Genomic big data management and computing
Academic year 2025/2026
Learning objectives
Many projects in the genomics field rely on increasingly large data sets, analyzing, for example, genomes of thousands of individuals affected by a particular disease. It is paramount to understand how large data sets can be managed and processed in an efficient way and how next-generation sequencing processing pipelines and workflows can be used to benefit such large-scale projects.
The objective of the course is to illustrate and discuss key aspects of the management, processing and analysis of big data for genomics (mainly data obtained by Next-Generation Sequencing), and to introduce some of the existing approaches, analysis systems and technologies. Practical applications will be illustrated using dedicated programming and query languages (PySpark, GMQL) as well as specific computational platforms and distributed systems (Galaxy, Apache Spark, Cloud Computing). Examples of "downstream" analyses will also be presented to underscore the need for big data management and computing in genomics.
Expected learning outcomes
Given the breadth of the topics covered, the course does not aim at in-depth knowledge of specific data analysis approaches. Rather, it provides a broad overview of different solutions, together with an understanding of the strengths and weaknesses of the different methodologies and computing environments for managing scientific workflows used for big data analysis in genomics.
Period: First semester
Assessment method: Exam
Grading: mark recorded out of thirty
Single course
This course cannot be taken as a single course. You can find the available courses by consulting the single courses catalogue.
Syllabus and course organization
Single edition
Period
First semester
Syllabus
Many projects in the genomics field rely on increasingly large data sets, analyzing, for example, genomes of thousands of individuals affected by a particular disease. It is paramount to understand how large data sets can be managed and processed in an efficient way and how next-generation sequencing processing pipelines and workflows can be used to benefit such large-scale projects.
The objective of the course is to illustrate and discuss key aspects of the management, processing and analysis of big data for genomics (mainly data obtained by Next-Generation Sequencing), and to introduce some of the existing approaches, analysis systems and technologies. Practical applications will be illustrated using dedicated programming and query languages (PySpark, GMQL) as well as specific computational platforms and distributed systems (Galaxy, Apache Spark, Cloud Computing). Examples of "downstream" analyses will also be presented to underscore the need for big data management and computing in genomics.
Given the breadth of the topics covered, the course does not aim at in-depth knowledge of specific data analysis approaches. Rather, it provides a broad overview of different solutions, together with an understanding of the strengths and weaknesses of the different methodologies and computing environments for managing scientific workflows used for big data analysis in genomics.
Prerequisites
Knowledge of programming in Python and R
Teaching methods
Lectures (theory) and exercise sessions with practical exercises and discussion of the solutions.
Reference material
Recommended articles (not required to pass the exam):
· Batut et al. (2018) Community-driven data analysis training for biology. Cell Systems 6: 752-758.
· Fillbrunn et al. (2017) KNIME for reproducible cross-domain analysis of life science data. Journal of Biotechnology 261: 149-156.
· Grüning et al. (2018) Practical computational reproducibility in the life sciences. Cell Systems 6(6): 631-635.
· Hoffman et al. (2012) Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat Methods 9(5): 473-476.
· Kulkarni and Frommolt (2017) Challenges in the setup of large-scale Next-Generation Sequencing analysis workflows. Computational and Structural Biotechnology Journal 15: 471-477.
· Langmead et al. (2018) Cloud computing as a platform for genomic data analysis and collaboration. Nat Rev Genet 19(4): 208-219.
Assessment methods and evaluation criteria
The assessment is based on a written exam, taken in the exam sessions defined by the school, that covers all aspects of the syllabus. The exam awards up to 60 points, plus up to 6 bonus points. A mark of 30 cum laude is awarded when the total score exceeds 64 points.
BIO/11 - MOLECULAR BIOLOGY - Credits (CFU): 1
ING-INF/05 - INFORMATION PROCESSING SYSTEMS - Credits (CFU): 5
Lectures: 48 hours
Instructor:
Piro Rosario Michael