Genomic big data management and computing | Università degli Studi di Milano Statale

Academic year 2021/2022

Maximum credits

Total hours

Scientific disciplinary sector (SSD)

BIO/11 ING-INF/05

Language

English

Degree programmes that use this course

Bioinformatics for computational genomics (Class LM-8), for students enrolled from the 2019/2020 academic year

Learning objectives

Many projects in genomics rely on increasingly large data sets, for example the genomes of thousands of individuals affected by a particular disease.
It is therefore essential to understand how large data sets can be managed and processed efficiently, and how Next-Generation Sequencing processing pipelines and workflows can benefit such large-scale projects.
This course introduces some of the existing technologies, tools and platforms available for this purpose, including Apache Spark, GMQL, Amazon Web Services (AWS), Galaxy, and KNIME. It also discusses examples of software libraries in different programming languages (mainly Python and R) that can help students design their own tools and pipelines.

Expected learning outcomes

Given the breadth of the topics presented, the goal of the course is not in-depth knowledge of specific data analysis approaches. Instead, the course provides a broad overview of different solutions, together with an understanding of the strengths and weaknesses of the methodologies and computing environments used to manage scientific workflows for big data analysis in genomics.

Period: First semester

Assessment method: Exam
Grading: mark recorded out of thirty

Single course

This course cannot be taken as a single course. The available courses are listed in the single-course catalogue.

Syllabus and teaching organization

Single edition

Period

First semester

Syllabus

Seminar lectures and practical sessions in the computer lab on the following topics:

- Course introduction (2 hours): Motivation, course information, introduction.

- Main concepts and issues (4 hours): challenges that need to be addressed; advantages of workflow management for big data applications; multi-source and heterogeneous data integration; reproducibility of results.

- Querying and extracting heterogeneous big genomic data (6 + 6 hours):
  - Genomic Data Model (GDM) and GenoMetric Query Language (GMQL)
  - Main operational features of GMQL and GMQL repository
  - GMQL interfaces: GMQL Web, GMQL APIs, pyGMQL and RGMQL
  - Integrative analysis of heterogeneous (epi)genomics data with GMQL, Python and R/Bioconductor

- Workflow management systems for biomedical applications (6 + 6 hours):
  - Galaxy
  - KNIME
  - Brief overview of other systems

- Distributed programming with Apache Spark (6 + 4 hours):
  - Introduction to distributed processing on the cloud and motivation
  - Main features of Apache Spark
  - Spark APIs for Python and R

- Pattern mining, search and analysis of heterogeneous big (epi)genomic data (6 + 4 hours):
  - Hidden Markov Models for state pattern analysis
  - Chromatin state discovery with ChromHMM and Segway
  - Searching similar feature patterns in multiple genome browser tracks
  - Integrated Genome Browser (IGB) and the SimSearch plugin

- Data processing with cloud computing services (4 hours):
  - Introduction to cloud computing for genomics
  - Amazon Web Services (AWS) and similar services

- Use cases / example analyses of heterogeneous data (2 + 4 hours)

Prerequisites

Knowledge of programming, preferably in Python and/or R. Knowledge of molecular biology.

Teaching methods

Class lectures and exercises in a computer lab, using the students' own laptops. All data and tools used are publicly available.

Reference material

The slides presented during the course and the estimated detailed schedule of lectures and practical sessions are available on "Be e-Poli" (BeeP), the portal for the network activities of students and professors at the Politecnico di Milano, which is accessible from the Politecnico di Milano website. Students registered for the course in the current academic year can access it.

Recommended articles (not required to pass the exam):
- Batut et al. (2018) Community-driven data analysis training for biology. Cell Systems 6: 752-758.
- Fillbrunn et al. (2017) KNIME for reproducible cross-domain analysis of life science data. Journal of Biotechnology 261: 149-156.
- Grüning et al. (2018) Practical computational reproducibility in the life sciences. Cell Systems 6(6): 631-635.
- Hoffman et al. (2012) Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat Methods 9(5): 473-476.
- Kulkarni and Frommolt (2017) Challenges in the setup of large-scale Next-Generation Sequencing analysis workflows. Computational and Structural Biotechnology Journal 15: 471-477.
- Langmead et al. (2018) Cloud computing as a platform for genomic data analysis and collaboration. Nat Rev Genet 19(4): 208-219.
- Li et al. (2015) An NGS workflow blueprint for DNA sequencing data and its application in individualized molecular oncology. Cancer Informatics 14(S5): 87-107.

Assessment methods and evaluation criteria

Assessment is based on a written exam at the end of the course, with exercises and open questions covering all the topics presented.

Teaching organization

BIO/11 - MOLECULAR BIOLOGY
ING-INF/05 - INFORMATION PROCESSING SYSTEMS

Lectures: 48 hours

Lecturer: Piro Rosario Michael