Genomic Big Data Management and Computing

A.Y. 2023/2024
Max ECTS: 6
Overall hours: 48
SSD: BIO/11, ING-INF/05
Language: English
Learning objectives
Many projects in genomics rely on increasingly large data sets, analyzing, for example, genomes of thousands of individuals affected by a particular disease.
It is paramount to understand how large data sets can be managed and processed in an efficient way and how Next-Generation Sequencing processing pipelines and workflows can be used to benefit such large-scale projects.
In this course, we will introduce some of the existing technologies, tools and platforms available for this purpose, including Apache Spark, GMQL, Amazon Web Services (AWS), Galaxy, and KNIME. We will also discuss examples of software libraries in different programming languages (mainly Python and R) that can help students design their own tools and pipelines.
Expected learning outcomes
Given the breadth of the topics covered, the goal of the course is not in-depth knowledge of specific data analysis approaches. Instead, it aims to provide a broad overview of the available solutions, together with an understanding of the strengths and weaknesses of the different methodologies and computing environments for managing scientific workflows used in big data analysis in genomics.
Single course

This course cannot be attended as a single course.

Course syllabus and organization

Single session

Lesson period
First semester
Course syllabus
1. Brief introduction to the main concepts and issues of big data analysis
· Challenges that need to be addressed
· Advantages of workflow management for big data applications
· Multi-source and heterogeneous data integration
· Reproducibility of results

2. Querying and extracting heterogeneous big genomics data
· Genomic Data Model (GDM) and GenoMetric Query Language (GMQL)
· Main operational features of GMQL and GMQL repository
· GMQL interfaces: GMQL Web, GMQL APIs, pyGMQL and RGMQL
· Integrative analysis of heterogeneous (epi)genomics data with GMQL, Python and R/Bioconductor

3. Workflow management systems for biomedical applications
· Galaxy
· (Very) brief overview of KNIME and other systems

4. Distributed programming with Apache Spark
· Introduction to distributed processing on the cloud and motivation
· Main features of Apache Spark
· Spark APIs for Python (PySpark) and R

5. Pattern mining, search and analysis of heterogeneous big genomic data
· Hidden Markov Models for state pattern analysis
· Chromatin state discovery with ChromHMM and Segway

6. Data processing with cloud computing services
· Introduction to cloud computing for genomics
· Amazon Web Services (AWS) and similar services

7. Use cases / example analyses of heterogeneous data
· Non-negative matrix factorization as an example of multivariate analysis, applied to the identification of mutational signatures from large-scale datasets of somatic mutations in cancer
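
To make the last topic more concrete, the following is a minimal sketch of non-negative matrix factorization, not taken from the course material. It assumes the multiplicative update rules of Lee and Seung for the squared (Frobenius) error, which is only one of several possible algorithms. The function names (`nmf`, `matmul`, `frobenius_error`) and the toy count matrix are hypothetical and chosen purely for illustration: rows stand for mutation types, columns for tumour samples.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius distance between V and the product W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factorize a non-negative matrix V (m x n) into W (m x rank) and H (rank x n)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    # Random strictly positive starting point.
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + eps for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise.
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_cols)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise.
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n_rows)]
    return w, h


# Made-up example: 4 mutation types x 5 samples, generated from 2 known "signatures".
true_signatures = [[8, 0], [4, 1], [0, 6], [1, 3]]
true_exposures = [[1, 2, 0, 3, 1], [2, 0, 3, 1, 1]]
counts = matmul(true_signatures, true_exposures)

signatures, exposures = nmf(counts, rank=2)
print("reconstruction error:", frobenius_error(counts, signatures, exposures))
```

In this reading, the columns of `signatures` play the role of mutational signatures and the rows of `exposures` describe how strongly each signature contributes to each sample. The factorization is only determined up to scaling and ordering of the components, so the recovered factors need not match the generating matrices entry by entry.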
Prerequisites for admission
Knowledge of programming in Python and R
Teaching methods
Class lectures and exercises in a computer lab, using the students' own laptops. All data and tools used are publicly available.
Teaching Resources
Recommended articles (not required to pass the exam):
· Batut et al. (2018) Community-driven data analysis training for biology. Cell Systems 6: 752-758.
· Fillbrunn et al. (2017) KNIME for reproducible cross-domain analysis of life science data. Journal of Biotechnology 261: 149-156.
· Grüning et al. (2018) Practical computational reproducibility in the life sciences. Cell Systems 6(6): 631-635.
· Hoffman et al. (2012) Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat Methods 9(5): 473-476.
· Kulkarni and Frommolt (2017) Challenges in the setup of large-scale Next-Generation Sequencing analysis workflows. Computational and Structural Biotechnology Journal 15: 471-477.
· Langmead et al. (2018) Cloud computing as a platform for genomic data analysis and collaboration. Nat Rev Genet 19(4): 208-219.
Assessment methods and Criteria
The assessment is based on a written exam, taken in the exam sessions defined by the school, that covers all aspects of the syllabus. The exam awards up to 60 points, plus up to 6 bonus points. A grade of 30 cum laude is assigned when the total score exceeds 64 points.

The following list provides a detailed overview of the elements that will be considered in the written exam:
Solution of numerical problems:
- Application of Hidden Markov Models
Exercises focusing on specific programming approaches:
- Implementation of simple GMQL or PySpark scripts
- Interpretation of simple GMQL or PySpark scripts
Descriptive exercises focusing on conceptual aspects:
- Formulation or description of basic notions and concepts of big data analysis
- Description of specific big data analysis approaches and systems/platforms as well as their requirements
- Basic notions of the biological background used for the lectures (cancer bioinformatics)
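
As an illustration of what applying a Hidden Markov Model to a numerical problem can look like, the following is a minimal sketch of the Viterbi algorithm, which finds the most probable sequence of hidden states for a series of observations. It is not an actual exam exercise. The two chromatin-like states, the observation symbols and all probabilities are made-up assumptions, and the function name `viterbi` is hypothetical. The sketch also assumes that every probability is strictly positive, because it works with logarithms.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path and its log-probability."""
    # best[t][s]: highest log-probability of any path that ends in state s at step t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    # back[t][s]: predecessor of s on that best path.
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states,
                       key=lambda p: best[t - 1][p] + math.log(trans_p[p][s]))
            best[t][s] = (best[t - 1][prev]
                          + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][observations[t]]))
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Made-up toy model: a histone mark is mostly seen in "active" regions.
states = ("active", "inactive")
start_p = {"active": 0.5, "inactive": 0.5}
trans_p = {"active": {"active": 0.8, "inactive": 0.2},
           "inactive": {"active": 0.2, "inactive": 0.8}}
emit_p = {"active": {"mark": 0.9, "none": 0.1},
          "inactive": {"mark": 0.1, "none": 0.9}}
observed = ["mark", "mark", "none", "none", "none", "mark"]

path, log_p = viterbi(observed, states, start_p, trans_p, emit_p)
print(path)
```

Working in log space avoids numerical underflow on long sequences, which matters when the observations span many genomic bins.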
BIO/11 - MOLECULAR BIOLOGY
ING-INF/05 - INFORMATION PROCESSING SYSTEMS
Lessons: 48 hours
Professor: Piro Rosario Michael