Biostatistics
A.Y. 2025/2026
Learning objectives
Assays in experimental biology generate large amounts of data that must be critically assessed and
processed appropriately to extract meaningful biological knowledge and generate testable
hypotheses. Proficiency in data wrangling and data visualisation, the ability to unravel complex
relationships in biological data and the ability to create transparent and reproducible workflows
constitute crucial skills for the modern biologist. In addition, a good understanding of the principles of
experimental design is central to the critical assessment of experimental data. With data as the focus
and R/RStudio as the tool, students are exposed to and trained in a unified view of experimental design
and data analysis. Students will develop expertise in data organisation, visualisation, analysis and
interpretation using both conventional biological data and complex, large-scale ("big") biological data.
The aims of this course are to enable students to (i) analyse data from a well-designed biological
experiment, (ii) create a transparent, reproducible analysis workflow using Rmarkdown in R/RStudio
that includes exploratory analyses, statistical modelling, model assessment and parameter estimation,
(iii) understand the power and pitfalls of statistical analyses, and (iv) implement methods for the analysis
of gene expression (RNA-Seq) data and the interpretation of the final results.
Throughout the course, we will use the R programming language and the RStudio software environment.
Expected learning outcomes
1. Use the R/RStudio environment to import, visualise, wrangle and summarise data.
2. Create transparent reproducible analysis workflows using Rmarkdown.
3. Understand the statistical model framework for statistical inference and estimation.
4. Interpret results from statistical models using ANOVA tables and estimated marginal means.
5. Communicate conclusions of statistical analyses in graphs and/or tables.
6. Correctly analyse, interpret and visualise the results of differential gene expression
analyses, based on RNA sequencing data.
Lesson period: Second semester
Assessment methods: Exam
Assessment result: grade recorded on a thirty-point scale
Single course
This course cannot be attended as a single course. Please check our list of single courses to find the ones available for enrolment.
Course syllabus and organization
Single session
Course syllabus
The first six lectures of the course will introduce students to the R/RStudio programming environment. These programming skills will be reinforced throughout the course.
This will include:
Introduction to the R environment for the analysis of biological data - 1.5 cfu Bio/11 (12hrs)
- Setting up R projects
- Basic data structures, data.frames, vectors, matrices
- Importing data in R
- Installation and management of software packages
- Data wrangling using the dplyr package (tidyverse)
- Data visualisation using the ggplot2 package (tidyverse)
- Data simulation using stochastic models
- Introduction to Rmarkdown
The following 12 classes will introduce students to the principles of statistical inference, statistical modelling and parameter estimation. These ideas will be reinforced using examples from published data. The emphasis will be on creating transparent and reproducible analysis workflows in R/RStudio.
Basics of statistical analysis - 1 cfu Bio/18 (8hrs)
- Data visualisation and pattern recognition
- Principles of statistical inference
- p-values: measures of evidence against the "null hypothesis"
- Statistical models of biological experiments, ANOVA
- Assessing model assumptions using residual plots
- First analysis workflow using R/RStudio
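As an informal illustration of the kind of first workflow these topics lead to, the sketch below fits a one-way ANOVA in base R and draws a residual plot to check model assumptions. The data are simulated, and all names (dat, treatment, growth, fit) are hypothetical; nothing here is taken from the course materials.

```r
# Simulated example: growth measured under three hypothetical treatments.
set.seed(1)
dat <- data.frame(
  treatment = factor(rep(c("control", "low", "high"), each = 10)),
  growth    = c(rnorm(10, mean = 10), rnorm(10, mean = 11), rnorm(10, mean = 13))
)

# Statistical model: growth explained by treatment (one-way ANOVA).
fit <- lm(growth ~ treatment, data = dat)

# ANOVA table: F test for differences among the treatment means.
anova_table <- anova(fit)
print(anova_table)

# Residuals against fitted values: look for constant spread and no trend.
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

With three groups of ten observations, the ANOVA table has 2 degrees of freedom for treatment and 27 for the residuals.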
Exploring mean and variance structure in statistical models - 1 cfu Bio/18 (8hrs)
- Factorial designs, ANOVA with multiple factors
- Linear models with covariates
- Linear mixed models
- Complete analysis workflow using R/RStudio
Experimental design, Generalised Linear Models and topics relevant to high-dimensional data - 1 cfu Bio/18 (8hrs)
- Principles of experimental design: randomisation, replication, blocking
- Generalised linear models: negative binomial model
- Principal Component Analysis
- Multiple hypothesis testing, p-value adjustments, false discovery rate (FDR)
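To give a rough idea of what a p-value adjustment looks like in practice, the sketch below applies the Benjamini-Hochberg procedure with base R's p.adjust(). The five p-values are invented for illustration and the 0.05 threshold is an assumption, not a course requirement.

```r
# Hypothetical raw p-values from five tests (illustrative values only).
p_values <- c(0.001, 0.008, 0.039, 0.041, 0.60)

# Benjamini-Hochberg adjustment, which controls the false discovery rate.
p_bh <- p.adjust(p_values, method = "BH")

# Compare how many tests pass an assumed 0.05 threshold before and after.
n_raw <- sum(p_values < 0.05)
n_bh  <- sum(p_bh < 0.05)

print(data.frame(raw = p_values, adjusted = p_bh))
```

Four of the raw p-values fall below 0.05, but only two remain below it after adjustment, which shows why unadjusted thresholds overstate the evidence when many hypotheses are tested at once.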
The final part of the course will be an introduction to the analysis of Next Generation Sequencing (NGS) data using R, with insights on the theoretical and practical principles underlying state-of-the-art methods for processing RNA-Seq assays to assess differential gene expression. In particular:
Differential gene expression analysis in R - 1 cfu Bio/11 (8hrs)
- Quality metrics and quality control
- Differential gene expression analyses in R
- Multiple testing correction and FDR (false discovery rate)
Visualization and interpretation of the results - 0.5 cfu Bio/11 (4hrs)
- Visualization of the data: heatmaps, scatterplots, boxplots
Classes will consist of intuitive descriptions of programming principles, bioinformatics methods and their underlying statistics, combined with practicals in which students apply the newly introduced concepts to data analysis use cases.
Prerequisites for admission
Basic knowledge of molecular biology topics:
- Structure and biochemical properties of nucleic acids;
- Nucleic acid sequencing methods;
- Mechanisms of gene expression regulation;
- Structure of the eukaryotic gene.
Basic IT knowledge:
- File and folder management.
Teaching methods
Lectures accompanied by practical exercises with real data. Instructors will assign exercises at the end of most lessons to help reinforce concepts between classes. Attendance is highly recommended.
Teaching Resources
W. N. Venables, D. M. Smith and the R Core Team. An introduction to R.
https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf
Wickham, H., & Grolemund, G. (2017). R for Data Science. O'Reilly Media.
https://r4ds.hadley.nz
Chen Y, McCarthy D, Ritchie M, Robinson M, Smyth G. edgeR: differential expression analysis of digital gene expression data. https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
Law CW, Alhamdoosh M, Su S, Dong X, Tian L, Smyth GK, Ritchie ME. RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR. F1000Res. 2016 Jun 17;5:ISCB Comm J-1408. doi: 10.12688/f1000research.9005.3. PMID: 27441086; PMCID: PMC4937821.
https://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html
Glimma: https://bioconductor.org/packages/release/bioc/html/Glimma.html
Copies of the slides projected during the classes, as well as additional materials and datasets, will be made available through the course website on the ARIEL platform of the University of Milan. This material is intended to support the lectures; studying it is not a substitute for regular class attendance. The material is made available only to registered students of the Degree Course in Molecular Biology of the Cell and should not be distributed to others without the express consent of the teachers.
Assessment methods and Criteria
Students will be required to complete a transparent data analysis workflow using Rmarkdown, consisting of the analysis of biological data from real experiments. Students will produce a report describing their results and submit it to the teachers at least 48 hours before the selected exam session. Projects will be undertaken individually or in pairs (1-2 students per group).
The grade will result from the joint evaluation of each candidate by the two teachers weighted as follows:
Project and oral presentation - 100%
BIO/11 - MOLECULAR BIOLOGY - University credits: 3
BIO/18 - GENETICS - University credits: 3
Lessons: 48 hours
Professor:
Chiara Matteo