Scientific programming

Academic year 2023/2024
Maximum credits: 6
Total hours: 60
Scientific disciplinary sector (SSD): ING-INF/05
Language: English
Learning objectives
The objective of the course is to make students proficient in writing programs and scripts in the programming languages most widely used in modern genomic research: R and Python.
Expected learning outcomes
At the end of the course, students are expected to be able to design and write advanced programs in Python and R, and to apply them to case studies derived from the analysis of genomic data.
Single course

This course cannot be attended as a single course. The courses that are available can be found in the single courses catalogue.

Course syllabus and organization

Single edition

Period
Second semester

Course syllabus
Seminar lectures and practical exercises on the following topics:

Part A: R programming
1. Course introduction: motivation and course information.
2. Introduction to R, CRAN and Bioconductor: review of the basic syntax and execution flow (blocks, conditional statements, loops); basic data structures (vectors, factors, matrices, data frames, lists), functions and scripts, data import/export.
3. Data processing in R: advanced use of data structures, vectorized operations and efficient coding in R (e.g., apply versus for-loops; differences in syntax and performance).
4. Important data types and packages for bioinformatics in R: GRanges for genomic locations, DNAString and RNAString, SummarizedExperiment, annotation packages (e.g., GenomicFeatures).
5. Class systems in R: S3, S4 and Reference classes.
6. Data visualization in R: simple plots, boxplots, heatmaps and more; basic introduction to the powerful and flexible ggplot2 framework, its syntax and use.
7. Creating R/Bioconductor packages: basic package structure; requirements; building and verifying packages; Bioconductor submission process.
8. Unit testing in R: the testthat framework for unit testing in R.
9. Specific use cases in R: e.g., differential expression analysis (DESeq2), pathway analysis (GSEA).

Part B: Python programming
1. Python recap: main Python concepts, including control flow statements, variables, data structures, classes, exception handling and file management.
2. Pandas and NumPy libraries: efficient matrix operations with NumPy; concept of Pandas DataFrame and some hints on the internals; overview of the main functionalities provided by a DataFrame (import from and export to files, relational operators, data retrieval, data manipulation).
3. Visualization libraries: overview of the two main libraries for data visualization in Python, matplotlib and Seaborn; simple plots (curves, scatter plots); more sophisticated plots (heatmaps, clustermaps, distribution plots); good practice for creating plots: correct use of axis scales, legends, titles, etc.
4. Concurrent programming: the need for parallelism; theoretical benefits of parallelism; Python library "multiprocessing" to spawn and join processes; Python library "threading" for multithreading and its limitations.
5. Network programming: theory and implementation of client-server architectures; implementation of RESTful web services using the Python module "flask".
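
The spawn-and-join pattern named under concurrent programming can be sketched as follows. This is a minimal illustration, not course material: it uses the standard-library `threading` module, and the function names `gc_content` and `gc_content_threaded` as well as the example sequences are hypothetical. `multiprocessing.Process` exposes the same `start()`/`join()` interface. In standard CPython builds, threads share a global interpreter lock, so CPU-bound work like this usually gains little from threads, which is presumably one of the limitations the topic alludes to.

```python
import threading


def gc_content(seq):
    """Return the fraction of G and C bases in a DNA sequence."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_content_threaded(seqs):
    """Compute the GC content of each sequence in its own thread."""
    results = [None] * len(seqs)

    def worker(index, seq):
        # Each thread writes only to its own slot, so no lock is needed here.
        results[index] = gc_content(seq)

    threads = [
        threading.Thread(target=worker, args=(i, s)) for i, s in enumerate(seqs)
    ]
    for thread in threads:
        thread.start()  # spawn the thread
    for thread in threads:
        thread.join()  # wait until the thread has finished
    return results


if __name__ == "__main__":
    # Hypothetical example reads, not real data.
    print(gc_content_threaded(["ATGC", "GGCC", "ATAT"]))
```

Results are stored by index, so the output order matches the input order regardless of which thread finishes first.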

General topics (independent of the programming language used):
1. Version control with Git/GitHub.
2. Best practices for computational biology.
Prerequisites
Basic knowledge of programming, preferably in Python and/or R. Basic knowledge of molecular biology.
Teaching methods
Lectures and practical sessions in a computer lab or on the students' own laptops.
Reference material
Recommended textbooks (not required to pass the exam):
· W. McKinney. Python for data analysis. 2013. O'Reilly. http://bedford-computing.co.uk/learning/wp-content/uploads/2015/10/Python-for-Data-Analysis.pdf
· Z.A. Shaw. Learn Python the hard way: A very simple introduction to the terrifyingly beautiful world of computers and code. (3rd Edition) 2014. Zed Shaw's Hard Way Series.
· O. Jones et al. Introduction to Scientific Programming and Simulation Using R. (2nd Edition) 2014. Chapman and Hall/CRC. ISBN 978-1-466-56999-7
· C. Ortutay, Z. Ortutay. Molecular Data Analysis Using R. 2017. Wiley Blackwell. ISBN 978-1-119-16504-0
Assessment methods and evaluation criteria
The assessment will be based on:

1. A written exam, taken in one of the exam sessions defined by the school and covering all aspects of the syllabus. The exam awards up to 60 points (plus up to 6 bonus points).
2. Two project assignments, one to be implemented in R and one to be implemented in Python.

The final grade will be a weighted average of the exam grade (70%) and the grades of the two project assignments (15% each).
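
The weighting can be written out as a small calculation. This is a sketch under an assumption the syllabus does not state, namely that the exam and both projects are graded on the same scale. The function name `final_grade` and the example grades are hypothetical.

```python
def final_grade(exam, project_r, project_python):
    """Weighted average: exam 70%, each project 15%.

    Assumes all three grades are on the same scale (not stated in the syllabus).
    """
    # Integer percentage weights keep the arithmetic easy to check by hand.
    return (70 * exam + 15 * project_r + 15 * project_python) / 100


if __name__ == "__main__":
    # Hypothetical grades, for illustration only.
    print(final_grade(28, 30, 26))
```
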
The following table provides a detailed overview of the elements that will be considered.

Written test (70%):
Practical programming exercises in R and Python (Dublin descriptors: DD1, DD2, DD3):
- Writing and interpreting command statements
- Working with basic data structures as well as advanced data structures and packages/software libraries for genomics
- Practical notions of concurrent programming (Python)
- Practical notions of network programming (Python)
- Practical notions of R packages (including unit testing)
- Practical notions of R classes
Descriptive exercises focusing on conceptual aspects (Dublin descriptors: DD1, DD2, DD3, DD4, DD5):
- Basic concepts of R: e.g., attaching vs. loading packages
- Theoretical notions/concepts of concurrent programming
- Theoretical notions/concepts of network programming
- Theoretical notions/concepts of R packages (including unit testing)
- Theoretical notions/concepts of R classes

Software projects (2 x 15%) (Dublin descriptors: DD1, DD2, DD3, DD4, DD5):
From a list of approx. 10 selected projects (the list may change from year to year to adapt to new interesting topics), each student will have to select one to be implemented in Python and a second to be implemented in R.
Example assignment for R:
- Develop a Bioconductor-compliant R package that provides functions for
Example assignment for Python:
- Develop a stand-alone Python script that
Evaluation criteria:
- To what degree are the requirements satisfied?
- Does the software/package offer reasonable and user-friendly functionality? (E.g., use of correct, interoperable data structures?)
- Was the workload reasonable? (E.g., documentation, unit tests?)
- Is the software/package or program well structured?
- What's the code quality? (E.g., efficient data processing?)
- Are there any errors when running the software and/or compiling the package?
IMPORTANT NOTE: plagiarism detection software will be applied!
ING-INF/05 - INFORMATION PROCESSING SYSTEMS - Credits (CFU): 6
Practical exercises: 24 hours
Lectures: 36 hours
Instructor: Piro Rosario Michael