Algorithms for Massive Datasets
A.Y. 2026/2027
Learning objectives
The course aims to describe the big data processing framework in terms of both methodologies and technologies.
Expected learning outcomes
Students:
- will be able to use technologies for the distributed storage of datasets;
- will know the map-reduce distributed processing framework and its leading extensions;
- will know the principal algorithms for classical big data problems and how to implement them using a distributed processing framework;
- will be able to choose appropriate methods for solving big data problems.
Lesson period: Second four-month period
Assessment methods: Exam
Assessment result: grade recorded on a 30-point scale
Single course
This course cannot be attended as a single course. Please check our list of single courses to find the ones available for enrolment.
Course syllabus and organization
Single session
Course syllabus
The course will cover the main techniques for processing data at massive scale and their implementation on distributed computational frameworks. More precisely, lectures will review the principal application contexts in which the amount of data cannot be handled using standard computing facilities and procedures, and will analyze these contexts in terms of tailored algorithms. Some general big data processing techniques, such as those falling under the umbrella of machine learning, will also be considered.
The following topics will be covered:
- Technical and mathematical preliminaries.
- Basics of MapReduce, Hadoop, and Spark.
- Analysis of MapReduce algorithms.
- NoSQL databases: MongoDB.
- Link analysis.
- Regression.
- Logistic regression.
- Stream analysis.
- Deep learning.
- Clustering.
- Finding similar items.
- Market-basket analysis.
- Gradient boosting.
- Recommender systems.
- Dimensionality reduction.
- Embeddings.
Prerequisites for admission
The course requires knowledge of the main topics of bachelor-level computer programming, linear algebra, calculus, probability, and statistics.
Teaching methods
Lectures
Teaching Resources
Textbook:
- Anand Rajaraman and Jeff Ullman, Mining of Massive Datasets, Cambridge University Press (ISBN:9781107015357).
Suggested readings:
- Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia, Learning Spark. Lightning-Fast Big Data Analysis, O'Reilly, 2015 (ISBN:978-1-449-35862-4)
- Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills, Advanced Analytics with Spark. Patterns for Learning from Data at Scale, O'Reilly, 2015 (ISBN:978-1-491-91276-8)
Lecture notes, supplementary material, and sample code:
- https://labonline.ctu.unimi.it/
- https://malchiodi.di.unimi.it/teaching/AMD/
Assessment methods and Criteria
The examination consists of a written test including open-ended questions of either a theoretical or an applied nature, with content and difficulty levels consistent with the topics covered during the course. During the examination, students may consult a formula sheet provided by the instructor and may use non-programmable calculators. The consultation of any other materials, including textbooks, personal notes, or electronic devices such as mobile phones, is strictly prohibited.
The written examination is graded on a 30-point scale, and the results are communicated by email. Assessment is based on the student's level of mastery of the course material, the correct use of mathematical notation and formalism, and the appropriate use of technical terminology.
Depending on the student's performance in the written examination, the instructor may require an additional oral examination to further assess the student's understanding of the course material and overall level of achievement. The final grade, expressed on a 30-point scale, is determined on the basis of the written examination and, where applicable, the oral examination.
INFO-01/A - Informatics - University credits: 6
Lessons: 48 hours
Professor:
Malchiodi Dario