Algorithms for Massive Data, Cloud and Distributed Computing

A.Y. 2019/2020
Max ECTS: 12
Overall hours: 80
SSD: INF/01
Language: English
Learning objectives
The objective of the class is to introduce the fundamental concepts underlying massive data management and analysis. On one side, it covers the main techniques for processing data at massive scale and their implementation on distributed computational frameworks; on the other side, it covers the technologies and solutions underlying the cloud computing paradigm and modern distributed systems (e.g., microservice architectures).
Expected learning outcomes
At the end of the class, the student shall know the main approaches enabling her/him to process massive amounts of data, as well as the operating principles of modern distributed computing systems, including cloud computing and microservice-based architectures. The student shall also acquire the ability to design and execute computations on massive datasets, deployed on modern distributed systems and cloud computing platforms.
Single course

This course cannot be attended as a single course. Please check our list of single courses to find the ones available for enrolment.

Course syllabus and organization

Single session

Lesson period
Second semester
Prerequisites for admission
Knowledge of computer programming, probability and statistics, basic calculus, fundamentals of computer networks, fundamentals of virtualization.
Assessment methods and Criteria
The examination consists of two mandatory exams, one for each module. The exam for module "Algorithms for Massive Data" consists of a project and an oral test, both related to the topics covered in the course. The project, described in a report, requires the student to process one or more datasets through the critical application of the techniques described during the classes. The oral test, which can be taken only after a positive evaluation of the project, is based on the discussion of some topics covered in the course and on in-depth questions about the presented project. The exam for module "Cloud and Distributed Computing" consists of a written exam (duration 2 hours) that aims to verify, by means of open and closed questions, the student's knowledge of all topics discussed during the course. When the student passes both exams, a final mark is computed, expressed in thirtieths, considering: knowledge of the topics, ability to apply the acquired knowledge to a practical project, project quality, critical thinking skills, clarity of exposition, and appropriate use of language.
Module Algorithms for Massive Data
Course syllabus
Module "Algorithms for Massive Data" will consider the main techniques for processing data at massive scale, and their implementation on distributed computational frameworks. In particular, lectures will review the principal application contexts characterized by amounts of data that cannot be handled using standard computing facilities and procedures. Such contexts will be analyzed in terms of algorithms tailored to them. In addition, some general big data processing techniques, such as those falling under the umbrella of machine learning, will be considered.

More precisely, the following topics will be covered.
- Mathematical preliminaries.
- Technical preliminaries: Python, Jupyter, Colab.
- Hadoop: HDFS and MapReduce.
- Analysis of MapReduce algorithms.
- Spark.
- Link analysis.
- Regression.
- Logistic regression.
- Multilayer perceptrons and deep learning.
- Clustering.
- Finding similar items.
- Market-basket analysis.
- Recommender systems.
- Dimensionality reduction.
- Embeddings.
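
As an informal illustration of the MapReduce topic listed above, the following minimal sketch simulates the map, shuffle, and reduce phases of a word count on a single machine in plain Python. It is not taken from the course material: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are hypothetical, and no Hadoop or Spark API is used.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key, mimicking the step between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence on an in-memory list of documents."""
    mapped = chain.from_iterable(map_phase(d) for d in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical sample input, for illustration only.
docs = ["big data big ideas", "data at massive scale"]
counts = word_count(docs)
print(counts)
```

In a real distributed framework the map and reduce functions would run in parallel on different nodes, and the grouping step would be handled by the framework itself; here everything runs sequentially in one process.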
Teaching methods
The module consists of traditional lectures.
Teaching Resources
Textbook:
- Anand Rajaraman and Jeff Ullman, Mining of Massive Datasets, Cambridge University Press (ISBN:9781107015357).

Suggested readings:
- Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia, Learning Spark. Lightning-Fast Big Data Analysis, O'Reilly, 2015 (ISBN:978-1-449-35862-4)
- Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills, Advanced Analytics with Spark. Patterns for Learning from Data at Scale, O'Reilly, 2015 (ISBN:978-1-491-91276-8)

Lecture notes, supplementary material, and sample code:
- https://dmalchiodiamd.ariel.ctu.unimi.it/
- http://malchiodi.di.unimi.it/teaching/AMD
Module Cloud and Distributed Computing
Course syllabus
Module "Cloud and Distributed Computing" will discuss the technologies and solutions underlying cloud computing and modern distributed systems, including microservice architectures. The module is composed of three main parts. The first part will provide an overview of the cloud computing paradigm and its technologies, as well as its service and deployment models. It will also investigate the risks and opportunities of cloud migration, focusing on governance and non-functional properties of the cloud. The second part will provide an overview of the microservice architecture and its technologies, focusing on the migration from a monolithic approach to microservices and on microservice orchestration. Finally, the third part will focus on privacy and data protection in the cloud.

In more detail, after a brief review of the fundamentals of IT networks and virtualization, the first part of the module will provide an overview of the cloud computing paradigm as follows.
1. Cloud Computing. Service models. Deployment models. Migration to the cloud. Cloudonomics. Challenges and issues.
2. IaaS, PaaS, SaaS: Definitions. Technologies. Case studies.
3. Non-functional aspects of the cloud.
4. New cloud services. PaaS Big Data.

The second part of the module will provide an overview of the microservice architecture and its technologies, as follows.
1. Microservice architecture. Overview and basic concepts. Microservices and containers. Docker.
2. Microservice migration and orchestration. Cloud for microservices. How to migrate a monolithic software to microservices. Examples.
3. Microservices and Big Data. Model-Based Big Data Analytics-as-a-Service.

Finally, the third part of the module will focus on privacy and data protection in the cloud as follows.
1. Data and access confidentiality and integrity in outsourcing and cloud scenarios.
Teaching methods
The module consists of traditional lectures.
Teaching Resources
Papers and slide decks available on the web page of the course (https://ariel.unimi.it)
Module Algorithms for Massive Data
INF/01 - INFORMATICS - University credits: 6
Lessons: 40 hours
Professor: Malchiodi Dario
Module Cloud and Distributed Computing
INF/01 - INFORMATICS - University credits: 6
Lessons: 40 hours
Professor(s)
Reception:
By appointment only
At Dipartimento di Informatica, Via Celoria 18, Milan (MI)
Reception:
By appointment
Room 5015 of the Computer Science Department