Text Mining and Sentiment Analysis | Università degli Studi di Milano Statale

A.Y. 2020/2021

Max ECTS

Overall hours

SSD

INF/01 SECS-S/01

Language

English

Included in the following degree programmes

Data Science and Economics (Class LM-91), enrolled from the 2018/2019 academic year

Learning objectives

Understand the state of the art on text mining and sentiment analysis. Design and develop methods for text classification and topic modeling. Design and develop methods for sentiment classification and polarity detection. Understand the differences between sentiment analysis and emotion detection. Design and develop methods for emotion detection in text.

Expected learning outcomes

At the end of the course, students will be able to address a specific problem in the area of text mining and sentiment analysis. In particular, students will know the main notions needed to understand text processing, the foundations of natural language processing, text classification, and topic modeling. Moreover, students will deal with sentiment analysis in the context of opinion mining, using both rule-based models and machine learning models for text.

Lesson period: Second trimester

Assessment methods: Exam
Assessment result: mark recorded on a 30-point scale

Single course

This course cannot be attended as a single course. Please check our list of single courses to find the ones available for enrolment.

Course syllabus and organization

Single session

Responsible

Ferrara Alfio

Lesson period

Second trimester

Emergency remote teaching

During the emergency phase, teaching activities will be delivered as synchronous lectures on the MS Teams platform, following the course timetable. The lectures will also be recorded and uploaded to the ARIEL website. Students will be informed as soon as possible, through the ARIEL website, if it becomes possible to attend classes at the University. This possibility will depend on the evolution of the emergency and on compliance with the safety directives.

Syllabus

Course syllabus

The course provides a complete overview of the state of the art and of research perspectives in the field of text mining and sentiment analysis, with an introduction to some relevant related problems such as emotion detection and opinion mining.

Introduction (0:30 hours)
Course introduction, logistics, course requirements, and Python installation.

Natural language processing (3:30 hours)
Basic techniques in natural language processing: tokenization (bag-of-words and n-gram models), stopwords and punctuation, stemming and lemmatization, part-of-speech tagging, chunking, regular expressions, and named entity recognition. Public NLP toolkits such as NLTK and spaCy will be introduced to gain hands-on experience in Python.
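
As a rough illustration of the first steps listed above, the following sketch tokenizes a sentence, removes stopwords, and builds a bag-of-words and bigrams using only the Python standard library. The stopword list and the function names are assumptions made for this example; they are not the course material, and toolkits such as NLTK and spaCy provide far more complete implementations.

```python
import re
from collections import Counter

# Illustrative stopword list (an assumption; real toolkits ship larger ones).
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(tokens):
    """Count token occurrences, ignoring word order."""
    return Counter(tokens)

tokens = remove_stopwords(
    tokenize("The plot of the movie is great, and the cast is great.")
)
bigrams = ngrams(tokens, 2)
bow = bag_of_words(tokens)
```

Here `tokens` keeps only the content words in their original order, while `bow` discards the order and `bigrams` retains a small amount of local context.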

Document representation (2 hours)
The Vector Space Model and tf-idf weighting: representing unstructured text documents in a format and structure suitable for subsequent automated text mining algorithms. PCA as a dimensionality reduction technique.
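
A minimal sketch of tf-idf weighting in the Vector Space Model, assuming raw term frequency and idf = log(N / df). This is only one of several common variants (libraries such as scikit-learn apply smoothing and normalization by default), and the toy documents and function names are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes weight = tf * log(N / df), with raw counts as tf.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each document counts once per term
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy corpus, already tokenized (hypothetical data).
docs = [
    ["good", "movie", "good", "cast"],
    ["bad", "movie"],
    ["good", "book"],
]
vectors = tf_idf(docs)
```

Terms that occur in few documents (such as "cast") receive a higher idf than terms shared by many documents (such as "movie"), and documents with no terms in common have cosine similarity 0.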

Text clustering (3 hours)
Clustering algorithms, i.e., connectivity-based clustering (a.k.a., hierarchical clustering) and centroid-based clustering (e.g., k-means clustering). Evaluation of text clustering: purity and Rand index.
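
The two evaluation measures named above can be sketched as follows. The sketch assumes that each document has one cluster id and one gold label; the toy assignment is invented for illustration, and the functions expect at least two documents.

```python
from collections import Counter
from itertools import combinations

def purity(clusters, labels):
    """Fraction of documents that carry the majority label of their cluster."""
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, []).append(y)
    majority_total = sum(
        Counter(ys).most_common(1)[0][1] for ys in by_cluster.values()
    )
    return majority_total / len(labels)

def rand_index(clusters, labels):
    """Fraction of document pairs on which clustering and gold labels agree.

    A pair agrees when it is placed together in both partitions
    or kept apart in both partitions.
    """
    pairs = list(combinations(range(len(labels)), 2))
    agree = sum(
        (clusters[i] == clusters[j]) == (labels[i] == labels[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical clustering of six documents into two clusters.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["sport", "sport", "politics", "politics", "politics", "sport"]

p = purity(clusters, labels)
ri = rand_index(clusters, labels)
```

In this example each cluster contains two documents of its majority label out of three, so purity is 4/6, while only 7 of the 15 document pairs are treated consistently, giving a Rand index of 7/15.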

Text categorization (5 hours)
Feature selection and text categorization algorithms: Naive Bayes, k Nearest Neighbor (kNN), Logistic Regression, Support Vector Machines and Decision Trees. Evaluation of text classification: precision and recall, confusion matrix, F-score.
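
As a small worked example of the evaluation measures listed above, the sketch below derives a binary confusion matrix from gold and predicted labels and computes precision, recall, and the F1 score from it. The labels and predictions are invented for illustration.

```python
def binary_confusion(y_true, y_pred, positive):
    """Return (tp, fp, fn, tn) counts for the given positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1, returning 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical gold labels and classifier output.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "spam"]

tp, fp, fn, tn = binary_confusion(y_true, y_pred, positive="spam")
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

With 3 true positives, 1 false positive, and 1 false negative, precision, recall, and F1 are all 0.75 in this example.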

Topic modeling (4 hours)
Topic models are a suite of algorithms that uncover the hidden thematic structure in document collections. Two basic topic models will be covered: Probabilistic Latent Semantic Indexing (pLSI) and Latent Dirichlet Allocation (LDA).

Document summarization (2 hours)
Document summarization is the process of reducing a text document to a summary that retains the most important points of the original. Extraction-based summarization methods will be covered.
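
One simple extraction-based strategy, shown here only as a sketch, scores each sentence by the average corpus frequency of its content words and keeps the top-scoring sentences in their original order. The stopword list, the scoring rule, and the example text are assumptions made for illustration; they are not necessarily the methods presented in the lectures.

```python
import re
from collections import Counter

# Illustrative stopword list (an assumption).
STOPWORDS = {"the", "a", "an", "is", "are", "was", "in", "to", "of", "and"}

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(sentence):
    """Lowercased alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOPWORDS]

def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, in document order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]

text = ("Topic models uncover themes in documents. "
        "Topic models are used to explore large document collections. "
        "The weather was nice.")
summary = summarize(text, n_sentences=1)
```

The off-topic last sentence shares no frequent words with the rest of the text, so it receives the lowest score and is the first to be dropped.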

Introduction to sentiment analysis and emotion detection (1 hour)
Definition of the sentiment analysis problem. Differences between sentiment analysis and emotion detection.

Lexicon-based approaches to sentiment analysis (4 hours)
Survey of the main approaches that exploit dictionaries, ontologies, and specialized corpora for detecting the sentiment polarity in texts.
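
A minimal sketch of a lexicon-based polarity scorer follows. The tiny lexicon, the negator list, and the rule that a negator flips the next sentiment-bearing word are all assumptions made for illustration; real lexical resources and rule sets are much richer.

```python
import re

# Hypothetical toy lexicon mapping words to polarity scores.
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}
# Hypothetical negators; each one flips the next lexicon word it precedes.
NEGATORS = {"not", "never", "no"}

def polarity(text):
    """Sum lexicon scores over the text, flipping the sign after a negator."""
    score = 0.0
    negate = False
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in NEGATORS:
            negate = True
            continue
        value = LEXICON.get(token)
        if value is not None:
            score += -value if negate else value
            negate = False  # the negation is consumed by this word
    return score

def label(text):
    """Map the polarity score to a coarse sentiment label."""
    score = polarity(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

example_score = polarity("The cast is great but the plot is not good")
```

In the example, "great" contributes +2 and the negated "good" contributes -1, so the overall score is mildly positive.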

Machine learning approaches to sentiment analysis (4 hours)
Sentiment and polarity detection as a classification problem. Overview and comparison of the main unsupervised and supervised models on a case study.
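
To illustrate polarity detection as a supervised classification problem, the sketch below implements a small multinomial Naive Bayes classifier with add-one (Laplace) smoothing using only the standard library. The class name, the training data, and the choice to skip words unseen in training are assumptions for this example; in practice a library such as scikit-learn would normally be used.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, y in zip(docs, labels):
            self.word_counts[y].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_label, best_logp = None, -math.inf
        for y, count in self.class_counts.items():
            logp = math.log(count / n_docs)  # log prior
            total = sum(self.word_counts[y].values())
            for w in tokens:
                if w not in self.vocab:
                    continue  # words unseen in training are ignored
                logp += math.log(
                    (self.word_counts[y][w] + 1) / (total + len(self.vocab))
                )
            if logp > best_logp:
                best_label, best_logp = y, logp
        return best_label

# Hypothetical training set of tokenized reviews.
train_docs = [
    ["great", "movie"],
    ["loved", "great", "cast"],
    ["boring", "movie"],
    ["terrible", "boring", "plot"],
]
train_labels = ["pos", "pos", "neg", "neg"]

model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict(["great", "plot"])
```

Smoothing lets the classifier handle "plot", which never occurs in a positive training review, without assigning the positive class a zero probability.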

Overview of neural network architectures for sentiment analysis (2 hours)
Design and implementation of a case study based on a neural network for sentiment detection and polarity evaluation.

Affect and emotion detection (1 hour)
Survey and definition of affect and emotion detection in texts. Discussion of the differences between detecting sentiment, feelings, emotions, and opinions.

The language of emotions (4 hours)
Methods and techniques for modeling the language of emotions using neural networks and statistical language models. Application to a case study.

Multimodal approaches to emotion detection (1 hour)
Survey on the exploitation of multimodal data (e.g., face and body language in video and audio recordings) in combination with text to detect the language of emotions.

Hands-on session: a real case study from design to implementation (2 hours)
Students will be provided with a real case study on sentiment analysis and emotion detection. During the lesson, the case study will be analyzed in order to design and implement a solution.

Recap and conclusion (1 hour)
Recap on the main course topics. Open discussion of the project work chosen by the students as their exam assignment.

Prerequisites for admission

Basic knowledge of machine learning, statistical learning, deep learning, and artificial intelligence.

Teaching methods

The course is given in the form of lectures with extensive use of examples and support materials such as Python notebooks. Slides and handouts are employed throughout the lectures and they are progressively published on the reference course website on the Ariel platform (https://aferraratmsa.ariel.ctu.unimi.it).
Lecture attendance is not mandatory, but it is strongly recommended.

Teaching Resources

Materials provided by the lecturer on the course website https://aferraratmsa.ariel.ctu.unimi.it

TOOLS
Reference programming language: Python.
Main modules:
NLTK
scikit-learn
spaCy
Gensim
PyTorch

REFERENCES
NLTK Book: https://www.nltk.org/book/
Aggarwal, C. C., & Zhai, C. (Eds.). (2012). Mining text data. Springer Science & Business Media.
Liu, B. (2012). Sentiment analysis and opinion mining. Synthesis lectures on human language technologies, 5(1), 1-167.
Munezero, M. D., Montero, C. S., Sutinen, E., & Pajunen, J. (2014). Are they different? Affect, feeling, emotion, sentiment, and opinion detection in text. IEEE transactions on affective computing, 5(2), 101-111.
Calvo, R. A., & D'Mello, S. (2010). Affect detection: An interdisciplinary review of models, methods, and their applications. IEEE Transactions on affective computing, 1(1), 18-37.

Assessment methods and Criteria

Attending students
Students will be required to prepare and discuss a short paper and a project on the course topics. The topic of the short paper and the project will be defined with the lecturer.

Non-attending students
Oral examination on the entire course syllabus.

Course structure

INF/01 - INFORMATICS - University credits: 3
SECS-S/01 - STATISTICS - University credits: 3

Lessons: 40 hours

Professors: Ferrara Alfio, Manzi Giancarlo

Professor(s)

Ferrara Alfio

Reception:

By appointment. Meetings are held online and are arranged by first contacting the professor by email.

Online. For meetings in person: Department of Computer Science, via Celoria 18, Milano, Room 7012 (7th floor)