Text Mining and Sentiment Analysis

A.Y. 2020/2021
Overall hours
INF/01 SECS-S/01
Learning objectives
Understand the state of the art on text mining and sentiment analysis. Design and develop methods for text classification and topic modeling. Design and develop methods for sentiment classification and polarity detection. Understand the differences between sentiment analysis and emotion detection. Design and develop methods for emotion detection in text.
Expected learning outcomes
At the end of the course, students will be able to address a specific problem in the area of text mining and sentiment analysis. In particular, students will know the main notions needed to understand text processing, the foundations of natural language processing, text classification, and topic modeling. Moreover, students will deal with sentiment analysis in the context of opinion mining, with both rule-based models and machine learning models for text.
Course syllabus and organization

Single session

Lesson period
Second trimester
In the emergency phase, teaching activities will be delivered as synchronous lectures on the MS Teams platform, following the course schedule. The lectures will also be recorded and uploaded to the ARIEL website. Students will be informed as soon as possible, through the ARIEL website, if they will be able to attend classes at the University. This possibility will depend on the evolution of the emergency and on compliance with the safety directives.
Course syllabus
The course provides a complete overview of the state of the art and research perspectives in the field of text mining and sentiment analysis, with an introduction to some relevant and related problems such as emotion detection and opinion mining.

Introduction (0:30 hours)
Course introduction, logistics, course requirements, and Python installation.

Natural language processing (3:30 hours)
Basic techniques in natural language processing: tokenization (bag-of-words and n-gram models), stopwords and punctuation, stemming and lemmatization, part-of-speech tagging, chunking, regular expressions, and named entity recognition. Public NLP toolkits such as NLTK and spaCy will be introduced to gain hands-on experience in Python.
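
As a rough illustration of tokenization, stopword removal, bag-of-words counts, and n-grams, the following minimal sketch uses only the Python standard library rather than NLTK or spaCy. The stopword list and the helper names (`tokenize`, `remove_stopwords`, `ngrams`) are assumptions made for this example and are not part of the course materials.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real toolkits ship much larger ones).
STOPWORDS = {"the", "a", "an", "is", "and", "of", "to", "in"}

def tokenize(text):
    """Lowercase the text and keep runs of letters, which also drops punctuation."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def ngrams(tokens, n):
    """Return the list of contiguous n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "The cat sat on the mat, and the cat slept."
tokens = remove_stopwords(tokenize(text))
bag_of_words = Counter(tokens)   # term -> frequency, word order is discarded
bigrams = ngrams(tokens, 2)      # word order is kept locally

print(tokens)        # ['cat', 'sat', 'on', 'mat', 'cat', 'slept']
print(bag_of_words)
print(bigrams)
```

A regular expression is a deliberately crude tokenizer; dedicated toolkits handle contractions, hyphenation, and non-English text far better.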

Document representation (2 hours)
The Vector Space Model and tf-idf weighting: representing unstructured text documents in a format and structure suited to later automated text mining algorithms. PCA as a dimensionality reduction technique.
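
The sketch below shows one common tf-idf variant (relative term frequency multiplied by log(N/df)) together with cosine similarity between the resulting sparse vectors. Many other weighting variants exist; the function names and the three toy documents are assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [
    "good movie great plot".split(),
    "bad movie poor plot".split(),
    "great acting good cast".split(),
]
vectors = tf_idf(docs)
print(cosine(vectors[0], vectors[1]))  # documents 0 and 1 share "movie" and "plot"
print(cosine(vectors[1], vectors[2]))  # 0.0, no shared terms
```

With this variant, a term that occurs in every document gets weight zero, which is the intended effect of the idf factor.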

Text clustering (3 hours)
Clustering algorithms: connectivity-based clustering (also known as hierarchical clustering) and centroid-based clustering (e.g., k-means clustering). Evaluation of text clustering: purity and Rand index.
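
The two evaluation measures named above can be sketched as follows, assuming each document has a predicted cluster id and a gold class label. The function names and the toy data are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def purity(clusters, labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, []).append(y)
    majority_total = sum(
        Counter(ys).most_common(1)[0][1] for ys in by_cluster.values()
    )
    return majority_total / len(labels)

def rand_index(clusters, labels):
    """Fraction of document pairs on which the clustering and the gold labels agree."""
    pairs = list(combinations(range(len(labels)), 2))
    agreements = sum(
        (clusters[i] == clusters[j]) == (labels[i] == labels[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

clusters = [0, 0, 0, 1, 1, 1]
labels = ["sport", "sport", "politics", "politics", "politics", "sport"]
print(purity(clusters, labels))      # 4/6: each cluster has a majority of 2 out of 3
print(rand_index(clusters, labels))  # 7/15: 7 of the 15 pairs are treated consistently
```

Purity rewards many small clusters (one document per cluster gives purity 1), while the Rand index also penalizes splitting documents of the same class.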

Text categorization (5 hours)
Feature selection and text categorization algorithms: Naive Bayes, k-Nearest Neighbors (kNN), Logistic Regression, Support Vector Machines, and Decision Trees. Evaluation of text classification: precision and recall, confusion matrix, F-score.
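
For a single target class, the evaluation measures listed above reduce to four confusion-matrix counts. The following sketch computes them for a binary case; the function name `binary_metrics` and the toy labels are assumptions for illustration.

```python
def binary_metrics(y_true, y_pred, positive):
    """Confusion-matrix counts, precision, recall, and F1 for one target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}

y_true = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"]
y_pred = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "pos"]
metrics = binary_metrics(y_true, y_pred, positive="pos")
print(metrics)  # tp=3, fp=1, fn=1, tn=3, so precision = recall = f1 = 0.75
```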

Topic modeling (4 hours)
Topic models are a suite of algorithms that uncover the hidden thematic structure in document collections. Two basic topic models will be covered: Probabilistic Latent Semantic Indexing (pLSI) and Latent Dirichlet Allocation (LDA).

Document summarization (2 hours)
Document summarization is the process of reducing a text document to a summary that retains the most important points of the original. Extraction-based summarization methods will be covered.
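
One very simple extraction-based approach, given here only as an assumed illustration and not as the method taught in the course, scores each sentence by the average corpus frequency of its words and keeps the top-scoring sentences in their original order.

```python
import re
from collections import Counter

def summarize(text, k=1):
    """Return the k highest-scoring sentences, in their original order."""
    # Naive sentence splitting on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        # Average word frequency, so long sentences are not favored by length alone.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]

text = ("Text mining extracts knowledge from text. "
        "The weather was nice. "
        "Text mining uses text classification.")
print(summarize(text, k=2))
```

In this toy input the off-topic weather sentence scores lowest and is dropped. Practical systems usually remove stopwords first, so that frequent function words do not dominate the scores.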

Introduction to sentiment analysis and emotion detection (1 hour)
Definition of the sentiment analysis problem. Differences between sentiment analysis and emotion detection.

Lexicon-based approaches to sentiment analysis (4 hours)
Survey of the main approaches that exploit dictionaries, ontologies, and specialized corpora for detecting the sentiment polarity in texts.
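
A dictionary-based polarity scorer can be sketched in a few lines. The lexicon, its weights, and the negation rule below are hypothetical and chosen only for this example; real sentiment lexicons are far larger and handle intensifiers and scope more carefully.

```python
import re

# Hypothetical miniature sentiment lexicon: word -> polarity weight.
LEXICON = {"good": 1.0, "great": 2.0, "love": 2.0,
           "bad": -1.0, "terrible": -2.0, "hate": -2.0}
NEGATIONS = {"not", "no", "never"}

def polarity(text):
    """Sum lexicon weights; a negation word flips the next sentiment-bearing word."""
    score = 0.0
    negate = False
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in NEGATIONS:
            negate = True
            continue
        value = LEXICON.get(token, 0.0)
        if value:
            score += -value if negate else value
            negate = False
    return score

def label(score):
    """Map a numeric score to a polarity class."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for sentence in ["I love this great movie", "The plot is not good", "An average film"]:
    print(sentence, "->", label(polarity(sentence)))
```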

Machine learning approaches to sentiment analysis (4 hours)
Sentiment and polarity detection as a classification problem. Overview and comparison of the main unsupervised and supervised models on a case study.
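
As one possible supervised baseline, polarity detection can be cast as text classification with a multinomial Naive Bayes model using add-one (Laplace) smoothing. The sketch below uses only the standard library; the four training reviews and the function names are invented for illustration and are not the course case study.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (token_list, label) pairs. Returns the fitted counts."""
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for tokens, label in examples:
        word_counts[label].update(tokens)
    vocab = {t for tokens, _ in examples for t in tokens}
    return {"class_counts": class_counts, "word_counts": word_counts, "vocab": vocab}

def predict(model, tokens):
    """Return the label with the highest log posterior under add-one smoothing."""
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_logp = None, -math.inf
    for label, n_docs in model["class_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        logp = math.log(n_docs / total_docs)  # log prior
        for t in tokens:
            if t in model["vocab"]:  # ignore words never seen in training
                logp += math.log((counts[t] + 1) / (total_words + vocab_size))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

train = [
    ("great film loved it".split(), "pos"),
    ("wonderful acting great plot".split(), "pos"),
    ("terrible film hated it".split(), "neg"),
    ("boring plot awful acting".split(), "neg"),
]
model = train_naive_bayes(train)
print(predict(model, "great acting".split()))       # pos
print(predict(model, "awful boring film".split()))  # neg
```

Working in log space avoids numerical underflow when many word probabilities are multiplied together.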

Overview of neural network architectures for sentiment analysis (2 hours)
Design and implementation of a case study based on a neural network for sentiment detection and polarity evaluation.

Affect and emotion detection (1 hour)
Survey and definition of affect and emotion detection in texts. Discussion about the differences between the tasks of detection of sentiment, feelings, emotions, and opinions.

The language of emotions (4 hours)
Methods and techniques for modeling the language of emotions using neural networks and statistical language models. Application to a case study.

Multimodal approaches to emotion detection (1 hour)
Survey on the exploitation of multimodal data (e.g., face and body language in video and audio recordings) in combination with text to detect the language of emotions.

Hands-on real case study, from design to implementation (2 hours)
Students will be provided with a real case study on sentiment analysis and emotion detection. During the lesson, the case study will be worked through in order to design and implement a solution.

Recap and conclusion (1 hour)
Recap on the main course topics. Open discussion of the project work chosen by the students as their exam assignment.
Prerequisites for admission
Basic knowledge of machine learning, statistical learning, deep learning, and artificial intelligence.
Teaching methods
The course is given in the form of lectures with extensive use of examples and support materials such as Python notebooks. Slides and handouts are employed throughout the lectures and they are progressively published on the reference course website on the Ariel platform (https://aferraratmsa.ariel.ctu.unimi.it).
Lecture attendance is not mandatory, but it is strongly recommended.
Teaching Resources
Materials provided by the lecturer on the course website https://aferraratmsa.ariel.ctu.unimi.it

Reference programming language: Python.
Main modules:

NLTK Book: https://www.nltk.org/book/
Aggarwal, C. C., & Zhai, C. (Eds.). (2012). Mining text data. Springer Science & Business Media.
Liu, B. (2012). Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1), 1-167.
Munezero, M. D., Montero, C. S., Sutinen, E., & Pajunen, J. (2014). Are they different? Affect, feeling, emotion, sentiment, and opinion detection in text. IEEE Transactions on Affective Computing, 5(2), 101-111.
Calvo, R. A., & D'Mello, S. (2010). Affect detection: An interdisciplinary review of models, methods, and their applications. IEEE Transactions on Affective Computing, 1(1), 18-37.
Assessment methods and Criteria
Attending students
Students will be required to prepare and discuss a short paper and a project on the course topics. The topic of the short paper and the project will be defined with the lecturer.

Non-attending students
Interview on the whole course program.
INF/01 - INFORMATICS - University credits: 3
SECS-S/01 - STATISTICS - University credits: 3
Lessons: 40 hours
Educational website(s)
By appointment. Meetings will be held online until the end of the COVID emergency.
Department of Computer Science, via Celoria 18 Milano, Room 7012 (7th floor)
Wednesday 2.30PM-5.30PM (appointment suggested, via Teams).
Room 37, 3rd Floor (due to the health emergency, in-person office hours are suspended; office hours will be held via Teams)