Big Data Analytics
A.Y. 2025/2026
Learning objectives
Recent years have witnessed a dramatic increase in interest in the analysis of texts in the social sciences. This is largely due to the development of new methods that facilitate substantively important inferences about politics and society from large text collections. This course aims to provide an introductory guide to this exciting new area of research, while also offering guidelines on how to use text methods effectively for social scientific research.
Expected learning outcomes
By the end of the course, students will be able to:
- use statistical methods on texts effectively for social scientific research;
- discuss the advantages, but also the limits, of each approach.
Course evaluation aims to verify the expected learning outcomes in relation to these topics.
Lesson period: First trimester
Assessment methods: Exam
Assessment result: grade recorded on a 30-point scale
Single course
This course cannot be attended as a single course. Please check our list of single courses to find the ones available for enrolment.
Course syllabus and organization
Single session
Course syllabus
Recent years have witnessed a dramatic increase in interest in the analysis of texts in the social sciences. This is largely due to the development of new techniques that facilitate substantively important inferences about politics and society from large text collections. This course provides students with the foundational skills required to work with large-scale text data in their research, while also offering guidelines on how to use and validate text methods for social scientific research. It also serves as a gateway to more advanced studies in artificial intelligence and large language models.
The course focuses on four main areas:
1) (supervised and unsupervised) scaling methods that make it possible to estimate the positions of actors along a latent dimension;
2) unsupervised topic models (including structural topic models) that make it possible to discover new ways of organizing texts into a set of unknown categories;
3) supervised classification methods (i.e., machine learning algorithms) that make it possible to organize texts into a set of pre-defined categories. The fundamental concepts used across ML techniques (such as overfitting, cross-validation, and global interpretation) will be introduced, together with a discussion of how to apply various popular ML techniques (e.g., random forests, naive Bayes, and neural networks, among others);
4) word-embedding techniques that extend beyond the traditional bag-of-words approach in text analytics. We will examine both static and dynamic word embedding techniques, with a focus on those leveraging the self-attention mechanism. Additionally, we will explore how these methods have driven the revolution in Large Language Models (LLMs). Special emphasis will be placed on the BERT family of Transformers.
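As an illustration of the first area (supervised scaling), the following is a minimal Python sketch in the spirit of the Wordscores approach of Laver, Benoit and Garry (2003). It is not course material: the lab sessions use R, and the toy texts, the reference positions, and the function names below are hypothetical, chosen only to show the core idea that words inherit positions from reference texts with known positions, and that a new text is scored by the words it contains.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def relative_freqs(tokens):
    """Relative frequency of each word within one text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def word_scores(reference_texts, reference_positions):
    """Score each word as the average of the reference positions,
    weighted by the probability of reading each reference text
    given that the word was observed."""
    freqs = [relative_freqs(tokenize(text)) for text in reference_texts]
    vocabulary = set().union(*freqs)
    scores = {}
    for word in vocabulary:
        column = [f.get(word, 0.0) for f in freqs]
        denom = sum(column)
        scores[word] = sum(
            fw / denom * position
            for fw, position in zip(column, reference_positions)
        )
    return scores


def score_text(text, scores):
    """Score a new text as the frequency-weighted mean of its word scores.
    Words absent from the reference texts are ignored; returns None if
    no word in the text has a score."""
    counts = Counter(tok for tok in tokenize(text) if tok in scores)
    total = sum(counts.values())
    if total == 0:
        return None
    return sum(count / total * scores[word] for word, count in counts.items())


# Hypothetical toy data: two reference texts with assumed positions
# on a left (-1) to right (+1) dimension.
references = [
    "tax cuts markets markets",
    "welfare spending welfare equality",
]
positions = [1.0, -1.0]

scores = word_scores(references, positions)
print(round(score_text("markets tax welfare", scores), 3))  # prints 0.333
```

With these toy inputs the new text contains two words seen only in the +1 reference and one word seen only in the -1 reference, so its estimated position lies between the two reference positions, closer to +1. Real applications also rescale the raw scores and report uncertainty, which this sketch omits.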
Prerequisites for admission
An elementary knowledge of R and a curiosity about applied statistics are good prerequisites for the lab sessions.
Teaching methods
Lab sessions are a crucial part of the course: they offer hands-on experience with the techniques and statistical methods discussed in class. All datasets, replication files for the lab sessions, and reference texts will be made available at a dedicated URL before the beginning of the course. Enrolled students should bring their own laptop with R, RStudio, and the relevant packages already installed and working (instructions will be circulated beforehand).
Teaching Resources
Benoit, Kenneth. 2020. Text as Data: An Overview. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science and International Relations, London: Sage, 461-497.
Grimmer, Justin, and Brandon M. Stewart. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297.
Laver, Michael, Kenneth Benoit, and John Garry. 2003. Extracting Policy Positions from Political Texts Using Words as Data. American Political Science Review, 97(2): 311-331.
Slapin, Jonathan B., and Sven-Oliver Proksch. 2008. A Scaling Model for Estimating Time-Series Party Positions from Texts. American Journal of Political Science, 52(3): 705-722.
Roberts, Margaret E., Brandon M. Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, and David G. Rand. 2014. Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science, 58(4): 1064-1082.
Olivella, Santiago, and Kelsey Shoub. 2020. Machine Learning in Political Science: Supervised Learning Models. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science and International Relations, London: Sage, chapter 56.
Jordan, Soren, Hannah L. Paul, and Andrew Q. Philips. 2022. How to Cautiously Uncover the "Black Box" of Machine Learning Models for Legislative Scholars. Legislative Studies Quarterly. https://onlinelibrary.wiley.com/doi/abs/10.1111/lsq.12378
Barberá, Pablo, and Zachary C. Steinert-Threlkeld. 2020. How to Use Social Media Data for Political Science Research. In Luigi Curini and Robert Franzese (eds.), SAGE Handbook of Research Methods in Political Science and International Relations, London: Sage, 404-423.
Rodriguez, Pedro L., and Arthur Spirling. 2022. Word Embeddings: What Works, What Doesn't, and How to Tell the Difference for Applied Research. Journal of Politics, 84(1): 101-115.
Further readings will be suggested during the course. Please check the course home page regularly and contact the professor with any questions.
Assessment methods and Criteria
For enrolled students, course grades will be based on home assignments and class participation. Instructions for non-enrolled students will be circulated later.
Professor(s)
Reception:
Wednesday 12:45-15:45
Room 319, via Conservatorio 7, Department of Social and Political Sciences. Please write directly to the professor to make an appointment.