Data Mining and Computational Statistics

A.Y. 2021/2022
Overall hours
Learning objectives
This course introduces the basic techniques of Data Mining and Computational Statistics and their applications in finance and economics, within the more general framework of data science. Students will develop programming skills using the R software in the Data Mining part and the OpenBUGS software for Bayesian Markov Chain Monte Carlo random variable generation. Students will learn to study Data Mining and Computational Statistics topics independently and will be able to solve practical problems in economic and financial data analysis.
Expected learning outcomes
At the end of the course students will be able to apply machine learning techniques and algorithms in economic and financial applications. Specifically, students will be familiar with supervised and unsupervised models. In the supervised framework, students will be able to fit advanced regression models such as ridge and lasso regression, and to use classification techniques such as the Bayes classifier, the K-NN classifier and the logistic model. In the unsupervised framework, students will become familiar with dimensionality reduction techniques and cluster analysis. More sophisticated techniques such as decision tree-based classification will also be presented. In Computational Statistics, students will learn resampling techniques, random number and random variable generation, and numerical integration.
Course syllabus and organization

Single session

Lesson period
Third trimester
If required by the health emergency, the lectures and the exercises will take place remotely, in synchronous mode, through the Microsoft Teams platform. The recordings and any further teaching material will be available on ARIEL.

The exam will consist of a 30-minute written test with 15 multiple-choice questions, conducted remotely through the platform if necessary, plus a report (5-8 pages) on a specific topic assigned during the course, to be delivered to the professor by e-mail. For attending students the report is group work (max 5 people); for non-attending students it is an individual report. The test determines 2/3 of the mark and the report 1/3. The professor may, at their discretion, ask questions about the submitted report.
Course syllabus
(i) Review of some statistical techniques (likelihood, confidence intervals, tests).
(ii) Introduction to data mining and statistical learning.
(iii) Supervised and unsupervised methods.
(iv) Classical linear regression.
(v) Classification methods: Logistic regression, discriminant analysis, KNN method.
(vi) Regression with regularization: Ridge and lasso regression.
(vii) Unsupervised methods: dimensionality reduction, clustering.
(viii) Decision trees and ensemble methods.
(ix) Resampling methods: bootstrap.
(x) Simulation methods: generation of random variables, Monte Carlo integration.
(xi) Other types of regression: principal component regression, splines, local regression.
(xii) Multidimensional scaling.
(xiii) Support vector machines.
(xiv) Text mining.
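As a rough illustration of two of the computational topics listed above (bootstrap resampling and Monte Carlo integration), the following is a minimal sketch in Python using only the standard library. It is not course material: the function names `mc_integrate` and `bootstrap_se`, the integrand, the synthetic sample and the seed are all assumptions chosen for the example.

```python
import math
import random
import statistics


def mc_integrate(f, a, b, n, rng):
    """Monte Carlo estimate of the integral of f over [a, b] from n uniform draws."""
    total = sum(f(rng.uniform(a, b)) for _ in range(n))
    return (b - a) * total / n


def bootstrap_se(data, stat, n_boot, rng):
    """Bootstrap standard error of `stat`: resample with replacement,
    recompute the statistic, and take the standard deviation of the replicates."""
    n = len(data)
    replicates = [stat(rng.choices(data, k=n)) for _ in range(n_boot)]
    return statistics.stdev(replicates)


rng = random.Random(2022)  # fixed seed, chosen arbitrarily for reproducibility

# Integral of exp(-x^2) over [0, 1]; the exact value is sqrt(pi)/2 * erf(1).
estimate = mc_integrate(lambda x: math.exp(-x * x), 0.0, 1.0, 100_000, rng)
exact = math.sqrt(math.pi) / 2 * math.erf(1.0)

# Synthetic standard normal sample (not course data) for the bootstrap part.
sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
se_boot = bootstrap_se(sample, statistics.mean, 1000, rng)
se_formula = statistics.stdev(sample) / math.sqrt(len(sample))

print(f"Monte Carlo estimate: {estimate:.4f} (exact: {exact:.4f})")
print(f"Bootstrap SE of the mean: {se_boot:.4f} (formula: {se_formula:.4f})")
```

For the sample mean, the bootstrap standard error should be close to the textbook value s/sqrt(n), which makes this a convenient sanity check before applying the same resampling scheme to statistics that have no closed-form standard error.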
Prerequisites for admission
A good knowledge of basic statistics is required, together with basic mathematics, especially linear algebra. Some knowledge of basic computer programming is welcome but not essential.
Teaching methods
Lectures will be given on the blackboard, since the topics are quite demanding for the students. We will try to work interactively with students by encouraging their oral and written contributions.
In addition to the lectures there will be 20 hours of exercises, in which the concepts presented in class are applied using the R software.
Teaching Resources
Main textbooks:
(i) An Introduction to Statistical Learning: with Applications in R (2013) by G. James, D. Witten, T. Hastie, R. Tibshirani, Springer.
(ii) Introducing Monte Carlo Statistical Methods with R (2010) by C.P. Robert, G. Casella, Springer.
Suggested reading for deeper insight into some topics in the main textbooks:
(i) The Elements of Statistical Learning, 2nd edition (2009) by T. Hastie, R. Tibshirani, J. Friedman, Springer.
(ii) Machine Learning: A Probabilistic Perspective (2012) by K.P. Murphy, The MIT Press.
(iii) Monte Carlo Statistical Methods (2004) by C.P. Robert, G. Casella, Springer.
Further reading will be suggested during the course.
Assessment methods and Criteria
The exam consists of preparing a poster, individually or in a group (max 4 students per group), on one or more topics of the course, which the students will present on the day of the exam. The poster must include a data analysis (not necessarily on financial data) carried out in R or Python (or both). The title, the composition of the group and the abstract of the poster must be approved by the instructor. This part is worth two thirds of the entire exam. The remaining part consists of a 30-minute written test with 10 multiple-choice questions.
SECS-S/01 - STATISTICS - University credits: 9
Practicals: 20 hours
Lessons: 40 hours
Professor: Manzi Giancarlo
Wednesday 2.30PM-5.30PM (appointment suggested, via Teams).
Room 37, 3rd Floor (due to the health emergency, in-person office hours are suspended; office hours will be held via Teams)