Data Mining and Computational Statistics

A.Y. 2023/2024
Max ECTS: 9
Overall hours: 60
SSD: SECS-S/01
Language: English
Learning objectives
This is an introductory course on the basic techniques of Data Mining and Computational Statistics and their applications in finance and economics, set within the more general framework of data science. Students will develop programming skills using the R software in the Data Mining part and the OpenBUGS software for Bayesian Markov Chain Monte Carlo random variable generation. Students will learn to study Data Mining and Computational Statistics topics independently and will be able to solve practical problems in economic and financial data analysis.
Expected learning outcomes
At the end of the course students will be able to apply machine learning techniques and algorithms in economic and financial applications. Specifically, students will be familiar with supervised and unsupervised models. In the supervised framework, students will be able to fit advanced regression models such as ridge and lasso regression and to apply classification techniques such as the Bayes classifier, the K-NN classifier and the logistic model. In the unsupervised framework, students will become familiar with dimensionality reduction techniques and cluster analysis. More sophisticated techniques such as decision tree-based classification will also be presented. In Computational Statistics, students will acquire knowledge of resampling techniques, random number and random variable generation, and numerical integration.
Single course

This course can be attended as a single course.

Course syllabus and organization

Single session

Responsible
Lesson period
Third trimester
Course syllabus
(0) Introduction to R software.
(i) Supervised vs. unsupervised methods: introduction.
(ii) Parametric vs. nonparametric methods; trade-off between the bias and variance of a statistical learning method.

Supervised methods:
(iv) Review of multiple linear regression. Shrinkage methods: ridge regression, the lasso and other shrinkage methods.
(v) Review of likelihood inference.
(vi) Classification methods: logistic regression.
(vii) The Bayes classifier and the K-nearest neighbors method.
(viii) Linear and quadratic discriminant analysis.
(ix) Resampling methods: cross validation and the bootstrap.
(x) Decision trees: regression and classification trees; pruning.
(xi) Tree-based methods: bagging and random forests.

Unsupervised methods:
(xii) Principal components analysis.
(xiii) Clustering.
Prerequisites for admission
A good knowledge of basic statistics is required, together with basic mathematics, especially linear algebra. Some basic knowledge of computer programming is welcome but not required.
Teaching methods
Lectures will be given on the blackboard, since the topics are quite demanding for the students. We will try to work interactively with students by encouraging their oral and written contributions.
In addition to the lectures there will be 20 hours of exercises, in which the concepts presented in class are applied using the R software and analyses of real data are developed.
Teaching Resources
Main textbooks:
(i) An Introduction to Statistical Learning, with applications in R (2013) by G. James, D. Witten, T. Hastie, R. Tibshirani, Springer.
(ii) Introducing Monte Carlo Statistical Methods with R (2010) by C.P. Robert, G. Casella, Springer.
Suggested reading for insights into some topics in main textbooks:
(i) The Elements of Statistical Learning, 2nd edition (2009), T. Hastie, R. Tibshirani, J. Friedman, Springer.
(ii) Machine Learning: a Probabilistic Perspective (2012), K.P. Murphy, The MIT Press.
(iii) Monte Carlo Statistical Methods (2004) by C.P. Robert, G. Casella, Springer.
Further reading will be suggested during the course.
Assessment methods and Criteria
One half of the exam consists of a 30-minute written test with 10 multiple-choice questions; the other half consists of a report (5-6 pages) on a specific topic assigned during the course, to be delivered by e-mail to the professor. For attending students the report is group work (max 5 people); for non-attending students it is an individual report. The professor may, at their discretion, ask some questions about the delivered report.
SECS-S/01 - STATISTICS - University credits: 9
Lessons: 60 hours
Professor: Tommasi Chiara
Professor(s)
Reception:
Wednesday from 9:00 to 12:00
Via Conservatorio, III floor, Room n. 35