Data Mining and Computational Statistics

A.Y. 2019/2020
Learning objectives
This course introduces the basic techniques of Data Mining and Computational Statistics and their applications in finance and economics, set within the more general framework of data science. Students will develop programming skills using the R software in the Data Mining part, and the OpenBUGS software for Bayesian Markov Chain Monte Carlo random variable generation. Students will learn to study Data Mining and Computational Statistics topics independently and will be able to solve practical problems in economic and financial data analysis.
Expected learning outcomes
At the end of the course students will be able to apply machine learning techniques and algorithms in economic and financial applications. Specifically, students will be familiar with supervised and unsupervised models. In the supervised framework, students will be able to fit advanced regression models such as ridge and lasso regression, and to use classification techniques such as the Bayes classifier, the K-NN classifier and the logistic model. In the unsupervised framework, students will become familiar with dimensionality reduction techniques and cluster analysis. More sophisticated techniques such as decision tree-based classification will also be presented. In Computational Statistics, students will acquire knowledge of resampling techniques, random number and random variable generation, and numerical integration.
Course syllabus and organization

Single session

Lesson period
Third trimester
Course syllabus
Part I
(i) Review of likelihood inference.
(ii) Introduction to data mining and statistical learning.
(iii) Exploratory data analysis and visualization.
(iv) Supervised vs. unsupervised methods: introduction.
(v) Parametric vs. nonparametric methods: introduction.
(vi) Multiple linear regression.
(vii) Classification methods: logistic regression, linear discriminant analysis and the K-nearest neighbors method. The Bayes classifier.
(viii) Resampling methods: cross validation and the bootstrap.
(ix) Shrinkage methods: Ridge regression, the Lasso and other shrinkage methods.
(x) Regression splines and local regression.
(xi) Tree-based methods: random forest, bagging and boosting.
(xii) Support vector machines.
(xiii) Unsupervised learning: PCA, clustering and multidimensional scaling methods; correspondence analysis. Principal component regression.
(xiv) Introduction to Bayesian methods in data mining.
(xv) Basic text mining.
(xvi) Data mining in finance.
Part II
(i) Computer-intensive statistical methods: overview.
(ii) Pseudo-random number and variable generation.
(iii) Monte Carlo methods for numerical integration.
(iv) Simulation-based inference.
(v) MCMC methods: overview.
(vi) MCMC methods: Metropolis-Hastings and Gibbs sampling.
Prerequisites for admission
A good knowledge of basic statistics and basic mathematics, especially linear algebra, is required. Some computer programming experience is welcome but not essential.
Teaching methods
Classes will be taught partly with prepared slides and partly at the blackboard, especially for the topics that students find more demanding. We will try to work interactively with students by encouraging their oral and written contributions.
Teaching Resources
Main textbooks:
(i) An Introduction to Statistical Learning, with applications in R (2013) by G. James, D. Witten, T. Hastie, R. Tibshirani, Springer.
(ii) Introducing Monte Carlo Statistical Methods with R (2010) by C.P. Robert, G. Casella, Springer.
Suggested reading for insights into some topics in main textbooks:
(i) The Elements of Statistical Learning, 2nd edition (2009) by T. Hastie, R. Tibshirani, J. Friedman, Springer.
(ii) Machine Learning: A Probabilistic Perspective (2012) by K.P. Murphy, The MIT Press.
(iii) Monte Carlo Statistical Methods (2004) by C.P. Robert, G. Casella, Springer.
Further reading will be suggested during the course.
Assessment methods and Criteria
One third of the exam consists of a multiple-choice test; the remaining two thirds consist of preparing a poster on course topics, in which the student must demonstrate knowledge of the R software and of the essential techniques of statistical learning and computational statistics. During the course, homework may be assigned, to be completed within a day or two. These assignments will contribute, to a limited extent, to increasing the final score.
SECS-S/01 - STATISTICS - University credits: 9
Lessons: 60 hours
Professor: Manzi Giancarlo
Wednesday 2:30 PM-5:30 PM (appointment suggested, via Teams).
Room 37, 3rd floor (in-person office hours are suspended due to the health emergency; office hours will be held via Teams)