Intro
- This is a summary of the first 10 chapters of the book The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie, Tibshirani, and Friedman.
- It is based on 61 questions from Prof. Lengauer's lecture at Saarland University: http://www.mpi-inf.mpg.de/departments/d3/teaching/sl1_07_08.html
- I hope it sheds some light on these chapters.
- Feel free to contact me if you would like to improve this work, have spotted an error, or want the LaTeX source:
- Cautionary note: This was my first machine learning lecture.
- Summary Of Statistical Machine Learning 1 (pdf)
Contents
Overview of Supervised Learning
- Describe the differences between supervised learning and unsupervised learning.
- Describe the differences between regression and classification.
- What is the Bayes error?
- Describe the k-nearest-neighbor algorithm.
- What is the model assumption on which the k-nearest-neighbor algorithm is based?
- What is the model assumption on which least squares regression is based?
- What is the curse of dimensionality? Give an example.
- Describe the bias-variance decomposition. Describe the connection to overfitting and underfitting.
- How does the EPE change for a linear model as the input dimensionality p and the number of training examples N are varied?
Linear Methods for Regression
- Derive the equations for standard linear regression.
- What is the hat matrix?
- What is a Z-score and how is it computed?
- Describe two methods for feature subset selection.
- What are shrinkage methods used for? Why/When are they useful?
- For all shrinkage methods for linear regression, state the optimization formulation and show/derive the solution. What are the underlying statistical hypotheses from which these models arise (think of the maximum-likelihood estimator of the parameter and the assumptions it rests on)?
- Explain PCA and PLS.
Linear Models for Classification
- Why is linear regression on indicators not suitable for classification in general? When is it reasonable to use it?
- What is the model assumption of LDA (QDA)? When does the data fit the model of LDA (QDA)?
- What is the rationale of reduced-rank LDA? Why is it useful?
- Describe the differences between LDA and logistic regression.
- Which algorithm is used to fit a logistic regression model?
- Logistic regression is popular because it assigns a risk to each input feature. Why can this interpretation be dangerous?
- What is the idea behind Rosenblatt’s perceptron learning algorithm?
- What is the idea behind support vector machines?
Basis Expansions and Regularization
- What is the concept of basis expansion?
- Why should this give superior results for some applications?
- What is a spline? How is it fitted to the data?
- How many degrees of freedom does a spline with K knots of degree M have?
- What is the advantage of a B-spline?
- What is a natural spline?
- What is a smoothing spline and how does it work?
Model Assessment and Selection
- Which methods do you know for choosing the number of knots or the smoothing parameter for a spline?
- Why does selecting the model with the lowest training error not work?
- What is the difference between test and validation data? What are they used for?
- What is the bias-variance tradeoff? How does bias/variance change as you change k for k-nearest-neighbors?
- What is the in-sample error? Why is it better to assess the in-sample error than the training error? What is the optimism of the training error?
- What is the Cp statistic? What does it try to estimate?
- What methods for model selection do you know?
- Describe advantages/disadvantages of these model selection methods.
- Describe advantages/disadvantages of Cross-Validation over the
- What is a major problem with the bootstrap estimate of the expected prediction error? How can it be alleviated?
Model Inference and Averaging
- What is the central idea behind maximum likelihood estimation?
- What are the benefits of using bootstrapping in parameter estimation?
- Describe the EM algorithm using the example of a two-component Gaussian mixture.
- What is the difference between the EM algorithm and Gibbs sampling (=MCMC)?
- What are the necessary prerequisites for parameter estimation by ML, Bayes, bootstrapping and MCMC?
- Describe bagging and stacking. What is surprising about the performance of bagging simple classifiers?
Additive Models, Trees and Related Methods
- What is the model assumption underlying additive models? Use an example to illustrate the workings of additive logistic regression.
- Describe the algorithm to fit a generalized additive model.
- What is a scatterplot smoother?
- What is a decision stump?
- How does a regression/classification tree work? How is it trained? Does the training procedure differ between classification/regression trees?
- Discuss the advantages and disadvantages of using trees.
- What is a good way of quantifying the performance of a (binary) classifier?
- Explain how the PRIM bump hunting algorithm works. What is the main difference with respect to CART methods?
- How do you compare two classifiers? Give an example and argue which one is better.
- Discuss the relations between MARS and CART learning methods.
- Describe the issues arising when dealing with missing data.
Boosting and Additive Trees
- Outline the AdaBoost.M1 algorithm.
- Argue why the AdaBoost algorithm fits an additive model. Which loss function does this algorithm optimize?
- Describe some loss functions for classification and regression. Compare them (e.g. with respect to performance and implementation).
- Why is it preferable to use small trees in tree boosting?
- What is an advantage of gradient boosting?
- How would you interpret a boosted tree model?