Intro

Contents

Overview of Supervised Learning

  1. Describe the differences between supervised learning and unsupervised learning.
  2. Describe the differences between regression and classification.
  3. What is the Bayes error?
  4. Describe the k-nearest-neighbor algorithm.
  5. What is the model assumption on which the k-nearest-neighbor algorithm is based?
  6. What is the model assumption on which least squares regression is based?
  7. What is the curse of dimensionality? Give an example.
  8. Describe the Bias-Variance Decomposition. Describe the connection to overfitting and under-fitting.
  9. How does the expected prediction error (EPE) change for a linear model as the input dimensionality p and the number of training examples N are varied?

Linear Methods for Regression

  1. Derive the equations for standard linear regression.
  2. What is the hat matrix?
  3. What is a Z-score and how is it computed?
  4. Describe two methods for feature subset selection.
  5. What are shrinkage methods used for? Why/When are they useful?
  6. For all shrinkage methods for linear regression, state the optimization formulation and show/derive the solution. What are the underlying statistical assumptions from which these models arise? (Think of the maximum-likelihood estimator of the parameter \theta and the assumptions on \theta.)
  7. Explain PCA and PLS.

Linear Models for Classification

  1. Why is linear regression on indicators not suitable for classification in general? When is it reasonable to use it?
  2. What is the model assumption of LDA (QDA)? When does the data fit the model of LDA (QDA)?
  3. What is the rationale of reduced-rank LDA? Why is it useful?
  4. Describe the differences between LDA and logistic regression.
  5. Which algorithm is used to compute a logistic regression?
  6. Logistic regression is popular because it assigns a risk to each input feature. Why can this interpretation be dangerous?
  7. What is the idea behind Rosenblatt’s perceptron learning algorithm?
  8. What is the idea behind support vector machines?

Basis Expansions and Regularization

  1. What is the concept of basis expansion?
  2. Why should this give superior results for some applications?
  3. What is a spline? How is it fitted to the data?
  4. How many degrees of freedom does a spline with K knots of degree M have?
  5. What is the advantage of a B-spline?
  6. What is a natural spline?
  7. What is a smoothing spline and how does it work?

Model Assessment and Selection

  1. Which methods do you know for choosing the number of knots or the smoothing parameter for a spline?
  2. Why does selecting the model with the lowest training error not work?
  3. What is the difference between test and validation data? What are they used for?
  4. What is the bias-variance tradeoff? How does bias/variance change as you change k for k-nearest-neighbors?
  5. What is the in-sample error? Why is it better to assess the in-sample error than the training error? What is the optimism of the training error?
  6. What is the C_p statistic? What does it try to estimate?
  7. What methods for model selection do you know?
  8. Describe advantages/disadvantages of these model selection methods.
  9. Describe advantages/disadvantages of Cross-Validation over the
  10. What is a major problem with the bootstrap estimate of the expected prediction error? How can it be alleviated?

Model Inference and Averaging

  1. What is the central idea behind maximum likelihood estimation?
  2. What are the benefits of using bootstrapping in parameter estimation?
  3. Describe the EM algorithm using the example of a two-component Gaussian mixture.
  4. What is the difference between the EM algorithm and Gibbs sampling (an MCMC method)?
  5. What are the necessary prerequisites for parameter estimation by ML, Bayes, bootstrapping and MCMC?
  6. Describe bagging and stacking. What is surprising about the performance of bagging simple classifiers?

Additive Models, Trees and Related Methods

  1. What is the model assumption underlying additive models? Use an example to illustrate the workings of additive logistic regression.
  2. Describe the algorithm to fit a generalized additive model.
  3. What is a scatterplot smoother?
  4. What is a decision stump?
  5. How does a regression/classification tree work? How is it trained? Does the training procedure differ between classification/regression trees?
  6. Discuss the advantages and disadvantages of using trees.
  7. What is a good way of quantifying the performance of a (binary) classifier?
  8. Explain how the PRIM bump hunting algorithm works. What is the main difference with respect to CART methods?
  9. How do you compare two classifiers? Give an example and argue which one is better.
  10. Discuss the relations between MARS and CART learning methods.
  11. Describe the issues arising when dealing with missing data.

Boosting and Additive Trees

  1. Outline the AdaBoost.M1 algorithm.
  2. Argue why the AdaBoost algorithm fits an additive model. Which loss function is optimized with this algorithm?
  3. Describe some loss functions for classification and regression. Compare them (e.g. with respect to performance and implementation).
  4. Why is it preferable to use small trees in tree boosting?
  5. What is an advantage of gradient boosting?
  6. How would you interpret a boosted tree model?