Self-Test Questions
The following questions are intended for exam preparation using active recall and complement flashcards.
On the origin of the questions:
  • Questions from past exams are marked with a star (⭐).
  • Another part of the questions comes from the lecture Maschinelles Lernen (Grundverfahren). These questions are marked with a brain (🧠).
  • Questions from the University of Toronto are marked with a camp (🏕️).
  • Questions from the University of Berkeley are marked with a firefighter (🧑‍🚒).
  • The remaining questions are my own or are interview questions.

Overview

Overview of the algorithms covered in the lecture

Big Data

  • F: What are the characteristics of big data?
    • volume
    • variety
    • velocity
    • veracity
    • value
  • F: Explain three characteristics of big data?
    • Volume refers to the sheer amount of data that is generated.
    • Variety refers to the diversity of types of data. Data can come in structured, semi-structured, or even unstructured types.
    • Velocity refers to the sheer speed at which data is generated (and processed).
    • Veracity refers to the quality of data or accuracy of the collected data. To resolve data quality issues one has to apply sophisticated pre-processing.
  • F: What is the difference between veracity and variety?
    Veracity refers to the quality of data (e. g. noise in the data), while variety refers to the types in which data can come (e. g. unstructured data). As data is often collected from different sources, both its types and its quality can differ.

ML vs. Statistics vs. Econometrics

  • F: Compare ML to Statistics. What are the most significant differences?
    Statistics is:
    • based on hypothesis, then a collection of data and analysis
    • model-oriented with an emphasis on parametric models
    • focus on understanding and hypothesis testing
    Whereas in Machine Learning:
    • there is seldom an a priori hypothesis
    • data is collected in advance
    • analysis is data-driven not hypothesis-driven
    • analysis is algorithm-oriented rather than model-oriented
    • focus lies on prediction
  • F: Compare ML to Econometrics. In which way do both differ?
    Econometrics is:
    • concerned with causal inference and counterfactuals
    • mostly centred around linear regressions and complex structural models
    • standard errors are often reported after one run
    Machine Learning is:
    • concerned about prediction
    • using all sorts of data-driven models e. g. Trees, NN, etc.

Structure of data / CRISP-DM / Taxonomy

  • F: What are the characteristics of unstructured data? Explain them.
    Unstructured data is:
    • Nonnumeric: No predefined numeric representation for the constructs of interests. Requires manual or automatic coding prior to analysis.
    • Multifaceted: A single unit of unstructured data possesses multiple facets. Each facet provides unique information and serves different research goals. E. g. voice data carries information about the speaker such as pitch and speech rate, so the same data can be used in both psychology and communication research.
    • Concurrent representation: The multiple facets of a single data unit are present simultaneously and each provides unique information, which allows different phenomena to be represented at the same time. One can study different research questions with one single unit of unstructured data.
  • F: What is 'structured data'?
    • Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyze.
    • Structured data conforms to a tabular format with a relationship between the different rows and columns.
  • F: What is unsupervised learning?
    • Observe data and construct a low complexity description of the data.
    • That means in unsupervised learning the structure that the data set is transformed into is not previously known or understood. Data is not labeled. (Grooking p. 13)
    • We observe only the features $X_1, X_2, \ldots, X_P$. We are not interested in prediction, because we do not have an associated response variable $Y$.
  • F: What is supervised learning?
    • We observe both a set of features $X_1, X_2, \ldots, X_P$ for each object, as well as a response or outcome variable $Y$. The goal is then to predict $Y$ using $X_1, X_2, \ldots, X_P$.
    • Examples include regression and classification.
  • F: What are advantages / disadvantages of unsupervised learning techniques?
    • No labeled data required, which is often expensive and laborious. (+)
    • Adding labels to the data after clustering is often easier (+) (own)
    • Unsupervised techniques such as clustering help with data understanding of the raw data. (+) (own)
    • Unsupervised learning is more subjective, as there is no simple objective (-)
  • F: Name practical applications of unsupervised learning.
    • Subgroups of breast cancer patients grouped by their gene expression measurements
    • Groups of shoppers characterized by their browsing and purchase histories
    • Movies grouped by ratings assigned by movie raters
  • F: What is the goal of unsupervised learning?
    • The goal of unsupervised learning is to discover interesting things about the measurements, e. g. informative ways to visualize the data or subgroups among the variables or observations.
  • F: Give two examples for unsupervised learning techniques.
    • Clustering algorithms such as $k$-means
    • Dimensionality reduction techniques such as PCA
  • F: Give examples for structured/unstructured data.
    Unstructured: (low degree of organization)
    • Video Data, as the video comes in different formats, compression ratios, sizes, where the video has to be transformed first to extract information from every single frame
    • Image Data, just like videos.
    Structured: (high degree of organization)
    • Numeric secondary data, e. g. sales figures, as they come in a standardized and easy-to-process format, e. g. floats with $x$ decimal places
    • Categorical data, e. g. gender, as there are predefined formats
  • F: Give a brief explanation of categorical, binary, ordinal, and numeric variables.
    • categorical/nominal: Names of things or symbols.
    • binary: A nominal variable with two categories or states: 0 or 1.
    • ordinal: Ordinal variables have a meaningful order or ranking among them, but the magnitude between successive values is not known.
    • numeric: A quantitative variable. Numeric variables could be interval-scaled or ratio-scaled.
  • F: Which steps are part of the CRISP-DM model? Explain them in-depth.
    1. Business understanding, i. e. developing an understanding of the business objectives and the requirements of the data mining project
    2. Data understanding, i. e. identifying and collecting the data set needed to fulfill the business goals
    3. Data preparation, i. e. preparing the data for modeling
    4. Modeling, i. e. building several models and assessing them on a technical level
    5. Evaluation, i. e. evaluating whether the models help achieve the business goals and planning the next steps
    6. Deployment, i. e. deploying the model to production and making it accessible to customers
  • F: Explain common techniques for data gathering.
    • Bulk downloads: Downloading large amounts of data. Often done using sophisticated software.
    • APIs: Accessing data through machine-readable interfaces. Examples include Google Maps API.
    • Web Scraping: Extraction of data from websites. Often done using bots and web crawlers or manually.
  • F: Why is it desirable to work on normalized data?
    • Some algorithms require normalized data, such as $k$-means clustering, which is 'isotropic' in all directions of space and therefore tends to produce more or less round clusters. Without standardization, the variables contribute unequally to the distance computation, depending on their scale and variance. (See here.)
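    A minimal sketch of z-score standardization, assuming NumPy. The function name `standardize` and the income/age data are made up for illustration:

```python
import numpy as np

def standardize(X):
    """Scale every column to zero mean and unit variance (z-scores)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # keep constant columns at zero instead of dividing by zero
    return (X - mean) / std

# Hypothetical data: income (euros) and age (years) live on very different scales.
X = np.array([[30_000.0, 25.0], [60_000.0, 40.0], [90_000.0, 55.0]])
Z = standardize(X)
```

    After this step both columns contribute on the same scale to Euclidean distances, which is what a distance-based method such as $k$-means relies on.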
  • F: Explain common techniques to analyze the relationship between variables.
    • A scatter plot (or scatter diagram) is used to show the relationship between variables
    • Bar plot for high dimensional data
    • Mean graph for categorical data
    • Correlation analysis
  • F: How can missing data be replaced? Explain.
    • mean-based imputation: i. e. mean is calculated from all observations
    • median-based imputation: Same as above but with median.
    • stratified imputation: i. e. categories/ structure of data is considered for replacements. E. g. missing height is different for gender male and female.
    • regressed imputation: i. e. replacing missing values by predictions of a regression model
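    The first three strategies can be sketched as follows, assuming NumPy and missing values encoded as NaN. The helper names and the height/gender data are hypothetical:

```python
import numpy as np

def impute_mean(x):
    """Replace NaNs by the mean of all observed values."""
    x = np.asarray(x, dtype=float).copy()
    x[np.isnan(x)] = np.nanmean(x)
    return x

def impute_median(x):
    """Replace NaNs by the median of all observed values."""
    x = np.asarray(x, dtype=float).copy()
    x[np.isnan(x)] = np.nanmedian(x)
    return x

def impute_stratified(x, groups):
    """Replace NaNs by the mean of the observed values within the same group."""
    x = np.asarray(x, dtype=float).copy()
    groups = np.asarray(groups)
    for g in np.unique(groups):
        mask = groups == g
        x[mask & np.isnan(x)] = np.nanmean(x[mask])
    return x

# Made-up example: heights in cm with two missing entries.
height = np.array([180.0, np.nan, 176.0, 165.0, np.nan, 161.0])
gender = np.array(["m", "m", "m", "f", "f", "f"])

by_mean = impute_mean(height)                      # fills both gaps with 170.5
by_group = impute_stratified(height, gender)       # fills with 178.0 (m) and 163.0 (f)
```

    Regressed imputation follows the same pattern, with the group mean replaced by the prediction of a regression model fitted on the complete rows.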
  • F: Explain 3 patterns in which missing data can occur.
    • Missing completely at random (MCAR): Missing values have no pattern and cannot be predicted.
    • Missing at random (MAR): Missing values can be predicted using other data available for the observation. Assign a categorical value.
    • Latent, yet unknown variable: Missing value depends on a latent and highly correlated variable.
  • F: What is a training, test, and validation set for?
    • Training set is used to fit all potential models
    • Validation set is used to select hyperparameters of a model
    • Test set is used to estimate the predictive power of a model on unseen data
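    A minimal sketch of a random three-way split, assuming NumPy. The function name and the 60/20/20 proportions are illustrative assumptions, not prescribed by the lecture:

```python
import numpy as np

def train_val_test_split(n_samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Return disjoint index arrays for the training, validation and test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)          # shuffle once, then cut into three parts
    n_test = int(round(test_frac * n_samples))
    n_val = int(round(val_frac * n_samples))
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = train_val_test_split(100)
```

    Models are fitted on `train_idx`, hyperparameters are chosen on `val_idx`, and `test_idx` is touched only once for the final performance estimate.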
  • F: What is the risk with tuning hyperparameters using a test dataset? 🧑‍🚒
    • Tuning model hyperparameters to a test set means that the hyperparameters may overfit that test set. If the same test set is used to estimate performance, it will produce an overestimate. Using a separate validation set for tuning and test set for measuring performance provides an unbiased, realistic measurement of performance. (Berkley p. 14)

LR, Ridge, and Lasso

  • F: Explain how best subset selection works in 3 steps.
    1. Let $\mathcal{M}_0$ denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
    2. For $k = 1, 2, \ldots, p$:
       a. Fit all $\binom{p}{k}$ models that contain exactly $k$ predictors.
       b. Pick the best among these $\binom{p}{k}$ models, and call it $\mathcal{M}_k$. Here best is defined as having the smallest RSS, or equivalently the largest $R^2$.
    3. Select a single best model from among $\mathcal{M}_0, \ldots, \mathcal{M}_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$.
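    Steps 1 and 2 can be sketched as follows, assuming NumPy and ordinary least squares with an intercept. The helper names `rss` and `best_subset_path` and the synthetic data are assumptions for illustration; the final model choice (step 3) via cross-validation or an information criterion is left out:

```python
from itertools import combinations
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit with intercept on the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def best_subset_path(X, y):
    """For every subset size k, return the best column subset and its RSS."""
    p = X.shape[1]
    path = {0: ((), rss(X, y, ()))}           # null model: intercept only
    for k in range(1, p + 1):
        best = min(combinations(range(p), k), key=lambda cols: rss(X, y, cols))
        path[k] = (best, rss(X, y, best))
    return path

# Synthetic data: only columns 0 and 2 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=200)
path = best_subset_path(X, y)
```

    The loop fits $2^p$ models in total, which is why the approach stops being practical once $p$ grows.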
  • F: Explain how forward stepwise selection works in 3 steps.
    • Intuition:
      Instead of searching through all possible subsets, we can seek a good path through them. Forward stepwise selection starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit. Like best subset regression, forward stepwise produces a sequence of models indexed by $k$, the subset size, which must be determined. (Hastie p. 59)
    • More formal description:
      1. Let $\mathcal{M}_0$ denote the null model, which contains no predictors.
      2. For $k = 0, \ldots, p-1$:
         a. Consider all $p-k$ models that augment the predictors in $\mathcal{M}_k$ with one additional predictor.
         b. Choose the best among these $p-k$ models, and call it $\mathcal{M}_{k+1}$. Here best is defined as having the smallest RSS or highest $R^2$.
      3. Select a single best model from among $\mathcal{M}_0, \ldots, \mathcal{M}_p$ using cross-validated prediction error, $C_p$, AIC, BIC, or adjusted $R^2$.
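    A minimal sketch of the greedy search, assuming NumPy and OLS with an intercept. The helper names and the synthetic data are hypothetical, and the final model choice among the returned path is not shown:

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit with intercept on the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_stepwise(X, y):
    """Return the sequence M_0, ..., M_p as (selected columns, RSS) pairs."""
    p = X.shape[1]
    selected = []
    path = [((), rss(X, y, ()))]              # M_0: intercept only
    for _ in range(p):
        remaining = [j for j in range(p) if j not in selected]
        # add the predictor that lowers the RSS the most
        best_j = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best_j)
        path.append((tuple(selected), rss(X, y, selected)))
    return path

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=200)
path = forward_stepwise(X, y)
```

    Only $1 + p(p+1)/2$ models are fitted here, compared with $2^p$ for best subset selection.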
  • F: Explain how backward stepwise selection works in 3 steps.
    • Backward stepwise selection starts with the full model and sequentially deletes the predictor that has the least impact on the fit. The candidate for dropping is the predictor whose removal increases the RSS the least, i. e. the variable with the smallest Z-score. (Hastie p. 60)
    • More formal definition:
      1. Let $\mathcal{M}_p$ denote the full model, which contains all $p$ predictors.
      2. For $k = p, p-1, \ldots, 1$:
         a. Consider all $k$ models that contain all but one of the predictors in $\mathcal{M}_k$, for a total of $k-1$ predictors.
         b. Choose the best among these $k$ models, and call it $\mathcal{M}_{k-1}$. Here best is defined as having the smallest RSS or highest $R^2$.
      3. Select a single best model from among $\mathcal{M}_0, \ldots, \mathcal{M}_p$ using cross-validated prediction error, $C_p$, AIC, BIC, or adjusted $R^2$.
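    The elimination loop can be sketched as follows, assuming NumPy and OLS with an intercept. The helper names and data are made up, and "least impact" is implemented here as the smallest RSS after removal:

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit with intercept on the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def backward_stepwise(X, y):
    """Return the sequence M_p, ..., M_0 as (selected columns, RSS) pairs."""
    selected = list(range(X.shape[1]))
    path = [(tuple(selected), rss(X, y, selected))]   # M_p: full model
    while selected:
        # drop the predictor whose removal leaves the smallest RSS
        drop = min(selected, key=lambda j: rss(X, y, [c for c in selected if c != j]))
        selected.remove(drop)
        path.append((tuple(selected), rss(X, y, selected)))
    return path

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=200)
path = backward_stepwise(X, y)
```

    Note that the full model must be estimable, so this variant needs more observations than predictors.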
  • F: Compare the best subset selection to forward selection.
    • Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model.
    • Best subset selection does not add predictors one at a time but chooses among all models containing exactly $k$ variables. The selected predictors might differ for different values of $k$.
    • Best subset selection becomes infeasible for a large number of variables ($\geq 40$). (Hastie p. 59)
  • F: Compare the subset selection methods forward and backward stepwise selection.
    • Backward stepwise selection is pretty much the inverse of forward stepwise selection.
  • F: When is it desirable to use backward stepwise selection and when is it desirable to use forward stepwise selection or best subset selection?
    • Computationally best subset selection is most demanding and becomes infeasible for a large number of features.
    • All three deliver similar results (See the comparison in Hastie p. 59)
  • F: What is an alternative to subset selection methods presented in the lecture?
    • Forward-stagewise regression
  • F: What are the reasons why shrinkage methods such as LASSO are preferred over subset selection methods such as best subset selection?
    • If the best subset can be found, it is indeed better than the LASSO in terms of selecting the variables that actually contribute to the fit.
    • In practice LASSO is still preferred, as it is computationally much easier to estimate, e. g. through the calculation of regularization paths using pathwise coordinate descent, whereas best subset selection is an NP-hard problem. (see here.)
    • TODO: The LASSO part is hard to follow
  • F: Explain how Linear Regression works.
    • Linear regression is a linear model, i. e. a model that assumes a linear relationship between the input variables $X$ and the single output variable $y$. $y$ is a linear combination of the input variables $x_1$ to $x_k$. To best describe the relationship between input and output variables, a line or hyperplane is fitted to the point cloud. (own)
    • The multiple linear regression model for the population is defined as:
      $$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon$$
      $\beta_0$ is the intercept, $\beta_1, \ldots, \beta_k$ are the regression coefficients of the $k$ independent variables, and $\epsilon$ is the error. $x$ are the input variables and $y$ is the dependent variable. We can present this equation in matrix notation:
      $$y = X\beta + \epsilon$$
  • F: What is the purpose of $\beta$ in a multiple linear regression model?
    • $\beta$ is a $(k+1)$-dimensional vector, where $\beta_0$ is the intercept and $\beta_1, \ldots, \beta_k$ are the regression coefficients of the $k$ independent variables.
  • F: Explain how an optimal estimate for $\beta$ can be derived.
    • A linear regression model has the best fit when the error term $\epsilon$ is minimal. To achieve this, the regression coefficients $\beta$ have to be estimated such that the error term is minimized. It is common to minimize the squared error $\|\epsilon\|_2^2$.
    • This leads to the following problem:
      $$\min_{\boldsymbol{\beta}} \sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j\right)^2 = \min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$$
    • TODO: formula for linear regression. TODO: minimization of the SSE?
    • Setting the gradient with respect to $\beta$ to zero gives the normal equations $X^{\intercal} X \beta = X^{\intercal} y$, which (for invertible $X^{\intercal} X$) can be solved for:
      $$\beta = \left(X^{\intercal} X\right)^{-1} X^{\intercal} y$$
      $$\epsilon = y - X\beta$$
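    A minimal numerical sketch of the closed-form estimate, assuming NumPy and a design matrix whose first column is all ones (the intercept). The data and the "true" coefficients are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), x])           # first column models the intercept
beta_true = np.array([1.0, 2.0, -3.0])         # made-up coefficients
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Solve the normal equations (X^T X) beta = X^T y instead of inverting X^T X explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat
```

    Solving the linear system is numerically preferable to forming the inverse. At the optimum the residuals are orthogonal to every column of `X`, which is exactly what the normal equations state.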
  • F: Does standard Linear Regression require scaling?
    • No, as multiplying $X_i$ by a constant $c$ leads to a scaling of the least squares coefficient estimates by a factor of $1/c$.
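    This can be checked numerically. The sketch below assumes NumPy; the helper `ols`, the data and the factor `c = 100` are made up for illustration:

```python
import numpy as np

def ols(X, y):
    """Least squares coefficients with an intercept in position 0."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * rng.normal(size=n)

c = 100.0
coef = ols(np.column_stack([x1, x2]), y)
coef_scaled = ols(np.column_stack([c * x1, x2]), y)   # first predictor rescaled by c
```

    Only the coefficient of the rescaled predictor changes (by the factor $1/c$); the fitted values stay the same. This invariance is specific to unpenalized least squares.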
  • F: Why do we optimize for the SSE?
    • It is fully differentiable
    • Easy to optimize
    • It also makes sense as:
      $$f^{*}(x) = \operatorname{argmin}_{f(x)} \mathrm{SSE} \Rightarrow f^{*}(x) = \mathbb{E}[y \mid x]$$
  • F: Give the definition for the SSE.
    • $SSE = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 = \sum_{i=1}^{n} e_i^2$
    • where $\hat{y}$ is the prediction for $y$.
  • F: Give the definition for the SSR.
    • $SSR = \sum_{i=1}^{n}\left(\hat{y}_i - \bar{y}\right)^2$
    • where $\hat{y}$ is the prediction for $y$ and $\bar{y}$ the observed mean.
  • F: Give the definition for the SST.
    • $SST = \sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$
    • where $\bar{y}$ is the observed mean of $y$.
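    A small sketch, assuming NumPy and synthetic data, that computes the three sums of squares for a least squares fit with intercept: SSE from the residuals $y_i - \hat{y}_i$, SSR from $\hat{y}_i - \bar{y}$, and SST from $y_i - \bar{y}$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 1.0 + 2.0 * x + rng.normal(size=50)

# OLS fit with intercept
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta

sse = np.sum((y - y_hat) ** 2)          # unexplained variation (residuals)
ssr = np.sum((y_hat - y.mean()) ** 2)   # variation explained by the regression
sst = np.sum((y - y.mean()) ** 2)       # total variation around the mean
```

    For least squares with an intercept the decomposition $SST = SSR + SSE$ holds, which is the basis for $R^2$.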
  • F: Give a graphical intuition for the SSE, SSR and SST.
  • F: Name two measures to test the goodness of fit of a linear regression model.
    • Total Sum of Squares (SST)
    • $R^2$
  • F: Write the definitions of $R^2$, adj. $R^2$, MAE, RMSE.
    • $R^2 = 1 - \frac{SSE}{SST}$
    • $\text{Adjusted } R^2 = 1 - \left(1 - R^2\right)\left(\frac{n-1}{n-k-1}\right)$
    • $\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$
    • $\text{MAE} = \frac{1}{n} \sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$
    • TODO: introduce $n$ and $k$.
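    The four measures can be sketched as plain functions, assuming NumPy, with `n` taken as the number of observations and `k` as the number of regressors. The function names and the tiny data set are made up:

```python
import numpy as np

def r2(y, y_hat):
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - sse / sst

def adjusted_r2(y, y_hat, k):
    n = len(y)
    return 1.0 - (1.0 - r2(y, y_hat)) * (n - 1) / (n - k - 1)

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

# SSE = 2, SST = 20, n = 4
print(r2(y, y_hat))               # 0.9
print(adjusted_r2(y, y_hat, 1))   # 0.85
print(rmse(y, y_hat))             # sqrt(0.5), about 0.707
print(mae(y, y_hat))              # 0.5
```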
  • F: Give an intuition for $R^2$, MAE and RMSE.
    • $R^2$ / adj. $R^2$: How well can my model explain the variance?
    • MAE: How does the model perform on average?
    • RMSE: How many large prediction deviations does the model have? (lecture BDA p. 43)
  • F: Compare $R^2$, adj. $R^2$ to MSE, MAE and RMSE. Name advantages and drawbacks.
    • MSE:
      • MSE is differentiable, which is important for finding optima. (+)
    • MAE:
      • The scale of MAE and RMSE depends on the scale of the dependent variable. (-)
      • MAE is not differentiable at zero. (-)
      • MAE is more robust to outliers. (+)
    • $R^2$:
      • The measure never decreases when new independent variables are added, which can lead to the addition of redundant variables to the model. (-)
  • F: Compare MAE to RMSE.
    • RMSE penalizes large errors more than MAE. This can be useful if being off by 10 is more than twice as bad as being off by 5. If, however, being off by 10 is just twice as bad as being off by 5, MAE should be preferred. (See here.)
  • F: Compare $R^2$, adj. $R^2$ to MAE, RMSE. Which of these are normed?
    • $R^2$ and adj. $R^2$ typically lie in $[0,1]$. TODO: they become negative if the prediction is worse than the mean (large SSE vs. small SST).
    • MAE and RMSE lie in $[0, \infty)$. Whether an RMSE is acceptable depends on the scale of the variables. (see here). RMSE and MAE are $0$ for models with a perfect fit.
  • F: In which way does the adjusted $R^2$ improve on the standard $R^2$?
    • A model might have a good fit in-sample but a poor fit out-of-sample if too many regressors are used.
    • Adj. $R^2$ is an $R^2$ that has been corrected by a penalty function and takes into account the number $k$ of regressors in the model.
  • F: Explain the three steps in fitting a regression model.
    • Specification:
      • Determine dependent and explanatory variables.
      • Exclude explanatory variables without predictive power.
      • Collect data for dependent and explanatory variables.
    • Fitting / Estimating:
      • Estimating regression coefficients.
    • Diagnosis:
      • Determine the quality of the regression model with e. g. $R^2$, adj. $R^2$, MSE and MAE.
      • Determine the model's significance and the significance of the regression coefficients.
      • Analyze the standard deviation of the regression errors.
  • F: Explain how one can test for the significance of a regression model. Give the $H_0$ and $H_1$ hypotheses for regression models.
    • $H_0$ states that all slope coefficients are equal to zero, which means none of the explanatory variables plays any role:
      $$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$$
      $H_1$ states that at least one slope coefficient is different from zero:
      $$H_1: \beta_j \neq 0 \text{ for at least one } j \in \{1, \ldots, k\}$$
  • F: Give an intuition for the Analysis of Variance (ANOVA) test.
    • The ANOVA test checks whether the means of two or more groups are equal.
    • The observation $x_{i,j}$, which is the $j$-th observation of the $i$-th group, can be decomposed into the grand mean $\bar{x}$, the between-groups deviation $\left(\bar{x}_i - \bar{x}\right)$, and the within-group deviation $\left(x_{i,j} - \bar{x}_i\right)$. One gets:
    • $x_{i,j} = \bar{x} + \left(\bar{x}_i - \bar{x}\right) + \left(x_{i,j} - \bar{x}_i\right)$
    • Now the $F$ statistic is simply the ratio of the between-groups variance and the within-group variance. (see here.)
  • F: How is the ANOVA test / $F$-test defined?
    • $F = \frac{SSR / k}{SSE / (n-k-1)} = \frac{MSR}{MSE},$
    • where $n$ is the sample size and $k$ is the number of slope parameters, so that the model has $k+1$ parameters including the intercept.
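    A minimal sketch of the overall $F$ statistic for a least squares fit, assuming NumPy and taking `k` as the number of regressors excluding the intercept. The function name and data are hypothetical:

```python
import numpy as np

def f_statistic(X, y):
    """Overall F statistic; X has shape (n, k) and contains no intercept column."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ beta
    ssr = np.sum((y_hat - y.mean()) ** 2)
    sse = np.sum((y - y_hat) ** 2)
    msr = ssr / k                  # mean square due to regression
    mse = sse / (n - k - 1)        # mean squared error of the residuals
    return msr / mse

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] + 0.5 * rng.normal(size=50)
F = f_statistic(X, y)
```

    Under the null hypothesis the statistic follows an $F$ distribution with $k$ and $n-k-1$ degrees of freedom, from which the $p$-value or the critical value is obtained.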
  • F: Explain how one can interpret the $F$-test.
    • If the $p$-value of the $F$-test is less than a significance level $\alpha$, the model does explain some variation of the dependent variable $y$.
    • One needs an $F$ table for the corresponding $\alpha$.
  • F: Explain what multicollinearity is.
    • Multicollinearity refers to the situation in which two or more explanatory variables in a multiple regression model are highly correlated.
    • Tests for multicollinearity are necessary after the model's significance and all significant independent variables have been determined. If strong multicollinearity is present, a change in one explanatory variable also comes with a change in another explanatory variable.
  • F: Name three possible indicators for multicollinearity.
    • Sensitivity of regression coefficients to the inclusion of additional explanatory variables
    • change from significance to insignificance after more explanatory variables have been added
    • An increase in the model’s standard error of the regression
  • F: How can one test for multicollinearity?
    • One can use the variance inflation factor (VIF)
  • F: Give the definition for the variance inflation factor.
    • To check the $j$-th variable for multicollinearity, one can calculate the VIF as follows:
    • The $j$-th variable is regressed on the remaining $k-1$ variables. The resulting regression looks like:
      $$x_j = c + b_1^{(j)} x_1 + \cdots + b_{j-1}^{(j)} x_{j-1} + b_{j+1}^{(j)} x_{j+1} + \cdots + b_k^{(j)} x_k, \quad j = 1, 2, \ldots, k$$
      Then we obtain the coefficient of determination of this regression, $R_j^2$, from which the VIF follows as $\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$.
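    A minimal sketch of this computation, assuming NumPy and the standard definition $\mathrm{VIF}_j = 1 / (1 - R_j^2)$. The function name `vif` and the synthetic, partly collinear data are assumptions for illustration:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for every column of X (shape (n, k))."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # regress the j-th variable on the remaining k-1 variables (with intercept)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2_j = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2_j)
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)   # nearly a copy of x1
x3 = rng.normal(size=500)              # unrelated to the other two
vifs = vif(np.column_stack([x1, x2, x3]))
```

    In this made-up example the first two variables get a very large VIF, while the independent third variable stays close to 1.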
  • F: What is the intuition of the Variance Inflation Factor?
    • The $j$-th variable is regressed on the remaining $k-1$ variables / features.
    • If the resulting $R_j^2$ is large, the remaining variables can explain the $j$-th variable, and so the resulting $\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$ will be large.
  • F: How can the VIF be interpreted?
    • As a rule of thumb, a VIF of 10 or more indicates a severe impact due to multicollinearity.
  • F: How can one test for linearity?
    • Plot the regression residuals on the vertical axis and the values of an explanatory variable on the horizontal axis. Repeat for every explanatory variable. If the errors are randomly scattered around zero, the model assumption is correct.
  • F: Why is it not desirable to use Linear Regression for default prediction?
    • In default prediction one searches for the probability of default $\operatorname{Pr}(\text{default} = \text{Yes} \mid \text{balance})$, which ranges between $0$ and $1$.
    • Fitting a line to a binary response variable (1 = default / 0 = non-default) could lead to estimates outside the $[0,1]$ interval, making them hard to interpret as probabilities, e. g. if they are negative.
    • Nevertheless, the predictions provide an ordering and can be interpreted as crude probability estimates.
  • F: Explain scenarios, where Ridge Regression would be preferred over LASSO.
    • Ridge only performs parameter shrinkage and no variable selection.
    • Ridge regression is preferred if one wants to insert some prior knowledge into the approach. With ridge, one has the ability to say that all features have at least some weight, even if it is very little (See here.)
  • F: Explain scenarios where LASSO is preferred over ridge regression.
    • As with ridge regression, the LASSO shrinks the coefficient estimates towards zero.
    • However, in the case of LASSO some coefficient estimates are forced to be exactly equal to zero (zeroed out) when the tuning parameter $\lambda$ is sufficiently large.
    • Therefore, LASSO automatically performs variable selection in addition to shrinking the parameters.
  • F: Name two approaches for shrinking regression coefficients towards zero.
    • ridge regression
    • LASSO
  • F: Explain what regularization is and why it is useful
  • F: Explain the ridge regression.
    • TODO: extension of linear regression
    • Ridge regression is a regularization approach. Regularization is used to prevent the coefficients from fitting the training data so perfectly that the model overfits. This is done by adding a penalty to the loss: a constant multiple $\lambda$ of a norm of the weight vector, which is referred to as the regularization term or shrinkage penalty. In the case of ridge regression it is the sum of the squared weights.
    • Taking this into account, one gets the following objective for ridge regression:
      $$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = RSS + \lambda \sum_{j=1}^{p} \beta_j^2$$
    • where $\lambda \geq 0$ is a tuning parameter, to be determined separately. The tuning parameter $\lambda$ controls the relative impact of these two terms on the regression coefficients. It should be selected using cross-validation.
    • Ridge regression still seeks coefficient estimates that fit the data well by making the RSS small.
    • The shrinkage penalty is small when $\beta_1, \ldots, \beta_p$ are close to zero, and so it has the effect of shrinking the estimates of $\beta_j$ towards zero. However, $\beta_0$ is left out of the penalty term, as penalizing the intercept would make the procedure depend on the origin chosen for $Y$. (Hastie p. 64)
    • As such:
      • Ridge regression yields non-sparse outputs, as coefficients are shrunk towards zero but never actually become 0.
      • It does not allow for feature selection. Same reasoning as above.
      • It is sometimes reported to yield better results than LASSO, although neither method dominates in general.
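    A minimal sketch of the ridge estimate, assuming NumPy. The intercept is kept out of the penalty by centering `X` and `y` first; the function name `ridge_fit`, the data and the values of `lam` are made up for illustration:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize RSS + lam * sum(beta_j^2) with an unpenalized intercept."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean                      # centering removes the intercept from the problem
    yc = y - y_mean
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y_mean - x_mean @ beta       # recover the intercept afterwards
    return beta0, beta

rng = np.random.default_rng(0)
X = 2.0 + rng.normal(size=(100, 3))
y = 5.0 + X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

b0_ols, b_ols = ridge_fit(X, y, lam=0.0)      # lam = 0 reproduces least squares
b0_mid, b_mid = ridge_fit(X, y, lam=10.0)
b0_big, b_big = ridge_fit(X, y, lam=1000.0)
```

    Increasing `lam` shrinks the coefficient vector, but no coefficient becomes exactly zero. In practice the predictors are usually standardized first, since the penalty is not scale invariant.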
  • F: Why is the intercept $\beta_0$ not part of the regularization term?
    • The intercept $\beta_0$ has been left out of the penalty term. Penalization of the intercept would make the procedure depend on the origin chosen for $Y$; that is, adding a constant $c$ to each of the targets $y_i$ would not simply result in a shift of the predictions by the same amount $c$. (Hastie p. 64)
    • Indeed, in the presence of an unpenalized intercept term, adding $c$ to all $y_i$ simply leads to $\beta_0$ increasing by $c$ as well, and correspondingly all predicted values $\hat{y}_i$ also increase by $c$. This is not true if the intercept is penalized: $\beta_0$ will have to increase by less than $c$. (see here.)
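    The effect can be illustrated numerically. The sketch below assumes NumPy; the function `ridge_predict`, the data, `lam = 5` and the shift `c = 10` are hypothetical choices:

```python
import numpy as np

def ridge_predict(X, y, lam, penalize_intercept):
    """In-sample ridge predictions, with or without a penalty on the intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    penalty = lam * np.eye(p + 1)
    if not penalize_intercept:
        penalty[0, 0] = 0.0              # entry 0 belongs to the intercept
    coef = np.linalg.solve(A.T @ A + penalty, A.T @ y)
    return A @ coef

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 5.0 + X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
c, lam = 10.0, 5.0

# How much do the predictions move when every target is shifted by c?
shift_unpenalized = ridge_predict(X, y + c, lam, False) - ridge_predict(X, y, lam, False)
shift_penalized = ridge_predict(X, y + c, lam, True) - ridge_predict(X, y, lam, True)
```

    Without a penalty on the intercept every prediction moves by exactly `c`. With a penalized intercept the predictions move by less than `c` on average, so the fit depends on the origin of the targets.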
  • F: Match $\ell_1$ norm, $\ell_2$ norm, ridge regression and LASSO to their counterparts.
    • ridge: $\ell_2$
    • LASSO: $\ell_1$
  • F: How is the $\ell_2$ norm defined?
    • $\|\beta\|_2 = \sqrt{\sum_{j=1}^{p} \beta_j^2}$
  • F: How is the $\ell_1$ norm defined?
    • $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$
  • F: Explain LASSO.
    • LASSO regression is a regularization approach. Regularization is used to prevent the coefficients from fitting the training data too perfectly. This is done by adding a penalty to the loss: a constant multiple $\lambda$ of a norm of the weight vector, which is referred to as the regularization term or shrinkage penalty. In the case of LASSO regression, it is the sum of the absolute weights.
    • Taking this into account, one gets the following objective for the LASSO:
      $$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = RSS + \lambda \sum_{j=1}^{p} |\beta_j|,$$
      where $\lambda \geq 0$ is a tuning parameter, to be determined separately. The tuning parameter $\lambda$ controls the relative impact of these two terms on the regression coefficients. It should be selected using cross-validation.
    • The LASSO still seeks coefficient estimates that fit the data well by making the RSS small.
    • The shrinkage penalty is small when $\beta_1, \ldots, \beta_p$ are close to zero, and so it has the effect of shrinking the estimates of $\beta_j$ towards zero. However, $\beta_0$ is left out of the penalty term. Some coefficient estimates are even forced to be exactly zero if $\lambda$ is sufficiently large.
    • As such:
      • Lasso regression yields sparse models. That is, models that involve only a subset of the variables.
      • Can be used for feature selection.
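    A minimal sketch of how such a sparse solution can be computed, assuming NumPy and cyclic coordinate descent with soft-thresholding on the objective $RSS + \lambda \sum_j |\beta_j|$. This is one possible solver, not necessarily the one used in the lecture; the function names, the data and `lam = 100` are made up:

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z towards zero by t and set it to zero if |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize RSS + lam * sum(|beta_j|) with an unpenalized intercept."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = np.sum(Xc ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: what is left when predictor j is taken out of the fit
            r_j = yc - Xc @ beta + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ r_j
            # the threshold is lam / 2 because the RSS carries no factor 1/2
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    beta0 = y_mean - x_mean @ beta
    return beta0, beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=200)

beta0_l, beta_l = lasso_coordinate_descent(X, y, lam=100.0)
beta0_0, beta_0 = lasso_coordinate_descent(X, y, lam=0.0)   # no penalty: least squares
```

    With the penalty switched on, the coefficients of the three noise columns are set exactly to zero and the two informative coefficients are shrunk, which is the feature selection behaviour described above.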
  • F: Explain the difference between LASSO and ridge regression?
    • Both are regularization approaches that prevent overfitting of an ordinary linear regression model and introduce smoothness to the model. This is done by adding a constant multiple of a norm of the weight vector to the loss, which prevents the coefficients from fitting the data so perfectly that they overfit.
    • Both are shrinkage methods that shrink the regression coefficients towards zero.
    • The difference between LASSO and ridge regression is the penalty that is added to the MSE or another loss function: for ridge it is the sum of the squared weights (squared $\ell_2$ norm), while for LASSO it is the sum of the absolute weights ($\ell_1$ norm).
    • LASSO:
      $$\underset{\beta}{\operatorname{minimize}} \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \leq s$$
    • Ridge regression:
      $$\underset{\beta}{\operatorname{minimize}} \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \leq s$$
    • The main difference is that, unlike subset selection, which generally selects models that involve just a subset of the variables, ridge regression includes all $p$ predictors in the final model. LASSO, however, zeroes out coefficients and can yield sparse models, whereas ridge regression yields non-sparse outputs and cannot be used for feature selection straight away.
    • However, in practice ridge regression is said to perform better than LASSO. (See here.) According to the script there is no clear tendency.
  • Visualization:
  • F: In practice, explain what is the main difference between ridge regression and LASSO.
    • Both are regularization approaches that prevent overfitting of an ordinary linear regression model and introduce smoothness to the model. This is done by adding a constant multiple of a norm of the weight vector to the loss, which prevents the coefficients from fitting the data so perfectly that they overfit.
    • Both are shrinkage methods that shrink the regression coefficients towards zero.
    • The difference between LASSO and ridge regression is the penalty that is added to the MSE or another loss function: for ridge it is the sum of the squared weights, while for LASSO it is the sum of the absolute weights.
    • TODO: formula