The following questions are intended for exam preparation using active recall and as a supplement to flashcards.
On the origin of the questions:
Questions from past exams are marked with a star (⭐).
Another part of the questions comes from the lecture Maschinelles Lernen (Grundverfahren). These questions are marked with a brain (🧠).
Questions from the University of Toronto are marked with a camp (🏕️).
Questions from UC Berkeley are marked with a firefighter (🧑🚒).
The remaining questions are my own or are interview questions.
Overview
Overview of the algorithms covered in the lecture
Big Data
F: What are the characteristics of big data? ⭐
volume
variety
velocity
veracity
value
F: Explain three characteristics of big data. ⭐
Volume refers to the sheer amount of data that is generated.
Variety refers to the diversity of types of data. Data can come in structured, semi-structured, or even unstructured types.
Velocity refers to the sheer speed at which data is generated (and processed).
Veracity refers to the quality of data or accuracy of the collected data. To resolve data quality issues one has to apply sophisticated pre-processing.
F: What is the difference between veracity and variety?
Veracity refers to the quality of data (e. g. noise in data), while variety refers to the types in which data can come (e. g. unstructured data). As data is often collected from different sources, both its types and its quality can differ.
ML vs. Statistics vs. Econometrics
F: Compare ML to Statistics. What are the most significant differences?
Statistics is:
based on hypothesis, then a collection of data and analysis
model-oriented with an emphasis on parametric models
focus on understanding and hypothesis testing
Whereas in Machine Learning:
there is seldom an a priori hypothesis
data is collected in advance
analysis is data-driven not hypothesis-driven
analysis is algorithm-oriented rather than model-oriented
focus lies on prediction
F: Compare ML to Econometrics. In which way do both differ?
Econometrics is:
concerned with causal inference and counterfactuals
mostly centred around linear regressions and complex structural models
standard errors are often reported after one run
Machine Learning is:
concerned about prediction
using all sorts of data-driven models e. g. Trees, NN, etc.
Structure of data / CRISP-DM / Taxonomy
F: What are the characteristics of unstructured data? Explain them. ⭐
Unstructured data is:
Nonnumeric: No predefined numeric representation for the constructs of interests. Requires manual or automatic coding prior to analysis.
Multifaceted: A single unit of unstructured data possesses multiple facets. Each facet provides unique information and serves different research goals. E. g. voice data carries information about the speaker, such as pitch and speech rate, so it can be used in both psychology and communication research.
Concurrent representation: The simultaneous presence of multiple facets in a single data unit, each providing unique information, makes it possible to represent different phenomena at the same time. One can study different research questions with a single unit of unstructured data.
F: What is 'structured data'?
Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyze.
Structured data conforms to a tabular format with a relationship between the different rows and columns.
F: What is unsupervised learning? ⭐
Observe data and construct a low-complexity description of the data.
That means in unsupervised learning the dataset that a dataset is transformed into is not previously known or understood. The data is not labeled. (Grokking p. 13)
We observe only the features $X_1, X_2,\ldots, X_P$. We are not interested in prediction, because we do not have an associated response variable $Y$.
F: What is supervised learning? ⭐
We observe both a set of features $X_1,X_2,\ldots,X_P$ for each object, as well as a response or outcome variable $Y$. The goal is then to predict $Y$ using $X_1, X_2,\ldots,X_P$.
Examples include regression and classification. (Clustering and PCA, by contrast, are unsupervised techniques.)
F: What are advantages / disadvantages of unsupervised learning techniques?
No labeled data required, which is often expensive and laborious. (+)
Adding labels to the data after clustering is often easier (+) (own)
Unsupervised techniques such as clustering help with data understanding of the raw data. (+) (own)
Unsupervised learning is more subjective, as there is no simple objective (-)
F: Name practical applications of unsupervised learning.
Subgroups of breast cancer patients grouped by their gene expression measurements
Groups of shoppers characterized by their browsing and purchase histories
Movies grouped by ratings assigned by movie raters
F: What is the goal of unsupervised learning?
The goal of unsupervised learning is to discover interesting things about the measurements, such as an informative way to visualize the data or subgroups among the variables or observations.
F: Give two examples for unsupervised learning techniques.
Clustering algorithms such as $k$-means
Dimensionality reduction techniques such as PCA
F: Give examples for structured/unstructured data.
Unstructured: (low degree of organization)
Video Data, as the video comes in different formats, compression ratios, sizes, where the video has to be transformed first to extract information from every single frame
Image Data, just like videos.
Structured: (high degree of organization)
Numeric secondary data e. g. sales figures, as they come in a standardized and easy-to-process format, e. g. float with $x$ decimal places
Categorical data e. g. gender, as there are predefined formats
F: Give a brief explanation of categorical, binary, ordinal, and numeric variables.
categorical/nominal: Names of things or symbols.
binary: A nominal variable with two categories or states: 0 or 1.
ordinal: Ordinal variables have a meaningful order or ranking among them, but the magnitude between successive values is not known.
numeric: A quantitative variable. Numeric variables could be interval-scaled or ratio-scaled.
F: Which steps are part of the CRISP-DM model? Explain them in-depth.
1. Business understanding, i. e. developing an understanding of the business objectives and the requirements of the data mining project
2. Data understanding, i. e. identifying and collecting the data set needed to fulfill the business goals
3. Data preparation, i. e. preparing the data for modeling
4. Modeling, i. e. building several models and assessing them on a technical level
5. Evaluation, i. e. evaluating whether the models help achieve the business goals, and planning the next steps
6. Deployment, i. e. deploying the model to production and making it accessible to customers
F: Explain common techniques for data gathering.
Bulk downloads: Downloading large amounts of data. Often done using sophisticated software.
APIs: Accessing data through machine-readable interfaces. Examples include Google Maps API.
Web Scraping: Extraction of data from websites. Often done using bots and web crawlers or manually.
F: Why is it desirable to work on normalized data?
Some algorithms require normalized data, such as $k$-means clustering, which is 'isotropic' in all directions of space and therefore tends to produce more or less round clusters. Without standardization, variables with a larger variance dominate the Euclidean distance, so the clusters are mainly separated along those variables. (See here.)
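As a small illustration of the point above, the following sketch standardizes two features to z-scores using only the Python standard library. The feature names and values are hypothetical, and the use of the population standard deviation is an assumption (using the sample standard deviation would work the same way).

```python
from statistics import mean, pstdev

def standardize(column):
    """Return z-scores: subtract the mean, divide by the standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation (assumption)
    return [(value - mu) / sigma for value in column]

# Hypothetical features on very different scales.
income = [30_000.0, 45_000.0, 60_000.0, 75_000.0]  # large variance
age = [25.0, 35.0, 45.0, 55.0]                     # small variance

income_z = standardize(income)
age_z = standardize(age)

# After standardization both features have mean 0 and standard deviation 1,
# so neither dominates a Euclidean distance.
print(income_z)
print(age_z)
```

On the raw scale, the distance between two observations would be driven almost entirely by income. After standardization both features contribute comparably.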
F: Explain common techniques to analyze the relationship between variables.
A scatter plot (or scatter diagram) is used to show the relationship between variables
Bar plot for high dimensional data
Mean graph for categorical data
Correlation analysis
F: How can missing data be replaced? Explain.
mean-based imputation: i. e. mean is calculated from all observations
median-based imputation: Same as above but with median.
stratified imputation: i. e. the categories / structure of the data are considered for replacements. E. g. a missing height is replaced by a different value for men than for women.
regressed imputation: i. e. replacing missing values by predictions of a regression model
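The first three strategies above can be sketched in a few lines. This is a minimal illustration under assumptions: the heights, the group labels and the helper `impute` are hypothetical, and `None` is used as the marker for a missing value.

```python
from statistics import mean, median

def impute(values, fill):
    """Replace None entries with the given fill value."""
    return [fill if v is None else v for v in values]

# Hypothetical heights in cm; None marks a missing value.
heights = [180.0, None, 165.0, 175.0, None, 160.0]
groups = ["m", "m", "f", "m", "f", "f"]

observed = [h for h in heights if h is not None]
mean_imputed = impute(heights, mean(observed))      # mean-based imputation
median_imputed = impute(heights, median(observed))  # median-based imputation

# Stratified imputation: use the mean of the observation's own group.
group_means = {
    g: mean(h for h, hg in zip(heights, groups) if hg == g and h is not None)
    for g in set(groups)
}
stratified = [group_means[g] if h is None else h for h, g in zip(heights, groups)]

print(mean_imputed)
print(stratified)
```

Regressed imputation follows the same pattern, with the fill value coming from a regression model fitted on the observed rows.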
F: Explain 3 patterns in which missing data can occur.
Missing completely at random / MCAR: Missing values have no pattern and cannot be predicted.
Missing at random / MAR: Missing values can be predicted using other data available for the observation. Assign a categorical value.
Latent, yet unknown variable: The missing value depends on a latent and highly correlated variable.
F: What is a training, test, and validation set for?
Training set is used to fit all potential models
Validation set is used to select hyperparameters of a model
Test set is used to estimate the predictive power of a model on unseen data
F: What is the risk with tuning hyperparameters using a test dataset? 🧑🚒
Tuning model hyperparameters to a test set means that the hyperparameters may overfit that test set. If the same test set is used to estimate performance, it will produce an overestimate. Using a separate validation set for tuning and a test set for measuring performance provides an unbiased, realistic measurement of performance. (Berkeley p. 14)
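One possible way to obtain three disjoint sets is sketched below. The function name `train_val_test_split`, the 60/20/20 proportions and the fixed seed are assumptions for illustration, not something prescribed by the lecture.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows once and cut them into three disjoint parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed for reproducibility
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]                    # touched only once, at the very end
    val = rows[n_test:n_test + n_val]       # used for hyperparameter selection
    train = rows[n_test + n_val:]           # used for fitting
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

Hyperparameters are compared on `val` only. `test` is evaluated a single time for the final model, which keeps the performance estimate unbiased.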
LR, Ridge, and Lasso
F: Explain how best subset selection works in 3 steps.
1. Let $\mathcal{M}_{0}$ denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
2. For $k=1,2, \ldots, p$:
a. Fit all $\binom{p}{k}$ models that contain exactly $k$ predictors.
b. Pick the best among these $\binom{p}{k}$ models, and call it $\mathcal{M}_{k}$. Here best is defined as having the smallest RSS, or equivalently the largest $R^{2}$.
3. Select a single best model from among $\mathcal{M}_{0}, \ldots, \mathcal{M}_{p}$ using cross-validated prediction error, $C_{p}$ (AIC), BIC, or adjusted $R^{2}$.
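Steps 1 and 2 can be sketched in plain Python as follows. The helper `rss` (ordinary least squares via the normal equations and Gauss-Jordan elimination) and the toy data are assumptions made for illustration. Step 3, the final choice among the $\mathcal{M}_k$, is left out because it needs cross-validation or an information criterion.

```python
from itertools import combinations

def rss(columns, y):
    """Residual sum of squares of an OLS fit (with intercept) on the given columns."""
    n = len(y)
    A = [[1.0] + [col[i] for col in columns] for i in range(n)]  # design matrix
    m = len(A[0])
    # Augmented normal equations (A^T A | A^T y).
    M = [[sum(A[i][r] * A[i][c] for i in range(n)) for c in range(m)]
         + [sum(A[i][r] * y[i] for i in range(n))] for r in range(m)]
    for c in range(m):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(m):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    beta = [M[r][m] / M[r][r] for r in range(m)]
    return sum((y[i] - sum(b * a for b, a in zip(beta, A[i]))) ** 2 for i in range(n))

def best_subset(columns, y):
    """Return {k: (best index tuple, RSS)} for k = 0..p (steps 1 and 2)."""
    p = len(columns)
    best = {0: ((), rss([], y))}  # null model M_0 predicts the sample mean
    for k in range(1, p + 1):
        best[k] = min(
            ((idx, rss([columns[j] for j in idx], y)) for idx in combinations(range(p), k)),
            key=lambda pair: pair[1],
        )
    return best

# Hypothetical data: y = 1 + 2*x1 + 0.5*x3, x2 is irrelevant.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.0, 1.0, -1.0, -1.0, -1.0, -1.0, 1.0, 1.0]
x3 = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
y = [1 + 2 * a + 0.5 * c for a, c in zip(x1, x3)]

best = best_subset([x1, x2, x3], y)
for k, (idx, value) in best.items():
    print(k, idx, round(value, 6))
```

The RSS can only go down as $k$ grows, which is why step 3 needs a criterion that penalizes model size or estimates the test error.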
F: Explain how forward stepwise selection works in 3 steps.
Intuition:
Instead of searching through all possible subsets, we can seek a good path through them. Forward stepwise selection starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit. Like best subset regression, forward stepwise produces a sequence of models indexed by $k$, the subset size, which must be determined. (Hastie p. 59)
More formal description:
1. Let $\mathcal{M}_{0}$ denote the null model, which contains no predictors.
2. For $k=0, \ldots, p-1$:
a. Consider all $p-k$ models that augment the predictors in $\mathcal{M}_{k}$ with one additional predictor.
b. Choose the best among these $p-k$ models, and call it $\mathcal{M}_{k+1}$. Here best is defined as having the smallest RSS or highest $R^{2}$.
3. Select a single best model from among $\mathcal{M}_{0}, \ldots, \mathcal{M}_{p}$ using cross-validated prediction error, $C_{p}$, AIC, BIC, or adjusted $R^{2}$.
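The greedy loop of step 2 can be sketched as follows. The helper `rss` (a small pure-Python least-squares fit) and the toy data are assumptions for illustration. The sketch returns the whole path of models and leaves the final choice of $k$ (step 3) open.

```python
def rss(columns, y):
    """Residual sum of squares of an OLS fit (with intercept) on the given columns."""
    n = len(y)
    A = [[1.0] + [col[i] for col in columns] for i in range(n)]  # design matrix
    m = len(A[0])
    # Augmented normal equations (A^T A | A^T y).
    M = [[sum(A[i][r] * A[i][c] for i in range(n)) for c in range(m)]
         + [sum(A[i][r] * y[i] for i in range(n))] for r in range(m)]
    for c in range(m):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(m):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    beta = [M[r][m] / M[r][r] for r in range(m)]
    return sum((y[i] - sum(b * a for b, a in zip(beta, A[i]))) ** 2 for i in range(n))

def forward_stepwise(columns, y):
    """Return [(added predictor index, RSS after adding it), ...] in order of entry."""
    selected, path = [], []
    remaining = list(range(len(columns)))
    while remaining:
        # Try each of the p - k remaining predictors and keep the best one.
        scores = {j: rss([columns[i] for i in selected + [j]], y) for j in remaining}
        best_j = min(scores, key=scores.get)
        selected.append(best_j)
        remaining.remove(best_j)
        path.append((best_j, scores[best_j]))
    return path

# Hypothetical data: y = 1 + 2*x1 + 0.5*x3, x2 is irrelevant.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.0, 1.0, -1.0, -1.0, -1.0, -1.0, 1.0, 1.0]
x3 = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
y = [1 + 2 * a + 0.5 * c for a, c in zip(x1, x3)]

path = forward_stepwise([x1, x2, x3], y)
print([j for j, _ in path])
```

Because a predictor is never removed once it has entered, the models on the path are nested. That is the main structural difference to best subset selection.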
F: Explain how backward stepwise selection works in 3 steps.
Backward stepwise selection starts with the full model and sequentially deletes the predictor that has the least impact on the fit. The candidate for dropping is the predictor whose removal increases the RSS the least, i. e. lowers $R^2$ the least. (Hastie p. 60)
More formal definition:
1. Let $\mathcal{M}_{p}$ denote the full model, which contains all $p$ predictors.
2. For $k=p, p-1, \ldots, 1$:
a. Consider all $k$ models that contain all but one of the predictors in $\mathcal{M}_{k}$, for a total of $k-1$ predictors.
b. Choose the best among these $k$ models, and call it $\mathcal{M}_{k-1}$. Here best is defined as having the smallest RSS or highest $R^{2}$.
3. Select a single best model from among $\mathcal{M}_{0}, \ldots, \mathcal{M}_{p}$ using cross-validated prediction error, $C_{p}$, AIC, BIC, or adjusted $R^{2}$.
F: Compare the best subset selection to forward selection.
Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model.
Best Subset Selection does not add predictors one-at-a-time but chooses from all models containing exactly $k$ variables. The selected predictors might be different for different values of $k$.
Best Subset Selection becomes infeasible for a large number of variables ($\geq 40$). (Hastie p. 59)
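The computational gap can be made concrete by counting the models each procedure has to fit. The two function names below are hypothetical. The counts follow directly from the step descriptions: best subset fits every subset of the $p$ predictors, forward stepwise fits the null model plus $p-k$ candidates in step $k$.

```python
from math import comb

def n_models_best_subset(p):
    """All subsets of p predictors, including the null model: sum_k C(p, k) = 2^p."""
    return sum(comb(p, k) for k in range(p + 1))

def n_models_forward_stepwise(p):
    """Null model plus p - k candidates for k = 0..p-1: 1 + p(p+1)/2."""
    return 1 + sum(p - k for k in range(p))

for p in (10, 20, 40):
    print(p, n_models_best_subset(p), n_models_forward_stepwise(p))
```

For $p = 40$ this is roughly $1.1 \cdot 10^{12}$ fits for best subset selection versus 821 for forward stepwise selection.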
F: Compare the subset selection methods forward and backward stepwise selection.
Backward stepwise selection is pretty much the inverse of forward stepwise selection.
F: When is it desirable to use backward stepwise selection and when is it desirable to use forward stepwise selection or best subset selection?
Computationally best subset selection is most demanding and becomes infeasible for a large number of features.
All three deliver similar results (See the comparison in Hastie p. 59)
F: What is an alternative to subset selection methods presented in the lecture?
Forward-stagewise regression
F: What are the reasons why shrinkage methods such as LASSO are preferred over subset selection methods such as best subset selection?
If the best subset can be found, it is indeed better than the LASSO in terms of selecting the variables that actually contribute to the fit.
In practice LASSO is still preferred as it is computationally much easier to estimate, e. g. through the calculation of regularization paths using pathwise coordinate descent, whereas best subset selection is an NP-hard problem. (see here.)
TODO: The LASSO part is hard to follow
F: Explain how Linear Regression works.
Linear regression is a linear model, i. e. a model that assumes a linear relationship between the input variables $X$ and the single output variable $y$. $y$ is a linear combination of the input variables $x_1$ to $x_k$. To best describe the relationship between input variables and output variable, a line or hyperplane is fitted to the point cloud. (own)
The multiple linear regression model for the population is defined as:
$y=\beta_{0}+\beta_{1} x_{1}+\cdots+\beta_{k} x_{k}+\varepsilon$
RMSE: How many large prediction deviations does the model have? (lecture BDA p. 43)
F: Compare $R^2$, adj. $R^2$ to MSE, MAE and RMSE. Name advantages and drawbacks.
MSE:
MSE is differentiable, which is important for finding optima. (+)
MAE:
The scale of MAE and RMSE depends on the scale of the dependent variable. (-)
MAE is not differentiable at zero. (-)
MAE is more robust to outliers. (+)
$R^2$:
The measure never decreases when new independent variables are added, which can lead to the addition of redundant variables to the model. (-)
F: Compare MAE to RMSE.
RMSE penalizes large errors more than MAE. This can be useful if being off by 10 is more than twice as bad as being off by 5. If, however, being off by 10 is just twice as bad as being off by 5, MAE should be preferred. (See here.)
F: Compare $R^2$, adj. $R^2$ to MAE, RMSE. Which of these are normed?
$R^2$ and adj. $R^2$ usually lie in $[0,1]$. TODO: worse than 0 if the prediction is worse than the mean / large SSE vs. small SST.
MAE and RMSE lie in $[0, \infty)$. Whether an RMSE is acceptable depends on the scale of the variables. (see here). RMSE and MAE are $0$ for models with a perfect fit.
F: In which way does the adjusted $R^2$ improve on the standard $R^2$?
A model might have a good fit in-sample but a poor fit out-of-sample if too many regressors are used.
Adj. $R^2$ is an $R^2$ that has been corrected by a penalty term which takes into account the number $k$ of regressors in the model.
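The metrics compared in the preceding questions can be computed in a few lines. This is a sketch under assumptions: the function name `regression_metrics` and the toy data are hypothetical, and the adjusted $R^2$ uses the common form $1-(1-R^2)\frac{n-1}{n-k-1}$, which may differ in notation from the lecture.

```python
from math import sqrt

def regression_metrics(y, y_hat, k):
    """Return MSE, RMSE, MAE, R^2 and adjusted R^2 for a model with k regressors."""
    n = len(y)
    y_bar = sum(y) / n
    sse = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # residual sum of squares
    sst = sum((a - y_bar) ** 2 for a in y)             # total sum of squares
    mse = sse / n
    r2 = 1 - sse / sst
    return {
        "MSE": mse,
        "RMSE": sqrt(mse),
        "MAE": sum(abs(a - b) for a, b in zip(y, y_hat)) / n,
        "R2": r2,
        "adj_R2": 1 - (1 - r2) * (n - 1) / (n - k - 1),
    }

# Hypothetical example: y_hat is the OLS fit of y on x = 1..5 (y_hat = 2.2 + 0.6 x).
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]
metrics = regression_metrics(y, y_hat, k=1)
print(metrics)
```

Here $R^2 = 0.6$ while the adjusted $R^2$ is about $0.47$: with only five observations, even a single regressor is penalized noticeably.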
F: Explain the three steps in fitting a regression model.
Specification:
Determine dependent and explanatory variables.
Exclude explanatory variables without predictive power.
Collect data for dependent and explanatory variables.
Fitting / Estimating:
Estimating regression coefficients.
Diagnosis:
Determine the quality of the regression model with e. g. $R^2$, adj. $R^2$, MSE and MAE.
Determine the model's significance and the significance of the regression coefficients.
Analyze the standard deviation of the regression errors.
F: Explain how one can test for the significance of a regression model. Give the $H_{0}$ and $H_{1}$ hypotheses for regression models.
$H_{0}$ states that all slope coefficients are equal to zero, which means none of the explanatory variables play any role.
$H_{0}: \beta_{1}=\beta_{2}=\cdots=\beta_{k}=0$
$H_{1}$ states that at least one slope coefficient is different from zero.
$H_{1}: \beta_{j}\neq0 \text { for at least one } j \in\{1, \ldots, k\}$
F: Give an intuition for the Analysis of Variance (ANOVA) test.
The ANOVA test checks whether the means of two or more groups are equal.
The observation $x_{i,j}$, which is the $j$-th observation of the $i$-th group, can be decomposed into a between-groups part and a within-group part. The $F$ statistic is simply the ratio of the between-groups variance and the within-group variance. (see here.) (see here.)
F: How is the ANOVA test / $F$-test defined?
$F=\frac{\frac{S S R}{k}}{\frac{S S E}{n-k-1}}=\frac{M S R}{M S E},$
where $n$ is the sample size, $k$ the number of slope parameters (explanatory variables), and $k+1$ the number of parameters in the model including the intercept.
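The statistic can be computed directly from the observed and fitted values. The sketch below assumes that $k$ counts the explanatory variables (excluding the intercept), as the degrees of freedom $k$ and $n-k-1$ in the formula suggest. The function name and the toy data are hypothetical.

```python
def f_statistic(y, y_hat, k):
    """F = (SSR / k) / (SSE / (n - k - 1)) for a regression with k explanatory variables."""
    n = len(y)
    y_bar = sum(y) / n
    ssr = sum((f - y_bar) ** 2 for f in y_hat)         # explained sum of squares
    sse = sum((a - f) ** 2 for a, f in zip(y, y_hat))  # residual sum of squares
    msr = ssr / k
    mse = sse / (n - k - 1)  # note: divides by n - k - 1, not by n
    return msr / mse

# Hypothetical example: y_hat is the OLS fit of y on x = 1..5 (y_hat = 2.2 + 0.6 x).
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]
F = f_statistic(y, y_hat, k=1)
print(F)
```

With $SSR = 3.6$ and $SSE = 2.4$ this gives $F = 3.6 / 0.8 = 4.5$. Turning $F$ into a $p$-value requires the $F$ distribution with $(k, n-k-1)$ degrees of freedom, which the Python standard library does not provide.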
F: Explain how one can interpret the $F$-test.
If the $p$-value of the $F$-test is less than a significance level $\alpha$, the model does explain some variation of the dependent variable $y$.
One needs to have an $F$ table for the corresponding $\alpha$.
F: Explain what multicollinearity is.
Multicollinearity refers to the situation in which two or more explanatory variables in a multiple regression model are highly correlated.
Tests for multicollinearity are necessary after the model's significance has been determined and all significant independent variables have been identified, because if strong multicollinearity is present, a change in one explanatory variable will also lead to a change in another explanatory variable.
F: Name three possible indicators for multicollinearity.
Sensitivity of regression coefficients to the inclusion of additional explanatory variables
change from significance to insignificance after more explanatory variables have been added
An increase in the model’s standard error of the regression
F: How can one test for multicollinearity?
One can use the variance inflation factor (VIF)
F: Give the definition for the variance inflation factor.
To check the $j$-th variable for multicollinearity, one can calculate the VIF as follows:
The $j$-th variable is regressed on the remaining $k-1$ variables. The resulting regression would look like:
$x_{j}=\alpha_{0}+\sum_{l \neq j} \alpha_{l} x_{l}+\varepsilon$
Then we obtain the coefficient of determination of this regression, $R_{j}^{2}$. The VIF is then $\mathrm{VIF}_{j}=\frac{1}{1-R_{j}^{2}}$.
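Given $R_j^2$ from the auxiliary regression, the VIF is commonly defined as $1/(1-R_j^2)$. The sketch below applies this definition. To stay self-contained it covers only the special case of two explanatory variables, where $R_j^2$ equals the squared correlation between them. The function names and the data are hypothetical.

```python
def vif(r2_j):
    """Variance inflation factor from the R^2 of the auxiliary regression."""
    return 1.0 / (1.0 - r2_j)

def r2_simple(x, z):
    """R^2 of regressing x on a single other variable z (squared correlation)."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    sxx = sum((a - mx) ** 2 for a in x)
    szz = sum((b - mz) ** 2 for b in z)
    return sxz ** 2 / (sxx * szz)

# Hypothetical explanatory variables.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 4.0, 5.0, 4.0, 5.0]

print(vif(r2_simple(x1, x2)))  # moderate correlation between x1 and x2
print(vif(0.0))                # uncorrelated variables give the minimum VIF of 1
print(vif(0.9))                # R_j^2 = 0.9 corresponds to a VIF of about 10
```

The last line shows where a threshold of 10 comes from: it corresponds to 90 % of the variance of the $j$-th variable being explained by the other variables.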
F: What is the intuition of the Variance Inflation Factor?
The $j$-th variable is regressed on the remaining $k-1$ variables / features.
If $R_j^2$ is large, the remaining variables can explain the $j$-th variable well, and so the resulting VIF will be large.
TODO: Formula!
F: How can VIF be interpreted?
A VIF of 10 or more is commonly taken to indicate a severe impact due to multicollinearity.
F: How can one test for linearity?
Plot the regression residuals on the vertical axis and the values of an explanatory variable on the horizontal axis. Repeat for every explanatory variable. If the errors are randomly scattered around zero, the model assumption is correct.
Fitting a line to a binary response variable (1 = default / 0 = non-default) could lead to estimates outside the $[0,1]$ interval, making them hard to interpret as probabilities, e. g. if they are negative.
Nevertheless, the predictions provide an ordering and can be interpreted as crude probability estimates.
F: Explain scenarios, where Ridge Regression would be preferred over LASSO.
Ridge only performs parameter shrinkage and no variable selection.
Ridge regression is preferred if one wants to insert some prior knowledge into the approach. With ridge, one has the ability to say that all features have at least some weight, even if it is very little (See here.)
F: Explain scenarios where LASSO is preferred over Ridge Regression.
As with ridge regression, the LASSO shrinks the coefficient estimates towards zero.
However, in the case of LASSO some coefficient estimates are forced to be exactly equal to zero (zeroed out) when the tuning parameter $\lambda$ is sufficiently large.
Therefore, LASSO performs both variable selection and parameter shrinkage automatically.
F: Name two approaches for shrinking regression coefficients towards zero.
ridge regression
LASSO
F: Explain what regularization is and why it is useful
Ridge regression is a regularization approach. Regularization is used to prevent the coefficients from fitting the training data so perfectly that the model overfits. This is done by adding a constant multiple of a norm of the weight vector to the loss, which is referred to as a regularization term or shrinkage penalty. In the case of ridge regression, it is the sum of the squared weights.
Taking this into account, ridge regression minimizes the following objective:
$\sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\sum_{j=1}^{p} \beta_{j} x_{i j}\right)^{2}+\lambda \sum_{j=1}^{p} \beta_{j}^{2}=R S S+\lambda \sum_{j=1}^{p} \beta_{j}^{2}$
where $\lambda \geq 0$ is a tuning parameter, to be determined separately. The tuning parameter $\lambda$ controls the relative impact of these two terms on the regression coefficients. It should be selected using cross-validation.
Ridge regression still seeks coefficient estimates that fit the data well by making the RSS small.
The shrinkage penalty is small when $\beta_1,\cdots, \beta_p$ are close to zero, and so it has the effect of shrinking the estimates of $\beta_j$ towards zero. However, $\beta_0$ is left out of the penalty term, as penalizing the intercept would make the procedure depend on the origin chosen for $Y$. (Hastie p. 64)
As such:
ridge regression yields non-sparse outputs, as coefficients are shrunk towards zero but never become exactly 0.
doesn't allow for feature selection. Same reasoning as above.
Often reported to yield better results than LASSO, although there is no clear tendency in general.
F: Why is the intercept $\beta_0$ not part of the regularization term?
The intercept $\beta_0$ has been left out of the penalty term. Penalization of the intercept would make the procedure depend on the origin chosen for $Y$; that is, adding a constant $c$ to each of the targets $y_i$ would not simply result in a shift of the predictions by the same amount $c$. (Hastie p. 64)
Indeed, in the presence of an unpenalized intercept term, adding a constant $c$ to all $y_i$ simply increases $\beta_0$ by $c$, so that all predictions shift by $c$ as well.
Lasso regression is a regularization approach. Regularization is used to prevent the coefficients from fitting the training data too perfectly. This is done by adding a constant multiple of a norm of the weight vector to the loss, which is referred to as a regularization term or shrinkage penalty. In the case of LASSO regression, it is the sum of the absolute weights.
Taking this into account, LASSO minimizes the following objective:
$\sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\sum_{j=1}^{p} \beta_{j} x_{i j}\right)^{2}+\lambda \sum_{j=1}^{p} |\beta_{j}|=R S S+\lambda \sum_{j=1}^{p} |\beta_{j}|,$
where $\lambda \geq 0$ is a tuning parameter, to be determined separately. The tuning parameter $\lambda$ controls the relative impact of these two terms on the regression coefficients. It should be selected using cross-validation.
LASSO still seeks coefficient estimates that fit the data well by making the RSS small.
The shrinkage penalty is small when $\beta_1,\cdots, \beta_p$ are close to zero, and so it has the effect of shrinking the estimates of $\beta_j$ towards zero. However, $\beta_0$ is left out of the penalty term. Some coefficient estimates are even forced to be exactly zero if $\lambda$ is sufficiently large.
As such:
Lasso regression yields sparse models. That is, models that involve only a subset of the variables.
Can be used for feature selection.
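The different behaviour of the two penalties can be seen in the simplest possible setting. The sketch below assumes a single predictor with centered $x$ and $y$, so that the intercept is zero and can be dropped. Under this assumption both objectives above have closed-form minimizers: ridge gives $\hat\beta = \frac{\sum x_i y_i}{\sum x_i^2 + \lambda}$, and LASSO gives a soft-thresholded version of the least-squares estimate. The function names and the data are hypothetical.

```python
def ridge_coef(x, y, lam):
    """Minimizer of sum (y_i - b*x_i)^2 + lam * b^2 for centered x and y."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

def lasso_coef(x, y, lam):
    """Minimizer of sum (y_i - b*x_i)^2 + lam * |b| (soft-thresholding)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)  # zero once lam >= 2 * |sxy|
    return (1 if sxy > 0 else -1) * shrunk / sxx

x = [-2.0, -1.0, 0.0, 1.0, 2.0]  # centered predictor
y = [-2.0, 0.0, 1.0, 0.0, 1.0]   # centered response

for lam in (0.0, 4.0, 20.0, 100.0):
    print(lam, ridge_coef(x, y, lam), lasso_coef(x, y, lam))
```

For $\lambda = 0$ both return the least-squares slope $0.6$. As $\lambda$ grows, the ridge coefficient gets smaller but stays positive, while the LASSO coefficient is exactly zero from $\lambda = 12$ onwards.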
F: Explain the difference between LASSO and ridge regression? ⭐
Both are regularization approaches that prevent overfitting of an ordinary linear regression model and introduce smoothness to the model. This is done by adding a constant multiple of a norm of the weight vector to the loss, which prevents the coefficients from fitting the training data so perfectly that they overfit.
Both are shrinkage methods that shrink the regression coefficients towards zero.
The difference between LASSO and ridge regression lies in the penalty: ridge adds the sum of the squared weights, while LASSO adds the sum of the absolute weights to the MSE or another loss function.
The main practical difference is that, unlike subset selection, which generally selects models that involve just a subset of the variables, ridge regression includes all $p$ predictors in the final model. LASSO, however, zeroes out coefficients and can yield sparse feature spaces, whereas ridge regression yields non-sparse outputs and cannot be used for feature selection straight away.
However, in practice ridge regression often performs better than LASSO. (See here.) According to the script there is no clear tendency.
Visualization:
F: In practice, explain what is the main difference between ridge regression and LASSO. ⭐
Both are regularization approaches that prevent overfitting of an ordinary linear regression model and introduce smoothness to the model. This is done by adding a constant multiple of a norm of the weight vector to the loss, which prevents the coefficients from fitting the training data so perfectly that they overfit.
Both are shrinkage methods that shrink the regression coefficients towards zero.
The difference between LASSO and ridge regression lies in the penalty: ridge adds the sum of the squared weights, while LASSO adds the sum of the absolute weights to the MSE or another loss function.
TODO: formula
F: Explain what is the difference between Linear Regression and LASSO? (8 points) ⭐
$\lambda$ is a tuning parameter that controls the relative impact of these two terms on the regression coefficients.
As can be seen above, linear regression is the most basic form. LASSO includes another term, the so-called regularization term or shrinkage penalty. More on that later. The standard linear regression doesn't penalize the choice of weights and doesn't include a regularization term.