⁉

# Selbsttestfragen

Nachfolgende Fragen eigenen sich zur Prüfungsvorbereitung mittels

**Active Recall**gedacht und eine Ergänzung zu Karteikarten.Offizielle Fragen sind mit 🧠 markiert. Probeklausurfragen mit einem 🦧. Fragen der University of Stanford mit 🧑🚒. Eigene Fragen haben keine Markierung.

*Under which assumptions is the least-squares objective from linear regression equivalent to a maximum likelihood objective?🦧*- Gaussian likelihood with linear mean and constant noise.

- What is the role of the lambda factor in the objective of ridge regression and how is it determined? 🦧
- $$\lambda$ controls the degree of penalization. It is determined through cross-validation.

- Write down the objective of linear binary logistic regression. The samples are given by$x_{i}$and the labels by$c_{i} \in\{0,1\}$. How is$p\left(c_{i} \mid \boldsymbol{x}_{i}\right)$assumed to be distributed in binary logistic regression?🦧

$\operatorname{argmax}_{\boldsymbol{w}} \sum_{i=1}^{N} c_{i} \log \left(\sigma\left(\boldsymbol{w}^{T} \boldsymbol{x}_{i}\right)\right)+\left(1-c_{i}\right) \log \left(1-\sigma\left(\boldsymbol{w}^{T} \boldsymbol{x}_{i}\right)\right)$

- Logistic regression assumes$p\left(c_{i} \mid x_{i}\right)$to be Bernoulli distributed.
- Explain the difference between generative and discriminative classification? Which is usually easier? 🦧
- In generative classification we model the data and use Bayes rule to obtain the class-posterior. In discriminative classification, the class-posterior is directly modelled. Discriminative classification is usually easier.

- Define the Bernoulli likelihood of linear logistic regression for a single sample$x$and label$y \in \{0,1\}$.
- $p(y \mid x)=\sigma\left(\boldsymbol{w}^{T} \boldsymbol{x}+b\right)^{y}\left(1-\sigma\left(\boldsymbol{w}^{T} \boldsymbol{x}+b\right)\right)^{1-y}$

*Why is it a bad idea to evaluate your algorithm on the training set? 🧠*- It's important to use unseen data when evaluating an algorithm to prevent overfitting.
- One would have zero empirical risk but a very high true risk.
- In the worst case an algorithm could learn the data by heart, perform well during testing and training but perform poorly on real unseen data.

*What is the difference between true and empirical risk? 🧠***True risk**is the performance on a random test point$(x,y)$. True risk is unknown.**Empirical risk**is the performance on the training set. Empirical risk can be evaluated using training samples.

*The true risk can be decomposed in which parts?🧠*- The true risk can be decomposed into:

$\underbrace{\mathbb{E}_{D_{n}}\left[\mathbb{E}_{x, y}\left[\left(\hat{f}_{D_{n}}(x)-\hat{f}_{*}(x)\right)^{2}\right]\right]}_{\text {Variance }}+\underbrace{\mathbb{E}_{x, y}\left[\left(\hat{f}_{*}(x)-f(x)\right)^{2}\right]}_{\text {Bias }^{2}}+\underbrace{\sigma^{2}}_{\text {noise }}$*How is the bias and the variance of a learning algorithm defined and how do they contribute to the true risk?🧠***Variance**:- $\mathbb{E}_{D_{n}}\left[\mathbb{E}_{x, y}\left[\left(\hat{f}_{D_{n}}(\boldsymbol{x})-\hat{f}_{*}(\boldsymbol{x})\right)^{2}\right]\right]$
- Variance is the difference between the estimates and the "best" estimates. Error depends on the number of data points that are available for training. It accounts for the limitations in the training data set.

**Bias**:- $\mathbb{E}_{x, y}\left[\left(\hat{f}_{*}(\boldsymbol{x})-f(\boldsymbol{x})\right)^{2}\right]$
- Bias is the difference between the true function and the "best" estimate of the model.

- Bias, variance, and noise sum up to the true risk.

*What is the advantage/disadvantage of k-fold CV vs. the Hold-out method?🧠*- k-Fold CV uses the entire data set for training, as the$k$-th fold is used for testing and the remaining folds are used for training. This is less wasteful compared to the Hold-Out-method, where a certain percentage of the data is exclusively reserved for testing.
- On the other hand, is k-Fold CV more computationally expensive as it requires$k$iterations to train the model. It also leads to more stable results though. In comparison, the Hold-out-method is less computationally expensive, as it requires only one run.
- What is problematic about the hold-out method is, that unlucky splits might give misleading results.

*Why does it make sense to penalize the norm of the weight vector?🧠*- Applying a penalty to large weight vectors helps to keep the parameters small. Which leads to a smoother function.
- Also, the complexity of the model is limited.
- The larger$\lambda$, the less complex the final model will be.

*Which norm can we use and what are the different effects?*- $\ell_1$
**penalty**:- $\operatorname{penalty}(\boldsymbol{\theta})=\|\boldsymbol{\theta}\|_{1}=\sum_{d}\left|\theta_{d}\right|$
- Introduces sparse solutions, which means some parameters can be 0. Suitable for feature selection.
- Hard to optimize.

- $\ell_2$
**penalty**:- $\operatorname{penalty}(\boldsymbol{\theta})=\|\boldsymbol{\theta}\|_{2}=\sum_{d} \theta_{d}^{2}$
- Redundant parameters will be close to 0, but never 0.
- Easy to optimize.

*What is the effect of early stopping?🧠*- Early stopping prevents overfitting, as model is not trained to the smallest training error.
- Further more, the model's complexity is limited. E. g. with Boosting approaches, one doesn't learn overly complicated trees.

- Give examples how to influence the model-complexity of at least 3 different algorithms.🦧
- Neural networks: Add/remove neurons or entire layers
- K-NN: Increasing k decreases model complexity
- Trees and forests: Increasing depth increases complexity
- Generalized linear methods: Choose larger function class as feature mapping
- Kernel methods: Use kernel with higher dimensional underlying feature space
- Mixture models: Use more mixture compenents

*What we mean with non-parametric / instance-based machine learning algorithms?🧠*- Non-parametric ML algorithms are algorithms, that do not learn a parametric model, but store all the training data and use the training data to make predictions.
- Examples include the$k$-nearest neighbour algorithm and the Gaussian processes.

*How does*$k$*-Nearest neighbour works?🧠*- To classify a new input vector$x$, we examine the$k$closest training data points to$x$(that lie within a hypersphere).
- We assign the class that is most frequent among those neighbours to the query point$x$.

*Why is it hard to use for high-dimensional data?🧠*- $k$-Nearest neighbour is built around a distant-based measure. However, in a very high-dimensional space, most points are equally far apart from each other.
- So the neighbourhood becomes large.

*How to search for nearest neighbours efficiently?🧠*- It's convenient to use a KD-tree as the data structure. The KD tree is balanced and performs binary splits on the feature space.
- Searching goes like this:
- Find region containing$\boldsymbol{x}$. Navigate starting from the root node to the child node containing$\boldsymbol{x}$.
- Save region point$\boldsymbol{x}^{*} = \boldsymbol{x}_{0}$as current best.
- Move up tree and recursively search regions intersecting hypersphere$S\left(\boldsymbol{x},\left\|\boldsymbol{x}-\boldsymbol{x}^{*}\right\|\right)$.
- Update$\boldsymbol{x}^{*}$if new nearest neighbour has been found.

*What is a binary regression/decision tree?🧠*- A binary decision tree is a tree that performs a binary split at each node. The predictor space is segmented into smaller regions.
- In case of regression trees the predicted value at a node is the average response variable of all observations within this node.

*What are useful splitting criterions?🧠***For regression trees:**- Minimum residual sum of squares:$\mathrm{RSS}=\sum_{\text {left }}\left(y_{i}-\bar{y}_{L}\right)^{2}+\sum_{\text {right }}\left(y_{i}-\bar{y}_{R}\right)^{2}$

**For classification trees:**- Information gain / Minimum entropy:$N_{L} H\left(p_{\mathrm{L}}\right)+N_{R} H\left(p_{\mathrm{R}}\right)$

*How can we influence the model complexity of a tree?🧠*- One can limit the depth of the tree or limit the minimum number of samples that must be present within one node.
- Another way is to apply pruning to a decision tree, which refers to cutting back a fully grown tree.

*Why is it useful to use multiple trees and randomization?🧠*- Decision Trees are prone to overfitting. That's why we should learn several trees to increase variability.
- By randomizing the features, using bootstrap samples of the data and combining the trees (random forest) or learning several trees on bootstrap samples (bagging) one can prevent trees from overfitting.

- Name at least two advantages and two disadvantages of decision trees. 🦧
**Advantages**:- Easy to interpret (if they are small)
- Computationally efficient. They are quick to fit, even for large problems.

**Disadvantage**:- Other ML methods such as NN achieve a better accuracy.
- Suffer from instability, if we change the data, ever-so-slightly trees can change a lot.

*Which data structure is usually used to efficiently implement k-Nearest neighbors? Name the main steps in building that data structure.🦧*- KD-Trees are commonly used.
- Building the tree:
- choose dimension (e. g. longest hyper-rectangle)
- Choose median as a pivot
- Split node according to the chosen pivot and dimension.

*How is the clustering problem defined?🧠*- Clustering tries to find a natural grouping within the data.
- One tries to maximize the intra-cluster similarity and minimize the inter-cluster similarity.

*Why is it called 'unsupervised'?🧠*- It's called unsupervised, as the data is unlabelled. There is no supervisor, that tells us in advance to which cluster a training data point belongs. Hence, unsupervised. The structure is actually learned from the data.

*How do hierarchical clustering methods work?*🧠**Init-phase:**Each of the$n$samples becomes a cluster itself**Iterate-phase:**- 1.Find closest clusters and merge them
- 2.Proceed until we have a single cluster

*What is the rule of the cluster-2-cluster distance and which distances can we use?🧠*- So-called cluster linkage types define the distance between two clusters. Commonly used are
- Single linkage
- Complete linkage
- Average linkage
- Centroid linkage

*How does the*$k$*-mean algorithm work? What are the two main steps?🧠*- The two main steps are the
**assignment**and**adjustment step**. **Assignment step:**- Assign each sample to its closest centroid$z_{n}=\arg \min_{k}\left\|\boldsymbol{c}_{k}-\boldsymbol{x}_{n}\right\|^{2}$

**Adjustment Step:**- Adjust the centroids to be the means of the samples assigned to them:$c_{k}=\frac{1}{\left|X_{k}\right|} \sum_{\boldsymbol{x}_{i} \in X_{k}} \boldsymbol{x}_{i}, \quad X_{k}=\left\{\boldsymbol{x}_{n} \mid z_{n}=k\right\}$

*Why does the algorithm converge? What is it minimizing?🧠*- $k$-means minimizes the Sum of Squared Distances (SSD). By checking, if a cluster centroids exists, that is closer to a point than its current cluster centroid the SSD is reduced. If no closer cluster centre exists and the SSD remains constant.
- SSD is defined as$\operatorname{SSD}(C ; \mathcal{D})=\sum_{i=1}^{n} d\left(\boldsymbol{x}_{i}, c\left(\boldsymbol{x}_{i}\right)\right)^{2}$

*Does*$k$*-means find a global minimum of the objective?🧠*- No, generally it doesn't. Finding a global minimum is a NP-hard problem. One would have to check all assignments to find the global best solution.
- Moreover, the result of$k$-means is heavily dependent on the initialization.

- What does dimensionality reduction mean?🧠
- Dimensionality reduction refers to projecting an original data point$\boldsymbol{x}_{i} \in \mathbb{R}^{D}$into a lower dimensional representation$\boldsymbol{z}_{i} \in \mathbb{R}^{M}$through a mapping$\boldsymbol{x}_{i} \rightarrow \boldsymbol{z}_{i}$.
- If the mapping is linear it can be represented as a function:$\boldsymbol{z}_{i}=\boldsymbol{W} \boldsymbol{x}_{i}, \text { with } \boldsymbol{W} \in \mathbb{R}^{M \times D}$.

- What is PCA? 🧠
- PCA is an approach for dimensionality reduction.
- PCA finds a low-dimensional representation of the data set that captures as much variance as possible. PCA seeks to find a smaller number of dimensions, where each dimension is a linear combination of the features.
- The first principal component is a set of features$X_{1}, X_{2}, \ldots, X_{p}$is the normalized linear combination of the features with$Z_{1}=\phi_{11} X_{1}+\phi_{21} X_{2}+\ldots+\phi_{p 1} X_{p}$.$\phi_{1}=\left(\begin{array}{llll} \phi_{11} & \phi_{21} & \ldots & \phi_{p 1} \end{array}\right)^{T}$are the principal component loading vector. (James et. al p. 375)

- What are three things that it does?🧠
- PCA maximizes variance of the projection
- PCA minimizes the error of the reconstruction
- PCA projects the data into a linear subspace

*What are the roles of the Eigenvectors and Eigenvalues in PCA?🧠*- Eigenvectors point to the directions of largest variance
- Largest Eigenvalues give the values of the largest variance

*Can you describe applications of PCA?🧠*- Find a natural coordinate system for the data
- Transform into a lower-dimensional representation
- Common pre-processing technique

- Why should each dimension have a unit variance / be normalized?
- Doing a PCA is a variance maximizing exercise. Therefore, if features have a different variance, features with the greatest variance will be loaded the most. To give the features the greatest weight, that explain most of the variance, data should be normalized first.

*What are parametric methods and how to obtain their parameters?🧠*- Parametric models are models, where we assume a certain underlying probability distribution. The no. of parameters is fixed. (see also here.)

- How many parameters have non-parametric methods?🧠
- non-parametric models derive the probability density from the data and don't assume a parametric models. Thus, don't require parameters to specify the model.
- However, they can have hyperparameters such as the bin count in a histogram, which depend on the concrete method.

*What are mixture models?🧠*- Mixture models create a complex distribution by combining simple ones.
- They have the following parameters$\boldsymbol{\theta}=\left\{\pi_{1}, \boldsymbol{\mu}_{1}, \boldsymbol{\Sigma}_{1}, \ldots, \pi_{K}, \boldsymbol{\mu}_{K}, \boldsymbol{\Sigma}_{K}\right\}$, where$\pi$stands for the responsibility or contribution of each distribution to the final mixture.$\mu_K$and$\Sigma_K$describe the Gaussian distribution of the$k$th component.

*Should gradient methods be used for training mixture models?🧠*- No, gradient descent shouldn't be used to train mixture models, as optimizing the likelihood of a sum is problematic. Convergence will be slow.
- Should use Expectation Maximization instead.

*How does the EM algorithm work?🧠*- The EM algorithm is an iterative algorithm for estimating latent variable models consisting of two steps. The first step is the Expectation step.
- The E step maximizes the lower bound with respect to some arbitrary distribution$q(z)$of the latent variable$z$, by minimizing / vanishing the Kullback Leibler distance of$q(z)$. The lower bound is tight, as the marginal likelihood equals the lower bound.
- In the second step, the M step the lower bound is maximized with respect to$\boldsymbol{\theta}$to obtain a new parameter estimate$\boldsymbol{\theta}^{\text{new}}$. As$q$is estimated using the old parameters of$\boldsymbol{\theta}$, the KL will be larger than zero and the overall marginal likelihood increases.

*What is the biggest problem of mixture models?🧠*- Mixture models assume that the data can be modelled through a combination of parametric distributions e. g. Gaussians.

- How does EM decomposes the marginal likelihood?🧠
- $\log p(\boldsymbol{x} \mid \boldsymbol{\theta})_{z}=\sum_{z} q(z) \log \frac{p(\boldsymbol{x}, z \mid \boldsymbol{\theta})}{q(z)}+\sum_{z} q(z) \log \frac{q(z)}{p(z \mid x)}$

*Why does EM always improve the lower bound?🧠*- In the E-Step the KL is minimized to zero, while the log-likelihood remains unchanged, hence the lower bound has to increase. This is due to the fact, that the lower bound is the difference between the unchanged marginal likelihood and the KL.
- In the M-Step the lower bound also increases, unless it's already at the maximum, as the new parameter set has to be at least as good as the old parameter estimate.

- Why does EM always improve the marginal likelihood?🧠
- Marginal log-likelihood is always improved in the maximization step, as the lower bound is maximized with respect to$\theta$to give some new$\theta^{\text{new}}$.That is, the$\mathcal{L}$will go up. But as$q$is determined with the old parameters, it will be different to the new posterior distribution of$p$. Hence, the KL will be non-zero or positive. So the sum of both terms will be greater than before.

*Why can we optimize each mixture component independently with EM?🧠*- We can separately update the single components and coefficients in the M step, as the objective of the lower bound is additive. (p. 43)
- TODO: Get a better understanding.

- Why do we need sampling for continuous latent variables?🧠
- Typically it is not feasible to compute the integral in the Maximization step with continuous latent variables, as no analytical solutions exist for the integral (see slide 55 f.)
- Instead an MC estimation is used to calculate the lower bound.

*Why is a neural network considered a parametric model?*- Deep learning models are generally parametric. Most often, they have a huge number of parameters, one for each weight that is tuned during training. (see here.)

*What is the link between Entropy and Kullback-Leiber divergence?*$\begin{aligned} \mathrm{KL}(p \| q) &=-\int p(\mathbf{x}) \ln q(\mathbf{x}) \mathrm{d} \mathbf{x}-\left(-\int p(\mathbf{x}) \ln p(\mathbf{x}) \mathrm{d} \mathbf{x}\right) \\ &=-\int p(\mathbf{x}) \ln \left\{\frac{q(\mathbf{x})}{p(\mathbf{x})}\right\} \mathrm{d} \mathbf{x} \end{aligned}$(see Bishop p. 55)*How does the Variational Bayes algorithm improve on the Expectation Maximization algorithm?*- The EM-algorithm assumes that the Kullback-Leiber distance can be set to zero.
- This is however not possible for complex latent variable models. Variational Bayes models solve this issue by allowing the KL to be greater than 0 after the E-step.

*What is the EM algorithm for?*- Finding the maximum likelihood solutions of a probabilistic model with latent variables

*Why is it a lower bound?*- Since Since$\mathrm{KL}(q \| p) \geq 0$it follow that$\mathcal{L}(q, \boldsymbol{\theta}) \leq \log p(\boldsymbol{x} \mid \boldsymbol{\theta})$.

*What are weaknesses of the EM algorithm?*- EM is sensitive to initialization.
- Often times$k$-nearest neighbour is used for initialization.

*How parametric and non-parametric models compare? What are their advantages/drawbacks?*- Parametric models
- Friendly analytic properties (+)
- Simple (+)
- Small memory requirements (+)
- Fast (+)
- Limited representation (-)

- Non-parametric models
- Very general (+)
- Suffer from curse of dimensionality (-)
- Large memory requirements (-)
- Slow (-)

*What is the difference between kernel density estimation and*$k$*-nearest neighbor?*- Both approaches allow to calculate the probability that a point falls into a region.
- Kernel density estimation
- Fix$V$(Volume) and determine$K$(the no. of data points$K$in fixed hypercube)

- $k$-nearest neighbour estimation
- Fix$K$(no. samples in Region) and determine size of Sphere$V$.
- First, we increase the size of the hypersphere until$K$points lie inside.

*Show that the Kullback-Leibler distance is ...*- (Whiteboard)

*What are the convergence properties of EM?*- EM improves the lower bound$\mathcal{L}\left(q_{\text {new }}, \boldsymbol{\theta}_{\text {new }}\right) \geq \mathcal{L}\left(q_{\text {old }}, \boldsymbol{\theta}_{\text {old }}\right)$
- E-Step affects KL, so lower bound has to go up.
- M-Step maximizes lower bound

- EM improves the marginal likelihood$\log p\left(\boldsymbol{x} \mid \boldsymbol{\theta}_{\mathrm{new}}\right) \geq \log p\left(\boldsymbol{x} \mid \boldsymbol{\theta}_{\text {old }}\right)$
- E-Step leaves marginal likelihood unaffected
- M-Step increases, as lower bound and KL increase

*What is the advantage of EM over*$k$*-means?*- EM is harder to learn than k-means but it also gives variances and densities

*What are the advantages of PPCA over the PCA? What are its drawbacks?*- PCA is a one-step solution (only eigenvalue decomposition)
- PCA is very fast
- PPCA is good if we need densities
- PPCA helps to understand EM and complex dimensionality reduction methods.

*Why do we need the Expectation step at all, if the Maximization steps, increases the likelihood?*- The expectation step is necessary to estimate the distribution of the latent variables by using the old parameters and the observed data$\boldsymbol{x}$. So we try to estimate the missing data.
- In the M step the generated (complete data) is used to generate new parameter estimates for$\boldsymbol{\theta}$.
- If there wasn't an E-Step it just wouldn't be possible.

- What is the definition of a kernel and its relation to an underlying feature space?🧠
- Let$\phi: \mathcal{X} \rightarrow \mathbb{R}^{d}$be an arbitrary feature function, then$k\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)=\left\langle\phi(\boldsymbol{x}), \boldsymbol{\phi}\left(\boldsymbol{x}^{\prime}\right)\right\rangle$defines a positive definite kernel.

- Why are kernels more powerful than traditional feature-based methods?🧠
- Traditional feature-based methods require working in the higher-dimensional feature space, while kernel-based methods do not.
- Especially, with kernels we can map data $x$ to an infinite dimensional feature space. These features never have to be represented explicitly, as long as we can evaluate the inner product of two feature vectors.
- Leads to more powerful representations than traditional feature models.

- What do we mean by the kernel trick?🧠
- Kernel methods allow to calculate the dot product in a high-dimensional feature space without ever computing the mapping$\phi$of the data in that space. Instead of applying the transformations of$\phi (x)$explicitly, the computation is done in the lower-dimensional feature space by replacing the inner product with a function.

- How do we apply the kernel trick to ridge regression?🧠
- The ridge solution is given by:$\boldsymbol{w}_{\text {ridge }}^{*}=\underbrace{\left(\Phi^{T} \Phi+\lambda \boldsymbol{I}\right)^{-1}}_{d \times d \text { matrix inversion }} \boldsymbol{\Phi}^{T} \boldsymbol{y}$
- By applying the "searle set of identies" ($(\boldsymbol{I}+\boldsymbol{A} \boldsymbol{B})^{-1} \boldsymbol{A}=\boldsymbol{A}(\boldsymbol{I}+\boldsymbol{B} \boldsymbol{A})^{-1})$we get:
- $\boldsymbol{w}^{*}=\underbrace{\left(\boldsymbol{\Phi}^{T} \boldsymbol{\Phi}+\lambda \boldsymbol{I}\right)^{-1}}_{d \times d \text { matrix inversion }} \boldsymbol{\Phi}^{T} \boldsymbol{y}=\boldsymbol{\Phi}^{T} \underbrace{\left(\Phi \Phi^{T}+\lambda \boldsymbol{I}\right)^{-1}}_{N \times N \text { matrix inversion }} \boldsymbol{y}$
- The kernelized version is then given by:$\boldsymbol{w}^{*}=\boldsymbol{\Phi}^{T} \underbrace{(\boldsymbol{K}+\lambda \boldsymbol{I})^{-1} \boldsymbol{y}}_{\alpha}=\mathbf{\Phi}^{T} \boldsymbol{\alpha}$

- How do we compute with infinite dimensional vectors?🧠
- Instead of inverting a$d \times d$matrix, we invert a$N \times N$matrix, which is beneficial if$d$is infinite dimensional.
- Still,$\boldsymbol{w}^{*} \in \mathbb{R}^{d}$is potentially infinite dimensional and can not be represented. However, we can evaluate a function$f$that is specified by$\boldsymbol{w}^{*}$:

$f(\boldsymbol{x})=\boldsymbol{\phi}(\boldsymbol{x})^{T} \boldsymbol{w}^{*}=f \boldsymbol{\phi}(\boldsymbol{x})^{T} \boldsymbol{\Phi}^{T} \boldsymbol{\alpha}=\boldsymbol{k}(\boldsymbol{x})^{T} \boldsymbol{\alpha}=\sum_{i} \alpha_{i} k\left(\boldsymbol{x}_{i}, \boldsymbol{x}\right)$with the kernel vector$\boldsymbol{k}\left(\boldsymbol{x}^{*}\right)=\left[\begin{array}{c}k\left(\boldsymbol{x}_{1}, \boldsymbol{x}^{*}\right) \\ \vdots \\ k\left(\boldsymbol{x}_{N}, \boldsymbol{x}^{*}\right)\end{array}\right]=\left[\begin{array}{c}\left(\phi\left(\boldsymbol{x}_{1}\right)^{T} \phi\left(x^{*}\right)\right. \\ \vdots\left(\boldsymbol{x}_{N}\right)^{T} \phi\left(x^{*}\right)\end{array}\right]=\boldsymbol{\Phi}_{X} \boldsymbol{\phi}\left(\boldsymbol{x}^{*}\right)$. - What are the hyper-parameters of a kernel and how can we optimize them?🧠
- For the RBF kernel the bandwidth$\sigma$is a hyper-parameter. Choosing it is a model-selection problem, that can be solved via cross-validation.

- Study the kernel properties where is symmetry relevant? Where and why is positive definiteness of matrix important?
- Inner products are symmetric and strictly positive definite. (see here.) Thus, a function substituting the product aka kernel has to fulfil the same properties.

- What is the impact of changing$\sigma$in an RBF kernel?
- Recall the RBF kernel is given by$k(\boldsymbol{x}, \boldsymbol{y})=\exp \left(-\frac{\|\boldsymbol{x}-\boldsymbol{y}\|^{2}}{2 \sigma^{2}}\right)$.
- If the distance between$\boldsymbol{x}- \boldsymbol{y}$is large,$\exp$will be small, this means training observations$\boldsymbol{x}$will hardly play any role in the prediction of test observation$\boldsymbol{y}$.
- The$\sigma$now controls the width of the neighbourhood or the variance of the Gaussian density.

- What is the actual gain from using polynomial kernels?
- By using a polynomial kernel with$d>1$one gets a more flexible decision boundary. This is comparable to fitting e. g. a classifier in a higher dimensional feature space.

- What is the purpose of the kernel matrix?
- A kernel matrix is built by evaluating the kernel on all pairs and any set inputs. So it stores the similarities of all samples.
- $[\boldsymbol{K}]_{i j}=\boldsymbol{\phi}\left(\boldsymbol{x}_{i}\right)^{T} \boldsymbol{\phi}\left(\boldsymbol{x}_{j}\right)=k\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right)$

- Proof that a positive definite Kernel is symmetric.
- $k\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)=\left\langle\boldsymbol{\phi}(\boldsymbol{x}), \boldsymbol{\phi}\left(\boldsymbol{x}^{\prime}\right)\right\rangle = \left\langle\boldsymbol{\phi}(\boldsymbol{x}^{\prime}), \boldsymbol{\phi}\left(\boldsymbol{x}\right)\right\rangle = k\left(\boldsymbol{x}^{\prime}, \boldsymbol{x}\right)$

- Proof that the Euclidean distance is a distance measure fulfilling all criteria for a distance measure.

- Why is it good to use a maximum margin objective for classification?🧠
- The maximum margin objective is good for classification, as it tries to find the hyperplane that gives the greatest minimum distance to the closest observations.
- The intuition is, that if less observations next to the decision boundary, the results are less uncertain.

- How can we define the margin as an optimization problem?
- It can be formulated as:
- $\begin{aligned} \operatorname{argmax}_{\mathbf{w}} & \frac{2}{\|\mathbf{w}\|}, \\ \text { s.t. } & \mathbf{w}^{T} \mathbf{x}_{i}+b\left\{\begin{array}{l} \geq+1, \quad \text { falls } y_{i}=+1 \\ \leq-1, \quad \text { falls } y_{i}=-1 \end{array}\right. \end{aligned}$

- What are slack variables and how can they be used to get a 'soft' margin?🧠
- A ''soft' margin is the concept of a hyperplane, where almost all observations are separated correctly. This is done through so called slack variables$\xi_i$, that allow violating the margin. We use slack variables$\xi_i \geq 0$and allow for margin violations:$y_{i}\left(\mathbf{w}^{T} \mathbf{x}_{i}+b\right) \geq 1-\xi_{i}$.

- How is the hinge loss defined?🧠$\operatorname{argmin}_{\mathbf{w}} \lambda \underbrace{\|\mathbf{w}\|^{2}}_{\text {regularization }}+\underbrace{\sum_{i=1}^{N} \max \left(0,1-y_{i} f\left(\boldsymbol{x}_{i}\right)\right)}_{\text {data loss }}, \quad \text { with } \lambda=\frac{1}{C}$
- What is the relation between the slack variables and the hinge loss?🧠
- If$\xi_i$is$>1$, the point lies outside the margin, but doesn't contribute to the loss.
- If$\xi_i$is$0 \leq \xi_{i} \leq 1$, it violates the margin and contributes to the loss.
- If$\xi_i$is$0$, it's a support vector.

- What are advantages and disadvantages in comparison to logistic regression?🧠
- Logistic regression doesn't allow for margin valuations. SVMs however, allow some observations to lie on the wrong side of the margin through the notion of slack variables.
- Logistic regression returns probabilities. SVMs do not.
- Logistic regression is more sensitive to outliers, whereas SVMs find a more balanced decision boundary.

- What is the difference between gradients and sub-gradients?🧠
- In order to calculate the gradient of a function, the function has to be differentiable.
- To calculate sub-gradients the function has to be convex, but not necessarily differentiable.
- The sub-gradient is similar to a piece-wise gradient.

- First, explain the intuition behind slack-variables in support vector machine training. Second, for a single data-point$\left(\boldsymbol{x}_{i}, c_{i}\right)$the margin condition with slack variable$\xi_{i}$is given as$c_{i}\left(\boldsymbol{w}^{T} \boldsymbol{x}_{i}+b\right) \geq 1-\xi_{i}$
- Assuming$0 \leq \xi_{i} \leq 1$, is$\boldsymbol{x}_{i}$classified correctly?
- Assuming$\xi_{i}>1$, is$x_{i}$classified correctly?🦧
- Slack variables$\xi_i$allow for a a margin violation. That means some observation can lie across the margin, but in between the decision boundary or even on the wrong side of the decision boundary.
- It acts as a regularization term.
- It helps to find a optimal hyperplane if a maximum margin hyperplane can otherwise not be found.
- In the first case above, the observation is classified correctly but violates the margin. The second observations is classified falsly and contributes to the loss.

- What are the two basic steps behind Bayesian Learning?🧠 1. Compute the posterior:$\underbrace{p(\boldsymbol{\theta} \mid \mathcal{D})}_{\text {posterior }}=\frac{\overbrace{p(\mathcal{D} \mid \boldsymbol{\theta})}^{\text {data likelihood}} \overbrace{p(\boldsymbol{\theta})}^{\text {prior }}}{\underbrace{p(\mathcal{D})}_{\text {evidence }}}$That is, compute posterior parameters/models
- 1.Integrate out posterior:$\underbrace{p\left(\boldsymbol{x}^{*} \mid \mathcal{D}\right)}_{\text {marginal likelihood }}=\int \underbrace{p\left(\boldsymbol{x}^{*} \mid \boldsymbol{\theta}\right)}_{\text {likelihood }} \underbrace{p(\boldsymbol{\theta} \mid \mathcal{D})}_{\text {posterior }} d \boldsymbol{\theta}$Average all parameters / models weighted by the posterior.

- Why is Bayesian Learning more robust against overfitting?🧠
- Instead of choosing a single estimate for the parameter vector, we estimate a probability distribution for the parameter vector$\boldsymbol{\theta}$. As multiple estimates are incorporated in the averaged parameter vector the model becomes more robust.

- What happens with the posterior if we add more data to the training set?🧠
- For very large datasets the posterior will be a
**point estimate**. Bayesian learning will be equivalent to maximum likelihood solution.

- What is completing the square and how does it work?🧠
- Generally completing the square means to bring a term in the form$(a+b)^2 = a^2+2ab+b^2$