CS480: Machine Learning


Core concepts

Performance

\[\begin{aligned} \text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}\\ \\ \text{Error Rate} &= \frac{FP + FN}{TP + TN + FP + FN}\\ \\ \text{Precision} &= \frac{TP}{TP + FP}\\ \\ \text{Recall} &= \frac{TP}{TP + FN}\\ \\ F &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\\ \\ \text{Weighted }F &= \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}\\ \end{aligned}\]
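
For example, with made-up confusion-matrix counts:

```python
# Hypothetical confusion-matrix counts for a binary classifier.
TP, TN, FP, FN = 40, 45, 5, 10

accuracy   = (TP + TN) / (TP + TN + FP + FN)
error_rate = (FP + FN) / (TP + TN + FP + FN)
precision  = TP / (TP + FP)
recall     = TP / (TP + FN)
f1         = 2 * precision * recall / (precision + recall)

beta = 2.0  # beta > 1 weights recall more heavily than precision
f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(accuracy, error_rate, precision, recall, f1, f_beta)
```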

Regularizer

A regularizer is a term that adds a cost to having large weights, adding inductive bias favouring "simpler" solutions.

Ridge regularizers use 2-norms, meaning that equivalent-cost weight vectors will make a circle when plotted.

Lasso regularizers use 1-norms, meaning that equivalent-cost weight vectors make a diamond when plotted. This adds incentive for having weights with values exactly at 0 instead of simply having small values, as a ridge regularizer would.

Optimization techniques

Instead of using 0-1 loss, whose gradient is zero almost everywhere, use a surrogate loss that is nicely differentiable.

Gradient Descent

If we are learning weights \(w\) subject to loss function \(f\), repeatedly take a step in the direction of the negative gradient: \(w \leftarrow w - \eta \nabla_w f(w)\) for some step size \(\eta\).
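
A minimal sketch of that loop applied to the regularized squared loss from the next subsection; the step size and iteration count are arbitrary choices:

```python
import numpy as np

def gradient_descent(X, y, lam=0.1, lr=0.01, steps=1000):
    """Minimize 1/2 ||Xw - y||^2 + lam/2 ||w||^2 by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) + lam * w   # gradient of the regularized loss
        w -= lr * grad                        # step downhill
    return w
```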

Closed-form solution

e.g. for squared loss with a 2-norm regularizer with no bias:

\[\begin{aligned} \mathcal{L}(w) &= \frac{1}{2}||Xw-Y||^2 + \frac{\lambda}{2}||w||^2\\ \nabla_w \mathcal{L}(w) &= X^T(Xw-Y) + \lambda w\\ &= X^T Xw - X^T Y + \lambda w\\ &= (X^T X + \lambda I)w - X^T Y\\ \\ (X^T X + \lambda I)w - X^T Y &= 0\\ (X^T X + \lambda I)w &= X^T Y\\ w &= (X^T X + \lambda I)^{-1} X^T Y\\ \end{aligned}\]
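
A quick numerical check of the closed form on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true + 0.1 * rng.normal(size=100)

lam = 0.5
# w = (X^T X + lambda I)^{-1} X^T Y; solve() is preferred to an explicit inverse.
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
print(w)  # close to w_true, shrunk slightly toward zero
```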

Probability

Joint probability: \(P(A, B) = P(A = a \land B = b) \forall a, b\).

Conditional probability: \(P(A|B) = \frac{P(A \land B)}{P(B)}\)

Product rule: \(P(A \land B) = P(A|B)P(B)\). Following this, \(P(A, B | C) = P(A|B, C)P(B|C)\)

Sum rule: \(\sum_i P(A_i | C) = 1\)

Two events are independent if and only if \(P(A,B) = P(A)P(B)\).

Bayes Rule, given a hypothesis \(H\) and evidence \(e\):

\[\underbrace{P(H|e)}_\text{posterior} = \frac{\underbrace{P(e|H)}_\text{likelihood} \underbrace{P(H)}_\text{prior}}{\underbrace{P(e)}_\text{normalizer}}\]

Maximum a Posteriori makes predictions using the hypothesis that maximizes the posterior probability of the hypothesis given the evidence, i.e. the product of the prior and the likelihood:

\[\begin{aligned} h_{MAP} &= \operatorname*{argmax}_{h_i} P(h_i | e)\\ &= \operatorname*{argmax}_{h_i} P(h_i) P(e | h_i) \end{aligned}\]

Maximum Likelihood makes predictions by selecting the hypothesis that makes the evidence the most likely:

\[h_{ML} = \operatorname*{argmax}_{h_i} P(e | h_i)\]

If you pick two probability distribution families for the likelihood and the prior and arrive at a posterior in the same family as the prior, then we call the prior family a conjugate prior for the likelihood family.
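
For example, a Bernoulli likelihood with a Beta prior is the classic conjugate pair. A small sketch comparing the ML and MAP estimates of a coin's heads probability; the prior pseudo-counts are made up:

```python
# Observed coin flips: 7 heads, 3 tails.
heads, tails = 7, 3

# Maximum likelihood: argmax_theta P(e | theta) for a Bernoulli likelihood.
theta_ml = heads / (heads + tails)                          # 0.7

# MAP with a Beta(a, b) prior; Beta is conjugate to the Bernoulli, so the
# posterior is Beta(a + heads, b + tails) and its mode is the MAP estimate.
a, b = 2, 2                                                 # hypothetical prior pseudo-counts
theta_map = (heads + a - 1) / (heads + tails + a + b - 2)   # 8/12 ~ 0.667

print(theta_ml, theta_map)
```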

Margins

Functional margin:

\[\hat{\gamma}_i = y_i(w^T x_i + b)\]

If \(y_i\) and our prediction \(w^T x_i + b\) have the same sign, the functional margin \(\hat{\gamma}_i\) will be positive. A large functional margin means a confident and correct prediction.

The functional margin of a training set is the minimum functional margin of individual training examples.

Geometric margin:

\[\gamma_i = y_i \left(\frac{w^Tx_i + b}{||w||}\right)\]

This refers to the distance between the hyperplane and a point.

The geometric margin of a training set is the minimum distance between the hyperplane and any point in the training set.
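
Computing both margins for a made-up weight vector and a few labelled points:

```python
import numpy as np

w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])
y = np.array([1, -1, -1])

functional = y * (X @ w + b)                  # y_i (w^T x_i + b)
geometric = functional / np.linalg.norm(w)    # divide by ||w||

# Margin of the training set = minimum over the individual examples.
print(functional.min(), geometric.min())
```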

Kernelizing

Decision trees

Training

Until either a maximum depth is reached or all remaining data gets grouped into the same leaf, add a decision tree node splitting on the feature that yields the highest information gain.

Given an event \(E\) with a probability \(P(E)\) of occurring, learning that \(E\) has occurred conveys \(I(E) = \log_2\frac{1}{P(E)}\) bits of information.

Entropy is the expected information content: each possible outcome's probability multiplied by the information that outcome would yield:

\[\begin{aligned} H(s) &= \sum_i^k p_i I(s_i)\\ &= \sum_i^k p_i \log\frac{1}{p_i}\\ &= - \sum_i^k p_i \log p_i\\ \end{aligned}\]

For binary classification, this simplifies to:

\[H(s) = -p_\oplus \log p_\oplus - p_\ominus \log p_\ominus\]

Conditional entropy is the expected number of bits needed to encode \(y\) if the value of \(x\) is known. This is done by taking the sum, over each possible value of \(x\), of the fraction of samples having that \(x\) value multiplied by the entropy of the set of samples having that \(x\) value:

\[H(y | x) = \sum_v P(x = v) H(y | x = v)\]
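
A sketch computing entropy, conditional entropy, and the resulting information gain \(H(y) - H(y|x)\) for one binary feature:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def conditional_entropy(feature, labels):
    total = len(labels)
    h = 0.0
    for v in np.unique(feature):
        mask = feature == v
        h += mask.sum() / total * entropy(labels[mask])   # P(x=v) * H(y | x=v)
    return h

x = np.array([0, 0, 1, 1, 1, 0])
y = np.array([0, 0, 1, 1, 0, 1])
gain = entropy(y) - conditional_entropy(x, y)
print(gain)
```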

Prediction

Given an example \(x\), step through the decision tree until a leaf node with a prediction is reached.

K Nearest Neighbours (KNN)

Training

Store all training examples (e.g. in a \(k\)-d tree).

Prediction

Define a distance between samples, such as Euclidean distance. Any metric will do as long as it is non-negative, zero only between identical points, symmetric, and satisfies the triangle inequality.

Pick the closest \(K\) stored training examples to the input according to the distance metric. Predict the most common label (or a random label of the possible tied most common labels) out of those \(K\).

For distance-weighted \(K\)NN, compute a weight \(w\) where \(0 \le w \le 1\) for each of the \(K\) nearest neighbours (e.g. based on distance.) Predict based on those weights, \(\frac{\sum_i w_i y_i}{\sum_i w_i}\), instead of simply the most common label \(y\). Example weight functions are \(\frac{1}{c + d(a, b)^n}\) for some \(c, n\) or \(e^{-\frac{d(a,b)^2}{\sigma^2}}\).
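
A minimal sketch of distance-weighted prediction using the \(\frac{1}{c + d(a,b)^n}\) weight family; the data and constants are made up:

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=3, c=1.0, n=2):
    d = np.linalg.norm(X_train - x, axis=1)    # Euclidean distances to every stored example
    nearest = np.argsort(d)[:k]                # indices of the K closest points
    w = 1.0 / (c + d[nearest] ** n)            # weight: 1 / (c + d^n)
    return (w @ y_train[nearest]) / w.sum()    # weighted average of the K labels

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 0.0, 1.0, 1.0])
print(weighted_knn_predict(X_train, y_train, np.array([1.6])))
```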

Perceptron

Training

Intuitively, the non-bias terms of \(w\) make a vector pointing toward the positive examples, normal to the decision plane. If an example was misclassified, it is because it was on the wrong side of the plane: the vector from the origin to the example points away from \(w\) rather than along it. So, adding it to \(w\) (or subtracting it, for a misclassified negative example) makes the resulting \(w'\) point slightly closer to that example than it did before.

Prediction

Let \(a = w \cdot x\). Then, we define the prediction \(y\):

\[y = \begin{cases}1, &a \gt 0\\0, &a \le 0\end{cases}\]
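
A minimal sketch of the whole loop, assuming labels in \(\{0, 1\}\) and a bias folded in as an extra always-1 feature:

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    """Perceptron with labels in {0, 1}; a column of ones supplies the bias."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            pred = 1 if w @ xi > 0 else 0
            w += (yi - pred) * xi        # no change when the prediction is correct
    return w

# Toy linearly separable data (logical AND).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w = train_perceptron(X, y)
print(w, [(1 if w @ np.append(xi, 1.0) > 0 else 0) for xi in X])
```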

Concepts

Bayesian Linear Regression

Training

We want to maximize the posterior probability of weights given labels and data. We typically minimize the negative log of the posterior instead, since it is easier to take the derivative of.

\[\begin{aligned} P(w|X,y) &= \frac{P(y|X,w)P(w)}{P(y|X)}\\ &= \frac{\left(\prod_{i=1}^N P(y_i|x_i, w)\right)P(w)}{P(y|X)}\\ -\ln P(w|X,y) &= -\sum_{i=1}^N \ln P(y_i|x_i, w) - \ln P(w) + \ln P(y|X)\\ &= \frac{1}{2\sigma^2}\sum_{i=1}^N(y_i - f(x_i))^2 + \frac{\alpha}{2}||w||^2 + C\\ \end{aligned}\]

(The last step assumes a Gaussian likelihood with variance \(\sigma^2\) and a zero-mean Gaussian prior on \(w\) with precision \(\alpha\).)

From this point, take the derivative, equate to 0, and solve.

Prediction

We don't estimate a single \(w\). Instead, we average all possible models, weighting different models according to their posterior probability.

Naive Bayes classification

Training

The point of generative learning is to find \(P(y|x)\) by estimating \(P(x|y)\) and \(P(y)\). Generally, this would be the calculation for \(P(x|y)\):

\[\begin{aligned} P(x|y) &= P(x_1, x_2, ..., x_m | y)\\ &= P(x_1|y)P(x_2|y,x_1)P(x_3|y,x_1,x_2) ... P(x_m|y,x_1,x_2,...,x_{m-1})\\ \end{aligned}\]

Instead, we introduce the Naive Bayes assumption that the features are conditionally independent given the label, \(P(x_i, x_j | y) = P(x_i|y)P(x_j|y)\) for \(i \ne j\), giving us instead:

\[P(x|y) = \prod_{i=1}^m P(x_i|y)\]

We define probabilities \(\theta_1 = P(y=1)\), \(\theta_{j,1} = P(x_j = 1 | y = 1)\), and \(\theta_{j,0} = P(x_j = 1 | y = 0)\).

Then, find parameters that maximize log-likelihood:

\[\begin{aligned} P(y|x) &\propto P(y)P(x|y)\\ \mathcal{L}(\theta) &= \prod_{i=1}^n\left(P(y_i) \prod_{j=1}^m P(x_{i,j}|y_i)\right)\\ \ln\mathcal{L}(\theta) &= \sum_{i=1}^n\left(\ln P(y_i) + \sum_{j=1}^m \ln P(x_{i,j}|y_i)\right)\\ &= \sum_{i=1}^n\Bigg(y_i \ln \theta_1 + (1-y_i) \ln(1 - \theta_1)\\ &\quad + \sum_{j=1}^m y_i\left(x_{i,j}\ln\theta_{j,1} + (1-x_{i,j})\ln(1 - \theta_{j,1})\right)\\ &\quad + \sum_{j=1}^m (1-y_i)\left(x_{i,j}\ln\theta_{j,0} + (1-x_{i,j})\ln(1 - \theta_{j,0})\right)\Bigg)\\ \end{aligned}\]

For discrete classification, choose Bernoulli distributions for \(\theta\) values. For continuous values, choose Gaussian.
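
A minimal sketch of the Bernoulli case with add-\(\alpha\) (Laplace) smoothing, discussed below; the data is made up and features/labels are assumed binary:

```python
import numpy as np

def train_bernoulli_nb(X, y, alpha=1.0):
    """Estimate P(y=1) and P(x_j=1 | y=c) with add-alpha smoothing."""
    n, m = X.shape
    theta_y = y.mean()                     # P(y = 1)
    theta_j = np.zeros((2, m))
    for c in (0, 1):
        Xc = X[y == c]
        theta_j[c] = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
    return theta_y, theta_j

def predict(x, theta_y, theta_j):
    log_post = []
    for c, prior in ((0, 1 - theta_y), (1, theta_y)):
        p = theta_j[c]
        # log P(y=c) + sum_j log P(x_j | y=c) for Bernoulli features
        log_post.append(np.log(prior)
                        + (x * np.log(p) + (1 - x) * np.log(1 - p)).sum())
    return int(np.argmax(log_post))

X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
y = np.array([1, 1, 0, 0])
params = train_bernoulli_nb(X, y)
print(predict(np.array([1, 0, 1]), *params))
```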

Prediction

Calculate \(P(y=c|x) \propto P(y=c)P(x|y=c)\) for each class \(c\) and predict the class with the highest value.

Other properties

The decision boundary can be found by checking where \(\frac{P(y=1|x)}{P(y=0|x)} = 1\), or for easier computation, \(\ln \frac{P(y=1|x)}{P(y=0|x)} = 0\).

Laplace smoothing: add a constant to the numerator and the denominator of the maximum likelihood estimate so that no probability is ever exactly zero; the constants act as hallucinated examples.

Support Vector Machines (SVMs)

Training

We are attempting to maximize the geometric margin of a data set. We look for:

\[\max_{w, b} \frac{1}{||w||}, \quad y_i(w^Tx_i+b) \ge 1 \quad \forall i \in \{1, ..., n\}\]

We use Lagrangian optimization. In general, we want:

\[\min_w f(w), \quad g_i(w) \le 0 \forall i, \quad h_i(w) = 0 \forall i\]

Then, we define the generalized Lagrangian:

\[L(w, \alpha, \beta) = f(w) + \sum_{i=1}^k \alpha_i g_i(w) + \sum_{i=1}^k \beta_i h_i(w)\]

The primal problem is defined:

\[\begin{aligned} \theta_P(w) &= \max_{\alpha,\beta: \alpha_i \ge 0} L(w, \alpha, \beta)\\ &= \begin{cases}f(w),&w\text{ satisfies all constraints}\\\infty,&\text{otherwise}\end{cases}\\ \end{aligned}\]

We are concerned with finding:

\[p^* = \min_w \theta_P(w) = \min_w \max_{\alpha,\beta: \alpha_i \ge 0} L(w, \alpha, \beta)\]

Instead of minimizing over \(w\) first, we can swap the order and minimize with respect to \(w\) on the inside, giving the dual problem:

\[d^* = \max_{\alpha,\beta: \alpha_i \ge 0} \theta_D (\alpha,\beta) = \max_{\alpha,\beta: \alpha_i \ge 0} \min_w L(w, \alpha, \beta)\]

The max min of a function is always less than or equal to the min max of the function, so \(d^* \le p^*\). They are equal when \(f\) and the \(g_i\) are convex, the \(h_i\) are affine, and the Karush-Kuhn-Tucker (KKT) conditions hold at the solution: stationarity (\(\nabla_w L = 0\)), primal feasibility (\(g_i(w) \le 0\), \(h_i(w) = 0\)), dual feasibility (\(\alpha_i \ge 0\)), and complementary slackness (\(\alpha_i g_i(w) = 0\)).

e.g. to minimize \(x^2\) subject to \((x-2)^2 \le 1\):
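
Working it through with the Lagrangian:

\[\begin{aligned} L(x, \alpha) &= x^2 + \alpha\left((x-2)^2 - 1\right), \quad \alpha \ge 0\\ \frac{\partial L}{\partial x} &= 2x + 2\alpha(x - 2) = 0 \implies x = \frac{2\alpha}{1 + \alpha}\\ \end{aligned}\]

By complementary slackness, either \(\alpha = 0\), giving \(x = 0\), which violates \((x-2)^2 \le 1\); or \((x-2)^2 = 1\), giving \(x = 1\) (with \(\alpha = 1 \ge 0\)) or \(x = 3\) (which would need \(\alpha = -3 < 0\), infeasible). So the constrained minimum is \(x^* = 1\) with value \(1\).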

To solve an optimal margin classifier, we want to solve \(\min_{w,b}\frac{1}{2}||w||^2\), subject to \(y_i(w^Tx_i+b) \ge 1\), \(i=1,...,n\). We can rewrite the constraints as \(g_i(w) = 1 - y_i(w^Tx_i+b) \le 0\) and solve the Lagrangian problem.

The dual formulation:

\[\max_\alpha \underbrace{\sum_{i=1}^n \alpha_i}_{\text{Depends only on}\\\text{dual variables}} - \underbrace{\frac{1}{2} \sum_{i,j=1}^n y_iy_j\alpha_i\alpha_j(x_i^T x_j)}_{\text{Data enters only through}\\\text{inner products}}\]

subject to \(\alpha_i \ge 0\) and \(\sum_{i=1}^n \alpha_i y_i = 0\).

We replace \(x_i^T x_j\) with a kernel \(K(x_i, x_j)\) so that different kinds of nonlinearities can be represented. The kernel function is equivalent to \(\phi(x_i)^T\phi(x_j)\) for some feature mapping \(\phi\). The kernel form is useful because it can take far fewer operations than explicitly computing the feature vectors and their dot product. Some kernels: linear \(x_i^T x_j\), polynomial \((x_i^T x_j + c)^d\), and Gaussian/RBF \(\exp\left(-\frac{||x_i - x_j||^2}{2\sigma^2}\right)\).
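
A small check that a kernel gives the same value as an explicit feature map: the quadratic kernel \(K(a,b) = (a^T b)^2\) corresponds, for 2-D inputs, to the feature map \(\phi(a) = (a_1^2, \sqrt{2}a_1 a_2, a_2^2)\).

```python
import numpy as np

def phi(a):
    # Explicit feature map for the quadratic kernel (a^T b)^2 in 2 dimensions.
    return np.array([a[0]**2, np.sqrt(2) * a[0] * a[1], a[1]**2])

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

print((a @ b) ** 2)       # kernel: one dot product, then a square
print(phi(a) @ phi(b))    # same value via the longer explicit feature vectors
```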

Due to the dual complementarity condition, we either have \(\alpha_i = 0\) or \(1 - y_i(w^Tx_i+b) = 0\). Only constraints where the latter is true are "active" constraints, and these are the support vectors. They are points lying on the edge of the margin.

To allow some constraints to be violated, soften the primal problem: replace the hard (infinite-cost) constraints with a finite penalty on violations, such as hinge loss or quadratic (squared hinge) loss.

Ensemble Learning

Bagging

Multiple versions of the same model are trained on different datasets. Given an initial dataset \(D\), sample from it with replacement to create a bootstrapped dataset \(D_*\) for each model instance. Take the average or majority prediction.

Tends to reduce variance but increase bias.

Random Forests

Bagged decision trees in which each split considers only a random subset of the features, which decorrelates the individual trees.

Boosting (AdaBoost)

For \(m\) samples, initialize uniform weights \(D_1(i) = \frac{1}{m}\). Then, for each round \(t\): train a weak learner \(h_t\) on the weighted data, compute its weighted error \(\epsilon_t\), set \(\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}\), and reweight the samples so that misclassified ones get more weight. The final classifier is \(\text{sign}\left(\sum_t \alpha_t h_t(x)\right)\). A sketch with stump weak learners follows.
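
A minimal sketch of these steps using one-feature threshold stumps as weak learners, assuming labels in \(\{-1, +1\}\); the data is made up:

```python
import numpy as np

def adaboost(X, y, rounds=10):
    """AdaBoost with exhaustive threshold stumps; labels must be in {-1, +1}."""
    m = len(y)
    D = np.full(m, 1.0 / m)
    ensemble = []                                  # (alpha, feature, threshold, sign)
    for _ in range(rounds):
        best = None
        # Pick the stump with the lowest weighted error.
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                for s in (1, -1):
                    pred = np.where(X[:, j] > thr, s, -s)
                    err = D[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, s, pred)
        err, j, thr, s, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)      # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)
        D *= np.exp(-alpha * y * pred)             # up-weight misclassified samples
        D /= D.sum()
        ensemble.append((alpha, j, thr, s))
    return ensemble

def predict(ensemble, X):
    score = sum(a * np.where(X[:, j] > thr, s, -s) for a, j, thr, s in ensemble)
    return np.sign(score)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
print(predict(adaboost(X, y), X))
```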

Active Learning

Learning which samples to query (label) so as to achieve the best results with the fewest labels.

Pool-based Uncertainty Sampling

Given a pool of unlabelled samples, ask the oracle for a label on the most uncertain one.

When using version spaces, pick samples that will eliminate half of the possible model parameters.

Query by Committee

Train multiple models, and query instances for which committee members disagree.

Disagreement metrics:

Dealing with outliers

Unsupervised Learning

K-Means Clustering

Picking \(K\):

\[\begin{aligned} W(K) &= \sum^K_{k=1}\sum_{i \in I_k} ||x_i-\bar{x}_k||^2\\ B(K) &= \sum_{k=1}^K n_k ||\bar{x}_k - \bar{x}||^2\\ CH(K) &= \frac{B(K)/(K-1)}{W(K)/(N-K)}\\ K^* &= \operatorname*{argmax}_{K \in \{2, ..., K_{max}\}} CH(K)\\ \end{aligned}\]
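
A sketch of the \(CH\) computation given cluster assignments from any K-means run; the data and assignments below are made up:

```python
import numpy as np

def ch_index(X, labels):
    """Calinski-Harabasz index for one clustering of X."""
    N, K = len(X), len(np.unique(labels))
    overall_mean = X.mean(axis=0)
    W = B = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        mean_k = Xk.mean(axis=0)
        W += ((Xk - mean_k) ** 2).sum()                       # within-cluster scatter
        B += len(Xk) * ((mean_k - overall_mean) ** 2).sum()   # between-cluster scatter
    return (B / (K - 1)) / (W / (N - K))

X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
labels = np.array([0, 0, 1, 1])
print(ch_index(X, labels))
```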

Hierarchical Clustering

Linkage (dissimilarity) functions measure distance between clusters: single linkage uses the minimum pairwise distance, complete linkage the maximum, and average linkage the mean.

Algorithm (agglomerative): start with each point in its own cluster, then repeatedly merge the two clusters with the smallest linkage value until only one cluster (or the desired number of clusters) remains.

Principal Component Analysis (PCA)

Semi-Supervised Learning

Self-Learning

Co-Training

Assumes: the features split into two views, each view is sufficient on its own to learn a good classifier, and the two views are conditionally independent given the label.

Gaussian Mixture Model (GMM)

Assumes data is generated by first sampling a class \(y = k\) with probability \(\pi_k\), then sampling \(x\) from the corresponding Gaussian \(\mathcal{N}(\mu_k, \sigma_k)\).

With labelled data:
\[\begin{aligned} \log p(D|\theta) &= \log \prod_{i=1}^l p(x_i, y_i | \theta)\\ &= \log \prod_{i=1}^l p(y_i | \theta) p(x_i | y_i, \theta)\\ \end{aligned}\]

With labelled and unlabelled data:
\[\begin{aligned} \log p(D|\theta) &= \log \prod_{i=1}^l p(x_i, y_i | \theta) \prod_{i=l+1}^{l+u} p(x_i | \theta)\\ &= \log \prod_{i=1}^l p(y_i | \theta) p(x_i | y_i, \theta) \prod_{i=l+1}^{l+u} \sum_{k=1}^K p(y_i = k | \theta) p(x_i | y_i = k, \theta)\\ \end{aligned}\]

For a 2-class GMM, apply Expectation Maximization (EM), alternating the following two steps until convergence.

Expectation step:
\[\gamma_{i,k} \leftarrow P(y_i = k | x_i) = \frac{\pi_k \mathcal{N}(x_i; \mu_k, \sigma_k)}{\sum_{k'=1}^2 \pi_{k'} \mathcal{N}(x_i; \mu_{k'}, \sigma_{k'})}\]

Maximization step:
\[\begin{aligned} \mu_k &\leftarrow \frac{1}{\sum_{i=1}^N \gamma_{i,k}}\sum_{i=1}^N \gamma_{i,k}x_i\\ \sigma_k &\leftarrow \frac{1}{\sum_{i=1}^N \gamma_{i,k}} \sum_{i=1}^N \gamma_{i,k} (x_i - \mu_k)^2\\ \pi_k &\leftarrow \frac{1}{N} \sum_{i=1}^N \gamma_{i,k}\\ \end{aligned}\]
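
A sketch of EM for a 1-D, 2-component mixture, matching the update equations above; the data and initial parameters are made up:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.array([0.1, -0.2, 0.3, 4.8, 5.1, 5.3])
mu, sigma, pi = np.array([0.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

for _ in range(50):
    # E-step: responsibilities gamma_{i,k}
    lik = np.stack([pi[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)], axis=1)
    gamma = lik / lik.sum(axis=1, keepdims=True)
    # M-step: re-estimate means, standard deviations, and mixing weights
    Nk = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    pi = Nk / len(x)

print(mu, sigma, pi)
```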

Other methods

Graph-based

Semi-supervised SVM

Neural Networks

\(\sigma(x) = \frac{1}{1 + e^{-x}}\)

\(\frac{\partial \sigma(x)}{\partial x} = \sigma(x)(1-\sigma(x))\)

Gradient descent:

\[\begin{aligned} E &= \frac{1}{2} \sum_{m \in M} (y^{(m)} - o^{(m)})^2\\ \frac{\partial E}{\partial w_i} &= \sum_{m \in M} \frac{\partial E}{\partial o^{(m)}} \frac{\partial o^{(m)}}{\partial w_i}\\ &= \sum_{m \in M} \frac{\partial E}{\partial o^{(m)}} \frac{\partial o^{(m)}}{\partial z^{(m)}} \frac{\partial z^{(m)}}{\partial w_i}\\ &= -\sum_{m \in M} (y^{(m)} - o^{(m)})\, o^{(m)} (1 - o^{(m)})\, x_i^{(m)}\\ \Delta w_i &= -\epsilon \frac{\partial E}{\partial w_i}\\ \end{aligned}\]

where \(z^{(m)} = w \cdot x^{(m)}\) and \(o^{(m)} = \sigma(z^{(m)})\).
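
A numerical sketch of this update for a single sigmoid unit (squared loss), assuming a bias folded in as an always-1 feature; the data and learning rate are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Logical OR with a bias column; targets are 0/1.
X = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 1], [0, 0, 1]], dtype=float)
y = np.array([1.0, 1.0, 1.0, 0.0])
w = np.zeros(3)
eps = 0.5                                      # learning rate

for _ in range(2000):
    o = sigmoid(X @ w)                         # o^{(m)} for every sample
    dE_dw = -(X.T @ ((y - o) * o * (1 - o)))   # gradient from the derivation above
    w -= eps * dE_dw                           # Delta w_i = -eps * dE/dw_i

print(w, sigmoid(X @ w).round(2))
```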

For RNNs, since weights are constrained to be the same over time, compute \(\frac{\partial E}{\partial w_1}\) and \(\frac{\partial E}{\partial w_2}\). Update \(w_1\) and \(w_2\) both by \(\frac{\partial E}{\partial w_1} + \frac{\partial E}{\partial w_2}\) to enforce \(w_1 = w_2\).

GANs try to find:

\[\begin{aligned} &\min_{W_g} \max_{W_d} \sum_n \log P(x_n \text{ is real; } W_d) + \log P(g(z_n; W_g) \text{ is fake; } W_d)\\ \equiv &\min_{W_g} \max_{W_d} \sum_n \log d(x_n; W_d) + \log(1 - d(g(z_n; W_g); W_d))\\ \end{aligned}\]

To train, repeat until convergence: take a gradient step on \(W_d\) to increase the objective (better discrimination between real and generated samples) while holding \(W_g\) fixed, then take a gradient step on \(W_g\) to decrease it (more convincing generations) while holding \(W_d\) fixed.

Structured prediction

Structured SVM

Goal: learn a scoring function \(s\) such that \(s(x,y) \gt s(x, \hat y)\ \forall \hat y \ne y\), where \(y\) is the true label and \(\hat y\) is an impostor label. In general, \(s(x, y)\) takes the form \(w^T \phi(x, y)\), weights over a feature function.

We try to find \(\hat y = \operatorname*{argmax}_y w \cdot \phi(x, y)\).

Conditional Random Fields

A log linear model is used to give the probability of a label sequence \(y\) given an observation sequence \(x\):

\[p(y|x,w) = \frac{\exp\left(\sum_{j=1}^F w_j \phi_j(x,y)\right)}{\sum_{y'}\exp\left(\sum_{j=1}^F w_j \phi_j(x,y')\right)}\]

Where \(\phi_j(x,y) = \sum_{i=1}^N f_i (y_{i-1}, y_i, x_{1:N}, i)\). \(x\) is a sequence of words, \(y_i\) is the label for the current word, and \(y_{i-1}\) is the label of the previous word.
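
A tiny brute-force sketch of this probability, with made-up feature functions \(f_1, f_2\) and weights; a real CRF would use dynamic programming rather than enumerating every label sequence:

```python
import numpy as np
from itertools import product

words = ["the", "dog", "runs"]
labels = [0, 1]   # hypothetical tag set, e.g. 0 = OTHER, 1 = NOUN

def f1(y_prev, y_cur, x, i):   # fires when "dog" is labelled 1
    return 1.0 if x[i] == "dog" and y_cur == 1 else 0.0

def f2(y_prev, y_cur, x, i):   # fires on a 1 -> 1 transition
    return 1.0 if y_prev == 1 and y_cur == 1 else 0.0

w = np.array([2.0, -1.0])

def score(y_seq):
    # sum_j w_j * phi_j(x, y), with phi_j(x, y) = sum_i f_j(y_{i-1}, y_i, x, i)
    phi = np.array([sum(f(y_seq[i - 1] if i > 0 else None, y_seq[i], words, i)
                        for i in range(len(words))) for f in (f1, f2)])
    return w @ phi

scores = {y: score(y) for y in product(labels, repeat=len(words))}
Z = sum(np.exp(s) for s in scores.values())          # normalizer over all y'
p = {y: np.exp(s) / Z for y, s in scores.items()}
best = max(p, key=p.get)
print(best, p[best])
```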

Reinforcement Learning

Markov System of Rewards

Given a state transition matrix \(P\), a discount factor \(\gamma\), and a reward \(r_i\) for being in state \(s_i\), we want to find a vector \(U^*(s_i)\), the expected discounted sum of future rewards starting at state \(s_i\).

Can solve by value iteration:

\[\begin{aligned} U^1(s_i) &= r_i\\ U^2(s_i) &= r_i + \gamma \sum_{j=1}^n P_{i,j} U^1 (s_j)\\ &...\\ U^{k+1}(s_i) &= r_i + \gamma \sum_{j=1}^n P_{i,j} U^k (s_j)\\ \end{aligned}\]

When there are actions that have probabilities of outcomes (e.g. in a Markov Decision Process):

\[V^{t+1}(s_i) = \max_k \left[ r_i + \gamma \sum_{j=1}^n P^{(k)}_{i,j}V^t(s_j)\right]\]

The best action \(k\) in state \(s_i\) is the argmax over \(k\) of the previous equation.
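
A sketch of this value iteration on a made-up 2-state, 2-action MDP:

```python
import numpy as np

# P[a][i][j] = probability of moving from state i to state j under action a.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],       # action 0
              [[0.5, 0.5],
               [0.0, 1.0]]])      # action 1
r = np.array([1.0, 0.0])          # reward for being in each state
gamma = 0.9

V = r.copy()                      # V^1(s_i) = r_i
for _ in range(200):
    # V^{t+1}(s_i) = max_k [ r_i + gamma * sum_j P^(k)_{ij} V^t(s_j) ]
    V = (r + gamma * (P @ V)).max(axis=0)

policy = (r + gamma * (P @ V)).argmax(axis=0)   # best action in each state
print(V, policy)
```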

Temporal difference

\[V^\pi(s) = V^\pi(s) + \underbrace\alpha_\text{Learning rate} \underbrace{(\underbrace{r(s) + \gamma V^\pi(s')}_\text{target} - V^\pi(s))}_\text{Temporal difference}\]

Basically, this is a running average of samples:

\[\begin{aligned} x &\leftarrow R(s) + \gamma V^\pi (s')\\ V^\pi (s) &\leftarrow (1-\alpha)V^\pi(s) + \alpha x\\ \end{aligned}\]

Q Learning

Make a table \(Q: S \times A \rightarrow R\) representing the value of a state-action pair. The optimal policy is then:
\[\pi^*(s) = \operatorname*{argmax}_a Q(s,a)\]

Bellman's equation:

\[Q(s,a) \leftarrow r(s) + \gamma \sum_{s'} P(s' | s,a) \max_{a'} Q(s', a')\]

That is to say, for each possible state \(s'\) you could end up in by performing \(a\) in state \(s\), take the maximum expected reward you can get from performing some action in \(s'\), multiplied by the probability of landing in \(s'\); sum these, scale the sum by \(\gamma\), and add the received reward.
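
A sketch of iterating that backup on a made-up 2-state, 2-action model with known transition probabilities:

```python
import numpy as np

# P[a][s][s'] = transition probabilities, r[s] = reward for being in state s.
P = np.array([[[0.8, 0.2],
               [0.1, 0.9]],      # action 0
              [[0.3, 0.7],
               [0.6, 0.4]]])     # action 1
r = np.array([0.0, 1.0])
gamma = 0.9

Q = np.zeros((2, 2))             # Q[s][a]
for _ in range(200):
    # Q(s,a) <- r(s) + gamma * sum_{s'} P(s'|s,a) max_{a'} Q(s', a')
    Q = r[:, None] + gamma * np.einsum('asj,j->sa', P, Q.max(axis=1))

policy = Q.argmax(axis=1)        # pi*(s) = argmax_a Q(s, a)
print(Q, policy)
```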

Learning policy = behaviour policy (SARSA)

On-policy: the update uses the action \(a'\) actually taken in \(s'\) by the behaviour policy: \(Q(s,a) \leftarrow Q(s,a) + \alpha(r + \gamma Q(s',a') - Q(s,a))\).

Learning policy ≠ behaviour policy

Off-policy (Q learning): the update uses the greedy action in \(s'\) regardless of what the behaviour policy actually did: \(Q(s,a) \leftarrow Q(s,a) + \alpha(r + \gamma \max_{a'} Q(s',a') - Q(s,a))\).

Learning by Demonstration

The goal here is to learn a policy that mimics expert actions.

Sequential Decision Making

Learn a policy \(\pi\) that maps sensory readings to action.

\[\hat\pi = \operatorname*{argmin}_\pi E_{s \sim d_\pi}[l(s, \pi)]\]

Here, \(E\) denotes the expectation of the loss \(l\), and \(d_\pi\) is the distribution of states visited when the policy \(\pi\) is followed.

Forward Training

For each policy \(\pi_i\):

Stochastic Mixing Iterative Learning (SMILE)

For each policy \(\pi_i\):

Dataset Aggregation (Dagger)

For each policy \(\pi_i\): roll out \(\pi_i\) and record the states it visits, query the expert for the correct action in each of those states, add these examples to the aggregated dataset of all rounds so far, and train \(\pi_{i+1}\) on the aggregated dataset.

Inverse Reinforcement Learning

Given traces from an expert and a probability transition matrix \(T\), create a reward function \(R\) such that:

\[E\left[\sum_{t=0}^\infty \gamma^t R^*(s_t) | \pi^*\right] \ge E\left[\sum_{t=0}^\infty \gamma^t R^*(s_t) | \pi\right] \forall \pi\]

Feature based reward function

\[\begin{aligned} &E\left[\sum_{t=0}^\infty \gamma^t R^*(s_t) | \pi\right]\\ =& E\left[\sum_{t=0}^\infty \gamma^t w^T \phi(s_t) | \pi\right]\\ =& w^T E\left[\sum_{t=0}^\infty \gamma^t \phi(s_t) | \pi\right]\\ =& w^T \mu(\pi)\\ \end{aligned}\]

\(\mu\) is the expected cumulative discounted sum of feature values. Given \(\phi: S \rightarrow \mathbb{R}^n\), find \(w^*\) such that \(w^{*T}\mu(\pi^*) \ge w^{*T}\mu(\pi) \forall \pi\).

Since a policy's value is \(w^T\mu(\pi)\) for bounded \(w\), it suffices to find a policy whose feature expectations are close to the expert's: \(||\mu(\pi) - \mu(\pi^*)||_1 \le \epsilon\).

Apprenticeship Learning

Independence criterion

\[\frac{P(R=1|A=a)}{P(R=1|A=b)} \ge 1 - \epsilon\]

This expresses the belief that the prediction \(R\) should be (approximately) independent of the protected attribute \(A\): traits relevant for classification are independent of that attribute.

Interpretability

Local Surrogate

Examples

Counterfactuals