\[\begin{aligned} \text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}\\ \\ \text{Error Rate} &= \frac{FP + FN}{TP + TN + FP + FN}\\ \\ \text{Precision} &= \frac{TP}{TP + FP}\\ \\ \text{Recall} &= \frac{TP}{TP + FN}\\ \\ F &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\\ \\ \text{Weighted }F &= \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}\\ \end{aligned}\]
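A quick way to sanity-check these formulas is to compute them from raw confusion-matrix counts. A minimal sketch (function name and example counts are my own, assuming Python):

```python
def metrics(tp, tn, fp, fn, beta=1.0):
    """Compute the confusion-matrix metrics defined above."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    error_rate = (fp + fn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return accuracy, error_rate, precision, recall, f1, f_beta

# Example: 40 true positives, 45 true negatives, 5 false positives, 10 false negatives.
print(metrics(tp=40, tn=45, fp=5, fn=10))
```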
A regularizer is a term that adds a cost to having large weights, adding inductive bias favouring "simpler" solutions.
Ridge regularizers use 2-norms, meaning that equivalent-cost weight vectors will make a circle when plotted.
Lasso regularizers use 1-norms, meaning that equivalent-cost weight vectors make a diamond when plotted. This adds incentive for having weights with values exactly at 0 instead of simply having small values, as a ridge regularizer would.
Instead of using 0-1 loss, use a smoother loss that is easier to differentiate.
If we are learning weights \(w\) subject to loss function \(f\):
e.g. for squared loss with a 2-norm regularizer with no bias:
\[\begin{aligned} \mathcal{L}(w) &= \frac{1}{2}||Xw-Y||^2 + \frac{\lambda}{2}||w||^2\\ \nabla_w \mathcal{L}(w) &= X^T(Xw-Y) + \lambda w\\ &= X^T Xw - X^T Y + \lambda w\\ &= (X^T X + \lambda I)w - X^T Y\\ \\ (X^T X + \lambda I)w - X^T Y &= 0\\ (X^T X + \lambda I)w &= X^T Y\\ w &= (X^T X + \lambda I)^{-1} X^T Y\\ \end{aligned}\]
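A minimal numpy sketch of this closed-form ridge solution (variable names and toy data are illustrative):

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Solve (X^T X + lambda I) w = X^T Y for w."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Toy data: y ~ 3*x0 - 2*x1 plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=100)
print(ridge_fit(X, Y, lam=0.5))
```

Using `np.linalg.solve` rather than explicitly inverting \(X^TX + \lambda I\) is numerically preferable but computes the same \(w\).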
Joint probability: \(P(A, B) = P(A = a \land B = b) \forall a, b\).
Conditional probability: \(P(A|B) = \frac{P(A \land B)}{P(B)}\)
Product rule: \(P(A \land B) = P(A|B)P(B)\). Following this, \(P(A, B | C) = P(A|B, C)P(B|C)\)
Sum rule: \(\sum_i P(A_i | C) = 1\)
Two events are independent if and only if \(P(A,B) = P(A)P(B)\).
Bayes Rule, given a hypothesis \(H\) and evidence \(e\):
\[\underbrace{P(H|e)}_\text{posterior} = \frac{\underbrace{P(e|H)}_\text{likelihood} \underbrace{P(H)}_\text{prior}}{\underbrace{P(e)}_\text{normalizer}}\]
Maximum a Posteriori makes predictions by finding the hypothesis that maximizes the posterior probability of the hypothesis given the evidence, i.e. the likelihood times the prior:
\[\begin{aligned} h_{MAP} &= \operatorname*{argmax}_{h_i} P(h_i | e)\\ &= \operatorname*{argmax}_{h_i} P(h_i) P(e | h_i) \end{aligned}\]
Maximum Likelihood makes predictions by selecting the hypothesis that makes the evidence the most likely:
\[h_{ML} = \operatorname*{argmax}_{h_i} P(e | h_i)\]
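A toy illustration of the difference (my own example, not from the notes): estimating a coin's heads probability from a few flips, where MAP folds in a Beta prior while ML uses the counts alone.

```python
import numpy as np

heads, tails = 3, 1             # observed evidence
alpha, beta = 2.0, 2.0          # Beta prior pseudo-counts (belief the coin is fair-ish)

thetas = np.linspace(0.01, 0.99, 99)
likelihood = thetas**heads * (1 - thetas)**tails
prior = thetas**(alpha - 1) * (1 - thetas)**(beta - 1)

theta_ml = thetas[np.argmax(likelihood)]           # argmax P(e | h)
theta_map = thetas[np.argmax(likelihood * prior)]  # argmax P(e | h) P(h)
print(theta_ml, theta_map)   # ML ~ 0.75, MAP pulled toward 0.5 by the prior
```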
If you pick two probability distribution families for the likelihood and the prior and arrive at a posterior in the same family as the prior, then we call the prior family a conjugate prior for the likelihood family.
Functional margin:
\[\hat{\gamma}_i = y_i(w^T x_i + b)\]
If \(y_i\) and our prediction \(w^T x_i + b\) have the same sign, the functional margin \(\hat{\gamma}_i\) will be positive. A large functional margin means a confident and correct prediction.
The functional margin of a training set is the minimum functional margin of individual training examples.
Geometric margin:
\[\gamma_i = y_i \left(\frac{w^Tx_i + b}{||w||}\right)\]
This refers to the distance between the hyperplane and a point.
The geometric margin of a training set is the minimum distance between the hyperplane and any point in the training set.
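A small sketch of both margins for a fixed hyperplane (the data and hyperplane here are made up):

```python
import numpy as np

def margins(w, b, X, y):
    """Functional and geometric margins of a labelled set for hyperplane (w, b)."""
    functional = y * (X @ w + b)              # gamma_hat_i for each example
    geometric = functional / np.linalg.norm(w)
    return functional.min(), geometric.min()  # training-set margins are the minima

X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0]])
y = np.array([1, 1, -1])
print(margins(w=np.array([1.0, 1.0]), b=-1.0, X=X, y=y))
```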
Until either a maximum depth is reached or all remaining data gets grouped into the same leaf, add a decision tree node splitting on the feature that yields the highest information gain.
Given an event \(E\) with a probability \(P(E)\) of occurring, knowing for sure that \(E\) will occur tells you \(I(E) = \log_2\frac{1}{P(E)}\) bits of information.
Entropy is the expected information: sum over all possibilities of each probability multiplied by the information that outcome would yield:
\[\begin{aligned} H(s) &= \sum_i^k p_i I(s_i)\\ &= \sum_i^k p_i \log\frac{1}{p_i}\\ &= - \sum_i^k p_i \log p_i\\ \end{aligned}\]
For binary classification, this simplifies to:
\[H(s) = -p_\oplus \log p_\oplus - p_\ominus \log p_\ominus\]
Conditional entropy is the expected number of bits needed to encode \(y\) if the possible values of \(x\) are known. This is done by taking the sum for each possible value of \(x\) of the percentage of samples having that \(x\) value in the set multiplied with the entropy of the set of samples having that \(x\) value:
\[H(y | x) = \sum_v P(x = v) H(y | x = v)\]
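A sketch of entropy, conditional entropy, and the resulting information gain used to choose splits (helper names are my own):

```python
import numpy as np

def entropy(labels):
    """H(y) in bits for a list of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def conditional_entropy(x, y):
    """H(y | x) = sum_v P(x = v) H(y | x = v)."""
    x, y = np.asarray(x), np.asarray(y)
    return sum((x == v).mean() * entropy(y[x == v]) for v in np.unique(x))

def information_gain(x, y):
    return entropy(y) - conditional_entropy(x, y)

# Toy split: feature x perfectly predicts y, so the gain equals H(y) = 1 bit.
x = [0, 0, 1, 1]
y = [0, 0, 1, 1]
print(information_gain(x, y))
```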
Given an example \(x\), step through the decision tree until a leaf node with a prediction is reached.
Store all training examples (e.g. in a \(k\)-d tree).
Define distance between samples, such as Euclidean distance. Any metric will do as long as:
Pick the closest \(K\) stored training examples to the input according to the distance metric. Predict the most common label (or a random label of the possible tied most common labels) out of those \(K\).
For distance-weighted \(K\)NN, compute a weight \(w\) where \(0 \le w \le 1\) for each of the \(K\) nearest neighbours (e.g. based on distance.) Predict based on those weights, \(\frac{\sum_i w_i y_i}{\sum_i w_i}\), instead of simply the most common label \(y\). Example weight functions are \(\frac{1}{c + d(a, b)^n}\) for some \(c, n\) or \(e^{-\frac{d(a,b)^2}{\sigma^2}}\).
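A minimal sketch of distance-weighted KNN regression using the \(\frac{1}{c + d^n}\) weight function (no \(k\)-d tree, just brute-force distances; names and toy data are mine):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=3, c=1.0, n=2):
    """Distance-weighted KNN with the weight function 1 / (c + d^n)."""
    d = np.linalg.norm(X_train - x, axis=1)        # Euclidean distances to x
    nearest = np.argsort(d)[:k]                    # indices of the K closest points
    w = 1.0 / (c + d[nearest] ** n)
    return np.dot(w, y_train[nearest]) / w.sum()   # weighted average of labels

X_train = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
y_train = np.array([0.0, 0.0, 1.0])
print(weighted_knn_predict(X_train, y_train, x=np.array([0.5, 0.0])))
```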
Intuitively, the non-bias terms of \(w\) make a vector pointing toward the positive examples, normal to the decision plane. If a positive example was misclassified, it is because it was on the wrong side of the plane, so the vector from the origin to that example points in roughly the opposite direction to \(w\). Adding it to \(w\) makes the resulting \(w'\) point slightly closer to that example than it did before.
Let \(a = w \cdot x\). Then, we define the prediction \(y\):
\[y = \begin{cases}1, &a \gt 0\\0, &a \le 0\end{cases}\]
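A minimal sketch of the perceptron update described above (loop structure, epoch count, and toy data are my own; the bias is folded in as a constant feature):

```python
import numpy as np

def perceptron_train(X, y, epochs=10):
    """Train a perceptron with labels y in {0, 1}."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append constant feature for the bias
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(Xb, y):
            pred = 1 if w @ x_i > 0 else 0
            # On a mistake, nudge w toward positive examples / away from negatives.
            w += (y_i - pred) * x_i
    return w

X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, 0, 0])
print(perceptron_train(X, y))
```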
We want to maximize the posterior probability of weights given labels and data. We typically minimize the negative log of this probability instead, since it is easier to take the derivative of.
\[\begin{aligned} P(w|y,X) &= \frac{P(y|w,X)P(w|X)}{P(y|X)}\\ &= \frac{\left(\prod_{i=1}^N P(y_i|x_i, w)\right)P(w)}{P(y|X)}\\ -\ln P(w|y,X) &= -\sum_{i=1}^N \ln P(y_i|x_i, w) - \ln P(w)+\ln P(y|X)\\ &= \frac{1}{2\sigma^2}\sum_{i=1}^N(y_i - f(x_i))^2 + \frac{\alpha}{2}||w||^2 + C\\ \end{aligned}\]
From this point, take the derivative, equate to 0, and solve.
We don't estimate a single \(w\). Instead, we average all possible models, weighting different models according to their posterior probability.
The point of generative learning is to find \(P(y|x)\) by estimating \(P(x|y)\) and \(P(y)\). Generally, this would be the calculation for \(P(x|y)\):
\[\begin{aligned} P(x|y) &= P(x_1, x_2, ..., x_m | y)\\ &= P(x_1|y)P(x_2|y,x_1)P(x_3|y,x_1,x_2) ... P(x_m|y,x_1,x_2,...,x_{m-1})\\ \end{aligned}\]
Instead, we introduce the Naive Bayes assumption that the features are conditionally independent given the class, \(P(x_i, x_j | y) = P(x_i|y)P(x_j|y)\) for \(i \ne j\), giving us instead:
\[P(x|y) = \prod_{i=1}^m P(x_i|y)\]
We define probabilities:
Then, find parameters that maximize log-likelihood:
\[\begin{aligned} P(y|x) &\propto P(y)P(x|y)\\ &= \prod_{i=1}^n\left(P(y_i) \prod_{j=1}^m P(x_{i,j}|y_i)\right)\\ \log P(y|x) &\propto \sum_{i=1}^n\left(\log P(y_i) + \sum_{j=1}^m \log P(x_{i,j}|y_i)\right)\\ &= \sum_{i=1}^n\Big(y_i \log \theta_1 + (1-y_i) \log(1 - \theta_1)\\ &\quad + \sum_{j=1}^m y_i\left(x_{i,j}\log\theta_{j,1} + (1-x_{i,j})\log(1 - \theta_{j,1})\right)\\ &\quad + \sum_{j=1}^m (1-y_i)\left(x_{i,j}\log\theta_{j,0} + (1-x_{i,j})\log(1 - \theta_{j,0})\right)\Big)\\ \end{aligned}\]
For binary features, choose Bernoulli distributions for the \(\theta\) values. For continuous features, choose Gaussians.
Calculate the prior probability \(P(y=c)\) for each class \(c\).
The decision boundary can be found by checking where \(\frac{P(y=1|x)}{P(y=0|x)} = 1\), or for easier computation, \(\ln \frac{P(y=1|x)}{P(y=0|x)} = 0\).
Laplace smoothing: add a constant to the numerator and the denominator of the maximum likelihood estimate so that no probability ever goes to zero; the added counts act as hallucinated examples.
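A compact sketch of Bernoulli Naive Bayes with Laplace smoothing (function names, the smoothing constant, and the toy data are my own):

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Bernoulli Naive Bayes with Laplace smoothing (alpha hallucinated examples)."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    # P(x_j = 1 | y = c), smoothed so no probability is ever exactly 0 or 1.
    likelihoods = {c: (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
                   for c in classes}
    return priors, likelihoods

def predict(x, priors, likelihoods):
    scores = {}
    for c, p in likelihoods.items():
        scores[c] = np.log(priors[c]) + np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return max(scores, key=scores.get)

X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]])
y = np.array([1, 1, 0, 0])
priors, likelihoods = fit_bernoulli_nb(X, y)
print(predict(np.array([1, 1, 0]), priors, likelihoods))
```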
We are attempting to maximize the geometric margin of a data set. We look for:
\[\max_{w, b} \frac{1}{||w||}, \quad y_i(w^Tx_i+b) \ge 1 \quad \forall i \in \{1, ..., n\}\]
We use Lagrangian optimization. In general, we want:
\[\min_w f(w), \quad g_i(w) \le 0 \forall i, \quad h_i(w) = 0 \forall i\]
Then, we define the generalized Lagrangian:
\[L(w, \alpha, \beta) = f(w) + \sum_{i=1}^k \alpha_i g_i(w) + \sum_{i=1}^k \beta_i h_i(w)\]
The primal problem is defined:
\[\begin{aligned} \theta_P(w) &= \max_{\alpha,\beta: \alpha_i \ge 0} L(w, \alpha, \beta)\\ &= \begin{cases}f(w),&w\text{ satisfies all constraints}\\\infty,&\text{otherwise}\end{cases}\\ \end{aligned}\]
We are concerned with finding:
\[p^* = \min_w \theta_P(w) = \min_w \max_{\alpha,\beta: \alpha_i \ge 0} L(w, \alpha, \beta)\]
Instead of minimizing over \(w\) on the outside, the dual problem minimizes over \(w\) on the inside and maximizes over \(\alpha, \beta\) on the outside:
\[d^* = \max_{\alpha,\beta: \alpha_i \ge 0} \theta_D (\alpha,\beta) = \max_{\alpha,\beta: \alpha_i \ge 0} \min_w L(w, \alpha, \beta)\]
The max-min of a function is always less than or equal to the min-max of the function, so \(d^* \le p^*\). Under the following conditions, the Karush-Kuhn-Tucker (KKT) conditions, we can determine that they are equal:
e.g. to minimize \(x^2\) subject to \((x-2)^2 \le 1\):
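A worked sketch of this example (my own working, following the recipe above), with \(g(x) = (x-2)^2 - 1\):

\[\begin{aligned} L(x, \alpha) &= x^2 + \alpha\left((x-2)^2 - 1\right)\\ \frac{\partial L}{\partial x} &= 2x + 2\alpha(x-2) = 0 \implies x = \frac{2\alpha}{1 + \alpha}\\ \end{aligned}\]

If \(\alpha = 0\), then \(x = 0\), which violates \((x-2)^2 \le 1\). So by complementary slackness the constraint is active, \((x-2)^2 = 1\), giving \(x = 1\) (with \(\alpha = 1 \ge 0\)) or \(x = 3\) (which would require \(\alpha = -3 < 0\), infeasible). The minimizer is therefore \(x = 1\).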
To solve an optimal margin classifier, we want to solve \(\min_{w,b}\frac{1}{2}||w||^2\), subject to \(y_i(w^Tx_i+b) \ge 1\), \(i=1,...,n\). We can rewrite the constraints as \(g_i(w) = 1 - y_i(w^Tx_i+b) \le 0\) and solve the Lagrangian problem.
The dual formulation:
\[\max_\alpha \underbrace{\sum_{i=1}^n \alpha_i}_\text{dual variables only} - \underbrace{\frac{1}{2} \sum_{i,j=1}^n y_iy_j\alpha_i\alpha_j(x_i^T x_j)}_\text{data enters only via inner products}\]
We replace \(x_i^T x_j\) with a kernel \(K(x_i, x_j)\) so that different kinds of nonlinearities can be represented. The kernel function is equivalent to \(\phi(x_i)^T\phi(x_j)\) for some feature mapping \(\phi\). The inner-product form is useful because it can require far fewer operations than explicitly computing the feature mapping (see the sketch below). Some kernels:
Due to the dual complementarity condition, we either have \(\alpha_i = 0\) or \(1 - y_i(w^Tx_i+b) = 0\). Only constraints where the latter is true are "active" constraints, and these are the support vectors. They are points lying on the edge of the margin.
To allow for not all constraints to be met, soften the primal problem by using non-infinite values in the loss function, such as quadratic loss or hinge loss.
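To see why the inner-product form above is cheap, here is a toy sketch (not course code): the quadratic kernel \((x^T z)^2\) agrees with the inner product of explicit degree-2 feature maps without ever materializing them.

```python
import numpy as np

def quadratic_kernel(x, z):
    """K(x, z) = (x . z)^2, computed without building the feature map."""
    return float(x @ z) ** 2

def phi(x):
    """Explicit degree-2 feature map: all pairwise products x_i * x_j."""
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
print(quadratic_kernel(x, z), phi(x) @ phi(z))  # the two values agree
```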
Multiple versions of the same model are trained on different datasets. Given an initial dataset \(D\), sample from it with replacement to create a bootstrapped dataset \(D_*\) for each model instance. Take the average or majority prediction.
Tends to reduce variance but increase bias.
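A toy sketch of the bagging procedure just described, using a deliberately weak base learner (the decision stump and data are made up for illustration):

```python
import numpy as np

def fit_stump(X, y):
    """Trivial base learner: threshold feature 0 at the midpoint of the class means."""
    return (X[y == 0][:, 0].mean() + X[y == 1][:, 0].mean()) / 2

def bagging_predict(X, y, X_test, n_models=25, seed=0):
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample D_* (with replacement)
        if len(np.unique(y[idx])) < 2:               # skip degenerate resamples
            continue
        t = fit_stump(X[idx], y[idx])
        votes.append((X_test[:, 0] > t).astype(int))
    return (np.mean(votes, axis=0) >= 0.5).astype(int)  # majority vote

X = np.array([[0.0], [1.0], [4.0], [5.0]])
y = np.array([0, 0, 1, 1])
print(bagging_predict(X, y, X_test=np.array([[0.5], [4.5]])))
```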
For \(m\) samples:
Learning how to choose which samples to get labelled in order to achieve the best results.
Given a pool of unlabelled samples, ask the oracle for a label on the most uncertain one.
When using version spaces, pick samples that will eliminate half of the possible model parameters.
Train multiple models, and query instances for which committee members disagree.
Disagreement metrics:
Dealing with outliers
Picking \(K\):
\[\begin{aligned} W(K) &= \sum^K_{k=1}\sum_{i \in I_k} ||x_i-\bar{x}_k||^2\\ B(K) &= \sum_{k=1}^K n_k ||\bar{x}_k - \bar{x}||^2\\ CH(K) &= \frac{B(K)/(K-1)}{W(K)/(N-K)}\\ K^* &= \operatorname*{argmax}_{K \in \{2, ..., K_{max}\}} CH(K)\\ \end{aligned}\]
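A sketch of computing \(CH(K)\) for one given cluster assignment (picking \(K^*\) would loop this over candidate clusterings, e.g. from k-means, which is omitted here):

```python
import numpy as np

def ch_index(X, labels):
    """Calinski-Harabasz score: (B/(K-1)) / (W/(N-K)) for rows of X."""
    N, K = len(X), len(np.unique(labels))
    overall_mean = X.mean(axis=0)
    W = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
            for k in np.unique(labels))
    B = sum((labels == k).sum() * ((X[labels == k].mean(axis=0) - overall_mean) ** 2).sum()
            for k in np.unique(labels))
    return (B / (K - 1)) / (W / (N - K))

# Two well-separated blobs labelled correctly should score highly.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
print(ch_index(X, labels))
```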
Linkage (dissimilarity) functions:
Algorithm:
Assumes:
Assumes data is generated by:
With labelled data:
\[\begin{aligned}
\log p(D|\theta) &= \log \prod_{i=1}^l p(x_i, y_i | \theta)\\
&= \log \prod_{i=1}^l p(y_i | \theta) p(x_i | y_i, \theta)\\
\end{aligned}\]
With labelled and unlabelled data:
\[\begin{aligned}
\log p(D|\theta) &= \log \prod_{i=1}^l p(x_i, y_i | \theta) \prod_{i=l+1}^{l+u} p(x_i | \theta)\\
&= \log \prod_{i=1}^l p(y_i | \theta) p(x_i | y_i, \theta) \prod_{i=l+1}^{l+u} \sum_{k=1}^K p(y_i = k | \theta) p(x_i | y_i = k, \theta)\\
\end{aligned}\]
Expectation step:
\[\gamma_{i,k} \leftarrow P(y_i = k | x_i) = \frac{\pi_k \mathcal{N}(x_i; \mu_k, \sigma_k)}{\sum_{k'=1}^K \pi_{k'} \mathcal{N}(x_i; \mu_{k'}, \sigma_{k'})}\]
Maximization step:
\[\begin{aligned}
\mu_k &\leftarrow \frac{1}{\sum_{i=1}^N \gamma_{i,k}}\sum_{i=1}^N \gamma_{i,k}x_i\\
\sigma_k &\leftarrow \frac{1}{\sum_{i=1}^N \gamma_{i,k}} \sum_{i=1}^N \gamma_{i,k} (x_i - \mu_k)^2\\
\pi_k &\leftarrow \frac{1}{N} \sum_{i=1}^N \gamma_{i,k}\\
\end{aligned}\]
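A compact 1-D sketch of these E- and M-steps for a two-component Gaussian mixture (toy data, my own variable names; `sigma` here is the standard deviation, i.e. the square root of the weighted variance):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_gmm_1d(x, K=2, iters=50, seed=0):
    """EM for a 1-D Gaussian mixture, following the update equations above."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=K, replace=False)
    sigma = np.full(K, x.std())
    pi = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: responsibilities gamma[i, k] = P(y_i = k | x_i).
        dens = np.stack([pi[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities.
        Nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
        pi = Nk / len(x)
    return pi, mu, sigma

x = np.concatenate([np.random.default_rng(1).normal(-3, 1, 200),
                    np.random.default_rng(2).normal(3, 1, 200)])
print(em_gmm_1d(x))
```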
Graph-based
Semi-supervised SVM
\(\sigma(x) = \frac{1}{1 + e^{-x}}\)
\(\frac{\partial \sigma(x)}{\partial x} = \sigma(x)(1-\sigma(x))\)
Gradient descent:
\[\begin{aligned} E &= \frac{1}{2} \sum_{m \in M} (y^{(m)} - o^{(m)})^2\\ \frac{\partial E}{\partial w_i} &= \sum_{m \in M} \frac{\partial E}{\partial o^{(m)}} \frac{\partial o^{(m)}}{\partial w_i}\\ &= \sum_{m \in M} \frac{\partial E}{\partial o^{(m)}} \frac{\partial o^{(m)}}{\partial z^{(m)}} \frac{\partial z^{(m)}}{\partial w_i}\\ &= -\sum_{m \in M} (y^{(m)} - o^{(m)})\, o^{(m)} (1 - o^{(m)})\, x_i^{(m)}\\ \Delta w_i &= -\epsilon \frac{\partial E}{\partial w_i}\\ \end{aligned}\]
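A minimal sketch of this update for a single sigmoid output unit (toy data and hyper-parameters are my own; the bias is folded in as a constant feature):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sigmoid_unit(X, y, epsilon=0.5, epochs=2000):
    """Batch gradient descent on squared error for a single sigmoid unit."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = sigmoid(X @ w)
        grad = -X.T @ ((y - o) * o * (1 - o))   # dE/dw from the derivation above
        w -= epsilon * grad                      # delta w = -epsilon * dE/dw
    return w

# Toy data: learn y = 1 iff the first feature is positive.
X = np.array([[-2, 1], [-1, 1], [1, 1], [2, 1]], dtype=float)
y = np.array([0, 0, 1, 1], dtype=float)
print(train_sigmoid_unit(X, y))
```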
For RNNs, since weights are constrained to be the same over time, compute \(\frac{\partial E}{\partial w_1}\) and \(\frac{\partial E}{\partial w_2}\). Update \(w_1\) and \(w_2\) both by \(\frac{\partial E}{\partial w_1} + \frac{\partial E}{\partial w_2}\) to enforce \(w_1 = w_2\).
GANs try to find:
\[\begin{aligned} &\min_{W_g} \max_{W_d} \sum_n \log P(x_n \text{ is real; } W_d) + \log P(g(z_n; W_g) \text{ is fake; } W_d)\\ \equiv &\min_{W_g} \max_{W_d} \sum_n \log d(x_n; W_d) + \log(1 - d(g(z_n; W_g); W_d))\\ \end{aligned}\]
To train, repeat until convergence:
Goal: learn a scoring function \(s(x,y) \gt s(x, \hat y) \forall \hat y \ne y\) where \(y\) is the true label and \(\hat y\) is an impostor label. In general, \(s(x, y)\) takes the form \(w^T \phi(x, y)\), weights over a feature function.
We try to find \(\hat y = \operatorname*{argmax}_y w \cdot \phi(x, y)\).
A log linear model is used to give the probability of a label sequence \(y\) given an observation sequence \(x\):
\[p(y|x,w) = \frac{\exp\left(\sum_{j=1}^F w_j \phi_j(x,y)\right)}{\sum_{y'}\exp\left(\sum_{j=1}^F w_j \phi_j(x,y')\right)}\]
Where \(\phi_j(x,y) = \sum_{i=1}^N f_i (y_{i-1}, y_i, x_{1:N}, i)\). \(x\) is a sequence of words, \(y_i\) is the label for the current word, and \(y_{i-1}\) is the label of the previous word.
Given a state transition matrix \(P\), a discount factor \(\gamma\), and a reward \(r_i\) for being in state \(s_i\), we want to find a vector \(U^*(s_i)\): the expected discounted sum of future rewards starting at state \(s_i\).
Can solve by value iteration:
\[\begin{aligned} U^1(s_i) &= r_i\\ U^2(s_i) &= r_i + \gamma \sum_{j=1}^n P_{i,j} U^1 (s_j)\\ &...\\ U^{k+1}(s_i) &= r_i + \gamma \sum_{j=1}^n P_{i,j} U^k (s_j)\\ \end{aligned}\]
When there are actions that have probabilities of outcomes (e.g. in a Markov Decision Process):
\[V^{t+1}(s_i) = \max_k \left[ r_i + \gamma \sum_{j=1}^n P^{(k)}_{i,j}V^t(s_j)\right]\]
The best action \(k\) in state \(s_i\) is the argmax over \(k\) of the previous equation.
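A sketch of value iteration over actions (the two-state MDP here is a made-up example; `P[k]` is the transition matrix for action \(k\)):

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, iters=100):
    """P[k][i, j] = P(s_j | s_i, action k); r[i] is the reward in state s_i."""
    V = np.array(r, dtype=float)              # V^1(s_i) = r_i
    for _ in range(iters):
        V = np.max([r + gamma * P[k] @ V for k in range(len(P))], axis=0)
    policy = np.argmax([r + gamma * P[k] @ V for k in range(len(P))], axis=0)
    return V, policy

# Two states, two actions: action 0 stays put, action 1 moves to the other state.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],      # action 0
              [[0.0, 1.0], [1.0, 0.0]]])     # action 1
r = np.array([0.0, 1.0])
print(value_iteration(P, r))
```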
\[V^\pi(s) \leftarrow V^\pi(s) + \underbrace{\alpha}_\text{Learning rate} \underbrace{(\underbrace{r(s) + \gamma V^\pi(s')}_\text{target} - V^\pi(s))}_\text{Temporal difference}\]
Basically, this is a running average of samples:
\[\begin{aligned} x &\leftarrow R(s) + \gamma V^\pi (s')\\ V^\pi (s) &\leftarrow (1-\alpha)V^\pi(s) + \alpha x\\ \end{aligned}\]
Make a table \(Q: S \times A \rightarrow \mathbb{R}\) representing the value of a state-action pair. The optimal policy is then:
\[\pi^*(s) = \operatorname*{argmax}_a Q(s,a)\]
Bellman's equation:
\[Q(s,a) \leftarrow r(s) + \gamma \sum_{s'} P(s' | s,a) \max_{a'} Q(s', a')\]
That is to say, for each possible state \(s'\) you could end up in by performing \(a\) at state \(s\), take the maximum expected reward you can get from performing some action from \(s'\), multiply it by the probability of landing in \(s'\), and sum over all \(s'\). Scale this sum by \(\gamma\) and add it to the received reward.
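A sketch of repeatedly applying this backup to a small table (the three-state chain MDP is made up; in a model-free setting the expectation over \(s'\) would be replaced by sampled transitions):

```python
import numpy as np

def q_value_iteration(P, r, gamma=0.9, iters=100):
    """Apply the Bellman backup above; P[a][s, s'] is the transition model."""
    n_actions, n_states = P.shape[0], P.shape[1]
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        for a in range(n_actions):
            Q[:, a] = r + gamma * P[a] @ Q.max(axis=1)
    return Q

# Three-state chain: action 0 moves left (or stays at 0), action 1 moves right (or stays at 2).
P = np.array([[[1, 0, 0], [1, 0, 0], [0, 1, 0]],
              [[0, 1, 0], [0, 0, 1], [0, 0, 1]]], dtype=float)
r = np.array([0.0, 0.0, 1.0])
Q = q_value_iteration(P, r)
print(Q, Q.argmax(axis=1))   # optimal policy = argmax_a Q(s, a): always move right
```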
The goal here is to learn a policy that mimics expert actions.
Learn a policy \(\pi\) that maps sensory readings to action.
\[\hat\pi = \operatorname*{argmin}_\pi E_{s \sim d_\pi}[l(s, \pi)]\]
Here, the expectation is the expected loss over \(d_\pi\), the distribution of states visited when the policy is followed.
For each policy \(\pi_i\):
For each policy \(\pi_i\):
For each policy \(\pi_i\):
Given traces from an expert and a probability transition matrix \(T\), create a reward function \(R\) such that:
\[E\left[\sum_{t=0}^\infty \gamma^t R^*(s_t) | \pi^*\right] \ge E\left[\sum_{t=0}^\infty \gamma^t R^*(s_t) | \pi\right] \forall \pi\]
\[\begin{aligned} &E\left[\sum_{t=0}^\infty \gamma^t R^*(s_t) | \pi\right]\\ =& E\left[\sum_{t=0}^\infty \gamma^t w^T \phi(s_t) | \pi\right]\\ =& w^T E\left[\sum_{t=0}^\infty \gamma^t \phi(s_t) | \pi\right]\\ =& w^T \mu(\pi)\\ \end{aligned}\]
\(\mu\) is the expected cumulative discounted sum of feature values. Given \(\phi: S \rightarrow \mathbb{R}^n\), find \(w^*\) such that \(w^{*T}\mu(\pi^*) \ge w^{*T}\mu(\pi) \forall \pi\).
It suffices that \(||\mu(\pi) - \mu(\pi^*)||_1 \le \epsilon\).
\[\frac{P(R=1|A=a)}{P(R=1|A=b)} \ge 1 - \epsilon\]
This expresses the belief that the traits relevant for classification are independent of the protected attribute \(A\).