theme: bricks class: text-center transition: slide-left mdc: true layout: cover
CS115 - Math for Computer Science
University of Information Technology - Vietnam National University
layout: center background: https://cover.sli.dev
In Logistic Regression, we have the loss function:
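The equation itself appears to have been dropped during conversion; the standard cross-entropy form, which the next sentence describes, is (here $\hat{y}_i = \sigma(b + \mathbf{w}^{\intercal}\mathbf{x}_i)$ with $\sigma$ the sigmoid, a notational assumption):

$$
J(\mathbf{w}, b) = -\sum_{i=1}^{N} \Big( y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \Big)
$$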
The loss function is the relative entropy (cross-entropy) between the true distribution and the predicted distribution.
We also know that the classification boundary is a hyperplane:
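In the notation used throughout the rest of these slides, the boundary is:

$$
b + \mathbf{w}^{\intercal}\mathbf{x} = 0
$$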
We can see that the shape of the loss function differs between the two cases. The loss function of Logistic Regression is convex, which means that it has only one global minimum. This is a good property for optimization.
layout: default image: https://cover.sli.dev level: 2 title: The hinge loss function
In SVM, we modify the loss function of Logistic Regression to make it more robust to outliers. We do this by using a function that only penalizes points that violate the margin, i.e. points that lie on the wrong side of the decision boundary or too close to it.
The functions above are called hinge loss functions. The hinge loss is defined as:
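The definition seems to be missing here; a standard piecewise form (the exact notation on the original slide is an assumption) is:

$$
L\big(y_i, h_{\mathbf{w},b}(\mathbf{x}_i)\big) =
\begin{cases}
0 & \text{if } y_i(b + \mathbf{w}^{\intercal}\mathbf{x}_i) \geq 1 \\
1 - y_i(b + \mathbf{w}^{\intercal}\mathbf{x}_i) & \text{otherwise}
\end{cases}
$$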
Which simplifies to:
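That is, the usual compact form (again written in this deck's notation):

$$
L\big(y_i, h_{\mathbf{w},b}(\mathbf{x}_i)\big) = \max\big(0,~ 1 - y_i(b + \mathbf{w}^{\intercal}\mathbf{x}_i)\big)
$$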
The margin is the distance between the decision boundary and the closest point of the training set. The support vectors are the points that lie on the margin.
Let $y_i \in \{-1, 1\}$ be the label of the training point $\mathbf{x}_i$. For a correctly classified point:
- If $y_i = 1$, then $b+\mathbf{w}^{\intercal}\mathbf{x}_i \geq 0$ $\implies$ $y_i(b+\mathbf{w}^{\intercal}\mathbf{x}_i) \geq 0$
- If $y_i = -1$, then $b+\mathbf{w}^{\intercal}\mathbf{x}_i \leq 0$ $\implies$ $y_i(b+\mathbf{w}^{\intercal}\mathbf{x}_i) \geq 0$
In both cases, $y_i(b+\mathbf{w}^{\intercal}\mathbf{x}_i) \geq 0$ whenever the point is classified correctly.
title: Find the best decision boundary layout: image-right image: ./image2.png backgroundSize: contain level: 3
The set of points that lie on the margin are the support vectors. The best decision boundary is the one that maximizes the margin. This can be formulated as an optimization problem:
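The formulation itself appears to have been lost in conversion; the standard margin-maximization problem, written in the notation of these slides, is:

$$
\hat{\mathbf{w}}, \hat{b} = \arg\max_{\mathbf{w}, b} \left\{ \min_{i} \frac{y_i(b + \mathbf{w}^{\intercal}\mathbf{x}_i)}{||\mathbf{w}||_2} \right\}
$$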
Back to the equation of the decision boundary: $b + \mathbf{w}^{\intercal}\mathbf{x} = 0$.

If we scale $\mathbf{w}$ and $b$ by the same positive constant, the decision boundary does not change.

Which means that we can scale $\mathbf{w}$ and $b$ so that $y_i(b + \mathbf{w}^{\intercal}\mathbf{x}_i) = 1$ for the points closest to the boundary; the margin then equals $\frac{1}{||\mathbf{w}||_2}$.

Plugging this into the optimization problem above, we get:
$$
\hat{\mathbf{w}}, \hat{b} = \arg\max_{\mathbf{w}, b} \frac{1}{||\mathbf{w}||_2} \quad \text{s.t. } y_i(b + \mathbf{w}^{\intercal}\mathbf{x}_i) \geq 1, ~~ \forall i \tag{2}
$$
Simplifying the optimization problem (2): maximizing $\frac{1}{||\mathbf{w}||_2}$ is equivalent to minimizing $\frac{1}{2}||\mathbf{w}||_2^2$.

The squared norm is used because it is differentiable everywhere and keeps the problem convex.

The optimization problem becomes a convex quadratic program with linear constraints:

$$
\hat{\mathbf{w}}, \hat{b} = \arg\min_{\mathbf{w}, b} \frac{1}{2}||\mathbf{w}||_2^2 \quad \text{s.t. } 1 - y_i(b + \mathbf{w}^{\intercal}\mathbf{x}_i) \leq 0, ~~ \forall i
$$
title: The KKT conditions and the Lagrange multipliers method level: 2 layout: two-cols-header hideInToc: false
The optimization problem of the SVM is a constrained optimization problem. In general, for an objective $f(\mathbf{x})$ with equality constraints $h_i(\mathbf{x}) = 0$ and inequality constraints $g_j(\mathbf{x}) \leq 0$, the optimal solution must satisfy the Karush-Kuhn-Tucker (KKT) conditions:
::left::
- Stationarity: The gradient of the Lagrangian with respect to the primal variables is zero:
- Primal feasibility: The primal variables satisfy the constraints:
::right::
- Dual feasibility: The dual variables are non-negative:
- Complementary slackness: For each inequality constraint, either the constraint is active or its multiplier is zero:
We have the Lagrangian Dual function:
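In the notation of the KKT conditions below ($\lambda_i$ for the equality constraints $h_i$, $\nu_j$ for the inequality constraints $g_j$), the Lagrangian and its dual function take the standard form:

$$
\mathcal{L}(\mathbf{x}, \lambda, \nu) = f(\mathbf{x}) + \sum_{i=1}^{m}\lambda_i h_i(\mathbf{x}) + \sum_{j=1}^{n} \nu_j g_j(\mathbf{x}), \qquad g(\lambda, \nu) = \min_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \lambda, \nu)
$$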
Which we can use to derive the KKT conditions:
$$
\begin{aligned}
\text{Stationarity:} \quad & \nabla_{\mathbf{x}} f(\mathbf{x}) + \sum_{i=1}^{m}\lambda_i \nabla_{\mathbf{x}} h_i(\mathbf{x}) + \sum_{j=1}^{n} \nu_j \nabla_{\mathbf{x}} g_j(\mathbf{x}) = 0 \\
\text{Complementary slackness:} \quad & \nu_j g_j(\mathbf{x}) = 0, ~~ \forall j \\
\text{Primal feasibility:} \quad & h_i(\mathbf{x}) = 0, \quad g_j(\mathbf{x}) \leq 0, ~~ \forall i, j \\
\text{Dual feasibility:} \quad & \nu_j \geq 0, ~~ \forall j
\end{aligned}
$$
Applying this to the hard margin SVM, which has only inequality constraints, the Lagrangian Dual function is:
$$
g(\lambda) = \min_{\mathbf{w}, b} \mathcal{L}(\mathbf{w}, b, \lambda), \qquad \mathcal{L}(\mathbf{w}, b, \lambda) = \frac{1}{2} ||\mathbf{w}||_2^2 + \sum_{i=1}^N \lambda_i\left(1 - y_i(\mathbf{w}^{\intercal}\mathbf{x}_i + b) \right) \tag{4}
$$
The optimal solution of the Lagrangian Dual function is found by setting the gradient of $\mathcal{L}$ with respect to the primal variables to zero:
$$
\frac{\partial \mathcal{L}(\mathbf{w}, b, \lambda)}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^N \lambda_i y_i \mathbf{x}_i = 0 \Rightarrow \mathbf{w} = \sum_{i=1}^N \lambda_i y_i \mathbf{x}_i \tag{6}
$$
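The matching condition for $b$ seems to have been dropped in conversion; it is the step that makes the $b$-term vanish in the substitution below:

$$
\frac{\partial \mathcal{L}(\mathbf{w}, b, \lambda)}{\partial b} = -\sum_{i=1}^N \lambda_i y_i = 0 \Rightarrow \sum_{i=1}^N \lambda_i y_i = 0
$$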
We get the optimal solution of the SVM optimization problem by solving these stationarity equations. Substituting (6) back into the Lagrangian gives:
$$
\begin{aligned}
g(\lambda) &= \frac{1}{2} ||\mathbf{w}||_2^2 + \sum_{i=1}^N\lambda_i - \mathbf{w}^{\intercal} \underbrace{\sum_{i=1}^{N} \lambda_i y_i \mathbf{x}_i}_{\mathbf{w}} - b \underbrace{\sum_{i=1}^N \lambda_i y_i}_{0} \\
&= \sum_{i=1}^N\lambda_i - \frac{1}{2} ||\mathbf{w}||_2^2 \\
&= \sum_{i=1}^N\lambda_i - \frac{1}{2} \sum_{i=1}^N \lambda_i y_i \mathbf{x}_i^{\intercal} \sum_{j=1}^N \lambda_j y_j \mathbf{x}_j \\
&= \sum_{i=1}^N \lambda_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \lambda_i \lambda_j y_i y_j \mathbf{x}_i^{\intercal} \mathbf{x}_j
\end{aligned}
$$
The minimum of the Lagrangian over $\mathbf{w}$ and $b$ is therefore the dual function $g(\lambda)$, which we maximize subject to $\lambda_i \geq 0$ and $\sum_{i=1}^N \lambda_i y_i = 0$.
At the optimal solution, the points that lie on the margin or beyond satisfy:
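The condition itself appears to be missing here; by complementary slackness it would be:

$$
\lambda_i \big(1 - y_i(b + \mathbf{w}^{\intercal}\mathbf{x}_i)\big) = 0 \quad \Rightarrow \quad \lambda_i = 0 ~~\text{or}~~ y_i(b + \mathbf{w}^{\intercal}\mathbf{x}_i) = 1, \qquad \forall i
$$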
In practice, the points that lie on the margin are the support vectors. The optimal solution of the SVM optimization problem depends only on the set of support vectors $\mathcal{S} = \{(\mathbf{x}_i, y_i) : \lambda_i \neq 0\}$.
Equation (6) for $\mathbf{w}$ then reduces to a sum over the support vectors only:
$$
\mathbf{w} = \sum_{i=1}^{N} \lambda_i y_i\mathbf{x}_i = \sum_{i=1, \lambda_i \neq 0}^{N} \lambda_i y_i\mathbf{x}_i = \sum_{(\mathbf{x}_i, y_i) \in \mathcal{S}} \lambda_i y_i\mathbf{x}_i
$$
The decision boundary is the set of points that satisfy $b + \mathbf{w}^{\intercal}\mathbf{x} = 0$.

Once we have $\mathbf{w}$, we can find $b$ from any support vector, for which $y_i(b + \mathbf{w}^{\intercal}\mathbf{x}_i) = 1$:
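A common way to write this (averaging over all support vectors for numerical stability, analogous to the soft margin formula later in these slides):

$$
b = \frac{1}{|\mathcal{S}|} \sum_{(\mathbf{x}_i, y_i) \in \mathcal{S}} \big( y_i - \mathbf{w}^{\intercal}\mathbf{x}_i \big)
$$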
The class of a new point $\mathbf{x}_i$ is determined by the sign of the decision function:
$$
\begin{aligned}
h_{\mathbf{w}, b}(\mathbf{x}_i) &= b + \mathbf{w}^{\intercal}\mathbf{x}_i \\
&= b + \Big( \sum_{(\mathbf{x}_j, y_j) \in \mathcal{S}}\lambda_j y_j \mathbf{x}_j^{\intercal} \Big)\mathbf{x}_i \\
&= b + \sum_{(\mathbf{x}_j, y_j) \in \mathcal{S}} \lambda_j y_j \mathbf{x}_{j}^{\intercal} \mathbf{x}_i
\end{aligned}
$$
The class of $\mathbf{x}$ is then:
$$
\hat{y} = \begin{cases} 1 & \text{if } h_{\mathbf{w}, b}(\mathbf{x}) > 0 \\ -1 & \text{if } h_{\mathbf{w}, b}(\mathbf{x}) \leq 0 \end{cases}
$$
We can see that the decision function is a linear combination of the support vectors. This is why SVM is called a sparse model: the decision function depends only on a few points of the training set.
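A minimal scikit-learn sketch illustrating this sparsity (the toy data, the large `C` used to approximate a hard margin, and all names are assumptions, not taken from the original slides):

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2D data: two linearly separable blobs labelled -1 and +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# A very large C approximates the hard margin SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# Only a handful of training points end up as support vectors;
# the decision function h(x) = b + sum_j lambda_j y_j x_j^T x uses them alone.
print("number of support vectors:", clf.support_vectors_.shape[0])
print("w =", clf.coef_[0], "b =", clf.intercept_[0])
```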
While the hard margin SVM tries to find the decision boundary that separates the classes perfectly, the soft margin SVM allows some points to lie on the wrong side of the decision boundary. This is done by introducing a slack variable $\xi_i \geq 0$ for each training point, which measures how far that point violates the margin.
The optimization problem of the soft margin SVM is:
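The problem statement seems to have been dropped here; the standard soft margin primal, consistent with the Lagrangian that follows, is:

$$
\hat{\mathbf{w}}, \hat{b}, \hat{\xi} = \arg\min_{\mathbf{w}, b, \xi} \frac{1}{2}||\mathbf{w}||_2^2 + C\sum_{i=1}^N \xi_i \quad \text{s.t. } y_i(\mathbf{w}^{\intercal}\mathbf{x}_i + b) \geq 1 - \xi_i, ~~ \xi_i \geq 0, ~~ \forall i
$$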
The Lagrangian of the soft margin SVM optimization problem is:
$$
\mathcal{L}(\mathbf{w}, b, \xi, \lambda, \nu) = \frac{1}{2}||\mathbf{w}||_2^2 + C \sum_{i=1}^N \xi_i + \sum_{i=1}^N \lambda_i \left( 1 - \xi_i - y_i(\mathbf{w}^{\intercal}\mathbf{x}_i + b)\right) - \sum_{i=1}^N \nu_i \xi_i
$$
where the Lagrange multipliers $\lambda_i \geq 0$ and $\nu_i \geq 0$ must satisfy the KKT conditions.

The solution of the optimization problem becomes:
$$
\frac{\partial \mathcal{L}(\mathbf{w}, b, \xi, \lambda, \nu)}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^N \lambda_i y_i \mathbf{x}_i = 0 \Rightarrow \mathbf{w} = \sum_{i=1}^N \lambda_i y_i \mathbf{x}_i \tag{10}
$$
From $\frac{\partial \mathcal{L}(\mathbf{w}, b, \xi, \lambda, \nu)}{\partial \xi_i} = C - \lambda_i - \nu_i = 0$, we get $\lambda_i = C - \nu_i$,

which means that $0 \leq \lambda_i \leq C$, since both $\lambda_i \geq 0$ and $\nu_i \geq 0$.

Plugging these conditions back into the Lagrangian eliminates $\mathbf{w}$, $b$, and $\xi$.

Now, the optimization problem is to maximize the dual function $g(\lambda)$:
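The resulting dual is the same expression as in the hard margin case, now with the box constraint (the exact form shown on the original slide is an assumption):

$$
g(\lambda) = \sum_{i=1}^N \lambda_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \lambda_i \lambda_j y_i y_j \mathbf{x}_i^{\intercal} \mathbf{x}_j, \qquad \text{s.t. } 0 \leq \lambda_i \leq C, ~~ \sum_{i=1}^N \lambda_i y_i = 0
$$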
This problem is similar to the hard margin SVM optimization problem, but with the additional constraint that $0 \leq \lambda_i \leq C$.
We perform the same steps as the hard margin SVM to find the optimal solution of the soft margin SVM:
$$
\mathbf{w} = \sum_{i \in \mathcal{S}} \lambda_i y_i \mathbf{x}_i
\newline
b = \frac{1}{N_{\mathcal{M}}} \sum_{n \in \mathcal{M}} (y_n - \mathbf{w}^T\mathbf{x}_n) = \frac{1}{N_{\mathcal{M}}} \sum_{n \in \mathcal{M}} \left(y_n - \sum_{i \in \mathcal{S}} \lambda_i y_i \mathbf{x}_i^T\mathbf{x}_n\right)
$$
with $\mathcal{S}$ the set of support vectors ($\lambda_i \neq 0$), $\mathcal{M}$ the set of points that lie exactly on the margin ($0 < \lambda_i < C$), and $N_{\mathcal{M}}$ the number of points in $\mathcal{M}$.
The class of a new point $\mathbf{x}$ is again determined by the sign of the decision function:
$$
h_{\mathbf{w}, b}(\mathbf{x}) = b + \sum_{i=1}^{N} \lambda_i y_i \mathbf{x}_i^{\intercal} \mathbf{x}
$$
We can see that the introduction of the slack variables $\xi_i$ lets the model tolerate some misclassified or margin-violating points, which makes the soft margin SVM more robust to noise and outliers.
However, the choice of the hyperparameter $C$ matters: a large $C$ penalizes margin violations heavily and behaves like the hard margin SVM, while a small $C$ tolerates more violations in exchange for a wider margin.
The main idea of the kernel trick is to map the input data into a higher-dimensional space where the classes are linearly separable. This is done by defining a kernel function $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^{\intercal}\phi(\mathbf{x}_j)$ that computes the inner product of the mapped points without ever computing the mapping $\phi$ explicitly.
The kernel function must satisfy the following properties:
- Symmetry: $K(\mathbf{x}_i, \mathbf{x}_j) = K(\mathbf{x}_j, \mathbf{x}_i)$
- Positive semi-definiteness: For any set of points $\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_n$, the matrix $K$ defined by $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$ is positive semi-definite.
- Mercer's theorem: If $K(\mathbf{x}_i, \mathbf{x}_j)$ is continuous, symmetric, and positive semi-definite, then it can be expressed as an inner product in a higher-dimensional space.
There are several types of kernel functions that are commonly used in SVM:
- Linear kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^{\intercal}\mathbf{x}_j$
- Polynomial kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^{\intercal}\mathbf{x}_j + c)^d$
- Gaussian (RBF) kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{||\mathbf{x}_i - \mathbf{x}_j||^2}{2\sigma^2}\right)$
- Sigmoid kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\alpha \mathbf{x}_i^{\intercal}\mathbf{x}_j + c)$
- Laplacian kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{||\mathbf{x}_i - \mathbf{x}_j||}{\sigma}\right)$
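As an illustration, the kernels listed above map onto scikit-learn's `SVC` almost one-to-one; the Laplacian kernel has no built-in string option, so it is passed as a callable. The parameter values below are placeholders:

```python
from sklearn.svm import SVC
from sklearn.metrics.pairwise import laplacian_kernel

models = {
    "linear":     SVC(kernel="linear"),
    "polynomial": SVC(kernel="poly", degree=3, coef0=1.0),      # (x_i^T x_j + c)^d
    "rbf":        SVC(kernel="rbf", gamma=0.5),                 # exp(-gamma * ||x_i - x_j||^2)
    "sigmoid":    SVC(kernel="sigmoid", gamma=0.5, coef0=1.0),  # tanh(alpha * x_i^T x_j + c)
    "laplacian":  SVC(kernel=lambda X, Y: laplacian_kernel(X, Y, gamma=0.5)),
}
# Each model is then trained the usual way, e.g. models["rbf"].fit(X_train, y_train)
```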
The EMNIST (Extended MNIST) dataset contains 814,255 characters from 62 classes, each represented as a 28x28 pixel image. This experiment uses the ByClass split of EMNIST, which covers all 62 classes.
The EMNIST ByClass dataset is extremely unbalanced, with some classes having very few samples. To balance the dataset, we merge uppercase and lowercase characters and then undersample so that each class has an equal number of samples. We then augment the data by rotating, scaling, and shifting the images.
The final dataset has 47 classes: 10 digits and 37 letters.
We then perform standard preprocessing steps such as normalization and splitting the dataset into training and testing sets.
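A minimal sketch of these steps with scikit-learn (the placeholder arrays, split ratio, and random seed are assumptions, not taken from the original slides):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the prepared EMNIST data
# (flattened 28x28 images and 47 class labels).
X = np.random.randint(0, 256, size=(1000, 28 * 28))
y = np.random.randint(0, 47, size=1000)

# Normalization: scale pixel values from [0, 255] to [0, 1].
X = X.astype("float32") / 255.0

# Split into training and testing sets, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```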
This experiment utilizes three models for comparison: a CNN, an SVM with a linear kernel, and a KNN classifier.
- A Convolutional Neural Network (CNN) with 3 convolutional layers and 2 fully connected layers.
- The CNN is trained using the Adam optimizer with a learning rate of 0.001.
- The CNN is trained for 10 epochs with a batch size of 64.
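A minimal PyTorch sketch consistent with this description (3 convolutional layers, 2 fully connected layers, Adam with lr = 0.001); the channel widths, pooling, and hidden size are assumptions, not taken from the original slides:

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 47):
        super().__init__()
        # Three convolutional layers, each followed by ReLU and 2x2 max pooling.
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 28 -> 14
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 14 -> 7
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 7 -> 3
        )
        # Two fully connected layers.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 3 * 3, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = SmallCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
# Training then loops over the EMNIST batches (batch size 64) for 10 epochs.
```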
We use GridSearchCV to find the best hyperparameters for the SVM model. The hyperparameters are:
- C: The regularization parameter.
- kernel: The kernel function used in the SVM model.
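A sketch of how that search might look (the candidate values are placeholders, and `X_train`, `y_train` are the training split from the preprocessing sketch above):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [0.1, 1, 10],            # regularization parameter
    "kernel": ["linear", "rbf"],  # kernel function
}

search = GridSearchCV(SVC(), param_grid, cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print("best hyperparameters:", search.best_params_)
```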
- A K-Nearest Neighbors (KNN) model with k=7.
- The KNN model uses the Euclidean distance metric.
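The corresponding scikit-learn configuration is a short sketch (again reusing the training split from above):

```python
from sklearn.neighbors import KNeighborsClassifier

# k = 7 with the Euclidean (L2) distance metric.
knn = KNeighborsClassifier(n_neighbors=7, metric="euclidean")
knn.fit(X_train, y_train)
```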
We evaluate the models using the following metrics:
- Accuracy: The percentage of correctly classified samples.
- Precision: The ratio of correctly predicted positive observations to the total predicted positive observations.
- Recall: The ratio of correctly predicted positive observations to all observations in the actual class.
- F1 Score: The harmonic mean of Precision and Recall.
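A sketch of how these metrics might be computed with scikit-learn (weighted averaging over the 47 classes is an assumption; `y_test` and the fitted `knn` model come from the sketches above):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = knn.predict(X_test)  # or the SVM / CNN predictions

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, average="weighted", zero_division=0))
print("recall   :", recall_score(y_test, y_pred, average="weighted", zero_division=0))
print("f1 score :", f1_score(y_test, y_pred, average="weighted", zero_division=0))
```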
We compare the performance of the CNN, SVM, and KNN models on the EMNIST dataset through Accuracy, Recall, and Precision.
title: Results level: 2 hideInToc: true layout: none
<style> .images { display: flex; justify-content: space-between; } .images img { max-width: 28%; /* Adjust size of images */ } </style>