Support Vector Machines

Published by 拉普拉斯的汪 on 2020-12-11

Reference:
  • Section 3.7 of Pattern Recognition, by Sergios Theodoridis and Konstantinos Koutroumbas (2009)
  • Slides of CS4220, TUD

Margins: Intuition

Let $\mathbf x_i,\ i=1,2,\ldots,N$, be the feature vectors of the training set, $X$. These belong to either of two classes, $\omega_1$ and $\omega_2$, which are assumed to be linearly separable. The goal is to design a hyperplane
$$g(\mathbf x)=\mathbf w^T\mathbf x+w_0=0$$
that correctly classifies all the training vectors. Such a hyperplane is not unique. For instance, Figure 3.9 illustrates the classification task with two possible hyperplane solutions. Which one should we choose as the classifier in practice? No doubt the answer is: the full-line one. The reason is that this hyperplane leaves more “room” on either side, so that data in both classes can move a bit more freely, with less risk of causing an error.

[Figure 3.9: A linearly separable two-class task with two possible hyperplane classifiers; the full-line one leaves a larger margin.]

Separable Classes

After the above brief discussion, we are ready to accept that a very sensible choice for the hyperplane classifier would be the one that leaves the maximum margin from both classes. Let us now quantify the term margin.

Every hyperplane is characterized by its direction (determined by $\mathbf w$) and its exact position in space (determined by $w_0$). Since we want to give no preference to either of the classes, it is reasonable, for each direction, to select the hyperplane that has the same distance from the respective nearest points in $\omega_1$ and $\omega_2$. This is illustrated in Figure 3.10.

[Figure 3.10: Two candidate directions; the margin for direction 1 is $2z_1$ and the margin for direction 2 is $2z_2$.]

The margin for direction 1 is $2z_1$ and the margin for direction 2 is $2z_2$. Our goal is to search for the direction that gives the maximum possible margin.

Recall that the distance of a point from a hyperplane is given by
$$z=\frac{|g(\mathbf x)|}{\|\mathbf w\|}$$
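As a quick numeric check (the hyperplane and point below are made-up toy values, purely for illustration), the distance formula is a one-liner:

```python
import numpy as np

# Toy hyperplane g(x) = w^T x + w_0 and an arbitrary point (illustrative values only)
w = np.array([3.0, 4.0])
w0 = -1.0
x = np.array([2.0, 1.0])

g = w @ x + w0                    # g(x) = 3*2 + 4*1 - 1 = 9
z = abs(g) / np.linalg.norm(w)    # |g(x)| / ||w|| = 9 / 5
print(z)                          # 1.8
```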
We can now scale $\mathbf w, w_0$ so that the value of $g(\mathbf x)$ at the nearest points in $\omega_1, \omega_2$ is equal to $1$ for $\omega_1$ and, thus, equal to $-1$ for $\omega_2$. This is equivalent to

  1. Having a margin of $\frac{1}{\|\mathbf w\|}+\frac{1}{\|\mathbf w\|}=\frac{2}{\|\mathbf w\|}$
  2. Requiring that

$$\begin{array}{ll} \mathbf w^T\mathbf x+w_0\ge 1, & \forall\,\mathbf x\in\omega_1 \\ \mathbf w^T\mathbf x+w_0\le -1, & \forall\,\mathbf x\in\omega_2 \end{array}$$

For each $\mathbf x_i$, we denote the corresponding class indicator by $y_i$ ($+1$ for $\omega_1$, $-1$ for $\omega_2$). Obviously, minimizing the norm makes the margin maximum. Our task can now be summarized as: compute the parameters $\mathbf w, w_0$ of the hyperplane so as to
$$\begin{aligned} &\underset{\mathbf w,w_0}{\operatorname{minimize}} && J(\mathbf w,w_0)=\frac{1}{2}\|\mathbf w\|^2\\ &\text{subject to} && y_i(\mathbf w^T\mathbf x_i+w_0)\ge 1,\quad i=1,2,\cdots,N \end{aligned}\tag{1}$$
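Before turning to the Lagrangian machinery, it may help to see that $(1)$ is an ordinary constrained optimization problem. The sketch below hands it to a general-purpose solver on a toy separable set; this is purely illustrative (dedicated SVM solvers work on the dual derived later), and all data values are made up:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set (made-up values)
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Optimization variable: theta = [w_1, w_2, w_0]
def objective(theta):
    w = theta[:-1]
    return 0.5 * w @ w            # J(w, w_0) = (1/2) ||w||^2

# One inequality constraint per sample: y_i (w^T x_i + w_0) - 1 >= 0
constraints = [
    {"type": "ineq",
     "fun": lambda theta, xi=xi, yi=yi: yi * (theta[:-1] @ xi + theta[-1]) - 1.0}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, w0 = res.x[:-1], res.x[-1]
print("w =", w, " w0 =", w0, " margin =", 2 / np.linalg.norm(w))
```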
This is a quadratic optimization task subject to a set of linear inequality constraints. The Lagrangian function can be defined as
$$\mathcal L(\mathbf w,w_0,\boldsymbol \lambda)=\frac{1}{2}\mathbf w^T\mathbf w-\sum_{i=1}^N \lambda_i\left[y_i(\mathbf w^T\mathbf x_i+w_0)-1\right]\tag{2}$$
and the KKT conditions are
$$\frac{\partial }{\partial \mathbf w}\mathcal L(\mathbf w,w_0,\boldsymbol \lambda)=\mathbf 0\tag{3}$$

$$\frac{\partial }{\partial w_0}\mathcal L(\mathbf w,w_0,\boldsymbol \lambda)=0 \tag{4}$$

$$\lambda_i\ge 0,\quad i=1,2,\cdots, N\tag{5}$$

$$\lambda_i\left[y_i(\mathbf w^T\mathbf x_i+w_0)-1\right]=0,\quad i=1,2,\cdots,N\tag{6}$$

Combining $(2)$ with $(3)$ and $(4)$ results in
$$\mathbf w=\sum_{i=1}^N \lambda_i y_i \mathbf x_i \tag{7}$$

$$\sum_{i=1}^N \lambda_i y_i=0 \tag{8}$$

Remarks:

  • The Lagrange multipliers can be either zero or positive. Thus, the vector parameter $\mathbf w$ of the optimal solution is a linear combination of $N_s\le N$ feature vectors that are associated with $\lambda_i\ne 0$. That is,
    $$\mathbf w=\sum_{i=1}^{N_s}\lambda_i y_i \mathbf x_i \tag{9}$$
    For nonzero $\lambda_i$, as the set of constraints in $(6)$ suggests, the corresponding $\mathbf x_i$ lies on either of the two hyperplanes, that is,
    $$\mathbf w^T \mathbf x_i+w_0=\pm 1\tag{10}$$
    These vectors are known as support vectors, and the optimum hyperplane classifier is known as a support vector machine (SVM). They constitute the critical elements of the training set. On the other hand, the resulting hyperplane classifier is insensitive to the number and position of the feature vectors with $\lambda_i=0$.

  • $\mathbf w$ is explicitly given by $(7)$, and $w_0$ can be obtained from any of the complementary slackness conditions $(6)$ with $\lambda_i\ne 0$. In practice, $w_0$ is computed as an average of the values obtained from all conditions of this type.

  • The objective of problem $(1)$ is strictly convex, since the corresponding Hessian matrix is positive definite. Furthermore, the inequality constraints are linear functions. These two conditions guarantee that any local minimum is also global and unique. Therefore, the optimal hyperplane classifier of a support vector machine is unique.

Plugging $(7)$ and $(8)$ back into the Lagrangian $(2)$ and simplifying, we get
$$\mathcal L(\mathbf w,w_0,\boldsymbol \lambda)=\sum_{i=1}^N \lambda_i-\frac{1}{2} \sum_{i,j=1}^N\lambda_i\lambda_j y_i y_j \mathbf x_i^T \mathbf x_j \tag{11}$$
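To see this, expand $(2)$ term by term:
$$\mathcal L=\frac{1}{2}\mathbf w^T\mathbf w-\sum_{i=1}^N \lambda_i y_i\mathbf w^T\mathbf x_i-w_0\sum_{i=1}^N\lambda_i y_i+\sum_{i=1}^N\lambda_i$$
By $(7)$, both $\mathbf w^T\mathbf w$ and $\sum_i\lambda_i y_i\mathbf w^T\mathbf x_i$ equal $\sum_{i,j=1}^N\lambda_i\lambda_j y_iy_j\mathbf x_i^T\mathbf x_j$, and by $(8)$ the $w_0$ term vanishes, which leaves exactly $(11)$.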
We then solve the primal problem $(1)$ by Lagrangian duality. The dual problem can be formulated as
$$\begin{aligned} &\underset{\boldsymbol \lambda}{\operatorname{maximize}} && \sum_{i=1}^N \lambda_i-\frac{1}{2} \sum_{i,j=1}^N\lambda_i\lambda_j y_i y_j \mathbf x_i^T \mathbf x_j\\ &\text{subject to} && \sum_{i=1}^N \lambda_i y_i=0\\ & && \boldsymbol \lambda \succeq\mathbf 0 \end{aligned} \tag{12}$$
Once the optimal Lagrange multipliers $\boldsymbol \lambda$ have been computed, the optimal hyperplane is obtained via $(7)$, and $w_0$ via the complementary slackness conditions $(6)$, as before.
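The dual $(12)$ is itself a standard quadratic program, so the whole training pipeline fits in a few lines. Below is a minimal sketch, assuming the `cvxopt` QP solver is available and using made-up toy data; $\mathbf w$ is recovered via $(7)$ and $w_0$ is averaged over the support vectors as described in the remarks above.

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    """Train a hard-margin SVM by solving the dual (12); X is (N, d), y holds +/-1."""
    N = X.shape[0]
    K = X @ X.T                                    # Gram matrix of inner products x_i^T x_j
    P = matrix((np.outer(y, y) * K).astype(float)) # P_ij = y_i y_j x_i^T x_j
    q = matrix(-np.ones(N))                        # maximize sum(lambda) == minimize -1^T lambda
    G = matrix(-np.eye(N))                         # -lambda_i <= 0, i.e. lambda_i >= 0
    h = matrix(np.zeros(N))
    A = matrix(y.reshape(1, -1).astype(float))     # equality constraint sum_i lambda_i y_i = 0
    b = matrix(0.0)
    lam = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])

    sv = lam > 1e-6                                # support vectors have lambda_i > 0
    w = (lam[sv] * y[sv]) @ X[sv]                  # Eq. (7)
    w0 = np.mean(y[sv] - X[sv] @ w)                # Eq. (6) with y_i^2 = 1, averaged over the SVs
    return w, w0, lam

# Toy usage (made-up separable data)
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, w0, lam = hard_margin_svm(X, y)
```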

Nonseparable Classes

When the classes are not separable, the above setup is no longer valid. Figure 3.11 illustrates the case in which the two classes are not separable.

[Figure 3.11: A nonseparable two-class task; correctly classified points inside the band are marked with squares, and misclassified points with circles.]

The training feature vectors now belong to one of the following three categories:

  • Vectors that fall outside the band and are correctly classified.

  • Vectors that fall inside the band and are correctly classified. These are the points placed in squares in Figure 3.11, and they satisfy the inequality
    $$0\le y_i (\mathbf w^T\mathbf x_i+w_0)<1$$

  • Vectors that are misclassified. They are enclosed by circles and obey the inequality
    $$y_i(\mathbf w^T\mathbf x_i+w_0)<0$$

All three cases can be treated under a single type of constraint by introducing a new set of variables, namely,
$$y_i(\mathbf w^T\mathbf x_i+w_0)\ge 1-\xi_i \tag{13}$$
The first category of data corresponds to $\xi_i=0$, the second to $0<\xi_i\le 1$, and the third to $\xi_i>1$. The variables $\xi_i$ are known as slack variables.
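A small sketch of this bookkeeping (all values below are made up for illustration): for a fixed hyperplane, the smallest slack satisfying $(13)$ is $\xi_i=\max\big(0,\,1-y_i(\mathbf w^T\mathbf x_i+w_0)\big)$, and its size tells us which of the three categories a point falls into.

```python
import numpy as np

def slack_categories(X, y, w, w0):
    """Smallest slacks satisfying (13) for a fixed hyperplane, plus the category masks."""
    margins = y * (X @ w + w0)               # y_i (w^T x_i + w_0)
    xi = np.maximum(0.0, 1.0 - margins)      # smallest xi_i such that (13) holds
    outside = xi == 0.0                      # outside the band, correctly classified
    inside = (xi > 0.0) & (xi <= 1.0)        # inside the band, still correctly classified
    misclassified = xi > 1.0                 # on the wrong side of the hyperplane
    return xi, outside, inside, misclassified

# Illustrative hyperplane and points
w, w0 = np.array([1.0, 0.0]), 0.0
X = np.array([[2.0, 0.0], [0.5, 1.0], [-0.5, 0.0]])
y = np.array([1.0, 1.0, 1.0])
print(slack_categories(X, y, w, w0))         # xi = [0.0, 0.5, 1.5]
```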

The goal now is to make the margin as large as possible while keeping the number of points with $\xi_i>0$ as small as possible. In mathematical terms, this amounts to minimizing the cost function
$$J(\mathbf w,w_0,\boldsymbol\xi)=\frac{1}{2}\|\mathbf w\|^2+C\sum_{i=1}^N I(\xi_i)\tag{14}$$
where $\boldsymbol \xi$ is the vector of the parameters $\xi_i$ and
$$I(\xi_i)=\left\{\begin{matrix} 1 & \xi_i >0\\ 0 & \xi_i=0 \end{matrix} \right. \tag{15}$$
The parameter $C$ is a positive constant that controls the relative influence of the two competing terms.

However, optimizing $(14)$ is difficult, since it involves the discontinuous function $I(\cdot)$. As is common in such cases, we choose to optimize a closely related cost function, and the goal becomes
$$\begin{aligned} &\underset{\mathbf w,w_0,\boldsymbol\xi}{\operatorname{minimize}} && J(\mathbf w,w_0,\boldsymbol\xi)=\frac{1}{2}\|\mathbf w\|^2+C\sum_{i=1}^N\xi_i\\ &\text{subject to} && y_i(\mathbf w^T\mathbf x_i+w_0)\ge 1-\xi_i,\quad i=1,2,\cdots,N\\ & && \xi_i\ge 0,\quad i=1,2,\cdots,N \end{aligned}\tag{16}$$
The problem is again a convex programming one, and the corresponding Lagrangian is given by
$$\mathcal L(\mathbf w,w_0,\boldsymbol \xi,\boldsymbol \lambda,\boldsymbol \mu)=\frac{1}{2}\mathbf w^T\mathbf w+C\sum_{i=1}^N\xi_i-\sum_{i=1}^N\mu_i\xi_i -\sum_{i=1}^N \lambda_i\left[y_i(\mathbf w^T\mathbf x_i+w_0)-1+\xi_i\right]\tag{17}$$
The KKT conditions are
$$\frac{\partial\mathcal L}{\partial \mathbf w}=\mathbf 0\Longrightarrow \mathbf w=\sum_{i=1}^N \lambda_i y_i\mathbf x_i\tag{18}$$

$$\frac{\partial\mathcal L}{\partial w_0}= 0\Longrightarrow \sum_{i=1}^N \lambda_i y_i=0\tag{19}$$

$$\frac{\partial\mathcal L}{\partial \xi_i}= 0\Longrightarrow C-\mu_i-\lambda_i=0, \quad i=1,2,\cdots,N \tag{20}$$

$$\lambda_i\left[y_i(\mathbf w^T\mathbf x_i+w_0)-1+\xi_i\right]=0, \quad i=1,2,\cdots,N\tag{21}$$

$$\mu_i \xi_i=0, \quad i=1,2,\cdots,N\tag{22}$$

$$\mu_i \ge 0,\quad \lambda_i\ge 0, \quad i=1,2,\cdots,N \tag{23}$$

As before, substituting $(18)$–$(20)$ back into $(17)$ and simplifying, we end up with
$$\begin{aligned} &\underset{\boldsymbol \lambda}{\operatorname{maximize}} && \sum_{i=1}^N \lambda_i-\frac{1}{2} \sum_{i,j=1}^N\lambda_i\lambda_j y_i y_j \mathbf x_i^T \mathbf x_j\\ &\text{subject to} && \sum_{i=1}^N \lambda_i y_i=0\\ & && 0\le \lambda_i \le C,\quad i=1,2,\cdots,N \end{aligned}\tag{24}$$
Note that, somewhat surprisingly, the only change to the dual problem is that the original constraint $0\le \lambda_i$ has become $0\le \lambda_i \le C$.
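In practice, problem $(24)$ is rarely coded by hand; library implementations such as scikit-learn's `SVC` solve exactly this kind of dual. A brief sketch on made-up overlapping data (parameter values are illustrative only):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up overlapping data: two Gaussian blobs that are not separable
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(1.5, 1.0, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0)   # C is the same trade-off constant as in (16)/(24)
clf.fit(X, y)

print("number of support vectors:", len(clf.support_vectors_))
print("w =", clf.coef_.ravel(), " w0 =", clf.intercept_[0])
# dual_coef_ stores y_i * lambda_i for the support vectors, so |dual_coef_| <= C
print("max |y_i * lambda_i| =", np.abs(clf.dual_coef_).max())
```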

Kernels

In the previous discussions the decision boundary is linear. What if the problem is not linearly separable? A possible solution is to map the original data to a new feature space in which the classes become linearly separable, as shown in the figure below:

[Figure: The original data and the same data after applying a feature mapping, where a linear separation becomes possible.]

To distinguish between these two sets of variables, we will call the original input values the input attributes of a problem. When they are mapped to some new set of quantities that are then passed to the learning algorithm, we will call those new quantities the input features. We will also let $\phi$ denote the feature mapping, which maps from the attributes to the features.

Since the algorithm can be written entirely in terms of the inner products $\langle \mathbf x,\mathbf z \rangle$, we can replace all those inner products with $\langle \phi(\mathbf x),\phi(\mathbf z) \rangle$. Specifically, given a feature mapping $\phi$, we define the corresponding kernel to be
$$K(\mathbf x,\mathbf z)=\phi(\mathbf x)^T \phi(\mathbf z)\tag{25}$$
Then, everywhere we previously had $\langle \mathbf x,\mathbf z \rangle$ in our algorithm, we can simply replace it with $K(\mathbf x, \mathbf z)$, and our algorithm will now be learning using the features $\phi$.
$K(\mathbf x,\mathbf z)$ may be very inexpensive to calculate, even though $\phi(\mathbf x)$ itself may be very expensive to calculate (perhaps because it is an extremely high-dimensional vector). In such settings, by using an efficient way to calculate $K(\mathbf x, \mathbf z)$ in our algorithm, we can get SVMs to learn in the high-dimensional feature space given by $\phi$, but without ever having to explicitly find or represent the vectors $\phi(\mathbf x)$.
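As a concrete illustration (a standard example, not from the text): for two-dimensional inputs, the degree-2 polynomial kernel $K(\mathbf x,\mathbf z)=(\mathbf x^T\mathbf z+1)^2$ equals $\phi(\mathbf x)^T\phi(\mathbf z)$ for the explicit six-dimensional mapping $\phi(\mathbf x)=(1,\sqrt2\,x_1,\sqrt2\,x_2,x_1^2,x_2^2,\sqrt2\,x_1x_2)$, so evaluating the kernel gives a six-dimensional inner product at the cost of a two-dimensional one:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map for a 2-D input."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def K(x, z):
    """Degree-2 polynomial kernel, evaluated directly in the 2-D input space."""
    return (x @ z + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(K(x, z), phi(x) @ phi(z))    # both evaluate to 4.0
```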

Therefore, we only need to define the kernel function and can forget about the mapping $\phi(\mathbf x)$. For instance, we transform problem $(12)$ into
$$\begin{aligned} &\underset{\boldsymbol \lambda}{\operatorname{maximize}} && \sum_{i=1}^N \lambda_i-\frac{1}{2} \sum_{i,j=1}^N\lambda_i\lambda_j y_i y_j K(\mathbf x_i, \mathbf x_j)\\ &\text{subject to} && \sum_{i=1}^N \lambda_i y_i=0\\ & && \boldsymbol \lambda \succeq\mathbf 0 \end{aligned}\tag{26}$$
where
$$K(\mathbf x_i,\mathbf x_j)=(\mathbf x_i^T\mathbf x_j+1)^d$$
or
$$K(\mathbf x_i,\mathbf x_j)=\exp\left( -\frac{\|\mathbf x_i-\mathbf x_j\|^2}{\sigma^2}\right)$$
Other kernels can also be applied.
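With a library implementation, swapping kernels is just an argument change. A hedged sketch using scikit-learn's `SVC` (note that scikit-learn parameterizes the RBF kernel as $\exp(-\gamma\|\mathbf x_i-\mathbf x_j\|^2)$, so $\gamma=1/\sigma^2$ in the notation above; the data set and parameter values are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# A toy set that is not linearly separable in the input space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# (x_i^T x_j + 1)^d with d = 3: gamma=1.0 and coef0=1.0 reproduce the form above
poly = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)
# exp(-gamma ||x_i - x_j||^2), i.e. gamma = 1 / sigma^2
rbf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

print("polynomial kernel training accuracy:", poly.score(X, y))
print("RBF kernel training accuracy:", rbf.score(X, y))
```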

