Support Vector Machines

Published by 拉普拉斯的汪 on 2020-12-11

Reference:
  • Section 3.7 of Pattern Recognition, by Sergios Theodoridis and Konstantinos Koutroumbas (2009)
  • Slides of CS4220, TUD

Margins: Intuition

Let $\mathbf x_i,\ i=1,2,\ldots,N$, be the feature vectors of the training set, $X$. These belong to either of two classes, $\omega_1$ and $\omega_2$, which are assumed to be linearly separable. The goal is to design a hyperplane
$$g(\mathbf x)=\mathbf w^T\mathbf x+w_0=0$$
that correctly classifies all the training vectors. Such a hyperplane is not unique. For instance, Figure 3.9 illustrates the classification task with two possible hyperplane solutions. Which one should we choose as the classifier in practice? No doubt the answer is: the full-line one. The reason is that this hyperplane leaves more “room” on either side, so that data in both classes can move a bit more freely, with less risk of causing an error.

[Figure 3.9: A linearly separable two-class task with two possible hyperplane classifiers; the full-line one leaves a larger margin.]

Separable Classes

After the above brief discussion, we are ready to accept that a very sensible choice for the hyperplane classifier would be the one that leaves the maximum margin from both classes. Let us now quantify the term margin.

Every hyperplane is characterized by its direction (determined by $\mathbf w$) and its exact position in space (determined by $w_0$). Since we want to give no preference to either of the classes, it is reasonable, for each direction, to select the hyperplane that has the same distance from the respective nearest points in $\omega_1$ and $\omega_2$. This is illustrated in Figure 3.10.

[Figure 3.10: Two candidate directions; the margin for direction 1 is $2z_1$ and the margin for direction 2 is $2z_2$.]

The margin for direction 1 is $2z_1$ and the margin for direction 2 is $2z_2$. Our goal is to search for the direction that gives the maximum possible margin.

Recall that the distance of a point from a hyperplane is given by
$$z=\frac{|g(\mathbf x)|}{\|\mathbf w\|}$$
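As a quick numeric check (the hyperplane and point below are made-up toy values, purely for illustration), the distance formula is a one-liner:

```python
import numpy as np

# Toy hyperplane g(x) = w^T x + w_0 and an arbitrary point (illustrative values only)
w = np.array([3.0, 4.0])
w0 = -1.0
x = np.array([2.0, 1.0])

g = w @ x + w0                    # g(x) = 3*2 + 4*1 - 1 = 9
z = abs(g) / np.linalg.norm(w)    # |g(x)| / ||w|| = 9 / 5
print(z)                          # 1.8
```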
We can now scale $\mathbf w, w_0$ so that the value of $g(\mathbf x)$ at the nearest points in $\omega_1, \omega_2$ is equal to $1$ for $\omega_1$ and, thus, equal to $-1$ for $\omega_2$. This is equivalent to

  1. Having a margin of $\frac{1}{\|\mathbf w\|}+\frac{1}{\|\mathbf w\|}=\frac{2}{\|\mathbf w\|}$
  2. Requiring that

$$\begin{array}{ll} \mathbf w^T\mathbf x+w_0\ge 1, & \forall\,\mathbf x\in\omega_1 \\ \mathbf w^T\mathbf x+w_0\le -1, & \forall\,\mathbf x\in\omega_2 \end{array}$$

For each $\mathbf x_i$, we denote the corresponding class indicator by $y_i$ ($+1$ for $\omega_1$, $-1$ for $\omega_2$). Obviously, minimizing the norm makes the margin maximum. Our task can now be summarized as: compute the parameters $\mathbf w, w_0$ of the hyperplane so as to
$$\begin{aligned} &\underset{\mathbf w,w_0}{\operatorname{minimize}} && J(\mathbf w,w_0)=\frac{1}{2}\|\mathbf w\|^2\\ &\text{subject to} && y_i(\mathbf w^T\mathbf x_i+w_0)\ge 1,\quad i=1,2,\cdots,N \end{aligned}\tag{1}$$
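Before turning to the Lagrangian machinery, it may help to see that $(1)$ is an ordinary constrained optimization problem. The sketch below hands it to a general-purpose solver on a toy separable set; this is purely illustrative (dedicated SVM solvers work on the dual derived later), and all data values are made up:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set (made-up values)
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Optimization variable: theta = [w_1, w_2, w_0]
def objective(theta):
    w = theta[:-1]
    return 0.5 * w @ w            # J(w, w_0) = (1/2) ||w||^2

# One inequality constraint per sample: y_i (w^T x_i + w_0) - 1 >= 0
constraints = [
    {"type": "ineq",
     "fun": lambda theta, xi=xi, yi=yi: yi * (theta[:-1] @ xi + theta[-1]) - 1.0}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, w0 = res.x[:-1], res.x[-1]
print("w =", w, " w0 =", w0, " margin =", 2 / np.linalg.norm(w))
```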
This is a quadratic optimization task subject to a set of linear inequality constraints. The Lagrangian function can be defined as
$$\mathcal L(\mathbf w,w_0,\boldsymbol \lambda)=\frac{1}{2}\mathbf w^T\mathbf w-\sum_{i=1}^N \lambda_i\left[y_i(\mathbf w^T\mathbf x_i+w_0)-1\right]\tag{2}$$
and the KKT conditions are
$$\frac{\partial }{\partial \mathbf w}\mathcal L(\mathbf w,w_0,\boldsymbol \lambda)=\mathbf 0\tag{3}$$

$$\frac{\partial }{\partial w_0}\mathcal L(\mathbf w,w_0,\boldsymbol \lambda)=0 \tag{4}$$

$$\lambda_i\ge 0,\quad i=1,2,\cdots, N\tag{5}$$

$$\lambda_i\left[y_i(\mathbf w^T\mathbf x_i+w_0)-1\right]=0,\quad i=1,2,\cdots,N\tag{6}$$

Combining $(2)$ with $(3)$ and $(4)$ results in
$$\mathbf w=\sum_{i=1}^N \lambda_i y_i \mathbf x_i \tag{7}$$

$$\sum_{i=1}^N \lambda_i y_i=0 \tag{8}$$

Remarks:

  • The Lagrange multipliers can be either zero or positive. Thus, the vector parameter $\mathbf w$ of the optimal solution is a linear combination of $N_s\le N$ feature vectors that are associated with $\lambda_i\ne 0$. That is,
    $$\mathbf w=\sum_{i=1}^{N_s}\lambda_i y_i \mathbf x_i \tag{9}$$
    For nonzero $\lambda_i$, as the set of constraints in $(6)$ suggests, the corresponding $\mathbf x_i$ lies on either of the two hyperplanes, that is,
    $$\mathbf w^T \mathbf x_i+w_0=\pm 1\tag{10}$$
    These vectors are known as support vectors, and the optimum hyperplane classifier is known as a support vector machine (SVM). They constitute the critical elements of the training set. On the other hand, the resulting hyperplane classifier is insensitive to the number and position of the feature vectors with $\lambda_i=0$.

  • $\mathbf w$ is explicitly given by $(7)$, and $w_0$ can be obtained from any of the complementary slackness conditions $(6)$ with $\lambda_i\ne 0$. In practice, $w_0$ is computed as an average of the values obtained from all conditions of this type.

  • The objective of problem $(1)$ is strictly convex, since the corresponding Hessian matrix is positive definite. Furthermore, the inequality constraints are linear functions. These two conditions guarantee that any local minimum is also global and unique. Therefore, the optimal hyperplane classifier of a support vector machine is unique.

Plugging $(7)$ and $(8)$ back into the Lagrangian $(2)$ and simplifying, we get
$$\mathcal L(\mathbf w,w_0,\boldsymbol \lambda)=\sum_{i=1}^N \lambda_i-\frac{1}{2} \sum_{i,j=1}^N\lambda_i\lambda_j y_i y_j \mathbf x_i^T \mathbf x_j \tag{11}$$
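To see this, expand $(2)$ term by term:
$$\mathcal L=\frac{1}{2}\mathbf w^T\mathbf w-\sum_{i=1}^N \lambda_i y_i\mathbf w^T\mathbf x_i-w_0\sum_{i=1}^N\lambda_i y_i+\sum_{i=1}^N\lambda_i$$
By $(7)$, both $\mathbf w^T\mathbf w$ and $\sum_i\lambda_i y_i\mathbf w^T\mathbf x_i$ equal $\sum_{i,j=1}^N\lambda_i\lambda_j y_iy_j\mathbf x_i^T\mathbf x_j$, and by $(8)$ the $w_0$ term vanishes, which leaves exactly $(11)$.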
We then solve the primal problem $(1)$ by Lagrangian duality. The dual problem can be formulated as
$$\begin{aligned} &\underset{\boldsymbol \lambda}{\operatorname{maximize}} && \sum_{i=1}^N \lambda_i-\frac{1}{2} \sum_{i,j=1}^N\lambda_i\lambda_j y_i y_j \mathbf x_i^T \mathbf x_j\\ &\text{subject to} && \sum_{i=1}^N \lambda_i y_i=0\\ & && \boldsymbol \lambda \succeq\mathbf 0 \end{aligned} \tag{12}$$
Once the optimal Lagrange multipliers $\boldsymbol \lambda$ have been computed, the optimal hyperplane is obtained via $(7)$, and $w_0$ via the complementary slackness conditions $(6)$, as before.
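The dual $(12)$ is itself a standard quadratic program, so the whole training pipeline fits in a few lines. Below is a minimal sketch, assuming the `cvxopt` QP solver is available and using made-up toy data; $\mathbf w$ is recovered via $(7)$ and $w_0$ is averaged over the support vectors as described in the remarks above.

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    """Train a hard-margin SVM by solving the dual (12); X is (N, d), y holds +/-1."""
    N = X.shape[0]
    K = X @ X.T                                    # Gram matrix of inner products x_i^T x_j
    P = matrix((np.outer(y, y) * K).astype(float)) # P_ij = y_i y_j x_i^T x_j
    q = matrix(-np.ones(N))                        # maximize sum(lambda) == minimize -1^T lambda
    G = matrix(-np.eye(N))                         # -lambda_i <= 0, i.e. lambda_i >= 0
    h = matrix(np.zeros(N))
    A = matrix(y.reshape(1, -1).astype(float))     # equality constraint sum_i lambda_i y_i = 0
    b = matrix(0.0)
    lam = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])

    sv = lam > 1e-6                                # support vectors have lambda_i > 0
    w = (lam[sv] * y[sv]) @ X[sv]                  # Eq. (7)
    w0 = np.mean(y[sv] - X[sv] @ w)                # Eq. (6) with y_i^2 = 1, averaged over the SVs
    return w, w0, lam

# Toy usage (made-up separable data)
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, w0, lam = hard_margin_svm(X, y)
```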

Nonseparable Classes

When the classes are not separable, the above setup is no longer valid. Figure 3.11 illustrates the case in which the two classes are not separable.

[Figure 3.11: A nonseparable two-class task; correctly classified points inside the band are marked with squares, and misclassified points with circles.]

The training feature vectors now belong to one of the following three categories:

  • Vectors that fall outside the band and are correctly classified.

  • Vectors that fall inside the band and are correctly classified. These are the points placed in squares in Figure 3.11, and they satisfy the inequality
    $$0\le y_i (\mathbf w^T\mathbf x_i+w_0)<1$$

  • Vectors that are misclassified. They are enclosed by circles and obey the inequality
    $$y_i(\mathbf w^T\mathbf x_i+w_0)<0$$

All three cases can be treated under a single type of constraint by introducing a new set of variables, namely,
$$y_i(\mathbf w^T\mathbf x_i+w_0)\ge 1-\xi_i \tag{13}$$
The first category of data corresponds to $\xi_i=0$, the second to $0<\xi_i\le 1$, and the third to $\xi_i>1$. The variables $\xi_i$ are known as slack variables.
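A small sketch of this bookkeeping (all values below are made up for illustration): for a fixed hyperplane, the smallest slack satisfying $(13)$ is $\xi_i=\max\big(0,\,1-y_i(\mathbf w^T\mathbf x_i+w_0)\big)$, and its size tells us which of the three categories a point falls into.

```python
import numpy as np

def slack_categories(X, y, w, w0):
    """Smallest slacks satisfying (13) for a fixed hyperplane, plus the category masks."""
    margins = y * (X @ w + w0)               # y_i (w^T x_i + w_0)
    xi = np.maximum(0.0, 1.0 - margins)      # smallest xi_i such that (13) holds
    outside = xi == 0.0                      # outside the band, correctly classified
    inside = (xi > 0.0) & (xi <= 1.0)        # inside the band, still correctly classified
    misclassified = xi > 1.0                 # on the wrong side of the hyperplane
    return xi, outside, inside, misclassified

# Illustrative hyperplane and points
w, w0 = np.array([1.0, 0.0]), 0.0
X = np.array([[2.0, 0.0], [0.5, 1.0], [-0.5, 0.0]])
y = np.array([1.0, 1.0, 1.0])
print(slack_categories(X, y, w, w0))         # xi = [0.0, 0.5, 1.5]
```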

The goal now is to make the margin as large as possible while keeping the number of points with $\xi_i>0$ as small as possible. In mathematical terms, this amounts to minimizing the cost function
$$J(\mathbf w,w_0,\boldsymbol\xi)=\frac{1}{2}\|\mathbf w\|^2+C\sum_{i=1}^N I(\xi_i)\tag{14}$$
where $\boldsymbol \xi$ is the vector of the parameters $\xi_i$ and
$$I(\xi_i)=\left\{\begin{matrix} 1 & \xi_i >0\\ 0 & \xi_i=0 \end{matrix} \right. \tag{15}$$
The parameter $C$ is a positive constant that controls the relative influence of the two competing terms.

However, optimizing $(14)$ is difficult, since it involves the discontinuous function $I(\cdot)$. As is common in such cases, we choose to optimize a closely related cost function, and the goal becomes
$$\begin{aligned} &\underset{\mathbf w,w_0,\boldsymbol\xi}{\operatorname{minimize}} && J(\mathbf w,w_0,\boldsymbol\xi)=\frac{1}{2}\|\mathbf w\|^2+C\sum_{i=1}^N\xi_i\\ &\text{subject to} && y_i(\mathbf w^T\mathbf x_i+w_0)\ge 1-\xi_i,\quad i=1,2,\cdots,N\\ & && \xi_i\ge 0,\quad i=1,2,\cdots,N \end{aligned}\tag{16}$$
The problem is again a convex programming one, and the corresponding Lagrangian is given by
$$\mathcal L(\mathbf w,w_0,\boldsymbol \xi,\boldsymbol \lambda,\boldsymbol \mu)=\frac{1}{2}\mathbf w^T\mathbf w+C\sum_{i=1}^N\xi_i-\sum_{i=1}^N\mu_i\xi_i -\sum_{i=1}^N \lambda_i\left[y_i(\mathbf w^T\mathbf x_i+w_0)-1+\xi_i\right]\tag{17}$$
The KKT conditions are
$$\frac{\partial\mathcal L}{\partial \mathbf w}=\mathbf 0\Longrightarrow \mathbf w=\sum_{i=1}^N \lambda_i y_i\mathbf x_i\tag{18}$$

$$\frac{\partial\mathcal L}{\partial w_0}= 0\Longrightarrow \sum_{i=1}^N \lambda_i y_i=0\tag{19}$$

$$\frac{\partial\mathcal L}{\partial \xi_i}= 0\Longrightarrow C-\mu_i-\lambda_i=0, \quad i=1,2,\cdots,N \tag{20}$$

$$\lambda_i\left[y_i(\mathbf w^T\mathbf x_i+w_0)-1+\xi_i\right]=0, \quad i=1,2,\cdots,N\tag{21}$$

$$\mu_i \xi_i=0, \quad i=1,2,\cdots,N\tag{22}$$

$$\mu_i \ge 0,\quad \lambda_i\ge 0, \quad i=1,2,\cdots,N \tag{23}$$

As before, substituting $(18)$–$(20)$ back into $(17)$ and simplifying, we end up with
$$\begin{aligned} &\underset{\boldsymbol \lambda}{\operatorname{maximize}} && \sum_{i=1}^N \lambda_i-\frac{1}{2} \sum_{i,j=1}^N\lambda_i\lambda_j y_i y_j \mathbf x_i^T \mathbf x_j\\ &\text{subject to} && \sum_{i=1}^N \lambda_i y_i=0\\ & && 0\le \lambda_i \le C,\quad i=1,2,\cdots,N \end{aligned}\tag{24}$$
Note that, somewhat surprisingly, the only change to the dual problem is that the original constraint $0\le \lambda_i$ has become $0\le \lambda_i \le C$.
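In practice, problem $(24)$ is rarely coded by hand; library implementations such as scikit-learn's `SVC` solve exactly this kind of dual. A brief sketch on made-up overlapping data (parameter values are illustrative only):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up overlapping data: two Gaussian blobs that are not separable
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(1.5, 1.0, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0)   # C is the same trade-off constant as in (16)/(24)
clf.fit(X, y)

print("number of support vectors:", len(clf.support_vectors_))
print("w =", clf.coef_.ravel(), " w0 =", clf.intercept_[0])
# dual_coef_ stores y_i * lambda_i for the support vectors, so |dual_coef_| <= C
print("max |y_i * lambda_i| =", np.abs(clf.dual_coef_).max())
```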

Kernels

In the previous discussions the decision boundary is linear. What if the problem is not linearly separable? A possible solution is to map the original data to a new feature space in which the classes become linearly separable, as shown in the figure below:

[Figure: The original data and the same data after applying a feature mapping, where a linear separation becomes possible.]

To distinguish between these two sets of variables, we will call the original input values the input attributes of a problem. When they are mapped to some new set of quantities that are then passed to the learning algorithm, we will call those new quantities the input features. We will also let $\phi$ denote the feature mapping, which maps from the attributes to the features.

Since the algorithm can be written entirely in terms of the inner products $\langle \mathbf x,\mathbf z \rangle$, we can replace all those inner products with $\langle \phi(\mathbf x),\phi(\mathbf z) \rangle$. Specifically, given a feature mapping $\phi$, we define the corresponding kernel to be
$$K(\mathbf x,\mathbf z)=\phi(\mathbf x)^T \phi(\mathbf z)\tag{25}$$
Then, everywhere we previously had $\langle \mathbf x,\mathbf z \rangle$ in our algorithm, we can simply replace it with $K(\mathbf x, \mathbf z)$, and our algorithm will now be learning using the features $\phi$.
$K(\mathbf x,\mathbf z)$ may be very inexpensive to calculate, even though $\phi(\mathbf x)$ itself may be very expensive to calculate (perhaps because it is an extremely high-dimensional vector). In such settings, by using an efficient way to calculate $K(\mathbf x, \mathbf z)$ in our algorithm, we can get SVMs to learn in the high-dimensional feature space given by $\phi$, but without ever having to explicitly find or represent the vectors $\phi(\mathbf x)$.
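As a concrete illustration (a standard example, not from the text): for two-dimensional inputs, the degree-2 polynomial kernel $K(\mathbf x,\mathbf z)=(\mathbf x^T\mathbf z+1)^2$ equals $\phi(\mathbf x)^T\phi(\mathbf z)$ for the explicit six-dimensional mapping $\phi(\mathbf x)=(1,\sqrt2\,x_1,\sqrt2\,x_2,x_1^2,x_2^2,\sqrt2\,x_1x_2)$, so evaluating the kernel gives a six-dimensional inner product at the cost of a two-dimensional one:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map for a 2-D input."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def K(x, z):
    """Degree-2 polynomial kernel, evaluated directly in the 2-D input space."""
    return (x @ z + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(K(x, z), phi(x) @ phi(z))    # both evaluate to 4.0
```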

Therefore, we only need to define the kernel function and can forget about the mapping $\phi(\mathbf x)$. For instance, we transform problem $(12)$ into
$$\begin{aligned} &\underset{\boldsymbol \lambda}{\operatorname{maximize}} && \sum_{i=1}^N \lambda_i-\frac{1}{2} \sum_{i,j=1}^N\lambda_i\lambda_j y_i y_j K(\mathbf x_i, \mathbf x_j)\\ &\text{subject to} && \sum_{i=1}^N \lambda_i y_i=0\\ & && \boldsymbol \lambda \succeq\mathbf 0 \end{aligned}\tag{26}$$
where
$$K(\mathbf x_i,\mathbf x_j)=(\mathbf x_i^T\mathbf x_j+1)^d$$
or
$$K(\mathbf x_i,\mathbf x_j)=\exp\left( -\frac{\|\mathbf x_i-\mathbf x_j\|^2}{\sigma^2}\right)$$
Other kernels can also be applied.
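With a library implementation, swapping kernels is just an argument change. A hedged sketch using scikit-learn's `SVC` (note that scikit-learn parameterizes the RBF kernel as $\exp(-\gamma\|\mathbf x_i-\mathbf x_j\|^2)$, so $\gamma=1/\sigma^2$ in the notation above; the data set and parameter values are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# A toy set that is not linearly separable in the input space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# (x_i^T x_j + 1)^d with d = 3: gamma=1.0 and coef0=1.0 reproduce the form above
poly = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)
# exp(-gamma ||x_i - x_j||^2), i.e. gamma = 1 / sigma^2
rbf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

print("polynomial kernel training accuracy:", poly.score(X, y))
print("RBF kernel training accuracy:", rbf.score(X, y))
```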

