Support Vector Machines
Reference:
- Section 3.7 of Pattern Recognition, Sergios Theodoridis and Konstantinos Koutroumbas (2009)
- Slides of CS4220, TUD
Margins: Intuition
Let $\mathbf x_i,\ i=1,2,\ldots,N$, be the feature vectors of the training set $X$. These belong to either of two classes, $\omega_1$ and $\omega_2$, which are assumed to be linearly separable. The goal is to design a hyperplane
$$g(\mathbf x)=\mathbf w^T\mathbf x+w_0=0$$
that correctly classifies all the training vectors. Such a hyperplane is not unique. For instance, Figure 3.9 illustrates the classification task with two possible hyperplane solutions. Which one should we choose as the classifier in practice? No doubt the answer is: the full-line one. The reason is that this hyperplane leaves more “room” on either side, so that data in both classes can move a bit more freely, with less risk of causing an error.
Separable Classes
After the above brief discussion, we are ready to accept that a very sensible choice for the hyperplane classifier would be the one that leaves the maximum margin from both classes. Let us now quantify the term margin.
Every hyperplane is characterized by its direction (determined by $\mathbf w$) and its exact position in space (determined by $w_0$). Since we want to give no preference to either of the classes, it is reasonable for each direction to select the hyperplane that has the same distance from the respective nearest points in $\omega_1$ and $\omega_2$. This is illustrated in Figure 3.10.
The margin for direction 1 is $2z_1$ and the margin for direction 2 is $2z_2$. Our goal is to search for the direction that gives the maximum possible margin.
Recall that the distance of a point from a hyperplane is given by
$$z=\frac{|g(\mathbf x)|}{\|\mathbf w\|}$$
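For example, for $\mathbf w=(3,4)^T$ and $w_0=-5$, the point $\mathbf x=(1,1)^T$ gives $g(\mathbf x)=3+4-5=2$ and $\|\mathbf w\|=5$, so its distance from the hyperplane is $z=2/5$.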
We can now scale $\mathbf w, w_0$ so that the value of $g(\mathbf x)$, at the nearest points in $\omega_1, \omega_2$, is equal to $1$ for $\omega_1$ and, thus, equal to $-1$ for $\omega_2$. This is equivalent to:
- Having a margin of $\frac{1}{\|\mathbf w\|}+\frac{1}{\|\mathbf w\|}=\frac{2}{\|\mathbf w\|}$
- Requiring that
$$\begin{array}{ll} \mathbf w^T\mathbf x+w_0\ge 1, & \forall\,\mathbf x\in\omega_1\\ \mathbf w^T\mathbf x+w_0\le -1, & \forall\,\mathbf x\in\omega_2 \end{array}$$
For each $\mathbf x_i$, we denote the corresponding class indicator by $y_i$ ($+1$ for $\omega_1$, $-1$ for $\omega_2$). Obviously, minimizing the norm makes the margin maximum. Our task can now be summarized as: compute the parameters $\mathbf w, w_0$ of the hyperplane so as to
$$\begin{aligned} &\underset{\mathbf w,w_0}{\operatorname{minimize}} && J(\mathbf w,w_0)=\frac{1}{2}\|\mathbf w\|^2\\ &\text{subject to} && y_i(\mathbf w^T\mathbf x_i+w_0)\ge 1,\quad i=1,2,\ldots,N \end{aligned}\tag{1}$$
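For concreteness, problem $(1)$ can be handed directly to a generic convex solver. The sketch below is only a minimal illustration, assuming Python with NumPy and cvxpy available and using hypothetical synthetic data; the rest of this section derives the same solution analytically through Lagrangian duality.

```python
import numpy as np
import cvxpy as cp

# Hypothetical, linearly separable 2-D data: two Gaussian blobs.
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[2, 2], scale=0.5, size=(20, 2))    # class omega_1 (y = +1)
X2 = rng.normal(loc=[-2, -2], scale=0.5, size=(20, 2))  # class omega_2 (y = -1)
X = np.vstack([X1, X2])
y = np.hstack([np.ones(20), -np.ones(20)])

# Problem (1): minimize 0.5 * ||w||^2  subject to  y_i (w^T x_i + w_0) >= 1.
w = cp.Variable(2)
w0 = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + w0) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "w0 =", w0.value)
```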
This is a quadratic optimization task subject to a set of linear inequality constraints. The Lagrangian function can be defined as
$$\mathcal L(\mathbf w,w_0,\boldsymbol\lambda)=\frac{1}{2}\mathbf w^T\mathbf w-\sum_{i=1}^N\lambda_i\left[y_i(\mathbf w^T\mathbf x_i+w_0)-1\right]\tag{2}$$
and the KKT conditions are
$$\frac{\partial}{\partial\mathbf w}\mathcal L(\mathbf w,w_0,\boldsymbol\lambda)=\mathbf 0\tag{3}$$
$$\frac{\partial}{\partial w_0}\mathcal L(\mathbf w,w_0,\boldsymbol\lambda)=0\tag{4}$$
$$\lambda_i\ge 0,\quad i=1,2,\ldots,N\tag{5}$$
$$\lambda_i\left[y_i(\mathbf w^T\mathbf x_i+w_0)-1\right]=0,\quad i=1,2,\ldots,N\tag{6}$$
Combining $(2)$ with $(3)$ and $(4)$ results in
$$\mathbf w=\sum_{i=1}^N\lambda_i y_i\mathbf x_i\tag{7}$$
$$\sum_{i=1}^N\lambda_i y_i=0\tag{8}$$
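Spelling out the step, the gradients in $(3)$ and $(4)$ are
$$\frac{\partial\mathcal L}{\partial\mathbf w}=\mathbf w-\sum_{i=1}^N\lambda_i y_i\mathbf x_i=\mathbf 0,\qquad \frac{\partial\mathcal L}{\partial w_0}=-\sum_{i=1}^N\lambda_i y_i=0,$$
which give $(7)$ and $(8)$ directly.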
Remarks:
- The Lagrange multipliers can be either zero or positive. Thus, the vector parameter $\mathbf w$ of the optimal solution is a linear combination of $N_s\le N$ feature vectors that are associated with $\lambda_i\ne 0$. That is,
$$\mathbf w=\sum_{i=1}^{N_s}\lambda_i y_i\mathbf x_i\tag{9}$$
For nonzero $\lambda_i$, as the set of constraints in $(6)$ suggests, the corresponding $\mathbf x_i$ lies on either of the two hyperplanes, that is,
$$\mathbf w^T\mathbf x_i+w_0=\pm 1\tag{10}$$
These are known as support vectors, and the optimum hyperplane classifier is known as a support vector machine (SVM). They constitute the critical elements of the training set. On the other hand, the resulting hyperplane classifier is insensitive to the number and position of those feature vectors with $\lambda_i=0$.
- $\mathbf w$ is explicitly given in $(7)$, and $w_0$ can be obtained implicitly from any of the complementary slackness conditions $(6)$. In practice, $w_0$ is computed as an average over all conditions of this type.
- The objective of problem $(1)$ is strictly convex, since the corresponding Hessian matrix is positive definite. Furthermore, the inequality constraints consist of linear functions. These two conditions guarantee that any local minimum is also global and unique. Therefore, the optimal hyperplane classifier of a support vector machine is unique.
Now, plugging $(7)$ and $(8)$ back into the Lagrangian $(2)$ and simplifying, we get
$$\mathcal L(\mathbf w,w_0,\boldsymbol\lambda)=\sum_{i=1}^N\lambda_i-\frac{1}{2}\sum_{i,j=1}^N\lambda_i\lambda_j y_i y_j\mathbf x_i^T\mathbf x_j\tag{11}$$
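The simplification is routine: by $(7)$,
$$\mathbf w^T\mathbf w=\sum_{i,j=1}^N\lambda_i\lambda_j y_i y_j\mathbf x_i^T\mathbf x_j \quad\text{and}\quad \sum_{i=1}^N\lambda_i y_i\mathbf w^T\mathbf x_i=\mathbf w^T\sum_{i=1}^N\lambda_i y_i\mathbf x_i=\mathbf w^T\mathbf w,$$
while $(8)$ makes the $w_0$ term vanish, so the quadratic terms collapse into the single $-\frac{1}{2}\sum_{i,j}$ term of $(11)$.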
We then solve the primal problem $(1)$ by Lagrangian duality. The dual problem can be formulated as
$$\begin{aligned} &\underset{\boldsymbol\lambda}{\operatorname{maximize}} && \sum_{i=1}^N\lambda_i-\frac{1}{2}\sum_{i,j=1}^N\lambda_i\lambda_j y_i y_j\mathbf x_i^T\mathbf x_j\\ &\text{subject to} && \sum_{i=1}^N\lambda_i y_i=0\\ & && \boldsymbol\lambda\succeq\mathbf 0 \end{aligned}\tag{12}$$
Once the optimal Lagrange multipliers $\boldsymbol\lambda$ have been computed, the optimal hyperplane is obtained via $(7)$, and $w_0$ via the complementary slackness conditions $(6)$, as before.
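As a sanity check, the dual $(12)$ can also be solved numerically. The sketch below is a minimal illustration, again assuming Python with NumPy and cvxpy and hypothetical synthetic data; it recovers $\mathbf w$ from $(7)$ and $w_0$ by averaging the active conditions $(6)$ over the support vectors, as discussed in the remarks above.

```python
import numpy as np
import cvxpy as cp

# Hypothetical, linearly separable 2-D data: two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.5, (20, 2)), rng.normal([-2, -2], 0.5, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
N = len(y)

# Dual problem (12).  The quadratic term is rewritten as
#   sum_{i,j} lam_i lam_j y_i y_j x_i^T x_j = || sum_i lam_i y_i x_i ||^2,
# which cvxpy recognizes as convex, so the overall objective is concave.
lam = cp.Variable(N)
objective = cp.Maximize(cp.sum(lam) - 0.5 * cp.sum_squares(cp.multiply(lam, y) @ X))
constraints = [lam >= 0, cp.sum(cp.multiply(y, lam)) == 0]
cp.Problem(objective, constraints).solve()

# Support vectors are the points with lambda_i > 0 (up to numerical tolerance).
lam_val = lam.value
sv = lam_val > 1e-6
w = X.T @ (lam_val * y)            # equation (7)
w0 = np.mean(y[sv] - X[sv] @ w)    # from (6): w0 = y_i - w^T x_i on support vectors
print("number of support vectors:", int(np.sum(sv)), "w =", w, "w0 =", w0)
```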
Nonseparable Classes
When the classes are not separable, the above setup is no longer valid. Figure 3.11 illustrates the case in which the two classes are not separable.
The training feature vectors now belong to one of the following three categories:
- Vectors that fall outside the band and are correctly classified.
- Vectors that fall inside the band and are correctly classified. These are the points placed in squares in Figure 3.11, and they satisfy the inequality $0\le y_i(\mathbf w^T\mathbf x+w_0)<1$.
- Vectors that are misclassified. They are enclosed by circles and obey the inequality $y_i(\mathbf w^T\mathbf x+w_0)<0$.
All three cases can be treated under a single type of constraint by introducing a new set of variables, namely,
$$y_i(\mathbf w^T\mathbf x+w_0)\ge 1-\xi_i\tag{13}$$
The first category of data corresponds to $\xi_i=0$, the second to $0<\xi_i\le 1$, and the third to $\xi_i>1$. The variables $\xi_i$ are known as slack variables.
The goal now is to make the margin as large as possible while keeping the number of points with $\xi_i>0$ as small as possible. In mathematical terms, this amounts to minimizing the cost function
$$J(\mathbf w,w_0,\boldsymbol\xi)=\frac{1}{2}\|\mathbf w\|^2+C\sum_{i=1}^N I(\xi_i)\tag{14}$$
where $\boldsymbol\xi$ is the vector of the parameters $\xi_i$ and
$$I(\xi_i)=\begin{cases} 1, & \xi_i>0\\ 0, & \xi_i=0 \end{cases}\tag{15}$$
The parameter $C$ is a positive constant that controls the relative influence of the two competing terms.
However, optimization of the above is difficult since it involves the discontinuous function $I(\cdot)$. As is common in such cases, we choose to optimize a closely related cost function, and the goal becomes
$$\begin{aligned} &\underset{\mathbf w,w_0,\boldsymbol\xi}{\operatorname{minimize}} && J(\mathbf w,w_0,\boldsymbol\xi)=\frac{1}{2}\|\mathbf w\|^2+C\sum_{i=1}^N\xi_i\\ &\text{subject to} && y_i(\mathbf w^T\mathbf x_i+w_0)\ge 1-\xi_i,\quad i=1,2,\ldots,N\\ & && \xi_i\ge 0,\quad i=1,2,\ldots,N \end{aligned}\tag{16}$$
The problem is again a convex programming one, and the corresponding Lagrangian is given by
$$\mathcal L(\mathbf w,w_0,\boldsymbol\xi,\boldsymbol\lambda,\boldsymbol\mu)=\frac{1}{2}\mathbf w^T\mathbf w+C\sum_{i=1}^N\xi_i-\sum_{i=1}^N\mu_i\xi_i-\sum_{i=1}^N\lambda_i\left[y_i(\mathbf w^T\mathbf x_i+w_0)-1+\xi_i\right]\tag{17}$$
The KKT conditions are
$$\frac{\partial\mathcal L}{\partial\mathbf w}=\mathbf 0\Longrightarrow\mathbf w=\sum_{i=1}^N\lambda_i y_i\mathbf x_i\tag{18}$$
$$\frac{\partial\mathcal L}{\partial w_0}=0\Longrightarrow\sum_{i=1}^N\lambda_i y_i=0\tag{19}$$
$$\frac{\partial\mathcal L}{\partial\xi_i}=0\Longrightarrow C-\mu_i-\lambda_i=0,\quad i=1,2,\ldots,N\tag{20}$$
$$\lambda_i\left[y_i(\mathbf w^T\mathbf x_i+w_0)-1+\xi_i\right]=0,\quad i=1,2,\ldots,N\tag{21}$$
$$\mu_i\xi_i=0,\quad i=1,2,\ldots,N\tag{22}$$
$$\mu_i\ge 0,\quad\lambda_i\ge 0,\quad i=1,2,\ldots,N\tag{23}$$
As before, substituting $(18)$–$(20)$ back into $(17)$ and simplifying, we end up with
$$\begin{aligned} &\underset{\boldsymbol\lambda}{\operatorname{maximize}} && \sum_{i=1}^N\lambda_i-\frac{1}{2}\sum_{i,j=1}^N\lambda_i\lambda_j y_i y_j\mathbf x_i^T\mathbf x_j\\ &\text{subject to} && \sum_{i=1}^N\lambda_i y_i=0\\ & && 0\le\lambda_i\le C,\quad i=1,2,\ldots,N \end{aligned}\tag{24}$$
Note that, somewhat surprisingly, the only change to the dual problem is that what was originally the constraint $0\le\lambda_i$ has now become $0\le\lambda_i\le C$.
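In practice one rarely solves $(24)$ by hand; for instance, scikit-learn's `SVC` solves exactly this kind of soft-margin dual, and its `C` parameter plays the role of the constant $C$ above. A minimal sketch, assuming scikit-learn is available and using hypothetical overlapping data:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical overlapping data: the classes are not linearly separable.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([1, 1], 1.0, (50, 2)), rng.normal([-1, -1], 1.0, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ stores y_i * lambda_i for the support vectors, so its absolute values
# respect the box constraint 0 <= lambda_i <= C of (24).
print("number of support vectors:", clf.support_vectors_.shape[0])
print("max |y_i * lambda_i| =", np.abs(clf.dual_coef_).max(), "<= C =", clf.C)
print("w =", clf.coef_.ravel(), "w0 =", clf.intercept_[0])
```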
Kernels
In the previous discussions, the decision boundary is linear. What if the problem is not linearly separable? A possible solution is to map the original data to a new set of quantities, as shown in the figure below:
To distinguish between these two sets of variables, we’ll call the original input values the input attributes of a problem. When these are mapped to some new set of quantities that are then passed to the learning algorithm, we’ll call those new quantities the input features. We will also let $\phi$ denote the feature mapping, which maps from the attributes to the features.
Since the algorithm can be written entirely in terms of the inner products $\langle\mathbf x,\mathbf z\rangle$, this means that we would replace all those inner products with $\langle\phi(\mathbf x),\phi(\mathbf z)\rangle$. Specifically, given a feature mapping $\phi$, we define the corresponding kernel to be
$$K(\mathbf x,\mathbf z)=\phi(\mathbf x)^T\phi(\mathbf z)\tag{25}$$
Then, everywhere we previously had $\langle\mathbf x,\mathbf z\rangle$ in our algorithm, we could simply replace it with $K(\mathbf x,\mathbf z)$, and our algorithm would now be learning using the features $\phi$.
$K(\mathbf x,\mathbf z)$ may be very inexpensive to calculate, even though $\phi(\mathbf x)$ itself may be very expensive to calculate (perhaps because it is an extremely high-dimensional vector). In such settings, by using an efficient way to calculate $K(\mathbf x,\mathbf z)$ in our algorithm, we can get SVMs to learn in the high-dimensional feature space given by $\phi$, but without ever having to explicitly find or represent the vectors $\phi(\mathbf x)$.
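As a small illustration (a sketch in Python with NumPy): for the degree-2 polynomial kernel $K(\mathbf x,\mathbf z)=(\mathbf x^T\mathbf z+1)^2$ on 2-D inputs, the corresponding feature map $\phi$ is 6-dimensional, yet the kernel value can be computed directly in the input space.

```python
import numpy as np

def poly2_kernel(x, z):
    # K(x, z) = (x^T z + 1)^2, evaluated directly in the 2-D input space.
    return (x @ z + 1.0) ** 2

def poly2_features(x):
    # Explicit feature map phi(x) for this kernel:
    # (x^T z + 1)^2 = 1 + 2 x1 z1 + 2 x2 z2 + x1^2 z1^2 + x2^2 z2^2 + 2 x1 x2 z1 z2.
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([0.5, -1.2])
z = np.array([2.0, 0.3])
print(poly2_kernel(x, z))                     # kernel value, computed cheaply
print(poly2_features(x) @ poly2_features(z))  # identical value via the explicit map
```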
Therefore, we only need to define the kernel function, and can forget about the mapping $\phi(\mathbf x)$. For instance, we transform problem $(12)$ into
$$\begin{aligned} &\underset{\boldsymbol\lambda}{\operatorname{maximize}} && \sum_{i=1}^N\lambda_i-\frac{1}{2}\sum_{i,j=1}^N\lambda_i\lambda_j y_i y_j K(\mathbf x_i,\mathbf x_j)\\ &\text{subject to} && \sum_{i=1}^N\lambda_i y_i=0\\ & && \boldsymbol\lambda\succeq\mathbf 0 \end{aligned}\tag{26}$$
where
$$K(\mathbf x_i,\mathbf x_j)=(\mathbf x_i^T\mathbf x_j+1)^d$$
or
$$K(\mathbf x_i,\mathbf x_j)=\exp\left(-\frac{\|\mathbf x_i-\mathbf x_j\|^2}{\sigma^2}\right)$$
Other kernels can also be applied.
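For example, an RBF-kernel SVM separates data that are not linearly separable in the input space. A minimal sketch, assuming scikit-learn is available; note that scikit-learn parameterizes the kernel as $\exp(-\gamma\|\mathbf x_i-\mathbf x_j\|^2)$, so $\gamma$ corresponds to $1/\sigma^2$ above.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# RBF-kernel SVM; gamma plays the role of 1 / sigma^2 in the kernel above.
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```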