Machine Learning and High-Dimensional Information Retrieval - Note 8 - Support Vector Machines

Note 8 Support Vector Machines

The idea behind support vector machines (SVMs) is simple. In the simplest case we assume that the samples of the two classes are linearly separable, i.e., that there exists an affine hyperplane that separates the two classes. The SVM is a supervised learning algorithm that finds the "best" separating hyperplane given training samples from two classes. Once this hyperplane is found, classifying new data points is easy: depending on which side of the hyperplane a new data point lies, it is assigned to the corresponding class.

8.1 Geometric basis of SVM

For some $\mathbf{w} \in \mathbb{R}^{p}$ and some nonnegative $b \geq 0$, an affine hyperplane in $\mathbb{R}^{p}$ is defined as the set

$$\mathcal{H}_{\mathbf{w}, b}:=\left\{\mathbf{x} \in \mathbb{R}^{p} \mid \mathbf{w}^{\top} \mathbf{x}-b=0\right\}. \tag{8.1}$$
The vector $\mathbf{w}$ is the normal of $\mathcal{H}_{\mathbf{w}, b}$, i.e., it is perpendicular to the hyperplane, because for any segment $\mathbf{x}_{2}-\mathbf{x}_{1}$ lying within the hyperplane, with $\mathbf{x}_{1}, \mathbf{x}_{2} \in \mathcal{H}_{\mathbf{w}, b}$, we have

$$\mathbf{w}^{\top}\left(\mathbf{x}_{2}-\mathbf{x}_{1}\right)=0. \tag{8.2}$$

From this it is easy to derive that two affine hyperplanes $\mathcal{H}_{\mathbf{w}, b}$ and $\mathcal{H}_{\tilde{\mathbf{w}}, \tilde{b}}$ are parallel if and only if $\mathbf{w}$ is a multiple of $\tilde{\mathbf{w}}$.

$^{1}$ This follows from the so-called hyperplane separation theorem, which states that two disjoint convex sets can be separated by a hyperplane. The result goes back to Hermann Minkowski (1864-1909).

Any hyperplane separates $\mathbb{R}^{p}$ completely into two half-spaces$^{1}$. The signed distance from a point $\mathbf{x}$ to $\mathcal{H}_{\mathbf{w}, b}$ is defined as

$$\delta\left(\mathbf{x}, \mathcal{H}_{\mathbf{w}, b}\right)=\frac{\mathbf{w}^{\top} \mathbf{x}-b}{\|\mathbf{w}\|}. \tag{8.3}$$

This definition is justified by the following lemma.

Lemma 8.1 (Signed distance to affine hyperplanes)

The Euclidean distance from $\mathbf{x}$ to $\mathcal{H}_{\mathbf{w}, b}$ is given by $\left|\delta\left(\mathbf{x}, \mathcal{H}_{\mathbf{w}, b}\right)\right|$.

Proof.
We argue geometrically. Starting from $\mathbf{x}$, we move toward the hyperplane along a direction $\mathbf{r}$. To find the shortest distance we have to solve

$$\min \|\mathbf{r}\| \quad \text{s.t.} \quad \mathbf{x}+\mathbf{r} \in \mathcal{H}_{\mathbf{w}, b} \tag{8.4}$$

The constraint here is equivalent to $(\mathbf{x}+\mathbf{r})^{\top} \mathbf{w}=b$, or

$$\mathbf{r}^{\top} \mathbf{w}=b-\mathbf{x}^{\top} \mathbf{w}. \tag{8.5}$$

It can be seen from this equation that the solution $\mathbf{r}$ of (8.4) must be a multiple of $\mathbf{w}$. To see this, assume $\mathbf{r}$ is not a multiple of $\mathbf{w}$. Then, with vectors $\mathbf{w}_{i}^{\perp}$ orthogonal to $\mathbf{w}$, we can always write $\mathbf{r}=\lambda \mathbf{w}+\sum_{i} \mu_{i} \mathbf{w}_{i}^{\perp}$. Independently of the $\mu_{i}$, $\left(\lambda \mathbf{w}+\sum_{i} \mu_{i} \mathbf{w}_{i}^{\perp}\right)^{\top} \mathbf{w}=\lambda \mathbf{w}^{\top} \mathbf{w}$ holds, and since the components are orthogonal, $\|\mathbf{r}\| \geq\|\lambda \mathbf{w}\|$.

This observation gives rise to the minimization problem,

$$\min_{\lambda}\|\lambda \mathbf{w}\| \quad \text{s.t.} \quad \lambda\|\mathbf{w}\|^{2}=b-\mathbf{x}^{\top} \mathbf{w} \tag{8.6}$$

Its solution is the absolute value of the signed distance

$$\frac{\left|\mathbf{w}^{\top} \mathbf{x}-b\right|}{\|\mathbf{w}\|}=\left|\delta\left(\mathbf{x}, \mathcal{H}_{\mathbf{w}, b}\right)\right| \tag{8.7}$$
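As a quick numerical sanity check of Lemma 8.1, the following Python sketch (the numbers are made up for illustration) compares the signed-distance formula (8.3) with the distance obtained by explicitly projecting a point onto the hyperplane:

```python
import numpy as np

# Hyperplane H_{w,b} = {x : w^T x - b = 0}; values chosen for illustration.
w = np.array([3.0, 4.0])
b = 2.0
x = np.array([1.0, 5.0])

# Signed distance according to (8.3).
delta = (w @ x - b) / np.linalg.norm(w)

# Explicit orthogonal projection of x onto the hyperplane: x + lam * w,
# with lam chosen so that w^T (x + lam * w) = b, cf. (8.6).
lam = (b - w @ x) / (w @ w)
x_proj = x + lam * w

print(delta)                          # signed distance, here (3 + 20 - 2)/5 = 4.2
print(np.linalg.norm(x - x_proj))     # Euclidean distance, equals |delta| = 4.2
print(w @ x_proj - b)                 # ~0, so the projection lies on the hyperplane
```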

It is easy to see that the signed distance is positive in one half-space and negative in the other. We define the margin of $\mathcal{H}_{\mathbf{w}, b}$ as a set of points close to $\mathcal{H}_{\mathbf{w}, b}$. More precisely, let

$$\begin{aligned} \mathcal{H}_{+} &:=\left\{\mathbf{x} \in \mathbb{R}^{p} \mid \mathbf{w}^{\top} \mathbf{x}-b=1\right\} \\ \mathcal{H}_{-} &:=\left\{\mathbf{x} \in \mathbb{R}^{p} \mid \mathbf{w}^{\top} \mathbf{x}-b=-1\right\} \end{aligned} \tag{8.8, 8.9}$$

be the two affine hyperplanes parallel to $\mathcal{H}_{\mathbf{w}, b}$. The margin of $\mathcal{H}_{\mathbf{w}, b}$ is then defined as the convex hull of $\mathcal{H}_{+} \cup \mathcal{H}_{-}$.

Using Lemma 8.1, we can compute the thickness of this margin, i.e., the distance between $\mathcal{H}_{+}$ and $\mathcal{H}_{-}$: it is given by $\frac{2}{\|\mathbf{w}\|}$. Suppose now we want to find an affine hyperplane $\mathcal{H}_{\mathbf{w}, b}$ that separates the two classes of data points "best". If we quantify "best" as having the largest possible margin while still preventing data points from falling inside it, we have to maximize $\frac{2}{\|\mathbf{w}\|}$, which brings us to the linear SVM.
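Indeed, by Lemma 8.1 any $\mathbf{x}_{+} \in \mathcal{H}_{+}$ and $\mathbf{x}_{-} \in \mathcal{H}_{-}$ satisfy

$$\delta\left(\mathbf{x}_{+}, \mathcal{H}_{\mathbf{w}, b}\right)=\frac{\mathbf{w}^{\top} \mathbf{x}_{+}-b}{\|\mathbf{w}\|}=\frac{1}{\|\mathbf{w}\|}, \qquad \delta\left(\mathbf{x}_{-}, \mathcal{H}_{\mathbf{w}, b}\right)=-\frac{1}{\|\mathbf{w}\|},$$

so the two hyperplanes lie at distance $\frac{1}{\|\mathbf{w}\|}-\left(-\frac{1}{\|\mathbf{w}\|}\right)=\frac{2}{\|\mathbf{w}\|}$ from each other.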

8.2 Basic Linear SVM

As mentioned above, SVM is a supervised learning method, so we start from $N$ labeled training samples $\left(\mathbf{x}_{i}, y_{i}\right) \in \mathbb{R}^{p} \times\{-1,1\}$, $i=1, \ldots, N$. Here, $y_{i}$ is either $1$ or $-1$, indicating to which of the two classes the sample belongs. For the linear SVM we must assume that the data are indeed linearly separable, i.e., that there is an affine hyperplane $\mathcal{H}_{\mathbf{w}, b}$ separating the two classes. Requiring that no data point lies within the margin of $\mathcal{H}_{\mathbf{w}, b}$ is equivalent to requiring that all points with $y_{i}=1$ have positive signed distance to $\mathcal{H}_{+}$ and all points with $y_{i}=-1$ have negative signed distance to $\mathcal{H}_{-}$, that is,

$$\begin{aligned} \mathbf{w}^{\top} \mathbf{x}_{i}-b \geq+1 & \quad \text{for } y_{i}=+1 \\ \mathbf{w}^{\top} \mathbf{x}_{i}-b \leq-1 & \quad \text{for } y_{i}=-1 \end{aligned} \tag{8.10}$$

This can be written more concisely as
$$y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}-b\right) \geq 1 \quad \text{for all } i=1, \ldots, N \tag{8.11}$$

Therefore, the task of finding the affine hyperplane that allows the maximum margin, while still separating the two classes, is described by the optimization problem
$$\max \frac{2}{\|\mathbf{w}\|} \quad \text{s.t.} \quad y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}-b\right) \geq 1 \quad \text{for all } i=1, \ldots, N \tag{8.12}$$

or, equivalently, by
$$\min \frac{1}{2}\|\mathbf{w}\|^{2} \quad \text{s.t.} \quad y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}-b\right) \geq 1 \quad \text{for all } i=1, \ldots, N \tag{8.13}$$
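The equivalence of (8.12) and (8.13) is immediate: both problems have the same constraints, and maximizing the margin is the same as

$$\arg\max_{\mathbf{w}, b} \frac{2}{\|\mathbf{w}\|}=\arg\min_{\mathbf{w}, b}\|\mathbf{w}\|=\arg\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^{2},$$

since $t \mapsto \frac{1}{2} t^{2}$ is strictly increasing for $t \geq 0$. The quadratic form is preferred because it is smooth and turns (8.13) into a standard quadratic program.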

In order to solve this constrained optimization problem, we need to introduce the optimality conditions for this type of problem.

8.2.1 Karush-Kuhn-Tucker Conditions

For a more detailed treatment of optimization we refer to the textbook Numerical Optimization by J. Nocedal and S. J. Wright, 2nd edition, Springer 2006. Consider the general problem
$$\begin{aligned} \min_{\mathbf{z} \in \mathbb{R}^{n}} \quad & f(\mathbf{z}) \\ \text{s.t.} \quad & c_{i}(\mathbf{z})=0 \text{ for } i \in \mathcal{E} \\ & c_{j}(\mathbf{z}) \geq 0 \text{ for } j \in \mathcal{I} \end{aligned} \tag{8.14}$$

with smooth real-valued functions $f, c_{i}$. Here $\mathcal{E}$ denotes the index set of the equality constraints and $\mathcal{I}$ that of the inequality constraints. For a point $\mathbf{z}$ that satisfies the constraints, we define its active set as $\mathcal{A}(\mathbf{z})=\mathcal{E} \cup\left\{i \in \mathcal{I} \mid c_{i}(\mathbf{z})=0\right\}$. In other words, $\mathcal{A}(\mathbf{z})$ contains the indices of all constraints that $\mathbf{z}$ satisfies with equality.

The Lagrangian function of the optimization problem (8.14) is given by
$$L(\mathbf{z}, \boldsymbol{\lambda})=f(\mathbf{z})-\sum_{i \in \mathcal{I} \cup \mathcal{E}} \lambda_{i} c_{i}(\mathbf{z}) \tag{8.15}$$

Theorem 8.2 Karush-Kuhn-Tucker (KKT) conditions

Let $\mathbf{z}^{\star}$ be a solution of (8.14). Under certain conditions on the constraint functions$^{2}$, there exists a Lagrange multiplier vector $\boldsymbol{\lambda}^{\star}$ such that

$$\begin{aligned} \nabla_{\mathbf{z}} L\left(\mathbf{z}^{\star}, \boldsymbol{\lambda}^{\star}\right) &=0 \\ c_{i}\left(\mathbf{z}^{\star}\right) &=0 \quad \text{for } i \in \mathcal{E} \\ c_{i}\left(\mathbf{z}^{\star}\right) & \geq 0 \quad \text{for } i \in \mathcal{I} \\ \lambda_{i}^{\star} & \geq 0 \quad \text{for } i \in \mathcal{I} \\ \lambda_{i}^{\star} c_{i}\left(\mathbf{z}^{\star}\right) &=0 \quad \text{for } i \in \mathcal{I} \cup \mathcal{E} \end{aligned} \tag{8.16-8.21}$$

Since the SVM optimization problem is convex (a convex objective function and constraints that define a convex feasible region), the KKT conditions are necessary and sufficient for $\mathbf{z}^{\star}, \boldsymbol{\lambda}^{\star}$ to be a solution. The last condition means that, for each $i$, either constraint $i$ is in the active set or the $i$-th component of $\boldsymbol{\lambda}^{\star}$ is zero, or possibly both.

$^{2}$ These conditions are satisfied if the constraints are linear, so in particular for the linear SVM.
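As a small illustration of Theorem 8.2 (a standard toy example, not part of the original notes): minimize $f(\mathbf{z})=z_{1}^{2}+z_{2}^{2}$ subject to the single inequality constraint $c_{1}(\mathbf{z})=z_{1}+z_{2}-1 \geq 0$. The Lagrangian is $L(\mathbf{z}, \lambda)=z_{1}^{2}+z_{2}^{2}-\lambda\left(z_{1}+z_{2}-1\right)$, and the KKT conditions read

$$\nabla_{\mathbf{z}} L=\left(2 z_{1}-\lambda, \; 2 z_{2}-\lambda\right)=0, \qquad z_{1}+z_{2}-1 \geq 0, \qquad \lambda \geq 0, \qquad \lambda\left(z_{1}+z_{2}-1\right)=0.$$

If $\lambda=0$, stationarity forces $\mathbf{z}=0$, which violates the constraint. Hence the constraint is active, $z_{1}+z_{2}=1$, stationarity gives $z_{1}=z_{2}=\lambda / 2$, and therefore $\lambda^{\star}=1 \geq 0$ and $\mathbf{z}^{\star}=\left(\frac{1}{2}, \frac{1}{2}\right)$.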

8.2.2 Lagrangian Duality

As before, we will consider the general optimization problem

$$\min f(\mathbf{z}) \quad \text{s.t.} \quad c_{i}(\mathbf{z})=0, \; i \in \mathcal{E}, \quad c_{j}(\mathbf{z}) \geq 0, \; j \in \mathcal{I}$$

This is also known as the primal problem. The corresponding Lagrangian function is defined as
$$L(\mathbf{z}, \boldsymbol{\lambda})=f(\mathbf{z})-\sum_{i \in \mathcal{I} \cup \mathcal{E}} \lambda_{i} c_{i}(\mathbf{z})$$

with Lagrange multipliers (also called dual variables) $\lambda_{i} \geq 0$ for $i \in \mathcal{I}$. Based on the Lagrangian we can construct a function that provides a lower bound on the objective $f$. Since the dual variables of the inequality constraints are nonnegative, i.e., $\lambda_{i} \geq 0$, we know that $f(\mathbf{z}) \geq L(\mathbf{z}, \boldsymbol{\lambda})$ for all feasible $\mathbf{z}$. This leads to the definition of the Lagrangian dual function

$$g(\boldsymbol{\lambda})=\inf_{\mathbf{z}} L(\mathbf{z}, \boldsymbol{\lambda})=\inf_{\mathbf{z}}\left(f(\mathbf{z})-\sum_{i \in \mathcal{I} \cup \mathcal{E}} \lambda_{i} c_{i}(\mathbf{z})\right) \tag{8.22}$$

The dual function $g(\cdot)$ is concave even if the original problem is not convex, because it is a point-wise infimum of affine functions of $\boldsymbol{\lambda}$. The dual yields a lower bound on the optimal value $p^{\star}$ of the objective $f$: for any $\boldsymbol{\lambda} \geq 0$ we have $g(\boldsymbol{\lambda}) \leq p^{\star}$.

The Lagrangian dual problem is the maximization problem

$$\max_{\boldsymbol{\lambda}} g(\boldsymbol{\lambda}) \quad \text{s.t.} \quad \lambda_{i} \geq 0$$

Under certain conditions (which hold in the case of the SVM), the minimum of the primal problem coincides with the maximum of the dual problem, i.e., $d^{\star}=\max_{\boldsymbol{\lambda}} g(\boldsymbol{\lambda})=\inf_{\mathbf{z}} f(\mathbf{z})=p^{\star}$, where the infimum is taken over the feasible $\mathbf{z}$. This is called strong duality.
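A one-dimensional toy example (not part of the original notes) illustrates these definitions: minimize $f(z)=z^{2}$ subject to $c(z)=z-1 \geq 0$, so $p^{\star}=1$. The Lagrangian is $L(z, \lambda)=z^{2}-\lambda(z-1)$ and

$$g(\lambda)=\inf_{z}\left(z^{2}-\lambda z+\lambda\right)=-\frac{\lambda^{2}}{4}+\lambda,$$

attained at $z=\lambda / 2$. The dual problem $\max_{\lambda \geq 0} g(\lambda)$ is solved by $\lambda^{\star}=2$ with $g\left(\lambda^{\star}\right)=1=p^{\star}$, so strong duality holds; moreover $g(\lambda) \leq p^{\star}$ for every $\lambda \geq 0$, as claimed above.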

Remark
Duality enables us to use convex optimization to compute a lower bound on the optimal value of any problem, convex or not. However, the dual function $g$ may not be easy to evaluate, because it is itself defined by an optimization problem. Duality works best when $g$ can be written in closed form. Even then, it may not be easy to solve the dual problem, since not all convex problems are easy to solve.

8.2.3 Linear SVM: Primal and Dual Problem

For the hard-margin linear SVM (8.13), the Lagrangian formulation reads
$$\begin{gathered} \min_{\mathbf{w}, b, \boldsymbol{\lambda} \geq 0} L(\mathbf{w}, b, \boldsymbol{\lambda}) \\ \text{with } L(\mathbf{w}, b, \boldsymbol{\lambda})=\frac{1}{2}\|\mathbf{w}\|^{2}-\sum_{i} \lambda_{i}\left(y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}-b\right)-1\right) \\ =\frac{1}{2}\|\mathbf{w}\|^{2}-\sum_{i} \lambda_{i} y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}-b\right)+\sum_{i} \lambda_{i} \end{gathered} \tag{8.23-8.25}$$

This problem is strictly convex in $\mathbf{w}$, so if a solution exists, it is unique. The gradient of the Lagrangian with respect to the optimization variables $(\mathbf{w}, b)$ is

$$\nabla_{(\mathbf{w}, b)} L(\mathbf{w}, b, \boldsymbol{\lambda})=\left[\begin{array}{c} \mathbf{w}-\sum_{i} \lambda_{i} y_{i} \mathbf{x}_{i} \\ \sum_{i} \lambda_{i} y_{i} \end{array}\right] \tag{8.26}$$

Therefore, the KKT conditions (8.16)-(8.21), in particular stationarity and complementary slackness, give

$$\begin{aligned} \mathbf{w}^{\star}-\sum_{i} \lambda_{i}^{\star} y_{i} \mathbf{x}_{i} &=0 \\ \sum_{i} \lambda_{i}^{\star} y_{i} &=0 \\ \lambda_{i}^{\star}\left(y_{i}\left(\left(\mathbf{w}^{\star}\right)^{\top} \mathbf{x}_{i}-b^{\star}\right)-1\right) &=0. \end{aligned} \tag{8.27-8.29}$$

This means that the solution $\mathbf{w}^{\star}$ is a linear combination of the points touching the margin hyperplanes, i.e., those lying on $\mathcal{H}_{+}$ and $\mathcal{H}_{-}$. These points are called support points or support vectors, which gives the method its name.

Now we can use these equations to derive a simple way of finding the optimal hyperplane parameters $\mathbf{w}^{\star}$ and $b^{\star}$. First, we substitute (8.27) and (8.28) into (8.25) and obtain (dropping the $^{\star}$ for readability) the expression

$$\begin{aligned} & \frac{1}{2}\|\mathbf{w}\|^{2}-\sum_{i} \lambda_{i} y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}-b\right)+\sum_{i} \lambda_{i} \\ =\;& \frac{1}{2} \sum_{i, j} \lambda_{i} \lambda_{j} y_{i} y_{j} \mathbf{x}_{i}^{\top} \mathbf{x}_{j}-\sum_{i, j} \lambda_{i} \lambda_{j} y_{i} y_{j} \mathbf{x}_{i}^{\top} \mathbf{x}_{j}+\sum_{i} \lambda_{i} y_{i} b+\sum_{i} \lambda_{i} \\ =\;& \sum_{i} \lambda_{i}-\frac{1}{2} \sum_{i, j} \lambda_{i} \lambda_{j} y_{i} y_{j} \mathbf{x}_{i}^{\top} \mathbf{x}_{j}. \end{aligned}$$

Here the term $\sum_{i} \lambda_{i} y_{i} b$ vanishes because of (8.28). This new formulation, which depends only on $\boldsymbol{\lambda}$, is the dual form of the problem (compare (8.22)). Let
$$L_{D}(\boldsymbol{\lambda})=\sum_{i} \lambda_{i}-\frac{1}{2} \boldsymbol{\lambda}^{\top} \mathbf{H} \boldsymbol{\lambda} \quad \text{s.t.} \quad \lambda_{i} \geq 0, \quad \sum_{i} \lambda_{i} y_{i}=0$$

where the entries of $\mathbf{H}$ are defined via the inner products $h_{ij}=y_{i} y_{j} \mathbf{x}_{i}^{\top} \mathbf{x}_{j}$. The optimal Lagrange multipliers are found by solving the maximization problem

$$\max_{\boldsymbol{\lambda}}\left(\sum_{i} \lambda_{i}-\frac{1}{2} \boldsymbol{\lambda}^{\top} \mathbf{H} \boldsymbol{\lambda}\right) \quad \text{s.t.} \quad \lambda_{i} \geq 0, \quad \sum_{i} \lambda_{i} y_{i}=0 \tag{8.30}$$

This is a convex quadratic optimization problem that can be solved with a quadratic programming (QP) solver (such as the function quadprog in Matlab). Plugging the result into formula (8.27) yields $\mathbf{w}^{\star}$, and formula (8.29) then yields $b^{\star}$. Note that each Lagrange multiplier $\lambda_{i}$ corresponds to the point $\mathbf{x}_{i}$. If $\lambda_{i}$ is nonzero, then equation (8.29) implies $y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}-b\right)=1$, i.e., the corresponding $\mathbf{x}_{i}$ is an element of $\mathcal{H}_{+}$ or $\mathcal{H}_{-}$.
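Below is a minimal sketch of how the dual problem (8.30) could be solved numerically in Python with the cvxopt QP solver (playing the role of Matlab's quadprog); this assumes cvxopt is installed, and the function and variable names are illustrative. Since solvers.qp minimizes, we pass the negative of the objective in (8.30).

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    """Solve the dual (8.30) for linearly separable data.
    X: (N, p) array of samples, y: (N,) array of labels in {-1, +1}."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    N = X.shape[0]

    Yx = y[:, None] * X
    H = Yx @ Yx.T                                   # h_ij = y_i y_j x_i^T x_j

    solvers.options["show_progress"] = False
    P = matrix(H)
    q = matrix(-np.ones(N))                         # maximizing sum(lambda) <-> minimizing -1^T lambda
    G = matrix(-np.eye(N))                          # -lambda_i <= 0, i.e. lambda_i >= 0
    h = matrix(np.zeros(N))
    A = matrix(y.reshape(1, -1))                    # sum_i lambda_i y_i = 0
    b = matrix(0.0)

    lam = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()

    # Recover w from (8.27); b from (8.29) via the support vectors (w^T x_i - y_i).
    w = ((lam * y)[:, None] * X).sum(axis=0)
    support = lam > 1e-6
    b_star = np.mean(X[support] @ w - y[support])
    return w, b_star, lam
```

A new point $\mathbf{z}$ is then classified by $\operatorname{sign}\left(\mathbf{w}^{\top} \mathbf{z}-b^{\star}\right)$.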

8.3 Soft Margin Linear SVM

Obviously, if the training set is not linearly separable, the above method fails, because in that case no $(\mathbf{w}, b)$ satisfies all the constraints. To overcome this shortcoming, the soft-margin SVM allows some data samples to be misclassified. It finds a compromise between a large margin and the degree of misclassification. The misclassification is quantified by a set of $N$ additional slack variables $\xi_{i}$, which are assumed to be nonnegative, leading to the constraints

$$y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}-b\right) \geq 1-\xi_{i} \quad \text{for all } i=1, \ldots, N \tag{8.31}$$

Therefore, in order to obtain a large margin while maintaining moderate misclassification, we consider the optimization problem

$$\begin{aligned} \min_{\mathbf{w}, b, \boldsymbol{\xi}} \quad & \frac{1}{2}\|\mathbf{w}\|_{2}^{2}+c \sum_{i=1}^{N} \xi_{i} \\ \text{s.t.} \quad & y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}-b\right) \geq 1-\xi_{i} \quad \text{and} \quad \xi_{i} \geq 0 \quad \forall i \end{aligned} \tag{8.32, 8.33}$$

Here, the free parameter $c>0$ trades off between a large margin and the amount of misclassification. The larger $c$ is chosen, the stronger the penalty for violating the separation constraints. As mentioned before, this is a quadratic programming problem and can be solved with a QP solver. The corresponding KKT conditions are discussed below. The Lagrangian of the soft-margin SVM is
$$L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\lambda}, \boldsymbol{\mu})=\frac{1}{2}\|\mathbf{w}\|^{2}+c \sum_{i} \xi_{i}-\sum_{i} \lambda_{i}\left(y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}-b\right)-1+\xi_{i}\right)-\sum_{i} \mu_{i} \xi_{i} \tag{8.34}$$

where the $\mu_{i}$ are the Lagrange multipliers enforcing $\xi_{i} \geq 0$. The KKT conditions read
$$\begin{aligned} \nabla_{\mathbf{w}} L=\mathbf{w}-\sum_{i} \lambda_{i} y_{i} \mathbf{x}_{i} &=0 \\ \nabla_{b} L=\sum_{i} \lambda_{i} y_{i} &=0 \\ \nabla_{\xi_{i}} L=c-\lambda_{i}-\mu_{i} &=0 \\ y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}-b\right)-1+\xi_{i} & \geq 0 \\ \xi_{i} & \geq 0 \\ \lambda_{i} & \geq 0 \\ \mu_{i} & \geq 0 \\ \lambda_{i}\left(y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}-b\right)-1+\xi_{i}\right) &=0 \\ \mu_{i} \xi_{i} &=0 \end{aligned} \tag{8.35-8.43}$$

Although a QP solver can already find the solution of this problem, we now derive the corresponding dual form, since it turns out to look very similar to that of the separable problem. Moreover, the dual form is required in order to extend the SVM to kernels. First, note that equation (8.37) implies $\mu_{i}=c-\lambda_{i}$. Inserting this into equation (8.34), we can already eliminate the dependence on $\mu_{i}$ and obtain

$$\frac{1}{2} \mathbf{w}^{\top} \mathbf{w}-\sum_{i} \lambda_{i}\left(y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}-b\right)-1\right) \tag{8.44}$$

Next, we use the fact from (8.35) that $\mathbf{w}=\sum_{i} \lambda_{i} y_{i} \mathbf{x}_{i}$ and substitute it into (8.44). Using property (8.36), we obtain the dual form
$$L_{D}(\boldsymbol{\lambda})=\sum_{i} \lambda_{i}-\frac{1}{2} \boldsymbol{\lambda}^{\top} \mathbf{H} \boldsymbol{\lambda} \tag{8.45}$$

where the matrix $\mathbf{H}$ has entries $h_{ij}=y_{i} y_{j} \mathbf{x}_{i}^{\top} \mathbf{x}_{j}$, under the constraints $0 \leq \lambda_{i} \leq c$ and $\sum_{i} \lambda_{i} y_{i}=0$. Therefore, the dual problem has the form

$$\max_{\boldsymbol{\lambda}} L_{D}(\boldsymbol{\lambda}) \quad \text{s.t.} \quad 0 \leq \lambda_{i} \leq c, \quad \sum_{i} \lambda_{i} y_{i}=0 \tag{8.46}$$

The solution is then given by $\mathbf{w}=\sum_{i} \lambda_{i} y_{i} \mathbf{x}_{i}$. Thus, the only difference from the separable case is that the $\lambda_{i}$ now have the upper bound $c$. Equations (8.37) and (8.43) imply that if $\lambda_{i}<c$, then $\xi_{i}=0$. Hence, any training sample $\mathbf{x}_{i}$ with $0<\lambda_{i}<c$ is a support vector. Furthermore, for any such support vector $\mathbf{x}_{i}$, equation (8.42) simplifies to $y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}-b\right)-1=0$, which can be used to compute $b^{\star}$. In order to obtain a more stable solution, it is recommended to average over all support points. Specifically,

$$b^{\star}=\frac{1}{N_{\operatorname{Supp}}} \sum_{i \in \operatorname{Supp}}\left(\left(\mathbf{w}^{\star}\right)^{\top} \mathbf{x}_{i}-y_{i}\right) \tag{8.47}$$

where $\operatorname{Supp}$ denotes the set of support indices and $N_{\operatorname{Supp}}$ the number of support vectors.
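In code, the only change relative to the hard-margin sketch of Section 8.2.3 is the additional upper bound $\lambda_{i} \leq c$ in (8.46) and the averaging (8.47). A hedged sketch (again assuming cvxopt; all names are illustrative):

```python
import numpy as np
from cvxopt import matrix, solvers

def soft_margin_svm(X, y, c=1.0):
    """Solve the soft-margin dual (8.46) with box constraints 0 <= lambda_i <= c."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    N = X.shape[0]

    Yx = y[:, None] * X
    H = Yx @ Yx.T                                   # h_ij = y_i y_j x_i^T x_j

    solvers.options["show_progress"] = False
    P = matrix(H)
    q = matrix(-np.ones(N))
    # Stack -lambda_i <= 0 and lambda_i <= c into one inequality system G lambda <= h.
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))
    h = matrix(np.hstack([np.zeros(N), c * np.ones(N)]))
    A = matrix(y.reshape(1, -1))
    b = matrix(0.0)

    lam = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()

    w = ((lam * y)[:, None] * X).sum(axis=0)
    # Margin support vectors (0 < lambda_i < c) are used to average b, cf. (8.47).
    margin_sv = (lam > 1e-6) & (lam < c - 1e-6)
    b_star = np.mean(X[margin_sv] @ w - y[margin_sv])
    return w, b_star, lam
```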

8.4 Kernel SVM

As we saw before, for a set of training samples $\left(\mathbf{x}_{i}, y_{i}\right)$, $i=1, \ldots, N$, $\mathbf{x}_{i} \in \mathbb{R}^{p}$, $y_{i} \in\{-1,1\}$, the (soft-margin) SVM problem can be rewritten in the dual form

$$\begin{aligned} \max_{\boldsymbol{\lambda}} \quad & \sum_{i} \lambda_{i}-\frac{1}{2} \boldsymbol{\lambda}^{\top} \mathbf{H} \boldsymbol{\lambda} \\ \text{s.t.} \quad & 0 \leq \lambda_{i} \leq c, \quad \sum_{i} \lambda_{i} y_{i}=0 \end{aligned} \tag{8.48}$$

with Gram matrix $\mathbf{H}$ whose entries are defined as $h_{ij}=y_{i} y_{j} \mathbf{x}_{i}^{\top} \mathbf{x}_{j}$. We already discussed the kernel trick in the chapter on kernel PCA: a nonlinear function $\phi$ maps the training samples into a high-dimensional Hilbert space $\mathcal{H}$ with inner product $\langle\cdot, \cdot\rangle_{\mathcal{H}}$, and this inner product is expressed directly via a kernel function $\kappa$. In other words, the kernel function $\kappa(\mathbf{x}, \mathbf{y})=\langle\phi(\mathbf{x}), \phi(\mathbf{y})\rangle_{\mathcal{H}}$ maps two points to $\mathbb{R}$. The same applies here: the inner product $\mathbf{x}^{\top} \mathbf{y}$ can be replaced by a kernel function. Common choices are

$$\begin{array}{ll} \kappa(\mathbf{x}, \mathbf{y})=\left(\alpha \mathbf{x}^{\top} \mathbf{y}+\beta\right)^{\gamma} & \text{polynomial kernel} \\ \kappa(\mathbf{x}, \mathbf{y})=\exp\left(-\|\mathbf{x}-\mathbf{y}\|^{2} /\left(2 \sigma^{2}\right)\right) & \text{radial basis function kernel} \\ \kappa(\mathbf{x}, \mathbf{y})=\tanh\left(\gamma \mathbf{x}^{\top} \mathbf{y}-\delta\right) & \text{sigmoid kernel} \end{array} \tag{8.49-8.51}$$

where $\tanh(x)=\left(e^{x}-e^{-x}\right) /\left(e^{x}+e^{-x}\right)$. One of these kernel functions is then used to define the entries of the Gram matrix via

$$h_{ij}=y_{i} y_{j} \kappa\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right) \tag{8.52}$$
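For instance, the Gram matrix (8.52) for the radial basis function kernel (8.50) could be assembled as follows (a sketch; the helper names and the choice of sigma are illustrative):

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    """kappa(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for all pairs of rows of X1 and X2."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def kernel_gram(X, y, sigma=1.0):
    """Entries h_ij = y_i y_j kappa(x_i, x_j) as in (8.52)."""
    K = rbf_kernel(X, X, sigma=sigma)
    return (y[:, None] * y[None, :]) * K
```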

A few remarks on the aforementioned kernels.

  • In the case of the polynomial kernel of degree $d$ (i.e., $\gamma=d$), when the original samples live in $\mathbb{R}^{p}$, the dimension of the Hilbert space $\mathcal{H}$ is $\binom{p+d}{d}$.

  • The radial basis function kernel is often called the Gaussian kernel; it corresponds to a nonlinear map $\phi$ into an infinite-dimensional Hilbert space $\mathcal{H}$.

  • The sigmoid kernel yields a symmetric positive definite (spd) kernel matrix only for specific choices of $\gamma, \delta$ and under conditions on the squared norms $\|\mathbf{x}\|^{2}$ of the signals.

Solving the problem

$$\begin{aligned} \max_{\boldsymbol{\lambda}} \quad & \sum_{i} \lambda_{i}-\frac{1}{2} \boldsymbol{\lambda}^{\top} \mathbf{H} \boldsymbol{\lambda} \\ \text{s.t.} \quad & 0 \leq \lambda_{i} \leq c, \quad \sum_{i} \lambda_{i} y_{i}=0 \end{aligned} \tag{8.53}$$

with $h_{ij}=y_{i} y_{j} \kappa\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right)$ provides the Lagrange multipliers $\boldsymbol{\lambda}^{\star}$. But how do we use this result to classify new points? First, note that, following the classification rule from the linear case (i.e., classifying $\mathbf{z}$ by the rule $\operatorname{sign}\left(\mathbf{w}^{\top} \mathbf{z}-b\right)$), we can express the decision function as

$$f(\mathbf{z})=\sum_{i \in \operatorname{Supp}} \lambda_{i} y_{i} \kappa\left(\mathbf{x}_{i}, \mathbf{z}\right)-b \tag{8.54}$$

where $\operatorname{Supp}$ denotes the support set (i.e., all $i$ with $0<\lambda_{i} \leq c$). We use the sign of $f(\mathbf{z})$ to assign the label.
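In code, the decision function (8.54) might look roughly as follows (a sketch reusing the rbf_kernel helper assumed above; lam and b are the outputs of the kernel dual problem):

```python
import numpy as np

def decision_function(Z, X, y, lam, b, kernel, **kernel_params):
    """f(z) = sum_{i in Supp} lambda_i y_i kappa(x_i, z) - b, cf. (8.54)."""
    support = lam > 1e-6                         # indices i with lambda_i > 0
    K = kernel(X[support], Z, **kernel_params)   # kappa(x_i, z_j) for support i, test j
    return (lam[support] * y[support]) @ K - b

def predict(Z, X, y, lam, b, kernel, **kernel_params):
    """Assign labels by the sign of f(z)."""
    return np.sign(decision_function(Z, X, y, lam, b, kernel, **kernel_params))
```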

The only remaining component we need is the offset $b$. From the KKT conditions of the primal problem, we know that the equation

$$y_{i}\left(\left\langle\phi\left(\mathbf{x}_{i}\right), \mathbf{w}\right\rangle_{\mathcal{H}}-b\right)-1=0$$

holds for any $i \in \operatorname{Supp}$. The vector $\mathbf{w}$ is an element of the high-dimensional Hilbert space $\mathcal{H}$ and can be rewritten as the sum of the mapped support vectors $\sum_{j \in \operatorname{Supp}} \lambda_{j} y_{j} \phi\left(\mathbf{x}_{j}\right)$. Plugging this into the previous equation, we get

$$b=\sum_{j \in \operatorname{Supp}}\left(\lambda_{j} y_{j}\left\langle\phi\left(\mathbf{x}_{i}\right), \phi\left(\mathbf{x}_{j}\right)\right\rangle_{\mathcal{H}}\right)-y_{i}$$

for any $i$ in the support set. To make this result more robust, we average over all support vectors and obtain
$$b=\frac{1}{N_{\operatorname{Supp}}} \sum_{i \in \operatorname{Supp}}\left(\sum_{j \in \operatorname{Supp}}\left(\lambda_{j} y_{j}\left\langle\phi\left(\mathbf{x}_{i}\right), \phi\left(\mathbf{x}_{j}\right)\right\rangle_{\mathcal{H}}\right)-y_{i}\right)$$

where $N_{\operatorname{Supp}}$ is the number of support vectors. Replacing the inner product by the kernel function, we can compute $b$ as
$$b=\frac{1}{N_{\operatorname{Supp}}} \sum_{i \in \operatorname{Supp}}\left(\sum_{j \in \operatorname{Supp}}\left(\lambda_{j} y_{j} \kappa\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right)\right)-y_{i}\right) \tag{8.55}$$
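Numerically, (8.55) could be evaluated as follows (a sketch; K_supp is the plain kernel matrix $\kappa\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right)$ restricted to the support indices):

```python
import numpy as np

def offset_b(K_supp, y_supp, lam_supp):
    """b as in (8.55): average over support vectors i of
    sum_j lambda_j y_j kappa(x_i, x_j) - y_i."""
    return np.mean(K_supp @ (lam_supp * y_supp) - y_supp)
```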
