SVM (Support Vector Machine)
The data set and source files can be obtained from the GitHub project
link: https://github.com/Raymond-Yang-2001/AndrewNg-Machine-Learing-Homework
1. Linear SVM
1.1 Starting from logistic regression
When classifying with logistic regression, we have $h_{\theta}(x)=\sigma(\theta^{\top}x)$, where $\sigma$ denotes the sigmoid function. Logistic regression predicts the positive class when $\theta^{\top}x \ge 0$ and the negative class when $\theta^{\top}x < 0$. Its loss function is:
$$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\left(h_{\theta}(x^{(i)})\right)-(1-y^{(i)})\log\left(1-h_{\theta}(x^{(i)})\right)\right]$$
The plots of $-\log{h_{\theta}(x)}$ and $-\log{(1-h_{\theta}(x))}$ are as follows:
We modify the loss function so that when $y=1$, we expect $\theta^{\top}x \gg 1$ rather than $\theta^{\top}x \gg 0$; and when $y=0$, we expect $\theta^{\top}x \ll -1$ rather than $\theta^{\top}x \ll 0$.
Consider the cost function of the SVM:
$$J(\theta)=C\sum_{i=1}^{m}\left[y^{(i)}\mathrm{cost}_{1}(\theta^{\top}x^{(i)})+(1-y^{(i)})\mathrm{cost}_{0}(\theta^{\top}x^{(i)})\right]+\frac{1}{2}\sum_{j=1}^{n}\theta_{j}^{2}$$
C here is the regularization parameter.
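The terms $\mathrm{cost}_{1}$ and $\mathrm{cost}_{0}$ are the losses for the positive and negative class respectively. One common way to write them (an assumption here, following the usual hinge-loss formulation rather than anything stated explicitly above) is:
$$\mathrm{cost}_{1}(z)=\max(0,\,1-z),\qquad \mathrm{cost}_{0}(z)=\max(0,\,1+z)$$
so $\mathrm{cost}_{1}(z)=0$ only when $z\ge 1$, and $\mathrm{cost}_{0}(z)=0$ only when $z\le -1$.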
In linear SVM, unlike logistic regression, which outputs a classification probability, we assume:
$$h_{\theta}(x)=\begin{cases}1, & \theta^{\top}x\ge 0\\ 0, & \text{otherwise}\end{cases}$$
In other words, the SVM classifier directly outputs the classification result.
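As a minimal sketch (illustrative only, not part of the original code), the linear SVM hypothesis can be written as:

import numpy as np

def svm_hypothesis(theta, X):
    # h_theta(x) = 1 if theta^T x >= 0, else 0
    # X is (m, n) with a leading column of ones; theta is (n,)
    return (X @ theta >= 0).astype(int)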
1.2 Large margin classification and SVM
As mentioned earlier, in SVM the condition for the cost term to vanish is: when $y=1$, we require $\theta^{\top}x \ge 1$ rather than $\theta^{\top}x \ge 0$; when $y=0$, we require $\theta^{\top}x \le -1$ rather than $\theta^{\top}x < 0$. In fact, using 0 as the decision threshold already separates the two classes well; SVM further "widens" this decision margin from 0 to $(-1, 1)$. For this reason, SVM is called a large margin classifier.
Consider the loss function of the linear SVM. Suppose we have found a $\theta$ that satisfies the above conditions; then the first term of the loss function is 0, and the optimization objective simplifies to:
$$\min_{\theta}\ \frac{1}{2}\sum_{j=1}^{n}\theta_{j}^{2}=\frac{1}{2}\|\theta\|^{2}$$
$$\mathrm{s.t.}\quad\begin{cases}\theta^{\top}x^{(i)}\ge 1, & y^{(i)}=1\\ \theta^{\top}x^{(i)}\le -1, & y^{(i)}=0\end{cases}$$
From linear algebra we know that $\theta^{\top}x^{(i)}=\rho^{(i)}\|\theta\|$, where $\rho^{(i)}$ is the length of the projection of $x^{(i)}$ onto the direction of $\theta$.
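Concretely, writing the inner product in terms of the angle $\alpha$ between $\theta$ and $x^{(i)}$:
$$\theta^{\top}x^{(i)}=\|\theta\|\,\|x^{(i)}\|\cos\alpha=\rho^{(i)}\|\theta\|,\qquad \rho^{(i)}=\|x^{(i)}\|\cos\alpha$$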
Suppose one of our decision boundaries is as follows: the blue line is the direction of $\theta$, and the green line perpendicular to it is the decision boundary:
In this case, $\rho^{(i)}$ is relatively small, so to satisfy $\rho^{(i)}\|\theta\|\ge 1$ or $\rho^{(i)}\|\theta\|\le -1$, $\|\theta\|$ must become very large. Obviously, this makes the loss function larger, which runs against the optimization objective.
Consider another decision boundary:
In this case, $\rho^{(i)}$ becomes larger, so $\|\theta\|$ can correspondingly become smaller. By making the margin, i.e. these $\rho^{(i)}$ values, larger, the support vector machine can end up with a smaller norm $\|\theta\|$. This is exactly what minimizing the objective function pursues, and it is why the support vector machine ends up being a large margin classifier: it tries to maximize the $\rho^{(i)}$, which are the distances from the training samples to the decision boundary.
1.3 Adjust regularization parameters
The data set used is visualized as follows:
Using regularization parameter C=1:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

# x, y, positive_data, negative_data come from the data loading / visualization step above
svc = svm.LinearSVC(C=1, max_iter=1000)
svc.fit(x, y.ravel())
# theta = [intercept, coef_1, coef_2]
theta1 = [svc.intercept_[0], svc.coef_[0, 0], svc.coef_[0, 1]]
x_ax = np.arange(0, 4, 0.1)
xx = np.array([1.5, 2.5])
# Decision boundary: theta0 + theta1*x1 + theta2*x2 = 0  =>  x2 = -theta0/theta2 - (theta1/theta2)*x1
y_ax = -theta1[0] / theta1[2] + (-theta1[1] / theta1[2]) * x_ax
print(theta1[0], -theta1[0] / theta1[2], -theta1[1] / theta1[2])
# Segment along the direction of the theta vector (slope = coef_2 / coef_1)
yy = (theta1[2] / theta1[1]) * xx
plt.figure(figsize=(10,8))
plt.scatter(x=positive_data[:, 0], y=positive_data[:, 1], s=10, color="red",label="positive")
plt.scatter(x=negative_data[:, 0], y=negative_data[:, 1], s=10, label="negative")
plt.plot(x_ax, y_ax, label="Decision Boundary")
plt.plot(xx, yy, label="Direction of Theta Vector")
plt.axis('equal')
plt.legend(loc='best',framealpha=0.5)
plt.show()
It can be seen that the SVM has learned a good classifier and is not affected by the outlier in the upper left corner.
Using regularization parameter C=100:
from sklearn import svm

# Larger C = weaker regularization; more iterations are needed to converge
svc2 = svm.LinearSVC(C=100, max_iter=100000)
svc2.fit(x, y.ravel())
theta2 = [svc2.intercept_[0], svc2.coef_[0, 0], svc2.coef_[0, 1]]
x_ax = np.arange(0, 4, 0.1)
y_ax = -theta2[0] / theta2[2] + (-theta2[1] / theta2[2])*x_ax
plt.figure(figsize=(10,8))
plt.scatter(x=positive_data[:, 0], y=positive_data[:, 1], s=10, color="red",label="positive")
plt.scatter(x=negative_data[:, 0], y=negative_data[:, 1], s=10, label="negative")
plt.plot(x_ax, y_ax, label="Decision Boundary")
xx = np.array([1,1.5])
yy = (theta2[2] / theta2[1] )*xx
plt.plot(xx + 1, yy, label="Direction of Theta Vector")
plt.axis('equal')
plt.legend(loc=0,framealpha=0.5)
plt.show()
It can be seen that when C is large, the SVM is affected by the outlier and overfits.
2. Nonlinear SVM (Gaussian kernel function)
The optimization objective of the linear SVM discussed above is based on the linear operation $\theta^{\top}x$. When we face more complex decision boundaries, a simple linear operation is no longer sufficient. Just as nonlinear activation functions are introduced in neural networks, in SVM we introduce nonlinear kernel functions to achieve more complex classification. This type of SVM is called nonlinear SVM.
This is equivalent to using a set of new features in place of the original samples; the kernel function performs the nonlinear mapping from samples to these new features.
$$f^{(i)} \leftarrow x^{(i)}$$
2.1 Gaussian kernel
$$f_{i}=\mathrm{sim}(x,l^{(i)})=\exp\left(-\frac{\|x-l^{(i)}\|^{2}}{2\sigma^{2}}\right)$$
When $x$ and $l^{(i)}$ are close to each other, the kernel value is close to 1; when they are far apart, the kernel value is close to 0.
$$f^{(i)}=\begin{bmatrix} f^{(i)}_{0}=1 \\ f^{(i)}_{1}=\mathrm{sim}(x^{(i)},l^{(1)}) \\ \vdots \\ f^{(i)}_{m}=\mathrm{sim}(x^{(i)},l^{(m)}) \end{bmatrix}$$
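A minimal NumPy sketch of this mapping (illustrative only; it assumes, as in the course, that the landmarks $l^{(i)}$ are the training samples themselves):

import numpy as np

def gaussian_kernel(x, l, sigma):
    # Similarity between sample x and landmark l
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

def feature_vector(x, landmarks, sigma):
    # f = [1, sim(x, l^(1)), ..., sim(x, l^(m))]
    return np.array([1.0] + [gaussian_kernel(x, l, sigma) for l in landmarks])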
The optimization objective then becomes:
$$J(\theta)=C\sum_{i=1}^{m}\left[y^{(i)}\mathrm{cost}_{1}(\theta^{\top}f^{(i)})+(1-y^{(i)})\mathrm{cost}_{0}(\theta^{\top}f^{(i)})\right]+\frac{1}{2}\sum_{j=1}^{n}\theta_{j}^{2}$$
When the $\sigma$ parameter is larger, the features vary more smoothly ($\|x-l^{(i)}\|^{2}$ has less impact on the kernel value), so different samples become less distinguishable. This helps mitigate the influence of outliers, lowers the variance of the model and reduces overfitting, but it increases the bias. Conversely, when $\sigma$ is smaller, the features become more discriminative, giving the model larger variance and smaller bias.
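Note that sklearn's rbf kernel, used in the code below, is parameterized by gamma rather than $\sigma$; the two are related by $\gamma = \frac{1}{2\sigma^{2}}$, so a large $\sigma$ corresponds to a small gamma.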
2.2 Nonlinear classification
A visualization of the dataset for non-linear classification looks like this:
Regularization parameter is 100
import numpy as np
import matplotlib.pyplot as plt


def show_boundary(svc, scale, fig_size, fig_dpi, positive_data, negative_data, term):
    """
    Show SVM classification boundary plot
    :param svc: instance of SVC, fitted and probability=True
    :param scale: scale for x-axis and y-axis
    :param fig_size: figure size, tuple (w, h)
    :param fig_dpi: figure dpi, int
    :param positive_data: positive data for dataset (n, d)
    :param negative_data: negative data for dataset (n, d)
    :param term: width for classification boundary
    :return: decision plot
    """
    # Dense grid of candidate points covering the plotting region
    t1 = np.linspace(scale[0, 0], scale[0, 1], 500)
    t2 = np.linspace(scale[1, 0], scale[1, 1], 500)
    coordinates = np.array([[x, y] for x in t1 for y in t2])
    # Keep the points whose positive-class probability is within `term` of 0.5: the decision boundary
    prob = svc.predict_proba(coordinates)
    idx1 = np.where(np.logical_and(prob[:, 1] > 0.5 - term, prob[:, 1] < 0.5 + term))[0]
    my_bd = coordinates[idx1]
    plt.figure(figsize=fig_size, dpi=fig_dpi)
    plt.scatter(x=my_bd[:, 0], y=my_bd[:, 1], s=10, color="yellow", label="My Decision Boundary")
    plt.scatter(x=positive_data[:, 0], y=positive_data[:, 1], s=10, color="red", label="positive")
    plt.scatter(x=negative_data[:, 0], y=negative_data[:, 1], s=10, label="negative")
    plt.title('Decision Boundary')
    plt.legend(loc=2)
    plt.show()
from sklearn import svm
from sklearn.metrics import classification_report
svc100 = svm.SVC(C=100, kernel='rbf', gamma=10, probability=True)
svc100.fit(x,y.ravel())
report100 = classification_report(svc100.predict(x),y,digits=4)
print(report100)
show_boundary(svc100, scale=np.array([[0,1],[0.4,1]]), fig_size=fig_size, fig_dpi=fig_dpi,positive_data=positive_data,negative_data=negative_data, term=1e-3)
precision recall f1-score support
0 0.9791 0.9542 0.9665 393
1 0.9625 0.9830 0.9726 470
accuracy 0.9699 863
macro avg 0.9708 0.9686 0.9696 863
weighted avg 0.9701 0.9699 0.9698 863
The regularization parameter is 1
svc1 = svm.SVC(C=1, kernel="rbf", gamma=10, probability=True)
svc1.fit(x,y.ravel())
report1 = classification_report(svc1.predict(x),y,digits=4)
print(report1)
show_boundary(svc1, scale=np.array([[0,1],[0.4,1]]), fig_size=fig_size, fig_dpi=fig_dpi,positive_data=positive_data,negative_data=negative_data, term=1e-3)
precision recall f1-score support
0 0.8851 0.8582 0.8715 395
1 0.8833 0.9060 0.8945 468
accuracy 0.8841 863
macro avg 0.8842 0.8821 0.8830 863
weighted avg 0.8841 0.8841 0.8840 863
It can be seen that as the regularization parameter becomes smaller, the classification boundary becomes "smoother".
2.3 Parameter search
In applied machine learning, choosing hyperparameters is a key step: different parameter values lead to different performance. One of the most commonly used methods is grid search (GridSearch).
Before implementing grid search, we first introduce a method for evaluating model performance: k-fold cross-validation. Normally, when training a model we split off a fixed part of the training set as the validation set. K-fold cross-validation instead splits the training set into k parts and trains the model k times, each time using one part as the validation set and the rest as the training set; the average score on the validation folds is used to evaluate the model. This method takes the data distribution of the whole training set into account and often reflects the generalization ability of the model better than a fixed validation split.
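As a quick reference, sklearn also provides cross_val_score, which runs this k-fold loop for you (a minimal sketch; train_x and train_y are assumed to come from the data loading earlier, and the C and gamma values are arbitrary):

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# 5-fold cross-validation accuracy of one fixed parameter combination
scores = cross_val_score(SVC(C=1, gamma=10), train_x, train_y.ravel(), cv=5)
print(scores.mean())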
The steps for grid search are:
- Specify a set of candidate values for each target parameter; the combinations of several parameters form a structure similar to a "grid"
- For each combination of parameter values, perform k-fold cross-validation (sklearn uses k=5 by default)
- Select the parameter combination with the highest average score as the optimal one
The code is implemented as follows:
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import KFold
from SVM import show_boundary

# Candidate values for both C and gamma; every (C, gamma) pair is one point of the grid
candidate = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100]
parameters_grid = np.array([[c, gamma] for c in candidate for gamma in candidate])
score_list = []
kf = KFold(n_splits=5)
for param in parameters_grid:
    score = []
    # 5-fold cross-validation for this (C, gamma) combination
    for tr_idx, test_idx in kf.split(train_x, train_y):
        tr_x, tr_y = train_x[tr_idx], train_y[tr_idx]
        test_x, test_y = train_x[test_idx], train_y[test_idx]
        svc = SVC(C=param[0], gamma=param[1], probability=True)
        svc.fit(tr_x, tr_y.ravel())
        score.append(svc.score(test_x, test_y.ravel()))
    score_list.append(score)
score_arr = np.array(score_list).mean(axis=1)
best_param = parameters_grid[np.argmax(score_arr)]
best_score = score_arr.max()
param_dict = {'C': best_param[0], 'gamma': best_param[1]}
best_svc = SVC(probability=True)
best_svc.set_params(**param_dict)
best_svc.fit(train_x,train_y.ravel())
print("Best parameters C={}, gamma={}, with average precision of {:.4f}".format(best_param[0], best_param[1], best_score))
Best parameters C=30.0, gamma=3.0, with average precision of 0.9244
Verify using sklearn's GridSearchCV:
from sklearn.model_selection import GridSearchCV

svc = SVC(probability=True)
parameters = {'C': candidate, 'gamma': candidate}
# GridSearchCV performs 5-fold cross-validation by default
clf = GridSearchCV(svc, parameters, n_jobs=-1)
clf.fit(train_x,train_y.ravel())
print("SKlearn result: C={}, gamma={}".format(clf.best_params_.get('C'), clf.best_params_.get('gamma')))
SKlearn result: C=30, gamma=3
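GridSearchCV also exposes the best cross-validated score and the refitted model (a short usage note; the attribute names below are from sklearn's API):

print(clf.best_score_)            # mean cross-validated score of the best parameter combination
best_model = clf.best_estimator_  # by default, refit on the whole training set with the best parameters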
Visualizing the dataset and the classification boundary:
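A sketch of that final visualization, reusing show_boundary from earlier with the tuned model (the scale values here are placeholders and would need to match the actual range of this dataset):

show_boundary(best_svc, scale=np.array([[-0.6, 0.4], [-0.7, 0.7]]), fig_size=fig_size, fig_dpi=fig_dpi,
              positive_data=positive_data, negative_data=negative_data, term=1e-3)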