机器学习——支持向量机

【说明】文章内容来自《机器学习——基于sklearn》，用于学习记录。若有争议联系删除。

1、简介

支持向量机(support vector machine,SVM)是一类按监督学习方式对数据进行二元分类的广义线性分类器，其决策边界是对学习样本求解的最大边距超平面(maximum-marginhyperplane)。与逻辑回归和神经网络相比，支持向量机在学习复杂的非线性方程时提供了一种更清晰、更强大的方式。

1.1 算法思想

支持向量机(Support Vector Machine,SVM)的基本思想是在N维数据找到N-1维的超平面(hyperplane)作为分类的决策边界。确定超平面的规则是：找到离超平面最近的那些点，使它们与超平面的距离尽可能远。在图中，离超平面最近的实心点和空心点称为支持向量，超平面两侧的支持向量与超平面的距离之和称为间隔距离，即图中的2/Ilwll。间隔距离越大，分类的准确率越高。在图中，两条虚线称为决策边界。

超平面可以用如下的线性方程来描述:

其中，w是超平面的法向量，定义了垂直于超平面的方向，b用于平移超平面。
支持向量机之所以成为目前最常用、效果最好的分类器之一，在小样本训练集上能够得到比其他算法更好的结果，原因就在于其优秀的泛化能力。但是，如果数据量很大(如垃圾邮件的分类检测），支持向量机的训练时间就会比较长。

1.2 支持向量机算法库

Sklearn 中支持向量机的算法库分为两类：一类是分类算法库，包括SVC、NuSVC和LinearSVC；另一类是回归算法库，包括svm、LinearSVR、svm.NuSVR、svm.SVR
在 SVC.NuSVC和 LinearSVC这3个分类算法库中，SVC和 NuSVC 差不多，区别仅在于两者对损失的度量方式不同；而LinearSVC 只用于线性分类，不支持各种从低维到高维的核函数，仅支持线性核函数，对线性不可分的数据不能使用。

2、核函数

核函数用于将非线性问题转化为线性问题。通过特征变换增加新的特征，使得低维空间中的线性不可分问题变为高维空间中的线性可分问题，进行升维变换。
SVC的语法如下:

SVC(kernel)

参数 kernel的取值有rbf、linear、 poly,代表不同的核函数。默认的rbf 代表径向基核函数(高斯核函数)，linear 代表线性核函数，poly 代表多项式核函数。

2.1径向基核函数

径向基核函数通过高斯分布函数衡量样本之间的相似度，进而使样本线性可分。径向基核函数的kernel参数取值为rbf，格式如下:

SVC(kernel='rbf', C)

示例：

import numpy as npimport matplotlib.pyplot as pltfrom sklearn import svmfrom sklearn.datasets import make_blobs#先创建50个数据点，将它们分为两类x, y = make_blobs(n_samples = 50, centers = 2, random_state = 6)#创建径向基核的支持向量机模型clf_rbf = svm.SVC(kernel = 'rbf', C = 1000)clf_rbf.fit(x, y)#画数据点plt.scatter(x[:,0], x[:,1], c= y, s = 30, cmap = plt.cm.Paired)#建立图像坐标ax = plt.gca()xlim = ax.get_xlim()ylim = ax.get_ylim()xx = np.linspace(xlim[0], ylim[1], 30)yy = np.linspace(ylim[0], ylim[1], 30)YY, XX = np.meshgrid(yy, xx)xy = np.vstack([XX.ravel(), YY.ravel()]).TZ = clf_rbf.decision_function(xy).reshape(XX.shape)#把分类的决定边界画出来ax.contour(XX, YY, Z, colors = 'k', levels = [-1, 0, 1], alpha = 0.5, linestyles = ['--', '-', '--'])ax.scatter(clf_rbf.support_vectors_[:, 0], clf_rbf.support_vectors_[:, 1], s = 100, linewidth = 1, facecolors = 'none')plt.show()

【运行结果】

2.2线性核函数

线性核函数(linear kernel)不通过核函数进行维度提升，仅在原始维度空间中寻求线性分类边界。线性核函数的kernel参数取值为linear,格式如下:

SVC(kernel='linear', C)

参数C为惩罚系数,用来控制损失函数的惩罚系数，类似于线性回归中的正则化系数。
C值越大，对误分类的惩罚越重，这样会使训练集在测试时准确率很高，但泛化能力弱，容易导致过拟合；C值越小，对误分类的惩罚越轻，容错能力和泛化能力强，但容易导致欠拟合。

线性核函数示例

import numpy as npimport matplotlib.pyplot as pltfrom sklearn import svmfrom sklearn.datasets import make_blobs#先创建50个数据点，让它们分为两类X, y = make_blobs(n_samples = 50, centers = 2, random_state = 6)#创建一个线性核的支持向量机面模型clf = svm.SVC(kernel= 'linear', C = 1000)clf.fit(X,y)#把数据点画出来plt.scatter(X[:, 0], X[:, 1], c = y, s = 30, cmap = plt.cm.Paired)#建立图像坐标ax = plt.gca()#获取坐标轴信息xlim = ax.get_xlim()ylim = ax.get_ylim()xx = np.linspace(xlim[0], xlim[1], 30)yy = np.linspace(ylim[0], ylim[1], 30)YY, XX = np.meshgrid(yy, xx)# meshgrid在二维平面将每一个x和每一个y分别对应起来，编织成栅格xy = np.vstack([XX.ravel(), YY.ravel()]).T#ravel()将数组维度拉成一维数组,np.vstack在竖直方向上堆叠z = clf.decision_function(xy).reshape(XX.shape)#把分类的决策边界画出来contour绘制等高线函数ax.contour(XX, YY, z, colors = 'k', levels = [-1,0,1], alpha = 0.5, linestyles = ['--','-','--'])ax.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s = 100, linewidth = 1, facecolors = 'none')plt.show()

【运行结果】

2.3 多项式核函数

多项式核函数通过多项式函数增加原始样本特征的高次幂，把样本特征投射到高位空间。多项式核函数的kernel参数取值为ploy。格式如下：

SVC(kernel = 'ploy', degree = 3)

参数degree表示选择的多项式的最高幂次，默认为三次多项式。

from sklearn.svm import SVCimport numpy as npX = np.array([[1,1],[1,2],[1,3],[1,4],[2,1],[2,2],[3,1],[4,1],[5,1],[5,2],[6,1],[6,2],[6,3],[6,4],[3,3],[3,4] ,[3,5],[4,3],[4,4],[4,5]])Y = np.array([1] * 14 + [-1] * 6)T = np.array([[0.5, 0.5], [1.5, 1.5], [3.5, 3.5], [4, 5.5]])#X 为训练样本， Y为训练样本标签（1 和-1）， T为测试样本svc = SVC(kernel = 'poly', degree = 2, gamma = 1, coef0 = 0)svc.fit(X, Y)pre = svc.predict(T)print('预测结果\n', pre)print('正类和负类支持向量总个数：\n',svc.n_support_)print("正类和负类支持向量索引:\n", svc.support_)print("正类和负类支持向量:\n", svc.support_vectors_)

【运行结果】

3、参数调优

3.1 gamma参数

gamma用于控制核函数的影响范围，主要适用于使用径向基函数（RBF）或多项式核函数（Poly）。

对于RBF核函数，gamma参数定义了单个训练样本对模型的影响范围。较小的gamma值表示影响范围较大，样本之间的距离相对较远的特征也可能被考虑进来，从而使决策边界更加平滑。较大的gamma值表示影响范围较小，模型将更加关注每个训练样本的局部区域，可能会导致决策边界更加复杂和详细。

对于Poly核函数，gamma参数定义了特征空间中特征的相似度。较小的gamma值表示特征之间的相似度较高，从而产生更平滑的决策边界。较大的gamma值表示特征之间的相似度较低，可能导致更复杂的决策边界。

import sklearn.svm as svmimport matplotlib.pyplot as pltfrom sklearn.datasets import load_wineimport numpy as npdef make_meshgrid(x, y, h = .02):x_min, x_max = x.min() - 1, x.max() + 1y_min, y_max = y.min() - 1, y.max() + 1xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))return xx, yydef plot_contours(ax, clf, xx, yy, **params):z = clf.predict(np.c_[xx.ravel(), yy.ravel()])z = z.reshape(xx.shape)out = ax.contourf(xx, yy, z, **params)#使用酒的数据集wine = load_wine()#选取数据集的前两个特征X = wine.data[:,:2]y = wine.targetC = 1.0models = (svm.SVC(kernel = 'rbf', gamma = 0.1, C = C), svm.SVC(kernel = 'rbf', gamma = 1,C = C), svm.SVC(kernel = 'rbf', gamma = 10,C = C))models = (clf.fit(X, y) for clf in models)titles = ('gamma = 0.1','gamma = 1', 'gamma = 10')fig, sub = plt.subplots(1, 3, figsize = (10, 3))#plt.subplots_adjust(wspace = 0.8, hspace = 0.2)X0, X1 = X[:, 0], X[:, 1]xx, yy = make_meshgrid(X0, X1)for clf, title, ax in zip(models, titles, sub.flatten()):plot_contours(ax, clf, xx, yy, cmap = plt.cm.plasma, alpha = 0.8)ax.scatter(X0, X1, c = y, cmap = plt.cm.plasma, s = 20, edgecolors = 'k')ax.set_xlim(xx.min(), xx.max())ax.set_ylim(yy.min(), yy.max())ax.set_xlabel("Feature 0")ax.set_ylabel('Feature 1')ax.set_xticks(())ax.set_yticks(())ax.set_title(title)plt.show()# 参数gamma分别取值为0.1.1和10。# gamma值越小，径向基核直径越大,进入支持向量机的决策边界中的数据越多,决策边界越平滑，模型越简单;# gamma值越大,支持向量机越倾向于把尽可能多的数据放到决策边界中,模型的复杂度越高。# 所以,gamma值越小，模型越倾向于欠拟合;gamma值越大,模型倾向于过拟合。

【运行结果】

3.2 惩罚系数C

C是惩罚系数，即对误差的宽容度,用于调节优化方向中的两个指标(间隔大小和分类准确度)的权重，表示对分错数据的惩罚力度。
当C较大时，分错的数据就会较少,但是过拟合的情况会比较严重;
当C较小时，容易出现欠拟合的情况。
C越大，训练的迭代次数越大，训练时间越长。

from sklearn import datasetsfrom sklearn.model_selection import GridSearchCVfrom sklearn.svm import SVCfrom sklearn.model_selection import train_test_splitiris = datasets.load_iris()x = iris.data[:,:2]y = iris.targetparam_grid = {'gamma': [0.001, 0.01, 0.1, 1, 10, 100], 'C': [0.001, 0.01, 0.1, 1, 10, 100]}print("Parameters:{}".format(param_grid))grid_search = GridSearchCV(SVC(), param_grid, cv = 5)x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state = 10)grid_search.fit(x_train, y_train)print("test set score:{:.2f}".format(grid_search.score(x_test, y_test)))print('Best parameters:{}'.format(grid_search.best_params_))print('Best score om train set:{:.2f}'.format(grid_search.best_score_))print('Best estimator:',grid_search.best_estimator_)print('Best score:',grid_search.best_score_)

【运行结果】

4、回归问题

支持向量机分类方法能推广到回归问题，称为支持向量回归。支持向量回归有3个版本：SVR、NuSVR和LinearSVR。

import numpy as npfrom sklearn.svm import SVRimport matplotlib.pyplot as plt#产生样本数据x = np.sort(5*np.random.rand(40, 1), axis = 0)y = np.sin(x).ravel()#在目标值中增加噪声数据y[::5] += 3*(0.5 -np.random.rand(8))#估计器svr_rbf = SVR(kernel = 'rbf', C = 1e3, gamma = 0.1)#径向基核函数svr_lin = SVR(kernel = 'linear', C = 1e3)#线性核函数svr_poly = SVR(kernel = 'poly', C = 1e3, degree = 2)#多项式核函数y_rbf = svr_rbf.fit(x, y).predict(x)y_lin = svr_lin.fit(x, y).predict(x)y_poly = svr_poly.fit(x, y).predict(x)lw = 2plt.scatter(x, y, color = 'darkorange', label = 'data')plt.plot(x, y_rbf, color = 'navy', lw = lw, label = 'RBF model')plt.plot(x, y_lin, color = 'c', lw = lw, label = 'Linear model')plt.plot(x, y_poly, color = 'cornflowerblue', lw = lw, label = 'Polynomial model')plt.xlabel('data')plt.ylabel('Support Vector Regression')plt.legend()plt.show()

【运行结果】

5、案例

5.1 鸢尾花

import numpy as npfrom sklearn import datasetsimport sklearn.model_selection as msimport sklearn.svm as svmimport matplotlib.pyplot as pltfrom sklearn.metrics import classification_reportiris = datasets.load_iris()x = iris.data[:,:2]y = iris.target#数据划分x_train, x_test, y_train, y_test = ms.train_test_split(x, y, test_size = 0.25, random_state = 5)#基于线性核函数model = svm.SVC(kernel = 'linear')model.fit(x_train, y_train)#基于多项式核函数，三阶多项式核函数#model = svm.SVC(kernel = 'poly', degree = 3)#model.fit(x_train,, y_train)#预测y_test_pred = model.predict(x_test)#计算模型精度bg = classification_report(y_test, y_test_pred)print('基于线性核函数的分类报告：', bg, sep ='\n')#绘制分类边界线l, r = x[:,0].min() - 1, x[:,0].max() + 1b, t = x[:,1].min() - 1, x[:,1].max() + 1n = 500grid_x,grid_y = np.meshgrid(np.linspace(l,r,n), np.linspace(b,t,n))bg_x = np.column_stack((grid_x.ravel(), grid_y.ravel()))bg_y = model.predict(bg_x)grid_z = bg_y.reshape(grid_x.shape)#画图显示样本数据plt.title('kernel = linear', fontsize = 16)plt.xlabel('x',fontsize = 14)plt.ylabel('y',fontsize = 14)plt.tick_params(labelsize = 10)plt.pcolormesh(grid_x, grid_y, grid_z, cmap = 'gray')plt.scatter(x_test[:,0], x_test[:,1], s = 80, c = y_test, cmap = 'jet', label = 'Samples')plt.legend()plt.show()

【运行结果】

5.1.1 在选择核函数时，一般遵循如下原则：

如果特征非常多或者样本数远少于特征数，数据更偏向线性可分，选择线性核函数效果会更好。
线性和函数的参数少，速度快；径向基核函数的参数多，分类结果非常依赖参数，需要交叉验证或网格搜索最佳参数。
径向基核函数应用最广，对于小样本还是大样本、高纬度还是低纬度等情况都适用。

5.2 波士顿房价

#导人画图工具import matplotlib.pyplot as plt#导人波士顿房价数据集#from sklearn.datasets import load_bostonimport pandas as pdimport numpy as npdata_url = "http://lib.stat.cmu.edu/datasets/boston"raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])target = raw_df.values[1::2, 2]#打印数据集中的键print(raw_df.keys())#导人数据集拆分工具from sklearn.model_selection import train_test_split#建立训练集和测试集X,y=data, targetX_train, X_test, y_train, y_test=train_test_split(X,y,random_state=8)#导人数据预处理工具from sklearn.preprocessing import StandardScaler#对训练集和测试集进行数据预处理scaler=StandardScaler()scaler.fit(X_train)X_train_scaled=scaler.transform(X_train)X_test_scaled=scaler.transform(X_test)#将预处理后的数据特征最大值和最小值用散点图表示n#导人支持向量机回归模型from sklearn.svm import SVR#用预处理后的数据重新训练模型for kernel in ['linear', 'rbf']:svr=SVR(kernel=kernel)svr.fit(X_train_scaled, y_train)print('数据预处理后',kernel,'核函数模型在训练集上的得分:{:.3f}'.format(svr.score(X_train_scaled,y_train)))print('数据预处理后',kernel,'核函数模型在测试集上的得分:{:.3f}'. format(svr.score(X_test_scaled,y_test)))plt.plot(X_train_scaled.min(axis=0),'v',label='train set min')plt.plot(X_train_scaled.max(axis=0),'^', label='train set max')plt.plot(X_test_scaled.min(axis=0), 'v', label='test set min')plt.plot(X_test_scaled.max(axis=0), '^', label='test set max')#设置图注位置为最佳位置plt.legend(loc='best')#设置横纵轴标题plt.xlabel('scaled features')plt.ylabel('scaled feature magnitude')plt.show()#设置径向基核模型的C参数和 gamma参数svr=SVR(C=100, gamma=0.1)svr.fit(X_train_scaled, y_train)print('调节参数后径向基核函数模型在训练集上的得分:{:.3f}'.format(svr.score(X_train_scaled, y_train)))print('调节参数后径向基核函数模型在测试集上的得分:{:.3f}'.format(svr.score(X_test_scaled, y_test)))

【运行结果】

机器学习——支持向量机

1、简介

1.1 算法思想

1.2 支持向量机算法库

2、核函数

2.1径向基核函数

2.2线性核函数

2.3 多项式核函数

3、参数调优

3.1 gamma参数

3.2 惩罚系数C

5、案例

5.1 鸢尾花

5.1.1 在选择核函数时，一般遵循如下原则：

5.2 波士顿房价

最新关注

热文推荐

网络编程-NIO案例与 AIO 案例

【C语言蓝桥杯每日一题】—— 递增序列

如何使用PyCharm里面的库Flask，以及怎么去安装

天翼云加速落地紫金DPU实践应用，让算力供给更高效！

react中实现修改input的defaultValue

6、SpringBoot2之整合Mybatis

机器学习——支持向量机

1、简介

1.1 算法思想

1.2 支持向量机算法库

2、核函数

2.1径向基核函数

2.2线性核函数

2.3 多项式核函数

3、参数调优

3.1 gamma参数

3.2 惩罚系数C

5、案例

5.1 鸢尾花

5.1.1 在选择核函数时，一般遵循如下原则：

5.2 波士顿房价

相关文章

最新关注

热文推荐