Andrew Ng's Machine Learning in Python (6): SVM Support Vector Machines (full code at the end)

All datasets: https://pan.baidu.com/s/1vTaw1n77xPPfKk23KEKARA (extraction code: 5gl2)

1 Support Vector Machines

1.1 Prepare datasets

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from scipy.io import loadmat
from sklearn import svm

'''
1.Prepare datasets
'''
mat = loadmat('data/ex6data1.mat')
print(mat.keys())

X = mat['X']
y = mat['y']
'''Most SVM libraries add the extra intercept feature x0 automatically, so there is no need to add it by hand.'''
def plotData(X, y):
    plt.figure(figsize=(8, 6))
    plt.scatter(X[:, 0], X[:, 1], c=y.flatten(), cmap='rainbow')

    plt.xlabel('x1')
    plt.ylabel('x2')

    pass

[Figure: scatter plot of Example Dataset 1]

Next, take a range slightly larger than the extent of the data, split it into 500 points along each axis, build a grid with meshgrid, and finally draw the decision boundary as a contour plot.

1.2 Decision Boundary

def plotBoundary(clf, X):
    '''Plot Decision Boundary'''
    x_min, x_max = X[:, 0].min() * 1.2, X[:, 0].max() * 1.1
    y_min, y_max = X[:, 1].min() * 1.1, X[:, 1].max() * 1.1

    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 500), np.linspace(y_min, y_max, 500))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    Z = Z.reshape(xx.shape)
    plt.contour(xx, yy, Z)

    pass

Fit the model by calling sklearn's support vector machine implementation.

models = [svm.SVC(C, kernel='linear') for C in [1, 100]]

clfs = [model.fit(X, y.ravel()) for model in models]
score = [model.score(X, y) for model in models]

def plot():
    titles = ['SVM Decision Boundary with C = {} (Example Dataset 1)'.format(C) for C in [1, 100]]
    for model, title in zip(clfs, titles):
        # plotData already opens its own figure, so no extra plt.figure is needed here
        plotData(X, y)
        plotBoundary(model, X)
        plt.title(title)
        pass
    pass

plot()
plt.show()

[Figure: SVM decision boundary with C = 1 (Example Dataset 1)]

[Figure: SVM decision boundary with C = 100 (Example Dataset 1)]

A large C parameter tells the SVM to try to classify all the examples correctly.

C plays a role similar to 1/λ, where λ is the regularization parameter that we were using previously for logistic regression.

C can be read as the penalty on misclassification: the larger the penalty, the more precisely the boundary fits the training set (and the greater the risk of overfitting).
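A quick check with the training scores computed above (reusing clfs, X and y from earlier): the larger C should give the higher training accuracy.

# Compare training accuracy for the two values of C fitted above.
for C, clf_ in zip([1, 100], clfs):
    print('C = {:>3}: training accuracy = {:.3f}'.format(C, clf_.score(X, y)))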

1.3 SVM with Gaussian Kernels

When using an SVM for non-linear classification, we generally use a Gaussian kernel:
$$K_{\text{gaussian}}\left(x^{(i)}, x^{(j)}\right)=\exp\left(-\frac{\left\|x^{(i)}-x^{(j)}\right\|^{2}}{2\sigma^{2}}\right)=\exp\left(-\frac{\sum_{k=1}^{n}\left(x_{k}^{(i)}-x_{k}^{(j)}\right)^{2}}{2\sigma^{2}}\right)$$
In this article we simply use sklearn's built-in 'rbf' kernel; the hand-written version below is only for checking the kernel value on one pair of points.

def gaussKernel(x1, x2, sigma):
    '''Gaussian (RBF) kernel: exp(-||x1 - x2||^2 / (2 * sigma^2))'''
    return np.exp(-((x1 - x2) ** 2).sum() / (2 * sigma ** 2))

a = gaussKernel(np.array([1, 2, 1]), np.array([0, 4, -1]), 2.)  # expected value ~0.32465
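As an optional sanity check (a sketch, assuming sklearn is available): with gamma = 1/(2σ²), sklearn's built-in RBF kernel should agree with gaussKernel on the same pair of points.

from sklearn.metrics.pairwise import rbf_kernel

# Both lines should print ~0.32465 for sigma = 2.
u, v = np.array([1., 2., 1.]), np.array([0., 4., -1.])
print(gaussKernel(u, v, 2.))
print(rbf_kernel(u.reshape(1, -1), v.reshape(1, -1), gamma=1 / (2 * 2. ** 2))[0, 0])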

1.3.1 Gaussian Kernel - Example Dataset 2
mat = loadmat('data/ex6data2.mat')
x2 = mat['X']
y2 = mat['y']
plotData(x2, y2)
plt.show()

[Figure: scatter plot of Example Dataset 2]

sigma = 0.1
gamma = np.power(sigma, -2)/2
'''
The larger gamma is in the Gaussian kernel, the smaller σ is, and the taller and narrower the bell curve: the boundary fits the training data more tightly (lower bias, higher variance).
The smaller gamma is, the larger σ is, and the shorter and wider the curve: a smoother boundary, higher bias, lower variance.
'''
clf = svm.SVC(C=1, kernel='rbf', gamma=gamma)
model = clf.fit(x2, y2.flatten())
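The boundary in the figure below can be reproduced with the plotting helpers defined earlier; a minimal sketch:

plotData(x2, y2)
plotBoundary(clf, x2)
plt.title('SVM (Gaussian Kernel) Decision Boundary (Example Dataset 2)')
plt.show()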

[Figure: SVM (Gaussian kernel) decision boundary on Example Dataset 2]
1.3.2 Gaussian Kernel - Example Dataset 3
'''
Example Dataset 3
'''
mat3 = loadmat('data/ex6data3.mat')
x3, y3 = mat3['X'], mat3['y']
Xval, yval = mat3['Xval'], mat3['yval']
plotData(x3, y3)

Cvalues = (0.01, 0.03, 0.1, 0.3, 1., 3., 10., 30.)
sigmavalues = Cvalues
best_pair, best_score = (0, 0), 0

for C in Cvalues:
    for sigma in sigmavalues:
        gamma = np.power(sigma, -2.) / 2
        model = svm.SVC(C=C, kernel='rbf', gamma=gamma)
        model.fit(x3, y3.flatten())
        this_score = model.score(Xval, yval)
        '''
        For a classifier such as SVC, model.score returns the mean accuracy
        on the given data and labels, i.e. the fraction of correctly
        classified samples (between 0 and 1; higher is better). The R^2
        coefficient of determination is what score returns for regressors
        (e.g. SVR), not for SVC.
        '''

        if this_score > best_score:
            best_score = this_score
            best_pair = (C, sigma)
        pass
    pass
print('Best (C, sigma) pair:', best_pair, 'validation accuracy:', best_score)

model = svm.SVC(C=1, kernel='rbf', gamma=np.power(0.1, -2.) / 2)  # retrain with the best pair found above (C=1, sigma=0.1)

model.fit(x3, y3.flatten())
plotData(x3, y3)
plotBoundary(model, x3)

[Figure: decision boundary on Example Dataset 3 with the best (C, sigma) pair]

On the score method of sklearn's SVMs: for a classifier (SVC) it returns the mean accuracy, while for a regressor (SVR) it returns the R² coefficient of determination.

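A quick way to verify this (a sketch reusing model, Xval and yval from above): score matches the accuracy computed from predict.

from sklearn.metrics import accuracy_score

pred = model.predict(Xval)
print(model.score(Xval, yval))             # mean accuracy
print(accuracy_score(yval.ravel(), pred))  # same value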

2 Spam Classification

For the spam classification part I'll keep things brief and just walk through the code.

import numpy as np
import matplotlib.pyplot as plt
from scipy.io import loadmat
from sklearn import svm
import pandas as pd
import re

import nltk, nltk.stem.porter

with open('data/emailSample1.txt', 'r') as f:
    email = f.read()
    pass
print(email)

def processEmail(email):
    '''All preprocessing steps except word stemming and removal of non-words.'''
    email = email.lower()                                        # lower-case everything
    email = re.sub(r'<[^<>]+>', ' ', email)                      # strip HTML tags
    email = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email)  # normalize URLs
    email = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email)         # normalize email addresses
    email = re.sub(r'[\$]+', 'dollar', email)                    # normalize dollar signs
    email = re.sub(r'[\d]+', 'number', email)                    # normalize numbers
    return email
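A quick demo of processEmail on a made-up snippet (hypothetical input, not from the dataset):

sample = 'Visit https://example.com now! Only $9.99, offer code 12345.'
print(processEmail(sample))
# -> visit httpaddr now! only dollarnumber.number, offer code number.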

def email2TokenList(email):
    """Preprocess the email and return a clean list of stemmed words."""

    stemmer = nltk.stem.porter.PorterStemmer()

    email = processEmail(email)

    # Split on whitespace and punctuation
    tokens = re.split(r'[ \@\$\/\#\.\-\:\&\*\+\=\[\]\?\!\(\)\{\}\,\'\"\>\_\<\;\%]', email)

    tokenlist = []
    for token in tokens:
        # Remove any remaining non-alphanumeric characters
        token = re.sub('[^a-zA-Z0-9]', '', token)
        # Stem the word, e.g. 'discounts' -> 'discount'
        stemmed = stemmer.stem(token)
        # Skip tokens that ended up empty
        if not len(token):
            continue
        tokenlist.append(stemmed)

    return tokenlist
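For example, tokenizing the sample email loaded above (the exact tokens depend on the email's contents):

print(email2TokenList(email)[:10])  # first ten stemmed tokens of emailSample1.txt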

def email2VocabIndices(email, vocab):
    '''Return the indices of the vocabulary words that appear in the email.'''
    token = email2TokenList(email)
    index = [i for i in range(len(vocab)) if vocab[i] in token]
    return index

def email2FeatureVector(email):
    '''
    Convert an email into a feature vector of length n = len(vocab):
    entries whose vocabulary word appears in the email are set to 1, the rest to 0.
    '''
    df = pd.read_table('data/vocab.txt', names=['words'])
    vocab = np.array(df)
    vector = np.zeros(len(vocab))
    vocab_indices = email2VocabIndices(email, vocab)

    for i in vocab_indices:
        vector[i] = 1
        pass
    return vector

vector = email2FeatureVector(email)
print('length of vector = {}\nnum of non-zero = {}'.format(len(vector), int(vector.sum())))

mat1 = loadmat('data/spamTrain.mat')
X, y = mat1['X'], mat1['y']

mat2 = loadmat('data/spamTest.mat')
Xtest, ytest = mat2['Xtest'], mat2['ytest']

clf = svm.SVC(C=0.1, kernel='linear')
clf.fit(X, y.ravel())  # ravel to the 1-D label array sklearn expects

predTrain = clf.score(X, y)
predTest = clf.score(Xtest, ytest)
print(predTrain, predTest)
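As a final sanity check (a sketch reusing clf and the feature vector computed above), the trained classifier can score the sample email itself; predict expects a 2-D array, hence the reshape:

print(clf.predict(vector.reshape(1, -1)))  # 1 = spam, 0 = not spam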


Full code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from scipy.io import loadmat
from sklearn import svm

'''
1.Prepare datasets
'''
mat = loadmat('data/ex6data1.mat')
print(mat.keys())

X = mat['X']
y = mat['y']
'''Most SVM libraries add the extra intercept feature x0 automatically, so there is no need to add it by hand.'''

def plotData(X, y):
    plt.figure(figsize=(8, 6))
    plt.scatter(X[:, 0], X[:, 1], c=y.flatten(), cmap='rainbow')

    plt.xlabel('x1')
    plt.ylabel('x2')

    pass

def plotBoundary(clf, X):
    '''Plot Decision Boundary'''
    x_min, x_max = X[:, 0].min() * 1.2, X[:, 0].max() * 1.1
    y_min, y_max = X[:, 1].min() * 1.1, X[:, 1].max() * 1.1

    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 500), np.linspace(y_min, y_max, 500))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    Z = Z.reshape(xx.shape)
    plt.contour(xx, yy, Z)

    pass

models = [svm.SVC(C, kernel='linear') for C in [1, 100]]

clfs = [model.fit(X, y.ravel()) for model in models]
score = [model.score(X, y) for model in models]

def plot():
    titles = ['SVM Decision Boundary with C = {} (Example Dataset 1)'.format(C) for C in [1, 100]]
    for model, title in zip(clfs, titles):
        # plotData already opens its own figure, so no extra plt.figure is needed here
        plotData(X, y)
        plotBoundary(model, X)
        plt.title(title)
        pass
    pass

plot()
plt.show()

'''
2.SVM with Gaussian Kernels
'''

def gaussKernel(x1, x2, sigma):
    '''Gaussian (RBF) kernel: exp(-||x1 - x2||^2 / (2 * sigma^2))'''
    return np.exp(-((x1 - x2) ** 2).sum() / (2 * sigma ** 2))

a = gaussKernel(np.array([1, 2, 1]), np.array([0, 4, -1]), 2.)  # expected value ~0.32465

'''
Example Dataset 2
'''

mat = loadmat('data/ex6data2.mat')
x2 = mat['X']
y2 = mat['y']
plotData(x2, y2)
plt.show()

sigma = 0.1
gamma = np.power(sigma, -2)/2
'''
The larger gamma is in the Gaussian kernel, the smaller σ is, and the taller and narrower the bell curve: the boundary fits the training data more tightly (lower bias, higher variance).
The smaller gamma is, the larger σ is, and the shorter and wider the curve: a smoother boundary, higher bias, lower variance.
'''
clf = svm.SVC(C=1, kernel='rbf', gamma=gamma)
model = clf.fit(x2, y2.flatten())

'''
Example Dataset 3
'''
mat3 = loadmat('data/ex6data3.mat')
x3, y3 = mat3['X'], mat3['y']
Xval, yval = mat3['Xval'], mat3['yval']
plotData(x3, y3)

Cvalues = (0.01, 0.03, 0.1, 0.3, 1., 3., 10., 30.)
sigmavalues = Cvalues
best_pair, best_score = (0, 0), 0

for C in Cvalues:
    for sigma in sigmavalues:
        gamma = np.power(sigma, -2.) / 2
        model = svm.SVC(C=C, kernel='rbf', gamma=gamma)
        model.fit(x3, y3.flatten())
        this_score = model.score(Xval, yval)
        '''
        For a classifier such as SVC, model.score returns the mean accuracy
        on the given data and labels, i.e. the fraction of correctly
        classified samples (between 0 and 1; higher is better). The R^2
        coefficient of determination is what score returns for regressors
        (e.g. SVR), not for SVC.
        '''

        if this_score > best_score:
            best_score = this_score
            best_pair = (C, sigma)
        pass
    pass
print('Best (C, sigma) pair:', best_pair, 'validation accuracy:', best_score)

model = svm.SVC(C=1, kernel='rbf', gamma=np.power(0.1, -2.) / 2)  # retrain with the best pair found above (C=1, sigma=0.1)

model.fit(x3, y3.flatten())
plotData(x3, y3)
plotBoundary(model, x3)

Reference: https://blog.csdn.net/Cowry5/article/details/80465922

Original: https://blog.csdn.net/weixin_48577398/article/details/117465475
Author: TCQD