Gaussian Naive Bayes Classification: Principles Explained and a Hand-Written Implementation

July 2, 2023, 6:41 AM • Artificial Intelligence

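Gaussian Naive Bayes (GNB) is a machine-learning classification technique based on a probabilistic approach and the Gaussian distribution. Naive Bayes assumes that each parameter (also called a feature or predictor) contributes to predicting the output variable independently of the others. The per-feature predictions are combined into a final prediction that gives the probability of the dependent variable belonging to each group, and the sample is assigned to the group (class) with the highest probability.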
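The rule described above can be written compactly. This is the standard naive Bayes formulation; the notation is chosen here for illustration, with $C_k$ for class $k$ and $x_1, \dots, x_n$ for the feature values of one sample:

```latex
\hat{y} = \arg\max_{k} \; P(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)
```

In the Gaussian variant, each factor $p(x_i \mid C_k)$ is a normal density whose mean and standard deviation are estimated from the samples of class $k$.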

What Is the Gaussian Distribution?

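The Gaussian distribution, also called the normal distribution, is a statistical model that describes how continuous random variables found in nature are distributed. It is characterized by its bell-shaped curve, and its two most important parameters are the mean (μ) and the standard deviation (σ). The mean is the center of the distribution, and the standard deviation measures the "width" of the distribution around the mean.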

It is important to know that a normally distributed variable X is continuous over -∞ < X < +∞, and that the total area under the model curve is 1.
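Both properties can be checked numerically. The sketch below uses only the Python standard library; the mean and standard deviation are illustrative values, not statistics taken from the dataset built later.

```python
from statistics import NormalDist

# Illustrative parameters (assumed for this check only).
mu, sigma = 20.0, 19.0
dist = NormalDist(mu, sigma)

def riemann_area(lo, hi, steps=100_000):
    """Approximate the area under dist.pdf between lo and hi (midpoint rule)."""
    width = (hi - lo) / steps
    return sum(dist.pdf(lo + (i + 0.5) * width) for i in range(steps)) * width

# Integrating over +/- 8 standard deviations captures essentially all of the mass.
total_area = riemann_area(mu - 8 * sigma, mu + 8 * sigma)

# Share of the distribution that lies within one standard deviation of the mean.
within_one_sd = dist.cdf(mu + sigma) - dist.cdf(mu - sigma)

print(f"total area  ~ {total_area:.6f}")
print(f"within 1 SD ~ {within_one_sd:.4f}")
```

The total area comes out at essentially 1, and roughly 68% of the mass lies within one standard deviation of the mean, whatever values are chosen for `mu` and `sigma`.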

Gaussian Naive Bayes for Multi-Class Classification

Import the necessary libraries:

from random import random
from random import randint
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statistics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from mlxtend.plotting import plot_decision_regions

Now create a dataset whose predictor variables are normally distributed.

#Creating values for FeNO with 3 classes:
FeNO_0 = np.random.normal(20, 19, 200)
FeNO_1 = np.random.normal(40, 20, 200)
FeNO_2 = np.random.normal(60, 20, 200)

#Creating values for FEV1 with 3 classes:
FEV1_0 = np.random.normal(4.65, 1, 200)
FEV1_1 = np.random.normal(3.75, 1.2, 200)
FEV1_2 = np.random.normal(2.85, 1.2, 200)

#Creating values for Broncho Dilation with 3 classes:
BD_0 = np.random.normal(150,49, 200)
BD_1 = np.random.normal(201,50, 200)
BD_2 = np.random.normal(251, 50, 200)

#Creating labels variable with three classes:(2)disease (1)possible disease (0)no disease:
not_asthma = np.zeros((200,), dtype=int)
poss_asthma = np.ones((200,), dtype=int)
asthma = np.full((200,), 2, dtype=int)

#Concatenate classes into one variable:
FeNO = np.concatenate([FeNO_0, FeNO_1, FeNO_2])
FEV1 = np.concatenate([FEV1_0, FEV1_1, FEV1_2])
BD = np.concatenate([BD_0, BD_1, BD_2])
dx = np.concatenate([not_asthma, poss_asthma, asthma])

#Create DataFrame:
df = pd.DataFrame()

#Add variables to DataFrame:
df['FeNO'] = FeNO.tolist()
df['FEV1'] = FEV1.tolist()
df['BD'] = BD.tolist()
df['dx'] = dx.tolist()

#Check database:
df

Our df has 600 rows and 4 columns. We can now inspect the distribution of each variable visually:

fig, axs = plt.subplots(2, 3, figsize=(14, 7))

sns.kdeplot(df['FEV1'], fill=True, color="b", ax=axs[0, 0])
sns.kdeplot(df['FeNO'], fill=True, color="b", ax=axs[0, 1])
sns.kdeplot(df['BD'], fill=True, color="b", ax=axs[0, 2])
sns.histplot(df["FEV1"], kde=True, stat="density", ax=axs[1, 0])
sns.histplot(df["FeNO"], kde=True, stat="density", ax=axs[1, 1])
sns.histplot(df["BD"], kde=True, stat="density", ax=axs[1, 2])

plt.show()

On visual inspection, the data appear to be close to a Gaussian distribution. Q-Q plots allow a closer check:

from statsmodels.graphics.gofplots import qqplot
from matplotlib import pyplot

#q-q plot:
fig, axs = pyplot.subplots(1, 3, figsize=(15, 5))
qqplot(df['FEV1'], line='s', ax=axs[0])
qqplot(df['FeNO'], line='s', ax=axs[1])
qqplot(df['BD'], line='s', ax=axs[2])
pyplot.show()

The distributions are not perfectly normal, but they are close. Next, look at the dataset and the relationships between the variables:

#Exploring dataset:
sns.pairplot(df, kind="scatter", hue="dx")
plt.show()

Box plots, together with per-group density curves, show how the three groups are distributed and which features separate the classes best:

#Plotting the distributions of all three groups on the same figure:
fig, axs = plt.subplots(2, 3, figsize=(14, 7))

sns.kdeplot(data=df, x='FEV1', hue='dx', fill=True, ax=axs[0, 0])
sns.kdeplot(data=df, x='FeNO', hue='dx', fill=True, ax=axs[0, 1])
sns.kdeplot(data=df, x='BD', hue='dx', fill=True, ax=axs[0, 2])
sns.boxplot(x=df["dx"], y=df["FEV1"], palette = 'magma', ax=axs[1, 0])
sns.boxplot(x=df["dx"], y=df["FeNO"], palette = 'magma',ax=axs[1, 1])
sns.boxplot(x=df["dx"], y=df["BD"], palette = 'magma',ax=axs[1, 2])

plt.show()

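Hand-Written Naive Bayes Classification

The point of writing the code by hand is not to reinvent the wheel but to understand the algorithm better. Before doing Bayesian classification, we first need to understand the normal distribution.

The formula for the normal distribution gives the probability density of an observation x within a group that has mean μ and standard deviation σ:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right)$$

We can write a function that computes this value: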

def normal_dist(x , mean , sd):
      #Normal probability density: 1/(sd*sqrt(2*pi)) * exp(-0.5*((x-mean)/sd)**2)
      prob_density = (1/(sd*np.sqrt(2*np.pi))) * np.exp(-0.5*((x-mean)/sd)**2)
      return prob_density
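A quick way to sanity-check a hand-written density function is to compare it with the standard library's `statistics.NormalDist`. The sketch below defines its own helper, `normal_pdf` (a name introduced here for illustration), and uses hypothetical input values.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mean, sd):
    """Normal density: 1 / (sd * sqrt(2*pi)) * exp(-0.5 * ((x - mean) / sd)**2)."""
    return (1.0 / (sd * math.sqrt(2.0 * math.pi))) * math.exp(-0.5 * ((x - mean) / sd) ** 2)

# Hypothetical inputs chosen only for the comparison.
x, mean, sd = 2.75, 4.65, 1.0
by_hand = normal_pdf(x, mean, sd)
reference = NormalDist(mean, sd).pdf(x)

print(by_hand, reference)
```

The two values should agree to floating-point precision. Note that the whole product `sd * sqrt(2*pi)` sits in the denominator, so it needs its own parentheses.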

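With the normal distribution formula in hand, we can compute how likely the sample is under each of the three groups (classes). First, compute the mean and standard deviation of every predictor within each group:

#Group 0:
group_0 = df[df['dx'] == 0]
print('Mean FEV1 group 0: ', statistics.mean(group_0['FEV1']))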
print('SD FEV1 group 0: ', statistics.stdev(group_0['FEV1']))
print('Mean FeNO group 0: ', statistics.mean(group_0['FeNO']))
print('SD FeNO group 0: ', statistics.stdev(group_0['FeNO']))
print('Mean BD group 0: ', statistics.mean(group_0['BD']))
print('SD BD group 0: ', statistics.stdev(group_0['BD']))

#Group 1:
group_1 = df[df['dx'] == 1]
print('Mean FEV1 group 1: ', statistics.mean(group_1['FEV1']))
print('SD FEV1 group 1: ', statistics.stdev(group_1['FEV1']))
print('Mean FeNO group 1: ', statistics.mean(group_1['FeNO']))
print('SD FeNO group 1: ', statistics.stdev(group_1['FeNO']))
print('Mean BD group 1: ', statistics.mean(group_1['BD']))
print('SD BD group 1: ', statistics.stdev(group_1['BD']))

#Group 2:
group_2 = df[df['dx'] == 2]
print('Mean FEV1 group 2: ', statistics.mean(group_2['FEV1']))
print('SD FEV1 group 2: ', statistics.stdev(group_2['FEV1']))
print('Mean FeNO group 2: ', statistics.mean(group_2['FeNO']))
print('SD FeNO group 2: ', statistics.stdev(group_2['FeNO']))
print('Mean BD group 2: ', statistics.mean(group_2['BD']))
print('SD BD group 2: ', statistics.stdev(group_2['BD']))
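The same per-group statistics can be obtained in one call with a pandas `groupby`. This is a sketch on a small stand-in dataset: the seed, group size and distribution parameters are assumptions made for illustration, and `toy` and `summary` are names introduced here.

```python
import numpy as np
import pandas as pd

# Stand-in data: the seed, group size and distribution parameters are illustrative assumptions.
rng = np.random.default_rng(0)
n = 50
toy = pd.DataFrame({
    "FEV1": np.concatenate([rng.normal(m, 1.0, n) for m in (4.65, 3.75, 2.85)]),
    "FeNO": np.concatenate([rng.normal(m, 20.0, n) for m in (20, 40, 60)]),
    "BD": np.concatenate([rng.normal(m, 50.0, n) for m in (150, 200, 250)]),
    "dx": np.repeat([0, 1, 2], n),
})

# One row per class; for every feature a (mean, std) pair.
# pandas' std uses ddof=1, the same sample standard deviation as statistics.stdev.
summary = toy.groupby("dx").agg(["mean", "std"])
print(summary.round(2))
```

A single value can then be read with, for example, `summary.loc[0, ("FEV1", "mean")]`.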

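Now test with an arbitrary sample: FEV1 = 2.75, FeNO = 27, BD = 125. The means and standard deviations passed to normal_dist below appear to be the group statistics from one run of the previous step; because the data are generated without a fixed random seed, your own values will differ slightly.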

#Probability for:
#FEV1 = 2.75
#FeNO = 27
#BD = 125

#We have the same number of observations, so the general probability is: 0.33
Prob_geral = round(0.333, 3)

#Prob FEV1:
Prob_FEV1_0 = round(normal_dist(2.75, 4.70, 1.08), 10)
print('Prob FEV1 0: ', Prob_FEV1_0)
Prob_FEV1_1 = round(normal_dist(2.75, 3.70, 1.13), 10)
print('Prob FEV1 1: ', Prob_FEV1_1)
Prob_FEV1_2 = round(normal_dist(2.75, 3.01, 1.22), 10)
print('Prob FEV1 2: ', Prob_FEV1_2)

#Prob FeNO:
Prob_FeNO_0 = round(normal_dist(27, 19.71, 19.29), 10)
print('Prob FeNO 0: ', Prob_FeNO_0)
Prob_FeNO_1 = round(normal_dist(27, 42.34, 19.85), 10)
print('Prob FeNO 1: ', Prob_FeNO_1)
Prob_FeNO_2 = round(normal_dist(27, 61.78, 21.39), 10)
print('Prob FeNO 2: ', Prob_FeNO_2)

#Prob BD:
Prob_BD_0 = round(normal_dist(125, 152.59, 50.33), 10)
print('Prob BD 0: ', Prob_BD_0)
Prob_BD_1 = round(normal_dist(125, 199.14, 50.81), 10)
print('Prob BD 1: ', Prob_BD_1)
Prob_BD_2 = round(normal_dist(125, 256.13, 47.04), 10)
print('Prob BD 2: ', Prob_BD_2)

#Compute probability:
Prob_group_0 = Prob_geral*Prob_FEV1_0*Prob_FeNO_0*Prob_BD_0
print('Prob group 0: ', Prob_group_0)

Prob_group_1 = Prob_geral*Prob_FEV1_1*Prob_FeNO_1*Prob_BD_1
print('Prob group 1: ', Prob_group_1)

Prob_group_2 = Prob_geral*Prob_FEV1_2*Prob_FeNO_2*Prob_BD_2
print('Prob group 2: ', Prob_group_2)

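With the values above, the sample scores highest for the second group (label 1, possible disease), with group 0 close behind. That is the whole manual Naive Bayes calculation. For a mature algorithm like this one, though, the more efficient implementation in Scikit-Learn is the better choice.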
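The three products above are unnormalized scores, not probabilities that sum to 1. The following sketch, using only the standard library, repeats the calculation in log space (which avoids underflow when many small densities are multiplied) and normalizes the result into posterior probabilities. The means and standard deviations are copied from the hand calculation; the dictionary layout and variable names are choices made here for illustration.

```python
import math
from statistics import NormalDist

# Per-class (mean, sd) for each feature, copied from the hand calculation above.
params = {
    0: {"FEV1": (4.70, 1.08), "FeNO": (19.71, 19.29), "BD": (152.59, 50.33)},
    1: {"FEV1": (3.70, 1.13), "FeNO": (42.34, 19.85), "BD": (199.14, 50.81)},
    2: {"FEV1": (3.01, 1.22), "FeNO": (61.78, 21.39), "BD": (256.13, 47.04)},
}
sample = {"FEV1": 2.75, "FeNO": 27, "BD": 125}
prior = 1 / 3  # the three classes have equal size

# log(prior) + sum of log densities, one score per class.
log_scores = {
    label: math.log(prior)
    + sum(math.log(NormalDist(m, s).pdf(sample[f])) for f, (m, s) in feats.items())
    for label, feats in params.items()
}

# Normalize with the log-sum-exp trick so the posteriors sum to 1.
top = max(log_scores.values())
total = sum(math.exp(v - top) for v in log_scores.values())
posteriors = {label: math.exp(v - top) / total for label, v in log_scores.items()}
best = max(posteriors, key=posteriors.get)

print(posteriors)
print("predicted class label:", best)
```

With these parameters the largest posterior belongs to class label 1, the second of the three groups, while label 2 receives only a small share because the sample's BD value lies far below that group's mean.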

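A Scikit-Learn Classifier Example

Scikit-Learn's GaussianNB gives us a more efficient way to do this, so we use it below for a complete classification example. First create the X and y variables and split them into training and test sets: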

#Creating X and y:
X = df.drop('dx', axis=1)
y = df['dx']

#Data split into train and test:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

Before fitting the model, standardize the features with StandardScaler:

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
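What the scaler does can be reproduced in a few lines of NumPy. This is a sketch with made-up feature values; the array names are hypothetical and the code is not scikit-learn's implementation. The statistics are estimated on the training rows only and then reused for the test rows, which mirrors calling `fit_transform` on the training set and `transform` on the test set.

```python
import numpy as np

# Toy feature matrices (hypothetical values, two features per row).
train_toy = np.array([[150.0, 4.0], [200.0, 3.5], [250.0, 3.0], [180.0, 4.5]])
test_toy = np.array([[210.0, 3.2], [160.0, 4.1]])

# Statistics are estimated on the training rows only ...
mean = train_toy.mean(axis=0)
std = train_toy.std(axis=0)  # population SD (ddof=0), which is what StandardScaler uses

# ... and then applied unchanged to both splits.
train_scaled = (train_toy - mean) / std
test_scaled = (test_toy - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1. The test columns generally do not, because they are scaled with the training statistics.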

Now build and evaluate the model:

#Build the model:
classifier = GaussianNB()
classifier.fit(X_train, y_train)

#Evaluate the model:
print("training set score: %f" % classifier.score(X_train, y_train))
print("test set score: %f" % classifier.score(X_test, y_test))
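To connect the library call back to the hand calculation, here is a minimal NumPy sketch of what fitting and predicting with Gaussian Naive Bayes involves. The class name `TinyGaussianNB` and the demo data are hypothetical, and the code is a simplified illustration, not scikit-learn's actual implementation (it omits variance smoothing, for example).

```python
import numpy as np

class TinyGaussianNB:
    """Hypothetical minimal Gaussian Naive Bayes; not scikit-learn's implementation."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        # Per-class feature means, variances and class priors.
        self.mean_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.var_ = np.array([X[y == c].var(axis=0) for c in self.classes_])
        self.prior_ = np.array([np.mean(y == c) for c in self.classes_])
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # diff has shape (n_samples, n_classes, n_features) via broadcasting.
        diff = X[:, None, :] - self.mean_[None, :, :]
        # Log of the normal density for every sample, class and feature.
        log_pdf = -0.5 * (np.log(2 * np.pi * self.var_) + diff ** 2 / self.var_)
        # Naive independence assumption: sum the log densities over features.
        log_joint = np.log(self.prior_) + log_pdf.sum(axis=2)
        return self.classes_[np.argmax(log_joint, axis=1)]

# Demo on well-separated synthetic classes (seed and parameters are assumptions).
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(loc, 1.0, size=(100, 2)) for loc in (0.0, 5.0, 10.0)])
y_demo = np.repeat([0, 1, 2], 100)

model = TinyGaussianNB().fit(X_demo, y_demo)
accuracy = np.mean(model.predict(X_demo) == y_demo)
print("training accuracy:", accuracy)
```

Working with sums of log densities instead of products of densities gives the same ranking of classes and is numerically safer when there are many features.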

Next, use a confusion matrix to summarize the results:

#Predicting the Test set results:
y_pred = classifier.predict(X_test)

#Confusion Matrix:
cm = confusion_matrix(y_test, y_pred)
print(cm)
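Reading a 3x3 confusion matrix is easier with per-class recall next to it. The sketch below builds the matrix by hand from two short label lists, which are made-up toy values and not results from the model above, using the same layout as scikit-learn: rows are true classes and columns are predicted classes.

```python
# Toy labels (hypothetical, only to show the layout of the matrix).
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_hat = [0, 0, 1, 1, 2, 1, 2, 1, 2, 2]

n_classes = 3
# Rows are true classes, columns are predicted classes.
cm_toy = [[0] * n_classes for _ in range(n_classes)]
for t, p in zip(y_true, y_hat):
    cm_toy[t][p] += 1

# Recall per class: correct predictions on the diagonal divided by the row total.
recall = [cm_toy[k][k] / sum(cm_toy[k]) for k in range(n_classes)]

for row in cm_toy:
    print(row)
print("recall per class:", recall)
```

Off-diagonal entries in row k show which classes the samples of class k were mistaken for, which is the pattern to look for when one class is predicted well and the others are not.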

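The confusion matrix shows that our model predicts class 0 best, while the error rates for classes 1 and 2 are high. To investigate, we plot the decision boundaries for each pair of variables: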

df.to_csv('data.csv', index = False)
data = pd.read_csv('data.csv')
def gaussian_nb_plot(data, features):
    #Fit GaussianNB on two features so the decision regions can be drawn in 2D:
    x = data[features].values
    y = data['dx'].astype(int).values
    Gauss_nb = GaussianNB()
    Gauss_nb.fit(x,y)
    #Accuracy on the same data the model was fitted on:
    print(Gauss_nb.score(x,y))
    #Plot decision region:
    plot_decision_regions(x,y, clf=Gauss_nb, legend=1)
    #Label the axes with the two features that were used:
    plt.xlabel(features[0])
    plt.ylabel(features[1])
    plt.title('Gaussian Naive Bayes')
    plt.show()
gaussian_nb_plot(data, ['BD','FeNO'])
gaussian_nb_plot(data, ['BD','FEV1'])
gaussian_nb_plot(data, ['FEV1','FeNO'])

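The decision-boundary plots show where the misclassifications come from: many points fall outside the region assigned to their own class. With real data we would analyze the specific causes, but since this is synthetic test data, no further analysis is needed.

Author: Carla Martins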

https://www.overfit.cn/post/0457f85f2c184ff0864db5256654aef1

Original: https://blog.csdn.net/m0_46510245/article/details/124007911
Author: deephub
Title: Gaussian Naive Bayes Classification: Principles Explained and a Hand-Written Implementation

