【机器学习】分类问题全流程

机器学习分类问题全流程

文章目录

冀以尘雾之微补益山海,荧烛末光增辉日月。 ——2022/1/27

导入类库

导入可视化模块、机器学习库中的模型评估模块和模型库模块;以方便后续使用。

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from pandas import read_csv
from pandas.plotting import scatter_matrix
from pandas import set_option
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier

导入数据

将要用的数据集导入,本文以玻璃数据集glass_train.csv为例。


# Load the glass feature file and its separate label file, then join them
# column-wise into a single DataFrame `df` used by the rest of the script.
# NOTE(review): column 'AI' looks like a typo for 'Al' (aluminium) -- kept
# as-is so it stays consistent with the feature list used later on.
data_path = 'glass_train.csv'
dataset = read_csv(data_path, header=0)          # feature columns
tag = read_csv('glass_train_labels.csv', header=0)  # target column
df = pd.concat([dataset, tag], axis=1)
df.columns = ['RI', 'Na', 'Mg', 'AI', 'Si', 'K', 'Ca', 'Ba', 'Fe', 'tag']
df  # notebook-style display of the assembled frame

数据探查与预处理

这里主要介绍简单的数据处理,对于复杂的数据集需要对数据做特征工程进行进一步的处理。


# --- Quick data audit (notebook-style: bare expressions display inline) ---
print(df.shape)            # (rows, columns)

df.describe()              # per-column summary statistics

df.dtypes                  # column data types

df.isna().sum()            # missing values per column

print(tag.value_counts())  # class balance of the target

df.corr(method='pearson')  # pairwise Pearson correlation matrix

# Correlation heatmap of all columns (including the target).
sns.heatmap(df.corr())
plt.show()

# Kernel-density estimate of the label distribution.
sns.kdeplot(df["tag"], color="#1874CD", label="tag_label", alpha=.7)
plt.grid()

plt.show()

# --- Hold-out split: 80% train / 20% validation, fixed seed -------------
array = df.values
X = array[:, 0:9].astype(float)  # nine physico-chemical features
Y = array[:, 9]                  # class label
validation_size = 0.2
seed = 7
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=validation_size, random_state=seed)

算法审查-训练模型

确定评估算法的基准,利用传统的机器学习算法进行训练并比较训练结果。


# Cross-validation settings shared by every model-comparison loop below.
num_folds = 10
seed = 7
scoring = 'accuracy'

# Baseline spot-check: five classic classifiers on the raw (unscaled) features.
models = {}
models['LR'] = LogisticRegression()
models['LDA'] = LinearDiscriminantAnalysis()
models['KNN'] = KNeighborsClassifier()
models['CART'] = DecisionTreeClassifier()
models['SVM'] = SVC()
results = []
for key in models:
    # Fixed seed => every model is scored on identical folds, a fair comparison.
    kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
    cv_results = cross_val_score(models[key], X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    print('%s : %f (%f)' % (key, cv_results.mean(), cv_results.std()))

# Box plot of the per-fold accuracies for each algorithm.
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(models.keys())
# BUG FIX: original called `pyplot.show()` but only `matplotlib.pyplot as plt`
# is imported, so the line raised NameError. Use the `plt` alias instead.
plt.show()


# Repeat the spot-check with standardized features: each model is wrapped in a
# Pipeline so the scaler is re-fit inside every CV fold (no data leakage).
base_estimators = [
    ('LR', LogisticRegression()),
    ('LDA', LinearDiscriminantAnalysis()),
    ('KNN', KNeighborsClassifier()),
    ('CART', DecisionTreeClassifier()),
    ('NB', GaussianNB()),
]
pipelines = {}
for short_name, estimator in base_estimators:
    pipelines['Scaler' + short_name] = Pipeline(
        [('Scaler', StandardScaler()), (short_name, estimator)])

results = []
for name, pipe in pipelines.items():
    # Same fixed seed as above so fold assignments match across models.
    kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
    fold_scores = cross_val_score(pipe, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(fold_scores)
    print('%s : %f (%f)' % (name, fold_scores.mean(), fold_scores.std()))

# Box plot of per-fold accuracies for the scaled pipelines.
fig = plt.figure()
fig.suptitle('Scaled Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(pipelines.keys())
plt.show()


# Same scaled-pipeline comparison, this time for four ensemble classifiers.
ensemble_specs = [
    ('ScaledAB', 'AB', AdaBoostClassifier()),
    ('ScaledGBM', 'GBM', GradientBoostingClassifier()),
    ('ScaledRF', 'RFR', RandomForestClassifier()),
    ('ScaledET', 'ETR', ExtraTreesClassifier()),
]
ensembles = {}
for dict_key, step_name, estimator in ensemble_specs:
    ensembles[dict_key] = Pipeline(
        [('Scaler', StandardScaler()), (step_name, estimator)])

results = []
for name, pipe in ensembles.items():
    # Identical folds (fixed seed) across all ensembles.
    kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
    cv_result = cross_val_score(pipe, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_result)
    print('%s: %f (%f)' % (name, cv_result.mean(), cv_result.std()))

# Box plot of per-fold accuracies for the ensemble pipelines.
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(ensembles.keys())
plt.show()

通过以上结果进行比较分析,得到结果最好的模型。

模型调参


# Tune the number of boosting stages for GradientBoostingClassifier via an
# exhaustive grid search on the standardized training data.
from sklearn.ensemble import GradientBoostingClassifier

# Fit the scaler on the training split only, then transform it.
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)

param_grid = {
    'n_estimators': [10, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900],
}
model = GradientBoostingClassifier()
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid,
                    scoring=scoring, cv=kfold)
grid_result = grid.fit(X=rescaledX, y=Y_train)
print('最优:%s 使用%s' % (grid_result.best_score_, grid_result.best_params_))

模型最终化&模型评估

根据上述的模型结果选择最佳的模型,以下只是一个参考。


# Final model: fit LightGBM on the standardized training split, then score
# it on the held-out validation split.
from lightgbm import LGBMClassifier

scaler = StandardScaler().fit(X_train)      # statistics from training data only
rescaledX = scaler.transform(X_train)
model = LGBMClassifier()
model.fit(X=rescaledX, y=Y_train)

# Transform the validation set with the *training* scaler and predict.
rescaled_validationX = scaler.transform(X_validation)
predictions = model.predict(rescaled_validationX)
print(predictions)

# Side-by-side view of predicted vs. actual labels.
a = pd.DataFrame()
a['预测值'] = list(predictions)
a['实际值'] = list(Y_validation)
a.head()

# Feature-importance table, most important first.
# NOTE(review): 'AI' mirrors the (likely misspelled) column name set earlier.
features = ['RI', 'Na', 'Mg', 'AI', 'Si', 'K', 'Ca', 'Ba', 'Fe']
importances = model.feature_importances_

importances_df = pd.DataFrame()
importances_df['特征名称'] = features
importances_df['特征重要性'] = importances
importances_df.sort_values('特征重要性', ascending=False)

# Standard classification metrics on the validation split.
rescaled_validationX = scaler.transform(X_validation)
predictions = model.predict(rescaled_validationX)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

参考书籍:Python机器学习实战案例_赵卫东,董亮编著_2019.12
全书源代码:https://github.com/weizy1981/MachineLearning

本文链接:https://blog.csdn.net/qq_46426207/article/details/122723777

Original: https://blog.csdn.net/qq_46426207/article/details/122723777
Author: 穻易yuyee
Title: 【机器学习】分类问题全流程

原创文章受到原创版权保护。转载请注明出处:https://www.johngo689.com/664467/

转载文章受原作者版权保护。转载请注明原作者出处!

(0)

大家都在看

亲爱的 Coder【最近整理,可免费获取】👉 最新必读书单  | 👏 面试题下载  | 🌎 免费的AI知识星球