Machine Learning (3): Classification with LightGBM

June 30, 2023, 3:57 PM • Artificial Intelligence

Introduction to LightGBM

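LightGBM is a scalable machine learning system released by Microsoft in 2017. It is an open-source project under Microsoft's DMTK (Distributed Machine Learning Toolkit), and its development was led by Guolin Ke (柯国霖), one of the winners of the first Alibaba Big Data Competition in 2014. It is a distributed gradient boosting framework based on the GBDT (gradient boosting decision tree) algorithm. To shorten model training time, its design focuses on reducing the memory and compute cost of handling the data, and on reducing the communication cost of parallel training across multiple machines.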

LightGBM can be seen as an upgraded, deluxe version of XGBoost: it reaches accuracy close to XGBoost's while training faster and using less memory.

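Main advantages of LightGBM:

Easy to use. It provides interfaces for mainstream languages (Python, C++, R), so users can build a model with LightGBM easily and get quite good results.
Efficient and scalable. It is fast and accurate on large datasets and places modest demands on memory and other hardware resources.
Robust. Compared with deep learning models, it reaches comparable results without fine-grained hyperparameter tuning.
Native support for missing values and categorical features. No special preprocessing of the data is needed.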

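Main disadvantages of LightGBM:

Unlike deep learning models, it cannot model spatial or temporal structure, so it does not capture high-dimensional data such as images, speech, and text well.
When massive training data is available and a suitable deep learning model can be found, deep learning can be far more accurate than LightGBM.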

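PS:

To install LightGBM, see https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html
The following page describes the two ways of using lightgbm, the native API (import lightgbm as lgb) and the scikit-learn interface (from lightgbm import LGBMRegressor, LGBMClassifier): https://www.cnblogs.com/chenxiangzhen/p/10894306.html
With the native API you can use lgb.cv to choose parameters by cross-validation, but note that the dataset must first be converted with lgb.Dataset.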

LightGBM parameters

lightgbm has many parameters; read https://lightgbm.readthedocs.io/en/latest/Parameters.html carefully.
For tuning, see https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html
1. Core parameters: task, objective, boosting, n_estimators, learning_rate, metric
2. Tree-related parameters: num_leaves, max_depth, min_data_in_leaf, feature_fraction_bynode, min_gain_to_split
3. Parameters for speeding up training and preventing overfitting: bagging_fraction, feature_fraction, lambda_l1, lambda_l2, max_bin, min_data_in_bin, bin_construct_sample_cnt (in fact, the tree parameters max_depth, min_data_in_leaf, and feature_fraction_bynode also help prevent overfitting)
4. Parameters for imbalanced data: pos_bagging_fraction, neg_bagging_fraction, is_unbalance
5. GOSS-related parameters (GOSS is enabled only when boosting=goss is set): top_rate, other_rate
6. EFB-related parameters: enable_bundle, max_conflict_rate (in fact, these two parameters can also speed up training)
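As a rough illustration of how these groups fit together in practice, the sketch below builds one params dict from per-group dicts. All values are hypothetical starting points, not recommendations from the original post; bagging_freq is added because, per the LightGBM parameter documentation, bagging_fraction only takes effect when bagging_freq is non-zero.

```python
# Hypothetical starting values, grouped the same way as the list above
core = {'objective': 'binary', 'boosting': 'gbdt',
        'learning_rate': 0.1, 'metric': 'auc'}
tree = {'num_leaves': 31, 'max_depth': -1, 'min_data_in_leaf': 20}
regularization = {'bagging_fraction': 0.8, 'bagging_freq': 1,
                  'feature_fraction': 0.8, 'lambda_l1': 0.0, 'lambda_l2': 0.0}
imbalance = {'is_unbalance': False}

groups = [core, tree, regularization, imbalance]

# Merge the groups; a name appearing in two groups would silently overwrite,
# so check that the merged dict kept every key
params = {}
for group in groups:
    params.update(group)
no_duplicates = len(params) == sum(len(g) for g in groups)

print(params)
```

A dict like this can be passed to the native API (lgb.train, lgb.cv) as is.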

PS1: There are also many tuning guides online, for example these pages I came across in a quick search:

https://www.cnblogs.com/wzdLY/p/9867719.html
https://blog.csdn.net/u012513618/article/details/78441676
https://www.cnblogs.com/jiangxinyang/p/9337094.html
https://www.imooc.com/article/43784?block_id=tuijian_wz

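PS2: Missing values do not need to be handled, and one-hot encoding is not needed (but string values cannot be passed in directly).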

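Hands-on practice

Reference link here

LightGBM classification on a League of Legends dataset

The dataset variables are described below:

Loading the dataset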

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

# Load the data; blueWins is the target column
df = pd.read_csv('./high_diamond_ranked_10min.csv')
y = df.blueWins

# Drop the match ID and the target to get the feature matrix
drop_cols = ['gameId','blueWins']
x = df.drop(drop_cols, axis=1)
x.describe()

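Ward-placement and ward-destruction counts vary widely across matches, and there are even outliers with 250 wards placed in the first ten minutes.
EliteMonsters equals Dragons + Heralds.
Variables such as TotalGold differ little across most matches.
The two teams' gold difference and experience difference are negatives of each other.
The red and blue teams each have roughly a 50% chance of getting first blood.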
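Relationships like these can be checked directly rather than read off the describe() table. The sketch below is an illustration on a tiny hand-made frame, not on the real CSV; the column names blueGoldDiff and redGoldDiff are assumptions about the dataset's naming and should be adjusted if the file uses different ones.

```python
import pandas as pd

def check_invariants(df):
    """Return one boolean per relationship noted in the observations above."""
    return {
        'elite = dragons + heralds':
            bool((df['blueEliteMonsters']
                  == df['blueDragons'] + df['blueHeralds']).all()),
        'gold diffs are opposite':
            bool((df['blueGoldDiff'] == -df['redGoldDiff']).all()),
    }

# Tiny hand-made frame standing in for the real data (values are invented)
toy = pd.DataFrame({
    'blueEliteMonsters': [0, 1, 2],
    'blueDragons':       [0, 1, 1],
    'blueHeralds':       [0, 0, 1],
    'blueGoldDiff':      [600, -2900, 0],
    'redGoldDiff':       [-600, 2900, 0],
})
result = check_invariants(toy)
print(result)
```

On the real data the same function would be called as check_invariants(df); columns that are exact functions of each other carry no extra information for the model.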

Visualization

data = x
data_std = (data - data.mean()) / data.std()
data = pd.concat([y, data_std.iloc[:, 0:9]], axis=1)
data = pd.melt(data, id_vars='blueWins', var_name='Features', value_name='Values')

fig, ax = plt.subplots(1,2,figsize=(15,5))

sns.violinplot(x='Features', y='Values', hue='blueWins', data=data, split=True,
               inner='quart', ax=ax[0], palette='Blues')
fig.autofmt_xdate(rotation=45)

data = x
data_std = (data - data.mean()) / data.std()
data = pd.concat([y, data_std.iloc[:, 9:18]], axis=1)
data = pd.melt(data, id_vars='blueWins', var_name='Features', value_name='Values')

sns.violinplot(x='Features', y='Values', hue='blueWins',
               data=data, split=True, inner='quart', ax=ax[1], palette='Blues')
fig.autofmt_xdate(rotation=45)

plt.show()

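The plots show:

More kills make a win more likely, and more deaths make a loss more likely (compare the left and right halves of the violins for blueKills and blueDeaths).
The assists violin has a shape similar to the kills violin, suggesting the two affect the outcome to a similar degree.
First blood is positively correlated with winning, though less clearly than the number of kills.
Gold difference and experience difference have a relatively small effect on the outcome.
The number of jungle monsters killed has little effect on the outcome.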
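The violin-plot code repeats the same standardize-then-melt step for each block of columns. One way to factor it out is sketched below; the helper name standardize_and_melt is hypothetical, and the three-row frame is toy data chosen so the z-scores are easy to check by hand.

```python
import pandas as pd

def standardize_and_melt(features, target, cols):
    """Z-score the chosen columns, attach the target, reshape to long form."""
    block = features[cols]
    block_std = (block - block.mean()) / block.std()
    wide = pd.concat([target, block_std], axis=1)
    return pd.melt(wide, id_vars=target.name,
                   var_name='Features', value_name='Values')

# Toy data: each column has mean 6 or 5 and sample standard deviation 4
toy_x = pd.DataFrame({'blueKills': [2, 6, 10], 'blueDeaths': [9, 5, 1]})
toy_y = pd.Series([0, 1, 1], name='blueWins')

long_df = standardize_and_melt(toy_x, toy_y, ['blueKills', 'blueDeaths'])
print(long_df)
```

The long-form result has one row per (match, feature) pair, which is the shape seaborn expects for x='Features', y='Values', hue='blueWins'.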

plt.figure(figsize=(18,14))
sns.heatmap(round(x.corr(),2), cmap='Blues', annot=True)
plt.show()


drop_cols = ['redAvgLevel','blueAvgLevel']
x.drop(drop_cols, axis=1, inplace=True)
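The post does not say why the AvgLevel columns are dropped; presumably the heatmap shows them to be almost perfectly correlated with other features. A sketch of how such pairs could be listed programmatically instead of read off the heatmap follows. The helper high_corr_pairs and the toy column blueTotalExperience are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.9):
    """List (col_a, col_b, r) for column pairs with |Pearson r| > threshold."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair is reported once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    pairs = pairs[pairs.abs() > threshold]
    return [(a, b, round(float(r), 2)) for (a, b), r in pairs.items()]

# Toy frame: the first two columns are perfectly linearly related
toy = pd.DataFrame({
    'blueTotalExperience': [100, 200, 300, 400],
    'blueAvgLevel':        [1.0, 2.0, 3.0, 4.0],
    'blueKills':           [3, 1, 4, 1],
})
pairs_found = high_corr_pairs(toy)
print(pairs_found)
```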

sns.set(style='whitegrid', palette='muted')

x['wardsPlacedDiff'] = x['blueWardsPlaced'] - x['redWardsPlaced']
x['wardsDestroyedDiff'] = x['blueWardsDestroyed'] - x['redWardsDestroyed']

data = x[['blueWardsPlaced','blueWardsDestroyed','wardsPlacedDiff','wardsDestroyedDiff']].sample(1000)
data_std = (data - data.mean()) / data.std()
data = pd.concat([y, data_std], axis=1, join='inner')  # keep only the 1000 sampled rows
data = pd.melt(data, id_vars='blueWins', var_name='Features', value_name='Values')

plt.figure(figsize=(10,6))
sns.swarmplot(x='Features', y='Values', hue='blueWins', data=data)
plt.xticks(rotation=45)
plt.show()

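The scatter plot of ward counts shows no clear relationship between the number of wards placed and the match outcome. A likely reason is that at Diamond rank and above, where to place wards and where to clear them is standard practice, so ward placement and destruction counts in the first ten minutes have little effect on the outcome. We therefore drop these features for now.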


drop_cols = ['blueWardsPlaced','blueWardsDestroyed','wardsPlacedDiff',
            'wardsDestroyedDiff','redWardsPlaced','redWardsDestroyed']
x.drop(drop_cols, axis=1, inplace=True)

x['killsDiff'] = x['blueKills'] - x['blueDeaths']
x['assistsDiff'] = x['blueAssists'] - x['redAssists']

x[['blueKills','blueDeaths','blueAssists','killsDiff','assistsDiff','redAssists']].hist(figsize=(12,10), bins=20)
plt.show()

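The distributions of kills, deaths, and assists look much alike. The kill difference (kills minus deaths) and the assist difference (blue assists minus red assists), however, are distributed very differently from the raw counts, so we construct these two new features.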

data = x[['blueKills','blueDeaths','blueAssists','killsDiff','assistsDiff','redAssists']].sample(1000)
data_std = (data - data.mean()) / data.std()
data = pd.concat([y, data_std], axis=1, join='inner')  # keep only the 1000 sampled rows
data = pd.melt(data, id_vars='blueWins', var_name='Features', value_name='Values')

plt.figure(figsize=(10,6))
sns.swarmplot(x='Features', y='Values', hue='blueWins', data=data)
plt.xticks(rotation=45)
plt.show()

The plot above shows that kills, deaths, and assists, as well as the features we constructed, all separate the two classes reasonably well.

data = pd.concat([y, x], axis=1).sample(500)

sns.pairplot(data, vars=['blueKills','blueDeaths','blueAssists','killsDiff','assistsDiff','redAssists'],
             hue='blueWins')

plt.show()


x['dragonsDiff'] = x['blueDragons'] - x['redDragons']
x['heraldsDiff'] = x['blueHeralds'] - x['redHeralds']
x['eliteDiff'] = x['blueEliteMonsters'] - x['redEliteMonsters']

data = pd.concat([y, x], axis=1)

eliteGroup = data.groupby(['eliteDiff'])['blueWins'].mean()
dragonGroup = data.groupby(['dragonsDiff'])['blueWins'].mean()
heraldGroup = data.groupby(['heraldsDiff'])['blueWins'].mean()

fig, ax = plt.subplots(1,3, figsize=(15,4))

eliteGroup.plot(kind='bar', ax=ax[0])
dragonGroup.plot(kind='bar', ax=ax[1])
heraldGroup.plot(kind='bar', ax=ax[2])

print(eliteGroup)
print(dragonGroup)
print(heraldGroup)

plt.show()

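We constructed the differences between the two teams in dragons taken, heralds taken, and elite monsters killed. Early in the game, taking a dragon is associated with winning more strongly than taking the Rift Herald. The number of elite monsters taken is also strongly correlated with win rate.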

x['towerDiff'] = x['blueTowersDestroyed'] - x['redTowersDestroyed']

data = pd.concat([y, x], axis=1)

towerGroup = data.groupby(['towerDiff'])['blueWins']
print(towerGroup.count())
print(towerGroup.mean())

fig, ax = plt.subplots(1,2,figsize=(15,5))

towerGroup.mean().plot(kind='line', ax=ax[0])
ax[0].set_title('Proportion of Blue Wins')
ax[0].set_ylabel('Proportion')

towerGroup.count().plot(kind='line', ax=ax[1])
ax[1].set_title('Count of Towers Destroyed')
ax[1].set_ylabel('Count')

plt.show()

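Destroying towers is the core of League of Legends, so the number of towers destroyed is likely to be closely tied to the outcome. The plots show that although a tower rarely falls in the first ten minutes, once a team takes the first tower its win rate rises substantially.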

Training and prediction with LightGBM

from sklearn.model_selection import train_test_split
data_target_part = y
data_features_part = x

x_train, x_test, y_train, y_test = train_test_split(data_features_part, data_target_part, test_size = 0.2, random_state = 2020)

from lightgbm.sklearn import LGBMClassifier

clf = LGBMClassifier()

clf.fit(x_train, y_train)


train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)
from sklearn import metrics

print('The accuracy of LightGBM on the training set is:',metrics.accuracy_score(y_train,train_predict))
print('The accuracy of LightGBM on the test set is:',metrics.accuracy_score(y_test,test_predict))

# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions
confusion_matrix_result = metrics.confusion_matrix(y_test,test_predict)
print('The confusion matrix result:\n',confusion_matrix_result)

plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
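When reading a confusion-matrix heatmap, the axis labels are only right if the matrix was built in scikit-learn's documented argument order, confusion_matrix(y_true, y_pred). The toy example below (hand-made labels, not model output) shows the orientation and how precision and recall fall out of the four cells.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration: 3 true losses, 2 true wins
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 1, 1, 1, 1]

# Documented order is (y_true, y_pred): rows = true labels, columns = predictions
cm = confusion_matrix(y_true, y_pred)

# Swapping the arguments yields the transpose, so axis labels would be reversed
cm_swapped = confusion_matrix(y_pred, y_true)

tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)   # of the predicted wins, how many were real wins
recall = tp / (tp + fn)      # of the real wins, how many were predicted

print(cm)
print('precision =', precision, 'recall =', recall)
```

With rows as true labels, a heatmap of cm should be labelled 'True labels' on the y-axis and 'Predicted labels' on the x-axis.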

Feature selection with LightGBM

sns.barplot(y=data_features_part.columns, x=clf.feature_importances_)

from sklearn.metrics import accuracy_score
from lightgbm import plot_importance

def estimate(model,data):

    ax1=plot_importance(model,importance_type="gain")
    ax1.set_title('gain')
    ax2=plot_importance(model, importance_type="split")
    ax2.set_title('split')
    plt.show()
def classes(data,label,test):
    model=LGBMClassifier()
    model.fit(data,label)
    ans=model.predict(test)
    estimate(model, data)
    return ans

ans=classes(x_train,y_train,x_test)
pre=accuracy_score(y_test, ans)
print('acc=',pre)

Tuning parameters for better results

from sklearn.model_selection import GridSearchCV

learning_rate = [0.1, 0.3, 0.6]
feature_fraction = [0.5, 0.8, 1]
num_leaves = [16, 32, 64]
max_depth = [-1,3,5,8]

parameters = { 'learning_rate': learning_rate,
              'feature_fraction':feature_fraction,
              'num_leaves': num_leaves,
              'max_depth': max_depth}
model = LGBMClassifier(n_estimators = 50)

clf = GridSearchCV(model, parameters, cv=3, scoring='accuracy',verbose=3, n_jobs=-1)
clf = clf.fit(x_train, y_train)

print(clf.best_params_)
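Before launching a grid search it helps to know how much work it implies. The sketch below counts the candidates for the grid above with scikit-learn's ParameterGrid, and also counts how many are actually distinct once the depth limit is taken into account: a binary tree of depth d has at most 2**d leaves, so with max_depth=3 the values 16, 32, and 64 for num_leaves all behave the same. The helper effective_leaves is a hypothetical name used only for this illustration.

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    'learning_rate': [0.1, 0.3, 0.6],
    'feature_fraction': [0.5, 0.8, 1],
    'num_leaves': [16, 32, 64],
    'max_depth': [-1, 3, 5, 8],
}
grid = ParameterGrid(parameters)

# Each candidate is fitted once per fold (cv=3 in the search above)
n_fits = len(grid) * 3

def effective_leaves(num_leaves, max_depth):
    """Leaves a tree can actually reach; max_depth <= 0 means no depth limit."""
    return num_leaves if max_depth <= 0 else min(num_leaves, 2 ** max_depth)

# Candidates that differ only in a num_leaves value the depth cap makes moot
distinct = {(p['learning_rate'], p['feature_fraction'], p['max_depth'],
             effective_leaves(p['num_leaves'], p['max_depth'])) for p in grid}

print(len(grid), 'candidates,', n_fits, 'fits,', len(distinct), 'distinct')
```

Pruning the redundant combinations, or searching num_leaves only where max_depth allows it to matter, shortens the search without changing its result.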

clf = LGBMClassifier(feature_fraction = 0.8,
                    learning_rate = 0.1,
                    max_depth= 3,
                    num_leaves = 16)

clf.fit(x_train, y_train)

train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)

print('The accuracy of LightGBM on the training set is:',metrics.accuracy_score(y_train,train_predict))
print('The accuracy of LightGBM on the test set is:',metrics.accuracy_score(y_test,test_predict))

# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions
confusion_matrix_result = metrics.confusion_matrix(y_test,test_predict)
print('The confusion matrix result:\n',confusion_matrix_result)

plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()

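That completes a simple hands-on application of LightGBM. Interested readers can get the dataset from the reference link mentioned earlier and explore further on their own.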

Original: https://blog.csdn.net/weixin_46719623/article/details/113558566
Author: 小黄油块跑
Title: Machine Learning (3): Classification with LightGBM

