Tuning LightGBM by maximizing AUC in a data mining competition

July 17, 2023, 11:14 PM • Artificial Intelligence

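I. Steps

0. The final result first

Parameter meanings
learning_rate  usually set between 0.05 and 0.1
n_estimators  100-1000; the number of boosting iterations
min_split_gain  0; the minimum gain required to split a node; tuning it is not recommended
min_child_samples  the minimum number of samples in a leaf; the default is 20. Choose it according to the data size: with more data, raise this value so that the data distribution in each leaf stays relatively stable.
min_child_weight  the minimum sum of hessians in a leaf; the default is 0.001, and it is commonly set to 1.

Parameters to tune by search
max_depth  the maximum depth of each tree. The most important parameter for preventing overfitting; usually limited to 3-5.
num_leaves  the number of leaves per tree; the default is 31. Together with max_depth it controls the shape of the tree, and it is usually set to a value in (0, 2^max_depth - 1].
subsample  the default is 1; usually set between 0.8 and 1.0 to prevent overfitting. In LightGBM it only takes effect when subsample_freq is greater than 0.
colsample_bytree  usually set between 0.8 and 1.0 to prevent overfitting
reg_alpha  L1 regularization parameter
reg_lambda  L2 regularization parameter; larger values make the influence of the individual features on the model more uniform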
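To see how max_depth and num_leaves interact, note that a binary tree of depth d can hold at most 2^d leaves, so a larger num_leaves cannot make a depth-limited tree any bigger. The sketch below only illustrates that bound; `max_leaves_for_depth` and `effective_num_leaves` are hypothetical helpers, not part of LightGBM.

```python
def max_leaves_for_depth(max_depth: int) -> int:
    """Upper bound on the number of leaves of a binary tree with the given depth."""
    if max_depth <= 0:
        raise ValueError("max_depth must be positive for this bound")
    return 2 ** max_depth


def effective_num_leaves(num_leaves: int, max_depth: int) -> int:
    """Hypothetical helper: the leaf count a depth-limited tree can actually reach."""
    return min(num_leaves, max_leaves_for_depth(max_depth))


if __name__ == "__main__":
    # Starting configuration used in this post: max_depth=4, num_leaves=100
    print(effective_num_leaves(100, 4))
    # Final configuration used in this post: max_depth=4, num_leaves=16
    print(effective_num_leaves(16, 4))
```

With max_depth=4 both configurations top out at 16 leaves, which is consistent with the tuned num_leaves of 16 in the final model.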

from matplotlib import pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve


def roc_auc_plot(clf, x_train, y_train, x_test, y_test):
    """Print AUC and KS for the train and test sets and plot both ROC curves."""
    # Predicted probability of the positive class, computed once per set
    train_proba = clf.predict_proba(x_train)[:, 1]
    train_auc = roc_auc_score(y_train, train_proba)
    train_fpr, train_tpr, _ = roc_curve(y_train, train_proba)
    # KS statistic: the largest gap between TPR and FPR along the ROC curve
    train_ks = abs(train_fpr - train_tpr).max()
    print('train_ks = ', train_ks)
    print('train_auc = ', train_auc)

    test_proba = clf.predict_proba(x_test)[:, 1]
    test_auc = roc_auc_score(y_test, test_proba)
    test_fpr, test_tpr, _ = roc_curve(y_test, test_proba)
    test_ks = abs(test_fpr - test_tpr).max()
    print('test_ks = ', test_ks)
    print('test_auc = ', test_auc)

    plt.plot(train_fpr, train_tpr, label='train_roc')
    plt.plot(test_fpr, test_tpr, label='test_roc')
    # Red dashed diagonal: the ROC curve of a random classifier
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('ROC Curve')
    plt.legend(loc='best')
    plt.show()
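The KS value printed by roc_auc_plot is the largest vertical gap between the ROC curve and the diagonal, that is, the maximum of |TPR - FPR| over all thresholds. The following pure-Python sketch recomputes it on a tiny example; the labels and scores are made up for illustration, and `ks_statistic` is a hypothetical helper rather than a scikit-learn function.

```python
def ks_statistic(labels, scores):
    """KS = max |TPR - FPR| over all score thresholds (binary labels 0/1)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    # Visit samples from the highest score down, lowering the threshold step by step
    order = sorted(range(len(labels)), key=lambda k: scores[k], reverse=True)
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(order):
        # Samples with tied scores cross the threshold together
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            if labels[order[j]] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        best = max(best, abs(tp / n_pos - fp / n_neg))
        i = j
    return best


if __name__ == "__main__":
    y_true = [1, 1, 0, 1, 0, 0]
    y_score = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
    print(ks_statistic(y_true, y_score))
```

A KS of 1.0 means some threshold separates the two classes perfectly; a KS near 0 means the score distributions of the two classes are almost identical.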

# x_train, y_train, x_test and y_test come from the split in step 1 below
lgb_model = lgb.LGBMClassifier(n_estimators=800,
                               boosting_type='gbdt',
                               learning_rate=0.04,
                               min_child_samples=68,
                               min_child_weight=0.01,
                               max_depth=4,
                               num_leaves=16,
                               colsample_bytree=0.8,
                               subsample=0.8,
                               reg_alpha=0.7777777777777778,
                               reg_lambda=0.3,
                               objective='binary')

# LightGBM 4.0+ takes early stopping as a callback instead of early_stopping_rounds
clf = lgb_model.fit(x_train, y_train,
                    eval_set=[(x_train, y_train), (x_test, y_test)],
                    eval_metric='auc',
                    callbacks=[lgb.early_stopping(100)])
roc_auc_plot(clf, x_train, y_train, x_test, y_test)
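Early stopping with 100 rounds means that training halts once the validation metric has failed to improve for 100 consecutive iterations, and the best iteration seen so far is kept. A simplified sketch of that rule on a list of per-iteration AUC values follows; the numbers are made up, and `best_iteration_with_patience` is a hypothetical helper, not LightGBM's implementation.

```python
def best_iteration_with_patience(valid_aucs, patience):
    """Return (best_index, stopped_index) for a higher-is-better metric.

    Scanning stops as soon as `patience` consecutive iterations fail to
    beat the best value seen so far; otherwise it runs to the end.
    """
    best_idx = 0
    for idx, value in enumerate(valid_aucs):
        if value > valid_aucs[best_idx]:
            best_idx = idx
        elif idx - best_idx >= patience:
            return best_idx, idx
    return best_idx, len(valid_aucs) - 1


if __name__ == "__main__":
    aucs = [0.70, 0.72, 0.75, 0.74, 0.74, 0.73, 0.76]
    # A short patience stops before the late improvement is ever seen
    print(best_iteration_with_patience(aucs, patience=3))
    # A longer patience survives the plateau and reaches the last value
    print(best_iteration_with_patience(aucs, patience=5))
```

The example shows the trade-off behind the patience setting: a small value saves training time but can stop before a late improvement.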

1. Load the cleaned dataset

The code is as follows:

import time
from multiprocessing import cpu_count

import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import (GridSearchCV, KFold, StratifiedKFold,
                                     cross_val_score, train_test_split)

df_Master_clean = pd.read_csv(r'F:\教师培训\ppd7\df_Master_clean.csv', encoding='gb18030')
# Columns excluded from the feature matrix
c = ['Idx', 'target', 'ListingInfo', 'sample_status']
# Keep only the rows that have a label
labeled = df_Master_clean[df_Master_clean['target'].notnull()]
x = labeled.drop(columns=c)
y = labeled['target']
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=2, test_size=0.2)

2. Parameter tuning

The code is as follows (example):

import matplotlib.pyplot as plt

# Start from an arbitrary set of parameters as the baseline.
# max_features is a scikit-learn parameter that LightGBM does not recognize, so it is left out.
lgb_model1 = lgb.LGBMClassifier(n_estimators=20,
                                boosting_type='gbdt',
                                learning_rate=0.01,
                                min_child_samples=100,
                                min_child_weight=0.003,
                                max_depth=4,
                                num_leaves=100,
                                colsample_bytree=0.7,
                                subsample=0.6,
                                reg_alpha=0.03,
                                reg_lambda=0.3,
                                objective='binary')

# Tune by watching the AUC: find the best n_estimators
axisx = range(100, 1001, 10)
test_aucs = []
for i in axisx:
    lgb_model = lgb.LGBMClassifier(n_estimators=i,
                                   boosting_type='gbdt',
                                   learning_rate=0.01,
                                   min_child_samples=100,
                                   min_child_weight=0.003,
                                   max_depth=4,
                                   num_leaves=100,
                                   colsample_bytree=0.7,
                                   subsample=0.6,
                                   reg_alpha=0.03,
                                   reg_lambda=0.3,
                                   objective='binary')
    # LightGBM 4.0+ takes early stopping as a callback instead of early_stopping_rounds
    clf = lgb_model.fit(x_train, y_train,
                        eval_set=[(x_train, y_train), (x_test, y_test)],
                        eval_metric='auc',
                        callbacks=[lgb.early_stopping(100)])
    test_auc = roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])
    test_aucs.append(test_auc)
plt.figure(figsize=(20,5))
plt.plot(axisx, test_aucs,c="red",label="LGBC")
plt.xticks(axisx)
plt.legend()
plt.show()
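Every search in this section repeats the same pattern: loop over candidate values, score each one, and keep the value with the highest AUC. A minimal sketch of that pattern as a reusable function is shown here, assuming a caller-supplied `score_fn`; both `sweep` and the toy scoring function are hypothetical and stand in for fitting a model and computing the test AUC.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def sweep(values: Iterable[T], score_fn: Callable[[T], float]) -> Tuple[T, float, List[float]]:
    """Score every candidate and return (best_value, best_score, all_scores)."""
    candidates = list(values)
    if not candidates:
        raise ValueError("values must not be empty")
    scores = [score_fn(v) for v in candidates]
    best_idx = scores.index(max(scores))  # first maximum wins on ties
    return candidates[best_idx], scores[best_idx], scores


if __name__ == "__main__":
    # Toy stand-in for "fit LightGBM and return the test AUC"
    def toy_auc(n_estimators: int) -> float:
        return 0.77 - abs(n_estimators - 800) / 1e5

    best, score, _ = sweep(range(100, 1001, 10), toy_auc)
    print(best, score)
```

In the real workflow, score_fn would build an LGBMClassifier with the candidate value, fit it, and return roc_auc_score on the held-out set.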

# Tune by watching the AUC: find the best learning_rate.
# Result: learning_rate=0.04, max_auc=0.7720493433228038
axisx = np.linspace(0.01, 0.05, 5)

test_aucs = []
for i in axisx:
    lgb_model = lgb.LGBMClassifier(n_estimators=800,
                                   boosting_type='gbdt',
                                   learning_rate=i,
                                   min_child_samples=100,
                                   min_child_weight=0.003,
                                   max_depth=4,
                                   num_leaves=100,
                                   colsample_bytree=0.7,
                                   subsample=0.6,
                                   reg_alpha=0.03,
                                   reg_lambda=0.3,
                                   objective='binary')
    # LightGBM 4.0+ takes early stopping as a callback instead of early_stopping_rounds
    clf = lgb_model.fit(x_train, y_train,
                        eval_set=[(x_train, y_train), (x_test, y_test)],
                        eval_metric='auc',
                        callbacks=[lgb.early_stopping(100)])
    test_auc = roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])
    test_aucs.append(test_auc)
print('learning_rate={}, max_auc={}'.format(axisx[test_aucs.index(max(test_aucs))], max(test_aucs)))

plt.figure(figsize=(20,5))
plt.plot(axisx, test_aucs,c="red",label="LGBC")
plt.ylabel('auc')
plt.xticks(axisx)
plt.legend()
plt.show()

# Tune by watching the AUC: find the best min_child_samples.
# Result: min_child_samples=68, max_auc=0.7754420865945606
axisx = range(65, 75, 1)
test_aucs = []
for i in axisx:
    lgb_model = lgb.LGBMClassifier(n_estimators=800,
                                   boosting_type='gbdt',
                                   learning_rate=0.04,
                                   min_child_samples=i,
                                   min_child_weight=0.003,
                                   max_depth=4,
                                   num_leaves=100,
                                   colsample_bytree=0.7,
                                   subsample=0.6,
                                   reg_alpha=0.03,
                                   reg_lambda=0.3,
                                   objective='binary')
    # LightGBM 4.0+ takes early stopping as a callback instead of early_stopping_rounds
    clf = lgb_model.fit(x_train, y_train,
                        eval_set=[(x_train, y_train), (x_test, y_test)],
                        eval_metric='auc',
                        callbacks=[lgb.early_stopping(100)])
    test_auc = roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])
    test_aucs.append(test_auc)
print('min_child_samples={}, max_auc={}'.format(axisx[test_aucs.index(max(test_aucs))], max(test_aucs)))
# Uncomment to also print KS and AUC for the last fitted model:
# test_fpr, test_tpr, _ = roc_curve(y_test, clf.predict_proba(x_test)[:, 1])
# test_ks = abs(test_fpr - test_tpr).max()
# print('test_ks = ', test_ks)
# print('test_auc = ', test_auc)

plt.figure(figsize=(20,5))
plt.plot(axisx, test_aucs,c="red",label="LGBC")
plt.ylabel('auc')
plt.xticks(axisx)
plt.legend()
plt.show()

# Tune by watching the AUC: min_child_weight had no effect on the AUC
axisx = [0.001, 0.005]
test_aucs = []
for i in axisx:
    lgb_model = lgb.LGBMClassifier(n_estimators=800,
                                   boosting_type='gbdt',
                                   learning_rate=0.04,
                                   min_child_samples=68,
                                   min_child_weight=i,
                                   max_depth=4,
                                   num_leaves=100,
                                   colsample_bytree=0.7,
                                   subsample=0.6,
                                   reg_alpha=0.03,
                                   reg_lambda=0.3,
                                   objective='binary')
    # LightGBM 4.0+ takes early stopping as a callback instead of early_stopping_rounds
    clf = lgb_model.fit(x_train, y_train,
                        eval_set=[(x_train, y_train), (x_test, y_test)],
                        eval_metric='auc',
                        callbacks=[lgb.early_stopping(100)])
    test_auc = roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])
    test_aucs.append(test_auc)
print('min_child_weight={}, max_auc={}'.format(axisx[test_aucs.index(max(test_aucs))], max(test_aucs)))
# Uncomment to also print KS and AUC for the last fitted model:
# test_fpr, test_tpr, _ = roc_curve(y_test, clf.predict_proba(x_test)[:, 1])
# test_ks = abs(test_fpr - test_tpr).max()
# print('test_ks = ', test_ks)
# print('test_auc = ', test_auc)

plt.figure(figsize=(20,5))
plt.plot(axisx, test_aucs,c="red",label="LGBC")
plt.ylabel('auc')
plt.xticks(axisx)
plt.legend()
plt.show()

# Tune by watching the AUC: max_depth.
# Result: max_depth=4, max_auc=0.7754420865945606
axisx = [3, 4, 5, 6, 7]
test_aucs = []
for i in axisx:
    lgb_model = lgb.LGBMClassifier(n_estimators=800,
                                   boosting_type='gbdt',
                                   learning_rate=0.04,
                                   min_child_samples=68,
                                   min_child_weight=0.01,
                                   max_depth=i,
                                   num_leaves=100,
                                   colsample_bytree=0.7,
                                   subsample=0.6,
                                   reg_alpha=0.03,
                                   reg_lambda=0.3,
                                   objective='binary')
    # LightGBM 4.0+ takes early stopping as a callback instead of early_stopping_rounds
    clf = lgb_model.fit(x_train, y_train,
                        eval_set=[(x_train, y_train), (x_test, y_test)],
                        eval_metric='auc',
                        callbacks=[lgb.early_stopping(100)])
    test_auc = roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])
    test_aucs.append(test_auc)
print('max_depth={}, max_auc={}'.format(axisx[test_aucs.index(max(test_aucs))], max(test_aucs)))
# Uncomment to also print KS and AUC for the last fitted model:
# test_fpr, test_tpr, _ = roc_curve(y_test, clf.predict_proba(x_test)[:, 1])
# test_ks = abs(test_fpr - test_tpr).max()
# print('test_ks = ', test_ks)
# print('test_auc = ', test_auc)

plt.figure(figsize=(20,5))
plt.plot(axisx, test_aucs,c="red",label="LGBC")
plt.ylabel('auc')
plt.xticks(axisx)
plt.legend()
plt.show()

# Tune by watching the AUC: num_leaves.
# Result: num_leaves=16, max_auc=0.7754420865945606
axisx = range(10, 20, 1)
test_aucs = []
for i in axisx:
    lgb_model = lgb.LGBMClassifier(n_estimators=800,
                                   boosting_type='gbdt',
                                   learning_rate=0.04,
                                   min_child_samples=68,
                                   min_child_weight=0.01,
                                   max_depth=4,
                                   num_leaves=i,
                                   colsample_bytree=0.7,
                                   subsample=0.6,
                                   reg_alpha=0.03,
                                   reg_lambda=0.3,
                                   objective='binary')
    # LightGBM 4.0+ takes early stopping as a callback instead of early_stopping_rounds
    clf = lgb_model.fit(x_train, y_train,
                        eval_set=[(x_train, y_train), (x_test, y_test)],
                        eval_metric='auc',
                        callbacks=[lgb.early_stopping(100)])
    test_auc = roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])
    test_aucs.append(test_auc)
print('num_leaves={}, max_auc={}'.format(axisx[test_aucs.index(max(test_aucs))], max(test_aucs)))
# Uncomment to also print KS and AUC for the last fitted model:
# test_fpr, test_tpr, _ = roc_curve(y_test, clf.predict_proba(x_test)[:, 1])
# test_ks = abs(test_fpr - test_tpr).max()
# print('test_ks = ', test_ks)
# print('test_auc = ', test_auc)

plt.figure(figsize=(20,5))
plt.plot(axisx, test_aucs,c="red",label="LGBC")
plt.ylabel('auc')
plt.xticks(axisx)
plt.legend()
plt.show()

# Tune by watching the AUC: colsample_bytree.
# Result: colsample_bytree=0.8, max_auc=0.7770645958950596
axisx = np.linspace(0.8, 0.85, 5)
test_aucs = []
for i in axisx:
    lgb_model = lgb.LGBMClassifier(n_estimators=800,
                                   boosting_type='gbdt',
                                   learning_rate=0.04,
                                   min_child_samples=68,
                                   min_child_weight=0.01,
                                   max_depth=4,
                                   num_leaves=16,
                                   colsample_bytree=i,
                                   subsample=0.6,
                                   reg_alpha=0.03,
                                   reg_lambda=0.3,
                                   objective='binary')
    # LightGBM 4.0+ takes early stopping as a callback instead of early_stopping_rounds
    clf = lgb_model.fit(x_train, y_train,
                        eval_set=[(x_train, y_train), (x_test, y_test)],
                        eval_metric='auc',
                        callbacks=[lgb.early_stopping(100)])
    test_auc = roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])
    test_aucs.append(test_auc)
print('colsample_bytree={}, max_auc={}'.format(axisx[test_aucs.index(max(test_aucs))], max(test_aucs)))
# Uncomment to also print KS and AUC for the last fitted model:
# test_fpr, test_tpr, _ = roc_curve(y_test, clf.predict_proba(x_test)[:, 1])
# test_ks = abs(test_fpr - test_tpr).max()
# print('test_ks = ', test_ks)
# print('test_auc = ', test_auc)

plt.figure(figsize=(20,5))
plt.plot(axisx, test_aucs,c="red",label="LGBC")
plt.ylabel('auc')
plt.xticks(axisx)
plt.legend()
plt.show()

# Tune by watching the AUC: subsample. This parameter had no effect on the AUC,
# most likely because subsample_freq stays at its default of 0, in which case LightGBM does no bagging.
axisx = np.linspace(0.7, 0.8, 10)
test_aucs = []
for i in axisx:
    lgb_model = lgb.LGBMClassifier(n_estimators=800,
                                   boosting_type='gbdt',
                                   learning_rate=0.04,
                                   min_child_samples=68,
                                   min_child_weight=0.01,
                                   max_depth=4,
                                   num_leaves=16,
                                   colsample_bytree=0.8,
                                   subsample=i,
                                   reg_alpha=0.03,
                                   reg_lambda=0.3,
                                   objective='binary')
    # LightGBM 4.0+ takes early stopping as a callback instead of early_stopping_rounds
    clf = lgb_model.fit(x_train, y_train,
                        eval_set=[(x_train, y_train), (x_test, y_test)],
                        eval_metric='auc',
                        callbacks=[lgb.early_stopping(100)])
    test_auc = roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])
    test_aucs.append(test_auc)
print('subsample={}, max_auc={}'.format(axisx[test_aucs.index(max(test_aucs))], max(test_aucs)))
# Uncomment to also print KS and AUC for the last fitted model:
# test_fpr, test_tpr, _ = roc_curve(y_test, clf.predict_proba(x_test)[:, 1])
# test_ks = abs(test_fpr - test_tpr).max()
# print('test_ks = ', test_ks)
# print('test_auc = ', test_auc)

plt.figure(figsize=(20,5))
plt.plot(axisx, test_aucs,c="red",label="LGBC")
plt.ylabel('auc')
plt.xticks(axisx)
plt.legend()
plt.show()
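One likely reason the subsample sweep showed no effect: according to the LightGBM parameter documentation, row subsampling (bagging) is only performed when the bagging frequency, `subsample_freq` in the scikit-learn API, is greater than 0, and its default is 0. The sketch below encodes that rule in a hypothetical helper `bagging_is_active`; it does not call LightGBM and was not run against the data in this post.

```python
def bagging_is_active(params: dict) -> bool:
    """Apply LightGBM's documented rule: bagging needs fraction < 1 and freq > 0."""
    fraction = params.get("subsample", 1.0)   # alias of bagging_fraction
    freq = params.get("subsample_freq", 0)    # alias of bagging_freq
    return fraction < 1.0 and freq > 0


if __name__ == "__main__":
    # The settings swept above: subsample is set, subsample_freq is left at 0
    print(bagging_is_active({"subsample": 0.8}))
    # Same fraction, but resampling the rows at every iteration
    print(bagging_is_active({"subsample": 0.8, "subsample_freq": 1}))
```

If that is the cause, adding subsample_freq=1 to the LGBMClassifier arguments should let the subsample sweep change the result.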

# Tune by watching the AUC: reg_alpha.
# Result: reg_alpha=0.7777777777777778, max_auc=0.7773465038084242
axisx = np.linspace(0.7, 0.8, 10)
test_aucs = []
for i in axisx:
    lgb_model = lgb.LGBMClassifier(n_estimators=800,
                                   boosting_type='gbdt',
                                   learning_rate=0.04,
                                   min_child_samples=68,
                                   min_child_weight=0.01,
                                   max_depth=4,
                                   num_leaves=16,
                                   colsample_bytree=0.8,
                                   subsample=0.8,
                                   reg_alpha=i,
                                   reg_lambda=0.3,
                                   objective='binary')
    # LightGBM 4.0+ takes early stopping as a callback instead of early_stopping_rounds
    clf = lgb_model.fit(x_train, y_train,
                        eval_set=[(x_train, y_train), (x_test, y_test)],
                        eval_metric='auc',
                        callbacks=[lgb.early_stopping(100)])
    test_auc = roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])
    test_aucs.append(test_auc)
print('reg_alpha={}, max_auc={}'.format(axisx[test_aucs.index(max(test_aucs))], max(test_aucs)))
# Uncomment to also print KS and AUC for the last fitted model:
# test_fpr, test_tpr, _ = roc_curve(y_test, clf.predict_proba(x_test)[:, 1])
# test_ks = abs(test_fpr - test_tpr).max()
# print('test_ks = ', test_ks)
# print('test_auc = ', test_auc)

plt.figure(figsize=(20,5))
plt.plot(axisx, test_aucs,c="red",label="LGBC")
plt.ylabel('auc')
plt.xticks(axisx)
plt.legend()
plt.show()

# Tune by watching the AUC: reg_lambda.
# Result: reg_lambda=0.30000000000000004, max_auc=0.7773465038084242
axisx = np.linspace(0.28, 0.32, 5)
test_aucs = []
for i in axisx:
    lgb_model = lgb.LGBMClassifier(n_estimators=800,
                                   boosting_type='gbdt',
                                   learning_rate=0.04,
                                   min_child_samples=68,
                                   min_child_weight=0.01,
                                   max_depth=4,
                                   num_leaves=16,
                                   colsample_bytree=0.8,
                                   subsample=0.8,
                                   reg_alpha=0.7777777777777778,
                                   reg_lambda=i,
                                   objective='binary')
    # LightGBM 4.0+ takes early stopping as a callback instead of early_stopping_rounds
    clf = lgb_model.fit(x_train, y_train,
                        eval_set=[(x_train, y_train), (x_test, y_test)],
                        eval_metric='auc',
                        callbacks=[lgb.early_stopping(100)])
    test_auc = roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])
    test_aucs.append(test_auc)
print('reg_lambda={}, max_auc={}'.format(axisx[test_aucs.index(max(test_aucs))], max(test_aucs)))
# Uncomment to also print KS and AUC for the last fitted model:
# test_fpr, test_tpr, _ = roc_curve(y_test, clf.predict_proba(x_test)[:, 1])
# test_ks = abs(test_fpr - test_tpr).max()
# print('test_ks = ', test_ks)
# print('test_auc = ', test_auc)

plt.figure(figsize=(20,5))
plt.plot(axisx, test_aucs,c="red",label="LGBC")
plt.ylabel('auc')
plt.xticks(axisx)
plt.legend()
plt.show()
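The value 0.30000000000000004 reported for reg_lambda is a floating-point artifact of the evenly spaced grid, not a meaningfully different setting from 0.3. A small sketch that builds the same kind of grid in plain Python and rounds each point is given here; `tidy_grid` and the choice of four decimals are assumptions for illustration.

```python
def tidy_grid(start: float, stop: float, num: int, ndigits: int = 4) -> list:
    """Evenly spaced values from start to stop (inclusive), rounded for readability."""
    if num < 2:
        raise ValueError("num must be at least 2")
    step = (stop - start) / (num - 1)
    return [round(start + k * step, ndigits) for k in range(num)]


if __name__ == "__main__":
    # Same endpoints and point count as the reg_lambda grid above
    print(tidy_grid(0.28, 0.32, 5))
```

Rounding the grid before the sweep keeps the reported best value readable without changing which region of the parameter space is searched.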

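Summary

Looking back over the whole tuning process, some parameters (min_child_weight and subsample) had no effect on the AUC, while others had a clear effect. Overall the tuned model is somewhat better: the best test AUC rose from 0.7720 after tuning learning_rate to 0.7773 after tuning reg_alpha.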
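The overall procedure in this post is a one-parameter-at-a-time search: fix all parameters, sweep one, keep its best value, then move on to the next. A compact sketch of that procedure follows, with a made-up scoring function in place of model fitting; `tune_sequentially`, `toy_score` and the grids are hypothetical.

```python
def tune_sequentially(base_params, grids, score_fn):
    """Tune parameters one at a time, carrying each best value forward.

    base_params: dict of starting values
    grids: list of (name, candidates) pairs, tuned in the given order
    score_fn: maps a params dict to a score where higher is better
    """
    params = dict(base_params)
    history = []
    for name, candidates in grids:
        scored = [(score_fn({**params, name: value}), value) for value in candidates]
        best_score = max(score for score, _ in scored)
        # Keep the first candidate that reaches the best score
        best_value = next(value for score, value in scored if score == best_score)
        params[name] = best_value
        history.append((name, best_value, best_score))
    return params, history


if __name__ == "__main__":
    # Toy score that peaks at learning_rate=0.04 and max_depth=4
    def toy_score(p):
        return 0.78 - abs(p["learning_rate"] - 0.04) - 0.01 * abs(p["max_depth"] - 4)

    start = {"learning_rate": 0.01, "max_depth": 6}
    grids = [("learning_rate", [0.01, 0.02, 0.03, 0.04, 0.05]),
             ("max_depth", [3, 4, 5, 6, 7])]
    best_params, steps = tune_sequentially(start, grids, toy_score)
    print(best_params)
```

Because each parameter is tuned with the others held fixed, the result of this kind of search can depend on the order in which the parameters are visited.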

Original: https://blog.csdn.net/weixin_43827767/article/details/120586336
Author: sunnuan01
Title: Tuning LightGBM by maximizing AUC in a data mining competition
