12. Advanced Ensemble Learning, Part 1: XGBoost

May 31, 2023, 9:12 AM • Artificial Intelligence

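The XGBoost algorithm

XGBoost (Extreme Gradient Boosting) is a gradient-boosted tree method. It is one of the flagship ensemble learning methods: many winning solutions in Kaggle data mining competitions have used XGBoost.

Building the optimal model

Deriving the XGBoost objective function

Introduction to CART trees

Defining tree complexity

Define the complexity of each tree

Tree complexity example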
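As a minimal sketch, assuming the usual XGBoost definition of tree complexity, Ω(f) = γT + ½λ Σ w_j² (T is the number of leaves, w_j are the leaf weights), the complexity of a small tree can be computed as follows. The helper name `tree_complexity` and the sample leaf weights are illustrative assumptions, not taken from the original post.

```python
def tree_complexity(leaf_weights, gamma=1.0, lam=1.0):
    """Omega(f) = gamma * T + 0.5 * lam * sum(w_j ** 2)."""
    num_leaves = len(leaf_weights)                   # T
    l2_penalty = sum(w ** 2 for w in leaf_weights)   # sum of squared leaf weights
    return gamma * num_leaves + 0.5 * lam * l2_penalty


# A hypothetical tree with three leaves whose weights are 2, 0.1 and -1
omega = tree_complexity([2.0, 0.1, -1.0], gamma=1.0, lam=1.0)
print(omega)
```

With γ = λ = 1 this evaluates to 3 + ½(4 + 0.01 + 1). More leaves or larger leaf weights both raise the penalty.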

Objective function derivation

Building regression trees in XGBoost

Computing the split nodes
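A minimal sketch of the split computation, assuming the standard XGBoost gain formula: Gain = ½ [G_L²/(H_L+λ) + G_R²/(H_R+λ) − (G_L+G_R)²/(H_L+H_R+λ)] − γ, where G and H are the sums of first- and second-order gradients in each child. The function names and the sample gradient sums are illustrative assumptions.

```python
def leaf_score(grad_sum, hess_sum, lam):
    """Structure score G^2 / (H + lambda) of a single leaf."""
    return grad_sum ** 2 / (hess_sum + lam)


def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain obtained by splitting one node into a left and a right child."""
    parent = leaf_score(g_left + g_right, h_left + h_right, lam)
    children = leaf_score(g_left, h_left, lam) + leaf_score(g_right, h_right, lam)
    return 0.5 * (children - parent) - gamma


# Hypothetical gradient statistics for one candidate split
gain = split_gain(g_left=-4.0, h_left=3.0, g_right=5.0, h_right=3.0,
                  lam=1.0, gamma=0.5)
print(gain)
```

A candidate split is typically kept only when this gain is positive, which is why a larger γ leads to more conservative trees.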

Deciding when to stop splitting

Differences between XGBoost and GBDT

XGBoost API overview

Official website

pip3 install xgboost

General parameters

Booster parameters

Parameters for Tree Booster

Parameters for Linear Booster

XGBoost case study: Titanic survival analysis

Case study
The features in the dataset we extracted include ticket class, survival, travel class, age, port of embarkation, home.dest, room, boat, and sex.
[Data](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt)

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier, export_graphviz


titan = pd.read_csv("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt")
titan

titan.describe()


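# Select the features and the target; copy so later edits do not touch titan
x = titan[["pclass", "age", "sex"]].copy()
y = titan["survived"]
x.head()

y.head()


# Fill missing ages with the mean age
x["age"] = x["age"].fillna(titan["age"].mean())
x.head()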


x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22, test_size=0.2)


x_train.head()

x_train = x_train.to_dict(orient="records")
x_test = x_test.to_dict(orient="records")
x_train

transfer = DictVectorizer()

# Fit the vectorizer on the training data only, then reuse it on the test data
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)
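A vectorizer learns its column layout from the data it is fitted on, so test records need to be encoded with the layout learned from the training records. The pure-Python sketch below imitates that idea in a simplified way (it is not scikit-learn's implementation; the helper names and the sample records are assumptions): string values become one-hot columns named `key=value`, numeric values keep their own column.

```python
def fit_feature_names(records):
    """Collect one column per numeric key and per (key, string value) pair."""
    names = set()
    for row in records:
        for key, value in row.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    return sorted(names)


def transform(records, feature_names):
    """Encode records with a fixed column layout; unseen categories are ignored."""
    index = {name: i for i, name in enumerate(feature_names)}
    rows = []
    for row in records:
        encoded = [0.0] * len(feature_names)
        for key, value in row.items():
            name = f"{key}={value}" if isinstance(value, str) else key
            if name in index:
                encoded[index[name]] = 1.0 if isinstance(value, str) else float(value)
        rows.append(encoded)
    return rows


train = [{"pclass": "1st", "age": 29.0, "sex": "female"},
         {"pclass": "3rd", "age": 31.2, "sex": "male"}]
test = [{"pclass": "3rd", "age": 40.0, "sex": "female"}]

columns = fit_feature_names(train)   # layout learned from the training records
print(columns)
print(transform(test, columns))      # test records reuse that layout
print(fit_feature_names(test))       # refitting on test gives a different layout
```

Refitting on the test records alone yields fewer columns in a different order, so a model trained on the training layout would read the wrong features.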


from xgboost import XGBClassifier
xg = XGBClassifier()
xg.fit(x_train, y_train)

xg.score(x_test, y_test)


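# Start at 1: in XGBoost, max_depth=0 means "no depth limit"
depth_range = range(1, 10)
score = []

for i in depth_range:
    # learning_rate is the scikit-learn wrapper's name for eta
    xg = XGBClassifier(learning_rate=1, gamma=0, max_depth=i)
    xg.fit(x_train, y_train)

    s = xg.score(x_test, y_test)

    print(s)
    score.append(s)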


import matplotlib.pyplot as plt
plt.plot(depth_range, score)
plt.show()

XGBoost case study: Otto product classification

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Data acquisition

data = pd.read_csv("./data/otto/train.csv")
data.head()

data.shape
data.describe()


import seaborn as sns

sns.countplot(x=data.target)

plt.show()

Basic data processing

The data has already been anonymized, so no special processing is needed.


new1_data = data[:10000]
new1_data.shape

sns.countplot(x=new1_data.target)

plt.show()


y = data["target"]
x = data.drop(["id", "target"], axis=1)
x.head()

y.head()


from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=0)

X_resampled, y_resampled = rus.fit_resample(x, y)

x.shape, y.shape

X_resampled.shape, y_resampled.shape

sns.countplot(x=y_resampled)
plt.show()
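To make the resampling step concrete, here is a minimal pure-Python sketch of random under-sampling: every class is sampled, without replacement, down to the size of the smallest class. It only illustrates the idea and is not imblearn's implementation; the helper name and the toy labels are assumptions.

```python
import random
from collections import Counter, defaultdict


def random_under_sample(labels, seed=0):
    """Return sorted row indices that keep the same number of rows per class."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    smallest = min(len(rows) for rows in by_class.values())
    rng = random.Random(seed)
    kept = []
    for rows in by_class.values():
        # Sample without replacement down to the minority class size
        kept.extend(rng.sample(rows, smallest))
    return sorted(kept)


# Toy imbalanced labels: 3, 10 and 5 rows per class
labels = ["Class_1"] * 3 + ["Class_2"] * 10 + ["Class_3"] * 5
kept = random_under_sample(labels)
print(Counter(labels[i] for i in kept))
```

The price of the balanced class counts is that rows from the larger classes are discarded.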

Convert the label values to numbers

y_resampled.head()

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_resampled = le.fit_transform(y_resampled)

y_resampled

Split the data

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2)

x_train.shape, y_train.shape

x_test.shape, y_test.shape

sns.countplot(x=y_test)
plt.show()


from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)

for train_index, test_index in sss.split(X_resampled.values, y_resampled):
    print(len(train_index))
    print(len(test_index))

    x_train = X_resampled.values[train_index]
    x_val = X_resampled.values[test_index]

    y_train = y_resampled[train_index]
    y_val = y_resampled[test_index]

print(x_train.shape, x_val.shape)

sns.countplot(x=y_val)
plt.show()
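The difference from a plain random split is that a stratified split draws the validation rows class by class, so each class keeps roughly the same share in both parts. A simplified pure-Python sketch of that idea follows (scikit-learn's allocation logic is more careful; the helper name and the toy labels are assumptions).

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Split row indices so every class contributes about test_size of its rows."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for rows in by_class.values():
        rows = rows[:]                 # copy before shuffling
        rng.shuffle(rows)
        n_test = round(len(rows) * test_size)
        test_idx.extend(rows[:n_test])
        train_idx.extend(rows[n_test:])
    return train_idx, test_idx


# Toy balanced labels: 50 rows for each of three classes
labels = [0] * 50 + [1] * 50 + [2] * 50
train_idx, test_idx = stratified_split(labels, test_size=0.2)
print(Counter(labels[i] for i in test_idx))
```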

Data standardization

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(x_train)

x_train_scaled = scaler.transform(x_train)
x_val_scaled = scaler.transform(x_val)
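The scaler is fitted on the training split only, and the same mean and standard deviation are then applied to the validation split. A one-column sketch of that pattern in plain Python, assuming the population standard deviation (the toy values and helper names are assumptions):

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Learn the center and scale from the training column only."""
    return mean(column), pstdev(column)


def apply_scaler(column, center, scale):
    """Standardize a column with previously learned statistics."""
    return [(value - center) / scale for value in column]


train_col = [1.0, 2.0, 3.0, 4.0, 5.0]
val_col = [3.0, 6.0]

center, scale = fit_scaler(train_col)
train_scaled = apply_scaler(train_col, center, scale)
val_scaled = apply_scaler(val_col, center, scale)   # reuses training statistics
print(train_scaled)
print(val_scaled)
```

The validation column is not guaranteed to have zero mean or unit variance after scaling, which is expected because its statistics were never used.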

PCA dimensionality reduction

x_train_scaled.shape

from sklearn.decomposition import PCA

pca = PCA(n_components=0.9)
x_train_pca = pca.fit_transform(x_train_scaled)
x_val_pca = pca.transform(x_val_scaled)

print(x_train_pca.shape, x_val_pca.shape)

plt.plot(np.cumsum(pca.explained_variance_ratio_))

plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance ratio")

plt.show()
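When `n_components` is a float such as 0.9, PCA keeps the smallest number of components whose cumulative explained variance ratio exceeds that fraction. The sketch below reproduces that selection rule on made-up ratios (the helper name and the ratios are assumptions, not values from the Otto data).

```python
from itertools import accumulate


def components_needed(explained_variance_ratio, threshold=0.9):
    """Smallest k whose cumulative explained variance ratio exceeds threshold."""
    for k, total in enumerate(accumulate(explained_variance_ratio), start=1):
        if total > threshold:
            return k
    return len(explained_variance_ratio)


# Hypothetical per-component ratios, sorted from largest to smallest
ratios = [0.5, 0.25, 0.125, 0.0625, 0.0625]
print(components_needed(ratios, threshold=0.9))
```

The cumulative curve plotted above carries the same information: the chosen number of components is where the curve first rises above 0.9.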

Model training

Baseline model training

from xgboost import XGBClassifier

xgb = XGBClassifier()
xgb.fit(x_train_pca, y_train)


y_pre_proba = xgb.predict_proba(x_val_pca)
y_pre_proba


from sklearn.metrics import log_loss

# The eps argument is omitted: it has been removed from recent scikit-learn releases
log_loss(y_val, y_pre_proba, normalize=True)
xgb.get_params()
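For reference, multi-class log loss is the mean negative log of the probability that the model assigned to the true class. A minimal sketch without the probability clipping that library implementations apply (the helper name and the toy predictions are assumptions):

```python
import math


def multiclass_log_loss(y_true, proba):
    """Mean negative log-probability assigned to the true class of each row."""
    total = 0.0
    for label, row in zip(y_true, proba):
        total -= math.log(row[label])
    return total / len(y_true)


# Three toy samples with integer labels and predicted class probabilities
y_true = [0, 2, 1]
proba = [[0.7, 0.2, 0.1],
         [0.1, 0.3, 0.6],
         [0.2, 0.5, 0.3]]
loss = multiclass_log_loss(y_true, proba)
print(loss)
```

Lower is better, and a confident wrong prediction is punished heavily. Because this sketch does not clip, a probability of exactly 0 for the true class would raise an error.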

Model tuning

Determine the best n_estimators

scores_ne = []
n_estimators = [100, 200, 300, 400, 500, 550, 600, 700]

for nes in n_estimators:
    print("n_estimators:", nes)
    # n_jobs=-1 uses all cores; random_state replaces the older seed alias
    xgb = XGBClassifier(max_depth=3,
                        learning_rate=0.1,
                        n_estimators=nes,
                        objective="multi:softprob",
                        n_jobs=-1,
                        min_child_weight=1,
                        subsample=1,
                        colsample_bytree=1,
                        random_state=42)

    xgb.fit(x_train_pca, y_train)
    y_pre = xgb.predict_proba(x_val_pca)
    score = log_loss(y_val, y_pre)
    scores_ne.append(score)

    print("log loss for this run: {}".format(score))


plt.plot(n_estimators, scores_ne, "o-")

plt.xlabel("n_estimators")
plt.ylabel("log_loss")
plt.show()

print("Best n_estimators: {}".format(n_estimators[np.argmin(scores_ne)]))

Determine the best max_depth

scores_md = []
max_depths = [1,3,5,6,7]

for md in max_depths:
    print("max_depth:", md)
    # Reuse the best n_estimators found in the previous step
    xgb = XGBClassifier(max_depth=md,
                        learning_rate=0.1,
                        n_estimators=n_estimators[np.argmin(scores_ne)],
                        objective="multi:softprob",
                        n_jobs=-1,
                        min_child_weight=1,
                        subsample=1,
                        colsample_bytree=1,
                        random_state=42)

    xgb.fit(x_train_pca, y_train)
    y_pre = xgb.predict_proba(x_val_pca)
    score = log_loss(y_val, y_pre)
    scores_md.append(score)

    print("log loss for this run: {}".format(score))


plt.plot(max_depths, scores_md, "o-")

plt.xlabel("max_depths")
plt.ylabel("log_loss")
plt.show()

print("Best max_depth: {}".format(max_depths[np.argmin(scores_md)]))

**Following the same pattern, run and tune the parameters below: min_child_weights, subsamples, colsample_bytrees, etas**
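The two tuning loops above share one pattern: hold the other parameters fixed, try each candidate value, record the validation log loss, and keep the value with the lowest score. A hedged sketch of that pattern as a reusable helper follows. `sweep_parameter` and `toy_score` are hypothetical names, and the toy scoring function stands in for model training so the sketch runs on its own.

```python
def sweep_parameter(name, values, score_fn, fixed_params=None):
    """Score each candidate value and return (best_value, scores); lower is better."""
    fixed_params = dict(fixed_params or {})
    scores = []
    for value in values:
        params = {**fixed_params, name: value}
        scores.append(score_fn(params))
    best_value = values[scores.index(min(scores))]
    return best_value, scores


# Stand-in scoring function so the sketch runs without training a model
def toy_score(params):
    return (params["subsample"] - 0.7) ** 2 + 0.5


best, scores = sweep_parameter("subsample", [0.5, 0.6, 0.7, 0.8, 0.9], toy_score,
                               fixed_params={"max_depth": 3})
print(best)
```

In the notebook, the scoring function would instead build `XGBClassifier(**params)`, fit it on `x_train_pca`, and return `log_loss(y_val, ...)` on the validation predictions.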

# n_jobs and random_state replace the older nthread and seed aliases
xgb = XGBClassifier(learning_rate=0.1,
                    n_estimators=550,
                    max_depth=3,
                    min_child_weight=3,
                    subsample=0.7,
                    colsample_bytree=0.7,
                    n_jobs=4,
                    random_state=42,
                    objective="multi:softprob")

# Note: the final model is fit on the standardized features, not the PCA-reduced ones
xgb.fit(x_train_scaled, y_train)

y_pre = xgb.predict_proba(x_val_scaled)

print("log_loss on the validation data: {}".format(log_loss(y_val, y_pre, normalize=True)))

log_loss on the validation data: 0.5944022517380477

Original: https://blog.csdn.net/weixin_50973728/article/details/123311899
Author: C–G
Title: 12. Advanced Ensemble Learning, Part 1: XGBoost
