Python Machine Learning 09: Random Forests

All the code and data for this series can be downloaded from Professor Chen Qiang's personal homepage: Python Data Programs.

Reference: Chen Qiang. Machine Learning and Python Applications. Beijing: Higher Education Press, 2021.

This series mostly skips the mathematical theory and focuses on implementing machine learning methods with the most concise Python code possible.

This chapter begins our coverage of ensemble learning. Ensemble methods tend to perform well on real problems because they aggregate the output of many individual learners. Ensemble models are often called "tree models" because they are estimated from many, many decision trees combined, which is also where the name "random forest" comes from. We first introduce bagging (bootstrap aggregating), the algorithm from which random forests originate. The difference between bagging and the boosting methods covered later is that in bagging the individual base estimators are independent of one another, while in boosting each base estimator depends on the ones before it. Bagging is built on the bootstrap familiar from econometrics: resample the data many times, fit a separate estimator on each resample, and average the results. Random forests improve on this by giving each base learner a different random subset of features at each split, which lowers the correlation between the estimators; this may increase the bias somewhat, but it reduces the variance by a larger amount.
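To make the resample-fit-average idea concrete, here is a minimal hand-rolled bagging sketch (an illustration of the concept only, not the book's code; the function bagging_predict and its parameters are made up for this example):

# Minimal hand-rolled bagging: bootstrap resamples + averaged tree predictions
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import resample

def bagging_predict(X, y, X_new, n_estimators=100, random_state=0):
    rng = np.random.RandomState(random_state)
    preds = []
    for _ in range(n_estimators):
        # draw one bootstrap resample (sampling with replacement)
        X_b, y_b = resample(X, y, random_state=rng)
        preds.append(DecisionTreeRegressor().fit(X_b, y_b).predict(X_new))
    # average the predictions of the individual trees
    return np.mean(preds, axis=0)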

A Python Example of Bagging

First import the packages and the data. We use the nonlinear mcycle data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor

from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_iris, load_boston  # load_boston was removed in scikit-learn 1.2
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import plot_roc_curve  # removed in scikit-learn 1.2; use RocCurveDisplay.from_estimator instead
from sklearn.inspection import plot_partial_dependence  # deprecated; use PartialDependenceDisplay.from_estimator instead
from mlxtend.plotting import plot_decision_regions

# Read in the data (motorcycle example: tree vs. bagging)
mcycle = pd.read_csv('mcycle.csv')
mcycle.head()
# Extract X and y
X = np.array(mcycle.times).reshape(-1, 1)
y = mcycle.accel

Here we compare the shapes of the fitted regression curves. First, estimate a single decision tree:

# Single tree estimation
model = DecisionTreeRegressor(random_state=123)
# Candidate pruning strengths from the cost-complexity pruning path
path = model.cost_complexity_pruning_path(X, y)
param_grid = {'ccp_alpha': path.ccp_alphas}
# 10-fold CV to choose the optimal ccp_alpha
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = GridSearchCV(DecisionTreeRegressor(random_state=123), param_grid, cv=kfold)
pred_tree = model.fit(X, y).predict(X)
print(model.score(X,y))
sns.scatterplot(x='times', y='accel', data=mcycle, alpha=0.6)
plt.plot(X, pred_tree, 'b')
plt.title('Single Tree Estimation')


As the plot shows, a single tree's fitted regression curve is piecewise constant, like a staircase, rather than smooth. Next, apply bagging:

# Bagging estimation (note: base_estimator was renamed to estimator in scikit-learn 1.2)
model = BaggingRegressor(base_estimator=DecisionTreeRegressor(random_state=123), n_estimators=500, random_state=0)
pred_bag = model.fit(X, y).predict(X)
print(model.score(X,y))
sns.scatterplot(x='times', y='accel', data=mcycle, alpha=0.6)
plt.plot(X, pred_bag, 'b')
plt.title('Bagging Estimation')

Alternatively, one could use RandomForestRegressor, which by default sets max_features = n_features and is therefore de facto bagging. The results are slightly different. The advantage of BaggingRegressor is the option to use different base learners.
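As a quick check of this equivalence, a hedged sketch on the same mcycle data (the result should be close to, but not identical to, the bagging fit above):

# De facto bagging via a random forest: consider all features at every split
model_rf = RandomForestRegressor(n_estimators=500, max_features=None, random_state=0)
pred_rf = model_rf.fit(X, y).predict(X)
print(model_rf.score(X, y))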


The fitted curve is much smoother.

For a regression problem we again use the ever-popular Boston housing data. The packages were already imported above; below we load the data, split it into training and test sets, and fit a bagging model:


Boston = load_boston()  # requires scikit-learn < 1.2, where load_boston still exists
X = pd.DataFrame(Boston.data, columns=Boston.feature_names)
y = Boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
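If your scikit-learn no longer ships load_boston (it was removed in version 1.2), the data can be loaded from the original source instead, following the workaround from scikit-learn's own deprecation notice:

# Fallback for scikit-learn >= 1.2, where load_boston has been removed
data_url = 'http://lib.stat.cmu.edu/datasets/boston'
raw_df = pd.read_csv(data_url, sep=r'\s+', skiprows=22, header=None)
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
X = pd.DataFrame(np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]]),
                 columns=feature_names)
y = raw_df.values[1::2, 2]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)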

# Bagging estimator with OOB scoring enabled
model = BaggingRegressor(base_estimator=DecisionTreeRegressor(random_state=123), n_estimators=500, oob_score=True, random_state=0)
# Fit
model.fit(X_train, y_train)
# Out-of-bag (OOB) predictions
pred_oob = model.oob_prediction_
# OOB mean squared error
mean_squared_error(y_train, pred_oob)
# OOB R-squared, an internal stand-in for test-set goodness of fit
model.oob_score_
# Test-set goodness of fit
model.score(X_test, y_test)

# Comparison with OLS linear regression
model = LinearRegression().fit(X_train, y_train)
model.score(X_test, y_test)
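As a quick sanity check (an aside, not from the book), the OOB R-squared reported by oob_score_ can be reproduced by hand from the OOB predictions:

# Reproduce oob_score_ from pred_oob; the two numbers should match
from sklearn.metrics import r2_score
r2_score(y_train, pred_oob)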

Now examine how the OOB error changes with the number of estimators:

# OOB errors vs. number of trees

oob_errors = []
for n_estimators in range(100, 301, 10):
    model = BaggingRegressor(base_estimator=DecisionTreeRegressor(random_state=123),
                             n_estimators=n_estimators, n_jobs=-1, oob_score=True, random_state=0)
    model.fit(X_train, y_train)
    pred_oob = model.oob_prediction_
    oob_errors.append(mean_squared_error(y_train, pred_oob))

plt.plot(range(100, 301, 10), oob_errors)
plt.xlabel('Number of Trees')
plt.ylabel('OOB MSE')
plt.title('Bagging OOB Errors')


As the plot shows, the OOB error falls as the number of estimators increases.

A Python Example of Random Forest Regression

We again use the Boston housing data:

# Random forest for regression on the Boston housing data
# Set the hyperparameter max_features, the number of features considered at each split
max_features = int(X_train.shape[1] / 3)
max_features
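This follows the common rule of thumb of about p/3 features per split for regression forests; for classification the usual choice is the square root of p, which we use later. A quick check of the arithmetic for the 13 Boston features:

# Rule-of-thumb split sizes for p = 13 features (illustration only)
p = X_train.shape[1]
print(int(p / 3))        # regression heuristic: 4
print(int(np.sqrt(p)))   # classification heuristic: 3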

# Fit and evaluate the model
model = RandomForestRegressor(n_estimators=5000, max_features=max_features, random_state=0)
model.fit(X_train, y_train)
model.score(X_test, y_test)

# Compare predictions with actual values
pred = model.predict(X_test)

plt.scatter(pred, y_test, alpha=0.6)
w = np.linspace(min(pred), max(pred), 100)
plt.plot(w, w)
plt.xlabel('pred')
plt.ylabel('y_test')
plt.title('Random Forest Prediction')


Visualize the variable importances:

# Feature importance plot
model.feature_importances_
sorted_index = model.feature_importances_.argsort()

plt.barh(range(X.shape[1]), model.feature_importances_[sorted_index])
plt.yticks(np.arange(X.shape[1]), X.columns[sorted_index])
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Random Forest')
plt.tight_layout()
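Note these are impurity-based importances, which can be biased toward features with many distinct values; permutation importance on the test set is a common cross-check. A minimal sketch:

# Permutation importance as an alternative check (sketch)
from sklearn.inspection import permutation_importance
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(pd.Series(perm.importances_mean, index=X.columns).sort_values())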

Plot partial dependence plots:
from sklearn.inspection import PartialDependenceDisplay
PartialDependenceDisplay.from_estimator(model, X, ['LSTAT', 'RM'])


Search for the optimal hyperparameter with a hand-written loop:

scores = []
for max_features in range(1, X.shape[1] + 1):
    model = RandomForestRegressor(max_features=max_features,
                                  n_estimators=500, random_state=123)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    scores.append(score)

index = np.argmax(scores)
range(1, X.shape[1] + 1)[index]

plt.plot(range(1, X.shape[1] + 1), scores, 'o-')
plt.axvline(range(1, X.shape[1] + 1)[index], linestyle='--', color='k', linewidth=1)
plt.xlabel('max_features')
plt.ylabel('R2')
plt.title('Choose max_features via Test Set')


As the plot shows, the test-set goodness of fit is highest at max_features = 9.

Next, compare how the test error of a random forest, bagging, and a single decision tree changes with the number of estimators:

#RF
scores_rf = []
for n_estimators in range(1, 301):
    model = RandomForestRegressor(max_features=9,
                                  n_estimators=n_estimators, random_state=123)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    scores_rf.append(mse)

# Bagging
scores_bag = []
for n_estimators in range(1, 301):
    model = BaggingRegressor(base_estimator=DecisionTreeRegressor(random_state=123), n_estimators=n_estimators, random_state=0)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    scores_bag.append(mse)

# Single decision tree (pruned via cost-complexity CV)
model = DecisionTreeRegressor()
path = model.cost_complexity_pruning_path(X_train, y_train)
param_grid = {'ccp_alpha': path.ccp_alphas}
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = GridSearchCV(DecisionTreeRegressor(random_state=123), param_grid, cv=kfold, scoring='neg_mean_squared_error')
model.fit(X_train, y_train)
score_tree = -model.score(X_test, y_test)
scores_tree = [score_tree for i in range(1, 301)]

# Plot
plt.plot(range(1, 301), scores_tree, 'k--', label='Single Tree')
plt.plot(range(1, 301), scores_bag, 'k-', label='Bagging')
plt.plot(range(1, 301), scores_rf, 'b-', label='Random Forest')
plt.xlabel('Number of Trees')
plt.ylabel('MSE')
plt.title('Test Error')
plt.legend()


Grid search for the optimal hyperparameter:

max_features = range(1, X.shape[1] + 1)
param_grid = {'max_features': max_features }
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = GridSearchCV(RandomForestRegressor(n_estimators=300, random_state=123),
                     param_grid, cv=kfold, scoring='neg_mean_squared_error', return_train_score=True)

model.fit(X_train, y_train)

model.best_params_

cv_mse = -model.cv_results_['mean_test_score']

plt.plot(max_features, cv_mse, 'o-')
plt.axvline(max_features[np.argmin(cv_mse)], linestyle='--', color='k', linewidth=1)
plt.xlabel('max_features')
plt.ylabel('MSE')
plt.title('CV Error for Random Forest')


A Python Example of Random Forest Classification

# Read in the data
Sonar = pd.read_csv('Sonar.csv')
Sonar.shape
Sonar.head(2)
# Extract X and y
X = Sonar.iloc[:, :-1]
y = Sonar.iloc[:, -1]
# Heatmap of the correlations between the features
sns.heatmap(X.corr(), cmap='Blues')
plt.title('Correlation Matrix')
plt.tight_layout()


Split the data into training and test sets, then fit and compare a decision tree and a random forest:

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=50, random_state=1)

# Single tree as benchmark
model = DecisionTreeClassifier()
path = model.cost_complexity_pruning_path(X_train, y_train)
param_grid = {'ccp_alpha': path.ccp_alphas}
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = GridSearchCV(DecisionTreeClassifier(random_state=123), param_grid, cv=kfold)
model.fit(X_train, y_train)
model.score(X_test, y_test)

# Random forest
model = RandomForestClassifier(n_estimators=500, max_features='sqrt', random_state=123)
model.fit(X_train, y_train)
model.score(X_test, y_test)

Grid search for the hyperparameter:

# Choose the optimal max_features (mtry) via CV
# This example dummy-encodes y to make it numeric; note that recent scikit-learn
# classifiers and GridSearchCV also accept string labels directly
y_train_dummy = pd.get_dummies(y_train)
y_train_dummy = y_train_dummy.iloc[:, 1]

param_grid = {'max_features': range(1, 11)}
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
model = GridSearchCV(RandomForestClassifier(n_estimators=300, random_state=123), param_grid, cv=kfold)
model.fit(X_train, y_train_dummy)

model.best_params_
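Incidentally, with recent scikit-learn versions the dummy-encoding step is unnecessary, since classifiers accept string labels directly; this hedged variant should recover the same best_params_:

# Hedged variant: fit the grid search directly on the string labels
model_str = GridSearchCV(RandomForestClassifier(n_estimators=300, random_state=123),
                         param_grid, cv=kfold)
model_str.fit(X_train, y_train)
print(model_str.best_params_)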
# The best value is max_features=8, so refit with it
model = RandomForestClassifier(n_estimators=500, max_features=8, random_state=123)
model.fit(X_train, y_train)
model.score(X_test, y_test)

# Variable importance plot
sorted_index = model.feature_importances_.argsort()
plt.barh(range(X.shape[1]), model.feature_importances_[sorted_index])
plt.yticks(np.arange(X.shape[1]), X.columns[sorted_index])
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Random Forest')


Compute the confusion matrix:

# Prediction performance
pred = model.predict(X_test)
table = pd.crosstab(y_test, pred, rownames=['Actual'], colnames=['Predicted'])
table
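The same table can also be produced with scikit-learn's confusion_matrix (rows are actual classes, columns predicted classes, in sorted label order):

# Equivalent confusion matrix via scikit-learn
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, pred)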


Compute metrics from the confusion matrix:

table = np.array(table)
Accuracy = (table[0, 0] + table[1, 1]) / np.sum(table)
Accuracy

# Sensitivity (recall for the positive class): TP / (TP + FN)
Sensitivity = table[1, 1] / (table[1, 0] + table[1, 1])
Sensitivity

# Specificity: TN / (TN + FP)
Specificity = table[0, 0] / (table[0, 0] + table[0, 1])
Specificity

# Note: this column-wise ratio TP / (TP + FP) is precision, not recall
# (recall equals the sensitivity computed above)
Precision = table[1, 1] / (table[0, 1] + table[1, 1])
Precision

cohen_kappa_score(y_test, pred)
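Cohen's kappa corrects accuracy for the agreement expected by chance, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed accuracy and p_e the accuracy expected under independent marginals. A hand computation from the confusion matrix above should match cohen_kappa_score:

# Hand computation of Cohen's kappa (sanity check)
p_o = (table[0, 0] + table[1, 1]) / table.sum()                  # observed agreement
p_e = (table[0, :].sum() * table[:, 0].sum() +
       table[1, :].sum() * table[:, 1].sum()) / table.sum()**2   # chance agreement
(p_o - p_e) / (1 - p_e)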

# Plot the ROC curve (plot_roc_curve works only on scikit-learn < 1.2)
plot_roc_curve(model, X_test, y_test)
x = np.linspace(0, 1, 100)
plt.plot(x, x, 'k--', linewidth=1)
plt.title('ROC Curve for Random Forest')
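On scikit-learn 1.2 and later, where plot_roc_curve has been removed, the equivalent display call is:

# Equivalent ROC plot on scikit-learn >= 1.2
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(model, X_test, y_test)
plt.plot(np.linspace(0, 1, 100), np.linspace(0, 1, 100), 'k--', linewidth=1)
plt.title('ROC Curve for Random Forest')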


Finally, draw the decision boundary of a random forest using two feature variables of the iris data:

X, y = load_iris(return_X_y=True)
X2 = X[:, 2:4]  # petal length and petal width

model = RandomForestClassifier(n_estimators=500, max_features=1, random_state=1)
model.fit(X2, y)
model.score(X2, y)

plot_decision_regions(X2, y, model)
plt.xlabel('petal_length')
plt.ylabel('petal_width')
plt.title('Decision Boundary for Random Forest')


Original: https://blog.csdn.net/weixin_46277779/article/details/125477184
Author: 阡之尘埃
Title: Python机器学习09——随机森林
