Random Forest in Practice: Classification, Feature Importance, and Regression (with a Detailed Python Walkthrough)

1. Random Forest: Classification Task

We use a random forest to classify the Iris dataset.

Step one: import the packages we will need:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt

Next, let's look at the dataset, focusing on its dimensions and features:

iris=load_iris()
iris.feature_names
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']
data=iris.data
data

[Output: `data` is a 150×4 NumPy array of feature values]

The data has four columns, one per feature.

Let's look at the labels:

target=iris.target
target

[Output: an array of 150 integer labels taking the values 0, 1, and 2]

And the distinct label values:

np.unique(target)

array([0, 1, 2])
There are three classes in total.

We select the last two features for classification: 'petal length (cm)' and 'petal width (cm)'.

X = iris.data[:,[2,3]]
y = iris.target
print('Class labels:', np.unique(y))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
print(X_train.shape, y_train.shape)

The printed output:

Class labels: [0 1 2]
(105, 2) (105,)
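
Because we passed stratify=y, both splits keep the original 50/50/50 class proportions. A minimal check with np.bincount (my addition; these counts are not shown in the original output):

# Each class appears proportionally often in both splits thanks to stratify=y
print(np.bincount(y_train))  # expected: [35 35 35]
print(np.bincount(y_test))   # expected: [15 15 15]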

We train the model and plot the decision regions:

def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):
    # Marker and color setup: one style per class
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # Build a grid covering the feature space, padded by 1 on each side
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))

    # Predict the class of every grid point and draw the filled regions
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.3, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # Scatter the actual samples, one class at a time
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=colors[idx],
                    marker=markers[idx], label=cl,
                    edgecolor='black')

    # Highlight the test samples with larger, unfilled circles
    if test_idx:
        X_test, y_test = X[test_idx, :], y[test_idx]
        plt.scatter(X_test[:, 0], X_test[:, 1],
                    c='white', edgecolor='black', alpha=1.0,
                    linewidth=1, marker='o',
                    s=100, label='test set')

forest = RandomForestClassifier(criterion='gini',
                                n_estimators=25,
                                random_state=1,
                                n_jobs=2)
forest.fit(X_train, y_train)
X_combined = np.vstack((X_train, X_test))
y_combined = np.hstack((y_train, y_test))
plot_decision_regions(X_combined, y_combined, classifier=forest, test_idx=range(105,150))
plt.xlabel('petal length')
plt.ylabel('petal width')
plt.legend(loc='upper left')
plt.show()

The resulting plot:

[Figure: decision regions over petal length vs. petal width, with the test samples circled]
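
The plot shows the boundary qualitatively; to attach a number, here is a minimal sketch (my addition, not part of the original walkthrough) using the classifier's built-in score method:

# Mean accuracy on the held-out test set (exact value depends on the split)
print('Test accuracy: %.3f' % forest.score(X_test, y_test))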

2. Random Forest: Feature Importance

First, a look at the dataset:

import pandas as pd

cwd = './Machine_Learning/'
data_dir = cwd+'RandomForest随机森林/data/'

df_wine = pd.read_csv(data_dir+'wine.data',
                      header=None,
                      names=['Class label', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium',
                               'Total phenols', 'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity',
                               'Hue', 'OD280/OD315 of diluted wines', 'Proline'])
print('Class labels', np.unique(df_wine['Class label']))
print('number of features:', len(df_wine.keys())-1)
df_wine.head()

Part of the output:

[Output: the first five rows of df_wine, showing the class label and 13 feature columns]

So this is the Wine dataset, with 13 features in total.

Next we split it into training and test sets:

X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
X_train.shape
(124, 13)

We train a RandomForestClassifier, then read its feature_importances_ attribute to get the feature importances:

feat_labels = df_wine.columns[1:]
forest = RandomForestClassifier(n_estimators=200, random_state=1)
forest.fit(X_train, y_train)
importances = forest.feature_importances_
print(len(importances))
importances

[Output: 13, followed by the array of 13 importance scores]
Let's sort the importances in descending order:
indices = np.argsort(importances)[::-1]
indices

numpy.argsort(a, axis=-1, kind='quicksort', order=None)
Purpose: sorts the array a along the given axis and returns the indices that would sort it.
Parameters: a: the input array; axis: the axis to sort along.
Returns: the indices of the sorted elements.
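
A minimal example (my addition) of why reversing argsort's output yields a descending ranking:

scores = np.array([0.1, 0.5, 0.2])
print(np.argsort(scores))        # [0 2 1] -> indices in ascending order of score
print(np.argsort(scores)[::-1])  # [1 2 0] -> indices in descending order of score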

[Output: the 13 feature indices, ordered from most to least important]
for i in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (i + 1, 30, feat_labels[indices[i]], importances[indices[i]]))

The final ranking:

 1) Proline                        0.194966
 2) Flavanoids                     0.187939
 3) Color intensity                0.134887
 4) OD280/OD315 of diluted wines   0.134407
 5) Alcohol                        0.115360
 6) Hue                            0.063459
 7) Total phenols                  0.040195
 8) Magnesium                      0.032550
 9) Proanthocyanins                0.026119
10) Malic acid                     0.024860
11) Alcalinity of ash              0.020615
12) Ash                            0.013679
13) Nonflavanoid phenols           0.010963

And a visualization of the result:

plt.title('Feature Importance')
plt.bar(range(X_train.shape[1]), importances[indices], align='center')
plt.xticks(range(X_train.shape[1]), feat_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
plt.show()

[Figure: bar chart of the feature importances, sorted in descending order]
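
Note that impurity-based importances like feature_importances_ can be biased toward features with many distinct values. As a complementary sketch (not part of the original walkthrough), scikit-learn's permutation_importance measures how much held-out accuracy drops when each feature is shuffled:

from sklearn.inspection import permutation_importance

# How much does shuffling each feature hurt test-set accuracy?
result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=1)
for i in result.importances_mean.argsort()[::-1]:
    print('%-30s %f' % (feat_labels[i], result.importances_mean[i]))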

3. Random Forest: Regression Task

A look at the dataset:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

df = pd.read_csv(data_dir+'housing.data.txt',
                 header=None,
                 sep=r'\s+',
                 names= ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV'])
df.head()

[Output: the first five rows of the Boston housing data, 13 feature columns plus the MEDV target]

X = df.iloc[:, :-1].values
y = df['MEDV'].values
X_train, X_test, y_train, y_test =train_test_split(X, y, test_size=0.2, random_state=1)
print(X_train.shape, y_train.shape)
(404, 13) (404,)
# Note: older scikit-learn used criterion='mse'; versions >= 1.2 require 'squared_error'
forest = RandomForestRegressor(n_estimators=100, criterion='squared_error', random_state=1, n_jobs=-1)
forest.fit(X_train, y_train)

y_train_pred = forest.predict(X_train)
y_test_pred = forest.predict(X_test)
print('MSE train: %.3f, test: %.3f' % (mean_squared_error(y_train, y_train_pred), mean_squared_error(y_test, y_test_pred)))

The final result:

MSE train: 1.237, test: 8.916
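
The test MSE is far larger than the training MSE, which points to overfitting. As a quick complement (my addition), the R² score makes the gap easy to read:

from sklearn.metrics import r2_score

# R^2 near 1.0 on train but noticeably lower on test confirms the generalization gap
print('R^2 train: %.3f, test: %.3f' % (r2_score(y_train, y_train_pred), r2_score(y_test, y_test_pred)))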

4. Full Source Code

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import numpy as np

from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt

iris = datasets.load_iris()
print(iris['data'].shape, iris['target'].shape)
X = iris.data[:,[2,3]]
y = iris.target
print('Class labels:', np.unique(y))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
print(X_train.shape, y_train.shape)
def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):
    # Marker and color setup: one style per class
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # Build a grid covering the feature space, padded by 1 on each side
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))

    # Predict the class of every grid point and draw the filled regions
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.3, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # Scatter the actual samples, one class at a time
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=colors[idx],
                    marker=markers[idx], label=cl,
                    edgecolor='black')

    # Highlight the test samples with larger, unfilled circles
    # (c='' is rejected by newer matplotlib; use 'white' instead)
    if test_idx:
        X_test, y_test = X[test_idx, :], y[test_idx]
        plt.scatter(X_test[:, 0], X_test[:, 1],
                    c='white', edgecolor='black', alpha=1.0,
                    linewidth=1, marker='o',
                    s=100, label='test set')

forest = RandomForestClassifier(criterion='gini',
                                n_estimators=25,
                                random_state=1,
                                n_jobs=2)
forest.fit(X_train, y_train)
X_combined = np.vstack((X_train, X_test))
y_combined = np.hstack((y_train, y_test))
plot_decision_regions(X_combined, y_combined, classifier=forest, test_idx=range(105,150))
plt.xlabel('petal length')
plt.ylabel('petal width')
plt.legend(loc='upper left')
plt.show()
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

cwd = './MachineLearning/'
data_dir = cwd+'RandomForest随机森林/data/'

df_wine = pd.read_csv(data_dir+'wine.data',
                      header=None,
                      names=['Class label', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium',
                               'Total phenols', 'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity',
                               'Hue', 'OD280/OD315 of diluted wines', 'Proline'])
print('Class labels', np.unique(df_wine['Class label']))
print('number of features:', len(df_wine.keys())-1)
df_wine.head()

X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
X_train.shape

feat_labels = df_wine.columns[1:]
forest = RandomForestClassifier(n_estimators=200, random_state=1)
forest.fit(X_train, y_train)
importances = forest.feature_importances_
print(len(importances))
importances

indices = np.argsort(importances)[::-1]
indices

for i in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (i + 1, 30, feat_labels[indices[i]], importances[indices[i]]))

plt.title('Feature Importance')
plt.bar(range(X_train.shape[1]), importances[indices], align='center')
plt.xticks(range(X_train.shape[1]), feat_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
plt.show()
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

df = pd.read_csv(data_dir+'housing.data.txt',
                 header=None,
                 sep=r'\s+',
                 names= ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV'])
df.head()

X = df.iloc[:, :-1].values
y = df['MEDV'].values
X_train, X_test, y_train, y_test =train_test_split(X, y, test_size=0.2, random_state=1)
print(X_train.shape, y_train.shape)

# Note: older scikit-learn used criterion='mse'; versions >= 1.2 require 'squared_error'
forest = RandomForestRegressor(n_estimators=100, criterion='squared_error', random_state=1, n_jobs=-1)
forest.fit(X_train, y_train)

y_train_pred = forest.predict(X_train)
y_test_pred = forest.predict(X_test)
print('MSE train: %.3f, test: %.3f' % (mean_squared_error(y_train, y_train_pred), mean_squared_error(y_test, y_test_pred)))

Original: https://blog.csdn.net/wzk4869/article/details/126652944
Author: 旅途中的宽~
Title: 随机森林实战(分类任务+特征重要性+回归任务)(含Python代码详解)

