"scikit-learn Machine Learning" Decision Trees ③ - Titanic Survivor Prediction [Approach + Code]

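Titanic prediction:

1. Approach

1.1 Data processing

  • Drop columns that are irrelevant to the prediction
  • Recode some values, for example male as 1 and female as 0
  • Fill in or drop missing values
  • Identify the target (the value to predict)
  • Split the dataset into a training set and a cross-validation set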

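1.2 Choosing and training a model

Since this article is about decision trees, we train the model with a decision tree algorithm.

When we train a decision tree with the information-entropy criterion used by ID3, the model may overfit (the score on the training set is high, but the score on the cross-validation set is noticeably lower). When that happens, the tree needs to be pruned.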
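The overfitting symptom described above can be reproduced in a few lines. This is a minimal sketch, assuming a synthetic dataset from `make_classification` as a stand-in for the Titanic data; the sample sizes and noise level are arbitrary choices for illustration, not values from the book.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic data (an assumption, for illustration only).
# flip_y adds label noise, which an unrestricted tree will happily memorise.
X, y = make_classification(n_samples=600, n_features=8, n_informative=3,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# No depth limit: the tree keeps splitting until every leaf is pure.
clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(X_train, y_train)

train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)
print('train score: {0:.3f}; test score: {1:.3f}'.format(train_score, test_score))
```

A perfect training score next to a clearly lower held-out score is the gap that pruning is meant to narrow.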

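Note: this article uses pre-pruning only. Older scikit-learn releases did not support post-pruning; since version 0.22, minimal cost-complexity pruning is available through the ccp_alpha parameter.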

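1.3 Optimizing the model with pre-pruning

In sklearn, max_depth can be used to prune the model. It fixes the maximum depth of the decision tree: the tree grows only down to that depth, and nodes at the limit are not split any further.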
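To see what the depth limit does to the fitted tree, the sketch below compares an unrestricted tree with one capped at `max_depth=3`. The synthetic data and the value 3 are assumptions chosen for illustration; `get_depth()` and `get_n_leaves()` are the estimator's own inspection methods.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption, not the Titanic set).
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           flip_y=0.2, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print('unrestricted: depth={0}, leaves={1}'.format(
    full_tree.get_depth(), full_tree.get_n_leaves()))
print('max_depth=3:  depth={0}, leaves={1}'.format(
    pruned_tree.get_depth(), pruned_tree.get_n_leaves()))
```

A binary tree of depth 3 can have at most 2**3 = 8 leaves, so the capped tree is far smaller than the unrestricted one.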

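But how do we choose the depth?
We could try the values one by one and see which depth works best, but hold on!
Programmers generally follow the DRY principle (Don't Repeat Yourself), so we are not going to try them by hand one at a time. One function can handle the whole sweep.

So we write a function that fits the model at each candidate depth and then automatically picks out the depth with the highest score. Raw scores are hard to read, though, so we usually also plot them to make the best choice easy to see.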

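Pre-pruning can also be done with min_impurity_decrease (the parameter used in the code below; the older min_impurity_split was deprecated and has been removed as of scikit-learn 1.0). A node is split only if the split reduces impurity by at least this threshold, which indirectly limits the number of leaves, much as limiting the depth does.
For details, see the following blog post, which covers some pre-pruning operations:
sklearn决策树 (sklearn decision trees)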
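The code in section 2 passes `min_impurity_decrease`. The scikit-learn documentation defines the quantity compared against that threshold as a weighted impurity decrease: `N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)`. The sketch below computes it by hand for Gini impurity; the node contents and the helper names `gini` and `weighted_impurity_decrease` are made up for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the node's share of all samples."""
    n_t = len(parent)
    return (n_t / n_total) * (
        gini(parent)
        - len(left) / n_t * gini(left)
        - len(right) / n_t * gini(right))


# Hypothetical node holding 10 samples, split into two children of 5.
parent = [1] * 5 + [0] * 5
left = [1, 1, 1, 1, 0]
right = [1, 0, 0, 0, 0]

threshold = 0.005
# The same split at the root of a 10-sample dataset ...
at_root = weighted_impurity_decrease(parent, left, right, n_total=10)
# ... and deep inside a tree trained on 1000 samples.
deep_node = weighted_impurity_decrease(parent, left, right, n_total=1000)

print(at_root, at_root >= threshold)
print(deep_node, deep_node >= threshold)
```

Because of the `N_t / N` factor, the same split counts for less in a small node of a large dataset, which is why this threshold mostly prunes splits near the bottom of the tree.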

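1.4 Trying other decision tree models

We can also try C4.5, the successor to ID3, or the CART algorithm, which is quite similar to ID3, and compare them to find the best result.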
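These algorithms differ mainly in how they score a split: ID3 uses information gain, C4.5 normalizes it into a gain ratio, and CART typically uses Gini impurity. Note that, according to its documentation, scikit-learn implements an optimized version of CART and exposes the choice as `criterion='gini'` or `criterion='entropy'`. The sketch below computes the three measures by hand on a made-up split; the function names and labels are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity, the criterion usually associated with CART."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, children):
    """ID3-style score: parent entropy minus weighted child entropy."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)


def gain_ratio(parent, children):
    """C4.5-style score: information gain divided by the split information."""
    n = len(parent)
    split_info = -sum(len(c) / n * math.log2(len(c) / n) for c in children)
    return information_gain(parent, children) / split_info


# Hypothetical node: 4 survivors and 4 non-survivors, split into two children.
parent = ['yes'] * 4 + ['no'] * 4
children = [['yes', 'yes', 'yes', 'no'], ['yes', 'no', 'no', 'no']]

gain = information_gain(parent, children)
ratio = gain_ratio(parent, children)
print('entropy={0:.3f} gini={1:.3f} gain={2:.3f} gain ratio={3:.3f}'.format(
    entropy(parent), gini(parent), gain, ratio))
```

With two equally sized children the split information is exactly 1 bit, so the gain ratio equals the information gain here; the two diverge when a split produces many small branches.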

2. Implementation (the code comes from the book and is not explained in detail)

2.1 Data processing

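%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def read_dataset(fname):
    # Use the first column of the CSV file as the row index
    data = pd.read_csv(fname, index_col=0)
    # Drop columns that are not used for prediction
    data.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
    # Encode sex: male -> 1, female -> 0
    data['Sex'] = (data['Sex'] == 'male').astype('int')
    # Encode each port of embarkation as its index in the list of unique values
    labels = data['Embarked'].unique().tolist()
    data['Embarked'] = data['Embarked'].apply(lambda n: labels.index(n))
    # Fill any remaining missing values with 0
    data = data.fillna(0)
    return data


train = read_dataset('datasets/titanic/train.csv')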

Splitting the dataset
from sklearn.model_selection import train_test_split

y = train['Survived'].values
X = train.drop(['Survived'], axis=1).values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print('train dataset: {0}; test dataset: {1}'.format(
    X_train.shape, X_test.shape))

2.2 Training the model

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)
print('train score: {0}; test score: {1}'.format(train_score, test_score))

Pre-pruning

from sklearn.tree import export_graphviz

# Export the unpruned tree in Graphviz DOT format
with open("titanic.dot", 'w') as f:
    export_graphviz(clf, out_file=f)


def cv_score(d):
    # Fit a tree limited to depth d; return (training score, validation score)
    clf = DecisionTreeClassifier(max_depth=d)
    clf.fit(X_train, y_train)
    tr_score = clf.score(X_train, y_train)
    val_score = clf.score(X_test, y_test)
    return (tr_score, val_score)

depths = range(2, 15)
scores = [cv_score(d) for d in depths]
tr_scores = [s[0] for s in scores]
cv_scores = [s[1] for s in scores]

best_score_index = np.argmax(cv_scores)
best_score = cv_scores[best_score_index]
best_param = depths[best_score_index]
print('best param: {0}; best score: {1}'.format(best_param, best_score))

plt.figure(figsize=(10, 6), dpi=144)
plt.grid()
plt.xlabel('max depth of decision tree')
plt.ylabel('score')
plt.plot(depths, cv_scores, '.g-', label='cross-validation score')
plt.plot(depths, tr_scores, '.r--', label='training score')
plt.legend()

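def cv_score(val):
    # Fit a tree that only splits when impurity drops by at least val;
    # return (training score, validation score)
    clf = DecisionTreeClassifier(criterion='gini', min_impurity_decrease=val)
    clf.fit(X_train, y_train)
    tr_score = clf.score(X_train, y_train)
    val_score = clf.score(X_test, y_test)
    return (tr_score, val_score)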

values = np.linspace(0, 0.005, 50)
scores = [cv_score(v) for v in values]
tr_scores = [s[0] for s in scores]
cv_scores = [s[1] for s in scores]

best_score_index = np.argmax(cv_scores)
best_score = cv_scores[best_score_index]
best_param = values[best_score_index]
print('best param: {0}; best score: {1}'.format(best_param, best_score))

plt.figure(figsize=(10, 6), dpi=144)
plt.grid()
plt.xlabel('min_impurity_decrease threshold (gini)')
plt.ylabel('score')
plt.plot(values, cv_scores, '.g-', label='cross-validation score')
plt.plot(values, tr_scores, '.r--', label='training score')
plt.legend()


def plot_curve(train_sizes, cv_results, xlabel):
    train_scores_mean = cv_results['mean_train_score']
    train_scores_std = cv_results['std_train_score']
    test_scores_mean = cv_results['mean_test_score']
    test_scores_std = cv_results['std_test_score']
    plt.figure(figsize=(10, 6), dpi=144)
    plt.title('parameters tuning')
    plt.grid()
    plt.xlabel(xlabel)
    plt.ylabel('score')
    plt.fill_between(train_sizes,
                     train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std,
                     alpha=0.1, color="r")
    plt.fill_between(train_sizes,
                     test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std,
                     alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, '.--', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, '.-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")


from sklearn.model_selection import GridSearchCV

thresholds = np.linspace(0, 0.005, 50)

param_grid = {'min_impurity_decrease': thresholds}

clf = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5, return_train_score=True)
clf.fit(X, y)
print("best param: {0}\nbest score: {1}".format(clf.best_params_,
                                                clf.best_score_))

plot_curve(thresholds, clf.cv_results_, xlabel='gini thresholds')

entropy_thresholds = np.linspace(0, 0.01, 50)
gini_thresholds = np.linspace(0, 0.005, 50)

param_grid = [{'criterion': ['entropy'],
               'min_impurity_decrease': entropy_thresholds},
              {'criterion': ['gini'],
               'min_impurity_decrease': gini_thresholds},
              {'max_depth': range(2, 10)},
              {'min_samples_split': range(2, 30, 2)}]

clf = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5, return_train_score=True)
clf.fit(X, y)
print("best param: {0}\nbest score: {1}".format(clf.best_params_,
                                                clf.best_score_))
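After a search like the one above, the winning model does not have to be rebuilt by hand: with the default `refit=True`, `GridSearchCV` refits the best parameter combination on the full data and exposes it as `best_estimator_`. A minimal sketch, assuming synthetic data and a much smaller grid than the one above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption, not the Titanic set).
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           flip_y=0.1, random_state=0)

# A list of dicts is searched as separate grids: 2 * 5 + 4 = 14 candidates.
param_grid = [{'criterion': ['entropy', 'gini'],
               'min_impurity_decrease': np.linspace(0, 0.01, 5)},
              {'max_depth': range(2, 6)}]

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

# Already refit on all of X, y with the best parameters.
best_tree = search.best_estimator_
print('best param: {0}'.format(search.best_params_))
print('leaves in best tree: {0}'.format(best_tree.get_n_leaves()))
```

Using `best_estimator_` avoids copying a long float such as a threshold value into the source by hand.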

Generating the decision tree

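# These parameters appear to be the best ones reported by the grid search above
# in the author's run; your own search may select different values
clf = DecisionTreeClassifier(criterion='entropy', min_impurity_decrease=0.002857142857142857)
clf.fit(X_train, y_train)
train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)
print('train score: {0}; test score: {1}'.format(train_score, test_score))

# Export the pruned tree in Graphviz DOT format
with open("titanic.dot", 'w') as f:
    export_graphviz(clf, out_file=f)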
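The exported titanic.dot file can be rendered with the Graphviz command-line tool, for example `dot -Tpng titanic.dot -o titanic.png`, if Graphviz is installed. When it is not, `sklearn.tree.export_text` prints the same tree as plain text. A minimal sketch on synthetic data; the feature names `f0` to `f3` are placeholders, not Titanic columns.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with placeholder feature names (assumptions for illustration).
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
feature_names = ['f0', 'f1', 'f2', 'f3']

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# One line per node: split conditions on inner nodes, predicted class on leaves.
report = export_text(clf, feature_names=feature_names)
print(report)
```

For the Titanic model, passing the DataFrame's column names (without `Survived`) as `feature_names` would make the rules readable.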


Original: https://blog.csdn.net/weixin_42198265/article/details/121417100
Author: Bessie_Lee
Title: "scikit-learn Machine Learning" Decision Trees ③ - Titanic Survivor Prediction [Approach + Code]
