Python Data Analysis Case 08: Predicting Titanic Passenger Survival (A Full Machine Learning Workflow)

1. Background Analysis

Predict whether each passenger survived.

Survival is coded as 1 and death as 0. The response variable takes only two values, so this is a binary classification problem.

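2. Data Collection and Loading

Download the Titanic data from Kaggle.

Kaggle is a well-known international data science competition platform. It hosts many competitions and datasets, and it is worth registering an account to look around. The Titanic project page is:

Titanic – Machine Learning from Disaster | Kaggle

Downloading the data requires logging in, and registering an account may require a VPN. If setting one up is inconvenient but you still need the Titanic dataset, you can ask the author for it in the comments.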

Import the commonly used data analysis packages:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Read the data, loading the training set and the test set separately:

data=pd.read_csv('train.csv')
data2=pd.read_csv('test.csv')

Show the first five rows of the training set:

data.head(5)

Show the first five rows of the test set:

data2.head(5)

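The first column is the passenger ID. The response variable is Survived, which records whether the passenger survived. The test set does not have this column because that is what we need to predict.

The remaining columns are the features describing each passenger: ticket class (Pclass), name, sex, age, number of siblings and spouses aboard (SibSp), number of parents and children aboard (Parch), fare, port of embarkation, and so on.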

3. Data Cleaning and Preparation

Inspect the data:

data.info()
data2.info()

Plot the missing values:

import missingno as msno
%matplotlib inline
msno.matrix(data)

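In this plot, black marks positions where data is present and white marks missing values. The Cabin column has many missing values, and Age also has a fair number.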
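
The same information can be read off as numbers. The sketch below counts missing values per column on a small made-up frame (the `toy` DataFrame and its values are assumptions for illustration, not the Kaggle data); the same two calls should work on `data` directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic training data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# Count and share of missing values per column, largest first.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "share": toy.isna().mean(),
}).sort_values("n_missing", ascending=False)
print(missing)
```

A table like this makes it easier to decide which columns to drop and which to fill.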

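Data cleaning

Feature selection

Next, drop the variables we do not need, such as the passenger ID, name, and ticket number, along with the Cabin variable, which is mostly missing.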

# Separate the response variable from the features
y = data['Survived']
data.drop(columns=['Survived', 'PassengerId', 'Name', 'Ticket', 'Cabin'], inplace=True)

# Keep the test-set IDs for the submission file, then drop the same columns
ID = data2['PassengerId']
data2.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'], inplace=True)

Filling missing values

# Assign back rather than calling fillna(inplace=True) on a column selection;
# ffill() replaces the older fillna(method='pad')
data['Age'] = data['Age'].fillna(data['Age'].median())
data['Embarked'] = data['Embarked'].ffill()

data2['Age'] = data2['Age'].fillna(data2['Age'].median())
data2['Fare'] = data2['Fare'].ffill()

Converting categorical data

d1={'male':0,'female':1}
d2={'S':1,'C':2,'Q':3}

data['Sex']=data['Sex'].map(d1)
data['Embarked']=data['Embarked'].map(d2)

data2['Sex']=data2['Sex'].map(d1)
data2['Embarked']=data2['Embarked'].map(d2)
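
Mapping Embarked to 1, 2, 3 implies an order between ports that does not really exist. One possible alternative, sketched below on a made-up `toy` frame (an assumption for illustration), is one-hot encoding with `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical toy column standing in for the Embarked feature.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per port instead of a single ordered code.
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)
encoded = pd.concat([toy.drop(columns="Embarked"), dummies], axis=1)
print(encoded)
```

If the training and test sets are encoded separately, their columns need to be aligned afterwards, since a category missing from one set would otherwise produce a different set of columns.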

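4. Feature Engineering

After the cleanup, check the data again.

Assign the data to X as the feature matrix (the response variable is already stored in y).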

X=data.copy()
test=data2.copy()
X.info()
test.info()

There are no missing values left.

Plot the distribution of each variable in the training set and the test set:

dist_cols = 4
# Enough rows to hold every column at four plots per row
dist_rows = int(np.ceil(len(data2.columns) / dist_cols))
plt.figure(figsize=(4 * dist_cols, 4 * dist_rows))
for i, col in enumerate(data2.columns, start=1):
    ax = plt.subplot(dist_rows, dist_cols, i)
    # fill=True replaces the deprecated shade=True
    sns.kdeplot(data[col], color="red", fill=True, label="train", ax=ax)
    sns.kdeplot(data2[col], color="blue", fill=True, label="test", ax=ax)
    ax.set_xlabel(col)
    ax.set_ylabel("Density")
    ax.legend()
plt.show()

View the correlation matrix:

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(data.corr(method='spearman'), annot=True, square=True, ax=ax)

5. Modeling and Optimization

Now for the machine learning!

Split the data into a training set and a validation set:

from sklearn.model_selection import train_test_split
X_train,X_val,y_train,y_val=train_test_split(X,y,test_size=0.2,stratify=y,random_state=0)

This is an 80/20 split: 80% of the data is used for training and 20% for validation.
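
The `stratify=y` argument keeps the share of survivors the same in both parts. The sketch below illustrates this on made-up labels (`X_toy` and `y_toy` are assumptions for illustration, not the Titanic data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 zeros and 40 ones.
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# With stratification, both parts keep the 40% share of positives.
print(len(y_tr), y_tr.mean())
print(len(y_va), y_va.mean())
```

Without `stratify`, a random split can leave the validation set with a noticeably different class ratio, which makes the validation accuracy harder to interpret.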

Standardize the data:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
test_s=scaler.transform(test)
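
The scaler is fitted on the training split only and then applied to the validation and test data. As a sketch of one way to keep that discipline automatic, the scaler and the classifier can be wrapped in a scikit-learn pipeline. The synthetic data from `make_classification` is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the seven Titanic features.
X_toy, y_toy = make_classification(n_samples=200, n_features=7, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# fit() scales using training statistics only; score() reuses them.
pipe = make_pipeline(StandardScaler(),
                     AdaBoostClassifier(n_estimators=50, random_state=77))
pipe.fit(X_tr, y_tr)
acc = pipe.score(X_va, y_va)
print(acc)
```

A pipeline like this can also be passed directly to cross-validation utilities, so the scaling is refitted inside every fold.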

Build an AdaBoost (adaptive boosting) model:

# AdaBoost (adaptive boosting)
from sklearn.ensemble import AdaBoostClassifier
model0 = AdaBoostClassifier(n_estimators=100,random_state=77)
model0.fit(X_train_s, y_train)
model0.score(X_val_s, y_val)

The model's accuracy on the validation set is 0.7988, which is reasonable.
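
An accuracy figure is easier to judge against a baseline that always predicts the majority class. The sketch below uses made-up labels in which about 38% of passengers survive (an assumed ratio chosen to resemble the training data; the arrays themselves are hypothetical).

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels with roughly 38% survivors.
y_tr = np.array([0] * 62 + [1] * 38)
y_va = np.array([0] * 31 + [1] * 19)

# The dummy model ignores the features, so placeholders are enough.
X_tr = np.zeros((len(y_tr), 1))
X_va = np.zeros((len(y_va), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_tr, y_tr)
baseline_acc = baseline.score(X_va, y_va)
print(baseline_acc)
```

A real model is only useful to the extent that it clearly beats this kind of baseline.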

6. Applying the Model: Prediction

Storing the predictions

Prepare a table to hold the predictions:

df = pd.DataFrame(columns=['PassengerId','Survived'])
df['PassengerId']=ID
df.head()

pred = model0.predict(test_s)
df['Survived']=pred
df.to_csv('predict_result__AdaBoost.csv',index=False)

7. Model Selection and Tuning

Model selection

from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
# Logistic regression
model1 = LogisticRegression(C=1e10)

# Linear discriminant analysis
model2 = LinearDiscriminantAnalysis()

# K-nearest neighbors
model3 = KNeighborsClassifier(n_neighbors=10)

# Decision tree
model4 = DecisionTreeClassifier(random_state=77)

# Random forest
model5 = RandomForestClassifier(n_estimators=1000, max_features='sqrt', random_state=10)

# Gradient boosting
model6 = GradientBoostingClassifier(random_state=123)

# XGBoost (extreme gradient boosting)
model7 = XGBClassifier(eval_metric=['logloss','auc','error'], n_estimators=1000,
                       colsample_bytree=0.8, learning_rate=0.1, random_state=77)

# Support vector machine
model8 = SVC(kernel="rbf", random_state=77)

# Neural network (multilayer perceptron)
model9 = MLPClassifier(hidden_layer_sizes=(16,8), random_state=77, max_iter=10000)

model_list = [model1, model2, model3, model4, model5, model6, model7, model8, model9]
model_name = ['Logistic Regression', 'Linear Discriminant Analysis', 'K-Nearest Neighbors',
              'Decision Tree', 'Random Forest', 'Gradient Boosting', 'XGBoost',
              'Support Vector Machine', 'Neural Network']

for name, model_C in zip(model_name, model_list):
    model_C.fit(X_train_s, y_train)
    s = model_C.score(X_val_s, y_val)
    print(name + ' accuracy on the validation set: ' + str(s))
    # Save each model's test-set predictions to its own CSV file
    pred = model_C.predict(test_s)
    df['Survived'] = pred
    csv_name = name + '_predictions.csv'
    df.to_csv(csv_name, index=False)
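
The loop above ranks models on a single validation split, so small differences may be noise. As a sketch of a more stable comparison, the example below scores two of the candidates with 5-fold cross-validation on synthetic data (the data, the two-model `candidates` dict, and the fold count are assumptions for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features and labels.
X_toy, y_toy = make_classification(n_samples=300, n_features=7, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=77),
}

cv_results = {}
for name, clf in candidates.items():
    # Scaling sits inside the pipeline so it is refitted in every fold.
    pipe = make_pipeline(StandardScaler(), clf)
    scores = cross_val_score(pipe, X_toy, y_toy, cv=cv, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the standard deviation alongside the mean shows whether one model's lead is larger than the fold-to-fold variation.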

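Model tuning

Further improvement comes from tuning hyperparameters. Here we use K-fold cross-validation to search for the best hyperparameters.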


from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.model_selection import GridSearchCV

We run a grid search over two hyperparameters, max_depth and learning_rate. Many other parameters could be searched as well, so it is worth experimenting.

# Choose the best hyperparameters with GridSearchCV
param_grid = {'max_depth': range(1, 10), 'learning_rate': np.linspace(0.1, 0.5, 5)}
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

model = GridSearchCV(estimator=GradientBoostingClassifier(n_estimators=300, random_state=123),
                     param_grid=param_grid, cv=kfold)
model.fit(X_train_s, y_train)
model.best_params_

The best parameters are a learning_rate of 0.1 and a max_depth of 3.
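
Grid search is exhaustive, so its cost grows with the product of the grid sizes. The sketch below counts the work implied by the grid and the 10-fold split used above; the variable names `n_candidates`, `n_splits`, and `n_fits` are introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as in the search: 9 depths x 5 learning rates.
param_grid = {
    "max_depth": list(range(1, 10)),
    "learning_rate": np.linspace(0.1, 0.5, 5),
}

n_candidates = len(ParameterGrid(param_grid))
n_splits = 10
n_fits = n_candidates * n_splits
print(n_candidates, n_fits)
```

With the default `refit=True`, GridSearchCV also trains the best combination once more on the full training data, which is why `best_estimator_` is ready to use afterwards.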

The model's accuracy is now:

model = model.best_estimator_
model.score(X_val_s, y_val)

The accuracy improves further to 0.84357. Then save the results:

pred = model.predict(test_s)
df['Survived']=pred
df.to_csv('tuned_gradient_boosting_predictions.csv', index=False)

That's it! You can now submit the saved files to Kaggle and see where your model's accuracy ranks.

8. Model Evaluation

Plot the features ranked by importance:

sorted_index = model.feature_importances_.argsort()
plt.barh(range(X.shape[1]), model.feature_importances_[sorted_index])
plt.yticks(np.arange(X.shape[1]), X.columns[sorted_index])
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Gradient Boosting')

Build the confusion matrix:

# Prediction performance on the validation set
# prob = model.predict_proba(X_val_s)
pred = model.predict(X_val_s)
table = pd.crosstab(y_val, pred, rownames=['Actual'], colnames=['Predicted'])
table
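
The four cells of a confusion matrix translate directly into precision and recall. The sketch below works this through on made-up labels (`y_true` and `y_pred` are hypothetical, not the model's actual output).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical validation labels and predictions.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# Rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)
print(tn, fp, fn, tp, precision, recall, f1)
```

On the real data, the same functions could be called with `y_val` and the validation predictions.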

ROC curve and AUC

Compute the AUC and plot the ROC curve:

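# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(model, X_val_s, y_val)
x = np.linspace(0, 1, 100)
plt.plot(x, x, 'k--', linewidth=1)  # diagonal = random guessing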

An AUC of 1 is the best possible value. Our model reaches 0.85, which is quite good.
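
The AUC can also be computed as a single number without plotting. It equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below uses four made-up scores (`y_true` and `scores` are hypothetical); on the real data the scores would typically come from `model.predict_proba(X_val_s)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities of survival.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Three of the four positive/negative pairs are ranked correctly.
auc = roc_auc_score(y_true, scores)
print(auc)
```

Because it depends only on the ranking of the scores, the AUC is unaffected by the choice of classification threshold.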

Original: https://blog.csdn.net/weixin_46277779/article/details/127057289
Author: 阡之尘埃
Title: Python Data Analysis Case 08: Predicting Titanic Passenger Survival (A Full Machine Learning Workflow)
