[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:bee77319-2dbd-4e5f-be52-6c9a3ffad370
[En]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:6e1e7f51-7758-449d-80b5-25cc9e0d369a
1、背景分析
预测乘客是否存活下来
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:caabe031-71ab-45f1-b751-4134843502df
[En]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:6a818147-50dd-47ab-b7a3-31eb97e23bd9
存活是1,死亡是0,响应变量为两种取值,所以这是一个分类问题。
2、数据收集和读取
从kaggle上下载泰坦尼克号的数据 ¶
kaggle是国际很有名的数据科学竞赛平台,上面有很多比赛,也有很多数据集,大家可以注册一个账号去看看。下面是泰坦尼克号的项目链接。
Titanic – Machine Learning from Disaster | Kaggle
当然下载数据要登陆账号,而注册账号需要翻墙……如果不方便弄个vpn,但需要泰坦尼克号数据集,可以评论找博主要。
导入数据分析常用包
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
读取数据,分训练集和测试集读取
data=pd.read_csv('train.csv')
data2=pd.read_csv('test.csv')
展示训练集前五行数据
data.head(5)
展示测试集前五行数据
data2.head(5)
可以看到第一列是乘客的编号。我们的响应变量是Survived,即乘客是否存活,但是测试集是没有这一列的,因为需要我们预测。
后面都是特征变量,乘客的特征:船舱等级、姓名,性别,年龄,子女个数,亲戚个数,船费、登船地点等等¶
3、数据清洗和整理
查看数据信息 ¶
data.info()
data2.info()
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:6a8584d2-af90-47b6-b52e-7ac55aaedd8e
[En]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:a59459c6-52cd-40cf-90c2-4f03618cb208
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:d34df833-2ba4-4cd2-b0c1-f5aeae3fc6de
[En]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:da08009a-1bf4-46f4-b03c-8faf0ddd88dd
画图看缺失值
import missingno as msno
%matplotlib inline
msno.matrix(data)
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:ab3a9ab8-2b53-48fd-bb84-339927d13707
[En]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:dc70f70f-bc5c-4c91-9231-874da732a6ac
这个图就是黑色的位置表示有数据,白色表示数据的缺失,可以看到cabin这一列缺失值很多,age年龄缺失值也较多。
数据清洗
特征选择
下面删除一些不需要的变量,比如乘客编号、名字、船票编号等等,还有缺失值很多的Cabin变量
y=data['Survived']
data.drop ('Survived',axis=1, inplace=True)
data.drop ('PassengerId',axis=1, inplace=True)
data.drop ('Name',axis=1, inplace=True)
data.drop ('Ticket',axis=1, inplace=True)
data.drop ('Cabin',axis=1, inplace=True)
ID=data2['PassengerId']
data2.drop ('PassengerId',axis=1, inplace=True)
data2.drop ('Name',axis=1, inplace=True)
data2.drop ('Ticket',axis=1, inplace=True)
data2.drop ('Cabin',axis=1, inplace=True)
缺失值填充
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:835b98b5-1e37-4c51-b122-64f4bc2d321b
[En]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:241cd8b0-d6c0-4422-bee6-23bfdc3f278e
data['Age'].fillna(data.Age.median(), inplace=True)
data['Embarked'].fillna(method='pad',axis=0,inplace=True)
data2['Age'].fillna(data2.Age.median(), inplace=True)
data2['Fare'].fillna(method='pad',axis=0,inplace=True)
分类型数据转化
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:e3540a73-46dc-43b5-9e35-fdc154505b1e
[En]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:ae4e680e-b44a-4801-8ed2-486900ac3188
d1={'male':0,'female':1}
d2={'S':1,'C':2,'Q':3}
data['Sex']=data['Sex'].map(d1)
data['Embarked']=data['Embarked'].map(d2)
data2['Sex']=data2['Sex'].map(d1)
data2['Embarked']=data2['Embarked'].map(d2)
4、特征工程
整理好后查看数据信息
将数据赋值给X表示为响应变量
X=data.copy()
test=data2.copy()
X.info()
test.info()
可以看到没有缺失值了。
画图查看训练集和测试集的变量分布情况
dist_cols = 4
dist_rows = len(data2.columns)
plt.figure(figsize=(4 * dist_cols, 4 * dist_rows))
i = 1
for col in data2.columns:
ax = plt.subplot(dist_rows, dist_cols, i)
ax = sns.kdeplot(data[col], color="Red", shade=True)
ax = sns.kdeplot(data2[col], color="Blue", shade=True)
ax.set_xlabel(col)
ax.set_ylabel("Frequency")
ax = ax.legend(["train", "test"])
i += 1
plt.show()
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:9c7de735-f9e8-44d2-8db3-e38075063754
[En]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:7ff443bb-d4e1-4516-85ec-204f88c45489
查看相关系数矩阵
corr = plt.subplots(figsize = (8,6))
corr= sns.heatmap(data.corr(method='spearman'),annot=True,square=True)
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:801b595f-3b59-425c-8b80-47fa1fa2f415
[En]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:8e164ab9-7faa-40b4-bacb-0df5236daae6
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:30228160-9694-488e-8677-6ec03c26a5f7
[En]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:6e3582da-8ccb-4d61-bd6c-067cdaf88242
5、建模与优化
开始机器学习!
划分训练集和验证集
from sklearn.model_selection import train_test_split
X_train,X_val,y_train,y_val=train_test_split(X,y,test_size=0.2,stratify=y,random_state=0)
这里是二八开,80%数据训练,20%数据验证。
数据标准化
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
test_s=scaler.transform(test)
构建自适应提升模型
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:a1ea7839-73c6-4c85-a79d-fe7a8da643d4
[En]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:60d44f27-9d62-4dbe-b40a-1a45c3921663
#自适应提升
from sklearn.ensemble import AdaBoostClassifier
model0 = AdaBoostClassifier(n_estimators=100,random_state=77)
model0.fit(X_train_s, y_train)
model0.score(X_val_s, y_val)
模型在验证集上的精度为0.7988,还可以。
6、模型运用——预测
存储预测结果
准备存储预测结果表格
df = pd.DataFrame(columns=['PassengerId','Survived'])
df['PassengerId']=ID
df.head()
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:a3de475a-cb05-49c2-bc59-f4c9b533987f
[En]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:be46f923-c3ea-42d2-97d1-a036087907cb
pred = model0.predict(test_s)
df['Survived']=pred
df.to_csv('predict_result__AdaBoost.csv',index=False)
七、模型选择和优化
模型选择
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:f4656e95-d394-4699-a0f4-e740b3bb646a
[En]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:128936c4-426e-47e9-ada1-f6d126ded633
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
#逻辑回归
model1 = LogisticRegression(C=1e10)
#线性判别分析
model2 = LinearDiscriminantAnalysis()
#K近邻
model3 = KNeighborsClassifier(n_neighbors=10)
#决策树
model4 = DecisionTreeClassifier(random_state=77)
#随机森林
model5= RandomForestClassifier(n_estimators=1000, max_features='sqrt',random_state=10)
#梯度提升
model6 = GradientBoostingClassifier(random_state=123)
#极端梯度提升
model7 = XGBClassifier(eval_metric=['logloss','auc','error'],n_estimators=1000,
colsample_bytree=0.8,learning_rate=0.1,random_state=77)
#支持向量机
model8 = SVC(kernel="rbf", random_state=77)
#神经网络
model9 = MLPClassifier(hidden_layer_sizes=(16,8), random_state=77, max_iter=10000)
model_list=[model1,model2,model3,model4,model5,model6,model7,model8,model9]
model_name=['逻辑回归','线性判别','K近邻','决策树','随机森林','梯度提升','极端梯度提升','支持向量机','神经网络']
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:593c07dd-f26e-4dd0-948d-e3ba8ffd2c73
[En]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:bd4e6c29-be95-45ce-9af2-f7966501971a
for i in range(9):
model_C=model_list[i]
name=model_name[i]
model_C.fit(X_train_s, y_train)
s=model_C.score(X_val_s, y_val)
print(name+'方法在验证集的准确率为:'+str(s))
pred = model_C.predict(test_s)
df['Survived']=pred
csv_name=name+'的预测结果.csv'
df.to_csv(csv_name,index=False)
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:09dd1553-fd08-4237-8f1d-5eef9528bd9e
[En]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:34fc4e7e-296c-47ce-9612-007242c11f08
模型优化
模型再继续优化就是调整超参数,这里利用K折交叉验证搜索最优超参数
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.model_selection import GridSearchCV
我们对max_depth’和 ‘learning_rate这两个超参数进行网格化搜索。当然也有很多别的参数也可以搜索,可以多试试。
Choose best hyperparameters by RandomizedSearchCV
param_distributions = {'max_depth': range(1, 10), 'learning_rate': np.linspace(0.1,0.5,5 )}
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
model = GridSearchCV(estimator=GradientBoostingClassifier(n_estimators=300,random_state=123),
param_grid=param_distributions, cv=kfold)
model.fit(X_train_s, y_train)
model.best_params_
可以看到最优参数是0.1和3
此时模型的精度为
model = model.best_estimator_
model.score(X_val_s, y_val)
模型精度进一步提升到0.84357,然后储存结果
pred = model.predict(test_s)
df['Survived']=pred
df.to_csv('调参后的梯度提升预测结果.csv',index=False)
这样就做完啦,可以把储存出来的文件提交到kaggle上看自己的模型精度的排名了!
八、模型评价
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:246d827c-7c89-4cc7-88e6-6240c907723c
[En]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:6e117a43-58e9-4338-8b5e-cb136224e5c5
得到特征变量的重要性排序图
sorted_index = model.feature_importances_.argsort()
plt.barh(range(X.shape[1]), model.feature_importances_[sorted_index])
plt.yticks(np.arange(X.shape[1]), X.columns[sorted_index])
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Gradient Boosting')
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:8e557e10-341b-4c07-b90a-854eaacb168b
[En]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:e5e12953-ad82-4c63-85eb-cfe1708eef17
画出混淆矩阵
Prediction Performance
#prob = model.predict_proba(X_test)
pred = model.predict(X_val_s)
table = pd.crosstab(y_val, pred, rownames=['Actual'], colnames=['Predicted'])
table
AUC图
计算AUC指标,并画图
from sklearn.metrics import plot_roc_curve
plot_roc_curve(model, X_val_s, y_val)
x = np.linspace(0, 1, 100)
plt.plot(x, x, 'k--', linewidth=1)
AUC为1最好,我们模型为0.85,也很不错
Original: https://blog.csdn.net/weixin_46277779/article/details/127057289
Author: 阡之尘埃
Title: Python数据分析案例08——预测泰坦尼克号乘员的生存(机器学习全流程)
原创文章受到原创版权保护。转载请注明出处:https://www.johngo689.com/561356/
转载文章受原作者版权保护。转载请注明原作者出处!