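I did this as a course project at school, but only half understood the workflow. Now that I have finished the book 机器学习实战, I am redoing it with my own understanding.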
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
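Inspecting the data shows that Age and Cabin have missing values, and that Name, Sex, Ticket, Cabin and Embarked are object-typed, so they will need to be converted during later processing.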
data_train = pd.read_csv(r'C:/Users/train.csv')
data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
 0   PassengerId  418 non-null    int64
 1   Pclass       418 non-null    int64
 2   Name         418 non-null    object
 3   Sex          418 non-null    object
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64
 6   Parch        418 non-null    int64
 7   Ticket       418 non-null    object
 8   Fare         418 non-null    float64
 9   Cabin        91 non-null     object
 10  Embarked     418 non-null    object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
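The same two observations (which columns have gaps, and which are not numeric) can also be read off programmatically instead of by eye. A minimal sketch, using a small made-up frame in place of train.csv; the rows below are illustrative, not the real data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame; the rows are made up for illustration.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.25, 71.2833, 7.925, 8.05],
})

# Number of missing values per column, keeping only columns that have any.
missing = df.isnull().sum()
missing = missing[missing > 0]

# Names of the non-numeric columns, which will need encoding later.
non_numeric_cols = df.select_dtypes(exclude="number").columns.tolist()

print(missing)
print(non_numeric_cols)
```

On the real data, `df` would simply be `data_train`.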
Set the passenger ID as the index.
test_process = test_process.set_index(['PassengerId'])
test_process
The test set now looks like this:
PassengerId  Pclass  Name                                          Sex     Age  SibSp  Parch  Ticket              Fare      Embarked  Called  Name_length  First_name
892          3       Kelly, Mr. James                              male    34   0      0      330911              7.8292    Q         Mr      16           Kelly
893          3       Wilkes, Mrs. James (Ellen Needs)              female  47   1      0      363272              7.0000    S         Mr      32           Wilkes
894          2       Myles, Mr. Thomas Francis                     male    62   0      0      240276              9.6875    Q         Mr      25           Myles
895          3       Wirz, Mr. Albert                              male    27   0      0      315154              8.6625    S         Mr      16           Wirz
896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22   1      1      3101298             12.2875   S         Mr      44           Hirvonen
...
1305         3       Spector, Mr. Woolf                            male    25   0      0      A.5. 3236           8.0500    S         Mr      18           Spector
1306         1       Oliva y Ocana, Dona. Fermina                  female  39   0      0      PC 17758            108.9000  C         NaN     28           Oliva y Ocana
1307         3       Saether, Mr. Simon Sivertsen                  male    38   0      0      SOTON/O.Q. 3101262  7.2500    S         Mr      28           Saether
1308         3       Ware, Mr. Frederick                           male    25   0      0      359309              8.0500    S         Mr      19           Ware
1309         3       Peter, Master. Michael J                      male    22   1      1      2668                22.3583   C         NaN     24           Peter
418 rows × 12 columns
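The code that produced the Called, Name_length and First_name columns is not shown here. As a rough sketch of how such columns might be derived from Name, assuming a regular expression for the title and a split on the comma (both are my assumptions, not necessarily the original method):

```python
import pandas as pd

names = pd.Series([
    "Kelly, Mr. James",
    "Wilkes, Mrs. James (Ellen Needs)",
    "Oliva y Ocana, Dona. Fermina",
], name="Name")

features = pd.DataFrame({
    # Title: the text between the comma and the first period.
    "Called": names.str.extract(r",\s*([^.]+)\.", expand=False),
    # Length of the full name string, spaces and punctuation included.
    "Name_length": names.str.len(),
    # Everything before the comma.
    "First_name": names.str.split(",").str[0],
})
print(features)
```

Note that this sketch yields "Mrs" and "Dona" as titles, whereas the table shows "Mr" and NaN for the same passengers, so the original extraction evidently worked differently for that column.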
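Handling missing values
The missing values in this dataset are presumably missing completely at random, that is, they do not depend on the other, fully observed variables, so both deletion and imputation are options. Cabin has too many missing values, so I drop that feature outright; if that feels risky, compute some correlations or plot the data first.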
train_process = data_train.drop(['Cabin'],axis=1)
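The check suggested above, done before dropping Cabin, could look roughly like this: flag the rows where Cabin is missing and compare survival rates between the two groups. The frame below is a made-up stand-in for data_train, and the column name Cabin_missing is introduced only for this illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for data_train; the rows are made up for illustration.
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1],
    "Cabin": ["C85", np.nan, "E46", np.nan, np.nan, np.nan],
})

# Flag rows where Cabin is missing, then compare survival rates per group.
toy["Cabin_missing"] = toy["Cabin"].isnull()
rates = toy.groupby("Cabin_missing")["Survived"].mean()
print(rates)
```

A large gap between the two groups would suggest that the missingness itself carries information, in which case keeping a missing/not-missing indicator may be worth more than dropping the column.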
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
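# Columns used for the imputation; Age (the target) comes first.
# .copy() avoids pandas' SettingWithCopyWarning on the assignment below.
Age_df = train_process[['Age','Survived','Pclass','SibSp','Parch','Fare']].copy()
UnknowAge = Age_df[Age_df.Age.isnull()].values
KnowAge = Age_df[Age_df.Age.notnull()].values
# Column 0 is the target (Age); the remaining columns are the predictors.
y_train = KnowAge[:,0]
x_train = KnowAge[:,1:]
rfr = RandomForestRegressor(n_estimators=500,random_state=42)
rfr.fit(x_train,y_train)
# Predict the missing ages and write them back into the Age column.
predictedAges = rfr.predict(UnknowAge[:,1:])
Age_df.loc[Age_df.Age.isnull(), 'Age'] = predictedAges
train_process.Age = Age_df.Age.astype(int)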
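Missing ages are imputed with a random forest: a regression model is fit on the rows where Age is known and then used to predict the missing values.
The test set also needs the Cabin column dropped and its missing ages imputed.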
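One thing to watch for: the Kaggle test set has no Survived column, so a regressor that was fit with Survived as a feature cannot be applied to it unchanged. A minimal sketch of a reusable helper that takes the feature list as an argument is shown below. The name `impute_age` and the toy rows are hypothetical, and the helper assumes the feature columns themselves contain no missing values:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_age(df, features, random_state=42):
    """Return a copy of df with missing Age filled by a random forest."""
    out = df.copy()
    known = out["Age"].notnull()
    if known.all():
        return out
    # Fit on rows where Age is known, predict the rows where it is missing.
    rfr = RandomForestRegressor(n_estimators=100, random_state=random_state)
    rfr.fit(out.loc[known, features], out.loc[known, "Age"])
    out.loc[~known, "Age"] = rfr.predict(out.loc[~known, features])
    return out

# Toy stand-in for the test set (made-up rows).
toy_test = pd.DataFrame({
    "Age": [34.0, np.nan, 62.0, 27.0, np.nan, 22.0],
    "Pclass": [3, 3, 2, 3, 1, 3],
    "SibSp": [0, 1, 0, 0, 0, 1],
    "Parch": [0, 0, 0, 0, 0, 1],
    "Fare": [7.8292, 7.0, 9.6875, 8.6625, 108.9, 12.2875],
})
filled = impute_age(toy_test, ["Pclass", "SibSp", "Parch", "Fare"])
print(filled["Age"])
```

Whether to fit the regressor on the training rows, the test rows, or both is a design choice; this sketch simply fits on whatever frame it is given.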
test_process = data_test.drop(['Cabin'],axis=1)
test_process.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype
 0   Pclass       891 non-null    int64
 1   Sex          891 non-null    float64
 2   Age          891 non-null    int32
 3   SibSp        891 non-null    int64
 4   Parch        891 non-null    int64
 5   Ticket       891 non-null    float64
 6   Fare         891 non-null    float64
 7   Embarked     891 non-null    float64
 8   Called       891 non-null    float64
 9   Name_length  891 non-null    float64
 10  First_name   891 non-null    float64
dtypes: float64(7), int32(1), int64(3)
memory usage: 73.2 KB
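The listing above contains only numeric dtypes, but the encoding step that turns the text columns into numbers is not shown. One way to get there, assumed here purely for illustration, is scikit-learn's OrdinalEncoder, which returns floats; scaling or a different encoder would produce float columns just as well. The toy rows are made up:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with one numeric and two text columns (made-up rows).
toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Embarked": ["Q", "S", "C"],
})

cat_cols = ["Sex", "Embarked"]
encoder = OrdinalEncoder()

# Encode the text columns (categories are numbered in sorted order),
# then put them back next to the untouched numeric columns.
encoded_cats = pd.DataFrame(
    encoder.fit_transform(toy[cat_cols]), columns=cat_cols, index=toy.index
)
encoded = toy.drop(columns=cat_cols).join(encoded_cats)
print(encoded)
```

For the test set, the already fitted encoder would be reused with `encoder.transform` so that both sets share the same category-to-number mapping.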
Voting
Let's look at a voting ensemble first.
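from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
lr_clf = LogisticRegression(penalty='l1',solver='saga',n_jobs=-1,max_iter=20000)
rnd_clf = RandomForestClassifier(n_estimators=300,max_depth=8,min_samples_leaf=1,min_samples_split=5,random_state=42)
svm_clf = SVC(C=2,kernel='poly',random_state=42,probability=True)
# Soft voting averages the predicted class probabilities of the three models.
voting_clf = VotingClassifier(estimators=[('lr',lr_clf),('rf',rnd_clf),('scv',svm_clf)],voting='soft')
voting_clf.fit(X_train_encoded,y_train)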
VotingClassifier(estimators=[('lr',
                              LogisticRegression(max_iter=20000, n_jobs=-1,
                                                 penalty='l1', solver='saga')),
                             ('rf',
                              RandomForestClassifier(max_depth=8,
                                                     min_samples_split=5,
                                                     n_estimators=300,
                                                     random_state=42)),
                             ('scv',
                              SVC(C=2, kernel='poly', probability=True,
                                  random_state=42))],
                 voting='soft')
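# Note: gender_submission.csv is Kaggle's sample submission (it assumes that
# all and only female passengers survive), not the true test labels, so the
# scores below measure agreement with that baseline, not real test accuracy.
y_test = pd.read_csv(r'C:/Users/gender_submission.csv')
y_test = y_test['Survived']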
from sklearn.metrics import accuracy_score
# Fit each model in turn and score its predictions on the test features.
for clf in (lr_clf,rnd_clf,svm_clf,voting_clf):
    clf.fit(X_train_encoded,y_train)
    y_pred = clf.predict(X_test_encoded)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
LogisticRegression 0.6961722488038278
RandomForestClassifier 0.80622009569378
SVC 0.6363636363636364
VotingClassifier 0.8110047846889952
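To make it concrete what soft voting does, the sketch below averages the class probabilities of the fitted base models by hand and checks that the result matches the ensemble's own prediction. It runs on synthetic data from `make_classification`, not the Titanic features, and uses default hyperparameters rather than the ones above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic binary-classification data standing in for the encoded features.
X, y = make_classification(n_samples=200, n_features=6, random_state=42)

voting = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),
    ],
    voting="soft",
)
voting.fit(X, y)

# Average the class probabilities of the fitted base models by hand...
avg_proba = np.mean([est.predict_proba(X) for est in voting.estimators_], axis=0)
manual_pred = voting.classes_[np.argmax(avg_proba, axis=1)]

# ...and compare with what the ensemble itself predicts.
print(np.array_equal(manual_pred, voting.predict(X)))
```

This is also why every base model needs `predict_proba`, which for SVC means setting `probability=True`.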
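Next I tried XGBoost, which seemed to work fairly well; note, though, that the model below is a regressor scored by MSE, so its result cannot be compared directly with the accuracies above.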
XGBoost
import xgboost
from sklearn.metrics import mean_squared_error
xgb_reg = xgboost.XGBRFRegressor(random_state=42)
xgb_reg.fit(X_train_encoded,y_train)
y_pred = xgb_reg.predict(X_test_encoded)
val_error=mean_squared_error(y_test,y_pred)
print("Validation MSE:", val_error)
Validation MSE: 0.5023153196818051
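Because this score is a mean squared error on continuous outputs, it is on a different scale from the accuracy figures reported for the voting models. A small sketch of the arithmetic, with made-up labels and regressor outputs, shows how the same predictions give one number as MSE and another as accuracy once they are thresholded at 0.5 (the threshold is an assumption of this example):

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Made-up labels and regressor outputs, for illustration only.
y_true = np.array([0, 1, 1, 0, 1])
y_score = np.array([0.2, 0.7, 0.4, 0.1, 0.9])

# MSE scores the raw continuous outputs.
mse = mean_squared_error(y_true, y_score)

# Accuracy needs class labels, so threshold the outputs at 0.5 first.
y_label = (y_score >= 0.5).astype(int)
acc = accuracy_score(y_true, y_label)

print("MSE:", mse)
print("Accuracy:", acc)
```

Applied to the XGBoost run, `y_score` would be the `y_pred` returned by `xgb_reg.predict`; alternatively a classifier such as `xgboost.XGBClassifier` could be scored with `accuracy_score` directly.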
Original: https://blog.csdn.net/weixin_43925467/article/details/124055489
Author: aka.炼金术士
Title: Data Analysis: Titanic Prediction