Data Analysis: Titanic Survival Prediction

I did this as a course project back in school, but I only half understood the workflow. Now that I have finished the book 《机器学习实战》, I am redoing the project with my own understanding.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Looking at the data in detail, you can see that Age and Cabin have missing values, and that Name, Sex, Ticket, Cabin and Embarked are of object type; both issues need to be handled in the later preprocessing steps.

data_train = pd.read_csv(r'C:/Users/train.csv')
data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Set the index to the passenger ID.

test_process = test_process.set_index(['PassengerId'])
test_process

The test set now looks like this:

             Pclass  Name                                          Sex     Age  SibSp  Parch  Ticket              Fare      Embarked  Called  Name_length  First_name
PassengerId
892          3       Kelly, Mr. James                              male    34   0      0      330911              7.8292    Q         Mr      16           Kelly
893          3       Wilkes, Mrs. James (Ellen Needs)              female  47   1      0      363272              7.0000    S         Mr      32           Wilkes
894          2       Myles, Mr. Thomas Francis                     male    62   0      0      240276              9.6875    Q         Mr      25           Myles
895          3       Wirz, Mr. Albert                              male    27   0      0      315154              8.6625    S         Mr      16           Wirz
896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22   1      1      3101298             12.2875   S         Mr      44           Hirvonen
...          ...     ...                                           ...     ...  ...    ...    ...                 ...       ...       ...     ...          ...
1305         3       Spector, Mr. Woolf                            male    25   0      0      A.5. 3236           8.0500    S         Mr      18           Spector
1306         1       Oliva y Ocana, Dona. Fermina                  female  39   0      0      PC 17758            108.9000  C         NaN     28           Oliva y Ocana
1307         3       Saether, Mr. Simon Sivertsen                  male    38   0      0      SOTON/O.Q. 3101262  7.2500    S         Mr      28           Saether
1308         3       Ware, Mr. Frederick                           male    25   0      0      359309              8.0500    S         Mr      19           Ware
1309         3       Peter, Master. Michael J                      male    22   1      1      2668                22.3583   C         NaN     24           Peter

418 rows × 12 columns

Handling missing values

The missing values here should be missing completely at random, not dependent on the other observed variables, so both deletion and imputation are reasonable options. Cabin has too many missing values, so this feature is dropped outright; if that feels risky, you can first compute some correlations or plot the data to check.
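For example, a quick sanity check before dropping the column might look like this (a minimal sketch, assuming data_train from above; it only reports the missing ratio and compares survival rates for passengers with and without a recorded cabin):

# share of passengers with no cabin recorded
cabin_missing_ratio = data_train['Cabin'].isnull().mean()
print(f"Cabin missing ratio: {cabin_missing_ratio:.2%}")
# survival rate for passengers with vs. without a recorded cabin
print(data_train.groupby(data_train['Cabin'].notnull())['Survived'].mean())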


train_process = data_train.drop(['Cabin'],axis=1)

from sklearn.ensemble import RandomForestRegressor

# fit a random forest on the rows with a known age, then predict the missing ages
Age_df = train_process[['Age','Survived','Pclass','SibSp','Parch','Fare']].copy()
KnowAge = Age_df[Age_df.Age.notnull()].values
UnknowAge = Age_df[Age_df.Age.isnull()].values

y_train = KnowAge[:, 0]    # known ages (target)
x_train = KnowAge[:, 1:]   # remaining numeric features
rfr = RandomForestRegressor(n_estimators=500, random_state=42)
rfr.fit(x_train, y_train)

predictedAges = rfr.predict(UnknowAge[:, 1:])
Age_df.loc[Age_df.Age.isnull(), 'Age'] = predictedAges
train_process['Age'] = Age_df['Age'].astype(int)

The missing ages are filled in with a random forest: a regression model is fit on the rows with a known age and then used to predict the missing ones.

The test set also needs the Cabin column dropped and its missing ages filled in; after the same feature engineering (Called, Name_length, First_name) and encoding, it ends up fully numeric:


test_process = data_test.drop(['Cabin'],axis=1)
test_process.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
 0   Pclass       418 non-null    int64
 1   Sex          418 non-null    float64
 2   Age          418 non-null    int32
 3   SibSp        418 non-null    int64
 4   Parch        418 non-null    int64
 5   Ticket       418 non-null    float64
 6   Fare         418 non-null    float64
 7   Embarked     418 non-null    float64
 8   Called       418 non-null    float64
 9   Name_length  418 non-null    float64
 10  First_name   418 non-null    float64
dtypes: float64(7), int32(1), int64(3)
memory usage: 34.4 KB
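The test-set age imputation itself is not shown in this excerpt. A minimal sketch of one way to do it, reusing the same random-forest idea but without Survived (which the test set does not have) and with the single missing Fare in the test set filled by the median:

# Sketch only: refit the age model on features shared by both sets.
test_process['Fare'] = test_process['Fare'].fillna(test_process['Fare'].median())

shared_cols = ['Pclass', 'SibSp', 'Parch', 'Fare']
age_model = RandomForestRegressor(n_estimators=500, random_state=42)
age_model.fit(train_process[shared_cols], train_process['Age'])   # train ages are already filled above

missing_age = test_process['Age'].isnull()
test_process.loc[missing_age, 'Age'] = age_model.predict(test_process.loc[missing_age, shared_cols])
test_process['Age'] = test_process['Age'].astype(int)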

Voting ensemble

Let's start with the voting approach.
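This excerpt does not show how X_train_encoded and X_test_encoded are built; judging from the info() output above, the remaining object columns were converted to numbers. A minimal sketch of one possible encoding (an assumption, not necessarily how the original post did it):

# build numeric feature matrices from the processed train and test sets
feature_cols = [c for c in train_process.columns
                if c != 'Survived' and c in test_process.columns]
X_train_encoded = train_process[feature_cols].copy()
X_test_encoded = test_process[feature_cols].copy()
y_train = train_process['Survived']

# factorize any column that is still of object dtype in either set,
# using the combined train+test values so the codes stay consistent
for col in feature_cols:
    if X_train_encoded[col].dtype == object or X_test_encoded[col].dtype == object:
        combined = pd.concat([X_train_encoded[col], X_test_encoded[col]]).astype(str)
        codes, _ = pd.factorize(combined)
        X_train_encoded[col] = codes[:len(X_train_encoded)]
        X_test_encoded[col] = codes[len(X_train_encoded):]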

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC

lr_clf = LogisticRegression(penalty='l1', solver='saga', n_jobs=-1, max_iter=20000)
rnd_clf = RandomForestClassifier(n_estimators=300, max_depth=8, min_samples_leaf=1, min_samples_split=5, random_state=42)
svm_clf = SVC(C=2, kernel='poly', random_state=42, probability=True)
voting_clf = VotingClassifier(estimators=[('lr', lr_clf), ('rf', rnd_clf), ('svc', svm_clf)], voting='soft')
voting_clf.fit(X_train_encoded, y_train)
VotingClassifier(estimators=[('lr',
                              LogisticRegression(max_iter=20000, n_jobs=-1,
                                                 penalty='l1', solver='saga')),
                             ('rf',
                              RandomForestClassifier(max_depth=8,
                                                     min_samples_split=5,
                                                     n_estimators=300,
                                                     random_state=42)),
                             ('svc',
                              SVC(C=2, kernel='poly', probability=True,
                                  random_state=42))],
                 voting='soft')
# gender_submission.csv (the sample submission) is used here as the reference labels for scoring
y_test = pd.read_csv(r'C:/Users/gender_submission.csv')
y_test = y_test['Survived']

from sklearn.metrics import accuracy_score

for clf in (lr_clf,rnd_clf,svm_clf,voting_clf):
    clf.fit(X_train_encoded,y_train)
    y_pred = clf.predict(X_test_encoded)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.6961722488038278
RandomForestClassifier 0.80622009569378
SVC 0.6363636363636364
VotingClassifier 0.8110047846889952

Next, let's try XGBoost, which does turn out to give fairly good results.

XGBoost

import xgboost
from sklearn.metrics import mean_squared_error

# XGBRFRegressor is XGBoost's random-forest-style regressor; it outputs a
# continuous survival score, which is scored here with MSE against the reference labels
xgb_reg = xgboost.XGBRFRegressor(random_state=42)
xgb_reg.fit(X_train_encoded, y_train)
y_pred = xgb_reg.predict(X_test_encoded)
val_error = mean_squared_error(y_test, y_pred)
print("Validation MSE:", val_error)
Validation MSE: 0.5023153196818051
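Since Survived is a binary label, XGBoost's classifier scored with accuracy is the more direct comparison to the voting results above. A minimal sketch (not from the original post; the hyperparameters are illustrative):

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# illustrative hyperparameters, not tuned
xgb_clf = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1, random_state=42)
xgb_clf.fit(X_train_encoded, y_train)
print("XGBClassifier accuracy:", accuracy_score(y_test, xgb_clf.predict(X_test_encoded)))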

Original: https://blog.csdn.net/weixin_43925467/article/details/124055489
Author: aka.炼金术士
Title: 数据分析——泰坦尼克号预测

