Data Analysis: Titanic Prediction

July 16, 2023, 11:08 AM • Artificial Intelligence • 99 views

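I did this as a course project back in school, but I only half understood the workflow. Now that I have finished the book 机器学习实战, I am redoing it from scratch with my own understanding.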

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

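Inspecting the data shows that Age and Cabin have missing values, and that Name, Sex, Ticket, Cabin and Embarked are of type object, so they will need to be converted during later processing.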

data_train = pd.read_csv(r'C:/Users/train.csv')
data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
 0   PassengerId  418 non-null    int64
 1   Pclass       418 non-null    int64
 2   Name         418 non-null    object
 3   Sex          418 non-null    object
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64
 6   Parch        418 non-null    int64
 7   Ticket       418 non-null    object
 8   Fare         418 non-null    float64
 9   Cabin        91 non-null     object
 10  Embarked     418 non-null    object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
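Besides reading the Non-Null Count column of info(), the missing values can be counted directly. The snippet below is a minimal sketch on a made-up four-row frame (the frame `df` and its values are assumptions for illustration, not the real train.csv):

```python
import pandas as pd

# Hypothetical miniature frame standing in for train.csv
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Sex": ["male", "female", "female", "male"],
})

# Number of missing values per column, largest first
missing = df.isnull().sum().sort_values(ascending=False)
print(missing)
```

On the real data the same call would be `data_train.isnull().sum()`.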

Set the index to the passenger ID.

test_process = test_process.set_index(['PassengerId'])
test_process

The test set now looks like this:

             Pclass  Name                                          Sex     Age  SibSp  Parch  Ticket              Fare      Embarked  Called  Name_length  First_name
PassengerId
892          3       Kelly, Mr. James                              male    34   0      0      330911              7.8292    Q         Mr      16           Kelly
893          3       Wilkes, Mrs. James (Ellen Needs)              female  47   1      0      363272              7.0000    S         Mr      32           Wilkes
894          2       Myles, Mr. Thomas Francis                     male    62   0      0      240276              9.6875    Q         Mr      25           Myles
895          3       Wirz, Mr. Albert                              male    27   0      0      315154              8.6625    S         Mr      16           Wirz
896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22   1      1      3101298             12.2875   S         Mr      44           Hirvonen
...
1305         3       Spector, Mr. Woolf                            male    25   0      0      A.5. 3236           8.0500    S         Mr      18           Spector
1306         1       Oliva y Ocana, Dona. Fermina                  female  39   0      0      PC 17758            108.9000  C         NaN     28           Oliva y Ocana
1307         3       Saether, Mr. Simon Sivertsen                  male    38   0      0      SOTON/O.Q. 3101262  7.2500    S         Mr      28           Saether
1308         3       Ware, Mr. Frederick                           male    25   0      0      359309              8.0500    S         Mr      19           Ware
1309         3       Peter, Master. Michael J                      male    22   1      1      2668                22.3583   C         NaN     24           Peter

418 rows × 12 columns
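The code that derives Called, Name_length and First_name from Name is not shown in this post. The snippet below is only a sketch of one possible way to do it with a regular expression; the function name `name_features` is an assumption. Note that it returns the full title (Mrs, Master, Dona), whereas the table above shows Mr for "Mrs." names and NaN for "Dona." and "Master.", and that the First_name column in the table actually holds the surname.

```python
import re

def name_features(name):
    """Split 'Last, Title. First ...' into (title, name_length, last_name)."""
    last_name = name.split(",")[0].strip()
    # The title is the text between the comma and the first period
    match = re.search(r",\s*([^.,]+)\.", name)
    title = match.group(1).strip() if match else None
    return title, len(name), last_name

print(name_features("Kelly, Mr. James"))
print(name_features("Wilkes, Mrs. James (Ellen Needs)"))
print(name_features("Peter, Master. Michael J"))
```

With pandas this could be applied column-wise, for example through `test_process['Name'].apply(name_features)`.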

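Handling Missing Values

The missing values in this dataset are assumed to be missing completely at random, i.e. they do not depend on the other fully observed variables, so both deletion and imputation are valid options. Cabin has too many missing values, so this feature is dropped outright. If that feels too hasty, compute a few correlations or draw some plots first to check.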
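As a rough sketch of the "compute a correlation first" check mentioned above, one can turn Cabin into a present/missing indicator and compare it with Survived. The six-row frame and the column name `HasCabin` are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical miniature sample; the real check would use data_train
df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1],
    "Cabin": ["C85", None, "E46", None, None, None],
})

# Indicator: 1 if a cabin was recorded, 0 if it is missing
df["HasCabin"] = df["Cabin"].notnull().astype(int)

# Survival rate in each group; a large gap hints that missingness is informative
rate_by_cabin = df.groupby("HasCabin")["Survived"].mean()
print(rate_by_cabin)

# Pearson correlation between the indicator and the label
corr = df["HasCabin"].corr(df["Survived"])
print(corr)
```

If the two groups differ a lot on the real data, keeping the indicator as a feature may be worth considering instead of discarding Cabin entirely.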


train_process = data_train.drop(['Cabin'],axis=1)


from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
# Numeric columns used to predict Age; .copy() avoids SettingWithCopyWarning
Age_df = train_process[['Age','Survived','Pclass','SibSp','Parch','Fare']].copy()
UnknowAge = Age_df[Age_df.Age.isnull()].values
KnowAge = Age_df[Age_df.Age.notnull()].values

# Column 0 is the target (Age); the remaining columns are the features
y_train = KnowAge[:,0]
x_train = KnowAge[:,1:]
rfr = RandomForestRegressor(n_estimators=500,random_state=42)
rfr.fit(x_train,y_train)
# Predict the missing ages and write them back
predictedAges = rfr.predict(UnknowAge[:,1:])
Age_df.loc[Age_df.Age.isnull(), 'Age'] = predictedAges
train_process.Age = Age_df.Age.astype(int)

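Missing ages are imputed with a random forest: a regression model is fitted on the rows where Age is known and then used to predict the missing values.

The test set also needs the Cabin column dropped and its missing ages imputed.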


test_process = data_test.drop(['Cabin'],axis=1)
test_process.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype
 0   Pclass       891 non-null    int64
 1   Sex          891 non-null    float64
 2   Age          891 non-null    int32
 3   SibSp        891 non-null    int64
 4   Parch        891 non-null    int64
 5   Ticket       891 non-null    float64
 6   Fare         891 non-null    float64
 7   Embarked     891 non-null    float64
 8   Called       891 non-null    float64
 9   Name_length  891 non-null    float64
 10  First_name   891 non-null    float64
dtypes: float64(7), int32(1), int64(3)
memory usage: 73.2 KB
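The age imputation for the test set is not shown in this post. One thing to watch out for is that the test set has no Survived column, so a regressor trained with Survived as a feature cannot be applied to it directly. The snippet below is a hedged sketch of one way around this: fit on features that exist in both sets. The helper `impute_age`, the toy frames and the feature list are all assumptions, not the author's code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_age(train_df, test_df, features):
    """Fit on rows of train_df with a known Age, fill missing Age in test_df."""
    known = train_df[train_df["Age"].notnull()]
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(known[features], known["Age"])

    filled = test_df.copy()
    missing = filled["Age"].isnull()
    if missing.any():
        filled.loc[missing, "Age"] = model.predict(filled.loc[missing, features])
    return filled

# Hypothetical toy frames standing in for train_process / test_process
train_toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "Pclass": [3, 1, 3, 1, 1, 3],
    "SibSp": [1, 1, 0, 1, 0, 3],
    "Parch": [0, 0, 0, 0, 0, 1],
    "Fare": [7.25, 71.28, 7.92, 53.1, 51.86, 21.07],
})
test_toy = pd.DataFrame({
    "Age": [34.0, np.nan, 62.0, np.nan],
    "Pclass": [3, 3, 2, 1],
    "SibSp": [0, 1, 0, 0],
    "Parch": [0, 0, 0, 0],
    "Fare": [7.83, 7.0, 9.69, 60.0],
})

# Only columns present in both sets (no Survived)
features = ["Pclass", "SibSp", "Parch", "Fare"]
test_filled = impute_age(train_toy, test_toy, features)
print(test_filled["Age"])
```

Any missing values in the feature columns themselves would need to be filled before calling the helper.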

Voting

Let's look at the voting ensemble first.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC

lr_clf = LogisticRegression(penalty='l1',solver='saga',n_jobs=-1,max_iter=20000)
rnd_clf = RandomForestClassifier(n_estimators=300,max_depth=8,min_samples_leaf=1,min_samples_split=5,random_state=42)
svm_clf = SVC(C=2,kernel='poly',random_state=42,probability=True)
voting_clf = VotingClassifier(estimators=[('lr',lr_clf),('rf',rnd_clf),('scv',svm_clf)],voting='soft')
voting_clf.fit(X_train_encoded,y_train)

  VotingClassifier(estimators=[('lr',
                                  LogisticRegression(max_iter=20000, n_jobs=-1,
                                                     penalty='l1', solver='saga')),
                                 ('rf',
                                  RandomForestClassifier(max_depth=8,
                                                         min_samples_split=5,
                                                         n_estimators=300,
                                                         random_state=42)),
                                 ('scv',
                                  SVC(C=2, kernel='poly', probability=True,
                                      random_state=42))],
                     voting='soft')
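With voting='soft', the ensemble averages the class probabilities predicted by each model and picks the class with the highest average, instead of counting one vote per model. The pure-Python sketch below illustrates the difference; the probability values and the helper names `soft_vote` and `hard_vote` are made up for illustration.

```python
def soft_vote(probas):
    """Average per-class probabilities across models and pick the argmax."""
    n_models = len(probas)
    n_classes = len(probas[0])
    avg = [sum(p[c] for p in probas) / n_models for c in range(n_classes)]
    label = max(range(n_classes), key=lambda c: avg[c])
    return label, avg

def hard_vote(probas):
    """Each model votes for its own most likely class; the majority wins."""
    votes = [max(range(len(p)), key=lambda c: p[c]) for p in probas]
    return max(set(votes), key=votes.count)

# Hypothetical [P(died), P(survived)] from three models for one passenger
probas = [
    [0.90, 0.10],  # one very confident model
    [0.45, 0.55],
    [0.40, 0.60],
]
soft_label, avg = soft_vote(probas)
hard_label = hard_vote(probas)
print(soft_label, avg)
print(hard_label)
```

Here two of the three models lean towards class 1, so hard voting picks 1, while the single confident model pulls the averaged probability towards class 0 under soft voting. This is also why SVC needs probability=True in a soft-voting ensemble.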

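# Note: gender_submission.csv is Kaggle's sample submission (it predicts that
# only female passengers survive), not the true test labels, so the scores
# below measure agreement with that baseline rather than real accuracy.
y_test = pd.read_csv(r'C:/Users/gender_submission.csv')

y_test = y_test['Survived']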

from sklearn.metrics import accuracy_score

for clf in (lr_clf,rnd_clf,svm_clf,voting_clf):
    clf.fit(X_train_encoded,y_train)
    y_pred = clf.predict(X_test_encoded)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.6961722488038278
RandomForestClassifier 0.80622009569378
SVC 0.6363636363636364
VotingClassifier 0.8110047846889952
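Kaggle does not publish the true labels of the test set, so a common way to estimate accuracy locally is k-fold cross-validation on the training set. The snippet below is a minimal sketch on synthetic data; `X_demo` and `y_demo` are stand-ins generated with make_classification, not the real X_train_encoded and y_train.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the encoded training data (assumption: 891 rows, 10 features)
X_demo, y_demo = make_classification(n_samples=891, n_features=10, random_state=42)

clf = RandomForestClassifier(n_estimators=50, max_depth=8, random_state=42)
# 5-fold cross-validated accuracy: each fold is held out once for scoring
scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean())
```

The same call should work with any of the classifiers above, including voting_clf, by passing the real training data instead.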

Next I tried XGBoost, and sure enough it works fairly well.

XGBoost

import xgboost
from sklearn.metrics import mean_squared_error
xgb_reg = xgboost.XGBRFRegressor(random_state=42)
xgb_reg.fit(X_train_encoded,y_train)
y_pred = xgb_reg.predict(X_test_encoded)
val_error=mean_squared_error(y_test,y_pred)
print("Validation MSE:", val_error)

Validation MSE: 0.5023153196818051
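The code above uses a regressor and reports MSE, which is not directly comparable to the accuracy scores of the classifiers. One way to put both on the same scale is to threshold the continuous predictions into 0/1 labels and then compute accuracy. The sketch below shows the idea in plain Python; the labels, scores and the 0.5 threshold are assumptions for illustration.

```python
def to_labels(scores, threshold=0.5):
    """Turn continuous predictions into 0/1 class labels."""
    return [1 if s >= threshold else 0 for s in scores]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error between labels and raw predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels and regressor outputs
y_true = [0, 1, 1, 0, 1]
y_score = [0.2, 0.7, 0.4, 0.1, 0.9]

print("MSE:", mse(y_true, y_score))
print("Accuracy:", accuracy(y_true, to_labels(y_score)))
```

With the real model this would amount to something like `accuracy_score(y_test, (y_pred >= 0.5).astype(int))`. Using a classifier such as xgboost.XGBClassifier would avoid the thresholding step altogether.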

Original: https://blog.csdn.net/weixin_43925467/article/details/124055489
Author: aka.炼金术士
Title: Data Analysis: Titanic Prediction

