Titanic Passenger Survival Prediction Model (Python / Jupyter Notebook / Data Mining / Data Analysis)

June 19, 2023, 3:38 PM • Artificial Intelligence • 97 views

Predicting who survived the Titanic in Python, using machine learning and data mining

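Preface

Using the Titanic dataset and its historical context, this exercise aims to understand and get familiar with the data and to practise initial data exploration, including handling missing values and analysing how the fields relate to each other. It also covers attribute transformation, feature generation, feature selection, and principal component analysis, in order to produce features suited to the modelling algorithms that predict whether a passenger survived.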

Experiment Procedure

Field Descriptions

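The meaning of each field is as follows.

PassengerId: passenger ID

Pclass: ticket class (1 = first class, 2 = second class, 3 = third class); to some extent it reflects the passenger's economic situation and social status

Name: passenger name

Sex: sex

Age: age

SibSp: number of siblings and spouses aboard

Parch: number of parents and children aboard

Ticket: ticket information (what the letters and digits encode has to be inferred through analysis)

Fare: ticket fare

Cabin: cabin number

Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Survived: whether the passenger survived (1 = survived, 0 = did not survive)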

Complete implementation in Jupyter Notebook:

Import all third-party libraries:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn import svm
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Font settings: only needed when plot labels contain Chinese characters
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

Data understanding and analysis:


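# Load the training data and look at the first rows
titanic = pd.read_csv(r'D:\Pycharm工程\MYPYTHON2\Titanic\train.csv')

pd.set_option('display.width', 130)
print(titanic.head(20))


# Count the missing values in each column
pd.isnull(titanic).sum()


# Summary statistics of the numeric columns
titanic.describe()


# Fill missing values: median for Age, 'NO' for Cabin, 'S' for Embarked
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
titanic['Cabin'] = titanic['Cabin'].fillna('NO')
titanic['Embarked'] = titanic['Embarked'].fillna('S')
pd.isnull(titanic).sum()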


# Distributions of survival, passenger class and port of embarkation
fig = plt.figure(figsize=(15, 10))

plt.subplot2grid((1, 3), (0, 0))
titanic.Survived.value_counts().plot(kind='bar')
plt.title("Survival (1 = survived)")
plt.ylabel("Number of passengers")
plt.xlabel("Survived")

plt.subplot2grid((1, 3), (0, 1))
titanic.Pclass.value_counts().plot(kind="bar")
plt.title("Passenger class distribution")
plt.ylabel("Number of passengers")
plt.xlabel("Pclass")

plt.subplot2grid((1, 3), (0, 2))
titanic.Embarked.value_counts().plot(kind='bar')
plt.title("Passengers boarding at each port")
plt.ylabel("Number of passengers")
plt.xlabel("Embarked")

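Analysis of results:

The distribution plots show that passengers who did not survive make up the larger share, that third class has the most passengers, and that S is the port where the most passengers boarded.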
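Bar heights are easier to compare once they are turned into shares. The following is a minimal sketch, assuming a small hand-made table (the rows are invented for illustration and are not taken from train.csv); on the real data the same calls would be applied to the titanic DataFrame.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT taken from train.csv.
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1],
    "Pclass":   [3, 3, 3, 1, 2],
    "Embarked": ["S", "S", "C", "S", "Q"],
})

# normalize=True returns shares that sum to 1 instead of raw counts.
survived_share = toy["Survived"].value_counts(normalize=True)
pclass_share = toy["Pclass"].value_counts(normalize=True)

print(survived_share)
print(pclass_share)
```

The printed shares answer "what fraction did not survive" directly, which the bar chart only shows visually.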


# Survivors and non-survivors per passenger class
Survived_0 = titanic.Pclass[titanic.Survived == 0].value_counts()
Survived_1 = titanic.Pclass[titanic.Survived == 1].value_counts()
df = pd.DataFrame({'Survived': Survived_1, 'Did not survive': Survived_0})
df.plot(kind='bar', stacked=True)
plt.title("Survival by passenger class")
plt.xlabel("Passenger class")
plt.ylabel("Number of passengers")

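Analysis of results:

Broadly speaking, first-class passengers have the largest number of survivors, and the share of survivors within that class is also the highest. Among third-class passengers, by contrast, those who did not survive are the large majority and only a small fraction lived. This suggests that social class strongly affected the chance of being rescued in the society of the time: the higher a passenger's class, the more privileges and opportunities they were given, while those at the bottom of society were treated unfairly.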
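The "share of survivors within a class" can be computed rather than read off the stacked bars. This is a small sketch using pd.crosstab with row-wise normalisation; the table below is made up for illustration and the variable names are my own.

```python
import pandas as pd

# Invented rows for illustration; not the real passenger list.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# normalize="index" divides each row by its row total,
# so every row holds the survival split within one class.
rate_by_class = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")
print(rate_by_class)
```

On the real data, `pd.crosstab(titanic["Pclass"], titanic["Survived"], normalize="index")` would give the per-class survival rate that the paragraph above describes.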


# Age of each passenger plotted against survival
fig = plt.figure()
fig.set_size_inches(12, 12)
plt.subplot2grid((2, 2), (0, 0))
plt.scatter(titanic.Survived, titanic.Age)
plt.ylabel('Age')
plt.grid(True, which='major', axis='y')  # horizontal grid lines only
plt.title('Passenger age vs. survival (1 = survived)')

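Analysis of results:
The plot shows no strong relationship between age and survival, although fewer of the oldest passengers were rescued. That is plausible: older passengers may have wanted to leave the chance to the younger generation, and they were also less mobile and could not keep up with the evacuation, so a smaller proportion of them survived.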
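A scatter plot of a 0/1 variable hides how many points overlap, so grouping ages into bands and comparing survival rates can make the pattern clearer. The sketch below assumes hypothetical cut points (12 and 60) and invented rows; neither comes from the original analysis.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "Age":      [4, 9, 25, 31, 40, 58, 63, 71],
    "Survived": [1, 1, 0, 1, 0, 1, 0, 0],
})

bins = [0, 12, 60, 120]                 # hypothetical cut points
labels = ["child", "adult", "senior"]   # hypothetical band names
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Mean of a 0/1 column is the survival rate within each band.
rate_by_age = toy.groupby("AgeGroup", observed=False)["Survived"].mean()
print(rate_by_age)
```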


# Survival counts split by sex (the x axis is Survived, the stacks are the sexes)
Survived_m = titanic.Survived[titanic.Sex == 'male'].value_counts()
Survived_f = titanic.Survived[titanic.Sex == 'female'].value_counts()
df = pd.DataFrame({'Male': Survived_m, 'Female': Survived_f})
df.plot(kind='bar', stacked=True)
plt.title("Passenger sex vs. survival")
plt.xlabel("Survived (1 = survived)")
plt.ylabel("Number of passengers")

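Analysis of results:
At a glance, women account for a large share of the survivors, and the proportion of women who survived is also higher than the proportion of men who survived. This fits the chivalry expected in British society at the time: men were to protect and respect women, so women were given priority during the evacuation. It is just as in the film, where Jack leaves the hope of survival to Rose at the end and tells her to live on. It is an admirable spirit!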


# Survivors and non-survivors per port of embarkation
Survived_0 = titanic.Embarked[titanic.Survived == 0].value_counts()
Survived_1 = titanic.Embarked[titanic.Survived == 1].value_counts()
df = pd.DataFrame({'Survived': Survived_1, 'Did not survive': Survived_0})
df.plot(kind='bar', stacked=True)
plt.title("Survival by port of embarkation")
plt.xlabel("Port of embarkation")
plt.ylabel("Number of passengers")

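Analysis of results:

Judging from the chart, which port a passenger boarded at has little influence on whether they survived; the proportions look roughly similar across ports.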
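Whether the proportions really are similar can be checked numerically instead of by eye. One possible check, sketched here on invented rows, is the per-port survival rate together with a chi-square test of independence from SciPy (SciPy is an extra dependency that the original notebook does not import).

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented rows for illustration; not the real passenger list.
toy = pd.DataFrame({
    "Embarked": ["S"] * 6 + ["C"] * 6 + ["Q"] * 4,
    "Survived": [0, 0, 0, 0, 1, 1,  1, 1, 1, 1, 0, 0,  0, 0, 1, 1],
})

# Survival rate per port.
rate_by_port = toy.groupby("Embarked")["Survived"].mean()

# Contingency table of counts, then the chi-square test of independence.
table = pd.crosstab(toy["Embarked"], toy["Survived"])
chi2, p_value, dof, expected = chi2_contingency(table)

print(rate_by_port)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

A small p-value would suggest that port and survival are not independent; with only a handful of toy rows the number itself means nothing, so run it on the full training table before drawing a conclusion.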


# Survival counts by number of siblings/spouses aboard
sibsp_Survived = titanic.SibSp[titanic.Survived == 1].value_counts()
sibsp_Nosurvived = titanic.SibSp[titanic.Survived == 0].value_counts()
df5 = pd.DataFrame({'Survived': sibsp_Survived, 'Did not survive': sibsp_Nosurvived})
df5.plot(kind='bar', alpha=0.7)
plt.title('Number of siblings/spouses aboard vs. survival')
plt.xlabel("Number of siblings/spouses aboard")
plt.ylabel("Number of passengers")
plt.show()

# Survival counts by number of parents/children aboard
parch_Survived = titanic.Parch[titanic.Survived == 1].value_counts()
parch_Nosurvived = titanic.Parch[titanic.Survived == 0].value_counts()
df4 = pd.DataFrame({'Survived': parch_Survived, 'Did not survive': parch_Nosurvived})
df4.plot(kind='bar', alpha=0.7)
plt.title('Number of parents/children aboard vs. survival')
plt.xlabel("Number of parents/children aboard")
plt.ylabel("Number of passengers")
plt.show()

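Analysis of results:
Passengers with no siblings or spouses and no parents or children aboard, that is, those travelling alone, have the largest number of survivors, and the number of survivors falls as the number of relatives aboard grows. Our group's interpretation is that a person with no relatives aboard had no one to worry about and could flee without hesitation; someone alone also moves more easily and can act more decisively, without having to look after family during the evacuation, so their chance of being rescued was somewhat higher. Note that these bars are raw counts, so they do not account for how many passengers fall into each group.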


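# Name length as a candidate feature
titanic['Name_long'] = titanic['Name'].str.len()
titanic = titanic.sort_values(by=['Survived'], ascending=False)

# Name length against passenger class: red = survived, blue = did not survive
survived = titanic[titanic.Survived == 1]
not_survived = titanic[titanic.Survived == 0]
plt.scatter(survived['Pclass'], survived['Name_long'], c='red')
plt.scatter(not_survived['Pclass'], not_survived['Name_long'], c='blue')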

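Analysis of results:

We suspected that the length of a name might be related to nobility, so we plotted name length against ticket class and survival. From the result, the three may be slightly related, or may not be related much at all.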
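When a scatter plot is inconclusive, a simple number can help decide whether a feature is worth keeping. As a rough sketch, the Pearson correlation between name length and the 0/1 survival label, plus the mean name length per outcome, could be computed like this (the names below are invented examples, not rows from the dataset).

```python
import pandas as pd

# Invented names for illustration only.
toy = pd.DataFrame({
    "Name": [
        "Smith, Mr. John",
        "Brown, Mrs. Anna Maria (Lee)",
        "Doe, Mr. Al",
        "Gray, Miss. Elizabeth Rose",
    ],
    "Pclass":   [3, 1, 3, 2],
    "Survived": [0, 1, 0, 1],
})

toy["Name_long"] = toy["Name"].str.len()

# Pearson correlation between name length and the survival label.
corr = toy["Name_long"].corr(toy["Survived"])

# Average name length for survivors vs. non-survivors.
mean_len = toy.groupby("Survived")["Name_long"].mean()

print(corr)
print(mean_len)
```

A correlation close to zero on the real data would support dropping the feature, which is what the later drop() call does.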

Attribute processing:


# Encode the categorical columns as integers
titanic['Sex'] = titanic['Sex'].map({'male': 0, 'female': 1})
titanic['Embarked'] = titanic['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

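# Split Age into quartile bands and inspect the survival rate of each band
titanic['AgeBand'] = pd.qcut(titanic.Age, 4)
print(titanic[['AgeBand', 'Survived']].groupby('AgeBand').mean())

def age_level(x):
    # Thresholds follow the AgeBand edges.
    # `<=` on the first edge is assumed (qcut bands are closed on the right).
    if x <= 22:
        return 0
    elif x < 28:
        return 1
    elif x < 35:
        return 2
    else:
        return 3

titanic['Age_type'] = titanic['Age'].apply(age_level)

# Same approach for Fare, using quintile bands
titanic['FareBand'] = pd.qcut(titanic.Fare, 5)
print(titanic[['FareBand', 'Survived']].groupby('FareBand').mean())

def fare_level(x):
    # Thresholds follow the FareBand edges; `<=` on the first edge is assumed.
    if x <= 7.854:
        return 0
    elif x < 10.5:
        return 1
    elif x < 21.679:
        return 2
    elif x < 39.688:
        return 3
    else:
        return 4

titanic['Fare_type'] = titanic['Fare'].apply(fare_level)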

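# Family size = siblings/spouses + parents/children + the passenger
titanic['FamilySize'] = titanic['SibSp'] + titanic['Parch'] + 1

def alone(x):
    # 1 if the passenger travels alone, 0 otherwise
    return 0 if x > 1 else 1
titanic['IsAlone'] = titanic['FamilySize'].apply(alone)
titanic.head(20)


# Drop the columns that are not used as model features
titanic_t = titanic.drop(['PassengerId', 'Name', 'Cabin', 'Age', 'SibSp', 'Parch', 'FamilySize', 'AgeBand', 'FareBand', 'Ticket', 'Name_long'], axis=1)
titanic_t.head(20)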
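The hand-written threshold functions above copy the band edges by eye. As an alternative sketch (not the author's method), pd.qcut can return integer band codes and the edges directly, and pd.cut can then reuse those edges on another table so that both share one encoding. The fare values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented fares for illustration only.
train_fare = pd.Series([7.25, 8.05, 13.0, 26.0, 35.5, 71.3, 7.9, 15.5, 52.0, 10.5])

# labels=False returns integer band codes; retbins=True also returns the edges.
codes, edges = pd.qcut(train_fare, q=5, labels=False, retbins=True)

# Open the outer edges so values outside the training range still get a band.
open_edges = np.concatenate(([-np.inf], edges[1:-1], [np.inf]))

# Reuse the training edges on new data.
test_fare = pd.Series([5.0, 14.0, 300.0])
test_codes = pd.cut(test_fare, bins=open_edges, labels=False)

print(codes.tolist())
print(test_codes.tolist())
```

Deriving the edges once from the training data avoids the situation where the training and test tables are cut at different quantiles.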

Building classification models, evaluating them, and using them for prediction

Splitting into training and test sets


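# Column 0 is Survived (the label); the remaining columns are the features
X = titanic_t.iloc[:, 1:]
Y = titanic_t.iloc[:, 0]
# Hold out 30% of the rows as the test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)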
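Because survivors are the minority class, a plain random split can leave the two parts with slightly different class balances. A hedged sketch of a stratified split follows; it uses toy data and a test_size of 0.25 (chosen only so the toy counts divide evenly), whereas the notebook above uses 0.3 without stratification.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 60% class 0 and 40% class 1.
toy_X = pd.DataFrame({"feature": range(20)})
toy_y = pd.Series([0] * 12 + [1] * 8)

# stratify=toy_y keeps the class ratio the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0, stratify=toy_y
)

print(y_tr.mean(), y_te.mean())
```

In the notebook this would amount to adding `stratify=Y` to the existing call.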

Building the classifiers and computing their scores


# Support vector machine
svc = svm.SVC(gamma='auto')
svc.fit(X_train, Y_train)
print(svc.score(X_test, Y_test))

0.7649253731343284


# Decision tree using the entropy criterion
dt = DecisionTreeClassifier(criterion='entropy')
dt.fit(X_train, Y_train)
print(dt.score(X_test, Y_test))

0.7947761194029851


# Gaussian naive Bayes
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
print(gaussian.score(X_test, Y_test))

0.7723880597014925

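The split ratio and the random seed both influence the scores, so results vary from run to run; compare the results of several settings and choose the parameters that perform best.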
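One way to choose parameters that depends less on a single lucky split is cross-validated grid search. The sketch below is an illustration on synthetic data from make_classification, not the author's procedure; the parameter grid is a hypothetical choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 7 Titanic feature columns.
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

# Hypothetical grid of KNN settings to compare.
param_grid = {"n_neighbors": [3, 5, 7, 9], "weights": ["uniform", "distance"]}

# Each setting is scored by 5-fold cross-validation on accuracy.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

With the notebook's data, `search.fit(X_train, Y_train)` would play the same role, and the held-out test set would be used only once at the end.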


# k-nearest neighbours: 5 neighbours, distance-weighted votes, Minkowski distance with p=3
mo = KNN(n_neighbors=5, weights="distance", p=3)
mo.fit(X_train, Y_train)
print(mo.score(X_test, Y_test))

0.7686567164179104

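Soft voting and hard voting deserve a mention here: with hard voting each classifier casts one vote for its predicted class and the majority wins, while with soft voting the predicted class probabilities are averaged and the class with the highest average wins.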


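# Soft voting needs class probabilities, so SVC is created with probability=True
svc_vote = svm.SVC(gamma='auto', probability=True)
GB = GaussianNB()
dtc = tree.DecisionTreeClassifier()
# A separate name keeps the fitted KNN model `mo` available for prediction later
knn_vote = KNN(n_neighbors=5, weights="distance", p=3)
clf_vc = VotingClassifier(estimators=[('svm', svc_vote), ('GNB', GB), ('dtc', dtc), ('knn', knn_vote)], voting='soft')
clf_vc.fit(X_train, Y_train)
print(clf_vc.score(X_test, Y_test))

0.8134328358208955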
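To make the difference between the two voting modes concrete, here is a tiny NumPy sketch with made-up probabilities for a single passenger. It only illustrates the arithmetic; in scikit-learn the choice is made through the `voting='hard'` or `voting='soft'` argument of VotingClassifier.

```python
import numpy as np

# Hypothetical P(survived) from three classifiers for ONE passenger.
proba_survived = np.array([0.45, 0.45, 0.90])

# Hard voting: each model casts one vote for its own predicted label.
votes = (proba_survived >= 0.5).astype(int)
hard_label = int(np.bincount(votes, minlength=2).argmax())

# Soft voting: average the probabilities first, then threshold once.
soft_label = int(proba_survived.mean() >= 0.5)

print(hard_label, soft_label)
```

Two lukewarm "no" votes outnumber one confident "yes" under hard voting, while the averaged probability of 0.6 tips soft voting the other way.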

Prediction

# Load the test data and fill its missing values (Fare is filled here as well)
titanic_test = pd.read_csv(r'D:\Pycharm工程\MYPYTHON2\Titanic\test.csv')
pd.set_option('display.width', 130)

titanic_test['Age'] = titanic_test['Age'].fillna(titanic_test['Age'].median())
titanic_test['Fare'] = titanic_test['Fare'].fillna(titanic_test['Fare'].median())
titanic_test['Cabin'] = titanic_test['Cabin'].fillna('NO')
titanic_test['Embarked'] = titanic_test['Embarked'].fillna('S')
pd.isnull(titanic_test).sum()

# Apply the same encoding and feature construction as for the training data
titanic_test['Sex'] = titanic_test['Sex'].map({'male': 0, 'female': 1})
titanic_test['Embarked'] = titanic_test['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

# The bands are shown for reference only; the thresholds below are the training-set ones
titanic_test['AgeBand'] = pd.qcut(titanic_test.Age, 4)
print(titanic_test['AgeBand'].value_counts())

def age_level(x):
    # `<=` on the first edge is assumed (qcut bands are closed on the right)
    if x <= 22:
        return 0
    elif x < 28:
        return 1
    elif x < 35:
        return 2
    else:
        return 3

titanic_test['Age_type'] = titanic_test['Age'].apply(age_level)

titanic_test['FareBand'] = pd.qcut(titanic_test.Fare, 5)
print(titanic_test['FareBand'].value_counts())

def fare_level(x):
    # `<=` on the first edge is assumed
    if x <= 7.854:
        return 0
    elif x < 10.5:
        return 1
    elif x < 21.679:
        return 2
    elif x < 39.688:
        return 3
    else:
        return 4

titanic_test['Fare_type'] = titanic_test['Fare'].apply(fare_level)

titanic_test['FamilySize'] = titanic_test['SibSp'] + titanic_test['Parch'] + 1

def alone(x):
    # 1 if the passenger travels alone, 0 otherwise
    return 0 if x > 1 else 1
titanic_test['IsAlone'] = titanic_test['FamilySize'].apply(alone)
titanic_test.head(10)

# Drop the columns that are not used as model features
titanic_test_t = titanic_test.drop(['PassengerId', 'Name', 'Cabin', 'Age', 'SibSp', 'Parch', 'FamilySize', 'AgeBand', 'FareBand', 'Ticket'], axis=1)
titanic_test_t.head(20)


# Predict on the test set with each fitted model
# (titanic_test_t holds the same 7 feature columns, in the same order, as the training features)
result1 = svc.predict(titanic_test_t.iloc[:, 0:7])
print(result1)


result2 = dt.predict(titanic_test_t.iloc[:, 0:7])
print(result2)


result3 = gaussian.predict(titanic_test_t.iloc[:, 0:7])
print(result3)


result4 = mo.predict(titanic_test_t.iloc[:, 0:7])
print(result4)


result5 = clf_vc.predict(titanic_test_t.iloc[:, 0:7])
print(result5)
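Printed arrays are hard to reuse, so it may be convenient to store the predictions next to the passenger IDs. The sketch below assumes a two-column layout (PassengerId, Survived) and uses placeholder values; in the notebook the IDs would come from `titanic_test['PassengerId']` and the labels from one of the result arrays, for example `result5`.

```python
import pandas as pd

# Placeholder values for illustration; not real predictions.
passenger_ids = pd.Series([892, 893, 894])
predictions = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})

# Without a path, to_csv returns the CSV text; pass a file name to write a file instead.
csv_text = submission.to_csv(index=False)
print(csv_text)
```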

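Summary

That is roughly the whole experiment. You can follow the description above step by step in Jupyter Notebook and it should work without much trouble. Pay attention to where the data files are stored, so that the program can read them when it runs. Also note that Jupyter Notebook automatically displays only the value of the last expression in a cell, so wrap other results in print() when they do not show up.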
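For the file-location issue, a quick check before calling pd.read_csv can save some debugging. This is a small sketch with a hypothetical helper name (describe_path) and a hypothetical relative folder; adjust the path to wherever train.csv and test.csv actually live.

```python
from pathlib import Path

def describe_path(path):
    """Report whether a data file exists and where the path resolves to."""
    path = Path(path)
    status = "found" if path.is_file() else "missing"
    return f"{path.name}: {status} (resolved to {path.resolve()})"

# Hypothetical location relative to the notebook's working directory.
print(describe_path("Titanic/train.csv"))
```

If the file is reported as missing, the resolved path shows which directory the notebook is actually running from.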

If you spot details that could be done better, please point them out in the comments!

Original: https://blog.csdn.net/Embrace_yxl_/article/details/122123724
Author: Embrace_yxl_
Title: Titanic Passenger Survival Prediction Model (Python / Jupyter Notebook / Data Mining / Data Analysis)

