【Kaggle】Titanic – Machine Learning from Disaster

1. Problem description

The workflow is to download the dataset and copy it into the project directory, then clean the data: handle missing values, drop fields that carry no useful signal, and so on. The first step is to understand what each field means; only then can you decide which fields to discard and handle them in code. For example, the data format is:
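As a rough sketch of that first inspection step, the snippet below builds a tiny made-up sample with the same column names as the Kaggle `train.csv` (the rows are invented for illustration, not real passengers) and counts the missing values per column. With the real data you would presumably call `pd.read_csv` on the downloaded file instead of building the frame by hand.

```python
import numpy as np
import pandas as pd

# Invented rows that only mimic the column layout of the Titanic train.csv.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 0],
    "Ticket": ["T1", "T2", "T3", "T4"],
    "Fare": [7.25, 71.28, 7.92, 13.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# Column dtypes show which fields are numeric and which are text.
print(sample.dtypes)

# Number of missing values per column, largest first.
missing_counts = sample.isnull().sum().sort_values(ascending=False)
print(missing_counts)
```

Columns with many missing values, or free-text columns such as Name and Ticket, are the natural candidates to drop or to handle specially.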

3. Solution

The most direct approach is to build a Keras Sequential model, train it, and use it for prediction:

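def train(feature, label, epochs=10):
    # Assumed stand-in: the original layers and compile step are missing here.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(feature.shape[1],)),
        tf.keras.layers.Dense(16, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.summary()
    history = model.fit(feature, label, epochs=epochs, batch_size=64)
    plot_loss_and_accuracy(history)
    return model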


def plot_loss_and_accuracy(history):
    """
    Plot the training loss and accuracy curves.
    """
    history_loss = history.history['loss']
    history_accuracy = history.history['accuracy']

    plt.plot(history_loss, label="loss")
    plt.plot(history_accuracy, label="accuracy")
    plt.grid(True, linestyle='--', alpha=0.5)
    plt.legend()
    plt.show()


It is worth looking at how other people solved this problem. First, though, here is my own code for the record:

"""
by: 梦否
date: 2022-4-11
"""
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

def get_raw_data(dataset_name):
    # Load the CSV file at the given path into a DataFrame.
    train_data = pd.read_csv(dataset_name)
    return train_data

def get_train_data(dataset_name):
    raw_data = get_raw_data(dataset_name)

    # Drop the fields that are not used as features.
    drop_data = raw_data.drop(['Name'], axis=1)
    drop_data = drop_data.drop(['Ticket'], axis=1)
    drop_data = drop_data.drop(['Cabin'], axis=1)
    drop_data = drop_data.drop(['Embarked'], axis=1)
    drop_data = drop_data.drop(['PassengerId'], axis=1)

    # Encode sex as 0 (male) / 1 (female).
    drop_data['Sex'] = drop_data['Sex'].map({'male': 0, 'female': 1})

    # Fill missing ages with the mean, computed before zero-filling the rest.
    average_age = round(drop_data['Age'].mean())
    drop_data['Age'] = drop_data['Age'].fillna(average_age)
    drop_data.fillna(0, inplace=True)

    # Column 0 is Survived (the label); the remaining columns are features.
    train_feature = drop_data.iloc[:, 1:].to_numpy(dtype='float32')
    train_label = drop_data.iloc[:, 0].to_numpy()
    return train_feature, train_label

def get_test_data(dataset_name):
    raw_data = get_raw_data(dataset_name)

    # Drop the fields that are not used as features (PassengerId is kept).
    drop_data = raw_data.drop(['Name'], axis=1)
    drop_data = drop_data.drop(['Ticket'], axis=1)
    drop_data = drop_data.drop(['Cabin'], axis=1)
    drop_data = drop_data.drop(['Embarked'], axis=1)

    # Encode sex as 0 (male) / 1 (female).
    drop_data['Sex'] = drop_data['Sex'].map({'male': 0, 'female': 1})

    # Fill missing ages with the mean, computed before zero-filling the rest.
    average_age = round(drop_data['Age'].mean())
    drop_data['Age'] = drop_data['Age'].fillna(average_age)
    drop_data.fillna(0, inplace=True)

    # Column 0 is PassengerId; the remaining columns are features.
    identify = drop_data.iloc[:, 0].to_numpy()
    feature = drop_data.iloc[:, 1:].to_numpy(dtype='float32')
    return identify, feature

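def train(feature, label, epochs=10):
    # Assumed stand-in: the original layers and compile step are missing here.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(feature.shape[1],)),
        tf.keras.layers.Dense(16, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.summary()
    history = model.fit(feature, label, epochs=epochs, batch_size=64)
    plot_loss_and_accuracy(history)
    return model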

def plot_loss_and_accuracy(history):
    """
    Plot the training loss and accuracy curves.
    """
    history_loss = history.history['loss']
    history_accuracy = history.history['accuracy']

    plt.plot(history_loss, label="loss")
    plt.plot(history_accuracy, label="accuracy")
    plt.grid(True, linestyle='--', alpha=0.5)
    plt.legend()
    plt.show()

def predict_model(model, feature):
    predict = model.predict(feature)
    # Threshold the predicted probability: >= 0.5 means survived (1), else 0.
    predict = (predict >= 0.5).astype(int)
    return predict

if __name__ == '__main__':
    train_x, train_y = get_train_data("./dataset/train.csv")
    print(train_x)
    identify, feature = get_test_data("./dataset/test.csv")
    model = train(train_x, train_y, epochs=500)
    predict = predict_model(model, feature)
    submission = pd.DataFrame({
        "PassengerId": identify,
        "Survived": predict[:, 0].astype(int)
    })
    submission.to_csv('result.csv', index=False)

4. Code study

4.1 Dataset processing

2. Analyzing the Fare field

The author's handling of missing values here is worth noting:

test_df["Fare"] = test_df["Fare"].fillna(test_df["Fare"].median())


The missing values are filled directly with the median of the column in a single line, whereas my code does the equivalent fill (using the mean) in several steps:


# Compute the mean first so the zero-filled entries do not skew it.
average_age = round(drop_data['Age'].mean())
drop_data['Age'] = drop_data['Age'].fillna(average_age)
drop_data.fillna(0, inplace=True)
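To make the difference between the two fill styles concrete, here is a minimal sketch on an invented column (the numbers are made up, not Titanic data). It shows the one-step fill with the mean or median, and why zero-filling before computing the mean gives a different, lower value.

```python
import numpy as np
import pandas as pd

# Toy column with two missing values (invented numbers).
ages = pd.Series([20.0, np.nan, 30.0, 40.0, np.nan])

# One-step fill: pandas skips NaN when computing the statistic.
filled_mean = ages.fillna(ages.mean())
filled_median = ages.fillna(ages.median())

# Pitfall: zero-filling first pulls the mean down (90 / 5 instead of 90 / 3).
skewed_mean = ages.fillna(0).mean()

print(filled_mean.tolist())
print(skewed_mean)
```

The median is usually the safer choice for skewed columns such as Fare, since a few very large values barely move it.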



fare_not_survived = titanic_df["Fare"][titanic_df["Survived"] == 0]
fare_survived     = titanic_df["Fare"][titanic_df["Survived"] == 1]


Then compute the mean and standard deviation of the fare for each group:


avgerage_fare = DataFrame([fare_not_survived.mean(), fare_survived.mean()])
std_fare      = DataFrame([fare_not_survived.std(), fare_survived.std()])


The means are then drawn as a bar chart, with the standard deviations as error bars:


avgerage_fare.plot(yerr=std_fare,kind='bar',legend=False)
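The same per-group statistics can also be obtained in one step with `groupby`. The sketch below uses an invented six-row frame (not real fares), and the name `fare_stats` is only illustrative.

```python
import pandas as pd

# Invented fares for three non-survivors (0) and three survivors (1).
df = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 1],
    "Fare": [7.0, 8.0, 9.0, 20.0, 30.0, 40.0],
})

# Mean and sample standard deviation of Fare for each Survived value.
fare_stats = df.groupby("Survived")["Fare"].agg(["mean", "std"])
print(fare_stats)
```

With matplotlib installed, `fare_stats["mean"].plot(kind="bar", yerr=fare_stats["std"])` should give the same kind of bar chart with error bars.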


3. Analyzing the Age field

This field also has missing values. They are filled with random integers drawn between (mean - std) and (mean + std). First compute the mean, the standard deviation, and the number of missing values:


average_age_titanic   = titanic_df["Age"].mean()
std_age_titanic       = titanic_df["Age"].std()
count_nan_age_titanic = titanic_df["Age"].isnull().sum()


Then generate the corresponding random integers:


rand_1 = np.random.randint(average_age_titanic - std_age_titanic, average_age_titanic + std_age_titanic, size = count_nan_age_titanic)



# rand_2 is presumably generated the same way from test_df (not shown here).
titanic_df.loc[titanic_df["Age"].isnull(), "Age"] = rand_1
test_df.loc[test_df["Age"].isnull(), "Age"] = rand_2



titanic_df['Age'] = titanic_df['Age'].astype(int)
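Put together, the random fill can be sketched on a small invented column as follows. This version uses NumPy's `default_rng` generator with a fixed seed and explicit integer bounds, which is a substitution on my part rather than the author's exact call; the ages are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Invented ages with two missing entries.
df = pd.DataFrame({"Age": [22.0, np.nan, 30.0, np.nan, 40.0, 28.0]})

mean_age = df["Age"].mean()
std_age = df["Age"].std()
missing = df["Age"].isnull()

# Integer bounds: rng.integers draws from the half-open range [low, high).
low = int(mean_age - std_age)
high = int(mean_age + std_age)
random_ages = rng.integers(low, high, size=int(missing.sum()))

# Assign through .loc so the original frame is modified, not a copy.
df.loc[missing, "Age"] = random_ages
df["Age"] = df["Age"].astype(int)
print(df)
```

Compared with filling every gap with a single mean, this keeps some of the spread of the original age distribution.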


The plot shows that the age distributions of survivors and non-survivors are similar and hard to tell apart, so the author computed a further statistic:

The per-age survival statistics show that age does matter: the young and the old were more likely to survive. The field is therefore relevant.

4. Analyzing the Cabin field

Because this field has too many missing values, it is dropped outright.
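One way to make the "too many missing values" decision explicit is to compute the fraction of missing entries per column and drop anything above a cut-off. The sketch below uses invented rows, and the 0.5 threshold is an assumption chosen for illustration, not a value from the original analysis.

```python
import numpy as np
import pandas as pd

# Invented rows; Cabin is mostly empty, as in the real dataset.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Cabin": [np.nan, np.nan, "C85", np.nan],
    "Fare": [7.25, 8.05, 71.28, 13.0],
})

# Fraction of missing entries in each column.
missing_ratio = df.isnull().mean()

threshold = 0.5  # assumed cut-off, adjust to taste
to_drop = missing_ratio[missing_ratio > threshold].index.tolist()

reduced = df.drop(columns=to_drop)
print(to_drop)
```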

5. Analyzing the Parch and SibSp fields

titanic_df['Family'] =  titanic_df["Parch"] + titanic_df["SibSp"]


To simplify processing, the count is converted to a binary flag (1 if the passenger has any family aboard, 0 otherwise):

titanic_df.loc[titanic_df['Family'] > 0, 'Family'] = 1


Then plot the survival rate against the Family flag:

family_perc = titanic_df[["Family", "Survived"]].groupby(['Family'],as_index=False).mean()
sns.barplot(x='Family', y='Survived', data=family_perc, order=[1,0], ax=axis2)
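The Family flag and its survival rate can be reproduced without plotting, as in this minimal sketch on invented rows (the values are made up and the name `family_rate` is only illustrative):

```python
import pandas as pd

# Invented passengers.
df = pd.DataFrame({
    "Parch": [0, 1, 0, 2, 0, 0],
    "SibSp": [0, 1, 0, 0, 1, 0],
    "Survived": [0, 1, 0, 1, 1, 1],
})

# 1 if the passenger has any parent, child, sibling or spouse aboard, else 0.
df["Family"] = ((df["Parch"] + df["SibSp"]) > 0).astype(int)

# Mean of Survived per group is the survival rate of that group.
family_rate = df.groupby("Family")["Survived"].mean()
print(family_rate)
```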


The plot shows that passengers travelling without family had a survival rate about 0.2 lower than those travelling with family. Most passengers had no family aboard, so having family aboard appears to improve the chance of survival. The field is therefore relevant.

6. Analyzing the Sex field

def get_person(passenger):
    age, sex = passenger
    return 'child' if age < 16 else sex

titanic_df['Person'] = titanic_df[['Age','Sex']].apply(get_person,axis=1)
test_df['Person']    = test_df[['Age','Sex']].apply(get_person,axis=1)



person_perc = titanic_df[["Person", "Survived"]].groupby(['Person'],as_index=False).mean()
sns.barplot(x='Person', y='Survived', data=person_perc, ax=axis2, order=['male','female','child'])
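The same Person feature can be built without a row-wise `apply`, using `Series.mask`. This is a sketch on invented rows; it follows the rule from `get_person` (anyone under 16 counts as a child), but the vectorized form is my substitution, not the author's code.

```python
import pandas as pd

# Invented passengers.
df = pd.DataFrame({
    "Age": [8, 30, 25, 12, 40, 35],
    "Sex": ["male", "male", "female", "female", "male", "female"],
    "Survived": [1, 0, 1, 1, 0, 1],
})

# Replace the sex with 'child' wherever the age is below 16.
df["Person"] = df["Sex"].mask(df["Age"] < 16, "child")

# Survival rate per category.
person_rate = df.groupby("Person")["Survived"].mean()
print(person_rate)
```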


Most notably, children are the smallest group yet have a relatively high survival rate. Women are also fewer than men, but their survival rate is much higher than that of men. The gap between the groups is large, so this feature is an important one.

4.2 Model building

Once the data has been processed, a logistic regression model is applied directly, followed by a random forest:

logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
logreg.score(X_train, Y_train)


random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
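Both `score` calls above measure accuracy on the training data itself, which tends to be optimistic, especially for a random forest. A hedged sketch of a hold-out check follows. It uses synthetic features generated with NumPy as a stand-in for the processed Titanic columns, and the 25% split size and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the processed features; this is not Titanic data.
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hold out a quarter of the rows for validation.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = LogisticRegression()
clf.fit(X_tr, y_tr)

train_acc = clf.score(X_tr, y_tr)  # accuracy on data the model has seen
val_acc = clf.score(X_val, y_val)  # accuracy on held-out data
print(train_acc, val_acc)
```

The validation accuracy is the better guide to how a submission is likely to score on the leaderboard.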


submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": Y_pred
})
submission.to_csv('titanic.csv', index=False)


4.3 Summary

Original: https://blog.csdn.net/qq_26460841/article/details/124107000
Author: 梦否
Title: 【Kaggle】Titanic – Machine Learning from Disaster


