Case: German Credit
In this assignment we classify the German Credit data set with a logistic regression model and evaluate the model with a confusion matrix and an ROC curve.
The German Credit data set contains observations on 30 variables for 1000 past applicants for credit. Each applicant was rated as “good credit”(700 cases) or “bad credit” (300 cases).
Assignment
1. Review the predictor variables and guess from their definition at what their role might be in a credit decision. Are there any surprises in the data?
2. Divide the data randomly into training (60%) and validation (40%) partitions, and develop classification models using the following data mining techniques in XLMiner.
3. Choose one model from each technique and report the confusion matrix and the cost/gain matrix for the validation data. For the logistic regression model use a cutoff “predicted probability of success” (“success”=1) of 0.5. Which technique gives the most net profit on the validation data?
4. Let’s see if we can improve our performance by changing the cutoff. Rather than accepting the above classification of everyone’s credit status, let’s use the “predicted probability of finding a good applicant” in logistic regression as a basis for selecting the best credit risks first, followed by poorer-risk applicants.
a. Sort the test data on "predicted probability of success."
b. For each test case, calculate the actual cost/gain of extending credit.
c. Add another column for cumulative net profit.
d. How far into the test data do you go to get maximum net profit? (Often this is specified as a percentile or rounded to deciles.)
e. If this logistic regression model is scored to future applicants, what "probability of success" cutoff should be used in extending credit?
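Steps 2–3 of the assignment can be sketched end to end. The block below is a minimal illustration, not the author’s exact pipeline: it uses `make_classification` as a synthetic stand-in for the 1000-row German Credit table (with roughly the 700/300 good/bad mix), since `GermanCredit.xlsx` is not bundled here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the German Credit data (label 1 = "good credit").
X, y = make_classification(n_samples=1000, n_features=30,
                           weights=[0.3, 0.7], random_state=0)

# Step 2: random 60% training / 40% validation partition.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.6, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 3: classify with a 0.5 cutoff on the predicted probability of success.
prob_success = clf.predict_proba(X_valid)[:, 1]
pred = (prob_success >= 0.5).astype(int)
cm = confusion_matrix(y_valid, pred)
print(cm)
```

With the real data the same pattern applies; only the loading step (`pd.read_excel`) and the column split into predictors and `RESPONSE` differ.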
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas_profiling
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
Q1. Review the predictor variables and guess from their definition at what their role might be in a credit decision. Are there any surprises in the data?
df = pd.read_excel(r'GermanCredit.xlsx')
df.head(10)
[Output: first 10 rows of the data — columns OBS#, CHK_ACCT, DURATION, HISTORY, NEW_CAR, USED_CAR, FURNITURE, RADIO/TV, EDUCATION, RETRAINING, …, AGE, OTHER_INSTALL, RENT, OWN_RES, NUM_CREDITS, JOB, NUM_DEPENDENTS, TELEPHONE, FOREIGN, RESPONSE]
10 rows × 32 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 32 columns):
# Column Non-Null Count Dtype
The modelling results:
The coefficients of the logistic regression model: [[-0.03410161 0.62564931 -0.48414517 0.504973 -0.45165218 0.11433931
-0.18546591 0.09348364 -0.23355452 -0.07481663 -0.17873389 0.52594587
0.245582 -0.3468231 0.02593704 0.37926205 0.15389971 -0.06706053
0.16737355 -0.03974959 0.14224639 -0.25124865 0.04792307 -0.26928597
-0.51482951 -0.24693031 -0.19668434 0.02740041 -0.02860024 0.20056279
0.27471923]]
The intercept of the logistic regression model: [1.38831511]
Number of successes in the validation data: 315.0
Number of successes predicted by the model: 315
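The introduction mentions evaluating the model with an ROC curve, though the curve itself is not reproduced above. A minimal sketch of that evaluation follows, again using `make_classification` as a synthetic stand-in for the real 60/40 split (the names `clf`, `X_test`, `y_test` mirror the write-up but are rebuilt here so the block runs on its own):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the German Credit train/validation split.
X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
score = clf.predict_proba(X_test)[:, 1]  # predicted probability of success

fpr, tpr, _ = roc_curve(y_test, score)
auc = roc_auc_score(y_test, score)
plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.savefig("roc.png")
```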
# s holds the predicted probability of success, e.g. s = clf.predict_proba(X_test)[:, 1]
X_test.loc[:, 'Score'] = s
X_test.sort_values("Score", inplace=True)
plt.plot(X_test.loc[:,'Score'].values)
plt.show()
X_test
[Output: X_test sorted ascending by the new Score column, from row 972 (Score 0.021543) to row 209 (Score 0.996783)]
400 rows × 32 columns
print('The test data sorted on "predicted probability of success":')
X_test.loc[:,'Score']
The test data sorted on "predicted probability of success":
972 0.021543
334 0.038262
728 0.049493
59 0.056447
11 0.074500
...
519 0.992815
135 0.992897
567 0.992978
156 0.993214
209 0.996783
Name: Score, Length: 400, dtype: float64
Q4.b
For each test case, calculate the actual cost/gain of extending credit.
# Expected gain per applicant: +100 if the credit succeeds, -500 if it fails,
# weighted by the predicted probability of success in 'Score'.
actual_gain = X_test['Score']*100 - 500*(1 - X_test['Score'])
X_test.loc[:,'Actual_gain'] = actual_gain
plt.plot(actual_gain.values)
plt.xlabel('number of test cases')
plt.ylabel('actual gain')
plt.show()
X_test
[Output: X_test with the new Actual_gain column, ranging from -487.07 for the lowest-scored row to 98.07 for the highest]
400 rows × 33 columns
print('The actual cost/gain of extending credit for each case:')
X_test.loc[:,'Actual_gain']
The actual cost/gain of extending credit for each case:
972 -487.074435
334 -477.042925
728 -470.304182
59 -466.132080
11 -455.300084
...
519 95.688838
135 95.738059
567 95.786722
156 95.928315
209 98.069581
Name: Actual_gain, Length: 400, dtype: float64
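Under the +100 / −500 payoffs assumed in the gain formula above, there is a natural breakeven probability: the expected gain 100p − 500(1 − p) is positive only for p > 5/6 ≈ 0.833. A quick check:

```python
# Expected gain of extending credit at predicted success probability p,
# using the +100 / -500 payoffs from the formula above.
def expected_gain(p):
    return 100 * p - 500 * (1 - p)

breakeven = 500 / (100 + 500)  # p where the expected gain crosses zero
print(round(breakeven, 4))     # 0.8333
print(expected_gain(0.9))      # ≈ 40.0
print(expected_gain(0.5))      # -200.0
```

This breakeven is a useful sanity check on any cutoff chosen later: applicants scored below ≈0.833 have negative expected gain under these payoffs.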
Q4.c
Add another column for cumulative net profit.
cumulate_net_profit = np.cumsum(actual_gain)
X_test.loc[:,'Cumulate_net_profit'] = cumulate_net_profit
plt.plot(cumulate_net_profit.values)
plt.xlabel('number of test cases')
plt.ylabel('cumulate net profit')
plt.show()
X_test
[Output: X_test with the new Cumulate_net_profit column; the running total falls to about -28,084 before the positive gains at the top of the score range pull it back to -27,698.65 at the last row]
400 rows × 34 columns
print('The cumulative net profit column:')
X_test.loc[:,'Cumulate_net_profit']
The cumulative net profit column:
972 -487.074435
334 -964.117360
728 -1434.421542
59 -1900.553622
11 -2355.853706
...
519 -28084.173974
135 -27988.435915
567 -27892.649194
156 -27796.720878
209 -27698.651297
Name: Cumulate_net_profit, Length: 400, dtype: float64
Q4.d
How far into the test data do you go to get maximum net profit? (Often this is specified as a percentile or rounded to deciles.)
# visualize the 31 fitted coefficients
plt.bar(np.arange(31), clf.coef_.reshape(-1))
plt.xlabel('Variables')
plt.ylabel('Weight')
plt.show()
Maximizing net profit hinges on the “predicted probability of success” assigned to the test data.
The fitted coefficients show that checking-account status has a strong positive effect on the probability of success, while the duration of the credit and the purpose of the credit have negative effects.
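Q4.d can also be answered directly: sort the applicants best-first, accumulate the gains, and find where the running total peaks. The sketch below uses synthetic uniform scores standing in for `X_test['Score']`, with the same +100/−500 payoff formula as above:

```python
import numpy as np

rng = np.random.default_rng(0)
score = rng.uniform(size=400)            # stand-in for X_test['Score']
gain = 100 * score - 500 * (1 - score)   # same payoff formula as above

order = np.argsort(score)[::-1]          # best credit risks first
cum_profit = np.cumsum(gain[order])
k = int(np.argmax(cum_profit)) + 1       # how many applicants to accept
print(k, k / len(score))                 # depth as a count and as a fraction
```

With the real Score column this gives the depth into the test data (which can then be rounded to a decile) at which cumulative net profit is maximized.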
Q4.e
If this logistic regression model is scored to future applicants, what “probability of success” cutoff should be used in extending credit?
# Score of the applicant sitting at the 5/6 depth of the sorted test data
X_test.iloc[int((5/6)*400), -3]
0.9578759345718599
To reduce the cost of extending credit, we need to balance risk against benefit. Here we propose setting a “probability of success” cutoff such that the number of successful applicants is at least five times the number of unsuccessful ones; equivalently, 5/6 of all applicants should fall below the cutoff. From the calculation above, the “probability of success” cutoff should be set at about 0.95.
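The 5/6 rule can also be read off as a quantile of the scores. A sketch, again with synthetic uniform scores standing in for `X_test['Score']`:

```python
import numpy as np

rng = np.random.default_rng(0)
score = rng.uniform(size=400)      # stand-in for X_test['Score']

# Cutoff such that 5/6 of applicants fall below it, i.e. at most
# one applicant in six is extended credit.
cutoff = np.quantile(score, 5 / 6)
accepted = (score >= cutoff).mean()
print(cutoff, accepted)            # accepted ≈ 1/6
```

Applied to the real Score column, this quantile reproduces the ≈0.95 cutoff reported above.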
Original: https://blog.csdn.net/GODSuner/article/details/115029101
Author: 春风惹人醉
Title: Classifying the German Credit data set with a Logistic Regression model