Classifying the German Credit Data Set with a Logistic Regression Model

Case: German Credit

In this assignment, we classify the German Credit data set with a logistic regression model and evaluate the model with a confusion matrix and an ROC curve.

The German Credit data set contains observations on 30 variables for 1000 past applicants for credit. Each applicant was rated as “good credit” (700 cases) or “bad credit” (300 cases).

Assignment

1. Review the predictor variables and guess from their definition at what their role might be in a credit decision. Are there any surprises in the data?

2. Divide the data randomly into training (60%) and validation (40%) partitions, and develop classification models using the following data mining techniques in XLMiner

3. Choose one model from each technique and report the confusion matrix and the cost/gain matrix for the validation data. For the logistic regression model use a cutoff “predicted probability of success” (“success”=1) of 0.5. Which technique gives the most net profit on the validation data?

4. Let’s see if we can improve our performance by changing the cutoff. Rather than accepting the above classification of everyone’s credit status, let’s use the “predicted probability of finding a good applicant” in logistic regression as a basis for selecting the best credit risks first, followed by poorer risk applicants.

a. Sort the test data on "predicted probability of success."

b. For each test case, calculate the actual cost/gain of extending credit.

c. Add another column for cumulative net profit.

d. How far into the test data do you go to get maximum net profit? (Often this is specified as a percentile or rounded to deciles.)

e. If this logistic regression model is scored to future applicants, what "probability of success" cutoff should be used in extending credit?


import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas_profiling
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")

Q1. Review the predictor variables and guess from their definition at what their role might be in a credit decision. Are there any surprises in the data?


df = pd.read_excel(r'GermanCredit.xlsx')
df.head(10)

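     OBS#  CHK_ACCT  DURATION  HISTORY  NEW_CAR  USED_CAR  FURNITURE  RADIO/TV  EDUCATION  RETRAINING  …  AGE  OTHER_INSTALL  RENT  OWN_RES  NUM_CREDITS  JOB  NUM_DEPENDENTS  TELEPHONE  FOREIGN  RESPONSE
  0     1         0         6        4        0         0          0         1          0           0  …   67              0     0        1            2    2               1          1        0         1
  1     2         1        48        2        0         0          0         1          0           0  …   22              0     0        1            1    2               1          0        0         0
  2     3         3        12        4        0         0          0         0          1           0  …   49              0     0        1            1    1               2          0        0         1
  3     4         0        42        2        0         0          1         0          0           0  …   45              0     0        0            1    2               2          0        0         1
  4     5         0        24        3        1         0          0         0          0           0  …   53              0     0        0            2    2               2          0        0         0
  5     6         3        36        2        0         0          0         0          1           0  …   35              0     0        0            1    1               2          1        0         1
  6     7         3        24        2        0         0          1         0          0           0  …   53              0     0        1            1    2               1          0        0         1
  7     8         1        36        2        0         1          0         0          0           0  …   35              0     1        0            1    3               1          1        0         1
  8     9         3        12        2        0         0          0         1          0           0  …   61              0     0        1            1    1               1          0        0         1
  9    10         1        30        4        1         0          0         0          0           0  …   28              0     0        1            2    3               1          0        0         0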

10 rows × 32 columns
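One way to look for surprises in the predictors is to check that every 0/1 dummy really contains only 0 and 1, and to count the classes. The sketch below runs on a small made-up frame, not on GermanCredit.xlsx; the list `binary_cols` and the planted value 2 are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy frame standing in for df; the value 2 is planted on purpose
toy = pd.DataFrame({
    "NEW_CAR":   [0, 1, 0, 2],
    "TELEPHONE": [1, 0, 1, 0],
    "RESPONSE":  [1, 0, 1, 1],
})

# Columns assumed to be 0/1 dummies (adjust to the real variable definitions)
binary_cols = ["NEW_CAR", "TELEPHONE", "RESPONSE"]

# Collect any values outside {0, 1} per column
unexpected = {}
for col in binary_cols:
    odd = sorted(int(v) for v in toy.loc[~toy[col].isin([0, 1]), col].unique())
    if odd:
        unexpected[col] = odd

# Class balance of the target (1 = good credit, 0 = bad credit)
class_counts = {int(k): int(v) for k, v in toy["RESPONSE"].value_counts().items()}

print(unexpected)
print(class_counts)
```

On the real data the same loop could be pointed at `df`, with `binary_cols` taken from the variable definitions.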

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 32 columns):
 #   Column            Non-Null Count  Dtype
The modelling results:
The slope of the logistic regression technique: [[-0.03410161  0.62564931 -0.48414517  0.504973   -0.45165218  0.11433931
  -0.18546591  0.09348364 -0.23355452 -0.07481663 -0.17873389  0.52594587
   0.245582   -0.3468231   0.02593704  0.37926205  0.15389971 -0.06706053
   0.16737355 -0.03974959  0.14224639 -0.25124865  0.04792307 -0.26928597
  -0.51482951 -0.24693031 -0.19668434  0.02740041 -0.02860024  0.20056279
   0.27471923]]
The intercept of the logistic regression technique: [1.38831511]
Number of successes from verification results: 315.0
Number of successes from model calculation results: 315
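The listing jumps from df.info() to the fitted coefficients, so the split and fit code is not shown. The following is a minimal sketch of what such a step might look like, not the author's code: it uses synthetic data in place of GermanCredit.xlsx, assumes the predictors are standardized (StandardScaler is imported above, but its use is not shown), and assumes the gain of 100 per accepted good applicant and loss of 500 per accepted bad applicant that appear in Q4.b.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the predictors and RESPONSE (NOT the real data)
X = rng.normal(size=(1000, 5))
logit = 0.8 + X @ np.array([1.0, -0.8, 0.5, 0.0, 0.0])
y = (rng.random(1000) < 1 / (1 + np.exp(-logit))).astype(int)

# 60% training / 40% validation, as the assignment asks
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Fit the scaler on the training partition only, then the classifier
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression().fit(scaler.transform(X_train), y_train)

# Predicted probability of success (class 1) and the 0.5 cutoff
prob_good = clf.predict_proba(scaler.transform(X_valid))[:, 1]
pred = (prob_good >= 0.5).astype(int)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_valid, pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Credit is extended only to predicted-good cases: +100 if good, -500 if bad
net_profit = 100 * tp - 500 * fp
print(cm)
print(net_profit)
```

Rejected applicants contribute nothing to net profit in this sketch, which is why only the predicted-good column of the confusion matrix enters the calculation.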

X_test.loc[:,'Score'] = s

X_test.sort_values("Score",inplace=True)
plt.plot(X_test.loc[:,'Score'].values)
plt.show()
X_test

     OBS#  CHK_ACCT  DURATION  HISTORY  NEW_CAR  USED_CAR  FURNITURE  RADIO/TV  EDUCATION  RETRAINING  …  AGE  OTHER_INSTALL  RENT  OWN_RES  NUM_CREDITS  JOB  NUM_DEPENDENTS  TELEPHONE  FOREIGN     Score
972   973         0        24        1        1         0          0         0          0           0  …   29              0     1        0            2    0               1          0        0  0.021543
334   335         0        24        0        0         0          1         0          0           0  …   23              1     1        0            2    2               2          0        0  0.038262
728   729         1        48        1        0         0          0         0          0           1  …   59              0     1        0            1    2               1          0        0  0.049493
 59    60         0        36        4        0         0          1         0          0           0  …   23              0     1        0            2    1               1          1        0  0.056447
 11    12         0        48        2        0         0          0         0          0           1  …   24              0     1        0            1    2               1          0        0  0.074500
...
519   520         3         6        4        0         0          0         1          0           0  …   36              0     0        0            2    2               1          0        0  0.992815
135   136         3        12        4        0         0          0         1          0           0  …   38              0     0        1            2    2               1          1        0  0.992897
567   568         3        24        4        0         0          0         1          0           0  …   34              0     0        1            1    2               1          0        0  0.992978
156   157         0         9        4        0         0          0         0          0           0  …   48              0     0        1            2    2               2          0        1  0.993214
209   210         3        12        2        0         1          0         0          0           0  …   55              0     0        1            1    2               1          0        1  0.996783

400 rows × 32 columns

print('The sorted test data on "predicted probability of success" as follows:')
X_test.loc[:,'Score']
The sorted test data on "predicted probability of success" as follows:

972    0.021543
334    0.038262
728    0.049493
59     0.056447
11     0.074500
         ...

519    0.992815
135    0.992897
567    0.992978
156    0.993214
209    0.996783
Name: Score, Length: 400, dtype: float64

Q4.b

For each test case, calculate the actual cost/gain of extending credit.
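# Gain per case weighted by the predicted score: +100 if the applicant turns out good, -500 if bad.
# This is an expected value under the model, not the realized outcome of each case.
actual_gain = X_test['Score']*100-500*(1-X_test['Score'])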
X_test.loc[:,'Actual_gain'] = actual_gain
plt.plot(actual_gain.values)
plt.xlabel('number of test cases')
plt.ylabel('actual gain')
plt.show()
X_test

     OBS#  CHK_ACCT  DURATION  HISTORY  NEW_CAR  USED_CAR  FURNITURE  RADIO/TV  EDUCATION  RETRAINING  …  OTHER_INSTALL  RENT  OWN_RES  NUM_CREDITS  JOB  NUM_DEPENDENTS  TELEPHONE  FOREIGN     Score  Actual_gain
972   973         0        24        1        1         0          0         0          0           0  …              0     1        0            2    0               1          0        0  0.021543  -487.074435
334   335         0        24        0        0         0          1         0          0           0  …              1     1        0            2    2               2          0        0  0.038262  -477.042925
728   729         1        48        1        0         0          0         0          0           1  …              0     1        0            1    2               1          0        0  0.049493  -470.304182
 59    60         0        36        4        0         0          1         0          0           0  …              0     1        0            2    1               1          1        0  0.056447  -466.132080
 11    12         0        48        2        0         0          0         0          0           1  …              0     1        0            1    2               1          0        0  0.074500  -455.300084
...
519   520         3         6        4        0         0          0         1          0           0  …              0     0        0            2    2               1          0        0  0.992815    95.688838
135   136         3        12        4        0         0          0         1          0           0  …              0     0        1            2    2               1          1        0  0.992897    95.738059
567   568         3        24        4        0         0          0         1          0           0  …              0     0        1            1    2               1          0        0  0.992978    95.786722
156   157         0         9        4        0         0          0         0          0           0  …              0     0        1            2    2               2          0        1  0.993214    95.928315
209   210         3        12        2        0         1          0         0          0           0  …              0     0        1            1    2               1          0        1  0.996783    98.069581

400 rows × 33 columns

print('The actual cost/gain of extending credit for each case as follows:')
X_test.loc[:,'Actual_gain']
The actual cost/gain of extending credit for each case as follows:

972   -487.074435
334   -477.042925
728   -470.304182
59    -466.132080
11    -455.300084
          ...

519     95.688838
135     95.738059
567     95.786722
156     95.928315
209     98.069581
Name: Actual_gain, Length: 400, dtype: float64
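The formula above weights the two outcomes by the predicted score, so it gives an expected gain. If the realized gain is wanted instead, it can be taken from the actual outcome of each case. A minimal sketch on made-up numbers, assuming the same gain of 100 and loss of 500 and a hypothetical `RESPONSE` column holding the true class of each test case:

```python
import numpy as np
import pandas as pd

# Hypothetical toy validation set: predicted score and actual outcome (1 = good, 0 = bad)
toy = pd.DataFrame({
    "Score":    [0.95, 0.90, 0.80, 0.60, 0.30],
    "RESPONSE": [1, 1, 0, 1, 0],
})

GAIN_GOOD = 100   # profit when credit goes to a good applicant
COST_BAD = -500   # loss when credit goes to a bad applicant

# Realized gain: depends only on what actually happened
toy["Actual_gain"] = np.where(toy["RESPONSE"] == 1, GAIN_GOOD, COST_BAD)

# Expected gain: the score-weighted formula used in the listing above
toy["Expected_gain"] = toy["Score"] * GAIN_GOOD + (1 - toy["Score"]) * COST_BAD

print(toy)
```

The two columns differ case by case: a bad applicant with a high score still has a positive expected gain but a realized loss of 500.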

Q4.c

Add another column for cumulative net profit.
# Running total of the gains, accumulated in the current (ascending Score) row order
cumulate_net_profit = np.cumsum(actual_gain)
X_test.loc[:,'Cumulate_net_profit'] = cumulate_net_profit
plt.plot(cumulate_net_profit.values)
plt.xlabel('number of test cases')
plt.ylabel('cumulate net profit')
plt.show()
X_test

     OBS#  CHK_ACCT  DURATION  HISTORY  NEW_CAR  USED_CAR  FURNITURE  RADIO/TV  EDUCATION  RETRAINING  …  RENT  OWN_RES  NUM_CREDITS  JOB  NUM_DEPENDENTS  TELEPHONE  FOREIGN     Score  Actual_gain  Cumulate_net_profit
972   973         0        24        1        1         0          0         0          0           0  …     1        0            2    0               1          0        0  0.021543  -487.074435          -487.074435
334   335         0        24        0        0         0          1         0          0           0  …     1        0            2    2               2          0        0  0.038262  -477.042925          -964.117360
728   729         1        48        1        0         0          0         0          0           1  …     1        0            1    2               1          0        0  0.049493  -470.304182         -1434.421542
 59    60         0        36        4        0         0          1         0          0           0  …     1        0            2    1               1          1        0  0.056447  -466.132080         -1900.553622
 11    12         0        48        2        0         0          0         0          0           1  …     1        0            1    2               1          0        0  0.074500  -455.300084         -2355.853706
...
519   520         3         6        4        0         0          0         1          0           0  …     0        0            2    2               1          0        0  0.992815    95.688838        -28084.173974
135   136         3        12        4        0         0          0         1          0           0  …     0        1            2    2               1          1        0  0.992897    95.738059        -27988.435915
567   568         3        24        4        0         0          0         1          0           0  …     0        1            1    2               1          0        0  0.992978    95.786722        -27892.649194
156   157         0         9        4        0         0          0         0          0           0  …     0        1            2    2               2          0        1  0.993214    95.928315        -27796.720878
209   210         3        12        2        0         1          0         0          0           0  …     0        1            1    2               1          0        1  0.996783    98.069581        -27698.651297

400 rows × 34 columns

print('The cumulative net profit column is as follows:')
X_test.loc[:,'Cumulate_net_profit']
The cumulative net profit column is as follows:

972     -487.074435
334     -964.117360
728    -1434.421542
59     -1900.553622
11     -2355.853706
           ...

519   -28084.173974
135   -27988.435915
567   -27892.649194
156   -27796.720878
209   -27698.651297
Name: Cumulate_net_profit, Length: 400, dtype: float64
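The listing accumulates the gains in ascending score order, whereas the assignment describes selecting the best credit risks first. A minimal sketch of that alternative ordering, on made-up numbers, with a hypothetical `Actual_gain` column holding realized gains of 100 or -500:

```python
import pandas as pd

# Hypothetical toy cases in arbitrary order
toy = pd.DataFrame({
    "Score":       [0.30, 0.95, 0.60, 0.90, 0.80],
    "Actual_gain": [-500, 100, 100, 100, -500],
})

# Best risks first: highest predicted probability of success at the top
ranked = toy.sort_values("Score", ascending=False).reset_index(drop=True)

# Net profit if credit is extended to the first k ranked cases, for every k
ranked["Cumulative_net_profit"] = ranked["Actual_gain"].cumsum()

print(ranked)
```

Sorting into a new frame rather than in place also leaves the original test data untouched for later steps.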

Q4.d

How far into the test data do you go to get maximum net profit? (Often this is specified as a percentile or rounded to deciles.)
# One bar per fitted coefficient (31 predictors)
plt.bar(np.arange(clf.coef_.shape[1]),clf.coef_.reshape(-1))
plt.xlabel('Variables')
plt.ylabel('Weight')
plt.show()

Maximizing net profit depends on how well the “predicted probability of success” ranks the test cases, and that ranking is driven by the fitted weights shown in the bar chart.

Reading the coefficients in the column order of the predictors, checking account status (CHK_ACCT, about 0.63) has a clear positive effect on the probability of success, while the duration of credit (DURATION, about -0.48) and some credit purposes such as NEW_CAR (about -0.45) have a negative effect.
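The depth question in Q4.d can also be answered directly from a cumulative net profit column. The sketch below uses made-up scores and realized gains, already sorted with the best risks first; the variable names are hypothetical.

```python
import numpy as np

# Hypothetical scores sorted in descending order, with realized gains per case
scores = np.array([0.97, 0.93, 0.90, 0.85, 0.80, 0.70, 0.55, 0.40, 0.30, 0.10])
gains = np.array([100, 100, 100, 100, -500, 100, 100, -500, -500, -500])

# Net profit after extending credit to the first k cases
cumulative = np.cumsum(gains)

# Position of the maximum (first occurrence if there are ties)
best = int(np.argmax(cumulative))

max_profit = int(cumulative[best])
depth = (best + 1) / len(scores)   # fraction of the ranked list that gets credit
cutoff = float(scores[best])       # score of the last accepted case

print(max_profit, depth, cutoff)
```

In this toy example the maximum is reached 40% of the way into the ranked list, and the score at that position is a natural candidate for the cutoff asked about in Q4.e.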

Q4.e

If this logistic regression model is scored to future applicants, what “probability of success” cutoff should be used in extending credit?
# Score at the position 5/6 of the way through the ascending-sorted test data
X_test['Score'].iloc[int((5/6)*len(X_test))]
0.9578759345718599

To reduce the cost of extending credit and balance risk against benefit, we propose a “probability of success” cutoff such that the number of successful applicants is at least five times the number of unsuccessful ones. With the test data sorted by ascending score, this places the cutoff 5/6 of the way through the 400 cases, where the score is 0.9579. On that basis, I think the “probability of success” cutoff should be set at about 0.95.
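The five-to-one ratio can also be read as a probability threshold rather than a position in the sorted list. Assuming the gain of 100 and loss of 500 from Q4.b, and assuming the predicted probabilities are reasonably well calibrated, the expected gain p*100 - (1-p)*500 is zero at p = 5/6, which is about 0.83. A small sketch of that arithmetic, with hypothetical function names:

```python
GAIN_GOOD = 100.0   # profit from a good applicant (value used in Q4.b)
LOSS_BAD = 500.0    # loss from a bad applicant (value used in Q4.b)


def expected_gain(p, gain_good=GAIN_GOOD, loss_bad=LOSS_BAD):
    """Expected gain of extending credit when the probability of success is p."""
    return p * gain_good - (1 - p) * loss_bad


def break_even_probability(gain_good=GAIN_GOOD, loss_bad=LOSS_BAD):
    """Probability of success at which the expected gain is exactly zero."""
    return loss_bad / (gain_good + loss_bad)


p_star = break_even_probability()
print(round(p_star, 4))
print(expected_gain(0.9), expected_gain(0.8))
```

This break-even value is a lower bound under the stated assumptions; a cutoff near 0.95 is stricter and rejects some applicants whose expected gain is still positive.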

Original: https://blog.csdn.net/GODSuner/article/details/115029101
Author: 春风惹人醉
Title: Logstic Regression模型对German Credit数据集进行分类
