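Credit Card Fraud Detection with Machine Learning
I. Credit Card Fraud Detection with Machine Learning
1.1 Preface
- Data source: the Kaggle Credit Card Fraud Detection dataset, https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud?resource=download
- This article applies XGBoost, random forest, KNN, logistic regression, SVM, and decision tree classifiers to the credit card fraud detection problem.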
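As a rough sketch of how the classifiers listed above could be trained and compared, the snippet below fits each one on a small synthetic, imbalanced dataset and reports its F1 score. The data from `make_classification`, the hyperparameters, and the `models` and `f1_scores` names are all assumptions made for illustration; they are not the article's settings or results. XGBoost is only added if the `xgboost` package is installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly 5% positives (NOT the Kaggle dataset).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Hypothetical hyperparameters, chosen only to keep the sketch fast.
models = {
    "Decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Random forest": RandomForestClassifier(n_estimators=50, max_depth=4,
                                            random_state=0),
}
try:
    # Optional dependency: skipped when xgboost is not installed.
    from xgboost import XGBClassifier
    models["XGBoost"] = XGBClassifier(n_estimators=50, max_depth=4)
except ImportError:
    pass

# Fit every model on the same split and score it on the held-out part.
f1_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    f1_scores[name] = f1_score(y_test, y_pred, zero_division=0)
    print(f"{name}: F1 = {f1_scores[name]:.3f}")
```

Using one shared train/test split for every model keeps the comparison fair; on the real data the same loop would run on `x_train` and `x_test` from the preprocessing step.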
1.2 Case study
1.2.1 Import the required modules into the Python environment
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from termcolor import colored as cl
import itertools
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score
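To make the three imported metrics concrete, here is a minimal sketch on ten hand-made labels (the `y_true` and `y_pred` values are invented for illustration and have nothing to do with the dataset). It shows how `confusion_matrix` lays out its cells for binary labels and how the F1 score follows from them.

```python
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score

# Hypothetical labels: 1 = fraud, 0 = non-fraud.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels the matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)   # (TP + TN) / all samples
f1 = f1_score(y_true, y_pred)               # 2*TP / (2*TP + FP + FN)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("accuracy:", accuracy)
print("F1:", round(f1, 4))
```

Accuracy counts every correct prediction equally, while F1 only looks at the positive (fraud) class, which is why the two numbers can differ a lot when frauds are rare.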
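1.2.2 Read the data and drop the unused Time column
- About the data: we use the Kaggle Credit Card Fraud Detection dataset. It contains the features V1 to V28, which are principal components obtained with PCA. The Time feature is dropped because it is not used for building the models.
- The remaining columns are Amount, the transaction amount, and Class, which states whether the transaction is a fraud case: Class 0 marks a normal (non-fraud) transaction and Class 1 marks a fraud.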
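A quick back-of-the-envelope calculation shows how unbalanced the two classes are. The two counts below are taken from the outputs printed in this article (284807 rows in the data shape, 492 rows in the fraud Amount statistics); the variable names are only illustrative.

```python
# Counts as printed in this article's outputs.
total_cases = 284807   # rows in the dataset
fraud_cases = 492      # rows with Class == 1

nonfraud_cases = total_cases - fraud_cases

# Share of fraud cases, in percent.
fraud_share = fraud_cases / total_cases * 100

# Accuracy (in percent) of a trivial model that always predicts "non-fraud".
baseline_accuracy = nonfraud_cases / total_cases * 100

print("Non-fraud cases:", nonfraud_cases)
print("Fraud share: {:.2f}%".format(fraud_share))
print("Always-non-fraud accuracy: {:.2f}%".format(baseline_accuracy))
```

Because a model that never flags anything is already right on almost every transaction, accuracy alone says little here; this is one reason to also look at the F1 score and the confusion matrix.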
df = pd.read_csv(r'../creditcard.csv')
print("Data's columns contain:\n", df.columns)
print("Data shape:\n", df.shape)
df.drop('Time', axis=1, inplace=True)
pd.set_option('display.max_columns', df.shape[1])
print(df.head())
'''
Data's columns contain:
Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
'Class'],
dtype='object')
Data shape:
(284807, 31)
V1 V2 V3 V4 V5 V6 V7 \
0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599
1 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803
2 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461
3 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609
4 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941
V8 V9 V10 V11 V12 V13 V14 \
0 0.098698 0.363787 0.090794 -0.551600 -0.617801 -0.991390 -0.311169
1 0.085102 -0.255425 -0.166974 1.612727 1.065235 0.489095 -0.143772
2 0.247676 -1.514654 0.207643 0.624501 0.066084 0.717293 -0.165946
3 0.377436 -1.387024 -0.054952 -0.226487 0.178228 0.507757 -0.287924
4 -0.270533 0.817739 0.753074 -0.822843 0.538196 1.345852 -1.119670
V15 V16 V17 V18 V19 V20 V21 \
0 1.468177 -0.470401 0.207971 0.025791 0.403993 0.251412 -0.018307
1 0.635558 0.463917 -0.114805 -0.183361 -0.145783 -0.069083 -0.225775
2 2.345865 -2.890083 1.109969 -0.121359 -2.261857 0.524980 0.247998
3 -0.631418 -1.059647 -0.684093 1.965775 -1.232622 -0.208038 -0.108300
4 0.175121 -0.451449 -0.237033 -0.038195 0.803487 0.408542 -0.009431
V22 V23 V24 V25 V26 V27 V28 \
0 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053
1 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724
2 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752
3 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458
4 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153
Amount Class
0 149.62 0
1 2.69 0
2 378.66 0
3 123.50 0
4 69.99 0
'''
1.2.3 Exploratory data analysis and data preprocessing
# Split the rows by label: Class 0 is non-fraud, Class 1 is fraud
cases = len(df)
nonfraud_cases = df[df.Class == 0]
fraud_cases = df[df.Class == 1]
# Share of fraud cases among all transactions, in percent
fraud_percentage = round(len(fraud_cases) / cases * 100, 2)
print(cl('CASE COUNT', attrs=['bold']))
print(cl('-' * 40, attrs=['bold']))
print(cl('Total number of cases are {}'.format(cases), attrs=['bold']))
print(cl('Number of Non-fraud cases are {}'.format(len(nonfraud_cases)), attrs=['bold']))
print(cl('Number of fraud cases are {}'.format(len(fraud_cases)), attrs=['bold']))
print(cl('Percentage of fraud cases is {}%'.format(fraud_percentage), attrs=['bold']))
print(cl('-' * 40, attrs=['bold']))
# Summary statistics of the transaction amount for each class
print(cl('CASE AMOUNT STATISTICS', attrs=['bold']))
print(cl('-' * 40, attrs=['bold']))
print(cl('NON-FRAUD CASE AMOUNT STATS', attrs=['bold']))
print(nonfraud_cases.Amount.describe())
print(cl('-' * 40, attrs=['bold']))
print(cl('FRAUD CASE AMOUNT STATS', attrs=['bold']))
print(fraud_cases.Amount.describe())
print(cl('-' * 40, attrs=['bold']))
# Standardize Amount to zero mean and unit variance
sc = StandardScaler()
amount = df.Amount.values
df.Amount = sc.fit_transform(amount.reshape(-1, 1)).ravel()
print(cl(df.Amount.head(10), attrs=['bold']))
# Separate features and labels, then hold out 20% of the rows for testing
x = df.drop('Class', axis=1).values
y = df.Class.values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
'''
CASE COUNT
CASE AMOUNT STATISTICS
FRAUD CASE AMOUNT STATS
count 492.000000
mean 122.211321
std 256.683288
min 0.000000
25% 1.000000
50% 9.250000
75% 105.890000
max 2125.870000
Name: Amount, dtype: float64
'''
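With only 492 fraud rows, a purely random split can leave the test set with a fraud share that differs from the training set. One common alternative, shown here as a sketch and not as the article's method, is to pass `stratify=y` to `train_test_split` and to fit the `StandardScaler` on the training part only. The data below is a synthetic stand-in for the Amount column, and all names and sizes are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 10,000 transactions, 20 of them fraud (0.2%).
n_samples = 10_000
y = np.zeros(n_samples, dtype=int)
y[:20] = 1
amount = rng.exponential(scale=80.0, size=(n_samples, 1))

# stratify=y keeps the fraud share the same in the train and test parts.
x_train, x_test, y_train, y_test = train_test_split(
    amount, y, test_size=0.2, stratify=y, random_state=0)

# Fit the scaler on the training rows only, then apply it to both parts,
# so no statistics from the test rows leak into preprocessing.
scaler = StandardScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

print("fraud share in train: {:.4f}".format(y_train.mean()))
print("fraud share in test:  {:.4f}".format(y_test.mean()))
```

On the real data the same two changes would apply to the `train_test_split` and `StandardScaler` calls in the preprocessing code above.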
Original: https://blog.csdn.net/qq_40216188/article/details/125853308
Author: 西西先生666
Title: Credit Card Fraud Detection with Machine Learning