Data Analysis Lab: Logistic Regression with sklearn

July 16, 2023, 6:44 AM • Artificial Intelligence • 91 views

About the experiment

1. Data analysis experiment

Handling imbalanced data

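2. Dataset overview

The dataset contains credit card transactions made over two days in September 2013. Of the 284807 transactions, 492 are fraudulent (about 0.17% of the total). The input consists of 28 features, V1, V2, ..., V28, plus the transaction time Time and the transaction amount Amount. To protect privacy, the meaning of V1 to V28 is not disclosed; we only know that these 28 values are the result of a PCA transformation. The Class field gives the label of each transaction: Class=0 is normal (not fraud) and Class=1 is fraud.

3. Goal of the experiment

The goal is to build a classifier for credit card fraud on this dataset, using logistic regression.

4. Overall workflow

Understand logistic regression classification and how to use it in sklearn;
Credit card fraud is a binary classification problem in which fraudulent transactions are a very small share of all transactions, so decide which evaluation metrics are most reliable for data this imbalanced;
Complete the hands-on fraud analysis project, using data visualization to explore the data and to evaluate the model results.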
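
To see why accuracy alone is a poor guide here, consider a hypothetical baseline that labels every transaction as normal. The sketch below uses only the two counts quoted in the dataset overview; the baseline itself is an illustration and not part of the original experiment.

```python
# Counts taken from the dataset overview.
n_total = 284807
n_fraud = 492
n_normal = n_total - n_fraud

# Hypothetical baseline: predict "normal" (Class=0) for every transaction,
# so no transaction is ever flagged as fraud.
tp, fp = 0, 0
tn, fn = n_normal, n_fraud

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)

print('Fraud share: {:.4%}'.format(n_fraud / n_total))
print('Baseline accuracy: {:.4%}'.format(accuracy))
print('Baseline recall: {:.1%}'.format(recall))
```

The baseline reaches roughly 99.83% accuracy while catching no fraud at all, which is why the metrics below focus on precision, recall, and F1.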

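Background before the experiment

– How to use the logistic regression tool in sklearn:

In sklearn, a logistic regression classifier is built with the LogisticRegression() class. Its commonly used constructor parameters are:
penalty: the regularization term, l1 or l2, default l2. l2 corresponds to assuming a Gaussian prior on the model parameters, and l1 to a Laplace prior;
solver: the optimization method for the logistic regression loss. The options include liblinear, lbfgs, newton-cg, sag, and saga. Since scikit-learn 0.22 the default is lbfgs (earlier versions defaulted to liblinear). liblinear suits small datasets; for large datasets, sag or saga can be used;
max_iter: the maximum number of iterations allowed for the solver to converge, default 100;
n_jobs: the number of CPU cores used for parallel computation. The default None behaves like 1; it can also be an integer, and -1 means use all available cores.
Once the classifier has been created, call fit to train it and predict to make predictions.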
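
The constructor, fit, and predict calls described above can be sketched as follows. This is a minimal illustration on synthetic data from make_classification, not the credit card dataset; the sample size, class weights, and max_iter=1000 are arbitrary assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced two-class data (illustration only).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Pass the solver and iteration cap explicitly instead of relying on
# defaults, which have changed between scikit-learn releases.
clf = LogisticRegression(solver='lbfgs', max_iter=1000)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)                # hard 0/1 labels
proba = clf.predict_proba(X_test)[:, 1]   # estimated probability of class 1

print(pred[:10])
print(proba[:3])
```

predict returns class labels, while predict_proba (or decision_function) returns a continuous score that can later be thresholded.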

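– Model evaluation metrics

First, the four possible outcomes of a prediction, taking fraud (Class=1) as the positive class: TP (fraud predicted as fraud), FP (normal predicted as fraud), TN (normal predicted as normal), and FN (fraud predicted as normal).

Accuracy = (TP+TN)/(TP+TN+FN+FP);
Precision P = TP/(TP+FP); Recall R = TP/(TP+FN), also called sensitivity or the true positive rate.
F1 = 2PR/(P+R) is the harmonic mean of precision P and recall R; the larger the value, the better the model.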
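
As a small worked example of these formulas, the sketch below computes the metrics by hand from a confusion matrix and compares them with sklearn's own functions. The ten labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels: 6 normal (0) and 4 fraud (1) transactions.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fn + fp)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print('TP={} FP={} TN={} FN={}'.format(tp, fp, tn, fn))
print('Accuracy: {:.3f}'.format(accuracy))
print('Precision: {:.3f}'.format(precision))
print('Recall: {:.3f}'.format(recall))
print('F1: {:.3f}'.format(f1))
```

Here TP=3, FP=2, TN=4, FN=1, so precision is 3/5 = 0.6, recall is 3/4 = 0.75, and F1 is about 0.667.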

Steps

The code is as follows (example):

```python
import itertools

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    """Draw a confusion matrix as a heat map with the count in each cell."""
    plt.figure()
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    # White text on dark cells, black text on light cells
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment='center',
                 color='white' if cm[i, j] > thresh else 'black')

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()


def show_metrics(cm):
    """Print precision, recall, and F1 for the positive class (Class=1)."""
    tp = cm[1, 1]
    fn = cm[1, 0]
    fp = cm[0, 1]
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print('Precision: {:.3f}'.format(precision))
    print('Recall: {:.3f}'.format(recall))
    print('F1 score: {:.3f}'.format(2 * precision * recall / (precision + recall)))


def plot_precision_recall(precision, recall, title='Precision-recall curve'):
    """Plot a precision-recall curve with the area under it shaded."""
    plt.step(recall, precision, color='b', alpha=0.2, where='post')
    plt.fill_between(recall, precision, step='post', alpha=0.2, color='b')
    plt.plot(recall, precision, linewidth=2)
    plt.xlim([0.0, 1])
    plt.ylim([0.0, 1.05])
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title(title)
    plt.show()


# Load the data and print summary statistics
data = pd.read_csv('creditcard.csv')
print(data.describe())

# Class distribution: normal (0) vs. fraud (1)
plt.figure()
ax = sns.countplot(x='Class', data=data)
plt.title('Class distribution')
plt.show()

num = len(data)
num_fraud = len(data[data['Class'] == 1])
print('Total transactions: ', num)
print('Fraudulent transactions: ', num_fraud)
print('Fraud ratio: {:.6f}'.format(num_fraud / num))

# Transaction counts over time, fraud vs. normal
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(15, 8))
bins = 50
ax1.hist(data.Time[data.Class == 1], bins=bins, color='deeppink')
ax1.set_title('Fraudulent transactions')
ax2.hist(data.Time[data.Class == 0], bins=bins, color='deepskyblue')
ax2.set_title('Normal transactions')
plt.xlabel('Time')
plt.ylabel('Number of transactions')
plt.show()

# Standardize Amount, then build the feature matrix and label vector
data['Amount_Norm'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
y = data['Class'].values
X = data.drop(['Time', 'Amount', 'Class'], axis=1).values

# Each model gets its own 90/10 split. The random seeds differ, so the two
# models are evaluated on different test sets.
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.1, random_state=33)
train_x1, test_x1, train_y1, test_y1 = train_test_split(X, y, test_size=0.1, random_state=3)

# Model 1: logistic regression
clf = LogisticRegression()
clf.fit(train_x, train_y)
predict_y = clf.predict(test_x)
score_y = clf.decision_function(test_x)

# Model 2: linear SVM
svc = LinearSVC()
svc.fit(train_x1, train_y1)
predict_y1 = svc.predict(test_x1)
score_y1 = svc.decision_function(test_x1)

# Confusion matrices
class_names = [0, 1]
cm = confusion_matrix(test_y, predict_y)
cm1 = confusion_matrix(test_y1, predict_y1)
plot_confusion_matrix(cm, classes=class_names, title='Logistic regression confusion matrix')
plot_confusion_matrix(cm1, classes=class_names, title='Linear SVC confusion matrix')

# Precision, recall, and F1: logistic regression first, then linear SVM
show_metrics(cm)
show_metrics(cm1)

# Precision-recall curves: logistic regression first, then linear SVM
precision, recall, thresholds = precision_recall_curve(test_y, score_y)
plot_precision_recall(precision, recall, title='Precision-recall curve (logistic regression)')

precision1, recall1, thresholds1 = precision_recall_curve(test_y1, score_y1)
plot_precision_recall(precision1, recall1, title='Precision-recall curve (linear SVC)')
```

Two classifiers are used, logistic regression and a linear SVM; each is labelled in the code.
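
Both models in the script are trained with default settings. One common way to account for class imbalance, which the original experiment does not use, is the class_weight='balanced' option, which weights each class inversely to its frequency. The sketch below compares it with the default on synthetic data; the dataset, the stratified split, and max_iter=1000 are assumptions made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 2% positives (illustration only).
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.98, 0.02], random_state=0)

# stratify=y keeps the class ratio the same in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

results = {}
for name, weight in [('default', None), ('balanced', 'balanced')]:
    model = LogisticRegression(class_weight=weight, max_iter=1000)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = (
        precision_score(y_test, pred, zero_division=0),
        recall_score(y_test, pred, zero_division=0),
    )
    print('{}: precision={:.3f}, recall={:.3f}'.format(name, *results[name]))
```

Balanced weighting often raises recall at the cost of precision, so whether it helps depends on how expensive missed fraud is relative to false alarms.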

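Original: https://blog.csdn.net/m0_59592892/article/details/123779390
Author: 拉垮的菜鸟
Title: Data Analysis Lab: Logistic Regression with sklearn

Original articles are protected by copyright. When reposting, please cite the source: https://www.johngo689.com/695897/

Reposted articles are protected by the original author's copyright. When reposting, please credit the original author and source.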
