python评分卡3_woe与IV分箱实现

2023年6月16日上午11:35 • 人工智能 • 阅读 108

本系列分以下章节：
python评分卡1_woe与IV值
 python评分卡2_woe与IV分箱方法
 python评分卡3_woe与IV分箱实现
 python评分卡4_logistics原理与解法_sklearn英译汉
 python评分卡5_Logit例1_plot_logistic_l1_l2_sparsity
python评分卡6_Logit例2plot_logistic_path

1.Python第三方库

如图所示，选择woe-scoring与woe-iv这两个包

; 1.1选woe-scoring

优点：

1).对于1，时间较为新，我现在写博客是2022-05-08，技术要努力学新的，新的通常兼容旧的功能，并根据旧的使用经验做了改进。
2).评分卡后期要结合Logistic回归模型，
Weight Of Evidence Transformer and LogisticRegression model with scikit-learn API翻译成中文:
证据转换器的权重和使用scikit-learn中API的Logistic回归模型

不足：

1).包太新，关于这个包的网上资料较少，甚至这个包的说明文档都没有
2).这个时候需要去github.com去寻找资料，幸运的是找到了，下面一起学习翻一下
3).包的依赖较为新，有时候需要重新安装更新一系列包

1.2选woe-iv

优点：

1).相关性强 aculate woe(weight of evidence) of each feature and then iv(information value).译文为计算每个特征的woe（证据权重），然后计算iv（信息值）。

不足：

1).时间久远,功能可能没有最新的包完善
2).不一定兼容现在的环境配置

2.woe-scoring 使用说明

2.1 WOE-Scoring

Monotone Weight Of Evidence Transformer and LogisticRegression model with scikit-learn API
证据转换器的单调权重和使用scikit-learn中API的Logistic回归模型

2.2 Quickstart 快速入门

2.2.1 Install the package: 安装包

pip install woe-scoring

2.2.2 Use WOETransformer: 使用woe转换器

import pandas as pd
from woe_scoring import WOETransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

使用数据集介绍

代码数据连接
PassengerId: 乘客在数据集中的编号
Survived：是否存活（0代表否，1代表是）
Pclass：社会阶级（1代表上层阶级，2代表中层阶级，3代表底层阶级）
Name：船上乘客的名字
Sex：船上乘客的性别
Age：船上乘客的年龄（可能存在 NaN）
SibSp：乘客在船上的兄弟姐妹和配偶的数量
Parch：乘客在船上的父母以及小孩的数量
Ticket：乘客船票的编号
Fare：乘客为船票支付的费用
Cabin：乘客所在船舱的编号（可能存在 NaN）
Embarked：乘客上船的港口（C 代表从 Cherbourg 登船，Q 代表从 Queenstown 登船，S 代表从 Southampton 登船）

df

PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked0103Braund, Mr. Owen Harris122.010A/5 211717.2500NaN01211Cumings, Mrs. John Bradley (Florence Briggs Th…038.010PC 1759971.2833C8502313Heikkinen, Miss. Laina026.000STON/O2. 31012827.9250NaN03411Futrelle, Mrs. Jacques Heath (Lily May Peel)035.01011380353.1000C12304503Allen, Mr. William Henry135.0003734508.0500NaN0…………………………………88688702Montvila, Rev. Juozas127.00021153613.0000NaN088788811Graham, Miss. Margaret Edith019.00011205330.0000B42088888903Johnston, Miss. Catherine Helen “Carrie”0NaN12W./C. 660723.4500NaN088989011Behr, Mr. Karl Howell126.00011136930.0000C148089089103Dooley, Mr. Patrick132.0003703767.7500NaN2

891 rows × 12 columns


df = pd.read_csv('titanic_data.csv')
df['Sex']=df['Sex'].apply(lambda x: 1 if x=='male' else 0 )
df['Embarked']=df['Embarked'].apply(lambda x: 1 if x=='C' else x).apply(lambda x: 2 if x=='Q' else 0 )
train, test = train_test_split(
    df, test_size=0.3, random_state=42, stratify=df["Survived"]
)

cat_cols = [
    "PassengerId",
    "Survived",
    "Name",
    "Ticket",
    "Cabin",
]

 special_cols= [
    "Pclass",
    "Sex",
    "SibSp",
    "Parch",
    "Embarked",
]

encoder = WOETransformer(
    max_bins=8,
    min_pct_group=0.1,
    diff_woe_threshold=0.1,
    cat_features=cat_cols,
    special_cols=special_cols,
    n_jobs=-1,
    merge_type='chi2',
)

encoder.fit(train, train["Survived"])
encoder.save("train_dict.json")

enc_train = encoder.transform(train)
enc_test = encoder.transform(test)

model = LogisticRegression()
model.fit(enc_train, train["Survived"])
test_proba = model.predict_proba(enc_test)[:, 1]

2.2.3 Use CreateModel: 使用创立的model

import pandas as pd
from woe_scoring import CreateModel
from sklearn.model_selection import train_test_split

df = pd.read_csv('titanic_data.csv')
df['Sex']=df['Sex'].apply(lambda x: 1 if x=='male' else 0 )
df['Embarked']=df['Embarked'].apply(lambda x: 1 if x=='C' else x).apply(lambda x: 2 if x=='Q' else 0 )
train, test = train_test_split(
    df, test_size=0.3, random_state=42, stratify=df["Survived"]
)

cat_cols = [
    "PassengerId",
    "Survived",
    "Name",
    "Ticket",
    "Cabin",
]

model = CreateModel(
    max_vars=0.8,
    special_cols=special_cols,
    n_jobs=-1,
    random_state=42,
    class_weight='balanced',
    cv=3,
)
model.fit(train, train["Survived"])
model.save_reports("/")
test_proba = model.predict_proba(test[model.feature_names_])

D:\d_programe\Anaconda3\lib\site-packages\statsmodels\tsa\tsatools.py:142: FutureWarning: In a future version of pandas all arguments of concat except for the argument 'objs' will be keyword-only
  x = pd.concat(x[::order], 1)

Optimization terminated successfully.

         Current function value: 0.462147
         Iterations 6

D:\d_programe\Anaconda3\lib\site-packages\statsmodels\tsa\tsatools.py:142: FutureWarning: In a future version of pandas all arguments of concat except for the argument 'objs' will be keyword-only
  x = pd.concat(x[::order], 1)

3.woe-iv

3.1 项目描述

首先我们看下包的项目描述，这是我们了解Python第三方库最快的途径

; Features 特征

Calculation of WOE and IV 计算woe和iv

def WOE(cls, data, varList, type0=’Con’, target_id=’y’, resfile=’result.xlsx’):
“”” 对分类变量直接进行分组统计并进行WOE、IV值计算对连续型变量进行分组（default:10）后进行WOE、IV值计算 :
param data: pandas DataFrame, mostly refer to ABT(Analysis Basics Table) :
param varList: variable list :
param type0: Continuous or Discontinuous(Category), ‘con’ is the required input for Continuous :
param target_id: y flag when gen the train data :
param resfile: download path of the result file of WOE and IV :
return: pandas DataFrame, result of woe and iv value according y flag
“””

定义函数 WOE(……):
“””
参数 data: 变量是pandas DataFrame类型,主要参考ABT（分析基础表）：
参数 varList: 变量是list类型 :
参数 type0: 连续或者非连续（类别） , 当值’con’时要求输入的是连续型变量
参数 target_id: 生成列车数据时的y标志：
参数 resfile: WOE和IV结果文件的下载路径;
return返回值）: pandas DataFrame结构类型, 根据y标志计算的woe和iv值结果;
“””

2 Apply of WOE repalcement of ABT 应用WOE替代ABT（分析基础表）

def applyWOE(cls, X_data, X_map, var_list, id_cols_list=None, flag_y=None):
“””将最优分箱的结果WOE值对原始数据进行编码 :
param X_data: pandas DataFrame, mostly refer to ABT(Analysis Basics Table) :
param X_map: pandas dataframe, map table, result of applying WOE, refer the func woe_iv.WOE :
param var_list: variable list :
param id_cols_list: some other features not been analysed but wanted like id, adress, etc. :
param flag_y: y flag when gen the train data :
return: pandas DataFrame, result of bining with y flag
“””

参数 X_data: pandas DataFrame结构类型, mostly refer to ABT(Analysis Basics Table) :
参数 X_map: pandas dataframe结构类型, map table, result of applying WOE, refer the func woe_iv.WOE :
参数 var_list: 变脸是 list 结构类型:
参数 id_cols_list: some other features not been analysed but wanted like id, adress, etc. :
参数 flag_y: y flag when gen the train data :
返回值: pandas DataFrame结构类型, result of bining with y flag

4.Python编码计算iv 与woe

4.1 函数构造

import pandas as pd, numpy as np, os, re, math, time

def is_monotonic(temp_series):
    return all(temp_series[i]  temp_series[i + 1] for i in range(len(temp_series) - 1)) or all(temp_series[i] >= temp_series[i + 1] for i in range(len(temp_series) - 1))

def prepare_bins(bin_data, c_i, target_col, max_bins):
    force_bin = True
    binned = False
    remarks = np.nan

    for n_bins in range(max_bins, 2, -1):
        try:
            bin_data[c_i + "_bins"] = pd.qcut(bin_data[c_i], n_bins, duplicates="drop")
            monotonic_series = bin_data.groupby(c_i + "_bins")[target_col].mean().reset_index(drop=True)
            if is_monotonic(monotonic_series):
                force_bin = False
                binned = True
                remarks = "binned monotonically"
                break
        except:
            pass

    if force_bin or (c_i + "_bins" in bin_data and bin_data[c_i + "_bins"].nunique() < 2):
        _min=bin_data[c_i].min()
        _mean=bin_data[c_i].mean()
        _max=bin_data[c_i].max()
        bin_data[c_i + "_bins"] = pd.cut(bin_data[c_i], [_min, _mean, _max], include_lowest=True)
        if bin_data[c_i + "_bins"].nunique() == 2:
            binned = True
            remarks = "binned forcefully"

    if binned:
        return c_i + "_bins", remarks, bin_data[[c_i, c_i+"_bins", target_col]].copy()
    else:
        remarks = "couldn't bin"
        return c_i, remarks, bin_data[[c_i, target_col]].copy()

def iv_woe_4iter(binned_data, target_col, class_col):
    if "_bins" in class_col:
        binned_data[class_col] = binned_data[class_col].cat.add_categories(['Missing'])
        binned_data[class_col] = binned_data[class_col].fillna("Missing")
        temp_groupby = binned_data.groupby(class_col).agg({class_col.replace("_bins", ""):["min", "max"],
                                                           target_col: ["count", "sum", "mean"]}).reset_index()
    else:
        binned_data[class_col] = binned_data[class_col].fillna("Missing")
        temp_groupby = binned_data.groupby(class_col).agg({class_col:["first", "first"],
                                                           target_col: ["count", "sum", "mean"]}).reset_index()

    temp_groupby.columns = ["sample_class", "min_value", "max_value", "sample_count", "event_count", "event_rate"]
    temp_groupby["non_event_count"] = temp_groupby["sample_count"] - temp_groupby["event_count"]
    temp_groupby["non_event_rate"] = 1 - temp_groupby["event_rate"]
    temp_groupby = temp_groupby[["sample_class", "min_value", "max_value", "sample_count",
                                 "non_event_count", "non_event_rate", "event_count", "event_rate"]]

    if "_bins" not in class_col and "Missing" in temp_groupby["min_value"]:
        temp_groupby["min_value"] = temp_groupby["min_value"].replace({"Missing": np.nan})
        temp_groupby["max_value"] = temp_groupby["max_value"].replace({"Missing": np.nan})
    temp_groupby["feature"] = class_col
    if "_bins" in class_col:
        temp_groupby["sample_class_label"]=temp_groupby["sample_class"].replace({"Missing": np.nan}).astype('category').cat.codes.replace({-1: np.nan})
    else:
        temp_groupby["sample_class_label"]=np.nan
    temp_groupby = temp_groupby[["feature", "sample_class", "sample_class_label", "sample_count", "min_value", "max_value",
                                 "non_event_count", "non_event_rate", "event_count", "event_rate"]]

"""
    **********get distribution of good and bad 得到好的和坏的分布
"""
    temp_groupby['distbn_non_event'] = temp_groupby["non_event_count"]/temp_groupby["non_event_count"].sum()
    temp_groupby['distbn_event'] = temp_groupby["event_count"]/temp_groupby["event_count"].sum()

    temp_groupby['woe'] = np.log(temp_groupby['distbn_non_event'] / temp_groupby['distbn_event'])
    temp_groupby['iv'] = (temp_groupby['distbn_non_event'] - temp_groupby['distbn_event']) * temp_groupby['woe']

    temp_groupby["woe"] = temp_groupby["woe"].replace([np.inf,-np.inf],0)
    temp_groupby["iv"] = temp_groupby["iv"].replace([np.inf,-np.inf],0)

    return temp_groupby

"""
- iterate over all features. 迭代所有功能。
- calculate WOE & IV for there classes.计算这些类别的woe与iv值
- append to one DataFrame woe_iv.追加到数据帧woe_iv中。
"""
def var_iter(data, target_col, max_bins):
    woe_iv = pd.DataFrame()
    remarks_list = []
    for c_i in data.columns:
        if c_i not in [target_col]:

"""
            ----logic---
            binning is done only when feature is continuous and non-binary. 仅当数据是连续型，且不是二进制时才会处理
            Note: Make sure dtype of continuous columns in dataframe is not object. 确保dataframe中连续列的数据类型不是object。
"""
            c_i_start_time=time.time()
            if np.issubdtype(data[c_i], np.number) and data[c_i].nunique() > 2:
                class_col, remarks, binned_data = prepare_bins(data[[c_i, target_col]].copy(), c_i, target_col, max_bins)
                agg_data = iv_woe_4iter(binned_data.copy(), target_col, class_col)
                remarks_list.append({"feature": c_i, "remarks": remarks})
            else:
                agg_data = iv_woe_4iter(data[[c_i, target_col]].copy(), target_col, c_i)
                remarks_list.append({"feature": c_i, "remarks": "categorical"})

            woe_iv = woe_iv.append(agg_data)
    return woe_iv, pd.DataFrame(remarks_list)

def get_iv_woe(data, target_col, max_bins):
    func_start_time = time.time()
    woe_iv, binning_remarks = var_iter(data, target_col, max_bins)
    print("------------------IV and WOE calculated for individual groups.------------------")
    print("Total time elapsed: {} minutes".format(round((time.time() - func_start_time) / 60, 3)))

    woe_iv["feature"] = woe_iv["feature"].replace("_bins", "", regex=True)
    woe_iv = woe_iv[["feature", "sample_class", "sample_class_label", "sample_count", "min_value", "max_value",
                     "non_event_count", "non_event_rate", "event_count", "event_rate", 'distbn_non_event',
                     'distbn_event', 'woe', 'iv']]

    iv = woe_iv.groupby("feature")[["iv"]].agg(["sum", "count"]).reset_index()
    print("------------------Aggregated IV values for features calculated.------------------")
    print("Total time elapsed: {} minutes".format(round((time.time() - func_start_time) / 60, 3)))

    iv.columns = ["feature", "iv", "number_of_classes"]
    null_percent_data=pd.DataFrame(data.isnull().mean()).reset_index()
    null_percent_data.columns=["feature", "feature_null_percent"]
    iv=iv.merge(null_percent_data, on="feature", how="left")
    print("------------------Null percent calculated in features.------------------")
    print("Total time elapsed: {} minutes".format(round((time.time() - func_start_time) / 60, 3)))
    iv = iv.merge(binning_remarks, on="feature", how="left")
    woe_iv = woe_iv.merge(iv[["feature", "iv", "remarks"]].rename(columns={"iv": "iv_sum"}), on="feature", how="left")
    print("------------------Binning remarks added and process is complete.------------------")
    print("Total time elapsed: {} minutes".format(round((time.time() - func_start_time) / 60, 3)))
    return iv, woe_iv.replace({"Missing": np.nan})

4.2 函数调用

Step-1 : 加载数据

代码数据连接

data=pd.read_csv("data.csv")
print(data.shape)

(1000, 8)

step-2 调用get_iv_woe函数

iv, woe_iv = get_iv_woe(data.copy(), target_col="bad_customer", max_bins=20)
print(iv.shape, woe_iv.shape)

Total time elapsed: 0.009 minutes
Total time elapsed: 0.01 minutes
(7, 5) (49, 16)

D:\d_programe\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3444: PerformanceWarning: indexing past lexsort depth may impact performance.

  exec(code_obj, self.user_global_ns, self.user_ns)
D:\d_programe\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3444: PerformanceWarning: indexing past lexsort depth may impact performance.

  exec(code_obj, self.user_global_ns, self.user_ns)

注意：确保dataframe中连续列的数据类型不是object。如果是object，它将被认为是分类的，将不会处理

参考引用：
1.https://github.com/klaudia-nazarko
2.https://pypi.org/project/woe-iv/
3.https://pypi.org/project/woe-scoring/

Original: https://blog.csdn.net/u012338969/article/details/124655874
Author: 雪龙无敌
Title: python评分卡3_woe与IV分箱实现

原创文章受到原创版权保护。转载请注明出处：https://www.johngo689.com/623780/

转载文章受原作者版权保护。转载请注明原作者出处！

人工智能

【自取】最近整理的，有需要可以领取学习：

Linux核心资料大放送~

全栈面试题汇总（持续更新&可下载）

一个提高学习100%效率的工具！

【超详细】深度学习面试题目！

LeetCode Python刷题答案下载！

LeetCode Java版刷题答案下载！

LeetCode C++ 版本，抓紧保存！

LeetCode GO语言刷题答案下载！

一些三维重建知识点

目录 0 我的疑问 1 什么是点云？ * 1.1 何为点云？ 1.2 从何而来？ 1.3 点云有什么用？ 2 深度图像、点云、体素、网格 3 三维重建流程 4 常用三维数据集 5 …

人工智能 2023年5月28日
0080
Anaconda虚拟环境安装GPU版本的pytorch

1. 首先安装好Anaconda 2. 安装完之后Anaconda默认是没有桌面快捷方式的，需要点击【开始】，找到Anaconda，点击Anaconda Prompt 3. Ana…

人工智能 2023年7月22日
0079
【python量化】搭建一个CNN-LSTM模型用于股票价格预测

写在前面下面的这篇文章主要教大家如何搭建一个基于CNN-LSTM的股票预测模型，并将其用于股票价格预测当中。原代码在文末进行获取。 1 CNN-LSTM模型这篇文章将带大家通…

人工智能 2023年6月23日
00109
AI智能图像识别的工作原理及行业应用

AI智能图像识别（人工智能(AI)的一部分）是当今一个正在蓄势待发的人工智能大趋势。富维图像也正在从事图像识别技术研发和应用。数据显示，人工智能图像识别市场规模已达到近390亿美元…

人工智能 2023年6月16日
00111
2021 CCF大数据与计算智能大赛个贷违约预测top 73 解决方案

目录一、概述二、解题过程 * 2.1 数据 2.2 构建基线 2.3 进阶思路一 2.4 进阶思路二 2.5 进阶思路三 2.6 融合 2.7 调优提分过程 2.8 其他工作 …

人工智能 2023年6月19日
0089
命名实体识别BiLSTM-CRF代码实现

命名实体识别BiLSTM-CRF代码实现 – 潘登同学的NLP笔记文章目录 * – 命名实体识别BiLSTM-CRF代码实现 — 潘登同学的NLP笔记* …

人工智能 2023年6月16日
0099
医学影像处理工具：SimpleITK学习笔记（二）图像基本操作

本节大纲 sitk图像的构造 sitk中的常见属性值像素的相关操作 SimpleITK和Numpy的相互转换类型转化中的index顺序 sitk图像的构造 sitk有几种构建图…

人工智能 2023年5月26日
0070
OpenCV-Python＜八＞图像平滑处理

消除图像中的噪音成分，叫做图像的平滑处理或者图像滤波。即在尽量保留图像细节特征的情况下对目标图像的噪声进行抑制。它是图像预处理过程中不可缺少的步骤。处理效果的好坏将直接影响到后续图…

人工智能 2023年7月19日
0070
Perl脚本获取.bash_profile中变量

Perl脚本获取.bash_profile中变量 最近在修改MySQL…

人工智能 2023年6月29日
0086
PyCharm更换pip源为国内源、模块安装、PyCharm依赖包导入导出教程

一、更换pip为国内源 1.使用PyCharm创建一个工程 2.通过 File -> Setting…选择解释器为本工程下的Python解释器。 3.单…

人工智能 2023年7月5日
00333
跟李沐学AI 动手学深度学习环境配置d2l、pytorch的安装（windows环境、python版本3.7）

我们的任务主要有：配置过程中主要参考了以下文章：https://blog.csdn.net/qq_38311396/article/details/120768038 ; 配置详…

人工智能 2023年6月24日
00146
第4章 docker仓库管理

1.公共仓库地址 2.登录docker hub docker login https://hub.docker.com 3.搜索镜像 docker search centos 4….

人工智能 2023年6月28日
00102
SSM整合

回答1： SSM指的是 MVC+ 这一组合，而 Security是框架中用于安全认证和授权的模块。将 Security，可以在SSM应用中提供更加完善的安全控制和认证功能。具体…

人工智能 2023年6月29日
0082
防止过拟合的方法

尽管过拟合是机器学习中的一个错误，它会降低模型的性能，但是，我们可以通过多种方式来防止它。使用线性模型，我们可以避免过拟合；然而，许多现实世界的问题是非线性的。防止模型过度拟合很重…

人工智能 2023年6月16日
00100
树莓派挂载exfat和ntfs格式硬盘优盘

树莓派 Linux系统默认可以自动识别到fat32格式的盘，但fat32支持的文件不能大于4G，所以只能将移动硬盘和U盘格式化为NTFS和exFAT这两种格式的，闪迪U盘一般默认格…

人工智能 2023年6月12日
00116
字节面试问到CPU的多级缓存架构，诸佬们怎么回答？

前言：大家好，我是小威，24届毕业生，上周在面试字节中，问到了一个关于CPU多级缓存架构的问题，当时答得并不是很好，之后查阅了资料，对此进行了复盘总结。如果文章有什么需要改进的地方…

人工智能 2023年7月31日
0071

2024 年 5 月
一	二	三	四	五	六	日
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31