Oversampling, Undersampling, Combined Sampling, and Ensemble Sampling in Data Processing

1. Imports

from sklearn.datasets import load_iris
from sklearn.datasets import load_breast_cancer
from collections import Counter

from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import SMOTEN
from imblearn.over_sampling import SMOTENC
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.over_sampling import SVMSMOTE
from imblearn.over_sampling import KMeansSMOTE
from imblearn.over_sampling import ADASYN

from imblearn.under_sampling import RandomUnderSampler
from imblearn.under_sampling import ClusterCentroids
from imblearn.under_sampling import NearMiss
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.under_sampling import RepeatedEditedNearestNeighbours
from imblearn.under_sampling import AllKNN
from imblearn.under_sampling import CondensedNearestNeighbour
from imblearn.under_sampling import OneSidedSelection
from imblearn.under_sampling import NeighbourhoodCleaningRule
from imblearn.under_sampling import InstanceHardnessThreshold

from imblearn.combine import SMOTEENN
from imblearn.combine import SMOTETomek

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

from sklearn.ensemble import BaggingClassifier
from imblearn.ensemble import BalancedBaggingClassifier
from imblearn.ensemble import RUSBoostClassifier
from imblearn.ensemble import EasyEnsembleClassifier
from imblearn.ensemble import BalancedRandomForestClassifier

2. Choosing a Dataset

The iris dataset is fairly balanced, while the breast cancer dataset is imbalanced, so the two make good sample data.

X, y = load_iris(return_X_y=True)
print(X, y, X.shape, y.shape, len(X), len(y), sep="\n")
print(sorted(Counter(y).items()))

X, y = load_breast_cancer(return_X_y=True)
print(X, y)
print(X.shape, y.shape)
print(len(X), len(y))
print(sorted(Counter(y).items()))
count0, count1 = 0, 0
for yy in y:
    if yy == 0:
        count0 += 1
    elif yy == 1:
        count1 += 1
print(count0, count1)

3. Oversampling

RandomOverSampler: random oversampling

ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

SMOTE: for a minority-class sample a, randomly select one of its nearest minority neighbors b, then pick a random point c on the line segment between a and b as a new minority-class sample.
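The interpolation step can be sketched in a few lines (a minimal illustration with made-up points a and b, not imblearn's internal code):

```python
import numpy as np

rng = np.random.default_rng(42)

# a: a minority-class sample; b: one of its nearest minority-class neighbors
a = np.array([1.0, 2.0])
b = np.array([3.0, 6.0])

# SMOTE draws gap ~ U[0, 1) and places the synthetic point c on the segment a-b
gap = rng.random()
c = a + gap * (b - a)

print(c)  # c lies between a and b in every coordinate
```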

ros = SMOTE(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

If the dataset consists only of categorical features, SMOTEN can be used.

ros = SMOTEN(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

SMOTENC is a SMOTE variant that can handle categorical features.
It only applies when the data contains both numerical and categorical features.

# note: the breast cancer features are actually all numeric; columns 18 and 19
# are declared categorical here purely to demonstrate the API
ros = SMOTENC(categorical_features=[18, 19], random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

Because SMOTE selects minority samples at random and ignores their surroundings when generating new samples, two problems can arise:
if the selected minority sample is surrounded by other minority samples, the synthesized sample adds little useful information;
if the selected minority sample is surrounded by majority samples, it may be noise, and the synthesized sample is likely to overlap with the surrounding majority class.

=========================================================
BorderlineSMOTE therefore improves on SMOTE by synthesizing new samples only from minority samples on the class border. The algorithm first divides all minority samples into three categories:
noise: noisy samples, i.e. minority samples whose K nearest neighbors are all majority samples;
danger: dangerous samples, i.e. minority samples for which at least half of the K nearest neighbors are majority samples;
safe: safe samples, i.e. minority samples for which more than half of the K nearest neighbors are minority samples.
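These three categories can be counted directly with a k-NN query. The sketch below classifies the breast cancer minority class this way (k=5 is an arbitrary choice for illustration, and the neighbor search uses plain Euclidean distance on the raw features):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import NearestNeighbors

X, y = load_breast_cancer(return_X_y=True)
minority_label = 0          # class 0 (malignant) is the minority class here
k = 5                       # arbitrary neighborhood size for illustration

# k+1 neighbors because each point is returned as its own nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, idx = nn.kneighbors(X[y == minority_label])

noise = danger = safe = 0
for row in idx:
    majority = np.sum(y[row[1:]] != minority_label)  # majority points among the k neighbors
    if majority == k:
        noise += 1          # all neighbors are majority: treated as noise
    elif majority >= k / 2:
        danger += 1         # at least half are majority: borderline sample
    else:
        safe += 1           # mostly minority neighbors: safe sample
print(noise, danger, safe)
```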

=========================================================
BorderlineSMOTE randomly selects only from samples in the "danger" category and then uses SMOTE to generate new samples.
"danger" samples are minority samples close to the class border, and samples near the border are the ones most easily misclassified.
Border-line SMOTE therefore synthesizes samples only from minority samples near the border, whereas plain SMOTE treats all minority samples alike.

=========================================================
Border-line SMOTE comes in two variants: Borderline-1 SMOTE and Borderline-2 SMOTE.
In Borderline-1 SMOTE, the x̂ in the synthesis formula is a minority-class sample.
In Borderline-2 SMOTE, x̂ can be any of the k nearest neighbors, including majority samples.

ros = BorderlineSMOTE(kind="borderline-1", random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

SVMSMOTE uses a support vector machine classifier to find support vectors, then generates new minority-class samples around them using SMOTE interpolation.

ros = SVMSMOTE(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

KMeansSMOTE applies K-means clustering before oversampling with SMOTE.
It consists of three steps: clustering, filtering, and oversampling.
In the clustering step, K-means partitions the data into k groups. The filtering step selects the clusters to oversample, keeping those with a high proportion of minority-class samples.
It then allocates the number of synthetic samples per cluster, assigning more to clusters where minority samples are sparsely distributed.
Finally, the oversampling step applies SMOTE within each selected cluster to reach the target ratio of minority to majority instances.

# note: KMeansSMOTE raises an error when no cluster contains enough minority
# samples; lowering cluster_balance_threshold can help on such datasets
ros = KMeansSMOTE(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

ADASYN (adaptive synthetic sampling) focuses on generating new minority-class samples next to original samples that are misclassified by a K-nearest-neighbor classifier.

ros = ADASYN(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

4. Undersampling

RandomUnderSampler: random undersampling

rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

ClusterCentroids replaces each cluster of samples with the centroid synthesized by the K-means algorithm.

rus = ClusterCentroids(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

NearMiss adds heuristic rules for choosing which majority samples to keep (version=1, 2, or 3). Version 3 behaves differently from versions 1 and 2; on this dataset it gives (277, 30) (277,) 277 277 [(0, 212), (1, 65)].

rus = NearMiss(version=3)
X_resampled, y_resampled = rus.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

EditedNearestNeighbours applies a nearest-neighbor rule to "edit" the dataset, finding samples that disagree with their neighbors and removing them.

enn = EditedNearestNeighbours()
X_resampled, y_resampled = enn.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

RepeatedEditedNearestNeighbours repeats the basic EditedNearestNeighbours algorithm several times.

renn = RepeatedEditedNearestNeighbours()
X_resampled, y_resampled = renn.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

AllKNN increases the number of nearest neighbors at each iteration.

allknn = AllKNN()
X_resampled, y_resampled = allknn.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

CondensedNearestNeighbour iterates with a 1-nearest-neighbor rule to decide whether each sample should be kept or removed. The algorithm works as follows:

1. Put all minority-class samples into a set C.
2. Pick one sample of the target class (the class to be undersampled), add it to C, and put the rest of that class into a set S.
3. Go through S sample by sample: train a 1-NN classifier on C and classify each sample in S.
4. Move the misclassified samples from S into C.
5. Repeat until no more samples are added to C.
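The steps above can be sketched with sklearn's 1-NN classifier (a simplified version that refits once per pass over S rather than after every single addition, so its result may differ slightly from imblearn's):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
minority_label = 0

# C starts with every minority sample plus one sample of the class to shrink
C = list(np.where(y == minority_label)[0])
S = list(np.where(y != minority_label)[0])
C.append(S.pop(0))

changed = True
while changed:                            # repeat until a full pass adds nothing
    changed = False
    knn = KNeighborsClassifier(n_neighbors=1).fit(X[C], y[C])
    remaining = []
    for i in S:
        if knn.predict(X[i:i + 1])[0] != y[i]:
            C.append(i)                   # misclassified by 1-NN: keep it in C
            changed = True
        else:
            remaining.append(i)           # correctly classified: candidate to drop
    S = remaining

print(len(C), len(y))                     # condensed set vs. original size
```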

cnn = CondensedNearestNeighbour(random_state=42)
X_resampled, y_resampled = cnn.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

OneSidedSelection uses the TomekLinks method to remove noisy majority-class samples.

oss = OneSidedSelection(random_state=42)
X_resampled, y_resampled = oss.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

NeighbourhoodCleaningRule focuses on cleaning the data rather than condensing it;
it therefore removes the union of the samples rejected by EditedNearestNeighbours and the samples misclassified by a 3-NN classifier.

ncr = NeighbourhoodCleaningRule()
X_resampled, y_resampled = ncr.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

InstanceHardnessThreshold is a rather special method: it fits a classifier on the data and removes the samples whose predicted probability falls below a threshold.

iht = InstanceHardnessThreshold(random_state=42)
X_resampled, y_resampled = iht.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

5. Combined Sampling

SMOTEENN combines oversampling and undersampling: SMOTE followed by cleaning with EditedNearestNeighbours.

smoteenn = SMOTEENN(random_state=42)
X_resampled, y_resampled = smoteenn.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

SMOTETomek combines oversampling and undersampling: SMOTE followed by removal of Tomek links.

smotetomek = SMOTETomek(random_state=42)
X_resampled, y_resampled = smotetomek.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

6. Splitting the Data

Here X and y refer to the breast cancer dataset from load_breast_cancer.

trainX, testX, trainY, testY = train_test_split(X, y, random_state=42)  # stratify=y would preserve the class ratio in both splits

7. Ensemble Sampling

BaggingClassifier does not balance each bootstrap subset, so a bagging classifier trained on imbalanced data is biased toward the majority class.

# base_estimator was renamed to estimator in recent sklearn/imblearn releases
bc = BaggingClassifier(estimator=DecisionTreeClassifier(), random_state=42)
bc.fit(trainX, trainY)
preY = bc.predict(testX)
print(preY)
print(confusion_matrix(testY, preY))

BalancedBaggingClassifier resamples each subset before training each base estimator;
in short, it combines an EasyEnsemble-style sampler with a classifier such as BaggingClassifier.
BalancedBaggingClassifier takes the same parameters as sklearn's BaggingClassifier, plus two extra ones, sampling_strategy and replacement, which control how the random undersampling is performed.

bbc = BalancedBaggingClassifier(estimator=DecisionTreeClassifier(), random_state=42)
bbc.fit(trainX, trainY)
preY = bbc.predict(testX)
print(preY)
print(confusion_matrix(testY, preY))

RUSBoostClassifier performs a random undersampling before each boosting iteration.

rbc = RUSBoostClassifier(estimator=DecisionTreeClassifier(), random_state=42)
rbc.fit(trainX, trainY)
preY = rbc.predict(testX)
print(preY)
print(confusion_matrix(testY, preY))

EasyEnsemble builds an ensemble by repeatedly random-undersampling the original dataset.

EasyEnsembleClassifier trains AdaBoost learners on the balanced subsets.
AdaBoost computes each weak learner's error rate, assigns larger weights to misclassified samples and smaller weights to correctly classified ones. Any weak learner with accuracy above 0.5 can become a member of the final classifier, and the more accurate the weak learner, the larger its weight.
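The weight update described above can be illustrated numerically (made-up labels and predictions showing the standard AdaBoost rule, not imblearn's internal code):

```python
import numpy as np

# one weak learner on four samples, labels in {-1, +1}; it misclassifies sample 1
y_true = np.array([1, 1, -1, -1])
y_pred = np.array([1, -1, -1, -1])
w = np.full(4, 0.25)                       # uniform initial sample weights

err = w[y_true != y_pred].sum()            # weighted error rate: 0.25 (< 0.5, so usable)
alpha = 0.5 * np.log((1 - err) / err)      # learner weight: higher accuracy -> larger alpha
w = w * np.exp(-alpha * y_true * y_pred)   # up-weight mistakes, down-weight correct ones
w = w / w.sum()                            # renormalize to a distribution

print(alpha, w)                            # the misclassified sample now carries weight 0.5
```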

eec = EasyEnsembleClassifier(random_state=42)  # the default estimator is AdaBoostClassifier, matching the description above
eec.fit(trainX, trainY)
preY = eec.predict(testX)
print(preY)
print(confusion_matrix(testY, preY))

BalancedRandomForestClassifier builds each tree on a balanced bootstrap subset of the data.

brfc = BalancedRandomForestClassifier(random_state=42)
brfc.fit(trainX, trainY)
preY = brfc.predict(testX)
print(preY)
print(confusion_matrix(testY, preY))
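When comparing the confusion matrices above, note that plain accuracy can be misleading on imbalanced data; sklearn's balanced_accuracy_score (the mean of per-class recall) is a safer summary. A small illustration with made-up predictions:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

# hypothetical test set: 90 negatives, 10 positives; the model recalls only 2 positives
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.array([0] * 90 + [1] * 2 + [0] * 8)

print(confusion_matrix(y_true, y_pred))          # rows = true class, cols = predicted
print((y_true == y_pred).mean())                 # plain accuracy: 0.92, looks fine
print(balanced_accuracy_score(y_true, y_pred))   # (1.0 + 0.2) / 2 = 0.6, reveals the problem
```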

Thanks for reading; I hope this article was useful to you.
Corrections and suggestions are welcome!

Original: https://blog.csdn.net/qq_38500228/article/details/122602498
Author: 一览天下945
Title: Oversampling, Undersampling, Combined Sampling, and Ensemble Sampling in Data Processing
