Oversampling, Undersampling, Combined Sampling, and Ensemble Sampling in Data Processing

1. Imports

from sklearn.datasets import load_iris
from sklearn.datasets import load_breast_cancer
from collections import Counter

from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import SMOTEN
from imblearn.over_sampling import SMOTENC
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.over_sampling import SVMSMOTE
from imblearn.over_sampling import KMeansSMOTE
from imblearn.over_sampling import ADASYN

from imblearn.under_sampling import RandomUnderSampler
from imblearn.under_sampling import ClusterCentroids
from imblearn.under_sampling import NearMiss
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.under_sampling import RepeatedEditedNearestNeighbours
from imblearn.under_sampling import AllKNN
from imblearn.under_sampling import CondensedNearestNeighbour
from imblearn.under_sampling import OneSidedSelection
from imblearn.under_sampling import NeighbourhoodCleaningRule
from imblearn.under_sampling import InstanceHardnessThreshold

from imblearn.combine import SMOTEENN
from imblearn.combine import SMOTETomek

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

from sklearn.ensemble import BaggingClassifier
from imblearn.ensemble import BalancedBaggingClassifier
from imblearn.ensemble import RUSBoostClassifier
from imblearn.ensemble import EasyEnsembleClassifier
from imblearn.ensemble import BalancedRandomForestClassifier

2. Choosing a Dataset

The iris dataset is fairly balanced, while the breast cancer dataset is imbalanced, so the two make good sample data.

X, y = load_iris(return_X_y=True)
print(X, y, X.shape, y.shape, len(X), len(y), sep="\n")
print(sorted(Counter(y).items()))

X, y = load_breast_cancer(return_X_y=True)
print(X, y)
print(X.shape, y.shape)
print(len(X), len(y))
print(sorted(Counter(y).items()))
count0, count1 = 0, 0
for yy in y:
    if yy == 0:
        count0 += 1
    elif yy == 1:
        count1 += 1
print(count0, count1)

3. Oversampling

RandomOverSampler: random oversampling

ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

SMOTE: for a minority-class sample a, randomly select one of its nearest minority neighbors b, then pick a random point c on the line segment between a and b as a new minority-class sample.
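The interpolation step can be sketched in a few lines (a minimal illustration with made-up points a and b, not imblearn's internal code):

```python
import numpy as np

rng = np.random.default_rng(42)

# a: a minority-class sample; b: one of its nearest minority-class neighbors
a = np.array([1.0, 2.0])
b = np.array([3.0, 6.0])

# SMOTE draws gap ~ U[0, 1) and places the synthetic point c on the segment a-b
gap = rng.random()
c = a + gap * (b - a)

print(c)  # c lies between a and b in every coordinate
```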

ros = SMOTE(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

If the dataset consists only of categorical features, SMOTEN can be used.

ros = SMOTEN(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

SMOTENC is a SMOTE variant that can handle categorical features.
It only applies when the data contains both numerical and categorical features.

# note: the breast cancer features are actually all numeric; columns 18 and 19
# are declared categorical here purely to demonstrate the API
ros = SMOTENC(categorical_features=[18, 19], random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

Because SMOTE selects minority samples at random and ignores their surroundings when generating new samples, two problems can arise:
if the selected minority sample is surrounded by other minority samples, the synthesized sample adds little useful information;
if the selected minority sample is surrounded by majority samples, it may be noise, and the synthesized sample is likely to overlap with the surrounding majority class.

=========================================================
BorderlineSMOTE therefore improves on SMOTE by synthesizing new samples only from minority samples on the class border. The algorithm first divides all minority samples into three categories:
noise: noisy samples, i.e. minority samples whose K nearest neighbors are all majority samples;
danger: dangerous samples, i.e. minority samples for which at least half of the K nearest neighbors are majority samples;
safe: safe samples, i.e. minority samples for which more than half of the K nearest neighbors are minority samples.
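These three categories can be counted directly with a k-NN query. The sketch below classifies the breast cancer minority class this way (k=5 is an arbitrary choice for illustration, and the neighbor search uses plain Euclidean distance on the raw features):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import NearestNeighbors

X, y = load_breast_cancer(return_X_y=True)
minority_label = 0          # class 0 (malignant) is the minority class here
k = 5                       # arbitrary neighborhood size for illustration

# k+1 neighbors because each point is returned as its own nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, idx = nn.kneighbors(X[y == minority_label])

noise = danger = safe = 0
for row in idx:
    majority = np.sum(y[row[1:]] != minority_label)  # majority points among the k neighbors
    if majority == k:
        noise += 1          # all neighbors are majority: treated as noise
    elif majority >= k / 2:
        danger += 1         # at least half are majority: borderline sample
    else:
        safe += 1           # mostly minority neighbors: safe sample
print(noise, danger, safe)
```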

=========================================================
BorderlineSMOTE randomly selects only from samples in the "danger" category and then uses SMOTE to generate new samples.
"danger" samples are minority samples close to the class border, and samples near the border are the ones most easily misclassified.
Border-line SMOTE therefore synthesizes samples only from minority samples near the border, whereas plain SMOTE treats all minority samples alike.

=========================================================
Border-line SMOTE comes in two variants: Borderline-1 SMOTE and Borderline-2 SMOTE.
In Borderline-1 SMOTE, the x̂ in the synthesis formula is a minority-class sample.
In Borderline-2 SMOTE, x̂ can be any of the k nearest neighbors, including majority samples.

ros = BorderlineSMOTE(kind="borderline-1", random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

SVMSMOTE uses a support vector machine classifier to find support vectors, then generates new minority-class samples around them using SMOTE interpolation.

ros = SVMSMOTE(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

KMeansSMOTE applies K-means clustering before oversampling with SMOTE.
It consists of three steps: clustering, filtering, and oversampling.
In the clustering step, K-means partitions the data into k groups. The filtering step selects the clusters to oversample, keeping those with a high proportion of minority-class samples.
It then allocates the number of synthetic samples per cluster, assigning more to clusters where minority samples are sparsely distributed.
Finally, the oversampling step applies SMOTE within each selected cluster to reach the target ratio of minority to majority instances.

# note: KMeansSMOTE raises an error when no cluster contains enough minority
# samples; lowering cluster_balance_threshold can help on such datasets
ros = KMeansSMOTE(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

ADASYN (adaptive synthetic sampling) focuses on generating new minority-class samples next to original samples that are misclassified by a K-nearest-neighbor classifier.

ros = ADASYN(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

4. Undersampling

RandomUnderSampler: random undersampling

rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

ClusterCentroids replaces each cluster of samples with the centroid synthesized by the K-means algorithm.

rus = ClusterCentroids(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

NearMiss adds heuristic rules for choosing which majority samples to keep (version=1, 2, or 3). Version 3 behaves differently from versions 1 and 2; on this dataset it gives (277, 30) (277,) 277 277 [(0, 212), (1, 65)].

rus = NearMiss(version=3)
X_resampled, y_resampled = rus.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

EditedNearestNeighbours applies a nearest-neighbor rule to "edit" the dataset, finding samples that disagree with their neighbors and removing them.

enn = EditedNearestNeighbours()
X_resampled, y_resampled = enn.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

RepeatedEditedNearestNeighbours repeats the basic EditedNearestNeighbours algorithm several times.

renn = RepeatedEditedNearestNeighbours()
X_resampled, y_resampled = renn.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

AllKNN increases the number of nearest neighbors at each iteration.

allknn = AllKNN()
X_resampled, y_resampled = allknn.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

CondensedNearestNeighbour iterates with a 1-nearest-neighbor rule to decide whether each sample should be kept or removed. The algorithm works as follows:

1. Put all minority-class samples into a set C.
2. Pick one sample of the target class (the class to be undersampled), add it to C, and put the rest of that class into a set S.
3. Go through S sample by sample: train a 1-NN classifier on C and classify each sample in S.
4. Move the misclassified samples from S into C.
5. Repeat until no more samples are added to C.
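The steps above can be sketched with sklearn's 1-NN classifier (a simplified version that refits once per pass over S rather than after every single addition, so its result may differ slightly from imblearn's):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
minority_label = 0

# C starts with every minority sample plus one sample of the class to shrink
C = list(np.where(y == minority_label)[0])
S = list(np.where(y != minority_label)[0])
C.append(S.pop(0))

changed = True
while changed:                            # repeat until a full pass adds nothing
    changed = False
    knn = KNeighborsClassifier(n_neighbors=1).fit(X[C], y[C])
    remaining = []
    for i in S:
        if knn.predict(X[i:i + 1])[0] != y[i]:
            C.append(i)                   # misclassified by 1-NN: keep it in C
            changed = True
        else:
            remaining.append(i)           # correctly classified: candidate to drop
    S = remaining

print(len(C), len(y))                     # condensed set vs. original size
```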

cnn = CondensedNearestNeighbour(random_state=42)
X_resampled, y_resampled = cnn.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

OneSidedSelection uses the TomekLinks method to remove noisy majority-class samples.

oss = OneSidedSelection(random_state=42)
X_resampled, y_resampled = oss.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

NeighbourhoodCleaningRule focuses on cleaning the data rather than condensing it;
it therefore removes the union of the samples rejected by EditedNearestNeighbours and the samples misclassified by a 3-NN classifier.

ncr = NeighbourhoodCleaningRule()
X_resampled, y_resampled = ncr.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

InstanceHardnessThreshold is a rather special method: it fits a classifier on the data and removes the samples whose predicted probability falls below a threshold.

iht = InstanceHardnessThreshold(random_state=42)
X_resampled, y_resampled = iht.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

5. Combined Sampling

SMOTEENN combines oversampling and undersampling: SMOTE followed by cleaning with EditedNearestNeighbours.

smoteenn = SMOTEENN(random_state=42)
X_resampled, y_resampled = smoteenn.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

SMOTETomek combines oversampling and undersampling: SMOTE followed by removal of Tomek links.

smotetomek = SMOTETomek(random_state=42)
X_resampled, y_resampled = smotetomek.fit_resample(X, y)
print(X_resampled, y_resampled, X_resampled.shape, y_resampled.shape, sep="\n")
print(len(X_resampled), len(y_resampled))
print(sorted(Counter(y_resampled).items()))

6. Splitting the Data

Here X and y refer to the breast cancer dataset from load_breast_cancer.

trainX, testX, trainY, testY = train_test_split(X, y, random_state=42)  # stratify=y would preserve the class ratio in both splits

7. Ensemble Sampling

BaggingClassifier does not balance each bootstrap subset, so a bagging classifier trained on imbalanced data is biased toward the majority class.

# base_estimator was renamed to estimator in recent sklearn/imblearn releases
bc = BaggingClassifier(estimator=DecisionTreeClassifier(), random_state=42)
bc.fit(trainX, trainY)
preY = bc.predict(testX)
print(preY)
print(confusion_matrix(testY, preY))

BalancedBaggingClassifier resamples each subset before training each base estimator;
in short, it combines an EasyEnsemble-style sampler with a classifier such as BaggingClassifier.
BalancedBaggingClassifier takes the same parameters as sklearn's BaggingClassifier, plus two extra ones, sampling_strategy and replacement, which control how the random undersampling is performed.

bbc = BalancedBaggingClassifier(estimator=DecisionTreeClassifier(), random_state=42)
bbc.fit(trainX, trainY)
preY = bbc.predict(testX)
print(preY)
print(confusion_matrix(testY, preY))

RUSBoostClassifier performs a random undersampling before each boosting iteration.

rbc = RUSBoostClassifier(estimator=DecisionTreeClassifier(), random_state=42)
rbc.fit(trainX, trainY)
preY = rbc.predict(testX)
print(preY)
print(confusion_matrix(testY, preY))

EasyEnsemble builds an ensemble by repeatedly random-undersampling the original dataset.

EasyEnsembleClassifier trains AdaBoost learners on the balanced subsets.
AdaBoost computes each weak learner's error rate, assigns larger weights to misclassified samples and smaller weights to correctly classified ones. Any weak learner with accuracy above 0.5 can become a member of the final classifier, and the more accurate the weak learner, the larger its weight.
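The weight update described above can be illustrated numerically (made-up labels and predictions showing the standard AdaBoost rule, not imblearn's internal code):

```python
import numpy as np

# one weak learner on four samples, labels in {-1, +1}; it misclassifies sample 1
y_true = np.array([1, 1, -1, -1])
y_pred = np.array([1, -1, -1, -1])
w = np.full(4, 0.25)                       # uniform initial sample weights

err = w[y_true != y_pred].sum()            # weighted error rate: 0.25 (< 0.5, so usable)
alpha = 0.5 * np.log((1 - err) / err)      # learner weight: higher accuracy -> larger alpha
w = w * np.exp(-alpha * y_true * y_pred)   # up-weight mistakes, down-weight correct ones
w = w / w.sum()                            # renormalize to a distribution

print(alpha, w)                            # the misclassified sample now carries weight 0.5
```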

eec = EasyEnsembleClassifier(random_state=42)  # the default estimator is AdaBoostClassifier, matching the description above
eec.fit(trainX, trainY)
preY = eec.predict(testX)
print(preY)
print(confusion_matrix(testY, preY))

BalancedRandomForestClassifier builds each tree on a balanced bootstrap subset of the data.

brfc = BalancedRandomForestClassifier(random_state=42)
brfc.fit(trainX, trainY)
preY = brfc.predict(testX)
print(preY)
print(confusion_matrix(testY, preY))
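When comparing the confusion matrices above, note that plain accuracy can be misleading on imbalanced data; sklearn's balanced_accuracy_score (the mean of per-class recall) is a safer summary. A small illustration with made-up predictions:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

# hypothetical test set: 90 negatives, 10 positives; the model recalls only 2 positives
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.array([0] * 90 + [1] * 2 + [0] * 8)

print(confusion_matrix(y_true, y_pred))          # rows = true class, cols = predicted
print((y_true == y_pred).mean())                 # plain accuracy: 0.92, looks fine
print(balanced_accuracy_score(y_true, y_pred))   # (1.0 + 0.2) / 2 = 0.6, reveals the problem
```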

Thanks for reading; I hope this article was useful to you.
Corrections and suggestions are welcome!

Original: https://blog.csdn.net/qq_38500228/article/details/122602498
Author: 一览天下945
Title: Oversampling, Undersampling, Combined Sampling, and Ensemble Sampling in Data Processing
