Missing-value imputation methods for big data (regression imputation [stochastic regression imputation], cluster imputation, ...)

June 17, 2023, 1:51 PM • Artificial Intelligence • 91 views

Contents

Regression imputation
random imputation
deterministic regression imputation
stochastic regression imputation
Cluster imputation
Autoencoder imputation
Conclusion

Regression imputation

First, import the required packages.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
import missingno as mno

import warnings
warnings.filterwarnings('ignore')

Next, load the data.

data=np.loadtxt('data/Magic.txt')  # forward slash avoids the invalid '\M' escape sequence
tmp_columns=list('abcdefghij')
tmp_columns.append('class')
magic=pd.DataFrame(data=data,columns=tmp_columns)

Draw 10 random rows to get a feel for the data.

magic.sample(10)


Check the dataset for missing values.

magic.isnull().sum()

There are no missing values.

We plot a heatmap of the features to examine the correlations between them.

%matplotlib inline
# Correlated pairs visible in the heatmap: a-b, a-c, b-c, d-e, j-a, j-b, j-c
complete_features=magic.loc[:,magic.columns.difference(['class'])]

plt.figure(figsize=(10,10))
sns.heatmap(complete_features.corr(),annot=True)
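Reading pairs off a heatmap by eye gets harder as the number of features grows. As a minimal sketch, the same pairs can be listed programmatically from the correlation matrix. The helper name `top_correlated_pairs`, the 0.5 threshold, and the synthetic frame are assumptions made for illustration; none of them come from the original notebook.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, threshold=0.5):
    """Return (col_i, col_j, r) for column pairs with |r| >= threshold, strongest first."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)

# Synthetic stand-in for the real data: y is built from x, z is independent.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
demo = pd.DataFrame({
    'x': x,
    'y': 2.0 * x + rng.normal(scale=0.1, size=500),
    'z': rng.normal(size=500),
})
pairs = top_correlated_pairs(demo, threshold=0.5)
```

On the article's data the call would be `top_correlated_pairs(complete_features)`, with the threshold chosen to taste.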

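Next, based on the correlations between features, we choose a, b, and c as the columns that will contain missing values.

We randomly blank out 10% of the entries in a, b, and c.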

prob_missing = 0.1
col_incomplete=['a','b','c']
ind_incomplete=[magic.columns.get_loc(i) for i in col_incomplete]
df_incomplete = magic.copy()
ix = [(row, col) for row in range(magic.shape[0]) for col in ind_incomplete]
for row, col in random.sample(ix, int(round(prob_missing * len(ix)))):
    df_incomplete.iat[row, col] = np.nan

df_complete=magic[col_incomplete]
df_incomplete_copy=df_incomplete.copy()

df_incomplete.isna().sum()
mno.matrix(df_incomplete, figsize = (20, 6))
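The masking step above builds a Python list of every (row, column) pair, which can be slow on a large table. A vectorized variant is sketched below under the same "missing completely at random" setup. The function name `mask_at_random`, the `seed` parameter, and the small demo frame are hypothetical and not taken from the original code.

```python
import numpy as np
import pandas as pd

def mask_at_random(df, columns, prob, seed=0):
    """Return a copy of df with a fraction `prob` of the cells in `columns` set to NaN."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    n_rows, n_cols = len(df), len(columns)
    n_cells = n_rows * n_cols
    n_missing = int(round(prob * n_cells))
    # Pick flat cell indices without replacement, then turn them into a boolean mask.
    flat = rng.choice(n_cells, size=n_missing, replace=False)
    mask = np.zeros(n_cells, dtype=bool)
    mask[flat] = True
    mask = mask.reshape(n_rows, n_cols)
    # DataFrame.mask replaces the cells where the condition is True with NaN.
    out[columns] = out[columns].mask(mask)
    return out

# Small demo frame (made-up numbers): 10 rows, columns a-d.
demo = pd.DataFrame(np.arange(40, dtype=float).reshape(10, 4), columns=list('abcd'))
masked = mask_at_random(demo, ['a', 'b', 'c'], prob=0.1)
```

Fixing the seed makes the missingness pattern reproducible, which helps when comparing imputation methods on the same masked cells.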

The figure below visualizes the table after masking.

random imputation

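Next, we want to fill columns a, b, and c by regression. I plan to use a KNN regression model whose training features are all features other than the one being predicted. However, more than one column contains missing values, so the other incomplete columns cannot be used as features directly. We therefore first fill the gaps in a, b, and c at random with values drawn from each column's observed entries, which gives three new features a_imp, b_imp, and c_imp. When predicting a, b_imp and c_imp can then take part in training as features.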

missing_columns=col_incomplete

def random_imputation(df,feature):
    num_missing=df[feature].isnull().sum()
    observed_values=df.loc[df[feature].notnull(),feature]
    df.loc[df[feature].isnull(),feature+'_imp']=np.random.choice(
        observed_values,num_missing,replace=True
    )
    return df

for feature in missing_columns:
    df_incomplete[feature+'_imp']=df_incomplete[feature]
    df_incomplete=random_imputation(df_incomplete,feature)

mno.matrix(df_incomplete,figsize=[20,6])

The table with the new filled-in features is shown in the figure below.

deterministic regression imputation

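Next, we use a KNN model (KNeighborsRegressor; the call below passes no arguments, so scikit-learn's default n_neighbors=5 applies) to predict and fill the missing values of each incomplete feature in turn.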

from sklearn.neighbors import KNeighborsRegressor

deter_data=pd.DataFrame(columns=['Det'+name for name in missing_columns])
for feature in missing_columns:
    deter_data['Det'+feature]=df_incomplete[feature+'_imp']
    para=list(set(df_incomplete.columns)-set(missing_columns)-{feature+'_imp'})

    model=KNeighborsRegressor()
    model.fit(X=df_incomplete[para],y=df_incomplete[feature+'_imp'])
    deter_data.loc[df_incomplete[feature].isnull(), 'Det'+feature]=model.predict(
        df_incomplete[para]
    )[df_incomplete[feature].isnull()]

mno.matrix(deter_data,figsize=[20,5])

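The table after filling is shown in the figure below; the dataset no longer contains missing values.

Next, we look at histograms and box plots of the original and the imputed data.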

sns.set()
fig,axes=plt.subplots(nrows=3,ncols=2)
fig.set_size_inches(8,8)

for index, variable in enumerate(['a','b','c']):
    # blue: observed values only; red: the column after deterministic imputation
    # histplot replaces the deprecated distplot (requires seaborn >= 0.11)
    sns.histplot(df_incomplete[variable].dropna(),kde=False,ax=axes[index, 0],color='blue')
    sns.histplot(deter_data['Det'+variable],kde=False,ax=axes[index,0],color='red')
    sns.boxplot(data=pd.concat([df_incomplete[variable], deter_data['Det'+variable]],axis=1),ax=axes[index,1])
plt.tight_layout()

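We can see that the histogram of the original complete data is taller and narrower than that of the imputed data; in other words, the standard deviation of the original feature distribution is smaller than that of the imputed one.

The reason is that the missing values were filled in by regression: the true values of those entries fluctuate above and below the regression model's hyperplane and carry some noise, which a deterministic prediction does not reproduce, so the spread of the imputed data does not match that of the original data.

The box plots show this as well: the interquartile range (IQR) of the imputed data is wider than that of the original data.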

stochastic regression imputation

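To address this, we add a noise term to the regression-imputed values. The noise follows a normal distribution centered on the model's prediction, with a standard deviation equal to that of the residuals on the observed rows. Draws that are not positive are discarded, in which case the randomly imputed value is kept.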

random_data=pd.DataFrame(columns=['Ran'+name for name in missing_columns])
for feature in missing_columns:
    random_data['Ran'+feature]=df_incomplete[feature+'_imp']
    para=list(set(df_incomplete.columns)-set(missing_columns)-{feature+'_imp'})

    model=KNeighborsRegressor()
    model.fit(X=df_incomplete[para],y=df_incomplete[feature+'_imp'])

    predict=model.predict(df_incomplete[para])
    std_error=(predict[df_incomplete[feature].notnull()]
        -df_incomplete.loc[df_incomplete[feature].notnull(), feature+'_imp']).std()
    random_predict=np.random.normal(size=df_incomplete[feature].shape[0],
        loc=predict,scale=std_error
    )

    random_data.loc[(df_incomplete[feature].isnull())&(random_predict>0),
        'Ran'+feature]=random_predict[(df_incomplete[feature].isnull())
        &(random_predict > 0)]
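The difference between the deterministic and the stochastic variant is easiest to see on a toy problem. The sketch below uses synthetic data and a plain linear fit via `np.polyfit` instead of the KNN regressor used in this article; the sample size, slope, noise scale, and 20% missing rate are all made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (not the Magic dataset): y depends linearly on x plus noise.
n = 2000
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(scale=1.0, size=n)

# Hide about 20% of y completely at random.
missing = rng.random(n) < 0.2
observed = ~missing

# Fit y ~ x on the observed rows only.
slope, intercept = np.polyfit(x[observed], y[observed], deg=1)
pred = slope * x + intercept

# Residual spread on the observed rows, used as the noise scale.
resid_std = np.std(y[observed] - pred[observed], ddof=2)

# Deterministic: imputed values sit exactly on the fitted line.
y_det = np.where(missing, pred, y)
# Stochastic: the same predictions plus normally distributed noise.
y_sto = np.where(missing, pred + rng.normal(scale=resid_std, size=n), y)

# Spread of the imputed cells only, for comparison.
det_spread = y_det[missing].std()
sto_spread = y_sto[missing].std()
```

Comparing `det_spread` and `sto_spread` with `y[missing].std()` shows how much of the true scatter each variant reproduces on the hidden cells.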

Next, we visualize the result.

sns.set()
fig,axes=plt.subplots(nrows=3,ncols=2)
fig.set_size_inches(8,8)

for index, variable in enumerate(['a','b','c']):
    # blue: observed values only; red: the column after stochastic imputation
    # histplot replaces the deprecated distplot (requires seaborn >= 0.11)
    sns.histplot(df_incomplete[variable].dropna(),kde=False,ax=axes[index, 0],color='blue')
    sns.histplot(random_data['Ran'+variable],kde=False,ax=axes[index,0],color='red')
    axes[index, 0].set(xlabel=variable+'/'+variable+'_imp')
    sns.boxplot(data=pd.concat([df_incomplete[variable], random_data['Ran'+variable]],axis=1),ax=axes[index,1])

plt.tight_layout()

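We can see that the feature distributions after imputation are somewhat improved compared with before, and that the shape of the original distribution is preserved.

After imputation the data no longer contain missing values.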

# .values assigns by position; the column names differ ('Rana' vs 'a')
df_incomplete[missing_columns]=random_data.values
df_incomplete.drop(columns=['a_imp','b_imp','c_imp'],inplace=True)
df_incomplete.isnull().sum()

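We compute the squared error of the KNN imputation. Note that the code sums the squared differences over all cells, so the value is a sum of squared errors rather than a mean.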

knn_mse=((df_complete.values-random_data.values)**2).sum()
knn_mse
The result is knn_mse = 32.29607335035754
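Because the expression above uses `.sum()`, the reported number grows with the number of masked cells. A mean over only the cells that were actually masked is easier to compare across experiments. The following is a minimal sketch; the helper name `imputation_errors` and the tiny arrays are made up for illustration.

```python
import numpy as np

def imputation_errors(true_values, imputed_values, was_missing):
    """Return (sse, mse) computed only over the cells that were masked."""
    true_values = np.asarray(true_values, dtype=float)
    imputed_values = np.asarray(imputed_values, dtype=float)
    was_missing = np.asarray(was_missing, dtype=bool)
    sq_err = (true_values - imputed_values) ** 2
    sse = float(sq_err[was_missing].sum())
    n = int(was_missing.sum())
    mse = sse / n if n else float('nan')
    return sse, mse

# Toy example: two of the four cells were masked and then imputed.
truth = np.array([[1.0, 2.0], [3.0, 4.0]])
filled = np.array([[1.0, 4.0], [2.0, 4.0]])
mask = np.array([[False, True], [True, False]])
sse, mse = imputation_errors(truth, filled, mask)
```

With the variables in this article, the mask would be `df_incomplete_copy[col_incomplete].isna().values`, the truth `df_complete.values`, and the imputed values `random_data.values`.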

Cluster imputation

Next, we use K-means to fill in the missing values.
We first build a dataset without the incomplete columns (a, b, c) to cluster on.

df_incomplete=df_incomplete_copy.copy()
df_cluster=df_incomplete[df_incomplete.columns.difference(['a','b','c','class'])]
df_cluster

The table is shown in the figure below.

To visualize how the data cluster, I wrote a function that plots a PCA projection.

from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D

def plot_pca(num,data,label):
    pca=PCA(n_components=num)
    X_pca=pca.fit_transform(data)
    print(pca.components_)

    X_failure=np.array([x for i,x in enumerate(X_pca) if label[i]==1.0])
    X_healthy=np.array([x for i,x in enumerate(X_pca) if label[i]==2.0])

    if num==3:
        fig = plt.figure(figsize=[10,15])
        ax = fig.add_subplot(projection='3d')  # Axes3D(fig) no longer attaches itself to the figure in recent Matplotlib

        ax.set_zlabel('Z', fontdict={'size': 15, 'color': 'red'})
        ax.set_ylabel('Y', fontdict={'size': 15, 'color': 'red'})
        ax.set_xlabel('X', fontdict={'size': 15, 'color': 'red'})
        ax.scatter(X_failure[:,0], X_failure[:,1], X_failure[:,2])
        ax.scatter(X_healthy[:,0], X_healthy[:,1], X_healthy[:,2])

        ax.view_init(elev=50,azim=10)
    elif num==2:
        plt.figure(figsize=[10,10])
        plt.scatter(X_failure[:,0],X_failure[:,1])
        plt.scatter(X_healthy[:,0],X_healthy[:,1])
    else:
        print('plot_pca only supports num=2 or num=3')

Next, to find a suitable number of clusters, we use the elbow method.

from sklearn.cluster import KMeans
%matplotlib inline
SSE = []
for k in range(1,9):
    estimator=KMeans(n_clusters=k, random_state=9)
    estimator.fit(df_cluster)
    SSE.append(estimator.inertia_)
plt.xlabel('k')
plt.ylabel('SSE')
plt.plot(range(1,9),SSE,'o-')
plt.show()

The curvature is largest at k=2, so we choose 2 clusters.
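Judging the elbow by eye can be backed up with a simple numeric check. One common heuristic, sketched here as an assumption rather than as the method used in this article, takes the k with the largest second difference of the SSE curve. The helper name `elbow_by_second_difference` and the SSE values are made up.

```python
import numpy as np

def elbow_by_second_difference(ks, sse):
    """Return the k where the SSE curve bends most sharply (largest second difference).

    Assumes `ks` is evenly spaced and has at least three entries.
    """
    ks = list(ks)
    sse = np.asarray(sse, dtype=float)
    if len(ks) != len(sse) or len(ks) < 3:
        raise ValueError('need matching ks and sse with at least 3 points')
    second_diff = np.diff(sse, n=2)
    # second_diff[i] is centered on ks[i + 1].
    return ks[int(np.argmax(second_diff)) + 1]

# Made-up SSE values for k = 1..5, with a sharp drop from k=1 to k=2.
demo_k = range(1, 6)
demo_sse = [100.0, 40.0, 30.0, 25.0, 22.0]
best_k = elbow_by_second_difference(demo_k, demo_sse)
```

With the loop above, the call would be `elbow_by_second_difference(range(1, 9), SSE)`. The heuristic can never return the first or last k, so the range should extend beyond the values of interest.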

Next, we visualize the clustering result.

%matplotlib inline

kmeans = KMeans(n_clusters=2, random_state=9)
idxs = kmeans.fit_predict(df_cluster)

pca=PCA(n_components=3)
pca.fit(df_cluster)
X_pca=pca.transform(df_cluster)

subX = []

for id in range(len(np.unique(idxs))):
    subX.append(np.array([X_pca[i] for i in range(X_pca.shape[0]) if idxs[i] == id]))

fig = plt.figure(figsize=[8,8])
ax = fig.add_subplot(projection='3d')  # Axes3D(fig) no longer attaches itself to the figure in recent Matplotlib

ax.set_zlabel('Z', fontdict={'size': 15, 'color': 'red'})
ax.set_ylabel('Y', fontdict={'size': 15, 'color': 'red'})
ax.set_xlabel('X', fontdict={'size': 15, 'color': 'red'})

for x in range(len(subX)):
    newX = subX[x]

    ax.scatter(newX[:,0], newX[:,1], newX[:,2])

We can see that the data do not fall into clearly separated clusters.


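# Per-cluster means of a, b, c (mean() skips NaN entries)
df_list=[]
# columns.difference() returns sorted names, so the first three columns are a, b, c
df_data=df_incomplete[df_incomplete.columns.difference(['class'])].values
for id in range(len(np.unique(idxs))):
    tmp_cluster_df=pd.DataFrame([df_data[i] for i in range(df_data.shape[0]) if idxs[i]==id]).iloc[:,:3]
    df_list.append(tmp_cluster_df.mean().values)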

cluster_data=pd.DataFrame(columns=['Clu_'+name for name in missing_columns])

for feature in missing_columns:
    cluster_data['Clu_'+feature]=df_incomplete[feature]
cluster_data['cluster']=idxs
cluster_data

The figure below shows the resulting dataset.

Fill in the dataset

for i,feature in enumerate(missing_columns):
    cluster_data.loc[(cluster_data['Clu_'+feature].isnull())&
        (cluster_data['cluster']==0),'Clu_'+feature]=df_list[0][i]
for i,feature in enumerate(missing_columns):
    cluster_data.loc[(cluster_data['Clu_'+feature].isnull())&
        (cluster_data['cluster']==1),'Clu_'+feature]=df_list[1][i]
cluster_data.drop(['cluster'],axis=1,inplace=True)
cluster_data.isnull().sum()
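The two loops above hard-code clusters 0 and 1. For an arbitrary number of clusters, the same cluster-mean fill can be written with a pandas group-by. This is a sketch under assumptions: the helper name `fill_with_cluster_means` and the demo values are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_with_cluster_means(df, columns, labels):
    """Fill NaNs in `columns` with the mean of the row's cluster (means skip NaNs)."""
    out = df.copy()
    labels = pd.Series(np.asarray(labels), index=df.index)
    # transform('mean') returns a frame of the same shape holding each row's cluster mean.
    cluster_means = out[columns].groupby(labels).transform('mean')
    out[columns] = out[columns].fillna(cluster_means)
    return out

# Made-up example: rows 0-2 belong to cluster 0, rows 3-4 to cluster 1.
demo = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 10.0, np.nan],
    'b': [2.0, 4.0, np.nan, 20.0, 30.0],
})
labels = [0, 0, 0, 1, 1]
filled = fill_with_cluster_means(demo, ['a', 'b'], labels)
```

With the variables in this article, the call would be `fill_with_cluster_means(df_incomplete, missing_columns, idxs)`, which also works unchanged if more than two clusters are tried.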

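Compute the squared error of the cluster imputation (again a sum of squared errors, since the code uses .sum())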

cluster_mse=((df_complete.values-cluster_data.values)**2).sum()
cluster_mse
The result is cluster_mse = 74.81802220522698

Visualize the feature distributions after imputation.

sns.set()
fig,axes=plt.subplots(nrows=3,ncols=2)
fig.set_size_inches(8,8)

for index, variable in enumerate(missing_columns):
    # blue: observed values only; red: the column after cluster-mean imputation
    # histplot replaces the deprecated distplot (requires seaborn >= 0.11)
    sns.histplot(df_incomplete[variable].dropna(),kde=False,ax=axes[index, 0],color='blue')
    sns.histplot(cluster_data['Clu_'+variable],kde=False,ax=axes[index,0],color='red')
    axes[index, 0].set(xlabel=variable)
    sns.boxplot(data=pd.concat([df_incomplete[variable],cluster_data['Clu_'+variable]],axis=1),ax=axes[index,1])

plt.tight_layout()

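We can see that the imputed feature distributions do not fit the original ones well; the clustering method does not perform well here.

This may also be because the number of clusters is too small.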

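Autoencoder imputation

I got lazy; I will fill this in later.
I have used this method before. If you want to know more, jump straight to my article on it.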

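Conclusion

In this experiment, filling missing values with clustering (K-means) did not work well. My guess is that the data do not form well-separated clusters. The summed squared error between the imputed dataset and the original dataset was 74.81802220522698.
For regression imputation, we found that the histogram of the original complete data is taller and narrower than that of the imputed data; in other words, the standard deviation of the original feature distribution is smaller than that of the imputed one.
The reason is that the missing values were filled in by regression: the true values of those entries fluctuate above and below the regression model's hyperplane and carry some noise, which a deterministic prediction does not reproduce.
So in the end we add a correction term, drawn from a Gaussian distribution, to each regression-imputed value. After this the imputation is clearly better, and its summed squared error (32.30 versus 74.82 for clustering) is also the smaller of the two.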

Original: https://blog.csdn.net/NP_hard/article/details/121221334
Author: NP_hard
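Title: 大数据缺省值插补方法（回归填补[stochastic regression imputation]，聚类填补，。。） (Missing-value imputation methods for big data: regression imputation [stochastic regression imputation], cluster imputation, ...)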
