


0 Preface


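1 Data Standardization

1.1 Definition of standardization

Standardization, also known as mean removal, processes the values of the same feature across different samples so that the feature ends up with a mean of 0 and a standard deviation of 1. Only the following formula is needed:

  x' = (x - μ) / σ

  • μ: mean of the data
  • σ: standard deviation of the data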

1.2 Why standardize data?



In machine learning, many algorithms and model-evaluation methods are based on distances (residuals). Features measured on very different scales would therefore influence the model unequally, simply because of their units. Standardization puts every feature on a comparable scale so that no feature dominates the distance. Put plainly, and to borrow an image, it reshapes differently stretched ellipses into circles, so that every point on the circle is the same distance from the origin.
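As a rough illustration of the point above, the following sketch measures Euclidean distances before and after standardization. The three samples (a price in yuan and a daily trading volume) are made-up values, and the helper name `euclidean` is only illustrative.

```python
import numpy as np

# Hypothetical samples: column 0 is a price (yuan), column 1 is a daily volume.
samples = np.array([[5.0, 1_000_000.0],
                    [7.0, 1_000_500.0],
                    [5.1, 1_200_000.0]])


def euclidean(a, b):
    """Euclidean distance between two 1-D vectors."""
    return np.sqrt(((a - b) ** 2).sum())


# Raw distances: the volume column dominates, so the large price gap
# between samples 0 and 1 hardly registers.
raw_d01 = euclidean(samples[0], samples[1])
raw_d02 = euclidean(samples[0], samples[2])

# Standardize each column (mean 0, standard deviation 1) and measure again.
standardized = (samples - samples.mean(axis=0)) / samples.std(axis=0)
std_d01 = euclidean(standardized[0], standardized[1])
std_d02 = euclidean(standardized[0], standardized[2])

print("raw:         ", raw_d01, raw_d02)
print("standardized:", std_d01, std_d02)
```

Before standardization the two distances differ by orders of magnitude purely because of the volume unit; afterwards both features contribute on a similar scale.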


1.3 Worked example

import numpy as np
from sklearn import preprocessing

data = np.array([[3, -1.5, 2, -5.4], [0, 4, -0.3, 2.1], [1, 3.3, -1.9, -4.3]])
print(data)



This defines the sample array. (The values are arbitrary, chosen so that the results can be verified by hand later.)


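The standardization formula needs the mean and the standard deviation first, which raises a question: should they be computed across a row (axis = 1) or down a column (axis = 0)? Each column holds the data of one field, while each row holds one value from every field. The mean and standard deviation should be computed for a single field, so the computation runs down the columns (axis = 0).

print('Mean: ', data.mean(axis=0))
print('Standard deviation: ', data.std(axis=0))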





The result can then be verified by hand, here for the standard deviation of the first column:

import math

print(math.sqrt(((3 - 1.33333333)**2 + (0 - 1.33333333)**2 + (1 - 1.33333333)**2) / 3))





The final standardized value, shown for the entry in the first row of the first column, is (3 - 1.33333333) / 1.24721913 ≈ 1.3363.
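The same manual calculation can be applied to the whole array at once. The sketch below assumes the sample array from this section and uses NumPy's default population standard deviation (ddof=0); the names `col_mean`, `col_std`, and `manual` are illustrative.

```python
import numpy as np

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

# Column-wise statistics: one mean and one standard deviation per field.
col_mean = data.mean(axis=0)
col_std = data.std(axis=0)  # population standard deviation (ddof=0)

# Apply x' = (x - mean) / std to every entry via broadcasting.
manual = (data - col_mean) / col_std

print(manual[0, 0])          # entry in the first row, first column
print(manual.mean(axis=0))   # close to 0 for every column
print(manual.std(axis=0))    # 1 for every column
```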


Although the principle above is simple and the steps are not difficult, processing the values one by one before every analysis wastes time, so the preprocessing module can be used instead.

Core code: preprocessing.scale()

data_standarized = preprocessing.scale(data)
print('Mean: ', data_standarized.mean(axis=0))
print('Standard deviation: ', data_standarized.std(axis=0))



2 Data Scaling

2.1 0-1 scaling



For the same feature across different samples, subtract the feature's minimum and divide by (maximum - minimum). The original maximum becomes 1 and the original minimum becomes 0, which removes the effect that different units and magnitudes would otherwise have on the final result. (For example, if a stock price fluctuates between 5 and 7 yuan while the daily trading volume is about 1 million, then without scaling the volume values are roughly five orders of magnitude larger than the prices and would dominate the model, distorting the final forecast.)
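A minimal NumPy sketch of this manual calculation, assuming the sample array from section 1.3 (the variable names are illustrative):

```python
import numpy as np

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

# Column-wise minimum and maximum: one pair per field.
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# (x - min) / (max - min): each column's minimum maps to 0, its maximum to 1.
data_minmax = (data - col_min) / (col_max - col_min)
print(data_minmax)
```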

2.2 Worked example



Besides computing the result by hand, you can call sklearn directly: the preprocessing module provides the MinMaxScaler class.

Core code: preprocessing.MinMaxScaler()

data_scaler = preprocessing.MinMaxScaler(feature_range=(0, 2))
data_scaled = data_scaler.fit_transform(data)
print(data_scaled)

Output: (the parameters of MinMaxScaler can be adjusted as needed; for example, scaling to the range 0-2 gives results that are twice the 0-1 scaled values)
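The factor-of-two relationship can be checked directly. This sketch assumes the sample array from section 1.3 and compares MinMaxScaler with feature_range=(0, 2) against a hand-computed 0-1 scaling; `manual_01` is an illustrative name.

```python
import numpy as np
from sklearn import preprocessing

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

# Hand-computed 0-1 scaling, column by column.
manual_01 = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))

# sklearn scaling to the range 0-2.
data_scaled = preprocessing.MinMaxScaler(feature_range=(0, 2)).fit_transform(data)

# True if the 0-2 result is exactly twice the 0-1 result.
print(np.allclose(data_scaled, 2 * manual_01))
```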


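3 Data Normalization

3.1 Definition of normalization

When the values in a feature vector need to be adjusted so that they can be measured on a common scale, data normalization can be used. One of the most common forms of normalization in machine learning adjusts the values of a feature vector so that they sum to 1 (which makes important features easier to spot). Common variants are the L1 norm and the L2 norm.

L1 norm: this is the variant based on absolute values, and it is very common. It goes by many other names, such as the familiar Manhattan distance and least absolute error. The L1 norm can measure the difference between two vectors, as in the sum of absolute differences (SAD).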
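As a small illustration of these terms, the sketch below computes an L1 norm, a sum of absolute differences, and an L1-normalized vector with NumPy. The vectors `u` and `v` are arbitrary example values.

```python
import numpy as np

u = np.array([3.0, 0.0, 1.0])
v = np.array([1.0, 2.0, -1.0])

# L1 norm of u: the sum of the absolute values of its entries.
l1_norm_u = np.abs(u).sum()

# Sum of absolute differences (Manhattan distance) between u and v.
sad = np.abs(u - v).sum()

# L1 normalization: divide by the L1 norm so the absolute values sum to 1.
u_l1 = u / l1_norm_u

print(l1_norm_u, sad, u_l1)
```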


3.2 Worked example

Core code: preprocessing.normalize()

L1 norm

data_normalized = preprocessing.normalize(data, norm='l1', axis=0)
print(data_normalized)
data_norm_abs = np.abs(data_normalized)
print(data_norm_abs.sum(axis=0))



Output: (the absolute values in each column of the transformed data sum to 1)



L2 norm

data_normalized = preprocessing.normalize(data, norm='l2', axis=0)
print(data_normalized)
# 0.31622777 * 3 = 0.9486833..., so the ratio between entries is preserved
print((data_normalized * data_normalized).sum(axis=0))



Output: (the squares of the transformed values in each column also sum to 1, and the ratios between values are unchanged)
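The examples above pass axis=0 so that each column (field) is normalized. By default, preprocessing.normalize uses axis=1 and normalizes each row (sample) instead. A short sketch of the difference, assuming the sample array from section 1.3:

```python
import numpy as np
from sklearn import preprocessing

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

# axis=0: every column ends up with unit L2 norm.
by_column = preprocessing.normalize(data, norm='l2', axis=0)

# Default axis=1: every row ends up with unit L2 norm.
by_row = preprocessing.normalize(data, norm='l2')

print((by_column ** 2).sum(axis=0))  # sums of squares per column
print((by_row ** 2).sum(axis=1))     # sums of squares per row
```

Which axis is appropriate depends on whether the goal is to compare fields or to compare samples.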


4 Binarization

4.1 Definition of binarization



When we want to convert a numerical feature vector into a Boolean vector, we can use binarization: given a threshold that we choose, values above the threshold become 1 and values at or below it become 0. In digital image processing, image binarization is the process of converting a color or grayscale image into a binary image, that is, an image with only two colors (usually black and white).



This technique is used to identify objects and shapes, especially characters. Through binarization, the object of interest can be separated from the background in which it is found, as in the familiar black-and-white portrait.


4.2 Worked example

Core code: preprocessing.Binarizer()

data_binarized = preprocessing.Binarizer(threshold=1.4).transform(data)
print(data_binarized)
print(data)



Output: (you set a threshold; values greater than it become 1, values less than or equal to it become 0, so the final data contains only 0 and 1)
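The same rule can be written by hand with a NumPy comparison, which makes the threshold behaviour explicit. This sketch assumes the sample array from section 1.3 and the threshold 1.4 used above.

```python
import numpy as np

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

threshold = 1.4

# Strictly greater than the threshold -> 1.0, otherwise -> 0.0.
manual_binarized = (data > threshold).astype(float)
print(manual_binarized)
```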


5 One-Hot Encoding

5.1 Definition of one-hot encoding



Many models can only process numerical values, but real data often contains categorical fields, so the categories have to be encoded numerically. One-hot encoding converts categorical data into 0-1 codes. A one-hot encoder can be thought of as a tool that tightens the feature vector: each feature in the vector is encoded under this scheme, which helps improve space efficiency.


On the meaning of "one-hot": after encoding, within the code of a single feature exactly one position is 1 and all the others are 0. Compare a traffic light: at any moment it shows only one color.


5.2 Worked example

Core code: preprocessing.OneHotEncoder()


data = np.array([[1, 1, 2], [0, 2, 3], [1, 0, 1], [0, 1, 0]])
print(data)




encoder = preprocessing.OneHotEncoder()
encoder.fit(data)
encoder_vector = encoder.transform([[0, 0, 0]]).toarray()
print(encoder_vector)
encoder_vector = encoder.transform([[1, 0, 3]]).toarray()
print(encoder_vector)
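To see how the output columns map back to the input fields, the fitted encoder's categories_ attribute can be inspected. The sketch below repeats the numeric example so that it runs on its own; the comments describe what the fitted encoder holds for this particular array.

```python
import numpy as np
from sklearn import preprocessing

data = np.array([[1, 1, 2], [0, 2, 3], [1, 0, 1], [0, 1, 0]])

encoder = preprocessing.OneHotEncoder()
encoder.fit(data)

# One array of distinct values per input column:
# column 0 has 2 values, column 1 has 3, column 2 has 4.
for column_index, categories in enumerate(encoder.categories_):
    print(column_index, categories)

# The encoded row therefore has 2 + 3 + 4 = 9 positions,
# with exactly one 1 inside each column's block.
encoder_vector = encoder.transform([[1, 0, 3]]).toarray()
print(encoder_vector)
```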





If the numbers are hard to follow, the same idea can be shown with names such as 张三 and 李四.

data = np.array([['张三', '张三', '李四'], ['王五', '李四', '赵六'], ['张三', '王五', '张三'], ['王五', '张三', '王五']])
print(data)
encoder = preprocessing.OneHotEncoder()
encoder.fit(data)
encoder_vector = encoder.transform([['张三', '张三', '张三']]).toarray()
print(encoder_vector)



Output: (the principle is the same as above; names are used instead of numbers to make the encoding easier to follow)



6 Label Encoding

6.1 Definition of label encoding



Besides one-hot encoding, categorical variables can also be label-encoded, in which case the codes are not limited to 0 and 1 but can take other values. Put simply, label encoding "tags" each category with a number that can later be used to look the category up. It can also be done by hand.


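6.2 Worked example

Core code: preprocessing.LabelEncoder()

input_classes = ['audi', 'ford', 'audi', 'toyota', 'ford', 'bmw']
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(input_classes)
for i, item in enumerate(label_encoder.classes_):
    print(item, "-->", i)

Output: (recall the combined use of for and enumerate from the basics section; it prints each label together with its encoded value)


Once the encoder has been fitted with fit, it can be used to look up the codes of some or all of the data, and also to look up labels from codes. For example, given some of the items in the original list, the corresponding codes can be obtained directly.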




The lookup also works in reverse: given encoded values, the corresponding labels can be output.
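Both lookups can be sketched with the transform and inverse_transform methods of LabelEncoder. The query lists `labels` and `encoded` below are arbitrary examples.

```python
from sklearn import preprocessing

input_classes = ['audi', 'ford', 'audi', 'toyota', 'ford', 'bmw']
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(input_classes)

# Forward lookup: labels -> codes (classes are sorted alphabetically).
labels = ['toyota', 'ford', 'audi']
encoded_labels = label_encoder.transform(labels)
print(labels, "-->", list(encoded_labels))

# Reverse lookup: codes -> labels.
encoded = [2, 1, 0, 3]
decoded_labels = label_encoder.inverse_transform(encoded)
print(encoded, "-->", list(decoded_labels))
```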


7 Handling Missing Values

7.1 Approaches



Missing values are among the most common data problems. How should they be handled, and is there a fixed recipe? The usual approaches are:

  • Delete: if a field is missing in a very large share of the samples, drop the whole field; if only a few values are missing and they are hard to fill, drop the affected samples
  • Fill: if less than 10% is missing, fill according to the distribution of the variable, using the mean (normal distribution) or the median (skewed distribution)
  • Simulate or predict the missing values: generate random values from the sample's data distribution to fill the gaps (interpolation), or build a model on features highly correlated with the missing one to predict the missing values
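The delete and fill approaches can be sketched with pandas. Everything in this example is an assumption made for illustration: the small DataFrame, the `Notes` column, and the 50% cut-off for dropping a field are not from the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps.
df = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0, 35.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0, np.nan],
    "Notes": [np.nan, np.nan, np.nan, "ok", np.nan],
})

# Share of missing values per field.
missing_ratio = df.isna().mean()
print(missing_ratio)

# Delete: drop fields that are mostly missing (0.5 is an arbitrary cut-off).
reduced = df.drop(columns=missing_ratio[missing_ratio > 0.5].index)

# Fill: mean for one field, median for the other, as examples of the two rules.
filled = reduced.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["Salary"] = filled["Salary"].fillna(filled["Salary"].median())
print(filled)
```

In practice the choice between mean and median would follow the distribution of each field, as described in the list above.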


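7.2 Worked example

The SimpleImputer class from sklearn is used here. The procedure has four steps:

  • (1) Import the class from the module
  • (2) Set what counts as missing and how to fill it
  • (3) Select the data
  • (4) Process the data

Read the test data, in which the Age and Salary fields contain missing values.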
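The test data itself is not reproduced in this article. To follow along, a stand-in with the same shape can be built by hand. In the sketch below only the Age and Salary field names come from the text; the `Country` and `Purchased` columns and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the test data: one text field, the two numeric
# fields with gaps (Age, Salary), and a label field in the last column.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "France"],
    "Age": [44.0, 27.0, np.nan, 38.0, 35.0],
    "Salary": [72000.0, 48000.0, 54000.0, np.nan, 58000.0],
    "Purchased": ["No", "Yes", "No", "No", "Yes"],
})

# Count the missing values per field.
missing_counts = dataset.isna().sum()
print(missing_counts)
```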




Extract the relevant fields, either by column name or by position.

# Extract by position
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, -1].values

Output: (typically all feature fields are extracted and assigned to X, and the label field is assigned to Y)



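# Import the imputation class
import numpy as np
from sklearn.impute import SimpleImputer

# Set what counts as missing and the fill rule
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Select the data and fit
imputer = imputer.fit(X[:, 1:3])
# Process the data
X[:, 1:3] = imputer.transform(X[:, 1:3])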





Finally, regarding the fill rules that can be set: the documentation shows that the available strategies are mean, median, most_frequent (mode), and constant.
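A small sketch comparing the four strategies on a single column. The `values` array is a made-up example, and fill_value=0.0 for the constant strategy is an arbitrary choice.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One column with a single missing entry at index 3.
values = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # Record the value that replaced the missing entry.
    results[strategy] = imputer.fit_transform(values)[3, 0]

# The constant strategy fills with a value supplied by the caller.
constant_imputer = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0.0)
results["constant"] = constant_imputer.fit_transform(values)[3, 0]

print(results)
```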




That wraps up this introduction to big data preprocessing. Time to scatter some flowers ✿✿ (°▽ °) ✿

Original: https://blog.51cto.com/u_15713987/5462745
Author: 百木从森
Title: 【数据分析师-数据分析项目案例】大数据预处理




