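This book reads like a reference manual or a dictionary: it spells out how specific Python calls are used and in which scenarios. Even so, working through it hands-on should give a deeper understanding of the Python libraries commonly used in machine learning and make them more familiar in practice.
What follows is my hands-on run of the book's code. The comments state what each statement does (written before the statement) and what each print outputs (written after the print). I do not follow the book strictly: the order is adjusted slightly depending on how the code runs, and I have added some of my own understanding.
Copying the code into your own environment and running it should make everything clearer still.
Each code block in this post is one complete run and can be copied and executed as a unit.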
This part mainly covers the sklearn module and some of its uses for processing numerical features.
04-1 Feature Scaling
from sklearn import preprocessing
import numpy as np
# Create the feature
feature = np.array([[-500.5], [-100.1], [0], [100.1], [900.9]])
print(feature)
[[-500.5]
[-100.1]
[ 0. ]
[ 100.1]
[ 900.9]]
# -- Create a scaler: min-max normalization, mapping the feature's minimum and maximum to 0 and 1
minmax_scale = preprocessing.MinMaxScaler(feature_range = (0, 1))
# Scale the feature
scaled_feature = minmax_scale.fit_transform(feature)
print(scaled_feature)
[[0. ]
[0.28571429]
[0.35714286]
[0.42857143]
[1. ]]
# Print the mean and standard deviation
print(scaled_feature.mean())
print(scaled_feature.std())
0.41428571428571426
0.32701494692170274
# -- Create a scaler: standardization, giving mean 0 and standard deviation 1
scaler = preprocessing.StandardScaler()
# Standardize the feature
scaled_feature = scaler.fit_transform(feature)
print(scaled_feature)
[[-1.26687088]
[-0.39316683]
[-0.17474081]
[ 0.0436852 ]
[ 1.79109332]]
# Print the mean and standard deviation
print(scaled_feature.mean())
print(scaled_feature.std())
0.0
1.0
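# -- Create a scaler for data with outliers: it centers on the median and scales by the interquartile range
scaler = preprocessing.RobustScaler()
# Scale the feature
scaled_feature = scaler.fit_transform(feature)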
print(scaled_feature)
[[-2.5]
[-0.5]
[ 0. ]
[ 0.5]
[ 4.5]]
# Print the mean and standard deviation
print(scaled_feature.mean())
print(scaled_feature.std())
0.4
2.2891046284519194
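The three scalers above differ only in which statistics they subtract and divide by. As a rough sketch, the same numbers can be recomputed by hand with NumPy, assuming scikit-learn's defaults (population standard deviation for StandardScaler, median and the 25th to 75th percentile range for RobustScaler). The variable names here are illustrative and not from the book.

```python
import numpy as np

feature = np.array([[-500.5], [-100.1], [0], [100.1], [900.9]])

# Min-max scaling: (x - min) / (max - min), computed per column
col_min = feature.min(axis=0)
col_max = feature.max(axis=0)
minmax_manual = (feature - col_min) / (col_max - col_min)

# Standardization: (x - mean) / std, using the population std (ddof=0)
standard_manual = (feature - feature.mean(axis=0)) / feature.std(axis=0)

# Robust scaling: (x - median) / (Q3 - Q1)
q1, q3 = np.percentile(feature, [25, 75], axis=0)
robust_manual = (feature - np.median(feature, axis=0)) / (q3 - q1)

print(minmax_manual.ravel())
print(standard_manual.ravel())
print(robust_manual.ravel())
```

Because the median and quartiles barely move when one extreme value is added, the robust version is the one least affected by outliers.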
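04-2 Normalizing Observations
The difference from feature scaling: feature scaling works on each feature (column) separately, whereas observation normalization rescales each sample (row) so that it has unit norm.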
from sklearn.preprocessing import Normalizer
import numpy as np
# Create the feature matrix
feature = np.array([[0.5, 0.5], [1.1, 3.4], [1.5, 20.2], [1.63, 34.4], [10.9, 3.3]])
print(feature)
[[ 0.5 0.5 ]
[ 1.1 3.4 ]
[ 1.5 20.2 ]
[ 1.63 34.4 ]
[10.9 3.3 ]]
# Create the normalizer, L2 norm
normalizer = Normalizer(norm = 'l2')
# Transform the feature matrix
print(normalizer.transform(feature))
[[0.70710678 0.70710678]
[0.30782029 0.95144452]
[0.07405353 0.99725427]
[0.04733062 0.99887928]
[0.95709822 0.28976368]]
# Create the normalizer, L1 norm
normalizer = Normalizer(norm = 'l1')
# Transform the feature matrix
print(normalizer.transform(feature))
[[0.5 0.5 ]
[0.24444444 0.75555556]
[0.06912442 0.93087558]
[0.04524008 0.95475992]
[0.76760563 0.23239437]]
# Create the normalizer, max norm (each row is divided by its largest absolute value)
normalizer = Normalizer(norm = 'max')
# Transform the feature matrix
print(normalizer.transform(feature))
[[1. 1. ]
[0.32352941 1. ]
[0.07425743 1. ]
[0.04738372 1. ]
[1. 0.30275229]]
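To make the row-wise behaviour concrete, here is a small NumPy sketch that divides each row by its own norm. It is an illustration of what the three `norm` options compute, not scikit-learn's actual implementation, and the variable names are my own.

```python
import numpy as np

feature = np.array([[0.5, 0.5], [1.1, 3.4], [1.5, 20.2], [1.63, 34.4], [10.9, 3.3]])

# One norm per row (axis=1); keepdims=True keeps shape (5, 1) for broadcasting
l2_norms = np.sqrt((feature ** 2).sum(axis=1, keepdims=True))
l1_norms = np.abs(feature).sum(axis=1, keepdims=True)
max_norms = np.abs(feature).max(axis=1, keepdims=True)

l2_normalized = feature / l2_norms    # each row has Euclidean length 1
l1_normalized = feature / l1_norms    # absolute values in each row sum to 1
max_normalized = feature / max_norms  # largest absolute value in each row is 1

print(l2_normalized)
print(l1_normalized)
print(max_normalized)
```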
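04-3 Polynomial and Interaction Features
- Create polynomial features when the relationship between a feature and the target is nonlinear
- Create interaction features when the target depends on several features acting together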
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
# Create the feature matrix
features = np.array([[2, 3], [2, 3], [2, 3]])
print(features)
[[2 3]
[2 3]
[2 3]]
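# Create the PolynomialFeatures object
polynomial_interaction = PolynomialFeatures(degree = 2, include_bias = False)
# -- Create polynomial features, for a nonlinear relationship between features and target; degree is the highest order
# Output columns: x1, x2, x1^2, x1*x2, x2^2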
print(polynomial_interaction.fit_transform(features))
[[2. 3. 4. 6. 9.]
[2. 3. 4. 6. 9.]
[2. 3. 4. 6. 9.]]
polynomial_interaction = PolynomialFeatures(degree = 3, include_bias = False)
# degree = 3: terms up to third order are generated, so the largest value is the cube of the largest original feature
print(polynomial_interaction.fit_transform(features))
[[ 2. 3. 4. 6. 9. 8. 12. 18. 27.]
[ 2. 3. 4. 6. 9. 8. 12. 18. 27.]
[ 2. 3. 4. 6. 9. 8. 12. 18. 27.]]
interaction = PolynomialFeatures(degree = 2, interaction_only = True, include_bias = False)
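# -- Create interaction features only, for a target that depends on several features together; interaction_only = True drops the powers of a single feature
# Output columns: x1, x2, x1*x2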
print(interaction.fit_transform(features))
[[2. 3. 6.]
[2. 3. 6.]
[2. 3. 6.]]
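The column order in the outputs above can be reproduced by enumerating index combinations degree by degree. The helper below, `polynomial_terms`, is a hypothetical function written for this illustration (it is not part of scikit-learn), and it assumes no bias column, matching `include_bias = False`.

```python
from itertools import combinations, combinations_with_replacement

import numpy as np


def polynomial_terms(x, degree, interaction_only=False):
    """Hypothetical helper: feature products for degrees 1..degree, without a bias term."""
    # Interaction terms never repeat a feature; polynomial terms may
    pick = combinations if interaction_only else combinations_with_replacement
    terms = []
    for d in range(1, degree + 1):
        for idx in pick(range(len(x)), d):
            terms.append(float(np.prod(x[list(idx)])))
    return terms


row = np.array([2, 3])
degree_2 = polynomial_terms(row, 2)
degree_3 = polynomial_terms(row, 3)
interactions = polynomial_terms(row, 2, interaction_only=True)

print(degree_2)
print(degree_3)
print(interactions)
```

With two input features, degree 2 gives 5 columns and degree 3 gives 9, so the feature count grows quickly as the degree increases.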
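04-4 Custom Feature Transformations
Sometimes features need to be transformed in a custom way, for example by taking their logarithm. There are two ways to do this: the function transformer FunctionTransformer() in sklearn, or the apply() method in pandas.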
from sklearn.preprocessing import FunctionTransformer
import numpy as np
# Create the feature matrix
features = np.array([[2, 3], [2, 3], [2, 3]])
print(features)
[[2 3]
[2 3]
[2 3]]
# Custom function
def add_ten(x):
    return x + 10
# Create the transformer
ten_transformer = FunctionTransformer(add_ten)
print(ten_transformer.transform(features))
[[12 13]
[12 13]
[12 13]]
# The same transformation can be done with pandas
import pandas as pd
df = pd.DataFrame(features, columns = ['feature_1', 'feature_2'])
print(df.apply(add_ten))
feature_1 feature_2
0 12 13
1 12 13
2 12 13
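The text mentions taking the logarithm of a feature as a typical custom transformation. Here is a minimal sketch of that case, assuming `np.log1p` (log of 1 + x) is an acceptable stand-in for a plain log and passing `np.expm1` as the inverse so the original values can be recovered. The choice of these two functions is mine, not the book's.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

features = np.array([[2, 3], [2, 3], [2, 3]])

# func is applied by transform(); inverse_func is applied by inverse_transform()
log_transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

logged = log_transformer.fit_transform(features)
restored = log_transformer.inverse_transform(logged)

print(logged)
print(restored)
```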
04-5 Outliers
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs
import numpy as np
# Create a simulated clustered dataset
features,_ = make_blobs(n_samples = 10, n_features = 2, centers = 1, random_state = 1)
print(features)
[[-1.83198811 3.52863145]
[-2.76017908 5.55121358]
[-1.61734616 4.98930508]
[-0.52579046 3.3065986 ]
[ 0.08525186 3.64528297]
[-0.79415228 2.10495117]
[-1.34052081 4.15711949]
[-1.98197711 4.02243551]
[-2.18773166 3.33352125]
[-0.19745197 2.34634916]]
# Replace two values with extreme values
features[0,1] = 10000
features[1,1] = 10000
print(features)
[[-1.83198811e+00 1.00000000e+04]
[-2.76017908e+00 1.00000000e+04]
[-1.61734616e+00 4.98930508e+00]
[-5.25790464e-01 3.30659860e+00]
[ 8.52518583e-02 3.64528297e+00]
[-7.94152277e-01 2.10495117e+00]
[-1.34052081e+00 4.15711949e+00]
[-1.98197711e+00 4.02243551e+00]
[-2.18773166e+00 3.33352125e+00]
[-1.97451969e-01 2.34634916e+00]]
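# ---- Method 1: EllipticEnvelope()
# Create the outlier detector; contamination is the assumed proportion of outliers
outlier_detector = EllipticEnvelope(contamination = .1)
# Fit the detector
outlier_detector.fit(features)
# Predict outliers (-1 = outlier, 1 = inlier)
print(outlier_detector.predict(features))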
[-1 1 1 1 1 1 1 1 1 1]
# Change the contamination value
outlier_detector = EllipticEnvelope(contamination = .3)
# Fit the detector
outlier_detector.fit(features)
# Predict outliers
print(outlier_detector.predict(features))
[-1 -1 1 1 -1 1 1 1 1 1]
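# ---- Method 2: detection with the interquartile range (IQR)
# A single feature can also be checked for outliers on its own, using the IQR
# IQR = third quartile (Q3) minus first quartile (Q1)
# Outliers are commonly defined as values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR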
feature = features[:,1]
print(feature)
[1.00000000e+04 1.00000000e+04 4.98930508e+00 3.30659860e+00
3.64528297e+00 2.10495117e+00 4.15711949e+00 4.02243551e+00
3.33352125e+00 2.34634916e+00]
# Function that returns the indices of outliers found with the IQR rule
def indicies_of_outliers(x):
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (iqr * 1.5)
    upper_bound = q3 + (iqr * 1.5)
    return np.where((x > upper_bound) | (x < lower_bound))
# Find the indices of the outliers
print(indicies_of_outliers(feature))
(array([0, 1]),)
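A small worked example may help show where the IQR bounds come from. The five values below are made up for illustration and are not from the book. The sketch prints the quartiles and bounds, then selects the outliers with a boolean mask instead of `np.where`.

```python
import numpy as np

# Illustrative data: four ordinary values and one extreme value
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Boolean mask: True where the value falls outside the bounds
is_outlier = (values < lower_bound) | (values > upper_bound)

print(q1, q3, iqr)               # 2.0 4.0 2.0
print(lower_bound, upper_bound)  # -1.0 7.0
print(values[is_outlier])        # [100.]
```

A boolean mask is convenient when the next step is filtering, for example `values[~is_outlier]` to keep only the inliers.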
# ---- Handling outliers
# ----- Method 1: use RobustScaler() to scale features that contain outliers
from sklearn import preprocessing
scaler = preprocessing.RobustScaler()
scaled_feature = scaler.fit_transform(features)
print(scaled_feature)
[[-2.61212566e-01 6.80970487e+03]
[-9.47948061e-01 6.80970487e+03]
[-1.02406616e-01 7.87126291e-01]
[ 7.05196630e-01 -3.59186642e-01]
[ 1.15728512e+00 -1.28464128e-01]
[ 5.06645267e-01 -1.17778692e+00]
[ 1.02406616e-01 2.20215119e-01]
[-3.72184092e-01 1.28464128e-01]
[-5.24414566e-01 -3.40846083e-01]
[ 9.48122608e-01 -1.01333897e+00]]
# ----- Method 2: work out why the outlier occurs and handle it accordingly
import pandas as pd
# Create the DataFrame
houses = pd.DataFrame()
houses['Price'] = [534433, 392333, 293222, 4322032]
houses['Bathrooms'] = [2, 3.5, 2, 116] # number of bathrooms
houses['Square_Feet'] = [1500, 2500, 1500, 48000]
print(houses)
Price Bathrooms Square_Feet
0 534433 2.0 1500
1 392333 3.5 2500
2 293222 2.0 1500
3 4322032 116.0 48000
# Observations can be filtered directly with a known condition
print(houses[houses['Bathrooms'] < 20])
Price Bathrooms Square_Feet
0 534433 2.0 1500
1 392333 3.5 2500
2 293222 2.0 1500
# Or mark them as outliers and keep the flag as a feature of the dataset
houses['Outlier'] = np.where(houses['Bathrooms'] < 20, 0, 1)
print(houses)
Price Bathrooms Square_Feet Outlier
0 534433 2.0 1500 0
1 392333 3.5 2500 0
2 293222 2.0 1500 0
3 4322032 116.0 48000 1
# Transform the feature to dampen the effect of the outlier
# Take the logarithm of the feature
houses['log_of_square_feet'] = [np.log(x) for x in houses['Square_Feet']]
print(houses)
Price Bathrooms Square_Feet Outlier log_of_square_feet
0 534433 2.0 1500 0 7.313220
1 392333 3.5 2500 0 7.824046
2 293222 2.0 1500 0 7.313220
3 4322032 116.0 48000 1 10.778956
04-6 Discretization and Grouping
from sklearn.preprocessing import Binarizer
import numpy as np
age = np.array([[6], [12], [20], [36], [65]])
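# -- Method 1: two intervals, binarization
# Create the binarizer (threshold must be passed by keyword in recent scikit-learn versions)
binarizer = Binarizer(threshold = 18)
# Binarize the feature (values greater than the threshold become 1)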
print(binarizer.fit_transform(age))
[[0]
[0]
[1]
[1]
[1]]
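# -- Method 2: multiple intervals, discretization
# Discretize the feature: bins lists the bin edges; the return value i (0 to n for n edges) is the index of the bin a value falls into, and by default each edge belongs to the bin on its right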
print(np.digitize(age, bins = [18]))
[[0]
[0]
[1]
[1]
[1]]
print(np.digitize(age, bins = [20, 30, 64]))
[[0]
[0]
[1]
[2]
[3]]
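In the output above, age 20 lands in bin 1 because `np.digitize` treats each edge as the left, inclusive boundary of a bin by default. The sketch below contrasts that with `right=True`, where the edge closes the bin on its left, and shows the two-bin case as a plain comparison. The variable names are illustrative.

```python
import numpy as np

age = np.array([[6], [12], [20], [36], [65]])

# Default: bins[i-1] <= x < bins[i], so 20 falls in bin 1
left_closed = np.digitize(age, bins=[20, 30, 64])

# right=True: bins[i-1] < x <= bins[i], so 20 falls in bin 0
right_closed = np.digitize(age, bins=[20, 30, 64], right=True)

# Two bins with a threshold of 18 written as a comparison
binary = (age > 18).astype(int)

print(left_closed.ravel())   # [0 0 1 2 3]
print(right_closed.ravel())  # [0 0 0 2 3]
print(binary.ravel())        # [0 0 1 1 1]
```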
# -- Method 3: no explicit grouping rule, group by clustering
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
# Create a simulated feature matrix
features, _ = make_blobs(n_samples = 50, n_features = 2, centers = 3, random_state = 1)
print(features[:5])
[[-9.87755355 -3.33614544]
[-7.28721033 -8.35398617]
[-6.94306091 -7.0237442 ]
[-7.44016713 -8.79195851]
[-6.64138783 -8.07588804]]
# Create the DataFrame
dataframe = pd.DataFrame(features, columns = ['feature_1', 'feature_2'])
print(dataframe.head(5))
feature_1 feature_2
0 -9.877554 -3.336145
1 -7.287210 -8.353986
2 -6.943061 -7.023744
3 -7.440167 -8.791959
4 -6.641388 -8.075888
# Create the K-Means clusterer
clusterer = KMeans(3, random_state = 0)
# Fit the clusterer to the features
clusterer.fit(features)
# Predict the cluster of each observation
dataframe['group'] = clusterer.predict(features)
print(dataframe.head(5))
feature_1 feature_2 group
0 -9.877554 -3.336145 0
1 -7.287210 -8.353986 2
2 -6.943061 -7.023744 2
3 -7.440167 -8.791959 2
4 -6.641388 -8.075888 2
04-7 Handling Missing Values
import numpy as np
# Create the feature matrix
features = np.array([[1.1, 11.1], [2.2, 22.2], [3.3, 33.3], [4.4, 44.4], [np.nan, 55]])
print(features)
[[ 1.1 11.1]
[ 2.2 22.2]
[ 3.3 33.3]
[ 4.4 44.4]
[ nan 55. ]]
# -- Method 1: keep only the observations without missing values (~ negates the boolean mask)
print(features[~np.isnan(features).any(axis = 1)])
[[ 1.1 11.1]
[ 2.2 22.2]
[ 3.3 33.3]
[ 4.4 44.4]]
# -- Method 2: pandas DataFrame.dropna()
import pandas as pd
dataframe = pd.DataFrame(features, columns = ['feature_1', 'feature_2'])
# Drop the observations that contain missing values
print(dataframe.dropna())
feature_1 feature_2
0 1.1 11.1
1 2.2 22.2
2 3.3 33.3
3 4.4 44.4
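# -- Imputing missing values
# --- Method 1: the fancyimpute module
from fancyimpute import KNN
# Imputation algorithm: k-nearest neighbors. Samples are weighted by the mean squared difference over the features that both rows have observed, and the weighted result is used to fill in the missing value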
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
# Create a simulated feature matrix
features, _ = make_blobs(n_samples = 1000, n_features = 2, random_state = 1)
print(features[:5])
[[-3.05837272 4.48825769]
[-8.60973869 -3.72714879]
[ 1.37129721 5.23107449]
[-9.33917563 -2.9544469 ]
[-8.63895561 -8.05263469]]
# Standardize the features
scaler = StandardScaler()
standardized_features = scaler.fit_transform(features)
print(standardized_features[:5])
[[ 0.87301861 1.31426523]
[-0.67073178 -0.22369263]
[ 2.1048424 1.45332359]
[-0.87357709 -0.07903966]
[-0.67885655 -1.03344137]]
# Replace one value with a missing value, keeping the true value for comparison
true_value = standardized_features[0,0]
standardized_features[0,0] = np.nan
print(standardized_features[:5])
[[ nan 1.31426523]
[-0.67073178 -0.22369263]
[ 2.1048424 1.45332359]
[-0.87357709 -0.07903966]
[-0.67885655 -1.03344137]]
# Predict the missing values in the feature matrix
features_knn_imputed = KNN(k = 5, verbose = 0).fit_transform(standardized_features)
# Compare the true value with the imputed value
print('True:', true_value)
print('Imputed:', features_knn_imputed[0,0])
True: 0.8730186113995938
Imputed: 1.0955332713113226
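# --- Method 2: sklearn's SimpleImputer
# Fill missing values with the feature's mean, median, or most frequent value; usually less accurate than KNN
from sklearn.impute import SimpleImputer
# Create the imputer
mean_imputer = SimpleImputer(strategy = 'mean')
# Impute the missing values
features_mean_imputed = mean_imputer.fit_transform(standardized_features)
# Compare the true value with the imputed value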
print('True:', true_value)
print('Imputed:', features_mean_imputed[0,0])
True: 0.8730186113995938
Imputed: about -0.00087 (the mean of the remaining 999 standardized values in that column)
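If you impute, it is best to also create a new binary feature that records whether the observation contained an imputed value, because the fact that a value was missing can itself carry information.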
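A minimal NumPy sketch of that idea follows: impute with the column mean and append a 0/1 indicator column for every feature that had missing values. The variable names are illustrative. scikit-learn's `SimpleImputer` also has an `add_indicator` option that serves the same purpose.

```python
import numpy as np

features = np.array([[1.1, 11.1], [2.2, 22.2], [3.3, 33.3], [4.4, 44.4], [np.nan, 55]])

# Record where values are missing before filling them in
missing_mask = np.isnan(features)

# Column means over the observed values only
column_means = np.nanmean(features, axis=0)

# Mean imputation: take the column mean wherever the mask is True
imputed = np.where(missing_mask, column_means, features)

# One indicator column for each feature that had at least one missing value
columns_with_missing = missing_mask.any(axis=0)
indicators = missing_mask[:, columns_with_missing].astype(float)

augmented = np.hstack([imputed, indicators])
print(augmented)
```

Here only the first feature has a missing value, so one indicator column is appended and the result has three columns.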
Original: https://www.cnblogs.com/camilia/p/16700449.html
Author: CAMILIA
Title: [Python] sklearn module: Getting Started with Machine Learning in Python, "Python Machine Learning Cookbook" (《Python机器学习手册》), 04: Handling Numerical Data