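Note: this blog post is based on Python 3 and pandas 1.3.5.
This post describes how to preprocess data with pandas. It covers five parts: filtering missing values, filling missing values, data transformation (removing duplicates, mapping values, and replacing values), simple arithmetic with automatic alignment and function application, and statistical operations and sorting. Code examples are included.
[Note: every part is organized around the basic pandas data structures, Series and DataFrame.]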
1.1 Filtering missing values
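Use the dropna method.
Signature: DataFrame.dropna(axis = 0/1, how = "any"/"all", thresh = minimum number of non-NaN values a row or column must have to be kept, subset = ["target columns"])
The code is as follows: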
import pandas as pd
import numpy as np
a = pd.DataFrame([[1,2,np.nan],[np.nan,2,3], [np.nan, np.nan, np.nan], [3,5,7]])
print(a)
a1 = a.dropna(axis = 0)
print(("Drop every row that contains a NaN value \n{}").format(a1))
a2 = a.dropna(axis = 1)
print(("Drop every column that contains a NaN value \n{}").format(a2))
a3 = a.dropna(how = "all", axis = 0)
print(("Drop rows whose values are all NaN \n{}").format(a3))
The result is as follows:
0 1 2
0 1.0 2.0 NaN
1 NaN 2.0 3.0
2 NaN NaN NaN
3 3.0 5.0 7.0
Drop every row that contains a NaN value
0 1 2
3 3.0 5.0 7.0
Drop every column that contains a NaN value
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
Drop rows whose values are all NaN
0 1 2
0 1.0 2.0 NaN
1 NaN 2.0 3.0
3 3.0 5.0 7.0
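The example above does not exercise the thresh and subset parameters. The following is a minimal sketch of how they behave on the same data; the variable names (df, kept_by_thresh, kept_by_subset) are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, np.nan],
                   [np.nan, 2, 3],
                   [np.nan, np.nan, np.nan],
                   [3, 5, 7]])

# thresh=2: keep only the rows that have at least 2 non-NaN values
kept_by_thresh = df.dropna(thresh=2)

# subset=[0]: only column 0 is inspected when deciding which rows to drop
kept_by_subset = df.dropna(subset=[0])

print(kept_by_thresh)
print(kept_by_subset)
```

Here thresh counts the values that are present, not the values that are missing, so row 2 (all NaN) is the only row removed by thresh=2.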
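1.2 Filling missing values
Signature: DataFrame.fillna(a constant, or a dict mapping column labels to fill values, method = "ffill"/"bfill", axis = 0/1, inplace = True/False)
[Note: axis = 0 propagates values down each column (along the index), and axis = 1 propagates values across each row. This follows the usual pandas convention, in which axis = 0 refers to the index axis. Pass either a fill value or method, not both.]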
import pandas as pd
import numpy as np
a = pd.DataFrame([[1,2,np.nan],[np.nan,2,3], [np.nan, np.nan, np.nan], [3,5,7]])
print(a)
a.fillna(method = "bfill", inplace= True, axis = 0)
print("DataFrame after filling:")
print(a)
The result is as follows:
0 1 2
0 1.0 2.0 NaN
1 NaN 2.0 3.0
2 NaN NaN NaN
3 3.0 5.0 7.0
DataFrame after filling:
0 1 2
0 1.0 2.0 3.0
1 3.0 2.0 3.0
2 3.0 5.0 7.0
3 3.0 5.0 7.0
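The signature also allows a constant or a per-column dict as the fill value, which the example above does not show. Here is a small sketch on the same data; the fill values (0, -1, 99) are arbitrary choices for illustration, and DataFrame.ffill() is used as the method-style equivalent of forward filling.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, np.nan],
                   [np.nan, 2, 3],
                   [np.nan, np.nan, np.nan],
                   [3, 5, 7]])

# Fill every NaN with the same constant
filled_const = df.fillna(0)

# Fill per column: keys are column labels; column 1 is not listed, so it is left as-is
filled_by_col = df.fillna({0: -1, 2: 99})

# Forward fill down each column; a leading NaN has nothing above it and stays NaN
filled_forward = df.ffill()

print(filled_const)
print(filled_by_col)
print(filled_forward)
```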
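2.1 Removing duplicate values in a column
Signature: DataFrame.drop_duplicates(subset = list of column labels, inplace = True/False)
Rows are compared only on the columns listed in subset, and by default the first occurrence of each value is kept.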
import pandas as pd
import numpy as np
a = pd.DataFrame([[1,2,np.nan],[np.nan,2,3], [np.nan, np.nan, np.nan], [3,5,7]])
print(a)
a.drop_duplicates(subset = [1], inplace = True)
print(a)
The result is as follows:
0 1 2
0 1.0 2.0 NaN
1 NaN 2.0 3.0
2 NaN NaN NaN
3 3.0 5.0 7.0
0 1 2
0 1.0 2.0 NaN
2 NaN NaN NaN
3 3.0 5.0 7.0
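To see which rows would be dropped before dropping them, or to change which occurrence survives, the duplicated method and the keep parameter can help. A short sketch on the same data, with illustrative variable names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, np.nan],
                   [np.nan, 2, 3],
                   [np.nan, np.nan, np.nan],
                   [3, 5, 7]])

# True for rows whose value in column 1 already appeared in an earlier row
mask = df.duplicated(subset=[1])

# keep="last": keep the last occurrence of each value instead of the first
keep_last = df.drop_duplicates(subset=[1], keep="last")

# keep=False: drop every row whose value in column 1 occurs more than once
drop_all = df.drop_duplicates(subset=[1], keep=False)

print(mask)
print(keep_last)
print(drop_all)
```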
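2.2 Mapping the values of a column
Signature: DataFrame[new column label] = Series.map(a dict-like mapping / a function)
Suppose we apply map to the column labeled 0 and store the result in a new column named "map_relationship". Values that have no matching key in the dict become NaN. The code is as follows: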
import pandas as pd
import numpy as np
a = pd.DataFrame([[1,2,np.nan],[np.nan,2,3], [np.nan, 20, np.nan], [3,5,7]])
print(a)
map_relationship = {1:"a", 2:"b",3:"c", 4:"d"}
a["map_relationship"] = a[0].map(map_relationship)
print(a)
The result is as follows:
0 1 2
0 1.0 2 NaN
1 NaN 2 3.0
2 NaN 20 NaN
3 3.0 5 7.0
0 1 2 map_relationship
0 1.0 2 NaN a
1 NaN 2 3.0 NaN
2 NaN 20 NaN NaN
3 3.0 5 7.0 c
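The signature also accepts a function in place of a dict. The sketch below maps column 1 through a lambda and shows na_action="ignore", which skips NaN values instead of passing them to the function. The new column names ("size", "col0_doubled") and the threshold of 10 are made up for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, np.nan],
                   [np.nan, 2, 3],
                   [np.nan, 20, np.nan],
                   [3, 5, 7]])

# Map with a function: label each value in column 1 by size
df["size"] = df[1].map(lambda v: "large" if v >= 10 else "small")

# na_action="ignore": NaN values are passed through untouched, not sent to the function
df["col0_doubled"] = df[0].map(lambda v: v * 2, na_action="ignore")

print(df)
```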
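2.3 Replacing values in a column
Signature: DataFrame[column label].replace(value to replace, replacement value, inplace = True/False)
[Note: with inplace = True the data is modified in place and the method returns None; with inplace = False (the default) a new object is returned and the original is left unchanged. Calling replace in place on a selected column, as done here, works in pandas 1.3.5 but may not update the DataFrame in newer pandas versions that use Copy-on-Write, where assigning the result back to the column is safer.]
The code is as follows: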
import pandas as pd
import numpy as np
a = pd.DataFrame([[1,2,np.nan],[np.nan,2,3], [np.nan, 20, np.nan], [3,5,7]])
print(a)
### Replace NaN with 999 only in the column labeled 0
a[0].replace(np.nan, 999, inplace = True)
print(a)
The result is as follows:
0 1 2
0 1.0 2 NaN
1 NaN 2 3.0
2 NaN 20 NaN
3 3.0 5 7.0
0 1 2
0 1.0 2 NaN
1 999.0 2 3.0
2 999.0 20 NaN
3 3.0 5 7.0
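As an alternative to the in-place call on a selected column, the result of replace can be assigned back to the column. This is a sketch of that pattern, not the original author's code; it should give the same result for column 0 without relying on in-place modification of a selection.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, np.nan],
                   [np.nan, 2, 3],
                   [np.nan, 20, np.nan],
                   [3, 5, 7]])

# Replace NaN with 999 in column 0 only, then assign the new Series back
df[0] = df[0].replace(np.nan, 999)

# The same pattern works for ordinary values: replace 20 with 2 in column 1
df[1] = df[1].replace(20, 2)

print(df)
```

Column 2 is not touched, so its NaN values remain.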
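2.4 Replacing values across the whole DataFrame
Signature: DataFrame.replace(value to replace, replacement value, inplace = True/False)
[Note: with inplace = True the DataFrame is modified in place and the method returns None; with inplace = False (the default) a new DataFrame is returned and the original is left unchanged.]
The code is as follows: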
import pandas as pd
import numpy as np
a = pd.DataFrame([[1,2,np.nan],[np.nan,2,3], [np.nan, 20, np.nan], [3,5,7]])
print(a)
### Replace NaN with 999 in every element
a.replace(np.nan, 999, inplace = True)
print(a)
The result is as follows:
0 1 2
0 1.0 2 NaN
1 NaN 2 3.0
2 NaN 20 NaN
3 3.0 5 7.0
0 1 2
0 1.0 2 999.0
1 999.0 2 3.0
2 999.0 20 999.0
3 3.0 5 7.0
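DataFrame.replace can also take a list of values to replace, or a nested dict that restricts the replacement to one column. The following sketch shows both forms on the same data; the replacement values (0 and 200) are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, np.nan],
                   [np.nan, 2, 3],
                   [np.nan, 20, np.nan],
                   [3, 5, 7]])

# A list of values to replace, all mapped to the same replacement value
multi = df.replace([np.nan, 20], 0)

# A nested dict {column label: {old value: new value}} limits the change to that column
per_col = df.replace({1: {20: 200}})

print(multi)
print(per_col)
```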
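For continuous numeric data, we often use binning (splitting the value range into segments) or quantiles to get an overall statistical picture of the sample.
Binning signature: pd.cut(a one-dimensional array, bins = list of bin edges, labels = list of names for the bins, ordered = True/False)
[Note: by default each bin is open on the left and closed on the right, i.e. (a, b]. Use right = False to get bins of the form [a, b), and include_lowest = True to make the first bin include its left edge.]
The code is as follows: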
import pandas as pd
import numpy as np
a = pd.Series([1,2,10,3,55,200,70,8,93,67])
print(a)
b = pd.cut(a,bins = [-10,50,150,500], labels = ["small","middle","big"], ordered = True)
print(b)
The result is as follows:
0 1
1 2
2 10
3 3
4 55
5 200
6 70
7 8
8 93
9 67
dtype: int64
0 small
1 small
2 small
3 small
4 middle
5 big
6 middle
7 small
8 middle
9 middle
dtype: category
Categories (3, object): ['small' < 'middle' < 'big']
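How values that fall exactly on a bin edge are treated depends on right and include_lowest. The sketch below uses a small, made-up Series whose values sit on the edges, so the difference is visible as NaN for values that fall outside every bin.

```python
import pandas as pd

s = pd.Series([0, 50, 100])
edges = [0, 50, 100]

# Default: bins are (0, 50] and (50, 100]; the value 0 is outside the first bin -> NaN
default_bins = pd.cut(s, bins=edges)

# include_lowest=True: the first bin is extended to include the left edge 0
with_lowest = pd.cut(s, bins=edges, include_lowest=True)

# right=False: bins are [0, 50) and [50, 100); the value 100 is outside the last bin -> NaN
left_closed = pd.cut(s, bins=edges, right=False)

print(default_bins)
print(with_lowest)
print(left_closed)
```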
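Quantile signature: new variable = pd.qcut(a one-dimensional array, q = number of quantile bins)
Next, we split the Series a generated above at its 25%, 50%, 75% and 100% quantiles and count the samples in each quantile bin.
The code is as follows: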
c = pd.qcut(a, 4)
print(c)
d = c.value_counts()
print(d)
The result is as follows:
0 (0.999, 4.25]
1 (0.999, 4.25]
2 (4.25, 32.5]
3 (0.999, 4.25]
4 (32.5, 69.25]
5 (69.25, 200.0]
6 (69.25, 200.0]
7 (4.25, 32.5]
8 (69.25, 200.0]
9 (32.5, 69.25]
dtype: category
Categories (4, interval[float64, right]): [(0.999, 4.25] < (4.25, 32.5] < (32.5, 69.25] <
(69.25, 200.0]]
(0.999, 4.25] 3
(69.25, 200.0] 3
(4.25, 32.5] 2
(32.5, 69.25] 2
dtype: int64
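These results show that, for the one-dimensional array a:
The 25% quantile is 4.25; 3 elements are at or below it.
The 50% quantile is 32.5; 5 elements are at or below it.
The 75% quantile is 69.25; 7 elements are at or below it.
The 100% quantile is 200; all 10 elements are at or below it.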
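The quantile values and cumulative counts listed above can be checked directly. This sketch recomputes the bin edges with Series.quantile and accumulates the per-bin counts in bin order; the variable names are illustrative.

```python
import pandas as pd

a = pd.Series([1, 2, 10, 3, 55, 200, 70, 8, 93, 67])

# Quantile values that pd.qcut(a, 4) uses as its bin edges
edges = a.quantile([0.25, 0.5, 0.75, 1.0])

# Count per quantile bin, sorted by bin order rather than by count
c = pd.qcut(a, 4)
per_bin = c.value_counts().sort_index()

# Running total: number of elements at or below each upper edge
cumulative = per_bin.cumsum()

print(edges)
print(cumulative)
```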
To close: this post has walked through the pandas preprocessing methods above. I hope it helps with your learning.
Original: https://blog.csdn.net/dylan_young/article/details/122407203
Author: Efred.D
Title: Pandas常见方法(2)-pandas对数据的预处理