Common pandas methods (2): data preprocessing with pandas

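Note: this blog is based on Python 3 and pandas 1.3.5.

This article describes how to preprocess data, covering five parts: missing-value filtering, missing-value filling, data transformation (duplicate removal, data mapping, data replacement), automatic alignment in simple arithmetic and function application, and statistical operations and sorting. Code examples are included.
[Note: every part of this article is organized around the basic pandas data structures, Series and DataFrame.]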

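1.1 Filtering missing values

Use the dropna method.
Signature: DataFrame.dropna(axis=0/1, how="any"/"all", thresh=minimum number of non-NaN values a row or column must contain to be kept, subset=[target columns])

Code: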

import pandas as pd
import numpy as np

a = pd.DataFrame([[1,2,np.nan],[np.nan,2,3], [np.nan, np.nan, np.nan], [3,5,7]])
print(a)

a1 = a.dropna(axis = 0)
print(("Drop every row that contains NaN \n{}").format(a1))

a2 = a.dropna(axis = 1)
print(("Drop every column that contains NaN \n{}").format(a2))

a3 = a.dropna(how = "all", axis = 0)
print(("Drop rows whose elements are all NaN \n{}").format(a3))

Result:

     0    1    2
0  1.0  2.0  NaN
1  NaN  2.0  3.0
2  NaN  NaN  NaN
3  3.0  5.0  7.0
Drop every row that contains NaN
     0    1    2
3  3.0  5.0  7.0
Drop every column that contains NaN
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
Drop rows whose elements are all NaN
     0    1    2
0  1.0  2.0  NaN
1  NaN  2.0  3.0
3  3.0  5.0  7.0
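
The signature above also lists thresh and subset, which the example does not exercise. The following is a minimal sketch of both on the same data; the variable names kept_by_thresh and kept_by_subset are only illustrative.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame([[1, 2, np.nan], [np.nan, 2, 3], [np.nan, np.nan, np.nan], [3, 5, 7]])

# thresh=2: keep only the rows that hold at least 2 non-NaN values
kept_by_thresh = a.dropna(axis=0, thresh=2)
print(kept_by_thresh)

# subset=[0]: only column 0 is inspected when deciding which rows to drop
kept_by_subset = a.dropna(axis=0, subset=[0])
print(kept_by_subset)
```

With this data, thresh=2 removes only row 2 (it has no valid values), while subset=[0] removes rows 1 and 2 because column 0 is NaN there.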

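1.2 Filling missing values

Signature: DataFrame.fillna(value (a constant, or a dict mapping column labels to fill values), method="ffill"/"bfill", axis=0/1, inplace=True/False)
[Note: axis=0 propagates values down each column (along the index), and axis=1 propagates them across each row (along the columns). This can feel reversed next to dropna, where axis=0 drops rows, so check the direction before filling.]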

import pandas as pd
import numpy as np

a = pd.DataFrame([[1,2,np.nan],[np.nan,2,3], [np.nan, np.nan, np.nan], [3,5,7]])
print(a)

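# Backward fill: each NaN takes the next valid value below it in the same column
a.fillna(method = "bfill", inplace= True, axis = 0)
print("DataFrame after filling:")
print(a)

Result:

     0    1    2
0  1.0  2.0  NaN
1  NaN  2.0  3.0
2  NaN  NaN  NaN
3  3.0  5.0  7.0
DataFrame after filling: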
     0    1    2
0  1.0  2.0  3.0
1  3.0  2.0  3.0
2  3.0  5.0  7.0
3  3.0  5.0  7.0
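
The signature also allows a dict that fills each column with its own value, and a forward fill. Here is a small sketch of both; it uses the DataFrame.ffill shorthand rather than the method argument, and the fill constants 0, -1 and 99 are arbitrary values chosen for illustration.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame([[1, 2, np.nan], [np.nan, 2, 3], [np.nan, np.nan, np.nan], [3, 5, 7]])

# Dict form: keys are column labels, values are the constants used for that column
filled_by_dict = a.fillna({0: 0, 1: -1, 2: 99})
print(filled_by_dict)

# Forward fill along axis=0: each NaN takes the last valid value above it
filled_forward = a.ffill(axis=0)
print(filled_forward)
```

Note that the forward fill leaves the NaN in row 0 of column 2 untouched, because there is no earlier value to copy from.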

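2.1 Dropping duplicate values in a column

Signature: DataFrame.drop_duplicates(subset=list of column labels, inplace=True/False)
Rows count as duplicates when their values in the subset columns match; by default the first occurrence is kept.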

import pandas as pd
import numpy as np

a = pd.DataFrame([[1,2,np.nan],[np.nan,2,3], [np.nan, np.nan, np.nan], [3,5,7]])
print(a)
a.drop_duplicates(subset = [1], inplace = True)
print(a)

Result:

     0    1    2
0  1.0  2.0  NaN
1  NaN  2.0  3.0
2  NaN  NaN  NaN
3  3.0  5.0  7.0
     0    1    2
0  1.0  2.0  NaN
2  NaN  NaN  NaN
3  3.0  5.0  7.0
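
As a complement, the sketch below contrasts the default behaviour with the keep parameter of drop_duplicates, which the signature above does not list. It uses inplace=False (the default), so the original DataFrame stays unchanged.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame([[1, 2, np.nan], [np.nan, 2, 3], [np.nan, np.nan, np.nan], [3, 5, 7]])

# Default keep="first": of the two rows with 2.0 in column 1, row 0 survives
first_kept = a.drop_duplicates(subset=[1])

# keep="last": the later of the two matching rows (row 1) survives instead
last_kept = a.drop_duplicates(subset=[1], keep="last")

# Without subset, whole rows are compared; no two rows here are identical
whole_rows = a.drop_duplicates()

print(first_kept, last_kept, whole_rows, sep="\n")
```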

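2.2 Mapping the values of one or more columns

Signature: DataFrame[new column label] = Series.map(a dict mapping / a function)
Suppose we apply map to the column labeled 0 and store the result in the DataFrame as a new column named "map_relationship". Values with no matching key in the dict become NaN. Code: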

import pandas as pd
import numpy as np

a = pd.DataFrame([[1,2,np.nan],[np.nan,2,3], [np.nan, 20, np.nan], [3,5,7]])
print(a)

map_relationship = {1:"a", 2:"b",3:"c", 4:"d"}

a["map_relationship"] = a[0].map(map_relationship)
print(a)

Result:

     0   1    2
0  1.0   2  NaN
1  NaN   2  3.0
2  NaN  20  NaN
3  3.0   5  7.0
     0   1    2 map_relationship
0  1.0   2  NaN                a
1  NaN   2  3.0              NaN
2  NaN  20  NaN              NaN
3  3.0   5  7.0                c
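
The signature says map also accepts a function, which the example above does not show. A minimal sketch follows; size_label is a hypothetical helper and the threshold of 5 is arbitrary. The second call uses na_action="ignore" so that NaN values are passed through without being handed to the function.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame([[1, 2, np.nan], [np.nan, 2, 3], [np.nan, 20, np.nan], [3, 5, 7]])

def size_label(value):
    # Hypothetical helper: label a number as "large" or "small"
    return "large" if value >= 5 else "small"

# Apply the function to every element of column 1
a["size"] = a[1].map(size_label)

# na_action="ignore" skips NaN, so the function only sees real numbers
a["scaled"] = a[0].map(lambda v: v * 10, na_action="ignore")
print(a)
```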

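2.3 Replacing values in a column

Signature: DataFrame[column label].replace(to_replace, value, inplace=True/False)
[Note: with inplace=True the object is modified in place and the method returns None; it does not return a "view". With inplace=False (the default) a new object is returned and the original is left unchanged.]

Code: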

import pandas as pd
import numpy as np

a = pd.DataFrame([[1,2,np.nan],[np.nan,2,3], [np.nan, 20, np.nan], [3,5,7]])
print(a)
### Replace NaN with 999 only in the column labeled 0
a[0].replace(np.nan, 999, inplace = True)
print(a)

Result:

     0   1    2
0  1.0   2  NaN
1  NaN   2  3.0
2  NaN  20  NaN
3  3.0   5  7.0
       0   1    2
0    1.0   2  NaN
1  999.0   2  3.0
2  999.0  20  NaN
3    3.0   5  7.0
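
An alternative that avoids calling an inplace method on a selected column is to assign the returned Series back to the column. This is only a sketch of that variant, not the form used above; it is offered on the assumption that newer pandas releases with copy-on-write may not propagate an in-place change on a[0] to the parent DataFrame.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame([[1, 2, np.nan], [np.nan, 2, 3], [np.nan, 20, np.nan], [3, 5, 7]])

# replace() without inplace returns a new Series; a itself is not modified yet
replaced = a[0].replace(np.nan, 999)

# Assigning the result back updates only column 0 of the DataFrame
a[0] = replaced
print(a)
```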

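2.4 Replacing values across the whole DataFrame

Signature: DataFrame.replace(to_replace, value, inplace=True/False)
[Note: with inplace=True the DataFrame is modified in place and the method returns None; it does not return a "view".]

Code: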

import pandas as pd
import numpy as np

a = pd.DataFrame([[1,2,np.nan],[np.nan,2,3], [np.nan, 20, np.nan], [3,5,7]])
print(a)
### Replace NaN with 999 across all elements
a.replace(np.nan, 999, inplace = True)
print(a)

Result:

     0   1    2
0  1.0   2  NaN
1  NaN   2  3.0
2  NaN  20  NaN
3  3.0   5  7.0
       0   1      2
0    1.0   2  999.0
1  999.0   2    3.0
2  999.0  20  999.0
3    3.0   5    7.0
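
replace also accepts a dict, which makes it possible to swap several values in one call. The sketch below is illustrative only: the choice of replacing 20 with 2 is arbitrary, and the call returns a new DataFrame instead of modifying a.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame([[1, 2, np.nan], [np.nan, 2, 3], [np.nan, 20, np.nan], [3, 5, 7]])

# Dict form: each key is a value to look for, each value is its replacement
replaced = a.replace({np.nan: 999, 20: 2})
print(replaced)

# The original DataFrame still contains its NaN values
print(a.isna().sum().sum())
```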

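For continuous numeric data, we usually bin (segment) the values or compute quantiles to get an overall statistical picture of the sample.

Binning signature: pd.cut(x (must be one-dimensional), bins=list of bin edges, labels=list of names, one per bin, ordered=True/False)
[Note: by default each bin includes its right edge but not its left edge, i.e. (left, right]; pass right=False to get [left, right) instead.]

Code: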

import pandas as pd
import numpy as np

a = pd.Series([1,2,10,3,55,200,70,8,93,67])
print(a)
b = pd.cut(a,bins = [-10,50,150,500], labels = ["small","middle","big"], ordered = True)
print(b)

Result:

0      1
1      2
2     10
3      3
4     55
5    200
6     70
7      8
8     93
9     67
dtype: int64
0     small
1     small
2     small
3     small
4    middle
5       big
6    middle
7     small
8    middle
9    middle
dtype: category
Categories (3, object): ['small' < 'middle' < 'big']
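
To make the edge behaviour concrete, here is a small sketch that places values exactly on the bin edges and compares the default right=True with right=False. The Series edge_values and the bin edges are made up for the illustration.

```python
import pandas as pd

# Values chosen so that 50 and 150 sit exactly on bin edges
edge_values = pd.Series([10, 50, 150])
bins = [0, 50, 150, 500]
names = ["small", "middle", "big"]

# Default right=True: bins are (0, 50], (50, 150], (150, 500]
right_closed = pd.cut(edge_values, bins=bins, labels=names)

# right=False: bins are [0, 50), [50, 150), [150, 500)
left_closed = pd.cut(edge_values, bins=bins, labels=names, right=False)

print(list(right_closed))
print(list(left_closed))
```

With right=True the value 50 falls in "small"; with right=False it moves to "middle", and 150 moves from "middle" to "big".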

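Quantile binning signature: new_variable = pd.qcut(x, q), where q is the number of quantile bins
Next, we split the Series a created above at its 25%, 50%, 75% and 100% quantiles and count the samples in each quantile bin.
Code: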

c = pd.qcut(a, 4)
print(c)
d = c.value_counts()
print(d)

Result:

0     (0.999, 4.25]
1     (0.999, 4.25]
2      (4.25, 32.5]
3     (0.999, 4.25]
4     (32.5, 69.25]
5    (69.25, 200.0]
6    (69.25, 200.0]
7      (4.25, 32.5]
8    (69.25, 200.0]
9     (32.5, 69.25]
dtype: category
Categories (4, interval[float64, right]): [(0.999, 4.25] < (4.25, 32.5] < (32.5, 69.25] <
                                           (69.25, 200.0]]
(0.999, 4.25]     3
(69.25, 200.0]    3
(4.25, 32.5]      2
(32.5, 69.25]     2
dtype: int64

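These results show that, for the one-dimensional Series a,
the 25% quantile is 4.25, with 3 elements at or below it
the 50% quantile is 32.5, with 5 elements at or below it
the 75% quantile is 69.25, with 7 elements at or below it
the 100% quantile is 200, with all 10 elements at or below it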
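
The quantile values and cumulative counts above can be cross-checked directly. The following is a minimal sketch using Series.quantile; the names edges and cumulative are only illustrative.

```python
import pandas as pd

a = pd.Series([1, 2, 10, 3, 55, 200, 70, 8, 93, 67])

# Quantile values at 25%, 50%, 75% and 100% (linear interpolation by default)
edges = a.quantile([0.25, 0.5, 0.75, 1.0])
print(edges)

# For each quantile, count how many elements are at or below its value
cumulative = {q: int((a <= edge).sum()) for q, edge in edges.items()}
print(cumulative)
```

The per-bin counts printed by value_counts (3, 2, 2, 3) are the differences between these cumulative counts.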

To close: that covers the pandas preprocessing steps discussed in this article. I hope it helps your learning.

Original: https://blog.csdn.net/dylan_young/article/details/122407203
Author: Efred.D
Title: Common pandas methods (2): data preprocessing with pandas
