Handling abnormal data in pandas (missing values and duplicate values)



1. Missing values

a) A missing value can be represented with None or np.nan

import pandas as pd
import numpy as np
data=[['mark',55,'Italy',4.5,'Europe'],
      ['John',33,'China',3.8,'Asian'],
      ['mary',40,'Japan',2.3,'Asian']]
df=pd.DataFrame(data=data,columns=['name','age','country','score','continent'],
                index=[1001,1002,1003])
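df.loc[1001,'score'] = None   # mark one existing value as missing (stored as NaN)
df.loc[1004,:]=None           # add a new row in which every value is missing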
print(df)
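
As a small check of the point above, the following sketch builds two numeric Series, one using None and one using np.nan, and shows that pandas ends up treating both as the same missing value. The variable names are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd

# In a numeric Series, None is converted to NaN and the dtype becomes float64.
with_none = pd.Series([1, None, 3])
with_nan = pd.Series([1, np.nan, 3])

print(with_none.dtype)              # float64
print(with_none.isna().tolist())    # [False, True, False]
# equals() treats NaNs in the same position as equal.
print(with_none.equals(with_nan))   # True
```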


b) Remove every row that contains missing data

df=df.dropna()

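Note:
df.dropna() does not modify the original DataFrame. It returns a new one, so the result has to be assigned back (or inplace=True has to be passed).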
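
A minimal sketch of that behaviour, using a small hypothetical frame rather than the one above: the original keeps its NaN row until the result of dropna() is assigned.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing score.
df = pd.DataFrame({'name': ['mark', 'John', 'mary'],
                   'score': [np.nan, 3.8, 2.3]})

cleaned = df.dropna()   # returns a new DataFrame; df itself is untouched

print(len(df))                    # 3
print(len(cleaned))               # 2
print(cleaned['name'].tolist())   # ['John', 'mary']
```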

c) Remove only the rows in which every value is missing

df=df.dropna(how='all')
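
To make the difference between the two modes concrete, the sketch below compares how='any' (the default) with how='all' on a small hypothetical frame in which one row is partly missing and one row is entirely missing.

```python
import numpy as np
import pandas as pd

# Row 1001 has one missing value; row 1003 is missing everywhere.
df = pd.DataFrame({'name': ['mark', 'John', None],
                   'score': [np.nan, 3.8, np.nan]},
                  index=[1001, 1002, 1003])

any_dropped = df.dropna(how='any')   # drop rows with at least one missing value
all_dropped = df.dropna(how='all')   # drop rows only if all values are missing

print(any_dropped.index.tolist())    # [1002]
print(all_dropped.index.tolist())    # [1001, 1002]
```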


d) Check whether each position holds NaN

print(df.isna())   # boolean mask: True where a value is missing; df itself is left unchanged
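
The boolean mask is often more useful when aggregated. As a sketch on a hypothetical frame, the code below counts missing values per column and selects the rows that contain at least one missing value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['mark', 'John', 'mary'],
                   'score': [np.nan, 3.8, np.nan]})

# True counts as 1 when summed, so this gives the number of NaNs per column.
missing_per_column = df.isna().sum()
print(missing_per_column['name'])    # 0
print(missing_per_column['score'])   # 2

# any(axis=1) is True for rows that contain at least one missing value.
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing['name'].tolist())   # ['mark', 'mary']
```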


e) Fill in missing values with fillna()

df=df.fillna({'score':df['score'].mean()})
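
The dictionary passed to fillna() can hold a different fill value for each column. The sketch below, on a hypothetical frame, fills a text column with a placeholder and a numeric column with its mean. The placeholder 'unknown' is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'country': ['Italy', None, 'Japan'],
                   'score': [4.0, np.nan, 2.0]})

# mean() skips NaN by default, so the mean here is (4.0 + 2.0) / 2 = 3.0.
filled = df.fillna({'country': 'unknown', 'score': df['score'].mean()})

print(filled['country'].tolist())   # ['Italy', 'unknown', 'Japan']
print(filled['score'].tolist())     # [4.0, 3.0, 2.0]
```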


2. Handling duplicate values

a) Use the drop_duplicates function (the first occurrence is kept)

import pandas as pd
import numpy as np
data=[['mark',55,'Italy',4.5,'Europe'],
      ['John',33,'China',3.8,'Asian'],
      ['mary',40,'Japan',2.3,'Asian'],
      ['fiona',35,'China',5.6,'Asian']]
df=pd.DataFrame(data=data,columns=['name','age','country','score','continent'],
                index=[1001,1002,1003,1004])
df=df.drop_duplicates(['country','continent'])
print(df)
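
Which occurrence survives is controlled by the keep parameter of drop_duplicates. As a sketch on a reduced, hypothetical version of the data, the code below compares the default keep='first' with keep='last' and keep=False.

```python
import pandas as pd

df = pd.DataFrame({'name': ['mark', 'John', 'mary', 'fiona'],
                   'country': ['Italy', 'China', 'Japan', 'China']},
                  index=[1001, 1002, 1003, 1004])

first = df.drop_duplicates(subset=['country'])                 # keep='first' is the default
last = df.drop_duplicates(subset=['country'], keep='last')     # keep the last occurrence
no_dups = df.drop_duplicates(subset=['country'], keep=False)   # drop every duplicated row

print(first.index.tolist())     # [1001, 1002, 1003]
print(last.index.tolist())      # [1001, 1003, 1004]
print(no_dups.index.tolist())   # [1001, 1003]
```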


b) is_unique checks whether a column contains duplicate values

unique() returns the distinct values that remain after deduplication

import pandas as pd
import numpy as np
data=[['mark',55,'Italy',4.5,'Europe'],
      ['John',33,'China',3.8,'Asian'],
      ['mary',40,'Japan',2.3,'Asian'],
      ['fiona',35,'China',5.6,'Asian']]
df=pd.DataFrame(data=data,columns=['name','age','country','score','continent'],
                index=[1001,1002,1003,1004])
print(df['country'].is_unique)
countries=df['country'].unique()   # NumPy array of distinct values; df is left unchanged
print(countries)
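
Two related Series methods, nunique() and value_counts(), are often used alongside is_unique and unique(). The sketch below shows all four on a hypothetical Series of country names.

```python
import pandas as pd

countries = pd.Series(['Italy', 'China', 'Japan', 'China'])

print(countries.is_unique)        # False, because 'China' appears twice
print(list(countries.unique()))   # ['Italy', 'China', 'Japan'], in order of first appearance
print(countries.nunique())        # 3 distinct values

# value_counts() reports how many times each value occurs.
counts = countries.value_counts()
print(counts['China'])            # 2
```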


c) Locate duplicate rows with duplicated()

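  1. With keep='first', the first occurrence is treated as the original (marked False) and only later repeats are marked True.
    With keep=False, every entry that has a duplicate is marked True.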
import pandas as pd
import numpy as np
data=[['mark',55,'Italy',4.5,'Europe'],
      ['John',33,'China',3.8,'Asian'],
      ['mary',40,'Japan',2.3,'Asian'],
      ['fiona',35,'China',5.6,'Asian']]
df=pd.DataFrame(data=data,columns=['name','age','country','score','continent'],
                index=[1001,1002,1003,1004])
print(df['country'].duplicated(keep=False))
print(df['country'].duplicated(keep='first'))
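
For reference, the masks produced by the different keep settings can be written out explicitly. The sketch below uses a hypothetical Series with the same country values and also includes keep='last', which the example above does not cover.

```python
import pandas as pd

countries = pd.Series(['Italy', 'China', 'Japan', 'China'],
                      index=[1001, 1002, 1003, 1004])

first_mask = countries.duplicated(keep='first')   # only later repeats are True
last_mask = countries.duplicated(keep='last')     # only earlier repeats are True
all_mask = countries.duplicated(keep=False)       # every duplicated entry is True

print(first_mask.tolist())   # [False, False, False, True]
print(last_mask.tolist())    # [False, True, False, False]
print(all_mask.tolist())     # [False, True, False, True]
```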

  2. Use the boolean mask to select the duplicate rows
import pandas as pd
import numpy as np
data=[['mark',55,'Italy',4.5,'Europe'],
      ['John',33,'China',3.8,'Asian'],
      ['mary',40,'Japan',2.3,'Asian'],
      ['fiona',35,'China',5.6,'Asian']]
df=pd.DataFrame(data=data,columns=['name','age','country','score','continent'],
                index=[1001,1002,1003,1004])
df=df.loc[df['country'].duplicated(keep=False),:]
print(df)
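
The same selection can also be based on several columns at once by calling duplicated() on the DataFrame with a subset argument. This is a sketch on a reduced, hypothetical version of the data, where a row counts as a duplicate only if both country and continent match another row.

```python
import pandas as pd

df = pd.DataFrame({'name': ['mark', 'John', 'mary', 'fiona'],
                   'country': ['Italy', 'China', 'Japan', 'China'],
                   'continent': ['Europe', 'Asian', 'Asian', 'Asian']},
                  index=[1001, 1002, 1003, 1004])

# True for every row whose (country, continent) pair occurs more than once.
mask = df.duplicated(subset=['country', 'continent'], keep=False)
dupes = df.loc[mask]

print(dupes.index.tolist())     # [1002, 1004]
print(dupes['name'].tolist())   # ['John', 'fiona']
```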


Original: https://blog.csdn.net/m0_51328444/article/details/128276381
Author: cousinmary
Title: Handling abnormal data in pandas (missing values and duplicate values)

