pandas deduplication functions

  • DataFrame.duplicated(subset=None, keep='first')

Returns a boolean Series that marks duplicate rows; True means the row is a duplicate.

subset: column label or sequence of labels, optional

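Restricts the duplicate check to the given column or columns. By default all columns are compared, so only fully identical rows count as duplicates.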

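keep: {'first', 'last', False}, default 'first'

'first': mark every duplicate as True except the first occurrence (the default)

'last': mark every duplicate as True except the last occurrence

False: mark all duplicates as True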

import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})
df

    brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

df.duplicated()

0    False
1     True
2    False
3    False
4    False
dtype: bool
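
The effect of the subset and keep parameters described above can be seen on the same sample data. The following is a minimal sketch; the variable names dup_last, dup_all and dup_subset are illustrative only and do not come from the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# keep='last': every occurrence except the last one is flagged.
dup_last = df.duplicated(keep='last')

# keep=False: every row that has at least one duplicate is flagged.
dup_all = df.duplicated(keep=False)

# subset: only the listed columns are compared (keep defaults to 'first').
dup_subset = df.duplicated(subset=['brand', 'style'])

print(dup_last.tolist())    # [True, False, False, False, False]
print(dup_all.tolist())     # [True, True, False, False, False]
print(dup_subset.tolist())  # [False, True, False, False, True]

# The boolean Series can be used as a mask to inspect the duplicated rows.
print(df[dup_all])
```

Rows 3 and 4 share brand and style but differ in rating, so they are only reported as duplicates when the comparison is limited to those two columns.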
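  • DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

Returns a DataFrame with duplicate rows removed. By default it keeps the first occurrence of each row, does not modify the original object, and leaves the index labels of the remaining rows unchanged.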

  • subset: column label or sequence of labels, optional. Only consider certain columns for identifying duplicates; by default all columns are used.

  • keep: {'first', 'last', False}, default 'first'. Same meaning as in pandas.DataFrame.duplicated(): 'first' keeps the first occurrence, 'last' keeps the last occurrence, and False drops every duplicated row.

  • inplace: bool, default False. Whether to drop duplicates in place or to return a copy.

  • ignore_index: bool, default False. If True, the resulting axis will be labeled 0, 1, …, n - 1. New in version 1.0.0.

Returns

  • DataFrame or None: DataFrame with duplicates removed, or None if inplace=True.
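
As a rough illustration of these parameters, the sketch below applies drop_duplicates to the sample DataFrame used earlier. The variable names (deduped, by_brand_style, df_copy) are made up for this example, and ignore_index assumes pandas 1.0.0 or later.

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# Defaults: compare all columns, keep the first occurrence,
# and keep the original index labels of the surviving rows.
deduped = df.drop_duplicates()
print(deduped.index.tolist())  # [0, 2, 3, 4]

# Compare only brand and style, keep the last occurrence,
# and relabel the result 0, 1, ..., n - 1.
by_brand_style = df.drop_duplicates(
    subset=['brand', 'style'], keep='last', ignore_index=True
)
print(by_brand_style['rating'].tolist())  # [4.0, 3.5, 5.0]
print(by_brand_style.index.tolist())      # [0, 1, 2]

# inplace=True modifies the object it is called on,
# so the work is done on a copy here to leave df untouched.
df_copy = df.copy()
df_copy.drop_duplicates(inplace=True)
print(len(df), len(df_copy))  # 5 4
```

Returning a new DataFrame (the default) is usually easier to follow than inplace=True, because the original data stays available for comparison.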

  • Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)

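Counts how many times each distinct value occurs. The result is sorted in descending order by default, so the most frequent value comes first; NA values are excluded by default.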

import numpy as np
import pandas as pd
index = pd.Index([3, 1, 2, 3, 4, np.nan])
index.value_counts()

3.0    2
2.0    1
4.0    1
1.0    1
dtype: int64
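
The dropna and normalize parameters in the signature can be tried on the same values. This is a small sketch, using a Series rather than an Index; the names s, counts, with_nan and shares are illustrative. The display order of values with equal counts can differ between pandas versions, so the example only checks the counts themselves.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 1, 2, 3, 4, np.nan])

# Default: NaN is excluded and the most frequent value comes first.
counts = s.value_counts()
print(counts.index[0], counts.iloc[0])  # 3.0 2

# dropna=False: NaN gets its own entry in the result.
with_nan = s.value_counts(dropna=False)
print(len(counts), len(with_nan))  # 4 5

# normalize=True: relative frequencies instead of raw counts.
# With the default dropna=True, the shares are taken over the
# five non-NaN values, so 3.0 accounts for 2 / 5 of them.
shares = s.value_counts(normalize=True)
print(shares.loc[3.0])
```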

Original: https://blog.csdn.net/what_how_why2020/article/details/114982839
Author: 思想在拧紧
Title: pandas deduplication functions (pandas去重函数)
