DataFrame.duplicated(subset=None, keep='first')
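Returns a boolean Series that marks duplicate rows; True means the row duplicates another row.
subset: column label or sequence of labels, optional
Restricts the duplicate check to the given column(s). By default all columns are used, so whole rows are compared.
keep: {'first', 'last', False}, default 'first'
'first': mark every duplicate as True except the first occurrence (default).
'last': mark every duplicate as True except the last occurrence.
False: mark all duplicates as True.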
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})
df
     brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0
df.duplicated()
0    False
1     True
2    False
3    False
4    False
dtype: bool
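The other keep and subset settings described above can be tried on the same data. This is a minimal sketch that rebuilds the example DataFrame; the variable names (dup_last, dup_all, dup_brand, dup_brand_style) are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# keep='last': flag every duplicate except the last occurrence,
# so row 0 is flagged instead of row 1.
dup_last = df.duplicated(keep='last').tolist()

# keep=False: flag every row that has a duplicate anywhere.
dup_all = df.duplicated(keep=False).tolist()

# subset: compare only the listed columns instead of whole rows.
dup_brand = df.duplicated(subset=['brand']).tolist()
dup_brand_style = df.duplicated(subset=['brand', 'style']).tolist()

print(dup_last)         # [True, False, False, False, False]
print(dup_all)          # [True, True, False, False, False]
print(dup_brand)        # [False, True, False, True, True]
print(dup_brand_style)  # [False, True, False, False, True]
```

Rows 3 and 4 differ only in rating, so they count as duplicates once rating is left out of subset.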
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
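Returns a DataFrame with duplicate rows removed. By default it keeps the first occurrence of each row, does not modify the original in place, and keeps the original index labels instead of renumbering the remaining rows.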
- subset: column label or sequence of labels, optional. Only consider certain columns for identifying duplicates; by default all columns are used.
- keep: {'first', 'last', False}, default 'first'. Same meaning as in pandas.DataFrame.duplicated(): it selects which occurrence is not treated as a duplicate and therefore survives; False drops every duplicated row.
- inplace: bool, default False. Whether to drop duplicates in place or to return a copy.
- ignore_index: bool, default False. If True, the resulting axis will be labeled 0, 1, …, n - 1. New in version 1.0.0.
Returns
- DataFrame or None: DataFrame with duplicates removed, or None if inplace=True.
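As a rough illustration of the parameters above, the sketch below applies drop_duplicates to the same sample data used for duplicated(). The variable names (deduped, latest, work, result) are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# Defaults: compare whole rows, keep the first occurrence, keep index labels.
deduped = df.drop_duplicates()
print(deduped.index.tolist())  # [0, 2, 3, 4]

# Compare only brand and style, keep the last occurrence,
# and relabel the surviving rows 0..n-1.
latest = df.drop_duplicates(subset=['brand', 'style'], keep='last',
                            ignore_index=True)
print(latest.index.tolist())      # [0, 1, 2]
print(latest['rating'].tolist())  # [4.0, 3.5, 5.0]

# inplace=True modifies the object itself and returns None.
work = df.copy()
result = work.drop_duplicates(inplace=True)
print(result, len(work))  # None 4
```

Working on a copy, as in the last step, keeps the original df available for comparison.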
Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)
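Counts how many times each distinct value occurs. The result is sorted in descending order by default, so the most frequent value comes first; NA values are excluded by default.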
import numpy as np

index = pd.Index([3, 1, 2, 3, 4, np.nan])
index.value_counts()
3.0 2
2.0 1
4.0 1
1.0 1
dtype: int64
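The dropna and normalize parameters in the signature change what gets counted and how it is reported. The following is a small sketch on a Series holding the same values as the example; the names s, counts, with_na and shares are chosen here for illustration. The relative order of values with equal counts can differ between pandas versions, so only the top entry is relied on.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 1, 2, 3, 4, np.nan])

# Default: NaN is excluded and the most frequent value comes first.
counts = s.value_counts()
print(counts.index[0], counts.iloc[0])  # 3.0 2

# dropna=False adds a separate entry for NaN.
with_na = s.value_counts(dropna=False)
print(int(with_na.sum()))  # 6

# normalize=True returns relative frequencies over the non-NaN values.
shares = s.value_counts(normalize=True)
print(shares.loc[3.0])
```

With normalize=True the frequencies sum to 1, and the value 3.0 accounts for 2 of the 5 non-NaN entries.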
Original: https://blog.csdn.net/what_how_why2020/article/details/114982839
Author: 思想在拧紧
Title: pandas deduplication functions (pandas去重函数)