pandas 数据处理-Group by操作

使用 “group by” 方式我们通常会有以下一个或几个步骤:

  • Splitting:根据某一准则对数据分组
  • Applying :对每一分组数据运用某个方法
  • Combining :将结果组合为数据结构
    在上述步骤中,split 方法较直接,在 split 之后我们希望对分组数据做相关计算,在 apply 步骤中我们可能想对数据进行如下操作:
  • Aggregation::聚合操作,对分组数据做汇总统计,如计算sums 或 means、统计分组个数 counts
  • Transformation:对分组数据做特定操作,如:分组数据进行标准化、按照分组数据数据选择值填充控制
  • Filtration:过滤数据。对组计算 True 或 False,按组进弃数据。如:丢弃某些 counts 较小的组,根据组sums 或 means 过滤数据

1 Split

pandas objects 可以基于任何轴进行分割,group by 会创建一个 GroupBy object 对象

import numpy as np
import pandas as pd
df = pd.DataFrame(
    [
        ("bird", "Falconiformes", 389.0),
        ("bird", "Psittaciformes", 24.0),
        ("mammal", "Carnivora", 80.2),
        ("mammal", "Primates", np.nan),
        ("mammal", "Carnivora", 58),
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
    columns=("class", "order", "max_speed"),)

grouped = df.groupby("class")
grouped = df.groupby("order", axis="columns")
grouped = df.groupby(["class", "order"])
grouped
Out[2]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000023F22DEA9C8>
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)

df2 = df.set_index(["A", "B"])

Out[2]:
            C       D
A   B
foo one 0.194055    -0.087457
bar one -1.542546   -1.442626
foo two 0.867688    -0.540060
bar three   1.622831    0.331491
foo two -0.364909   0.639529
bar two 0.771066    -0.675301
foo one 1.071776    0.884663
    three   1.367875    1.474144

1.1 GroupBy sorting

groupby 之后数据会按照 key 值进行排序,可以显示声明不排序

df2.groupby(["X"], sort=False).sum()

1.2 GroupBy dropna

默认情况下, 在 groupby 操作中会排除 NA key 的统计,可以通过设置 dropna=False 去除这一限制

df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])

df_dropna.groupby(by=["b"], dropna=True).sum()
Out[30]:
     a  c
b
1.0  2  3
2.0  2  5

df_dropna.groupby(by=["b"], dropna=False).sum()
Out[31]:
     a  c
b
1.0  2  3
2.0  2  5
NaN  1  4

1.3 GroupBy object attributes

groups 的属性是一个字典,key 为分组唯一值,value 为属于key的轴标签

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
gb= df.groupby(["A", "B"])
gb.groups
Out[32]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}

len(grouped)

gb.agg        gb.boxplot    gb.cummin     gb.describe   gb.filter     gb.get_group  gb.height     gb.last       gb.median     gb.ngroups    gb.plot       gb.rank       gb.std        gb.transform
gb.aggregate  gb.count      gb.cumprod    gb.dtype      gb.first      gb.groups     gb.hist       gb.max        gb.min        gb.nth        gb.prod       gb.resample   gb.sum        gb.var
gb.apply      gb.cummax     gb.cumsum     gb.fillna     gb.gender     gb.head       gb.indices    gb.mean       gb.name       gb.ohlc       gb.quantile   gb.size       gb.tail       gb.weight

1.4 Grouping DataFrame with Index levels and columns

arrays = [
    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
    ["one", "two", "one", "two", "one", "two", "one", "two"],
]

index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
df = pd.DataFrame({"A": [1, 1, 1, 1, 2, 2, 3, 3], "B": np.arange(8)}, index=index)

Out[53]:
              A  B
first second
bar   one     1  0
      two     1  1
baz   one     1  2
      two     1  3
foo   one     2  4
      two     2  5
qux   one     3  6
      two     3  7

Out[54]:
          B
second A
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7

df.groupby([pd.Grouper(level="second"), "A"]).sum()

df.groupby(["second", "A"]).sum()

1.5 DataFrame column selection in GroupBy

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
grouped = df.groupby(["A"])
grouped_C = grouped["C"]

df["C"].groupby(df["A"])

2 遍历组 Iterating through groups


grouped = df.groupby('A')
for name, group in grouped:
    print(name)
    print(group)

bar
     A      B         C         D
1  bar    one  0.254161  1.511763
3  bar  three  0.215897 -0.990582
5  bar    two -0.077118  1.211526
foo
     A      B         C         D
0  foo    one -0.575247  1.346061
2  foo    two -1.143704  1.627081
4  foo    two  1.193555 -0.441652
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

3 选择组 Selecting a group

grouped.get_group("bar")

df.groupby(["A", "B"]).get_group(("bar", "one"))

4 数据聚合 Aggregation

GroupBy object 对象创建之后我们可以对分组数据做计算

grouped = df.groupby("A")
grouped.aggregate(np.sum)
Out[69]:
        C         D
A
bar  0.392940  1.732707
foo -1.796421  2.824590

grouped = df.groupby(["A", "B"])
grouped.aggregate(np.sum)
Out[71]:
                  C         D
A   B
bar one    0.254161  1.511763
    three  0.215897 -0.990582
    two   -0.077118  1.211526
foo one   -0.983776  1.614581
    three -0.862495  0.024580
    two    0.049851  1.185429

从上面结果可以看到,对分组数据做计算之后,分组 name 作为了新的索引。如果不想分组name 最为新索引,可以用 as_index 操作数据

grouped = df.groupby(["A", "B"], as_index=False)
grouped.aggregate(np.sum)
Out[73]:
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429

df.groupby("A", as_index=False).sum()

df.groupby(["A", "B"]).sum().reset_index()
Out[75]:
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429

统计每组个数时,返回为 Series ,group name 为索引,值为组的大小

grouped = df.groupby(["A", "B"], as_index=False)
grouped.size()
Out[76]:
     A      B  size
0  bar    one     1
1  bar  three     1
2  bar    two     1
3  foo    one     2
4  foo  three     1
5  foo    two     2

grouped.describe()

df4.groupby("A")["B"].nunique()

可以应用于分组数据的方法

pandas 数据处理-Group by操作
汇总返回的都为 Series 类型数据

4.1 对分组数据一次执行多个方法

执行单个汇总方法,返回的为 Series 类型数据。当传入多个统计方法时候,返回为 DataFrame

grouped = df.groupby("A")
grouped["C"].agg([np.sum, np.mean, np.std])
Out[83]:
          sum      mean       std
A
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265

grouped[["C", "D"]].agg([np.sum, np.mean, np.std])
Out[84]:
            C                             D
          sum      mean       std       sum      mean       std
A
bar  0.392940  0.130980  0.181231  1.732707  0.577569  1.366330
foo -1.796421 -0.359284  0.912265  2.824590  0.564918  0.884785

应用 lambda 函数

grouped["C"].agg([lambda x: x.max() - x.min(), lambda x: x.median() - x.mean()])
Out[88]:
     <lambda_0>  <lambda_1>
A
bar    0.331279    0.084917
foo    2.337259   -0.215962

4.2 命名聚合Named aggregation

animals = pd.DataFrame(
    {
        "kind": ["cat", "dog", "cat", "dog"],
        "height": [9.1, 6.0, 9.5, 34.0],
        "weight": [7.9, 7.5, 9.9, 198.0],
    }
)

animals.groupby("kind").agg(
    min_height=pd.NamedAgg(column="height", aggfunc="min"),
    max_height=pd.NamedAgg(column="height", aggfunc="max"),
    average_weight=pd.NamedAgg(column="weight", aggfunc=np.mean),
)
Out[91]:
      min_height  max_height  average_weight
kind
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

animals.groupby("kind").agg(
    min_height=("height", "min"),
    max_height=("height", "max"),
    average_weight=("weight", np.mean),
)

grouped.agg({"C": np.sum, "D": lambda x: np.std(x, ddof=1)})
Out[95]:
            C         D
A
bar  0.392940  1.366330
foo -1.796421  0.884785

animals.groupby("kind")[["height"]].agg(lambda x: x.astype(int).sum())

5 过滤 Filtration

dff = pd.DataFrame({"A": np.arange(8), "B": list("aabbbbcc")})
dff.groupby("B").filter(lambda x: len(x) > 2)
Out[142]:
   A  B
2  2  b
3  3  b
4  4  b
5  5  b
dff.groupby("B").filter(lambda x: len(x) > 2, dropna=False)
Out[143]:
     A    B
0  NaN  NaN
1  NaN  NaN
2  2.0    b
3  3.0    b
4  4.0    b
5  5.0    b
6  NaN  NaN
7  NaN  NaN

参考

Group by: split-apply-combine

Original: https://blog.csdn.net/weixin_40994552/article/details/124906960
Author: 小何才露尖尖角
Title: pandas 数据处理-Group by操作

原创文章受到原创版权保护。转载请注明出处:https://www.johngo689.com/737861/

转载文章受原作者版权保护。转载请注明原作者出处!

(0)

大家都在看

亲爱的 Coder【最近整理,可免费获取】👉 最新必读书单  | 👏 面试题下载  | 🌎 免费的AI知识星球