pandas Data Processing: Group By Operations

August 6, 2023, 3:43 PM • Python • 63 views

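A "group by" operation usually involves one or more of the following steps:

Splitting: divide the data into groups according to some criterion
Applying: apply a function to each group independently
Combining: combine the results into a data structure

Of these, the split step is the most straightforward. After splitting, we usually want to compute something on the grouped data. In the apply step we may want to do one of the following:

Aggregation: compute summary statistics for each group, such as sums, means, or group sizes (counts)
Transformation: perform group-specific computations, such as standardizing the data within each group or filling missing values with a value derived from each group
Filtration: discard some groups according to a group-wise computation that evaluates to True or False, for example dropping groups with only a few members, or filtering on the group sum or mean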
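
As a rough illustration of the three kinds of apply step, the sketch below uses a small made-up DataFrame. The `sales` table and its `store` and `amount` columns are assumptions for illustration and do not come from this article.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
sales = pd.DataFrame(
    {
        "store": ["a", "a", "b", "b", "b", "c"],
        "amount": [10.0, 20.0, 5.0, 15.0, 40.0, 7.0],
    }
)
by_store = sales.groupby("store")["amount"]

# Aggregation: one summary value per group
totals = by_store.sum()

# Transformation: the result keeps the shape of the input
centered = by_store.transform(lambda s: s - s.mean())

# Filtration: keep only the rows of groups with at least two members
large = sales.groupby("store").filter(lambda g: len(g) >= 2)

print(totals)
print(centered)
print(large)
```

Aggregation shrinks the data to one row per group, transformation returns a result aligned with the original rows, and filtration returns a subset of the original rows.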

1 Split

pandas objects can be split along any of their axes. Calling groupby creates a GroupBy object.

import numpy as np
import pandas as pd
df = pd.DataFrame(
    [
        ("bird", "Falconiformes", 389.0),
        ("bird", "Psittaciformes", 24.0),
        ("mammal", "Carnivora", 80.2),
        ("mammal", "Primates", np.nan),
        ("mammal", "Carnivora", 58),
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
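    columns=("class", "order", "max_speed"),
)

# Group by a single column
grouped = df.groupby("class")
# Grouping along the columns axis; the axis argument is deprecated since pandas 2.1
# grouped = df.groupby("order", axis="columns")
# Group by several columns
grouped = df.groupby(["class", "order"])
grouped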
Out[2]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000023F22DEA9C8>

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)

df2 = df.set_index(["A", "B"])

df2
Out[2]:
                  C         D
A   B
foo one    0.194055 -0.087457
bar one   -1.542546 -1.442626
foo two    0.867688 -0.540060
bar three  1.622831  0.331491
foo two   -0.364909  0.639529
bar two    0.771066 -0.675301
foo one    1.071776  0.884663
    three  1.367875  1.474144

1.1 GroupBy sorting

By default, the group keys are sorted in the result of a groupby operation. Pass sort=False to turn sorting off explicitly.

# df2 has no "X" column; group by its index level "A" instead
df2.groupby(["A"], sort=False).sum()
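
To make the effect of sort visible, here is a minimal sketch on a separate toy frame. The name `df_sort` and its `X` and `Y` columns are assumptions introduced for this example.

```python
import pandas as pd

# Hypothetical frame whose keys appear in the order B, A
df_sort = pd.DataFrame({"X": ["B", "B", "A", "A"], "Y": [1, 2, 3, 4]})

sorted_sum = df_sort.groupby("X").sum()                # keys in sorted order
unsorted_sum = df_sort.groupby("X", sort=False).sum()  # keys in order of first appearance

print(sorted_sum.index.tolist())
print(unsorted_sum.index.tolist())
```

The sums themselves are identical; only the order of the rows in the result differs.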

1.2 GroupBy dropna

By default, rows whose group key is NA are excluded from a groupby operation. Set dropna=False to keep them as a group of their own.

df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])

df_dropna.groupby(by=["b"], dropna=True).sum()
Out[30]:
     a  c
b
1.0  2  3
2.0  2  5

df_dropna.groupby(by=["b"], dropna=False).sum()
Out[31]:
     a  c
b
1.0  2  3
2.0  2  5
NaN  1  4

1.3 GroupBy object attributes

The groups attribute is a dictionary whose keys are the unique group keys and whose values are the axis labels belonging to each group.

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
gb = df.groupby(["A", "B"])
gb.groups
Out[32]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}

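len(gb)  # number of groups, here 6

Tab completion on a GroupBy object lists its attributes and methods. The listing below comes from the pandas documentation, for a DataFrame with gender, height and weight columns, so gb.gender, gb.height and gb.weight are column names rather than methods:
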
gb.agg        gb.boxplot    gb.cummin     gb.describe   gb.filter     gb.get_group  gb.height     gb.last       gb.median     gb.ngroups    gb.plot       gb.rank       gb.std        gb.transform
gb.aggregate  gb.count      gb.cumprod    gb.dtype      gb.first      gb.groups     gb.hist       gb.max        gb.min        gb.nth        gb.prod       gb.resample   gb.sum        gb.var
gb.apply      gb.cummax     gb.cumsum     gb.fillna     gb.gender     gb.head       gb.indices    gb.mean       gb.name       gb.ohlc       gb.quantile   gb.size       gb.tail       gb.weight
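
A few of these attributes can be tried on a small frame. This is only a sketch; `df_attr` and `gb_attr` are hypothetical names chosen for the example.

```python
import pandas as pd

# Hypothetical toy data
df_attr = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "one"],
        "C": [1, 2, 3, 4],
    }
)
gb_attr = df_attr.groupby(["A", "B"])

# Number of groups, available as an attribute or through len()
n_groups = gb_attr.ngroups

# Mapping from each group key to the row labels in that group
group_labels = {key: list(labels) for key, labels in gb_attr.groups.items()}

# Number of rows in each group
sizes = gb_attr.size()

print(n_groups, len(gb_attr))
print(group_labels)
print(sizes)
```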

1.4 Grouping DataFrame with Index levels and columns

arrays = [
    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
    ["one", "two", "one", "two", "one", "two", "one", "two"],
]

index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
df = pd.DataFrame({"A": [1, 1, 1, 1, 2, 2, 3, 3], "B": np.arange(8)}, index=index)

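df
Out[53]:
              A  B
first second
bar   one     1  0
      two     1  1
baz   one     1  2
      two     1  3
foo   one     2  4
      two     2  5
qux   one     3  6
      two     3  7

A DataFrame can be grouped by a combination of index levels and columns. Specify index levels with pd.Grouper and columns by name:

df.groupby([pd.Grouper(level="second"), "A"]).sum()
Out[54]:
          B
second A
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7

Index level names can also be passed directly as keys, which gives the same result:

df.groupby(["second", "A"]).sum()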
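
The two spellings can be compared directly. The sketch below rebuilds the same frame under the hypothetical name `df_levels` so that it runs on its own.

```python
import numpy as np
import pandas as pd

arrays = [
    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
    ["one", "two", "one", "two", "one", "two", "one", "two"],
]
index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
df_levels = pd.DataFrame(
    {"A": [1, 1, 1, 1, 2, 2, 3, 3], "B": np.arange(8)}, index=index
)

# Index level given through pd.Grouper, column given by name
by_grouper = df_levels.groupby([pd.Grouper(level="second"), "A"]).sum()

# Index level and column both given by name
by_name = df_levels.groupby(["second", "A"]).sum()

print(by_grouper)
print(by_name)
```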

1.5 DataFrame column selection in GroupBy

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
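grouped = df.groupby(["A"])
# Select column C from the grouped DataFrame
grouped_C = grouped["C"]

# A more verbose way to group the same column: group the Series df["C"] by the values of df["A"]
df["C"].groupby(df["A"])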
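
As a quick check that the short and verbose forms aggregate to the same values, here is a sketch on made-up data. The name `df_sel` and its numbers are assumptions.

```python
import pandas as pd

# Hypothetical toy data
df_sel = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "C": [1.0, 2.0, 3.0, 4.0],
        "D": [10.0, 20.0, 30.0, 40.0],
    }
)

# Select the column after grouping
short_form = df_sel.groupby("A")["C"].sum()

# Group the selected Series by another Series
long_form = df_sel["C"].groupby(df_sel["A"]).sum()

print(short_form)
print(long_form)
```

Selecting the column on the GroupBy object avoids repeating the grouping key and restricts the computation to the columns that are needed.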

2 Iterating through groups


grouped = df.groupby('A')
for name, group in grouped:
    print(name)
    print(group)

bar
     A      B         C         D
1  bar    one  0.254161  1.511763
3  bar  three  0.215897 -0.990582
5  bar    two -0.077118  1.211526
foo
     A      B         C         D
0  foo    one -0.575247  1.346061
2  foo    two -1.143704  1.627081
4  foo    two  1.193555 -0.441652
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580
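
When grouping by several keys, the group name produced during iteration is a tuple. A minimal sketch, with `df_iter` as a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy data
df_iter = pd.DataFrame(
    {"A": ["foo", "bar", "foo"], "B": ["one", "one", "one"], "C": [1, 2, 3]}
)

rows_per_group = {}
for name, group in df_iter.groupby(["A", "B"]):
    # name is a tuple such as ("bar", "one"); group is the matching sub-DataFrame
    rows_per_group[name] = len(group)

print(rows_per_group)
```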

3 Selecting a group

grouped.get_group("bar")

df.groupby(["A", "B"]).get_group(("bar", "one"))
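
For reference, get_group returns the rows of one group as an ordinary DataFrame, much like a boolean selection on the key column. The sketch below assumes a small made-up frame named `df_get`.

```python
import pandas as pd

# Hypothetical toy data
df_get = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "two"],
        "C": [1, 2, 3, 4],
    }
)

# Single key: pass the key value
bar_rows = df_get.groupby("A").get_group("bar")

# Several keys: pass a tuple with one value per key
bar_one = df_get.groupby(["A", "B"]).get_group(("bar", "one"))

# The same rows selected with a boolean mask
same_rows = df_get[df_get["A"] == "bar"]

print(bar_rows)
print(bar_one)
```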

4 Aggregation

Once the GroupBy object has been created, we can run computations on the grouped data.

grouped = df.groupby("A")
# Select the numeric columns first; column B holds text
grouped[["C", "D"]].aggregate("sum")
Out[69]:
        C         D
A
bar  0.392940  1.732707
foo -1.796421  2.824590

grouped = df.groupby(["A", "B"])
grouped.aggregate("sum")
Out[71]:
                  C         D
A   B
bar one    0.254161  1.511763
    three  0.215897 -0.990582
    two   -0.077118  1.211526
foo one   -0.983776  1.614581
    three -0.862495  0.024580
    two    0.049851  1.185429

As the results above show, the group names become the index of the aggregated result. If you do not want the group names as the new index, pass as_index=False.

grouped = df.groupby(["A", "B"], as_index=False)
grouped.aggregate("sum")
Out[73]:
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429

df.groupby("A", as_index=False).sum(numeric_only=True)

df.groupby(["A", "B"]).sum().reset_index()
Out[75]:
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429
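
The two ways of keeping the group keys as ordinary columns can be compared on a small frame. This is a sketch; `df_idx` and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data
df_idx = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "one", "two"],
        "C": [1.0, 2.0, 3.0, 4.0],
    }
)

# Keys stay as columns because as_index=False
flat = df_idx.groupby(["A", "B"], as_index=False).sum()

# Keys become the index first and are then moved back into columns
reset = df_idx.groupby(["A", "B"]).sum().reset_index()

print(flat)
print(reset)
```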

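size() counts the rows in each group. By default it returns a Series whose index is the group names and whose values are the group sizes. With as_index=False, as below, it returns a DataFrame with a size column instead.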

grouped = df.groupby(["A", "B"], as_index=False)
grouped.size()
Out[76]:
     A      B  size
0  bar    one     1
1  bar  three     1
2  bar    two     1
3  foo    one     2
4  foo  three     1
5  foo    two     2


grouped.describe()

# Count the distinct values of B within each group of A (df4 was undefined; df is used instead)
df.groupby("A")["B"].nunique()

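Methods that can be applied to grouped data include mean, sum, size, count, std, var, describe, first, last, nth, min and max.

Applied to a single selected column, a reduction such as sum or mean returns a Series.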

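4.1 Applying multiple functions at once

Applying a single aggregation function to a grouped Series returns a Series. Passing a list of functions returns a DataFrame with one column per function.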

grouped = df.groupby("A")
grouped["C"].agg(["sum", "mean", "std"])
Out[83]:
          sum      mean       std
A
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265

grouped[["C", "D"]].agg(["sum", "mean", "std"])
Out[84]:
            C                             D
          sum      mean       std       sum      mean       std
A
bar  0.392940  0.130980  0.181231  1.732707  0.577569  1.366330
foo -1.796421 -0.359284  0.912265  2.824590  0.564918  0.884785

Applying lambda functions

grouped["C"].agg([lambda x: x.max() - x.min(), lambda x: x.median() - x.mean()])
Out[88]:
     <lambda_0>  <lambda_1>
A
bar    0.331279    0.084917
foo    2.337259   -0.215962

4.2 Named aggregation

animals = pd.DataFrame(
    {
        "kind": ["cat", "dog", "cat", "dog"],
        "height": [9.1, 6.0, 9.5, 34.0],
        "weight": [7.9, 7.5, 9.9, 198.0],
    }
)

animals.groupby("kind").agg(
    min_height=pd.NamedAgg(column="height", aggfunc="min"),
    max_height=pd.NamedAgg(column="height", aggfunc="max"),
    average_weight=pd.NamedAgg(column="weight", aggfunc="mean"),
)
Out[91]:
      min_height  max_height  average_weight
kind
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

animals.groupby("kind").agg(
    min_height=("height", "min"),
    max_height=("height", "max"),
    average_weight=("weight", "mean"),
)
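
Named aggregation can also be used on a single selected column, where each keyword is the output column name and each value is the function to apply. A minimal sketch that rebuilds the animals frame so it runs on its own (the variable name `heights` is an assumption):

```python
import pandas as pd

animals = pd.DataFrame(
    {
        "kind": ["cat", "dog", "cat", "dog"],
        "height": [9.1, 6.0, 9.5, 34.0],
        "weight": [7.9, 7.5, 9.9, 198.0],
    }
)

# On a grouped Series no column name is needed, only output name = function
heights = animals.groupby("kind")["height"].agg(min_height="min", max_height="max")

print(heights)
```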

grouped.agg({"C": "sum", "D": lambda x: np.std(x, ddof=1)})
Out[95]:
            C         D
A
bar  0.392940  1.366330
foo -1.796421  0.884785

animals.groupby("kind")[["height"]].agg(lambda x: x.astype(int).sum())

5 Filtration

dff = pd.DataFrame({"A": np.arange(8), "B": list("aabbbbcc")})
dff.groupby("B").filter(lambda x: len(x) > 2)
Out[142]:
   A  B
2  2  b
3  3  b
4  4  b
5  5  b
dff.groupby("B").filter(lambda x: len(x) > 2, dropna=False)
Out[143]:
     A    B
0  NaN  NaN
1  NaN  NaN
2  2.0    b
3  3.0    b
4  4.0    b
5  5.0    b
6  NaN  NaN
7  NaN  NaN
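
In the example above, filter keeps every row of a group for which the function returns True. With dropna=False the rows of the rejected groups are kept but filled with NaN. The same selection can be sketched with a boolean mask built by transform, a swapped-in technique that the original article does not use; the variable names below are assumptions.

```python
import numpy as np
import pandas as pd

dff = pd.DataFrame({"A": np.arange(8), "B": list("aabbbbcc")})

# filter: the function receives each group and returns True or False
kept = dff.groupby("B").filter(lambda g: len(g) > 2)

# Alternative: broadcast each group's count of non-missing values of A
# back to its rows, then use the comparison as a boolean mask
mask = dff.groupby("B")["A"].transform("count") > 2
kept_via_mask = dff[mask]

print(kept)
print(kept_via_mask)
```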

References

Group by: split-apply-combine

Original: https://blog.csdn.net/weixin_40994552/article/details/124906960
Author: 小何才露尖尖角
Title: pandas 数据处理-Group by操作
