pandas: groupby operations

August 16, 2023, 5:15 AM • Python

Objective
Become proficient with groupby operations in pandas.

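Background
groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False)
Parameters:
by: the grouping key (a label or list of labels, a dict, a function, a tuple, or a Series)
axis: the axis to split along (0 for rows, 1 for columns); deprecated since pandas 2.1
level: group by one or more levels of the index
sort: whether to sort the group keys in the result; sort=True (the default) sorts them in ascending order, sort=False keeps them in order of first appearance
as_index: whether the group keys become the index of the resulting DataFrame; defaults to as_index=True
group_keys: when calling apply, add the group keys to the index to identify each piece
squeeze: reduce the dimensionality of the return type if possible, otherwise return a consistent type (deprecated in pandas 1.1 and removed in 2.0)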

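Grouping operations (split-apply-combine)
Grouping and aggregating data: what is the groupby technique?
In data analysis we often need to split the data and run a computation within each group, for example computing the average income of a city's working population by education level and age bracket.
groupby in pandas provides efficient grouped computation.
We split the data by one or more categorical variables and then run the required computation on each piece separately.
The process can be understood as three steps:

1. Split the data (split)
2. Apply a function to each piece (apply)
3. Combine the results (combine)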
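
The three steps can be spelled out by hand and compared with a single groupby call. This is only a minimal sketch: the `people` frame and its `education` and `income` columns are hypothetical toy data, not part of the exercises.

```python
import pandas as pd

# Hypothetical toy data: income by education level.
people = pd.DataFrame({
    'education': ['BSc', 'MSc', 'BSc', 'MSc', 'PhD'],
    'income': [40, 60, 50, 70, 90],
})

# 1. Split: one sub-frame per distinct education level.
pieces = {key: part for key, part in people.groupby('education')}

# 2. Apply: compute the mean income inside each piece.
means = {key: part['income'].mean() for key, part in pieces.items()}

# 3. Combine: collect the per-group results into a single Series.
manual = pd.Series(means, name='income').rename_axis('education')

# groupby performs the same three steps in one expression.
direct = people.groupby('education')['income'].mean()
```

Both `manual` and `direct` hold one mean per education level; the groupby version simply hides the intermediate pieces.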

Environment
Python 3.6.1
Jupyter
Exercises
Practice groupby operations in pandas through worked examples.

import pandas as pd
import numpy as np

1. Create a DataFrame df.

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': np.random.randn(8),
    'D': np.random.randn(8),
})
df

     A      B         C         D
0  foo    one  1.210314  0.704632
1  bar    one -0.062250  1.446661
2  foo    two -1.389148  0.741158
3  bar  three -1.095487  1.759002
4  foo    two -0.964502 -0.564613
5  bar    two  0.829750 -2.951202
6  foo    one -0.516992 -0.058681
7  foo  three -2.728634  0.250330

2. Group df by column A.

df.groupby('A')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000000004E64130>

3. Group df by columns A and B.

df.groupby(['A','B'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000000004E68F10>

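4. Group with a custom function: define a function and pass it to groupby so that the columns of df are grouped according to the condition it implements.

def get_letter_type(letter):
    # Classify a column label as a vowel or a consonant.
    if letter.lower() in 'aeiou':
        return 'vowel'
    else:
        return 'consonant'

# With axis=1 the function is applied to the column labels.
# Note that axis=1 is deprecated since pandas 2.1.
grouped = df.groupby(get_letter_type, axis=1)
for group in grouped:
    print(group)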

('consonant',        B         C         D
0    one  1.210314  0.704632
1    one -0.062250  1.446661
2    two -1.389148  0.741158
3  three -1.095487  1.759002
4    two -0.964502 -0.564613
5    two  0.829750 -2.951202
6    one -0.516992 -0.058681
7  three -2.728634  0.250330)
('vowel',      A
0  foo
1  bar
2  foo
3  bar
4  foo
5  bar
6  foo
7  foo)
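
Because grouping along axis=1 is deprecated in newer pandas releases, the same column grouping can be reached by transposing first. This is a sketch under that assumption, using a small hypothetical frame `df_demo` with fixed values in place of the random df above.

```python
import pandas as pd

def get_letter_type(letter):
    # Classify a column label as a vowel or a consonant.
    return 'vowel' if letter.lower() in 'aeiou' else 'consonant'

# Hypothetical stand-in for df with fixed values.
df_demo = pd.DataFrame({
    'A': ['foo', 'bar', 'foo'],
    'B': ['one', 'one', 'two'],
    'C': [1.0, 2.0, 3.0],
    'D': [4.0, 5.0, 6.0],
})

# Transpose so the column labels become the row index, group the rows
# with the function, then transpose each piece back.
column_groups = {}
for kind, part in df_demo.T.groupby(get_letter_type):
    column_groups[kind] = part.T
```

One trade-off: transposing a frame with mixed column types yields object-typed data, so the pieces may need their dtypes restored before numeric work.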

5. Create a Series named s, group it by its index with groupby, and apply first, last, and sum to the resulting grouped object.

lst = [1, 2, 3, 1, 2, 3]
s = pd.Series([1, 2, 3, 10, 20, 30], lst)
s

1     1
2     2
3     3
1    10
2    20
3    30
dtype: int64

grouped = s.groupby(level=0)
grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x0000000004E714C0>


grouped.first()

1    1
2    2
3    3
dtype: int64


grouped.last()

1    10
2    20
3    30
dtype: int64


grouped.sum()

1    11
2    22
3    33
dtype: int64

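6. Sorting groups: by default groupby sorts the group keys in ascending order. Passing sort=False to groupby turns this sorting off, so the groups appear in the order in which their keys first occur in the data (it does not sort in descending order).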

df2=pd.DataFrame({'X':['B','B','A','A'],'Y':[1,2,3,4]})
df2

   X  Y
0  B  1
1  B  2
2  A  3
3  A  4


df2.groupby(['X']).sum()

   Y
X
A  7
B  3


df2.groupby(['X'],sort=False).sum()

   Y
X
B  3
A  7
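
To make the difference between unsorted and descending order visible, here is a small sketch with a hypothetical frame `df_order` whose keys first appear in the order B, C, A. A descending result has to be requested explicitly, for example with sort_index.

```python
import pandas as pd

# Hypothetical data: keys first appear as B, C, A.
df_order = pd.DataFrame({'X': ['B', 'C', 'A', 'B'], 'Y': [1, 2, 3, 4]})

# Default (sort=True): group keys come back in ascending order.
ascending = df_order.groupby('X').sum()

# sort=False: group keys keep their order of first appearance.
first_seen = df_order.groupby('X', sort=False).sum()

# For a descending order, sort the aggregated result explicitly.
descending = df_order.groupby('X').sum().sort_index(ascending=False)
```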

7. Use the get_group method to retrieve the rows of a single group.

df3 = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
df3

   X  Y
0  A  1
1  B  4
2  A  3
3  B  2


df3.groupby('X').get_group('A')

   X  Y
0  A  1
2  A  3


df3.groupby('X').get_group('B')

   X  Y
1  B  4
3  B  2

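8. Use the groups attribute to see every group: it returns a dict that maps each group name to the index labels of the rows in that group.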

df.groupby('A').groups

{'bar': [1, 3, 5], 'foo': [0, 2, 4, 6, 7]}

df.groupby(['A','B']).groups

{('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}
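
The index labels stored in groups can be passed back to .loc to recover the rows of a group. The sketch below uses a hypothetical frame `df_demo` and shows that this matches get_group.

```python
import pandas as pd

# Hypothetical data with a default integer index.
df_demo = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'], 'C': [10, 20, 30, 40]})
grouped = df_demo.groupby('A')

# groups maps each group name to the index labels of its rows.
labels = {name: list(idx) for name, idx in grouped.groups.items()}

# Feeding the labels to .loc recovers the same rows as get_group.
bar_rows = df_demo.loc[grouped.groups['bar']]

# ngroups gives the number of groups.
n_groups = grouped.ngroups
```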

9. List all the built-in methods of a grouped object.

grouped=df.groupby(['A'])
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000000008070B80>

help(grouped)

10. Grouping on a multi-level index: create a Series with a two-level index, then group it in two ways (by each index level) and sum.

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
arrays

[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
 ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

index=pd.MultiIndex.from_arrays(arrays,names=['first','second'])
index

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

s=pd.Series(np.random.randn(8),index=index)
s

first  second
bar    one       0.528530
       two       0.083659
baz    one      -1.561120
       two      -1.276969
foo    one      -0.487720
       two       0.339357
qux    one       0.198976
       two      -0.379343
dtype: float64

s.groupby(level=0).sum()

first
bar    0.612189
baz   -2.838089
foo   -0.148363
qux   -0.180368
dtype: float64

s.groupby(level='second').sum()

second
one   -1.321335
two   -1.233296
dtype: float64

11. Compound grouping: group s by both first and second, then sum.

s.groupby(level=['first', 'second']).sum()

first  second
bar    one       0.528530
       two       0.083659
baz    one      -1.561120
       two      -1.276969
foo    one      -0.487720
       two       0.339357
qux    one       0.198976
       two      -0.379343
dtype: float64

12. Compound grouping (by index level and column): create a DataFrame df and group it by an index level together with a column.

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3], 'B': np.arange(8)},index=index)
df

              A  B
first second
bar   one     1  0
      two     1  1
baz   one     1  2
      two     1  3
foo   one     2  4
      two     2  5
qux   one     3  6
      two     3  7

df.groupby([pd.Grouper(level=1),'A']).sum()

          B
second A
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7
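
When the index levels are named, the level name can also be given directly as a key next to a column name. The sketch below rebuilds the same df and compares both spellings; the variable names `by_grouper` and `by_name` are illustrative only.

```python
import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3], 'B': np.arange(8)}, index=index)

# Explicit form: a Grouper for the index level plus a column label.
by_grouper = df.groupby([pd.Grouper(level='second'), 'A']).sum()

# Shorter form: the index level name used as a key next to the column label.
by_name = df.groupby(['second', 'A']).sum()
```

The explicit pd.Grouper form is still useful when a level has no name or when a level name collides with a column name.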

13. Group df by column A, assign column C of the grouped object to grouped_C, and count the number of values in each group.

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': np.random.randn(8),
    'D': np.random.randn(8),
})
df

     A      B         C         D
0  foo    one  0.240436  0.178188
1  bar    one  0.078877 -0.667510
2  foo    two  0.287559 -1.029024
3  bar  three  0.275751  0.685817
4  foo    two -0.469280 -1.583382
5  bar    two  0.182907 -0.306387
6  foo    one -0.930772  0.231160
7  foo  three -0.826608  1.170842

grouped=df.groupby(['A'])
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000000009540760>

grouped_C=grouped['C']
grouped_C

<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000000095350D0>

grouped_C.count()

A
bar    3
foo    5
Name: C, dtype: int64

14. For column C of the df created above, group by the values of column A and sum.

df['C'].groupby(df['A']).sum()

A
bar    0.537535
foo   -1.698664
Name: C, dtype: float64

15. Iterate over the groups: group df by columns A and B; the name of each group is a tuple.

for name, group in df.groupby(['A', 'B']):
    print(name)
    print(group)

('bar', 'one')
     A    B         C        D
1  bar  one  0.078877 -0.66751
('bar', 'three')
     A      B         C         D
3  bar  three  0.275751  0.685817
('bar', 'two')
     A    B         C         D
5  bar  two  0.182907 -0.306387
('foo', 'one')
     A    B         C         D
0  foo  one  0.240436  0.178188
6  foo  one -0.930772  0.231160
('foo', 'three')
     A      B         C         D
7  foo  three -0.826608  1.170842
('foo', 'two')
     A    B         C         D
2  foo  two  0.287559 -1.029024
4  foo  two -0.469280 -1.583382

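16. Group df by column A and look at the bar group of the grouped object.

df.groupby('A').get_group('bar')

     A      B         C         D
1  bar    one  0.078877 -0.667510
3  bar  three  0.275751  0.685817
5  bar    two  0.182907 -0.306387

17. Group df by columns A and B and look at the group in which A is bar and B is one.

df.groupby(['A','B']).get_group(('bar','one'))

     A    B         C        D
1  bar  one  0.078877 -0.66751

Note: when grouping by two columns, the key passed to get_group must be a tuple with one value for each grouping column.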

1. Aggregation: group df by column A and use the aggregate function to compute the sum of each group.

grouped=df.groupby(['A'])
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000000008F95A00>

grouped.aggregate(np.sum)

            C         D
A
bar  0.537535 -0.288080
foo -1.698664 -1.032218
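
The output above comes from an older pandas that silently drops the non-numeric column B. As far as I am aware, pandas 2.x no longer does this and prefers the aggregation to be named by a string rather than passed as np.sum. A sketch that should behave the same across versions, assuming a hypothetical frame `df_demo`, selects the numeric columns first:

```python
import pandas as pd

# Hypothetical stand-in for df with fixed values.
df_demo = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'three'],
    'C': [1.0, 2.0, 3.0, 4.0],
    'D': [10.0, 20.0, 30.0, 40.0],
})

# Select the numeric columns explicitly, then aggregate by name.
sums = df_demo.groupby('A')[['C', 'D']].agg('sum')

# The dedicated sum method gives the same result.
same = df_demo.groupby('A')[['C', 'D']].sum()
```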

2. Group df by columns A and B and use the aggregate function to sum each group.

grouped=df.groupby(['A','B'])
grouped.aggregate(np.sum)

                  C         D
A   B
bar one    0.078877 -0.667510
    three  0.275751  0.685817
    two    0.182907 -0.306387
foo one   -0.690335  0.409347
    three -0.826608  1.170842
    two   -0.181721 -2.612407

Note: as the results above show, after aggregation the group names become the new index. Pass as_index=False to keep them as ordinary columns instead.

3. With as_index=True, the keys used in groupby become the index of the resulting DataFrame. Here, group df by columns A and B with as_index=False, then use the aggregate function to sum each group.

grouped=df.groupby(['A','B'],as_index=False)
grouped.aggregate(np.sum)

     A      B         C         D
0  bar    one  0.078877 -0.667510
1  bar  three  0.275751  0.685817
2  bar    two  0.182907 -0.306387
3  foo    one -0.690335  0.409347
4  foo  three -0.826608  1.170842
5  foo    two -0.181721 -2.612407

4. The reset_index method gives the same result as passing as_index=False.

df.groupby(['A','B']).sum().reset_index()

     A      B         C         D
0  bar    one  0.078877 -0.667510
1  bar  three  0.275751  0.685817
2  bar    two  0.182907 -0.306387
3  foo    one -0.690335  0.409347
4  foo  three -0.826608  1.170842
5  foo    two -0.181721 -2.612407
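
The equivalence of the two spellings can be checked directly. This is a small sketch on a hypothetical frame `df_demo` with fixed values; `flat_a` and `flat_b` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for df with fixed values.
df_demo = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'one', 'two'],
    'C': [1.0, 2.0, 3.0, 4.0],
    'D': [10.0, 20.0, 30.0, 40.0],
})

# Group keys kept as ordinary columns from the start.
flat_a = df_demo.groupby(['A', 'B'], as_index=False).sum()

# Group keys first become the index, then are moved back into columns.
flat_b = df_demo.groupby(['A', 'B']).sum().reset_index()
```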

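5. Aggregation: group df by columns A and B and use the size method to get the size of each group. It returns a Series whose index holds the group names and whose values are the group sizes.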

grouped=df.groupby(['A','B'])
grouped.size()

A    B
bar  one      1
     three    1
     two      1
foo  one      2
     three    1
     two      2
dtype: int64
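
size and count look alike on clean data but differ when values are missing: size counts rows, count counts non-missing values. A minimal sketch with a hypothetical frame `df_nan` that contains one NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one value of C is missing in the foo group.
df_nan = pd.DataFrame({'A': ['foo', 'foo', 'bar'], 'C': [1.0, np.nan, 3.0]})
grouped_nan = df_nan.groupby('A')

# size: number of rows per group, missing values included.
sizes = grouped_nan.size()

# count: number of non-missing values of C per group.
counts = grouped_nan['C'].count()
```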

6. Aggregation: compute descriptive statistics for the grouped object.

grouped.describe()

              C
          count      mean       std       min       25%       50%       75%       max
A   B
bar one     1.0  0.078877       NaN  0.078877  0.078877  0.078877  0.078877  0.078877
    three   1.0  0.275751       NaN  0.275751  0.275751  0.275751  0.275751  0.275751
    two     1.0  0.182907       NaN  0.182907  0.182907  0.182907  0.182907  0.182907
foo one     2.0 -0.345168  0.828169 -0.930772 -0.637970 -0.345168 -0.052366  0.240436
    three   1.0 -0.826608       NaN -0.826608 -0.826608 -0.826608 -0.826608 -0.826608
    two     2.0 -0.090861  0.535166 -0.469280 -0.280070 -0.090861  0.098349  0.287559

              D
          count      mean       std       min       25%       50%       75%       max
A   B
bar one     1.0 -0.667510       NaN -0.667510 -0.667510 -0.667510 -0.667510 -0.667510
    three   1.0  0.685817       NaN  0.685817  0.685817  0.685817  0.685817  0.685817
    two     1.0 -0.306387       NaN -0.306387 -0.306387 -0.306387 -0.306387 -0.306387
foo one     2.0  0.204674  0.037457  0.178188  0.191431  0.204674  0.217917  0.231160
    three   1.0  1.170842       NaN  1.170842  1.170842  1.170842  1.170842  1.170842
    two     2.0 -1.306203  0.391990 -1.583382 -1.444793 -1.306203 -1.167614 -1.029024

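Note: aggregation functions reduce the dimensionality of the data by collapsing each group to a summary. Common ones are: mean, sum, size, count, std, var, sem, describe, first, last, nth, min, max.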

7. Applying several functions to one grouped result: on a grouped Series we can pass a list of aggregation functions, which returns a DataFrame with one column per function. Dicts of functions are covered in the next step.

grouped=df.groupby('A')
grouped['C'].agg([np.sum,np.mean,np.std])

          sum      mean       std
A
bar  0.537535  0.179178  0.098490
foo -1.698664 -0.339733  0.577332

grouped.agg([np.sum,np.mean,np.std])

            C                             D
          sum      mean       std       sum      mean       std
A
bar  0.537535  0.179178  0.098490 -0.288080 -0.096027  0.700758
foo -1.698664 -0.339733  0.577332 -1.032218 -0.206444  1.096466

grouped['C'].agg([np.sum,np.mean,np.std]).rename(columns={'sum':'foo','mean':'bar','std':'baz'})

          foo       bar       baz
A
bar  0.537535  0.179178  0.098490
foo -1.698664 -0.339733  0.577332
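
In pandas 0.25 and later, the renaming can be folded into the agg call itself with keyword arguments (named aggregation). The sketch below uses a hypothetical frame `df_demo` and compares it with the rename approach; it is one possible spelling, not the only one.

```python
import pandas as pd

# Hypothetical stand-in for df with at least two rows per group.
df_demo = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'C': [1.0, 2.0, 3.0, 6.0],
})
grouped_demo = df_demo.groupby('A')

# Aggregate with a list of function names, then rename the columns.
renamed = grouped_demo['C'].agg(['sum', 'mean', 'std']).rename(
    columns={'sum': 'foo', 'mean': 'bar', 'std': 'baz'})

# Named aggregation: each keyword is an output column name,
# each value is the function applied to C.
named = grouped_demo['C'].agg(foo='sum', bar='mean', baz='std')
```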

8. Applying different aggregation functions to different columns: pass agg a dict that maps each column name to the function to apply to that column.

grouped.agg({'C':np.sum,'D':lambda x:np.std(x,ddof=1)})

            C         D
A
bar  0.537535  0.700758
foo -1.698664  1.096466

grouped.agg({'C':np.sum,'D':np.std})
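
The ddof argument matters here: the std method in pandas uses ddof=1 (sample standard deviation), while NumPy's np.std defaults to ddof=0 (population standard deviation). How a bare np.std passed to agg is treated may depend on the pandas version, so stating ddof explicitly is the safer route. A sketch on a hypothetical frame `df_demo`:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two values per group.
df_demo = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
                        'D': [1.0, 3.0, 2.0, 6.0]})
grouped_demo = df_demo.groupby('A')

# pandas std: sample standard deviation (ddof=1) by default.
sample_std = grouped_demo['D'].std()

# NumPy on the raw values with ddof=0: population standard deviation.
population_std = grouped_demo['D'].agg(lambda x: np.std(x.to_numpy(), ddof=0))

# NumPy with ddof=1 reproduces the pandas default.
explicit_sample = grouped_demo['D'].agg(lambda x: np.std(x.to_numpy(), ddof=1))
```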