啃书 《利用python进行数据分析》 第十章数据聚合与分组操作:GroupBy详解、agg()、apply()、pivot_table()、crosstab()详解

groupby就是分组嘛,分组完后会产生一个GroupBy对象,
这个对象可以遍历(1.)和选择子集(2.)
这个对象可以根据某一类别变量列(或多列)来构建,也可以通过字典、Series(3.)分组、也可以通过函数分组

先来看一下最常用的情况:

根据列来构建,得到GroupBy对象

df=pd.DataFrame({'key1':['a','a','b','b','a'],'key2':['one','two','one','two','one'],'data1':np.random.randn(5),'data2':np.random.randn(5)})

df
Out[5]:
  key1 key2     data1     data2
0    a  one -1.159168  2.053633
1    a  two  0.329434 -2.261167
2    b  one -0.032842  1.040264
3    b  two  1.288403 -0.802069
4    a  one -0.523062 -0.687959

grouped=df['data1'].groupby(df['key1'])

grouped
Out[8]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x0000014A100E1490>

grouped.mean()
Out[9]:
key1
a   -0.450932
b    0.627781
Name: data1, dtype: float64

means=df['data1'].groupby([df['key1'],df['key2']]).mean()

means
Out[13]:
key1  key2
a     one    -0.841115
      two     0.329434
b     one    -0.032842
      two     1.288403
Name: data1, dtype: float64

means.unstack()
Out[14]:
key2       one       two
key1
a    -0.841115  0.329434
b    -0.032842  1.288403

states=np.array(['Ohio','California','California','Ohio','Ohio'])

years=np.array([2005,2005,2006,2005,2006])

df['data1'].groupby([states,years]).mean()
Out[17]:
California  2005    0.329434
            2006   -0.032842
Ohio        2005    0.064618
            2006   -0.523062
Name: data1, dtype: float64

df.groupby('key1').mean()
Out[18]:
         data1     data2
key1
a    -0.450932 -0.298498
b     0.627781  0.119097

df.groupby(['key1','key2']).mean()
Out[19]:
              data1     data2
key1 key2
a    one  -0.841115  0.682837
     two   0.329434 -2.261167
b    one  -0.032842  1.040264
     two   1.288403 -0.802069

df.groupby(['key1','key2']).size()
Out[20]:
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

通过以上例子我们发现:

  • groupby的分组键可以是一列,也可以是多列;
  • 分组键只要和被分的列相同长度就可以
  • 直接对一个dataframe对象分组时,分组键直接写列名就好
  • GroupBy分组后产生GroupBy对象,将函数应用在GroupBy对象上会得到想要的效果

1.遍历个分组

对GroupBy对象遍历,会生成一个包含 组名和数据块的2维元组序列。


for name,group in df.groupby('key1'):
    print(name)
    print(group)

a
  key1 key2     data1     data2
0    a  one -1.159168  2.053633
1    a  two  0.329434 -2.261167
4    a  one -0.523062 -0.687959
b
  key1 key2     data1     data2
2    b  one -0.032842  1.040264
3    b  two  1.288403 -0.802069

for (k1,k2),group in df.groupby(['key1','key2']):
    print((k1,k2))
    print(group)

('a', 'one')
  key1 key2     data1     data2
0    a  one -1.159168  2.053633
4    a  one -0.523062 -0.687959
('a', 'two')
  key1 key2     data1     data2
1    a  two  0.329434 -2.261167
('b', 'one')
  key1 key2     data1     data2
2    b  one -0.032842  1.040264
('b', 'two')
  key1 key2     data1     data2
3    b  two  1.288403 -0.802069

  • 转成字典形式,方便选择
pieces=dict(list(df.groupby('key1')))

pieces['b']
Out[31]:
  key1 key2     data1     data2
2    b  one -0.032842  1.040264
3    b  two  1.288403 -0.802069
  • 在Axis=1轴上根据dtype分组

df.dtypes
Out[32]:
key1      object
key2      object
data1    float64
data2    float64
dtype: object

grouped=df.groupby(df.dtypes,axis=1)

grouped
Out[34]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000014A11C6E4F0>

for dtype,group in grouped:
    print(dtype)
    print(group)

float64
      data1     data2
0 -1.159168  2.053633
1  0.329434 -2.261167
2 -0.032842  1.040264
3  1.288403 -0.802069
4 -0.523062 -0.687959
object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one

2.选择一列或所有列的子列

df.groupby(['key1','key2'])[['data1']].mean()
Out[39]:
              data1
key1 key2
a    one  -0.841115
     two   0.329434
b    one  -0.032842
     two   1.288403

df.groupby(['key1','key2']).mean()
Out[40]:
              data1     data2
key1 key2
a    one  -0.841115  0.682837
     two   0.329434 -2.261167
b    one  -0.032842  1.040264
     two   1.288403 -0.802069

3.使用字典和Series分组

这是在axis=1轴向上的 分组

people=pd.DataFrame(np.random.randn(5,5),columns=['a','b','c','d','e'],index=['Joe','Steve','Wes','Jim','Travis'])

people.iloc[2:3,[1,2]]=np.nan

people
Out[43]:
               a         b         c         d         e
Joe    -0.030326  1.494879 -0.281233 -0.672231  0.420580
Steve   0.591620 -0.295068  0.468036 -0.275808  0.306458
Wes    -0.609984       NaN       NaN -0.032940  0.742902
Jim    -0.305856 -1.170328 -0.091456 -0.245780 -0.248539
Travis -0.730377  1.098705  0.496633 -0.535223  1.077263

mapping={'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}

by_column=people.groupby(mapping,axis=1)

by_column.sum()
Out[46]:
            blue       red
Joe    -0.953464  1.885133
Steve   0.192228  0.603010
Wes    -0.032940  0.132917
Jim    -0.337237 -1.724723
Travis -0.038590  1.445590

map_series=pd.Series(mapping)

map_series
Out[48]:
a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

people.groupby(map_series,axis=1).count()
Out[50]:
        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3

4.使用函数分组

people.groupby(len).sum()
Out[51]:
          a         b         c         d         e
3 -0.946167  0.324551 -0.372689 -0.950951  0.914943
5  0.591620 -0.295068  0.468036 -0.275808  0.306458
6 -0.730377  1.098705  0.496633 -0.535223  1.077263

key_list=['one','one','one','two','two']

people.groupby([len,key_list]).min()
Out[54]:
              a         b         c         d         e
3 one -0.609984  1.494879 -0.281233 -0.672231  0.420580
  two -0.305856 -1.170328 -0.091456 -0.245780 -0.248539
5 one  0.591620 -0.295068  0.468036 -0.275808  0.306458
6 two -0.730377  1.098705  0.496633 -0.535223  1.077263

5.根据索引层级分组

columns=pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],[1,3,5,1,3]],names=['cty','tenor'])

hier_df=pd.DataFrame(np.random.randn(4,5),columns=columns)

hier_df
Out[57]:
cty          US                            JP
tenor         1         3         5         1         3
0     -0.286884  0.558246  0.453022  1.439828  0.045596
1     -0.164324  0.497046  0.173779  0.376164 -0.513194
2     -0.411519 -1.334042  0.470202 -0.249659  1.129361
3      0.459289 -1.111118  0.217509  1.570917  0.940517

hier_df.groupby(level='cty',axis=1).count()
Out[58]:
cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3

小结:

  • groupby的分组方式:
  • 分组键(具体的一列值)、字典、Series、函数、等
  • 可以是axis=0向的,也可以是axis=1向的。但长度都要对应
  • GroupBy对象可以遍历、选择

聚合就是根据数组产生 标量值

  • 聚合的方法有很多,可以参考pandas 的描述性统计
  • 如果要使用自己的函数,需要用agg或aggregate方法来传递

df
Out[59]:
  key1 key2     data1     data2
0    a  one -1.159168  2.053633
1    a  two  0.329434 -2.261167
2    b  one -0.032842  1.040264
3    b  two  1.288403 -0.802069
4    a  one -0.523062 -0.687959

def peak_to_peak(arr):
    return arr.max()-arr.min()

df.groupby('key1').agg(peak_to_peak)
Out[62]:
         data1     data2
key1
a     1.488602  4.314799
b     1.321245  1.842333

  • describe()方法虽然不是聚合函数,但也是有效的
df.groupby('key1').describe()
Out[63]:
     data1                      ...     data2
     count      mean       std  ...       50%       75%       max
key1                            ...

a      3.0 -0.450932  0.746918  ... -0.687959  0.682837  2.053633
b      2.0  0.627781  0.934261  ...  0.119097  0.579681  1.040264

[2 rows x 16 columns]

1.逐列及多函数应用

  • 将函数名以字符串的形式传递给agg()
  • 也可以传递函数名的列表

agg函数传达聚合函数的方法多种多样:

  • 多列&一个(多个)聚合函数
  • 多列&一个(多个)聚合函数
  • 利用元组实现对聚合结果的列名重命名
  • 利用字典实现 对不同列应用不同的聚合函数

以下是案例

数据导入


data=pd.read_excel(r'C:\Users\章ky\Desktop\非洲通讯产品销售数据.xlsx')

data.head()
Out[9]:
          日期             国家            城市       地区         服务分类     销售额     利润
0 2017-01-01  Cote d'Ivoire       Abidjan  Western   Commercial  656.96   6.57
1 2017-01-01     Madagascar  Antananarivo  Eastern       Public  875.94 -70.08
2 2017-01-01         Rwanda        Kigali  Eastern       Public  258.35  18.08
3 2017-01-01       Zimbabwe        Harare  Eastern  Residential  875.62 -35.02
4 2017-01-02       Ethiopia   Addis Ababa  Eastern  Residential  509.93  10.20

grouped=data.groupby(['服务分类','地区'])

grouped_pct=grouped['销售额']

grouped_pct
Out[12]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x0000016605D20B20>

多个聚合函数


grouped_pct.agg('mean')
Out[13]:
服务分类         地区
Commercial   Eastern     516.283218
             Middle      494.972603
             Northern    472.082821
             Southern    536.559286
             Western     492.849565
Public       Eastern     487.137327
             Middle      452.043729
             Northern    416.683111
             Southern    609.509310
             Western     487.892286
Residential  Eastern     507.873900
             Middle      530.288333
             Northern    524.036818
             Southern    529.237719
             Western     474.820100
Name: 销售额, dtype: float64

grouped_pct.agg(['mean','std'])
Out[14]:
                            mean         std
服务分类        地区
Commercial  Eastern   516.283218  283.011987
            Middle    494.972603  303.269793
            Northern  472.082821  286.887794
            Southern  536.559286  287.523768
            Western   492.849565  271.758411
Public      Eastern   487.137327  292.310160
            Middle    452.043729  320.831745
            Northern  416.683111  252.796114
            Southern  609.509310  238.548006
            Western   487.892286  299.995695
Residential Eastern   507.873900  285.163213
            Middle    530.288333  315.332683
            Northern  524.036818  263.298494
            Southern  529.237719  314.481258
            Western   474.820100  306.045465

利用元组修改聚合结果的列名


grouped_pct.agg([('平均值','mean'),('标准差',np.std)])
Out[15]:
                             平均值         标准差
服务分类        地区
Commercial  Eastern   516.283218  283.011987
            Middle    494.972603  303.269793
            Northern  472.082821  286.887794
            Southern  536.559286  287.523768
            Western   492.849565  271.758411
Public      Eastern   487.137327  292.310160
            Middle    452.043729  320.831745
            Northern  416.683111  252.796114
            Southern  609.509310  238.548006
            Western   487.892286  299.995695
Residential Eastern   507.873900  285.163213
            Middle    530.288333  315.332683
            Northern  524.036818  263.298494
            Southern  529.237719  314.481258
            Western   474.820100  306.045465

functions=['count','mean','max']

result=grouped['销售额','利润'].agg(functions)
<ipython-input-17-b855aa14b760>:1: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.

  result=grouped['销售额','利润'].agg(functions)

result
Out[18]:
                       销售额                        利润
                     count        mean     max count       mean    max
服务分类        地区
Commercial  Eastern     87  516.283218  994.49    87   6.590345  88.23
            Middle      73  494.972603  984.20    73   4.921781  85.03
            Northern    39  472.082821  928.29    39   6.058205  72.88
            Southern    42  536.559286  993.94    42  13.109286  95.39
            Western    115  492.849565  992.34   115  -1.400696  69.29
Public      Eastern    101  487.137327  982.91   101   0.291089  77.22
            Middle      59  452.043729  997.23    59   7.814237  89.74
            Northern    45  416.683111  928.39    45   8.741556  58.28
            Southern    29  609.509310  989.02    29  15.928966  95.13
            Western    105  487.892286  994.43   105  -1.576667  89.62
Residential Eastern    100  507.873900  992.58   100   5.482900  69.66
            Middle      60  530.288333  996.56    60   8.156500  85.19
            Northern    44  524.036818  984.51    44  13.494318  74.07
            Southern    57  529.237719  999.08    57   1.455614  76.24
            Western    100  474.820100  992.39   100   5.709300  91.98

result['销售额']
Out[19]:
                      count        mean     max
服务分类        地区
Commercial  Eastern      87  516.283218  994.49
            Middle       73  494.972603  984.20
            Northern     39  472.082821  928.29
            Southern     42  536.559286  993.94
            Western     115  492.849565  992.34
Public      Eastern     101  487.137327  982.91
            Middle       59  452.043729  997.23
            Northern     45  416.683111  928.39
            Southern     29  609.509310  989.02
            Western     105  487.892286  994.43
Residential Eastern     100  507.873900  992.58
            Middle       60  530.288333  996.56
            Northern     44  524.036818  984.51
            Southern     57  529.237719  999.08
            Western     100  474.820100  992.39

ftuples=[('平均值','mean'),('中位数','median')]

grouped['销售额','利润'].agg(ftuples)
<ipython-input-21-10538c1cd5b3>:1: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.

  grouped['销售额','利润'].agg(ftuples)
Out[21]:
                             销售额                  利润
                             平均值      中位数        平均值     中位数
服务分类        地区
Commercial  Eastern   516.283218  506.010   6.590345   3.620
            Middle    494.972603  483.350   4.921781   2.120
            Northern  472.082821  506.780   6.058205   2.610
            Southern  536.559286  510.665  13.109286   5.475
            Western   492.849565  494.930  -1.400696  -3.710
Public      Eastern   487.137327  472.240   0.291089  -0.850
            Middle    452.043729  398.180   7.814237   2.570
            Northern  416.683111  434.160   8.741556   6.570
            Southern  609.509310  621.630  15.928966  17.850
            Western   487.892286  462.980  -1.576667   0.520
Residential Eastern   507.873900  500.125   5.482900   4.230
            Middle    530.288333  471.820   8.156500   3.375
            Northern  524.036818  489.030  13.494318  11.835
            Southern  529.237719  526.930   1.455614   3.330
            Western   474.820100  393.110   5.709300   3.660

grouped.agg({'销售额':np.max,'利润':'sum'})
Out[22]:
                         销售额      利润
服务分类        地区
Commercial  Eastern   994.49  573.36
            Middle    984.20  359.29
            Northern  928.29  236.27
            Southern  993.94  550.59
            Western   992.34 -161.08
Public      Eastern   982.91   29.40
            Middle    997.23  461.04
            Northern  928.39  393.37
            Southern  989.02  461.94
            Western   994.43 -165.55
Residential Eastern   992.58  548.29
            Middle    996.56  489.39
            Northern  984.51  593.75
            Southern  999.08   82.97
            Western   992.39  570.93

利用字典不同的列应用不同的聚合函数


grouped.agg({'销售额':[np.max,'min','sum','count'],'利润':'sum'})
Out[23]:
                         销售额                              利润
                        amax     min       sum count     sum
服务分类        地区
Commercial  Eastern   994.49    1.14  44916.64    87  573.36
            Middle    984.20    0.89  36133.00    73  359.29
            Northern  928.29   43.55  18411.23    39  236.27
            Southern  993.94    1.39  22535.49    42  550.59
            Western   992.34    5.47  56677.70   115 -161.08
Public      Eastern   982.91    3.10  49200.87   101   29.40
            Middle    997.23   32.15  26670.58    59  461.04
            Northern  928.39   22.19  18750.74    45  393.37
            Southern  989.02  139.31  17675.77    29  461.94
            Western   994.43    2.69  51228.69   105 -165.55
Residential Eastern   992.58    6.87  50787.39   100  548.29
            Middle    996.56   18.24  31817.30    60  489.39
            Northern  984.51  120.45  23057.62    44  593.75
            Southern  999.08   23.96  30166.55    57   82.97
            Western   992.39    3.30  47482.01   100  570.93

2.返回不含索引的聚合函数 as_index=False

前面的例子中都是带有索引的

利用参数as_index=False来禁用分组键作为索引的行为

data.groupby(['服务分类','地区'],as_index=False).mean()
Out[24]:
           服务分类        地区         销售额         利润
0    Commercial   Eastern  516.283218   6.590345
1    Commercial    Middle  494.972603   4.921781
2    Commercial  Northern  472.082821   6.058205
3    Commercial  Southern  536.559286  13.109286
4    Commercial   Western  492.849565  -1.400696
5        Public   Eastern  487.137327   0.291089
6        Public    Middle  452.043729   7.814237
7        Public  Northern  416.683111   8.741556
8        Public  Southern  609.509310  15.928966
9        Public   Western  487.892286  -1.576667
10  Residential   Eastern  507.873900   5.482900
11  Residential    Middle  530.288333   8.156500
12  Residential  Northern  524.036818  13.494318
13  Residential  Southern  529.237719   1.455614
14  Residential   Western  474.820100   5.709300

小结:agg()的作用

  • 逐列传递聚合函数
  • 对不同的列传递不同函数(字典)
  • 对聚合的列重命名(元组)
  • 禁止用分组键作为索引的行为,as_index=False

apply() 函数重点在于将函数 应用到每一个分组中,这里的 应用函数既包括了 聚合函数也包括了 非聚合函数

agg() 重点在 聚合,如果是非聚合函数,就不太行咯(也有例外,但总之效果不好)

data.head()
Out[26]:
          日期             国家            城市       地区         服务分类     销售额     利润
0 2017-01-01  Cote d'Ivoire       Abidjan  Western   Commercial  656.96   6.57
1 2017-01-01     Madagascar  Antananarivo  Eastern       Public  875.94 -70.08
2 2017-01-01         Rwanda        Kigali  Eastern       Public  258.35  18.08
3 2017-01-01       Zimbabwe        Harare  Eastern  Residential  875.62 -35.02
4 2017-01-02       Ethiopia   Addis Ababa  Eastern  Residential  509.93  10.20

def top(data,n=5,column='销售额'):
    return data.sort_values(by=column)[-n:]

top(data,n=6)
Out[29]:
            日期                                国家  ...     销售额     利润
402 2018-06-10                      Burkina Faso  ...  994.43   9.94
623 2019-04-02                        Mozambique  ...  994.49 -59.67
365 2018-04-11          Central African Republic  ...  995.01 -59.70
75  2017-02-20                 Equatorial Guinea  ...  996.56   9.97
464 2018-09-12  Democratic Republic of the Congo  ...  997.23  49.86
627 2019-04-14                           Namibia  ...  999.08  19.98

[6 rows x 7 columns]

data.groupby('服务分类').apply(top)
Out[30]:
                         日期                                国家  ...     销售额     利润
服务分类                                                           ...

Commercial  592  2019-01-28                           Somalia  ...  985.13  59.11
            100  2017-04-09                             Benin  ...  989.86  69.29
            985  2020-09-02                           Nigeria  ...  992.34  29.77
            1021 2020-10-13                      South Africa  ...  993.94 -49.70
            623  2019-04-02                        Mozambique  ...  994.49 -59.67
Public      391  2018-05-18                         Swaziland  ...  989.02  89.01
            552  2019-01-10                             Ghana  ...  992.12  19.84
            820  2020-01-11                              Togo  ...  992.23 -19.84
            402  2018-06-10                      Burkina Faso  ...  994.43   9.94
            464  2018-09-12  Democratic Republic of the Congo  ...  997.23  49.86
Residential 459  2018-09-03                           Nigeria  ...  992.39  39.70
            60   2017-01-26                             Kenya  ...  992.58  49.63
            365  2018-04-11          Central African Republic  ...  995.01 -59.70
            75   2017-02-20                 Equatorial Guinea  ...  996.56   9.97
            627  2019-04-14                           Namibia  ...  999.08  19.98

[15 rows x 7 columns]

data.groupby(['服务分类','地区']).apply(top,n=2,column='利润')
Out[42]:
                                  日期  ...     利润
服务分类        地区                        ...

Commercial  Eastern  708  2019-08-09  ...  85.17
                     283  2018-01-11  ...  88.23
            Middle   198  2017-08-26  ...  80.60
                     548  2019-01-09  ...  85.03
            Northern 257  2017-12-15  ...  60.19
                     782  2019-12-09  ...  72.88
            Southern 157  2017-07-03  ...  78.72
                     118  2017-05-10  ...  95.39
            Western  439  2018-08-06  ...  66.69
                     100  2017-04-09  ...  69.29
Public      Eastern  878  2020-02-28  ...  68.76
                     858  2020-02-01  ...  77.22
            Middle   588  2019-01-26  ...  78.78
                     562  2019-01-14  ...  89.74
            Northern 185  2017-08-13  ...  56.11
                     63   2017-01-27  ...  58.28
            Southern 391  2018-05-18  ...  89.01
                     1046 2020-12-07  ...  95.13
            Western  569  2019-01-18  ...  79.70
                     460  2018-09-04  ...  89.62
Residential Eastern  1033 2020-11-18  ...  62.62
                     497  2018-10-22  ...  69.66
            Middle   1001 2020-09-19  ...  78.98
                     407  2018-06-12  ...  85.19
            Northern 64   2017-01-28  ...  60.88
                     293  2018-01-15  ...  74.07
            Southern 694  2019-07-25  ...  70.47
                     693  2019-07-22  ...  76.24
            Western  94   2017-03-18  ...  86.98
                     829  2020-01-15  ...  91.98

[30 rows x 7 columns]

分析上例:

  • top函数是一个排序函数
  • 复习下:sort_values()默认升序,通过参数by传递根据哪列来排序
  • groupby(‘服务分类’):根据’服务分类’来分组
  • apply(top):将top函数应用到每个分组上,即每个分组内部的,根据销售额排序的最高的6条数据

前面提到的GroupBy对象直接应用非聚合函数describe()函数,原理其实是省略了apply()这一部


result=data.groupby('服务分类')['销售额'].describe()

result
Out[45]:
             count        mean         std  ...     50%     75%     max
服务分类                                        ...

Commercial   356.0  501.893427  283.651006  ...  501.22  741.69  994.49
Public       339.0  482.379499  292.925777  ...  456.37  737.36  997.23
Residential  361.0  507.786343  297.611310  ...  472.81  784.94  999.08

[3 rows x 8 columns]

f=lambda x:x.describe()

data.groupby('服务分类')['销售额'].apply(f)
Out[48]:
服务分类
Commercial   count    356.000000
             mean     501.893427
             std      283.651006
             min        0.890000
             25%      282.910000
             50%      501.220000
             75%      741.690000
             max      994.490000
Public       count    339.000000
             mean     482.379499
             std      292.925777
             min        2.690000
             25%      214.590000
             50%      456.370000
             75%      737.360000
             max      997.230000
Residential  count    361.000000
             mean     507.786343
             std      297.611310
             min        3.300000
             25%      247.000000
             50%      472.810000
             75%      784.940000
             max      999.080000
Name: 销售额, dtype: float64

data.groupby('服务分类')['销售额'].apply(f).unstack()
Out[49]:
             count        mean         std  ...     50%     75%     max
服务分类                                        ...

Commercial   356.0  501.893427  283.651006  ...  501.22  741.69  994.49
Public       339.0  482.379499  292.925777  ...  456.37  737.36  997.23
Residential  361.0  507.786343  297.611310  ...  472.81  784.94  999.08

[3 rows x 8 columns]

1.压缩分组键 group_keys=False

先前的例子中,得到的对象 既有分组键形成的分层索引,又有原始索引,要禁用这个功能:group_keys=False


data.groupby('服务分类').apply(top)
Out[50]:
                         日期                                国家  ...     销售额     利润
服务分类                                                           ...

Commercial  592  2019-01-28                           Somalia  ...  985.13  59.11
            100  2017-04-09                             Benin  ...  989.86  69.29
            985  2020-09-02                           Nigeria  ...  992.34  29.77
            1021 2020-10-13                      South Africa  ...  993.94 -49.70
            623  2019-04-02                        Mozambique  ...  994.49 -59.67
Public      391  2018-05-18                         Swaziland  ...  989.02  89.01
            552  2019-01-10                             Ghana  ...  992.12  19.84
            820  2020-01-11                              Togo  ...  992.23 -19.84
            402  2018-06-10                      Burkina Faso  ...  994.43   9.94
            464  2018-09-12  Democratic Republic of the Congo  ...  997.23  49.86
Residential 459  2018-09-03                           Nigeria  ...  992.39  39.70
            60   2017-01-26                             Kenya  ...  992.58  49.63
            365  2018-04-11          Central African Republic  ...  995.01 -59.70
            75   2017-02-20                 Equatorial Guinea  ...  996.56   9.97
            627  2019-04-14                           Namibia  ...  999.08  19.98

[15 rows x 7 columns]

data.groupby('服务分类',group_keys=False).apply(top)
Out[51]:
             日期                                国家  ...     销售额     利润
592  2019-01-28                           Somalia  ...  985.13  59.11
100  2017-04-09                             Benin  ...  989.86  69.29
985  2020-09-02                           Nigeria  ...  992.34  29.77
1021 2020-10-13                      South Africa  ...  993.94 -49.70
623  2019-04-02                        Mozambique  ...  994.49 -59.67
391  2018-05-18                         Swaziland  ...  989.02  89.01
552  2019-01-10                             Ghana  ...  992.12  19.84
820  2020-01-11                              Togo  ...  992.23 -19.84
402  2018-06-10                      Burkina Faso  ...  994.43   9.94
464  2018-09-12  Democratic Republic of the Congo  ...  997.23  49.86
459  2018-09-03                           Nigeria  ...  992.39  39.70
60   2017-01-26                             Kenya  ...  992.58  49.63
365  2018-04-11          Central African Republic  ...  995.01 -59.70
75   2017-02-20                 Equatorial Guinea  ...  996.56   9.97
627  2019-04-14                           Namibia  ...  999.08  19.98

2.分位数与桶分析

第八章的cut(),qcut()函数与groupby函数一起使用,更方便

使用:利用cut进行等长桶划分,并用groupby和apply查看各组情况

frame=pd.DataFrame({'data1':np.random.randn(1000),'data2':np.random.randn(1000)})

quartiles=pd.cut(frame.data1,4)

quartiles[:10]
Out[58]:
0      (0.282, 1.85]
1    (-1.285, 0.282]
2    (-1.285, 0.282]
3    (-1.285, 0.282]
4    (-1.285, 0.282]
5      (1.85, 3.417]
6      (0.282, 1.85]
7      (0.282, 1.85]
8    (-1.285, 0.282]
9      (0.282, 1.85]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-2.859, -1.285] < (-1.285, 0.282] < (0.282, 1.85] < (1.85, 3.417]]

def get_stats(group):
    return{'min':group.min(),'max':group.max(),'count':group.count(),'mean':group.mean()}

grouped=frame.data2.groupby(quartiles)

grouped.apply(get_stats).unstack()
Out[62]:
                       min       max  count      mean
data1
(-2.859, -1.285] -2.489030  2.060369  100.0 -0.046831
(-1.285, 0.282]  -2.551677  2.689849  513.0 -0.031574
(0.282, 1.85]    -2.855721  3.457885  365.0  0.047201
(1.85, 3.417]    -2.793339  1.713071   22.0 -0.203545

使用:利用qcut进行等量桶划分,并用groupby和apply查看各组情况


grouping=pd.qcut(frame.data1,10,labels=False)

grouped=frame.data2.groupby(grouping)

grouped.apply(get_stats).unstack()
Out[65]:
            min       max  count      mean
data1
0     -2.489030  2.060369  100.0 -0.046831
1     -2.280951  2.689849  100.0 -0.011814
2     -2.551677  2.401666  100.0 -0.028943
3     -1.623332  2.366734  100.0 -0.000173
4     -2.207725  2.195542  100.0  0.070011
5     -2.025112  1.951863  100.0 -0.115340
6     -2.249669  2.357705  100.0  0.018044
7     -2.450440  2.438719  100.0  0.062580
8     -2.855721  3.457885  100.0  0.016411
9     -2.793339  2.923083  100.0 -0.045247

3.示例:使用指定分组填充缺失值

  • 之前我们学了用fillna指定填充值:

s=pd.Series(np.random.randn(6))

s[::2]=np.nan

s
Out[84]:
0         NaN
1    1.085667
2         NaN
3   -0.214952
4         NaN
5    0.909982
dtype: float64

s.fillna(s.mean())
Out[85]:
0    0.593566
1    1.085667
2    0.593566
3   -0.214952
4    0.593566
5    0.909982
dtype: float64

  • 我们可以利用apply分组填充缺失值
states=['Ohio','New York','Vermont','Florida','Oregon','Nevada','California','Idaho']

group_key=['East']*4+['West']*4

data=pd.Series(np.random.randn(8),index=states)

data
Out[89]:
Ohio          1.275526
New York     -0.790592
Vermont       1.063371
Florida       0.103123
Oregon       -1.586148
Nevada       -0.062837
California    1.651587
Idaho        -1.263948
dtype: float64

data[['Vermont','Nevada','Idaho']]=np.nan

data
Out[91]:
Ohio          1.275526
New York     -0.790592
Vermont            NaN
Florida       0.103123
Oregon       -1.586148
Nevada             NaN
California    1.651587
Idaho              NaN
dtype: float64

data.groupby(group_key).mean()
Out[93]:
East    0.196019
West    0.032719
dtype: float64

fill_mean=lambda g:g.fillna(g.mean())

data.groupby(group_key).apply(fill_mean)
Out[95]:
Ohio          1.275526
New York     -0.790592
Vermont       0.196019
Florida       0.103123
Oregon       -1.586148
Nevada        0.032719
California    1.651587
Idaho         0.032719
dtype: float64

fill_values={'East':0.5,'West':-1}

fill_func=lambda g:g.fillna(fill_values[g.name])

data.groupby(group_key).apply(fill_func)
Out[98]:
Ohio          1.275526
New York     -0.790592
Vermont       0.500000
Florida       0.103123
Oregon       -1.586148
Nevada       -1.000000
California    1.651587
Idaho        -1.000000
dtype: float64

注意,分组对象会有一个内置的name属性

1.聚合表pivot_table()

选项名描述values需要聚合的列;默认聚合所有数值型的列index在结果透视表上的行上进行分组的列名或其他分组键columns在结果透视表上的列上进行分组的列名或其他分组键aggfunc聚合函数或函数列表(默认’mean’);可以是groupby上下文的任意有效函数fill_value在结果表中替换缺失值dropna如果为True,将不含所有条目均为NA的列margins添加行/列小计和总计

data.pivot_table(index=['服务分类','地区'])
Out[66]:
                             利润         销售额
服务分类        地区
Commercial  Eastern    6.590345  516.283218
            Middle     4.921781  494.972603
            Northern   6.058205  472.082821
            Southern  13.109286  536.559286
            Western   -1.400696  492.849565
Public      Eastern    0.291089  487.137327
            Middle     7.814237  452.043729
            Northern   8.741556  416.683111
            Southern  15.928966  609.509310
            Western   -1.576667  487.892286
Residential Eastern    5.482900  507.873900
            Middle     8.156500  530.288333
            Northern  13.494318  524.036818
            Southern   1.455614  529.237719
            Western    5.709300  474.820100

data.pivot_table('利润',index='服务分类',columns='地区')
Out[68]:
地区            Eastern    Middle   Northern   Southern   Western
服务分类
Commercial   6.590345  4.921781   6.058205  13.109286 -1.400696
Public       0.291089  7.814237   8.741556  15.928966 -1.576667
Residential  5.482900  8.156500  13.494318   1.455614  5.709300

data.pivot_table(['利润','销售额'],index='服务分类',columns='地区')
Out[69]:
                   利润                       ...         销售额
地区            Eastern    Middle   Northern  ...    Northern    Southern     Western
服务分类                                        ...

Commercial   6.590345  4.921781   6.058205  ...  472.082821  536.559286  492.849565
Public       0.291089  7.814237   8.741556  ...  416.683111  609.509310  487.892286
Residential  5.482900  8.156500  13.494318  ...  524.036818  529.237719  474.820100

[3 rows x 10 columns]

data.pivot_table(['利润','销售额'],index='服务分类',columns='地区',margins=True)
Out[70]:
                   利润                       ...         销售额
地区            Eastern    Middle   Northern  ...    Southern     Western         All
服务分类                                        ...

Commercial   6.590345  4.921781   6.058205  ...  536.559286  492.849565  501.893427
Public       0.291089  7.814237   8.741556  ...  609.509310  487.892286  482.379499
Residential  5.482900  8.156500  13.494318  ...  529.237719  474.820100  507.786343
All          3.996701  6.821458   9.557734  ...  549.826641  485.588750  497.643542

[4 rows x 12 columns]

data.pivot_table('利润',index='服务分类',columns='地区',margins=True)
Out[71]:
地区            Eastern    Middle   Northern   Southern   Western       All
服务分类
Commercial   6.590345  4.921781   6.058205  13.109286 -1.400696  4.377612
Public       0.291089  7.814237   8.741556  15.928966 -1.576667  3.481416
Residential  5.482900  8.156500  13.494318   1.455614  5.709300  6.330554
All          3.996701  6.821458   9.557734   8.558594  0.763438  4.757538

data.pivot_table('利润',index='服务分类',columns='地区',aggfunc=len,margins=True)
Out[72]:
地区           Eastern  Middle  Northern  Southern  Western     All
服务分类
Commercial      87.0    73.0      39.0      42.0    115.0   356.0
Public         101.0    59.0      45.0      29.0    105.0   339.0
Residential    100.0    60.0      44.0      57.0    100.0   361.0
All            288.0   192.0     128.0     128.0    320.0  1056.0

2.交叉表 crosstab()

交叉表是透视表的特例,计算的是分组中的 频率

pd.crosstab(data.服务分类,data.地区,margins=True)
Out[75]:
地区           Eastern  Middle  Northern  Southern  Western   All
服务分类
Commercial        87      73        39        42      115   356
Public           101      59        45        29      105   339
Residential      100      60        44        57      100   361
All              288     192       128       128      320  1056

crosstab的前两个参数可以是数组、Series、或者是 列表

`python
data2=pd.read_excel(r’D:\Data\学习\酒店数据.xlsx’)

data2.info()

RangeIndex: 396 entries, 0 to 395
Data columns (total 11 columns):

Original: https://blog.csdn.net/weixin_55730631/article/details/124597861
Author: 真是喵啊
Title: 啃书 《利用python进行数据分析》 第十章数据聚合与分组操作:GroupBy详解、agg()、apply()、pivot_table()、crosstab()详解

原创文章受到原创版权保护。转载请注明出处:https://www.johngo689.com/675109/

转载文章受原作者版权保护。转载请注明原作者出处!

(0)

大家都在看

亲爱的 Coder【最近整理,可免费获取】👉 最新必读书单  | 👏 面试题下载  | 🌎 免费的AI知识星球