groupby is, as the name suggests, about grouping: calling it produces a GroupBy object.
This object can be iterated over (1.) and used to select subsets (2.).
It can be built from one categorical column (or several), and you can also group by a dict or Series (3.), or by a function (4.).
Let's start with the most common case:
building the GroupBy object from columns.
df=pd.DataFrame({'key1':['a','a','b','b','a'],'key2':['one','two','one','two','one'],'data1':np.random.randn(5),'data2':np.random.randn(5)})
df
Out[5]:
key1 key2 data1 data2
0 a one -1.159168 2.053633
1 a two 0.329434 -2.261167
2 b one -0.032842 1.040264
3 b two 1.288403 -0.802069
4 a one -0.523062 -0.687959
grouped=df['data1'].groupby(df['key1'])
grouped
Out[8]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x0000014A100E1490>
grouped.mean()
Out[9]:
key1
a -0.450932
b 0.627781
Name: data1, dtype: float64
means=df['data1'].groupby([df['key1'],df['key2']]).mean()
means
Out[13]:
key1 key2
a one -0.841115
two 0.329434
b one -0.032842
two 1.288403
Name: data1, dtype: float64
means.unstack()
Out[14]:
key2 one two
key1
a -0.841115 0.329434
b -0.032842 1.288403
states=np.array(['Ohio','California','California','Ohio','Ohio'])
years=np.array([2005,2005,2006,2005,2006])
df['data1'].groupby([states,years]).mean()
Out[17]:
California 2005 0.329434
2006 -0.032842
Ohio 2005 0.064618
2006 -0.523062
Name: data1, dtype: float64
df.groupby('key1').mean()
Out[18]:
data1 data2
key1
a -0.450932 -0.298498
b 0.627781 0.119097
df.groupby(['key1','key2']).mean()
Out[19]:
data1 data2
key1 key2
a one -0.841115 0.682837
two 0.329434 -2.261167
b one -0.032842 1.040264
two 1.288403 -0.802069
df.groupby(['key1','key2']).size()
Out[20]:
key1 key2
a one 2
two 1
b one 1
two 1
dtype: int64
These examples show:
- the groupby key can be one column or several;
- a grouping key only needs to match the length of the axis being grouped;
- when grouping a DataFrame directly, the key can simply be a column name;
- groupby produces a GroupBy object, and applying a function to that object gives the desired result.
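The flow above can be reproduced as a small self-contained script (same column names as the example; the random values will differ from the output shown):

```python
import numpy as np
import pandas as pd

# Same structure as the example DataFrame above
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

# Group one Series by another Series, then aggregate
by_series = df['data1'].groupby(df['key1']).mean()

# Equivalent and more common: group the DataFrame by a column name
by_name = df.groupby('key1')['data1'].mean()

assert by_series.equals(by_name)
```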
1.Iterating over the groups
Iterating over a GroupBy object yields a sequence of 2-tuples of the group name and the chunk of data.
for name,group in df.groupby('key1'):
    print(name)
    print(group)
a
key1 key2 data1 data2
0 a one -1.159168 2.053633
1 a two 0.329434 -2.261167
4 a one -0.523062 -0.687959
b
key1 key2 data1 data2
2 b one -0.032842 1.040264
3 b two 1.288403 -0.802069
for (k1,k2),group in df.groupby(['key1','key2']):
    print((k1,k2))
    print(group)
('a', 'one')
key1 key2 data1 data2
0 a one -1.159168 2.053633
4 a one -0.523062 -0.687959
('a', 'two')
key1 key2 data1 data2
1 a two 0.329434 -2.261167
('b', 'one')
key1 key2 data1 data2
2 b one -0.032842 1.040264
('b', 'two')
key1 key2 data1 data2
3 b two 1.288403 -0.802069
- Converting to a dict makes it easy to pick out a single group
pieces=dict(list(df.groupby('key1')))
pieces['b']
Out[31]:
key1 key2 data1 data2
2 b one -0.032842 1.040264
3 b two 1.288403 -0.802069
- Grouping along axis=1 by dtype
df.dtypes
Out[32]:
key1 object
key2 object
data1 float64
data2 float64
dtype: object
grouped=df.groupby(df.dtypes,axis=1)
grouped
Out[34]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000014A11C6E4F0>
for dtype,group in grouped:
    print(dtype)
    print(group)
float64
data1 data2
0 -1.159168 2.053633
1 0.329434 -2.261167
2 -0.032842 1.040264
3 1.288403 -0.802069
4 -0.523062 -0.687959
object
key1 key2
0 a one
1 a two
2 b one
3 b two
4 a one
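In recent pandas releases, `groupby(..., axis=1)` is deprecated; if all you need is the split of columns by dtype shown above, `select_dtypes` does the same job (a sketch reusing the example's column layout):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b'],
                   'key2': ['one', 'two', 'one'],
                   'data1': np.random.randn(3),
                   'data2': np.random.randn(3)})

# Column subsets by dtype, without groupby(axis=1)
floats = df.select_dtypes(include='float64')
objects = df.select_dtypes(include='object')

assert list(floats.columns) == ['data1', 'data2']
assert list(objects.columns) == ['key1', 'key2']
```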
2.Selecting a column or a subset of columns
df.groupby(['key1','key2'])[['data1']].mean()
Out[39]:
data1
key1 key2
a one -0.841115
two 0.329434
b one -0.032842
two 1.288403
df.groupby(['key1','key2']).mean()
Out[40]:
data1 data2
key1 key2
a one -0.841115 0.682837
two 0.329434 -2.261167
b one -0.032842 1.040264
two 1.288403 -0.802069
3.Grouping with dicts and Series
This is grouping along the axis=1 direction.
people=pd.DataFrame(np.random.randn(5,5),columns=['a','b','c','d','e'],index=['Joe','Steve','Wes','Jim','Travis'])
people.iloc[2:3,[1,2]]=np.nan
people
Out[43]:
a b c d e
Joe -0.030326 1.494879 -0.281233 -0.672231 0.420580
Steve 0.591620 -0.295068 0.468036 -0.275808 0.306458
Wes -0.609984 NaN NaN -0.032940 0.742902
Jim -0.305856 -1.170328 -0.091456 -0.245780 -0.248539
Travis -0.730377 1.098705 0.496633 -0.535223 1.077263
mapping={'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}
by_column=people.groupby(mapping,axis=1)
by_column.sum()
Out[46]:
blue red
Joe -0.953464 1.885133
Steve 0.192228 0.603010
Wes -0.032940 0.132917
Jim -0.337237 -1.724723
Travis -0.038590 1.445590
map_series=pd.Series(mapping)
map_series
Out[48]:
a red
b red
c blue
d blue
e red
f orange
dtype: object
people.groupby(map_series,axis=1).count()
Out[50]:
blue red
Joe 2 3
Steve 2 3
Wes 1 2
Jim 2 3
Travis 2 3
4.Grouping with functions
people.groupby(len).sum()
Out[51]:
a b c d e
3 -0.946167 0.324551 -0.372689 -0.950951 0.914943
5 0.591620 -0.295068 0.468036 -0.275808 0.306458
6 -0.730377 1.098705 0.496633 -0.535223 1.077263
key_list=['one','one','one','two','two']
people.groupby([len,key_list]).min()
Out[54]:
a b c d e
3 one -0.609984 1.494879 -0.281233 -0.672231 0.420580
two -0.305856 -1.170328 -0.091456 -0.245780 -0.248539
5 one 0.591620 -0.295068 0.468036 -0.275808 0.306458
6 two -0.730377 1.098705 0.496633 -0.535223 1.077263
5.Grouping by index level
columns=pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],[1,3,5,1,3]],names=['cty','tenor'])
hier_df=pd.DataFrame(np.random.randn(4,5),columns=columns)
hier_df
Out[57]:
cty US JP
tenor 1 3 5 1 3
0 -0.286884 0.558246 0.453022 1.439828 0.045596
1 -0.164324 0.497046 0.173779 0.376164 -0.513194
2 -0.411519 -1.334042 0.470202 -0.249659 1.129361
3 0.459289 -1.111118 0.217509 1.570917 0.940517
hier_df.groupby(level='cty',axis=1).count()
Out[58]:
cty JP US
0 2 3
1 2 3
2 2 3
3 2 3
Summary:
- ways groupby can group:
- by grouping keys (an actual column of values), dicts, Series, functions, and so on
- along axis=0 or along axis=1; either way the key's length must match that axis
- a GroupBy object can be iterated over and subset
Aggregation means producing a scalar value from an array.
- There are many aggregation methods; see the descriptive statistics in pandas
- To use a function of your own, pass it via the agg or aggregate method
df
Out[59]:
key1 key2 data1 data2
0 a one -1.159168 2.053633
1 a two 0.329434 -2.261167
2 b one -0.032842 1.040264
3 b two 1.288403 -0.802069
4 a one -0.523062 -0.687959
def peak_to_peak(arr):
    return arr.max()-arr.min()
df.groupby('key1').agg(peak_to_peak)
Out[62]:
data1 data2
key1
a 1.488602 4.314799
b 1.321245 1.842333
- describe() is not an aggregation function, but it works here too
df.groupby('key1').describe()
Out[63]:
data1 ... data2
count mean std ... 50% 75% max
key1 ...
a 3.0 -0.450932 0.746918 ... -0.687959 0.682837 2.053633
b 2.0 0.627781 0.934261 ... 0.119097 0.579681 1.040264
[2 rows x 16 columns]
1.Column-wise and multiple function application
- pass a function name to agg() as a string
- a list of function names can also be passed
agg accepts aggregation functions in several ways:
- one column & one (or more) aggregation functions
- multiple columns & one (or more) aggregation functions
- tuples, to rename the columns of the aggregated result
- a dict, to apply different aggregation functions to different columns
Here are the examples.
Loading the data:
data=pd.read_excel(r'C:\Users\章ky\Desktop\非洲通讯产品销售数据.xlsx')
data.head()
Out[9]:
日期 国家 城市 地区 服务分类 销售额 利润
0 2017-01-01 Cote d'Ivoire Abidjan Western Commercial 656.96 6.57
1 2017-01-01 Madagascar Antananarivo Eastern Public 875.94 -70.08
2 2017-01-01 Rwanda Kigali Eastern Public 258.35 18.08
3 2017-01-01 Zimbabwe Harare Eastern Residential 875.62 -35.02
4 2017-01-02 Ethiopia Addis Ababa Eastern Residential 509.93 10.20
grouped=data.groupby(['服务分类','地区'])
grouped_pct=grouped['销售额']
grouped_pct
Out[12]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x0000016605D20B20>
One or more aggregation functions
grouped_pct.agg('mean')
Out[13]:
服务分类 地区
Commercial Eastern 516.283218
Middle 494.972603
Northern 472.082821
Southern 536.559286
Western 492.849565
Public Eastern 487.137327
Middle 452.043729
Northern 416.683111
Southern 609.509310
Western 487.892286
Residential Eastern 507.873900
Middle 530.288333
Northern 524.036818
Southern 529.237719
Western 474.820100
Name: 销售额, dtype: float64
grouped_pct.agg(['mean','std'])
Out[14]:
mean std
服务分类 地区
Commercial Eastern 516.283218 283.011987
Middle 494.972603 303.269793
Northern 472.082821 286.887794
Southern 536.559286 287.523768
Western 492.849565 271.758411
Public Eastern 487.137327 292.310160
Middle 452.043729 320.831745
Northern 416.683111 252.796114
Southern 609.509310 238.548006
Western 487.892286 299.995695
Residential Eastern 507.873900 285.163213
Middle 530.288333 315.332683
Northern 524.036818 263.298494
Southern 529.237719 314.481258
Western 474.820100 306.045465
Renaming the aggregated columns with tuples
grouped_pct.agg([('平均值','mean'),('标准差',np.std)])
Out[15]:
平均值 标准差
服务分类 地区
Commercial Eastern 516.283218 283.011987
Middle 494.972603 303.269793
Northern 472.082821 286.887794
Southern 536.559286 287.523768
Western 492.849565 271.758411
Public Eastern 487.137327 292.310160
Middle 452.043729 320.831745
Northern 416.683111 252.796114
Southern 609.509310 238.548006
Western 487.892286 299.995695
Residential Eastern 507.873900 285.163213
Middle 530.288333 315.332683
Northern 524.036818 263.298494
Southern 529.237719 314.481258
Western 474.820100 306.045465
functions=['count','mean','max']
result=grouped[['销售额','利润']].agg(functions)
result
Out[18]:
销售额 利润
count mean max count mean max
服务分类 地区
Commercial Eastern 87 516.283218 994.49 87 6.590345 88.23
Middle 73 494.972603 984.20 73 4.921781 85.03
Northern 39 472.082821 928.29 39 6.058205 72.88
Southern 42 536.559286 993.94 42 13.109286 95.39
Western 115 492.849565 992.34 115 -1.400696 69.29
Public Eastern 101 487.137327 982.91 101 0.291089 77.22
Middle 59 452.043729 997.23 59 7.814237 89.74
Northern 45 416.683111 928.39 45 8.741556 58.28
Southern 29 609.509310 989.02 29 15.928966 95.13
Western 105 487.892286 994.43 105 -1.576667 89.62
Residential Eastern 100 507.873900 992.58 100 5.482900 69.66
Middle 60 530.288333 996.56 60 8.156500 85.19
Northern 44 524.036818 984.51 44 13.494318 74.07
Southern 57 529.237719 999.08 57 1.455614 76.24
Western 100 474.820100 992.39 100 5.709300 91.98
result['销售额']
Out[19]:
count mean max
服务分类 地区
Commercial Eastern 87 516.283218 994.49
Middle 73 494.972603 984.20
Northern 39 472.082821 928.29
Southern 42 536.559286 993.94
Western 115 492.849565 992.34
Public Eastern 101 487.137327 982.91
Middle 59 452.043729 997.23
Northern 45 416.683111 928.39
Southern 29 609.509310 989.02
Western 105 487.892286 994.43
Residential Eastern 100 507.873900 992.58
Middle 60 530.288333 996.56
Northern 44 524.036818 984.51
Southern 57 529.237719 999.08
Western 100 474.820100 992.39
ftuples=[('平均值','mean'),('中位数','median')]
grouped[['销售额','利润']].agg(ftuples)
Out[21]:
销售额 利润
平均值 中位数 平均值 中位数
服务分类 地区
Commercial Eastern 516.283218 506.010 6.590345 3.620
Middle 494.972603 483.350 4.921781 2.120
Northern 472.082821 506.780 6.058205 2.610
Southern 536.559286 510.665 13.109286 5.475
Western 492.849565 494.930 -1.400696 -3.710
Public Eastern 487.137327 472.240 0.291089 -0.850
Middle 452.043729 398.180 7.814237 2.570
Northern 416.683111 434.160 8.741556 6.570
Southern 609.509310 621.630 15.928966 17.850
Western 487.892286 462.980 -1.576667 0.520
Residential Eastern 507.873900 500.125 5.482900 4.230
Middle 530.288333 471.820 8.156500 3.375
Northern 524.036818 489.030 13.494318 11.835
Southern 529.237719 526.930 1.455614 3.330
Western 474.820100 393.110 5.709300 3.660
grouped.agg({'销售额':np.max,'利润':'sum'})
Out[22]:
销售额 利润
服务分类 地区
Commercial Eastern 994.49 573.36
Middle 984.20 359.29
Northern 928.29 236.27
Southern 993.94 550.59
Western 992.34 -161.08
Public Eastern 982.91 29.40
Middle 997.23 461.04
Northern 928.39 393.37
Southern 989.02 461.94
Western 994.43 -165.55
Residential Eastern 992.58 548.29
Middle 996.56 489.39
Northern 984.51 593.75
Southern 999.08 82.97
Western 992.39 570.93
Applying different aggregation functions to different columns with a dict
grouped.agg({'销售额':[np.max,'min','sum','count'],'利润':'sum'})
Out[23]:
销售额 利润
amax min sum count sum
服务分类 地区
Commercial Eastern 994.49 1.14 44916.64 87 573.36
Middle 984.20 0.89 36133.00 73 359.29
Northern 928.29 43.55 18411.23 39 236.27
Southern 993.94 1.39 22535.49 42 550.59
Western 992.34 5.47 56677.70 115 -161.08
Public Eastern 982.91 3.10 49200.87 101 29.40
Middle 997.23 32.15 26670.58 59 461.04
Northern 928.39 22.19 18750.74 45 393.37
Southern 989.02 139.31 17675.77 29 461.94
Western 994.43 2.69 51228.69 105 -165.55
Residential Eastern 992.58 6.87 50787.39 100 548.29
Middle 996.56 18.24 31817.30 60 489.39
Northern 984.51 120.45 23057.62 44 593.75
Southern 999.08 23.96 30166.55 57 82.97
Western 992.39 3.30 47482.01 100 570.93
2.Returning aggregated data without row indexes: as_index=False
All the examples so far kept the group keys in the index.
Pass as_index=False to disable using the group keys as the index.
data.groupby(['服务分类','地区'],as_index=False).mean()
Out[24]:
服务分类 地区 销售额 利润
0 Commercial Eastern 516.283218 6.590345
1 Commercial Middle 494.972603 4.921781
2 Commercial Northern 472.082821 6.058205
3 Commercial Southern 536.559286 13.109286
4 Commercial Western 492.849565 -1.400696
5 Public Eastern 487.137327 0.291089
6 Public Middle 452.043729 7.814237
7 Public Northern 416.683111 8.741556
8 Public Southern 609.509310 15.928966
9 Public Western 487.892286 -1.576667
10 Residential Eastern 507.873900 5.482900
11 Residential Middle 530.288333 8.156500
12 Residential Northern 524.036818 13.494318
13 Residential Southern 529.237719 1.455614
14 Residential Western 474.820100 5.709300
Summary: what agg() does
- applies aggregation functions column by column
- applies different functions to different columns (via a dict)
- renames the aggregated columns (via tuples)
- disables indexing by the group keys, with as_index=False
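Since pandas 0.25 there is one more option worth knowing, "named aggregation", which combines renaming and per-column functions in a single call (a minimal sketch with made-up column names, not the sales data):

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'],
                   'val': [1.0, 3.0, 5.0]})

# One keyword per output column: new_name=(input_column, aggfunc)
out = df.groupby('key').agg(val_mean=('val', 'mean'),
                            val_max=('val', 'max'))

assert list(out.columns) == ['val_mean', 'val_max']
assert out.loc['a', 'val_mean'] == 2.0
```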
The point of apply() is that it applies a function to each group, and that function may be aggregating or non-aggregating.
agg() is focused on aggregation; with non-aggregating functions it generally won't work well (there are exceptions, but the results are unreliable).
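The contrast shows up in a toy example: a reducing function fits agg, while a function returning several rows per group needs apply (column names here are made up):

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                   'val': [1, 4, 2, 8]})

# Reducing function: one scalar per group -> agg works
spans = df.groupby('key')['val'].agg(lambda s: s.max() - s.min())

# Non-reducing function: several rows per group -> use apply
top2 = df.groupby('key', group_keys=False)['val'].apply(lambda s: s.nlargest(2))

assert spans['a'] == 3 and spans['b'] == 6
assert len(top2) == 4
```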
data.head()
Out[26]:
日期 国家 城市 地区 服务分类 销售额 利润
0 2017-01-01 Cote d'Ivoire Abidjan Western Commercial 656.96 6.57
1 2017-01-01 Madagascar Antananarivo Eastern Public 875.94 -70.08
2 2017-01-01 Rwanda Kigali Eastern Public 258.35 18.08
3 2017-01-01 Zimbabwe Harare Eastern Residential 875.62 -35.02
4 2017-01-02 Ethiopia Addis Ababa Eastern Residential 509.93 10.20
def top(data,n=5,column='销售额'):
    return data.sort_values(by=column)[-n:]
top(data,n=6)
Out[29]:
日期 国家 ... 销售额 利润
402 2018-06-10 Burkina Faso ... 994.43 9.94
623 2019-04-02 Mozambique ... 994.49 -59.67
365 2018-04-11 Central African Republic ... 995.01 -59.70
75 2017-02-20 Equatorial Guinea ... 996.56 9.97
464 2018-09-12 Democratic Republic of the Congo ... 997.23 49.86
627 2019-04-14 Namibia ... 999.08 19.98
[6 rows x 7 columns]
data.groupby('服务分类').apply(top)
Out[30]:
日期 国家 ... 销售额 利润
服务分类 ...
Commercial 592 2019-01-28 Somalia ... 985.13 59.11
100 2017-04-09 Benin ... 989.86 69.29
985 2020-09-02 Nigeria ... 992.34 29.77
1021 2020-10-13 South Africa ... 993.94 -49.70
623 2019-04-02 Mozambique ... 994.49 -59.67
Public 391 2018-05-18 Swaziland ... 989.02 89.01
552 2019-01-10 Ghana ... 992.12 19.84
820 2020-01-11 Togo ... 992.23 -19.84
402 2018-06-10 Burkina Faso ... 994.43 9.94
464 2018-09-12 Democratic Republic of the Congo ... 997.23 49.86
Residential 459 2018-09-03 Nigeria ... 992.39 39.70
60 2017-01-26 Kenya ... 992.58 49.63
365 2018-04-11 Central African Republic ... 995.01 -59.70
75 2017-02-20 Equatorial Guinea ... 996.56 9.97
627 2019-04-14 Namibia ... 999.08 19.98
[15 rows x 7 columns]
data.groupby(['服务分类','地区']).apply(top,n=2,column='利润')
Out[42]:
日期 ... 利润
服务分类 地区 ...
Commercial Eastern 708 2019-08-09 ... 85.17
283 2018-01-11 ... 88.23
Middle 198 2017-08-26 ... 80.60
548 2019-01-09 ... 85.03
Northern 257 2017-12-15 ... 60.19
782 2019-12-09 ... 72.88
Southern 157 2017-07-03 ... 78.72
118 2017-05-10 ... 95.39
Western 439 2018-08-06 ... 66.69
100 2017-04-09 ... 69.29
Public Eastern 878 2020-02-28 ... 68.76
858 2020-02-01 ... 77.22
Middle 588 2019-01-26 ... 78.78
562 2019-01-14 ... 89.74
Northern 185 2017-08-13 ... 56.11
63 2017-01-27 ... 58.28
Southern 391 2018-05-18 ... 89.01
1046 2020-12-07 ... 95.13
Western 569 2019-01-18 ... 79.70
460 2018-09-04 ... 89.62
Residential Eastern 1033 2020-11-18 ... 62.62
497 2018-10-22 ... 69.66
Middle 1001 2020-09-19 ... 78.98
407 2018-06-12 ... 85.19
Northern 64 2017-01-28 ... 60.88
293 2018-01-15 ... 74.07
Southern 694 2019-07-25 ... 70.47
693 2019-07-22 ... 76.24
Western 94 2017-03-18 ... 86.98
829 2020-01-15 ... 91.98
[30 rows x 7 columns]
Breaking down this example:
- top is a sort-and-slice function
- recall: sort_values() sorts ascending by default, and the by parameter says which column to sort on
- groupby('服务分类'): groups by '服务分类'
- apply(top): applies top to each group, i.e. returns, within each group, the top n rows by 销售额 (5 by default)
Calling the non-aggregating describe() directly on a GroupBy object, as we did earlier, works because it is really just shorthand for this apply() step:
result=data.groupby('服务分类')['销售额'].describe()
result
Out[45]:
count mean std ... 50% 75% max
服务分类 ...
Commercial 356.0 501.893427 283.651006 ... 501.22 741.69 994.49
Public 339.0 482.379499 292.925777 ... 456.37 737.36 997.23
Residential 361.0 507.786343 297.611310 ... 472.81 784.94 999.08
[3 rows x 8 columns]
f=lambda x:x.describe()
data.groupby('服务分类')['销售额'].apply(f)
Out[48]:
服务分类
Commercial count 356.000000
mean 501.893427
std 283.651006
min 0.890000
25% 282.910000
50% 501.220000
75% 741.690000
max 994.490000
Public count 339.000000
mean 482.379499
std 292.925777
min 2.690000
25% 214.590000
50% 456.370000
75% 737.360000
max 997.230000
Residential count 361.000000
mean 507.786343
std 297.611310
min 3.300000
25% 247.000000
50% 472.810000
75% 784.940000
max 999.080000
Name: 销售额, dtype: float64
data.groupby('服务分类')['销售额'].apply(f).unstack()
Out[49]:
count mean std ... 50% 75% max
服务分类 ...
Commercial 356.0 501.893427 283.651006 ... 501.22 741.69 994.49
Public 339.0 482.379499 292.925777 ... 456.37 737.36 997.23
Residential 361.0 507.786343 297.611310 ... 472.81 784.94 999.08
[3 rows x 8 columns]
1.Suppressing the group keys: group_keys=False
In the previous example the result carries both a hierarchical index formed from the group keys and the original index; to disable this, pass group_keys=False.
data.groupby('服务分类').apply(top)
Out[50]:
日期 国家 ... 销售额 利润
服务分类 ...
Commercial 592 2019-01-28 Somalia ... 985.13 59.11
100 2017-04-09 Benin ... 989.86 69.29
985 2020-09-02 Nigeria ... 992.34 29.77
1021 2020-10-13 South Africa ... 993.94 -49.70
623 2019-04-02 Mozambique ... 994.49 -59.67
Public 391 2018-05-18 Swaziland ... 989.02 89.01
552 2019-01-10 Ghana ... 992.12 19.84
820 2020-01-11 Togo ... 992.23 -19.84
402 2018-06-10 Burkina Faso ... 994.43 9.94
464 2018-09-12 Democratic Republic of the Congo ... 997.23 49.86
Residential 459 2018-09-03 Nigeria ... 992.39 39.70
60 2017-01-26 Kenya ... 992.58 49.63
365 2018-04-11 Central African Republic ... 995.01 -59.70
75 2017-02-20 Equatorial Guinea ... 996.56 9.97
627 2019-04-14 Namibia ... 999.08 19.98
[15 rows x 7 columns]
data.groupby('服务分类',group_keys=False).apply(top)
Out[51]:
日期 国家 ... 销售额 利润
592 2019-01-28 Somalia ... 985.13 59.11
100 2017-04-09 Benin ... 989.86 69.29
985 2020-09-02 Nigeria ... 992.34 29.77
1021 2020-10-13 South Africa ... 993.94 -49.70
623 2019-04-02 Mozambique ... 994.49 -59.67
391 2018-05-18 Swaziland ... 989.02 89.01
552 2019-01-10 Ghana ... 992.12 19.84
820 2020-01-11 Togo ... 992.23 -19.84
402 2018-06-10 Burkina Faso ... 994.43 9.94
464 2018-09-12 Democratic Republic of the Congo ... 997.23 49.86
459 2018-09-03 Nigeria ... 992.39 39.70
60 2017-01-26 Kenya ... 992.58 49.63
365 2018-04-11 Central African Republic ... 995.01 -59.70
75 2017-02-20 Equatorial Guinea ... 996.56 9.97
627 2019-04-14 Namibia ... 999.08 19.98
2.Quantile and bucket analysis
The cut() and qcut() functions from Chapter 8 are even handier when combined with groupby.
Usage: split into equal-width buckets with cut, then inspect each bucket with groupby and apply.
frame=pd.DataFrame({'data1':np.random.randn(1000),'data2':np.random.randn(1000)})
quartiles=pd.cut(frame.data1,4)
quartiles[:10]
Out[58]:
0 (0.282, 1.85]
1 (-1.285, 0.282]
2 (-1.285, 0.282]
3 (-1.285, 0.282]
4 (-1.285, 0.282]
5 (1.85, 3.417]
6 (0.282, 1.85]
7 (0.282, 1.85]
8 (-1.285, 0.282]
9 (0.282, 1.85]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-2.859, -1.285] < (-1.285, 0.282] < (0.282, 1.85] < (1.85, 3.417]]
def get_stats(group):
    return {'min':group.min(),'max':group.max(),'count':group.count(),'mean':group.mean()}
grouped=frame.data2.groupby(quartiles)
grouped.apply(get_stats).unstack()
Out[62]:
min max count mean
data1
(-2.859, -1.285] -2.489030 2.060369 100.0 -0.046831
(-1.285, 0.282] -2.551677 2.689849 513.0 -0.031574
(0.282, 1.85] -2.855721 3.457885 365.0 0.047201
(1.85, 3.417] -2.793339 1.713071 22.0 -0.203545
Usage: split into equal-count buckets with qcut, then inspect each bucket with groupby and apply.
grouping=pd.qcut(frame.data1,10,labels=False)
grouped=frame.data2.groupby(grouping)
grouped.apply(get_stats).unstack()
Out[65]:
min max count mean
data1
0 -2.489030 2.060369 100.0 -0.046831
1 -2.280951 2.689849 100.0 -0.011814
2 -2.551677 2.401666 100.0 -0.028943
3 -1.623332 2.366734 100.0 -0.000173
4 -2.207725 2.195542 100.0 0.070011
5 -2.025112 1.951863 100.0 -0.115340
6 -2.249669 2.357705 100.0 0.018044
7 -2.450440 2.438719 100.0 0.062580
8 -2.855721 3.457885 100.0 0.016411
9 -2.793339 2.923083 100.0 -0.045247
3.Example: filling missing values with group-specific values
- Earlier we learned to fill with a fixed value via fillna:
s=pd.Series(np.random.randn(6))
s[::2]=np.nan
s
Out[84]:
0 NaN
1 1.085667
2 NaN
3 -0.214952
4 NaN
5 0.909982
dtype: float64
s.fillna(s.mean())
Out[85]:
0 0.593566
1 1.085667
2 0.593566
3 -0.214952
4 0.593566
5 0.909982
dtype: float64
- With apply we can fill missing values group by group:
states=['Ohio','New York','Vermont','Florida','Oregon','Nevada','California','Idaho']
group_key=['East']*4+['West']*4
data=pd.Series(np.random.randn(8),index=states)
data
Out[89]:
Ohio 1.275526
New York -0.790592
Vermont 1.063371
Florida 0.103123
Oregon -1.586148
Nevada -0.062837
California 1.651587
Idaho -1.263948
dtype: float64
data[['Vermont','Nevada','Idaho']]=np.nan
data
Out[91]:
Ohio 1.275526
New York -0.790592
Vermont NaN
Florida 0.103123
Oregon -1.586148
Nevada NaN
California 1.651587
Idaho NaN
dtype: float64
data.groupby(group_key).mean()
Out[93]:
East 0.196019
West 0.032719
dtype: float64
fill_mean=lambda g:g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)
Out[95]:
Ohio 1.275526
New York -0.790592
Vermont 0.196019
Florida 0.103123
Oregon -1.586148
Nevada 0.032719
California 1.651587
Idaho 0.032719
dtype: float64
fill_values={'East':0.5,'West':-1}
fill_func=lambda g:g.fillna(fill_values[g.name])
data.groupby(group_key).apply(fill_func)
Out[98]:
Ohio 1.275526
New York -0.790592
Vermont 0.500000
Florida 0.103123
Oregon -1.586148
Nevada -1.000000
California 1.651587
Idaho -1.000000
dtype: float64
Note that each group object passed in carries a built-in name attribute.
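A minimal demonstration of that name attribute, which is what lets fill_func above look up the right fill value:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
keys = ['east', 'east', 'west', 'west']

# Inside apply, each group's .name is its group key
names = s.groupby(keys).apply(lambda g: g.name)

assert list(names) == ['east', 'west']
```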
1.Pivot tables: pivot_table()
pivot_table options:
- values: column(s) to aggregate; by default, all numeric columns
- index: column names or other group keys to group on the rows of the resulting table
- columns: column names or other group keys to group on the columns of the resulting table
- aggfunc: aggregation function or list of functions (default 'mean'); any function valid in a groupby context works
- fill_value: value used to replace missing values in the result table
- dropna: if True, drop columns whose entries are all NA
- margins: add row/column subtotals and grand totals
data.pivot_table(index=['服务分类','地区'])
Out[66]:
利润 销售额
服务分类 地区
Commercial Eastern 6.590345 516.283218
Middle 4.921781 494.972603
Northern 6.058205 472.082821
Southern 13.109286 536.559286
Western -1.400696 492.849565
Public Eastern 0.291089 487.137327
Middle 7.814237 452.043729
Northern 8.741556 416.683111
Southern 15.928966 609.509310
Western -1.576667 487.892286
Residential Eastern 5.482900 507.873900
Middle 8.156500 530.288333
Northern 13.494318 524.036818
Southern 1.455614 529.237719
Western 5.709300 474.820100
data.pivot_table('利润',index='服务分类',columns='地区')
Out[68]:
地区 Eastern Middle Northern Southern Western
服务分类
Commercial 6.590345 4.921781 6.058205 13.109286 -1.400696
Public 0.291089 7.814237 8.741556 15.928966 -1.576667
Residential 5.482900 8.156500 13.494318 1.455614 5.709300
data.pivot_table(['利润','销售额'],index='服务分类',columns='地区')
Out[69]:
利润 ... 销售额
地区 Eastern Middle Northern ... Northern Southern Western
服务分类 ...
Commercial 6.590345 4.921781 6.058205 ... 472.082821 536.559286 492.849565
Public 0.291089 7.814237 8.741556 ... 416.683111 609.509310 487.892286
Residential 5.482900 8.156500 13.494318 ... 524.036818 529.237719 474.820100
[3 rows x 10 columns]
data.pivot_table(['利润','销售额'],index='服务分类',columns='地区',margins=True)
Out[70]:
利润 ... 销售额
地区 Eastern Middle Northern ... Southern Western All
服务分类 ...
Commercial 6.590345 4.921781 6.058205 ... 536.559286 492.849565 501.893427
Public 0.291089 7.814237 8.741556 ... 609.509310 487.892286 482.379499
Residential 5.482900 8.156500 13.494318 ... 529.237719 474.820100 507.786343
All 3.996701 6.821458 9.557734 ... 549.826641 485.588750 497.643542
[4 rows x 12 columns]
data.pivot_table('利润',index='服务分类',columns='地区',margins=True)
Out[71]:
地区 Eastern Middle Northern Southern Western All
服务分类
Commercial 6.590345 4.921781 6.058205 13.109286 -1.400696 4.377612
Public 0.291089 7.814237 8.741556 15.928966 -1.576667 3.481416
Residential 5.482900 8.156500 13.494318 1.455614 5.709300 6.330554
All 3.996701 6.821458 9.557734 8.558594 0.763438 4.757538
data.pivot_table('利润',index='服务分类',columns='地区',aggfunc=len,margins=True)
Out[72]:
地区 Eastern Middle Northern Southern Western All
服务分类
Commercial 87.0 73.0 39.0 42.0 115.0 356.0
Public 101.0 59.0 45.0 29.0 105.0 339.0
Residential 100.0 60.0 44.0 57.0 100.0 361.0
All 288.0 192.0 128.0 128.0 320.0 1056.0
2.Cross-tabulations: crosstab()
A cross-tabulation is a special case of a pivot table that computes group frequencies.
pd.crosstab(data.服务分类,data.地区,margins=True)
Out[75]:
地区 Eastern Middle Northern Southern Western All
服务分类
Commercial 87 73 39 42 115 356
Public 101 59 45 29 105 339
Residential 100 60 44 57 100 361
All 288 192 128 128 320 1056
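Since a crosstab is just a frequency pivot table, the same counts can be rebuilt with groupby as a quick check (tiny made-up data, not the sales dataset):

```python
import pandas as pd

df = pd.DataFrame({'cat': ['A', 'A', 'B'],
                   'region': ['E', 'W', 'E']})

ct = pd.crosstab(df.cat, df.region)

# Equivalent frequency table via groupby + size + unstack
gt = df.groupby(['cat', 'region']).size().unstack(fill_value=0)

assert ct.equals(gt)
```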
The first two arguments to crosstab can be arrays, Series, or lists.
data2=pd.read_excel(r'D:\Data\学习\酒店数据.xlsx')
data2.info()
RangeIndex: 396 entries, 0 to 395
Data columns (total 11 columns):
Original: https://blog.csdn.net/weixin_55730631/article/details/124597861
Author: 真是喵啊
Title: Working through "Python for Data Analysis", Chapter 10, Data Aggregation and Group Operations: GroupBy, agg(), apply(), pivot_table(), crosstab() in detail