Data Science: Grouping and Aggregation in pandas

July 8, 2023, 4:18 PM • Artificial Intelligence • 70 reads

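Introduction

Suppose we have a dataset of Starbucks stores worldwide. If I want to know whether the US or China has more Starbucks stores, or how many stores each Chinese province has, what should I do?

One idea: loop over every row and add 1 to a counter each time?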
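The loop-and-count idea can be sketched as follows. This is only an illustration: the five rows are made up and stand in for the real CSV, and only the column names `Brand` and `Country` come from the dataset used in this article.

```python
import pandas as pd

# Tiny made-up sample standing in for the Starbucks CSV (hypothetical rows).
df = pd.DataFrame({
    "Brand": ["Starbucks"] * 5,
    "Country": ["US", "CN", "US", "CN", "US"],
})

# Manual approach: walk the column once and add 1 per store.
counts = {}
for country in df["Country"]:
    counts[country] = counts.get(country, 0) + 1

print(counts)  # {'US': 3, 'CN': 2}
```

This works, but every new question (per province, per city) needs another hand-written loop, which is the motivation for groupby.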

import pandas as pd
import numpy as np

file_path = "./starbucks_store_worldwide.csv"

df = pd.read_csv(file_path)
print(df.head(1))
print(df.info())
grouped = df.groupby(by="Country")
print(grouped)

# grouped is a DataFrameGroupBy object
# It can be iterated over
for i, j in grouped:
    print(i)

Data source: https://www.kaggle.com/starbucks/store-locations/data

pandas has a very simple way to perform this kind of grouping:

df.groupby(by="columns_name")

import pandas as pd
import numpy as np

file_path = "./starbucks_store_worldwide.csv"

df = pd.read_csv(file_path)
grouped = df.groupby(by="Country")
print(grouped)

country_count = grouped["Brand"].count()
print(country_count)
print(country_count["US"])
print(country_count["CN"])

# Count the stores in each Chinese province
china_data = df[df["Country"] =="CN"]

grouped = china_data.groupby(by="State/Province").count()["Brand"]

print(grouped)

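So the question is: what does calling groupby actually return?

Grouping and aggregation

grouped = df.groupby(by="columns_name")

grouped is a DataFrameGroupBy object, and it is iterable.

Each element of grouped is a tuple.

The tuple holds (the group key, i.e. the value that was grouped on, and the DataFrame for that group).

DataFrameGroupBy objects provide many optimized methods, such as count, sum, mean, median, std, min and max.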
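A minimal sketch of what iterating over the grouped object yields, and of two of the built-in aggregations. The rows are invented for illustration; the real file has many more columns.

```python
import pandas as pd

# Made-up rows; only the column names mirror the Starbucks dataset.
df = pd.DataFrame({
    "Brand": ["Starbucks", "Starbucks", "Teavana", "Starbucks"],
    "Country": ["CN", "US", "US", "CN"],
    "City": ["Beijing", "Seattle", "Boston", "Shanghai"],
})

grouped = df.groupby(by="Country")

# Each element is a (group key, sub-DataFrame) tuple.
for country, sub_df in grouped:
    print(country, type(sub_df).__name__, len(sub_df))

# Built-in aggregations are applied once per group.
store_counts = grouped["Brand"].count()     # non-null Brand values per country
brand_variety = grouped["Brand"].nunique()  # distinct brands per country
print(store_counts)
print(brand_variety)
```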

What if we need to group and count by both country and province?

grouped = df.groupby(by=[df["Country"],df["State/Province"]])

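Select part of the data after grouping:

df.groupby(by=["Country","State/Province"])["Country"].count()

Select a column first, then group it:

df["Country"].groupby(by=[df["Country"],df["State/Province"]]).count()

The return value is a Series with a two-level index.

Looking at the result: because only one column was selected, it is a Series.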
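To make the "Series with two index levels" concrete, here is a small sketch on invented rows (the province codes are placeholders, not values checked against the real file):

```python
import pandas as pd

# Made-up rows; "11", "31", "WA", "CA" are placeholder province/state codes.
df = pd.DataFrame({
    "Brand": ["Starbucks"] * 5,
    "Country": ["CN", "CN", "CN", "US", "US"],
    "State/Province": ["11", "11", "31", "WA", "CA"],
})

counts = df.groupby(by=["Country", "State/Province"])["Brand"].count()

print(type(counts).__name__)       # Series
print(counts.index.nlevels)        # 2
print(counts.loc[("CN", "11")])    # 2
print(counts.loc["US"])            # the inner level for one country
```

Indexing with a tuple picks one (country, province) pair, while indexing with only the outer label returns everything under that country.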

What if I want a DataFrame back?

df[["Brand"]], with double square brackets, selects a DataFrame rather than a Series. All three of the following return a DataFrame:

# Group by several keys and return a DataFrame
grouped1 = df[["Brand"]].groupby(by=[df["Country"],df["State/Province"]]).count()
print(grouped1,type(grouped1))
grouped2= df.groupby(by=[df["Country"],df["State/Province"]])[["Brand"]].count()
grouped3 = df.groupby(by=[df["Country"],df["State/Province"]]).count()[["Brand"]]
print(grouped2,type(grouped2))
print(grouped3,type(grouped3))
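The difference between single and double brackets can be checked on a small invented frame. The name `store_count` below is a label chosen for this sketch, not something from the article.

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "Brand": ["Starbucks"] * 4,
    "Country": ["CN", "CN", "US", "US"],
    "State/Province": ["11", "31", "WA", "WA"],
})
keys = ["Country", "State/Province"]

as_series = df.groupby(by=keys)["Brand"].count()    # single brackets
as_frame = df.groupby(by=keys)[["Brand"]].count()   # double brackets

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame

# A grouped Series can also be flattened into an ordinary table.
flat = as_series.reset_index(name="store_count")
print(flat)
```

If the goal is a flat table for export or plotting, `reset_index` is one option; keeping the two-level index is usually handier for lookups.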

Indexes and MultiIndex

Simple index operations:

Get the index: df.index

In [19]: t=pd.DataFrame(np.arange(12).reshape((3,4)),index=list(string.ascii_uppercase[:3]),columns=list(string.ascii_uppercase[-4:]))

In [20]: t
Out[20]:
   W  X   Y   Z
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11

In [21]: t.index
Out[21]: Index(['A', 'B', 'C'], dtype='object')

Assign a new index: df.index = ['x', 'y']

In [22]: t.index=["a","b","c"]

In [23]: t
Out[23]:
   W  X   Y   Z
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11

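Reindex: df.reindex(list("abcedf")). Rows are looked up by label, so a label that is not in the existing index produces a row of NaN.

In [26]: t.reindex(list("abcd"))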
Out[26]:
     W    X     Y     Z
a  0.0  1.0   2.0   3.0
b  4.0  5.0   6.0   7.0
c  8.0  9.0  10.0  11.0
d  NaN  NaN   NaN   NaN
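Assigning to `.index` and calling `reindex` are easy to confuse, so here is a short sketch of the difference. The frame mirrors `t` from the session above; `fill_value=0` is an extra option shown only for comparison.

```python
import numpy as np
import pandas as pd

t = pd.DataFrame(np.arange(12).reshape((3, 4)),
                 index=list("ABC"), columns=list("WXYZ"))

# Assignment relabels the existing rows in place; the data does not move.
t.index = ["a", "b", "c"]

# reindex looks rows up by label; "d" does not exist, so it becomes NaN
# and the integer columns are upcast to float.
r = t.reindex(list("abcd"))
print(r)

# Optionally supply a fill value instead of NaN.
r0 = t.reindex(list("abcd"), fill_value=0)
print(r0)
```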

Use a column as the index: df.set_index("Country", drop=False)

In [27]: t.set_index("W",drop=False)
Out[27]:
   W  X   Y   Z
W
0  0  1   2   3
4  4  5   6   7
8  8  9  10  11

Get the unique values of the index: df.set_index("Country").index.unique()

In [42]: t=pd.DataFrame(np.ones(12).reshape((3,4)),index=list(string.ascii_uppercase[:3]),columns=list(string.ascii_uppercase[-4:]))

In [43]: t
Out[43]:
     W    X    Y    Z
A  1.0  1.0  1.0  1.0
B  1.0  1.0  1.0  1.0
C  1.0  1.0  1.0  1.0

In [44]: t.set_index("W").index.unique()
Out[44]: Float64Index([1.0], dtype='float64', name='W')

Suppose a is a DataFrame. What does the result look like when we set two index columns with a.set_index(["c", "d"])?

In [46]: t
Out[46]:
   W  X   Y   Z
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11

In [48]: t.set_index(["X","Y"])
Out[48]:
      W   Z
X Y
1 2   0   3
5 6   4   7
9 10  8  11

In [49]: t.set_index(["X","Y"],drop=False)
Out[49]:
      W  X   Y   Z
X Y
1 2   0  1   2   3
5 6   4  5   6   7
9 10  8  9  10  11
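A short sketch of what the two-column `set_index` produces, using the same 3x4 frame as the session above: the result carries a MultiIndex, rows are addressed by a tuple, and `reset_index` turns the levels back into columns.

```python
import numpy as np
import pandas as pd

t = pd.DataFrame(np.arange(12).reshape((3, 4)),
                 index=list("ABC"), columns=list("WXYZ"))

multi = t.set_index(["X", "Y"])

print(type(multi.index).__name__)   # MultiIndex
print(list(multi.index.names))      # ['X', 'Y']
print(multi.loc[(5, 6), "W"])       # 4

# reset_index moves the index levels back into ordinary columns.
restored = multi.reset_index()
print(restored)
```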

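What if I only want the value for the index label h?

The levels of a MultiIndex are its outer and inner layers. After swapping the levels, the inner and outer layers trade places, so we can select directly by h.

In [8]: x.swaplevel()   # swap the outer and inner levels so that d comes first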
Out[8]:
d  c
h  one    0
j  one    1
k  one    2
l  two    3
m  two    4
n  two    5
o  two    6
Name: a, dtype: int64

In [9]: x.swaplevel()["h"]
Out[9]:
c
one    0
Name: a, dtype: int64

So how does a DataFrame select values? With .loc, one level at a time, as the same x shows here:

In [11]: x.loc["one"]
Out[11]:
d
h    0
j    1
k    2
Name: a, dtype: int64

In [12]: x.loc["one"].loc["h"]
Out[12]: 0

In [13]: x.swaplevel().loc["j"]
Out[13]:
c
one    1
Name: a, dtype: int64
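The session above never shows how `x` was built. The sketch below is a guess that reproduces the printed output: a frame `a` with columns `a`, `c` and `d`, indexed by `["c", "d"]`. The frame itself is an assumption; only the resulting values match the transcript.

```python
import pandas as pd

# Assumed construction of x, chosen to match the outputs printed above.
a = pd.DataFrame({
    "a": range(7),
    "c": ["one"] * 3 + ["two"] * 4,
    "d": list("hjklmno"),
})
x = a.set_index(["c", "d"])["a"]

print(x.loc["one"])            # inner level d for c == "one"
print(x.loc[("one", "h")])     # 0
print(x.swaplevel().loc["h"])  # select by d after swapping levels

# xs selects on an inner level without swapping first.
print(x.xs("h", level="d"))
```

`xs` with `level=` is an alternative to `swaplevel` when only a single inner label is needed.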

Hands-on practice

1. Use matplotlib to show the 10 countries with the most stores

# coding=utf-8
import pandas as pd
from matplotlib import pyplot as plt

file_path = "./starbucks_store_worldwide.csv"

df = pd.read_csv(file_path)

# Use matplotlib to show the 10 countries with the most stores
# Prepare the data
data1 = df.groupby(by="Country").count()["Brand"].sort_values(ascending=False)[:10]

_x = data1.index
print(_x)
print(len(_x))
print(range(len(_x)))
_y = data1.values

# Plot
plt.figure(figsize=(20,8),dpi=80)

plt.bar(range(len(_x)),_y)

plt.xticks(range(len(_x)),_x)

plt.show()

Output figure
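The data-preparation step of this exercise can be written in a couple of equivalent ways. The sketch below uses invented rows and skips the plotting; note that `size()` counts rows, whereas `count()` counts non-null values in the chosen column.

```python
import pandas as pd

# Made-up rows standing in for the CSV.
df = pd.DataFrame({
    "Brand": ["Starbucks"] * 6,
    "Country": ["US", "US", "US", "CN", "CN", "JP"],
})

# Rows per country, largest first, keeping the top 2 for this tiny sample.
top = df.groupby(by="Country").size().nlargest(2)
print(top)

# value_counts gives the same ranking without an explicit groupby.
same = df["Country"].value_counts().head(2)
print(same)
```

For the real exercise the `2` would become `10`, and `top.index` and `top.values` would feed `plt.bar` as in the code above.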

2. Use matplotlib to show the number of stores in each Chinese city

# coding=utf-8
import pandas as pd
from matplotlib import pyplot as plt
# Use a font that can render Chinese characters
plt.rcParams['font.sans-serif'] = ['SimHei']
# Path to the data file
file_path = "./starbucks_store_worldwide.csv"
# Read the data
df = pd.read_csv(file_path)
df = df[df["Country"]=="CN"]

# Prepare the data: the 25 cities with the most stores
data1 = df.groupby(by="City").count()["Brand"].sort_values(ascending=False)[:25]

_x = data1.index
_y = data1.values

# Plot
plt.figure(figsize=(20,12),dpi=80)

# Horizontal bars, so the city names on the y axis stay readable
plt.barh(range(len(_x)),_y,height=0.3,color="orange")

plt.yticks(range(len(_x)),_x)

plt.show()

Output figure

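Now we have data on 10,000 top-ranked books worldwide. Please work out the following:

The number of books per publication year
The average rating of books per publication year

Data source: https://www.kaggle.com/zygmunt/goodbooks-10k

# coding=utf-8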
import pandas as pd
from matplotlib import pyplot as plt
# Path to the data file
file_path="./books.csv"
# Read the data
df=pd.read_csv(file_path)
#print(df.head(2))
# Average rating of books per publication year (rows without a year are dropped)
data1=df[pd.notnull(df["original_publication_year"])]
grouped = data1["average_rating"].groupby(by=data1["original_publication_year"]).mean()
#print(grouped)
_x = grouped.index
_y = grouped.values

# Plot
plt.figure(figsize=(20,8),dpi=80)
plt.plot(range(len(_x)),_y)
print(len(_x))
plt.xticks(list(range(len(_x)))[::10],_x[::10].astype(int),rotation=45)
plt.show()

Output figure
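The code above answers only the second question. A possible way to get both the count and the mean per year in one pass is `agg`, sketched here on invented rows; only the column names `original_publication_year` and `average_rating` come from the code above.

```python
import pandas as pd

# Made-up rows; one has a missing publication year.
books = pd.DataFrame({
    "original_publication_year": [1999.0, 1999.0, 2005.0, None, 2005.0, 2005.0],
    "average_rating": [4.0, 3.0, 4.5, 3.9, 3.5, 4.0],
})

# Drop rows without a year, as the article does with pd.notnull.
valid = books.dropna(subset=["original_publication_year"])

# One groupby, two aggregations: books per year and mean rating per year.
per_year = (
    valid.groupby(by="original_publication_year")["average_rating"]
    .agg(["count", "mean"])
)
print(per_year)
```

The `count` column answers the first question and the `mean` column the second; either one can be plotted against `per_year.index`.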

Summary

Original: https://blog.csdn.net/Colorfully_lu/article/details/121444725
Author: Colorfully_lu
Title: Data Science: Grouping and Aggregation in pandas
