Data Analysis: Working with Tables

August 9, 2023, 3:50 AM • Python • 49 views

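import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load Kobe Bryant's shot log, using shot_id as the index
kobe_df = pd.read_csv('data/Kobe_data.csv', index_col='shot_id')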


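# Kobe's most frequently used shot (action type combined with shot type)
ser = kobe_df.action_type + '-' + kobe_df.combined_shot_type
ser.value_counts().index[0]


# The opponent Kobe played against most often (one row per game)
kobe_df.drop_duplicates('game_id').opponent.value_counts().index[0]


# Total points from made shots: treat missing flags as misses,
# then sum the point value embedded in shot_type (e.g. '2PT Field Goal')
kobe_df['shot_made_flag'] = kobe_df.shot_made_flag.fillna(0).map(int)
temp = kobe_df[kobe_df.shot_made_flag == 1].shot_type

temp.str.extract(r'(\d+)')[0].map(int).sum()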
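The snippets above depend on `data/Kobe_data.csv`, which is not reproduced here. As a minimal sketch, assuming a tiny made-up table (`demo_df` is hypothetical) with the same two column names, the scoring computation can be checked like this:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for Kobe_data.csv (values are made up).
demo_df = pd.DataFrame({
    'shot_made_flag': [1.0, 0.0, np.nan, 1.0, 1.0],
    'shot_type': ['2PT Field Goal', '3PT Field Goal', '2PT Field Goal',
                  '3PT Field Goal', '2PT Field Goal'],
})

# Treat missing flags as misses, then keep only the made shots.
demo_df['shot_made_flag'] = demo_df.shot_made_flag.fillna(0).astype(int)
made = demo_df[demo_df.shot_made_flag == 1].shot_type

# Pull the leading digits ("2" or "3") out of the shot type and add them up.
total_points = made.str.extract(r'(\d+)')[0].astype(int).sum()
print(total_points)
```

Three shots are made here (2 + 3 + 2), so the script prints 7.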

import pymysql

conn = pymysql.connect(host='47.104.31.138', port=3306,
                       user='guest', password='Guest.618',
                       database='hrs', charset='utf8mb4')
dept_df = pd.read_sql('select dno, dname, dloc from tb_dept', conn)
dept_df

emp_df = pd.read_sql(
    sql='select eno, ename, job, mgr, sal, comm, dno from tb_emp',
    con=conn,
)


pd.merge(emp_df, dept_df, how='inner', on='dno').set_index('eno')
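The merge above needs a live database connection. A minimal sketch with made-up rows (`emp_demo` and `dept_demo` are hypothetical stand-ins for `tb_emp` and `tb_dept`) shows what `how='inner'` keeps and what a left join would keep instead:

```python
import pandas as pd

# Hypothetical stand-ins for tb_emp and tb_dept (made-up rows).
emp_demo = pd.DataFrame({
    'eno': [1001, 1002, 1003],
    'ename': ['A', 'B', 'C'],
    'dno': [10, 20, 40],   # department 40 has no match in dept_demo
})
dept_demo = pd.DataFrame({
    'dno': [10, 20, 30],
    'dname': ['Accounting', 'R&D', 'Sales'],
})

# Inner join: only employees whose dno exists in both tables survive.
inner = pd.merge(emp_demo, dept_demo, how='inner', on='dno').set_index('eno')

# Left join: every employee is kept; unmatched department columns become NaN.
left = pd.merge(emp_demo, dept_demo, how='left', on='dno').set_index('eno')

print(inner)
print(left)
```

With an inner join, employee 1003 silently disappears, so it is worth comparing row counts before and after a merge.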

emp_df[(emp_df.dno == 20) & (emp_df.sal >= 5000)]

emp_df.query('dno == 20 and sal >= 5000')


emp_df.drop(index=emp_df[emp_df.dno != 20].index, inplace=True)

emp_df.drop(columns=['mgr', 'dno'], inplace=True)


emp_df.rename(columns={'sal': 'salary', 'comm': 'allowance'}, inplace=True)


emp_df.reset_index(inplace=True)

emp_df.set_index('ename', inplace=True)
emp_df

emp_df.reindex(columns=['salary', 'job'])

emp_df.reindex(index=['李莫愁', '张三丰', '乔峰'])

import itertools

names = ['高新', '犀浦', '新津']
years = ['2018', '2019']
groups = ['A', 'B']

for name, year, group in itertools.product(names, years, groups):
    print(name, year, group)

import itertools

names = ['高新', '犀浦', '新津']
years = ['2018', '2019']
dfs = [pd.read_excel(f'data/小宝剑大药房（{name}店）{year}年销售数据.xlsx', header=1)
       for name, year in itertools.product(names, years)]

pd.concat(dfs, ignore_index=True).to_excel('小宝剑大药房2018-2019年汇总数据.xlsx')
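`itertools.product` yields every (store, year) pair, which is what drives the six file reads above. A minimal sketch, assuming small in-memory frames in place of the Excel files (the `store`, `year` and `sales` columns are made up):

```python
import itertools

import pandas as pd

names = ['高新', '犀浦', '新津']
years = ['2018', '2019']

# One tiny made-up frame per (store, year) pair, standing in for read_excel.
dfs = [
    pd.DataFrame({'store': [name, name], 'year': [year, year], 'sales': [1, 2]})
    for name, year in itertools.product(names, years)
]

# ignore_index=True discards each frame's own 0..1 index and renumbers the rows.
merged = pd.concat(dfs, ignore_index=True)
print(merged.shape)
```

Without `ignore_index=True` the combined frame would repeat the index labels 0 and 1 six times each.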

emp_df.isnull()

youtube_df.tail(10)
youtube_df.head(10)

youtube_df.duplicated('video_id')


youtube_df = youtube_df.drop_duplicates('video_id', keep='first')
youtube_df

emp_df.replace('程序员', '程序猿', inplace=True)

Get value at specified row/column pair
>>> df.at[4, 'B']
2

Set value at specified row/column pair

>>> df.at[4, 'B'] = 10
>>> df.at[4, 'B']
10

emp_df.at[2, 'job'] = '程序媛'


emp_df.replace(regex='程序[猿媛]', value='程序员')


emp_df['job'] = emp_df.job.str.replace('程序[猿媛]', '程序员', regex=True)
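The two regex replacements above differ in scope: `DataFrame.replace` scans every column, while `Series.str.replace` touches only one. A minimal sketch on a made-up frame (`jobs` is hypothetical):

```python
import pandas as pd

# Made-up data: two variant spellings plus one unrelated job title.
jobs = pd.DataFrame({
    'ename': ['A', 'B', 'C'],
    'job': ['程序猿', '程序媛', '分析师'],
})

# Frame-wide replacement: returns a new frame, the original is untouched.
by_frame = jobs.replace(regex='程序[猿媛]', value='程序员')

# Column-only replacement through the .str accessor.
by_column = jobs.job.str.replace('程序[猿媛]', '程序员', regex=True)

print(by_frame.job.tolist())
print(by_column.tolist())
```

Neither call modifies `jobs` unless the result is assigned back or `inplace=True` is passed to `replace`.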

temp = pd.DataFrame({'A': range(3), 'B': range(1, 4)})

temp.apply(np.sqrt)

temp.apply(np.sum)

temp.apply(np.sum, axis=1)

temp.transform(np.sqrt)

temp.transform([np.exp, np.sqrt])
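As a rough illustration of the difference between the calls above: `apply` with a reducing function collapses an axis, while `transform` always returns something with the same number of rows. The variable names below are only for this sketch:

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({'A': range(3), 'B': range(1, 4)})

col_sums = temp.apply(np.sum)             # one value per column
row_sums = temp.apply(np.sum, axis=1)     # one value per row
roots = temp.transform(np.sqrt)           # element-wise, same shape as temp
both = temp.transform([np.exp, np.sqrt])  # one sub-column per function

print(col_sums.tolist(), row_sums.tolist())
print(roots.shape, both.shape)
```

Passing a list of functions to `transform` produces two-level column labels, one sub-column per function under each original column.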


student_df['stu_sex'] = student_df.stu_sex.transform(lambda x: '男' if x == 1 else '女')
student_df


temp = youtube_df.sort_values(by=['likes', 'views'], ascending=[False, True])

youtube_df.drop_duplicates('video_id', keep='last').nlargest(10, 'likes')

youtube_df['hot_value'] = youtube_df.views + youtube_df.likes + youtube_df.dislikes + youtube_df.comment_count


youtube_df.groupby(by='channel_title').hot_value.sum()

def ptp(g):
    return g.max() - g.min()

temp = youtube_df.groupby(by='channel_title')
temp[['hot_value', 'likes']].agg(['sum', 'max', 'min', ptp])

student_df.stu_sex.value_counts()

student_df.groupby('stu_sex').count()

temp = student_df.groupby(by=['col_id', 'stu_sex']).count()

Group by multiple categories

temp.loc[(1, '男')]


student_df.pivot_table(index='stu_sex', values='stu_id', aggfunc='count')


student_df.pivot_table(index=['col_id', 'stu_sex'], values='stu_id', aggfunc='count')

temp = student_df.pivot_table(
    index='col_id',
    columns='stu_sex',
    values='stu_id',
    aggfunc='count',
    fill_value=0
)
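The student table itself is not shown in this post. A minimal sketch, assuming a made-up table (`student_demo` is hypothetical) with the same three column names, shows how the pivot table relates to the two-key `groupby`:

```python
import pandas as pd

# Made-up student records using the same column names as above.
student_demo = pd.DataFrame({
    'stu_id': [1, 2, 3, 4, 5],
    'stu_sex': ['男', '女', '男', '男', '女'],
    'col_id': [1, 1, 2, 2, 3],
})

# Long form: one row per (college, sex) pair that actually occurs.
grouped = student_demo.groupby(['col_id', 'stu_sex']).stu_id.count()

# Wide form: colleges as rows, sexes as columns, missing pairs filled with 0.
counts = student_demo.pivot_table(
    index='col_id',
    columns='stu_sex',
    values='stu_id',
    aggfunc='count',
    fill_value=0
)

print(grouped)
print(counts)
```

The pivot table holds the same counts as the grouped Series, with `fill_value=0` covering combinations that never occur.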

df1 = pd.DataFrame({
    "类别": ["手机", "手机", "手机", "手机", "手机", "电脑", "电脑", "电脑", "电脑"],
    "品牌": ["华为", "华为", "华为", "小米", "小米", "华为", "华为", "小米", "小米"],
    "等级": ["A类", "B类", "A类", "B类", "C类", "A类", "B类", "C类", "A类"],
    "A组": [1, 2, 2, 3, 3, 4, 5, 6, 7],
    "B组": [2, 4, 5, 5, 6, 6, 8, 9, 9]
})


df1.pivot_table(index='类别', values='A组', aggfunc=np.sum)


df1.pivot_table(index='类别', columns='品牌', values='A组', aggfunc=np.sum)

df2 = pd.DataFrame({
    '类别': ['水果', '水果', '水果', '蔬菜', '蔬菜', '肉类', '肉类'],
    '产地': ['美国', '中国', '中国', '中国', '新西兰', '新西兰', '美国'],
    '名称': ['苹果', '梨', '草莓', '番茄', '黄瓜', '羊肉', '牛肉'],
    '数量': [5, 5, 9, 3, 2, 10, 8],
    '价格': [5.8, 5.2, 10.8, 3.5, 3.0, 13.1, 20.5]
})


pd.crosstab(
    index=df2['类别'],
    columns=df2['产地'],
    values=df2['数量'],
    aggfunc=np.sum,
    margins=True,
    margins_name='总计'
).fillna(0).astype(int)

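`np.random.normal(μ, σ, size)` produces an array whose elements are drawn from a normal distribution with mean μ and standard deviation σ (the second argument is the standard deviation, not the variance); `size` can be a count or an (n, m) shape.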

heights = np.round(np.random.normal(172, 8, 500), 1)
temp_df = pd.DataFrame(data=heights, index=np.arange(1001, 1501), columns=['身高'])


bins = [0, 150, 160, 170, 180, 190, 200, np.inf]

cate = pd.cut(temp_df['身高'], bins, right=False)
result = temp_df.groupby(cate).count()
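Because the heights above are random, the bin counts change on every run. A minimal sketch with fixed, made-up values shows where the boundary values land when `right=False` is used:

```python
import numpy as np
import pandas as pd

# Fixed made-up heights, chosen to sit on or next to bin edges.
heights = pd.Series([149.9, 150.0, 159.9, 160.0, 172.3, 200.0])
bins = [0, 150, 160, 170, 180, 190, 200, np.inf]

# right=False makes every bin closed on the left and open on the right: [a, b)
cate = pd.cut(heights, bins, right=False)

# observed=False keeps empty bins in the result with a count of 0.
counts = heights.groupby(cate, observed=False).count()
print(counts)
```

With left-closed bins, 150.0 falls into [150, 160) and 200.0 falls into the open-ended last bin.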

luohu_df = pd.read_csv('data/2018年北京积分落户数据.csv', index_col='id')

pd.to_datetime(luohu_df.birthday)


from datetime import datetime

ref_date = datetime(2018, 7, 1)
ser = ref_date - pd.to_datetime(luohu_df.birthday)
luohu_df['age'] = ser.dt.days // 365
luohu_df
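Dividing the day count by 365 is an approximation that ignores leap days. A minimal sketch with made-up birthdays (the `birthdays` Series is hypothetical) shows the calculation and one case where it is off by a year:

```python
from datetime import datetime

import pandas as pd

# Made-up birthdays; the last person turns 30 one day AFTER the reference date.
birthdays = pd.Series(['1980-07-01', '1975-12-31', '1988-07-02'])

ref_date = datetime(2018, 7, 1)
delta = ref_date - pd.to_datetime(birthdays)

# Whole "years" of 365 days each; leap days make this slightly generous.
age = delta.dt.days // 365
print(age.tolist())
```

The third person is still 29 on the reference date, but the accumulated leap days push the quotient to 30. For a rough age histogram this is usually acceptable.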


temp = luohu_df.company.value_counts()
temp[temp > 20]


bins = np.arange(30, 61, 5)
cate = pd.cut(luohu_df.age, bins)
temp = luohu_df.groupby(cate).name.count()
temp.plot(kind='bar')
for i in range(temp.size):
    # Use positional access; the index holds intervals, not integers
    plt.text(i, temp.iloc[i], temp.iloc[i], ha='center')
plt.xticks(
    np.arange(temp.size),
    labels=[f'{index.left}~{index.right} years' for index in temp.index],
    rotation=30
)
plt.show()


bins = np.arange(90, 126, 5)
cate = pd.cut(luohu_df.score, bins, right=False)
luohu_df.groupby(cate).name.count()

lagou_df = pd.read_csv(
    'data/lagou.csv',
    index_col='no',
    usecols=['no', 'city', 'companyFullName', 'positionName', 'industryField', 'salary']
)
lagou_df.shape


lagou_df = lagou_df[lagou_df.positionName.str.contains('数据分析')]
lagou_df.tail()


temp = lagou_df.salary.str.extract(r'(\d+)[kK]?-(\d+)[kK]?').astype(int)
temp

lagou_df.loc[:, 'salary'] = temp.mean(axis=1)
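The salary column holds ranges as text, and the step above turns each range into its midpoint. A minimal sketch on made-up strings (assuming the `15k-30k` style that the regular expression targets):

```python
import pandas as pd

# Made-up salary strings in the "<low>k-<high>k" style.
salary = pd.Series(['15k-30k', '8K-10K', '20k-40k'])

# Two capture groups give a two-column frame: lower and upper bound.
bounds = salary.str.extract(r'(\d+)[kK]?-(\d+)[kK]?').astype(int)

# The row-wise mean is the midpoint of each range, in thousands.
midpoint = bounds.mean(axis=1)
print(midpoint.tolist())
```

Rows that do not match the pattern would come back as NaN and make the integer conversion fail, so it is worth checking for them first on real data.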


lagou_df.city.value_counts()

ser = lagou_df.groupby('city').companyFullName.count()
ser.plot(figsize=(10, 4), kind='bar', color=['r', 'g', 'b', 'y'], width=0.8)

plt.grid(True, alpha=0.25, linestyle=':', axis='y')

plt.xticks(rotation=0)

plt.yticks(np.arange(0, 501, 50))

plt.xlabel('')

plt.title('Number of positions by city')
for i in range(ser.size):
    # Use positional access; the index holds city names, not integers
    plt.text(i, ser.iloc[i], ser.iloc[i], ha='center')
plt.show()


temp = lagou_df.industryField.str.split(pat='[丨,]', expand=True)
lagou_df['modifiedIndustryField'] = temp[0]
ser = lagou_df.groupby('modifiedIndustryField').companyFullName.count()
explodes = [0, 0.15, 0, 0, 0.05, 0, 0, 0, 0, 0]
ser.nlargest(10).plot(
    figsize=(6, 6),
    kind='pie',
    autopct='%.2f%%',
    pctdistance=0.75,
    shadow=True,
    explode=explodes,
    wedgeprops={
        'edgecolor': 'white',
        'width': 0.5
    }
)
plt.ylabel('')
plt.show()
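Before counting, the industry field is split on either separator and only the first part is kept. A minimal sketch on made-up strings (the values in `field` are hypothetical):

```python
import pandas as pd

# Made-up industry strings using both separators, plus one with none.
field = pd.Series(['移动互联网,金融', '电商丨企业服务', '数据服务'])

# The multi-character pattern is treated as a regular expression:
# split on either '丨' or ',' and spread the parts into columns.
parts = field.str.split(pat='[丨,]', expand=True)

# Column 0 is the primary industry used for grouping.
primary = parts[0]
print(primary.tolist())
```

Values without a separator pass through unchanged, and their second column is empty.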


ser = np.round(lagou_df.groupby('city').salary.mean(), 2)
ser.plot(figsize=(8, 4), kind='bar')
ser.plot(kind='line', color='red', marker='o', linestyle='--')
plt.show()

Extract the order data for 2019


from datetime import datetime

start = datetime(2019, 1, 1)
end = datetime(2019, 12, 31, 23, 59, 59)
order_df.drop(index=order_df[order_df.orderTime < start].index, inplace=True)
order_df.drop(index=order_df[order_df.orderTime > end].index, inplace=True)
order_df.shape

Handle records where the payment time is earlier than the order time


order_df.drop(order_df[order_df.payTime < order_df.orderTime].index, inplace=True)
order_df.shape

Processing the discount field


order_df['discount'] = np.round(order_df.payment / order_df.orderAmount, 4)
mean_discount = np.mean(order_df[order_df.discount <= 1].discount)
order_df['discount'] = order_df.discount.apply(lambda x: x if x <= 1 else mean_discount)
order_df['payment'] = order_df.orderAmount * order_df.discount
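The idea in this step is that a discount above 1 (payment larger than the order amount) is treated as bad data and replaced by the average of the valid discounts. A minimal sketch on made-up orders, assuming the cutoff is `<= 1` and using `Series.where` in place of `apply`:

```python
import numpy as np
import pandas as pd

# Made-up orders; the third row pays more than the order amount.
orders = pd.DataFrame({
    'orderAmount': [100.0, 200.0, 50.0, 80.0],
    'payment': [90.0, 100.0, 60.0, 80.0],
})

orders['discount'] = np.round(orders.payment / orders.orderAmount, 4)

# Average over the rows considered valid (assumed cutoff: discount <= 1).
valid = orders.discount <= 1
mean_discount = orders.loc[valid, 'discount'].mean()

# Keep valid discounts, substitute the mean elsewhere, then recompute payment.
orders['discount'] = orders.discount.where(valid, mean_discount)
orders['payment'] = orders.orderAmount * orders.discount
print(orders)
```

`where` keeps values where the condition holds and substitutes elsewhere, which avoids a Python-level lambda for every row.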

Display the overall analysis

# '否' ("no") and '是' ("yes") are the values stored in the chargeback column
print(f'GMV: {order_df.orderAmount.sum() / 10000:.4f} (10,000 yuan)')
print(f'Total sales: {order_df.payment.sum() / 10000:.4f} (10,000 yuan)')
real_total = order_df[order_df.chargeback == "否"].payment.sum()
print(f'Actual sales: {real_total / 10000:.4f} (10,000 yuan)')
back_rate = order_df[order_df.chargeback == '是'].orderID.size / order_df.orderID.size
print(f'Return rate: {back_rate * 100:.2f}%')
print(f'Average spend per customer: {real_total / order_df.userID.nunique():.2f} yuan')
print(order_df[order_df.chargeback == '是'].orderID.size)
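The order table is not included in this post. A minimal sketch on four made-up orders (all values in `orders` are hypothetical) shows how each headline figure is derived:

```python
import pandas as pd

# Made-up orders; '是' marks a returned order, '否' a kept one.
orders = pd.DataFrame({
    'orderID': ['o1', 'o2', 'o3', 'o4'],
    'userID': ['u1', 'u1', 'u2', 'u3'],
    'orderAmount': [100.0, 200.0, 300.0, 400.0],
    'payment': [90.0, 180.0, 300.0, 400.0],
    'chargeback': ['否', '否', '是', '否'],
})

gmv = orders.orderAmount.sum()                                # all order amounts
total_sales = orders.payment.sum()                            # all payments
real_total = orders[orders.chargeback == '否'].payment.sum()  # payments not returned
back_rate = (orders.chargeback == '是').mean()                # share of returned orders
per_customer = real_total / orders.userID.nunique()           # actual sales per distinct user

print(gmv, total_sales, real_total, back_rate, round(per_customer, 2))
```

Taking the mean of a boolean Series gives the same ratio as dividing the two `.size` values.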


Original: https://blog.csdn.net/niki__/article/details/121624646
Author: niki__
Title: 数据分析_表和表的运用

