Pandas操作题
1.学生数据分析
1.数据导入并展示, 文件位置位于../data/下
- 请将文件student-info.csv中的内容读入pandas DataFrame “df_info”,注意该文件的分隔符是分号,并展示前五行
- 请将文件student-score.csv中的内容读入pandas DataFrame “df_score”,注意该文件的分隔符是分号,并展示前五行
In [1]:
import pandas as pd
df_info = pd.read_csv("C:/Users/PycharmProjects/pythonProject/ML/Projects/课程/资料/homework/data/student-info.csv",sep = ';')
df_info.head(5)
Out[1]:
IDschoolsexageaddressfamsizePstatusMeduFeduMjob…higherinternetromanticfamrelfreetimegooutDalcWalchealthabsences00GPF18UGT3A44at_home…yesnono434113611GPF17UGT3T11at_home…yesyesno533113422GPF15ULE3T11at_home…yesyesno4322331033GPF15UGT3T42health…yesyesyes322115244GPF16UGT3T33other…yesnono4321254
5 rows × 31 columns
In [3]:
import pandas as pd
df_score = pd.read_csv("C:/Users/PycharmProjects/pythonProject/ML/Projects/课程/资料/homework/data/student-score.csv",sep = ';')
df_score.head(5)
Out[3]:
IDG1G2G30056611556227810331514154461010
2.把两个DataFrame “df_info” 和 “df_score” 按照学生ID对齐拼接起来,结果存在一个新DataFrame “df” 中
In [55]:
df = pd.concat([df_info,df_score], axis=1)
df_info.merge(df_score)
df = pd.merge(df_info,df_score,on=['ID'])
3.把”df”所有的列展示出来
In [56]:
df
Out[56]:
IDschoolsexageaddressfamsizePstatusMeduFeduMjob…famrelfreetimegooutDalcWalchealthabsencesG1G2G300GPF18UGT3A44at_home…434113656611GPF17UGT3T11at_home…533113455622GPF15ULE3T11at_home…43223310781033GPF15UGT3T42health…322115215141544GPF16UGT3T33other…432125461010…………………………………………………………390390MSM20ULE3A22services…55445411999391391MSM17ULE3T31services…2453423141616392392MSM21RGT3T11other…55333331087393393MSM18RLE3T32services…4413450111210394394MSM19ULE3T11other…3233355899
395 rows × 34 columns
In [16]:
df.columns
Out[16]:
Index(['ID', 'school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu',
'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
'Walc', 'health', 'absences', 'ID', 'G1', 'G2', 'G3'],
dtype='object')
4.统计男生的平均分和女生的平均分(G1, G2, G3),使用groupby和aggregate来操作
In [59]:
group_by_sex = df.groupby('sex').agg('mean')
group_by_sex[['G1','G2','G3']]
#group_by_sex = df.groupby('sex').mean()
#group_by_sex[['G1','G2','G3']]
Out[59]:
G1G2G3sex F10.62019210.3894239.966346M11.22994711.07486610.914439
In [60]:
group_by_agg = group_by_sex.aggregate('mean')
group_by_agg[['G1','G2','G3']]
Out[60]:
G1 10.925069
G2 10.732145
G3 10.440392
dtype: float64
5.统计不同年龄的学生(G1, G2, G3)的平均分,并作出柱状图展示
In [70]:
group_by_age = df.groupby('age').mean()
#group_by_age[['G1','G2','G3']]
group_by_age[['G1','G2','G3']].plot(kind='bar')
Out[70]:
6.统计不同学校(school一列)的学生的平均分,并作柱状图展示,要求同一科目的两个学校成绩贴在一起展示
In [88]:
group_by_school=df.groupby('school').agg('mean').transpose()
group_by_school.plot(kind='bar')
#roup_by_school.T.plot(kind='bar')
Out[88]:
7.统计一下每个学校三个科目总分的前十名,然后把前十名的同学所有信息输出
In [201]:
#group_by_school=df.groupby(['school']).agg('sum').transpose()
group_by_school['G1','G2','G3']
df["total"] = df["G1"] + df["G2"]+ df["G3"]
df.groupb
6 # df.groupby(['school','total'])
7
TypeError: 'method' object is not subscriptable
2.沃尔玛销售数据整理
-
数据导入并展示, 文件位置位于 ../data下
-
请将文件 stores.csv 中的内容读入pandas DataFrame “df_stores”,并展示前五行以及打印df_stores的shape
In [211]:
import pandas as pd
df_stores = pd.read_csv(r"C:\Users\PycharmProjects\pythonProject\ML\Projects\课程\资料\homework\data\walmart\walmart\stores.csv")
df_stores.head(5)
Out[211]:
StoreTypeSize01A15131512A20230723B3739234A20586345B34875
In [212]:
df_stores.shape
Out[212]:
(45, 3)
- 请将文件 features.csv 中的内容读入pandas DataFrame “df_feats”,将df_feats改成只留下以下几个columns: “Store”, “Date”, “Temperature”, “Fuel_Price”, “CPI”, “Unemployment”, 并展示前五行以及打印df_feats的shape
In [215]:
df_feats = pd.read_csv(r"C:\Users\PycharmProjects\pythonProject\ML\Projects\课程\资料\homework\data\walmart\walmart\features.csv")
df_feats.loc[:,["Store", "Date", "Temperature", "Fuel_Price", "CPI", "Unemployment"]].head()
Out[215]:
StoreDateTemperatureFuel_PriceCPIUnemployment012010-02-0542.312.572211.0963588.106112010-02-1238.512.548211.2421708.106212010-02-1939.932.514211.2891438.106312010-02-2646.632.561211.3196438.106412010-03-0546.502.625211.3501438.106
In [217]:
df_feats.shape
Out[217]:
(8190, 12)
- 请将文件 sales.csv 中的内容读入pandas DataFrame “df_sales”,并展示前五行以及打印df_sales的shape
In [219]:
df_sales = pd.read_csv(r"C:\Users\PycharmProjects\pythonProject\ML\Projects\课程\资料\homework\data\walmart\walmart\sales.csv")
df_sales.head(5)
Out[219]:
StoreDeptDateWeekly_SalesIsHoliday0112010-02-0524924.50False1112010-02-1246039.49True2112010-02-1941595.55False3112010-02-2619403.54False4112010-03-0521827.90False
In [220]:
df_sales.shape
Out[220]:
(421570, 5)
- 我们发现 df_sales 中每个Store有很多个Dept(department), 我们只关心整家店的销售情况,所以请把这张表格按照Store和Date来计算整店销售总额,IsHoliday这一列不必保留。结果保存在df_sales_by_store中
In [257]:
df_sales_by_store=df_sales.loc[:,["Store", "Dept", "Date", "Weekly_Sales"]].groupby(['Store','Date']).sum()
df_sales_by_store
Out[257]:
DeptWeekly_SalesStoreDate 12010-02-0531841643690.902010-02-1231371641957.442010-02-1931841611968.172010-02-2631371409727.592010-03-0532151554806.68…………452012-09-282853713173.952012-10-052952733455.072012-10-122853734464.362012-10-192853718125.532012-10-262907760281.43
6435 rows × 2 columns
- 使用 Store 和 Date 这两列合并 df_sales_by_store 和 df_feats 两张表格,结果保存在 df_sale_feats 。
In [261]:
df_sale_feats = pd.merge(df_sales_by_store,df_feats,on=['Store','Date'])
df_sale_feats
Out[261]:
StoreDateDeptWeekly_SalesTemperatureFuel_PriceMarkDown1MarkDown2MarkDown3MarkDown4MarkDown5CPIUnemploymentIsHoliday012010-02-0531841643690.9042.312.572NaNNaNNaNNaNNaN211.0963588.106False112010-02-1231371641957.4438.512.548NaNNaNNaNNaNNaN211.2421708.106True212010-02-1931841611968.1739.932.514NaNNaNNaNNaNNaN211.2891438.106False312010-02-2631371409727.5946.632.561NaNNaNNaNNaNNaN211.3196438.106False412010-03-0532151554806.6846.502.625NaNNaNNaNNaNNaN211.3501438.106False………………………………………6430452012-09-282853713173.9564.883.9974556.6120.641.501601.013288.25192.0135588.684False6431452012-10-052952733455.0764.893.9855046.74NaN18.822253.432340.01192.1704128.667False6432452012-10-122853734464.3654.474.0001956.28NaN7.89599.323990.54192.3272658.667False6433452012-10-192853718125.5356.473.9692004.02NaN3.18437.731537.49192.3308548.667False6434452012-10-262907760281.4358.853.8824018.9158.08100.00211.94858.33192.3088998.667False
6435 rows × 14 columns
- 使用 Store 这一列合并 df_sale_feats 和 df_stores 两张表格,结果仍然保留在df_sale_feats 中。
In [266]:
df_sale_feats = pd.merge(df_sale_feats,df_stores,on=['Store'])
df_sale_feats
Out[266]:
StoreDateDeptWeekly_SalesTemperatureFuel_PriceMarkDown1MarkDown2MarkDown3MarkDown4MarkDown5CPIUnemploymentIsHolidayTypeSize012010-02-0531841643690.9042.312.572NaNNaNNaNNaNNaN211.0963588.106FalseA151315112010-02-1231371641957.4438.512.548NaNNaNNaNNaNNaN211.2421708.106TrueA151315212010-02-1931841611968.1739.932.514NaNNaNNaNNaNNaN211.2891438.106FalseA151315312010-02-2631371409727.5946.632.561NaNNaNNaNNaNNaN211.3196438.106FalseA151315412010-03-0532151554806.6846.502.625NaNNaNNaNNaNNaN211.3501438.106FalseA151315……………………………………………6430452012-09-282853713173.9564.883.9974556.6120.641.501601.013288.25192.0135588.684FalseB1182216431452012-10-052952733455.0764.893.9855046.74NaN18.822253.432340.01192.1704128.667FalseB1182216432452012-10-122853734464.3654.474.0001956.28NaN7.89599.323990.54192.3272658.667FalseB1182216433452012-10-192853718125.5356.473.9692004.02NaN3.18437.731537.49192.3308548.667FalseB1182216434452012-10-262907760281.4358.853.8824018.9158.08100.00211.94858.33192.3088998.667FalseB118221
6435 rows × 16 columns
In [ ]:
Original: https://blog.csdn.net/u011868279/article/details/115420604
Author: 梦想家DBA
Title: Pandas操作题
原创文章受到原创版权保护。转载请注明出处:https://www.johngo689.com/753310/
转载文章受原作者版权保护。转载请注明原作者出处!