1 Introduction to pandas
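- Python has long been strong at data munging and preparation, but weaker at data analysis and modeling. pandas helps fill this gap, letting you carry out an entire data analysis workflow in Python without switching to a more domain-specific language such as R.
- Combined with the Jupyter toolkit and other libraries, the Python environment for data analysis does well on performance, productivity, and collaboration. pandas is Python's core data analysis library: it provides fast, flexible, and expressive data structures designed to make working with relational or labeled data simple and intuitive. It is an essential high-level tool for data analysis in Python.
- The main pandas data structures are Series (one-dimensional) and DataFrame (two-dimensional); together they cover most use cases in finance, statistics, social science, engineering, and other fields.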
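- Data processing usually falls into several stages:
(1) data wrangling and cleaning
(2) data analysis and modeling
(3) data visualization and tabulation
pandas is an ideal tool for all of these stages.
- Installation (once Python is installed, the command below is enough; using Anaconda to create the Python environment is recommended):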
pip install pandas -i https://pypi.tuna.tsinghua.edu.cn/simple
2 Data structures
2.1 Series
When a Series is built from a list, pandas generates an integer index by default; you can also specify the index explicitly.
import numpy as np
import pandas as pd
l = [0,1,7,9,np.nan,None,1024,512]  # np.nan: the np.NAN alias was removed in NumPy 2.0
s1 = pd.Series(data=l)
s2 = pd.Series(data=l,index=list('abcdefhi'),dtype='float32')
s3 = pd.Series(data={'a':99,'b':137,'c':149},name='python_score')
display(s1,s2,s3)
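To make the default-versus-explicit index behavior concrete, here is a minimal sketch. The values and the labels "a", "b", "c" are made up for illustration and are not from the original example.

```python
import pandas as pd

values = [10, 20, 30]

# No index given: pandas generates the integer labels 0, 1, 2.
default_idx = pd.Series(values)

# Explicit labels: elements can then be looked up by label.
labeled = pd.Series(values, index=["a", "b", "c"])

print(list(default_idx.index))
print(labeled["b"])
```

The data is identical in both cases; only the labels used to address it differ.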
2.2 DataFrame
A DataFrame is a two-dimensional labeled data structure whose columns can have different types, similar to an Excel sheet, a SQL table, or a dict of Series objects.
import numpy as np
import pandas as pd
df1 = pd.DataFrame(data={'python':[99,107,122],'math':[111,137,88],'En':[68,108,43]},
index=['张三','李四','jayden'])
df2 = pd.DataFrame(data=np.random.randint(0,151,size=(5,3)),
index = ['Danial','Brandon','softpo','Ella','Cindy'],
columns=['Python','Math','En'])
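The "dict of Series" view can be illustrated with a small sketch. The two Series and their labels below are hypothetical; the point is only that columns are aligned on row labels when the DataFrame is built.

```python
import pandas as pd

python = pd.Series([99, 107], index=["x", "y"])
math = pd.Series([111, 88], index=["y", "z"])

# Each Series becomes a column. Rows are the union of the labels,
# and positions a Series does not cover are filled with NaN.
df = pd.DataFrame({"python": python, "math": math})
print(df)
```

Because "x" only appears in `python` and "z" only in `math`, the result has three rows with one missing value in each column.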
3 Viewing data
The common DataFrame attributes, together with its overview and summary-statistics methods, give a quick look at the data.
import numpy as np
import pandas as pd
df = pd.DataFrame(data = np.random.randint(0,151,size = (150,3)),
index = None,
columns=['Python','Math','En'])
df.head(10)
df.tail(10)
df.shape
df.dtypes
df.index
df.columns
df.values
df.describe()
df.info()
4 Data input and output
4.1 Reading and writing CSV files
import numpy as np
import pandas as pd
df = pd.DataFrame(data = np.random.randint(0,50,size = [50,5]),
columns=['IT','化工','生物','教师','士兵'])
df.to_csv("./salary.csv",
sep=';',
header = True,
index = True)
pd.read_csv("./salary.csv",
sep=';',
header = [0],
index_col=0)
pd.read_table('./salary.csv',
sep = ';',
header = [0],
index_col=0)
4.2 Reading and writing Excel files
First install the libraries that handle Excel files (in the current Python environment).
pip install xlrd -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install xlwt -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install openpyxl
import numpy as np
import pandas as pd
df1 = pd.DataFrame(data=np.random.randint(0,50,size=(50,5)),
columns=('IT','化工','生物','教师','士兵'))
df2 = pd.DataFrame(data = np.random.randint(0,50,size = [150,3]),
columns=['Python','Tensorflow','Keras'])
# Write .xlsx: pandas 2.0 dropped the xlwt engine, so writing .xls no longer works
df1.to_excel("./salary.xlsx",
sheet_name='salary',
header = True,
index = False)
pd.read_excel('./salary.xlsx',
sheet_name=0,
header = 0,
names = list('ABCDE'),
index_col=1)
# One file, several sheets
with pd.ExcelWriter('./data.xlsx') as writer:
    df1.to_excel(writer,sheet_name='salary',index=False)
    df2.to_excel(writer,sheet_name='score',index=False)
pd.read_excel('./data.xlsx',
sheet_name='salary')
4.3 Reading and writing SQL
pip install sqlalchemy -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install pymysql -i https://pypi.tuna.tsinghua.edu.cn/simple
Database engine configuration: https://docs.sqlalchemy.org/en/13/core/engines.html
import numpy as np
import pandas as pd
from sqlalchemy import create_engine
df = pd.DataFrame(data=np.random.randint(0,50,size=[150,3]),
columns=['Python','Tensorflow','Keras'])
conn = create_engine('mysql+pymysql://root:root@localhost/pandas?charset=UTF8MB4')
df.to_sql('score',
conn,
if_exists='append')
pd.read_sql('select * from score limit 10',
conn,
index_col='Python')
4.4 HDF5
pip install tables -i https://pypi.tuna.tsinghua.edu.cn/simple
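- HDF5 is a technology suite for managing very large and complex data collections.
- HDF5 is also a file format that can store different types of data; files usually use the .h5 extension, and the structure is hierarchical.
- An HDF5 file can be viewed as a group that contains various datasets.
- Data storage in an HDF5 file rests on two core concepts: group and dataset.
- A dataset represents a collection of data. One file can hold many kinds of datasets, and groups are how those datasets are organized.
- The most intuitive analogy is a file system, where files live in different directories. A directory corresponds to an HDF5 group: it carries the classification of the datasets, and groups let you manage and tell apart many datasets. A file corresponds to an HDF5 dataset: it holds the actual data.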
import numpy as np
import pandas as pd
df1 = pd.DataFrame(data = np.random.randint(0,50,size = [50,5]),
columns=['IT','化工','生物','教师','士兵'])
df2 = pd.DataFrame(data = np.random.randint(0,50,size = [150,3]),
columns=['Python','Tensorflow','Keras'])
df1.to_hdf('./data.h5',key='salary')
df2.to_hdf('./data.h5',key = 'score')
pd.read_hdf('./data.h5',
key = 'salary')
5 Selecting data
5.1 Selecting columns
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randint(0,150,size=[150,3]),
columns=['Python','Tensorflow','Keras'])
df['Python']
df.Python
df[['Python','Keras']]
df[3:15]
5.2 Selecting by label
import pandas as pd
import numpy as np
df = pd.DataFrame(data = np.random.randint(0,150,size = [10,3]),
index = list('ABCDEFGHIJ'),
columns=['Python','Tensorflow','Keras'])
df.loc[['A','C','D','F']]
df.loc['A':'E',['Python','Keras']]
df.loc[:,['Keras','Tensorflow']]
df.loc['E'::2,'Python':'Tensorflow']
df.loc['A','Python']
5.3 Selecting by position
import pandas as pd
import numpy as np
df = pd.DataFrame(data = np.random.randint(0,150,size = [10,3]),
index = list('ABCDEFGHIJ'),
columns=['Python','Tensorflow','Keras'])
df.iloc[4]
df.iloc[2:8,0:2]
df.iloc[[1,3,5],[0,2,1]]
df.iloc[1:3,:]
df.iloc[:,:2]
df.iloc[0,2]
5.4 Boolean indexing
import pandas as pd
import numpy as np
df = pd.DataFrame(data = np.random.randint(0,150,size = [10,3]),
index = list('ABCDEFGHIJ'),
columns=['Python','Tensorflow','Keras'])
cond1 = df.Python > 100
df[cond1]
cond2 = (df.Python > 50) & (df['Keras'] > 50)
df[cond2]
df[df > 50]
df[df.index.isin(['A','C','F','Z'])]
5.5 Assignment
import pandas as pd
import numpy as np
df = pd.DataFrame(data = np.random.randint(0,150,size = [10,3]),
index = list('ABCDEFGHIJ'),
columns=['Python','Tensorflow','Keras'])
s = pd.Series(data = np.random.randint(0,150,size = 9),index=list('BCDEFGHIJ'),name = 'PyTorch')
df['PyTorch'] = s
df.loc['A','Python'] = 256
df.iloc[3,2] = 512
df.loc[:,'Python'] = np.array([128]*10)
df[df >= 128] = -df
6 Combining data
pandas provides several ways to combine Series and DataFrame objects.
6.1 Concatenating with concat
import pandas as pd
import numpy as np
df1 = pd.DataFrame(data = np.random.randint(0,150,size = [10,3]),
index = list('ABCDEFGHIJ'),
columns=['Python','Tensorflow','Keras'])
df2 = pd.DataFrame(data = np.random.randint(0,150,size = [10,3]),
index = list('KLMNOPQRST'),
columns=['Python','Tensorflow','Keras'])
df3 = pd.DataFrame(data = np.random.randint(0,150,size = (10,2)),
index = list('ABCDEFGHIJ'),
columns=['PyTorch','Paddle'])
pd.concat([df1,df2],axis = 0)
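# df1.append(df2) gave the same result in older versions; DataFrame.append was removed in pandas 2.0, so use pd.concat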
pd.concat([df1,df3],axis = 1)
6.2 Inserting a column into a DataFrame
import numpy as np
import pandas as pd
df = pd.DataFrame(data = np.random.randint(0,151,size = (10,3)),
index = list('ABCDEFGHIJ'),
columns = ['Python','Keras','Tensorflow'])
df.insert(loc = 1,column='Pytorch',value=1024)
df
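6.3 SQL-style joins with merge
Merge and join operations link data sets together through one or more keys. They are core operations of relational databases, and the pandas merge function is the main entry point for join operations on data sets.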
import numpy as np
import pandas as pd
df1 = pd.DataFrame(data={'name':['softpo','Daniel','Brandon','Ella'],'weight': [70,55,75,65]})
df2 = pd.DataFrame(data={'name':['softpo','Daniel','Brandon','Cindy'],'height':[172,170,170,166]})
df3 = pd.DataFrame(data = {'名字':['softpo','Daniel','Brandon','Cindy'],'height': [172,170,170,166]})
pd.merge(df1,df2, how = 'inner',
on = 'name')
pd.merge(df1,df3, how = 'outer',
left_on = 'name',
right_on = '名字')
df4 = pd.DataFrame(data = np.random.randint(0,151,size = (10,3)),
index = list('ABCDEFHIJK'),
columns=['Python','Keras','Tensorflow'])
score_mean = pd.DataFrame(df4.mean(axis = 1).round(1),columns=['平均分'])
pd.merge(left = df4,right = score_mean,
left_index=True,
right_index=True)
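The effect of the `how` argument is easiest to see on a tiny input. The sketch below uses two hypothetical frames (`left`, `right`) that share only some names; `indicator=True` is a standard `pd.merge` option that adds a `_merge` column recording where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"name": ["softpo", "Daniel", "Ella"], "weight": [70, 55, 65]})
right = pd.DataFrame({"name": ["softpo", "Daniel", "Cindy"], "height": [172, 170, 166]})

# inner: keep only keys present in both frames.
inner = pd.merge(left, right, how="inner", on="name")

# outer: keep every key from either frame; missing values become NaN.
outer = pd.merge(left, right, how="outer", on="name", indicator=True)

print(inner)
print(outer)
```

Here "Ella" appears only on the left and "Cindy" only on the right, so the inner join drops both and the outer join keeps both with a NaN in the column the other frame would have supplied.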
7 Data cleaning
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'color':['red','blue','red','green','blue',None,'red'],
'price':[10,20,10,15,20,0,np.nan]})
df.duplicated()
df.drop_duplicates()
df.isnull()
df.dropna(how = 'any')
df.fillna(value=1111)
del df['color']
df.drop(labels = ['price'],axis = 1)
df.drop(labels = [0,1,5],axis = 0)
df = pd.DataFrame(np.array(([3,7,1], [2, 8, 256])),
index=['dog', 'cat'],
columns=['China', 'America', 'France'])
df.filter(items=['China', 'France'])
df.filter(regex='a$', axis=1)
df.filter(like='og', axis=0)
df2 = pd.DataFrame(data = np.random.randn(10000,3))
cond = (df2.abs() > 3*df2.std()).any(axis = 1)  # abs() so that outliers on both tails are flagged
index = df2[cond].index
df2.drop(labels=index,axis = 0)
8 Data transformation
8.1 Renaming axes and replacing values
import numpy as np
import pandas as pd
df = pd.DataFrame(data = np.random.randint(0,10,size = (10,3)),
index = list('ABCDEFHIJK'),
columns=['Python','Tensorflow','Keras'])
df.iloc[4,2] = None
df.rename(index = {'A':'AA','B':'BB'},
columns = {'Python':'人工智能'})
df.replace(3,1024)
df.replace([0,7],2048)
df.replace({0:512,np.nan:998})
df.replace({'Python':2},-1024)
8.2 Transforming Series elements with map
import numpy as np
import pandas as pd
df = pd.DataFrame(data = np.random.randint(0,10,size = (10,3)),
index = list('ABCDEFHIJK'),
columns=['Python','Tensorflow','Keras'])
df.iloc[4,2] = None
df['Keras'].map({1:'Hello',5:'World',7:'AI'})
df['Python'].map(lambda x:True if x >=5 else False)
def convert(x):
    if x%3 == 0:
        return True
    elif x%3 == 1:
        return False
df['Tensorflow'].map(convert)
8.3 Transforming elements with apply
import numpy as np
import pandas as pd
df = pd.DataFrame(data = np.random.randint(0,10,size = (10,3)),
index = list('ABCDEFHIJK'),
columns=['Python','Tensorflow','Keras'])
df.iloc[4,2] = None
df['Keras'].apply(lambda x:True if x >5 else False)
df.apply(lambda x : x.median(),axis = 0)
def convert(x):
    return (x.mean().round(1),x.count())
df.apply(convert,axis = 1)
df.applymap(lambda x : x + 100)  # element-wise; pandas 2.1+ renames this to DataFrame.map
8.4 Transforming with transform
import numpy as np
import pandas as pd
df = pd.DataFrame(data = np.random.randint(0,10,size = (10,3)),
index = list('ABCDEFHIJK'),
columns=['Python','Tensorflow','Keras'])
df.iloc[4,2] = None
df['Python'].transform([np.sqrt,np.exp])
def convert(x):
    if x.mean() > 5:
        x *= 10
    else:
        x *= -10
    return x
df.transform({'Python':convert,'Tensorflow':np.max,'Keras':np.min})
8.5 Shuffling, random sampling, and dummy variables
import numpy as np
import pandas as pd
df = pd.DataFrame(data = np.random.randint(0,10,size = (10,3)),
index = list('ABCDEFHIJK'),
columns=['Python','Tensorflow','Keras'])
ran = np.random.permutation(10)
df.take(ran)
df.take(np.random.randint(0,10,size = 15))
df = pd.DataFrame({'key':['b','b','a','c','a','b']})
pd.get_dummies(df,prefix='',prefix_sep='')
9 Reshaping data
import numpy as np
import pandas as pd
df = pd.DataFrame(data = np.random.randint(0,100,size = (10,3)),
index = list('ABCDEFHIJK'),
columns = ['Python','Tensorflow','Keras'])
df.T
df2 = pd.DataFrame(data=np.random.randint(0,100,size=(20,3)),
index = pd.MultiIndex.from_product([list('ABCDEFHIJK'),['期中','期末']]),
columns = ['Python','Tensorflow','Keras'])
df2.unstack(level = -1)
df2.stack()
df2.stack().unstack(level = 1)
df2
df2.mean()
df2.groupby(level=0).mean()  # mean(level=...) was removed in pandas 2.0
df2.groupby(level=1).mean()
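10 Mathematical and statistical methods
pandas objects come with a set of common mathematical and statistical methods. Most are summary statistics: they reduce a Series to a single value such as the mean or max, or reduce the rows or columns of a DataFrame to a Series.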
import numpy as np
import pandas as pd
df = pd.DataFrame(data = np.random.randint(0,100,size = (20,3)),
index = list('ABCDEFHIJKLMNOPQRSTU'),
columns=['Python','Tensorflow','Keras'])
df.count()
df.max(axis = 0)
df.min()
df.median()
df.sum()
df.mean(axis = 1)
df.quantile(q = [0.2,0.4,0.8])
df.describe()
df['Python'].argmin()
df['Keras'].argmax()
df.idxmax()
df.idxmin()
df['Python'].value_counts()
df['Keras'].unique()
df.cumsum()
df.cumprod()
df.std()
df.var()
df.cummin()
df.cummax()
df.diff()
df.pct_change()
df.cov()
df['Python'].cov(df['Keras'])
df.corr()
df.corrwith(df['Tensorflow'])
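The difference between reducing a Series and reducing a DataFrame along an axis can be checked on a small, fixed input. The numbers below are made up so the results are easy to verify by hand.

```python
import pandas as pd

df = pd.DataFrame({"Python": [90, 60, 30], "Keras": [80, 100, 60]},
                  index=["A", "B", "C"])

col_mean = df.mean(axis=0)   # one value per column -> Series indexed by column name
row_mean = df.mean(axis=1)   # one value per row    -> Series indexed by row label
best = df["Keras"].max()     # reducing a single Series gives a scalar

print(col_mean)
print(row_mean)
print(best)
```

`axis=0` collapses the rows and keeps the columns, while `axis=1` collapses the columns and keeps the rows.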
11 Sorting data
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.randint(0,30,size=(30,3)),
index= list('qwertyuioijhgfcasdcvbnerfghjcf'),
columns = ['Python','Keras','Pytorch'])
df.sort_index(axis = 0 , ascending=True)
df.sort_index(axis = 1,ascending=False)
df.sort_values(by = ['Python'])
df.sort_values(by = ['Python','Keras'])
df.nlargest(10,columns='Keras')
df.nsmallest(5,columns='Python')
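12 Binning
- Binning converts continuous data into categorical counterparts, for example splitting continuous height data into short, medium, and tall.
- Binning comes in two forms: equal-width and equal-frequency.
- Binning is also called bucketing or discretization.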
import numpy as np
import pandas as pd
df = pd.DataFrame(data = np.random.randint(0,150,size = (6,3)),
columns=['Python','Tensorflow','Keras'])
pd.cut(df.Python,bins = 3)
pd.cut(df.Keras,
bins = [0,60,90,120,150],
right = False,
labels=['不及格','中等','良好','优秀'])
pd.qcut(df.Python,q = 4,
labels=['差','中','良','优'])
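Equal-width and equal-frequency binning give very different results on skewed data. The sketch below uses a hypothetical series with one large outlier to show this; it is an illustration, not part of the original example.

```python
import pandas as pd

scores = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# Equal-width: the range 1..100 is cut into 3 intervals of the same width,
# so the outlier pushes almost every value into the first interval.
equal_width = pd.cut(scores, bins=3)

# Equal-frequency: interval edges are quantiles, so each bin holds
# roughly the same number of observations.
equal_freq = pd.qcut(scores, q=3)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

With `pd.cut`, eight of the nine values share a bin and the middle bin is empty; with `pd.qcut`, each bin receives three values.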
13 Grouping and aggregation
13.1 Grouping and aggregating
import numpy as np
import pandas as pd
df = pd.DataFrame(data={
'sex':np.random.randint(0,2,size=300),
'class':np.random.randint(1,9,size=300),
'Python':np.random.randint(0,151,size = 300),
'Keras':np.random.randint(0,151,size =300),
'Tensorflow':np.random.randint(0,151,size=300),
'Java':np.random.randint(0,151,size = 300),
'C++':np.random.randint(0,151,size = 300)
})
df
df['sex'] = df['sex'].map({0:'男',1:'女'})
df
g = df.groupby(by = 'sex') [['Python','Java']]
g = df.groupby(by = ['class','sex'])[['Python']]
g = df['Python'].groupby(df['class'])
g = df[['Keras','Python']].groupby([df['class'],df['sex']])
g = df.groupby(df.dtypes,axis = 1)
for name,data in g:
    print('Group name:',name)
    print('Data:',data)
m = {'sex':'category','class':'category','Python':'IT','Keras':'IT','Tensorflow':'IT','Java':'IT','C++':'IT'}
for name,data in df.groupby(m,axis = 1):
    print('Group name:',name)
    print('Data:',data)
df.groupby(by = 'sex').mean().round(1)
df.groupby(by = ['class','sex'])[['Python','Keras']].max()
df.groupby(by = ['class','sex']).size()
df.groupby(by = ['class','sex']).describe()
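13.2 Group-wise apply and transform
# Explicit per-column mean: in pandas 2.x np.mean on a DataFrame reduces over both axes
df.groupby(by = ['class','sex'])[['Python','Keras']].apply(lambda g: g.mean()).round(1)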
def normalization(x):
    return (x - x.min())/(x.max() - x.min())
df.groupby(by = ['class','sex'])[['Python','Tensorflow']].transform(normalization).round(3)
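The practical difference between aggregating and transforming is the shape of the result. Here is a minimal sketch on a hypothetical four-row frame: aggregation returns one row per group, while `transform` returns a value for every original row, which makes it convenient for group-wise centering or normalization.

```python
import pandas as pd

df = pd.DataFrame({"sex": ["M", "F", "M", "F"],
                   "Python": [10, 20, 30, 40]})

# Aggregation: one mean per group (index is the group key).
group_mean = df.groupby("sex")["Python"].mean()

# Transformation: the group mean is broadcast back to every row,
# so it lines up with the original column.
centered = df["Python"] - df.groupby("sex")["Python"].transform("mean")

print(group_mean)
print(centered)
```

Because the transformed result keeps the original index, it can be assigned straight back as a new column.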
13.3 Group-wise aggregation with agg
df.groupby(by = ['class','sex']) [['Tensorflow','Keras']].agg([np.max,np.min,pd.Series.count])
df.groupby(by = ['class','sex'])[['Python','Keras']].agg({
'Python':[('最大值',np.max),('最小值',np.min)],
'Keras':[('计数',pd.Series.count),('中位数',np.median)]})
13.4 Pivot tables with pivot_table
def count(x):
    return len(x)
df.pivot_table(values=['Python','Keras','Tensorflow'],
index=['class','sex'],
aggfunc={'Python':[('最大值',np.max)],
'Keras':[('最小值',np.min),('中位数',np.median)],
'Tensorflow':[('最小值',np.min),('平均值',np.mean),('计数',count)]})
14 Time series
14.1 Working with timestamps
import numpy as np
import pandas as pd
pd.Timestamp('2020-8-24 12')
pd.Period('2020-8-24',freq = 'M')
index = pd.date_range('2020.08.24',periods=5,freq = 'M')
index
pd.period_range('2020.08.24',periods=5,freq='M')
ts = pd.Series(np.random.randint(0,10,size = 5),index = index)
ts
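# Mixed input formats: pandas 2.0+ needs format='mixed' (earlier versions parsed each element separately by default)
pd.to_datetime(['2020.08.24','2020-08-24','24/08/2020','2020/8/24'],format='mixed')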
pd.to_datetime([1598582232],unit='s')
dt = pd.to_datetime([1598582420401],unit = 'ms')
dt
dt + pd.DateOffset(hours = 8)
dt + pd.DateOffset(days = 100)
14.2 Indexing by timestamp
import numpy as np
import pandas as pd
index = pd.date_range("2020-8-24", periods=200, freq="D")
ts = pd.Series(range(len(index)), index=index)
ts['2020-08-30']
ts['2020-08-24':'2020-09-3']
ts['2020-08']
ts[pd.Timestamp('2020-08-30')]
ts[pd.Timestamp('2020-08-24'):pd.Timestamp('2020-08-30')]
ts[pd.date_range('2020-08-24',periods=10,freq='D')]
ts.index.year
ts.index.dayofweek
ts.index.isocalendar().week  # weekofyear was removed in pandas 2.0
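14.3 Common time series methods
Time series work often involves shifting or lagging, frequency conversion, and resampling. The examples below show how to use these operations.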
import numpy as np
import pandas as pd
index = pd.date_range('8/1/2020', periods=365, freq='D')
ts = pd.Series(np.random.randint(0, 500, len(index)), index=index)
ts.shift(periods = 2)
ts.shift(periods = -2)
ts.shift(periods = 2,freq = pd.tseries.offsets.Day())
ts.asfreq(pd.tseries.offsets.Week())
ts.asfreq(pd.tseries.offsets.MonthEnd())
ts.asfreq(pd.tseries.offsets.Hour(),fill_value = 0)
ts.resample('2W').sum()
ts.resample('3M').sum().cumsum()
d = dict({'price': [10, 11, 9, 13, 14, 18, 17, 19],
'volume': [50, 60, 40, 100, 50, 100, 40, 50],
'week_starting':pd.date_range('24/08/2020',periods=8,freq='W')})
df1 = pd.DataFrame(d)
df1.resample('M',on = 'week_starting').apply(np.sum)
df1.resample('M',on = 'week_starting').agg({'price':np.mean,'volume':np.sum})
days = pd.date_range('1/8/2020', periods=4, freq='D')
data2 = dict({'price': [10, 11, 9, 13, 14, 18, 17, 19], 'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
df2 = pd.DataFrame(data2,
index=pd.MultiIndex.from_product([days, ['morning','afternoon']]))
df2.resample('D', level=0).sum()
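Since the calls above run on random data, the results are hard to check by eye. The following sketch repeats the shift and resample ideas on a tiny fixed series (six hypothetical daily values) so each result can be verified by hand.

```python
import pandas as pd

idx = pd.date_range("2020-08-01", periods=6, freq="D")
ts = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Without freq: values slide down one position, the index stays put,
# and the first value becomes NaN.
lagged = ts.shift(1)

# With freq: the index itself moves forward one day, values are untouched.
moved = ts.shift(1, freq="D")

# Downsampling: sum the values inside each 2-day bucket.
two_day = ts.resample("2D").sum()

print(lagged)
print(moved)
print(two_day)
```

The two forms of `shift` answer different questions: the first aligns each day with the previous day's value, the second relabels the observations with later timestamps.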
14.4 Time zones
import numpy as np
import pandas as pd
import pytz
index = pd.date_range('8/1/2012 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(index)), index)
pytz.common_timezones
ts = ts.tz_localize(tz='UTC')
ts.tz_convert(tz = 'Asia/Shanghai')
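The two time zone calls used above do different jobs, which a single timestamp makes clear. This is a minimal sketch with a hypothetical naive timestamp: `tz_localize` attaches a zone to a naive time without changing the wall-clock reading, while `tz_convert` re-expresses the same instant in another zone.

```python
import pandas as pd

naive = pd.Timestamp("2012-08-01 00:00")

# Attach a zone: still midnight, now interpreted as UTC.
utc = naive.tz_localize("UTC")

# Convert: the same instant seen from Shanghai (UTC+8).
shanghai = utc.tz_convert("Asia/Shanghai")

print(utc)
print(shanghai)
```

The two aware timestamps compare equal because they denote the same moment; only the displayed wall-clock time differs.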
15 Data visualization
pip install matplotlib -i https://pypi.tuna.tsinghua.edu.cn/simple
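pandas plotting is built on matplotlib, and Jupyter Notebook renders the figures inline, so once the package is installed you can call .plot directly without importing matplotlib yourself.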
import numpy as np
import pandas as pd
df1 = pd.DataFrame(data=np.random.randn(1000,4),
index= pd.date_range(start='27/6/2012',periods=1000),
columns=list('ABCD'))
df1.cumsum().plot()
df2 = pd.DataFrame(data = np.random.rand(10,4),
columns = list('ABCD'))
df2.plot.bar(stacked = True)
df3 = pd.DataFrame(data = np.random.rand(4,2),
index = list('ABCD'),
columns=['One','Two'])
df3.plot.pie(subplots = True,figsize = (8,8))
df4 = pd.DataFrame(np.random.rand(50, 4),
columns=list('ABCD'))
df4.plot.scatter(x='A', y='B')
ax = df4.plot.scatter(x='A', y='C', color='DarkBlue', label='Group 1');
df4.plot.scatter(x='B', y='D', color='DarkGreen', label='Group 2', ax=ax)
df4.plot.scatter(x='A',y='B',s = df4['C']*200)
df5 = pd.DataFrame(data = np.random.rand(10, 4),
columns=list('ABCD'))
df5.plot.area(stacked = True);
df6 = pd.DataFrame(data = np.random.rand(10, 5),
columns=list('ABCDE'))
df6.plot.box()
df7 = pd.DataFrame({'A': np.random.randn(1000) + 1, 'B': np.random.randn(1000), 'C': np.random.randn(1000) - 1})
df7.plot.hist(alpha=0.5)
df7.plot.hist(stacked = True)
df7.hist(figsize = (8,8))
When running from an editor such as PyCharm, only a small change is needed: import matplotlib and call plt.show() to display the figure.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df1 = pd.DataFrame(data=np.random.randn(1000,4),
index= pd.date_range(start='27/6/2012',periods=1000),
columns=list('ABCD'))
df1 = df1.cumsum()
plt.plot(df1)
plt.show()
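16 pandas summary
- A fast, efficient DataFrame object for data manipulation with integrated indexing;
- Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
- Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily turn messy data into an orderly form;
- Flexible reshaping and pivoting of data sets;
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
- Columns can be inserted into and deleted from data structures, making them size-mutable;
- Aggregating or transforming data with a powerful group-by engine, enabling split-apply-combine operations on data sets;
- High-performance merging and joining of data sets;
- Hierarchical axis indexing provides an intuitive way to work with high-dimensional data in a lower-dimensional data structure;
- Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. You can even create domain-specific time offsets and join time series without losing data;
- Highly optimized for performance, with critical code paths written in Cython or C.
- Python with pandas is used in a wide range of academic and commercial domains, including finance, neuroscience, economics, statistics, advertising, web analytics, and more.
- Having got this far, take a moment to appreciate what pandas does well; if any part is still unfamiliar, go back and review the earlier sections.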
Original: https://blog.csdn.net/qq_34516746/article/details/124280937
Author: jaydenStyle
Title: Pandas数据分析库(2)Python数据分析