Official documentation: https://pandas.pydata.org/docs/
Installation
pip install pandas
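Installing from a Jupyter notebook
!pip install pandas
Data structures
The two most important pandas data structures are Series and DataFrame.
Series
A Series is similar to a one-dimensional array: it consists of a sequence of values (with a NumPy dtype) and an associated index (the data labels).
Construction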
import numpy as np
import pandas as pd
ser_obj = pd.Series(range(10, 20))
print(ser_obj)
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
dtype: int64
Getting the values
print(ser_obj.values)
[10 11 12 13 14 15 16 17 18 19]
Getting the index
print(ser_obj.index)
RangeIndex(start=0, stop=10, step=1)
Operations
print(ser_obj*2)
0 20
1 22
2 24
3 26
4 28
5 30
6 32
7 34
8 36
9 38
dtype: int64
print(ser_obj>15)
0 False
1 False
2 False
3 False
4 False
5 False
6 True
7 True
8 True
9 True
dtype: bool
Building from a dict
year_data = {2001: 17.8, 2005: 20.1, 2003: 16.5}
ser_obj2 = pd.Series(year_data)
print(ser_obj2)
print(ser_obj2.index)
print(ser_obj2[2001])
2001 17.8
2005 20.1
2003 16.5
dtype: float64
Int64Index([2001, 2005, 2003], dtype='int64')
17.8
The name attribute
Object name: ser_obj.name
Index name: ser_obj.index.name
print(ser_obj2.name)
print(ser_obj2.index.name)
None
None
ser_obj2.name = 'temp'
ser_obj2.index.name = 'year'
print(ser_obj2.head())
year
2001 17.8
2005 20.1
2003 16.5
Name: temp, dtype: float64
A Series object essentially consists of two arrays: one holds the keys (the index) and the other holds the values.
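To make the two-array view concrete, here is a minimal sketch. The labels and values are made up for illustration; it builds a Series from an explicit value array and label list, then reads both parts back.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three readings labelled by year.
values = np.array([17.8, 20.1, 16.5])
labels = [2001, 2005, 2003]

temps = pd.Series(values, index=labels)

# The two underlying pieces can be read back separately.
keys = temps.index.to_numpy()   # array of labels
vals = temps.to_numpy()         # array of values

# Pairing them up again gives the dict-like mapping the Series represents.
as_dict = dict(zip(keys.tolist(), vals.tolist()))
print(as_dict)
```

Looking a value up by label, as in temps[2005], goes through the key array to find the matching position in the value array.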
DataFrame
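A tabular data structure in which each column can hold a different type of value; it has both a row index and a column index.
Row index: index, axis 0
Column index: columns, axis 1
Construction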
t = pd.DataFrame(np.arange(12).reshape(3,4))
print(t)
t1 = pd.DataFrame(np.random.randn(5,4))
print(t1)
print(t1.head())
dict_data = {'A':1,
'B':pd.Timestamp('20210831'),
'C':pd.Series(1,index=list(range(4)),dtype='float32'),
'D':np.array([3]*4,dtype='int32'),
'E':['python','java','c++','c'],
'F':'hello'}
t2 = pd.DataFrame(dict_data)
print(t2)
print(t2['A'])
0 1
1 1
2 1
3 1
Name: A, dtype: int64
print(type(t2['A']))
print(t2.A)
0 1
1 1
2 1
3 1
Name: A, dtype: int64
d1 = [{"name" : "xiaohong" ,"age" :32,"tel" :10010},{ "name": "xiaogang" ,"tel": 10000} ,{"name":"xiaowang" ,"age":22}]
t3 = pd.DataFrame(d1)
t3
Adding a column
t2['G'] = t2['D']+4
t2
Deleting a column
del t2['G']
t2
Dropping duplicate rows
t2.drop_duplicates(subset=['A'])
Parameters of drop_duplicates():
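- subset: the columns to consider when identifying duplicates; all columns by default
- keep: one of {'first', 'last', False}, default 'first'; keep the first occurrence, keep the last occurrence, or drop every duplicated row
- inplace: bool, default False; modify the DataFrame in place instead of returning a copy
- ignore_index: bool, default False; if True, the resulting index is relabelled starting from 0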
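The effect of these parameters is easiest to see on a small frame. The sketch below uses a hypothetical frame (the column names lang and score are not from the original) and compares the rows that survive under each setting.

```python
import pandas as pd

# Hypothetical frame with a repeated value in column 'lang'.
frame = pd.DataFrame({'lang': ['python', 'java', 'python', 'c'],
                      'score': [1, 2, 3, 4]})

first = frame.drop_duplicates(subset=['lang'])               # keeps the first 'python'
last = frame.drop_duplicates(subset=['lang'], keep='last')   # keeps the last 'python'
none = frame.drop_duplicates(subset=['lang'], keep=False)    # drops every 'python' row
reset = frame.drop_duplicates(subset=['lang'], ignore_index=True)  # index restarts at 0

print(first, last, none, reset, sep='\n\n')
```

Because inplace is left at its default, frame itself still has all four rows afterwards.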
Index
print(type(t2.index))
print(t2.index)
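Index objects are immutable, which keeps the data safe.
Common index types:
- Index: the generic index
- Int64Index: integer index (removed in pandas 2.0, where a plain Index with an integer dtype takes its place)
- MultiIndex: hierarchical index
- DatetimeIndex: timestamp index
Specifying row index labels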
s = pd.Series(range(5),index=['a','b','c','d','e'])
s
a 0
b 1
c 2
d 3
e 4
dtype: int64
print(s['b'])
print(s.iloc[1])  # by position; plain s[1] on a labelled index is deprecated since pandas 2.1
1
1
Slice indexing
Slicing by position excludes the end point; slicing by label includes it
print(s[1:3])
print(s['b':'d'])
b 1
c 2
dtype: int64
b 1
c 2
d 3
dtype: int64
Non-contiguous indexing
print(s[['a','c']])
a 0
c 2
dtype: int64
Boolean indexing
s_bool = s>2
print(s_bool)
print(s[s_bool])
print(s[s>2])
a False
b False
c False
d True
e True
dtype: bool
d 3
e 4
dtype: int64
d 3
e 4
dtype: int64
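Boolean masks can also be combined. The following sketch reuses the same Series and assumes the usual element-wise operators: & for and, | for or, ~ for not. Each comparison needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

s = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])

middle = s[(s > 0) & (s < 4)]    # both conditions must hold
edges = s[(s == 0) | (s == 4)]   # either condition may hold
not_big = s[~(s > 2)]            # negation of a mask

print(middle)
print(edges)
print(not_big)
```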
Specifying column labels
d = pd.DataFrame(np.random.randn(5,4),columns=['a','b','c','d'])
d
loc: label-based indexing
The [] operator on a DataFrame slices rows only; to slice by label on rows and columns together, use loc
print(s.loc['b':'d'])
b 1
c 2
d 3
dtype: int64
print(d.loc[0:2])
print(d.loc[0:2,'a'])
iloc: position-based indexing
Works like loc, but selects by integer position; the end point of a slice is excluded
print(s.iloc[1:3])
print(d.iloc[0:2,0])
b 1
c 2
dtype: int64
0 -1.126207
1 -0.614330
Name: a, dtype: float64
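The difference between the two accessors is clearer when row labels are not integers. This is a small sketch on a hypothetical frame (the labels w to z are chosen for illustration) that selects the same block once by label and once by position.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row labels are letters, so labels and positions differ.
frame = pd.DataFrame(np.arange(12).reshape(4, 3),
                     index=['w', 'x', 'y', 'z'],
                     columns=['a', 'b', 'c'])

by_label = frame.loc['x':'y', ['a', 'c']]   # end label 'y' is included
by_position = frame.iloc[1:3, [0, 2]]       # end position 3 is excluded

print(by_label)
print(by_position)
```

Both selections return rows x and y with columns a and c, even though the slice bounds are written differently.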
Hierarchical index
ser = pd.Series(np.random.randn(12),index=[['a','a','a','b','b','b','c','c','c','d','d','d'],
[0,1,2,0,1,2,0,1,2,0,1,2]])
print(ser)
print(ser.index)
print(ser['c'])
0 -0.428242
1 0.178003
2 -1.399262
dtype: float64
print(ser[:,2])
a -0.294524
b -1.255692
c -1.399262
d 0.243327
dtype: float64
Swapping the inner and outer index levels
swaplevel()
print(ser.swaplevel())
0 a -0.670116
1 a -0.167777
2 a -0.294524
0 b -0.246279
1 b -0.919629
2 b -1.255692
0 c -0.428242
1 c 0.178003
2 c -1.399262
0 d 1.189471
1 d 0.063832
2 d 0.243327
dtype: float64
Use unstack() to turn a hierarchically indexed Series into a DataFrame
print(ser.unstack(0))
df = ser.unstack()
print(df)
print()
print(df.stack())
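Since the random values above make the reshaping hard to follow, here is a sketch with fixed, made-up values. It shows which index level becomes the columns and that stack() reverses unstack().

```python
import pandas as pd

# Hypothetical two-level Series: outer level is a letter, inner level a number.
idx = pd.MultiIndex.from_product([['a', 'b'], [0, 1, 2]])
ser = pd.Series([10, 11, 12, 20, 21, 22], index=idx)

wide = ser.unstack()          # inner level (0, 1, 2) becomes the columns
wide_outer = ser.unstack(0)   # outer level ('a', 'b') becomes the columns
back = wide.stack()           # columns return to the inner index level

print(wide)
print(wide_outer)
print(back)
```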
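Aligned operations
An important part of data cleaning: operations align their operands by index, and positions without a match on both sides are filled with NaN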
s1 = pd.Series(range(0,10))
s2 = pd.Series(range(10,15))
print(s1+s2)
0 10.0
1 12.0
2 14.0
3 16.0
4 18.0
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
dtype: float64
df1 = pd.DataFrame(np.ones((2,2)),columns=['a','b'])
df2 = pd.DataFrame(np.ones((3,3)),columns=['a','b','c'])
print(df1)
print(df2)
print(df1+df2)
Specifying a fill value
fill_value
print(s1)
print(s2)
print(s1.add(s2,fill_value=0))
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
dtype: int64
0 10
1 11
2 12
3 13
4 14
dtype: int64
0 10.0
1 12.0
2 14.0
3 16.0
4 18.0
5 5.0
6 6.0
7 7.0
8 8.0
9 9.0
dtype: float64
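The same fill_value argument is available on DataFrame.add. A minimal sketch, reusing the shapes of df1 and df2 from the alignment example, compares the plain + operator with add(fill_value=0).

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((2, 2)), columns=['a', 'b'])
df2 = pd.DataFrame(np.ones((3, 3)), columns=['a', 'b', 'c'])

plain = df1 + df2                     # cells missing on one side become NaN
filled = df1.add(df2, fill_value=0)   # a cell missing on one side counts as 0

print(plain)
print(filled)
```

A cell that is missing on both sides would still be NaN after add(); here every cell exists in df2, so the filled result has no NaN.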
Applying functions
NumPy functions can be applied directly
df = pd.DataFrame(np.random.randn(5,4))
print(df)
print(np.abs(df))
apply
print(df.apply(lambda x: x.max()))
0 3.147196
1 0.624949
2 1.078376
3 1.307140
dtype: float64
print(df.apply(lambda x: x.max(),axis=1))
0 0.683147
1 1.094432
2 3.147196
3 3.080139
4 0.624949
dtype: float64
applymap
Applies a function to every element (applymap is deprecated since pandas 2.1 in favour of DataFrame.map)
f = lambda x : '%.2f'%x
print(df.applymap(f))
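One way to get the same element-wise formatting without relying on applymap is to combine apply with Series.map. This is a sketch on a small made-up frame, not the only possible approach.

```python
import pandas as pd

# Hypothetical frame with a few floats to format.
frame = pd.DataFrame({'x': [1.234, 5.678], 'y': [9.0, 0.5]})
fmt = lambda v: '%.2f' % v

# apply passes each column (a Series) to the function;
# Series.map then calls fmt on every element of that column.
formatted = frame.apply(lambda col: col.map(fmt))
print(formatted)
```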
Sorting
Sorting by index
sort_index(axis=0, ascending=True)
Ascending by default
s = pd.Series(range(5,10),index=np.random.randint(5,size=5))
print(s)
s.sort_index()
Sorting by value
sort_values(by='column name')
Sorts by a uniquely named column; a duplicated column name raises an error
df = pd.DataFrame(np.arange(24).reshape(6,4))
df.loc[3, 3] = 1
print(df)
print('-'*50)
df_vsort = df.sort_values(by=3,axis=0,ascending=False)
print(df_vsort)
print('-'*50)
df_vsort = df.sort_values(by=3,axis=1, ascending=False)
print(df_vsort)
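sort_values also accepts a list of columns with one sort direction per column. A minimal sketch on a hypothetical frame (the column names group and score are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'group': ['b', 'a', 'b', 'a'],
                      'score': [1, 4, 3, 2]})

# Sort by 'group' ascending, then by 'score' descending within each group.
ordered = frame.sort_values(by=['group', 'score'], ascending=[True, False])
print(ordered)
```

The original row labels travel with the rows, so the index of the result shows where each row came from.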
Handling missing data
1. Detecting missing values
isnull()
df_data = pd.DataFrame([np.random.randn(3),[1.,2.,np.nan],
[np.nan,4.,np.nan],[0,1.,2.]])
print(df_data)
print()
print(df_data.isnull())
2. Dropping missing data
dropna()
Drops the rows or columns that contain missing values; rows by default
print(df_data.dropna())
print(df_data.dropna(axis=1))
3. Filling missing data
fillna()
print(df_data.fillna(5.))
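The three steps can be tried on fixed values. The sketch below uses a made-up frame and, as assumptions beyond the original text, the how='all' option of dropna and a per-column fill with each column's mean.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is partly missing, row 2 is entirely missing.
frame = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                      'b': [2.0, 4.0, np.nan]})

any_missing = frame.dropna()            # drops rows with at least one NaN
all_missing = frame.dropna(how='all')   # drops only rows that are entirely NaN
filled = frame.fillna(frame.mean())     # fills each column with its own mean

print(any_missing)
print(all_missing)
print(filled)
```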
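Computation
sum()
Sum
max()
Maximum
min()
Minimum
mean()
Mean
axis=0 computes the statistic for each column; axis=1 computes it for each row
skipna excludes missing values; True by default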
df = pd.DataFrame(np.random.randn(5,4),columns=['a','b','c','d'])
print(df)
print()
print(df.sum())
print(df.max())
print(df.min(axis=1,skipna=False))
print(df.describe())
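- count: number of non-NaN values
- mean: mean
- std: standard deviation
- min: minimum
- 25%: first quartile (Q1), also called the lower quartile; the value at the 25% position when the sample is sorted in ascending order.
- 50%: second quartile (Q2), also called the median; the value at the 50% position when the sample is sorted in ascending order.
- 75%: third quartile (Q3), also called the upper quartile; the value at the 75% position when the sample is sorted in ascending order.
- max: maximum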
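To check how the describe() rows relate to the individual statistics, the following sketch runs describe() on a small made-up Series and compares the percentile rows with Series.quantile.

```python
import pandas as pd

col = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])
summary = col.describe()

# The percentile rows match Series.quantile, and 50% is the median.
q1, q2, q3 = col.quantile([0.25, 0.5, 0.75])

print(summary)
print(q1, q2, q3)
```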
Grouping and aggregation
Grouping
groupby()
dict_obj = {'key1' : ['a', 'b', 'a', 'b',
'a', 'b', 'a', 'a'],
'key2' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'data1': np.random.randn(8),
'data2': np.random.randn(8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)
print()
grouped1 = df_obj.groupby('key1')
print(type(grouped1))
print(grouped1.mean(numeric_only=True))
print()
grouped2 = df_obj['data1'].groupby(df_obj['key1'])
print(type(grouped2))
print(grouped2.mean())
print(grouped1.size())
print(grouped2.size())
Custom grouping
self_def_key = [0, 1, 2, 3, 3, 3, 5, 7]
print(df_obj.groupby(self_def_key).size())
print(df_obj.groupby(self_def_key).sum())
Group 3 contains the original rows 3, 4 and 5; group 5 contains the original row 6
print(df_obj.groupby([df_obj['key1'], df_obj['key2']]).size())
grouped2 = df_obj.groupby(['key2', 'key1'])
print(grouped2.size())
print(grouped2.mean())
Grouping by data type (the axis=1 argument of groupby is deprecated since pandas 2.1)
print(df_obj.groupby(df_obj.dtypes,axis=1).size())
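Aggregation
Aggregation functions: min() max() mean() sum() count() size() describe()
def peak_range(df):
    # Spread between the largest and smallest value in each group
    return df.max() - df.min()
# Select the numeric columns first: subtraction is not defined for the strings in key2
print(df_obj.groupby('key1')[['data1', 'data2']].agg(peak_range))
print(df_obj.groupby('key1')[['data1', 'data2']].agg(lambda df : df.max() - df.min()))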
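agg can also take a list that mixes built-in aggregation names with a custom function. A minimal sketch on fixed, made-up data so the result can be checked by hand:

```python
import pandas as pd

frame = pd.DataFrame({'key1': ['a', 'b', 'a', 'b'],
                      'data1': [1.0, 2.0, 5.0, 8.0]})

def peak_range(values):
    """Spread between the largest and smallest value of a group."""
    return values.max() - values.min()

# One output column per aggregation; the custom function is
# labelled with its own name.
stats = frame.groupby('key1')['data1'].agg(['mean', 'max', peak_range])
print(stats)
```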
Data cleaning
Joining data
pd.merge()
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1' : np.random.randint(0,10,7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b' ,'d'],
'data2' : np.random.randint(0,10,3)})
print(df_obj1)
print(df_obj2)
print()
pd.merge(df_obj1, df_obj2)
left_on, right_on
Specify the join key of the left and the right data respectively
df_obj1 = df_obj1.rename(columns={'key':'key1'})
df_obj2 = df_obj2.rename(columns={'key':'key2'})
pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2')
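The default is an inner join (inner): the keys in the result are the intersection of the keys on both sides
how
Specifies the join type. With an outer join (outer), the keys in the result are the union of the keys on both sides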
pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='outer')
Left join (left): keeps every key of the left data
pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='left')
Right join (right): keeps every key of the right data
pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='right')
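With random data it is hard to see which keys survive each join type. The sketch below uses two small made-up frames and collects the keys present in the result for each value of how.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'data1': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': [10, 20, 40]})

# For each join type, record which keys appear in the merged result.
keys = {how: sorted(pd.merge(left, right, on='key', how=how)['key'])
        for how in ('inner', 'outer', 'left', 'right')}
print(keys)
```

The inner join keeps only a and b, the outer join keeps all four keys, and the left and right joins keep the keys of their respective side.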
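suffixes
Appends suffixes to overlapping column names so that they stay distinct
left = df_obj1.rename(columns={'key1': 'key', 'data1': 'data'})
right = df_obj2.rename(columns={'key2': 'key', 'data2': 'data'})
print(pd.merge(left, right, on='key', suffixes=('_left', '_right')))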
Concatenating data
pd.concat()
Concatenating arrays with NumPy
arr1 = np.random.randint(0, 10, (3, 4))
arr2 = np.random.randint(0, 10, (3, 4))
print(arr1)
print(arr2)
print(np.concatenate([arr1, arr2]))
print(np.concatenate([arr1, arr2], axis=1))
Index without duplicates
ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(0,5))
ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(5,9))
ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(9,12))
print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1))
Index with duplicates
ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(5))
ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(4))
ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(3))
print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1, join='outer'))
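A short sketch with fixed values shows what happens to overlapping labels. The ignore_index and join='inner' options are not used in the text above and are included here as illustrative assumptions about what a reader might want next.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=[0, 1, 2])
s2 = pd.Series([4, 5], index=[0, 1])

stacked = pd.concat([s1, s2])                         # labels repeat: 0, 1, 2, 0, 1
renumbered = pd.concat([s1, s2], ignore_index=True)   # fresh labels 0..4
outer = pd.concat([s1, s2], axis=1)                   # union of labels, NaN where missing
inner = pd.concat([s1, s2], axis=1, join='inner')     # only labels present in both

print(stacked)
print(renumbered)
print(outer)
print(inner)
```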
duplicated()
Flags, for each row, whether it duplicates an earlier row
df_obj = pd.DataFrame({'data1' : ['a'] * 4 + ['b'] * 4,
'data2' : np.random.randint(0, 4, 8)})
print(df_obj)
print(df_obj.duplicated())
print(df_obj.duplicated('data2'))
drop_duplicates()
Filters out duplicate rows
print(df_obj.drop_duplicates())
print(df_obj.drop_duplicates('data2'))
map()
Transforms each element of a Series with the function passed to map
ser_obj = pd.Series(np.random.randint(0,10,10))
print(ser_obj)
print(ser_obj.map(lambda x : x ** 2))
replace()
Replaces values
ser_obj.replace(1, -100)
ser_obj.replace([6, 8], -100)
ser_obj.replace([4, 7], [-100, -200])
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
'B': [5, 6, 7, 8, 9],
'C': ['a', 'b', 'c', 'd', 'e']})
df.replace(to_replace=r'^a', value=100, regex=True)
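Because the Series above is random, the effect of each call varies from run to run. The following sketch repeats the same calls on fixed, made-up values and adds the dict form of replace, which expresses a pairwise mapping in one argument.

```python
import pandas as pd

ser = pd.Series([1, 4, 6, 7, 8])

single = ser.replace(1, -100)                  # one value
many_to_one = ser.replace([6, 8], -100)        # several values, one replacement
pairwise = ser.replace([4, 7], [-100, -200])   # 4 -> -100, 7 -> -200
by_dict = ser.replace({4: -100, 7: -200})      # same mapping written as a dict

print(single.tolist())
print(many_to_one.tolist())
print(pairwise.tolist())
print(ser.tolist())   # replace returns a new Series; ser is unchanged
```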
Original: https://blog.csdn.net/qq_38929220/article/details/119964360
Author: 叶柖
Title: Python library: pandas