Pandas Quick Start Guide

August 8, 2023, 2:53 AM • Python • 46 views

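Before reading this article you should have some familiarity with NumPy. Readers who are not yet comfortable with NumPy can refer to my other article.

This article covers only the common operations used to normalize data during preprocessing; advanced Pandas features are touched on only briefly.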

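Introduction to Pandas
Pandas is a Python library for working with tabular datasets. It makes exploring and processing data straightforward, and it is usually combined with NumPy and with Matplotlib for visualization. Basic statistics such as the mean, median, maximum, and minimum are simple to compute. This article introduces the basic usage of Pandas so that readers can get started quickly.
Before using it, we import Pandas, along with NumPy as a helper: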

>>> import pandas as pd
>>> import numpy as np

1. Creating Pandas data structures

Pandas has two main data structures, Series and DataFrame. A Series holds one-dimensional data, and a DataFrame holds two-dimensional data.

We can create a Series like this:

>>> s = pd.Series([1,2,3,np.nan,5,6])
>>> s
0    1.0
1    2.0
2    3.0
3    NaN
4    5.0
5    6.0
dtype: float64

As you can see, a Series gets a default index when it is created. We can also supply our own index:

>>> s = pd.Series(['D','5.0','str'],index = [4,5,7])
>>> s
4      D
5    5.0
7    str
dtype: object

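The index has been replaced by the one we passed in, and the data we pass can be of any type (here strings, stored with dtype object).

We can also create a Series from a NumPy ndarray: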

>>> s = pd.Series(np.random.randn(5),index = ['a','b','c','d','e'])
>>> s
a   -0.516244
b   -1.249185
c    0.704626
d   -0.430690
e   -0.297458
dtype: float64

The index attribute of a Series returns its index:

>>> s.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

We can create a Series from a dictionary:

>>> d = {'a':1,'b':2,'c':3}
>>> s = pd.Series(d)
>>> s
a    1
b    2
c    3
dtype: int64

When creating a Series from a dictionary, any label in the index you pass that has no matching key in the dictionary is filled with NaN:

>>> d = {'a':1,'b':2,'c':3}
>>> pd.Series(d,index=['a','b','d','c'])
a    1.0
b    2.0
d    NaN
c    3.0
dtype: float64

A Series can also be created from a scalar. In that case an index must be supplied, and the scalar is repeated to match the length of the index:

>>> pd.Series(6,index=[1,2,3,4,5])
1    6
2    6
3    6
4    6
5    6
dtype: int64

Many Series operations resemble those of a NumPy ndarray, and most NumPy functions also work on a Series. A few examples:

>>> s = pd.Series(np.random.randn(5),index = ['a','b','c','d','e'])
>>> s
a   -1.179425
b    1.737900
c   -1.817220
d    0.635392
e   -0.008855
dtype: float64

>>> s.iloc[0]  # select by position; plain s[0] on a labelled Series is deprecated in recent pandas
-1.1794246984641934

>>> s[3:]
d    0.635392
e   -0.008855
dtype: float64

>>> s[s > s.median()]
b    1.737900
d    0.635392
dtype: float64

>>> np.exp(s)
a    0.307456
b    5.685393
c    0.162477
d    1.887762
e    0.991184
dtype: float64

The dtype attribute of a Series tells us the type of the data it holds:

>>> s.dtype
dtype('float64')

Series.array exposes the data of a Series as a pandas array. For example:

>>> s.array
<PandasArray>
[-1.1794246984641934,  1.7379002494230982, -1.8172198722285076,
  0.6353918842322767, -0.0088550837527395]
Length: 5, dtype: float64

We can also use to_numpy() to convert a Series to an ndarray:

>>> s.to_numpy()
array([-1.1794247 ,  1.73790025, -1.81721987,  0.63539188, -0.00885508])

When the index is not the default one (0, 1, ...), we can also index by label:

>>> s['a']
-1.1794246984641934

>>> 'e' in s
True

>>> 'f' in s
False

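Indexing with a label that does not exist, such as s['f'], raises a KeyError. The get() method instead returns None (so the interpreter prints nothing), or a default value that you specify: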

>>> s.get('f')

>>> s.get('f',np.nan)
nan
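The different ways of reading a single value can be put side by side. This is a minimal sketch on a small hand-written Series (the values and the label "z" are made up for illustration), showing position-based access, label-based access, and the two forms of get():

```python
import pandas as pd

s = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"])

first_by_position = s.iloc[0]        # position-based access
first_by_label = s.loc["a"]          # label-based access
missing = s.get("z")                 # label not present: returns None
fallback = s.get("z", default=0.0)   # label not present: returns the default

print(first_by_position, first_by_label, missing, fallback)
```

Using .iloc and .loc explicitly avoids any ambiguity about whether a key is meant as a position or as a label.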

Basic arithmetic on a Series is similar to NumPy. For example:

>>> s = pd.Series(np.random.randn(4),index = ['a','b','c','d'])
>>> s
a    1.037190
b    0.226480
c    0.507962
d   -1.296568
dtype: float64

>>> s + s
a    2.074381
b    0.452961
c    1.015925
d   -2.593136
dtype: float64

>>> s * 2
a    2.074381
b    0.452961
c    1.015925
d   -2.593136
dtype: float64

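Notably, operations between Series are aligned on the index: values are matched by label rather than by position. Series of different lengths can therefore still be combined, and any label that is missing from either operand produces NaN in the result: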

>>> s[:2] + s[1:]
a         NaN
b    0.452961
c         NaN
d         NaN
dtype: float64
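Because the example above uses random numbers, the alignment rule is easier to check on fixed values. The following is a small sketch with two hand-written Series (the names left and right are made up for illustration); it also shows Series.add with fill_value as one way to avoid the NaN results:

```python
import numpy as np
import pandas as pd

# Two Series that share only the labels "b" and "c".
left = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
right = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# The + operator aligns on labels; a label present on only one side gives NaN.
aligned_sum = left + right

# add() with fill_value=0 treats a label missing on one side as 0 instead.
filled_sum = left.add(right, fill_value=0)

print(aligned_sum)
print(filled_sum)
```

The result index is the union of both indexes, so neither operand needs to be trimmed beforehand.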

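After creating a Series we can give it a name and read it back through the name attribute. The rename() method returns a new Series with a different name and leaves the original unchanged:

>>> s = pd.Series(np.random.randn(3),name='a Series object')
>>> s.name
'a Series object'

>>> s1 = s.rename('a new name')
>>> s1.name
'a new name'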

There are many ways to create a DataFrame. One is from a dictionary of Series, where each key becomes the name of a column:

>>> d = {
... 'one':pd.Series([1,2,3],index=['a','b','c']),
... 'two':pd.Series([4,5,6,7],index=['a','b','c', 'd'])
... }
>>> df = pd.DataFrame(d)
>>> df
    one two
a   1.0   4
b   2.0   5
c   3.0   6
d   NaN   7

>>> df = pd.DataFrame(d,index = ['a','b','c','d'],
... columns=['two','three'])
>>> df
    two three
a   4   NaN
b   5   NaN
c   6   NaN
d   7   NaN

The index and columns attributes of a DataFrame return its row labels and column names:

>>> df.index
Index(['a', 'b', 'c', 'd'], dtype='object')

>>> df.columns
Index(['two', 'three'], dtype='object')

A DataFrame can also be created from a dictionary of ndarrays or lists. Again, the dictionary keys become the column names:

>>> d = {'first':[1,2,3],'second':[4,5,6]}
>>> df = pd.DataFrame(d)
>>> df
    first   second
0       1       4
1       2       5
2       3       6

A DataFrame can also be created from a list of dictionaries:

>>> data = [{'a':1,'b':2},{'c':3,'d':4}]
>>> df = pd.DataFrame(data)
>>> df
    a   b   c   d
0   1.0 2.0 NaN NaN
1   NaN NaN 3.0 4.0

Or from Series objects:

>>> s1 = pd.Series(np.random.randn(4))
>>> s2 = pd.Series(np.random.randn(4))
>>> pd.DataFrame({'one':s1,'two':s2})
    one         two
0   1.761223    0.561849
1   0.896577    1.826814
2   1.841716    0.056537
3   1.103170    -1.337749

Using column names, we can select, add, and delete columns of a DataFrame:

>>> df = pd.DataFrame([{'a':1,'b':2},{'a':3,'b':4},{'a':5,'b':6}],
... index=['one','two','three'])
>>> df
        a   b
one     1   2
two     3   4
three   5   6

>>> df['a']
one      1
two      3
three    5
Name: a, dtype: int64

>>> df['c'] = df['a'] * df['b']
>>> df
        a   b   c
one     1   2   2
two     3   4   12
three   5   6   30

>>> df['flag'] = df['c']>10
>>> df
        a   b   c   flag
one     1   2   2   False
two     3   4   12  True
three   5   6   30  True

>>> del df['flag']
>>> c = df.pop('c')  # pop() removes the column and returns it
>>> df
        a   b
one     1   2
two     3   4
three   5   6

We can also assign a scalar to a column, or copy part of an existing column into a new column. Rows without a matching label become NaN:

>>> df['new'] = 'Hello'
>>> df
        a   b   new
one     1   2   Hello
two     3   4   Hello
three   5   6   Hello

>>> df['new'] = df['b'][:1]
>>> df
        a   b   new
one     1   2   2.0
two     3   4   NaN
three   5   6   NaN

The insert() method inserts a column at a given position:

>>> df.insert(1,'insert',df['b'])

>>> df
        a   insert  b   new
one     1   2   2   2.0
two     3   4   4   NaN
three   5   6   6   NaN

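We can index and slice a DataFrame. The table below summarizes the basic cases:

Operation                           Syntax           Result
Select a column                     df[col]          Series
Select a row by label               df.loc[label]    Series
Select a row by integer position    df.iloc[pos]     Series
Slice rows                          df[5:10]         DataFrame
Select rows by a boolean vector     df[bool_vec]     DataFrame

Some examples: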

>>> df.loc['one']
a         1.0
insert    2.0
b         2.0
new       2.0
Name: one, dtype: float64

>>> df.iloc[2]
a         5.0
insert    6.0
b         6.0
new       NaN
Name: three, dtype: float64
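All five selection forms from the table can be tried on one small DataFrame. This is a sketch on made-up data (the variable names are illustrative only); the comments state what type each expression returns:

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 3, 5], "b": [2, 4, 6]},
    index=["one", "two", "three"],
)

col = df["a"]                  # Series: a single column
row_by_label = df.loc["two"]   # Series: a single row, selected by label
row_by_pos = df.iloc[2]        # Series: a single row, selected by position
first_two = df[0:2]            # DataFrame: a slice of rows
big_rows = df[df["a"] > 1]     # DataFrame: rows where the boolean mask is True

print(first_two)
print(big_rows)
```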

We can transpose a DataFrame with .T:

>>> df.T
        one two three
a       1.0 3.0 5.0
insert  2.0 4.0 6.0
b       2.0 4.0 6.0
new     2.0 NaN NaN

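Column names are also exposed as attributes of the DataFrame (when the name is a valid Python identifier), so a column can be read directly: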

>>> df.a
one      1
two      3
three    5
Name: a, dtype: int64

2. Basic functionality

This section introduces some basic Pandas functionality. We first create a few objects for later use:

>>> date = pd.date_range('2022/1/1',periods=8)
>>> date
DatetimeIndex(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
               '2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08'],
              dtype='datetime64[ns]', freq='D')

>>> s = pd.Series(np.random.randn(5),index = ['a','b','c','d','e'])
>>> df = pd.DataFrame(np.random.randn(8,3),index=date,columns=['A','B','C'])

The head() and tail() methods show the first and last rows of the data. The default is 5 rows, and you can pass the number of rows you want. For example:

>>> long_series = pd.Series(np.random.randn(1000))
>>> long_series.head()
0   -0.822034
1    0.918064
2   -0.029024
3   -0.453670
4    0.148209
dtype: float64

>>> long_series.tail(3)
997    0.482723
998   -1.188191
999    0.384464
dtype: float64

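Series and DataFrame objects expose the following attributes:
shape, which gives the dimensions of the object;
the axis labels (index, plus columns for a DataFrame), which give the labels along each axis.

For example:

>>> df.shape
(8, 3)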

>>> df.columns = [x.lower() for x in df.columns]
>>> df.head()
              a          b           c
2022-01-01  0.278170    0.699225    0.848391
2022-01-02  0.145116    2.537738    -1.286354
2022-01-03  1.715956    0.716130    -0.923585
2022-01-04  -0.743891   0.571609    -1.313765
2022-01-05  -0.779608   -0.757700   0.469854

We can use .array to get the pandas array that backs an Index or a Series:

>>> s.index.array
<PandasArray>
['a', 'b', 'c', 'd', 'e']
Length: 5, dtype: object

>>> s.array
<PandasArray>
[-0.30892022321403717,  -1.3042857117851279,  0.12166198582976105,
   0.3368197766037855,  0.10975093002310791]
Length: 5, dtype: float64

We can also use to_numpy() or numpy.asarray() to convert them to a NumPy ndarray:

>>> s.to_numpy()
array([-0.30892022, -1.30428571,  0.12166199,  0.33681978,  0.10975093])

>>> np.asarray(s)
array([-0.30892022, -1.30428571,  0.12166199,  0.33681978,  0.10975093])

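Pandas provides methods for arithmetic: add(), sub(), mul(), and div(), corresponding to addition, subtraction, multiplication, and division. The following examples show how they are used: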

>>> df = pd.DataFrame(
...  {
...      'one':pd.Series(np.random.randn(3),index=['a','b','c']),
...      'two':pd.Series(np.random.randn(4),index=['a','b','c','d']),
...      'three':pd.Series(np.random.randn(2),index=['b','c'])
...  })
>>> df
    one         two         three
a   0.266653    -0.212135   NaN
b   0.870522    0.298908    -0.525418
c   -1.113338   -1.510115   0.166605
d   NaN         0.665832    NaN

>>> row = df.iloc[1]
>>> df.add(row,axis=1)
    one         two         three
a   1.137175    0.086774    NaN
b   1.741043    0.597817    -1.050836
c   -0.242817   -1.211206   -0.358813
d   NaN         0.964741    NaN

>>> columns = df['two']
>>> df.sub(columns,axis=0)
    one         two     three
a   0.478788    0.0     NaN
b   0.571613    0.0     -0.824326
c   0.396776    0.0     1.676719
d   NaN         0.0     NaN

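These examples show how Pandas broadcasting works: the axis argument decides whether the Series is matched against the column labels (axis=1) or against the row index (axis=0).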
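The effect of axis is easier to verify with integers than with random floats. A minimal sketch, assuming a made-up 3x2 DataFrame (the names minus_row and minus_col are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]}, index=["a", "b", "c"])

row = df.loc["a"]   # Series indexed by column labels: x=1, y=10
col = df["x"]       # Series indexed by row labels: a=1, b=2, c=3

# axis=1: the Series index is matched against the column labels,
# so `row` is subtracted from every row.
minus_row = df.sub(row, axis=1)

# axis=0: the Series index is matched against the row labels,
# so `col` is subtracted from every column.
minus_col = df.sub(col, axis=0)

print(minus_row)
print(minus_col)
```

The string forms axis="columns" and axis="index" are equivalent to 1 and 0 and can make the intent clearer.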

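Handling missing values

NaN values cannot take part in the computations above. When calling these methods we can pass the fill_value option, so that a NaN on one side is replaced by an actual value before the operation is carried out. For example: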

>>> df
    one         two         three
a   0.266653    -0.212135   NaN
b   0.870522    0.298908    -0.525418
c   -1.113338   -1.510115   0.166605
d   NaN         0.665832    NaN

>>> df2 = pd.DataFrame({'one':[1,2,3,4],'two':[2,5,4,8],
...     'three':[5,6,1,2]},index=['a','b','c','d'])
>>> df2
    one two three
a   1   2   5
b   2   5   6
c   3   4   1
d   4   8   2

>>> df.add(df2, fill_value=0)
    one         two         three
a   1.266653    1.787865    5.000000
b   2.870522    5.298908    5.474582
c   1.886662    2.489885    1.166605
d   4.000000    8.665832    2.000000
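One detail that the example above cannot show, because df2 has no missing values, is what happens when both operands are NaN at the same position. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"v": [1.0, np.nan, np.nan]}, index=["a", "b", "c"])
right = pd.DataFrame({"v": [10.0, 20.0, np.nan]}, index=["a", "b", "c"])

plain = left + right                     # NaN wherever either side is NaN
filled = left.add(right, fill_value=0)   # NaN only where both sides are NaN

print(plain)
print(filled)
```

Row "c" stays NaN in both results: fill_value only substitutes for a value that is missing on one side.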

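We can summarize data with a few boolean reductions: empty, any(), all(), and bool().
all() tests whether every value satisfies a condition;
any() tests whether at least one value satisfies it;
empty tells whether the object contains no elements;
bool() returns the truth value of a single-element object (it is deprecated since pandas 2.1).
For example: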

>>> (df>0).all()
one      False
two      False
three    False
dtype: bool

>>> (df>0).all().all()
False

>>> df.empty
False

>>> pd.DataFrame(columns=['ABC']).empty
True

>>> (df>0).any()
one      True
two      True
three    True
dtype: bool

>>> pd.Series([False]).bool()
False

>>> pd.DataFrame([[True]]).bool()
True
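The same reductions on a fixed DataFrame make the results predictable. This sketch uses made-up values; it also uses Series.item() to read a single-element object, which is one possible replacement for bool() on newer pandas versions:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, -2, 3], "y": [4, 5, 6]})
positive = df > 0

per_column_all = positive.all()           # x: False, y: True
per_row_any = positive.any(axis=1)        # True for every row
everything = bool(positive.all().all())   # False, because x contains -2

single = pd.Series([True]).item()         # value of a one-element Series
is_empty = pd.DataFrame(columns=["x"]).empty

print(per_column_all)
print(everything, single, is_empty)
```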

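When comparing with operators, for example df + df == 2 * df, some entries come out False. This is because NaN does not compare equal to anything, including itself; we will not go into more detail here.
The alternative is the equals() method, which treats NaN values in the same position as equal:

>>> (df + df).equals(2 * df)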
True

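Note that equals() can still return False when two objects hold the same data but their index is in a different order. In that case, sort the unsorted object with df.sort_index() before calling equals().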
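Both points can be checked on a two-row DataFrame. A minimal sketch with made-up values (reordered is an illustrative name for the same data with its rows swapped):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1.0, np.nan], "two": [3.0, 4.0]}, index=["a", "b"])

# Element-wise comparison: the NaN cell compares as False.
elementwise = df + df == 2 * df
all_equal_elementwise = bool(elementwise.all().all())

# equals() treats NaN in the same position as equal.
same_content = (df + df).equals(2 * df)

# Same data, different row order.
reordered = df.loc[["b", "a"]]
equal_unsorted = df.equals(reordered)
equal_sorted = df.equals(reordered.sort_index())

print(all_equal_elementwise, same_content, equal_unsorted, equal_sorted)
```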

When processing data we often want a quick look at its summary statistics to get a general feel for it. Pandas provides methods for this. For example:

>>> df
    one         two         three
a   0.242950    1.626914    NaN
b   -0.948663   -0.352977   0.956809
c   -0.742458   0.882777    0.374549
d   NaN         -0.285258   NaN

>>> df.mean()
one     -0.482724
two      0.467864
three    0.665679
dtype: float64

>>> df.median(axis=1)
a    0.934932
b   -0.352977
c    0.374549
d   -0.285258
dtype: float64

We can use the skipna option to control whether NaN values are excluded (the default is True):

>>> df.std(axis=0,skipna=True)
one      0.636853
two      0.958562
three    0.411720
dtype: float64

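Besides the methods shown above, other commonly used methods are listed in the table below:

Method      Description
count       Number of non-NaN values
sum         Sum
mean        Mean
mad         Mean absolute deviation (removed in pandas 2.0)
median      Arithmetic median
min         Minimum
max         Maximum
mode        Mode
abs         Absolute value
std         Sample standard deviation
var         Unbiased variance
sem         Standard error of the mean
skew        Sample skewness (third moment)
kurt        Sample kurtosis (fourth moment)
quantile    Sample quantile (value at a given %)
cumsum      Cumulative sum
cumprod     Cumulative product
cummax      Cumulative maximum
cummin      Cumulative minimum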
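A few of the methods in the table, applied to a short Series with one missing value, show how NaN is handled by default. This is a sketch on made-up numbers:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0])

n_valid = s.count()                     # 3 non-NaN values
total = s.sum()                         # NaN is skipped: 7.0
mean = s.mean()                         # 7.0 / 3
median = s.median()                     # 2.0
half = s.quantile(0.5)                  # same as the median here
running = s.cumsum()                    # 1.0, 3.0, NaN, 7.0
mean_with_nan = s.mean(skipna=False)    # NaN, because one value is missing

print(n_valid, total, mean, median, half, mean_with_nan)
print(running)
```

The cumulative methods keep the NaN in place and continue accumulating after it.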

We can also use the describe() method to get a quick summary of the data:

>>> s = pd.Series(np.random.randn(1000))
>>> s.describe()
count    1000.000000
mean        0.023479
std         1.022949
min        -3.031548
25%        -0.656509
50%         0.014604
75%         0.681994
max         2.823129
dtype: float64

The idxmin() and idxmax() methods return the index labels of the minimum and maximum values:

>>> df = pd.DataFrame(np.random.randn(5, 3), columns=["A", "B", "C"])
>>> df
    A           B           C
0   -0.071734   0.336667    -0.413956
1   -0.588041   1.325679    -0.695395
2   -0.926539   0.674262    0.591602
3   0.292243    0.612337    0.401866
4   -0.526544   -0.699263   0.006968

>>> df.idxmax(axis=1)
0    B
1    B
2    B
3    B
4    C
dtype: object

>>> df.idxmin(axis=0)
A    2
B    4
C    1
dtype: int64
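The direction of axis in idxmax() and idxmin() is easy to mix up, so here is a small sketch with labelled rows (the data and names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 5, 3], "B": [4, 2, 6]}, index=["x", "y", "z"])

# Default axis=0: for each column, the row label of its maximum.
max_row_per_column = df.idxmax()

# axis=1: for each row, the column label of its minimum.
min_column_per_row = df.idxmin(axis=1)

print(max_row_per_column)
print(min_column_per_row)
```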

The value_counts() method counts how many times each value occurs:

>>> d = np.random.randint(0,10,size=(50))
>>> d
array([2, 0, 8, 4, 5, 8, 1, 8, 9, 5, 2, 2, 5, 3, 1, 5, 2, 5, 4, 7, 6, 2,
       1, 3, 9, 4, 1, 1, 4, 8, 1, 3, 3, 7, 6, 0, 2, 4, 5, 0, 6, 5, 9, 2,
       8, 1, 1, 9, 8, 8])

>>> s = pd.Series(d)
>>> s.value_counts()
1    8
2    7
8    7
5    7
4    5
9    4
3    4
0    3
6    3
7    2
dtype: int64
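value_counts() works on any hashable values, not only integers, and can return proportions instead of counts. A short sketch on made-up string data:

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c", "a", "b"])

counts = s.value_counts()                 # sorted by count, largest first
shares = s.value_counts(normalize=True)   # relative frequencies instead of counts

print(counts)
print(shares)
```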

The reindex() method conforms the data to a new set of labels. If you want one object to have the same labels as another, use reindex_like():

>>> s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
>>> s
a    1.584789
b    0.996338
c    0.484087
d    0.496449
e   -1.393620
dtype: float64

>>> s.reindex(['e','b','g','a','i'])
e   -1.393620
b    0.996338
g         NaN
a    1.584789
i         NaN
dtype: float64

>>> s1.reindex_like(s)
a   NaN
b   NaN
c   NaN
d   NaN
e   NaN
dtype: float64

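While reindexing we can fill the missing values by passing the method option to reindex(). The main fill methods are:

Method     Rule
ffill      Fill values forward
bfill      Fill values backward
nearest    Fill from the nearest index value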

>>> date = pd.date_range('2022/1/1',periods=8)
>>> s = pd.Series(np.random.randn(8),index=date)
>>> s
2022-01-01   -0.579916
2022-01-02    2.261039
2022-01-03    2.422621
2022-01-04   -0.350369
2022-01-05   -0.400870
2022-01-06   -1.017559
2022-01-07   -1.343754
2022-01-08    0.411179
Freq: D, dtype: float64

>>> s2 = s.iloc[[0,3,6]]  # keep every third observation, selected by position
>>> s2
2022-01-01   -0.579916
2022-01-04   -0.350369
2022-01-07   -1.343754
Freq: 3D, dtype: float64

>>> s2.reindex(s.index)
2022-01-01   -0.579916
2022-01-02         NaN
2022-01-03         NaN
2022-01-04   -0.350369
2022-01-05         NaN
2022-01-06         NaN
2022-01-07   -1.343754
2022-01-08         NaN
Freq: D, dtype: float64

>>> s2.reindex(s.index,method = 'ffill')
2022-01-01   -0.579916
2022-01-02   -0.579916
2022-01-03   -0.579916
2022-01-04   -0.350369
2022-01-05   -0.350369
2022-01-06   -0.350369
2022-01-07   -1.343754
2022-01-08   -1.343754
Freq: D, dtype: float64

>>> s2.reindex(s.index,method = 'bfill')
2022-01-01   -0.579916
2022-01-02   -0.350369
2022-01-03   -0.350369
2022-01-04   -0.350369
2022-01-05   -1.343754
2022-01-06   -1.343754
2022-01-07   -1.343754
2022-01-08         NaN
Freq: D, dtype: float64

>>> s2.reindex(s.index,method = 'nearest')
2022-01-01   -0.579916
2022-01-02   -0.579916
2022-01-03   -0.350369
2022-01-04   -0.350369
2022-01-05   -0.350369
2022-01-06   -1.343754
2022-01-07   -1.343754
2022-01-08   -1.343754
Freq: D, dtype: float64

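The limit option caps the number of consecutive values that are filled, which keeps a single observation from being propagated too far: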

>>> s2.reindex(s.index,method = 'ffill',limit = 1)
2022-01-01   -0.579916
2022-01-02   -0.579916
2022-01-03         NaN
2022-01-04   -0.350369
2022-01-05   -0.350369
2022-01-06         NaN
2022-01-07   -1.343754
2022-01-08   -1.343754
Freq: D, dtype: float64
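The fill options can be compared on a tiny integer-indexed Series, where the expected values are easy to work out by hand. A minimal sketch with made-up data (sparse and target are illustrative names):

```python
import pandas as pd

sparse = pd.Series([10.0, 40.0], index=[0, 3])
target = [0, 1, 2, 3, 4]

plain = sparse.reindex(target)                               # 10, NaN, NaN, 40, NaN
forward = sparse.reindex(target, method="ffill")             # 10, 10, 10, 40, 40
backward = sparse.reindex(target, method="bfill")            # 10, 40, 40, 40, NaN
limited = sparse.reindex(target, method="ffill", limit=1)    # 10, 10, NaN, 40, 40
constant = sparse.reindex(target, fill_value=0.0)            # 10, 0, 0, 40, 0

print(limited)
```

The method option requires the original index to be sorted (monotonic); fill_value has no such requirement.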

The drop() method removes rows or columns from the data. Using the five-row DataFrame df with columns A, B, and C created earlier:

>>> df.drop('A',axis=1)
    B           C
0   0.336667    -0.413956
1   1.325679    -0.695395
2   0.674262    0.591602
3   0.612337    0.401866
4   -0.699263   0.006968

>>> df.drop([1,3],axis=0)
    A           B           C
0   -0.071734   0.336667    -0.413956
2   -0.926539   0.674262    0.591602
4   -0.526544   -0.699263   0.006968

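Note that reindex() does not rename anything: labels that are absent from the original data simply come back as NaN. To change the labels while leaving the data untouched, use the rename() method:

>>> s = pd.Series(np.random.randn(5),index=['a','b','c','d','e'])
>>> s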
a   -0.326658
b    1.032068
c    1.408185
d   -0.498249
e   -0.780058
dtype: float64

>>> s.rename(str.upper)
A   -0.326658
B    1.032068
C    1.408185
D   -0.498249
E   -0.780058
dtype: float64

>>> s.rename({'a':'o','b':'p','c':'q','d':'r','e':'s'})
o   -0.326658
p    1.032068
q    1.408185
r   -0.498249
s   -0.780058
dtype: float64
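rename() works the same way on a DataFrame, where row labels and column labels can be changed in one call. The sketch below uses made-up data and contrasts it with reindex(), which looks labels up instead of renaming them:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

# rename: a mapping for the columns, a function for the row labels.
renamed = df.rename(columns={"a": "alpha"}, index=str.upper)

# reindex: "alpha" does not exist in df, so that column is all NaN.
reindexed = df.reindex(columns=["alpha", "b"])

print(renamed)
print(reindexed)
```

rename() returns a new object by default, so df itself keeps its original labels.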

The simplest way to iterate is a for loop. Iterating over a DataFrame yields its column labels:

>>> df = pd.DataFrame({"col1": np.random.randn(3),
...                 "col2": np.random.randn(3)},
...                 index=["a", "b","c"])
>>> for col in df:
...     print(col)
col1
col2

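Pandas also provides the items() method for iterating over (label, column) pairs, as well as the iterrows() and itertuples() methods shown below for iterating over rows. itertuples() is usually much faster than iterrows():

>>> for label, ser in df.items():
...     print(label)
...     print(ser)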
col1
a   -0.752748
b    0.355623
c   -1.396863
Name: col1, dtype: float64
col2
a    0.170046
b    2.115181
c    1.130240
Name: col2, dtype: float64

iterrows() iterates over the rows of a DataFrame as (index, Series) pairs:

>>> for index, row in df.iterrows():
...    print(index,row,sep='\n')
a
col1   -0.752748
col2    0.170046
Name: a, dtype: float64
b
col1    0.355623
col2    2.115181
Name: b, dtype: float64
c
col1   -1.396863
col2    1.130240
Name: c, dtype: float64

itertuples() returns each row as a namedtuple:

>>> for row in df.itertuples():
...     print(row)
Pandas(Index='a', col1=-0.7527479960512652, col2=0.17004562449070096)
Pandas(Index='b', col1=0.3556232583123519, col2=2.1151810120272803)
Pandas(Index='c', col1=-1.3968625542650128, col2=1.1302402391429855)
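The row iterators differ in more than speed. The sketch below, on made-up data, shows that iterrows() packs each row into a Series (so an integer column is upcast when the other column holds floats), that itertuples() gives attribute access to the fields, and that a vectorized expression produces the same totals without an explicit loop:

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [1, 2, 3], "col2": [0.5, 1.5, 2.5]},
    index=["a", "b", "c"],
)

# iterrows: the first row as a Series; col1 is upcast from int to float.
first_index, first_row = next(df.iterrows())
upcast_value = first_row["col1"]

# itertuples: fields are available as attributes of a namedtuple.
totals = [row.col1 + row.col2 for row in df.itertuples()]

# Vectorized alternative: operate on whole columns at once.
vectorized = (df["col1"] + df["col2"]).tolist()

print(first_index, upcast_value, totals, vectorized)
```

For simple arithmetic like this, the column-wise form is generally the idiomatic choice; row iteration is mostly useful when each row needs custom Python logic.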

Pandas supports three kinds of sorting: by index labels, by values, or by a combination of both.

To sort by labels, use sort_index():

>>> df = pd.DataFrame({
... "one": pd.Series(np.random.randn(3), index=["a", "b", "c"]),
... "two": pd.Series(np.random.randn(4), index=["a", "b", "c", "d"]),
... "three": pd.Series(np.random.randn(3), index=["b", "c", "d"]),})
>>> unsorted_df = df.reindex(
... index=["a", "d", "c", "b"], columns=["three", "two", "one"])
>>> unsorted_df
    three       two         one
a   NaN         0.663213    0.334880
d   0.002582    0.722183    NaN
c   -0.063257   -2.603502   0.459381
b   -0.926569   0.244567    -0.165420

>>> unsorted_df.sort_index()
    three       two         one
a   NaN         0.663213    0.334880
b   -0.926569   0.244567    -0.165420
c   -0.063257   -2.603502   0.459381
d   0.002582    0.722183    NaN

>>> unsorted_df.sort_index(ascending=False)
    three       two         one
d   0.002582    0.722183    NaN
c   -0.063257   -2.603502   0.459381
b   -0.926569   0.244567    -0.165420
a   NaN         0.663213    0.334880

>>> unsorted_df.sort_index(axis=1)
    one         three       two
a   0.334880    NaN         0.663213
d   NaN         0.002582    0.722183
c   0.459381    -0.063257   -2.603502
b   -0.165420   -0.926569   0.244567

To sort by values, use sort_values(). The by option selects which column to sort on, and when the data contains NaN the na_position option decides where the NaN values are placed:

>>> df1 = pd.DataFrame({"one": [2, 1, 1, 1],
...                     "two": [1, 3, 2, 4],
...                     "three": [5, np.nan, 3, 2]})
>>> df1.sort_values(by='two')
    one two three
0   2   1   5.0
2   1   2   3.0
1   1   3   NaN
3   1   4   2.0

>>> df1.sort_values('three',na_position='first')
    one two three
1   1   3   NaN
3   1   4   2.0
2   1   2   3.0
0   2   1   5.0
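sort_values() also accepts a list of columns, with a separate sort direction for each. A sketch on a small made-up DataFrame (the values are chosen only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "one": [2, 1, 1, 1],
    "two": [1, 3, 2, 4],
    "three": [5, np.nan, 3, 2],
})

# Sort by "one" ascending, and break ties by "two" descending.
by_two_keys = df.sort_values(by=["one", "two"], ascending=[True, False])

# Put the missing value of "three" at the top.
nan_first = df.sort_values("three", na_position="first")

print(by_two_keys)
print(nan_first)
```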

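Datasets usually arrive as files, for example in csv format. We read the data from the file and then process it further with Pandas. Reading files is straightforward, so only a brief introduction is given here.
In general, to read a particular type of file you call pd.read_*, where * stands for the file type. For example, to read a csv file: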

pd.read_csv('data.csv')

And to read an Excel file:

pd.read_excel('data.xls')

Other file types are encountered less often and are not covered here.
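To try read_csv() without a file on disk, the csv text can be wrapped in an in-memory buffer. This is a sketch only: the column names and values are made up, and io.StringIO stands in for a real file path such as 'data.csv':

```python
import io

import pandas as pd

# In-memory stand-in for a csv file; a path like 'data.csv' works the same way.
csv_text = "name,score\nalice,90\nbob,85\n"

df = pd.read_csv(io.StringIO(csv_text))

# Write the data back out as csv text, without the row index.
round_trip = df.to_csv(index=False)

print(df)
print(round_trip)
```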

Original: https://blog.csdn.net/Leslie_i/article/details/124733352
Author: ブリンク
Title: Pandas Quick Start Guide

