pandas Data Structures (Part 2): DataFrame

Basic structure of a DataFrame

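A DataFrame is a tabular data structure: an ordered collection of columns, each of which may hold a different data type (numeric, string, boolean, and so on). It has both a row index (index) and a column index (columns), and can be viewed as a collection of Series that all share the same row index.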
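
As a rough illustration of the description above, the following sketch builds a small DataFrame and inspects its two indexes. The column names and values are made up for this example.

```python
import pandas as pd

# Three columns, each with its own data type (illustrative data only).
df = pd.DataFrame({
    "name": ["a", "b", "c"],         # strings
    "score": [1.5, 2.0, 3.5],        # floats
    "passed": [True, True, False],   # booleans
})

print(df.index)    # row labels
print(df.columns)  # column labels
print(df.dtypes)   # one dtype per column

# Each column is a Series that shares the DataFrame's row index.
score = df["score"]
print(score.index.equals(df.index))
```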

Constructing a DataFrame

Creating a DataFrame from a dict of equal-length lists

All lists in the dict must have the same length; otherwise pandas raises a ValueError.
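
A minimal sketch of the failure case, assuming a deliberately mismatched dict (the name `bad` is just for this example). The exact wording of the error message varies between pandas versions, so it is printed rather than quoted here.

```python
import pandas as pd

# 'state' has 2 items but 'year' has 3, so the columns cannot line up.
bad = {"state": ["Ohio", "Nevada"], "year": [2000, 2001, 2002]}

try:
    pd.DataFrame(bad)
    error = None
except ValueError as exc:
    error = exc
    print("ValueError:", exc)
```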

In [1]: import numpy as np
   ...: import pandas as pd
   ...:
   ...: data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
   ...:         'year': [2000, 2001, 2002, 2001, 2002, 2003],
   ...:         'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [4]: df = pd.DataFrame(data)
In [5]: df
Out[5]:
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
  • Use the columns keyword to set the column order. If a name passed in columns is not a key in the dict, that column is still created and is filled with NaN.
In [6]: df1 = pd.DataFrame(data, columns=['year', 'state', 'pop'])

In [7]: df1
Out[7]:
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2
In [8]: df2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'])

In [9]: df2
Out[9]:
   year   state  pop debt
0  2000    Ohio  1.5  NaN
1  2001    Ohio  1.7  NaN
2  2002    Ohio  3.6  NaN
3  2001  Nevada  2.4  NaN
4  2002  Nevada  2.9  NaN
5  2003  Nevada  3.2  NaN
  • Use the index keyword to set the row labels.
In [10]: df3 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['one', 'two', 'three', 'four', 'five', 'six'])

In [11]: df3
Out[11]:
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN
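
The effect of the two keywords can also be checked programmatically. The sketch below uses a shortened version of the data; note that when the input is a dict, `columns` also acts as a filter, so a dict key that is not listed (here `pop`) is left out of the result.

```python
import pandas as pd

data = {"state": ["Ohio", "Nevada"], "year": [2000, 2001], "pop": [1.5, 2.4]}

# 'debt' is not a key in data; 'pop' is a key but is not requested.
df = pd.DataFrame(data, columns=["year", "state", "debt"], index=["one", "two"])

print(df.columns.tolist())       # order follows the columns argument
print(df["debt"].isna().all())   # the missing column holds only NaN
print("pop" in df.columns)       # unrequested keys are dropped
```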

Creating a DataFrame from a dict of equal-length arrays

In [28]: data = {'state': np.array(['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada']),
    ...:         'year': np.array([2000, 2001, 2002, 2001, 2002, 2003]),
    ...:         'pop': np.array([1.5, 1.7, 3.6, 2.4, 2.9, 3.2])}

In [29]: df4 = pd.DataFrame(data)
In [30]: df4
Out[30]:
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2

Creating a DataFrame from a nested dict of dicts

In [31]: pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [36]: df5 = pd.DataFrame(pop, index=[2000, 2001, 2002])
In [37]: df5
Out[37]:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6

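When a DataFrame is built from a nested dict, the outer keys become the column labels and the inner keys become the row labels. The inner dicts may have different lengths; missing values are filled with NaN.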
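
For comparison, here is a sketch of the same construction without the index argument. In that case the row labels are taken from the inner keys; `sort_index()` is used here only to get a predictable row order. Transposing with `.T` swaps rows and columns.

```python
import pandas as pd

pop = {"Nevada": {2001: 2.4, 2002: 2.9},
       "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# Outer keys -> columns, inner keys -> rows; sort rows by year.
df = pd.DataFrame(pop).sort_index()
print(df)

# After transposing, the states are the rows and the years are the columns.
print(df.T)
```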

Creating a DataFrame from a dict of Series

In [41]: df5['Ohio']
Out[41]:
2000    1.5
2001    1.7
2002    3.6
Name: Ohio, dtype: float64

In [38]: pdata = {'Ohio': df5['Ohio'][:2], 'Nevada': df5['Nevada'][:2]}

In [39]: df6 = pd.DataFrame(pdata)
In [40]: df6
Out[40]:
      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4
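
The same idea can be sketched with Series built from scratch rather than sliced from df5. The Series below have different lengths; pandas aligns them on their index labels and fills the gaps with NaN. The variable names are chosen for this example only.

```python
import pandas as pd

ohio = pd.Series([1.5, 1.7, 3.6], index=[2000, 2001, 2002])
nevada = pd.Series([2.4, 2.9], index=[2001, 2002])  # no value for 2000

# Rows are the union of both indexes; values are matched by label.
df = pd.DataFrame({"Ohio": ohio, "Nevada": nevada})
print(df)
```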

Adding, deleting, modifying, and querying DataFrame data

Selecting a column

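  • Method 1: DataFrame['column'] (bracket notation; works for any column name)
  • Method 2: DataFrame.column (attribute access; works only when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method)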
In [12]: df3['state']
Out[12]:
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object
In [13]: df3.state
Out[13]:
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

Selecting a single column returns a Series. The returned Series has the same row index as the DataFrame, and its name attribute is set to the column name.
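
One pitfall with attribute access is worth a small sketch. The example data in this article has a column called pop, which is also the name of a DataFrame method, so the attribute form returns the method rather than the column. Bracket notation avoids the ambiguity.

```python
import pandas as pd

df = pd.DataFrame({"state": ["Ohio", "Nevada"], "pop": [1.5, 2.4]})

state = df.state     # no clash: returns the 'state' column as a Series
pop_attr = df.pop    # clash: this is the DataFrame.pop method
pop_col = df["pop"]  # bracket notation always returns the column

print(type(state).__name__)
print(callable(pop_attr))
print(pop_col.name)
```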

Selecting a row with loc

In [14]: df3.loc['two']
Out[14]:
year     2001
state    Ohio
pop       1.7
debt      NaN
Name: two, dtype: object
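
As a sketch of a few related forms of label-based selection, loc also accepts a column label after the row label, or a list of row labels. The small DataFrame here is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"year": [2000, 2001, 2002], "state": ["Ohio", "Ohio", "Nevada"]},
    index=["one", "two", "three"],
)

row = df.loc["two"]                # one row -> Series indexed by column names
cell = df.loc["two", "state"]      # row label and column label -> single value
subset = df.loc[["one", "three"]]  # list of row labels -> DataFrame

print(row)
print(cell)
print(subset)
```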

Assigning new values to a column

  • Assign a scalar to a column; the value is repeated for every row.
In [15]: df3['debt'] = 1.2

In [16]: df3
Out[16]:
       year   state  pop  debt
one    2000    Ohio  1.5   1.2
two    2001    Ohio  1.7   1.2
three  2002    Ohio  3.6   1.2
four   2001  Nevada  2.4   1.2
five   2002  Nevada  2.9   1.2
six    2003  Nevada  3.2   1.2
  • Assign a list or array whose length equals the number of rows.
In [17]: df3['debt'] = [-1.2, 3.0, 4.1, -5.2, 7.1, 10.9]

In [18]: df3
Out[18]:
       year   state  pop  debt
one    2000    Ohio  1.5  -1.2
two    2001    Ohio  1.7   3.0
three  2002    Ohio  3.6   4.1
four   2001  Nevada  2.4  -5.2
five   2002  Nevada  2.9   7.1
six    2003  Nevada  3.2  10.9
In [19]: df3['debt'] = np.array([-1.2, 2.4, 5.2, -7.1, 8.9, 3.6])

In [20]: df3
Out[20]:
       year   state  pop  debt
one    2000    Ohio  1.5  -1.2
two    2001    Ohio  1.7   2.4
three  2002    Ohio  3.6   5.2
four   2001  Nevada  2.4  -7.1
five   2002  Nevada  2.9   8.9
six    2003  Nevada  3.2   3.6
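  • Assign a Series to a column. The Series does not need to have the same length as the DataFrame: its values are aligned to the DataFrame's row index by label, and rows with no matching label are filled with NaN.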
In [21]: val = pd.Series([-1.2, 3.5, 7.1], index=['one', 'three', 'six'])

In [22]: df3['debt'] = val
In [23]: df3
Out[23]:
       year   state  pop  debt
one    2000    Ohio  1.5  -1.2
two    2001    Ohio  1.7   NaN
three  2002    Ohio  3.6   3.5
four   2001  Nevada  2.4   NaN
five   2002  Nevada  2.9   NaN
six    2003  Nevada  3.2   7.1
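
The three assignment rules above can be summarized in one short sketch, including the two edge cases the session does not show: a list of the wrong length raises a ValueError, and Series labels that do not exist in the DataFrame's index (the hypothetical label 'ten' below) are silently dropped.

```python
import pandas as pd

df = pd.DataFrame({"year": [2000, 2001, 2002]}, index=["one", "two", "three"])

# Scalar: repeated for every row.
df["debt"] = 1.2
scalar_ok = bool((df["debt"] == 1.2).all())

# List of the wrong length: rejected.
try:
    df["debt"] = [1.0, 2.0]
    length_error = None
except ValueError as exc:
    length_error = exc

# Series: aligned by label; 'ten' is not a row of df, so it is ignored.
val = pd.Series([-1.2, 9.9], index=["one", "ten"])
df["debt"] = val
print(df)
```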

Adding a column

If the column being assigned to does not exist, a new column is created.

In [24]: df3['eastern'] = df3['state'] == 'Ohio'

In [25]: df3
Out[25]:
       year   state  pop  debt  eastern
one    2000    Ohio  1.5  -1.2     True
two    2001    Ohio  1.7   NaN     True
three  2002    Ohio  3.6   3.5     True
four   2001  Nevada  2.4   NaN    False
five   2002  Nevada  2.9   NaN    False
six    2003  Nevada  3.2   7.1    False
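
Note that this applies to bracket notation. The sketch below contrasts it with attribute-style assignment, which sets a plain Python attribute instead of creating a column (pandas emits a warning for this pattern, suppressed here to keep the output clean). The column name `western` is made up for the example.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"state": ["Ohio", "Nevada"]})

# Bracket notation creates the new column.
df["eastern"] = df["state"] == "Ohio"

# Attribute assignment does NOT create a column.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.western = df["state"] == "Nevada"

print(df.columns.tolist())
```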

Deleting a column with the del keyword

In [26]: del df3['eastern']

In [27]: df3
Out[27]:
       year   state  pop  debt
one    2000    Ohio  1.5  -1.2
two    2001    Ohio  1.7   NaN
three  2002    Ohio  3.6   3.5
four   2001  Nevada  2.4   NaN
five   2002  Nevada  2.9   NaN
six    2003  Nevada  3.2   7.1
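
del modifies the DataFrame in place. When the original should stay untouched, one alternative is `DataFrame.drop(columns=...)`, which returns a new DataFrame. A small sketch of both, with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({"year": [2000, 2001], "debt": [1.0, 2.0], "eastern": [True, False]})

# del removes the column from df itself.
del df["eastern"]

# drop returns a copy without 'debt'; df still has it.
trimmed = df.drop(columns=["debt"])

print(df.columns.tolist())
print(trimmed.columns.tolist())
```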

Original: https://blog.csdn.net/m0_58830154/article/details/125946132
Author: cjwdllj
Title: pandas 数据结构(二):DataFrame
