Basic structure of a DataFrame
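A DataFrame is a tabular data structure: an ordered collection of columns, each of which can hold a different data type (numeric, string, boolean). A DataFrame has both a row index (index) and a column index (columns), and can be viewed as a collection of Series that share the same index.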
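To make the two indexes concrete, here is a minimal sketch that inspects them on a small frame. The sample data is made up for illustration and is not part of the article's running example.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
df = pd.DataFrame(
    {"state": ["Ohio", "Nevada"], "year": [2000, 2001], "pop": [1.5, 2.4]}
)

print(df.index)    # row labels
print(df.columns)  # column labels
print(df.dtypes)   # one dtype per column

# Each column is a Series that shares the frame's row index.
state = df["state"]
print(type(state).__name__)
print(state.index.equals(df.index))
```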
Constructing a DataFrame
From a dict of equal-length lists
All lists in the dict must have the same length; otherwise pandas raises a ValueError.
In [1]: data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
   ...:         'year': [2000, 2001, 2002, 2001, 2002, 2003],
   ...:         'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
In [2]: import numpy as np
In [3]: import pandas as pd
In [4]: df = pd.DataFrame(data)
In [5]: df
Out[5]:
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
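The length requirement can be checked directly. The following is a minimal sketch with made-up data (the name bad_data is hypothetical); it assumes a reasonably recent pandas, where the mismatch surfaces as a ValueError.

```python
import pandas as pd

# Hypothetical data: 'year' has two entries but 'state' has three.
bad_data = {"state": ["Ohio", "Ohio", "Nevada"], "year": [2000, 2001]}

try:
    pd.DataFrame(bad_data)
    raised = False
except ValueError as exc:
    # pandas refuses to build a frame from columns of different lengths.
    raised = True
    print("ValueError:", exc)
```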
- Use the columns keyword to specify the column order. If a column passed via columns is not a key in the dict, it is added and filled with NaN.
In [6]: df1 = pd.DataFrame(data, columns=['year', 'state', 'pop'])
In [7]: df1
Out[7]:
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2
In [8]: df2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'])
In [9]: df2
Out[9]:
   year   state  pop debt
0  2000    Ohio  1.5  NaN
1  2001    Ohio  1.7  NaN
2  2002    Ohio  3.6  NaN
3  2001  Nevada  2.4  NaN
4  2002  Nevada  2.9  NaN
5  2003  Nevada  3.2  NaN
- Use the index keyword to specify the row index.
In [10]: df3 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['one', 'two', 'three', 'four', 'five', 'six'])
In [11]: df3
Out[11]:
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN
From a dict of equal-length arrays
In [28]: data = {'state': np.array(['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada']),
    ...:         'year': np.array([2000, 2001, 2002, 2001, 2002, 2003]),
    ...:         'pop': np.array([1.5, 1.7, 3.6, 2.4, 2.9, 3.2])}
In [29]: df4 = pd.DataFrame(data)
In [30]: df4
Out[30]:
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
From a nested dict of dicts
In [31]: pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
In [36]: df5 = pd.DataFrame(pop, index=[2000, 2001, 2002])
In [37]: df5
Out[37]:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
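When a DataFrame is built from a nested dict, the outer keys become the column labels and the inner keys become the row labels. The inner dicts may differ in length; missing values are filled with NaN.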
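As a small illustration of this rule, the sketch below builds a frame from the same nested dict without passing index, so the row labels come from the union of the inner keys. The variable name frame is illustrative, and the order of the row labels may vary between pandas versions, so the code sorts the index explicitly.

```python
import pandas as pd

pop = {
    "Nevada": {2001: 2.4, 2002: 2.9},
    "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
}

# No index argument: row labels are the union of the inner dict keys.
frame = pd.DataFrame(pop).sort_index()
print(frame)

# Outer keys became columns; Nevada has no entry for 2000, so it is NaN.
print(list(frame.columns))
print(pd.isna(frame.loc[2000, "Nevada"]))
```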
From a dict of Series
In [41]: df5['Ohio']
Out[41]:
2000    1.5
2001    1.7
2002    3.6
Name: Ohio, dtype: float64
In [38]: pdata = {'Ohio': df5['Ohio'][:2], 'Nevada': df5['Nevada'][:2]}
In [39]: df6 = pd.DataFrame(pdata)
In [40]: df6
Out[40]:
      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4
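The example above slices columns from an existing frame; Series built independently behave the same way. Here is a minimal sketch with made-up values (the names ohio and nevada are illustrative): the Series are aligned on the union of their indexes, and positions missing from one Series become NaN.

```python
import pandas as pd

ohio = pd.Series([1.5, 1.7], index=[2000, 2001])
nevada = pd.Series([2.4, 2.9], index=[2001, 2002])

frame = pd.DataFrame({"Ohio": ohio, "Nevada": nevada})
print(frame)

# Three distinct labels in total, so the result has three rows.
print(frame.shape)
# Count of missing values per column after alignment.
print(frame.isna().sum())
```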
Adding, deleting, modifying, and querying a DataFrame
Selecting a column of a DataFrame
- Option 1: DataFrame['column_name']
- Option 2: DataFrame.column_name
In [12]: df3['state']
Out[12]:
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object
In [13]: df3.state
Out[13]:
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object
Selecting a single column returns a Series. The returned Series has the same row index as the DataFrame, and its name attribute is set to the column label.
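One caveat about the attribute form is worth a short sketch. Attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. The pop column used in this article is such a clash, because DataFrame.pop is a method. The data below is a cut-down, illustrative copy of the article's example.

```python
import pandas as pd

df = pd.DataFrame(
    {"state": ["Ohio", "Nevada"], "pop": [1.5, 2.4]},
    index=["one", "two"],
)

# Bracket access always returns the column as a Series.
col = df["pop"]
print(type(col).__name__, col.name)

# Attribute access finds the DataFrame.pop method, not the column.
print(callable(df.pop))

# For a non-clashing name, both forms return the same Series.
print(df.state.equals(df["state"]))
```

For that reason the bracket form is the safer default, and the attribute form is mostly a convenience for interactive work.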
Selecting a row with the loc indexer
In [14]: df3.loc['two']
Out[14]:
year     2001
state    Ohio
pop       1.7
debt      NaN
Name: two, dtype: object
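loc selects by label. For comparison, the sketch below also uses iloc, which selects by integer position; iloc is not covered in this article and is included here only as a related pandas indexer. The data is an illustrative subset.

```python
import pandas as pd

df = pd.DataFrame(
    {"year": [2000, 2001, 2002], "state": ["Ohio", "Ohio", "Nevada"]},
    index=["one", "two", "three"],
)

by_label = df.loc["two"]    # row whose index label is 'two'
by_position = df.iloc[1]    # second row, counting from 0

# Both are Series indexed by the column labels and named after the row label.
print(by_label)
print(by_label.equals(by_position))

# A list of labels returns a DataFrame instead of a Series.
subset = df.loc[["one", "three"]]
print(subset)
```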
Assigning new values to a column
- Assign a scalar to a column.
In [15]: df3['debt'] = 1.2
In [16]: df3
Out[16]:
       year   state  pop  debt
one    2000    Ohio  1.5   1.2
two    2001    Ohio  1.7   1.2
three  2002    Ohio  3.6   1.2
four   2001  Nevada  2.4   1.2
five   2002  Nevada  2.9   1.2
six    2003  Nevada  3.2   1.2
- Assign a list or array of the same length as the DataFrame to a column.
In [17]: df3['debt'] = [-1.2, 3.0, 4.1, -5.2, 7.1, 10.9]
In [18]: df3
Out[18]:
       year   state  pop  debt
one    2000    Ohio  1.5  -1.2
two    2001    Ohio  1.7   3.0
three  2002    Ohio  3.6   4.1
four   2001  Nevada  2.4  -5.2
five   2002  Nevada  2.9   7.1
six    2003  Nevada  3.2  10.9
In [19]: df3['debt'] = np.array([-1.2, 2.4, 5.2, -7.1, 8.9, 3.6])
In [20]: df3
Out[20]:
       year   state  pop  debt
one    2000    Ohio  1.5  -1.2
two    2001    Ohio  1.7   2.4
three  2002    Ohio  3.6   5.2
four   2001  Nevada  2.4  -7.1
five   2002  Nevada  2.9   8.9
six    2003  Nevada  3.2   3.6
- Assign a Series to a column. The Series need not have the same length as the DataFrame: it is realigned to the DataFrame's index, and missing values are filled with NaN.
In [21]: val = pd.Series([-1.2, 3.5, 7.1], index=['one', 'three', 'six'])
In [22]: df3['debt'] = val
In [23]: df3
Out[23]:
       year   state  pop  debt
one    2000    Ohio  1.5  -1.2
two    2001    Ohio  1.7   NaN
three  2002    Ohio  3.6   3.5
four   2001  Nevada  2.4   NaN
five   2002  Nevada  2.9   NaN
six    2003  Nevada  3.2   7.1
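The difference between assigning a list and assigning a Series can be summarized in a short sketch. It uses a three-row illustrative frame and assumes a recent pandas, where a list of the wrong length raises a ValueError. It also shows that labels in the Series that are absent from the DataFrame's index are dropped.

```python
import pandas as pd

df = pd.DataFrame({"year": [2000, 2001, 2002]}, index=["one", "two", "three"])

# A list or array must match the number of rows exactly.
try:
    df["debt"] = [1.0, 2.0]
    length_error = False
except ValueError:
    length_error = True
print(length_error)

# A Series is aligned by label: 'two' is missing, 'ten' is not in df.index.
val = pd.Series([-1.2, 3.5, 9.9], index=["one", "three", "ten"])
df["debt"] = val
print(df)
```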
Adding a column
Assigning to a column that does not exist creates a new column.
In [24]: df3['eastern'] = df3['state'] == 'Ohio'
In [25]: df3
Out[25]:
       year   state  pop  debt  eastern
one    2000    Ohio  1.5  -1.2     True
two    2001    Ohio  1.7   NaN     True
three  2002    Ohio  3.6   3.5     True
four   2001  Nevada  2.4   NaN    False
five   2002  Nevada  2.9   NaN    False
six    2003  Nevada  3.2   7.1    False
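A related pitfall: new columns have to be created with bracket syntax. The sketch below, on illustrative data, shows that assigning through an attribute only sets a Python attribute on the object and leaves the columns unchanged. pandas may emit a UserWarning for this, which the sketch silences.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"state": ["Ohio", "Nevada"]}, index=["one", "two"])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # Sets a plain attribute; no column is created.
    df.eastern = [True, False]
created_by_attribute = "eastern" in df.columns
print(created_by_attribute)

# Bracket assignment creates the column.
df["eastern"] = df["state"] == "Ohio"
created_by_brackets = "eastern" in df.columns
print(created_by_brackets)
print(df)
```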
Deleting a column with the del keyword
In [26]: del df3['eastern']
In [27]: df3
Out[27]:
       year   state  pop  debt
one    2000    Ohio  1.5  -1.2
two    2001    Ohio  1.7   NaN
three  2002    Ohio  3.6   3.5
four   2001  Nevada  2.4   NaN
five   2002  Nevada  2.9   NaN
six    2003  Nevada  3.2   7.1
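del modifies the DataFrame in place. As an aside, pandas also provides DataFrame.drop, which returns a new object and leaves the original untouched; it is not used in this article and appears here only for comparison, on illustrative data.

```python
import pandas as pd

df = pd.DataFrame(
    {"state": ["Ohio", "Nevada"], "eastern": [True, False]},
    index=["one", "two"],
)

# drop returns a new DataFrame; df still has the column.
trimmed = df.drop(columns=["eastern"])
columns_after_drop = list(df.columns)
print(list(trimmed.columns), columns_after_drop)

# del removes the column from df itself.
del df["eastern"]
print(list(df.columns))
```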
Original: https://blog.csdn.net/m0_58830154/article/details/125946132
Author: cjwdllj
Title: pandas Data Structures (2): DataFrame