- DataFrame (continued)
Indexing and selection
The basic indexing syntax is as follows:
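Operation                           Syntax           Result
Select a column                     df[col]          Series
Select a row by label               df.loc[label]    Series
Select a row by integer position    df.iloc[loc]     Series
Select rows by boolean vector       df[bool_vec]     DataFrame
Slice rows                          df[5:10]         DataFrame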
For example, selecting a row returns a Series whose index is the column labels of the DataFrame:
In [89]: df.loc["b"]
Out[89]:
one 2.0
bar 2.0
flag False
foo bar
one_trunc 2.0
Name: b, dtype: object
In [90]: df.iloc[2]
Out[90]:
one 3.0
bar 3.0
flag True
foo bar
one_trunc NaN
Name: c, dtype: object
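Indexing and slicing are covered in more detail in the later chapter on indexing.
Data alignment and arithmetic
Data in DataFrame objects is automatically aligned on both the index (row labels) and the columns. The result has the union of the row labels and the union of the column labels.
In [92]: df2 = pd.DataFrame(np.random.randn(7, 3), columns=["A", "B", "C"])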
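As a minimal sketch of the selection forms listed above, the following uses a small hypothetical DataFrame (its labels and values are invented for illustration and are not the df used in the session above) and checks the type each form returns.

```python
import pandas as pd

# Hypothetical toy data; labels and values are chosen only for illustration.
df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "flag": [False, False, True]},
    index=["a", "b", "c"],
)

col = df["one"]              # select a column -> Series
row_by_label = df.loc["b"]   # select a row by label -> Series
row_by_pos = df.iloc[2]      # select a row by integer position -> Series
masked = df[df["flag"]]      # select rows with a boolean vector -> DataFrame
sliced = df[0:2]             # slice rows by position -> DataFrame

print(type(col).__name__, type(row_by_label).__name__, type(sliced).__name__)
# The index of a selected row is the DataFrame's column labels.
print(list(row_by_label.index))
```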
In [93]: df + df2
Out[93]:
A B C D
0 0.045691 -0.014138 1.380871 NaN
1 -0.955398 -1.501007 0.037181 NaN
2 -0.662690 1.534833 -0.859691 NaN
3 -2.452949 1.237274 -0.133712 NaN
4 1.414490 1.951676 -2.320422 NaN
5 -0.494922 -1.649727 -1.084601 NaN
6 -1.047551 -0.748572 -0.805479 NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
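The union behavior is easier to verify with small, fixed numbers. The sketch below uses two hypothetical frames (left and right are illustrative names, not from the session above). It also shows DataFrame.add with fill_value, which treats a label missing on one side as 0 instead of producing NaN.

```python
import pandas as pd

# Hypothetical frames with partly overlapping rows and columns.
left = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})   # rows 0..2
right = pd.DataFrame({"A": [100, 200], "C": [5, 6]})        # rows 0..1

# The result covers the union of rows (0..2) and columns (A, B, C).
# Any cell that is missing in either operand becomes NaN.
total = left + right
print(total)

# With fill_value, a value missing on only one side is replaced before adding.
filled = left.add(right, fill_value=0)
print(filled)
```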
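When an operation is performed between a DataFrame and a Series, the default behavior is to align the Series index with the DataFrame columns and then broadcast the operation row by row. For example:
In [94]: df - df.iloc[0]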
Out[94]:
A B C D
0 0.000000 0.000000 0.000000 0.000000
1 -1.359261 -0.248717 -0.453372 -1.754659
2 0.253128 0.829678 0.010026 -1.991234
3 -1.311128 0.054325 -1.724913 -1.620544
4 0.573025 1.500742 -0.676070 1.367331
5 -1.741248 0.781993 -1.241620 -2.053136
6 -1.240774 -0.869551 -0.153282 0.000430
7 -0.743894 0.411013 -0.929563 -0.282386
8 -1.194921 1.320690 0.238224 -1.482644
9 2.293786 1.856228 0.773289 -1.446531
So what happens if we use a column instead of a row?
df
A B C
0 1 3 4
1 2 5 0
2 3 1 1
3 4 7 6
4 5 2 2
df - df['A']
A B C 0 1 2 3 4
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN
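The extracted column A is a Series indexed by 0-4, which does not match the column labels A, B, and C of df, so after alignment every value in the result is NaN.
Operations with scalars work just as they do for the other data structures: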
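To subtract a column from every column, the Series has to be matched against the row index rather than the column labels. One way to do this, sketched here with the same five-row data as above, is DataFrame.sub with axis=0.

```python
import pandas as pd

# Same values as the five-row df shown above.
df = pd.DataFrame(
    {"A": [1, 2, 3, 4, 5], "B": [3, 5, 1, 7, 2], "C": [4, 0, 1, 6, 2]}
)

# axis=0 (equivalently axis="index") aligns the Series index with the
# DataFrame's row index, so column A is subtracted from each column.
result = df.sub(df["A"], axis=0)
print(result)
```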
In [95]: df * 5 + 2
Out[95]:
A B C D
0 3.359299 -0.124862 4.835102 3.381160
1 -3.437003 -1.368449 2.568242 -5.392133
2 4.624938 4.023526 4.885230 -6.575010
3 -3.196342 0.146766 -3.789461 -4.721559
4 6.224426 7.378849 1.454750 10.217815
5 -5.346940 3.785103 -1.373001 -6.884519
6 -2.844569 -4.472618 4.068691 3.383309
7 -0.360173 1.930201 0.187285 1.969232
8 -2.615303 6.478587 6.026220 -4.032059
9 14.828230 9.156280 8.701544 -3.851494
In [96]: 1 / df
Out[96]:
A B C D
0 3.678365 -2.353094 1.763605 3.620145
1 -0.919624 -1.484363 8.799067 -0.676395
2 1.904807 2.470934 1.732964 -0.583090
3 -0.962215 -2.697986 -0.863638 -0.743875
4 1.183593 0.929567 -9.170108 0.608434
5 -0.680555 2.800959 -1.482360 -0.562777
6 -1.032084 -0.772485 2.416988 3.614523
7 -2.118489 -71.634509 -2.758294 -162.507295
8 -1.083352 1.116424 1.241860 -0.828904
9 0.389765 0.698687 0.746097 -0.854483
In [97]: df ** 4
Out[97]:
A B C D
0 0.005462 3.261689e-02 0.103370 5.822320e-03
1 1.398165 2.059869e-01 0.000167 4.777482e+00
2 0.075962 2.682596e-02 0.110877 8.650845e+00
3 1.166571 1.887302e-02 1.797515 3.265879e+00
4 0.509555 1.339298e+00 0.000141 7.297019e+00
5 4.661717 1.624699e-02 0.207103 9.969092e+00
6 0.881334 2.808277e+00 0.029302 5.858632e-03
7 0.049647 3.797614e-08 0.017276 1.433866e-09
8 0.725974 6.437005e-01 0.420446 2.118275e+00
9 43.329821 4.196326e+00 3.227153 1.875802e+00
Boolean operators work as well:
In [98]: df1 = pd.DataFrame({"a": [1, 0, 1], "b": [0, 1, 1]}, dtype=bool)
In [99]: df2 = pd.DataFrame({"a": [0, 1, 1], "b": [1, 1, 0]}, dtype=bool)
In [100]: df1 & df2
Out[100]:
a b
0 False False
1 False True
2 True False
In [101]: df1 | df2
Out[101]:
a b
0 True True
1 True True
2 True True
In [102]: df1 ^ df2
Out[102]:
a b
0 True True
1 True False
2 False True
In [103]: -df1
Out[103]:
a b
0 False True
1 True False
2 False False
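Element-wise logical NOT on boolean data is more commonly written with the ~ operator. A minimal sketch using the same df1 as above:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 0, 1], "b": [0, 1, 1]}, dtype=bool)

# ~ flips every boolean value element-wise.
inverted = ~df1
print(inverted)
```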
As with a multidimensional array, you can transpose a DataFrame using the T attribute or the transpose method:
In [104]: df[:5].T
Out[104]:
0 1 2 3 4
A 0.271860 -1.087401 0.524988 -1.039268 0.844885
B -0.424972 -0.673690 0.404705 -0.370647 1.075770
C 0.567020 0.113648 0.577046 -1.157892 -0.109050
D 0.276232 -1.478427 -1.715002 -1.344312 1.643563
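A short sketch on a hypothetical two-by-two frame, checking that the T attribute and the transpose method give the same result and that row and column labels swap places:

```python
import pandas as pd

# Hypothetical toy frame; labels are for illustration only.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])

t = df.T                     # attribute form
same = t.equals(df.transpose())  # method form gives the same frame

print(t)
print(same)
```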
Applying NumPy functions
If the data in a DataFrame is entirely numeric, most NumPy functions can be applied to it directly:
In [105]: np.exp(df)
Out[105]:
A B C D
0 1.312403 0.653788 1.763006 1.318154
1 0.337092 0.509824 1.120358 0.227996
2 1.690438 1.498861 1.780770 0.179963
3 0.353713 0.690288 0.314148 0.260719
4 2.327710 2.932249 0.896686 5.173571
5 0.230066 1.429065 0.509360 0.169161
6 0.379495 0.274028 1.512461 1.318720
7 0.623732 0.986137 0.695904 0.993865
8 0.397301 2.449092 2.237242 0.299269
9 13.009059 4.183951 3.820223 0.310274
In [106]: np.asarray(df)
Out[106]:
array([[ 0.2719, -0.425 , 0.567 , 0.2762],
[-1.0874, -0.6737, 0.1136, -1.4784],
[ 0.525 , 0.4047, 0.577 , -1.715 ],
[-1.0393, -0.3706, -1.1579, -1.3443],
[ 0.8449, 1.0758, -0.109 , 1.6436],
[-1.4694, 0.357 , -0.6746, -1.7769],
[-0.9689, -1.2945, 0.4137, 0.2767],
[-0.472 , -0.014 , -0.3625, -0.0062],
[-0.9231, 0.8957, 0.8052, -1.2064],
[ 2.5656, 1.4313, 1.3403, -1.1703]])
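The difference between the two calls above is what comes back: an element-wise NumPy function returns a DataFrame with the labels intact, while np.asarray returns a plain array without labels. The sketch below uses hypothetical data and also shows DataFrame.to_numpy, the method form of the conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame for illustration.
df = pd.DataFrame({"A": [1.0, 4.0], "B": [9.0, 16.0]}, index=["x", "y"])

roots = np.sqrt(df)    # element-wise ufunc: result is still a labeled DataFrame
arr = df.to_numpy()    # plain ndarray; row and column labels are dropped

print(roots)
print(arr)
```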
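When a NumPy universal function (ufunc) is given more than one Series, the inputs are aligned on their indexes before the function is applied.
In [109]: ser1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
In [110]: ser2 = pd.Series([1, 3, 5], index=["b", "a", "c"])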
In [111]: ser1
Out[111]:
a 1
b 2
c 3
dtype: int64
In [112]: ser2
Out[112]:
b 1
a 3
c 5
dtype: int64
In [113]: np.remainder(ser1, ser2)
Out[113]:
a 1
b 0
c 3
dtype: int64
The result takes the union of the two indexes, and any label that is missing from one of the inputs gets NaN:
In [114]: ser3 = pd.Series([2, 4, 6], index=["b", "c", "d"])
In [115]: ser3
Out[115]:
b 2
c 4
d 6
dtype: int64
In [116]: np.remainder(ser1, ser3)
Out[116]:
a NaN
b 0.0
c 3.0
d NaN
dtype: float64
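If only the labels shared by both Series are wanted, there are two simple options, sketched here under the assumption that dropping the unmatched labels is acceptable: align with an inner join before calling the ufunc, or drop the NaN entries afterwards.

```python
import numpy as np
import pandas as pd

ser1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
ser3 = pd.Series([2, 4, 6], index=["b", "c", "d"])

# Option 1: keep only the labels present in both Series, then apply the ufunc.
left, right = ser1.align(ser3, join="inner")
common = np.remainder(left, right)

# Option 2: compute on the union and drop the unmatched labels afterwards.
# The intermediate NaN values force the result to float64.
dropped = np.remainder(ser1, ser3).dropna()

print(common)
print(dropped)
```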
When a binary ufunc is applied to a Series and an Index, the Series implementation takes precedence and a Series is returned:
In [117]: ser = pd.Series([1, 2, 3])
In [118]: idx = pd.Index([4, 5, 6])
In [119]: np.maximum(ser, idx)
Out[119]:
0 4
1 5
2 6
dtype: int64
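A small check of the return type, using hypothetical values in which neither input dominates:

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so that the maximum comes from both inputs.
ser = pd.Series([1, 5, 3])
idx = pd.Index([4, 2, 6])

# The values are compared position by position; the result is a Series
# that carries the index of ser.
result = np.maximum(ser, idx)
print(type(result).__name__)
print(result)
```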
Console display
When a large DataFrame is printed in the console, the output is truncated to show only the first and last few rows:
In [120]: baseball = pd.read_csv("data/baseball.csv")
In [121]: print(baseball)
id player year stint team lg g ab r h ... rbi sb cs bb so ibb hbp sh sf gidp
0 88641 womacto01 2006 2 CHN NL 19 50 6 14 ... 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0
1 88643 schilcu01 2006 1 BOS AL 31 2 0 1 ... 0.0 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0
.. ... ... ... ... ... .. .. ... .. ... ... ... ... ... .. ... ... ... ... ... ...
98 89533 aloumo01 2007 1 NYN NL 87 328 51 112 ... 49.0 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0
99 89534 alomasa02 2007 1 NYN NL 8 22 1 3 ... 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0
[100 rows x 23 columns]
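How much gets truncated is governed by pandas display options. As a hedged sketch on a hypothetical ten-row frame (not the baseball data), pd.option_context can temporarily lower display.max_rows, and to_string renders every row regardless of the option.

```python
import pandas as pd

# Hypothetical ten-row frame standing in for a larger dataset.
df = pd.DataFrame({"n": range(10), "sq": [i * i for i in range(10)]})

# Temporarily allow at most 4 rows; the setting is restored on exit.
with pd.option_context("display.max_rows", 4):
    truncated = repr(df)

# to_string() ignores the row limit and renders the whole frame.
full = df.to_string()

print(truncated)
print(full)
```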
The info method prints a summary of the DataFrame:
In [122]: baseball.info()
RangeIndex: 100 entries, 0 to 99
Data columns (total 23 columns):
Column Non-Null Count Dtype
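info writes to standard output by default. If the summary is needed as a string, for example for logging, its buf parameter accepts a writable buffer. A minimal sketch with a hypothetical three-row frame:

```python
import io

import pandas as pd

# Hypothetical frame with one missing value, for illustration only.
df = pd.DataFrame({"x": [1.0, None, 3.0], "y": ["a", "b", "c"]})

# Redirect the summary into an in-memory text buffer instead of stdout.
buf = io.StringIO()
df.info(buf=buf)
summary = buf.getvalue()

print(summary)
```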
Original: https://blog.csdn.net/weixin_28893597/article/details/113963466
Author: yishan li
Title: python randn(5)_Python 数据处理(五)