python randn(5)_Python Data Processing (Part 5)

  1. DataFrame (continued)

Indexing and selection

The basic indexing syntax is as follows:

Operation                         Syntax          Result
Select column                     df[col]         Series
Select row by label               df.loc[label]   Series
Select row by integer position    df.iloc[loc]    Series
Select rows by boolean vector     df[bool_vec]    DataFrame
Slice rows                        df[5:10]        DataFrame

For example, selecting a row returns a Series whose index is the DataFrame's column names:

In [89]: df.loc["b"]

Out[89]:

one 2.0

bar 2.0

flag False

foo bar

one_trunc 2.0

Name: b, dtype: object

In [90]: df.iloc[2]

Out[90]:

one 3.0

bar 3.0

flag True

foo bar

one_trunc NaN

Name: c, dtype: object
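
The selection forms listed above can be checked on a small frame. This is a minimal sketch; the data and the variable names are made up for illustration and are not the df used in this article.

```python
import pandas as pd

# Hypothetical data, only for illustration.
df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0, 4.0], "flag": [False, False, True, True]},
    index=["a", "b", "c", "d"],
)

col = df["one"]              # select a column -> Series
row_by_label = df.loc["b"]   # select a row by label -> Series
row_by_pos = df.iloc[2]      # select a row by integer position -> Series
row_slice = df[1:3]          # slice rows -> DataFrame
masked = df[df["one"] > 2]   # boolean vector -> DataFrame

print(type(col).__name__, type(row_by_label).__name__, type(masked).__name__)
print(list(row_by_label.index))  # the row's index is the column names
print(row_by_pos.name)           # the row's name is its original row label
```

Note that a selected row mixes values from columns of different types, which is why the row Series in the output above has dtype object.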

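Indexing and slicing are covered in detail in a later chapter on indexing.

Data alignment and arithmetic

Operations between DataFrame objects automatically align on both the index (row labels) and the column names. The result has the union of the row labels and the union of the column labels.

In [92]: df2 = pd.DataFrame(np.random.randn(7, 3), columns=["A", "B", "C"])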

In [93]: df + df2

Out[93]:

A B C D

0 0.045691 -0.014138 1.380871 NaN

1 -0.955398 -1.501007 0.037181 NaN

2 -0.662690 1.534833 -0.859691 NaN

3 -2.452949 1.237274 -0.133712 NaN

4 1.414490 1.951676 -2.320422 NaN

5 -0.494922 -1.649727 -1.084601 NaN

6 -1.047551 -0.748572 -0.805479 NaN

7 NaN NaN NaN NaN

8 NaN NaN NaN NaN

9 NaN NaN NaN NaN
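
Because the frames above hold random numbers, the alignment rule is easier to verify with fixed values. The sketch below uses two hypothetical frames, left and right, and also shows DataFrame.add with fill_value as one way to avoid NaN where only one side has a value.

```python
import numpy as np
import pandas as pd

# Hypothetical frames with partly overlapping rows and columns.
left = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})  # rows 0..2
right = pd.DataFrame({"A": [100.0, 200.0], "C": [5.0, 6.0]})          # rows 0..1

# Plain addition: the result covers the union of rows and columns,
# and any cell missing on either side becomes NaN.
total = left + right
print(total)

# add(..., fill_value=0) treats a value missing on one side as 0.
# A cell missing on both sides still stays NaN.
filled = left.add(right, fill_value=0)
print(filled)
```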

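When an operation involves a DataFrame and a Series, the default behavior is to align the Series index with the DataFrame columns and then broadcast the operation across the rows. For example:

In [94]: df - df.iloc[0]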

Out[94]:

A B C D

0 0.000000 0.000000 0.000000 0.000000

1 -1.359261 -0.248717 -0.453372 -1.754659

2 0.253128 0.829678 0.010026 -1.991234

3 -1.311128 0.054325 -1.724913 -1.620544

4 0.573025 1.500742 -0.676070 1.367331

5 -1.741248 0.781993 -1.241620 -2.053136

6 -1.240774 -0.869551 -0.153282 0.000430

7 -0.743894 0.411013 -0.929563 -0.282386

8 -1.194921 1.320690 0.238224 -1.482644

9 2.293786 1.856228 0.773289 -1.446531

What happens if we use a column instead?

df

A B C

0 1 3 4

1 2 5 0

2 3 1 1

3 4 7 6

4 5 2 2

df - df['A']

A B C 0 1 2 3 4

0 NaN NaN NaN NaN NaN NaN NaN NaN

1 NaN NaN NaN NaN NaN NaN NaN NaN

2 NaN NaN NaN NaN NaN NaN NaN NaN

3 NaN NaN NaN NaN NaN NaN NaN NaN

4 NaN NaN NaN NaN NaN NaN NaN NaN

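Because the index of the extracted column A is 0-4, which matches none of df's column names A, B and C, every value in the result is NaN.

Scalar operations work the same way as with the other data structures: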
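
If the goal is to subtract a column from every column, one option is the sub method with axis=0, which matches the Series index against the row labels instead of the column names. A minimal sketch, reusing the small integer frame shown above:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                   "B": [3, 5, 1, 7, 2],
                   "C": [4, 0, 1, 6, 2]})

# axis=0 aligns the Series index (0-4) with the row labels of df,
# so column A is subtracted from each column, row by row.
result = df.sub(df["A"], axis=0)
print(result)
```

The related methods add, mul and div accept the same axis argument.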

In [95]: df * 5 + 2

Out[95]:

A B C D

0 3.359299 -0.124862 4.835102 3.381160

1 -3.437003 -1.368449 2.568242 -5.392133

2 4.624938 4.023526 4.885230 -6.575010

3 -3.196342 0.146766 -3.789461 -4.721559

4 6.224426 7.378849 1.454750 10.217815

5 -5.346940 3.785103 -1.373001 -6.884519

6 -2.844569 -4.472618 4.068691 3.383309

7 -0.360173 1.930201 0.187285 1.969232

8 -2.615303 6.478587 6.026220 -4.032059

9 14.828230 9.156280 8.701544 -3.851494

In [96]: 1 / df

Out[96]:

A B C D

0 3.678365 -2.353094 1.763605 3.620145

1 -0.919624 -1.484363 8.799067 -0.676395

2 1.904807 2.470934 1.732964 -0.583090

3 -0.962215 -2.697986 -0.863638 -0.743875

4 1.183593 0.929567 -9.170108 0.608434

5 -0.680555 2.800959 -1.482360 -0.562777

6 -1.032084 -0.772485 2.416988 3.614523

7 -2.118489 -71.634509 -2.758294 -162.507295

8 -1.083352 1.116424 1.241860 -0.828904

9 0.389765 0.698687 0.746097 -0.854483

In [97]: df ** 4

Out[97]:

A B C D

0 0.005462 3.261689e-02 0.103370 5.822320e-03

1 1.398165 2.059869e-01 0.000167 4.777482e+00

2 0.075962 2.682596e-02 0.110877 8.650845e+00

3 1.166571 1.887302e-02 1.797515 3.265879e+00

4 0.509555 1.339298e+00 0.000141 7.297019e+00

5 4.661717 1.624699e-02 0.207103 9.969092e+00

6 0.881334 2.808277e+00 0.029302 5.858632e-03

7 0.049647 3.797614e-08 0.017276 1.433866e-09

8 0.725974 6.437005e-01 0.420446 2.118275e+00

9 43.329821 4.196326e+00 3.227153 1.875802e+00
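
The same three expressions are easier to verify by hand on fixed values. A minimal sketch with a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical values chosen so the results are exact.
df = pd.DataFrame({"A": [1.0, 2.0], "B": [4.0, -0.5]})

scaled = df * 5 + 2   # each element is multiplied by 5, then 2 is added
recip = 1 / df        # element-wise reciprocal
power = df ** 4       # element-wise fourth power

print(scaled)
print(recip)
print(power)
```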

Boolean operators work the same way:

In [98]: df1 = pd.DataFrame({"a": [1, 0, 1], "b": [0, 1, 1]}, dtype=bool)

In [99]: df2 = pd.DataFrame({"a": [0, 1, 1], "b": [1, 1, 0]}, dtype=bool)

In [100]: df1 & df2

Out[100]:

a b

0 False False

1 False True

2 True False

In [101]: df1 | df2

Out[101]:

a b

0 True True

1 True True

2 True True

In [102]: df1 ^ df2

Out[102]:

a b

0 True True

1 True False

2 False True

In [103]: -df1

Out[103]:

a b

0 False True

1 True False

2 False False
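
Logical NOT is more commonly written with the ~ operator. The sketch below rebuilds the two boolean frames from above and checks ~ against the other operators; the names inverted, lhs and rhs are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 0, 1], "b": [0, 1, 1]}, dtype=bool)
df2 = pd.DataFrame({"a": [0, 1, 1], "b": [1, 1, 0]}, dtype=bool)

inverted = ~df1  # element-wise logical NOT
print(inverted)

# De Morgan's law holds element-wise: not (x and y) == (not x) or (not y)
lhs = ~(df1 & df2)
rhs = ~df1 | ~df2
print(lhs.equals(rhs))
```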

As with a multidimensional array (ndarray), a DataFrame can be transposed with the T attribute or the transpose method:

In [104]: df[:5].T

Out[104]:

0 1 2 3 4

A 0.271860 -1.087401 0.524988 -1.039268 0.844885

B -0.424972 -0.673690 0.404705 -0.370647 1.075770

C 0.567020 0.113648 0.577046 -1.157892 -0.109050

D 0.276232 -1.478427 -1.715002 -1.344312 1.643563
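
A small sketch of what transposing does to the labels, using a hypothetical frame with string row labels:

```python
import pandas as pd

# Hypothetical frame: 3 rows (x, y, z) and 2 columns (A, B).
df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]},
                  index=["x", "y", "z"])

t = df.T  # equivalent to df.transpose()

print(t.shape)          # rows and columns are swapped
print(list(t.index))    # the old column names become the row labels
print(list(t.columns))  # the old row labels become the column names
```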

Applying NumPy functions

If your DataFrame holds only numeric data, many NumPy functions can be applied to it directly:

In [105]: np.exp(df)

Out[105]:

A B C D

0 1.312403 0.653788 1.763006 1.318154

1 0.337092 0.509824 1.120358 0.227996

2 1.690438 1.498861 1.780770 0.179963

3 0.353713 0.690288 0.314148 0.260719

4 2.327710 2.932249 0.896686 5.173571

5 0.230066 1.429065 0.509360 0.169161

6 0.379495 0.274028 1.512461 1.318720

7 0.623732 0.986137 0.695904 0.993865

8 0.397301 2.449092 2.237242 0.299269

9 13.009059 4.183951 3.820223 0.310274

In [106]: np.asarray(df)

Out[106]:

array([[ 0.2719, -0.425 , 0.567 , 0.2762],

[-1.0874, -0.6737, 0.1136, -1.4784],

[ 0.525 , 0.4047, 0.577 , -1.715 ],

[-1.0393, -0.3706, -1.1579, -1.3443],

[ 0.8449, 1.0758, -0.109 , 1.6436],

[-1.4694, 0.357 , -0.6746, -1.7769],

[-0.9689, -1.2945, 0.4137, 0.2767],

[-0.472 , -0.014 , -0.3625, -0.0062],

[-0.9231, 0.8957, 0.8052, -1.2064],

[ 2.5656, 1.4313, 1.3403, -1.1703]])
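
The two calls above differ in what they return: an element-wise function such as np.exp gives back a DataFrame with the labels kept, while np.asarray drops the labels and returns a plain ndarray. A minimal sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame with string row labels.
df = pd.DataFrame({"A": [0.0, 1.0], "B": [2.0, 3.0]}, index=["r0", "r1"])

exp_df = np.exp(df)   # still a DataFrame; index and columns are preserved
arr = np.asarray(df)  # a plain NumPy array; the labels are gone

print(type(exp_df).__name__, list(exp_df.index), list(exp_df.columns))
print(type(arr).__name__, arr.shape)
```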

When a NumPy universal function (ufunc) is given more than one Series, the inputs are aligned on their index before the function is applied.

In [109]: ser1 = pd.Series([1, 2, 3], index=["a", "b", "c"])

In [110]: ser2 = pd.Series([1, 3, 5], index=["b", "a", "c"])

In [111]: ser1

Out[111]:

a 1

b 2

c 3

dtype: int64

In [112]: ser2

Out[112]:

b 1

a 3

c 5

dtype: int64

In [113]: np.remainder(ser1, ser2)

Out[113]:

a 1

b 0

c 3

dtype: int64

Labels that have no match in the other Series are filled with NaN:

In [114]: ser3 = pd.Series([2, 4, 6], index=["b", "c", "d"])

In [115]: ser3

Out[115]:

b 2

c 4

d 6

dtype: int64

In [116]: np.remainder(ser1, ser3)

Out[116]:

a NaN

b 0.0

c 3.0

d NaN

dtype: float64
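
If label alignment is not wanted, one way around it is to pass one operand as a plain array, so the values are paired by position. A minimal sketch that reuses ser1 and ser2 from above; the names aligned and positional are illustrative.

```python
import numpy as np
import pandas as pd

ser1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
ser2 = pd.Series([1, 3, 5], index=["b", "a", "c"])

# Aligned by label: a -> 1 % 3, b -> 2 % 1, c -> 3 % 5
aligned = np.remainder(ser1, ser2)

# Paired by position: 1 % 1, 2 % 3, 3 % 5 (the result keeps ser1's index)
positional = np.remainder(ser1, ser2.to_numpy())

print(aligned)
print(positional)
```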

When a binary ufunc is applied to a Series and an Index, the Series takes precedence and the result is returned as a Series:

In [117]: ser = pd.Series([1, 2, 3])

In [118]: idx = pd.Index([4, 5, 6])

In [119]: np.maximum(ser, idx)

Out[119]:

0 4

1 5

2 6

dtype: int64

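Console display

When a large DataFrame is displayed in the console, the output is truncated so that only the first and last few rows are shown:

In [120]: baseball = pd.read_csv("data/baseball.csv")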

In [121]: print(baseball)

id player year stint team lg g ab r h ... rbi sb cs bb so ibb hbp sh sf gidp

0 88641 womacto01 2006 2 CHN NL 19 50 6 14 ... 2.0 1.0 1.0 4 4.0 0.0 0.0 3.0 0.0 0.0

1 88643 schilcu01 2006 1 BOS AL 31 2 0 1 ... 0.0 0.0 0.0 0 1.0 0.0 0.0 0.0 0.0 0.0

.. ... ... ... ... ... .. .. ... .. ... ... ... ... ... .. ... ... ... ... ... ...

98 89533 aloumo01 2007 1 NYN NL 87 328 51 112 ... 49.0 3.0 0.0 27 30.0 5.0 2.0 0.0 3.0 13.0

99 89534 alomasa02 2007 1 NYN NL 8 22 1 3 ... 0.0 0.0 0.0 0 3.0 0.0 0.0 0.0 0.0 0.0

[100 rows x 23 columns]
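
How much gets folded is controlled by pandas display options. The sketch below assumes the default option values and uses a hypothetical 100-row frame rather than the baseball data. It shows two ways to see every row: to_string, and a temporary change to display.max_rows through option_context.

```python
import pandas as pd

# Hypothetical frame, long enough to be folded under the default settings.
df = pd.DataFrame({"x": range(100), "y": range(100, 200), "z": range(200, 300)})

truncated = repr(df)   # default settings fold the middle rows
full = df.to_string()  # always renders every row

# Lift the row limit only inside this block; the option is restored afterwards.
with pd.option_context("display.max_rows", None):
    untruncated = repr(df)

print(truncated)
print(len(full.splitlines()), len(untruncated.splitlines()))
```

pd.set_option("display.max_rows", ...) changes the same option for the rest of the session.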

Use the info method to display a summary:

In [122]: baseball.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 100 entries, 0 to 99

Data columns (total 23 columns):

Column Non-Null Count Dtype
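
info prints its summary and returns None, so capturing the text requires passing a buffer. A minimal sketch with a hypothetical three-row frame; the buf parameter is part of DataFrame.info, while the data and variable names are made up.

```python
import io

import pandas as pd

# Hypothetical frame with one missing value in "score".
df = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, None, 1.5]})

buffer = io.StringIO()
df.info(buf=buffer)          # write the summary into the buffer instead of stdout
summary = buffer.getvalue()

print(summary)
```

The Non-Null Count column is a quick way to spot columns with missing values: here score reports 2 non-null entries out of 3.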

Original: https://blog.csdn.net/weixin_28893597/article/details/113963466
Author: yishan li
Title: python randn(5)_Python Data Processing (Part 5)
