10 分钟 pandas-大致了解pandas能做的工作

pandas 中有两种对象, Series 和 DataFrame


s = pd.Series([1, 3, 5, np.nan, 6, 8])

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
Out[8]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
df.head()
df.tail(3)
df.index
df.columns
df.describe()

DataFrame 类型转为 Numpy 类型,DataFrame.to_numpy() 。Numpy 数组全部数据为一种类型,DataFrame 中列数据为一种类型。所以当 DataFrame 列有多种数据类型时,会将数据转到一种泛化类型


df.T

df.sort_index(axis=1, ascending=False)

df.sort_values(by="B")

3.1 获取数据


df["A"]

df[0:3]

3.2 根据标签获取数据


df.loc[dates[0]]
df.loc[:, ["A", "B"]]

df.loc[dates[0], "A"]

3.3 根据位置获取数据

loc根据标签值获取数据,iloc则根据位置获取数据

df.iloc[3]

df.iloc[3:5, 0:2]

df.iloc[[1, 2, 4], [0, 2]]

df.iloc[1:3, :]

df.iloc[1, 1]

3.4 布尔索引


df[df["A"] > 0]

df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]
df2[df2["E"].isin(["two", "four"])]

Out[44]:
                   A         B         C         D     E
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804   two
2013-01-05 -0.424972  0.567020  0.276232 -1.087401  four

3.5 增加新列

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))

Out[46]:
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

df["F"] = s1

不同方式设置值


df.at[dates[0], "A"] = 0

df.iat[0, 1] = 0

df.loc[:, "D"] = np.array([5] * len(df))

Out[51]:
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -1.509059  5  NaN
2013-01-02  1.212112 -0.173215  0.119209  5  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0
2013-01-05 -0.424972  0.567020  0.276232  5  4.0
2013-01-06 -0.673690  0.113648 -1.478427  5  5.0

通过where操作后设置值

df2 = df.copy()

df2[df2 > 0] = -df2

pandas 使用 np.nan 代表缺省值

重索引允许更改/添加/删除指定轴上的索引,返回数据的副本

Out[51]:
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -1.509059  5  NaN
2013-01-02  1.212112 -0.173215  0.119209  5  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0
2013-01-05 -0.424972  0.567020  0.276232  5  4.0
2013-01-06 -0.673690  0.113648 -1.478427  5  5.0

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])

df1.loc[dates[0] : dates[1], "E"] = 1

Out[57]:
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000 -1.509059  5  NaN  1.0
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0  NaN
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0  NaN
缺省值处理



df1.dropna(how="any")

df1.fillna(value=5)

pd.isna(df1)
Out[60]:
                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True

4.1 数据统计

此类操作会排除 nan 值


df.mean()

df.mean(1)

对于操作 pandas 会利用广播进行自动对齐

df = pd.DataFrame(np.random.randn(6, 4), index=pd.date_range("20130101", periods=6),columns=["A", "B", "C", "D"])
ds = pd.Series([1,2,3,4,5,6],index=pd.date_range("20130102", periods=6))
df["F"]=ds

dates=pd.date_range("20130101", periods=6)
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
Out[64]:
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

df.sub(s, axis="index")
Out[65]:
                   A         B         C    D    F
2013-01-01       NaN       NaN       NaN  NaN  NaN
2013-01-02       NaN       NaN       NaN  NaN  NaN
2013-01-03 -1.861849 -3.104569 -1.494929  4.0  1.0
2013-01-04 -2.278445 -3.706771 -4.039575  2.0  0.0
2013-01-05 -5.424972 -4.432980 -4.723768  0.0 -1.0
2013-01-06       NaN       NaN       NaN  NaN  NaN

4.2 Apply 方法

df.apply(np.cumsum)
Out[66]:
                   A         B         C   D     F
2013-01-01  0.000000  0.000000 -1.509059   5   NaN
2013-01-02  1.212112 -0.173215 -1.389850  10   1.0
2013-01-03  0.350263 -2.277784 -1.884779  15   3.0
2013-01-04  1.071818 -2.984555 -2.924354  20   6.0
2013-01-05  0.646846 -2.417535 -2.648122  25  10.0
2013-01-06 -0.026844 -2.303886 -4.126549  30  15.0

df.apply(lambda x: x.max() - x.min())
Out[67]:
A    2.073961
B    2.671590
C    1.785291
D    0.000000
F    4.000000
dtype: float64

4.3 Histogramming

s = pd.Series(np.random.randint(0, 7, size=10))
s.value_counts()
Out[70]:
4    5
2    2
6    2
1    1

4.4 字符串方法

s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
s.str.lower()
df = pd.DataFrame(np.random.randn(10, 4))

pieces = [df[:3], df[3:7], df[7:]]

pd.concat(pieces)

值得注意的是,在 pandas 添加新列是非常快的,而添加行需要用到复制操作,代价相对较高

SQL 方式的合并

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
pd.merge(left, right, on="key")

Out[86]:
   key  lval  rval
0  foo     1     4
1  bar     2     5

分组一般涉及以下一个或多个步骤的流程:

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)

df.groupby("A").sum()
Out[89]:
            C         D
A
bar  1.732707  1.073134
foo  2.824590 -0.574779

df.groupby(["A", "B"]).sum()
Out[90]:
                  C         D
A   B
bar one    1.511763  0.396823
    three -0.990582 -0.532532
    two    1.211526  1.208843
foo one    1.614581 -1.658537
    three  0.024580 -0.264610
    two    1.185429  1.348368

待补充

7.1 Stack

7.2Pivot tables

待补充

待补充

待补充

11.1 CSV


df.to_csv("foo.csv")

pd.read_csv("foo.csv")

11.2 HDF5

df.to_hdf("foo.h5", "df")
pd.read_hdf("foo.h5", "df")

11.3 Excel

df.to_excel("foo.xlsx", sheet_name="Sheet1")
pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"])

Original: https://blog.csdn.net/weixin_40994552/article/details/124888127
Author: 小何才露尖尖角
Title: 10 分钟 pandas-大致了解pandas能做的工作

原创文章受到原创版权保护。转载请注明出处:https://www.johngo689.com/742892/

转载文章受原作者版权保护。转载请注明原作者出处!

(0)

大家都在看

亲爱的 Coder【最近整理,可免费获取】👉 最新必读书单  | 👏 面试题下载  | 🌎 免费的AI知识星球