pandas has two primary data structures: Series (one-dimensional) and DataFrame (two-dimensional).
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
Out[8]:
A B C D
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
2013-01-05 -0.424972 0.567020 0.276232 -1.087401
2013-01-06 -0.673690 0.113648 -1.478427 0.524988
df.head()
df.tail(3)
df.index
df.columns
df.describe()
DataFrame.to_numpy() converts a DataFrame to a NumPy array. A NumPy array holds a single dtype for all its data, whereas a DataFrame holds one dtype per column. So when a DataFrame has columns of several different dtypes, to_numpy() promotes everything to the broadest common dtype (often object).
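A minimal sketch of this dtype promotion (a hypothetical two-column frame, not from the original article): a mixed int/string DataFrame converts to an object array, while a homogeneous float one keeps float64.

```python
import numpy as np
import pandas as pd

# Mixed dtypes: to_numpy() must fall back to the common "object" dtype.
df_mixed = pd.DataFrame({"x": [1, 2], "y": ["a", "b"]})
arr = df_mixed.to_numpy()
print(arr.dtype)  # object

# Homogeneous floats convert without any promotion.
df_float = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
print(df_float.to_numpy().dtype)  # float64
```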
df.T
df.sort_index(axis=1, ascending=False)
df.sort_values(by="B")
3.1 Selecting data
df["A"]
df[0:3]
3.2 Selection by label
df.loc[dates[0]]
df.loc[:, ["A", "B"]]
df.loc[dates[0], "A"]
3.3 Selection by position
loc selects rows and columns by label, while iloc selects them by integer position.
df.iloc[3]
df.iloc[3:5, 0:2]
df.iloc[[1, 2, 4], [0, 2]]
df.iloc[1:3, :]
df.iloc[1, 1]
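The loc/iloc distinction is easiest to see on an index whose labels are integers that don't match their positions. A minimal sketch (a hypothetical Series, not from the article):

```python
import pandas as pd

# Labels are [3, 2, 1], so label 1 and position 1 point at different rows.
s = pd.Series([10, 20, 30], index=[3, 2, 1])
print(s.loc[1])   # 30 -- the row whose *label* is 1
print(s.iloc[1])  # 20 -- the row at *position* 1
```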
3.4 Boolean indexing
df[df["A"] > 0]
df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]
df2[df2["E"].isin(["two", "four"])]
Out[44]:
A B C D E
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 two
2013-01-05 -0.424972 0.567020 0.276232 -1.087401 four
3.5 Adding a new column
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
Out[46]:
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
df["F"] = s1
Setting values in different ways
df.at[dates[0], "A"] = 0
df.iat[0, 1] = 0
df.loc[:, "D"] = np.array([5] * len(df))
Out[51]:
A B C D F
2013-01-01 0.000000 0.000000 -1.509059 5 NaN
2013-01-02 1.212112 -0.173215 0.119209 5 1.0
2013-01-03 -0.861849 -2.104569 -0.494929 5 2.0
2013-01-04 0.721555 -0.706771 -1.039575 5 3.0
2013-01-05 -0.424972 0.567020 0.276232 5 4.0
2013-01-06 -0.673690 0.113648 -1.478427 5 5.0
Setting values with a where operation
df2 = df.copy()
df2[df2 > 0] = -df2
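This where-style assignment negates every positive entry, so afterwards no value in df2 is positive. A minimal, checkable sketch:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.random.randn(4, 3), columns=list("ABC"))

# Wherever the boolean mask df2 > 0 is True, replace the value with -df2.
df2[df2 > 0] = -df2
print((df2 <= 0).all().all())  # True: every entry is now non-positive
```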
pandas uses np.nan to represent missing values.
Reindexing lets you change/add/delete the index on a specified axis; it returns a copy of the data.
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1.loc[dates[0] : dates[1], "E"] = 1
Out[57]:
A B C D F E
2013-01-01 0.000000 0.000000 -1.509059 5 NaN 1.0
2013-01-02 1.212112 -0.173215 0.119209 5 1.0 1.0
2013-01-03 -0.861849 -2.104569 -0.494929 5 2.0 NaN
2013-01-04 0.721555 -0.706771 -1.039575 5 3.0 NaN
Handling missing values
df1.dropna(how="any")
df1.fillna(value=5)
pd.isna(df1)
Out[60]:
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True
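The three tools above can be checked on a small deterministic frame (a hypothetical two-column example, not the df1 from the article):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, np.nan, 6.0]})

print(len(df1.dropna(how="any")))      # 1  -- only the last row has no NaN
print(df1.fillna(value=5).iloc[0, 1])  # 5.0 -- the NaN was replaced by 5
print(pd.isna(df1).sum().sum())        # 3  -- total number of missing cells
```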
4.1 Statistics
Operations like these exclude NaN values by default.
df.mean()
df.mean(axis=1)
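The axis argument decides the direction of the reduction: the default reduces each column, axis=1 reduces each row. A minimal sketch with fixed numbers:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 3.0], "B": [2.0, 4.0]})

print(df.mean().tolist())        # [2.0, 3.0] -- one mean per column
print(df.mean(axis=1).tolist())  # [1.5, 3.5] -- one mean per row
```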
For operations between objects with different indexes, pandas automatically aligns them and broadcasts along the specified dimension.
df = pd.DataFrame(np.random.randn(6, 4), index=pd.date_range("20130101", periods=6), columns=["A", "B", "C", "D"])
ds = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
df["F"] = ds
dates = pd.date_range("20130101", periods=6)
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
Out[64]:
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
df.sub(s, axis="index")
Out[65]:
A B C D F
2013-01-01 NaN NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN NaN
2013-01-03 -1.861849 -3.104569 -1.494929 4.0 1.0
2013-01-04 -2.278445 -3.706771 -4.039575 2.0 0.0
2013-01-05 -5.424972 -4.432980 -4.723768 0.0 -1.0
2013-01-06 NaN NaN NaN NaN NaN
4.2 Apply 方法
df.apply(np.cumsum)
Out[66]:
A B C D F
2013-01-01 0.000000 0.000000 -1.509059 5 NaN
2013-01-02 1.212112 -0.173215 -1.389850 10 1.0
2013-01-03 0.350263 -2.277784 -1.884779 15 3.0
2013-01-04 1.071818 -2.984555 -2.924354 20 6.0
2013-01-05 0.646846 -2.417535 -2.648122 25 10.0
2013-01-06 -0.026844 -2.303886 -4.126549 30 15.0
df.apply(lambda x: x.max() - x.min())
Out[67]:
A 2.073961
B 2.671590
C 1.785291
D 0.000000
F 4.000000
dtype: float64
4.3 Histogramming
s = pd.Series(np.random.randint(0, 7, size=10))
s.value_counts()
Out[70]:
4 5
2 2
6 2
1 1
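Since the example above uses random integers, here is the same idea on fixed data so the counts are reproducible:

```python
import pandas as pd

s = pd.Series([4, 4, 4, 2, 2, 6, 1])
counts = s.value_counts()  # sorted by frequency, descending

print(counts[4], counts[2], counts[6], counts[1])  # 3 2 1 1
```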
4.4 String methods
s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
s.str.lower()
5 Merge
df = pd.DataFrame(np.random.randn(10, 4))
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)
Note that adding a column to a DataFrame is very fast, whereas adding a row requires a copy and is comparatively expensive.
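The concat round trip above can be verified directly: splitting a frame into row slices and concatenating them back reproduces the original exactly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 4))
pieces = [df[:3], df[3:7], df[7:]]

restored = pd.concat(pieces)
print(restored.equals(df))  # True: same values, same index order
```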
SQL-style merges
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
pd.merge(left, right, on="key")
Out[86]:
key lval rval
0 foo 1 4
1 bar 2 5
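One behavior worth knowing that the example above doesn't show (a hypothetical case, not from the article): when the join key is not unique, merge produces the cartesian product of the matching rows.

```python
import pandas as pd

# "foo" appears twice on each side, so each left row matches each right row.
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})

merged = pd.merge(left, right, on="key")
print(len(merged))  # 4 -- 2 left rows x 2 right rows
```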
6 Grouping
"Group by" generally involves one or more of the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
df = pd.DataFrame(
{
"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
"B": ["one", "one", "two", "three", "two", "two", "one", "three"],
"C": np.random.randn(8),
"D": np.random.randn(8),
}
)
df.groupby("A").sum()
Out[89]:
C D
A
bar 1.732707 1.073134
foo 2.824590 -0.574779
df.groupby(["A", "B"]).sum()
Out[90]:
C D
A B
bar one 1.511763 0.396823
three -0.990582 -0.532532
two 1.211526 1.208843
foo one 1.614581 -1.658537
three 0.024580 -0.264610
two 1.185429 1.348368
To be added.
7.1 Stack
7.2 Pivot tables
To be added.
To be added.
To be added.
11.1 CSV
df.to_csv("foo.csv")
pd.read_csv("foo.csv")
11.2 HDF5
df.to_hdf("foo.h5", "df")
pd.read_hdf("foo.h5", "df")
11.3 Excel
df.to_excel("foo.xlsx", sheet_name="Sheet1")
pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"])
Original: https://blog.csdn.net/weixin_40994552/article/details/124888127
Author: 小何才露尖尖角
Title: 10 Minutes to pandas: a quick overview of what pandas can do