10 minutes to pandas, part 01

This part covers the following three sections:
1. Object creation
2. Viewing data
3. Selection

Import the packages:

import numpy as np
import pandas as pd

1. Object creation

Creating a Series by passing a list of values, letting pandas create a default integer index:

s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
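
The float64 dtype in the output above comes from the np.nan entry: NumPy integer dtypes cannot represent a missing value, so pandas upcasts the whole Series to float. A minimal sketch of that behavior (the names ints and with_nan are illustrative, not from the original):

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 3, 5])              # no missing values: integer dtype
with_nan = pd.Series([1, 3, 5, np.nan])  # NaN forces an upcast to float64

print(ints.dtype)
print(with_nan.dtype)
print(with_nan.isna().sum())  # number of missing entries
```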

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns. First, create the datetime index:

dates = pd.date_range("2013/01/01", periods=6, freq="D")
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

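Build the DataFrame on that index. Because the values are random and unseeded, your numbers will differ; some outputs later in this article appear to come from separate runs, so their values and index dates do not always match this table.

# list("ABCD") is shorthand for ["A", "B", "C", "D"]
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df

                   A         B         C         D
2013-01-01 -0.520896 -0.340412 -1.265841 -0.419562
2013-01-02 -0.270485  1.139635 -0.099596 -0.622623
2013-01-03  1.380236 -1.922205  1.406446 -1.534292
2013-01-04  1.049023  0.363657 -0.479516 -0.243051
2013-01-05  0.720896  0.821581  0.369389 -0.133051
2013-01-06 -0.337006 -0.329537  1.296696 -2.602595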

Creating a DataFrame by passing a dictionary of objects that can be converted into a series-like structure:

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df2

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

The columns of the resulting DataFrame have different dtypes:


df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
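
Because each column carries its own dtype, columns can be filtered by type. The sketch below rebuilds df2 as defined above and uses DataFrame.select_dtypes to keep only the numeric columns; the variable name numeric is an illustrative choice, not part of the original.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

# Keep only columns whose dtype is numeric (float64, float32, int32 here).
numeric = df2.select_dtypes(include="number")
print(list(numeric.columns))
```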

If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Typing df2. and pressing Tab lists the columns A, B, C, D, E and F together with the DataFrame’s public attributes and methods, describe among them:

df2.describe()

         A    C    D
count  4.0  4.0  4.0
mean   1.0  1.0  3.0
std    0.0  0.0  0.0
min    1.0  1.0  3.0
25%    1.0  1.0  3.0
50%    1.0  1.0  3.0
75%    1.0  1.0  3.0
max    1.0  1.0  3.0

In the output shown here, the summary covers columns A, C and D.

2. Viewing data

Here is how to view the top and bottom rows of the frame:


df.head()

                   A         B         C         D
2013-01-01 -0.520896 -0.340412 -1.265841 -0.419562
2013-01-02 -0.270485  1.139635 -0.099596 -0.622623
2013-01-03  1.380236 -1.922205  1.406446 -1.534292
2013-01-04  1.049023  0.363657 -0.479516 -0.243051
2013-01-05  0.720896  0.821581  0.369389 -0.133051


df.tail(3)

                   A         B         C         D
2013-01-04  1.049023  0.363657 -0.479516 -0.243051
2013-01-05  0.720896  0.821581  0.369389 -0.133051
2013-01-06 -0.337006 -0.329537  1.296696 -2.602595

Display the index and the columns:


df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')

DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note that this can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between pandas and NumPy: NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.

For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn’t require copying data:

df.to_numpy()

For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive:

df2.to_numpy()
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

Note:
DataFrame.to_numpy() does not include the index or column labels in the output.
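
The dtype difference described above can be checked on two small frames. This is a minimal sketch; the frames floats and mixed are made up for illustration and are not the df and df2 used elsewhere in this article.

```python
import numpy as np
import pandas as pd

floats = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
mixed = pd.DataFrame({"x": [1.0, 2.0], "label": ["a", "b"]})

print(floats.to_numpy().dtype)  # one float dtype fits every column
print(mixed.to_numpy().dtype)   # no common numeric dtype, so object is used
print(floats.to_numpy().shape)  # values only: no index or column labels
```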

describe() shows a quick statistical summary of your data:

df.describe()

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.336962 -0.044547  0.204596 -0.925862
std    0.812650  1.096672  1.037980  0.961737
min   -0.520896 -1.922205 -1.265841 -2.602595
25%   -0.320375 -0.337693 -0.384536 -1.306375
50%    0.225206  0.017060  0.134896 -0.521093
75%    0.966991  0.707100  1.064869 -0.287179
max    1.380236  1.139635  1.406446 -0.133051

Transposing your data:

df.T

   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A   -0.520896   -0.270485    1.380236    1.049023    0.720896   -0.337006
B   -0.340412    1.139635   -1.922205    0.363657    0.821581   -0.329537
C   -1.265841   -0.099596    1.406446   -0.479516    0.369389    1.296696
D   -0.419562   -0.622623   -1.534292   -0.243051   -0.133051   -2.602595

Sorting by an axis (axis=1 sorts the column labels, axis=0 sorts the row labels):

df.sort_index(axis=1, ascending=False)

                   D         C         B         A
2013-01-01 -0.419562 -1.265841 -0.340412 -0.520896
2013-01-02 -0.622623 -0.099596  1.139635 -0.270485
2013-01-03 -1.534292  1.406446 -1.922205  1.380236
2013-01-04 -0.243051 -0.479516  0.363657  1.049023
2013-01-05 -0.133051  0.369389  0.821581  0.720896
2013-01-06 -2.602595  1.296696 -0.329537 -0.337006


df.sort_index(axis=0, ascending=False)

                   A         B         C         D
2013-01-06 -0.337006 -0.329537  1.296696 -2.602595
2013-01-05  0.720896  0.821581  0.369389 -0.133051
2013-01-04  1.049023  0.363657 -0.479516 -0.243051
2013-01-03  1.380236 -1.922205  1.406446 -1.534292
2013-01-02 -0.270485  1.139635 -0.099596 -0.622623
2013-01-01 -0.520896 -0.340412 -1.265841 -0.419562

Sorting by values:

df.sort_values(by="B")

                   A         B         C         D
2013-01-03  1.380236 -1.922205  1.406446 -1.534292
2013-01-01 -0.520896 -0.340412 -1.265841 -0.419562
2013-01-06 -0.337006 -0.329537  1.296696 -2.602595
2013-01-04  1.049023  0.363657 -0.479516 -0.243051
2013-01-05  0.720896  0.821581  0.369389 -0.133051
2013-01-02 -0.270485  1.139635 -0.099596 -0.622623
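
The three sorts are easier to compare on fixed data than on random values. The following is a small sketch with a hypothetical frame (its labels r0, r1, r2 and its values are invented for illustration):

```python
import pandas as pd

frame = pd.DataFrame(
    {"B": [1, 3, 2], "A": [9, 8, 7]},
    index=["r2", "r0", "r1"],
)

by_columns = frame.sort_index(axis=1)  # reorder column labels: A, B
by_rows = frame.sort_index(axis=0)     # reorder row labels: r0, r1, r2
by_values = frame.sort_values(by="B")  # reorder rows by the values in B

print(by_columns.columns.tolist())
print(by_rows.index.tolist())
print(by_values.index.tolist())
```

None of these calls modifies frame; each returns a new, reordered DataFrame.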

3. Selection

Note:
While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access methods: .at, .iat, .loc and .iloc. See the indexing documentation: Indexing and Selecting Data, and MultiIndex / Advanced Indexing.
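
As a quick orientation, the four accessors named in the note differ along two lines: label versus position, and general selection versus single-scalar access. A minimal sketch on a made-up frame (the data and variable names are illustrative assumptions):

```python
import pandas as pd

frame = pd.DataFrame(
    {"A": [10, 20, 30], "B": [1.5, 2.5, 3.5]},
    index=pd.date_range("2013-01-01", periods=3),
)
second_day = pd.Timestamp("2013-01-02")

row_by_label = frame.loc[second_day]       # label-based, returns a Series
row_by_position = frame.iloc[1]            # position-based, same row
cell_by_label = frame.at[second_day, "A"]  # scalar access by label
cell_by_position = frame.iat[1, 0]         # scalar access by position

print(row_by_label.equals(row_by_position))
print(cell_by_label, cell_by_position)
```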

Selecting a single column, which yields a Series, equivalent to df.A:

df["A"]
2013-01-01   -0.520896
2013-01-02   -0.270485
2013-01-03    1.380236
2013-01-04    1.049023
2013-01-05    0.720896
2013-01-06   -0.337006
Freq: D, Name: A, dtype: float64

Selecting via [], which slices the rows:

df[0:3]

                   A         B         C         D
2013-01-01 -0.520896 -0.340412 -1.265841 -0.419562
2013-01-02 -0.270485  1.139635 -0.099596 -0.622623
2013-01-03  1.380236 -1.922205  1.406446 -1.534292

df["2013-01-02":"2013-05-04"]

                   A         B         C         D
2013-01-02 -0.270485  1.139635 -0.099596 -0.622623
2013-01-03  1.380236 -1.922205  1.406446 -1.534292
2013-01-04  1.049023  0.363657 -0.479516 -0.243051
2013-01-05  0.720896  0.821581  0.369389 -0.133051
2013-01-06 -0.337006 -0.329537  1.296696 -2.602595

Selection by label

See more in Selection by Label.

For getting a cross section using a label:

df.loc[dates[0]]
A   -0.520896
B   -0.340412
C   -1.265841
D   -0.419562
Name: 2013-01-01 00:00:00, dtype: float64

Selecting on a multi-axis by label:

df.loc[:, ["A", "B"]]

                   A         B
2013-01-31 -0.512502 -1.073798
2013-02-28  1.671920 -1.603149
2013-03-31  0.116484 -0.519765
2013-04-30  0.383318  0.410609
2013-05-31 -0.818920 -2.595957
2013-06-30  1.059115  0.402510

Label slicing includes both endpoints:

df.loc["20130102":"20130104", ["A", "B"]]

                   A         B
2013-01-02 -0.270485  1.139635
2013-01-03  1.380236 -1.922205
2013-01-04  1.049023  0.363657
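
The inclusive end point is specific to label slicing; position slices follow the usual Python rule and exclude the stop. A small sketch contrasting the two on a hypothetical six-day frame:

```python
import pandas as pd

frame = pd.DataFrame(
    {"A": range(6)},
    index=pd.date_range("2013-01-01", periods=6, freq="D"),
)

by_label = frame.loc["2013-01-02":"2013-01-04"]  # both endpoints included
by_position = frame.iloc[1:3]                    # stop position excluded

print(len(by_label), len(by_position))
```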

Selecting a single row label reduces the dimensions of the returned object (a Series instead of a DataFrame):

df.loc["20130102", ["A", "B"]]
A   -0.270485
B    1.139635
Name: 2013-01-02 00:00:00, dtype: float64

For getting a scalar value:

df.loc[dates[0], "A"]
-0.52089556678858

For getting fast access to a scalar (equivalent to the prior method):

df.at[dates[0], "A"]
-0.52089556678858

Selection by position

See more in Selection by Position.

Select via the position of the passed integers:

df.iloc[3]
A    1.049023
B    0.363657
C   -0.479516
D   -0.243051
Name: 2013-01-04 00:00:00, dtype: float64

By integer slices, acting similarly to NumPy/Python:

df.iloc[3:5, 0:2]

By lists of integer position locations, similar to the NumPy/Python style:

df.iloc[[1, 2, 4], [0, 2]]

                   A         C
2013-01-02 -0.440009 -0.094901
2013-01-03 -1.095589  1.443271
2013-01-05 -0.826357  2.082919

For slicing rows explicitly:

df.iloc[1:3, :]

                   A         B         C         D
2013-01-02 -0.440009  0.666086 -0.094901  1.087610
2013-01-03 -1.095589  0.708428  1.443271 -0.012472

For slicing columns explicitly:

df.iloc[:, 1:3]

                   B         C
2013-01-01  0.129966  0.749187
2013-01-02  0.666086 -0.094901
2013-01-03  0.708428  1.443271
2013-01-04 -0.339991  0.584877
2013-01-05  0.072159  2.082919
2013-01-06 -0.746247  0.195187

For getting a value explicitly:

df.iloc[1, 1]
0.6660861685291358

For getting fast access to a scalar (equivalent to the prior method):

df.iat[1, 1]
0.6660861685291358

Boolean indexing

Using a single column’s values to select data:

df[df["A"] > 0]

                   A         B         C         D
2013-02-28  1.671920 -1.603149 -0.154643 -0.752101
2013-03-31  0.116484 -0.519765  0.918146 -0.717562
2013-04-30  0.383318  0.410609  0.071098 -0.029965
2013-06-30  1.059115  0.402510  0.773409 -1.164358

Selecting values from a DataFrame where a boolean condition is met:

df[df > 0]

                   A         B         C   D
2013-01-31       NaN       NaN  1.407725 NaN
2013-02-28  1.671920       NaN       NaN NaN
2013-03-31  0.116484       NaN  0.918146 NaN
2013-04-30  0.383318  0.410609  0.071098 NaN
2013-05-31       NaN       NaN  0.362031 NaN
2013-06-30  1.059115  0.402510  0.773409 NaN
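
The two forms of boolean indexing behave differently: a boolean Series drops whole rows, while a boolean DataFrame keeps the shape and fills non-matching cells with NaN. A minimal sketch on fixed, made-up data:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [-1.0, 5.0, -6.0]})

rows = frame[frame["A"] > 0]  # Series mask: keeps whole rows where A > 0
cells = frame[frame > 0]      # DataFrame mask: same shape, NaN where False

print(rows.index.tolist())
print(cells.isna().sum().sum())  # count of cells that failed the condition
print(cells.equals(frame.where(frame > 0)))
```

The last line shows that the DataFrame-mask form gives the same result as DataFrame.where.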

Using the isin() method for filtering:

df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]
df2

                   A         B         C         D      E
2013-01-01 -0.633072  0.129966  0.749187  1.201542    one
2013-01-02 -0.440009  0.666086 -0.094901  1.087610    one
2013-01-03 -1.095589  0.708428  1.443271 -0.012472    two
2013-01-04 -0.012166 -0.339991  0.584877 -0.930127  three
2013-01-05 -0.826357  0.072159  2.082919 -0.478526   four
2013-01-06 -0.357370 -0.746247  0.195187 -1.009280  three

df2[df2["E"].isin(["two", "four"])]

                   A         B         C         D     E
2013-01-03 -1.095589  0.708428  1.443271 -0.012472   two
2013-01-05 -0.826357  0.072159  2.082919 -0.478526  four
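
isin() returns a boolean Series, so it can be stored, reused, and inverted like any other mask. A short sketch using only the E column from the example above (the names mask, selected and excluded are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"E": ["one", "one", "two", "three", "four", "three"]})

mask = frame["E"].isin(["two", "four"])
selected = frame[mask]   # rows whose E is "two" or "four"
excluded = frame[~mask]  # ~ inverts the mask: every other row

print(selected.index.tolist())
print(excluded.index.tolist())
```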

Setting a new column automatically aligns the data by the indexes:

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
df["F"] = s1
df

                   A         B         C         D    F
2013-01-01 -0.633072  0.129966  0.749187  1.201542  NaN
2013-01-02 -0.440009  0.666086 -0.094901  1.087610  1.0
2013-01-03 -1.095589  0.708428  1.443271 -0.012472  2.0
2013-01-04 -0.012166 -0.339991  0.584877 -0.930127  3.0
2013-01-05 -0.826357  0.072159  2.082919 -0.478526  4.0
2013-01-06 -0.357370 -0.746247  0.195187 -1.009280  5.0
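
Alignment means the new values are matched to rows by index label, not by position: labels missing from the Series become NaN, and Series labels absent from the frame are dropped. A minimal sketch with a hypothetical three-row frame and a Series shifted by one day:

```python
import pandas as pd

frame = pd.DataFrame(
    {"A": [10, 20, 30]},
    index=pd.date_range("2013-01-01", periods=3),
)
shifted = pd.Series([1, 2, 3], index=pd.date_range("2013-01-02", periods=3))

frame["F"] = shifted  # matched by index label, not by position

print(frame)
```

Here 2013-01-01 has no counterpart in shifted and gets NaN, while the Series value for 2013-01-04 has no row to land in and is discarded.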

Setting values by label:

df.at[dates[0], "A"] = 0
df

                   A         B         C         D    F
2013-01-01  0.000000  0.129966  0.749187  1.201542  NaN
2013-01-02 -0.440009  0.666086 -0.094901  1.087610  1.0
2013-01-03 -1.095589  0.708428  1.443271 -0.012472  2.0
2013-01-04 -0.012166 -0.339991  0.584877 -0.930127  3.0
2013-01-05 -0.826357  0.072159  2.082919 -0.478526  4.0
2013-01-06 -0.357370 -0.746247  0.195187 -1.009280  5.0

Setting values by position:

df.iat[0, 1] = 0
df

                   A         B         C         D    F
2013-01-01  0.000000  0.000000  0.749187  1.201542  NaN
2013-01-02 -0.440009  0.666086 -0.094901  1.087610  1.0
2013-01-03 -1.095589  0.708428  1.443271 -0.012472  2.0
2013-01-04 -0.012166 -0.339991  0.584877 -0.930127  3.0
2013-01-05 -0.826357  0.072159  2.082919 -0.478526  4.0
2013-01-06 -0.357370 -0.746247  0.195187 -1.009280  5.0

Setting by assigning with a NumPy array:

df.loc[:, "D"] = np.array([5] * len(df))
df

                   A         B         C  D    F
2013-01-01  0.000000  0.000000  0.749187  5  NaN
2013-01-02 -0.440009  0.666086 -0.094901  5  1.0
2013-01-03 -1.095589  0.708428  1.443271  5  2.0
2013-01-04 -0.012166 -0.339991  0.584877  5  3.0
2013-01-05 -0.826357  0.072159  2.082919  5  4.0
2013-01-06 -0.357370 -0.746247  0.195187  5  5.0

A where operation with setting:

df2 = df.copy()
df2[df2 > 0] = -df2
df2

                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -0.749187 -5  NaN
2013-01-02 -0.440009 -0.666086 -0.094901 -5 -1.0
2013-01-03 -1.095589 -0.708428 -1.443271 -5 -2.0
2013-01-04 -0.012166 -0.339991 -0.584877 -5 -3.0
2013-01-05 -0.826357 -0.072159 -2.082919 -5 -4.0
2013-01-06 -0.357370 -0.746247 -0.195187 -5 -5.0

df2[df2 < 0] = -df2 + 1
df2

                   A         B         C  D    F
2013-01-01  0.000000  0.000000  1.749187  6  NaN
2013-01-02  1.440009  1.666086  1.094901  6  2.0
2013-01-03  2.095589  1.708428  2.443271  6  3.0
2013-01-04  1.012166  1.339991  1.584877  6  4.0
2013-01-05  1.826357  1.072159  3.082919  6  5.0
2013-01-06  1.357370  1.746247  1.195187  6  6.0
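
In a masked assignment such as the ones above, only the cells where the mask is True are overwritten; the right-hand side is aligned to the frame and every other cell keeps its value. A small sketch on fixed, made-up data (the names frame and negated are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, -2.0, 0.0], "B": [-3.0, 4.0, 5.0]})

negated = frame.copy()            # work on a copy so frame stays unchanged
negated[negated > 0] = -negated   # flip the sign of positive cells only

print(negated)
```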

Original: https://blog.csdn.net/u012338969/article/details/124558022
Author: 雪龙无敌
Title: pandas10minnutes_中英对照01
