利用python进行股票分析(四)pandas

文章目录

4.1. 环境配置

pip3 install pandas

4.2. pandas基础

4.2.1. 简单的pandas


import pandas as pd
test_data = {
    "country":["China","China","China"],
    "sites":["baidu","sougou","hao123"],
    "rank":[1,3,2]
}
df1 = pd.DataFrame(test_data)
print(df1)

  country   sites  rank
0   China   baidu     1
1   China  sougou     3
2   China  hao123     2

4.2.2. pandas Series简单示例


se1 = pd.Series(["Hello","world"], index=["a","b"])
print(se1)
print(se1["a"])

a    Hello
b    world
dtype: object
Hello

4.2.3. 通过字典来创建Series


dict1 = {1:"hello", 2:"world", "3":"kelvin"}
se1 = pd.Series(dict1)
se1

1     hello
2     world
3    kelvin
dtype: object

4.2.4. DataFrame,如果不指定index的话,默认就是从0开始编号


my_data = [["kelvin","31"],["tom","29"],["kipper","13"]]
my_column = ["name", "age"]
df = pd.DataFrame(data = my_data, columns = my_column)
print(df)

     name age
0  kelvin  31
1     tom  29
2  kipper  13

4.2.5. DataFrame获取数据 df.loc[0]


my_data = [["kelvin","31"],["tom","29"],["kipper","13"]]
my_column = ["name", "age"]
df = pd.DataFrame(data = my_data, columns = my_column)
print(df)
print()
print(df.loc[0])
print()
print(df.loc[0]["name"])

     name age
0  kelvin  31
1     tom  29
2  kipper  13

name    kelvin
age         31
Name: 0, dtype: object

kelvin

4.2.6. DataFrame返回多行数据


data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
df.loc[["day1","day2"]]
print(df)

      calories  duration
day1       420        50
day2       380        40
day3       390        45
      calories  duration
day1       420        50
day2       380        40
day3       390        45

4.2.7. DataFrame写入到csv


name = ["kelvin", "tom", "kipper"]
age = [31, 29, 15]
region = ["江苏", "上海", "杭州"]
dict1 = {"name":name, "age":age, "region":region}
df = pd.DataFrame(dict1)
print(df)
df.to_csv("test_csv.csv")

     name  age region
0  kelvin   31     江苏
1     tom   29     上海
2  kipper   15     杭州

4.2.8. pandas读取csv, read_csv, head, tail


df = pd.read_csv("nba.csv")
df.head(10)
print(df)

              Name            Team  Number Position   Age Height  Weight  \
0    Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0
1      Jae Crowder  Boston Celtics    99.0       SF  25.0    6-6   235.0
2     John Holland  Boston Celtics    30.0       SG  27.0    6-5   205.0
3      R.J. Hunter  Boston Celtics    28.0       SG  22.0    6-5   185.0
4    Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0
..             ...             ...     ...      ...   ...    ...     ...

453   Shelvin Mack       Utah Jazz     8.0       PG  26.0    6-3   203.0
454      Raul Neto       Utah Jazz    25.0       PG  24.0    6-1   179.0
455   Tibor Pleiss       Utah Jazz    21.0        C  26.0    7-3   256.0
456    Jeff Withey       Utah Jazz    24.0        C  26.0    7-0   231.0
457            NaN             NaN     NaN      NaN   NaN    NaN     NaN

               College     Salary
0                Texas  7730337.0
1            Marquette  6796117.0
2    Boston University        NaN
3        Georgia State  1148640.0
4                  NaN  5000000.0
..                 ...        ...

453             Butler  2433333.0
454                NaN   900000.0
455                NaN  2900000.0
456             Kansas   947276.0
457                NaN        NaN

[458 rows x 9 columns]

4.2.9. 构造DataFrame:从json构建DataFrame


json1 = [
    {"name":"kelvin","age":"31","region":"江苏"},
    {"name":"tom","age":"29","region":"上海","unkonwn":"111"},
    {"name":"kipper","age":"15","region":"杭州"}
]
df = pd.DataFrame(json1)
print(df)
df.to_json()

     name age region unkonwn
0  kelvin  31     江苏     NaN
1     tom  29     上海     111
2  kipper  15     杭州     NaN
'{"name":{"0":"kelvin","1":"tom","2":"kipper"},"age":{"0":"31","1":"29","2":"15"},"region":{"0":"\\u6c5f\\u82cf","1":"\\u4e0a\\u6d77","2":"\\u676d\\u5dde"},"unkonwn":{"0":null,"1":"111","2":null}}'

4.2.10. 构造DataFrame:字典格式的json


json1 = {
    "2021-07-01":{"name":"kelvin","age":"31","region":"江苏"},
    "2021-07-02":{"name":"tom","age":"29","region":"上海","unkonwn":"111"},
    "2021-07-03":{"name":"kipper","age":"15","region":"杭州"}
}
df = pd.DataFrame(json1).T
print(df)

              name age region unkonwn
2021-07-01  kelvin  31     江苏     NaN
2021-07-02     tom  29     上海     111
2021-07-03  kipper  15     杭州     NaN

4.2.11. DataFrame去掉空数据 dropna()


df = pd.read_csv("property-data.csv")
print(df)
print(df.isnull())
print()
print(df.dropna())

           PID  ST_NUM     ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0  100001000.0   104.0      PUTNAM            Y            3        1  1000
1  100002000.0   197.0   LEXINGTON            N            3      1.5    --
2  100003000.0     NaN   LEXINGTON            N          NaN        1   850
3  100004000.0   201.0    BERKELEY           12            1      NaN   700
4          NaN   203.0    BERKELEY            Y            3        2  1600
5  100006000.0   207.0    BERKELEY            Y          NaN        1   800
6  100007000.0     NaN  WASHINGTON          NaN            2   HURLEY   950
7  100008000.0   213.0     TREMONT            Y            1        1   NaN
8  100009000.0   215.0     TREMONT            Y           na        2  1800
     PID  ST_NUM  ST_NAME  OWN_OCCUPIED  NUM_BEDROOMS  NUM_BATH  SQ_FT
0  False   False    False         False         False     False  False
1  False   False    False         False         False     False  False
2  False    True    False         False          True     False  False
3  False   False    False         False         False      True  False
4   True   False    False         False         False     False  False
5  False   False    False         False          True     False  False
6  False    True    False          True         False     False  False
7  False   False    False         False         False     False   True
8  False   False    False         False         False     False  False

           PID  ST_NUM    ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0  100001000.0   104.0     PUTNAM            Y            3        1  1000
1  100002000.0   197.0  LEXINGTON            N            3      1.5    --
8  100009000.0   215.0    TREMONT            Y           na        2  1800

4.2.12. 读取csv的时候,可以指定哪些是空数据


missing_value = ["na","--"]
df = pd.read_csv("property-data.csv", na_values = missing_value)
print(df)

           PID  ST_NUM     ST_NAME OWN_OCCUPIED  NUM_BEDROOMS NUM_BATH   SQ_FT
0  100001000.0   104.0      PUTNAM            Y           3.0        1  1000.0
1  100002000.0   197.0   LEXINGTON            N           3.0      1.5     NaN
2  100003000.0     NaN   LEXINGTON            N           NaN        1   850.0
3  100004000.0   201.0    BERKELEY           12           1.0      NaN   700.0
4          NaN   203.0    BERKELEY            Y           3.0        2  1600.0
5  100006000.0   207.0    BERKELEY            Y           NaN        1   800.0
6  100007000.0     NaN  WASHINGTON          NaN           2.0   HURLEY   950.0
7  100008000.0   213.0     TREMONT            Y           1.0        1     NaN
8  100009000.0   215.0     TREMONT            Y           NaN        2  1800.0

4.2.13. 指定列,如果有空数据,则删除整行


df = pd.read_csv("property-data.csv")
print(df)
print()
df.dropna(subset=["ST_NUM"])
print(df)

           PID  ST_NUM     ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0  100001000.0   104.0      PUTNAM            Y            3        1  1000
1  100002000.0   197.0   LEXINGTON            N            3      1.5    --
2  100003000.0     NaN   LEXINGTON            N          NaN        1   850
3  100004000.0   201.0    BERKELEY           12            1      NaN   700
4          NaN   203.0    BERKELEY            Y            3        2  1600
5  100006000.0   207.0    BERKELEY            Y          NaN        1   800
6  100007000.0     NaN  WASHINGTON          NaN            2   HURLEY   950
7  100008000.0   213.0     TREMONT            Y            1        1   NaN
8  100009000.0   215.0     TREMONT            Y           na        2  1800

           PID  ST_NUM     ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0  100001000.0   104.0      PUTNAM            Y            3        1  1000
1  100002000.0   197.0   LEXINGTON            N            3      1.5    --
2  100003000.0     NaN   LEXINGTON            N          NaN        1   850
3  100004000.0   201.0    BERKELEY           12            1      NaN   700
4          NaN   203.0    BERKELEY            Y            3        2  1600
5  100006000.0   207.0    BERKELEY            Y          NaN        1   800
6  100007000.0     NaN  WASHINGTON          NaN            2   HURLEY   950
7  100008000.0   213.0     TREMONT            Y            1        1   NaN
8  100009000.0   215.0     TREMONT            Y           na        2  1800

4.2.14. 填充NaN,空数据,替换


df = pd.read_csv("property-data.csv")
print(df)
df["ST_NUM"].fillna(10000, inplace=True)
print()
print(df)

           PID  ST_NUM     ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0  100001000.0   104.0      PUTNAM            Y            3        1  1000
1  100002000.0   197.0   LEXINGTON            N            3      1.5    --
2  100003000.0     NaN   LEXINGTON            N          NaN        1   850
3  100004000.0   201.0    BERKELEY           12            1      NaN   700
4          NaN   203.0    BERKELEY            Y            3        2  1600
5  100006000.0   207.0    BERKELEY            Y          NaN        1   800
6  100007000.0     NaN  WASHINGTON          NaN            2   HURLEY   950
7  100008000.0   213.0     TREMONT            Y            1        1   NaN
8  100009000.0   215.0     TREMONT            Y           na        2  1800

           PID   ST_NUM     ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0  100001000.0    104.0      PUTNAM            Y            3        1  1000
1  100002000.0    197.0   LEXINGTON            N            3      1.5    --
2  100003000.0  10000.0   LEXINGTON            N          NaN        1   850
3  100004000.0    201.0    BERKELEY           12            1      NaN   700
4          NaN    203.0    BERKELEY            Y            3        2  1600
5  100006000.0    207.0    BERKELEY            Y          NaN        1   800
6  100007000.0  10000.0  WASHINGTON          NaN            2   HURLEY   950
7  100008000.0    213.0     TREMONT            Y            1        1   NaN
8  100009000.0    215.0     TREMONT            Y           na        2  1800

4.2.15. 修改DataFrame中的错误数据


person = {
  "name": ['Google', 'Runoob' , 'Taobao'],
  "age": [50, 40, 12345]
}
df = pd.DataFrame(person)
print(df)
print()
df.loc[2,"age"] = 30
print(df)

     name    age
0  Google     50
1  Runoob     40
2  Taobao  12345

     name  age
0  Google   50
1  Runoob   40
2  Taobao   30

4.2.16. DataFrame的遍历,遍历DataFrame


person = {
  "name": ['Google', 'Runoob' , 'Taobao'],
  "age": [50, 200, 12345]
}
df = pd.DataFrame(person)
print(df)
print(df.index)
for x in df.index:
    print(x)
    if df.loc[x, "age"] > 120:
        df.loc[x, "age"] = 1

print(df)

     name    age
0  Google     50
1  Runoob    200
2  Taobao  12345
RangeIndex(start=0, stop=3, step=1)
0
1
2
     name  age
0  Google   50
1  Runoob    1
2  Taobao    1

Original: https://blog.csdn.net/sdfiiiiii/article/details/119422583
Author: Kelvin写代码
Title: 利用python进行股票分析(四)pandas

原创文章受到原创版权保护。转载请注明出处:https://www.johngo689.com/738001/

转载文章受原作者版权保护。转载请注明原作者出处!

(0)

大家都在看

亲爱的 Coder【最近整理,可免费获取】👉 最新必读书单  | 👏 面试题下载  | 🌎 免费的AI知识星球