Pandas millisecond-level time parsing

When extracting timestamped data with pandas, the time values read in may be of type int or str, so the first step is to parse them into datetimes.

Create a DataFrame of test data

import pandas as pd
import numpy as np

df = pd.DataFrame({
             'time1': [20210101000000, 20210101000001, '20210101000002', '20210101000003',
                      '2021-01-01 00:00:04', '2021-01-01 00:00:05', '2021/01/01 00:00:06',
                      '2021/1/1 00:00:07', '2021-01-01 00:00:08', '2021-01-01 00:00:09'],
             'time2': ['2021-01-01 00:00:00:167', '2021-01-01 00:00:01:167', '2021-01-01 00:00:02:167',
                       '2021-01-01 00:00:03:167', '2021-01-01 00:00:04:167', '2021-01-01 00:00:05:167',
                       '2021-01-01 00:00:06:167', '2021-01-01 00:00:07:167', '2021-01-01 00:00:08:167',
                       '2021-01-01 00:00:09:167'],
             'name': ['赵一', '钱二', '张三', '李四', '王五', '麻六', '孙七', '周八', '吴九', '郑十'],
             'age': [18, 28, 38, 48, 16, 26, 39, 25, 19, 24],
             'sex': [True, True, False, True, False, False, True, False, True, False],
             'score': [85.5, 59, 78, 88, 69, 96, 90, 85, 69, 79],
             })

print(df['time1'][0], type(df['time1'][0]))
print(df['time1'][2], type(df['time1'][2]))    # check the data types
print('*'*50)
print(df['time2'][0], type(df['time2'][0]))
print('*'*50)
df

_______________________________________________________________________________________________
Output:
20210101000000 <class 'int'>
20210101000002 <class 'str'>
**************************************************
2021-01-01 00:00:00:167 <class 'str'>
**************************************************
time1   time2   name    age sex score
0   20210101000000  2021-01-01 00:00:00:167 赵一  18  True    85.5
1   20210101000001  2021-01-01 00:00:01:167 钱二  28  True    59.0
2   20210101000002  2021-01-01 00:00:02:167 张三  38  False   78.0
3   20210101000003  2021-01-01 00:00:03:167 李四  48  True    88.0
4   2021-01-01 00:00:04 2021-01-01 00:00:04:167 王五  16  False   69.0
5   2021-01-01 00:00:05 2021-01-01 00:00:05:167 麻六  26  False   96.0
6   2021/01/01 00:00:06 2021-01-01 00:00:06:167 孙七  39  True    90.0
7   2021/1/1 00:00:07   2021-01-01 00:00:07:167 周八  25  False   85.0
8   2021-01-01 00:00:08 2021-01-01 00:00:08:167 吴九  19  True    69.0
9   2021-01-01 00:00:09 2021-01-01 00:00:09:167 郑十  24  False   79.0

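The layout pandas handles most readily is "%Y-%m-%d %H:%M:%S". As the code above shows, the timestamps to be parsed arrive as two data types, int and str, and at two resolutions: the standard second-level format and a millisecond-level format.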
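One way to handle the mixed int/str layouts in time1 is to convert every value to str and try a short list of explicit formats. The following is only a minimal sketch: the names FORMATS and parse_any are hypothetical helpers introduced here for illustration, and the format list assumes only the three layouts that appear in the test data.

```python
from datetime import datetime

import pandas as pd

# Assumed layouts, taken from the time1 column of the test data above.
FORMATS = ('%Y%m%d%H%M%S', '%Y-%m-%d %H:%M:%S', '%Y/%m/%d %H:%M:%S')


def parse_any(value):
    """Hypothetical helper: try each known layout until one matches."""
    text = str(value)  # ints such as 20210101000000 become digit strings
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f'no known format matches {text!r}')


raw = [20210101000000, '20210101000002', '2021-01-01 00:00:04', '2021/1/1 00:00:07']
parsed = pd.Series([parse_any(v) for v in raw])
print(parsed)
```

Parsing one value at a time is slower than a vectorized call, but it does not depend on how a given pandas version infers a single format for the whole column.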

The parse_dates=['...'] parameter of pandas read_csv()

# Save, then read back, to check what pd.read_csv(parse_dates=[...]) can parse
df.to_csv('test.csv', index=False, encoding='utf-8')
df = pd.read_csv('test.csv', parse_dates=['time1','time2'])
print(df['time1'][0], type(df['time1'][0]))
print(df['time1'][2], type(df['time1'][2]))   # parsed successfully
print('*'*50)
print(df['time2'][0], type(df['time2'][0]))   # not parsed
print('*'*50)
df
_______________________________________________________________________________________________
Output:
2021-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2021-01-01 00:00:02 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
**************************************************
2021-01-01 00:00:00:167 <class 'str'>
**************************************************
time1   time2   name    age sex score
0   2021-01-01 00:00:00 2021-01-01 00:00:00:167 赵一  18  True    85.5
1   2021-01-01 00:00:01 2021-01-01 00:00:01:167 钱二  28  True    59.0
2   2021-01-01 00:00:02 2021-01-01 00:00:02:167 张三  38  False   78.0
3   2021-01-01 00:00:03 2021-01-01 00:00:03:167 李四  48  True    88.0
4   2021-01-01 00:00:04 2021-01-01 00:00:04:167 王五  16  False   69.0
5   2021-01-01 00:00:05 2021-01-01 00:00:05:167 麻六  26  False   96.0
6   2021-01-01 00:00:06 2021-01-01 00:00:06:167 孙七  39  True    90.0
7   2021-01-01 00:00:07 2021-01-01 00:00:07:167 周八  25  False   85.0
8   2021-01-01 00:00:08 2021-01-01 00:00:08:167 吴九  19  True    69.0
9   2021-01-01 00:00:09 2021-01-01 00:00:09:167 郑十  24  False   79.0

The output above shows the limitation of this approach: the millisecond-level str data in time2 was not parsed and remains str.
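Because read_csv leaves an unparseable column as text rather than stopping, it can help to check the column types after reading. The sketch below uses a hypothetical two-row sample in the same layout as test.csv (read from an in-memory string, which is an assumption made to keep the example self-contained) and lists the requested date columns that did not become datetimes.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same layout as test.csv.
csv_text = (
    "time1,time2\n"
    "2021-01-01 00:00:00,2021-01-01 00:00:00:167\n"
    "2021-01-01 00:00:01,2021-01-01 00:00:01:167\n"
)
date_cols = ['time1', 'time2']
df_check = pd.read_csv(io.StringIO(csv_text), parse_dates=date_cols)

# Columns that are still not datetime were silently left unparsed.
unparsed = [c for c in date_cols
            if not pd.api.types.is_datetime64_any_dtype(df_check[c])]
print(unparsed)
```

Any column reported here needs a second pass with an explicit format.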

The built-in pandas to_datetime() method

df = pd.read_csv('test.csv')
print(df['time1'][0], type(df['time1'][0]))  # before parsing
print('*'*50)
print(pd.to_datetime(df['time1']))  # mixed layouts; pandas 2.x may need format='mixed' here
# print(pd.to_datetime(df['time2']))  # raises ParserError: Unknown string format: 2021-01-01 00:00:00:167
print(pd.to_datetime(df['time2'], format='%Y-%m-%d %H:%M:%S:%f'))  # parse the millisecond-level data
______________________________________________________________________________________________
Output:
20210101000000 <class 'str'>
**************************************************
0   2021-01-01 00:00:00
1   2021-01-01 00:00:01
2   2021-01-01 00:00:02
3   2021-01-01 00:00:03
4   2021-01-01 00:00:04
5   2021-01-01 00:00:05
6   2021-01-01 00:00:06
7   2021-01-01 00:00:07
8   2021-01-01 00:00:08
9   2021-01-01 00:00:09
Name: time1, dtype: datetime64[ns]
0   2021-01-01 00:00:00.167
1   2021-01-01 00:00:01.167
2   2021-01-01 00:00:02.167
3   2021-01-01 00:00:03.167
4   2021-01-01 00:00:04.167
5   2021-01-01 00:00:05.167
6   2021-01-01 00:00:06.167
7   2021-01-01 00:00:07.167
8   2021-01-01 00:00:08.167
9   2021-01-01 00:00:09.167
Name: time2, dtype: datetime64[ns]

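As shown, parsing millisecond-level data in this layout requires specifying the format explicitly, format='%Y-%m-%d %H:%M:%S:%f', which the to_datetime() method accepts. Other approaches remain to be explored and added later.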
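One possible alternative, offered only as a sketch, is to rewrite the last ':' before the three millisecond digits as '.', which turns the text into a standard layout that to_datetime() parses without a format argument. The regular expression assumes the fractional part is always exactly three digits at the end of the string, as in the test data.

```python
import pandas as pd

s = pd.Series(['2021-01-01 00:00:00:167', '2021-01-01 00:00:01:167'])

# Replace the final ':' before the 3-digit milliseconds with '.'.
normalized = s.str.replace(r':(\d{3})$', r'.\1', regex=True)
parsed = pd.to_datetime(normalized)
print(parsed)

# Same result as the explicit-format call used above.
explicit = pd.to_datetime(s, format='%Y-%m-%d %H:%M:%S:%f')
print(parsed.equals(explicit))
```

The explicit format is usually the simpler choice; the rewrite is mainly useful when a downstream tool expects the '.' separator.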

A supplementary note

df_1 = pd.DataFrame({'Time Date':['1/15/2021 11:00','1/15/2021 11:10']})
print (df_1)
df_1['Time Date'] = pd.to_datetime(df_1['Time Date'], format='%m/%d/%Y %H:%M')
print (df_1)
         Time Date
0  1/15/2021 11:00
1  1/15/2021 11:10
_________________________
# the first parsed value is equivalent to datetime.datetime(2021, 1, 15, 11, 0)
df1 = pd.DataFrame({'Join Date':['1/4/21 14:44','2/3/21 14:44']})
print (df1)
# Points to note: %y (two-digit year) and the floor() method, which truncates to the start of the day
df1['Join Date'] = pd.to_datetime(df1['Join Date'], format='%d/%m/%y %H:%M').dt.floor('D')
print (df1)
      Join Date
0  1/4/21 14:44
1  2/3/21 14:44
_________________________
   Join Date
0 2021-04-01
1 2021-03-02
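The two points in this note can be checked side by side. The sketch below parses the same strings with day-first and month-first formats to show that the format string alone decides how '1/4/21' is read, and then applies floor('D'), which sets the time of day to midnight. The variable names are chosen here for illustration.

```python
import pandas as pd

s = pd.Series(['1/4/21 14:44', '2/3/21 14:44'])

# %y is the two-digit year; the order of %d and %m decides day vs. month.
day_first = pd.to_datetime(s, format='%d/%m/%y %H:%M')    # 1 April 2021, 2 March 2021
month_first = pd.to_datetime(s, format='%m/%d/%y %H:%M')  # 4 January 2021, 3 February 2021

# floor('D') truncates each timestamp to the start of its day.
floored = day_first.dt.floor('D')

print(day_first)
print(month_first)
print(floored)
```

For this purpose dt.normalize() gives the same result as dt.floor('D').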

Another supplementary note

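chunks = pd.read_csv('test.csv', chunksize=3)   # returns an iterator of DataFrames, not a DataFrame
for chunk in chunks:
    print(chunk)   # the data arrives in blocks; a final block with fewer than 3 rows is still returned, so nothing is lost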
____________________________________________________________________________________________
Output:
                 time1                    time2 name  age    sex  score
0  2021-01-01 00:00:00  2021-01-01 00:00:00:167   赵一   18   True   85.5
1  2021-01-01 00:00:01  2021-01-01 00:00:01:167   钱二   28   True   59.0
2  2021-01-01 00:00:02  2021-01-01 00:00:02:167   张三   38  False   78.0
                 time1                    time2 name  age    sex  score
3  2021-01-01 00:00:03  2021-01-01 00:00:03:167   李四   48   True   88.0
4  2021-01-01 00:00:04  2021-01-01 00:00:04:167   王五   16  False   69.0
5  2021-01-01 00:00:05  2021-01-01 00:00:05:167   麻六   26  False   96.0
                 time1                    time2 name  age    sex  score
6  2021-01-01 00:00:06  2021-01-01 00:00:06:167   孙七   39   True   90.0
7  2021-01-01 00:00:07  2021-01-01 00:00:07:167   周八   25  False   85.0
8  2021-01-01 00:00:08  2021-01-01 00:00:08:167   吴九   19   True   69.0
                 time1                    time2 name  age    sex  score
9  2021-01-01 00:00:09  2021-01-01 00:00:09:167   郑十   24  False   79.0

This confirms the behavior of the chunksize= parameter: a final block smaller than the chunk size is still read.
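Chunked reading combines naturally with the explicit millisecond format. The sketch below parses time2 chunk by chunk and then concatenates the pieces. It reads a hypothetical ten-row sample from an in-memory string (an assumption, so that the example runs without test.csv), and the score values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical ten-row sample with the same time2 layout as test.csv.
csv_text = 'time2,score\n' + ''.join(
    f'2021-01-01 00:00:0{i}:167,{60 + i}\n' for i in range(10)
)

parts = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=3):
    # Parse the millisecond timestamps inside each chunk.
    chunk['time2'] = pd.to_datetime(chunk['time2'], format='%Y-%m-%d %H:%M:%S:%f')
    parts.append(chunk)

sizes = [len(p) for p in parts]              # rows per chunk
result = pd.concat(parts, ignore_index=True)  # reassemble the full table
print(sizes)
print(result.dtypes)
```

For files that do not fit in memory, each chunk would be aggregated or written out inside the loop instead of being collected in a list.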

Original: https://blog.csdn.net/xnd_31726/article/details/112297723
Author: 一定波兮
Title: Pandas 毫秒级时间解析
