从最基础的部分开始学习,完整的解决一个任务!本文的代码借鉴自2020CCF冠军的开源代码。简单的数据格式转换是我数据挖掘道路上的拦路虎,今天看到人家的处理方式,惊呼!学到了~
- 将txt文件中的数据用
open(file)
读取出来是一堆字符串类型的数据,处理起来很不方便。以下代码,使用pandas的read_csv
读取txt
文件,使用sep
字段对字符串进行划分。先粗略划分,随后逐步细化、合并,详情看代码。 - 最后将处理好的dataframe数据并没有用CSV存储,而使用
pkl
格式,原因是:pickle文件的读取速度比csv文件读取速度快2倍 !csv VS pkl 请查看这篇博文
原txt文件248MB,转换为pkl文件420MB - 遇到多个相关数据集,找到共有的字段,并用该字段将多个数据集进行
merge
2020CCF 路况预测–数据读取部分:
将数据格式转化成dataFrame
import pandas as pd
import numpy as np
import torch
import scipy
import sklearn
import logging
from tqdm import tqdm
def get_data(path):
columns = ['link', 'label', 'current_slice_id', 'future_slice_id', 'recent_feature_1', 'recent_feature_2',
'recent_feature_3', 'recent_feature_4', 'recent_feature_5']
for i in range(1, 5):
for j in range(1, 6):
columns.append('history_feature_cycle{}_gap{}'.format(i, j))
data = pd.read_csv(path, header=None, sep=';')
print(data.shape)
data['temp'] = data[0].apply(lambda x: x.split())
for i in range(4):
data[columns[i]] = data['temp'].apply(lambda x: x[i])
for i in range(1, 6):
data['temp'] = data[i].apply(lambda x: x.split())
for j in range(5 * i - 1, 5 * i + 4):
data[columns[j]] = data['temp'].apply(lambda x: x[j - 5 * i + 1])
data = data[columns]
data_columns = data.columns
save_columns = []
for col in tqdm(data_columns):
if 'feature' in col:
data['temp'] = data[col].apply(lambda x: x.split(','))
data[col + '_speed'] = data['temp'].apply(lambda x: float(x[0].split(':')[1]))
data[col + '_eta'] = data['temp'].apply(lambda x: float(x[1]))
data[col + '_status'] = data['temp'].apply(lambda x: float(x[2]))
data[col + '_num_car'] = data['temp'].apply(lambda x: float(x[3]))
save_columns.extend([col + '_speed', col + '_eta', col + '_status', col + '_num_car'])
data = data[columns[:4] + save_columns]
return data
logging.warning(u"get or data success!!!")
logging.warning(u"assert env:")
logging.warning("------------------------------")
logging.warning(u"pd.__version__:{}".format(pd.__version__))
logging.warning(u"sklearn.__version__{}".format(sklearn.__version__))
logging.warning(u'numpy version{}'.format(np.__version__))
logging.warning(u'lgb version{}'.format(lgb.__version__))
logging.warning(u'torch version{}'.format(torch.__version__))
logging.warning(u'torch.cuda.is_available{}'.format(torch.cuda.is_available()))
logging.warning(u'scipy.__version__{}'.format(scipy.__version__))
logging.warning("------------------------------")
if __name__ == '__main__':
or_data_path = r'D:\DATA\DIDI_CCF_RoadConditionForecast\traffic\{}.txt'
save_data_path = r'D:\DATA\DIDI_CCF_RoadConditionForecast\train_{}.pkl'
test_data_path = r'D:\DATA\DIDI_CCF_RoadConditionForecast\20190801_testdata.txt'
test_save_data_path = r'D:\DATA\DIDI_CCF_RoadConditionForecast\test_new.pkl'
for i in range(1, 31):
print(i)
num = 20190700 + i
train_df = get_data(or_data_path.format(num))
train_df['day'] = i
print(train_df.shape)
train_df.to_pickle(save_data_path.format(i))
test = get_data(test_data_path)
test['day'] = 32
test.to_pickle(test_save_data_path)
数据一:历史与实时路况 traffic/201907XX.txt
每天对应一个 txt
文件,总体格式:link_id、label、current_slice_id、future_slice_id、recent_feature、history_feature1、history_feature2、history_feature3、history_feature4
字段名称字段含义备注link路段的idlabel对应时间的link的路况状态:取值 {1,2,3,4}
current_slice_id当前时间片id:
( s l i c e _ i d + 720 ) % 720 (slice_id+720)\%720 (s l i c e _i d +7 2 0 )%7 2 0
future_slice_id待预测时间片id
0 < = f u t u r e _ s l i c e _ i d − c u r r e n t _ s l i c e _ i d < = 30 0
recent_feature1. 近期n个时间片路况特征,n=5,时间片之间空格分隔,字段之间”逗号”分隔。
-
具体格式:
时间片:路况速度,eta速度,路况状态,参与路况计算的车辆数
-
特征都为0时,说明此时间片无车经过history_feature历史同期n个时间片路况特征,星期之间”分号”分隔,共4组(-28,-21,-14,-7),每组格式和recent_feature一致;
4表示一个月4个星期的同一天,比如上面recent_feature用的 周一
,那么这四条样本也是 周一
,采用 周一
相同时间点的4天
每天的数据集分别包括:503556、505953、507552、516549、517413、492267、462417、519387、518511、516024、521820、522081、491574、469377、519537、524658、523518、528582、525174、495456、465870、517290、516813、512130、509850、521751、478371、454839、513840、510414条样本,每条样本的数据组成如下:
353495 1 236 245;
232:29.80,32.40,1,4 233:31.60,32.20,1,2 234:20.00,21.90,2,2 235:22.20,25.90,2,5 236:21.30,26.30,2,4;
245:30.00,32.70,0,9 246:30.00,36.10,0,10 247:27.40,35.20,1,12 248:26.90,35.70,1,10 249:28.90,37.00,1,9;
245:36.10,37.30,1,7 246:29.30,38.50,1,7 247:27.70,39.70,1,6 248:28.60,40.20,1,3 249:29.60,38.70,1,4;
245:30.40,40.10,1,6 246:32.30,40.10,1,6 247:30.60,41.10,1,5 248:29.60,39.20,1,4 249:28.00,37.90,1,4;
245:28.30,38.40,1,7 246:28.20,39.40,1,6 247:28.80,35.10,1,3 248:30.00,35.60,1,4 249:29.40,37.20,1,5
以上数据对应的意思如表格所示,分号隔开有六个部分:
- 第一表示路段的基本信息,包括路段ID、路段的label、当前时间片ID和未来时间片ID:
link label current_slice_id future_slice_id;
- 第二部分表示最近
n=5
个时间片的实时路段信息,用recent_feature表示: 每一个时间片的格式为:时间片ID:路况速度,eta速度,路况状态,参与路况计算的车辆数
- 第三~六部分表示历史路段信息,包括
4
组数据,每组包括n=5
个时间片:用history_feature
表示,时间片格式同上。分别表示recent时间的7天前、14天前、21天前、28天前相同时间片的路况状态。
处理后的格式,二维表格,105个字段:
['link',
'label',
'current_slice_id',
'future_slice_id',
'day',
'recent_feature_1_speed',
'recent_feature_1_eta',
'recent_feature_1_status',
'recent_feature_1_num_car',
'recent_feature_2_speed',
'recent_feature_2_eta',
'recent_feature_2_status',
'recent_feature_2_num_car',
'recent_feature_3_speed',
'recent_feature_3_eta',
'recent_feature_3_status',
'recent_feature_3_num_car',
'recent_feature_4_speed',
'recent_feature_4_eta',
'recent_feature_4_status',
'recent_feature_4_num_car',
'recent_feature_5_speed',
'recent_feature_5_eta',
'recent_feature_5_status',
'recent_feature_5_num_car',
'history_feature_cycle1_gap1_speed',
'history_feature_cycle1_gap1_eta',
'history_feature_cycle1_gap1_status',
'history_feature_cycle1_gap1_num_car',
'history_feature_cycle1_gap2_speed',
'history_feature_cycle1_gap2_eta',
'history_feature_cycle1_gap2_status',
'history_feature_cycle1_gap2_num_car',
'history_feature_cycle1_gap3_speed',
'history_feature_cycle1_gap3_eta',
'history_feature_cycle1_gap3_status',
'history_feature_cycle1_gap3_num_car',
'history_feature_cycle1_gap4_speed',
'history_feature_cycle1_gap4_eta',
'history_feature_cycle1_gap4_status',
'history_feature_cycle1_gap4_num_car',
'history_feature_cycle1_gap5_speed',
'history_feature_cycle1_gap5_eta',
'history_feature_cycle1_gap5_status',
'history_feature_cycle1_gap5_num_car',
'history_feature_cycle2_gap1_speed',
'history_feature_cycle2_gap1_eta',
'history_feature_cycle2_gap1_status',
'history_feature_cycle2_gap1_num_car',
'history_feature_cycle2_gap2_speed',
'history_feature_cycle2_gap2_eta',
'history_feature_cycle2_gap2_status',
'history_feature_cycle2_gap2_num_car',
'history_feature_cycle2_gap3_speed',
'history_feature_cycle2_gap3_eta',
'history_feature_cycle2_gap3_status',
'history_feature_cycle2_gap3_num_car',
'history_feature_cycle2_gap4_speed',
'history_feature_cycle2_gap4_eta',
'history_feature_cycle2_gap4_status',
'history_feature_cycle2_gap4_num_car',
'history_feature_cycle2_gap5_speed',
'history_feature_cycle2_gap5_eta',
'history_feature_cycle2_gap5_status',
'history_feature_cycle2_gap5_num_car',
'history_feature_cycle3_gap1_speed',
'history_feature_cycle3_gap1_eta',
'history_feature_cycle3_gap1_status',
'history_feature_cycle3_gap1_num_car',
'history_feature_cycle3_gap2_speed',
'history_feature_cycle3_gap2_eta',
'history_feature_cycle3_gap2_status',
'history_feature_cycle3_gap2_num_car',
'history_feature_cycle3_gap3_speed',
'history_feature_cycle3_gap3_eta',
'history_feature_cycle3_gap3_status',
'history_feature_cycle3_gap3_num_car',
'history_feature_cycle3_gap4_speed',
'history_feature_cycle3_gap4_eta',
'history_feature_cycle3_gap4_status',
'history_feature_cycle3_gap4_num_car',
'history_feature_cycle3_gap5_speed',
'history_feature_cycle3_gap5_eta',
'history_feature_cycle3_gap5_status',
'history_feature_cycle3_gap5_num_car',
'history_feature_cycle4_gap1_speed',
'history_feature_cycle4_gap1_eta',
'history_feature_cycle4_gap1_status',
'history_feature_cycle4_gap1_num_car',
'history_feature_cycle4_gap2_speed',
'history_feature_cycle4_gap2_eta',
'history_feature_cycle4_gap2_status',
'history_feature_cycle4_gap2_num_car',
'history_feature_cycle4_gap3_speed',
'history_feature_cycle4_gap3_eta',
'history_feature_cycle4_gap3_status',
'history_feature_cycle4_gap3_num_car',
'history_feature_cycle4_gap4_speed',
'history_feature_cycle4_gap4_eta',
'history_feature_cycle4_gap4_status',
'history_feature_cycle4_gap4_num_car',
'history_feature_cycle4_gap5_speed',
'history_feature_cycle4_gap5_eta',
'history_feature_cycle4_gap5_status',
'history_feature_cycle4_gap5_num_car']
数据二:道路属性 attr.txt
字段名称字段类型字段含义linkidcategoriclink的idlengthnumericlink的长度,以m为单位directioncategoriclink的通行方向pathclasscategoriclink的的功能等级speedclasscategoriclink的速度限制等级LaneNumcategoriclink的车道数speedlimitnumericlink的限速,以m/s为单位levelcategoriclink的levelwidthnumericlink的宽度,以m为单位
用同样的方式读取文件并处理数据:
- 考虑到category的内存比int64要小很多,读取出数据之后,将其类型进行转换。
- 数值类型 的进行归一化处理
road_attr = pd.read_csv(road_attr_path, header=None, sep='\t')
road_attr.columns=['link','length','direction','path_class','speed_class','LaneNum','speed_limit','level','width']
category_columns = ['direction','path_class','speed_class','LaneNum','level']
for cate in category_columns:
road_attr[cate] = road_attr[cate].astype('category')
road_attr.memory_usage(deep=True)
def scale_numeric(data,columns):
for col in columns:
data[col] = (data[col].values-np.mean(data[col].values))/np.std(data[col].values)
numeric_columns = ['length','speed_limit','width']
scale_numeric(road_attr, numeric_columns)
处理好的 link attr
通过 link
与训练集进行左连接:
train = train.merge(road_attr,on='link',how='left')
test = test.merge(road_attr,on='link',how='left')
数据三:路网拓扑 topo.txt
keyvaluelink_id下游linkid1,下游linkid2,下游linkid3,…
road_topo = pd.read_csv(road_topo_path,header=None, sep='\t')
road_topo.columns = ['link','next_link']
road_topo['next_link'] = road_topo['next_link'].apply(lambda x : [int(i) for i in x.split(',')])
再次通过 link
与训练集左连接:
train=train.merge(road_topo,on='link',how='left')
test=test.merge(road_topo,on='link',how='left')
最终
增加时间特征,从日期中获取周几作为新字段,
得到的训练集维度为: 1500w,117
,将30天的订单数据处理后沿着 axis:0
拼接成 1500w,105
,然后拼接上8个 link attr
字段,在拼接上3个时间字段,最后拼接1个 next_links
得到最终的训练集,其字段如下:
['current_slice_id',
'day',
'future_slice_id',
'history_feature_cycle1_gap1_eta',
'history_feature_cycle1_gap1_num_car',
'history_feature_cycle1_gap1_speed',
'history_feature_cycle1_gap1_status',
'history_feature_cycle1_gap2_eta',
'history_feature_cycle1_gap2_num_car',
'history_feature_cycle1_gap2_speed',
'history_feature_cycle1_gap2_status',
'history_feature_cycle1_gap3_eta',
'history_feature_cycle1_gap3_num_car',
'history_feature_cycle1_gap3_speed',
'history_feature_cycle1_gap3_status',
'history_feature_cycle1_gap4_eta',
'history_feature_cycle1_gap4_num_car',
'history_feature_cycle1_gap4_speed',
'history_feature_cycle1_gap4_status',
'history_feature_cycle1_gap5_eta',
'history_feature_cycle1_gap5_num_car',
'history_feature_cycle1_gap5_speed',
'history_feature_cycle1_gap5_status',
'history_feature_cycle2_gap1_eta',
'history_feature_cycle2_gap1_num_car',
'history_feature_cycle2_gap1_speed',
'history_feature_cycle2_gap1_status',
'history_feature_cycle2_gap2_eta',
'history_feature_cycle2_gap2_num_car',
'history_feature_cycle2_gap2_speed',
'history_feature_cycle2_gap2_status',
'history_feature_cycle2_gap3_eta',
'history_feature_cycle2_gap3_num_car',
'history_feature_cycle2_gap3_speed',
'history_feature_cycle2_gap3_status',
'history_feature_cycle2_gap4_eta',
'history_feature_cycle2_gap4_num_car',
'history_feature_cycle2_gap4_speed',
'history_feature_cycle2_gap4_status',
'history_feature_cycle2_gap5_eta',
'history_feature_cycle2_gap5_num_car',
'history_feature_cycle2_gap5_speed',
'history_feature_cycle2_gap5_status',
'history_feature_cycle3_gap1_eta',
'history_feature_cycle3_gap1_num_car',
'history_feature_cycle3_gap1_speed',
'history_feature_cycle3_gap1_status',
'history_feature_cycle3_gap2_eta',
'history_feature_cycle3_gap2_num_car',
'history_feature_cycle3_gap2_speed',
'history_feature_cycle3_gap2_status',
'history_feature_cycle3_gap3_eta',
'history_feature_cycle3_gap3_num_car',
'history_feature_cycle3_gap3_speed',
'history_feature_cycle3_gap3_status',
'history_feature_cycle3_gap4_eta',
'history_feature_cycle3_gap4_num_car',
'history_feature_cycle3_gap4_speed',
'history_feature_cycle3_gap4_status',
'history_feature_cycle3_gap5_eta',
'history_feature_cycle3_gap5_num_car',
'history_feature_cycle3_gap5_speed',
'history_feature_cycle3_gap5_status',
'history_feature_cycle4_gap1_eta',
'history_feature_cycle4_gap1_num_car',
'history_feature_cycle4_gap1_speed',
'history_feature_cycle4_gap1_status',
'history_feature_cycle4_gap2_eta',
'history_feature_cycle4_gap2_num_car',
'history_feature_cycle4_gap2_speed',
'history_feature_cycle4_gap2_status',
'history_feature_cycle4_gap3_eta',
'history_feature_cycle4_gap3_num_car',
'history_feature_cycle4_gap3_speed',
'history_feature_cycle4_gap3_status',
'history_feature_cycle4_gap4_eta',
'history_feature_cycle4_gap4_num_car',
'history_feature_cycle4_gap4_speed',
'history_feature_cycle4_gap4_status',
'history_feature_cycle4_gap5_eta',
'history_feature_cycle4_gap5_num_car',
'history_feature_cycle4_gap5_speed',
'history_feature_cycle4_gap5_status',
'label',
'link',
'recent_feature_1_eta',
'recent_feature_1_num_car',
'recent_feature_1_speed',
'recent_feature_1_status',
'recent_feature_2_eta',
'recent_feature_2_num_car',
'recent_feature_2_speed',
'recent_feature_2_status',
'recent_feature_3_eta',
'recent_feature_3_num_car',
'recent_feature_3_speed',
'recent_feature_3_status',
'recent_feature_4_eta',
'recent_feature_4_num_car',
'recent_feature_4_speed',
'recent_feature_4_status',
'recent_feature_5_eta',
'recent_feature_5_num_car',
'recent_feature_5_speed',
'recent_feature_5_status',
'length',
'direction',
'path_class',
'speed_class',
'LaneNum',
'speed_limit',
'level',
'width',
'week_day',
'hour',
'time_gap',
'next_link']
pandas 内存缩减
def pandas_reduce_mem_usage(df, igore_columns=[]):
start_mem = df.memory_usage().sum() / 1024 ** 2
print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
start_time = datetime.datetime.now()
for col in df.columns:
if col in igore_columns:
continue
col_type = df[col].dtype
if col_type != object:
c_min = df[col].min()
c_max = df[col].max()
print('{} column dtype is {} and begin convert to others'.format(col, col_type))
if str(col_type)[:3] == 'int':
if c_min < 0:
if c_min >= np.iinfo(np.int8).min and c_max np.iinfo(np.int8).max:
df[col] = df[col].astype(np.int8)
elif c_min >= np.iinfo(np.int16).min and c_max np.iinfo(np.int16).max:
df[col] = df[col].astype(np.int16)
elif c_min >= np.iinfo(np.int32).min and c_max np.iinfo(np.int32).max:
df[col] = df[col].astype(np.int32)
else:
df[col] = df[col].astype(np.int64)
else:
if c_min >= np.iinfo(np.uint8).min and c_max np.iinfo(np.uint8).max:
df[col] = df[col].astype(np.uint8)
elif c_min >= np.iinfo(np.uint16).min and c_max np.iinfo(np.uint16).max:
df[col] = df[col].astype(np.uint16)
elif c_min >= np.iinfo(np.uint32).min and c_max np.iinfo(np.uint32).max:
df[col] = df[col].astype(np.uint32)
else:
df[col] = df[col].astype(np.uint64)
else:
if c_min >= np.finfo(np.float16).min and c_max np.finfo(np.float16).max:
df[col] = df[col].astype(np.float16)
elif c_min >= np.finfo(np.float32).min and c_max np.finfo(np.float32).max:
df[col] = df[col].astype(np.float32)
else:
df[col] = df[col].astype(np.float64)
print('\t\tcolumn dtype is {}'.format(df[col].dtype))
else:
print('\t\tcolumns dtype is object and will convert to category')
df[col] = df[col].astype('category')
end_mem = df.memory_usage().sum() / 1024 ** 2
end_time = datetime.datetime.now()
print('consume times: {:.4f}'.format((end_time - start_time).seconds))
print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
return df
缩减前的文件大小:
Original: https://blog.csdn.net/qq_33866063/article/details/117235093
Author: 小卜妞~
Title: 2020CCF 路况预测–数据读取部分:将数据格式转化成dataFrame
原创文章受到原创版权保护。转载请注明出处:https://www.johngo689.com/752915/
转载文章受原作者版权保护。转载请注明原作者出处!