2020 CCF Road Condition Prediction – Data Loading: Converting the Raw Data into a DataFrame

Start from the most basic part and work through a complete task. The code in this post is adapted from the open-source solution of the 2020 CCF competition champion. Simple data-format conversion used to be a stumbling block on my data-mining journey; seeing how they handle it was a real eye-opener.

  1. Reading the txt file with a plain open(file) gives back a pile of raw strings, which is awkward to work with. The code below instead loads the txt file with pandas read_csv, using the sep parameter to split each record into coarse chunks; these chunks are then refined and combined step by step (see the code for details).
  2. The processed DataFrame is not stored as CSV but in pkl format, because reading a pickle file is roughly twice as fast as reading the equivalent CSV (for a csv-vs-pkl comparison see the blog post linked in the original article, and the timing sketch after this list).
    The original txt file is 248 MB; the converted pkl file is 420 MB.
  3. When several related datasets are involved, find the field they share and merge the datasets on that field.
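A minimal timing sketch of the csv-vs-pkl point (not from the original post; the file names are hypothetical and assume one processed daily frame has already been saved):

import time
import pandas as pd

# Hypothetical files: one processed daily DataFrame saved in both formats.
df = pd.read_pickle('train_1.pkl')
df.to_csv('train_1.csv', index=False)

t0 = time.time()
pd.read_csv('train_1.csv')
print('csv read: {:.2f}s'.format(time.time() - t0))

t0 = time.time()
pd.read_pickle('train_1.pkl')
print('pkl read: {:.2f}s'.format(time.time() - t0))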

2020 CCF Road Condition Prediction – data loading: converting the data into a DataFrame

import pandas as pd
import numpy as np

import torch
import scipy
import sklearn
import logging
import lightgbm as lgb  # needed for the lgb.__version__ check below
from tqdm import tqdm

def get_data(path):
    # Column names: 4 id/label fields, 5 recent-feature slots and
    # 4 history cycles x 5 gaps = 20 history-feature slots (29 names in total).
    columns = ['link', 'label', 'current_slice_id', 'future_slice_id', 'recent_feature_1', 'recent_feature_2',
               'recent_feature_3', 'recent_feature_4', 'recent_feature_5']
    for i in range(1, 5):
        for j in range(1, 6):
            columns.append('history_feature_cycle{}_gap{}'.format(i, j))

    # Coarse split: each record has 6 ';'-separated parts, so read_csv with
    # sep=';' yields 6 raw string columns (0..5).
    data = pd.read_csv(path, header=None, sep=';')
    print(data.shape)

    # Part 0 holds "link label current_slice_id future_slice_id" (space separated).
    data['temp'] = data[0].apply(lambda x: x.split())
    for i in range(4):
        data[columns[i]] = data['temp'].apply(lambda x: x[i])

    # Parts 1..5 each hold 5 space-separated time-slice tokens:
    # part 1 -> recent_feature_1..5, parts 2..5 -> history cycles 1..4.
    for i in range(1, 6):
        data['temp'] = data[i].apply(lambda x: x.split())
        for j in range(5 * i - 1, 5 * i + 4):
            data[columns[j]] = data['temp'].apply(lambda x: x[j - 5 * i + 1])

    data = data[columns]
    data_columns = data.columns
    save_columns = []
    # Each feature token looks like "slice_id:speed,eta,status,num_car";
    # expand it into four numeric columns per feature slot.
    for col in tqdm(data_columns):
        if 'feature' in col:
            data['temp'] = data[col].apply(lambda x: x.split(','))
            data[col + '_speed'] = data['temp'].apply(lambda x: float(x[0].split(':')[1]))
            data[col + '_eta'] = data['temp'].apply(lambda x: float(x[1]))
            data[col + '_status'] = data['temp'].apply(lambda x: float(x[2]))
            data[col + '_num_car'] = data['temp'].apply(lambda x: float(x[3]))
            save_columns.extend([col + '_speed', col + '_eta', col + '_status', col + '_num_car'])
    # Keep the 4 id/label fields plus the 100 expanded feature columns.
    data = data[columns[:4] + save_columns]
    return data

logging.warning(u"get or data success!!!")
logging.warning(u"assert env:")
logging.warning("------------------------------")
logging.warning(u"pd.__version__:{}".format(pd.__version__))
logging.warning(u"sklearn.__version__{}".format(sklearn.__version__))
logging.warning(u'numpy version{}'.format(np.__version__))
logging.warning(u'lgb version{}'.format(lgb.__version__))
logging.warning(u'torch version{}'.format(torch.__version__))
logging.warning(u'torch.cuda.is_available{}'.format(torch.cuda.is_available()))
logging.warning(u'scipy.__version__{}'.format(scipy.__version__))
logging.warning("------------------------------")

if __name__ == '__main__':

    # {} in or_data_path is filled with the date (201907XX) of each raw daily file.
    or_data_path = r'D:\DATA\DIDI_CCF_RoadConditionForecast\traffic\{}.txt'
    save_data_path = r'D:\DATA\DIDI_CCF_RoadConditionForecast\train_{}.pkl'
    test_data_path = r'D:\DATA\DIDI_CCF_RoadConditionForecast\20190801_testdata.txt'
    test_save_data_path = r'D:\DATA\DIDI_CCF_RoadConditionForecast\test_new.pkl'

    # 30 daily training files: 20190701 .. 20190730
    for i in range(1, 31):

        print(i)
        num = 20190700 + i
        train_df = get_data(or_data_path.format(num))
        train_df['day'] = i          # day index within July
        print(train_df.shape)
        train_df.to_pickle(save_data_path.format(i))

    # test set: 2019-08-01, stored as day 32
    test = get_data(test_data_path)
    test['day'] = 32
    test.to_pickle(test_save_data_path)
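
After the 30 daily pkl files are written, they can be read back and stacked along axis 0 into one training DataFrame (a minimal sketch, reusing save_data_path from the script above):

# Stack the 30 daily frames into a single training set (~15M rows x 105 columns)
train = pd.concat([pd.read_pickle(save_data_path.format(i)) for i in range(1, 31)],
                  axis=0, ignore_index=True)
print(train.shape)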

Dataset 1: historical and real-time road conditions, traffic/201907XX.txt

Each day corresponds to one txt file. Overall record format: link_id, label, current_slice_id, future_slice_id, recent_feature, history_feature1, history_feature2, history_feature3, history_feature4.

| Field | Meaning | Notes |
| --- | --- | --- |
| link | id of the road segment (link) | |
| label | road-condition status of the link at the target time | one of {1, 2, 3, 4} |
| current_slice_id | id of the current time slice | computed as (slice_id + 720) % 720 |
| future_slice_id | id of the time slice to be predicted | 0 <= future_slice_id - current_slice_id <= 30 |
| recent_feature | road-condition features of the most recent n = 5 time slices; slices are separated by spaces, fields within a slice by commas | per-slice format: slice_id:speed,eta_speed,status,num_vehicles; if all features are 0, no vehicle passed the link in that slice |
| history_feature | road-condition features of the same n = 5 time slices in previous weeks; the 4 weekly groups (-28, -21, -14, -7 days) are separated by semicolons, and each group uses the same format as recent_feature | the 4 groups are the same weekday across 4 weeks: if recent_feature falls on a Monday, the history groups cover the same time slices on the 4 preceding Mondays |

The daily files contain 503556, 505953, 507552, 516549, 517413, 492267, 462417, 519387, 518511, 516024, 521820, 522081, 491574, 469377, 519537, 524658, 523518, 528582, 525174, 495456, 465870, 517290, 516813, 512130, 509850, 521751, 478371, 454839, 513840 and 510414 samples respectively. Each sample is structured as follows:

353495 1 236 245;

232:29.80,32.40,1,4 233:31.60,32.20,1,2 234:20.00,21.90,2,2 235:22.20,25.90,2,5 236:21.30,26.30,2,4;

245:30.00,32.70,0,9 246:30.00,36.10,0,10 247:27.40,35.20,1,12 248:26.90,35.70,1,10 249:28.90,37.00,1,9;
245:36.10,37.30,1,7 246:29.30,38.50,1,7 247:27.70,39.70,1,6 248:28.60,40.20,1,3 249:29.60,38.70,1,4;
245:30.40,40.10,1,6 246:32.30,40.10,1,6 247:30.60,41.10,1,5 248:29.60,39.20,1,4 249:28.00,37.90,1,4;
245:28.30,38.40,1,7 246:28.20,39.40,1,6 247:28.80,35.10,1,3 248:30.00,35.60,1,4 249:29.40,37.20,1,5

As described in the table above, each record is split by semicolons into six parts (a small parsing sketch follows this list):

  • The first part is the basic information of the road segment: the link id, its label, the current time-slice id and the future time-slice id: link label current_slice_id future_slice_id;
  • The second part is the real-time information of the most recent n = 5 time slices, the recent_feature. Each time slice has the format: slice_id:speed,eta_speed,status,num_vehicles
  • Parts three to six are the historical information, the history_feature: 4 groups of n = 5 time slices each, with the same per-slice format as above. They describe the road condition at the same time slices 7, 14, 21 and 28 days before the recent period.
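
To make the record format concrete, here is a small parsing sketch (not part of the original post) applied to a shortened version of the example record above:

# Sketch: split one raw record into its ';'-separated parts and parse the recent features
record = ('353495 1 236 245;'
          '232:29.80,32.40,1,4 233:31.60,32.20,1,2 234:20.00,21.90,2,2 '
          '235:22.20,25.90,2,5 236:21.30,26.30,2,4;'
          '245:30.00,32.70,0,9 246:30.00,36.10,0,10 247:27.40,35.20,1,12 '
          '248:26.90,35.70,1,10 249:28.90,37.00,1,9')  # history cycles 2-4 omitted

parts = record.split(';')
link, label, current_slice_id, future_slice_id = parts[0].split()
for token in parts[1].split():                      # the 5 recent time slices
    slice_id, values = token.split(':')
    speed, eta_speed, status, num_car = (float(v) for v in values.split(','))
    print(slice_id, speed, eta_speed, status, num_car)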

After processing (and adding the day column in the main script), the data is a flat two-dimensional table with 105 fields:

['link',
 'label',
 'current_slice_id',
 'future_slice_id',
 'day',
 'recent_feature_1_speed',
 'recent_feature_1_eta',
 'recent_feature_1_status',
 'recent_feature_1_num_car',
 'recent_feature_2_speed',
 'recent_feature_2_eta',
 'recent_feature_2_status',
 'recent_feature_2_num_car',
 'recent_feature_3_speed',
 'recent_feature_3_eta',
 'recent_feature_3_status',
 'recent_feature_3_num_car',
 'recent_feature_4_speed',
 'recent_feature_4_eta',
 'recent_feature_4_status',
 'recent_feature_4_num_car',
 'recent_feature_5_speed',
 'recent_feature_5_eta',
 'recent_feature_5_status',
 'recent_feature_5_num_car',
 'history_feature_cycle1_gap1_speed',
 'history_feature_cycle1_gap1_eta',
 'history_feature_cycle1_gap1_status',
 'history_feature_cycle1_gap1_num_car',
 'history_feature_cycle1_gap2_speed',
 'history_feature_cycle1_gap2_eta',
 'history_feature_cycle1_gap2_status',
 'history_feature_cycle1_gap2_num_car',
 'history_feature_cycle1_gap3_speed',
 'history_feature_cycle1_gap3_eta',
 'history_feature_cycle1_gap3_status',
 'history_feature_cycle1_gap3_num_car',
 'history_feature_cycle1_gap4_speed',
 'history_feature_cycle1_gap4_eta',
 'history_feature_cycle1_gap4_status',
 'history_feature_cycle1_gap4_num_car',
 'history_feature_cycle1_gap5_speed',
 'history_feature_cycle1_gap5_eta',
 'history_feature_cycle1_gap5_status',
 'history_feature_cycle1_gap5_num_car',
 'history_feature_cycle2_gap1_speed',
 'history_feature_cycle2_gap1_eta',
 'history_feature_cycle2_gap1_status',
 'history_feature_cycle2_gap1_num_car',
 'history_feature_cycle2_gap2_speed',
 'history_feature_cycle2_gap2_eta',
 'history_feature_cycle2_gap2_status',
 'history_feature_cycle2_gap2_num_car',
 'history_feature_cycle2_gap3_speed',
 'history_feature_cycle2_gap3_eta',
 'history_feature_cycle2_gap3_status',
 'history_feature_cycle2_gap3_num_car',
 'history_feature_cycle2_gap4_speed',
 'history_feature_cycle2_gap4_eta',
 'history_feature_cycle2_gap4_status',
 'history_feature_cycle2_gap4_num_car',
 'history_feature_cycle2_gap5_speed',
 'history_feature_cycle2_gap5_eta',
 'history_feature_cycle2_gap5_status',
 'history_feature_cycle2_gap5_num_car',
 'history_feature_cycle3_gap1_speed',
 'history_feature_cycle3_gap1_eta',
 'history_feature_cycle3_gap1_status',
 'history_feature_cycle3_gap1_num_car',
 'history_feature_cycle3_gap2_speed',
 'history_feature_cycle3_gap2_eta',
 'history_feature_cycle3_gap2_status',
 'history_feature_cycle3_gap2_num_car',
 'history_feature_cycle3_gap3_speed',
 'history_feature_cycle3_gap3_eta',
 'history_feature_cycle3_gap3_status',
 'history_feature_cycle3_gap3_num_car',
 'history_feature_cycle3_gap4_speed',
 'history_feature_cycle3_gap4_eta',
 'history_feature_cycle3_gap4_status',
 'history_feature_cycle3_gap4_num_car',
 'history_feature_cycle3_gap5_speed',
 'history_feature_cycle3_gap5_eta',
 'history_feature_cycle3_gap5_status',
 'history_feature_cycle3_gap5_num_car',
 'history_feature_cycle4_gap1_speed',
 'history_feature_cycle4_gap1_eta',
 'history_feature_cycle4_gap1_status',
 'history_feature_cycle4_gap1_num_car',
 'history_feature_cycle4_gap2_speed',
 'history_feature_cycle4_gap2_eta',
 'history_feature_cycle4_gap2_status',
 'history_feature_cycle4_gap2_num_car',
 'history_feature_cycle4_gap3_speed',
 'history_feature_cycle4_gap3_eta',
 'history_feature_cycle4_gap3_status',
 'history_feature_cycle4_gap3_num_car',
 'history_feature_cycle4_gap4_speed',
 'history_feature_cycle4_gap4_eta',
 'history_feature_cycle4_gap4_status',
 'history_feature_cycle4_gap4_num_car',
 'history_feature_cycle4_gap5_speed',
 'history_feature_cycle4_gap5_eta',
 'history_feature_cycle4_gap5_status',
 'history_feature_cycle4_gap5_num_car']

Dataset 2: road attributes, attr.txt

| Field | Type | Meaning |
| --- | --- | --- |
| link_id | categorical | id of the link |
| length | numeric | length of the link, in m |
| direction | categorical | travel direction of the link |
| path_class | categorical | functional class of the link |
| speed_class | categorical | speed-limit class of the link |
| LaneNum | categorical | number of lanes of the link |
| speed_limit | numeric | speed limit of the link, in m/s |
| level | categorical | level of the link |
| width | numeric | width of the link, in m |

The file is read and processed in the same way:

  • Since the category dtype uses far less memory than int64, the categorical columns are converted to category after loading.
  • The numeric columns are standardized (zero mean, unit variance).
road_attr = pd.read_csv(road_attr_path, header=None, sep='\t')
road_attr.columns=['link','length','direction','path_class','speed_class','LaneNum','speed_limit','level','width']

category_columns = ['direction','path_class','speed_class','LaneNum','level']
for cate in category_columns:
    road_attr[cate] = road_attr[cate].astype('category')
road_attr.memory_usage(deep=True)

def scale_numeric(data, columns):
    # In-place z-score standardization: (x - mean) / std
    for col in columns:
        data[col] = (data[col].values - np.mean(data[col].values)) / np.std(data[col].values)

numeric_columns = ['length','speed_limit','width']
scale_numeric(road_attr, numeric_columns)

The processed link attributes are then left-joined to the training and test sets on link:

train = train.merge(road_attr,on='link',how='left')
test = test.merge(road_attr,on='link',how='left')

Dataset 3: road-network topology, topo.txt

| Key | Value |
| --- | --- |
| link_id | downstream link_id_1, downstream link_id_2, downstream link_id_3, … |

road_topo = pd.read_csv(road_topo_path,header=None, sep='\t')
road_topo.columns = ['link','next_link']
road_topo['next_link'] = road_topo['next_link'].apply(lambda x : [int(i) for i in x.split(',')])

This is again left-joined to the training and test sets on link:

train=train.merge(road_topo,on='link',how='left')
test=test.merge(road_topo,on='link',how='left')

Final result

Time features are also added, with the weekday derived from the date as a new field (a sketch of this step follows the field list below).
The final training set has shape roughly (15M, 117): the 30 days of processed traffic data are concatenated along axis 0 into a (15M, 105) table, the 8 link-attribute fields are merged in, the 3 time fields are appended, and finally the next_link column is added. Its fields are:

['current_slice_id',
 'day',
 'future_slice_id',
 'history_feature_cycle1_gap1_eta',
 'history_feature_cycle1_gap1_num_car',
 'history_feature_cycle1_gap1_speed',
 'history_feature_cycle1_gap1_status',
 'history_feature_cycle1_gap2_eta',
 'history_feature_cycle1_gap2_num_car',
 'history_feature_cycle1_gap2_speed',
 'history_feature_cycle1_gap2_status',
 'history_feature_cycle1_gap3_eta',
 'history_feature_cycle1_gap3_num_car',
 'history_feature_cycle1_gap3_speed',
 'history_feature_cycle1_gap3_status',
 'history_feature_cycle1_gap4_eta',
 'history_feature_cycle1_gap4_num_car',
 'history_feature_cycle1_gap4_speed',
 'history_feature_cycle1_gap4_status',
 'history_feature_cycle1_gap5_eta',
 'history_feature_cycle1_gap5_num_car',
 'history_feature_cycle1_gap5_speed',
 'history_feature_cycle1_gap5_status',
 'history_feature_cycle2_gap1_eta',
 'history_feature_cycle2_gap1_num_car',
 'history_feature_cycle2_gap1_speed',
 'history_feature_cycle2_gap1_status',
 'history_feature_cycle2_gap2_eta',
 'history_feature_cycle2_gap2_num_car',
 'history_feature_cycle2_gap2_speed',
 'history_feature_cycle2_gap2_status',
 'history_feature_cycle2_gap3_eta',
 'history_feature_cycle2_gap3_num_car',
 'history_feature_cycle2_gap3_speed',
 'history_feature_cycle2_gap3_status',
 'history_feature_cycle2_gap4_eta',
 'history_feature_cycle2_gap4_num_car',
 'history_feature_cycle2_gap4_speed',
 'history_feature_cycle2_gap4_status',
 'history_feature_cycle2_gap5_eta',
 'history_feature_cycle2_gap5_num_car',
 'history_feature_cycle2_gap5_speed',
 'history_feature_cycle2_gap5_status',
 'history_feature_cycle3_gap1_eta',
 'history_feature_cycle3_gap1_num_car',
 'history_feature_cycle3_gap1_speed',
 'history_feature_cycle3_gap1_status',
 'history_feature_cycle3_gap2_eta',
 'history_feature_cycle3_gap2_num_car',
 'history_feature_cycle3_gap2_speed',
 'history_feature_cycle3_gap2_status',
 'history_feature_cycle3_gap3_eta',
 'history_feature_cycle3_gap3_num_car',
 'history_feature_cycle3_gap3_speed',
 'history_feature_cycle3_gap3_status',
 'history_feature_cycle3_gap4_eta',
 'history_feature_cycle3_gap4_num_car',
 'history_feature_cycle3_gap4_speed',
 'history_feature_cycle3_gap4_status',
 'history_feature_cycle3_gap5_eta',
 'history_feature_cycle3_gap5_num_car',
 'history_feature_cycle3_gap5_speed',
 'history_feature_cycle3_gap5_status',
 'history_feature_cycle4_gap1_eta',
 'history_feature_cycle4_gap1_num_car',
 'history_feature_cycle4_gap1_speed',
 'history_feature_cycle4_gap1_status',
 'history_feature_cycle4_gap2_eta',
 'history_feature_cycle4_gap2_num_car',
 'history_feature_cycle4_gap2_speed',
 'history_feature_cycle4_gap2_status',
 'history_feature_cycle4_gap3_eta',
 'history_feature_cycle4_gap3_num_car',
 'history_feature_cycle4_gap3_speed',
 'history_feature_cycle4_gap3_status',
 'history_feature_cycle4_gap4_eta',
 'history_feature_cycle4_gap4_num_car',
 'history_feature_cycle4_gap4_speed',
 'history_feature_cycle4_gap4_status',
 'history_feature_cycle4_gap5_eta',
 'history_feature_cycle4_gap5_num_car',
 'history_feature_cycle4_gap5_speed',
 'history_feature_cycle4_gap5_status',
 'label',
 'link',
 'recent_feature_1_eta',
 'recent_feature_1_num_car',
 'recent_feature_1_speed',
 'recent_feature_1_status',
 'recent_feature_2_eta',
 'recent_feature_2_num_car',
 'recent_feature_2_speed',
 'recent_feature_2_status',
 'recent_feature_3_eta',
 'recent_feature_3_num_car',
 'recent_feature_3_speed',
 'recent_feature_3_status',
 'recent_feature_4_eta',
 'recent_feature_4_num_car',
 'recent_feature_4_speed',
 'recent_feature_4_status',
 'recent_feature_5_eta',
 'recent_feature_5_num_car',
 'recent_feature_5_speed',
 'recent_feature_5_status',
 'length',
 'direction',
 'path_class',
 'speed_class',
 'LaneNum',
 'speed_limit',
 'level',
 'width',
 'week_day',
 'hour',
 'time_gap',
 'next_link']
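
The original post does not show the code that builds the three time fields; a minimal sketch of one plausible derivation, assuming day 1 corresponds to 2019-07-01 and that the 720 time slices per day are 2 minutes long (30 slices per hour), is:

# Sketch (assumptions, not the original code): derive week_day, hour and time_gap.
dates = pd.Timestamp('2019-06-30') + pd.to_timedelta(train['day'], unit='D')
train['week_day'] = dates.dt.dayofweek                       # 0 = Monday ... 6 = Sunday
train['hour'] = train['future_slice_id'].astype(int) // 30   # 30 two-minute slices per hour
train['time_gap'] = train['future_slice_id'].astype(int) - train['current_slice_id'].astype(int)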

Reducing pandas memory usage

import datetime

def pandas_reduce_mem_usage(df, ignore_columns=[]):
    start_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    start_time = datetime.datetime.now()
    for col in df.columns:
        if col in ignore_columns:
            continue
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            print('{} column dtype is {} and begin convert to others'.format(col, col_type))
            if str(col_type)[:3] == 'int':
                if c_min < 0:
                    # signed integers: pick the smallest type that can hold the range
                    if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                        df[col] = df[col].astype(np.int8)
                    elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                        df[col] = df[col].astype(np.int16)
                    elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                        df[col] = df[col].astype(np.int32)
                    else:
                        df[col] = df[col].astype(np.int64)
                else:
                    # non-negative integers: use unsigned types
                    if c_min >= np.iinfo(np.uint8).min and c_max <= np.iinfo(np.uint8).max:
                        df[col] = df[col].astype(np.uint8)
                    elif c_min >= np.iinfo(np.uint16).min and c_max <= np.iinfo(np.uint16).max:
                        df[col] = df[col].astype(np.uint16)
                    elif c_min >= np.iinfo(np.uint32).min and c_max <= np.iinfo(np.uint32).max:
                        df[col] = df[col].astype(np.uint32)
                    else:
                        df[col] = df[col].astype(np.uint64)
            else:
                # floats: downcast when the value range fits the smaller type
                if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
            print('\t\tcolumn dtype is {}'.format(df[col].dtype))
        else:
            print('\t\tcolumn dtype is object and will convert to category')
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024 ** 2
    end_time = datetime.datetime.now()
    print('consume times: {:.4f}'.format((end_time - start_time).seconds))
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
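
A typical call on the merged training frame (a usage sketch; next_link is skipped because a column of Python lists cannot be converted to category):

train = pandas_reduce_mem_usage(train, ignore_columns=['next_link'])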

File size before the reduction: (screenshot from the original post, not reproduced here)

Original: https://blog.csdn.net/qq_33866063/article/details/117235093
Author: 小卜妞~
Title: 2020CCF 路况预测–数据读取部分:将数据格式转化成dataFrame
