[ASTGCN with a single feature] Code walkthrough (torch): parameter reading and data loading (Part 1)

2. configparser and argparse

If these modules are unfamiliar, or you hit errors while reading the parameters, see the posts 《【configparser】参数读取》 and 《【argparse】参数配置》.
  • Contents of configurations/PEMS04_astgcn.conf (shown as a screenshot in the original post)
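Since the screenshot is not reproduced here, the following is a minimal sketch of such a file, reconstructed only from the keys that prepareData.py reads below; the values are illustrative (PEMS04 has 307 vertices, PEMS08 has 170):

[Data]
adj_filename = ./data/PEMS04/distance.csv
graph_signal_matrix_filename = ./data/PEMS04/PEMS04.npz
num_of_vertices = 307
points_per_hour = 12
num_for_predict = 12
len_input = 12
dataset_name = PEMS04

[Training]
num_of_weeks = 0
num_of_days = 0
num_of_hours = 1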

3. The prepareData.py file

This file performs the parameter reading and the data preprocessing.

Import the libraries:

import os
import numpy as np
import argparse
import configparser

3.1 Reading the parameters


parser = argparse.ArgumentParser()
parser.add_argument("--config", default='configurations/PEMS08_astgcn.conf', type=str,
                    help="configuration file path")
args = parser.parse_args()
config = configparser.ConfigParser()
print('Read configuration file: %s' % (args.config))
config.read(args.config)
data_config = config['Data']
training_config = config['Training']

adj_filename = data_config['adj_filename']
graph_signal_matrix_filename = data_config['graph_signal_matrix_filename']

if config.has_option('Data', 'id_filename'):
    id_filename = data_config['id_filename']
else:
    id_filename = None

num_of_vertices = int(data_config['num_of_vertices'])
points_per_hour = int(data_config['points_per_hour'])
num_for_predict = int(data_config['num_for_predict'])
len_input = int(data_config['len_input'])
dataset_name = data_config['dataset_name']
num_of_weeks = int(training_config['num_of_weeks'])
num_of_days = int(training_config['num_of_days'])
num_of_hours = int(training_config['num_of_hours'])
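
Note that configparser returns every option as a string, which is why all numeric options above are wrapped in int(). A minimal self-contained sketch (using read_string with a made-up snippet instead of a file on disk):

import configparser

config = configparser.ConfigParser()
config.read_string('''
[Data]
num_of_vertices = 307
dataset_name = PEMS04
''')
data_config = config['Data']
print(type(data_config['num_of_vertices']))      # <class 'str'>: options are always strings
print(int(data_config['num_of_vertices']) + 1)   # 308
print(config.has_option('Data', 'id_filename'))  # False, hence id_filename = None above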

3.2 Variable reference

  • Variables used in the model

| Code variable | Meaning | Example value |
| --- | --- | --- |
| num_of_vertices | number of vertices in the network | 170 (PEMS08) |
| points_per_hour | number of observed time points per hour | 12 |
| num_for_predict | number of future time points to predict | 12 |
| len_input | length of the input sequence | 12 |
| num_of_weeks | number of week-periodic segments | 0 (2 in the Section 8 test) |
| num_of_days | number of day-periodic segments | 0 (2 in the Section 8 test) |
| num_of_hours | number of recent hourly segments | 1 (2 in the Section 8 test) |

  • Variables controlling the pipeline

| Code variable | Default / value | Meaning | Role |
| --- | --- | --- | --- |
| args.config | 'configurations/PEMS04_astgcn.conf' | file path | path used to read the model parameters (the code above defaults to the PEMS08 file) |
| config | configparser instance | after config.read it holds two sections, each with many options | |
| data_config | config['Data'] | the ['Data'] section of config (holds many options) | |
| training_config | config['Training'] | the ['Training'] section of config (holds many options) | |
| adj_filename | './data/PEMS04/distance.csv' | file path | used to read the adjacency matrix |
| graph_signal_matrix_filename | './data/PEMS04/PEMS04.npz' | file path | used to read the graph signal matrix |
| dataset_name | | | name of the dataset |

4. The read_and_generate_dataset function

  • The call that reads and generates the dataset
all_data = read_and_generate_dataset(graph_signal_matrix_filename, 0, 0, num_of_hours, num_for_predict, points_per_hour=points_per_hour, save=True)
  • Function definition
def read_and_generate_dataset(graph_signal_matrix_filename,
                              num_of_weeks, num_of_days,
                              num_of_hours, num_for_predict,
                              points_per_hour=12, save=False):
    '''
    Parameters
    ----------
    graph_signal_matrix_filename: str, path of graph signal matrix file
    num_of_weeks: int, e.g. 0
    num_of_days: int, e.g. 0
    num_of_hours: int, e.g. 1
    num_for_predict: int, e.g. 12
    points_per_hour: int, default 12, depends on data
    save: bool, e.g. True
    Returns
    ----------
    feature: np.ndarray,
             shape is (num_of_samples, num_of_depend * points_per_hour,
                       num_of_vertices, num_of_features)
    target: np.ndarray,
            shape is (num_of_samples, num_of_vertices, num_for_predict)
    '''

    data_seq = np.load(graph_signal_matrix_filename)['data']

    all_samples = []
    for idx in range(data_seq.shape[0]):
        sample = get_sample_indices(data_seq, num_of_weeks, num_of_days,
                                    num_of_hours, idx, num_for_predict,
                                    points_per_hour)

        if ((sample[0] is None) and (sample[1] is None) and (sample[2] is None)):
            continue

        week_sample, day_sample, hour_sample, target = sample

        sample = []

        if num_of_weeks > 0:
            week_sample = np.expand_dims(week_sample, axis=0).transpose((0, 2, 3, 1))
            sample.append(week_sample)

        if num_of_days > 0:
            day_sample = np.expand_dims(day_sample, axis=0).transpose((0, 2, 3, 1))
            sample.append(day_sample)

        if num_of_hours > 0:
            hour_sample = np.expand_dims(hour_sample, axis=0).transpose((0, 2, 3, 1))
            sample.append(hour_sample)
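        # next, build the target: (T, N, F) -> (1, T, N, F) -> transpose to (1, N, F, T) -> keep feature 0 only -> (1, N, T)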

        target = np.expand_dims(target, axis=0).transpose((0, 2, 3, 1))[:, :, 0, :]
        sample.append(target)

        time_sample = np.expand_dims(np.array([idx]), axis=0)
        sample.append(time_sample)

        all_samples.append(sample)

    split_line1 = int(len(all_samples) * 0.6)
    split_line2 = int(len(all_samples) * 0.8)

    training_set = [np.concatenate(i, axis=0)
                    for i in zip(*all_samples[:split_line1])]
    validation_set = [np.concatenate(i, axis=0)
                      for i in zip(*all_samples[split_line1: split_line2])]
    testing_set = [np.concatenate(i, axis=0)
                   for i in zip(*all_samples[split_line2:])]

    train_x = np.concatenate(training_set[:-2], axis=-1)
    val_x = np.concatenate(validation_set[:-2], axis=-1)
    test_x = np.concatenate(testing_set[:-2], axis=-1)

    train_target = training_set[-2]
    val_target = validation_set[-2]
    test_target = testing_set[-2]

    train_timestamp = training_set[-1]
    val_timestamp = validation_set[-1]
    test_timestamp = testing_set[-1]

    (stats, train_x_norm, val_x_norm, test_x_norm) = normalization(train_x, val_x, test_x)

    all_data = {
        'train': {
            'x': train_x_norm,
            'target': train_target,
            'timestamp': train_timestamp,
        },
        'val': {
            'x': val_x_norm,
            'target': val_target,
            'timestamp': val_timestamp,
        },
        'test': {
            'x': test_x_norm,
            'target': test_target,
            'timestamp': test_timestamp,
        },
        'stats': {
            '_mean': stats['_mean'],
            '_std': stats['_std'],
        }
    }
    print('train x:', all_data['train']['x'].shape)
    print('train target:', all_data['train']['target'].shape)
    print('train timestamp:', all_data['train']['timestamp'].shape)
    print()
    print('val x:', all_data['val']['x'].shape)
    print('val target:', all_data['val']['target'].shape)
    print('val timestamp:', all_data['val']['timestamp'].shape)
    print()
    print('test x:', all_data['test']['x'].shape)
    print('test target:', all_data['test']['target'].shape)
    print('test timestamp:', all_data['test']['timestamp'].shape)
    print()
    print('train data _mean :', stats['_mean'].shape, stats['_mean'])
    print('train data _std :', stats['_std'].shape, stats['_std'])

    if save:
        file = os.path.basename(graph_signal_matrix_filename).split('.')[0]
        dirpath = os.path.dirname(graph_signal_matrix_filename)
        filename = os.path.join(dirpath, file + '_r' + str(num_of_hours) + '_d' + str(num_of_days) + '_w' + str(num_of_weeks)) + '_astcgn'
        print('save file:', filename)
        np.savez_compressed(filename,
                            train_x=all_data['train']['x'], train_target=all_data['train']['target'],
                            train_timestamp=all_data['train']['timestamp'],
                            val_x=all_data['val']['x'], val_target=all_data['val']['target'],
                            val_timestamp=all_data['val']['timestamp'],
                            test_x=all_data['test']['x'], test_target=all_data['test']['target'],
                            test_timestamp=all_data['test']['timestamp'],
                            mean=all_data['stats']['_mean'], std=all_data['stats']['_std']
                            )
    return all_data
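
The save branch derives the output name from the input .npz path. A quick sketch of the resulting name, assuming the PEMS04 paths from Section 3.2 and the default num_of_hours=1, num_of_days=0, num_of_weeks=0 (note that the '_astcgn' spelling comes straight from the code above):

import os

f = './data/PEMS04/PEMS04.npz'
file = os.path.basename(f).split('.')[0]                          # 'PEMS04'
dirpath = os.path.dirname(f)                                      # './data/PEMS04'
filename = os.path.join(dirpath, file + '_r1_d0_w0') + '_astcgn'
print(filename)  # ./data/PEMS04/PEMS04_r1_d0_w0_astcgn (np.savez_compressed appends .npz)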

Function walkthrough:

Note 1: loading the data
np.load(graph_signal_matrix_filename)['data'] reads the array named 'data'. An .npz file usually contains at least one array.
1. To list the arrays it contains:

  data = np.load(graph_signal_matrix_filename)
  print(data.files)

out: ['data']
It contains a single array, named "data".
2. To extract the array: data["data"]
3. For PEMS08, data_seq.shape is (17856, 170, 3), i.e. (sequence length, number of vertices, number of features per vertex). Reminder: the sequence is ordered by time.
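
A minimal round trip showing this behaviour with a toy file (file and variable names made up):

import numpy as np

arr = np.zeros((17856, 170, 3))      # same layout as PEMS08: (T, N, F)
np.savez('toy_pems.npz', data=arr)   # the keyword name becomes the array name

loaded = np.load('toy_pems.npz')
print(loaded.files)                  # ['data']
print(loaded['data'].shape)          # (17856, 170, 3)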
Note 2: generating the sliding windows

Pseudocode:

[Figure: pseudocode of the sliding-window sample generation]

After studying the schematic, move on to Sections 5 and 6.

A. Schematic

Schematic of how a sliding window generates a new sequence. The top right shows a continuous time series with the window size set to 4. Each time the window advances one step (or possibly `unit` steps), it yields one "window of data", until the last window is reached. All windows are then concatenated along axis=0 into a new array (see the left part of the figure).

[Figure: sliding-window schematic]
  1. Call get_sample_indices to obtain the week, day, hour and target slices (the function is analysed in Section 5). Each slice has shape (some sequence length, number of vertices, number of features). A toy version of the window construction is sketched below.
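
A toy version of the schematic, assuming a short 1-D series and window size 4 (all names made up):

import numpy as np

seq = np.arange(10)        # a short 'time series'
window = 4
# one window per step, stacked along axis 0 as in the schematic
windows = np.stack([seq[i:i + window] for i in range(len(seq) - window + 1)], axis=0)
print(windows.shape)             # (7, 4)
print(windows[0], windows[-1])   # [0 1 2 3] [6 7 8 9]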

B. Data assembly illustration

[Figure: data assembly illustration]
Note 3: building the train / val / test data
  • split_line1 and split_line2 are the 60% and 80% cut points: all_samples is split chronologically into train (first 60%), validation (next 20%) and test (last 20%), after which zip(*...) regroups each split by field, as in the sketch below.
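
Each element of all_samples is a list like [hour_sample, target, time_sample] (plus day/week entries when enabled), so zip(*all_samples) regroups the samples field by field before concatenation. A minimal sketch, assuming only the hour branch is active and toy shapes:

import numpy as np

# 10 fake samples, each [hour_sample (1,N,F,T), target (1,N,T), time_sample (1,1)]
all_samples = [[np.zeros((1, 170, 3, 12)), np.zeros((1, 170, 12)), np.array([[i]])]
               for i in range(10)]

split_line1 = int(len(all_samples) * 0.6)   # 6
split_line2 = int(len(all_samples) * 0.8)   # 8

training_set = [np.concatenate(field, axis=0) for field in zip(*all_samples[:split_line1])]
print([x.shape for x in training_set])      # [(6, 170, 3, 12), (6, 170, 12), (6, 1)]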

5. The get_sample_indices function in prepareData.py

5.1 Calling the function

  • The function is called once for every idx (see the loop inside read_and_generate_dataset above).

5.2 Function overview

This function produces the week sample, the day sample and the hour sample. Following the idea of the schematic, it first obtains the index list of the block groups (this step is done by search_data, see Section 6), and then merges the data with np.concatenate to obtain the final sample.

The final sample has shape (window size × number of block groups, number of vertices, number of features).

5.3 Example: the hour sample

[Figure: worked example for the hour sample]
5.4 Function code
def get_sample_indices(data_sequence, num_of_weeks, num_of_days, num_of_hours,
                       label_start_idx, num_for_predict, points_per_hour=12):
    '''
    Parameters
    ----------
    data_sequence: np.ndarray
                   shape is (sequence_length, num_of_vertices, num_of_features)
    num_of_weeks: int, e.g. 0
    num_of_days: int, e.g. 0
    num_of_hours: int, e.g. 1
    label_start_idx: int, the first index of the predicting target, 0~16992
    num_for_predict: int, e.g. 12 (num_timesteps_output); the label window size equals the prediction length
    points_per_hour: int, default 12, number of points per hour; each hour is split into 12 slots, one record every 5 minutes, i.e. two adjacent blocks are 5 minutes apart
    Returns
    ----------
    week_sample: np.ndarray, the week sample
                 shape is (num_of_weeks * points_per_hour,
                           num_of_vertices, num_of_features)
    day_sample: np.ndarray, the day sample
                 shape is (num_of_days * points_per_hour,
                           num_of_vertices, num_of_features)
    hour_sample: np.ndarray, the hour sample
                 shape is (num_of_hours * points_per_hour,
                           num_of_vertices, num_of_features)
    target: np.ndarray, the label sample
            shape is (num_for_predict, num_of_vertices, num_of_features)
    '''
    week_sample, day_sample, hour_sample = None, None, None

    if label_start_idx + num_for_predict > data_sequence.shape[0]:

        return week_sample, day_sample, hour_sample, None

    if num_of_weeks > 0:
        week_indices = search_data(data_sequence.shape[0], num_of_weeks,
                                   label_start_idx, num_for_predict,
                                   7 * 24, points_per_hour)

        if not week_indices:
            return None, None, None, None

        week_sample = np.concatenate([data_sequence[i: j]
                                      for i, j in week_indices], axis=0)

    if num_of_days > 0:
        day_indices = search_data(data_sequence.shape[0], num_of_days,
                                  label_start_idx, num_for_predict,
                                  24, points_per_hour)

        if not day_indices:
            return None, None, None, None

        day_sample = np.concatenate([data_sequence[i: j]
                                     for i, j in day_indices], axis=0)

    if num_of_hours > 0:
        hour_indices = search_data(data_sequence.shape[0], num_of_hours,
                                   label_start_idx, num_for_predict,
                                   1, points_per_hour)

        if not hour_indices:
            return None, None, None, None

        hour_sample = np.concatenate([data_sequence[i: j]
                                      for i, j in hour_indices], axis=0)

    target = data_sequence[label_start_idx: label_start_idx + num_for_predict]

    return week_sample, day_sample, hour_sample, target
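
A quick sanity check on random data (a toy example, assuming search_data from Section 6 is also defined; shapes follow the PEMS04 layout):

import numpy as np

data_seq = np.random.rand(16992, 307, 3)   # (T, N, F), PEMS04-sized
week_s, day_s, hour_s, target = get_sample_indices(
    data_seq, num_of_weeks=0, num_of_days=0, num_of_hours=1,
    label_start_idx=4032, num_for_predict=12, points_per_hour=12)
print(hour_s.shape)    # (12, 307, 3): one hour of history
print(target.shape)    # (12, 307, 3): the next 12 steps
print(week_s, day_s)   # None None, since those branches are disabled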

6. The search_data function

This function returns the first and last index of every (sliding) block group. Suppose the new data in the schematic has shape (20, 4, 170, 3): 20 windows, each spanning 4 points, i.e. 20 block groups of 4 blocks each. The function returns the list of (start, end) index pairs, one pair per block group.
Important notes:

  1. In the schematic the window slides one step at a time; in this function it slides points_per_hour * units steps at a time.

  2. In the schematic the window size is 4 (easier to draw); in this function the window size is num_for_predict, which is 12 throughout this post (and here equal to points_per_hour).

  3. In the schematic the window moves forward along the time axis; in this function it moves backwards, against the time axis. As soon as a window's start index start_idx becomes negative, the function returns None.

6.1 Example: num_of_days = 1

The command calling the function:

[Figure: the call to search_data with num_of_days = 1]

6.2 Function code
def search_data(sequence_length, num_of_depend, label_start_idx,
                num_for_predict, units, points_per_hour):
    '''
    Parameters
    ----------
    sequence_length: int, length of all history data
    num_of_depend: int, number of periods to look back (num_of_weeks / num_of_days / num_of_hours)
    label_start_idx: int, the first index of predicting target
    num_for_predict: int, the number of points will be predicted for each sample
    units: int, week: 7 * 24, day: 24, recent(hour): 1
    points_per_hour: int, number of points per hour, depends on data
    Returns
    ----------
    list[(start_idx, end_idx)]
    '''

    if points_per_hour <= 0:
        raise ValueError("points_per_hour should be greater than 0!")

    if label_start_idx + num_for_predict > sequence_length:
        return None

    x_idx = []
    for i in range(1, num_of_depend + 1):
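        # step backwards by whole periods: i * units hours = points_per_hour * units * i points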
        start_idx = label_start_idx - points_per_hour * units * i
        end_idx = start_idx + num_for_predict
        if start_idx >= 0:
            x_idx.append((start_idx, end_idx))
        else:
            return None

    if len(x_idx) != num_of_depend:
        return None

    return x_idx[::-1]
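
This reproduces the week row of the table in Section 8 directly (a sketch, assuming the function as defined above):

print(search_data(16992, 2, 4032, 12, 7 * 24, 12))
# -> [(0, 12), (2016, 2028)]
print(search_data(16992, 2, 4031, 12, 7 * 24, 12))
# -> None, because 4031 - 2 * 2016 = -1 < 0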

7. The normalization function

Standardizes the train, val and test data by subtracting the mean and dividing by the standard deviation; both statistics are computed on the training set only.

def normalization(train, val, test):
    '''
    Parameters
    ----------
    train, val, test: np.ndarray (B,N,F,T)
    Returns
    ----------
    stats: dict, two keys: mean and std
    train_norm, val_norm, test_norm: np.ndarray,
                                     shape is the same as original
    '''
    assert train.shape[1:] == val.shape[1:] and val.shape[1:] == test.shape[1:]
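    # statistics come from the training set only and are reused for val and test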
    mean = train.mean(axis=(0,1,3), keepdims=True)
    std = train.std(axis=(0,1,3), keepdims=True)
    print('mean.shape:',mean.shape)
    print('std.shape:',std.shape)

    def normalize(x):
        return (x - mean) / std

    train_norm = normalize(train)
    val_norm = normalize(val)
    test_norm = normalize(test)

    return {'_mean': mean, '_std': std}, train_norm, val_norm, test_norm
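
A toy check of the shapes and broadcasting (the sizes are made up but follow the (B, N, F, T) layout):

import numpy as np

train = np.random.rand(6, 170, 3, 12)
val = np.random.rand(2, 170, 3, 12)
test = np.random.rand(2, 170, 3, 12)

stats, train_n, val_n, test_n = normalization(train, val, test)
print(stats['_mean'].shape)           # (1, 1, 3, 1): one mean per feature
print(abs(train_n.mean()) < 1e-9)     # True: the training split is centred
print(train_n.shape == train.shape)   # True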

8. Testing read_and_generate_dataset

  1. Call the function, this time with num_of_weeks = 2, num_of_days = 2 and num_of_hours = 2:
all_data = read_and_generate_dataset(graph_signal_matrix_filename,
                                     2, 2, 2, num_for_predict,
                                     points_per_hour=points_per_hour, save=True)

  2. idx: int, ranging over 0 ~ data_seq.shape[0] - 1 (0 ~ 16991 for PEMS04); each idx triggers one call to get_sample_indices:
for idx in range(data_seq.shape[0]):
    sample = get_sample_indices(data_seq, num_of_weeks, num_of_days,
                                num_of_hours, idx, num_for_predict,
                                points_per_hour)

From the week / day / hour cases below one can see that valid samples are produced only when idx ≥ max(7 * 24 * 12 * num_of_weeks, 24 * 12 * num_of_days, 12 * num_of_hours).
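
In numbers, for the test call above:

# first idx that yields valid samples for (weeks, days, hours) = (2, 2, 2)
print(max(7 * 24 * 12 * 2, 24 * 12 * 2, 12 * 2))   # 4032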

  3. num_of_weeks = 2 > 0: search_data is called:
week_indices = search_data(data_sequence.shape[0], num_of_weeks,
                           label_start_idx, num_for_predict,
                           7 * 24, points_per_hour)

One week back corresponds to 7 * 24 * 12 = 2016 time points.

| idx | start_idx_1 | end_idx_1 | start_idx_2 | end_idx_2 | return |
| --- | --- | --- | --- | --- | --- |
| 0 – 2015 | -2016 → -1 | | | | None |
| 2016 | 0 | 12 | -2016 | -2004 | None |
| 4031 | 2015 | 2027 | -1 | 11 | None |
| 4032 | 2016 | 2028 | 0 | 12 | [(0, 12), (2016, 2028)] |

  • Concatenation: week_sample = np.concatenate([data_seq[0:12], data_seq[2016:2028]], axis=0), whose shape is (24, 307, 3).

  • num_of_days = 2 > 0: search_data is called:

day_indices = search_data(data_sequence.shape[0], num_of_days,
                          label_start_idx, num_for_predict,
                          24, points_per_hour)

| idx | start_idx_1 | end_idx_1 | start_idx_2 | end_idx_2 | return |
| --- | --- | --- | --- | --- | --- |
| 4032 | 3744 | 3756 | 3456 | 3468 | [(3456, 3468), (3744, 3756)] |

  • Concatenation: day_sample = np.concatenate([data_seq[3456:3468], data_seq[3744:3756]], axis=0), whose shape is (24, 307, 3).

  • num_of_hours = 2 > 0: search_data is called:

hour_indices = search_data(data_sequence.shape[0], num_of_hours,
                           label_start_idx, num_for_predict,
                           1, points_per_hour)

| idx | start_idx_1 | end_idx_1 | start_idx_2 | end_idx_2 | return |
| --- | --- | --- | --- | --- | --- |
| 4032 | 4020 | 4032 | 4008 | 4020 | [(4008, 4020), (4020, 4032)] |

  • Concatenation: hour_sample = np.concatenate([data_seq[4008:4020], data_seq[4020:4032]], axis=0), whose shape is (24, 307, 3).

Follow-up posts in this series:
【ASTGCN】代码解读(torch)之train_ASTGCN_r(二)

【ASTGCN】模型解读(torch)之模型框架(三)

Original: https://blog.csdn.net/panbaoran913/article/details/124332937
Author: panbaoran913
Title: [ASTGCN之1个特征]解读(torch)之参数读取和数据读入(一)
