sktime time series library: an introduction to inputs and model building

Sequence analysis with sktime

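sktime is a relatively new library for sequence data. It covers several tasks, namely classification, regression, clustering, forecasting, and annotation, and is worth a closer look.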

Building inputs

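sktime uses a data layout called nested. To convert other kinds of data into the nested form, sktime provides several input-building tools, each suited to a different source layout:
1. from_2d_array_to_nested: converts 2D data, also called tabular data. This layout can hold only a single variable: axis 0 indexes the instances (for example, patients), and axis 1 holds that variable's measurements at successive time points (for example, a drug concentration measured at several times).
Import: from sktime.datatypes._panel._convert import (from_2d_array_to_nested, from_nested_to_2d_array, is_nested_dataframe,)
2. from_long_to_nested: builds the nested form from long-format data. Long-format data stacks the repeated measurements along axis 0, while axis 1 lists a few fields: case_id (the instance number, for example a patient), dim_id (the dimension number, identifying which measured variable a row belongs to, for example heart rate or blood pressure), and reading_id (the position of a reading within its series; it may be easier to think of it in terms of the series length series_len, for example 24 hourly readings).
3. from_multi_index_to_nested: builds the nested form from a multi-indexed pandas DataFrame, specifically one with two index levels, case_id and reading_id, with the dimensions laid out along axis 1.
4. from_3d_numpy_to_nested: builds the nested form from a 3D numpy array. A 3D numpy array is similar to the multi-indexed data, except that it uses default integer indices.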
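To make the nested layout concrete, here is a minimal sketch that builds one by hand with pandas only, without calling sktime. The column name var_0 and the variable X_nested_manual are hypothetical names chosen for this illustration, and the result is only meant to show the shape of the idea, not to reproduce what sktime's converters do internally.

```python
import numpy as np
import pandas as pd

# Tabular input: 3 instances (rows) x 4 time points (columns), one variable.
X_2d = np.arange(12, dtype=float).reshape(3, 4)

# Nested form: one row per instance, one column per variable,
# and each cell holds the whole series for that instance and variable.
cells = np.empty(len(X_2d), dtype=object)
for i, row in enumerate(X_2d):
    cells[i] = pd.Series(row)

# "var_0" is a hypothetical column name used only in this sketch.
X_nested_manual = pd.DataFrame({"var_0": cells})

print(X_nested_manual.shape)                 # (3, 1)
print(type(X_nested_manual.iloc[0, 0]))      # a pandas Series
print(X_nested_manual.iloc[1, 0].tolist())   # [4.0, 5.0, 6.0, 7.0]
```

The table itself stays small (instances by variables), while the time axis lives inside each cell.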

These four should be enough for most purposes; sktime also provides many other ways to read data.

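Note: time series are usually stamped as year-month-day hour:minute:second. If the data is measured once an hour and each day is treated as one case, reading_id runs from 0 to 23 and case_id counts the days. If the same data is instead organized with one month per case, reading_id covers 24*30 readings and case_id counts the months.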
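The choice of period in the note can be sketched with a plain numpy reshape. The data below is hypothetical (60 full days of hourly readings, with 30-day months assumed), and the variable names are made up for the illustration.

```python
import numpy as np

n_days = 60
# Hypothetical hourly readings for 60 full days, one value per hour.
hourly = np.arange(n_days * 24, dtype=float)

# One case per day: case_id is the day number, reading_id is the hour 0..23.
per_day = hourly.reshape(n_days, 24)

# One case per 30-day month: each case holds 24 * 30 = 720 readings.
per_month = hourly.reshape(n_days // 30, 24 * 30)

print(per_day.shape)    # (60, 24)
print(per_month.shape)  # (2, 720)
```

Either array can then be treated as tabular (2D) input, with rows as cases and columns as readings.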

The following uses code provided by sktime to demonstrate:


import sktime
import pandas as pd
from numpy.random import default_rng
from sktime.datatypes._panel._convert import (
    from_2d_array_to_nested,
    from_nested_to_2d_array,
    is_nested_dataframe,
)
rng = default_rng()
X_2d = rng.standard_normal((50, 20))
print(f"The tabular data has the shape {X_2d.shape}")
print(pd.DataFrame(X_2d).head(3))
print(pd.DataFrame(X_2d).tail(3))

output:
The tabular data has the shape (50, 20)
0 1 2 3 4 5 6
0 0.951345 -1.315426 0.214762 -0.136433 0.247084 -1.932268 0.785887
1 -0.010882 1.698942 0.781392 0.007080 -0.734776 0.233316 -1.657900
2 1.018911 -0.964553 -0.682563 0.702798 -0.032204 -1.105005 0.095655

     7         8         9         10        11        12        13  \

0 -0.042764 0.569522 1.017830 0.186783 0.249294 0.244153 -0.985478
1 -0.279308 0.831627 0.765397 1.470033 -0.484943 -0.084951 -1.571112
2 -0.861139 0.122383 -0.389639 1.494302 0.343313 0.297859 0.142106

     14        15        16        17        18        19

0 1.862123 -0.193330 -0.433783 0.750041 -0.500827 -2.268721
1 0.047955 0.664481 -1.374092 -0.187755 1.350386 -0.360479
2 -0.572279 -0.419691 -0.341976 0.008623 0.901157 0.709582
0 1 2 3 4 5 6
47 -2.027696 0.868153 0.083681 0.045453 -0.199436 -0.960754 0.611705
48 1.278521 0.740739 -0.581048 -0.274030 0.559486 -1.960081 -0.527335
49 0.256974 -1.053948 -0.180352 0.495492 -0.229110 3.771682 0.350383

      7         8         9         10        11        12        13  \

47 -0.909587 0.887509 0.960593 -0.712712 0.668460 0.539432 -1.276410
48 -1.160588 0.541308 -1.091403 2.419113 -0.700643 1.003165 -1.082824
49 0.555191 -1.780015 -0.549731 1.503096 -1.293898 0.111052 0.422830

      14        15        16        17        18        19

47 2.669976 0.632130 -0.482352 -0.309763 1.390071 -0.773413
48 -0.796634 0.635805 0.935867 0.563172 -1.042389 -0.277242
49 1.761040 -1.392075 0.873080 -1.138395 -0.788783 1.283540

X_nested = from_2d_array_to_nested(X_2d)
print(f"X_nested is a nested DataFrame: {is_nested_dataframe(X_nested)}")
print(f"The cell contains a {type(X_nested.iloc[0,0])}.")
print(f"The nested DataFrame has shape {X_nested.shape}")
print(X_nested.head(1))
print(X_nested.tail(1))

output:
X_nested is a nested DataFrame: True
The cell contains a

from sktime.datasets import generate_example_long_table
X = generate_example_long_table(num_cases=50, series_len=20, num_dims=5)
print(X.head())
print(X.tail())

Long-format data:

case_id dim_id reading_id value
0 0 0 0 0.562555
1 0 0 1 0.280241
2 0 0 2 0.738769
3 0 0 3 0.258843
4 0 0 4 0.354176
case_id dim_id reading_id value
4995 49 4 15 0.734340
4996 49 4 16 0.991652
4997 49 4 17 0.665473
4998 49 4 18 0.272355
4999 49 4 19 0.695632
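To show how the four long-format columns map onto nested cells, the following sketch does the grouping by hand with pandas on a tiny table. It is an illustration under assumptions, not sktime's own conversion code: the names X_long, X_nested_manual, and the dim_0/dim_1 column labels are hypothetical.

```python
import numpy as np
import pandas as pd

# A tiny long table with the same four columns as above:
# 2 cases x 2 dimensions x 3 readings = 12 rows.
rows = [
    (case, dim, reading, float(100 * case + 10 * dim + reading))
    for case in range(2)
    for dim in range(2)
    for reading in range(3)
]
X_long = pd.DataFrame(rows, columns=["case_id", "dim_id", "reading_id", "value"])

n_cases = X_long["case_id"].nunique()
n_dims = X_long["dim_id"].nunique()

# Collect each (case_id, dim_id) group into one series indexed by reading_id
# and store it in the cell [case_id, dim_id].
cells = np.empty((n_cases, n_dims), dtype=object)
for (case, dim), group in X_long.groupby(["case_id", "dim_id"]):
    cells[case, dim] = group.set_index("reading_id")["value"]

# Hypothetical column labels, one per dimension.
X_nested_manual = pd.DataFrame(cells, columns=[f"dim_{d}" for d in range(n_dims)])

print(X_nested_manual.shape)                # (2, 2)
print(X_nested_manual.iloc[1, 0].tolist())  # [100.0, 101.0, 102.0]
```

Each row of the result is one case and each column one dimension, which is why the 5000-row table above should shrink to 50 rows and 5 columns once nested.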

from sktime.datatypes._panel._convert import from_long_to_nested, from_nested_to_long
X_nested = from_long_to_nested(X)
X_nested.head()
print(f"X_nested is a nested DataFrame: {is_nested_dataframe(X_nested)}")
print(f"The cell contains a {type(X_nested.iloc[0,0])}.")
print(f"The nested DataFrame has shape {X_nested.shape}")
X_nested.iloc[0, 0]

X_nested is a nested DataFrame: True
The cell contains a


from sktime.datasets import make_multi_index_dataframe
from sktime.datatypes._panel._convert import (
    from_multi_index_to_nested,
    from_nested_to_multi_index,
)
X_mi = make_multi_index_dataframe(n_instances=50, n_columns=5, n_timepoints=20)
print(f"The multi-indexed DataFrame has shape {X_mi.shape}")
print(f"The multi-index names are {X_mi.index.names}")
X_mi.head()

output:
The multi-indexed DataFrame has shape (1000, 5)
The multi-index names are ['case_id', 'reading_id']
                       var_0     var_1     var_2     var_3     var_4
case_id reading_id
0       0           0.995135  0.038909  0.181639  0.104997  0.122326
        1           0.355448  0.991423  0.807475  0.062603  0.175412
        2           0.935102  0.403729  0.031880  0.664581  0.438267
        3           0.724594  0.291621  0.833333  0.780480  0.201459
        4           0.019985  0.743191  0.485572  0.444321  0.146891
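The similarity between multi-indexed data and a 3D numpy array can be sketched as a reshape. This example uses pandas and numpy only; the frame X_mi_small is a hypothetical stand-in, and the (instance, variable, time point) axis order of the 3D array is an assumption about the convention, so check it against your sktime version.

```python
import numpy as np
import pandas as pd

n_instances, n_timepoints, n_columns = 3, 4, 2

# Build a small frame indexed by (case_id, reading_id), like the one above.
index = pd.MultiIndex.from_product(
    [range(n_instances), range(n_timepoints)], names=["case_id", "reading_id"]
)
values = np.arange(n_instances * n_timepoints * n_columns, dtype=float)
X_mi_small = pd.DataFrame(
    values.reshape(-1, n_columns), index=index, columns=["var_0", "var_1"]
)

# Rows are ordered case by case, so a reshape recovers the layout
# (instance, time point, variable); the transpose moves variables to axis 1.
X_3d = X_mi_small.to_numpy().reshape(n_instances, n_timepoints, n_columns)
X_3d = X_3d.transpose(0, 2, 1)

print(X_mi_small.shape)  # (12, 2)
print(X_3d.shape)        # (3, 2, 4)
```

The reshape only works when every case has the same number of readings and the rows are sorted by case_id and then reading_id.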

X_nested = from_multi_index_to_nested(X_mi, instance_index="case_id")
print(f"X_nested is a nested DataFrame: {is_nested_dataframe(X_nested)}")
print(f"The cell contains a {type(X_nested.iloc[0,0])}.")
print(f"The nested DataFrame has shape {X_nested.shape}")
X_nested.head()

output:
X_nested is a nested DataFrame: True
The cell contains a

Building models

from sktime.datasets import load_airline
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.forecasting.theta import ThetaForecaster
from sktime.performance_metrics.forecasting import mean_absolute_percentage_error

import matplotlib.pyplot  as plt
y = load_airline()
print(y)

y_train, y_test = temporal_train_test_split(y)
fh = ForecastingHorizon(y_test.index, is_relative=False)
print(fh)
forecaster = ThetaForecaster(sp=12)
forecaster.fit(y_train)
y_pred = forecaster.predict(fh)
mean_absolute_percentage_error(y_test, y_pred)
y_train.plot()
y_test.plot()
y_pred.plot()
plt.show()

output:
Period
1949-01 112.0
1949-02 118.0
1949-03 132.0
1949-04 129.0
1949-05 121.0

1960-08 606.0
1960-09 508.0
1960-10 461.0
1960-11 390.0
1960-12 432.0
Freq: M, Name: Number of airline passengers, Length: 144, dtype: float64
ForecastingHorizon(['1958-01', '1958-02', '1958-03', '1958-04', '1958-05', '1958-06',
'1958-07', '1958-08', '1958-09', '1958-10', '1958-11', '1958-12',
'1959-01', '1959-02', '1959-03', '1959-04', '1959-05', '1959-06',
'1959-07', '1959-08', '1959-09', '1959-10', '1959-11', '1959-12',
'1960-01', '1960-02', '1960-03', '1960-04', '1960-05', '1960-06',
'1960-07', '1960-08', '1960-09', '1960-10', '1960-11', '1960-12'],
dtype='period[M]', name='Period', is_relative=False)
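The horizon printed above covers the last 36 of the 144 months, that is, the final quarter of the series in time order. The sketch below mimics that kind of chronological split and computes the percentage error by hand with numpy. The data and the naive forecast are hypothetical stand-ins, not the Theta model. As far as I know, sktime's mean_absolute_percentage_error also accepts a symmetric flag, so both variants are shown; check the default in your installed version.

```python
import numpy as np

# Hypothetical monthly values standing in for a real series.
y = np.array([112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0])

# Chronological split: the last 25% of observations form the test set,
# so no future value leaks into training.
n_test = int(np.ceil(0.25 * len(y)))
y_train, y_test = y[:-n_test], y[-n_test:]

# Naive forecast as a stand-in model: repeat the last training value.
y_pred = np.full(n_test, y_train[-1])

# MAPE: mean of |error| / |actual|.
mape = np.mean(np.abs(y_test - y_pred) / np.abs(y_test))

# Symmetric variant (sMAPE): mean of 2 * |error| / (|actual| + |forecast|).
smape = np.mean(2 * np.abs(y_test - y_pred) / (np.abs(y_test) + np.abs(y_pred)))

print(len(y_train), len(y_test))
print(round(mape, 4), round(smape, 4))
```

Unlike a random split, this keeps the test period strictly after the training period, which is what a forecasting evaluation needs.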

from sktime.classification.interval_based import TimeSeriesForestClassifier
from sktime.datasets import load_arrow_head
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_arrow_head()
print(X.shape)
print(y)
X_train, X_test, y_train, y_test = train_test_split(X, y)
classifier = TimeSeriesForestClassifier()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
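# train_test_split shuffles at random, so the score varies between runs;
# the original run returned 0.8679245283018868
accuracy_score(y_test, y_pred)
X.iloc[0,0]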

output:
(211, 1)
['0' '1' '2' '0' '1' '2' '0' '1' '2' '0' '1' '2' '0' '1' '2' '0' '1' '2'
'0' '1' '2' '0' '1' '2' '0' '1' '2' '0' '1' '2' '0' '1' '2' '0' '1' '2'
'0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0'
'0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0'
'0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0'
'0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '1' '1' '1'
'1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1'
'1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1'
'1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '2' '2' '2' '2'
'2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2'
'2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2'
'2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2']
0 -1.963009
1 -1.957825
2 -1.956145
3 -1.938289
4 -1.896657

246 -1.841345
247 -1.884289
248 -1.905393
249 -1.923905
250 -1.909153
Length: 251, dtype: float64
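The classifier above comes from sktime's interval_based module. As a rough illustration of the general time series forest idea (summary statistics over random intervals, fed to a tree ensemble), here is a numpy-only sketch of the feature extraction step. It is an assumption-laden simplification, not sktime's implementation: the function interval_features, its parameters, and the random data are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical panel: 6 series of length 50, stored as a plain 2D array.
X = rng.standard_normal((6, 50))


def interval_features(X, n_intervals=3, min_length=5, rng=rng):
    """Mean, standard deviation, and slope over randomly chosen intervals."""
    n_timepoints = X.shape[1]
    features = []
    for _ in range(n_intervals):
        # Pick a random interval [start, end) of at least min_length points.
        start = rng.integers(0, n_timepoints - min_length + 1)
        end = rng.integers(start + min_length, n_timepoints + 1)
        window = X[:, start:end]
        t = np.arange(end - start)
        # Least-squares slope of each series within the window.
        slope = np.polyfit(t, window.T, deg=1)[0]
        features += [window.mean(axis=1), window.std(axis=1), slope]
    return np.column_stack(features)


F = interval_features(X)
print(F.shape)  # (6, 9)
```

The resulting feature table has one row per series, so it could be passed to any ordinary tabular classifier.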

from sktime.annotation.adapters import PyODAnnotator
from pyod.models.iforest import IForest
from sktime.datasets import load_airline
y = load_airline()
print(y.shape)
pyod_model = IForest()
pyod_sktime_annotator = PyODAnnotator(pyod_model)
pyod_sktime_annotator.fit(y)
annotated_series = pyod_sktime_annotator.predict(y)
annotated_series

output:
(144,)
1949-01 1
1949-02 0
1949-03 0
1949-04 0
1949-05 0

1960-08 1
1960-09 1
1960-10 0
1960-11 0
1960-12 1
Freq: M, Length: 144, dtype: int32
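The annotator returns one 0/1 label per time point, with 1 marking a point flagged as an outlier. To show that output format without PyOD, the sketch below swaps in a plain z-score rule in place of the isolation forest. The series and the threshold of 2 standard deviations are hypothetical, and this simple rule will not reproduce the labels printed above.

```python
import pandas as pd

# Hypothetical monthly series with one obvious spike.
index = pd.period_range("2020-01", periods=8, freq="M")
y = pd.Series([10.0, 11.0, 9.0, 10.0, 30.0, 10.0, 11.0, 9.0], index=index)

# Stand-in detector (not an isolation forest): flag points that lie
# more than 2 standard deviations away from the mean.
z = (y - y.mean()) / y.std()
labels = (z.abs() > 2).astype(int)

print(labels)
```

Only the spike in 2020-05 gets the label 1 here; every other month is labeled 0.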

Original: https://blog.csdn.net/skyskytotop/article/details/123895815
Author: 预测模型的开发与应用研究
Title: 时间序列库sktime 输入和模型构建介绍

