# Feature Selection for Time Series Forecasting in Python

2017/06/21 19:41

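This tutorial covers:

● How to create and interpret a correlation plot of lagged observations.

● How to calculate and interpret feature importance scores for time series features.

● How to perform feature selection on time series input variables.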

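The tutorial is divided into six parts:

1. Load the monthly car sales dataset: load the dataset used throughout the tutorial.
2. Make stationary: make the dataset stationary to simplify the later analysis and forecasting.
3. Autocorrelation plot: create a correlation plot of the time series data.
4. Time series to supervised learning: convert the univariate time series into a supervised learning problem.
5. Feature importance of lag variables: calculate and review feature importance scores for the time series data.
6. Feature selection of lag variables: calculate and review feature selection results for the time series data.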

## 1. Load the Data

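```python
# line plot of time series
from pandas import read_csv
from matplotlib import pyplot

# load dataset (the load step is missing from the source; 'car-sales.csv' is an assumed filename)
series = read_csv('car-sales.csv', header=0, index_col=0, parse_dates=True)['Sales']
# display first few rows
print(series.head(5))
# line plot of dataset
series.plot()
pyplot.show()
```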

```
Month
1960-01-01     6550
1960-02-01     8728
1960-03-01    12026
1960-04-01    14395
1960-05-01    14587
Name: Sales, dtype: int64
```

## 2. Make the Series Stationary

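```python
# seasonally adjust the time series
from matplotlib import pyplot

# `series` is the monthly car sales series loaded in section 1
# seasonal difference: subtract the observation from the same month one year earlier
differenced = series.diff(12)
# trim off the first year of empty data
differenced = differenced.iloc[12:]
# save differenced dataset to file
# (the save call is missing from the source; 'seasonally_adjusted.csv' is an assumed filename)
differenced.to_csv('seasonally_adjusted.csv')
# plot differenced dataset
differenced.plot()
pyplot.show()
```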
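To make the effect of the seasonal difference concrete, here is a minimal sketch on a synthetic series. The data is an assumption made up for illustration (a linear trend plus a 12-month cycle), not the car sales dataset.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (an assumption for illustration): linear trend plus a 12-month cycle.
index = pd.date_range("1960-01-01", periods=48, freq="MS")
months = np.arange(48)
values = 100.0 + 2.0 * months + 10.0 * np.sin(2 * np.pi * months / 12)
series = pd.Series(values, index=index, name="Sales")

# Seasonal difference: subtract the value from the same month one year earlier.
differenced = series.diff(12).dropna()

# The 12-month cycle cancels; only the yearly trend increment (2.0 * 12) remains.
print(differenced.round(6).unique())

# Invert the transform by adding back the observation from 12 months earlier.
restored = differenced + series.shift(12).dropna()
print(np.allclose(restored, series.iloc[12:]))
```

Because the difference can be inverted this way, forecasts made on the adjusted series can be mapped back to the original scale.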

## 3. Autocorrelation Plot

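```python
from statsmodels.graphics.tsaplots import plot_acf
from matplotlib import pyplot

# Assumption: the load step is missing from the source, so this reuses the
# seasonally adjusted series (`differenced`) from section 2.
series = differenced
# plot the correlation of the series with lagged copies of itself
plot_acf(series)
pyplot.show()
```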
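The values behind such a plot can also be inspected as numbers. The sketch below uses pandas `Series.autocorr` on a synthetic series (an assumption for illustration). Note that `autocorr` computes a plain Pearson correlation between the series and its shifted copy, so the values can differ slightly from the estimator that `plot_acf` draws.

```python
import numpy as np
import pandas as pd

# Synthetic series (an assumption for illustration): a 12-step cycle plus small noise.
rng = np.random.default_rng(1)
steps = np.arange(120)
series = pd.Series(np.sin(2 * np.pi * steps / 12) + rng.normal(0.0, 0.1, size=120))

# Correlation between the series and a copy of itself shifted by each lag.
acf_values = {lag: series.autocorr(lag=lag) for lag in range(1, 13)}

for lag, value in acf_values.items():
    print(f"lag {lag:2d}: {value:+.3f}")
```

Lags with correlations far from zero are candidate input features; here the strongest positive value falls at lag 12 and the strongest negative value at lag 6, as expected for a 12-step cycle.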

## 4. Time Series to Supervised Learning

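```python
from pandas import DataFrame

# Assumption: the load step is missing from the source, so this reuses the
# seasonally adjusted series (`differenced`) from section 2.
series = differenced
# reframe as supervised learning: columns t-12 ... t-1 are inputs, t is the output
dataframe = DataFrame()
for i in range(12, 0, -1):
    dataframe['t-' + str(i)] = series.shift(i)
dataframe['t'] = series.values
# drop the leading rows; shifting by up to 12 leaves NaN in only the first 12 rows,
# so this slice also discards one complete row
dataframe = dataframe.iloc[13:]
# save to new file
dataframe.to_csv('lags_12months_features.csv', index=False)
```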
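The same reframing can be wrapped in a small helper. This is a sketch only: `make_lag_frame` is a hypothetical name, not part of pandas, and the toy series is made up so the result is easy to check by eye.

```python
import pandas as pd


def make_lag_frame(series: pd.Series, n_lags: int) -> pd.DataFrame:
    """Return columns t-n_lags ... t-1 (inputs) and t (target), without NaN rows."""
    columns = {f"t-{lag}": series.shift(lag) for lag in range(n_lags, 0, -1)}
    columns["t"] = series
    # The first n_lags rows lack history for at least one lag, so they contain NaN.
    return pd.concat(columns, axis=1).dropna()


toy = pd.Series([10, 20, 30, 40, 50, 60])
frame = make_lag_frame(toy, n_lags=3)
print(frame)
```

Each remaining row pairs the three previous observations with the value to predict, and `dropna()` removes exactly the `n_lags` incomplete rows.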

## 5. Feature Importance of Lag Variables

```python
from sklearn.ensemble import RandomForestRegressor
from matplotlib import pyplot

# `dataframe` is the lag-feature table built in section 4
array = dataframe.values
# split into input and output
X = array[:, 0:-1]
y = array[:, -1]
# fit random forest model
model = RandomForestRegressor(n_estimators=500, random_state=1)
model.fit(X, y)
# show importance scores
print(model.feature_importances_)
# plot importance scores
names = dataframe.columns.values[0:-1]
ticks = list(range(len(names)))
pyplot.bar(ticks, model.feature_importances_)
pyplot.xticks(ticks, names)
pyplot.show()
```

```
[ 0.21642244  0.06271259  0.05662302  0.05543768  0.07155573  0.08478599
  0.07699371  0.05366735  0.1033234   0.04897883  0.1066669   0.06283236]
```
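The raw array is easier to read once each score is paired with its lag name. The sketch below copies the scores from the printed output above and assumes they follow the column order built in section 4 (t-12 first, t-1 last).

```python
import pandas as pd

# Importance scores copied from the printed output above, in column order t-12 ... t-1.
scores = [0.21642244, 0.06271259, 0.05662302, 0.05543768, 0.07155573, 0.08478599,
          0.07699371, 0.05366735, 0.1033234, 0.04897883, 0.1066669, 0.06283236]
names = [f"t-{lag}" for lag in range(12, 0, -1)]

# Label each score with its lag and sort from most to least important.
ranked = pd.Series(scores, index=names).sort_values(ascending=False)
print(ranked.head(4))
```

A one-shot ranking like this need not match the result of recursive elimination, which refits the model after each feature is removed.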

## 6. Feature Selection of Lag Variables

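Recursive feature elimination (RFE) fits a predictive model, weights each feature by its importance to that model, and removes the features with the smallest weights. It repeats this process until only the desired number of features remains.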

```python
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
from matplotlib import pyplot

# separate into input and output variables (`dataframe` is the lag-feature table from section 4)
array = dataframe.values
X = array[:, 0:-1]
y = array[:, -1]
# perform feature selection; n_features_to_select must be passed by keyword in recent scikit-learn
rfe = RFE(RandomForestRegressor(n_estimators=500, random_state=1), n_features_to_select=4)
fit = rfe.fit(X, y)
# report selected features
print('Selected Features:')
names = dataframe.columns.values[0:-1]
for i in range(len(fit.support_)):
    if fit.support_[i]:
        print(names[i])
# plot feature rank (rank 1 marks a selected feature)
ticks = list(range(len(names)))
pyplot.bar(ticks, fit.ranking_)
pyplot.xticks(ticks, names)
pyplot.show()
```

```
Selected Features:
t-12
t-6
t-4
t-2
```
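The elimination loop that RFE performs can be sketched by hand. This is an illustration of the idea, not scikit-learn's implementation: `eliminate_recursively` is a hypothetical helper, the data is synthetic, and the magnitude of least-squares coefficients on standardized features stands in for the random forest importances used in the tutorial.

```python
import numpy as np


def eliminate_recursively(X, y, n_keep):
    """Drop the weakest feature one at a time until n_keep remain; return kept column indices."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Standardize so coefficient magnitudes are comparable across features.
        subset = X[:, remaining]
        Z = (subset - subset.mean(axis=0)) / subset.std(axis=0)
        # Refit on the surviving features only (least squares stands in for the random forest).
        weights, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)
        # Remove the feature whose weight has the smallest magnitude.
        weakest = int(np.argmin(np.abs(weights)))
        remaining.pop(weakest)
    return remaining


# Synthetic data (an assumption for illustration): only columns 0 and 3 drive the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

print(eliminate_recursively(X, y, n_keep=2))
```

The key point is that the model is refit after every removal, so a feature's weight is always judged relative to the features still in play.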

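## Summary

This tutorial covered:

● How to interpret a correlation plot for highly correlated lagged observations.

● How to calculate and review feature importance scores for time series data.

● How to use feature selection to identify the most relevant input variables in time series data.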
