Time Series Analysis

2018-11-01 Alex_杨策

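Introduction to Time Series

Time series analysis comes up often in data analysis, especially with financial data. A time series is a set of random variables ordered by time. Examples include the GDP or CPI figures that the National Bureau of Statistics publishes every year or month, and the value of a stock, fund, or index over a 24-hour period.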

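Working with Time Series

When you first get raw time series data, you may run into situations such as:
1. A stretch of time is missing and needs to be filled.
2. The series are misaligned and need to be aligned.
3. Table a and table b use different time intervals and need to be resampled.
...
Problems like these require some processing to arrive at the data you actually want.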
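As a rough illustration of the first case, the sketch below fills a one-day gap by reindexing against a complete daily index. The values and the variable names (raw, full_index, with_gap, filled) are made up for the example, and forward filling is only one of several possible fill strategies.

```python
import pandas as pd

# Hypothetical daily readings with 2018-11-03 missing.
raw = pd.Series(
    [1.0, 2.0, 4.0],
    index=pd.to_datetime(["2018-11-01", "2018-11-02", "2018-11-04"]),
)

# Build the complete daily index, then reindex; the gap becomes NaN.
full_index = pd.date_range(raw.index.min(), raw.index.max(), freq="D")
with_gap = raw.reindex(full_index)

# Fill the gap by carrying the previous value forward.
filled = with_gap.ffill()
print(filled)
```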

Timestamp

A Timestamp represents a single instant in time. You can create one directly with pd.Timestamp().

In [1]: import pandas as pd

In [2]: pd.Timestamp("2018-11-1")
Out[2]: Timestamp('2018-11-01 00:00:00')

In [3]: pd.Timestamp(2018,11,1)
Out[3]: Timestamp('2018-11-01 00:00:00')

In [4]: pd.Timestamp("2018-11-1 11:31:00")
Out[4]: Timestamp('2018-11-01 11:31:00')
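Once created, a Timestamp can be inspected and used in arithmetic. The short sketch below shows a few attributes and methods that pandas provides on Timestamp; the variable names are chosen only for illustration.

```python
import pandas as pd

ts = pd.Timestamp("2018-11-1 11:31:00")

# Individual fields are available as attributes.
print(ts.year, ts.month, ts.day, ts.hour, ts.minute)  # 2018 11 1 11 31

# day_name() returns the weekday name (English by default).
print(ts.day_name())  # Thursday

# Adding a Timedelta yields another Timestamp.
next_day = ts + pd.Timedelta(days=1)
print(next_day)  # 2018-11-02 11:31:00
```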

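Timestamp Index (DatetimeIndex)

A single time stamp is a Timestamp object. When time stamps come as a list, pandas stores them as a DatetimeIndex. pd.Timestamp() cannot build one from a list; use pd.to_datetime() instead: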

In [5]: pd.to_datetime(["2018-11-1", "2018-11-2", "2018-11-3"])
Out[5]: DatetimeIndex(['2018-11-01', '2018-11-02', '2018-11-03'], dtype='datetime64[ns]', freq=None)

pd.to_datetime() is not only a way to create a DatetimeIndex. It also converts between date formats: most common ways of writing a date can be normalized with pd.to_datetime().

In [7]: pd.to_datetime(['Jul 10, 2018', '2018-12-12', None])
Out[7]: DatetimeIndex(['2018-07-10', '2018-12-12', 'NaT'], dtype='datetime64[ns]', freq=None)

In [8]: pd.to_datetime(['2017/10/11', '2018/11/1'])
Out[8]: DatetimeIndex(['2017-10-11', '2018-11-01'], dtype='datetime64[ns]', freq=None)

For the day-first style common in Europe (day-month-year), pass dayfirst=True so that the first number is read as the day:

In [9]: pd.to_datetime('1-10-2018')
Out[9]: Timestamp('2018-01-10 00:00:00')

In [10]: pd.to_datetime('1-10-2018', dayfirst=True)
Out[10]: Timestamp('2018-10-01 00:00:00')
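When the layout of the input is known in advance, an explicit format string is another way to remove the ambiguity. The sketch below uses the format= parameter of pd.to_datetime() with strftime-style codes; it is an alternative to dayfirst=True, not something the original session used.

```python
import pandas as pd

# '%d-%m-%Y' states day-month-year explicitly, so no guessing is involved.
day_first = pd.to_datetime("1-10-2018", format="%d-%m-%Y")
print(day_first)  # 2018-10-01 00:00:00

# '%m-%d-%Y' reads the same string as month-day-year.
month_first = pd.to_datetime("1-10-2018", format="%m-%d-%Y")
print(month_first)  # 2018-01-10 00:00:00
```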

Data held in the familiar pandas Series and DataFrame containers can also be converted directly with to_datetime:

In [11]: pd.to_datetime(pd.Series(['2018-11-1', '2018-11-2', '2018-11-3']))
Out[11]:
0   2018-11-01
1   2018-11-02
2   2018-11-03
dtype: datetime64[ns]

In [12]: pd.to_datetime(['2018-11-1', '2018-11-2', '2018-11-3'])
Out[12]: DatetimeIndex(['2018-11-01', '2018-11-02', '2018-11-03'], dtype='datetime64[ns]', freq=None)

In [13]: pd.to_datetime(pd.DataFrame({'year': [2017, 2017], 'month': [1, 2],
    ...:  'day': [3, 4], 'hour': [5, 6]}))
Out[13]:
0   2017-01-03 05:00:00
1   2017-02-04 06:00:00
dtype: datetime64[ns]

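Note that:

pd.to_datetime(Series/DataFrame) returns a Series.
pd.to_datetime(List) returns a DatetimeIndex.

To convert a DataFrame as shown above, the columns year, month and day are required. The columns hour, minute, second, millisecond, microsecond and nanosecond are optional.

When converting data with pd.to_datetime(), it is easy to run into invalid values. Some tasks cannot tolerate invalid data, so raising an error that points to the bad values is a good approach. Other tasks do not care about the occasional invalid value, in which case it can be ignored.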

# Raise an error on invalid data (the exact message varies by pandas version)
In [15]: pd.to_datetime(['2018-11-1', 'invalid'], errors='raise')
ValueError: ('Unknown string format:', 'invalid')

# Ignore invalid data: the input comes back unconverted
# (errors='ignore' is deprecated in recent pandas releases)
In [17]: pd.to_datetime(['2018-11-1', 'invalid'], errors='ignore')
Out[17]: array(['2018-11-1', 'invalid'], dtype=object)

# Replace invalid data with NaT
In [18]: pd.to_datetime(['2018-11-1', 'invalid'], errors='coerce')
Out[18]: DatetimeIndex(['2018-11-01', 'NaT'], dtype='datetime64[ns]', freq=None)
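A common follow-up is to find out which entries failed to parse. One possible approach, sketched below with made-up input, is to coerce first and then compare against the original values. The names raw, parsed and bad_mask are illustrative only.

```python
import pandas as pd

raw = pd.Series(["2018-11-01", "invalid", "2018-11-03"])

# errors='coerce' turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce")

# Rows that failed to parse: NaT after conversion but not missing before.
bad_mask = parsed.isna() & raw.notna()
print(raw[bad_mask])
```

This keeps the strictness of errors='raise' (every bad value is reported) while still converting the rows that are valid.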

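Another important way to generate a DatetimeIndex is pandas.date_range.
Given a rule, pandas.date_range generates an ordered DatetimeIndex.
Its signature with default arguments is:

pandas.date_range(start=None, end=None, periods=None, freq='D', tz=None, normalize=False,
name=None, closed=None, **kwargs)

The commonly used parameters are:

start= : the start time
end= : the end time
periods= : the number of timestamps to generate; if None, both start and end must be given.
freq= : the interval between timestamps.
tz= : the time zone.

Of these, freq= is the key parameter. Frequencies that can be set include:

freq='s' : second
freq='min' : minute
freq='H' : hour
freq='D' : day
freq='W' : week, anchored on Sunday by default
freq='M' : month end
freq='BM' : last business day of each month

These are the aliases of the pandas version used in this article; newer releases rename several of them (for example 'h' instead of 'H' and 'ME' instead of 'M').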

# From 2018-11-1 to 2018-11-2 at hourly intervals
In [19]: pd.date_range('2018-11-1', '2018-11-2', freq='H')
Out[19]:
DatetimeIndex(['2018-11-01 00:00:00', '2018-11-01 01:00:00',
               '2018-11-01 02:00:00', '2018-11-01 03:00:00',
               '2018-11-01 04:00:00', '2018-11-01 05:00:00',
               '2018-11-01 06:00:00', '2018-11-01 07:00:00',
               '2018-11-01 08:00:00', '2018-11-01 09:00:00',
               '2018-11-01 10:00:00', '2018-11-01 11:00:00',
               '2018-11-01 12:00:00', '2018-11-01 13:00:00',
               '2018-11-01 14:00:00', '2018-11-01 15:00:00',
               '2018-11-01 16:00:00', '2018-11-01 17:00:00',
               '2018-11-01 18:00:00', '2018-11-01 19:00:00',
               '2018-11-01 20:00:00', '2018-11-01 21:00:00',
               '2018-11-01 22:00:00', '2018-11-01 23:00:00',
               '2018-11-02 00:00:00'],
              dtype='datetime64[ns]', freq='H')

# Starting at 2018-11-1, 10 timestamps at 1-second intervals
In [20]: pd.date_range('2018-11-1', periods=10, freq='s')
Out[20]:
DatetimeIndex(['2018-11-01 00:00:00', '2018-11-01 00:00:01',
               '2018-11-01 00:00:02', '2018-11-01 00:00:03',
               '2018-11-01 00:00:04', '2018-11-01 00:00:05',
               '2018-11-01 00:00:06', '2018-11-01 00:00:07',
               '2018-11-01 00:00:08', '2018-11-01 00:00:09'],
              dtype='datetime64[ns]', freq='S')

# Starting at 2018-11-1, 10 timestamps at intervals of 1 hour 20 minutes
In [21]: pd.date_range('11/1/2018', periods=10, freq='1H20min')
Out[21]:
DatetimeIndex(['2018-11-01 00:00:00', '2018-11-01 01:20:00',
               '2018-11-01 02:40:00', '2018-11-01 04:00:00',
               '2018-11-01 05:20:00', '2018-11-01 06:40:00',
               '2018-11-01 08:00:00', '2018-11-01 09:20:00',
               '2018-11-01 10:40:00', '2018-11-01 12:00:00'],
              dtype='datetime64[ns]', freq='80T')
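Calendar-anchored frequencies behave a little differently from fixed intervals, because the first timestamp is moved to the next anchor date. The sketch below checks this for 'W', and also shows 'B' (business days), an alias that is not in the list above but is part of pandas.

```python
import pandas as pd

# 'W' is anchored on Sunday by default, so every stamp is a Sunday.
# 2018-11-01 is a Thursday; the range starts on the following Sunday.
weekly = pd.date_range("2018-11-1", periods=4, freq="W")
print(weekly.strftime("%Y-%m-%d").tolist())
print(weekly.day_name().tolist())

# 'B' generates business days only (Monday to Friday).
business = pd.date_range("2018-11-1", "2018-11-7", freq="B")
print(business.strftime("%Y-%m-%d").tolist())
```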

Besides generating a DatetimeIndex, you can also operate on an existing one. These operations include selection and slicing, much as with a Series.

In [23]: x = pd.date_range('2018-11-1', periods=10, freq='1D1H')

In [24]: x
Out[24]:
DatetimeIndex(['2018-11-01 00:00:00', '2018-11-02 01:00:00',
               '2018-11-03 02:00:00', '2018-11-04 03:00:00',
               '2018-11-05 04:00:00', '2018-11-06 05:00:00',
               '2018-11-07 06:00:00', '2018-11-08 07:00:00',
               '2018-11-09 08:00:00', '2018-11-10 09:00:00'],
              dtype='datetime64[ns]', freq='25H')

# Select the timestamp at position 1
In [27]: x[1]
Out[27]: Timestamp('2018-11-02 01:00:00', freq='25H')

# Slice the timestamps at positions 0 through 4
In [28]: x[:5]
Out[28]:
DatetimeIndex(['2018-11-01 00:00:00', '2018-11-02 01:00:00',
               '2018-11-03 02:00:00', '2018-11-04 03:00:00',
               '2018-11-05 04:00:00'],
              dtype='datetime64[ns]', freq='25H')

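DateOffset Objects

Above, freq='1D1H' produced a timestamp index with a spacing of 1 day + 1 hour. Time series work also makes frequent use of DateOffset objects, which allow more flexible changes to a timestamp index. A DateOffset object can mainly:

Add a span of time to a time index, or subtract one from it.
Be multiplied by an integer to scale the shift applied to a time index.
Move a time index forward or backward to the next or previous specific offset date.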

In [29]: from pandas import offsets

In [30]: a = pd.date_range('2018-11-1', periods=10, freq='1D1H')

In [31]: a
Out[31]:
DatetimeIndex(['2018-11-01 00:00:00', '2018-11-02 01:00:00',
               '2018-11-03 02:00:00', '2018-11-04 03:00:00',
               '2018-11-05 04:00:00', '2018-11-06 05:00:00',
               '2018-11-07 06:00:00', '2018-11-08 07:00:00',
               '2018-11-09 08:00:00', '2018-11-10 09:00:00'],
              dtype='datetime64[ns]', freq='25H')

# Use a DateOffset object to add 1 month + 2 days + 3 hours to every element of a
In [33]: a + offsets.DateOffset(months=1, days=2, hours=3)
Out[33]:
DatetimeIndex(['2018-12-03 03:00:00', '2018-12-04 04:00:00',
               '2018-12-05 05:00:00', '2018-12-06 06:00:00',
               '2018-12-07 07:00:00', '2018-12-08 08:00:00',
               '2018-12-09 09:00:00', '2018-12-10 10:00:00',
               '2018-12-11 11:00:00', '2018-12-12 12:00:00'],
              dtype='datetime64[ns]', freq='25H')

# Use a DateOffset object to move a later by 2 weeks
In [34]: a + 2*offsets.Week()
Out[34]:
DatetimeIndex(['2018-11-15 00:00:00', '2018-11-16 01:00:00',
               '2018-11-17 02:00:00', '2018-11-18 03:00:00',
               '2018-11-19 04:00:00', '2018-11-20 05:00:00',
               '2018-11-21 06:00:00', '2018-11-22 07:00:00',
               '2018-11-23 08:00:00', '2018-11-24 09:00:00'],
              dtype='datetime64[ns]', freq='25H')
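Moving to the next or previous offset date is not demonstrated in the session above. A minimal sketch follows, using offsets.MonthEnd as the anchor; the choice of MonthEnd and the sample date are assumptions made for the example.

```python
import pandas as pd
from pandas import offsets

ts = pd.Timestamp("2018-11-15")

# Adding an anchored offset moves to the next matching date.
print(ts + offsets.MonthEnd())        # 2018-11-30 00:00:00

# rollforward/rollback snap to the nearest anchor in that direction;
# a date already on the anchor is left unchanged.
month_end = offsets.MonthEnd()
print(month_end.rollforward(ts))      # 2018-11-30 00:00:00
print(month_end.rollback(ts))         # 2018-10-31 00:00:00

# Multiplying an offset by an integer scales it.
print(ts + 3 * offsets.Day())         # 2018-11-18 00:00:00
```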

Period (Time Span)

pandas also provides Period (a time span) and PeriodIndex (an index of time spans). They describe a span of time rather than a single instant.

# A span of one year
In [35]: pd.Period('2018')
Out[35]: Period('2018', 'A-DEC')

# A span of one month
In [36]: pd.Period('2018-11')
Out[36]: Period('2018-11', 'M')

# A span of one day
In [37]: pd.Period('2018-11-1')
Out[37]: Period('2018-11-01', 'D')

# A span of one hour
In [38]: pd.Period('2018-11-1 13')
Out[38]: Period('2018-11-01 13:00', 'H')

# A span of one minute
In [39]: pd.Period('2018-11-1 13:22')
Out[39]: Period('2018-11-01 13:22', 'T')

# A span of one second
In [40]: pd.Period('2018-11-1 13:22:12')
Out[40]: Period('2018-11-01 13:22:12', 'S')

A sequence of periods can likewise be generated with pandas.period_range():

In [42]: pd.period_range('2017-11', '2018-11', freq='M')
Out[42]:
PeriodIndex(['2017-11', '2017-12', '2018-01', '2018-02', '2018-03', '2018-04',
             '2018-05', '2018-06', '2018-07', '2018-08', '2018-09', '2018-10',
             '2018-11'],
            dtype='period[M]', freq='M')

The dtype of a DatetimeIndex is datetime64[ns], whereas the dtype of this PeriodIndex is period[M]. The difference between Timestamp and Period is worth a separate look:

In [43]: pd.Period('2018-11-1')
Out[43]: Period('2018-11-01', 'D')

In [44]: pd.Timestamp('2018-11-1')
Out[44]: Timestamp('2018-11-01 00:00:00')

As shown, the first represents the whole day 2018-11-01, while the second represents only the instant 2018-11-01 00:00:00.
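The difference can be made concrete by looking at the boundaries of the Period. The sketch below uses the start_time and end_time attributes together with the to_period() and to_timestamp() conversions; the variable names are illustrative.

```python
import pandas as pd

p = pd.Period("2018-11-1")      # a span of one day
ts = pd.Timestamp("2018-11-1")  # a single instant

# A Period has a start and an end; the Timestamp equals the start.
print(p.start_time)  # 2018-11-01 00:00:00
print(p.end_time)    # the last representable instant of that day
print(p.start_time == ts)

# Any instant during that day falls inside the span.
noon = pd.Timestamp("2018-11-1 12:00")
print(p.start_time <= noon <= p.end_time)

# Converting between the two representations.
print(ts.to_period("D") == p)
print(p.to_timestamp() == ts)
```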

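Selecting Time Series Data

A DatetimeIndex is called a timestamp index because its main use is as the index of a Series or DataFrame. Below, some random data is generated to show how to work with time series data.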

In [1]: import numpy as np
In [2]: import pandas as pd

# Generate the time index
In [3]: i = pd.date_range('2018-11-1', periods=20, freq='M')

# Generate random data and use the time index as its index
In [4]: data = pd.Series(np.random.randn(len(i)), index=i)

# Inspect the data
In [5]: data
Out[5]:
2018-11-30   -1.040781
2018-12-31   -2.396724
2019-01-31    0.370134
2019-02-28   -1.655618
2019-03-31   -0.755367
2019-04-30   -1.465855
2019-05-31   -1.212847
2019-06-30   -0.816448
2019-07-31    0.360213
2019-08-31    0.100798
2019-09-30    1.004533
2019-10-31    0.488605
2019-11-30   -2.452875
2019-12-31   -1.495978
2020-01-31    0.535245
2020-02-29   -0.480371
2020-03-31   -0.536331
2020-04-30    0.640610
2020-05-31    0.271148
2020-06-30    0.522567
Freq: M, dtype: float64

This produces a Series indexed by time, which brings us back to ordinary operations on pandas Series and DataFrame objects. A few examples:

# Select all data from 2018
In [6]: data['2018']
Out[6]:
2018-11-30   -1.040781
2018-12-31   -2.396724
Freq: M, dtype: float64

# Select all data from July 2019 through March 2020
In [7]: data['2019-07':'2020-03']
Out[7]:
2019-07-31    0.360213
2019-08-31    0.100798
2019-09-30    1.004533
2019-10-31    0.488605
2019-11-30   -2.452875
2019-12-31   -1.495978
2020-01-31    0.535245
2020-02-29   -0.480371
2020-03-31   -0.536331
Freq: M, dtype: float64

# Use loc to select all data from January 2019
In [8]: data.loc['2019-01']
Out[8]:
2019-01-31    0.370134
Freq: M, dtype: float64

# Use truncate to select the data between 2019-3-1 and 2020-4-2
In [9]: data.truncate(before='2019-3-1', after='2020-4-2')
Out[9]:
2019-03-31   -0.755367
2019-04-30   -1.465855
2019-05-31   -1.212847
2019-06-30   -0.816448
2019-07-31    0.360213
2019-08-31    0.100798
2019-09-30    1.004533
2019-10-31    0.488605
2019-11-30   -2.452875
2019-12-31   -1.495978
2020-01-31    0.535245
2020-02-29   -0.480371
2020-03-31   -0.536331
Freq: M, dtype: float64
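Because the session above uses random values, the results are hard to check by eye. The sketch below repeats the same kinds of selection on deterministic data. It assumes a month-start index ('MS') purely to keep the example short, and adds a boolean mask on an index attribute as an alternative that the original does not cover.

```python
import pandas as pd

# 20 month-start timestamps from 2018-11 to 2020-06, with values 0..19.
idx = pd.date_range("2018-11-1", periods=20, freq="MS")
data = pd.Series(range(len(idx)), index=idx)

# A year string selects every row in that year.
year_2019 = data.loc["2019"]
print(len(year_2019))  # 12

# String slices include both endpoints.
span = data.loc["2019-07":"2020-03"]
print(span.index[0], span.index[-1], len(span))

# A boolean mask on index attributes is an alternative to string slices.
q4 = data[data.index.quarter == 4]
print(q4.index.strftime("%Y-%m").tolist())
```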

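Shifting Time Series Data

The shift method offsets either the data or the time index as a whole.

# Generate the time index
In [10]: i = pd.date_range('2017-1-1', periods=5, freq='M')

# Generate random data and use the time index as its index
In [11]: data = pd.Series(np.random.randn(len(i)), index=i)

# Inspect the data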
In [12]: data
Out[12]:
2017-01-31   -0.440074
2017-02-28    0.706395
2017-03-31    0.823844
2017-04-30    0.703313
2017-05-31    0.920151
Freq: M, dtype: float64

# Move the data 3 positions later along an unchanged index; pandas fills the vacated slots with NaN
In [13]: data.shift(3)
Out[13]:
2017-01-31         NaN
2017-02-28         NaN
2017-03-31         NaN
2017-04-30   -0.440074
2017-05-31    0.706395
Freq: M, dtype: float64

# Move the data 3 positions earlier along an unchanged index
In [14]: data.shift(-3)
Out[14]:
2017-01-31    0.703313
2017-02-28    0.920151
2017-03-31         NaN
2017-04-30         NaN
2017-05-31         NaN
Freq: M, dtype: float64

# Move the index timestamps 3 days later; the data values are unchanged
In [15]: data.shift(3, freq='D')
Out[15]:
2017-02-03   -0.440074
2017-03-03    0.706395
2017-04-03    0.823844
2017-05-03    0.703313
2017-06-03    0.920151
dtype: float64
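A typical use of shift is comparing each observation with the previous one. The sketch below uses made-up values on a daily index; it is one illustration of the idea rather than part of the original session.

```python
import pandas as pd

idx = pd.date_range("2017-1-1", periods=5, freq="D")
data = pd.Series([10.0, 12.0, 9.0, 15.0, 15.0], index=idx)

# shift(1) places each value on the following timestamp, so the
# subtraction gives the change relative to the previous observation.
change = data - data.shift(1)
print(change.tolist())  # [nan, 2.0, -3.0, 6.0, 0.0]

# diff() is the built-in shortcut for the same computation.
print(change.equals(data.diff()))  # True
```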

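Resampling Time Series Data

Besides shifting, resampling is also used frequently. resample can raise or lower the frequency of a time-indexed series. For example, when a time series is very large, sampling at a lower frequency gives a smaller dataset that still covers the same time range. Resampling is also an important tool when several datasets of different frequencies need to be aligned.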

In [16]: i = pd.date_range('2017-01-01', periods=20, freq='D')

In [17]: data = pd.Series(np.random.randn(len(i)), index=i)

In [18]: data
Out[18]:
2017-01-01    1.096656
2017-01-02   -2.404326
2017-01-03   -0.883177
2017-01-04    0.554299
2017-01-05   -1.004089
2017-01-06   -0.014365
2017-01-07   -0.514893
2017-01-08   -0.049173
2017-01-09    1.633568
2017-01-10    2.076252
2017-01-11    0.132104
2017-01-12   -1.011756
2017-01-13   -1.330824
2017-01-14    1.626463
2017-01-15   -0.339399
2017-01-16   -0.622435
2017-01-17   -0.201180
2017-01-18   -1.193216
2017-01-19   -1.522457
2017-01-20    1.217058
Freq: D, dtype: float64

# Downsample to 2-day bins and sum the values in each bin
In [19]: data.resample('2D').sum()
Out[19]:
2017-01-01   -1.307670
2017-01-03   -0.328878
2017-01-05   -1.018454
2017-01-07   -0.564066
2017-01-09    3.709820
2017-01-11   -0.879652
2017-01-13    0.295638
2017-01-15   -0.961834
2017-01-17   -1.394395
2017-01-19   -0.305399
dtype: float64

# Downsample to 2-day bins and take the mean of the values in each bin
In [20]: data.resample('2D').mean()
Out[20]:
2017-01-01   -0.653835
2017-01-03   -0.164439
2017-01-05   -0.509227
2017-01-07   -0.282033
2017-01-09    1.854910
2017-01-11   -0.439826
2017-01-13    0.147819
2017-01-15   -0.480917
2017-01-17   -0.697198
2017-01-19   -0.152700
dtype: float64

# Downsample to 2-day bins and take the maximum of the values in each bin
In [21]: data.resample('2D').max()
Out[21]:
2017-01-01    1.096656
2017-01-03    0.554299
2017-01-05   -0.014365
2017-01-07   -0.049173
2017-01-09    2.076252
2017-01-11    0.132104
2017-01-13    1.626463
2017-01-15   -0.339399
2017-01-17   -0.201180
2017-01-19    1.217058
dtype: float64

# Downsample to 2-day bins and list the first (open), highest, lowest and last (close) value in each bin
In [22]: data.resample('2D').ohlc()
Out[22]:
                open      high       low     close
2017-01-01  1.096656  1.096656 -2.404326 -2.404326
2017-01-03 -0.883177  0.554299 -0.883177  0.554299
2017-01-05 -1.004089 -0.014365 -1.004089 -0.014365
2017-01-07 -0.514893 -0.049173 -0.514893 -0.049173
2017-01-09  1.633568  2.076252  1.633568  2.076252
2017-01-11  0.132104  0.132104 -1.011756 -1.011756
2017-01-13 -1.330824  1.626463 -1.330824  1.626463
2017-01-15 -0.339399 -0.339399 -0.622435 -0.622435
2017-01-17 -0.201180 -0.201180 -1.193216 -1.193216
2017-01-19 -1.522457  1.217058 -1.522457  1.217058
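To see exactly how the bins are formed, it helps to resample values that can be added up by hand. The sketch below does that with six made-up daily values, and also uses agg() to compute several aggregations in one call, which the session above does not show.

```python
import pandas as pd

idx = pd.date_range("2017-01-01", periods=6, freq="D")
data = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

# Each 2-day bin is labelled by its first day: (1+2), (3+4), (5+6).
sums = data.resample("2D").sum()
print(sums.tolist())  # [3.0, 7.0, 11.0]

# agg() applies several aggregations at once, one column per function.
summary = data.resample("2D").agg(["sum", "mean", "max"])
print(summary)
```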

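With resampling, the main thing to watch is how the new values are computed after resampling. The examples above are downsampling; upsampling is also possible.

# Raise the frequency from daily to hourly and fill the new rows with the previous value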
In [23]: data.resample('H').ffill()
Out[23]:
2017-01-01 00:00:00    1.096656
2017-01-01 01:00:00    1.096656
2017-01-01 02:00:00    1.096656
2017-01-01 03:00:00    1.096656
                       ...
2017-01-19 20:00:00   -1.522457
2017-01-19 21:00:00   -1.522457
2017-01-19 22:00:00   -1.522457
2017-01-19 23:00:00   -1.522457
2017-01-20 00:00:00    1.217058
Freq: H, Length: 457, dtype: float64

# Raise the frequency from daily to hourly without filling the new rows
In [24]: data.resample('H').asfreq()
Out[24]:
2017-01-01 00:00:00    1.096656
2017-01-01 01:00:00         NaN
2017-01-01 02:00:00         NaN
2017-01-01 03:00:00         NaN
2017-01-01 04:00:00         NaN
                       ...
2017-01-19 20:00:00         NaN
2017-01-19 21:00:00         NaN
2017-01-19 22:00:00         NaN
2017-01-19 23:00:00         NaN
2017-01-20 00:00:00    1.217058
Freq: H, Length: 457, dtype: float64

# Raise the frequency from daily to hourly, filling at most 3 new rows after each original value
In [25]: data.resample('H').ffill(limit=3)
Out[25]:
2017-01-01 00:00:00    1.096656
2017-01-01 01:00:00    1.096656
2017-01-01 02:00:00    1.096656
2017-01-01 03:00:00    1.096656
2017-01-01 04:00:00         NaN
                       ...
2017-01-19 20:00:00         NaN
2017-01-19 21:00:00         NaN
2017-01-19 22:00:00         NaN
2017-01-19 23:00:00         NaN
2017-01-20 00:00:00    1.217058
Freq: H, Length: 457, dtype: float64
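The three upsampling strategies are easier to compare on a tiny series. The sketch below goes from every second day to daily, with made-up values, and adds linear interpolation via interpolate() as a further option that the session above does not use.

```python
import pandas as pd

idx = pd.to_datetime(["2017-01-01", "2017-01-03", "2017-01-05"])
data = pd.Series([0.0, 2.0, 4.0], index=idx)

# No filling: the new rows are NaN.
unfilled = data.resample("D").asfreq()
print(unfilled.tolist())  # [0.0, nan, 2.0, nan, 4.0]

# Forward fill: each new row repeats the previous known value.
forward = data.resample("D").ffill()
print(forward.tolist())   # [0.0, 0.0, 2.0, 2.0, 4.0]

# Linear interpolation: new rows lie on a straight line between neighbours.
linear = data.resample("D").interpolate()
print(linear.tolist())    # [0.0, 1.0, 2.0, 3.0, 4.0]
```

Which of these is appropriate depends on the data: forward fill suits values that hold until the next update, while interpolation suits quantities that change gradually.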