pandas我爱编程数据清洗

Pandas中时间和日期处理

2016-08-25  本文已影响27493人  一刀YiDao

1、生成一个时间段

In [1]:import pandas as pd
In [2]:import numpy as np
#1、生成一个时间区间段,间隔为小时
In [3]:rng = pd.date_range('1/1/2011', periods=72, freq='H')
#2、生成一个Series,并制定索引为时间段
In [4]:ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [5]:ts
Out[5]: 
2011-01-01 00:00:00   -0.204085
2011-01-01 01:00:00    1.101711
2011-01-01 02:00:00    1.840500
2011-01-01 03:00:00    0.112426
2011-01-01 04:00:00   -0.310413
2011-01-01 05:00:00    1.180762
2011-01-01 06:00:00    0.087775
2011-01-01 07:00:00    1.087877
2011-01-01 08:00:00   -0.950237
2011-01-01 09:00:00   -0.468453
Freq: H, dtype: float64

#3、改变时间间隔
In [6]:converted = ts.asfreq('45Min', method='pad')
In [7]:converted
Out[7]: 
2011-01-01 00:00:00   -0.204085
2011-01-01 00:45:00   -0.204085
2011-01-01 01:30:00    1.101711
2011-01-01 02:15:00    1.840500
2011-01-01 03:00:00    0.112426
2011-01-01 03:45:00    0.112426
2011-01-01 04:30:00   -0.310413
2011-01-01 05:15:00    1.180762
2011-01-01 06:00:00    0.087775
2011-01-01 06:45:00    0.087775
2011-01-01 07:30:00    1.087877
2011-01-01 08:15:00   -0.950237
2011-01-01 09:00:00   -0.468453
Freq: 45T, dtype: float64

2、转华为日期格式

2.1 数字生成日期格式

In [8]: pd.Timestamp(datetime(2012, 5, 1))
Out[8]: Timestamp('2012-05-01 00:00:00')

2.2 字符生成日期格式

In [9]: pd.Timestamp('2012-05-01')
Out[9]: Timestamp('2012-05-01 00:00:00')

2.3 只有年月

In [10]: pd.Period('2011-01')
Out[10]: Period('2011-01', 'M')

In [11]: pd.Period('2012-05', freq='D')
Out[11]: Period('2012-05-01', 'D')

2.4 转化为日期格式

In [22]: pd.to_datetime(pd.Series(['Jul 31, 2009', '2010-01-10', None]))
Out[22]: 
0   2009-07-31
1   2010-01-10
2          NaT
dtype: datetime64[ns]

In [23]: pd.to_datetime(['2005/11/23', '2010.12.31'])
Out[23]: DatetimeIndex(['2005-11-23', '2010-12-31'], dtype='datetime64[ns]', freq=None)

3、生成一个时间段

3.1 生成索引的方法

In [35]: dates = [datetime(2012, 5, 1), datetime(2012, 5, 2), datetime(2012, 5, 3)]

# Note the frequency information
In [36]: index = pd.DatetimeIndex(dates)

In [37]: index
Out[37]: DatetimeIndex(['2012-05-01', '2012-05-02', '2012-05-03'], dtype='datetime64[ns]', freq=None)

# Automatically converted to DatetimeIndex
In [38]: index = pd.Index(dates)

In [39]: index
Out[39]: DatetimeIndex(['2012-05-01', '2012-05-02', '2012-05-03'], dtype='datetime64[ns]', freq=None)


# date_range日历,bdate_range工作日
In [40]: index = pd.date_range('2000-1-1', periods=1000, freq='M')

In [41]: index
Out[41]: 
DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-30',
               '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
               '2000-09-30', '2000-10-31',
               ...
               '2082-07-31', '2082-08-31', '2082-09-30', '2082-10-31',
               '2082-11-30', '2082-12-31', '2083-01-31', '2083-02-28',
               '2083-03-31', '2083-04-30'],
              dtype='datetime64[ns]', length=1000, freq='M')

In [42]: index = pd.bdate_range('2012-1-1', periods=250)

In [43]: index
Out[43]: 
DatetimeIndex(['2012-01-02', '2012-01-03', '2012-01-04', '2012-01-05',
               '2012-01-06', '2012-01-09', '2012-01-10', '2012-01-11',
               '2012-01-12', '2012-01-13',
               ...
               '2012-12-03', '2012-12-04', '2012-12-05', '2012-12-06',
               '2012-12-07', '2012-12-10', '2012-12-11', '2012-12-12',
               '2012-12-13', '2012-12-14'],
              dtype='datetime64[ns]', length=250, freq='B')


In [44]: start = datetime(2011, 1, 1)

In [45]: end = datetime(2012, 1, 1)

In [46]: rng = pd.date_range(start, end)

In [47]: rng
Out[47]: 
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
               '2011-01-05', '2011-01-06', '2011-01-07', '2011-01-08',
               '2011-01-09', '2011-01-10',
               ...
               '2011-12-23', '2011-12-24', '2011-12-25', '2011-12-26',
               '2011-12-27', '2011-12-28', '2011-12-29', '2011-12-30',
               '2011-12-31', '2012-01-01'],
              dtype='datetime64[ns]', length=366, freq='D')

In [48]: rng = pd.bdate_range(start, end)

In [49]: rng
Out[49]: 
DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05', '2011-01-06',
               '2011-01-07', '2011-01-10', '2011-01-11', '2011-01-12',
               '2011-01-13', '2011-01-14',
               ...
               '2011-12-19', '2011-12-20', '2011-12-21', '2011-12-22',
               '2011-12-23', '2011-12-26', '2011-12-27', '2011-12-28',
               '2011-12-29', '2011-12-30'],
              dtype='datetime64[ns]', length=260, freq='B')

3.2 每个月末,每隔一周

In [50]: pd.date_range(start, end, freq='BM')
Out[50]: 
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
               '2011-05-31', '2011-06-30', '2011-07-29', '2011-08-31',
               '2011-09-30', '2011-10-31', '2011-11-30', '2011-12-30'],
              dtype='datetime64[ns]', freq='BM')

In [51]: pd.date_range(start, end, freq='W')
Out[51]: 
DatetimeIndex(['2011-01-02', '2011-01-09', '2011-01-16', '2011-01-23',
               '2011-01-30', '2011-02-06', '2011-02-13', '2011-02-20',
               '2011-02-27', '2011-03-06', '2011-03-13', '2011-03-20',
               '2011-03-27', '2011-04-03', '2011-04-10', '2011-04-17',
               '2011-04-24', '2011-05-01', '2011-05-08', '2011-05-15',
               '2011-05-22', '2011-05-29', '2011-06-05', '2011-06-12',
               '2011-06-19', '2011-06-26', '2011-07-03', '2011-07-10',
               '2011-07-17', '2011-07-24', '2011-07-31', '2011-08-07',
               '2011-08-14', '2011-08-21', '2011-08-28', '2011-09-04',
               '2011-09-11', '2011-09-18', '2011-09-25', '2011-10-02',
               '2011-10-09', '2011-10-16', '2011-10-23', '2011-10-30',
               '2011-11-06', '2011-11-13', '2011-11-20', '2011-11-27',
               '2011-12-04', '2011-12-11', '2011-12-18', '2011-12-25',
               '2012-01-01'],
              dtype='datetime64[ns]', freq='W-SUN')

3.3 从End往前数20个工作日,从start往后数20个工作日

In [52]: pd.bdate_range(end=end, periods=20)
Out[52]: 
DatetimeIndex(['2011-12-05', '2011-12-06', '2011-12-07', '2011-12-08',
               '2011-12-09', '2011-12-12', '2011-12-13', '2011-12-14',
               '2011-12-15', '2011-12-16', '2011-12-19', '2011-12-20',
               '2011-12-21', '2011-12-22', '2011-12-23', '2011-12-26',
               '2011-12-27', '2011-12-28', '2011-12-29', '2011-12-30'],
              dtype='datetime64[ns]', freq='B')

In [53]: pd.bdate_range(start=start, periods=20)
Out[53]: 
DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05', '2011-01-06',
               '2011-01-07', '2011-01-10', '2011-01-11', '2011-01-12',
               '2011-01-13', '2011-01-14', '2011-01-17', '2011-01-18',
               '2011-01-19', '2011-01-20', '2011-01-21', '2011-01-24',
               '2011-01-25', '2011-01-26', '2011-01-27', '2011-01-28'],
              dtype='datetime64[ns]', freq='B')

4、根据部分索引选择,切片

In [56]: rng = pd.date_range(start, end, freq='BM')

In [57]: ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [58]: ts.index
Out[58]: 
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
               '2011-05-31', '2011-06-30', '2011-07-29', '2011-08-31',
               '2011-09-30', '2011-10-31', '2011-11-30', '2011-12-30'],
              dtype='datetime64[ns]', freq='BM')

In [59]: ts[:5].index
Out[59]: 
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
               '2011-05-31'],
              dtype='datetime64[ns]', freq='BM')

In [60]: ts[::2].index
Out[60]: 
DatetimeIndex(['2011-01-31', '2011-03-31', '2011-05-31', '2011-07-29',
               '2011-09-30', '2011-11-30'],
              dtype='datetime64[ns]', freq='2BM')

In [61]: ts['1/31/2011']
Out[61]: -1.2812473076599531

In [62]: ts[pd.datetime(2011, 12, 25):]
Out[62]: 
2011-12-30    0.687738
Freq: BM, dtype: float64

In [63]: ts['10/31/2011':'12/31/2011']
Out[63]: 
2011-10-31    0.149748
2011-11-30   -0.732339
2011-12-30    0.687738
Freq: BM, dtype: float64

In [64]: ts['2011']
Out[64]: 
2011-01-31   -1.281247
2011-02-28   -0.727707
2011-03-31   -0.121306
2011-04-29   -0.097883
2011-05-31    0.695775
2011-06-30    0.341734
2011-07-29    0.959726
2011-08-31   -1.110336
2011-09-30   -0.619976
2011-10-31    0.149748
2011-11-30   -0.732339
2011-12-30    0.687738
Freq: BM, dtype: float64

In [65]: ts['2011-6']
Out[65]: 
2011-06-30    0.341734
Freq: BM, dtype: float64


# DataFrame中指定了时间索引,可以根据时间索引提取子集
In [66]: dft = pd.DataFrame(np.random.randn(100000,1),columns=['A'],index=pd.date_range('20130101',periods=100000,freq='T'))


In [67]: dft
Out[67]: 
                            A
2013-01-01 00:00:00  0.176444
2013-01-01 00:01:00  0.403310
2013-01-01 00:02:00 -0.154951
2013-01-01 00:03:00  0.301624
2013-01-01 00:04:00 -2.179861
2013-01-01 00:05:00 -1.369849
2013-01-01 00:06:00 -0.954208
...                       ...
2013-03-11 10:33:00 -0.293083
2013-03-11 10:34:00 -0.059881
2013-03-11 10:35:00  1.252450
2013-03-11 10:36:00  0.046611
2013-03-11 10:37:00  0.059478
2013-03-11 10:38:00 -0.286539
2013-03-11 10:39:00  0.841669

[100000 rows x 1 columns]

In [68]: dft['2013']
Out[68]: 
                            A
2013-01-01 00:00:00  0.176444
2013-01-01 00:01:00  0.403310
2013-01-01 00:02:00 -0.154951
2013-01-01 00:03:00  0.301624
2013-01-01 00:04:00 -2.179861
2013-01-01 00:05:00 -1.369849
2013-01-01 00:06:00 -0.954208
...                       ...
2013-03-11 10:33:00 -0.293083
2013-03-11 10:34:00 -0.059881
2013-03-11 10:35:00  1.252450
2013-03-11 10:36:00  0.046611
2013-03-11 10:37:00  0.059478
2013-03-11 10:38:00 -0.286539
2013-03-11 10:39:00  0.841669

[100000 rows x 1 columns]


In [69]: dft['2013-1':'2013-2']
Out[69]: 
                            A
2013-01-01 00:00:00  0.176444
2013-01-01 00:01:00  0.403310
2013-01-01 00:02:00 -0.154951
2013-01-01 00:03:00  0.301624
2013-01-01 00:04:00 -2.179861
2013-01-01 00:05:00 -1.369849
2013-01-01 00:06:00 -0.954208
...                       ...
2013-02-28 23:53:00  0.103114
2013-02-28 23:54:00 -1.303422
2013-02-28 23:55:00  0.451943
2013-02-28 23:56:00  0.220534
2013-02-28 23:57:00 -1.624220
2013-02-28 23:58:00  0.093915
2013-02-28 23:59:00 -1.087454

[84960 rows x 1 columns]


In [70]: dft['2013-1':'2013-2-28']
Out[70]: 
                            A
2013-01-01 00:00:00  0.176444
2013-01-01 00:01:00  0.403310
2013-01-01 00:02:00 -0.154951
2013-01-01 00:03:00  0.301624
2013-01-01 00:04:00 -2.179861
2013-01-01 00:05:00 -1.369849
2013-01-01 00:06:00 -0.954208
...                       ...
2013-02-28 23:53:00  0.103114
2013-02-28 23:54:00 -1.303422
2013-02-28 23:55:00  0.451943
2013-02-28 23:56:00  0.220534
2013-02-28 23:57:00 -1.624220
2013-02-28 23:58:00  0.093915
2013-02-28 23:59:00 -1.087454

[84960 rows x 1 columns]


In [71]: dft['2013-1':'2013-2-28 00:00:00']
Out[71]: 
                            A
2013-01-01 00:00:00  0.176444
2013-01-01 00:01:00  0.403310
2013-01-01 00:02:00 -0.154951
2013-01-01 00:03:00  0.301624
2013-01-01 00:04:00 -2.179861
2013-01-01 00:05:00 -1.369849
2013-01-01 00:06:00 -0.954208
...                       ...
2013-02-27 23:54:00  0.897051
2013-02-27 23:55:00 -0.309230
2013-02-27 23:56:00  1.944713
2013-02-27 23:57:00  0.369265
2013-02-27 23:58:00  0.053071
2013-02-27 23:59:00 -0.019734
2013-02-28 00:00:00  1.388189

[83521 rows x 1 columns]


In [72]: dft['2013-1-15':'2013-1-15 12:30:00']
Out[72]: 
                            A
2013-01-15 00:00:00  0.501288
2013-01-15 00:01:00 -0.605198
2013-01-15 00:02:00  0.215146
2013-01-15 00:03:00  0.924732
2013-01-15 00:04:00 -2.228519
2013-01-15 00:05:00  1.517331
2013-01-15 00:06:00 -1.188774
...                       ...
2013-01-15 12:24:00  1.358314
2013-01-15 12:25:00 -0.737727
2013-01-15 12:26:00  1.838323
2013-01-15 12:27:00 -0.774090
2013-01-15 12:28:00  0.622261
2013-01-15 12:29:00 -0.631649
2013-01-15 12:30:00  0.193284

[751 rows x 1 columns]


In [73]: dft.loc['2013-1-15 12:30:00']
Out[73]: 
A    0.193284
Name: 2013-01-15 12:30:00, dtype: float64

6、常用时间

类别 解释
year
month
day
hour
minute 分钟
second
microsecond 微秒
nanosecond 纳秒
date 返回日期
time 返回时间
dayofyear 年序日
weekofyear 年序周
week
dayofweek 周中的第几天,Monday=0, Sunday=6
weekday 周中的第几天,Monday=0, Sunday=6
weekday_name 周中的星期几,ex: Friday
quarter 季度
days_in_month 一个月中有多少天
is_month_start 是否月初第一天
is_month_end 是否月末最后一天
is_quarter_start 是否季度的最开始
is_quarter_end 是否季度的最后一个
is_year_start 是否年初第一天
is_year_end 是否年末第一天

7、某一时间点,往前往后加一段时间

类别 解释
BDay 工作日
CDay 自定义日期
Week
WeekOfMonth 月中的第几周
LastWeekOfMonth 月中的最后一周
MonthEnd 日历上月末
MonthBegin 日历上月初
BMonthEnd 工作月初
BMonthBegin 月开始营业
CBMonthEnd 自定义月末
CBMonthBegin 自定义月初
QuarterEnd 日历季末
QuarterBegin 日历季初
BQuarterEnd 工作季末
BQuarterBegin 工作季初
FY5253Quarter retail (aka 52-53 week) quarter
YearEnd 日历年末
YearBegin 日历年初
BYearEnd 工作年末
BYearBegin 工作年初
FY5253 retail (aka 52-53 week) year
BusinessHour 工作小时
CustomBusinessHour 自定义小时
Hour 小时
Minute 分钟
Second
In [84]: d = pd.datetime(2008, 8, 18, 9, 0)
In [86]: from pandas.tseries.offsets import *

In [87]: d + DateOffset(months=4, days=5)
Out[87]: Timestamp('2008-12-23 09:00:00')


In [88]: d - 5 * BDay()
Out[88]: Timestamp('2008-08-11 09:00:00')

# 月末
In [89]: d + BMonthEnd()
Out[89]: Timestamp('2008-08-29 09:00:00')


In [90]: d
Out[90]: datetime.datetime(2008, 8, 18, 9, 0)

# 往前数月末
In [91]: offset = BMonthEnd()

In [92]: offset.rollforward(d)
Out[92]: Timestamp('2008-08-29 09:00:00')

# 往后数月末
In [93]: offset.rollback(d)
Out[93]: Timestamp('2008-07-31 09:00:00')


# 时间方面的
In [94]: day = Day()

In [95]: day.apply(pd.Timestamp('2014-01-01 09:00'))
Out[95]: Timestamp('2014-01-02 09:00:00')

In [96]: day = Day(normalize=True)

In [97]: day.apply(pd.Timestamp('2014-01-01 09:00'))
Out[97]: Timestamp('2014-01-02 00:00:00')

In [98]: hour = Hour()

In [99]: hour.apply(pd.Timestamp('2014-01-01 22:00'))
Out[99]: Timestamp('2014-01-01 23:00:00')

In [100]: hour = Hour(normalize=True)

In [101]: hour.apply(pd.Timestamp('2014-01-01 22:00'))
Out[101]: Timestamp('2014-01-01 00:00:00')

In [102]: hour.apply(pd.Timestamp('2014-01-01 23:00'))
Out[102]: Timestamp('2014-01-02 00:00:00')


# 周相关的
In [103]: d
Out[103]: datetime.datetime(2008, 8, 18, 9, 0)

In [104]: d + Week()
Out[104]: Timestamp('2008-08-25 09:00:00')

In [105]: d + Week(weekday=4)
Out[105]: Timestamp('2008-08-22 09:00:00')

In [106]: (d + Week(weekday=4)).weekday()
Out[106]: 4

In [107]: d - Week()
Out[107]: Timestamp('2008-08-11 09:00:00')

8、时间序列相关的时间处理

In [213]: ts = ts[:5]

In [214]: ts.shift(1)
Out[214]: 
2011-01-31         NaN
2011-02-28   -1.281247
2011-03-31   -0.727707
2011-04-29   -0.121306
2011-05-31   -0.097883
Freq: BM, dtype: float64


In [215]: ts.shift(5, freq=datetools.bday)
Out[215]: 
2011-02-07   -1.281247
2011-03-07   -0.727707
2011-04-07   -0.121306
2011-05-06   -0.097883
2011-06-07    0.695775
dtype: float64

In [216]: ts.shift(5, freq='BM')
Out[216]: 
2011-06-30   -1.281247
2011-07-29   -0.727707
2011-08-31   -0.121306
2011-09-30   -0.097883
2011-10-31    0.695775
Freq: BM, dtype: float64


In [217]: ts.tshift(5, freq='D')
Out[217]: 
2011-02-05   -1.281247
2011-03-05   -0.727707
2011-04-05   -0.121306
2011-05-04   -0.097883
2011-06-05    0.695775
dtype: float64

上一篇 下一篇

猜你喜欢

热点阅读