Python Data Processing (21): HDF5 Queries

7 Querying

7.1 Querying a table

The select and delete operations have an optional criterion that can be specified to select/delete only a subset of the data. This allows one to retrieve a very small portion of data from a very large on-disk table.
Valid comparison operators are:

=, ==, !=, >, >=, <, <=

Valid boolean expressions are combined with:

- | : or
- & : and
- ( and ) : for grouping

Note:

- = will be automatically expanded to the comparison operator ==
- ~ is the not operator, but can only be used in very limited circumstances
- If a list/tuple of expressions is passed, they will be combined via &
The following are valid expressions:

- 'index >= date'
- "columns = ['A', 'D']"
- "columns in ['A', 'D']"
- 'columns = A'
- 'columns == A'
- "~(columns = ['A', 'B'])"
- 'index > df.index[3] & string = "bar"'
- '(index > df.index[3] & index <= df.index[6]) | string = "bar"'
- "ts >= Timestamp('2012-02-01')"
- "major_axis >= 20130101"
The indexers are on the left-hand side of the sub-expression: columns, major_axis, ts

The right-hand side of the sub-expression (after a comparison operator) can be:

- functions that will be evaluated, e.g. Timestamp('2012-02-01')
- strings, e.g. "bar"
- date-like values, e.g. 20130101, or "20130101"
- lists, e.g. "['A', 'B']"
- variables defined in the local namespace, e.g. date
Note:

Passing a string by interpolating it into the query expression is not recommended. Instead, assign the string of interest to a variable and use that variable directly in the expression. For example, this works:
string = "HolyMoly'"
store.select("df", "index == string")
whereas this:
string = "HolyMoly'"
store.select('df', f'index == {string}')
will raise a SyntaxError, because the string variable contains a single quote inside the double quotes.

If you must interpolate, use the '%r' format specifier:
store.select("df", "index == %r" % string)
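To see why '%r' works where plain interpolation fails, here is a quick pure-Python sketch of the two expression strings being built (no store involved):

```python
string = "HolyMoly'"

# Naive interpolation leaves an unbalanced single quote in the expression,
# which the query parser cannot tokenize:
bad = f"index == {string}"
print(bad)    # index == HolyMoly'

# '%r' inserts repr(string), which quotes and escapes the value correctly:
good = "index == %r" % string
print(good)   # index == "HolyMoly'"
```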
Suppose we have the following data:
In [400]: dfq = pd.DataFrame(
.....: np.random.randn(10, 4),
.....: columns=list("ABCD"),
.....: index=pd.date_range("20130101", periods=10),
.....: )
.....:
In [401]: store.append("dfq", dfq, format="table", data_columns=True)
Use boolean expressions, with inline function evaluation:
In [402]: store.select("dfq", "index>pd.Timestamp('20130104') & columns=['A', 'B']")
Out[402]:
A B
2013-01-05 -1.083889 0.811865
2013-01-06 -0.402227 1.618922
2013-01-07 0.948196 0.183573
2013-01-08 -1.043530 -0.708145
2013-01-09 0.813949 1.508891
2013-01-10 1.176488 -1.246093
Use inline column references:
In [403]: store.select("dfq", where="A>0 or C>0")
Out[403]:
A B C D
2013-01-01 0.620028 0.159416 -0.263043 -0.639244
2013-01-04 -0.536722 1.005707 0.296917 0.139796
2013-01-05 -1.083889 0.811865 1.648435 -0.164377
2013-01-07 0.948196 0.183573 0.145277 0.308146
2013-01-08 -1.043530 -0.708145 1.430905 -0.850136
2013-01-09 0.813949 1.508891 -1.556154 0.187597
2013-01-10 1.176488 -1.246093 -0.002726 -0.444249
The columns keyword can be supplied to select a list of columns to be returned; this is equivalent to passing 'columns=list_of_columns_to_filter':
In [404]: store.select("df", "columns=['A', 'B']")
Out[404]:
A B
2000-01-01 1.334065 0.521036
2000-01-02 -1.613932 1.088104
2000-01-03 -0.585314 -0.275038
2000-01-04 0.632369 -1.249657
2000-01-05 1.060617 -0.143682
2000-01-06 3.050329 1.317933
2000-01-07 -0.539452 -0.771133
2000-01-08 0.649464 -1.736427
start and stop parameters can be specified to limit the rows returned. These are computed in terms of the total number of rows in the table.

>>> store.select("dfq", "columns=['A', 'B']", start=3, stop=5)
A B
2013-01-04 -0.483155 1.143564
2013-01-05 0.218290 -1.391789
7.2 Querying timedelta64[ns]

You can store and query data of type timedelta64[ns]. Terms can be specified in the format <float>(<unit>), where float may be signed (and fractional) and unit may be D, s, ms, us, or ns.

Here is an example:
In [405]: from datetime import timedelta
In [406]: dftd = pd.DataFrame(
.....: {
.....: "A": pd.Timestamp("20130101"),
.....: "B": [
.....: pd.Timestamp("20130101") + timedelta(days=i, seconds=10)
.....: for i in range(10)
.....: ],
.....: }
.....: )
.....:
In [407]: dftd["C"] = dftd["A"] - dftd["B"]
In [408]: dftd
Out[408]:
A B C
0 2013-01-01 2013-01-01 00:00:10 -1 days +23:59:50
1 2013-01-01 2013-01-02 00:00:10 -2 days +23:59:50
2 2013-01-01 2013-01-03 00:00:10 -3 days +23:59:50
3 2013-01-01 2013-01-04 00:00:10 -4 days +23:59:50
4 2013-01-01 2013-01-05 00:00:10 -5 days +23:59:50
5 2013-01-01 2013-01-06 00:00:10 -6 days +23:59:50
6 2013-01-01 2013-01-07 00:00:10 -7 days +23:59:50
7 2013-01-01 2013-01-08 00:00:10 -8 days +23:59:50
8 2013-01-01 2013-01-09 00:00:10 -9 days +23:59:50
9 2013-01-01 2013-01-10 00:00:10 -10 days +23:59:50
In [409]: store.append("dftd", dftd, data_columns=True)
In [410]: store.select("dftd", "C<'-3.5D'")
Out[410]:
A B C
4 2013-01-01 2013-01-05 00:00:10 -5 days +23:59:50
5 2013-01-01 2013-01-06 00:00:10 -6 days +23:59:50
6 2013-01-01 2013-01-07 00:00:10 -7 days +23:59:50
7 2013-01-01 2013-01-08 00:00:10 -8 days +23:59:50
8 2013-01-01 2013-01-09 00:00:10 -9 days +23:59:50
9 2013-01-01 2013-01-10 00:00:10 -10 days +23:59:50
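The <float><unit> strings used in such terms follow pandas' own Timedelta parsing, so you can inspect the value a term compares against. A minimal sketch, assuming only that pandas is installed:

```python
import pandas as pd

# The term "C<'-3.5D'" compares column C against this value: a signed,
# fractional number of days parsed by pd.Timedelta.
cutoff = pd.Timedelta("-3.5D")
assert cutoff == pd.Timedelta(days=-3.5)

# The other units listed above parse the same way:
print(pd.Timedelta("10s"))    # 0 days 00:00:10
print(pd.Timedelta("1.5ms"))
```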
7.3 Querying a MultiIndex

Selecting from a MultiIndex can be achieved by using the names of its levels:
In [411]: df_mi.index.names
Out[411]: FrozenList(['foo', 'bar'])
In [412]: store.select("df_mi", "foo=baz and bar=two")
Out[412]:
A B C
foo bar
baz two 1.064908 1.778161 -0.913867
If the index levels are unnamed (level=None), they are accessible through the default level names level_n, where n is the level of the MultiIndex you want to select from:
In [413]: index = pd.MultiIndex(
.....: levels=[["foo", "bar", "baz", "qux"], ["one", "two", "three"]],
.....: codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
.....: )
.....:
In [414]: df_mi_2 = pd.DataFrame(np.random.randn(10, 3), index=index, columns=["A", "B", "C"])
In [415]: df_mi_2
Out[415]:
A B C
foo one 0.856838 1.491776 0.001283
two 0.701816 -1.097917 0.102588
three 0.661740 0.443531 0.559313
bar one -0.459055 -1.222598 -0.455304
two -0.781163 0.826204 -0.530057
baz two 0.296135 1.366810 1.073372
three -0.994957 0.755314 2.119746
qux one -2.628174 -0.089460 -0.133636
two 0.337920 -0.634027 0.421107
three 0.604303 1.053434 1.109090
In [416]: store.append("df_mi_2", df_mi_2)
# the levels are automatically included as data columns with keyword level_n
In [417]: store.select("df_mi_2", "level_0=foo and level_1=two")
Out[417]:
A B C
foo two 0.701816 -1.097917 0.102588
7.4 Indexing

You can create/modify an index for a table with create_table_index after data is already in the table. Creating a table index is highly encouraged; it can greatly speed up your queries when you perform a select.

Note:

Indexes are automatically created on the indexables and on any data columns you specify. This behavior can be turned off by passing index=False to append.
# we have automagically already created an index (in the first section)
In [418]: i = store.root.df.table.cols.index.index
In [419]: i.optlevel, i.kind
Out[419]: (6, 'medium')
# change an index by passing new parameters
In [420]: store.create_table_index("df", optlevel=9, kind="full")
In [421]: i = store.root.df.table.cols.index.index
In [422]: i.optlevel, i.kind
Out[422]: (9, 'full')
Oftentimes when appending large amounts of data to a store, it is useful to turn off index creation for each append, then recreate the index at the end:
In [423]: df_1 = pd.DataFrame(np.random.randn(10, 2), columns=list("AB"))
In [424]: df_2 = pd.DataFrame(np.random.randn(10, 2), columns=list("AB"))
In [425]: st = pd.HDFStore("appends.h5", mode="w")
In [426]: st.append("df", df_1, data_columns=["B"], index=False)
In [427]: st.append("df", df_2, data_columns=["B"], index=False)
In [428]: st.get_storer("df").table
Out[428]:
/df/table (Table(20,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
"B": Float64Col(shape=(), dflt=0.0, pos=2)}
byteorder := 'little'
chunkshape := (2730,)
Then create the index when finished appending:
In [429]: st.create_table_index("df", columns=["B"], optlevel=9, kind="full")
In [430]: st.get_storer("df").table
Out[430]:
/df/table (Table(20,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
"B": Float64Col(shape=(), dflt=0.0, pos=2)}
byteorder := 'little'
chunkshape := (2730,)
autoindex := True
colindexes := {
"B": Index(9, full, shuffle, zlib(1)).is_csi=True}
In [431]: st.close()
7.5 Query via data columns

Sometimes you may want to run queries against only certain columns. You can designate these as data columns, or pass data_columns=True to force all columns to be data_columns.
In [432]: df_dc = df.copy()
In [433]: df_dc["string"] = "foo"
In [434]: df_dc.loc[df_dc.index[4:6], "string"] = np.nan
In [435]: df_dc.loc[df_dc.index[7:9], "string"] = "bar"
In [436]: df_dc["string2"] = "cool"
In [437]: df_dc.loc[df_dc.index[1:3], ["B", "C"]] = 1.0
In [438]: df_dc
Out[438]:
A B C string string2
2000-01-01 1.334065 0.521036 0.930384 foo cool
2000-01-02 -1.613932 1.000000 1.000000 foo cool
2000-01-03 -0.585314 1.000000 1.000000 foo cool
2000-01-04 0.632369 -1.249657 0.975593 foo cool
2000-01-05 1.060617 -0.143682 0.218423 NaN cool
2000-01-06 3.050329 1.317933 -0.963725 NaN cool
2000-01-07 -0.539452 -0.771133 0.023751 foo cool
2000-01-08 0.649464 -1.736427 0.197288 bar cool
# on-disk operations
In [439]: store.append("df_dc", df_dc, data_columns=["B", "C", "string", "string2"])
In [440]: store.select("df_dc", where="B > 0")
Out[440]:
A B C string string2
2000-01-01 1.334065 0.521036 0.930384 foo cool
2000-01-02 -1.613932 1.000000 1.000000 foo cool
2000-01-03 -0.585314 1.000000 1.000000 foo cool
2000-01-06 3.050329 1.317933 -0.963725 NaN cool
# getting creative
In [441]: store.select("df_dc", "B > 0 & C > 0 & string == foo")
Out[441]:
A B C string string2
2000-01-01 1.334065 0.521036 0.930384 foo cool
2000-01-02 -1.613932 1.000000 1.000000 foo cool
2000-01-03 -0.585314 1.000000 1.000000 foo cool
# this is in-memory version of this type of selection
In [442]: df_dc[(df_dc.B > 0) & (df_dc.C > 0) & (df_dc.string == "foo")]
Out[442]:
A B C string string2
2000-01-01 1.334065 0.521036 0.930384 foo cool
2000-01-02 -1.613932 1.000000 1.000000 foo cool
2000-01-03 -0.585314 1.000000 1.000000 foo cool
# we have automagically created this index and the B/C/string/string2
# columns are stored separately as ``PyTables`` columns
In [443]: store.root.df_dc.table
Out[443]:
/df_dc/table (Table(8,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
"B": Float64Col(shape=(), dflt=0.0, pos=2),
"C": Float64Col(shape=(), dflt=0.0, pos=3),
"string": StringCol(itemsize=3, shape=(), dflt=b'', pos=4),
"string2": StringCol(itemsize=4, shape=(), dflt=b'', pos=5)}
byteorder := 'little'
chunkshape := (1680,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"B": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"C": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"string": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"string2": Index(6, medium, shuffle, zlib(1)).is_csi=False}
7.6 Iterator

You can pass iterator=True or chunksize=number_in_a_chunk to select and select_as_multiple to return an iterator over the results. The default is 50,000 rows returned per chunk.
In [444]: for df in store.select("df", chunksize=3):
.....: print(df)
.....:
A B C
2000-01-01 1.334065 0.521036 0.930384
2000-01-02 -1.613932 1.088104 -0.632963
2000-01-03 -0.585314 -0.275038 -0.937512
A B C
2000-01-04 0.632369 -1.249657 0.975593
2000-01-05 1.060617 -0.143682 0.218423
2000-01-06 3.050329 1.317933 -0.963725
A B C
2000-01-07 -0.539452 -0.771133 0.023751
2000-01-08 0.649464 -1.736427 0.197288
Note:

You can also use read_hdf to return an iterator; it automatically closes the store when finished iterating:
for df in pd.read_hdf("store.h5", "df", chunksize=3):
print(df)
Note:

The chunksize keyword applies to the source rows: the source table is read in blocks of chunksize rows, the query is applied to each block, and only the surviving rows are yielded. The chunks produced by the returned iterator may therefore vary in size.
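This behavior can be sketched in plain Python (a sketch of the idea only, not the actual HDFStore implementation):

```python
# Simulate select(where=..., chunksize=3) over 10 source rows:
# read fixed-size blocks, filter each block, yield what survives.
rows = list(range(10))
chunksize = 3
keep = lambda r: r % 2 == 0          # stand-in for the where filter

chunks = [[r for r in rows[i:i + chunksize] if keep(r)]
          for i in range(0, len(rows), chunksize)]
print(chunks)  # [[0, 2], [4], [6, 8], []] -- chunk sizes vary
```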
Here is a recipe for generating a query and using it to create equal-sized return chunks:
In [445]: dfeq = pd.DataFrame({"number": np.arange(1, 11)})
In [446]: dfeq
Out[446]:
number
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
In [447]: store.append("dfeq", dfeq, data_columns=["number"])
In [448]: def chunks(l, n):
.....: return [l[i: i + n] for i in range(0, len(l), n)]
.....:
In [449]: evens = [2, 4, 6, 8, 10]
In [450]: coordinates = store.select_as_coordinates("dfeq", "number=evens")
In [451]: for c in chunks(coordinates, 2):
.....: print(store.select("dfeq", where=c))
.....:
number
1 2
3 4
number
5 6
7 8
number
9 10
7.7 Advanced queries

7.7.1 Select a single column

To retrieve a single indexable or data column, use the method select_column. This will, for example, let you get the index very quickly. The result is returned as a Series, indexed by the row number. select_column does not currently accept the where selector:
In [452]: store.select_column("df_dc", "index")
Out[452]:
0 2000-01-01
1 2000-01-02
2 2000-01-03
3 2000-01-04
4 2000-01-05
5 2000-01-06
6 2000-01-07
7 2000-01-08
Name: index, dtype: datetime64[ns]
In [453]: store.select_column("df_dc", "string")
Out[453]:
0 foo
1 foo
2 foo
3 foo
4 NaN
5 NaN
6 foo
7 bar
Name: string, dtype: object
7.7.2 Selecting coordinates

Sometimes you want to get the coordinates (i.e. the index locations) of your query. The resulting Int64Index of row locations can be passed to where for subsequent operations:
In [454]: df_coord = pd.DataFrame(
.....: np.random.randn(1000, 2), index=pd.date_range("20000101", periods=1000)
.....: )
.....:
In [455]: store.append("df_coord", df_coord)
In [456]: c = store.select_as_coordinates("df_coord", "index > 20020101")
In [457]: c
Out[457]:
Int64Index([732, 733, 734, 735, 736, 737, 738, 739, 740, 741,
...
990, 991, 992, 993, 994, 995, 996, 997, 998, 999],
dtype='int64', length=268)
In [458]: store.select("df_coord", where=c)
Out[458]:
0 1
2002-01-02 -0.165548 0.646989
2002-01-03 0.782753 -0.123409
2002-01-04 -0.391932 -0.740915
2002-01-05 1.211070 -0.668715
2002-01-06 0.341987 -0.685867
... ... ...
2002-09-22 1.788110 -0.405908
2002-09-23 -0.801912 0.768460
2002-09-24 0.466284 -0.457411
2002-09-25 -0.364060 0.785367
2002-09-26 -1.463093 1.187315
[268 rows x 2 columns]
7.7.3 Selecting using a where mask

Sometimes your query can involve creating a list of rows to select. Usually this mask is the resulting index from an indexing operation:
In [459]: df_mask = pd.DataFrame(
.....: np.random.randn(1000, 2), index=pd.date_range("20000101", periods=1000)
.....: )
.....:
In [460]: store.append("df_mask", df_mask)
In [461]: c = store.select_column("df_mask", "index")
In [462]: where = c[pd.DatetimeIndex(c).month == 5].index
In [463]: store.select("df_mask", where=where)
Out[463]:
0 1
2000-05-01 1.735883 -2.615261
2000-05-02 0.422173 2.425154
2000-05-03 0.632453 -0.165640
2000-05-04 -1.017207 -0.005696
2000-05-05 0.299606 0.070606
... ... ...
2002-05-27 0.234503 1.199126
2002-05-28 -3.021833 -1.016828
2002-05-29 0.522794 0.063465
2002-05-30 -1.653736 0.031709
2002-05-31 -0.968402 -0.393583
[93 rows x 2 columns]
7.7.4 Storer object

If you want to inspect the stored object, retrieve it via get_storer. You can use this programmatically, for example, to get the number of rows in an object:
In [464]: store.get_storer("df_dc").nrows
Out[464]: 8
7.8 Multiple table queries

The methods append_to_multiple and select_as_multiple can append to and select from multiple tables at once.

The idea is to have one table (called the selector table) that indexes most or all of the columns, and to run your queries against it. The other tables are data tables whose indexes match the selector table's index. You can then perform a very fast query on the selector table, yet get lots of related data back. This is similar to having one very wide table, but enables more efficient queries.

The append_to_multiple method splits a given single DataFrame into multiple tables according to a dict that maps each table name to the list of columns you want in that table. If None is used in place of a list, that table will hold the remaining unspecified columns of the given DataFrame.

The argument selector defines which table is the selector table (i.e. the one you can make queries from). The argument dropna drops rows from the input DataFrame to keep the tables synchronized: if a row to be written is entirely np.nan in one table, that row is dropped from all tables.

If dropna=False, the user is responsible for synchronizing the tables. Remember that entirely np.nan rows are not written to the HDFStore, so if you pass dropna=False, some tables may end up with more rows than others, and select_as_multiple may not work or may return unexpected results.
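As a rough in-memory sketch of how append_to_multiple divides the columns (split_frame below is a hypothetical helper for illustration; the real method also writes each piece to the store):

```python
import numpy as np
import pandas as pd

def split_frame(df, groups):
    """Split df column-wise per the dict; a None value takes the remainder."""
    named = {c for cols in groups.values() if cols is not None for c in cols}
    rest = [c for c in df.columns if c not in named]
    return {name: (df[rest] if cols is None else df[cols])
            for name, cols in groups.items()}

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("ABCD"))
pieces = split_frame(df, {"t1": ["A", "B"], "t2": None})
print(list(pieces["t1"].columns))  # ['A', 'B']
print(list(pieces["t2"].columns))  # ['C', 'D']
```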
In [465]: df_mt = pd.DataFrame(
.....: np.random.randn(8, 6),
.....: index=pd.date_range("1/1/2000", periods=8),
.....: columns=["A", "B", "C", "D", "E", "F"],
.....: )
.....:
In [466]: df_mt["foo"] = "bar"
In [467]: df_mt.loc[df_mt.index[1], ("A", "B")] = np.nan
# you can also create the tables individually
In [468]: store.append_to_multiple(
.....: {"df1_mt": ["A", "B"], "df2_mt": None}, df_mt, selector="df1_mt"
.....: )
.....:
In [469]: store
Out[469]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
# individual tables were created
In [470]: store.select("df1_mt")
Out[470]:
A B
2000-01-01 1.251079 -0.362628
2000-01-02 NaN NaN
2000-01-03 0.719421 -0.448886
2000-01-04 1.140998 -0.877922
2000-01-05 1.043605 1.798494
2000-01-06 -0.467812 -0.027965
2000-01-07 0.150568 0.754820
2000-01-08 -0.596306 -0.910022
In [471]: store.select("df2_mt")
Out[471]:
C D E F foo
2000-01-01 1.602451 -0.221229 0.712403 0.465927 bar
2000-01-02 -0.525571 0.851566 -0.681308 -0.549386 bar
2000-01-03 -0.044171 1.396628 1.041242 -1.588171 bar
2000-01-04 0.463351 -0.861042 -2.192841 -1.025263 bar
2000-01-05 -1.954845 -1.712882 -0.204377 -1.608953 bar
2000-01-06 1.601542 -0.417884 -2.757922 -0.307713 bar
2000-01-07 -1.935461 1.007668 0.079529 -1.459471 bar
2000-01-08 -1.057072 -0.864360 -1.124870 1.732966 bar
# as a multiple
In [472]: store.select_as_multiple(
.....: ["df1_mt", "df2_mt"],
.....: where=["A>0", "B>0"],
.....: selector="df1_mt",
.....: )
.....:
Out[472]:
A B C D E F foo
2000-01-05 1.043605 1.798494 -1.954845 -1.712882 -0.204377 -1.608953 bar
2000-01-07 0.150568 0.754820 -1.935461 1.007668 0.079529 -1.459471 bar