
Python Data Processing (21): HDF5 Queries

2021-02-24  名本无名

7 Queries

7.1 Querying a table

The select and delete operations take an optional criterion that can be specified to select/delete only a subset of the data.

This allows one to retrieve a small portion of data from a very large on-disk table.

Valid comparison operators are:

=, ==, !=, >, >=, <, <=

Valid boolean expressions are combined with:

| : or
& : and
( and ) : for grouping

Note

'=' will be automatically expanded to the comparison operator '=='; '~' is the not operator, but can only be used in very limited circumstances; if a list/tuple of expressions is passed, they will be combined via &.

The following are valid expressions:

'index >= date'
"columns = ['A', 'D']"
"columns in ['A', 'D']"
'columns = A'
'columns == A'
"~(columns = ['A', 'B'])"
'index > df.index[3] & string = "bar"'
'(index > df.index[3] & index <= df.index[6]) | string = "bar"'
"ts >= Timestamp('2012-02-01')"
"major_axis>=20130101"

The indexers are on the left-hand side of the sub-expression: columns, major_axis, ts

The right-hand side of the sub-expression (after a comparison operator) can be:

functions that will be evaluated, e.g. Timestamp('2012-02-01')
strings, e.g. "bar"
date-likes, e.g. 20130101, or "20130101"
lists, e.g. ['A', 'B']
variables that are defined in the local namespace, e.g. date

Note

Passing a string to a query by interpolating it into the query expression is not recommended. Instead, assign the string of interest to a variable and use that variable directly in the expression. For example:

string = "HolyMoly'"
store.select("df", "index == string")

If you instead interpolate the string:

string = "HolyMoly'"
store.select('df', f'index == {string}')

This will raise a SyntaxError, because the interpolated value contains an unmatched single quote.

If you must interpolate, use the '%r' format specifier, which quotes the value via repr():

store.select("df", "index == %r" % string)
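
A quick standalone check of the quoting behavior (plain string formatting, no store needed): '%r' interpolates repr() of the value, which re-quotes and escapes it so the embedded single quote cannot break the expression.

```python
string = "HolyMoly'"

# f-string interpolation drops the value in bare, leaving an unmatched quote
expr_bad = f"index == {string}"    # index == HolyMoly'

# %r interpolates repr(string), which re-quotes the value safely
expr_ok = "index == %r" % string   # index == "HolyMoly'"

print(expr_bad)
print(expr_ok)
```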

Suppose we have the following data:

In [400]: dfq = pd.DataFrame(
   .....:     np.random.randn(10, 4),
   .....:     columns=list("ABCD"),
   .....:     index=pd.date_range("20130101", periods=10),
   .....: )
   .....: 

In [401]: store.append("dfq", dfq, format="table", data_columns=True)

Query with a boolean expression and inline function evaluation:

In [402]: store.select("dfq", "index>pd.Timestamp('20130104') & columns=['A', 'B']")
Out[402]: 
                   A         B
2013-01-05 -1.083889  0.811865
2013-01-06 -0.402227  1.618922
2013-01-07  0.948196  0.183573
2013-01-08 -1.043530 -0.708145
2013-01-09  0.813949  1.508891
2013-01-10  1.176488 -1.246093

Query using an inline column reference:

In [403]: store.select("dfq", where="A>0 or C>0")
Out[403]: 
                   A         B         C         D
2013-01-01  0.620028  0.159416 -0.263043 -0.639244
2013-01-04 -0.536722  1.005707  0.296917  0.139796
2013-01-05 -1.083889  0.811865  1.648435 -0.164377
2013-01-07  0.948196  0.183573  0.145277  0.308146
2013-01-08 -1.043530 -0.708145  1.430905 -0.850136
2013-01-09  0.813949  1.508891 -1.556154  0.187597
2013-01-10  1.176488 -1.246093 -0.002726 -0.444249
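
For reference, the in-memory equivalents of the two on-disk queries above; this sketch rebuilds a dfq locally (random values, so the numbers will differ from the output shown):

```python
import numpy as np
import pandas as pd

dfq = pd.DataFrame(
    np.random.randn(10, 4),
    columns=list("ABCD"),
    index=pd.date_range("20130101", periods=10),
)

# in-memory version of "index>pd.Timestamp('20130104') & columns=['A', 'B']"
sel1 = dfq.loc[dfq.index > pd.Timestamp("20130104"), ["A", "B"]]

# in-memory version of "A>0 or C>0"
sel2 = dfq[(dfq.A > 0) | (dfq.C > 0)]

print(sel1.shape)  # (6, 2): 2013-01-05 through 2013-01-10
```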

The columns keyword can be supplied to select a list of columns to be returned; this is equivalent to passing 'columns=list_of_columns_to_filter':

In [404]: store.select("df", "columns=['A', 'B']")
Out[404]: 
                   A         B
2000-01-01  1.334065  0.521036
2000-01-02 -1.613932  1.088104
2000-01-03 -0.585314 -0.275038
2000-01-04  0.632369 -1.249657
2000-01-05  1.060617 -0.143682
2000-01-06  3.050329  1.317933
2000-01-07 -0.539452 -0.771133
2000-01-08  0.649464 -1.736427

The start and stop parameters can be specified to limit the rows returned. These are interpreted in terms of the total number of rows in the table:

>>> store.select("dfq", "columns=['A', 'B']", start=3, stop=5)
                   A         B
2013-01-04 -0.483155  1.143564
2013-01-05  0.218290 -1.391789

7.2 Querying timedelta64[ns]

You can store and query using the timedelta64[ns] type.

Terms can be specified in the format <float>(<unit>), where float may be signed (and fractional) and unit can be one of D, s, ms, us, ns for the timedelta.

Here is an example:

In [405]: from datetime import timedelta

In [406]: dftd = pd.DataFrame(
   .....:     {
   .....:         "A": pd.Timestamp("20130101"),
   .....:         "B": [
   .....:             pd.Timestamp("20130101") + timedelta(days=i, seconds=10)
   .....:             for i in range(10)
   .....:         ],
   .....:     }
   .....: )
   .....: 

In [407]: dftd["C"] = dftd["A"] - dftd["B"]

In [408]: dftd
Out[408]: 
           A                   B                  C
0 2013-01-01 2013-01-01 00:00:10  -1 days +23:59:50
1 2013-01-01 2013-01-02 00:00:10  -2 days +23:59:50
2 2013-01-01 2013-01-03 00:00:10  -3 days +23:59:50
3 2013-01-01 2013-01-04 00:00:10  -4 days +23:59:50
4 2013-01-01 2013-01-05 00:00:10  -5 days +23:59:50
5 2013-01-01 2013-01-06 00:00:10  -6 days +23:59:50
6 2013-01-01 2013-01-07 00:00:10  -7 days +23:59:50
7 2013-01-01 2013-01-08 00:00:10  -8 days +23:59:50
8 2013-01-01 2013-01-09 00:00:10  -9 days +23:59:50
9 2013-01-01 2013-01-10 00:00:10 -10 days +23:59:50

In [409]: store.append("dftd", dftd, data_columns=True)

In [410]: store.select("dftd", "C<'-3.5D'")
Out[410]: 
           A                   B                  C
4 2013-01-01 2013-01-05 00:00:10  -5 days +23:59:50
5 2013-01-01 2013-01-06 00:00:10  -6 days +23:59:50
6 2013-01-01 2013-01-07 00:00:10  -7 days +23:59:50
7 2013-01-01 2013-01-08 00:00:10  -8 days +23:59:50
8 2013-01-01 2013-01-09 00:00:10  -9 days +23:59:50
9 2013-01-01 2013-01-10 00:00:10 -10 days +23:59:50
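
The on-disk query above can be mirrored in memory; a minimal sketch (pure pandas, no store) showing how the '-3.5D' term is parsed:

```python
import pandas as pd
from datetime import timedelta

dftd = pd.DataFrame(
    {
        "A": pd.Timestamp("20130101"),
        "B": [
            pd.Timestamp("20130101") + timedelta(days=i, seconds=10)
            for i in range(10)
        ],
    }
)
dftd["C"] = dftd["A"] - dftd["B"]

# the query term '-3.5D' is just a Timedelta string
cutoff = pd.Timedelta("-3.5D")
result = dftd[dftd["C"] < cutoff]  # in-memory equivalent of "C<'-3.5D'"
print(len(result))  # rows 4..9, i.e. 6 rows
```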

7.3 Querying a MultiIndex

Selecting from a MultiIndex can be achieved by using the name of the level:

In [411]: df_mi.index.names
Out[411]: FrozenList(['foo', 'bar'])

In [412]: store.select("df_mi", "foo=baz and bar=two")
Out[412]: 
                A         B         C
foo bar                              
baz two  1.064908  1.778161 -0.913867

If the MultiIndex level names are None, the levels are automatically made available via the default names level_n, where n is the level of the MultiIndex you want to select from:

In [413]: index = pd.MultiIndex(
   .....:     levels=[["foo", "bar", "baz", "qux"], ["one", "two", "three"]],
   .....:     codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
   .....: )
   .....: 

In [414]: df_mi_2 = pd.DataFrame(np.random.randn(10, 3), index=index, columns=["A", "B", "C"])

In [415]: df_mi_2
Out[415]: 
                  A         B         C
foo one    0.856838  1.491776  0.001283
    two    0.701816 -1.097917  0.102588
    three  0.661740  0.443531  0.559313
bar one   -0.459055 -1.222598 -0.455304
    two   -0.781163  0.826204 -0.530057
baz two    0.296135  1.366810  1.073372
    three -0.994957  0.755314  2.119746
qux one   -2.628174 -0.089460 -0.133636
    two    0.337920 -0.634027  0.421107
    three  0.604303  1.053434  1.109090

In [416]: store.append("df_mi_2", df_mi_2)

# the levels are automatically included as data columns with keyword level_n
In [417]: store.select("df_mi_2", "level_0=foo and level_1=two")
Out[417]: 
                A         B         C
foo two  0.701816 -1.097917  0.102588
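
For comparison, a minimal in-memory sketch of the same level-based selection (no store involved), using index.get_level_values:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex(
    levels=[["foo", "bar", "baz", "qux"], ["one", "two", "three"]],
    codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
)
df_mi_2 = pd.DataFrame(np.random.randn(10, 3), index=index, columns=["A", "B", "C"])

# unnamed levels are addressed positionally; this mirrors
# "level_0=foo and level_1=two" in the on-disk query
mask = (df_mi_2.index.get_level_values(0) == "foo") & (
    df_mi_2.index.get_level_values(1) == "two"
)
print(df_mi_2[mask])  # the single ('foo', 'two') row
```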

7.4 Indexing

You can create/modify an index for a table with create_table_index after data is already in the table.

Creating a table index is highly encouraged, as it can speed up your queries a great deal when you perform a select.

Note

Indexes are automatically created on the indexables and on any data columns you specify. This behavior can be turned off by passing index=False to append.

# we have automagically already created an index (in the first section)
In [418]: i = store.root.df.table.cols.index.index

In [419]: i.optlevel, i.kind
Out[419]: (6, 'medium')

# change an index by passing new parameters
In [420]: store.create_table_index("df", optlevel=9, kind="full")

In [421]: i = store.root.df.table.cols.index.index

In [422]: i.optlevel, i.kind
Out[422]: (9, 'full')

Oftentimes when appending large amounts of data to a store, it is useful to turn off index creation for each append, then recreate it at the end:

In [423]: df_1 = pd.DataFrame(np.random.randn(10, 2), columns=list("AB"))

In [424]: df_2 = pd.DataFrame(np.random.randn(10, 2), columns=list("AB"))

In [425]: st = pd.HDFStore("appends.h5", mode="w")

In [426]: st.append("df", df_1, data_columns=["B"], index=False)

In [427]: st.append("df", df_2, data_columns=["B"], index=False)

In [428]: st.get_storer("df").table
Out[428]: 
/df/table (Table(20,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)

Then create the index when finished appending:

In [429]: st.create_table_index("df", columns=["B"], optlevel=9, kind="full")

In [430]: st.get_storer("df").table
Out[430]: 
/df/table (Table(20,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)
  autoindex := True
  colindexes := {
    "B": Index(9, full, shuffle, zlib(1)).is_csi=True}

In [431]: st.close()

7.5 Query via data columns

Sometimes you only want to be able to query on certain columns; you can designate those as data columns, or pass data_columns=True to force all columns to be data_columns:

In [432]: df_dc = df.copy()

In [433]: df_dc["string"] = "foo"

In [434]: df_dc.loc[df_dc.index[4:6], "string"] = np.nan

In [435]: df_dc.loc[df_dc.index[7:9], "string"] = "bar"

In [436]: df_dc["string2"] = "cool"

In [437]: df_dc.loc[df_dc.index[1:3], ["B", "C"]] = 1.0

In [438]: df_dc
Out[438]: 
                   A         B         C string string2
2000-01-01  1.334065  0.521036  0.930384    foo    cool
2000-01-02 -1.613932  1.000000  1.000000    foo    cool
2000-01-03 -0.585314  1.000000  1.000000    foo    cool
2000-01-04  0.632369 -1.249657  0.975593    foo    cool
2000-01-05  1.060617 -0.143682  0.218423    NaN    cool
2000-01-06  3.050329  1.317933 -0.963725    NaN    cool
2000-01-07 -0.539452 -0.771133  0.023751    foo    cool
2000-01-08  0.649464 -1.736427  0.197288    bar    cool

# on-disk operations
In [439]: store.append("df_dc", df_dc, data_columns=["B", "C", "string", "string2"])

In [440]: store.select("df_dc", where="B > 0")
Out[440]: 
                   A         B         C string string2
2000-01-01  1.334065  0.521036  0.930384    foo    cool
2000-01-02 -1.613932  1.000000  1.000000    foo    cool
2000-01-03 -0.585314  1.000000  1.000000    foo    cool
2000-01-06  3.050329  1.317933 -0.963725    NaN    cool

# getting creative
In [441]: store.select("df_dc", "B > 0 & C > 0 & string == foo")
Out[441]: 
                   A         B         C string string2
2000-01-01  1.334065  0.521036  0.930384    foo    cool
2000-01-02 -1.613932  1.000000  1.000000    foo    cool
2000-01-03 -0.585314  1.000000  1.000000    foo    cool

# this is in-memory version of this type of selection
In [442]: df_dc[(df_dc.B > 0) & (df_dc.C > 0) & (df_dc.string == "foo")]
Out[442]: 
                   A         B         C string string2
2000-01-01  1.334065  0.521036  0.930384    foo    cool
2000-01-02 -1.613932  1.000000  1.000000    foo    cool
2000-01-03 -0.585314  1.000000  1.000000    foo    cool

# we have automagically created this index and the B/C/string/string2
# columns are stored separately as ``PyTables`` columns
In [443]: store.root.df_dc.table
Out[443]: 
/df_dc/table (Table(8,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2),
  "C": Float64Col(shape=(), dflt=0.0, pos=3),
  "string": StringCol(itemsize=3, shape=(), dflt=b'', pos=4),
  "string2": StringCol(itemsize=4, shape=(), dflt=b'', pos=5)}
  byteorder := 'little'
  chunkshape := (1680,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "B": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "C": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "string": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "string2": Index(6, medium, shuffle, zlib(1)).is_csi=False}

7.6 Iterator

You can pass iterator=True or chunksize=number_in_a_chunk to select and select_as_multiple to get an iterator over the results. The default is 50,000 rows returned per chunk:

In [444]: for df in store.select("df", chunksize=3):
   .....:     print(df)
   .....: 
                   A         B         C
2000-01-01  1.334065  0.521036  0.930384
2000-01-02 -1.613932  1.088104 -0.632963
2000-01-03 -0.585314 -0.275038 -0.937512
                   A         B         C
2000-01-04  0.632369 -1.249657  0.975593
2000-01-05  1.060617 -0.143682  0.218423
2000-01-06  3.050329  1.317933 -0.963725
                   A         B         C
2000-01-07 -0.539452 -0.771133  0.023751
2000-01-08  0.649464 -1.736427  0.197288

Note

You can also use the iterator with read_hdf, which will open and then automatically close the store when finished iterating:

for df in pd.read_hdf("store.h5", "df", chunksize=3):
    print(df)

Note

The chunksize keyword applies to the source rows: when you run a query, the source rows are subdivided into chunks and the query applied to each, so the returned iterator may yield chunks of unequal size.

Here is a recipe for generating equal sized chunks:

In [445]: dfeq = pd.DataFrame({"number": np.arange(1, 11)})

In [446]: dfeq
Out[446]: 
   number
0       1
1       2
2       3
3       4
4       5
5       6
6       7
7       8
8       9
9      10

In [447]: store.append("dfeq", dfeq, data_columns=["number"])

In [448]: def chunks(l, n):
   .....:     return [l[i: i + n] for i in range(0, len(l), n)]
   .....: 

In [449]: evens = [2, 4, 6, 8, 10]

In [450]: coordinates = store.select_as_coordinates("dfeq", "number=evens")

In [451]: for c in chunks(coordinates, 2):
   .....:     print(store.select("dfeq", where=c))
   .....: 
   number
1       2
3       4
   number
5       6
7       8
   number
9      10
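
The chunks helper above is plain Python slicing; a quick standalone check of its logic (no store required):

```python
def chunks(l, n):
    # split a sequence into consecutive pieces of length n (last may be shorter)
    return [l[i: i + n] for i in range(0, len(l), n)]

print(chunks(list(range(10)), 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```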

7.7 Advanced queries

7.7.1 Select a single column

To retrieve a single indexable or data column, use the select_column method. This, for example, enables you to get the index very quickly. It returns a Series of the results, indexed by the row number.

This method does not yet accept a where selector:

In [452]: store.select_column("df_dc", "index")
Out[452]: 
0   2000-01-01
1   2000-01-02
2   2000-01-03
3   2000-01-04
4   2000-01-05
5   2000-01-06
6   2000-01-07
7   2000-01-08
Name: index, dtype: datetime64[ns]

In [453]: store.select_column("df_dc", "string")
Out[453]: 
0    foo
1    foo
2    foo
3    foo
4    NaN
5    NaN
6    foo
7    bar
Name: string, dtype: object

7.7.2 Selecting coordinates

Sometimes you want to get the coordinates (i.e. the index locations) of your query. This returns an Int64Index of the resulting locations, which can be passed to a subsequent where for further operations:

In [454]: df_coord = pd.DataFrame(
   .....:     np.random.randn(1000, 2), index=pd.date_range("20000101", periods=1000)
   .....: )
   .....: 

In [455]: store.append("df_coord", df_coord)

In [456]: c = store.select_as_coordinates("df_coord", "index > 20020101")

In [457]: c
Out[457]: 
Int64Index([732, 733, 734, 735, 736, 737, 738, 739, 740, 741,
            ...
            990, 991, 992, 993, 994, 995, 996, 997, 998, 999],
           dtype='int64', length=268)

In [458]: store.select("df_coord", where=c)
Out[458]: 
                   0         1
2002-01-02 -0.165548  0.646989
2002-01-03  0.782753 -0.123409
2002-01-04 -0.391932 -0.740915
2002-01-05  1.211070 -0.668715
2002-01-06  0.341987 -0.685867
...              ...       ...
2002-09-22  1.788110 -0.405908
2002-09-23 -0.801912  0.768460
2002-09-24  0.466284 -0.457411
2002-09-25 -0.364060  0.785367
2002-09-26 -1.463093  1.187315

[268 rows x 2 columns]
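
The coordinates are just integer row locations; a standalone in-memory analogue (assuming the same 1000-day range as above) using np.flatnonzero:

```python
import numpy as np
import pandas as pd

df_coord = pd.DataFrame(
    np.random.randn(1000, 2), index=pd.date_range("20000101", periods=1000)
)

# integer locations of the matching rows, analogous to the
# Int64Index returned by select_as_coordinates
coords = np.flatnonzero(df_coord.index > pd.Timestamp("20020101"))
print(coords[0], len(coords))  # 732 268
```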

7.7.3 Selecting using a where mask

Sometimes your query can involve creating a list of rows to select. Usually this mask is the resulting index from an indexing operation:

In [459]: df_mask = pd.DataFrame(
   .....:     np.random.randn(1000, 2), index=pd.date_range("20000101", periods=1000)
   .....: )
   .....: 

In [460]: store.append("df_mask", df_mask)

In [461]: c = store.select_column("df_mask", "index")

In [462]: where = c[pd.DatetimeIndex(c).month == 5].index

In [463]: store.select("df_mask", where=where)
Out[463]: 
                   0         1
2000-05-01  1.735883 -2.615261
2000-05-02  0.422173  2.425154
2000-05-03  0.632453 -0.165640
2000-05-04 -1.017207 -0.005696
2000-05-05  0.299606  0.070606
...              ...       ...
2002-05-27  0.234503  1.199126
2002-05-28 -3.021833 -1.016828
2002-05-29  0.522794  0.063465
2002-05-30 -1.653736  0.031709
2002-05-31 -0.968402 -0.393583

[93 rows x 2 columns]
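
The masking step itself is ordinary pandas; a standalone sketch of how the May-only selection works (assuming the same 1000-day range as above):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("20000101", periods=1000)
df_mask = pd.DataFrame(np.random.randn(1000, 2), index=idx)

# keep only rows whose index falls in May of any year
may_rows = df_mask[df_mask.index.month == 5]
print(len(may_rows))  # three Mays (2000-2002), 31 days each -> 93
```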

7.7.4 Storer object

If you want to inspect the stored object, retrieve it via get_storer. You can use this programmatically, say, to get the number of rows in an object:

In [464]: store.get_storer("df_dc").nrows
Out[464]: 8

7.8 Multiple table queries

The methods append_to_multiple and select_as_multiple can perform appending/selecting from multiple tables at once.

The idea is to have one table (call it the selector table) that you index most/all of the columns on, and perform your queries against it. The other tables are data tables whose index matches the selector table's index.

You can then perform a very fast query on the selector table, yet get lots of data back. This method is similar to having a very wide table, but enables more efficient queries.

The append_to_multiple method splits a given single DataFrame into multiple tables according to a dictionary that maps table names to the list of columns you want in that table. If None is used in place of a list, that table will have the remaining unspecified columns of the given DataFrame.

The selector argument defines which table is the selector table (i.e. the one you can make queries against). The dropna argument will drop rows from the input DataFrame to ensure the tables are synchronized: if a row to be written is entirely np.nan, that row will be dropped from all tables.

If dropna is False, the user is responsible for synchronizing the tables. Remember that entirely np.nan rows are not written to the HDFStore, so if you choose to call dropna=False, some tables may have more rows than others, and therefore select_as_multiple may not work or may return unexpected results:
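
The splitting rule (a dict of table name to columns, with None taking the leftovers) can be sketched in plain pandas; split_frame below is a hypothetical helper illustrating the rule, not part of the pandas API:

```python
import numpy as np
import pandas as pd

def split_frame(df, spec):
    # mimic append_to_multiple's splitting rule: each table gets its listed
    # columns; a None entry receives all remaining unspecified columns
    taken = [c for cols in spec.values() if cols is not None for c in cols]
    out = {}
    for name, cols in spec.items():
        if cols is None:
            cols = [c for c in df.columns if c not in taken]
        out[name] = df[cols]
    return out

df_mt = pd.DataFrame(np.random.randn(8, 6), columns=list("ABCDEF"))
df_mt["foo"] = "bar"

parts = split_frame(df_mt, {"df1_mt": ["A", "B"], "df2_mt": None})
print(list(parts["df1_mt"].columns))  # ['A', 'B']
print(list(parts["df2_mt"].columns))  # ['C', 'D', 'E', 'F', 'foo']
```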

In [465]: df_mt = pd.DataFrame(
   .....:     np.random.randn(8, 6),
   .....:     index=pd.date_range("1/1/2000", periods=8),
   .....:     columns=["A", "B", "C", "D", "E", "F"],
   .....: )
   .....: 

In [466]: df_mt["foo"] = "bar"

In [467]: df_mt.loc[df_mt.index[1], ("A", "B")] = np.nan

# you can also create the tables individually
In [468]: store.append_to_multiple(
   .....:     {"df1_mt": ["A", "B"], "df2_mt": None}, df_mt, selector="df1_mt"
   .....: )
   .....: 

In [469]: store
Out[469]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

# individual tables were created
In [470]: store.select("df1_mt")
Out[470]: 
                   A         B
2000-01-01  1.251079 -0.362628
2000-01-02       NaN       NaN
2000-01-03  0.719421 -0.448886
2000-01-04  1.140998 -0.877922
2000-01-05  1.043605  1.798494
2000-01-06 -0.467812 -0.027965
2000-01-07  0.150568  0.754820
2000-01-08 -0.596306 -0.910022

In [471]: store.select("df2_mt")
Out[471]: 
                   C         D         E         F  foo
2000-01-01  1.602451 -0.221229  0.712403  0.465927  bar
2000-01-02 -0.525571  0.851566 -0.681308 -0.549386  bar
2000-01-03 -0.044171  1.396628  1.041242 -1.588171  bar
2000-01-04  0.463351 -0.861042 -2.192841 -1.025263  bar
2000-01-05 -1.954845 -1.712882 -0.204377 -1.608953  bar
2000-01-06  1.601542 -0.417884 -2.757922 -0.307713  bar
2000-01-07 -1.935461  1.007668  0.079529 -1.459471  bar
2000-01-08 -1.057072 -0.864360 -1.124870  1.732966  bar

# as a multiple
In [472]: store.select_as_multiple(
   .....:     ["df1_mt", "df2_mt"],
   .....:     where=["A>0", "B>0"],
   .....:     selector="df1_mt",
   .....: )
   .....: 
Out[472]: 
                   A         B         C         D         E         F  foo
2000-01-05  1.043605  1.798494 -1.954845 -1.712882 -0.204377 -1.608953  bar
2000-01-07  0.150568  0.754820 -1.935461  1.007668  0.079529 -1.459471  bar