Python Data Processing (20): HDF5 Basics
Preface
HDF (Hierarchical Data Format) is a set of file formats (HDF4, HDF5) designed to store and organize large amounts of data.
HDF5 lets you store large amounts of numerical data while still being able to access it easily and quickly. Thousands of datasets can be stored in a single file, categorized and tagged however you want.
Usage
HDFStore is a dict-like object that reads and writes pandas objects in the high-performance HDF5 format using the PyTables library.
In [345]: store = pd.HDFStore("store.h5")
In [346]: print(store)
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
Objects can be written to the file, just like adding key-value pairs to a dict:
In [347]: index = pd.date_range("1/1/2000", periods=8)
In [348]: s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
In [349]: df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"])
# store.put('s', s) is an equivalent method
In [350]: store["s"] = s
In [351]: store["df"] = df
In [352]: store
Out[352]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
In the current or a later Python session, you can retrieve the stored objects:
# store.get('df') is an equivalent method
In [353]: store["df"]
Out[353]:
A B C
2000-01-01 1.334065 0.521036 0.930384
2000-01-02 -1.613932 1.088104 -0.632963
2000-01-03 -0.585314 -0.275038 -0.937512
2000-01-04 0.632369 -1.249657 0.975593
2000-01-05 1.060617 -0.143682 0.218423
2000-01-06 3.050329 1.317933 -0.963725
2000-01-07 -0.539452 -0.771133 0.023751
2000-01-08 0.649464 -1.736427 0.197288
# dotted (attribute) access provides get as well
In [354]: store.df
Out[354]:
A B C
2000-01-01 1.334065 0.521036 0.930384
2000-01-02 -1.613932 1.088104 -0.632963
2000-01-03 -0.585314 -0.275038 -0.937512
2000-01-04 0.632369 -1.249657 0.975593
2000-01-05 1.060617 -0.143682 0.218423
2000-01-06 3.050329 1.317933 -0.963725
2000-01-07 -0.539452 -0.771133 0.023751
2000-01-08 0.649464 -1.736427 0.197288
Deleting the object corresponding to a key:
# store.remove('df') is an equivalent method
In [355]: del store["df"]
In [356]: store
Out[356]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
Closing a store, and using a context manager:
In [357]: store.close()
In [358]: store
Out[358]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
In [359]: store.is_open
Out[359]: False
# Working with, and automatically closing the store using a context manager
In [360]: with pd.HDFStore("store.h5") as store:
.....: store.keys()
.....:
1 Read/write API
HDFStore supports a top-level API: read_hdf for reading and to_hdf for writing, which work much like read_csv and to_csv.
In [361]: df_tl = pd.DataFrame({"A": list(range(5)), "B": list(range(5))})
In [362]: df_tl.to_hdf("store_tl.h5", "table", append=True)
In [363]: pd.read_hdf("store_tl.h5", "table", where=["index>2"])
Out[363]:
A B
3 3 3
4 4 4
By default, HDFStore does not drop rows that are all missing; this behavior can be changed by setting dropna=True:
In [364]: df_with_missing = pd.DataFrame(
.....: {
.....: "col1": [0, np.nan, 2],
.....: "col2": [1, np.nan, np.nan],
.....: }
.....: )
.....:
In [365]: df_with_missing
Out[365]:
col1 col2
0 0.0 1.0
1 NaN NaN
2 2.0 NaN
In [366]: df_with_missing.to_hdf("file.h5", "df_with_missing", format="table", mode="w")
In [367]: pd.read_hdf("file.h5", "df_with_missing")
Out[367]:
col1 col2
0 0.0 1.0
1 NaN NaN
2 2.0 NaN
In [368]: df_with_missing.to_hdf(
.....: "file.h5", "df_with_missing", format="table", mode="w", dropna=True
.....: )
.....:
In [369]: pd.read_hdf("file.h5", "df_with_missing")
Out[369]:
col1 col2
0 0.0 1.0
2 2.0 NaN
2 The fixed format
fixed is the default format for put and to_hdf. It supports fast reading and writing, but the stored object can neither be appended to nor searched with queries.
It is specified with format='fixed' or format='f'.
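To illustrate these limitations, here is a minimal sketch (the file name store_fixed.h5 is just for illustration): writing with the default fixed format works, but a subsequent append raises an error because only table-format nodes support appending.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 2), columns=["A", "B"])

# fixed is the default format for to_hdf, equivalent to format="fixed"
df.to_hdf("store_fixed.h5", "df", mode="w")

# reading the whole object back works fine
print(pd.read_hdf("store_fixed.h5", "df"))

# but appending to a fixed-format node is not supported
try:
    df.to_hdf("store_fixed.h5", "df", append=True)
except ValueError as err:
    print("append failed:", err)
```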
3 The table format
HDFStore supports another PyTables format on disk, the table format.
Conceptually, a table is shaped very much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions.
In addition, delete- and query-type operations are supported. The format is specified by passing format='table' or format='t' to append, put, or to_hdf.
It can also be enabled globally with pd.set_option('io.hdf.default_format', 'table'), which makes put/append/to_hdf store in the table format by default.
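A short sketch of the global option (the file name default_fmt.h5 is just for illustration): after setting io.hdf.default_format, put stores a table without an explicit format argument.

```python
import numpy as np
import pandas as pd

# make 'table' the default storage format for put/append/to_hdf
pd.set_option("io.hdf.default_format", "table")

df = pd.DataFrame(np.random.randn(3, 2), columns=["A", "B"])

with pd.HDFStore("default_fmt.h5", mode="w") as store:
    store.put("df", df)  # stored as a table even without format='table'
    ptype = store.root.df._v_attrs.pandas_type

print(ptype)  # 'frame_table' rather than the fixed-format 'frame'

# restore the default so later writes are unaffected
pd.set_option("io.hdf.default_format", None)
```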
In [370]: store = pd.HDFStore("store.h5")
In [371]: df1 = df[0:4]
In [372]: df2 = df[4:]
# append data (creates a table automatically)
In [373]: store.append("df", df1)
In [374]: store.append("df", df2)
In [375]: store
Out[375]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
# select the entire object
In [376]: store.select("df")
Out[376]:
A B C
2000-01-01 1.334065 0.521036 0.930384
2000-01-02 -1.613932 1.088104 -0.632963
2000-01-03 -0.585314 -0.275038 -0.937512
2000-01-04 0.632369 -1.249657 0.975593
2000-01-05 1.060617 -0.143682 0.218423
2000-01-06 3.050329 1.317933 -0.963725
2000-01-07 -0.539452 -0.771133 0.023751
2000-01-08 0.649464 -1.736427 0.197288
# the type of stored data
In [377]: store.root.df._v_attrs.pandas_type
Out[377]: 'frame_table'
4 Hierarchical keys
Keys to a store can be specified as strings in a hierarchical, path-name-like format, which generates a hierarchy of sub-stores (Groups in PyTables parlance).
Keys can be specified without the leading '/' and are always absolute (e.g. 'foo' refers to '/foo').
Removal operations remove the specified sub-store and everything below it, so be careful.
In [378]: store.put("foo/bar/bah", df)
In [379]: store.append("food/orange", df)
In [380]: store.append("food/apple", df)
In [381]: store
Out[381]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
# a list of keys is returned
In [382]: store.keys()
Out[382]: ['/df', '/food/apple', '/food/orange', '/foo/bar/bah']
# remove all nodes under this level
In [383]: store.remove("food")
In [384]: store
Out[384]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
You can walk through the group hierarchy using the walk method:
In [385]: for (path, subgroups, subkeys) in store.walk():
.....: for subgroup in subgroups:
.....: print("GROUP: {}/{}".format(path, subgroup))
.....: for subkey in subkeys:
.....: key = "/".join([path, subkey])
.....: print("KEY: {}".format(key))
.....: print(store.get(key))
.....:
GROUP: /foo
KEY: /df
A B C
2000-01-01 1.334065 0.521036 0.930384
2000-01-02 -1.613932 1.088104 -0.632963
2000-01-03 -0.585314 -0.275038 -0.937512
2000-01-04 0.632369 -1.249657 0.975593
2000-01-05 1.060617 -0.143682 0.218423
2000-01-06 3.050329 1.317933 -0.963725
2000-01-07 -0.539452 -0.771133 0.023751
2000-01-08 0.649464 -1.736427 0.197288
GROUP: /foo/bar
KEY: /foo/bar/bah
A B C
2000-01-01 1.334065 0.521036 0.930384
2000-01-02 -1.613932 1.088104 -0.632963
2000-01-03 -0.585314 -0.275038 -0.937512
2000-01-04 0.632369 -1.249657 0.975593
2000-01-05 1.060617 -0.143682 0.218423
2000-01-06 3.050329 1.317933 -0.963725
2000-01-07 -0.539452 -0.771133 0.023751
2000-01-08 0.649464 -1.736427 0.197288
Unlike items stored under the root node, hierarchical keys cannot be retrieved with the dotted (attribute) access described above:
In [8]: store.foo.bar.bah
AttributeError: 'HDFStore' object has no attribute 'foo'
# you can directly access the actual PyTables node by using the root node
In [9]: store.root.foo.bar.bah
Out[9]:
/foo/bar/bah (Group) ''
children := ['block0_items' (Array), 'block0_values' (Array), 'axis0' (Array), 'axis1' (Array)]
They can, however, be accessed with an explicit string key:
In [386]: store["foo/bar/bah"]
Out[386]:
A B C
2000-01-01 1.334065 0.521036 0.930384
2000-01-02 -1.613932 1.088104 -0.632963
2000-01-03 -0.585314 -0.275038 -0.937512
2000-01-04 0.632369 -1.249657 0.975593
2000-01-05 1.060617 -0.143682 0.218423
2000-01-06 3.050329 1.317933 -0.963725
2000-01-07 -0.539452 -0.771133 0.023751
2000-01-08 0.649464 -1.736427 0.197288
5 Storing types
5.1 Storing mixed types
Mixed-dtype data can also be stored. Strings are stored as fixed-width columns, with the width set to the length of the longest string in the data; trying to append longer strings later will raise an exception.
Passing min_itemsize={"values": size} specifies a larger minimum width for the string columns. Storing floats, strings, ints, bools, and datetime64 is currently supported.
For string columns, passing nan_rep='nan' changes the default representation of NaN on disk:
In [387]: df_mixed = pd.DataFrame(
.....: {
.....: "A": np.random.randn(8),
.....: "B": np.random.randn(8),
.....: "C": np.array(np.random.randn(8), dtype="float32"),
.....: "string": "string",
.....: "int": 1,
.....: "bool": True,
.....: "datetime64": pd.Timestamp("20010102"),
.....: },
.....: index=list(range(8)),
.....: )
.....:
In [388]: df_mixed.loc[df_mixed.index[3:5], ["A", "B", "string", "datetime64"]] = np.nan
In [389]: store.append("df_mixed", df_mixed, min_itemsize={"values": 50})
In [390]: df_mixed1 = store.select("df_mixed")
In [391]: df_mixed1
Out[391]:
A B C string int bool datetime64
0 -0.116008 0.743946 -0.398501 string 1 True 2001-01-02
1 0.592375 -0.533097 -0.677311 string 1 True 2001-01-02
2 0.476481 -0.140850 -0.874991 string 1 True 2001-01-02
3 NaN NaN -1.167564 NaN 1 True NaT
4 NaN NaN -0.593353 NaN 1 True NaT
5 0.852727 0.463819 0.146262 string 1 True 2001-01-02
6 -1.177365 0.793644 -0.131959 string 1 True 2001-01-02
7 1.236988 0.221252 0.089012 string 1 True 2001-01-02
In [392]: df_mixed1.dtypes.value_counts()
Out[392]:
float64 2
int64 1
float32 1
bool 1
datetime64[ns] 1
object 1
dtype: int64
# we have provided a minimum string column size
In [393]: store.root.df_mixed.table
Out[393]:
/df_mixed/table (Table(8,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1),
"values_block_1": Float32Col(shape=(1,), dflt=0.0, pos=2),
"values_block_2": Int64Col(shape=(1,), dflt=0, pos=3),
"values_block_3": Int64Col(shape=(1,), dflt=0, pos=4),
"values_block_4": BoolCol(shape=(1,), dflt=False, pos=5),
"values_block_5": StringCol(itemsize=50, shape=(1,), dflt=b'', pos=6)}
byteorder := 'little'
chunkshape := (689,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
6 Storing a MultiIndex DataFrame
In [394]: index = pd.MultiIndex(
.....: levels=[["foo", "bar", "baz", "qux"], ["one", "two", "three"]],
.....: codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
.....: names=["foo", "bar"],
.....: )
.....:
In [395]: df_mi = pd.DataFrame(np.random.randn(10, 3), index=index, columns=["A", "B", "C"])
In [396]: df_mi
Out[396]:
A B C
foo bar
foo one 0.667450 0.169405 -1.358046
two -0.105563 0.492195 0.076693
three 0.213685 -0.285283 -1.210529
bar one -1.408386 0.941577 -0.342447
two 0.222031 0.052607 2.093214
baz two 1.064908 1.778161 -0.913867
three -0.030004 -0.399846 -1.234765
qux one 0.081323 -0.268494 0.168016
two -0.898283 -0.218499 1.408028
three -1.267828 -0.689263 0.520995
In [397]: store.append("df_mi", df_mi)
In [398]: store.select("df_mi")
Out[398]:
A B C
foo bar
foo one 0.667450 0.169405 -1.358046
two -0.105563 0.492195 0.076693
three 0.213685 -0.285283 -1.210529
bar one -1.408386 0.941577 -0.342447
two 0.222031 0.052607 2.093214
baz two 1.064908 1.778161 -0.913867
three -0.030004 -0.399846 -1.234765
qux one 0.081323 -0.268494 0.168016
two -0.898283 -0.218499 1.408028
three -1.267828 -0.689263 0.520995
# the levels are automatically included as data columns
In [399]: store.select("df_mi", "foo=bar")
Out[399]:
A B C
foo bar
bar one -1.408386 0.941577 -0.342447
two 0.222031 0.052607 2.093214
Data types
HDFStore maps object dtypes to underlying PyTables dtypes. The supported types include: floating, integer, boolean, datetime64[ns], timedelta64[ns], categorical, and object (strings).
unicode columns are not supported, and will fail.
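A small sketch of the dtype mapping in practice (the file name types.h5 is just for illustration): a frame covering several of the supported dtypes survives a round trip through a table-format store with its dtypes intact.

```python
import numpy as np
import pandas as pd

df_types = pd.DataFrame(
    {
        "f": np.random.randn(3),                        # float64
        "i": np.arange(3),                              # int64
        "b": [True, False, True],                       # bool
        "dt": pd.date_range("2001-01-02", periods=3),   # datetime64[ns]
        "td": pd.to_timedelta(np.arange(3), unit="D"),  # timedelta64[ns]
    }
)

df_types.to_hdf("types.h5", "df_types", format="table", mode="w")

# the dtypes survive the round trip
print(pd.read_hdf("types.h5", "df_types").dtypes)
```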
Categorical data
Data containing category dtypes can be written to an HDFStore. Queries work the same as if it were an object array.
However, the category-dtyped data is stored in a more efficient manner:
In [473]: dfcat = pd.DataFrame(
.....: {"A": pd.Series(list("aabbcdba")).astype("category"), "B": np.random.randn(8)}
.....: )
.....:
In [474]: dfcat
Out[474]:
A B
0 a 0.477849
1 a 0.283128
2 b -2.045700
3 b -0.338206
4 c -0.423113
5 d 2.314361
6 b -0.033100
7 a -0.965461
In [475]: dfcat.dtypes
Out[475]:
A category
B float64
dtype: object
In [476]: cstore = pd.HDFStore("cats.h5", mode="w")
In [477]: cstore.append("dfcat", dfcat, format="table", data_columns=["A"])
In [478]: result = cstore.select("dfcat", where="A in ['b', 'c']")
In [479]: result
Out[479]:
A B
2 b -2.045700
3 b -0.338206
4 c -0.423113
6 b -0.033100
In [480]: result.dtypes
Out[480]:
A category
B float64
dtype: object
String columns
The underlying implementation of HDFStore uses a fixed column width (itemsize) for string columns. The itemsize is calculated on the first append as the length of the longest string in the column.
On subsequent appends, a string column whose strings exceed the established itemsize will raise an exception.
Pass min_itemsize on the first table creation to specify the minimum width of a particular string column.
min_itemsize can be an integer, or a dict mapping column names to integers, which specifies a minimum width per string column.
Note that passing a min_itemsize dict causes all of the passed columns to be created as data_columns automatically:
In [481]: dfs = pd.DataFrame({"A": "foo", "B": "bar"}, index=list(range(5)))
In [482]: dfs
Out[482]:
A B
0 foo bar
1 foo bar
2 foo bar
3 foo bar
4 foo bar
# A and B have a size of 30
In [483]: store.append("dfs", dfs, min_itemsize=30)
In [484]: store.get_storer("dfs").table
Out[484]:
/dfs/table (Table(5,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": StringCol(itemsize=30, shape=(2,), dflt=b'', pos=1)}
byteorder := 'little'
chunkshape := (963,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
# A is created as a data_column with a size of 30
# B's size is calculated
In [485]: store.append("dfs2", dfs, min_itemsize={"A": 30})
In [486]: store.get_storer("dfs2").table
Out[486]:
/dfs2/table (Table(5,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": StringCol(itemsize=3, shape=(1,), dflt=b'', pos=1),
"A": StringCol(itemsize=30, shape=(), dflt=b'', pos=2)}
byteorder := 'little'
chunkshape := (1598,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"A": Index(6, medium, shuffle, zlib(1)).is_csi=False}
String columns serialize np.nan (a missing value) with the string given by the nan_rep parameter, which defaults to nan.
This default can inadvertently turn an actual nan string value into a missing value:
In [487]: dfss = pd.DataFrame({"A": ["foo", "bar", "nan"]})
In [488]: dfss
Out[488]:
A
0 foo
1 bar
2 nan
In [489]: store.append("dfss", dfss)
In [490]: store.select("dfss")
Out[490]:
A
0 foo
1 bar
2 NaN
# here you need to specify a different nan rep
In [491]: store.append("dfss2", dfss, nan_rep="_nan_")
In [492]: store.select("dfss2")
Out[492]:
A
0 foo
1 bar
2 nan