Python Data Processing (20): HDF5 Basics
Preface
HDF (Hierarchical Data Format) is a set of file formats (HDF4, HDF5) designed to store and organize large amounts of data.
HDF5 lets you store large amounts of numerical data while still being able to access it easily and quickly. Thousands of datasets can be stored in a single file, categorized and tagged however you want.
Usage
HDFStore is a dict-like object that reads and writes pandas objects in the high-performance HDF5 format using the PyTables library.
In [345]: store = pd.HDFStore("store.h5")
In [346]: print(store)
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
Objects can be written to the file, just like adding key-value pairs to a dict:
In [347]: index = pd.date_range("1/1/2000", periods=8)
In [348]: s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
In [349]: df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"])
# store.put('s', s) is an equivalent method
In [350]: store["s"] = s
In [351]: store["df"] = df
In [352]: store
Out[352]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
In the current or a later Python session, you can retrieve the stored objects:
# store.get('df') is an equivalent method
In [353]: store["df"]
Out[353]:
A B C
2000-01-01 1.334065 0.521036 0.930384
2000-01-02 -1.613932 1.088104 -0.632963
2000-01-03 -0.585314 -0.275038 -0.937512
2000-01-04 0.632369 -1.249657 0.975593
2000-01-05 1.060617 -0.143682 0.218423
2000-01-06 3.050329 1.317933 -0.963725
2000-01-07 -0.539452 -0.771133 0.023751
2000-01-08 0.649464 -1.736427 0.197288
# dotted (attribute) access provides get as well
In [354]: store.df
Out[354]:
A B C
2000-01-01 1.334065 0.521036 0.930384
2000-01-02 -1.613932 1.088104 -0.632963
2000-01-03 -0.585314 -0.275038 -0.937512
2000-01-04 0.632369 -1.249657 0.975593
2000-01-05 1.060617 -0.143682 0.218423
2000-01-06 3.050329 1.317933 -0.963725
2000-01-07 -0.539452 -0.771133 0.023751
2000-01-08 0.649464 -1.736427 0.197288
Deleting the object corresponding to a key:
# store.remove('df') is an equivalent method
In [355]: del store["df"]
In [356]: store
Out[356]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
Closing a store, and using a context manager:
In [357]: store.close()
In [358]: store
Out[358]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
In [359]: store.is_open
Out[359]: False
# Working with, and automatically closing the store using a context manager
In [360]: with pd.HDFStore("store.h5") as store:
.....: store.keys()
.....:
1 Read/write API
HDFStore supports a top-level API: read_hdf for reading and to_hdf for writing, which work much like read_csv and to_csv.
In [361]: df_tl = pd.DataFrame({"A": list(range(5)), "B": list(range(5))})
In [362]: df_tl.to_hdf("store_tl.h5", "table", append=True)
In [363]: pd.read_hdf("store_tl.h5", "table", where=["index>2"])
Out[363]:
A B
3 3 3
4 4 4
By default, HDFStore does not drop rows that are all missing; this behavior can be changed by setting dropna=True:
In [364]: df_with_missing = pd.DataFrame(
.....: {
.....: "col1": [0, np.nan, 2],
.....: "col2": [1, np.nan, np.nan],
.....: }
.....: )
.....:
In [365]: df_with_missing
Out[365]:
col1 col2
0 0.0 1.0
1 NaN NaN
2 2.0 NaN
In [366]: df_with_missing.to_hdf("file.h5", "df_with_missing", format="table", mode="w")
In [367]: pd.read_hdf("file.h5", "df_with_missing")
Out[367]:
col1 col2
0 0.0 1.0
1 NaN NaN
2 2.0 NaN
In [368]: df_with_missing.to_hdf(
.....: "file.h5", "df_with_missing", format="table", mode="w", dropna=True
.....: )
.....:
In [369]: pd.read_hdf("file.h5", "df_with_missing")
Out[369]:
col1 col2
0 0.0 1.0
2 2.0 NaN
2 The fixed format
fixed is the default format for put and to_hdf. It supports fast reading and writing, but the stored object can neither be appended to nor searched with queries.
It is specified with format='fixed' or format='f'.
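To illustrate these limitations, here is a minimal sketch (the file name store_fixed.h5 is just for illustration): writing with the default fixed format works, but a subsequent append raises an error because only table-format nodes support appending.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 2), columns=["A", "B"])

# fixed is the default format for to_hdf, equivalent to format="fixed"
df.to_hdf("store_fixed.h5", "df", mode="w")

# reading the whole object back works fine
print(pd.read_hdf("store_fixed.h5", "df"))

# but appending to a fixed-format node is not supported
try:
    df.to_hdf("store_fixed.h5", "df", append=True)
except ValueError as err:
    print("append failed:", err)
```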
3 The table format
HDFStore supports another PyTables format on disk, the table format.
Conceptually, a table is shaped very much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions.
In addition, delete- and query-type operations are supported. The format is specified by passing format='table' or format='t' to append, put, or to_hdf.
It can also be enabled globally with pd.set_option('io.hdf.default_format', 'table'), which makes put/append/to_hdf store in the table format by default.
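A short sketch of the global option (the file name default_fmt.h5 is just for illustration): after setting io.hdf.default_format, put stores a table without an explicit format argument.

```python
import numpy as np
import pandas as pd

# make 'table' the default storage format for put/append/to_hdf
pd.set_option("io.hdf.default_format", "table")

df = pd.DataFrame(np.random.randn(3, 2), columns=["A", "B"])

with pd.HDFStore("default_fmt.h5", mode="w") as store:
    store.put("df", df)  # stored as a table even without format='table'
    ptype = store.root.df._v_attrs.pandas_type

print(ptype)  # 'frame_table' rather than the fixed-format 'frame'

# restore the default so later writes are unaffected
pd.set_option("io.hdf.default_format", None)
```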
In [370]: store = pd.HDFStore("store.h5")
In [371]: df1 = df[0:4]
In [372]: df2 = df[4:]
# append data (creates a table automatically)
In [373]: store.append("df", df1)
In [374]: store.append("df", df2)
In [375]: store
Out[375]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
# select the entire object
In [376]: store.select("df")
Out[376]:
A B C
2000-01-01 1.334065 0.521036 0.930384
2000-01-02 -1.613932 1.088104 -0.632963
2000-01-03 -0.585314 -0.275038 -0.937512
2000-01-04 0.632369 -1.249657 0.975593
2000-01-05 1.060617 -0.143682 0.218423
2000-01-06 3.050329 1.317933 -0.963725
2000-01-07 -0.539452 -0.771133 0.023751
2000-01-08 0.649464 -1.736427 0.197288
# the type of stored data
In [377]: store.root.df._v_attrs.pandas_type
Out[377]: 'frame_table'
4 Hierarchical keys
Keys to a store can be specified as strings in a hierarchical, path-name-like format, which generates a hierarchy of sub-stores (Groups in PyTables parlance).
Keys can be specified without the leading '/' and are always absolute (e.g. 'foo' refers to '/foo').
Removal operations remove the specified sub-store and everything below it, so be careful.
In [378]: store.put("foo/bar/bah", df)
In [379]: store.append("food/orange", df)
In [380]: store.append("food/apple", df)
In [381]: store
Out[381]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
# a list of keys is returned
In [382]: store.keys()
Out[382]: ['/df', '/food/apple', '/food/orange', '/foo/bar/bah']
# remove all nodes under this level
In [383]: store.remove("food")
In [384]: store
Out[384]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
You can walk through the group hierarchy using the walk method:
In [385]: for (path, subgroups, subkeys) in store.walk():
.....: for subgroup in subgroups:
.....: print("GROUP: {}/{}".format(path, subgroup))
.....: for subkey in subkeys:
.....: key = "/".join([path, subkey])
.....: print("KEY: {}".format(key))
.....: print(store.get(key))
.....:
GROUP: /foo
KEY: /df
A B C
2000-01-01 1.334065 0.521036 0.930384
2000-01-02 -1.613932 1.088104 -0.632963
2000-01-03 -0.585314 -0.275038 -0.937512
2000-01-04 0.632369 -1.249657 0.975593
2000-01-05 1.060617 -0.143682 0.218423
2000-01-06 3.050329 1.317933 -0.963725
2000-01-07 -0.539452 -0.771133 0.023751
2000-01-08 0.649464 -1.736427 0.197288
GROUP: /foo/bar
KEY: /foo/bar/bah
A B C
2000-01-01 1.334065 0.521036 0.930384
2000-01-02 -1.613932 1.088104 -0.632963
2000-01-03 -0.585314 -0.275038 -0.937512
2000-01-04 0.632369 -1.249657 0.975593
2000-01-05 1.060617 -0.143682 0.218423
2000-01-06 3.050329 1.317933 -0.963725
2000-01-07 -0.539452 -0.771133 0.023751
2000-01-08 0.649464 -1.736427 0.197288
Unlike items stored under the root node, hierarchical keys cannot be retrieved with the dotted (attribute) access described above:
In [8]: store.foo.bar.bah
AttributeError: 'HDFStore' object has no attribute 'foo'
# you can directly access the actual PyTables node by using the root node
In [9]: store.root.foo.bar.bah
Out[9]:
/foo/bar/bah (Group) ''
children := ['block0_items' (Array), 'block0_values' (Array), 'axis0' (Array), 'axis1' (Array)]
They can, however, be accessed with an explicit string key:
In [386]: store["foo/bar/bah"]
Out[386]:
A B C
2000-01-01 1.334065 0.521036 0.930384
2000-01-02 -1.613932 1.088104 -0.632963
2000-01-03 -0.585314 -0.275038 -0.937512
2000-01-04 0.632369 -1.249657 0.975593
2000-01-05 1.060617 -0.143682 0.218423
2000-01-06 3.050329 1.317933 -0.963725
2000-01-07 -0.539452 -0.771133 0.023751
2000-01-08 0.649464 -1.736427 0.197288
5 Storing types
5.1 Storing mixed types
Mixed-dtype data can also be stored. Strings are stored as fixed-width columns, with the width set to the length of the longest string in the data; trying to append longer strings later will raise an exception.
Passing min_itemsize={"values": size} specifies a larger minimum width for the string columns. Storing floats, strings, ints, bools, and datetime64 is currently supported.
For string columns, passing nan_rep='nan' changes the default representation of NaN on disk:
In [387]: df_mixed = pd.DataFrame(
.....: {
.....: "A": np.random.randn(8),
.....: "B": np.random.randn(8),
.....: "C": np.array(np.random.randn(8), dtype="float32"),
.....: "string": "string",
.....: "int": 1,
.....: "bool": True,
.....: "datetime64": pd.Timestamp("20010102"),
.....: },
.....: index=list(range(8)),
.....: )
.....:
In [388]: df_mixed.loc[df_mixed.index[3:5], ["A", "B", "string", "datetime64"]] = np.nan
In [389]: store.append("df_mixed", df_mixed, min_itemsize={"values": 50})
In [390]: df_mixed1 = store.select("df_mixed")
In [391]: df_mixed1
Out[391]:
A B C string int bool datetime64
0 -0.116008 0.743946 -0.398501 string 1 True 2001-01-02
1 0.592375 -0.533097 -0.677311 string 1 True 2001-01-02
2 0.476481 -0.140850 -0.874991 string 1 True 2001-01-02
3 NaN NaN -1.167564 NaN 1 True NaT
4 NaN NaN -0.593353 NaN 1 True NaT
5 0.852727 0.463819 0.146262 string 1 True 2001-01-02
6 -1.177365 0.793644 -0.131959 string 1 True 2001-01-02
7 1.236988 0.221252 0.089012 string 1 True 2001-01-02
In [392]: df_mixed1.dtypes.value_counts()
Out[392]:
float64 2
int64 1
float32 1
bool 1
datetime64[ns] 1
object 1
dtype: int64
# we have provided a minimum string column size
In [393]: store.root.df_mixed.table
Out[393]:
/df_mixed/table (Table(8,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1),
"values_block_1": Float32Col(shape=(1,), dflt=0.0, pos=2),
"values_block_2": Int64Col(shape=(1,), dflt=0, pos=3),
"values_block_3": Int64Col(shape=(1,), dflt=0, pos=4),
"values_block_4": BoolCol(shape=(1,), dflt=False, pos=5),
"values_block_5": StringCol(itemsize=50, shape=(1,), dflt=b'', pos=6)}
byteorder := 'little'
chunkshape := (689,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
6 Storing a MultiIndex DataFrame
In [394]: index = pd.MultiIndex(
.....: levels=[["foo", "bar", "baz", "qux"], ["one", "two", "three"]],
.....: codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
.....: names=["foo", "bar"],
.....: )
.....:
In [395]: df_mi = pd.DataFrame(np.random.randn(10, 3), index=index, columns=["A", "B", "C"])
In [396]: df_mi
Out[396]:
A B C
foo bar
foo one 0.667450 0.169405 -1.358046
two -0.105563 0.492195 0.076693
three 0.213685 -0.285283 -1.210529
bar one -1.408386 0.941577 -0.342447
two 0.222031 0.052607 2.093214
baz two 1.064908 1.778161 -0.913867
three -0.030004 -0.399846 -1.234765
qux one 0.081323 -0.268494 0.168016
two -0.898283 -0.218499 1.408028
three -1.267828 -0.689263 0.520995
In [397]: store.append("df_mi", df_mi)
In [398]: store.select("df_mi")
Out[398]:
A B C
foo bar
foo one 0.667450 0.169405 -1.358046
two -0.105563 0.492195 0.076693
three 0.213685 -0.285283 -1.210529
bar one -1.408386 0.941577 -0.342447
two 0.222031 0.052607 2.093214
baz two 1.064908 1.778161 -0.913867
three -0.030004 -0.399846 -1.234765
qux one 0.081323 -0.268494 0.168016
two -0.898283 -0.218499 1.408028
three -1.267828 -0.689263 0.520995
# the levels are automatically included as data columns
In [399]: store.select("df_mi", "foo=bar")
Out[399]:
A B C
foo bar
bar one -1.408386 0.941577 -0.342447
two 0.222031 0.052607 2.093214
Data types
HDFStore maps object dtypes to underlying PyTables dtypes. The supported types include: floating, integer, boolean, datetime64[ns], timedelta64[ns], categorical, and object (strings).
unicode columns are not supported, and will fail.
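A small sketch of the dtype mapping in practice (the file name types.h5 is just for illustration): a frame covering several of the supported dtypes survives a round trip through a table-format store with its dtypes intact.

```python
import numpy as np
import pandas as pd

df_types = pd.DataFrame(
    {
        "f": np.random.randn(3),                        # float64
        "i": np.arange(3),                              # int64
        "b": [True, False, True],                       # bool
        "dt": pd.date_range("2001-01-02", periods=3),   # datetime64[ns]
        "td": pd.to_timedelta(np.arange(3), unit="D"),  # timedelta64[ns]
    }
)

df_types.to_hdf("types.h5", "df_types", format="table", mode="w")

# the dtypes survive the round trip
print(pd.read_hdf("types.h5", "df_types").dtypes)
```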
Categorical data
Data containing category dtypes can be written to an HDFStore. Queries work the same as if it were an object array.
However, the category-dtyped data is stored in a more efficient manner:
In [473]: dfcat = pd.DataFrame(
.....: {"A": pd.Series(list("aabbcdba")).astype("category"), "B": np.random.randn(8)}
.....: )
.....:
In [474]: dfcat
Out[474]:
A B
0 a 0.477849
1 a 0.283128
2 b -2.045700
3 b -0.338206
4 c -0.423113
5 d 2.314361
6 b -0.033100
7 a -0.965461
In [475]: dfcat.dtypes
Out[475]:
A category
B float64
dtype: object
In [476]: cstore = pd.HDFStore("cats.h5", mode="w")
In [477]: cstore.append("dfcat", dfcat, format="table", data_columns=["A"])
In [478]: result = cstore.select("dfcat", where="A in ['b', 'c']")
In [479]: result
Out[479]:
A B
2 b -2.045700
3 b -0.338206
4 c -0.423113
6 b -0.033100
In [480]: result.dtypes
Out[480]:
A category
B float64
dtype: object
String columns
The underlying implementation of HDFStore uses a fixed column width (itemsize) for string columns. The itemsize is calculated on the first append as the length of the longest string in the column.
On subsequent appends, a string column whose strings exceed the established itemsize will raise an exception.
Pass min_itemsize on the first table creation to specify the minimum width of a particular string column.
min_itemsize can be an integer, or a dict mapping column names to integers, which specifies a minimum width per string column.
Note that passing a min_itemsize dict causes all of the passed columns to be created as data_columns automatically:
In [481]: dfs = pd.DataFrame({"A": "foo", "B": "bar"}, index=list(range(5)))
In [482]: dfs
Out[482]:
A B
0 foo bar
1 foo bar
2 foo bar
3 foo bar
4 foo bar
# A and B have a size of 30
In [483]: store.append("dfs", dfs, min_itemsize=30)
In [484]: store.get_storer("dfs").table
Out[484]:
/dfs/table (Table(5,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": StringCol(itemsize=30, shape=(2,), dflt=b'', pos=1)}
byteorder := 'little'
chunkshape := (963,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
# A is created as a data_column with a size of 30
# B's size is calculated
In [485]: store.append("dfs2", dfs, min_itemsize={"A": 30})
In [486]: store.get_storer("dfs2").table
Out[486]:
/dfs2/table (Table(5,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": StringCol(itemsize=3, shape=(1,), dflt=b'', pos=1),
"A": StringCol(itemsize=30, shape=(), dflt=b'', pos=2)}
byteorder := 'little'
chunkshape := (1598,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"A": Index(6, medium, shuffle, zlib(1)).is_csi=False}
String columns serialize np.nan (a missing value) with the string given by the nan_rep parameter, which defaults to nan.
This default can inadvertently turn an actual nan string value into a missing value:
In [487]: dfss = pd.DataFrame({"A": ["foo", "bar", "nan"]})
In [488]: dfss
Out[488]:
A
0 foo
1 bar
2 nan
In [489]: store.append("dfss", dfss)
In [490]: store.select("dfss")
Out[490]:
A
0 foo
1 bar
2 NaN
# here you need to specify a different nan rep
In [491]: store.append("dfss2", dfss, nan_rep="_nan_")
In [492]: store.select("dfss2")
Out[492]:
A
0 foo
1 bar
2 nan