PyTables usage notes

2021-01-28, by 井底蛙蛙呱呱呱

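PyTables is a tool for fast storage and retrieval of large amounts of data. Like h5py, it stores data in the HDF5 format, but it offers more features on top of that. Those extra features also make its official documentation long and hard to navigate, so these notes record the basics of using PyTables.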

PyTables can store a wide range of data: strings, numbers, arrays, character arrays, variable-length arrays and more.

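For storing a simple string, number or array, the code samples in the official documentation are enough. The examples in these notes use a single record read from a tab-separated file: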

import numpy as np
import os
import tables
import tables as tb
from tqdm import tqdm
import csv
import json

# In a Jupyter notebook, display the value of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# fpath is the path to the tab-separated input file (set it before running)
with open(fpath, 'r') as f:
    fcsv = csv.DictReader(f, delimiter='\t')
    for row in tqdm(fcsv):
        label = int(float(row['label']))
        id_ = row['id']
        camp = row['mz_camp']
        arr_features = json.loads(row['features'])
        break

label
id_
camp
arr_features[:10]
# Sample output:
1
'f6379cff0f6a4e3c4b6b9a514d1e2df7'
'28136560~28203900~28207488~28229731~28241431~28256316~28280756~28305638~28320029~28327595~28330390~28335941~28350527~28355142~28357456~28358600~28363423~28367817~28377060~28382884~28383105~28389436~28390580~28408871~28441839~28442775~28462158~28467384~28467592~28468411~28473351~28476406~28499819~28505188~28510882~28523531~28523544~28524454~28535608~28541445~28564026~28569928~28570786~28570955~28590884~28623358~28623579~28625464~28633199~28640154~28651373~28670262~78188107~78195101'
[1, 76, 2475, 2651, 2651, 2651, 2651, 2651, 2651, 2651]
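The loading code above implies a particular input layout. A minimal sketch of that layout, assuming a header row with the columns id, label, mz_camp and features; the in-memory file and its shortened values are made up for illustration and stand in for the real file behind fpath:

```python
import csv
import io
import json

# Hypothetical two-line TSV standing in for the real input file.
sample_tsv = (
    "id\tlabel\tmz_camp\tfeatures\n"
    "f6379cff0f6a4e3c4b6b9a514d1e2df7\t1.0\t28136560~28203900\t[1, 76, 2475]\n"
)

reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
row = next(reader)  # first data row, like the loop that breaks after one row

label = int(float(row["label"]))         # "1.0" -> 1
camp_tokens = row["mz_camp"].split("~")  # "~"-separated ids -> list of str
features = json.loads(row["features"])   # JSON list -> list of int
```

The label goes through float first because it is written as "1.0" in this assumed layout, which int() cannot parse directly.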

Storing the data in HDF5 format with PyTables starts with describing the table layout:

# class Particle(tb.IsDescription):
#     uid = tb.StringCol(itemsize=32, pos=1)
#     label = tb.Int8Col(pos=2)
#     array_features = tb.Int32Col(shape=(len(arr_features),), pos=3)

arr_length = len(arr_features)
Particle2 = {
    "uid": tb.Col.from_kind("string", itemsize=32, pos=1),            
    "label": tb.Col.from_type("int8", pos=2),            
    "array_features": tb.Col.from_type("int32", shape=(len(arr_features),), pos=3),            
}

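The column types of the data to be stored are defined above in two ways: as an IsDescription subclass (commented out) and as a dictionary of Col instances. In these definitions:

itemsize is the maximum length, in bytes, of a string value; anything longer is truncated;
pos sets the order of the columns, which matters when appending rows in bulk as tuples;
shape gives the shape of the value stored in each cell of the column.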
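The effect of the three parameters can be checked on a tiny table. This is a minimal sketch, not the article's own code: the file name demo.h5, the node name demo and the deliberately small itemsize=4 are assumptions chosen to make the truncation visible.

```python
import os
import tempfile

import tables as tb

# Hypothetical layout: a 4-byte string, an int8 and a length-3 int32 vector.
description = {
    "uid": tb.StringCol(itemsize=4, pos=1),
    "label": tb.Int8Col(pos=2),
    "array_features": tb.Int32Col(shape=(3,), pos=3),
}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.h5")
    with tb.open_file(path, mode="w") as h5:
        table = h5.create_table(h5.root, "demo", description)
        # Tuples follow the column order given by pos: uid, label, array_features.
        table.append([("abcdefgh", 1, [1, 2, 3])])
        table.flush()

        colnames = table.colnames                           # ordered by pos
        stored_uid = table.cols.uid[0]                      # cut to 4 bytes
        stored_shape = table.cols.array_features[0].shape   # shape of one cell
```

Without pos, columns defined through a dictionary are ordered alphabetically by name, so the tuples passed to append() would have to follow that order instead.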

For array data, PyTables provides several classes:

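tables.Array, created with create_array(), is the most basic way to store an array. It is not compressed and its shape cannot be changed after creation, for example by adding columns. One interesting property is that data is read back in the form it was stored in: a list is read back as a list, and a NumPy array as a NumPy array;
tables.CArray, created with create_carray(), differs from Array mainly in that it supports compression, configured through the filters argument, e.g. filters=tables.Filters(complevel=5, complib='zlib');
tables.EArray, created with create_earray(), differs from CArray in that it is extendable: one dimension is declared with length 0 at creation, e.g. shape=(0, 3), and rows can later be appended along it (the current version supports only one extendable dimension);
tables.VLArray, created with create_vlarray(), is very powerful: each row can hold an array of a different length, e.g. [1,2], [1,2,3], [1,2,3,4].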
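The four classes can be compared side by side. The following is a minimal sketch under assumed names (arrays.h5 and the node names plain, compressed, extendable and ragged are made up for illustration):

```python
import os
import tempfile

import numpy as np
import tables as tb

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "arrays.h5")
    with tb.open_file(path, mode="w") as h5:
        data = np.arange(6, dtype=np.int32).reshape(2, 3)

        # Array: fixed shape, no compression.
        h5.create_array(h5.root, "plain", data)

        # CArray: fixed shape, chunked, compression set through filters.
        filters = tb.Filters(complevel=5, complib="zlib")
        carr = h5.create_carray(h5.root, "compressed", atom=tb.Int32Atom(),
                                shape=(2, 3), filters=filters)
        carr[:] = data

        # EArray: the 0 in shape marks the dimension that can grow.
        earr = h5.create_earray(h5.root, "extendable", atom=tb.Int32Atom(),
                                shape=(0, 3))
        earr.append([[1, 2, 3]])
        earr.append([[4, 5, 6], [7, 8, 9]])

        # VLArray: every append() adds one row of arbitrary length.
        vlarr = h5.create_vlarray(h5.root, "ragged", atom=tb.Int32Atom())
        vlarr.append([1, 2])
        vlarr.append([1, 2, 3])
        vlarr.append([1, 2, 3, 4])

        earray_shape = earr.shape
        row_lengths = [len(row) for row in vlarr]
```

EArray.append() takes a block whose non-extendable dimensions match the declared shape, so a single row is passed as a 1x3 nested list.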

For more on these array classes, see the official documentation: Homogenous storage classes.

Below, the camp variable is split into a list of strings and stored in a VLArray:

vlarray = fileh.create_vlarray(root, 'mz_camp', atom=tables.ObjectAtom(),
                               title="text data.")

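Setting atom to tables.ObjectAtom() pickles each appended object, so the list is read back with its elements still of type str. atom=tables.StringAtom(itemsize=32) also works, but then the values are read back as bytes, e.g. b"hello", and need var.decode() to become str.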
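The difference between the two atoms can be seen by storing the same list both ways. A minimal sketch, with atoms.h5 and the node names as_objects and as_bytes assumed for illustration:

```python
import os
import tempfile

import tables as tb

tokens = "28136560~28203900~28207488".split("~")

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "atoms.h5")
    with tb.open_file(path, mode="w") as h5:
        as_objects = h5.create_vlarray(h5.root, "as_objects",
                                       atom=tb.ObjectAtom())
        as_bytes = h5.create_vlarray(h5.root, "as_bytes",
                                     atom=tb.StringAtom(itemsize=32))
        as_objects.append(tokens)
        as_bytes.append(tokens)

        # ObjectAtom: the pickled list comes back as a list of str.
        from_objects = as_objects[0]
        # StringAtom: a NumPy array of bytes comes back; decode each item.
        raw_first = as_bytes[0][0]
        from_bytes = [item.decode() for item in as_bytes[0]]
```

The trade-off is that ObjectAtom data is a Python pickle, which is convenient from Python but opaque to other HDF5 tools, while StringAtom data is a plain fixed-width string array.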

Together, the table and the arrays above cover nearly every storage need. The complete code is as follows:

import numpy as np
import os
import tables
import tables as tb
from tqdm import tqdm
import csv
import json

# In a Jupyter notebook, display the value of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# fpath is the path to the tab-separated input file (set it before running)
with open(fpath, 'r') as f:
    fcsv = csv.DictReader(f, delimiter='\t')
    for row in tqdm(fcsv):
        label = int(float(row['label']))
        id_ = row['id']
        camp = row['mz_camp']
        arr_features = json.loads(row['features'])
        break

label
id_
camp
arr_features[:10]
# Sample output:
1
'f6379cff0f6a4e3c4b6b9a514d1e2df7'
'28136560~28203900~28207488~28229731~28241431~28256316~28280756~28305638~28320029~28327595~28330390~28335941~28350527~28355142~28357456~28358600~28363423~28367817~28377060~28382884~28383105~28389436~28390580~28408871~28441839~28442775~28462158~28467384~28467592~28468411~28473351~28476406~28499819~28505188~28510882~28523531~28523544~28524454~28535608~28541445~28564026~28569928~28570786~28570955~28590884~28623358~28623579~28625464~28633199~28640154~28651373~28670262~78188107~78195101'
[1, 76, 2475, 2651, 2651, 2651, 2651, 2651, 2651, 2651]


# Store the data in HDF5 format
# (1) Open the file; to add data to an existing file, set mode to 'a'
fileh = tables.open_file("vlarray2.h5", mode="w")

# (2) Define the table layout and create the table node
# class Particle(tb.IsDescription):
#     uid = tb.StringCol(itemsize=32, pos=1)
#     label = tb.Int8Col(pos=2)
#     array_features = tb.Int32Col(shape=(len(arr_features),), pos=3)

arr_length = len(arr_features)
Particle2 = {
    "uid": tb.Col.from_kind("string", itemsize=32, pos=1),            
    "label": tb.Col.from_type("int8", pos=2),            
    "array_features": tb.Col.from_type("int32", shape=(arr_length, ), pos=3),            
}
# Get the root group
root = fileh.root
table = fileh.create_table(root, 'table', Particle2, "here id, label and array features.")

# (3) Create the node that stores the lists of strings
vlarray = fileh.create_vlarray(root, 'mz_camp',  atom=tables.ObjectAtom(),
                               title="text data.")

# (4) Load the data
# Append table rows in bulk
table.append([(id_, 1, arr_features),
              (id_, 2, arr_features),
              (id_, 3, arr_features),
              (id_, 4, arr_features),
              (id_, 5, arr_features),
              (id_, 6, arr_features)])
table.flush()

# Alternative: append table rows one at a time
# # Create a shortcut to the table record object
# particle = table.row
# particle['uid'] = id_
# particle['label'] = label
# particle['array_features'] = arr_features
# particle.append()  # queue the row; call table.flush() when done

# Append to the vlarray; rows have to be appended one at a time, since everything passed to a single append() is stored as one row
vlarray.append(['1']+camp.split('~'))
vlarray.append(['2']+camp.split('~'))
vlarray.append(['3']+camp.split('~'))
vlarray.append(['4']+camp.split('~'))
vlarray.append(['5']+camp.split('~'))
vlarray.append(['6']+camp.split('~'))
vlarray.flush()

# Close the file
fileh.close()
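The row-by-row way of filling the table, which is only commented out in the listing above, can be sketched as follows. The file name rows.h5 and the shortened three-element feature vector are assumptions made for illustration:

```python
import os
import tempfile

import tables as tb

# Same layout as Particle2, with a shortened feature vector.
description = {
    "uid": tb.StringCol(itemsize=32, pos=1),
    "label": tb.Int8Col(pos=2),
    "array_features": tb.Int32Col(shape=(3,), pos=3),
}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "rows.h5")
    with tb.open_file(path, mode="w") as h5:
        table = h5.create_table(h5.root, "table", description)
        particle = table.row  # reusable record buffer
        for label in range(1, 4):
            particle["uid"] = "f6379cff0f6a4e3c4b6b9a514d1e2df7"
            particle["label"] = label
            particle["array_features"] = [1, 76, 2475]
            particle.append()  # without this call the row is never queued
        table.flush()  # write the buffered rows to disk

        nrows = table.nrows
        labels = table.cols.label[:].tolist()
```

Opening the file in a with block closes it even if an exception is raised, which replaces the explicit fileh.close() call.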

The file can be read back with the following code:

fileh = tables.open_file("vlarray2.h5", mode="r")
root = fileh.root

# Read values from the table directly by index
root.table.cols.uid[1]
root.table.cols.label[1]
root.table.cols.array_features[1]
root.mz_camp[1][:5]
# Output
b'f6379cff0f6a4e3c4b6b9a514d1e2df7'
2
array([   1,   76, 2475, 2651, 2651, 2651, 2651, 2651, 2651, 2651, 2554,
       2651, 2651, 2651, 2651, 2651, 2651, 2651, 2651, 2651, 2651, 2651,
       2651, 2651, 2651, 2577, 2651, 2651, 2651, 2651, 2651, 2651, 2651,
       2651, 2651, 2651, 2651, 2651, 2651, 2651, 2651, 2651, 2651, 2651,
       2651, 2651, 2651, 2651, 2651, 2651, 2651, 2651, 2651, 2651, 2651],
      dtype=int32)
['2', '28136560', '28203900', '28207488', '28229731']

# The data can also be read in a for loop
for id_, lab, arr, camp_ in zip(root.table.cols.uid, root.table.cols.label, root.table.cols.array_features, root.mz_camp):
    print(id_)
    print(lab)
    print(arr[:5])
    print(camp_[:5])
    
fileh.close()
# Output
b'f6379cff0f6a4e3c4b6b9a514d1e2df7'
1
[   1   76 2475 2651 2651]
['1', '28136560', '28203900', '28207488', '28229731']
b'f6379cff0f6a4e3c4b6b9a514d1e2df7'
2
[   1   76 2475 2651 2651]
['2', '28136560', '28203900', '28207488', '28229731']
b'f6379cff0f6a4e3c4b6b9a514d1e2df7'
3
[   1   76 2475 2651 2651]
['3', '28136560', '28203900', '28207488', '28229731']
b'f6379cff0f6a4e3c4b6b9a514d1e2df7'
4
[   1   76 2475 2651 2651]
['4', '28136560', '28203900', '28207488', '28229731']
b'f6379cff0f6a4e3c4b6b9a514d1e2df7'
5
[   1   76 2475 2651 2651]
['5', '28136560', '28203900', '28207488', '28229731']
b'f6379cff0f6a4e3c4b6b9a514d1e2df7'
6
[   1   76 2475 2651 2651]
['6', '28136560', '28203900', '28207488', '28229731']
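As the output shows, uid values come back as bytes. A minimal sketch of reading whole rows with iterrows() and converting each field to a plain Python value; the file name read.h5 and the two short records are made up for illustration:

```python
import os
import tempfile

import tables as tb

description = {
    "uid": tb.StringCol(itemsize=32, pos=1),
    "label": tb.Int8Col(pos=2),
    "array_features": tb.Int32Col(shape=(3,), pos=3),
}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "read.h5")

    # Write two hypothetical records first so there is something to read.
    with tb.open_file(path, mode="w") as h5:
        table = h5.create_table(h5.root, "table", description)
        table.append([("id-1", 1, [1, 2, 3]), ("id-2", 2, [4, 5, 6])])
        table.flush()

    with tb.open_file(path, mode="r") as h5:
        records = [
            (
                row["uid"].decode(),             # bytes -> str
                int(row["label"]),               # NumPy int8 -> int
                row["array_features"].tolist(),  # copy out of the row buffer
            )
            for row in h5.root.table.iterrows()
        ]
```

The values are converted inside the loop because the row object is reused between iterations, so references to it should not be kept after the loop moves on.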

For more usage, see the official documentation:
PyTables Tutorials.
Official examples.
