9. HugeGraph + Cassandra: the underlying storage structure

2019-09-26 NazgulSun

At the storage layer, HugeGraph keeps its data in a number of tables:

cqlsh:hugegraph> describe tables;

graph_vertices system_edges_in graph_search_indexes
vertex_labels system_search_indexes edge_labels
system_range_indexes system_secondary_indexes system_edges_out
graph_edges_out graph_secondary_indexes index_labels
property_keys graph_range_indexes counters
system_vertices graph_edges_in

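The tables fall into two main groups: those prefixed with graph_ and those prefixed with system_. A few tables, such as counters, are shared.
Tables starting with graph_ mostly hold business data, while tables starting with system_ are mostly used by the system itself.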
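
As a small illustration of this naming convention, the sketch below groups the table names from the `describe tables` output by prefix. The function name `group_tables` and the group label "shared" are made up for this example; they are not part of HugeGraph.

```python
# Table names copied from the `describe tables` output above.
TABLES = [
    "graph_vertices", "system_edges_in", "graph_search_indexes",
    "vertex_labels", "system_search_indexes", "edge_labels",
    "system_range_indexes", "system_secondary_indexes", "system_edges_out",
    "graph_edges_out", "graph_secondary_indexes", "index_labels",
    "property_keys", "graph_range_indexes", "counters",
    "system_vertices", "graph_edges_in",
]


def group_tables(names):
    """Split table names into graph_, system_ and unprefixed (shared) groups."""
    groups = {"graph": [], "system": [], "shared": []}
    for name in names:
        if name.startswith("graph_"):
            groups["graph"].append(name)
        elif name.startswith("system_"):
            groups["system"].append(name)
        else:
            groups["shared"].append(name)
    return groups


groups = group_tables(TABLES)
for kind, names in groups.items():
    print(kind, sorted(names))
```

The unprefixed group contains counters together with the schema tables (property_keys, vertex_labels, edge_labels, index_labels) that the following sections walk through.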

The first step in working with graph data is to define the schema: properties, vertices, relationships (edges), and indexes.

Start with the smallest unit, the property.

cqlsh:hugegraph> select * from property_keys limit 1;

 id | cardinality | data_type | name   | properties | status | user_data
----+-------------+-----------+--------+------------+--------+-----------
 23 |           1 |         8 | org_no |       null |      1 |      null

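cardinality indicates whether the property holds a single value or a collection. A data_type of 8 is the string type; the full list of enum values has to be looked up in the code.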

Vertices

Once properties exist, combining them gives a vertex.

cqlsh:hugegraph> select * from vertex_labels where id = 3;

 id | enable_label_index | id_strategy | index_labels | name | nullable_keys | primary_keys | properties | status | user_data
----+--------------------+-------------+--------------+------+---------------+--------------+------------+--------+-----------
  3 |               True |           3 |     {38, 39} | News |       {8, 30} |         null |    {8, 30} |      1 |      null

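A vertex label carries quite a few settings: id_strategy is the ID generation strategy, and there are also the set of property IDs, the IDs of its index labels, the primary keys, the nullable keys, and so on. These values point at rows in other tables through lists of IDs, so they act like foreign keys.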

Properties were covered above; next, look at the index_labels table.

index_labels

cqlsh:hugegraph> select * from index_labels where id in (38,39);

 id | base_type | base_value | fields  | index_type | name                 | status
----+-----------+------------+---------+------------+----------------------+--------
 38 |         1 |          3 | [30, 8] |          1 | News_news_id_user_id |      2
 39 |         1 |          3 |     [8] |          1 |         News_user_id |      2

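This table describes the indexes with IDs 38 and 39: their names, whether they are attached to a vertex label or an edge label, which label that is (base_value 3 is the News vertex label queried above), and which property IDs they cover (fields).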
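
The foreign-key style references between vertex_labels and index_labels can be followed by hand. The sketch below does it in Python over rows copied from the query output above, reduced to the columns it needs. The helper names `indexes_of` and `check_index_fields` are invented for this illustration, and the consistency check is an assumption about how the tables relate, not something HugeGraph is known to enforce in this form.

```python
# Rows copied from the cqlsh output above, reduced to the columns used here.
vertex_labels = {
    3: {"name": "News", "properties": {8, 30}, "index_labels": {38, 39}},
}
index_labels = {
    38: {"name": "News_news_id_user_id", "base_value": 3, "fields": [30, 8]},
    39: {"name": "News_user_id", "base_value": 3, "fields": [8]},
}


def indexes_of(label_id):
    """Follow vertex_labels.index_labels into index_labels, like a join."""
    label = vertex_labels[label_id]
    return [index_labels[i]["name"] for i in sorted(label["index_labels"])]


def check_index_fields(label_id):
    """Check that every indexed field is one of the label's own property ids."""
    label = vertex_labels[label_id]
    return all(
        set(index_labels[i]["fields"]) <= label["properties"]
        for i in label["index_labels"]
    )


print(indexes_of(3))
print(check_index_fields(3))
```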

That completes the definition of a vertex. Next, look at how edges are defined.

edge_labels

cqlsh:hugegraph> select * from edge_labels;

 id | enable_label_index | frequency | index_labels             | name                      | nullable_keys               | properties                  | sort_keys | source_label | status | target_label | user_data
----+--------------------+-----------+--------------------------+---------------------------+-----------------------------+-----------------------------+-----------+--------------+--------+--------------+-----------
  1 |               True |         1 |             {30, 31, 32} |    Company_Invest_Company |                 {8, 28, 29} |                 {8, 28, 29} |      null |            1 |      1 |            1 |      null
  2 |               True |         1 |             {33, 34, 35} |     Person_Invest_Company |                 {8, 28, 29} |                 {8, 28, 29} |      null |            2 |      1 |            1 |      null
  4 |               True |         1 | {40, 41, 43, 44, 45, 46} |      News_Related_Company | {8, 31, 32, 33, 34, 35, 36} | {8, 31, 32, 33, 34, 35, 36} |      null |            3 |      1 |            1 |      null
  3 |               True |         1 |                 {36, 37} | Person_PositionOf_Company |                     {8, 28} |                     {8, 28} |      null |            2 |      1 |            1 |      null

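Compared with a vertex label, an edge label additionally records source_label and target_label, and the main new concept is frequency, which says whether two vertices may be connected by more than one edge with the same label.
For example, suppose there are person and company vertices and an edge label called "association". Several associations may exist between the same person and company, such as holding a position and investing. In that case the frequency type is multiple.
When multiple is used, a property must be designated as the sort key. Its value is what distinguishes one association from another, for instance position versus investment.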
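
One way to picture the effect of frequency and the sort key is as a change in what identifies an edge. The following is a toy model only: the class `EdgeStore`, the key layout, and the vertex IDs are assumptions made for illustration and do not reproduce HugeGraph's actual edge ID format.

```python
class EdgeStore:
    """Toy edge store; the key layout is an assumption for illustration."""

    def __init__(self, frequency):
        if frequency not in ("single", "multiple"):
            raise ValueError("frequency must be 'single' or 'multiple'")
        self.frequency = frequency
        self.edges = {}

    def add(self, source, label, target, props, sort_key=None):
        if self.frequency == "multiple":
            if sort_key is None:
                raise ValueError("a multiple-frequency label needs a sort key")
            # The sort key's value becomes part of the edge identity.
            key = (source, label, props[sort_key], target)
        else:
            # One edge per (source, label, target); a later write overwrites.
            key = (source, label, target)
        self.edges[key] = props
        return key


single = EdgeStore("single")
single.add("person:1", "association", "company:1", {"kind": "position"})
single.add("person:1", "association", "company:1", {"kind": "investment"})

multiple = EdgeStore("multiple")
multiple.add("person:1", "association", "company:1", {"kind": "position"}, sort_key="kind")
multiple.add("person:1", "association", "company:1", {"kind": "investment"}, sort_key="kind")

print(len(single.edges), len(multiple.edges))
```

With single frequency the second write replaces the first, while with multiple frequency both edges survive because the sort value separates them.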

With the schema defined, we can look at how the data itself is stored. The schema definition is like a class; the data are the objects.

Start with graph_vertices:

cqlsh:hugegraph> select * from graph_vertices limit 1;

 id                  | label | properties
---------------------+-------+--------------------------------------------------------------------------
 S353623026048307200 |     2 | {8: '"admin"', 11: '"熊美斌"', 27: '"6a425966030e61cfec8ffb40958ee3d0"'}

It is quite simple: an id, a label, and the properties. properties is stored as a map whose keys are property IDs.
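
In the row above each property value is text that still carries its own double quotes, which suggests the values are JSON-encoded. Under that assumption (it is inferred from the output, not confirmed against the HugeGraph source), a row can be decoded like this; `decode_properties` is a name made up for the example.

```python
import json

# Row copied from the graph_vertices output above.
row = {
    "id": "S353623026048307200",
    "label": 2,
    "properties": {
        8: '"admin"',
        11: '"熊美斌"',
        27: '"6a425966030e61cfec8ffb40958ee3d0"',
    },
}


def decode_properties(raw):
    """Decode each stored value, assuming it is a JSON-encoded scalar."""
    return {key_id: json.loads(text) for key_id, text in raw.items()}


decoded = decode_properties(row["properties"])
for key_id, value in sorted(decoded.items()):
    # key_id would be resolved to a name through the property_keys table.
    print(key_id, repr(value))
```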

Next, the edges:

 owner_vertex        | direction | label | sort_values | other_vertex        | properties
---------------------+-----------+-------+-------------+---------------------+-------------------------------------
 S352108445291385111 |         2 |     3 |             | S353604988217462784 | {8: '"admin"', 28: '"股东,董监高"'}

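Edges are split into edges_in and edges_out so that traversals can run in both the forward and the reverse direction. The reason for this design is explained in the chapter on graph partitioning.
An edge row stores the IDs of its start and end vertices together with its properties, so its structure is very similar to that of a vertex row.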
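
The idea of keeping every edge twice, once per direction, can be sketched with two in-memory maps. This is a simplified model under stated assumptions: the two dictionaries stand in for the edges_out and edges_in tables, the direction and sort_values columns are left out, and the vertex IDs are invented.

```python
from collections import defaultdict

# Each map goes from owner_vertex to a list of (label, other_vertex, props).
edges_out = defaultdict(list)
edges_in = defaultdict(list)


def add_edge(source, label, target, props):
    """Write the edge once per direction so both lookups stay local."""
    edges_out[source].append((label, target, props))
    edges_in[target].append((label, source, props))


def out_neighbors(vertex):
    """Vertices reachable by following edges forward from `vertex`."""
    return [other for _, other, _ in edges_out.get(vertex, [])]


def in_neighbors(vertex):
    """Vertices that have an edge pointing at `vertex`."""
    return [other for _, other, _ in edges_in.get(vertex, [])]


add_edge("person:1", 3, "company:1", {"role": "shareholder"})
add_edge("person:2", 3, "company:1", {"role": "director"})

print(out_neighbors("person:1"))
print(in_neighbors("company:1"))
```

The cost is that each edge is written twice; the benefit is that both "who does this person point to" and "who points at this company" become lookups keyed by a single owner vertex.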

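With vertices and edges stored, how do we locate a target at query time? By the values of its properties?
With Cassandra as the storage backend, first look at how graph_vertices is defined: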

CREATE TABLE hugegraph.graph_vertices (
id text PRIMARY KEY,
label int,
properties map<int, text>
) WITH bloom_filter_fp_chance = 0.01

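There is a single primary key column, and it serves as the partition key. In general, if the primary key is composite, say (a, b, c),
the first column is the partition key and b and c are called clustering keys. A query can only match a prefix of the key, for example where a=1 and b=2, but not c=1 and b=2.
Vertex queries, however, usually filter on arbitrary combinations of properties. To speed up such queries and to get around
this limitation of the Cassandra storage engine, HugeGraph implements its own indexes.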
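
The prefix rule can be written down as a small check. This is a deliberately simplified model of the restriction described above: it only considers equality conditions and ignores Cassandra features such as ALLOW FILTERING, native secondary indexes, and range conditions on the last clustering column. The function name `is_key_prefix` is invented for the example.

```python
def is_key_prefix(primary_key, restricted):
    """Return True if the restricted columns form a leading prefix of the key.

    primary_key: ordered list of key columns, partition key first.
    restricted:  columns that appear with equality in the WHERE clause.
    """
    restricted = set(restricted)
    if not restricted:
        return False
    prefix = primary_key[:len(restricted)]
    return set(prefix) == restricted


key = ["a", "b", "c"]
print(is_key_prefix(key, ["a", "b"]))  # where a=1 and b=2
print(is_key_prefix(key, ["c", "b"]))  # where c=1 and b=2
```

For graph_vertices the key is just id, so under this model only lookups by vertex ID qualify, which is why property-based queries need the extra index tables.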

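How is this done? Take a region vertex D with the properties name="China" and city="Shanghai". We can build an index on name and city, and the system then inserts into a table called graph_secondary_indexes the inverted mapping (name, city) -> vertex ID. At query time it is enough to look up this inverted index table.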

cqlsh:hugegraph> select * from graph_secondary_indexes where field_values='7778529' limit 1;

 field_values | index_label_id | element_ids
--------------+----------------+---------------------
      7778529 |             38 | S365024405421690880
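
The inverted mapping described above can be sketched as a dictionary keyed by the indexed values. The class `SecondaryIndex`, the separator used to join multiple field values, and the index label ID 100 are all hypothetical; the real encoding of field_values in HugeGraph is not shown in this article.

```python
from collections import defaultdict


class SecondaryIndex:
    """Toy inverted index: (index_label_id, field_values) -> element ids."""

    def __init__(self):
        self.rows = defaultdict(set)

    @staticmethod
    def join_values(values):
        # Hypothetical encoding: join the indexed values with a separator.
        return "|".join(str(v) for v in values)

    def put(self, index_label_id, values, element_id):
        """Register an element under the values of its indexed fields."""
        self.rows[(index_label_id, self.join_values(values))].add(element_id)

    def remove(self, index_label_id, values, element_id):
        """Drop an entry, e.g. before re-indexing a changed property."""
        self.rows[(index_label_id, self.join_values(values))].discard(element_id)

    def lookup(self, index_label_id, values):
        """Return the ids of all elements with exactly these field values."""
        key = (index_label_id, self.join_values(values))
        return set(self.rows.get(key, ()))


index = SecondaryIndex()
index.put(100, ("China", "Shanghai"), "D")
print(index.lookup(100, ("China", "Shanghai")))
print(index.lookup(100, ("China", "Beijing")))
```

The `remove` method hints at the maintenance cost: whenever an indexed property changes, the old entry has to be deleted and a new one written.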

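Besides secondary indexes, there are also range indexes and search indexes.
A search index supports full-text search over property values, providing Lucene-like functionality.
A range index supports range queries. All of these are implemented by HugeGraph at the system level.
This makes the system more complex, because index creation, updates, and synchronization all have to be managed, but it pays off in query performance.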
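
To show what a range index has to provide, here is a minimal in-memory sketch that keeps (value, element id) pairs sorted and answers range queries with binary search. It illustrates the idea only; the class `RangeIndex` and its methods are made up, and HugeGraph's range index tables on Cassandra are not implemented this way in code.

```python
import bisect


class RangeIndex:
    """Toy range index over one numeric property."""

    def __init__(self):
        self.entries = []  # kept sorted as (value, element_id) pairs

    def put(self, value, element_id):
        """Insert while keeping the entries ordered by value."""
        bisect.insort(self.entries, (value, element_id))

    def between(self, low, high):
        """Return element ids whose value v satisfies low <= v < high."""
        # A 1-tuple (x,) sorts before every (x, element_id) pair.
        start = bisect.bisect_left(self.entries, (low,))
        end = bisect.bisect_left(self.entries, (high,))
        return [element_id for _, element_id in self.entries[start:end]]


ages = RangeIndex()
ages.put(30, "v3")
ages.put(10, "v1")
ages.put(20, "v2")
print(ages.between(10, 30))
```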

So far we have gone through essentially all of the graph_-prefixed tables.

Next come the system_-prefixed tables.

The schema for the vertices and edges kept in system_vertices, system_edges_in, and system_edges_out also lives in vertex_labels and edge_labels. Only the data they store is different.

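Take system_vertices as an example: it stores the execution status of the system's tasks.
A quick aside on HugeGraph tasks: the system runs many asynchronous tasks internally. For example, adding or modifying vertices can trigger tasks such as rebuild index.
These tasks all run asynchronously, and their state is kept in task vertices. When the system restarts or recovers from an abnormal stop, it reads the task states to decide whether a restore is needed.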

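The schema of this task vertex is initialized when the system starts. Initializing HugeGraph as a whole basically amounts to inserting the task vertex label and creating its properties and the indexes needed to query it.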

cqlsh:hugegraph> select * from system_vertices limit 1;

 id  | label | properties
-----+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 L35 |   -13 | {-9: '0', -8: '1564109684704', -7: '1564109684694', -6: '0', -5: '5', -3: '"com.baidu.hugegraph.job.schema.RebuildIndexCallable"', -2: '"INDEX_LABEL:35:Person_Invest_Company_user_id"', -1: '"rebuild_index"'}

The label is -13; look it up in vertex_labels:

cqlsh:hugegraph> select * from vertex_labels where id = -13;

 id  | enable_label_index | id_strategy | index_labels | name  | nullable_keys           | primary_keys | properties                                          | status | user_data
-----+--------------------+-------------+--------------+-------+-------------------------+--------------+-----------------------------------------------------+--------+-----------
 -13 |               True |           4 |        {-14} | ~task | {-12, -11, -10, -8, -4} |         null | {-12, -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1} |      1 |      null

Its indexes are stored in system_range_indexes, system_search_indexes, and system_secondary_indexes.

Finally, the counters table:

cqlsh:hugegraph> select * from counters;

 schema_type  | id
--------------+----
 PROPERTY_KEY | 36
 VERTEX_LABEL |  3
   SYS_SCHEMA | 32
  INDEX_LABEL | 53
   EDGE_LABEL |  4
         TASK | 59

It stores the global ID counters. For the property_keys table, for example, the next id will be 37.
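
The role of the counters table can be sketched as a per-type counter that is incremented whenever a new schema element needs an ID. This is a single-process illustration: the function `next_id` is invented, and how HugeGraph actually increments the counter on Cassandra (step size, concurrency control) is not covered by this article.

```python
# Values copied from the counters output above.
COUNTERS = {
    "PROPERTY_KEY": 36,
    "VERTEX_LABEL": 3,
    "SYS_SCHEMA": 32,
    "INDEX_LABEL": 53,
    "EDGE_LABEL": 4,
    "TASK": 59,
}


def next_id(counters, schema_type):
    """Advance the counter for one schema type and return the new id."""
    counters[schema_type] += 1
    return counters[schema_type]


demo = dict(COUNTERS)  # work on a copy so the sample data stays unchanged
print(next_id(demo, "PROPERTY_KEY"))
```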

That covers HugeGraph's storage structure. Other storage backends are much the same; MySQL, for example, uses essentially the same set of tables.