
Using Hudi: Metadata Index

2024-07-22  AlienPaul

This post covers the introduction, configuration and usage of Hudi's metadata index. It reorganizes the relevant parts of the official Hudi documentation into one place, making them easier to read and search.

The metadata index is built on top of the metadata table and provides the ability to quickly locate where data physically lives without reading the underlying data.
The metadata table is designed to be serverless and is not tied to any specific compute engine.
The metadata table is an MOR-type Hudi table that stores index information and lives under the .hoodie directory. Its data files use the HFile format, which significantly speeds up key-based lookups.
Compared with traditional indexes, the metadata index delivers a huge performance boost, for both writes and reads.

Advantages:

- Serverless by design, not bound to any specific compute engine.
- Fast key-based lookups (HFile format) without reading the underlying data files.
- Improves both write and read performance compared with traditional indexes.

Supported index types (each stored as a partition of the metadata table):

- files (file listings)
- column_stats (column ranges of data files)
- bloom_filter (bloom filters of data files)
- record_index (record level index)

Enabling the Hudi metadata table and multi-modal index on the write side

Configuration options available for Spark:

hoodie.metadata.enable
  Default: true (Optional). Enabled on the write side.
  Description: Enable the internal metadata table which serves table metadata like level file listings. For 0.10.1 and prior releases, metadata table is disabled by default and needs to be explicitly enabled.
  Config Param: ENABLE
  Since Version: 0.7.0

hoodie.metadata.index.bloom.filter.enable
  Default: false (Optional)
  Description: Enable indexing bloom filters of user data files under metadata table. When enabled, metadata table will have a partition to store the bloom filter index and will be used during the index lookups.
  Config Param: ENABLE_METADATA_INDEX_BLOOM_FILTER
  Since Version: 0.11.0

hoodie.metadata.index.column.stats.enable
  Default: false (Optional)
  Description: Enable indexing column ranges of user data files under metadata table key lookups. When enabled, metadata table will have a partition to store the column ranges and will be used for pruning files during the index lookups.
  Config Param: ENABLE_METADATA_INDEX_COLUMN_STATS
  Since Version: 0.11.0

hoodie.metadata.record.index.enable
  Default: false (Optional)
  Description: Create the HUDI Record Index within the Metadata Table.
  Config Param: RECORD_INDEX_ENABLE_PROP
  Since Version: 0.14.0
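As a sketch, turning all four write-side options on at once amounts to the following write properties (names taken from the table above; pass them however your job supplies Hudi write configs):

```properties
# enable the metadata table itself
hoodie.metadata.enable=true
# build the bloom_filter, column_stats and record_index partitions
hoodie.metadata.index.bloom.filter.enable=true
hoodie.metadata.index.column.stats.enable=true
hoodie.metadata.record.index.enable=true
```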

Configuration options available for Flink:

metadata.enabled
  Default: false (Optional)
  Description: Enable the internal metadata table which serves table metadata like level file listings, default disabled.
  Config Param: METADATA_ENABLED

hoodie.metadata.index.column.stats.enable
  Default: false (Optional)
  Description: Enable indexing column ranges of user data files under metadata table key lookups. When enabled, metadata table will have a partition to store the column ranges and will be used for pruning files during the index lookups.
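A minimal Flink SQL sketch that sets both options at table definition time. Only the two index flags come from the table above; the schema, table name and path are hypothetical placeholders:

```sql
CREATE TABLE hudi_tbl (
  id   INT,
  name STRING,
  ts   TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_tbl',                       -- hypothetical path
  'metadata.enabled' = 'true',                           -- enable the metadata table
  'hoodie.metadata.index.column.stats.enable' = 'true'   -- build the column_stats index
);
```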

Judging from the configuration options listed on the official site, Flink currently only supports column stats and data skipping.

Using the file index

File-level indexing only requires enabling the metadata table.
The configuration options to enable the metadata table in each engine are:

Spark DataSource / Spark SQL / Structured Streaming:
  Config: hoodie.metadata.enable
  When set to true, enables use of the Spark file index implementation for Hudi, which speeds up listing of large tables.

Presto:
  Config: hudi.metadata-table-enabled
  When set to true, fetches the list of file names and sizes from Hudi's metadata table rather than from storage.

Trino:
  Config: hudi.metadata-enabled
  When set to true, fetches the list of file names and sizes from the metadata table rather than from storage.

Athena:
  Config: hudi.metadata-listing-enabled
  When this table property is set to TRUE, enables the Hudi metadata table and the related file listing functionality.

Flink DataStream / Flink SQL:
  Config: metadata.enabled
  When set to true in the DDL, uses the internal metadata table to serve table metadata like level file listings.
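For example, in a Spark SQL session the reader-side flag can simply be set before querying. A sketch with a hypothetical table name:

```sql
-- use the metadata table for file listings instead of scanning storage
SET hoodie.metadata.enable = true;
SELECT * FROM hudi_tbl LIMIT 10;  -- hudi_tbl is a placeholder
```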

Using the column_stats index and data skipping

Using data skipping from Spark or Flink requires that the Hudi table has the metadata table enabled and the column_stats index built.
To enable data skipping at read time, Spark and Flink respectively need the following settings:

Spark DataSource / Spark SQL / Structured Streaming:
  hoodie.metadata.enable: When set to true, enables use of the Spark file index implementation for Hudi, which speeds up listing of large tables.
  hoodie.enable.data.skipping: When set to true, enables data skipping, allowing queries to leverage indexes to reduce the search space by skipping over files.
    Config Param: ENABLE_DATA_SKIPPING
    Since Version: 0.10.0

Flink DataStream / Flink SQL:
  metadata.enabled: When set to true in the DDL, uses the internal metadata table to serve table metadata like level file listings.
  read.data.skipping.enabled: When set to true, enables data skipping, allowing queries to leverage indexes to reduce the search space by skipping over files.
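Putting the two Spark flags together, a data-skipping query session might look like the following sketch. The table name, column and predicate are hypothetical, and the column_stats index must already have been built on the queried column:

```sql
SET hoodie.metadata.enable = true;
SET hoodie.enable.data.skipping = true;
-- with column_stats available for price, files whose [min, max] range
-- cannot satisfy the predicate are pruned before scanning
SELECT * FROM hudi_tbl WHERE price > 100;
```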

Record index vs. traditional indexes

The record level index has significant advantages over traditional indexes.

                                 Record Level Index   Global Simple Index   Global Bloom Index   Bucket Index
Performant look-up in general    Yes                  No                    No
Boost both writes and reads      Yes                  No, write-only        No, write-only
Easy to enable                   Yes                  Yes                   Yes
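To actually use the record index for lookups, the write-side flag from the earlier config table is typically paired with the record-index index type. A hedged Spark SQL sketch; hoodie.index.type=RECORD_INDEX is an assumption based on Hudi 0.14.0 and is not stated in the original text:

```sql
SET hoodie.metadata.enable = true;
SET hoodie.metadata.record.index.enable = true;  -- build the record_index partition
SET hoodie.index.type = RECORD_INDEX;            -- assumed: route index lookups to it
```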

Online asynchronous metadata indexing with Spark

An example of write-side Hudi parameters for Spark:

# ensure that both metadata and async indexing is enabled as below two configs  
hoodie.metadata.enable=true  
hoodie.metadata.index.async=true  
# enable column_stats index config  
hoodie.metadata.index.column.stats.enable=true  
# set concurrency mode and lock configs as this is a multi-writer scenario  
# check https://hudi.apache.org/docs/concurrency_control/ for different lock provider configs  
hoodie.write.concurrency.mode=optimistic_concurrency_control  
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider  
hoodie.write.lock.zookeeper.url=<zk_url>  
hoodie.write.lock.zookeeper.port=<zk_port>  
hoodie.write.lock.zookeeper.lock_key=<zk_key>  
hoodie.write.lock.zookeeper.base_path=<zk_base_path>

Offline metadata indexing

Schedule

We can use HoodieIndexer's schedule mode to schedule an index build. For example:

spark-submit \
--class org.apache.hudi.utilities.HoodieIndexer \
/Users/home/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.13.0.jar \
--props /Users/home/path/to/indexer.properties \
--mode schedule \
--base-path /tmp/hudi-ny-taxi \
--table-name ny_hudi_tbl \
--index-types COLUMN_STATS \
--parallelism 1 \
--spark-memory 1g
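The contents of the --props file are not shown in the original; a hypothetical sketch, consistent with the async-indexing properties listed earlier in this post:

```properties
# indexer.properties (hypothetical contents)
hoodie.metadata.enable=true
hoodie.metadata.index.async=true
hoodie.metadata.index.column.stats.enable=true
# lock configs, since the indexer runs alongside the writer
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider
```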

This operation writes an indexing.requested instant to the timeline.

Execute

Use HoodieIndexer's execute mode to run the index build scheduled in the previous step.

spark-submit \
--class org.apache.hudi.utilities.HoodieIndexer \
/Users/home/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.13.0.jar \
--props /Users/home/path/to/indexer.properties \
--mode execute \
--base-path /tmp/hudi-ny-taxi \
--table-name ny_hudi_tbl \
--index-types COLUMN_STATS \
--parallelism 1 \
--spark-memory 1g

We can also use the scheduleAndExecute mode to handle scheduling and execution in one shot. Keeping scheduling and execution separate, however, offers more flexibility.

Drop

To drop an index, use HoodieIndexer's dropindex mode.

spark-submit \
--class org.apache.hudi.utilities.HoodieIndexer \
/Users/home/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.13.0.jar \
--props /Users/home/path/to/indexer.properties \
--mode dropindex \
--base-path /tmp/hudi-ny-taxi \
--table-name ny_hudi_tbl \
--index-types COLUMN_STATS \
--parallelism 1 \
--spark-memory 2g

Concurrent write control

If a single writer runs together with async table services (such as the async indexer) in the same process, the in-process lock provider is sufficient:

hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider

For genuine multi-writer scenarios, configure a distributed lock provider instead:

hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.write.lock.provider=<distributed-lock-provider-classname>

Notes

Before enabling the metadata table, stop all write operations on the table, or alternatively enable concurrent write control.

Data skipping is currently not supported for MOR tables. See: [HUDI-3866] Support Data Skipping for MOR - ASF JIRA (apache.org)

References

Metadata Table | Apache Hudi
Metadata Indexing | Apache Hudi
Record Level Index: Apache Hudi's ultra-fast indexing for large-scale datasets - 知乎 (zhihu.com)
