
Using Hudi: Metadata Index

2024-07-22  AlienPaul

This post covers the introduction, configuration and usage of Hudi's metadata index. It reorganizes the relevant parts of the official Hudi documentation into one place, making them easier to read and search.

The metadata index is built on top of the metadata table and provides the ability to quickly locate where data physically lives without reading the underlying data.
The metadata table is designed to be serverless and is not tied to any specific compute engine.
The metadata table is an MOR-type Hudi table that stores index information and lives under the .hoodie directory. Its data files use the HFile format, which significantly speeds up key-based lookups.
Compared with traditional indexes, the metadata index delivers a huge performance boost, for both writes and reads.

Advantages:

- Serverless by design, not bound to any specific compute engine.
- Fast key-based lookups (HFile format) without reading the underlying data files.
- Improves both write and read performance compared with traditional indexes.

Supported index types (each stored as a partition of the metadata table):

- files (file listings)
- column_stats (column ranges of data files)
- bloom_filter (bloom filters of data files)
- record_index (record level index)

Enabling the Hudi metadata table and multi-modal index on the write side

Configuration options available for Spark:

hoodie.metadata.enable
  Default: true (Optional). Enabled on the write side.
  Description: Enable the internal metadata table which serves table metadata like level file listings. For 0.10.1 and prior releases, metadata table is disabled by default and needs to be explicitly enabled.
  Config Param: ENABLE
  Since Version: 0.7.0

hoodie.metadata.index.bloom.filter.enable
  Default: false (Optional)
  Description: Enable indexing bloom filters of user data files under metadata table. When enabled, metadata table will have a partition to store the bloom filter index and will be used during the index lookups.
  Config Param: ENABLE_METADATA_INDEX_BLOOM_FILTER
  Since Version: 0.11.0

hoodie.metadata.index.column.stats.enable
  Default: false (Optional)
  Description: Enable indexing column ranges of user data files under metadata table key lookups. When enabled, metadata table will have a partition to store the column ranges and will be used for pruning files during the index lookups.
  Config Param: ENABLE_METADATA_INDEX_COLUMN_STATS
  Since Version: 0.11.0

hoodie.metadata.record.index.enable
  Default: false (Optional)
  Description: Create the HUDI Record Index within the Metadata Table.
  Config Param: RECORD_INDEX_ENABLE_PROP
  Since Version: 0.14.0
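As a sketch, turning all four write-side options on at once amounts to the following write properties (names taken from the table above; pass them however your job supplies Hudi write configs):

```properties
# enable the metadata table itself
hoodie.metadata.enable=true
# build the bloom_filter, column_stats and record_index partitions
hoodie.metadata.index.bloom.filter.enable=true
hoodie.metadata.index.column.stats.enable=true
hoodie.metadata.record.index.enable=true
```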

Configuration options available for Flink:

metadata.enabled
  Default: false (Optional)
  Description: Enable the internal metadata table which serves table metadata like level file listings, default disabled.
  Config Param: METADATA_ENABLED

hoodie.metadata.index.column.stats.enable
  Default: false (Optional)
  Description: Enable indexing column ranges of user data files under metadata table key lookups. When enabled, metadata table will have a partition to store the column ranges and will be used for pruning files during the index lookups.
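A minimal Flink SQL sketch that sets both options at table definition time. Only the two index flags come from the table above; the schema, table name and path are hypothetical placeholders:

```sql
CREATE TABLE hudi_tbl (
  id   INT,
  name STRING,
  ts   TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_tbl',                       -- hypothetical path
  'metadata.enabled' = 'true',                           -- enable the metadata table
  'hoodie.metadata.index.column.stats.enable' = 'true'   -- build the column_stats index
);
```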

Judging from the configuration options listed on the official site, Flink currently only supports column stats and data skipping.

Using the file index

File-level indexing only requires enabling the metadata table.
The configuration options to enable the metadata table in each engine are:

Spark DataSource / Spark SQL / Structured Streaming:
  Config: hoodie.metadata.enable
  When set to true, enables use of the Spark file index implementation for Hudi, which speeds up listing of large tables.

Presto:
  Config: hudi.metadata-table-enabled
  When set to true, fetches the list of file names and sizes from Hudi's metadata table rather than from storage.

Trino:
  Config: hudi.metadata-enabled
  When set to true, fetches the list of file names and sizes from the metadata table rather than from storage.

Athena:
  Config: hudi.metadata-listing-enabled
  When this table property is set to TRUE, enables the Hudi metadata table and the related file listing functionality.

Flink DataStream / Flink SQL:
  Config: metadata.enabled
  When set to true in the DDL, uses the internal metadata table to serve table metadata like level file listings.
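For example, in a Spark SQL session the reader-side flag can simply be set before querying. A sketch with a hypothetical table name:

```sql
-- use the metadata table for file listings instead of scanning storage
SET hoodie.metadata.enable = true;
SELECT * FROM hudi_tbl LIMIT 10;  -- hudi_tbl is a placeholder
```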

Using the column_stats index and data skipping

Using data skipping from Spark or Flink requires that the Hudi table has the metadata table enabled and the column_stats index built.
To enable data skipping at read time, Spark and Flink respectively need the following settings:

Spark DataSource / Spark SQL / Structured Streaming:
  hoodie.metadata.enable: When set to true, enables use of the Spark file index implementation for Hudi, which speeds up listing of large tables.
  hoodie.enable.data.skipping: When set to true, enables data skipping, allowing queries to leverage indexes to reduce the search space by skipping over files.
    Config Param: ENABLE_DATA_SKIPPING
    Since Version: 0.10.0

Flink DataStream / Flink SQL:
  metadata.enabled: When set to true in the DDL, uses the internal metadata table to serve table metadata like level file listings.
  read.data.skipping.enabled: When set to true, enables data skipping, allowing queries to leverage indexes to reduce the search space by skipping over files.
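Putting the two Spark flags together, a data-skipping query session might look like the following sketch. The table name, column and predicate are hypothetical, and the column_stats index must already have been built on the queried column:

```sql
SET hoodie.metadata.enable = true;
SET hoodie.enable.data.skipping = true;
-- with column_stats available for price, files whose [min, max] range
-- cannot satisfy the predicate are pruned before scanning
SELECT * FROM hudi_tbl WHERE price > 100;
```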

Record index vs. traditional indexes

The record level index has significant advantages over traditional indexes.

                                 Record Level Index   Global Simple Index   Global Bloom Index   Bucket Index
Performant look-up in general    Yes                  No                    No
Boost both writes and reads      Yes                  No, write-only        No, write-only
Easy to enable                   Yes                  Yes                   Yes
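To actually use the record index for lookups, the write-side flag from the earlier config table is typically paired with the record-index index type. A hedged Spark SQL sketch; hoodie.index.type=RECORD_INDEX is an assumption based on Hudi 0.14.0 and is not stated in the original text:

```sql
SET hoodie.metadata.enable = true;
SET hoodie.metadata.record.index.enable = true;  -- build the record_index partition
SET hoodie.index.type = RECORD_INDEX;            -- assumed: route index lookups to it
```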

Online asynchronous metadata indexing with Spark

An example of write-side Hudi parameters for Spark:

# ensure that both metadata and async indexing is enabled as below two configs  
hoodie.metadata.enable=true  
hoodie.metadata.index.async=true  
# enable column_stats index config  
hoodie.metadata.index.column.stats.enable=true  
# set concurrency mode and lock configs as this is a multi-writer scenario  
# check https://hudi.apache.org/docs/concurrency_control/ for different lock provider configs  
hoodie.write.concurrency.mode=optimistic_concurrency_control  
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider  
hoodie.write.lock.zookeeper.url=<zk_url>  
hoodie.write.lock.zookeeper.port=<zk_port>  
hoodie.write.lock.zookeeper.lock_key=<zk_key>  
hoodie.write.lock.zookeeper.base_path=<zk_base_path>

Offline metadata indexing

Schedule

We can use HoodieIndexer's schedule mode to schedule an index build. For example:

spark-submit \
--class org.apache.hudi.utilities.HoodieIndexer \
/Users/home/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.13.0.jar \
--props /Users/home/path/to/indexer.properties \
--mode schedule \
--base-path /tmp/hudi-ny-taxi \
--table-name ny_hudi_tbl \
--index-types COLUMN_STATS \
--parallelism 1 \
--spark-memory 1g
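The contents of the --props file are not shown in the original; a hypothetical sketch, consistent with the async-indexing properties listed earlier in this post:

```properties
# indexer.properties (hypothetical contents)
hoodie.metadata.enable=true
hoodie.metadata.index.async=true
hoodie.metadata.index.column.stats.enable=true
# lock configs, since the indexer runs alongside the writer
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider
```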

This operation writes an indexing.requested instant to the timeline.

Execute

Use HoodieIndexer's execute mode to run the index build scheduled in the previous step.

spark-submit \
--class org.apache.hudi.utilities.HoodieIndexer \
/Users/home/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.13.0.jar \
--props /Users/home/path/to/indexer.properties \
--mode execute \
--base-path /tmp/hudi-ny-taxi \
--table-name ny_hudi_tbl \
--index-types COLUMN_STATS \
--parallelism 1 \
--spark-memory 1g

We can also use the scheduleAndExecute mode to handle scheduling and execution in one shot. Keeping scheduling and execution separate, however, offers more flexibility.

Drop

To drop an index, use HoodieIndexer's dropindex mode.

spark-submit \
--class org.apache.hudi.utilities.HoodieIndexer \
/Users/home/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.13.0.jar \
--props /Users/home/path/to/indexer.properties \
--mode dropindex \
--base-path /tmp/hudi-ny-taxi \
--table-name ny_hudi_tbl \
--index-types COLUMN_STATS \
--parallelism 1 \
--spark-memory 2g

Concurrent write control

If a single writer runs together with async table services (such as the async indexer) in the same process, the in-process lock provider is sufficient:

hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider

For genuine multi-writer scenarios, configure a distributed lock provider instead:

hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.write.lock.provider=<distributed-lock-provider-classname>

Notes

Before enabling the metadata table, stop all write operations on the table, or alternatively enable concurrent write control.

Data skipping is currently not supported for MOR tables. See: [HUDI-3866] Support Data Skipping for MOR - ASF JIRA (apache.org)

References

Metadata Table | Apache Hudi
Metadata Indexing | Apache Hudi
Record Level Index: Apache Hudi's ultra-fast indexing for large-scale datasets - 知乎 (zhihu.com)
