An Introduction to Databricks Delta Lake

2019-05-06  牛肉圆粉不加葱

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.

1. Delta Lake Features

2. Batch Reads and Writes

2.1 Simple Examples

Create a table

df.write.format("delta").save("/delta/events")

Partition data

df.write.format("delta").partitionBy("date").save("/delta/events")

Read a table

spark.read.format("delta").load("/delta/events")

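2.2 Querying an Older Snapshot of a Table (Time Travel)

Delta Lake time travel lets you query an older snapshot of a Delta Lake table. Time travel has many use cases.

DataFrameReader options let you create a DataFrame from a Delta Lake table that is pinned to a specific version of that table. There are two ways to do this, by timestamp or by version number: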

df1 = spark.read.format("delta").option("timestampAsOf", timestamp_string).load("/delta/events")
df2 = spark.read.format("delta").option("versionAsOf", version).load("/delta/events")

For timestamp_string, only date or timestamp strings are accepted, for example "2019-01-01" or "2019-01-01 00:00:00.000Z".
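
To make the relationship between the two options more concrete, here is a minimal sketch of how a timestamp could be resolved to a version number: pick the latest version committed at or before the requested time. This is a simplified model, not Delta Lake's actual code; the function name `version_as_of` and the commit times are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from datetime import datetime, timezone


def version_as_of(commit_times, timestamp):
    """Return the latest version committed at or before `timestamp`.

    `commit_times` is an ascending list of commit datetimes in which the
    list index equals the version number (a hypothetical input format).
    """
    idx = bisect_right(commit_times, timestamp)
    if idx == 0:
        raise ValueError("timestamp is earlier than the first commit")
    return idx - 1


# Hypothetical commit times for versions 0, 1 and 2.
commits = [
    datetime(2019, 1, 1, 9, 0, tzinfo=timezone.utc),
    datetime(2019, 1, 1, 12, 0, tzinfo=timezone.utc),
    datetime(2019, 1, 2, 8, 0, tzinfo=timezone.utc),
]

query_time = datetime(2019, 1, 1, 23, 59, tzinfo=timezone.utc)
print(version_as_of(commits, query_time))  # 1
```

Under this model, reading with timestampAsOf set to the end of 2019-01-01 would return the same data as reading with versionAsOf set to 1.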

2.3 Writing to a Table

With append mode, you can atomically add new data to an existing Delta Lake table:

df.write.format("delta").mode("append").save("/delta/events")

To atomically replace all of the data in a table, use overwrite mode:

df.write.format("delta").mode("overwrite").save("/delta/events")

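You can also selectively overwrite only the data that matches a predicate on the partition columns. The following atomically replaces the data for January 2017 with the data in df: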

(df.write
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", "date >= '2017-01-01' AND date <= '2017-01-31'")
  .save("/delta/events"))

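2.4 Automatic Schema Updates

Delta Lake can automatically update a table's schema as part of a DML transaction and make the schema compatible with the data being written.

2.4.1 Adding Columns

Columns that are present in the DataFrame but missing from the table are automatically added as part of the write transaction when schema merging is enabled for the write, for example with .option("mergeSchema", "true").

2.4.2 NullType Columns

When writing to Delta, NullType columns are dropped from the DataFrame (because Parquet does not support NullType). When a different data type is later received for such a column, Delta Lake merges the schema to the new data type.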
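
The two rules in 2.4.1 and 2.4.2 can be modeled with a few lines of plain Python. This is only a sketch of the behavior described above, not Delta Lake's implementation: the function `merge_schema`, the representation of a schema as a list of (name, type) pairs, and the use of the string "null" for NullType are all assumptions made for illustration. Type conflicts on existing columns are not handled.

```python
def merge_schema(table_fields, df_fields):
    """Return the table schema after a write with schema merging enabled.

    Both arguments are lists of (column_name, type_name) pairs
    (a hypothetical, simplified schema representation).
    """
    merged = list(table_fields)
    known = {name for name, _ in table_fields}
    for name, type_name in df_fields:
        if type_name == "null":
            # NullType columns are dropped before the data is written.
            continue
        if name not in known:
            # Columns missing from the table are appended to its schema.
            merged.append((name, type_name))
            known.add(name)
    return merged


table = [("id", "long")]
incoming = [("id", "long"), ("country", "string"), ("note", "null")]

after_first_write = merge_schema(table, incoming)
print(after_first_write)  # [('id', 'long'), ('country', 'string')]

# A later write supplies a concrete type for the column that was dropped.
after_second_write = merge_schema(after_first_write, [("note", "string")])
print(after_second_write)
```

The second call shows the NullType case: the `note` column is absent after the first write and only enters the table schema once it arrives with a concrete type.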

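By default, overwriting the data in a table does not overwrite its schema. When you overwrite a table using overwrite mode without replaceWhere, you may still want to replace the schema with that of the data being written. You can opt in to replacing the table's schema by setting: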

df.write.option("overwriteSchema", "true")

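2.5 Views

Delta Lake supports creating views on top of Delta Lake tables, just as you would with a data source table.

The core challenge when working with views is schema resolution. If you change the schema of a Delta Lake table, for example by adding a new column, you must make sure that the column is available in the corresponding views built on top of that base table.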

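3. Streaming Reads and Writes

4. Concurrency Control

Delta Lake provides ACID transaction guarantees between reads and writes.

4.1 Optimistic Concurrency Control

Delta Lake uses optimistic concurrency control to provide transactional guarantees between writes. Under this mechanism, a write proceeds in three stages:

  1. read: read the latest available version of the table to identify which files need to be modified
  2. write: stage all changes by writing new data files
  3. validate and commit: call the commit method, generate the commit info, and create a new log file whose version number is incremented by 1; if a file with that name already exists, a ConcurrentModificationException is thrown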
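
The commit step above can be sketched with an exclusive file create on a local filesystem. This is an illustration under stated assumptions, not Delta Lake's code: the helper names `commit_file_name` and `try_commit` are hypothetical, the exclusive create stands in for whatever atomic primitive the real storage system offers, and Python's FileExistsError plays the role of ConcurrentModificationException.

```python
import json
import os
import tempfile


def commit_file_name(version):
    # Log files are named with a 20-digit, zero-padded version number.
    return f"{version:020d}.json"


def try_commit(log_dir, read_version, actions):
    """Try to commit `actions` as version `read_version + 1`.

    Raises FileExistsError if another writer has already created the
    log file for that version.
    """
    new_version = read_version + 1
    path = os.path.join(log_dir, commit_file_name(new_version))
    # O_EXCL makes the create fail when the file already exists.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    with os.fdopen(fd, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    return new_version


log_dir = tempfile.mkdtemp()
v0 = try_commit(log_dir, -1, [{"add": {"path": "part-a.parquet"}}])

# Writers A and B both read version 0; A commits first.
v1 = try_commit(log_dir, v0, [{"add": {"path": "part-b.parquet"}}])
conflict = False
try:
    try_commit(log_dir, v0, [{"add": {"path": "part-c.parquet"}}])
except FileExistsError:
    conflict = True
    print("conflict: version", v0 + 1, "is already committed")
```

Writer B loses the race because the file for version 1 already exists. In this sketch the losing writer would have to re-read the table and retry against the newer version.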

5. Delta Directory Structure

adminMacBook-Pro:spark-2.1.1-bin-2.7.3 admin$ hadoop fs -ls /tmp/delta-table/
Found 34 items
drwx------   - admin supergroup          0 2019-04-30 14:22 /tmp/delta-table/_delta_log
-rw-------   2 admin supergroup        263 2019-04-30 14:21 /tmp/delta-table/part-00000-174ce4e0-9dde-4704-9d79-b41e1cb51eda-c000.snappy.parquet
-rw-------   2 admin supergroup        263 2019-04-30 14:13 /tmp/delta-table/part-00000-19ed1ad3-45b6-4527-8acc-2137f256165b-c000.snappy.parquet
-rw-------   2 admin supergroup        263 2019-04-30 14:22 /tmp/delta-table/part-00000-2ec77809-f928-4717-bc66-c61f5fa4e690-c000.snappy.parquet
-rw-------   2 admin supergroup        263 2019-04-30 14:13 /tmp/delta-table/part-00000-381f68e5-f027-4304-a9a9-a0d63b33f95c-c000.snappy.parquet
-rw-------   2 admin supergroup        263 2019-04-30 14:22 /tmp/delta-table/part-00000-4f7a5e99-8a2d-4661-85a2-3adb796e6014-c000.snappy.parquet
-rw-------   2 admin supergroup        263 2019-04-30 14:21 /tmp/delta-table/part-00000-6739de6e-61d2-4083-84a6-127362012290-c000.snappy.parquet
-rw-------   2 admin supergroup        263 2019-04-30 14:13 /tmp/delta-table/part-00000-687847fc-4196-40a8-87aa-cb288ce41d3a-c000.snappy.parquet
-rw-------   2 admin supergroup        263 2019-04-30 14:22 /tmp/delta-table/part-00000-717ac816-0195-4581-9c26-8249e94b6cf6-c000.snappy.parquet
-rw-------   2 admin supergroup        263 2019-04-30 14:12 /tmp/delta-table/part-00000-8520f2ed-5a81-4caa-bd24-ca16ae96fcfc-c000.snappy.parquet
-rw-------   2 admin supergroup        263 2019-04-30 14:13 /tmp/delta-table/part-00000-b538fde3-6fe2-454a-b505-4f388807866f-c000.snappy.parquet
-rw-------   2 admin supergroup        263 2019-04-30 14:22 /tmp/delta-table/part-00000-e7ce533e-2a52-4f16-b356-1627f4d0b986-c000.snappy.parquet
-rw-------   2 admin supergroup        423 2019-04-30 14:22 /tmp/delta-table/part-00003-2272ef42-30f6-461a-a3df-1d643d36fe57-c000.snappy.parquet
-rw-------   2 admin supergroup        423 2019-04-30 14:21 /tmp/delta-table/part-00003-5f13f30e-0412-4013-8887-ef36a813b7b3-c000.snappy.parquet
-rw-------   2 admin supergroup        423 2019-04-30 14:22 /tmp/delta-table/part-00003-6d7f0bfd-9dce-4dcc-9ed8-6a8bb9fd3f43-c000.snappy.parquet
-rw-------   2 admin supergroup        423 2019-04-30 14:22 /tmp/delta-table/part-00003-9905ed46-2906-4626-8ceb-7a1b33d0fba9-c000.snappy.parquet
-rw-------   2 admin supergroup        423 2019-04-30 14:12 /tmp/delta-table/part-00003-a4841920-25b7-4822-85c8-c946605227f9-c000.snappy.parquet
-rw-------   2 admin supergroup        423 2019-04-30 14:13 /tmp/delta-table/part-00003-bbf64370-749e-4b05-a1aa-83edd474f4dd-c000.snappy.parquet
-rw-------   2 admin supergroup        423 2019-04-30 14:13 /tmp/delta-table/part-00003-c38c8546-adca-4580-884c-f0814d8a9b86-c000.snappy.parquet
-rw-------   2 admin supergroup        423 2019-04-30 14:22 /tmp/delta-table/part-00003-cda3cb71-6bba-478a-9623-5c5a3483517e-c000.snappy.parquet
-rw-------   2 admin supergroup        423 2019-04-30 14:13 /tmp/delta-table/part-00003-ced20325-7840-441e-b1ee-4b5781ec92df-c000.snappy.parquet
-rw-------   2 admin supergroup        423 2019-04-30 14:13 /tmp/delta-table/part-00003-d6ce4ea8-103e-4dc5-9b28-e53b321b0b86-c000.snappy.parquet
-rw-------   2 admin supergroup        423 2019-04-30 14:21 /tmp/delta-table/part-00003-ee71b4f1-6153-42fd-bfec-3e4478c23a22-c000.snappy.parquet
-rw-------   2 admin supergroup        423 2019-04-30 14:22 /tmp/delta-table/part-00007-37233cb9-bd0b-41b9-b935-25592e57bcbb-c000.snappy.parquet
-rw-------   2 admin supergroup        423 2019-04-30 14:13 /tmp/delta-table/part-00007-3851b027-acca-445f-8208-408bafe6ecaf-c000.snappy.parquet
-rw-------   2 admin supergroup        423 2019-04-30 14:13 /tmp/delta-table/part-00007-405cf78e-229b-4045-891d-582433962093-c000.snappy.parquet
-rw-------   2 admin supergroup        423 2019-04-30 14:22 /tmp/delta-table/part-00007-44be9b91-a94f-41d0-af47-b867e052b92b-c000.snappy.parquet
-rw-------   2 admin supergroup        423 2019-04-30 14:13 /tmp/delta-table/part-00007-48155c68-e17c-4773-8711-03da5aba86b3-c000.snappy.parquet
-rw-------   2 admin supergroup        423 2019-04-30 14:21 /tmp/delta-table/part-00007-57aeb2af-3785-400a-94d7-c257e4ce6ba0-c000.snappy.parquet
-rw-------   2 admin supergroup        423 2019-04-30 14:21 /tmp/delta-table/part-00007-616923f9-615b-4977-9b8b-15a155baf9b4-c000.snappy.parquet
-rw-------   2 admin supergroup        423 2019-04-30 14:12 /tmp/delta-table/part-00007-c23e1896-47dd-4076-a189-174bb45f6384-c000.snappy.parquet
-rw-------   2 admin supergroup        423 2019-04-30 14:13 /tmp/delta-table/part-00007-c34873a2-9077-4ed0-9104-3f08059be4c9-c000.snappy.parquet
-rw-------   2 admin supergroup        423 2019-04-30 14:22 /tmp/delta-table/part-00007-ca13411e-88ad-4d5a-8a6d-dfd1808824bb-c000.snappy.parquet
-rw-------   2 admin supergroup        423 2019-04-30 14:22 /tmp/delta-table/part-00007-f0912369-add4-444b-9bdb-677a1d688db6-c000.snappy.parquet

adminMacBook-Pro:spark-2.1.1-bin-2.7.3 admin$ hadoop fs -ls /tmp/delta-table/_delta_log
Found 13 items
-rw-------   2 admin supergroup       1076 2019-04-30 14:12 /tmp/delta-table/_delta_log/00000000000000000000.json
-rw-------   2 admin supergroup       1147 2019-04-30 14:13 /tmp/delta-table/_delta_log/00000000000000000001.json
-rw-------   2 admin supergroup       1147 2019-04-30 14:13 /tmp/delta-table/_delta_log/00000000000000000002.json
-rw-------   2 admin supergroup       1147 2019-04-30 14:13 /tmp/delta-table/_delta_log/00000000000000000003.json
-rw-------   2 admin supergroup        718 2019-04-30 14:13 /tmp/delta-table/_delta_log/00000000000000000004.json
-rw-------   2 admin supergroup       1573 2019-04-30 14:21 /tmp/delta-table/_delta_log/00000000000000000005.json
-rw-------   2 admin supergroup       1147 2019-04-30 14:21 /tmp/delta-table/_delta_log/00000000000000000006.json
-rw-------   2 admin supergroup       1147 2019-04-30 14:22 /tmp/delta-table/_delta_log/00000000000000000007.json
-rw-------   2 admin supergroup       1147 2019-04-30 14:22 /tmp/delta-table/_delta_log/00000000000000000008.json
-rw-------   2 admin supergroup       1147 2019-04-30 14:22 /tmp/delta-table/_delta_log/00000000000000000009.json
-rw-------   2 admin supergroup      13308 2019-04-30 14:22 /tmp/delta-table/_delta_log/00000000000000000010.checkpoint.parquet
-rw-------   2 admin supergroup       1147 2019-04-30 14:22 /tmp/delta-table/_delta_log/00000000000000000010.json
-rw-------   2 admin supergroup         38 2019-04-30 14:22 /tmp/delta-table/_delta_log/_last_checkpoint
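
The listing above shows one JSON file per commit (versions 0 to 10) plus a checkpoint at version 10. As a rough sketch, and assuming a checkpoint captures the table state up to and including its version, a reader that wants the latest snapshot only needs the newest checkpoint and any commits after it. The function `files_to_read` is hypothetical and works on file names only.

```python
import re

COMMIT = re.compile(r"^(\d{20})\.json$")
CHECKPOINT = re.compile(r"^(\d{20})\.checkpoint\.parquet$")


def versions(names, pattern):
    """Extract the version numbers of all names matching `pattern`."""
    found = []
    for name in names:
        match = pattern.match(name)
        if match:
            found.append(int(match.group(1)))
    return sorted(found)


def files_to_read(names):
    """List the log files needed to rebuild the latest snapshot."""
    checkpoints = versions(names, CHECKPOINT)
    start = checkpoints[-1] if checkpoints else -1
    plan = []
    if start >= 0:
        plan.append(f"{start:020d}.checkpoint.parquet")
    # Only commits newer than the checkpoint still need to be replayed.
    plan += [f"{v:020d}.json" for v in versions(names, COMMIT) if v > start]
    return plan


# File names as they appear in the _delta_log listing above.
names = [f"{v:020d}.json" for v in range(11)]
names += ["00000000000000000010.checkpoint.parquet", "_last_checkpoint"]

print(files_to_read(names))  # ['00000000000000000010.checkpoint.parquet']
```

For the directory shown here the plan is just the checkpoint file, because no commit newer than version 10 exists yet.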

adminMacBook-Pro:spark-2.1.1-bin-2.7.3 admin$ hadoop fs -cat /tmp/delta-table/_delta_log/00000000000000000000.json
{"commitInfo":{"userId":null,"userName":null,"operation":"WRITE","operationParameters":{"mode":"Overwrite","partitionBy":"[]"},"job":null,"notebook":null,"clusterId":null,"readVersion":null,"isolationLevel":null}}
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"7a880cd4-0061-42ae-a998-965b6cfc3198","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1556604759993}}
{"add":{"path":"part-00000-8520f2ed-5a81-4caa-bd24-ca16ae96fcfc-c000.snappy.parquet","partitionValues":{},"size":263,"modificationTime":1556604760157,"dataChange":true}}
{"add":{"path":"part-00003-a4841920-25b7-4822-85c8-c946605227f9-c000.snappy.parquet","partitionValues":{},"size":423,"modificationTime":1556604760193,"dataChange":true}}
{"add":{"path":"part-00007-c23e1896-47dd-4076-a189-174bb45f6384-c000.snappy.parquet","partitionValues":{},"size":423,"modificationTime":1556604760216,"dataChange":true}}

adminMacBook-Pro:spark-2.1.1-bin-2.7.3 admin$ hadoop fs -cat /tmp/delta-table/_delta_log/00000000000000000001.json
{"commitInfo":{"userId":null,"userName":null,"operation":"WRITE","operationParameters":{"mode":"Overwrite","partitionBy":"[]"},"job":null,"notebook":null,"clusterId":null,"readVersion":0,"isolationLevel":null}}
{"add":{"path":"part-00000-687847fc-4196-40a8-87aa-cb288ce41d3a-c000.snappy.parquet","partitionValues":{},"size":263,"modificationTime":1556604781695,"dataChange":true}}
{"add":{"path":"part-00003-bbf64370-749e-4b05-a1aa-83edd474f4dd-c000.snappy.parquet","partitionValues":{},"size":423,"modificationTime":1556604781707,"dataChange":true}}
{"add":{"path":"part-00007-c34873a2-9077-4ed0-9104-3f08059be4c9-c000.snappy.parquet","partitionValues":{},"size":423,"modificationTime":1556604781708,"dataChange":true}}
{"remove":{"path":"part-00000-8520f2ed-5a81-4caa-bd24-ca16ae96fcfc-c000.snappy.parquet","deletionTimestamp":1556604781912,"dataChange":true}}
{"remove":{"path":"part-00003-a4841920-25b7-4822-85c8-c946605227f9-c000.snappy.parquet","deletionTimestamp":1556604781913,"dataChange":true}}
{"remove":{"path":"part-00007-c23e1896-47dd-4076-a189-174bb45f6384-c000.snappy.parquet","deletionTimestamp":1556604781913,"dataChange":true}}
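
The two commit files above show how an overwrite is recorded: version 1 adds three new files and removes the three files that version 0 added. A minimal sketch of replaying such a log to find the files that make up the current table is given below. The function `replay` is hypothetical, the file paths are abbreviated from the ones in the listing, and every field other than "path" is omitted.

```python
import json


def replay(log_lines, active=None):
    """Apply add/remove actions in order and return the active file set."""
    active = set() if active is None else set(active)
    for line in log_lines:
        action = json.loads(line)
        if "add" in action:
            active.add(action["add"]["path"])
        elif "remove" in action:
            active.discard(action["remove"]["path"])
        # Other actions (commitInfo, protocol, metaData) do not change the file set.
    return active


version_0 = [
    '{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}',
    '{"add":{"path":"part-00000-8520f2ed.parquet"}}',
    '{"add":{"path":"part-00003-a4841920.parquet"}}',
    '{"add":{"path":"part-00007-c23e1896.parquet"}}',
]
version_1 = [
    '{"add":{"path":"part-00000-687847fc.parquet"}}',
    '{"add":{"path":"part-00003-bbf64370.parquet"}}',
    '{"add":{"path":"part-00007-c34873a2.parquet"}}',
    '{"remove":{"path":"part-00000-8520f2ed.parquet"}}',
    '{"remove":{"path":"part-00003-a4841920.parquet"}}',
    '{"remove":{"path":"part-00007-c23e1896.parquet"}}',
]

state_v0 = replay(version_0)
state_v1 = replay(version_1, state_v0)
print(sorted(state_v1))
```

After version 1 only the three newly added files remain active. The removed files still exist in the table directory, as the first listing shows, which is what makes it possible to read version 0 again later.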