Spark 2.1 structured streaming

2017-01-06 本文已影响0人 biggeng

最近（12月8日）, Spark 2.1 版本正式发布。2.1版本是第二个Spark2.x版本。又增强了Spark对于Structured streaming的支持，包括数据源对Kafka的支持，以及新增的streaming中对于event time watermark的支持。

什么是structured streaming ?

在Spark2.0时，Spark引入了structured streaming，structured streaming是建立在Spark SQL之上的可扩展和高容错的流处理架构。不同于Spark1.x时代的DStream和ForeachRDD, structured streaming的目的是使用户能够像使用Spark SQL处理批处理一样，能够使用相同的方法处理流数据。Spark SQL引擎会递增式的处理到来的数据，并且持续更新流处理输出的数据。

当前的structured streaming 特性还是处于alpha版本，可以进行实验环境的验证，不建议进行生产环境。

没有边界的大表

不同于Spark1.x使用interval将流数据分为不同的mini batch, structured streaming将流数据看作是一张没有边界的表，流数据不断的向表尾增加数据。如下图所示：

structured-streaming-stream-as-a-table.png

在每一个周期时，新的内容将会增加到表尾，查询的结果将会更新到结果表中。一旦结果表被更新，就需要将改变后的表内容输出到外部的sink中。

structured streaming支持三种输出模式:

Complete mode: 整个更新的结果表都会被输出。
Append mode: 只有新增加到结果表的数据会被输出。
Updated mode: 只有被更新的结果表会输出。当前版本暂不支持这个特性

Word count

structured-streaming-example-model.png

不同于Spark1.x，用户需要自己保存历史的数据，structured steaming会帮助用户维护历史的计算数据放到结果表中，每次只需要更新结果表的数据。

Event time

event time作为Row的一列，表示的event的实际时间，而不是到达streaming处理的时间。
event time可以用来进行基于时间相关的计算

watermark处理延迟数据

上面提到了structured streaming可以维护历史的数据，但是如果一条数据的到来时间延迟过长，那么计算这条数据没有什么意义。因此需要一种机制丢弃掉延迟到来的数据。在Spark2.1中，引入了watermark机制。watermark指定列名称为event time的，并且定义了数据延迟到大的最大阈值。超过这个阈值到来的数据将会被忽略。
使用示例如下：

import spark.implicits._

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }

// Group the data by window and word and compute the count of each group
val windowedCounts = words
    .withWatermark("timestamp", "10 minutes")
    .groupBy(
        window($"timestamp", "10 minutes", "5 minutes"),
        $"word")
    .count()

Kafka支持

Spark2.1支持Kafka0.10.0集成structured streaming. 可以支持从一个topic，多个topic读入数据生成data frame.

// Subscribe to 1 topic
val ds1 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// Subscribe to multiple topics
val ds2 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1,topic2")
  .load()
ds2.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// Subscribe to a pattern
val ds3 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .load()
ds3.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]