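Learning Flink, Part 11: Window & EventTime
The previous post tried out Processing Time. This one looks at Event Time, and at Watermarks, which you have to deal with whenever you work in event time.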
EventTime & Watermark
Event time programs must specify how to generate Event Time Watermarks, which is the mechanism that signals progress in event time.
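A program that runs on event time must specify how its watermarks are generated.
The following is quoted from 《从0到1学习flink》 ("Learning Flink from 0 to 1") and the official documentation: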
A stream processor that supports event time needs a way to measure the progress of event time. For example, a window operator that builds hourly windows needs to be notified when event time has passed beyond the end of an hour, so that the operator can close the window in progress.
Event time can progress independently of processing time (measured by wall clocks). For example, in one program the current event time of an operator may trail slightly behind the processing time (accounting for a delay in receiving the events), while both proceed at the same speed. On the other hand, another streaming program might progress through weeks of event time with only a few seconds of processing, by fast-forwarding through some historic data already buffered in a Kafka topic (or another message queue).
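The mechanism Flink uses to measure progress in event time is the watermark. Watermarks flow as part of the data stream and carry a timestamp t. A Watermark(t) declares that event time has reached time t in that stream, meaning that there should be no more elements in the stream with a timestamp t' <= t (i.e. events with timestamps older than or equal to the watermark).
The figure below shows a stream of events with (logical) timestamps and watermarks flowing inline. In this example the events are in order (with respect to their timestamps), so the watermarks are simply periodic markers in the stream.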
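stream_watermark_in_order
Watermarks are crucial for out-of-order streams, as illustrated below, where the events are not ordered by their timestamps. In general, a watermark is a declaration that by that point in the stream, all events up to a certain timestamp should have arrived. Once a watermark reaches an operator, the operator can advance its internal event time clock to the value of the watermark.
stream_watermark_out_of_order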
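My understanding: if the time characteristic in Flink is set to Event Time, watermarks must be configured, because they are the marker that tells Flink how far event time has progressed.
Once Watermark(time1) has been emitted, Flink assumes that every element with a timestamp time2 <= time1 has already arrived, whether the stream is ordered or not. An element that still shows up after that is treated as late.
Who produces the watermarks? They are produced by the job code running in Flink, not by the external data source itself.
Does every element get its own watermark? It can be 1:1, but it does not have to be; it depends on what the job needs.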
It is possible to generate a watermark on every single event. However, because each watermark causes some computation downstream, an excessive number of watermarks degrades performance.
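To make the Watermark(t) rule concrete, here is a minimal plain-Java sketch with no Flink dependency. The class WatermarkClock and its methods are made up for illustration and are not Flink's implementation; the sketch only models the bookkeeping described above: a watermark can only move event time forward, and an element with a timestamp t' <= t counts as late.

```java
/** Toy model of an operator's event-time clock (hypothetical, not Flink code). */
final class WatermarkClock {
    private long currentWatermark = Long.MIN_VALUE;

    /** Advances the clock. Watermarks that do not move it forward are ignored. */
    boolean onWatermark(long watermarkTimestamp) {
        if (watermarkTimestamp > currentWatermark) {
            currentWatermark = watermarkTimestamp;
            return true;
        }
        return false;
    }

    /** An element is late when its timestamp is at or below the current watermark. */
    boolean isLate(long elementTimestamp) {
        return elementTimestamp <= currentWatermark;
    }

    long currentWatermark() {
        return currentWatermark;
    }

    public static void main(String[] args) {
        WatermarkClock clock = new WatermarkClock();
        clock.onWatermark(11);
        System.out.println(clock.isLate(9));   // true: 9 <= 11
        System.out.println(clock.isLate(15));  // false: 15 > 11
    }
}
```

Because only a larger watermark advances the clock, emitting one per element mostly adds downstream work without adding information, which is the performance concern in the quote above.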
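Watermarks in Parallel Streams
Watermarks are generated at, or directly after, source functions. Each parallel subtask of a source function usually generates its watermarks independently. These watermarks define the event time at that particular parallel source.
As the watermarks flow through the streaming program, they advance the event time at the operators where they arrive. Whenever an operator advances its event time, it generates a new watermark downstream for its successor operators.
Some operators consume multiple input streams; a union, for example, or operators following a keyBy(…) or partition(…) function. Such an operator's current event time is the minimum of its input streams' event times. As its input streams update their event times, so does the operator.
The figure below shows an example of events and watermarks flowing through parallel streams, and operators tracking event time.
flink_parallel_streams_watermarks
Judging from the figure, event time originates at the source, and so do the watermarks.
The data flows from the source through a map transformation and is then processed in a window.
The rest of the figure I have not figured out yet...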
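The rule for operators with several inputs (the operator's event time is the minimum of its inputs' event times) can be sketched in plain Java. This is only a toy model under my own naming (MultiInputEventTime is hypothetical), not how Flink implements it internally.

```java
import java.util.Arrays;

/** Toy model of an operator that tracks event time over several inputs (hypothetical). */
final class MultiInputEventTime {
    private final long[] inputWatermarks;
    private long operatorEventTime = Long.MIN_VALUE;

    MultiInputEventTime(int numberOfInputs) {
        inputWatermarks = new long[numberOfInputs];
        Arrays.fill(inputWatermarks, Long.MIN_VALUE);
    }

    /** Records a watermark from one input and returns the operator's event time afterwards. */
    long onWatermark(int inputIndex, long watermarkTimestamp) {
        inputWatermarks[inputIndex] = Math.max(inputWatermarks[inputIndex], watermarkTimestamp);
        // The slowest input determines how far the operator may advance.
        long min = Arrays.stream(inputWatermarks).min().getAsLong();
        operatorEventTime = Math.max(operatorEventTime, min);
        return operatorEventTime;
    }

    public static void main(String[] args) {
        MultiInputEventTime op = new MultiInputEventTime(2);
        System.out.println(op.onWatermark(0, 29)); // Long.MIN_VALUE: input 1 has reported nothing yet
        System.out.println(op.onWatermark(1, 14)); // 14
        System.out.println(op.onWatermark(1, 33)); // 29
    }
}
```

The first call shows why one idle input can hold back the whole operator: until every input has delivered a watermark, the minimum does not move.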
About Timestamps and Watermarks
In order to work with event time, Flink needs to know the events’ timestamps, meaning each element in the stream needs to have its event timestamp assigned. This is usually done by accessing/extracting the timestamp from some field in the element.
Timestamp assignment goes hand-in-hand with generating watermarks, which tell the system about progress in event time.
There are two ways to assign timestamps and generate watermarks:
- Directly in the data stream source
- Via a timestamp assigner / watermark generator: in Flink, timestamp assigners also define the watermarks to be emitted
Attention: Both timestamps and watermarks are specified as milliseconds since the Java epoch of 1970-01-01T00:00:00Z.
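As a quick illustration of the unit, the following sketch converts a UTC instant to epoch milliseconds using only the JDK. The class name EpochMillisExample is made up for this post. It also assumes that a field like getCurrentTime() in the later examples is a java.util.Date, whose getTime() returns the same epoch-millisecond value.

```java
import java.time.Instant;
import java.util.Date;

final class EpochMillisExample {
    /** Converts an ISO-8601 UTC instant such as "1970-01-01T00:00:01Z" to epoch milliseconds. */
    static long toEpochMillis(String isoInstant) {
        return Instant.parse(isoInstant).toEpochMilli();
    }

    public static void main(String[] args) {
        System.out.println(toEpochMillis("1970-01-01T00:00:00Z")); // 0
        System.out.println(toEpochMillis("1970-01-01T00:00:01Z")); // 1000
        // java.util.Date uses the same unit, so Date.getTime() can serve as an event timestamp.
        System.out.println(new Date(1000L).getTime());             // 1000
    }
}
```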
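In event time mode, Flink must know the timestamp of every event. In other words, each element in the stream needs a timestamp assigned, which is usually taken from a field of the element.
Assigning timestamps and generating watermarks are normally handled together (hand-in-hand).
There are two ways to assign timestamps and generate watermarks:
- Directly in the data source
- Via a timestamp assigner (also called a watermark generator). In Flink, a timestamp assigner is at the same time a watermark generator
Assigning directly in the data source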
Stream sources can directly assign timestamps to the elements they produce, and they can also emit watermarks. When this is done, no timestamp assigner is needed. Note that if a timestamp assigner is used, any timestamps and watermarks provided by the source will be overwritten.
To assign a timestamp to an element in the source directly, the source must use the collectWithTimestamp(...) method on the SourceContext. To generate watermarks, the source must call the emitWatermark(Watermark) function.
For example, the MySQL data source with Spring from an earlier post was implemented like this:
@Override
public void run(SourceContext<UrlInfo> sourceContext) throws Exception {
    log.info("------query ");
    if (urlInfoManager == null) {
        init();
    }
    List<UrlInfo> urlInfoList = urlInfoManager.queryAll();
    // Emit sequentially, so the SourceContext is only used from the source thread
    urlInfoList.forEach(sourceContext::collect);
}
To attach timestamps, call collectWithTimestamp; to generate watermarks, call emitWatermark.
After the change:
@Override
public void run(SourceContext<UrlInfo> sourceContext) throws Exception {
    log.info("------query ");
    if (urlInfoManager == null) {
        init();
    }
    List<UrlInfo> urlInfoList = urlInfoManager.queryAll();
    // Emit sequentially, so the SourceContext is only used from the source thread
    for (UrlInfo urlInfo : urlInfoList) {
        // Use the event's own time when present, otherwise fall back to the wall clock
        long timestamp = urlInfo.getCurrentTime() == null
                ? System.currentTimeMillis()
                : urlInfo.getCurrentTime().getTime();
        // Attach the timestamp; this replaces collect(), so the element is emitted only once
        sourceContext.collectWithTimestamp(urlInfo, timestamp);
        // Generate the watermark
        sourceContext.emitWatermark(new Watermark(timestamp));
    }
}
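Note the two added calls, collectWithTimestamp and emitWatermark: both the timestamp and the watermark are produced for every single element.
Assigning via Timestamp Assigners / Watermark Generators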
Timestamp assigners take a stream and produce a new stream with timestamped elements and watermarks. If the original stream had timestamps and/or watermarks already, the timestamp assigner overwrites them.
Timestamp assigners are usually specified immediately after the data source, but it is not strictly required to do so. A common pattern, for example, is to parse (MapFunction) and filter (FilterFunction) before the timestamp assigner. In any case, the timestamp assigner needs to be specified before the first operation on event time (such as the first window operation). As a special case, when using Kafka as the source of a streaming job, Flink allows the specification of a timestamp assigner / watermark emitter inside the source (or consumer) itself. More information on how to do so can be found in the Kafka Connector documentation.
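A timestamp assigner takes a stream as input and outputs a stream whose elements carry timestamps and watermarks. If the stream already had timestamps or watermarks, they are overwritten.
A timestamp assigner is usually specified right after the data source is created, but that is not mandatory. A common pattern is to specify it after parse and filter steps. In any case, it must be specified before the first operation that works on event time.
First, an example: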
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    MysqlDSWithSpringForFlink streamSource = new MysqlDSWithSpringForFlink();
    DataStreamSource<UrlInfo> dataStreamSource = env.addSource(streamSource);

    SingleOutputStreamOperator<UrlInfo> withTimestampAndWatermarkStream =
            dataStreamSource
                    // equals() compares by value (assumes getDomain() returns an object such as a String)
                    .filter(o -> UrlInfo.BAIDU.equals(o.getDomain()))
                    .assignTimestampsAndWatermarks(new MyTimestampAndWatermarkAssigner());

    // Print the stream that actually carries the timestamps and watermarks
    withTimestampAndWatermarkStream.addSink(new PrintSinkFunction<>());

    env.execute("mysql Datasource with pool and spring");
}
As you can see, assignTimestampsAndWatermarks is applied right after the filter.
With Periodic Watermarks (adding watermarks periodically)
AssignerWithPeriodicWatermarks assigns timestamps and generates watermarks periodically (possibly depending on the stream elements, or purely based on processing time).
The interval (every n milliseconds) in which the watermark will be generated is defined via ExecutionConfig.setAutoWatermarkInterval(...). The assigner's getCurrentWatermark() method will be called each time, and a new watermark will be emitted if the returned watermark is non-null and larger than the previous watermark.
To generate watermarks periodically instead of for every element, implement the AssignerWithPeriodicWatermarks interface. The interval is given in milliseconds and is set through ExecutionConfig.setAutoWatermarkInterval.
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    // Set the watermark interval (milliseconds)
    ExecutionConfig config = env.getConfig();
    config.setAutoWatermarkInterval(300);

    MysqlDSWithSpringForFlink streamSource = new MysqlDSWithSpringForFlink();
    DataStreamSource<UrlInfo> dataStreamSource = env.addSource(streamSource);

    SingleOutputStreamOperator<UrlInfo> withTimestampAndWatermarkStream =
            dataStreamSource
                    // equals() compares by value (assumes getDomain() returns an object such as a String)
                    .filter(o -> UrlInfo.BAIDU.equals(o.getDomain()))
                    .assignTimestampsAndWatermarks(new TimeLagWatermarkGenerator());

    // Print the stream that actually carries the timestamps and watermarks
    withTimestampAndWatermarkStream.addSink(new PrintSinkFunction<>());

    env.execute("mysql Datasource with pool and spring");
}
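As you can see, the watermark interval is set through ExecutionConfig, and a TimeLagWatermarkGenerator is attached after the filter. Its code is as follows (taken from the official documentation with small changes):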
import myflink.model.UrlInfo;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

/**
 * This generator generates watermarks that are lagging behind processing time by a fixed amount.
 * It assumes that elements arrive in Flink after a bounded delay.
 */
public class TimeLagWatermarkGenerator implements AssignerWithPeriodicWatermarks<UrlInfo> {

    private final long maxTimeLag = 5000; // 5 seconds

    @Override
    public long extractTimestamp(UrlInfo element, long previousElementTimestamp) {
        // the event timestamp comes from the element itself
        return element.getCurrentTime().getTime();
    }

    @Override
    public Watermark getCurrentWatermark() {
        // return the watermark as current time minus the maximum time lag
        return new Watermark(System.currentTimeMillis() - maxTimeLag);
    }
}
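The generator above derives its watermark from processing time. A different periodic variant, which the official documentation describes as bounded out-of-orderness, derives it from the largest event timestamp seen so far minus a fixed delay. The following plain-Java sketch models only that arithmetic, without Flink on the classpath. BoundedOutOfOrdernessSketch, onElement and onPeriodicTick are names invented here; they stand in for extractTimestamp and getCurrentWatermark and are not Flink API.

```java
/** Toy model of a periodic, bounded-out-of-orderness watermark generator (hypothetical). */
final class BoundedOutOfOrdernessSketch {
    private final long maxOutOfOrderness;
    private long currentMaxTimestamp = Long.MIN_VALUE;
    private long lastEmitted = Long.MIN_VALUE;

    BoundedOutOfOrdernessSketch(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrderness = maxOutOfOrdernessMillis;
    }

    /** Called per element: remembers the largest timestamp seen and returns the element's timestamp. */
    long onElement(long timestamp) {
        currentMaxTimestamp = Math.max(currentMaxTimestamp, timestamp);
        return timestamp;
    }

    /** Called on each interval tick: returns a new watermark, or null when there is nothing new to emit. */
    Long onPeriodicTick() {
        if (currentMaxTimestamp == Long.MIN_VALUE) {
            return null; // no element seen yet
        }
        long candidate = currentMaxTimestamp - maxOutOfOrderness;
        if (candidate <= lastEmitted) {
            return null; // only a larger watermark is emitted
        }
        lastEmitted = candidate;
        return candidate;
    }

    public static void main(String[] args) {
        BoundedOutOfOrdernessSketch generator = new BoundedOutOfOrdernessSketch(3_500L);
        generator.onElement(10_000L);
        generator.onElement(9_000L);                    // out of order, does not lower the maximum
        System.out.println(generator.onPeriodicTick()); // 6500
        System.out.println(generator.onPeriodicTick()); // null
        generator.onElement(12_000L);
        System.out.println(generator.onPeriodicTick()); // 8500
    }
}
```

The trade-off between the two variants: the time-lag version keeps advancing even when no data arrives, while this one only advances when newer events show up.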
With Punctuated Watermarks (emitted irregularly, triggered by events)
To generate watermarks whenever a certain event indicates that a new watermark might be generated, use AssignerWithPunctuatedWatermarks. For this class Flink will first call the extractTimestamp(...) method to assign the element a timestamp, and then immediately call the checkAndGetNextWatermark(...) method on that element.
The checkAndGetNextWatermark(...) method is passed the timestamp that was assigned in the extractTimestamp(...) method, and can decide whether it wants to generate a watermark. Whenever the checkAndGetNextWatermark(...) method returns a non-null watermark, and that watermark is larger than the latest previous watermark, that new watermark will be emitted.
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    MysqlDSWithSpringForFlink streamSource = new MysqlDSWithSpringForFlink();
    DataStreamSource<UrlInfo> dataStreamSource = env.addSource(streamSource);

    SingleOutputStreamOperator<UrlInfo> withTimestampAndWatermarkStream =
            dataStreamSource
                    // equals() compares by value (assumes getDomain() returns an object such as a String)
                    .filter(o -> UrlInfo.BAIDU.equals(o.getDomain()))
                    .assignTimestampsAndWatermarks(new PunctuatedAssigner());

    // Print the stream that actually carries the timestamps and watermarks
    withTimestampAndWatermarkStream.addSink(new PrintSinkFunction<>());

    env.execute("mysql Datasource with pool and spring");
}
import myflink.model.UrlInfo;
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

public class PunctuatedAssigner implements AssignerWithPunctuatedWatermarks<UrlInfo> {

    @Override
    public long extractTimestamp(UrlInfo element, long previousElementTimestamp) {
        return element.getCurrentTime().getTime();
    }

    @Override
    public Watermark checkAndGetNextWatermark(UrlInfo lastElement, long extractedTimestamp) {
        // Emit a watermark (timestamp in milliseconds) only for elements flagged as markers;
        // this relies on UrlInfo providing hasWatermarkMarker()
        return lastElement.hasWatermarkMarker() ? new Watermark(extractedTimestamp) : null;
    }
}
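To see which watermarks such an assigner would end up emitting, here is a small plain-Java simulation of the two rules quoted above: a candidate watermark is produced only for marker events, and it is emitted only if it is larger than the previous one. PunctuatedSketch and its Event class are hypothetical stand-ins for the assigner and UrlInfo, not Flink classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy simulation of punctuated watermark emission (hypothetical, not Flink code). */
final class PunctuatedSketch {

    /** Hypothetical event: a timestamp plus a flag saying it may trigger a watermark. */
    static final class Event {
        final long timestamp;
        final boolean watermarkMarker;

        Event(long timestamp, boolean watermarkMarker) {
            this.timestamp = timestamp;
            this.watermarkMarker = watermarkMarker;
        }
    }

    /** Returns the watermarks that would be emitted for the given events, in order. */
    static List<Long> emittedWatermarks(List<Event> events) {
        List<Long> emitted = new ArrayList<>();
        long lastWatermark = Long.MIN_VALUE;
        for (Event event : events) {
            // Plays the role of checkAndGetNextWatermark: null means "no watermark for this event"
            Long candidate = event.watermarkMarker ? Long.valueOf(event.timestamp) : null;
            if (candidate != null && candidate > lastWatermark) {
                lastWatermark = candidate;
                emitted.add(candidate);
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event(1, false),
                new Event(5, true),
                new Event(3, true),   // marker, but not larger than 5, so nothing is emitted
                new Event(8, false),
                new Event(9, true));
        System.out.println(emittedWatermarks(events)); // [5, 9]
    }
}
```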
Kafka
When using Apache Kafka as a data source, each Kafka partition may have a simple event time pattern (ascending timestamps or bounded out-of-orderness). However, when consuming streams from Kafka, multiple partitions often get consumed in parallel, interleaving the events from the partitions and destroying the per-partition patterns (this is inherent in how Kafka’s consumer clients work).
In that case, you can use Flink’s Kafka-partition-aware watermark generation. Using that feature, watermarks are generated inside the Kafka consumer, per Kafka partition, and the per-partition watermarks are merged in the same way as watermarks are merged on stream shuffles.
For example, if event timestamps are strictly ascending per Kafka partition, generating per-partition watermarks with the ascending timestamps watermark generator will result in perfect overall watermarks.
The illustrations below show how to use the per-Kafka-partition watermark generation, and how watermarks propagate through the streaming dataflow in that case.
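Kafka has multiple partitions, and each partition may follow its own simple event time pattern. On the consumer side, however, several partitions are read in parallel and their events get interleaved, which destroys the per-partition pattern.
In that case you can use Flink's Kafka-partition-aware watermark generation, as in the following code: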
public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    Properties properties = new Properties();
    properties.put("bootstrap.servers", "localhost:9092");
    properties.put("zookeeper.connect", "localhost:2181");
    properties.put("group.id", "metric-group");
    properties.put("auto.offset.reset", "latest");
    properties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    properties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    FlinkKafkaConsumer010<String> kafkaConsumer = new FlinkKafkaConsumer010<>(
            "testjin", // topic
            new SimpleStringSchema(),
            properties);

    // Attach the assigner to the consumer itself, so that watermarks are generated per Kafka partition.
    // Calling assignTimestampsAndWatermarks on the stream after the source would not be partition-aware.
    kafkaConsumer.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<String>() {
        @Override
        public long extractAscendingTimestamp(String element) {
            // The consumer emits raw strings, so the timestamp is read from the parsed JSON
            return JSON.parseObject(element, UrlInfo.class).getCurrentTime().getTime();
        }
    });

    SingleOutputStreamOperator<UrlInfo> dataStreamSource = env.addSource(kafkaConsumer)
            .setParallelism(1)
            // map: transforms one stream into another, here String --> UrlInfo
            .map(string -> JSON.parseObject(string, UrlInfo.class));

    env.execute("save url to db");
}
Note that an AscendingTimestampExtractor is used, i.e. a timestamp assigner for timestamps that arrive in ascending order.
References:
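Why per-partition generation matters can be checked with a few numbers. The sketch below is plain Java with made-up data and names (PerPartitionWatermarkSketch is hypothetical, not Flink's Kafka connector code). It interleaves two partitions that are each strictly ascending, counts how often the merged order goes backwards, and then computes a merged watermark as the minimum over the partitions' latest timestamps.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of per-partition watermarks for interleaved Kafka partitions (hypothetical). */
final class PerPartitionWatermarkSketch {

    /** Counts how often a timestamp is smaller than its predecessor in the given order. */
    static int ascendingViolations(long[] timestamps) {
        int violations = 0;
        for (int i = 1; i < timestamps.length; i++) {
            if (timestamps[i] < timestamps[i - 1]) {
                violations++;
            }
        }
        return violations;
    }

    /**
     * Each record is {partition, timestamp}, in consumption order.
     * Returns the minimum over all partitions of that partition's latest timestamp.
     */
    static long mergedWatermark(long[][] records) {
        Map<Long, Long> latestPerPartition = new HashMap<>();
        for (long[] record : records) {
            latestPerPartition.merge(record[0], record[1], Long::max);
        }
        return latestPerPartition.values().stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(Long.MIN_VALUE);
    }

    public static void main(String[] args) {
        // Partition 0 holds 1, 4, 7 and partition 1 holds 2, 3, 5; both are ascending on their own.
        long[][] interleaved = {{0, 1}, {1, 2}, {0, 4}, {1, 3}, {0, 7}, {1, 5}};
        long[] mergedOrder = {1, 2, 4, 3, 7, 5};
        System.out.println(ascendingViolations(mergedOrder)); // 2: the merged stream is no longer ascending
        System.out.println(mergedWatermark(interleaved));     // 5: min(7, 5)
    }
}
```

With a merged watermark of 5, every element with a timestamp up to 5 has indeed been seen, while an ascending extractor applied to the merged order would have met two timestamps that go backwards.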
http://www.54tianzhisheng.cn/2018/12/11/Flink-time/
https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/event_time.html
https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/event_timestamps_watermarks.html