The Dataflow Programming Model in Flink

2019-08-01  kerwinX

This article is based on the official Flink documentation, quoted throughout so readers can verify it against the source. Original: Dataflow Programming Model

Levels of Abstraction

Flink offers different levels of abstraction for developing streaming/batch applications: from lowest to highest, stateful stream processing (the ProcessFunction), the core DataStream/DataSet APIs, the Table API, and SQL.

[Figure: levels of abstraction]

Programs and Dataflows

The basic building blocks of Flink programs are streams and transformations. (Note that the DataSets used in Flink’s DataSet API are also streams internally – more about that later.) Conceptually a stream is a (potentially never-ending) flow of data records, and a transformation is an operation that takes one or more streams as input, and produces one or more output streams as a result.


When executed, Flink programs are mapped to streaming dataflows, consisting of streams and transformation operators. Each dataflow starts with one or more sources and ends in one or more sinks. The dataflows resemble arbitrary directed acyclic graphs (DAGs). Although special forms of cycles are permitted via iteration constructs, for the most part we will gloss over this for simplicity.

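
The source → transformation → sink structure above can be sketched in plain Python (illustrative only; Flink's actual APIs are Java/Scala). A generator stands in for a potentially never-ending stream of records, and the sink bounds the stream so the example terminates:

```python
# A minimal sketch of a streaming dataflow: a source produces records, a
# transformation consumes one stream and produces another, and a sink
# terminates the dataflow. This is plain Python, not Flink code.
def source():
    """Source: a (potentially never-ending) flow of data records."""
    n = 0
    while True:
        yield n
        n += 1

def map_transform(stream, fn):
    """Transformation: takes a stream as input, produces a stream as output."""
    for record in stream:
        yield fn(record)

def sink(stream, limit):
    """Sink: consumes the stream (bounded here so the example terminates)."""
    out = []
    for record in stream:
        out.append(record)
        if len(out) == limit:
            return out

results = sink(map_transform(source(), lambda x: x * 2), limit=5)
print(results)  # [0, 2, 4, 6, 8]
```

Note that nothing is materialized: each record flows through the whole pipeline as it is produced, which is exactly why unbounded streams pose no problem.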

[Figure: program dataflow]

Often there is a one-to-one correspondence between the transformations in the programs and the operators in the dataflow. Sometimes, however, one transformation may consist of multiple transformation operators.

Sources and sinks are documented in the streaming connectors and batch connectors docs. Transformations are documented in DataStream operators and DataSet transformations.


Parallel Dataflows

Programs in Flink are inherently parallel and distributed. During execution, a stream has one or more stream partitions, and each operator has one or more operator subtasks. The operator subtasks are independent of one another, and execute in different threads and possibly on different machines or containers.

The number of operator subtasks is the parallelism of that particular operator. The parallelism of a stream is always that of its producing operator. Different operators of the same program may have different levels of parallelism.


[Figure: parallel dataflow]

Streams can transport data between two operators in a one-to-one (or forwarding) pattern, or in a redistributing pattern:

  • One-to-one streams (for example between the Source and the map() operators in the figure above) preserve the partitioning and ordering of the elements. That means that subtask[1] of the map() operator will see the same elements in the same order as they were produced by subtask[1] of the Source operator.
  • Redistributing streams (as between map() and keyBy/window above, as well as between keyBy/window and Sink) change the partitioning of streams. Each operator subtask sends data to different target subtasks, depending on the selected transformation. Examples are keyBy() (which re-partitions by hashing the key), broadcast(), or rebalance() (which re-partitions randomly). In a redistributing exchange the ordering among the elements is only preserved within each pair of sending and receiving subtasks (for example, subtask[1] of map() and subtask[2] of keyBy/window). So in this example, the ordering within each key is preserved, but the parallelism does introduce non-determinism regarding the order in which the aggregated results for different keys arrive at the sink.
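
A keyBy()-style redistributing exchange can be sketched in plain Python (this is an illustration, not Flink's implementation — Flink actually hashes keys into key groups; the simple hash-modulo routing here is a stand-in):

```python
# Sketch of a redistributing exchange: each record is routed to a downstream
# operator subtask by hashing its key, so all records carrying the same key
# reach the same subtask. zlib.crc32 is used because it is deterministic
# across runs (Python's built-in hash() is salted for strings).
import zlib

PARALLELISM = 4

def target_subtask(key: str) -> int:
    """Pick the receiving subtask for a key: hash, then modulo parallelism."""
    return zlib.crc32(key.encode()) % PARALLELISM

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
subtasks = {i: [] for i in range(PARALLELISM)}
for key, value in records:
    subtasks[target_subtask(key)].append((key, value))

# All records for key "a" arrive at one subtask, and their relative order
# (1 before 3) is preserved within that sender/receiver pair.
assert [v for k, v in subtasks[target_subtask("a")] if k == "a"] == [1, 3]
```

Records for different keys may land on different subtasks and be processed concurrently, which is the source of the cross-key non-determinism described above.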

Details about configuring and controlling parallelism can be found in the docs on parallel execution.


Windows

Aggregating events (e.g., counts, sums) works differently on streams than in batch processing. For example, it is impossible to count all elements in a stream, because streams are in general infinite (unbounded). Instead, aggregates on streams (counts, sums, etc), are scoped by windows, such as “count over the last 5 minutes”, or “sum of the last 100 elements”.

Windows can be time driven (example: every 30 seconds) or data driven (example: every 100 elements). One typically distinguishes different types of windows, such as tumbling windows (no overlap), sliding windows (with overlap), and session windows (punctuated by a gap of inactivity).

More window examples can be found in this blog post. More details are in the window docs.

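
The data-driven window shapes can be sketched over a bounded sample in plain Python (not Flink's window API): tumbling windows never overlap, sliding windows do.

```python
# Count windows over a bounded list: tumbling windows partition the input,
# sliding windows advance by a smaller "slide" and therefore overlap.
def tumbling(elements, size):
    return [elements[i:i + size] for i in range(0, len(elements), size)]

def sliding(elements, size, slide):
    return [elements[i:i + size]
            for i in range(0, len(elements) - size + 1, slide)]

data = [1, 2, 3, 4, 5, 6]
print(tumbling(data, size=2))          # [[1, 2], [3, 4], [5, 6]]
print(sliding(data, size=3, slide=1))  # [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]

# "Sum of the last 2 elements" as an aggregate scoped by tumbling windows:
sums = [sum(w) for w in tumbling(data, size=2)]
print(sums)  # [3, 7, 11]
```

A real stream would emit each window's aggregate as soon as that window closes, rather than collecting all windows up front.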

[Figure: windows]

更多的窗口例子可以在这篇博文中找到 blog post。更多细节在windows docs中。

Time

When referring to time in a streaming program (for example to define windows), one can refer to different notions of time:

  • Event Time is the time when an event was created. It is usually described by a timestamp in the events, for example attached by the producing sensor, or the producing service. Flink accesses event timestamps via timestamp assigners.
  • Ingestion time is the time when an event enters the Flink dataflow at the source operator.
  • Processing Time is the local time at each operator that performs a time-based operation.
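
The three notions of time attach to a record at different points in its life. A sketch in plain Python (the field names are illustrative, not Flink API):

```python
# One record passes through producer, source, and operator; each stage
# observes a different notion of time.
import time

def make_event(payload, event_time):
    # Event time: stamped by the producer (e.g., a sensor) when the event occurred.
    return {"payload": payload, "event_time": event_time}

def ingest(event):
    # Ingestion time: stamped when the event enters the dataflow at the source.
    event["ingestion_time"] = time.time()
    return event

def process(event):
    # Processing time: the local clock of the operator performing a
    # time-based operation.
    event["processing_time"] = time.time()
    return event

# An event that occurred on 2019-08-01 but is processed now:
e = process(ingest(make_event("reading", event_time=1_564_617_600.0)))
assert e["event_time"] <= e["ingestion_time"] <= e["processing_time"]
```

The gap between event time and processing time is exactly what watermarking and the event-time machinery exist to manage.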

More details on how to handle time are in the event time docs.


[Figure: event time, ingestion time, and processing time]


Stateful Operations

While many operations in a dataflow simply look at one individual event at a time (for example an event parser), some operations remember information across multiple events (for example window operators). These operations are called stateful.

The state of stateful operations is maintained in what can be thought of as an embedded key/value store. The state is partitioned and distributed strictly together with the streams that are read by the stateful operators. Hence, access to the key/value state is only possible on keyed streams, after a keyBy() function, and is restricted to the values associated with the current event’s key. Aligning the keys of streams and state makes sure that all state updates are local operations, guaranteeing consistency without transaction overhead. This alignment also allows Flink to redistribute the state and adjust the stream partitioning transparently.
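
Keyed state can be sketched in plain Python (illustrative, not Flink's state API): after a keyBy()-style partitioning, each subtask holds an embedded key/value store and may only read or update the entry for the current record's key, so every state update is a local operation.

```python
# A stateful operator that keeps a running sum per key. Access to the
# embedded key/value store is scoped to the current record's key.
from collections import defaultdict

class SumPerKey:
    def __init__(self):
        self.state = defaultdict(int)   # embedded key/value store

    def process(self, key, value):
        # Only the entry for this record's key is touched - a local update.
        self.state[key] += value
        return key, self.state[key]

op = SumPerKey()
outputs = [op.process(k, v) for k, v in [("a", 1), ("b", 2), ("a", 3)]]
print(outputs)  # [('a', 1), ('b', 2), ('a', 4)]
```

Because state and stream share the same partitioning, moving a key's records to another subtask only requires moving that key's state entry along with them — which is what lets Flink rescale transparently.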

For more information, see the documentation on state.


[Figure: state partitioning]


Checkpoints for Fault Tolerance

Flink implements fault tolerance using a combination of stream replay and checkpointing. A checkpoint is related to a specific point in each of the input streams along with the corresponding state for each of the operators. A streaming dataflow can be resumed from a checkpoint while maintaining consistency (exactly-once processing semantics) by restoring the state of the operators and replaying the events from the point of the checkpoint.

The checkpoint interval is a means of trading off the overhead of fault tolerance during execution with the recovery time (the number of events that need to be replayed).

The description of the fault tolerance internals provides more information about how Flink manages checkpoints and related topics. Details about enabling and configuring checkpointing are in the checkpointing API docs.
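
Checkpoint-and-replay recovery can be sketched in plain Python (a simplification; Flink's checkpoints are asynchronous, distributed snapshots): snapshot the operator state together with the input position, then on failure restore the state and replay from that position, reproducing the failure-free result.

```python
# A running sum over a replayable input, recovered from a checkpoint.
events = [1, 2, 3, 4, 5]

def run(events, state=0, start=0):
    """Process events[start:] on top of a restored state."""
    for i in range(start, len(events)):
        state += events[i]
    return state

# Take a checkpoint after processing two events: state + input offset.
checkpoint = {"offset": 2, "state": run(events[:2])}

# Simulate a crash after the checkpoint; recover by restoring the state
# and replaying events from the checkpointed offset.
recovered = run(events, state=checkpoint["state"], start=checkpoint["offset"])
assert recovered == run(events)  # same result as a failure-free run
print(recovered)  # 15
```

The checkpoint interval trade-off is visible here: checkpointing more often shrinks the replay segment (`events[offset:]`) at the cost of more snapshots during normal execution.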


Batch on Streaming

Flink executes batch programs as a special case of streaming programs, where the streams are bounded (finite number of elements). A DataSet is treated internally as a stream of data. The concepts above thus apply to batch programs in the same way as well as they apply to streaming programs, with minor exceptions:

  • Fault tolerance for batch programs does not use checkpointing. Recovery happens by fully replaying the streams. That is possible, because inputs are bounded. This pushes the cost more towards the recovery, but makes the regular processing cheaper, because it avoids checkpoints.
  • Stateful operations in the DataSet API use simplified in-memory/out-of-core data structures, rather than key/value indexes.
  • The DataSet API introduces special synchronized (superstep-based) iterations, which are only possible on bounded streams. For details, check out the iteration docs.
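
The first point can be sketched in plain Python (illustrative only): the same transformation code runs over a bounded stream, and recovery needs no checkpoint because the finite input can simply be replayed in full.

```python
# Batch as a special case of streaming: a "DataSet" is just a bounded stream,
# and fault tolerance degenerates to replaying the whole input.
def map_transform(stream, fn):
    for record in stream:
        yield fn(record)

bounded_input = [1, 2, 3]               # finite, replayable input
first_run = list(map_transform(iter(bounded_input), lambda x: x + 1))

# On failure, re-run over the replayable bounded input - no checkpoint needed.
replayed = list(map_transform(iter(bounded_input), lambda x: x + 1))
assert replayed == first_run
print(first_run)  # [2, 3, 4]
```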


Next Steps

Next, continue with the basic concepts of Flink's Distributed Runtime.
