Chapter 2 Data Processing Using
Window
The window function allows the grouping of existing KeyedDataStreama by time or other conditions. The following transformation emits groups of records by a time window of 10 seconds
windows
函数允许对已有的KeyedDataStream
通过时间或其他条件进行分组。下面的transformation
通过10秒的时间窗口产生了一组数据。
In Java
inputStream. keyBy (0).window (TumblingEventTimeWindows.of (Time.seconde (10)));
In Scala:
inputStream. keyBy (0).window (TumblingEventTimeWindows.of (Time.seconde (10)))
Flink defines slices of data in order to process (potentially) infinite data streams. These slices are called windows. This slicing helps processing data in chunks by applying transformations. To do windowing on a stream, we need to assign a key on which the distribution can be made and a function which describes what transformations to perform on a windowed stream
Flink定义数据分片,以便处理无穷的数据流。这些分片叫windows
。这个分片通过transformation
有助于处理大块数据。要在流上执行窗口化。我们需要指定key
,基于key
可以实现分布式;还需要一个function
,这个方法描述了在窗口化的流上需要执行什么样的transformation
。
To slice streams into windows, we can use pre-implemented Flink window assigners. We have options such as, tumbling windows, sliding windows, global and session windows.
Flink also allows you to write custom window assigners by extending WindowAssigner
class. Let's try to understand how these various assigners work.
将流切成窗口,我们可以使用预先实现好的Flink 窗口分配器。包括tumbling windows
,sliding windows
,global
and session windows
。
Flink 也允许你写一些自定义的分配器通过继承 WindowAssigner
类。下面我们先了解一下这些内置的分配器是如何工作的。
Global windows
Global windows are never-ending windows unless specified by a trigger. Generally in this case, each element is assigned to one single per-key global Window. If we don't specify any trigger, no computation will ever get triggered.
全局窗口是永远不会结束的窗口,除非指定触发器。通常,在这种场景下,每个元素都会被分配到单独的per-key
全局窗口。如果未指定触发器,则不会触发计算。
Tumbling windows (翻滚窗口,无重叠)
Tumbling windows are created based on certain times. They are fixed-length windows and non over lapping. Tumbling windows should be useful when you need to do computation of elements in specific time. For example, tumbling window of 10 minutes can be used to compute a group of events occurring in 10 minutes time.
Tumbling windows
是基于确定时间的。它们的窗口长度是固定的并且不会有重叠。这种窗口用于当需要对指定时间内的元素进行计算时。举个例子,10分钟的翻滚窗口可以对10分钟内产生的事件进行计算。
Sliding windows(滑动窗口,有重叠)
Sliding windows are like tumbling windows but they are overlapping. They are fixed length windows overlapping the previous ones by a user given window slide parameter .This type of windowing is useful when you want to compute something out of a group of events occurring in a certain time frame.
Sliding windows
与tumbling windows
类似,但它有重叠。它的长度固定但通过用户给定的滑动参数会和上一个窗口有重叠。这种窗口用于你想对固定时间框架内发生的事件进行计算。
Session windows
Session windows are useful when windows boundaries need to be decided upon the input data. Session windows allows flexibility in window start time and window size. We can also provide session gap configuration parameter which indicates how long to wait before considering the session in closed。
Session windows
在根据输入数据确定窗口边界的场景是有用的。Session windows
允许灵活的配置窗口启动时间和窗口大小。我们还可以提供会话间隙配置参数,该参数指定在关闭会话之前需要等待多长时间。
WindowAIl
The windowAll function allows the grouping of regular data streams. Generally this is a non-parallel data transformation, as it runs on non-partitioned streams of data.
windowAll函数允许对常规数据流进行分组。通常这是一个非并行数据 transformation
,因为它运行在非分区的数据流上。
In Java:
inputStream.windowAll (TumblingEventTimeWindows.of (Time.seconda (10)));
In Scala:
inputStream.windowAll (TumblingEventTimeWindows.of (Time.seconde (10)));
Similar to regular data stream functions, we have window data stream functions as well.The only difference is they work on windowed data streams. So window reduce works like the Reduce function, Window fold works like the Fold function, and there are aggregations as well
和普通的数据流一样,窗口数据流也有对应的函数。它们唯一的区别是它们工作在窗口数据流上。Window reduce
的运行和Reduce
方法一样,Window fold
和Fold
方法一样。其他的聚合方法也是如此。
Union
The Union function performs the union of two or more data streams together. This does the combining of data streams in parallel。 If we combine one stream with itself then it outputs each record twice.
Union
方法将两个或多个数据流合并在一起。这个合并是并行的。如果我们将stream
和它自己组合的话,那么每条记录会输出两次。
In Java:
inputStream. union (inputstreaml, inputstream2, ...);
In Scala:
inputStream.union (inputstream1, inputstream2. ...)
Window join
We can also join two data streams by some keys in a common window. The following example shows the joining of two streams in a Window of 5 seconds where the joining condition of the first attribute of the first stream is equal to the second attribute of the other stream
我们也可以将共有窗口的两个流通过key连接起来。下面 的例 子演示了两个流在一个5秒的窗口进行连接;连接条件是第一个流的第一个属于和另一个流的第二个属性相等。
In Java:
inputStream.join (inputStream1).
where (0).equalTo (1)
.window (TumblingEventTimeWindows.of (Time.seconds(5)))
.apply (new JoinFunction (){...});
In Scala:
inputStream. join (inputStream1)
.where (0) .equalTo (1)
.window (TumblingEventTimewindows.of (Time.seconds (5)))
.apply{...}
Split
This function splits the stream into two or more streams based on the criteria. This can be used when you get a mixed stream and you may want to process each data separately
这个方法会根据条件将一个流拆分成两个或多个流。当你拿到一个混合流,然后你想分别去处理它们的数据时,这个方法很有用。
In Java:
SplitStream<Integer> split = inputStream. split (new outputSelector<Integer>() {
@override public Iterable<string> select (Integer value) {
List<String> output = new ArrayList<String> ();
if (value% 2 ==0){
output. add ("even");
}
else {
output.add ("odd");
}
return output);
In Scala:
val split= inputStream. split
(num: Int) =>
(num % 2) match {
case 0 => List ("even")
case 1 => List ("odd")
}
}
Select
This function allows you to select a specific stream from the split stream
该方法允许你从split stream
中选择特定的流。
In Java:
SplitStream split;
DataStream even = split.select ("even");
DataStream odd = split.select ("odd");
DataStream all = split.select ("even", "odd");
In Scala:
val even = split select "even"
val odd = split select "odd"
val all = split.select ("even", "odd")
Project
The Project function allows you to select a sub-set of attributes from the event stream and only sends selected elements to the next processing stream.
Project
方法允许从事件流中选择一个属性的子集,并只把选中的选项发到下一个待处理的流中。
In Java
DataStream<Tuple4<Integer,Double,String,String>>in=//{....}
DataStream<Tuple2<String,String>>out=in.project(3,2);
In Scala:
var in:DateStream[(Int,Double,String,String)]=//{....}
var out=in.project(3,2)
The preceding function selects the attribute numbers 2 and 3 from the given records. The following is the sample input and output records
上个方法从给定的记录集中,将第2和第3个属性选中。下面是简单输入及对应的输出记录。
(1,10.0, A, B)=> (B,A)
(2,20.0,C, D)=> (D,C)