
Fully Understanding How Flink Kafka OffsetState Is Stored

2020-07-03  shengjk1

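The earlier post "Flink Consuming Kafka for Busy People" (写给大忙人看的Flink 消费 Kafka) walked through how Flink consumes Kafka at the source-code level. One point was left unclear there: how do offsets get stored in state?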

How Kafka offsets are stored in state

This post picks up where "Flink Consuming Kafka for Busy People" left off.

// get the records for each topic partition
// partitionDiscoverer.discoverPartitions has already ensured that
// subscribedPartitionStates contains only this task's KafkaTopicPartitions
for (KafkaTopicPartitionState<TopicPartition> partition : subscribedPartitionStates()) {
    // take only the records that belong to this task
    List<ConsumerRecord<byte[], byte[]>> partitionRecords =
        records.records(partition.getKafkaPartitionHandle());

    for (ConsumerRecord<byte[], byte[]> record : partitionRecords) {
        // the deserializer passed in, i.e. the user-defined deserializationSchema
        final T value = deserializer.deserialize(record);

        // when the user-defined deserializationSchema's isEndOfStream returns true,
        // running is set to false and the fetch loop stops
        if (deserializer.isEndOfStream(value)) {
            // end of stream signaled
            running = false;
            break;
        }

        // emit the actual record. this also updates offset state atomically
        // and deals with timestamps and watermark generation
        emitRecord(value, partition, record.offset(), record);
    }
}
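The control flow around isEndOfStream is easy to misread, so here is a minimal sketch of the same loop shape using plain strings instead of Kafka records. The class name, the "END" marker, and the sample batch are all hypothetical and exist only for illustration. It shows that `break` leaves only the current partition's loop: records after the marker in that partition are skipped, while the remaining partitions of the same batch are still processed before `running` is checked again.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the fetch loop body; not Flink code.
final class EndOfStreamSketch {
    // Hypothetical end marker; a real schema decides this inside isEndOfStream.
    static final String END_MARKER = "END";

    boolean running = true;
    final List<String> emitted = new ArrayList<>();

    // Processes one polled batch, grouped by partition id.
    void processBatch(Map<Integer, List<String>> batch) {
        for (Map.Entry<Integer, List<String>> partition : batch.entrySet()) {
            for (String value : partition.getValue()) {
                if (END_MARKER.equals(value)) {
                    running = false;
                    break; // leaves only this partition's inner loop
                }
                emitted.add(value);
            }
        }
    }

    // Builds a two-partition batch and runs it through the loop.
    static EndOfStreamSketch demo() {
        Map<Integer, List<String>> batch = new LinkedHashMap<>();
        batch.put(0, Arrays.asList("a", "END", "b"));
        batch.put(1, Arrays.asList("c", "d"));
        EndOfStreamSketch sketch = new EndOfStreamSketch();
        sketch.processBatch(batch);
        return sketch;
    }

    public static void main(String[] args) {
        EndOfStreamSketch sketch = demo();
        // prints: [a, c, d] running=false
        System.out.println(sketch.emitted + " running=" + sketch.running);
    }
}
```

In the sketch, "b" is never emitted, and no offset would be recorded for the marker itself, because the emit call is skipped for it.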

Here the subscribedPartitionStates() method simply returns the subscribedPartitionStates field.
Following the call chain further down, we arrive at

protected void emitRecordWithTimestamp(
        T record, KafkaTopicPartitionState<KPH> partitionState, long offset, long timestamp) throws Exception {

    if (record != null) {
        // no watermarks
        if (timestampWatermarkMode == NO_TIMESTAMPS_WATERMARKS) {
            // fast path logic, in case there are no watermarks generated in the fetcher

            // emit the record, using the checkpoint lock to guarantee
            // atomicity of record emission and offset state update
            synchronized (checkpointLock) {
                sourceContext.collectWithTimestamp(record, timestamp);
                // set the offset in state (this actually updates subscribedPartitionStates;
                // snapshotState later reads the values from subscribedPartitionStates)
                partitionState.setOffset(offset);
            }
        } else if (timestampWatermarkMode == PERIODIC_WATERMARKS) {
            emitRecordWithTimestampAndPeriodicWatermark(record, partitionState, offset, timestamp);
        } else {
            emitRecordWithTimestampAndPunctuatedWatermark(record, partitionState, offset, timestamp);
        }
    } else {
        // if the record is null, simply just update the offset state for partition
        synchronized (checkpointLock) {
            partitionState.setOffset(offset);
        }
    }
}

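Note the order: the offset is written into subscribedPartitionStates only after sourceContext has emitted the record, and both steps run under checkpointLock, so a snapshot never observes one without the other.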
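To make the role of the lock concrete, the following is a minimal sketch, not Flink code: the class names, the single-partition setup, and the -1 starting offset are assumptions made for illustration. A writer thread emits records and updates the offset under one lock while the main thread takes snapshots under the same lock and checks that the number of emitted records always equals the last offset plus one. For simplicity the sketch takes the lock inside its snapshot method; the real connector may arrange lock ownership differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for KafkaTopicPartitionState: a mutable offset holder.
final class PartitionStateSketch {
    private long offset = -1L; // sketch-only sentinel: nothing emitted yet

    long getOffset() { return offset; }
    void setOffset(long offset) { this.offset = offset; }
}

// Hypothetical stand-in for the fetcher together with the source context.
final class CheckpointLockSketch {
    private final Object checkpointLock = new Object();
    private final List<String> emitted = new ArrayList<>();
    private final PartitionStateSketch partitionState = new PartitionStateSketch();

    // Mirrors the fast path: emission and offset update share one lock.
    void emitRecord(String record, long offset) {
        synchronized (checkpointLock) {
            emitted.add(record);
            partitionState.setOffset(offset);
        }
    }

    // Reads {emitted record count, last offset} under the same lock.
    long[] snapshot() {
        synchronized (checkpointLock) {
            return new long[] {emitted.size(), partitionState.getOffset()};
        }
    }

    // Returns true if every snapshot taken while the writer ran was consistent.
    static boolean run(int records) {
        CheckpointLockSketch sketch = new CheckpointLockSketch();
        Thread writer = new Thread(() -> {
            for (int i = 0; i < records; i++) {
                sketch.emitRecord("record-" + i, i);
            }
        });
        writer.start();
        boolean consistent = true;
        while (writer.isAlive()) {
            long[] s = sketch.snapshot();
            // Offsets start at 0, so count == offset + 1 when both updates are seen together.
            consistent &= (s[0] == s[1] + 1);
        }
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        long[] last = sketch.snapshot();
        return consistent && last[0] == records && last[1] == records - 1;
    }

    public static void main(String[] args) {
        System.out.println("all snapshots consistent: " + run(100_000));
    }
}
```

If the `synchronized` block in `emitRecord` were removed, a snapshot could land between the two statements and report a record as emitted without its offset, which is the inconsistency the checkpoint lock is there to prevent.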

When FlinkKafkaConsumer takes a snapshot, it reads the values of subscribedPartitionStates from the fetcher.

// read the current offsets from the fetcher's subscribedPartitionStates
HashMap<KafkaTopicPartition, Long> currentOffsets = fetcher.snapshotCurrentState();

if (offsetCommitMode == OffsetCommitMode.ON_CHECKPOINTS) {
    // the map cannot be asynchronously updated, because only one checkpoint call can happen
    // on this function at a time: either snapshotState() or notifyCheckpointComplete()
    pendingOffsetsToCommit.put(context.getCheckpointId(), currentOffsets);
}

for (Map.Entry<KafkaTopicPartition, Long> kafkaTopicPartitionLongEntry : currentOffsets.entrySet()) {
    unionOffsetStates.add(
            Tuple2.of(kafkaTopicPartitionLongEntry.getKey(), kafkaTopicPartitionLongEntry.getValue()));
}

With that, whenever a checkpoint is taken, the corresponding offsets are written into state.
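The snapshot step can be sketched without any Flink dependency. The code below is an illustration under stated assumptions, not the connector's implementation: a plain list stands in for unionOffsetStates, string keys such as "orders-0" stand in for KafkaTopicPartition, the topic name is made up, and clearing the list before refilling it is a choice of this sketch. It shows that the snapshot works on a copy of the live offsets, so offsets that advance afterwards do not alter what was recorded for that checkpoint id.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the consumer's snapshot logic; not Flink code.
final class OffsetSnapshotSketch {
    // Live offsets as the fetcher's partition states would hold them (key: "topic-partition").
    final Map<String, Long> liveOffsets = new HashMap<>();
    // Stand-in for unionOffsetStates: a plain list instead of Flink operator state.
    final List<Map.Entry<String, Long>> offsetState = new ArrayList<>();
    // Stand-in for pendingOffsetsToCommit, keyed by checkpoint id.
    final Map<Long, Map<String, Long>> pendingOffsetsToCommit = new HashMap<>();

    void snapshotState(long checkpointId, boolean commitOnCheckpoints) {
        // Copy first, so later offset updates cannot change this checkpoint's view.
        Map<String, Long> currentOffsets = new HashMap<>(liveOffsets);
        if (commitOnCheckpoints) {
            pendingOffsetsToCommit.put(checkpointId, currentOffsets);
        }
        // Sketch choice: rebuild the list so it holds only the latest snapshot.
        offsetState.clear();
        for (Map.Entry<String, Long> entry : currentOffsets.entrySet()) {
            offsetState.add(
                new AbstractMap.SimpleImmutableEntry<>(entry.getKey(), entry.getValue()));
        }
    }

    // Takes one snapshot, then lets the live offset advance past it.
    static OffsetSnapshotSketch demo() {
        OffsetSnapshotSketch source = new OffsetSnapshotSketch();
        source.liveOffsets.put("orders-0", 41L);
        source.liveOffsets.put("orders-1", 7L);
        source.snapshotState(1L, true);
        // Consumption continues after the snapshot was taken.
        source.liveOffsets.put("orders-0", 42L);
        return source;
    }

    public static void main(String[] args) {
        OffsetSnapshotSketch source = demo();
        System.out.println("state:   " + source.offsetState);
        System.out.println("pending: " + source.pendingOffsetsToCommit);
        System.out.println("live:    " + source.liveOffsets);
    }
}
```

After `demo()`, the state list and the pending map for checkpoint 1 still hold offset 41 for "orders-0", while the live map has already moved on to 42.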
