Introduction to Kafka

2017-10-28  狼牙战士

Apache Kafka is a distributed streaming platform. What exactly does that mean?

We think of a streaming platform as having three key capabilities:

  1. It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system.
  2. It lets you store streams of records in a fault-tolerant way.
  3. It lets you process streams of records as they occur.

What is Kafka good for?

It is used for two broad classes of application:

  1. Building real-time streaming data pipelines that reliably get data between systems or applications
  2. Building real-time streaming applications that transform or react to the streams of data

To understand how Kafka does these things, let's dive in and explore Kafka's capabilities from the bottom up.

First, a few concepts:

  1. Kafka is run as a cluster on one or more servers.
  2. The Kafka cluster stores streams of records in categories called topics.
  3. Each record consists of a key, a value, and a timestamp.

Kafka has four core APIs:

  1. The Producer API allows an application to publish a stream of records to one or more Kafka topics.
  2. The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
  3. The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
  4. The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.

In Kafka, communication between the clients and the servers is done with a simple, high-performance, language-agnostic TCP protocol. The protocol is versioned and maintains backwards compatibility with older versions. We provide a Java client for Kafka, but clients are available in many languages.

Topics and Logs

Let's first dive into the core abstraction Kafka provides for a stream of records: the topic.
A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.
For each topic, the Kafka cluster maintains a partitioned log that looks like this:


[figure: the partitioned log for a topic]

Each partition is an ordered, immutable sequence of records that is continually appended to: a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.
The Kafka cluster retains all published records, whether or not they have been consumed, using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size, so storing data for a long time is not a problem.


In fact, the only metadata retained on a per-consumer basis is the offset, or position, of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but since the position is controlled by the consumer, it can in fact consume records in any order it likes. For example, a consumer can reset to an older offset to reprocess data from the past, or skip ahead to the most recent record and start consuming from "now".
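As a concrete sketch of this (using a recent version of the Kafka Java client; the broker address, topic name, and offset below are made-up illustration values), a consumer can take over a partition explicitly and rewind it to an older offset to reprocess past records:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayFromOffset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Take over partition 0 of a hypothetical topic explicitly,
            // bypassing group management so we control the position ourselves.
            TopicPartition partition = new TopicPartition("my-topic", 0);
            consumer.assign(Collections.singletonList(partition));

            // Rewind to an older offset to reprocess past records;
            // seekToEnd() would instead skip ahead and consume from "now".
            consumer.seek(partition, 42L);

            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}
```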
This combination of features means that Kafka consumers are very cheap: they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to "tail" the contents of any topic without changing what is consumed by any existing consumers.
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions, so it can handle an arbitrary amount of data. Second, they act as the unit of parallelism; more on that in a bit.

Distribution

The partitions of the log are distributed over the servers in the Kafka cluster, with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.
Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others, so load is well balanced within the cluster.
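The replication factor is chosen when a topic is created. Here is a minimal sketch with the Java AdminClient (the topic name, partition count, and replication factor are hypothetical) creating a topic whose four partitions are each replicated on three servers, one leader plus two followers apiece:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // A hypothetical topic with 4 partitions, each replicated on 3 servers:
            // one leader plus two followers per partition.
            NewTopic topic = new NewTopic("orders", 4, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```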

Producers

Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load, or it can be done according to some semantic partition function (say, based on some key in the record). More on the use of partitioning in a second!
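As a rough sketch with the Java producer client (the topic name "orders", the key, and the broker address are made up for illustration), keyed records are hashed by the default partitioner, so everything with the same key lands on the same partition:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The default partitioner hashes the key, so both records for
            // "customer-42" land on the same partition and stay in order.
            producer.send(new ProducerRecord<>("orders", "customer-42", "order created"));
            producer.send(new ProducerRecord<>("orders", "customer-42", "order shipped"));
        }
    }
}
```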

Consumers

Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
If all the consumer instances have the same consumer group, then the records will effectively be load-balanced over the consumer instances.
If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
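A minimal sketch with the Java consumer client (the group id, topic name, and broker address are hypothetical): start several copies of this program with the same group.id and the topic's partitions are divided among them; give each copy its own group.id and every copy receives every record:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "logical-subscriber-a");    // instances sharing this id split the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders")); // hypothetical topic
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```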


[Figure: a two-server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.]

More commonly, however, we have found that topics have a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.

The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances, so that each instance is the exclusive consumer of a "fair share" of the partitions at any point in time. This process of maintaining membership in the group is handled dynamically by the Kafka protocol. If new instances join the group, they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.

Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records, this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.

Guarantees

At a high level, Kafka gives the following guarantees:

  1. Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
  2. A consumer instance sees records in the order they are stored in the log.
  3. For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.

More details on these guarantees are given in the design section of the documentation.

Kafka as a Messaging System

How does Kafka's notion of streams compare to a traditional enterprise messaging system?

Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each record goes to one of them; in publish-subscribe, the record is broadcast to all consumers. Each of these two models has a strength and a weakness. The strength of queuing is that it allows you to divide up the processing of data over multiple consumer instances, which lets you scale your processing. Unfortunately, queues aren't multi-subscriber: once one process reads the data, it's gone. Publish-subscribe allows you to broadcast data to multiple processes, but has no way of scaling processing since every message goes to every subscriber.

The consumer group concept in Kafka generalizes these two models. As with a queue, the consumer group allows you to divide up processing over a collection of processes (the members of the consumer group). As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups.

The advantage of Kafka's model is that every topic has both these properties: it can scale processing and it is also multi-subscriber. There is no need to choose one or the other.

Kafka has stronger ordering guarantees than a traditional messaging system, too.

A traditional queue retains records in order on the server, and if multiple consumers consume from the queue, the server hands out records in the order they are stored. However, although the server hands out records in order, the records are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the records is lost in the presence of parallel consumption. Messaging systems often work around this with a notion of "exclusive consumer" that allows only one process to consume from a queue, but of course this means there is no parallelism in processing.

Kafka does it better. By having a notion of parallelism within the topic, the partition, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions, this still balances the load over many consumer instances. Note, however, that there cannot be more consumer instances in a consumer group than partitions.

Kafka as a Storage System

Any message queue that allows publishing messages decoupled from consuming them is effectively acting as a storage system for in-flight messages. What is different about Kafka is that it is a very good storage system.

Data written to Kafka is written to disk and replicated for fault tolerance. Kafka allows producers to wait on acknowledgement, so that a write isn't considered complete until it is fully replicated and guaranteed to persist even if the server written to fails.
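With the Java producer this behavior is controlled by the acks setting. A hedged sketch (the topic name and broker address are illustration values) of a write that blocks until the leader and the in-sync replicas have all acknowledged it:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DurableWrite {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("acks", "all"); // the leader responds only after all in-sync replicas have the write
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() returns a Future; get() blocks until the replicated write is acknowledged.
            producer.send(new ProducerRecord<>("audit-log", "key", "event")).get();
        }
    }
}
```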

The disk structures Kafka uses scale well: Kafka will perform the same whether you have 50 KB or 50 TB of persistent data on the server.

As a result of taking storage seriously and allowing the clients to control their read position, you can think of Kafka as a kind of special-purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation. The documentation has more details on Kafka's commit log storage and replication design.

Kafka for Stream Processing

It isn't enough to just read, write, and store streams of data; the purpose is to enable real-time processing of streams.

In Kafka, a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.

For example, a retail application might take in input streams of sales and shipments, and output a stream of reorders and price adjustments computed off this data.

It is possible to do simple processing directly using the producer and consumer APIs. However, for more complex transformations, Kafka provides a fully integrated Streams API. This allows building applications that do non-trivial processing, computing aggregations off of streams or joining streams together.
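To make this concrete, here is a minimal sketch against a recent version of the Streams API (the application id, topic names, and per-key counting logic are all invented for illustration) that performs a stateful aggregation: counting records per key from an input topic and streaming the running counts to an output topic:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class PageViewCounter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "page-view-counter"); // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> views = builder.stream("page-views"); // hypothetical input topic

        // Stateful aggregation: a running count of records per key,
        // materialized by Kafka itself for fault tolerance.
        KTable<String, Long> counts = views.groupByKey().count();

        // Stream the changelog of counts to a hypothetical output topic.
        counts.toStream().to("page-view-counts", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```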

This facility helps solve the hard problems this type of application faces: handling out-of-order data, reprocessing input as code changes, performing stateful computations, and so on.

The Streams API builds on the core primitives Kafka provides: it uses the producer and consumer APIs for input, uses Kafka for stateful storage, and uses the same group mechanism for fault tolerance among the stream processor instances.

Putting the Pieces Together

This combination of messaging, storage, and stream processing may seem unusual, but it is essential to Kafka's role as a streaming platform.

A distributed file system like HDFS allows storing static files for batch processing. Effectively, a system like this allows storing and processing historical data from the past.

A traditional enterprise messaging system allows processing future messages that will arrive after you subscribe. Applications built in this way process future data as it arrives.

Kafka combines both of these capabilities, and the combination is critical both for Kafka's use as a platform for streaming applications and for streaming data pipelines.

By combining storage and low-latency subscriptions, streaming applications can treat past and future data the same way. That is, a single application can process historical, stored data, but rather than ending when it reaches the last record, it can keep processing as future data arrives. This is a generalized notion of stream processing that subsumes batch processing as well as message-driven applications.

Likewise, for streaming data pipelines, the subscription to real-time events makes it possible to use Kafka for very low-latency pipelines; the ability to store data reliably also makes it possible to use Kafka for critical data where delivery must be guaranteed, or for integration with offline systems that load data only periodically or may go down for extended periods for maintenance. The stream processing facilities make it possible to transform data as it arrives. For more information on the guarantees, APIs, and capabilities Kafka provides, see the rest of the documentation.

