Kafka
Summary
Kafka is a distributed messaging system for log processing. Logs are generated by system activities such as (1) logins, pageviews, clicks, "likes" and (2) operational metrics such as service call stack, call latency, errors, and disk utilization. Log data is important for analytics: it tracks user engagement and system utilization.
A recent trend is that log data is also used in real time to provide features such as (1) search relevance, (2) recommendations driven by item popularity, (3) ad targeting, and (4) newsfeed features that aggregate user status updates or actions for their friends to read.
Kafka was designed because this real-time usage of log data is orders of magnitude larger than the "real" data. Early systems for processing this kind of data relied on physically scraping log files off production servers for analysis. In recent years, log aggregators have been built for collecting and loading log data into a data warehouse or Hadoop for offline consumption. LinkedIn needed to support most of its real-time applications with delays of no more than a few seconds, which is why Kafka was designed.
Kafka Design principles
Kafka defines a stream of messages of a particular type as a topic. Published messages are stored at a set of servers called brokers.
A consumer can subscribe to one or more topics from the brokers; consumption is pull-based, i.e., consumers fetch data from the brokers.
What makes Kafka fast?
Simple storage
Kafka stores each partition of a topic as a logical log. A log is implemented as a set of segment files of approximately the same size (e.g., 1 GB). Every time a producer publishes a message to a partition, the broker simply appends it to the last segment file. The segment file is flushed once a configurable number of messages have been published or a certain amount of time has elapsed. (A message is only exposed to consumers after it is flushed; I assume this batches writes so the broker doesn't initiate a disk transfer for every small message.)
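A minimal sketch of this append-and-flush behavior (the flush thresholds and file handling here are assumptions for illustration, not Kafka's actual implementation):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Toy partition log: appends each message to the active segment file and only
// flushes (thereby exposing messages to consumers) after enough messages or
// enough time has accumulated.
class PartitionLog {
    private static final int FLUSH_MESSAGE_COUNT = 1000; // assumed threshold
    private static final long FLUSH_INTERVAL_MS = 1000;  // assumed threshold

    private final FileChannel activeSegment;
    private int unflushedMessages = 0;
    private long lastFlushMs = System.currentTimeMillis();

    PartitionLog(Path segmentFile) throws IOException {
        this.activeSegment = FileChannel.open(
                segmentFile, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // The broker simply appends the message to the last segment file.
    void append(byte[] message) throws IOException {
        activeSegment.write(ByteBuffer.wrap(message));
        unflushedMessages++;
        long now = System.currentTimeMillis();
        if (unflushedMessages >= FLUSH_MESSAGE_COUNT || now - lastFlushMs >= FLUSH_INTERVAL_MS) {
            activeSegment.force(false); // flush; only now do consumers see these messages
            unflushedMessages = 0;
            lastFlushMs = now;
        }
    }
}
```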
Addressing by logical offset
Kafka uses logical offsets to address each message. This avoids the overhead of maintaining auxiliary, seek-intensive random-access index structures (e.g., B-trees) that map message ids to actual message locations.
Each pull request from a consumer contains the offset of the message from which consumption begins and an acceptable number of bytes to fetch. Each broker keeps in memory a sorted list of offsets, including the offset of the first message in every segment file, to locate the segment containing a requested message.
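A minimal sketch of that lookup (the in-memory index and segment naming are illustrative stand-ins, not Kafka's data structures):

```java
import java.util.Map;
import java.util.TreeMap;

// Toy offset index: maps the first offset of each segment file to that segment.
class SegmentIndex {
    // key = logical offset of the first message in a segment, value = segment file name
    private final TreeMap<Long, String> firstOffsetToSegment = new TreeMap<>();

    void addSegment(long firstOffset, String segmentFile) {
        firstOffsetToSegment.put(firstOffset, segmentFile);
    }

    // A pull request carries the offset where consumption begins; locating the
    // right segment is a floor lookup on the sorted first offsets, so no
    // per-message index (e.g., a B-tree) is needed.
    String segmentFor(long requestedOffset) {
        Map.Entry<Long, String> entry = firstOffsetToSegment.floorEntry(requestedOffset);
        if (entry == null) {
            throw new IllegalArgumentException(
                    "offset " + requestedOffset + " is before the earliest retained message");
        }
        return entry.getValue();
    }
}
```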
Rely on underlying file system page cache
Kafka explicitly avoids caching messages in memory at the Kafka layer and instead relies on the local page cache. The main benefit is avoiding double buffering. It also incurs very little garbage-collection overhead, making an efficient implementation in a VM-based language (the JVM) feasible.
Since both the producer and the consumer access segment files sequentially, with the consumer often lagging slightly behind the producer, normal operating system caching heuristics (write-through caching and read-ahead) are very effective.
Network optimization
On Linux and other Unix operating systems, there exists a sendfile API that can directly transfer bytes from a file channel to a socket channel. This avoids two of the four data copies and one of the two system calls incurred by the typical approach of reading a file into an application buffer and then writing it to a socket.
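In Java, this zero-copy path is exposed through FileChannel.transferTo, which the JVM can back with sendfile on Linux. A minimal sketch (the file name, host, and port are placeholders):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        Path segment = Path.of("segment-00000000.log"); // placeholder segment file
        try (FileChannel file = FileChannel.open(segment, StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9092))) {
            long position = 0;
            long remaining = file.size();
            // transferTo moves bytes from the page cache straight to the socket,
            // avoiding the extra user-space copy of a read()/write() loop.
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```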
Stateless broker
Kafka uses a simple time-based SLA for its retention policy: a message is automatically deleted after it has been retained for longer than a set period, typically 7 days. An important benefit of this design is that a consumer can deliberately rewind to an old offset and re-consume data. This is particularly important for ETL loads into a data warehouse or Hadoop system.
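A minimal sketch of such a time-based retention pass (the directory layout, file naming, and use of modification time are assumptions for illustration):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.time.Duration;
import java.time.Instant;

// Toy retention pass: delete any segment file older than the retention window.
// Consumers that rewind can re-consume anything still within the window.
public class RetentionSweep {
    static final Duration RETENTION = Duration.ofDays(7);

    public static void sweep(Path partitionDir) throws IOException {
        Instant cutoff = Instant.now().minus(RETENTION);
        try (DirectoryStream<Path> segments = Files.newDirectoryStream(partitionDir, "*.log")) {
            for (Path segment : segments) {
                FileTime lastModified = Files.getLastModifiedTime(segment);
                if (lastModified.toInstant().isBefore(cutoff)) {
                    Files.delete(segment);
                }
            }
        }
    }
}
```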
Distributed
Kafka has the concept of consumer groups, where each group consists of one or more consumers that jointly consume a set of subscribed topics (i.e., a topic is processed in parallel across the group).
To facilitate coordination, Kafka employs the highly available consensus service ZooKeeper for the following tasks:
- detecting the addition and the removal of brokers and consumers
- triggering a rebalance process in each consumer when the above events happen
- maintaining the consumption relationship and keeping track of the consumed offset of each partition. Specifically, when each broker or consumer starts up, it stores its information in a broker or consumer registry in ZooKeeper (a minimal registration sketch follows this list).
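A minimal sketch of the registration step using the ZooKeeper Java client (the znode path and payload format here are assumptions, not Kafka's actual registry layout):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class BrokerRegistration {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble (address is a placeholder).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });

        // Ephemeral znode: it disappears automatically if the broker dies,
        // which is how additions/removals are detected and rebalances triggered.
        String brokerId = "1";
        String payload = "host=broker1;port=9092;topics=topic1"; // assumed payload format
        zk.create("/brokers/" + brokerId,
                  payload.getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.EPHEMERAL);
    }
}
```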
Experimental results
Producer test: Kafka can publish messages at rates orders of magnitude higher than ActiveMQ and at least twice as high as RabbitMQ.
A few reasons: first, the Kafka producer doesn't wait for acknowledgements from the broker and sends messages as fast as the broker can handle. In Figure 4, a single Kafka producer nearly saturates the 1 Gb link, which supports this hypothesis.
[Figure 4: producer performance]
We note that without acknowledging the producer, there is no guarantee that every published message is actually received by the broker. For many types of log data, it is desirable to trade durability for throughput, as long as the number of dropped messages is relatively small.
Durability tradeoff.
Second, Kafka has a more efficient storage format (messages addressed by logical offsets rather than heavyweight per-message headers), so there is less overhead per message.
Consumer test:
On average, Kafka consumed 22,000 messages per second, more than 4 times that of ActiveMQ and RabbitMQ.
Reasons:
- Fewer bytes were transferred from the broker in Kafka due to its more efficient storage format.
- The brokers in ActiveMQ and RabbitMQ had to maintain the delivery state of every message, whereas Kafka has no disk write activity on the broker during consumption and further reduces transmission overhead with the sendfile API.
Conclusion
Kafka is designed for processing huge volumes of log data streams. It uses a pull-based consumption model that allows an application to consume data at its own rate and rewind consumption whenever needed.
Future work: durability and data availability guarantees, asynchronous and synchronous replication models.