Key details about the Kafka partitioner

Apache Kafka is the de facto standard for event streaming today. Part of the reason for Kafka's success is its ability to handle massive amounts of data; throughput of millions of records per second is not unheard of in production environments. A key part of the design that makes this possible is partitioning.

Kafka uses partitions to spread the data load across the brokers in the cluster, and the partition is also Kafka's unit of parallelism: more partitions generally means higher throughput. Since Kafka works with key-value pairs, getting records with the same key onto the same partition is critical.

Consider a banking application that sends each transaction to Kafka keyed by customer ID. It is critical that all events for a given customer land on the same partition; that way, consumer applications process those records in the order they arrive. The mechanism for ensuring that records with the same key end up on the same partition is simple but effective: take the hash of the key modulo the number of partitions. The following diagram shows this concept in action:

At a high level, a hash function such as CRC32 or Murmur2 takes an input and produces an output of a fixed size, such as a 32-bit number. The same input always produces the same output, whether the function is implemented in Java, Python, or any other language. The partitioner uses the hash result to select partitions consistently, so the same record key always maps to the same Kafka partition. I won't go into detail on hashing in this post; it is enough to know that several hashing algorithms are available.
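To make the hash-modulo step concrete, here is a minimal sketch in Java using the murmur2 and toPositive helpers from the Apache Kafka clients library (org.apache.kafka.common.utils.Utils); the key and the partition count below are illustrative placeholders, not values from this post.

Java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class PartitionForKey {
    public static void main(String[] args) {
        byte[] keyBytes = "customer-12345".getBytes(StandardCharsets.UTF_8); // example key
        int numPartitions = 6;                                               // example partition count

        // Hash the key, then take the result modulo the number of partitions.
        // toPositive() clears the sign bit so the modulo result is never negative.
        int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;

        // The same key always produces the same hash, so it always maps to the same partition.
        System.out.println("key maps to partition " + partition);
    }
}

Because the hash of a given key never changes, the chosen partition only changes if the number of partitions changes.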

What I want to discuss today is not how partitioning works, but the partitioner in the Kafka producer client. The producer uses a partitioner to determine the correct partition for a given key, so it is critical that all of your producer clients use the same partitioning strategy.

Since producer clients ship with a default partitioner, this requirement should not be a problem. For example, when using the Java producer client from the Apache Kafka distribution, the KafkaProducer class provides a default partitioner that uses the Murmur2 hash function to determine the partition for a given key.
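As a quick sketch of that default in action, the following Java producer sets no partitioner.class at all, so the KafkaProducer falls back to its built-in Murmur2-based partitioner; the broker address, topic name, key, and value here are placeholders.

Java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DefaultPartitionerProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        // No partitioner.class is configured, so the producer uses its
        // built-in partitioner, which hashes non-null keys with Murmur2.

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Every transaction keyed by this customer ID hashes to the same partition.
            producer.send(new ProducerRecord<>("transactions", "customer-12345", "debit:42.00"));
        }
    }
}

Every record sent with the key "customer-12345" therefore lands on the same partition, preserving per-customer ordering.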

But what about Kafka producer clients in other languages? The excellent librdkafka project is a C/C++ implementation of the Kafka client, widely used in non-JVM Kafka applications; Kafka clients in other languages (Python, C#) are built on top of it. librdkafka's default partitioner uses the CRC32 hash function to map keys to partitions.

This is not a problem in itself, but it could easily become one. The Kafka broker is agnostic to the client's language; you can use clients in any language, and as long as they follow the Kafka protocol, the broker will happily accept their produce and consume requests. In today's polyglot programming environment, you may have development teams within your organization working in different languages, such as Python and Java. Without any changes, though, those two teams will use different partitioning strategies in the form of different hashing algorithms: librdkafka producers using CRC32 and Java producers using Murmur2, so records with the same key will land on different partitions! So, what is the remedy for this situation?

The Java KafkaProducer offers only one hashing algorithm through its default partitioner, and since implementing a custom partitioner is tricky, it's best to stick with that default. The librdkafka producer client, however, provides multiple partitioner options. One of these is the murmur2_random partitioner, which uses the Murmur2 hash function and assigns records with null keys to random partitions, which is comparable to the behavior of the Java default partitioner.

For example, when using the C# Kafka producer client, you can set the partitioning strategy with the following line:

C#
ProducerConfig.Partitioner = Partitioner.Murmur2Random;

Now your C# and Java producer clients use compatible partitioning methods!

When using a non-Java Kafka client, it is a good idea to configure the same partitioning strategy as the Java producer client, so that all of your producers map the same key to the same partition.
