Kafka Streams Exactly Once Design

https://docs.google.com/document/d/1pGZ8xtOOyGwDYgH5vA6h19zOMMaduFK1DAB8_gBYA2c/edit#



This document describes the detailed implementation design of KIP-129: Streams Exactly-Once Semantics, and it depends on KIP-98 - Exactly Once Delivery and Transactional Messaging. Readers are strongly encouraged to read these two KIP proposals before continuing with this document.

Design Goal

Background

Task Commit

Producer Transactional Messaging

Consumer Read-Committed Only

Task Shutdown

Clean Shutdown

Unclean Shutdown

Task Rebalancing

Error Handling

Public Interfaces

Misc. Considerata


Design Goal

The goal of this design is to enhance Kafka Streams clients with an exactly-once guarantee: each input record fetched from the source Kafka topics is processed exactly once, meaning that the associated state store updates and the result messages sent to destination Kafka topics are reflected exactly once, even under failures. More specifically, Streams clients will rely on the following features provided by KIP-98 to achieve this guarantee:

  1. Idempotent Producer based on producer identifiers (PIDs) to eliminate duplicates.

  2. Transactional messaging with consumer committing offsets as part of the transaction.

  3. Consumer configured to fetch committed messages only.


Discussion on side-effects: The above exactly-once guarantee does not extend to external side-effects: for example, a processor sending an email that cannot be acknowledged, or a processor writing to an external database in a way that cannot be rolled back. Such actions may still be executed multiple times, or not at all, for a given input record even with this approach.


Background

In Kafka Streams, a user-defined processor topology may be divided into multiple sub-topologies that are not directly connected to each other; instead, output Kafka topics of one sub-topology may feed into input topics of another sub-topology. Due to the presence of these Kafka topics, there is no back-pressure between sub-topologies. Such a situation can happen when users deliberately apply “through” operators, for example to materialize some intermediate streams into their own topics, or when the library automatically uses an internal re-partition topic before join / aggregation operators.

Depending on the number of partitions of the input topics, each sub-topology can then be mapped into one or more tasks, with each task consuming a non-overlapping subset of partitions of the input topics. Tasks are then assigned to Kafka Streams threads that are running the same application code. Each task may have zero or more state store instances associated with its sub-topology. Each thread contains one producer client and two consumer clients (one for normal fetching from input topic partitions, and one for fetching from changelog topics for state restoration only). Tasks assigned to the same thread share these clients for fetching and producing messages.

As a result, the current processing state of a running task consists of the following three things:

  1. The offsets of its assigned partitions up to which it has consumed and processed so far.

  2. The current content of its associated state stores so far.

  3. The resulting messages it has produced to output Kafka topics so far.

Whenever the first value in this triplet has changed (i.e. more input messages processed), the other two values may be changed as well.

Today when a commit() call is triggered, either periodically by the library, explicitly by user code, or upon a rebalance, the following operations are executed in order (a minimal sketch follows the list):

  1. Flush all the state stores. If the store is persistent, flushing will persist all its dirty writes.

  2. Flush all the generated outgoing messages. This includes both the generated result messages as well as the changelog messages of the state stores.

  3. Commit currently consumed offsets on the assigned input topic-partitions.
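
To make the ordering concrete, here is a minimal sketch of this at-least-once commit sequence expressed directly with the client APIs; the class and argument names are hypothetical stand-ins for Streams internals, not the actual library code.

import java.util.Map;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.streams.processor.StateStore;

final class AtLeastOnceCommitSketch {
    // Hedged sketch of today's commit order; the arguments stand in for Streams internals.
    static void commit(Iterable<StateStore> stores,
                       KafkaProducer<byte[], byte[]> producer,
                       KafkaConsumer<byte[], byte[]> consumer,
                       Map<TopicPartition, OffsetAndMetadata> consumedOffsets) {
        for (StateStore store : stores) {
            store.flush();                        // 1. persist dirty writes of persistent stores
        }
        producer.flush();                         // 2. drain result and changelog messages to Kafka
        consumer.commitSync(consumedOffsets);     // 3. commit consumed input offsets last
    }
}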

When a failure happens in between these three ordered operations, upon recovery we will reprocess the uncommitted input messages, potentially causing duplicate updates to the state stores as well as duplicate outgoing messages. Hence the guarantee that Kafka Streams provides today is at-least-once.


As mentioned at the beginning of this document, the design goal of Kafka Streams EoS is to make the above three operations “atomic” together: either the result of all three operations are reflected or none of them are reflected. In the remaining sections we will describe the proposed design in different Kafka Streams procedures.


Task Commit

As mentioned in the previous section, the state of a task is a triplet of

(input-offsets, state-content, output-offsets)

And since any state content can be re-constructed from the associated changelog, we treat the changelog stored in Kafka as the “source-of-truth” and use its topic offset to represent the state-content itself. Also, for the output-offsets, we are going to use the transactional commit APIs to make sure that any messages beyond the output-offsets are not “committed”, and hence not exposed to downstream consumers. Thus the state of a task now becomes a triplet of

(input-offsets, changelog-messages, output-messages)

We are going to use the transactional messaging APIs proposed in KIP-98 to commit the above state triplet as a bunch of messages in an “atomic”, transactional manner. Each producer will have a single ongoing transaction at a time (we will talk about committing input-offsets within the transaction later). Within a transaction, we can also eliminate duplicate messages based on the producer’s PID.

One issue, however, is that since in Kafka Streams multiple tasks can be assigned to the same thread, those tasks would share the same producer client to send their outgoing messages. When a rebalance happens, these tasks may be re-assigned to more than one other thread, and hence to multiple producers. On the other hand, each producer equipped with a unique transactional.id can only have one ongoing transaction at a given time, and upon receiving the migrated task(s), each of those producers would need to read its last transaction during the recovery process, but that transaction would contain contents from multiple tasks. Therefore, we can no longer use a single producer per thread, since we need to ensure transactional boundaries within a single task.

In this design we take the approach of assigning a separate producer per task so that any transaction contains only the output messages of a single task. The downside of this approach is that it may incur a performance impact due to reduced batching effectiveness and increased memory management difficulties. Users still have a knob to tune the number of tasks, and hence producers, in the producer-per-task scenario by providing a custom PartitionGrouper implementation.


Producer Transactional Messaging

When a new stream task is created during the consumer rebalance process, the following steps will be executed (see the sketch after this list):

  1. Create a new KafkaProducer client associated with the task, setting the transactional.id config to applicationID + taskID. This guarantees uniqueness of the transactional.id, assuming that the applicationID is unique across all the applications running against the shared Kafka cluster.

  2. Call initTransactions() on the newly created producer, to enable transactional functionalities on the producer. This would roll forward or roll back any possibly unfinished transactions by the previous task that has failed.

  3. Call beginTransaction() to start the first transaction. This ordering makes sure that any ongoing transactions with this transactional.id have completed before the consumer reads the last committed offsets.
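
A minimal sketch of these three steps, assuming a per-task helper; the class name, the “-” separator in the transactional.id, and the byte-array serializers are illustrative assumptions, while the config keys and producer calls are the real client APIs.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;

final class TaskProducerSketch {
    // Hedged sketch: one transactional producer per task, transactional.id = applicationID + taskID.
    static KafkaProducer<byte[], byte[]> create(String applicationId, String taskId, String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, applicationId + "-" + taskId);  // step 1
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
        KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props);
        producer.initTransactions();   // step 2: fences zombies and resolves any unfinished transaction
        producer.beginTransaction();   // step 3: open the first transaction for this task
        return producer;
    }
}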


Discussion on Fencing with Transactional ID. When producer.initTransactions() is called, the producer will contact the transaction coordinator (on the broker) to get its internal PID for the given transactional.id. The coordinator will then bump the epoch number for that PID, so that any other producer instance trying to commit a transaction later with the same transactional.id will be rejected due to an incorrect epoch number. This is how zombie writes with the same transactional.id are fenced.

In addition, assuming the application is stateful and that it is storing its recovery state in Kafka (i.e. the changelog), it needs to wait for the previous ongoing transaction to complete before it restores its own state. Calling producer.initTransactions() before letting the consumer read committed offsets makes sure that the read offsets are always committed.

And during normal execution, when the task needs to send messages to output topics it will just use normal producer.send() calls. We leverage the PID and the sequence number to fence duplicated writes, i.e. an invalid ProduceRequest with either an incorrect epoch number or a non-consecutive sequence number will be rejected by the brokers. If the producer goes offline for some time, the transactional.id -> PID mapping will automatically expire after a configured timeout and the ongoing transaction will be rolled back by the coordinator itself, so that upon recovery a new PID will be assigned to the transactional.id to continue with new transactions.


When stream.commit() is called, the following steps are executed in order (see the sketch after this list):

  1. Flush local state stores (KTable caches) to make sure all changelog records are sent downstream.

  2. Call producer.sendOffsetsToTransaction(offsets) to commit the current recorded consumer positions within the transaction. Note that although the thread’s consumer can be shared among multiple tasks, and hence multiple producers, a task’s assigned partitions are always exclusive, and hence it is safe to just commit the offsets of this task’s assigned partitions.

  3. Call producer.commitTransaction() to commit the current transaction. As a result the task state represented as the above triplet is committed atomically.

  4. Call producer.beginTransaction() again to start the next transaction.
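
A minimal sketch of this commit sequence; the class and argument names are hypothetical stand-ins (the per-task stores, the recorded input offsets, and the consumer group id), while sendOffsetsToTransaction, commitTransaction and beginTransaction are the real producer APIs.

import java.util.Map;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.streams.processor.StateStore;

final class TransactionalCommitSketch {
    // Hedged sketch of the transactional commit of a single task's state triplet.
    static void commit(Iterable<StateStore> taskStores,
                       KafkaProducer<byte[], byte[]> producer,
                       Map<TopicPartition, OffsetAndMetadata> inputOffsets,
                       String applicationId /* also the consumer group id */) {
        for (StateStore store : taskStores) {
            store.flush();                                               // 1. push cached changelog records downstream
        }
        producer.sendOffsetsToTransaction(inputOffsets, applicationId);  // 2. input offsets join the transaction
        producer.commitTransaction();                                    // 3. the whole triplet becomes visible atomically
        producer.beginTransaction();                                     // 4. immediately open the next transaction
    }
}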


Discussion on user-side message buffering: If users choose to buffer some incoming messages within their application code, and those buffered messages have not yet been processed when commit() is called, then a failure may cause these messages to be lost: their offsets are committed but their processing results are not. This KIP does not improve the current state of Kafka Streams guarantees for this scenario, which is not even at-least-once; it is out of the scope of this design. We can only educate users to make sure their internally buffered messages are all processed before they explicitly call commit(), or provide a parameterized commit(offsets) call for them to commit only the processed messages (such a parameterized commit(offsets) API is out of the scope of this KIP and might be an improvement later on).


Discussion on transient failures: On the producer side, various transient exceptions can be thrown when the brokers are temporarily unavailable, bytes are corrupted while transmitting over the network, etc. These transient exceptions can surface in the producer callbacks. To prevent such transient failures from breaking even at-least-once semantics, we should enforce the producer retries config to be effectively infinite, and verify that transient failures never reach the callbacks. Other, non-transient failures are discussed in the Error Handling section.
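
For example, a minimal sketch of that setting; RETRIES_CONFIG is the real producer config constant, and using Integer.MAX_VALUE as the “infinite” value is an assumption.

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

final class InfiniteRetriesSketch {
    // Hedged sketch: retry transient send errors internally so they never surface in callbacks.
    static Properties apply(Properties producerProps) {
        producerProps.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        return producerProps;
    }
}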


Consumer Read-Committed Only

The stream thread’s embedded consumer should set the isolation.level config to read_committed to make sure that any consumed messages come from committed transactions (see the sketch below). Note that the consumed partitions may be written by multiple producers, and these producers may either use transactional messaging (for example, if they are Streams’ embedded producers) or not at all. So the fetched partitions may contain both transactional and non-transactional messages, and with isolation.level set to read_committed consumers will still consume the non-transactional messages.
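
A sketch of the relevant consumer settings; the class name, group id handling and byte-array deserializers are illustrative assumptions, while the config keys are the real consumer constants.

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

final class ReadCommittedConsumerSketch {
    // Hedged sketch: the embedded consumer only sees messages from committed transactions
    // (plus all non-transactional messages), and never auto-commits offsets itself.
    static KafkaConsumer<byte[], byte[]> create(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");  // offsets are committed via the producer transaction
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);
        return new KafkaConsumer<>(props);
    }
}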

By doing so the processor can then safely process the received messages and potentially update its task’s local state stores, knowing that those updates will never need to be reverted since the received transactional messages are all committed. At the same time, the messages returned from the consumers are still in the offset order.

One thing to note, though, is that if there is an incomplete transaction on a fetched partition whose status cannot be decided yet, then, in order to preserve offset ordering, the consumer cannot return any messages after that transaction’s messages, even if they are known to be committed. As a result, the consumer is “blocked” waiting on this incomplete transaction, and hence the processor will be stalled as well.

In order to make sure that all incomplete transactions will be completed eventually, even if their producers are stalled in the middle of producing them, so that consumers can be unblocked, the transaction coordinator will proactively time out transactions based on the producer-side transaction.timeout.ms config value.


Task Shutdown

Tasks can be shut down when users call stream.close() on the instance, when a rebalance happens (discussed in detail later) and assigned partitions get revoked, or when the Kafka Streams application encounters some non-recoverable error and hence needs to stop-and-resume the task. Upon shutting down a task, we need to make sure its corresponding producer has either committed or aborted its current transaction.

Since the local state store data keeps being updated between two consecutive commits, and these updates usually cannot be rolled back, when we encounter an error and hence need to stop the task, we need to treat it as an "unclean" shutdown: its state stores may no longer be valid, and when resuming the task the state needs to be rebuilt from scratch instead of reusing the existing state stores. On the other hand, if the task can be shut down cleanly, e.g. due to a rebalance, we do not need to re-build the state stores upon resuming and should be able to continue from the current state. In order to distinguish these two cases, we will introduce a flag parameter in the task.close() call indicating whether the task is shutting down "cleanly" or not:


close(boolean clean /* if the task can be closed cleanly or not */)


Clean Shutdown

If task.close(true)  is called, the following steps are executed:

  1. Close all the processors in the sub-topology, as well as flushing the state store manager to make sure any incomplete processing is completed and output messages are sent.

  2. Commit the task following the above steps (note that if some of the steps failed, we will jump directly to the unclean shutdown process).

  3. Write the changelog offsets that were tracked from the callback function for all stores into a CRCed local offset checkpoint file. The existence of the offset checkpoint file indicates that the task was shut down cleanly.

  4. Stop the underlying producer client for the task by calling producer.close().


Unclean Shutdown

If task.close(false) is called, the following steps are executed (a combined sketch of the clean and unclean shutdown paths follows this list):

  1. Abort the current transaction by calling producer.abortTransaction(); note that we do NOT need to flush any data since it will be aborted anyway.

  2. Close all the processors in the sub-topology and flush the state store manager; if new messages need to be sent during this step, we can simply ignore them since they would be aborted anyway.

  3. Stop the underlying producer client for the task by calling producer.close().
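
A minimal sketch of the two shutdown paths; the class name and the Runnable hooks standing in for “close processors and flush stores”, “commit the task” and “write the checkpoint file” are illustrative assumptions, not the actual Streams internals.

import org.apache.kafka.clients.producer.KafkaProducer;

final class TaskCloseSketch {
    // Hedged, simplified sketch of task.close(clean): commit and checkpoint on the clean path,
    // abort and skip the checkpoint on the unclean path.
    static void close(boolean clean,
                      KafkaProducer<byte[], byte[]> producer,
                      Runnable closeProcessorsAndFlushStores,
                      Runnable commitTask,
                      Runnable writeOffsetCheckpointFile) {
        if (clean) {
            closeProcessorsAndFlushStores.run();      // 1. finish in-flight processing, send remaining output
            try {
                commitTask.run();                     // 2. transactional commit as described above
                writeOffsetCheckpointFile.run();      // 3. CRCed checkpoint file marks a clean shutdown
            } catch (RuntimeException e) {
                producer.abortTransaction();          // commit failed: fall back to the unclean path
                producer.close();
                return;
            }
        } else {
            producer.abortTransaction();              // 1. abort; nothing needs to be flushed
            closeProcessorsAndFlushStores.run();      // 2. any late output is simply dropped
            // no checkpoint file is written, so resuming will rebuild state from scratch
        }
        producer.close();                             // final step on both paths
    }
}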


Task Rebalancing

Kafka Streams failures are detected via the consumer membership protocol, i.e. heartbeats, and upon detecting a failure, tasks will be migrated between existing instances; similarly, when new instances of the same application are started, the rebalance process is triggered as well so that tasks can be rebalanced among the instances.

More specifically, upon a consumer rebalance, Kafka Streams will just cleanly shut down the tasks for the revoked partitions in the onPartitionsRevoked() callback, following the steps mentioned above; then in the onPartitionsAssigned() callback the thread can simply (re-)create the tasks for the newly assigned partitions and resume these created tasks. Since consumers can only consume "committed data", tasks processing different sub-topologies are decoupled from each other: when we abort one task’s transaction, we do not need to propagate the abort to other tasks that are processing its downstream (children) sub-topologies. Hence the task resuming process becomes quite simple; once all the tasks have been created inside the callback, we just need to execute the following steps:

  1. Let the consumer reset to its last committed offsets by reading them from the consumer coordinator, which will only reflect offsets from committed transactions. Note that this step is done inside the consumer client and Kafka Streams does not need to add any extra logic.

  2. Check whether the local checkpoint file containing the changelog offsets exists:

    a. If yes, it means the task was located on this process and was shut down cleanly in the previous run; in this case restore from the recorded offset up to the changelog end.

    b. Otherwise, it means the task either did not exist on this process before or was shut down uncleanly; in this case restore from the beginning offset of the changelog up to the changelog end.

Note that the restoration consumer should also be "read committed" only, up to the changelog end. A minimal sketch of this restore decision follows.
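
The file-based check and the offset arguments below are hypothetical simplifications of how Streams tracks its checkpoints.

import java.io.File;

final class RestoreDecisionSketch {
    // Hedged sketch: pick the offset from which to replay the changelog when a task is (re-)created.
    static long restoreStartOffset(File offsetCheckpointFile,
                                   long checkpointedOffset,
                                   long changelogBeginningOffset) {
        if (offsetCheckpointFile.exists()) {
            return checkpointedOffset;          // clean shutdown on this process: replay only the tail
        }
        return changelogBeginningOffset;        // new or uncleanly shut down task: rebuild from scratch
    }
}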


Error Handling

Now we can talk about different failure handling scenarios, and whether or not they should trigger a task closure and rebalancing. Note that since Streams controls the usage of the embedded producer / consumer clients, many of the possible exceptions should never be thrown, otherwise there is a bug in the Streams code: e.g. InterruptException should never be thrown since Streams never interrupts the thread calling the corresponding functions, and hence producer.flush()/close() should never see it; similarly, IllegalArgumentException / IllegalStateException should never be thrown.


We categorize all the possible failures encountered as three types, based on their handling logic, which are summarized below:

Rollback then Resume: when a Kafka broker has a transient failure, a retriable error code may be sent back to the clients. However, in order to achieve EoS we cannot simply retry upon such errors. Instead we need to “rollback” the current transaction and then resume the task by executing the following steps (the running Kafka Streams task does not need to be terminated; see the sketch after this list):

  1. Call producer.abortTransaction() to abort the current ongoing transaction.

  2. Read the previously committed offset and reset the consumer to those offsets.

  3. Restore the local state stores from the changelog.
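
A minimal sketch of these three steps against the client APIs; the restore hook and the seek-to-beginning fallback for partitions with no committed offset are illustrative assumptions.

import java.util.Collections;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.TopicPartition;

final class RollbackThenResumeSketch {
    // Hedged sketch: abort the transaction, rewind to the last committed offsets,
    // and rebuild local state; the task itself keeps running.
    static void rollbackThenResume(KafkaProducer<byte[], byte[]> producer,
                                   KafkaConsumer<byte[], byte[]> consumer,
                                   Runnable restoreStateFromChangelog) {
        producer.abortTransaction();                             // 1. abort the current ongoing transaction
        for (TopicPartition tp : consumer.assignment()) {        // 2. reset to the last committed offsets
            OffsetAndMetadata committed = consumer.committed(tp);
            if (committed != null) {
                consumer.seek(tp, committed.offset());
            } else {
                consumer.seekToBeginning(Collections.singleton(tp));
            }
        }
        restoreStateFromChangelog.run();                         // 3. restore state stores from the changelog
        producer.beginTransaction();                             // open a fresh transaction before resuming
    }
}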

Failures falling into this category include:

  • producer.partitionsFor() thrown exceptions.

  • producer.send() thrown exceptions.

  • producer.xxTransaction() thrown exceptions
    (except ProducerFencedException).

  • Consumer API functions (e.g. poll / etc) throw retriable exceptions.

  • Consumer rebalance process throw retriable exceptions.


Close Myself as Zombie: when the newly added ProducerFenced / InvalidProducerEpoch error code is returned in the producer send callback, it means that another producer with the same PID has been set up on a different thread or machine, and hence this task is already a “zombie” and can be safely closed. We can simply shut down the task uncleanly and proceed (a sketch follows the failure list below). If all tasks of a thread have been closed, then that thread can be shut down.

Failures falling into this category include:

  • producer.xxTransaction() thrown ProducerFenced/InvalidProducerEpoch
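
A minimal sketch of how such a fencing error might be handled around a transaction commit; the unclean-close hook is a hypothetical stand-in, and catching the exception at commit time is just one of the places it can surface.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.errors.ProducerFencedException;

final class ZombieHandlingSketch {
    // Hedged sketch: a fenced producer means another instance owns this task now,
    // so this copy is a zombie and simply closes itself.
    static void commitOrCloseAsZombie(KafkaProducer<byte[], byte[]> producer, Runnable closeTaskAsZombie) {
        try {
            producer.commitTransaction();
        } catch (ProducerFencedException fenced) {
            closeTaskAsZombie.run();
        }
    }
}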


Stop the World: When fatal exceptions such as authentication failures are thrown, the whole Kafka Streams instance should be shut down immediately and users alerted through logging / metrics, since in this case the task is doomed to fail even after migrating to another stream thread. In addition, such failures should be reported to users as they may be due to mis-configured applications or misuse, and hence will likely be thrown at the very beginning of the runtime.

Failures falling into this category include:

  • Any of the thrown AuthorizationException from either producer or consumer.

  • Any of the thrown exceptions from the Streams own classes. For example:

    • thread.onPartitionAssigned() / onPartitionRevoked() thrown StreamsException.

    • task.addRecords() thrown SerializationException.

  • The exception passed to the producer’s Callback is non-retriable (note that it should never be a retriable exception).

  • Any other exceptions that are not covered in the previous two categories.

    • E.g. KafkaException from either producer or consumer.


Discussion on application code failures: The above section only discussed possible exceptions thrown from the Streams library. In the application code (i.e. user-customized processor.process() / punctuate() / etc.) various exceptions can still be thrown, ranging from a divide-by-zero to specific errors like a RocksDBException. In this proposal they all fall into the third category above as fatal errors, and will cause the thread to die; users can optionally handle them in the exception handler. There are some discussions about extending the exception handling to a finer granularity for application errors, which will be addressed in a separate KIP and hence is not discussed here.


Discussion on State Store Checkpoints: one potential optimization for the Rollback then Resume case above, to reduce the state restoration work upon resuming, is to keep state store “checkpoints” that are aligned with message transaction boundaries, which can be saved atomically with the commit of the message transactions. There are a few different options for doing so, and we describe one of these ideas here as an example, applying state store checkpoints (for example, RocksDB supports such a feature). More specifically:

  • When committing a task, after the state stores have all been flushed and before the producer writes the committed offsets, we can choose to trigger the persistent state stores' checkpoint mechanism to make a checkpoint for each persistent store.

  • When writing the committed offsets into the input-offset-topic, also pass the references (e.g. local state directory / file locations) of all the persistent store checkpoints into the offset metadata as part of the state triplet.

  • Then upon resuming, if the fetched last transaction contains the state store checkpoint information, we can skip the restoration of the state stores from the changelog and use the checkpointed image directly.

Similarly, we can leverage the underlying state store engine’s own transaction mechanism to create such consistent checkpoints as well (e.g. RocksDB also supports transactions).


Public Interfaces

The only public interface change proposed in this KIP is adding one more config to StreamsConfig:


processing.guarantee

Here are the possible values:


exactly_once: the processing of each record will be reflected exactly once in the application’s state even in the presence of failures.

at_least_once: the processing of each record will be reflected at least once in the application’s state even in the presence of failures.


Default: at_least_once
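
For example, a minimal sketch of enabling the new guarantee, assuming the config is exposed via StreamsConfig constants as in the released client; the application id and bootstrap servers are placeholders.

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

final class ExactlyOnceConfigSketch {
    // Hedged sketch: opt a Streams application into exactly-once processing.
    static Properties exactlyOnceProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");        // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // placeholder
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        return props;
    }
}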



Misc. Considerata

  1. The mappings from partitions to tasks must be "static", i.e. a partition should NEVER be reassigned from one existing task to another; in practice this holds for the DefaultPartitionGrouper, but we cannot enforce it programmatically for the user-customizable interface, other than calling it out in the JavaDoc.

  2. As mentioned above, since consumers can only consume "committed data", committed transactions are decoupled from one sub-topology to the next; however, we need to keep the transaction "length" reasonably small, otherwise the end-to-end latency from one sub-topology to a downstream sub-topology would increase. More specifically, the worst-case latency is now lower-bounded by the commit interval, so we should set a small default value (e.g. 100ms) for the commit interval config, and document clearly that if users increase this interval, they will see higher latencies. We are also working on improving this scenario as future work, letting consumers consume only “committed data” at the end of the pipeline while allowing them to consume “unstable data” in the middle of the pipeline, at the cost of cascading rollbacks in case of failures; the details are still under design and will likely be proposed as a separate KIP.
