Notes on a Flink job failing to consume Kafka topics after a server outage

1. Background

A crash of the cluster servers caused many big data components to shut down abnormally. After the servers and the cluster were restarted, every component reported a healthy status, but the Flink jobs could not run normally.

2. Symptoms

After the server restart, everything looked normal and all components were in good condition.

However, a problem appeared when submitting the Flink job: Zookeeper reported canary test failures from time to time.


I then checked the Flink runtime logs and found the error "Timeout for obtaining topic metadata"; every task reported the same error.


3. Locating the problem

To solve the problem we had to find its root cause. Combined with Zookeeper reporting canary test failures from time to time, we first suspected a network problem (the server was overstretched), but there was no way to prove it, so we had to keep trying other approaches.
Starting a consumer from the command line directly on the server still produced the same error.
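For reference, such a test consumer can be started roughly like this (the broker address and topic name below are placeholders for illustration); in this environment it failed with the same metadata timeout:

kafka-console-consumer --bootstrap-server node3:9092 --topic my_topic --from-beginning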

Now it was clear that this was not a network problem; it was more likely a problem with Kafka metadata, so I used the following command to inspect each topic:

kafka-topics --describe --zookeeper node3:2181
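On a healthy cluster the output looks roughly like the following (the topic name and broker IDs here are illustrative); every partition should show a valid Leader and an Isr list matching its Replicas:

Topic:my_topic	PartitionCount:3	ReplicationFactor:2	Configs:
	Topic: my_topic	Partition: 0	Leader: 1	Replicas: 1,2	Isr: 1,2
	Topic: my_topic	Partition: 1	Leader: 2	Replicas: 2,3	Isr: 2,3
	Topic: my_topic	Partition: 2	Leader: 3	Replicas: 3,1	Isr: 3,1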

The output showed that all topics were normal, so I created a new topic and tried to consume from it instead. That did not work either; the same timeout error was reported.
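Creating the new test topic looked roughly like this (the topic name, partition count, and replication factor are made up for illustration):

kafka-topics --create --zookeeper node3:2181 --topic test_topic --partitions 3 --replication-factor 2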

So I suspected that there might be a problem with the Kafka cluster information saved in Zookeeper. In the end it was determined that the information recorded in the /controller node on Zookeeper did not match the actual state of the cluster.
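The controller information can be inspected with the Zookeeper command-line client, for example (zkCli.sh is the stock client, the wrapper on CDH-style distributions is zookeeper-client; /controller is the standard path):

zkCli.sh -server node3:2181
get /controller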

4. Solving the problem

With the problem located, fixing it was straightforward (a sketch of the command for step 2 follows the list):
1. Stop the Kafka cluster service.
2. Delete the /controller node in Zookeeper.
3. Restart the Zookeeper cluster.
4. Start the Kafka cluster service.
5. Resubmit the Flink jobs.
6. The problem is solved.
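A minimal sketch of step 2, run from a zkCli.sh session like the one shown earlier; the /controller znode has no children, so a plain delete is enough, and a newly elected controller recreates it automatically once Kafka is started again:

delete /controller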

5. Additional background

What the /controller node does

In a Kafka cluster, only one broker is elected as the controller. Every broker has a chance to become the controller, and the elected one is responsible for managing partitions and replicas and for fault detection and recovery across the cluster. To ensure that a new controller can be re-elected quickly when the current one fails, Kafka uses Zookeeper to store and manage the controller information. The content of the /controller node is therefore critical: it lets the Kafka brokers and other clients know which broker is the current controller, so they can communicate with it and have administrative tasks handled promptly.

Specifically, Kafka stores the broker ID of the current controller in the /controller node on Zookeeper (the controller's address can then be looked up under /brokers/ids). Other brokers and clients can read this information so that they can communicate with the controller and perform the necessary management tasks, such as creating or deleting partitions and replicas.
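For example, reading the node in a zkCli.sh session returns a small JSON document along these lines (the broker ID and timestamp below are illustrative):

get /controller
{"version":1,"brokerid":2,"timestamp":"1681380000000"}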

In addition, Kafka uses Zookeeper to store other metadata related to cluster management, such as partition and replica assignment information and, for older consumer versions, consumer group offsets.
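Assuming the same zkCli.sh session, a few of these standard paths can be browsed directly: /brokers/ids lists the registered brokers, /brokers/topics holds topics and their partition assignments, and /consumers holds old-style consumer group data (depending on the version and client in use, it may be empty):

ls /brokers/ids
ls /brokers/topics
ls /consumers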
