1. Problem background
In the early morning of February 27, replication delay on a MySQL standby database in the production environment increased during a backup, because FLUSH TABLES WITH READ LOCK (FTWRL) could not complete and its lock was not released. According to the slow query log, the lock was held for nearly 25 minutes.
Version:
- MySQL 5.7.21
- PXB 2.4.18
Slow query log:
Backup command in backup script:
The main logical content of mysql_kill.sh:
Backup parameters:
2. Problem reproduction and analysis
2.1 Problem analysis
- Thread 144 is the SQL thread, which acts as the Coordinator thread in parallel replication.
- Threads 145 and 146 are worker threads for parallel replication; the transactions queued on them can be executed in parallel.
- Thread 162 is the backup thread, which executes FLUSH TABLES WITH READ LOCK (FTWRL).
When Coordinator thread 144 dispatched transactions from the relay log, it found a transaction that could not be executed until a preceding transaction had committed, so it stayed in the "Waiting for dependent transaction to commit" state.
Worker threads 145 and 146 and backup thread 162 form a deadlock:
- Thread 145 waits for thread 162 to release the global read lock.
- Thread 162 holds the MDL global read lock and, while requesting the global commit lock, is blocked by thread 146.
- Thread 146 holds the MDL commit lock. Because the slave database sets slave_preserve_commit_order=1 to preserve the binlog commit order, and the transaction executed by thread 146 comes later in the binlog, thread 146 must wait for the transaction on thread 145 to commit first.
The result is the wait cycle 145->162->146->145, which is a deadlock.
A deadlock in which three threads block one another like this is relatively rare.
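The wait relationships described above can be written down as a small wait-for graph. The following is a minimal sketch, not output from MySQL: the dictionary is filled in by hand from the analysis in this section, and `find_cycle` is a hypothetical helper name introduced only for illustration.

```python
# Hand-built wait-for graph from the analysis above (an illustration, not MySQL output).
# Key: the waiting thread id; value: the thread it is waiting on.
WAITS_FOR = {
    145: 162,  # worker 145 waits for the backup thread's global read lock
    162: 146,  # backup thread 162 (FTWRL) waits for the commit lock held by worker 146
    146: 145,  # worker 146 waits for worker 145 to commit first (commit order preserved)
}


def find_cycle(waits_for, start):
    """Follow wait-for edges from `start`; return the cycle as a list, or None."""
    path = []
    position = {}  # thread id -> index in `path` where it was first visited
    node = start
    while node in waits_for:
        if node in position:
            # We came back to a thread already on the path: that is the cycle.
            return path[position[node]:] + [node]
        position[node] = len(path)
        path.append(node)
        node = waits_for[node]
    return None  # the chain ended at a thread that is not waiting on anyone


cycle = find_cycle(WAITS_FOR, 145)
print("->".join(str(thread_id) for thread_id in cycle))  # prints 145->162->146->145
```

Removing any one edge breaks the cycle, which is why each suggestion in the conclusion targets a single edge of this graph.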
2.2 Why the relevant parameters do not take effect
--ftwrl-wait-timeout=60: before executing FTWRL, if a long-running SQL statement is detected, the backup waits up to the specified time (in seconds). If a long-running statement still exists after the timeout, the backup exits with an error. The default value is 0, which means FTWRL is executed immediately.
--ftwrl-wait-threshold=5: defines how long-running SQL is detected before FTWRL is executed. A statement that has already been running for longer than the specified time (in seconds) when the flush is about to be executed is treated as long-running. The default is 60 seconds.
--kill-long-queries-timeout=0: after FTWRL has been issued, if the flush is blocked for N seconds, the threads blocking it are killed. The default value of 0 means that no statement blocking the flush is killed; the flush waits until those statements finish.
From these descriptions it is clear that the --ftwrl-wait-* parameters only control the detection of long-running SQL before FTWRL is executed and do not help once FTWRL has been issued. The --kill-long-queries-timeout parameter was left at its default value of 0, so it had no effect either.
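The reasoning above can be condensed into a small model. This is only a sketch of the behavior as described in this section, not PXB source code; `BackupLockOptions` and `active_safeguards` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class BackupLockOptions:
    """The three options discussed above, with the defaults stated in this section."""
    ftwrl_wait_timeout: int = 0         # --ftwrl-wait-timeout (seconds)
    ftwrl_wait_threshold: int = 60      # --ftwrl-wait-threshold (seconds)
    kill_long_queries_timeout: int = 0  # --kill-long-queries-timeout (seconds)


def active_safeguards(opts, ftwrl_issued):
    """Return the options that can still act in the given phase (simplified model)."""
    if not ftwrl_issued:
        # Before FTWRL: long-query detection only runs with a non-zero wait timeout.
        if opts.ftwrl_wait_timeout > 0:
            return ["--ftwrl-wait-timeout", "--ftwrl-wait-threshold"]
        return []
    # After FTWRL has been issued: only the kill timeout matters, and 0 disables it.
    if opts.kill_long_queries_timeout > 0:
        return ["--kill-long-queries-timeout"]
    return []


# Settings from the backup in this incident.
incident = BackupLockOptions(ftwrl_wait_timeout=60, ftwrl_wait_threshold=5,
                             kill_long_queries_timeout=0)
print(active_safeguards(incident, ftwrl_issued=True))  # prints []
```

With the incident's settings, nothing is left to intervene once FTWRL is blocked, which matches the 25 minutes of waiting seen in the slow log.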
3. Conclusion and suggestions
- The excessive delay on the slave database in this incident was caused by a deadlock between the SQL threads and the global read lock taken by FTWRL during the PXB backup.
- Enable the --kill-long-query-type and --kill-long-queries-timeout parameters so that the threads blocking the flush are killed once the block is detected. This approach is heavy-handed and carries greater risk; consider it only if no business traffic accesses the backup database.
- Enable the --safe-slave-backup parameter, which stops the SQL thread while the backup is performed to avoid the deadlock. This is recommended only on a standby database that has no business access.
- Set the MySQL parameter slave_preserve_commit_order=0 to turn off ordered binlog commits on the slave database. Turning this parameter off only affects the commit order of parallel-replicated transactions on the slave database and has no impact on the final data consistency. Therefore, unless there is a specific requirement that the binlog order of the slave database match that of the master database, consider setting slave_preserve_commit_order=0 to avoid the deadlock.
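As a rough illustration of the two backup-side suggestions, the sketch below only assembles an xtrabackup command line for review and does not run anything. The option names are spelled as in the PXB 2.4 option reference; the target directory, the 20-second timeout, the `all` query type, and the helper name `build_backup_command` are assumptions made for this example and are not taken from the original backup script.

```python
import shlex


def build_backup_command(target_dir, mitigation=None):
    """Return an xtrabackup argument list for a dry run; nothing is executed."""
    args = ["xtrabackup", "--backup", f"--target-dir={target_dir}"]
    if mitigation == "kill-blockers":
        # Kill whatever still blocks FTWRL after 20 seconds (hypothetical value).
        args += ["--kill-long-queries-timeout=20", "--kill-long-query-type=all"]
    elif mitigation == "safe-slave-backup":
        # Stop the SQL thread for the duration of the backup.
        args.append("--safe-slave-backup")
    elif mitigation is not None:
        raise ValueError(f"unknown mitigation: {mitigation}")
    return args


# Print the command instead of running it, so it can be checked first.
print(shlex.join(build_backup_command("/data/backup/full", "safe-slave-backup")))
```

Both variants are meant for a standby database without business access, as noted in the list above.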
Enjoy GreatSQL :)