A slow query plus an innobackupex backup caused the database to hang

Symptoms

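  • An SMS alert reported that the early-morning backup job had failed.
  • Developers reported that the application was down; every application instance had been restarted, but that did not seem to fix it.
  • CPU, disk space, logs, and other metrics all looked normal and no other alerts had fired, which was puzzling.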

Investigation

  • The backup that started at 01:00 did not succeed:

>> log scanned up to (348743804470)
xtrabackup: Creating suspend file '/data1/backup/2017-11-14_01-00-06/xtrabackup_suspended_2' with pid '3048'

171114 01:06:15  innobackupex: Continuing after ibbackup has suspended
171114 01:06:15  innobackupex: Executing FLUSH NO_WRITE_TO_BINLOG TABLES...
>> log scanned up to (348743819365)
>> log scanned up to (348743830358)
>> log scanned up to (348743838117)
>> log scanned up to (348743848569)

……

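  • Meanwhile, a slow query had been running for more than an hour without finishing, stuck in the Sending data state.
  • A large number of SQL statements started to wait, in the state: Waiting for table flush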
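
The two observations above come from looking at the process list. As a rough sketch of that triage step, the Python below counts active threads per state and picks out long-running queries. It assumes the rows have already been fetched from information_schema.PROCESSLIST (the column names ID, COMMAND, TIME, STATE, INFO are the real ones); the sample rows, thread IDs, and the helper names count_states and long_running are made up for illustration.

```python
from collections import Counter

# Hypothetical rows shaped like information_schema.PROCESSLIST output.
# The IDs, times, and statements are invented for illustration only.
sample_rows = [
    {"ID": 101, "COMMAND": "Query", "TIME": 4200, "STATE": "Sending data",
     "INFO": "SELECT ... FROM big_table ..."},
    {"ID": 202, "COMMAND": "Query", "TIME": 600, "STATE": "Waiting for table flush",
     "INFO": "FLUSH NO_WRITE_TO_BINLOG TABLES"},
    {"ID": 303, "COMMAND": "Query", "TIME": 590, "STATE": "Waiting for table flush",
     "INFO": "SELECT ... FROM big_table ..."},
    {"ID": 304, "COMMAND": "Query", "TIME": 580, "STATE": "Waiting for table flush",
     "INFO": "UPDATE big_table SET ..."},
    {"ID": 400, "COMMAND": "Sleep", "TIME": 30, "STATE": "", "INFO": None},
]


def count_states(rows):
    """Count threads that are executing a statement, grouped by STATE."""
    return Counter(r["STATE"] for r in rows if r["COMMAND"] == "Query")


def long_running(rows, min_seconds):
    """Return executing threads that have run for at least min_seconds, longest first."""
    hits = [r for r in rows if r["COMMAND"] == "Query" and r["TIME"] >= min_seconds]
    return sorted(hits, key=lambda r: r["TIME"], reverse=True)


if __name__ == "__main__":
    for state, n in count_states(sample_rows).most_common():
        print(f"{n:3d}  {state}")
    for r in long_running(sample_rows, 3600):
        print(r["ID"], r["TIME"], r["STATE"], r["INFO"])
```

A pile-up of threads in one waiting state next to a single query that has been running for an hour is the pattern to look for here.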

Analysis

About Waiting for table flush, the official MySQL documentation says:
The thread is executing FLUSH TABLES and is waiting for all threads to close their tables, or the thread got a notification that the underlying structure for a table has changed and it needs to reopen the table to get the new structure. However, to reopen the table, it must wait until all other threads have closed the table in question.
This notification takes place if another thread has used FLUSH TABLES or one of the following statements on the table in question: FLUSH TABLES tbl_name, ALTER TABLE, RENAME TABLE, REPAIR TABLE, ANALYZE TABLE, or OPTIMIZE TABLE.
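In this case, the many threads in Waiting for table flush were waiting for the FLUSH TABLES issued by innobackupex to finish, so that the tables could be closed and reopened. But that FLUSH TABLES was itself stuck behind the slow query, which still had its tables open.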
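
The blocking chain described above can be sketched as a small heuristic over process-list rows. This is only an illustration under stated assumptions: the rows are hypothetical, the function name find_flush_blockers is made up, and the rule (a statement that was already running when the FLUSH started, and is not itself waiting for the flush, is a likely blocker) is a simplification rather than anything MySQL reports directly.

```python
WAITING = "Waiting for table flush"

# Hypothetical rows shaped like information_schema.PROCESSLIST output.
sample_rows = [
    {"ID": 101, "COMMAND": "Query", "TIME": 4200, "STATE": "Sending data",
     "INFO": "SELECT ... FROM big_table ..."},
    {"ID": 202, "COMMAND": "Query", "TIME": 600, "STATE": WAITING,
     "INFO": "FLUSH NO_WRITE_TO_BINLOG TABLES"},
    {"ID": 303, "COMMAND": "Query", "TIME": 590, "STATE": WAITING,
     "INFO": "SELECT ... FROM big_table ..."},
    {"ID": 304, "COMMAND": "Query", "TIME": 580, "STATE": WAITING,
     "INFO": "UPDATE big_table SET ..."},
    {"ID": 400, "COMMAND": "Sleep", "TIME": 30, "STATE": "", "INFO": None},
]


def find_flush_blockers(rows):
    """Split rows into (flush_thread, likely_blockers, victims).

    Heuristic only: a blocker is a running statement that is at least as
    old as the FLUSH and is not itself waiting for the flush.
    """
    active = [r for r in rows if r["COMMAND"] == "Query"]
    flushes = [r for r in active
               if (r["INFO"] or "").lstrip().upper().startswith("FLUSH")]
    if not flushes:
        return None, [], []
    # If several FLUSH statements are queued, the oldest one is the head of the chain.
    flush_thread = max(flushes, key=lambda r: r["TIME"])
    blockers = [r for r in active
                if r["ID"] != flush_thread["ID"]
                and r["TIME"] >= flush_thread["TIME"]
                and r["STATE"] != WAITING]
    victims = [r for r in active
               if r["ID"] != flush_thread["ID"] and r["STATE"] == WAITING]
    return flush_thread, blockers, victims


if __name__ == "__main__":
    flush_thread, blockers, victims = find_flush_blockers(sample_rows)
    print("flush thread:", flush_thread["ID"])
    print("likely blockers:", [r["ID"] for r in blockers])
    print("victims:", [r["ID"] for r in victims])
```

With rows like these, the hour-old SELECT is the only candidate blocker, and every statement that arrived after the FLUSH shows up as a victim.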

Solution

innobackupex has a few options for handling this situation:
--kill-long-queries-timeout=N (seconds)
How long to let queries complete after FLUSH TABLES WITH READ LOCK is issued before starting to kill them. The default is 0, which means no queries are killed.
--kill-long-query-type={all|select}
Which types of queries are killed once kill-long-queries-timeout has expired.
Example:
innobackupex --user=root --password=password --kill-long-queries-timeout=30 --kill-long-query-type=all /backup/
innobackupex waits 30 seconds and then kills the blocking queries, so the backup can complete without stalling other DML operations.
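
For a scheduled backup job, the same command line can be assembled in code so that the two kill options are validated before anything runs. A minimal sketch, assuming only the flags shown above; the helper name build_backup_cmd and its parameters are hypothetical, and nothing here is part of innobackupex itself.

```python
def build_backup_cmd(target_dir, user, password, kill_timeout=30, kill_type="all"):
    """Build an innobackupex argument list using only the flags shown above."""
    if kill_type not in ("all", "select"):
        raise ValueError("kill_type must be 'all' or 'select'")
    if kill_timeout < 0:
        raise ValueError("kill_timeout must be >= 0 (0 means never kill)")
    return [
        "innobackupex",
        f"--user={user}",
        f"--password={password}",
        f"--kill-long-queries-timeout={kill_timeout}",
        f"--kill-long-query-type={kill_type}",
        target_dir,
    ]


if __name__ == "__main__":
    # Prints the same command as the example above.
    print(" ".join(build_backup_cmd("/backup/", "root", "password")))
```

On a host where innobackupex is installed, the returned list can be handed to subprocess.run(cmd, check=True); passing a list rather than a single string avoids shell quoting problems.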


Reposted from blog.csdn.net/sdmei/article/details/78537946