MySQL backup failure: the twists and turns of analyzing and fixing the problem

Today a colleague and I worked through a strange MySQL space anomaly, and the process of handling it surfaced a few issues worth writing down.

The background: backups of one instance kept failing. I had investigated it several times and, after confirming that the Slave was still usable, set it aside; over the last few days I finally had time to sort it out and write it up.

The backup failure error was:

innobackupex: Error writing file '/tmp/xbtempevLQbf' (Errcode: 28 - No space left on device)
xtrabackup: Error: write to logfile failed
xtrabackup: Error: xtrabackup_copy_logfile() failed.

At first glance the problem looks straightforward: not enough space. But the space configuration itself was not the issue.

To reproduce it locally, I ran the following command on the machine:

/usr/local/mysql_tools/percona-xtrabackup-2.4.8-Linux-x86_64/bin/innobackupex --defaults-file=/data/mysql_4308/my.cnf --user=xxxx --password=xxxx --socket=/data/mysql_4308/tmp/mysql.sock  --stream=tar /data/xxxx/mysql/xxxx_4308/2020-02-11   > /data/xxxx/mysql/xxxx_4308/2020-02-11.tar.gz

We found that /tmp itself showed no space anomaly at all; instead, it was the space usage of the root filesystem that became abnormal (a screenshot of the abnormal usage was captured during the test).

And as soon as the exception was thrown and the backup failed, the space usage immediately recovered.

Putting together the information available so far, my feeling was that the problem did not seem directly linked to /tmp; some other directory under the root filesystem had to be producing the anomaly during the backup.
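Since the error message points at a temporary file under /tmp (the xbtemp file that innobackupex writes while streaming), one quick sanity check, shown here as a general sketch rather than part of the original troubleshooting, is to confirm which temporary directories the instance is actually configured to use; if I remember right, innobackupex has its own --tmpdir option and otherwise falls back to the tmpdir in the defaults file.

-- Where the server writes temporary files
SHOW VARIABLES LIKE 'tmpdir';
SHOW VARIABLES LIKE 'innodb_tmpdir';

-- Rough indicator of temporary-table pressure since startup
SHOW GLOBAL STATUS LIKE 'Created_tmp%';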

So I started a second test, this time focusing on the overall usage of the root filesystem to see exactly which directory was abnormal. Embarrassingly, even though my script sampled the usage quickly, none of the usual suspect directories showed a space anomaly.

332K    ./home
411M    ./lib
26M     ./lib64
16K     ./lost+found
4.0K    ./media
4.0K    ./misc
4.0K    ./mnt
0       ./net
184M    ./opt
du: cannot access `./proc/40102/task/40102/fd/4': No such file or directory
du: cannot access `./proc/40102/task/40102/fdinfo/4': No such file or directory
du: cannot access `./proc/40102/fd/4': No such file or directory
du: cannot access `./proc/40102/fdinfo/4': No such file or directory
0       ./proc
2.3G    ./root
56K     ./tmp
...

So, judging from what was in front of me, the anomaly appeared to be associated with the space reported under the /proc directory.

At this point it felt like the available leads were running out.

I went through the script again and checked the parameter files; compared with other environments there was nothing obviously different. But one detail caught my attention: top showed this instance using about 6G of memory (the server has 8G), while the buffer pool is configured at roughly 3G. This is a slave environment with no application connections, so it is unlikely that a large number of connections were consuming resources. Taken as a whole, the server's memory usage looked abnormal.
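To put numbers behind that impression, the following is a minimal sketch of the kind of memory checks that help here, assuming MySQL 5.7 with the sys schema installed and performance_schema memory instrumentation enabled:

-- Configured buffer pool size, in GB
SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gb;

-- Total memory allocation tracked by performance_schema
SELECT * FROM sys.memory_global_total;

-- Largest memory consumers by allocation area
SELECT event_name, current_alloc
FROM sys.memory_global_by_current_bytes
LIMIT 10;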

This time I tried an online resize of the buffer pool, but found that the memory did not shrink. Since this is a slave, I went ahead and restarted the slave service.
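For reference, in MySQL 5.7 the online resize is just a variable change, and its progress can be watched through a status variable; a minimal sketch (the 2G target is only an example value):

-- Shrink the buffer pool online (MySQL 5.7+); the target size here is an example
SET GLOBAL innodb_buffer_pool_size = 2 * 1024 * 1024 * 1024;

-- Resizing happens in chunks in the background; watch its progress here
SHOW STATUS LIKE 'Innodb_buffer_pool_resize_status';

Note that even when the resize completes, the RSS reported by top does not necessarily drop, since memory freed inside the process is not always returned to the operating system.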

Unexpectedly, the restart got stuck. After about two minutes it printed roughly two lines of numeric output and then hung with no further response and nothing in the error log, so I moved to plan B and prepared to kill the process and restart the service.

This time the kill took effect, and after a while the service came back up. But the slave then reported a replication error.

                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'
...
             Master_Server_Id: 190
                  Master_UUID: 570dcd0e-f6d0-11e8-adc3-005056b7e95f
...
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 200211 14:20:57
           Retrieved_Gtid_Set: 570dcd0e-f6d0-11e8-adc3-005056b7e95f:821211986-2157277214
            Executed_Gtid_Set: 570dcd0e-f6d0-11e8-adc3-005056b7e95f:1-820070317:821211986-2157277214

This error message is fairly self-explanatory: the binlogs on the master have been purged, so the slave can no longer fetch and apply the transactions it needs.

Why would such a strange thing happen? The master keeps binlogs for a configured number of days by default; it would certainly not delete a binlog written only an hour earlier.
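For context, the retention setting and the binlogs still present on the master can be checked directly; these are standard commands, shown as a sketch:

-- On the master: how many days binlogs are kept (pre-8.0 variable name)
SHOW VARIABLES LIKE 'expire_logs_days';

-- Binlog files still on disk, with sizes
SHOW BINARY LOGS;

-- GTIDs that have already been purged from the binlogs
SELECT @@GLOBAL.gtid_purged;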

The GTID-related values on the slave were as follows:

Retrieved_Gtid_Set: 570dcd0e-f6d0-11e8-adc3-005056b7e95f:821211986-2157277214

Executed_Gtid_Set: 570dcd0e-f6d0-11e8-adc3-005056b7e95f:1-820070317:821211986-2157277214

gtid_purged     : 570dcd0e-f6d0-11e8-adc3-005056b7e95f:1-820070317:821211986-2131381624

On the master side, gtid_purged was:

gtid_purged      :570dcd0e-f6d0-11e8-adc3-005056b7e95f:1-2089314252

Putting this information together, the GTID sets on the slave do not join up completely with the master's, which means some earlier operation on this slave caused the GTIDs on the master and slave to diverge.

And this missing stretch of changes, up to around 570dcd0e-f6d0-11e8-adc3-005056b7e95f:821211986, dates back, conservatively, more than a month; those binlogs are certainly no longer retained.
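A convenient way to see exactly which transactions the slave is missing is GTID_SUBTRACT(), taking the master's purged set minus the slave's executed set. Using the values above, a quick sketch:

-- Transactions already purged on the master that the slave never executed
SELECT GTID_SUBTRACT(
    '570dcd0e-f6d0-11e8-adc3-005056b7e95f:1-2089314252',
    '570dcd0e-f6d0-11e8-adc3-005056b7e95f:1-820070317:821211986-2157277214'
) AS missing_on_slave;
-- Expected: 570dcd0e-f6d0-11e8-adc3-005056b7e95f:820070318-821211985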

The first step here was to temporarily repair replication.

Unexpectedly, even stop slave had a problem: a seemingly simple stop slave operation took a full minute.

>>stop slave;

Query OK, 0 rows affected (1 min 1.99 sec)

I tried reducing the buffer pool configuration, restarting, and running stop slave again; the operation was still very slow. So the delay was not related to the buffer pool, and that direction could be ruled out; it was more likely related to GTIDs.
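When stop slave hangs like this, it is worth looking at what the replication threads are actually doing while you wait; a quick sketch, assuming MySQL 5.7's performance_schema replication tables are available:

-- Thread states while STOP SLAVE is waiting
SHOW PROCESSLIST;

-- Applier (SQL thread / worker) state and last applied transaction
SELECT * FROM performance_schema.replication_applier_status_by_worker;

-- IO thread connection state
SELECT * FROM performance_schema.replication_connection_status;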

The fix on the slave side was done with the following steps:

reset master;
stop slave;
reset slave all;
SET @@GLOBAL.GTID_PURGED='570dcd0e-f6d0-11e8-adc3-005056b7e95f:1-2157277214';
CHANGE MASTER TO MASTER_USER='dba_repl', MASTER_PASSWORD='xxxx' , MASTER_HOST='xxxxx',MASTER_PORT=4308,MASTER_AUTO_POSITION = 1;

The key step here is setting GTID_PURGED.
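After the CHANGE MASTER, the natural follow-up is to start replication and verify it; a brief sketch of the checks (standard commands, not copied from the original notes):

START SLAVE;

-- Both Slave_IO_Running and Slave_SQL_Running should be Yes,
-- and the Retrieved/Executed GTID sets should advance together
SHOW SLAVE STATUS\G

-- Sanity-check the executed set against the master's
SELECT @@GLOBAL.gtid_executed;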

After the repair, the replication delay on the slave side was resolved. I then tried the backup again, and this time there was no abnormal space consumption on the root filesystem either.

Summary:

Because the focus throughout was on solving the problem quickly, the logs captured at some steps are not rich or detailed enough, and from an analysis standpoint some of the more convincing evidence is still missing. As for the root cause, it was essentially some unreasonable condition (such as a bug or a configuration anomaly) that triggered the abnormal behavior.

What is worth taking away here is the overall approach to the analysis, rather than the specific problem itself.

If you have the Way but not yet the technique, the technique can still be sought; if you have the technique but not the Way, you stop at technique.
