mysql slave 备库延迟是怎么得到的

在mysql的备库的监控中有一项很重要的指标:Seconds_Behind_Master,这个值是怎么得到的呢？下面从5.1.58的代码中分析一下：

mysql的replication中有2个比较重要的class：Master_info(rpl_mi.h), Relay_log_info(rpl_rli.h)，他们分别对应于master,info文件和slave.info文件；很显然，Master_info是io_thread需要的，Relay_log_info是sql_thread需要的。Master_info中有一个变量 clock_diff_with_master，这个值记录着mysql的主库和备库的时间差，可以理解为主备的主机时间差。clock_diff_with_master变量的定义如下：

  /*
     The difference in seconds between the clock of the master and the clock of
     the slave (second - first). It must be signed as it may be <0 or >0.
     clock_diff_with_master is computed when the I/O thread starts; for this the
     I/O thread does a SELECT UNIX_TIMESTAMP() on the master.
     "how late the slave is compared to the master" is computed like this:
     clock_of_slave - last_timestamp_executed_by_SQL_thread - clock_diff_with_master

  */
  long clock_diff_with_master;

这个变量的注释直接提到了Seconds_Behind_Master的计算方法：clock_of_slave - last_timestamp_executed_by_SQL_thread - clock_diff_with_master。clock_of_slave是slave的当前时间--执行show slave status的当前时间。

先看一下clock_diff_with_master的计算：(slave.cc)。执行”start slave;“/“start slave io_thread;”后，会执行start_slave_threads来启动io thread，io thread启动后首先做的就是获取主库的mysql版本和主库的当前时间(mysql_real_query(mysql, STRING_WITH_LEN("SELECT UNIX_TIMESTAMP()")))，获取到主库的当前时间后，用备库的当前时间减去主库的时间，得到clock_diff_with_master。

int start_slave_threads(bool need_slave_mutex, bool wait_for_start,
                        Master_info* mi, const char* master_info_fname,
                        const char* slave_info_fname, int thread_mask)
{
...
  if (thread_mask & SLAVE_IO)
    error=start_slave_thread(handle_slave_io,lock_io,lock_cond_io,
                             cond_io,
                             &mi->slave_running, &mi->slave_run_id,
                             mi, 1); //high priority, to read the most possible
...
}
...
/**
  Slave IO thread entry point.

  @param arg Pointer to Master_info struct that holds information for
  the IO thread.

  @return Always 0.
*/
pthread_handler_t handle_slave_io(void *arg)
{
...
  thd_proc_info(thd, "Checking master version");
  ret= get_master_version_and_clock(mysql, mi);
...
}
...
static int get_master_version_and_clock(MYSQL* mysql, Master_info* mi)
{
...
  master_res= NULL;
  if (!mysql_real_query(mysql, STRING_WITH_LEN("SELECT UNIX_TIMESTAMP()")) &&
      (master_res= mysql_store_result(mysql)) &&
      (master_row= mysql_fetch_row(master_res)))
  {
    mi->clock_diff_with_master=
      (long) (time((time_t*) 0) - strtoul(master_row[0], 0, 10));
  }
  else if (is_network_error(mysql_errno(mysql)))
  {
    mi->report(WARNING_LEVEL, mysql_errno(mysql),
               "Get master clock failed with error: %s", mysql_error(mysql));
    goto network_err;
  }
  else 
  {
    mi->clock_diff_with_master= 0; /* The "most sensible" value */
    sql_print_warning("\"SELECT UNIX_TIMESTAMP()\" failed on master, "
                      "do not trust column Seconds_Behind_Master of SHOW "
                      "SLAVE STATUS. Error: %s (%d)",
                      mysql_error(mysql), mysql_errno(mysql));
  }
...
}

last_timestamp_executed_by_SQL_thread则是sql slave thread最近一次执行的event的时间戳(rpl.rli.cc中的last_event_start_time变量)。而这个变量的更新则是在stmt_done()函数中完成的(rpl.rli.cc)，相关定义如下:

  /*
    Used by row-based replication to detect that it should not stop at
    this event, but give it a chance to send more events. The time
    where the last event inside a group started is stored here. If the
    variable is zero, we are not in a group (but may be in a
    transaction).
   */
  time_t last_event_start_time;

  /**
    Helper function to do after statement completion.

    This function is called from an event to complete the group by
    either stepping the group position, if the "statement" is not
    inside a transaction; or increase the event position, if the
    "statement" is inside a transaction.

    @param event_log_pos
    Master log position of the event. The position is recorded in the
    relay log info and used to produce information for <code>SHOW
    SLAVE STATUS</code>.

    @param event_creation_time
    Timestamp for the creation of the event on the master side. The
    time stamp is recorded in the relay log info and used to compute
    the <code>Seconds_behind_master</code> field.
  */
  void stmt_done(my_off_t event_log_pos,
                 time_t event_creation_time);

具体的逻辑则是，sql thread启动后，读取relaylog(netxt_event())，apply & update pos(apply_event_and_update_pos())，update_pos的时候判断是否执行到了事务的结束位置，如果执行到了，则调用stmt_done()，stmt_done()会将last_master_timestamp更新为最近一次event的创建时间(event_creation_time)。因此如果在主备基本无延迟的时候，主库执行了一个大事务，你会发现备库延迟突然很大，然后又没了，"延迟跳跃"。

/**
  Slave SQL thread entry point.

  @param arg Pointer to Relay_log_info object that holds information
  for the SQL thread.

  @return Always 0.
*/
pthread_handler_t handle_slave_sql(void *arg)
{
...
    if (exec_relay_log_event(thd,rli))
...
}
...
static int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{
...
  Log_event * ev = next_event(rli);
...
    exec_res= apply_event_and_update_pos(ev, thd, rli);
...
}
...
int apply_event_and_update_pos(Log_event* ev, THD* thd, Relay_log_info* rli)
{
...
    exec_res= ev->apply_event(rli);
...
    int error= ev->update_pos(rli);
...
}
....log_event.cc
int
Rows_log_event::do_update_pos(Relay_log_info *rli)
{
...
  if (get_flags(STMT_END_F))
  {
    /*
      Indicate that a statement is finished.
      Step the group log position if we are not in a transaction,
      otherwise increase the event log position.
    */
    rli->stmt_done(log_pos, when);
   }
...
}
...rpl_rli.cc
void Relay_log_info::stmt_done(my_off_t event_master_log_pos,
                                  time_t event_creation_time)
{
...
  if ((sql_thd->options & OPTION_BEGIN) && opt_using_transactions)
    inc_event_relay_log_pos();
  else
  {
    inc_group_relay_log_pos(event_master_log_pos);
    flush_relay_log_info(this);
    /*
      Note that Rotate_log_event::do_apply_event() does not call this
      function, so there is no chance that a fake rotate event resets
      last_master_timestamp.  Note that we update without mutex
      (probably ok - except in some very rare cases, only consequence
      is that value may take some time to display in
      Seconds_Behind_Master - not critical).
    */
#ifndef DBUG_OFF
    if (!(event_creation_time == 0 && debug_not_change_ts_if_art_event > 0))
#else
      if (event_creation_time != 0)
#endif
        last_master_timestamp= event_creation_time;
  }
}

另外一种可能会出现延迟跳跃的情况是：主库的网络有问题，io thread拉binlog的速度很慢，备库实际有很大的延迟，而sql thread执行的时候永远是hot log，这个时候会出现，备库一直有很大的延迟，突然一下延迟没了(=0)，接下去延迟又变得很大。这种情况的原因为：sql thread读取relay log是在next_event(slave.cc)中完成的，而next_event中会判断当前的读取的relay log是否是“热的”--正在被io thread使用，如果是"热的"，并且读取到当前relay log的最后一个event 或io thread没有再更新relay log的时候，会将last_master_time_stamp置为0，并new一个 Rotate_log_event。而如果在这个时候执行"show slave status"，延迟就是0了。这个地方需要补充一点的是“show slave status”和前面提到的逻辑，唯一有争用的锁是&rli->data_lock，但是next_event在恢复last_master_time_stamp前就释放了这个锁：

...slave.cc...
static Log_event* next_event(Relay_log_info* rli)
{
...
  safe_mutex_assert_owner(&rli->data_lock);
...
    if (!cur_log->error) /* EOF */
    {
      /*
        On a hot log, EOF means that there are no more updates to
        process and we must block until I/O thread adds some and
        signals us to continue
      */
      if (hot_log)
      {
        time_t save_timestamp= rli->last_master_timestamp;
        rli->last_master_timestamp= 0;
...
        /*
          We can, and should release data_lock while we are waiting for
          update. If we do not, show slave status will block
        */
        pthread_mutex_unlock(&rli->data_lock);
...
        rli->last_master_timestamp= save_timestamp;
...
       }
...
     }
...
}
....
/**
  Execute a SHOW SLAVE STATUS statement.

  @param thd Pointer to THD object for the client thread executing the
  statement.

  @param mi Pointer to Master_info object for the IO thread.

  @retval FALSE success
  @retval TRUE failure
*/
bool show_master_info(THD* thd, Master_info* mi)
{
...
    pthread_mutex_lock(&mi->run_lock);
    protocol->store(mi->io_thd ? mi->io_thd->proc_info : "", &my_charset_bin);
    pthread_mutex_unlock(&mi->run_lock);

    pthread_mutex_lock(&mi->data_lock);
    pthread_mutex_lock(&mi->rli.data_lock);
    pthread_mutex_lock(&mi->err_lock);
    pthread_mutex_lock(&mi->rli.err_lock);
...
    /*
      Seconds_Behind_Master: if SQL thread is running and I/O thread is
      connected, we can compute it otherwise show NULL (i.e. unknown).
    */
    if ((mi->slave_running == MYSQL_SLAVE_RUN_CONNECT) &&
        mi->rli.slave_running)
    {
      long time_diff= ((long)(time(0) - mi->rli.last_master_timestamp)
                       - mi->clock_diff_with_master);
      /*
        Apparently on some systems time_diff can be <0. Here are possible
        reasons related to MySQL:
        - the master is itself a slave of another master whose time is ahead.
        - somebody used an explicit SET TIMESTAMP on the master.
        Possible reason related to granularity-to-second of time functions
        (nothing to do with MySQL), which can explain a value of -1:
        assume the master's and slave's time are perfectly synchronized, and
        that at slave's connection time, when the master's timestamp is read,
        it is at the very end of second 1, and (a very short time later) when
        the slave's timestamp is read it is at the very beginning of second
        2. Then the recorded value for master is 1 and the recorded value for
        slave is 2. At SHOW SLAVE STATUS time, assume that the difference
        between timestamp of slave and rli->last_master_timestamp is 0
        (i.e. they are in the same second), then we get 0-(2-1)=-1 as a result.
        This confuses users, so we don't go below 0: hence the max().

        last_master_timestamp == 0 (an "impossible" timestamp 1970) is a
        special marker to say "consider we have caught up".
      */
      protocol->store((longlong)(mi->rli.last_master_timestamp ?
                                 max(0, time_diff) : 0));
    }
    else
    {
      protocol->store_null();
    }
...
}

到这里Seconds_Behind_Master的计算就结束了。总结起来就是Seconds_Behind_Master的计算有3部分组成，备库当前时间，sql thread最近一次执行的event的时间，主备系统的时间差。备库出现延迟跳跃，可能是大事务引起，也可能是网络问题。(ps:这2种情况，都被我遇到过:))

链接：

http://www.mysqlperformanceblog.com/2013/09/16/possible-reasons-when-mysql-replication-lag-is-flapping-between-0-and-xxxxx/

mysql slave 备库延迟是怎么得到的

猜你喜欢