Redis Design and Implementation (17): Cluster Failure Detection

1 Overview

Nodes in a Redis cluster are either masters or slaves. A master is responsible for serving its assigned slots; a slave replicates its master's data and, once the master goes offline, takes over handling command requests in its place. This article covers node replication and failure detection; failover is deferred to the next post for reasons of length.

2 Node Replication

Sending "CLUSTER REPLICATE <nodeID>" to a cluster node turns the receiving node into a slave of the node identified by <nodeID>.

The implementation lives in clusterCommand() in cluster.c:

else if (!strcasecmp(c->argv[1]->ptr,"replicate") && c->argc == 3) {
    /* CLUSTER REPLICATE <NODE ID> */
    // look up the designated master
    clusterNode *n = clusterLookupNode(c->argv[2]->ptr);

    /* Lookup the specified node in our table. */
    // the node is unknown
    if (!n) {
        addReplyErrorFormat(c,"Unknown node %s", (char*)c->argv[2]->ptr);
        return;
    }

    /* I can't replicate myself. */
    if (n == myself) {
        addReplyError(c,"Can't replicate myself");
        return;
    }

    /* Can't replicate a slave. */
    // the designated node is itself a slave: reply with an error
    if (nodeIsSlave(n)) {
        addReplyError(c,"I can only replicate a master, not a slave.");
        return;
    }

    /* If the instance is currently a master, it should have no assigned
     * slots nor keys to accept to replicate some other node.
     * Slaves can switch to another master without issues. */
    // if myself is a master that still owns slots or whose database is
    // not empty, refuse to demote it
    if (nodeIsMaster(myself) &&
        (myself->numslots != 0 || dictSize(server.db[0].dict) != 0)) {
        addReplyError(c,
            "To set a master the node must be empty and "
            "without assigned slots.");
        return;
    }

    /* Set the master. */
    // make node n the master of myself
    clusterSetMaster(n);
    // schedule a state update and a config save
    clusterDoBeforeSleep(CLUSTER_TODO_UPDATE_STATE|CLUSTER_TODO_SAVE_CONFIG);
    addReply(c,shared.ok);
}

The main work is done by clusterSetMaster(), which makes the current node a slave of node n.

/* Set the specified node 'n' as master for this node.
 * If this node is currently a master, it is turned into a slave. */
void clusterSetMaster(clusterNode *n) {
    // n must not be the myself node
    serverAssert(n != myself);
    // myself must not own any slot
    serverAssert(myself->numslots == 0);
    // if myself is currently a master
    if (nodeIsMaster(myself)) {
        // clear the master and migrate-to flags
        myself->flags &= ~(CLUSTER_NODE_MASTER|CLUSTER_NODE_MIGRATE_TO);
        // and mark it as a slave
        myself->flags |= CLUSTER_NODE_SLAVE;
        // clear any importing/migrating state on all slots
        clusterCloseAllSlots();
    // myself is already a slave
    } else {
        // if myself already has a master, detach from it first
        if (myself->slaveof)
            clusterNodeRemoveSlave(myself->slaveof,myself);
    }
    // make n the master of myself
    myself->slaveof = n;
    // add myself to n's slave table
    clusterNodeAddSlave(n,myself);
    // start replicating from n
    replicationSetMaster(n->ip, n->port);
    // reset any manual failover state
    resetManualFailover();
}

The key call is replicationSetMaster(), which reuses the master-slave replication code path: it is equivalent to sending the current node a "SLAVEOF" command, and it kicks off the normal replication flow.
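
For context, here is a simplified sketch of what replicationSetMaster() boils down to in replication.c; the real function also discards the cached master, frees the replication backlog, cancels any in-flight handshake, and disconnects blocked clients:

void replicationSetMaster(char *ip, int port) {
    sdsfree(server.masterhost);
    server.masterhost = sdsnew(ip);       /* remember the master address */
    server.masterport = port;
    if (server.master) freeClient(server.master); /* drop the old master link */
    server.repl_state = REPL_STATE_CONNECT; /* the replication cron will now
                                             * connect and start SYNC/PSYNC */
}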

In the clusterNode structure, the slaves array and the numslaves counter record the slaves that are currently replicating a given master.
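
For reference, a trimmed sketch of the clusterNode fields that matter in this article (from cluster.h; the real struct has many more members):

typedef struct clusterNode {
    char name[CLUSTER_NAMELEN];     /* node ID */
    int flags;                      /* CLUSTER_NODE_MASTER|SLAVE|PFAIL|FAIL|... */
    int numslots;                   /* number of slots served by this node */
    int numslaves;                  /* number of slaves, if this is a master */
    struct clusterNode **slaves;    /* the slaves replicating this master */
    struct clusterNode *slaveof;    /* our master, if this node is a slave */
    mstime_t ping_sent;             /* when we sent the latest pending PING */
    mstime_t pong_received;         /* when we last received a PONG */
    mstime_t fail_time;             /* when the FAIL flag was set */
    list *fail_reports;             /* list of clusterNodeFailReport entries */
    char ip[NET_IP_STR_LEN];        /* last known IP address */
    int port;                       /* last known port */
    clusterLink *link;              /* TCP/IP link with this node */
} clusterNode;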

3 Failure Detection

The earlier article on the cluster gossip protocol described how Redis cluster nodes communicate through PING/PONG messages. These messages carry not only slot information but also master/slave state and node failure information, so failure detection is built on the same message propagation mechanism.

3.1 Subjective Failure Detection

First, every second a Redis cluster node samples a few random nodes and sends a PING to the one whose last PONG is oldest, i.e. the node it has gone longest without hearing from. This happens in the cluster cron function clusterCron():

    /* Ping some random node 1 time every 10 iterations, so that we usually ping
     * one random node every second. */
    // clusterCron runs every 100ms, so this block runs about once per second
    if (!(iteration % 10)) {
        int j;

        /* Check a few random nodes and ping the one with the oldest
         * pong_received time. */
        // sample 5 random nodes and ping the one with the smallest
        // pong_received (de, min_pong_node and min_pong are declared
        // earlier in clusterCron)
        for (j = 0; j < 5; j++) {
            // pick one node at random
            de = dictGetRandomKey(server.cluster->nodes);
            clusterNode *this = dictGetVal(de);

            /* Don't ping nodes disconnected or with a ping currently active. */
            if (this->link == NULL || this->ping_sent != 0) continue;
            // skip myself and nodes still in handshake
            if (this->flags & (CLUSTER_NODE_MYSELF|CLUSTER_NODE_HANDSHAKE))
                continue;
            // remember the sampled node whose PONG is the oldest
            if (min_pong_node == NULL || min_pong > this->pong_received) {
                min_pong_node = this;
                min_pong = this->pong_received;
            }
        }
        // ping the node we have heard from least recently, to check
        // whether it is still reachable
        if (min_pong_node) {
            serverLog(LL_DEBUG,"Pinging node %.40s", min_pong_node->name);
            clusterSendPing(min_pong_node->link, CLUSTERMSG_TYPE_PING);
        }
    }

If the node really has failed, it will never answer the PING with a PONG, and the timeout logic below kicks in.

    // iterate over all the nodes
    while((de = dictNext(di)) != NULL) {
        clusterNode *node = dictGetVal(de);
        now = mstime(); /* Use an updated time at every iteration. */
        mstime_t delay;
        // skip myself, nodes with no known address (NOADDR), and nodes
        // still in handshake
        if (node->flags &
            (CLUSTER_NODE_MYSELF|CLUSTER_NODE_NOADDR|CLUSTER_NODE_HANDSHAKE))
                continue;

        /* Orphaned master check, useful only if the current instance
         * is a slave that may migrate to another master. */
        // myself is a slave, node is a master, and that master is not FAIL
        if (nodeIsSlave(myself) && nodeIsMaster(node) && !nodeFailed(node)) {
            // count node's working (non-failing) slaves
            int okslaves = clusterCountNonFailingSlaves(node);

            /* A master is orphaned if it is serving a non-zero number of
             * slots, have no working slaves, but used to have at least one
             * slave, or failed over a master that used to have slaves. */
            // node has no working slaves, still serves slots, and carries
            // the migrate-to flag
            if (okslaves == 0 && node->numslots > 0 &&
                node->flags & CLUSTER_NODE_MIGRATE_TO)
            {
                // one more orphaned master
                orphaned_masters++;
            }
            // track the maximum number of working slaves seen for a master
            if (okslaves > max_slaves) max_slaves = okslaves;
            // if myself is a slave of this very master
            if (nodeIsSlave(myself) && myself->slaveof == node)
                // remember how many working slaves our own master has
                this_slaves = okslaves;
        }

        /* If we are waiting for the PONG more than half the cluster
         * timeout, reconnect the link: maybe there is a connection
         * issue even if the node is alive. */
        // we have waited for the PONG for more than half of
        // cluster_node_timeout (and the link is old enough): drop the link,
        // the connection may be broken even if the node itself is fine
        if (node->link && /* is connected */
            now - node->link->ctime >
            server.cluster_node_timeout && /* was not already reconnected */
            node->ping_sent && /* we already sent a ping */
            node->pong_received < node->ping_sent && /* still waiting pong */
            /* and we are waiting for the pong more than timeout/2 */
            now - node->ping_sent > server.cluster_node_timeout/2)
        {
            /* Disconnect the link, it will be reconnected automatically. */
            // free the link; the next cron cycle reconnects automatically
            freeClusterLink(node->link);
        }

        /* If we have currently no active ping in this instance, and the
         * received PONG is older than half the cluster timeout, send
         * a new ping now, to ensure all the nodes are pinged without
         * a too big delay. */
        // no PING is pending and the last PONG is older than half the
        // timeout: ping the node now
        if (node->link &&
            node->ping_sent == 0 &&
            (now - node->pong_received) > server.cluster_node_timeout/2)
        {
            clusterSendPing(node->link, CLUSTERMSG_TYPE_PING);
            continue;
        }

        /* If we are a master and one of the slaves requested a manual
         * failover, ping it continuously. */
        // myself is a master and this slave asked for a manual failover:
        // keep pinging it
        if (server.cluster->mf_end &&
            nodeIsMaster(myself) &&
            server.cluster->mf_slave == node &&
            node->link)
        {
            clusterSendPing(node->link, CLUSTERMSG_TYPE_PING);
            continue;
        }

        /* Check only if we have an active ping for this instance. */
        // everything below only applies once a PING has been sent
        if (node->ping_sent == 0) continue;

        /* Compute the delay of the PONG. Note that if we already received
         * the PONG, then node->ping_sent is zero, so can't reach this
         * code at all. */
        // how long we have been waiting for the PONG
        delay = now - node->ping_sent;
        // the wait exceeded the node timeout
        if (delay > server.cluster_node_timeout) {
            /* Timeout reached. Set the node as possibly failing if it is
             * not already in this state. */
            // flag the node as possibly failing (PFAIL)
            if (!(node->flags & (CLUSTER_NODE_PFAIL|CLUSTER_NODE_FAIL))) {
                serverLog(LL_DEBUG,"*** NODE %.40s possibly failing",
                    node->name);
                node->flags |= CLUSTER_NODE_PFAIL;
                // schedule a cluster state update
                update_state = 1;
            }
        }
    }

If the PING has gone unanswered for longer than cluster_node_timeout (15 seconds by default), the loop sets the CLUSTER_NODE_PFAIL flag on that node: the myself node now subjectively considers it down. This is only a local suspicion, not the final failure verdict.
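
The subjective/objective checks in the rest of this article rely on a few flag-testing helpers defined in cluster.h:

#define nodeIsMaster(n) ((n)->flags & CLUSTER_NODE_MASTER)
#define nodeIsSlave(n)  ((n)->flags & CLUSTER_NODE_SLAVE)
#define nodeTimedOut(n) ((n)->flags & CLUSTER_NODE_PFAIL) /* subjectively down */
#define nodeFailed(n)   ((n)->flags & CLUSTER_NODE_FAIL)  /* objectively down */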

3.2 Objective Failure Detection

Once the myself node suspects a node is down, it sets that node's CLUSTER_NODE_PFAIL flag, marking it subjectively down. As described in the earlier gossip-protocol article, clusterSendPing() includes nodes in the CLUSTER_NODE_PFAIL state in the gossip section of outgoing PING/PONG messages. Whichever node receives such a message reads it with clusterReadHandler(), which checks that the message is complete and well formed, and then hands it to the generic dispatcher clusterProcessPacket(). Here we only look at the code that processes the gossip section of PING/PONG packets, clusterProcessGossipSection() (a sketch of the gossip entry layout follows the listing):

/* Process the gossip section of PING or PONG packets.
 * Note that this function assumes that the packet is already sanity-checked
 * by the caller, not in the content of the gossip section, but in the
 * length. */
void clusterProcessGossipSection(clusterMsg *hdr, clusterLink *link) {
    // number of gossip entries carried by this message
    uint16_t count = ntohs(hdr->count);
    // address of the clusterMsgDataGossip array
    clusterMsgDataGossip *g = (clusterMsgDataGossip*) hdr->data.ping.gossip;
    // the node that sent the message
    clusterNode *sender = link->node ? link->node : clusterLookupNode(hdr->sender);

    // iterate over every gossip entry
    while(count--) {
        // flags of the node described by this entry
        uint16_t flags = ntohs(g->flags);
        clusterNode *node;
        sds ci;

        if (server.verbosity == LL_DEBUG) {
            // render the flags as a comma-separated sds string ci
            ci = representClusterNodeFlags(sdsempty(), flags);
            serverLog(LL_DEBUG,"GOSSIP %.40s %s:%d %s",
                g->nodename,
                g->ip,
                ntohs(g->port),
                ci);
            sdsfree(ci);
        }

        /* Update our state accordingly to the gossip sections */
        // look up the described node in our cluster table by name
        node = clusterLookupNode(g->nodename);
        // the node is already known to us
        if (node) {
            /* We already know this node.
               Handle failure reports, only when the sender is a master. */
            // handle failure reports only when the sender is a master and
            // the entry is not about ourselves
            if (sender && nodeIsMaster(sender) && node != myself) {
                // the entry flags the node as (possibly) failing
                if (flags & (CLUSTER_NODE_FAIL|CLUSTER_NODE_PFAIL)) {
                    // add sender to node's failure report list
                    if (clusterNodeAddFailureReport(node,sender)) {
                        serverLog(LL_VERBOSE,
                            "Node %.40s reported node %.40s as not reachable.",
                            sender->name, node->name);
                    }
                    // check whether node should now be marked as FAIL
                    markNodeAsFailingIfNeeded(node);
                } else {
                    // the entry reports the node as healthy: if sender
                    // previously reported it down, retract that report
                    if (clusterNodeDelFailureReport(node,sender)) {
                        serverLog(LL_VERBOSE,
                            "Node %.40s reported node %.40s is back online.",
                            sender->name, node->name);
                    }
                }
            }

            /* If we already know this node, but it is not reachable, and
             * we see a different address in the gossip section of a node that
             * can talk with this other node, update the address, disconnect
             * the old link if any, so that we'll attempt to connect with the
             * new address. */
            // we consider the node down, but this entry says it is up and
            // reachable at a different address: the node probably moved,
            // so adopt the new address and let the cron reconnect
            if (node->flags & (CLUSTER_NODE_FAIL|CLUSTER_NODE_PFAIL) &&
                !(flags & CLUSTER_NODE_NOADDR) &&
                !(flags & (CLUSTER_NODE_FAIL|CLUSTER_NODE_PFAIL)) &&
                (strcasecmp(node->ip,g->ip) || node->port != ntohs(g->port)))
            {
                // free the old cluster link, if any
                if (node->link) freeClusterLink(node->link);
                // take the address from the gossip entry
                memcpy(node->ip,g->ip,NET_IP_STR_LEN);
                node->port = ntohs(g->port);
                // clear the no-address flag
                node->flags &= ~CLUSTER_NODE_NOADDR;
            }
        } else {
            /* If it's not in NOADDR state and we don't have it, we
             * start a handshake process against this IP/PORT pairs.
             *
             * Note that we require that the sender of this gossip message
             * is a well known node in our cluster, otherwise we risk
             * joining another cluster. */
            // the node is unknown: if it is not flagged NOADDR and is not
            // blacklisted, start a handshake with the advertised address.
            // The sender must be a node we already trust, otherwise we
            // risk joining a different cluster.
            if (sender &&
                !(flags & CLUSTER_NODE_NOADDR) &&
                !clusterBlacklistExists(g->nodename))
            {
                // start the handshake
                clusterStartHandshake(g->ip,ntohs(g->port));
            }
        }

        /* Next node */
        g++;
    }
}
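
Each gossip entry g parsed above has the following layout (clusterMsgDataGossip in cluster.h, as in the 3.x sources; later versions repurpose the unused fields):

typedef struct {
    char nodename[CLUSTER_NAMELEN]; /* ID of the described node */
    uint32_t ping_sent;             /* last PING sent, in seconds */
    uint32_t pong_received;         /* last PONG received, in seconds */
    char ip[NET_IP_STR_LEN];        /* IP address, last time it was seen */
    uint16_t port;                  /* port, last time it was seen */
    uint16_t flags;                 /* copy of node->flags */
    uint16_t notused1;              /* some room for future improvements */
    uint32_t notused2;
} clusterMsgDataGossip;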

If the node described by a gossip entry is found in our table, and the entry flags it as FAIL or PFAIL, clusterNodeAddFailureReport() is called to add the sender to that node's failure report list, and markNodeAsFailingIfNeeded() then checks whether the node should be considered objectively down. Let's look at each function in turn:

/* This function is called every time we get a failure report from a node.
 * The side effect is to populate the fail_reports list (or to update
 * the timestamp of an existing report).
 *
 * 'failing' is the node that is in failure state according to the
 * 'sender' node.
 *
 * The function returns 0 if it just updates a timestamp of an existing
 * failure report from the same sender. 1 is returned if a new failure
 * report is created. */
int clusterNodeAddFailureReport(clusterNode *failing, clusterNode *sender) {
    // the list holding the failure reports for this node
    list *l = failing->fail_reports;
    listNode *ln;
    listIter li;
    clusterNodeFailReport *fr;

    /* If a failure report from the same sender already exists, just update
     * the timestamp. */
    listRewind(l,&li);
    while ((ln = listNext(&li)) != NULL) {
        fr = ln->value;
        // sender already reported this node: refresh the timestamp
        if (fr->node == sender) {
            fr->time = mstime();
            return 0;
        }
    }

    /* Otherwise create a new report. */
    fr = zmalloc(sizeof(*fr));
    // record who reported the failure
    fr->node = sender;
    // and when
    fr->time = mstime();
    // append to the failure report list
    listAddNodeTail(l,fr);
    return 1;
}

The function is straightforward: it walks the failing node's fail_reports list. If sender has already reported this node down, only the report's timestamp is refreshed; otherwise a new report is created and appended to the list.
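
Each entry in the fail_reports list is a small struct (clusterNodeFailReport in cluster.h):

typedef struct clusterNodeFailReport {
    struct clusterNode *node; /* Node reporting the failure condition. */
    mstime_t time;            /* Time of the last report from this node. */
} clusterNodeFailReport;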

markNodeAsFailingIfNeeded() is then called to decide whether the node is objectively down. The code is as follows:

/* This function checks if a given node should be marked as FAIL.
 * It happens if the following conditions are met:
 *
 * 1) We received enough failure reports from other master nodes via gossip.
 *    Enough means that the majority of the masters signaled the node is
 *    down recently.
 * 2) We believe this node is in PFAIL state.
 *
 * If a failure is detected we also inform the whole cluster about this
 * event trying to force every other node to set the FAIL flag for the node.
 *
 * Note that the form of agreement used here is weak, as we collect the majority
 * of masters state during some time, and even if we force agreement by
 * propagating the FAIL message, because of partitions we may not reach every
 * node. However:
 *
 * 1) Either we reach the majority and eventually the FAIL state will propagate
 *    to all the cluster.
 * 2) Or there is no majority so no slave promotion will be authorized and the
 *    FAIL flag will be cleared after some time.
 */
void markNodeAsFailingIfNeeded(clusterNode *node) {
    int failures;
    // quorum needed to mark the node as FAIL: more than half of the
    // masters serving slots (server.cluster->size)
    int needed_quorum = (server.cluster->size / 2) + 1;

    // we do not even consider it PFAIL ourselves: nothing to do
    if (!nodeTimedOut(node)) return; /* We can reach it. */
    // already marked FAIL: nothing to do
    if (nodeFailed(node)) return; /* Already FAILing. */

    // number of other nodes that reported node as PFAIL or FAIL
    // (not counting ourselves)
    failures = clusterNodeFailureReportsCount(node);
    /* Also count myself as a voter if I'm a master. */
    if (nodeIsMaster(myself)) failures++;
    // not enough reports to reach the quorum: the node cannot be
    // declared down yet
    if (failures < needed_quorum) return; /* No weak agreement from masters. */

    serverLog(LL_NOTICE,
        "Marking node %.40s as failing (quorum reached).", node->name);

    /* Mark the node as failing. */
    // clear PFAIL, set FAIL
    node->flags &= ~CLUSTER_NODE_PFAIL;
    node->flags |= CLUSTER_NODE_FAIL;
    // record when the node was declared failing
    node->fail_time = mstime();

    /* Broadcast the failing node name to everybody, forcing all the other
     * reachable nodes to flag the node as FAIL. */
    // if myself is a master, broadcast a FAIL message so that every
    // other node also marks this node as FAIL
    if (nodeIsMaster(myself)) clusterSendFail(node->name);
    clusterDoBeforeSleep(CLUSTER_TODO_UPDATE_STATE|CLUSTER_TODO_SAVE_CONFIG);
}

This function decides whether a node is objectively down. Redis declares a node FAIL once more than half of the slot-serving masters consider it down; needed_quorum computes exactly that majority. For example, with 6 masters serving slots, needed_quorum is (6 / 2) + 1 = 4. Note also that failure reports expire: clusterNodeFailureReportsCount() discards reports older than twice cluster_node_timeout, so the quorum must be reached within a bounded time window.

The final step is to broadcast the objectively-down node to the whole cluster by sending a FAIL message, which is done by clusterSendFail():

/* Send a FAIL message to all the nodes we are able to contact.
 * The FAIL message is sent when we detect that a node is failing
 * (CLUSTER_NODE_PFAIL) and we also receive a gossip confirmation of this:
 * we switch the node state to CLUSTER_NODE_FAIL and ask all the other
 * nodes to do the same ASAP. */
void clusterSendFail(char *nodename) {
    unsigned char buf[sizeof(clusterMsg)];
    clusterMsg *hdr = (clusterMsg*) buf;
    // build the header of the FAIL message
    clusterBuildMessageHdr(hdr,CLUSTERMSG_TYPE_FAIL);
    // set the name of the failing node
    memcpy(hdr->data.fail.about.nodename,nodename,CLUSTER_NAMELEN);
    // broadcast to every node in the cluster
    clusterBroadcastMessage(buf,ntohl(hdr->totlen));
}

With that, every node in the cluster learns that the node is objectively down.
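
On the receiving side, clusterProcessPacket() reacts to a CLUSTERMSG_TYPE_FAIL message roughly as follows (paraphrased from cluster.c; the flag names are real, the excerpt is trimmed):

else if (type == CLUSTERMSG_TYPE_FAIL) {
    clusterNode *failing;

    if (sender) {
        failing = clusterLookupNode(hdr->data.fail.about.nodename);
        // mark the reported node as FAIL, unless it is ourselves or
        // already flagged
        if (failing &&
            !(failing->flags & (CLUSTER_NODE_FAIL|CLUSTER_NODE_MYSELF)))
        {
            serverLog(LL_NOTICE,
                "FAIL message received from %.40s about %.40s",
                hdr->sender, hdr->data.fail.about.nodename);
            failing->flags |= CLUSTER_NODE_FAIL;
            failing->fail_time = mstime();
            failing->flags &= ~CLUSTER_NODE_PFAIL;
            clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG|
                                 CLUSTER_TODO_UPDATE_STATE);
        }
    }
}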

Reference:

https://blog.csdn.net/men_wen/article/details/73137338
