"Redis Design and Implementation" - 17: Slot Assignment

1. Preface

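A Redis cluster stores the key-value pairs of its database by sharding: the whole database is divided into 16384 slots, every key in the database belongs to exactly one of these slots, and each node in the cluster can handle anywhere from 0 to 16384 slots.
When every one of the 16384 slots is handled by some node, the cluster is online (ok); conversely, if any slot is not handled by any node, the cluster is offline (fail).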
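
The online/offline rule above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the `node` type, the `owners` table and the function `cluster_is_ok()` are hypothetical names, not part of the Redis source.

```c
#include <stddef.h>

#define CLUSTER_SLOTS 16384

/* Hypothetical stand-in for a cluster node; only its identity matters here. */
typedef struct node { int id; } node;

/* Returns 1 ("ok") when every slot has an owner, 0 ("fail") as soon as
 * one slot is found that no node handles. */
int cluster_is_ok(node *owners[CLUSTER_SLOTS]) {
    for (int j = 0; j < CLUSTER_SLOTS; j++) {
        if (owners[j] == NULL) return 0;
    }
    return 1;
}
```

A single unassigned slot is enough to make the function report the offline state, which mirrors the rule described in the text.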
Sending the CLUSTER ADDSLOTS command to a node assigns one or more slots to that node:

CLUSTER ADDSLOTS <slot> [slot ...]

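The following sections describe how a node stores slot assignment information and how nodes propagate it to each other, and then walk through the implementation of the CLUSTER ADDSLOTS command.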

2. Recording a node's slot assignment

The slots and numslots fields of the clusterNode structure record which slots a node handles:

typedef struct clusterNode {
    // Slot bitmap of the node
    unsigned char slots[CLUSTER_SLOTS/8]; /* slots handled by this node */
    // Number of slots this node is responsible for
    int numslots;   /* Number of slots handled by this node */
} clusterNode;

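The slots field is a bit array of 16384/8 = 2048 bytes, holding 16384 bits in total.
Redis numbers these 16384 bits from index 0 to index 16383 and uses the bit at index i to decide whether the node handles slot i:

If the bit at index i of the slots array is 1, the node handles slot i;
If the bit at index i of the slots array is 0, the node does not handle slot i.

The numslots field records how many slots the node handles, that is, the number of bits in the slots array whose value is 1.

For example, if the bits at indices 1, 3, 5, 8, 9 and 10 of the array are 1 and every other bit is 0, the node handles slots 1, 3, 5, 8, 9 and 10.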
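
The bit layout described above can be reproduced with a small sketch. The names `slot_node`, `bitmap_test()`, `bitmap_set()` and `build_example_node()` are hypothetical; the sketch assumes that slot i lives in byte i/8 at bit i%8, counted from the least significant bit.

```c
#include <string.h>

#define CLUSTER_SLOTS 16384

/* Hypothetical, trimmed-down node: just the bitmap and the counter. */
typedef struct slot_node {
    unsigned char slots[CLUSTER_SLOTS / 8];
    int numslots;
} slot_node;

/* Return 1 if bit `pos` is set: byte pos/8, bit pos%8 inside that byte. */
int bitmap_test(const unsigned char *bitmap, int pos) {
    return (bitmap[pos / 8] & (1 << (pos & 7))) != 0;
}

/* Set bit `pos` to 1. */
void bitmap_set(unsigned char *bitmap, int pos) {
    bitmap[pos / 8] |= (unsigned char)(1 << (pos & 7));
}

/* Build the example node that handles slots 1, 3, 5, 8, 9 and 10. */
void build_example_node(slot_node *n) {
    static const int owned[] = {1, 3, 5, 8, 9, 10};
    memset(n, 0, sizeof(*n));
    for (size_t i = 0; i < sizeof(owned) / sizeof(owned[0]); i++) {
        bitmap_set(n->slots, owned[i]);
        n->numslots++;
    }
}
```

With this layout, slots 1, 3 and 5 land in the first byte (0x2A) and slots 8, 9 and 10 in the second byte (0x07), and numslots ends up as 6.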

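3. Propagating a node's slot assignment

Besides recording the slots it handles in the slots and numslots fields of its clusterNode structure, a node also sends its slots array to the other nodes in the cluster through messages, telling them which slots it currently handles.
When node A receives node B's slots array in a message, node A looks up the clusterNode structure for node B in its own clusterState.nodes dictionary and saves or updates the slots array in that structure.
Because every node sends its slots array to the other nodes, and every node that receives a slots array stores it in the clusterNode structure of the corresponding node, every node in the cluster ends up knowing which node each of the 16384 slots is assigned to.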
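
The receive side of this exchange can be sketched as follows. This is a simplification, not the Redis code path: a flat array (`node_table`) stands in for the clusterState.nodes dictionary, the names `peer`, `lookup_peer()` and `store_advertised_slots()` are hypothetical, and the configEpoch checks that Redis performs before accepting a claim are left out.

```c
#include <string.h>
#include <stddef.h>

#define CLUSTER_SLOTS 16384
#define NAME_LEN 40
#define MAX_NODES 8

/* Hypothetical record that node A keeps for each peer it knows about. */
typedef struct peer {
    char name[NAME_LEN + 1];
    unsigned char slots[CLUSTER_SLOTS / 8];
} peer;

/* Stand-in for the nodes dictionary: a small flat array. */
typedef struct node_table {
    peer nodes[MAX_NODES];
    int count;
} node_table;

/* Find the record for `name`, or NULL if the node is unknown. */
peer *lookup_peer(node_table *t, const char *name) {
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->nodes[i].name, name) == 0) return &t->nodes[i];
    return NULL;
}

/* Store the bitmap advertised by `sender`.
 * Returns 1 if the stored bitmap changed, 0 if it was already identical,
 * -1 if the sender is not in the table. */
int store_advertised_slots(node_table *t, const char *sender,
                           const unsigned char *myslots) {
    peer *p = lookup_peer(t, sender);
    if (p == NULL) return -1;
    if (memcmp(p->slots, myslots, sizeof(p->slots)) == 0) return 0;
    memcpy(p->slots, myslots, sizeof(p->slots));
    return 1;
}
```

The memcmp() before the copy plays the same role as the mismatch check in the real packet handler: the expensive work is only needed when the advertised bitmap differs from the stored one.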

When clusterBuildMessageHdr() builds the header of a message, it copies the sending node's slot information into the header.

/* Build the message header. hdr must point to a buffer at least
 * sizeof(clusterMsg) in bytes. */
void clusterBuildMessageHdr(clusterMsg *hdr, int type) {
    int totlen = 0;
    uint64_t offset;
    clusterNode *master;

    /* If this node is a master, we send its slots bitmap and configEpoch.
     * If this node is a slave we send the master's information instead (the
     * node is flagged as slave so the receiver knows that it is NOT really
     * in charge of these slots). */
    master = (nodeIsSlave(myself) && myself->slaveof) ?
              myself->slaveof : myself;
    // Zero the whole header.
    memset(hdr,0,sizeof(*hdr));
    // Set the protocol version and the signature.
    hdr->ver = htons(CLUSTER_PROTO_VER);
    hdr->sig[0] = 'R';
    hdr->sig[1] = 'C';
    hdr->sig[2] = 'm';
    hdr->sig[3] = 'b';
    // Set the message type.
    hdr->type = htons(type);
    // Set the name of the sender.
    memcpy(hdr->sender,myself->name,CLUSTER_NAMELEN);
    // Copy the slots bitmap (our own, or our master's if we are a slave).
    memcpy(hdr->myslots,master->slots,sizeof(hdr->myslots));
    // Clear the slaveof field.
    memset(hdr->slaveof,0,CLUSTER_NAMELEN);
    // If myself is a slave, record the name of its master.
    if (myself->slaveof != NULL)
        memcpy(hdr->slaveof,myself->slaveof->name, CLUSTER_NAMELEN);
    // Set the port.
    hdr->port = htons(server.port);
    // Set the flags of myself.
    hdr->flags = htons(myself->flags);
    // Set the cluster state as seen by this node.
    hdr->state = server.cluster->state;

    /* Set the currentEpoch and configEpochs. */
    hdr->currentEpoch = htonu64(server.cluster->currentEpoch);
    hdr->configEpoch = htonu64(master->configEpoch);

    /* Set the replication offset. */
    if (nodeIsSlave(myself))
        // myself is a slave: use the offset it has replicated up to.
        offset = replicationGetSlaveOffset();
    else
        // myself is a master: use its own replication offset.
        offset = server.master_repl_offset;
    hdr->offset = htonu64(offset);

    /* Set the message flags. */
    // If myself is a master with a manual failover in progress,
    // flag the message to tell receivers that this master is paused.
    if (nodeIsMaster(myself) && server.cluster->mf_end)
        hdr->mflags[0] |= CLUSTERMSG_FLAG0_PAUSED;

    /* Compute the message length for certain messages. For other messages
     * this is up to the caller. */
    if (type == CLUSTERMSG_TYPE_FAIL) {
        // FAIL message: header plus the FAIL payload.
        totlen = sizeof(clusterMsg)-sizeof(union clusterMsgData);
        totlen += sizeof(clusterMsgDataFail);
    } else if (type == CLUSTERMSG_TYPE_UPDATE) {
        // UPDATE message: header plus the UPDATE payload.
        totlen = sizeof(clusterMsg)-sizeof(union clusterMsgData);
        totlen += sizeof(clusterMsgDataUpdate);
    }
    // Set the total length of the message.
    hdr->totlen = htonl(totlen);
    /* For PING, PONG, and MEET, fixing the totlen field is up to the caller. */
}
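
The part of the header that matters for slot propagation is the choice of whose bitmap to advertise. The following reduced sketch isolates that choice. The types `mini_node` and `mini_hdr` and the function `build_mini_hdr()` are hypothetical, the header carries only three fields, and the byte-order conversions of the real function are deliberately omitted.

```c
#include <string.h>
#include <stddef.h>

#define CLUSTER_SLOTS 16384

/* Hypothetical reduced node: role, master pointer, epoch and bitmap. */
typedef struct mini_node {
    int is_slave;
    struct mini_node *slaveof;
    unsigned long long config_epoch;
    unsigned char slots[CLUSTER_SLOTS / 8];
} mini_node;

/* Hypothetical reduced header with only the fields used in this sketch. */
typedef struct mini_hdr {
    char sig[4];
    unsigned long long config_epoch;
    unsigned char myslots[CLUSTER_SLOTS / 8];
} mini_hdr;

/* Fill the header on behalf of `self`. A slave with a known master
 * advertises its master's bitmap and configEpoch instead of its own. */
void build_mini_hdr(mini_hdr *hdr, const mini_node *self) {
    const mini_node *master =
        (self->is_slave && self->slaveof) ? self->slaveof : self;

    memset(hdr, 0, sizeof(*hdr));
    memcpy(hdr->sig, "RCmb", 4);
    /* Host byte order: this sketch skips the network byte-order step. */
    hdr->config_epoch = master->config_epoch;
    memcpy(hdr->myslots, master->slots, sizeof(hdr->myslots));
}
```

Because the whole bitmap is copied with one memcpy(), advertising the slots costs a fixed 2048 bytes per message regardless of how many slots the node handles.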

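When clusterProcessPacket() handles a received packet, it compares the slot information carried in the packet with its own records; if the two do not match, it updates the slot assignment as seen from the current node. The code that handles this case is: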

        /* Update our info about served slots.
         *
         * Note: this MUST happen after we update the master/slave state
         * so that CLUSTER_NODE_MASTER flag will be set. */

        /* Many checks are only needed if the set of served slots this
         * instance claims is different compared to the set of slots we have
         * for it. Check this ASAP to avoid other computationally expensive
         * checks later. */
        clusterNode *sender_master = NULL; /* Sender or its master if slave. */
        int dirty_slots = 0; /* Sender claimed slots don't match my view? */

        if (sender) {
            // If the sender is a master, use the sender itself;
            // if it is a slave, use its master.
            sender_master = nodeIsMaster(sender) ? sender : sender->slaveof;
            if (sender_master) {
                // Do the slots claimed in the header differ from our record?
                dirty_slots = memcmp(sender_master->slots,
                        hdr->myslots,sizeof(hdr->myslots)) != 0;
            }
        }

        /* 1) If the sender of the message is a master, and we detected that
         *    the set of slots it claims changed, scan the slots to see if we
         *    need to update our configuration. */
        if (sender && nodeIsMaster(sender) && dirty_slots)
            // Reconcile our record of the sender's slots with what it claims.
            clusterUpdateSlotsConfigWith(sender,senderConfigEpoch,hdr->myslots);

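After the sending node has taken charge of some slots, it advertises them to the myself node in its packets. The sender entry that myself looks up in its own view of the cluster does not yet record those slots, so dirty_slots is set to 1 to indicate a mismatch. clusterUpdateSlotsConfigWith() is then called to update, in myself's view of the cluster, the slots held by the sending node. The function is as follows: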

/* This function is called when we receive a master configuration via a
 * PING, PONG or UPDATE packet. What we receive is a node, a configEpoch of the
 * node, and the set of slots claimed under this configEpoch.
 *
 * What we do is to rebind the slots with newer configuration compared to our
 * local configuration, and if needed, we turn ourself into a replica of the
 * node (see the function comments for more info).
 *
 * The 'sender' is the node for which we received a configuration update.
 * Sometimes it is not actually the "Sender" of the information, like in the
 * case we receive the info via an UPDATE packet. */
 
void clusterUpdateSlotsConfigWith(clusterNode *sender, uint64_t senderConfigEpoch, unsigned char *slots) {
    int j;
    clusterNode *curmaster, *newmaster = NULL;
    /* The dirty slots list is a list of slots for which we lose the ownership
     * while having still keys inside. This usually happens after a failover
     * or after a manual cluster reconfiguration operated by the admin.
     *
     * If the update message is not able to demote a master to slave (in this
     * case we'll resync with the master updating the whole key space), we
     * need to delete all the keys in the slots we lost ownership. */
    uint16_t dirty_slots[CLUSTER_SLOTS];
    int dirty_slots_count = 0;

    /* Here we set curmaster to this node or the node this node
     * replicates to if it's a slave. In the for loop we are
     * interested to check if slots are taken away from curmaster. */
    curmaster = nodeIsMaster(myself) ? myself : myself->slaveof;

    // If the update is about this node itself, discard it and return.
    if (sender == myself) {    	 
        serverLog(LL_WARNING,"Discarding UPDATE message about myself.");
        return;
    }
	  
    // Walk through all the slots.
    for (j = 0; j < CLUSTER_SLOTS; j++) {
        // Only slots claimed by the sender (bit j set in 'slots') are processed.
        if (bitmapTestBit(slots,j)) {
            /* The slot is already bound to the sender of this message. */
            if (server.cluster->slots[j] == sender) continue;

            /* The slot is in importing state, it should be modified only
             * manually via redis-trib (example: a resharding is in progress
             * and the migrating side slot was already closed and is advertising
             * a new config. We still want the slot to be closed manually). */
            if (server.cluster->importing_slots_from[j]) continue;

            /* We rebind the slot to the new node claiming it if:
             * 1) The slot was unassigned or the new node claims it with a
             *    greater configEpoch.
             * 2) We are not currently importing the slot. */
            if (server.cluster->slots[j] == NULL ||
                server.cluster->slots[j]->configEpoch < senderConfigEpoch)
            {
                /* Was this slot mine, and still contains keys? Mark it as
                 * a dirty slot. */
                if (server.cluster->slots[j] == myself &&
                    countKeysInSlot(j) &&
                    sender != myself)
                {
                    // Record the slot in the dirty list and bump the count.
                    dirty_slots[dirty_slots_count] = j;
                    dirty_slots_count++;
                }
                // Was slot j owned by curmaster (this node, or its master if
                // this node is a slave)? If so a failover may have happened:
                // remember the sender as the candidate new master.
                if (server.cluster->slots[j] == curmaster)
                    newmaster = sender;
                // Unbind the slot from its current owner...
                clusterDelSlot(j);
                // ...and assign it to the sender.
                clusterAddSlot(sender,j);
                clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG|
                                     CLUSTER_TODO_UPDATE_STATE|
                                     CLUSTER_TODO_FSYNC_CONFIG);
            }
        }
    }

    /* If at least one slot was reassigned from a node to another node
     * with a greater configEpoch, it is possible that:
     * 1) We are a master left without slots. This means that we were
     *    failed over and we should turn into a replica of the new
     *    master.
     * 2) We are a slave and our master is left without slots. We need
     *    to replicate to the new slots owner. */
    if (newmaster && curmaster->numslots == 0) {
        serverLog(LL_WARNING,
            "Configuration change detected. Reconfiguring myself "
            "as a replica of %.40s", sender->name);
        // Make the sender the master of this node.
        clusterSetMaster(sender);
        clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG|
                             CLUSTER_TODO_UPDATE_STATE|
                             CLUSTER_TODO_FSYNC_CONFIG);
    } else if (dirty_slots_count) {
        /* If we are here, we received an update message which removed
         * ownership for certain slots we still have keys about, but still
         * we are serving some slots, so this master node was not demoted to
         * a slave.
         *
         * In order to maintain a consistent state between keys and slots
         * we need to remove all the keys from the slots we lost. */
        for (j = 0; j < dirty_slots_count; j++)
            delKeysInSlot(dirty_slots[j]);
    }
}

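A slot is rebound in two cases:

If, in myself's view of the cluster, no node is responsible for the slot, the slot is assigned directly.
If the sender's configEpoch is greater than that of the slot's current owner, the sender's configuration is newer. In this case two if checks matter: whether the slot being taken away was owned by myself and still holds keys (a dirty slot), and whether a failover has been detected.

In both cases clusterAddSlot() is finally called to assign the slot to the sender node in myself's view of the cluster, so myself learns the sender's slot assignment. Given enough time, every master tells every other node in the cluster which slots it handles, so every node eventually knows which node each of the 16384 slots is assigned to.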
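
The rebinding rule can be isolated in a short sketch. It is a reduced model, not the Redis function: the names `owner` and `apply_claims()` are hypothetical, the sender's epoch is read from the node itself rather than passed as a separate argument, and the importing check, dirty-slot handling and failover detection are left out.

```c
#include <stddef.h>
#include <stdint.h>

#define CLUSTER_SLOTS 16384

/* Hypothetical reduced node: only the configuration epoch is needed. */
typedef struct owner {
    uint64_t config_epoch;
} owner;

/* Apply the slots claimed by `sender` (as a bitmap) to the slot table.
 * A claimed slot is rebound when it is unassigned or when its current
 * owner has a smaller configEpoch than the sender.
 * Returns the number of slots rebound to the sender. */
int apply_claims(owner *table[CLUSTER_SLOTS], owner *sender,
                 const unsigned char *claimed) {
    int rebound = 0;
    for (int j = 0; j < CLUSTER_SLOTS; j++) {
        if (!(claimed[j / 8] & (1 << (j & 7)))) continue; /* not claimed */
        if (table[j] == sender) continue;                 /* already bound */
        if (table[j] == NULL ||
            table[j]->config_epoch < sender->config_epoch) {
            table[j] = sender;
            rebound++;
        }
    }
    return rebound;
}
```

A sender with an older configEpoch cannot take a slot away from its current owner, which is what keeps stale claims from overwriting newer configuration.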

4. Recording the slot assignment of the whole cluster

The slots array in the clusterState structure records the assignment of all 16384 slots in the cluster.

typedef struct clusterState {
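    // migrating_slots_to[i]: the target node that slot i is being migrated to
    clusterNode *migrating_slots_to[CLUSTER_SLOTS];
    // importing_slots_from[i]: the source node that slot i is being imported from
    clusterNode *importing_slots_from[CLUSTER_SLOTS];
    // slots[i]: the node responsible for slot i
    clusterNode *slots[CLUSTER_SLOTS];
    // Skiplist that maps slots to the keys they contain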
    zskiplist *slots_to_keys;
} clusterState;

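migrating_slots_to is an array used during resharding: for each slot being migrated out of the current node, it records the target node the slot is migrating to.
importing_slots_from is an array used during resharding: for each slot being imported into the current node, it records the source node the slot is imported from.
slots is an array that maps every slot in the cluster to the master node responsible for it.
slots_to_keys is a skiplist; the CLUSTER GETKEYSINSLOT command walks it to return the keys that belong to a given slot.

If slots[i] is NULL, slot i has not been assigned to any node.
If slots[i] points to a clusterNode structure, slot i has been assigned to the node that structure represents.

Because the assignment of every slot is stored in the clusterState.slots array, checking whether slot i has been assigned, or finding the node that handles slot i, only requires reading clusterState.slots[i], which is an O(1) operation.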

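Note that although the clusterState.slots array records the assignment of every slot in the cluster, it is still necessary to keep a slots array in the clusterNode structure to record the slots of a single node:

When the program needs to send a node's slot assignment to other nodes in a message, it only has to send that node's clusterNode.slots array as a whole.
If Redis used only the clusterState.slots array and no clusterNode.slots array, then every time node A's slot assignment had to be propagated, the program would first have to walk the whole clusterState.slots array to work out which slots node A handles before it could send them, which is far more cumbersome and less efficient than sending clusterNode.slots directly.
The clusterState.slots array records the assignment of all slots in the cluster, whereas a clusterNode.slots array records only the slots of the node that the clusterNode structure represents. This is the key difference between the two slots arrays.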
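
To make the cost argument concrete, here is a sketch of what rebuilding one node's bitmap from the cluster-wide table would look like. The names `gnode` and `bitmap_from_table()` are hypothetical and the table is a plain array of pointers, mirroring clusterState.slots only in shape.

```c
#include <string.h>
#include <stddef.h>

#define CLUSTER_SLOTS 16384

/* Hypothetical reduced node holding only its own bitmap. */
typedef struct gnode {
    unsigned char slots[CLUSTER_SLOTS / 8];
} gnode;

/* Rebuild the bitmap of node `n` from the cluster-wide slot table.
 * This always takes CLUSTER_SLOTS iterations, however few slots the
 * node handles. Returns the number of slots found for `n`. */
int bitmap_from_table(gnode *table[CLUSTER_SLOTS], const gnode *n,
                      unsigned char *out) {
    int count = 0;
    memset(out, 0, CLUSTER_SLOTS / 8);
    for (int j = 0; j < CLUSTER_SLOTS; j++) {
        if (table[j] == n) {
            out[j / 8] |= (unsigned char)(1 << (j & 7));
            count++;
        }
    }
    return count;
}
```

Keeping the per-node bitmap up to date avoids this scan entirely: sending a node's slots becomes a single fixed-size copy of its slots array, while the table still answers "who owns slot i" with one array access.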

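5. Implementation of the CLUSTER ADDSLOTS command

The CLUSTER ADDSLOTS command takes one or more slots as arguments and assigns all of them to the node that receives the command. When a node receives a CLUSTER ADDSLOTS command from a client, it handles it in clusterCommand(), the function that executes every CLUSTER subcommand, so only the code that handles the addslots option is listed here: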

 else if ((!strcasecmp(c->argv[1]->ptr,"addslots") ||
               !strcasecmp(c->argv[1]->ptr,"delslots")) && c->argc >= 3)
    {
        /* CLUSTER ADDSLOTS <slot> [slot] ... */
        /* CLUSTER DELSLOTS <slot> [slot] ... */
        int j, slot;
        unsigned char *slots = zmalloc(CLUSTER_SLOTS);
        // 1 if this is a DELSLOTS operation, 0 for ADDSLOTS
        int del = !strcasecmp(c->argv[1]->ptr,"delslots");

        memset(slots,0,CLUSTER_SLOTS);
        /* Check that all the arguments are parseable and that all the
         * slots are not already busy. */
        // Walk through all the slot arguments.
        for (j = 2; j < c->argc; j++) {
            // Parse the slot number; on failure an error reply was already sent.
            if ((slot = getSlotOrReply(c,c->argv[j])) == -1) {
                zfree(slots);
                return;
            }
            // DELSLOTS on a slot that has no owner: reply with an error.
            if (del && server.cluster->slots[slot] == NULL) {
                addReplyErrorFormat(c,"Slot %d is already unassigned", slot);
                zfree(slots);
                return;
            // ADDSLOTS on a slot that already has an owner: reply with an error.
            } else if (!del && server.cluster->slots[slot]) {
                addReplyErrorFormat(c,"Slot %d is already busy", slot);
                zfree(slots);
                return;
            }
            // The same slot appears more than once in the arguments: reply with an error.
            if (slots[slot]++ == 1) {
                addReplyErrorFormat(c,"Slot %d specified multiple times",
                    (int)slot);
                zfree(slots);
                return;
            }
        }
        // The loop above guarantees that every requested slot can be processed.
        for (j = 0; j < CLUSTER_SLOTS; j++) {
            // Slot j was named in the arguments.
            if (slots[j]) {
                int retval;

                /* If this slot was set as importing we can clear this
                 * state as now we are the real owner of the slot. */
                if (server.cluster->importing_slots_from[j])
                    server.cluster->importing_slots_from[j] = NULL;
                // Unassign the slot (DELSLOTS) or assign it to myself (ADDSLOTS).
                retval = del ? clusterDelSlot(j) :
                               clusterAddSlot(myself,j);
                serverAssertWithInfo(c,NULL,retval == C_OK);
            }
        }
        zfree(slots);
        // Update the cluster state and save the configuration.
        clusterDoBeforeSleep(CLUSTER_TODO_UPDATE_STATE|CLUSTER_TODO_SAVE_CONFIG);
        addReply(c,shared.ok);

    }

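The handler first determines whether the operation is a deletion or an addition, then checks that every slot named in the arguments is valid, and finally walks through the requested slots, calling clusterAddSlot() to assign each one to the myself node. The code of clusterAddSlot() is: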

/* Add the specified slot to the list of slots that node 'n' will
 * serve. Return C_OK if the operation ended with success.
 * If the slot is already assigned to another instance this is considered
 * an error and C_ERR is returned. */
int clusterAddSlot(clusterNode *n, int slot) {
    // The slot already has an owner: return C_ERR.
    if (server.cluster->slots[slot]) return C_ERR;
    // Set the slot's bit in the bitmap of node n.
    clusterNodeSetSlotBit(n,slot);
    // Record node n as the owner of the slot.
    server.cluster->slots[slot] = n;
    return C_OK;
}
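
The handler and clusterAddSlot() together follow a validate-then-apply pattern: nothing is modified until every argument has been checked. A minimal sketch of that pattern, using the hypothetical names `anode` and `add_slots()` and plain integer return codes instead of client replies, could look like this:

```c
#include <string.h>
#include <stddef.h>

#define CLUSTER_SLOTS 16384

/* Hypothetical reduced node: only the slot counter is tracked. */
typedef struct anode { int numslots; } anode;

/* Assign the requested slots to `self`.
 * Returns 0 on success. Returns -1 and leaves the table untouched if any
 * slot is out of range, already owned, or listed more than once. */
int add_slots(anode *table[CLUSTER_SLOTS], anode *self,
              const int *req, int nreq) {
    unsigned char seen[CLUSTER_SLOTS];
    memset(seen, 0, sizeof(seen));

    /* Phase 1: validate every argument without changing any state. */
    for (int i = 0; i < nreq; i++) {
        int s = req[i];
        if (s < 0 || s >= CLUSTER_SLOTS) return -1; /* invalid slot */
        if (table[s] != NULL) return -1;            /* slot is busy */
        if (seen[s]++) return -1;                   /* listed twice */
    }

    /* Phase 2: apply; every marked slot is known to be assignable. */
    for (int j = 0; j < CLUSTER_SLOTS; j++) {
        if (seen[j]) {
            table[j] = self;
            self->numslots++;
        }
    }
    return 0;
}
```

Splitting the work into two passes means a command such as ADDSLOTS with one bad argument assigns none of its slots, rather than leaving the node with a partial assignment.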

clusterNodeSetSlotBit() sets to 1 the bit for the given slot in the slot bitmap of node n (here the myself node), marking the slot as handled by that node.

/* Set the slot bit and return the old value. */
int clusterNodeSetSlotBit(clusterNode *n, int slot) {
    // Read the current value of the slot's bit.
    int old = bitmapTestBit(n->slots,slot);
    // Set the bit to 1.
    bitmapSetBit(n->slots,slot);
    // If the bit was previously clear...
    if (!old) {
        // ...node n now handles one more slot.
        n->numslots++;
        /* When a master gets its first slot, even if it has no slaves,
         * it gets flagged with MIGRATE_TO, that is, the master is a valid
         * target for replicas migration, if and only if at least one of
         * the other masters has slaves right now.
         *
         * Normally masters are valid targets of replica migration if:
         * 1. They used to have slaves (but no longer have).
         * 2. They are slaves failing over a master that used to have slaves.
         *
         * However new masters with slots assigned are considered valid
         * migration targets if the rest of the cluster is not slave-less.
         *
         * See https://github.com/antirez/redis/issues/3043 for more info. */
        if (n->numslots == 1 && clusterMastersHaveSlaves())
            // Flag the node as a valid replica migration target.
            n->flags |= CLUSTER_NODE_MIGRATE_TO;
    }
    return old;
}
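
The reason the function returns the old bit value is that it keeps numslots consistent when the same slot is set twice. The sketch below shows that bookkeeping in isolation; `cnode`, `set_slot_bit()` and `clear_slot_bit()` are hypothetical names, and the replica-migration flag handled by the real function is omitted.

```c
#define CLUSTER_SLOTS 16384

/* Hypothetical reduced node: bitmap plus the counter derived from it. */
typedef struct cnode {
    unsigned char slots[CLUSTER_SLOTS / 8];
    int numslots;
} cnode;

/* Set the bit for `slot` and return its previous value.
 * The counter only grows on a 0 -> 1 transition. */
int set_slot_bit(cnode *n, int slot) {
    unsigned char mask = (unsigned char)(1 << (slot & 7));
    int old = (n->slots[slot / 8] & mask) != 0;
    n->slots[slot / 8] |= mask;
    if (!old) n->numslots++;
    return old;
}

/* Clear the bit for `slot` and return its previous value.
 * The counter only shrinks on a 1 -> 0 transition. */
int clear_slot_bit(cnode *n, int slot) {
    unsigned char mask = (unsigned char)(1 << (slot & 7));
    int old = (n->slots[slot / 8] & mask) != 0;
    n->slots[slot / 8] &= (unsigned char)~mask;
    if (old) n->numslots--;
    return old;
}
```

Setting slot 42 twice leaves numslots at 1, so the counter always equals the number of 1 bits in the bitmap.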

Finally, once the CLUSTER ADDSLOTS command has finished, the node sends messages telling the other nodes in the cluster which slots it is now handling.