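Background

A producer needs metadata: which leader partitions does a topic have, and which broker nodes host those leader partitions? Without this information it cannot decide which nodes to send messages to.

Each topic's partitions are spread across the brokers of the cluster.
A topic's partitions can change dynamically.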

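Metadata update flow

Implementation

How the KafkaProducer main thread updates metadata

Overall metadata update flow

Marking whether the metadata needs an update

The KafkaProducer class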



private ClusterAndWaitTime waitOnMetadata(String topic, Integer partition, long nowMs, long maxWaitMs) throws InterruptedException {

    // 1. Fetch the cached metadata.
    Cluster cluster = metadata.fetch();

    // Reject the send if the topic is known to be invalid.
    if (cluster.invalidTopics().contains(topic))
        throw new InvalidTopicException(topic);

    // 2. Add the topic to the set of topics the metadata tracks.
    metadata.add(topic, nowMs);

    // 3. Look up the topic's partition count in the metadata.
    Integer partitionsCount = cluster.partitionCountForTopic(topic);

    // 4. Fast path: if the client-side cache already covers the target partition, return the
    // cached metadata without asking the broker for an update. Most sends stop here.
    // The cache is considered sufficient when the topic has partitions and the record either
    // does not specify a partition or specifies one within the known range.
    if (partitionsCount != null && (partition == null || partition < partitionsCount))
        return new ClusterAndWaitTime(cluster, 0);

    long remainingWaitMs = maxWaitMs;
    long elapsed = 0;

    // 5. Keep asking the Sender to refresh the metadata until the topic and partition appear
    // or the wait times out. This covers two cases: (a) the topic gained partitions;
    // (b) the cached metadata is stale.
    do {
        if (partition != null) {
            log.trace("Requesting metadata update for partition {} of topic {}.", partition, topic);
        } else {
            log.trace("Requesting metadata update for topic {}.", topic);
        }

        // 6. Add the topic, together with the current time, to the tracked topic set.
        metadata.add(topic, nowMs + elapsed);

        // 7. Mark the metadata as needing an update and get the current version.
        int version = metadata.requestUpdateForTopic(topic);

        // 8. Wake up the Sender thread.
        // send() and poll() both run on the Sender thread, so the selector's blocking call
        // has to be interrupted for it to promptly handle network events on the channel
        // used to fetch metadata.
        sender.wakeup();
        try {
            // 9. Block until the metadata version advances or the wait times out.
            metadata.awaitUpdate(version, remainingWaitMs);
        } catch (TimeoutException ex) {
            throw new TimeoutException(
                    String.format("Topic %s not present in metadata after %d ms.",
                            topic, maxWaitMs));
        }

        // 10. Fetch the metadata again.
        cluster = metadata.fetch();

        // 11. Compute how long the wait for the update has taken so far.
        elapsed = time.milliseconds() - nowMs;

        // 12. Throw if the wait has timed out.
        if (elapsed >= maxWaitMs) {
            throw new TimeoutException(partitionsCount == null ?
                    String.format("Topic %s not present in metadata after %d ms.",
                            topic, maxWaitMs) :
                    String.format("Partition %d of topic %s with partition count %d is not present in metadata after %d ms.",
                            partition, topic, partitionsCount, maxWaitMs));
        }
        metadata.maybeThrowExceptionForTopic(topic);
        remainingWaitMs = maxWaitMs - elapsed;

        // 13. Read the topic's partition count from the refreshed metadata.
        partitionsCount = cluster.partitionCountForTopic(topic);
    } while (partitionsCount == null || (partition != null && partition >= partitionsCount));

    // 14. Return the metadata and the time spent waiting for the update.
    return new ClusterAndWaitTime(cluster, elapsed);
}

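Why is metadata updated per topic?

Because there is no need to pull all of the metadata. The producer only needs metadata for the topics it actually sends to, while the cluster as a whole may hold a very large number of topics, for example 10,000.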
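
To make the per-topic idea concrete, here is a minimal sketch of a client that tracks only the topics it has sent to, each with an idle expiry, and builds its request list from that set. The class `TopicInterestSet`, the `idleMs` parameter, and both methods are hypothetical; they are only loosely modeled on what `metadata.add(topic, nowMs)` does in the code above and are not Kafka's implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: tracks which topics a producer currently cares about. */
final class TopicInterestSet {
    private final Map<String, Long> expiryByTopic = new HashMap<>();
    private final long idleMs;

    TopicInterestSet(long idleMs) {
        this.idleMs = idleMs;
    }

    /** Records (or refreshes) interest in a topic; returns true if the topic is new. */
    boolean add(String topic, long nowMs) {
        return expiryByTopic.put(topic, nowMs + idleMs) == null;
    }

    /** Drops topics that have been idle past their expiry and returns the rest, sorted. */
    List<String> topicsToRequest(long nowMs) {
        expiryByTopic.values().removeIf(expiry -> expiry <= nowMs);
        List<String> topics = new ArrayList<>(expiryByTopic.keySet());
        Collections.sort(topics);
        return topics;
    }
}
```

With an idle time of 1000 ms, a topic last used at 500 ms is still requested at 1200 ms, while a topic last used at 0 ms has already dropped out of the request.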

The blocking method:

ProducerMetadata.awaitUpdate()



public synchronized void awaitUpdate(final int lastVersion, final long timeoutMs) throws InterruptedException {
    long currentTimeMs = time.milliseconds();
    long deadlineMs = currentTimeMs + timeoutMs < 0 ? Long.MAX_VALUE : currentTimeMs + timeoutMs;
    
    time.waitObject(this, () -> {
        maybeThrowFatalException();
        return updateVersion() > lastVersion || isClosed();
    }, deadlineMs);

    if (isClosed())
        throw new KafkaException("Requested metadata update after close");
}

The blocked thread is released when the wait times out or the metadata version advances. The waitObject() method in SystemTime:

public void waitObject(Object obj, Supplier<Boolean> condition, long deadlineMs) throws InterruptedException {
    synchronized (obj) {
        while (true) {
            if (condition.get())
                return;
            long currentTimeMs = milliseconds();
            if (currentTimeMs >= deadlineMs)
                throw new TimeoutException("Condition not satisfied before deadline");
            obj.wait(deadlineMs - currentTimeMs);
        }
    }
}

This method loops until the deadline passes or the condition is satisfied.
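
The same wait-until-deadline pattern can be shown in isolation. The sketch below is a hypothetical `VersionGate` class, not Kafka code: one side bumps a version and calls `notifyAll()`, the other side waits in a loop until the version advances or the deadline passes. Unlike `waitObject()`, it reports a timeout by returning `false` instead of throwing, and it treats an interrupt like a timeout; both are simplifications made for the example.

```java
/** Hypothetical sketch of a version gate: waiters block until the version advances. */
final class VersionGate {
    private int version = 0;

    synchronized int version() {
        return version;
    }

    /** Publishes a new version and wakes every waiting thread. */
    synchronized void bump() {
        version++;
        notifyAll();
    }

    /**
     * Blocks until the version exceeds lastVersion or timeoutMs elapses.
     * Returns true if the version advanced, false on timeout or interruption.
     */
    synchronized boolean awaitNewerThan(int lastVersion, long timeoutMs) {
        long nowMs = System.currentTimeMillis();
        // Guard against overflow for very large timeouts, as awaitUpdate() does.
        long deadlineMs = nowMs + timeoutMs < 0 ? Long.MAX_VALUE : nowMs + timeoutMs;
        while (version <= lastVersion) {
            long remainingMs = deadlineMs - System.currentTimeMillis();
            if (remainingMs <= 0) {
                return false;
            }
            try {
                // wait() can return spuriously, so the condition is rechecked by the loop.
                wait(remainingMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return true;
    }
}
```

A producer-like caller would read `version()`, ask the background thread to refresh, and then call `awaitNewerThan(version, maxWaitMs)`; the refreshing thread calls `bump()` once new data is in place.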

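This handles two kinds of metadata inconsistency:

The topic being sent to has no partitions in the cache, which means the cached metadata is stale;
The requested partition number is at or beyond the partition count in the cached metadata, which suggests the topic has gained partitions.

It does not, however, check whether the partition a message targets currently has a leader. The reasoning is that a partition in the middle of a leader election has no leader right now but may have one shortly, so the check is skipped here and left to the Sender class at the moment it actually sends.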
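
The two conditions above reduce to a single predicate, which is also the loop condition of `waitOnMetadata()`. The helper below is a hypothetical restatement for illustration; the class and method names are not part of Kafka.

```java
/** Hypothetical helper mirroring the two checks made in waitOnMetadata(). */
final class MetadataCheck {
    private MetadataCheck() { }

    /**
     * @param partitionsCount partition count from the cached metadata, or null if the topic is unknown
     * @param partition       partition requested by the record, or null if none was specified
     * @return true if the cached metadata cannot serve this send and a refresh is needed
     */
    static boolean needsRefresh(Integer partitionsCount, Integer partition) {
        if (partitionsCount == null) {
            return true; // the topic is missing from the cache
        }
        // A specified partition must fall inside the known range [0, partitionsCount).
        return partition != null && partition >= partitionsCount;
    }
}
```

For a topic cached with 3 partitions, partition 2 is served from the cache, while partition 3 triggers a refresh because valid partition numbers run from 0 to 2.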

The Sender class

private long sendProducerData(long now) {
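    // 1. Fetch the metadata from the cache.
    Cluster cluster = metadata.fetch();

    // 2. Find the nodes that are ready to receive data, and detect pending messages whose
    // leader partition has no node in the cached metadata.
    RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);

    // 3. If some topic's leader partition has no known node, flag the metadata for update
    // so that the network layer (NetworkClient) refreshes it.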
    if (!result.unknownLeaderTopics.isEmpty()) {
        for (String topic : result.unknownLeaderTopics)
            this.metadata.add(topic, now);
        log.debug("Requesting metadata update due to unknown leader topics from the batched records: {}",
            result.unknownLeaderTopics);
        this.metadata.requestUpdate();
    }

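This handles one more kind of metadata inconsistency:

The leader of a partition to send to is missing from the cached metadata, which means the partition is electing a leader or the machine hosting the leader has gone down.

Why do both KafkaProducer and Sender mark the metadata for update?

KafkaProducer is done once it has put a message into the buffer, and the Sender picks that message up asynchronously some time later. The metadata may have changed in between, so the Sender has to check again whether the set of topics without a leader partition is empty; if it is not empty, it marks the metadata for update.

In addition, KafkaProducer never checks whether the partition a message targets has a leader.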

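Why does Sender mark the update without blocking, while KafkaProducer blocks?

The first condition KafkaProducer checks is that the topic has no partitions in the cache, which means the topic may not exist at all. The metadata must be refreshed to find out. If the topic really does not exist, that is an error: the send should go no further and an exception should be thrown promptly.

The second condition is that the specified partition number is at or beyond the topic's partition count. That suggests the partition is newly added, so the metadata must be refreshed to verify that the partition really exists.

The Sender thread, by contrast, first calls ready() to collect the topics whose leader partition cannot be found; if that set is not empty, it marks the metadata for update. It then calls drain() to pick up the messages whose leader partition is present in the cached metadata, and sends those. In other words, the Sender sends whatever already has metadata, and for data that lacks metadata it only marks an update.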
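
The Sender-side split described above can be sketched as a single pass over the pending partitions. Everything in this example is a hypothetical simplification: `LeaderCheck` stands in for the real readiness result, partitions are keyed by a `"topic-partition"` string, and nodes are plain strings. It shows only the non-blocking shape of the check, not how `RecordAccumulator.ready()` is implemented.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical, simplified stand-in for the result of a readiness check. */
final class LeaderCheck {
    final Set<String> readyNodes = new TreeSet<>();
    final Set<String> unknownLeaderTopics = new TreeSet<>();

    /**
     * @param pendingPartitionsByTopic topic -> partitions that have buffered records
     * @param leaderByPartition        "topic-partition" -> node id of the known leader
     */
    static LeaderCheck run(Map<String, List<Integer>> pendingPartitionsByTopic,
                           Map<String, String> leaderByPartition) {
        LeaderCheck result = new LeaderCheck();
        for (Map.Entry<String, List<Integer>> entry : pendingPartitionsByTopic.entrySet()) {
            for (int partition : entry.getValue()) {
                String leader = leaderByPartition.get(entry.getKey() + "-" + partition);
                if (leader == null) {
                    // No blocking here: remember the topic so an update can be requested.
                    result.unknownLeaderTopics.add(entry.getKey());
                } else {
                    result.readyNodes.add(leader);
                }
            }
        }
        return result;
    }
}
```

A caller would send to `readyNodes` right away and, if `unknownLeaderTopics` is not empty, request a metadata update and carry on, which matches the behavior of `sendProducerData()` shown earlier.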

Sending the metadata update request

The poll() method of the NetworkClient class:

public List<ClientResponse> poll(long timeout, long now) {
    ensureActive();
    if (!abortedSends.isEmpty()) {
        List<ClientResponse> responses = new ArrayList<>();
        handleAbortedSends(responses);
        completeResponses(responses);
        return responses;
    }

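    // 1. Try to update the metadata, creating a metadata request if one is due.
    long metadataTimeout = metadataUpdater.maybeUpdate(now);
    try {

        // 2. Perform the network I/O, which includes sending the metadata request.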
        this.selector.poll(Utils.min(timeout, metadataTimeout, defaultRequestTimeoutMs));
    } catch (IOException e) {
        log.error("Unexpected error during I/O", e);
    }

The maybeUpdate(long now) method of DefaultMetadataUpdater, an inner class of NetworkClient


public long maybeUpdate(long now) {

    // 1. How long until the next metadata update is due.
    long timeToNextMetadataUpdate = metadata.timeToNextUpdate(now);

    // 2. Check whether a MetadataRequest has been sent but its response has not arrived yet.
    long waitForMetadataFetch = hasFetchInProgress() ? defaultRequestTimeoutMs : 0;

    long metadataTimeout = Math.max(timeToNextMetadataUpdate, waitForMetadataFetch);
    if (metadataTimeout > 0) {
        return metadataTimeout;
    }

    // 3. Pick the least-loaded node.
    Node node = leastLoadedNode(now);

    // 4. No node is available: back off and retry later.
    if (node == null) {
        log.debug("Give up sending metadata request since no node is available");
        return reconnectBackoffMs;
    }

    return maybeUpdate(now, node);
}

Checking whether it is time to update:

public synchronized long timeToNextUpdate(long nowMs) {
    long timeToExpire = updateRequested() ? 0 : Math.max(this.lastSuccessfulRefreshMs + this.metadataExpireMs - nowMs, 0);
    return Math.max(timeToExpire, timeToAllowUpdate(nowMs));
}
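
A small worked example may help with the arithmetic in `timeToNextUpdate()`. The sketch below turns it into a pure function. The article does not show `timeToAllowUpdate()`, so the sketch assumes it is the remaining refresh backoff since the last attempt, `max(lastRefreshMs + refreshBackoffMs - nowMs, 0)`; that formula, the `refreshBackoffMs` parameter, and the class name `RefreshTimer` are assumptions for illustration and may differ between Kafka versions.

```java
/** Hypothetical pure-function version of the update timing calculation. */
final class RefreshTimer {
    private RefreshTimer() { }

    static long timeToNextUpdate(boolean updateRequested,
                                 long lastSuccessfulRefreshMs,
                                 long lastRefreshMs,
                                 long metadataExpireMs,
                                 long refreshBackoffMs,
                                 long nowMs) {
        // An explicit update request skips the periodic expiry wait.
        long timeToExpire = updateRequested
                ? 0
                : Math.max(lastSuccessfulRefreshMs + metadataExpireMs - nowMs, 0);
        // Assumption: even a requested update waits out the backoff since the last attempt.
        long timeToAllowUpdate = Math.max(lastRefreshMs + refreshBackoffMs - nowMs, 0);
        return Math.max(timeToExpire, timeToAllowUpdate);
    }
}
```

With a 300,000 ms expiry and a last successful refresh at time 0, a call at 100,000 ms with no update requested yields 200,000 ms. If an update has been requested and the last attempt was at 99,950 ms with a 100 ms backoff, the same call yields 50 ms.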

The maybeUpdate(long now, Node node) method of DefaultMetadataUpdater, an inner class of NetworkClient

This is the method that ultimately sends the metadata update request.

private long maybeUpdate(long now, Node node) {
    String nodeConnectionId = node.idString();

    // 1. Check whether a request can be sent to this node.
    if (canSendRequest(nodeConnectionId, now)) {

        // 1.1 Build the metadata request.
        Metadata.MetadataRequestAndVersion requestAndVersion = metadata.newMetadataRequestAndVersion(now);
        MetadataRequest.Builder metadataRequest = requestAndVersion.requestBuilder;
        log.debug("Sending metadata request {} to node {}", metadataRequest, node);

        // 1.2 Send the metadata request to the chosen node.
        sendInternalMetadataRequest(metadataRequest, nodeConnectionId, now);
        inProgress = new InProgressData(requestAndVersion.requestVersion, requestAndVersion.isPartialUpdate);
        return defaultRequestTimeoutMs;
    }

    if (isAnyNodeConnecting()) {
        return reconnectBackoffMs;
    }

    // 2. Check whether a connection to the node can be attempted.
    if (connectionStates.canConnect(nodeConnectionId, now)) {
        log.debug("Initialize connection to node {} for sending metadata request", node);

        // Start connecting to the node.
        initiateConnect(node, now);
        return reconnectBackoffMs;
    }
    return Long.MAX_VALUE;
}
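
The branches of `maybeUpdate(now, node)` can be restated as a pure decision function, which makes the returned delays easier to compare. The enum, class, and method names below are hypothetical; the three boolean inputs stand for the results of `canSendRequest(...)`, `isAnyNodeConnecting()`, and `connectionStates.canConnect(...)` in the code above.

```java
/** Hypothetical restatement of the branches in maybeUpdate(now, node). */
final class MetadataSendDecision {
    enum Action { SEND_REQUEST, WAIT_FOR_PENDING_CONNECTION, START_CONNECTING, NOTHING_POSSIBLE }

    private MetadataSendDecision() { }

    static Action decide(boolean canSendRequest, boolean anyNodeConnecting, boolean canConnect) {
        if (canSendRequest) {
            return Action.SEND_REQUEST;
        }
        if (anyNodeConnecting) {
            return Action.WAIT_FOR_PENDING_CONNECTION;
        }
        if (canConnect) {
            return Action.START_CONNECTING;
        }
        return Action.NOTHING_POSSIBLE;
    }

    /** The delay the method would return for each action. */
    static long delayMs(Action action, long requestTimeoutMs, long reconnectBackoffMs) {
        switch (action) {
            case SEND_REQUEST:
                return requestTimeoutMs;
            case WAIT_FOR_PENDING_CONNECTION:
            case START_CONNECTING:
                return reconnectBackoffMs;
            default:
                return Long.MAX_VALUE;
        }
    }
}
```

The order of the checks matters: a node that is ready to send wins over everything else, and a new connection is started only when no other connection attempt is already in flight.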

Handling the metadata update response

The handleSuccessfulResponse() method of DefaultMetadataUpdater, an inner class of NetworkClient

public void handleSuccessfulResponse(RequestHeader requestHeader, long now, MetadataResponse response) {
    List<TopicPartition> missingListenerPartitions = response.topicMetadata().stream().flatMap(topicMetadata ->
        topicMetadata.partitionMetadata().stream()
            .filter(partitionMetadata -> partitionMetadata.error == Errors.LISTENER_NOT_FOUND)
            .map(partitionMetadata -> new TopicPartition(topicMetadata.topic(), partitionMetadata.partition())))
        .collect(Collectors.toList());
    if (!missingListenerPartitions.isEmpty()) {
        int count = missingListenerPartitions.size();
        log.warn("{} partitions have leader brokers without a matching listener, including {}",
                count, missingListenerPartitions.subList(0, Math.min(10, count)));
    }

    // 1. Check the response for errors.
    Map<String, Errors> errors = response.errors();
    if (!errors.isEmpty())
        log.warn("Error while fetching metadata with correlation id {} : {}", requestHeader.correlationId(), errors);

    // 2. A response with no broker information counts as a failed metadata fetch.
    if (response.brokers().isEmpty()) {

        // The update failed.
        log.trace("Ignoring empty metadata response with correlation id {}.", requestHeader.correlationId());
        this.metadata.failedUpdate(now);
    } else {
        // 3. Update the metadata.
        this.metadata.update(inProgress.requestVersion, response, inProgress.isPartialUpdate, now);
    }
    inProgress = null;
}

The update() method of the ProducerMetadata class

public synchronized void update(int requestVersion, MetadataResponse response, boolean isPartialUpdate, long nowMs) {
    super.update(requestVersion, response, isPartialUpdate, nowMs);

    // Remove topics whose metadata has now been received from the set of new topics.
    if (!newTopics.isEmpty()) {
        for (MetadataResponse.TopicMetadata metadata : response.topicMetadata()) {
            newTopics.remove(metadata.topic());
        }
    }

    // Wake up threads blocked in awaitUpdate().
    notifyAll();
}

The update() method of the Metadata class

public synchronized void update(int requestVersion, MetadataResponse response, boolean isPartialUpdate, long nowMs) {
    Objects.requireNonNull(response, "Metadata response cannot be null");
    if (isClosed())
        throw new IllegalStateException("Update requested after metadata close");
        
    // 1. Decide whether another partial update is still needed, then update the bookkeeping fields.
    this.needPartialUpdate = requestVersion < this.requestVersion;
    this.lastRefreshMs = nowMs;
    this.updateVersion += 1;
    if (!isPartialUpdate) {
        this.needFullUpdate = false;
        this.lastSuccessfulRefreshMs = nowMs;
    }

    String previousClusterId = cache.clusterResource().clusterId();

    // 2. Parse the metadata response.
    this.cache = handleMetadataResponse(response, isPartialUpdate, nowMs);
    Cluster cluster = cache.cluster();
    maybeSetMetadataError(cluster);
    this.lastSeenLeaderEpochs.keySet().removeIf(tp -> !retainTopic(tp.topic(), false, nowMs));
    String newClusterId = cache.clusterResource().clusterId();
    if (!Objects.equals(previousClusterId, newClusterId)) {
        log.info("Cluster ID: {}", newClusterId);
    }
    clusterResourceListeners.onUpdate(cache.clusterResource());
    log.debug("Updated cluster metadata updateVersion {} to {}", this.updateVersion, this.cache);
}

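Architecture design notes

How do you design an information synchronization module?

Producer metadata updating is itself an information synchronization module, and several of its design choices are worth borrowing:

A cache layer that serves most reads.
Sensible update triggers: 1) refresh on demand when data is missing; 2) refresh periodically.
A way to recognize a newer version: check whether the version number is greater than the current one.
A mechanism for detecting whether synchronization succeeded.
Decoupling: the business thread is kept separate from the synchronization thread and obtains information asynchronously.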
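
The points above can be combined into one small sketch: a cached snapshot that serves reads, flags a refresh on a miss or after an expiry interval, and accepts only snapshots with a greater version. The class `SyncedSnapshot` and all of its members are hypothetical and heavily simplified. In a real module a background thread would poll `refreshDue()`, fetch the data, and call `apply()`, while business threads would only call `get()`.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a versioned, refreshable cache. */
final class SyncedSnapshot<K, V> {
    private final long expireMs;
    private Map<K, V> data = Collections.emptyMap();
    private int version = 0;
    private long lastRefreshMs;
    private boolean updateRequested = false;

    SyncedSnapshot(long expireMs, long nowMs) {
        this.expireMs = expireMs;
        this.lastRefreshMs = nowMs;
    }

    /** Read path: serves from the cache; a miss flags that a refresh is needed. */
    synchronized V get(K key) {
        V value = data.get(key);
        if (value == null) {
            updateRequested = true;
        }
        return value;
    }

    /** Update trigger: on demand after a miss, or periodically after expireMs. */
    synchronized boolean refreshDue(long nowMs) {
        return updateRequested || nowMs - lastRefreshMs >= expireMs;
    }

    /** Sync path: applies a snapshot only if its version is newer; returns true if applied. */
    synchronized boolean apply(int newVersion, Map<K, V> snapshot, long nowMs) {
        if (newVersion <= version) {
            return false; // stale or duplicate snapshot
        }
        data = new HashMap<>(snapshot);
        version = newVersion;
        lastRefreshMs = nowMs;
        updateRequested = false;
        notifyAll(); // wake any thread waiting on this object for fresh data
        return true;
    }

    synchronized int version() {
        return version;
    }
}
```

The return value of `apply()` doubles as the "did synchronization succeed" signal: a caller that remembers the version it saw can tell whether newer data has arrived.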

Closing words

I have published a booklet on Juejin that analyzes Kafka at the source-code level.

You are welcome to support it: 《Kafka 源码精讲》
