[go-libp2p Source Code Analysis] DHT Routing Table Refresh Manager


RtRefreshManager refreshes the DHT routing table. Peers on the network may go offline at any time, so RtRefreshManager runs a periodic check (every 10 minutes by default): peers that can no longer be reached are evicted from the routing table, and once all peers have been checked the k-buckets are refreshed (re-populated).

  • It provides a Refresh method with a force parameter. If force is set to true, every bucket is refreshed regardless of when it was last refreshed. Receiving on the returned channel blocks until the refresh has completed.
  • It also exposes a RefreshNoWait method. Unlike Refresh, it does not force a refresh and it does not block. A minimal usage sketch of both follows this list.
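A minimal usage sketch of the two entry points, assuming you already hold a constructed *rtrefresh.RtRefreshManager (the variable name rtm and the wrapper function are made up; check the method signatures against the version of the package you use):

package example

import (
	"log"

	"github.com/libp2p/go-libp2p-kad-dht/rtrefresh"
)

// refreshNow is a usage sketch for the two entry points described above.
// The caller supplies the already-constructed refresh manager.
func refreshNow(rtm *rtrefresh.RtRefreshManager) {
	// Force-refresh every bucket and block until the refresh has finished:
	// Refresh returns an error channel that is written to and closed on completion.
	if err := <-rtm.Refresh(true); err != nil {
		log.Printf("forced refresh failed: %v", err)
	}

	// Request a non-forced refresh and return immediately.
	rtm.RefreshNoWait()
}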

Sequence diagram: (image omitted)

RtRefreshManager Initialization

  1. IpfsDHT's New method initializes the RtRefreshManager and then starts a goroutine that handles routing table refreshes.
func New(ctx context.Context, h host.Host, options ...Option) (*IpfsDHT, error) {
	...
	dht, err := makeDHT(ctx, h, cfg)
	if err != nil {
		return nil, fmt.Errorf("failed to create DHT, err=%s", err)
	}
	...
	dht.proc.Go(dht.populatePeers)
	...
}
  2. makeDHT constructs the RtRefreshManager and sets bootstrapPeers (cfg is built from the options passed to New); bootstrapPeers is empty by default.

func makeDHT(ctx context.Context, h host.Host, cfg config) (*IpfsDHT, error) {
	...
	dht.bootstrapPeers = cfg.bootstrapPeers

	// rt refresh manager
	rtRefresh, err := makeRtRefreshManager(dht, cfg, maxLastSuccessfulOutboundThreshold)
	if err != nil {
		return nil, fmt.Errorf("failed to construct RT Refresh Manager,err=%s", err)
	}
	dht.rtRefreshManager = rtRefresh

	...
}
  3. Constructing the RtRefreshManager. Three parameters are important here: refreshQueryTimeout (default 1 minute), refreshInterval (default 10 minutes), and maxLastSuccessfulOutboundThreshold.
    If alpha >= K, then maxLastSuccessfulOutboundThreshold = refreshInterval; otherwise it is derived from the formula below (a worked example with the default values follows the snippet).
	if cfg.concurrency < cfg.bucketSize { // (alpha < K)
		l1 := math.Log(float64(1) / float64(cfg.bucketSize))                              // Log(1/K)
		l2 := math.Log(float64(1) - (float64(cfg.concurrency) / float64(cfg.bucketSize))) // Log(1 - (alpha / K))
		maxLastSuccessfulOutboundThreshold = time.Duration(l1 / l2 * float64(cfg.routingTable.refreshInterval))
	} else {
		maxLastSuccessfulOutboundThreshold = cfg.routingTable.refreshInterval
	}

Two closures passed in here are particularly important:

  1. keyGenFnc generates a random peer ID for a given cpl (this peer does not actually exist). The generated ID is closest to the peers in rt.buckets[cpl]: if you look up the peers closest to it, they will necessarily come from rt.buckets[cpl].
  2. queryFnc refreshes a k-bucket. It calls GetClosestPeers to run an iterative lookup; the peers it finds are not used for anything (no put_value/get_value), the lookup only serves to populate the routing table (queryFnc runs once for each of the up to maxCplForRefresh+1 = 16 tracked k-buckets).
func makeRtRefreshManager(dht *IpfsDHT, cfg config, maxLastSuccessfulOutboundThreshold time.Duration) (*rtrefresh.RtRefreshManager, error) {
	keyGenFnc := func(cpl uint) (string, error) {
		p, err := dht.routingTable.GenRandPeerID(cpl)
		return string(p), err
	}

	queryFnc := func(ctx context.Context, key string) error {
		_, err := dht.GetClosestPeers(ctx, key)
		return err
	}

	r, err := rtrefresh.NewRtRefreshManager(
		dht.host, dht.routingTable, cfg.routingTable.autoRefresh,
		keyGenFnc,
		queryFnc,
		cfg.routingTable.refreshQueryTimeout,
		cfg.routingTable.refreshInterval,
		maxLastSuccessfulOutboundThreshold,
		dht.refreshFinishedCh)

	return r, err
}

keyGenFnc ultimately calls the routing table's GenRandPeerID. Through a series of bit operations it derives a 16-bit targetPrefix from targetCpl and randPrefix, and that prefix is always a valid index into the keyPrefixMap table.
The mask operation keeps the top targetCpl+1 bits of the toggled local prefix and fills the remaining bits from the random prefix, so that exactly the first targetCpl bits of targetPrefix match the local ID (bit targetCpl+1 is the flipped one). A bit-level worked example follows the GenRandPeerID listing below.

maxCplForRefresh is 15, so the routing table only ever refreshes the first 16 buckets, even though with 256-bit keys there can be up to 256 buckets. Why not refresh the rest? The answer lies in how keyPrefixMap is generated: supporting targetCpl up to 15 requires 2^16 = 65536 precomputed entries, all loaded into memory. Supporting targetCpl up to 31 would already require 2^32 = 4294967296 entries, and the table keeps growing exponentially from there (a cap of 255 would be hopeless). Capping the refreshable range at 15 is a deliberate trade-off.


// GenRandPeerID generates a random peerID for a given Cpl
func (rt *RoutingTable) GenRandPeerID(targetCpl uint) (peer.ID, error) {
	if targetCpl > maxCplForRefresh {
		return "", fmt.Errorf("cannot generate peer ID for Cpl greater than %d", maxCplForRefresh)
	}

	localPrefix := binary.BigEndian.Uint16(rt.local)

	// For host with ID `L`, an ID `K` belongs to a bucket with ID `B` ONLY IF CommonPrefixLen(L,K) is EXACTLY B.
	// Hence, to achieve a targetPrefix `T`, we must toggle the (T+1)th bit in L & then copy (T+1) bits from L
	// to our randomly generated prefix.
	toggledLocalPrefix := localPrefix ^ (uint16(0x8000) >> targetCpl)
	randPrefix, err := randUint16()
	if err != nil {
		return "", err
	}

	// Combine the toggled local prefix and the random bits at the correct offset such that ONLY the first `targetCpl` bits match the local ID.
	mask := (^uint16(0)) << (16 - (targetCpl + 1))
	targetPrefix := (toggledLocalPrefix & mask) | (randPrefix & ^mask)

	// Convert to a known peer ID.
	key := keyPrefixMap[targetPrefix]
	id := [34]byte{mh.SHA2_256, 32}
	binary.BigEndian.PutUint32(id[2:], key)
	return peer.ID(id[:]), nil
}
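The mask logic is easier to see with concrete numbers. The sketch below walks through the prefix construction for targetCpl = 2 with made-up values for the local prefix and the "random" bits (they are assumptions purely for illustration; the real code uses rt.local and randUint16).

package main

import "fmt"

func main() {
	// Assume a local 16-bit key prefix of 1011 0000 0000 0000 and targetCpl = 2.
	localPrefix := uint16(0xB000)
	targetCpl := uint(2)

	// Flip the (targetCpl+1)-th bit of the local prefix: 1011... -> 1001...
	toggled := localPrefix ^ (uint16(0x8000) >> targetCpl)

	// mask has the top targetCpl+1 bits set: 1110 0000 0000 0000
	mask := (^uint16(0)) << (16 - (targetCpl + 1))

	// Pretend these are the random bits returned by randUint16().
	randPrefix := uint16(0x3FFF)

	// Top 3 bits come from the toggled local prefix, the rest from randPrefix.
	targetPrefix := (toggled & mask) | (randPrefix & ^mask)

	fmt.Printf("local : %016b\n", localPrefix)  // 1011000000000000
	fmt.Printf("target: %016b\n", targetPrefix) // 1001111111111111
	// The first 2 bits match and the 3rd bit differs,
	// so CommonPrefixLen(local, target) == 2.
}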

Now let's look at where keyPrefixMap comes from.
The keyPrefixMap array has 2^16 = 65536 entries: for every possible 16-bit prefix it stores a seed i (the loop counter below) such that SHA-256 of the 34-byte ID built from i starts with exactly that prefix. GenRandPeerID only has to look up keyPrefixMap[targetPrefix] and rebuild the same 34-byte ID, so the Kademlia key of the returned peer ID (the SHA-256 of its bytes) is guaranteed to start with targetPrefix.

const bits = 16
const target = 1 << bits // 65536
const idLen = 32 + 2

func main() {
	pkg := os.Getenv("GOPACKAGE")
	file := os.Getenv("GOFILE")
	targetFile := strings.TrimSuffix(file, ".go") + "_prefixmap.go"

	ids := new([target]uint32)
	found := new([target]bool)
	count := int32(0)

	out := make([]byte, 32)
	// The first two bytes are fixed (18, 32): the multihash header used as hasher input.
	inp := [idLen]byte{mh.SHA2_256, 32}
	hasher := sha256.New()

	// The loop counter i will end up larger than target, because the first 2 bytes
	// of the hashes generated from different seeds may collide.
	for i := uint32(0); count < target; i++ {
		// Write i into the input in big-endian order.
		binary.BigEndian.PutUint32(inp[2:], i)

		hasher.Write(inp[:])
		out = hasher.Sum(out[:0])
		hasher.Reset()
		// out is 32 bytes (256 bits). Take the first 4 bytes as a uint32 and shift
		// right by 16, keeping only the top 16 bits (maximum value 65535).
		// Since different seeds may produce the same 2-byte prefix, deduplicate until
		// all 65536 prefixes are covered: prefix is the index, i is the stored value.
		prefix := binary.BigEndian.Uint32(out) >> (32 - bits)
		if !found[prefix] {
			found[prefix] = true
			ids[prefix] = i
			count++
		}
	}

	f, err := os.Create(targetFile)
	if err != nil {
		panic(err)
	}

	printf := func(s string, args ...interface{}) {
		_, err := fmt.Fprintf(f, s, args...)
		if err != nil {
			panic(err)
		}
	}

	printf("package %s\n\n", pkg)
	printf("// Code generated by generate/generate_map.go DO NOT EDIT\n")
	printf("var keyPrefixMap = [...]uint32{")
	for i, j := range ids[:] {
		if i%16 == 0 {
			printf("\n\t")
		} else {
			printf(" ")
		}
		printf("%d,", j)
	}
	printf("\n}")
	f.Close()
}
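The invariant described above can be checked end to end without touching the library internals. The sketch below redoes the generator's search for a single prefix (0xABCD is an arbitrary choice), rebuilds the 34-byte ID the way GenRandPeerID does, and confirms that hashing it reproduces the prefix. It assumes that converting a peer ID to a Kademlia key is a plain SHA-256 of its bytes, as go-libp2p-kbucket does; the header bytes 0x12 and 32 mirror mh.SHA2_256 and the digest length from the listings above.

package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

func main() {
	const wantPrefix = uint16(0xABCD)

	// Search for a seed exactly as the generator does.
	inp := [34]byte{0x12, 32}
	var seed uint32
	for i := uint32(0); ; i++ {
		binary.BigEndian.PutUint32(inp[2:], i)
		digest := sha256.Sum256(inp[:])
		if binary.BigEndian.Uint16(digest[:2]) == wantPrefix {
			seed = i // this is the value keyPrefixMap[wantPrefix] would hold
			break
		}
	}

	// Rebuild the peer ID exactly as GenRandPeerID does.
	id := [34]byte{0x12, 32}
	binary.BigEndian.PutUint32(id[2:], seed)

	// Hashing the peer ID bytes (the routing table's key conversion) must yield
	// a key that begins with wantPrefix.
	key := sha256.Sum256(id[:])
	fmt.Printf("want %04x, got %04x\n", wantPrefix, binary.BigEndian.Uint16(key[:2]))
}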

DefaultBootstrapPeers

The init function in dht_bootstrap.go defines several bootstrap nodes. When constructing the DHT you can pass in dht.DefaultBootstrapPeers or GetDefaultBootstrapPeerAddrInfos(); a usage sketch follows the two listings below.

var DefaultBootstrapPeers []multiaddr.Multiaddr

func init() {
	for _, s := range []string{
		"/dnsaddr/bootstrap.libp2p.io/p2p/QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN",
		"/dnsaddr/bootstrap.libp2p.io/p2p/QmQCU2EcMqAqQPR2i9bChDtGNJchTbq5TbXJJ16u19uLTa",
		"/dnsaddr/bootstrap.libp2p.io/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb",
		"/dnsaddr/bootstrap.libp2p.io/p2p/QmcZf59bWwK5XFi76CZX8cbJ4BhTzzA3gU1ZjYZcYW3dwt",
		"/ip4/104.131.131.82/tcp/4001/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ", // mars.i.ipfs.io
	} {
		ma, err := multiaddr.NewMultiaddr(s)
		if err != nil {
			panic(err)
		}
		DefaultBootstrapPeers = append(DefaultBootstrapPeers, ma)
	}
}

func GetDefaultBootstrapPeerAddrInfos() []peer.AddrInfo {
	ds := make([]peer.AddrInfo, 0, len(DefaultBootstrapPeers))

	for i := range DefaultBootstrapPeers {
		info, err := peer.AddrInfoFromP2pAddr(DefaultBootstrapPeers[i])
		if err != nil {
			logger.Errorw("failed to convert bootstrapper address to peer addr info", "address",
				DefaultBootstrapPeers[i].String(), "err", err)
			continue
		}
		ds = append(ds, *info)
	}
	return ds
}
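For completeness, here is a minimal sketch of feeding these bootstrap peers into a DHT instance. It assumes the dht.BootstrapPeers option and the import paths shown (note that older go-libp2p versions pass a context to libp2p.New); verify both against the version you are using.

package main

import (
	"context"

	"github.com/libp2p/go-libp2p"
	dht "github.com/libp2p/go-libp2p-kad-dht"
)

func main() {
	ctx := context.Background()

	// Create a libp2p host (older go-libp2p versions take ctx as the first argument).
	h, err := libp2p.New()
	if err != nil {
		panic(err)
	}

	// Hand the default bootstrappers to the DHT as peer.AddrInfos.
	kadDHT, err := dht.New(ctx, h, dht.BootstrapPeers(dht.GetDefaultBootstrapPeerAddrInfos()...))
	if err != nil {
		panic(err)
	}
	defer kadDHT.Close()
}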

Starting the rtRefreshManager Loop

Unless fixing low peers is disabled (disableFixLowPeers), populatePeers first calls fixLowPeers to fill the routing table, then starts the rtRefreshManager, and finally starts another goroutine (a timer task that calls fixLowPeers every two minutes).

func (dht *IpfsDHT) populatePeers(_ goprocess.Process) {
	if !dht.disableFixLowPeers {
		dht.fixLowPeers(dht.ctx)
	}

	if err := dht.rtRefreshManager.Start(); err != nil {
		logger.Error(err)
	}

	if !dht.disableFixLowPeers {
		dht.proc.Go(dht.fixLowPeersRoutine)
	}

}

If the routing table holds no more peers than the threshold (minRTRefreshThreshold, 10), fixLowPeers tries to bring more peers into it.

func (dht *IpfsDHT) fixLowPeers(ctx context.Context) {
	if dht.routingTable.Size() > minRTRefreshThreshold {
		return
	}

	// Try to add all peers currently connected to this host to the routing table.
	for _, p := range dht.host.Network().Peers() {
		dht.peerFound(ctx, p, false)
	}

	// Before connecting to the bootstrap nodes, we should first try the non-bootstrap
	// peers from a previous routing table snapshot.
	if dht.routingTable.Size() == 0 {
		if len(dht.bootstrapPeers) == 0 {
			return
		}

		found := 0
		// rand.Perm shuffles the order of bootstrapPeers.
		for _, i := range rand.Perm(len(dht.bootstrapPeers)) {
			ai := dht.bootstrapPeers[i]
			err := dht.Host().Connect(ctx, ai)
			if err == nil {
				found++
			} else {
				logger.Warnw("failed to bootstrap", "peer", ai.ID, "error", err)
			}

			// Wait for two bootstrap peers, or try them all.
			// Why two? In theory one is usually enough.
			// However, if the network restarts and everyone only connects to a single
			// bootstrapper, we end up with a largely partitioned network.
			// So we always use two random peers to bootstrap.
			if found == maxNBoostrappers {
				break
			}
		}
	}

	// If the routing table is still empty (perhaps because Identify hasn't completed yet),
	// there is no point in triggering a refresh.
	if dht.routingTable.Size() == 0 {
		return
	}

	if dht.autoRefresh {
		dht.rtRefreshManager.RefreshNoWait()
	}
}


// periodicBootstrapInterval = 2 * time.Minute
// A ticker checks the number of peers in the routing table every two minutes.
// The rt.PeerRemoved callback also calls fixRTIfNeeded.
func (dht *IpfsDHT) fixLowPeersRoutine(proc goprocess.Process) {
	ticker := time.NewTicker(periodicBootstrapInterval)
	defer ticker.Stop()

	for {
		select {
		case <-dht.fixLowPeersChan:
		case <-ticker.C:
		case <-proc.Closing():
			return
		}

		dht.fixLowPeers(dht.Context())
	}

}

func (dht *IpfsDHT) fixRTIfNeeded() {
	select {
	case dht.fixLowPeersChan <- struct{}{}:
	default:
	}
}

RtRefreshManager's Start launches a goroutine that runs the loop method.

func (r *RtRefreshManager) Start() error {
	r.refcount.Add(1)
	go r.loop()
	return nil
}

When enableAutoRefresh is enabled, the loop first forces one full routing table refresh and then starts the ticker.

  • refreshTickrCh drives the periodic automatic refresh
  • triggerRefresh drives manual refreshes (when Refresh or RefreshNoWait is called)

On each pass, every peer in the routing table whose last successful outbound query is older than the grace period is checked with Connect (think of it as a dial/ping with a 10-second timeout), one goroutine per peer. Peers that cannot be connected are removed from the routing table. Once all checks have finished, the refresh itself is executed.

func (r *RtRefreshManager) loop() {
	defer r.refcount.Done()

	var refreshTickrCh <-chan time.Time
	if r.enableAutoRefresh {
		err := r.doRefresh(true)
		if err != nil {
			logger.Warn("failed when refreshing routing table", err)
		}
		t := time.NewTicker(r.refreshInterval)
		defer t.Stop()
		refreshTickrCh = t.C
	}

	for {
		var waiting []chan<- error
		var forced bool
		select {
		case <-refreshTickrCh:
		case triggerRefreshReq := <-r.triggerRefresh:
			if triggerRefreshReq.respCh != nil {
				waiting = append(waiting, triggerRefreshReq.respCh)
			}
			forced = forced || triggerRefreshReq.forceCplRefresh
		case <-r.ctx.Done():
			return
		}

		// Batch multiple refresh requests if they're all waiting at the same time.
	OuterLoop:
		for {
			select {
			case triggerRefreshReq := <-r.triggerRefresh:
				if triggerRefreshReq.respCh != nil {
					waiting = append(waiting, triggerRefreshReq.respCh)
				}
				forced = forced || triggerRefreshReq.forceCplRefresh
			default:
				break OuterLoop
			}
		}

		// Ping each stale peer and evict it if it does not answer within the
		// 10-second ping timeout.
		var wg sync.WaitGroup
		for _, ps := range r.rt.GetPeerInfos() {
			if time.Since(ps.LastSuccessfulOutboundQueryAt) > r.successfulOutboundQueryGracePeriod {
				wg.Add(1)
				go func(ps kbucket.PeerInfo) {
					defer wg.Done()
					livelinessCtx, cancel := context.WithTimeout(r.ctx, peerPingTimeout)
					if err := r.h.Connect(livelinessCtx, peer.AddrInfo{ID: ps.Id}); err != nil {
						logger.Debugw("evicting peer after failed ping", "peer", ps.Id, "error", err)
						r.rt.RemovePeer(ps.Id)
					}
					cancel()
				}(ps)
			}
		}
		wg.Wait()

		// Query for self and refresh the required buckets.
		err := r.doRefresh(forced)
		for _, w := range waiting {
			w <- err
			close(w)
		}
		if err != nil {
			logger.Warnw("failed when refreshing routing table", "error", err)
		}
	}
}

doRefresh first calls queryForSelf to look up the local node, which ultimately calls dht.GetClosestPeers to find nearby peers.
It then calls the routing table's GetTrackedCplsForRefresh to get the last refresh time of each tracked cpl (at most 16, because we can currently only generate prefixes of up to maxCplForRefresh bits).
It iterates over these cpls (starting at 0):
1. If the refresh is forced, the cpl's last refresh time is ignored and refreshCpl runs directly; otherwise the cpl is refreshed only if its last refresh is older than refreshInterval.
2. If refreshCpl fails, the error is merged into one combined error (the 16 cpls can produce several errors but only one is returned), and the combined error is returned after the whole iteration finishes.
3. If refreshCpl succeeds but the bucket for that cpl is still empty, there is a gap: only the cpls from cpl+1 up to 2*(cpl+1) (capped at the last tracked cpl) are refreshed, and doRefresh returns early. (Why 2*(cpl+1)? The code does not explain.)
4. Finally a message is sent on refreshDoneCh, which is handled in dht.rtPeerLoop.

func (r *RtRefreshManager) doRefresh(forceRefresh bool) error {
	var merr error

	if err := r.queryForSelf(); err != nil {
		merr = multierror.Append(merr, err)
	}

	refreshCpls := r.rt.GetTrackedCplsForRefresh()

	rfnc := func(cpl uint) (err error) {
		if forceRefresh {
			err = r.refreshCpl(cpl)
		} else {
			err = r.refreshCplIfEligible(cpl, refreshCpls[cpl])
		}
		return
	}

	for c := range refreshCpls {
		cpl := uint(c)
		if err := rfnc(cpl); err != nil {
			merr = multierror.Append(merr, err)
		} else {
			// NPeersForCpl returns the number of peers in the bucket for the given cpl.
			if r.rt.NPeersForCpl(cpl) == 0 {
				lastCpl := min(2*(c+1), len(refreshCpls)-1)
				for i := c + 1; i < lastCpl+1; i++ {
					if err := rfnc(uint(i)); err != nil {
						merr = multierror.Append(merr, err)
					}
				}
				return merr
			}
		}
	}

	select {
	case r.refreshDoneCh <- struct{}{}:
	case <-r.ctx.Done():
		return r.ctx.Err()
	}

	return merr
}


const maxCplForRefresh uint = 15

// GetTrackedCplsForRefresh returns the Cpl's we are tracking for refresh.
// Caller is free to modify the returned slice as it is a defensive copy.
func (rt *RoutingTable) GetTrackedCplsForRefresh() []time.Time {
	maxCommonPrefix := rt.maxCommonPrefix()
	if maxCommonPrefix > maxCplForRefresh {
		maxCommonPrefix = maxCplForRefresh
	}

	rt.cplRefreshLk.RLock()
	defer rt.cplRefreshLk.RUnlock()

	cpls := make([]time.Time, maxCommonPrefix+1)
	for i := uint(0); i <= maxCommonPrefix; i++ {
		// defaults to the zero value if we haven't refreshed it yet.
		cpls[i] = rt.cplRefreshedAt[i]
	}
	return cpls
}

refreshCpl first generates a key for this cpl (a peer ID that does not exist but is closest to that k-bucket) and then calls refreshQueryFnc (i.e. dht.GetClosestPeers).

func (r *RtRefreshManager) refreshCpl(cpl uint) error {
	// gen a key for the query to refresh the cpl
	key, err := r.refreshKeyGenFnc(cpl)
	if err != nil {
		return fmt.Errorf("failed to generated query key for cpl=%d, err=%s", cpl, err)
	}

	logger.Infof("starting refreshing cpl %d with key %s (routing table size was %d)",
		cpl, loggableRawKeyString(key), r.rt.Size())

	if err := r.runRefreshDHTQuery(key); err != nil {
		return fmt.Errorf("failed to refresh cpl=%d, err=%s", cpl, err)
	}

	logger.Infof("finished refreshing cpl %d, routing table size is now %d", cpl, r.rt.Size())
	return nil
}

func (r *RtRefreshManager) runRefreshDHTQuery(key string) error {
	queryCtx, cancel := context.WithTimeout(r.ctx, r.refreshQueryTimeout)
	defer cancel()

	err := r.refreshQueryFnc(queryCtx, key)

	if err == nil || (err == context.DeadlineExceeded && queryCtx.Err() == context.DeadlineExceeded) {
		return nil
	}
	return err
}

Finally, let's look at dht.GetClosestPeers. For an analysis of runLookupWithFollowup, see the companion article on DHT Kademlia iterative lookups.

// GetClosestPeers is the Kademlia "node lookup" operation. It returns the K closest
// peers to the given key (on a channel).
// If the context is canceled, it returns the context error along with the closest
// K peers it has found so far.
// When the lookup completes, the Cpl's last refresh time is reset.
func (dht *IpfsDHT) GetClosestPeers(ctx context.Context, key string) (<-chan peer.ID, error) {
	if key == "" {
		return nil, fmt.Errorf("can't lookup empty key")
	}
	//TODO: I can break the interface! return []peer.ID
	lookupRes, err := dht.runLookupWithFollowup(ctx, key,
		func(ctx context.Context, p peer.ID) ([]*peer.AddrInfo, error) {
			// For DHT query command
			routing.PublishQueryEvent(ctx, &routing.QueryEvent{
				Type: routing.SendingQuery,
				ID:   p,
			})

			pmes, err := dht.findPeerSingle(ctx, p, peer.ID(key))
			if err != nil {
				logger.Debugf("error getting closer peers: %s", err)
				return nil, err
			}
			peers := pb.PBPeersToPeerInfos(pmes.GetCloserPeers())

			// For DHT query command
			routing.PublishQueryEvent(ctx, &routing.QueryEvent{
				Type:      routing.PeerResponse,
				ID:        p,
				Responses: peers,
			})

			return peers, err
		},
		func() bool { return false },
	)

	if err != nil {
		return nil, err
	}

	out := make(chan peer.ID, dht.bucketSize)
	defer close(out)

	for _, p := range lookupRes.peers {
		out <- p
	}

	if ctx.Err() == nil && lookupRes.completed {
		// refresh the cpl for this key as the query was successful
		dht.routingTable.ResetCplRefreshedAtForID(kb.ConvertKey(key), time.Now())
	}

	return out, ctx.Err()
}
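A short consumption sketch for the channel-returning version of GetClosestPeers shown above (newer releases return a slice instead, so adapt accordingly); the function name and its arguments are placeholders supplied by the caller.

package example

import (
	"context"
	"fmt"

	dht "github.com/libp2p/go-libp2p-kad-dht"
)

// printClosestPeers drains the channel returned by GetClosestPeers.
// The channel is buffered (bucketSize) and closed before the call returns,
// so ranging over it never blocks.
func printClosestPeers(ctx context.Context, kadDHT *dht.IpfsDHT, key string) error {
	peersCh, err := kadDHT.GetClosestPeers(ctx, key)
	if err != nil {
		return err
	}
	for p := range peersCh {
		fmt.Println("close peer:", p)
	}
	return nil
}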


Reposted from blog.csdn.net/kk3909/article/details/110870855