etcd Distributed Lock: Implementing a CP Distributed Lock

Why a CP Distributed Lock

For what a distributed lock does and what it must guarantee, see the earlier post Redis Distributed Lock: a simple distributed lock based on AOP and Redis; that background is not repeated here.

The self-developed Redis distributed lock already covers most scenarios (a non-fair, auto-renewing, reentrant distributed lock) and can be used in production. But because it is built on a standalone Redis instance, it is a single point of failure. As more business scenarios came on board, a single Redis node was no longer reliable enough, which left us only two choices: 1. switch from a single Redis node to a Redis cluster; 2. use another implementation based on a consensus algorithm.

Option 1 has an inherent defect: a Redis cluster cannot guarantee data consistency at the moment the master goes down, because master and slave may not yet be in sync. Service a may acquire a lock from the master node; if the master then goes down before the slave has finished synchronizing the master's data, service b can successfully acquire the same lock from the slave.

Among implementations based on consensus algorithms, ZooKeeper and etcd are both good choices. Considering that ZooKeeper-based locks have already been covered at length elsewhere, we chose the rising star, etcd.

In a distributed lock scenario we care more about the lock's consistency than its availability, so a CP lock is more reliable than an AP lock.

Design Ideas

etcd introduces the concept of a lease: we first grant a lease and set its time to live. The lease's TTL can then serve as the lock's effective time.

Next we can call etcd's lock function to lock the given lockName on that lease. If no other thread holds the lock, the current thread acquires it immediately; otherwise it has to wait. We set a timeout on the lock wait so that a thread competing for the lock gives up after waiting too long. Because of network jitter and similar issues, I recommend a minimum timeout of 500 ms (or whatever value you consider reasonable).

For unlocking, we skip etcd's unlock operation and use its revoke operation directly. Why not unlock? First, unlock requires the lockKey returned by the earlier lock operation as a parameter, and we do not want to maintain one more field. Second, we would eventually call revoke anyway, and revoke invalidates every key under the lease; since our design maps one lease to exactly one lock, revoking the lease cannot accidentally release some other business scenario's lock.

In addition, to make sure the lease does not expire while a thread is still waiting to acquire the lock, we set up a daemon thread: right after granting the lease, the thread starts a daemon thread that periodically decides whether the lease needs to be renewed.

This differs from the Redis distributed lock. There, the lock's effective time is the TTL of a cache entry, so the renewal daemon can be started after the lock has been acquired. Here, the lock's effective time is the lease's TTL, and the lease may expire while we are still waiting for the lock, so the daemon has to be started as soon as the lease is granted. That adds a fair amount of complexity.

Concrete Implementation

etcd itself is written in Go, so using it directly from a Java program would be somewhat difficult. We therefore use jetcd as the etcd client, which lets a Java program talk to the etcd server in plain Java code.
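
For reference, a minimal jetcd client setup might look like the sketch below. The endpoint address and the wrapper class are illustrative assumptions, not part of the original project:

import io.etcd.jetcd.Client;

public class EtcdClientDemo {
    public static void main(String[] args) {
        // assumption: a single-node etcd listening locally on the default port
        Client client = Client.builder()
                .endpoints("http://127.0.0.1:2379")
                .build();
        // the lease and lock clients used throughout this post hang off this Client
        System.out.println("lease client ready: " + client.getLeaseClient());
        client.close();
    }
}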

jetcd provides a LeaseClient, and we can use its grant function directly to complete the lease-granting operation.

public LockLeaseData getLeaseData(String lockName, Long lockTime) {
    try {
        LockLeaseData lockLeaseData = new LockLeaseData();
        // grant a lease whose TTL (in seconds) serves as the lock's effective time
        CompletableFuture<LeaseGrantResponse> leaseGrantResponseCompletableFuture = client.getLeaseClient().grant(lockTime);
        Long leaseId = leaseGrantResponseCompletableFuture.get(1, TimeUnit.SECONDS).getID();
        lockLeaseData.setLeaseId(leaseId);
        // start the renewal daemon right away, so the lease cannot expire while
        // the owner thread is still waiting to acquire the lock
        CpSurvivalClam cpSurvivalClam = new CpSurvivalClam(Thread.currentThread(), leaseId, lockName, lockTime, this);
        Thread survivalThread = threadFactoryManager.getThreadFactory().newThread(cpSurvivalClam);
        survivalThread.start();
        lockLeaseData.setCpSurvivalClam(cpSurvivalClam);
        lockLeaseData.setSurvivalThread(survivalThread);
        return lockLeaseData;
    } catch (InterruptedException | ExecutionException | TimeoutException e) {
        return null;
    }
}
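
The LockLeaseData holder itself is not shown in this post. Reconstructed from the getters and setters called in getLeaseData above, a minimal version would be a plain data object along these lines (a sketch, not the original class):

public class LockLeaseData {

    /** the lease id returned by etcd's grant call */
    private Long leaseId;

    /** the renewal task and the daemon thread running it */
    private CpSurvivalClam cpSurvivalClam;
    private Thread survivalThread;

    public Long getLeaseId() { return leaseId; }
    public void setLeaseId(Long leaseId) { this.leaseId = leaseId; }

    public CpSurvivalClam getCpSurvivalClam() { return cpSurvivalClam; }
    public void setCpSurvivalClam(CpSurvivalClam cpSurvivalClam) { this.cpSurvivalClam = cpSurvivalClam; }

    public Thread getSurvivalThread() { return survivalThread; }
    public void setSurvivalThread(Thread survivalThread) { this.survivalThread = survivalThread; }
}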

As mentioned above, once we have obtained the lease we start the CpSurvivalClam daemon thread to renew it periodically. The CpSurvivalClam implementation is broadly the same as the one in our Redis distributed lock; the only difference is that the expandLockTime operation now calls etcd's keepAliveOnce. expandLockTime looks like this:

/**
 * Reset the effective time of the lock
 *
 * @param leaseId the lease id of the lock
 * @return whether the reset succeeded
 */
public Boolean expandLockTime(Long leaseId) {
    try {
        CompletableFuture<LeaseKeepAliveResponse> leaseKeepAliveResponseCompletableFuture = client.getLeaseClient().keepAliveOnce(leaseId);
        leaseKeepAliveResponseCompletableFuture.get();
        return Boolean.TRUE;
    } catch (InterruptedException | ExecutionException e) {
        return Boolean.FALSE;
    }
}
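
The CpSurvivalClam class is carried over from the Redis lock and not reprinted in this post. Under the design described above (periodic renewal via expandLockTime, stopped through stop() plus an interrupt), a minimal sketch could look like this; the field layout and the renewal interval of one third of the TTL are assumptions:

public class CpSurvivalClam implements Runnable {

    private final Thread workThread;     // the thread that owns or is waiting for the lock
    private final Long leaseId;
    private final String lockName;
    private final Long lockTime;         // lease TTL, in seconds
    private final LockEtcdClient lockEtcdClient;
    private volatile boolean stop = false;

    public CpSurvivalClam(Thread workThread, Long leaseId, String lockName,
                          Long lockTime, LockEtcdClient lockEtcdClient) {
        this.workThread = workThread;
        this.leaseId = leaseId;
        this.lockName = lockName;
        this.lockTime = lockTime;
        this.lockEtcdClient = lockEtcdClient;
    }

    /** called by stopKeepAlive() when the lock is released or acquisition fails */
    public void stop() {
        this.stop = true;
    }

    @Override
    public void run() {
        try {
            while (!stop && workThread.isAlive()) {
                // assumed schedule: renew once every third of the lease TTL
                Thread.sleep(lockTime * 1000 / 3);
                if (!stop && !lockEtcdClient.expandLockTime(leaseId)) {
                    return; // renewal failed, the lease is gone
                }
            }
        } catch (InterruptedException e) {
            // interrupted by stopKeepAlive(); exit quietly
        }
    }
}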

jetcd likewise provides a LockClient whose lock function we can use directly: we pass in the lockName and the leaseId, and receive a lockKey under that lease. Additionally, to make sure the lease has not expired by the time the lock succeeds, we add a timeToLive step that checks whether the lease is still alive after the lock is acquired. If the TTL is not greater than 0, the lock is judged to have failed.

/**
 * Lock on the given lease; if the lease has expired, the lock counts as failed.
 *
 * @param leaseId  the lease id of the lock
 * @param lockName the name of the lock
 * @param waitTime how long to wait during locking, in ms
 * @return whether the lock was acquired
 */
public Boolean tryLock(Long leaseId, String lockName, Long waitTime) {
    try {
        CompletableFuture<LockResponse> lockResponseCompletableFuture = client.getLockClient().lock(ByteSequence.from(lockName, Charset.defaultCharset()), leaseId);
        // enforce the 500 ms minimum wait recommended above to absorb network jitter
        long timeout = Math.max(500, waitTime);
        lockResponseCompletableFuture.get(timeout, TimeUnit.MILLISECONDS).getKey();
        // the lease may have expired while we were waiting; check that it is still alive
        CompletableFuture<LeaseTimeToLiveResponse> leaseTimeToLiveResponseCompletableFuture = client.getLeaseClient().timeToLive(leaseId, LeaseOption.DEFAULT);
        long ttl = leaseTimeToLiveResponseCompletableFuture.get(1, TimeUnit.SECONDS).getTTl();
        return ttl > 0 ? Boolean.TRUE : Boolean.FALSE;
    } catch (TimeoutException | InterruptedException | ExecutionException e) {
        return Boolean.FALSE;
    }
}

For unlocking, we use the revoke operation on LeaseClient directly, which revokes the lease and releases the lock held under it in one step.

/**
 * Revoke the lease, releasing the lock held under it
 *
 * @param leaseId the lease id
 * @return whether the release succeeded
 */
public Boolean unLock(Long leaseId) {
    try {
        CompletableFuture<LeaseRevokeResponse> revokeResponseCompletableFuture = client.getLeaseClient().revoke(leaseId);
        revokeResponseCompletableFuture.get(1, TimeUnit.SECONDS);
        return Boolean.TRUE;
    } catch (InterruptedException | ExecutionException | TimeoutException e) {
        return Boolean.FALSE;
    }
}


Finally, the CpLock object encapsulates the whole locking and unlocking flow and exposes only the execute method, so callers cannot forget to unlock.

public class CpLock {

    private String lockName;

    private LockEtcdClient lockEtcdClient;

    /**
     * Hold count of the distributed lock (reentrancy count)
     */
    private volatile int state;

    private volatile transient Thread lockOwnerThread;

    /**
     * The lease object owned by the current thread
     */
    private FastThreadLocal<LockLeaseData> lockLeaseDataFastThreadLocal = new FastThreadLocal<>();
    /**
     * Automatic lock release time, in seconds; defaults to 30
     */
    private static Long LOCK_TIME = 30L;

    /**
     * Wait time per retry when the lock cannot be acquired, in ms; defaults to 300
     */
    private static Integer SLEEP_TIME_ONCE = 300;

    CpLock(String lockName, LockEtcdClient lockEtcdClient) {
        this.lockName = lockName;
        this.lockEtcdClient = lockEtcdClient;
    }

    private LockLeaseData getLockLeaseData(String lockName, long lockTime) {
        if (lockLeaseDataFastThreadLocal.get() != null) {
            return lockLeaseDataFastThreadLocal.get();
        } else {
            LockLeaseData lockLeaseData = lockEtcdClient.getLeaseData(lockName, lockTime);
            lockLeaseDataFastThreadLocal.set(lockLeaseData);
            return lockLeaseData;
        }
    }

    final Boolean tryLock(long waitTime) {
        final long startTime = System.currentTimeMillis();
        final long endTime = startTime + waitTime * 1000;
        final long lockTime = LOCK_TIME;
        final Thread current = Thread.currentThread();
        try {
            do {
                int c = this.getState();
                if (c == 0) {
                    LockLeaseData lockLeaseData = this.getLockLeaseData(lockName, lockTime);
                    if (Objects.isNull(lockLeaseData)) {
                        return Boolean.FALSE;
                    }
                    Long leaseId = lockLeaseData.getLeaseId();
                    if (lockEtcdClient.tryLock(leaseId, lockName, endTime - System.currentTimeMillis())) {
                        log.info("线程获取重入锁成功,cp锁的名称为{}", lockName);
                        this.setLockOwnerThread(current);
                        this.setState(c + 1);
                        return Boolean.TRUE;
                    }
                } else if (lockOwnerThread == Thread.currentThread()) {
                    if (c + 1 <= 0) {
                        throw new Error("Maximum lock count exceeded");
                    }
                    this.setState(c + 1);
                    log.info("线程重入锁成功,cp锁的名称为{},当前LockCount为{}", lockName, state);
                    return Boolean.TRUE;
                }
                int sleepTime = SLEEP_TIME_ONCE;
                if (waitTime > 0) {
                    log.info("线程暂时无法获得cp锁,当前已等待{}ms,本次将再等待{}ms,cp锁的名称为{}", System.currentTimeMillis() - startTime, sleepTime, lockName);
                    try {
                        Thread.sleep(sleepTime);
                    } catch (InterruptedException e) {
                        log.info("线程等待过程中被中断,cp锁的名称为{}", lockName, e);
                    }
                }
            } while (System.currentTimeMillis() <= endTime);
            if (waitTime == 0) {
                log.info("线程获得cp锁失败,将放弃获取,cp锁的名称为{}", lockName);
            } else {
                log.info("线程获得cp锁失败,之前共等待{}ms,将放弃等待获取,cp锁的名称为{}", System.currentTimeMillis() - startTime, lockName);
            }
            this.stopKeepAlive();
            return Boolean.FALSE;
        } catch (Exception e) {
            log.error("execute error", e);
            this.stopKeepAlive();
            return Boolean.FALSE;
        }
    }

    /**
     * Stop the renewal and remove the lease object from the thread
     */
    private void stopKeepAlive() {
        LockLeaseData lockLeaseData = lockLeaseDataFastThreadLocal.get();
        if (Objects.nonNull(lockLeaseData)) {
            lockLeaseData.getCpSurvivalClam().stop();
            lockLeaseData.setCpSurvivalClam(null);
            lockLeaseData.getSurvivalThread().interrupt();
            lockLeaseData.setSurvivalThread(null);
        }
        lockLeaseDataFastThreadLocal.remove();
    }

    final void unLock() {
        if (lockOwnerThread == Thread.currentThread()) {
            int c = this.getState() - 1;
            if (c == 0) {
                this.setLockOwnerThread(null);
                this.setState(c);
                LockLeaseData lockLeaseData = lockLeaseDataFastThreadLocal.get();
                this.stopKeepAlive();
                // unLock must be executed last, so that state and related fields are
                // already correct by the time another thread acquires the lock
                lockEtcdClient.unLock(lockLeaseData.getLeaseId());
                log.info("重入锁LockCount-1,线程已成功释放锁,cp锁的名称为{}", lockName);
            } else {
                this.setState(c);
                log.info("重入锁LockCount-1,cp锁的名称为{},剩余LockCount为{}", lockName, c);
            }
        }
    }

    public <T> T execute(Supplier<T> supplier, int waitTime) {
        Boolean holdLock = Boolean.FALSE;
        Preconditions.checkArgument(waitTime >= 0, "waitTime must be a non-negative integer");
        try {
            if (holdLock = this.tryLock(waitTime)) {
                return supplier.get();
            }
            return null;
        } catch (Exception e) {
            log.error("cpLock execute error", e);
            return null;
        } finally {
            if (holdLock) {
                this.unLock();
            }
        }
    }

    public <T> T execute(Supplier<T> supplier) {
        return this.execute(supplier, 0);
    }
}

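Putting it together, calling code might look like the following sketch. How CpLock instances are created and shared is not shown in the post, so the direct constructor call and the lock name are assumptions for illustration:

CpLock cpLock = new CpLock("cp:order:10001", lockEtcdClient);

// wait up to 5 s for the lock; execute returns null if the lock is not acquired
String result = cpLock.execute(() -> {
    // business logic guarded by the lock goes here
    return "done";
}, 5);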

CpLock is broadly consistent with the ApLock from the earlier Redis distributed lock. The main differences are:

1. Because the renewal daemon is started when the lease is granted, it must be stopped not only on a normal release but also when the competition for the lock fails or the lock is released abnormally. And because of reentrancy, we only want to grant a lease and compete for the lock when state is 0. To avoid checking all of these cases separately, we introduced the FastThreadLocal field lockLeaseDataFastThreadLocal to hold the current thread's lease object.

2. In the Redis distributed lock, waiting for the lock is always done by sleep-based polling. In the etcd version, when state is 0 the waiting is delegated to etcd's own wait logic; only when state is non-zero do we still poll with sleep. Since state can change from 0 to non-zero between attempts, the waitTime we pass down is endTime - System.currentTimeMillis() rather than the originally supplied waitTime, which keeps the total wait closer to what the caller asked for.

Release Notes

In this update we implemented a CP distributed lock based on etcd, and we also fixed a hidden problem in the Redis distributed lock.

Previously, the setState call ran after the unLock call, which could cause trouble under concurrency. Suppose threads a and b compete for the lock and each reads the local variable c as 0. Thread a acquires the lock and releases it immediately: it performs the unLock first, while state is still 1. Thread b then locks successfully and sets state to c + 1, which is again 1. Now thread a executes its setState, resetting state to 0. When thread b later releases the lock, it executes state - 1, and state ends up at -1. The root cause is that reading the state value and writing it back are separate operations; since multi-threaded access is already controlled by the distributed lock itself, moving the unLock call after all the assignments is enough to fix the problem.

Next Steps

The current CP distributed lock implementation already covers the vast majority of scenarios (a non-fair, auto-renewing, reentrant distributed lock) and can be put into production. In follow-up work the AP lock and the CP lock will be updated in step, each release optimizing some usage scenarios. I will try to tackle fair locking and the need to sleep-poll while waiting for the lock.

A CP distributed lock has a great many usage scenarios to consider, and it has only been through small-scale testing so far; if anything has been overlooked, I hope you will bear with me.

Recommended Reading

1. Redis Distributed Lock: a simple distributed lock based on AOP and Redis
2. Redis Distributed Lock (II): supporting lock renewal, to avoid multiple threads acquiring the lock after a timeout
3. Redis Distributed Lock (III): supporting reentrant locking, to avoid deadlock when locking recursively

That's all for this time; see you next time. Feel free to leave a comment to discuss, and a thumbs-up is always welcome.

Origin: juejin.im/post/5d69dd446fb9a06aed713877