Scheduled tasks are common in business systems, but in a cluster the same task may be triggered on several machines at the same time, which is usually not what we want. Below are some common approaches to the problem, collected for reference.
Problem
Two issues with scheduled tasks in a cluster need to be addressed:
1. If every machine in the cluster starts the scheduled task, the same data is likely to be processed more than once.
2. If instead the task is switched on for only one machine and off for all the others, duplicate processing is avoided, but that machine becomes a single point of failure.
Solutions
① Designate a machine to run the scheduled task
Among the machines that could run the task, designate one; on every run, each machine checks whether it is the designated machine, and only the designated one executes the task.
This avoids duplicate execution, but the obvious drawback is the single point of failure: once the designated machine goes down, the task is no longer executed at all.
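As a sketch of this approach (the hostname `job-node-1` is a made-up example; in practice the current name would come from os.Hostname() and the designated name from configuration), each node compares its own identity against the designated one before running:

```go
package main

import "fmt"

// shouldRunHere reports whether this machine is the designated one.
// Every node in the cluster fires the same timer, but only the node
// whose hostname matches the configured name actually runs the task.
func shouldRunHere(current, designated string) bool {
	return current == designated
}

func main() {
	fmt.Println(shouldRunHere("job-node-1", "job-node-1")) // designated node: runs the task
	fmt.Println(shouldRunHere("job-node-2", "job-node-1")) // every other node: skips
}
```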
② Load different configuration files dynamically
For example, create several configuration files and distribute the timer tasks among them. Each machine loads a different file, so different machines run different tasks.
The drawback is that you must maintain at least two different configuration files, which is troublesome.
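A minimal sketch of the idea, with made-up profile and task names: each machine is started under a different profile, and the profile decides which timer tasks that machine schedules, so no task appears on two machines.

```go
package main

import "fmt"

// profiles maps a configuration name to the scheduled tasks it owns.
// The names here are illustrative only; in practice each machine would
// load its profile from a config file chosen at startup.
var profiles = map[string][]string{
	"profile-a": {"sync-orders", "send-emails"},
	"profile-b": {"rebuild-cache"},
}

// tasksFor returns the tasks a machine with the given profile should schedule.
func tasksFor(profile string) []string {
	return profiles[profile]
}

func main() {
	// Machine A is started with profile-a, machine B with profile-b,
	// so the same task is never scheduled on two machines at once.
	fmt.Println(tasksFor("profile-a")) // [sync-orders send-emails]
	fmt.Println(tasksFor("profile-b")) // [rebuild-cache]
}
```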
③ Run the scheduled tasks as a separate program
Peel the scheduled tasks out of the main program and run them on their own.
The drawback is the added development and maintenance cost; unless the project is large, this is generally not recommended.
④ Use a listener
Have a program monitor the timers and check whether a task has already been executed; if the related business logic has already run, the task is not executed again.
⑤ Let the timer read task state from a database
Since all machines connect to the same database, each scheduled task can be given a marker row in the database, which is used to tell whether the task has already run.
create table timed_task (
type int(3) not null,
status tinyint(1) not null,
exec_timestamp bigint(20) not null,
interval_time int(11) not null,
PRIMARY KEY (type)
);
This table indicates whether the current task may be executed. If the project has many scheduled tasks, the type field distinguishes the different timers, so a single shared table is enough.
type            # distinguishes the different timer types
status          # whether the task may run: 0 = runnable, 1 = not runnable
exec_timestamp  # when the scheduled task last ran
interval_time   # time threshold, in seconds
The interval_time field deserves a closer look. It serves two purposes: detecting when the node that runs the task has failed or gone down, and preventing the task from running again within a short window.
You can set up a checking mechanism for the failure case: if the executing node crashes or goes down without resetting status back to 0 (the runnable state), compare exec_timestamp + interval_time against the current time. If the current time is greater than exec_timestamp + interval_time while status is still 1, a failure or crash is indicated, and you can go and look for the problem in the program.
When a scheduled task runs in a cluster, the clocks of the machines may differ by a few seconds. So it can happen that one machine has already processed the task, taking only a few milliseconds, and has reset status back to 0 (runnable); a few seconds later, another machine's timer fires, sees status = 0 in the database, and runs the task again, so the same data is processed twice. The interval_time field prevents this repeated execution: if the current time is less than exec_timestamp + interval_time while status is 0, the task has already run recently and is not allowed to run again.
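The two checks above can be sketched as a pair of helper functions (the names `shouldRun` and `looksStuck` are my own; timestamps are Unix seconds, mirroring the exec_timestamp and interval_time columns of the timed_task table):

```go
package main

import "fmt"

// shouldRun reports whether the task may be executed: status must be 0
// (runnable) AND at least intervalTime seconds must have passed since the
// last run, which absorbs a few seconds of clock skew between machines.
func shouldRun(status int, execTimestamp, intervalTime, now int64) bool {
	return status == 0 && now >= execTimestamp+intervalTime
}

// looksStuck reports the failure condition: status has stayed at 1 past
// the expected window, suggesting the executing node crashed or went down
// before resetting it.
func looksStuck(status int, execTimestamp, intervalTime, now int64) bool {
	return status == 1 && now > execTimestamp+intervalTime
}

func main() {
	last := int64(1_000_000) // last execution time (Unix seconds)
	interval := int64(60)    // interval_time threshold

	fmt.Println(shouldRun(0, last, interval, last+3))   // false: ran moments ago on another node
	fmt.Println(shouldRun(0, last, interval, last+61))  // true: the window has passed
	fmt.Println(looksStuck(1, last, interval, last+61)) // true: node never reset status
}
```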
⑥ Use a Redis distributed lock with an expiration time
Use a Redis distributed lock to avoid duplicate processing; to keep the lock from being held forever, for example when the client that acquired it crashes without deleting it, give the lock an expiration time.
My own practice
I personally use the sixth approach: a Redis distributed lock to keep several machines from processing the same scheduled task. A simple implementation follows; for brevity all the code is in a single file, which is not a reasonable structure for real development and is only for reference.
Note: the Redis code below assumes a single Redis instance; implementing a distributed lock against a Redis cluster or master-replica setup will be covered in another article.
package main
import (
"github.com/robfig/cron/v3"
"github.com/go-redis/redis"
"time"
"fmt"
"math/rand"
"strconv"
"crypto/md5"
)
// redisLock holds the key/value pair of a distributed lock
type redisLock struct {
key string
value string
}
var Cron *cron.Cron // use a third-party cron library for the scheduling itself, rather than reinventing the wheel
var client = redis.NewClient(&redis.Options{
Addr: "127.0.0.1:6379",
Password: "", // no password set
DB: 0, // use default DB
})
func StartCronJob() {
Cron = cron.New()
cronJobTime := "* * * * *" // run once every minute
_, err := Cron.AddFunc(cronJobTime, cronJob)
if err != nil {
fmt.Println(err)
return
}
Cron.Start()
}
func cronJob() {
r := redisLock{}
hasLock, err := r.GrabLock(time.Now().Unix(), 101) // try to grab the lock
if err != nil {
return
}
if !hasLock {
return
}
defer r.ReleaseLock() // only the holder of the lock is entitled to release it
fmt.Println("hello world, ", time.Now())
}
// GrabLock tries to acquire the distributed lock for the given timestamp and task type
func (r *redisLock) GrabLock(dateStamp int64, tp int) (hasLock bool, err error) {
// Random sleep (0-20ms): in a cluster, a clock skew of a few ms often means the
// scheduled task is always grabbed by the same few machines, which is unbalanced;
// the random sleep gives every machine a chance to grab the lock and run the task
rand.Seed(time.Now().UnixNano())
time.Sleep(time.Duration(rand.Intn(20)) * time.Millisecond)
// The key is built this way so that, when there are many scheduled tasks, the different tasks can be told apart
key := "cron_job:" + strconv.FormatInt(dateStamp, 10) + ":" + strconv.Itoa(tp)
// The value is built to be unique among all clients requesting the lock; this is what makes releasing the lock safe, because it prevents deleting a lock acquired by another client
ds := strconv.FormatInt(dateStamp, 10) + strconv.FormatInt(int64(tp), 10) + strconv.Itoa(rand.Intn(100))
value := fmt.Sprintf("%x", md5.Sum([]byte(ds)))
expireTime := 300
r.key = key
r.value = value
// acquire the lock
hasLock, err = client.SetNX(key, value, time.Duration(expireTime)*time.Second).Result()
if err != nil {
fmt.Println(err)
return
}
return
}
// ReleaseLock releases the lock
func (r *redisLock) ReleaseLock() {
// use a Lua script to delete the key, so the check-and-delete is atomic
delScript := `if redis.call("get", KEYS[1]) == ARGV[1] then
return redis.call("del", KEYS[1])
else return 0
end`
keys := []string{r.key}
result, err := client.Eval(delScript, keys, r.value).Result()
if err != nil {
fmt.Println(err)
return
}
// Eval returns an int64; the original compared result against the untyped int 0, which never matches
if n, ok := result.(int64); !ok || n == 0 {
fmt.Println("lock not released: it was missing or already owned by another client")
}
}
func main() {
StartCronJob()
select {} // block forever; otherwise main exits and the cron goroutine never gets to run
}
First, a word about the Redis SETNX command. SETNX ("SET if Not eXists") assigns a value to a key only if the key does not yet exist: it returns 1 when the key was set and 0 when the key already exists, and the check and the set happen atomically within the single command. A plain SET, by contrast, overwrites the key unconditionally, so it cannot tell you which client acquired the lock. (Note that go-redis's SetNX with a non-zero expiration issues a single SET ... NX EX command under the hood, so the value and the expiration are set atomically as well.)
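To make the SETNX semantics concrete, here is a minimal in-memory simulation, with a plain Go map standing in for Redis (this is an illustration of the command's semantics, not a real client): the key is set only when it is absent, and the return value tells the caller whether it won the lock.

```go
package main

import "fmt"

// setNX simulates the semantics of Redis SETNX on a plain map:
// the key is set only if it is absent, and the boolean result says
// whether the caller "won" the key.
func setNX(store map[string]string, key, value string) bool {
	if _, exists := store[key]; exists {
		return false // key already set: another client holds the lock
	}
	store[key] = value
	return true
}

func main() {
	store := map[string]string{}
	fmt.Println(setNX(store, "cron_job:1", "client-A")) // true: A acquires the lock
	fmt.Println(setNX(store, "cron_job:1", "client-B")) // false: B is rejected
	fmt.Println(store["cron_job:1"])                    // client-A still owns it
}
```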
The expiration time exists so that if the client holding the lock crashes before releasing it, the lock is not held forever; without it, no other client could ever grab the lock again, which amounts to a deadlock.
Why does the EVAL call guarantee atomicity? Because of how Redis works: according to the official documentation of the EVAL command, a Lua script is treated as a single command while it runs, and Redis does not execute any other command until the EVAL call has completed.
The most common mistake when releasing the lock is to call DEL directly, which deletes the lock without checking who owns it, so one client can release another client's lock. For example: client A grabs the lock but then blocks on some long operation; the lock times out and Redis releases it automatically; client B then acquires the lock; client A, still believing the lock is its own, goes on to release it, thereby deleting client B's lock. So a bare DEL instruction may delete a lock that belongs to another client.
Therefore, to avoid releasing someone else's lock, you must check before deletion that the current value is the one you set yourself; if it is not, the lock is no longer yours and must not be released.