Consul service discovery system design:
Preface: As a system's business grows more complex, the number of services it runs keeps increasing. Updating services through static configuration files becomes unreliable and is not real-time. This is why dedicated service discovery systems such as etcd, Consul, and ZooKeeper appeared: services register themselves with the cluster, and clients fetch service information (the corresponding ip:port plus route or URI) from the cluster. Clients therefore no longer need to track dynamic changes of the server nodes, and a service discovery system can also provide load balancing and seamless failover when a service goes down, making services highly available. Like etcd, Consul uses the Raft algorithm (for data consistency) together with the gossip protocol (I will write a dedicated article about it later; it is one of the data propagation algorithms used by libp2p and Bitcoin). Application examples: a MongoDB replica set plus Consul for a highly available, high-performance NoSQL database; or etcd+Redis, or Consul+Filecoin, for a highly available, high-performance chain cluster.
Architecture diagram:
Three servers and three clients; each client node runs a web service.
Consul servers persist their data to files on disk, while client agents do not persist to files; apart from that, their functionality is the same.
1. Environment preparation:
OS: Ubuntu 18.04
Nodes:
# Consul servers
192.168.1.47
192.168.1.48
192.168.1.49
# Consul clients + MongoDB
192.168.1.100
192.168.1.101
192.168.1.102
2. Install the services:
Run the following commands on all six nodes:
1) Add the HashiCorp apt repository (Ubuntu/Debian):
curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
Install Consul:
apt-get update
apt-get install consul -y
2) Install MongoDB:
apt install mongodb-server -y
3. Set up the Consul cluster:
1. On all server and client nodes, create the data directory that the Consul agents below will use:
mkdir -p /data/consul0
2. Start the server nodes:
On consul server1 (192.168.1.47), run:
nohup consul agent -bootstrap-expect 3 -server -data-dir /data/consul0 -node=server1 -bind=192.168.1.47 -config-dir /etc/consul.d -enable-script-checks=true -datacenter=dc1 > /data/consul0/consul.log 2>&1 &
On consul server2 (192.168.1.48), run:
nohup consul agent -server -data-dir /data/consul0 -node=server2 -bind=192.168.1.48 -config-dir /etc/consul.d -enable-script-checks=true -datacenter=dc1 -join 192.168.1.47 > /data/consul0/consul.log 2>&1 &
On consul server3 (192.168.1.49), run:
nohup consul agent -server -data-dir /data/consul0 -node=server3 -bind=192.168.1.49 -config-dir /etc/consul.d -enable-script-checks=true -datacenter=dc1 -join 192.168.1.47 > /data/consul0/consul.log 2>&1 &
3. Start the client nodes:
Start the web service on the consul server3 node and on the three consul client nodes:
root@jacky-VirtualBox:~# cat /root/test/web/main.go
package main

// web.go: a minimal HTTP service that will be registered with Consul as "web".
import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strconv"
)

var iCnt int = 0

// helloHandler counts requests and replies with a greeting.
func helloHandler(w http.ResponseWriter, r *http.Request) {
	iCnt++
	str := "Hello world ! friend(" + strconv.Itoa(iCnt) + ")"
	io.WriteString(w, str)
	fmt.Println(str)
}

func main() {
	http.Handle("/hello", http.HandlerFunc(helloHandler))
	err := http.ListenAndServe(":80", nil)
	if err != nil {
		log.Fatal("ListenAndServe: ", err.Error())
	}
}
root@jacky-VirtualBox:~#
root@jacky-VirtualBox:~# cd /root/test/web/
root@jacky-VirtualBox:~/test/web# go build -o web main.go
root@jacky-VirtualBox:~/test/web# nohup /root/test/web/web > /tmp/web.log 2>&1 &
root@jacky-VirtualBox:~#
root@jacky-VirtualBox:~# ps -aux | grep web
root 1804 0.1 0.1 1003376 5244 pts/0 Sl 11:01 0:00 /root/test/web/web
root 1812 0.0 0.0 17672 724 pts/0 S+ 11:01 0:00 grep --color=auto web
root@jacky-VirtualBox:~#
root@jacky-VirtualBox:~# lsof -i:80
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
web 1804 root 3u IPv6 31934 0t0 TCP *:http (LISTEN)
root@jacky-VirtualBox:~#
Register the web service with the Consul cluster (there are two ways to register: through the HTTP API or through a configuration file; the configuration-file approach is recommended, and a sketch of the HTTP approach is shown after the JSON example below):
# mkdir -p /etc/consul.d/   (create the directory if it does not exist yet)
# Every *.json file under this directory defines one service. When Consul starts it parses these JSON files, registers the services, and runs their health checks (which can be thought of as heartbeats).
Edit the configuration file /etc/consul.d/web.json with vim:
{
  "service": {
    "name": "web",
    "tags": ["rails"],
    "port": 80,
    "check": {
      "name": "ping",
      "script": "curl -s localhost:80",
      "interval": "3s"
    }
  }
}
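For the HTTP registration alternative mentioned above, a service definition is simply PUT to the local agent. The following is a minimal sketch (not part of the original setup) against the agent's /v1/agent/service/register endpoint; it assumes the agent's HTTP API is reachable on 127.0.0.1:8500 and uses the newer "Args" form of the script check, which recent Consul versions prefer over the deprecated "script" field.
// register_web.go: a minimal sketch of registering the "web" service over the
// agent HTTP API. Assumes the local agent's HTTP API is on 127.0.0.1:8500;
// adjust the address, service name, and check to your environment.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Service definition equivalent to web.json above.
	payload := map[string]interface{}{
		"Name": "web",
		"Tags": []string{"rails"},
		"Port": 80,
		"Check": map[string]interface{}{
			"Name":     "ping",
			"Args":     []string{"curl", "-s", "localhost:80"},
			"Interval": "3s",
		},
	}
	body, err := json.Marshal(payload)
	if err != nil {
		log.Fatal(err)
	}
	req, err := http.NewRequest(http.MethodPut,
		"http://127.0.0.1:8500/v1/agent/service/register", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("register status:", resp.Status)
}
Note that script-based checks (whether "script" or "Args") only run because the agents were started with -enable-script-checks=true.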
On consul client1 (192.168.1.100), run:
nohup consul agent -data-dir /data/consul0 -node=client1 -bind=192.168.1.100 -config-dir /etc/consul.d -enable-script-checks=true -datacenter=dc1 -join 192.168.1.47 > /data/consul0/consul.log 2>&1 &
On consul client2 (192.168.1.101), run:
nohup consul agent -data-dir /data/consul0 -node=client2 -bind=192.168.1.101 -config-dir /etc/consul.d -enable-script-checks=true -datacenter=dc1 -join 192.168.1.47 > /data/consul0/consul.log 2>&1 &
On consul client3 (192.168.1.102), run:
nohup consul agent -data-dir /data/consul0 -node=client3 -bind=192.168.1.102 -config-dir /etc/consul.d -enable-script-checks=true -datacenter=dc1 -join 192.168.1.47 > /data/consul0/consul.log 2>&1 &
Notes:
nohup ("no hang up") keeps the command running even after the current terminal logs out.
&: run the command in the background.
2>&1:
In the bash shell,
0 is standard input (normally the keyboard);
1 is standard output (normally the screen);
2 is standard error.
When a command is run as nohup ... > logfile &, only standard output is redirected into the log file; anything written to standard error would not end up there.
Adding 2>&1 redirects standard error to wherever standard output points, so both the normal output and the error messages of the background process are captured in the same log file.
View the logs:
tail -f /data/consul0/consul.log
4. View the Consul cluster information:
Run the following commands on consul client3 (192.168.1.102):
root@jacky-VirtualBox:~# consul info
agent:
    check_monitors = 1
    check_ttls = 0
    checks = 1
    services = 1
build:
    prerelease =
    revision = 27de64da
    version = 1.10.0
consul:
    acl = disabled
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 43
    max_procs = 1
    os = linux
    version = go1.16.5
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 2
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 6
    members = 4
    query_queue = 0
    query_time = 3
root@jacky-VirtualBox:~#
root@jacky-VirtualBox:~# consul members
Node     Address             Status  Type    Build   Protocol  DC   Segment
server1  192.168.1.47:8301   alive   server  1.10.0  2         dc1  <all>
server2  192.168.1.48:8301   alive   server  1.10.0  2         dc1  <all>
server3  192.168.1.49:8301   alive   server  1.10.0  2         dc1  <all>
client1  192.168.1.100:8301  alive   client  1.10.0  2         dc1  <default>
client2  192.168.1.101:8301  alive   client  1.10.0  2         dc1  <default>
client3  192.168.1.102:8301  alive   client  1.10.0  2         dc1  <default>
root@jacky-VirtualBox:~#
5. Query the service information:
root@jacky-VirtualBox:~# dig @127.0.0.1 -p 8600 web.service.consul SRV
; <<>> DiG 9.16.1-Ubuntu <<>> @127.0.0.1 -p 8600 web.service.consul SRV
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46717
;; flags: qr aa rd; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 9
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;web.service.consul. IN SRV
;; ANSWER SECTION:
web.service.consul. 0 IN SRV 1 1 80 client2.node.dc1.consul.
web.service.consul. 0 IN SRV 1 1 80 client3.node.dc1.consul.
web.service.consul. 0 IN SRV 1 1 80 client1.node.dc1.consul.
web.service.consul. 0 IN SRV 1 1 80 server3.node.dc1.consul.
;; ADDITIONAL SECTION:
client2.node.dc1.consul. 0 IN A 192.168.1.101
client2.node.dc1.consul. 0 IN TXT "consul-network-segment="
client3.node.dc1.consul. 0 IN A 192.168.1.102
client3.node.dc1.consul. 0 IN TXT "consul-network-segment="
client1.node.dc1.consul. 0 IN A 192.168.1.100
client1.node.dc1.consul. 0 IN TXT "consul-network-segment="
server3.node.dc1.consul. 0 IN A 192.168.1.49
server3.node.dc1.consul. 0 IN TXT "consul-network-segment="
;; Query time: 3 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Tue Jul 01 21:06:41 CST 2021
;; MSG SIZE rcvd: 411
root@jacky-VirtualBox:~#
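Besides the DNS interface queried above, a client program can discover healthy instances through the agent's HTTP API and then load-balance across them itself. The following is a minimal sketch, assuming the local agent's HTTP API is on 127.0.0.1:8500 and the service is registered under the name "web":
// discover_web.go: query /v1/health/service/web?passing=true, which returns
// only instances whose health checks are currently passing.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// entry models only the fields of the health API response we care about.
type entry struct {
	Node struct {
		Address string
	}
	Service struct {
		Address string
		Port    int
	}
}

func main() {
	resp, err := http.Get("http://127.0.0.1:8500/v1/health/service/web?passing=true")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var entries []entry
	if err := json.NewDecoder(resp.Body).Decode(&entries); err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		addr := e.Service.Address
		if addr == "" {
			addr = e.Node.Address // empty service address falls back to the node address
		}
		fmt.Printf("healthy web instance: %s:%d\n", addr, e.Service.Port)
	}
}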
6. Reload the Consul services:
Consul's policy is to load service definitions at startup; while it is running it does not actively rescan the files. Therefore, after manually adding or removing a service, you have to send Consul an event that makes it reload: it rescans all the JSON files under /etc/consul.d/ and updates the in-memory service registry to match. This manual-trigger design is quite reasonable.
root@jacky-VirtualBox:~# mv /etc/consul.d/web.json /tmp
root@jacky-VirtualBox:~# consul reload
Configuration reload triggered
root@jacky-VirtualBox:~# tail -f /data/consul0/consul.log
2021-07-14T11:16:04.954+0800 [WARN] agent: Check is now critical: check=service:web
2021-07-14T11:16:14.993+0800 [WARN] agent: Check is now critical: check=service:web
2021-07-14T11:16:25.042+0800 [WARN] agent: Check is now critical: check=service:web
2021-07-14T11:16:35.065+0800 [WARN] agent: Check is now critical: check=service:web
2021-07-14T11:16:45.106+0800 [WARN] agent: Check is now critical: check=service:web
2021-07-14T11:16:55.147+0800 [WARN] agent: Check is now critical: check=service:web
2021-07-14T11:16:56.290+0800 [WARN] agent.auto_config: skipping file /etc/consul.d/consul.env, extension must be .hcl or .json, or config format must be set
2021-07-14T11:16:56.290+0800 [WARN] agent.auto_config: using enable-script-checks without ACLs and without allow_write_http_from is DANGEROUS, use enable-local-script-checks instead, see https://www.hashicorp.com/blog/protecting-consul-from-rce-risk-in-specific-configurations/
2021-07-14T11:16:56.290+0800 [WARN] agent: DEPRECATED Backwards compatibility with pre-1.9 metrics enabled. These metrics will be removed in a future version of Consul. Set `telemetry {
disable_compat_1.9 = true }` to disable them.
2021-07-14T11:16:56.301+0800 [INFO] agent: Deregistered service: service=web
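For completeness: besides the consul reload command, the same reload can be triggered over the agent's HTTP API. A minimal sketch, assuming the agent's HTTP API listens on 127.0.0.1:8500:
// reload.go: trigger a configuration reload via PUT /v1/agent/reload
// (the HTTP equivalent of running `consul reload`).
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	req, err := http.NewRequest(http.MethodPut, "http://127.0.0.1:8500/v1/agent/reload", nil)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("reload status:", resp.Status)
}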
7. Start the UI service (in production it is advisable to leave the UI disabled and operate through the CLI):
When the Consul servers and clients were started above, the UI was not enabled. Pick one node (for example server3) and restart its agent with the UI enabled.
First kill the running consul process:
root@jacky-VirtualBox:~# ps -aux | grep consul
root 1941 1.9 2.9 781836 74308 pts/0 Sl 12:40 0:21 consul agent -server -ui -data-dir /data/consul0 -node=server3 -bind=192.168.1.49 -config-dir /etc/consul.d -enable-script-checks=true -datacenter=dc1 -join 192.168.1.47
root 2221 0.0 0.0 17676 728 pts/0 S+ 12:58 0:00 grep --color=auto consul
root@jacky-VirtualBox:~#
root@jacky-VirtualBox:~#
root@jacky-VirtualBox:~#
root@jacky-VirtualBox:~# kill 1941
root@jacky-VirtualBox:~#
[3]+ Exit 1 nohup consul agent -server -ui -data-dir /data/consul0 -node=server3 -bind=192.168.1.49 -config-dir /etc/consul.d -enable-script-checks=true -datacenter=dc1 -join 192.168.1.47 > /data/consul0/consul.log 2>&1
root@jacky-VirtualBox:~#
Start consul on server3 again:
nohup consul agent -server -ui -data-dir /data/consul0 -node=server3 -bind=192.168.1.49 -config-dir /etc/consul.d -enable-script-checks=true -datacenter=dc1 -join 192.168.1.47 > /data/consul0/consul.log 2>&1 &
In a browser on Windows, open http://192.168.1.49:8500/ui; the page cannot be reached. Hmm...
When you run into a problem, don't panic; take out your phone and post a WeChat Moment first...
Back on server3, check which address the UI service is listening on; it turns out to be bound to 127.0.0.1:8500 only:
root@jacky-VirtualBox:~# lsof -i:8500
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
consul 1941 root 22u IPv4 35022 0t0 TCP localhost:8500 (LISTEN)
root@jacky-VirtualBox:~#
root@jacky-VirtualBox:~# curl http://localhost:8500/ui
<a href="/ui/">Moved Permanently</a>.
root@jacky-VirtualBox:~#
This quick CLI test reveals two things: first, Consul's UI service only listens for local requests by default; second, /ui is redirected to /ui/.
There are two ways to solve the access problem: (1) deploy a TCP proxy on server3, for example listening on 0.0.0.0:9009, that forwards requests arriving at 192.168.1.49:9009 to 127.0.0.1:8500; (2) explicitly set the address that server3's client interfaces (including the UI) listen on:
nohup consul agent -server -ui -client=0.0.0.0 -data-dir /data/consul0 -node=server3 -bind=192.168.1.49 -config-dir /etc/consul.d -enable-script-checks=true -datacenter=dc1 -join 192.168.1.47 > /data/consul0/consul.log 2>&1 &
Check the listening address again; it has changed to 0.0.0.0:8500:
root@jacky-VirtualBox:~# lsof -i:8500
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
consul 2228 root 22u IPv6 37271 0t0 TCP *:8500 (LISTEN)
root@jacky-VirtualBox:~#
The result matches the analysis above, and the UI can now be viewed from the Windows browser.
Accessing the UI by way of a proxy:
Source code of the proxy service:
package main

import (
	"fmt"
	"io"
	"net"
	"os"
	"strings"
	"sync"

	"github.com/urfave/cli/v2"
	"golang.org/x/sys/unix"
)

// tcpproxy forwards TCP connections arriving at src to one of dsts (round-robin).
type tcpproxy struct {
	lock sync.Mutex
	dsts []string
	src  string
}

var TcpProxy = &cli.Command{
	Name:      "tcpproxy",
	Aliases:   []string{""},
	Usage:     "tcp port proxy",
	UsageText: "tcpproxy [--src=0.0.0.0:7777] [--dst=136.19.188.100:9999,136.19.188.110:9999,136.19.188.120:9999]",
	Flags: []cli.Flag{
		&cli.StringFlag{
			Name:   "src",
			Hidden: true,
		},
		&cli.StringFlag{
			Name:   "dst",
			Hidden: true,
		},
	},
	Action: func(cctx *cli.Context) error {
		src := cctx.String("src")
		dst := cctx.String("dst")
		if src == "" || dst == "" {
			fmt.Println(cctx.Command.UsageText)
			return nil
		}
		tp := &tcpproxy{
			src:  src,
			dsts: strings.Split(dst, ","),
		}
		tp.server()
		return nil
	},
}

func main() {
	local := []*cli.Command{
		TcpProxy,
	}
	app := &cli.App{
		Name:                 "proxy",
		Usage:                "proxy tcpproxy",
		Version:              "v0.0.1",
		EnableBashCompletion: true,
		Flags: []cli.Flag{
			&cli.StringFlag{
				Name:    "configfile",
				EnvVars: []string{""},
				Hidden:  true,
				Value:   "cfg.toml",
			},
		},
		Commands: local,
	}
	if err := app.Run(os.Args); err != nil {
		fmt.Fprintf(os.Stderr, "ERROR: %s\n\n", err) // nolint:errcheck
		os.Exit(1)
	}
}

// unixSetLimit raises the open-file limit so the proxy can hold many connections.
func unixSetLimit(soft uint64, max uint64) error {
	rlimit := unix.Rlimit{
		Cur: soft,
		Max: max,
	}
	return unix.Setrlimit(unix.RLIMIT_NOFILE, &rlimit)
}

func (p *tcpproxy) server() {
	unixSetLimit(60000, 60000)
	listen, err := net.Listen("tcp", p.src)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer listen.Close()
	fmt.Println("listen at:", p.src)
	for {
		conn, err := listen.Accept()
		if err != nil {
			fmt.Printf("failed to accept client connection: %v\n", err)
			continue
		}
		fmt.Println("build new proxy connect. ", "client address =", conn.RemoteAddr(), " local server address=", conn.LocalAddr())
		go p.handle(conn)
	}
}

func (p *tcpproxy) handle(sconn net.Conn) {
	defer sconn.Close()
	dst, ok := p.select_dst()
	if !ok {
		return
	}
	dconn, err := net.Dial("tcp", dst)
	if err != nil {
		fmt.Printf("failed to connect to %v: %v\n", dst, err)
		return
	}
	defer dconn.Close()
	ExitChan := make(chan bool, 1)
	// forward data from the client to the destination server
	go func(sconn net.Conn, dconn net.Conn, Exit chan bool) {
		_, err := io.Copy(dconn, sconn)
		if err != nil {
			fmt.Printf("failed to send data to %v: %v\n", dst, err)
			ExitChan <- true
		}
	}(sconn, dconn, ExitChan)
	// copy responses from the destination server back to the client
	go func(sconn net.Conn, dconn net.Conn, Exit chan bool) {
		_, err := io.Copy(sconn, dconn)
		if err != nil {
			fmt.Printf("failed to receive data from %v: %v\n", dst, err)
			ExitChan <- true
		}
	}(sconn, dconn, ExitChan)
	<-ExitChan
}

// select_dst picks the next destination address, rotating round-robin over the list.
func (p *tcpproxy) select_dst() (string, bool) {
	p.lock.Lock()
	defer p.lock.Unlock()
	if len(p.dsts) < 1 {
		fmt.Println("failed select_dst()")
		return "", false
	}
	dst := p.dsts[0]
	p.dsts = append(p.dsts[1:], dst)
	return dst, true
}
root@jacky-VirtualBox:~/test/proxy# tree
.
├── go.mod
├── go.sum
├── main.go
└── proxy
0 directories, 4 files
root@jacky-VirtualBox:~/test/proxy#
root@jacky-VirtualBox:~/test/proxy# go mod tidy
go: finding module for package github.com/urfave/cli/v2
go: downloading github.com/urfave/cli/v2 v2.3.0
go: downloading github.com/urfave/cli v1.22.5
go: finding module for package golang.org/x/sys/unix
go: downloading golang.org/x/sys v0.0.0-20210630005230-0f9fa26af87c
go: found github.com/urfave/cli/v2 in github.com/urfave/cli/v2 v2.3.0
go: found golang.org/x/sys/unix in golang.org/x/sys v0.0.0-20210630005230-0f9fa26af87c
go: downloading github.com/cpuguy83/go-md2man/v2 v2.0.0-20190314233015-f79a8a8ca69d
root@jacky-VirtualBox:~/test/proxy# go build
root@jacky-VirtualBox:~/test/proxy# ll
total 5176
drwxr-xr-x 2 root root    4096 Jul 14 12:32 ./
drwxr-xr-x 9 root root    4096 Jul 14 12:19 ../
-rw-r--r-- 1 root root     121 Jul 14 12:32 go.mod
-rw-r--r-- 1 root root    1454 Jul 14 12:32 go.sum
-rw-r--r-- 1 root root    3066 Jul 14 12:20 main.go
-rwxr-xr-x 1 root root 5278425 Jul 14 12:32 proxy*
root@jacky-VirtualBox:~/test/proxy#
root@jacky-VirtualBox:~/test/proxy# nohup ./proxy tcpproxy --src=0.0.0.0:9009 --dst=localhost:8500 > /tmp/proxy.log 2>&1 &
root@jacky-VirtualBox:~/test/proxy# tail -f /tmp/proxy.log
listen at: 0.0.0.0:9009
build new proxy connect. client address = 192.168.1.30:63053 local server address= 192.168.1.49:9009
build new proxy connect. client address = 192.168.1.30:52724 local server address= 192.168.1.49:9009
build new proxy connect. client address = 192.168.1.30:61155 local server address= 192.168.1.49:9009
build new proxy connect. client address = 192.168.1.30:62671 local server address= 192.168.1.49:9009
build new proxy connect. client address = 192.168.1.30:64189 local server address= 192.168.1.49:9009
build new proxy connect. client address = 192.168.1.30:57534 local server address= 192.168.1.49:9009
Access the UI web page through the proxy's port:
4. MongoDB high-availability cluster design