Preface: In our previous article, "ELK+Kafka+Filebeat: Analyzing Nginx Logs", we covered the configuration of an ES cluster. In later experiments I found that although that cluster starts and processes data normally, as soon as the es1 node goes down the whole cluster stops working and no data gets processed, so it is not actually highly available. Let's fix that now.
First, the previous configuration:
ES1 node:
[root@es1 ~]# cat /opt/soft/elasticsearch-7.6.0/elasticsearch.yml
# Cluster name
cluster.name: my-app
# Node name
node.name: es1
# Whether this node is master-eligible
node.master: true
# Whether this node stores data
node.data: true
# Data path
path.data: /var/es/data
# Log path
path.logs: /var/es/logs
# IP this node binds to
network.host: 10.1.1.7
# HTTP port for this node
http.port: 9200
# Transport (inter-node communication) port
transport.tcp.port: 9300
# Master-eligible node(s) used to bootstrap the cluster on first startup and
# discover the other nodes; entries must match node.name values
cluster.initial_master_nodes: ["es1"]
# Hosts (IP:transport port) probed during discovery when the cluster forms
discovery.zen.ping.unicast.hosts: ["10.1.1.7:9300","10.1.1.8:9300", "10.1.1.9:9300"]
# Minimum number of mutually connected master-eligible nodes required to form
# a cluster: (N/2)+1, to prevent a split-brain with multiple masters
discovery.zen.minimum_master_nodes: 2
ES2 configuration:
cluster.name: my-app
node.name: es2
node.master: true
node.data: true
path.data: /var/es/data
path.logs: /var/es/logs
network.host: 10.1.1.8
http.port: 9200
transport.tcp.port: 9300
cluster.initial_master_nodes: ["es1"]
discovery.zen.ping.unicast.hosts: ["10.1.1.7:9300","10.1.1.8:9300", "10.1.1.9:9300"]
discovery.zen.minimum_master_nodes: 2
ES3 configuration:
cluster.name: my-app
node.name: es3
node.master: false
node.data: true
path.data: /var/es/data
path.logs: /var/es/logs
network.host: 10.1.1.9
http.port: 9200
transport.tcp.port: 9300
cluster.initial_master_nodes: ["es1"]
discovery.zen.ping.unicast.hosts: ["10.1.1.7:9300","10.1.1.8:9300", "10.1.1.9:9300"]
discovery.zen.minimum_master_nodes: 2
Cluster state after startup:
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.1.1.7 14 94 0 0.06 0.03 0.05 dilm * es1
10.1.1.8 12 97 3 0.14 0.09 0.11 dilm - es2
10.1.1.9 12 93 0 0.32 0.08 0.07 dil - es3
The problem: as soon as node 10.1.1.7 goes down, the cluster is paralyzed.
So where does the problem lie?
First, we need enough master-eligible nodes. A proper production cluster should have at least 3 nodes that are eligible to become master, i.e. at least three nodes configured with node.master: true.
With 3 master-eligible nodes, discovery.zen.minimum_master_nodes can be set to (3/2)+1 = 2 (integer division), meaning at least 2 master-eligible nodes must be available to elect a new master. In the cluster above there are three nodes in total, but only 2 of them are master-eligible, while minimum_master_nodes is also set to 2. Once one master-eligible node is stopped, only 1 remains, the quorum of 2 can no longer be met, and the cluster goes down. (Note that in Elasticsearch 7.x the discovery.zen.* settings are deprecated and minimum_master_nodes is effectively ignored, since the voting quorum is managed automatically, but the underlying rule is the same: a majority of the master-eligible nodes must survive.)
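The quorum arithmetic above can be sketched in a couple of lines of shell, using integer division exactly as in the (N/2)+1 formula:

```shell
# Quorum of master-eligible nodes: (N / 2) + 1, with integer division
for n in 2 3 4 5; do
    echo "quorum for $n master-eligible nodes: $(( n / 2 + 1 ))"
done
# With N = 2 the quorum is still 2, so losing either master-eligible node
# makes an election impossible; with N = 3, one node can be lost safely.
```

This is why 3 is the practical minimum: it is the smallest N where the quorum is strictly less than N.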
Good, the basic requirement is now clear: at least three master-eligible nodes, with at least 2 of them required for an election. Let's build a new cluster:
ES1:
[root@es1 ~]# cat /etc/elasticsearch/elasticsearch.yml
cluster.name: my-app
node.name: es1
node.master: true
node.data: true
path.data: /var/es/data
path.logs: /var/es/logs
network.host: 192.168.1.8
http.port: 9200
transport.tcp.port: 9300
cluster.initial_master_nodes: ["es1"]
discovery.zen.ping.unicast.hosts: ["192.168.1.8:9300","192.168.1.9:9300", "192.168.1.10:9300","192.168.1.12:9300"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 100s
ES2:
[root@es2 ~]# cat /etc/elasticsearch/elasticsearch.yml
cluster.name: my-app
node.name: es2
node.master: true
node.data: true
path.data: /var/es/data
path.logs: /var/es/logs
network.host: 192.168.1.9
http.port: 9200
transport.tcp.port: 9300
cluster.initial_master_nodes: ["es1"]
discovery.zen.ping.unicast.hosts: ["192.168.1.8:9300","192.168.1.9:9300", "192.168.1.10:9300","192.168.1.12:9300"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 100s
ES3:
[root@es3 ~]# cat /etc/elasticsearch/elasticsearch.yml
cluster.name: my-app
node.name: es3
node.master: true
node.data: true
path.data: /var/es/data
path.logs: /var/es/logs
network.host: 192.168.1.10
http.port: 9200
transport.tcp.port: 9300
cluster.initial_master_nodes: ["es1"]
discovery.zen.ping.unicast.hosts: ["192.168.1.8:9300","192.168.1.9:9300", "192.168.1.10:9300","192.168.1.12:9300"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 100s
ES4:
[root@es4 ~]# cat /etc/elasticsearch/elasticsearch.yml
cluster.name: my-app
node.name: es4
node.master: false
node.data: true
path.data: /var/es/data
path.logs: /var/es/logs
network.host: 192.168.1.12
http.port: 9200
transport.tcp.port: 9300
cluster.initial_master_nodes: ["es1"]
discovery.zen.ping.unicast.hosts: ["192.168.1.8:9300","192.168.1.9:9300", "192.168.1.10:9300","192.168.1.12:9300"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 100s
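Before starting the nodes, it's worth sanity-checking that enough of them are master-eligible. A minimal sketch (the here-string simply mirrors the node.master lines of the four configs above; in practice you would grep the real elasticsearch.yml on each host):

```shell
# node.master values from the four configs above (es1, es2, es3, es4)
masters='true
true
true
false'

# Count master-eligible nodes: we need at least 3 so that
# the quorum of (3/2)+1 = 2 survives the loss of any single node
eligible=$(echo "$masters" | grep -c '^true$')
echo "master-eligible nodes: $eligible"   # prints 3
```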
After the four nodes are started one by one, the ES1 log shows:
[2020-03-16T13:11:53,469][INFO ][o.e.c.s.MasterService ] [es1] node-join[{es2}{CgceuyGlQ2GTcCGqegSz3Q}{oxQ-79VaQxulg07SejH0iQ}{192.168.1.9}{192.168.1.9:9300}{dilm}{ml.machine_memory=1019797504, ml.max_open_jobs=20, xpack.installed=true} join existing leader], term: 1, version: 20, delta: added {{es2}{CgceuyGlQ2GTcCGqegSz3Q}{oxQ-79VaQxulg07SejH0iQ}{192.168.1.9}{192.168.1.9:9300}{dilm}{ml.machine_memory=1019797504, ml.max_open_jobs=20, xpack.installed=true}}
[2020-03-16T13:11:55,058][INFO ][o.e.c.s.ClusterApplierService] [es1] added {{es2}{CgceuyGlQ2GTcCGqegSz3Q}{oxQ-79VaQxulg07SejH0iQ}{192.168.1.9}{192.168.1.9:9300}{dilm}{ml.machine_memory=1019797504, ml.max_open_jobs=20, xpack.installed=true}}, term: 1, version: 20, reason: Publication{term=1, version=20}
[2020-03-16T13:11:55,084][INFO ][o.e.c.s.MasterService ] [es1] node-join[{es3}{i-wCmfCESsCNDr6Vw50Aew}{rxtD1oqtQniz62voo7MPUg}{192.168.1.10}{192.168.1.10:9300}{dilm}{ml.machine_memory=1019797504, ml.max_open_jobs=20, xpack.installed=true} join existing leader, {es4}{L221OFajR8-FlaIIYg37Qw}{zvHrSewbQvq3OO14zKHCpg}{192.168.1.12}{192.168.1.12:9300}{dil}{ml.machine_memory=1019797504, ml.max_open_jobs=20, xpack.installed=true} join existing leader], term: 1, version: 21, delta: added {{es4}{L221OFajR8-FlaIIYg37Qw}{zvHrSewbQvq3OO14zKHCpg}{192.168.1.12}{192.168.1.12:9300}{dil}{ml.machine_memory=1019797504, ml.max_open_jobs=20, xpack.installed=true},{es3}{i-wCmfCESsCNDr6Vw50Aew}{rxtD1oqtQniz62voo7MPUg}{192.168.1.10}{192.168.1.10:9300}{dilm}{ml.machine_memory=1019797504, ml.max_open_jobs=20, xpack.installed=true}}
[2020-03-16T13:11:56,347][INFO ][o.e.c.s.ClusterApplierService] [es1] added {{es4}{L221OFajR8-FlaIIYg37Qw}{zvHrSewbQvq3OO14zKHCpg}{192.168.1.12}{192.168.1.12:9300}{dil}{ml.machine_memory=1019797504, ml.max_open_jobs=20, xpack.installed=true},{es3}{i-wCmfCESsCNDr6Vw50Aew}{rxtD1oqtQniz62voo7MPUg}{192.168.1.10}{192.168.1.10:9300}{dilm}{ml.machine_memory=1019797504, ml.max_open_jobs=20, xpack.installed=true}}, term: 1, version: 21, reason: Publication{term=1, version=21}
The log shows that the other 3 nodes have joined the cluster bootstrapped by ES1. You can now check the state of each node by visiting http://192.168.1.8:9200/_cat/nodes?v :
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.1.10 8 93 0 0.00 0.13 0.25 dilm - es3
192.168.1.8 8 92 0 0.00 0.12 0.24 dilm * es1
192.168.1.12 11 93 0 0.00 0.12 0.22 dil - es4
192.168.1.9 7 92 0 0.00 0.12 0.24 dilm - es2
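When scripting failover checks, the elected master can be picked out of the `_cat/nodes` output by matching the `*` in the master column. A sketch using the table above as sample input (in practice, pipe in `curl -s 'http://192.168.1.8:9200/_cat/nodes'`):

```shell
# Sample `_cat/nodes` output (the table shown above, without the header row)
nodes='192.168.1.10 8 93 0 0.00 0.13 0.25 dilm - es3
192.168.1.8 8 92 0 0.00 0.12 0.24 dilm * es1
192.168.1.12 11 93 0 0.00 0.12 0.22 dil - es4
192.168.1.9 7 92 0 0.00 0.12 0.24 dilm - es2'

# Field 9 is the master marker ("*" on the elected master), field 10 the name
echo "$nodes" | awk '$9 == "*" { print "current master: " $10 }'
# prints: current master: es1
```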
Now let's stop the ES1 node:
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.1.10 9 93 0 0.00 0.08 0.22 dilm * es3
192.168.1.12 9 93 0 0.04 0.09 0.20 dil - es4
192.168.1.9 11 92 0 0.00 0.08 0.21 dilm - es2
The master role has now moved to ES3.
With that, a basic highly available ES cluster is complete.