Pitfalls of Expanding a ZooKeeper Cluster

One, background

Because of a firm company requirement, the production VM servers are being migrated to unified ZStack virtualized servers. A check of the servers my project uses showed that they host a ZooKeeper cluster, so that cluster has to be migrated as well.

Two, the migration plan

To keep the migration from affecting the business, the preferred approach is to scale out first and then scale in (expand -> shrink).

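Notes:
1. The original production cluster is a 3-node ZK cluster made up of VM-1, VM-2, and VM-3;
2. Expand the cluster to 6 nodes (adding ZS-1, ZS-2, ZS-3) and let data synchronization complete;
3. Shrink the cluster by taking the three original nodes (VM-1, VM-2, VM-3) offline;
4. Replace the address that nginx resolves to.

OK! The goal is clear and the process is straightforward, so let's get to work.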

Three, the steps (the process was verified in a test environment without problems):

  1. Set up the ZooKeeper environment on the three new servers. The configuration can be the same as the old cluster's, and it is best to use the same version (the author uses 3.4.6);

  2. In the old cluster's zoo.cfg, add the addresses of the new nodes (one at a time), then restart the newly added nodes one by one.
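
The configuration change in step 2 can be sketched roughly as follows. The host names (vm-1 to zs-3), the server IDs 1 to 6, and the ports 2888 and 3888 are placeholders chosen for illustration, not values from the original cluster, and the script writes to a scratch file rather than a real zoo.cfg.

```shell
#!/bin/sh
# Sketch: build the server list for the expanded 6-node ensemble.
# Host names, IDs and ports below are hypothetical placeholders.
cfg=$(mktemp)

# Existing 3-node ensemble (the old VM nodes).
cat > "$cfg" <<'EOF'
clientPort=2181
server.1=vm-1:2888:3888
server.2=vm-2:2888:3888
server.3=vm-3:2888:3888
EOF

# Append the three new nodes, one entry per node.
id=4
for host in zs-1 zs-2 zs-3; do
    # Skip an entry that is already present, so the script can be re-run.
    grep -q "^server\.$id=" "$cfg" || echo "server.$id=$host:2888:3888" >> "$cfg"
    id=$((id + 1))
done

server_count=$(grep -c '^server\.' "$cfg")
echo "servers configured: $server_count"
```

Each new node also needs a myid file in its dataDir whose number matches its own server.N entry.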

Four, the problem

[root@localhost bin]# ./zkServer.sh  status
ZooKeeper JMX enabled by default
Using config: /usr/zookeeper/zookeeper-3.4.6/bin/../conf/zoo.cfg
Error contacting service. It is probably not running.

  • Checking the data at this point shows that data synchronization is normal: ZS-1 synchronizes data normally, but the node's status information cannot be viewed;
  • I suspected the cause was that the old nodes had not been restarted. Checking the node information of the original cluster at that point showed that the original cluster nodes were abnormal too. Further investigation showed that the original cluster had been in this abnormal state all along.

  • The preliminary diagnosis was that an abnormality in the original cluster might be preventing the new nodes from joining the election normally, so troubleshooting continued.

  • After the cluster was restored to its initial state, the node status still could not be viewed properly. OK, on with the diagnosis ...

Five, the investigation process

The following checks come from the internet.

There are several possible causes:

First, zoo.cfg configuration: the directory specified by dataLogDir has not been created.

1.zoo.cfg
[root@SIA-215 conf]# cat zoo.cfg
...
dataDir=/app/zookeeperdata/data
dataLogDir=/app/zookeeperdata/log
...

2.Path
[root@SIA-215 conf]# cd /app/zookeeperdata/
[root@SIA-215 zookeeperdata]# ll
total 8
drwxr-xr-x 3 root root 4096 Apr 23 19:59 data
drwxr-xr-x 3 root root 4096 Aug 29  2015 log

After checking, this cause was ruled out.

Second, the integer in the myid file is malformed, or it does not correspond to the server integer in zoo.cfg.

[root@SIA-215 data]# cd /app/zookeeperdata/data
[root@SIA-215 data]# cat myid 
2[root@SIA-215 data]# 

After checking, this cause was ruled out as well.
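
A minimal sketch of this check, assuming the usual layout in which myid holds a single integer and zoo.cfg has one server.N line per node. The function name check_myid and the scratch file contents are made up for illustration.

```shell
#!/bin/sh
# check_myid MYID_FILE ZOO_CFG
# Prints "ok" when myid is a plain integer with a matching server.N line,
# otherwise prints a short reason. Both arguments are paths chosen by the caller.
check_myid() {
    # Drop line endings; the myid in the transcript above has no trailing newline.
    id=$(tr -d '\r\n' < "$1")
    case "$id" in
        ''|*[!0-9]*) echo "myid is not a plain integer"; return 1 ;;
    esac
    if grep -q "^[[:space:]]*server\.$id=" "$2"; then
        echo "ok"
    else
        echo "no server.$id entry in zoo.cfg"
        return 1
    fi
}

# Demo with scratch files (hypothetical contents).
tmp=$(mktemp -d)
printf '2' > "$tmp/myid"
printf 'server.1=host-a:2888:3888\nserver.2=host-b:2888:3888\n' > "$tmp/zoo.cfg"
result=$(check_myid "$tmp/myid" "$tmp/zoo.cfg")
echo "$result"
```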

Third, the firewall has not been turned off.

Use service iptables stop to turn the firewall off;
use service iptables status to confirm;
use chkconfig iptables off to keep the firewall disabled at boot.

Confirm that the firewall is turned off:

[root@localhost ~]# service iptables status
iptables: Firewall is not running.
The firewall is confirmed to be off.

Fourth, the port is occupied.


[root@localhost bin]# netstat -tunlp | grep 2181
tcp        0      0 :::12181                    :::*                        LISTEN      30035/java          
tcp        0      0 :::22181                    :::*                        LISTEN      30307/java 

The port is confirmed not to be occupied.
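
Note that grep 2181 is a substring match, which is why the output above lists 12181 and 22181. A small sketch of an exact-port filter follows; the function name port_listeners is made up, and it assumes the netstat -tunlp column layout shown above, with the local address in the fourth column.

```shell
#!/bin/sh
# port_listeners PORT: read `netstat -tunlp` style lines on stdin and print
# only those whose local address (4th column) ends in exactly ":PORT".
port_listeners() {
    awk -v port="$1" '{
        n = split($4, parts, ":")      # local address, e.g. ":::12181"
        if (parts[n] == port) print
    }'
}

# Demo on the two lines from the output above.
sample='tcp        0      0 :::12181                    :::*                        LISTEN      30035/java
tcp        0      0 :::22181                    :::*                        LISTEN      30307/java'

matches=$(printf '%s\n' "$sample" | port_listeners 2181 | awk 'END { print NR }')
echo "listeners on exactly 2181: $matches"
```

On a live host the function would be fed from netstat directly, for example netstat -tunlp | port_listeners 2181.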

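Fifth, a wrong hostname in zoo.cfg.


Testing in the test environment showed that the hostname is correct and multi-domain name resolution works, so this problem does not exist.

Sixth, the machine's hostname has two mappings in the hosts file; keep only the single mapping between the hostname and its IP address.


Testing in the test environment showed that the hostname is correct and multi-domain name resolution works, so this problem does not exist. Ruled out.

Seventh, a problem with the nc command in zkServer.sh.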


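 The nc command may not be installed on the machine. Another suggestion is to find this line in zkServer.sh:
 STAT=`echo stat | nc localhost $(grep clientPort "$ZOOCFG" | sed -e 's/.*=//') 2> /dev/null| grep Mode`
 and add -q 1 (the digit 1, not the letter l) between nc and localhost.
 
 The ZooKeeper version here is 3.4.6, and its zkServer.sh does not contain this line at all (the statement that fetches the status does not use the nc command):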

 # -q is necessary on some versions of linux where nc returns too quickly, and no stat result is output
    clientPortAddress=`grep "^[[:space:]]*clientPortAddress[^[:alpha:]]" "$ZOOCFG" | sed -e 's/.*=//'`
    if ! [ $clientPortAddress ]
    then
        clientPortAddress="localhost"
    fi
    clientPort=`grep "^[[:space:]]*clientPort[^[:alpha:]]" "$ZOOCFG" | sed -e 's/.*=//'`
    STAT=`"$JAVA" "-Dzookeeper.log.dir=${ZOO_LOG_DIR}" "-Dzookeeper.root.logger=${ZOO_LOG4J_PROP}" \
             -cp "$CLASSPATH" $JVMFLAGS org.apache.zookeeper.client.FourLetterWordMain \
             $clientPortAddress $clientPort srvr 2> /dev/null    \
          | grep Mode`
    if [ "x$STAT" = "x" ]
    then
        echo "Error contacting service. It is probably not running."
        exit 1
    else
        echo $STAT
        exit 0
    fi
    ;;

Six, my own investigation

Current symptoms: the old cluster synchronizes data normally and can elect a leader (confirmed from the logs), but node status information cannot be viewed; after the cluster is expanded, data cannot be synchronized.

Solution:

1, try starting in foreground mode: pick a non-leader node and restart it in the foreground so that the log can be watched directly.


zkServer.sh start-foreground

The node starts normally, with no abnormal output.

2, read the shell script: analyze zkServer.sh.

  • The message "Error contacting service. It is probably not running." is printed by the following part of the script.

STAT=`"$JAVA" "-Dzookeeper.log.dir=${ZOO_LOG_DIR}" "-Dzookeeper.root.logger=${ZOO_LOG4J_PROP}" \
             -cp "$CLASSPATH" $JVMFLAGS org.apache.zookeeper.client.FourLetterWordMain \
             $clientPortAddress $clientPort srvr 2> /dev/null    \
          | grep Mode`
    if [ "x$STAT" = "x" ]
    then
        echo "Error contacting service. It is probably not running."
        exit 1
    else
        echo $STAT
        exit 0
    fi
    ;;

  • From this excerpt of the script we can tentatively conclude that when $STAT cannot be obtained, the STAT variable is empty and the script prints "Error contacting service. It is probably not running."
    OK, so what exactly is going on with this $STAT?

    if [ "x$STAT" = "x" ]
    then
        echo "Error contacting service. It is probably not running."
        exit 1
    else
        echo $STAT
        exit 0
    fi

3, run the shell script in debug mode to watch the execution:

  • An excerpt of the execution log follows: the STAT variable is indeed empty, which makes the script print "Error contacting service. It is probably not running." and exit.

++ grep '^[[:space:]]*clientPort[^[:alpha:]]' /app/zookeeper-3.4.6/bin/../conf/zoo.cfg
+ clientPort=5181
++ grep Mode
++ /opt/jdk1.8.0_131/bin/java -Dzookeeper.log.dir=. -Dzookeeper.root.logger=INFO,CONSOLE -cp '/app/zookeeper-3.4.6/bin/../build/classes:/app/zookeeper-3.4.6/bin/../build/lib/*.jar:/app/zookeeper-3.4.6/bin/../lib/slf4j-log4j12-1.6.1.jar:/app/zookeeper-3.4.6/bin/../lib/slf4j-api-1.6.1.jar:/app/zookeeper-3.4.6/bin/../lib/netty-3.7.0.Final.jar:/app/zookeeper-3.4.6/bin/../lib/log4j-1.2.16.jar:/app/zookeeper-3.4.6/bin/../lib/jline-0.9.94.jar:/app/zookeeper-3.4.6/bin/../zookeeper-3.4.6.jar:/app/zookeeper-3.4.6/bin/../src/java/lib/*.jar:/app/zookeeper-3.4.6/bin/../conf:.:/opt/jdk1.8.0_131/lib/dt.jar:/opt/jdk1.8.0_131/lib/tools.jar' org.apache.zookeeper.client.FourLetterWordMain localhost 5181 srvr
+ STAT=
+ '[' x = x ']'
+ echo 'Error contacting service. It is probably not running.'
Error contacting service. It is probably not running.
+ exit 1

4, modify the shell script: in zkServer.sh, add a statement that captures the full output of the STAT command, this time without the grep filter and with stderr written to test.log.


STAT1=`"$JAVA" "-Dzookeeper.log.dir=${ZOO_LOG_DIR}" "-Dzookeeper.root.logger=${ZOO_LOG4J_PROP}" \
             -cp "$CLASSPATH" $JVMFLAGS org.apache.zookeeper.client.FourLetterWordMain \
             $clientPortAddress $clientPort srvr 2> test.log`

echo "$STAT1"
  • The best way is to work on a copy of the script so that the original is not contaminated. That is what I did; then run the copied script.

[root@localhost bin]# ./zkServer.sh  status
ZooKeeper JMX enabled by default
Using config: /usr/zookeeper/zookeeper-3.4.10/bin/../conf/zoo.cfg
Error contacting service. It is probably not running.

  • Then look at the generated test.log file: it does indeed contain an exception.

Exception in thread "main" java.lang.NumberFormatException: For input string: "2181
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.parseInt(Integer.java:527)
at org.apache.zookeeper.client.FourLetterWordMain.main(FourLetterWordMain.java:76)

  • Judging from the log, the port number 2181 is the culprit: the value that was read is not a valid number.

zkServer.sh contains this statement:


clientPort=`grep "^[[:space:]]*clientPort[^[:alpha:]]" "$ZOOCFG" | sed -e 's/.*=//'`
During execution, grep "^[[:space:]]*clientPort[^[:alpha:]]" "$ZOOCFG" | sed -e 's/.*=//' actually runs as the following command:
grep '^[[:space:]]*clientPort[^[:alpha:]]' /app/zookeeper-3.4.6/bin/../conf/zoo.cfg | sed -e 's/.*=//'
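
One way to see what that command really returns is to dump the extracted value byte by byte. The sketch below reproduces the extraction on a scratch file; the CRLF line endings written into that file are an assumption made for the demonstration, since the original post does not say which hidden character was present.

```shell
#!/bin/sh
# Sketch: reproduce the clientPort extraction on a scratch config and dump
# the raw bytes. The CRLF line endings written here are an assumption.
ZOOCFG=$(mktemp)
printf 'tickTime=2000\r\nclientPort=2181\r\n' > "$ZOOCFG"

# Same pipeline that zkServer.sh uses to read the port.
clientPort=$(grep "^[[:space:]]*clientPort[^[:alpha:]]" "$ZOOCFG" | sed -e 's/.*=//')

# od -c makes invisible characters visible; a clean value shows only digits.
printf '%s' "$clientPort" | od -c

# The length is 5, not 4, because of the trailing carriage return.
echo "length: ${#clientPort}"
```

A value that looks like 2181 on screen but is longer than four characters would explain why Integer.parseInt rejects it.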

  • At this point the problem can basically be pinned on the configuration file.

  • Replacing the configuration file and restarting solved the problem.

  • The likely cause is that editing zoo.cfg changed its encoding or file format, so that the file content was parsed incorrectly.
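
If the stray characters turn out to be Windows-style carriage returns (an assumption; the post only says encoding or format), a cleaned copy of the file can be produced along these lines. The file names are scratch files, not the real zoo.cfg.

```shell
#!/bin/sh
# Sketch: detect carriage returns in a config file and write a cleaned copy.
# Assumes the hidden characters are CRs; both paths are scratch files.
cfg=$(mktemp)
clean=$(mktemp)
printf 'clientPort=2181\r\ndataDir=/tmp/zk\r\n' > "$cfg"

cr=$(printf '\r')
if grep -q "$cr" "$cfg"; then
    echo "carriage returns found, writing cleaned copy"
    tr -d '\r' < "$cfg" > "$clean"
else
    cp "$cfg" "$clean"
fi

# Re-read the port from the cleaned copy and check that it is purely numeric.
port=$(grep '^clientPort=' "$clean" | sed -e 's/.*=//')
case "$port" in
    ''|*[!0-9]*) echo "clientPort is still not a plain number" ;;
    *)           echo "clientPort parses cleanly: $port" ;;
esac
```

Writing to a separate file keeps the original untouched until the cleaned version has been checked.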

Author: Mao Masae

Origin blog.csdn.net/gao2175/article/details/90667502