Resolving the conflict when Solr5 and Solr6 coexist in the same cluster

Problem Background

The company has been running the Solr 5.3 search engine in production since September 2015. By the beginning of this year the open source community had already moved on to Solr 6.x, and it moves fast: Solr 6 integrates the SQL parsing engine from Facebook's Presto database ( http://prestodb-china.com/ ), so a Solr 6 server can accept simple SQL queries. For now it only supports very simple single-table queries and simple aggregations with built-in functions, but it is a solid step toward Solr becoming a NoSQL DB. Even more convenient, the SolrJ package ships a driver that implements the JDBC interface, so an ordinary development engineer can easily write search-engine-backed queries.
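For example, querying through that driver looks like ordinary JDBC code. Below is a minimal sketch assuming the Solr 6 SolrJ jar (which provides the driver org.apache.solr.client.solrj.io.sql.DriverImpl) is on the classpath; the ZooKeeper address and collection name are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SolrJdbcDemo {
  public static void main(String[] args) throws Exception {
    // Register SolrJ's JDBC driver explicitly, in case automatic
    // ServiceLoader registration is not picked up.
    Class.forName("org.apache.solr.client.solrj.io.sql.DriverImpl");

    // The connection string points at the ZooKeeper ensemble of the
    // SolrCloud cluster; "collection1" is a placeholder collection name.
    String url = "jdbc:solr://10.1.5.19:2181?collection=collection1";

    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         // Only simple single-table selects and aggregations are supported.
         ResultSet rs = stmt.executeQuery("SELECT id FROM collection1 LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString("id"));
      }
    }
  }
}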

     

Naturally we wanted to use such a nice feature, so we quickly deployed new Solr6 collections onto the existing Solr5 cluster. It did not take long to hit a serious problem, which shows up when calling an API like:

http://10.1.5.19:8080/solr/admin/collections?action=CREATE&name=collection1

When the admin/collections API is called to submit a create-collection request to the cluster, the operation frequently times out and reports a serialization-related exception. It makes no difference whether the collection being created is a solr5 or a solr6 one; only occasionally does the request succeed.

   

Several other collection APIs fail the same way without exception, for example:

curl 'http://localhost:8080/solr/admin/collections?action=DELETE&name=search4totalpay'

  

Cause Analysis

When looking for the cause, it is natural to wonder: is it because solr6 nodes were added to a cluster that originally contained only solr5 nodes? After reading through a pile of code, that turned out to be exactly the reason.

Let's first look at the cluster call flow chart for executing the "/admin/collections" API, as follows:

 

Description of the collection-creation flow in the cluster:

 

  1. An API request such as /admin/collections?action=CREATE is sent to any node in the Solr cluster (handled by org.apache.solr.handler.admin.CollectionsHandler).
  2. The receiving node writes a task node to ZooKeeper whose content is the task to be executed, serialized as JSON (see the sketch after this list).
  3. The node with the yellow background in the figure above (the Overseer node, a role all nodes compete for) claims the task from the ZooKeeper node.
  4. The Overseer node executes the task. For action=CREATE this means dispatching the commands that create the collection's replica cores to the other nodes, via the "/admin/cores" path.
  5. Once all replicas have been created successfully, the Overseer node writes a success marker to ZooKeeper, and the end user can see that the task has completed (the task can of course also be executed asynchronously).
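To make step 2 concrete, here is a minimal sketch of a task being enqueued as a sequential child node. The queue path /overseer/collection-queue-work matches what SolrCloud uses, but the plain ZooKeeper client code is only illustrative; Solr itself goes through its own SolrZkClient/DistributedQueue wrappers:

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EnqueueCollectionTask {
  public static void main(String[] args) throws Exception {
    // Connect to the same ZooKeeper ensemble the Solr cluster uses (placeholder address).
    ZooKeeper zk = new ZooKeeper("10.1.5.19:2181", 15000, event -> {});

    // The task content is JSON mirroring the parameters of the HTTP request.
    byte[] task = "{\"operation\":\"create\",\"name\":\"collection1\"}"
        .getBytes(StandardCharsets.UTF_8);

    // Tasks are sequential child nodes that the Overseer consumes in order.
    String path = zk.create("/overseer/collection-queue-work/qn-", task,
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);

    System.out.println("task enqueued at " + path);
    zk.close();
  }
}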

At this point the crux of the problem is clear. It lies in steps 2 and 3: requests such as /admin/collections sent from solr6 nodes to solr5 nodes (from a higher-version node to a lower-version one) go wrong, i.e. the Solr API has a forward-compatibility problem. Testing showed that requests sent from solr5 to solr6 work fine, so Solr's backward compatibility is not an issue.

 

Solution:

Now that the problem is understood, the fix is simple: when a Solr6 node starts, remove it from the competition for the Overseer role, so that a Solr6 node never has a chance to become the Overseer. To achieve this we first need to understand the code structure around the Overseer election.

   Take a look at the following class diagram:

 

When ZkController is initialized it creates a LeaderElector object, and the LeaderElector pre-registers the local node's information under the /overseer_elect/election node to take part in the leader election. Solr runs a leader election in two scenarios: one is the cluster Overseer mentioned above, which is responsible for executing the tasks submitted to the cloud; the other is electing the leader replica of a shard.

The execution logic is encapsulated in OverseerElectionContext and ShardLeaderElectionContextBase respectively.
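Under the hood the election follows the standard ZooKeeper recipe: every candidate creates an EPHEMERAL_SEQUENTIAL child under /overseer_elect/election, and the candidate with the lowest sequence number becomes the leader. A bare-bones sketch with a plain ZooKeeper client (Solr's LeaderElector additionally watches its predecessor node and handles session expiry):

import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class OverseerElectionSketch {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("10.1.5.19:2181", 15000, event -> {});

    // Register as a candidate; the ephemeral node vanishes automatically
    // if this Solr node's ZooKeeper session dies.
    String me = zk.create("/overseer_elect/election/n_",
        "10.1.5.20:8080_solr".getBytes(), // placeholder node identity
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

    // The candidate holding the lowest sequence number is the Overseer.
    List<String> candidates = zk.getChildren("/overseer_elect/election", false);
    Collections.sort(candidates);
    System.out.println("overseer? " + me.endsWith(candidates.get(0)));
    zk.close();
  }
}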

 

The point that needs to be modified is in the init() method of ZkController:

private void init(CurrentCoreDescriptorProvider registerOnReconnect) {
    try {
      createClusterZkNodes(zkClient);
      zkStateReader.createClusterStateWatchersAndUpdate();
      // start the overseer first as following code may need it's processing
      if (!zkRunOnly) {
        //▼▼▼▼▼▼ 20161022 baisui: when solr5 and solr6 nodes share one cluster, a solr6
        // node may win the Overseer election but then fails to execute the queued tasks,
        // so the election methods are stubbed out here and the Overseer role can only
        // ever be taken by solr5 nodes
        overseerElector = new LeaderElector(zkClient) {

          @Override
          public int joinElection(ElectionContext context, boolean replacement)
              throws KeeperException, InterruptedException, IOException {
            return 0; // never join the Overseer election
          }

          @Override
          public int joinElection(ElectionContext context, boolean replacement, boolean joinAtHead)
              throws KeeperException, InterruptedException, IOException {
            return 0; // never join the Overseer election
          }

          @Override
          void retryElection(ElectionContext context, boolean joinAtHead)
              throws KeeperException, InterruptedException, IOException {
            // no-op: never retry the election either
          }
        };
        //▲▲▲▲▲▲
        this.overseer = new Overseer(cc.getShardHandlerFactory().getShardHandler(), cc.getUpdateShardHandler(),
            CommonParams.CORES_HANDLER_PATH, zkStateReader, this, cloudConfig);
        ElectionContext context = new OverseerElectionContext(zkClient,
            overseer, getNodeName());
        overseerElector.setup(context);
        overseerElector.joinElection(context, false);
      }

      Stat stat = zkClient.exists(ZkStateReader.LIVE_NODES_ZKNODE, null, true);
      if (stat != null && stat.getNumChildren() > 0) {
        publishAndWaitForDownStates();
      }

      // Do this last to signal we're up.
      createEphemeralLiveNode();
    } catch (IOException e) {
      log.error("", e);
      throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
          "Can't create ZooKeeperController", e);
    } catch (InterruptedException e) {
      // Restore the interrupted status
      Thread.currentThread().interrupt();
      log.error("", e);
      throw new ZooKeeperException(SolrException.ErrorCode.SERVER_ERROR,
          "", e);
    } catch (KeeperException e) {
      log.error("", e);
      throw new ZooKeeperException(SolrException.ErrorCode.SERVER_ERROR,
          "", e);
    }

  }

 

As shown above, you only need to override LeaderElector's three election methods, the two joinElection() overloads and retryElection(), with empty bodies; this removes the Solr6 node from the election for the cluster Overseer.
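After restarting the patched Solr6 nodes, the fix can be verified with the Collections API's OVERSEERSTATUS action, e.g.:

http://10.1.5.19:8080/solr/admin/collections?action=OVERSEERSTATUS

The "leader" field in the response should now always name a solr5 node.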

 

Summary

At this point, Solr5 and Solr6 can coexist in one ZooKeeper domain and run normally. Of course, the simplest approach would have been to stand up a separate ZooKeeper cluster and put the solr6 indexes into their own cluster, but that would quietly add cluster-maintenance cost and would not be worth it.

   

   

    

 
