Eureka Client Instance status DOWN - How to troubleshoot a Spring Cloud Eureka instance whose status is DOWN

The Problem

After a Spring Boot application starts and registers itself as a Eureka client with the Eureka server, the instance shows up in the instance list at eureka.server.ip:port, but with a red DOWN in front of it.
Calling the service's REST endpoints directly by IP works without any problem, but calls made through RestTemplate or FeignClient using the applicationName fail.

Cause

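When a service registered with the Eureka server has health checking enabled, the Spring Boot application periodically checks whether the external resources it is configured with are available, for example whether each data source can be reached. If any one of them cannot be reached, the client pushes a status update to the Eureka server that takes the instance offline. This works much like a heartbeat that verifies each instance is actually usable. You can also implement your own HealthCheckHandler to customize the health check, or even expose a REST endpoint that changes the value returned by getStatus(), so that an instance can be taken offline or brought back online on demand. For details, see the article "springcloud-eureka集群-健康检测" (Spring Cloud Eureka cluster: health checks).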
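The rule described above (one unavailable resource is enough to mark the whole instance DOWN, plus an optional manual switch) can be sketched in plain Java. This is a simplified model for illustration, not the code Spring Boot or Eureka actually runs; the class name HealthAggregationSketch and its methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Simplified model only: this is NOT Spring Boot's or Eureka's implementation.
class HealthAggregationSketch {
    // Hypothetical named checks; each returns true when its resource is reachable.
    private final Map<String, BooleanSupplier> checks = new LinkedHashMap<>();
    // Manual switch, standing in for a REST endpoint that forces the instance offline.
    private volatile boolean forcedDown = false;

    void register(String name, BooleanSupplier check) {
        checks.put(name, check);
    }

    void setForcedDown(boolean forcedDown) {
        this.forcedDown = forcedDown;
    }

    // "UP" only when nothing forces the instance down and every check passes.
    String overallStatus() {
        if (forcedDown) {
            return "DOWN";
        }
        for (BooleanSupplier check : checks.values()) {
            boolean ok;
            try {
                ok = check.getAsBoolean();
            } catch (RuntimeException e) {
                ok = false; // a check that throws counts as an unavailable resource
            }
            if (!ok) {
                return "DOWN";
            }
        }
        return "UP";
    }

    public static void main(String[] args) {
        HealthAggregationSketch health = new HealthAggregationSketch();
        health.register("cache", () -> true);
        health.register("database", () -> false); // simulate an unreachable data source
        System.out.println(health.overallStatus());
    }
}
```

In a real application the corresponding hook is the HealthCheckHandler mentioned above; the sketch only shows why a single unreachable data source, even one pulled in by a dependency, is enough to take the instance offline.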

Enabling health checking requires the following configuration.
Add this dependency to the pom:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

In application.properties, turn on the health check:

eureka.client.healthcheck.enabled = true

Once the application has started, you can request:
${spring.cloud.client.ipAddress}:${server.port}[/server.contextPath]/health
to check its health. A healthy instance returns:

{"details":{},"status":{"code":"UP","description":"Spring Cloud Eureka Discovery Client"}}

An unhealthy instance returns:

{"details":{},"status":{"code":"DOWN","description":""}}

The status shown by the Eureka server is updated based on this data.
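To check the endpoint from the command line rather than a browser, a small probe along the following lines may help. It is a sketch under assumptions: the class name HealthProbe, the 5-second timeout, and the regex-based extraction are illustrative choices (a real JSON parser would be more robust), and the URL has to be supplied by you. The explicit timeouts matter because a hanging /health request would otherwise block indefinitely.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HealthProbe {
    // Matches "code":"UP" (the shape shown above) or a flat "status":"UP".
    private static final Pattern STATUS =
            Pattern.compile("\"(?:code|status)\"\\s*:\\s*\"([A-Z_]+)\"");

    // Returns the first status code found in the body, or "UNKNOWN" if there is none.
    static String extractStatus(String body) {
        if (body == null) {
            return "UNKNOWN";
        }
        Matcher m = STATUS.matcher(body);
        return m.find() ? m.group(1) : "UNKNOWN";
    }

    // Fetches the endpoint with explicit timeouts so a hanging request fails fast.
    static String fetchBody(String url, int timeoutMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(timeoutMillis);
        conn.setReadTimeout(timeoutMillis);
        try {
            // An unhealthy instance may answer with an HTTP error status; read the error stream then.
            InputStream in = conn.getResponseCode() < 400
                    ? conn.getInputStream()
                    : conn.getErrorStream();
            if (in == null) {
                return "";
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // With a URL argument, probe it; without one, parse the sample body from this post.
        String body = args.length > 0
                ? fetchBody(args[0], 5000)
                : "{\"details\":{},\"status\":{\"code\":\"DOWN\",\"description\":\"\"}}";
        System.out.println(extractStatus(body));
    }
}
```

If fetchBody throws a timeout exception instead of returning a body, the endpoint is hanging rather than reporting DOWN, which is the second case discussed later in this post.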

Solving the Problem

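Once the source of the problem is found, fixing it is quick.
First, request
${spring.cloud.client.ipAddress}:${server.port}[/server.contextPath]/health (if the status is DOWN, the health check retries several times to make sure the resource is really down, so the response may take a while). If it really reports DOWN, some resource is unavailable. The log usually shows the error; for example, if MySQL is down, the log keeps printing: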

...
[2018-06-14 20:26:16] [http-nio-8089-exec-2] [ERROR] [com.alibaba.druid.filter.logging.Log4jFilter:152] {conn-10002, stmt-20007} execute error. SELECT 1
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
...

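If
${spring.cloud.client.ipAddress}:${server.port}[/server.contextPath]/health cannot be reached at all and never returns no matter how long you wait, and the application either has no external resources configured or all of them are confirmed reachable, then something is wrong with the application itself. That is the case I ran into this time. Here, pay special attention to the WARN-level messages in the log, because they matter a lot. In this case there were 8 WARN lines: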

[2018-06-14 19:51:54] [main] [WARN] [org.mybatis.spring.mapper.ClassPathMapperScanner:166] No MyBatis mapper was found in '[com.xx.xx.core, com.xx.xx]' package. Please check your configuration.
[2018-06-14 19:51:55] [main] [WARN] [org.mybatis.spring.mapper.ClassPathMapperScanner:166] No MyBatis mapper was found in '[com.xx.xx.**.mapper]' package. Please check your configuration.
[2018-06-14 19:51:55] [main] [WARN] [org.springframework.context.annotation.ConfigurationClassPostProcessor:373] Cannot enhance @Configuration bean definition 'myBatisMapperScannerConfig' since its singleton instance has been created too early. The typical cause is a non-static @Bean method with a BeanDefinitionRegistryPostProcessor return type: Consider declaring such methods as 'static'.
[2018-06-14 19:52:05] [main] [WARN] [com.netflix.config.sources.URLConfigurationSource:121] No URLs will be polled as dynamic configuration sources.
[2018-06-14 19:52:05] [main] [WARN] [com.netflix.config.sources.URLConfigurationSource:121] No URLs will be polled as dynamic configuration sources.
[2018-06-14 19:52:05] [main] [WARN] [org.springframework.boot.autoconfigure.thymeleaf.AbstractTemplateResolverConfiguration:60] Cannot find template location: classpath:/templates/ (please add some templates or check your Thymeleaf configuration)
[2018-06-14 19:52:09] [DiscoveryClient-InstanceInfoReplicator-0] [WARN] [com.netflix.discovery.DiscoveryClient$3:1277] Saw local status change event StatusChangeEvent [timestamp=1528977129378, current=DOWN, previous=UP]
[2018-06-14 19:52:09] [DiscoveryClient-InstanceInfoReplicator-0] [WARN] [com.netflix.discovery.InstanceInfoReplicator:93] Ignoring onDemand update due to rate limiter

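The first two WARNs are basically enough to locate the problem. The application has no external resources of its own, and the two package paths named in those warnings are not configured anywhere in it, so the configuration causing DOWN must come from the parent project or from a dependency. Sure enough, the application depended on a package that contained a database configuration, and that database could not be reached. After adding an exclusion for that package, everything worked.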

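The /health endpoint may also have been unreachable because of the retries: the resource is configured in the parent project or a dependency, so the request may never complete or return and simply hangs (I am not certain this is the reason).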

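The last two WARNs report the status change. To see every action in the log, you can also enable DEBUG in logback, which clearly shows the details of the HTTP PUT requests that send the status to the Eureka server.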
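Since the WARN lines were what located the problem, it can help to pull them out of a long startup log automatically. The sketch below assumes the log layout seen in the excerpts above, "[timestamp] [thread] [LEVEL] [logger:line] message"; the class name WarnLineFilter is hypothetical, and the pattern would need adjusting for a different logback layout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WarnLineFilter {
    // Layout assumed from the excerpts: [timestamp] [thread] [LEVEL] [logger:line] message
    private static final Pattern LINE = Pattern.compile(
            "^\\[([^\\]]+)\\] \\[([^\\]]+)\\] \\[(\\w+)\\] \\[([^\\]]+)\\] (.*)$");

    // Returns "logger:line -> message" for every WARN entry; other lines are skipped.
    static List<String> warnMessages(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (m.matches() && "WARN".equals(m.group(3))) {
                out.add(m.group(4) + " -> " + m.group(5));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "[2018-06-14 19:52:05] [main] [WARN] "
                        + "[com.netflix.config.sources.URLConfigurationSource:121] "
                        + "No URLs will be polled as dynamic configuration sources.",
                "[2018-06-14 20:26:16] [http-nio-8089-exec-2] [ERROR] "
                        + "[com.alibaba.druid.filter.logging.Log4jFilter:152] "
                        + "{conn-10002, stmt-20007} execute error. SELECT 1");
        for (String warn : warnMessages(sample)) {
            System.out.println(warn);
        }
    }
}
```

Only the first sample line is printed, because the second one is logged at ERROR level.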

Retrospective

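I took some detours while solving this: searching the issues on the Spring Cloud GitHub repository, suspecting a version conflict between Spring Boot and Spring Cloud, suspecting a version conflict between the parent and child projects, and so on. In the end it was simply an external resource failing its health check. What mainly threw me off was that requests to the /health endpoint failed.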