Background

Deployment history

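During a routine release in November, Tomcat crashed at startup. This happened four times in total.

On the 1st, 2nd, and 4th attempts, Tomcat crashed before startup had completed. The 3rd attempt did start, but it logged errors continuously and crashed after running for 20 minutes.

Every rollback went through without any problems.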

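What was released

I won't go into the rest of the release. The main change was making the ActiveMQ consumers multi-threaded. Instead of using the concurrency property provided by DefaultMessageListenerContainer, the onMessage method hands each message off to a thread pool for processing. The threading is wired in via AOP, which I'll describe in detail another time.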
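To make the hand-off concrete, here is a minimal sketch of the idea, not our production code. SimpleListener is a hypothetical stand-in for javax.jms.MessageListener so the example compiles without a JMS jar, and PooledListener, process, and the pool size are all assumptions made for illustration. The AOP wiring is left out.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PooledListenerSketch {

    /** Hypothetical stand-in for javax.jms.MessageListener (assumption, not the real JMS type). */
    interface SimpleListener {
        void onMessage(String message);
    }

    /** Listener that returns immediately and lets a pool thread do the real work. */
    static class PooledListener implements SimpleListener {
        private final ExecutorService pool;
        final AtomicInteger handled = new AtomicInteger();
        final AtomicInteger failed = new AtomicInteger();

        PooledListener(int threads) {
            this.pool = Executors.newFixedThreadPool(threads);
        }

        @Override
        public void onMessage(final String message) {
            pool.execute(new Runnable() {
                @Override
                public void run() {
                    try {
                        process(message);
                        handled.incrementAndGet();
                    } catch (RuntimeException e) {
                        // With execute(), an uncaught exception would terminate this
                        // worker thread (the pool replaces it), so count it here instead.
                        failed.incrementAndGet();
                    }
                }
            });
        }

        /** Placeholder business logic; a null message stands in for the NPE bug. */
        void process(String message) {
            if (message == null) {
                throw new NullPointerException("message is null");
            }
        }

        /** Stops accepting work and waits for queued tasks; returns false on timeout or interrupt. */
        boolean shutdownAndWait(long seconds) {
            pool.shutdown();
            try {
                return pool.awaitTermination(seconds, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        PooledListener listener = new PooledListener(4);
        listener.onMessage("order-1");
        listener.onMessage(null);
        listener.onMessage("order-2");
        listener.shutdownAndWait(5);
        System.out.println("handled=" + listener.handled.get()
                + " failed=" + listener.failed.get());
    }
}
```

With this shape, onMessage returns before the message is actually processed, so any failure inside the pool has to be handled in the task itself. The container never sees it.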

Error logs

First startup:

Nov 02, 2017 10:08:44 PM redis.clients.jedis.JedisSentinelPool initPool
INFO: Created JedisPool to master at xx.xx.xx.xx:xxxx
**
GLib:ERROR:gmain.c:1963:g_main_dispatch: assertion failed: (current->dispatching_sources == &current_source_link)

Second startup:

 Nov 02, 2017 10:17:13 PM redis.clients.jedis.JedisSentinelPool initPool
 INFO: Created JedisPool to master at xx.xx.xx.xx:xxxx

 (process:7785): GLib-CRITICAL (recursed) **: g_main_context_iterate: assertion `g_thread_supported ()' failed
 aborting...


 (process:7785): GLib-WARNING **: g_main_loop_run() was called from second thread but g_thread_init() was never called.

The third attempt did start, but kept logging errors:

Nov 02, 2017 10:43:07 PM redis.clients.jedis.JedisSentinelPool initPool
 INFO: Created JedisPool to master at xx.xx.xx.xx:xxxx
 十一月 02, 2017 10:43:14 下午 org.apache.catalina.startup.HostConfig deployDescriptor
 信息: Deploying configuration descriptor manager.xml
 十一月 02, 2017 10:43:14 下午 org.apache.catalina.startup.HostConfig deployDescriptor
 信息: Deploying configuration descriptor host-manager.xml
 十一月 02, 2017 10:43:14 下午 org.apache.catalina.startup.HostConfig deployDirectory
 信息: Deploying web application directory examples
 十一月 02, 2017 10:43:14 下午 org.apache.catalina.startup.HostConfig deployDirectory
 信息: Deploying web application directory docs
 十一月 02, 2017 10:43:14 下午 org.apache.coyote.http11.Http11Protocol start
 信息: Starting Coyote HTTP/1.1 on http-8000
 十一月 02, 2017 10:43:14 下午 org.apache.jk.common.ChannelSocket init
 信息: JK: ajp13 listening on /0.0.0.0:8001
 十一月 02, 2017 10:43:14 下午 org.apache.jk.server.JkMain start
 信息: Jk running ID=0 time=0/82  config=null
 十一月 02, 2017 10:43:14 下午 org.apache.catalina.startup.Catalina start
 信息: Server startup in 134766 ms
 十一月 02, 2017 10:46:29 下午 com.navercorp.pinpoint.bootstrap.config.DefaultProfilerConfig readBoolean
 信息: profiler.jdk.http.param=true
 十一月 02, 2017 10:46:29 下午 com.navercorp.pinpoint.bootstrap.config.DefaultProfilerConfig readBoolean
 信息: profiler.jdk.http.param=true
 十一月 02, 2017 10:46:29 下午 com.navercorp.pinpoint.bootstrap.config.DefaultProfilerConfig readBoolean
 信息: profiler.jdk.http.param=true

 (process:15054): GLib-CRITICAL **: g_main_context_iterate: assertion `g_thread_supported ()' failed

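The fourth startup produced the same log as the first, so I won't paste it.

Some observations

Before Tomcat finishes starting (Server startup in xxx ms), the Jedis connections are already established, the MQ connections have been created and are already consuming messages, and Pinpoint has begun its instrumentation.
For business reasons, some of the reworked MQ consumers use Redis as a distributed lock.
Because of a bug, the worker threads threw a NullPointerException.

Versions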

版本

2.6.32_1-15-0-0 #1 SMP Fri Sep 19 15:37:59 CST 2014 x86_64 x86_64 x86_64 GNU/Linux

java version "1.7.0_79"
Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)

pinpoint: 1.5

tomcat: 6.0.43

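The operation

Let's treat this incident as a surgery and pick up the scalpel~

Trying to reproduce the problem

After the NullPointerException was fixed, production started up without any errors. We therefore suspected that the NullPointerException was a large part of the cause, and planned to mock it in the test environment.

The test environment simulated multi-threaded MQ consumption + the distributed lock + a thrown NullPointerException, with Pinpoint set up for instrumentation and every version matching production. The crash did not reproduce.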

Late one quiet night, two guys "secretly" pushed the faulty version to production to try to reproduce it, with Tomcat configured to write a crash log: -XX:ErrorFile=/home/work/logs/java_error%p.log
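When no crash log shows up, one thing worth ruling out is that the flag never reached the JVM. The sketch below is an illustration, not something we ran at the time. It reads the JVM input arguments through the standard RuntimeMXBean and looks for the -XX:ErrorFile option. The class name ErrorFileCheck and the helper findErrorFile are made up for this example.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class ErrorFileCheck {

    static final String PREFIX = "-XX:ErrorFile=";

    /** Returns the path given to the first -XX:ErrorFile option, or null if it is absent. */
    static String findErrorFile(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            if (arg.startsWith(PREFIX)) {
                return arg.substring(PREFIX.length());
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Arguments the running JVM was started with (not the program arguments)
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        String path = findErrorFile(jvmArgs);
        if (path == null) {
            System.out.println("ErrorFile not set; HotSpot falls back to hs_err_pid<pid>.log");
        } else {
            System.out.println("ErrorFile=" + path);
        }
    }
}
```

Even with the flag in place, the file is only written when HotSpot's own fatal error handler runs. Whether an abort triggered from inside GLib goes through that handler is something I have not verified, so a missing file does not prove the flag was ignored.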

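The first startup, surprisingly, succeeded...

kill -9, restart: it succeeded again, with no error logs...

kill -9, then an immediate restart (within 0.5 seconds): startup failed, with the same error log as the second startup in the Background section. However, there was no crash log at all. We checked the system logs and found nothing relevant there either.

At this point we had hit a dead end...

We searched Baidu, Google, and Bing every way we could think of.

The related records we found: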

https://issues.jenkins-ci.org/browse/JENKINS-18980

https://issues.jenkins-ci.org/browse/JENKINS-14425

https://groups.google.com/forum/#!topic/jenkinsci-users/R6ikq12KNDM

http://hllvm.group.iteye.com/group/topic/39864

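None of them offered a solution, so we were stuck again. Worse still, after running for 3 days, production went down (only one machine...), with the same error log as the first startup in the Background section.

Approaching from the machine side

After agonizing for a while, we decided to look at the machines themselves. We swapped the two machines hosting MQ with other machines, to check whether the problem would show up on the new ones.

After a month of stable running, just as we were tentatively concluding that the machines had caused the crash, a bolt from the blue struck at 5 p.m. one afternoon...

The new machine crashed = =|||. All of our reasoning was overturned in an instant~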

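Upgrading the Tomcat infrastructure

We originally planned to start verification from the OS kernel. Unfortunately, because of how the company's data center is managed, upgrading the kernel is not easy, so we decided to start by upgrading the infrastructure instead.

We upgraded every Tomcat on the web servers from 6.0.43 to 7.0.85, then watched how production behaved.

Date of the change: February 27, 2018.

It has been running stably so far. If there are further results, I'll post updates...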
