记录一次排查极光推送SDK死锁问题

线上服务稳定了好几个月,这两天抽风,总是有mysql数据行被锁,update会报错(Lock wait timeout exceeded; try restarting transaction)

导致系统无法使用,害得我得一直在后台给用户解锁(通过mysql的kill命令)/(ㄒoㄒ)/~~


极光sdk版本(接入sdk时从官方文档复制的,一直没更新,可谁没事会去更新这个呢):

<!-- JPush -->
<dependency>
	<groupId>cn.jpush.api</groupId>
	<artifactId>jpush-client</artifactId>
	<version>3.2.15</version>
</dependency>
<dependency>
	<groupId>cn.jpush.api</groupId>
	<artifactId>jiguang-common</artifactId>
	<version>1.0.1</version>
</dependency>
<dependency>
	<groupId>io.netty</groupId>
	<artifactId>netty-all</artifactId>
	<version>4.1.6.Final</version>
</dependency>


通过mysql后台查询,会发现有几个事务一直不提交,一直锁住了数据。

由于之前一直后台功能一直正常,并没有往代码方面想,找了一天数据库原因未果(尼玛的)!

经过一天观察,发现只有遇到极光推送的地方才会加锁,而且是不定时出现,故重点观察极光推送代码(为log4j添加了打印线程id的配置)!

第一步:通过观察log,发现极光推送之后该线程在log中再也不出现了(只打印了关键代码3中LastHttpContent判定逻辑中的语句,死锁的原因)

通过jps和jstack定位到了极光sdk的死锁问题(真坑):


第二步:分析原因

关键代码1:cn.jiguang.common.connection.NettyHttpClient.sendHttpRequest:158

private ResponseWrapper sendHttpRequest(HttpMethod method, String url, String body) throws Exception {
        CountDownLatch latch = new CountDownLatch(1);// 注意此处实例化了一个CountDownLatch 并传递给了NettyClientInitializer
        NettyClientInitializer initializer = new NettyClientInitializer(_sslCtx, null, latch);
        b.handler(initializer);
        ResponseWrapper wrapper = new ResponseWrapper();
        URI uri = new URI(url);
        String scheme = uri.getScheme() == null ? "http" : uri.getScheme();
        String host = uri.getHost() == null ? "127.0.0.1" : uri.getHost();
        int port = uri.getPort();
        if (port == -1) {
            if ("http".equalsIgnoreCase(scheme)) {
                port = 80;
            } else if ("https".equalsIgnoreCase(scheme)) {
                port = 443;
            }
        }

        try {
            ChannelFuture connect = b.connect(host, port);
            _channel = connect.sync().channel();
            FullHttpRequest request;
            if (null != body) {
                ByteBuf byteBuf = Unpooled.copiedBuffer(body.getBytes(CharsetUtil.UTF_8));
                request = new DefaultFullHttpRequest(HttpVersion.HTTP_1_1, method, uri.getRawPath(), byteBuf);
                request.headers().set(HttpHeaderNames.CONTENT_LENGTH, (long) byteBuf.readableBytes());
            } else {
                request = new DefaultFullHttpRequest(HTTP_1_1, method, uri.getRawPath());
            }
            request.headers().set(HttpHeaderNames.HOST, uri.getHost());
            request.headers().set(HttpHeaderNames.AUTHORIZATION, _authCode);
            request.headers().set("Content-Type","application/json;charset=utf-8");
            connect.awaitUninterruptibly();
            LOG.info("Sending request. " + request);
            LOG.info("Send body: " + body);
            _channel.writeAndFlush(request);
            latch.await();// 等待继续执行,这个地方使用CountDownLatch的作用就是等待netty异步网络调用返回结果后继续执行,所以此处等待,如果netty执行完后不countDown线程将永远等待在这个地方
            wrapper = initializer.getResponse();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return wrapper;
    }
关键代码2:cn.jiguang.common.connection.NettyClientInitializer.initChannel:25

public NettyClientInitializer(SslContext sslContext, NettyHttpClient.BaseCallback callback, CountDownLatch latch) {
        this._sslCtx = sslContext;
        this._callback = callback;
        this._latch = latch;
    }

    @Override
    protected void initChannel(SocketChannel socketChannel) throws Exception {
        this._handler = new HttpResponseHandler(_callback, _latch);// 此处将上一段代码实例化的CountDownLatch继续传递给了cn.jiguang.common.connection.HttpResponseHandler
        socketChannel.pipeline().addLast(_sslCtx.newHandler(socketChannel.alloc()), new HttpClientCodec(), _handler);
    }
关键代码3:cn.jiguang.common.connection.HttpResponseHandler

public class HttpResponseHandler extends SimpleChannelInboundHandler<HttpObject> {

    private static final Logger LOG = LoggerFactory.getLogger(HttpResponseHandler.class);
    private int status;
    private NettyHttpClient.BaseCallback _callback;
    private CountDownLatch _latch;
    private ResponseWrapper _wrapper = new ResponseWrapper();

    public HttpResponseHandler(NettyHttpClient.BaseCallback callback, CountDownLatch latch) {
        this._callback = callback;
        this._latch = latch;
    }


    @Override
    protected void channelRead0(ChannelHandlerContext ctx, HttpObject msg) throws Exception {
        if (msg instanceof HttpResponse) {
            HttpResponse response = (HttpResponse) msg;
            status = response.status().code();
        }
        if (msg instanceof HttpContent) {
            HttpContent content = (HttpContent) msg;
            LOG.info(content.content().toString());
            if (content instanceof LastHttpContent) {// 通过观察日志,发现此处仅执行内容LastHttpContent的代码,并未返回HttpContent,推测是极光后台更新了,对之前的sdk兼容有问题
                LOG.info("closing connection");
                ctx.close();
            } else {
                String responseContent = content.content().toString(CharsetUtil.UTF_8);
                _wrapper.responseCode = status;
                _wrapper.responseContent = responseContent;
                LOG.info("Got Response code: " + status + " content: " + responseContent);
                System.err.println("Got Response code: " + status + " content: " + responseContent);
                System.err.flush();
                if (null != _callback) {
                    _callback.onSucceed(_wrapper);
                }
                if (null != _latch) {
                    _latch.countDown();
                }
            }
        }
    }

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
        LOG.error("error:", cause);
        try {
            ctx.close();
        }catch (Exception ex) {
            LOG.error("close error:", ex);
        }
    }

    public void resetLatch(CountDownLatch latch) {
        this._latch = latch;
    }

    public ResponseWrapper getResponse() {
        return _wrapper;
    }
}
此段代码在异常时( exceptionCaught)也未进行countDown调用,也会死锁!!!着实让人堪忧啊!

最后通过升级了极光的sdk解决了死锁问题!!!

总结:CountLatchDown一定要注意countDown,否则就会死锁,挂住线程不提交事务,造成数据库行锁。

并且查询数据库事务表看到的执行sql是null,并没有正在执行的sql,应该就可以判定是事务未提交,分析为什么未提交就好了!



猜你喜欢

转载自blog.csdn.net/qq838419230/article/details/78643165