Background
While an offline HDFS cluster was running hdfs dfsadmin -refreshNamenodes hostname:50020 to register DataNodes with a new NameSpace, more than 10 DataNode nodes hit a deadlock, with an occurrence probability of roughly 1‰. This article records the troubleshooting process and the fix for that problem.
Basics Review
In Java, locking can be achieved either implicitly with the synchronized keyword, which acquires and releases a monitor lock, or explicitly with the Lock interface (and its implementations) by calling lock()/unlock(). A deadlock is a situation in which two or more threads are blocked forever, each waiting on the other; its necessary conditions are not repeated here. The following simple example demonstrates how a deadlock occurs.
public class TestLock {
    private static Object locka = new Object();
    private static Object lockb = new Object();

    public static void main(String[] args) {
        // thread1: acquires locka first, then tries to acquire lockb
        new Thread(new Runnable() {
            @Override
            public void run() {
                synchronized (locka) {
                    sleep(500);
                    synchronized (lockb) {
                    }
                }
            }
        }, "thread1").start();

        // thread2: acquires lockb first, then tries to acquire locka
        new Thread(new Runnable() {
            @Override
            public void run() {
                synchronized (lockb) {
                    sleep(500);
                    synchronized (locka) {
                    }
                }
            }
        }, "thread2").start();
    }

    public static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
In the interval (0, t1), thread1 and thread2 acquire locka and lockb respectively; in (t1, ∞), each waits for the lock already held by the other, forming a circular wait between the two threads.
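The same circular wait can be written with the explicit Lock API mentioned above. The following is a minimal sketch, not part of the original example; the class name TestExplicitLock and the helper acquireBoth are made up for illustration, with ReentrantLock standing in for the intrinsic monitors:

import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class TestExplicitLock {
    private static final Lock locka = new ReentrantLock();
    private static final Lock lockb = new ReentrantLock();

    public static void main(String[] args) {
        // Each thread takes its first lock, pauses, then blocks forever on the other one.
        new Thread(() -> acquireBoth(locka, lockb), "thread1").start();
        new Thread(() -> acquireBoth(lockb, locka), "thread2").start();
    }

    private static void acquireBoth(Lock first, Lock second) {
        first.lock();
        try {
            Thread.sleep(500);    // give the other thread time to grab its first lock
            second.lock();        // both threads park here: circular wait
            second.unlock();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            first.unlock();
        }
    }
}

jstack -l also reports deadlocks that involve such ownable synchronizers, as the DataNode trace later in this article shows.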
jstack -l <pid> dumps the thread stacks of a Java application and prints additional information about locks. For this example it reports:
Found one Java-level deadlock:
=============================
"thread2":
waiting to lock monitor 0x00007fe9e316a608 (object 0x0000000795817088, a java.lang.Object),
which is held by "thread1"
"thread1":
waiting to lock monitor 0x00007fe9e38374b8 (object 0x0000000795817098, a java.lang.Object),
which is held by "thread2"
Java stack information for the threads listed above:
===================================================
"thread2":
at thread.lock.lock1.TestLock$2.run(TestLock.java:26)
- waiting to lock <0x0000000795817088> (a java.lang.Object)
- locked <0x0000000795817098> (a java.lang.Object)
at java.lang.Thread.run(Thread.java:748)
"thread1":
at thread.lock.lock1.TestLock$1.run(TestLock.java:15)
- waiting to lock <0x0000000795817098> (a java.lang.Object)
- locked <0x0000000795817088> (a java.lang.Object)
at java.lang.Thread.run(Thread.java:748)
Found 1 deadlock.
The circular wait is obvious from the paired locked and waiting to lock entries.
DataNode Deadlock Analysis and Fix
Deadlock Analysis
Analyzing the runtime state of a process in a production environment is usually more involved. When the DataNode (dn for short) hung while executing hdfs dfsadmin -refreshNamenodes hostname:50020 and the dn log filled with exceptions, jstack -l located a deadlock. The relevant part of the jstack output is:
// ignore other stack
"Thread-219217170" #219414965 daemon prio=5 os_prio=0 tid=0x00007f8a284e1000 nid=0xf1f7 waiting for monitor entry [0x00007f8a87ba8000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolManager.addBlockPool(BlockPoolManager.java:65)
- waiting to lock <0x00000006c0779a40> (a org.apache.hadoop.hdfs.server.datanode.BlockPoolManager)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1765)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:390)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:282)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:825)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x000000074f872078> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
// ignore other stack
Found one Java-level deadlock:
=============================
"DataXceiver for client DFSClient_NONMAPREDUCE_-1127884171_152 at /10.83.175.59:58436 [Sending block BP-1922004198-10.83.100.11-1515033360950:blk_8441637123_7
413732454]":
waiting to lock monitor 0x00007f8ade875fa8 (object 0x00000006c0779a40, a org.apache.hadoop.hdfs.server.datanode.BlockPoolManager),
which is held by "IPC Server handler 106 on 50020"
"IPC Server handler 106 on 50020":
waiting for ownable synchronizer 0x000000074f872078, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
which is held by "Thread-219217170"
"Thread-219217170":
waiting to lock monitor 0x00007f8ade875fa8 (object 0x00000006c0779a40, a org.apache.hadoop.hdfs.server.datanode.BlockPoolManager),
which is held by "IPC Server handler 106 on 50020"
Java stack information for the threads listed above:
===================================================
"DataXceiver for client DFSClient_NONMAPREDUCE_-1127884171_152 at /10.83.175.59:58436 [Sending block BP-1922004198-10.83.100.11-1515033360950:blk_8441637123_7
413732454]":
at org.apache.hadoop.hdfs.server.datanode.BlockPoolManager.get(BlockPoolManager.java:85)
- waiting to lock <0x00000006c0779a40> (a org.apache.hadoop.hdfs.server.datanode.BlockPoolManager)
at org.apache.hadoop.hdfs.server.datanode.DataNode.getDNRegistrationForBP(DataNode.java:1909)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.checkAndWaitForBP(DataXceiver.java:1365)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.checkAccess(DataXceiver.java:1406)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.checkAccess(DataXceiver.java:1396)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:577)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:152)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:104)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:290)
at java.lang.Thread.run(Thread.java:745)
"IPC Server handler 106 on 50020":
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000074f872078> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.readLock(BPOfferService.java:110)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.getBlockPoolId(BPOfferService.java:207)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.formatThreadName(BPServiceActor.java:561)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.access$700(BPServiceActor.java:85)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$LifelineSender.start(BPServiceActor.java:1027)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.start(BPServiceActor.java:554)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.start(BPOfferService.java:343)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolManager$1.run(BlockPoolManager.java:129)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolManager.startAll(BlockPoolManager.java:124)
- locked <0x00000006c0779a40> (a org.apache.hadoop.hdfs.server.datanode.BlockPoolManager)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolManager.doRefreshNamenodes(BlockPoolManager.java:231)
- locked <0x00000006c0779a40> (a org.apache.hadoop.hdfs.server.datanode.BlockPoolManager)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolManager.refreshNamenodes(BlockPoolManager.java:170)
- locked <0x00000006c0779d20> (a java.lang.Object)
at org.apache.hadoop.hdfs.server.datanode.DataNode.refreshNamenodes(DataNode.java:3278)
at org.apache.hadoop.hdfs.server.datanode.DataNode.refreshNamenodes(DataNode.java:3285)
at org.apache.hadoop.hdfs.protocolPB.ClientDatanodeProtocolServerSideTranslatorPB.refreshNamenodes(ClientDatanodeProtocolServerSideTranslatorPB.java:132)
at org.apache.hadoop.hdfs.protocol.proto.ClientDatanodeProtocolProtos$ClientDatanodeProtocolService$2.callBlockingMethod(ClientDatanodeProtocolProtos.java:15993)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:882)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2709)
"Thread-219217170":
at org.apache.hadoop.hdfs.server.datanode.BlockPoolManager.addBlockPool(BlockPoolManager.java:65)
- waiting to lock <0x00000006c0779a40> (a org.apache.hadoop.hdfs.server.datanode.BlockPoolManager)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1765)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:390)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:282)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:825)
at java.lang.Thread.run(Thread.java:745)
Found 1 deadlock.
The jstack output above lists three threads as involved in the deadlock, but in fact only two of them actually cause it: "IPC Server handler 106 on 50020" (the refreshNameNodeThread) and "Thread-219217170" (the bpThread). The resulting circular wait is shown in the figure below:
In the figure, nodes with a green background mark where the two threads start, and nodes with a red background mark where a lock is acquired or waited on. The locks involved in this deadlock and the main flow are as follows:
In BPOfferService:
private final Lock mReadLock = mReadWriteLock.readLock();
private final Lock mWriteLock = mReadWriteLock.writeLock();
In BlockPoolManager: the synchronized keyword locks on the BlockPoolManager instance itself.
Thread 1 (refreshNameNodeThread): in (0, t1) it acquires the synchronized monitor of the BlockPoolManager instance; in (t1, ∞) it waits for the BPOfferService read lock.
Thread 2 (bpThread): in (0, t1) it acquires the BPOfferService write lock (via writeLock(), see the snippet below); in (t1, ∞) it waits for the synchronized monitor of the BlockPoolManager instance.
The code where thread 1 acquires its lock:
// BlockPoolManager
private void doRefreshNamenodes(
    Map<String, Map<String, InetSocketAddress>> addrMap,
    Map<String, Map<String, InetSocketAddress>> lifelineAddrMap)
    throws IOException {
  // ...
  synchronized (this) {
    // do something
  }
}
The code where thread 1 waits for a lock:
// BPOfferService
String getBlockPoolId(boolean quiet) {
  // ...
  readLock();
  // ...
}
The code where thread 2 acquires its lock:
// BPOfferService
void verifyAndSetNamespaceInfo(BPServiceActor actor, NamespaceInfo nsInfo)
    throws IOException {
  writeLock();
  // ...
}
The code where thread 2 waits for a lock:
// BlockPoolManager
synchronized void addBlockPool(BPOfferService bpos) {
  // do something
}
Even though synchronized, readLock, and writeLock are all involved, the read lock and the write lock are still mutually exclusive, so the pattern above still forms a head-to-tail circular wait on resources and results in a deadlock.
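The crux is that a read lock cannot be granted while another thread holds the write lock. The following minimal sketch, with made-up names unrelated to the DataNode code, shows this read/write mutual exclusion: the reader blocks in readLock().lock() until the writer releases the write lock, just as refreshNameNodeThread blocks in getBlockPoolId() while bpThread holds the write lock in verifyAndSetNamespaceInfo():

import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadWriteExclusion {
    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();

        // Writer holds the write lock for a while.
        Thread writer = new Thread(() -> {
            rwLock.writeLock().lock();
            try {
                Thread.sleep(2000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                rwLock.writeLock().unlock();
            }
        }, "writer");

        // Reader blocks on the read lock until the write lock is released.
        Thread reader = new Thread(() -> {
            rwLock.readLock().lock();   // blocks while the write lock is held
            try {
                System.out.println("read lock acquired only after write lock was released");
            } finally {
                rwLock.readLock().unlock();
            }
        }, "reader");

        writer.start();
        Thread.sleep(100);              // make sure the writer grabs the write lock first
        reader.start();
    }
}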
Solution
The way to avoid a deadlock is to break any one of its necessary conditions. A common approach is to break the circular wait on resources; two ways to implement it are listed here:
- Reduce the lock scope so that only the resources that truly need protection are locked. For example, a synchronized method(){} can be rewritten as method(){ synchronized {} }, keeping the parts of the method that do not need the lock outside of it (a sketch follows this list).
- Stagger the times at which the two threads acquire their locks: only let the second thread enter the lock-acquiring logic after thread 1 has released its lock.
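A minimal sketch of the first option, shrinking a method-level synchronized block down to the critical section (the class and method names here are made up for illustration):

public class NarrowLockScope {
    private int counter;

    // Before: the whole method is synchronized, including work that touches no shared state.
    public synchronized void methodBefore() {
        doExpensiveLocalWork();
        counter++;
    }

    // After: only the shared-state update holds the lock, so the lock is held
    // for a shorter time and only over the resource that actually needs it.
    public void methodAfter() {
        doExpensiveLocalWork();      // runs without holding the lock
        synchronized (this) {
            counter++;               // only the necessary resource is protected
        }
    }

    private void doExpensiveLocalWork() {
        // purely local computation, no shared state involved
    }
}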
The latter approach is used here to fix the deadlock described in this article. Deadlocks usually occur with low probability, so reproducing one simply by re-running tests over and over is nearly impossible. To reproduce the deadlock, we need to control when each thread acquires its locks: in (0, t1), thread 1 and thread 2 must each have acquired exactly their first lock, and in (t1, ∞) each must be trying to acquire its second lock. The following mechanisms can be used to control the timing of lock acquisition:
- CountDownLatch
- CyclicBarrier
- Thread.sleep
- ...
Take CyclicBarrier, for example: it lets a group of threads wait until they all reach a certain state and then proceed together. For this problem, that state is "thread 1 and thread 2 have each acquired their own first lock". For example:
// ...
// Both threads call this after taking their first lock; await() returns only once
// both have arrived, so they then go for their second locks at the same time.
CyclicBarrier barrier = new CyclicBarrier(2);

public void delayWhenHoldLock() {
    try {
        barrier.await();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Calling delayWhenHoldLock() right before thread 1 and thread 2 try to acquire their second locks is enough. Of course, with CyclicBarrier the deadlock is reproduced reliably whenever it exists, but when there is no deadlock one thread may stay blocked in await() forever. To be able to reproduce the deadlock and still have the test pass after the fix, this article uses Thread.sleep in the final test. To stagger the times at which the two threads acquire their locks, the start of the second thread (bpThread) is moved until after the first thread has released the read lock; the flow after the fix is shown below:
With this change, the deadlock is reproduced before the fix and the test passes after it, as expected. Here are the points worth noting from the debugging process:
- In HDFS, starting a MiniDFSCluster to simulate a cluster and using the DataNodeFaultInjector class to inject a fault that only takes effect in tests is very developer-friendly; make good use of them (a sketch follows this list).
- When reproducing the deadlock, make sure each thread is waiting on the very lock object that the other thread already holds. If they are merely two instances with the same name, the deadlock will not be reproduced.
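As a rough sketch of that fault-injection pattern: a test installs its own DataNodeFaultInjector through the class's static set()/get() accessors, so the delay only happens while the test's injector is in place. The hook name delayWhenHoldLock() below is assumed for illustration and may not match the method actually added by the patch; only the set()/get() swapping pattern is the point:

import org.apache.hadoop.hdfs.server.datanode.DataNodeFaultInjector;

public class FaultInjectionSketch {
    public static void runWithDelay(Runnable scenario) {
        DataNodeFaultInjector original = DataNodeFaultInjector.get();
        // Hypothetical hook: production code would call it while holding its first lock,
        // and the default implementation would be a no-op, so only tests pay the delay.
        DataNodeFaultInjector delayer = new DataNodeFaultInjector() {
            public void delayWhenHoldLock() {
                try {
                    Thread.sleep(1000);   // hold the first lock long enough for the other thread to take its own
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        };
        DataNodeFaultInjector.set(delayer);
        try {
            scenario.run();               // e.g. trigger refreshNamenodes against a MiniDFSCluster
        } finally {
            DataNodeFaultInjector.set(original);  // always restore the default injector
        }
    }
}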
This change has been submitted to the HDFS community; see https://issues.apache.org/jira/browse/HDFS-15641