Flink production issue: The assigned slot container_xxx was removed
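A customer runs Flink (on Yarn) on site for data extraction, with a JDBC source and a Kafka sink. They reported that the job dies after running for roughly 10 days and asked me to take a look.
Environment: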
Flink: 1.5.2
jdk: 1.8.0_25
Hadoop: 2.4.1
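The JobManager and the TaskManager are each allocated 1 GB of memory
I first went through the logs our system had collected. Two segments looked potentially useful.
Segment 1: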
2019-12-26 08:42:42,157 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@tdh04:46540] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2019-12-26 08:42:42,877 INFO org.apache.flink.yarn.YarnResourceManager - Closing TaskExecutor connection container_1576488269936_0008_01_000008 because: Exception from container-launch.
Container id: container_e05_1576488269936_0008_01_000008
Exit code: 255
Stack trace: ExitCodeException exitCode=255:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:585)
at org.apache.hadoop.util.Shell.run(Shell.java:482)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 255
Segment 2:
2019-12-26 08:42:42,877 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Unregister TaskManager f4660fcc70ee329e2427b5ed1245aa83 from the SlotManager.
2019-12-26 08:42:42,878 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: JDBC Source -> Timestamps/Watermarks -> sink_0_projection -> 数据输出_0_CheckOutputTypeFunction -> Sink: Unnamed (1/1) (96b3d4c2da227693ac34a8e8d2a4abea) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: The assigned slot container_1576488269936_0008_01_000008_0 was removed.
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:786)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:756)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:948)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:372)
at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:793)
at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:339)
at org.apache.flink.yarn.YarnResourceManager$$Lambda$212/930337248.run(Unknown Source)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-12-26 08:42:42,879 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Discarding checkpoint 82951 of job 42983ed24d98360747a8a535f2a3c8ba because: The assigned slot container_1576488269936_0008_01_000008_0 was removed.
2019-12-26 08:42:42,879 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job flow_1575462616880_0011 (42983ed24d98360747a8a535f2a3c8ba) switched from state RUNNING to FAILING.
org.apache.flink.util.FlinkException: The assigned slot container_1576488269936_0008_01_000008_0 was removed.
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:786)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:756)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:948)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:372)
at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:793)
at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:339)
at org.apache.flink.yarn.YarnResourceManager$$Lambda$212/930337248.run(Unknown Source)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-12-26 08:42:42,882 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Try to restart or fail the job flow_1575462616880_0011 (42983ed24d98360747a8a535f2a3c8ba) if no longer possible.
2019-12-26 08:42:42,882 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job flow_1575462616880_0011 (42983ed24d98360747a8a535f2a3c8ba) switched from state FAILING to FAILED.
org.apache.flink.util.FlinkException: The assigned slot container_1576488269936_0008_01_000008_0 was removed.
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:786)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:756)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:948)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:372)
at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:793)
at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:339)
at org.apache.flink.yarn.YarnResourceManager$$Lambda$212/930337248.run(Unknown Source)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-12-26 08:42:42,882 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Could not restart the job flow_1575462616880_0011 (42983ed24d98360747a8a535f2a3c8ba) because the restart strategy prevented it.
org.apache.flink.util.FlinkException: The assigned slot container_1576488269936_0008_01_000008_0 was removed.
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:786)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:756)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:948)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:372)
at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:793)
at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:339)
at org.apache.flink.yarn.YarnResourceManager$$Lambda$212/930337248.run(Unknown Source)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
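These logs tell us very little: we only learn that the container was removed, not why. Next I went to look at the logs on Yarn. (Yarn log aggregation has to be enabled for this. It was off the first time I was asked to look at the problem, so I turned it on and told the customer to come back to me the next time the failure occurred.)
Useful logs from Yarn: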
jobmanager.out
2019-12-26 08:42:44,233 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
taskmanager.err
SEVERE: Failed to resolve default logging config file: config/java.util.logging.properties
Uncaught error from thread [flink-akka.actor.default-dispatcher-4]: GC overhead limit exceeded, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for for ActorSystem[flink]
java.lang.OutOfMemoryError: GC overhead limit exceeded
at sun.reflect.AccessorGenerator.emitConstructor(AccessorGenerator.java:429)
at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:379)
at sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:112)
at sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:340)
at java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1376)
at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:493)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:602)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readClass(ObjectInputStream.java:1484)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1334)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation$MethodInvocation.readObject(RemoteRpcInvocation.java:204)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:502)
at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:489)
at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:477)
I put the events in chronological order:
2019-12-26 08:42:42,877 Container exits
2019-12-26 08:42:44.084 kill job
2019-12-26 08:42:44,233 ClusterEntrypoint - RECEIVED SIGNAL 15
2019-12-26 08:42:45,005 FINISH_APPLICATION sent to absent application application_1576488269936_0008
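First the TaskManager's container exited, and at that point the job had effectively already failed. Because the job ended abnormally, our system proactively kills the application on Yarn, so SIGNAL 15 was in fact sent by us. The last entry is the warning emitted when Yarn is asked to kill an application that no longer exists.
Since nothing pointed to an obvious cause for the container exit, I combined the following observations:
- The job starts normally
- It fails almost periodically
- akka reports an OutOfMemoryError
My guess was a memory leak, so I went through the business code (which I did not write). It creates a new JDBC PreparedStatement on every iteration and never closes it. I suspected this was the cause and wrote a small demo to check: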
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JDBC {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/test", "root", "123456");
        PreparedStatement preparedStatement;
        while (true) {
            // A new PreparedStatement on every iteration, as in the business code.
            preparedStatement = conn.prepareStatement("select count(*) from table_a");
            ResultSet resultSet = preparedStatement.executeQuery();
            while (resultSet.next()) {
                resultSet.getObject(1);
            }
            // Left commented out to reproduce the leak; uncomment to close the statement.
            //preparedStatement.close();
            // Uncomment to slow the loop down so heap growth can be observed.
            //Thread.sleep(10);
        }
    }
}
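JVM options: -Xmx100M
I watched memory in Visual VM. The program hit OOM very soon after starting, so I added a sleep. Visual VM then showed the old generation growing steadily; GC eventually kicked in but could not reclaim the memory, and the program still ran out of memory.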
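When attaching Visual VM is not practical (on a customer's Yarn cluster, for example), the same trend can be read from inside the JVM. The following is only a minimal sketch, not part of the original investigation: the class name HeapPoolLogger and its output format are made up for illustration, and it relies solely on the standard java.lang.management API. If the "usedAfterLastGc" value of the old-generation pool keeps climbing across snapshots, that is the same signal as the rising old generation seen in Visual VM.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Hypothetical helper, named for illustration only.
class HeapPoolLogger {

    /** Builds one line per heap memory pool: current usage and usage after the last GC. */
    static String snapshot() {
        StringBuilder sb = new StringBuilder();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip Metaspace, code cache, and other non-heap pools
            }
            MemoryUsage now = pool.getUsage();
            // getCollectionUsage() returns null when the pool does not support it.
            MemoryUsage afterGc = pool.getCollectionUsage();
            sb.append(pool.getName())
              .append(" used=").append(now == null ? -1 : now.getUsed())
              .append(" usedAfterLastGc=").append(afterGc == null ? -1 : afterGc.getUsed())
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        // Print a few snapshots; a long-running job would do this on a background timer.
        for (int i = 0; i < 3; i++) {
            System.out.print(snapshot());
            Thread.sleep(1000);
        }
    }
}
```

Pool names depend on the garbage collector in use, so the sketch prints every heap pool rather than guessing the name of the old generation.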
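After changing the code to close the preparedStatement, the old generation could still fill up almost completely, but memory dropped back down after each GC.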
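One way to make sure the statement is always closed, even when the query throws, is try-with-resources. The sketch below is an assumption about how the fix could look, not the actual change made to the customer's code; the class name JdbcCloseSketch and the method queryOnce are hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper, named for illustration only.
class JdbcCloseSketch {

    /**
     * Runs a query, reads the first column of every row, and returns the last value read.
     * The ResultSet and the PreparedStatement are both closed before the method returns,
     * whether it returns normally or throws.
     */
    static Object queryOnce(Connection conn, String sql) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            Object last = null;
            while (rs.next()) {
                last = rs.getObject(1);
            }
            return last;
        } // resources are closed here in reverse order: rs first, then ps
    }
}
```

In the demo loop, the body would then reduce to a call such as queryOnce(conn, "select count(*) from table_a"). Since the SQL text never changes, preparing the statement once outside the loop and reusing it is another option.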
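If "The assigned slot container_xxx was removed" is not caused by some other exception at job startup (ClassNotFound, for example) and the problem recurs periodically, a memory leak should be the first thing to consider.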