Flink production issue: The assigned slot container_xxx was removed


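A customer site runs Flink (on YARN) for data extraction, with a JDBC source and a Kafka sink. The customer reported that the job dies after running for roughly 10 days and asked me to take a look.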

Environment:

Flink: 1.5.2

JDK: 1.8.0_25

Hadoop: 2.4.1

The JobManager and the TaskManager are each allocated 1 GB of memory.

I first looked at the logs collected by our own system. Two excerpts looked potentially useful.

First excerpt:

2019-12-26 08:42:42,157 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@tdh04:46540] has failed, address is now gated for [50] ms. Reason: [Disassociated] 
2019-12-26 08:42:42,877 INFO  org.apache.flink.yarn.YarnResourceManager                     - Closing TaskExecutor connection container_1576488269936_0008_01_000008 because: Exception from container-launch.
Container id: container_e05_1576488269936_0008_01_000008
Exit code: 255
Stack trace: ExitCodeException exitCode=255: 
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:585)
	at org.apache.hadoop.util.Shell.run(Shell.java:482)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 255

Second excerpt:

2019-12-26 08:42:42,877 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Unregister TaskManager f4660fcc70ee329e2427b5ed1245aa83 from the SlotManager.
2019-12-26 08:42:42,878 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: JDBC Source -> Timestamps/Watermarks -> sink_0_projection -> 数据输出_0_CheckOutputTypeFunction -> Sink: Unnamed (1/1) (96b3d4c2da227693ac34a8e8d2a4abea) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: The assigned slot container_1576488269936_0008_01_000008_0 was removed.
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:786)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:756)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:948)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:372)
	at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:793)
	at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:339)
	at org.apache.flink.yarn.YarnResourceManager$$Lambda$212/930337248.run(Unknown Source)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
	at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
	at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
	at akka.actor.ActorCell.invoke(ActorCell.scala:495)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
	at akka.dispatch.Mailbox.run(Mailbox.scala:224)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-12-26 08:42:42,879 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Discarding checkpoint 82951 of job 42983ed24d98360747a8a535f2a3c8ba because: The assigned slot container_1576488269936_0008_01_000008_0 was removed.
2019-12-26 08:42:42,879 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job flow_1575462616880_0011 (42983ed24d98360747a8a535f2a3c8ba) switched from state RUNNING to FAILING.
org.apache.flink.util.FlinkException: The assigned slot container_1576488269936_0008_01_000008_0 was removed.
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:786)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:756)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:948)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:372)
	at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:793)
	at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:339)
	at org.apache.flink.yarn.YarnResourceManager$$Lambda$212/930337248.run(Unknown Source)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
	at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
	at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
	at akka.actor.ActorCell.invoke(ActorCell.scala:495)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
	at akka.dispatch.Mailbox.run(Mailbox.scala:224)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-12-26 08:42:42,882 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Try to restart or fail the job flow_1575462616880_0011 (42983ed24d98360747a8a535f2a3c8ba) if no longer possible.
2019-12-26 08:42:42,882 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job flow_1575462616880_0011 (42983ed24d98360747a8a535f2a3c8ba) switched from state FAILING to FAILED.
org.apache.flink.util.FlinkException: The assigned slot container_1576488269936_0008_01_000008_0 was removed.
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:786)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:756)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:948)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:372)
	at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:793)
	at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:339)
	at org.apache.flink.yarn.YarnResourceManager$$Lambda$212/930337248.run(Unknown Source)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
	at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
	at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
	at akka.actor.ActorCell.invoke(ActorCell.scala:495)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
	at akka.dispatch.Mailbox.run(Mailbox.scala:224)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-12-26 08:42:42,882 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Could not restart the job flow_1575462616880_0011 (42983ed24d98360747a8a535f2a3c8ba) because the restart strategy prevented it.
org.apache.flink.util.FlinkException: The assigned slot container_1576488269936_0008_01_000008_0 was removed.
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:786)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:756)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:948)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:372)
	at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:793)
	at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:339)
	at org.apache.flink.yarn.YarnResourceManager$$Lambda$212/930337248.run(Unknown Source)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
	at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
	at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
	at akka.actor.ActorCell.invoke(ActorCell.scala:495)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
	at akka.dispatch.Mailbox.run(Mailbox.scala:224)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

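These logs reveal very little: we learn that the container was removed, but not why. The next step is the logs on YARN. (This requires YARN log aggregation to be enabled. It was off the first time I was asked to look into this, so I turned it on and told the customer to come back when the problem recurred.)

Useful logs from YARN: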

jobmanager.out

2019-12-26 08:42:44,233 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.

taskmanager.err

SEVERE: Failed to resolve default logging config file: config/java.util.logging.properties
Uncaught error from thread [flink-akka.actor.default-dispatcher-4]: GC overhead limit exceeded, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for for ActorSystem[flink]
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at sun.reflect.AccessorGenerator.emitConstructor(AccessorGenerator.java:429)
	at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:379)
	at sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:112)
	at sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:340)
	at java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1376)
	at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
	at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:493)
	at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
	at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
	at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:602)
	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
	at java.io.ObjectInputStream.readClass(ObjectInputStream.java:1484)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1334)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
	at org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation$MethodInvocation.readObject(RemoteRpcInvocation.java:204)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
	at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:502)
	at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:489)
	at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:477)

I put the events in chronological order:

2019-12-26 08:42:42,877  Container exits
2019-12-26 08:42:44.084  kill job
2019-12-26 08:42:44,233 ClusterEntrypoint  - RECEIVED SIGNAL 15
2019-12-26 08:42:45,005  FINISH_APPLICATION sent to absent application application_1576488269936_0008

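First the TaskManager's container exits, and at that point the job has already failed. Because the job ended abnormally, our system proactively kills the application on YARN, so the SIGNAL 15 was actually sent by us. The last entry is the warning emitted when killing an application on YARN that no longer exists.

I found no obvious cause for the container exit, so I combined the following observations:

  1. The job starts normally
  2. It fails almost periodically
  3. The OutOfMemoryError reported on the akka thread

My guess was a memory leak. I then went through the business code (not written by me) and saw that it creates a new JDBC PreparedStatement for every query and never closes it. Suspecting this was the cause, I wrote the following demo: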

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

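public class JDBC {

    public static void main(String[] args) throws Exception {

        // Load the MySQL driver and open one long-lived connection.
        Class.forName("com.mysql.jdbc.Driver");
        Connection conn = DriverManager.getConnection(
            "jdbc:mysql://localhost:3306/test", "root", "123456");

        PreparedStatement preparedStatement;
        while (true) {
            // A new PreparedStatement is created on every iteration, as in the business code.
            preparedStatement = conn.prepareStatement("select count(*) from table_a");
            ResultSet resultSet = preparedStatement.executeQuery();
            while (resultSet.next()) {
                resultSet.getObject(1);
            }
            // Left commented out to reproduce the leak; uncomment to close the statement.
            //preparedStatement.close();
            // Uncomment to slow the loop down so that heap growth can be observed.
            //Thread.sleep(10);
        }
    }
}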

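JVM options: -Xmx100M

I watched memory with VisualVM. The program hit an OOM very soon after starting, so I added a sleep. VisualVM then showed the old generation growing steadily; GC eventually ran but reclaimed nothing useful, and the program still ended in an OOM.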
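Where attaching VisualVM is not convenient, roughly the same trend can be watched from inside the JVM through the standard java.lang.management API. The following is only a sketch of that idea and was not part of the original investigation; the class name HeapPoolLogger and the method snapshot are hypothetical, and the pool names that get printed depend on the garbage collector in use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not taken from the original job.
class HeapPoolLogger {

    // Returns one line per heap memory pool with its current usage in bytes.
    // "max" is -1 when the JVM does not define a limit for that pool.
    static String snapshot() {
        StringBuilder sb = new StringBuilder();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip metaspace, code cache and other non-heap pools
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            sb.append(String.format("%s used=%d max=%d%n",
                pool.getName(), usage.getUsed(), usage.getMax()));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        // Print three samples one second apart; a real check would run much longer.
        for (int i = 0; i < 3; i++) {
            System.out.print(snapshot());
            Thread.sleep(1000);
        }
    }
}
```

If the old-generation pool's used value keeps climbing across samples and does not fall after collections, that matches the pattern seen in VisualVM.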

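After changing the code to close the preparedStatement, the old generation could still fill up almost completely, but memory dropped back down after GC.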
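Closing the statement can be sketched with try-with-resources, which also closes it when the query throws. This is a minimal sketch and not the customer's actual code: the class name ClosingQuery and the helper countRows are hypothetical, and the caller is assumed to own the long-lived Connection. A plausible explanation for the leak, stated here as an assumption, is that the driver's connection keeps references to statements that are still open, so unclosed statements stay reachable and cannot be collected.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper, not taken from the original job.
class ClosingQuery {

    // Runs a single-value count query and releases its JDBC resources.
    // The connection is NOT closed here; it belongs to the caller.
    static long countRows(Connection conn, String sql) throws SQLException {
        // try-with-resources closes the ResultSet first, then the
        // PreparedStatement, even if executeQuery() or next() throws.
        try (PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            return rs.next() ? rs.getLong(1) : 0L;
        }
    }
}
```

In the demo loop, the body would then shrink to a single call such as countRows(conn, "select count(*) from table_a"). Preparing the statement once outside the loop and reusing it would be another option.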

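If "The assigned slot container_xxx was removed" is not caused by some other exception at job startup (for example ClassNotFound) and the problem recurs periodically, a memory leak should be the first thing to consider.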

Reposted from blog.csdn.net/u010942041/article/details/103731168