Flink: The heartbeat ... timed out

景志
2023-12-01

The Flink job was killed. The logs showed the following error:

Caused by: java.util.concurrent.TimeoutException: The heartbeat of TaskManager with id container_e06_1627962873638_5732_01_000003  timed out.
 at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1125)
 at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:109)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
 ... 20 more

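Cause: this error means the heartbeat of a container (TaskManager) timed out. There are generally two possible reasons:

1. A physical machine in the cluster lost network connectivity. In this case the job usually recovers normally after failover, and if it does not happen often it can be ignored.

2. The TaskManager (TM) on the node that failed over has too little memory configured, so heavy GC pauses cause the heartbeat to time out. Increase the memory setting for that TM.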

Solution: increase the memory the Flink job runs with.
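
As a rough illustration of that fix, the fragment below raises the TaskManager memory in flink-conf.yaml. The values are placeholders rather than settings from the original job, and the key name assumes Flink 1.10 or later; the commented-out heartbeat setting is an optional extra that the original post does not mention.

```yaml
# flink-conf.yaml (illustrative values, not taken from the original job)

# Total memory of each TaskManager process (Flink 1.10+).
# On older versions the corresponding key is taskmanager.heap.size.
taskmanager.memory.process.size: 4096m

# Optional, not part of the original fix: allow longer pauses before a
# TaskManager is declared lost. Value is in milliseconds (default 50000).
# heartbeat.timeout: 120000
```

Since the container id in the log suggests a YARN deployment, the same key can probably be passed per job at submission time (for example with -yD taskmanager.memory.process.size=4096m in yarn-cluster mode), which avoids changing the cluster-wide configuration. Raising the heartbeat timeout only hides long GC pauses, so it is best treated as a stopgap alongside the memory increase.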
