我在Thread上运行flink作业,我们使用命令行中的“fink run”将作业提交给Thread,有一天我们在flink作业上出现异常,因为我们没有启用flink重启策略,所以它只是失败了,但最终我们从Thread应用程序列表中发现作业状态为“成功”,我们预期为“失败”。
Flink CLI日志:
06/12/2018 03:13:37 FlatMap (getTagStorageMapper.flatMap)(23/32) switched to CANCELED
06/12/2018 03:13:37 GroupReduce (ResultReducer.reduceGroup)(31/32) switched to CANCELED
06/12/2018 03:13:37 FlatMap (SubClassEDFJoinMapper.flatMap)(29/32) switched to CANCELED
06/12/2018 03:13:37 CHAIN DataSource (SubClassInventory.AvroInputFormat.createInput) -> FlatMap (SubClassInventoryMapper.flatMap)(27/32) switched to CANCELED
06/12/2018 03:13:37 GroupReduce (OutputReducer.reduceGroup)(28/32) switched to CANCELED
06/12/2018 03:13:37 CHAIN DataSource (SubClassInventory.AvroInputFormat.createInput) -> FlatMap (BIMBQMInstrumentMapper.flatMap)(27/32) switched to CANCELED
06/12/2018 03:13:37 GroupReduce (BIMBQMGovCorpReduce.reduceGroup)(30/32) switched to CANCELED
06/12/2018 03:13:37 FlatMap (BIMBQMEVMJoinMapper.flatMap)(32/32) switched to CANCELED
06/12/2018 03:13:37 Job execution switched to status FAILED.
No JobSubmissionResult returned, please make sure you called ExecutionEnvironment.execute()
2018-06-12 03:13:37,625 INFO org.apache.flink.yarn.YarnClusterClient - Sending shutdown request to the Application Master
2018-06-12 03:13:37,625 INFO org.apache.flink.yarn.YarnClusterClient - Start application client.
2018-06-12 03:13:37,630 INFO org.apache.flink.yarn.ApplicationClient - Notification about new leader address akka.tcp://flink@ip-10-97-46-149.tr-fr-nonprod.aws-int.thomsonreuters.com:45663/user/jobmanager with session ID 00000000-0000-0000-0000-000000000000.
2018-06-12 03:13:37,632 INFO org.apache.flink.yarn.ApplicationClient - Sending StopCluster request to JobManager.
2018-06-12 03:13:37,633 INFO org.apache.flink.yarn.ApplicationClient - Received address of new leader akka.tcp://flink@ip-10-97-46-149.tr-fr-nonprod.aws-int.thomsonreuters.com:45663/user/jobmanager with session ID 00000000-0000-0000-0000-000000000000.
2018-06-12 03:13:37,634 INFO org.apache.flink.yarn.ApplicationClient - Disconnect from JobManager null.
2018-06-12 03:13:37,635 INFO org.apache.flink.yarn.ApplicationClient - Trying to register at JobManager akka.tcp://flink@ip-10-97-46-149.tr-fr-nonprod.aws-int.thomsonreuters.com:45663/user/jobmanager.
2018-06-12 03:13:37,688 INFO org.apache.flink.yarn.ApplicationClient - Successfully registered at the ResourceManager using JobManager Actor[akka.tcp://flink@ip-10-97-46-149.tr-fr-nonprod.aws-int.thomsonreuters.com:45663/user/jobmanager#182802345]
2018-06-12 03:13:38,648 INFO org.apache.flink.yarn.ApplicationClient - Sending StopCluster request to JobManager.
2018-06-12 03:13:39,480 INFO org.apache.flink.yarn.YarnClusterClient - Application application_1528772982594_0001 finished with state FINISHED and final state SUCCEEDED at 1528773218662
2018-06-12 03:13:39,480 INFO org.apache.flink.yarn.YarnClusterClient - YARN Client is shutting down
2018-06-12 03:13:39,582 INFO org.apache.flink.yarn.ApplicationClient - Stopped Application client.
2018-06-12 03:13:39,583 INFO org.apache.flink.yarn.ApplicationClient - Disconnect from JobManager Actor[akka.tcp://flink@ip-10-97-46-149.tr-fr-nonprod.aws-int.thomsonreuters.com:45663/user/jobmanager#182802345].
Flink作业管理器日志:
FlatMap (BIMBQMEVMJoinMapper.flatMap) (32/32) (67a002e07fe799c1624a471340c8cf9d) switched from CANCELING to CANCELED.
Try to restart or fail the job Flink Java Job at Tue Jun 12 03:13:17 UTC 2018 (1086cedb3617feeee8aace29a7fc6bd0) if no longer possible.
Requesting new TaskManager container with 8192 megabytes memory. Pending requests: 1
Job Flink Java Job at Tue Jun 12 03:13:17 UTC 2018 (1086cedb3617feeee8aace29a7fc6bd0) switched from state FAILING to FAILED.
Could not restart the job Flink Java Job at Tue Jun 12 03:13:17 UTC 2018 (1086cedb3617feeee8aace29a7fc6bd0) because the restart strategy prevented it.
Unregistered task manager ip-10-97-44-186/10.97.44.186. Number of registered task managers 31. Number of available slots 31
Stopping JobManager with final application status SUCCEEDED and diagnostics: Flink YARN Client requested shutdown
Shutting down cluster with status SUCCEEDED : Flink YARN Client requested shutdown
Unregistering application from the YARN Resource Manager
Waiting for application to be successfully unregistered.
有谁能帮我理解为什么塞恩说我的Flink工作是“成功的”?
纱线中报告的应用程序状态并不反映已执行作业的状态,而是反映Flink簇的状态,因为这是纱线应用程序。因此,纱线应用的最终状态仅取决于燧石簇是否正确完成。换言之,如果作业失败,则不一定意味着Flink集群失败。这是两件不同的事情。
即使是一个简单的WordCount mapduce也会因相同的错误而失败。 Hadoop 2.6.0 下面是纱线原木。 似乎在资源协商期间发生了某种超时 但我无法验证这一点,即超时的确切原因。 2016-11-11 15:38:09313信息组织。阿帕奇。hadoop。纱线服务器resourcemanager。amlauncher。AMLauncher:启动appattempt\u 1478856
aws上的3台机器(32个内核和64 GB内存) 我手动安装了带有hdfs和yarn服务的Hadoop2(没有使用EMR)。 机器#1运行hdfs-(NameNode&SeconderyNameNode)和yarn-(resourcemanager),在masters文件中定义 问题是,我认为我做错了,因为这项工作需要相当多的时间,大约一个小时,我认为它不是很优化。 我使用以下命令运行flink:
我试图理解Python线程的“守护进程”标志。这我知道 线程可以标记为“守护线程”。此标志的意义在于,当只剩下守护进程线程时,整个Python程序将退出。初始值从创建线程继承。 但在我的例子中,python程序在守护进程线程离开并且线程没有完成其工作之前退出。 主程序 第一个线程只写5000个第一个整数,而第二个线程不写任何数字
> 我试图在作业完成之前返回Spring Batch作业ID。 我当前的实现只在作业完成后返回信息。 我使用批处理程序控制器和批处理服务,发布在下面。谢谢,我是新来的Spring Batch,经过详尽的搜索,找不到太多与我的问题相关的。有一个帖子有人使用Apache骆驼,但我没有。 控制器 服务 再次感谢。 编辑 我已将其添加到批处理配置中 编辑 在马哈茂德·本·哈辛的评论的帮助下,我解决了这个问
我正在使用spark submit执行以下命令: spark submit script\u测试。py—主纱线—部署模式群集spark submit script\u测试。py—主纱线簇—部署模式簇 这工作做得很好。我可以在Spark History Server UI下看到它。但是,我无法在RessourceManager UI(纱线)下看到它。 我感觉我的作业没有发送到集群,但它只在一个节点上
我有一项服务,每天在Kubernetes上部署数千个短期工作。我试图让Kubernetes在完成后使用这里描述的功能删除这些作业: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#clean-up-finished-jobs-automatically 作业完成,但在表示的时间限制之