Notes on a Hadoop job submission failure
INFO log
2019-07-18 11:40:50 386 [QuartzScheduler_Worker-1:203538] - [INFO] org.apache.hadoop.ipc.Client - Retrying connect to server: sparka/10.240.47.152:8032. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-07-18 11:40:51 388 [QuartzScheduler_Worker-1:204540] - [INFO] org.apache.hadoop.ipc.Client - Retrying connect to server: sparka/10.240.47.152:8032. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-07-18 11:40:52 389 [QuartzScheduler_Worker-1:205541] - [INFO] org.apache.hadoop.ipc.Client - Retrying connect to server: sparka/10.240.47.152:8032. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
ERROR log
2019-07-18 10:51:12 160 [QuartzScheduler_Worker-1:669550199] - [ERROR] com.bonree.action.mr.dispatcher.MrDispatcherJob - mr dispatcher. task name [TaskConfig{id=923, name='TRAJECTORY@HOUR@1562281200000', type=1, createTime=2019-07-05, uploadTime=2019-07-05, info='{"baseInput":"/data/br/base/action/inputpath/source/2019/07/05/07","baseOutput":"/data/br/base/action/inputpath/result/2019/07/05/07/TRAJECTORY","configMap":{},"gran":"HOUR","monitorTime":1562281200000,"mrTypeName":"TRAJECTORY"}', status=2}] run occur error!taskConfig:java.net.ConnectException: Call From sparka/10.240.47.152 to sparka:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
2019-07-18 11:11:22 520 [QuartzScheduler_Worker-1:670760559] - [ERROR] com.bonree.action.mr.dispatcher.MrDispatcherJob - mr dispatcher. task name [TaskConfig{id=925, name='TRAJECTORY@HOUR@1562284800000', type=1, createTime=2019-07-05, uploadTime=2019-07-05, info='{"baseInput":"/data/br/base/action/inputpath/source/2019/07/05/08","baseOutput":"/data/br/base/action/inputpath/result/2019/07/05/08/TRAJECTORY","configMap":{},"gran":"HOUR","monitorTime":1562284800000,"mrTypeName":"TRAJECTORY"}', status=2}] run occur error!taskConfig:java.net.ConnectException: Call From sparka/10.240.47.152 to sparka:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
2019-07-18 11:31:32 854 [QuartzScheduler_Worker-1:671970893] - [ERROR] com.bonree.action.mr.dispatcher.MrDispatcherJob - mr dispatcher. task name [TaskConfig{id=924, name='LAUNCH@HOUR@1562284800000', type=1, createTime=2019-07-05, uploadTime=2019-07-05, info='{"baseInput":"/data/br/base/action/inputpath/source/2019/07/05/08","baseOutput":"/data/br/base/action/inputpath/result/2019/07/05/08/LAUNCH","configMap":{},"gran":"HOUR","monitorTime":1562284800000,"mrTypeName":"LAUNCH"}', status=2}] run occur error!taskConfig:java.net.ConnectException: Call From sparka/10.240.47.152 to sparka:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
Similar failures can show ConnectionRefused on 8020 or some other port; it almost always means the service behind that port is down. Here 8032 is the YARN ResourceManager's RPC port, so that is where to look. If you don't know which service a port belongs to, grep the Hadoop configuration directory to find where it is configured.
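Independent of Hadoop, a plain TCP connect is a quick way to confirm whether anything is listening on the port at all. A minimal sketch (the host and port come from the logs above; `check_port` is just an illustrative helper, not part of any Hadoop API):

```python
import socket

def check_port(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # ConnectionRefusedError, timeout, DNS failure, ...
        return False

# While the ResourceManager is down, check_port("sparka", 8032) returns False,
# which matches the "Connection refused" in the job logs.
```

A refused connection (as opposed to a timeout) is a strong hint that the host is up but nothing is bound to the port, which points at the daemon rather than the network.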
First I grep the whole config directory for 8032, then narrow down which file actually sets it:
[root@sparka hadoop]# cat * | grep 8032
<value>sparka:8032</value>
[root@sparka hadoop]# cat hdfs-site.xml | grep 8032
[root@sparka hadoop]# cat core-site.xml | grep 8032
[root@sparka hadoop]# cat yarn-site.xml | grep 8032
<value>sparka:8032</value>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>sparka:8032</value>
</property>
So it is yarn.resourcemanager.address. Check the YARN ResourceManager process with jps:
[root@sparka hadoop]# jps
24384 jar
17312 jar
15937 Main
12746 NameNode
7370 SDK_WEB_AUTOREPORT.jar
31436 ZEUS_MANAGERSERVER.jar
26637 jar
24078 Jps
21903 jar
29427 Bootstrap
31027 jar
12947 JournalNode
18580 jar
23541 RunJar
24314 jar
13117 DFSZKFailoverController
22333 jar
15358 Main
4287 machineagent.jar
[root@sparka hadoop]#
The ResourceManager is indeed missing. The next step is to find out when it stopped (or failed to start) and why, by checking its log under Hadoop's logs directory.
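This kind of check is easy to script, for example in a monitor that alerts when an expected daemon disappears. A hypothetical sketch (`missing_daemons` is my own helper; here it runs against a captured jps snippet rather than a live `jps` call):

```python
def missing_daemons(jps_output: str, expected: list) -> list:
    """Return the expected daemon names that do not appear in `jps` output.

    Each jps line looks like "<pid> <main-class-or-jar>".
    """
    running = set()
    for line in jps_output.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            running.add(parts[1])
    return [name for name in expected if name not in running]

sample = """12746 NameNode
12947 JournalNode
13117 DFSZKFailoverController"""
print(missing_daemons(sample, ["NameNode", "ResourceManager", "NodeManager"]))
# -> ['ResourceManager', 'NodeManager']
```

In a real monitor you would feed it `subprocess.run(["jps"], ...)` output instead of a hard-coded string.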
Check the ResourceManager log
[root@sparka logs]# tail -F yarn-root-resourcemanager-sparka.log
2019-06-22 00:08:54,794 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1555906617082_6791,name=sdk_data_action_day-determine_partitions_groupby-Optional.of([2019-06-21T00:00:00.000Z/2019-06-22T00:00:00.000Z]),user=root,queue=root.root,state=FAILED,trackingUrl=http://sparka:8088/cluster/app/application_1555906617082_6791,appMasterHost=N/A,startTime=1561132960264,finishTime=1561133311607,finalStatus=FAILED
2019-06-22 00:08:58,664 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Sending NMToken for nodeId : sparkb:32859 for container : container_1555906617082_6794_01_000001
2019-06-22 00:09:00,142 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1555906617082_6794_01_000001 Container Transitioned from ALLOCATED to ACQUIRED
2019-06-22 00:09:00,142 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Clear node set for appattempt_1555906617082_6794_000001
2019-06-22 00:09:00,859 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Storing attempt: AppId: application_1555906617082_6794 AttemptId: appattempt_1555906617082_6794_000001 MasterContainer: Container: [ContainerId: container_1555906617082_6794_01_000001, NodeId: sparkb:32859, NodeHttpAddress: sparkb:8042, Resource: <memory:2048, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.240.47.153:32859 }, ]
2019-06-22 00:09:03,013 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1555906617082_6794_000001 State change from SCHEDULED to ALLOCATED_SAVING
2019-06-22 00:09:06,446 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1555906617082_6794_000001 State change from ALLOCATED_SAVING to ALLOCATED
2019-06-22 00:09:07,251 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching masterappattempt_1555906617082_6794_000001
2019-06-22 00:09:13,245 INFO org.apache.hadoop.util.ExitUtil: Halt with status -1 Message: HaltException
2019-06-22 00:09:13,960 INFO org.apache.hadoop.util.ExitUtil: Halt with status -1 Message: HaltException
The last two lines show the exception: the process was halted. Let's look at the source of ExitUtil.halt:
/**
 * Forcibly terminates the currently running Java virtual machine.
 *
 * @param status exit code
 * @param msg message used to create the {@code HaltException}
 * @throws HaltException if Runtime.getRuntime().halt() is disabled for test purposes
 */
public static void halt(int status, String msg) throws HaltException {
  LOG.info("Halt with status " + status + " Message: " + msg);
  if (systemHaltDisabled) {
    HaltException ee = new HaltException(status, msg);
    LOG.fatal("Halt called", ee);
    if (null == firstHaltException) {
      firstHaltException = ee;
    }
    throw ee;
  }
  Runtime.getRuntime().halt(status);
}
According to the Javadoc, the JVM was forcibly terminated. When a service gets killed off like this, the usual cause is that the machine could not supply the resources the process needed; my guess is it ran out of memory, but since it died a while back I could not find any log evidence either way...
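One way to gather evidence for (or against) the out-of-memory guess is to search the kernel log (`dmesg` output or `/var/log/messages`) for oom-killer activity. A small hypothetical helper, run here on a fabricated sample line rather than real output from this server:

```python
import re

# Phrases the Linux oom-killer typically logs; exact wording varies by kernel version.
OOM_PATTERN = re.compile(
    r"(oom-killer|Out of memory: Kill process|Killed process)", re.IGNORECASE
)

def oom_events(kernel_log: str) -> list:
    """Return kernel-log lines that look like oom-killer activity."""
    return [line for line in kernel_log.splitlines() if OOM_PATTERN.search(line)]

sample = ("Jun 22 00:09:12 sparka kernel: Out of memory: "
          "Kill process 31337 (java) score 912 or sacrifice child")
print(oom_events(sample))
```

If the kernel log has rotated away since the incident, as it likely had here, this will come up empty and the OOM theory stays a guess.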
Finally, restart YARN to bring the ResourceManager back.
If you have better ideas or approaches for troubleshooting this, please point them out. Many thanks.