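Fixing DolphinScheduler tasks that produce data but show a failed status (offline data warehouse jobs on CDH 6.3.2)
DolphinScheduler suddenly started misbehaving in production: tasks produced their data, but their status was shown as failed, so the workflow could not continue.
A strange problem
The ZooKeeper-related log entries look like this: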
4edf77f387, negotiated timeout = 60000
[INFO] 2025-03-25 10:17:50.845 org.apache.curator.framework.state.ConnectionStateManager:[251] - State change: RECONNECTED
[ERROR] 2025-03-25 10:17:50.847 org.apache.curator.framework.recipes.cache.TreeCache:[779] -
java.lang.IllegalStateException: unexpected NodeCreated on non-root node
at org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:507)
at org.apache.curator.framework.recipes.cache.TreeCache$TreeNode.process(TreeCache.java:372)
at org.apache.curator.framework.imps.NamespaceWatcher.process(NamespaceWatcher.java:77)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:533)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
[the same TreeCache error and stack trace repeat four more times, at 10:17:50.847 and 10:17:50.848]
[INFO] 2025-03-25 10:17:50.848 org.apache.dolphinscheduler.service.zk.ZookeeperOperator:[76] - reconnected to zookeeper
[root@cdh03 logs]# tail -f dolphinscheduler-worker.log
[INFO] 2025-03-25 10:17:50.848 org.apache.dolphinscheduler.service.zk.ZookeeperOperator:[76] - reconnected to zookeeper
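This looked like a problem in how DolphinScheduler interacts with ZooKeeper or YARN on CDH 6.3.2. The last time it happened, a brute-force restart of the whole CDH cluster made it go away.
Steps to restart DolphinScheduler
1. First, change into the DolphinScheduler installation directory with the following shell command: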
cd /opt/dolphinscheduler # or your actual installation path
2. Stop all services:
./bin/stop-all.sh
3. Wait a few seconds to make sure every service has stopped, then start them all again:
./bin/start-all.sh
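After the restart it can help to confirm that every daemon actually came back. The following is a minimal sketch, not part of the original procedure: the function name `missing_daemons` is hypothetical, and the daemon names are an assumption based on a 1.3.x-style deployment, so adjust them to whatever `jps` shows on your nodes.

```shell
# Report which expected DolphinScheduler daemons are missing from `jps` output (read on stdin).
# Assumption: these daemon names match your DolphinScheduler version; edit the list if not.
missing_daemons() {
  local jps_output name
  jps_output=$(cat)
  for name in MasterServer WorkerServer LoggerServer ApiApplicationServer AlertServer; do
    # Print the name only when no line of the jps output contains it as a whole word.
    echo "$jps_output" | grep -qw "$name" || echo "$name"
  done
}
```

Run it as `jps | missing_daemons` on each node; empty output means all listed daemons were found there. On a multi-node deployment some daemons are expected to be absent from a given host.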
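After rerunning the task, the problem was still there. (Occasionally a short-running task that does not interact with MySQL did show a successful status.)
With no other option, I went back to the DolphinScheduler worker log, and this time found the following: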
[ERROR] 2025-03-25 13:45:54.049 org.apache.dolphinscheduler.common.utils.HttpUtils:[73] - Connect to cdh02:8088 [cdh02/10.0.0.2] failed: Connection refused (Connection refused)
org.apache.http.conn.HttpHostConnectException: Connect to cdh02:8088 [cdh02/10.0.0.2] failed: Connection refused (Connection refused)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
at org.apache.dolphinscheduler.common.utils.HttpUtils.get(HttpUtils.java:60)
at org.apache.dolphinscheduler.common.utils.HadoopUtils$YarnHAAdminUtils.getRMState(HadoopUtils.java:645)
at org.apache.dolphinscheduler.common.utils.HadoopUtils$YarnHAAdminUtils.getAcitveRMName(HadoopUtils.java:620)
at org.apache.dolphinscheduler.common.utils.HadoopUtils.getAppAddress(HadoopUtils.java:557)
at org.apache.dolphinscheduler.common.utils.HadoopUtils.getApplicationUrl(HadoopUtils.java:204)
at org.apache.dolphinscheduler.common.utils.HadoopUtils.getApplicationStatus(HadoopUtils.java:410)
at org.apache.dolphinscheduler.server.worker.task.AbstractCommandExecutor.isSuccessOfYarnState(AbstractCommandExecutor.java:390)
at org.apache.dolphinscheduler.server.worker.task.AbstractCommandExecutor.run(AbstractCommandExecutor.java:230)
at org.apache.dolphinscheduler.server.worker.task.shell.ShellTask.handle(ShellTask.java:98)
at org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread.run(TaskExecuteThread.java:133)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:74)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:134)
... 24 common frames omitted
[INFO] 2025-03-25 13:45:54.050 org.apache.dolphinscheduler.common.utils.HadoopUtils:[206] - application url : http://null:8088/ws/v1/cluster/apps/%s
[ERROR] 2025-03-25 13:46:04.065 org.apache.dolphinscheduler.common.utils.HttpUtils:[73] - null: Name or service not known
java.net.UnknownHostException: null: Name or service not known
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at org.apache.http.impl.conn.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:45)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:111)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
at org.apache.dolphinscheduler.common.utils.HttpUtils.get(HttpUtils.java:60)
at org.apache.dolphinscheduler.common.utils.HadoopUtils.getApplicationStatus(HadoopUtils.java:412)
at org.apache.dolphinscheduler.server.worker.task.AbstractCommandExecutor.isSuccessOfYarnState(AbstractCommandExecutor.java:390)
at org.apache.dolphinscheduler.server.worker.task.AbstractCommandExecutor.run(AbstractCommandExecutor.java:230)
at org.apache.dolphinscheduler.server.worker.task.shell.ShellTask.handle(ShellTask.java:98)
at org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread.run(TaskExecuteThread.java:133)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolE
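Analysis of the DolphinScheduler worker log errors and the fix
The DolphinScheduler worker log shows errors connecting to the YARN ResourceManager. These errors are probably why the workflow is shown as failed after the Sqoop task has run.
The log points to the following key problem:
1. YARN ResourceManager connection failure: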
Connect to cdh02:8088 [cdh02/10.0.0.2] failed: Connection refused
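DolphinScheduler tried to connect to the CDH cluster's YARN ResourceManager (port 8088), but the connection was refused.
CDH 6.3.2 is configured for high availability here, so I checked the YARN ResourceManager setting in DolphinScheduler's conf/common.properties: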
yarn.resourcemanager.ha.rm.ids=cdh01,cdh02
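In CDH, the YARN ResourceManagers are actually cdh01 and cdh04, with cdh04 as the active node. DolphinScheduler had been misconfigured with cdh02, which is not a YARN ResourceManager at all, so naturally the connection could not succeed.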
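The `http://null:8088/...` URL in the worker log fits this explanation: with cdh01 presumably in standby and cdh02 refusing connections, no configured host reports itself as active, so the host part of the status URL ends up empty. The sketch below only imitates that selection logic in shell and is not DolphinScheduler's code. The function names, the "UNREACHABLE" label, the `<application_id>` placeholder, and the use of the literal string "null" for a missing host are all illustrative assumptions.

```shell
# Read "host state" lines on stdin and print the first host reporting ACTIVE (nothing if none does).
pick_active_rm() {
  awk '$2 == "ACTIVE" { print $1; exit }'
}

# Build a status URL shaped like the one in the log; a missing host is rendered as "null".
app_status_url() {
  local active
  active=$(pick_active_rm)
  echo "http://${active:-null}:8088/ws/v1/cluster/apps/<application_id>"
}

# Misconfigured list (cdh01 standby, cdh02 not a ResourceManager): no host is ACTIVE.
printf 'cdh01 STANDBY\ncdh02 UNREACHABLE\n' | app_status_url
# Corrected list: cdh04 is found as the active ResourceManager.
printf 'cdh01 STANDBY\ncdh04 ACTIVE\n' | app_status_url
```

The first call prints a URL with `null` as the host, and the second prints one pointing at cdh04. When the status lookup cannot reach a ResourceManager, the scheduler has no way to confirm that the YARN application succeeded, which would explain a failed status on a task that did produce its data.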
Change it to the correct value:
yarn.resourcemanager.ha.rm.ids=cdh01,cdh04
After restarting DolphinScheduler, everything was indeed back to normal.
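To catch this kind of misconfiguration earlier, the configured host list can be checked against YARN directly. The following is a rough sketch under stated assumptions: the ResourceManager web port is 8088 (as in the log above), `curl` is installed, and each ResourceManager answers the YARN REST endpoint `/ws/v1/cluster/info` with a JSON document containing an `haState` field. The function names are hypothetical.

```shell
# Print the hosts listed in yarn.resourcemanager.ha.rm.ids, one per line, from a properties file.
rm_ids_from_properties() {
  local file="$1"
  grep -E '^[[:space:]]*yarn\.resourcemanager\.ha\.rm\.ids[[:space:]]*=' "$file" \
    | tail -n 1 \
    | cut -d= -f2- \
    | tr ',' '\n' \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Extract the haState value from the cluster-info JSON read on stdin.
ha_state_from_json() {
  sed -n 's/.*"haState"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Probe every configured host and print "host state"; hosts that do not answer are UNREACHABLE.
check_rm_hosts() {
  local file="$1" host state
  for host in $(rm_ids_from_properties "$file"); do
    state=$(curl -s --max-time 5 "http://${host}:8088/ws/v1/cluster/info" | ha_state_from_json)
    echo "${host} ${state:-UNREACHABLE}"
  done
}
```

Calling `check_rm_hosts conf/common.properties` from the installation directory prints one line per configured host. A healthy HA setup should show exactly one host as ACTIVE; a host shown as UNREACHABLE, like cdh02 in this incident, is a sign that the list does not match the cluster.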