Preface
Background of the exception:
- Colleagues on the project had previously deployed 8 nodes; all worked normally
- They later scaled the cluster out by another 8 nodes; all of the new nodes failed
- YARN ResourceManager is configured with HA
- Hadoop version: 3.1.1
The exception
2025-09-17 09:30:00,869 INFO client.ConfiguredRMFailoverProxyProvider (ConfiguredRMFailoverProxyProvider.java:performFailover(100)) - Failing over to rm2
2025-09-17 09:30:01,043 INFO retry.RetryInvocationHandler (RetryInvocationHandler.java:log(411)) - org.apache.hadoop.security.authorize.AuthorizationException: User nm/indata-192-168-1-3.indata.com@INDATA.COM (auth:KERBEROS) is not authorized for protocol interface org.apache.hadoop.yarn.server.api.ResourceTrackerPB: this service is only accessible by nm/192.168.1.3@INDATA.COM, while invoking ResourceTrackerPBClientImpl.registerNodeManager over rm2 after 1 failover attempts. Trying to failover after sleeping for 21498ms.
2025-09-17 09:30:22,541 INFO client.ConfiguredRMFailoverProxyProvider (ConfiguredRMFailoverProxyProvider.java:performFailover(100)) - Failing over to rm1
2025-09-17 09:30:22,545 INFO retry.RetryInvocationHandler (RetryInvocationHandler.java:log(411)) - java.net.ConnectException: Call From indata-192-168-1-3.indata.com/192.168.1.3 to indata-192-168-1-1.indata.com:8031 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ResourceTrackerPBClientImpl.registerNodeManager over rm1 after 2 failover attempts. Trying to failover after sleeping for 33407ms.
2025-09-17 09:30:55,953 INFO client.ConfiguredRMFailoverProxyProvider (ConfiguredRMFailoverProxyProvider.java:performFailover(100)) - Failing over to rm2
2025-09-17 09:30:55,992 INFO retry.RetryInvocationHandler (RetryInvocationHandler.java:log(411)) - org.apache.hadoop.security.authorize.AuthorizationException: User nm/indata-192-168-1-3.indata.com@INDATA.COM (auth:KERBEROS) is not authorized for protocol interface org.apache.hadoop.yarn.server.api.ResourceTrackerPB: this service is only accessible by nm/192.168.1.3@INDATA.COM, while invoking ResourceTrackerPBClientImpl.registerNodeManager over rm2 after 3 failover attempts. Trying to failover after sleeping for 44974ms.
2025-09-17 09:31:40,967 INFO client.ConfiguredRMFailoverProxyProvider (ConfiguredRMFailoverProxyProvider.java:performFailover(100)) - Failing over to rm1
2025-09-17 09:31:40,971 INFO retry.RetryInvocationHandler (RetryInvocationHandler.java:log(411)) - java.net.ConnectException: Call From indata-192-168-1-3.indata.com/192.168.1.3 to indata-192-168-1-1.indata.com:8031 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ResourceTrackerPBClientImpl.registerNodeManager over rm1 after 4 failover attempts. Trying to failover after sleeping for 15164ms.
2025-09-17 09:31:56,136 INFO client.ConfiguredRMFailoverProxyProvider (ConfiguredRMFailoverProxyProvider.java:performFailover(100)) - Failing over to rm2
2025-09-17 09:31:56,181 INFO retry.RetryInvocationHandler (RetryInvocationHandler.java:log(411)) - org.apache.hadoop.security.authorize.AuthorizationException: User nm/indata-192-168-1-3.indata.com@INDATA.COM (auth:KERBEROS) is not authorized for protocol interface org.apache.hadoop.yarn.server.api.ResourceTrackerPB: this service is only accessible by nm/192.168.1.3@INDATA.COM, while invoking ResourceTrackerPBClientImpl.registerNodeManager over rm2 after 5 failover attempts. Trying to failover after sleeping for 27554ms.
2025-09-17 09:32:23,741 INFO client.ConfiguredRMFailoverProxyProvider (ConfiguredRMFailoverProxyProvider.java:performFailover(100)) - Failing over to rm1
2025-09-17 09:32:23,749 INFO retry.RetryInvocationHandler (RetryInvocationHandler.java:log(411)) - java.net.ConnectException: Call From indata-192-168-1-3.indata.com/192.168.1.3 to indata-192-168-1-1.indata.com:8031 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ResourceTrackerPBClientImpl.registerNodeManager over rm1 after 6 failover attempts. Trying to failover after sleeping for 35229ms.
Summary of the exceptions:
Authorization exception
- Error message:
org.apache.hadoop.security.authorize.AuthorizationException: User nm/indata-192-168-1-3.indata.com@INDATA.COM (auth:KERBEROS) is not authorized for protocol interface org.apache.hadoop.yarn.server.api.ResourceTrackerPB: this service is only accessible by nm/192.168.1.3@INDATA.COM
- Meaning: the Kerberos-authenticated user nm/indata-192-168-1-3.indata.com@INDATA.COM is not authorized; the service only allows access by nm/192.168.1.3@INDATA.COM
- Where it occurs: repeatedly, when the NodeManager registers with rm2
Connection refused exception
- Error message:
java.net.ConnectException: Call From indata-192-168-1-3.indata.com/192.168.1.3 to indata-192-168-1-1.indata.com:8031 failed on connection exception: java.net.ConnectException: Connection refused
- Meaning: the connection from indata-192-168-1-3.indata.com to indata-192-168-1-1.indata.com:8031 was refused
- Where it occurs: repeatedly, when the NodeManager tries to register with rm1
Overall picture:
- The client keeps failing over between rm1 and rm2
- Connections to rm2 fail with the authorization error
- Connections to rm1 cannot be established at all (port 8031 refuses the connection)
- The cycle repeats, and the NodeManager never manages to register with either ResourceManager
- After many retries, the NodeManager ultimately fails to start
Notes
- rm1 is currently Standby; a Standby ResourceManager does not listen on port 8031, hence the connection failure
- rm2 is currently Active; its connections fail because of the authorization error
First, switching rm1 to Active
- Watching the logs after the switch, no authorization error was reported when connecting to rm1, and registration succeeded.
- It is therefore reasonable to conclude that the problem is an authorization failure; but it was not yet clear why authorization failed against rm2 while succeeding against rm1.
Final resolution
After much analysis and experimentation, the root cause was finally found by reading the source code.
Source code
Searching the source for the log message leads to ServiceAuthorizationManager; the key authorization-failure logic is in its authorize method:
public void authorize(UserGroupInformation user, Class<?> protocol,
    Configuration conf, InetAddress addr) throws AuthorizationException {
  AccessControlList[] acls = protocolToAcls.get(protocol);
  MachineList[] hosts = protocolToMachineLists.get(protocol);
  if (acls == null || hosts == null) {
    throw new AuthorizationException("Protocol " + protocol
        + " is not known.");
  }

  // get client principal key to verify (if available)
  KerberosInfo krbInfo = SecurityUtil.getKerberosInfo(protocol, conf);
  String clientPrincipal = null;
  if (krbInfo != null) {
    String clientKey = krbInfo.clientPrincipal();
    if (clientKey != null && !clientKey.isEmpty()) {
      try {
        clientPrincipal = SecurityUtil.getServerPrincipal(
            conf.get(clientKey), addr);
      } catch (IOException e) {
        throw (AuthorizationException) new AuthorizationException(
            "Can't figure out Kerberos principal name for connection from "
                + addr + " for user=" + user + " protocol=" + protocol)
            .initCause(e);
      }
    }
  }
  if ((clientPrincipal != null && !clientPrincipal.equals(user.getUserName()))
      || acls.length != 2 || !acls[0].isUserAllowed(user)
      || acls[1].isUserAllowed(user)) {
    String cause = clientPrincipal != null
        ? ": this service is only accessible by " + clientPrincipal
        : ": denied by configured ACL";
    AUDITLOG.warn(AUTHZ_FAILED_FOR + user + " for protocol=" + protocol
        + cause);
    throw new AuthorizationException("User " + user
        + " is not authorized for protocol " + protocol + cause);
  }
  if (addr != null) {
    String hostAddress = addr.getHostAddress();
    if (hosts.length != 2 || !hosts[0].includes(hostAddress)
        || hosts[1].includes(hostAddress)) {
      AUDITLOG.warn(AUTHZ_FAILED_FOR + " for protocol=" + protocol
          + " from host = " + hostAddress);
      throw new AuthorizationException("Host " + hostAddress
          + " is not authorized for protocol " + protocol);
    }
  }
  AUDITLOG.info(AUTHZ_SUCCESSFUL_FOR + user + " for protocol=" + protocol);
}
Error trigger condition
In the code above, clientPrincipal != null && !clientPrincipal.equals(user.getUserName()) evaluates to true:
- clientPrincipal: the "allowed client principal" that the ResourceManager computes from configuration (nm/192.168.1.3@INDATA.COM)
- user.getUserName(): the Kerberos principal the NodeManager actually authenticated with (nm/indata-192-168-1-3.indata.com@INDATA.COM)
- The two are not equal, so an AuthorizationException is thrown immediately, with the message "this service is only accessible by nm/192.168.1.3@INDATA.COM"
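The failing comparison can be sketched in isolation. This is a minimal illustration, not Hadoop code; the class and method names are made up, and the two principal strings are taken from the log above.

```java
// Sketch of the check that fails in ServiceAuthorizationManager.authorize:
// if an expected client principal was computed and it does not equal the
// caller's Kerberos name, the request is rejected.
public class PrincipalCheck {
    static String check(String clientPrincipal, String userName) {
        if (clientPrincipal != null && !clientPrincipal.equals(userName)) {
            return "User " + userName
                + " is not authorized: this service is only accessible by "
                + clientPrincipal;
        }
        return "OK";
    }

    public static void main(String[] args) {
        // What rm2 computed vs. what the NodeManager's keytab contains
        String expected = "nm/192.168.1.3@INDATA.COM";
        String actual = "nm/indata-192-168-1-3.indata.com@INDATA.COM";
        System.out.println(check(expected, actual)); // rejected
        System.out.println(check(actual, actual));   // accepted, as on rm1
    }
}
```

Note that the check is a plain string comparison: the IP form and the hostname form of the same node are different principals as far as authorization is concerned.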
Where clientPrincipal comes from
clientPrincipal is produced by the following code:
KerberosInfo krbInfo = SecurityUtil.getKerberosInfo(protocol, conf);
String clientKey = krbInfo.clientPrincipal(); // look up the configuration key naming the protocol's client principal
clientPrincipal = SecurityUtil.getServerPrincipal(conf.get(clientKey), addr);
- protocol: ResourceTrackerPB, the NodeManager registration protocol
- krbInfo.clientPrincipal(): via the @KerberosInfo annotation, returns the configuration key naming the client principal for this protocol; for ResourceTrackerPB that key is yarn.nodemanager.principal (defined by Hadoop's built-in annotation)
- SecurityUtil.getServerPrincipal(conf.get(clientKey), addr): builds clientPrincipal from the value of yarn.nodemanager.principal and the client IP (addr)
Why is clientPrincipal nm/192.168.1.3@INDATA.COM?
SecurityUtil.getServerPrincipal parses the value of yarn.nodemanager.principal:
- yarn.nodemanager.principal is configured as nm/_HOST@INDATA.COM
- _HOST is replaced with the client IP (192.168.1.3) rather than the hostname, producing nm/192.168.1.3@INDATA.COM
Why wasn't _HOST replaced with the hostname?
Verification showed that /etc/hosts on the rm2 node had no entry mapping 192.168.1.3 to its hostname. The inferred _HOST substitution logic:
- The ResourceManager checks whether /etc/hosts on its own node maps the NodeManager's IP address to a hostname
- If the mapping exists, _HOST is replaced with the hostname
- If not, _HOST is replaced with the IP
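The inferred substitution above can be sketched as follows. This is an illustration of the behavior as deduced here, not the real SecurityUtil.getServerPrincipal implementation (which does an actual reverse lookup of addr); a Map stands in for the ResourceManager node's /etc/hosts, and the names and IPs come from the logs above.

```java
import java.util.Map;

public class HostSubstitution {
    // Replace _HOST in the configured principal with the hostname the
    // RM node can resolve for the client IP, falling back to the raw IP
    // when no mapping exists (the rm2 situation).
    static String resolve(String principalConf, String clientIp,
                          Map<String, String> etcHosts) {
        String host = etcHosts.getOrDefault(clientIp, clientIp);
        return principalConf.replace("_HOST", host);
    }

    public static void main(String[] args) {
        String conf = "nm/_HOST@INDATA.COM"; // yarn.nodemanager.principal
        // rm1: mapping present -> principal matches the NM keytab
        System.out.println(resolve(conf, "192.168.1.3",
                Map.of("192.168.1.3", "indata-192-168-1-3.indata.com")));
        // rm2: mapping missing -> _HOST degrades to the raw IP
        System.out.println(resolve(conf, "192.168.1.3", Map.of()));
    }
}
```

With the mapping present this yields nm/indata-192-168-1-3.indata.com@INDATA.COM; without it, nm/192.168.1.3@INDATA.COM, exactly the mismatched pair seen in the log.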
Root cause
- Although /etc/hosts on the 8 new nodes listed the IP-to-hostname mappings for all 16 nodes, among the 8 previously deployed nodes only the rm1 node had the new nodes' mappings added
- When a NodeManager registers, the ResourceManager builds clientPrincipal from the value of yarn.nodemanager.principal (nm/_HOST@INDATA.COM) and the client IP (addr): it checks its own node's /etc/hosts for a hostname mapping of the NodeManager's IP, substituting the hostname if present and the IP otherwise
- On rm2 this means only nm/192.168.1.3@INDATA.COM is allowed, which does not match the keytab principal nm/indata-192-168-1-3.indata.com@INDATA.COM, so authorization fails
- Because /etc/hosts on rm1 was configured correctly, authorization succeeds once rm1 becomes Active
Fix
- Add the IP-to-hostname mappings for the 8 new nodes to /etc/hosts on the rm2 node
- To avoid further surprises, it is advisable to configure the mappings for all nodes in /etc/hosts on every node
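For example, the entry for the node seen in the logs would look like this on rm2 (one such line is needed per newly added node; only the node from the logs is shown here):

```
192.168.1.3   indata-192-168-1-3.indata.com
```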
