
GaussDB cluster failure: cm_ctl: can't connect to cm_server

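1. Problem description

GaussDB cluster in a 3-AZ, 3-replica architecture. After the node servers were restarted, cm_ctl could not connect to cm_server and reported: cm_ctl: can't connect to cm_server.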

[omm@gaussdb03 ~]$ cm_ctl query -Cvpid
[  CMServer State   ]

node             node_ip         instance                             state
-----------------------------------------------------------------------------
1  172.16.60.226 172.16.60.226   1    /data/cluster/data/cm/cm_server Down
2  172.16.60.227 172.16.60.227   2    /data/cluster/data/cm/cm_server Down
3  172.16.60.228 172.16.60.228   3    /data/cluster/data/cm/cm_server Standby

[    ETCD State     ]

node             node_ip         instance                     state
---------------------------------------------------------------------------
1  172.16.60.226 172.16.60.226   7001 /data/cluster/data/etcd Down
2  172.16.60.227 172.16.60.227   7002 /data/cluster/data/etcd Down
3  172.16.60.228 172.16.60.228   7003 /data/cluster/data/etcd Down

cm_ctl: can't connect to cm_server.
Maybe cm_server is not running, or timeout expired. Please try again.

2. Problem analysis

  • On every machine, the cluster component processes (CM, ETCD, GTM, CN, DN) were all still running:
[root@gaussdb03 ~]# ps -ef |grep cluster
omm         5198       1  0 13:43 ?        00:00:06 /data/cluster/core/app/bin/om_monitor -L /data/cluster/logs/gaussdb/omm/cm/om_monitor
omm         5202    5198  9 13:43 ?        00:01:32 /data/cluster/core/app/bin/cm_agent
omm         5214       1  0 13:43 ?        00:00:03 /data/cluster/core/app/bin/etcd -name etcd_7003 --data-dir /data/cluster/data/etcd --client-cert-auth --trusted-ca-file /data/cluster/core/app/share/sslcert/etcd/etcdca.crt --cert-file /data/cluster/data/etcd/etcd.crt --key-file /data/cluster/data/etcd/etcd.key --peer-client-cert-auth --peer-trusted-ca-file /data/cluster/core/app/share/sslcert/etcd/etcdca.crt --peer-cert-file /data/cluster/data/etcd/etcd.crt --peer-key-file /data/cluster/data/etcd/etcd.key -initial-advertise-peer-urls https://172.16.60.228:30320 -listen-peer-urls https://172.16.60.228:30320 -listen-client-urls https://172.16.60.228:30300 -advertise-client-urls https://172.16.60.228:30300 --election-timeout 5000 --heartbeat-interval 1000 --log-outputs stdout --quota-backend-bytes 8589934592 --auto-compaction-mode periodic --auto-compaction-retention 1h -initial-cluster-token etcd-cluster-omm --enable-v2=false -initial-cluster etcd_7001=https://172.16.60.226:30320,etcd_7002=https://172.16.60.227:30320,etcd_7003=https://172.16.60.228:30320 -initial-cluster-state new
omm         5362       1  0 13:43 ?        00:00:00 /data/cluster/core/app/bin/gs_gtm -D /data/cluster/data/gtm -M pending
omm         5369       1 41 13:43 ?        00:06:57 /data/cluster/core/app/bin/gaussdb --coordinator -D /data/cluster/data/cn
omm         5385       1  2 13:43 ?        00:00:29 /data/cluster/core/app/bin/cm_server
omm         5576       1 23 13:43 ?        00:03:56 /data/cluster/core/app/bin/gaussdb --datanode -D /data/cluster/data/dn/dn_6003 -M pending
omm         6225       1 23 13:43 ?        00:03:54 /data/cluster/core/app/bin/gaussdb --datanode -D /data/cluster/data/dn/dn_6006 -M pending
omm         6482       1 23 13:43 ?        00:03:57 /data/cluster/core/app/bin/gaussdb --datanode -D /data/cluster/data/dn/dn_6007 -M pending
root       23084   23031  0 13:59 pts/0    00:00:00 grep cluster
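  • Both CM and ETCD showed Down. According to the official documentation, ETCD must be healthy first, because CM relies on ETCD to elect its primary. So ETCD was investigated first.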
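  • Check the ETCD log. It shows the local member sending pre-vote requests while both peers are unreachable on port 30320 ("i/o timeout" and "no route to host"):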
[omm@gaussdb01 etcd]$ pwd
/data/cluster/logs/gaussdb/omm/cm/etcd
[omm@gaussdb01 etcd]$ view etcd_7001-current.log
{"level":"info","ts":"2025-09-01T14:11:57.175+0800","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"6c461eeb977a77bb [logterm: 5, index: 16182] sent MsgPreVote request to 82a123c2037aba1a at term 5"}
{"level":"info","ts":"2025-09-01T14:11:57.175+0800","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"6c461eeb977a77bb [logterm: 5, index: 16182] sent MsgPreVote request to d354b9b181618c10 at term 5"}
{"level":"warn","ts":"2025-09-01T14:11:57.489+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"82a123c2037aba1a","rtt":"0s","error":"dial tcp 172.16.60.228:30320: i/o timeout"}
{"level":"warn","ts":"2025-09-01T14:11:57.489+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"d354b9b181618c10","rtt":"0s","error":"dial tcp 172.16.60.227:30320: connect: no route to host"}
{"level":"warn","ts":"2025-09-01T14:11:57.489+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"82a123c2037aba1a","rtt":"0s","error":"dial tcp 172.16.60.228:30320: connect: no route to host"}
{"level":"warn","ts":"2025-09-01T14:11:57.489+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"d354b9b181618c10","rtt":"0s","error":"dial tcp 172.16.60.227:30320: connect: no route to host"}
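When the log is long, it can help to collapse the repeated prober warnings into one entry per peer. The following is a minimal sketch, assuming the log is in the JSON-lines form shown above; the function name `summarize_unhealthy_peers` is hypothetical, while the `msg`, `remote-peer-id`, and `error` fields are taken from the excerpt.

```python
import json
from collections import defaultdict

# Two records copied from the etcd_7001 log excerpt above.
SAMPLE = [
    '{"level":"warn","ts":"2025-09-01T14:11:57.489+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"82a123c2037aba1a","rtt":"0s","error":"dial tcp 172.16.60.228:30320: i/o timeout"}',
    '{"level":"warn","ts":"2025-09-01T14:11:57.489+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"d354b9b181618c10","rtt":"0s","error":"dial tcp 172.16.60.227:30320: connect: no route to host"}',
]


def summarize_unhealthy_peers(log_lines):
    """Group 'prober detected unhealthy status' warnings by remote peer.

    Returns a dict mapping remote-peer-id to the set of distinct error strings.
    """
    problems = defaultdict(set)
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log record
        if not isinstance(entry, dict):
            continue
        if entry.get("msg") != "prober detected unhealthy status":
            continue
        peer = entry.get("remote-peer-id", "unknown")
        problems[peer].add(entry.get("error", ""))
    return dict(problems)


summary = summarize_unhealthy_peers(SAMPLE)
for peer, errors in sorted(summary.items()):
    print(peer, sorted(errors))
```

To run it against the real file, pass an open file handle instead of `SAMPLE`, for example `summarize_unhealthy_peers(open("etcd_7001-current.log"))`. If every peer appears with a dial error, the problem is most likely on the network path rather than in ETCD itself.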
  • Check the firewall configuration. firewalld was still enabled and running, so stop it and disable it:
[root@gaussdb03 ~]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2025-09-01 13:42:30 CST; 29min ago
     Docs: man:firewalld(1)
 Main PID: 1334 (firewalld)
    Tasks: 2
   Memory: 34.6M
   CGroup: /system.slice/firewalld.service
           └─1334 /usr/bin/python3 /usr/sbin/firewalld --nofork --nopid

Sep 01 13:42:29 gaussdb03 systemd[1]: Starting firewalld - dynamic firewall daemon...
Sep 01 13:42:30 gaussdb03 systemd[1]: Started firewalld - dynamic firewall daemon.
[root@gaussdb03 ~]# systemctl stop firewalld.service
[root@gaussdb03 ~]# systemctl disable firewalld.service
Removed /etc/systemd/system/multi-user.target.wants/firewalld.service.
Removed /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
  • Check the cluster status again. The cluster is back to normal:
[omm@gaussdb01 ~]$ cm_ctl query -Cv
[  CMServer State   ]

node             instance state
---------------------------------
1  172.16.60.226 1        Standby
2  172.16.60.227 2        Standby
3  172.16.60.228 3        Primary

[    ETCD State     ]

node             instance state
---------------------------------------
1  172.16.60.226 7001     StateFollower
2  172.16.60.227 7002     StateLeader
3  172.16.60.228 7003     StateFollower

[   Cluster State   ]

cluster_state   : Normal
redistributing  : No
balanced        : Yes
current_az      : AZ_ALL

[ Coordinator State ]

node             instance state
---------------------------------
1  172.16.60.226 5001     Normal
2  172.16.60.227 5002     Normal
3  172.16.60.228 5003     Normal

[ Central Coordinator State ]

node             instance state
---------------------------------
2  172.16.60.227 5002     Normal

[     GTM State     ]

node             instance state                    sync_state
-----------------------------------------------------------------
1  172.16.60.226 1001     P Primary Connection ok  Sync
2  172.16.60.227 1002     S Standby Connection ok  Sync
3  172.16.60.228 1003     S Standby Connection ok  Sync

[  Datanode State   ]

node             instance state            | node             instance state            | node             instance state
---------------------------------------------------------------------------------------------------------------------------------------
1  172.16.60.226 6001     P Primary Normal | 2  172.16.60.227 6002     S Standby Normal | 3  172.16.60.228 6003     S Standby Normal
2  172.16.60.227 6004     P Primary Normal | 1  172.16.60.226 6005     S Standby Normal | 3  172.16.60.228 6006     S Standby Normal
3  172.16.60.228 6007     P Primary Normal | 2  172.16.60.227 6008     S Standby Normal | 1  172.16.60.226 6009     S Standby Normal
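The ETCD log errors all concern the peer port 30320, so a plain TCP connect test from each node is a quick way to confirm whether peer traffic is blocked, both before and after the firewall change. The following is a minimal sketch; the helper names `can_connect` and `check_peers` are hypothetical, and the address list is copied from the `-initial-cluster` flag in the `ps` output above.

```python
import socket

# ETCD peer endpoints, taken from the -initial-cluster flag shown earlier.
ETCD_PEERS = [
    ("172.16.60.226", 30320),
    ("172.16.60.227", 30320),
    ("172.16.60.228", 30320),
]


def can_connect(host, port, timeout=3.0):
    """Try a TCP connection; return (True, "") or (False, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        # Covers refused connections, timeouts, and "no route to host".
        return False, str(exc)


def check_peers(peers, timeout=3.0):
    """Return {"host:port": (reachable, reason)} for every peer."""
    return {
        f"{host}:{port}": can_connect(host, port, timeout)
        for host, port in peers
    }
```

Calling `check_peers(ETCD_PEERS)` on each node and printing the result shows which directions are blocked. This only tests that the TCP port accepts connections; it does not perform the TLS handshake that ETCD peers use, so a successful connect is necessary but not sufficient for a healthy ETCD link.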

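3. Summary

The operating system firewall had not been disabled. After the OS restart, firewalld came back up and blocked ETCD peer traffic, so the ETCD members could not reach each other and stayed unhealthy. Because CMS (cm_server) depends on ETCD to elect its primary, CMS was abnormal as well, and cm_ctl could not connect to the instances.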
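The reason blocked peer traffic takes ETCD down completely, rather than just degrading it, is the majority rule of Raft, the consensus protocol etcd uses: a member can only become leader if it can reach a majority of the cluster, itself included. A minimal sketch of that arithmetic follows; the function names are illustrative and not part of etcd or GaussDB.

```python
def quorum(members):
    """Smallest majority of a Raft cluster with `members` voting members."""
    return members // 2 + 1


def can_elect_leader(members, reachable_including_self):
    """True if one member can gather a majority of votes."""
    return reachable_including_self >= quorum(members)


# Three ETCD members, as in this cluster.
print(quorum(3))               # majority needed
print(can_elect_leader(3, 1))  # a member that reaches only itself
print(can_elect_leader(3, 2))  # a member that reaches one peer
```

Assuming each member could reach only itself, as the etcd_7001 log suggests, no member could gather the two votes a three-member cluster needs, which matches all three ETCD instances being reported as Down.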
