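GaussDB cluster failure: cm_ctl: can't connect to cm_server
1. Problem Description
A GaussDB cluster deployed as 3 AZs with 3 replicas could no longer reach cm_server after the node servers were rebooted. cm_ctl reported: cm_ctl: can't connect to cm_server.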
[omm@gaussdb03 ~]$ cm_ctl query -Cvpid
[ CMServer State ]
node node_ip instance state
-----------------------------------------------------------------------------
1 172.16.60.226 172.16.60.226 1 /data/cluster/data/cm/cm_server Down
2 172.16.60.227 172.16.60.227 2 /data/cluster/data/cm/cm_server Down
3 172.16.60.228 172.16.60.228 3 /data/cluster/data/cm/cm_server Standby

[ ETCD State ]
node node_ip instance state
---------------------------------------------------------------------------
1 172.16.60.226 172.16.60.226 7001 /data/cluster/data/etcd Down
2 172.16.60.227 172.16.60.227 7002 /data/cluster/data/etcd Down
3 172.16.60.228 172.16.60.228 7003 /data/cluster/data/etcd Down

cm_ctl: can't connect to cm_server.
Maybe cm_server is not running, or timeout expired. Please try again.
2. Problem Analysis
- Check every machine: the cluster component processes (CM, ETCD, GTM, CN, DN) are all still running
[root@gaussdb03 ~]# ps -ef |grep cluster
omm 5198 1 0 13:43 ? 00:00:06 /data/cluster/core/app/bin/om_monitor -L /data/cluster/logs/gaussdb/omm/cm/om_monitor
omm 5202 5198 9 13:43 ? 00:01:32 /data/cluster/core/app/bin/cm_agent
omm 5214 1 0 13:43 ? 00:00:03 /data/cluster/core/app/bin/etcd -name etcd_7003 --data-dir /data/cluster/data/etcd --client-cert-auth --trusted-ca-file /data/cluster/core/app/share/sslcert/etcd/etcdca.crt --cert-file /data/cluster/data/etcd/etcd.crt --key-file /data/cluster/data/etcd/etcd.key --peer-client-cert-auth --peer-trusted-ca-file /data/cluster/core/app/share/sslcert/etcd/etcdca.crt --peer-cert-file /data/cluster/data/etcd/etcd.crt --peer-key-file /data/cluster/data/etcd/etcd.key -initial-advertise-peer-urls https://172.16.60.228:30320 -listen-peer-urls https://172.16.60.228:30320 -listen-client-urls https://172.16.60.228:30300 -advertise-client-urls https://172.16.60.228:30300 --election-timeout 5000 --heartbeat-interval 1000 --log-outputs stdout --quota-backend-bytes 8589934592 --auto-compaction-mode periodic --auto-compaction-retention 1h -initial-cluster-token etcd-cluster-omm --enable-v2=false -initial-cluster etcd_7001=https://172.16.60.226:30320,etcd_7002=https://172.16.60.227:30320,etcd_7003=https://172.16.60.228:30320 -initial-cluster-state new
omm 5362 1 0 13:43 ? 00:00:00 /data/cluster/core/app/bin/gs_gtm -D /data/cluster/data/gtm -M pending
omm 5369 1 41 13:43 ? 00:06:57 /data/cluster/core/app/bin/gaussdb --coordinator -D /data/cluster/data/cn
omm 5385 1 2 13:43 ? 00:00:29 /data/cluster/core/app/bin/cm_server
omm 5576 1 23 13:43 ? 00:03:56 /data/cluster/core/app/bin/gaussdb --datanode -D /data/cluster/data/dn/dn_6003 -M pending
omm 6225 1 23 13:43 ? 00:03:54 /data/cluster/core/app/bin/gaussdb --datanode -D /data/cluster/data/dn/dn_6006 -M pending
omm 6482 1 23 13:43 ? 00:03:57 /data/cluster/core/app/bin/gaussdb --datanode -D /data/cluster/data/dn/dn_6007 -M pending
root 23084 23031 0 13:59 pts/0 00:00:00 grep cluster
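- Both CM and ETCD are reported as Down. According to the official documentation, ETCD must be healthy first, because CM relies on ETCD to elect its primary
- Check the ETCD log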
[omm@gaussdb01 etcd]$ pwd
/data/cluster/logs/gaussdb/omm/cm/etcd
[omm@gaussdb01 etcd]$ view etcd_7001-current.log
{"level":"info","ts":"2025-09-01T14:11:57.175+0800","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"6c461eeb977a77bb [logterm: 5, index: 16182] sent MsgPreVote request to 82a123c2037aba1a at term 5"}
{"level":"info","ts":"2025-09-01T14:11:57.175+0800","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"6c461eeb977a77bb [logterm: 5, index: 16182] sent MsgPreVote request to d354b9b181618c10 at term 5"}
{"level":"warn","ts":"2025-09-01T14:11:57.489+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"82a123c2037aba1a","rtt":"0s","error":"dial tcp 172.16.60.228:30320: i/o timeout"}
{"level":"warn","ts":"2025-09-01T14:11:57.489+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"d354b9b181618c10","rtt":"0s","error":"dial tcp 172.16.60.227:30320: connect: no route to host"}
{"level":"warn","ts":"2025-09-01T14:11:57.489+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"82a123c2037aba1a","rtt":"0s","error":"dial tcp 172.16.60.228:30320: connect: no route to host"}
{"level":"warn","ts":"2025-09-01T14:11:57.489+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"d354b9b181618c10","rtt":"0s","error":"dial tcp 172.16.60.227:30320: connect: no route to host"}
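When the log is long, it can help to condense these prober warnings into one line per distinct error. The following is a minimal sketch, not part of the original troubleshooting session: the function name summarize_etcd_peer_errors is hypothetical, and it assumes the log keeps the one-JSON-object-per-line layout shown above, with the message text "prober detected unhealthy status" and an "error" field.

```shell
#!/usr/bin/env bash
# Summarise peer-connection errors from an etcd JSON log.
# Usage: summarize_etcd_peer_errors <logfile>
# (Hypothetical helper; assumes one JSON object per line, as in the log above.)
summarize_etcd_peer_errors() {
    local logfile="$1"
    # 1. keep only the prober warnings
    # 2. extract the value of the "error" field
    # 3. count identical messages, most frequent first
    grep -F '"msg":"prober detected unhealthy status"' "$logfile" \
        | sed -n 's/.*"error":"\([^"]*\)".*/\1/p' \
        | sort | uniq -c | sort -rn
}
```

For example, on gaussdb01 it could be called as summarize_etcd_peer_errors /data/cluster/logs/gaussdb/omm/cm/etcd/etcd_7001-current.log. Each output line is a count followed by the error text, so repeated failures towards the same peer address and port stand out immediately.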
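- The log shows that the peer port 30320 on the other nodes cannot be reached ("no route to host", "i/o timeout"), which points to a firewall. Check the firewall configuration: firewalld is still running and enabled, so stop and disable it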
[root@gaussdb03 ~]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2025-09-01 13:42:30 CST; 29min ago
     Docs: man:firewalld(1)
 Main PID: 1334 (firewalld)
    Tasks: 2
   Memory: 34.6M
   CGroup: /system.slice/firewalld.service
           └─1334 /usr/bin/python3 /usr/sbin/firewalld --nofork --nopid

Sep 01 13:42:29 gaussdb03 systemd[1]: Starting firewalld - dynamic firewall daemon...
Sep 01 13:42:30 gaussdb03 systemd[1]: Started firewalld - dynamic firewall daemon.
[root@gaussdb03 ~]# systemctl stop firewalld.service
[root@gaussdb03 ~]# systemctl disable firewalld.service
Removed /etc/systemd/system/multi-user.target.wants/firewalld.service.
Removed /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
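Once the firewall has been changed on every node, the ETCD peer port can be probed directly instead of waiting for the next log entries. The sketch below is an illustration under stated assumptions, not a GaussDB tool: check_port and check_etcd_peers are hypothetical names, the node IPs and peer port 30320 are taken from the etcd command line in the process list above, and it relies on bash's /dev/tcp redirection plus the coreutils timeout command being available.

```shell
#!/usr/bin/env bash
# check_port HOST PORT [TIMEOUT_SECONDS]
# Prints "HOST:PORT reachable" (returns 0) or "HOST:PORT unreachable" (returns 1).
check_port() {
    local host="$1" port="$2" secs="${3:-3}"
    # bash opens a TCP connection when a redirection targets /dev/tcp/HOST/PORT;
    # timeout bounds the attempt so silently dropped packets cannot hang the check.
    if timeout "$secs" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "$host:$port reachable"
    else
        echo "$host:$port unreachable"
        return 1
    fi
}

# Probe the etcd peer port on all three nodes.
# IPs and port are the values shown in this cluster's etcd command line.
check_etcd_peers() {
    local rc=0 ip
    for ip in 172.16.60.226 172.16.60.227 172.16.60.228; do
        check_port "$ip" 30320 || rc=1
    done
    return "$rc"
}
```

Running check_etcd_peers on each of the three nodes covers every direction of peer traffic. A successful TCP connect only shows that the port is no longer filtered; it says nothing about the TLS handshake or ETCD health, which cm_ctl query still has to confirm.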
- Check the cluster state again: it is back to normal
[omm@gaussdb01 ~]$ cm_ctl query -Cv
[ CMServer State ]
node instance state
---------------------------------
1 172.16.60.226 1 Standby
2 172.16.60.227 2 Standby
3 172.16.60.228 3 Primary

[ ETCD State ]
node instance state
---------------------------------------
1 172.16.60.226 7001 StateFollower
2 172.16.60.227 7002 StateLeader
3 172.16.60.228 7003 StateFollower

[ Cluster State ]
cluster_state : Normal
redistributing : No
balanced : Yes
current_az : AZ_ALL

[ Coordinator State ]
node instance state
---------------------------------
1 172.16.60.226 5001 Normal
2 172.16.60.227 5002 Normal
3 172.16.60.228 5003 Normal

[ Central Coordinator State ]
node instance state
---------------------------------
2 172.16.60.227 5002 Normal

[ GTM State ]
node instance state sync_state
-----------------------------------------------------------------
1 172.16.60.226 1001 P Primary Connection ok Sync
2 172.16.60.227 1002 S Standby Connection ok Sync
3 172.16.60.228 1003 S Standby Connection ok Sync

[ Datanode State ]
node instance state | node instance state | node instance state
---------------------------------------------------------------------------------------------------------------------------------------
1 172.16.60.226 6001 P Primary Normal | 2 172.16.60.227 6002 S Standby Normal | 3 172.16.60.228 6003 S Standby Normal
2 172.16.60.227 6004 P Primary Normal | 1 172.16.60.226 6005 S Standby Normal | 3 172.16.60.228 6006 S Standby Normal
3 172.16.60.228 6007 P Primary Normal | 2 172.16.60.227 6008 S Standby Normal | 1 172.16.60.226 6009 S Standby Normal
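3. Summary
The operating system firewall had not been disabled, so it was active again after the reboot. The ETCD members could not connect to the other nodes and ETCD stayed unhealthy, which in turn left CMS in an abnormal state, so cm_ctl could not connect to the instances.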