GaussDB cluster failure: cm_ctl: can't connect to cm_server

news 2025/9/2 14:13:55

1. Problem description

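A GaussDB cluster deployed across 3 AZs with 3 replicas could no longer reach cm_server after the node servers were rebooted. cm_ctl reported: cm_ctl: can't connect to cm_server.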

[omm@gaussdb03 ~]$ cm_ctl query -Cvpid
[  CMServer State   ]

node             node_ip         instance                             state
-----------------------------------------------------------------------------
1  172.16.60.226 172.16.60.226   1    /data/cluster/data/cm/cm_server Down
2  172.16.60.227 172.16.60.227   2    /data/cluster/data/cm/cm_server Down
3  172.16.60.228 172.16.60.228   3    /data/cluster/data/cm/cm_server Standby

[    ETCD State     ]

node             node_ip         instance                     state
---------------------------------------------------------------------------
1  172.16.60.226 172.16.60.226   7001 /data/cluster/data/etcd Down
2  172.16.60.227 172.16.60.227   7002 /data/cluster/data/etcd Down
3  172.16.60.228 172.16.60.228   7003 /data/cluster/data/etcd Down

cm_ctl: can't connect to cm_server.
Maybe cm_server is not running, or timeout expired. Please try again.

2. Problem analysis

A check on each machine showed that the cluster component processes (CM, ETCD, GTM, CN, DN) were all still running:

[root@gaussdb03 ~]# ps -ef |grep cluster
omm         5198       1  0 13:43 ?        00:00:06 /data/cluster/core/app/bin/om_monitor -L /data/cluster/logs/gaussdb/omm/cm/om_monitor
omm         5202    5198  9 13:43 ?        00:01:32 /data/cluster/core/app/bin/cm_agent
omm         5214       1  0 13:43 ?        00:00:03 /data/cluster/core/app/bin/etcd -name etcd_7003 --data-dir /data/cluster/data/etcd --client-cert-auth --trusted-ca-file /data/cluster/core/app/share/sslcert/etcd/etcdca.crt --cert-file /data/cluster/data/etcd/etcd.crt --key-file /data/cluster/data/etcd/etcd.key --peer-client-cert-auth --peer-trusted-ca-file /data/cluster/core/app/share/sslcert/etcd/etcdca.crt --peer-cert-file /data/cluster/data/etcd/etcd.crt --peer-key-file /data/cluster/data/etcd/etcd.key -initial-advertise-peer-urls https://172.16.60.228:30320 -listen-peer-urls https://172.16.60.228:30320 -listen-client-urls https://172.16.60.228:30300 -advertise-client-urls https://172.16.60.228:30300 --election-timeout 5000 --heartbeat-interval 1000 --log-outputs stdout --quota-backend-bytes 8589934592 --auto-compaction-mode periodic --auto-compaction-retention 1h -initial-cluster-token etcd-cluster-omm --enable-v2=false -initial-cluster etcd_7001=https://172.16.60.226:30320,etcd_7002=https://172.16.60.227:30320,etcd_7003=https://172.16.60.228:30320 -initial-cluster-state new
omm         5362       1  0 13:43 ?        00:00:00 /data/cluster/core/app/bin/gs_gtm -D /data/cluster/data/gtm -M pending
omm         5369       1 41 13:43 ?        00:06:57 /data/cluster/core/app/bin/gaussdb --coordinator -D /data/cluster/data/cn
omm         5385       1  2 13:43 ?        00:00:29 /data/cluster/core/app/bin/cm_server
omm         5576       1 23 13:43 ?        00:03:56 /data/cluster/core/app/bin/gaussdb --datanode -D /data/cluster/data/dn/dn_6003 -M pending
omm         6225       1 23 13:43 ?        00:03:54 /data/cluster/core/app/bin/gaussdb --datanode -D /data/cluster/data/dn/dn_6006 -M pending
omm         6482       1 23 13:43 ?        00:03:57 /data/cluster/core/app/bin/gaussdb --datanode -D /data/cluster/data/dn/dn_6007 -M pending
root       23084   23031  0 13:59 pts/0    00:00:00 grep cluster

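No CM server was Primary and every ETCD instance was reported as Down. According to the official documentation, ETCD has to be healthy first, because CM relies on ETCD to elect its primary. The next step was therefore to check the ETCD log: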

[omm@gaussdb01 etcd]$ pwd
/data/cluster/logs/gaussdb/omm/cm/etcd
[omm@gaussdb01 etcd]$ view etcd_7001-current.log
{"level":"info","ts":"2025-09-01T14:11:57.175+0800","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"6c461eeb977a77bb [logterm: 5, index: 16182] sent MsgPreVote request to 82a123c2037aba1a at term 5"}
{"level":"info","ts":"2025-09-01T14:11:57.175+0800","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"6c461eeb977a77bb [logterm: 5, index: 16182] sent MsgPreVote request to d354b9b181618c10 at term 5"}
{"level":"warn","ts":"2025-09-01T14:11:57.489+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"82a123c2037aba1a","rtt":"0s","error":"dial tcp 172.16.60.228:30320: i/o timeout"}
{"level":"warn","ts":"2025-09-01T14:11:57.489+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"d354b9b181618c10","rtt":"0s","error":"dial tcp 172.16.60.227:30320: connect: no route to host"}
{"level":"warn","ts":"2025-09-01T14:11:57.489+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"82a123c2037aba1a","rtt":"0s","error":"dial tcp 172.16.60.228:30320: connect: no route to host"}
{"level":"warn","ts":"2025-09-01T14:11:57.489+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"d354b9b181618c10","rtt":"0s","error":"dial tcp 172.16.60.227:30320: connect: no route to host"}
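The warnings above can be tallied per peer address to make the pattern easier to see. The following is a minimal Python sketch, assuming the JSON-lines log format shown in the excerpt; the function name `summarize_peer_errors` and the embedded sample are illustrative and not part of any GaussDB or etcd tooling.

```python
import json
import re
from collections import Counter

# Three lines copied from the log excerpt above (one info, two warnings).
SAMPLE_LINES = [
    '{"level":"info","ts":"2025-09-01T14:11:57.175+0800","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"6c461eeb977a77bb [logterm: 5, index: 16182] sent MsgPreVote request to 82a123c2037aba1a at term 5"}',
    '{"level":"warn","ts":"2025-09-01T14:11:57.489+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"82a123c2037aba1a","rtt":"0s","error":"dial tcp 172.16.60.228:30320: i/o timeout"}',
    '{"level":"warn","ts":"2025-09-01T14:11:57.489+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"d354b9b181618c10","rtt":"0s","error":"dial tcp 172.16.60.227:30320: connect: no route to host"}',
]

# Splits "dial tcp HOST:PORT: REASON" into (HOST:PORT, REASON).
DIAL_RE = re.compile(r"^dial tcp (\S+): (.*)$")


def summarize_peer_errors(lines):
    """Count peer-probe warnings per (address, reason)."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank, truncated, or non-JSON lines
        if not isinstance(entry, dict):
            continue
        if entry.get("msg") != "prober detected unhealthy status":
            continue
        error = entry.get("error", "")
        match = DIAL_RE.match(error)
        key = match.groups() if match else (None, error)
        counts[key] += 1
    return counts


for (address, reason), n in summarize_peer_errors(SAMPLE_LINES).most_common():
    print(f"{address}  {reason}  x{n}")
```

To run it against the real log, pass an open file handle (for example the `etcd_7001-current.log` file viewed above) instead of `SAMPLE_LINES`. If every peer address fails at the TCP dial stage, the problem is more likely in the network path than in etcd itself.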

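The connections to the peers on port 30320 failed with "no route to host" and "i/o timeout", which pointed at the network path rather than at ETCD itself, so the firewall configuration was checked next. firewalld was still running and enabled at boot, so it was stopped and disabled: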

[root@gaussdb03 ~]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2025-09-01 13:42:30 CST; 29min ago
     Docs: man:firewalld(1)
 Main PID: 1334 (firewalld)
    Tasks: 2
   Memory: 34.6M
   CGroup: /system.slice/firewalld.service
           └─1334 /usr/bin/python3 /usr/sbin/firewalld --nofork --nopid

Sep 01 13:42:29 gaussdb03 systemd[1]: Starting firewalld - dynamic firewall daemon...
Sep 01 13:42:30 gaussdb03 systemd[1]: Started firewalld - dynamic firewall daemon.
[root@gaussdb03 ~]# systemctl stop firewalld.service
[root@gaussdb03 ~]# systemctl disable firewalld.service
Removed /etc/systemd/system/multi-user.target.wants/firewalld.service.
Removed /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
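Whether the firewall change actually restored connectivity can be checked with a plain TCP probe from each node. The sketch below is only an illustration: the node addresses and the ports 30320 (peer) and 30300 (client) are taken from the etcd command line in the `ps` output above, while the helper names `probe` and `check_all` are made up for this example. A successful TCP connection only shows that the port is reachable; it says nothing about TLS or etcd health.

```python
import socket

# Addresses and ports as they appear in the etcd command line earlier in this article.
NODES = ["172.16.60.226", "172.16.60.227", "172.16.60.228"]
PORTS = [30320, 30300]  # etcd peer port, etcd client port


def probe(host, port, timeout=3.0):
    """Return None if a TCP connection succeeds, otherwise a short error string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as exc:  # covers timeouts, refused connections, no route to host
        return f"{type(exc).__name__}: {exc}"


def check_all(nodes, ports, timeout=3.0):
    """Probe every (node, port) pair and return a dict of the failures only."""
    failures = {}
    for host in nodes:
        for port in ports:
            error = probe(host, port, timeout)
            if error is not None:
                failures[(host, port)] = error
    return failures


# Example call, to be run on one of the cluster nodes:
#     for target, error in check_all(NODES, PORTS).items():
#         print(target, error)
```

An empty result from `check_all` means every etcd port accepted a connection from the node where the script ran. Repeating the check on all three nodes covers both directions of each link.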

Checking the cluster state again showed that it was back to normal:

[omm@gaussdb01 ~]$ cm_ctl query -Cv
[  CMServer State   ]

node             instance state
---------------------------------
1  172.16.60.226 1        Standby
2  172.16.60.227 2        Standby
3  172.16.60.228 3        Primary

[    ETCD State     ]

node             instance state
---------------------------------------
1  172.16.60.226 7001     StateFollower
2  172.16.60.227 7002     StateLeader
3  172.16.60.228 7003     StateFollower

[   Cluster State   ]

cluster_state   : Normal
redistributing  : No
balanced        : Yes
current_az      : AZ_ALL

[ Coordinator State ]

node             instance state
---------------------------------
1  172.16.60.226 5001     Normal
2  172.16.60.227 5002     Normal
3  172.16.60.228 5003     Normal

[ Central Coordinator State ]

node             instance state
---------------------------------
2  172.16.60.227 5002     Normal

[     GTM State     ]

node             instance state                    sync_state
-----------------------------------------------------------------
1  172.16.60.226 1001     P Primary Connection ok  Sync
2  172.16.60.227 1002     S Standby Connection ok  Sync
3  172.16.60.228 1003     S Standby Connection ok  Sync

[  Datanode State   ]

node             instance state            | node             instance state            | node             instance state
---------------------------------------------------------------------------------------------------------------------------------------
1  172.16.60.226 6001     P Primary Normal | 2  172.16.60.227 6002     S Standby Normal | 3  172.16.60.228 6003     S Standby Normal
2  172.16.60.227 6004     P Primary Normal | 1  172.16.60.226 6005     S Standby Normal | 3  172.16.60.228 6006     S Standby Normal
3  172.16.60.228 6007     P Primary Normal | 2  172.16.60.227 6008     S Standby Normal | 1  172.16.60.226 6009     S Standby Normal
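The difference between the failing and the healthy output comes down to three facts: one CM server is Primary, one ETCD instance is StateLeader, and cluster_state is Normal. A rough Python sketch of that check follows. It assumes the `cm_ctl query` output layout shown in this article, which may differ between GaussDB versions; the function names and the shortened samples are illustrative only.

```python
import re

# Section headers look like "[  CMServer State   ]".
SECTION_RE = re.compile(r"\[\s*([A-Za-z ]+?)\s*\]")

HEALTHY = """
[  CMServer State   ]
1  172.16.60.226 1        Standby
2  172.16.60.227 2        Standby
3  172.16.60.228 3        Primary
[    ETCD State     ]
1  172.16.60.226 7001     StateFollower
2  172.16.60.227 7002     StateLeader
3  172.16.60.228 7003     StateFollower
[   Cluster State   ]
cluster_state   : Normal
[     GTM State     ]
1  172.16.60.226 1001     P Primary Connection ok  Sync
"""

FAILING = """
[  CMServer State   ]
1  172.16.60.226 172.16.60.226   1    /data/cluster/data/cm/cm_server Down
2  172.16.60.227 172.16.60.227   2    /data/cluster/data/cm/cm_server Down
3  172.16.60.228 172.16.60.228   3    /data/cluster/data/cm/cm_server Standby
[    ETCD State     ]
1  172.16.60.226 172.16.60.226   7001 /data/cluster/data/etcd Down
cm_ctl: can't connect to cm_server.
"""


def split_sections(text):
    """Map each section name to the text that follows its header."""
    parts = SECTION_RE.split(text)  # [preamble, name1, body1, name2, body2, ...]
    return {parts[i]: parts[i + 1] for i in range(1, len(parts), 2)}


def summarize(text):
    """Extract the three facts that distinguish a healthy cluster in this article."""
    sections = split_sections(text)
    state = re.search(r"cluster_state\s*:\s*(\w+)", sections.get("Cluster State", ""))
    return {
        "cm_primaries": len(re.findall(r"\bPrimary\b", sections.get("CMServer State", ""))),
        "etcd_leaders": len(re.findall(r"\bStateLeader\b", sections.get("ETCD State", ""))),
        "cluster_state": state.group(1) if state else None,
    }


print(summarize(HEALTHY))
print(summarize(FAILING))
```

The text is split by section first because "Primary" also appears in the GTM and Datanode sections; counting it across the whole output would overstate the number of CM primaries.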