Notes on clickhouse-server failing to connect to clickhouse-keeper
Background
I wanted to deploy a simple cluster with 1 shard, 2 replicas, and 1 keeper.
There are two VMs: 192.168.1.3 and 192.168.1.6.
192.168.1.3: runs one clickhouse-server and one clickhouse-keeper
192.168.1.6: runs one clickhouse-server
The clickhouse-server instances on 192.168.1.3 and 192.168.1.6 form one shard and act as replicas of each other.
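For context, this is roughly what the matching cluster definition could look like; the cluster name cluster_1s_2r, the file name, and the internal_replication setting are my own illustrative choices, not taken from the original setup:
config.d/cluster.xml
<clickhouse>
    <remote_servers>
        <cluster_1s_2r>
            <shard>
                <!-- let replicated tables handle replication themselves -->
                <internal_replication>true</internal_replication>
                <replica>
                    <host>server1</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>server2</host>
                    <port>9000</port>
                </replica>
            </shard>
        </cluster_1s_2r>
    </remote_servers>
</clickhouse>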
My configuration
clickhouse-server:
config.d/house-keeper.xml
<clickhouse>
    <zookeeper>
        <node index="1">
            <host>server1</host>
            <port>9181</port>
        </node>
    </zookeeper>
</clickhouse>
clickhouse-keeper:
<clickhouse>
    <path>/data/keeper/conf</path>
    <keeper_server>
        <tcp_port>9181</tcp_port>
        <listen_host>0.0.0.0</listen_host> <!-- IPv4 -->
        <listen_host>::</listen_host>      <!-- IPv6 -->
        <server_id>1</server_id>
        <log_storage_path>/data/keeper/log</log_storage_path>
        <snapshot_storage_path>/data/keeper/snapshots</snapshot_storage_path>
        <create_snapshot_on_exit>0</create_snapshot_on_exit>
        <digest_enabled>1</digest_enabled>
        <coordination_settings>
            <operation_timeout_ms>10000</operation_timeout_ms>
            <session_timeout_ms>100000</session_timeout_ms>
            <min_session_timeout_ms>10000</min_session_timeout_ms>
            <force_sync>false</force_sync>
            <startup_timeout>240000</startup_timeout>
            <!-- we want all logs for complex problems investigation -->
            <reserved_log_items>100000</reserved_log_items>
            <snapshot_distance>100000</snapshot_distance>
            <!-- For instant start in single node configuration -->
            <heart_beat_interval_ms>0</heart_beat_interval_ms>
            <election_timeout_lower_bound_ms>0</election_timeout_lower_bound_ms>
            <election_timeout_upper_bound_ms>0</election_timeout_upper_bound_ms>
            <compress_logs>0</compress_logs>
            <async_replication>1</async_replication>
            <latest_logs_cache_size_threshold>1073741824</latest_logs_cache_size_threshold>
            <commit_logs_cache_size_threshold>524288000</commit_logs_cache_size_threshold>
            <raft_logs_level>trace</raft_logs_level>
        </coordination_settings>
        <raft_configuration>
            <server>
                <id>1</id>
                <hostname>server1</hostname>
                <port>9234</port>
            </server>
        </raft_configuration>
        <feature_flags>
            <filtered_list>1</filtered_list>
            <multi_read>1</multi_read>
            <check_not_exists>1</check_not_exists>
            <create_if_not_exists>1</create_if_not_exists>
        </feature_flags>
    </keeper_server>
</clickhouse>
The hostnames used above resolve to:
server1: 192.168.1.3
server2: 192.168.1.6
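Assuming the names are resolved through /etc/hosts on each VM (my assumption; the post does not show how the names are resolved), the entries would be:
192.168.1.3 server1
192.168.1.6 server2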
Problem
After starting all of the servers involved, the following errors appeared:
1. On the clickhouse-server side:
<Error> virtual bool DB::DDLWorker::initializeMainThread(): Code: 999. Coordination::Exception: All connection tries failed while connecting to ZooKeeper. nodes:
2. On the keeper side, I checked which ports were actually open for external connections:
netstat -tulnp | grep 9181
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 127.0.0.1:9181 0.0.0.0:* LISTEN 19010/./clickhouse-
tcp6 0 0 ::1:9181 :::* LISTEN 19010/./clickhouse-
Something was clearly wrong: the keeper was only listening on the loopback addresses, even though I had explicitly configured listen_host as 0.0.0.0.
Solution
After puzzling over it for a while, I decided to look at the source code to see how listen_host is picked up here.
Reading the source: when listen_hosts is empty, the keeper falls back to listening on only the following two addresses.
std::vector<std::string> listen_hosts = DB::getMultipleValuesFromConfig(config(), "", "listen_host");
bool listen_try = config().getBool("listen_try", false);
if (listen_hosts.empty())
{
    // fallback: no listen_host configured -> listen only on loopback
    listen_hosts.emplace_back("::1");
    listen_hosts.emplace_back("127.0.0.1");
    listen_try = true;
}
That seemed odd. Could it be that my listen_host was in the wrong place?
Reading further, I found that the code reads the listen_host tags that sit directly under the <clickhouse> tag.
So I moved listen_host so that it sits directly under the <clickhouse> tag (it had previously been placed under <keeper_server>), restarted all three servers, and the connection now succeeded.
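For reference, a minimal sketch of the corrected placement, keeping the same values as the config above:
<clickhouse>
    <listen_host>0.0.0.0</listen_host> <!-- IPv4 -->
    <listen_host>::</listen_host>      <!-- IPv6 -->
    <path>/data/keeper/conf</path>
    <keeper_server>
        <tcp_port>9181</tcp_port>
        <!-- listen_host no longer lives here; the remaining settings are unchanged -->
    </keeper_server>
</clickhouse>
After the restart, the same netstat check should show the keeper bound to 0.0.0.0:9181 (and [::]:9181) instead of only the loopback addresses.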
After the fix, the keeper log showed the clickhouse-server connections coming in, clickhouse-server no longer logged any errors about connecting to keeper, and the DDLWorker thread was no longer blocked by the failed keeper connection.
On 192.168.1.3, I connected to the keeper with a keeper client and ran the cons four-letter-word command, which showed that all of the clickhouse-server instances were now connected to clickhouse-keeper.
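As an aside, four-letter-word commands like cons can also be sent straight to the keeper's client port with nc (the address below just reuses the IP and port from this setup):
echo cons | nc 192.168.1.3 9181
cons lists the connection/session details of every client attached to the keeper, so both clickhouse-server instances should appear in its output.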
Summary
A very basic mistake, but it cost a lot of time to track down. Writing it down so I don't trip over it again.