Deploying SigNoz with an External Highly Available ClickHouse Cluster
Table of Contents
- Overview
- Architecture
- Deployment
  - ClickHouse
    - Cluster deployment
    - Database initialization
  - SigNoz deployment
- Collector pipeline
  - Step 1: OTLP → Kafka
  - Step 2: Kafka → ClickHouse
- Sampling logic
- Onboarding applications
Overview
- Goal: improve the reliability and availability of SigNoz trace/log/metric storage.
- Background: the ClickHouse instance bundled with SigNoz offers only limited high availability; production environments need an external HA cluster.
Architecture
- SigNoz Collector / Query / UI are deployed separately from the external ClickHouse cluster.
- Kafka sits between the Collectors and the external ClickHouse as an intermediate queue, decoupling ingestion from storage and absorbing high-concurrency write pressure.
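- Put together (as spelled out in the two Collector steps below), the data path is: application (OTLP) → Collector stage 1 → Kafka → Collector stage 2 → external ClickHouse cluster, with SigNoz Query/UI reading from ClickHouse.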
Deployment
ClickHouse
Cluster deployment
Documentation link
Database initialization
- schema-migrator-sync checks whether the tables SigNoz needs exist in ClickHouse, and creates them automatically if they do not.
- This guarantees that the table schemas the Collector, Query service, and dashboards depend on are complete and usable.
- Later schema upgrades and changes (ALTERs and the like) are also executed through the sync or async migrator.
version: "3.9"services:schema-migrator-sync:image: signoz/signoz-schema-migrator:v0.111.41container_name: schema-migrator-synccommand:- sync- --dsn=tcp://default:TrRkZeK1KJsOzfdxVFRLk7oy7hJx@192.168.100.10:9920- --up= - --cluster-name=clickhouse_cluster ##填写正确的集群名称restart: on-failureschema-migrator-async:image: signoz/signoz-schema-migrator:v0.111.41container_name: schema-migrator-asynccommand:- async- --dsn=tcp://default:TrRkZeK1KJsOzfdxVFRLk7oy7hJx@192.168.100.10:9920- --up=- --cluster-name=clickhouse_clusterdepends_on:schema-migrator-sync:condition: service_completed_successfullyrestart: on-failure
⚡️ Point schema-migrator-sync at any single ClickHouse node: it recognizes the whole cluster automatically and creates the tables on all nodes.
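A quick sanity check, assuming the node address and credentials from the compose file above, is to confirm the SigNoz tables now exist (run it against a couple of different nodes to verify cluster-wide creation):

```bash
# List the tables the migrator created in the traces database.
clickhouse-client --host 192.168.100.10 --port 9920 \
  --user default --password 'TrRkZeK1KJsOzfdxVFRLk7oy7hJx' \
  --query "SHOW TABLES FROM signoz_traces"
```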
SigNoz deployment
```bash
# Install the application with the Helm chart
helm repo add signoz https://charts.signoz.io
helm pull signoz/signoz
tar -zxvf signoz-${version}.tgz   # extract the pulled chart (it unpacks into ./signoz)
vim signoz/values.yaml
```
```yaml
# Disable the bundled ClickHouse (and its ZooKeeper)
clickhouse:
  enabled: false
zookeeper:
  enabled: false

# Point SigNoz at the external ClickHouse
externalClickhouse:
  host: 192.168.200.10        # preferably the TVS (load-balanced) address, not a single node
  cluster: clickhouse_cluster
  database: signoz_metrics
  traceDatabase: signoz_traces
  logDatabase: signoz_logs
  user: "default"
  password: ""
  existingSecret:
  existingSecretPasswordKey:
  secure: false
  verify: false
  httpPort: 8323
  tcpPort: 9922
```
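With values.yaml edited, install the chart. A minimal sketch; the release name and namespace here are placeholders, not from the original setup:

```bash
# Install from the unpacked chart directory so the edited values.yaml is used.
helm install signoz ./signoz -n platform --create-namespace
```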
Collector pipeline
⚡️ Run two Collectors, one per stage below.
Step 1: OTLP → Kafka
- Purpose: collect the trace, metric, and log data sent by applications (microservices/agents) into Kafka first.
- Flow:
  - Applications send data to the Collector over OTLP (gRPC or HTTP).
  - The Collector runs the necessary processors (e.g. batch, tail_sampling).
  - Exporters then write the data to Kafka.
- Benefits:
  - Data is not written directly to ClickHouse, reducing load on it.
  - Kafka provides asynchronous buffering, temporary storage, and message retries.
  - The pipeline can absorb large-scale, highly concurrent ingestion.
- Configuration
  - Update the ConfigMap and the service start command yourself: "--config=/conf/otel-collector-config-otok.yaml"
```yaml
## otel-collector-config-otok.yaml
exporters:
  clickhouselogsexporter:
    dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_LOG_DATABASE}
    timeout: 10s
    use_new_schema: true
  clickhousetraces:
    datasource: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_TRACE_DATABASE}
    low_cardinal_exception_grouping: ${env:LOW_CARDINAL_EXCEPTION_GROUPING}
    use_new_schema: true
  metadataexporter:
    cache:
      provider: in_memory
    dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/signoz_metadata
    tenant_id: ${env:TENANT_ID}
    timeout: 10s
  signozclickhousemetrics:
    dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_DATABASE}
    timeout: 45s
  kafka/traces:
    brokers:
      - 192.168.100.10:9092
      - 192.168.100.20:9092
      - 192.168.100.30:9092
    topic: otel_traces
    protocol_version: 2.0.0
    #encoding: otlp_json
    producer:
      max_message_bytes: 10485760
  kafka/logs:
    brokers:
      - 192.168.100.10:9092
      - 192.168.100.20:9092
      - 192.168.100.30:9092
    topic: otel_logs
    protocol_version: 2.0.0
    #encoding: otlp_json
    producer:
      max_message_bytes: 10485760
  kafka/metrics:
    brokers:
      - 192.168.100.10:9092
      - 192.168.100.20:9092
      - 192.168.100.30:9092
    topic: otel_metrics
    protocol_version: 2.0.0
    #encoding: otlp_json
    producer:
      max_message_bytes: 10485760

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: localhost:1777
  zpages:
    endpoint: localhost:55679

processors:
  batch:
    send_batch_size: 50000
    timeout: 1s
  tail_sampling:
    decision_wait: 2s
    policies:
      - name: error-traces
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: success-traces
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
      - name: long-traces
        type: latency
        latency:
          threshold_ms: 2000000
  signozspanmetrics/delta:
    aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA
    dimensions:
      - default: default
        name: service.namespace
      - default: default
        name: deployment.environment
      - name: signoz.collector.id
    dimensions_cache_size: 100000
    latency_histogram_buckets:
      - 100us
      - 1ms
      - 2ms
      - 6ms
      - 10ms
      - 50ms
      - 100ms
      - 250ms
      - 500ms
      - 1000ms
      - 1400ms
      - 2000ms
      - 5s
      - 10s
      - 20s
      - 40s
      - 60s
    metrics_exporter: signozclickhousemetrics

receivers:
  httplogreceiver/heroku:
    endpoint: 0.0.0.0:8081
    source: heroku
  httplogreceiver/json:
    endpoint: 0.0.0.0:8082
    source: json
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_http:
        endpoint: 0.0.0.0:14268
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 16
      http:
        endpoint: 0.0.0.0:4318
  kafka:
    brokers:
      - 192.168.100.10:9092
      - 192.168.100.20:9092
      - 192.168.100.30:9092
    topic: otel
    group_id: otel_group
    protocol_version: 2.0.0

service:
  extensions:
    - health_check
    - zpages
    - pprof
  pipelines:
    logs:
      exporters:
        - kafka/logs
        - metadataexporter
      processors:
        - batch
      receivers:
        - otlp
        - httplogreceiver/heroku
        - httplogreceiver/json
    metrics:
      exporters:
        - metadataexporter
        - kafka/metrics
      processors:
        - batch
      receivers:
        - otlp
    traces:
      exporters:
        - kafka/traces
        - metadataexporter
      processors:
        - batch
        - tail_sampling
      receivers:
        - otlp
        - jaeger
  telemetry:
    logs:
      encoding: json
```
👆 The wire format and the consumer-side processing differ per signal, so when writing OTLP data to Kafka use one topic per signal: `otel_traces`, `otel_metrics`, and `otel_logs`. This avoids the `illegal wireType` errors caused by proto type mismatches.
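Creating the three topics up front (rather than relying on broker auto-creation) is one way to do this. A sketch using the broker address from the configs above; partition and replication counts are assumptions to size for your own throughput:

```bash
# Create one topic per signal; 6 partitions / replication factor 3 are illustrative values.
for topic in otel_traces otel_metrics otel_logs; do
  kafka-topics.sh --create \
    --bootstrap-server 192.168.100.10:9092 \
    --topic "$topic" \
    --partitions 6 \
    --replication-factor 3
done
```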
Step 2: Kafka → ClickHouse
- Purpose: consume the data from Kafka and write it into ClickHouse.
- Flow:
  - The Collector consumes the topics as a Kafka consumer.
  - It can apply further processing (e.g. batch, metrics aggregation, span metrics).
  - Finally the ClickHouse exporters write to the corresponding tables (traces/logs/metrics).
- Notes:
  - Specify `encoding` on the consumer so the read format matches what was written to Kafka.
  - This stage can also do extra filtering/shaping (e.g. aggregation, dropping unneeded data).
- Configuration
  - Update the ConfigMap and the service start command yourself: "--config=/conf/otel-collector-config-ktoc.yaml"
```yaml
## otel-collector-config-ktoc.yaml
exporters:
  clickhouselogsexporter:
    dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_LOG_DATABASE}
    timeout: 10s
    use_new_schema: true
  clickhousetraces:
    datasource: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_TRACE_DATABASE}
    low_cardinal_exception_grouping: ${env:LOW_CARDINAL_EXCEPTION_GROUPING}
    use_new_schema: true
  metadataexporter:
    cache:
      provider: in_memory
    dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/signoz_metadata
    tenant_id: ${env:TENANT_ID}
    timeout: 10s
  signozclickhousemetrics:
    dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_DATABASE}
    timeout: 45s

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: localhost:1777
  zpages:
    endpoint: localhost:55679

processors:
  batch:
    send_batch_size: 50000
    timeout: 1s
  signozspanmetrics/delta:
    aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA
    dimensions:
      - default: default
        name: service.namespace
      - default: default
        name: deployment.environment
      - name: signoz.collector.id
    dimensions_cache_size: 100000
    latency_histogram_buckets:
      - 100us
      - 1ms
      - 2ms
      - 6ms
      - 10ms
      - 50ms
      - 100ms
      - 250ms
      - 500ms
      - 1000ms
      - 1400ms
      - 2000ms
      - 5s
      - 10s
      - 20s
      - 40s
      - 60s
    metrics_exporter: signozclickhousemetrics

receivers:
  httplogreceiver/heroku:
    endpoint: 0.0.0.0:8081
    source: heroku
  httplogreceiver/json:
    endpoint: 0.0.0.0:8082
    source: json
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_http:
        endpoint: 0.0.0.0:14268
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 16
      http:
        endpoint: 0.0.0.0:4318
  kafka/logs:
    brokers:
      - 192.168.100.10:9092
      - 192.168.100.20:9092
      - 192.168.100.30:9092
    topic: otel_logs
    group_id: otel_logs_group
    protocol_version: 2.0.0
    encoding: otlp_proto
  kafka/traces:
    brokers:
      - 192.168.100.10:9092
      - 192.168.100.20:9092
      - 192.168.100.30:9092
    topic: otel_traces
    group_id: otel_traces_group
    protocol_version: 2.0.0
    encoding: otlp_proto
  kafka/metrics:
    brokers:
      - 192.168.100.10:9092
      - 192.168.100.20:9092
      - 192.168.100.30:9092
    topic: otel_metrics
    group_id: otel_metrics_group
    protocol_version: 2.0.0
    encoding: otlp_proto

service:
  extensions:
    - health_check
    - zpages
    - pprof
  pipelines:
    logs:
      exporters:
        - clickhouselogsexporter
        - metadataexporter
      processors:
        - batch
      receivers:
        - kafka/logs
        - httplogreceiver/heroku
        - httplogreceiver/json
    metrics:
      exporters:
        - metadataexporter
        - signozclickhousemetrics
      processors:
        - batch
      receivers:
        - kafka/metrics
    traces:
      exporters:
        - clickhousetraces
        - metadataexporter
      processors:
        - signozspanmetrics/delta
        - batch
      receivers:
        - kafka/traces
        - jaeger
  telemetry:
    logs:
      encoding: json
```
Sampling logic
- tail_sampling (keep all error traces, sample normal traces probabilistically) belongs in the OTLP → Kafka stage, so that the data written to Kafka is already trimmed to the volume you want.
- The Kafka → ClickHouse stage only consumes, writes tables, and does light processing; it does not sample again.
```yaml
# Corresponding configuration
processors:
  batch:
    send_batch_size: 50000
    timeout: 1s
  tail_sampling:
    decision_wait: 2s
    policies:
      - name: error-traces        # keep every trace with ERROR status
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: success-traces      # keep 10% of normal traces
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
      - name: long-traces         # keep traces above the latency threshold
        type: latency
        latency:
          threshold_ms: 2000000
```
Onboarding applications
Once the ops work above is complete, application teams onboard as appropriate; multiple receivers are supported (OTLP, Jaeger, and the HTTP log receivers in the configs above).
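For example, a service instrumented with an OpenTelemetry SDK or agent only needs to point at the stage-1 Collector's OTLP endpoint; the hostname below is a placeholder:

```bash
# Standard OpenTelemetry SDK environment variables; the Collector hostname is a placeholder.
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector.example.internal:4317"  # gRPC port from the receiver config
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
export OTEL_SERVICE_NAME="my-service"
```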