16. Big Data Monitoring

news 2025/10/8 17:13:25

0. Overview

Main components of the monitoring stack.
[Figure: monitoring components]

Software versions.
[Figure: software versions]

1. Exporter monitoring configuration

1.1 node_exporter

Start command

nohup ./node_exporter &

Service
Create the file /etc/systemd/system/node_exporter.service:

[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=bigdatabit9
Group=bigdatabit9
Type=simple
ExecStart=/opt/apps/node_exporter/node_exporter
Restart=always

[Install]
WantedBy=multi-user.target

1.2 kafka_exporter

Start script

#!/bin/bash
cd /opt/apps/exporters/kafka_exporter 
nohup ./kafka_exporter --kafka.server=instance-kafka01:9092 --kafka.server=instance-kafka02:9092 --kafka.server=instance-kafka03:9092 \
--zookeeper.server=instance-kafka03:2181,instance-kafka02:2181,instance-kafka01:2181 \
--web.listen-address="172.16.0.243:9340" >/dev/null 2>&1 &

Service
Create the file /etc/systemd/system/kafka_exporter.service:

[Unit]
Description=Kafka Exporter for Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=bigdatabit9
Group=bigdatabit9
Type=simple
ExecStart=/opt/apps/exporters/kafka_exporter/kafka_exporter \
  --kafka.server=instance-kafka01:9092 \
  --kafka.server=instance-kafka02:9092 \
  --kafka.server=instance-kafka03:9092 \
  --zookeeper.server=instance-kafka03:2181,instance-kafka02:2181,instance-kafka01:2181 \
  --web.listen-address=0.0.0.0:9340
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Starting the exporter

kafka_exporter is used as the example here; the other services are started the same way.

Commands

sudo systemctl daemon-reload
sudo systemctl enable kafka_exporter
sudo systemctl start kafka_exporter

Check the service status

sudo systemctl status kafka_exporter
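
Beyond the systemd status, it can be useful to confirm that the exporter really serves metrics. The following is a minimal sketch of a parser for the Prometheus text exposition format. The function name parse_metrics is hypothetical, and the parser is deliberately simplified: it assumes no timestamps and no escaped quotes inside label values.

```python
import re

# Matches lines such as:  kafka_brokers 3   or   up{job="kafka"} 1
# Simplified on purpose: lines with timestamps or escaped quotes are skipped.
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

The input text would typically come from the exporter's /metrics endpoint on port 9340, fetched for example with urllib.request.urlopen.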

2. Prometheus configuration

2.1 Configuring prometheus.yml

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - instance-metric01:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "rules/*.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "pushgateway"
    static_configs:
      - targets: ["instance-metric01:9091"]
  - job_name: "kafka"
    static_configs:
      - targets: ["instance-kafka02:9340"]
  - job_name: "node"
    static_configs:
      - targets: ["instance-kafka01:9100","instance-kafka02:9100","instance-kafka03:9100","instance-metric01:9100"]
    metric_relabel_configs:
      - action: replace
        source_labels: ["instance"]
        regex: ([^:]+):([0-9]+)
        replacement: $1
        target_label: "host_name"
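
The metric_relabel_configs entry on the node job derives a host_name label by stripping the port from instance. As a rough illustration of what that rule does, here is a small Python sketch. The function name relabel_host_name is hypothetical; Python writes the capture group as \1 where Prometheus writes $1, and re.fullmatch is used because Prometheus anchors relabel regexes at both ends.

```python
import re


def relabel_host_name(labels, regex=r"([^:]+):([0-9]+)", replacement=r"\1"):
    """Mimic the 'replace' rule: copy the host part of instance into host_name."""
    match = re.fullmatch(regex, labels.get("instance", ""))
    if match is None:
        return dict(labels)  # no match: the label set is left unchanged
    result = dict(labels)
    result["host_name"] = match.expand(replacement)
    return result
```

An instance value without a port does not match the regex, so no host_name label is added in that case.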

2.2 Alert rules configuration

The rule files are placed in the rules directory under the Prometheus directory.

cpu.yml

groups:
- name: cpu_state
  rules:
  - alert: CPU usage alert
    expr: (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[2m])) by (host_name)) * 100 > 90
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.host_name}} CPU usage exceeds 90%"
      description: "Server [{{$labels.host_name}}]: current CPU usage {{$value}}% exceeds 90%"
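
To make the CPU expression easier to follow, the arithmetic can be sketched in Python. This is only an approximation under stated assumptions: the function name and input shape are hypothetical, and the real rate() function also handles counter resets and extrapolation, which this sketch ignores.

```python
def cpu_usage_percent(idle_samples, window_seconds):
    """Approximate (1 - avg(rate(idle))) * 100 for one host.

    idle_samples maps a CPU id to (idle_seconds_at_start, idle_seconds_at_end),
    i.e. two readings of node_cpu_seconds_total{mode="idle"} taken
    window_seconds apart.
    """
    if not idle_samples or window_seconds <= 0:
        raise ValueError("need at least one CPU and a positive window")
    # Per-CPU idle rate: fraction of each second spent idle.
    idle_rates = [
        (end - start) / window_seconds for start, end in idle_samples.values()
    ]
    avg_idle = sum(idle_rates) / len(idle_rates)
    return (1 - avg_idle) * 100
```

With a 2 minute window, a host whose cores were idle for only a few seconds each would come out well above the 90 threshold used in the rule.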

disk.yml

groups:
- name: disk_state
  rules:
  - alert: Disk usage alert
    expr: (node_filesystem_size_bytes{fstype=~"ext.?|xfs"} - node_filesystem_avail_bytes{fstype=~"ext.?|xfs"}) / node_filesystem_size_bytes{fstype=~"ext.?|xfs"} * 100 > 80
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.host_name}} disk partition usage exceeds 80%"
      description: "Mount point [{{ $labels.mountpoint }}] on server [{{$labels.host_name}}]: current value {{$value}}% exceeds 80%"
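
The disk rule combines a filesystem-type filter with a usage calculation. The sketch below reproduces both steps in Python for illustration. The function name disk_alerts and the input dictionaries are hypothetical; re.fullmatch stands in for the fully anchored =~ matcher.

```python
import re

# Same pattern as fstype=~"ext.?|xfs"; Prometheus anchors it at both ends.
FSTYPE_RE = re.compile(r"ext.?|xfs")


def disk_alerts(filesystems, threshold=80.0):
    """Return (mountpoint, used_percent) for filesystems above the threshold.

    Each item in filesystems is a dict with the keys
    mountpoint, fstype, size_bytes and avail_bytes.
    """
    firing = []
    for fs in filesystems:
        if FSTYPE_RE.fullmatch(fs["fstype"]) is None:
            continue  # e.g. tmpfs is not covered by the rule
        if fs["size_bytes"] <= 0:
            continue  # avoid dividing by zero on empty pseudo filesystems
        used = (fs["size_bytes"] - fs["avail_bytes"]) / fs["size_bytes"] * 100
        if used > threshold:
            firing.append((fs["mountpoint"], used))
    return firing
```

Because the pattern is anchored, ext3, ext4 and xfs are included, while tmpfs is excluded even when it is completely full.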

dispatcher.yml

groups:
- name: dispatcher_state
  rules:
  - alert: dispatcher06 status
    expr: sum(dispatcher06_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher wrote no data"
      description: "The dispatcher on server 172.16.0.218 wrote no data; the process has a problem!"
  - alert: dispatcher07 status
    expr: sum(dispatcher07_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher wrote no data"
      description: "The dispatcher on server 172.16.0.219 wrote no data; the process has a problem!"
  - alert: dispatcherk1 status
    expr: sum(dispatcherk1_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher wrote no data"
      description: "The dispatcher on server 172.16.0.243 wrote no data; the process has a problem!"
  - alert: dispatcherk2 status
    expr: sum(dispatcherk2_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher wrote no data"
      description: "The dispatcher on server 172.16.0.244 wrote no data; the process has a problem!"
  - alert: dispatcherk3 status
    expr: sum(dispatcherk3_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher wrote no data"
      description: "The dispatcher on server 172.16.0.245 wrote no data; the process has a problem!"
  - alert: dispatcherk4 status
    expr: sum(dispatcherk4_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher wrote no data"
      description: "The dispatcher on server 172.16.0.246 wrote no data; the process has a problem!"
  - alert: dispatcherk5 status
    expr: sum(dispatcherk5_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher wrote no data"
      description: "The dispatcher on server 172.16.0.247 wrote no data; the process has a problem!"
  - alert: dispatcherk6 status
    expr: sum(dispatcherk6_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher wrote no data"
      description: "The dispatcher on server 172.16.0.140 wrote no data; the process has a problem!"
  - alert: dispatcherk7 status
    expr: sum(dispatcherk7_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher wrote no data"
      description: "The dispatcher on server 172.16.0.141 wrote no data; the process has a problem!"
  - alert: dispatcherk8 status
    expr: sum(dispatcherk8_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher wrote no data"
      description: "The dispatcher on server 172.16.0.142 wrote no data; the process has a problem!"
  - alert: dispatcherk9 status
    expr: sum(dispatcherk9_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher wrote no data"
      description: "The dispatcher on server 172.16.0.143 wrote no data; the process has a problem!"
  - alert: dispatcherk13 status
    expr: sum(dispatcherk13_data) == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "dispatcher wrote no data"
      description: "The dispatcher on server 172.16.0.155 wrote no data; the process has a problem!"

dn.yml

groups:
- name: dn_state
  rules:
  - alert: DataNode capacity alert
    expr: (sum(Hadoop_DataNode_DfsUsed{name="FSDatasetState"}) by (host_name) / sum(Hadoop_DataNode_Capacity{name="FSDatasetState"}) by(host_name)) * 100 > 80
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "DataNode {{$labels.host_name}}: used capacity exceeds 80%"
      description: "DataNode {{$labels.host_name}}: current used capacity {{$value}} exceeds 80% of total capacity"

kafka_lag.yml

groups:
- name: kafka_lag
  rules:
  - alert: Kafka message backlog alert
    expr: sum(kafka_consumergroup_lag{ topic!~"pct_.+"}) by(consumergroup,topic) > 500000 or sum(kafka_consumergroup_lag{topic=~"pct_.+"}) by(consumergroup,topic) > 2000000
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Consumer group {{$labels.consumergroup}} on topic {{$labels.topic}} has a message backlog"
      description: "Message lag: {{$value}}"
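
The lag rule applies two different thresholds depending on whether the topic name starts with pct_. A minimal Python sketch of that logic follows, assuming per-partition lag values as input. The function name lag_alerts and the tuple layout are hypothetical.

```python
import re
from collections import defaultdict

# Same pattern as topic=~"pct_.+", anchored at both ends.
PCT_TOPIC_RE = re.compile(r"pct_.+")


def lag_alerts(samples, default_limit=500_000, pct_limit=2_000_000):
    """Return {(consumergroup, topic): total_lag} for backlogs over the limit.

    samples is an iterable of (consumergroup, topic, partition_lag) tuples,
    one per partition, mirroring kafka_consumergroup_lag.
    """
    totals = defaultdict(float)
    for group, topic, lag in samples:
        totals[(group, topic)] += lag  # sum(...) by (consumergroup, topic)
    firing = {}
    for (group, topic), total in totals.items():
        limit = pct_limit if PCT_TOPIC_RE.fullmatch(topic) else default_limit
        if total > limit:
            firing[(group, topic)] = total
    return firing
```

Summing per consumer group and topic before comparing means a backlog spread across many partitions still triggers the alert.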

mem.yml

groups:
- name: memory_state
  rules:
  - alert: Memory usage alert
    expr: (1 - (node_memory_MemAvailable_bytes / (node_memory_MemTotal_bytes)))* 100 > 90
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.host_name}} memory usage exceeds 90%"
      description: "Server [{{$labels.host_name}}]: current memory usage {{$value}}% exceeds 90%"

process.yml

groups:
- name: proc_state
  rules:
  - alert: Process liveness alert
    expr: namedprocess_namegroup_num_procs<1
    for: 60s
    labels:
      severity: critical
      target: "{{$labels.app_name}}"
    annotations:
      summary: "Process {{$labels.app_name}} has stopped"
      description: "Process {{$labels.app_name}} has stopped on server {{$labels.host_name}}."

prometheus_process.yml

groups:
- name: proc_state
  rules:
  - alert: Prometheus component liveness alert
    expr: sum(up) by(instance,job) == 0
    for: 30s
    labels:
      severity: critical
      target: "{{$labels.job}}"
    annotations:
      summary: "Process {{$labels.job}} has stopped"
      description: "Process {{$labels.job}} has stopped on server {{$labels.instance}}."

yarn.yml

groups:
- name: yarn_node
  rules:
  - alert: Insufficient YARN nodes
    expr: sum(Hadoop_ResourceManager_NumActiveNMs{job='rm'}) by (job) < 13 or sum(Hadoop_ResourceManager_NumActiveNMs{job='rmf'}) by (job) < 12
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "YARN cluster {{$labels.job}} has too few active nodes"

2.3 Startup

Start command

nohup /opt/apps/prometheus/prometheus \
--web.listen-address="0.0.0.0:9090" \
--web.read-timeout=5m \
--web.max-connections=10  \
--storage.tsdb.retention=7d  \
--storage.tsdb.path="data/" \
--query.max-concurrency=20   \
--query.timeout=2m \
--web.enable-lifecycle \
> /opt/apps/prometheus/logs/start.log 2>&1 &

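2.4 Reloading the configuration

Reload the configuration over HTTP. This endpoint is available because Prometheus was started with --web.enable-lifecycle.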

curl -X POST http://localhost:9090/-/reload
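
The same reload can be triggered from a script, for instance after a deployment tool rewrites the rule files. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the default base URL simply mirrors the curl command above.

```python
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build (but do not send) the POST request for the reload endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url="http://localhost:9090", timeout=10):
    """Send the reload request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for non-2xx responses, for example
    when the new configuration is invalid or the lifecycle API is disabled.
    """
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Separating request construction from sending keeps the URL handling easy to check without a running Prometheus server.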

3. pushgateway

Start command

nohup /opt/apps/pushgateway/pushgateway \
--web.listen-address="0.0.0.0:9091" \
> /opt/apps/pushgateway/start.log 2>&1 &
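
Once the Pushgateway is running, jobs push metrics to it over HTTP and Prometheus scrapes them from port 9091. The source does not show how the dispatcher metrics used in the alert rules are produced, so the sketch below is only an assumption of what such a push could look like. The job name dispatcher and the function name build_push_request are hypothetical.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(gateway, job, metric_name, value):
    """Build a PUT request that pushes a single gauge to the Pushgateway.

    gateway is host:port, e.g. "instance-metric01:9091". PUT replaces all
    metrics previously pushed for the same job grouping key.
    """
    # Text exposition format; the body must end with a newline.
    body = f"# TYPE {metric_name} gauge\n{metric_name} {value}\n".encode("utf-8")
    job_path = quote(job, safe="")
    url = f"http://{gateway}/metrics/job/{job_path}"
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
```

The request can then be sent with urllib.request.urlopen(request, timeout=10). Using POST instead of PUT would replace only metrics with the same name within the group.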

4. alertmanager

4.1 Configuring alertmanager.yml

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 5m
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://mecury-ca01:9825/api/alarm/send'
        send_resolved: true
inhibit_rules:
  - source_match:
      alertname: 'ApplicationDown'
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'job', "target", 'instance']
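
The inhibit rule is the least obvious part of this file, so here is a simplified Python sketch of how such a rule decides whether an alert is muted. It is an illustration under assumptions, not Alertmanager's implementation: the function name is_inhibited is hypothetical, only exact-match matchers are handled, and alerts are plain label dictionaries.

```python
def is_inhibited(target, firing_alerts, rule):
    """Return True if target is muted by some firing alert under rule.

    rule is a dict with the keys source_match, target_match and equal.
    A missing label is treated like a label with an empty value.
    """
    def matches(labels, matchers):
        return all(labels.get(k, "") == v for k, v in matchers.items())

    if not matches(target, rule["target_match"]):
        return False
    target_is_source = matches(target, rule["source_match"])
    for source in firing_alerts:
        if not matches(source, rule["source_match"]):
            continue
        if target_is_source and matches(source, rule["target_match"]):
            continue  # alerts matching both sides do not inhibit each other
        if all(source.get(n, "") == target.get(n, "") for n in rule["equal"]):
            return True
    return False
```

Because alertname is listed in equal, this particular rule only mutes a warning alert that is also named ApplicationDown and shares the same job, target and instance labels as the critical one.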

Configure the alert webhook address. For reference, the payload sent to the webhook looks like this:

{
  "version": "4",
  "groupKey": "alertname:ApplicationDown",
  "status": "firing",
  "receiver": "web.hook",
  "groupLabels": {
    "alertname": "ApplicationDown"
  },
  "commonLabels": {
    "alertname": "ApplicationDown",
    "severity": "critical",
    "instance": "10.0.0.1:8080",
    "job": "web",
    "target": "10.0.0.1"
  },
  "commonAnnotations": {
    "summary": "Web application is down",
    "description": "The web application at instance 10.0.0.1:8080 is not responding."
  },
  "externalURL": "http://alertmanager:9093",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "ApplicationDown",
        "severity": "critical",
        "instance": "10.0.0.1:8080",
        "job": "web",
        "target": "10.0.0.1"
      },
      "annotations": {
        "summary": "Web application is down",
        "description": "The web application at instance 10.0.0.1:8080 is not responding."
      },
      "startsAt": "2025-06-19T04:30:00Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://prometheus:9090/graph?g0.expr=up%7Bjob%3D%22web%22%7D+%3D%3D+0",
      "fingerprint": "1234567890abcdef"
    }
  ]
}
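
On the receiving side, the webhook service has to decode this payload and turn each entry in alerts into a notification. The service behind /api/alarm/send is not described here, so the following is only a hypothetical sketch of that decoding step. The function name summarize_payload and the output format are assumptions.

```python
import json


def summarize_payload(raw_body):
    """Turn an Alertmanager webhook body into one text line per alert."""
    payload = json.loads(raw_body)
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{status}] {name} ({severity}): {summary}".format(
                status=alert.get("status", "unknown"),
                name=labels.get("alertname", "?"),
                severity=labels.get("severity", "?"),
                summary=annotations.get("summary", ""),
            )
        )
    return lines
```

Since send_resolved is true in the receiver configuration, the same handler also receives alerts whose status is resolved, so the status field is worth including in the message.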

4.2 Startup

Start script start.sh

#!/bin/bash
nohup /opt/apps/alertmanager/alertmanager \
--config.file=/opt/apps/alertmanager/alertmanager.yml \
> /opt/apps/alertmanager/start.log 2>&1 &

5. Grafana

5.1 Installation

Start command

nohup /opt/apps/grafana/bin/grafana-server web > /opt/apps/grafana/grafana.log 2>&1 &

The default username and password are both admin.

5.2 Commonly used dashboard templates

Grafana dashboard IDs to import:

node 16098
kafka 7589
process 249
