
A Deep Technical Dive into the Apollo Monitor Module

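Contents

  1. Module Overview
  2. Architecture
  3. Core Data Structures
  4. Hardware Monitors
  5. Software Monitors
  6. Status Management
  7. Configuration System
  8. Execution Flow
  9. Best Practices
  10. Summary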

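1. Module Overview

1.1 Role

The Monitor module is the "health check center" of the Apollo autonomous driving system. It is responsible for:

  • Hardware monitoring: tracks the state of GPS, CAN cards, system resources, and other hardware
  • Software monitoring: tracks processes, modules, message channels, latency, and other software state
  • Safety protection: checks safety conditions in autonomous driving mode and triggers an emergency stop when necessary
  • Status aggregation: collects all monitoring results and publishes the overall system status
  • Fault diagnosis: provides detailed error messages that help locate problems quickly

1.2 Technical Characteristics

  • Periodic execution: built on CyberRT's TimerComponent, runs once every 500 ms
  • Layered monitoring: a three-layer scheme of hardware, software, and safety monitors
  • Flexible configuration: monitored items are configured dynamically through HMI mode files
  • Priority escalation: FATAL > ERROR > WARN > OK > UNKNOWN
  • Safety protection: in autonomous driving mode, checks run automatically and can trigger EStop
  • Efficient publishing: a hash fingerprint detects status changes, which reduces network overhead

1.3 Code Size

  • Main files: monitor.h, monitor.cc
  • Number of monitors: 13 (4 hardware monitors + 9 software monitors)
  • Code location: modules/monitor/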

2. Architecture

2.1 Layered System Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Monitor main component                   │
│              (TimerComponent, 500 ms period)                │
└─────────────────────────────────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────┐
│                  MonitorManager                             │
│            (singleton, central manager)                     │
│  - Status management (SystemStatus)                         │
│  - Config management (HMIMode)                              │
│  - Channel management (Reader/Writer cache)                 │
│  - Log buffering (MonitorLogBuffer)                         │
└─────────────────────────────────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────┐
│                    Monitor layer                            │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │   Hardware   │  │   Software   │  │    Safety    │       │
│  ├──────────────┤  ├──────────────┤  ├──────────────┤       │
│  │ GPS Monitor  │  │Process Mon   │  │Functional    │       │
│  │ CAN Monitor  │  │Module Mon    │  │Safety Mon    │       │
│  │Resource Mon  │  │Channel Mon   │  └──────────────┘       │
│  │Localization  │  │Latency Mon   │                         │
│  │   Monitor    │  │Summary Mon   │                         │
│  │              │  │Camera Mon    │                         │
│  │              │  │Recorder Mon  │                         │
│  └──────────────┘  └──────────────┘                         │
└─────────────────────────────────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────┐
│                    Output layer                             │
│  /apollo/monitor/system_status (SystemStatus message)       │
│  /apollo/monitor/latency_report (LatencyReport message)     │
│  MonitorLog messages                                        │
└─────────────────────────────────────────────────────────────┘

2.2 Directory Layout

modules/monitor/
├── monitor.h                       # Main component header
├── monitor.cc                      # Main component implementation
├── BUILD                           # Build configuration
├── README.md                       # Documentation
├── cyberfile.xml                   # Package metadata
│
├── dag/
│   └── monitor.dag                 # DAG configuration (500 ms period)
│
├── launch/
│   └── monitor.launch              # Launch configuration
│
├── common/                         # Shared base classes
│   ├── monitor_manager.h           # Central manager (singleton)
│   ├── monitor_manager.cc
│   ├── recurrent_runner.h          # Base class for periodic runners
│   └── recurrent_runner.cc
│
├── hardware/                       # Hardware monitors (4)
│   ├── gps_monitor.h/cc           # GPS/GNSS status monitoring
│   ├── esdcan_monitor.h/cc        # ESD CAN card monitoring
│   ├── socket_can_monitor.h/cc    # Socket CAN monitoring
│   └── resource_monitor.h/cc      # System resource monitoring
│
└── software/                       # Software monitors (9)
    ├── process_monitor.h/cc            # Process liveness monitoring
    ├── module_monitor.h/cc             # Module node monitoring
    ├── channel_monitor.h/cc            # Message channel monitoring
    ├── latency_monitor.h/cc            # End-to-end latency monitoring
    ├── summary_monitor.h/cc            # Status aggregation and publishing
    ├── functional_safety_monitor.h/cc  # Functional safety checks
    ├── localization_monitor.h/cc       # Localization status monitoring
    ├── camera_monitor.h/cc             # Camera monitoring
    └── recorder_monitor.h/cc           # Recorder status monitoring

2.3 Design Patterns

2.3.1 Singleton Pattern

MonitorManager is a singleton, which guarantees a single global instance:

class MonitorManager : public Singleton<MonitorManager> {
 public:
  // Get the singleton instance
  static MonitorManager* Instance();

  // Copying and assignment are disabled
  MonitorManager(const MonitorManager&) = delete;
  MonitorManager& operator=(const MonitorManager&) = delete;
};

// Usage
auto manager = MonitorManager::Instance();

2.3.2 Template Method Pattern

RecurrentRunner defines the execution skeleton shared by all monitors:

class RecurrentRunner {
 public:
  // Template method: defines the execution skeleton
  void Tick(const double current_time) {
    if (ShouldRun(current_time)) {
      RunOnce(current_time);  // implemented by subclasses
    }
  }

  // Pure virtual: every subclass must implement it
  virtual void RunOnce(const double current_time) = 0;

 private:
  bool ShouldRun(const double current_time);
};

// A subclass supplies the concrete logic
class GpsMonitor : public RecurrentRunner {
 public:
  void RunOnce(const double current_time) override {
    // GPS monitoring logic
  }
};

2.3.3 Strategy Pattern

Each monitor implements its own monitoring strategy:

// Hardware monitoring strategies
class GpsMonitor : public RecurrentRunner {
  void RunOnce(...) { /* check GPS solution quality */ }
};

class ResourceMonitor : public RecurrentRunner {
  void RunOnce(...) { /* check CPU / memory / disk */ }
};

// Software monitoring strategies
class ProcessMonitor : public RecurrentRunner {
  void RunOnce(...) { /* check whether processes are running */ }
};

class ChannelMonitor : public RecurrentRunner {
  void RunOnce(...) { /* check message channels */ }
};

3. Core Data Structures

3.1 Monitor: the Main Component

File: modules/monitor/monitor.h

class Monitor : public apollo::cyber::TimerComponent {
 public:
  bool Init() override;
  bool Proc() override;

 private:
  std::vector<std::shared_ptr<RecurrentRunner>> runners_;
};

Init implementation (monitor.cc:40-75):

bool Monitor::Init() {
  // Initialize the manager
  MonitorManager::Instance()->Init(node_);

  // Create the hardware monitors
  runners_.emplace_back(new EsdCanMonitor());
  runners_.emplace_back(new SocketCanMonitor());
  runners_.emplace_back(new GpsMonitor());
  runners_.emplace_back(new LocalizationMonitor());

  // Create the software monitors
  runners_.emplace_back(new CameraMonitor());
  runners_.emplace_back(new ProcessMonitor());
  runners_.emplace_back(new ModuleMonitor());
  const std::shared_ptr<LatencyMonitor> latency_monitor(new LatencyMonitor());
  runners_.emplace_back(latency_monitor);
  runners_.emplace_back(new ChannelMonitor(latency_monitor));
  runners_.emplace_back(new ResourceMonitor());
  runners_.emplace_back(new SummaryMonitor());

  // Functional safety monitor (optional)
  if (FLAGS_enable_functional_safety) {
    runners_.emplace_back(new FunctionalSafetyMonitor());
  }
  return true;
}

Proc implementation (monitor.cc:77-85):

bool Monitor::Proc() {
  const double current_time = Clock::NowInSeconds();

  // Start a new monitoring cycle
  if (!MonitorManager::Instance()->StartFrame(current_time)) {
    return false;
  }

  // Run every monitor
  for (auto& runner : runners_) {
    runner->Tick(current_time);
  }

  // End the monitoring cycle
  MonitorManager::Instance()->EndFrame();
  return true;
}

3.2 MonitorManager: the Central Manager

File: modules/monitor/common/monitor_manager.h

class MonitorManager : public Singleton<MonitorManager> {
 public:
  void Init(const std::shared_ptr<cyber::Node>& node);

  // Frame management
  bool StartFrame(const double current_time);
  void EndFrame();

  // Accessors
  const HMIMode& GetHMIMode() const { return mode_config_; }
  bool IsInAutonomousMode() const { return in_autonomous_driving_; }
  SystemStatus* GetStatus() { return &status_; }
  MonitorLogBuffer& LogBuffer() { return log_buffer_; }

  // Create a Reader/Writer (cached)
  template <class T>
  std::shared_ptr<cyber::Reader<T>> CreateReader(const std::string& channel);
  template <class T>
  std::shared_ptr<cyber::Writer<T>> CreateWriter(const std::string& channel);

 private:
  std::shared_ptr<cyber::Node> node_;

  // Core data
  SystemStatus status_;                 // system status
  HMIMode mode_config_;                 // configuration of the current mode
  std::string current_mode_;            // name of the current mode (used by StartFrame)
  bool in_autonomous_driving_ = false;  // autonomous driving flag
  MonitorLogBuffer log_buffer_;         // log buffer

  // Reader cache (avoids creating the same reader twice)
  std::unordered_map<std::string, std::shared_ptr<cyber::ReaderBase>> readers_;
};

StartFrame implementation (monitor_manager.cc:46-94):

bool MonitorManager::StartFrame(const double current_time) {
  // 1. Read the HMI status
  static auto hmi_status_reader =
      CreateReader<HMIStatus>(FLAGS_hmi_status_topic);
  auto hmi_status = hmi_status_reader->GetLatestObserved();
  if (hmi_status == nullptr) {
    AERROR_EVERY(100) << "No HMIStatus was received.";
    return false;
  }

  // 2. Detect a mode change
  const std::string current_mode = hmi_status->current_mode();
  if (current_mode_ != current_mode) {
    AINFO << "HMI mode changed from " << current_mode_
          << " to " << current_mode;

    // Load the configuration of the new mode
    if (!cyber::common::GetProtoFromFile(
            HMIModeHelper::GetModeDefinitionPath(current_mode),
            &mode_config_)) {
      AERROR << "Failed to load mode config for " << current_mode;
      return false;
    }
    current_mode_ = current_mode;

    // Rebuild the list of monitored components
    status_.clear_components();
    for (const auto& iter : mode_config_.monitored_components()) {
      status_.mutable_components()->insert({iter.first, Component()});
    }
  } else {
    // Clear the summaries left over from the previous frame
    for (auto& iter : *status_.mutable_components()) {
      iter.second.clear_summary();
    }
  }

  // 3. Detect whether the vehicle is driving autonomously
  in_autonomous_driving_ = CheckAutonomousDriving(current_time);
  return true;
}

CheckAutonomousDriving implementation:

bool MonitorManager::CheckAutonomousDriving(const double current_time) {
  static auto chassis_reader =
      CreateReader<Chassis>(FLAGS_chassis_topic);
  auto chassis = chassis_reader->GetLatestObserved();
  if (chassis == nullptr) {
    return false;
  }

  // Reject a stale chassis message
  if (current_time - chassis->header().timestamp_sec() >
      FLAGS_max_chassis_message_delay) {
    return false;
  }

  // Check the driving mode
  return chassis->driving_mode() == Chassis::COMPLETE_AUTO_DRIVE;
}

3.3 RecurrentRunner: the Base Class

File: modules/monitor/common/recurrent_runner.h

class RecurrentRunner {
 public:
  RecurrentRunner(const std::string& name, const double interval);
  virtual ~RecurrentRunner() = default;

  // Main entry point
  void Tick(const double current_time);

  // Must be implemented by subclasses
  virtual void RunOnce(const double current_time) = 0;

 protected:
  std::string name_;            // monitor name
  unsigned int round_count_;    // number of rounds executed

 private:
  double interval_;             // run interval (seconds)
  double next_round_;           // time of the next run
};

Tick implementation (recurrent_runner.cc:27-37):

void RecurrentRunner::Tick(const double current_time) {
  // Special case for ProcessMonitor: run immediately when requested
  if (name_ == "ProcessMonitor" &&
      MonitorManager::Instance()->GetStatus()->detect_immediately()) {
    RunOnce(current_time);
    return;
  }

  // Otherwise run at the configured interval
  if (next_round_ <= current_time) {
    ++round_count_;
    AINFO_EVERY(100) << name_ << " is running round #" << round_count_;
    next_round_ = current_time + interval_;
    RunOnce(current_time);
  }
}
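
The interval gating in Tick can be tried in isolation. The sketch below is a simplified stand-in rather than Apollo's class: it drops the ProcessMonitor special case, the logging, and MonitorManager, and the names SimpleRunner, CountingRunner, and CountRuns are hypothetical. It assumes next_round_ starts at 0, so the first tick always runs.

```cpp
#include <string>
#include <utility>

// Simplified stand-in for RecurrentRunner: RunOnce fires at most once per
// interval, no matter how often Tick is called.
class SimpleRunner {
 public:
  SimpleRunner(std::string name, double interval)
      : name_(std::move(name)), interval_(interval) {}
  virtual ~SimpleRunner() = default;

  void Tick(double current_time) {
    if (next_round_ <= current_time) {
      ++round_count_;
      next_round_ = current_time + interval_;
      RunOnce(current_time);
    }
  }

  virtual void RunOnce(double current_time) = 0;

  unsigned int round_count() const { return round_count_; }

 protected:
  std::string name_;
  unsigned int round_count_ = 0;

 private:
  double interval_;
  double next_round_ = 0.0;  // assumption: the first tick is always due
};

// Hypothetical monitor with a 3 s interval that only records its last run.
class CountingRunner : public SimpleRunner {
 public:
  CountingRunner() : SimpleRunner("CountingRunner", 3.0) {}
  void RunOnce(double current_time) override { last_run_ = current_time; }
  double last_run() const { return last_run_; }

 private:
  double last_run_ = -1.0;
};

// Drives the runner at the 0.5 s period of the Monitor component and
// returns how many times RunOnce fired during `ticks` ticks.
unsigned int CountRuns(int ticks) {
  CountingRunner runner;
  for (int i = 0; i < ticks; ++i) {
    runner.Tick(0.5 * i);
  }
  return runner.round_count();
}
```

With a 0.5 s tick and a 3 s interval, the runner fires at t = 0, 3, and 6 seconds, so a monitor's effective period is its own interval rounded up to a multiple of the component period.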

3.4 SystemStatus: the System Status

File: modules/common_msgs/monitor_msgs/system_status.proto

message ComponentStatus {
  enum Status {
    UNKNOWN = 0;
    OK = 1;
    WARN = 2;
    ERROR = 3;
    FATAL = 4;  // highest priority
  }
  optional Status status = 1 [default = UNKNOWN];
  optional string message = 2;
}

message Component {
  optional ComponentStatus summary = 1;          // aggregated status
  optional ComponentStatus process_status = 2;   // process check result
  optional ComponentStatus channel_status = 3;   // channel check result
  optional ComponentStatus resource_status = 4;  // resource check result
  optional ComponentStatus other_status = 5;     // other check results
  optional ComponentStatus module_status = 6;    // module check result
}

message SystemStatus {
  optional apollo.common.Header header = 1;

  // Component status collections
  map<string, ComponentStatus> hmi_modules = 7;       // HMI module status
  map<string, Component> components = 8;              // monitored component status
  map<string, ComponentStatus> other_components = 10; // other component status
  map<string, Component> global_components = 11;      // global component status

  // Safety related
  optional string passenger_msg = 4;                  // message for the passenger
  optional double safety_mode_trigger_time = 5;       // time safety mode was triggered
  optional bool require_emergency_stop = 6;           // whether EStop is required

  // Other flags
  optional bool is_realtime_in_simulation = 9;        // simulation flag
  optional bool detect_immediately = 12;              // immediate detection flag
}

Status priority:

FATAL (4) > ERROR (3) > WARN (2) > OK (1) > UNKNOWN (0)

4. Hardware Monitors

4.1 GPS Monitor

File: modules/monitor/hardware/gps_monitor.h

Purpose: monitors the quality of the GNSS positioning solution

Run interval: 3 seconds

Implementation (gps_monitor.cc:39-72):

void GpsMonitor::RunOnce(const double current_time) {
  auto manager = MonitorManager::Instance();

  // Look up the GPS component
  Component* component = FindOrNull(
      *manager->GetStatus()->mutable_components(),
      FLAGS_gps_component_name);  // default: "GPS"
  if (component == nullptr) {
    return;  // GPS is not monitored in the current mode
  }

  // Read the latest GnssBestPose message
  static auto reader =
      manager->CreateReader<GnssBestPose>(FLAGS_gnss_best_pose_topic);
  auto gnss_best_pose = reader->GetLatestObserved();
  ComponentStatus* status = component->mutable_other_status();

  // Check that a message exists
  if (gnss_best_pose == nullptr) {
    SummaryMonitor::EscalateStatus(
        ComponentStatus::ERROR, "No GnssBestPose message", status);
    return;
  }

  // Check the solution quality level
  switch (gnss_best_pose->sol_type()) {
    case SolutionType::NARROW_INT:    // RTK fixed solution
      SummaryMonitor::EscalateStatus(ComponentStatus::OK, "", status);
      break;
    case SolutionType::SINGLE:        // single-point positioning
      SummaryMonitor::EscalateStatus(
          ComponentStatus::WARN, "SolutionType is SINGLE", status);
      break;
    default:                          // any other, degraded state
      SummaryMonitor::EscalateStatus(
          ComponentStatus::ERROR, "SolutionType is wrong", status);
      break;
  }
}

GNSS solution types:

  • NARROW_INT: RTK fixed solution (best) → OK
  • SINGLE: single-point positioning (poor) → WARN
  • Anything else: invalid state → ERROR

4.2 CAN Monitors

Files:

  • modules/monitor/hardware/esdcan_monitor.h
  • modules/monitor/hardware/socket_can_monitor.h

Purpose: monitor the state of the CAN communication hardware

Run interval: 3 seconds

ESD CAN status code mapping:

// esdcan_monitor.cc
switch (status_code) {
  case NTCAN_SUCCESS:              // normal
    status = ComponentStatus::OK;
    break;
  case NTCAN_RX_TIMEOUT:           // receive timeout
  case NTCAN_TX_TIMEOUT:           // transmit timeout
    status = ComponentStatus::WARN;
    break;
  case NTCAN_TX_ERROR:             // transmit error
  case NTCAN_CONTR_OFF_BUS:        // controller off the bus
    status = ComponentStatus::ERROR;
    break;
  case NTCAN_CONTR_BUSY:           // controller busy
    status = ComponentStatus::FATAL;
    break;
  default:
    status = ComponentStatus::UNKNOWN;
    break;
}

4.3 Resource Monitor

File: modules/monitor/hardware/resource_monitor.h

Purpose: monitors system resource usage

Run interval: 5 seconds

Monitored metrics:

4.3.1 Disk Space

void ResourceMonitor::CheckDiskSpace(
    const ResourceMonitorConfig& config, ComponentStatus* status) {
  for (const auto& disk_space : config.disk_spaces()) {
    // 1. Iterate over matching paths (wildcards are supported)
    for (const auto& path : cyber::common::Glob(disk_space.path())) {
      // 2. Read the filesystem statistics
      struct statvfs stat;
      if (statvfs(path.c_str(), &stat) != 0) {
        continue;
      }

      // 3. Compute the available space (GB)
      const double available_gb =
          static_cast<double>(stat.f_bavail * stat.f_bsize) / (1 << 30);

      // 4. Compare against the thresholds
      if (available_gb < disk_space.insufficient_space_error()) {
        SummaryMonitor::EscalateStatus(
            ComponentStatus::ERROR,
            absl::StrCat("Insufficient disk space at ", path,
                         ": ", available_gb, " GB"),
            status);
      } else if (available_gb < disk_space.insufficient_space_warning()) {
        SummaryMonitor::EscalateStatus(
            ComponentStatus::WARN,
            absl::StrCat("Low disk space at ", path,
                         ": ", available_gb, " GB"),
            status);
      }
    }
  }
}

Configuration example:

disk_spaces {
  path: "/apollo/data/*"
  insufficient_space_warning: 50   # warning threshold: 50 GB
  insufficient_space_error: 20     # error threshold: 20 GB
}

4.3.2 CPU Usage

void ResourceMonitor::CheckCPUUsage(
    const ResourceMonitorConfig& config, ComponentStatus* status) {
  for (const auto& cpu_usage : config.cpu_usages()) {
    float usage_percentage = 0.0;
    if (cpu_usage.process_dag_path().empty()) {
      // CPU usage of the whole system
      usage_percentage = GetSystemCPUUsage();
    } else {
      // CPU usage of a specific process
      usage_percentage = GetProcessCPUUsage(cpu_usage.process_dag_path());
    }

    // Compare against the thresholds
    if (usage_percentage > cpu_usage.high_cpu_usage_error()) {
      SummaryMonitor::EscalateStatus(
          ComponentStatus::ERROR,
          absl::StrCat("High CPU usage: ", usage_percentage, "%"),
          status);
    } else if (usage_percentage > cpu_usage.high_cpu_usage_warning()) {
      SummaryMonitor::EscalateStatus(
          ComponentStatus::WARN,
          absl::StrCat("CPU usage warning: ", usage_percentage, "%"),
          status);
    }
  }
}

// Read the system-wide CPU usage from /proc/stat
float GetSystemCPUUsage() {
  // On the first call both counters are 0, so the result is the average since boot.
  static uint64_t last_total = 0, last_idle = 0;

  std::ifstream stat_file("/proc/stat");
  std::string line;
  std::getline(stat_file, line);  // first line: "cpu ..."

  std::istringstream iss(line);
  std::string cpu;
  uint64_t user, nice, system, idle, iowait, irq, softirq, steal;
  iss >> cpu >> user >> nice >> system >> idle >> iowait >> irq >> softirq >> steal;

  uint64_t total = user + nice + system + idle + iowait + irq + softirq + steal;
  uint64_t diff_total = total - last_total;
  uint64_t diff_idle = idle - last_idle;
  if (diff_total == 0) {
    return 0.0f;  // no jiffies elapsed since the last sample; avoid dividing by zero
  }

  float usage = 100.0 * (1.0 - static_cast<float>(diff_idle) / diff_total);
  last_total = total;
  last_idle = idle;
  return usage;
}

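
The delta computation above is easier to check when the parsing and the arithmetic are separated from the file read. The following is a minimal sketch under that assumption; CpuSample, ParseCpuLine, and CpuUsagePercent are hypothetical names that do not appear in Apollo. Like the listing above, it counts only the idle column as idle time.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Total and idle jiffies taken from one reading of the aggregate CPU line.
struct CpuSample {
  uint64_t total = 0;
  uint64_t idle = 0;
};

// Parses the aggregate "cpu ..." line of /proc/stat.
// Returns false (leaving *sample untouched) if the line has another shape.
bool ParseCpuLine(const std::string& line, CpuSample* sample) {
  std::istringstream iss(line);
  std::string label;
  uint64_t user = 0, nice = 0, system = 0, idle = 0;
  uint64_t iowait = 0, irq = 0, softirq = 0, steal = 0;
  if (!(iss >> label >> user >> nice >> system >> idle >> iowait >> irq >>
        softirq >> steal) ||
      label != "cpu") {
    return false;
  }
  sample->total = user + nice + system + idle + iowait + irq + softirq + steal;
  sample->idle = idle;
  return true;
}

// CPU usage in percent over the interval between two samples.
// Returns 0 when no jiffies elapsed, which avoids a division by zero.
double CpuUsagePercent(const CpuSample& prev, const CpuSample& curr) {
  const uint64_t diff_total = curr.total - prev.total;
  const uint64_t diff_idle = curr.idle - prev.idle;
  if (diff_total == 0) {
    return 0.0;
  }
  return 100.0 * (1.0 - static_cast<double>(diff_idle) /
                            static_cast<double>(diff_total));
}
```

For example, if the counters advance by 200 jiffies in total and 100 of them are idle, the usage over that interval is 50%.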
4.3.3 Memory Usage

void ResourceMonitor::CheckMemoryUsage(
    const ResourceMonitorConfig& config, ComponentStatus* status) {
  for (const auto& memory_usage : config.memory_usages()) {
    int usage_mb = 0;
    if (memory_usage.process_dag_path().empty()) {
      // Memory usage of the whole system
      usage_mb = GetSystemMemoryUsage();
    } else {
      // Memory usage of a specific process
      usage_mb = GetProcessMemoryUsage(memory_usage.process_dag_path());
    }

    // Compare against the thresholds
    if (usage_mb > memory_usage.high_memory_usage_error()) {
      SummaryMonitor::EscalateStatus(
          ComponentStatus::ERROR,
          absl::StrCat("High memory usage: ", usage_mb, " MB"),
          status);
    } else if (usage_mb > memory_usage.high_memory_usage_warning()) {
      SummaryMonitor::EscalateStatus(
          ComponentStatus::WARN,
          absl::StrCat("Memory usage warning: ", usage_mb, " MB"),
          status);
    }
  }
}

// Read the system-wide memory usage from /proc/meminfo
int GetSystemMemoryUsage() {
  std::ifstream meminfo("/proc/meminfo");
  std::string line;
  unsigned long total_kb = 0, available_kb = 0;  // type matches the %lu conversions below
  while (std::getline(meminfo, line)) {
    if (line.find("MemTotal:") == 0) {
      sscanf(line.c_str(), "MemTotal: %lu kB", &total_kb);
    } else if (line.find("MemAvailable:") == 0) {
      sscanf(line.c_str(), "MemAvailable: %lu kB", &available_kb);
    }
  }
  const unsigned long used_kb = total_kb - available_kb;
  return static_cast<int>(used_kb / 1024);  // convert to MB
}

4.3.4 Disk I/O Load

void ResourceMonitor::CheckDiskLoads(
    const ResourceMonitorConfig& config, ComponentStatus* status) {
  for (const auto& disk_load : config.disk_load_usages()) {
    int load = GetDiskLoad(disk_load.device_name());  // reads /proc/diskstats
    if (load > disk_load.high_disk_load_error()) {
      SummaryMonitor::EscalateStatus(
          ComponentStatus::ERROR,
          absl::StrCat("High disk load on ", disk_load.device_name(),
                       ": ", load),
          status);
    } else if (load > disk_load.high_disk_load_warning()) {
      SummaryMonitor::EscalateStatus(
          ComponentStatus::WARN,
          absl::StrCat("Disk load warning on ", disk_load.device_name(),
                       ": ", load),
          status);
    }
  }
}

4.4 Localization Monitor

File: modules/monitor/software/localization_monitor.h

Purpose: monitors the fusion status of the localization module

Run interval: 5 seconds

Implementation:

void LocalizationMonitor::RunOnce(const double current_time) {
  auto manager = MonitorManager::Instance();

  // Read the LocalizationStatus message
  static auto reader = manager->CreateReader<LocalizationStatus>(
      FLAGS_localization_status_topic);
  auto status_msg = reader->GetLatestObserved();

  Component* component = FindOrNull(
      *manager->GetStatus()->mutable_components(), "Localization");
  if (component == nullptr) {
    return;
  }
  ComponentStatus* status = component->mutable_other_status();

  if (status_msg == nullptr) {
    SummaryMonitor::EscalateStatus(
        ComponentStatus::ERROR, "No LocalizationStatus message", status);
    return;
  }

  // Map the fusion status
  switch (status_msg->fusion_status()) {
    case LocalizationStatus::OK:
      SummaryMonitor::EscalateStatus(ComponentStatus::OK, "", status);
      break;
    case LocalizationStatus::WARNNING:
      SummaryMonitor::EscalateStatus(
          ComponentStatus::WARN,
          "Localization fusion status is WARNNING", status);
      break;
    case LocalizationStatus::ERROR:
      SummaryMonitor::EscalateStatus(
          ComponentStatus::ERROR,
          "Localization fusion status is ERROR", status);
      break;
    case LocalizationStatus::FATAL:
      SummaryMonitor::EscalateStatus(
          ComponentStatus::FATAL,
          "Localization fusion status is FATAL", status);
      break;
  }
}

5. Software Monitors

5.1 Process Monitor

File: modules/monitor/software/process_monitor.h

Purpose: checks whether critical processes are running

Run interval: 1.5 seconds (runs immediately when detect_immediately is set)

Implementation (process_monitor.cc:40-97):

void ProcessMonitor::RunOnce(const double current_time) {
  auto manager = MonitorManager::Instance();

  // 1. Collect the command lines of all running processes
  std::vector<std::string> running_processes;
  for (const auto& cmd_file : cyber::common::Glob("/proc/*/cmdline")) {
    std::string cmd_string;
    if (cyber::common::GetContent(cmd_file, &cmd_string) &&
        !cmd_string.empty()) {
      // Arguments in cmdline are separated by \0; replace them with spaces
      std::replace(cmd_string.begin(), cmd_string.end(), '\0', ' ');
      running_processes.push_back(cmd_string);
    }
  }

  const auto& mode = manager->GetHMIMode();

  // 2. Check the HMI module processes
  auto* hmi_modules = manager->GetStatus()->mutable_hmi_modules();
  for (const auto& iter : mode.modules()) {
    const std::string& module_name = iter.first;
    const auto& config = iter.second.process_monitor_config();
    UpdateStatus(running_processes, config, &(*hmi_modules)[module_name]);
  }

  // 3. Check the monitored component processes
  auto* components = manager->GetStatus()->mutable_components();
  for (const auto& iter : mode.monitored_components()) {
    const std::string& name = iter.first;
    if (iter.second.has_process()) {
      const auto& config = iter.second.process();
      auto* status = (*components)[name].mutable_process_status();
      UpdateStatus(running_processes, config, status);
    }
  }

  // 4. Check the global component processes
  auto* global_components = manager->GetStatus()->mutable_global_components();
  for (const auto& iter : mode.global_components()) {
    const std::string& name = iter.first;
    if (iter.second.has_process()) {
      const auto& config = iter.second.process();
      auto* status = (*global_components)[name].mutable_process_status();
      UpdateStatus(running_processes, config, status);
    }
  }
}

Matching logic:

void ProcessMonitor::UpdateStatus(
    const std::vector<std::string>& running_processes,
    const ProcessMonitorConfig& config,
    ComponentStatus* status) {
  status->clear_status();

  // Walk through every running process
  for (const std::string& command : running_processes) {
    bool all_keywords_matched = true;
    // The process must contain every keyword (AND logic)
    for (const std::string& keyword : config.command_keywords()) {
      if (command.find(keyword) == std::string::npos) {
        all_keywords_matched = false;
        break;
      }
    }
    if (all_keywords_matched) {
      // Found a matching process
      SummaryMonitor::EscalateStatus(ComponentStatus::OK, command, status);
      return;
    }
  }

  // No matching process found → FATAL
  SummaryMonitor::EscalateStatus(
      ComponentStatus::FATAL, "Process not found", status);
}

Configuration example:

process {
  command_keywords: "mainboard"
  command_keywords: "-d"
  command_keywords: "modules/planning/dag/planning.dag"
}

# Example of a matching command line:
# /opt/apollo/bin/mainboard -d modules/planning/dag/planning.dag
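
The AND-matching rule can be exercised without touching /proc. The sketch below restates it with standard algorithms; MatchesAllKeywords and IsProcessRunning are hypothetical helper names, not functions from Apollo.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns true if `command` contains every keyword as a substring (AND logic).
bool MatchesAllKeywords(const std::string& command,
                        const std::vector<std::string>& keywords) {
  return std::all_of(keywords.begin(), keywords.end(),
                     [&command](const std::string& keyword) {
                       return command.find(keyword) != std::string::npos;
                     });
}

// Returns true if at least one running command line matches all keywords.
bool IsProcessRunning(const std::vector<std::string>& running_processes,
                      const std::vector<std::string>& keywords) {
  return std::any_of(running_processes.begin(), running_processes.end(),
                     [&keywords](const std::string& command) {
                       return MatchesAllKeywords(command, keywords);
                     });
}
```

Two properties follow from substring matching: keyword order does not matter, and an empty keyword list matches any process, so a process entry without keywords reports OK as long as anything is running.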

5.2 Module Monitor

File: modules/monitor/software/module_monitor.h

Purpose: checks whether the CyberRT nodes of a module exist

Run interval: 1.5 seconds

Implementation (module_monitor.cc:43-80):

class ModuleMonitor : public RecurrentRunner {
 public:
  ModuleMonitor();
  void RunOnce(const double current_time) override;

 private:
  void UpdateStatus(const ModuleMonitorConfig& config,
                    const std::string& module_name,
                    ComponentStatus* status);

  std::shared_ptr<cyber::service_discovery::NodeManager> node_manager_;
};

void ModuleMonitor::RunOnce(const double current_time) {
  auto manager = MonitorManager::Instance();
  const auto& mode = manager->GetHMIMode();

  // Check every monitored component
  auto* components = manager->GetStatus()->mutable_components();
  for (const auto& iter : mode.monitored_components()) {
    const std::string& name = iter.first;
    const auto& monitored_component = iter.second;
    if (monitored_component.has_module()) {
      const auto& config = monitored_component.module();
      auto* status = (*components)[name].mutable_module_status();
      UpdateStatus(config, name, status);
    }
  }
}

void ModuleMonitor::UpdateStatus(const ModuleMonitorConfig& config,
                                 const std::string& module_name,
                                 ComponentStatus* status) {
  status->clear_status();

  // Every required node must exist (AND logic)
  bool all_nodes_matched = true;
  for (const std::string& node_name : config.node_name()) {
    if (!node_manager_->HasNode(node_name)) {
      all_nodes_matched = false;
      break;
    }
  }

  if (all_nodes_matched) {
    // All nodes were found
    SummaryMonitor::EscalateStatus(ComponentStatus::OK, module_name, status);
  } else {
    // A required node is missing
    SummaryMonitor::EscalateStatus(
        ComponentStatus::FATAL, "Required node not found", status);
  }
}

Configuration example:

module {
  node_name: "planning"
  node_name: "planning_monitor"
}

5.3 Channel Monitor

File: modules/monitor/software/channel_monitor.h

Purpose: checks the delay, frequency, and completeness of message channels

Run interval: 5 seconds

Supported channels (channel_monitor.cc:73-105):

static const auto channel_function_map = std::unordered_map<...>{
    {FLAGS_control_command_topic,
     &CreateReaderAndLatestMessage<control::ControlCommand>},
    {FLAGS_localization_topic,
     &CreateReaderAndLatestMessage<localization::LocalizationEstimate>},
    {FLAGS_perception_obstacle_topic,
     &CreateReaderAndLatestMessage<perception::PerceptionObstacles>},
    {FLAGS_prediction_topic,
     &CreateReaderAndLatestMessage<prediction::PredictionObstacles>},
    {FLAGS_planning_trajectory_topic,
     &CreateReaderAndLatestMessage<planning::ADCTrajectory>},
    {FLAGS_chassis_topic,
     &CreateReaderAndLatestMessage<canbus::Chassis>},
    {FLAGS_pointcloud_topic,
     &CreateReaderAndLatestMessage<drivers::PointCloud>},
    // ... more channels ...
};

Check logic (channel_monitor.cc:171-236):

void ChannelMonitor::UpdateStatus(const ChannelMonitorConfig& config,
                                  ComponentStatus* status,
                                  const bool update_freq,
                                  const double freq) {
  status->clear_status();

  // 1. Get the reader and the latest message
  const auto reader_message_pair = GetReaderAndLatestMessage(config.name());
  const auto reader = reader_message_pair.first;
  const auto message = reader_message_pair.second;

  // 2. Check that the channel is registered
  if (reader == nullptr) {
    SummaryMonitor::EscalateStatus(
        ComponentStatus::UNKNOWN,
        absl::StrCat(config.name(), " is not registered in ChannelMonitor."),
        status);
    return;
  }

  // 3. Check that the message is not empty
  if (message == nullptr || message->ByteSize() == 0) {
    SummaryMonitor::EscalateStatus(
        ComponentStatus::FATAL,
        absl::StrCat("The message ", config.name(), " received is empty."),
        status);
    return;
  }

  // 4. Check the channel delay (FATAL level)
  const double delay = reader->GetDelaySec();
  if (delay < 0 || delay > config.delay_fatal()) {
    SummaryMonitor::EscalateStatus(
        ComponentStatus::FATAL,
        absl::StrCat(config.name(), " delayed for ", delay, " seconds."),
        status);
  }

  // 5. Check the mandatory fields (ERROR level)
  const std::string field_sepr = ".";
  for (const auto& field : config.mandatory_fields()) {
    if (!ValidateFields(*message, absl::StrSplit(field, field_sepr), 0)) {
      SummaryMonitor::EscalateStatus(
          ComponentStatus::ERROR,
          absl::StrCat(config.name(), " missing field ", field),
          status);
    }
  }

  // 6. Check the channel frequency (WARN level)
  if (update_freq) {
    if (freq > config.max_frequency_allowed()) {
      SummaryMonitor::EscalateStatus(
          ComponentStatus::WARN,
          absl::StrCat(config.name(), " has frequency ", freq,
                       " > max allowed ", config.max_frequency_allowed()),
          status);
    }
    if (freq < config.min_frequency_allowed()) {
      SummaryMonitor::EscalateStatus(
          ComponentStatus::WARN,
          absl::StrCat(config.name(), " has frequency ", freq,
                       " < min allowed ", config.min_frequency_allowed()),
          status);
    }
  }

  // Takes effect only if no check above raised the status
  SummaryMonitor::EscalateStatus(ComponentStatus::OK, "", status);
}

Field validation algorithm:

bool ValidateFields(const google::protobuf::Message& message,
                    const std::vector<std::string>& fields,
                    const size_t field_step) {
  if (field_step >= fields.size()) {
    return true;  // every path segment has been validated
  }

  const auto* desc = message.GetDescriptor();
  const auto* refl = message.GetReflection();

  // Walk through all fields of the message
  for (int i = 0; i < desc->field_count(); ++i) {
    const auto* field_desc = desc->field(i);
    if (field_desc->name() == fields[field_step]) {
      // Found the target field
      if (field_desc->is_repeated()) {
        // Repeated field: must be non-empty and the last path segment
        return refl->FieldSize(message, field_desc) > 0 &&
               field_step == fields.size() - 1;
      }
      if (field_desc->type() !=
          google::protobuf::FieldDescriptor::TYPE_MESSAGE) {
        // Scalar field: must be set and the last path segment
        return refl->HasField(message, field_desc) &&
               field_step == fields.size() - 1;
      }
      // Message field: validate recursively
      return ValidateFields(refl->GetMessage(message, field_desc),
                            fields, field_step + 1);
    }
  }
  return false;
}

Configuration example:

channel {
  name: "/apollo/localization/pose"
  delay_fatal: 3.0
  mandatory_fields: "pose.position.x"
  mandatory_fields: "pose.position.y"
  mandatory_fields: "pose.orientation"
  min_frequency_allowed: 5.0
  max_frequency_allowed: 15.0
}

5.4 Latency Monitor

File: modules/monitor/software/latency_monitor.h

Purpose: monitors end-to-end latency and message frequency

Run interval: 1.5 seconds

Data structures:

class LatencyMonitor : public RecurrentRunner {
 public:
  LatencyMonitor();
  void RunOnce(const double current_time) override;

  // Get the frequency of a channel
  bool GetFrequency(const std::string& channel_name, double* freq);

 private:
  void UpdateStat(const std::shared_ptr<LatencyRecordMap>& records);
  void PublishLatencyReport();
  void AggregateLatency();

  apollo::common::LatencyReport latency_report_;

  // Tracking map: track_id -> (timestamp, module_id, module_name)
  std::unordered_map<uint64_t,
                     std::set<std::tuple<uint64_t, uint64_t, std::string>>>
      track_map_;

  // Frequency cache: channel_name -> frequency
  std::unordered_map<std::string, double> freq_map_;

  double flush_time_ = 0.0;
};

Statistics algorithm:

LatencyStat GenerateStat(const std::vector<uint64_t>& numbers) {
  LatencyStat stat;
  uint64_t min_number = (1UL << 63);
  uint64_t max_number = 0;
  uint64_t sum = 0;
  for (const auto number : numbers) {
    min_number = std::min(min_number, number);
    max_number = std::max(max_number, number);
    sum += number;
  }
  const uint32_t sample_size = static_cast<uint32_t>(numbers.size());
  stat.set_min_duration(min_number);
  stat.set_max_duration(max_number);
  stat.set_aver_duration(sample_size == 0 ? 0 : sum / sample_size);
  stat.set_sample_size(sample_size);
  return stat;
}

Published topic: /apollo/monitor/latency_report

Report interval: 15 seconds (FLAGS_latency_report_interval)

5.5 Summary Monitor

File: modules/monitor/software/summary_monitor.h

Purpose: aggregates the results of all monitors and publishes the system status

Run interval: every cycle

Status escalation rule (summary_monitor.cc:33-45):

void SummaryMonitor::EscalateStatus(
    const ComponentStatus::Status new_status,
    const std::string& message,
    ComponentStatus* current_status) {
  // Priority: FATAL > ERROR > WARN > OK > UNKNOWN
  if (new_status > current_status->status()) {
    current_status->set_status(new_status);
    if (!message.empty()) {
      current_status->set_message(message);
    } else {
      current_status->clear_message();
    }
  }
}

Execution logic (summary_monitor.cc:51-99):

void SummaryMonitor::RunOnce(const double current_time) {
  auto manager = MonitorManager::Instance();
  auto* status = manager->GetStatus();

  // 1. Aggregate the status of each component
  for (auto& component : *status->mutable_components()) {
    auto* summary = component.second.mutable_summary();
    // Escalate each sub-status into the summary
    EscalateStatus(component.second.process_status().status(),
                   component.second.process_status().message(), summary);
    EscalateStatus(component.second.module_status().status(),
                   component.second.module_status().message(), summary);
    EscalateStatus(component.second.channel_status().status(),
                   component.second.channel_status().message(), summary);
    EscalateStatus(component.second.resource_status().status(),
                   component.second.resource_status().message(), summary);
    EscalateStatus(component.second.other_status().status(),
                   component.second.other_status().message(), summary);
  }

  // 2. Aggregate the status of each global component
  for (auto& component : *status->mutable_global_components()) {
    auto* summary = component.second.mutable_summary();
    EscalateStatus(component.second.process_status().status(),
                   component.second.process_status().message(), summary);
    EscalateStatus(component.second.resource_status().status(),
                   component.second.resource_status().message(), summary);
  }

  // 3. Detect a status change (using a hash fingerprint)
  static std::hash<std::string> hash_fn;
  std::string proto_bytes;
  status->SerializeToString(&proto_bytes);
  const size_t new_fp = hash_fn(proto_bytes);

  // 4. Publish when the status changed or the interval elapsed
  if (system_status_fp_ != new_fp ||
      current_time - last_broadcast_ >
          FLAGS_system_status_publish_interval) {
    static auto writer =
        manager->CreateWriter<SystemStatus>(FLAGS_system_status_topic);
    common::util::FillHeader("SystemMonitor", status);
    writer->Write(*status);
    status->clear_header();
    system_status_fp_ = new_fp;
    last_broadcast_ = current_time;
  }
}

Publishing strategy:

  • Publish immediately when the status changes
  • Otherwise publish at a fixed interval (FLAGS_system_status_publish_interval, default 10 seconds)
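
The "changed or interval elapsed" rule can be sketched independently of protobuf and CyberRT. In the sketch below, PublishGate is a hypothetical class, and the serialized status is represented by a plain string; it is an illustration of the rule, not Apollo's implementation.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Decides whether a serialized status should be published: either its hash
// fingerprint differs from the last published one, or too much time passed.
class PublishGate {
 public:
  explicit PublishGate(double max_interval_sec)
      : max_interval_sec_(max_interval_sec) {}

  bool ShouldPublish(const std::string& serialized, double current_time) {
    const std::size_t fp = std::hash<std::string>{}(serialized);
    const bool changed = !has_published_ || fp != last_fp_;
    const bool stale = current_time - last_broadcast_ > max_interval_sec_;
    if (!changed && !stale) {
      return false;
    }
    has_published_ = true;
    last_fp_ = fp;
    last_broadcast_ = current_time;
    return true;
  }

 private:
  double max_interval_sec_;
  bool has_published_ = false;
  std::size_t last_fp_ = 0;
  double last_broadcast_ = 0.0;
};
```

Only the fingerprint is stored, not the previous message, so the comparison is cheap. A hash collision could hide a real change; the timed republish bounds how long such a change can go unpublished.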

5.6 Functional Safety Monitor

File: modules/monitor/software/functional_safety_monitor.h

Purpose: checks safety conditions in autonomous driving mode and triggers EStop

Run interval: every cycle (when enabled)

Safety check logic (functional_safety_monitor.cc:49-76):

void FunctionalSafetyMonitor::RunOnce(const double current_time) {
  auto* system_status = MonitorManager::Instance()->GetStatus();

  // 1. Check whether the system is safe
  if (CheckSafety()) {
    // Everything is fine → clear the safety-mode markers
    system_status->clear_passenger_msg();
    system_status->clear_safety_mode_trigger_time();
    system_status->clear_require_emergency_stop();
    return;
  }

  // 2. EStop has already been requested → nothing to do
  if (system_status->require_emergency_stop()) {
    return;
  }

  // 3. Newly entering safety mode
  system_status->set_passenger_msg("Error! Please disengage.");
  if (!system_status->has_safety_mode_trigger_time()) {
    system_status->set_safety_mode_trigger_time(current_time);
    return;  // first cycle in safety mode: do not EStop yet
  }

  // 4. Request EStop automatically after the timeout
  if (system_status->safety_mode_trigger_time() +
          FLAGS_safety_mode_seconds_before_estop <  // default: 10 seconds
      current_time) {
    system_status->set_require_emergency_stop(true);
    // Log the event
    MonitorManager::Instance()->LogBuffer().ERROR(
        "Functional safety triggered emergency stop.");
  }
}

bool FunctionalSafetyMonitor::CheckSafety() {
  auto manager = MonitorManager::Instance();

  // Only check in autonomous driving mode
  if (!manager->IsInAutonomousMode()) {
    return true;  // not driving autonomously → safe
  }

  auto* status = manager->GetStatus();

  // Check the status of the critical components
  for (const auto& component : status->components()) {
    if (component.second.summary().status() == ComponentStatus::ERROR ||
        component.second.summary().status() == ComponentStatus::FATAL) {
      // Is this a safety-critical component?
      const auto& mode = manager->GetHMIMode();
      const auto& monitored_component =
          FindOrNull(mode.monitored_components(), component.first);
      if (monitored_component != nullptr &&
          monitored_component->required_for_safety()) {
        // A safety-critical component failed → unsafe
        manager->LogBuffer().ERROR(
            absl::StrCat(component.first, " triggers safe mode: ",
                         component.second.summary().message()));
        return false;
      }
    }
  }
  return true;  // all critical components are healthy → safe
}

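EStop flow

1. An error is detected in a safety-critical component
    ↓
2. Safety mode is entered (passenger_msg is set)
    ↓
3. The monitor waits 10 seconds (FLAGS_safety_mode_seconds_before_estop)
    ↓
4. If the component has still not recovered, require_emergency_stop is set
    ↓
5. The Control module watches this flag and triggers the emergency stop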
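
The timing behaviour of this flow is easier to see without the protobuf accessors. The sketch below is an illustration under assumptions: SafetyState and SafetyTick are hypothetical names standing in for the three SystemStatus fields and for RunOnce(), and the result of CheckSafety() is passed in as a plain bool.

```cpp
#include <cassert>

// Hypothetical stand-in for the three SystemStatus fields used by the
// safety logic; the real fields live in a protobuf message.
struct SafetyState {
  bool has_trigger_time = false;
  double trigger_time = 0.0;
  bool require_emergency_stop = false;
};

// One monitoring tick. `safe` is the outcome of the safety check and
// `grace_sec` plays the role of FLAGS_safety_mode_seconds_before_estop.
void SafetyTick(bool safe, double now, double grace_sec, SafetyState* s) {
  if (safe) {
    *s = SafetyState();  // recovered: leave safety mode entirely
    return;
  }
  if (s->require_emergency_stop) {
    return;  // EStop already requested
  }
  if (!s->has_trigger_time) {
    s->has_trigger_time = true;  // first unsafe tick: start the clock
    s->trigger_time = now;
    return;
  }
  if (s->trigger_time + grace_sec < now) {
    s->require_emergency_stop = true;  // grace period expired
  }
}
```

With a 10-second grace period, an error first seen at t = 100 s leads to an EStop request on the first tick after t = 110 s, and a recovery at any point before or after that resets all three fields.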

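5.7 Other Monitors

5.7.1 Camera Monitor

File location: modules/monitor/software/camera_monitor.h

Function: monitors the camera image streams

Execution interval: 5 seconds

5.7.2 Recorder Monitor

File location: modules/monitor/software/recorder_monitor.h

Function: monitors the recording state of SmartRecorder

Execution interval: 5 seconds

State mapping

  • RECORDING → OK
  • TERMINATING → WARN
  • STOPPED → ERROR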
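
Written as code, the state mapping is a three-way switch. This is only a sketch: the enum type names RecordingState and Status are hypothetical, and only the value names come from the list above.

```cpp
#include <cassert>

// Hypothetical enums; only the value names are taken from the mapping.
enum class RecordingState { RECORDING, TERMINATING, STOPPED };
enum class Status { UNKNOWN, OK, WARN, ERROR, FATAL };

// Maps a recorder state to the component status reported by the monitor.
Status ToComponentStatus(RecordingState state) {
  switch (state) {
    case RecordingState::RECORDING:
      return Status::OK;
    case RecordingState::TERMINATING:
      return Status::WARN;
    case RecordingState::STOPPED:
      return Status::ERROR;
  }
  return Status::UNKNOWN;  // not reached for the three values above
}
```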

6. Status Management

6.1 Status Hierarchy

SystemStatus
├─ hmi_modules: map<string, ComponentStatus>
│   └─ Planning: ComponentStatus{status, message}
│
├─ components: map<string, Component>
│   └─ Localization: Component
│       ├─ summary: ComponentStatus          (summary status)
│       ├─ process_status: ComponentStatus   (process check)
│       ├─ module_status: ComponentStatus    (node check)
│       ├─ channel_status: ComponentStatus   (channel check)
│       ├─ resource_status: ComponentStatus  (resource check)
│       └─ other_status: ComponentStatus     (other checks)
│
├─ other_components: map<string, ComponentStatus>
│   └─ OtherComponent: ComponentStatus
│
├─ global_components: map<string, Component>
│   └─ GlobalComponent: Component{...}
│
├─ passenger_msg: string                     (passenger prompt)
├─ safety_mode_trigger_time: double          (safety-mode trigger time)
└─ require_emergency_stop: bool              (emergency-stop flag)

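6.2 Status Escalation Example

Suppose the sub-statuses of the Localization component are:

process_status   = OK        (process is running)
module_status    = OK        (node exists)
channel_status   = WARN      (message frequency is low)
resource_status  = OK        (resources are normal)
other_status     = ERROR     (fusion quality is poor)

Escalating them through EscalateStatus:
↓
summary = max(OK, OK, WARN, OK, ERROR)
summary = ERROR  (highest priority)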
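
The "max" in this example works because the status values are ordered by priority. A minimal sketch of the idea, using simplified stand-ins for the protobuf types (the struct layout and the numeric enum values here are assumptions made for illustration):

```cpp
#include <cassert>
#include <string>

// Simplified stand-ins for the protobuf types. The values are ordered so
// that a larger value means a higher priority:
// FATAL > ERROR > WARN > OK > UNKNOWN.
enum Status { UNKNOWN = 0, OK = 1, WARN = 2, ERROR = 3, FATAL = 4 };

struct ComponentStatus {
  Status status = UNKNOWN;
  std::string message;
};

// Overwrites *current only when new_status has strictly higher priority,
// so the summary always ends up holding the worst sub-status seen so far.
void EscalateStatus(Status new_status, const std::string& message,
                    ComponentStatus* current) {
  if (new_status > current->status) {
    current->status = new_status;
    current->message = message;
  }
}
```

Feeding the five sub-statuses from the example into one summary object leaves it at ERROR with the message of the ERROR sub-status; a later WARN does not downgrade it.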

6.3 HMI Mode Configuration

File location: modules/common_msgs/dreamview_msgs/hmi_mode.proto

message MonitoredComponent {
  optional ProcessMonitorConfig process = 1;
  optional ChannelMonitorConfig channel = 2;
  optional ResourceMonitorConfig resource = 3;
  optional bool required_for_safety = 4 [default = true];
  optional ModuleMonitorConfig module = 5;
}

message HMIMode {
  map<string, apollo.dreamview.Module> modules = 1;
  map<string, MonitoredComponent> monitored_components = 2;
  map<string, ComponentStatus> other_components = 3;
  map<string, MonitoredComponent> global_components = 4;
}

Configuration example (from an HMI mode file):

monitored_components {
  key: "Localization"
  value {
    process {
      command_keywords: "mainboard"
      command_keywords: "-d"
      command_keywords: "modules/localization/dag/localization.dag"
    }
    module {
      node_name: "localization"
    }
    channel {
      name: "/apollo/localization/pose"
      delay_fatal: 3.0
      mandatory_fields: "pose.position.x"
      mandatory_fields: "pose.position.y"
      min_frequency_allowed: 5.0
      max_frequency_allowed: 15.0
    }
    required_for_safety: true
  }
}

monitored_components {
  key: "Planning"
  value {
    process {
      command_keywords: "mainboard"
      command_keywords: "modules/planning/dag/planning.dag"
    }
    module {
      node_name: "planning"
    }
    channel {
      name: "/apollo/planning/trajectory"
      delay_fatal: 3.0
      mandatory_fields: "trajectory_point"
      min_frequency_allowed: 1.0
      max_frequency_allowed: 15.0
    }
    required_for_safety: true
  }
}
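
To make the channel thresholds concrete, the sketch below evaluates a measured delay and frequency against delay_fatal, min_frequency_allowed, and max_frequency_allowed. It rests on assumptions: ChannelLimits and CheckChannel are hypothetical names, a non-positive limit is treated as "not configured", and the severities (FATAL for excessive delay, WARN for out-of-range frequency) are inferred from the field name and from the low-frequency WARN in section 6.2 rather than taken from the Apollo source.

```cpp
#include <cassert>

enum Status { UNKNOWN = 0, OK = 1, WARN = 2, ERROR = 3, FATAL = 4 };

// Mirrors three fields of the channel block in the configuration example.
// In this sketch a non-positive value means "not configured".
struct ChannelLimits {
  double delay_fatal = 0.0;            // seconds
  double min_frequency_allowed = 0.0;  // Hz
  double max_frequency_allowed = 0.0;  // Hz
};

// Assumed severity mapping: delay beyond delay_fatal is FATAL, a frequency
// outside [min, max] is WARN, anything else is OK.
Status CheckChannel(const ChannelLimits& cfg, double delay_sec,
                    double freq_hz) {
  if (cfg.delay_fatal > 0.0 && delay_sec > cfg.delay_fatal) {
    return FATAL;
  }
  if (cfg.min_frequency_allowed > 0.0 &&
      freq_hz < cfg.min_frequency_allowed) {
    return WARN;
  }
  if (cfg.max_frequency_allowed > 0.0 &&
      freq_hz > cfg.max_frequency_allowed) {
    return WARN;
  }
  return OK;
}
```

With the Localization limits (3.0 s, 5.0 Hz, 15.0 Hz), a message that is 0.2 s old and arrives at 8.5 Hz is OK, while the same channel at 2 Hz would be flagged WARN under these assumptions.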

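7. Configuration System

7.1 DAG Configuration

File location: modules/monitor/dag/monitor.dag

module_config {
  module_library : "modules/monitor/libmonitor.so"
  timer_components {
    class_name : "Monitor"
    config {
      name: "monitor"
      interval: 500      # execution period: 500 milliseconds
    }
  }
}

Notes

  • The Monitor component runs Proc() once every 500 ms
  • Proc() calls Tick() on each monitor in turn
  • Each monitor has its own execution interval (managed by RecurrentRunner)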
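
The relationship between the 500 ms frame and the per-monitor intervals can be sketched as a small base class. This is a simplified illustration, not the Apollo source: the member names and the CountingRunner subclass are assumptions, and logging is omitted.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Simplified sketch of a runner that is ticked on every frame but only
// does work once per interval.
class RecurrentRunner {
 public:
  RecurrentRunner(std::string name, double interval_sec)
      : name_(std::move(name)), interval_sec_(interval_sec) {}
  virtual ~RecurrentRunner() = default;

  // Called on every frame; runs RunOnce() only when the next round is due.
  void Tick(double current_time) {
    if (next_round_ <= current_time) {
      next_round_ = current_time + interval_sec_;
      RunOnce(current_time);
    }
  }

  virtual void RunOnce(double current_time) = 0;

  const std::string& name() const { return name_; }

 private:
  std::string name_;
  double interval_sec_;
  double next_round_ = 0.0;
};

// Hypothetical monitor with a 3-second interval that counts its runs.
class CountingRunner : public RecurrentRunner {
 public:
  CountingRunner() : RecurrentRunner("CountingRunner", 3.0) {}
  void RunOnce(double /*current_time*/) override { ++runs; }
  int runs = 0;
};
```

Ticking a CountingRunner every 0.5 s from t = 0 to t = 6 invokes RunOnce() three times (at 0, 3, and 6 seconds), even though Tick() itself is called thirteen times.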

7.2 Launch Configuration

File location: modules/monitor/launch/monitor.launch

<cyber>
  <module>
    <name>monitor</name>
    <dag_conf>modules/monitor/dag/monitor.dag</dag_conf>
    <process_name>monitor</process_name>
  </module>
</cyber>

7.3 Flag Configuration

Main configuration flags

// Monitor names and intervals
DEFINE_string(gps_monitor_name, "GpsMonitor", "GPS monitor name");
DEFINE_double(gps_monitor_interval, 3.0, "GPS monitor interval");
DEFINE_double(process_monitor_interval, 1.5, "Process monitor interval");
DEFINE_double(channel_monitor_interval, 5.0, "Channel monitor interval");
DEFINE_double(resource_monitor_interval, 5.0, "Resource monitor interval");

// Component names
DEFINE_string(gps_component_name, "GPS", "GPS component name");

// Topic names
DEFINE_string(hmi_status_topic, "/apollo/hmi/status", "HMI status topic");
DEFINE_string(system_status_topic, "/apollo/monitor/system_status", "System status topic");
DEFINE_string(chassis_topic, "/apollo/canbus/chassis", "Chassis topic");
DEFINE_string(gnss_best_pose_topic, "/apollo/sensor/gnss/best_pose", "GNSS best pose topic");

// System parameters
DEFINE_double(system_status_publish_interval, 10.0, "System status publish interval");
DEFINE_double(max_chassis_message_delay, 1.0, "Max chassis message delay");
DEFINE_bool(enable_functional_safety, true, "Enable functional safety monitor");
DEFINE_double(safety_mode_seconds_before_estop, 10.0, "Seconds before EStop");

// Latency monitoring
DEFINE_double(latency_monitor_interval, 1.5, "Latency monitor interval");
DEFINE_double(latency_report_interval, 15.0, "Latency report interval");
DEFINE_int32(latency_reader_capacity, 30, "Latency reader capacity");

8. Execution Flow

8.1 Complete Execution Flow

Timer fires (every 500 ms)
    ↓
Monitor::Proc()
    ↓
┌──────────────────────────────────────────────────────────┐
│ MonitorManager::StartFrame(current_time)                 │
│   ├─ Read the HMI status                                 │
│   ├─ Detect mode changes                                 │
│   │   └─ If the mode changed → reload the config         │
│   ├─ Clear last frame's component summaries              │
│   └─ Detect whether in autonomous mode                   │
└──────────────────────────────────────────────────────────┘
    ↓
┌──────────────────────────────────────────────────────────┐
│ Run all monitors                                         │
│                                                          │
│ EsdCanMonitor::Tick()          (every 3 s)               │
│   └─ When the interval elapses → RunOnce()               │
│       └─ Check the ESD CAN status                        │
│                                                          │
│ SocketCanMonitor::Tick()       (every 3 s)               │
│   └─ Check the Socket CAN status                         │
│                                                          │
│ GpsMonitor::Tick()             (every 3 s)               │
│   └─ Check the GNSS solution quality                     │
│       └─ Update component.other_status                   │
│                                                          │
│ LocalizationMonitor::Tick()    (every 5 s)               │
│   └─ Check the localization fusion status                │
│                                                          │
│ CameraMonitor::Tick()          (every 5 s)               │
│   └─ Check the camera images                             │
│                                                          │
│ ProcessMonitor::Tick()         (every 1.5 s or at once)  │
│   └─ Scan /proc/*/cmdline                                │
│   └─ Match the process keywords                          │
│   └─ Update component.process_status                     │
│                                                          │
│ ModuleMonitor::Tick()          (every 1.5 s)             │
│   └─ Query the CyberRT nodes                             │
│   └─ Update component.module_status                      │
│                                                          │
│ LatencyMonitor::Tick()         (every 1.5 s)             │
│   └─ Collect latency data                                │
│   └─ Compute frequencies                                 │
│   └─ Publish a latency report every 15 s                 │
│                                                          │
│ ChannelMonitor::Tick()         (every 5 s)               │
│   └─ Check message delay                                 │
│   └─ Check mandatory fields                              │
│   └─ Check message frequency                             │
│   └─ Update component.channel_status                     │
│                                                          │
│ ResourceMonitor::Tick()        (every 5 s)               │
│   └─ Check disk space                                    │
│   └─ Check CPU usage                                     │
│   └─ Check memory usage                                  │
│   └─ Check disk I/O load                                 │
│   └─ Update component.resource_status                    │
│                                                          │
│ SummaryMonitor::Tick()         (every tick)              │
│   └─ Summarize each component's status                   │
│   │   └─ summary = max(process, module, channel,         │
│   │                    resource, other)                  │
│   └─ Compute the status hash fingerprint                 │
│   └─ On change or timeout → publish SystemStatus         │
│                                                          │
│ FunctionalSafetyMonitor::Tick() (every tick, if enabled) │
│   └─ Check whether in autonomous mode                    │
│   └─ Check safety-critical component statuses            │
│   └─ On ERROR/FATAL → enter safety mode                  │
│   └─ After 10 s → set require_emergency_stop             │
└──────────────────────────────────────────────────────────┘
    ↓
┌──────────────────────────────────────────────────────────┐
│ MonitorManager::EndFrame()                               │
│   └─ log_buffer_.Publish()                              │
│       └─ Publish all monitor log messages                │
└──────────────────────────────────────────────────────────┘

8.2 Monitoring Flow Example: the Localization Component

Assume the HMI mode configuration includes the Localization component:

1. ProcessMonitor checks the localization process
   ├─ Scan /proc/*/cmdline
   ├─ Look for a process containing ["mainboard", "-d", "localization.dag"]
   ├─ Found: /opt/apollo/bin/mainboard -d modules/localization/dag/localization.dag
   └─ Result: process_status = OK

2. ModuleMonitor checks the localization node
   ├─ Query the CyberRT nodes
   ├─ Check that the "localization" node exists
   └─ Result: module_status = OK

3. ChannelMonitor checks the localization messages
   ├─ Read /apollo/localization/pose
   ├─ Check delay: 0.2 s < 3.0 s (delay_fatal) → OK
   ├─ Check fields: pose.position.x ✓, pose.position.y ✓ → OK
   ├─ Check frequency: 8.5 Hz, within [5.0, 15.0] → OK
   └─ Result: channel_status = OK

4. LocalizationMonitor checks the localization quality
   ├─ Read the LocalizationStatus message
   ├─ fusion_status = OK
   └─ Result: other_status = OK

5. SummaryMonitor summarizes
   ├─ summary = max(OK, OK, OK, OK)
   └─ Result: summary = OK

6. FunctionalSafetyMonitor checks safety
   └─ Localization.summary = OK → safe

If localization fails

Suppose LocalizationStatus.fusion_status = ERROR:

1. ProcessMonitor: process_status = OK (the process is still running)
2. ModuleMonitor: module_status = OK (the node still exists)
3. ChannelMonitor: channel_status = OK (messages are normal)
4. LocalizationMonitor: other_status = ERROR (fusion error)
5. SummaryMonitor: summary = max(OK, OK, OK, ERROR) = ERROR
6. FunctionalSafetyMonitor:
   └─ Detects Localization.summary = ERROR
   └─ required_for_safety = true
   └─ Enters safety mode
   └─ Sets passenger_msg = "Error! Please disengage."
   └─ Sets safety_mode_trigger_time = current_time
   └─ After 10 seconds → require_emergency_stop = true
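
Step 1 of the walkthrough, where ProcessMonitor looks for a process whose command line contains every configured keyword, is an AND match over substrings. A minimal sketch of that check (MatchesAllKeywords is a hypothetical name, and the command line is assumed to have its NUL separators already replaced by spaces):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns true only if every keyword occurs somewhere in the command line.
// An empty keyword list matches any command line.
bool MatchesAllKeywords(const std::string& cmdline,
                        const std::vector<std::string>& keywords) {
  for (const std::string& keyword : keywords) {
    if (cmdline.find(keyword) == std::string::npos) {
      return false;  // one missing keyword is enough to reject
    }
  }
  return true;
}
```

Because the match is by substring, a keyword such as "localization.dag" matches the full path modules/localization/dag/localization.dag, while a single mistyped keyword makes the whole match fail.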

9. Best Practices

9.1 Development Recommendations

9.1.1 Adding a New Monitor

// Step 1: create the monitor class
// my_monitor.h
class MyMonitor : public RecurrentRunner {
 public:
  MyMonitor() : RecurrentRunner("MyMonitor", 3.0) {}  // 3-second interval

  void RunOnce(const double current_time) override {
    auto manager = MonitorManager::Instance();
    // Look up the component entry
    Component* component = FindOrNull(
        *manager->GetStatus()->mutable_components(), "MyComponent");
    if (component == nullptr) {
      return;
    }
    ComponentStatus* status = component->mutable_other_status();
    // Run the check
    if (CheckMyCondition()) {
      SummaryMonitor::EscalateStatus(ComponentStatus::OK, "", status);
    } else {
      SummaryMonitor::EscalateStatus(ComponentStatus::ERROR,
                                     "My condition failed", status);
    }
  }

 private:
  bool CheckMyCondition();
};

// Step 2: register it in Monitor::Init()
bool Monitor::Init() {
  // ...
  runners_.emplace_back(new MyMonitor());
  // ...
}

// Step 3: add the component to the HMI mode configuration
monitored_components {
  key: "MyComponent"
  value {
    # monitoring options
  }
}

9.1.2 Configuring Monitored Items

# HMI mode configuration file
monitored_components {
  key: "Planning"
  value {
    # Process monitoring
    process {
      command_keywords: "mainboard"
      command_keywords: "modules/planning/dag/planning.dag"
    }
    # Node monitoring
    module {
      node_name: "planning"
      node_name: "planning_monitor"
    }
    # Channel monitoring
    channel {
      name: "/apollo/planning/trajectory"
      delay_fatal: 3.0
      mandatory_fields: "trajectory_point"
      mandatory_fields: "header.timestamp_sec"
      min_frequency_allowed: 1.0
      max_frequency_allowed: 15.0
    }
    # Resource monitoring
    resource {
      cpu_usages {
        process_dag_path: "modules/planning/dag/planning.dag"
        high_cpu_usage_warning: 80.0
        high_cpu_usage_error: 95.0
      }
      memory_usages {
        process_dag_path: "modules/planning/dag/planning.dag"
        high_memory_usage_warning: 1024   # MB
        high_memory_usage_error: 2048
      }
    }
    # Whether the component is safety-critical
    required_for_safety: true
  }
}

9.2 Debugging Tips

9.2.1 Viewing the System Status

# View the SystemStatus message with cyber_monitor
cyber_monitor -c /apollo/monitor/system_status

# Print the full message contents with the echo command
cyber_channel echo /apollo/monitor/system_status

9.2.2 Viewing the Latency Report

# View end-to-end latency
cyber_monitor -c /apollo/monitor/latency_report

9.2.3 Debug Logging

// Add detailed logging inside a monitor
AINFO << "Checking component: " << component_name;
AINFO << "Current status: " << ComponentStatus::Status_Name(status);
AERROR << "Check failed: " << error_message;

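9.3 Common Issues

9.3.1 Component Status Stuck at FATAL

Problem: a component's process_status stays at FATAL

Troubleshooting steps

  1. Check that the process is actually running: ps aux | grep <keyword>
  2. Check that the process keyword configuration is correct
  3. Inspect the actual contents of /proc/*/cmdline: cat /proc/$(pidof <process-name>)/cmdline | tr '\0' ' '

Solution

  • Adjust command_keywords in the HMI mode configuration
  • Make sure every keyword appears in the process command line

9.3.2 Channel Monitor Reports Errors

Problem: channel_status reports "missing field"

Troubleshooting steps

  1. Check whether the message is really missing the field
  2. Use protobuf reflection to inspect the message structure
  3. Check the spelling of the field name (it is case-sensitive)

Solution

  • Correct the mandatory_fields configuration
  • Or fix the upstream module so that it sends complete messages

9.3.3 Frequent EStop Triggers

Problem: FunctionalSafetyMonitor triggers emergency stops frequently

Troubleshooting steps

  1. Inspect SystemStatus to identify which component triggered safety mode
  2. Read that component's detailed error message
  3. Check whether required_for_safety: true is really needed for it

Solution

  • Fix the root cause that drives the component to ERROR/FATAL
  • Adjust the required_for_safety setting
  • Adjust the FLAGS_safety_mode_seconds_before_estop delay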

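10. Summary

10.1 Core Features

The Apollo Monitor module is a comprehensive, flexible, and efficient system for monitoring system health:

  1. Comprehensive coverage

    • Hardware layer: GPS, CAN, resources
    • Software layer: processes, nodes, channels, latency
    • Safety layer: functional safety checks, EStop
  2. Flexible configuration

    • Monitored items are configured dynamically through HMI mode files
    • Switching modes at runtime updates the monitored list automatically
    • Each component can have its own monitoring policy
  3. Efficient execution

    • 500 ms period, with an independent interval for each monitor
    • Hash fingerprints detect status changes
    • Reader/Writer caching avoids repeated creation
  4. Safety protection

    • Autonomous driving mode is detected automatically
    • Errors in safety-critical components trigger EStop automatically
    • The EStop delay is configurable (10 seconds by default)

10.2 Monitor Overview

Monitor                   Type                Interval     Main function
EsdCanMonitor             Hardware            3 s          ESD CAN card status
SocketCanMonitor          Hardware            3 s          Socket CAN status
GpsMonitor                Hardware            3 s          GNSS solution quality
ResourceMonitor           Hardware            5 s          CPU / memory / disk
LocalizationMonitor       Hardware/Software   5 s          Localization fusion status
ProcessMonitor            Software            1.5 s        Process running state
ModuleMonitor             Software            1.5 s        CyberRT nodes
ChannelMonitor            Software            5 s          Message delay / frequency / fields
LatencyMonitor            Software            1.5 s        End-to-end latency statistics
CameraMonitor             Software            5 s          Camera images
RecorderMonitor           Software            5 s          Recorder state
SummaryMonitor            Software            Every tick   Status summary and publishing
FunctionalSafetyMonitor   Safety              Every tick   Safety check and EStop

10.3 Technical Highlights

  • Layered monitoring: a three-layer hardware → software → safety structure
  • Priority escalation: FATAL > ERROR > WARN > OK > UNKNOWN
  • Autonomous driving protection: safety is checked and EStop triggered only in autonomous driving mode
  • Flexible process detection: keyword-based AND matching
  • Complete channel checks: delay + frequency + mandatory fields
  • In-depth resource monitoring: disk + CPU + memory + I/O
  • Efficient status publishing: changes are detected through hash fingerprints

The Monitor module gives the Apollo autonomous driving system reliable health monitoring and safety protection.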

