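Apollo Monitor Module: A Technical Deep Dive
Contents
- Module Overview
- Architecture
- Core Data Structures
- Hardware Monitors
- Software Monitors
- Status Management
- Configuration System
- Execution Flow
- Best Practices
- Summary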
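1. Module Overview
1.1 Role
The Monitor module is the "health check center" of the Apollo autonomous driving system. It is responsible for:
- Hardware monitoring: tracks the state of hardware such as GPS, CAN cards, and system resources
- Software monitoring: tracks processes, modules, message channels, and latency
- Safety protection: checks safety conditions while in autonomous driving mode and triggers an emergency stop when necessary
- Status aggregation: collects all monitoring results and publishes the overall system status
- Fault diagnosis: provides detailed error messages that help locate problems quickly
1.2 Technical Characteristics
- Periodic execution: built on CyberRT's TimerComponent and run every 500 ms
- Layered monitoring: three layers (hardware, software, safety)
- Flexible configuration: monitored items are configured dynamically through HMI mode files
- Priority escalation: FATAL > ERROR > WARN > OK > UNKNOWN
- Safety protection: in autonomous driving mode, safety conditions are checked automatically and EStop is triggered when they fail
- Efficient publishing: a hash fingerprint detects status changes, which reduces network overhead
1.3 Code Size
- Main files: monitor.h, monitor.cc
- Number of monitors: 13 (4 hardware monitors + 9 software monitors)
- Code location:
modules/monitor/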
2. Architecture
2.1 Layered System Architecture
┌────────────────────────────────────────────────────────────────┐
│ Monitor main component                                         │
│ (TimerComponent, 500 ms period)                                │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ MonitorManager                                                 │
│ (singleton, central manager)                                   │
│ - Status management (SystemStatus)                             │
│ - Configuration management (HMIMode)                           │
│ - Channel management (Reader/Writer cache)                     │
│ - Log buffering (MonitorLogBuffer)                             │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ Monitor layer                                                  │
│ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Hardware         │ │ Software         │ │ Safety           │ │
│ ├──────────────────┤ ├──────────────────┤ ├──────────────────┤ │
│ │ GPS Monitor      │ │ Process Mon      │ │ Functional       │ │
│ │ CAN Monitor      │ │ Module Mon       │ │ Safety Mon       │ │
│ │ Resource Mon     │ │ Channel Mon      │ └──────────────────┘ │
│ │ Localization Mon │ │ Latency Mon      │                      │
│ │                  │ │ Summary Mon      │                      │
│ │                  │ │ Camera Mon       │                      │
│ │                  │ │ Recorder Mon     │                      │
│ └──────────────────┘ └──────────────────┘                      │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ Output layer                                                   │
│ /apollo/monitor/system_status (SystemStatus message)           │
│ /apollo/monitor/latency_report (LatencyReport message)         │
│ MonitorLog messages                                            │
└────────────────────────────────────────────────────────────────┘
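2.2 Directory Layout
modules/monitor/
├── monitor.h                          # Main component header
├── monitor.cc                         # Main component implementation
├── BUILD                              # Build configuration
├── README.md                          # Documentation
├── cyberfile.xml                      # Package metadata
│
├── dag/
│   └── monitor.dag                    # DAG configuration (500 ms period)
│
├── launch/
│   └── monitor.launch                 # Launch configuration
│
├── common/                            # Shared base classes
│   ├── monitor_manager.h              # Central manager (singleton)
│   ├── monitor_manager.cc
│   ├── recurrent_runner.h             # Base class for periodically run monitors
│   └── recurrent_runner.cc
│
├── hardware/                          # Hardware monitors (4)
│   ├── gps_monitor.h/cc               # GPS/GNSS status monitoring
│   ├── esdcan_monitor.h/cc            # ESD CAN card monitoring
│   ├── socket_can_monitor.h/cc        # Socket CAN monitoring
│   └── resource_monitor.h/cc          # System resource monitoring
│
└── software/                          # Software monitors (9)
    ├── process_monitor.h/cc           # Process liveness monitoring
    ├── module_monitor.h/cc            # Module node monitoring
    ├── channel_monitor.h/cc           # Message channel monitoring
    ├── latency_monitor.h/cc           # End-to-end latency monitoring
    ├── summary_monitor.h/cc           # Status aggregation and publishing
    ├── functional_safety_monitor.h/cc # Functional safety checks
    ├── localization_monitor.h/cc      # Localization status monitoring
    ├── camera_monitor.h/cc            # Camera monitoring
    └── recorder_monitor.h/cc          # Recorder status monitoring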
2.3 Design Patterns
2.3.1 Singleton Pattern
MonitorManager is a singleton, which guarantees one global instance:
class MonitorManager : public Singleton<MonitorManager> {
 public:
  // Get the singleton instance
  static MonitorManager* Instance();

  // Copying and assignment are disabled
  MonitorManager(const MonitorManager&) = delete;
  MonitorManager& operator=(const MonitorManager&) = delete;
};

// Usage
auto manager = MonitorManager::Instance();
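For readers who want to try the pattern outside Apollo, the following is a minimal sketch of a singleton built on a function-local static. StatusRegistry and its mode field are hypothetical names used only for illustration; this is not how Apollo's Singleton helper is implemented.

```cpp
#include <string>

// Hypothetical stand-in for a process-wide manager (not Apollo's MonitorManager).
class StatusRegistry {
 public:
  // A function-local static is constructed once, on first use,
  // and its initialization is thread-safe since C++11.
  static StatusRegistry* Instance() {
    static StatusRegistry instance;
    return &instance;
  }

  // Copying and assignment are disabled so no second instance can appear.
  StatusRegistry(const StatusRegistry&) = delete;
  StatusRegistry& operator=(const StatusRegistry&) = delete;

  void set_mode(const std::string& mode) { mode_ = mode; }
  const std::string& mode() const { return mode_; }

 private:
  StatusRegistry() = default;  // Only Instance() may construct.
  std::string mode_;
};
```

Every call to StatusRegistry::Instance() returns the same pointer, so state written through one call site is visible at all others.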
2.3.2 Template Method Pattern
RecurrentRunner defines the execution skeleton shared by all monitors:
class RecurrentRunner {
 public:
  // Template method: defines the execution skeleton
  void Tick(const double current_time) {
    if (ShouldRun(current_time)) {
      RunOnce(current_time);  // Implemented by subclasses
    }
  }

  // Pure virtual: every subclass must implement it
  virtual void RunOnce(const double current_time) = 0;

 private:
  bool ShouldRun(const double current_time);
};

// Subclasses supply the concrete logic
class GpsMonitor : public RecurrentRunner {
 public:
  void RunOnce(const double current_time) override {
    // GPS monitoring logic
  }
};
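2.3.3 Strategy Pattern
Each monitor implements its own monitoring strategy behind the same RunOnce interface:
// Hardware monitoring strategies
class GpsMonitor : public RecurrentRunner {
  void RunOnce(const double current_time) override { /* Check GPS solution quality */ }
};

class ResourceMonitor : public RecurrentRunner {
  void RunOnce(const double current_time) override { /* Check CPU, memory, and disk */ }
};

// Software monitoring strategies
class ProcessMonitor : public RecurrentRunner {
  void RunOnce(const double current_time) override { /* Check whether processes are running */ }
};

class ChannelMonitor : public RecurrentRunner {
  void RunOnce(const double current_time) override { /* Check message channels */ }
};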
3. Core Data Structures
3.1 Monitor - Main Component
File: modules/monitor/monitor.h
class Monitor : public apollo::cyber::TimerComponent {
 public:
  bool Init() override;
  bool Proc() override;

 private:
  std::vector<std::shared_ptr<RecurrentRunner>> runners_;
};
Init implementation (monitor.cc:40-75):
bool Monitor::Init() {
  // Initialize the manager
  MonitorManager::Instance()->Init(node_);

  // Create hardware monitors
  runners_.emplace_back(new EsdCanMonitor());
  runners_.emplace_back(new SocketCanMonitor());
  runners_.emplace_back(new GpsMonitor());
  runners_.emplace_back(new LocalizationMonitor());

  // Create software monitors
  runners_.emplace_back(new CameraMonitor());
  runners_.emplace_back(new ProcessMonitor());
  runners_.emplace_back(new ModuleMonitor());
  const std::shared_ptr<LatencyMonitor> latency_monitor(new LatencyMonitor());
  runners_.emplace_back(latency_monitor);
  runners_.emplace_back(new ChannelMonitor(latency_monitor));
  runners_.emplace_back(new ResourceMonitor());
  runners_.emplace_back(new SummaryMonitor());

  // Functional safety monitor (optional)
  if (FLAGS_enable_functional_safety) {
    runners_.emplace_back(new FunctionalSafetyMonitor());
  }
  return true;
}
Proc implementation (monitor.cc:77-85):
bool Monitor::Proc() {
  const double current_time = Clock::NowInSeconds();

  // Start a new monitoring cycle
  if (!MonitorManager::Instance()->StartFrame(current_time)) {
    return false;
  }

  // Run every monitor
  for (auto& runner : runners_) {
    runner->Tick(current_time);
  }

  // End the monitoring cycle
  MonitorManager::Instance()->EndFrame();
  return true;
}
3.2 MonitorManager - Central Manager
File: modules/monitor/common/monitor_manager.h
class MonitorManager : public Singleton<MonitorManager> {
 public:
  void Init(const std::shared_ptr<cyber::Node>& node);

  // Frame management
  bool StartFrame(const double current_time);
  void EndFrame();

  // Accessors
  const HMIMode& GetHMIMode() const { return mode_config_; }
  bool IsInAutonomousMode() const { return in_autonomous_driving_; }
  SystemStatus* GetStatus() { return &status_; }
  MonitorLogBuffer& LogBuffer() { return log_buffer_; }

  // Create a Reader/Writer (cached)
  template <class T>
  std::shared_ptr<cyber::Reader<T>> CreateReader(const std::string& channel);
  template <class T>
  std::shared_ptr<cyber::Writer<T>> CreateWriter(const std::string& channel);

 private:
  bool CheckAutonomousDriving(const double current_time);

  std::shared_ptr<cyber::Node> node_;

  // Core data
  SystemStatus status_;                 // System status
  std::string current_mode_;            // Name of the current HMI mode
  HMIMode mode_config_;                 // Configuration of the current mode
  bool in_autonomous_driving_ = false;  // Autonomous driving flag
  MonitorLogBuffer log_buffer_;         // Log buffer

  // Reader cache (avoids creating the same reader twice)
  std::unordered_map<std::string, std::shared_ptr<cyber::ReaderBase>> readers_;
};
StartFrame implementation (monitor_manager.cc:46-94):
bool MonitorManager::StartFrame(const double current_time) {
  // 1. Read the HMI status
  static auto hmi_status_reader =
      CreateReader<HMIStatus>(FLAGS_hmi_status_topic);
  auto hmi_status = hmi_status_reader->GetLatestObserved();
  if (hmi_status == nullptr) {
    AERROR_EVERY(100) << "No HMIStatus was received.";
    return false;
  }

  // 2. Detect a mode change
  const std::string current_mode = hmi_status->current_mode();
  if (current_mode_ != current_mode) {
    AINFO << "HMI mode changed from " << current_mode_
          << " to " << current_mode;
    // Load the configuration of the new mode
    if (!cyber::common::GetProtoFromFile(
            HMIModeHelper::GetModeDefinitionPath(current_mode),
            &mode_config_)) {
      AERROR << "Failed to load mode config for " << current_mode;
      return false;
    }
    current_mode_ = current_mode;

    // Rebuild the list of monitored components
    status_.clear_components();
    for (const auto& iter : mode_config_.monitored_components()) {
      status_.mutable_components()->insert({iter.first, Component()});
    }
  } else {
    // Clear the summaries left over from the previous frame
    for (auto& iter : *status_.mutable_components()) {
      iter.second.clear_summary();
    }
  }

  // 3. Detect whether the vehicle is driving autonomously
  in_autonomous_driving_ = CheckAutonomousDriving(current_time);
  return true;
}
CheckAutonomousDriving implementation:
bool MonitorManager::CheckAutonomousDriving(const double current_time) {
  static auto chassis_reader =
      CreateReader<Chassis>(FLAGS_chassis_topic);
  auto chassis = chassis_reader->GetLatestObserved();
  if (chassis == nullptr) {
    return false;
  }

  // Ignore stale chassis messages
  if (current_time - chassis->header().timestamp_sec() >
      FLAGS_max_chassis_message_delay) {
    return false;
  }

  // Check the driving mode
  return chassis->driving_mode() == Chassis::COMPLETE_AUTO_DRIVE;
}
3.3 RecurrentRunner - Base Class
File: modules/monitor/common/recurrent_runner.h
class RecurrentRunner {
 public:
  RecurrentRunner(const std::string& name, const double interval);
  virtual ~RecurrentRunner() = default;

  // Main entry point
  void Tick(const double current_time);

  // Must be implemented by subclasses
  virtual void RunOnce(const double current_time) = 0;

 protected:
  std::string name_;          // Monitor name
  unsigned int round_count_;  // Number of rounds executed

 private:
  double interval_;    // Execution interval (seconds)
  double next_round_;  // Time of the next execution
};
Tick implementation (recurrent_runner.cc:27-37):
void RecurrentRunner::Tick(const double current_time) {
  // Special case for ProcessMonitor: run immediately when requested
  if (name_ == "ProcessMonitor" &&
      MonitorManager::Instance()->GetStatus()->detect_immediately()) {
    RunOnce(current_time);
    return;
  }

  // Otherwise run once per interval
  if (next_round_ <= current_time) {
    ++round_count_;
    AINFO_EVERY(100) << name_ << " is running round #" << round_count_;
    next_round_ = current_time + interval_;
    RunOnce(current_time);
  }
}
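The interaction between the 500 ms outer timer and a per-monitor interval can be checked with a small standalone sketch. SimpleRunner, CountingRunner, and SimulateTicks are hypothetical names; the sketch keeps only the interval gating from Tick and leaves out logging and the ProcessMonitor special case.

```cpp
#include <string>
#include <utility>

// Simplified runner that keeps only the interval gating.
class SimpleRunner {
 public:
  SimpleRunner(std::string name, double interval_sec)
      : name_(std::move(name)), interval_(interval_sec) {}
  virtual ~SimpleRunner() = default;

  // Called on every outer timer cycle; runs the monitor only when due.
  void Tick(double current_time) {
    if (next_round_ <= current_time) {
      ++round_count_;
      next_round_ = current_time + interval_;
      RunOnce(current_time);
    }
  }

  virtual void RunOnce(double current_time) = 0;
  unsigned int round_count() const { return round_count_; }

 private:
  std::string name_;
  double interval_;
  double next_round_ = 0.0;
  unsigned int round_count_ = 0;
};

// Hypothetical monitor that only remembers when it last ran.
class CountingRunner : public SimpleRunner {
 public:
  using SimpleRunner::SimpleRunner;
  void RunOnce(double current_time) override { last_run_ = current_time; }
  double last_run() const { return last_run_; }

 private:
  double last_run_ = -1.0;
};

// Drives the runner with a 500 ms outer timer for `ticks` cycles,
// starting at t = 0, and returns how many times RunOnce was invoked.
unsigned int SimulateTicks(CountingRunner* runner, int ticks) {
  for (int i = 0; i < ticks; ++i) {
    runner->Tick(0.5 * i);
  }
  return runner->round_count();
}
```

With a 1.5 s interval and seven outer cycles (t = 0.0 to 3.0), the runner fires at t = 0.0, 1.5, and 3.0. The effective period of a monitor is therefore its interval rounded up to a multiple of the outer timer period.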
3.4 SystemStatus - System Status
File: modules/common_msgs/monitor_msgs/system_status.proto
message ComponentStatus {
  enum Status {
    UNKNOWN = 0;
    OK = 1;
    WARN = 2;
    ERROR = 3;
    FATAL = 4;  // Highest priority
  }
  optional Status status = 1 [default = UNKNOWN];
  optional string message = 2;
}

message Component {
  optional ComponentStatus summary = 1;          // Aggregated status
  optional ComponentStatus process_status = 2;   // Result of the process check
  optional ComponentStatus channel_status = 3;   // Result of the channel check
  optional ComponentStatus resource_status = 4;  // Result of the resource check
  optional ComponentStatus other_status = 5;     // Result of other checks
  optional ComponentStatus module_status = 6;    // Result of the module check
}

message SystemStatus {
  optional apollo.common.Header header = 1;

  // Component status collections
  map<string, ComponentStatus> hmi_modules = 7;        // HMI module status
  map<string, Component> components = 8;               // Monitored component status
  map<string, ComponentStatus> other_components = 10;  // Other component status
  map<string, Component> global_components = 11;       // Global component status

  // Safety related
  optional string passenger_msg = 4;             // Message for the passenger
  optional double safety_mode_trigger_time = 5;  // Time at which safety mode was triggered
  optional bool require_emergency_stop = 6;      // Whether EStop is required

  // Other flags
  optional bool is_realtime_in_simulation = 9;  // Simulation flag
  optional bool detect_immediately = 12;        // Immediate detection flag
}
Status priority:
FATAL (4) > ERROR (3) > WARN (2) > OK (1) > UNKNOWN (0)
4. Hardware Monitors
4.1 GPS Monitor
File: modules/monitor/hardware/gps_monitor.h
Purpose: monitors the quality of the GNSS positioning solution
Execution interval: 3 seconds
Implementation (gps_monitor.cc:39-72):
void GpsMonitor::RunOnce(const double current_time) {
  auto manager = MonitorManager::Instance();

  // Look up the GPS component
  Component* component = FindOrNull(
      *manager->GetStatus()->mutable_components(),
      FLAGS_gps_component_name);  // Default: "GPS"
  if (component == nullptr) {
    return;  // The current mode does not monitor GPS
  }

  // Read the latest GnssBestPose message
  static auto reader =
      manager->CreateReader<GnssBestPose>(FLAGS_gnss_best_pose_topic);
  auto gnss_best_pose = reader->GetLatestObserved();
  ComponentStatus* status = component->mutable_other_status();

  // Check that a message exists
  if (gnss_best_pose == nullptr) {
    SummaryMonitor::EscalateStatus(
        ComponentStatus::ERROR, "No GnssBestPose message", status);
    return;
  }

  // Check the solution quality
  switch (gnss_best_pose->sol_type()) {
    case SolutionType::NARROW_INT:  // RTK fixed solution
      SummaryMonitor::EscalateStatus(ComponentStatus::OK, "", status);
      break;
    case SolutionType::SINGLE:  // Single-point positioning
      SummaryMonitor::EscalateStatus(
          ComponentStatus::WARN, "SolutionType is SINGLE", status);
      break;
    default:  // Any other, degraded state
      SummaryMonitor::EscalateStatus(
          ComponentStatus::ERROR, "SolutionType is wrong", status);
      break;
  }
}
GNSS solution types:
- NARROW_INT: RTK fixed solution (best) → OK
- SINGLE: single-point positioning (poor) → WARN
- Anything else: invalid state → ERROR
4.2 CAN Monitors
Files:
- modules/monitor/hardware/esdcan_monitor.h
- modules/monitor/hardware/socket_can_monitor.h
Purpose: monitors the state of the CAN communication hardware
Execution interval: 3 seconds
ESD CAN status code mapping:
// esdcan_monitor.cc
switch (status_code) {
  case NTCAN_SUCCESS:  // Normal
    status = ComponentStatus::OK;
    break;
  case NTCAN_RX_TIMEOUT:  // Receive timeout
  case NTCAN_TX_TIMEOUT:  // Transmit timeout
    status = ComponentStatus::WARN;
    break;
  case NTCAN_TX_ERROR:       // Transmit error
  case NTCAN_CONTR_OFF_BUS:  // Controller is off the bus
    status = ComponentStatus::ERROR;
    break;
  case NTCAN_CONTR_BUSY:  // Controller busy
    status = ComponentStatus::FATAL;
    break;
  default:
    status = ComponentStatus::UNKNOWN;
    break;
}
4.3 Resource Monitor
File: modules/monitor/hardware/resource_monitor.h
Purpose: monitors system resource usage
Execution interval: 5 seconds
Monitored metrics:
4.3.1 Disk Space
void ResourceMonitor::CheckDiskSpace(const ResourceMonitorConfig& config,
                                     ComponentStatus* status) {
  for (const auto& disk_space : config.disk_spaces()) {
    // 1. Iterate over the matching paths (wildcards are supported)
    for (const auto& path : cyber::common::Glob(disk_space.path())) {
      // 2. Read the filesystem statistics
      struct statvfs stat;
      if (statvfs(path.c_str(), &stat) != 0) {
        continue;
      }

      // 3. Compute the available space in GB. f_bavail is counted in units
      //    of f_frsize; converting to double first avoids integer overflow.
      const double available_gb =
          static_cast<double>(stat.f_bavail) * stat.f_frsize / (1 << 30);

      // 4. Compare against the thresholds
      if (available_gb < disk_space.insufficient_space_error()) {
        SummaryMonitor::EscalateStatus(
            ComponentStatus::ERROR,
            absl::StrCat("Insufficient disk space at ", path,
                         ": ", available_gb, " GB"),
            status);
      } else if (available_gb < disk_space.insufficient_space_warning()) {
        SummaryMonitor::EscalateStatus(
            ComponentStatus::WARN,
            absl::StrCat("Low disk space at ", path,
                         ": ", available_gb, " GB"),
            status);
      }
    }
  }
}
Configuration example:
disk_spaces {
  path: "/apollo/data/*"
  insufficient_space_warning: 50  # Warning threshold: 50 GB
  insufficient_space_error: 20    # Error threshold: 20 GB
}
4.3.2 CPU Usage
void ResourceMonitor::CheckCPUUsage(const ResourceMonitorConfig& config,
                                    ComponentStatus* status) {
  for (const auto& cpu_usage : config.cpu_usages()) {
    float usage_percentage = 0.0;
    if (cpu_usage.process_dag_path().empty()) {
      // CPU usage of the whole system
      usage_percentage = GetSystemCPUUsage();
    } else {
      // CPU usage of one specific process
      usage_percentage = GetProcessCPUUsage(cpu_usage.process_dag_path());
    }

    // Compare against the thresholds
    if (usage_percentage > cpu_usage.high_cpu_usage_error()) {
      SummaryMonitor::EscalateStatus(
          ComponentStatus::ERROR,
          absl::StrCat("High CPU usage: ", usage_percentage, "%"),
          status);
    } else if (usage_percentage > cpu_usage.high_cpu_usage_warning()) {
      SummaryMonitor::EscalateStatus(
          ComponentStatus::WARN,
          absl::StrCat("CPU usage warning: ", usage_percentage, "%"),
          status);
    }
  }
}

// Read the system CPU usage from /proc/stat.
// The first call has no previous sample, so it reports the average since boot.
float GetSystemCPUUsage() {
  static uint64_t last_total = 0, last_idle = 0;

  std::ifstream stat_file("/proc/stat");
  std::string line;
  std::getline(stat_file, line);  // First line: "cpu ..."

  std::istringstream iss(line);
  std::string cpu;
  uint64_t user, nice, system, idle, iowait, irq, softirq, steal;
  iss >> cpu >> user >> nice >> system >> idle >> iowait >> irq >> softirq >> steal;

  uint64_t total = user + nice + system + idle + iowait + irq + softirq + steal;
  uint64_t diff_total = total - last_total;
  uint64_t diff_idle = idle - last_idle;
  last_total = total;
  last_idle = idle;

  // Guard against division by zero when no time has elapsed between samples
  if (diff_total == 0) {
    return 0.0f;
  }
  return 100.0f * (1.0f - static_cast<float>(diff_idle) / diff_total);
}
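The calculation above can be separated from file I/O so that it is easy to test. The sketch below is an assumption-laden variant rather than Apollo's code: CpuSample, ParseCpuLine, and CpuUsagePercent are hypothetical names, and the caller is expected to keep the previous sample instead of relying on static variables.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Aggregate jiffies taken from the first "cpu ..." line of /proc/stat.
struct CpuSample {
  uint64_t total = 0;
  uint64_t idle = 0;
};

// Parses "cpu user nice system idle iowait irq softirq steal ...".
// Returns false for per-core lines such as "cpu0" or for truncated input.
bool ParseCpuLine(const std::string& line, CpuSample* sample) {
  std::istringstream iss(line);
  std::string label;
  uint64_t user = 0, nice = 0, system = 0, idle = 0;
  uint64_t iowait = 0, irq = 0, softirq = 0, steal = 0;
  if (!(iss >> label >> user >> nice >> system >> idle >> iowait >> irq >>
        softirq >> steal) ||
      label != "cpu") {
    return false;
  }
  sample->total = user + nice + system + idle + iowait + irq + softirq + steal;
  sample->idle = idle;
  return true;
}

// Busy percentage between two samples; 0 when no time has elapsed.
double CpuUsagePercent(const CpuSample& prev, const CpuSample& curr) {
  const uint64_t diff_total = curr.total - prev.total;
  if (diff_total == 0) {
    return 0.0;
  }
  const uint64_t diff_idle = curr.idle - prev.idle;
  return 100.0 * (1.0 - static_cast<double>(diff_idle) / diff_total);
}
```

Feeding two consecutive "cpu" lines through ParseCpuLine and passing the results to CpuUsagePercent gives the share of non-idle time in that window, which is the value the thresholds are compared against.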
4.3.3 Memory Usage
void ResourceMonitor::CheckMemoryUsage(const ResourceMonitorConfig& config,
                                       ComponentStatus* status) {
  for (const auto& memory_usage : config.memory_usages()) {
    int usage_mb = 0;
    if (memory_usage.process_dag_path().empty()) {
      // Memory usage of the whole system
      usage_mb = GetSystemMemoryUsage();
    } else {
      // Memory usage of one specific process
      usage_mb = GetProcessMemoryUsage(memory_usage.process_dag_path());
    }

    // Compare against the thresholds
    if (usage_mb > memory_usage.high_memory_usage_error()) {
      SummaryMonitor::EscalateStatus(
          ComponentStatus::ERROR,
          absl::StrCat("High memory usage: ", usage_mb, " MB"),
          status);
    } else if (usage_mb > memory_usage.high_memory_usage_warning()) {
      SummaryMonitor::EscalateStatus(
          ComponentStatus::WARN,
          absl::StrCat("Memory usage warning: ", usage_mb, " MB"),
          status);
    }
  }
}

// Read the system memory usage from /proc/meminfo
int GetSystemMemoryUsage() {
  std::ifstream meminfo("/proc/meminfo");
  std::string line;
  // unsigned long long matches the %llu conversion on every platform
  unsigned long long total_kb = 0, available_kb = 0;
  while (std::getline(meminfo, line)) {
    if (line.find("MemTotal:") == 0) {
      sscanf(line.c_str(), "MemTotal: %llu kB", &total_kb);
    } else if (line.find("MemAvailable:") == 0) {
      sscanf(line.c_str(), "MemAvailable: %llu kB", &available_kb);
    }
  }
  const unsigned long long used_kb = total_kb - available_kb;
  return static_cast<int>(used_kb / 1024);  // Convert to MB
}
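As a hedged alternative to sscanf, the same "MemTotal minus MemAvailable" computation can be written against any input stream, which makes it testable without a real /proc filesystem. UsedMemoryMb is a hypothetical helper, and returning -1 for missing fields is a choice made for this sketch only.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Returns used memory in MB (MemTotal - MemAvailable) read from a
// /proc/meminfo-style stream, or -1 if either field is missing or the
// values are inconsistent.
int64_t UsedMemoryMb(std::istream& meminfo) {
  uint64_t total_kb = 0, available_kb = 0;
  bool has_total = false, has_available = false;
  std::string line;
  while (std::getline(meminfo, line)) {
    std::istringstream iss(line);
    std::string key;
    uint64_t value = 0;
    if (!(iss >> key >> value)) {
      continue;  // Skip lines that are not "Key: number ..."
    }
    if (key == "MemTotal:") {
      total_kb = value;
      has_total = true;
    } else if (key == "MemAvailable:") {
      available_kb = value;
      has_available = true;
    }
  }
  if (!has_total || !has_available || available_kb > total_kb) {
    return -1;
  }
  return static_cast<int64_t>((total_kb - available_kb) / 1024);
}
```

In production the stream would be a std::ifstream opened on /proc/meminfo; in a unit test it can be a std::istringstream holding a few sample lines.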
4.3.4 Disk I/O Load
void ResourceMonitor::CheckDiskLoads(const ResourceMonitorConfig& config,
                                     ComponentStatus* status) {
  for (const auto& disk_load : config.disk_load_usages()) {
    int load = GetDiskLoad(disk_load.device_name());  // Reads /proc/diskstats
    if (load > disk_load.high_disk_load_error()) {
      SummaryMonitor::EscalateStatus(
          ComponentStatus::ERROR,
          absl::StrCat("High disk load on ", disk_load.device_name(),
                       ": ", load),
          status);
    } else if (load > disk_load.high_disk_load_warning()) {
      SummaryMonitor::EscalateStatus(
          ComponentStatus::WARN,
          absl::StrCat("Disk load warning on ", disk_load.device_name(),
                       ": ", load),
          status);
    }
  }
}
4.4 Localization Monitor
File: modules/monitor/software/localization_monitor.h
Purpose: monitors the fusion status of the localization module
Execution interval: 5 seconds
Note: the source file lives under software/, although this article groups the monitor with the hardware monitors.
Implementation:
void LocalizationMonitor::RunOnce(const double current_time) {
  auto manager = MonitorManager::Instance();

  // Read the LocalizationStatus message
  static auto reader =
      manager->CreateReader<LocalizationStatus>(FLAGS_localization_status_topic);
  auto status_msg = reader->GetLatestObserved();

  Component* component = FindOrNull(
      *manager->GetStatus()->mutable_components(), "Localization");
  if (component == nullptr) {
    return;
  }
  ComponentStatus* status = component->mutable_other_status();

  if (status_msg == nullptr) {
    SummaryMonitor::EscalateStatus(
        ComponentStatus::ERROR, "No LocalizationStatus message", status);
    return;
  }

  // Map the fusion status
  switch (status_msg->fusion_status()) {
    case LocalizationStatus::OK:
      SummaryMonitor::EscalateStatus(ComponentStatus::OK, "", status);
      break;
    case LocalizationStatus::WARNNING:
      SummaryMonitor::EscalateStatus(
          ComponentStatus::WARN,
          "Localization fusion status is WARNNING", status);
      break;
    case LocalizationStatus::ERROR:
      SummaryMonitor::EscalateStatus(
          ComponentStatus::ERROR,
          "Localization fusion status is ERROR", status);
      break;
    case LocalizationStatus::FATAL:
      SummaryMonitor::EscalateStatus(
          ComponentStatus::FATAL,
          "Localization fusion status is FATAL", status);
      break;
  }
}
5. Software Monitors
5.1 Process Monitor
File: modules/monitor/software/process_monitor.h
Purpose: checks whether critical processes are running
Execution interval: 1.5 seconds (can also be triggered immediately)
Implementation (process_monitor.cc:40-97):
void ProcessMonitor::RunOnce(const double current_time) {
  auto manager = MonitorManager::Instance();

  // 1. Collect the command lines of all running processes
  std::vector<std::string> running_processes;
  for (const auto& cmd_file : cyber::common::Glob("/proc/*/cmdline")) {
    std::string cmd_string;
    if (cyber::common::GetContent(cmd_file, &cmd_string) &&
        !cmd_string.empty()) {
      // Arguments in cmdline are separated by \0; replace them with spaces
      std::replace(cmd_string.begin(), cmd_string.end(), '\0', ' ');
      running_processes.push_back(cmd_string);
    }
  }

  const auto& mode = manager->GetHMIMode();

  // 2. Check the processes of HMI modules
  auto* hmi_modules = manager->GetStatus()->mutable_hmi_modules();
  for (const auto& iter : mode.modules()) {
    const std::string& module_name = iter.first;
    const auto& config = iter.second.process_monitor_config();
    UpdateStatus(running_processes, config, &(*hmi_modules)[module_name]);
  }

  // 3. Check the processes of monitored components
  auto* components = manager->GetStatus()->mutable_components();
  for (const auto& iter : mode.monitored_components()) {
    const std::string& name = iter.first;
    if (iter.second.has_process()) {
      const auto& config = iter.second.process();
      auto* status = (*components)[name].mutable_process_status();
      UpdateStatus(running_processes, config, status);
    }
  }

  // 4. Check the processes of global components
  auto* global_components = manager->GetStatus()->mutable_global_components();
  for (const auto& iter : mode.global_components()) {
    const std::string& name = iter.first;
    if (iter.second.has_process()) {
      const auto& config = iter.second.process();
      auto* status = (*global_components)[name].mutable_process_status();
      UpdateStatus(running_processes, config, status);
    }
  }
}
Matching logic:
void ProcessMonitor::UpdateStatus(
    const std::vector<std::string>& running_processes,
    const ProcessMonitorConfig& config,
    ComponentStatus* status) {
  status->clear_status();

  // Walk through all running processes
  for (const std::string& command : running_processes) {
    bool all_keywords_matched = true;
    // The command line must contain every keyword (AND logic)
    for (const std::string& keyword : config.command_keywords()) {
      if (command.find(keyword) == std::string::npos) {
        all_keywords_matched = false;
        break;
      }
    }
    if (all_keywords_matched) {
      // Found a matching process
      SummaryMonitor::EscalateStatus(ComponentStatus::OK, command, status);
      return;
    }
  }

  // No matching process was found → FATAL
  SummaryMonitor::EscalateStatus(
      ComponentStatus::FATAL, "Process not found", status);
}
Configuration example:
process {
  command_keywords: "mainboard"
  command_keywords: "-d"
  command_keywords: "modules/planning/dag/planning.dag"
}
# Example of a matching command line:
# /opt/apollo/bin/mainboard -d modules/planning/dag/planning.dag
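The keyword matching described above can be exercised in isolation. The following sketch uses hypothetical helper names (NormalizeCmdline, MatchesAllKeywords, AnyProcessMatches) and plain standard-library containers instead of the protobuf config.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Converts raw /proc/<pid>/cmdline content, whose arguments are separated
// by NUL bytes, into a single space-separated string.
std::string NormalizeCmdline(std::string raw) {
  std::replace(raw.begin(), raw.end(), '\0', ' ');
  return raw;
}

// True when the command line contains every keyword (AND logic).
bool MatchesAllKeywords(const std::string& command,
                        const std::vector<std::string>& keywords) {
  return std::all_of(keywords.begin(), keywords.end(),
                     [&command](const std::string& keyword) {
                       return command.find(keyword) != std::string::npos;
                     });
}

// True when at least one running process matches all keywords.
bool AnyProcessMatches(const std::vector<std::string>& running_processes,
                       const std::vector<std::string>& keywords) {
  return std::any_of(running_processes.begin(), running_processes.end(),
                     [&keywords](const std::string& command) {
                       return MatchesAllKeywords(command, keywords);
                     });
}
```

One consequence of the AND logic is visible in the sketch: an empty keyword list matches every process, so a component configured without keywords would always be reported as found. Keywords are matched as substrings, so they should be specific enough (for example a full DAG path) to avoid accidental matches.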
5.2 Module Monitor
File: modules/monitor/software/module_monitor.h
Purpose: checks whether the CyberRT nodes of a module exist
Execution interval: 1.5 seconds
Implementation (module_monitor.cc:43-80):
class ModuleMonitor : public RecurrentRunner {
 public:
  ModuleMonitor();
  void RunOnce(const double current_time) override;

 private:
  void UpdateStatus(const ModuleMonitorConfig& config,
                    const std::string& module_name,
                    ComponentStatus* status);

  std::shared_ptr<cyber::service_discovery::NodeManager> node_manager_;
};

void ModuleMonitor::RunOnce(const double current_time) {
  auto manager = MonitorManager::Instance();
  const auto& mode = manager->GetHMIMode();

  // Check every monitored component
  auto* components = manager->GetStatus()->mutable_components();
  for (const auto& iter : mode.monitored_components()) {
    const std::string& name = iter.first;
    const auto& monitored_component = iter.second;
    if (monitored_component.has_module()) {
      const auto& config = monitored_component.module();
      auto* status = (*components)[name].mutable_module_status();
      UpdateStatus(config, name, status);
    }
  }
}

void ModuleMonitor::UpdateStatus(const ModuleMonitorConfig& config,
                                 const std::string& module_name,
                                 ComponentStatus* status) {
  status->clear_status();

  // Every required node must exist (AND logic)
  bool all_nodes_matched = true;
  for (const std::string& node_name : config.node_name()) {
    if (!node_manager_->HasNode(node_name)) {
      all_nodes_matched = false;
      break;
    }
  }

  if (all_nodes_matched) {
    // All nodes were found
    SummaryMonitor::EscalateStatus(ComponentStatus::OK, module_name, status);
  } else {
    // A required node is missing
    SummaryMonitor::EscalateStatus(
        ComponentStatus::FATAL, "Required node not found", status);
  }
}
Configuration example:
module {
  node_name: "planning"
  node_name: "planning_monitor"
}
5.3 Channel Monitor
File: modules/monitor/software/channel_monitor.h
Purpose: checks the delay, frequency, and completeness of message channels
Execution interval: 5 seconds
Supported channels (channel_monitor.cc:73-105):
static const auto channel_function_map = std::unordered_map<...>{
    {FLAGS_control_command_topic,
     &CreateReaderAndLatestMessage<control::ControlCommand>},
    {FLAGS_localization_topic,
     &CreateReaderAndLatestMessage<localization::LocalizationEstimate>},
    {FLAGS_perception_obstacle_topic,
     &CreateReaderAndLatestMessage<perception::PerceptionObstacles>},
    {FLAGS_prediction_topic,
     &CreateReaderAndLatestMessage<prediction::PredictionObstacles>},
    {FLAGS_planning_trajectory_topic,
     &CreateReaderAndLatestMessage<planning::ADCTrajectory>},
    {FLAGS_chassis_topic,
     &CreateReaderAndLatestMessage<canbus::Chassis>},
    {FLAGS_pointcloud_topic,
     &CreateReaderAndLatestMessage<drivers::PointCloud>},
    // ... more channels ...
};
Check logic (channel_monitor.cc:171-236):
void ChannelMonitor::UpdateStatus(const ChannelMonitorConfig& config,
                                  ComponentStatus* status,
                                  const bool update_freq,
                                  const double freq) {
  status->clear_status();

  // 1. Get the reader and the latest message
  const auto reader_message_pair = GetReaderAndLatestMessage(config.name());
  const auto reader = reader_message_pair.first;
  const auto message = reader_message_pair.second;

  // 2. Check that the channel is registered
  if (reader == nullptr) {
    SummaryMonitor::EscalateStatus(
        ComponentStatus::UNKNOWN,
        absl::StrCat(config.name(), " is not registered in ChannelMonitor."),
        status);
    return;
  }

  // 3. Check that the message is not empty
  if (message == nullptr || message->ByteSize() == 0) {
    SummaryMonitor::EscalateStatus(
        ComponentStatus::FATAL,
        absl::StrCat("The message ", config.name(), " received is empty."),
        status);
    return;
  }

  // 4. Check the channel delay (FATAL level)
  const double delay = reader->GetDelaySec();
  if (delay < 0 || delay > config.delay_fatal()) {
    SummaryMonitor::EscalateStatus(
        ComponentStatus::FATAL,
        absl::StrCat(config.name(), " delayed for ", delay, " seconds."),
        status);
  }

  // 5. Check the mandatory fields (ERROR level)
  const std::string field_sepr = ".";
  for (const auto& field : config.mandatory_fields()) {
    if (!ValidateFields(*message, absl::StrSplit(field, field_sepr), 0)) {
      SummaryMonitor::EscalateStatus(
          ComponentStatus::ERROR,
          absl::StrCat(config.name(), " missing field ", field),
          status);
    }
  }

  // 6. Check the channel frequency (WARN level)
  if (update_freq) {
    if (freq > config.max_frequency_allowed()) {
      SummaryMonitor::EscalateStatus(
          ComponentStatus::WARN,
          absl::StrCat(config.name(), " has frequency ", freq,
                       " > max allowed ", config.max_frequency_allowed()),
          status);
    }
    if (freq < config.min_frequency_allowed()) {
      SummaryMonitor::EscalateStatus(
          ComponentStatus::WARN,
          absl::StrCat(config.name(), " has frequency ", freq,
                       " < min allowed ", config.min_frequency_allowed()),
          status);
    }
  }

  // Reached only as a fallback: OK never overrides a higher status set above
  SummaryMonitor::EscalateStatus(ComponentStatus::OK, "", status);
}
Field validation algorithm:
bool ValidateFields(const google::protobuf::Message& message,
                    const std::vector<std::string>& fields,
                    const size_t field_step) {
  if (field_step >= fields.size()) {
    return true;  // Every path segment has been validated
  }
  const auto* desc = message.GetDescriptor();
  const auto* refl = message.GetReflection();

  // Walk through all fields of the message
  for (int i = 0; i < desc->field_count(); ++i) {
    const auto* field_desc = desc->field(i);
    if (field_desc->name() == fields[field_step]) {
      // Found the target field
      if (field_desc->is_repeated()) {
        // Repeated field: must be non-empty and be the last path segment
        return refl->FieldSize(message, field_desc) > 0 &&
               field_step == fields.size() - 1;
      }
      if (field_desc->type() !=
          google::protobuf::FieldDescriptor::TYPE_MESSAGE) {
        // Scalar field: must be set and be the last path segment
        return refl->HasField(message, field_desc) &&
               field_step == fields.size() - 1;
      }
      // Message field: validate the rest of the path recursively
      return ValidateFields(refl->GetMessage(message, field_desc),
                            fields, field_step + 1);
    }
  }
  return false;
}
Configuration example:
channel {
  name: "/apollo/localization/pose"
  delay_fatal: 3.0
  mandatory_fields: "pose.position.x"
  mandatory_fields: "pose.position.y"
  mandatory_fields: "pose.orientation"
  min_frequency_allowed: 5.0
  max_frequency_allowed: 15.0
}
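How the three thresholds in a channel block combine can be sketched without CyberRT or protobuf. In the sketch below, Level, ChannelLimits, and ClassifyChannel are hypothetical names, and the result of the mandatory-field check is passed in as a boolean rather than computed through reflection.

```cpp
// Severity levels ordered so that a larger value has higher priority.
enum class Level { kUnknown = 0, kOk, kWarn, kError, kFatal };

// Hypothetical plain-struct counterpart of the channel configuration.
struct ChannelLimits {
  double delay_fatal;    // Maximum tolerated delay in seconds
  double min_frequency;  // Lowest acceptable frequency in Hz
  double max_frequency;  // Highest acceptable frequency in Hz
};

// Applies the checks in the same order as the channel monitor:
// delay (FATAL), mandatory fields (ERROR), frequency (WARN), then OK.
// The highest level reached wins.
Level ClassifyChannel(const ChannelLimits& limits, double delay_sec,
                      bool mandatory_fields_present, double frequency) {
  Level level = Level::kUnknown;
  auto escalate = [&level](Level candidate) {
    if (candidate > level) {
      level = candidate;
    }
  };

  if (delay_sec < 0 || delay_sec > limits.delay_fatal) {
    escalate(Level::kFatal);
  }
  if (!mandatory_fields_present) {
    escalate(Level::kError);
  }
  if (frequency < limits.min_frequency || frequency > limits.max_frequency) {
    escalate(Level::kWarn);
  }
  escalate(Level::kOk);  // Only takes effect if nothing above fired.
  return level;
}
```

With the limits from the configuration example (3.0 s, 5 Hz, 15 Hz), a channel at 2 Hz with complete fields is classified as WARN, while the same channel with a 4 s delay is FATAL regardless of its frequency.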
5.4 Latency Monitor
File: modules/monitor/software/latency_monitor.h
Purpose: monitors end-to-end latency and message frequency
Execution interval: 1.5 seconds
Data structures:
class LatencyMonitor : public RecurrentRunner {
 public:
  LatencyMonitor();
  void RunOnce(const double current_time) override;

  // Get the frequency of a channel
  bool GetFrequency(const std::string& channel_name, double* freq);

 private:
  void UpdateStat(const std::shared_ptr<LatencyRecordMap>& records);
  void PublishLatencyReport();
  void AggregateLatency();

  apollo::common::LatencyReport latency_report_;

  // Tracking map: track_id -> (timestamp, module_id, module_name)
  std::unordered_map<uint64_t,
                     std::set<std::tuple<uint64_t, uint64_t, std::string>>>
      track_map_;

  // Frequency cache: channel_name -> frequency
  std::unordered_map<std::string, double> freq_map_;

  double flush_time_ = 0.0;
};
Statistics algorithm:
LatencyStat GenerateStat(const std::vector<uint64_t>& numbers) {
  LatencyStat stat;
  uint64_t min_number = (1UL << 63);
  uint64_t max_number = 0;
  uint64_t sum = 0;
  for (const auto number : numbers) {
    min_number = std::min(min_number, number);
    max_number = std::max(max_number, number);
    sum += number;
  }
  const uint32_t sample_size = static_cast<uint32_t>(numbers.size());
  stat.set_min_duration(min_number);
  stat.set_max_duration(max_number);
  stat.set_aver_duration(sample_size == 0 ? 0 : sum / sample_size);
  stat.set_sample_size(sample_size);
  return stat;
}
Published topic: /apollo/monitor/latency_report
Report interval: 15 seconds (FLAGS_latency_report_interval)
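The same min/max/average aggregation can be written against a plain struct for experimentation. DurationStat and SummarizeDurations are hypothetical names. Unlike GenerateStat above, which would report a minimum of 2^63 for an empty input, this sketch returns all zeros in that case; that difference is a deliberate choice for the sketch, not a description of Apollo's behavior.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Plain-struct counterpart of the latency statistics message.
struct DurationStat {
  uint64_t min_duration = 0;
  uint64_t max_duration = 0;
  uint64_t average_duration = 0;
  uint32_t sample_size = 0;
};

// Computes min, max, and integer average of the given durations.
// An empty input yields a zeroed result.
DurationStat SummarizeDurations(const std::vector<uint64_t>& durations) {
  DurationStat stat;
  if (durations.empty()) {
    return stat;
  }
  uint64_t min_value = std::numeric_limits<uint64_t>::max();
  uint64_t max_value = 0;
  uint64_t sum = 0;
  for (const uint64_t d : durations) {
    min_value = std::min(min_value, d);
    max_value = std::max(max_value, d);
    sum += d;
  }
  stat.min_duration = min_value;
  stat.max_duration = max_value;
  stat.average_duration = sum / durations.size();  // Integer division.
  stat.sample_size = static_cast<uint32_t>(durations.size());
  return stat;
}
```

Because the average uses integer division, sub-unit precision is dropped; with nanosecond-scale durations this is usually negligible.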
5.5 Summary Monitor
File: modules/monitor/software/summary_monitor.h
Purpose: aggregates the results of all monitors and publishes the system status
Execution interval: runs on every cycle
Status escalation rule (summary_monitor.cc:33-45):
void SummaryMonitor::EscalateStatus(const ComponentStatus::Status new_status,
                                    const std::string& message,
                                    ComponentStatus* current_status) {
  // Priority: FATAL > ERROR > WARN > OK > UNKNOWN
  if (new_status > current_status->status()) {
    current_status->set_status(new_status);
    if (!message.empty()) {
      current_status->set_message(message);
    } else {
      current_status->clear_message();
    }
  }
}
Execution logic (summary_monitor.cc:51-99):
void SummaryMonitor::RunOnce(const double current_time) {
  auto manager = MonitorManager::Instance();
  auto* status = manager->GetStatus();

  // 1. Aggregate the status of each component
  for (auto& component : *status->mutable_components()) {
    auto* summary = component.second.mutable_summary();
    // Escalate each sub-status into the summary
    EscalateStatus(component.second.process_status().status(),
                   component.second.process_status().message(), summary);
    EscalateStatus(component.second.module_status().status(),
                   component.second.module_status().message(), summary);
    EscalateStatus(component.second.channel_status().status(),
                   component.second.channel_status().message(), summary);
    EscalateStatus(component.second.resource_status().status(),
                   component.second.resource_status().message(), summary);
    EscalateStatus(component.second.other_status().status(),
                   component.second.other_status().message(), summary);
  }

  // 2. Aggregate the status of global components
  for (auto& component : *status->mutable_global_components()) {
    auto* summary = component.second.mutable_summary();
    EscalateStatus(component.second.process_status().status(), ..., summary);
    EscalateStatus(component.second.resource_status().status(), ..., summary);
  }

  // 3. Detect whether the status changed (hash fingerprint)
  static std::hash<std::string> hash_fn;
  std::string proto_bytes;
  status->SerializeToString(&proto_bytes);
  const size_t new_fp = hash_fn(proto_bytes);

  // 4. Publish when the status changed or the interval elapsed
  if (system_status_fp_ != new_fp ||
      current_time - last_broadcast_ >
          FLAGS_system_status_publish_interval) {
    static auto writer =
        manager->CreateWriter<SystemStatus>(FLAGS_system_status_topic);
    common::util::FillHeader("SystemMonitor", status);
    writer->Write(*status);
    // The header is cleared again after writing, so the fingerprint
    // computed on the next cycle covers the status content only
    status->clear_header();
    system_status_fp_ = new_fp;
    last_broadcast_ = current_time;
  }
}
Publishing policy:
- Publish immediately when the status changes
- Otherwise publish at a fixed interval (FLAGS_system_status_publish_interval, 10 seconds by default)
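The publishing policy can be isolated into a small gate object. The sketch below is a simplified illustration: PublishGate is a hypothetical class, the serialized status is represented by an ordinary string, and std::hash stands in for whatever fingerprint function is used in practice.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Decides whether a serialized status should be published:
// either its fingerprint changed or the maximum interval has elapsed.
class PublishGate {
 public:
  explicit PublishGate(double max_interval_sec)
      : max_interval_(max_interval_sec) {}

  // Returns true when the caller should publish at current_time, and
  // records the fingerprint and time of that publication.
  bool ShouldPublish(const std::string& serialized, double current_time) {
    const std::size_t fingerprint = std::hash<std::string>{}(serialized);
    const bool unchanged = has_published_ && fingerprint == last_fingerprint_;
    const bool within_interval =
        current_time - last_broadcast_ <= max_interval_;
    if (unchanged && within_interval) {
      return false;
    }
    last_fingerprint_ = fingerprint;
    last_broadcast_ = current_time;
    has_published_ = true;
    return true;
  }

 private:
  double max_interval_;
  double last_broadcast_ = 0.0;
  std::size_t last_fingerprint_ = 0;
  bool has_published_ = false;
};
```

With a 10-second interval, an unchanged status is published roughly once every 10 seconds, while any change goes out on the next cycle. Comparing fingerprints rather than full messages keeps the stored state to a single integer, at the cost of a theoretical (and in practice very unlikely) hash collision hiding a change until the interval elapses.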
5.6 Functional Safety Monitor
File: modules/monitor/software/functional_safety_monitor.h
Purpose: checks safety conditions in autonomous driving mode and triggers EStop
Execution interval: runs on every cycle (if enabled)
Safety check logic (functional_safety_monitor.cc:49-76):
void FunctionalSafetyMonitor::RunOnce(const double current_time) {
  auto* system_status = MonitorManager::Instance()->GetStatus();

  // 1. Check whether the system is safe
  if (CheckSafety()) {
    // Everything is fine → clear the safety mode markers
    system_status->clear_passenger_msg();
    system_status->clear_safety_mode_trigger_time();
    system_status->clear_require_emergency_stop();
    return;
  }

  // 2. EStop has already been requested → nothing more to do
  if (system_status->require_emergency_stop()) {
    return;
  }

  // 3. Newly entering safety mode
  system_status->set_passenger_msg("Error! Please disengage.");
  if (!system_status->has_safety_mode_trigger_time()) {
    system_status->set_safety_mode_trigger_time(current_time);
    return;  // First cycle in safety mode: do not EStop yet
  }

  // 4. Request EStop once the grace period has elapsed
  if (system_status->safety_mode_trigger_time() +
          FLAGS_safety_mode_seconds_before_estop <  // Default: 10 seconds
      current_time) {
    system_status->set_require_emergency_stop(true);
    // Log the event
    MonitorManager::Instance()->LogBuffer().ERROR(
        "Functional safety triggered emergency stop.");
  }
}

bool FunctionalSafetyMonitor::CheckSafety() {
  auto manager = MonitorManager::Instance();

  // Only check in autonomous driving mode
  if (!manager->IsInAutonomousMode()) {
    return true;  // Not driving autonomously → safe
  }

  auto* status = manager->GetStatus();
  // Check the status of critical components
  for (const auto& component : status->components()) {
    if (component.second.summary().status() == ComponentStatus::ERROR ||
        component.second.summary().status() == ComponentStatus::FATAL) {
      // Check whether this is a safety-critical component
      const auto& mode = manager->GetHMIMode();
      const auto& monitored_component =
          FindOrNull(mode.monitored_components(), component.first);
      if (monitored_component != nullptr &&
          monitored_component->required_for_safety()) {
        // A safety-critical component failed → unsafe
        manager->LogBuffer().ERROR(absl::StrCat(
            component.first, " triggers safe mode: ",
            component.second.summary().message()));
        return false;
      }
    }
  }
  return true;  // All critical components are healthy → safe
}
EStop flow:
1. A critical component error is detected
   ↓
2. Safety mode is entered (passenger_msg is set)
   ↓
3. Wait 10 seconds (FLAGS_safety_mode_seconds_before_estop)
   ↓
4. If the error persists, require_emergency_stop is set
   ↓
5. The Control module watches this flag and triggers the emergency stop
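The flow above is a small state machine, which the following sketch reproduces with plain fields instead of the SystemStatus proto. SafetyState and UpdateSafetyState are hypothetical names, and the outcome of the safety check is passed in as a boolean.

```cpp
// Plain-struct counterpart of the safety-related SystemStatus fields.
struct SafetyState {
  bool has_trigger_time = false;
  double trigger_time = 0.0;  // When safety mode was entered
  bool require_emergency_stop = false;
};

// Advances the state by one monitoring cycle.
//   safe:                 result of the safety check for this cycle
//   now:                  current time in seconds
//   seconds_before_estop: grace period before EStop is requested
void UpdateSafetyState(bool safe, double now, double seconds_before_estop,
                       SafetyState* state) {
  if (safe) {
    *state = SafetyState{};  // Recovery clears every safety marker.
    return;
  }
  if (state->require_emergency_stop) {
    return;  // EStop already requested.
  }
  if (!state->has_trigger_time) {
    // First unsafe cycle: start the grace period, do not stop yet.
    state->has_trigger_time = true;
    state->trigger_time = now;
    return;
  }
  if (state->trigger_time + seconds_before_estop < now) {
    state->require_emergency_stop = true;
  }
}
```

Two properties are worth noting. Because the comparison is strict, EStop is requested on the first cycle strictly after the grace period, not exactly at its end. And a single safe cycle resets the timer, so an intermittent fault restarts the grace period each time it clears.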
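5.7 Other Monitors
5.7.1 Camera Monitor
File: modules/monitor/software/camera_monitor.h
Purpose: monitors the camera image streams
Execution interval: 5 seconds
5.7.2 Recorder Monitor
File: modules/monitor/software/recorder_monitor.h
Purpose: monitors the recording state of SmartRecorder
Execution interval: 5 seconds
Status mapping:
- RECORDING → OK
- TERMINATING → WARN
- STOPPED → ERROR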
6. Status Management
6.1 Status Hierarchy
SystemStatus
├─ hmi_modules: map<string, ComponentStatus>
│  └─ Planning: ComponentStatus{status, message}
│
├─ components: map<string, Component>
│  └─ Localization: Component
│     ├─ summary: ComponentStatus (aggregated status)
│     ├─ process_status: ComponentStatus (process check)
│     ├─ module_status: ComponentStatus (node check)
│     ├─ channel_status: ComponentStatus (channel check)
│     ├─ resource_status: ComponentStatus (resource check)
│     └─ other_status: ComponentStatus (other checks)
│
├─ other_components: map<string, ComponentStatus>
│  └─ OtherComponent: ComponentStatus
│
├─ global_components: map<string, Component>
│  └─ GlobalComponent: Component{...}
│
├─ passenger_msg: string (passenger prompt)
├─ safety_mode_trigger_time: double (time at which safety mode was triggered)
└─ require_emergency_stop: bool (emergency stop flag)
6.2 Status Escalation Example
Suppose the sub-statuses of the Localization component are:
process_status  = OK    (process is running)
module_status   = OK    (node exists)
channel_status  = WARN  (message frequency is low)
resource_status = OK    (resources are normal)
other_status    = ERROR (poor fusion quality)
Escalating them through EscalateStatus gives:
↓
summary = max(OK, OK, WARN, OK, ERROR)
summary = ERROR (highest priority)
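The example can be reproduced with a few lines of standalone code. Level, Status, Escalate, and Summarize are hypothetical stand-ins for the proto types and for SummaryMonitor::EscalateStatus; the messages are made up for illustration.

```cpp
#include <string>
#include <vector>

// Severity levels ordered so that a larger value has higher priority.
enum class Level { kUnknown = 0, kOk, kWarn, kError, kFatal };

struct Status {
  Level level;
  std::string message;
};

// Replaces the current status only when the new level is strictly higher.
void Escalate(Level new_level, const std::string& message, Status* current) {
  if (new_level > current->level) {
    current->level = new_level;
    current->message = message;
  }
}

// Folds all sub-statuses into a single summary.
Status Summarize(const std::vector<Status>& sub_statuses) {
  Status summary{Level::kUnknown, ""};
  for (const Status& sub : sub_statuses) {
    Escalate(sub.level, sub.message, &summary);
  }
  return summary;
}
```

For the five sub-statuses listed above, Summarize returns ERROR together with the message of the ERROR sub-status. Because the comparison is strict, when two sub-statuses share the highest level the message of the one escalated first is kept.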
6.3 HMI Mode Configuration
File location: modules/common_msgs/dreamview_msgs/hmi_mode.proto
message MonitoredComponent {
  optional ProcessMonitorConfig process = 1;
  optional ChannelMonitorConfig channel = 2;
  optional ResourceMonitorConfig resource = 3;
  optional bool required_for_safety = 4 [default = true];
  optional ModuleMonitorConfig module = 5;
}

message HMIMode {
  map<string, apollo.dreamview.Module> modules = 1;
  map<string, MonitoredComponent> monitored_components = 2;
  map<string, ComponentStatus> other_components = 3;
  map<string, MonitoredComponent> global_components = 4;
}
Configuration example (from an HMI mode file):
monitored_components {
  key: "Localization"
  value {
    process {
      command_keywords: "mainboard"
      command_keywords: "-d"
      command_keywords: "modules/localization/dag/localization.dag"
    }
    module {
      node_name: "localization"
    }
    channel {
      name: "/apollo/localization/pose"
      delay_fatal: 3.0
      mandatory_fields: "pose.position.x"
      mandatory_fields: "pose.position.y"
      min_frequency_allowed: 5.0
      max_frequency_allowed: 15.0
    }
    required_for_safety: true
  }
}

monitored_components {
  key: "Planning"
  value {
    process {
      command_keywords: "mainboard"
      command_keywords: "modules/planning/dag/planning.dag"
    }
    module {
      node_name: "planning"
    }
    channel {
      name: "/apollo/planning/trajectory"
      delay_fatal: 3.0
      mandatory_fields: "trajectory_point"
      min_frequency_allowed: 1.0
      max_frequency_allowed: 15.0
    }
    required_for_safety: true
  }
}
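To show how the channel thresholds in this configuration could be applied, here is a minimal sketch. ChannelLimits, Verdict, and EvaluateChannel are hypothetical names. The severities are also assumptions, inferred from the field name delay_fatal and from the "WARN (low message frequency)" example in 6.2 rather than taken from the ChannelMonitor source.

```cpp
// Hypothetical plain-struct mirror of the channel settings shown above.
struct ChannelLimits {
  double delay_fatal;            // seconds
  double min_frequency_allowed;  // Hz
  double max_frequency_allowed;  // Hz
};

// Hypothetical result levels for this sketch.
enum class Verdict { OK, WARN, FATAL };

// Classifies one channel observation against its configured limits.
// Assumed severities: excessive delay is FATAL, a frequency outside the
// allowed band is WARN, anything else is OK.
Verdict EvaluateChannel(const ChannelLimits& limits, double delay_sec,
                        double frequency_hz) {
  if (delay_sec > limits.delay_fatal) {
    return Verdict::FATAL;
  }
  if (frequency_hz < limits.min_frequency_allowed ||
      frequency_hz > limits.max_frequency_allowed) {
    return Verdict::WARN;
  }
  return Verdict::OK;
}
```

With the Localization limits above (3.0 s, 5.0 to 15.0 Hz), a message that is 0.2 s old and arrives at 8.5 Hz is classified as OK.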
7. Configuration System
7.1 DAG Configuration
File location: modules/monitor/dag/monitor.dag
module_config {
  module_library : "modules/monitor/libmonitor.so"
  timer_components {
    class_name : "Monitor"
    config {
      name: "monitor"
      interval: 500  # Execution period: 500 ms
    }
  }
}
Notes:
- The Monitor component runs Proc() every 500 ms
- Proc() calls each monitor's Tick() in turn
- Each monitor has its own execution interval (managed by RecurrentRunner)
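The interplay between the fixed 500 ms Proc() cycle and the per-monitor intervals can be sketched as follows. RecurrentRunnerSketch and CountingRunner are hypothetical classes written for illustration. They follow the Tick()/RunOnce() split described above but are not copied from the real RecurrentRunner.

```cpp
#include <string>
#include <utility>

// Hypothetical base class: Tick() is called on every cycle, but RunOnce()
// only executes when the runner's own interval has elapsed.
class RecurrentRunnerSketch {
 public:
  RecurrentRunnerSketch(std::string name, double interval_sec)
      : name_(std::move(name)), interval_sec_(interval_sec) {}
  virtual ~RecurrentRunnerSketch() = default;

  void Tick(double current_time) {
    if (next_round_ <= current_time) {
      next_round_ = current_time + interval_sec_;
      RunOnce(current_time);
    }
  }

 protected:
  virtual void RunOnce(double current_time) = 0;

 private:
  std::string name_;
  double interval_sec_;
  double next_round_ = 0.0;  // Due immediately on the first Tick().
};

// Example runner with a 3-second interval that counts its executions.
class CountingRunner : public RecurrentRunnerSketch {
 public:
  CountingRunner() : RecurrentRunnerSketch("CountingRunner", 3.0) {}
  int runs = 0;

 protected:
  void RunOnce(double /*current_time*/) override { ++runs; }
};
```

Ticking a CountingRunner every 0.5 s from t = 0 to t = 6 executes RunOnce() only at t = 0, 3, and 6. The other ten calls return immediately, which is what keeps a short main cycle cheap.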
7.2 Launch Configuration
File location: modules/monitor/launch/monitor.launch
<cyber>
  <module>
    <name>monitor</name>
    <dag_conf>modules/monitor/dag/monitor.dag</dag_conf>
    <process_name>monitor</process_name>
  </module>
</cyber>
7.3 Flag Configuration
Main configuration flags:
// Monitor names and intervals (in seconds)
DEFINE_string(gps_monitor_name, "GpsMonitor", "GPS monitor name");
DEFINE_double(gps_monitor_interval, 3.0, "GPS monitor interval");
DEFINE_double(process_monitor_interval, 1.5, "Process monitor interval");
DEFINE_double(channel_monitor_interval, 5.0, "Channel monitor interval");
DEFINE_double(resource_monitor_interval, 5.0, "Resource monitor interval");

// Component names
DEFINE_string(gps_component_name, "GPS", "GPS component name");

// Topic names
DEFINE_string(hmi_status_topic, "/apollo/hmi/status", "HMI status topic");
DEFINE_string(system_status_topic, "/apollo/monitor/system_status", "System status topic");
DEFINE_string(chassis_topic, "/apollo/canbus/chassis", "Chassis topic");
DEFINE_string(gnss_best_pose_topic, "/apollo/sensor/gnss/best_pose", "GNSS best pose topic");

// System parameters
DEFINE_double(system_status_publish_interval, 10.0, "System status publish interval");
DEFINE_double(max_chassis_message_delay, 1.0, "Max chassis message delay");
DEFINE_bool(enable_functional_safety, true, "Enable functional safety monitor");
DEFINE_double(safety_mode_seconds_before_estop, 10.0, "Seconds before EStop");

// Latency monitoring
DEFINE_double(latency_monitor_interval, 1.5, "Latency monitor interval");
DEFINE_double(latency_report_interval, 15.0, "Latency report interval");
DEFINE_int32(latency_reader_capacity, 30, "Latency reader capacity");
8. Execution Flow
8.1 Complete Execution Flow
Timer fires (every 500 ms)
↓
Monitor::Proc()
↓
┌────────────────────────────────────────────────────────────────────┐
│ MonitorManager::StartFrame(current_time)                           │
│   ├─ Read the HMI status                                           │
│   ├─ Detect mode changes                                           │
│   │  └─ On a mode change, reload the configuration                 │
│   ├─ Clear the previous frame's component summaries                │
│   └─ Detect whether the vehicle is in autonomous mode              │
└────────────────────────────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────────────────────────────┐
│ Run all monitors                                                   │
│                                                                    │
│ EsdCanMonitor::Tick()  (every 3 s)                                 │
│   └─ If due, call RunOnce()                                        │
│      └─ Check the ESD CAN status                                   │
│                                                                    │
│ SocketCanMonitor::Tick()  (every 3 s)                              │
│   └─ Check the Socket CAN status                                   │
│                                                                    │
│ GpsMonitor::Tick()  (every 3 s)                                    │
│   ├─ Check the GNSS solution quality                               │
│   └─ Update component.other_status                                 │
│                                                                    │
│ LocalizationMonitor::Tick()  (every 5 s)                           │
│   └─ Check the localization fusion status                          │
│                                                                    │
│ CameraMonitor::Tick()  (every 5 s)                                 │
│   └─ Check the camera images                                       │
│                                                                    │
│ ProcessMonitor::Tick()  (every 1.5 s, or immediately)              │
│   ├─ Scan /proc/*/cmdline                                          │
│   ├─ Match the process keywords                                    │
│   └─ Update component.process_status                               │
│                                                                    │
│ ModuleMonitor::Tick()  (every 1.5 s)                               │
│   ├─ Query the CyberRT nodes                                       │
│   └─ Update component.module_status                                │
│                                                                    │
│ LatencyMonitor::Tick()  (every 1.5 s)                              │
│   ├─ Collect latency data                                          │
│   ├─ Compute frequencies                                           │
│   └─ Publish a latency report every 15 s                           │
│                                                                    │
│ ChannelMonitor::Tick()  (every 5 s)                                │
│   ├─ Check message delay                                           │
│   ├─ Check mandatory fields                                        │
│   ├─ Check message frequency                                       │
│   └─ Update component.channel_status                               │
│                                                                    │
│ ResourceMonitor::Tick()  (every 5 s)                               │
│   ├─ Check disk space                                              │
│   ├─ Check CPU usage                                               │
│   ├─ Check memory usage                                            │
│   ├─ Check disk I/O load                                           │
│   └─ Update component.resource_status                              │
│                                                                    │
│ SummaryMonitor::Tick()  (every cycle)                              │
│   ├─ Summarize each component's status                             │
│   │  └─ summary = max(process, module, channel, resource, other)   │
│   ├─ Compute the status hash fingerprint                           │
│   └─ Publish SystemStatus if the status changed or timed out       │
│                                                                    │
│ FunctionalSafetyMonitor::Tick()  (every cycle, if enabled)         │
│   ├─ Check whether the vehicle is in autonomous mode               │
│   ├─ Check the status of safety-critical components                │
│   ├─ On ERROR/FATAL, enter safety mode                             │
│   └─ After 10 s without recovery, set require_emergency_stop       │
└────────────────────────────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────────────────────────────┐
│ MonitorManager::EndFrame()                                         │
│   └─ log_buffer_.Publish()                                         │
│      └─ Publish all buffered monitor log messages                  │
└────────────────────────────────────────────────────────────────────┘
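The "publish only if the status changed or timed out" step in SummaryMonitor can be illustrated with a hash fingerprint. This is a sketch under assumptions: ChangeDetectingPublisher is a hypothetical class, std::hash over a serialized string stands in for whatever fingerprint the module actually computes, and the timeout is passed in rather than read from a flag.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical helper that decides whether a status message needs to be
// published: either its fingerprint changed, or too much time has passed
// since the last publication.
class ChangeDetectingPublisher {
 public:
  explicit ChangeDetectingPublisher(double max_silence_sec)
      : max_silence_sec_(max_silence_sec) {}

  // Returns true when the caller should publish `serialized_status`.
  bool ShouldPublish(const std::string& serialized_status,
                     double current_time) {
    const std::size_t fingerprint =
        std::hash<std::string>{}(serialized_status);
    const bool changed = !has_published_ || fingerprint != last_fingerprint_;
    const bool timed_out =
        has_published_ &&
        current_time - last_publish_time_ >= max_silence_sec_;
    if (!changed && !timed_out) {
      return false;
    }
    has_published_ = true;
    last_fingerprint_ = fingerprint;
    last_publish_time_ = current_time;
    return true;
  }

 private:
  double max_silence_sec_;
  bool has_published_ = false;
  std::size_t last_fingerprint_ = 0;
  double last_publish_time_ = 0.0;
};
```

Comparing one integer per cycle is cheaper than comparing whole messages field by field, and the timeout ensures that subscribers who join late still receive a status within a bounded time.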
8.2 Monitoring Flow Example: the Localization Component
Assume the HMI mode configuration includes the Localization component:
1. ProcessMonitor checks the localization process
   ├─ Scan /proc/*/cmdline
   ├─ Look for a process containing ["mainboard", "-d", "localization.dag"]
   ├─ Found: /opt/apollo/bin/mainboard -d modules/localization/dag/localization.dag
   └─ Result: process_status = OK
2. ModuleMonitor checks the localization node
   ├─ Query the CyberRT nodes
   ├─ Check whether the "localization" node exists
   └─ Result: module_status = OK
3. ChannelMonitor checks the localization messages
   ├─ Read /apollo/localization/pose
   ├─ Check delay: 0.2 s < 3.0 s (delay_fatal) → OK
   ├─ Check fields: pose.position.x ✓, pose.position.y ✓ → OK
   ├─ Check frequency: 8.5 Hz, within [5.0, 15.0] → OK
   └─ Result: channel_status = OK
4. LocalizationMonitor checks the localization quality
   ├─ Read the LocalizationStatus message
   ├─ fusion_status = OK
   └─ Result: other_status = OK
5. SummaryMonitor aggregates
   ├─ summary = max(OK, OK, OK, OK)
   └─ Result: summary = OK
6. FunctionalSafetyMonitor checks safety
   └─ Localization.summary = OK → safe
If localization fails:
Assume LocalizationStatus.fusion_status = ERROR:
1. ProcessMonitor: process_status = OK (the process is still running)
2. ModuleMonitor: module_status = OK (the node still exists)
3. ChannelMonitor: channel_status = OK (messages are normal)
4. LocalizationMonitor: other_status = ERROR (fusion error)
5. SummaryMonitor: summary = max(OK, OK, OK, ERROR) = ERROR
6. FunctionalSafetyMonitor:
   ├─ Detects Localization.summary = ERROR
   ├─ required_for_safety = true
   ├─ Enters safety mode
   ├─ Sets passenger_msg = "Error! Please disengage."
   ├─ Sets safety_mode_trigger_time = current_time
   └─ After 10 seconds → require_emergency_stop = true
9. Best Practices
9.1 Development Recommendations
9.1.1 Adding a New Monitor
// Step 1: create the monitor class
// my_monitor.h
class MyMonitor : public RecurrentRunner {
 public:
  MyMonitor() : RecurrentRunner("MyMonitor", 3.0) {}  // 3-second interval

  void RunOnce(const double current_time) override {
    auto manager = MonitorManager::Instance();
    // Look up the component's status entry
    Component* component = FindOrNull(
        *manager->GetStatus()->mutable_components(), "MyComponent");
    if (component == nullptr) {
      return;
    }
    ComponentStatus* status = component->mutable_other_status();
    // Run the check
    if (CheckMyCondition()) {
      SummaryMonitor::EscalateStatus(ComponentStatus::OK, "", status);
    } else {
      SummaryMonitor::EscalateStatus(ComponentStatus::ERROR,
                                     "My condition failed", status);
    }
  }

 private:
  bool CheckMyCondition();
};

// Step 2: register it in Monitor::Init()
bool Monitor::Init() {
  // ...
  runners_.emplace_back(new MyMonitor());
  // ...
}

// Step 3: add the component to the HMI mode configuration
monitored_components {
  key: "MyComponent"
  value {
    # Configure the monitoring items here
  }
}
9.1.2 Configuring Monitoring Items
# HMI mode configuration file
monitored_components {
  key: "Planning"
  value {
    # Process monitoring
    process {
      command_keywords: "mainboard"
      command_keywords: "modules/planning/dag/planning.dag"
    }
    # Node monitoring
    module {
      node_name: "planning"
      node_name: "planning_monitor"
    }
    # Channel monitoring
    channel {
      name: "/apollo/planning/trajectory"
      delay_fatal: 3.0
      mandatory_fields: "trajectory_point"
      mandatory_fields: "header.timestamp_sec"
      min_frequency_allowed: 1.0
      max_frequency_allowed: 15.0
    }
    # Resource monitoring
    resource {
      cpu_usages {
        process_dag_path: "modules/planning/dag/planning.dag"
        high_cpu_usage_warning: 80.0
        high_cpu_usage_error: 95.0
      }
      memory_usages {
        process_dag_path: "modules/planning/dag/planning.dag"
        high_memory_usage_warning: 1024  # MB
        high_memory_usage_error: 2048
      }
    }
    # Whether the component is safety-critical
    required_for_safety: true
  }
}
9.2 Debugging Tips
9.2.1 Inspecting the System Status
# View the SystemStatus message with cyber_monitor
cyber_monitor -c /apollo/monitor/system_status

# Print the full message contents with cyber_channel echo
cyber_channel echo /apollo/monitor/system_status
9.2.2 Inspecting the Latency Report
# View end-to-end latency
cyber_monitor -c /apollo/monitor/latency_report
9.2.3 Debug Logging
// Add detailed logging inside a monitor
AINFO << "Checking component: " << component_name;
AINFO << "Current status: " << ComponentStatus::Status_Name(status);
AERROR << "Check failed: " << error_message;
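9.3 Common Problems
9.3.1 Component Status Stays FATAL
Problem: a component's process_status stays FATAL
Troubleshooting steps:
- Check whether the process is actually running:
  ps aux | grep <keyword>
- Check that the process keywords are configured correctly
- Inspect the actual contents of /proc/*/cmdline:
  cat /proc/$(pidof <process name>)/cmdline | tr '\0' ' '
Solutions:
- Adjust command_keywords in the HMI mode configuration
- Make sure every keyword appears in the process command line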
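The reason every keyword must appear is that process detection is an AND match: one missing keyword means the process is not found. The sketch below shows that rule with a hypothetical helper, ContainsAllKeywords. It works on a command line that has already been read into a string and is not the ProcessMonitor implementation.

```cpp
#include <string>
#include <vector>

// Returns true only if every keyword occurs somewhere in the command line
// (AND semantics). An empty keyword list matches any command line.
bool ContainsAllKeywords(const std::string& cmdline,
                         const std::vector<std::string>& keywords) {
  for (const std::string& keyword : keywords) {
    if (cmdline.find(keyword) == std::string::npos) {
      return false;
    }
  }
  return true;
}
```

For the command line `/opt/apollo/bin/mainboard -d modules/localization/dag/localization.dag`, the keywords "mainboard", "-d", and "modules/localization/dag/localization.dag" all match, whereas a keyword list that includes the planning DAG path does not.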
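9.3.2 Channel Monitor Reports an Error
Problem: channel_status reports "missing field"
Troubleshooting steps:
- Check whether the message is really missing that field
- Use protobuf reflection to inspect the message structure
- Check that the field name is spelled correctly (it is case-sensitive)
Solutions:
- Correct the mandatory_fields configuration
- Or fix the upstream module so that it sends complete messages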
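9.3.3 EStop Is Triggered Frequently
Problem: FunctionalSafetyMonitor frequently triggers an emergency stop
Troubleshooting steps:
- Inspect SystemStatus to identify which component triggers safety mode
- Read that component's detailed error message
- Check whether required_for_safety: true is really needed for it
Solutions:
- Fix the root cause of the component's ERROR/FATAL status
- Adjust the required_for_safety setting
- Adjust the FLAGS_safety_mode_seconds_before_estop delay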
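10. Summary
10.1 Core Features
The Apollo Monitor module is a system health monitor with four main characteristics:
- Comprehensive coverage:
  - Hardware layer: GPS, CAN, resources
  - Software layer: processes, nodes, channels, latency
  - Safety layer: functional safety checks, EStop
- Flexible configuration:
  - Monitoring items are configured dynamically through HMI mode files
  - Switching modes at runtime automatically updates the monitored list
  - Each component can have its own monitoring policy
- Efficient execution:
  - 500 ms main cycle, with an independent interval per monitor
  - Hash fingerprints detect status changes
  - Reader/Writer caching avoids repeated creation
- Safety protection:
  - Autonomous driving mode is detected automatically
  - Errors in safety-critical components trigger EStop automatically
  - The EStop delay is configurable (10 seconds by default)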
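10.2 Monitor Overview
| Monitor | Type | Interval | Main function |
|---|---|---|---|
| EsdCanMonitor | Hardware | 3 s | ESD CAN card status |
| SocketCanMonitor | Hardware | 3 s | Socket CAN status |
| GpsMonitor | Hardware | 3 s | GNSS solution quality |
| ResourceMonitor | Hardware | 5 s | CPU / memory / disk |
| LocalizationMonitor | Hardware/Software | 5 s | Localization fusion status |
| ProcessMonitor | Software | 1.5 s | Process running state |
| ModuleMonitor | Software | 1.5 s | CyberRT nodes |
| ChannelMonitor | Software | 5 s | Message delay / frequency / fields |
| LatencyMonitor | Software | 1.5 s | End-to-end latency statistics |
| CameraMonitor | Software | 5 s | Camera images |
| RecorderMonitor | Software | 5 s | Recorder status |
| SummaryMonitor | Software | Every cycle | Status aggregation and publishing |
| FunctionalSafetyMonitor | Safety | Every cycle | Safety check and EStop |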
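10.3 Technical Highlights
- Layered monitoring: a three-layer hardware → software → safety structure
- Priority escalation: FATAL > ERROR > WARN > OK > UNKNOWN
- Autonomous driving protection: safety is checked and EStop triggered only in autonomous driving mode
- Flexible process detection: keyword-based AND matching
- Complete channel checks: delay + frequency + mandatory fields
- In-depth resource monitoring: disk + CPU + memory + I/O
- Efficient status publishing: hash fingerprints detect changes
Together, these mechanisms give the Apollo autonomous driving system health monitoring and safety protection.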
