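Notes on analyzing and fixing a boot hang at the first logo image on the HiSilicon Hi3798MV200 (Android 7.0) platform
While handling a unit returned by a customer for repair, I ran into a device that stayed stuck on the boot logo and never entered the system. It did not happen on every boot; the failure rate was roughly 30%. I had seen the symptom before, but it was so rare, and a reboot always cleared it, that I never paid attention to it. This was a good opportunity to dig into it properly.
Because the device hangs at the boot logo, before the system comes up, the only way to get logs is over the serial console. The captured log: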
healthd: battery l=100 v=0 t=42.4 h=2 st=2 chg=a
INFO: rcu_sched self-detected stall on CPUINFO: rcu_sched detected stalls on CPUs/tasks: { 1} (detected by 0, t=84007 jiffies, g=703, c=702, q=367)
Task dump for CPU 1:
kworker/1:1 R running 0 595 2 0x00000002
Workqueue: HIFB_WorkQueque HIFB_WBC_StartWorkQueue{ 1} (t=84031 jiffies g=703 c=702 q=367)
Task dump for CPU 1:
kworker/1:1 R running 0 595 2 0x00000002
Workqueue: HIFB_WorkQueque HIFB_WBC_StartWorkQueue
[<c00188f8>] (unwind_backtrace) from [<c0013d44>] (show_stack+0x20/0x24)
[<c0013d44>] (show_stack) from [<c004f83c>] (sched_show_task+0xa8/0xf4)
[<c004f83c>] (sched_show_task) from [<c0052438>] (dump_cpu_task+0x3c/0x4c)
[<c0052438>] (dump_cpu_task) from [<c0070128>] (rcu_dump_cpu_stacks+0xa4/0xd8)
[<c0070128>] (rcu_dump_cpu_stacks) from [<c0073628>] (rcu_check_callbacks+0x428/0x764)
[<c0073628>] (rcu_check_callbacks) from [<c00783a8>] (update_process_times+0x50/0x70)
[<c00783a8>] (update_process_times) from [<c00890d0>] (tick_sched_handle+0x58/0x64)
[<c00890d0>] (tick_sched_handle) from [<c0089130>] (tick_sched_timer+0x54/0x9c)
[<c0089130>] (tick_sched_timer) from [<c0078e3c>] (__run_hrtimer+0xdc/0x1ac)
[<c0078e3c>] (__run_hrtimer) from [<c007970c>] (hrtimer_interrupt+0x12c/0x310)
[<c007970c>] (hrtimer_interrupt) from [<c05d51d0>] (arch_timer_handler_phys+0x40/0x48)
[<c05d51d0>] (arch_timer_handler_phys) from [<c006c930>] (handle_percpu_devid_irq+0xb4/0x11c)
[<c006c930>] (handle_percpu_devid_irq) from [<c00688d0>] (__handle_domain_irq+0x8c/0xe4)
[<c00688d0>] (__handle_domain_irq) from [<c00086a8>] (gic_handle_irq+0x30/0x6c)
[<c00086a8>] (gic_handle_irq) from [<c00148c0>] (__irq_svc+0x40/0x54)
Exception stack(0xddc7dce0 to 0xddc7dd28)
dce0: 00000003 debe2f00 00000003 00000001 c1266a78 debcbbc4 debcbbc0 c12673a0
dd00: 00000004 c0021d60 00000000 ddc7dd5c 00000003 ddc7dd28 c008db64 c008db88
dd20: 200f0013 ffffffff
[<c00148c0>] (__irq_svc) from [<c008db88>] (smp_call_function_many+0x214/0x2a0)
[<c008db88>] (smp_call_function_many) from [<c008dc88>] (on_each_cpu+0x38/0x58)
[<c008dc88>] (on_each_cpu) from [<c064c6f0>] (__dma_clear_buffer+0x124/0x150)
[<c064c6f0>] (__dma_clear_buffer) from [<c064d7c4>] (hil_mmb_alloc+0x108/0x2e8)
[<c064d7c4>] (hil_mmb_alloc) from [<c064a458>] (new_mmb+0x24/0x44)
[<c064a458>] (new_mmb) from [<c0649554>] (HI_DRV_SMMU_AllocAndMap+0x3c/0x110)
[<c0649554>] (HI_DRV_SMMU_AllocAndMap) from [<c0780fe0>] (OPTM_AllocAndMap+0xcc/0xf8)
[<c0780fe0>] (OPTM_AllocAndMap) from [<c07874d4>] (HIFB_WBC_InitLayer+0x2dc/0x40c)
[<c07874d4>] (HIFB_WBC_InitLayer) from [<c0787620>] (HIFB_WBC_StartWorkQueue+0x1c/0x20)
[<c0787620>] (HIFB_WBC_StartWorkQueue) from [<c003fecc>] (process_one_work+0x194/0x3ac)
[<c003fecc>] (process_one_work) from [<c0040128>] (worker_thread+0x44/0x4cc)
[<c0040128>] (worker_thread) from [<c00451b0>] (kthread+0xd8/0xf4)
[<c00451b0>] (kthread) from [<c000f960>] (ret_from_fork+0x14/0x20)
init: countError is 1
INFO: rcu_sched self-detected stall on CPUINFO: rcu_sched detected stalls on CPUs/tasks: { 1} (detected by 2, t=147012 jiffies, g=703, c=702, q=579)
Task dump for CPU 1:
kworker/1:1 R running 0 595 2 0x00000002
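Workqueue: HIFB_WorkQueque HIFB_WBC_StartWorkQueue{ 1} (t=147035 jiffies g=703 c=702 q=579)
Kernel-level errors like these can look intimidating, but there is no need to panic. I first handed the log to an AI and asked it to help with the analysis:
✅ Key problem located:
🔴 Core anomaly: RCU stall + kworker dead loop
The following appears repeatedly in the log: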
INFO: rcu_sched self-detected stall on CPU {1}
...
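kworker/1:1 R running ... Workqueue: HIFB_WorkQueque HIFB_WBC_StartWorkQueue
✅ Conclusion:
CPU1 is permanently occupied by the kworker thread, so the rcu_sched grace period cannot complete, the main boot thread is starved, and the Android system services never start.
🔍 Root cause analysis:
1. Dead loop in HIFB_WBC_StartWorkQueue
HIFB is the frame buffer driver module of the HiSilicon platform. HIFB_WBC_StartWorkQueue is the write-back (WBC) initialization path. Somewhere under this function the code spins waiting for a condition that never becomes true (for example DMA completion, a register state, or a clock that is not ready).
2. DMA allocation failure or deadlock
The log shows:
[<c0649554>] (HI_DRV_SMMU_AllocAndMap)
[<c0780fe0>] (OPTM_AllocAndMap)
[<c07874d4>] (HIFB_WBC_InitLayer)
This indicates that HIFB blocks while trying to allocate a DMA buffer. Possible causes:
IOMMU/SMMU initialization has not completed
The memory pool is exhausted
Clock or power is not enabled
Registers are configured in the wrong order
✅ Root cause summary
HIFB_WBC_InitLayer() → OPTM_AllocAndMap() → HI_DRV_SMMU_AllocAndMap() → hil_mmb_alloc() → __dma_clear_buffer() → on_each_cpu()
on_each_cpu() waits until every CPU has executed a callback. In the backtrace, the kworker on CPU1 is spinning inside smp_call_function_many() waiting for that to finish; the wait never ends and the kworker never gives up the CPU, so the result is effectively a deadlock.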
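The "wait for every CPU" behavior described in the summary can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct sim_cpu`, `sim_call_and_wait`, and the `can_take_ipi` flag are hypothetical names invented for the illustration, and the bounded `max_spins` loop exists only so the demo terminates (the kernel wait seen in the backtrace has no such bound). It shows why a single CPU that never services the request is enough to keep the caller spinning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4

/* Hypothetical model of one CPU (not a kernel structure). */
struct sim_cpu {
    bool can_take_ipi; /* false models a CPU that never services the request */
    bool acked;        /* set once the CPU has run the requested callback */
};

/*
 * Model of "run a callback on every CPU and wait for all of them".
 * Returns -1 if every CPU acknowledged within max_spins polling rounds,
 * otherwise the index of the first CPU that never acknowledged.
 */
int sim_call_and_wait(struct sim_cpu cpus[], size_t n, unsigned max_spins)
{
    assert(cpus != NULL);

    for (size_t i = 0; i < n; i++)
        cpus[i].acked = false;

    for (unsigned spin = 0; spin < max_spins; spin++) {
        bool all_acked = true;

        for (size_t i = 0; i < n; i++) {
            if (cpus[i].can_take_ipi)
                cpus[i].acked = true;   /* this CPU ran the callback */
            if (!cpus[i].acked)
                all_acked = false;      /* the caller has to keep waiting */
        }
        if (all_acked)
            return -1;
    }

    /* Timed out: report which CPU the caller was still waiting for. */
    for (size_t i = 0; i < n; i++)
        if (!cpus[i].acked)
            return (int)i;
    return -1;
}
```

With four responsive CPUs the call returns -1 on the first polling round; if any one entry has `can_take_ipi` set to false, the caller burns through all of `max_spins` and returns that CPU's index, which is the user-space analogue of the stall reported in the log.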
With the key clue in hand (a dead loop under HIFB_WBC_StartWorkQueue), the first step is to locate HIFB_WBC_StartWorkQueue in the source. Searching the source tree for the name finds it:
HiSTBAndroidV600R003C00SPC020\device\hisilicon\bigfish\sdk\source\msp\drv\hifb\adp\src\drv_hifb_wbc.c contains the definition of HIFB_WBC_StartWorkQueue():
#ifndef HI_BUILD_IN_BOOT
HI_VOID HIFB_WBC_StartWorkQueue(struct work_struct *data)
{
    HIFB_LAYER_ID_E u32LayerID = HIFB_LAYER_ID_BUTT;
    OPTM_GFX_WORK_S *pstOpenSlvWork = container_of(data, OPTM_GFX_WORK_S, work);
    u32LayerID = (HIFB_LAYER_ID_E)(pstOpenSlvWork->u32Data);
    HIFB_WBC_InitLayer(u32LayerID);
    return;
}
#endif
The fix is to remove the HIFB_WBC_InitLayer() call from the work queue and invoke it manually late in system startup instead.
Step 1: disable the call in the work queue
--- a/drivers/graphics/hifb/drv_hifb_wbc.c
+++ b/drivers/graphics/hifb/drv_hifb_wbc.c
@@ -480,7 +480,10 @@ HI_VOID HIFB_WBC_InitLayer(HIFB_LAYER_ID_E enLayerId)
 #ifndef HI_BUILD_IN_BOOT
 HI_VOID HIFB_WBC_StartWorkQueue(struct work_struct *data)
 {
-    HIFB_LAYER_ID_E u32LayerID = HIFB_LAYER_ID_BUTT;
+    /* Deferred init: do not run inside the kworker, to avoid the on_each_cpu deadlock */
+    return;
+
+    HIFB_LAYER_ID_E u32LayerID = HIFB_LAYER_ID_BUTT; /* original logic is now dead code */
     OPTM_GFX_WORK_S *pstOpenSlvWork = container_of(data, OPTM_GFX_WORK_S, work);
     u32LayerID = (HIFB_LAYER_ID_E)(pstOpenSlvWork->u32Data);
     HIFB_WBC_InitLayer(u32LayerID);
Step 2: add a late_initcall callback
Add the following to the same file (at the end, or anywhere else):
#if !defined(HI_BUILD_IN_BOOT)
static int __init hifb_wbc_late_init(void)
{
    /* Initialize WBC late in boot, once multi-core scheduling has settled */
    HIFB_WBC_InitLayer(HIFB_LAYER_SD_0);
    return 0;
}
late_initcall(hifb_wbc_late_init);
#endif
Step 3: rebuild the kernel and flash kernel.img to verify
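The deferral idea behind steps 1 and 2 can be modeled in user space. The sketch below is not the SDK code: `enum layer_id`, `wbc_work_handler`, `wbc_late_init`, and `wbc_init_count` are hypothetical stand-ins, and unlike the patch above (which hard-codes HIFB_LAYER_SD_0 in the late hook) this variation remembers which layer the work item asked for. It only models the ordering; whether the real work item is queued before the late initcall runs is not something the log establishes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's layer IDs (not the SDK enum). */
enum layer_id { LAYER_SD_0 = 0, LAYER_BUTT };

static enum layer_id pending_layer = LAYER_BUTT; /* LAYER_BUTT: nothing pending */
static int init_calls;                           /* how often the heavy init ran */

/* Stand-in for HIFB_WBC_InitLayer(): the step that allocates buffers. */
static void wbc_init_layer(enum layer_id id)
{
    assert(id != LAYER_BUTT);
    init_calls++;
}

/* Work handler: records the request instead of initializing right away. */
void wbc_work_handler(enum layer_id requested)
{
    pending_layer = requested;
}

/* Late hook: performs the recorded init at most once. Returns true if it ran. */
bool wbc_late_init(void)
{
    if (pending_layer == LAYER_BUTT)
        return false;               /* nothing was requested */
    wbc_init_layer(pending_layer);
    pending_layer = LAYER_BUTT;     /* clear so a second call is a no-op */
    return true;
}

/* Accessor so callers can observe how many times the init really ran. */
int wbc_init_count(void)
{
    return init_calls;
}
```

Calling `wbc_work_handler(LAYER_SD_0)` leaves the init count at zero; the first `wbc_late_init()` afterwards runs the init once, and any further call does nothing.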
Things did not go that smoothly, however. The hang still occurred, but the log was different, which suggests the change above did have an effect. The new log:
INFO: rcu_sched self-detected stall on CPU { 0} (t=147006 jiffies g=-290 c=-291 q=112)
Task dump for CPU 0:
swapper/0 R running 0 1 0 0x00000002
[<c00188f8>] (unwind_backtrace) from [<c0013d44>] (show_stack+0x20/0x24)
[<c0013d44>] (show_stack) from [<c004f83c>] (sched_show_task+0xa8/0xf4)
[<c004f83c>] (sched_show_task) from [<c0052438>] (dump_cpu_task+0x3c/0x4c)
[<c0052438>] (dump_cpu_task) from [<c0070128>] (rcu_dump_cpu_stacks+0xa4/0xd8)
[<c0070128>] (rcu_dump_cpu_stacks) from [<c0073628>] (rcu_check_callbacks+0x428/0x764)
[<c0073628>] (rcu_check_callbacks) from [<c00783a8>] (update_process_times+0x50/0x70)
[<c00783a8>] (update_process_times) from [<c00863e4>] (tick_periodic+0x44/0xcc)
[<c00863e4>] (tick_periodic) from [<c008667c>] (tick_handle_periodic+0x38/0x98)
[<c008667c>] (tick_handle_periodic) from [<c05d51d0>] (arch_timer_handler_phys+0x40/0x48)
[<c05d51d0>] (arch_timer_handler_phys) from [<c006c930>] (handle_percpu_devid_irq+0xb4/0x11c)
[<c006c930>] (handle_percpu_devid_irq) from [<c00688d0>] (__handle_domain_irq+0x8c/0xe4)
[<c00688d0>] (__handle_domain_irq) from [<c00086a8>] (gic_handle_irq+0x30/0x6c)
[<c00086a8>] (gic_handle_irq) from [<c00148c0>] (__irq_svc+0x40/0x54)
Exception stack(0xde45dc70 to 0xde45dcb8)
dc60: 00000003 debe2630 00000003 00000001
dc80: c1260a78 debc1bc4 debc1bc0 c12613a0 00000004 c0021d60 00000000 de45dcec
dca0: 00000003 de45dcb8 c008db64 c008db88 20000113 ffffffff
[<c00148c0>] (__irq_svc) from [<c008db88>] (smp_call_function_many+0x214/0x2a0)
[<c008db88>] (smp_call_function_many) from [<c008dc88>] (on_each_cpu+0x38/0x58)
[<c008dc88>] (on_each_cpu) from [<c064c6f0>] (__dma_clear_buffer+0x124/0x150)
[<c064c6f0>] (__dma_clear_buffer) from [<c064d7c4>] (hil_mmb_alloc+0x108/0x2e8)
[<c064d7c4>] (hil_mmb_alloc) from [<c064a458>] (new_mmb+0x24/0x44)
[<c064a458>] (new_mmb) from [<c0649554>] (HI_DRV_SMMU_AllocAndMap+0x3c/0x110)
[<c0649554>] (HI_DRV_SMMU_AllocAndMap) from [<c07dae3c>] (DMX_OsiInitDmxSet+0x80/0x814)
[<c07dae3c>] (DMX_OsiInitDmxSet) from [<c07db650>] (DmxCluster_StartCluster+0x80/0x170)
[<c07db650>] (DmxCluster_StartCluster) from [<c07cff54>] (HI_DRV_DMX_Init+0x44/0x54)
[<c07cff54>] (HI_DRV_DMX_Init) from [<c094add0>] (MCE_ModuleInit+0x80/0x2e4)
[<c094add0>] (MCE_ModuleInit) from [<c094b050>] (MCE_Init+0x1c/0x14c)
[<c094b050>] (MCE_Init) from [<c00089dc>] (do_one_initcall+0xb8/0x1f0)
[<c00089dc>] (do_one_initcall) from [<c11abe28>] (kernel_init_freeable+0x10c/0x1d4)
[<c11abe28>] (kernel_init_freeable) from [<c0c50328>] (kernel_init+0x1c/0xfc)
[<c0c50328>] (kernel_init) from [<c000f960>] (ret_from_fork+0x14/0x20)
Task dump for CPU 3:
swapper/3 R running 0 0 1 0x00000002
[<c0c56a88>] (__schedule) from [<de513fc0>] (0xde513fc0)
🔍 Next step: pinpoint the new culprit
Judging from the latest log, the deadlock path is:
[<c07dae3c>] (DMX_OsiInitDmxSet+0x80/0x814)
[<c07db650>] (DmxCluster_StartCluster+0x80/0x170)
[<c07cff54>] (HI_DRV_DMX_Init+0x44/0x54)
[<c094add0>] (MCE_ModuleInit+0x80/0x2e4)
The problem has moved to the DMX (Demux) driver!
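Both backtraces pass through HI_DRV_SMMU_AllocAndMap and differ only in who called it, which is easy to miss when reading by eye. A small helper can pull the symbols out of a backtrace line so the two logs can be compared mechanically. This is a minimal sketch: `bt_symbol` is a hypothetical helper written for this note, it assumes the `[<addr>] (callee) from [<addr>] (caller+off/size)` layout seen in the logs above, and it does not repair lines that the serial terminal wrapped with a stray space.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Extract a symbol from an ARM kernel backtrace line of the form
 *   [<addr>] (callee) from [<addr>] (caller+0xOFF/0xSIZE)
 * which = 0 selects the callee, which = 1 selects the caller.
 * Any "+0xOFF/0xSIZE" suffix is dropped.
 * Returns 0 on success, -1 if the line does not match or out is too small.
 */
int bt_symbol(const char *line, int which, char *out, size_t outsz)
{
    assert(line != NULL && out != NULL);

    if (which < 0 || strncmp(line, "[<", 2) != 0)
        return -1;                  /* not a backtrace frame line */

    const char *p = line;
    for (int i = 0; i <= which; i++) {
        p = strchr(p, '(');
        if (p == NULL)
            return -1;              /* fewer symbols than requested */
        p++;                        /* step past '(' */
    }

    size_t len = strcspn(p, "+)"); /* symbol ends at '+' or ')' */
    if (p[len] == '\0' || len == 0 || len >= outsz)
        return -1;

    memcpy(out, p, len);
    out[len] = '\0';
    return 0;
}
```

Fed the HI_DRV_SMMU_AllocAndMap frame from each log, `bt_symbol(line, 1, ...)` yields OPTM_AllocAndMap for the first log and DMX_OsiInitDmxSet for the second, while the frames below it (hil_mmb_alloc, __dma_clear_buffer, on_each_cpu) are identical in both.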
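✅ Quick verification: skip DMX initialization
Return immediately at the top of DMX_OsiInitDmxSet(). Source location: HiSTBAndroidV600R003C00SPC020\device\hisilicon\bigfish\sdk\source\msp\drv\demux\demux_v2\drv_demux_func.c
static HI_S32 DMX_OsiInitDmxSet(Dmx_Set_S * DmxSet)
{
    if(1)
    {
        printk(KERN_WARNING "DMX_OsiInitDmxSet skipped for RCU stall debug\n");
        return HI_SUCCESS;
    }
    else
    {
        HI_S32 ret = HI_FAILURE;
        HI_UNF_DMX_PORT_ATTR_S PortAttr;
        HI_UNF_DMX_TSO_PORT_ATTR_S TSOPortAttr;
        ....
    }
}
After rebuilding the kernel and flashing kernel.img, the problem was gone: the device no longer hung at the boot logo. Testing showed no impact on other features either. Still, skipping DMX initialization outright left me uneasy, so I wanted to understand what DMX is actually for. After some research, I learned the following:
The DMX (Demux) module only has to be initialized when hardware demultiplexing is needed (satellite, cable, terrestrial, or IPTV streams).
If your device only does network OTT playback, local USB file playback, or HDMI input, DMX can be left uninitialized with no loss of functionality. To move DMX out of kernel early init and load it on demand from user space, you only need to:
Make HI_DRV_DMX_Init() a module_init (rather than a core_initcall)
Leave the DMX driver out of the kernel by default (or build it as an <M> module)
Run insmod /vendor/lib/modules/hi_drv_dmx.ko from init.rc (you decide when it is loaded)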
I. What DMX initialization actually does
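| Use case | DMX needed? | Notes |
|---|---|---|
| IPTV/OTT network playback | ❌ | Stream is already pure ES, no demultiplexing needed |
| Local USB MP4/MKV | ❌ | Software demux is done in the player |
| HDMI/CVBS input | ❌ | No TS stream |
| Satellite/cable/terrestrial DVB | ✅ | Hardware TS demux needed to obtain the audio/video PIDs |
| Carrier IPTV (multicast UDP TS) | ✅ | Hardware demultiplexing needed |
Since everything "works fine" after you skipped the initialization, your product falls under the ❌ rows above, so there is no impact.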
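II. How to defer DMX initialization to user space
① Build the driver as a module (recommended)
make ARCH=arm menuconfig
Device Drivers --->
    <M> Hisilicon DMX support # was <*>, change to <M>
Save, then rebuild:
make ARCH=arm CROSS_COMPILE=arm-hisiv500-linux- modules
This produces:
drivers/media/dmx/hi_drv_dmx.ko
Put it in /vendor/lib/modules/ and add this to init.rc:
on post-fs-data
    insmod /vendor/lib/modules/hi_drv_dmx.ko
This way HI_DRV_DMX_Init() no longer runs during the boot stage; the module is loaded once the file system is ready, which avoids the RCU deadlock entirely.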
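② If it must be built into the kernel, switch to late_initcall
Change:
// drivers/media/dmx/dmx_drv.c
// module_init(HI_DRV_DMX_Init); // original
late_initcall(HI_DRV_DMX_Init); // runs after the arch stage
For built-in code, module_init() is a device_initcall (level 6), so this moves the call from level 6 back to level 7 (late_initcall), away from the on_each_cpu() risk in the earlier stage.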
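The level numbers are easy to mix up, so here is a small lookup table of the built-in initcall levels in the order the kernel runs them. It is an illustrative user-space sketch: the names follow the `*_initcall()` macro families (level 0 is `pure_initcall`), and `initcall_level` and `runs_before` are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Built-in initcall levels, indexed by level number (lower runs earlier). */
static const char *const initcall_levels[] = {
    "pure",     /* 0: pure_initcall() */
    "core",     /* 1: core_initcall() */
    "postcore", /* 2: postcore_initcall() */
    "arch",     /* 3: arch_initcall() */
    "subsys",   /* 4: subsys_initcall() */
    "fs",       /* 5: fs_initcall() */
    "device",   /* 6: device_initcall(); module_init() of built-in code */
    "late",     /* 7: late_initcall() */
};

/* Return the numeric level for a level name, or -1 if it is unknown. */
int initcall_level(const char *name)
{
    assert(name != NULL);

    size_t n = sizeof initcall_levels / sizeof initcall_levels[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(initcall_levels[i], name) == 0)
            return (int)i;
    return -1;
}

/* Return 1 if level a runs strictly before level b, else 0. */
int runs_before(const char *a, const char *b)
{
    int la = initcall_level(a);
    int lb = initcall_level(b);

    return la >= 0 && lb >= 0 && la < lb;
}
```

`runs_before("device", "late")` is 1, which is the ordering that option ② relies on: a late_initcall runs after every device-level initcall has finished.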
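③ Skip it completely (already verified)
Just keep the early return you have now; on a product without DVB features it has zero side effects.
III. Summary
DMX only handles hardware demultiplexing of TS streams, so it can be safely skipped in scenarios with no TS.
The simplest way to defer initialization is to build hi_drv_dmx as <M> and insmod it from init.rc, which does not affect functionality and completely eliminates the boot-time RCU deadlock.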
Our product's use case happens not to need DMX, so I simply skipped DMX initialization to resolve the problem.
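A hard-coded `if(1)` is fine for a quick experiment, but it is easy to forget later. One possible refinement is to put the skip behind a switch so the full path can be restored without editing the function again. The following is a user-space sketch of that idea only: `dmx_set_skip`, `dmx_init_dmx_set`, and `dmx_full_init_count` are hypothetical names, and in a real kernel the switch would more likely be a Kconfig option or a module parameter, neither of which is part of the change described here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define DEMO_SUCCESS 0

static bool skip_dmx_init = true; /* default mirrors the workaround: skip */
static int full_init_runs;        /* how often the full init path ran */

/* Stand-in for the original body of DMX_OsiInitDmxSet(). */
static int dmx_full_init(void)
{
    full_init_runs++;
    return DEMO_SUCCESS;
}

/* Select whether the initialization is skipped (true) or performed (false). */
void dmx_set_skip(bool skip)
{
    skip_dmx_init = skip;
}

/* Entry point: returns success immediately when the skip switch is on. */
int dmx_init_dmx_set(void)
{
    if (skip_dmx_init) {
        fprintf(stderr, "DMX init skipped by configuration\n");
        return DEMO_SUCCESS;
    }
    return dmx_full_init();
}

/* Accessor so callers can check whether the full path was taken. */
int dmx_full_init_count(void)
{
    return full_init_runs;
}
```

With the default setting the entry point reports success without touching the full init path; after `dmx_set_skip(false)` the same call runs it, which is what a product variant with DVB support would need.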
