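Summary of MindIE 1.0 New Features and Debugging Issues
Overview
I recently deployed models with MindIE 1.0 on an Ascend 310P3. Compared with MindIE 1.0_rc2, which I used before, it has many new features and changes, and these caused quite a few deployment problems. I list my findings here in the hope that they are useful to others.
Feature 1: Shared memory must be configured explicitly
Error message: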
[2025-04-02 14:04:30.801+08:00] [63217] [63219] [server] [INFO] [share_memory.cpp:168] : [ShareMemory::SharedMemorySizeCheck] shared memory size check success.
[2025-04-02 14:04:30.826+08:00] [63217] [63219] [server] [ERROR] [share_memory.cpp:158] : [ShareMemory::SharedMemorySizeCheck]shared memory available is not enough on the filesystem.
[2025-04-02 14:04:30.826+08:00] [63217] [63219] [server] [ERROR] [share_memory.cpp:26] : [ShareMemory::Create]available shared memory size is not enough.
[2025-04-02 14:04:30.826+08:00] [63217] [63219] [server] [ERROR] [master_IPC_communicator.cpp:137] : Failed to create response shared memory
[2025-04-02 14:04:30.826+08:00] [63217] [63219] [server] [ERROR] [connector_launcher.cpp:143] : [ConnectorLauncher::Launch] Failed to setup message channel.
[2025-04-02 14:04:30.833+08:00] [63217] [63219] [server] [ERROR] [model_backend.cpp:230] : [ModelBackend::InstanceInit] Failed to launcher connector agent.
[2025-04-02 14:04:30.833276+08:00] [63217] [63219] [server] [ERROR] [llm_infer_model_instance.cpp:234] : [LLMInferModelInstance::Init] llmManager_ init fail!
Solution: explicitly set the shared memory size when starting the Docker container
docker run ...... --shm-size 10g ......
Relevant log output after a successful start:
[2025-04-02 14:53:47.319+08:00] [119] [121] [server] [INFO] [share_memory.cpp:167] : total shared memory size:10240MB, and available shared memory size:10240MB.
[2025-04-02 14:53:47.319+08:00] [119] [121] [server] [INFO] [share_memory.cpp:168] : [ShareMemory::SharedMemorySizeCheck] shared memory size check success.
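The same figures can be checked from inside the container before the service starts. The following is a minimal sketch, assuming shared memory is mounted at /dev/shm (the usual Docker location); `shm_sizes_mb` is a hypothetical helper, not part of MindIE, and the logs above do not state the minimum size MindIE requires.

```python
import shutil


def shm_sizes_mb(path="/dev/shm"):
    """Return (total_mb, available_mb) for the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    mb = 1024 * 1024
    return usage.total // mb, usage.free // mb
```

Calling `shm_sizes_mb()` in the container returns the total and available size in MB, which can be compared with the values printed by share_memory.cpp.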
Feature 2: config.json in the model directory must have permission 0750 (no access for others)
Error message:
Check path: config.json failed, by: Check Other group permission failed: Current permission is 4, but required no greater than 0. Required permission: 750, but got 644
Failed to check config.json under model weight path.
ERR: Failed to init endpoint! Please check the service log or console output.
Killed
Solution: set the permission of config.json to 0750
chmod 0750 /in/Qwen2.5-3B-Instruct/config.json
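The error text compares each permission digit against a limit ("Current permission is 4, but required no greater than 0"). A minimal sketch of that kind of check follows, assuming a digit-by-digit comparison against 0750 as the message wording suggests; `permission_violations` is a hypothetical helper, and the check MindIE actually performs may differ.

```python
def permission_violations(mode, limit=0o750):
    """Compare the owner/group/other digits of `mode` against `limit`.

    Returns one message per class whose digit exceeds the allowed digit.
    """
    problems = []
    for name, shift in (("owner", 6), ("group", 3), ("other", 0)):
        current = (mode >> shift) & 0o7   # digit of the file's mode
        allowed = (limit >> shift) & 0o7  # digit of the required mode
        if current > allowed:
            problems.append(
                f"{name}: current permission is {current}, "
                f"required no greater than {allowed}"
            )
    return problems
```

For a real file, pass `os.stat(path).st_mode`; the file-type bits in `st_mode` are masked out by the `& 0o7` step. A mode of 0o644 is reported for the "other" class only, while 0o750 yields an empty list.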
Feature 3: The other-group permission bits of the model directory must be 0
Error message:
[ERROR] [model_deploy_config.cpp:159] Failed to get vocab size from tokenizer wrapper with exception: PermissionError: The file should not be writable by others who are neither the owner nor in the group. Please check the input path:/invoker-deploy/Qwen_local_57_c7c2d6e7/in/Qwen2.5-3B-Instruct/generation_config.json, and change mode to 33277.
The file should not be writable by others who are neither the owner nor in the group. Please check the input path:/invoker-deploy/Qwen_local_57_c7c2d6e7/in/Qwen2.5-3B-Instruct/model-00002-of-00002.safetensors, and change mode to 33277
Check path: config.json failed, by: Check Other group permission failed: Current permission is 4, but required no greater than 0. Required permission: 750, but got 444
Failed to check config.json under model weight path.
Solution: clear the other-group permission bits. As in Feature 2, recursively setting the whole directory to 0750 is enough.
chmod 0750 /path/to/model/ -R
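The "change mode to 33277" hint in the error message is a raw `st_mode` value printed in decimal, not an octal permission string. A short sketch using Python's standard `stat` module decodes it; `describe_mode` is a hypothetical helper name.

```python
import stat


def describe_mode(st_mode):
    """Return (octal permission string, ls-style string) for a raw st_mode."""
    return oct(stat.S_IMODE(st_mode)), stat.filemode(st_mode)


print(describe_mode(33277))  # ('0o775', '-rwxrwxr-x')
```

So 33277 means a regular file with mode 775. That mode still grants read and execute to others, which the config.json check in the same log rejects, so 0750 is the value that satisfies both messages.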
Feature 4: model_name must not contain special characters; only [a-zA-Z0-9_.-] is allowed
Error message:
The value of modelName must meet the following rules: The string length is [1, 256] and consists of a match of the type [a-zA-Z0-9_.-]. The first and last characters must be characters or digits.
ERR: Failed to init endpoint! Please check the service log or console output.
Solution: make sure model_name in config.json contains no special characters
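The rule quoted in the error message can be checked before starting the service. Below is a minimal sketch that encodes the stated rule (length 1 to 256, characters from [a-zA-Z0-9_.-], first and last character a letter or digit) as a regular expression; `is_valid_model_name` is a hypothetical helper, and it assumes "characters" in the message means ASCII letters.

```python
import re

# First char alphanumeric; optionally up to 254 allowed chars plus a final
# alphanumeric char, giving a total length between 1 and 256.
_MODEL_NAME = re.compile(r"[a-zA-Z0-9](?:[a-zA-Z0-9_.-]{0,254}[a-zA-Z0-9])?")


def is_valid_model_name(name):
    """Return True if `name` satisfies the rule quoted in the error message."""
    return _MODEL_NAME.fullmatch(name) is not None
```

For example, `is_valid_model_name("Qwen2.5-3B-Instruct")` is True, while names such as `"qwen/2.5"` or `"_qwen"` are rejected.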
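Feature 5: MindIE can run multiple instances of the same model
Multiple instances can be started by setting modelInstanceNumber in config.json, the mindie-server configuration file.
With two instances, worldSize is still 1: one instance is started on each of two cards, and the service is still exposed on a single port. Both instances can also be assigned to the same card, as long as there is enough device memory. For example: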
"modelInstanceNumber": 2,
"npuDeviceIds": [
[0], [0] # both instances on the same card
],
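To keep the two keys consistent, the fragment can be generated rather than typed by hand. This is a sketch under the assumption, inferred from the example above, that npuDeviceIds holds one inner list per instance; `instance_fragment` is a hypothetical helper, and where these keys sit in the full config.json is not shown here.

```python
import json


def instance_fragment(device_groups):
    """Build the two config keys from a list of per-instance device-ID lists."""
    if not device_groups or any(not group for group in device_groups):
        raise ValueError("each instance needs at least one device ID")
    return {
        "modelInstanceNumber": len(device_groups),
        "npuDeviceIds": [list(group) for group in device_groups],
    }


# One instance on card 0 and one on card 1.
print(json.dumps(instance_fragment([[0], [1]]), indent=4))
```

Passing `[[0], [0]]` instead produces the same-card layout. The output is plain JSON without the inline `#` annotation, since JSON itself has no comment syntax.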