How to Avoid and Recover from LoRA Fine-Tuning Interruptions Caused by Closing the Terminal

news 2025/11/5 5:06:14

Environment:

Ubuntu 20.04

LLaMA-Factory

Qwen2.5-7B-Instruct

llama.cpp

H20 (95 GB)

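Problem description:

When running LoRA fine-tuning with the command CUDA_VISIBLE_DEVICES=1 FORCE_TORCHRUN=1 llamafactory-cli train examples/train_lora/qwen2_5-7b_lora_sft.yaml, the fine-tuning process is interrupted if the terminal window is closed unexpectedly.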

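Solution:

Why the run is interrupted:
- Closing the terminal sends a SIGHUP signal to the processes attached to that terminal session; by default, SIGHUP terminates them.
- No background execution or session-management tool (such as nohup or tmux) was used.
How to confirm the interruption:
- Check the training log for signs of abnormal termination.
- Run nvidia-smi to see whether the training process still occupies the GPU.
- Check whether the training outputs (such as checkpoints) are complete.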
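
The "is it still running?" check can also be scripted. The sketch below is one possible approach, under the assumption that the training command was started in the background with `&` and that its PID was saved right afterwards with `echo $! > train.pid`. The file name `train.pid` and the helper `pid_is_alive` are hypothetical names chosen for illustration, not part of LLaMA-Factory.

```shell
# Hypothetical helper; the default PID-file name train.pid is an assumption.
# Returns 0 if the process whose PID is stored in the given file is alive,
# non-zero if the file is missing/empty or the process no longer exists.
pid_is_alive() {
    local pid_file="${1:-train.pid}"
    [ -s "$pid_file" ] || return 1
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    kill -0 "$(cat "$pid_file")" 2>/dev/null
}

# Usage (commented out so that sourcing this file has no side effects):
# if pid_is_alive train.pid; then
#     echo "training is still running"
# else
#     echo "training has stopped"; tail -n 20 train.log
# fi
```

A PID check only shows that the launcher process exists; the log and nvidia-smi are still needed to see whether training is actually making progress.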

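How to avoid the interruption:

Use nohup. The environment variable assignments must come before nohup, because nohup treats its first argument as the command to run:

CUDA_VISIBLE_DEVICES=1 FORCE_TORCHRUN=1 nohup llamafactory-cli train examples/train_lora/qwen2_5-7b_lora_sft.yaml > train.log 2>&1 &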

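Alternatively, wrap the whole command in sh -c

With this approach the inner shell parses the variable assignments, so they are set correctly and applied to the command that immediately follows them: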

nohup sh -c 'CUDA_VISIBLE_DEVICES=1 FORCE_TORCHRUN=1 python -m llamafactory.cli train examples/train_lora/qwen2_5-7b_lora_sft.yaml' > train.log 2>&1 &

View the log in real time:

tail -f train.log
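
tmux, the other session manager mentioned above, keeps the job attached to a session that survives closing the terminal window. The following is a minimal sketch rather than a definitive setup: it assumes tmux is installed, and the session name `lora_sft` and the helper `start_training_in_tmux` are hypothetical names chosen for illustration.

```shell
# Sketch: run the training command inside a detached tmux session.
# The session name "lora_sft" is an arbitrary, hypothetical choice.
TRAIN_CMD='CUDA_VISIBLE_DEVICES=1 FORCE_TORCHRUN=1 llamafactory-cli train examples/train_lora/qwen2_5-7b_lora_sft.yaml'

start_training_in_tmux() {
    local session="${1:-lora_sft}"
    # -d starts the session detached; tmux runs the command string in a shell,
    # and tee copies the output to train.log as well as the tmux window.
    tmux new-session -d -s "$session" "$TRAIN_CMD 2>&1 | tee train.log"
}

# Usage (commented out so that sourcing this file starts nothing):
# start_training_in_tmux lora_sft
# tmux attach -t lora_sft    # reattach; detach again with Ctrl-b d
# tmux ls                    # list existing sessions
```

Because the session ends when the command exits, the output is also written to train.log so it remains available afterwards.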

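How to recover after an interruption:
- Resume from a checkpoint, where path_to_checkpoint is a placeholder for a saved checkpoint directory:
```
CUDA_VISIBLE_DEVICES=1 FORCE_TORCHRUN=1 llamafactory-cli train examples/train_lora/qwen2_5-7b_lora_sft.yaml --resume_from_checkpoint path_to_checkpoint
```
- Restart training: if no checkpoint was saved, training has to start over from the beginning.
- Check the log and GPU status: identify the cause of the interruption and fix it.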
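
To fill in path_to_checkpoint, the most recent checkpoint has to be located first. The sketch below assumes that checkpoints are written as `checkpoint-<step>` subdirectories of the training output directory, which is the layout the Hugging Face Trainer uses. The helper name `latest_checkpoint` and the example directory in the usage comment are hypothetical; the real directory is whatever `output_dir` is set to in the YAML file.

```shell
# Hypothetical helper: print the checkpoint-<step> directory with the highest
# step number under the given output directory. Returns non-zero if none exists.
latest_checkpoint() {
    local output_dir="$1" best="" best_step=-1 dir step
    for dir in "$output_dir"/checkpoint-*; do
        [ -d "$dir" ] || continue
        step="${dir##*-}"                      # text after the last "-"
        case "$step" in
            ''|*[!0-9]*) continue ;;           # skip non-numeric suffixes
        esac
        if [ "$step" -gt "$best_step" ]; then
            best_step="$step"
            best="$dir"
        fi
    done
    [ -n "$best" ] && printf '%s\n' "$best"
}

# Usage (the directory below is an assumed example, not taken from the article):
# ckpt="$(latest_checkpoint saves/qwen2_5-7b/lora/sft)" && echo "$ckpt"
```

Comparing step numbers numerically avoids the pitfall of plain alphabetical sorting, which would place checkpoint-1000 before checkpoint-500.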
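Summary:
- Use background execution or a session-management tool (such as `nohup`) to avoid interruptions.
- Enable checkpoint saving so that training can be resumed after an interruption.
- Check the training log and GPU status regularly to make sure training is proceeding normally.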