torch.distributed.launch, torchrun, and torch.distributed.run are not compatible with nohup

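Symptom:

After a PyTorch distributed training job was started with nohup, the SSH connection to the server dropped and the training run failed with the following error: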

WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3971878 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3971879 closing signal SIGHUP
Traceback (most recent call last):
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 237, in launch_agent
    result = agent.run()
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 3971841 got signal: 1

The command that was run:

nohup ./my_train.sh   >log.log 2>&1   &

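The likely cause is that torch.distributed.launch, torchrun, and torch.distributed.run do not work with nohup. nohup only arranges for SIGHUP to be ignored before it starts the command. As the log and traceback above show, the torch.distributed elastic agent handles SIGHUP (signal 1) itself: its _terminate_process_handler raises SignalException and the workers are shut down. So when the SSH connection drops and the terminal window is closed, the launcher reacts to the signal anyway and nohup has no effect.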
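The mechanism described above can be reproduced without PyTorch. The following is a minimal sketch, assuming a POSIX system (SIGHUP does not exist on Windows). It does not use any torch API: `SignalException`, `_raise_on_signal`, and `demo_sighup_override` are hypothetical names chosen for illustration, loosely modeled on what the traceback shows. It first emulates the "ignore SIGHUP" state that nohup sets up, then installs a Python-level handler the way a launcher might, and shows that the second SIGHUP is no longer ignored.

```python
import os
import signal
import time


class SignalException(Exception):
    """Hypothetical stand-in for the exception type seen in the traceback."""


def _raise_on_signal(signum, frame):
    # Mimics the behaviour visible in the traceback: turn the signal into an exception.
    raise SignalException(f"Process {os.getpid()} got signal: {signum}")


def demo_sighup_override():
    """Show that installing a handler replaces an inherited 'ignore SIGHUP' setting."""
    previous = signal.getsignal(signal.SIGHUP)
    if previous is None:  # handler was installed outside Python; fall back to default
        previous = signal.SIG_DFL
    result = {}
    try:
        # Step 1: emulate what nohup does before exec'ing the command.
        signal.signal(signal.SIGHUP, signal.SIG_IGN)
        os.kill(os.getpid(), signal.SIGHUP)  # ignored, the process keeps running
        time.sleep(0.05)
        result["ignored_under_nohup"] = True

        # Step 2: a program that installs its own SIGHUP handler undoes step 1.
        signal.signal(signal.SIGHUP, _raise_on_signal)
        try:
            os.kill(os.getpid(), signal.SIGHUP)
            time.sleep(0.05)  # give the interpreter a chance to run the handler
            result["raised_after_handler_installed"] = False
        except SignalException:
            result["raised_after_handler_installed"] = True
    finally:
        # Restore whatever disposition was active before the demo.
        signal.signal(signal.SIGHUP, previous)
    return result


if __name__ == "__main__":
    print(demo_sighup_override())
```

If both values in the returned dictionary are true, the process survived SIGHUP while it was ignored and was interrupted once a handler had been installed, which matches the failure reported here.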

ref: https://discuss.pytorch.org/t/ddp-error-torch-distributed-elastic-agent-server-api-received-1-death-signal-shutting-down-workers/135720/6

