Customizing a PyTorch Communication Backend in Practice
Background
For related research, I need to replace PyTorch's communication backend and customize the algorithms used in distributed training, such as all_reduce, including customizing the communication protocol itself.
PyTorch's default communication backends are 'gloo' and 'nccl'; 'mpi' is also supported (it requires building PyTorch from source). The PyTorch 2.7 source tree additionally contains an experimental 'ucc' backend, whose code can serve as a reference for later customization work.
See: Distributed communication package - torch.distributed — PyTorch 2.7 documentation
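For orientation, here is a minimal single-process sketch (my illustration, not from the sources above) showing where the backend name enters the picture; the rendezvous values are arbitrary:

import os
import torch
import torch.distributed as dist

# Single-process rendezvous settings (illustrative values)
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# The `backend` argument is where a custom backend name plugs in later.
dist.init_process_group(backend="gloo", rank=0, world_size=1)
x = torch.ones(4)
dist.all_reduce(x)  # with world_size=1 the result is unchanged
print(x)
dist.destroy_process_group()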
本地环境
$ lsb_release -a # Ubuntu 23.04
$ python3 --version # Python 3.11.4
$ python3 -c "import torch; print(torch.__version__)" # 2.7.0+cu126
Hands-on
For a newcomer to Python and PyTorch, it took a long stretch of fumbling to finally find the entry point to the problem:
1. 使用 C++ 扩展定制进程组后端 — PyTorch 教程 2.7.0+cu126 文档 — the Chinese translation of the tutorial 'Customizing Process Group Backends Using Cpp Extensions'. This is exactly what I was looking for.
2. https://github.com/H-Huang/torch_collective_extension — an example of extending torch collective communication. The project demonstrates two approaches: one based on custom_backend (the recommended way) and one based on custom_process_group (the legacy way, and what most material found online describes). The custom_backend code is the same as in (1) but more complete, providing stub implementations for all of PyTorch's collective communication interfaces.
The verification below is based mainly on the code from (1); a quick sketch of the registration hook it relies on follows.
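Both approaches ultimately go through torch.distributed.Backend.register_backend, which can also be called from Python. A minimal sketch (construct_fn here is a hypothetical placeholder; in the real extension it is the C++ function createBackendDummy exported via pybind11):

import torch.distributed as dist

# Hypothetical placeholder for the backend construction function; the real
# one receives (store, rank, size, timeout) and returns a Backend instance.
def construct_fn(store, rank, size, timeout):
    raise NotImplementedError

# After this call, "mydummy" is a valid value for init_process_group(backend=...).
dist.Backend.register_backend("mydummy", construct_fn)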
Code
Create a directory for the code under your home directory; in this example it is named 'custom_backend' and contains four files:
- dummy.hpp -- C++ header
- dummy.cpp -- C++ source
- setup.py -- Python build and configuration script
- example.py -- verification/test code
In what follows, the user home directory is assumed to be '/home/~usrname'.
1. dummy.hpp
// file name: dummy.hpp
// Code adapted from the PyTorch tutorial:
// https://pytorch.ac.cn/tutorials/intermediate/process_group_cpp_extension_tutorial.html
#pragma once

// Adjust the include paths to your environment; see setup.py
// #include <torch/python.h>
#include <torch/csrc/api/include/torch/python.h>
#include <torch/csrc/distributed/c10d/Backend.hpp>
#include <torch/csrc/distributed/c10d/Work.hpp>
#include <torch/csrc/distributed/c10d/Store.hpp>
#include <torch/csrc/distributed/c10d/Types.hpp>
#include <torch/csrc/distributed/c10d/Utils.hpp>
#include <pybind11/chrono.h>

namespace c10d {

class BackendDummy : public Backend {
  public:
    BackendDummy(int rank, int size);

    c10::intrusive_ptr<Work> allgather(
        std::vector<std::vector<at::Tensor>>& outputTensors,
        std::vector<at::Tensor>& inputTensors,
        const AllgatherOptions& opts = AllgatherOptions()) override;

    c10::intrusive_ptr<Work> allreduce(
        std::vector<at::Tensor>& tensors,
        const AllreduceOptions& opts = AllreduceOptions()) override;

    // The collective communication APIs without a custom implementation
    // will error out if invoked by application code.

    static c10::intrusive_ptr<Backend> createBackendDummy(
        const c10::intrusive_ptr<::c10d::Store>& store,
        int rank,
        int size,
        const std::chrono::duration<float>& timeout);

    static void BackendDummyConstructor() __attribute__((constructor)) {
      py::object module = py::module::import("torch.distributed");
      py::object register_backend =
          module.attr("Backend").attr("register_backend");
      // torch.distributed.Backend.register_backend will add `dummy` as a
      // new valid backend. Note: this is where the backend name ("dummy")
      // is chosen!
      register_backend("dummy", py::cpp_function(createBackendDummy));
    }
};

class WorkDummy : public Work {
    friend class BackendDummy;

  public:
    WorkDummy(
        OpType opType,
        c10::intrusive_ptr<c10::ivalue::Future> future) // future of the output
        : Work(
              -1, // rank, only used by recvAnySource, irrelevant in this demo
              opType),
          future_(std::move(future)) {}

    bool isCompleted() override;
    bool isSuccess() const override;
    bool wait(std::chrono::milliseconds timeout = kUnsetTimeout) override;
    virtual c10::intrusive_ptr<c10::ivalue::Future> getFuture() override;

  private:
    c10::intrusive_ptr<c10::ivalue::Future> future_;
};

} // namespace c10d
2. dummy.cpp
// file name: dummy.cpp
// Code adapted from the PyTorch tutorial:
// https://pytorch.ac.cn/tutorials/intermediate/process_group_cpp_extension_tutorial.html
#include "dummy.hpp"
#include <iostream>

namespace c10d {

bool WorkDummy::isCompleted() {
  return true;
}

bool WorkDummy::isSuccess() const {
  return true;
}

bool WorkDummy::wait(std::chrono::milliseconds /* unused */) {
  return true;
}

c10::intrusive_ptr<c10::ivalue::Future> WorkDummy::getFuture() {
  return future_;
}

// If necessary, pass store/rank/size to the ctor and exchange connection
// information here
BackendDummy::BackendDummy(int rank, int size)
    : Backend(rank, size) {}

// This is a dummy allgather that sets all output tensors to zero
// Modify the implementation to conduct real communication asynchronously
c10::intrusive_ptr<Work> BackendDummy::allgather(
    std::vector<std::vector<at::Tensor>>& outputTensors,
    std::vector<at::Tensor>& inputTensors,
    const AllgatherOptions& /* unused */) {
  for (auto& outputTensorVec : outputTensors) {
    for (auto& outputTensor : outputTensorVec) {
      outputTensor.zero_();
    }
  }

  auto future = c10::make_intrusive<c10::ivalue::Future>(
      c10::ListType::create(c10::ListType::create(c10::TensorType::get())));
  future->markCompleted(c10::IValue(outputTensors));
  return c10::make_intrusive<WorkDummy>(OpType::ALLGATHER, std::move(future));
}

// This is a dummy allreduce that sets all output tensors to zero
// Modify the implementation to conduct real communication asynchronously
c10::intrusive_ptr<Work> BackendDummy::allreduce(
    std::vector<at::Tensor>& tensors,
    const AllreduceOptions& opts) {
  for (auto& tensor : tensors) {
    tensor.zero_();
  }

  auto future = c10::make_intrusive<c10::ivalue::Future>(
      c10::ListType::create(c10::TensorType::get()));
  future->markCompleted(c10::IValue(tensors));
  // The original tutorial had a mistake here; the op type should be ALLREDUCE.
  return c10::make_intrusive<WorkDummy>(OpType::ALLREDUCE, std::move(future));
}

c10::intrusive_ptr<Backend> BackendDummy::createBackendDummy(
    const c10::intrusive_ptr<::c10d::Store>& /* unused */,
    int rank,
    int size,
    const std::chrono::duration<float>& /* unused */) {
  return c10::make_intrusive<BackendDummy>(rank, size);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("createBackendDummy", &BackendDummy::createBackendDummy);
}

} // namespace c10d
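The Work object returned by each collective is what application code sees when a collective is launched asynchronously. A small sketch of how WorkDummy's methods get exercised from Python (it assumes a process group with the dummy backend is already initialized, as in example.py below):

import torch
import torch.distributed as dist

# async_op=True returns the Work handle, here backed by WorkDummy.
x = torch.ones(6)
work = dist.all_reduce(x, async_op=True)
work.wait()                 # WorkDummy::wait, returns immediately
print(work.is_completed())  # WorkDummy::isCompleted -> True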
3. setup.py
# file name: setup.py
# Code adapted from the PyTorch tutorial:
# https://pytorch.ac.cn/tutorials/intermediate/process_group_cpp_extension_tutorial.html
# torch lives under ~/.local/lib/python3.11/site-packages/torch
import os
import sys
import torch
from setuptools import setup
from torch.utils import cpp_extension

# Adjust the paths to your environment
# sources = ["src/dummy.cpp"]
# include_dirs = [f"{os.path.dirname(os.path.abspath(__file__))}/include/"]
# Replace the path below with your local torch include directory
sources = ["dummy.cpp"]
include_dirs = ["/home/~~~/.local/lib/python3.11/site-packages/torch/include"]

if torch.cuda.is_available():
    module = cpp_extension.CUDAExtension(
        name="dummy_collectives",
        sources=sources,
        include_dirs=include_dirs,
    )
else:
    # CUDA not available
    module = cpp_extension.CppExtension(
        name="dummy_collectives",
        sources=sources,
        include_dirs=include_dirs,
    )

setup(
    name="Dummy-Collectives",
    version="0.0.1",
    ext_modules=[module],
    cmdclass={"build_ext": cpp_extension.BuildExtension},
)
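Rather than hard-coding the torch include directory, it can be derived from the installed torch itself; a sketch of the substitution (cpp_extension.include_paths() returns the header directories of the current torch installation):

from torch.utils import cpp_extension

# Drop-in replacement for the hard-coded include_dirs above.
include_dirs = cpp_extension.include_paths()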
4. example.py
Note: this file has been modified substantially from the tutorial source.
# The initialization differs from the original tutorial code; note the changes.
# Run from a terminal with:
# torchrun --nnodes=1 --nproc-per-node=2 --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 example.py
import os

import torch
import dummy_collectives

import torch.distributed as dist

rank = int(os.getenv('RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
dist.init_process_group(backend="dummy", rank=rank, world_size=world_size)

# CPU allreduce
x = torch.ones(6)
dist.all_reduce(x)
print(f"cpu allreduce: {x}")

# GPU allreduce: the computation stays on CUDA; what gets customized is the
# communication between machines in the backend
if torch.cuda.is_available():
    y = x.cuda()
    dist.all_reduce(y)
    print(f"cuda allreduce: {y}")

    try:
        dist.broadcast(y, 0)
    except RuntimeError:
        print("got RuntimeError when calling broadcast")

torch.distributed.destroy_process_group()
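A side note on the initialization (an observation about torch.distributed defaults, not from the original tutorial): under torchrun, MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are already set in the environment, and init_process_group's default env:// rendezvous reads them, so the explicit rank/world_size arguments can usually be omitted:

import torch
import dummy_collectives  # importing the module registers the "dummy" backend
import torch.distributed as dist

# env:// rendezvous picks up RANK/WORLD_SIZE set by torchrun.
dist.init_process_group(backend="dummy")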
Build & install
Replace the path arguments in all commands below with your own HOME path.
$ python3 setup.py build
Success! A 'build' directory plus .so and .egg-info files are generated in the current directory.
$ python3 setup.py install
This fails with:
running install
/usr/lib/python3/dist-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/usr/lib/python3/dist-packages/setuptools/command/easy_install.py:146: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
error: can't create or remove files in install directory

The following error occurred while trying to add or remove files in the
installation directory:

    [Errno 13] Permission denied: '/usr/local/lib/python3.11/dist-packages/test-easy-install-68533.write-test'

The installation directory you specified (via --install-dir, --prefix, or
the distutils default setting) was:

    /usr/local/lib/python3.11/dist-packages/

Perhaps your account does not have write access to this directory?  If the
installation directory is a system-owned directory, you may need to sign in
as the administrator or "root" account.  If you do not have administrative
access to this machine, you may wish to choose a different installation
directory, preferably one that is listed in your PYTHONPATH environment
variable.

For information on other options, you may wish to consult the
documentation at:

  https://setuptools.pypa.io/en/latest/deprecated/easy_install.html

Please make the appropriate changes for your system and try again.
This looks like a permissions problem: the target directory '/usr/local/lib/python3.11/dist-packages/' is not writable by a normal user. The real install target should be under the user's HOME directory instead, i.e.:
/home/~usrname/.local/lib/python3.11/site-packages
Following the hint in the error output, specify the install directory on the command line:
$ python3 setup.py install --install-dir=/home/~usrname/.local/lib/python3.11/site-packages
This fails again:
usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
   or: setup.py --help [cmd1 cmd2 ...]
   or: setup.py --help-commands
   or: setup.py cmd --help

error: option --install-dir not recognized
It seems the command following setup.py is wrong — apparently --install-dir is not an option that the install command recognizes.
$ python3 setup.py --help
Output:
Common commands: (see '--help-commands' for more)

  setup.py build      will build the package underneath 'build/'
  setup.py install    will install the package

Global options:
  ...
Puzzling at first: according to the help output, only two commands, build and install, are highlighted.
Following the hint in reference (1), switch the command to 'develop' (an editable install, which does accept --install-dir):
$ python3 setup.py develop --install-dir=/home/~usrname/.local/lib/python3.11/site-packages
Success; output:
running develop
/usr/lib/python3/dist-packages/setuptools/command/easy_install.py:146: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/usr/lib/python3/dist-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running egg_info
creating Dummy_Collectives.egg-info
writing Dummy_Collectives.egg-info/PKG-INFO
writing dependency_links to Dummy_Collectives.egg-info/dependency_links.txt
writing top-level names to Dummy_Collectives.egg-info/top_level.txt
writing manifest file 'Dummy_Collectives.egg-info/SOURCES.txt'
/home/~usrname/.local/lib/python3.11/site-packages/torch/utils/cpp_extension.py:576: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
  warnings.warn(msg.format('we could not find ninja.'))
reading manifest file 'Dummy_Collectives.egg-info/SOURCES.txt'
writing manifest file 'Dummy_Collectives.egg-info/SOURCES.txt'
running build_ext
copying build/lib.linux-x86_64-cpython-311/dummy_collectives.cpython-311-x86_64-linux-gnu.so ->
Creating /home/~usrname/.local/lib/python3.11/site-packages/Dummy-Collectives.egg-link (link to .)
Dummy-Collectives 0.0.1 is already the active version in easy-install.pth

Installed /home/~usrname/src/pytorch/custom_backend
Processing dependencies for Dummy-Collectives==0.0.1
Finished processing dependencies for Dummy-Collectives==0.0.1
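As the deprecation warnings above indicate, setup.py-based installs are being phased out. An editable pip install into the user site should achieve the same result as develop (untested here, offered as an assumption based on pip's standard behavior):
$ pip3 install --user -e .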
Verification
$ pip3 show dummy_collectives
Name: Dummy-Collectives
Version: 0.0.1
Summary:
Home-page:
Author:
Author-email:
License:
Location: /home/~usrname/src/pytorch/custom_backend
Editable project location: /home/~usrname/src/pytorch/custom_backend
Requires:
Required-by:
The command succeeds, confirming the package is installed. Note that installation alone does not register the 'dummy' backend; registration happens when the module is imported, via the __attribute__((constructor)) hook in dummy.hpp.
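Once the module imports cleanly (see the troubleshooting just below), the registration itself can be checked from Python. A sketch; backend_list is assumed here to be the attribute that register_backend appends registered names to in this PyTorch version:

import torch
import dummy_collectives  # triggers BackendDummyConstructor -> register_backend
import torch.distributed as dist

# Assumption: Backend.backend_list holds the names of registered backends.
print("dummy" in dist.Backend.backend_list)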
$ python3 -c "import dummy_collectives"
This fails with:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: libc10.so: cannot open shared object file: No such file or directory
Locate 'libc10.so' with find:
$ find ~ -name libc10.so
Based on the search result, add the directory containing the file (LD_LIBRARY_PATH takes directories, not file paths) to the 'LD_LIBRARY_PATH' environment variable.
On my machine the file was found at:
/home/~usrname/.local/lib/python3.11/site-packages/torch/lib/libc10.so
$ export LD_LIBRARY_PATH=…
After updating the environment variable, run again:
$ python3 -c "import dummy_collectives"
It fails again, with:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: /home/~usrname/.local/lib/python3.11/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkGetErrorLog_12_6, version libnvJitLink.so.12
This time 'libnvJitLink.so.12' cannot be resolved.
Locate it with find again; on my machine the path was:
/home/~usrname/.local/lib/python3.11/site-packages/nvidia/nvjitlink/lib/libnvJitLink.so.12
Add this directory to 'LD_LIBRARY_PATH' as well.
Important note (verified later): running 'import torch' before 'import dummy_collectives' makes the 'LD_LIBRARY_PATH' configuration unnecessary, presumably because importing torch automatically loads the relevant .so files. The correct command is:
$ python3 -c "import torch; import dummy_collectives"
Run the test again:
$ python3 -c "import dummy_collectives"
No error this time, just one warning:
/home/~usrname/.local/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py:350: UserWarning: Device capability of dummy unspecified, assuming `cpu` and `cuda`. Please specify it via the `devices` argument of `register_backend`.
  warnings.warn(
I tried to silence this warning by adding a parameter like devices="cpu" to the register_backend() call in the '.hpp' file, but the extension no longer compiled; this is probably a torch version issue. Shelved for now (see the Python-side sketch below); continuing with verification.
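A possible workaround, sketched but not verified in this setup: call register_backend from the Python side with the devices argument, reusing the construction function the extension exports. The name "dummy_cpu" is hypothetical, chosen to avoid clashing with the already-registered "dummy":

import torch
import dummy_collectives
import torch.distributed as dist

# Hypothetical second registration under a different name, declaring the
# device capability explicitly so the UserWarning is not emitted.
dist.Backend.register_backend(
    "dummy_cpu",
    dummy_collectives.createBackendDummy,
    devices=["cpu"],
)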
$ torchrun --nnodes=1 --nproc-per-node=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 example.py
Success! Output below. Note that the all-reduced tensors are all zeros: the dummy backend's allreduce simply zeroes its inputs rather than performing a real reduction, and broadcast raises a RuntimeError because it has no implementation.
/home/~usrname/.local/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py:350: UserWarning: Device capability of dummy unspecified, assuming `cpu` and `cuda`. Please specify it via the `devices` argument of `register_backend`.
  warnings.warn(
cpu allreduce: tensor([0., 0., 0., 0., 0., 0.])
cuda allreduce: tensor([0., 0., 0., 0., 0., 0.], device='cuda:0')
got RuntimeError when calling broadcast
Tip: the '--nproc-per-node=x' argument in the command above launches multiple processes for a distributed run.
Key finding:
If import torch is executed before import dummy_collectives, the 'LD_LIBRARY_PATH' setup described earlier is not needed.
Next steps: customize the all_reduce operation of a PyTorch backend on top of UDP and the RoCE-v2 protocol.