
Pitfalls of deploying gpt-oss-20b with vLLM

Deployment GPU: 4090 48G (other GPU models may run into different problems later when installing and using flash-attn)
Model download: https://huggingface.co/openai/gpt-oss-20b/tree/main
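The weights are served from a local path later in this post. One way to fetch them there (assuming the huggingface_hub CLI is installed; depending on your network a mirror or proxy may be needed):

pip install -U "huggingface_hub[cli]"
huggingface-cli download openai/gpt-oss-20b --local-dir /data1/models/gpt-oss-20b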

The official environment setup and launch commands (they look simple; in practice every step is a trap):

uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match
vllm serve openai/gpt-oss-20b

The Python version must be 3.12 (gpt-oss pins '<3.13,>=3.12'); otherwise you get this error:

Collecting diskcache==5.6.3 (from vllm==0.10.1+gptoss)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/3f/27/4570e78fc0bf5ea0ca45eb1de3818a23787af9b390c0b0a0033a1b8236f9/diskcache-5.6.3-py3-none-any.whl (45 kB)
Collecting depyf==0.19.0 (from vllm==0.10.1+gptoss)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/28/4d/1192acbcdc5e843f5e5d51f6e8788f2b60a9fe0b578ac385ded67a0b0b26/depyf-0.19.0-py3-none-any.whl (39 kB)
Collecting compressed-tensors==0.10.2 (from vllm==0.10.1+gptoss)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/43/ac/56bb4b6b3150783119479e2f05e32ebfc39ca6ff8e6fcd45eb178743b39e/compressed_tensors-0.10.2-py3-none-any.whl (169 kB)
[notice] A new release of pip is available: 25.1.1 -> 25.2
[notice] To update, run: pip install --upgrade pip
ERROR: Package 'gpt-oss' requires a different Python: 3.10.12 not in '<3.13,>=3.12'
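The easiest way I know to get a matching interpreter is to let uv manage it. A minimal sketch, assuming a reasonably recent uv (the venv name mirrors the venv-py12 path that shows up in the tracebacks below):

uv python install 3.12
uv venv venv-py12 --python 3.12
source venv-py12/bin/activate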

Installing directly with the official command above, the downloads break off very easily (and no, it is not a disk-space problem):

Resolved 169 packages in 15.45s
⠦ Preparing packages... (2/15)
nvidia-curand-cu12     ------------------------------ 5.87 MiB/60.67 MiB
nvidia-cuda-nvrtc-cu12 ------------------------------ 4.50 MiB/83.96 MiB
nvidia-nvshmem-cu12    ------------------------------ 4.25 MiB/118.83 MiB
pytorch-triton         ------------------------------ 4.50 MiB/147.49 MiB
nvidia-cufft-cu12      ------------------------------ 4.25 MiB/184.17 MiB
nvidia-cusolver-cu12   ------------------------------ 3.77 MiB/255.11 MiB
nvidia-cusparselt-cu12 ------------------------------ 4.00 MiB/273.89 MiB
nvidia-cusparse-cu12   ------------------------------ 4.25 MiB/274.86 MiB
nvidia-nccl-cu12       ------------------------------ 4.00 MiB/307.42 MiB
nvidia-cublas-cu12     ------------------------------ 4.25 MiB/566.81 MiB
nvidia-cudnn-cu12      ------------------------------ 4.00 MiB/674.02 MiB
vllm                   ------------------------------ 29.86 MiB/774.76 MiB
torch                  ------------------------------ 4.25 MiB/852.84 MiB
× Failed to download `torchvision==0.24.0.dev20250804+cu128`
├─▶ Failed to extract archive: torchvision-....whl
╰─▶ I/O operation failed during extraction

Split the packages up and download them one at a time, so uv is not pulling several files of hundreds of MB, or over a GB, at once and getting cut off. Something like:

uv pip install --pre torch torchaudio torchvision \
    --index-url https://pypi.tuna.tsinghua.edu.cn/simple \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128
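If single downloads still stall, uv's HTTP timeout can be raised via the UV_HTTP_TIMEOUT environment variable (in seconds; 600 below is an arbitrary value I'd try, not something from the official instructions):

UV_HTTP_TIMEOUT=600 uv pip install --pre torch \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128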

If that still fails, download the file manually with wget -c (resumable) and install from the local wheel:

wget -c https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.4.1-py3-none-manylinux_2_27_x86_64.whl
uv pip install nvidia_cublas_cu12-12.8.4.1-py3-none-manylinux_2_27_x86_64.whl
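The same pattern scales to however many wheels keep failing; a small sketch (the list holds just the cuBLAS wheel from above, append the others as needed):

# wget -c resumes partial files, so re-running continues instead of restarting
for url in \
    https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.4.1-py3-none-manylinux_2_27_x86_64.whl; do
    wget -c "$url"
done
# install every wheel downloaded into the current directory
uv pip install ./*.whl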

With the environment in place, start the server:

vllm serve /data1/models/gpt-oss-20b

It "successfully" errored out:

(EngineCore_0 pid=1976920) W0822 11:48:23.183000 1976920 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(EngineCore_0 pid=1976920) W0822 11:48:23.183000 1976920 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[W822 11:48:25.998767914 ProcessGroupNCCL.cpp:915] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_0 pid=1976920) INFO 08-22 11:48:25 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_0 pid=1976920) INFO 08-22 11:48:25 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(EngineCore_0 pid=1976920) INFO 08-22 11:48:25 [gpu_model_runner.py:1913] Starting to load model /data1/models/gpt-oss-20b...
(EngineCore_0 pid=1976920) INFO 08-22 11:48:26 [gpu_model_runner.py:1945] Loading model from scratch...
(EngineCore_0 pid=1976920) INFO 08-22 11:48:26 [cuda.py:323] Using Flash Attention backend on V1 engine.
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] EngineCore failed to start.
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] Traceback (most recent call last):
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 510, in __init__
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]     self.model_executor = executor_class(vllm_config)
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]     self._init_executor()
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 49, in _init_executor
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]     self.collective_rpc("load_model")
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]     answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2948, in run_method
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]     return func(*args, **kwargs)
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 211, in load_model
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1946, in load_model
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]     self.model = model_loader.load_model(
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 44, in load_model
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]     model = initialize_model(vllm_config=vllm_config,
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 63, in initialize_model
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]     return model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/model_executor/models/gpt_oss.py", line 241, in __init__
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]     self.model = GptOssModel(
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]                  ^^^^^^^^^^^^
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 183, in __init__
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/model_executor/models/gpt_oss.py", line 214, in __init__
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]     TransformerBlock(
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/model_executor/models/gpt_oss.py", line 183, in __init__
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]     self.attn = OAIAttention(config, prefix=f"{prefix}.attn")
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/model_executor/models/gpt_oss.py", line 110, in __init__
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]     self.attn = Attention(
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]                 ^^^^^^^^^^
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/attention/layer.py", line 176, in __init__
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]     self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/attention/backends/flash_attn.py", line 417, in __init__
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]     assert self.vllm_flash_attn_version == 3, (
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] AssertionError: Sinks are only supported in FlashAttention 3
(EngineCore_0 pid=1976920) Process EngineCore_0:
(EngineCore_0 pid=1976920) Traceback (most recent call last):
(EngineCore_0 pid=1976920)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_0 pid=1976920)     self.run()
(EngineCore_0 pid=1976920)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_0 pid=1976920)     self._target(*self._args, **self._kwargs)
(EngineCore_0 pid=1976920)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 722, in run_engine_core
(EngineCore_0 pid=1976920)     raise e
(EngineCore_0 pid=1976920)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_0 pid=1976920)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=1976920)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 510, in __init__
(EngineCore_0 pid=1976920)     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_0 pid=1976920)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_0 pid=1976920)     self.model_executor = executor_class(vllm_config)
(EngineCore_0 pid=1976920)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_0 pid=1976920)     self._init_executor()
(EngineCore_0 pid=1976920)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 49, in _init_executor
(EngineCore_0 pid=1976920)     self.collective_rpc("load_model")
(EngineCore_0 pid=1976920)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_0 pid=1976920)     answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=1976920)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2948, in run_method
(EngineCore_0 pid=1976920)     return func(*args, **kwargs)
(EngineCore_0 pid=1976920)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 211, in load_model
(EngineCore_0 pid=1976920)     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_0 pid=1976920)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1946, in load_model
(EngineCore_0 pid=1976920)     self.model = model_loader.load_model(
(EngineCore_0 pid=1976920)                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 44, in load_model
(EngineCore_0 pid=1976920)     model = initialize_model(vllm_config=vllm_config,
(EngineCore_0 pid=1976920)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 63, in initialize_model
(EngineCore_0 pid=1976920)     return model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore_0 pid=1976920)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/model_executor/models/gpt_oss.py", line 241, in __init__
(EngineCore_0 pid=1976920)     self.model = GptOssModel(
(EngineCore_0 pid=1976920)                  ^^^^^^^^^^^^
(EngineCore_0 pid=1976920)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 183, in __init__
(EngineCore_0 pid=1976920)     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
(EngineCore_0 pid=1976920)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/model_executor/models/gpt_oss.py", line 214, in __init__
(EngineCore_0 pid=1976920)     TransformerBlock(
(EngineCore_0 pid=1976920)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/model_executor/models/gpt_oss.py", line 183, in __init__
(EngineCore_0 pid=1976920)     self.attn = OAIAttention(config, prefix=f"{prefix}.attn")
(EngineCore_0 pid=1976920)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/model_executor/models/gpt_oss.py", line 110, in __init__
(EngineCore_0 pid=1976920)     self.attn = Attention(
(EngineCore_0 pid=1976920)                 ^^^^^^^^^^
(EngineCore_0 pid=1976920)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/attention/layer.py", line 176, in __init__
(EngineCore_0 pid=1976920)     self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
(EngineCore_0 pid=1976920)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/attention/backends/flash_attn.py", line 417, in __init__
(EngineCore_0 pid=1976920)     assert self.vllm_flash_attn_version == 3, (
(EngineCore_0 pid=1976920)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920) AssertionError: Sinks are only supported in FlashAttention 3
[rank0]:[W822 11:48:27.406723611 ProcessGroupNCCL.cpp:1522] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=1976008) Traceback (most recent call last):
(APIServer pid=1976008)   File "/data1/tlw/vllm/venv-py12/bin/vllm", line 10, in <module>
(APIServer pid=1976008)     sys.exit(main())
(APIServer pid=1976008)              ^^^^^^
(APIServer pid=1976008)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=1976008)     args.dispatch_function(args)
(APIServer pid=1976008)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 50, in cmd
(APIServer pid=1976008)     uvloop.run(run_server(args))
(APIServer pid=1976008)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=1976008)     return __asyncio.run(
(APIServer pid=1976008)            ^^^^^^^^^^^^^^
(APIServer pid=1976008)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1976008)     return runner.run(main)
(APIServer pid=1976008)            ^^^^^^^^^^^^^^^^
(APIServer pid=1976008)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1976008)     return self._loop.run_until_complete(task)
(APIServer pid=1976008)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1976008)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1976008)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=1976008)     return await main
(APIServer pid=1976008)            ^^^^^^^^^^
(APIServer pid=1976008)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1827, in run_server
(APIServer pid=1976008)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1976008)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1847, in run_server_worker
(APIServer pid=1976008)     async with build_async_engine_client(
(APIServer pid=1976008)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1976008)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1976008)     return await anext(self.gen)
(APIServer pid=1976008)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1976008)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 167, in build_async_engine_client
(APIServer pid=1976008)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1976008)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1976008)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1976008)     return await anext(self.gen)
(APIServer pid=1976008)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1976008)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 209, in build_async_engine_client_from_engine_args
(APIServer pid=1976008)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1976008)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1976008)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/utils/__init__.py", line 1520, in inner
(APIServer pid=1976008)     return fn(*args, **kwargs)
(APIServer pid=1976008)            ^^^^^^^^^^^^^^^^^^^
(APIServer pid=1976008)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 173, in from_vllm_config
(APIServer pid=1976008)     return cls(
(APIServer pid=1976008)            ^^^^
(APIServer pid=1976008)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 119, in __init__
(APIServer pid=1976008)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1976008)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1976008)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 101, in make_async_mp_client
(APIServer pid=1976008)     return AsyncMPClient(*client_args)
(APIServer pid=1976008)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1976008)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 733, in __init__
(APIServer pid=1976008)     super().__init__(
(APIServer pid=1976008)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 421, in __init__
(APIServer pid=1976008)     with launch_core_engines(vllm_config, executor_class,
(APIServer pid=1976008)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1976008)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1976008)     next(self.gen)
(APIServer pid=1976008)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 697, in launch_core_engines
(APIServer pid=1976008)     wait_for_engine_startup(
(APIServer pid=1976008)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 750, in wait_for_engine_startup
(APIServer pid=1976008)     raise RuntimeError("Engine core initialization failed. "
(APIServer pid=1976008) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Various attempts, such as upgrading flash-attn, went nowhere:

pip install --upgrade flash-attn

In the end I manually downloaded a wheel from the official release page, https://github.com/Dao-AILab/flash-attention/releases/tag/v2.8.3, and installed it:

wget -c "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl"pip install flash_attn-2.8.3+cu12torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl

Entirely unsurprisingly, the same error came back (a standalone flash-attn 2.x wheel cannot satisfy a check that demands FlashAttention 3):

(EngineCore_0 pid=2395666) INFO 08-22 15:04:46 [gpu_model_runner.py:1913] Starting to load model /data1/models/gpt-oss-20b...
(EngineCore_0 pid=2395666) INFO 08-22 15:04:47 [cuda.py:323] Using Flash Attention backend on V1 engine.
(EngineCore_0 pid=2395666) ERROR 08-22 15:04:47 [core.py:718] EngineCore failed to start.
(EngineCore_0 pid=2395666) ERROR 08-22 15:04:47 [core.py:718] Traceback (most recent call last):
    ... (identical stack trace to the one above) ...
(EngineCore_0 pid=2395666) ERROR 08-22 15:04:47 [core.py:718] AssertionError: Sinks are only supported in FlashAttention 3
(APIServer pid=2394725) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

After digging through related issues, it appeared that prefixing the command with VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 solves the problem. That squares with the root cause: gpt-oss's attention sinks require FlashAttention 3, whose kernels target Hopper-class GPUs, so on an Ada-generation 4090 falling back to the Triton attention backend is the practical way out.

The final launch command:

CUDA_VISIBLE_DEVICES=1 \
VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 \
VLLM_USE_V1=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
vllm serve /data1/models/gpt-oss-20b \
    --served-model-name gpt-oss-20b \
    --gpu-memory-utilization 0.5 \
    --tensor-parallel-size 1 \
    --host 0.0.0.0 \
    --port 8000 \
    --uvicorn-log-level debug
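Once the server does come up, the OpenAI-compatible endpoint can be smoke-tested with a request like the one below (my own check, not from the original steps; the model field must match --served-model-name):

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "gpt-oss-20b", "messages": [{"role": "user", "content": "Say hi"}]}'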

Before that, though, yet another problem appeared:

(venv-py12) root@test:/data1/tlw/vllm# CUDA_VISIBLE_DEVICES=1 VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1  VLLM_USE_V1=1 VLLM_WORKER_MULTIPROC_METHOD=spawn vllm serve /data1/models/gpt-oss-20b --served-model-name gpt-oss-20b  --gpu-memory-utilization 0.9  --tensor-parallel-size 1  --host 0.0.0.0  --port 8000  --uvicorn-log-level debug
INFO 08-22 15:28:09 [__init__.py:241] Automatically detected platform cuda.
(APIServer pid=2465374) INFO 08-22 15:28:12 [api_server.py:1787] vLLM API server version 0.10.2.dev2+gf5635d62e.d20250807
(APIServer pid=2465374) INFO 08-22 15:28:12 [utils.py:326] non-default args: {'model_tag': '/data1/models/gpt-oss-20b', 'host': '0.0.0.0', 'uvicorn_log_level': 'debug', 'model': '/data1/models/gpt-oss-20b', 'served_model_name': ['gpt-oss-20b']}
(APIServer pid=2465374) INFO 08-22 15:28:19 [config.py:726] Resolved architecture: GptOssForCausalLM
(APIServer pid=2465374) ERROR 08-22 15:28:19 [config.py:123] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/data1/models/gpt-oss-20b'. Use repo_type argument if needed., retrying 1 of 2
(APIServer pid=2465374) ERROR 08-22 15:28:21 [config.py:121] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/data1/models/gpt-oss-20b'. Use repo_type argument if needed.
(APIServer pid=2465374) INFO 08-22 15:28:21 [config.py:3628] Downcasting torch.float32 to torch.bfloat16.
(APIServer pid=2465374) INFO 08-22 15:28:21 [config.py:1759] Using max model len 131072
(APIServer pid=2465374) WARNING 08-22 15:28:21 [config.py:1198] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
(APIServer pid=2465374) INFO 08-22 15:28:21 [config.py:2588] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=2465374) INFO 08-22 15:28:21 [config.py:244] Overriding cuda graph sizes to [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512, 528, 544, 560, 576, 592, 608, 624, 640, 656, 672, 688, 704, 720, 736, 752, 768, 784, 800, 816, 832, 848, 864, 880, 896, 912, 928, 944, 960, 976, 992, 1008, 1024]
INFO 08-22 15:28:26 [__init__.py:241] Automatically detected platform cuda.
(EngineCore_0 pid=2466326) INFO 08-22 15:28:29 [core.py:654] Waiting for init message from front-end.
(EngineCore_0 pid=2466326) INFO 08-22 15:28:29 [core.py:73] Initializing a V1 LLM engine (v0.10.2.dev2+gf5635d62e.d20250807) with config: model='/data1/models/gpt-oss-20b', speculative_config=None, tokenizer='/data1/models/gpt-oss-20b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend='openai'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=gpt-oss-20b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[1024,1008,992,976,960,944,928,912,896,880,864,848,832,816,800,784,768,752,736,720,704,688,672,656,640,624,608,592,576,560,544,528,512,496,480,464,448,432,416,400,384,368,352,336,320,304,288,272,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":1024,"local_cache_dir":null}
(EngineCore_0 pid=2466326)
(EngineCore_0 pid=2466326)              LL          LL          MMM       MMM
(EngineCore_0 pid=2466326)              LL          LL          MMMM     MMMM
(EngineCore_0 pid=2466326)          V   LL          LL          MM MM   MM MM
(EngineCore_0 pid=2466326) vvvv  VVVV   LL          LL          MM  MM MM  MM
(EngineCore_0 pid=2466326) vvvv VVVV    LL          LL          MM   MMM   MM
(EngineCore_0 pid=2466326)  vvv VVVV    LL          LL          MM    M    MM
(EngineCore_0 pid=2466326)   vvVVVV     LL          LL          MM         MM
(EngineCore_0 pid=2466326)     VVVV     LLLLLLLLLL  LLLLLLLLL   M           M
(EngineCore_0 pid=2466326)
(EngineCore_0 pid=2466326) W0822 15:28:29.528000 2466326 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(EngineCore_0 pid=2466326) W0822 15:28:29.528000 2466326 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[W822 15:28:30.519722738 ProcessGroupNCCL.cpp:915] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_0 pid=2466326) INFO 08-22 15:28:30 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_0 pid=2466326) INFO 08-22 15:28:30 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(EngineCore_0 pid=2466326) INFO 08-22 15:28:30 [gpu_model_runner.py:1913] Starting to load model /data1/models/gpt-oss-20b...
(EngineCore_0 pid=2466326) INFO 08-22 15:28:30 [gpu_model_runner.py:1945] Loading model from scratch...
(EngineCore_0 pid=2466326) INFO 08-22 15:28:30 [cuda.py:286] Using Triton backend on V1 engine.
(EngineCore_0 pid=2466326) WARNING 08-22 15:28:30 [rocm.py:29] Failed to import from amdsmi with ModuleNotFoundError("No module named 'amdsmi'")
(EngineCore_0 pid=2466326) WARNING 08-22 15:28:30 [rocm.py:40] Failed to import from vllm._rocm_C with ModuleNotFoundError("No module named 'vllm._rocm_C'")
(EngineCore_0 pid=2466326) INFO 08-22 15:28:30 [triton_attn.py:263] Using vllm unified attention for TritonAttentionImpl
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:00<00:01,  1.37it/s]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:01<00:00,  1.24it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.23it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.25it/s]
(EngineCore_0 pid=2466326)
(EngineCore_0 pid=2466326) INFO 08-22 15:28:33 [default_loader.py:262] Loading weights took 2.52 seconds
(EngineCore_0 pid=2466326) INFO 08-22 15:28:33 [gpu_model_runner.py:1962] Model loading took 12.8848 GiB and 2.814522 seconds
(EngineCore_0 pid=2466326) INFO 08-22 15:28:38 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/68cd9b7b1a/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_0 pid=2466326) INFO 08-22 15:28:38 [backends.py:541] Dynamo bytecode transform time: 4.39 s
/tmp/tmpu15cffx7/cuda_utils.c:5:10: fatal error: Python.h: No such file or directory
5 | #include <Python.h>
|          ^~~~~~~~~~
compilation terminated.
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] EngineCore failed to start.
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] Traceback (most recent call last):
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 510, in __init__
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 91, in __init__
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     self._initialize_kv_caches(vllm_config)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 181, in _initialize_kv_caches
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     self.model_executor.determine_available_memory())
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     output = self.collective_rpc("determine_available_memory")
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2948, in run_method
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     return func(*args, **kwargs)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     return func(*args, **kwargs)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 243, in determine_available_memory
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     self.model_runner.profile_run()
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2498, in profile_run
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     = self._dummy_run(self.max_num_tokens, is_profile=True)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     return func(*args, **kwargs)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2250, in _dummy_run
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     outputs = model(
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]               ^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     return self._call_impl(*args, **kwargs)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     return forward_call(*args, **kwargs)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/model_executor/models/gpt_oss.py", line 258, in forward
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     return self.model(input_ids, positions)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 272, in __call__
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     output = self.compiled_callable(*args, **kwargs)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 817, in compile_wrapper
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 979, in _compile_fx_inner
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     raise InductorError(e, currentframe()).with_traceback(
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 963, in _compile_fx_inner
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     mb_compiled_graph = fx_codegen_and_compile(
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]                         ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1646, in fx_codegen_and_compile
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1506, in codegen_and_compile
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     compiled_module = graph.compile_to_module()
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]                       ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/graph.py", line 2318, in compile_to_module
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     return self._compile_to_module()
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/graph.py", line 2324, in _compile_to_module
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]                                                              ^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/graph.py", line 2263, in codegen
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     self.scheduler.codegen()
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/scheduler.py", line 4759, in codegen
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     else self._codegen(self.nodes)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]          ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/scheduler.py", line 4915, in _codegen
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     self.get_backend(device).codegen_node(node)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/codegen/cuda_combined_scheduling.py", line 107, in codegen_node
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     return self._triton_scheduling.codegen_node(node)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/codegen/simd.py", line 1401, in codegen_node
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     return self.codegen_node_schedule(
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/codegen/simd.py", line 1454, in codegen_node_schedule
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     src_code = kernel.codegen_kernel()
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]                ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/codegen/triton.py", line 3982, in codegen_kernel
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     **self.inductor_meta_common(),
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/codegen/triton.py", line 3806, in inductor_meta_common
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     "backend_hash": torch.utils._triton.triton_hash_with_backend(),
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/utils/_triton.py", line 164, in triton_hash_with_backend
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     backend = triton_backend()
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]               ^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/utils/_triton.py", line 156, in triton_backend
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     target = driver.active.get_current_target()
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/runtime/driver.py", line 30, in __getattr__
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     return getattr(self._initialize_obj(), name)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]                    ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/runtime/driver.py", line 26, in _initialize_obj
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     self._obj = self._init_fn()
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]                 ^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/runtime/driver.py", line 12, in _create_driver
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     return active_drivers[0]()
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 715, in __init__
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     self.utils = CudaUtils()  # TODO: make static
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]                  ^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 62, in __init__
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     mod = compile_module_from_src(
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]           ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/runtime/build.py", line 88, in compile_module_from_src
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     so = _build(name, src_path, tmpdir, library_dirs or [], include_dirs or [], libraries or [])
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/runtime/build.py", line 51, in _build
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     subprocess.check_call(cc_cmd, stdout=subprocess.DEVNULL)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]   File "/usr/lib/python3.12/subprocess.py", line 413, in check_call
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]     raise CalledProcessError(retcode, cmd)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] torch._inductor.exc.InductorError: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpu15cffx7/cuda_utils.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpu15cffx7/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/backends/nvidia/lib', '-L/lib/x86_64-linux-gnu', '-I/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpu15cffx7', '-I/usr/include/python3.12']' returned non-zero exit status 1.
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]
(EngineCore_0 pid=2466326) Process EngineCore_0:
(EngineCore_0 pid=2466326) Traceback (most recent call last):
(EngineCore_0 pid=2466326)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_0 pid=2466326)     self.run()
(EngineCore_0 pid=2466326)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_0 pid=2466326)     self._target(*self._args, **self._kwargs)
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 722, in run_engine_core
(EngineCore_0 pid=2466326)     raise e
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_0 pid=2466326)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=2466326)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 510, in __init__
(EngineCore_0 pid=2466326)     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 91, in __init__
(EngineCore_0 pid=2466326)     self._initialize_kv_caches(vllm_config)
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 181, in _initialize_kv_caches
(EngineCore_0 pid=2466326)     self.model_executor.determine_available_memory())
(EngineCore_0 pid=2466326)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
(EngineCore_0 pid=2466326)     output = self.collective_rpc("determine_available_memory")
(EngineCore_0 pid=2466326)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_0 pid=2466326)     answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=2466326)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2948, in run_method
(EngineCore_0 pid=2466326)     return func(*args, **kwargs)
(EngineCore_0 pid=2466326)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_0 pid=2466326)     return func(*args, **kwargs)
(EngineCore_0 pid=2466326)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 243, in determine_available_memory
(EngineCore_0 pid=2466326)     self.model_runner.profile_run()
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2498, in profile_run
(EngineCore_0 pid=2466326)     = self._dummy_run(self.max_num_tokens, is_profile=True)
(EngineCore_0 pid=2466326)       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_0 pid=2466326)     return func(*args, **kwargs)
(EngineCore_0 pid=2466326)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2250, in _dummy_run
(EngineCore_0 pid=2466326)     outputs = model(
(EngineCore_0 pid=2466326)               ^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_0 pid=2466326)     return self._call_impl(*args, **kwargs)
(EngineCore_0 pid=2466326)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_0 pid=2466326)     return forward_call(*args, **kwargs)
(EngineCore_0 pid=2466326)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/model_executor/models/gpt_oss.py", line 258, in forward
(EngineCore_0 pid=2466326)     return self.model(input_ids, positions)
(EngineCore_0 pid=2466326)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 272, in __call__
(EngineCore_0 pid=2466326)     output = self.compiled_callable(*args, **kwargs)
(EngineCore_0 pid=2466326)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 817, in compile_wrapper
(EngineCore_0 pid=2466326)     raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
(EngineCore_0 pid=2466326)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 979, in _compile_fx_inner
(EngineCore_0 pid=2466326)     raise InductorError(e, currentframe()).with_traceback(
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 963, in _compile_fx_inner
(EngineCore_0 pid=2466326)     mb_compiled_graph = fx_codegen_and_compile(
(EngineCore_0 pid=2466326)                         ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1646, in fx_codegen_and_compile
(EngineCore_0 pid=2466326)     return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
(EngineCore_0 pid=2466326)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1506, in codegen_and_compile
(EngineCore_0 pid=2466326)     compiled_module = graph.compile_to_module()
(EngineCore_0 pid=2466326)                       ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/graph.py", line 2318, in compile_to_module
(EngineCore_0 pid=2466326)     return self._compile_to_module()
(EngineCore_0 pid=2466326)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/graph.py", line 2324, in _compile_to_module
(EngineCore_0 pid=2466326)     self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
(EngineCore_0 pid=2466326)                                                              ^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/graph.py", line 2263, in codegen
(EngineCore_0 pid=2466326)     self.scheduler.codegen()
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/scheduler.py", line 4759, in codegen
(EngineCore_0 pid=2466326)     else self._codegen(self.nodes)
(EngineCore_0 pid=2466326)          ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/scheduler.py", line 4915, in _codegen
(EngineCore_0 pid=2466326)     self.get_backend(device).codegen_node(node)
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/codegen/cuda_combined_scheduling.py", line 107, in codegen_node
(EngineCore_0 pid=2466326)     return self._triton_scheduling.codegen_node(node)
(EngineCore_0 pid=2466326)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/codegen/simd.py", line 1401, in codegen_node
(EngineCore_0 pid=2466326)     return self.codegen_node_schedule(
(EngineCore_0 pid=2466326)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/codegen/simd.py", line 1454, in codegen_node_schedule
(EngineCore_0 pid=2466326)     src_code = kernel.codegen_kernel()
(EngineCore_0 pid=2466326)                ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/codegen/triton.py", line 3982, in codegen_kernel
(EngineCore_0 pid=2466326)     **self.inductor_meta_common(),
(EngineCore_0 pid=2466326)       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/codegen/triton.py", line 3806, in inductor_meta_common
(EngineCore_0 pid=2466326)     "backend_hash": torch.utils._triton.triton_hash_with_backend(),
(EngineCore_0 pid=2466326)                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/utils/_triton.py", line 164, in triton_hash_with_backend
(EngineCore_0 pid=2466326)     backend = triton_backend()
(EngineCore_0 pid=2466326)               ^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/utils/_triton.py", line 156, in triton_backend
(EngineCore_0 pid=2466326)     target = driver.active.get_current_target()
(EngineCore_0 pid=2466326)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/runtime/driver.py", line 30, in __getattr__
(EngineCore_0 pid=2466326)     return getattr(self._initialize_obj(), name)
(EngineCore_0 pid=2466326)                    ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/runtime/driver.py", line 26, in _initialize_obj
(EngineCore_0 pid=2466326)     self._obj = self._init_fn()
(EngineCore_0 pid=2466326)                 ^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/runtime/driver.py", line 12, in _create_driver
(EngineCore_0 pid=2466326)     return active_drivers[0]()
(EngineCore_0 pid=2466326)            ^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 715, in __init__
(EngineCore_0 pid=2466326)     self.utils = CudaUtils()  # TODO: make static
(EngineCore_0 pid=2466326)                  ^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 62, in __init__
(EngineCore_0 pid=2466326)     mod = compile_module_from_src(
(EngineCore_0 pid=2466326)           ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/runtime/build.py", line 88, in compile_module_from_src
(EngineCore_0 pid=2466326)     so = _build(name, src_path, tmpdir, library_dirs or [], include_dirs or [], libraries or [])
(EngineCore_0 pid=2466326)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/runtime/build.py", line 51, in _build
(EngineCore_0 pid=2466326)     subprocess.check_call(cc_cmd, stdout=subprocess.DEVNULL)
(EngineCore_0 pid=2466326)   File "/usr/lib/python3.12/subprocess.py", line 413, in check_call
(EngineCore_0 pid=2466326)     raise CalledProcessError(retcode, cmd)
(EngineCore_0 pid=2466326) torch._inductor.exc.InductorError: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpu15cffx7/cuda_utils.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpu15cffx7/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/backends/nvidia/lib', '-L/lib/x86_64-linux-gnu', '-I/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpu15cffx7', '-I/usr/include/python3.12']' returned non-zero exit status 1.
(EngineCore_0 pid=2466326)
(EngineCore_0 pid=2466326) Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
(EngineCore_0 pid=2466326)
[rank0]:[W822 15:28:40.720887190 ProcessGroupNCCL.cpp:1522] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=2465374) Traceback (most recent call last):
(APIServer pid=2465374)   File "/data1/tlw/vllm/venv-py12/bin/vllm", line 10, in <module>
(APIServer pid=2465374)     sys.exit(main())
(APIServer pid=2465374)              ^^^^^^
(APIServer pid=2465374)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=2465374)     args.dispatch_function(args)
(APIServer pid=2465374)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 50, in cmd
(APIServer pid=2465374)     uvloop.run(run_server(args))
(APIServer pid=2465374)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=2465374)     return __asyncio.run(
(APIServer pid=2465374)            ^^^^^^^^^^^^^^
(APIServer pid=2465374)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=2465374)     return runner.run(main)
(APIServer pid=2465374)            ^^^^^^^^^^^^^^^^
(APIServer pid=2465374)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=2465374)     return self._loop.run_until_complete(task)
(APIServer pid=2465374)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2465374)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=2465374)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=2465374)     return await main
(APIServer pid=2465374)            ^^^^^^^^^^
(APIServer pid=2465374)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1827, in run_server
(APIServer pid=2465374)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=2465374)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1847, in run_server_worker
(APIServer pid=2465374)     async with build_async_engine_client(
(APIServer pid=2465374)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2465374)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=2465374)     return await anext(self.gen)
(APIServer pid=2465374)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2465374)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 167, in build_async_engine_client
(APIServer pid=2465374)     async with build_async_engine_client_from_engine_args(
(APIServer pid=2465374)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2465374)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=2465374)     return await anext(self.gen)
(APIServer pid=2465374)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2465374)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 209, in build_async_engine_client_from_engine_args
(APIServer pid=2465374)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=2465374)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2465374)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/utils/__init__.py", line 1520, in inner
(APIServer pid=2465374)     return fn(*args, **kwargs)
(APIServer pid=2465374)            ^^^^^^^^^^^^^^^^^^^
(APIServer pid=2465374)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 173, in from_vllm_config
(APIServer pid=2465374)     return cls(
(APIServer pid=2465374)            ^^^^
(APIServer pid=2465374)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 119, in __init__
(APIServer pid=2465374)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=2465374)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2465374)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 101, in make_async_mp_client
(APIServer pid=2465374)     return AsyncMPClient(*client_args)
(APIServer pid=2465374)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2465374)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 733, in __init__
(APIServer pid=2465374)     super().__init__(
(APIServer pid=2465374)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 421, in __init__
(APIServer pid=2465374)     with launch_core_engines(vllm_config, executor_class,
(APIServer pid=2465374)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2465374)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=2465374)     next(self.gen)
(APIServer pid=2465374)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 697, in launch_core_engines
(APIServer pid=2465374)     wait_for_engine_startup(
(APIServer pid=2465374)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 750, in wait_for_engine_startup
(APIServer pid=2465374)     raise RuntimeError("Engine core initialization failed. "
(APIServer pid=2465374) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

The key error buried in the traceback:

/tmp/tmpu15cffx7/cuda_utils.c:5:10: fatal error: Python.h: No such file or directory
    5 | #include <Python.h>
      |          ^~~~~~~~~~
compilation terminated.

The fix: install the Python 3.12 development headers (they must match the interpreter the venv was built on, since Triton compiles cuda_utils.c against that interpreter's include directory):

sudo apt-get update
sudo apt-get install python3.12-dev
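
After installing, a quick sanity probe confirms that Python.h is now where Triton's gcc invocation will look for it. This is only a check of the interpreter's include path, nothing vLLM-specific:

import os
import sysconfig

# Triton builds its cuda_utils extension with -I<python include dir>,
# so Python.h must exist there after installing python3.12-dev.
include_dir = sysconfig.get_paths()["include"]
header = os.path.join(include_dir, "Python.h")
print(include_dir, "->", "found" if os.path.exists(header) else "MISSING")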

Run the serve command again, and the next problem shows up:

(EngineCore_0 pid=2504605) INFO 08-22 15:41:41 [triton_attn.py:263] Using vllm unified attention for TritonAttentionImpl
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:00<00:01,  1.36it/s]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:01<00:00,  1.26it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.32it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.32it/s]
(EngineCore_0 pid=2504605)
(EngineCore_0 pid=2504605) INFO 08-22 15:41:43 [default_loader.py:262] Loading weights took 2.35 seconds
(EngineCore_0 pid=2504605) INFO 08-22 15:41:44 [gpu_model_runner.py:1962] Model loading took 12.8848 GiB and 2.664834 seconds
(EngineCore_0 pid=2504605) INFO 08-22 15:41:49 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/68cd9b7b1a/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_0 pid=2504605) INFO 08-22 15:41:49 [backends.py:541] Dynamo bytecode transform time: 4.57 s
(EngineCore_0 pid=2504605) INFO 08-22 15:41:52 [backends.py:194] Cache the graph for dynamic shape for later use
(EngineCore_0 pid=2504605) INFO 08-22 15:42:22 [backends.py:215] Compiling a graph for dynamic shape takes 33.10 s
(EngineCore_0 pid=2504605) INFO 08-22 15:42:35 [monitor.py:34] torch.compile takes 37.67 s in total
(EngineCore_0 pid=2504605) [rank0]:W0822 15:42:35.972000 2504605 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(EngineCore_0 pid=2504605) [rank0]:W0822 15:42:35.972000 2504605 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
(EngineCore_0 pid=2504605) [rank0]:W0822 15:42:35.974000 2504605 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(EngineCore_0 pid=2504605) [rank0]:W0822 15:42:35.974000 2504605 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
(EngineCore_0 pid=2504605) INFO 08-22 15:43:31 [gpu_worker.py:275] Available KV cache memory: 28.66 GiB
(EngineCore_0 pid=2504605) INFO 08-22 15:43:31 [kv_cache_utils.py:993] GPU KV cache size: 626,080 tokens
(EngineCore_0 pid=2504605) INFO 08-22 15:43:31 [kv_cache_utils.py:997] Maximum concurrency for 131,072 tokens per request: 9.40x
Capturing CUDA graph shapes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 83/83 [01:13<00:00,  1.13it/s]
(EngineCore_0 pid=2504605) INFO 08-22 15:44:45 [gpu_model_runner.py:2567] Graph capturing finished in 74 secs, took 0.74 GiB
(EngineCore_0 pid=2504605) INFO 08-22 15:44:45 [core.py:216] init engine (profile, create kv cache, warmup model) took 181.35 seconds
(EngineCore_0 pid=2504605) reasoning_end_token_ids [200006, 173781, 200005, 17196, 200008]
(APIServer pid=2503694) INFO 08-22 15:44:47 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 78261
(APIServer pid=2503694) INFO 08-22 15:44:47 [api_server.py:1599] Supported_tasks: ['generate']
(APIServer pid=2503694) WARNING 08-22 15:44:47 [serving_responses.py:123] For gpt-oss, we ignore --enable-auto-tool-choice and always enable tool use.
[rank0]:[W822 15:45:54.457687074 ProcessGroupNCCL.cpp:1522] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=2503694) Traceback (most recent call last):
(APIServer pid=2503694)   File "/data1/tlw/vllm/venv-py12/bin/vllm", line 10, in <module>
(APIServer pid=2503694)     sys.exit(main())
(APIServer pid=2503694)              ^^^^^^
(APIServer pid=2503694)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=2503694)     args.dispatch_function(args)
(APIServer pid=2503694)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 50, in cmd
(APIServer pid=2503694)     uvloop.run(run_server(args))
(APIServer pid=2503694)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=2503694)     return __asyncio.run(
(APIServer pid=2503694)            ^^^^^^^^^^^^^^
(APIServer pid=2503694)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=2503694)     return runner.run(main)
(APIServer pid=2503694)            ^^^^^^^^^^^^^^^^
(APIServer pid=2503694)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=2503694)     return self._loop.run_until_complete(task)
(APIServer pid=2503694)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2503694)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=2503694)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=2503694)     return await main
(APIServer pid=2503694)            ^^^^^^^^^^
(APIServer pid=2503694)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1827, in run_server
(APIServer pid=2503694)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=2503694)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1855, in run_server_worker
(APIServer pid=2503694)     await init_app_state(engine_client, vllm_config, app.state, args)
(APIServer pid=2503694)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1657, in init_app_state
(APIServer pid=2503694)     state.openai_serving_responses = OpenAIServingResponses(
(APIServer pid=2503694)                                      ^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2503694)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_responses.py", line 130, in __init__
(APIServer pid=2503694)     get_stop_tokens_for_assistant_actions())
(APIServer pid=2503694)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2503694)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/harmony_utils.py", line 187, in get_stop_tokens_for_assistant_actions
(APIServer pid=2503694)     return get_encoding().stop_tokens_for_assistant_actions()
(APIServer pid=2503694)            ^^^^^^^^^^^^^^
(APIServer pid=2503694)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/harmony_utils.py", line 37, in get_encoding
(APIServer pid=2503694)     _harmony_encoding = load_harmony_encoding(
(APIServer pid=2503694)                         ^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2503694)   File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/openai_harmony/__init__.py", line 689, in load_harmony_encoding
(APIServer pid=2503694)     inner: _PyHarmonyEncoding = _load_harmony_encoding(name)
(APIServer pid=2503694)                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2503694) openai_harmony.HarmonyError: error downloading or loading vocab file: an underlying IO error occurred while reading from response https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken: error decoding response body

The core error:
When starting the OpenAI-compatible API server, vLLM needs to load a vocabulary (vocab) file named o200k_base.tiktoken, and the server failed to download it.

openai_harmony.HarmonyError: error downloading or loading vocab file: an underlying IO error occurred while reading from response https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken: error decoding response body

The fix: download the files manually and point the tiktoken cache at them (make sure TIKTOKEN_CACHE_DIR points at the directory the files actually end up in):

mkdir tiktoken_cache
wget https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
wget https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
export TIKTOKEN_CACHE_DIR="/data1/tiktoken_cache"
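
One caveat: tiktoken does not look cache entries up by filename. In the tiktoken releases I have seen, the cache key is the SHA-1 hash of the download URL, so the two files usually have to be copied into TIKTOKEN_CACHE_DIR under that hashed name. A minimal sketch, assuming both files sit in the current directory (verify the hashing scheme against the installed tiktoken version):

import hashlib
import os
import shutil

cache_dir = "/data1/tiktoken_cache"
os.makedirs(cache_dir, exist_ok=True)

urls = {
    "o200k_base.tiktoken": "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken",
    "cl100k_base.tiktoken": "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken",
}
for fname, url in urls.items():
    # tiktoken's read_file_cached() names cache entries sha1(url).hexdigest()
    key = hashlib.sha1(url.encode()).hexdigest()
    shutil.copy(fname, os.path.join(cache_dir, key))
    print(fname, "->", key)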

With the cache in place, re-running the earlier serve command finally brings the server up:


(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:29] Available routes are:
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /docs, Methods: HEAD, GET
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /redoc, Methods: HEAD, GET
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /health, Methods: GET
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /load, Methods: GET
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /ping, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /ping, Methods: GET
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /tokenize, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /detokenize, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /v1/models, Methods: GET
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /version, Methods: GET
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /v1/responses, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /v1/completions, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /v1/embeddings, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /pooling, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /classify, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /score, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /v1/score, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /v1/audio/translations, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /rerank, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /v1/rerank, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /v2/rerank, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /invocations, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /metrics, Methods: GET
(APIServer pid=2560106) INFO:     Started server process [2560106]
(APIServer pid=2560106) INFO:     Waiting for application startup.
(APIServer pid=2560106) INFO:     Application startup complete.
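
The server is now listening. As a first smoke test, a bare-bones probe of /health and /v1/models using only the standard library (replace x.x.x.x with the real host):

import json
import urllib.request

BASE = "http://x.x.x.x:8000"  # replace with the server's address

# /health returns HTTP 200 with an empty body once the engine is ready
with urllib.request.urlopen(f"{BASE}/health") as r:
    print("/health:", r.status)

# /v1/models lists the served model id to pass in chat/completions requests
with urllib.request.urlopen(f"{BASE}/v1/models") as r:
    print(json.dumps(json.load(r), indent=2, ensure_ascii=False))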

A simple test script:

import openai

# 1. Initialize the OpenAI client.
#    - base_url points at the local vLLM server's address and port.
#    - api_key is a required argument, but vLLM does not validate it,
#      so any string will do.
client = openai.OpenAI(
    base_url="http://x.x.x.x:8000/v1",
    api_key="not-needed",
)

# 2. Define the model name and the chat messages.
#    - The model name must match the --served-model-name given when
#      starting vLLM (without that flag, vLLM uses the model path).
MODEL_NAME = "gpt-oss-20b"
messages = [
    {"role": "user", "content": "你好,请用中文介绍一下你自己以及你的功能。"}
]

print(f"--- 向模型 {MODEL_NAME} 发送请求 ---")
print(f"用户: {messages[-1]['content']}")
print("--- 模型响应 ---")

try:
    # 3. Send the request to the vLLM server.
    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=messages,
        temperature=0.7,  # randomness of the output; 0 is most deterministic
        max_tokens=512,   # upper bound on the reply length
    )
    # 4. Print the model's full reply.
    print(response.choices[0].message.content)
except openai.APIConnectionError as e:
    print("\n[错误] 无法连接到 vLLM 服务器。")
    print(f"详细错误: {e.__cause__}")
except Exception as e:
    print(f"\n[发生未知错误]: {e}")

The response:

--- 向模型 gpt-oss-20b 发送请求 ---
用户: 你好,请用中文介绍一下你自己以及你的功能。
--- 模型响应 ---
你好!我是 ChatGPT,基于 OpenAI 的 GPT‑4 架构开发的大型语言模型。以下是我的基本信息和主要功能介绍:

## 基本信息

- **模型**:GPT‑4(Chat Generative Pre‑trained Transformer 4)
- **训练数据截止时间**:2023 年 09 月
- **语言能力**:支持多种语言(中文、英文、日语、法语等),但主要以中文和英文为主。
- **平台**:可通过多种接口使用(网页、API、聊天机器人等)。

## 核心功能

| 功能 | 说明 |
|------|------|
| **自然语言理解 & 生成** | 可以回答问题、撰写文章、生成对话、改写文本、翻译等。 |
| **知识问答** | 提供事实性回答,涵盖科学、技术、历史、文化、时事等广泛领域。 |
| **创意写作** | 写诗、小说、剧本、广告文案、标题、摘要等。 |
| **编程与调试** | 生成代码、解释代码、调试、算法设计、技术文档。 |
| **教育辅导** | 讲解概念、解题思路、做练习、提供学习资源。 |
| **商务与营销** | 撰写商务邮件、市场分析、产品描述、SEO 文案。 |
| **情感支持** | 进行轻度心理支持、倾听、建议,但不替代专业心理咨询。 |
| **多轮对话** | 记住上下文,进行连贯、自然的对话。 |
| **定制化** | 通过提示工程(prompt engineering)可以调节语气、风格、专业度。 |

## 使用注意

1. **信息准确性**:我基于已有数据训练,可能出现过时或不准确的信息。重要决策请自行核实。
2. **隐私安全**:对话内容不会被用于训练或公开分享,但请避免输入敏感个人信息。
3. **合法合规**:请勿用于违法或不道德用途,例如生成仇恨言论、传播谣言等。

## 如何与我互动

(The reply is cut off here, presumably at the max_tokens=512 limit.)