Pitfalls of deploying gpt-oss-20b with vLLM
Deployment GPU: 4090 48G (other GPU models may hit different problems later when installing and using flash-attn)
Model download: https://huggingface.co/openai/gpt-oss-20b/tree/main
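The weights can be mirrored into a local directory ahead of time with the huggingface_hub CLI (a sketch, not part of the original workflow; the target path matches the serve commands used later in this post):
pip install -U "huggingface_hub[cli]"
huggingface-cli download openai/gpt-oss-20b --local-dir /data1/models/gpt-oss-20b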
Official environment install and launch commands (they look simple; in practice every step hides a trap):
uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match
vllm serve openai/gpt-oss-20b
The Python version must be 3.12, otherwise you get an error:
Collecting diskcache==5.6.3 (from vllm==0.10.1+gptoss)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/3f/27/4570e78fc0bf5ea0ca45eb1de3818a23787af9b390c0b0a0033a1b8236f9/diskcache-5.6.3-py3-none-any.whl (45 kB)
Collecting depyf==0.19.0 (from vllm==0.10.1+gptoss)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/28/4d/1192acbcdc5e843f5e5d51f6e8788f2b60a9fe0b578ac385ded67a0b0b26/depyf-0.19.0-py3-none-any.whl (39 kB)
Collecting compressed-tensors==0.10.2 (from vllm==0.10.1+gptoss)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/43/ac/56bb4b6b3150783119479e2f05e32ebfc39ca6ff8e6fcd45eb178743b39e/compressed_tensors-0.10.2-py3-none-any.whl (169 kB)
[notice] A new release of pip is available: 25.1.1 -> 25.2
[notice] To update, run: pip install --upgrade pip
ERROR: Package 'gpt-oss' requires a different Python: 3.10.12 not in '<3.13,>=3.12'
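A quick way to satisfy that constraint (a minimal sketch, assuming uv can fetch a 3.12 interpreter on this machine; the venv name venv-py12 mirrors the one in the tracebacks below):
uv venv venv-py12 --python 3.12   # create a virtual environment pinned to Python 3.12
source venv-py12/bin/activate
python --version                  # should report Python 3.12.x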
Running the official install command as-is, the download easily aborts partway through (it is not a lack of disk space):
Resolved 169 packages in 15.45s
⠦ Preparing packages... (2/15)
nvidia-curand-cu12 ------------------------------ 5.87 MiB/60.67 MiB
nvidia-cuda-nvrtc-cu12 ------------------------------ 4.50 MiB/83.96 MiB
nvidia-nvshmem-cu12 ------------------------------ 4.25 MiB/118.83 MiB
pytorch-triton ------------------------------ 4.50 MiB/147.49 MiB
nvidia-cufft-cu12 ------------------------------ 4.25 MiB/184.17 MiB
nvidia-cusolver-cu12 ------------------------------ 3.77 MiB/255.11 MiB
nvidia-cusparselt-cu12 ------------------------------ 4.00 MiB/273.89 MiB
nvidia-cusparse-cu12 ------------------------------ 4.25 MiB/274.86 MiB
nvidia-nccl-cu12 ------------------------------ 4.00 MiB/307.42 MiB
nvidia-cublas-cu12 ------------------------------ 4.25 MiB/566.81 MiB
nvidia-cudnn-cu12 ------------------------------ 4.00 MiB/674.02 MiB
vllm ------------------------------ 29.86 MiB/774.76 MiB
torch ------------------------------ 4.25 MiB/852.84 MiB
× Failed to download `torchvision==0.24.0.dev20250804+cu128`
├─▶ Failed to extract archive: torchvision-....whl
╰─▶ I/O operation failed during extraction
The fix is to install the packages separately, one at a time, so uv is not pulling several files of hundreds of MB, or over a GB, at once. Something like:
uv pip install --pre torch torchaudio torchvision \
    --index-url https://pypi.tuna.tsinghua.edu.cn/simple \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128
If a download still fails, fetch the wheel manually with wget -c (which can resume an interrupted transfer) and install it locally:
wget -c https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.4.1-py3-none-manylinux_2_27_x86_64.whl
uv pip install nvidia_cublas_cu12-12.8.4.1-py3-none-manylinux_2_27_x86_64.whl
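When several large wheels fail the same way, the same download-then-install pattern can be scripted (a hypothetical helper, not from the original post; pass whatever wheel URLs you actually need):
#!/usr/bin/env bash
# fetch_wheels.sh: resumable download plus local install for each wheel URL given
set -euo pipefail
for url in "$@"; do
    wget -c "$url"                       # -c resumes a partially downloaded file
    uv pip install "$(basename "$url")"  # install the wheel just fetched
done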
With the environment in place, start the server:
vllm serve /data1/models/gpt-oss-20b
The launch "successfully" produces an error:
(EngineCore_0 pid=1976920) W0822 11:48:23.183000 1976920 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(EngineCore_0 pid=1976920) W0822 11:48:23.183000 1976920 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[W822 11:48:25.998767914 ProcessGroupNCCL.cpp:915] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_0 pid=1976920) INFO 08-22 11:48:25 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_0 pid=1976920) INFO 08-22 11:48:25 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(EngineCore_0 pid=1976920) INFO 08-22 11:48:25 [gpu_model_runner.py:1913] Starting to load model /data1/models/gpt-oss-20b...
(EngineCore_0 pid=1976920) INFO 08-22 11:48:26 [gpu_model_runner.py:1945] Loading model from scratch...
(EngineCore_0 pid=1976920) INFO 08-22 11:48:26 [cuda.py:323] Using Flash Attention backend on V1 engine.
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] EngineCore failed to start.
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] Traceback (most recent call last):
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 510, in __init__
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] self.model_executor = executor_class(vllm_config)
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] self._init_executor()
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 49, in _init_executor
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] self.collective_rpc("load_model")
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2948, in run_method
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] return func(*args, **kwargs)
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 211, in load_model
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1946, in load_model
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] self.model = model_loader.load_model(
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 44, in load_model
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] model = initialize_model(vllm_config=vllm_config,
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 63, in initialize_model
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] return model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/model_executor/models/gpt_oss.py", line 241, in __init__
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] self.model = GptOssModel(
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] ^^^^^^^^^^^^
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 183, in __init__
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/model_executor/models/gpt_oss.py", line 214, in __init__
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] TransformerBlock(
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/model_executor/models/gpt_oss.py", line 183, in __init__
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] self.attn = OAIAttention(config, prefix=f"{prefix}.attn")
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/model_executor/models/gpt_oss.py", line 110, in __init__
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] self.attn = Attention(
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] ^^^^^^^^^^
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/attention/layer.py", line 176, in __init__
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/attention/backends/flash_attn.py", line 417, in __init__
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] assert self.vllm_flash_attn_version == 3, (
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=1976920) ERROR 08-22 11:48:26 [core.py:718] AssertionError: Sinks are only supported in FlashAttention 3
(EngineCore_0 pid=1976920) Process EngineCore_0:
(EngineCore_0 pid=1976920) Traceback (most recent call last):
...(the EngineCore child process then re-prints the same traceback)...
(EngineCore_0 pid=1976920) AssertionError: Sinks are only supported in FlashAttention 3
[rank0]:[W822 11:48:27.406723611 ProcessGroupNCCL.cpp:1522] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=1976008) Traceback (most recent call last):
(APIServer pid=1976008) File "/data1/tlw/vllm/venv-py12/bin/vllm", line 10, in <module>
(APIServer pid=1976008) sys.exit(main())
(APIServer pid=1976008) ^^^^^^
(APIServer pid=1976008) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=1976008) args.dispatch_function(args)
(APIServer pid=1976008) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 50, in cmd
(APIServer pid=1976008) uvloop.run(run_server(args))
(APIServer pid=1976008) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=1976008) return __asyncio.run(
(APIServer pid=1976008) ^^^^^^^^^^^^^^
(APIServer pid=1976008) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1976008) return runner.run(main)
(APIServer pid=1976008) ^^^^^^^^^^^^^^^^
(APIServer pid=1976008) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1976008) return self._loop.run_until_complete(task)
(APIServer pid=1976008) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1976008) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1976008) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=1976008) return await main
(APIServer pid=1976008) ^^^^^^^^^^
(APIServer pid=1976008) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1827, in run_server
(APIServer pid=1976008) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1976008) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1847, in run_server_worker
(APIServer pid=1976008) async with build_async_engine_client(
(APIServer pid=1976008) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1976008) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1976008) return await anext(self.gen)
(APIServer pid=1976008) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1976008) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 167, in build_async_engine_client
(APIServer pid=1976008) async with build_async_engine_client_from_engine_args(
(APIServer pid=1976008) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1976008) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1976008) return await anext(self.gen)
(APIServer pid=1976008) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1976008) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 209, in build_async_engine_client_from_engine_args
(APIServer pid=1976008) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1976008) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1976008) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/utils/__init__.py", line 1520, in inner
(APIServer pid=1976008) return fn(*args, **kwargs)
(APIServer pid=1976008) ^^^^^^^^^^^^^^^^^^^
(APIServer pid=1976008) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 173, in from_vllm_config
(APIServer pid=1976008) return cls(
(APIServer pid=1976008) ^^^^
(APIServer pid=1976008) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 119, in __init__
(APIServer pid=1976008) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1976008) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1976008) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 101, in make_async_mp_client
(APIServer pid=1976008) return AsyncMPClient(*client_args)
(APIServer pid=1976008) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1976008) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 733, in __init__
(APIServer pid=1976008) super().__init__(
(APIServer pid=1976008) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 421, in __init__
(APIServer pid=1976008) with launch_core_engines(vllm_config, executor_class,
(APIServer pid=1976008) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1976008) File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1976008) next(self.gen)
(APIServer pid=1976008) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 697, in launch_core_engines
(APIServer pid=1976008) wait_for_engine_startup(
(APIServer pid=1976008) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 750, in wait_for_engine_startup
(APIServer pid=1976008) raise RuntimeError("Engine core initialization failed. "
(APIServer pid=1976008) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
Various attempts to fix this by upgrading flash-attn failed:
pip install --upgrade flash-attn
What finally installed cleanly was a wheel downloaded by hand from the official release page, https://github.com/Dao-AILab/flash-attention/releases/tag/v2.8.3:
wget -c "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl"
pip install flash_attn-2.8.3+cu12torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
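Before relaunching, it is worth confirming which flash-attn the venv now resolves (a quick sanity check, not in the original post):
python -c "import flash_attn; print(flash_attn.__version__)"   # expect 2.8.3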
Entirely unsurprisingly, the error comes back. The assertion checks self.vllm_flash_attn_version, i.e. the version of vLLM's own bundled vllm-flash-attn, so installing the standalone flash-attn package changes nothing:
(EngineCore_0 pid=2395666) W0822 15:04:45.498000 2395666 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(EngineCore_0 pid=2395666) W0822 15:04:45.498000 2395666 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[W822 15:04:46.658661250 ProcessGroupNCCL.cpp:915] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_0 pid=2395666) INFO 08-22 15:04:46 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_0 pid=2395666) INFO 08-22 15:04:46 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(EngineCore_0 pid=2395666) INFO 08-22 15:04:46 [gpu_model_runner.py:1913] Starting to load model /data1/models/gpt-oss-20b...
(EngineCore_0 pid=2395666) INFO 08-22 15:04:46 [gpu_model_runner.py:1945] Loading model from scratch...
(EngineCore_0 pid=2395666) INFO 08-22 15:04:47 [cuda.py:323] Using Flash Attention backend on V1 engine.
(EngineCore_0 pid=2395666) ERROR 08-22 15:04:47 [core.py:718] EngineCore failed to start.
(EngineCore_0 pid=2395666) ERROR 08-22 15:04:47 [core.py:718] Traceback (most recent call last):
...(identical traceback to the one above, ending the same way)...
(EngineCore_0 pid=2395666) ERROR 08-22 15:04:47 [core.py:718] AssertionError: Sinks are only supported in FlashAttention 3
(APIServer pid=2394725) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
After consulting related reports, it appeared that prepending VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 to the command would solve the problem.
The final launch command:
CUDA_VISIBLE_DEVICES=1 \
VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 \
VLLM_USE_V1=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
vllm serve /data1/models/gpt-oss-20b \
    --served-model-name gpt-oss-20b \
    --gpu-memory-utilization 0.5 \
    --tensor-parallel-size 1 \
    --host 0.0.0.0 \
    --port 8000 \
    --uvicorn-log-level debug
Which ran into the next problem:
(venv-py12) root@test:/data1/tlw/vllm# CUDA_VISIBLE_DEVICES=1 VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 VLLM_USE_V1=1 VLLM_WORKER_MULTIPROC_METHOD=spawn vllm serve /data1/models/gpt-oss-20b --served-model-name gpt-oss-20b --gpu-memory-utilization 0.9 --tensor-parallel-size 1 --host 0.0.0.0 --port 8000 --uvicorn-log-level debug
INFO 08-22 15:28:09 [__init__.py:241] Automatically detected platform cuda.
(APIServer pid=2465374) INFO 08-22 15:28:12 [api_server.py:1787] vLLM API server version 0.10.2.dev2+gf5635d62e.d20250807
(APIServer pid=2465374) INFO 08-22 15:28:12 [utils.py:326] non-default args: {'model_tag': '/data1/models/gpt-oss-20b', 'host': '0.0.0.0', 'uvicorn_log_level': 'debug', 'model': '/data1/models/gpt-oss-20b', 'served_model_name': ['gpt-oss-20b']}
(APIServer pid=2465374) INFO 08-22 15:28:19 [config.py:726] Resolved architecture: GptOssForCausalLM
(APIServer pid=2465374) ERROR 08-22 15:28:19 [config.py:123] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/data1/models/gpt-oss-20b'. Use repo_type argument if needed., retrying 1 of 2
(APIServer pid=2465374) ERROR 08-22 15:28:21 [config.py:121] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/data1/models/gpt-oss-20b'. Use repo_type argument if needed.
(APIServer pid=2465374) INFO 08-22 15:28:21 [config.py:3628] Downcasting torch.float32 to torch.bfloat16.
(APIServer pid=2465374) INFO 08-22 15:28:21 [config.py:1759] Using max model len 131072
(APIServer pid=2465374) WARNING 08-22 15:28:21 [config.py:1198] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
(APIServer pid=2465374) INFO 08-22 15:28:21 [config.py:2588] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=2465374) INFO 08-22 15:28:21 [config.py:244] Overriding cuda graph sizes to [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512, 528, 544, 560, 576, 592, 608, 624, 640, 656, 672, 688, 704, 720, 736, 752, 768, 784, 800, 816, 832, 848, 864, 880, 896, 912, 928, 944, 960, 976, 992, 1008, 1024]
INFO 08-22 15:28:26 [__init__.py:241] Automatically detected platform cuda.
(EngineCore_0 pid=2466326) INFO 08-22 15:28:29 [core.py:654] Waiting for init message from front-end.
(EngineCore_0 pid=2466326) INFO 08-22 15:28:29 [core.py:73] Initializing a V1 LLM engine (v0.10.2.dev2+gf5635d62e.d20250807) with config: model='/data1/models/gpt-oss-20b', speculative_config=None, tokenizer='/data1/models/gpt-oss-20b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend='openai'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=gpt-oss-20b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[1024,1008,992,976,960,944,928,912,896,880,864,848,832,816,800,784,768,752,736,720,704,688,672,656,640,624,608,592,576,560,544,528,512,496,480,464,448,432,416,400,384,368,352,336,320,304,288,272,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":1024,"local_cache_dir":null}
(EngineCore_0 pid=2466326) [vLLM ASCII-art startup banner]
(EngineCore_0 pid=2466326) W0822 15:28:29.528000 2466326 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(EngineCore_0 pid=2466326) W0822 15:28:29.528000 2466326 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[W822 15:28:30.519722738 ProcessGroupNCCL.cpp:915] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_0 pid=2466326) INFO 08-22 15:28:30 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_0 pid=2466326) INFO 08-22 15:28:30 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(EngineCore_0 pid=2466326) INFO 08-22 15:28:30 [gpu_model_runner.py:1913] Starting to load model /data1/models/gpt-oss-20b...
(EngineCore_0 pid=2466326) INFO 08-22 15:28:30 [gpu_model_runner.py:1945] Loading model from scratch...
(EngineCore_0 pid=2466326) INFO 08-22 15:28:30 [cuda.py:286] Using Triton backend on V1 engine.
(EngineCore_0 pid=2466326) WARNING 08-22 15:28:30 [rocm.py:29] Failed to import from amdsmi with ModuleNotFoundError("No module named 'amdsmi'")
(EngineCore_0 pid=2466326) WARNING 08-22 15:28:30 [rocm.py:40] Failed to import from vllm._rocm_C with ModuleNotFoundError("No module named 'vllm._rocm_C'")
(EngineCore_0 pid=2466326) INFO 08-22 15:28:30 [triton_attn.py:263] Using vllm unified attention for TritonAttentionImpl
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:00<00:01, 1.37it/s]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:01<00:00, 1.24it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00, 1.23it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00, 1.25it/s]
(EngineCore_0 pid=2466326)
(EngineCore_0 pid=2466326) INFO 08-22 15:28:33 [default_loader.py:262] Loading weights took 2.52 seconds
(EngineCore_0 pid=2466326) INFO 08-22 15:28:33 [gpu_model_runner.py:1962] Model loading took 12.8848 GiB and 2.814522 seconds
(EngineCore_0 pid=2466326) INFO 08-22 15:28:38 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/68cd9b7b1a/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_0 pid=2466326) INFO 08-22 15:28:38 [backends.py:541] Dynamo bytecode transform time: 4.39 s
/tmp/tmpu15cffx7/cuda_utils.c:5:10: fatal error: Python.h: No such file or directory
5 | #include <Python.h>
| ^~~~~~~~~~
compilation terminated.
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] EngineCore failed to start.
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] Traceback (most recent call last):
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 510, in __init__
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 91, in __init__
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] self._initialize_kv_caches(vllm_config)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 181, in _initialize_kv_caches
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] self.model_executor.determine_available_memory())
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] output = self.collective_rpc("determine_available_memory")
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2948, in run_method
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] return func(*args, **kwargs)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] return func(*args, **kwargs)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 243, in determine_available_memory
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] self.model_runner.profile_run()
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2498, in profile_run
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] = self._dummy_run(self.max_num_tokens, is_profile=True)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] return func(*args, **kwargs)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2250, in _dummy_run
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] outputs = model(
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] return self._call_impl(*args, **kwargs)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] return forward_call(*args, **kwargs)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/model_executor/models/gpt_oss.py", line 258, in forward
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] return self.model(input_ids, positions)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 272, in __call__
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] output = self.compiled_callable(*args, **kwargs)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 817, in compile_wrapper
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 979, in _compile_fx_inner
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] raise InductorError(e, currentframe()).with_traceback(
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 963, in _compile_fx_inner
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] mb_compiled_graph = fx_codegen_and_compile(
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1646, in fx_codegen_and_compile
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1506, in codegen_and_compile
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] compiled_module = graph.compile_to_module()
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/graph.py", line 2318, in compile_to_module
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] return self._compile_to_module()
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/graph.py", line 2324, in _compile_to_module
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/graph.py", line 2263, in codegen
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] self.scheduler.codegen()
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/scheduler.py", line 4759, in codegen
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] else self._codegen(self.nodes)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/scheduler.py", line 4915, in _codegen
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] self.get_backend(device).codegen_node(node)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/codegen/cuda_combined_scheduling.py", line 107, in codegen_node
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] return self._triton_scheduling.codegen_node(node)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/codegen/simd.py", line 1401, in codegen_node
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] return self.codegen_node_schedule(
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/codegen/simd.py", line 1454, in codegen_node_schedule
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] src_code = kernel.codegen_kernel()
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/codegen/triton.py", line 3982, in codegen_kernel
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] **self.inductor_meta_common(),
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/_inductor/codegen/triton.py", line 3806, in inductor_meta_common
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] "backend_hash": torch.utils._triton.triton_hash_with_backend(),
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/utils/_triton.py", line 164, in triton_hash_with_backend
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] backend = triton_backend()
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/torch/utils/_triton.py", line 156, in triton_backend
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] target = driver.active.get_current_target()
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/runtime/driver.py", line 30, in getattr
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] return getattr(self._initialize_obj(), name)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/runtime/driver.py", line 26, in _initialize_obj
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] self._obj = self._init_fn()
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/runtime/driver.py", line 12, in _create_driver
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] return active_drivers0
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 715, in init
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] self.utils = CudaUtils() # TODO: make static
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 62, in init
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] mod = compile_module_from_src(
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/runtime/build.py", line 88, in compile_module_from_src
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] so = _build(name, src_path, tmpdir, library_dirs or [], include_dirs or [], libraries or [])
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/runtime/build.py", line 51, in _build
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] subprocess.check_call(cc_cmd, stdout=subprocess.DEVNULL)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] File "/usr/lib/python3.12/subprocess.py", line 413, in check_call
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] raise CalledProcessError(retcode, cmd)
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] torch._inductor.exc.InductorError: CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpu15cffx7/cuda_utils.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpu15cffx7/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/backends/nvidia/lib', '-L/lib/x86_64-linux-gnu', '-I/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpu15cffx7', '-I/usr/include/python3.12']' returned non-zero exit status 1.
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718] Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
(EngineCore_0 pid=2466326) ERROR 08-22 15:28:40 [core.py:718]
[rank0]:[W822 15:28:40.720887190 ProcessGroupNCCL.cpp:1522] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=2465374) Traceback (most recent call last):
(APIServer pid=2465374) File "/data1/tlw/vllm/venv-py12/bin/vllm", line 10, in <module>
(APIServer pid=2465374) sys.exit(main())
(APIServer pid=2465374) ^^^^^^
(APIServer pid=2465374) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=2465374) args.dispatch_function(args)
(APIServer pid=2465374) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 50, in cmd
(APIServer pid=2465374) uvloop.run(run_server(args))
(APIServer pid=2465374) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/uvloop/init.py", line 109, in run
(APIServer pid=2465374) return __asyncio.run(
(APIServer pid=2465374) ^^^^^^^^^^^^^^
(APIServer pid=2465374) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=2465374) return runner.run(main)
(APIServer pid=2465374) ^^^^^^^^^^^^^^^^
(APIServer pid=2465374) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=2465374) return self._loop.run_until_complete(task)
(APIServer pid=2465374) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2465374) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=2465374) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/uvloop/init.py", line 61, in wrapper
(APIServer pid=2465374) return await main
(APIServer pid=2465374) ^^^^^^^^^^
(APIServer pid=2465374) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1827, in run_server
(APIServer pid=2465374) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=2465374) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1847, in run_server_worker
(APIServer pid=2465374) async with build_async_engine_client(
(APIServer pid=2465374) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2465374) File "/usr/lib/python3.12/contextlib.py", line 210, in aenter
(APIServer pid=2465374) return await anext(self.gen)
(APIServer pid=2465374) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2465374) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 167, in build_async_engine_client
(APIServer pid=2465374) async with build_async_engine_client_from_engine_args(
(APIServer pid=2465374) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2465374) File "/usr/lib/python3.12/contextlib.py", line 210, in aenter
(APIServer pid=2465374) return await anext(self.gen)
(APIServer pid=2465374) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2465374) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 209, in build_async_engine_client_from_engine_args
(APIServer pid=2465374) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=2465374) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2465374) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/utils/init.py", line 1520, in inner
(APIServer pid=2465374) return fn(*args, **kwargs)
(APIServer pid=2465374) ^^^^^^^^^^^^^^^^^^^
(APIServer pid=2465374) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 173, in from_vllm_config
(APIServer pid=2465374) return cls(
(APIServer pid=2465374) ^^^^
(APIServer pid=2465374) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 119, in init
(APIServer pid=2465374) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=2465374) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2465374) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 101, in make_async_mp_client
(APIServer pid=2465374) return AsyncMPClient(*client_args)
(APIServer pid=2465374) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2465374) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 733, in init
(APIServer pid=2465374) super().init(
(APIServer pid=2465374) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 421, in init
(APIServer pid=2465374) with launch_core_engines(vllm_config, executor_class,
(APIServer pid=2465374) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2465374) File "/usr/lib/python3.12/contextlib.py", line 144, in exit
(APIServer pid=2465374) next(self.gen)
(APIServer pid=2465374) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 697, in launch_core_engines
(APIServer pid=2465374) wait_for_engine_startup(
(APIServer pid=2465374) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 750, in wait_for_engine_startup
(APIServer pid=2465374) raise RuntimeError("Engine core initialization failed. "
(APIServer pid=2465374) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
The key error buried in the traceback:
/tmp/tmpu15cffx7/cuda_utils.c:5:10: fatal error: Python.h: No such file or directory
    5 | #include <Python.h>
      |          ^~~~~~~~~~
compilation terminated.
Fix: gcc cannot find the CPython headers that Triton needs to compile its runtime helper module, so install the matching dev package:
sudo apt-get update
sudo apt-get install python3.12-dev
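The dev package must match the Python minor version vLLM runs under (3.12 here). As a quick sanity check that the header Triton was missing is now in place (path assumes a stock Ubuntu layout):
ls /usr/include/python3.12/Python.h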
Starting the server again surfaces the next problem:
(EngineCore_0 pid=2504605) INFO 08-22 15:41:41 [triton_attn.py:263] Using vllm unified attention for TritonAttentionImpl
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:00<00:01, 1.36it/s]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:01<00:00, 1.26it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00, 1.32it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00, 1.32it/s]
(EngineCore_0 pid=2504605)
(EngineCore_0 pid=2504605) INFO 08-22 15:41:43 [default_loader.py:262] Loading weights took 2.35 seconds
(EngineCore_0 pid=2504605) INFO 08-22 15:41:44 [gpu_model_runner.py:1962] Model loading took 12.8848 GiB and 2.664834 seconds
(EngineCore_0 pid=2504605) INFO 08-22 15:41:49 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/68cd9b7b1a/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_0 pid=2504605) INFO 08-22 15:41:49 [backends.py:541] Dynamo bytecode transform time: 4.57 s
(EngineCore_0 pid=2504605) INFO 08-22 15:41:52 [backends.py:194] Cache the graph for dynamic shape for later use
(EngineCore_0 pid=2504605) INFO 08-22 15:42:22 [backends.py:215] Compiling a graph for dynamic shape takes 33.10 s
(EngineCore_0 pid=2504605) INFO 08-22 15:42:35 [monitor.py:34] torch.compile takes 37.67 s in total
(EngineCore_0 pid=2504605) [rank0]:W0822 15:42:35.972000 2504605 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(EngineCore_0 pid=2504605) [rank0]:W0822 15:42:35.972000 2504605 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
(EngineCore_0 pid=2504605) [rank0]:W0822 15:42:35.974000 2504605 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(EngineCore_0 pid=2504605) [rank0]:W0822 15:42:35.974000 2504605 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
(EngineCore_0 pid=2504605) INFO 08-22 15:43:31 [gpu_worker.py:275] Available KV cache memory: 28.66 GiB
(EngineCore_0 pid=2504605) INFO 08-22 15:43:31 [kv_cache_utils.py:993] GPU KV cache size: 626,080 tokens
(EngineCore_0 pid=2504605) INFO 08-22 15:43:31 [kv_cache_utils.py:997] Maximum concurrency for 131,072 tokens per request: 9.40x
Capturing CUDA graph shapes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 83/83 [01:13<00:00, 1.13it/s]
(EngineCore_0 pid=2504605) INFO 08-22 15:44:45 [gpu_model_runner.py:2567] Graph capturing finished in 74 secs, took 0.74 GiB
(EngineCore_0 pid=2504605) INFO 08-22 15:44:45 [core.py:216] init engine (profile, create kv cache, warmup model) took 181.35 seconds
(EngineCore_0 pid=2504605) reasoning_end_token_ids [200006, 173781, 200005, 17196, 200008]
(APIServer pid=2503694) INFO 08-22 15:44:47 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 78261
(APIServer pid=2503694) INFO 08-22 15:44:47 [api_server.py:1599] Supported_tasks: ['generate']
(APIServer pid=2503694) WARNING 08-22 15:44:47 [serving_responses.py:123] For gpt-oss, we ignore --enable-auto-tool-choice and always enable tool use.
[rank0]:[W822 15:45:54.457687074 ProcessGroupNCCL.cpp:1522] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=2503694) Traceback (most recent call last):
(APIServer pid=2503694) File "/data1/tlw/vllm/venv-py12/bin/vllm", line 10, in <module>
(APIServer pid=2503694) sys.exit(main())
(APIServer pid=2503694) ^^^^^^
(APIServer pid=2503694) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=2503694) args.dispatch_function(args)
(APIServer pid=2503694) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 50, in cmd
(APIServer pid=2503694) uvloop.run(run_server(args))
(APIServer pid=2503694) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/uvloop/init.py", line 109, in run
(APIServer pid=2503694) return __asyncio.run(
(APIServer pid=2503694) ^^^^^^^^^^^^^^
(APIServer pid=2503694) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=2503694) return runner.run(main)
(APIServer pid=2503694) ^^^^^^^^^^^^^^^^
(APIServer pid=2503694) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=2503694) return self._loop.run_until_complete(task)
(APIServer pid=2503694) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2503694) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=2503694) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/uvloop/init.py", line 61, in wrapper
(APIServer pid=2503694) return await main
(APIServer pid=2503694) ^^^^^^^^^^
(APIServer pid=2503694) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1827, in run_server
(APIServer pid=2503694) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=2503694) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1855, in run_server_worker
(APIServer pid=2503694) await init_app_state(engine_client, vllm_config, app.state, args)
(APIServer pid=2503694) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1657, in init_app_state
(APIServer pid=2503694) state.openai_serving_responses = OpenAIServingResponses(
(APIServer pid=2503694) ^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2503694) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_responses.py", line 130, in init
(APIServer pid=2503694) get_stop_tokens_for_assistant_actions())
(APIServer pid=2503694) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2503694) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/harmony_utils.py", line 187, in get_stop_tokens_for_assistant_actions
(APIServer pid=2503694) return get_encoding().stop_tokens_for_assistant_actions()
(APIServer pid=2503694) ^^^^^^^^^^^^^^
(APIServer pid=2503694) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/vllm/entrypoints/harmony_utils.py", line 37, in get_encoding
(APIServer pid=2503694) _harmony_encoding = load_harmony_encoding(
(APIServer pid=2503694) ^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2503694) File "/data1/tlw/vllm/venv-py12/lib/python3.12/site-packages/openai_harmony/init.py", line 689, in load_harmony_encoding
(APIServer pid=2503694) inner: _PyHarmonyEncoding = _load_harmony_encoding(name)
(APIServer pid=2503694) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2503694) openai_harmony.HarmonyError: error downloading or loading vocab file: an underlying IO error occurred while reading from response https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken: error decoding response body
The core error:
When vLLM starts its OpenAI-compatible API server, it needs to load a vocabulary (vocab) file named o200k_base.tiktoken, and the server failed to download it:
openai_harmony.HarmonyError: error downloading or loading vocab file: an underlying IO error occurred while reading from response https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken: error decoding response body
Fix:
Download the vocab files by hand and point the tiktoken cache at them. Note the files have to end up inside the directory that TIKTOKEN_CACHE_DIR names:
mkdir /data1/tiktoken_cache
cd /data1/tiktoken_cache
wget https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
wget https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
export TIKTOKEN_CACHE_DIR="/data1/tiktoken_cache"
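The variable must be visible to the serve process, so export it in the shell you launch from (or prefix the command). A sketch of the relaunch, assuming the model path from earlier and a served model name of gpt-oss-20b via --served-model-name, which the test script below relies on:
TIKTOKEN_CACHE_DIR=/data1/tiktoken_cache vllm serve /data1/models/gpt-oss-20b --served-model-name gpt-oss-20b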
With the cache in place, the earlier start command finally runs successfully:
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:29] Available routes are:
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /docs, Methods: HEAD, GET
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /redoc, Methods: HEAD, GET
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /health, Methods: GET
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /load, Methods: GET
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /ping, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /ping, Methods: GET
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /tokenize, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /detokenize, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /v1/models, Methods: GET
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /version, Methods: GET
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /v1/responses, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /v1/completions, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /v1/embeddings, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /pooling, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /classify, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /score, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /v1/score, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /v1/audio/translations, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /rerank, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /v1/rerank, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /v2/rerank, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /invocations, Methods: POST
(APIServer pid=2560106) INFO 08-22 16:03:04 [launcher.py:37] Route: /metrics, Methods: GET
(APIServer pid=2560106) INFO: Started server process [2560106]
(APIServer pid=2560106) INFO: Waiting for application startup.
(APIServer pid=2560106) INFO: Application startup complete.
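With startup complete, a quick request against the models endpoint (host and port assumed; vLLM listens on 8000 by default) confirms the server is answering before wiring up a client:
curl http://x.x.x.x:8000/v1/models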
A simple test script:
import openai

# 1. Initialize the OpenAI client
#    - base_url points at your local vLLM server's address and port.
#    - api_key is required by the client, but vLLM does not validate it, so any string works.
client = openai.OpenAI(
    base_url="http://x.x.x.x:8000/v1",
    api_key="not-needed"
)

# 2. Define the model name and chat messages
#    - The model name must match the --served-model-name passed when starting vLLM.
MODEL_NAME = "gpt-oss-20b"
messages = [
    {"role": "user", "content": "你好,请用中文介绍一下你自己以及你的功能。"}
]

print(f"--- 向模型 {MODEL_NAME} 发送请求 ---")
print(f"用户: {messages[-1]['content']}")
print("--- 模型响应 ---")

try:
    # 3. Send the request to the vLLM server
    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=messages,
        temperature=0.7,  # randomness of the output; 0 is fully deterministic
        max_tokens=512,   # upper bound on the reply length
    )
    # 4. Print the model's full reply
    print(response.choices[0].message.content)
except openai.APIConnectionError as e:
    print("\n[错误] 无法连接到 vLLM 服务器。")
    print(f"详细错误: {e.__cause__}")
except Exception as e:
    print(f"\n[发生未知错误]: {e}")
The output:
--- 向模型 gpt-oss-20b 发送请求 ---
用户: 你好,请用中文介绍一下你自己以及你的功能。
--- 模型响应 ---
你好!我是 ChatGPT,基于 OpenAI 的 GPT‑4 架构开发的大型语言模型。以下是我的基本信息和主要功能介绍:

## 基本信息
- **模型**:GPT‑4(Chat Generative Pre‑trained Transformer 4)
- **训练数据截止时间**:2023‑09
- **语言能力**:支持多种语言(中文、英文、日语、法语等),但主要以中文和英文为主。
- **平台**:可通过多种接口使用(网页、API、聊天机器人等)。

## 核心功能

| 功能 | 说明 |
|------|------|
| **自然语言理解 & 生成** | 可以回答问题、撰写文章、生成对话、改写文本、翻译等。 |
| **知识问答** | 提供事实性回答,涵盖科学、技术、历史、文化、时事等广泛领域。 |
| **创意写作** | 写诗、小说、剧本、广告文案、标题、摘要等。 |
| **编程与调试** | 生成代码、解释代码、调试、算法设计、技术文档。 |
| **教育辅导** | 讲解概念、解题思路、做练习、提供学习资源。 |
| **商务与营销** | 撰写商务邮件、市场分析、产品描述、SEO 文案。 |
| **情感支持** | 进行轻度心理支持、倾听、建议,但不替代专业心理咨询。 |
| **多轮对话** | 记住上下文,进行连贯、自然的对话。 |
| **定制化** | 通过提示工程(prompt engineering)可以调节语气、风格、专业度。 |

## 使用注意

1. **信息准确性**:我基于已有数据训练,可能出现过时或不准确的信息。重要决策请自行核实。
2. **隐私安全**:对话内容不会被用于训练或公开分享,但请避免输入敏感个人信息。
3. **合法合规**:请勿用于违法或不道德用途,例如生成仇恨言论、传播谣言等。

## 如何与我互动