Running DeepSeek on an AMD Cloud GPU
On the afternoon of April 19, 2025, I had the chance to attend the AMD & CSDN ROCm AI developer meetup and ran DeepSeek on an AMD cloud GPU.
Installing Open WebUI locally
I previously deployed it in a containerized way:
https://lizhiyong.blog.csdn.net/article/details/145582453
This time I took a different route, since a Python environment is needed; see:
https://lizhiyong.blog.csdn.net/article/details/127827522
After installing Anaconda, a few commands are enough to bring the service up:
conda env list                      # list existing environments
conda init                          # initialize the shell for conda (reopen the shell afterwards)
conda create -n py311 python=3.11   # Open WebUI currently targets Python 3.11
conda activate py311
pip install open-webui
open-webui serve                    # serves the UI on port 8080 by default
Once it starts successfully, you may still need to press Ctrl+C once before the page can be opened at:
http://127.0.0.1:8080
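If the browser does not load it right away, a quick sanity check from another terminal confirms whether the service is actually listening (a trivial check, nothing Open WebUI specific):

curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:8080   # prints 200 once the UI is up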
Server configuration
root@atl1g2r2u16gpu:/workspace# rocm-smi
============================================ ROCm System Management Interface ============================================
====================================================== Concise Info ======================================================
Device  Node  IDs              Temp        Power     Partitions          SCLK    MCLK    Fan  Perf  PwrCap  VRAM%  GPU%
              (DID, GUID)      (Junction)  (Socket)  (Mem, Compute, ID)
==========================================================================================================================
0       4     0x74a1, 47045    45.0°C      156.0W    NPS1, SPX, 0        136Mhz  900Mhz  0%   auto  750.0W  0%     0%
==========================================================================================================================
================================================== End of ROCm SMI Log ===================================================
root@atl1g2r2u16gpu:/workspace# rocminfo
ROCk module version 6.10.5 is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.14
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                     INTEL(R) XEON(R) PLATINUM 8568Y+
  Uuid:                     CPU-XX
  Marketing Name:           INTEL(R) XEON(R) PLATINUM 8568Y+
  Vendor Name:              CPU
  Feature:                  None specified
  Profile:                  FULL_PROFILE
  Float Round Mode:         NEAR
  Max Queue Number:         0(0x0)
  Queue Min Size:           0(0x0)
  Queue Max Size:           0(0x0)
  Queue Type:               MULTI
  Node:                     0
  Device Type:              CPU
  Cache Info:               L1: 49152(0xc000) KB
  Chip ID:                  0(0x0)
  ASIC Revision:            0(0x0)
  Cacheline Size:           64(0x40)
  Max Clock Freq. (MHz):    4000
  BDFID:                    0
  Internal Node ID:         0
  Compute Unit:             48
  SIMDs per CU:             0
  Shader Engines:           0
  Shader Arrs. per Eng.:    0
  WatchPts on Addr. Ranges: 1
  Memory Properties:
  Features:                 None
  Pool Info:
    Pool 1: GLOBAL; FLAGS: FINE GRAINED; Size: 1056335292(0x3ef665bc) KB; Allocatable: TRUE; Alloc Granule: 4KB; Recommended Granule: 4KB; Alignment: 4KB; Accessible by all: TRUE
    Pool 2: GLOBAL; FLAGS: EXTENDED FINE GRAINED; Size: 1056335292(0x3ef665bc) KB; Allocatable: TRUE; Alloc Granule: 4KB; Recommended Granule: 4KB; Alignment: 4KB; Accessible by all: TRUE
    Pool 3: GLOBAL; FLAGS: KERNARG, FINE GRAINED; Size: 1056335292(0x3ef665bc) KB; Allocatable: TRUE; Alloc Granule: 4KB; Recommended Granule: 4KB; Alignment: 4KB; Accessible by all: TRUE
    Pool 4: GLOBAL; FLAGS: COARSE GRAINED; Size: 1056335292(0x3ef665bc) KB; Allocatable: TRUE; Alloc Granule: 4KB; Recommended Granule: 4KB; Alignment: 4KB; Accessible by all: TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                     INTEL(R) XEON(R) PLATINUM 8568Y+
  Uuid:                     CPU-XX
  Marketing Name:           INTEL(R) XEON(R) PLATINUM 8568Y+
  Vendor Name:              CPU
  Feature:                  None specified
  Profile:                  FULL_PROFILE
  Float Round Mode:         NEAR
  Max Queue Number:         0(0x0)
  Queue Min Size:           0(0x0)
  Queue Max Size:           0(0x0)
  Queue Type:               MULTI
  Node:                     1
  Device Type:              CPU
  Cache Info:               L1: 49152(0xc000) KB
  Chip ID:                  0(0x0)
  ASIC Revision:            0(0x0)
  Cacheline Size:           64(0x40)
  Max Clock Freq. (MHz):    4000
  BDFID:                    0
  Internal Node ID:         1
  Compute Unit:             48
  SIMDs per CU:             0
  Shader Engines:           0
  Shader Arrs. per Eng.:    0
  WatchPts on Addr. Ranges: 1
  Memory Properties:
  Features:                 None
  Pool Info:
    Pool 1: GLOBAL; FLAGS: FINE GRAINED; Size: 1056940056(0x3effa018) KB; Allocatable: TRUE; Alloc Granule: 4KB; Recommended Granule: 4KB; Alignment: 4KB; Accessible by all: TRUE
    Pool 2: GLOBAL; FLAGS: EXTENDED FINE GRAINED; Size: 1056940056(0x3effa018) KB; Allocatable: TRUE; Alloc Granule: 4KB; Recommended Granule: 4KB; Alignment: 4KB; Accessible by all: TRUE
    Pool 3: GLOBAL; FLAGS: KERNARG, FINE GRAINED; Size: 1056940056(0x3effa018) KB; Allocatable: TRUE; Alloc Granule: 4KB; Recommended Granule: 4KB; Alignment: 4KB; Accessible by all: TRUE
    Pool 4: GLOBAL; FLAGS: COARSE GRAINED; Size: 1056940056(0x3effa018) KB; Allocatable: TRUE; Alloc Granule: 4KB; Recommended Granule: 4KB; Alignment: 4KB; Accessible by all: TRUE
  ISA Info:
*******
Agent 3
*******
  Name:                     gfx942
  Uuid:                     GPU-e927e74ce22946d1
  Marketing Name:           AMD Instinct MI300X
  Vendor Name:              AMD
  Feature:                  KERNEL_DISPATCH
  Profile:                  BASE_PROFILE
  Float Round Mode:         NEAR
  Max Queue Number:         128(0x80)
  Queue Min Size:           64(0x40)
  Queue Max Size:           131072(0x20000)
  Queue Type:               MULTI
  Node:                     2
  Device Type:              GPU
  Cache Info:               L1: 32(0x20) KB; L2: 4096(0x1000) KB; L3: 262144(0x40000) KB
  Chip ID:                  29857(0x74a1)
  ASIC Revision:            1(0x1)
  Cacheline Size:           64(0x40)
  Max Clock Freq. (MHz):    2100
  BDFID:                    19968
  Internal Node ID:         2
  Compute Unit:             304
  SIMDs per CU:             4
  Shader Engines:           32
  Shader Arrs. per Eng.:    1
  WatchPts on Addr. Ranges: 4
  Coherent Host Access:     FALSE
  Memory Properties:
  Features:                 KERNEL_DISPATCH
  Fast F16 Operation:       TRUE
  Wavefront Size:           64(0x40)
  Workgroup Max Size:       1024(0x400)
  Workgroup Max Size per Dimension: x 1024(0x400), y 1024(0x400), z 1024(0x400)
  Max Waves Per CU:         32(0x20)
  Max Work-item Per CU:     2048(0x800)
  Grid Max Size:            4294967295(0xffffffff)
  Grid Max Size per Dimension: x 4294967295(0xffffffff), y 4294967295(0xffffffff), z 4294967295(0xffffffff)
  Max fbarriers/Workgrp:    32
  Packet Processor uCode::  166
  SDMA engine uCode::       22
  IOMMU Support::           None
  Pool Info:
    Pool 1: GLOBAL; FLAGS: COARSE GRAINED; Size: 201310208(0xbffc000) KB; Allocatable: TRUE; Alloc Granule: 4KB; Recommended Granule: 2048KB; Alignment: 4KB; Accessible by all: FALSE
    Pool 2: GLOBAL; FLAGS: EXTENDED FINE GRAINED; Size: 201310208(0xbffc000) KB; Allocatable: TRUE; Alloc Granule: 4KB; Recommended Granule: 2048KB; Alignment: 4KB; Accessible by all: FALSE
    Pool 3: GLOBAL; FLAGS: FINE GRAINED; Size: 201310208(0xbffc000) KB; Allocatable: TRUE; Alloc Granule: 4KB; Recommended Granule: 2048KB; Alignment: 4KB; Accessible by all: FALSE
    Pool 4: GROUP; Size: 64(0x40) KB; Allocatable: FALSE
  ISA Info:
    ISA 1
      Name:                 amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
      Machine Models:       HSA_MACHINE_MODEL_LARGE
      Profiles:             HSA_PROFILE_BASE
      Default Rounding Mode: NEAR
      Fast f16:             TRUE
      Workgroup Max Size:   1024(0x400)
      Workgroup Max Size per Dimension: x 1024(0x400), y 1024(0x400), z 1024(0x400)
      Grid Max Size:        4294967295(0xffffffff)
      Grid Max Size per Dimension: x 4294967295(0xffffffff), y 4294967295(0xffffffff), z 4294967295(0xffffffff)
      FBarrier Max Size:    32
*** Done ***
root@atl1g2r2u16gpu:/workspace# free -h
               total        used        free      shared  buff/cache   available
Mem: 2.0Ti 79Gi 1.5Ti 8.0Mi 432Gi 1.9Ti
Swap: 8.0Gi 0B 8.0Gi
root@atl1g2r2u16gpu:/workspace# cpuinfo
Python Version: 3.12.9.final.0 (64 bit)
Cpuinfo Version: 9.0.0
Vendor ID Raw: GenuineIntel
Hardware Raw:
Brand Raw: INTEL(R) XEON(R) PLATINUM 8568Y+
Hz Advertised Friendly: 3.1935 GHz
Hz Actual Friendly: 3.1935 GHz
Hz Advertised: (3193491000, 0)
Hz Actual: (3193491000, 0)
Arch: X86_64
Bits: 64
Count: 96
Arch String Raw: x86_64
L1 Data Cache Size: 4.5 MiB
L1 Instruction Cache Size: 3145728
L2 Cache Size: 201326592
L2 Cache Line Size: 2048
L2 Cache Associativity: 7
L3 Cache Size: 314572800
Stepping: 2
Model: 207
Family: 6
Processor Type:
Flags: 3dnowprefetch, abm, acpi, adx, aes, amx_bf16, amx_int8, amx_tile, aperfmperf, apic, arat, arch_capabilities, arch_lbr, arch_perfmon, art, avx, avx2, avx512_bf16, avx512_bitalg, avx512_fp16, avx512_vbmi2, avx512_vnni, avx512_vpopcntdq, avx512bitalg, avx512bw, avx512cd, avx512dq, avx512f, avx512ifma, avx512vbmi, avx512vbmi2, avx512vl, avx512vnni, avx512vpopcntdq, avx_vnni, bmi1, bmi2, bts, bus_lock_detect, cat_l2, cat_l3, cdp_l2, cdp_l3, cldemote, clflush, clflushopt, clwb, cmov, constant_tsc, cpuid, cpuid_fault, cqm, cqm_llc, cqm_mbm_local, cqm_mbm_total, cqm_occup_llc, cx16, cx8, dca, de, ds_cpl, dtes64, dtherm, dts, enqcmd, epb, erms, est, f16c, flush_l1d, fma, fpu, fsgsbase, fsrm, fxsr, gfni, hfi, ht, ibpb, ibrs, ibrs_enhanced, ibt, ida, intel_pt, invpcid, la57, lahf_lm, lm, mba, mca, mce, md_clear, mmx, monitor, movbe, movdir64b, movdiri, msr, mtrr, nonstop_tsc, nopl, nx, ospke, osxsave, pae, pat, pbe, pcid, pclmulqdq, pconfig, pdcm, pdpe1gb, pebs, pge, pku, pln, pni, popcnt, pqe, pqm, pse, pse36, pts, rdpid, rdrand, rdrnd, rdseed, rdt_a, rdtscp, rep_good, sdbg, sep, serialize, sgx, sgx_lc, sha, sha_ni, smap, smep, smx, split_lock_detect, ss, ssbd, sse, sse2, sse4_1, sse4_2, ssse3, stibp, syscall, tm, tm2, tme, tsc, tsc_adjust, tsc_deadline_timer, tsc_known_freq, tscdeadline, tsxldtrk, umip, user_shstk, vaes, vme, vmx, vpclmulqdq, waitpkg, wbnoinvd, x2apic, xgetbv1, xsave, xsavec, xsaveopt, xsaves, xtopology, xtpr
root@atl1g2r2u16gpu:/workspace#
Running the model
root@atl1g2r2u16gpu:/workspace# vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --host 0.0.0.0 \
    --port $VLLM_PORT \
    --api-key abc-123 \
    --trust-remote-code \
    --seed 42
INFO 04-19 08:13:42 [__init__.py:239] Automatically detected platform rocm.
INFO 04-19 08:13:43 [api_server.py:1034] vLLM API server version 0.8.3.dev349+gb8498bc4a
INFO 04-19 08:13:43 [api_server.py:1035] args: Namespace(subparser='serve', model_tag='deepseek-ai/DeepSeek-R1-Distill-Qwen-32B', config='', host='0.0.0.0', port=8100, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='abc-123', lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='deepseek-ai/DeepSeek-R1-Distill-Qwen-32B', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=42, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, 
dispatch_function=<function ServeSubcommand.cmd at 0x730f25e351c0>)
config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 664/664 [00:00<00:00, 8.19MB/s]
INFO 04-19 08:13:55 [config.py:604] This model supports multiple tasks: {'classify', 'embed', 'reward', 'score', 'generate'}. Defaulting to 'generate'.
INFO 04-19 08:13:55 [arg_utils.py:1735] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
WARNING 04-19 08:13:55 [arg_utils.py:1597] The model has a long context length (131072). This may cause OOM during the initial memory profiling phase, or result in low performance due to small KV cache size. Consider setting --max-model-len to a smaller value.
INFO 04-19 08:13:59 [api_server.py:246] Started engine process with PID 225
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.07k/3.07k [00:00<00:00, 38.8MB/s]
tokenizer.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.03M/7.03M [00:00<00:00, 27.0MB/s]
INFO 04-19 08:14:01 [__init__.py:239] Automatically detected platform rocm.
INFO 04-19 08:14:02 [llm_engine.py:242] Initializing a V0 LLM engine (v0.8.3.dev349+gb8498bc4a) with config: model='deepseek-ai/DeepSeek-R1-Distill-Qwen-32B', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-R1-Distill-Qwen-32B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=42, served_model_name=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
generation_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 181/181 [00:00<00:00, 1.97MB/s]
INFO 04-19 08:14:05 [utils.py:746] Port 8100 is already in use, trying port 8101
INFO 04-19 08:14:10 [rocm.py:181] None is not supported in AMD GPUs.
INFO 04-19 08:14:10 [rocm.py:182] Using ROCmFlashAttention backend.
INFO 04-19 08:14:10 [parallel_state.py:957] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-19 08:14:10 [model_runner.py:1110] Starting to load model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B...
WARNING 04-19 08:14:10 [rocm.py:283] Model architecture 'Qwen2ForCausalLM' is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting `VLLM_USE_TRITON_FLASH_ATTN=0`
INFO 04-19 08:14:11 [weight_utils.py:265] Using model weights format ['*.safetensors']
model-00008-of-000008.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.07G/4.07G [00:17<00:00, 239MB/s]
model-00001-of-000008.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8.79G/8.79G [00:35<00:00, 246MB/s]
model-00003-of-000008.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8.78G/8.78G [00:36<00:00, 239MB/s]
model-00004-of-000008.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8.78G/8.78G [00:37<00:00, 235MB/s]
model-00006-of-000008.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8.78G/8.78G [00:37<00:00, 234MB/s]
model-00005-of-000008.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8.78G/8.78G [00:37<00:00, 233MB/s]
model-00002-of-000008.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8.78G/8.78G [00:37<00:00, 233MB/s]
model-00007-of-000008.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8.78G/8.78G [00:37<00:00, 232MB/s]
INFO 04-19 08:14:49 [weight_utils.py:281] Time spent downloading weights for deepseek-ai/DeepSeek-R1-Distill-Qwen-32B: 37.866222 seconds
model.safetensors.index.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64.0k/64.0k [00:00<00:00, 47.3MB/s]
Loading safetensors checkpoint shards:   0% Completed | 0/8 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  12% Completed | 1/8 [00:03<00:21, 3.08s/it]
Loading safetensors checkpoint shards:  25% Completed | 2/8 [00:06<00:19, 3.23s/it]
Loading safetensors checkpoint shards:  38% Completed | 3/8 [00:09<00:15, 3.17s/it]
Loading safetensors checkpoint shards:  50% Completed | 4/8 [00:11<00:10, 2.54s/it]
Loading safetensors checkpoint shards: 62% Completed | 5/8 [00:14<00:08, 2.70s/it]
Loading safetensors checkpoint shards: 75% Completed | 6/8 [00:17<00:05, 2.84s/it]
Loading safetensors checkpoint shards: 88% Completed | 7/8 [00:20<00:02, 2.89s/it]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:23<00:00, 2.93s/it]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:23<00:00, 2.90s/it]
INFO 04-19 08:15:12 [loader.py:458] Loading weights took 23.41 seconds
INFO 04-19 08:15:12 [model_runner.py:1146] Model loading took 61.3008 GiB and 61.724195 seconds
INFO 04-19 08:15:56 [worker.py:295] Memory profiling takes 43.98 seconds
INFO 04-19 08:15:56 [worker.py:295] the current vLLM instance can use total_gpu_memory (191.98GiB) x gpu_memory_utilization (0.90) = 172.79GiB
INFO 04-19 08:15:56 [worker.py:295] model weights take 61.30GiB; non_torch_memory takes 0.96GiB; PyTorch activation peak memory takes 25.45GiB; the rest of the memory reserved for KV Cache is 85.07GiB.
INFO 04-19 08:15:56 [executor_base.py:112] # rocm blocks: 21777, # CPU blocks: 1024
INFO 04-19 08:15:56 [executor_base.py:117] Maximum concurrency for 131072 tokens per request: 2.66x
INFO 04-19 08:15:57 [model_runner.py:1456] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:09<00:00, 3.61it/s]
INFO 04-19 08:16:07 [model_runner.py:1598] Graph capturing finished in 10 secs, took 0.24 GiB
INFO 04-19 08:16:07 [llm_engine.py:448] init engine (profile, create kv cache, warmup model) took 54.57 seconds
WARNING 04-19 08:16:07 [config.py:1094] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 04-19 08:16:07 [serving_chat.py:117] Using default chat sampling params from model: {'temperature': 0.6, 'top_p': 0.95}
INFO 04-19 08:16:07 [serving_completion.py:61] Using default completion sampling params from model: {'temperature': 0.6, 'top_p': 0.95}
INFO 04-19 08:16:07 [api_server.py:1081] Starting vLLM API server on http://0.0.0.0:8100
INFO 04-19 08:16:07 [launcher.py:26] Available routes are:
INFO 04-19 08:16:07 [launcher.py:34] Route: /openapi.json, Methods: HEAD, GET
INFO 04-19 08:16:07 [launcher.py:34] Route: /docs, Methods: HEAD, GET
INFO 04-19 08:16:07 [launcher.py:34] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 04-19 08:16:07 [launcher.py:34] Route: /redoc, Methods: HEAD, GET
INFO 04-19 08:16:07 [launcher.py:34] Route: /health, Methods: GET
INFO 04-19 08:16:07 [launcher.py:34] Route: /load, Methods: GET
INFO 04-19 08:16:07 [launcher.py:34] Route: /ping, Methods: POST, GET
INFO 04-19 08:16:07 [launcher.py:34] Route: /tokenize, Methods: POST
INFO 04-19 08:16:07 [launcher.py:34] Route: /detokenize, Methods: POST
INFO 04-19 08:16:07 [launcher.py:34] Route: /v1/models, Methods: GET
INFO 04-19 08:16:07 [launcher.py:34] Route: /version, Methods: GET
INFO 04-19 08:16:07 [launcher.py:34] Route: /v1/chat/completions, Methods: POST
INFO 04-19 08:16:07 [launcher.py:34] Route: /v1/completions, Methods: POST
INFO 04-19 08:16:07 [launcher.py:34] Route: /v1/embeddings, Methods: POST
INFO 04-19 08:16:07 [launcher.py:34] Route: /pooling, Methods: POST
INFO 04-19 08:16:07 [launcher.py:34] Route: /score, Methods: POST
INFO 04-19 08:16:07 [launcher.py:34] Route: /v1/score, Methods: POST
INFO 04-19 08:16:07 [launcher.py:34] Route: /v1/audio/transcriptions, Methods: POST
INFO 04-19 08:16:07 [launcher.py:34] Route: /rerank, Methods: POST
INFO 04-19 08:16:07 [launcher.py:34] Route: /v1/rerank, Methods: POST
INFO 04-19 08:16:07 [launcher.py:34] Route: /v2/rerank, Methods: POST
INFO 04-19 08:16:07 [launcher.py:34] Route: /invocations, Methods: POST
INFO 04-19 08:16:07 [launcher.py:34] Route: /metrics, Methods: GET
INFO: Started server process [71]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: 223.160.206.192:62698 - "OPTIONS /v1/models HTTP/1.1" 200 OK
INFO: 223.160.206.192:62698 - "GET /v1/models HTTP/1.1" 200 OK
INFO: 223.160.206.192:62699 - "GET /v1/models HTTP/1.1" 200 OK
INFO: 223.160.206.192:62700 - "GET /v1/models HTTP/1.1" 200 OK
INFO: 223.160.206.192:62701 - "GET /v1/models HTTP/1.1" 200 OK
INFO: 223.160.206.192:62702 - "GET /v1/models HTTP/1.1" 200 OK
INFO: 223.160.206.192:62703 - "GET /v1/models HTTP/1.1" 200 OK
INFO: 223.160.206.192:62704 - "OPTIONS /v1/chat/completions HTTP/1.1" 200 OK
INFO 04-19 08:30:50 [chat_utils.py:396] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
INFO 04-19 08:30:50 [logger.py:39] Received request chatcmpl-fa95118d98274ce79820a32a2c0fbb1f: prompt: '<|begin▁of▁sentence|><|User|>你是谁?<|Assistant|><think>\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=-1, min_p=0.0, ppl_measurement=False, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=131064, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO: 223.160.206.192:62704 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 04-19 08:30:50 [engine.py:310] Added request chatcmpl-fa95118d98274ce79820a32a2c0fbb1f.
INFO 04-19 08:30:53 [metrics.py:489] Avg prompt throughput: 1.6 tokens/s, Avg generation throughput: 14.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 04-19 08:30:54 [logger.py:39] Received request chatcmpl-439caf5427184211b2c8a8e61a1db25d: prompt: '<|begin▁of▁sentence|><|User|>### Task:\nGenerate a concise, 3-5 word title with an emoji summarizing the chat history.\n### Guidelines:\n- The title should clearly represent the main theme or subject of the conversation.\n- Use emojis that enhance understanding of the topic, but avoid quotation marks or special formatting.\n- Write the title in the chat\'s primary language; default to English if multilingual.\n- Prioritize accuracy over excessive creativity; keep it clear and simple.\n### Output:\nJSON format: { "title": "your concise title here" }\n### Examples:\n- { "title": "📉 Stock Market Trends" },\n- { "title": "🍪 Perfect Chocolate Chip Recipe" },\n- { "title": "Evolution of Music Streaming" },\n- { "title": "Remote Work Productivity Tips" },\n- { "title": "Artificial Intelligence in Healthcare" },\n- { "title": "🎮 Video Game Development Insights" }\n### Chat History:\n<chat_history>\nUSER: 你是谁?\nASSISTANT: 您好!我是由中国的深度求索(DeepSeek)公司开发的智能助手DeepSeek-R1。如您有任何任何问题,我会尽我所能为您提供帮助。\n</think>\n\n您好!我是由中国的深度求索(DeepSeek)公司开发的智能助手DeepSeek-R1。如您有任何任何问题,我会尽我所能为您提供帮助。\n</chat_history><|Assistant|><think>\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=-1, min_p=0.0, ppl_measurement=False, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1000, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 04-19 08:30:54 [engine.py:310] Added request chatcmpl-439caf5427184211b2c8a8e61a1db25d.
INFO 04-19 08:30:58 [metrics.py:489] Avg prompt throughput: 57.5 tokens/s, Avg generation throughput: 22.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO: 223.160.206.192:62704 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 04-19 08:31:01 [logger.py:39] Received request chatcmpl-8c4f8f4fc3fd40b48f62b822c7154e05: prompt: '<|begin▁of▁sentence|><|User|>### Task:\nGenerate 1-3 broad tags categorizing the main themes of the chat history, along with 1-3 more specific subtopic tags.\n\n### Guidelines:\n- Start with high-level domains (e.g. Science, Technology, Philosophy, Arts, Politics, Business, Health, Sports, Entertainment, Education)\n- Consider including relevant subfields/subdomains if they are strongly represented throughout the conversation\n- If content is too short (less than 3 messages) or too diverse, use only ["General"]\n- Use the chat\'s primary language; default to English if multilingual\n- Prioritize accuracy over specificity\n\n### Output:\nJSON format: { "tags": ["tag1", "tag2", "tag3"] }\n\n### Chat History:\n<chat_history>\nUSER: 你是谁?\nASSISTANT: 您好!我是由中国的深度求索(DeepSeek)公司开发的智能助手DeepSeek-R1。如您有任何任何问题,我会尽我所能为您提供帮助。\n</think>\n\n您好!我是由中国的深度求索(DeepSeek)公司开发的智能助手DeepSeek-R1。如您有任何任何问题,我会尽我所能为您提供帮助。\n</chat_history><|Assistant|><think>\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=-1, min_p=0.0, ppl_measurement=False, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=130822, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 04-19 08:31:01 [engine.py:310] Added request chatcmpl-8c4f8f4fc3fd40b48f62b822c7154e05.
INFO 04-19 08:31:03 [metrics.py:489] Avg prompt throughput: 49.0 tokens/s, Avg generation throughput: 26.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 04-19 08:31:08 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 46.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO: 223.160.206.192:62704 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 04-19 08:31:20 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.2 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 04-19 08:31:30 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 04-19 08:35:21 [logger.py:39] Received request chatcmpl-bb908ecb4c1f46749eb5145d9272b85d: prompt: '<|begin▁of▁sentence|><|User|>你是谁?<|Assistant|>\n\n您好!我是由中国的深度求索(DeepSeek)公司开发的智能助手DeepSeek-R1。如您有任何任何问题,我会尽我所能为您提供帮助。<|end▁of▁sentence|><|User|>虎鲸是鱼吗?<|Assistant|><think>\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=-1, min_p=0.0, ppl_measurement=False, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=131019, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO: 223.160.206.192:62706 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 04-19 08:35:21 [engine.py:310] Added request chatcmpl-bb908ecb4c1f46749eb5145d9272b85d.
INFO 04-19 08:35:25 [metrics.py:489] Avg prompt throughput: 10.6 tokens/s, Avg generation throughput: 22.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 04-19 08:35:30 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 46.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 04-19 08:35:40 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
The server-side environment itself runs in a container.
Port
root@atl1g2r2u16gpu:/workspace# echo $VLLM_PORT
xxxx
root@atl1g2r2u16gpu:/workspace#
The service can be reached from my local machine via the server's public IP.
Configure the connection in Open WebUI:
http://134.199.131.6:xxxx/v1
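Before entering this base URL and the API key (the abc-123 value passed to vllm serve) into Open WebUI, the endpoint can be sanity-checked with curl; a minimal example, with xxxx standing in for the masked port:

curl http://134.199.131.6:xxxx/v1/models -H "Authorization: Bearer abc-123"   # replace xxxx with the real $VLLM_PORT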
Verification
Open WebUI can access the service without problems, and the server's token generation speed is quite good.
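The same endpoint can also be exercised without Open WebUI. A minimal chat completion request mirroring the "你是谁?" request seen in the log above (again with xxxx standing in for the masked port):

curl http://134.199.131.6:xxxx/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer abc-123" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
        "messages": [{"role": "user", "content": "你是谁?"}],
        "max_tokens": 256
      }'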
GPUs supported by ROCm
See the official documentation:
https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-distributions
As of June 2025, the GPUs officially supported by ROCm include:
| GPU | Architecture | LLVM target | Support |
| --- | --- | --- | --- |
| AMD Radeon RX 9070 XT | RDNA4 | gfx1201 | ✅ [5] |
| AMD Radeon RX 9070 GRE | RDNA4 | gfx1201 | ✅ [5] |
| AMD Radeon RX 9070 | RDNA4 | gfx1201 | ✅ [5] |
| AMD Radeon RX 9060 XT | RDNA4 | gfx1200 | ✅ [5] |
| AMD Radeon RX 7900 XTX | RDNA3 | gfx1100 | ✅ |
| AMD Radeon RX 7900 XT | RDNA3 | gfx1100 | ✅ |
| AMD Radeon RX 7900 GRE | RDNA3 | gfx1100 | ✅ [5] |
| AMD Radeon RX 7800 XT | RDNA3 | gfx1101 | ✅ [5] |
| AMD Radeon VII | GCN5.1 | gfx906 | ❌ |
The good news is that the new RDNA4 cards are finally supported, so there is no longer any need to fall back on older-generation cards.
The bad news is that after all this waiting, the Radeon 780M iGPU in my 7840HS is still not officially supported (though there are modified builds on GitHub; see the links and the quick check below):
https://www.amd.com/en/developer/resources/rocm-hub/hip-sdk.html
https://github.com/ByronLeeeee/Ollama-For-AMD-Installer
https://github.com/likelovewant/ollama-for-amd
https://github.com/likelovewant/ROCmLibs-for-gfx1103-AMD780M-APU
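To see which LLVM target a given GPU reports (and therefore where it sits in the table above), rocminfo can be grepped on a Linux box with ROCm installed; a small illustrative check, not specific to any of the builds linked above:

rocminfo | grep -m1 "gfx"   # the MI300X above reports gfx942; the 780M iGPU corresponds to gfx1103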
Next I will try the modified ollama build and compare it with LM Studio's Vulkan backend to see which one delivers the better token generation speed.
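For that comparison, one low-effort way to read tokens per second from ollama is its --verbose flag, which prints prompt/eval rates after each reply; a sketch, with the model tag purely as a placeholder:

ollama run deepseek-r1:7b --verbose   # placeholder tag; check the "eval rate" line for tokens/s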
Please credit the source when reposting: https://lizhiyong.blog.csdn.net/article/details/148623788