当前位置：首页 > news >正文

大模型压测方法

news 2025/7/16 3:58:37

一、背景

经过几天的实际使用，发现当并发数达到一定阈值时，性能出现下降。为进一步评估和优化集群性能，现对已部署的 DeepSeek-r1 推理集群进行深入的性能压测。选型LLM 性能压测工具经过调研，选择推理引擎 SGLang 自带的 sglang.bench_serving 基准测试工具，以及 Locust 和 EvalScope 两款成熟的性能测试工具，进行全面的性能评估。

二、Locust

官网地址：https://locust.io
Locust 是一款开源的性能和负载测试工具，主要用于测试 HTTP 和其他协议的性能。它的最大优势是，用户可以用简单的 Python 代码来定义测试，灵活且易于使用，并且提供Web UI界面，在测试过程中，可以实时查看吞吐量、响应时间和错误情况，或者将数据导出以供后续分析。

#安装包
~# pip3 install locust

~# locust --version
locust 2.32.9 from /usr/local/lib/python3.10/dist-packages/locust (Python 3.10.12)

##压测脚本
from locust import HttpUser,task, between
import  json
class  LLMUser(HttpUser):
   wait_time = between(1,2) # 每个用户请求的间隔时间
   @task
   def generate_text(self):
      headers = {"Content-Type":"application/json"}
      data = {"model":"DeepSeek-R1:1.5b-qwen-distill-q4_K_M","prompt":"简单介绍一下你自己","stream": False}
      self.client.post("/api/generate", headers=headers,  json=data, timeout=60)

###压测方法
# locust -f locustfile.py --host http://10.0.x.x:32169

# 日志如下：
[2025-02-17 17:35:34,204] mgmt-ser-14-128/INFO/locust.main: Starting Locust 2.32.9
[2025-02-17 17:35:34,204] mgmt-ser-14-128/INFO/locust.main: Starting web interface at http://0.0.0.0:8089

在这里插入图片描述

三、SGLang Bench

:~# python3 -m sglang.bench_one_batch_server --model None --base-url http://127.0.0.1:8000 --batch-size 1 --input-len 128 --output-len 128

#结果如下：
INFO 02-18 18:10:16 __init__.py:190] Automatically detected platform cuda.
batch size: 16
latency: 2.46 s
output throughput: 104.01 token/s
(input + output) throughput: 6760.66 token/s
batch size: 1
latency: 4.64 s
output throughput: 27.61 token/s
(input + output) throughput: 55.22 token/s
##############################################################
:~# python3 -m sglang.bench_one_batch_server --model None --base-url http://127.0.0.1:8000 --batch-size 10 --input-len 1280 --output-len 1280

#结果如下
INFO 02-18 18:10:55 __init__.py:190] Automatically detected platform cuda.
batch size: 16
latency: 2.50 s
output throughput: 102.31 token/s
(input + output) throughput: 6650.28 token/s
batch size: 10
latency: 58.57 s
output throughput: 218.54 token/s
(input + output) throughput: 437.07 token/s

安装

四、EvalScope

官网：https://evalscope.readthedocs.io
EvalScope是魔搭社区官方推出的模型评测与性能基准测试框架，内置多个常用测试基准和评测指标，如MMLU、CMMLU、C-Eval、GSM8K、ARC、HellaSwag、TruthfulQA、MATH和HumanEval等；支持多种类型的模型评测，包括LLM、多模态LLM、embedding模型和reranker模型。EvalScope还适用于多种评测场景，如端到端RAG评测、竞技场模式和模型推理性能压测等。

~# pip3 install evalscope -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

# 安装额外依赖
~# pip install evalscope[perf] -U  -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

###通过evalscope命令行测试：
######################################################
#!/usr/bin/env bash

evalscope perf \
 --parallel 1 \
 --url http://10.233.57.205:8000/v1/completions \
 --model deepseek-r1 \
 --log-every-n-query 5 \
 --connect-timeout 6000 \
 --read-timeout 6000 \
 --max-tokens 2048 \
 --min-tokens 2048 \
 --api openai \
 --dataset speed_benchmark \
 --debug
 ###########################################
##通过python脚本压测
 #!/usr/bin/env python

from evalscope.perf.main import run_perf_benchmark

task_cfg = {"url": "http://127.0.0.1:11434/v1/chat/completions",
            "parallel": 1,
            "model": "deepseek-r1",
            "number": 15,
            "api": "openai",
            "dataset": "openqa",
            "stream": True}
run_perf_benchmark(task_cfg)


注：speed_benchmark: 测试[1, 6144, 14336, 30720]长度的prompt，固定输出2048个token。

五、性能指标

通常把评估大模型服务的指标分为两类。一类是和系统级别的指标，包括input token per sec，output token per sec；另一类是请求级别的指标，TPOT，TTFT。

5.1系统级别指标用于衡量整个大模型服务的整体性能和吞吐能力，通常是从全局视角评估系统的效率。*
InputToken Per Sec（输入 Token 吞吐量）：系统每秒能够处理的输入 token 数量。
OutputToken Per Sec（输出 Token 吞吐量）：系统每秒能够生成的输出 token 数量。大家通常说的 TPS（Token Per Sec）一般都是指 Output TPS。
Concurrency（并发数）：系统在同一时间正在处理的请求数量。

5.2请求级别指标用于衡量单个请求的性能和响应效率，通常是从用户视角评估系统的 SLA。
TTFT (Time to First Token，首 Token 延迟) 从发送请求到系统生成第一个输出 token 的时间。衡量系统对单个请求的响应速度，TTFT 越低，用户体验越好。
TPOT (Time Per OutputToken，单 Token 生成时间) ：系统生成每个输出 token 所需的时间。TPOT 越低，模型生成文本的速度越快。一般 TPOT 需要在 50ms 以内，否则会跟不上人眼的阅读速度，影响阅读体验。

**5.3吞吐量直接反映成本。**同样的时间能吐的字越多，单个 token 的成本越低。 TPOT 和 TTFT 直接反映服务质量（SLA），更低的 TTFT 和 TPOT 会带来更好的体验。更低的价格和更高的服务质量，二者常常不可得兼。他们之间的桥梁是Concurrency。max-concurrency可以近似看成 decode 阶段的 batch size ，一个过大的 max-concurrency 会提升 GPU 利用率带来更高吞吐更低的成本，但是很多 request 打包在一起导致每个 request 都很高的 TPOT 。

六、参考资料

https://mp.weixin.qq.com/s/XAu5v0LAofvR47vNaK5tMw
https://mp.weixin.qq.com/s/q9oMW-FjfCz5qQrDO0whIg
 https://modelscope.cn/docs
 https://docs.locust.io/en/stable/quickstart.html
 https://datacrunch.io/blog/deepseek-v3-llm-nvidia-h200-gpu-inference-benchmarking• 
 https://docs.sglang.ai/references/benchmark_and_profiling.html
 https://evalscope.readthedocs.io/zh-cn/latest/user_guides/stress_test/speed_benchmark.html#
 https://www.modelscope.cn/datasets/gliang1001/ShareGPT_V3_unfiltered_cleaned_split/files
https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-four-a1008-nodes@

查看全文

http://www.dtcms.com/a/34961.html