A Complete Guide to LLM Deployment: Building a Highly Available AI Service with Docker + FastAPI + Nginx
With the rapid progress of large language models (LLMs), deploying them to production as stable, efficient services has become a key step in putting AI to work. A robust, scalable, and highly available serving architecture noticeably improves user experience and keeps the business running.
This article is a detailed deployment guide: we use Docker to containerize the model inference service, FastAPI as the high-performance API framework, and Nginx as the reverse proxy and load balancer, and combine them into a highly available AI service. We cover the whole journey: wrapping the model, turning it into a service, containerizing it, deploying it, and designing for high availability.
1. Why the Docker + FastAPI + Nginx Combination?
Among the many possible stacks, why pick these three for serving large models?
Docker (containerization):
Environment isolation and consistency: the model and all of its dependencies (libraries, Python version, CUDA, etc.) run inside a portable container, unaffected by the host environment. This largely closes the gap between "it works on my machine" and "it works in production".
Easy to ship and manage: Docker images are straightforward to distribute, store, and deploy, whether on a single machine or in a large cluster (e.g. Kubernetes).
Resource control: Docker can cap a container's CPU and memory usage, which helps with resource management.
FastAPI (high-performance Python web framework):
Excellent performance: built on Starlette and Pydantic, FastAPI is one of the fastest Python web frameworks and is well suited to the I/O-heavy request handling around LLM inference.
Modern and easy to use: Python 3.7+ type hints, auto-generated interactive API docs (Swagger/OpenAPI), clear code, high development velocity.
Async support: built-in async/await lets it handle many concurrent HTTP requests efficiently, which matters when each request may trigger a slow model inference call.
Strong data validation: Pydantic validates request models out of the box, which makes handling OpenAI-compatible API parameters easy.
Nginx (web server and reverse proxy):
High-performance reverse proxy: Nginx is known for its event-driven architecture and excellent concurrency, and can absorb large numbers of simultaneous HTTP requests.
Load balancing: it can spread incoming requests across multiple FastAPI instances, enabling horizontal scaling and improving overall throughput and availability.
SSL/TLS termination: it handles HTTPS and forwards decrypted requests to the application, taking that load off the application layer.
Static file serving: if the API ships static assets (e.g. the Swagger UI), Nginx serves them efficiently.
Request/response caching: caching can be configured where appropriate.
Health checks: Nginx can monitor backend health and route traffic only to healthy instances.
Overall architecture:
<TEXT>
+-------------+       +-----------------------+       +--------------------+
| User/Client | --->  | Nginx                 | --->  | FastAPI            |
+-------------+       | (Reverse Proxy,       |       | (Model Inference   |
                      |  Load Balancer, SSL)  |       |  API Service)      |
                      +-----------------------+       +--------------------+
                                                       (Multiple Instances)
                                                                |
                                                      +--------------------+
                                                      | Docker Container   |
                                                      +--------------------+
                                                                |
                                                      +--------------------+
                                                      | GPU Hardware       |
                                                      +--------------------+
2. Step 1: Model Service - FastAPI API Design
First, prepare a service that loads and runs your model and exposes it over HTTP with FastAPI. As before, we use an OpenAI-compatible API wrapper as the example.
Python code (app/main.py):
(This refers to the app/main.py from the earlier "Deploying LLMs on K8s" article. Make sure it is complete: it should handle GET /v1/models and POST /v1/chat/completions and support streaming output. A minimal, illustrative skeleton is sketched after the key points below.)
Key points to recall:
OpenAI compatibility: follow the OpenAI request/response format strictly (messages, role, content, choices, usage, stream, and so on).
Model inference: the infer and stream_infer methods of CustomModelService wrap the actual model call. In production this is where your optimized inference backend lives (e.g. vLLM, TensorRT-LLM, Triton Inference Server, or hand-tuned PyTorch/TensorFlow inference code).
Async I/O: FastAPI with async/await is very efficient for this I/O-bound inference workload.
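The following is a minimal, illustrative skeleton of such an app/main.py, not the full version from the earlier article. The DummyModelService, its placeholder token stream, and the field defaults are stand-ins that you would replace with your real CustomModelService and inference backend.
<PYTHON>
# app/main.py (illustrative skeleton; swap DummyModelService for your real backend)
import json
import time
import uuid
from typing import List, Optional

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI(title="LLM API")

class ChatMessage(BaseModel):
    role: str
    content: str

class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[ChatMessage]
    temperature: Optional[float] = 1.0
    stream: Optional[bool] = False

class DummyModelService:
    """Placeholder; replace with vLLM / TensorRT-LLM / your own inference code."""
    async def infer(self, messages: List[ChatMessage]) -> str:
        return "This is a placeholder answer."

    async def stream_infer(self, messages: List[ChatMessage]):
        for token in ["This ", "is ", "a ", "placeholder ", "answer."]:
            yield token

model_service = DummyModelService()

@app.get("/v1/models")
async def list_models():
    return {"object": "list", "data": [{"id": "your-model-name", "object": "model"}]}

@app.post("/v1/chat/completions")
async def chat_completions(req: ChatCompletionRequest):
    if req.stream:
        # Server-Sent Events, OpenAI-style: one JSON chunk per delta, then [DONE]
        async def event_stream():
            async for token in model_service.stream_infer(req.messages):
                chunk = {
                    "id": f"chatcmpl-{uuid.uuid4().hex}",
                    "object": "chat.completion.chunk",
                    "created": int(time.time()),
                    "model": req.model,
                    "choices": [{"index": 0, "delta": {"content": token}, "finish_reason": None}],
                }
                yield f"data: {json.dumps(chunk)}\n\n"
            yield "data: [DONE]\n\n"
        return StreamingResponse(event_stream(), media_type="text/event-stream")

    answer = await model_service.infer(req.messages)
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": req.model,
        "choices": [{"index": 0, "message": {"role": "assistant", "content": answer}, "finish_reason": "stop"}],
        "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
    }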
3. Step 2: Dockerize Your FastAPI Application
The next step is to build a Docker image for the FastAPI service.
app/requirements.txt (make sure all required libraries are listed):
<TEXT>
fastapi
uvicorn[standard] # 'standard' pulls in uvloop, httptools and websockets for better performance
pydantic
torch # or tensorflow, transformers, etc.
numba # for potential performance optimizations
# other dependencies for your model inference
Dockerfile
<DOCKERFILE>
# --- Base Image ---
# Use a base image with Python and necessary libraries.
# For GPU workloads, use an NVIDIA CUDA-enabled base image.
# Example using CUDA 11.8 on Ubuntu 22.04 (Python 3.10 is installed via apt below):
FROM nvidia/cuda:11.8.0-base-ubuntu22.04
# Set environment variables
ENV PYTHONUNBUFFERED=1 \
    PORT=8000 \
    MODEL_NAME="your-model-name"
# Adjust the model path or loading mechanism to your needs, e.g.:
# ENV MODEL_PATH="/app/models/your_model"
# Configure user and workdir
ARG USER=appuser
ARG GROUP=appgroup
RUN groupadd --gid 1000 ${GROUP} && useradd --uid 1000 --gid ${GROUP} --shell /bin/bash --create-home ${USER}
WORKDIR /app
# Install system dependencies (e.g., build tools, CUDA specific libs if not in base)
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
python3-pip \
git \
# Add any other necessary system packages here
&& rm -rf /var/lib/apt/lists/*
# Install Python dependencies
COPY app/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application source code
COPY app/ /app/
# OPTIONAL: Copy large model files if they are not dynamically loaded
# If your model is very large, consider using volumes or init containers in K8s
# COPY models/ /app/models/
# Set ownership to the non-root user we created
RUN chown -R ${USER}:${GROUP} /app
# Switch to the non-root user
USER ${USER}
# Run the application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
# Note: For GPU bound models, often 1 worker is sufficient. If your FastAPI doesn't
# directly use awaitable model calls (e.g., blocking calls run in thread pool),
# you might need more workers, but this increases GPU resource demands.
# Uvicorn with 'standard' extras includes httptools and websockets for better performance.
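Optionally, a .dockerignore keeps bulky or irrelevant files out of the build context; the entries below are only a suggested starting point for a layout like this one.
<TEXT>
# .dockerignore (suggested)
.git
__pycache__/
*.pyc
*.ipynb
.env
# keep local model weights out of the image if they are mounted or downloaded at runtime
models/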
Build and push the image:
<BASH>
# Assuming you are in the project root directory
docker build -t your-dockerhub-username/your-llm-api:v1.0.0 .
docker push your-dockerhub-username/your-llm-api:v1.0.0
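Before pushing, you can smoke-test the image locally. This assumes the NVIDIA Container Toolkit is installed so that --gpus works, and that port 8000 is free on your machine.
<BASH>
docker run --rm --gpus all -p 8000:8000 \
  -e MODEL_NAME="your-model-name" \
  your-dockerhub-username/your-llm-api:v1.0.0

# In another terminal:
curl http://localhost:8000/v1/models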
4. Step 3: Nginx Configuration and Integration
Nginx sits in front of the AI service: it receives client requests, load-balances them, and forwards them to the backend FastAPI instances.
a. The Nginx image
We can take a stock Nginx image and deploy it alongside the FastAPI application. In Kubernetes, however, it is more common to run Nginx as part of an Ingress Controller or behind a LoadBalancer Service, or directly inside a Pod.
For brevity, we first consider exposing the service inside a single Pod or via a Kubernetes Service; a more advanced deployment uses the Nginx Ingress Controller.
b. The Nginx configuration file (nginx.conf)
Below is an example nginx.conf that proxies traffic to the FastAPI service.
<NGINX>
# Custom Nginx configuration
# Note: this is a simplified example; harden it for production.
# In Kubernetes this logic usually lives in the Ingress Controller (or a LoadBalancer),
# and the Service already load-balances across the FastAPI pods. The fragment below is
# illustrative for the case where Nginx itself proxies to the 'llm-api-service' Service
# on port 80, or to a separate FastAPI process/container (in a single-container setup
# you might use supervisord or a similar process manager to run both).
events {
    worker_connections 1024;
    multi_accept on;
}
http {
    include mime.types;
    default_type application/json;
    sendfile on;
    keepalive_timeout 65;

    upstream fastapi_llm_api {
        # 'llm-api-service' is the Kubernetes Service name from the earlier article;
        # the Service itself already load-balances across healthy pods.
        # If Nginx runs in the *same* pod as FastAPI (uncommon in production):
        # server 127.0.0.1:8000;
        # If Nginx is a separate proxy in front of the Service:
        server llm-api-service.default.svc.cluster.local:80; # resolved via Kubernetes internal DNS
        # Multiple FastAPI instances, if you are not using a K8s Service for balancing:
        # server fastapi-app-1:8000;
        # server fastapi-app-2:8000;
        # The load balancing method can be changed (round-robin by default, least_conn, ...)
        # keepalive 32; # Optional: keep connections to the backend open
    }

    server {
        listen 80;
        server_name your-ai-service.example.com; # Your domain name

        # Simple health endpoint that probes a known-good path on the backend
        location /healthz {
            access_log off;
            proxy_pass http://fastapi_llm_api/v1/models;
            proxy_intercept_errors on;
        }
        # Note: active health checks (the health_check directive) require NGINX Plus;
        # open-source Nginx only does passive checks via max_fails/fail_timeout on upstream servers.

        location / {
            proxy_pass http://fastapi_llm_api; # Forward to the upstream group
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            # Recommended for streaming responses:
            # proxy_request_buffering off;
            # proxy_buffering off;
            # proxy_http_version 1.1; # Ensure HTTP/1.1 for keep-alive
            # Long-lived connections for streaming
            proxy_read_timeout 3600s; # Adjust as needed
            proxy_send_timeout 3600s;
            client_max_body_size 100M; # Adjust for large inputs if applicable
        }

        # If you want Nginx to expose the Swagger UI (FastAPI serves it at /docs)
        location /docs {
            proxy_pass http://fastapi_llm_api/docs;
            proxy_set_header Host $host;
        }
        location /openapi.json {
            proxy_pass http://fastapi_llm_api/openapi.json;
            proxy_set_header Host $host;
        }
    }
    # Add further server blocks for SSL, etc.
}
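For the single-host path mentioned in the comments above (no Kubernetes), a minimal docker-compose sketch might look like the following. It assumes the image built in Step 2 and a local nginx.conf whose upstream points at fastapi:8000 (the Compose service name) instead of the cluster DNS name; adjust names and the GPU reservation to your environment.
<YAML>
# docker-compose.yml (illustrative single-host setup)
services:
  fastapi:
    image: your-dockerhub-username/your-llm-api:v1.0.0
    environment:
      - MODEL_NAME=your-model-name
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  nginx:
    image: nginx:1.25
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - fastapi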
Notes:
In Kubernetes we usually do not bake Nginx and FastAPI into one Docker image (except as a simple sidecar). The standard layout is:
FastAPI application Pods: run the FastAPI service.
Kubernetes Service: gives the FastAPI Pods a stable ClusterIP and load-balances at the service level.
Nginx Ingress Controller: deployed as a DaemonSet or Deployment, it is itself one or more Nginx instances that accept external traffic, evaluate Ingress resources, and route requests to the backing Services.
For the goal of this article, a highly available AI service, we focus on getting Nginx's capabilities through an Ingress Controller on EKS/GKE/AKS or a similar environment.
5. Step 4: Kubernetes Deployment & Service (FastAPI App)
This part mirrors the k8s/deployment.yaml and k8s/service.yaml from the earlier "Deploying LLMs on K8s" article, but we bump the replica count a little to demonstrate load balancing.
k8s/fastapi-deployment.yaml
<YAML>
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-fastapi-app # Renamed to be clearer
  namespace: default
  labels:
    app: llm-fastapi
spec:
  replicas: 3 # Start with 3 replicas to demonstrate load balancing
  selector:
    matchLabels:
      app: llm-fastapi
  template:
    metadata:
      labels:
        app: llm-fastapi
    spec:
      containers:
        - name: llm-fastapi-container
          image: your-dockerhub-username/your-llm-api:v1.0.0 # Your FastAPI image
          ports:
            - containerPort: 8000
          resources:
            # Request slightly more resources potentially, and importantly, specify GPU
            requests:
              cpu: "1" # Request 1 CPU core
              memory: "4Gi" # Request 4GB RAM
              nvidia.com/gpu: 1 # IMPORTANT: Request 1 GPU
            limits:
              cpu: "2" # Limit to 2 CPU cores
              memory: "8Gi" # Limit to 8GB RAM
              nvidia.com/gpu: 1 # Limit to 1 GPU
          readinessProbe:
            httpGet:
              path: /v1/models
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /v1/models
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 15
          env:
            - name: MODEL_NAME
              value: "your-model-name"
k8s/fastapi-service.yaml
<YAML>
apiVersion: v1
kind: Service
metadata:
  name: llm-fastapi-svc # Renamed for clarity
  namespace: default
spec:
  selector:
    app: llm-fastapi # Matches the Deployment's pod labels
  ports:
    - protocol: TCP
      port: 80 # Internal service port
      targetPort: 8000 # Container port
  type: ClusterIP # Keep ClusterIP; the Ingress will expose it
6. Step 5: Nginx Ingress Controller Deployment
Installing an Ingress Controller is the standard way to expose HTTP/HTTPS services from a Kubernetes cluster. We assume you already have a cluster and will install the NGINX Ingress Controller into it.
Install the Nginx Ingress Controller (Helm recommended):
<BASH>
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
# Run 2 controller replicas for high availability. DaemonSet + hostNetwork is a common
# pattern for edge/bare-metal nodes; on a cloud provider, prefer kind=Deployment with a
# LoadBalancer Service (see the commented option below).
helm install nginx-ingress ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --set controller.replicaCount=2 \
  --set controller.nodeSelector."kubernetes\.io/os"="linux" \
  --set controller.kind="DaemonSet" \
  --set controller.hostNetwork=true
# Expose the controller via a cloud LoadBalancer instead:
# --set controller.service.type="LoadBalancer"
Note: controller.hostNetwork=true is meant for the DaemonSet setup; it lets the Ingress Controller Pods listen directly on the node's network interfaces. If you choose a Deployment, or you run on a cloud provider, you would normally set controller.service.type="LoadBalancer" to get an external IP instead.
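After installation you can verify how the controller is exposed; the exact resource names depend on the Helm release name used above.
<BASH>
kubectl get pods -n ingress-nginx -o wide
kubectl get svc -n ingress-nginx   # with type=LoadBalancer, look for the EXTERNAL-IP column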
Ingress Resource Definition
k8s/ingress.yaml
<YAML>
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-api-ingress
  namespace: default
  annotations:
    # Useful annotations for an LLM API behind the NGINX Ingress Controller
    # (all commented out here; enable what you need):
    # nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"   # long timeout for streaming
    # nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    # nginx.ingress.kubernetes.io/proxy-body-size: "100m"
    # nginx.ingress.kubernetes.io/proxy-buffering: "off"       # deliver streamed responses immediately
    # nginx.ingress.kubernetes.io/ssl-redirect: "true"         # if a TLS certificate is configured
    # The rule below points at the K8s Service 'llm-fastapi-svc' on port 80;
    # /docs and /openapi.json are proxied through the same Ingress as well.
spec:
  ingressClassName: nginx # Ensures this Ingress is handled by our Nginx Ingress Controller
  rules:
    - host: ai-service.your-domain.com # Replace with your actual domain name
      http:
        paths:
          - path: / # Route all traffic meant for the AI service
            pathType: Prefix
            backend:
              service:
                name: llm-fastapi-svc # The Kubernetes Service for your FastAPI app
                port:
                  number: 80 # The port exposed by the Service
  # Optional: TLS configuration for HTTPS
  # tls:
  #   - hosts:
  #       - ai-service.your-domain.com
  #     secretName: your-tls-secret # A Kubernetes Secret containing your TLS cert and key
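If you enable the commented-out tls section, the referenced Secret can be created from an existing certificate/key pair, for example:
<BASH>
kubectl create secret tls your-tls-secret \
  --cert=path/to/fullchain.pem \
  --key=path/to/privkey.pem \
  --namespace default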
Deploy the Ingress:
<BASH>
kubectl apply -f k8s/ingress.yaml
Deploy the FastAPI app and Service:
<BASH>
kubectl apply -f k8s/fastapi-deployment.yaml
kubectl apply -f k8s/fastapi-service.yaml
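A quick sanity check that the pieces are running and wired together:
<BASH>
kubectl get pods -l app=llm-fastapi        # the 3 replicas should become Ready
kubectl get svc llm-fastapi-svc
kubectl get ingress llm-api-ingress        # shows the address assigned by the controller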
Testing:
Configure DNS: point ai-service.your-domain.com at the IP address exposed by your cluster's Ingress Controller (the LoadBalancer IP or a node IP).
Send a request:
<BASH>
curl -X POST \
http://ai-service.your-domain.com/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "your-model-name",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Kubernetes?"}
],
"temperature": 0.7
}'
You can also test streaming output:
<BASH>
curl -N -X POST \
http://ai-service.your-domain.com/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "your-model-name",
"messages": [
{"role": "user", "content": "Explain the concept of elasticity."}
],
"stream": true
}'
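Because the service is OpenAI-compatible, you can also call it with the official openai Python SDK (v1.x) by overriding base_url; the api_key value is a placeholder here unless your gateway enforces authentication.
<PYTHON>
from openai import OpenAI

client = OpenAI(
    base_url="http://ai-service.your-domain.com/v1",
    api_key="not-used",  # the SDK requires a value even if the backend ignores it
)

# Stream a chat completion through Nginx/Ingress to the FastAPI service
stream = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Explain the concept of elasticity."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)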
7. Achieving High Availability
Multiple replicas: with replicas: 3 (or more) in the Deployment, the service keeps running even if one Pod fails; Kubernetes keeps rescheduling Pods onto healthy nodes. (An HPA can adjust this replica count automatically; see the sketch after this list.)
Highly available Nginx Ingress Controller: deploy the controller with controller.replicaCount=2 (or more) and either controller.kind="DaemonSet" (one instance per node) or a Deployment, so the ingress layer itself is not a single point of failure.
Kubernetes Service load balancing: a ClusterIP Service already load-balances, spreading traffic across all healthy Pods of the Deployment.
Health checks (probes):
Readiness probe: tells Kubernetes when a Pod is ready to receive traffic. While the model is still loading, the readiness probe fails; once loading finishes and the probe passes, the Service starts routing traffic to that Pod.
Liveness probe: tells Kubernetes when a Pod is broken and must be restarted. If inference hangs, the liveness probe fails and Kubernetes restarts the Pod.
Resource requests and limits: set CPU, memory, and GPU requests/limits sensibly so Pods are neither OOMKilled for lack of resources nor able to starve other workloads on the node. GPU limits are essential.
Persistent model storage: for large model files, keep them on a PersistentVolume or download them from object storage via an init container/sidecar, so a rebuilt Pod can load the model reliably instead of re-downloading it from scratch every time.
Monitoring with ELK / Prometheus + Grafana: track QPS, latency, error rate, GPU utilization, GPU memory usage, and other key metrics in real time, and set up alerts; this is a necessary part of keeping the service highly available.
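As referenced in the first item above, a HorizontalPodAutoscaler can adjust the replica count automatically. The sketch below scales the Deployment on CPU utilization and is only a starting point; GPU-aware or request-rate-based autoscaling typically needs custom metrics (e.g. via the Prometheus Adapter).
<YAML>
# k8s/fastapi-hpa.yaml (illustrative)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-fastapi-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-fastapi-app
  minReplicas: 3
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70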
8. Summary
By containerizing the FastAPI service with Docker and putting the Nginx Ingress Controller's load balancing and reverse proxying in front of it, we get a flexible, high-performance, highly available AI service.
Docker guarantees environment consistency and easy deployment.
FastAPI provides a high-performance, asynchronous API layer that suits I/O-bound model inference.
Nginx Ingress Controller is the traffic entry point, providing load balancing, SSL termination, and routing rules, and can itself run with multiple replicas for high availability.
Kubernetes handles Pod scheduling, self-healing, horizontal scaling (via HPA), and cluster-wide resource management.
This setup is a solid foundation for LLM applications of any size. Depending on your needs, you can layer on more advanced monitoring, log collection, model version management, security hardening, and so on.