A Complete Guide to LLM Deployment: Building a Highly Available AI Service with Docker + FastAPI + Nginx
With the rapid progress of large language models (LLMs), deploying them to production as stable, efficient services has become a key step in putting AI to work. A robust, scalable, and highly available serving architecture noticeably improves user experience and keeps the business running.
This article is a detailed deployment guide: we use Docker to containerize the model inference service, FastAPI as the high-performance API framework, and Nginx as the reverse proxy and load balancer, and combine them into a highly available AI service. We cover the whole journey: wrapping the model, turning it into a service, containerizing it, deploying it, and designing for high availability.
1. Why the Docker + FastAPI + Nginx Combination?
Among the many possible stacks, why pick these three for serving large models?
Docker (containerization):
Environment isolation and consistency: the model and all of its dependencies (libraries, Python version, CUDA, etc.) run inside a portable container, unaffected by the host environment. This largely closes the gap between "it works on my machine" and "it works in production".
Easy to ship and manage: Docker images are straightforward to distribute, store, and deploy, whether on a single machine or in a large cluster (e.g. Kubernetes).
Resource control: Docker can cap a container's CPU and memory usage, which helps with resource management.
FastAPI (high-performance Python web framework):
Excellent performance: built on Starlette and Pydantic, FastAPI is one of the fastest Python web frameworks and is well suited to the I/O-heavy request handling around LLM inference.
Modern and easy to use: Python 3.7+ type hints, auto-generated interactive API docs (Swagger/OpenAPI), clear code, high development velocity.
Async support: built-in async/await lets it handle many concurrent HTTP requests efficiently, which matters when each request may trigger a slow model inference call.
Strong data validation: Pydantic validates request models out of the box, which makes handling OpenAI-compatible API parameters easy.
Nginx (web server and reverse proxy):
High-performance reverse proxy: Nginx is known for its event-driven architecture and excellent concurrency, and can absorb large numbers of simultaneous HTTP requests.
Load balancing: it can spread incoming requests across multiple FastAPI instances, enabling horizontal scaling and improving overall throughput and availability.
SSL/TLS termination: it handles HTTPS and forwards decrypted requests to the application, taking that load off the application layer.
Static file serving: if the API ships static assets (e.g. the Swagger UI), Nginx serves them efficiently.
Request/response caching: caching can be configured where appropriate.
Health checks: Nginx can monitor backend health and route traffic only to healthy instances.
Overall architecture:
<TEXT>
+-------------+       +-----------------------+       +--------------------+
| User/Client | --->  | Nginx                 | --->  | FastAPI            |
+-------------+       | (Reverse Proxy,       |       | (Model Inference   |
                      |  Load Balancer, SSL)  |       |  API Service)      |
                      +-----------------------+       +--------------------+
                                                       (Multiple Instances)
                                                                |
                                                      +--------------------+
                                                      | Docker Container   |
                                                      +--------------------+
                                                                |
                                                      +--------------------+
                                                      | GPU Hardware       |
                                                      +--------------------+
2. Step 1: Model Service - FastAPI API Design
First, prepare a service that loads and runs your model and exposes it over HTTP with FastAPI. As before, we use an OpenAI-compatible API wrapper as the example.
Python code (app/main.py):
(This refers to the app/main.py from the earlier "Deploying LLMs on K8s" article. Make sure it is complete: it should handle GET /v1/models and POST /v1/chat/completions and support streaming output. A minimal, illustrative skeleton is sketched after the key points below.)
Key points to recall:
OpenAI compatibility: follow the OpenAI request/response format strictly (messages, role, content, choices, usage, stream, and so on).
Model inference: the infer and stream_infer methods of CustomModelService wrap the actual model call. In production this is where your optimized inference backend lives (e.g. vLLM, TensorRT-LLM, Triton Inference Server, or hand-tuned PyTorch/TensorFlow inference code).
Async I/O: FastAPI with async/await is very efficient for this I/O-bound inference workload.
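The following is a minimal, illustrative skeleton of such an app/main.py, not the full version from the earlier article. The DummyModelService, its placeholder token stream, and the field defaults are stand-ins that you would replace with your real CustomModelService and inference backend.
<PYTHON>
# app/main.py (illustrative skeleton; swap DummyModelService for your real backend)
import json
import time
import uuid
from typing import List, Optional

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI(title="LLM API")

class ChatMessage(BaseModel):
    role: str
    content: str

class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[ChatMessage]
    temperature: Optional[float] = 1.0
    stream: Optional[bool] = False

class DummyModelService:
    """Placeholder; replace with vLLM / TensorRT-LLM / your own inference code."""
    async def infer(self, messages: List[ChatMessage]) -> str:
        return "This is a placeholder answer."

    async def stream_infer(self, messages: List[ChatMessage]):
        for token in ["This ", "is ", "a ", "placeholder ", "answer."]:
            yield token

model_service = DummyModelService()

@app.get("/v1/models")
async def list_models():
    return {"object": "list", "data": [{"id": "your-model-name", "object": "model"}]}

@app.post("/v1/chat/completions")
async def chat_completions(req: ChatCompletionRequest):
    if req.stream:
        # Server-Sent Events, OpenAI-style: one JSON chunk per delta, then [DONE]
        async def event_stream():
            async for token in model_service.stream_infer(req.messages):
                chunk = {
                    "id": f"chatcmpl-{uuid.uuid4().hex}",
                    "object": "chat.completion.chunk",
                    "created": int(time.time()),
                    "model": req.model,
                    "choices": [{"index": 0, "delta": {"content": token}, "finish_reason": None}],
                }
                yield f"data: {json.dumps(chunk)}\n\n"
            yield "data: [DONE]\n\n"
        return StreamingResponse(event_stream(), media_type="text/event-stream")

    answer = await model_service.infer(req.messages)
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": req.model,
        "choices": [{"index": 0, "message": {"role": "assistant", "content": answer}, "finish_reason": "stop"}],
        "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
    }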
3. Step 2: Dockerize Your FastAPI Application
The next step is to build a Docker image for the FastAPI service.
app/requirements.txt (make sure all required libraries are listed):
<TEXT>
fastapi
uvicorn[standard] # 'standard' pulls in uvloop, httptools and websockets for better performance
pydantic
torch # or tensorflow, transformers, etc.
numba # for potential performance optimizations
# other dependencies for your model inference
Dockerfile
<DOCKERFILE>
# --- Base Image ---
# Use a base image with Python and necessary libraries.
# For GPU workloads, use an NVIDIA CUDA-enabled base image.
# Example using CUDA 11.8 on Ubuntu 22.04 (Python 3.10 is installed via apt below):
FROM nvidia/cuda:11.8.0-base-ubuntu22.04
# Set environment variables
ENV PYTHONUNBUFFERED=1 \
    PORT=8000 \
    MODEL_NAME="your-model-name"
# Adjust the model path or loading mechanism to your needs, e.g.:
# ENV MODEL_PATH="/app/models/your_model"
# Configure user and workdir
ARG USER=appuser
ARG GROUP=appgroup
RUN groupadd --gid 1000 ${GROUP} && useradd --uid 1000 --gid ${GROUP} --shell /bin/bash --create-home ${USER}
WORKDIR /app
# Install system dependencies (e.g., build tools, CUDA specific libs if not in base)
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
python3-pip \
git \
# Add any other necessary system packages here
&& rm -rf /var/lib/apt/lists/*
# Install Python dependencies
COPY app/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application source code
COPY app/ /app/
# OPTIONAL: Copy large model files if they are not dynamically loaded
# If your model is very large, consider using volumes or init containers in K8s
# COPY models/ /app/models/
# Set ownership to the non-root user we created
RUN chown -R ${USER}:${GROUP} /app
# Switch to the non-root user
USER ${USER}
# Run the application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
# Note: For GPU bound models, often 1 worker is sufficient. If your FastAPI doesn't
# directly use awaitable model calls (e.g., blocking calls run in thread pool),
# you might need more workers, but this increases GPU resource demands.
# Uvicorn with 'standard' extras includes httptools and websockets for better performance.
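Optionally, a .dockerignore keeps bulky or irrelevant files out of the build context; the entries below are only a suggested starting point for a layout like this one.
<TEXT>
# .dockerignore (suggested)
.git
__pycache__/
*.pyc
*.ipynb
.env
# keep local model weights out of the image if they are mounted or downloaded at runtime
models/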
Build and push the image:
<BASH>
# Assuming you are in the project root directory
docker build -t your-dockerhub-username/your-llm-api:v1.0.0 .
docker push your-dockerhub-username/your-llm-api:v1.0.0
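Before pushing, you can smoke-test the image locally. This assumes the NVIDIA Container Toolkit is installed so that --gpus works, and that port 8000 is free on your machine.
<BASH>
docker run --rm --gpus all -p 8000:8000 \
  -e MODEL_NAME="your-model-name" \
  your-dockerhub-username/your-llm-api:v1.0.0

# In another terminal:
curl http://localhost:8000/v1/models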
4. Step 3: Nginx Configuration and Integration
Nginx sits in front of the AI service: it receives client requests, load-balances them, and forwards them to the backend FastAPI instances.
a. The Nginx image
We can take a stock Nginx image and deploy it alongside the FastAPI application. In Kubernetes, however, it is more common to run Nginx as part of an Ingress Controller or behind a LoadBalancer Service, or directly inside a Pod.
For brevity, we first consider exposing the service inside a single Pod or via a Kubernetes Service; a more advanced deployment uses the Nginx Ingress Controller.
b. The Nginx configuration file (nginx.conf)
Below is an example nginx.conf that proxies traffic to the FastAPI service.
<NGINX>
# Custom Nginx configuration
# Note: this is a simplified example; harden it for production.
# In Kubernetes this logic usually lives in the Ingress Controller (or a LoadBalancer),
# and the Service already load-balances across the FastAPI pods. The fragment below is
# illustrative for the case where Nginx itself proxies to the 'llm-api-service' Service
# on port 80, or to a separate FastAPI process/container (in a single-container setup
# you might use supervisord or a similar process manager to run both).
events {
    worker_connections 1024;
    multi_accept on;
}
http {
    include mime.types;
    default_type application/json;
    sendfile on;
    keepalive_timeout 65;

    upstream fastapi_llm_api {
        # 'llm-api-service' is the Kubernetes Service name from the earlier article;
        # the Service itself already load-balances across healthy pods.
        # If Nginx runs in the *same* pod as FastAPI (uncommon in production):
        # server 127.0.0.1:8000;
        # If Nginx is a separate proxy in front of the Service:
        server llm-api-service.default.svc.cluster.local:80; # resolved via Kubernetes internal DNS
        # Multiple FastAPI instances, if you are not using a K8s Service for balancing:
        # server fastapi-app-1:8000;
        # server fastapi-app-2:8000;
        # The load balancing method can be changed (round-robin by default, least_conn, ...)
        # keepalive 32; # Optional: keep connections to the backend open
    }

    server {
        listen 80;
        server_name your-ai-service.example.com; # Your domain name

        # Simple health endpoint that probes a known-good path on the backend
        location /healthz {
            access_log off;
            proxy_pass http://fastapi_llm_api/v1/models;
            proxy_intercept_errors on;
        }
        # Note: active health checks (the health_check directive) require NGINX Plus;
        # open-source Nginx only does passive checks via max_fails/fail_timeout on upstream servers.

        location / {
            proxy_pass http://fastapi_llm_api; # Forward to the upstream group
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            # Recommended for streaming responses:
            # proxy_request_buffering off;
            # proxy_buffering off;
            # proxy_http_version 1.1; # Ensure HTTP/1.1 for keep-alive
            # Long-lived connections for streaming
            proxy_read_timeout 3600s; # Adjust as needed
            proxy_send_timeout 3600s;
            client_max_body_size 100M; # Adjust for large inputs if applicable
        }

        # If you want Nginx to expose the Swagger UI (FastAPI serves it at /docs)
        location /docs {
            proxy_pass http://fastapi_llm_api/docs;
            proxy_set_header Host $host;
        }
        location /openapi.json {
            proxy_pass http://fastapi_llm_api/openapi.json;
            proxy_set_header Host $host;
        }
    }
    # Add further server blocks for SSL, etc.
}
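For the single-host path mentioned in the comments above (no Kubernetes), a minimal docker-compose sketch might look like the following. It assumes the image built in Step 2 and a local nginx.conf whose upstream points at fastapi:8000 (the Compose service name) instead of the cluster DNS name; adjust names and the GPU reservation to your environment.
<YAML>
# docker-compose.yml (illustrative single-host setup)
services:
  fastapi:
    image: your-dockerhub-username/your-llm-api:v1.0.0
    environment:
      - MODEL_NAME=your-model-name
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  nginx:
    image: nginx:1.25
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - fastapi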
Notes:
In Kubernetes we usually do not bake Nginx and FastAPI into one Docker image (except as a simple sidecar). The standard layout is:
FastAPI application Pods: run the FastAPI service.
Kubernetes Service: gives the FastAPI Pods a stable ClusterIP and load-balances at the service level.
Nginx Ingress Controller: deployed as a DaemonSet or Deployment, it is itself one or more Nginx instances that accept external traffic, evaluate Ingress resources, and route requests to the backing Services.
For the goal of this article, a highly available AI service, we focus on getting Nginx's capabilities through an Ingress Controller on EKS/GKE/AKS or a similar environment.
5. Step 4: Kubernetes Deployment & Service (FastAPI App)
This part mirrors the k8s/deployment.yaml and k8s/service.yaml from the earlier "Deploying LLMs on K8s" article, but we bump the replica count a little to demonstrate load balancing.
k8s/fastapi-deployment.yaml
<YAML>
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-fastapi-app # Renamed to be clearer
  namespace: default
  labels:
    app: llm-fastapi
spec:
  replicas: 3 # Start with 3 replicas to demonstrate load balancing
  selector:
    matchLabels:
      app: llm-fastapi
  template:
    metadata:
      labels:
        app: llm-fastapi
    spec:
      containers:
        - name: llm-fastapi-container
          image: your-dockerhub-username/your-llm-api:v1.0.0 # Your FastAPI image
          ports:
            - containerPort: 8000
          resources:
            # Request slightly more resources potentially, and importantly, specify GPU
            requests:
              cpu: "1" # Request 1 CPU core
              memory: "4Gi" # Request 4GB RAM
              nvidia.com/gpu: 1 # IMPORTANT: Request 1 GPU
            limits:
              cpu: "2" # Limit to 2 CPU cores
              memory: "8Gi" # Limit to 8GB RAM
              nvidia.com/gpu: 1 # Limit to 1 GPU
          readinessProbe:
            httpGet:
              path: /v1/models
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /v1/models
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 15
          env:
            - name: MODEL_NAME
              value: "your-model-name"
k8s/fastapi-service.yaml
<YAML>
apiVersion: v1
kind: Service
metadata:
  name: llm-fastapi-svc # Renamed for clarity
  namespace: default
spec:
  selector:
    app: llm-fastapi # Matches the Deployment's pod labels
  ports:
    - protocol: TCP
      port: 80 # Internal service port
      targetPort: 8000 # Container port
  type: ClusterIP # Keep ClusterIP; the Ingress will expose it
6. Step 5: Nginx Ingress Controller Deployment
Installing an Ingress Controller is the standard way to expose HTTP/HTTPS services from a Kubernetes cluster. We assume you already have a cluster and will install the NGINX Ingress Controller into it.
Install the Nginx Ingress Controller (Helm recommended):
<BASH>
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
# Run 2 controller replicas for high availability. DaemonSet + hostNetwork is a common
# pattern for edge/bare-metal nodes; on a cloud provider, prefer kind=Deployment with a
# LoadBalancer Service (see the commented option below).
helm install nginx-ingress ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --set controller.replicaCount=2 \
  --set controller.nodeSelector."kubernetes\.io/os"="linux" \
  --set controller.kind="DaemonSet" \
  --set controller.hostNetwork=true
# Expose the controller via a cloud LoadBalancer instead:
# --set controller.service.type="LoadBalancer"
Note: controller.hostNetwork=true is meant for the DaemonSet setup; it lets the Ingress Controller Pods listen directly on the node's network interfaces. If you choose a Deployment, or you run on a cloud provider, you would normally set controller.service.type="LoadBalancer" to get an external IP instead.
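After installation you can verify how the controller is exposed; the exact resource names depend on the Helm release name used above.
<BASH>
kubectl get pods -n ingress-nginx -o wide
kubectl get svc -n ingress-nginx   # with type=LoadBalancer, look for the EXTERNAL-IP column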
Ingress Resource Definition
k8s/ingress.yaml
<YAML>
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-api-ingress
  namespace: default
  annotations:
    # Useful annotations for an LLM API behind the NGINX Ingress Controller
    # (all commented out here; enable what you need):
    # nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"   # long timeout for streaming
    # nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    # nginx.ingress.kubernetes.io/proxy-body-size: "100m"
    # nginx.ingress.kubernetes.io/proxy-buffering: "off"       # deliver streamed responses immediately
    # nginx.ingress.kubernetes.io/ssl-redirect: "true"         # if a TLS certificate is configured
    # The rule below points at the K8s Service 'llm-fastapi-svc' on port 80;
    # /docs and /openapi.json are proxied through the same Ingress as well.
spec:
  ingressClassName: nginx # Ensures this Ingress is handled by our Nginx Ingress Controller
  rules:
    - host: ai-service.your-domain.com # Replace with your actual domain name
      http:
        paths:
          - path: / # Route all traffic meant for the AI service
            pathType: Prefix
            backend:
              service:
                name: llm-fastapi-svc # The Kubernetes Service for your FastAPI app
                port:
                  number: 80 # The port exposed by the Service
  # Optional: TLS configuration for HTTPS
  # tls:
  #   - hosts:
  #       - ai-service.your-domain.com
  #     secretName: your-tls-secret # A Kubernetes Secret containing your TLS cert and key
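If you enable the commented-out tls section, the referenced Secret can be created from an existing certificate/key pair, for example:
<BASH>
kubectl create secret tls your-tls-secret \
  --cert=path/to/fullchain.pem \
  --key=path/to/privkey.pem \
  --namespace default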
Deploy the Ingress:
<BASH>
kubectl apply -f k8s/ingress.yaml
Deploy the FastAPI app and Service:
<BASH>
kubectl apply -f k8s/fastapi-deployment.yaml
kubectl apply -f k8s/fastapi-service.yaml
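A quick sanity check that the pieces are running and wired together:
<BASH>
kubectl get pods -l app=llm-fastapi        # the 3 replicas should become Ready
kubectl get svc llm-fastapi-svc
kubectl get ingress llm-api-ingress        # shows the address assigned by the controller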
Testing:
Configure DNS: point ai-service.your-domain.com at the IP address exposed by your cluster's Ingress Controller (the LoadBalancer IP or a node IP).
Send a request:
<BASH>
curl -X POST \
http://ai-service.your-domain.com/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "your-model-name",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Kubernetes?"}
],
"temperature": 0.7
}'
You can also test streaming output:
<BASH>
curl -N -X POST \
http://ai-service.your-domain.com/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "your-model-name",
"messages": [
{"role": "user", "content": "Explain the concept of elasticity."}
],
"stream": true
}'
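Because the service is OpenAI-compatible, you can also call it with the official openai Python SDK (v1.x) by overriding base_url; the api_key value is a placeholder here unless your gateway enforces authentication.
<PYTHON>
from openai import OpenAI

client = OpenAI(
    base_url="http://ai-service.your-domain.com/v1",
    api_key="not-used",  # the SDK requires a value even if the backend ignores it
)

# Stream a chat completion through Nginx/Ingress to the FastAPI service
stream = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Explain the concept of elasticity."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)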
7. Achieving High Availability
Multiple replicas: with replicas: 3 (or more) in the Deployment, the service keeps running even if one Pod fails; Kubernetes keeps rescheduling Pods onto healthy nodes. (An HPA can adjust this replica count automatically; see the sketch after this list.)
Highly available Nginx Ingress Controller: deploy the controller with controller.replicaCount=2 (or more) and either controller.kind="DaemonSet" (one instance per node) or a Deployment, so the ingress layer itself is not a single point of failure.
Kubernetes Service load balancing: a ClusterIP Service already load-balances, spreading traffic across all healthy Pods of the Deployment.
Health checks (probes):
Readiness probe: tells Kubernetes when a Pod is ready to receive traffic. While the model is still loading, the readiness probe fails; once loading finishes and the probe passes, the Service starts routing traffic to that Pod.
Liveness probe: tells Kubernetes when a Pod is broken and must be restarted. If inference hangs, the liveness probe fails and Kubernetes restarts the Pod.
Resource requests and limits: set CPU, memory, and GPU requests/limits sensibly so Pods are neither OOMKilled for lack of resources nor able to starve other workloads on the node. GPU limits are essential.
Persistent model storage: for large model files, keep them on a PersistentVolume or download them from object storage via an init container/sidecar, so a rebuilt Pod can load the model reliably instead of re-downloading it from scratch every time.
Monitoring with ELK / Prometheus + Grafana: track QPS, latency, error rate, GPU utilization, GPU memory usage, and other key metrics in real time, and set up alerts; this is a necessary part of keeping the service highly available.
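As referenced in the first item above, a HorizontalPodAutoscaler can adjust the replica count automatically. The sketch below scales the Deployment on CPU utilization and is only a starting point; GPU-aware or request-rate-based autoscaling typically needs custom metrics (e.g. via the Prometheus Adapter).
<YAML>
# k8s/fastapi-hpa.yaml (illustrative)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-fastapi-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-fastapi-app
  minReplicas: 3
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70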
8. Summary
By containerizing the FastAPI service with Docker and putting the Nginx Ingress Controller's load balancing and reverse proxying in front of it, we get a flexible, high-performance, highly available AI service.
Docker guarantees environment consistency and easy deployment.
FastAPI provides a high-performance, asynchronous API layer that suits I/O-bound model inference.
Nginx Ingress Controller is the traffic entry point, providing load balancing, SSL termination, and routing rules, and can itself run with multiple replicas for high availability.
Kubernetes handles Pod scheduling, self-healing, horizontal scaling (via HPA), and cluster-wide resource management.
This setup is a solid foundation for LLM applications of any size. Depending on your needs, you can layer on more advanced monitoring, log collection, model version management, security hardening, and so on.