
The Complete Guide to LLM Deployment: Building a Highly Available AI Service with Docker + FastAPI + Nginx

With the rapid progress of large language models (LLMs), deploying them to production as stable, efficient services has become a key step in putting AI to work. A robust, scalable, and highly available serving architecture noticeably improves user experience and keeps the business running.

This article is a detailed deployment guide: you will use Docker to containerize the model inference service, FastAPI as the high-performance API framework, and Nginx as the reverse proxy and load balancer, and put them together into a highly available AI service. We will walk through model wrapping, turning it into a service, containerization, deployment, and high-availability design.

1. Why the Docker + FastAPI + Nginx Combination?

Among the many possible stacks, why lean toward this trio for serving large models?

Docker (containerization):

Environment isolation and consistency: the model and all of its dependencies (libraries, Python version, CUDA, and so on) run inside a portable container, insulated from the host environment. This greatly simplifies the jump from "it works on my machine" to "it works in production".

Easy to deploy and manage: Docker images are easy to distribute, store, and deploy, whether on a single machine or across a large cluster (e.g. Kubernetes).

Resource control: Docker can cap a container's CPU and memory usage, which helps with resource management.

FastAPI (a high-performance Python web framework):

Excellent performance: built on Starlette and Pydantic, FastAPI is one of the fastest Python web frameworks, which suits the I/O-heavy nature of serving LLMs.

Modern and easy to use: it supports Python 3.7+ type hints, auto-generates interactive API docs (Swagger/OpenAPI), keeps code clear, and speeds up development.

Async support: built-in async/await lets it handle concurrent HTTP requests efficiently, which matters for a service whose model inference calls can take a while.

Strong data validation: Pydantic gives you out-of-the-box request validation, making it easy to handle OpenAI-compatible API parameters.

Nginx (web server & reverse proxy):

High-performance reverse proxy: Nginx is known for its concurrency and efficient event-driven model, and can absorb large volumes of concurrent HTTP requests.

Load balancing: it can spread incoming requests across multiple FastAPI instances (worker processes), enabling horizontal scaling and improving overall throughput and availability.

SSL/TLS termination: it can handle HTTPS encryption and forward decrypted requests to the application, offloading that work from the application layer.

Static file serving: if the API ships static assets (such as the Swagger UI), Nginx serves them efficiently.

Request/response caching: caching can be configured where it makes sense.

Health checks: Nginx can monitor backend health and route traffic only to healthy instances.

Overall architecture:

```text
+-------------------+      +----------------------+      +-------------------+
|    User/Client    | ---> |        Nginx         | ---> |      FastAPI      |
+-------------------+      | (Reverse Proxy,      |      | (Model Inference  |
                           |  Load Balancer, SSL) |      |  API Service)     |
                           +----------------------+      +-------------------+
                                                                   |  (Multiple Instances)
                                                                   |
                                                          +-------------------+
                                                          |      Docker       |
                                                          |     Container     |
                                                          +-------------------+
                                                                   |
                                                          +-------------------+
                                                          |   GPU Hardware    |
                                                          +-------------------+
```

2. Step 1: Model Service - FastAPI API Design

First, prepare a service that can load and run your large model and expose it over HTTP with FastAPI. We will keep using the OpenAI-compatible API wrapper mentioned earlier as the example.

Python code (app/main.py):

(This reuses the app/main.py from the earlier "Deploying LLMs on K8s" article. Make sure it is feature-complete: it must handle GET /v1/models and POST /v1/chat/completions and support streaming output. A minimal illustrative sketch follows after the key points below.)

Key points recap:

OpenAI compatibility: follow OpenAI's request/response format exactly (messages, role, content, choices, usage, stream, and so on).

Model inference: the infer and stream_infer methods of CustomModelService wrap the actual model call. In production this would be your optimized inference backend (vLLM, TensorRT-LLM, Triton Inference Server, or tuned PyTorch/TensorFlow inference code).

Async I/O: FastAPI with async/await is very efficient for the I/O-bound parts of model serving.
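If you do not have that article handy, the sketch below shows the general shape of such an OpenAI-compatible service. It is a minimal illustration, not the full app/main.py: CustomModelService here merely echoes the last user message, and you would replace its infer/stream_infer bodies with calls into your real inference backend.

```python
# app/main.py -- minimal OpenAI-compatible sketch (illustrative only)
import json
import time
import uuid
from typing import AsyncIterator, List

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI(title="LLM API")
MODEL_NAME = "your-model-name"


class ChatMessage(BaseModel):
    role: str
    content: str


class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[ChatMessage]
    temperature: float = 0.7
    stream: bool = False


class CustomModelService:
    """Placeholder for a real backend (vLLM, TensorRT-LLM, plain PyTorch, ...)."""

    async def infer(self, messages: List[ChatMessage], temperature: float) -> str:
        return f"(echo) {messages[-1].content}"

    async def stream_infer(self, messages: List[ChatMessage], temperature: float) -> AsyncIterator[str]:
        for token in f"(echo) {messages[-1].content}".split():
            yield token + " "


service = CustomModelService()


@app.get("/v1/models")
async def list_models():
    return {"object": "list", "data": [{"id": MODEL_NAME, "object": "model"}]}


@app.post("/v1/chat/completions")
async def chat_completions(req: ChatCompletionRequest):
    if req.stream:
        async def sse() -> AsyncIterator[str]:
            async for chunk in service.stream_infer(req.messages, req.temperature):
                payload = {
                    "id": f"chatcmpl-{uuid.uuid4().hex}",
                    "object": "chat.completion.chunk",
                    "created": int(time.time()),
                    "model": req.model,
                    "choices": [{"index": 0, "delta": {"content": chunk}, "finish_reason": None}],
                }
                yield f"data: {json.dumps(payload)}\n\n"
            yield "data: [DONE]\n\n"

        return StreamingResponse(sse(), media_type="text/event-stream")

    text = await service.infer(req.messages, req.temperature)
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": req.model,
        "choices": [{"index": 0, "message": {"role": "assistant", "content": text}, "finish_reason": "stop"}],
        "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
    }
```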

3. Step 2: Dockerize Your FastAPI Application

The next step is to build a Docker image for the FastAPI service.

app/requirements.txt (make sure it lists every required library):

```text
fastapi
uvicorn[standard]   # 'standard' pulls in uvloop, httptools and websockets for better performance
pydantic
torch               # or tensorflow, transformers, etc.
numba               # for potential performance optimizations
# other dependencies for your model inference
```

Dockerfile

```dockerfile
# --- Base Image ---
# Use a base image that provides (or lets you install) Python and the necessary libraries.
# For GPU workloads, use an NVIDIA CUDA-enabled base image.
# Example using CUDA 11.8 on Ubuntu 22.04:
FROM nvidia/cuda:11.8.0-base-ubuntu22.04

# Set environment variables
ENV PYTHONUNBUFFERED=1 \
    PORT=8000 \
    MODEL_NAME="your-model-name"
# Adjust the model path or loading mechanism to your needs
# ENV MODEL_PATH="/app/models/your_model"

# Configure user and workdir
ARG USER=appuser
ARG GROUP=appgroup
RUN groupadd --gid 1000 ${GROUP} && \
    useradd --uid 1000 --gid ${GROUP} --shell /bin/bash --create-home ${USER}

WORKDIR /app

# Install system dependencies (e.g. build tools, extra CUDA libs if not in the base image)
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        python3-pip \
        git \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY app/requirements.txt .
RUN python3 -m pip install --no-cache-dir -r requirements.txt

# Copy application source code
COPY app/ /app/

# OPTIONAL: copy large model files if they are not loaded dynamically.
# If your model is very large, consider volumes or init containers in K8s instead.
# COPY models/ /app/models/

# Give ownership to the non-root user we created, then switch to it
RUN chown -R ${USER}:${GROUP} /app
USER ${USER}

EXPOSE 8000

# Run the application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]

# Note: for GPU-bound models, one worker is often sufficient. If your FastAPI code does not
# use awaitable model calls (i.e. blocking calls run in a thread pool), you might need more
# workers, but that multiplies GPU resource demands.
# uvicorn[standard] includes httptools and websockets for better performance.
```
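To illustrate the worker note at the end of the Dockerfile: if your inference call is synchronous and blocking, awaiting it directly would stall the event loop for every other request on that worker. Below is a minimal sketch of one way to keep a single Uvicorn worker responsive; blocking_generate() is a hypothetical stand-in for your real call, and asyncio.to_thread requires Python 3.9+.

```python
# Sketch: keep the event loop responsive when the model call is blocking.
import asyncio

from fastapi import FastAPI

app = FastAPI()

# With one GPU, limit in-flight inference so requests queue instead of competing for VRAM.
GPU_SEMAPHORE = asyncio.Semaphore(1)


def blocking_generate(prompt: str) -> str:
    """Hypothetical synchronous inference call (e.g. a plain PyTorch generate())."""
    return f"response to: {prompt}"


@app.post("/generate")
async def generate(payload: dict):
    prompt = payload.get("prompt", "")
    async with GPU_SEMAPHORE:
        # Offload the blocking call to a worker thread so other requests keep flowing.
        text = await asyncio.to_thread(blocking_generate, prompt)
    return {"text": text}
```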

Build and push the image:

```bash
# Assuming you are in the project root directory
docker build -t your-dockerhub-username/your-llm-api:v1.0.0 .
docker push your-dockerhub-username/your-llm-api:v1.0.0
```

4. Step 3: Nginx Configuration and Integration

Nginx is the front door of our AI service. It receives client requests, load-balances them, and forwards them to the backend FastAPI instances.

a. The Nginx image

We could take a stock Nginx image and deploy it alongside our FastAPI application. In Kubernetes, however, it is more common to manage Nginx as part of an Ingress Controller or a LoadBalancer Service, or to run Nginx directly inside the Pod.

For simplicity, we first consider running it in the same Pod or exposing it through a K8s Service; a more advanced deployment uses the Nginx Ingress Controller.

b. The Nginx configuration file (nginx.conf)

Here is an example nginx.conf that proxies traffic to the FastAPI service.

```nginx
# Custom Nginx configuration.
# Note: this is a simplified, illustrative example; production configs should be more robust.
#
# In Kubernetes, this logic usually lives inside the Ingress Controller (or behind a
# LoadBalancer Service). Here we assume Nginx runs as a separate proxy in front of the
# FastAPI Pods and forwards to the Kubernetes Service 'llm-api-service' on port 80;
# the Service then load-balances across the healthy Pods.

events {
    worker_connections 1024;
    multi_accept on;
}

http {
    include mime.types;
    default_type application/json;
    sendfile on;
    keepalive_timeout 65;

    upstream fastapi_llm_api {
        # 'llm-api-service' is the Kubernetes Service name defined earlier.
        # The K8s Service automatically load balances to healthy Pods.
        #
        # If Nginx ran in the *same* Pod as FastAPI (uncommon in production):
        # server 127.0.0.1:8000;
        #
        # Nginx as a separate proxy in front of the Service (K8s internal DNS):
        server llm-api-service.default.svc.cluster.local:80;

        # Example for multiple FastAPI instances if you are not using a K8s Service for balancing:
        # server fastapi-app-1:8000;
        # server fastapi-app-2:8000;

        # Load balancing method (round-robin by default; least_conn etc. are available)
        # keepalive 32;  # Optional: keep connections to the backend open
    }

    server {
        listen 80;
        server_name your-ai-service.example.com;  # Your domain name

        # Basic health check endpoint: probe a known-good FastAPI endpoint
        location /healthz {
            access_log off;
            proxy_pass http://fastapi_llm_api/v1/models;
            proxy_intercept_errors on;
            # Active health checks (the health_check directive) require NGINX Plus
            # or a third-party module:
            # health_check interval=10s fails=3;
        }

        location / {
            proxy_pass http://fastapi_llm_api;  # Forward to the upstream group
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            # Settings that matter for token streaming (SSE):
            # proxy_buffering off;           # do not buffer streamed responses
            # proxy_request_buffering off;
            # proxy_http_version 1.1;        # ensure HTTP/1.1 for keep-alive

            # Long-lived connections for streaming
            proxy_read_timeout 3600s;  # Adjust as needed
            proxy_send_timeout 3600s;
            client_max_body_size 100M; # Adjust for large inputs if applicable
        }

        # Expose the FastAPI docs (FastAPI serves Swagger UI at /docs)
        location /docs {
            proxy_pass http://fastapi_llm_api/docs;
            proxy_set_header Host $host;
        }

        location /openapi.json {
            proxy_pass http://fastapi_llm_api/openapi.json;
            proxy_set_header Host $host;
        }
    }

    # Add other server blocks for SSL, etc.
}
```

Notes:

In Kubernetes we usually do not build a single Docker image that bundles Nginx and FastAPI (except as a simple sidecar). The more standard layout is:

FastAPI application Pods: run the FastAPI service.

Kubernetes Service: gives the FastAPI Pods a stable ClusterIP and provides service-level load balancing.

Nginx Ingress Controller: deploy an Nginx Ingress Controller (usually as a DaemonSet or Deployment). It is itself one or more Nginx instances that listen for external traffic, interpret Ingress resources, and route requests to the backend Services.

For this article's goal of a highly available AI service, we will focus on getting Nginx's capabilities through an Ingress Controller on environments such as EKS/GKE/AKS.

5. Step 4: Kubernetes Deployment & Service (FastAPI App)

This part is similar to k8s/deployment.yaml and k8s/service.yaml from the earlier "Deploying LLMs on K8s" article, but we bump the replica count a little to demonstrate load balancing.

k8s/fastapi-deployment.yaml

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-fastapi-app   # Renamed to be clearer
  namespace: default
  labels:
    app: llm-fastapi
spec:
  replicas: 3              # Start with 3 replicas to demonstrate load balancing
  selector:
    matchLabels:
      app: llm-fastapi
  template:
    metadata:
      labels:
        app: llm-fastapi
    spec:
      containers:
        - name: llm-fastapi-container
          image: your-dockerhub-username/your-llm-api:v1.0.0   # Your FastAPI image
          ports:
            - containerPort: 8000
          resources:
            # Request enough CPU/memory and, importantly, specify the GPU
            requests:
              cpu: "1"            # Request 1 CPU core
              memory: "4Gi"       # Request 4 GiB RAM
              nvidia.com/gpu: 1   # IMPORTANT: request 1 GPU
            limits:
              cpu: "2"            # Limit to 2 CPU cores
              memory: "8Gi"       # Limit to 8 GiB RAM
              nvidia.com/gpu: 1   # Limit to 1 GPU
          readinessProbe:
            httpGet:
              path: /v1/models
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /v1/models
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 15
          env:
            - name: MODEL_NAME
              value: "your-model-name"
```

k8s/fastapi-service.yaml

```yaml
apiVersion: v1
kind: Service
metadata:
  name: llm-fastapi-svc   # Renamed for clarity
  namespace: default
spec:
  selector:
    app: llm-fastapi      # Matches the Deployment's pod labels
  ports:
    - protocol: TCP
      port: 80            # Internal service port
      targetPort: 8000    # Container port
  type: ClusterIP         # Keep as ClusterIP; the Ingress will expose it
```

6. Step 5: Deploying the Nginx Ingress Controller

Installing an Ingress Controller is the standard way to expose HTTP/HTTPS services from Kubernetes. We assume you already have a Kubernetes cluster; the next step is to install the NGINX Ingress Controller.

Install the Nginx Ingress Controller (Helm is recommended):

```bash
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

# Two controller replicas for high availability (replicaCount applies when
# controller.kind is "Deployment"; a DaemonSet runs one controller per matching node).
# DaemonSet + hostNetwork is a common pattern for bare-metal/edge clusters, where the
# controller listens directly on each node's network interfaces.
helm install nginx-ingress ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --set controller.replicaCount=2 \
  --set controller.nodeSelector."kubernetes\.io/os"="linux" \
  --set controller.kind="DaemonSet" \
  --set controller.hostNetwork=true
  # On cloud providers, expose the controller through a LoadBalancer Service instead:
  # --set controller.service.type="LoadBalancer"
```

Note: controller.hostNetwork=true goes with the DaemonSet setup; it lets the Ingress Controller Pods listen directly on the nodes' primary network interfaces. If you choose a Deployment, or you are on a cloud provider, you would normally set controller.service.type="LoadBalancer" to obtain an external IP instead.

Ingress Resource Definition

k8s/ingress.yaml

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-api-ingress
  namespace: default
  annotations:
    # Useful annotations for LLM traffic (uncomment as needed):
    # nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"   # long timeouts for streaming
    # nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    # nginx.ingress.kubernetes.io/proxy-body-size: "100m"
    # Add annotations for SSL if you have a TLS certificate configured:
    # nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx   # Ensures this Ingress uses our Nginx Ingress Controller
  rules:
    - host: ai-service.your-domain.com   # Replace with your actual domain name
      http:
        paths:
          # Route all traffic meant for the AI service; since the Service
          # 'llm-fastapi-svc' listens on port 80, /docs is exposed through
          # the same Ingress as well.
          - path: /
            pathType: Prefix
            backend:
              service:
                name: llm-fastapi-svc   # The Kubernetes Service for your FastAPI app
                port:
                  number: 80            # The port exposed by the Service
  # Optional: TLS configuration for HTTPS
  # tls:
  #   - hosts:
  #       - ai-service.your-domain.com
  #     secretName: your-tls-secret   # A Kubernetes Secret containing your TLS cert and key
```

Deploy the Ingress:

```bash
kubectl apply -f k8s/ingress.yaml
```

Deploy the FastAPI app and Service:

```bash
kubectl apply -f k8s/fastapi-deployment.yaml
kubectl apply -f k8s/fastapi-service.yaml
```

Testing:

Configure DNS: point ai-service.your-domain.com at the IP address your cluster's Ingress Controller exposes (the LoadBalancer IP or a node IP).

Send a request:

```bash
curl -X POST \
  http://ai-service.your-domain.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is Kubernetes?"}
    ],
    "temperature": 0.7
  }'
```

You can also test streaming output:

```bash
curl -N -X POST \
  http://ai-service.your-domain.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "messages": [
      {"role": "user", "content": "Explain the concept of elasticity."}
    ],
    "stream": true
  }'
```
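Besides curl, any OpenAI-compatible client can exercise the service. Below is a sketch using the official openai Python SDK (v1-style client); the base_url points at the Ingress host, and the api_key value is an arbitrary placeholder because the example service does no authentication.

```python
# Sketch: call the deployed service through the OpenAI Python SDK (pip install openai).
from openai import OpenAI

client = OpenAI(
    base_url="http://ai-service.your-domain.com/v1",  # your Ingress host
    api_key="not-used-by-this-service",               # placeholder; our API does no auth
)

# Non-streaming request
resp = client.chat.completions.create(
    model="your-model-name",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Kubernetes?"},
    ],
    temperature=0.7,
)
print(resp.choices[0].message.content)

# Streaming request
stream = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Explain the concept of elasticity."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```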

7. Achieving High Availability

Multiple replicas: with replicas: 3 (or more) in the Deployment, the service keeps running even if one Pod fails; Kubernetes keeps rescheduling Pods onto healthy nodes.

Nginx Ingress Controller high availability: run more than one controller instance, either as a Deployment with controller.replicaCount=2 (or more) or as a DaemonSet on every node, so the ingress layer itself is not a single point of failure.

Kubernetes Service load balancing: a ClusterIP Service load-balances out of the box, spreading traffic across all healthy Pods of the Deployment.

Health checks (probes):

Readiness probe: tells Kubernetes when a Pod is ready to receive traffic. While the model is still loading, the readiness probe may fail; once loading completes and the probe succeeds, the Service starts routing traffic to that Pod (see the sketch after this list).

Liveness probe: tells Kubernetes when a Pod has failed and needs a restart. If model inference hangs, the liveness probe fails and Kubernetes restarts the Pod.

Resource requests and limits: set sensible CPU, memory, and GPU requests and limits so Pods are neither OOMKilled for lack of memory nor able to starve other services on the node. GPU limits are essential.

Data persistence (model loading): for large model files, use a PersistentVolume or download from object storage via an init container or sidecar, so a rebuilt Pod can load the model reliably instead of downloading it from scratch every time.

Monitoring with ELK/Prometheus + Grafana: real-time monitoring of QPS, latency, error rate, GPU utilization, GPU memory usage, and other key metrics, together with alerting, is an essential part of keeping the service highly available.
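For the probes above to be meaningful, the API should distinguish "the process is up" from "the model is loaded and ready". The sketch below is one hedged way to do that; the /healthz/live and /healthz/ready paths are illustrative names (the Deployment above simply probes /v1/models), and the startup hook stands in for your real model-loading code.

```python
# Sketch: readiness-aware health endpoints for the probes described above.
import asyncio

from fastapi import FastAPI, Response, status

app = FastAPI()
model_ready = False  # flipped to True once weights are loaded


@app.on_event("startup")
async def load_model() -> None:
    global model_ready
    # Load weights here; the readiness probe keeps traffic away until loading finishes.
    await asyncio.sleep(0)  # placeholder for the real (possibly long) load
    model_ready = True


@app.get("/healthz/live")
async def liveness() -> dict:
    # The process is alive; point the liveness probe here.
    return {"status": "ok"}


@app.get("/healthz/ready")
async def readiness(response: Response) -> dict:
    # Only report ready once the model is actually loaded.
    if not model_ready:
        response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE
        return {"status": "loading"}
    return {"status": "ready"}
```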

8. Summary

By containerizing the FastAPI service with Docker and combining it with the load-balancing and reverse-proxy capabilities of the Nginx Ingress Controller, we can build a flexible, high-performance, and highly available AI service.

Docker guarantees environment consistency and easy deployment.

FastAPI provides a high-performance, asynchronous API layer, well suited to I/O-bound model inference.

The Nginx Ingress Controller sits at the traffic entry point, providing load balancing, SSL termination, routing rules, and other key traffic-management capabilities, and can run with multiple replicas for high availability.

Kubernetes handles Pod scheduling, self-healing, horizontal scaling (via HPA), and overall cluster resource management.

This setup is a solid foundation that can support LLM applications of any scale. Depending on your needs, you can go further with advanced monitoring, log collection, model version management, security hardening, and more.
