Uploading PDFs and Storing Them in MinIO Object Storage
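This project is a full-stack application that lets users upload PDF files. The backend is built with Flask: it stores the original PDF file in a MinIO bucket and indexes the extracted text content into OpenSearch. The frontend is a simple React application for uploading files.
Code: https://github.com/zhouruiliangxian/Awesome-demo/tree/main/Fullstack/pdf_search_app
Project structure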
    pdf_search_app/
    ├── backend/                 # Flask backend
    │   ├── .env                 # Environment variables for the backend
    │   ├── app.py               # Main Flask application logic
    │   └── requirements.txt     # Python dependencies
    ├── frontend/                # React frontend
    │   ├── public/
    │   ├── src/
    │   │   ├── App.css          # Frontend stylesheet
    │   │   └── App.js           # Main React component
    │   └── package.json
    └── docker-compose.yml       # Docker Compose file for the infrastructure services
How to run the application
Follow the steps below to get the whole application up and running.
Step 1: Start the infrastructure services
    version: '3.8'

    services:
      opensearch-node:
        image: opensearchproject/opensearch:2.19.1
        container_name: opensearch-node-pdf
        environment:
          - cluster.name=opensearch-cluster
          - node.name=opensearch-node
          - discovery.type=single-node
          - bootstrap.memory_lock=true
          - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m"
          - "DISABLE_SECURITY_PLUGIN=true"
        ulimits:
          memlock:
            soft: -1
            hard: -1
          nofile:
            soft: 65536
            hard: 65536
        volumes:
          - opensearch-data:/usr/share/opensearch/data
        ports:
          - "9200:9200"
          - "9600:9600"
        networks:
          - app-network

      opensearch-dashboards:
        image: opensearchproject/opensearch-dashboards:2.19.1
        container_name: opensearch-dashboards-pdf
        ports:
          - "5601:5601"
        environment:
          OPENSEARCH_HOSTS: '["http://opensearch-node:9200"]'
          DISABLE_SECURITY_DASHBOARDS_PLUGIN: "true"
        networks:
          - app-network
        depends_on:
          - opensearch-node

      minio:
        image: minio/minio:latest
        container_name: minio
        ports:
          - "9000:9000" # API Port
          - "9001:9001" # Console Port
        volumes:
          - minio-data:/data
        environment:
          - MINIO_ROOT_USER=minioadmin # Change for production
          - MINIO_ROOT_PASSWORD=minioadmin # Change for production
        command: server /data --address ":9000" --console-address ":9001"
        networks:
          - app-network

    volumes:
      opensearch-data:
      minio-data:

    networks:
      app-network:
        driver: bridge
This docker-compose.yml file starts OpenSearch, OpenSearch Dashboards, and MinIO.
From the root of the pdf_search_app directory, run:
    docker-compose up -d
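Once the containers are running, the services are available at the following addresses:
- OpenSearch Dashboards: http://localhost:5601
- MinIO console: http://localhost:9001 (log in with minioadmin / minioadmin, the credentials configured in docker-compose.yml)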
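The containers can take a few seconds to accept connections after docker-compose up -d returns. The following is a minimal sketch of a readiness check that polls both services before the backend is started. The helper names (http_ok, wait_for, report_services) are hypothetical, and the URLs assume the port mappings from docker-compose.yml plus MinIO's liveness endpoint at /minio/health/live.

```python
import http.client
import time
import urllib.request

# Assumed URLs, based on the ports published in docker-compose.yml.
SERVICES = {
    "OpenSearch": "http://localhost:9200",
    "MinIO": "http://localhost:9000/minio/health/live",
}


def http_ok(url, timeout=2.0):
    """Return True if a GET to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or a broken reply.
        return False


def wait_for(check, attempts=10, delay=3.0, sleep=time.sleep):
    """Call `check()` until it returns True or `attempts` run out."""
    for attempt in range(1, attempts + 1):
        if check():
            return True
        if attempt < attempts:
            sleep(delay)
    return False


def report_services(services=SERVICES):
    """Print one readiness line per service."""
    for name, url in services.items():
        ready = wait_for(lambda u=url: http_ok(u))
        print(f"{name}: {'ready' if ready else 'not reachable'}")
```

Calling report_services() from a Python shell prints one line per service; the sleep parameter of wait_for exists only so the waiting can be skipped in tests.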
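Step 2: Run the Flask backend
- Change to the backend directory:
      cd backend
- Create a virtual environment and install the dependencies:
      # Create a virtual environment (uv places it in .venv by default)
      uv venv
      # Activate it (Windows)
      .\.venv\Scripts\activate
      # (macOS/Linux)
      # source .venv/bin/activate
      # Install the dependencies
      uv pip install -r requirements.txt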
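- Start the Flask server:
  Important: start it with the uv run app.py command so that the initialization code (such as creating the MinIO bucket) is executed. That code sits under the if __name__ == '__main__': guard in app.py, so it is skipped when the module is imported by another launcher such as flask run.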
The app.py file:

    # -*- coding: utf-8 -*-
    import os
    import io
    import time

    from flask import Flask, request, jsonify
    from flask_cors import CORS
    from dotenv import load_dotenv
    from minio import Minio
    from opensearchpy import OpenSearch
    import PyPDF2

    # --- Initialization ---
    load_dotenv()

    app = Flask(__name__)
    # Enable CORS for React frontend (adjust in production)
    CORS(app, resources={r"/api/*": {"origins": "http://localhost:3000"}})

    # --- Client Connections ---

    # OpenSearch Client
    opensearch_client = OpenSearch(
        hosts=[{'host': os.getenv('OPENSEARCH_HOST'), 'port': int(os.getenv('OPENSEARCH_PORT'))}],
        http_auth=None,
        use_ssl=False,
        verify_certs=False,
        ssl_assert_hostname=False,
        ssl_show_warn=False,
    )

    # MinIO Client
    minio_client = Minio(
        os.getenv('MINIO_ENDPOINT'),
        access_key=os.getenv('MINIO_ACCESS_KEY'),
        secret_key=os.getenv('MINIO_SECRET_KEY'),
        secure=False  # Set to True if using HTTPS
    )

    # --- Helper Functions ---

    def setup_minio_and_opensearch():
        """Ensure MinIO bucket and OpenSearch index exist, with retries."""
        max_retries = 5
        retry_delay = 3  # seconds

        # Setup MinIO
        for i in range(max_retries):
            try:
                bucket_name = os.getenv('MINIO_BUCKET')
                found = minio_client.bucket_exists(bucket_name)
                if not found:
                    minio_client.make_bucket(bucket_name)
                    print(f"MinIO bucket '{bucket_name}' created.")
                else:
                    print(f"MinIO bucket '{bucket_name}' already exists.")
                break  # Success, exit loop
            except Exception as e:
                print(f"MinIO setup failed (attempt {i+1}/{max_retries}): {e}")
                if i + 1 == max_retries:
                    raise
                print(f"Retrying in {retry_delay} seconds...")
                time.sleep(retry_delay)

        # Setup OpenSearch (can also have a retry loop if needed)
        index_name = os.getenv('OPENSEARCH_INDEX')
        if not opensearch_client.indices.exists(index=index_name):
            opensearch_client.indices.create(index=index_name)
            print(f"OpenSearch index '{index_name}' created.")
        else:
            print(f"OpenSearch index '{index_name}' already exists.")

    def extract_text_from_pdf(pdf_file):
        """Extracts text content from a PDF file stream."""
        text = ""
        try:
            pdf_reader = PyPDF2.PdfReader(pdf_file)
            for page in pdf_reader.pages:
                # extract_text() may return None for pages without a text layer
                text += page.extract_text() or ""
        except Exception as e:
            print(f"Error extracting PDF text: {e}")
            return None
        return text

    # --- API Routes ---

    @app.route('/api/upload', methods=['POST'])
    def upload_pdf():
        if 'file' not in request.files:
            return jsonify({"error": "No file part"}), 400

        file = request.files['file']
        if file.filename == '' or not file.filename.lower().endswith('.pdf'):
            return jsonify({"error": "Invalid file, please upload a PDF"}), 400

        try:
            # Read file into memory
            pdf_bytes = file.read()
            pdf_stream = io.BytesIO(pdf_bytes)
            file_length = len(pdf_bytes)
            file_name = file.filename

            # 1. Upload original PDF to MinIO
            minio_bucket = os.getenv('MINIO_BUCKET')
            minio_client.put_object(
                minio_bucket,
                file_name,
                pdf_stream,
                length=file_length,
                content_type='application/pdf'
            )
            print(f"Successfully uploaded '{file_name}' to MinIO bucket '{minio_bucket}'.")

            # 2. Extract text from PDF
            pdf_stream.seek(0)  # Reset stream position after upload
            extracted_text = extract_text_from_pdf(pdf_stream)
            if extracted_text is None:
                return jsonify({"error": "Could not extract text from PDF"}), 500

            # 3. Index metadata and text into OpenSearch
            document = {
                'file_name': file_name,
                'minio_path': f"/{minio_bucket}/{file_name}",
                'content': extracted_text,
                'size_bytes': file_length
            }
            opensearch_index = os.getenv('OPENSEARCH_INDEX')
            opensearch_client.index(
                index=opensearch_index,
                body=document,
                refresh=True  # Make it immediately searchable
            )
            print(f"Successfully indexed metadata for '{file_name}' in OpenSearch.")

            return jsonify({
                "message": "File uploaded and indexed successfully!",
                "file_name": file_name,
                "minio_path": document['minio_path']
            }), 201

        except Exception as e:
            print(f"An error occurred: {e}")
            return jsonify({"error": "An internal error occurred"}), 500

    # --- Main Execution ---

    if __name__ == '__main__':
        # Create the bucket and index before the server starts accepting requests
        with app.app_context():
            setup_minio_and_opensearch()
        app.run(host='0.0.0.0', port=5001, debug=True)

    uv run app.py
The backend server starts at http://localhost:5001. On first run, it automatically creates the required MinIO bucket (pdfs) and OpenSearch index (pdf_documents).
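app.py reads all of its connection settings from environment variables loaded from backend/.env, whose contents are not shown here. The sketch below illustrates one way to validate those variables at startup. The variable names are the ones app.py reads; the example values are assumptions chosen to match docker-compose.yml and the bucket and index names mentioned above, and load_settings is a hypothetical helper that is not part of the project.

```python
import os

# Names come from app.py; values are assumed defaults for a local setup.
EXAMPLE_ENV = {
    "OPENSEARCH_HOST": "localhost",
    "OPENSEARCH_PORT": "9200",
    "OPENSEARCH_INDEX": "pdf_documents",
    "MINIO_ENDPOINT": "localhost:9000",
    "MINIO_ACCESS_KEY": "minioadmin",
    "MINIO_SECRET_KEY": "minioadmin",
    "MINIO_BUCKET": "pdfs",
}


def load_settings(env=None):
    """Return the required settings, failing early if any are missing or invalid."""
    env = os.environ if env is None else env
    missing = [name for name in EXAMPLE_ENV if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    settings = {name: env[name] for name in EXAMPLE_ENV}
    try:
        # The OpenSearch client expects the port as an integer.
        settings["OPENSEARCH_PORT"] = int(settings["OPENSEARCH_PORT"])
    except ValueError:
        raise RuntimeError("OPENSEARCH_PORT must be an integer") from None
    return settings
```

Without a check like this, a missing OPENSEARCH_PORT makes int(os.getenv('OPENSEARCH_PORT')) raise a TypeError while app.py is still being loaded, which is harder to diagnose than an explicit message naming the variable.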
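Step 3: Run the React frontend
- Open a new terminal.
- Change to the frontend directory:
      cd frontend
- Install the dependencies and start the development server:
      npm install
      npm start
- Your browser should open http://localhost:3000 automatically, where you will see the PDF upload page.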
Testing the result
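Besides the browser, the upload endpoint can be exercised from a script. This is a minimal sketch using only the Python standard library; it assumes the backend is listening on http://localhost:5001 and uses the form field name file, which is what app.py reads from request.files. The helper names build_multipart and upload_pdf are hypothetical.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field, filename, data, content_type="application/pdf"):
    """Encode one file as multipart/form-data; return (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def upload_pdf(path, url="http://localhost:5001/api/upload"):
    """POST the PDF at `path` to the backend and return the decoded JSON reply."""
    with open(path, "rb") as fh:
        data = fh.read()
    body, content_type = build_multipart("file", os.path.basename(path), data)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

For a successful upload, app.py answers with status 201 and a JSON body containing message, file_name, and minio_path, so upload_pdf("sample.pdf") would return that dictionary (sample.pdf being any local PDF).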
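How it works
- Upload: You select a PDF file in the React frontend and click "Upload".
- API call: The frontend sends the file to the /api/upload endpoint of the Flask backend.
- Processing: The Flask server then does the following:
  a. Uploads the original PDF file directly to the pdfs bucket in MinIO.
  b. Extracts all text from the PDF with the PyPDF2 library.
  c. Builds a JSON document containing the file name, its path in MinIO, the extracted text, and the file size in bytes.
  d. Indexes this JSON document into OpenSearch.
- Result: You can now open OpenSearch Dashboards (http://localhost:5601), inspect the pdf_documents index, and search the content of the PDFs you uploaded.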
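The indexed documents can also be searched from Python rather than from the Dashboards UI. The sketch below builds a full-text match query against the content field and runs it through an opensearch-py client such as the opensearch_client created in app.py. The field names are the ones app.py indexes; build_content_query and search_pdfs are hypothetical helpers, and the highlight section is an optional addition for illustration.

```python
def build_content_query(text, size=5):
    """Build a search body that matches `text` against the `content` field."""
    return {
        "size": size,
        # Return only the metadata; the full extracted text can be large.
        "_source": ["file_name", "minio_path", "size_bytes"],
        "query": {"match": {"content": text}},
        "highlight": {"fields": {"content": {}}},
    }


def search_pdfs(client, text, index="pdf_documents"):
    """Run the query with an opensearch-py client and return the list of hits."""
    response = client.search(index=index, body=build_content_query(text))
    return response["hits"]["hits"]
```

Each returned hit carries the stored metadata under _source, so hit["_source"]["minio_path"] points back to the original PDF in MinIO.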