
[Example] Parsing a local PDF document with the Tongyi Qianwen (Qwen) large model and converting it to Markdown

news 2025/9/20 6:16:15

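Install the required tools

Tongyi Qianwen (Qwen) is currently offered mainly through an API or as open-source models. To parse a PDF, first extract the text with a third-party library (such as PyPDF2 or pdfplumber), then call the Qwen API to process it. The two methods below follow this approach:

Method 1: Extract the text with PyPDF2, then call the Qwen API

Install the dependencies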
```
pip install pypdf2 requests
```

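Extract the PDF text

```
from PyPDF2 import PdfReader


def extract_text_from_pdf(file_path):
    """Extract the text of every page with PyPDF2."""
    reader = PdfReader(file_path)
    text = ""
    for page in reader.pages:
        # Guard against pages that yield no text (for example scanned images)
        text += page.extract_text() or ""
    return text
```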

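Call the Tongyi Qianwen API to process the text

```
import requests


def qwen_api_request(text):
    """Send the extracted text to the Qwen HTTP API and return the generated text."""
    api_url = "https://dashscope.aliyuncs.com/api/v1/services/aigc/text-generation/generation"
    headers = {
        "Authorization": "Bearer YOUR_API_KEY",  # replace with your actual API key
        "Content-Type": "application/json",
    }
    data = {
        "model": "qwen-max",
        "input": {
            "messages": [
                {"role": "user", "content": f"Convert the following text to Markdown format:\n{text}"}
            ]
        },
    }
    response = requests.post(api_url, json=data, headers=headers, timeout=120)
    response.raise_for_status()  # surface HTTP errors before reading the body
    return response.json()["output"]["text"]
```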
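The request function above reads `output.text` directly, which raises a `KeyError` if the service answers in a different shape. The sketch below is a more defensive reader. It assumes two response layouts (`output.text`, and `output.choices[0].message.content` when a message-style result format is requested) and that error bodies carry `code` and `message` fields; these shapes are assumptions to verify against the DashScope documentation, and the helper name `extract_generated_text` is hypothetical.

```python
def extract_generated_text(payload):
    """Return the generated text from a DashScope-style response dict.

    Assumed shapes (not confirmed by the article):
      {"output": {"text": "..."}}
      {"output": {"choices": [{"message": {"content": "..."}}]}}
    """
    output = payload.get("output") or {}

    # Shape 1: plain text result
    if isinstance(output.get("text"), str):
        return output["text"]

    # Shape 2: message-style result
    choices = output.get("choices") or []
    if choices:
        message = choices[0].get("message") or {}
        content = message.get("content")
        if isinstance(content, str):
            return content

    # Neither shape matched: report whatever error details are present
    raise ValueError(
        f"No generated text in response: {payload.get('code')} {payload.get('message')}"
    )
```

It could replace the last line of the request function, for example `return extract_generated_text(response.json())`.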

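Method 2: Combine pdfplumber with the Qwen API

Install the dependencies (this method calls Qwen through the dashscope SDK)
```
pip install pdfplumber dashscope
```
Extract the PDF text (keeping table rows and layout as far as pdfplumber's text extraction allows)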

```
import pdfplumber


def extract_text_from_pdf(pdf_path):
    """Extract all text from a PDF with pdfplumber."""
    text = ""
    try:
        with pdfplumber.open(pdf_path) as pdf:
            for i, page in enumerate(pdf.pages):
                page_text = page.extract_text()
                if page_text:
                    text += f"\n--- Page {i + 1} ---\n"
                    text += page_text + "\n"
        print("✅ PDF text extracted successfully. Preview of the first 300 characters:")
        print(text[:300] + "...")
        return text
    except Exception as e:
        print(f"❌ PDF extraction failed: {str(e)}")
        return None
```
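`extract_text()` returns tables as ordinary lines of text and leaves it to the model to rebuild them. pdfplumber also provides `page.extract_tables()`, which returns each table as a list of rows. As a possible complement, the sketch below renders one such table as Markdown. The helper name `rows_to_markdown_table` is hypothetical and not part of the original article. It assumes the first row is the header and that empty cells arrive as `None`.

```python
def rows_to_markdown_table(rows):
    """Render a list of rows (lists of cell strings or None) as a Markdown table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell):
        # Empty cells become "", whitespace and newlines inside a cell are
        # collapsed, and pipes are escaped so they do not break the table.
        text = "" if cell is None else str(cell)
        return " ".join(text.split()).replace("|", "\\|")

    def render(row):
        # Pad short rows so every line has the same number of columns
        cells = [clean(c) for c in row] + [""] * (width - len(row))
        return "| " + " | ".join(cells) + " |"

    lines = [render(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(render(row) for row in rows[1:])
    return "\n".join(lines)
```

Inside the page loop this might be used as `for table in page.extract_tables(): text += rows_to_markdown_table(table) + "\n"`, so that the model receives tables that are already structured.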

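Call the Qwen model to convert the text to Markdown

```
from dashscope import Generation  # dashscope.api_key must be set before calling


def enhance_text_to_markdown_with_qwen(text_content):
    """Call qwen-max to convert the raw text into structured Markdown."""
    if not text_content.strip():
        return None

    prompt = f"""
You are a professional industrial-equipment documentation engineer. Convert the following raw text extracted from a PDF into a clearly structured, consistently formatted, semantically complete Markdown document. Requirements:
1. Detect headings and assign sensible heading levels (#, ##, ###)
2. Format lists with - or 1. 2. 3.
3. Render tables in standard, aligned Markdown table syntax (even if the source is a plain-text table)
4. Keep every technical detail, parameter, procedure step, and safety warning; do not omit anything
5. If the source has section numbers (such as 1.1, 2.3.1), keep them and turn them into headings
6. Where the text refers to an image (such as "see the figure below" or "the structure diagram is as follows"), insert a placeholder: ![description](image_X.png)
7. Output pure Markdown with no additional explanation

Original text:
---
{text_content}
---
"""
    try:
        response = Generation.call(
            model="qwen-max",  # the article's choice of most capable text model
            prompt=prompt,
            seed=12345,
            temperature=0.2,  # low temperature to stay faithful to the source
            top_p=0.85,
            result_format='message',
        )
        if response.status_code == 200:
            markdown_content = response.output.choices[0].message.content.strip()
            return markdown_content
        else:
            print(f"❌ Qwen API call failed: {response.message}")
            return None
    except Exception as e:
        print(f"❌ Error while calling Qwen: {str(e)}")
        return None
```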

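Even when the prompt asks for pure Markdown, chat models sometimes wrap the whole answer in a single code fence tagged `markdown`. The sketch below removes such a wrapper before the result is saved. The helper name `strip_wrapping_fence` is hypothetical and not part of the original article. One known limitation: an answer that happens to start with one fenced block and end with another would also be treated as wrapped.

```python
import re

# Matches an answer that is entirely wrapped in one fenced block, with an
# optional language tag (such as "markdown") after the opening fence.
_WRAPPED = re.compile(r"\A`{3,}[\w-]*[ \t]*\n(.*?)\n`{3,}[ \t]*\Z", re.DOTALL)


def strip_wrapping_fence(markdown_text):
    """Remove a single code fence that wraps the whole model answer, if present."""
    stripped = markdown_text.strip()
    match = _WRAPPED.match(stripped)
    # Return the inner content when wrapped, otherwise the trimmed input
    return match.group(1) if match else stripped
```

It could be applied to the model output right before writing the file, for example `enhanced_md = strip_wrapping_fence(enhanced_md)`.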
Write the Markdown file
Write the processed result to a .md file:

```
def pdf_to_markdown_via_text(pdf_path, output_md_path="output.md"):
    """Main flow: PDF → local text extraction → Qwen structuring → save Markdown."""
    print(f"📄 Extracting text locally from PDF file: {pdf_path}")

    # Step 1: extract the text locally
    raw_text = extract_text_from_pdf(pdf_path)
    if not raw_text:
        return False

    # Step 2: call Qwen to improve the structure
    print("🧠 Calling the Tongyi Qianwen MAX model to restructure the document...")
    enhanced_md = enhance_text_to_markdown_with_qwen(raw_text)
    if not enhanced_md:
        return False

    # Step 3: save the result
    with open(output_md_path, "w", encoding="utf-8") as f:
        f.write(enhanced_md)

    print(f"\n🎉 Conversion succeeded! Saved to: {output_md_path}")
    print("\n🔍 Preview of the first 500 characters of the generated Markdown:")
    print("=" * 60)
    print(enhanced_md[:500] + "..." if len(enhanced_md) > 500 else enhanced_md)
    print("=" * 60)
    return True
```
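The main flow sends the entire extracted text in a single request, which may exceed the model's input limit for long PDFs. One possible workaround is to split the text on line boundaries under a character budget and convert each piece separately. The sketch below shows only the splitting step. The name `split_into_chunks` and the default of 6000 characters are hypothetical; the right budget depends on the model's actual context limit, which the article does not state.

```python
def split_into_chunks(text, max_chars=6000):
    """Greedily pack whole lines into chunks of at most max_chars characters.

    A single line longer than max_chars is hard-split.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        # Hard-split oversized lines so that no piece exceeds the budget
        pieces = [line[i:i + max_chars] for i in range(0, len(line), max_chars)]
        for piece in pieces:
            # Start a new chunk when adding this piece would overflow
            if current and len(current) + len(piece) > max_chars:
                chunks.append(current)
                current = ""
            current += piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to `enhance_text_to_markdown_with_qwen` and the results joined with blank lines. Heading levels may not stay consistent across chunks, so the joined output is worth a manual check.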

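Method 2 -> complete example code

```
import dashscope
from dashscope import Generation
import pdfplumber  # lightweight PDF text-extraction library

# Set the API key
dashscope.api_key = "YOUR_API_KEY"  # ← replace with your API key


def extract_text_from_pdf(pdf_path):
    """Extract all text from a PDF with pdfplumber."""
    text = ""
    try:
        with pdfplumber.open(pdf_path) as pdf:
            for i, page in enumerate(pdf.pages):
                page_text = page.extract_text()
                if page_text:
                    text += f"\n--- Page {i + 1} ---\n"
                    text += page_text + "\n"
        print("✅ PDF text extracted successfully. Preview of the first 300 characters:")
        print(text[:300] + "...")
        return text
    except Exception as e:
        print(f"❌ PDF extraction failed: {str(e)}")
        return None


def enhance_text_to_markdown_with_qwen(text_content):
    """Call qwen-max to convert the raw text into structured Markdown."""
    if not text_content.strip():
        return None

    prompt = f"""
You are a professional industrial-equipment documentation engineer. Convert the following raw text extracted from a PDF into a clearly structured, consistently formatted, semantically complete Markdown document. Requirements:
1. Detect headings and assign sensible heading levels (#, ##, ###)
2. Format lists with - or 1. 2. 3.
3. Render tables in standard, aligned Markdown table syntax (even if the source is a plain-text table)
4. Keep every technical detail, parameter, procedure step, and safety warning; do not omit anything
5. If the source has section numbers (such as 1.1, 2.3.1), keep them and turn them into headings
6. Where the text refers to an image (such as "see the figure below" or "the structure diagram is as follows"), insert a placeholder: ![description](image_X.png)
7. Output pure Markdown with no additional explanation

Original text:
---
{text_content}
---
"""
    try:
        response = Generation.call(
            model="qwen-max",  # the article's choice of most capable text model
            prompt=prompt,
            seed=12345,
            temperature=0.2,  # low temperature to stay faithful to the source
            top_p=0.85,
            result_format='message',
        )
        if response.status_code == 200:
            markdown_content = response.output.choices[0].message.content.strip()
            return markdown_content
        else:
            print(f"❌ Qwen API call failed: {response.message}")
            return None
    except Exception as e:
        print(f"❌ Error while calling Qwen: {str(e)}")
        return None


def pdf_to_markdown_via_text(pdf_path, output_md_path="output.md"):
    """Main flow: PDF → local text extraction → Qwen structuring → save Markdown."""
    print(f"📄 Extracting text locally from PDF file: {pdf_path}")

    # Step 1: extract the text locally
    raw_text = extract_text_from_pdf(pdf_path)
    if not raw_text:
        return False

    # Step 2: call Qwen to improve the structure
    print("🧠 Calling the Tongyi Qianwen MAX model to restructure the document...")
    enhanced_md = enhance_text_to_markdown_with_qwen(raw_text)
    if not enhanced_md:
        return False

    # Step 3: save the result
    with open(output_md_path, "w", encoding="utf-8") as f:
        f.write(enhanced_md)

    print(f"\n🎉 Conversion succeeded! Saved to: {output_md_path}")
    print("\n🔍 Preview of the first 500 characters of the generated Markdown:")
    print("=" * 60)
    print(enhanced_md[:500] + "..." if len(enhanced_md) > 500 else enhanced_md)
    print("=" * 60)
    return True


# ====== Run ======
if __name__ == "__main__":
    PDF_FILE = "维护保养.pdf"    # the sample maintenance manual
    OUTPUT_FILE = "维护保养.md"

    # Install the dependencies (if not yet installed):
    # pip install pdfplumber dashscope
    success = pdf_to_markdown_via_text(PDF_FILE, OUTPUT_FILE)
    if success:
        print("\n✅ Pipeline finished!")
    else:
        print("\n❌ Processing failed; check the error messages above.")
```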