
Converting PDF Documents to Markdown

news 2025/10/22 7:46:30

This article covers only the processing of Chinese and English PDF documents and their conversion to Markdown.

Quick deployment of the MinerU 2.5 multimodal model

MinerU official website
Download location for the locally deployed multimodal model

Installing from source

git clone https://github.com/opendatalab/MinerU.git
source venv/bin/activate
cd /MinerU
pip install -e .

Stable versions

CUDA Version: 12.4
python=3.12.11
torch=2.8.0
transformers=4.55.2
vllm=0.10.2
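
Before going further, it can help to confirm that the active environment matches the versions listed above. The following is a minimal sketch, not part of MinerU: the function name `check_versions` and its `lookup` parameter are made up for illustration, and only the three Python packages from the list are checked.

```python
from importlib import metadata

# Package versions listed in the article as a known-good combination.
EXPECTED = {"torch": "2.8.0", "transformers": "4.55.2", "vllm": "0.10.2"}


def check_versions(expected=EXPECTED, lookup=metadata.version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Ignore local build tags such as "+cu124" when comparing.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches
```

Calling `check_versions()` inside the virtual environment returns an empty dict when everything matches, and otherwise lists each package with its expected and installed version.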

Source changes for local deployment

Add the configuration file mineru.json

cd /MinerU
touch mineru.json
chmod 777 mineru.json
vim mineru.json
{
    "bucket_info": {
        "bucket-name-1": ["ak", "sk", "endpoint"],
        "bucket-name-2": ["ak", "sk", "endpoint"]
    },
    "pipeline": "/MinerU",    # change to your local model directory
    "models-dir": "/MinerU/models/MinerU20",  # same as above
    "output-options": {
        "save-intermediate": false,
        "save-images": false
    },
    "layoutreader-model-dir": "/home/projects/MinerU/mineru/model/reading_order",
    "device-mode": "cpu",
    "num-workers": 8,
    "memory-limit": "16g",
    "layout-config": {
        "model": "doclayout_yolo"
    },
    "formula-config": {
        "mfd_model": "yolo_v8_mfd",
        "mfr_model": "unimernet_small",
        "enable": true
    },
    "table-config": {
        "model": "rapid_table",
        "sub_model": "slanet_plus",
        "enable": true,
        "max_time": 400
    },
    "llm-aided-config": {
        "formula_aided": {
            "api_key": "your_api_key",
            "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
            "model": "qwen2.5-7b-instruct",
            "enable": false
        },
        "text_aided": {
            "api_key": "your_api_key",
            "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
            "model": "qwen2.5-7b-instruct",
            "enable": false
        },
        "title_aided": {
            "api_key": "your_api_key",
            "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
            "model": "qwen2.5-32b-instruct",
            "enable": false
        }
    },
    "config_version": "1.2.0"
}

Note: the # annotations are explanatory only. JSON does not allow comments, so delete them before saving the file; otherwise json.load will fail to parse it.
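
Because JSON has no comment syntax, one way to avoid hand-editing mistakes is to generate the file from Python. This is a minimal sketch under stated assumptions: `write_mineru_config` is a hypothetical helper, it writes only a reduced subset of the keys shown in the article, and the key names are copied from the article rather than verified against MinerU's schema.

```python
import json
from pathlib import Path


def write_mineru_config(path, models_dir, device_mode="cpu"):
    """Write a reduced mineru.json and return the parsed result."""
    config = {
        "models-dir": models_dir,        # local model directory
        "device-mode": device_mode,
        "table-config": {
            "model": "rapid_table",
            "sub_model": "slanet_plus",
            "enable": True,
            "max_time": 400,
        },
        "config_version": "1.2.0",
    }
    path = Path(path)
    path.write_text(json.dumps(config, indent=4, ensure_ascii=False), encoding="utf-8")
    # Re-read the file so that any syntax problem surfaces immediately.
    return json.loads(path.read_text(encoding="utf-8"))
```

For example, `write_mineru_config("/MinerU/mineru.json", "/MinerU/models/MinerU20")` would create the file at the location used in the article.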

Pointing MinerU at the configuration file

vim /home/projects/New_MinerU/MinerU/mineru/utils/config_reader.py
Find the read_config function and modify part of the code:

def read_config():
    if os.path.isabs(CONFIG_FILE_NAME):
        config_file = CONFIG_FILE_NAME
    else:
        home_dir = os.path.expanduser('~')
        config_file = os.path.join(home_dir, CONFIG_FILE_NAME)
    config_file = "/home/mineru.json"  # added: hard-coded path; it must point to the mineru.json you created
    if not os.path.exists(config_file):
        # logger.warning(f'{config_file} not found, using default configuration')
        return None
    else:
        with open(config_file, 'r', encoding='utf-8') as f:
            config = json.load(f)
        return config
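
The lookup logic of the patched function can be exercised in isolation. The sketch below is a standalone illustration, not MinerU code: `read_config_sketch` and its `explicit_path` parameter are hypothetical, with `explicit_path` playing the role of the hard-coded override.

```python
import json
import os


def read_config_sketch(explicit_path=None, file_name="mineru.json"):
    """Resolve the config path, then load it; return None if the file is missing."""
    if explicit_path:
        # Stands in for the hard-coded override added in the article.
        config_file = explicit_path
    elif os.path.isabs(file_name):
        config_file = file_name
    else:
        # Default behaviour: look for the file in the user's home directory.
        config_file = os.path.join(os.path.expanduser("~"), file_name)
    if not os.path.exists(config_file):
        return None  # the caller falls back to default settings
    with open(config_file, "r", encoding="utf-8") as f:
        return json.load(f)
```

A `None` result is a quick signal that the path does not match the place where mineru.json was actually saved.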

Loading local models

vim /MinerU/mineru/utils/models_download_utils.py
Find the auto_download_and_get_model_root_path function and replace part of the code:

def auto_download_and_get_model_root_path(relative_path: str, repo_mode='pipeline') -> str:
    model_source = os.getenv('MINERU_MODEL_SOURCE', "huggingface")
    model_source = "local"  # added: always use local models
    if model_source == 'local':
        local_models_config = get_local_models_dir()
        root_path = local_models_config    # modified
        if not root_path:
            raise ValueError(f"Local path for repo_mode '{repo_mode}' is not configured.")
        return root_path
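
With the source forced to "local", a wrong model path only shows up when parsing starts. The following sketch shows the same check done up front. It is an assumption-laden illustration: `resolve_local_model_root` is hypothetical, it reads the "models-dir" key as written in the article's mineru.json, and the real `get_local_models_dir` may behave differently.

```python
import os


def resolve_local_model_root(config, repo_mode="pipeline"):
    """Return the configured local model directory or raise a descriptive error."""
    root_path = (config or {}).get("models-dir")
    if not root_path:
        raise ValueError(f"Local path for repo_mode '{repo_mode}' is not configured.")
    if not os.path.isdir(root_path):
        # Extra check beyond the article's code: the directory must exist on disk.
        raise FileNotFoundError(f"Model directory does not exist: {root_path}")
    return root_path
```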

Disabling document copying

vim /home/projects/New_MinerU/MinerU/mineru/cli/common.py
Find the convert_pdf_bytes_to_bytes_by_pypdfium2 function and replace part of the code:

def convert_pdf_bytes_to_bytes_by_pypdfium2(pdf_bytes, start_page_id=0, end_page_id=None):
    return pdf_bytes   # added: return the input unchanged, skipping the copy

Disabling table parsing (optional)

vim /home/projects/New_MinerU/MinerU/mineru/backend/pipeline/pipeline_analyze.py
Find the batch_image_analyze function and modify the code at the end:

    table_enable = False   # added: force table parsing off
    batch_model = BatchAnalyze(model_manager, batch_ratio, formula_enable, table_enable)
    results = batch_model(images_with_extra_info)
    clean_memory(get_device())
    return results

Running MinerU locally

mineru -p <input_path> -o <output_path> -m auto -b pipeline -l en -d cpu

Options:
  -v, --version                   Show the version and exit
  -p, --path PATH                 Input file path or directory (required)
  -o, --output PATH               Output directory (required)
  -m, --method [auto|txt|ocr]     Parsing method: auto (default), txt, ocr (pipeline backend only)
  -b, --backend [pipeline|vlm-transformers|vlm-vllm-engine|vlm-http-client]
                                  Parsing backend (default: pipeline)
  -l, --lang [ch|ch_server|ch_lite|en|korean|japan|chinese_cht|ta|te|ka|th|el|latin|arabic|east_slavic|cyrillic|devanagari]
                                  Document language (can improve OCR accuracy; pipeline backend only)
  -u, --url TEXT                  Server address, required when using http-client
  -s, --start INTEGER             First page to parse (0-based)
  -e, --end INTEGER               Last page to parse (0-based)
  -f, --formula BOOLEAN           Enable formula parsing (default: on)
  -t, --table BOOLEAN             Enable table parsing (default: on)
  -d, --device TEXT               Inference device (e.g. cpu/cuda/cuda:0/npu/mps; pipeline backend only)
  --vram INTEGER                  Maximum GPU memory per process, in GB (pipeline backend only)
  --source [huggingface|modelscope|local]
                                  Model source (default: huggingface)
  --help                          Show help information
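
For batch jobs, the same command can be driven from Python. The sketch below uses only the flags from the option list above. The function names `build_mineru_command` and `run_mineru` are made up for illustration, and actually running the command assumes the `mineru` executable is on the PATH.

```python
import subprocess


def build_mineru_command(input_path, output_path, method="auto",
                         backend="pipeline", lang="en", device="cpu"):
    """Build the argument list for the mineru CLI call shown in the article."""
    return [
        "mineru",
        "-p", str(input_path),   # input file or directory
        "-o", str(output_path),  # output directory
        "-m", method,
        "-b", backend,
        "-l", lang,
        "-d", device,
    ]


def run_mineru(input_path, output_path, **options):
    """Run mineru and raise CalledProcessError if it exits with a non-zero status."""
    cmd = build_mineru_command(input_path, output_path, **options)
    return subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.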

Output files

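output_dir/vlm/
├── content_list.json    # parsed content list: semantic information for text blocks, paragraphs, headings, etc.
├── document1.md         # lightweight text output (Markdown)
├── images/              # resource directory (images, attachments)
│   ├── page_1.png
│   ├── page_2.png
│   └── ...
├── layout.pdf           # visualization of the original layout: position and arrangement of text, images, tables, etc.
├── middle.json          # intermediate data, such as OCR results and intermediate representations from text extraction
├── model.json           # model configuration or parameters used during parsing
├── origin.pdf           # copy of the original input document
└── span.pdf             # highlights specific elements (cross-page tables, formulas, image regions)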
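
Downstream code usually needs only the Markdown file and the content list. The following sketch locates them in an output directory. It assumes the file naming shown in the tree above; `collect_outputs` and `load_content_list` are hypothetical helpers, not part of MinerU.

```python
import json
from pathlib import Path


def collect_outputs(output_dir):
    """Return (markdown_files, content_list_files) found anywhere under output_dir."""
    root = Path(output_dir)
    markdown_files = sorted(root.rglob("*.md"))
    # Matches "content_list.json" as well as prefixed names ending in it.
    content_lists = sorted(root.rglob("*content_list.json"))
    return markdown_files, content_lists


def load_content_list(path):
    """Parse a content list file into Python objects."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```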
