Using MinerU with Docker
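1 Introduction
MinerU is a tool that converts PDFs into machine-readable formats such as Markdown and JSON, which makes it easy to extract the content into whatever format you need. The output quality is good, but processing is somewhat slow.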
# Official site
https://opendatalab.github.io/MinerU/zh/
# GitHub repository
https://github.com/opendatalab/mineru
2 Building the image
The image is built from vllm-openai:v0.10.1.1 and a Dockerfile.
Pull the vllm-openai:v0.10.1.1 image
docker pull vllm/vllm-openai:v0.10.1.1
Dockerfile contents (the file can be downloaded from the official site)
# Use DaoCloud mirrored vllm image for China region for gpu with Ampere architecture and above (Compute Capability>=8.0)
# Compute Capability version query (https://developer.nvidia.com/cuda-gpus)
FROM docker.m.daocloud.io/vllm/vllm-openai:v0.10.1.1

# Use the official vllm image
# FROM vllm/vllm-openai:v0.10.1.1

# Use DaoCloud mirrored vllm image for China region for gpu with Turing architecture and below (Compute Capability<8.0)
# FROM docker.m.daocloud.io/vllm/vllm-openai:v0.10.2

# Use the official vllm image
# FROM vllm/vllm-openai:v0.10.2

# Install libgl for opencv support & Noto fonts for Chinese characters
RUN apt-get update && \
    apt-get install -y \
        fonts-noto-core \
        fonts-noto-cjk \
        fontconfig \
        libgl1 && \
    fc-cache -fv && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Install mineru latest
RUN python3 -m pip install -U 'mineru[core]' -i https://mirrors.aliyun.com/pypi/simple --break-system-packages && \
    python3 -m pip cache purge

# Download models and update the configuration file
RUN /bin/bash -c "mineru-models-download -s modelscope -m all"

# Set the entry point to activate the virtual environment and run the command line tool
ENTRYPOINT ["/bin/bash", "-c", "export MINERU_MODEL_SOURCE=local && exec \"$@\"", "--"]
Build the image
docker build -t mineru:2.6.4 -f Dockerfile .
Create and start the container
docker run -itd \
--name mineru \
--gpus all \
--shm-size 32g \
-p 30000:30000 \
-p 7860:7860 \
-p 8000:8000 \
--ipc=host \
mineru:2.6.4 \
mineru-vllm-server --port 30000
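Model loading can take a while after the container starts, so it can help to wait until the port accepts connections before sending any documents. The sketch below is a minimal TCP-level check using only the Python standard library. `wait_for_port` is a hypothetical helper name, not part of MinerU, and the default timeout and interval are arbitrary assumptions. It only confirms that something is listening on the port, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 300.0, interval: float = 2.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

With the port mapping used above, a call such as `wait_for_port("192.168.0.104", 30000)` would block until the server port is reachable or five minutes have passed; the IP address is the one used later in this post and should be replaced with your own host.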
3 Calling from Python
The code below is adapted from the Python example on the official site.
Python code
import os
from pathlib import Path

from mineru.backend.vlm.vlm_middle_json_mkcontent import union_make
from mineru.backend.vlm.vlm_analyze import doc_analyze
from mineru.utils.enum_class import MakeMode
from mineru.cli.common import read_fn, convert_pdf_bytes_to_bytes_by_pypdfium2, prepare_env
from mineru.data.data_reader_writer import FileBasedDataWriter


def test(pdf_path: Path, pdf_file_name: str, output_dir: Path, server_url):
    # Parsing method and backend
    parse_method = "vlm"
    backend = "http-client"
    # Read the file as bytes
    pdf_bytes = read_fn(pdf_path)
    pdf_bytes = convert_pdf_bytes_to_bytes_by_pypdfium2(pdf_bytes)
    # Prepare the output directories and writers
    local_image_dir, local_md_dir = prepare_env(output_dir, pdf_file_name, parse_method)
    image_writer = FileBasedDataWriter(local_image_dir)
    md_writer = FileBasedDataWriter(local_md_dir)
    # Parse the file
    middle_json, infer_result = doc_analyze(pdf_bytes, image_writer=image_writer, backend=backend, server_url=server_url)
    # Print the raw results
    print(middle_json)
    print(infer_result)
    # Get the parsed page information
    pdf_info = middle_json["pdf_info"]
    # Build the Markdown content
    f_make_md_mode = MakeMode.MM_MD
    image_dir = str(os.path.basename(local_image_dir))
    md_content_str = union_make(pdf_info, f_make_md_mode, image_dir)
    # Markdown text
    print(md_content_str)
    # Save the Markdown file
    md_writer.write_string(
        f"{pdf_file_name}.md",
        md_content_str,
    )


if __name__ == '__main__':
    pdf_input_temp = "E:/test/input/test1.pdf"
    test(pdf_path=Path(pdf_input_temp), pdf_file_name="test1", output_dir=Path("E:/test/output"), server_url="http://192.168.0.104:30000")
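To process more than one file, the `test` function above can be called in a loop. The sketch below covers only the input side: it lists the PDFs directly inside a folder and derives the `pdf_file_name` argument from each file's stem. `collect_pdfs` is a hypothetical helper, not part of MinerU, and it assumes a flat folder with no subdirectories to search.

```python
from pathlib import Path


def collect_pdfs(input_dir: Path) -> list[tuple[Path, str]]:
    """Return (pdf_path, pdf_file_name) pairs for every PDF directly inside input_dir."""
    pairs = []
    for path in sorted(input_dir.iterdir()):
        # Compare the extension case-insensitively so "A.PDF" is included too
        if path.is_file() and path.suffix.lower() == ".pdf":
            pairs.append((path, path.stem))
    return pairs
```

Each pair can then be passed on, for example `for pdf_path, name in collect_pdfs(Path("E:/test/input")): test(pdf_path, name, Path("E:/test/output"), server_url)`, where `server_url` is the same address used in the single-file call.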
4 Results
Server output

Python script output

