
Xiaohongshu open-sources dots.ocr: multilingual document layout parsing in a single vision-language model


Introduction

dots.ocr is a powerful multilingual document parser that unifies layout detection and content recognition in a single vision-language model while maintaining strong reading-order recovery. Although its base model is a lightweight 1.7B-parameter LLM, it achieves state-of-the-art (SOTA) performance.

  1. Powerful performance: dots.ocr achieves SOTA results for text, tables, and reading order on OmniDocBench, and its formula recognition is comparable to much larger models such as Doubao-1.5 and Gemini2.5-Pro.
  2. Multilingual support: dots.ocr shows strong parsing ability for low-resource languages, with clear advantages in both layout detection and content recognition on our in-house multilingual document benchmark.
  3. Unified, simple architecture: built on a single vision-language model, dots.ocr is far simpler than conventional pipelines that chain multiple specialized models. Switching between tasks requires only changing the input prompt, demonstrating that a VLM can rival traditional detection models such as DocLayout-YOLO on detection tasks.
  4. Efficient and fast: built on a 1.7B-parameter LLM, dots.ocr offers faster inference than many high-performing solutions based on larger foundation models.

Performance comparison: dots.ocr vs. competing models

[Figure: benchmark comparison of dots.ocr with competing models]

Notes:

  • The EN and ZH scores are end-to-end results on OmniDocBench; the Multilingual score is the end-to-end result on dots.ocr-bench. In the tables below, the Edit metrics are normalized edit distances (lower is better), while TEDS is a table-structure similarity score (higher is better).

Benchmark results

  1. OmniDocBench
    End-to-end evaluation results across different tasks.
| Model Type | Method | Overall Edit ↓ (EN / ZH) | Text Edit ↓ (EN / ZH) | Formula Edit ↓ (EN / ZH) | Table TEDS ↑ (EN / ZH) | Table Edit ↓ (EN / ZH) | Read Order Edit ↓ (EN / ZH) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Pipeline Tools | MinerU | 0.150 / 0.357 | 0.061 / 0.215 | 0.278 / 0.577 | 78.6 / 62.1 | 0.180 / 0.344 | 0.079 / 0.292 |
| Pipeline Tools | Marker | 0.336 / 0.556 | 0.080 / 0.315 | 0.530 / 0.883 | 67.6 / 49.2 | 0.619 / 0.685 | 0.114 / 0.340 |
| Pipeline Tools | Mathpix | 0.191 / 0.365 | 0.105 / 0.384 | 0.306 / 0.454 | 77.0 / 67.1 | 0.243 / 0.320 | 0.108 / 0.304 |
| Pipeline Tools | Docling | 0.589 / 0.909 | 0.416 / 0.987 | 0.999 / 1 | 61.3 / 25.0 | 0.627 / 0.810 | 0.313 / 0.837 |
| Pipeline Tools | Pix2Text | 0.320 / 0.528 | 0.138 / 0.356 | 0.276 / 0.611 | 73.6 / 66.2 | 0.584 / 0.645 | 0.281 / 0.499 |
| Pipeline Tools | Unstructured | 0.586 / 0.716 | 0.198 / 0.481 | 0.999 / 1 | 0 / 0.06 | 1 / 0.998 | 0.145 / 0.387 |
| Pipeline Tools | OpenParse | 0.646 / 0.814 | 0.681 / 0.974 | 0.996 / 1 | 64.8 / 27.5 | 0.284 / 0.639 | 0.595 / 0.641 |
| Pipeline Tools | PPStruct-V3 | 0.145 / 0.206 | 0.058 / 0.088 | 0.295 / 0.535 | - / - | 0.159 / 0.109 | 0.069 / 0.091 |
| Expert VLMs | GOT-OCR | 0.287 / 0.411 | 0.189 / 0.315 | 0.360 / 0.528 | 53.2 / 47.2 | 0.459 / 0.520 | 0.141 / 0.280 |
| Expert VLMs | Nougat | 0.452 / 0.973 | 0.365 / 0.998 | 0.488 / 0.941 | 39.9 / 0 | 0.572 / 1.000 | 0.382 / 0.954 |
| Expert VLMs | Mistral OCR | 0.268 / 0.439 | 0.072 / 0.325 | 0.318 / 0.495 | 75.8 / 63.6 | 0.600 / 0.650 | 0.083 / 0.284 |
| Expert VLMs | OLMOCR-sglang | 0.326 / 0.469 | 0.097 / 0.293 | 0.455 / 0.655 | 68.1 / 61.3 | 0.608 / 0.652 | 0.145 / 0.277 |
| Expert VLMs | SmolDocling-256M | 0.493 / 0.816 | 0.262 / 0.838 | 0.753 / 0.997 | 44.9 / 16.5 | 0.729 / 0.907 | 0.227 / 0.522 |
| Expert VLMs | Dolphin | 0.206 / 0.306 | 0.107 / 0.197 | 0.447 / 0.580 | 77.3 / 67.2 | 0.180 / 0.285 | 0.091 / 0.162 |
| Expert VLMs | MinerU 2 | 0.139 / 0.240 | 0.047 / 0.109 | 0.297 / 0.536 | 82.5 / 79.0 | 0.141 / 0.195 | 0.069 / 0.118 |
| Expert VLMs | OCRFlux | 0.195 / 0.281 | 0.064 / 0.183 | 0.379 / 0.613 | 71.6 / 81.3 | 0.253 / 0.139 | 0.086 / 0.187 |
| Expert VLMs | MonkeyOCR-pro-3B | 0.138 / 0.206 | 0.067 / 0.107 | 0.246 / 0.421 | 81.5 / 87.5 | 0.139 / 0.111 | 0.100 / 0.185 |
| General VLMs | GPT4o | 0.233 / 0.399 | 0.144 / 0.409 | 0.425 / 0.606 | 72.0 / 62.9 | 0.234 / 0.329 | 0.128 / 0.251 |
| General VLMs | Qwen2-VL-72B | 0.252 / 0.327 | 0.096 / 0.218 | 0.404 / 0.487 | 76.8 / 76.4 | 0.387 / 0.408 | 0.119 / 0.193 |
| General VLMs | Qwen2.5-VL-72B | 0.214 / 0.261 | 0.092 / 0.18 | 0.315 / 0.434 | 82.9 / 83.9 | 0.341 / 0.262 | 0.106 / 0.168 |
| General VLMs | Gemini2.5-Pro | 0.148 / 0.212 | 0.055 / 0.168 | 0.356 / 0.439 | 85.8 / 86.4 | 0.13 / 0.119 | 0.049 / 0.121 |
| General VLMs | doubao-1-5-thinking-vision-pro-250428 | 0.140 / 0.162 | 0.043 / 0.085 | 0.295 / 0.384 | 83.3 / 89.3 | 0.165 / 0.085 | 0.058 / 0.094 |
| Expert VLMs | dots.ocr | 0.125 / 0.160 | 0.032 / 0.066 | 0.329 / 0.416 | 88.6 / 89.0 | 0.099 / 0.092 | 0.040 / 0.067 |

End-to-end text recognition performance across nine PDF page types.

| Model Type | Model | Book | Slides | Financial Report | Textbook | Exam Paper | Magazine | Academic Papers | Notes | Newspaper | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Pipeline Tools | MinerU | 0.055 | 0.124 | 0.033 | 0.102 | 0.159 | 0.072 | 0.025 | 0.984 | 0.171 | 0.206 |
| Pipeline Tools | Marker | 0.074 | 0.340 | 0.089 | 0.319 | 0.452 | 0.153 | 0.059 | 0.651 | 0.192 | 0.274 |
| Pipeline Tools | Mathpix | 0.131 | 0.220 | 0.202 | 0.216 | 0.278 | 0.147 | 0.091 | 0.634 | 0.690 | 0.300 |
| Expert VLMs | GOT-OCR | 0.111 | 0.222 | 0.067 | 0.132 | 0.204 | 0.198 | 0.179 | 0.388 | 0.771 | 0.267 |
| Expert VLMs | Nougat | 0.734 | 0.958 | 1.000 | 0.820 | 0.930 | 0.830 | 0.214 | 0.991 | 0.871 | 0.806 |
| Expert VLMs | Dolphin | 0.091 | 0.131 | 0.057 | 0.146 | 0.231 | 0.121 | 0.074 | 0.363 | 0.307 | 0.177 |
| Expert VLMs | OCRFlux | 0.068 | 0.125 | 0.092 | 0.102 | 0.119 | 0.083 | 0.047 | 0.223 | 0.536 | 0.149 |
| Expert VLMs | MonkeyOCR-pro-3B | 0.084 | 0.129 | 0.060 | 0.090 | 0.107 | 0.073 | 0.050 | 0.171 | 0.107 | 0.100 |
| General VLMs | GPT4o | 0.157 | 0.163 | 0.348 | 0.187 | 0.281 | 0.173 | 0.146 | 0.607 | 0.751 | 0.316 |
| General VLMs | Qwen2.5-VL-7B | 0.148 | 0.053 | 0.111 | 0.137 | 0.189 | 0.117 | 0.134 | 0.204 | 0.706 | 0.205 |
| General VLMs | InternVL3-8B | 0.163 | 0.056 | 0.107 | 0.109 | 0.129 | 0.100 | 0.159 | 0.150 | 0.681 | 0.188 |
| General VLMs | doubao-1-5-thinking-vision-pro-250428 | 0.048 | 0.048 | 0.024 | 0.062 | 0.085 | 0.051 | 0.039 | 0.096 | 0.181 | 0.073 |
| Expert VLMs | dots.ocr | 0.031 | 0.047 | 0.011 | 0.082 | 0.079 | 0.028 | 0.029 | 0.109 | 0.056 | 0.055 |

Notes:

  • The metrics come from MonkeyOCR, OmniDocBench, and our in-house evaluation results.
  • We removed Page-header and Page-footer cells from the result markdown before scoring.
  • We used the tikz_preprocess pipeline to upscale the input images to 200 DPI (a minimal sketch of this kind of upscaling follows).
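tikz_preprocess is internal to the authors' evaluation setup. Purely as an illustration of what 200-DPI upscaling amounts to, here is a minimal sketch; the base DPI of 72 (typical for rendered PDF pages) and the use of Pillow are assumptions, not the actual pipeline.

```python
# Minimal sketch of DPI upscaling, NOT the actual tikz_preprocess pipeline:
# scale a page image rendered at an assumed base DPI (72) up to 200 DPI.
from PIL import Image

def upscale_to_200dpi(path: str, base_dpi: int = 72) -> Image.Image:
    img = Image.open(path)
    scale = 200 / base_dpi  # e.g., 72 -> 200 DPI is ~2.78x per side
    new_size = (round(img.width * scale), round(img.height * scale))
    return img.resize(new_size, Image.LANCZOS)

upscale_to_200dpi("demo/demo_image1.jpg").save("demo_image1_200dpi.jpg", dpi=(200, 200))
```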
  2. dots.ocr-bench
    This is an in-house benchmark of 1,493 PDF images covering 100 languages.
    End-to-end evaluation results across different tasks.
| Method | Overall Edit ↓ | Text Edit ↓ | Formula Edit ↓ | Table TEDS ↑ | Table Edit ↓ | Read Order Edit ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| MonkeyOCR-3B | 0.483 | 0.445 | 0.627 | 50.93 | 0.452 | 0.409 |
| doubao-1-5-thinking-vision-pro-250428 | 0.291 | 0.226 | 0.440 | 71.2 | 0.260 | 0.238 |
| doubao-1-6 | 0.299 | 0.270 | 0.417 | 71.0 | 0.258 | 0.253 |
| Gemini2.5-Pro | 0.251 | 0.163 | 0.402 | 77.1 | 0.236 | 0.202 |
| dots.ocr | 0.177 | 0.075 | 0.297 | 79.2 | 0.186 | 0.152 |

Notes:

  • We use the same metric-computation pipeline as OmniDocBench.
  • Page-header and Page-footer cells are removed from the result markdown before scoring, as in the filtering sketch below.
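The filtering described in the second note is straightforward. Here is a minimal sketch (not the authors' evaluation code), assuming the JSON layout format that dots.ocr emits, with bbox/category/text per element as specified by the prompt shown in the Hugging Face inference section below:

```python
# Minimal sketch of dropping header/footer elements before scoring.
# Assumes the model's JSON output: a list of objects with "bbox",
# "category", and "text" keys. The real evaluation code is internal;
# this only illustrates the idea.
import json

def strip_headers_footers(layout_json: str) -> list:
    elements = json.loads(layout_json)
    return [
        el for el in elements
        if el.get("category") not in ("Page-header", "Page-footer")
    ]
```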

Layout detection

Each cell shows F1@IoU=.50:.05:.95 / F1@IoU=.50 (both higher is better).

| Method | Overall | Text | Formula | Table | Picture |
| --- | --- | --- | --- | --- | --- |
| DocLayout-YOLO-DocStructBench | 0.733 / 0.806 | 0.694 / 0.779 | 0.480 / 0.620 | 0.803 / 0.858 | 0.619 / 0.678 |
| dots.ocr (parse all) | 0.831 / 0.922 | 0.801 / 0.909 | 0.654 / 0.770 | 0.838 / 0.888 | 0.748 / 0.831 |
| dots.ocr (detection only) | 0.845 / 0.930 | 0.816 / 0.917 | 0.716 / 0.832 | 0.875 / 0.918 | 0.765 / 0.843 |

Notes:

  • prompt_layout_all_en is used for full parsing and prompt_layout_only_en for detection only; see the prompts for details. The sketch below illustrates the box-IoU matching behind these F1 scores.
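For readers unfamiliar with the metric, here is a generic sketch of the box-IoU computation that F1@IoU thresholds; this is standard detection-metric logic, not dots.ocr's evaluation code.

```python
# Minimal sketch of bbox IoU, the quantity thresholded in F1@IoU metrics.
# Boxes are [x1, y1, x2, y2], as in dots.ocr's layout output.
def iou(a: list, b: list) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A predicted box counts as a true positive at IoU=.50 if it overlaps a
# same-category ground-truth box with iou(...) >= 0.5; F1@.50:.05:.95
# averages F1 over thresholds 0.50, 0.55, ..., 0.95.
```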
  3. olmOCR-bench
| Model | ArXiv | Old Scans Math | Tables | Old Scans | Headers and Footers | Multi column | Long Tiny Text | Base | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GOT OCR | 52.7 | 52.0 | 0.2 | 22.1 | 93.6 | 42.0 | 29.9 | 94.0 | 48.3 ± 1.1 |
| Marker | 76.0 | 57.9 | 57.6 | 27.8 | 84.9 | 72.9 | 84.6 | 99.1 | 70.1 ± 1.1 |
| MinerU | 75.4 | 47.4 | 60.9 | 17.3 | 96.6 | 59.0 | 39.1 | 96.6 | 61.5 ± 1.1 |
| Mistral OCR | 77.2 | 67.5 | 60.6 | 29.3 | 93.6 | 71.3 | 77.1 | 99.4 | 72.0 ± 1.1 |
| Nanonets OCR | 67.0 | 68.6 | 77.7 | 39.5 | 40.7 | 69.9 | 53.4 | 99.3 | 64.5 ± 1.1 |
| GPT-4o (No Anchor) | 51.5 | 75.5 | 69.1 | 40.9 | 94.2 | 68.9 | 54.1 | 96.7 | 68.9 ± 1.1 |
| GPT-4o (Anchored) | 53.5 | 74.5 | 70.0 | 40.7 | 93.8 | 69.3 | 60.6 | 96.8 | 69.9 ± 1.1 |
| Gemini Flash 2 (No Anchor) | 32.1 | 56.3 | 61.4 | 27.8 | 48.0 | 58.7 | 84.4 | 94.0 | 57.8 ± 1.1 |
| Gemini Flash 2 (Anchored) | 54.5 | 56.1 | 72.1 | 34.2 | 64.7 | 61.5 | 71.5 | 95.6 | 63.8 ± 1.2 |
| Qwen 2 VL (No Anchor) | 19.7 | 31.7 | 24.2 | 17.1 | 88.9 | 8.3 | 6.8 | 55.5 | 31.5 ± 0.9 |
| Qwen 2.5 VL (No Anchor) | 63.1 | 65.7 | 67.3 | 38.6 | 73.6 | 68.3 | 49.1 | 98.3 | 65.5 ± 1.2 |
| olmOCR v0.1.75 (No Anchor) | 71.5 | 71.4 | 71.4 | 42.8 | 94.1 | 77.7 | 71.0 | 97.8 | 74.7 ± 1.1 |
| olmOCR v0.1.75 (Anchored) | 74.9 | 71.2 | 71.0 | 42.2 | 94.5 | 78.3 | 73.3 | 98.3 | 75.5 ± 1.0 |
| MonkeyOCR-pro-3B | 83.8 | 68.8 | 74.6 | 36.1 | 91.2 | 76.6 | 80.1 | 95.3 | 75.8 ± 1.0 |
| dots.ocr | 82.1 | 64.2 | 88.3 | 40.9 | 94.1 | 82.4 | 81.2 | 99.5 | 79.1 ± 1.0 |

Quick start

  1. Installation
    Install dots.ocr:

```bash
conda create -n dots_ocr python=3.12
conda activate dots_ocr

git clone https://github.com/rednote-hilab/dots.ocr.git
cd dots.ocr

# Install pytorch, see https://pytorch.org/get-started/previous-versions/ for your cuda version
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
pip install -e .
```

If you run into problems during installation, you can try our Docker image for a simpler setup and follow these steps:

```bash
git clone https://github.com/rednote-hilab/dots.ocr.git
cd dots.ocr
pip install -e .
```

Download the model weights

💡 Note: use a directory name without periods for the model save path (e.g., DotsOCR instead of dots.ocr). Python cannot import a package whose directory name contains a period, and the vLLM registration step below imports the weights directory as a module. This is a temporary workaround pending our integration with Transformers.

```bash
python3 tools/download_model.py
```

  2. Deployment
    vLLM inference
    We strongly recommend using vLLM for deployment and inference. All of our evaluation results are based on vLLM 0.9.1. The Docker image is built on the official vLLM image; you can also refer to the Dockerfile to set up your own deployment environment.
```bash
# You need to register the model to vLLM first
python3 tools/download_model.py
export hf_model_path=./weights/DotsOCR  # Path to your downloaded model weights. Use a directory name without periods (e.g., `DotsOCR` instead of `dots.ocr`); this is a temporary workaround pending our integration with Transformers.
export PYTHONPATH=$(dirname "$hf_model_path"):$PYTHONPATH

# If you downloaded the model weights yourself, replace `DotsOCR` with your model directory name (again, no periods)
sed -i '/^from vllm\.entrypoints\.cli\.main import main$/a\
from DotsOCR import modeling_dots_ocr_vllm' `which vllm`

# Launch the vLLM server
CUDA_VISIBLE_DEVICES=0 vllm serve ${hf_model_path} --tensor-parallel-size 1 --gpu-memory-utilization 0.95 --chat-template-content-format string --served-model-name model --trust-remote-code
# If you get "ModuleNotFoundError: No module named 'DotsOCR'", check the note above on the saved model directory name.

# vLLM API demo
python3 ./demo/demo_vllm.py --prompt_mode prompt_layout_all_en
```
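demo_vllm.py is the official client. Purely as an illustration, here is a minimal sketch of calling the server through the OpenAI-compatible API that vLLM exposes; the port (vLLM's default 8000), the model name "model" (set by --served-model-name above), and the truncated prompt text are assumptions.

```python
# Minimal sketch of querying the vLLM server started above via its
# OpenAI-compatible API. The real prompts live in dots_ocr/utils
# (dict_promptmode_to_prompt); the one below is truncated for brevity.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("demo/demo_image1.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="model",  # matches --served-model-name above
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text",
             "text": "Please output the layout information from the PDF image, ..."},
        ],
    }],
)
print(resp.choices[0].message.content)  # a single JSON object, per the prompt spec below
```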

Hugging Face inference

```bash
python3 demo/demo_hf.py
```

Hugging Face inference details

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer
from qwen_vl_utils import process_vision_info
from dots_ocr.utils import dict_promptmode_to_prompt

model_path = "./weights/DotsOCR"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

image_path = "demo/demo_image1.jpg"
prompt = """Please output the layout information from the PDF image, including each layout element's bbox, its category, and the corresponding text content within the bbox.

1. Bbox format: [x1, y1, x2, y2]

2. Layout Categories: The possible categories are ['Caption', 'Footnote', 'Formula', 'List-item', 'Page-footer', 'Page-header', 'Picture', 'Section-header', 'Table', 'Text', 'Title'].

3. Text Extraction & Formatting Rules:
    - Picture: For the 'Picture' category, the text field should be omitted.
    - Formula: Format its text as LaTeX.
    - Table: Format its text as HTML.
    - All Others (Text, Title, etc.): Format their text as Markdown.

4. Constraints:
    - The output text must be the original text from the image, with no translation.
    - All layout elements must be sorted according to human reading order.

5. Final Output: The entire output must be a single JSON object.
"""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=24000)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
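The prompt asks for a single JSON object whose layout elements carry a bbox, a category, and (except for Picture) a text field. Here is a minimal sketch of consuming that output, continuing the snippet above; it assumes the object parses as a flat list of elements, which may not match the exact top-level shape:

```python
# Minimal sketch of consuming the model's JSON layout output, continuing
# the snippet above. Assumes output_text[0] parses as a list of elements
# with "bbox", "category", and (except Picture) "text" keys, per the prompt.
import json

elements = json.loads(output_text[0])
for el in elements:
    print(el["category"], el["bbox"])

# Elements arrive in reading order, so body text can be joined directly,
# skipping headers/footers (and text-less pictures) as the benchmarks above do.
body = "\n\n".join(
    el["text"] for el in elements
    if el["category"] not in ("Page-header", "Page-footer", "Picture")
)
```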

  3. Document parsing
    Based on the vLLM server, you can parse an image or a PDF file with the following commands:

```bash
# Parse all layout info, both detection and recognition
# Parse a single image
python3 dots_ocr/parser.py demo/demo_image1.jpg

# Parse a single PDF
python3 dots_ocr/parser.py demo/demo_pdf1.pdf --num_threads 64  # try bigger num_threads for a pdf with a large number of pages

# Layout detection only
python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_layout_only_en

# Parse text only, except Page-header and Page-footer
python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_ocr

# Parse layout info by bbox
python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_grounding_ocr --bbox  # supply the target box coordinates
```
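For batch work, a thin wrapper can drive the parser over a whole folder. This sketch uses only the arguments documented above; the input directory name is hypothetical.

```python
# Minimal sketch: run the documented parser CLI over every PDF in a
# folder ("docs" is a placeholder). Adjust num_threads to your page counts.
import pathlib
import subprocess

for pdf in sorted(pathlib.Path("docs").glob("*.pdf")):
    subprocess.run(
        ["python3", "dots_ocr/parser.py", str(pdf), "--num_threads", "64"],
        check=True,
    )
```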
