
Dify Agent Platform Source Code Secondary Development Notes (6) - Improving PDF Document Recognition in the Knowledge Base

Contents

Preface

Adding the PdfNewExtractor Class

Updating the ExtractProcessor Class

Final Results


Preface


In Dify 1.1.3, the knowledge base parses PDFs with pypdfium2 for plain text extraction, which has the following main problems:
1. Limited text extraction ability, with weak support for tables and images
2. No optimization specifically for Chinese text
3. No document structure analysis
4. No document quality assessment

Proposed improvements:
1. Replace pypdfium2 with pdfplumber (a short pdfplumber sketch follows this list)
2. Add OCR support
3. Improve the Chinese text handling logic
4. Add document structure analysis
5. Implement intelligent table recognition
6. Add a caching mechanism
7. Optimize handling of large files
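
To make the first point concrete, here is a minimal, self-contained pdfplumber sketch (not part of the Dify code; sample.pdf is a hypothetical test file) showing how it exposes both layout-aware page text and tables, which is what the new extractor below builds on:

import pdfplumber

# Minimal sketch: print layout-aware text length and table count per page.
with pdfplumber.open("sample.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text(layout=True) or ""  # may be None for image-only pages
        tables = page.extract_tables()               # each table is a list of rows (lists of cells)
        print(f"page {page.page_number}: {len(text)} chars, {len(tables)} table(s)")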

Install the pdfplumber and pytesseract packages:

pip install pdfplumber
pip install pytesseract
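
Note that pytesseract is only a Python wrapper: the Tesseract OCR engine itself, plus the language data used below (chi_sim, chi_tra, eng), must also be installed on the machine running the api service. On Debian/Ubuntu, for example:

apt-get install tesseract-ocr tesseract-ocr-chi-sim tesseract-ocr-chi-tra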

Adding the PdfNewExtractor Class


Add a new PdfNewExtractor handler class to replace the old PdfExtractor:

from collections.abc import Iterator
from typing import Optional, cast
import pdfplumber
import pytesseract
from PIL import Image
import io
from core.rag.extractor.blob.blob import Blob
from core.rag.extractor.extractor_base import BaseExtractor
from core.rag.models.document import Document
from extensions.ext_storage import storage


class PdfNewExtractor(BaseExtractor):
    """Enhanced PDF loader with improved text extraction, OCR support, and structure analysis.

    Args:
        file_path: Path to the PDF file to load.
        file_cache_key: Optional cache key for storing extracted text.
        enable_ocr: Whether to enable OCR for text extraction from images.
    """

    def __init__(self, file_path: str, file_cache_key: Optional[str] = None, enable_ocr: bool = False):
        """Initialize with file path and optional settings."""
        self._file_path = file_path
        self._file_cache_key = file_cache_key
        self._enable_ocr = enable_ocr

    def extract(self) -> list[Document]:
        """Extract text from the PDF, with caching support."""
        plaintext_file_exists = False
        if self._file_cache_key:
            try:
                text = cast(bytes, storage.load(self._file_cache_key)).decode("utf-8")
                plaintext_file_exists = True
                return [Document(page_content=text)]
            except FileNotFoundError:
                pass

        documents = list(self.load())
        text_list = []
        for document in documents:
            text_list.append(document.page_content)
        text = "\n\n".join(text_list)

        # Save the plaintext for caching
        if not plaintext_file_exists and self._file_cache_key:
            storage.save(self._file_cache_key, text.encode("utf-8"))

        return documents

    def load(self) -> Iterator[Document]:
        """Lazily load PDF pages with enhanced text extraction."""
        blob = Blob.from_path(self._file_path)
        yield from self.parse(blob)

    def parse(self, blob: Blob) -> Iterator[Document]:
        """Parse the PDF with enhanced features including OCR and structure analysis."""
        with blob.as_bytes_io() as file_obj:
            with pdfplumber.open(file_obj) as pdf:
                for page_number, page in enumerate(pdf.pages):
                    # Extract text with layout preservation; extract_text() may return None for image-only pages
                    content = page.extract_text(layout=True) or ""

                    # Try to detect and fix encoding issues
                    try:
                        # First try to round-trip as UTF-8
                        content = content.encode('utf-8').decode('utf-8')
                    except UnicodeError:
                        try:
                            # If UTF-8 fails, try GB18030 (common Chinese encoding)
                            content = content.encode('utf-8').decode('gb18030', errors='ignore')
                        except UnicodeError:
                            # If all else fails, use a more lenient approach
                            content = content.encode('utf-8', errors='ignore').decode('utf-8', errors='ignore')

                    # Extract tables if present
                    tables = page.extract_tables()
                    if tables:
                        table_text = "\n\nTables:\n"
                        for table in tables:
                            # Convert each table to tab-separated text
                            table_text += "\n" + "\n".join(
                                ["\t".join([str(cell) if cell else "" for cell in row]) for row in table]
                            )
                        content += table_text

                    # Perform OCR if enabled and the extracted text is sparse or shows signs of encoding problems
                    if self._enable_ocr and (
                        len(content.strip()) < 100 or any('\ufffd' in line for line in content.splitlines())
                    ):
                        image = page.to_image()
                        img_bytes = io.BytesIO()
                        image.original.save(img_bytes, format='PNG')
                        img_bytes.seek(0)
                        pil_image = Image.open(img_bytes)
                        # Run Tesseract with multiple language models to improve accuracy
                        ocr_text = pytesseract.image_to_string(
                            pil_image,
                            lang='chi_sim+chi_tra+eng',  # Simplified Chinese, Traditional Chinese and English
                            config='--psm 3 --oem 3'  # Automatic page segmentation, default OCR engine mode
                        )
                        if ocr_text.strip():
                            # Clean and normalize the OCR text
                            ocr_text = ocr_text.replace('\x0c', '').strip()
                            content = f"{content}\n\nOCR Text:\n{ocr_text}"

                    metadata = {
                        "source": blob.source,
                        "page": page_number,
                        "has_tables": bool(tables)
                    }

                    yield Document(page_content=content, metadata=metadata)
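
As a quick sanity check, the new extractor can be exercised on its own from within the Dify api environment (so that the core.rag imports resolve); sample.pdf below is a hypothetical test file:

extractor = PdfNewExtractor("sample.pdf", enable_ocr=True)
for doc in extractor.extract():
    print(doc.metadata, doc.page_content[:200])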

Updating the ExtractProcessor Class


In ExtractProcessor, replace both occurrences of extractor = PdfExtractor(file_path) with extractor = PdfNewExtractor(file_path).
They are at lines 144 and 148 of the file.
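
In each of the two places the old line

extractor = PdfExtractor(file_path)

becomes

extractor = PdfNewExtractor(file_path)

and the new class also has to be imported at the top of extract_processor.py, for example (assuming the class was saved as core/rag/extractor/pdf_new_extractor.py):

from core.rag.extractor.pdf_new_extractor import PdfNewExtractor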

Final Results


Testing shows the optimized extractor works very well.
