
A Look at Spring AI Alibaba's PdfTablesParser

This article takes a look at Spring AI Alibaba's PdfTablesParser.

PdfTablesParser

community/document-parsers/spring-ai-alibaba-starter-document-parser-pdf-tables/src/main/java/com/alibaba/cloud/ai/parser/pdf/tables/PdfTablesParser.java

public class PdfTablesParser implements DocumentParser {

    /**
     * The page number of the PDF file to be parsed. Default value is 1.
     */
    private final Integer page;

    /**
     * The metadata of the PDF file to be parsed.
     */
    private final Map<String, String> metadata;

    public PdfTablesParser() {
        this(1);
    }

    public PdfTablesParser(Integer pageNumber) {
        this(pageNumber, Map.of());
    }

    public PdfTablesParser(Integer pageNumber, Map<String, String> metadata) {
        this.page = pageNumber;
        this.metadata = metadata;
    }

    @Override
    public List<Document> parse(InputStream inputStream) {
        try {
            return data2Document(parseTables(extraTableData(inputStream)));
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    protected List<Table> extraTableData(InputStream in) throws Exception {
        PDDocument document = PDDocument.load(in);

        // check pdf files
        int numberOfPages = document.getNumberOfPages();
        if (numberOfPages < 0) {
            throw new RuntimeException("No page found in the PDF file.");
        }
        if (page > numberOfPages) {
            throw new RuntimeException("The page number is greater than the number of pages in the PDF file.");
        }

        SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();

        // extract page by page numbers.
        Page extract = new ObjectExtractor(document).extract(this.page);
        return sea.extract(extract);
    }

    protected List<String> parseTables(List<Table> data) {
        if (data.isEmpty()) {
            return Collections.emptyList();
        }
        return data.stream()
            .flatMap(table -> table.getRows()
                .stream()
                .map(cells -> cells.stream()
                    .map(content -> content.getText().replace("\r", "").replace("\n", " "))
                    .reduce((first, second) -> first + "|" + second)
                    .orElse("") + "|"))
            .collect(Collectors.toList());
    }

    private List<Document> data2Document(List<String> data) {
        List<Document> documents = new ArrayList<>();
        if (data.isEmpty()) {
            return null;
        }
        for (String datum : data) {
            Document doc = new Document(datum);
            documents.add(addMetadata(doc));
        }
        return documents;
    }

    private Document addMetadata(Document document) {
        if (metadata.isEmpty()) {
            return document;
        }
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            document.getMetadata().put(entry.getKey(), entry.getValue());
        }
        return document;
    }

}

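PdfTablesParser uses tabula to parse PDFs. Its parse method runs extraTableData first, then parseTables, and finally data2Document. extraTableData uses SpreadsheetExtractionAlgorithm to extract the tables on the configured page as a List<Table>. parseTables converts the List<Table> into a List<String>, one string per table row, with the cell texts joined by "|" and a trailing "|" appended. data2Document wraps each string in a Document and copies the configured metadata onto it, producing a List<Document>. Note that data2Document returns null, not an empty list, when no rows were extracted.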
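To make the row format concrete, here is a minimal sketch of the per-row joining that parseTables performs, written against plain strings instead of tabula cells so that it runs without any dependency. The class RowJoinSketch and its methods are hypothetical names used only for illustration. The leading empty cell in the sample rows is an assumption, chosen so that the output resembles the expected strings in the tests.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Spring AI Alibaba.
// It mirrors the string handling inside parseTables, using String cells
// in place of tabula's RectangularTextContainer.
final class RowJoinSketch {

    // Remove "\r", replace "\n" with a space, join the cells with "|",
    // and append a trailing "|" (an empty row therefore becomes "|").
    static String joinRow(List<String> cells) {
        return cells.stream()
            .map(text -> text.replace("\r", "").replace("\n", " "))
            .reduce((first, second) -> first + "|" + second)
            .orElse("") + "|";
    }

    // Apply joinRow to every row of a table.
    static List<String> joinRows(List<List<String>> rows) {
        return rows.stream().map(RowJoinSketch::joinRow).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Assumed sample data: the first cell of each row is empty.
        List<List<String>> rows = List.of(
            List.of("", "name", "age", "sex"),
            List.of("", "zhang\rsan", "20", "m"));
        joinRows(rows).forEach(System.out::println);
    }
}
```

Because each row ends up as one string, and later as one Document, a table with N rows yields N documents rather than a single document per table.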

Example

class PdfTablesParserTests {

    private Resource resource;

    private Resource resource2;

    @BeforeEach
    void setUp() {
        resource = new DefaultResourceLoader().getResource("classpath:/pdf-tables.pdf");
        resource2 = new DefaultResourceLoader().getResource("classpath:/sample1.pdf");
        if (!resource.exists()) {
            throw new RuntimeException("Resource not found: " + resource);
        }
    }

    /**
     * tabula-java use.
     */
    @Test
    void PdfTableTest() throws IOException {
        InputStream in = new FileInputStream(resource.getFile());
        try (PDDocument document = PDDocument.load(in)) {
            SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
            PageIterator pi = new ObjectExtractor(document).extract();
            while (pi.hasNext()) {
                // iterate over the pages of the document
                Page page = pi.next();
                List<Table> table = sea.extract(page);
                // iterate over the tables of the page
                for (Table tables : table) {
                    List<List<RectangularTextContainer>> rows = tables.getRows();
                    // iterate over the rows of the table
                    for (List<RectangularTextContainer> cells : rows) {
                        // print all column-cells of the row plus linefeed
                        for (RectangularTextContainer content : cells) {
                            // Note: Cell.getText() uses \r to concat text chunk
                            String text = content.getText().replace("\r", " ");
                            System.out.print(text + "|");
                        }
                        System.out.println();
                    }
                }
            }
        }
    }

    @Test
    void PdfTablesParseTest() throws IOException {
        String res = """
                |name|age|sex|
                |zhangsan|20|m|
                |lisi|21|w|
                |wangwu|22|m|
                |zhangliu|23|w|
                |songqi|24|w|
                """;
        InputStream in = new FileInputStream(resource.getFile());
        PdfTablesParser pdfTablesParser = new PdfTablesParser();
        List<Document> docs = pdfTablesParser.parse(in);
        StringBuilder sb = new StringBuilder();
        docs.subList(1, docs.size()).forEach(doc -> sb.append(doc.getText() + "\n"));
        Assert.equals(res, sb.toString());
    }

    @Test
    void PdfTablesParseTest2() throws IOException {
        String res = """
                Sample Date:|May 2001|
                Prepared by:|Accelio Present Applied Technology|
                Created and Tested Using:|•Accelio Present Central 5.4•Accelio Present Output Designer 5.4|
                Features Demonstrated:|•Primary bookmarks in a PDF file.•Secondary bookmarks in a PDF file.|
                """;
        InputStream in = new FileInputStream(resource2.getFile());
        PdfTablesParser pdfTablesParser = new PdfTablesParser();
        List<Document> docs = pdfTablesParser.parse(in);
        StringBuilder sb = new StringBuilder();
        docs.forEach(doc -> sb.append(doc.getText() + "\n"));
        Assert.equals(res, sb.toString());
    }

    @Test
    void PdfTablesParseTest3() throws IOException {
        String res = """
                |Filename|||escription|escription|||||||||ap_bookmark.IFD|The template design.||||||ap_bookmark.mdf|The template targeted for PDF output.||||||ap_bookmark.dat|A sample data file in DAT format.||||||ap_bookmark.bmk|A sample bookmark file.||||||ap_bookmark.pdf|Sample PDF output.||||||ap_bookmark_doc.pdf|A document describing the sample.|||||||To bookmark by|Use the command line parameter|||Invoices|-abmkap_bookmark.bmk -abmsinvoices|||Type|-abmkap_bookmark.bmk -abmstype|||Amount|-abmkap_bookmark.bmk -abmsamount||
                """;
        InputStream in = new FileInputStream(resource2.getFile());
        PdfTablesParser pdfTablesParser = new PdfTablesParser(3);
        List<Document> docs = pdfTablesParser.parse(in);
        StringBuilder sb = new StringBuilder();
        docs.forEach(doc -> sb.append(doc.getText() + "\n"));
        Assert.equals(res, sb.toString());
    }

}
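The tests call docs.forEach directly on the parse result. In the source shown earlier, data2Document returns null when the page yields no table rows, so the same call would throw a NullPointerException on a page without tables. One possible guard, sketched here with plain strings standing in for Document texts, is to normalize null to an empty list before iterating. NullSafeDocsSketch, orEmpty and joinLines are hypothetical names, not part of the library.

```java
import java.util.List;

// Hypothetical caller-side guard, not part of Spring AI Alibaba.
final class NullSafeDocsSketch {

    // Treat a null parse result as "no rows found".
    static <T> List<T> orEmpty(List<T> parsed) {
        return parsed == null ? List.of() : parsed;
    }

    // Concatenate the row texts the same way the tests do,
    // appending "\n" after each entry, but tolerate a null input.
    static String joinLines(List<String> texts) {
        StringBuilder sb = new StringBuilder();
        orEmpty(texts).forEach(text -> sb.append(text).append("\n"));
        return sb.toString();
    }

    public static void main(String[] args) {
        // A null result produces an empty string instead of an exception.
        System.out.println(joinLines(null).isEmpty());
        System.out.print(joinLines(List.of("|name|age|sex|", "|zhangsan|20|m|")));
    }
}
```

With the real parser, the same idea would apply to the List<Document> returned by parse, under the assumption that the null return is kept in the version you use.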

Summary

The spring-ai-alibaba-starter-document-parser-pdf-tables module of Spring AI Alibaba provides PdfTablesParser, which parses the table data in a PDF file into Document objects.

doc

  • java2ai
  • tabula-java
