A Look at Spring AI Alibaba's MarkdownDocumentParser

This article takes a look at the MarkdownDocumentParser in Spring AI Alibaba.

MarkdownDocumentParser

community/document-parsers/spring-ai-alibaba-starter-document-parser-markdown/src/main/java/com/alibaba/cloud/ai/parser/markdown/MarkdownDocumentParser.java

public class MarkdownDocumentParser implements DocumentParser {

    /**
     * Configuration to a parsing process.
     */
    private final MarkdownDocumentParserConfig config;

    /**
     * Markdown parser.
     */
    private final Parser parser;

    public MarkdownDocumentParser() {
        this(MarkdownDocumentParserConfig.defaultConfig());
    }

    /**
     * Create a new {@link MarkdownDocumentParser} instance.
     */
    public MarkdownDocumentParser(MarkdownDocumentParserConfig config) {
        this.config = config;
        this.parser = Parser.builder().build();
    }

    @Override
    public List<Document> parse(InputStream inputStream) {
        try (var input = inputStream) {
            Node node = this.parser.parseReader(new InputStreamReader(input));
            DocumentVisitor documentVisitor = new DocumentVisitor(this.config);
            node.accept(documentVisitor);
            return documentVisitor.getDocuments();
        }
        catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    //......
}

MarkdownDocumentParser uses org.commonmark.parser.Parser to parse the inputStream into a Node, then walks that node with a DocumentVisitor to turn it into a list of Document objects.
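The "parse into a node tree, then hand the tree to a visitor" flow can be sketched without any library. The block below is only an illustration of the visitor pattern that node.accept(documentVisitor) relies on: Node, Visitor, HeadingNode, TextNode and CollectingVisitor are hypothetical stand-ins invented here, not commonmark or Spring AI Alibaba types.

```java
import java.util.ArrayList;
import java.util.List;

public class VisitorSketch {

    // Hypothetical stand-in for a visitor interface: one overload per node type.
    interface Visitor {
        void visit(HeadingNode heading);
        void visit(TextNode text);
    }

    // Hypothetical stand-in for a parsed node.
    interface Node {
        void accept(Visitor visitor);
    }

    static final class HeadingNode implements Node {
        final int level;
        final String title;

        HeadingNode(int level, String title) {
            this.level = level;
            this.title = title;
        }

        @Override
        public void accept(Visitor visitor) {
            // Double dispatch: the node picks the overload matching its own type.
            visitor.visit(this);
        }
    }

    static final class TextNode implements Node {
        final String literal;

        TextNode(String literal) {
            this.literal = literal;
        }

        @Override
        public void accept(Visitor visitor) {
            visitor.visit(this);
        }
    }

    // Records one entry per visited node, in visiting order.
    static final class CollectingVisitor implements Visitor {
        private final List<String> seen = new ArrayList<>();

        @Override
        public void visit(HeadingNode heading) {
            seen.add("header_" + heading.level + ":" + heading.title);
        }

        @Override
        public void visit(TextNode text) {
            seen.add("text:" + text.literal);
        }

        List<String> getSeen() {
            return seen;
        }
    }

    // Mirrors the shape of parse(): build a visitor, let every node accept it, return the result.
    static List<String> walk(List<Node> nodes) {
        CollectingVisitor visitor = new CollectingVisitor();
        for (Node node : nodes) {
            node.accept(visitor);
        }
        return visitor.getSeen();
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(new HeadingNode(1, "Intro"), new TextNode("Hello"));
        System.out.println(walk(nodes));
    }
}
```

The caller never checks node types itself; each node routes the call to the right visit overload, which is why the real parser only needs a single accept call on the root node.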

DocumentVisitor

static class DocumentVisitor extends AbstractVisitor {

    private final List<Document> documents = new ArrayList<>();

    private final List<String> currentParagraphs = new ArrayList<>();

    private final MarkdownDocumentParserConfig config;

    private Document.Builder currentDocumentBuilder;

    DocumentVisitor(MarkdownDocumentParserConfig config) {
        this.config = config;
    }

    /**
     * Visits the document node and initializes the current document builder.
     */
    @Override
    public void visit(org.commonmark.node.Document document) {
        this.currentDocumentBuilder = Document.builder();
        super.visit(document);
    }

    @Override
    public void visit(Heading heading) {
        buildAndFlush();
        super.visit(heading);
    }

    @Override
    public void visit(ThematicBreak thematicBreak) {
        if (this.config.horizontalRuleCreateDocument) {
            buildAndFlush();
        }
        super.visit(thematicBreak);
    }

    @Override
    public void visit(SoftLineBreak softLineBreak) {
        translateLineBreakToSpace();
        super.visit(softLineBreak);
    }

    @Override
    public void visit(HardLineBreak hardLineBreak) {
        translateLineBreakToSpace();
        super.visit(hardLineBreak);
    }

    @Override
    public void visit(ListItem listItem) {
        translateLineBreakToSpace();
        super.visit(listItem);
    }

    @Override
    public void visit(BlockQuote blockQuote) {
        if (!this.config.includeBlockquote) {
            buildAndFlush();
        }
        translateLineBreakToSpace();
        this.currentDocumentBuilder.metadata("category", "blockquote");
        super.visit(blockQuote);
    }

    @Override
    public void visit(Code code) {
        this.currentParagraphs.add(code.getLiteral());
        this.currentDocumentBuilder.metadata("category", "code_inline");
        super.visit(code);
    }

    @Override
    public void visit(FencedCodeBlock fencedCodeBlock) {
        if (!this.config.includeCodeBlock) {
            buildAndFlush();
        }
        translateLineBreakToSpace();
        this.currentParagraphs.add(fencedCodeBlock.getLiteral());
        this.currentDocumentBuilder.metadata("category", "code_block");
        this.currentDocumentBuilder.metadata("lang", fencedCodeBlock.getInfo());
        buildAndFlush();
        super.visit(fencedCodeBlock);
    }

    @Override
    public void visit(Text text) {
        if (text.getParent() instanceof Heading heading) {
            this.currentDocumentBuilder.metadata("category", "header_%d".formatted(heading.getLevel()))
                .metadata("title", text.getLiteral());
        }
        else {
            this.currentParagraphs.add(text.getLiteral());
        }
        super.visit(text);
    }

    public List<Document> getDocuments() {
        buildAndFlush();
        return this.documents;
    }

    private void buildAndFlush() {
        if (!this.currentParagraphs.isEmpty()) {
            String content = String.join("", this.currentParagraphs);
            Document.Builder builder = this.currentDocumentBuilder.text(content);
            this.config.additionalMetadata.forEach(builder::metadata);
            Document document = builder.build();
            this.documents.add(document);
            this.currentParagraphs.clear();
        }
        this.currentDocumentBuilder = Document.builder();
    }

    private void translateLineBreakToSpace() {
        if (!this.currentParagraphs.isEmpty()) {
            this.currentParagraphs.add(" ");
        }
    }

}

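DocumentVisitor extends AbstractVisitor. Its visit methods append content to currentParagraphs and set the matching metadata (category, title, lang) on the current builder, and buildAndFlush then builds the Document from the buffered text. Every call to buildAndFlush ends by reassigning currentDocumentBuilder to a fresh Document.builder().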
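The buffer-and-flush behaviour described above can be reproduced in a few lines of plain Java. This is a simplified sketch and not the library code: Chunk is a hypothetical stand-in for Document, and onHeading, onText and onLineBreak are hypothetical callbacks that stand in for the corresponding visit methods.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BufferFlushSketch {

    // Hypothetical stand-in for Document: the joined text plus its metadata.
    static final class Chunk {
        final String text;
        final Map<String, Object> metadata;

        Chunk(String text, Map<String, Object> metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    private final List<Chunk> chunks = new ArrayList<>();
    private final List<String> currentParagraphs = new ArrayList<>();
    private Map<String, Object> currentMetadata = new LinkedHashMap<>();

    // A heading closes the chunk in progress, then labels the next one.
    void onHeading(int level, String title) {
        buildAndFlush();
        currentMetadata.put("category", "header_" + level);
        currentMetadata.put("title", title);
    }

    // Body text is only buffered; nothing is emitted yet.
    void onText(String literal) {
        currentParagraphs.add(literal);
    }

    // A line break becomes a single space, but never at the start of a chunk.
    void onLineBreak() {
        if (!currentParagraphs.isEmpty()) {
            currentParagraphs.add(" ");
        }
    }

    // The final flush picks up whatever is still buffered.
    List<Chunk> getChunks() {
        buildAndFlush();
        return chunks;
    }

    private void buildAndFlush() {
        if (!currentParagraphs.isEmpty()) {
            chunks.add(new Chunk(String.join("", currentParagraphs), currentMetadata));
            currentParagraphs.clear();
        }
        // Metadata is reset on every call, even when nothing was emitted.
        currentMetadata = new LinkedHashMap<>();
    }

    // Small driver: text before any heading, then a heading with two lines of body text.
    static List<Chunk> demo() {
        BufferFlushSketch sketch = new BufferFlushSketch();
        sketch.onText("Intro text");
        sketch.onHeading(1, "Header 1a");
        sketch.onText("first line");
        sketch.onLineBreak();
        sketch.onText("second line");
        return sketch.getChunks();
    }

    public static void main(String[] args) {
        for (Chunk chunk : demo()) {
            System.out.println(chunk.metadata + " -> " + chunk.text);
        }
    }
}
```

Because the metadata holder is replaced on every flush, metadata set while the buffer is still empty is discarded by the next flush. In the sketch, as in the original buildAndFlush, a heading immediately followed by another heading therefore contributes no chunk of its own.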

Example

class MarkdownDocumentParserTest {

    @Test
    void testOnlyHeadersWithParagraphs() throws IOException {
        MarkdownDocumentParser reader = new MarkdownDocumentParser();
        List<Document> documents = reader
            .parse(new DefaultResourceLoader().getResource("classpath:/only-headers.md").getInputStream());

        assertThat(documents).hasSize(4)
            .extracting(Document::getMetadata, Document::getText)
            .containsOnly(tuple(Map.of("category", "header_1", "title", "Header 1a"),
                    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur diam eros, laoreet sit amet cursus vitae, varius sed nisi. Cras sit amet quam quis velit commodo porta consectetur id nisi. Phasellus tincidunt pulvinar augue."),
                    tuple(Map.of("category", "header_1", "title", "Header 1b"),
                            "Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Etiam lobortis risus libero, sed sollicitudin risus cursus in. Morbi enim metus, ornare vel lacinia eget, venenatis vel nibh."),
                    tuple(Map.of("category", "header_2", "title", "Header 2b"),
                            "Proin vel laoreet leo, sed luctus augue. Sed et ligula commodo, commodo lacus at, consequat turpis. Maecenas eget sapien odio. Maecenas urna lectus, pellentesque in accumsan aliquam, congue eu libero."),
                    tuple(Map.of("category", "header_2", "title", "Header 2c"),
                            "Ut rhoncus nec justo a porttitor. Pellentesque auctor pharetra eros, viverra sodales lorem aliquet id. Curabitur semper nisi vel sem interdum suscipit."));
    }

    @Test
    void testWithFormatting() throws IOException {
        MarkdownDocumentParser reader = new MarkdownDocumentParser();
        List<Document> documents = reader
            .parse(new DefaultResourceLoader().getResource("classpath:/with-formatting.md").getInputStream());

        assertThat(documents).hasSize(2)
            .extracting(Document::getMetadata, Document::getText)
            .containsOnly(tuple(Map.of("category", "header_1", "title", "This is a fancy header name"),
                    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec tincidunt velit non bibendum gravida. Cras accumsan tincidunt ornare. Donec hendrerit consequat tellus blandit accumsan. Aenean aliquam metus at arcu elementum dignissim."),
                    tuple(Map.of("category", "header_3", "title", "Header 3"),
                            "Aenean eu leo eu nibh tristique posuere quis quis massa."));
    }

    @Test
    void testDocumentDividedViaHorizontalRules() throws IOException {
        MarkdownDocumentParserConfig config = MarkdownDocumentParserConfig.builder()
            .withHorizontalRuleCreateDocument(true)
            .build();
        MarkdownDocumentParser reader = new MarkdownDocumentParser(config);
        List<Document> documents = reader
            .parse(new DefaultResourceLoader().getResource("classpath:/horizontal-rules.md").getInputStream());

        assertThat(documents).hasSize(7)
            .extracting(Document::getMetadata, Document::getText)
            .containsOnly(tuple(Map.of(),
                    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec tincidunt velit non bibendum gravida."),
                    tuple(Map.of(),
                            "Cras accumsan tincidunt ornare. Donec hendrerit consequat tellus blandit accumsan. Aenean aliquam metus at arcu elementum dignissim."),
                    tuple(Map.of(),
                            "Nullam nisi dui, egestas nec sem nec, interdum lobortis enim. Pellentesque odio orci, faucibus eu luctus nec, venenatis et magna."),
                    tuple(Map.of(),
                            "Vestibulum nec eros non felis fermentum posuere eget ac risus. Curabitur et fringilla massa. Cras facilisis nec nisl sit amet sagittis."),
                    tuple(Map.of(),
                            "Aenean eu leo eu nibh tristique posuere quis quis massa. Nullam lacinia luctus sem ut vehicula."),
                    tuple(Map.of(),
                            "Aenean quis vulputate mi. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Nam tincidunt nunc a tortor tincidunt, nec lobortis diam rhoncus."),
                    tuple(Map.of(), "Nulla facilisi. Phasellus eget tellus sed nibh ornare interdum eu eu mi."));
    }

    @Test
    void testDocumentNotDividedViaHorizontalRulesWhenIsDisabled() throws IOException {
        MarkdownDocumentParserConfig config = MarkdownDocumentParserConfig.builder()
            .withHorizontalRuleCreateDocument(false)
            .build();
        MarkdownDocumentParser reader = new MarkdownDocumentParser(config);
        List<Document> documents = reader
            .parse(new DefaultResourceLoader().getResource("classpath:/horizontal-rules.md").getInputStream());

        assertThat(documents).hasSize(1);

        Document documentsFirst = documents.get(0);
        assertThat(documentsFirst.getMetadata()).isEmpty();
        assertThat(documentsFirst.getText()).startsWith("Lorem ipsum dolor sit amet, consectetur adipiscing elit")
            .endsWith("Phasellus eget tellus sed nibh ornare interdum eu eu mi.");
    }

    @Test
    void testSimpleMarkdownDocumentWithHardAndSoftLineBreaks() throws IOException {
        MarkdownDocumentParser reader = new MarkdownDocumentParser();
        List<Document> documents = reader
            .parse(new DefaultResourceLoader().getResource("classpath:/simple.md").getInputStream());

        assertThat(documents).hasSize(1);

        Document documentsFirst = documents.get(0);
        assertThat(documentsFirst.getMetadata()).isEmpty();
        assertThat(documentsFirst.getText()).isEqualTo(
                "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec tincidunt velit non bibendum gravida. Cras accumsan tincidunt ornare. Donec hendrerit consequat tellus blandit accumsan. Aenean aliquam metus at arcu elementum dignissim.Nullam nisi dui, egestas nec sem nec, interdum lobortis enim. Pellentesque odio orci, faucibus eu luctus nec, venenatis et magna. Vestibulum nec eros non felis fermentum posuere eget ac risus.Aenean eu leo eu nibh tristique posuere quis quis massa. Nullam lacinia luctus sem ut vehicula.");
    }

    @Test
    void testCode() throws IOException {
        MarkdownDocumentParserConfig config = MarkdownDocumentParserConfig.builder()
            .withHorizontalRuleCreateDocument(true)
            .build();
        MarkdownDocumentParser reader = new MarkdownDocumentParser(config);
        List<Document> documents = reader
            .parse(new DefaultResourceLoader().getResource("classpath:/code.md").getInputStream());

        assertThat(documents).satisfiesExactly(document -> {
            assertThat(document.getMetadata()).isEqualTo(Map.of());
            assertThat(document.getText()).isEqualTo("This is a Java sample application:");
        }, document -> {
            assertThat(document.getMetadata()).isEqualTo(Map.of("lang", "java", "category", "code_block"));
            assertThat(document.getText()).startsWith("package com.example.demo;")
                .contains("SpringApplication.run(DemoApplication.class, args);");
        }, document -> {
            assertThat(document.getMetadata()).isEqualTo(Map.of("category", "code_inline"));
            assertThat(document.getText()).isEqualTo(
                    "Markdown also provides the possibility to use inline code formatting throughout the entire sentence.");
        }, document -> {
            assertThat(document.getMetadata()).isEqualTo(Map.of());
            assertThat(document.getText())
                .isEqualTo("Another possibility is to set block code without specific highlighting:");
        }, document -> {
            assertThat(document.getMetadata()).isEqualTo(Map.of("lang", "", "category", "code_block"));
            assertThat(document.getText()).isEqualTo("./mvnw spring-javaformat:apply\n");
        });
    }

    @Test
    void testCodeWhenCodeBlockShouldNotBeSeparatedDocument() throws IOException {
        MarkdownDocumentParserConfig config = MarkdownDocumentParserConfig.builder()
            .withHorizontalRuleCreateDocument(true)
            .withIncludeCodeBlock(true)
            .build();
        MarkdownDocumentParser reader = new MarkdownDocumentParser(config);
        List<Document> documents = reader
            .parse(new DefaultResourceLoader().getResource("classpath:/code.md").getInputStream());

        assertThat(documents).satisfiesExactly(document -> {
            assertThat(document.getMetadata()).isEqualTo(Map.of("lang", "java", "category", "code_block"));
            assertThat(document.getText()).startsWith("This is a Java sample application: package com.example.demo")
                .contains("SpringApplication.run(DemoApplication.class, args);");
        }, document -> {
            assertThat(document.getMetadata()).isEqualTo(Map.of("category", "code_inline"));
            assertThat(document.getText()).isEqualTo(
                    "Markdown also provides the possibility to use inline code formatting throughout the entire sentence.");
        }, document -> {
            assertThat(document.getMetadata()).isEqualTo(Map.of("lang", "", "category", "code_block"));
            assertThat(document.getText()).isEqualTo(
                    "Another possibility is to set block code without specific highlighting: ./mvnw spring-javaformat:apply\n");
        });
    }

    @Test
    void testBlockquote() throws IOException {
        MarkdownDocumentParser reader = new MarkdownDocumentParser();
        List<Document> documents = reader
            .parse(new DefaultResourceLoader().getResource("classpath:/blockquote.md").getInputStream());

        assertThat(documents).hasSize(2)
            .extracting(Document::getMetadata, Document::getText)
            .containsOnly(tuple(Map.of(),
                    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur diam eros, laoreet sit amet cursus vitae, varius sed nisi. Cras sit amet quam quis velit commodo porta consectetur id nisi. Phasellus tincidunt pulvinar augue."),
                    tuple(Map.of("category", "blockquote"),
                            "Proin vel laoreet leo, sed luctus augue. Sed et ligula commodo, commodo lacus at, consequat turpis. Maecenas eget sapien odio. Maecenas urna lectus, pellentesque in accumsan aliquam, congue eu libero. Ut rhoncus nec justo a porttitor. Pellentesque auctor pharetra eros, viverra sodales lorem aliquet id. Curabitur semper nisi vel sem interdum suscipit."));
    }

    @Test
    void testBlockquoteWhenBlockquoteShouldNotBeSeparatedDocument() throws IOException {
        MarkdownDocumentParserConfig config = MarkdownDocumentParserConfig.builder()
            .withIncludeBlockquote(true)
            .build();
        MarkdownDocumentParser reader = new MarkdownDocumentParser(config);
        List<Document> documents = reader
            .parse(new DefaultResourceLoader().getResource("classpath:/blockquote.md").getInputStream());

        assertThat(documents).hasSize(1);

        Document documentsFirst = documents.get(0);
        assertThat(documentsFirst.getMetadata()).isEqualTo(Map.of("category", "blockquote"));
        assertThat(documentsFirst.getText()).isEqualTo(
                "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur diam eros, laoreet sit amet cursus vitae, varius sed nisi. Cras sit amet quam quis velit commodo porta consectetur id nisi. Phasellus tincidunt pulvinar augue. Proin vel laoreet leo, sed luctus augue. Sed et ligula commodo, commodo lacus at, consequat turpis. Maecenas eget sapien odio. Maecenas urna lectus, pellentesque in accumsan aliquam, congue eu libero. Ut rhoncus nec justo a porttitor. Pellentesque auctor pharetra eros, viverra sodales lorem aliquet id. Curabitur semper nisi vel sem interdum suscipit.");
    }

    @Test
    void testLists() throws IOException {
        MarkdownDocumentParser reader = new MarkdownDocumentParser();
        List<Document> documents = reader
            .parse(new DefaultResourceLoader().getResource("classpath:/lists.md").getInputStream());

        assertThat(documents).hasSize(2)
            .extracting(Document::getMetadata, Document::getText)
            .containsOnly(tuple(Map.of("category", "header_2", "title", "Ordered list"),
                    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur diam eros, laoreet sit amet cursus vitae, varius sed nisi. Cras sit amet quam quis velit commodo porta consectetur id nisi. Phasellus tincidunt pulvinar augue. Proin vel laoreet leo, sed luctus augue. Sed et ligula commodo, commodo lacus at, consequat turpis. Maecenas eget sapien odio. Pellentesque auctor pharetra eros, viverra sodales lorem aliquet id. Curabitur semper nisi vel sem interdum suscipit. Maecenas urna lectus, pellentesque in accumsan aliquam, congue eu libero. Ut rhoncus nec justo a porttitor."),
                    tuple(Map.of("category", "header_2", "title", "Unordered list"),
                            "Aenean eu leo eu nibh tristique posuere quis quis massa. Aenean imperdiet libero dui, nec malesuada dui maximus vel. Vestibulum sed dui condimentum, cursus libero in, dapibus tortor. Etiam facilisis enim in egestas dictum."));
    }

    @Test
    void testWithAdditionalMetadata() throws IOException {
        MarkdownDocumentParserConfig config = MarkdownDocumentParserConfig.builder()
            .withAdditionalMetadata("service", "some-service-name")
            .withAdditionalMetadata("env", "prod")
            .build();
        MarkdownDocumentParser reader = new MarkdownDocumentParser(config);
        List<Document> documents = reader
            .parse(new DefaultResourceLoader().getResource("classpath:/simple.md").getInputStream());

        assertThat(documents).hasSize(1);

        Document documentsFirst = documents.get(0);
        assertThat(documentsFirst.getMetadata()).isEqualTo(Map.of("service", "some-service-name", "env", "prod"));
        assertThat(documentsFirst.getText()).startsWith("Lorem ipsum dolor sit amet, consectetur adipiscing elit.");
    }

}

Summary

The spring-ai-alibaba-starter-document-parser-markdown module of Spring AI Alibaba provides MarkdownDocumentParser, which parses Markdown input into Document objects.

doc

  • java2ai
