
A Look at Spring AI Alibaba's DocumentParser

This article takes a look at the DocumentParser in Spring AI Alibaba.

DocumentParser

spring-ai-alibaba-core/src/main/java/com/alibaba/cloud/ai/document/DocumentParser.java

public interface DocumentParser {

    /**
     * Parses a given {@link InputStream} into a {@link Document}. The specific
     * implementation of this method will depend on the type of the document being parsed.
     * <p>
     * Note: This method does not close the provided {@link InputStream} - it is the
     * caller's responsibility to manage the lifecycle of the stream.
     * @param inputStream The {@link InputStream} that contains the content of the
     * {@link Document}.
     * @return The parsed {@link Document}.
     */
    List<Document> parse(InputStream inputStream);

}

The DocumentParser interface defines a parse method that parses an InputStream into a list of org.springframework.ai.document.Document. Its implementations include TextDocumentParser and JsonDocumentParser, among others.

TextDocumentParser

spring-ai-alibaba-core/src/main/java/com/alibaba/cloud/ai/document/TextDocumentParser.java

public class TextDocumentParser implements DocumentParser {

    private final Charset charset;

    public TextDocumentParser() {
        this(UTF_8);
    }

    public TextDocumentParser(Charset charset) {
        Assert.notNull(charset, "charset");
        this.charset = charset;
    }

    @Override
    public List<Document> parse(InputStream inputStream) {
        try {
            String text = new String(inputStream.readAllBytes(), charset);
            if (text.isBlank()) {
                throw new Exception();
            }
            return Collections.singletonList(new Document(text));
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

}

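TextDocumentParser implements the DocumentParser interface. Its parse method reads all bytes from the inputStream, decodes them into a String with the configured charset (UTF-8 by default), and wraps the result in a single Document. Blank input, like any read failure, ends up as a RuntimeException.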
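The read-decode-validate steps, plus the interface's note that the caller owns the stream, can be sketched without Spring AI on the classpath. This is only an illustration: TextDoc is a stand-in for org.springframework.ai.document.Document, TextParseSketch is a hypothetical name, and the sketch throws IllegalArgumentException and UncheckedIOException where the real class wraps everything in RuntimeException.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Stand-in for org.springframework.ai.document.Document (assumption: only the text matters here).
record TextDoc(String text) {}

class TextParseSketch {

    // Same steps as described above: read all bytes, decode with the charset, reject blank input.
    static List<TextDoc> parse(InputStream in, Charset charset) {
        try {
            String text = new String(in.readAllBytes(), charset);
            if (text.isBlank()) {
                throw new IllegalArgumentException("input is blank");
            }
            return List.of(new TextDoc(text));
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "hello parser".getBytes(StandardCharsets.UTF_8);
        // The parser does not close the stream, so the caller does it via try-with-resources.
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            List<TextDoc> docs = parse(in, StandardCharsets.UTF_8);
            System.out.println(docs.get(0).text()); // prints "hello parser"
        }
    }
}
```

With the real TextDocumentParser the calling pattern would presumably be the same: open the stream in a try-with-resources block and pass it to parse.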

JsonDocumentParser

spring-ai-alibaba-core/src/main/java/com/alibaba/cloud/ai/document/JsonDocumentParser.java

public class JsonDocumentParser implements DocumentParser {

    private final JsonMetadataGenerator jsonMetadataGenerator;

    private final ObjectMapper objectMapper = new ObjectMapper();

    /**
     * The key from the JSON that we will use as the text to parse into the Document text
     */
    private final List<String> jsonKeysToUse;

    public JsonDocumentParser(String... jsonKeysToUse) {
        this(new EmptyJsonMetadataGenerator(), jsonKeysToUse);
    }

    public JsonDocumentParser(JsonMetadataGenerator jsonMetadataGenerator, String... jsonKeysToUse) {
        Objects.requireNonNull(jsonKeysToUse, "keys must not be null");
        Objects.requireNonNull(jsonMetadataGenerator, "jsonMetadataGenerator must not be null");
        this.jsonMetadataGenerator = jsonMetadataGenerator;
        this.jsonKeysToUse = List.of(jsonKeysToUse);
    }

    @Override
    public List<Document> parse(InputStream inputStream) {
        try {
            JsonNode rootNode = this.objectMapper.readTree(inputStream);

            if (rootNode.isArray()) {
                return StreamSupport.stream(rootNode.spliterator(), true)
                    .map(jsonNode -> parseJsonNode(jsonNode, this.objectMapper))
                    .toList();
            }
            else {
                return Collections.singletonList(parseJsonNode(rootNode, this.objectMapper));
            }
        }
        catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    //......

    private Document parseJsonNode(JsonNode jsonNode, ObjectMapper objectMapper) {
        Map<String, Object> item = objectMapper.convertValue(jsonNode, new TypeReference<Map<String, Object>>() {
        });
        var sb = new StringBuilder();

        this.jsonKeysToUse.stream()
            .filter(item::containsKey)
            .forEach(key -> sb.append(key).append(": ").append(item.get(key)).append(System.lineSeparator()));

        Map<String, Object> metadata = this.jsonMetadataGenerator.generate(item);
        String content = sb.isEmpty() ? item.toString() : sb.toString();
        return new Document(content, metadata);
    }

    //......
}

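JsonDocumentParser uses ObjectMapper to parse the JSON input; a top-level array yields one Document per element, anything else yields a single Document. Each node is first converted to a Map, then every key in jsonKeysToUse that is present in the Map is appended as a "key: value" line. If none of the keys is present, the content falls back to the Map's toString(). Metadata is generated by jsonMetadataGenerator, and the content and metadata together build the Document.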
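To make the content-building step concrete, here is a small sketch of the same concatenation applied to a plain Map, with no Jackson involved. JsonContentSketch and buildContent are hypothetical names used only for illustration; the logic is modeled on the parseJsonNode method shown above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class JsonContentSketch {

    // One "key: value" line per selected key that exists in the item;
    // if no selected key is present, fall back to the Map's toString().
    static String buildContent(Map<String, Object> item, List<String> keysToUse) {
        StringBuilder sb = new StringBuilder();
        for (String key : keysToUse) {
            if (item.containsKey(key)) {
                sb.append(key).append(": ").append(item.get(key)).append(System.lineSeparator());
            }
        }
        return sb.length() == 0 ? item.toString() : sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> item = new LinkedHashMap<>();
        item.put("text", "Sample text");
        item.put("description", "Sample description");
        item.put("other", "Other field");

        // Prints two lines: "text: Sample text" and "description: Sample description"
        System.out.print(buildContent(item, List.of("text", "description")));

        // No selected key matches, so this prints the Map form:
        // {text=Sample text, description=Sample description, other=Other field}
        System.out.println(buildContent(item, List.of("missing")));
    }
}
```

Note that the fallback is the Map's toString() form (key=value pairs), not JSON. An empty object still comes out as {} because an empty Map also prints as {}.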

Example

class JsonDocumentParserTests {

    private JsonDocumentParser parser;

    @BeforeEach
    void setUp() {
        // Initialize parser with text and description fields
        parser = new JsonDocumentParser("text", "description");
    }

    @Test
    void testParseSingleJsonObject() {
        // Test parsing a single JSON object with text and description fields
        String json = """
                {
                    "text": "Sample text",
                    "description": "Sample description",
                    "other": "Other field"
                }
                """;

        List<Document> documents = parser.parse(toInputStream(json));

        assertThat(documents).hasSize(1);
        Document doc = documents.get(0);
        assertThat(doc.getText()).contains("Sample text").contains("Sample description");
    }

    @Test
    void testParseJsonArray() {
        // Test parsing an array of JSON objects
        String json = """
                [
                    {
                        "text": "First text",
                        "description": "First description"
                    },
                    {
                        "text": "Second text",
                        "description": "Second description"
                    }
                ]
                """;

        List<Document> documents = parser.parse(toInputStream(json));

        assertThat(documents).hasSize(2);
        assertThat(documents.get(0).getText()).contains("First text");
        assertThat(documents.get(1).getText()).contains("Second text");
    }

    @Test
    void testJsonPointerParsing() {
        // Test parsing using JSON pointer to specific location in document
        String json = """
                {
                    "data": {
                        "items": [
                            {
                                "text": "Pointer text",
                                "description": "Pointer description"
                            }
                        ]
                    }
                }
                """;

        List<Document> documents = parser.get("/data/items", toInputStream(json));

        assertThat(documents).hasSize(1);
        assertThat(documents.get(0).getText()).contains("Pointer text").contains("Pointer description");
    }

    @Test
    void testEmptyJsonInput() {
        // Test handling of empty JSON object
        String json = "{}";

        List<Document> documents = parser.parse(toInputStream(json));

        assertThat(documents).hasSize(1);
        assertThat(documents.get(0).getText()).isEqualTo("{}");
    }

    @Test
    void testInvalidJsonPointer() {
        // Test handling of invalid JSON pointer
        String json = """
                {
                    "data": {}
                }
                """;

        assertThrows(IllegalArgumentException.class, () -> parser.get("/invalid/pointer", toInputStream(json)));
    }

    private InputStream toInputStream(String content) {
        return new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
    }

}

Summary

Spring AI Alibaba defines com.alibaba.cloud.ai.document.DocumentParser, and some implementations of org.springframework.ai.document.DocumentReader delegate to the corresponding parser. Out of the box, spring-ai-alibaba-core provides two DocumentParser implementations: TextDocumentParser and JsonDocumentParser.
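The delegation idea can be illustrated roughly as follows: the reader knows where the content comes from and manages the stream, while the parser only turns bytes into documents. Everything here is a stand-in written for this sketch (Doc, Parser, WholeTextParser, DelegatingReader); none of these are actual Spring AI or Spring AI Alibaba classes, and real readers may be structured differently.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.function.Supplier;

// Stand-in for the Document type (assumption: text only).
record Doc(String text) {}

// Stand-in with the same shape as the DocumentParser interface.
interface Parser {
    List<Doc> parse(InputStream inputStream);
}

// Hypothetical parser: the whole stream becomes one document.
class WholeTextParser implements Parser {
    @Override
    public List<Doc> parse(InputStream inputStream) {
        try {
            return List.of(new Doc(new String(inputStream.readAllBytes(), StandardCharsets.UTF_8)));
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Hypothetical reader: opens the source, delegates parsing, closes the stream.
class DelegatingReader implements Supplier<List<Doc>> {

    private final Supplier<InputStream> source;

    private final Parser parser;

    DelegatingReader(Supplier<InputStream> source, Parser parser) {
        this.source = source;
        this.parser = parser;
    }

    @Override
    public List<Doc> get() {
        // The parser does not close the stream, so the reader takes care of it.
        try (InputStream in = source.get()) {
            return parser.parse(in);
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class DelegationDemo {
    public static void main(String[] args) {
        DelegatingReader reader = new DelegatingReader(
                () -> new ByteArrayInputStream("from some source".getBytes(StandardCharsets.UTF_8)),
                new WholeTextParser());
        System.out.println(reader.get().get(0).text()); // prints "from some source"
    }
}
```

Splitting the two roles this way means the same parser can be reused across different sources, and a source can switch formats by swapping the parser.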

doc

  • java2ai
