
A Look at Spring AI Alibaba's SentenceSplitter

This article takes a closer look at Spring AI Alibaba's SentenceSplitter.

SentenceSplitter

spring-ai-alibaba-core/src/main/java/com/alibaba/cloud/ai/transformer/splitter/SentenceSplitter.java

public class SentenceSplitter extends TextSplitter {

	private final EncodingRegistry registry = Encodings.newLazyEncodingRegistry();

	private final Encoding encoding = registry.getEncoding(EncodingType.CL100K_BASE);

	private static final int DEFAULT_CHUNK_SIZE = 1024;

	private final SentenceModel sentenceModel;

	private final int chunkSize;

	public SentenceSplitter() {
		this(DEFAULT_CHUNK_SIZE);
	}

	public SentenceSplitter(int chunkSize) {
		this.chunkSize = chunkSize;
		this.sentenceModel = getSentenceModel();
	}

	@Override
	protected List<String> splitText(String text) {
		SentenceDetectorME sentenceDetector = new SentenceDetectorME(sentenceModel);
		String[] texts = sentenceDetector.sentDetect(text);
		if (texts == null || texts.length == 0) {
			return Collections.emptyList();
		}

		List<String> chunks = new ArrayList<>();
		StringBuilder chunk = new StringBuilder();
		for (int i = 0; i < texts.length; i++) {
			int currentChunkSize = getEncodedTokens(chunk.toString()).size();
			int textTokenSize = getEncodedTokens(texts[i]).size();
			if (currentChunkSize + textTokenSize > chunkSize) {
				chunks.add(chunk.toString());
				chunk = new StringBuilder(texts[i]);
			}
			else {
				chunk.append(texts[i]);
			}
			if (i == texts.length - 1) {
				chunks.add(chunk.toString());
			}
		}
		return chunks;
	}

	private SentenceModel getSentenceModel() {
		try (InputStream is = getClass().getResourceAsStream("/opennlp/opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin")) {
			if (is == null) {
				throw new RuntimeException("sentence model is invalid");
			}
			return new SentenceModel(is);
		}
		catch (IOException e) {
			throw new RuntimeException(e);
		}
	}

	private List<Integer> getEncodedTokens(String text) {
		Assert.notNull(text, "Text must not be null");
		return this.encoding.encode(text).boxed();
	}

}

SentenceSplitter extends TextSplitter. Its constructor calls getSentenceModel() to load the SentenceModel from /opennlp/opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin on the classpath. The splitText method creates a SentenceDetectorME and uses its sentDetect to break the text into sentences, then merges consecutive sentences into chunks, starting a new chunk whenever adding the next sentence would push the current chunk's token count past chunkSize (1024 by default).
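Note that chunkSize is measured in tokens rather than characters: each candidate chunk is encoded with jtokkit's CL100K_BASE encoding and its token count is compared against the limit. Below is a minimal standalone sketch of that counting step (the sample sentence is made up; the registry and encoding are the same jtokkit types the splitter holds as fields above):

import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.EncodingType;

public class TokenCountSketch {

	public static void main(String[] args) {
		// Same registry/encoding pair that SentenceSplitter holds as fields
		EncodingRegistry registry = Encodings.newLazyEncodingRegistry();
		Encoding encoding = registry.getEncoding(EncodingType.CL100K_BASE);

		String sentence = "This is a test. This is another test.";
		// SentenceSplitter sizes chunks by encoded token count, not character length
		int tokenCount = encoding.encode(sentence).size();
		System.out.println(sentence.length() + " characters, " + tokenCount + " tokens");
	}

}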

Example

spring-ai-alibaba-core/src/test/java/com/alibaba/cloud/ai/transformer/splitter/SentenceSplitterTests.java

class SentenceSplitterTests {

	private SentenceSplitter splitter;

	private static final int CUSTOM_CHUNK_SIZE = 100;

	@BeforeEach
	void setUp() {
		// Initialize with default chunk size
		splitter = new SentenceSplitter();
	}

	/**
	 * Test default constructor. Verifies that splitter can be created with default chunk
	 * size.
	 */
	@Test
	void testDefaultConstructor() {
		SentenceSplitter defaultSplitter = new SentenceSplitter();
		assertThat(defaultSplitter).isNotNull();
	}

	/**
	 * Test constructor with custom chunk size. Verifies that splitter can be created
	 * with specified chunk size.
	 */
	@Test
	void testCustomChunkSizeConstructor() {
		SentenceSplitter customSplitter = new SentenceSplitter(CUSTOM_CHUNK_SIZE);
		assertThat(customSplitter).isNotNull();
	}

	/**
	 * Test splitting simple sentences. Verifies basic sentence splitting functionality.
	 */
	@Test
	void testSplitSimpleSentences() {
		String text = "This is a test. This is another test. And this is a third test.";
		Document doc = new Document(text);
		List<Document> documents = splitter.apply(Collections.singletonList(doc));
		assertThat(documents).isNotNull();
		assertThat(documents).hasSize(1);
		assertThat(documents.get(0).getText()).contains("This is a test", "This is another test",
				"And this is a third test");
	}

	/**
	 * Test splitting empty text. Verifies handling of empty input.
	 */
	@Test
	void testSplitEmptyText() {
		Document doc = new Document("");
		List<Document> documents = splitter.apply(Collections.singletonList(doc));
		assertThat(documents).isEmpty();
	}

	/**
	 * Test splitting text with special characters. Verifies handling of text with
	 * various punctuation and special characters.
	 */
	@Test
	void testSplitTextWithSpecialCharacters() {
		String text = "Hello, world! How are you? I'm doing great... This is a test; with various punctuation.";
		Document doc = new Document(text);
		List<Document> documents = splitter.apply(Collections.singletonList(doc));
		assertThat(documents).isNotNull();
		assertThat(documents).hasSize(1);
		assertThat(documents.get(0).getText()).contains("Hello, world", "How are you", "I'm doing great",
				"This is a test");
	}

	/**
	 * Test splitting long text. Verifies handling of text that exceeds default chunk
	 * size.
	 */
	@Test
	void testSplitLongText() {
		// Generate a very long text that will exceed the default chunk size (1024
		// tokens)
		StringBuilder longText = new StringBuilder();
		String longSentence = "This is a very long sentence with many words that will contribute to the total token count and eventually force the text to be split into multiple chunks because it exceeds the default chunk size limit of 1024 tokens. ";
		// Repeat the sentence enough times to ensure we exceed the chunk size
		for (int i = 0; i < 50; i++) {
			longText.append(longSentence);
		}
		Document doc = new Document(longText.toString());
		List<Document> documents = splitter.apply(Collections.singletonList(doc));
		// Verify that the text was split into multiple documents
		assertThat(documents).isNotNull();
		assertThat(documents).hasSizeGreaterThan(1);
		// Verify that each document contains part of the original text
		documents.forEach(document -> assertThat(document.getText()).contains("This is a very long sentence"));
	}

	/**
	 * Test splitting text with multiple line breaks. Verifies handling of text with
	 * various types of line breaks.
	 */
	@Test
	void testSplitTextWithLineBreaks() {
		String text = "First sentence.\nSecond sentence.\r\nThird sentence.\rFourth sentence.";
		Document doc = new Document(text);
		List<Document> documents = splitter.apply(Collections.singletonList(doc));
		assertThat(documents).isNotNull();
		assertThat(documents.get(0).getText()).contains("First sentence", "Second sentence", "Third sentence",
				"Fourth sentence");
	}

	/**
	 * Test splitting text with single character sentences. Verifies handling of very
	 * short sentences.
	 */
	@Test
	void testSplitSingleCharacterSentences() {
		String text = "A. B. C. D.";
		Document doc = new Document(text);
		List<Document> documents = splitter.apply(Collections.singletonList(doc));
		assertThat(documents).isNotNull();
		assertThat(documents).hasSize(1);
		assertThat(documents.get(0).getText()).contains("A", "B", "C", "D");
	}

	/**
	 * Test splitting multiple documents. Verifies handling of multiple input documents.
	 */
	@Test
	void testSplitMultipleDocuments() {
		List<Document> inputDocs = new ArrayList<>();
		inputDocs.add(new Document("First document. With multiple sentences."));
		inputDocs.add(new Document("Second document. Also with multiple sentences."));
		List<Document> documents = splitter.apply(inputDocs);
		assertThat(documents).isNotNull();
		assertThat(documents).hasSizeGreaterThan(1);
	}

}

Summary

Spring AI Alibaba provides SentenceSplitter, which uses OpenNLP's SentenceDetectorME to split text into sentences; its constructor loads the SentenceModel from /opennlp/opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin.
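To close, here is a minimal usage sketch (not from the article; the sample text and the 256-token chunk size are assumptions) showing how SentenceSplitter can be applied to documents, mirroring the apply(...) calls in the tests above:

import java.util.List;

import com.alibaba.cloud.ai.transformer.splitter.SentenceSplitter;

import org.springframework.ai.document.Document;

public class SentenceSplitterUsage {

	public static void main(String[] args) {
		// Assumed chunk size of 256 tokens; the no-arg constructor defaults to 1024
		SentenceSplitter splitter = new SentenceSplitter(256);

		Document doc = new Document("First sentence. Second sentence. Third sentence.");

		// The splitter is applied to a list of documents, as in the tests above,
		// and returns one Document per resulting chunk
		List<Document> chunks = splitter.apply(List.of(doc));
		chunks.forEach(chunk -> System.out.println(chunk.getText()));
	}

}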

doc

  • 1.0.0-M6.1/get-started
