
sensitive-word v0.25.1 new features: returning the matched word and fixing tags

news 2025/7/1 21:32:57

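Open source projects

sensitive-word core: https://github.com/houbb/sensitive-word

sensitive-word admin console: https://github.com/houbb/sensitive-word-admin

Release highlights

Hi everyone, I'm Lao Ma.

Earlier versions of sensitive-word did not return the underlying word that was actually matched, which sometimes made troubleshooting very time-consuming.

In addition, when character conversion or character skipping is enabled, the matched text that comes back can differ from the word as it was defined, which can look odd.

So v0.25.1 returns the underlying matched word and fixes the tags lookup.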

Problem scenario

issues/105

final String text = "你好敏#!@感$!@词";
List<WordTagsDto> wordList = wordBs.findAll(text, WordResultHandlers.wordTags());
[WordTagsDto{word='敏#!@感$!@词', tags=null}]

final String text = "你好敏感词";
List<WordTagsDto> wordList = wordBs.findAll(text, WordResultHandlers.wordTags());
[WordTagsDto{word='敏感词', tags=[0]}]

PR 111

A contributor did submit a PR to address this issue:

pull/111

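In practice, though, it still did not cover every scenario.

What is the root cause

The root cause is that when a word was hit, we previously returned only the matched text, for example 【敏#!@感$!@词】, while tags are defined only for 【敏感词】.

Trying to enumerate every possible variant of the matched text is clearly unreasonable.

So we need to know which blacklist word was actually matched.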
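The mismatch can be reproduced without the library at all. The sketch below is only an illustration: `RawTextTagLookupDemo`, `tagTable`, and `tagsOf` are hypothetical names, and a plain `HashMap` stands in for whatever tag store is actually configured.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class RawTextTagLookupDemo {
    // Hypothetical tag table: tags exist only for the blacklist word itself.
    static Map<String, Set<String>> tagTable() {
        Map<String, Set<String>> table = new HashMap<>();
        table.put("敏感词", new HashSet<>(Arrays.asList("0")));
        return table;
    }

    // Exact-key lookup; returns null when no tags are defined for the key.
    static Set<String> tagsOf(String key) {
        return tagTable().get(key);
    }

    public static void main(String[] args) {
        System.out.println(tagsOf("敏感词"));       // prints [0]
        System.out.println(tagsOf("敏#!@感$!@词")); // prints null
    }
}
```

The second lookup misses because the key is the raw matched text, not the word the tags were registered under.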

Solution

Blacklist hit word

With that requirement in mind, the existing blacklist word handling now additionally returns the underlying hit word.
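As a rough picture of what "also return the underlying hit word" means, here is a toy matcher. It is not the library's implementation: `HitWordSketch`, `Hit`, `matchAt`, and `findFirst` are hypothetical, and the rule "skip anything that is not a letter or digit" is an assumed stand-in for the real character-ignore strategy.

```java
import java.util.Arrays;
import java.util.List;

class HitWordSketch {
    // Hypothetical result: the raw span in the input plus the blacklist word behind it.
    static final class Hit {
        final int start;   // inclusive start index in the input text
        final int end;     // exclusive end index in the input text
        final String word; // blacklist word that was actually hit

        Hit(int start, int end, String word) {
            this.start = start;
            this.end = end;
            this.word = word;
        }

        String rawText(String text) {
            return text.substring(start, end);
        }
    }

    // Tries to match `word` at `start`, skipping noise characters inside a partial match.
    static Hit matchAt(String text, int start, String word) {
        int ti = start;
        int wi = 0;
        while (ti < text.length() && wi < word.length()) {
            char c = text.charAt(ti);
            if (c == word.charAt(wi)) {
                wi++;
                ti++;
            } else if (wi > 0 && !Character.isLetterOrDigit(c)) {
                ti++; // assumed ignore rule: punctuation and symbols are noise
            } else {
                return null;
            }
        }
        return wi == word.length() ? new Hit(start, ti, word) : null;
    }

    // Returns the first hit for any blacklist word, or null if nothing matches.
    static Hit findFirst(String text, List<String> blacklist) {
        for (int i = 0; i < text.length(); i++) {
            for (String word : blacklist) {
                Hit hit = matchAt(text, i, word);
                if (hit != null) {
                    return hit;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String text = "你好敏#!@感$!@词";
        Hit hit = findFirst(text, Arrays.asList("敏感词"));
        System.out.println(hit.rawText(text) + " -> " + hit.word);
    }
}
```

Carrying `word` alongside the span lets a caller choose between the text as the user typed it and the word as it was defined.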

Built-in tags adjustment

public class WordResultHandlerWordTags extends AbstractWordResultHandler<WordTagsDto> {

    @Override
    protected WordTagsDto doHandle(IWordResult wordResult, IWordContext wordContext, String originalText) {
        WordTagsDto dto = new WordTagsDto();

        // Slice the matched text out of the original input
        String word = InnerWordCharUtils.getString(originalText.toCharArray(), wordResult);

        // Look up tags with the matched text first
        Set<String> wordTags = InnerWordTagUtils.tags(word, wordContext);

        // If empty, fall back to the blacklist word that was actually hit (v0.25.1, bug 105)
        if (CollectionUtil.isEmpty(wordTags)) {
            wordTags = InnerWordTagUtils.tags(wordResult.word(), wordContext);
        }

        dto.setWord(word);
        dto.setTags(wordTags);
        return dto;
    }

}

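To keep the result intuitive, we still look up tags with the matched word first.

If nothing is found, we then query with the underlying blacklist word that was hit.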
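Stripped of the library types, the two-step lookup can be sketched as follows. `TagFallbackSketch` and `resolveTags` are hypothetical names, and the tag store is assumed to be a plain map.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class TagFallbackSketch {
    // Looks up tags by the raw matched text first, then by the underlying hit word.
    static Set<String> resolveTags(String rawText, String hitWord, Map<String, Set<String>> tagTable) {
        Set<String> tags = tagTable.get(rawText);
        if (tags == null || tags.isEmpty()) {
            tags = tagTable.get(hitWord); // fallback: the blacklist word as defined
        }
        return tags;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> tagTable = new HashMap<>();
        tagTable.put("敏感词", new HashSet<>(Arrays.asList("政治", "领导人")));

        Set<String> tags = resolveTags("敏---感---词", "敏感词", tagTable);
        System.out.println(tags.contains("政治")); // prints true
    }
}
```

Checking the raw text first means that tags registered under the exact matched text, if any exist, still take precedence over the fallback.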

Test results

敏感词 is the underlying blacklist word.

敏---感---词 is the returned text that matched after the ignored characters were skipped.

@Test
public void testNoiseCharacterInTaggedWords() {
    Map<String, Set<String>> newHashMap = new HashMap<>();
    newHashMap.put("敏感词", new HashSet<>(Arrays.asList("政治", "领导人")));

    // Instance configured with both character ignoring and tags enabled
    SensitiveWordBs ignoreAndTagWordBs = SensitiveWordBs.newInstance()
            .charIgnore(SensitiveWordCharIgnores.specialChars()) // enable character ignoring
            .wordTag(WordTags.map(newHashMap))
            .init();

    // Text containing the sensitive word with noise characters
    final String noisyText = "你好敏---感---词";

    // Check the instance with both features enabled (this failed before the fix)
    List<WordTagsDto> fixedWord = ignoreAndTagWordBs.findAll(noisyText, WordResultHandlers.wordTags());
    Assert.assertEquals(1, fixedWord.size());
    Assert.assertEquals("敏---感---词", fixedWord.get(0).getWord());
    Assert.assertNotNull("tags should not be null", fixedWord.get(0).getTags());
    Assert.assertTrue("should contain the '政治' tag", fixedWord.get(0).getTags().contains("政治"));
    Assert.assertTrue("should contain the '领导人' tag", fixedWord.get(0).getTags().contains("领导人"));
}