
Building an efficient substring inverted index with a Trie and RoaringBitmap

news 2025/10/28 9:38:53

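1. Trie (prefix tree)

✅ What is it?

A Trie (pronounced "try") is a tree-shaped data structure specialized for strings. It is particularly well suited to prefix matching and fast lookup.

🌰 An example:

Suppose we have these company names:

华为 (Huawei)
华为技术 (Huawei Technologies)
华为终端 (Huawei Device)
华为云 (Huawei Cloud)
华为供应链 (Huawei Supply Chain)

Finding every name that starts with "华为" in a plain list means scanning all entries. In a Trie, the names share the path of their common prefix (multi-character branches are shown collapsed here; each character is really its own node):

    (root)
      └── 华
          └── 为   (end of "华为")
              ├── 技术
              ├── 终端
              ├── 云
              └── 供应链

Typing "华为" walks straight to the "为" node, and every word in the subtree below it is a candidate. That is how a Trie supports autocomplete and prefix-based matching.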
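To make the prefix walk concrete, here is a minimal sketch of a prefix trie with an autocomplete method. The class name `PrefixTrie` and its methods are illustrative and do not appear in the code discussed later in this article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Minimal prefix trie: insert words, then list every word under a prefix. */
final class PrefixTrie {
    private static final class Node {
        // TreeMap keeps children sorted so the output order is deterministic
        final Map<Character, Node> children = new TreeMap<>();
        boolean isWord; // true if an inserted word ends at this node
    }

    private final Node root = new Node();

    /** Adds a word, creating one node per character as needed. */
    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    /** Returns all inserted words that start with the given prefix. */
    List<String> complete(String prefix) {
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return new ArrayList<>(); // no inserted word has this prefix
            }
        }
        List<String> out = new ArrayList<>();
        collect(node, new StringBuilder(prefix), out);
        return out;
    }

    /** Depth-first walk of the subtree, rebuilding each word from the path. */
    private void collect(Node node, StringBuilder path, List<String> out) {
        if (node.isWord) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey());
            collect(e.getValue(), path, out);
            path.deleteCharAt(path.length() - 1);
        }
    }
}
```

After inserting the five company names above, `complete("华为")` returns all five, because "华为" itself is marked as a word and the other four live in its subtree. The walk to the prefix node costs O(m) for a prefix of m characters; collecting the results costs time proportional to the size of the subtree.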

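✅ Role in this scenario:

Quickly recognizes keywords such as products and companies in the user's input (typing "华" already matches "华为")
Supports prefix search directly; spelling tolerance and alias expansion need extra logic on top, for example inserting aliases as additional keys
Improves the recall and speed of NER (named entity recognition)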

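🔹 2. RoaringBitmap

✅ What is it?

RoaringBitmap is a compressed bitmap structure for storing large sets of integers compactly and operating on them quickly, for example "which documents contain a given word".

🌰 An example:

Suppose every product has an ID (1, 2, 3, ... up to 1,000,000) and we want to record which products belong to the "手机" (mobile phone) category.
Traditional approach: keep all IDs in an array or a hash set → high memory use and slow set operations.
With RoaringBitmap: IDs are mapped to bits and stored in compressed form, for example:

"手机" → {1, 2, 3, 1000, 1001, 200000}
Stored as a RoaringBitmap, this set may take only a few dozen bytes, and intersection and union are very fast.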

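✅ Role in this scenario:

Associates each keyword (such as "华为") with the list of entity IDs it maps to in the knowledge graph
Stores these ID sets as RoaringBitmaps, which saves memory when the sets are large
Supports fast merging of candidate sets after keyword matching (for example, "华为 AND 手机" is the intersection of two bitmaps)
Improves response time for fuzzy search under high concurrency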
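The "华为 AND 手机" intersection can be sketched as follows. To keep the snippet runnable with the JDK alone, it uses `java.util.BitSet` as a stand-in for RoaringBitmap; that substitution is an assumption made for illustration, and the helper class `IdSets` is hypothetical. With the RoaringBitmap library, the corresponding calls would be `RoaringBitmap.bitmapOf(...)`, `RoaringBitmap.and(a, b)` and `RoaringBitmap.or(a, b)`.

```java
import java.util.BitSet;

/** Set algebra on ID sets, with java.util.BitSet standing in for RoaringBitmap. */
final class IdSets {
    /** Builds a bit set containing the given non-negative IDs. */
    static BitSet of(int... ids) {
        BitSet set = new BitSet();
        for (int id : ids) {
            set.set(id);
        }
        return set;
    }

    /** IDs present in both sets (keyword A AND keyword B); inputs are not modified. */
    static BitSet and(BitSet a, BitSet b) {
        BitSet out = (BitSet) a.clone();
        out.and(b);
        return out;
    }

    /** IDs present in either set (keyword A OR keyword B); inputs are not modified. */
    static BitSet or(BitSet a, BitSet b) {
        BitSet out = (BitSet) a.clone();
        out.or(b);
        return out;
    }
}
```

For instance, with a keyword set {1001, 1002, 2001} and the category set {1, 2, 3, 1000, 1001, 200000} from the example above, the intersection is {1001}. The design difference matters at scale: `BitSet` is uncompressed, so holding ID 200000 alone allocates roughly 25 KB of bits, whereas RoaringBitmap splits the ID space into 16-bit chunks and stores sparse chunks as short sorted arrays.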

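🔗 Using them together: Trie + RoaringBitmap

Step	Process
1️⃣	The user types "华" → the Trie quickly finds all words starting with "华" (such as "华为" and "华星")
2️⃣	Each word has a RoaringBitmap holding the entity IDs associated with it (for example, "华为" → {1001, 1002})
3️⃣	The bitmaps of the matched words are combined by union or intersection to get the final candidates
4️⃣	The candidates are passed to the NER model for semantic judgment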
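Steps 1 to 3 of the table can be sketched in a few lines: a trie whose word-end nodes carry an ID set, and a prefix lookup that unions the sets of every word under the prefix. As before, `java.util.BitSet` stands in for RoaringBitmap so the snippet has no dependencies, and the class `PrefixIdIndex` with its method names is hypothetical.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Trie whose word-end nodes hold the entity IDs associated with that word. */
final class PrefixIdIndex {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        BitSet ids; // non-null only where a whole word ends
    }

    private final Node root = new Node();

    /** Associates entityId with word. */
    void put(String word, int entityId) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        if (node.ids == null) {
            node.ids = new BitSet();
        }
        node.ids.set(entityId);
    }

    /** Union of the ID sets of every word that starts with prefix. */
    BitSet lookupPrefix(String prefix) {
        BitSet result = new BitSet();
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return result; // nothing starts with this prefix
            }
        }
        unionSubtree(node, result);
        return result;
    }

    /** Accumulates the ID sets of all words in the subtree into acc. */
    private void unionSubtree(Node node, BitSet acc) {
        if (node.ids != null) {
            acc.or(node.ids);
        }
        for (Node child : node.children.values()) {
            unionSubtree(child, acc);
        }
    }
}
```

Note that this variant pays for the subtree walk at query time. The implementation discussed below makes the opposite trade: it stores IDs on every substring node at index time, so a query only has to read the bitmap at the node it reaches.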

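✅ Summary of advantages:

Fast: a Trie matches a prefix in O(m) time (m is the length of the string)
Compact: RoaringBitmap compresses well. The estimate here is 5 to 10 times less memory than an ordinary set, though the actual ratio depends on how many IDs each set holds and how densely they cluster
Scalable: supports real-time fuzzy retrieval over large volumes of dimension data and high-concurrency AI queries

💡 In one sentence:

The Trie quickly finds the possible words, RoaringBitmap efficiently records and combines the entities those words map to, and together they give high-performance matching over massive business dimension data.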

import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
// Also required but not shown in the original listing: the Spring annotations
// (@Scheduled, @Autowired, @Qualifier), @PostConstruct and @Resource, the project
// types (AppKnowledgeGraph, AppKnowledgeGraphImpl, AppKnowledgeGraphMapper,
// BIKgInfoVo, BIKgRelationVO, DESUtils), a class-level bean annotation, and a
// logger named `log` (for example via Lombok's @Slf4j).

public class KGraphCache {

    private final CompactInvertedIndex invertedIndex = new CompactInvertedIndex();
    private final Map<String, BIKgInfoVo> kgCache = new ConcurrentHashMap<>(); // subject → Vo

    // AtomicBoolean keeps the "initialized" flag thread-safe
    private final AtomicBoolean initialized = new AtomicBoolean(false);
    private final AtomicInteger loadAttemptCount = new AtomicInteger(0);
    private static final int MAX_LOAD_ATTEMPTS = 6;
    private static final String PASSWORD_KGDICT = "*********";

    // Category ID → category name
    private static final Map<Integer, String> CATEGORY_MAP = createCategoryMap();

    private static Map<Integer, String> createCategoryMap() {
        Map<Integer, String> map = new HashMap<>();
        map.put(1, "产品分类"); // product category
        map.put(2, "公司");     // company
        map.put(3, "产品");     // product
        return Collections.unmodifiableMap(map);
    }

    @Resource
    private AppKnowledgeGraphMapper appKnowledgeGraphMapper;

    @Autowired
    @Qualifier("asyncExecutor")
    private Executor asyncExecutor;

    // ========================
    // Initialization and scheduling
    // ========================

    @PostConstruct
    public void init() {
        log.info("Knowledge graph cache component registered, waiting for first load...");
    }

    /**
     * Checks every 30 minutes: if the cache is not initialized yet, tries to load it.
     * Once a load succeeds, this method stops doing any work.
     */
    @Scheduled(initialDelay = 10_000, fixedDelay = 30 * 60 * 1000)
    public void checkAndLoadCache() {
        // Already initialized: nothing to do
        if (initialized.get()) {
            return;
        }
        // Retry budget exhausted: stop trying
        if (loadAttemptCount.get() >= MAX_LOAD_ATTEMPTS) {
            return;
        }

        int attempt = loadAttemptCount.incrementAndGet();
        log.info("[Lazy init check] Attempt {} to load the knowledge graph cache...", attempt);

        loadCacheAsync().thenAccept(success -> {
            if (success) {
                boolean wasSet = initialized.compareAndSet(false, true);
                if (wasSet) {
                    log.info("Knowledge graph cache loaded for the first time, marked as initialized");
                    // Optional: reset the counter (not required; use 0 or 1 as needed)
                    loadAttemptCount.set(1);
                }
            } else {
                int currentFailures = loadAttemptCount.get();
                log.warn("Load attempt {} failed ({} failures so far); retrying next cycle (max {} attempts)",
                        currentFailures, currentFailures, MAX_LOAD_ATTEMPTS);
            }
        }).exceptionally(throwable -> {
            int currentFailures = loadAttemptCount.get();
            log.warn("Load attempt {} threw an exception: {}", currentFailures, throwable.getMessage(), throwable);
            return null;
        });
    }

    /**
     * Refreshes the cache every day at 05:01 (incremental or full).
     */
    @Scheduled(cron = "0 1 5 * * ?") // every day at 05:01:00
    public void scheduledRefresh() {
        log.info("Starting scheduled task: daily knowledge graph cache refresh");
        loadCacheAsync().thenAccept(success -> {
            if (success) {
                log.info("Daily cache refresh finished");
            } else {
                log.warn("Daily cache refresh failed, manual check recommended");
            }
        });
    }

    // ========================
    // Core asynchronous loading logic
    // ========================

    /**
     * Loads the cache asynchronously.
     *
     * @return a CompletableFuture<Boolean> that completes with true if the load succeeded
     */
    public CompletableFuture<Boolean> loadCacheAsync() {
        return CompletableFuture.supplyAsync(() -> {
            try {
                log.info("[Async task] Loading knowledge graph cache...");
                long start = System.currentTimeMillis();

                LocalDate yesterday = LocalDate.now().minusDays(1);
                LocalDate dayBeforeYesterday = yesterday.minusDays(1);

                List<AppKnowledgeGraph> allData = selectByDt(yesterday);
                if (allData.isEmpty()) {
                    log.info("No data for yesterday ({}), trying the day before ({})", yesterday, dayBeforeYesterday);
                    allData = selectByDt(dayBeforeYesterday);
                }
                if (allData.isEmpty()) {
                    log.warn("No knowledge graph data found for yesterday or the day before; treating this load as failed");
                    return false;
                }

                // Deduplicate on (cleaned subject, predicateId, objectId)
                Map<List<Object>, AppKnowledgeGraph> dedupMap = allData.stream()
                        .filter(kg -> kg.getSubject() != null && kg.getPredicateId() != null && kg.getObjectId() != null)
                        .collect(Collectors.toMap(
                                kg -> {
                                    String cleanedSubject = TextCleaner.cleanSubject(kg.getSubject());
                                    return Arrays.asList(cleanedSubject, kg.getPredicateId(), kg.getObjectId());
                                },
                                kg -> kg,
                                (e1, e2) -> e1 // keep the first occurrence
                        ));
                List<AppKnowledgeGraph> uniqueData = new ArrayList<>(dedupMap.values());

                // Group by subject (the subject must be cleaned here as well)
                Map<String, List<AppKnowledgeGraph>> grouped = uniqueData.stream()
                        .collect(Collectors.groupingBy(kg -> TextCleaner.cleanSubject(kg.getSubject())));

                // === Incremental cache update starts ===
                Set<String> currentSubjects = new HashSet<>(grouped.keySet());
                Set<String> existingSubjects = new HashSet<>(kgCache.keySet());

                // 1. Remove subjects that no longer exist
                Set<String> toRemove = new HashSet<>(existingSubjects);
                toRemove.removeAll(currentSubjects);
                for (String subject : toRemove) {
                    kgCache.remove(subject);
                    invertedIndex.remove(subject);
                }

                // 2. Add new subjects or update existing ones
                for (Map.Entry<String, List<AppKnowledgeGraph>> entry : grouped.entrySet()) {
                    String cleanedSubject = entry.getKey();
                    BIKgInfoVo vo = new BIKgInfoVo();
                    vo.setEntity(cleanedSubject);
                    vo.setRelations(entry.getValue().stream()
                            .map(kg -> new BIKgRelationVO(kg.getPredicateId(), kg.getObjectId()))
                            .collect(Collectors.toList()));

                    // Update the cache and the inverted index
                    BIKgInfoVo oldVo = kgCache.put(cleanedSubject, vo);
                    if (oldVo == null) {
                        invertedIndex.add(cleanedSubject); // new subject
                    }
                    // An existing subject is already indexed; nothing to do
                }

                long time = System.currentTimeMillis() - start;
                log.info("Knowledge graph cache updated: {} unique triples loaded, cache size {}, took {} ms",
                        uniqueData.size(), kgCache.size(), time);
                return true;
            } catch (Exception e) {
                log.error("Exception while loading the knowledge graph cache asynchronously", e);
                return false;
            }
        }, asyncExecutor);
    }

    // ========================
    // Data access
    // ========================

    private List<AppKnowledgeGraph> selectByDt(LocalDate localDate) {
        AppKnowledgeGraphImpl example = new AppKnowledgeGraphImpl();
        java.sql.Date sqlDate = java.sql.Date.valueOf(localDate);
        example.createCriteria().andDtEqualTo(sqlDate);
        return appKnowledgeGraphMapper.selectByExample(example);
    }

    // ========================
    // Query API
    // ========================

    public List<BIKgInfoVo> searchByQuestion(String question) {
        if (question == null || question.trim().isEmpty()) {
            return Collections.emptyList();
        }
        question = question.trim();

        Set<Integer> matchedIds = invertedIndex.search(question);
        if (matchedIds.isEmpty()) {
            return Collections.emptyList();
        }

        // Resolve all candidate subjects
        List<String> subjects = matchedIds.stream()
                .map(invertedIndex::getStringById)
                .filter(Objects::nonNull)
                .collect(Collectors.toList());

        // Sort by longest common substring (descending), then by string length (ascending)
        String finalQuestion = question;
        subjects.sort((a, b) -> {
            int lcsA = longestCommonSubstringLength(a, finalQuestion);
            int lcsB = longestCommonSubstringLength(b, finalQuestion);
            if (lcsA != lcsB) {
                return Integer.compare(lcsB, lcsA); // longer LCS first
            }
            return Integer.compare(a.length(), b.length()); // shorter string first
        });

        List<BIKgInfoVo> result = new ArrayList<>();
        Set<String> seen = new HashSet<>(); // guards against adding the same subject twice

        // Per-category counters: key = category name, value = count
        Map<String, Integer> categoryCount = new HashMap<>();
        Set<String> selectedCategories = new LinkedHashSet<>(); // keeps first-seen order

        Map<Integer, String> categoryMap = CATEGORY_MAP;

        for (String subject : subjects) {
            if (result.size() >= 12) break;

            BIKgInfoVo vo = kgCache.get(subject);
            if (vo == null || seen.contains(subject)) continue;

            // Derive the category from the relations
            String category = extractCategory(vo, categoryMap);
            if (category == null) {
                category = "其他"; // "other", the default category
            }

            // At most 3 categories, at most 4 entries per category (3 x 4 = the cap of 12)
            if (selectedCategories.size() < 3 || selectedCategories.contains(category)) {
                int count = categoryCount.getOrDefault(category, 0);
                if (count < 4) {
                    selectedCategories.add(category);
                    categoryCount.put(category, count + 1);

                    BIKgInfoVo newVo = new BIKgInfoVo();
                    newVo.setEntity(DESUtils.encrypt(vo.getEntity(), PASSWORD_KGDICT)); // encrypted
                    newVo.setRelations(vo.getRelations());
                    result.add(newVo);
                    seen.add(subject);
                }
            }
        }
        return result;
    }

    private String extractCategory(BIKgInfoVo vo, Map<Integer, String> categoryMap) {
        if (vo.getRelations() == null) return null;
        // Assumes predicate == 1 denotes the "type" relation
        for (BIKgRelationVO rel : vo.getRelations()) {
            if (rel.getPredicate() != null && rel.getPredicate().equals(1)) {
                Integer obj = rel.getObject();
                if (obj != null && categoryMap.containsKey(obj)) {
                    return categoryMap.get(obj);
                }
            }
        }
        return null; // category could not be determined
    }

    // Classic O(m * n) dynamic programming over both strings
    private int longestCommonSubstringLength(String a, String b) {
        int m = a.length(), n = b.length();
        if (m == 0 || n == 0) return 0;
        int[][] dp = new int[m + 1][n + 1];
        int max = 0;
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                    max = Math.max(max, dp[i][j]);
                } else {
                    dp[i][j] = 0;
                }
            }
        }
        return max;
    }

    // ========================
    // Monitoring and status
    // ========================

    public boolean isInitialized() {
        return initialized.get();
    }

    public int size() {
        return kgCache.size();
    }

    // ================== Utility: text cleaning ==================

    public static class TextCleaner {
        /**
         * Characters to strip: double quote ", single quote ', backslash \,
         * angle brackets <>, braces {}, square brackets [], vertical bar |
         */
        private static final Pattern INVALID_CHARS_PATTERN = Pattern.compile("[\"'\\\\<>{}\\[\\]|]");

        /**
         * Cleans a subject string by removing the invalid characters.
         */
        public static String cleanSubject(String subject) {
            if (subject == null || subject.isEmpty()) {
                return subject;
            }
            return INVALID_CHARS_PATTERN.matcher(subject).replaceAll("");
        }
    }
}

import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CopyOnWriteArrayList;

import org.roaringbitmap.IntIterator;
import org.roaringbitmap.RoaringBitmap;

/**
 * Compact inverted index built on a Trie plus RoaringBitmap.
 * Indexing: splits each text into all substrings of length >= 2 and inserts them
 * into the Trie, each pointing at the text's subject ID.
 * Querying: extracts the substrings of the question and returns the matching subject IDs.
 */
public class CompactInvertedIndex {

    private final TrieNode root = new TrieNode();
    private final Map<String, Integer> stringToId = new ConcurrentHashMap<>();
    private final List<String> idToString = new CopyOnWriteArrayList<>();
    private volatile int nextId = 0;

    // ========================
    // ID mapping
    // ========================

    private int getId(String str) {
        return stringToId.computeIfAbsent(str, k -> {
            int id;
            synchronized (this) {
                id = nextId++;
                while (idToString.size() <= id) {
                    idToString.add(null);
                }
                idToString.set(id, k);
            }
            return id;
        });
    }

    public String getStringById(int id) {
        return id >= 0 && id < idToString.size() ? idToString.get(id) : null;
    }

    // ========================
    // Clearing the whole index
    // ========================

    /**
     * Clears all data: empties the Trie and resets the ID mapping.
     * Guarded by synchronized, but should not run concurrently with add():
     * getId() takes the map's internal lock and then this monitor, while clear()
     * takes them in the opposite order.
     */
    public synchronized void clear() {
        this.root.children.clear();
        if (this.root.bitmap != null) {
            this.root.bitmap.clear();
        }
        this.stringToId.clear();
        this.idToString.clear();
        this.nextId = 0;
        log("Inverted index cleared");
    }

    /**
     * Adds several strings at once (for example, a list of subjects).
     *
     * @param strings the strings to index
     */
    public void addAll(Collection<String> strings) {
        if (strings == null || strings.isEmpty()) return;
        for (String str : strings) {
            add(str);
        }
        log("Added " + strings.size() + " strings to the inverted index");
    }

    // ========================
    // Building the index
    // ========================

    /**
     * Adds a text (for example, a subject) and binds it to its ID.
     */
    public void add(String text) {
        if (text == null || text.length() < 2) return;
        int id = getId(text);
        for (int i = 0; i <= text.length() - 2; i++) {
            for (int j = i + 2; j <= text.length(); j++) {
                String substr = text.substring(i, j);
                insertSubstring(substr, id);
            }
        }
    }

    private void insertSubstring(String substr, int id) {
        TrieNode node = root;
        for (char c : substr.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        // RoaringBitmap is not thread-safe, so create and mutate it under the node's lock
        synchronized (node) {
            if (node.bitmap == null) {
                node.bitmap = new RoaringBitmap();
            }
            node.bitmap.add(id);
        }
    }

    // ========================
    // Query matching
    // ========================

    /**
     * Looks up every substring of length >= 2 in the question and returns
     * the set of matching subject IDs.
     */
    public Set<Integer> search(String question) {
        if (question == null || question.length() < 2) {
            return Collections.emptySet();
        }
        Set<Integer> result = ConcurrentHashMap.newKeySet();
        for (int i = 0; i <= question.length() - 2; i++) {
            for (int j = i + 2; j <= question.length(); j++) {
                String substr = question.substring(i, j);
                RoaringBitmap ids = searchSubstring(substr);
                if (ids != null && !ids.isEmpty()) {
                    IntIterator iter = ids.getIntIterator();
                    while (iter.hasNext()) {
                        result.add(iter.next());
                    }
                }
            }
        }
        return result;
    }

    private RoaringBitmap searchSubstring(String substr) {
        TrieNode node = root;
        for (char c : substr.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return null;
        }
        synchronized (node) {
            // Return a snapshot so the caller can iterate without holding the lock
            return node.bitmap == null ? null : node.bitmap.clone();
        }
    }

    // ========================
    // Removal
    // ========================

    public void remove(String text) {
        if (text == null || text.length() < 2) return;
        Integer id = stringToId.get(text);
        if (id == null) return;
        for (int i = 0; i <= text.length() - 2; i++) {
            for (int j = i + 2; j <= text.length(); j++) {
                String substr = text.substring(i, j);
                removeSubstring(substr, id);
            }
        }
        stringToId.remove(text);
        // Optional: idToString.set(id, null) to mark the slot as empty
    }

    private void removeSubstring(String substr, int id) {
        TrieNode node = root;
        for (char c : substr.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return;
        }
        synchronized (node) {
            if (node.bitmap != null) {
                node.bitmap.remove(id);
                // Empty nodes could be reclaimed (needs parent pointers); omitted here
            }
        }
    }

    // ========================
    // Statistics
    // ========================

    public int size() {
        return stringToId.size();
    }

    /** Rough estimate only: assumes about 100 bytes per node and 2 bytes per key character. */
    public long getMemoryEstimateKB() {
        return countNodes(root) * 100L / 1024
                + stringToId.keySet().stream().mapToInt(String::length).sum() * 2L / 1024;
    }

    private long countNodes(TrieNode node) {
        if (node == null) return 0;
        long count = 1;
        for (TrieNode child : node.children.values()) {
            count += countNodes(child);
        }
        return count;
    }

    // ========================
    // Trie node
    // ========================

    private static class TrieNode {
        ConcurrentMap<Character, TrieNode> children = new ConcurrentHashMap<>(4);
        volatile RoaringBitmap bitmap; // volatile makes the reference visible across threads
    }

    // ========================
    // Debug logging (optional)
    // ========================

    private void log(String msg) {
        System.out.println("[CompactInvertedIndex] " + msg);
        // Consider replacing with an SLF4J Logger:
        // log.debug(msg);
    }
}

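The CompactInvertedIndex code above is a typical substring inverted index built on a Trie plus RoaringBitmap. It combines the two techniques to match fragments of a query against a large set of entity names quickly.

The following explains in detail how it puts the two together:

✅ I. Overall design goal

The goal of the component is:

Given a user question, quickly find every knowledge-graph entity (such as "华为" or "小米手机") that shares a substring with it, and return the IDs of those entities.

To do this, it uses:

Trie: looks up each substring character by character from the root
RoaringBitmap: stores the entity IDs attached to each substring compactly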

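✅ II. Core structures

1. TrieNode: a node of the Trie

private static class TrieNode {
    ConcurrentMap<Character, TrieNode> children = new ConcurrentHashMap<>(4);
    volatile RoaringBitmap bitmap; // IDs of all strings that contain this substring
}

children: maps the current character to the next one (for example '华' → '为')
bitmap: when a substring (such as "小米") ends at this node, records the IDs of all original strings that contain it (such as "小米手机")

👉 Every path in the Trie spells a substring, and the bitmap at the end of the path holds the IDs of all texts containing that substring.

2. RoaringBitmap: compact storage for ID sets

Each TrieNode's bitmap is a RoaringBitmap holding the IDs of all strings that hit this substring.
Advantages:
- Small memory footprint (compressed storage)
- Fast add, remove, or (union) and and (intersection) operations
- Well suited to large data volumes. Note that RoaringBitmap itself is not thread-safe, so concurrent writers need external locking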

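✅ III. Building the index: the `add()` method (write path)

public void add(String text) {
    if (text == null || text.length() < 2) return;
    int id = getId(text); // assign an ID to each unique string
    for (int i = 0; i <= text.length() - 2; i++) {
        for (int j = i + 2; j <= text.length(); j++) {
            String substr = text.substring(i, j);
            insertSubstring(substr, id);
        }
    }
}

🔍 Key logic:

Each text (such as "华为手机") is split into all substrings of length ≥ 2:
- "华为"
- "为手"
- "手机"
- "华为手"
- "为手机"
- "华为手机"
Each substring is inserted into the Trie, and the text's ID is recorded in the bitmap of the node where the substring ends.
A text of n characters yields n(n−1)/2 such substrings (6 for n = 4), so the index grows quadratically with the length of each name.

👉 As a result, as long as the user's input contains any one of these substrings (such as "华为"), the entity "华为手机" is found immediately.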
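The decomposition can be checked in isolation with a small helper that mirrors the two nested loops of `add()`. The class `Substrings` and the `minLen` parameter are introduced here for illustration only; the original code hard-codes the minimum length 2.

```java
import java.util.ArrayList;
import java.util.List;

/** Enumerates substrings the same way the nested loops in add() do. */
final class Substrings {
    /**
     * Returns all substrings of text with length >= minLen (minLen must be >= 1),
     * ordered by start index and then by length.
     */
    static List<String> atLeast(String text, int minLen) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + minLen <= text.length(); i++) {
            for (int j = i + minLen; j <= text.length(); j++) {
                out.add(text.substring(i, j));
            }
        }
        return out;
    }
}
```

Calling `Substrings.atLeast("华为手机", 2)` produces the six substrings listed above, grouped by start position rather than by length. Because the loops work on Java `char` values, a character outside the Basic Multilingual Plane (stored as a surrogate pair) would be split across substrings; common Chinese characters are not affected.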

Insertion sketch (for the substring "华为"):

root
└── '华'
    └── '为' → TrieNode.bitmap.add(id_of_华为手机)

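✅ IV. Query matching: the `search()` method (read path)

public Set<Integer> search(String question) {
    if (question == null || question.length() < 2) {
        return Collections.emptySet();
    }
    Set<Integer> result = ConcurrentHashMap.newKeySet();
    for (int i = 0; i <= question.length() - 2; i++) {
        for (int j = i + 2; j <= question.length(); j++) {
            String substr = question.substring(i, j);
            RoaringBitmap ids = searchSubstring(substr);
            if (ids != null && !ids.isEmpty()) {
                // Merge the IDs of every matching substring into result
                IntIterator iter = ids.getIntIterator();
                while (iter.hasNext()) {
                    result.add(iter.next());
                }
            }
        }
    }
    return result;
}

🔍 Query logic:

The user's question (such as "华为销量", "Huawei sales") is likewise split into all substrings of length ≥ 2:
- "华为"
- "为销"
- "销量"
- "华为销"
- "为销量"
- "华为销量"
Each substring is looked up in the Trie.
If it is found, the bitmap of the matching node is read and all of its IDs are added to the result set.

👉 The final result is the set of IDs of all candidate entities that share at least one substring of length ≥ 2 with the question.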
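One possible refinement, not part of the original code: `search()` restarts from the root for every (i, j) pair, so a question of n characters performs O(n²) lookups of up to n steps each. Walking the trie once per start position visits the same nodes in O(n²) character steps overall and can stop as soon as a character has no child. The sketch below shows the idea under simplifying assumptions: it is single-threaded, uses `java.util.BitSet` in place of RoaringBitmap so it runs on the JDK alone, assigns a fresh ID on every `add` without deduplicating texts, and uses the hypothetical class name `SubstringIndex`.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Single-threaded substring index that walks the trie once per start position. */
final class SubstringIndex {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        BitSet ids; // IDs of texts containing the substring spelled by the path to this node
    }

    private final Node root = new Node();
    private final List<String> texts = new ArrayList<>(); // ID -> text

    /** Indexes every substring of length >= 2 and returns the ID assigned to text. */
    int add(String text) {
        int id = texts.size();
        texts.add(text);
        for (int i = 0; i + 2 <= text.length(); i++) {
            Node node = root;
            for (int j = i; j < text.length(); j++) {
                node = node.children.computeIfAbsent(text.charAt(j), k -> new Node());
                if (j - i + 1 >= 2) { // only paths of length >= 2 carry IDs
                    if (node.ids == null) {
                        node.ids = new BitSet();
                    }
                    node.ids.set(id);
                }
            }
        }
        return id;
    }

    /** Union of the IDs of all texts sharing a substring of length >= 2 with question. */
    BitSet search(String question) {
        BitSet result = new BitSet();
        for (int i = 0; i + 2 <= question.length(); i++) {
            Node node = root;
            for (int j = i; j < question.length(); j++) {
                node = node.children.get(question.charAt(j));
                if (node == null) {
                    break; // no longer substring starting at i can match either
                }
                if (node.ids != null) {
                    result.or(node.ids); // bitmap union also removes duplicates
                }
            }
        }
        return result;
    }

    /** Resolves an ID back to the original text. */
    String textOf(int id) {
        return texts.get(id);
    }
}
```

Accumulating into a bitmap with `or` keeps the result compressed and deduplicated; the original instead copies each ID into a `Set<Integer>`, which boxes every value. With the RoaringBitmap library the equivalent in-place union is the instance method `or(RoaringBitmap)`.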

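✅ V. Where Trie + RoaringBitmap pay off

Technique	Role	How it shows up in this code
Trie	Fast prefix matching without a full scan	Descends one character per level, locating a substring in O(m) time
RoaringBitmap	Compact storage of ID sets	Each node stores its IDs in a bitmap; duplicates across substrings are removed by the result Set in `search()`
Substring index	Higher recall for fuzzy matching	All substrings of length ≥ 2 are indexed, so incomplete user input still matches
Concurrency	Multi-threaded reads and writes	Uses `ConcurrentHashMap`, `volatile` and `synchronized`; because RoaringBitmap is not thread-safe, mutations of a node's bitmap must also happen under a lock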

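✅ VI. Performance highlights

Memory:
- IDs are stored in compressed RoaringBitmaps, estimated here at 5 to 10 times less memory than HashSet<Integer>. The saving is largest for nodes with many IDs; nodes holding a single ID gain little, and indexing every substring creates many such nodes.
Speed:
- A Trie lookup takes O(m) time, where m is the length of the substring.
- Substring lookups run sequentially in this code. They are independent of each other, so they could be parallelized if needed.
Dynamic updates:
- add, remove and clear are supported, which allows incremental cache updates.
Thread safety:
- ConcurrentHashMap, volatile and synchronized protect the map and ID structures. volatile only guarantees visibility of the bitmap reference, so changes to a bitmap's contents still need a lock.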

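✅ VII. A complete example

Suppose:

"华为手机" is added → assigned ID=0 (IDs start at 0 because nextId starts at 0)
"小米手机" is added → assigned ID=1

After building:

The "华为" node in the Trie has bitmap = {0}
The "小米" node has bitmap = {1}
The "手机" node has bitmap = {0, 1}

When the user enters "手机品牌" (phone brands):

The substring "手机" matches → returns IDs {0, 1}
The system then resolves "华为手机" and "小米手机" as candidates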

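✅ Summary: how does it combine Trie + RoaringBitmap?

The CompactInvertedIndex class combines Trie + RoaringBitmap as follows:
A Trie organizes all substrings of length ≥ 2, giving O(m) lookup per substring;
Each Trie node where a substring ends keeps a RoaringBitmap with the IDs of all original strings containing that substring;
A query splits the question into substrings, looks them up in the Trie, and merges the IDs of every matching bitmap into the candidate set;
Substring indexing gives high recall and bitmap storage keeps the ID sets compact, at the price of an index that grows quadratically with the length of each name.

🎯 This is a common "candidate recall" optimization in industrial NLP systems, well suited to knowledge graphs, search and AI word segmentation.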

--------------------------------------------------------------------------------

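The getId(String str) method is a key part of the CompactInvertedIndex inverted index. It implements an efficient, thread-safe, unique mapping from strings to integer IDs.

Let us go through what it does, the design idea behind it, and why it is written this way.

🔍 The code

private int getId(String str) {
    return stringToId.computeIfAbsent(str, k -> {
        int id;
        synchronized (this) {
            id = nextId++;
            while (idToString.size() <= id) {
                idToString.add(null);
            }
            idToString.set(id, k);
        }
        return id;
    });
}

✅ I. Overall function: string → integer ID mapping

Goal:

Assign every unique string (such as "华为手机") a unique integer ID (0, 1, 2, ...). All later operations use the ID instead of the string, which improves performance.

This is known as string interning or dictionary encoding.

Where it is used:

The Trie nodes do not store the full strings, only IDs (saves memory)
RoaringBitmap stores int IDs (efficient)

✅ II. Data structures

Variable	Type	Purpose
`stringToId`	`Map<String, Integer>`	String → ID mapping (the primary index)
`idToString`	`List<String>`	ID → string reverse mapping (used to resolve query results)
`nextId`	`int`	The next available ID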
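✅ III. Step-by-step explanation

Step 1: entry point

private int getId(String str) {

Takes a string str
Returns its integer ID

Step 2: `computeIfAbsent` gives lazy creation plus thread-safe deduplication

return stringToId.computeIfAbsent(str, k -> { ... });

📌 `computeIfAbsent(key, mappingFunction)`

This is an atomic operation provided by ConcurrentHashMap:

If str already has a mapping, the existing ID is returned directly
If not, the lambda runs to generate a new ID, which is then inserted

✅ Advantages:

No duplicate ID is assigned under high concurrency
No external lock is needed for the existence check

Step 3: enter the synchronized block

synchronized (this) {

Although the outer call is ConcurrentHashMap.computeIfAbsent, the lambda also modifies the shared variables nextId and idToString, so a lock is needed to make those changes atomic.

⚠️ Note: ConcurrentHashMap alone is not enough, because nextId++ and the idToString update must happen together atomically. The lambda runs while the map holds an internal lock for that key, so the work inside it should stay short.

Step 4: take the next ID

id = nextId++;

Uses an auto-increment allocation strategy
Starts at 0 and increases by one on each allocation

For example:

First call: id = 0, nextId becomes 1
Second call: id = 1, nextId becomes 2

Step 5: make sure the `idToString` list is long enough

while (idToString.size() <= id) {idToString.add(null);
}

👉 This prevents an out-of-bounds write to the list.

Because nextId++ and the list update run under the same lock, id normally equals idToString.size() on entry, so the loop runs exactly once and appends a single slot. The loop form is defensive:

If the list were ever shorter than id (say idToString.size() == 0 while id = 5)
Calling set(5, ...) directly would throw
So the list is first padded with null until size > 5

✅ Safe growth, no IndexOutOfBoundsException

Step 6: store the reverse mapping from ID to string

idToString.set(id, k);

Stores the string at position idToString[id] so that the original string can be looked up by ID later.

For example:

String entity = idToString.get(0); // returns "华为手机"

Step 7: leave the synchronized block and return the ID

}
return id;

Exits the synchronized block and returns the allocated ID.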

✅ IV. Full walkthrough

Suppose getId("华为手机") and getId("小米手机") are called in turn:

Step	Operation	`stringToId`	`idToString`	`nextId`
1	Call `getId("华为手机")`	`"华为手机" → 0`	`[0]="华为手机"`	1
2	Call `getId("小米手机")`	`"小米手机" → 1`	`[0]="华为手机", [1]="小米手机"`	2
3	Call `getId("华为手机")` again	(already present) returns 0 directly	unchanged	unchanged
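The walkthrough above can be reproduced with a simpler dictionary that guards both maps with a single lock. This is an alternative sketch, not the original design: the class `StringDictionary` is hypothetical, and it trades the lock-free read path of `ConcurrentHashMap` for code in which the next ID is always the current list size.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Dictionary encoding with one lock: string -> sequential int ID and back. */
final class StringDictionary {
    private final Map<String, Integer> stringToId = new HashMap<>();
    private final List<String> idToString = new ArrayList<>();

    /** Returns the existing ID for str, or assigns the next sequential ID. */
    synchronized int getId(String str) {
        Integer existing = stringToId.get(str);
        if (existing != null) {
            return existing;
        }
        int id = idToString.size(); // the next ID always equals the current list size
        idToString.add(str);
        stringToId.put(str, id);
        return id;
    }

    /** Returns the string for id, or null if the ID was never assigned. */
    synchronized String getStringById(int id) {
        return id >= 0 && id < idToString.size() ? idToString.get(id) : null;
    }

    /** Number of distinct strings seen so far. */
    synchronized int size() {
        return idToString.size();
    }
}
```

With this version, the three calls in the table return 0, 1 and 0 in that order. Appending to an `ArrayList` is amortized O(1), whereas `CopyOnWriteArrayList` copies its whole backing array on every write, so the single-lock variant may also load large dictionaries faster; which one is preferable depends on the ratio of reads to writes.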

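✅ V. Why this design? Summary of benefits

Property	How it is achieved	Benefit
Uniqueness	`computeIfAbsent`	The same string always returns the same ID
Thread safety	`ConcurrentHashMap + synchronized(this)`	Concurrent calls from multiple threads never assign duplicate IDs
Two-way lookup	`stringToId` + `idToString`	Supports ID ↔ string mapping in both directions
Performance	`int` stored instead of `String`	The Trie and bitmaps are faster and use less memory
Dynamic growth	Auto-increment ID + growable List	New strings can be added at any time, bounded by the int range and available memory; note that `CopyOnWriteArrayList` copies its backing array on every write, so bulk loading costs O(n²) copies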