Dirt-Tolerant JSON Handling for LLM Output: A Complete Approach from Cleaning to Strictification (with a Ready-to-Use Utility Class)
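When you wire a large language model's (LLM) responses into a backend, the JSON is often not clean enough:
- Wrapped in code fences (```json ... ```) or mixed with explanatory text
- Single quotes, unquoted field names, trailing commas, comments
- Truncated: a missing closing } or ], or an unterminated string
- Mismatched backslash-escaping levels, smart quotes, odd whitespace characters, and so on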
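This article presents a Java utility class, LLMJsonNormalizer, that can be copied in and used directly. Its goals:
- Clean: strip fences and explanatory text, normalize line endings and quotes, remove the BOM, and so on
- Extract: pull the first complete {...} or [...] out of a longer text
- Repair (optional): conservatively complete damage that can be determined unambiguously, such as tail truncation
- Lenient parsing: tolerate JSON5-style input (single quotes, trailing commas, comments, and so on)
- Strict output: always end up with valid standard JSON (compact or pretty-printed)
- Selectable policy: strict parsing that throws on failure (fail-fast) is the recommended default; best-effort must be enabled explicitly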
Dependencies (Jackson 2.15+ recommended)
<dependencies>
  <dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-core</artifactId>
    <version>2.15.3</version>
  </dependency>
  <dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.15.3</version>
  </dependency>
</dependencies>
Complete working code (single file, copy and use)
package ai.util;

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.json.JsonReadFeature;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * "Dirt-tolerant" JSON handling for LLM output.
 *
 * Two policies:
 *   1) STRICT_THROW: strict parsing; throws on failure (recommended default)
 *   2) BEST_EFFORT: clean / repair / parse leniently, then strictify
 *
 * Capabilities:
 *   - Preprocessing: strip BOM, normalize line endings, smart quotes -> plain quotes, clean invisible whitespace
 *   - Extraction: pull the JSON fragment out of ```json ... ``` / """...""" / `...` / long text
 *   - Repair (conservative): for tail truncation, add the missing quote / } ] and handle a dangling backslash
 *   - Lenient parsing: single quotes, unquoted field names, comments, trailing commas, control characters, non-numeric numbers
 *   - Light fixes: trailing-comma cleanup, 'key' -> "key", stripping explanatory prefix/suffix, easing double escaping
 *   - Strictification: round-trip through a strict ObjectMapper to emit valid JSON
 */
public final class LLMJsonNormalizer {

    /** Parsing policy. */
    public enum ParsePolicy {
        STRICT_THROW, // accept strict JSON only; throw immediately on failure (fail-fast)
        BEST_EFFORT   // clean / repair / parse leniently to get a result, then strictify
    }

    private static final ObjectMapper LENIENT_MAPPER = new ObjectMapper(
            JsonFactory.builder()
                    .configure(JsonReadFeature.ALLOW_UNQUOTED_FIELD_NAMES.mappedFeature(), true)
                    .configure(JsonReadFeature.ALLOW_SINGLE_QUOTES.mappedFeature(), true)
                    .configure(JsonReadFeature.ALLOW_JAVA_COMMENTS.mappedFeature(), true)
                    .configure(JsonReadFeature.ALLOW_TRAILING_COMMA.mappedFeature(), true)
                    .configure(JsonReadFeature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER.mappedFeature(), true)
                    .configure(JsonReadFeature.ALLOW_NON_NUMERIC_NUMBERS.mappedFeature(), true)
                    .configure(JsonReadFeature.ALLOW_UNESCAPED_CONTROL_CHARS.mappedFeature(), true)
                    .build());

    private static final ObjectMapper STRICT_MAPPER = new ObjectMapper();

    private static final Pattern FENCED_BLOCK =
            Pattern.compile("```(?:json)?\\s*([\\s\\S]*?)\\s*```", Pattern.CASE_INSENSITIVE);
    private static final Pattern TRIPLE_QUOTES =
            Pattern.compile("^\\s*\"\"\"([\\s\\S]*?)\"\"\"\\s*$");
    private static final Pattern SINGLE_BACKTICK =
            Pattern.compile("^\\s*`([\\s\\S]*?)`\\s*$");

    private LLMJsonNormalizer() {}

    /* ======================== Public entry points ======================== */

    /** Single entry point: parse according to the policy and return a JsonNode. */
    public static JsonNode parse(String text, ParsePolicy policy) {
        if (text == null) throw new IllegalArgumentException("Input is null");
        if (policy == ParsePolicy.STRICT_THROW) {
            // Only the lightest preprocessing and fence stripping; no structural repair
            String s = preprocess(text);
            String fromFence = extractFromFencedBlocks(s);
            String candidate = fromFence != null ? fromFence : s;
            try {
                return STRICT_MAPPER.readTree(candidate);
            } catch (com.fasterxml.jackson.core.JsonProcessingException e) {
                var loc = e.getLocation();
                String msg = String.format("Strict parsing failed: %s (line %d, column %d)",
                        e.getOriginalMessage(),
                        loc == null ? -1 : loc.getLineNr(),
                        loc == null ? -1 : loc.getColumnNr());
                throw new IllegalArgumentException(msg, e);
            } catch (Exception e) {
                throw new IllegalArgumentException("Strict parsing failed: " + e.getMessage(), e);
            }
        } else { // BEST_EFFORT
            return parseBestEffort(text);
        }
    }

    /** Parse and emit compact JSON (single line). */
    public static String toCompactJson(String text, ParsePolicy policy) {
        JsonNode n = parse(text, policy);
        try {
            return STRICT_MAPPER.writeValueAsString(n);
        } catch (Exception e) {
            throw new IllegalArgumentException("Serialization failed: " + e.getMessage(), e);
        }
    }

    /** Parse and emit pretty-printed JSON (two-space indentation). */
    public static String toPrettyJson(String text, ParsePolicy policy) {
        JsonNode n = parse(text, policy);
        try {
            return STRICT_MAPPER.writerWithDefaultPrettyPrinter().writeValueAsString(n);
        } catch (Exception e) {
            throw new IllegalArgumentException("Formatting failed: " + e.getMessage(), e);
        }
    }

    /** Convenience: strict mode. */
    public static JsonNode parseStrictOrThrow(String text) {
        return parse(text, ParsePolicy.STRICT_THROW);
    }

    /** Convenience: best-effort mode. */
    public static JsonNode parseBestEffortSafe(String text) {
        return parse(text, ParsePolicy.BEST_EFFORT);
    }

    /** Convert to a Map. */
    @SuppressWarnings("unchecked")
    public static Map<String, Object> toMap(String text, ParsePolicy policy) {
        JsonNode node = parse(text, policy);
        return STRICT_MAPPER.convertValue(node, Map.class);
    }

    /** Convert to a List (the root must be an array). */
    @SuppressWarnings("unchecked")
    public static List<Object> toList(String text, ParsePolicy policy) {
        JsonNode node = parse(text, policy);
        if (!node.isArray()) throw new IllegalArgumentException("Root node is not an array.");
        return STRICT_MAPPER.convertValue(node, List.class);
    }

    /* ====================== BEST_EFFORT pipeline ====================== */

    private static JsonNode parseBestEffort(String raw) {
        String s = preprocess(raw);

        // First try the content of a fence / triple quotes / backticks
        String fromFence = extractFromFencedBlocks(s);
        if (fromFence != null) {
            JsonNode n = tryParsePipeline(fromFence);
            if (n != null) return n;
        }

        // Then try the whole text
        JsonNode n = tryParsePipeline(s);
        if (n != null) return n;

        // Finally pull the first complete JSON value out of the long text
        String extracted = extractFirstJsonObjectOrArray(s);
        if (extracted != null) {
            n = tryParsePipeline(extracted);
            if (n != null) return n;
        }
        throw new IllegalArgumentException("Could not parse valid JSON from the text.");
    }

    /** Pipeline: repair -> lenient parse -> light fixes -> lenient parse again -> strictify. */
    private static JsonNode tryParsePipeline(String s) {
        // Damage repair (conservative completion of tail truncation only)
        String repaired = repairBrokenJson(s);

        // A. Lenient parse
        JsonNode node = tryLenientParse(repaired);
        if (node != null) return toStrictNode(node);

        // B. Light fixes (trailing commas, single quotes, double escaping, explanatory prefix/suffix)
        String fixed = lightFixes(repaired);

        // C. Lenient parse again
        node = tryLenientParse(fixed);
        if (node != null) return toStrictNode(node);

        // D. Last resort: strict parse (in case the text is already strict JSON)
        try {
            return STRICT_MAPPER.readTree(fixed);
        } catch (Exception ignored) {
            // fall through
        }
        return null;
    }

    private static JsonNode tryLenientParse(String s) {
        try {
            return LENIENT_MAPPER.readTree(s);
        } catch (Exception e) {
            return null;
        }
    }

    private static JsonNode toStrictNode(JsonNode n) {
        try {
            String compact = STRICT_MAPPER.writeValueAsString(n);
            return STRICT_MAPPER.readTree(compact);
        } catch (Exception e) {
            return n; // fallback for rare cases
        }
    }

    /* ========================= Preprocessing / extraction ========================= */

    private static String preprocess(String raw) {
        // Strip the BOM
        byte[] bytes = raw.getBytes(StandardCharsets.UTF_8);
        if (bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF) {
            raw = new String(bytes, 3, bytes.length - 3, StandardCharsets.UTF_8);
        }
        // Normalize line endings, smart quotes -> plain quotes, remove odd whitespace
        return raw.replace("\r\n", "\n").replace("\r", "\n")
                .replace('\u201C', '"').replace('\u201D', '"')   // curly double quotes
                .replace('\u2018', '\'').replace('\u2019', '\'') // curly single quotes
                .replace("\u00A0", " ")
                .replace("\u200B", "")
                .trim();
    }

    private static String extractFromFencedBlocks(String s) {
        Matcher m = FENCED_BLOCK.matcher(s);
        if (m.find()) return m.group(1).trim();
        Matcher m2 = TRIPLE_QUOTES.matcher(s);
        if (m2.find()) return m2.group(1).trim();
        Matcher m3 = SINGLE_BACKTICK.matcher(s);
        if (m3.find()) return m3.group(1).trim();
        return null;
    }

    /** Bracket matching: pull the first complete JSON value (object or array) out of a long text. */
    private static String extractFirstJsonObjectOrArray(String s) {
        int n = s.length();
        Deque<Character> stack = new ArrayDeque<>();
        boolean inString = false;
        char quote = 0;
        boolean esc = false;
        int start = -1;
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            if (inString) {
                if (esc) { esc = false; }
                else if (c == '\\') { esc = true; }
                else if (c == quote) { inString = false; }
                continue;
            } else if (!stack.isEmpty() && (c == '"' || c == '\'')) {
                // Track strings only inside a candidate; otherwise an apostrophe in the
                // surrounding prose (e.g. "Here's") would swallow the rest of the text.
                inString = true;
                quote = c;
                continue;
            }
            if (c == '{' || c == '[') {
                if (stack.isEmpty()) start = i;
                stack.push(c);
            } else if (c == '}' || c == ']') {
                if (!stack.isEmpty()) {
                    char open = stack.peek();
                    if ((open == '{' && c == '}') || (open == '[' && c == ']')) {
                        stack.pop();
                        if (stack.isEmpty() && start >= 0) {
                            return s.substring(start, i + 1);
                        }
                    } else {
                        // Mismatch: reset and keep looking for the next candidate
                        stack.clear();
                        start = -1;
                    }
                }
            }
        }
        return null;
    }

    /* ====================== Damage repair (conservative) ====================== */

    /** Conservative repair of tail truncation: close the string, add } ], handle a dangling backslash. */
    private static String repairBrokenJson(String s) {
        if (s == null || s.isEmpty()) return s;
        String t = s;
        // Drop leftover backticks / extra whitespace at the end
        t = t.replaceAll("`+$", "").replaceAll("\\s+\\z", "");
        // Dangling backslash: append a space so it does not escape the closing quote
        if (t.endsWith("\\")) t = t + " ";
        // String left open at the end of the text
        t = closeDanglingStringIfAny(t);
        // Append the missing } ] in reverse order of opening
        t = balanceBracketsAtEnd(t);
        return t;
    }

    /**
     * Single pass that remembers which quote opened the current string, so an apostrophe
     * inside a double-quoted string (or vice versa) is not mistaken for a new string.
     */
    private static String closeDanglingStringIfAny(String s) {
        boolean in = false, esc = false;
        char quote = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (in) {
                if (esc) { esc = false; }
                else if (c == '\\') { esc = true; }
                else if (c == quote) { in = false; }
            } else if (c == '"' || c == '\'') {
                in = true;
                quote = c;
            }
        }
        if (in) {
            return s.endsWith("\\") ? s + " " + quote : s + quote;
        }
        return s;
    }

    private static String balanceBracketsAtEnd(String s) {
        Deque<Character> stack = new ArrayDeque<>();
        boolean inStr = false, esc = false;
        char q = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (inStr) {
                if (esc) { esc = false; }
                else if (c == '\\') { esc = true; }
                else if (c == q) { inStr = false; }
                continue;
            } else if (c == '"' || c == '\'') {
                inStr = true;
                q = c;
                continue;
            }
            if (c == '{' || c == '[') {
                stack.push(c);
            } else if (c == '}' || c == ']') {
                if (!stack.isEmpty()) {
                    char open = stack.peek();
                    if ((open == '{' && c == '}') || (open == '[' && c == ']')) {
                        stack.pop();
                    }
                }
            }
        }
        if (stack.isEmpty()) return s;
        StringBuilder tail = new StringBuilder();
        while (!stack.isEmpty()) {
            char open = stack.pop();
            tail.append(open == '{' ? '}' : ']');
        }
        return s + tail;
    }

    /* ======================== Light fixes / cleanup ======================== */

    private static String lightFixes(String s) {
        String t = s;
        // Double escaping (when the whole text looks over-escaped, peel off one level)
        if (looksLikeDoubleEscaped(t)) {
            t = t.replace("\\\"", "\"").replace("\\\\", "\\");
        }
        // Trailing commas: {"a":1,} / [1,2,]
        t = t.replaceAll(",\\s*(\\})", "$1");
        t = t.replaceAll(",\\s*(\\])", "$1");
        // 'key' -> "key"; : 'value' -> : "value"
        if (t.contains("{") || t.contains("[")) {
            t = t.replaceAll("(?<!\\\\)'([A-Za-z0-9_\\-\\.]+)'\\s*:", "\"$1\":");
            t = t.replaceAll(":\\s*'([^'\\\\]*(?:\\\\.[^'\\\\]*)*)'", ": \"$1\"");
        }
        // Strip explanatory text: keep only from the first { or [ to the last } or ]
        t = stripNonJsonPrefixSuffix(t);
        return t.trim();
    }

    private static boolean looksLikeDoubleEscaped(String s) {
        long q = countOf(s, "\\\""), slash = countOf(s, "\\\\");
        return q >= 2 && slash >= 2 && q + slash > 6;
    }

    private static long countOf(String s, String sub) {
        long c = 0;
        int i = 0;
        while ((i = s.indexOf(sub, i)) >= 0) {
            c++;
            i += sub.length();
        }
        return c;
    }

    private static String stripNonJsonPrefixSuffix(String s) {
        String t = s.replaceFirst("^[\\s\\S]{0,200}?(?=\\{|\\[)", "");
        int lastObj = t.lastIndexOf('}');
        int lastArr = t.lastIndexOf(']');
        int cut = Math.max(lastObj, lastArr);
        if (cut > 0 && cut < t.length() - 1) t = t.substring(0, cut + 1);
        return t;
    }
}
Usage examples
String llmOutput = """
        Here is the result (may include explanations): ```json
        {'name': 'Alice', // this is a comment
         age: 23, skills: ['Java','LLM',],}```
        """;

// 1) Recommended default: strict mode (throws on failure)
try {
    String strict = LLMJsonNormalizer.toPrettyJson(llmOutput, LLMJsonNormalizer.ParsePolicy.STRICT_THROW);
    System.out.println(strict);
} catch (IllegalArgumentException e) {
    // Log the error and handle it
    System.err.println(e.getMessage());
}

// 2) Best-effort mode: clean / repair / parse leniently, then strictify
String best = LLMJsonNormalizer.toPrettyJson(llmOutput, LLMJsonNormalizer.ParsePolicy.BEST_EFFORT);
System.out.println(best);
// {
//   "name" : "Alice",
//   "age" : 23,
//   "skills" : [ "Java", "LLM" ]
// }

// 3) Truncated JSON (missing closing braces)
String broken = "{ 'user': {'name': 'Li', 'skills': ['Java','LLM'] ";
System.out.println(LLMJsonNormalizer.toCompactJson(broken, LLMJsonNormalizer.ParsePolicy.BEST_EFFORT));
// -> {"user":{"name":"Li","skills":["Java","LLM"]}}

Map<String, Object> map = LLMJsonNormalizer.toMap(best, LLMJsonNormalizer.ParsePolicy.BEST_EFFORT);
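Key points and practical strategy
1) Policy separation: strict by default, repair only when explicitly enabled
- STRICT_THROW: recommended for core production paths; strong contract, observable, no silent tampering. It still runs preprocess and fence stripping, but performs no structural repair.
- BEST_EFFORT: suitable for canary tooling, operations back offices, log analysis, or UI display, where the output mainly needs to be readable by humans.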
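One way to keep "repair only when explicitly enabled" visible to callers is to return a flag alongside the value. The sketch below is an assumption layered on top of the article: PolicyGate and Outcome are hypothetical names, and the two parsers are passed in as plain functions so the example does not depend on Jackson.

```java
import java.util.function.Function;

/** Hypothetical wrapper; not part of LLMJsonNormalizer. */
final class PolicyGate {
    private PolicyGate() {}

    /** Parsed value plus a flag telling the caller whether repair was involved. */
    static final class Outcome<T> {
        final T value;
        final boolean repaired;

        Outcome(T value, boolean repaired) {
            this.value = value;
            this.repaired = repaired;
        }
    }

    /**
     * Tries the strict parser first. Falls back to the best-effort parser only when
     * allowRepair is true; otherwise the strict failure is rethrown (fail-fast).
     */
    static <T> Outcome<T> parse(String text, boolean allowRepair,
                                Function<String, T> strict,
                                Function<String, T> bestEffort) {
        try {
            return new Outcome<>(strict.apply(text), false);
        } catch (IllegalArgumentException e) {
            if (!allowRepair) throw e;
            return new Outcome<>(bestEffort.apply(text), true);
        }
    }

    public static void main(String[] args) {
        // Stand-in parsers: Integer.parseInt plays the strict role, trimming first plays best-effort.
        Outcome<Integer> o = parse(" 42 ", true, Integer::parseInt, s -> Integer.parseInt(s.trim()));
        System.out.println(o.value + " repaired=" + o.repaired); // 42 repaired=true
    }
}
```

With the class from this article, the two functions would be LLMJsonNormalizer::parseStrictOrThrow and LLMJsonNormalizer::parseBestEffortSafe; the repaired flag can then be logged or used to keep repaired data out of persistent storage.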
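2) Repair boundary: only fix tail truncation
- repairBrokenJson conservatively completes only the end of the text: it closes an open string, adds the missing } / ], and handles a dangling \.
- It does not try to guess mid-document structure (for example a missing comma or a half-lost key/value pair), to avoid introducing false facts.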
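To make this boundary concrete, here is a condensed, standalone sketch of tail-only repair. TailRepairSketch and closeTail are hypothetical names; the sketch does the string and bracket bookkeeping in a single pass and is meant as an illustration of the idea, not as a replacement for repairBrokenJson.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical standalone sketch of tail-only repair. */
final class TailRepairSketch {
    private TailRepairSketch() {}

    /** Closes an unterminated string and any still-open { or [ at the end of the text. */
    static String closeTail(String s) {
        Deque<Character> open = new ArrayDeque<>();
        boolean inString = false, escaped = false;
        char quote = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (inString) {
                if (escaped) escaped = false;
                else if (c == '\\') escaped = true;
                else if (c == quote) inString = false;
            } else if (c == '"' || c == '\'') {
                inString = true;
                quote = c; // remember which quote opened the string
            } else if (c == '{' || c == '[') {
                open.push(c);
            } else if ((c == '}' || c == ']') && !open.isEmpty()) {
                char top = open.peek();
                if ((top == '{' && c == '}') || (top == '[' && c == ']')) open.pop();
            }
        }
        StringBuilder out = new StringBuilder(s);
        if (inString) {
            // A dangling backslash would escape the quote we add; the escaped space that
            // results is only accepted by a lenient parser that allows arbitrary escapes.
            if (escaped) out.append(' ');
            out.append(quote);
        }
        while (!open.isEmpty()) out.append(open.pop() == '{' ? '}' : ']');
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(closeTail("{\"a\": \"it's\", \"b\": [1, 2")); // {"a": "it's", "b": [1, 2]}
        System.out.println(closeTail("{\"name\": \"Al"));                // {"name": "Al"}
        System.out.println(closeTail("{\"a\" 1}"));                      // {"a" 1}  (mid-structure damage is left alone)
    }
}
```

The last call shows the other side of the boundary: a missing colon in the middle of the document is not touched, so the parser still rejects it and the failure surfaces to the caller.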
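3) Lenient parsing + strictification
- LENIENT_MAPPER enables JSON5-style tolerance.
- After a successful parse, the result is round-tripped through STRICT_MAPPER, so the output is valid standard JSON.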
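4) Extraction algorithm: bracket matching
- When the model buries JSON in a long text, string-aware bracket matching locates the first complete {...} or [...]; this is more robust than a pure regex.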
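The difference from a regex shows up as soon as a brace appears inside a string value. The following sketch contrasts a greedy and a lazy regex with a simplified matcher; ExtractionContrast is a hypothetical name, and the matcher is deliberately reduced to objects and double-quoted strings to keep the comparison short.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical demo: regex extraction versus string-aware brace matching. */
final class ExtractionContrast {
    private ExtractionContrast() {}

    /** Returns the first regex match, or null if there is none. */
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    /** Depth-counting scan for the first balanced {...}; braces inside double-quoted strings are ignored. */
    static String firstBalancedObject(String text) {
        int depth = 0, start = -1;
        boolean inString = false, escaped = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inString) {
                if (escaped) escaped = false;
                else if (c == '\\') escaped = true;
                else if (c == '"') inString = false;
            } else if (c == '"' && depth > 0) {
                inString = true; // only track strings inside a candidate object
            } else if (c == '{') {
                if (depth++ == 0) start = i;
            } else if (c == '}' && depth > 0) {
                if (--depth == 0) return text.substring(start, i + 1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String text = "Result: {\"note\": \"use } carefully\"} and later {\"x\": 1}";
        System.out.println(firstMatch("\\{[\\s\\S]*\\}", text));  // greedy: runs from the first { to the last }
        System.out.println(firstMatch("\\{[\\s\\S]*?\\}", text)); // lazy: {"note": "use }
        System.out.println(firstBalancedObject(text));            // {"note": "use } carefully"}
    }
}
```

The greedy pattern glues two objects and the prose between them together, the lazy one stops at the brace inside the string, and only the string-aware scan returns a parseable object.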
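5) Risk control for light fixes
- lightFixes makes only the minimum necessary corrections: trailing commas, single-quoted keys and values, double escaping, and stripping explanatory text. These fixes are regex-based and not string-aware, which is why they run only after the first lenient parse has failed.
- If parsing still fails, throw an exception and let the caller decide (retry, fall back, or prompt).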
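The risk is easy to reproduce: the trailing-comma regex used in lightFixes also matches a ",}" that happens to sit inside a string value. The sketch below shows that case next to a string-aware alternative; TrailingCommaSketch and stripTrailingCommas are hypothetical names, offered as one possible hardening rather than as part of the class above.

```java
/** Hypothetical sketch: trailing-comma removal that leaves string contents untouched. */
final class TrailingCommaSketch {
    private TrailingCommaSketch() {}

    /** Removes a comma only when it is outside a string and followed by optional whitespace and } or ]. */
    static String stripTrailingCommas(String s) {
        StringBuilder out = new StringBuilder(s.length());
        boolean inString = false, escaped = false;
        char quote = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (inString) {
                if (escaped) escaped = false;
                else if (c == '\\') escaped = true;
                else if (c == quote) inString = false;
            } else if (c == '"' || c == '\'') {
                inString = true;
                quote = c;
            } else if (c == ',') {
                int j = i + 1;
                while (j < s.length() && Character.isWhitespace(s.charAt(j))) j++;
                if (j < s.length() && (s.charAt(j) == '}' || s.charAt(j) == ']')) {
                    continue; // drop this comma, keep everything else
                }
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "{\"msg\":\"a,}\",\"n\":1,}";
        // Same pattern as in lightFixes: it also rewrites the string value "a,}" to "a}".
        System.out.println(input.replaceAll(",\\s*(\\})", "$1")); // {"msg":"a}","n":1}
        System.out.println(stripTrailingCommas(input));            // {"msg":"a,}","n":1}
    }
}
```

Both results parse, but the regex version has silently changed the data, which is exactly the kind of deviation the strict policy is meant to rule out.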
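When throwing immediately is the better fit
- Sensitive scenarios such as payments, permissions, and configuration
- Strong schema mapping (Avro / Protobuf / strict DTOs)
- CI tests and regression: treat a parse failure as a failing test case, exposing upstream quality problems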
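When best-effort is acceptable
- Mixed human/machine text (documents, chat, customer service)
- Display-first front ends (the data is treated as a candidate, not as fact to be persisted)
- Workflows with manual review or downstream validation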
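Monitoring and testing recommendations
- Logging: record the policy, the failure reason, the error position (line/column), and 3 lines of context around the failing line to ease troubleshooting.
- Regression corpus: collect typical dirty samples (trailing commas, single quotes, comments, truncation, fences, multiple escaping, smart quotes, invisible whitespace).
- Dual testing: run the same input under both STRICT_THROW and BEST_EFFORT and compare each against its expected behavior.
- Golden-data comparison: compare BEST_EFFORT output with human-labeled ground truth to assess the risk of deviations introduced by repair.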
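For the logging item, a small helper can turn a line/column pair into a readable excerpt. This is a minimal sketch under assumptions: ErrorContext and around are hypothetical names, line and column are taken to be 1-based (as reported in the strict-mode error message), and radius = 1 yields the failing line plus one line on each side.

```java
/** Hypothetical helper: renders the lines around an error position with a caret marker. */
final class ErrorContext {
    private ErrorContext() {}

    /**
     * Returns up to `radius` lines before and after the 1-based `line`,
     * with a caret placed under the 1-based `column` of the failing line.
     */
    static String around(String text, int line, int column, int radius) {
        String[] lines = text.split("\n", -1);
        if (line < 1 || line > lines.length) return "";
        int from = Math.max(1, line - radius);
        int to = Math.min(lines.length, line + radius);
        StringBuilder sb = new StringBuilder();
        for (int n = from; n <= to; n++) {
            sb.append(String.format("%4d | %s\n", n, lines[n - 1]));
            if (n == line) {
                // 7 characters of gutter to line up with "%4d | "
                sb.append("     | ").append(" ".repeat(Math.max(0, column - 1))).append("^\n");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "{\n  \"a\": 1,\n  \"b\": oops\n}";
        System.out.print(around(text, 3, 8, 1));
    }
}
```

The parser's column is usually close to, but not always exactly on, the offending character, so the caret is best read as a hint rather than a precise pointer.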
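Possible extensions
- extractAllJsonBlocks(…): extract multiple JSON segments from a long text, repair and parse each one, and return a List<JsonNode>.
- Schema validation: follow parsing with JSON Schema validation; in strict flows, compare against the DTO.
- Per-key tolerance: allow filling in minimal default values or dropping unknown fields (display use only).
- Multilingual prompt cleanup: more precise prefix/suffix stripping for the explanation templates common in specific languages.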
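As a starting point for the first extension, the scan below collects every top-level balanced span instead of stopping at the first one. It is a sketch under assumptions: JsonBlockExtractor is a hypothetical name, and the method returns raw substrings (List<String>) so that it stays free of Jackson; each span would still have to go through the parsing pipeline to become a JsonNode.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of the extractAllJsonBlocks extension, returning raw spans. */
final class JsonBlockExtractor {
    private JsonBlockExtractor() {}

    /** Returns every top-level balanced {...} or [...] span, in order of appearance. */
    static List<String> extractAllJsonBlocks(String text) {
        List<String> blocks = new ArrayList<>();
        Deque<Character> stack = new ArrayDeque<>();
        boolean inString = false, escaped = false;
        char quote = 0;
        int start = -1;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inString) {
                if (escaped) escaped = false;
                else if (c == '\\') escaped = true;
                else if (c == quote) inString = false;
            } else if (!stack.isEmpty() && (c == '"' || c == '\'')) {
                inString = true; // quotes in the surrounding prose are ignored
                quote = c;
            } else if (c == '{' || c == '[') {
                if (stack.isEmpty()) start = i;
                stack.push(c);
            } else if ((c == '}' || c == ']') && !stack.isEmpty()) {
                char open = stack.peek();
                if ((open == '{' && c == '}') || (open == '[' && c == ']')) {
                    stack.pop();
                    if (stack.isEmpty()) blocks.add(text.substring(start, i + 1));
                } else {
                    stack.clear(); // mismatched closer: abandon this candidate
                }
            }
        }
        return blocks; // a span still open at the end of the text is not returned
    }

    public static void main(String[] args) {
        String text = "It's done. First: {\"a\": [1, 2]} then [\"x}\", 'y'] and a broken {\"z\": 1";
        System.out.println(extractAllJsonBlocks(text)); // [{"a": [1, 2]}, ["x}", 'y']]
    }
}
```

Bracketed prose such as a citation marker would also be returned as a candidate, so callers should treat the spans as candidates and keep only those that actually parse.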
With this utility class, you can keep the core path safe and controllable while keeping the LLM integration from being dragged down by dirty JSON: throw when you should throw, repair when you should repair.