Extracting Automatic Numbering from Word Documents: A Detailed Walkthrough
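I. Background
Because of a business requirement, I am currently implementing a feature that extracts the content of an exam paper and converts it from unstructured to structured data. While parsing the papers, I found that some outline headings use Word's automatic numbering, and the usual way of reading paragraph text did not pick those numbers up (Word generates them from the numbering definitions rather than storing them in the paragraph text). It took some effort, but I got it working in the end, so I am writing it down to share.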
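II. Overview
Given an exam paper in Word format, the Java code first checks whether an XWPFParagraph has numbering and, if it does, extracts the automatic number. Automatic numbering comes in different types, such as Chinese numerals (一, 二, 三), Arabic numerals (1, 2, 3), or letters (a, b, c).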
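III. Implementation
1. First, implement a class that holds the automatic-numbering context:
NumberingContext registers the different numbering formats, such as uppercase letters, lowercase letters, and Chinese numerals, and keeps a counter per level for each list. See the code for details: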
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NumberingContext {
    // Counter state per list: numId key -> one counter per level
    private final Map<String, List<Integer>> numberingStates = new HashMap<>();

    // Numbering format mapping (can be extended with more formats)
    public static final Map<String, NumberFormatter> FORMATTERS = new HashMap<>();

    static {
        FORMATTERS.put("decimal", new DecimalFormatter());
        FORMATTERS.put("lowerLetter", new LowerLetterFormatter());
        FORMATTERS.put("upperLetter", new UpperLetterFormatter()); // newly added
        FORMATTERS.put("upperRoman", new UpperRomanFormatter());
        FORMATTERS.put("chineseCounting", new ChineseCountingFormatter());
    }

    public String resolveNumberText(String numIdKey, int ilvl, String format, String textTemplate) {
        List<Integer> counters = numberingStates.computeIfAbsent(numIdKey, k -> new ArrayList<>());

        // Make sure there is a counter for every level up to ilvl
        while (counters.size() <= ilvl) {
            counters.add(0);
        }

        // Increment the counter of the current level and reset all deeper levels
        counters.set(ilvl, counters.get(ilvl) + 1);
        for (int i = ilvl + 1; i < counters.size(); i++) {
            counters.set(i, 0);
        }

        // Look up the formatter, falling back to decimal for unknown formats
        NumberFormatter formatter = FORMATTERS.get(format);
        if (formatter == null) {
            formatter = new DecimalFormatter();
        }

        // Replace the placeholders (%1, %2, ...) in the template.
        // Note: every level is rendered with the current level's format.
        String result = textTemplate;
        for (int i = 0; i <= ilvl; i++) {
            String placeholder = "%" + (i + 1);
            if (i < counters.size()) {
                String replacement = formatter.format(counters.get(i));
                result = result.replace(placeholder, replacement);
            }
        }
        return result;
    }
}
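To see how the counters behave across levels, here is a minimal self-contained sketch of the same increment-and-reset logic. It assumes decimal formatting only; the class name CounterDemo, the method name next, and the templates "%1." and "%1.%2" are made up for illustration and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class: mirrors the counter handling of NumberingContext,
// but renders every level as a plain decimal number.
class CounterDemo {
    private final Map<String, List<Integer>> states = new HashMap<>();

    String next(String numIdKey, int ilvl, String template) {
        List<Integer> counters = states.computeIfAbsent(numIdKey, k -> new ArrayList<>());
        // Grow the list so that a counter exists for this level
        while (counters.size() <= ilvl) {
            counters.add(0);
        }
        // Increment this level, then reset every deeper level
        counters.set(ilvl, counters.get(ilvl) + 1);
        for (int i = ilvl + 1; i < counters.size(); i++) {
            counters.set(i, 0);
        }
        // Fill in %1 .. %(ilvl + 1)
        String result = template;
        for (int i = 0; i <= ilvl; i++) {
            result = result.replace("%" + (i + 1), String.valueOf(counters.get(i)));
        }
        return result;
    }

    public static void main(String[] args) {
        CounterDemo demo = new CounterDemo();
        System.out.println(demo.next("1", 0, "%1."));   // 1.
        System.out.println(demo.next("1", 1, "%1.%2")); // 1.1
        System.out.println(demo.next("1", 1, "%1.%2")); // 1.2
        System.out.println(demo.next("1", 0, "%1."));   // 2.
        System.out.println(demo.next("1", 1, "%1.%2")); // 2.1
    }
}
```

Because the state is keyed by numId, two separate lists in the same document keep independent counters, and a new top-level item restarts the sub-level numbering.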
2. Formatters for the different numbering formats
1. DecimalFormatter
public class DecimalFormatter implements NumberFormatter {
    @Override
    public String format(int number) {
        // Arabic numerals: 1, 2, 3, ...
        return String.valueOf(number);
    }
}
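2. LowerLetterFormatter
public class LowerLetterFormatter implements NumberFormatter {
    @Override
    public String format(int number) {
        // Guard the range so values past 'z' do not produce non-letter characters
        if (number < 1 || number > 26) {
            throw new IllegalArgumentException("the lowerLetter format only supports 1-26");
        }
        // 1 -> a, 2 -> b, ..., 26 -> z
        return String.valueOf((char) ('a' + number - 1));
    }
}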
3. UpperLetterFormatter
public class UpperLetterFormatter implements NumberFormatter {
    @Override
    public String format(int number) {
        if (number < 1 || number > 26) {
            throw new IllegalArgumentException("the upperLetter format only supports 1-26");
        }
        // 1 -> A, 2 -> B, ..., 26 -> Z
        return String.valueOf((char) ('A' + number - 1));
    }
}
The remaining formatters are omitted here.
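For readers who want a starting point for the omitted pieces, here is one possible sketch. It is not the author's code: the shape of the NumberFormatter interface is inferred from how the formatters above use it, the demo class FormatterDemo is hypothetical, and the supported ranges (1-99 for Chinese numerals, 1-3999 for Roman numerals) are limits chosen for this sketch rather than a statement of what Word itself does.

```java
// Hypothetical demo entry point, only here to make the sketch runnable.
class FormatterDemo {
    public static void main(String[] args) {
        NumberFormatter chinese = new ChineseCountingFormatter();
        NumberFormatter roman = new UpperRomanFormatter();
        for (int n : new int[] {1, 10, 11, 20, 21}) {
            System.out.println(n + " -> " + chinese.format(n) + " / " + roman.format(n));
        }
    }
}

// Assumed shape of the interface; the post never shows it.
interface NumberFormatter {
    String format(int number);
}

// Sketch for "chineseCounting", limited to 1-99.
class ChineseCountingFormatter implements NumberFormatter {
    private static final String[] DIGITS =
            {"", "一", "二", "三", "四", "五", "六", "七", "八", "九"};

    @Override
    public String format(int number) {
        if (number < 1 || number > 99) {
            throw new IllegalArgumentException("this sketch only supports 1-99");
        }
        int tens = number / 10;
        int ones = number % 10;
        StringBuilder sb = new StringBuilder();
        if (tens > 1) {
            sb.append(DIGITS[tens]); // 20-99 get a leading digit; 10-19 start with a bare 十
        }
        if (tens >= 1) {
            sb.append("十");
        }
        sb.append(DIGITS[ones]); // empty string when ones == 0
        return sb.toString();
    }
}

// Sketch for "upperRoman", limited to 1-3999.
class UpperRomanFormatter implements NumberFormatter {
    private static final int[] VALUES =
            {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
    private static final String[] SYMBOLS =
            {"M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I"};

    @Override
    public String format(int number) {
        if (number < 1 || number > 3999) {
            throw new IllegalArgumentException("this sketch only supports 1-3999");
        }
        StringBuilder sb = new StringBuilder();
        int remaining = number;
        // Greedily subtract the largest value that still fits
        for (int i = 0; i < VALUES.length; i++) {
            while (remaining >= VALUES[i]) {
                sb.append(SYMBOLS[i]);
                remaining -= VALUES[i];
            }
        }
        return sb.toString();
    }
}
```

Throwing on out-of-range input follows the same choice as UpperLetterFormatter above; a real implementation may prefer to fall back to decimal instead.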
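3. Next comes the core logic:
The getParagraphNumbering method extracts the automatic number from an XWPFParagraph. It follows the paragraph's numId to its abstract numbering definition, reads the format and text template of the paragraph's level, and passes them to NumberingContext.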
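import java.math.BigInteger;
import org.apache.poi.xwpf.usermodel.XWPFAbstractNum;
import org.apache.poi.xwpf.usermodel.XWPFNum;
import org.apache.poi.xwpf.usermodel.XWPFNumbering;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTAbstractNum;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTLvl;

public static String getParagraphNumbering(NumberingContext numberingContext, XWPFParagraph paragraph) {
    try {
        if (paragraph.getNumID() == null) {
            // The paragraph has no numbering
            return "";
        }
        BigInteger numId = paragraph.getNumID();
        BigInteger ilvl = paragraph.getNumIlvl() != null ? paragraph.getNumIlvl() : BigInteger.ZERO;

        // Use the numId as the key for this list's counter state
        String numIdKey = numId.toString();

        XWPFNumbering numbering = paragraph.getDocument().getNumbering();
        if (numbering == null) {
            // The document has no numbering definitions
            return "";
        }
        XWPFNum num = numbering.getNum(numId);
        if (num == null) {
            // The numbering definition does not exist
            return "";
        }

        // Find the abstract numbering definition that this numId points to
        BigInteger abstractNumId = num.getCTNum().getAbstractNumId().getVal();
        for (XWPFAbstractNum abstractNum : numbering.getAbstractNums()) {
            if (abstractNum.getCTAbstractNum().getAbstractNumId().equals(abstractNumId)) {
                CTAbstractNum ctAbstractNum = abstractNum.getCTAbstractNum();
                CTLvl ctLvl = ctAbstractNum.getLvlArray(ilvl.intValue());
                if (ctLvl != null && ctLvl.getNumFmt() != null && ctLvl.getLvlText() != null) {
                    String format = ctLvl.getNumFmt().getVal().toString(); // e.g. "decimal"
                    String text = ctLvl.getLvlText().getVal();             // e.g. "%1."
                    return numberingContext.resolveNumberText(numIdKey, ilvl.intValue(), format, text);
                }
            }
        }
        // The numbering level does not exist or its definition is incomplete
        return "";
    } catch (Exception e) {
        // Something went wrong while resolving the numbering
        return "";
    }
}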