[Word Parsing] A Complete Guide to Extracting Math Formulas from Word and Rendering Them on a Web Page
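I. Background
Due to a business requirement, I am currently implementing a feature that extracts the content of an exam paper, converting it from unstructured to structured data. While parsing a math exam paper, I found that it contains math formulas, which also had to be extracted. Displaying them on a web page turned out to be difficult, and it was something I had not run into before. It took some effort, but I eventually got it working, so I am writing it up to share.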
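II. Overview
Use Java to extract the formulas from a math exam paper in Word format. At this point the data is in OMML (Office Math Markup Language) format. Next, convert the OMML to MathML, then convert the MathML to LaTeX, and finally send the LaTeX to the front end, which renders it.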
III. Main Content
1. Parse the Word file with Java and output the math formulas in OMML format.
Word content (figure):
Java code that parses the file:
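import java.io.FileInputStream;
import java.io.IOException;
import org.apache.poi.xwpf.usermodel.IBodyElement;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMath;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMathPara;

public static void readMathFormulas2(String filePath) throws IOException {
    // Open the .docx file; try-with-resources closes the stream and the document
    try (FileInputStream fis = new FileInputStream(filePath);
         XWPFDocument document = new XWPFDocument(fis)) {
        for (IBodyElement element : document.getBodyElements()) {
            if (element instanceof XWPFParagraph) {
                XWPFParagraph paragraph = (XWPFParagraph) element;
                // Inline formulas: m:oMath elements placed directly in the paragraph
                for (CTOMath math : paragraph.getCTP().getOMathList()) {
                    System.out.println("Inline formula: " + math.xmlText());
                }
                // Display (block) formulas: m:oMathPara wrappers
                for (CTOMathPara mathPara : paragraph.getCTP().getOMathParaList()) {
                    System.out.println("Block formula: " + mathPara.xmlText());
                }
            }
        }
    }
}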
The extracted math formula in OMML format:
<m:oMath xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" xmlns:cx6="http://schemas.microsoft.com/office/drawing/2016/5/12/chartex" xmlns:cx7="http://schemas.microsoft.com/office/drawing/2016/5/13/chartex" xmlns:cx8="http://schemas.microsoft.com/office/drawing/2016/5/14/chartex" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:aink="http://schemas.microsoft.com/office/drawing/2016/ink" xmlns:am3d="http://schemas.microsoft.com/office/drawing/2017/model3d" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:oel="http://schemas.microsoft.com/office/2019/extlst" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w16cex="http://schemas.microsoft.com/office/word/2018/wordml/cex" xmlns:w16cid="http://schemas.microsoft.com/office/word/2016/wordml/cid" xmlns:w16="http://schemas.microsoft.com/office/word/2018/wordml" xmlns:w16du="http://schemas.microsoft.com/office/word/2023/wordml/word16du" 
xmlns:w16sdtdh="http://schemas.microsoft.com/office/word/2020/wordml/sdtdatahash" xmlns:w16sdtfl="http://schemas.microsoft.com/office/word/2024/wordml/sdtformatlock" xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"><m:f><m:fPr><m:ctrlPr><w:rPr><w:rFonts w:ascii="Cambria Math" w:hAnsi="宋体" w:cs="宋体"/><w:i/></w:rPr></m:ctrlPr></m:fPr><m:num><m:r><w:rPr><w:rFonts w:ascii="Cambria Math" w:hAnsi="宋体" w:cs="宋体"/></w:rPr><m:t>x</m:t></m:r></m:num><m:den><m:r><w:rPr><w:rFonts w:ascii="Cambria Math" w:hAnsi="宋体" w:cs="宋体"/></w:rPr><m:t>a</m:t></m:r></m:den></m:f><m:r><w:rPr><w:rFonts w:ascii="Cambria Math" w:hAnsi="宋体" w:cs="宋体"/></w:rPr><m:t>=</m:t></m:r><m:f><m:fPr><m:ctrlPr><w:rPr><w:rFonts w:ascii="Cambria Math" w:hAnsi="宋体" w:cs="宋体"/><w:i/></w:rPr></m:ctrlPr></m:fPr><m:num><m:r><w:rPr><w:rFonts w:ascii="Cambria Math" w:hAnsi="宋体" w:cs="宋体"/></w:rPr><m:t>y</m:t></m:r></m:num><m:den><m:r><w:rPr><w:rFonts w:ascii="Cambria Math" w:hAnsi="宋体" w:cs="宋体"/></w:rPr><m:t>b</m:t></m:r></m:den></m:f><m:r><w:rPr><w:rFonts w:ascii="Cambria Math" w:hAnsi="宋体" w:cs="宋体"/></w:rPr><m:t>=</m:t></m:r><m:f><m:fPr><m:ctrlPr><w:rPr><w:rFonts w:ascii="Cambria Math" w:hAnsi="宋体" w:cs="宋体"/><w:i/></w:rPr></m:ctrlPr></m:fPr><m:num><m:r><w:rPr><w:rFonts w:ascii="Cambria Math" w:hAnsi="宋体" w:cs="宋体"/></w:rPr><m:t>z</m:t></m:r></m:num><m:den><m:r><w:rPr><w:rFonts w:ascii="Cambria Math" w:hAnsi="宋体" w:cs="宋体"/></w:rPr><m:t>c</m:t></m:r></m:den></m:f><m:r><w:rPr><w:rFonts w:ascii="Cambria Math" w:hAnsi="宋体" w:cs="宋体"/></w:rPr><m:t>=k</m:t></m:r></m:oMath>
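The conversion step that follows reads its input from a file (F:/input.xml), while the parsing code above only prints the OMML. A minimal sketch of the missing link, assuming one formula per file; the class name OmmlFileWriter and the method writeOmml are hypothetical and not part of the original code:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not from the original post): stores one OMML fragment for the XSLT step. */
final class OmmlFileWriter {

    /** Writes the OMML string to the target file as UTF-8 and returns the path. */
    static Path writeOmml(String omml, Path target) throws IOException {
        // The fragment printed above declares its namespaces on the root element,
        // so it can be saved as a standalone XML document without further wrapping.
        Files.write(target, omml.getBytes(StandardCharsets.UTF_8));
        return target;
    }
}
```

For example, calling OmmlFileWriter.writeOmml(math.xmlText(), Paths.get("F:/input.xml")) inside the loop would produce the input file used in the next step.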
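2. Convert the OMML formula to MathML
The conversion is done in Java with an XSLT transformation, as in the code below. OMML2MML.xsl is the stylesheet distributed with Microsoft Office, and input.xml is assumed to hold the OMML output from step 1.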
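import java.io.File;
import javax.xml.transform.*;
import javax.xml.transform.stream.*;

// Load the OMML-to-MathML stylesheet
TransformerFactory factory = TransformerFactory.newInstance();
Transformer transformer = factory.newTransformer(new StreamSource(new File("F:/OMML2MML.xsl")));
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
// Transform the OMML file into a MathML file
transformer.transform(new StreamSource(new File("F:/input.xml")), new StreamResult(new File("F:/output.mml")));
System.out.println("Conversion complete; the output file is UTF-8 encoded.");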
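When formulas are processed one by one inside a service, going through files on disk may be inconvenient. The same transformation can be sketched in memory, as below. This is only an illustration: the class OmmlToMathMl and its transform method are hypothetical names, and the stylesheet is passed in as a parameter so that the caller decides where OMML2MML.xsl is loaded from.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper (not from the original post): string-to-string XSLT transformation. */
final class OmmlToMathMl {

    /** Applies the given stylesheet to the XML string and returns the result as a string. */
    static String transform(String xml, Source stylesheet) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer(stylesheet);
        // Leave out the <?xml ...?> header so the result can be embedded in other markup.
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

A call might look like OmmlToMathMl.transform(omml, new StreamSource(new File("F:/OMML2MML.xsl"))). Since a Transformer is rebuilt on every call here, compiling the stylesheet once into a Templates object would likely be preferable for large batches.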
The MathML generated by the conversion:
<?xml version="1.0" encoding="UTF-8"?><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"><mml:mfrac><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mi>k</mml:mi></mml:math>
3. Convert the MathML to LaTeX
Place the MathML above in an HTML file, as follows:
<!DOCTYPE html>
<html>
<head><meta charset="UTF-8"><title>MathML to LaTeX</title>
</head>
<body><!-- Insert your MathML content here -->
<?xml version="1.0" encoding="UTF-8"?><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"><mml:mfrac><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mi>k</mml:mi></mml:math>
</body>
</html>
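Building this HTML file by hand does not scale to many formulas. A small sketch of generating it from a MathML string follows; the class MathMlHtmlWrapper is a hypothetical name. As a precaution it drops a leading XML declaration, which has no meaning inside an HTML body; the hand-written example above leaves the declaration in place.

```java
/** Hypothetical helper (not from the original post): wraps MathML in a minimal HTML page. */
final class MathMlHtmlWrapper {

    /** Returns an HTML document whose body contains the given MathML. */
    static String wrap(String mathMl) {
        // Remove a leading <?xml ...?> declaration, if there is one.
        String body = mathMl.replaceFirst("^\\s*<\\?xml[^>]*\\?>\\s*", "");
        return "<!DOCTYPE html>\n"
                + "<html>\n"
                + "<head><meta charset=\"UTF-8\"><title>MathML to LaTeX</title></head>\n"
                + "<body>\n" + body + "\n</body>\n"
                + "</html>\n";
    }
}
```

The returned string can then be written to input.html as UTF-8.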
Use pandoc to convert the HTML to LaTeX:
pandoc -f html -t latex input.html -o output.tex
The generated LaTeX:
\(\frac{x}{a} = \frac{y}{b} = \frac{z}{c} = k\)
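Since the rest of the pipeline runs in Java, the pandoc command above can also be launched from Java code. The sketch below assumes that pandoc is installed and available on the PATH; the class PandocRunner and its methods are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not from the original post): runs the pandoc command shown above. */
final class PandocRunner {

    /** Builds the argument list: pandoc -f html -t latex <input> -o <output>. */
    static List<String> command(Path inputHtml, Path outputTex) {
        return Arrays.asList("pandoc", "-f", "html", "-t", "latex",
                inputHtml.toString(), "-o", outputTex.toString());
    }

    /** Starts pandoc, waits for it to finish, and returns its exit code (0 means success). */
    static int run(Path inputHtml, Path outputTex) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(inputHtml, outputTex))
                .inheritIO() // forward pandoc's messages to this process's console
                .start();
        return process.waitFor();
    }
}
```

Passing the arguments as a list avoids shell quoting problems when file paths contain spaces.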
4. Render the LaTeX formula on a front-end page
Rendered result (figure):
Rendering code:
<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.16.0/dist/katex.min.css">
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.0/dist/katex.min.js"></script>
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.0/dist/contrib/auto-render.min.js" onload="renderMathInElement(document.body);"></script>
</head>
<body> \(\frac{x}{a} = \frac{y}{b} = \frac{z}{c} = k\)
</body>
</html>