Exporting LaTeX formulas to Word: converting LaTeX to MathML and using POI to export Word files with editable formulas

Posted by 蒲公英的狂想 on 2022-02-21

Background

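In an earlier post, "Using spire.doc to export standard-format Word documents with editable LaTeX formulas", I described generating Word files with spire.doc. Its API is admittedly convenient, but I kept hitting errors in practice. The ∑ summation formula threw exceptions, and \limits, \widehat, \sideset, \overline, \leqslant, \geqslant and \textcircled all caused trouble: parse failures that left formulas unrendered, a null pointer exception when setting the upper and lower limits of a sum, and so on. Converting to MathML first and passing that in produced the same failures, and I found no fix. One or two broken formulas can be shown as images, but with this many problems that was not a workable approach.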

Exporting LaTeX to Word with POI

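Getting LaTeX into Word with POI takes three stages: LaTeX → MathML (Mathematical Markup Language) → OMML (Office Math Markup Language, Word's native formula format)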

The LaTeX-to-MathML problem

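POI can work with MathML once it has been converted to OMML. Almost everything I generate is math exam papers, so I already had the LaTeX formulas, but they needed converting to MathML. I first planned to use the fmath trio of libraries. A gripe here: the download link on the official site is dead. A search turned up a copy on CSDN, which I had not visited in a long time, and one download cost me 50 points. Apparently every resource there now starts at 50 points, so CSDN is no longer a place where ordinary folks like me can hang around.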

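In my experiments, though, complex formulas exported through fmath occasionally displayed incorrectly in Word, possibly because the library is so old. On StackOverflow I saw someone recommend the snuggletex-core library, so I switched to it and tested it against a large collection of LaTeX math formulas.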

POM dependencies

<!-- https://mvnrepository.com/artifact/de.rototor.snuggletex/snuggletex-core -->
<dependency>
    <groupId>de.rototor.snuggletex</groupId>
    <artifactId>snuggletex-core</artifactId>
    <version>1.3.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.poi/poi -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>4.1.2</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.poi/ooxml-schemas -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>ooxml-schemas</artifactId>
    <version>1.4</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.poi/poi-ooxml -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>4.1.2</version>
</dependency>
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.11.0</version>
</dependency>

Converting LaTeX to MathML with snuggletex-core

Note: the LaTeX passed in here must be wrapped in dollar signs ($...$); otherwise the MathML conversion fails.

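import lombok.SneakyThrows;
import org.apache.poi.xwpf.usermodel.ParagraphAlignment;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMath;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import uk.ac.ed.ph.snuggletex.SnuggleEngine;
import uk.ac.ed.ph.snuggletex.SnuggleInput;
import uk.ac.ed.ph.snuggletex.SnuggleSession;

@SneakyThrows
public static void addLatex(String latex, XWPFParagraph paragraph) {
    paragraph.setAlignment(ParagraphAlignment.LEFT);
    paragraph.setFontAlignment(ParagraphAlignment.LEFT.getValue());
    // LaTeX -> MathML with SnuggleTeX
    SnuggleEngine engine = new SnuggleEngine();
    SnuggleSession session = engine.createSession();
    session.parseInput(new SnuggleInput(latex));
    String mathML = session.buildXMLString();
    // MathML -> OMML, then copy the result into a new oMath element of the paragraph
    CTOMath converted = getOMML(mathML);
    CTP ctp = paragraph.getCTP();
    CTOMath target = ctp.addNewOMath();
    target.set(converted);
}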
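
Since the converter expects dollar-wrapped input, a small guard in front of addLatex can save a failed conversion when a formula arrives bare. The following is only a sketch under my own assumptions: the class LatexDelimiters and the method ensureInlineMath are hypothetical names, not part of SnuggleTeX or POI, and the check is deliberately simple (it only looks at the first and last character).

```java
final class LatexDelimiters {
    private LatexDelimiters() {}

    /**
     * Returns the formula wrapped in single dollar signs.
     * Input that already starts and ends with an unescaped '$' is returned trimmed but otherwise unchanged.
     */
    static String ensureInlineMath(String latex) {
        String trimmed = latex.trim();
        boolean wrapped = trimmed.length() >= 2
                && trimmed.startsWith("$")
                && trimmed.endsWith("$")
                && !trimmed.endsWith("\\$"); // a trailing \$ is an escaped dollar, not a delimiter
        return wrapped ? trimmed : "$" + trimmed + "$";
    }
}
```

A call such as addLatex(LatexDelimiters.ensureInlineMath(latex), paragraph) would then accept both bare and already-wrapped formulas.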

Converting MathML to OMML

MML2OMML.XSL ships with Microsoft Office on Windows; search the Office installation directory and you will find it.

private static File stylesheet = new File("D:\\MML2OMML.XSL");
private static TransformerFactory tFactory = TransformerFactory.newInstance();
private static StreamSource stylesource = new StreamSource(stylesheet);

private static CTOMath getOMML(String mathML) throws Exception {
    Transformer transformer = tFactory.newTransformer(stylesource);

    StringReader stringreader = new StringReader(mathML);
    StreamSource source = new StreamSource(stringreader);

    StringWriter stringwriter = new StringWriter();
    StreamResult result = new StreamResult(stringwriter);
    transformer.transform(source, result);

    String ooML = stringwriter.toString();
    stringwriter.close();

    CTOMathPara ctOMathPara = CTOMathPara.Factory.parse(ooML);
    CTOMath ctOMath = ctOMathPara.getOMathArray(0);

    //for making this to work with Office 2007 Word also, special font settings are necessary
    XmlCursor xmlcursor = ctOMath.newCursor();
    while (xmlcursor.hasNextToken()) {
        XmlCursor.TokenType tokentype = xmlcursor.toNextToken();
        if (tokentype.isStart()) {
            if (xmlcursor.getObject() instanceof CTR) {
                CTR cTR = (CTR) xmlcursor.getObject();
                cTR.addNewRPr2().addNewRFonts().setAscii("Cambria Math");
                cTR.getRPr2().getRFonts().setHAnsi("Cambria Math"); // up to apache poi 4.1.2
                //cTR.getRPr2().getRFontsArray(0).setHAnsi("Cambria Math"); // since apache poi 5.0.0
            }
        }
    }

    return ctOMath;
}
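
getOMML above builds a new Transformer from the same StreamSource for every formula. One possible refinement, sketched here with nothing but the JDK's javax.xml.transform API, is to compile the stylesheet once into a Templates object (which the JAXP documentation requires to be thread-safe) and create a Transformer from it per call. CompiledStylesheet and its apply method are names I made up for illustration; they are not part of POI or of the code above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Source;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class CompiledStylesheet {
    private final Templates templates;

    /** Parses and compiles the stylesheet a single time. */
    CompiledStylesheet(Source stylesheet) {
        try {
            this.templates = TransformerFactory.newInstance().newTemplates(stylesheet);
        } catch (TransformerException e) {
            throw new IllegalStateException("Cannot compile stylesheet", e);
        }
    }

    /** Runs the compiled stylesheet over one XML string and returns the serialized result. */
    String apply(String xml) {
        StringWriter out = new StringWriter();
        try {
            // Transformers are not thread-safe, so a fresh one is created per call
            templates.newTransformer()
                     .transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Cannot transform input", e);
        }
        return out.toString();
    }
}
```

With this in place, getOMML could keep one CompiledStylesheet built from new StreamSource(new File("D:\\MML2OMML.XSL")) and pass the output of apply(mathML) to CTOMathPara.Factory.parse as before.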

Symbols known to be unrecognized (no solution found so far)

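I tried many components, and spire.doc, fmath and the rest all failed to render \textcircled. It is a standard LaTeX command that draws a circle around its argument, giving results like ①. After those attempts went nowhere, I could only work around it for now in a rather ugly way, with the latexFilter method. In my data only ①②③④ appear with any frequency and the others never showed up, so take care if you reuse this.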

private static String latexFilter(String latex){
    if(!latex.contains("textcircled")){
        return latex;
    }
    return TextCircledEnum.replaceTextCircled(latex);
}

private enum TextCircledEnum{
    Zero("\\\\textcircled\\{0\\}","⓪"),
    One("\\\\textcircled\\{1\\}","①"),
    Two("\\\\textcircled\\{2\\}","②"),
    Three("\\\\textcircled\\{3\\}","③"),
    Four("\\\\textcircled\\{4\\}","④"),
    Five("\\\\textcircled\\{5\\}","⑤"),
    Six("\\\\textcircled\\{6\\}","⑥"),
    Seven("\\\\textcircled\\{7\\}","⑦"),
    Eight("\\\\textcircled\\{8\\}","⑧"),
    Nine("\\\\textcircled\\{9\\}","⑨"),
    Ten("\\\\textcircled\\{10\\}","⑩")
    ;

    TextCircledEnum(String code, String v) {
        this.code = code;
        this.v = v;
    }

    public final String code;
    public final String v;

    public static String replaceTextCircled(String latex){
        for (TextCircledEnum c : TextCircledEnum.values()) {
            latex = latex.replaceAll(c.code,c.v);
        }
        return latex;
    }

}
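
The enum above only matches the braced form, so the unbraced \textcircled1 entry in the test list passes through unchanged. As a possible alternative, here is a minimal sketch that handles both spellings with one regular expression and computes the replacement from the Unicode code point instead of listing each value. It relies on ① to ⑳ occupying U+2460 to U+2473 and ⓪ being U+24EA. The class name CircledDigits is hypothetical, and numbers above 20 are left untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CircledDigits {
    private CircledDigits() {}

    // Matches \textcircled{N} with a one- or two-digit N, or \textcircledN with a single digit
    private static final Pattern TEXTCIRCLED =
            Pattern.compile("\\\\textcircled\\s*(?:\\{\\s*(\\d{1,2})\\s*\\}|(\\d))");

    static String replace(String latex) {
        Matcher m = TEXTCIRCLED.matcher(latex);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String digits = m.group(1) != null ? m.group(1) : m.group(2);
            int n = Integer.parseInt(digits);
            String replacement;
            if (n == 0) {
                replacement = "\u24EA";                                  // circled zero
            } else if (n <= 20) {
                replacement = String.valueOf((char) (0x2460 + n - 1));   // circled 1..20
            } else {
                replacement = m.group();                                 // no single code point: keep as-is
            }
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

Arguments that are not plain numbers, such as \textcircled{a}, do not match the pattern and are returned unchanged, so they would still need separate handling.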

Test code, with a large set of LaTeX formulas

public static void main(String[] args) throws Exception {

    XWPFDocument document = new XWPFDocument();

    XWPFParagraph paragraph = document.createParagraph();
    paragraph.setAlignment(ParagraphAlignment.LEFT);
    List<String> latexList = Arrays.asList("$\\frac{\\sum\\limits_{i=1}^{n}({x}_{i}−\\overline{x})({y}_{i}−\\overline{y})}{\\sum\\limits_{i=1}^{n}({x}_{i}−\\overline{x}{)}^{2}}$"
        , "$\\frac{ \\sum _{i=1}^{n} (x_ {i}-\\overline {x})(y_ {i}-\\overline {y})}{\\sqrt { \\sum _{i=1}^{n} (x_ {i-x})^ {2} \\sum _{i=1}^{n} (y_ {i}-y)^ {2}}}$"
        , "$\\widehat{y}$"
        , "$s_{x}^ {2}$"
        , "$\\sum _{i=1}^{n}$"
        , "$\\frac%…7B(a+b)(c+d)(a+c)(b+d)}$"
        , "$0 \\geqslant x\\leqslant 5 \\widehat{A} \\hat{A} \\sideset{^1_2}{^3_4}Y \\sideset{^1_2}{^3_4}Y $"
        , "$\\textcircled{1}$"
        , "$\\textcircled1$"
        , "$\\f\\relax{x} = \\int_{-\\infty}^\\infty   \\f\\hat\\xi\\,e^{2 \\pi i \\xi x} \\,d\\xi$"
        , "$a_{1} \\quad  x^2 \\quad e^{- \\alpha t}  \\quad b^{3}_{ij} \\quad e^{2}\\neq {e^x}^2$"
        , "$\\sqrt{x} \\quad \\sqrt[3]{x} \\quad \\sqrt{x^{2}+ \\sqrt{y}}$"
        , "$\\frac{x^2}{k+1} \\quad  x^{\\frac{2}{k+1}} \\quad x^{1/2}$"
        , "$\\vec a  \\qquad  \\overrightarrow{AB}  \\qquad  \\overleftarrow{AB}$"
        , "$\\sum_{i=1}^{n} \\quad \\int_{0}^{\\frac{\\pi}{2}} \\quad \\prod_{\\epsilon}$"
        , "$\\alpha \\beta \\gamma \\sigma \\omega \\delta \\pi \\rho \\epsilon \\eta \\lambda \\mu \\xi \\tau \\kappa \\zeta \\phi \\chi$"
        , "$\\le  \\ge  \\ne  \\approx  \\sim  \\subseteq  \\in  \\notin  \\times  \\div  \\pm  \\Rightarrow  \\rightarrow  \\infty  \\partial  \\angle  \\triangle$"
        , "$\\left\\{  \n" +
            "             \\begin{array}{**lr**}  \n" +
            "             x=\\dfrac{3\\pi}{2}(1+2t)\\cos(\\dfrac{3\\pi}{2}(1+2t)), &  \\\\  \n" +
            "             y=s, & 0\\leq s\\leq L,|t|\\leq1.\\\\  \n" +
            "             z=\\dfrac{3\\pi}{2}(1+2t)\\sin(\\dfrac{3\\pi}{2}(1+2t)), &    \n" +
            "             \\end{array}  \n" +
            "\\right.  \n$"
        ,"$F^{HLLC}=\\left\\{\n" +
            "\\begin{array}{rcl}\n" +
            "F_L       &      & {0      <      S_L}\\\\\n" +
            "F^*_L     &      & {S_L \\leq 0 < S_M}\\\\\n" +
            "F^*_R     &      & {S_M \\leq 0 < S_R}\\\\\n" +
            "F_R       &      & {S_R \\leq 0}\n" +
            "\\end{array} \\right. $"
        ,"$\\Bigg ( \\bigg [ \\Big \\{\\big \\langle \\left \\vert \\parallel \\frac{a}{b} \\parallel \\right \\vert \\big \\rangle \\Big \\} \\bigg ] \\Bigg )$"
    );
    latexList.forEach(latex -> addLatex(latexFilter(latex), document.createParagraph()));
    FileOutputStream out = new FileOutputStream("CreateWordFormulaFromMathML.docx");
    document.write(out);
    out.close();
    document.close();

}

Converting LaTeX to MathML with fmath (abandoned)

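When I converted the formulas above with the fmath trio, some of them threw errors and some rendered worse than expected, so I abandoned it. The fmath conversion code is below.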

@SneakyThrows
public static void addLatexByFMath(String latex, XWPFParagraph paragraph) {
    String mathML = fmath.conversion.ConvertFromLatexToMathML.convertToMathML(latex);
    mathML = mathML.replaceFirst("<math ", "<math xmlns=\"http://www.w3.org/1998/Math/MathML\" ");
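    // &plusmn; is not a predefined XML entity, so swap it for the literal character before parsing
    mathML = mathML.replace("&plusmn;", "±");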
    CTOMath ctOMath = getOMML(mathML);
    CTP ctp = paragraph.getCTP();
    CTOMath ctoMath = ctp.addNewOMath();
    ctoMath.set(ctOMath);
}

Overview of the POI APIs used to generate Word

Creating a paragraph

private XWPFParagraph newParagraph(XWPFDocument document) {
    XWPFParagraph paragraph = document.createParagraph();
    paragraph.setSpacingLineRule(LineSpacingRule.AUTO);
    paragraph.setSpacingBefore(30);
    paragraph.setAlignment(ParagraphAlignment.LEFT);
    return paragraph;
}

Adding text

Note: POI does not turn line-break characters such as \r and \n into line breaks in Word. To break a line, call xwpfRun.addBreak().

public void addText(String text, XWPFParagraph paragraph) {
    if (StringUtils.isEmpty(text)) {
        return;
    }
    XWPFRun xwpfRun = paragraph.createRun();
    String[] lines = text.split("\n");
    if (lines.length < 1) {
        return;
    }
    xwpfRun.setText(lines[0], 0);
    for (int m = 1; m < lines.length; m++) {
        xwpfRun.addBreak();
        xwpfRun.setText(lines[m]);
    }
    if (text.endsWith("\n")) {
        xwpfRun.addBreak();
    }
}
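
One detail of addText: String.split("\n") drops trailing empty strings, so text ending in several newlines produces only the single break added by the endsWith check. If blank lines at the end matter, a possible variation is to split with a negative limit, which keeps them, and to accept \r\n as well. The sketch below shows only the splitting step in plain JDK code; LineSplitter is a hypothetical helper name.

```java
import java.util.Arrays;
import java.util.List;

final class LineSplitter {
    private LineSplitter() {}

    /** Splits on \n or \r\n and keeps empty segments, including trailing ones. */
    static List<String> lines(String text) {
        // A negative limit tells split() not to discard trailing empty strings
        return Arrays.asList(text.split("\\r?\\n", -1));
    }
}
```

In addText this would mean calling setText on the first element, then addBreak() followed by setText for every later element, with no separate endsWith check.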

Rendering tables

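Note: the table's row and column counts are fully computed before rendering (cell merging is not covered here). Passing a percentage string to table.setWidth() is only supported from POI 4.x onward.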

private void parse2Table(WordInnerPojo innerPojo, XWPFParagraph paragraph) {
    XWPFTable table = paragraph.getDocument().createTable(innerPojo.rows, innerPojo.lines);
    table.setWidth("100%");
    for (int i = 0; i < innerPojo.rowLines.size(); i++) {
        List<String> rowLine = innerPojo.rowLines.get(i);
        for (int j = 0; j < rowLine.size(); j++) {
            XWPFTableCell cell = table.getRow(i).getCell(j);
            XWPFParagraph innerParagraph = cell.getParagraphs().size() > 0 ? cell.getParagraphs().get(0) : cell.addParagraph();
            innerParagraph.setSpacingBefore(0);
            innerParagraph.setVerticalAlignment(TextAlignment.CENTER);
            innerParagraph.setAlignment(ParagraphAlignment.LEFT);
            addContent(rowLine.get(j), innerParagraph);
        }
    }
    paragraph.getDocument().createParagraph();
}

Inserting images

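Note: sizes must be converted to EMU (English Metric Units); the toEMU method of org.apache.poi.util.Units does this directly. Written this way, the image is appended right after the text with no line break.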

paragraph.createRun().addPicture(new ByteArrayInputStream(innerPojo.image), 
    XWPFDocument.PICTURE_TYPE_JPEG, "", 
    Units.toEMU(width.intValue()), 
    Units.toEMU(height.intValue()));
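
A point worth knowing about the units: Units.toEMU interprets its argument as points (1/72 inch), while the width and height here come from the image and are pixel counts, so the picture ends up larger in Word than a pixel-for-pixel rendering at 96 dpi would be. POI also offers Units.pixelToEMU for pixel input. The arithmetic behind both conversions is sketched below in plain Java; EmuMath is a hypothetical class that only mirrors the constants (914400 EMU per inch) and is not POI's own code.

```java
final class EmuMath {
    private EmuMath() {}

    static final int EMU_PER_INCH = 914_400;
    static final int EMU_PER_POINT = EMU_PER_INCH / 72;        // 12700
    static final int EMU_PER_PIXEL_96DPI = EMU_PER_INCH / 96;  // 9525

    /** Points (1/72 inch) to EMU. */
    static int pointsToEmu(double points) {
        return (int) Math.rint(EMU_PER_POINT * points);
    }

    /** Pixels to EMU, assuming a 96 dpi display. */
    static int pixelsToEmu(int pixels) {
        return pixels * EMU_PER_PIXEL_96DPI;
    }
}
```

For a 300-pixel-wide image, treating 300 as points gives 3810000 EMU, whereas treating it as pixels gives 2857500 EMU, a ratio of 4:3.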

POJO class and rendering logic for Word formulas

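A raw piece of HTML text has to be parsed in segments: text, formulas, tables, images and so on. The parse result is abstracted into a POJO class. The non-text elements are extracted and marked with placeholders, which are later used for substitution and rendering.

POJO class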

private static class WordInnerPojo {
    protected static final int LATEX_TYPE = 0;
    protected static final int IMG_TYPE = 1;
    protected static final int TABLE_TYPE = 2;
    private int type;
    private byte[] image;
    private String latex;
    private String imageUrl;
    private int rows;
    private int lines;
    private List<List<String>> rowLines;
    private BufferedImage imageTemp;

    @SneakyThrows
    BufferedImage readImage() {
        if (this.imageTemp == null) {
            this.imageTemp = ImageIO.read(new ByteArrayInputStream(this.image));
        }
        return imageTemp;
    }

    private Integer getImageWidth() {
        return readImage().getWidth();
    }

    private Integer getImageHeight() {
        return readImage().getHeight();
    }

}

Rendering logic

@SneakyThrows
private void appendWordInnerPojo(WordInnerPojo innerPojo, XWPFParagraph paragraph) {
    switch (innerPojo.type) {
        case WordInnerPojo.LATEX_TYPE:
            addLatex(latexFilter(MessageFormat.format("${0}$", URLDecoder.decode(innerPojo.latex, "UTF-8"))), paragraph);
            break;
        case WordInnerPojo.IMG_TYPE:
            log.info("imageUrl:{}", innerPojo.imageUrl);
            /* Limit the rendered size of images in Word so they are not too large */
            Float width = Float.valueOf(innerPojo.getImageWidth());
            Float height = Float.valueOf(innerPojo.getImageHeight());
            if (width > 300 && width > height) {
                BigDecimal rate = BigDecimal.valueOf(300).divide(BigDecimal.valueOf(width), 8, BigDecimal.ROUND_DOWN);
                height = height * rate.floatValue();
                width = 300f;
            } else if (height > 200 && height > width) {
                BigDecimal rate = BigDecimal.valueOf(200).divide(BigDecimal.valueOf(height), 8, BigDecimal.ROUND_DOWN);
                width = width * rate.floatValue();
                height = 200f;
            }
            paragraph.createRun().addPicture(new ByteArrayInputStream(innerPojo.image), XWPFDocument.PICTURE_TYPE_JPEG, "", Units.toEMU(width.intValue()), Units.toEMU(height.intValue()));
            paragraph.createRun().addBreak();
            break;
        case WordInnerPojo.TABLE_TYPE:
            parse2Table(innerPojo, paragraph);
            break;
    }
}
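
As far as I can tell, the two size branches above leave some cases uncapped: a square image (width equal to height) matches neither condition, and a wide image such as 400 x 390 is scaled by width only, leaving its height near 292. A possible alternative is to pick a single scale factor that satisfies both limits at once. The sketch below is plain Java with a hypothetical ImageFit helper; the 300 x 200 bounds are the ones used in the code above.

```java
final class ImageFit {
    private ImageFit() {}

    /**
     * Scales (width, height) down to fit inside (maxWidth, maxHeight) while keeping the aspect ratio.
     * Images that already fit are returned unchanged. Returns {scaledWidth, scaledHeight}.
     */
    static int[] fitWithin(int width, int height, int maxWidth, int maxHeight) {
        if (width <= 0 || height <= 0) {
            throw new IllegalArgumentException("Image dimensions must be positive");
        }
        // The smaller of the two ratios is the one that keeps both sides within bounds
        double scale = Math.min(1.0, Math.min((double) maxWidth / width, (double) maxHeight / height));
        int scaledWidth = Math.max(1, (int) Math.round(width * scale));
        int scaledHeight = Math.max(1, (int) Math.round(height * scale));
        return new int[] { scaledWidth, scaledHeight };
    }
}
```

In the IMG_TYPE branch, the result of ImageFit.fitWithin(innerPojo.getImageWidth(), innerPojo.getImageHeight(), 300, 200) could replace the width and height computed by the two if branches.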

Done! A sample of the exported output: [screenshot not preserved]


Reference link

https://stackoverflow.com/questions/46623554/add-latex-type-equation-in-word-docx-using-apache-poi
