A troubleshooting log: formula characters lost when converting PDF to images

Posted by 蒲公英的狂想 on 2021-10-31

Original URL: https://www.cnblogs.com/surging-dandelion/p/15489727.html


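Symptom

PDF textbooks are imported into our system, where a program converts each PDF to images and merges them into one large image for the front end to annotate. When the subject is math or chemistry and the source file is a PDF, some characters inside formulas go missing, as the screenshots show.
Original

After conversion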

WTF!
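
The merge step mentioned above (page images combined into one large image) is not shown in the post. A minimal sketch of one way to do it with the JDK alone, assuming the pages are stacked top to bottom; the class and method names are made up for illustration:

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.util.List;

final class PageStitcher {
    /** Stacks page images top to bottom on a white canvas as wide as the widest page. */
    static BufferedImage stitchVertically(List<BufferedImage> pages) {
        if (pages.isEmpty()) {
            throw new IllegalArgumentException("no pages to stitch");
        }
        int width = 0;
        int height = 0;
        for (BufferedImage page : pages) {
            width = Math.max(width, page.getWidth());
            height += page.getHeight();
        }
        BufferedImage merged = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = merged.createGraphics();
        try {
            // Fill first so that narrower pages leave white margins, not black ones
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, width, height);
            int y = 0;
            for (BufferedImage page : pages) {
                g.drawImage(page, 0, y, null);
                y += page.getHeight();
            }
        } finally {
            g.dispose();
        }
        return merged;
    }
}
```

The merged image is held in memory in full, so a long textbook rendered at a high DPI may need to be merged in batches.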

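Conversion approaches

PDFBox (current approach)

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.InputStream;
import javax.imageio.ImageIO;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;

public void pdf2Jpg(InputStream in, String jpgFilePath) {
    // Pages go into a directory named after the target file, minus its extension
    int idx = jpgFilePath.lastIndexOf('.');
    String base = idx < 0 ? jpgFilePath : jpgFilePath.substring(0, idx);
    String jpgPrefix = base + File.separator;
    int pdfDpi = 200;
    // Loader.loadPDF(InputStream) is the call used in this post;
    // check which overloads your PDFBox version offers
    try (final PDDocument document = Loader.loadPDF(in)) {
        File dir = new File(jpgPrefix);
        if (!dir.exists()) {
            dir.mkdirs();
        }
        // One renderer serves every page of the document
        PDFRenderer renderer = new PDFRenderer(document);
        int size = document.getNumberOfPages();
        for (int i = 0; i < size; i++) {
            BufferedImage image = renderer.renderImageWithDPI(i, pdfDpi, ImageType.RGB);
            File jpgFile = new File(jpgPrefix + i + ".jpg");
            ImageIO.write(image, "jpg", jpgFile);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}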

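The result:

Exactly the same as the screenshot people sent me. At first I suspected a font problem. The formula uses the font Cambria Math, which I had already installed locally, but the font name copied out of the PDF is CambriaMath, without the space. This is where I hit a gap in my knowledge: why no space? Pasted into Word, the character shows up as a rectangle with a question mark inside. Let's try another SDK.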

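ICEpdf

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;
import org.icepdf.core.exceptions.PDFException;
import org.icepdf.core.exceptions.PDFSecurityException;
import org.icepdf.core.util.GraphicsRenderingHints;

public static void pdf2Pic(String pdfPath, String path) throws IOException, PDFException, PDFSecurityException, InterruptedException {
    org.icepdf.core.pobjects.Document document = new org.icepdf.core.pobjects.Document();
    document.setFile(pdfPath);
    // Zoom factor
    float scale = 2.5f;
    // Rotation angle in degrees
    float rotation = 0f;

    try {
        for (int i = 0; i < document.getNumberOfPages(); i++) {
            BufferedImage image = (BufferedImage)
                    document.getPageImage(i, GraphicsRenderingHints.SCREEN, org.icepdf.core.pobjects.Page.BOUNDARY_CROPBOX, rotation, scale);
            try {
                String imgName = i + ".jpg";
                System.out.println(imgName);
                // path is used as a plain prefix, so it should end with a file separator
                File file = new File(path + imgName);
                ImageIO.write(image, "jpg", file);
            } catch (IOException e) {
                e.printStackTrace();
            }
            image.flush();
        }
    } finally {
        // Release the document even if a page fails to render
        document.dispose();
    }
}

The result: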

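This time the characters are not blank; instead we get the garbled rectangle-with-question-mark glyph. It still feels like a font problem, but one that seems to be triggered by the formulas. The input is a PDF, and when I type a formula into Word by hand, export it to PDF and convert that to an image, everything is fine. Baffling! Let's try yet another SDK.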
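
A side note on resolution, since the two snippets so far specify it differently: PDFBox takes a DPI value (200) while ICEpdf takes a zoom factor (2.5). The sketch below relates the two, assuming the zoom factor is relative to the PDF default of 72 user-space units per inch; the A4 page size of 595 x 842 points is only an example, and the exact rounding each library applies may differ.

```java
final class RenderScale {
    /** Default PDF user space: 72 units per inch. */
    static final double POINTS_PER_INCH = 72.0;

    static double scaleForDpi(double dpi) {
        return dpi / POINTS_PER_INCH;
    }

    static double dpiForScale(double scale) {
        return scale * POINTS_PER_INCH;
    }

    /** Pixel length of a page edge given in points, rounded down. */
    static int pixels(double points, double dpi) {
        return (int) Math.floor(points * dpi / POINTS_PER_INCH);
    }

    public static void main(String[] args) {
        System.out.println(dpiForScale(2.5));                            // prints 180.0
        System.out.println(pixels(595, 200) + " x " + pixels(842, 200)); // prints 1652 x 2338
    }
}
```

Under that assumption the ICEpdf run renders at roughly 180 DPI, slightly below the 200 DPI used with PDFBox, which matters when comparing the output images side by side.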

Aspose.PDF (watermarked trial, for testing only)

import com.aspose.pdf.Document;
import com.aspose.pdf.devices.JpegDevice;
import com.aspose.pdf.devices.Resolution;

public static void pdfToImage(InputStream inputStream, String imgFilePath) {
    try {
        log.info("convert pdf2jpg begin");
        long old = System.currentTimeMillis();
        Document pdfDocument = new Document(inputStream);
        // Resolution in DPI
        Resolution resolution = new Resolution(200);
        JpegDevice jpegDevice = new JpegDevice(resolution);
        List<BufferedImage> imageList = new ArrayList<>();
        List<File> fileList = new ArrayList<>();
        // Page indexes start at 1 here
        for (int index = 1; index <= pdfDocument.getPages().size(); index++) {
            File file = new File(imgFilePath + index + ".jpg");
            // try-with-resources closes the stream even if rendering throws
            try (FileOutputStream fileOS = new FileOutputStream(file)) {
                jpegDevice.process(pdfDocument.getPages().get_Item(index), fileOS);
            }
            imageList.add(ImageIO.read(file));
            fileList.add(file);
        }
        // Deleting the temporary files in fileList is omitted here
        long now = System.currentTimeMillis();
        log.info("convert pdf2jpg completed, elapsed: " + ((now - old) / 1000.0) + "s");
    } catch (Exception e) {
        // Passing the exception as the last argument logs the stack trace too
        log.error("convert pdf2jpg error", e);
    }
}

The result:

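Seeing Aspose's output, I was basically convinced that this has nothing to do with the fonts installed locally. It is painful to look at. Next.

PDFRenderer. Yes, I have resorted to PDFRenderer.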

import com.sun.pdfview.PDFFile;
import com.sun.pdfview.PDFPage;
import com.sun.pdfview.PDFRenderer;

private static void pdfToImageRender(InputStream is, String imgFilePath) throws IOException {
    // toByteArray is a helper (not shown) that reads the whole stream into memory
    byte[] byt = toByteArray(is);
    // wrap() leaves the buffer positioned at 0, ready for reading;
    // a failure to open the file now propagates instead of causing a NullPointerException later
    PDFFile pdffile = new PDFFile(ByteBuffer.wrap(byt));
    int pagen = pdffile.getNumPages();

    // Page numbers start at 1 here
    for (int i = 1; i <= pagen; i++) {
        PDFPage page = pdffile.getPage(i);
        int width = (int) page.getBBox().getWidth();
        int height = (int) page.getBBox().getHeight();
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g2 = img.createGraphics();
        // The Color argument is the background color of the rendered image
        PDFRenderer renderer = new PDFRenderer(page, g2,
                new Rectangle(0, 0, width, height), null, Color.white);
        try {
            page.waitForFinish();
        } catch (Exception e) {
            e.printStackTrace();
        }
        renderer.run();
        g2.dispose();
        // try-with-resources closes the output stream for every page
        try (OutputStream os = new FileOutputStream(imgFilePath + i + ".jpg")) {
            ImageIO.write(img, "jpg", os);
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}

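The result:

Yes, you are seeing it right: rendering fails at the very first formula character, and nothing after it is drawn. So it is a font problem after all, damn it. The error output: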

java.lang.IllegalArgumentException: newPosition > limit: (1146308935 > 1619740)
	at java.nio.Buffer.createPositionException(Buffer.java:269)
	at java.nio.Buffer.position(Buffer.java:244)
	at com.sun.pdfview.font.ttf.TrueTypeFont.parseDirectories(TrueTypeFont.java:312)
	at com.sun.pdfview.font.ttf.TrueTypeFont.parseFont(TrueTypeFont.java:68)
	at com.sun.pdfview.font.TTFFont.<init>(TTFFont.java:78)
	at com.sun.pdfview.font.TTFFont.<init>(TTFFont.java:50)
	at com.sun.pdfview.font.CIDFontType2.<init>(CIDFontType2.java:68)
	at com.sun.pdfview.font.PDFFont.getFont(PDFFont.java:191)
	at com.sun.pdfview.font.Type0Font.<init>(Type0Font.java:51)
	at com.sun.pdfview.font.PDFFont.getFont(PDFFont.java:156)
	at com.sun.pdfview.PDFParser.getFontFrom(PDFParser.java:1166)
	at com.sun.pdfview.PDFParser.iterate(PDFParser.java:719)
	at com.sun.pdfview.BaseWatchable.run(BaseWatchable.java:102)
	at java.lang.Thread.run(Thread.java:748)
java.lang.IllegalArgumentException: newPosition > limit: (1146308935 > 1619740)
	at java.nio.Buffer.createPositionException(Buffer.java:269)
	at java.nio.Buffer.position(Buffer.java:244)
	at com.sun.pdfview.font.ttf.TrueTypeFont.parseDirectories(TrueTypeFont.java:312)
	at com.sun.pdfview.font.ttf.TrueTypeFont.parseFont(TrueTypeFont.java:68)
	at com.sun.pdfview.font.TTFFont.<init>(TTFFont.java:78)
	at com.sun.pdfview.font.TTFFont.<init>(TTFFont.java:50)
	at com.sun.pdfview.font.CIDFontType2.<init>(CIDFontType2.java:68)
	at com.sun.pdfview.font.PDFFont.getFont(PDFFont.java:191)
	at com.sun.pdfview.font.Type0Font.<init>(Type0Font.java:51)
	at com.sun.pdfview.font.PDFFont.getFont(PDFFont.java:156)
	at com.sun.pdfview.PDFParser.getFontFrom(PDFParser.java:1166)
	at com.sun.pdfview.PDFParser.iterate(PDFParser.java:719)
	at com.sun.pdfview.BaseWatchable.run(BaseWatchable.java:102)
	at java.lang.Thread.run(Thread.java:748)
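
One detail in this trace is worth a closer look. The position 1146308935 is far larger than the 1619740-byte buffer, and read as four big-endian ASCII bytes it spells a table tag. The sketch below decodes it and then shows, on a synthetic header built purely for illustration, that a parser expecting a single-font sfnt header would pick up exactly this value as the first table's offset if it were handed a version 2.0 TrueType Collection header with two fonts. This is consistent with the embedded font program being a collection (Cambria Math ships on Windows inside cambria.ttc, as far as I know), but it is one reading of the number, not a confirmed diagnosis.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class OffsetAsTag {
    /** Renders a 32-bit value as the four ASCII bytes it is made of (big-endian). */
    static String asTag(int value) {
        byte[] bytes = ByteBuffer.allocate(4).putInt(value).array();
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    /** Builds a synthetic TrueType Collection header, version 2.0, listing two fonts. */
    static byte[] syntheticTtcHeader() {
        ByteBuffer buf = ByteBuffer.allocate(32);
        buf.put("ttcf".getBytes(StandardCharsets.US_ASCII)); // bytes 0-3: collection tag
        buf.putShort((short) 2).putShort((short) 0);         // bytes 4-7: version 2.0
        buf.putInt(2);                                       // bytes 8-11: number of fonts
        buf.putInt(32).putInt(64);                           // bytes 12-19: made-up directory offsets
        buf.put("DSIG".getBytes(StandardCharsets.US_ASCII)); // bytes 20-23: signature table tag
        buf.putInt(0).putInt(0);                             // bytes 24-31: signature length and offset
        return buf.array();
    }

    /** Reads the header the way a single-font parser would: the offset field of record 0. */
    static int firstTableOffsetAsSfnt(byte[] header) {
        // sfnt layout: 12-byte header, then 16-byte records of tag, checksum, offset, length
        return ByteBuffer.wrap(header).getInt(12 + 8);
    }

    public static void main(String[] args) {
        System.out.println(asTag(1146308935));                            // prints DSIG
        System.out.println(firstTableOffsetAsSfnt(syntheticTtcHeader())); // prints 1146308935
    }
}
```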

Summary

In fact, PDFBox also logs a warning during conversion:

00:19:53.100 [main] WARN org.apache.pdfbox.pdmodel.font.PDCIDFontType2 - Could not read embedded OTF for font DCWGQU+CambriaMath
java.io.IOException: head is mandatory
	at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:182)
	at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
	at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:79)
	at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:27)
	at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
	at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:73)
	at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:114)
	at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:67)
	at org.apache.pdfbox.pdmodel.font.PDFontFactory.createDescendantFont(PDFontFactory.java:138)
	at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:88)
	at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:96)
	at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:143)
	at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:66)
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:849)
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495)
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:142)
	at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:264)
	at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:338)
	at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:259)
	at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:245)
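
The message "head is mandatory" means the parser found no `head` table in the embedded font program. Given the raw font bytes, this can be checked without any PDF library by listing the tags in the sfnt table directory. The sketch below assumes the bytes have already been extracted from the PDF (not shown here; with PDFBox that would presumably go through the font descriptor's embedded font stream), and the class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class SfntTables {
    private static final int TTC_TAG = 0x74746366; // "ttcf"

    /** Returns the table tags listed in the sfnt table directory of a font program. */
    static List<String> tableTags(byte[] font) {
        ByteBuffer buf = ByteBuffer.wrap(font); // big-endian by default, as sfnt requires
        if (buf.remaining() < 12) {
            throw new IllegalArgumentException("too short for an sfnt header");
        }
        int version = buf.getInt();
        if (version == TTC_TAG) {
            // A collection has a different header; its fonts live at per-font offsets
            throw new IllegalArgumentException("TrueType Collection, not a single font");
        }
        int numTables = buf.getShort() & 0xFFFF;
        buf.position(12); // skip searchRange, entrySelector, rangeShift
        if (buf.remaining() < numTables * 16) {
            throw new IllegalArgumentException("table directory is truncated");
        }
        List<String> tags = new ArrayList<>();
        byte[] tag = new byte[4];
        for (int i = 0; i < numTables; i++) {
            buf.get(tag);
            tags.add(new String(tag, StandardCharsets.US_ASCII));
            buf.position(buf.position() + 12); // skip checksum, offset, length
        }
        return tags;
    }

    static boolean hasTable(byte[] font, String tag) {
        return tableTags(font).contains(tag);
    }
}
```

If `hasTable(bytes, "head")` returns false, or the call reports a collection, the font program is not something a plain TrueType or OpenType parser can consume directly.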

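I looked up the PDFBox error. What I found says that PDFBox does not support OTF fonts, and that such a font has to be converted before it can be used. Searching Google for the keyword DCWGQU+CambriaMath turned up nothing.
My guess is that the formulas were produced by an old version of Office, which wrote the font into the PDF during export, and that none of these SDKs read and parse the font carried inside the PDF. They only read its name and then fall back to a font installed on the operating system; the fonts are incompatible, parsing fails, and the error follows. With the tools I have at hand I cannot generate a PDF like this to reproduce the problem.
I did get hold of the original Word files for some other problematic PDFs. Re-exported to PDF with a newer version of Word, they convert fine with PDFBox. But that does not solve the problem: the mainstream PDF-to-image SDKs basically all fail, and you cannot really ask the customer at the source to upgrade Office and try again.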
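
A note on the font name itself, DCWGQU+CambriaMath. In PDF, a font name prefixed with six uppercase letters and a plus sign conventionally marks an embedded font subset, and the six letters are an arbitrary tag chosen by the producing software, which would explain why searching for the full string finds nothing. The part after the plus sign is a PostScript-style name, and those cannot contain spaces, hence CambriaMath rather than Cambria Math. A small sketch for separating the two parts (class and method names are made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SubsetFontName {
    // Six uppercase letters, a plus sign, then the actual font name
    private static final Pattern SUBSET = Pattern.compile("^([A-Z]{6})\\+(.+)$");

    static boolean isSubset(String baseFont) {
        return SUBSET.matcher(baseFont).matches();
    }

    /** Returns the name without the subset tag, or the input unchanged if it has none. */
    static String stripSubsetTag(String baseFont) {
        Matcher m = SUBSET.matcher(baseFont);
        return m.matches() ? m.group(2) : baseFont;
    }

    public static void main(String[] args) {
        System.out.println(stripSubsetTag("DCWGQU+CambriaMath")); // prints CambriaMath
    }
}
```

Searching for the stripped name, together with the error message, is more likely to turn up relevant reports than the tagged name.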

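Some reflections
Locked in! It is like He Dachun's "Love-Stronger-Than-Gold Seven-Day Lock" in Scissor Seven (刺客伍六七): once locked, it stays locked for seven days and cannot be opened before the time is up. The lock is like love: it cannot be opened from the inside, but if the person who cast it lets go voluntarily, it opens early. Dachun released the lock to save Seven, and I can do the same. So I let go of the Java SDKs, dropped the four letters "Java" from my search queries, and took a look at the world outside.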

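The big gun: ImageMagick (open source, free)

ImageMagick is free software for creating, editing and compositing images. It reads, converts and writes images in many formats, and can crop images, replace colors, apply effects, rotate and combine images, and draw text, lines, polygons, ellipses and curves. Its full source code is open and may be freely used, copied, modified and distributed under the ImageMagick License, which is derived from Apache 2.0. It runs on most operating systems, and most of its functionality is used through command-line tools.
Baidu Baike: https://baike.baidu.com/item/ImageMagick/6865180?fr=aladdin
Official site: https://imagemagick.org/index.php

Here is the code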

// Platform detection from the os.name system property
private static final boolean isWin = System.getProperty("os.name").toLowerCase().contains("win");
private static final boolean isLinux = System.getProperty("os.name").toLowerCase().contains("linux");
// {0} is the source PDF, {1} the output directory; paths containing spaces would need quoting
private static final String COMMAND_LINE_WIN = "C:/PROGRA~1/ImageMagick-7.1.0-Q16-HDRI/magick.exe convert -density 220 -quality 80 -background white -alpha remove {0} {1}/1.jpg";
private static final String COMMAND_LINE_LINUX = "magick convert -density 220 -quality 80 -background white -alpha remove {0} {1}/1.jpg";

public void pdf2jpgByMagick(String pdfPath, String jpgFilePath) throws IOException {
    // Create the output directory if it does not exist yet
    FileUtils.forceMkdir(new File(jpgFilePath));
    String command;
    if (isWin) {
        command = MessageFormat.format(COMMAND_LINE_WIN, pdfPath, jpgFilePath);
    } else if (isLinux) {
        command = MessageFormat.format(COMMAND_LINE_LINUX, pdfPath, jpgFilePath);
    } else {
        throw new RuntimeException("Unsupported platform");
    }
    // CommandUtil.exeCmd is a project helper (not shown) that runs the command line
    CommandUtil.exeCmd(command);
}
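
CommandUtil.exeCmd is not shown in the post. As a rough sketch of what such a helper could look like, the version below launches ImageMagick through ProcessBuilder with the arguments as a list, so that paths containing spaces survive without quoting. The flags and their order are copied from the command line above; the class and method names are made up for illustration, and ImageMagick typically needs Ghostscript installed to read PDF input.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MagickCommand {
    /** Builds the argument list; each element reaches the process as one argument. */
    static List<String> build(String magickExe, String pdfPath, String outputFile) {
        return new ArrayList<>(Arrays.asList(
                magickExe, "convert",
                "-density", "220",
                "-quality", "80",
                "-background", "white",
                "-alpha", "remove",
                pdfPath,
                outputFile));
    }

    /** Runs the command, forwarding its output to this process, and returns the exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        // inheritIO avoids blocking on an unread stdout or stderr pipe
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }
}
```

A call might look like `MagickCommand.run(MagickCommand.build("magick", "/data/in/my book.pdf", "/data/out/1.jpg"))`; a non-zero exit code signals failure. For a multi-page PDF, ImageMagick typically writes one file per page with an index added to the name.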

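The result:

Absolutely perfect! But installing it on Windows (locally), installing it on Linux servers and packaging it into container images all bring their own chores, and there are a few small pitfalls in installing ImageMagick and choosing the JPG conversion parameters. Stay tuned for the next post.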
