Parsing Office (Word, Excel, PowerPoint) and PDF files in Java: implementation approaches and notes from development

Posted by funnyZpC on 2017-07-29

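Before anything else, let me share what I went through before writing this post. If I had to compress all my feelings into one word, it would be "pitfall"; in two words, "giant pitfall". The reason: this requirement did not start out the way it ended up. Bear with me while I tell the story:

At first, the client and we agreed that Office and PDF files would be uploaded and parsed into HTML, and that the app would call its built-in server to "play" them directly as HTML.

One month went by~, two months~, three months~~~

When development finally started, this turned out to be a pitfall... According to the requirements spec, the whole thing was to be delivered as a single feature. Technical difficulty aside, the estimated hours were nowhere near enough to build what the spec asked for (far too many defects!).

Then a week~, another week~, and yet another week~~~

After trying all sorts of approaches we got the requirement into a usable state. Then, at requirement sign-off, the client said: "We never asked you to parse these documents. We only asked you to upload them as raw files, so that tapping one in the app lets the user choose a third-party application to open it. That has been our requirement from the very beginning."

/** Hearing this, I burst into tears (ಥ _ ಥ). Had the business side confirmed this at the start, we would not have wasted so much time and effort on such a huge detour... */

The requirement went full circle and ended up where it started. Having lived through it, here is my summary of the endless pitfalls in this requirement: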

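A> The open source community has plenty of demos, but they have many shortcomings: WordArt, images, formulas, color styles, video and audio inside Office files cannot be parsed.

B> For the objects that can be parsed, the results are not great: the layout of Word and PPT files gets scrambled, custom formats in Excel cells all turn into raw numbers, and so on.

C> Open source material is incomplete, so different document types need different parsers (for example docx4j for Word and POI for Excel), which makes for a huge amount of code.

D> Because the code-based parsing is not good enough, the revised approach requires converting the source file into another form before upload: a PDF has to be split into images, a PPT converted to video or images. The feature thus becomes semi-automatic ╥﹏╥...

E> A big problem with parsing Word with docx4j is speed: files over 5 MB, or Word documents with complex content, take a very long time to parse. In addition, parsing large Excel files with POI (e.g. more than 1000 rows) easily causes out-of-memory errors that are hard to control.

F> The schedule was too short, only 15 days..., overtime all the way (⊙︿⊙). Hey foreman, give me a raise!!! ε=怒ε=怒ε=怒ε=怒ε=( o｀ω′)ノ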

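Rant over. Time to show the final result~

The four screenshots, from left to right, show PDF, PPT, Word and Excel parsed to HTML. Because of the development agreement, parts of screenshots 1 and 2 are blurred out, and since this is only a browser emulating a phone, the rendering looks rather rough. My apologies for that~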

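Here is the approach I finally settled on:
A> Word documents come in two formats, doc (2003) and docx (2007). Since doc is on its way out, and docx4j offers a one-step conversion for docx, the doc format is not supported.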

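B> As with Word, legacy Excel formats are not converted. The approach is built on a third-party demo; the specific technology involved is Apache POI (poi.hssf, plus its XSSF counterpart, which is what POI uses for .xlsx workbooks).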

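C> PowerPoint (ppt) has many kinds of built-in objects. To protect the user experience, my approach is to export the PPT directly to mp4 or to images (packed into a zip), upload that, and wrap it in HTML with code.

D> For PDF there is likewise no good demo for converting to HTML, so, as with PPT, it is converted to images with external software, packed and uploaded, then wrapped in HTML with code.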

First, the code for parsing Word:

(Snippet 1)

    public static void Word2Html() throws FileNotFoundException, Docx4JException{
            // the docx4j log level needs to be configured in log4j
            WordprocessingMLPackage wmp = WordprocessingMLPackage.load(new File("C:\\Users\\funnyZpC\\Desktop\\Test\\word.docx"));
            Docx4J.toHTML(wmp, "C:\\Users\\funnyZpC\\Desktop\\result\\wordIMG", "wordIMG", new FileOutputStream(new File("C:\\Users\\funnyZpC\\Desktop\\result\\word.html")));
    }

(Snippet 2)

    public ProcessFileInfo processDOCX(File file,String uploadPath)throws Exception{
        String fileName=file.getName().substring(0,file.getName().lastIndexOf("."));// file name without extension
        WordprocessingMLPackage wmp = WordprocessingMLPackage.load(file);// load the source file
        String basePath=String.format("%s%s%s", uploadPath,File.separator,fileName);// base directory
        FileUtils.forceMkdir(new File(basePath));// create the directory
        String zipFilePath=String.format("%s%s%s.%s", uploadPath,File.separator,fileName,"ZIP");// path of the final output file
        // convert; try-with-resources closes index.html before the directory is zipped and deleted
        try(OutputStream out=new FileOutputStream(new File(basePath,"index.html"))){
            Docx4J.toHTML(wmp, String.format("%s%s%s", basePath,File.separator,fileName),fileName,out);
        }
        scormService.zip(basePath, zipFilePath);// zip the result
        FileUtils.forceDelete(new File(basePath));// delete the temporary directory
        file.delete();// conversion done, delete the original docx file
        return new ProcessFileInfo(true,new File(zipFilePath).getName(),zipFilePath);// return information about the target file
    }

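Parsing a Word (docx) document takes just two lines of code (lines 3 and 4 of snippet 1, the load and toHTML calls). Snippet 2 above is the code from actual development; I suggest reading it side by side with snippet 1. Since the project may be deployed on Linux, use File.separator instead of a hard-coded "/" or "\" path separator. The four parameters of the toHTML method also need explaining ==>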

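Docx4J.toHTML(the WordprocessingMLPackage instance loaded from the source docx file, the directory where the extracted images are stored, the image folder name used in the HTML's image references (relative to the main HTML file), the output stream for the main HTML);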
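The path handling in snippet 2 can also be written with java.nio.file.Paths, which inserts the platform separator itself. The following is only a minimal sketch under that assumption; the class and method names (UploadPaths, baseName, workDir, zipPath) are hypothetical and not part of the project. It also guards against upload names that have no extension, a case where substring(0, lastIndexOf(".")) would throw.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the original project.
final class UploadPaths {
    private UploadPaths() {}

    /** File name without its last extension; names without a dot are returned unchanged. */
    static String baseName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot > 0 ? fileName.substring(0, dot) : fileName;
    }

    /** Working directory for one upload: uploadPath/baseName. */
    static Path workDir(String uploadPath, String fileName) {
        return Paths.get(uploadPath, baseName(fileName));
    }

    /** Archive written next to the working directory: uploadPath/baseName.ZIP. */
    static Path zipPath(String uploadPath, String fileName) {
        return Paths.get(uploadPath, baseName(fileName) + ".ZIP");
    }
}
```

For an upload named word.docx, workDir gives uploadPath/word and zipPath gives uploadPath/word.ZIP, using whichever separator the host system expects.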

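The screenshot below shows the output directory:

docx4j logs quite a lot internally. When testing with the default demo, the output file contains the following notice:

The notice roughly says: to hide this message, adjust the debug log level for docx4j. The fix is to set the docx4j log level to ERROR in the project's log4j.properties, for example: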
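A minimal sketch of such an entry, assuming log4j 1.x property syntax and that the docx4j loggers live under the org.docx4j package; adjust it to the logging setup the project actually uses:

```properties
# log4j.properties: only let ERROR-level messages from docx4j through
log4j.logger.org.docx4j=ERROR
```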

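If the project is managed with Maven, just add the docx4j dependency to pom.xml. If you configure docx4j and its dependencies by hand, make sure the dependency versions match your docx4j version (I recommend docx4j 3.3.5, which gives somewhat better results!), otherwise you will run into all sorts of problems~ The screenshot below shows some notes from the Maven repository; if you configure the dependencies manually, be sure to click through and read them: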

The following are excerpts from the Excel parsing code (the code is incomplete; email me if you need the rest):

(Snippet 1)

    /**
     * @param file       source file, e.g. c://xx//xx.xlsx
     * @param uploadPath base directory path
     * @return information about the generated ZIP file
     * @throws Exception if building the HTML or the ZIP fails
     */
    public ProcessFileInfo processXLSX(File file,String uploadPath)throws Exception {
        List<String> sheets=Excel2HtmlUtils.readExcelToHtml(file.getPath());
        FileUtils.forceMkdir(new File(uploadPath));// create the directory
        String code=file.getName().substring(0,file.getName().lastIndexOf("."));// file name without extension
        String basePath=String.format("%s%s%s", uploadPath,File.separator,code);
        FileUtils.forceMkdir(new File(basePath));
        File htmlFile = new File(String.format("%s%s%s", basePath,File.separator,"index.html"));
        Writer fw=null;
        PrintWriter bw=null;
        // build the html file
        try{
             fw= new BufferedWriter( new OutputStreamWriter(new FileOutputStream(htmlFile.getPath()),"UTF-8"));
             bw=new PrintWriter(fw);
             // add the document head and the scaling style
            String head="<!DOCTYPE html><html><head><meta charset=\"UTF-8\"></head><body style=\"transform: scale(0.7,0.7);-webkit-transform: scale(0.7,0.7);\">";
            StringBuilder body=new StringBuilder();
            for (String e : sheets) {
                body.append(e);
            }
            String foot="</body></html>";
            bw.write(String.format("%s%s%s", head,body.toString(),foot));
        }catch(Exception e){
            throw new Exception("failed to write "+htmlFile.getPath(),e);// rethrow, keeping the original cause
        }finally{
            if (bw != null) {
                bw.close();
            }
            if(fw!=null){
                fw.close();
            }
        }
        String htmlZipFile=String.format("%s%s%s.%s",uploadPath,File.separator,file.getName().substring(0,file.getName().lastIndexOf(".")),"ZIP");
        // zip the result
        scormService.zip(basePath, htmlZipFile);
        file.delete();// delete the uploaded xlsx file
        FileUtils.forceDelete(new File(basePath));
        return new ProcessFileInfo(true,new File(htmlZipFile).getName(),htmlZipFile);
    }
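The writer handling in processXLSX can be written more compactly with try-with-resources, which closes the writer on both success and failure. This is only a sketch of that idea using java.nio.file; the class HtmlPageWriter and its method writePage are hypothetical names, and the head and foot strings are copied from the snippet above.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of the original project.
final class HtmlPageWriter {
    private HtmlPageWriter() {}

    static final String HEAD = "<!DOCTYPE html><html><head><meta charset=\"UTF-8\"></head>"
            + "<body style=\"transform: scale(0.7,0.7);-webkit-transform: scale(0.7,0.7);\">";
    static final String FOOT = "</body></html>";

    /** Writes HEAD, every fragment in order, then FOOT to target as UTF-8. */
    static void writePage(Path target, List<String> fragments) throws IOException {
        // Create the parent directory if it does not exist yet.
        Files.createDirectories(target.toAbsolutePath().getParent());
        // The writer is closed automatically, even if a write fails.
        try (Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write(HEAD);
            for (String fragment : fragments) {
                out.write(fragment);
            }
            out.write(FOOT);
        }
    }
}
```

Writing the fragments one by one also avoids building the whole page as a single string first, which matters a little when the sheets are large.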

(Snippet 2)

    /**
     * Entry point.
     *
     * @param filePath path of the file
     * @return a list of "<table>...</table>" strings (null if the file could not be read)
     */
    public static List<String> readExcelToHtml(String filePath) {
            List<String> htmlExcel=null;
            // try-with-resources closes the input stream once the workbook has been read
            try (InputStream is = new FileInputStream(new File(filePath))) {
                Workbook wb = WorkbookFactory.create(is);
                htmlExcel = getExcelToHtml(wb);
            } catch (EncryptedDocumentException e) {
                e.printStackTrace();
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (InvalidFormatException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        return htmlExcel;
    }

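The above shows only the content wrapping for xlsx files and the entry method for Excel parsing. The whole parser lives in the utils package; the service just calls the method and passes arguments, as shown in the screenshot below:

The Excel parsing utilities consist of four classes. Excel2HtmlUtils is the entry class; the other three support it in handling Excel styles. Note: when processing Excel, the utility must limit the number of records it handles to avoid out-of-memory errors. By the way, if the generated HTML is meant for mobile devices, consider giving it a scaled-down size => transform: scale(0.7,0.7);-webkit-transform: scale(0.7,0.7);.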
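The record limit mentioned above could look roughly like the following. This is a sketch only: it works on rows that have already been read into plain lists, the class RowLimitedTable and its methods are hypothetical, and it caps the size of the generated HTML without changing how much memory POI itself needs to load the workbook.

```java
import java.util.List;

// Hypothetical helper, not part of the original utility classes.
final class RowLimitedTable {
    private RowLimitedTable() {}

    /** Renders at most maxRows rows as an HTML table and notes how many rows were skipped. */
    static String toHtml(List<List<String>> rows, int maxRows) {
        StringBuilder html = new StringBuilder("<table>");
        int limit = Math.min(rows.size(), Math.max(maxRows, 0));
        for (List<String> row : rows.subList(0, limit)) {
            html.append("<tr>");
            for (String cell : row) {
                html.append("<td>").append(escape(cell)).append("</td>");
            }
            html.append("</tr>");
        }
        int skipped = rows.size() - limit;
        if (skipped > 0) {
            // Tell the reader that the table was cut off.
            html.append("<tr><td>").append(skipped).append(" more row(s) not shown</td></tr>");
        }
        return html.append("</table>").toString();
    }

    /** Escapes the characters that would otherwise be read as HTML markup. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

A suitable value for maxRows depends on the device and the sheet width, so it is best kept configurable rather than hard-coded.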

With Excel covered, here is the code for turning a PDF (uploaded as a ZIP of images) into HTML. The code is fairly simple, so I will not explain much. This is the implementation (an excerpt from the method body):

        /**
         * Order the images by the number in their file names:
         *  a> extract the digits from each file name into an int array (the sequence)
         *  b> check that the number of distinct sequence values equals the number of files; if not, return a failure result
         *  c> sort the sequence array in ascending order
         *  d> walk the sequence array, look up each file name (value) in the Map and write the html
         */
        String nm=null;
        int[] i=new int[imgNames.size()];
        Map<Integer,String> names=new HashMap<Integer,String>();
        Pattern p=Pattern.compile("[^0-9]");
        for(int j=0;j<imgNames.size();j++){
            nm=imgNames.get(j).substring(0,imgNames.get(j).lastIndexOf("."));// name without extension
            String idx=p.matcher(nm).replaceAll("").trim();
            i[j]=Integer.parseInt("".equals(idx)?"0":idx);
            names.put(i[j],imgNames.get(j));
        }
        if(names.keySet().size()!=i.length){
            //System.out.println("==== please check your image numbering ====");/* duplicate or missing numbers */
            return new ProcessFileInfo(false,null,null);
        }
        Arrays.sort(i);// sort the int array in ascending order

        // wrap the images in html
        StringBuilder html=new StringBuilder();
        html.append("<!DOCTYPE html><html><head><meta charset='UTF-8'><title>PDF</title></head>");
        html.append("<body style=\"margin:0px 0px;padding:0px 0px;\">");
        for (int  k : i) {
            // an img src is a URL, so it always uses "/" rather than File.separator
            html.append(String.format("%s%s%s%s%s","<div style=\"width:100%;\"><img src=\"./",fileName,"/",names.get(k),"\"  style=\"width:100%;\" /></div>"));
        }
        html.append("</body></html>");
        File indexFile=new File(String.format("%s%s%s",basePath,File.separator,"index.html"));
        Writer fw=null;
        PrintWriter bw=null;
        // build the file (write the html into index.html)
        try{
             fw= new BufferedWriter( new OutputStreamWriter(new FileOutputStream(indexFile),"UTF-8"));// write the file as UTF-8
             bw=new PrintWriter(fw);
             bw.write(html.toString());
        }catch(Exception e){
            throw new Exception("failed to write "+indexFile.getPath(),e);// rethrow, keeping the original cause
        }finally{
            if (bw != null) {
                bw.close();
            }
            if(fw!=null){
                fw.close();
            }
        }
        String zipFilePath=String.format("%s%s%s.%s", uploadPath,File.separator,file.hashCode(),"ZIP");
        scormService.zip(basePath, zipFilePath);
        // delete the files
        file.delete();
        FileUtils.forceDelete(new File(basePath));
        return new ProcessFileInfo(true,new File(zipFilePath).getName(),zipFilePath);
    }
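The ordering steps a> to d> in the comment above can also be expressed with a TreeMap, which keeps its keys sorted and makes duplicates easy to detect. The sketch below is one possible variant, not the project's code: the class PageOrder and its method are hypothetical, and unlike the snippet above it rejects names without any digits instead of treating them as page 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the original project.
final class PageOrder {
    private PageOrder() {}

    /**
     * Returns the image names ordered by the digits in each name (extension excluded).
     * Throws IllegalArgumentException if a name has no digits, the digits do not
     * fit into an int, or two names carry the same number.
     */
    static List<String> sortByNumber(List<String> imgNames) {
        Map<Integer, String> byNumber = new TreeMap<>();
        for (String name : imgNames) {
            int dot = name.lastIndexOf('.');
            String stem = dot > 0 ? name.substring(0, dot) : name;
            String digits = stem.replaceAll("[^0-9]", "");
            if (digits.isEmpty()) {
                throw new IllegalArgumentException("no page number in " + name);
            }
            int number = Integer.parseInt(digits);
            // put returns the previous value when the key was already present.
            if (byNumber.put(number, name) != null) {
                throw new IllegalArgumentException("duplicate page number " + number);
            }
        }
        // TreeMap iterates in ascending key order.
        return new ArrayList<>(byNumber.values());
    }
}
```

Sorting by the parsed number rather than by the raw name is what keeps page10.png after page2.png.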

As with Excel, since I save the PPT as mp4, only a simple wrapper is needed after upload. When handling it, pay close attention to the relative reference to the video in the HTML. This is the implementation:

    /**
     * @param file       the uploaded file, e.g. c://xx.//xxx.mp4
     * @param uploadPath base directory path where the html is stored
     * @return information about the generated ZIP file
     * @throws Exception if building the HTML or the ZIP fails
     */
    public ProcessFileInfo processPPTX(File file,String uploadPath)throws Exception{
        String fileName=file.getName().substring(0,file.getName().lastIndexOf("."));// file name without extension
        String suffix=file.getName().substring(file.getName().lastIndexOf(".")+1,file.getName().length()).toLowerCase();// video file extension
        String basePath=String.format("%s%s%s", uploadPath,File.separator,fileName);
        FileUtils.forceMkdir(new File(basePath));
        // copy the video file into basePath
        String videoPath=String.format("%s%s%s", basePath,File.separator,file.getName());
        FileUtils.copyFile(file, new File(videoPath));
        StringBuilder html=new StringBuilder();
        html.append("<!DOCTYPE html><html><head><meta charset='utf-8'><title>powerpoint</title></head>");
        html.append("<body style=\"margin:0px 0px;\"><div style=\"width:100%;margin:auto 0% auto 0%;\">");
        html.append("<video controls=\"controls\"  width=\"100%\"  height=\"100%\" name=\"media\" >");// no poster image
        // src is the copied file's own name, relative to index.html; the MIME type is video/, not audio/
        html.append(String.format("%s%s%s%s%s%s","<source src=\"",file.getName(),"\" type=\"video/",suffix,"\" >","</video></div>"));
        html.append("</body></html>");// closing tags
        File indexFile=new File(String.format("%s%s%s",basePath,File.separator,"index.html"));
        Writer fw=null;
        PrintWriter bw=null;
        // build the file (write the html into index.html)
        try{
             fw= new BufferedWriter( new OutputStreamWriter(new FileOutputStream(indexFile),"UTF-8"));// write the file as UTF-8
             bw=new PrintWriter(fw);
             bw.write(html.toString());
        }catch(Exception e){
            throw new Exception("failed to write "+indexFile.getPath(),e);// rethrow, keeping the original cause
        }finally{
            if (bw != null) {
                bw.close();
            }
            if(fw!=null){
                fw.close();
            }
        }
        String zipFilePath=String.format("%s%s%s.%s", uploadPath,File.separator,fileName,"ZIP");
        scormService.zip(basePath, zipFilePath);
        // delete the files
        file.delete();
        FileUtils.forceDelete(new File(basePath));
        return new ProcessFileInfo(true,new File(zipFilePath).getName(),zipFilePath);
    }
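The relative video reference can be isolated in a small helper. The sketch below is an illustration under a few assumptions: the class VideoPage is hypothetical, the video sits in the same directory as index.html, the file name contains no characters that need escaping in HTML, and only three common container types are mapped to their MIME types.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of the original project.
final class VideoPage {
    private VideoPage() {}

    private static final Map<String, String> MIME_TYPES = new HashMap<>();
    static {
        MIME_TYPES.put("mp4", "video/mp4");
        MIME_TYPES.put("webm", "video/webm");
        MIME_TYPES.put("ogv", "video/ogg");
    }

    /** Builds a page that plays videoFileName, referenced relative to the page itself. */
    static String html(String videoFileName) {
        int dot = videoFileName.lastIndexOf('.');
        String suffix = videoFileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        String mimeType = MIME_TYPES.get(suffix);
        if (mimeType == null) {
            throw new IllegalArgumentException("unsupported video type: " + videoFileName);
        }
        // The src keeps the file name exactly as stored, so it also resolves
        // on case-sensitive file systems.
        return "<!DOCTYPE html><html><head><meta charset='utf-8'><title>powerpoint</title></head>"
                + "<body style=\"margin:0px 0px;\"><video controls width=\"100%\">"
                + "<source src=\"" + videoFileName + "\" type=\"" + mimeType + "\">"
                + "</video></body></html>";
    }
}
```

Only the extension is lower-cased for the MIME lookup; the src attribute uses the name unchanged, so a file stored as Demo.MP4 is still found.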

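Although the requirement was eventually changed to the simplest possible implementation, the approaches worked out during this nearly wasted effort are still worth sharing. If the above helps even one developer, it was well worth it.

Please credit the source when reposting: http://www.cnblogs.com/funnyzpc/p/7225988.html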
