Using xpdf 4.0 (Detailed)

Posted by kevin-onepiece on 2018-04-11

Original article: https://blog.csdn.net/weixin_40503094/article/details/79904155

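1. Download the xpdf tools and the Chinese Simplified support package. (At first I downloaded the source code but could not find pdftotext.exe; it turned out to be in the tools package.)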

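2. Create a folder named xpdf and extract the tools archive into it. Copy all the files in bin64 into xpdf, then copy sample-xpdfrc from doc into xpdf and rename it to xpdfrc. Finally, copy the xpdf-chinese-simplified folder into xpdf.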

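3. Edit the xpdfrc file:

(1) On line 73 of xpdfrc, uncomment the textEncoding UTF-8 line so that the output encoding is set to UTF-8.

(2) Below it, add the setting textPageBreaks no, which means that no page break is inserted between the pages of the PDF document.

(3) Append the following lines to the end of the file, replacing the paths with your own: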

#----- begin Chinese Simplified support package (2011-sep-02)
cidToUnicode Adobe-GB1 E:/xpdf/xpdf-chinese-simplified/Adobe-GB1.cidToUnicode
unicodeMap ISO-2022-CN E:/xpdf/xpdf-chinese-simplified/ISO-2022-CN.unicodeMap
unicodeMap EUC-CN E:/xpdf/xpdf-chinese-simplified/EUC-CN.unicodeMap
unicodeMap GBK E:/xpdf/xpdf-chinese-simplified/GBK.unicodeMap
cMapDir Adobe-GB1 E:/xpdf/xpdf-chinese-simplified/CMap
toUnicodeDir E:/xpdf/xpdf-chinese-simplified/CMap
#displayCIDFontTT Adobe-GB1 E:/xpdf/xpdf-chinese-simplified/CMap/gkai00mp.ttf
#fontFileCC Adobe-GB1 /usr/..../gkai00mp.ttf
#----- end Chinese Simplified support package
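
A wrong path in these lines is easy to miss, so here is a rough sketch of a checker that reads xpdfrc and reports whether each referenced file or folder exists. The class name XpdfrcPathCheck and the default location E:/xpdf/xpdfrc are my own assumptions for illustration, and the parsing is deliberately simple: it assumes the paths are unquoted and contain no spaces.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class XpdfrcPathCheck {
    // Directives from the block above whose last field is a file or directory path
    private static final List<String> PATH_DIRECTIVES =
            Arrays.asList("cidToUnicode", "unicodeMap", "cMapDir", "toUnicodeDir");

    /** Returns the path field of every active (uncommented) path directive. */
    static List<String> extractPaths(List<String> lines) {
        List<String> paths = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] fields = trimmed.split("\\s+");
            if (fields.length >= 2 && PATH_DIRECTIVES.contains(fields[0])) {
                // Assumption: the path is unquoted and contains no spaces
                paths.add(fields[fields.length - 1]);
            }
        }
        return paths;
    }

    public static void main(String[] args) throws IOException {
        // Assumed location, matching the E:/xpdf layout used in this post
        Path xpdfrc = Paths.get(args.length > 0 ? args[0] : "E:/xpdf/xpdfrc");
        List<String> lines = Files.readAllLines(xpdfrc, StandardCharsets.UTF_8);
        for (String p : extractPaths(lines)) {
            String status = Files.exists(Paths.get(p)) ? "OK       " : "MISSING  ";
            System.out.println(status + p);
        }
    }
}
```

Run it with the path to your own xpdfrc as the first argument. Any line reported as MISSING points to a path that still needs to be changed.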

4. Next, the code:

import java.io.BufferedInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.logging.Level;
import java.util.logging.Logger;
/**
*
* @author Angela
*/
public class Pdf4Text {
    /**
     * Extracts the text of a PDF and prints it to standard output.
     * @param file path to the PDF file
     * @param isLayout whether to preserve the original layout
     */
    public static void extractTXT(String file, boolean isLayout) {
        // Path to the pdftotext executable
        String PATH_TO_XPDF = "E:\\xpdf\\pdftotext.exe";
        File pdffile = new File(file);
        // Absolute path of the PDF file
        String source_absolutePath = pdffile.getAbsolutePath();
        // "-" sends the text to stdout instead of saving it to a file
        String target_absolutePath = "-";
        // -enc sets the output encoding
        String encoding = "-enc";
        String character = "UTF-8";
        // -q suppresses all messages and errors
        String mistake = "-q";
        // -nopgbrk omits the page breaks between pages
        String nopagebrk = "-nopgbrk";
        // Command line. -layout (preserve the original layout) is included only
        // when isLayout is true; an empty "" placeholder could otherwise be
        // passed to pdftotext as a separate empty argument on some platforms.
        String[] cmd = isLayout
                ? new String[] { PATH_TO_XPDF, "-layout", encoding, character,
                        mistake, nopagebrk, source_absolutePath, target_absolutePath }
                : new String[] { PATH_TO_XPDF, encoding, character,
                        mistake, nopagebrk, source_absolutePath, target_absolutePath };
        try {
            Process p = Runtime.getRuntime().exec(cmd);
            // Read the standard output of the pdftotext process
            BufferedInputStream bis = new BufferedInputStream(p.getInputStream());
            InputStreamReader reader = new InputStreamReader(bis, character);
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[10000];
            int len;
            while ((len = reader.read(buf)) >= 0) {
                // Append only the characters actually read in this pass
                sb.append(buf, 0, len);
            }
            System.out.println(sb.toString());
            reader.close();
        } catch (IOException ex) {
            Logger.getLogger(Pdf4Text.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
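    /**
     * Extracts the text of a PDF and saves it to a text file.
     * @param file path to the PDF file
     * @param savePath path where the text is saved
     * @param isLayout whether to preserve the original layout
     */
    public static void extractTXT(String file, String savePath, boolean isLayout) {
        // Path to the pdftotext executable
        String PATH_TO_XPDF = "E:\\xpdf\\pdftotext.exe";
        File pdffile = new File(file);
        // Absolute path of the PDF file
        String source_absolutePath = pdffile.getAbsolutePath();
        File targetfile = new File(savePath);
        // Absolute path of the output text file
        String target_absolutePath = targetfile.getAbsolutePath();
        // -enc sets the output encoding
        String encoding = "-enc";
        String character = "UTF-8";
        // -q suppresses all messages and errors
        String mistake = "-q";
        // -nopgbrk omits the page breaks between pages
        String nopagebrk = "-nopgbrk";
        // Command line. -layout (preserve the original layout) is included only
        // when isLayout is true, so no empty "" argument is ever passed.
        String[] cmd = isLayout
                ? new String[] { PATH_TO_XPDF, "-layout", encoding, character,
                        mistake, nopagebrk, source_absolutePath, target_absolutePath }
                : new String[] { PATH_TO_XPDF, encoding, character,
                        mistake, nopagebrk, source_absolutePath, target_absolutePath };
        try {
            Process p = Runtime.getRuntime().exec(cmd);
            // Wait for pdftotext to finish so the output file is complete on return
            p.waitFor();
        } catch (IOException ex) {
            Logger.getLogger(Pdf4Text.class.getName()).log(Level.SEVERE, null, ex);
        } catch (InterruptedException ex) {
            // Restore the interrupt flag so callers can still observe it
            Thread.currentThread().interrupt();
        }
    }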
    public static void main(String args[]) {
        String file = "E:\\three body.pdf";
        String savePath = "E:\\three body.txt";
        long startTime = System.currentTimeMillis();
        extractTXT(file, savePath, true);
        long endTime = System.currentTimeMillis();
        System.out.println("Time taken to read and write: " + (endTime - startTime) + "ms");
    }
}
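
Both methods above build the same command line, so here is a minimal sketch of how that part could be shared and run with ProcessBuilder instead of Runtime.exec. The class name PdftotextCommand and its two methods are hypothetical names of my own; the options are the same ones used above (-layout, -enc, -q, -nopgbrk). Treat it as one possible arrangement, not the only way to do it.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PdftotextCommand {
    /** Builds the argument list; -layout is added only when requested. */
    static List<String> buildCommand(String exe, String pdf, String txt, boolean keepLayout) {
        List<String> cmd = new ArrayList<>();
        cmd.add(exe);
        if (keepLayout) {
            cmd.add("-layout");
        }
        cmd.add("-enc");
        cmd.add("UTF-8");
        cmd.add("-q");
        cmd.add("-nopgbrk");
        cmd.add(pdf);
        cmd.add(txt);
        return cmd;
    }

    /** Runs the command and returns its exit code once it has finished. */
    static int run(List<String> cmd) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.inheritIO(); // let any output go straight to this program's console
        return pb.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Paths taken from the example in this post; adjust them to your machine
        List<String> cmd = buildCommand("E:\\xpdf\\pdftotext.exe",
                "E:\\three body.pdf", "E:\\three body.txt", true);
        int exit = run(cmd);
        System.out.println("pdftotext exited with code " + exit);
    }
}
```

Because run waits for the process, a caller can check the exit code before reading the output file, and a nonzero code is a hint that the conversion did not succeed.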

Of course, you can also use the other executables, such as pdftohtml.exe, pdftopng.exe, pdftoppm.exe, and pdftops.exe, to convert PDFs into other formats.
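
As a rough illustration, the sketch below calls pdftopng from Java in the same way. It assumes the usual pdftopng usage of options followed by the PDF file and an output name prefix, with -f and -l for the first and last page and -r for the resolution in DPI; please check these against the documentation shipped with your copy. The class name PdftopngCommand, the output prefix E:\three body, and the page range and resolution are made up for the example.

```java
import java.util.Arrays;
import java.util.List;

public class PdftopngCommand {
    /** Builds a pdftopng argument list for a page range at a given resolution. */
    static List<String> buildCommand(String exe, String pdf, String pngRoot,
                                     int firstPage, int lastPage, int dpi) {
        return Arrays.asList(exe,
                "-f", String.valueOf(firstPage),
                "-l", String.valueOf(lastPage),
                "-r", String.valueOf(dpi),
                pdf, pngRoot);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical paths and values, following the E:\xpdf layout of this post
        List<String> cmd = buildCommand("E:\\xpdf\\pdftopng.exe",
                "E:\\three body.pdf", "E:\\three body", 1, 3, 150);
        int exit = new ProcessBuilder(cmd).inheritIO().start().waitFor();
        System.out.println("pdftopng exited with code " + exit);
    }
}
```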

If you have any questions, leave a comment below and I will reply as soon as I see it.
