Reading PDF, HTML, Word, RTF, TXT, PowerPoint, and Excel Documents in Lucene Development

Posted by iteye_9300 on 2010-03-23

Original URL: https://blog.csdn.net/iteye_9300/article/details/81833190

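These seven document types are, I believe, the ones you will meet most often.

POI comes up repeatedly below, so here is a brief introduction first.

POI handles Word and Excel well: http://jakarta.apache.org/poi/

POI needs at least a handful of JAR packages; the set used for the 2007 formats is listed in the header comment of the WordAndExcelExtractor class below.

PDFBox handles PDF well: http://pdfbox.apache.org/download.html

Each format is covered in turn below.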
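Because every format needs its own reader, an indexer usually has to decide which reader to call for a given file. The following is a minimal sketch of that decision based on the file extension. The class name DocTypes, the enum DocType, and the method detect are hypothetical names introduced for illustration; they do not come from the original project or from any library.

```java
import java.util.Locale;

// Hypothetical helper for illustration: maps a file name to the kind of
// reader it needs. Not part of POI, PDFBox, or Lucene.
class DocTypes {
    enum DocType {
        WORD_2003, WORD_2007, EXCEL_2003, EXCEL_2007,
        POWERPOINT, PDF, HTML, TXT, RTF, UNKNOWN
    }

    static DocType detect(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return DocType.UNKNOWN; // no extension at all
        }
        // Compare case-insensitively so "A.DOCX" and "a.docx" behave the same
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "doc":  return DocType.WORD_2003;
            case "docx": return DocType.WORD_2007;
            case "xls":  return DocType.EXCEL_2003;
            case "xlsx": return DocType.EXCEL_2007;
            case "ppt":  return DocType.POWERPOINT;
            case "pdf":  return DocType.PDF;
            case "htm":
            case "html": return DocType.HTML;
            case "txt":  return DocType.TXT;
            case "rtf":  return DocType.RTF;
            default:     return DocType.UNKNOWN;
        }
    }

    public static void main(String[] args) {
        System.out.println(detect("D:/Hello2007.xlsx")); // prints EXCEL_2007
    }
}
```

An extension is only a hint, so a caller would still want to handle the case where the chosen reader rejects the file.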

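Sections 1 and 2 support only the 2003 formats of Word and Excel (.doc and .xls).

First, Word documents:
I use POI here. Download the relevant JARs yourself and add them to the project (the same applies to every JAR used below, so I will not repeat it).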

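import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.poi.hwpf.extractor.WordExtractor;

public static String readWord(String path) throws Exception {
    // try-with-resources closes the stream even when extraction fails;
    // errors propagate to the caller instead of being swallowed
    try (InputStream is = new FileInputStream(path)) {
        // WordExtractor reads the binary .doc (2003) format only
        return new WordExtractor(is).getText();
    }
}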

Second, Excel documents:

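import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.poi.hssf.extractor.ExcelExtractor;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;

public static String ReadExcel(String path) throws IOException {
    // try-with-resources closes the stream; a missing file now surfaces as an
    // IOException instead of a silent null return
    try (InputStream inputStream = new FileInputStream(path)) {
        HSSFWorkbook wb = new HSSFWorkbook(inputStream); // .xls (2003) only
        ExcelExtractor extractor = new ExcelExtractor(wb);
        extractor.setFormulasNotResults(true);  // emit formula text rather than results
        extractor.setIncludeSheetNames(false);  // leave sheet names out of the text
        return extractor.getText();
    }
}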
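The two readers above assume the 2003 binary formats, while the 2007 formats need different classes. A file extension can be wrong, so one option is to look at the first bytes of the file: the 2003 formats are OLE2 compound files, and the 2007 formats are ZIP archives. The sketch below checks those two signatures. ContainerSniffer and its members are hypothetical names for illustration, and a ZIP signature only shows that the file is a ZIP archive, not that it is a valid .docx or .xlsx.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration; not part of POI or the original post.
class ContainerSniffer {
    enum Container { OLE2, ZIP, UNKNOWN }

    // Signature of an OLE2 compound file (.doc, .xls, .ppt)
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // Local-file-header signature that starts a ZIP archive (.docx, .xlsx)
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // Peeks at the first 8 bytes, then rewinds so a parser can read from the start.
    static Container sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(8);
        byte[] head = new byte[8];
        int filled = 0;
        while (filled < head.length) {
            int n = in.read(head, filled, head.length - filled);
            if (n < 0) {
                break; // stream ended before 8 bytes were available
            }
            filled += n;
        }
        in.reset();
        if (startsWith(head, filled, OLE2_MAGIC)) return Container.OLE2;
        if (startsWith(head, filled, ZIP_MAGIC)) return Container.ZIP;
        return Container.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int length, int[] magic) {
        if (length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] zipStart = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00};
        System.out.println(sniff(new ByteArrayInputStream(zipStart))); // prints ZIP
    }
}
```

A FileInputStream does not support mark/reset, so in practice it would be wrapped in a java.io.BufferedInputStream before being passed to sniff.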

Working with the 2007 formats of Word and Excel:

package com.test;

   
/**   
 * Required JARs:
 * poi-3.0.2-FINAL-20080204.jar   
 * poi-contrib-3.0.2-FINAL-20080204.jar   
 * poi-scratchpad-3.0.2-FINAL-20080204.jar   
 * poi-3.5-beta6-20090622.jar   
 * geronimo-stax-api_1.0_spec-1.0.jar   
 * ooxml-schemas-1.0.jar   
 * openxml4j-bin-beta.jar   
 * poi-ooxml-3.5-beta6-20090622.jar   
 * xmlbeans-2.3.0.jar   
 * dom4j-1.6.1.jar   
 */   
   
import java.io.FileInputStream;    
import java.io.IOException;    
import java.io.InputStream;    
   
import org.apache.poi.POIXMLDocument;    
import org.apache.poi.POIXMLTextExtractor;    
import org.apache.poi.hssf.usermodel.HSSFCell;    
import org.apache.poi.hssf.usermodel.HSSFRow;    
import org.apache.poi.hssf.usermodel.HSSFSheet;    
import org.apache.poi.hssf.usermodel.HSSFWorkbook;    
import org.apache.poi.hwpf.extractor.WordExtractor;    
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;    
import org.apache.poi.openxml4j.opc.OPCPackage;    
import org.apache.poi.xssf.usermodel.XSSFCell;    
import org.apache.poi.xssf.usermodel.XSSFRow;    
import org.apache.poi.xssf.usermodel.XSSFSheet;    
import org.apache.poi.xssf.usermodel.XSSFWorkbook;    
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;    
import org.apache.xmlbeans.XmlException;    
   
public class WordAndExcelExtractor {    
 public static void main(String[] args){    
  try{    
   String wordFile = "D:/松山血戰.docx";    
   String wordText2007 = WordAndExcelExtractor.extractTextFromDOC2007(wordFile);    
   System.out.println("wordText2007======="+wordText2007);    
       
   InputStream is = new FileInputStream("D:/XXX研發中心技術崗位職位需求.xls");       
   String excelText = WordAndExcelExtractor.extractTextFromXLS(is);       
   System.out.println("text2003==========" + excelText);    
       
   String excelFile = "D:/Hello2007.xlsx";       
   String excelText2007 = WordAndExcelExtractor.extractTextFromXLS2007(excelFile);    
   System.out.println("excelText2007==========" + excelText2007);    
   
       
  }catch(Exception e ){    
   e.printStackTrace();    
  }    
 }    
     
 /**
  * @Method: extractTextFromDOC
  * @Description: Extract plain text from a Word 2003 (.doc) document
  *
  * @param is InputStream of the .doc file
  * @return String
  * @throws IOException if the stream cannot be read
  */
    public static String extractTextFromDOC(InputStream is) throws IOException {
        WordExtractor ex = new WordExtractor(is); // is: InputStream of the Word file

        return ex.getText();
    }
     
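 /**
  * @Method: extractTextFromDOC2007
  * @Description: Extract plain text from a Word 2007 (.docx) document
  *
  * @param fileName path of the .docx file
  * @return String
  * @throws IOException, OpenXML4JException, XmlException
  */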
    public static String extractTextFromDOC2007(String fileName) throws IOException, OpenXML4JException, XmlException {    
     OPCPackage opcPackage = POIXMLDocument.openPackage(fileName);    
     POIXMLTextExtractor ex = new XWPFWordExtractor(opcPackage);       
   
        return ex.getText();    
    }    
     
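 /**
  * @Method: extractTextFromXLS
  * @Description: Extract plain text from an Excel 2003 (.xls) document
  *
  * @param is InputStream of the .xls file
  * @return String
  * @throws IOException if the stream cannot be read
  */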
    @SuppressWarnings("deprecation")    
 private static String extractTextFromXLS(InputStream is)    
        throws IOException {    
        StringBuffer content  = new StringBuffer();    
        HSSFWorkbook workbook = new HSSFWorkbook(is); // open the Excel workbook
   
        for (int numSheets = 0; numSheets < workbook.getNumberOfSheets(); numSheets++) {    
            if (null != workbook.getSheetAt(numSheets)) {    
                HSSFSheet aSheet = workbook.getSheetAt(numSheets); // get one sheet
   
                for (int rowNumOfSheet = 0; rowNumOfSheet <= aSheet.getLastRowNum(); rowNumOfSheet++) {    
                    if (null != aSheet.getRow(rowNumOfSheet)) {    
                        HSSFRow aRow = aSheet.getRow(rowNumOfSheet); // get one row
   
                        // getLastCellNum() is the last cell index plus one, so loop with '<'
                        for (int cellNumOfRow = 0; cellNumOfRow < aRow.getLastCellNum(); cellNumOfRow++) {
                            if (null != aRow.getCell(cellNumOfRow)) {
                                HSSFCell aCell = aRow.getCell(cellNumOfRow); // get the cell

                                if(aCell.getCellType() == HSSFCell.CELL_TYPE_NUMERIC){
                                 content.append(aCell.getNumericCellValue());
                                }else if(aCell.getCellType() == HSSFCell.CELL_TYPE_BOOLEAN){
                                 content.append(aCell.getBooleanCellValue());
                                }else {
                                 content.append(aCell.getStringCellValue());
                                }
                                content.append(' '); // keep adjacent cell values from fusing into one token
                            }
                        }
                    }    
                }    
            }    
        }    
   
        return content.toString();    
    }    
        
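    /**
     * @Method: extractTextFromXLS2007
     * @Description: Extract plain text from an Excel 2007 (.xlsx) document
     *
     * @param fileName path of the .xlsx file
     * @return String
     * @throws Exception if the workbook cannot be opened or read
     */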
    private static String extractTextFromXLS2007(String fileName) throws Exception{    
     StringBuffer content = new StringBuffer();    
         
     // Construct the XSSFWorkbook from the file path
  XSSFWorkbook xwb = new XSSFWorkbook(fileName);    
      
  // Loop over the sheets
  for(int numSheet = 0; numSheet < xwb.getNumberOfSheets(); numSheet++){    
   XSSFSheet xSheet = xwb.getSheetAt(numSheet);     
   if(xSheet == null){    
    continue;    
   }    
       
   // Loop over the rows
   for(int rowNum = 0; rowNum <= xSheet.getLastRowNum(); rowNum++){    
    XSSFRow xRow = xSheet.getRow(rowNum);    
    if(xRow == null){    
     continue;    
    }    
        
    // Loop over the cells; getLastCellNum() is the last index plus one, so use '<'
    for(int cellNum = 0; cellNum < xRow.getLastCellNum(); cellNum++){
     XSSFCell xCell = xRow.getCell(cellNum);
     if(xCell == null){
      continue;
     }

     if(xCell.getCellType() == XSSFCell.CELL_TYPE_BOOLEAN){
      content.append(xCell.getBooleanCellValue());
     }else if(xCell.getCellType() == XSSFCell.CELL_TYPE_NUMERIC){
      content.append(xCell.getNumericCellValue());
     }else{
      content.append(xCell.getStringCellValue());
     }
     content.append(' '); // keep adjacent cell values from fusing into one token
    }
   }    
  }    
      
  return content.toString();    
    }    
        
}
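One detail of the cell loops above: getNumericCellValue() returns a double, so appending it directly writes a cell containing 42 as "42.0", and a search for "42" may then miss it depending on the analyzer. A minimal sketch of one way to normalize the value before appending it follows. CellText and formatNumeric are hypothetical names for illustration, and the sketch ignores Excel cell formats (dates, for example, are also stored as numbers).

```java
import java.math.BigDecimal;

// Hypothetical helper for illustration; not part of POI or the original post.
class CellText {
    // Renders 42.0 as "42" and 3.5 as "3.5", without scientific notation.
    static String formatNumeric(double value) {
        if (Double.isNaN(value) || Double.isInfinite(value)) {
            return String.valueOf(value); // BigDecimal cannot represent these
        }
        // BigDecimal.valueOf goes through Double.toString, so 0.1 stays "0.1"
        return BigDecimal.valueOf(value).stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(formatNumeric(42.0)); // prints 42
        System.out.println(formatNumeric(3.5));  // prints 3.5
    }
}
```

In the loops above, the call would replace the direct append, for example content.append(CellText.formatNumeric(aCell.getNumericCellValue())).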

Third, PowerPoint documents:

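import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.poi.hslf.HSLFSlideShow;
import org.apache.poi.hslf.model.Slide;
import org.apache.poi.hslf.model.TextRun;
import org.apache.poi.hslf.usermodel.SlideShow;

public static String readPowerPoint(String path) {
    StringBuilder content = new StringBuilder();
    // try-with-resources closes the file stream once the slides are parsed
    try (InputStream is = new FileInputStream(path)) {
        // Build the SlideShow from the file's InputStream (.ppt, 2003 format)
        SlideShow ss = new SlideShow(new HSLFSlideShow(is));
        Slide[] slides = ss.getSlides(); // get every slide
        for (int i = 0; i < slides.length; i++) {
            // The TextRuns hold the text content of the slide
            TextRun[] t = slides[i].getTextRuns();
            for (int j = 0; j < t.length; j++) {
                // Append the text; the newline keeps separate runs apart
                content.append(t[j].getText()).append('\n');
            }
        }
    } catch (Exception ex) {
        System.out.println(ex.toString());
    }
    return content.toString();
}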

Fourth, PDF documents:

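import java.io.FileInputStream;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public static String readPdf(String path) throws Exception {
    FileInputStream fis = new FileInputStream(path);
    PDDocument document = null;
    try {
        PDFParser p = new PDFParser(fis); // parser API of the PDFBox 1.x line
        p.parse();
        document = p.getPDDocument();
        return new PDFTextStripper().getText(document).trim();
    } finally {
        // Release the document and the stream even if parsing fails
        if (document != null) document.close();
        fis.close();
    }
}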

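Fifth, HTML documents. Note that to get the TITLE and BODY content of an HTML document, we first have to read the source file and then filter out the tags, which is tedious.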

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.regex.Pattern;

public static String readHtml(String urlString) {
    StringBuilder content = new StringBuilder();
    // Despite its name, urlString is a local file path; the file is read as UTF-8
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
            new FileInputStream(urlString), "utf-8"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            content.append(line).append('\n');
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    String htmlStr = content.toString(); // string that still contains HTML tags
    String textStr = "";
    try {
        // Regex for <script>...</script> blocks. The tail of this pattern and of
        // the next one was cut off in the source; it is restored here by
        // symmetry with the opening tag.
        String regEx_script = "<[\\s]*?script[^>]*?>[\\s\\S]*?<[\\s]*?\\/[\\s]*?script[\\s]*?>";
        // Regex for <style>...</style> blocks
        String regEx_style = "<[\\s]*?style[^>]*?>[\\s\\S]*?<[\\s]*?\\/[\\s]*?style[\\s]*?>";
        String regEx_html = "<[^>]+>"; // regex for any remaining HTML tag
        htmlStr = Pattern.compile(regEx_script, Pattern.CASE_INSENSITIVE)
                .matcher(htmlStr).replaceAll(""); // strip script blocks
        htmlStr = Pattern.compile(regEx_style, Pattern.CASE_INSENSITIVE)
                .matcher(htmlStr).replaceAll(""); // strip style blocks
        htmlStr = Pattern.compile(regEx_html, Pattern.CASE_INSENSITIVE)
                .matcher(htmlStr).replaceAll(""); // strip the remaining tags
        textStr = htmlStr;
    } catch (Exception e) {
        System.err.println("Html2Text: " + e.getMessage());
    }
    return textStr; // return the plain text
}
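The method above returns all of the text in one string, so the TITLE is no longer distinguishable from the BODY. If the title is wanted as its own field, it has to be captured before the tags are stripped. A minimal sketch under that assumption follows; HtmlTitle and extractTitle are hypothetical names for illustration, and a regex like this only copes with simple, well-formed markup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original post.
class HtmlTitle {
    // Captures whatever sits between <title ...> and </title>, in any letter case
    private static final Pattern TITLE = Pattern.compile(
            "<title\\b[^>]*>([\\s\\S]*?)</title\\s*>", Pattern.CASE_INSENSITIVE);

    // Returns the title with whitespace collapsed, or "" when there is none.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        if (!m.find()) {
            return "";
        }
        return m.group(1).replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><TITLE>\n Lucene  notes </TITLE></head><body>x</body></html>";
        System.out.println(extractTitle(html)); // prints Lucene notes
    }
}
```

The same raw source string can then be passed through the tag filtering shown earlier to obtain the body text.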

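Sixth, TXT documents. Pay attention to the encoding when building an index for TXT text.
This project implements combined queries.
The reader's encoding must match the file's encoding. For GBK-encoded files, if this step is not set to GBK the TXT content comes out entirely garbled: BufferedReader reader = new BufferedReader(new InputStreamReader(is, "GBK")); The full code is as follows: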

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public static String readTxt(String path) throws IOException {
    StringBuilder sb = new StringBuilder();
    // Must be set to the file's encoding (GBK here), otherwise the text is garbled;
    // try-with-resources closes the reader and the underlying stream
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
            new FileInputStream(path), "GBK"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line).append('\n');
        }
    }
    return sb.toString().trim();
}
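Hard-coding GBK works only while every TXT file really is GBK. When a collection mixes UTF-8 and GBK files, one common heuristic is to attempt a strict UTF-8 decode first and fall back to GBK if the bytes are not valid UTF-8. The sketch below illustrates that idea; TextDecoding and decode are hypothetical names, and the heuristic can guess wrong, since a short GBK byte sequence may happen to be valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the original post.
class TextDecoding {
    static String decode(byte[] bytes) {
        try {
            // REPORT makes the decoder throw on invalid input instead of
            // silently substituting replacement characters
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: assume the legacy GBK encoding
            return new String(bytes, Charset.forName("GBK"));
        }
    }

    public static void main(String[] args) {
        byte[] gbkBytes = "\u4e2d\u6587".getBytes(Charset.forName("GBK"));
        System.out.println(decode(gbkBytes).equals("\u4e2d\u6587")); // prints true
    }
}
```

The bytes of a file can be obtained with java.nio.file.Files.readAllBytes before calling decode.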

Seventh, RTF documents. RTF conversion is already available in javax (javax.swing.text.rtf).

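import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.rtf.RTFEditorKit;

public static String readRtf(String path) {
    String result = null;
    // try-with-resources closes the file stream after the document is read
    try (InputStream is = new FileInputStream(path)) {
        DefaultStyledDocument styledDoc = new DefaultStyledDocument();
        new RTFEditorKit().read(is, styledDoc, 0);
        // Extract the text; Chinese text has to be re-decoded as GBK,
        // otherwise it comes out garbled
        result = new String(styledDoc.getText(0, styledDoc.getLength())
                .getBytes("ISO-8859-1"), "GBK");
    } catch (IOException e) {
        e.printStackTrace();
    } catch (BadLocationException e) {
        e.printStackTrace();
    }
    return result;
}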
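The getBytes("iso8859-1") followed by a GBK decode in readRtf can look odd at first. It works on the assumption, implied by the code above, that the RTF reader turned each raw byte into one Latin-1 character. ISO-8859-1 maps every byte value to exactly one character and back, so re-encoding with it recovers the original bytes, which can then be decoded with the real charset. A small stdlib-only sketch of that round trip follows; Latin1RoundTrip and repair are hypothetical names for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the original post.
class Latin1RoundTrip {
    // Recovers text whose bytes were wrongly decoded as ISO-8859-1.
    static String repair(String misdecoded, Charset actual) {
        // ISO-8859-1 is a one-to-one byte/char mapping, so this step
        // restores the original bytes without loss
        byte[] rawBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, actual);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String original = "\u4e2d\u6587";
        // Simulate the wrong decode: GBK bytes read one byte per character
        String garbled = new String(original.getBytes(gbk), StandardCharsets.ISO_8859_1);
        System.out.println(repair(garbled, gbk).equals(original)); // prints true
    }
}
```

The trick only helps when the wrong decode was ISO-8859-1; a wrong decode through a lossy charset cannot be undone this way.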
