Searching the contents of local PDF files

Posted by 园糯 on 2024-12-04

Original URL: https://www.cnblogs.com/tellw/p/18587207

import fitz  # PyMuPDF
import re
from pathlib import Path
from colorama import Fore, init
import sys
import os

init()  # let colorama translate ANSI color codes on Windows consoles
 
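def search_pdf(pdf_path, fmt, keywords, context_len=10):
	# Cache file: the source path relative to the home directory, recreated under the current working directory
	relp = os.path.relpath(pdf_path, 'C:/Users/tellw')
	if not os.path.exists(relp):
		ds = os.path.dirname(relp)
		if ds and not os.path.exists(ds):
			os.makedirs(ds)
		if fmt == 'pdf':
			with open(relp, 'w', encoding='utf8') as f:
				# Open the PDF file
				document = fitz.open(pdf_path)
				# Extract each page's text, strip whitespace, lowercase it, and cache it
				for page_num in range(len(document)):
					page = document[page_num]
					text = re.sub(r'\s', '', page.get_text()).lower()
					f.write(text)

				# Close the PDF document
				document.close()
		elif fmt == 'txt':
			with open(pdf_path, 'r', encoding='utf8') as f:
				text = f.read()
			with open(relp, 'w', encoding='utf8') as f:
				f.write(re.sub(r'\s', '', text).lower())  # \s matches any whitespace: spaces, tabs, newlines, full-width (CJK) spaces, etc.
	with open(relp, 'r', encoding='utf8') as f:
		text = f.read()
	# The cached text is lowercased, so lowercase the keywords as well; escape them so they match literally
	keywords = [re.escape(kw.lower()) for kw in keywords]
	search_re = '.{0,20}'.join(keywords)
	search_res = re.findall(f'.{{0,{context_len}}}{search_re}.{{0,{context_len}}}', text)
	# Highlight every keyword in a single pass so a later keyword cannot match inside an inserted color code
	highlight_re = '|'.join(keywords)
	for sr in search_res:
		sr = re.sub(highlight_re, lambda m: f'{Fore.RED}{m.group(0)}{Fore.RESET}', sr)
		print(sr + '\t\t\t\t' + str(pdf_path) + '\n')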

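if len(sys.argv)>=2:
	keywords=sys.argv[1:]
else:
	# No keywords given: print a usage hint instead of exiting silently
	print(f'usage: {sys.argv[0]} KEYWORD [KEYWORD ...]', file=sys.stderr)
	sys.exit(1)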
dirs=[
	'C:/Users/tellw/open_title/file_updates',
	'C:/Users/tellw/open_title/papers/benchmark',
	'C:/Users/tellw/open_title/papers/edge_computing',
	'C:/Users/tellw/open_title/papers/guidance',
	'C:/Users/tellw/open_title/papers/methodology',
	'C:/Users/tellw/open_title/papers/misc',
	'C:/Users/tellw/open_title/papers/other-themes-benchmark',
	'C:/Users/tellw/open_title/papers/speech_recognition',
	'C:/Users/tellw/open_title/papers/test',
	'C:/Users/tellw/open_title/papers/to_c',
	'C:/Users/tellw/open_title/papers/books',
]
pdf_file_paths=[]
txt_file_paths=[]
for d in dirs:
	pdf_file_paths.extend(list(Path(d).glob('*.pdf')))
for d in dirs:
	txt_file_paths.extend(list(Path(d).glob('*.txt')))
os.chdir('C:/Users/tellw/open_title/paper_search_space')
context_len=30
for pdf_file_path in pdf_file_paths:
	search_pdf(pdf_file_path, 'pdf',keywords,context_len)
for txt_file_path in txt_file_paths:
	search_pdf(txt_file_path,'txt',keywords,context_len)

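Baidu's GPT produced the initial code for searching keywords in a PDF file, and I improved on it from there. The search keywords are passed as script arguments, for example ['搜','索','內','容'] (the four characters of "search content"). The script first collects the target documents, which are PDF and txt files. For each one it builds a counterpart file in the search space, with spaces, line breaks, and letter case removed, since none of these matter when displaying results. It then searches that file for the target string using a regular expression such as .{0,30}搜.{0,20}索.{0,20}內.{0,20}容.{0,30}, which allows up to 20 characters between consecutive keywords and up to 30 characters of context on each side.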
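As an illustration of how that pattern is assembled, here is a minimal standalone sketch, not part of the script above. The helper names `normalize`, `build_pattern`, and `search`, the `gap` parameter, and the sample sentence are all hypothetical; escaping keywords with `re.escape` and lowercasing them are assumptions added for the example.

```python
import re

def normalize(text):
    """Strip all whitespace and lowercase, mirroring the search-space step."""
    return re.sub(r'\s', '', text).lower()

def build_pattern(keywords, context_len=30, gap=20):
    """Join keywords with bounded gaps and add bounded context on both sides."""
    middle = f'.{{0,{gap}}}'.join(re.escape(kw.lower()) for kw in keywords)
    return f'.{{0,{context_len}}}{middle}.{{0,{context_len}}}'

def search(text, keywords, context_len=30):
    """Return every snippet of normalized text that contains the keywords in order."""
    return re.findall(build_pattern(keywords, context_len), normalize(text))

# Hypothetical sample text; the line break and capitals vanish after normalization.
sample = "Edge computing keeps\nData near the sensor."
pattern = build_pattern(["edge", "data"], context_len=5)
matches = search(sample, ["Edge", "DATA"], context_len=5)
print(pattern)   # .{0,5}edge.{0,20}data.{0,5}
print(matches)   # ['edgecomputingkeepsdataneart']
```

Because whitespace is removed before matching, a phrase broken across lines in the PDF can still be found, and the bounded quantifiers keep each hit short enough to print as a one-line snippet.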

Created at 2404061003, modified at 2412042104
