Exporting a VuePress-Built Website to PDF

Posted by leetao94 on 2019-05-09

Original URL: https://www.cnblogs.com/leetao94/p/10840552.html

Preface

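I have been learning Rust for a while. There are quite a few Chinese translations of the official documentation online, but only the one on the Rust Chinese site seems to be kept up to date. Unfortunately it offers no PDF download, which is rather inconvenient. To make reading easier and to keep up with later documentation updates, I decided to write a crawler in Python that downloads the pages and saves them as a PDF.

The resulting PDF keeps the styling of the original site. Below I walk through how the crawler was built; the final version can export any site built with VuePress to PDF.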

The Crawler

Choosing the Dependencies

requests
BeautifulSoup4
pdfkit

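requests and BeautifulSoup4 need no introduction here, since anyone who has written a crawler has most likely used them. The library worth discussing is pdfkit, which does the actual PDF export. A quick look at how it is used follows.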

PdfKit

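PdfKit is a wrapper around the wkhtmltopdf tool, so before using it you need to download the matching installer from the wkhtmltopdf website and install it on your machine.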


Optional: on Windows, after installing you can add the installation path to the system environment variables (PATH).
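Whether the installation path is on PATH determines how the binary gets found. As a rough sketch, one could prefer an explicitly given path and fall back to a PATH lookup. The helper names find_wkhtmltopdf and make_config are hypothetical and not part of pdfkit; only pdfkit.configuration(wkhtmltopdf=...) is the library's own call.

```python
import os
import shutil


def find_wkhtmltopdf(explicit_path=None):
    """Return a usable path to the wkhtmltopdf binary, or None if not found.

    Hypothetical helper for illustration; it is not part of pdfkit.
    """
    # Prefer a path supplied by the caller, if it points at an existing file.
    if explicit_path and os.path.isfile(explicit_path):
        return explicit_path
    # Otherwise search the directories listed in the PATH environment variable.
    return shutil.which("wkhtmltopdf")


def make_config(explicit_path=None):
    """Build a pdfkit configuration from whatever binary could be located."""
    import pdfkit  # third-party; imported lazily so the lookup works without it

    path = find_wkhtmltopdf(explicit_path)
    if path is None:
        raise FileNotFoundError("wkhtmltopdf not found; install it or pass its path")
    return pdfkit.configuration(wkhtmltopdf=path)
```

The object returned by make_config can then be passed as the configuration argument of the pdfkit functions described next.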

Once it is installed, there are three PdfKit functions you will use most often.

from_url

def from_url(url, output_path, options=None, toc=None, cover=None,
             configuration=None, cover_first=False):
    """
    Convert file or files from URLs to PDF document

    :param url: URL or list of URLs to be saved
    :param output_path: path to output PDF file. False means file will be returned as string.
    :param options: (optional) dict with wkhtmltopdf global and page options, with or w/o '--'
    :param toc: (optional) dict with toc-specific wkhtmltopdf options, with or w/o '--'
    :param cover: (optional) string with url/filename with a cover html page
    :param configuration: (optional) instance of pdfkit.configuration.Configuration()
    :param cover_first: (optional) if True, cover always precedes TOC

    Returns: True on success
    """

    r = PDFKit(url, 'url', options=options, toc=toc, cover=cover,
               configuration=configuration, cover_first=cover_first)

    return r.to_pdf(output_path)

As the name suggests, this function downloads the page at a URL and converts it to PDF.

from_file

def from_file(input, output_path, options=None, toc=None, cover=None, css=None,
              configuration=None, cover_first=False):
    """
    Convert HTML file or files to PDF document

    :param input: path to HTML file or list with paths or file-like object
    :param output_path: path to output PDF file. False means file will be returned as string.
    :param options: (optional) dict with wkhtmltopdf options, with or w/o '--'
    :param toc: (optional) dict with toc-specific wkhtmltopdf options, with or w/o '--'
    :param cover: (optional) string with url/filename with a cover html page
    :param css: (optional) string with path to css file which will be added to a single input file
    :param configuration: (optional) instance of pdfkit.configuration.Configuration()
    :param cover_first: (optional) if True, cover always precedes TOC

    Returns: True on success
    """

    r = PDFKit(input, 'file', options=options, toc=toc, cover=cover, css=css,
               configuration=configuration, cover_first=cover_first)

    return r.to_pdf(output_path)

This one generates a PDF from files, and it is the approach I ended up choosing. Why not from_url()? That will become clear after the analysis below.

from_string

def from_string(input, output_path, options=None, toc=None, cover=None, css=None,
                configuration=None, cover_first=False):
    """
    Convert given string or strings to PDF document

    :param input: string with a desired text. Could be a raw text or a html file
    :param output_path: path to output PDF file. False means file will be returned as string.
    :param options: (optional) dict with wkhtmltopdf options, with or w/o '--'
    :param toc: (optional) dict with toc-specific wkhtmltopdf options, with or w/o '--'
    :param cover: (optional) string with url/filename with a cover html page
    :param css: (optional) string with path to css file which will be added to a input string
    :param configuration: (optional) instance of pdfkit.configuration.Configuration()
    :param cover_first: (optional) if True, cover always precedes TOC

    Returns: True on success
    """

    r = PDFKit(input, 'string', options=options, toc=toc, cover=cover, css=css,
               configuration=configuration, cover_first=cover_first)

    return r.to_pdf(output_path)

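This function generates a PDF from a string. It offers no obvious way to preserve the page's styling, so I did not consider it. For more on PdfKit usage, see the wkhtmltopdf documentation.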

Analyzing the Target Pages

With the dependencies chosen, the next step is to analyze the target pages and start writing the crawler.

Testing PdfKit

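PdfKit can generate a PDF directly with from_url. If that produces a suitable PDF, all we need is the list of page links, which would save a lot of time. So let's test the output first: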

import pdfkit
pdfkit.from_url("https://rustlang-cn.org/office/rust/book/", 'out.pdf', configuration=pdfkit.configuration(
                wkhtmltopdf="path/to/wkhtmltopdf.exe"))

The exported PDF shows that the page styling is preserved, but the sidebar and the top and bottom navigation bars are kept as well, and the sidebar even covers the main content. That rules out from_url.

The Final Approach

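The test shows that from_url will not work, so the export has to go through from_file. Before saving each downloaded page to disk, we also need to modify its content to remove the top navigation bar, the sidebar, and the bottom navigation bar.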

Locating the Elements

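First we need the link to the next page. Open the browser's developer tools and inspect the page: every next-page link is a hyperlink inside the element with class next.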


In the same way, it is easy to find the elements for the top navigation bar, the sidebar, and the bottom bar, which are, in order:

<div class="navbar"></div>
<div class="sidebar"></div>
<div class="page-edit"></div>

With the elements identified, the next step is to extract the link and remove the unwanted elements:

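from bs4 import BeautifulSoup


class DownloadVuePress2Pdf:

    def get_content_and_next_url(self, content):  # content is the page's HTML source
        soup = BeautifulSoup(content, 'html.parser')

        # Remove the elements that should not appear in the PDF
        navbar = soup.select('.navbar')
        if len(navbar):
            navbar[0].decompose()

        sidebar = soup.select('.sidebar')
        if len(sidebar):
            sidebar[0].decompose()

        page_edit = soup.select('.page-edit')
        if len(page_edit):
            page_edit[0].decompose()

        # Note: the next-page link lives inside the bottom navigation bar,
        # so read the link first and only then remove the element.
        next_span = soup.select(".next")
        if len(next_span):
            next_span_href = next_span[0].a['href']
        else:
            next_span_href = None

        page_nav = soup.select('.page-nav')
        if len(page_nav):
            page_nav[0].decompose()

        # Return the cleaned HTML and the next link (None on the last page)
        return str(soup), next_span_href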
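The method above handles a single page. A minimal sketch of the loop that keeps following next-page links until none is left might look like the following. The function name crawl and its fetch and parse parameters are assumptions of this sketch, not part of the original code: fetch(url) is expected to return the page's HTML, and parse(html) to return the cleaned HTML together with the next link (or None).

```python
from urllib.parse import urljoin


def crawl(start_url, fetch, parse, max_pages=500):
    """Follow next-page links from start_url and collect the cleaned pages.

    fetch(url) -> HTML text and parse(html) -> (cleaned_html, next_href)
    are supplied by the caller; both are assumptions of this sketch.
    """
    pages = []
    seen = set()
    url = start_url
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)  # guard against navigation loops
        cleaned_html, next_href = parse(fetch(url))
        pages.append((url, cleaned_html))
        # next_href is usually relative, so resolve it against the current
        # page; None means this was the last page.
        url = urljoin(url, next_href) if next_href else None
    return pages
```

With requests, fetch could be something like lambda url: requests.get(url).text, and parse any function that returns the cleaned HTML and the next link as a pair. Passing both in as arguments keeps the loop easy to try out without network access.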

Preserving the Styling in the Exported PDF

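To make the exported PDF look like the web page, there are two options:

Create a local CSS file in the matching directory based on the page source. This clearly does not generalize: we cannot create a new CSS file for every site we export.
Since the local approach does not scale, rewrite the href of each CSS link in the page so that it points at the remote stylesheet.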

Add the following to the code above:

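# Assumed here: links holds the page's <link> tags that carry an href
links = soup.find_all('link', href=True)
for link in links:
    if not link['href'].startswith("http"):
        # css_domain is the domain the stylesheets are served from; set it by hand
        link['href'] = css_domain + link['href']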
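Prefixing a fixed domain works when every relative href starts with "/". A slightly more general sketch, assuming the URL of the current page is known, resolves each href with urllib.parse.urljoin from the standard library, which also copes with protocol-relative ("//host/...") and document-relative paths. The function name absolutize_hrefs and the parameter page_url are made up for illustration.

```python
from urllib.parse import urljoin


def absolutize_hrefs(hrefs, page_url):
    """Resolve each stylesheet href against the URL of the page it came from.

    Hrefs that are already absolute are returned unchanged by urljoin.
    """
    return [urljoin(page_url, href) for href in hrefs]
```

Inside a loop over the link tags this would amount to link['href'] = urljoin(page_url, link['href']), with no startswith check needed.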


Exporting

Using the steps above we download each page and save it locally. Once everything is downloaded, the last step is the PDF export, which from_file() makes straightforward:

pdfkit.from_file([list of files], "output_file_name.pdf", options=options, configuration=config)
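The call above leaves the file list, options, and config undefined. One possible way to fill them in is sketched below, under stated assumptions: the helper names save_pages and export_pdf, the numbered file naming, and the option values are illustrative and not taken from the author's full code. Only pdfkit.from_file and pdfkit.configuration are the library's own calls.

```python
import os


def save_pages(html_pages, out_dir):
    """Write each cleaned HTML string to a numbered file; return paths in order."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for index, html in enumerate(html_pages):
        # Zero-padded names keep the files in page order when sorted.
        path = os.path.join(out_dir, f"{index:04d}.html")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(html)
        paths.append(path)
    return paths


# Example wkhtmltopdf options; the values are illustrative, not the author's.
PDF_OPTIONS = {
    "page-size": "A4",
    "encoding": "UTF-8",
}


def export_pdf(html_paths, output_pdf, wkhtmltopdf_path=None):
    """Merge the saved HTML files into one PDF with pdfkit.from_file."""
    if not html_paths:
        raise ValueError("no HTML files to export")
    import pdfkit  # third-party; requires the wkhtmltopdf binary

    # Without an explicit path, pdfkit looks for wkhtmltopdf on PATH itself.
    config = pdfkit.configuration(wkhtmltopdf=wkhtmltopdf_path) if wkhtmltopdf_path else None
    return pdfkit.from_file(html_paths, output_pdf, options=PDF_OPTIONS, configuration=config)
```

Saving every page first and converting once at the end means a failed conversion can be retried without downloading the site again.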

That completes the export of the Rust documentation to PDF, with the result described at the beginning of this post.

Note: sites built with VuePress share largely the same layout, so the code above can also be used to export other VuePress-built sites.

Full Code

Search for the WeChat official account LeeTao and reply 20190509 to get it.
