Putting It into Practice: Scraping Liao Xuefeng's Python Tutorial with Python and Turning It into a PDF

Posted by morethink on 2019-02-27

Original URL: https://flycode.co/archives/263253

Python

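After working through Liao Xuefeng's Python tutorial, I felt I ought to build something with what I had learned. I also wanted to be able to consult the tutorial at any time, so I came up with the idea of turning it into a PDF.

Turning the tutorial into a PDF takes three steps:

Generate an empty HTML document first, then scrape each tutorial page and put it into a newly created div. The result is a single HTML file containing every tutorial page (BeautifulSoup).
Convert the HTML to PDF (wkhtmltopdf).
Because Liao Xuefeng writes tutorials, his site's anti-scraping measures are fairly good, so the scraping also needs proxy IPs (free or paid).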

BeautifulSoup

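Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document, and it can save you hours or even days of work.

Installation

pip3 install beautifulsoup4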

Getting started

Pass a document to the BeautifulSoup constructor to get a document object. You can pass in either a string or an open file handle.

For example:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"))
soup = BeautifulSoup("<html>data</html>")

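First, the document is converted to Unicode, and HTML entities are converted to Unicode characters.
Then Beautiful Soup picks the most suitable parser to parse the document. If you specify a parser explicitly, Beautiful Soup uses that parser instead.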

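Kinds of objects

Beautiful Soup turns a complex HTML document into a complex tree structure. Every node is a Python object, and all objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

Tag: put simply, one of the tags in the HTML, such as div or p.
NavigableString: the text inside a tag, obtained with, for example, soup.p.string.
BeautifulSoup: represents the entire content of a document.
Comment: a Comment object is a special type of NavigableString whose output does not include the comment delimiters.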
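
A minimal sketch of the four object kinds follows. The HTML snippet is made up for illustration rather than taken from the tutorial site, and html.parser is used only because it ships with Python.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Made-up sample markup: one tag with text, followed by an HTML comment
html = "<p class='title'><b>The Dormouse's story</b><!-- a note --></p>"
soup = BeautifulSoup(html, "html.parser")

tag = soup.b                   # Tag: the first <b> element
text = soup.b.string           # NavigableString: the text inside <b>
comment = soup.p.contents[1]   # Comment: the second child of <p>

# Print the class name and the string form of each object
for obj in (soup, tag, text, comment):
    print(type(obj).__name__, repr(str(obj)))
```

Note that the string form of the Comment object contains only the comment text, without the `<!--` and `-->` delimiters.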

Tag

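A Tag corresponds to a tag in the HTML, and BeautifulSoup lets you read it with soup.name, where name is the name of a tag in the document. For example:

print(soup.title) prints the title tag, including the tag itself and its content:
```
<title>The Dormouse's story</title>
```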

print(soup.head) prints the head tag and its content:

<head><title>The Dormouse's story</title></head>

If the document contains several tags with the requested name, this form of access returns only the first matching tag.

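Tag attributes

Every Tag has two important attributes, name and attrs:

name: a Tag's name is the tag name itself; for example, soup.p.name is p.
attrs: a dictionary mapping attribute names to values. For example, print(soup.p.attrs) outputs {'class': ['title'], 'name': 'dromouse'}. You can also fetch a single value: print(soup.p.attrs['class']) outputs ['title'], which is a list because class can hold several values. The get method works as well, as in print(soup.p.get('class')), and so does indexing the tag directly with print(soup.p['class']).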

get

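The get method returns the value of one of a tag's attributes. It is worth remembering because it is useful in many situations: for example, to get the image URL from an <img src="#"> tag, use soup.img.get('src'). For example:

# Get the class attribute of the first p tag
print(soup.p.get("class"))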
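
The attribute access described in the last two sections can be sketched as follows. The markup is made up for illustration; it reuses the class and name values quoted above and adds a hypothetical img tag.

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration
html = (
    '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
    '<img src="cover.png">'
)
soup = BeautifulSoup(html, "html.parser")

p = soup.p
print(p.name)               # the tag name
print(p.attrs)              # all attributes as a dict
print(p["class"])           # indexing raises KeyError if the attribute is missing
print(p.get("class"))       # get() returns None if the attribute is missing
print(p.get("id"))          # this tag has no id attribute
print(soup.img.get("src"))  # the image URL
```

Preferring get() over indexing is a reasonable default when scraping, since real pages often omit attributes on some tags.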

string

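Returns the text content of a tag. It only returns content when the tag has no child tags, or exactly one child tag; otherwise it returns None. For example:

# The p tag in the sample document has no child tags, so its text is returned correctly
print(soup.p.string)
# This prints None, because the html tag has many child tags
print(soup.html.string)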

`get_text()`

Returns all the text inside a tag, including the text of its descendants. This is the method you will use most often.
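
The difference between .string and get_text() is easiest to see side by side. This is a small sketch using made-up markup with three paragraphs: text only, a single child tag, and mixed content.

```python
from bs4 import BeautifulSoup

# Made-up sample markup: three paragraphs with different structure
html = (
    "<div>"
    "<p id='a'>Only text</p>"
    "<p id='b'><b>Nested</b></p>"
    "<p id='c'>Hello, <b>world</b>!</p>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
a, b, c = soup.find_all("p")

print(a.string)             # text only: the text itself
print(b.string)             # exactly one child tag: that child's text
print(c.string)             # text mixed with a child tag: None
print(c.get_text())         # all text in the paragraph, descendants included
print(soup.div.get_text())  # all text in the div, concatenated
```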

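Searching the tree

BeautifulSoup is mainly used to traverse child nodes and their attributes. Accessing a tag as an attribute, such as soup.p, only returns the first tag with that name in the document. To get all the <p> tags, or anything more than the first tag with a given name, use find_all():

find_all(name, attrs, recursive, string, limit, **kwargs)

find_all searches the descendants of a node and returns every node that matches the filters.

name parameter: the name of a Tag, such as p, div, or title.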

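import re

# 1. Tag name
print(soup.find_all('p'))
# 2. Regular expression
print(soup.find_all(re.compile('^p')))
# 3. List
print(soup.find_all(['p', 'a']))

In addition, the attrs parameter can also serve as a filter, and the limit parameter caps the number of results returned.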
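
As a rough illustration of the attrs and limit parameters, here is a sketch against made-up markup with three links. Keyword filters such as id and href are also shown, since find_all accepts attribute names as keyword arguments.

```python
import re

from bs4 import BeautifulSoup

# Made-up sample markup for illustration
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Filter by attribute values with the attrs dict
sisters = soup.find_all("a", attrs={"class": "sister"})
# Stop after the first two matches
first_two = soup.find_all("a", limit=2)
# Attribute names can also be passed as keyword arguments
by_id = soup.find_all(id="link3")
by_href = soup.find_all("a", href=re.compile("lacie$"))

print(len(sisters))
print([a["id"] for a in first_two])
print(by_id[0].get_text(), by_href[0]["id"])
```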

CSS selectors

Finds Tags using CSS selector syntax. This also relies on a single method, select(), which returns a list. It is used as follows:

# 1. Select by tag name
print(soup.select('head'))
# 2. Select by id
print(soup.select('#link1'))
# 3. Select by class
print(soup.select('.sister'))
# 4. Select by attribute
print(soup.select('p[name=dromouse]'))
# 5. Combined selectors
print(soup.select("body p"))
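
For a version of these selector calls that runs on its own, the sketch below supplies a small made-up document. It also uses select_one(), which returns the first match or None instead of a list.

```python
from bs4 import BeautifulSoup

# Made-up sample document for illustration
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

titles = soup.select("head > title")      # direct child selector
links = soup.select("p.story a.sister")   # class plus descendant selector
first = soup.select_one("#link1")         # single element by id
named = soup.select('p[name="dromouse"]') # attribute selector
missing = soup.select("table")            # no match gives an empty list

print(titles[0].get_text())
print([a["id"] for a in links])
print(first.get_text(), len(named), len(missing))
```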

wkhtmltopdf

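wkhtmltopdf is mainly used to generate PDFs from HTML.

pdfkit is a Python wrapper around wkhtmltopdf. It supports converting URLs, local files, and text content to PDF, and in the end it still calls the wkhtmltopdf command.

Installation

Install wkhtmltopdf first, then pdfkit.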

wkhtmltopdf.org/downloads.h…
pdfkit
```
pip3 install pdfkit
```

Converting a URL, file, or string

import pdfkit

pdfkit.from_url('http://google.com', 'out.pdf')
pdfkit.from_file('index.html', 'out.pdf')
pdfkit.from_string('Hello!', 'out.pdf')

Converting a list of URLs or file names

pdfkit.from_url(['google.com', 'baidu.com'], 'out.pdf')
pdfkit.from_file(['file1.html', 'file2.html'], 'out.pdf')

Converting an open file

with open('file.html') as f:
    pdfkit.from_file(f, 'out.pdf')

Custom options

options = {
    'page-size': 'Letter',
    'margin-top': '0.75in',
    'margin-right': '0.75in',
    'margin-bottom': '0.75in',
    'margin-left': '0.75in',
    'encoding': "UTF-8",
    'custom-header' : [
        ('Accept-Encoding', 'gzip')
    ],
    'cookie': [
        ('cookie-name1', 'cookie-value1'),
        ('cookie-name2', 'cookie-value2'),
    ],
    'no-outline': None,
    'outline-depth': 10,
}

pdfkit.from_url('http://google.com', 'out.pdf', options=options)
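
Since pdfkit ends up calling the wkhtmltopdf command, it can help to picture how an options dict like the one above maps onto command-line arguments. The helper below is a hypothetical sketch written for this article, not pdfkit's actual implementation: it assumes a None value means a bare flag and a list of tuples means a repeated option.

```python
def options_to_args(options):
    """Flatten an options dict into wkhtmltopdf-style arguments (hypothetical helper)."""
    args = []
    for key, value in options.items():
        flag = "--" + key
        if value is None:
            # A None value stands for a flag with no argument, e.g. --no-outline
            args.append(flag)
        elif isinstance(value, list):
            # A list repeats the option once per item, e.g. several cookies
            for item in value:
                args.append(flag)
                if isinstance(item, (tuple, list)):
                    args.extend(str(part) for part in item)
                else:
                    args.append(str(item))
        else:
            args.extend([flag, str(value)])
    return args


options = {
    "page-size": "Letter",
    "encoding": "UTF-8",
    "cookie": [("cookie-name1", "cookie-value1")],
    "no-outline": None,
}
print(options_to_args(options))
```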

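Using proxy IPs

After scraping a dozen or so tutorial pages, the site started returning an error.

Liao Xuefeng's anti-scraping measures are clearly effective, so I had to use proxy IPs. After trying the free Xici proxy list, I settled on the paid Abuyun service, whose response speed and stability seemed acceptable.

Results

[Screenshot of the script running]

[Image of the generated PDF]

The code is as follows: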

import time
import pdfkit
import requests
from bs4 import BeautifulSoup


# Uses the Abuyun proxy service.
# You can drop the proxy or switch to another provider.
def get_soup(target_url):
    proxy_host = "http-dyn.abuyun.com"
    proxy_port = "9020"
    proxy_user = "your-username"
    proxy_pass = "your-password"
    proxy_meta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
        "host": proxy_host,
        "port": proxy_port,
        "user": proxy_user,
        "pass": proxy_pass,
    }

    proxies = {
        "http": proxy_meta,
        "https": proxy_meta,
    }
    headers = {'User-Agent':
                   'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
    flag = True
    while flag:
        try:
            resp = requests.get(target_url, proxies=proxies, headers=headers)
            flag = False
        except Exception as e:
            print(e)
            time.sleep(0.4)

    soup = BeautifulSoup(resp.text, 'html.parser')
    return soup


def get_toc(url):
    soup = get_soup(url)
    toc = soup.select("#x-wiki-index a")
    print(toc[0]['href'])
    return toc


# Download the HTML of one tutorial page
def download_html(url, depth):
    soup = get_soup(url)
    # Map the table-of-contents depth to a heading level (h1 or h2)
    if int(depth) <= 1:
        depth = '1'
    elif int(depth) >= 2:
        depth = '2'
    title = soup.select(".x-content h4")[0]
    new_title = BeautifulSoup('<h' + depth + '>' + title.string + '</h' + depth + '>', 'html.parser')
    print(new_title)
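    # Load images: copy the real URL from data-src into src
    images = soup.find_all('img')
    for x in images:
        # Skip images without data-src instead of raising KeyError
        if x.has_attr('data-src'):
            x['src'] = x['data-src']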

    div_content = soup.find('div', class_='x-wiki-content')
    return new_title, div_content


def convert_pdf(template):
    html_file = "python-tutorial-pdf.html"
    with open(html_file, mode="w", encoding="utf8") as code:
        code.write(str(template))
    pdfkit.from_file(html_file, 'python-tutorial-pdf.pdf')


if __name__ == '__main__':
    # HTML template
    template = BeautifulSoup(
        '<!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> <link rel="stylesheet" href="https://cdn.liaoxuefeng.com/cdn/static/themes/default/css/all.css?v=bc43d83"> <script src="https://cdn.liaoxuefeng.com/cdn/static/themes/default/js/all.js?v=bc43d83"></script> </head> <body> </body> </html>',
        'html.parser')
    # Tutorial table of contents
    toc = get_toc('https://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000')
    for i, x in enumerate(toc):
        url = 'https://www.liaoxuefeng.com' + x['href']
        # Download the HTML of this tutorial page
        content = download_html(url, x.parent['depth'])
        # Append the new tutorial page to the template
        new_div = template.new_tag('div', id=str(i))
        template.body.append(new_div)
        # Heading first, then the page content
        new_div.append(content[0])
        new_div.append(content[1])
        time.sleep(0.4)
    convert_pdf(template)
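
The retry loop in get_soup keeps retrying forever if every request fails. A bounded variant might look like the following sketch. The names build_proxies, fetch_with_retry, and flaky_fetch are hypothetical, and a fake fetch function stands in for requests.get so the example runs without network access or a proxy account.

```python
import time


def build_proxies(host, port, user, password):
    """Build a proxies dict in the shape that requests expects."""
    proxy_url = f"http://{user}:{password}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


def fetch_with_retry(fetch, url, max_attempts=5, delay=0.4):
    """Call fetch(url), retrying on any exception up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return fetch(url)
        except Exception as error:
            last_error = error
            print(error)
            time.sleep(delay)
    raise RuntimeError(f"giving up on {url} after {max_attempts} attempts") from last_error


# Fake fetch function (an assumption for the demo): fails twice, then succeeds
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated block")
    return "<html>ok</html>"


page = fetch_with_retry(flaky_fetch, "https://example.com/page", delay=0)
print(page, calls["count"])
```

In the real script, fetch could be a small function that calls requests.get with the proxies and headers already used in get_soup.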
