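Python Web Scraping in Practice (1): Using requests and BeautifulSoup

Posted by 吳小龍同學 on 2017-12-11

Original URL: https://juejin.im/post/5a2e98d6f265da430c11c5ca

Python basics

My earlier 《Python 3 極簡教程.pdf》 (a minimal Python 3 tutorial) is a quick start for readers who already have some programming background. After working through that series you should be able to write an API on your own, and small projects will be no problem.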

requests

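requests is a Python HTTP library, comparable to Retrofit on Android. Its features include Keep-Alive and connection pooling, cookie persistence, automatic content decompression, HTTP proxies, SSL verification, connection timeouts, and Sessions, among many others. When this was written it supported both Python 2 and Python 3; current releases require Python 3. GitHub: github.com/requests/re… .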

Installation

Mac:

pip3 install requests

Windows:

pip install requests

Sending requests

The HTTP request methods include GET, POST, PUT, and DELETE.

import requests

# GET request
response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all')

# POST request
response = requests.post('http://127.0.0.1:1024/developer/api/v1.0/insert')

# PUT request
response = requests.put('http://127.0.0.1:1024/developer/api/v1.0/update')

# DELETE request
response = requests.delete('http://127.0.0.1:1024/developer/api/v1.0/delete')

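A request returns a Response object, which wraps the data the server sends back in the HTTP response. Its main elements include the status code, reason phrase, response headers, response URL, response encoding, and response body.

# Status code
print(response.status_code)

# Response URL
print(response.url)

# Reason phrase
print(response.reason)

# Response body parsed as JSON
print(response.json())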
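Before reading the body it is usually worth checking whether the request succeeded. The following is a minimal sketch, assuming you want HTTP error statuses turned into exceptions: `raise_for_status()` raises `requests.exceptions.HTTPError` for 4xx and 5xx responses. The `Response` objects are built by hand only so the snippet runs without a server (normally they come from `requests.get(...)`), and `describe` is a hypothetical helper name.

```python
import requests


def describe(response):
    """Return a short summary, or the error message for 4xx/5xx responses."""
    try:
        # Raises requests.exceptions.HTTPError when status_code is 4xx or 5xx.
        response.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        return 'failed: %s' % exc
    return 'ok: %s' % response.status_code


# Hand-built responses, for illustration only (normally returned by requests.get).
ok = requests.Response()
ok.status_code = 200

missing = requests.Response()
missing.status_code = 404
missing.reason = 'Not Found'
missing.url = 'http://127.0.0.1:1024/developer/api/v1.0/all'

print(describe(ok))
print(describe(missing))
```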

Custom request headers

To add HTTP headers to a request, just pass a dict to the headers keyword argument.

header = {'Application-Id': '19869a66c6',
          'Content-Type': 'application/json'
          }
response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all/', headers=header)
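To see which headers would actually go out without needing the local API to be running, one option is to build the request and prepare it instead of sending it. This is only a sketch: it uses `requests.Request(...).prepare()`, which skips the defaults a `Session` would add (such as User-Agent), so it shows just the headers supplied here.

```python
import requests

header = {'Application-Id': '19869a66c6',
          'Content-Type': 'application/json'}

# Build the request without sending it, so no server is needed.
request = requests.Request(
    'GET', 'http://127.0.0.1:1024/developer/api/v1.0/all/', headers=header)
prepared = request.prepare()

# Header lookup on the prepared request is case-insensitive.
print(prepared.headers['application-id'])
print(prepared.headers['Content-Type'])
```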

Building query parameters

To pass data in the URL's query string, for example http://127.0.0.1:1024/developer/api/v1.0/all?key1=value1&key2=value2 , Requests lets you supply the parameters as a dictionary of strings via the params keyword argument.

payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", params=payload)

A list can also be passed as a value:

payload = {'key1': 'value1', 'key2': ['value2', 'value3']}
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", params=payload)

# Response URL
print(response.url)  # prints: http://127.0.0.1:1024/developer/api/v1.0/all?key1=value1&key2=value2&key2=value3
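The URL that params produces can also be checked offline. The sketch below prepares the request without sending it and parses the query string back with the standard library; the extra `q` key is an assumption added here to show that values containing spaces are encoded for you.

```python
from urllib.parse import parse_qs, urlsplit

import requests

payload = {'key1': 'value1', 'key2': ['value2', 'value3'], 'q': 'python crawler'}

# Build the request without sending it, so no server is needed.
prepared = requests.Request(
    'GET', 'http://127.0.0.1:1024/developer/api/v1.0/all', params=payload).prepare()

print(prepared.url)

# Parse the query string back to check what a server would receive.
query = parse_qs(urlsplit(prepared.url).query)
print(query)
```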

POST request data

If the server expects form data, pass it with the data keyword argument.

payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.post("http://127.0.0.1:1024/developer/api/v1.0/insert", data=payload)

If the server expects a JSON body, use the json keyword argument; the value can be passed as a dict and Requests serializes it.

obj = {
    "article_title": "小公務員之死2"
}
response = requests.post('http://127.0.0.1:1024/developer/api/v1.0/insert', json=obj)
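The difference between data and json is easiest to see in the prepared request. A small sketch, again preparing requests rather than sending them: data produces a form-encoded body, while json produces a JSON body and sets the Content-Type header accordingly.

```python
import json

import requests

url = 'http://127.0.0.1:1024/developer/api/v1.0/insert'

# data=... is sent as an HTML form body.
form = requests.Request(
    'POST', url, data={'key1': 'value1', 'key2': 'value2'}).prepare()

# json=... is serialized to JSON by Requests.
as_json = requests.Request(
    'POST', url, json={'article_title': '小公務員之死2'}).prepare()

print(form.headers['Content-Type'], form.body)
print(as_json.headers['Content-Type'], json.loads(as_json.body))
```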

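Response content

Requests automatically decodes content from the server. Most Unicode charsets are decoded seamlessly. After the request is made, Requests makes an educated guess about the response encoding based on the HTTP headers.

# Response content
# text is a property (not a method) that returns the body as str
# print(response.text)
# json() returns the body parsed as JSON
print(response.json())
# content is a property (not a method) that returns the body as bytes
# print(response.content)
# Raw response: the initial request must set stream=True
# response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all', stream=True)
# print(response.raw)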
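To compare the accessors side by side without the article's API, the sketch below starts a throwaway HTTP server from the standard library on a free local port. `JsonHandler` and its payload are hypothetical stand-ins, and `trust_env = False` is set only so that proxy environment variables do not interfere with the loopback request.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class JsonHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the article's local API."""

    def do_GET(self):
        body = json.dumps({'article_title': 'demo'}).encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'application/json; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(('127.0.0.1', 0), JsonHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for this demo

try:
    response = session.get('http://127.0.0.1:%d/' % server.server_port, timeout=5)
    text = response.text          # str, decoded using response.encoding
    raw_bytes = response.content  # bytes, not decoded
    data = response.json()        # parsed JSON, here a dict
    print(type(text).__name__, type(raw_bytes).__name__, data)
finally:
    server.shutdown()
    server.server_close()
```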

Timeouts

Unless you explicitly pass a timeout, requests never times out on its own. If the server does not respond, the whole program stays blocked and cannot do anything else.

response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all', timeout=5)  # in seconds
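When the timeout expires, requests raises an exception rather than returning a response, so the call needs a try/except. Here is a sketch of that behavior against a deliberately slow local server; `SlowHandler`, the 1 second delay, and the 0.2 second timeout are all made up for the demonstration.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical server that answers more slowly than the client will wait."""

    def do_GET(self):
        time.sleep(1)  # longer than the client's timeout below
        try:
            self.send_response(200)
            self.end_headers()
        except OSError:
            pass  # the client has already given up

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(('127.0.0.1', 0), SlowHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for this demo

try:
    session.get('http://127.0.0.1:%d/' % server.server_port, timeout=0.2)
    outcome = 'completed'
except requests.exceptions.Timeout:
    # Raised for both connect timeouts and read timeouts.
    outcome = 'timed out'
finally:
    server.shutdown()
    server.server_close()

print(outcome)
```

The timeout argument also accepts a (connect, read) tuple when the two phases need different limits.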

Proxy settings

If you hit a site too frequently, the server can easily block you. requests supports proxies through the proxies keyword argument.

# Proxies
proxies = {
    'http': 'http://127.0.0.1:1024',
    'https': 'http://127.0.0.1:4000',
}
response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all', proxies=proxies)

BeautifulSoup

BeautifulSoup is a Python HTML parsing library, comparable to jsoup in Java.

Installation

BeautifulSoup 3 is no longer developed; use BeautifulSoup 4 directly.

Mac:

pip3 install beautifulsoup4

Windows:

pip install beautifulsoup4

Installing a parser

I use html5lib, which is implemented in pure Python.

Mac:

pip3 install html5lib

Windows:

pip install html5lib

Basic usage

BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object.

Parsing

from bs4 import BeautifulSoup

def get_html_data():
    html_doc = """
    <html>
    <head>
    <title>WuXiaolong</title>
    </head>
    <body>
    <p>分享 Android 技術，也關注 Python 等熱門技術。</p>
    <p>寫部落格的初衷：總結經驗，記錄自己的成長。</p>
    <p>你必須足夠的努力，才能看起來毫不費力！專注！精緻！
    </p>
    <p class="Blog"><a href="http://wuxiaolong.me/">WuXiaolong's blog</a></p>
    <p class="WeChat"><a href="https://open.weixin.qq.com/qr/code?username=MrWuXiaolong">公眾號：吳小龍同學</a> </p>
    <p class="GitHub"><a href="http://example.com/tillie" class="sister" id="link3">GitHub</a></p>
    </body>
    </html>   
    """
    soup = BeautifulSoup(html_doc, "html5lib")

tag

tag = soup.head
print(tag)  # <head><title>WuXiaolong</title></head>
print(tag.name)  # head
print(tag.title)  # <title>WuXiaolong</title>
print(soup.p)  # <p>分享 Android 技術，也關注 Python 等熱門技術。</p>
print(soup.a['href'])  # prints the href attribute of the first a tag: http://wuxiaolong.me/

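Note: when several tags match, the first one is returned, as with the p tag here.

Searching

print(soup.find('p'))  # <p>分享 Android 技術，也關注 Python 等熱門技術。</p>

find also returns the first matching tag by default, and returns None when nothing matches. To target a specific node, such as the WeChat account paragraph here, filter on an attribute of the tag, for example its class:

# class is a Python keyword, so the argument is spelled class_.
print(soup.find('p', class_="WeChat"))
# <p class="WeChat"><a href="https://open.weixin.qq.com/qr/code?username=MrWuXiaolong">公眾號：吳小龍同學</a> </p>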
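Besides class_, a node can be located in a few other ways. The sketch below uses a trimmed-down copy of the sample document and the built-in "html.parser" (swapped in for html5lib only to avoid an extra install) to show an attrs dict, an id filter, and a CSS selector via select_one, plus the None result for a miss.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="Blog"><a href="http://wuxiaolong.me/">WuXiaolong's blog</a></p>
<p class="GitHub"><a href="http://example.com/tillie" class="sister" id="link3">GitHub</a></p>
"""
# "html.parser" ships with Python; the article itself uses html5lib.
soup = BeautifulSoup(html_doc, 'html.parser')

by_attrs = soup.find('p', attrs={'class': 'Blog'})  # same match as class_="Blog"
by_id = soup.find('a', id='link3')                  # filter on the id attribute
by_css = soup.select_one('p.GitHub a')              # CSS selector

print(by_attrs.a['href'])
print(by_id.get_text())
print(by_css.get('href'))

# No match: find returns None, so check before using the result.
print(soup.find('p', class_='Missing'))
```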

Find all p tags:

for p in soup.find_all('p'):
    print(p.string)
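One thing to watch in a loop like this: .string is None whenever a tag has more than one child, which is the case for the WeChat paragraph because a space follows its a tag. A small sketch of the difference, using a two-paragraph excerpt with English placeholder link text and the built-in "html.parser" (both are assumptions made to keep the snippet short):

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="Blog"><a href="http://wuxiaolong.me/">WuXiaolong's blog</a></p>
<p class="WeChat"><a href="https://open.weixin.qq.com/qr/code?username=MrWuXiaolong">WeChat account</a> </p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
blog, wechat = soup.find_all('p')

# One child chain (p > a > text): .string follows it down to the text.
print(blog.string)

# Two children (the a tag and a trailing space): .string gives up and returns None.
print(wechat.string)

# get_text() collects all nested text regardless of how many children there are.
print(wechat.get_text(strip=True))

# Collect every link target in the document.
links = [a['href'] for a in soup.find_all('a')]
print(links)
```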

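Hands-on project

A while ago a user reported that my personal app was down. I no longer maintain the app, but I should at least keep it running. Most people know that its data is scraped (see 《手把手教你做個人app》). One advantage of scraped data is that I don't have to manage the data myself; the downside is that when the other site goes down or its HTML nodes change, my parsing fails and the app has no data. This report made me wonder whether I should crawl the site's data into my own store, and since I was teaching myself Python it was good practice, so I got started. My original plan was to crawl with Python, insert into a local MySQL database, write my own API with Flask, call it from Android with Retrofit, and then insert into bmob with the bmob SDK... which was laborious and did not feel workable. Then I learned that bmob provides a RESTful API, which solved the big problem: the Python crawler can insert the data directly. Here I demonstrate inserting into a local database; with bmob you would call its RESTful API to insert the data instead.

Choosing a site

The demo site I chose is meiriyiwen.com/random . Every request returns a different article, which is exactly what I need: I just request it on a timer, parse the data I want, and insert it into the database.

Creating the database

I created it directly in Navicat Premium; the command line works too.

Creating the table

Create the article table with pymysql. The table needs the columns id, article_title, article_author, and article_content. The code is below and only needs to run once.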

import pymysql


def create_table():
    # Open the connection
    db = pymysql.connect(host='localhost',
                         user='root',
                         password='root',
                         db='python3learn')
    # SQL statement that creates the article table
    sql = '''create table if not exists article (
    id int NOT NULL AUTO_INCREMENT,
    article_title text,
    article_author text,
    article_content text,
    PRIMARY KEY (`id`)
    )'''
    # Create a cursor object with cursor()
    cursor = db.cursor()
    try:
        # Execute the SQL statement
        cursor.execute(sql)
        # Commit the transaction
        db.commit()
        print('create table success')
    except Exception as e:  # roll back if an error occurs
        db.rollback()
        print(e)
    finally:
        # Close the cursor
        cursor.close()
        # Close the database connection
        db.close()


if __name__ == '__main__':
    create_table()

Parsing the site

First fetch the page with requests, then use BeautifulSoup to pick out the nodes you need.

import requests
from bs4 import BeautifulSoup


def get_html_data():
    # GET request; the timeout keeps the crawler from blocking forever
    response = requests.get('https://meiriyiwen.com/random', timeout=10)

    soup = BeautifulSoup(response.content, "html5lib")
    article = soup.find("div", id='article_show')
    article_title = article.h1.string
    print('article_title=%s' % article_title)
    article_author = article.find('p', class_="article_author").string
    print('article_author=%s' % article_author)
    article_contents = article.find('div', class_="article_text").find_all('p')
    article_content = ''
    for content in article_contents:
        # Keep each paragraph's HTML markup
        article_content = article_content + str(content)
    # Print once, after all paragraphs have been joined
    print('article_content=%s' % article_content)
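Since the whole point of this project is that the crawler breaks when the site's HTML changes, it may help to make the parsing step return None instead of raising when a node is missing. The following is a sketch under assumptions: `parse_article` is a hypothetical helper, `SAMPLE_HTML` only mimics the node names used above (the real page may differ), and the built-in "html.parser" replaces html5lib so the snippet runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the node names used above; the real page may differ.
SAMPLE_HTML = """
<div id="article_show">
  <h1>Sample title</h1>
  <p class="article_author"><span>Sample author</span></p>
  <div class="article_text"><p>First paragraph.</p><p>Second paragraph.</p></div>
</div>
"""


def parse_article(html):
    """Return (title, author, content), or None if an expected node is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    article = soup.find('div', id='article_show')
    if article is None:
        return None
    title = article.find('h1')
    author = article.find('p', class_='article_author')
    text = article.find('div', class_='article_text')
    if title is None or author is None or text is None:
        return None
    # Keep the paragraph markup, as the article's code does with str(content).
    content = ''.join(str(p) for p in text.find_all('p'))
    return title.get_text(strip=True), author.get_text(strip=True), content


print(parse_article(SAMPLE_HTML))
print(parse_article('<html><body>layout changed</body></html>'))
```

Separating parsing from fetching this way also lets the parser be exercised against saved HTML without touching the network.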

Inserting into the database

There is a filter here: assuming article titles on this site are unique, a row is not inserted if one with the same title already exists.

import pymysql


def insert_table(article_title, article_author, article_content):
    # Open the connection
    db = pymysql.connect(host='localhost',
                         user='root',
                         password='root',
                         db='python3learn',
                         charset="utf8")
    # Look up the title first, then insert only if it is not there yet
    query_sql = 'select * from article where article_title=%s'
    sql = 'insert into article (article_title,article_author,article_content) values (%s, %s, %s)'
    # Create a cursor object with cursor()
    cursor = db.cursor()
    try:
        query_value = (article_title,)
        # Execute the lookup
        cursor.execute(query_sql, query_value)
        results = cursor.fetchall()
        if len(results) == 0:
            value = (article_title, article_author, article_content)
            cursor.execute(sql, value)
            # Commit the transaction
            db.commit()
            print('--------------《%s》 insert table success-------------' % article_title)
            return True
        else:
            print('--------------《%s》 already exists-------------' % article_title)
            return False

    except Exception as e:  # roll back if an error occurs
        db.rollback()
        print(e)
        return False

    finally:
        # Close the cursor
        cursor.close()
        # Close the database connection
        db.close()
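An alternative to the select-then-insert check is to let the database enforce uniqueness. The sketch below swaps in the standard library's sqlite3 with an in-memory database so it runs without a MySQL server; it is not the article's pymysql code. It relies on a UNIQUE column plus SQLite's INSERT OR IGNORE. In MySQL the rough counterpart would be a unique index and INSERT IGNORE, but a TEXT column needs a prefix length or a VARCHAR type for that, so treat this as an idea to adapt rather than a drop-in change. `insert_article` is a hypothetical name.

```python
import sqlite3


def insert_article(conn, title, author, content):
    """Insert a row unless the title already exists; return True if inserted."""
    with conn:  # commits on success, rolls back if an exception is raised
        cursor = conn.execute(
            'INSERT OR IGNORE INTO article '
            '(article_title, article_author, article_content) VALUES (?, ?, ?)',
            (title, author, content))
        # rowcount is 0 when the UNIQUE constraint caused the row to be skipped.
        return cursor.rowcount == 1


conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE IF NOT EXISTS article (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    article_title TEXT UNIQUE,
    article_author TEXT,
    article_content TEXT
)''')

first = insert_article(conn, 'Sample title', 'Sample author', '<p>text</p>')
second = insert_article(conn, 'Sample title', 'Sample author', '<p>text</p>')
count = conn.execute('SELECT COUNT(*) FROM article').fetchone()[0]
print(first, second, count)
conn.close()
```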

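Scheduling

I added a timer so that the crawl runs again after a set interval.

import sched
import time


# Initialize the scheduler class from the sched module.
# The first argument is a function that returns a timestamp; the second is a
# function that blocks until the scheduled time arrives.
schedule = sched.scheduler(time.time, time.sleep)


# Function triggered periodically by the scheduler
def print_time(inc):
    # to do something
    print('to do something')
    schedule.enter(inc, 0, print_time, (inc,))


# Default interval: 60 s
def start(inc=60):
    # enter takes four arguments: the delay, the priority (orders events that are
    # due at the same time), the function to call, and that function's arguments
    # (as a tuple)
    schedule.enter(0, 0, print_time, (inc,))
    schedule.run()


if __name__ == '__main__':
    # Print every 5 s
    start(5)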
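One property of this pattern is that if the job raises an exception, the next run is never scheduled and the loop stops. A sketch of one way around that, assuming it is acceptable to log the error and carry on: `run_periodically` and `flaky_job` are hypothetical names, and the run count is bounded only so that the example terminates.

```python
import sched
import time


def run_periodically(scheduler, interval, job, times):
    """Run job `times` times, `interval` seconds apart, surviving job failures."""

    def tick(remaining):
        try:
            job()
        except Exception as exc:  # one failed run should not stop later runs
            print('job failed: %s' % exc)
        if remaining > 1:
            scheduler.enter(interval, 0, tick, (remaining - 1,))

    scheduler.enter(0, 0, tick, (times,))
    scheduler.run()


calls = []


def flaky_job():
    """Stand-in for the crawl; fails on its second run."""
    calls.append(len(calls) + 1)
    if len(calls) == 2:
        raise RuntimeError('simulated network error')


run_periodically(sched.scheduler(time.time, time.sleep), 0.01, flaky_job, 3)
print(calls)
```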

Complete code

import pymysql
import requests
from bs4 import BeautifulSoup
import sched
import time


def create_table():
    # Open the connection
    db = pymysql.connect(host='localhost',
                         user='root',
                         password='root',
                         db='python3learn')
    # SQL statement that creates the article table
    sql = '''create table if not exists article (
    id int NOT NULL AUTO_INCREMENT,
    article_title text,
    article_author text,
    article_content text,
    PRIMARY KEY (`id`)
    )'''
    # Create a cursor object with cursor()
    cursor = db.cursor()
    try:
        # Execute the SQL statement
        cursor.execute(sql)
        # Commit the transaction
        db.commit()
        print('create table success')
    except Exception as e:  # roll back if an error occurs
        db.rollback()
        print(e)
    finally:
        # Close the cursor
        cursor.close()
        # Close the database connection
        db.close()


def insert_table(article_title, article_author, article_content):
    # Open the connection
    db = pymysql.connect(host='localhost',
                         user='root',
                         password='root',
                         db='python3learn',
                         charset="utf8")
    # Look up the title first, then insert only if it is not there yet
    query_sql = 'select * from article where article_title=%s'
    sql = 'insert into article (article_title,article_author,article_content) values (%s, %s, %s)'
    # Create a cursor object with cursor()
    cursor = db.cursor()
    try:
        query_value = (article_title,)
        # Execute the lookup
        cursor.execute(query_sql, query_value)
        results = cursor.fetchall()
        if len(results) == 0:
            value = (article_title, article_author, article_content)
            cursor.execute(sql, value)
            # Commit the transaction
            db.commit()
            print('--------------《%s》 insert table success-------------' % article_title)
            return True
        else:
            print('--------------《%s》 already exists-------------' % article_title)
            return False

    except Exception as e:  # roll back if an error occurs
        db.rollback()
        print(e)
        return False

    finally:
        # Close the cursor
        cursor.close()
        # Close the database connection
        db.close()


def get_html_data():
    # GET request; the timeout keeps the crawler from blocking forever
    response = requests.get('https://meiriyiwen.com/random', timeout=10)

    soup = BeautifulSoup(response.content, "html5lib")
    article = soup.find("div", id='article_show')
    article_title = article.h1.string
    print('article_title=%s' % article_title)
    article_author = article.find('p', class_="article_author").string
    print('article_author=%s' % article_author)
    article_contents = article.find('div', class_="article_text").find_all('p')
    article_content = ''
    for content in article_contents:
        # Keep each paragraph's HTML markup
        article_content = article_content + str(content)
    # Print once, after all paragraphs have been joined
    print('article_content=%s' % article_content)

    # Insert into the database
    insert_table(article_title, article_author, article_content)


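# Initialize the scheduler class from the sched module.
# The first argument is a function that returns a timestamp; the second is a
# function that blocks until the scheduled time arrives.
schedule = sched.scheduler(time.time, time.sleep)


# Function triggered periodically by the scheduler
def print_time(inc):
    get_html_data()
    schedule.enter(inc, 0, print_time, (inc,))


# Default interval: 60 s
def start(inc=60):
    # enter takes four arguments: the delay, the priority (orders events that are
    # due at the same time), the function to call, and that function's arguments
    # (as a tuple)
    schedule.enter(0, 0, print_time, (inc,))
    schedule.run()


if __name__ == '__main__':
    # Crawl every 5 minutes
    start(60*5)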

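Open question: this only crawls a single article. What about an article list where clicking an entry opens the article's detail page? Presumably you first fetch the list and then loop over it, parsing each detail page and inserting it into the database. I have not yet worked out a better way, so I will leave it for a later post.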
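As a starting point for that list-then-detail idea, the first step could look roughly like the sketch below: pull the detail links out of a list page, resolve them against the page URL, and drop duplicates. Everything specific here is hypothetical (the `ul.articles` markup, the example.com URLs, and the `extract_detail_urls` name); each resulting URL would then be fetched with requests and parsed like the single article above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical list-page markup; a real site will need different selectors.
LIST_HTML = """
<ul class="articles">
  <li><a href="/article/1">First</a></li>
  <li><a href="/article/2">Second</a></li>
  <li><a href="/article/1">First (duplicate link)</a></li>
</ul>
"""


def extract_detail_urls(list_html, base_url):
    """Return the absolute detail-page URLs on a list page, without duplicates."""
    soup = BeautifulSoup(list_html, 'html.parser')
    urls = []
    for a in soup.select('ul.articles a[href]'):
        url = urljoin(base_url, a['href'])  # resolve relative links
        if url not in urls:
            urls.append(url)
    return urls


detail_urls = extract_detail_urls(LIST_HTML, 'https://example.com/list')
print(detail_urls)
```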

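Closing

Although I am learning Python purely as a hobby, I still want to put it to use; otherwise I would soon forget what I have learned. I look forward to the next Python article.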

