Web Scraping: HTTP Requests and HTML Parsing (Scraping the Zhihu Website)

Posted by thoustree on 2021-05-19

Tags: web scraping, HTTP, HTML, website

1. Sending web requests

1.1 requests

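The get() method of the requests library sends a GET request. It is common to pass extra parameters with it, such as the "user-agent" request header and the login "cookie".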

1.1.1 user-agent

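Log in to the site in a browser and copy the "user-agent" value (for example from the request headers shown in the browser's developer tools) into a text file.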

1.1.2 cookie

Log in to the site in a browser and copy the "cookie" value from the same request headers into a text file.
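
The two values saved in 1.1.1 and 1.1.2 can be read back at run time instead of being pasted into the source. The following is a minimal sketch only: the helper name build_headers and the file names used with it are hypothetical, and each file is assumed to hold a single value copied from the browser.

```python
from pathlib import Path


def build_headers(user_agent_file, cookie_file):
    """Build a requests-style headers dict from two text files.

    Both file paths are assumptions made for this sketch; each file is
    expected to contain one value copied from the browser.
    """
    user_agent = Path(user_agent_file).read_text(encoding='utf-8').strip()
    cookie = Path(cookie_file).read_text(encoding='utf-8').strip()

    headers = {'user-agent': user_agent}
    if cookie:  # leave the header out when no cookie was saved
        headers['cookie'] = cookie
    return headers
```

The returned dict can be passed directly as the headers argument of requests.get(), for example requests.get(url, headers=build_headers('user_agent.txt', 'cookie.txt'), timeout=5).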

1.1.3 Test code

import requests
from requests.exceptions import RequestException

# Replace the empty 'cookie' value with your own cookie
headers = {
    'cookie': '',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}


def get_page(url):
    """Return the page HTML as text, or None if the request fails."""
    try:
        response = requests.get(url, headers=headers, timeout=5)
        if response.status_code == 200:
            print('Request succeeded')
            return response.text
        return None  # any status other than 200
    except RequestException:
        print('Request failed')
        return None


if __name__ == '__main__':
    input_url = 'https://www.zhihu.com/hot'
    get_page(input_url)

1.2 selenium

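Most sites can identify a selenium crawler through the value of window.navigator.webdriver, so the first step is to keep the site from recognising the selenium-driven browser. As with requests, selenium requests often need the "user-agent" request header and the login "cookie" as well.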

1.2.1 Removing the window.navigator.webdriver value in Selenium

Add the following code to the program (this applies to older versions of Chrome).

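import time

from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions


option = ChromeOptions()
# Drop the 'enable-automation' switch that Chrome is normally started with
option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = Chrome(options=option)
time.sleep(10)  # give the browser time to start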
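
The snippet above is described as targeting older Chrome versions. For newer versions, two approaches that are commonly combined with it are the Chrome switch --disable-blink-features=AutomationControlled and a script injected through the DevTools command Page.addScriptToEvaluateOnNewDocument. The sketch below shows both. The helper name make_driver and the two constants are hypothetical, and whether this actually hides the browser from a given site or Chrome version is an assumption that has to be checked.

```python
# Chrome switch commonly used so that navigator.webdriver is not set to true.
STEALTH_ARGUMENTS = ['--disable-blink-features=AutomationControlled']

# Script evaluated before any page script; makes navigator.webdriver undefined.
HIDE_WEBDRIVER_JS = (
    "Object.defineProperty(navigator, 'webdriver', "
    "{get: () => undefined});"
)


def make_driver():
    """Start Chrome with the settings above (hypothetical helper)."""
    # Imported here so the constants can be reused without selenium installed.
    from selenium.webdriver import Chrome, ChromeOptions

    option = ChromeOptions()
    option.add_experimental_option('excludeSwitches', ['enable-automation'])
    for argument in STEALTH_ARGUMENTS:
        option.add_argument(argument)

    driver = Chrome(options=option)
    # Chrome DevTools Protocol: run the script in every new document.
    driver.execute_cdp_cmd(
        'Page.addScriptToEvaluateOnNewDocument',
        {'source': HIDE_WEBDRIVER_JS},
    )
    return driver
```

After driver = make_driver(), running driver.execute_script('return navigator.webdriver') on a loaded page is one way to check what the site would see.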

1.2.2 user-agent

Log in to the site, copy the "user-agent" value into a text file, and run the following code to add the request header.

from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions


option = ChromeOptions()
# Pass the value without inner quotes so they cannot end up inside the header
option.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36')

1.2.3 cookie

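selenium requires each cookie to be a dict with the two keys "name" and "value" and their corresponding values. If the site's cookie is available only as a single string, copying it directly does not meet this requirement. Instead, the login "cookie" can be obtained with selenium's get_cookies() method.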

from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions
import time
import json

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
option.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36')
driver = Chrome(options=option)
time.sleep(10)

driver.get('https://www.zhihu.com/signin?next=%2F')
time.sleep(30)  # time to log in by hand in the opened browser window
driver.get('https://www.zhihu.com/')
cookies = driver.get_cookies()  # list of dicts, each with 'name' and 'value'
json_cookies = json.dumps(cookies)

# Choose your own file name and location; each run appends one JSON line
with open('cookies.txt', 'a') as f:
    f.write(json_cookies)
    f.write('\n')
    f.write('\n')

1.2.4 Test code example

Copy the cookie obtained above into the program below and it is ready to run.

from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions
from selenium.webdriver.common.by import By
import time

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = Chrome(options=option)
time.sleep(10)

driver.get('https://www.zhihu.com')
time.sleep(10)

driver.delete_all_cookies()   # clear the cookies set by the visit above
time.sleep(2)

cookie = {}  # replace with your own cookie: a dict with at least 'name' and 'value'
driver.add_cookie(cookie)
driver.get('https://www.zhihu.com/')
time.sleep(5)
# find_elements_by_css_selector was removed in newer Selenium 4 releases;
# find_elements(By.CSS_SELECTOR, ...) is the equivalent call
for i in driver.find_elements(By.CSS_SELECTOR, 'div[itemprop="zhihu:question"] > a'):
    print(i.text)
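
Rather than pasting a cookie into the source by hand, the file written in 1.2.3 can be read back. The sketch below assumes that file layout (one JSON list per line, newest last); the helper name load_saved_cookies is hypothetical.

```python
import json


def load_saved_cookies(path):
    """Return the cookie list stored on the last non-empty line of `path`."""
    with open(path, encoding='utf-8') as f:
        lines = [line.strip() for line in f if line.strip()]
    if not lines:
        return []
    cookies = json.loads(lines[-1])
    # Keep only entries selenium can use: both 'name' and 'value' present.
    return [c for c in cookies if 'name' in c and 'value' in c]


# Intended use, once the driver has opened a page on the same domain:
# for cookie in load_saved_cookies('cookies.txt'):
#     driver.add_cookie(cookie)
```

Depending on the selenium and browser versions, some saved fields may be rejected by add_cookie(); that case is not handled in this sketch.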

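2. HTML parsing (locating elements)

To scrape the target data, first locate the element that contains it. Both BeautifulSoup and selenium make it easy to traverse the elements of an HTML document.

2.1 Locating elements with BeautifulSoup

In the code below, BeautifulSoup first locates the "h2" tags whose class is "HotItem-title" and then reads each tag's string value through the .text attribute (it is an attribute, so it is not called with parentheses).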

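import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException

# Replace the empty 'cookie' value with your own cookie
headers = {
    'cookie': '',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}


def get_page(url):
    """Return the page HTML as text, or None if the request fails."""
    try:
        response = requests.get(url, headers=headers, timeout=5)
        if response.status_code == 200:
            print('Request succeeded')
            return response.text
        return None  # any status other than 200
    except RequestException:
        print('Request failed')
        return None


def parse_page(html):
    """Print the text of the first ten <h2 class="HotItem-title"> tags."""
    if html is None:  # the request failed, so there is nothing to parse
        return
    soup = BeautifulSoup(html, "html.parser")
    titles = soup.find_all("h2", {'class': 'HotItem-title'})[:10]
    for title in titles:
        print(title.text)  # .text is an attribute, not a method


if __name__ == '__main__':
    input_url = 'https://www.zhihu.com/hot'
    parse_page(get_page(input_url))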
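
To make clear what locating "h2" tags with the class "HotItem-title" amounts to, the same lookup can be sketched without BeautifulSoup, using the standard library's html.parser instead. This is an illustration only, not a replacement for the code above; the class name TitleParser and the sample HTML are made up for the example.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of every <h2> whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.titles = []
        self._inside = False   # True while inside a matching <h2>
        self._chunks = []      # text fragments of the current <h2>

    def handle_starttag(self, tag, attrs):
        if tag == 'h2':
            classes = (dict(attrs).get('class') or '').split()
            if self.target_class in classes:
                self._inside = True
                self._chunks = []

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'h2' and self._inside:
            self.titles.append(''.join(self._chunks).strip())
            self._inside = False


# Made-up sample markup, used only to exercise the parser offline.
sample_html = (
    '<div><h2 class="HotItem-title">First <b>question</b></h2>'
    '<h2 class="other">skipped</h2>'
    '<h2 class="HotItem-title big">Second</h2></div>'
)
parser = TitleParser('HotItem-title')
parser.feed(sample_html)
print(parser.titles)
```

Text inside nested tags is joined into the title, which mirrors what reading .text on a BeautifulSoup tag returns.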

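2.2 Locating elements with selenium

selenium's syntax for locating elements differs somewhat from BeautifulSoup's. The code example below (the same as in 1.2.4 Test code example) uses a hierarchical locator: 'div[itemprop="zhihu:question"] > a'. The author finds this way of locating more reassuring.

In selenium, the text value of an element is read through its .text attribute. BeautifulSoup tags expose .text as an attribute as well; the method form there is get_text().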

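from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions
from selenium.webdriver.common.by import By
import time

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = Chrome(options=option)
time.sleep(10)

driver.get('https://www.zhihu.com')
time.sleep(10)

driver.delete_all_cookies()   # clear the cookies set by the visit above
time.sleep(2)

cookie = {}  # replace with your own cookie: a dict with at least 'name' and 'value'
driver.add_cookie(cookie)
driver.get('https://www.zhihu.com/')
time.sleep(5)
# 'div[itemprop="zhihu:question"] > a' selects <a> elements that are direct
# children of a <div> whose itemprop attribute equals "zhihu:question"
for i in driver.find_elements(By.CSS_SELECTOR, 'div[itemprop="zhihu:question"] > a'):
    print(i.text)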
