[雪峰磁針石 Blog] Python Scraping Cookbook 1: Introduction to Web Scraping

Published 2018-09-10

Chapter 1: Introduction to Web Scraping

  • Scraping python.org with Requests and Beautiful Soup
  • Scraping python.org with urllib3 and Beautiful Soup
  • Scraping python.org with Scrapy
  • Scraping Python.org with Selenium and PhantomJS

First confirm that you can open: https://www.python.org/events/pythonevents
Install requests and bs4, and then we can start Example 1: scraping python.org with Requests and Beautiful Soup.


# pip3 install requests bs4 lxml

Scraping python.org with Requests and Beautiful Soup

01_events_with_requests.py


import requests
from bs4 import BeautifulSoup

def get_upcoming_events(url):
    req = requests.get(url)

    soup = BeautifulSoup(req.text, 'lxml')

    events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li')

    for event in events:
        event_details = dict()
        event_details['name'] = event.find('h3').find('a').text
        event_details['location'] = event.find('span', {'class': 'event-location'}).text
        event_details['time'] = event.find('time').text
        print(event_details)

get_upcoming_events('https://www.python.org/events/python-events/')

Output:


$ python3 01_events_with_requests.py 
{'name': 'PyCon US 2018', 'location': 'Cleveland, Ohio, USA', 'time': '09 May – 18 May  2018'}
{'name': 'DjangoCon Europe 2018', 'location': 'Heidelberg, Germany', 'time': '23 May – 28 May  2018'}
{'name': 'PyCon APAC 2018', 'location': 'NUS School of Computing / COM1, 13 Computing Drive, Singapore 117417, Singapore', 'time': '31 May – 03 June  2018'}
{'name': 'PyCon CZ 2018', 'location': 'Prague, Czech Republic', 'time': '01 June – 04 June  2018'}
{'name': 'PyConTW 2018', 'location': 'Taipei, Taiwan', 'time': '01 June – 03 June  2018'}
{'name': 'PyLondinium', 'location': 'London, UK', 'time': '08 June – 11 June  2018'}

Note: the listed events change over time, so your output will differ from run to run.
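To check the parsing logic without depending on the live page, the same Beautiful Soup selectors can be run against an inline HTML fragment. The fragment below is a simplified, hypothetical imitation of python.org's event markup, not the real page:

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical fragment imitating python.org's event markup
SAMPLE_HTML = """
<ul class="list-recent-events">
  <li>
    <h3 class="event-title"><a href="#">PyCon US 2018</a></h3>
    <p><span class="event-location">Cleveland, Ohio, USA</span>
       <time>09 May - 18 May 2018</time></p>
  </li>
</ul>
"""

def parse_events(html):
    """Extract name/location/time dicts from the event-list markup."""
    soup = BeautifulSoup(html, 'html.parser')  # html.parser needs no extra install
    events = []
    for li in soup.find('ul', {'class': 'list-recent-events'}).findAll('li'):
        events.append({
            'name': li.find('h3').find('a').text,
            'location': li.find('span', {'class': 'event-location'}).text,
            'time': li.find('time').text,
        })
    return events

print(parse_events(SAMPLE_HTML))
```

Running the selectors against a fixed fragment like this is an easy way to debug them before pointing the scraper at the real site.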

Exercise: use requests to scrape the blog-post titles on the home page of https://china-testing.github.io/ (10 titles in total).

Reference answer:

01_blog_title.py


import requests
from bs4 import BeautifulSoup

def get_upcoming_events(url):
    req = requests.get(url)

    soup = BeautifulSoup(req.text, 'lxml')

    events = soup.findAll('article')

    for event in events:
        event_details = {}
        event_details['name'] = event.find('h1').find('a').text
        print(event_details)

get_upcoming_events('https://china-testing.github.io/')

Output:


$ python3 01_blog_title.py 
{'name': '10分鐘學會API測試'}
{'name': 'python資料分析快速入門教程4-資料匯聚'}
{'name': 'python資料分析快速入門教程6-重整'}
{'name': 'python資料分析快速入門教程5-處理缺失資料'}
{'name': 'python庫介紹-pytesseract: OCR光學字元識別'}
{'name': '軟體自動化測試初學者忠告'}
{'name': '使用opencv轉換3d圖片'}
{'name': 'python opencv3例項(物件識別和擴增實境)2-邊緣檢測和應用影像過濾器'}
{'name': 'numpy學習指南3rd3:常用函式'}
{'name': 'numpy學習指南3rd2:NumPy基礎'}

Scraping python.org with urllib3 and Beautiful Soup

Code: 02_events_with_urlib3.py


import urllib3
from bs4 import BeautifulSoup

def get_upcoming_events(url):
    req = urllib3.PoolManager()
    res = req.request('GET', url)

    soup = BeautifulSoup(res.data, 'html.parser')

    events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li')

    for event in events:
        event_details = dict()
        event_details['name'] = event.find('h3').find('a').text
        event_details['location'] = event.find('span', {'class': 'event-location'}).text
        event_details['time'] = event.find('time').text
        print(event_details)

get_upcoming_events('https://www.python.org/events/python-events/')

requests is built on top of urllib3; in most cases you should simply use requests directly.
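One reason to drop down to urllib3 is that its PoolManager exposes low-level knobs such as retries and timeouts directly. A minimal sketch, with values chosen arbitrarily for illustration:

```python
import urllib3
from urllib3.util.retry import Retry

# Retry up to 3 times, backing off between attempts, on common server errors
retry = Retry(total=3, backoff_factor=0.5, status_forcelist=[500, 502, 503])

# Fail fast: 2 seconds to connect, 5 seconds to read
timeout = urllib3.Timeout(connect=2.0, read=5.0)

# These keyword arguments are applied to every connection pool it creates
http = urllib3.PoolManager(retries=retry, timeout=timeout)
print(retry.total, timeout.connect_timeout)
```

A PoolManager configured like this could then replace the bare `urllib3.PoolManager()` in the example above.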

Scraping python.org with Scrapy

Scrapy is a very popular open-source Python scraping framework for extracting data. It ships with many built-in modules and extensions, and it is also our tool of choice when scraping with Python.
Scrapy offers a number of powerful features worth mentioning:

  • Built-in extensions for issuing HTTP requests and handling compression, authentication, caching, user agents, and HTTP headers
  • Built-in support for selector languages such as CSS and XPath for selecting and extracting data, plus support for selecting content and links with regular expressions
  • Encoding support for handling languages and non-standard encoding declarations
  • A flexible API for reusing and writing custom middleware and pipelines, which provides a clean, simple way to implement tasks such as automatically downloading assets (for example, images or media) and storing data in backends such as the file system, S3, or databases

There are several ways to use Scrapy. One is the programmatic mode, where we create the crawler and spider in code. You can also configure a Scrapy project from a template or generator and run it from the command line. This book follows the programmatic mode, because it keeps the code in a single file.

Code: 03_events_with_scrapy.py


import scrapy
from scrapy.crawler import CrawlerProcess

class PythonEventsSpider(scrapy.Spider):
    name = 'pythoneventsspider'

    start_urls = ['https://www.python.org/events/python-events/',]
    found_events = []

    def parse(self, response):
        for event in response.xpath('//ul[contains(@class, "list-recent-events")]/li'):
            event_details = dict()
            event_details['name'] = event.xpath('h3[@class="event-title"]/a/text()').extract_first()
            event_details['location'] = event.xpath('p/span[@class="event-location"]/text()').extract_first()
            event_details['time'] = event.xpath('p/time/text()').extract_first()
            self.found_events.append(event_details)
if __name__ == "__main__":
    process = CrawlerProcess({'LOG_LEVEL': 'ERROR'})
    process.crawl(PythonEventsSpider)
    spider = next(iter(process.crawlers)).spider
    process.start()

    for event in spider.found_events: print(event)

Exercise: use Scrapy to scrape the blog-post titles on the home page of https://china-testing.github.io/ (10 titles in total).

Reference answer:

03_blog_with_scrapy.py


import scrapy
from scrapy.crawler import CrawlerProcess

class PythonEventsSpider(scrapy.Spider):
    name = 'pythoneventsspider'

    start_urls = ['https://china-testing.github.io/',]
    found_events = []

    def parse(self, response):
        for event in response.xpath('//article//h1'):
            event_details = dict()
            event_details['name'] = event.xpath('a/text()').extract_first()
            self.found_events.append(event_details)

if __name__ == "__main__":
    process = CrawlerProcess({'LOG_LEVEL': 'ERROR'})
    process.crawl(PythonEventsSpider)
    spider = next(iter(process.crawlers)).spider
    process.start()

    for event in spider.found_events: print(event)

Scraping Python.org with Selenium and PhantomJS

04_events_with_selenium.py


from selenium import webdriver

def get_upcoming_events(url):
    driver = webdriver.Chrome()
    driver.get(url)

    events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')

    for event in events:
        event_details = dict()
        event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
        event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
        event_details['time'] = event.find_element_by_xpath('p/time').text
        print(event_details)

    driver.close()

get_upcoming_events('https://www.python.org/events/python-events/')

Switching to driver = webdriver.PhantomJS('phantomjs') runs the browser without a visible UI. The code:

05_events_with_phantomjs.py


from selenium import webdriver

def get_upcoming_events(url):
    driver = webdriver.PhantomJS('phantomjs')
    driver.get(url)

    events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')

    for event in events:
        event_details = dict()
        event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
        event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
        event_details['time'] = event.find_element_by_xpath('p/time').text
        print(event_details)

    driver.close()

get_upcoming_events('https://www.python.org/events/python-events/')

That said, Selenium's own headless mode is now a better replacement for PhantomJS.

04_events_with_selenium_headless.py


from selenium import webdriver

def get_upcoming_events(url):

    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    driver = webdriver.Chrome(chrome_options=options)
    driver.get(url)

    events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')

    for event in events:
        event_details = dict()
        event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
        event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
        event_details['time'] = event.find_element_by_xpath('p/time').text
        print(event_details)

    driver.close()

get_upcoming_events('https://www.python.org/events/python-events/')
