[雪峰磁針石 Blog] Python Scraping Cookbook 1: Introduction to Web Scraping

Published 2018-09-10

Chapter 1: Introduction to Web Scraping

  • Scraping python.org with Requests and Beautiful Soup
  • Scraping python.org with urllib3 and Beautiful Soup
  • Scraping python.org with Scrapy
  • Scraping Python.org with Selenium and PhantomJS

First confirm that you can open: https://www.python.org/events/pythonevents
Install requests and bs4, and then we can start Example 1: scraping python.org with Requests and Beautiful Soup.


# pip3 install requests bs4 lxml

Scraping python.org with Requests and Beautiful Soup

01_events_with_requests.py


import requests
from bs4 import BeautifulSoup

def get_upcoming_events(url):
    req = requests.get(url)

    soup = BeautifulSoup(req.text, 'lxml')

    events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li')

    for event in events:
        event_details = dict()
        event_details['name'] = event.find('h3').find('a').text
        event_details['location'] = event.find('span', {'class': 'event-location'}).text
        event_details['time'] = event.find('time').text
        print(event_details)

get_upcoming_events('https://www.python.org/events/python-events/')

Output:


$ python3 01_events_with_requests.py 
{'name': 'PyCon US 2018', 'location': 'Cleveland, Ohio, USA', 'time': '09 May – 18 May  2018'}
{'name': 'DjangoCon Europe 2018', 'location': 'Heidelberg, Germany', 'time': '23 May – 28 May  2018'}
{'name': 'PyCon APAC 2018', 'location': 'NUS School of Computing / COM1, 13 Computing Drive, Singapore 117417, Singapore', 'time': '31 May – 03 June  2018'}
{'name': 'PyCon CZ 2018', 'location': 'Prague, Czech Republic', 'time': '01 June – 04 June  2018'}
{'name': 'PyConTW 2018', 'location': 'Taipei, Taiwan', 'time': '01 June – 03 June  2018'}
{'name': 'PyLondinium', 'location': 'London, UK', 'time': '08 June – 11 June  2018'}

Note: the listed events change over time, so your output will differ from run to run.
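To check the parsing logic without depending on the live page, the same Beautiful Soup selectors can be run against an inline HTML fragment. The fragment below is a simplified, hypothetical imitation of python.org's event markup, not the real page:

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical fragment imitating python.org's event markup
SAMPLE_HTML = """
<ul class="list-recent-events">
  <li>
    <h3 class="event-title"><a href="#">PyCon US 2018</a></h3>
    <p><span class="event-location">Cleveland, Ohio, USA</span>
       <time>09 May - 18 May 2018</time></p>
  </li>
</ul>
"""

def parse_events(html):
    """Extract name/location/time dicts from the event-list markup."""
    soup = BeautifulSoup(html, 'html.parser')  # html.parser needs no extra install
    events = []
    for li in soup.find('ul', {'class': 'list-recent-events'}).findAll('li'):
        events.append({
            'name': li.find('h3').find('a').text,
            'location': li.find('span', {'class': 'event-location'}).text,
            'time': li.find('time').text,
        })
    return events

print(parse_events(SAMPLE_HTML))
```

Running the selectors against a fixed fragment like this is an easy way to debug them before pointing the scraper at the real site.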

Exercise: use requests to scrape the blog-post titles on the home page of https://china-testing.github.io/ (10 titles in total).

Reference answer:

01_blog_title.py


import requests
from bs4 import BeautifulSoup

def get_upcoming_events(url):
    req = requests.get(url)

    soup = BeautifulSoup(req.text, 'lxml')

    events = soup.findAll('article')

    for event in events:
        event_details = {}
        event_details['name'] = event.find('h1').find('a').text
        print(event_details)

get_upcoming_events('https://china-testing.github.io/')

Output:


$ python3 01_blog_title.py 
{'name': '10分鐘學會API測試'}
{'name': 'python資料分析快速入門教程4-資料匯聚'}
{'name': 'python資料分析快速入門教程6-重整'}
{'name': 'python資料分析快速入門教程5-處理缺失資料'}
{'name': 'python庫介紹-pytesseract: OCR光學字元識別'}
{'name': '軟體自動化測試初學者忠告'}
{'name': '使用opencv轉換3d圖片'}
{'name': 'python opencv3例項(物件識別和擴增實境)2-邊緣檢測和應用影像過濾器'}
{'name': 'numpy學習指南3rd3:常用函式'}
{'name': 'numpy學習指南3rd2:NumPy基礎'}

Scraping python.org with urllib3 and Beautiful Soup

Code: 02_events_with_urlib3.py


import urllib3
from bs4 import BeautifulSoup

def get_upcoming_events(url):
    req = urllib3.PoolManager()
    res = req.request('GET', url)

    soup = BeautifulSoup(res.data, 'html.parser')

    events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li')

    for event in events:
        event_details = dict()
        event_details['name'] = event.find('h3').find('a').text
        event_details['location'] = event.find('span', {'class': 'event-location'}).text
        event_details['time'] = event.find('time').text
        print(event_details)

get_upcoming_events('https://www.python.org/events/python-events/')

requests is built on top of urllib3; in most cases you should simply use requests directly.
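One reason to drop down to urllib3 is that its PoolManager exposes low-level knobs such as retries and timeouts directly. A minimal sketch, with values chosen arbitrarily for illustration:

```python
import urllib3
from urllib3.util.retry import Retry

# Retry up to 3 times, backing off between attempts, on common server errors
retry = Retry(total=3, backoff_factor=0.5, status_forcelist=[500, 502, 503])

# Fail fast: 2 seconds to connect, 5 seconds to read
timeout = urllib3.Timeout(connect=2.0, read=5.0)

# These keyword arguments are applied to every connection pool it creates
http = urllib3.PoolManager(retries=retry, timeout=timeout)
print(retry.total, timeout.connect_timeout)
```

A PoolManager configured like this could then replace the bare `urllib3.PoolManager()` in the example above.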

Scraping python.org with Scrapy

Scrapy is a very popular open-source Python scraping framework for extracting data. It ships with many built-in modules and extensions, and it is also our tool of choice when scraping with Python.
Scrapy offers a number of powerful features worth mentioning:

  • Built-in extensions for issuing HTTP requests and handling compression, authentication, caching, user agents, and HTTP headers
  • Built-in support for selector languages such as CSS and XPath for selecting and extracting data, plus support for selecting content and links with regular expressions
  • Encoding support for handling languages and non-standard encoding declarations
  • A flexible API for reusing and writing custom middleware and pipelines, which provides a clean, simple way to implement tasks such as automatically downloading assets (for example, images or media) and storing data in backends such as the file system, S3, or databases

There are several ways to use Scrapy. One is the programmatic mode, where we create the crawler and spider in code. You can also configure a Scrapy project from a template or generator and run it from the command line. This book follows the programmatic mode, because it keeps the code in a single file.

Code: 03_events_with_scrapy.py


import scrapy
from scrapy.crawler import CrawlerProcess

class PythonEventsSpider(scrapy.Spider):
    name = 'pythoneventsspider'

    start_urls = ['https://www.python.org/events/python-events/',]
    found_events = []

    def parse(self, response):
        for event in response.xpath('//ul[contains(@class, "list-recent-events")]/li'):
            event_details = dict()
            event_details['name'] = event.xpath('h3[@class="event-title"]/a/text()').extract_first()
            event_details['location'] = event.xpath('p/span[@class="event-location"]/text()').extract_first()
            event_details['time'] = event.xpath('p/time/text()').extract_first()
            self.found_events.append(event_details)
if __name__ == "__main__":
    process = CrawlerProcess({'LOG_LEVEL': 'ERROR'})
    process.crawl(PythonEventsSpider)
    spider = next(iter(process.crawlers)).spider
    process.start()

    for event in spider.found_events: print(event)

Exercise: use Scrapy to scrape the blog-post titles on the home page of https://china-testing.github.io/ (10 titles in total).

Reference answer:

03_blog_with_scrapy.py


import scrapy
from scrapy.crawler import CrawlerProcess

class PythonEventsSpider(scrapy.Spider):
    name = 'pythoneventsspider'

    start_urls = ['https://china-testing.github.io/',]
    found_events = []

    def parse(self, response):
        for event in response.xpath('//article//h1'):
            event_details = dict()
            event_details['name'] = event.xpath('a/text()').extract_first()
            self.found_events.append(event_details)

if __name__ == "__main__":
    process = CrawlerProcess({'LOG_LEVEL': 'ERROR'})
    process.crawl(PythonEventsSpider)
    spider = next(iter(process.crawlers)).spider
    process.start()

    for event in spider.found_events: print(event)

Scraping Python.org with Selenium and PhantomJS

04_events_with_selenium.py


from selenium import webdriver

def get_upcoming_events(url):
    driver = webdriver.Chrome()
    driver.get(url)

    events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')

    for event in events:
        event_details = dict()
        event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
        event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
        event_details['time'] = event.find_element_by_xpath('p/time').text
        print(event_details)

    driver.close()

get_upcoming_events('https://www.python.org/events/python-events/')

Switching to driver = webdriver.PhantomJS('phantomjs') runs the browser without a visible UI. The code:

05_events_with_phantomjs.py


from selenium import webdriver

def get_upcoming_events(url):
    driver = webdriver.PhantomJS('phantomjs')
    driver.get(url)

    events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')

    for event in events:
        event_details = dict()
        event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
        event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
        event_details['time'] = event.find_element_by_xpath('p/time').text
        print(event_details)

    driver.close()

get_upcoming_events('https://www.python.org/events/python-events/')

That said, Selenium's own headless mode is now a better replacement for PhantomJS.

04_events_with_selenium_headless.py


from selenium import webdriver

def get_upcoming_events(url):

    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    driver = webdriver.Chrome(chrome_options=options)
    driver.get(url)

    events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')

    for event in events:
        event_details = dict()
        event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
        event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
        event_details['time'] = event.find_element_by_xpath('p/time').text
        print(event_details)

    driver.close()

get_upcoming_events('https://www.python.org/events/python-events/')
