[雪峰磁針石部落格] Python Scraping Cookbook 1: Introduction to Web Scraping
Chapter 1: Introduction to Web Scraping
- Scraping python.org with Requests and Beautiful Soup
- Scraping python.org with urllib3 and Beautiful Soup
- Scraping python.org with Scrapy
- Scraping python.org with Selenium and PhantomJS
Please confirm that you can open: https://www.python.org/events/python-events/
Install requests and bs4, and then we can start with example 1: scraping python.org with Requests and Beautiful Soup.
# pip3 install requests bs4
Scraping python.org with Requests and Beautiful Soup
- Goal: scrape the name, location, and time of each event listed on https://www.python.org/events/python-events/.
01_events_with_requests.py
import requests
from bs4 import BeautifulSoup

def get_upcoming_events(url):
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'lxml')
    events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li')
    for event in events:
        event_details = dict()
        event_details['name'] = event.find('h3').find('a').text
        event_details['location'] = event.find('span', {'class': 'event-location'}).text
        event_details['time'] = event.find('time').text
        print(event_details)

get_upcoming_events('https://www.python.org/events/python-events/')
Output:
$ python3 01_events_with_requests.py
{'name': 'PyCon US 2018', 'location': 'Cleveland, Ohio, USA', 'time': '09 May – 18 May 2018'}
{'name': 'DjangoCon Europe 2018', 'location': 'Heidelberg, Germany', 'time': '23 May – 28 May 2018'}
{'name': 'PyCon APAC 2018', 'location': 'NUS School of Computing / COM1, 13 Computing Drive, Singapore 117417, Singapore', 'time': '31 May – 03 June 2018'}
{'name': 'PyCon CZ 2018', 'location': 'Prague, Czech Republic', 'time': '01 June – 04 June 2018'}
{'name': 'PyConTW 2018', 'location': 'Taipei, Taiwan', 'time': '01 June – 03 June 2018'}
{'name': 'PyLondinium', 'location': 'London, UK', 'time': '08 June – 11 June 2018'}
Note: since the event listings change over time, your results will differ from run to run.
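The same selection logic can be tried offline against a small inline HTML snippet, which makes it easier to see what `find` and `findAll` are doing without a network request. The markup below is hypothetical, mirroring the structure of the python.org events list:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet shaped like the python.org events markup.
html = """
<ul class="list-recent-events">
  <li>
    <h3 class="event-title"><a href="#">Example Conf 2018</a></h3>
    <p><span class="event-location">Example City</span> <time>01 Jan. 2018</time></p>
  </li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')
# Same traversal as the scraper: the <ul> container, then each <li> event.
events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li')
for event in events:
    details = {
        'name': event.find('h3').find('a').text,
        'location': event.find('span', {'class': 'event-location'}).text,
        'time': event.find('time').text,
    }
    print(details)
    # {'name': 'Example Conf 2018', 'location': 'Example City', 'time': '01 Jan. 2018'}
```

Note that the attribute filter is a dict (`{'class': 'event-location'}`), not a set; mixing up `:` and `,` there is an easy mistake to make.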
Exercise: use requests to scrape the blog post titles (10 in total) from the homepage of https://china-testing.github.io/.
Reference answer:
01_blog_title.py
import requests
from bs4 import BeautifulSoup

def get_upcoming_events(url):
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'lxml')
    events = soup.findAll('article')
    for event in events:
        event_details = {}
        event_details['name'] = event.find('h1').find('a').text
        print(event_details)

get_upcoming_events('https://china-testing.github.io/')
Output:
$ python3 01_blog_title.py
{'name': '10分鐘學會API測試'}
{'name': 'python資料分析快速入門教程4-資料匯聚'}
{'name': 'python資料分析快速入門教程6-重整'}
{'name': 'python資料分析快速入門教程5-處理缺失資料'}
{'name': 'python庫介紹-pytesseract: OCR光學字元識別'}
{'name': '軟體自動化測試初學者忠告'}
{'name': '使用opencv轉換3d圖片'}
{'name': 'python opencv3例項(物件識別和擴增實境)2-邊緣檢測和應用影像過濾器'}
{'name': 'numpy學習指南3rd3:常用函式'}
{'name': 'numpy學習指南3rd2:NumPy基礎'}
Scraping python.org with urllib3 and Beautiful Soup
Code: 02_events_with_urlib3.py
import urllib3
from bs4 import BeautifulSoup

def get_upcoming_events(url):
    req = urllib3.PoolManager()
    res = req.request('GET', url)
    soup = BeautifulSoup(res.data, 'html.parser')
    events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li')
    for event in events:
        event_details = dict()
        event_details['name'] = event.find('h3').find('a').text
        event_details['location'] = event.find('span', {'class': 'event-location'}).text
        event_details['time'] = event.find('time').text
        print(event_details)

get_upcoming_events('https://www.python.org/events/python-events/')
requests is a wrapper around urllib3; in general you would just use requests directly.
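One reason to drop down to urllib3 is finer control over connection pooling, retries, and timeouts on the `PoolManager` itself. A minimal sketch of such a configuration follows; the specific retry and timeout values are illustrative assumptions, not recommendations:

```python
import urllib3
from urllib3.util.retry import Retry

# Retry transient server errors up to 3 times, backing off between attempts.
retries = Retry(total=3, backoff_factor=0.5, status_forcelist=[500, 502, 503])

# Fail fast on slow servers: 2s to connect, 5s to read a response.
timeout = urllib3.Timeout(connect=2.0, read=5.0)

# Every request made through this manager inherits the policy above.
http = urllib3.PoolManager(retries=retries, timeout=timeout)

# e.g. http.request('GET', 'https://www.python.org/events/python-events/')
# would now retry and time out according to this policy.
```

With plain requests, the equivalent behaviour requires mounting a custom `HTTPAdapter` per session, so this is one of the few places where urllib3's API is the more direct one.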
Scraping python.org with Scrapy
Scrapy is a very popular open-source Python scraping framework for extracting data. It provides all of these capabilities along with many other built-in modules and extensions, and it is our tool of choice when it comes to scraping with Python.
Scrapy offers a number of powerful features worth mentioning:
- Built-in extensions for making HTTP requests and handling compression, authentication, caching, user agents, and HTTP headers
- Built-in support for selector languages such as CSS and XPath for selecting and extracting data, as well as support for selecting content and links with regular expressions
- Encoding support for handling different languages and non-standard encoding declarations
- Flexible APIs for reusing and writing custom middleware and pipelines, which provide a clean, simple way to implement tasks such as automatically downloading assets (for example, images or media) and storing data in backends such as the file system, S3, or databases
There are several ways to use Scrapy. One is the programmatic mode, where we create the crawler and spider in code. It is also possible to configure a Scrapy template or generator project and run it from the command line. This book follows the programmatic mode, since it keeps the code in a single file.
Code: 03_events_with_scrapy.py
import scrapy
from scrapy.crawler import CrawlerProcess

class PythonEventsSpider(scrapy.Spider):
    name = 'pythoneventsspider'
    start_urls = ['https://www.python.org/events/python-events/',]
    found_events = []

    def parse(self, response):
        for event in response.xpath('//ul[contains(@class, "list-recent-events")]/li'):
            event_details = dict()
            event_details['name'] = event.xpath('h3[@class="event-title"]/a/text()').extract_first()
            event_details['location'] = event.xpath('p/span[@class="event-location"]/text()').extract_first()
            event_details['time'] = event.xpath('p/time/text()').extract_first()
            self.found_events.append(event_details)

if __name__ == "__main__":
    process = CrawlerProcess({'LOG_LEVEL': 'ERROR'})
    process.crawl(PythonEventsSpider)
    spider = next(iter(process.crawlers)).spider
    process.start()
    for event in spider.found_events: print(event)
Exercise: use Scrapy to scrape the blog post titles (10 in total) from the homepage of https://china-testing.github.io/.
Reference answer:
03_blog_with_scrapy.py
import scrapy
from scrapy.crawler import CrawlerProcess

class PythonEventsSpider(scrapy.Spider):
    name = 'pythoneventsspider'
    start_urls = ['https://china-testing.github.io/',]
    found_events = []

    def parse(self, response):
        for event in response.xpath('//article//h1'):
            event_details = dict()
            event_details['name'] = event.xpath('a/text()').extract_first()
            self.found_events.append(event_details)

if __name__ == "__main__":
    process = CrawlerProcess({'LOG_LEVEL': 'ERROR'})
    process.crawl(PythonEventsSpider)
    spider = next(iter(process.crawlers)).spider
    process.start()
    for event in spider.found_events: print(event)
Scraping python.org with Selenium and PhantomJS
04_events_with_selenium.py
from selenium import webdriver

def get_upcoming_events(url):
    driver = webdriver.Chrome()
    driver.get(url)
    events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')
    for event in events:
        event_details = dict()
        event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
        event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
        event_details['time'] = event.find_element_by_xpath('p/time').text
        print(event_details)
    driver.close()

get_upcoming_events('https://www.python.org/events/python-events/')
Switching to driver = webdriver.PhantomJS('phantomjs') runs the scrape without a visible browser window. The code:
05_events_with_phantomjs.py
from selenium import webdriver

def get_upcoming_events(url):
    driver = webdriver.PhantomJS('phantomjs')
    driver.get(url)
    events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')
    for event in events:
        event_details = dict()
        event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
        event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
        event_details['time'] = event.find_element_by_xpath('p/time').text
        print(event_details)
    driver.close()

get_upcoming_events('https://www.python.org/events/python-events/')
That said, Selenium's headless mode is now a better replacement for PhantomJS.
04_events_with_selenium_headless.py
from selenium import webdriver

def get_upcoming_events(url):
    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    driver = webdriver.Chrome(chrome_options=options)
    driver.get(url)
    events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')
    for event in events:
        event_details = dict()
        event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
        event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
        event_details['time'] = event.find_element_by_xpath('p/time').text
        print(event_details)
    driver.close()

get_upcoming_events('https://www.python.org/events/python-events/')