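Preface
Many people write web crawlers in Python, and Python has plenty of crawler frameworks to choose from, such as pyspider and Scrapy. I lean toward Scrapy, so this article uses it to build a small crawler demo.
This article is aimed at developers who already know some Python and have a basic understanding of web crawlers.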
Installing Scrapy
Check the environment first. Here Python is version 3.6.2 and pip is 9.0.1.
F:\techlee\python>python --version
Python 3.6.2
F:\techlee\python>pip --version
pip 9.0.1 from d:\program files\python\python36-32\lib\site-packages (python 3.6)
Install the Scrapy framework
F:\techlee\python>pip install scrapy
Collecting scrapy
Downloading Scrapy-1.4.0-py2.py3-none-any.whl (248kB)
100% |████████████████████████████████| 256kB 188kB/s
// a long installation process
Successfully installed Twisted-17.9.0 scrapy-1.4.0
If the installation fails with this error:
error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools
install the Visual C++ 2015 Build Tools:
http://landinghub.visualstudi…
Once the installation finishes, confirm the version:
F:\techlee\python>scrapy version
Scrapy 1.4.0
Creating the project
F:\techlee\python>scrapy startproject scrapyDemo
New Scrapy project 'scrapyDemo', using template directory 'd:\program files\python\python36-32\lib\site-packages\scrapy\templates\project', created in:
F:\techlee\python\scrapyDemo
You can start your first spider with:
cd scrapyDemo
scrapy genspider example example.com
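Directory structure
scrapyDemo/
    scrapy.cfg            # deployment configuration file
    scrapyDemo/           # the project's Python module
        __init__.py
        items.py          # data containers (item definitions)
        pipelines.py      # project pipelines file
        settings.py       # project settings
        spiders/          # Spider classes define how to crawl a site (or several sites)
            __init__.py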
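Create the class that does the crawling, ImoocSpider, in scrapyDemo/spiders:
# -*- coding: utf-8 -*-
import scrapy
from urllib import parse as urlparse


# Crawls imooc.com (慕課網)
class ImoocSpider(scrapy.Spider):
    # The name is how Scrapy locates (and instantiates) the spider, so it must be unique
    name = "imooc"
    # URLs the crawl starts from
    start_urls = ['http://www.imooc.com/course/list']
    # URLs whose domain is not in this list are not crawled
    allowed_domains = ['www.imooc.com']

    def parse(self, response):
        # Each course card on the list page links to a course detail page
        learn_nodes = response.css('a.course-card')
        for learn_node in learn_nodes:
            learn_url = learn_node.css("::attr(href)").extract_first()
            # Resolve the href against the current page and parse the result with parse_learn
            yield scrapy.Request(url=urlparse.urljoin(response.url, learn_url), callback=self.parse_learn)

    def parse_learn(self, response):
        # default='' keeps the string concatenation below from failing when nothing matches
        title = response.xpath('//h2[@class="l"]/text()').extract_first(default='')
        # Extracted but not printed in this demo
        content = response.xpath('//div[@class="course-brief"]/p/text()').extract_first()
        url = response.url
        print('標題:' + title)  # "標題" means "title"
        print('地址:' + url)  # "地址" means "address", i.e. the page URL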
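To illustrate two things the spider relies on, here is a minimal standard-library sketch: how `urljoin` turns a relative `href` into an absolute URL, and a simplified host check in the spirit of `allowed_domains`. The helpers `resolve` and `is_allowed` are hypothetical names introduced for this illustration. They are not Scrapy's implementation, which handles off-site filtering internally, and the subdomain rule used here is an assumption made for the sketch.

```python
from urllib import parse as urlparse


def resolve(page_url, href):
    """Resolve a (possibly relative) href against the page it was found on."""
    return urlparse.urljoin(page_url, href)


def is_allowed(url, allowed_domains):
    """Hypothetical helper: True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse.urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


page = "http://www.imooc.com/course/list"
allowed = ["www.imooc.com"]

# A root-relative link is resolved against the scheme and host of the page
print(resolve(page, "/learn/876"))  # http://www.imooc.com/learn/876
# An absolute link is returned unchanged
print(resolve(page, "http://www.imooc.com/learn/893"))  # http://www.imooc.com/learn/893
# Only URLs on an allowed host pass the check
print(is_allowed(resolve(page, "/learn/876"), allowed))  # True
print(is_allowed("http://example.com/", allowed))  # False
```

Resolving before requesting matters because the `href` values on a list page are often relative, and a request needs an absolute URL.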
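Start crawling
F:\techlee\python\scrapyDemo>scrapy crawl imooc
If the following error appears, the win32api library is missing. Install the build that matches your Python version.
import win32api
ModuleNotFoundError: No module named 'win32api'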
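Before running the crawl, you can check whether the modules it needs are importable. The sketch below uses only the standard library. `is_installed` is a hypothetical helper name, and the module names checked are the ones mentioned in this article (`win32api` ships as part of the pywin32 package).

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Report the modules this article depends on
for name in ("scrapy", "win32api"):
    status = "found" if is_installed(name) else "missing"
    print(name + ": " + status)
```

The output depends on your environment. A "missing" line for `win32api` corresponds to the ModuleNotFoundError shown above.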
Done
Output like the following means the crawl succeeded:
F:\techlee\python\scrapyDemo>scrapy crawl imooc
2017-10-17 14:28:32 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapyDemo)
……
2017-10-17 14:28:32 [scrapy.core.engine] INFO: Spider opened
2017-10-17 14:28:32 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-17 14:28:32 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-17 14:28:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/robots.txt> (referer: None)
2017-10-17 14:28:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/course/list> (referer: None)
2017-10-17 14:28:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/876> (referer: http://www.imooc.com/course/list)
標題:整合MultiDex專案實戰
地址:http://www.imooc.com/learn/876
2017-10-17 14:28:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/893> (referer: http://www.imooc.com/course/list)
標題:阿里D2前端技術論壇——2016初心
地址:http://www.imooc.com/learn/893
2017-10-17 14:28:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/890> (referer: http://www.imooc.com/course/list)
2017-10-17 14:28:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/888> (referer: http://www.imooc.com/course/list)
標題:Hadoop進階
地址:http://www.imooc.com/learn/890
標題:Javascript實現二叉樹演算法
地址:http://www.imooc.com/learn/888
2017-10-17 14:28:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/894> (referer: http://www.imooc.com/course/list)
標題:Fragment應用上
地址:http://www.imooc.com/learn/894
2017-10-17 14:28:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/887> (referer: http://www.imooc.com/course/list)
標題:PHP-物件導向
地址:http://www.imooc.com/learn/887
2017-10-17 14:28:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/900> (referer: http://www.imooc.com/course/list)
2017-10-17 14:28:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/889> (referer: http://www.imooc.com/course/list)
2017-10-17 14:28:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/901> (referer: http://www.imooc.com/course/list)
標題:Sketch的基礎例項應用
地址:http://www.imooc.com/learn/900
標題:ElasticSearch入門
地址:http://www.imooc.com/learn/889
標題:使用Google Guice實現依賴注入
地址:http://www.imooc.com/learn/901
2017-10-17 14:28:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/867> (referer: http://www.imooc.com/course/list)
標題:Docker入門
地址:http://www.imooc.com/learn/867
2017-10-17 14:28:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/878> (referer: http://www.imooc.com/course/list)
標題:Android圖表繪製之直方圖
地址:http://www.imooc.com/learn/878
2017-10-17 14:28:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/892> (referer: http://www.imooc.com/course/list)
標題:UI版式設計
地址:http://www.imooc.com/learn/892
2017-10-17 14:28:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/877> (referer: http://www.imooc.com/course/list)
2017-10-17 14:28:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/886> (referer: http://www.imooc.com/course/list)
標題:RxJava與RxAndroid基礎入門
地址:http://www.imooc.com/learn/877
標題:iOS開發之Audio特輯
地址:http://www.imooc.com/learn/886
2017-10-17 14:28:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/861> (referer: http://www.imooc.com/course/list)
標題:基於Websocket的火拼俄羅斯(基礎)
地址:http://www.imooc.com/learn/861
2017-10-17 14:28:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/895> (referer: http://www.imooc.com/course/list)
2017-10-17 14:28:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/882> (referer: http://www.imooc.com/course/list)
標題:2017AWS 技術峰會——大資料技術專場
地址:http://www.imooc.com/learn/895
標題:基於websocket的火拼俄羅斯(單機版)
地址:http://www.imooc.com/learn/882
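The titles printed above come from the XPath query in parse_learn. To see what that kind of query selects, here is a rough standard-library sketch run against a hypothetical, heavily simplified fragment of a course page. The markup and description text are invented for illustration and the real page may differ. `xml.etree.ElementTree` supports only a subset of XPath, so the paths are written in its dialect rather than copied from the spider.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real course page is larger and may differ
SAMPLE = """
<html><body>
  <h2 class="l">Docker入門</h2>
  <div class="course-brief"><p>A short course description.</p></div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Roughly equivalent to //h2[@class="l"]/text() in the spider
title_node = root.find(".//h2[@class='l']")
# Roughly equivalent to //div[@class="course-brief"]/p/text()
brief_node = root.find(".//div[@class='course-brief']/p")

title = title_node.text if title_node is not None else ""
brief = brief_node.text if brief_node is not None else ""

print("title:", title)
print("brief:", brief)
```

If the site changes its class names, these queries match nothing and the crawl prints empty or missing titles, so the selectors are the first thing to check.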
Original article: https://www.tech1024.cn/origi…
Saving the data to a MySQL database: https://www.tech1024.cn/origi…