(Reposted from) http://www.aobosir.com/blog/2016/11/26/python3-large-web-crawler-001-Build-development-environment/
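Preface
There are many ways to develop a Python crawler. In terms of program complexity, they fall into two kinds: crawler projects and single-file crawlers. Some readers may have used Python's urllib module. It makes single-file crawlers very convenient to write, but on large projects neither its efficiency nor the stability of the resulting program is very good. Scrapy is a Python crawling framework that improves development efficiency and is well suited to medium and large crawler projects. In short, urllib is the better fit for single-file crawlers, and Scrapy is the better fit for crawler projects.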
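For comparison, a single-file crawler of the kind described above might look like the following minimal sketch. It assumes only the standard library; the function names `extract_title` and `fetch_title` are hypothetical names chosen for this illustration, and the regular expression is a simplification rather than a full HTML parser.

```python
import re
import urllib.request

# Simplified pattern for the first <title> element (an assumption of this sketch,
# not a robust HTML parser).
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


def fetch_title(url, timeout=10):
    """Download a page with urllib and return its title (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return extract_title(html)
```

Calling `fetch_title` with a page URL would return that page's title. Everything here (downloading, decoding, parsing) is wired together by hand, which is the part a framework such as Scrapy takes over in a larger project.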
This series explains how to build crawler projects. This post is the first in the series: setting up the development environment.
1 . Install Python 3
Download it from the official website. Python 3.5 is fine, and the installer only needs to be clicked through.
By default, Python 3 is installed in: C:\Users\[Username]\AppData\Local\Programs\Python\Python35
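To confirm which interpreter is actually being run and where it lives, a short check like the one below can help. This is a sketch using only the standard library; `describe_interpreter` is a hypothetical helper name introduced here.

```python
import sys


def describe_interpreter():
    """Collect basic facts about the Python interpreter running this script."""
    return {
        "executable": sys.executable,  # full path of python.exe
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        "is_python3": sys.version_info[0] == 3,
        # On a 64-bit build, sys.maxsize is larger than 2**32.
        "is_64bit": sys.maxsize > 2**32,
    }


for key, value in describe_interpreter().items():
    print(key, "=", value)
```

The `is_64bit` value is worth noting, because it decides which prebuilt packages match this installation later on.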
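2 . Install a Python integrated development environment: PyCharm IDE 2016.1.4
Download: https://www.jetbrains.com/pycharm/download/#section=windows
Note:
Professional is the full edition, but it requires a license key.
Registration method: http://blog.csdn.net/tianzhaixing2013/article/details/44997881
This time I installed PyCharm 2016.
Community is the free edition, but the Terminal inside it cannot be used.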
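3 . Install Visual Studio 2015
First, it helps to know why Visual Studio is needed. (See this website.)
If you do not install it, running pip install third-package-name will sometimes fail with this error: error: Unable to find vcvarsall.bat
vcvarsall.bat is the script that sets up the Visual C++ build environment, which pip looks for when it has to compile a package's C extensions from source.
Visual Studio 2015 is installed in order to install the Python Tools 2.2.5 for Visual Studio 2015 that comes with it.
How to download and install Visual Studio 2015 is described here.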
4 . Upgrade pip
Run the following command in a DOS window to upgrade pip.
| python -m pip install --upgrade pip |
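5 . Install some third-party libraries
lxml, Twisted, pywin32, scrapy
lxml is a library that processes XML quickly and flexibly. Twisted is an event-driven networking engine framework implemented in Python. pywin32 provides the win32api. Scrapy is an application framework written for crawling websites and extracting structured data.
We installed Python 3.5 and my computer is 64-bit, so download: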
lxml-3.6.4-cp35-cp35m-win_amd64.whl
Twisted-16.5.0-cp35-cp35m-win_amd64.whl
pywin32-220.1-cp35-cp35m-win_amd64.whl
scrapy (install it directly with the command: pip.exe install scrapy)
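The reason the Python version and the 64-bit detail matter is encoded in the wheel filenames themselves: cp35 is the CPython 3.5 tag and win_amd64 is the 64-bit Windows tag. The sketch below splits a wheel filename into those tags. `parse_wheel_filename` is a hypothetical helper written for this illustration, and it assumes the standard name-version[-build]-python-abi-platform.whl layout.

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into its dash-separated tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: " + filename)
    parts = filename[:-len(".whl")].split("-")
    # Layout: name-version[-build]-python-abi-platform, so 5 or 6 parts.
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel filename: " + filename)
    return {
        "name": parts[0],
        "version": parts[1],
        "python_tag": parts[-3],    # e.g. cp35 = CPython 3.5
        "abi_tag": parts[-2],       # e.g. cp35m
        "platform_tag": parts[-1],  # e.g. win_amd64 = 64-bit Windows
    }


for wheel in ("lxml-3.6.4-cp35-cp35m-win_amd64.whl",
              "Twisted-16.5.0-cp35-cp35m-win_amd64.whl",
              "pywin32-220.1-cp35-cp35m-win_amd64.whl"):
    print(parse_wheel_filename(wheel))
```

If the tags do not match the installed interpreter (for example a cp27 or win32 wheel on 64-bit Python 3.5), pip refuses to install the file, so checking them before downloading saves a round trip.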
How to install third-party libraries in Python: http://blog.csdn.net/github_35160620/article/details/52203682
Note: if Python 2 was previously installed on your computer, Python 2 has its own pip and Python 3 has its own pip as well. So when you run pip install some-package-name on the DOS command line, which pip does the system use: Python 2's or Python 3's?
You can find the answer in this post: http://www.aobosir.com/blog/2016/11/23/pip-install-python2-python3/
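One way to sidestep the ambiguity, sketched here as an illustration rather than as the linked post's method, is to invoke pip through a specific interpreter with `python -m pip`, the same form used for the upgrade command earlier. The helper name `pip_command` is hypothetical.

```python
import shutil
import sys


def pip_command(*args):
    """Build a pip command tied to the interpreter running this script."""
    return [sys.executable, "-m", "pip"] + list(args)


# Which pip a bare "pip" resolves to depends on the PATH order.
print("pip found on PATH:", shutil.which("pip"))
# This form always uses the pip that belongs to this interpreter.
print("unambiguous command:", pip_command("install", "some-package-name"))
```

Running the script with the Python 3.5 interpreter shows both the pip that a bare `pip` would resolve to and the command that is guaranteed to target Python 3.5.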
After downloading, this is how I installed them on my computer:
Install lxml:
| C:\Users\AOBO>cd C:\Users\AOBO\AppData\Local\Programs\Python\Python35\Scripts
C:\Users\AOBO\AppData\Local\Programs\Python\Python35\Scripts>pip.exe install D:\software_install_package_win\python\some-Python-third-packages\lxml-3.6.4-cp35-cp35m-win_amd64.whl
Processing d:\software_install_package_win\python\some-python-third-packages\lxml-3.6.4-cp35-cp35m-win_amd64.whl
Installing collected packages: lxml
Successfully installed lxml-3.6.4
|
Install Twisted: (When it reached Collecting constantly>=15.1 (from Twisted==16.5.0), it hung, and it only continued after I pressed Ctrl+C. It then automatically downloaded these three dependencies: constantly, incremental, and zope.interface.)
| C:\Users\AOBO\AppData\Local\Programs\Python\Python35\Scripts>pip.exe install D:\software_install_package_win\python\some-Python-third-packages\Twisted-16.5.0-cp35-cp35m-win_amd64.whl
Processing d:\software_install_package_win\python\some-python-third-packages\twisted-16.5.0-cp35-cp35m-win_amd64.whl
Collecting constantly>=15.1 (from Twisted==16.5.0)
# (It hung here and only continued after I pressed Ctrl+C. It then automatically downloaded these three dependencies: constantly, incremental, and zope.interface.)
Downloading constantly-15.1.0-py2.py3-none-any.whl
Collecting incremental>=16.10.1 (from Twisted==16.5.0)
Downloading incremental-16.10.1-py2.py3-none-any.whl
Collecting zope.interface>=4.0.2 (from Twisted==16.5.0)
Downloading zope.interface-4.3.2-cp35-cp35m-win_amd64.whl (136kB)
100% |████████████████████████████████| 143kB 7.1kB/s
Requirement already satisfied: setuptools in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from zope.interface>=4.0.2->Twisted==16.5.0)
Installing collected packages: constantly, incremental, zope.interface, Twisted
Successfully installed Twisted-16.5.0 constantly-15.1.0 incremental-16.10.1 zope.interface-4.3.2 |
Install pywin32:
| C:\Users\AOBO\AppData\Local\Programs\Python\Python35\Scripts>pip.exe install D:\software_install_package_win\python\some-Python-third-packages\pywin32-220.1-cp35-cp35m-win_amd64.whl
Processing d:\software_install_package_win\python\some-python-third-packages\pywin32-220.1-cp35-cp35m-win_amd64.whl
Installing collected packages: pywin32
Successfully installed pywin32-220.1 |
Install scrapy:
| C:\Users\AOBO\AppData\Local\Programs\Python\Python35\Scripts>pip.exe install scrapy
Collecting scrapy
Downloading Scrapy-1.2.1-py2.py3-none-any.whl (294kB)
100% |████████████████████████████████| 296kB 338kB/s
Collecting service-identity (from scrapy)
Downloading service_identity-16.0.0-py2.py3-none-any.whl
Collecting six>=1.5.2 (from scrapy)
Downloading six-1.10.0-py2.py3-none-any.whl
Collecting w3lib>=1.15.0 (from scrapy)
Downloading w3lib-1.16.0-py2.py3-none-any.whl
Collecting PyDispatcher>=2.0.5 (from scrapy)
Downloading PyDispatcher-2.0.5.tar.gz
Requirement already satisfied: Twisted>=10.0.0 in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from scrapy)
Requirement already satisfied: lxml in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from scrapy)
Collecting cssselect>=0.9 (from scrapy)
Downloading cssselect-1.0.0-py2.py3-none-any.whl
Collecting parsel>=0.9.3 (from scrapy)
Downloading parsel-1.1.0-py2.py3-none-any.whl
Collecting queuelib (from scrapy)
Downloading queuelib-1.4.2-py2.py3-none-any.whl
Collecting pyOpenSSL (from scrapy)
Downloading pyOpenSSL-16.2.0-py2.py3-none-any.whl (43kB)
100% |████████████████████████████████| 51kB 4.7MB/s
Collecting pyasn1 (from service-identity->scrapy)
Downloading pyasn1-0.1.9-py2.py3-none-any.whl
Collecting pyasn1-modules (from service-identity->scrapy)
Downloading pyasn1_modules-0.0.8-py2.py3-none-any.whl
Collecting attrs (from service-identity->scrapy)
Downloading attrs-16.2.0-py2.py3-none-any.whl
Requirement already satisfied: constantly>=15.1 in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from Twisted>=10.0.0->scrapy)
Requirement already satisfied: zope.interface>=4.0.2 in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from Twisted>=10.0.0->scrapy)
Requirement already satisfied: incremental>=16.10.1 in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from Twisted>=10.0.0->scrapy)
Collecting cryptography>=1.3.4 (from pyOpenSSL->scrapy)
Downloading cryptography-1.6-cp35-cp35m-win_amd64.whl (1.3MB)
100% |████████████████████████████████| 1.3MB 257kB/s
Requirement already satisfied: setuptools in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from zope.interface>=4.0.2->Twisted>=10.0.0->scrapy)
Collecting cffi>=1.4.1 (from cryptography>=1.3.4->pyOpenSSL->scrapy)
Downloading cffi-1.9.1-cp35-cp35m-win_amd64.whl (158kB)
100% |████████████████████████████████| 163kB 322kB/s
Collecting idna>=2.0 (from cryptography>=1.3.4->pyOpenSSL->scrapy)
Downloading idna-2.1-py2.py3-none-any.whl (54kB)
100% |████████████████████████████████| 61kB 4.4MB/s
Collecting pycparser (from cffi>=1.4.1->cryptography>=1.3.4->pyOpenSSL->scrapy)
Downloading pycparser-2.17.tar.gz (231kB)
100% |████████████████████████████████| 235kB 311kB/s
Installing collected packages: six, pycparser, cffi, pyasn1, idna, cryptography, pyOpenSSL, pyasn1-modules, attrs, service-identity, w3lib, PyDispatcher, cssselect, parsel, queuelib, scrapy
Running setup.py install for pycparser ... done
Running setup.py install for PyDispatcher ... done
Successfully installed PyDispatcher-2.0.5 attrs-16.2.0 cffi-1.9.1 cryptography-1.6 cssselect-1.0.0 idna-2.1 parsel-1.1.0 pyOpenSSL-16.2.0 pyasn1-0.1.9 pyasn1-modules-0.0.8 pycparser-2.17 queuelib-1.4.2 scrapy-1.2.1 service-identity-16.0.0 six-1.10.0 w3lib-1.16.0 |
Check whether scrapy was installed successfully: (run scrapy -h; if it prints information, the installation succeeded)
| C:\Users\AOBO>scrapy -h
Scrapy 1.2.1 - no active project
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
commands
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
[ more ] More commands available when run from project directory
Use "scrapy <command> -h" to see more info about a command
C:\Users\AOBO>
|
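Check that all the libraries just installed were installed successfully:
Start PyCharm and create a new project:
The libraries just installed can be seen here:
Installation succeeded.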
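The same check can also be done from a script instead of the PyCharm interface. The following is a minimal sketch using the standard library's `importlib.util.find_spec`; the helper name `check_modules` is hypothetical, and the list uses import names, which differ from some package names (pywin32 is imported as win32api).

```python
import importlib.util

# Import names, not package names: pywin32 is imported as win32api.
MODULES = ["lxml", "twisted", "win32api", "scrapy"]


def check_modules(names):
    """Map each top-level module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, found in check_modules(MODULES).items():
    print("{:10s} {}".format(name, "OK" if found else "MISSING"))
```

Note that this only confirms that each module can be located. It does not import the module, so a problem that appears at import time, such as a DLL failing to load, would not be caught here.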
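6 . A command-line window tool: PowerCmd
PowerCmd is an enhancement tool for Windows CMD.
Download and installation: http://www.aobosir.com/blog/2016/11/23/powercmd-install/
This software turned out to be rather poor, though: when I run a command such as scrapy -h in it, nothing is printed, whereas the same command does print information in a DOS window.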
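Test the environment
1 . Run scrapy -h. If it prints information, Scrapy was installed successfully.
2 . Run scrapy bench. If it runs into a problem, the pywin32 library still has steps to complete. (For the fix to the error import win32api ImportError: DLL load failed, see the solution here.)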
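The two checks above can be approximated in a script. This is a sketch under stated assumptions: `run_check` and `can_import_win32api` are hypothetical helpers, only `scrapy -h` is actually executed (`scrapy bench` takes longer and is left to run by hand), and the win32api import stands in for the failure that `scrapy bench` would reveal.

```python
import shutil
import subprocess


def run_check(command):
    """Run a command quietly.

    Returns True on exit code 0, False on any other exit code,
    and None when the executable is not found on PATH.
    """
    executable = shutil.which(command[0])
    if executable is None:
        return None
    result = subprocess.run([executable] + command[1:],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


def can_import_win32api():
    """Return True if win32api (from pywin32, Windows only) imports cleanly."""
    try:
        import win32api  # noqa: F401
    except ImportError:
        # Also covers "ImportError: DLL load failed".
        return False
    return True


print("scrapy -h:", run_check(["scrapy", "-h"]))
print("import win32api:", can_import_win32api())
```

A result of None for `scrapy -h` suggests that the Scripts directory is not on PATH, while False for the win32api import points to the unfinished pywin32 setup mentioned above.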
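Next, we will learn the Scrapy commands. After that, we will learn how to create a Scrapy crawler project and a spider, with an example that crawls the Baidu title and CSDN blogs.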