Python 3 Large-Scale Web Crawler in Practice 001 --- Setting Up the Development Environment

Posted by zhuhai__yizhi on 2017-08-24

My machine: Windows 10, 64-bit

Preface

There are many ways to write a Python crawler. In terms of program complexity, they fall into two categories: crawler projects and standalone crawler scripts.
Some of you have probably played with Python's urllib module. It is very convenient for writing small, standalone crawler scripts, but on larger projects you will find that both the efficiency and the stability of the program leave something to be desired.
Scrapy is a Python crawler framework. Using Scrapy improves development efficiency, and it is well suited to medium and large crawler projects.
In short: the urllib library is better for standalone crawler scripts, while Scrapy is better for crawler projects.
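To make the "crawler script" idea concrete, here is a minimal sketch of my own (not from the original post) that uses only the standard library's urllib; it fetches a page and prints the beginning of the HTML. The file name and URL are just placeholders.

# fetch_page.py -- a minimal "crawler script" using only the standard library (Python 3)
from urllib import request

url = "http://www.baidu.com/"            # any page you are allowed to fetch
with request.urlopen(url, timeout=10) as response:
    html = response.read().decode("utf-8", errors="replace")

print(html[:200])                        # show the first 200 characters of the page

This works fine for one-off pages, but once you need retries, scheduling and concurrent requests, the project structure Scrapy provides pays off, which is why this series uses Scrapy.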

This column walks through how to build a crawler project. This post is the first one: setting up the development environment.

1 . Install Python 3

Just download it from the official website; any Python 3.5 release will do, and the installer is a simple next-next-finish affair.

By default, Python 3 is installed under: C:\Users\[Username]\AppData\Local\Programs\Python\Python35.

2 . Install the Python IDE: PyCharm 2016.1.4

Download: https://www.jetbrains.com/pycharm/download/#section=windows

Note:

The Professional edition is the full version, but it requires a license key.

How to register: http://blog.csdn.net/tianzhaixing2013/article/details/44997881

What I installed this time is PyCharm 2016.

The Community edition is free, but the Terminal inside it cannot be used.

3 . Install Visual Studio 2015

First, you should understand why Visual Studio is needed at all. (See the referenced site.)

If you do not install it, then sometimes when you run pip install third-package-name you will hit this error: error: Unable to find vcvarsall.bat (pip needs a C/C++ compiler to build some packages from source, and vcvarsall.bat ships with Visual Studio's C++ build tools).


The point of installing Visual Studio 2015 is to get the Python Tools 2.2.5 for Visual Studio 2015 component that it installs.

How to download and install Visual Studio 2015 is described here.

4 . Upgrade pip

Run the following command in a DOS (Command Prompt) window to upgrade pip:

python -m pip install --upgrade pip

5 . Install Some Third-Party Libraries

lxml, Twisted, pywin32, scrapy

lxml is a library for fast, flexible processing of XML and HTML (see the short sketch after this list).
Twisted is an event-driven networking engine written in Python.
pywin32 provides the win32api (Python bindings for the Windows API).
Scrapy is an application framework written for crawling web sites and extracting structured data from them.
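As a quick illustration of what lxml is for (a sketch of my own, not from the original post), it can parse an HTML string and pull values out with XPath:

# lxml_demo.py -- parse a small HTML snippet and extract data with XPath
from lxml import etree

html = "<html><body><h1>Hello</h1><a href='/page1'>link</a></body></html>"
tree = etree.HTML(html)                  # tolerant HTML parser

print(tree.xpath("//h1/text()"))         # ['Hello']
print(tree.xpath("//a/@href"))           # ['/page1']

Scrapy itself uses lxml (through parsel) for its selectors, which is why it appears in this list.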


We installed Python 3.5 and my machine is 64-bit, so download the following (in the file names, cp35 means CPython 3.5 and win_amd64 means 64-bit Windows):

lxml‑3.6.4‑cp35‑cp35m‑win_amd64.whl

Twisted‑16.5.0‑cp35‑cp35m‑win_amd64.whl

pywin32‑220.1‑cp35‑cp35m‑win_amd64.whl

scrapy (install it directly with the command: pip.exe install scrapy.)


How to install third-party Python libraries: http://blog.csdn.net/github_35160620/article/details/52203682

Note: if Python 2 was installed on your machine earlier, then Python 2 has its own pip and Python 3 has its own pip as well. So when you run pip install some-package-name on the DOS command line, which pip will the system use: Python 2's, or Python 3's?

You can find the answer to this question in this post: http://www.aobosir.com/blog/2016/11/23/pip-install-python2-python3/
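One way to sidestep the ambiguity (my own suggestion, not taken from the linked post) is to never call pip.exe bare and instead always run pip through the interpreter you mean. A tiny helper script, with an arbitrary file name of my own choosing, shows which interpreter and which pip you are actually talking to:

# which_pip.py -- report which interpreter is running and which pip belongs to it
import sys
import pip

print("interpreter:", sys.executable)    # e.g. ...\Python35\python.exe
print("pip version:", pip.__version__)

Run it with the interpreter you intend, for example python which_pip.py, or py -3 which_pip.py with the Windows launcher. For the same reason, python -m pip install some-package-name is a safer habit than a bare pip install, because the package is guaranteed to land in that interpreter's site-packages.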


After downloading the wheels, this is how I installed them on my machine:

Install lxml:

C:\Users\AOBO>cd C:\Users\AOBO\AppData\Local\Programs\Python\Python35\Scripts
C:\Users\AOBO\AppData\Local\Programs\Python\Python35\Scripts>pip.exe install D:\software_install_package_win\python\some-Python-third-packages\lxml-3.6.4-cp35-cp35m-win_amd64.whl
Processing d:\software_install_package_win\python\some-python-third-packages\lxml-3.6.4-cp35-cp35m-win_amd64.whl
Installing collected packages: lxml
Successfully installed lxml-3.6.4

Install Twisted: (when it reached the line Collecting constantly>=15.1 (from Twisted==16.5.0) it hung; only after I pressed Ctrl+C did it continue, automatically downloading the three dependencies below: constantly, incremental and zope.interface)

C:\Users\AOBO\AppData\Local\Programs\Python\Python35\Scripts>pip.exe install D:\software_install_package_win\python\some-Python-third-packages\Twisted-16.5.0-cp35-cp35m-win_amd64.whl
Processing d:\software_install_package_win\python\some-python-third-packages\twisted-16.5.0-cp35-cp35m-win_amd64.whl
Collecting constantly>=15.1 (from Twisted==16.5.0)
#(it hung here; after Ctrl+C it continued and downloaded the three dependencies below automatically)
  Downloading constantly-15.1.0-py2.py3-none-any.whl
Collecting incremental>=16.10.1 (from Twisted==16.5.0)
  Downloading incremental-16.10.1-py2.py3-none-any.whl
Collecting zope.interface>=4.0.2 (from Twisted==16.5.0)
  Downloading zope.interface-4.3.2-cp35-cp35m-win_amd64.whl (136kB)
    100% |████████████████████████████████| 143kB 7.1kB/s
Requirement already satisfied: setuptools in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from zope.interface>=4.0.2->Twisted==16.5.0)
Installing collected packages: constantly, incremental, zope.interface, Twisted
Successfully installed Twisted-16.5.0 constantly-15.1.0 incremental-16.10.1 zope.interface-4.3.2

Install pywin32:

C:\Users\AOBO\AppData\Local\Programs\Python\Python35\Scripts>pip.exe install D:\software_install_package_win\python\some-Python-third-packages\pywin32-220.1-cp35-cp35m-win_amd64.whl
Processing d:\software_install_package_win\python\some-python-third-packages\pywin32-220.1-cp35-cp35m-win_amd64.whl
Installing collected packages: pywin32
Successfully installed pywin32-220.1

Install scrapy:

C:\Users\AOBO\AppData\Local\Programs\Python\Python35\Scripts>pip.exe install scrapy
Collecting scrapy
  Downloading Scrapy-1.2.1-py2.py3-none-any.whl (294kB)
    100% |████████████████████████████████| 296kB 338kB/s
Collecting service-identity (from scrapy)
  Downloading service_identity-16.0.0-py2.py3-none-any.whl
Collecting six>=1.5.2 (from scrapy)
  Downloading six-1.10.0-py2.py3-none-any.whl
Collecting w3lib>=1.15.0 (from scrapy)
  Downloading w3lib-1.16.0-py2.py3-none-any.whl
Collecting PyDispatcher>=2.0.5 (from scrapy)
  Downloading PyDispatcher-2.0.5.tar.gz
Requirement already satisfied: Twisted>=10.0.0 in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from scrapy)
Requirement already satisfied: lxml in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from scrapy)
Collecting cssselect>=0.9 (from scrapy)
  Downloading cssselect-1.0.0-py2.py3-none-any.whl
Collecting parsel>=0.9.3 (from scrapy)
  Downloading parsel-1.1.0-py2.py3-none-any.whl
Collecting queuelib (from scrapy)
  Downloading queuelib-1.4.2-py2.py3-none-any.whl
Collecting pyOpenSSL (from scrapy)
  Downloading pyOpenSSL-16.2.0-py2.py3-none-any.whl (43kB)
    100% |████████████████████████████████| 51kB 4.7MB/s
Collecting pyasn1 (from service-identity->scrapy)
  Downloading pyasn1-0.1.9-py2.py3-none-any.whl
Collecting pyasn1-modules (from service-identity->scrapy)
  Downloading pyasn1_modules-0.0.8-py2.py3-none-any.whl
Collecting attrs (from service-identity->scrapy)
  Downloading attrs-16.2.0-py2.py3-none-any.whl
Requirement already satisfied: constantly>=15.1 in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from Twisted>=10.0.0->scrapy)
Requirement already satisfied: zope.interface>=4.0.2 in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from Twisted>=10.0.0->scrapy)
Requirement already satisfied: incremental>=16.10.1 in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from Twisted>=10.0.0->scrapy)
Collecting cryptography>=1.3.4 (from pyOpenSSL->scrapy)
  Downloading cryptography-1.6-cp35-cp35m-win_amd64.whl (1.3MB)
    100% |████████████████████████████████| 1.3MB 257kB/s
Requirement already satisfied: setuptools in c:\users\aobo\appdata\local\programs\python\python35\lib\site-packages (from zope.interface>=4.0.2->Twisted>=10.0.0->scrapy)
Collecting cffi>=1.4.1 (from cryptography>=1.3.4->pyOpenSSL->scrapy)
  Downloading cffi-1.9.1-cp35-cp35m-win_amd64.whl (158kB)
    100% |████████████████████████████████| 163kB 322kB/s
Collecting idna>=2.0 (from cryptography>=1.3.4->pyOpenSSL->scrapy)
  Downloading idna-2.1-py2.py3-none-any.whl (54kB)
    100% |████████████████████████████████| 61kB 4.4MB/s
Collecting pycparser (from cffi>=1.4.1->cryptography>=1.3.4->pyOpenSSL->scrapy)
  Downloading pycparser-2.17.tar.gz (231kB)
    100% |████████████████████████████████| 235kB 311kB/s
Installing collected packages: six, pycparser, cffi, pyasn1, idna, cryptography, pyOpenSSL, pyasn1-modules, attrs, service-identity, w3lib, PyDispatcher, cssselect, parsel, queuelib, scrapy
  Running setup.py install for pycparser ... done
  Running setup.py install for PyDispatcher ... done
Successfully installed PyDispatcher-2.0.5 attrs-16.2.0 cffi-1.9.1 cryptography-1.6 cssselect-1.0.0 idna-2.1 parsel-1.1.0 pyOpenSSL-16.2.0 pyasn1-0.1.9 pyasn1-modules-0.0.8 pycparser-2.17 queuelib-1.4.2 scrapy-1.2.1 service-identity-16.0.0 six-1.10.0 w3lib-1.16.0

Check whether Scrapy installed successfully: (run scrapy -h; if it prints the usage information below, the installation worked)

C:\Users\AOBO>scrapy -h
Scrapy 1.2.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  commands
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

C:\Users\AOBO>

Check that all the libraries we just installed are in place:

Launch PyCharm and create a new project.


The libraries we just installed show up in the package list of the project interpreter.


Installation successful.
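If you would rather check from the command line than from PyCharm, a small script of my own along these lines (the file name is arbitrary) imports each library and prints its version:

# check_env.py -- confirm the four libraries import cleanly and report their versions
import lxml.etree
import twisted
import scrapy

print("lxml   :", lxml.etree.__version__)
print("Twisted:", twisted.__version__)
print("Scrapy :", scrapy.__version__)

# pywin32 has no single top-level package; importing win32api is the usual smoke test
import win32api
print("pywin32: win32api imported OK")

If the win32api import fails with a DLL error, see the pywin32 note in the "Testing the Environment" section below.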


6 . A Handy Command-Line Tool: PowerCmd

PowerCmd is an enhancement tool for the Windows CMD shell.

Download and installation instructions: http://www.aobosir.com/blog/2016/11/23/powercmd-install/

That said, the tool has a real flaw: when I run a command such as scrapy -h inside PowerCmd, nothing gets printed, whereas the same command prints its output just fine in a plain DOS window.



Testing the Environment

1 . Run scrapy -h. If it prints information, Scrapy was installed successfully.

2 . Run scrapy bench. If you run into a problem here, pywin32 still needs an extra post-install step. (To resolve the error import win32api ImportError: DLL load failed, see the fix here.)


Next we will learn the Scrapy commands. Once we know them, we will move on to creating a Scrapy crawler project and its spiders, with an example: crawling the Baidu page title and a CSDN blog.
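For orientation only (a sketch of my own, ahead of the next post), a Scrapy spider is just a small Python class. The one below would log the title of the Baidu homepage; the file and spider names are my own placeholders.

# baidu_title_spider.py -- run with: scrapy runspider baidu_title_spider.py
import scrapy

class BaiduTitleSpider(scrapy.Spider):
    name = "baidu_title"
    start_urls = ["http://www.baidu.com/"]

    def parse(self, response):
        # extract the <title> text with an XPath selector
        title = response.xpath("//title/text()").extract_first()
        self.logger.info("page title: %s", title)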


Please visit: http://www.aobosir.com/
