Web Scraping with Python -- Writing Your First Web Crawler

Posted by zhujianing^_^ on 2017-03-30

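Writing your first Python web crawler

To scrape a website, we first need to download the pages that contain the data of interest, a process generally known as crawling.
This article covers three ways of fetching page content: using a sitemap file, iterating over IDs, and following links.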

Downloading a web page

Before we can crawl a page, we have to download it. The download script looks like this:

import urllib2
def download(url):
    return urllib2.urlopen(url).read()

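When a URL is passed in, this function downloads the page and returns its HTML.
The snippet has one problem, though: if the URL does not exist, urllib2 raises an exception. An improved version catches urllib2.URLError and returns None: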

import urllib2
def download(url):
    try:
        html=urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Downloading error:',e.reason
        html=None
    return html
print download('http://www.sse.com.cn')

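Errors encountered while downloading are often temporary, such as the 503 error a server returns when it is overloaded. For errors like this, simply retrying the download is enough. The code below adds retry support: it retries only on 5xx status codes, at most num_reload times.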

import urllib2
def download(url,num_reload=5):
    try:
        html=urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Downloading error:',e.reason
        html = None
        if num_reload>0 and ( hasattr(e,'code') and 500<=e.code<600 ):
            return download(url,num_reload-1)
    return html

download('http://httpstat.us/500')
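
The recursive retry above can also be written as a loop. The sketch below is a Python 3 port (urllib2 was split into urllib.request and urllib.error in Python 3), not the author's code. The `opener` parameter and the name `num_retries` are assumptions, added only so the retry logic can be exercised without a network connection.

```python
import urllib.error
import urllib.request


def download(url, num_retries=2, opener=urllib.request.urlopen):
    """Return the page body as bytes, or None if the download fails.

    `opener` is a hypothetical hook (not in the original code) that
    defaults to urllib.request.urlopen and can be swapped out in tests.
    """
    # One initial attempt plus up to num_retries retries.
    for attempts_left in range(num_retries, -1, -1):
        try:
            return opener(url).read()
        except urllib.error.URLError as e:
            print('Downloading error:', e.reason)
            # Only HTTPError carries a status code; plain URLError does not.
            code = getattr(e, 'code', None)
            retryable = code is not None and 500 <= code < 600
            if not retryable or attempts_left == 0:
                return None
    return None
```

Only 5xx responses are retried, because a 4xx error such as 404 points to a problem with the request itself and is unlikely to go away on a second attempt.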

The following step can be skipped:

Setting the user agent

By default, urllib2 downloads pages with the user agent Python-urllib/2.7, where 2.7 is the Python version number.

import urllib2
def download(url,user_agent='wswp',num_reload=5):
    headers={'User-agent':user_agent}
    request=urllib2.Request(url,headers=headers)
    try:
        html=urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Downloading error:',e.reason
        html = None
        if num_reload>0 and ( hasattr(e,'code') and 500<=e.code<600 ):
            return download(url,user_agent,num_reload-1)
    return html

download('http://httpstat.us/500')

We now have a flexible function that lets us set both the user agent and the number of retries.
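
To see what the user-agent setting actually changes, the sketch below inspects the headers without sending any request. It is written for Python 3, where urllib2's functionality lives in urllib.request; the helper name `build_request` is an assumption made for this illustration.

```python
import urllib.request

# The default opener announces itself as "Python-urllib/<version>".
default_agent = dict(urllib.request.build_opener().addheaders)['User-agent']
print(default_agent)


def build_request(url, user_agent='wswp'):
    """Build a request carrying a custom User-agent header (hypothetical helper)."""
    return urllib.request.Request(url, headers={'User-agent': user_agent})


request = build_request('http://example.webscraping.com')
print(request.get_header('User-agent'))
```

Passing `request` to urllib.request.urlopen would then send the custom header instead of the default one.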

Sitemap crawler

Use the sitemap (the sitemap file) referenced in robots.txt to download all of the pages.
The robots.txt file:
# section 1
User-agent: BadCrawler
Disallow: /
# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap
# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
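
The rules in this robots.txt can be checked programmatically. The sketch below is an illustration, not part of the original article: it feeds the same lines to Python 3's standard urllib.robotparser module and asks what each crawler may fetch. The user agent name 'wswp' is borrowed from the download function's default, and `site_maps()` needs Python 3.8 or later.

```python
from urllib import robotparser

ROBOTS_TXT = """\
# section 1
User-agent: BadCrawler
Disallow: /
# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap
# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse in memory, no download needed

base = 'http://example.webscraping.com'
bad_allowed = rp.can_fetch('BadCrawler', base + '/')            # section 1 bans it everywhere
page_allowed = rp.can_fetch('wswp', base + '/view/Afghanistan-1')
trap_allowed = rp.can_fetch('wswp', base + '/trap')             # section 2 bans /trap
delay = rp.crawl_delay('wswp')                                  # seconds between requests
sitemaps = rp.site_maps()                                       # Python 3.8+

print(bad_allowed, page_allowed, trap_allowed, delay, sitemaps)
```

Section 1 bans the crawler named BadCrawler from the whole site, section 2 asks every other crawler to wait 5 seconds between requests and to stay away from /trap, and section 3 points to the sitemap.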
Below is the crawler code that uses the sitemap file:

#!/usr/bin/python
#coding:utf-8
import urllib2
import re
def download(url,user_agent='wswp',num_reload=5):
    headers={'User-agent':user_agent}
    request=urllib2.Request(url,headers=headers)
    try:
        html=urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Downloading error:',e.reason
        html = None
        if num_reload>0 and ( hasattr(e,'code') and 500<=e.code<600 ):
            return download(url,user_agent,num_reload-1)
    return html

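def crawl_sitemap(url):
    sitemap = download(url)  # download the sitemap file
    if sitemap is None:  # the download failed, so there is nothing to parse
        return
    links = re.findall('<loc>(.*?)</loc>', sitemap)  # extract the links listed in the sitemap
    for link in links:
        print link
        html = download(link)
        # print html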

crawl_sitemap('http://example.webscraping.com/sitemap.xml')

Running the code above prints each URL listed in the sitemap as it is downloaded.
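
The link-extraction step of crawl_sitemap can be tried offline. The sketch below, written for Python 3, applies the same regular expression to a small hand-written sitemap; the sample XML and the helper name `extract_sitemap_links` are assumptions made for this illustration.

```python
import re

# A hand-written stand-in for the real sitemap.xml (assumed content).
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
<url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
</urlset>"""


def extract_sitemap_links(sitemap):
    """Return every URL found inside <loc> tags, or [] if there is no sitemap."""
    if sitemap is None:
        return []
    if isinstance(sitemap, bytes):
        # In Python 3 a downloaded body arrives as bytes and must be decoded.
        sitemap = sitemap.decode('utf-8')
    return re.findall(r'<loc>(.*?)</loc>', sitemap)


links = extract_sitemap_links(SAMPLE_SITEMAP)
for link in links:
    print(link)
```

The non-greedy `(.*?)` keeps each match inside a single pair of <loc> tags even when several tags share one line.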

ID iteration crawler

Below are the URLs of some sample countries:
http://example.webscraping.com/view/Afghanistan-1
http://example.webscraping.com/view/Aland-Islands-2
http://example.webscraping.com/view/Albania-3
http://example.webscraping.com/view/Algeria-4
http://example.webscraping.com/view/American-Samoa-5
http://example.webscraping.com/view/Andorra-6
http://example.webscraping.com/view/Angola-7
http://example.webscraping.com/view/Anguilla-8
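As you can see, these URLs differ only at the end. Including a page alias in the URL is very common practice and helps with search engine optimization. In most cases the web server ignores this string and uses only the ID to look up the matching record in the database. We can therefore try http://example.webscraping.com/view/-%d to match every page. Sample code: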

#!/usr/bin/python
#coding:utf-8
import urllib2
import itertools
def download(url,user_agent='wswp',num_reload=5):
    headers={'User-agent':user_agent}
    request=urllib2.Request(url,headers=headers)
    try:
        html=urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Downloading error:',e.reason
        html = None
        if num_reload>0 and ( hasattr(e,'code') and 500<=e.code<600 ):
            return download(url,user_agent,num_reload-1)
    return html

for page in itertools.count(1):
    url='http://example.webscraping.com/view/-%d'%page
    html=download(url)
    print url
    if html is None:
        break
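
The loop above stops at the first failed download, so a single deleted record would end the crawl early. One possible refinement, sketched here for Python 3 and not taken from the code above, is to stop only after several consecutive failures. The function name `crawl_ids`, the `max_errors` parameter, and the injected `download` callable are all assumptions made for this illustration.

```python
import itertools


def crawl_ids(download, url_template, max_errors=5):
    """Walk IDs 1, 2, 3, ... until max_errors downloads in a row fail.

    `download` is any callable that returns the page body, or None on failure.
    Returns the list of URLs that were downloaded successfully.
    """
    pages = []
    num_errors = 0
    for page_id in itertools.count(1):
        url = url_template % page_id
        html = download(url)
        if html is None:
            num_errors += 1
            if num_errors == max_errors:
                break  # assume we have run past the last record
        else:
            num_errors = 0  # a success resets the failure streak
            pages.append(url)
    return pages
```

With `max_errors=1` the function behaves like the original loop; larger values tolerate gaps in the ID sequence at the cost of a few extra requests at the end.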

Link crawler

This time we will use regular expressions to decide which pages to download.
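
As a rough illustration of the idea, the Python 3 sketch below follows only the links whose path matches a regular expression. It is one possible shape for such a crawler, not the author's implementation: the names `get_links` and `link_crawler`, the injected `download` callable, and the in-memory `SITE` pages are all assumptions.

```python
import re
from urllib.parse import urljoin


def get_links(html):
    """Return the href value of every <a> tag in the page."""
    webpage_regex = re.compile(r'<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)


def link_crawler(seed_url, link_regex, download):
    """Crawl from seed_url, following only links that match link_regex."""
    crawl_queue = [seed_url]
    seen = {seed_url}  # avoid downloading the same page twice
    visited = []
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        visited.append(url)
        if html is None:
            continue
        for link in get_links(html):
            if re.match(link_regex, link):
                link = urljoin(seed_url, link)  # make relative links absolute
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)
    return visited


# Hypothetical in-memory pages standing in for real downloads.
SITE = {
    'http://example.webscraping.com/':
        '<a href="/index/1">next</a> <a href="/view/Afghanistan-1">A</a> '
        '<a href="/user/login">login</a>',
    'http://example.webscraping.com/index/1':
        '<a href="/view/Albania-3">B</a> <a href="/view/Afghanistan-1">A</a>',
    'http://example.webscraping.com/view/Afghanistan-1': '<a href="/">home</a>',
    'http://example.webscraping.com/view/Albania-3': '',
}

visited = link_crawler('http://example.webscraping.com/', '/(index|view)', SITE.get)
print(sorted(visited))
```

The pattern `/(index|view)` keeps the crawler on index and country pages and skips everything else, such as the login link.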
