Python 3網路爬蟲開發實戰 (Python 3 Web Crawler Development in Practice)

Posted by lxcl96 on 2021-04-28

Analyzing the Robots Protocol

The book uses Jianshu as its example site and analyzes that site's robots.txt file.

robots.txt

The contents of Jianshu's robots.txt file are as follows:

# See http://www.robotstxt.org/wc/norobots.html for documentation on how to use the robots.txt file
#
# To ban all spiders from the entire site uncomment the next two lines:
User-agent: *
Disallow: /search
Disallow: /convos/
Disallow: /notes/
Disallow: /admin/
Disallow: /adm/
Disallow: /p/0826cf4692f9
Disallow: /p/d8b31d20a867
Disallow: /collections/*/recommended_authors
Disallow: /trial/*
Disallow: /keyword_notes
Disallow: /stats-2017/*

User-agent: trendkite-akashic-crawler
Request-rate: 1/2 # load 1 page per 2 seconds
Crawl-delay: 60

User-agent: YisouSpider
Request-rate: 1/10 # load 1 page per 10 seconds
Crawl-delay: 60

User-agent: Cliqzbot
Disallow: /

User-agent: Googlebot
Request-rate: 2/1 # load 2 page per 1 seconds
Crawl-delay: 10
Allow: /

User-agent: Mediapartners-Google
Allow: /
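Incidentally, the Request-rate and Crawl-delay lines above are non-standard directives, but since Python 3.6 RobotFileParser can read them back through request_rate() and crawl_delay(). A minimal sketch (not from the book) that feeds the Googlebot block from the listing above straight to the parser:

from urllib.robotparser import RobotFileParser

# Sketch only: parse the Googlebot entry from the listing above directly.
# crawl_delay() and request_rate() require Python 3.6 or newer.
rules = [
    'User-agent: Googlebot',
    'Request-rate: 2/1',
    'Crawl-delay: 10',
    'Allow: /',
]

rp = RobotFileParser()
rp.parse(rules)
print(rp.can_fetch('Googlebot', 'http://www.jianshu.com/p/b67554025d7d'))  # True
print(rp.crawl_delay('Googlebot'))    # 10
print(rp.request_rate('Googlebot'))   # RequestRate(requests=2, seconds=1)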

The crawler code:

According to the rules above, the book's code should print True, False (the article URL is allowed, while anything under /search is disallowed), but every time I ran it myself I got False, False:

from urllib.robotparser import RobotFileParser
import urllib.request

rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')

# This step is essential, even though it returns nothing
rp.read()
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*', "http://www.jianshu.com/search?q=python&page=l&type=collections"))
print(rp.mtime())

So I started digging into the read() method of RobotFileParser and found that it uses urlopen. My suspicion was that because urlopen sends no browser headers with the request, the server identifies it as a crawler and refuses access.

# Try opening this link with a bare urlopen call (no browser headers);
# if the server rejects it, urlopen raises an HTTPError instead of returning
f = urllib.request.urlopen('http://www.jianshu.com/robots.txt')
print(f)

Sure enough, the request was rejected.

There are two ways to deal with this:

① Modify the read() method: drop the bare urlopen call and build a Request that carries a browser User-Agent instead:

The original read() source is as follows:

    def read(self):
        """Reads the robots.txt URL and feeds it to the parser."""
        try:
            f = urllib.request.urlopen(self.url)
        except urllib.error.HTTPError as err:
            if err.code in (401, 403):
                self.disallow_all = True
            elif err.code >= 400 and err.code < 500:
                self.allow_all = True
        else:
            raw = f.read()
            self.parse(raw.decode("utf-8").splitlines())

The modified read() method:

    def read(self):
        """Reads the robots.txt URL and feeds it to the parser."""
        try:
            # Build the request with a browser User-Agent so the server does
            # not reject it as a bot; keeping urlopen inside the try block
            # preserves the original HTTPError handling.
            req = urllib.request.Request(self.url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'})
            f = urllib.request.urlopen(req)
        except urllib.error.HTTPError as err:
            if err.code in (401, 403):
                self.disallow_all = True
            elif err.code >= 400 and err.code < 500:
                self.allow_all = True
        else:
            raw = f.read()
            self.parse(raw.decode("utf-8").splitlines())

 

Running it again now gives the expected result: success!
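In practice there is no need to edit the standard library file itself: the same change can be applied by subclassing RobotFileParser and overriding read(). The class name below is my own, purely for illustration; this is a minimal sketch of the same idea:

from urllib.robotparser import RobotFileParser
import urllib.request
import urllib.error

# Hypothetical subclass, for illustration only
class HeaderedRobotFileParser(RobotFileParser):
    """RobotFileParser whose read() sends a browser User-Agent."""

    def read(self):
        try:
            req = urllib.request.Request(
                self.url,
                headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'})
            f = urllib.request.urlopen(req)
        except urllib.error.HTTPError as err:
            if err.code in (401, 403):
                self.disallow_all = True
            elif 400 <= err.code < 500:
                self.allow_all = True
        else:
            self.parse(f.read().decode('utf-8').splitlines())

rp = HeaderedRobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))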

② Instead of read(), use the parse() method: fetch robots.txt ourselves with a Request carrying the headers, then hand the lines to parse() for analysis.

The code is as follows:

rps = RobotFileParser()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
# Fetch robots.txt ourselves, attaching the browser headers
req = urllib.request.Request('http://www.jianshu.com/robots.txt', headers=headers)
lines = urllib.request.urlopen(req).read().decode('utf-8').split('\n')

# Feed the robots.txt contents to the parser
rps.parse(lines)
# With the Request carrying headers, this now returns True
print(rps.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rps.can_fetch('*', "http://www.jianshu.com/search?q=python&page=1&type=collections"))
print(rps.mtime())

Running this also succeeds.
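One more variant worth mentioning (a sketch of my own, not from the book): install a global opener that carries the browser User-Agent, after which the unmodified read() works as-is, because urlopen() uses the installed opener for every subsequent request:

import urllib.request
from urllib.robotparser import RobotFileParser

# Sketch only: the headers set on the installed opener are sent by every
# later urlopen() call, including the one inside the stock read()
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0')]
urllib.request.install_opener(opener)

rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
rp.read()  # now fetches robots.txt with the browser User-Agent attached
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))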

 
