Before Crawling a Website, Part 4: Avoiding Spider Traps

Posted by Pop_Rain on 2017-05-19

Original URL: https://blog.csdn.net/pop_rain/article/details/72526642

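So far, our crawler follows every link it has not visited before. Some websites, however, generate page content dynamically, which can produce an unlimited number of pages. For example, if a site has an online calendar with links to the next month and the next year, the next month's page will in turn link to the month after that, and the chain of pages never ends. This situation is called a spider trap.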
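
To make the trap concrete, here is a minimal sketch of such a calendar. The URL pattern `/calendar/YYYY-MM` and the functions `next_month_url` and `follow_calendar` are hypothetical, invented only for illustration. It shows that every page links to a URL that has never been visited, so a record of visited links alone cannot stop the crawl.

```python
import itertools

def next_month_url(url):
    """Return the 'next month' link a hypothetical calendar page would contain."""
    prefix, _, stamp = url.rpartition("/")
    year, month = (int(part) for part in stamp.split("-"))
    if month == 12:
        year, month = year + 1, 1
    else:
        month += 1
    return "%s/%04d-%02d" % (prefix, year, month)

def follow_calendar(start_url):
    """Yield the endless chain of pages reachable from start_url."""
    url = start_url
    while True:
        yield url
        url = next_month_url(url)

# Take only the first 14 pages; without a limit this chain would never end.
start = "http://example.com/calendar/2017-05"  # assumed URL, not a real site
pages = list(itertools.islice(follow_calendar(start), 14))
print(pages[-1])
print(len(set(pages)) == len(pages))  # True: no URL repeats, so a visited check never fires
```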

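A simple way to avoid getting stuck in a spider trap is to record how many links were followed to reach the current page, that is, its depth. Once the maximum depth is reached, the crawler stops adding that page's links to the queue. To implement this we change the seen variable: it used to record only the links that had been visited, and it now becomes a dictionary that also records the depth of each page.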

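# To avoid spider traps, seen (used to skip duplicate links) becomes a dictionary that also records each page's depth.
# With this in place, we can be confident the crawl will eventually finish.
# To disable the feature, set max_depth to a negative number; the current depth will then never equal it.
def link_crawler(..., max_depth=2):
    seen = {}
    ...
    depth = seen[url]
    if depth != max_depth:
        for link in links:
            if link not in seen:
                seen[link] = depth + 1
                crawl_queue.append(link)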
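
The outline above can be tried without any network access. The following is a minimal sketch, not the full crawler: `crawl` and `calendar_links` are hypothetical names, `get_links(url)` stands in for downloading a page and extracting its links, and the `http://example.com/month/N` URLs are made up to simulate a trap in which page N always links to page N + 1.

```python
def crawl(seed_url, get_links, max_depth=2):
    """Depth-limited crawl; get_links(url) replaces download + link extraction."""
    crawl_queue = [seed_url]
    seen = {seed_url: 0}          # URL -> number of links followed to reach it
    while crawl_queue:
        url = crawl_queue.pop()
        depth = seen[url]
        if depth != max_depth:    # at max depth, stop queuing this page's links
            for link in get_links(url):
                if link not in seen:
                    seen[link] = depth + 1
                    crawl_queue.append(link)
    return seen

def calendar_links(url):
    """Hypothetical trap: page N always links to page N + 1."""
    prefix, _, number = url.rpartition("/")
    return ["%s/%d" % (prefix, int(number) + 1)]

seen = crawl("http://example.com/month/0", calendar_links, max_depth=2)
for url, depth in sorted(seen.items(), key=lambda item: item[1]):
    print(depth, url)
```

With max_depth=2 the crawl stops after pages 0, 1 and 2, even though the simulated site never runs out of links. Because pop() takes the most recently added link, the crawl is depth-first, so the depth stored for a page is that of the first path that reached it, which is not necessarily the shortest one.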

Integrating this feature into the earlier link crawler gives the following code:

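import urllib.request
import urllib.error
import re  # regular expressions
import urllib.parse  # converts relative links (which a browser understands but Python does not) into absolute URLs
import urllib.robotparser  # parses the site's robots.txt before crawling, to avoid what the site forbids or restricts
import datetime  # needed by the download throttle
import time  # time.sleep() is used by the download throttle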
def download(url, user_agent="brain", proxy=None, num_retries=2):
    """Download the page at url. proxy defaults to None; pass a value to use a proxy."""
    print("downloading:", url)
    header = {"user-agent": user_agent}  # set our own user agent instead of Python's default, Python-urllib/3.6
    req = urllib.request.Request(url, headers=header)

    opener = urllib.request.build_opener()  # opener that can carry an optional proxy handler
    if proxy:  # if a proxy is given, route this URL's scheme through it
        proxy_params = {urllib.parse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib.request.ProxyHandler(proxy_params))

    try:
        html = opener.open(req).read()  # go through the opener so the proxy setting takes effect
    except urllib.error.URLError as e:  # something went wrong during the download
        print("download error:", e.reason)
        html = None
        # 4XX errors mean the request is at fault, 5XX errors mean the server is, so retry only on 5XX
        if num_retries > 0:
            if hasattr(e, "code") and 500 <= e.code < 600:
                return download(url, user_agent, proxy, num_retries - 1)  # recursively retry 5XX HTTP errors
    return html
# download("http://example.webscraping.com")  # works normally
# download("http://httpstat.us/500")  # test page that always returns a 5XX error

# Link-following crawler
# link_crawler() takes the URL of the site to crawl, a regular expression for the links to follow, and the maximum depth.
def link_crawler(seed_url, link_regex, max_depth=2):
    """Download seed_url, extract every link URL from the page, and match each one against link_regex.
    A link that matches link_regex is added to the queue,
    and the next pass of while crawl_queue: repeats the same steps for it.
    This continues until crawl_queue is empty, at which point the function returns."""
    crawl_queue = [seed_url]
    # To avoid spider traps, seen maps each URL to its depth. A negative max_depth disables the limit, because the current depth then never equals it.
    seen = {seed_url: 0}  # the seed URL starts at depth 0
    # Previously: seen = set(crawl_queue), a set of crawled links used to skip duplicates when pages link to each other
    user_agent = "brain"

    # Parse robots.txt once before crawling, to avoid what the site forbids or restricts
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://example.webscraping.com/robots.txt")
    rp.read()
    # One shared Throttle, so the time of the last access is remembered between downloads
    throttle = Throttle(delay=5)  # the example site's robots.txt asks for a delay of 5

    while crawl_queue:
        url = crawl_queue.pop()

        if rp.can_fetch(user_agent, url):  # continue only if robots.txt allows this URL
            throttle.wait(url)  # rate limit: call before every download
            html = download(url, user_agent)  # proxy and num_retries can also be passed here

            if html is None:  # download failed, skip this page
                continue
            html = html.decode("utf-8", errors="ignore")  # bytes -> str for the regex; UTF-8 is an assumption

            depth = seen[url]  # depth of the current page, used to avoid spider traps
            if depth != max_depth:
                # filter for links matching our regular expression
                for link in get_links(html):
                    if re.match(link_regex, link):
                        link = urllib.parse.urljoin(seed_url, link)  # turn the extracted relative path (e.g. /view/Poland-178) into an absolute URL
                        if link not in seen:  # not crawled or queued before
                            seen[link] = depth + 1  # one level deeper than the current page
                            crawl_queue.append(link)  # new link: queue it for crawling
        else:
            print("Blocked by robots.txt:", url)
        
def get_links(html):
    """Return every link URL found in an HTML page."""
    # webpage_regex matches strings such as <a href="xxx"> or <a href='xxx'> and captures the URL xxx. Note that xxx is often a relative path in the page source, e.g. view/1, which cannot be opened as-is.
    webpage_regex = re.compile('<a href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)
    # return re.findall('<a[^>]+href=["\'](.*?)["\']', html) would also work, but compiling the pattern first, as above, is preferable

class Throttle:  # download rate limiter; call wait() before every download
    """Add a delay between downloads to the same domain"""
    def __init__(self, delay):
        self.delay = delay  # value of delay between downloads for each domain
        self.domains = {}   # time when each domain was last accessed (aside: a Unix timestamp is the number of seconds since 1970-01-01 00:00:00 GMT, i.e. 08:00:00 Beijing time)

    def wait(self, url):
        domain = urllib.parse.urlparse(url).netloc
        last_accessed = self.domains.get(domain)

        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (datetime.datetime.now() - last_accessed).total_seconds()
            if sleep_secs > 0:
                time.sleep(sleep_secs)  # domain has been accessed recently, so need to sleep
        self.domains[domain] = datetime.datetime.now()

# Only follow links like http://example.webscraping.com/index... or http://example.webscraping.com/view...
link_crawler("http://example.webscraping.com", "/(index|view)")
