Exceptions and timeouts in highly robust Python crawlers

Posted by 路易十四 on 2017-02-13

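Crawlers are the kind of program where surprises are the norm: there is no guarantee that every request returns a stable, uniform result. To make a crawler robust, it has to cope with bad data, timeouts, and deadlocks; only then can it run for months without stopping. This article grew out of long-term maintenance of an open-source anti-anti-crawler library on GitHub; stars are welcome if you want more.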

1: Basic exception handling with try & except

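The point of a try & except statement is not only to catch an exception but, more importantly, to let the program ignore it: the vast majority of exceptions in a crawler disappear if the request is simply made again. So when an exception turns up, putting the failed task back on the task queue is the least-effort fix.

Second, a statement wrapped in try cannot bring down the whole program even if it fails. Trust me, you definitely do not want a program that was meant to run all weekend to stop in the middle of the night.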

import time

try:
    pass
    # statements that may fail
except Exception as e:
    pass
    # keep the failed url so it can be rerun later
    print(e)
finally:
    # runs whether or not an exception was handled
    print(time.ctime())
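
The idea of keeping failed URLs for a later rerun can be sketched as a small retry queue. This is only a minimal sketch, not the author's implementation: `crawl_with_retry`, `fetch`, and `max_rounds` are hypothetical names, and `fetch` stands in for whatever request function the crawler actually uses.

```python
from collections import deque


def crawl_with_retry(urls, fetch, max_rounds=3):
    """Fetch every URL, re-queueing failures for up to max_rounds passes.

    `fetch` is a hypothetical callable that takes a URL and either returns
    a result or raises an exception.
    """
    results = {}
    pending = deque(urls)
    for _ in range(max_rounds):
        failed = deque()
        while pending:
            url = pending.popleft()
            try:
                results[url] = fetch(url)
            except Exception as e:
                # Keep the failed URL so it is retried in the next round.
                print("failed: %s (%s)" % (url, e))
                failed.append(url)
        if not failed:
            break
        pending = failed
    # Anything still pending after the last round never succeeded.
    return results, list(pending)
```

The second return value is the list of URLs that still failed after the last round, which could be written to disk and rerun later.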

2: Timeouts for request functions

2.1: Ordinary requests

2.1.1: Single request

import requests

url = "http://example.com"  # placeholder URL
response = requests.get(url, timeout=60)  # give up after 60 seconds
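
A timeout only helps if the resulting exception is handled. The following is a minimal sketch of one way to do that with requests; the function name `fetch` and the default timeout values are assumptions chosen for illustration, not part of the original article.

```python
import requests


def fetch(url, timeout=(3.05, 60)):
    """Return the response body, or None if the request fails or times out.

    `timeout` is a (connect, read) pair in seconds, as accepted by requests.
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.exceptions.Timeout:
        # Covers both connect timeouts and read timeouts.
        print("timed out: %s" % url)
    except requests.exceptions.RequestException as e:
        print("request failed: %s (%s)" % (url, e))
    return None
```

Note that the read timeout in requests limits how long the client waits between bytes from the server, not the total download time, so a very slow but steady response can still run far longer than the value given.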

2.1.2: Persistent session

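import requesocks

URL = "http://example.com"  # placeholder URL
headers = {"User-Agent": "Mozilla/5.0"}  # placeholder headers
session = requesocks.session()
response = session.get(URL, headers=headers, timeout=10)  # 10-second timeout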

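3: Timeouts for selenium+chrome | phantomjs

2.2.1: Timeout settings for selenium+chrome

Official documentation: http://selenium-python.readthedocs.io/waits.html

Explicit wait: wait for a certain condition to occur before proceeding further in the code.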

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
 
driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    element = WebDriverWait(driver, 10).until(  # adjust the timeout (seconds) here
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()

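Implicit wait: tells WebDriver to poll the DOM for a certain amount of time when trying to find an element or elements that are not immediately available. The default setting is 0. Once set, the implicit wait applies for the life of the WebDriver object instance.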

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.implicitly_wait(10)  # seconds
driver.get("http://somedomain/url_that_delays_loading")
myDynamicElement = driver.find_element(By.ID, "myDynamicElement")

2.2.2: Timeout settings for phantomjs

Here phantomjs is used without selenium, so the script has to be written in JavaScript. The key setting is

page.settings.resourceTimeout = 5000; // wait 5 seconds

var system = require('system');
var args = system.args;
var url = args[1];
var page = require('webpage').create();
page.settings.resourceTimeout = 5000; // wait 5 seconds
page.onResourceTimeout = function(e) {
    console.log(e.errorCode);   // print the error code
    console.log(e.errorString); // print the error message
    console.log(e.url);         // print the url that timed out
    phantom.exit(1);
};
page.open(url, function(status) {
    if (status === 'success') {
        var html = page.evaluate(function() {
            return document.documentElement.outerHTML;
        });
        console.log(html);
    }
    phantom.exit();
});
// $ phantomjs xx.js http://bbs.pcbaby.com.cn/topic-2149414.html

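4: Deadlock or timeout handling for custom functions

This one is very important!

Python executes statements sequentially, but what if the next statement might hang (say, a while(1) loop)? How do you force it to time out? If the call has no timeout setting of its own, you have to handle it yourself with signals (import signal).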

# coding: utf-8
import time
import signal

def test(i):
    time.sleep(0.999)  # simulate a slow call
    print("%d within time" % i)
    return i

def fuc_time(time_out, i):
    # Timeout control: replace test() below with the function that may hang
    def handler(signum, frame):
        raise AssertionError
    try:
        signal.signal(signal.SIGALRM, handler)
        signal.alarm(time_out)  # time_out is the timeout in seconds
        temp = test(i)  # returns normally if it finishes in time
        return temp
    except AssertionError:
        print("%d timeout" % i)  # report the timeout
    finally:
        signal.alarm(0)  # cancel any pending alarm

if __name__ == '__main__':
    for i in range(1, 10):
        fuc_time(1, i)
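
The same signal-based technique can be wrapped so that it is reusable for any function. The following is a minimal sketch under a few assumptions: `FunctionTimeout`, `time_limit`, and `run_with_limit` are hypothetical names, and because it relies on SIGALRM it only works on Unix and only in the main thread.

```python
import signal
from contextlib import contextmanager


class FunctionTimeout(Exception):
    """Raised when the guarded block runs past its time limit."""


@contextmanager
def time_limit(seconds):
    """Raise FunctionTimeout if the with-block runs longer than `seconds`."""
    def handler(signum, frame):
        raise FunctionTimeout("timed out after %s seconds" % seconds)

    old_handler = signal.signal(signal.SIGALRM, handler)
    # setitimer accepts fractions of a second, unlike signal.alarm.
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)     # cancel the timer
        signal.signal(signal.SIGALRM, old_handler)  # restore the old handler


def run_with_limit(func, seconds, *args):
    """Call func(*args); return its result, or None if it times out."""
    try:
        with time_limit(seconds):
            return func(*args)
    except FunctionTimeout:
        return None
```

Cancelling the timer in `finally` matters: without it, an alarm left over from a call that finished in time could fire later in unrelated code.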

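5: Deadlock or timeout handling for custom threads

In one program the selenium+phantomjs approach was not suitable (the required functionality was too hard to implement that way), so only native phantomjs could be used. But native phantomjs can itself hang in extreme cases, when some error occurs before the timeout setting takes effect.

The best solution is then to have Python start a separate thread (process) that calls native phantomjs, and apply timeout control to that thread or process.

Here the ping command is used as a first test: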

import subprocess
from threading import Timer
import time

kill = lambda process: process.kill()

cmd = ["ping", "www.google.com"]
ping = subprocess.Popen(
    cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

my_timer = Timer(5, kill, [ping])  # set the timeout (seconds) and the process to kill
try:
    my_timer.start()  # start the timer
    stdout, stderr = ping.communicate()  # collect the output
    # print(stderr)
    print(time.ctime())
finally:
    print(time.ctime())
    my_timer.cancel()
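
On Python 3.3 and later, `Popen.communicate()` accepts a `timeout` argument, so the same effect can be had without a separate Timer. The following is a minimal sketch of that alternative, not the author's method; `run_with_timeout` is a hypothetical helper name.

```python
import subprocess


def run_with_timeout(cmd, seconds):
    """Run cmd; return (returncode, stdout), or (None, partial stdout) on timeout."""
    process = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    try:
        stdout, _ = process.communicate(timeout=seconds)
        return process.returncode, stdout
    except subprocess.TimeoutExpired:
        process.kill()  # communicate() does not kill the child on timeout
        stdout, _ = process.communicate()  # reap the process, collect output
        return None, stdout
```

A call such as `run_with_timeout(["phantomjs", "xx.js", url], 30)` would then bound a native phantomjs run to 30 seconds, assuming `phantomjs` is on the PATH and `url` is defined.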

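6: Designing a self-restarting program

For example, if the program has reported errors many times under some condition, restarting it once that condition is met solves most problems. Of course this treats the symptom, not the cause. But if restarting the program causes no real trouble (for instance, a program that reads from a queue), a self-restart is one of the least-effort approaches.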

import time
import sys
import os

def restart_program():
    python = sys.executable
    os.execl(python, python, *sys.argv)

if __name__ == "__main__":
    print('start...')
    print("The program will end in 3 seconds...")
    time.sleep(3)
    restart_program()
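
The "restart after repeated errors" condition described above can be sketched as a counter of consecutive failures. This is only an illustration under stated assumptions: `run_tasks`, `handle`, and `max_errors` are hypothetical names, and `restart` is passed in as a parameter so the logic can be tried out without actually re-executing the process.

```python
import os
import sys


def restart_program():
    """Replace the current process with a fresh copy of itself."""
    python = sys.executable
    os.execl(python, python, *sys.argv)


def run_tasks(tasks, handle, max_errors=5, restart=restart_program):
    """Process tasks in order; call restart() after max_errors failures in a row.

    `handle` is a hypothetical callable that processes one task and raises
    an exception on failure.
    """
    errors = 0
    for task in tasks:
        try:
            handle(task)
            errors = 0  # a success resets the counter
        except Exception as e:
            errors += 1
            print("error %d/%d on %r: %s" % (errors, max_errors, task, e))
            if errors >= max_errors:
                restart()
                return
```

Counting consecutive failures rather than total failures keeps occasional, isolated errors from triggering a restart.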
