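Usage Details of urllib2 in the Python Standard Library

Published on 2015-12-17

The Python standard library contains many practical utilities, but the official documentation is often vague about the details of using them. The HTTP client library urllib2 is one example. This post summarizes some details of using urllib2.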

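Proxy settings

By default, urllib2 uses the environment variable http_proxy to configure the HTTP proxy. To control the proxy explicitly in a program, independent of environment variables, you can do the following: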

import urllib2

enable_proxy = True
proxy_handler = urllib2.ProxyHandler({"http" : 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})

if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)

urllib2.install_opener(opener)

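One detail to note here: urllib2.install_opener() sets the global opener of urllib2. This is convenient for later calls, but it rules out finer-grained control, such as using two different proxy settings in the same program. A better practice is to leave the global setting alone and call the opener's open method directly instead of the global urlopen function.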
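
The advice above could be sketched roughly as follows. The helper name make_opener is hypothetical, and the fallback import of urllib.request is an assumption added so that the sketch also runs on Python 3; the article itself targets Python 2.

```python
try:
    import urllib2  # Python 2, as used in this article
except ImportError:
    import urllib.request as urllib2  # assumption: Python 3 equivalent


def make_opener(proxy_url=None):
    """Return an opener that uses proxy_url, or no proxy at all when None.

    make_opener is a hypothetical helper, not part of urllib2.
    """
    if proxy_url is None:
        # An empty mapping disables proxies, including the one from http_proxy.
        handler = urllib2.ProxyHandler({})
    else:
        handler = urllib2.ProxyHandler({'http': proxy_url})
    return urllib2.build_opener(handler)


# Two independent openers; neither one changes the global urlopen().
proxied_opener = make_opener('http://some-proxy.com:8080')
direct_opener = make_opener()

# Usage (performs real network I/O, so it is left commented out):
# response = proxied_opener.open('http://www.google.com')
# response = direct_opener.open('http://www.google.com')
```

Because install_opener() is never called, each part of a program can keep its own opener and its own proxy configuration.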

Timeout settings

In older versions of Python, the urllib2 API did not expose a timeout setting. The only way to set a timeout was to change the global socket timeout.

import urllib2

import socket
socket.setdefaulttimeout(10) # time out after 10 seconds
urllib2.socket.setdefaulttimeout(10) # another way to do the same thing

Since Python 2.6, the timeout can be set directly through the timeout parameter of urllib2.urlopen().

import urllib2
response = urllib2.urlopen('http://www.google.com', timeout=10)
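
A timeout is only useful if the caller handles it. The following is a minimal sketch, not a definitive pattern: fetch is a hypothetical helper, the opener parameter exists only so the function can be exercised without a network, and the urllib.request fallback is an assumption for Python 3.

```python
import socket

try:
    import urllib2  # Python 2, as used in this article
except ImportError:
    import urllib.request as urllib2  # assumption: Python 3 equivalent


def fetch(url, timeout=10, opener=None):
    """Return the response body, or None if the request timed out.

    fetch and its opener parameter are hypothetical, for illustration only.
    """
    if opener is None:
        opener = urllib2.build_opener()
    try:
        return opener.open(url, timeout=timeout).read()
    except socket.timeout:
        # Raised directly when the timeout occurs while reading the response.
        return None
    except urllib2.URLError as e:
        # A timeout while connecting may arrive wrapped in URLError.
        if isinstance(getattr(e, 'reason', None), socket.timeout):
            return None
        raise

# Usage (performs real network I/O, so it is left commented out):
# body = fetch('http://www.google.com', timeout=10)
```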

Adding specific headers to an HTTP request

To add headers, use a Request object:

import urllib2
uri = 'http://www.example.com/'  # placeholder URL; replace with the real target
request = urllib2.Request(uri)
request.add_header('User-Agent', 'fake-client')
response = urllib2.urlopen(request)

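Pay special attention to certain headers, because servers check them:

User-Agent: some servers or proxies use this value to decide whether a request comes from a browser.

Content-Type: when calling a REST interface, the server checks this value to decide how to parse the HTTP body. Common values are:

application/xml: used for XML RPC calls, such as RESTful/SOAP
application/json: used for JSON RPC calls
application/x-www-form-urlencoded: used when a browser submits a web form

When using RESTful or SOAP services provided by a server, a wrong Content-Type causes the server to reject the request.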
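
As an illustration of setting Content-Type for a JSON call, here is a small sketch. The helper build_json_request, the URL http://example.com/api and the payload are all hypothetical, and the urllib.request fallback is an assumption for Python 3.

```python
import json

try:
    import urllib2  # Python 2, as used in this article
except ImportError:
    import urllib.request as urllib2  # assumption: Python 3 equivalent


def build_json_request(url, payload):
    """Build a POST request whose body is JSON (hypothetical helper)."""
    body = json.dumps(payload).encode('utf-8')
    # Supplying data makes urllib2 send the request as a POST.
    request = urllib2.Request(url, data=body)
    request.add_header('Content-Type', 'application/json')
    request.add_header('User-Agent', 'fake-client')
    return request


# The URL and payload below are placeholders.
request = build_json_request('http://example.com/api', {'name': 'value'})

# Usage (performs real network I/O, so it is left commented out):
# response = urllib2.urlopen(request)
```

Note that urllib2 normalizes header names when storing them, so the header is looked up as 'Content-type' on the Request object.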

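Redirect

By default, urllib2 automatically follows redirects for HTTP 3XX status codes, with no manual configuration needed. To detect whether a redirect happened, just check whether the URL of the response matches the URL of the request.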

import urllib2
response = urllib2.urlopen('http://www.google.cn')
# A redirect happened if the final URL differs from the requested one.
redirected = response.geturl() != 'http://www.google.cn'

If you do not want automatic redirects, besides using the lower-level httplib library, you can also define a custom HTTPRedirectHandler class.

import urllib2

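class RedirectHandler(urllib2.HTTPRedirectHandler):
    # Returning None means the redirect is not followed; urllib2 then
    # falls back to its default error handling and raises HTTPError.
    def http_error_301(self, req, fp, code, msg, headers):
        pass
    def http_error_302(self, req, fp, code, msg, headers):
        pass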

opener = urllib2.build_opener(RedirectHandler)
opener.open('http://www.google.cn')
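
An alternative sketch, which covers every redirect code the base handler knows about instead of only 301 and 302, is to override redirect_request. The class name NoRedirectHandler is hypothetical, and the urllib.request fallback is an assumption for Python 3.

```python
try:
    import urllib2  # Python 2, as used in this article
except ImportError:
    import urllib.request as urllib2  # assumption: Python 3 equivalent


class NoRedirectHandler(urllib2.HTTPRedirectHandler):
    """Refuse to follow any redirect (hypothetical class name)."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means no new request is built, so the 3XX
        # response ends up being reported as an HTTPError.
        return None


# Passing the subclass replaces the default HTTPRedirectHandler.
opener = urllib2.build_opener(NoRedirectHandler)

# Usage (performs real network I/O, so it is left commented out):
# try:
#     opener.open('http://www.google.cn')
# except urllib2.HTTPError as e:
#     target = e.headers.get('Location')  # where the server wanted to send us
```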

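Cookie

urllib2 can also handle cookies automatically once an HTTPCookieProcessor is added to the opener. To get the value of a particular cookie item, you can do this: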

import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.google.com')
for item in cookie:
    if item.name == 'some_cookie_item_name':
        print item.value
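
The lookup loop above can be wrapped in a small function so that it returns the value instead of printing it. This is only a sketch: get_cookie_value is a hypothetical helper, and the fallback imports are an assumption for Python 3.

```python
try:
    import cookielib  # Python 2, as used in this article
    import urllib2
except ImportError:
    import http.cookiejar as cookielib  # assumption: Python 3 equivalents
    import urllib.request as urllib2


def get_cookie_value(jar, name):
    """Return the value of the first cookie called name, or None.

    get_cookie_value is a hypothetical helper, not part of cookielib.
    """
    for item in jar:
        if item.name == name:
            return item.value
    return None


jar = cookielib.CookieJar()
# Every response opened through this opener stores its cookies in jar.
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

# Usage (performs real network I/O, so it is left commented out):
# opener.open('http://www.google.com')
# value = get_cookie_value(jar, 'some_cookie_item_name')
```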

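Using the HTTP PUT and DELETE methods

urllib2 only supports the HTTP GET and POST methods out of the box. To use HTTP PUT and DELETE, you would normally have to use the lower-level httplib library. Even so, urllib2 can be made to send PUT or DELETE requests in the following way: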

import urllib2

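uri = 'http://www.example.com/resource'  # placeholder URL
data = 'payload'  # placeholder request body
request = urllib2.Request(uri, data=data)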
request.get_method = lambda: 'PUT' # or 'DELETE'
response = urllib2.urlopen(request)
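
Instead of patching get_method on each instance, the same idea could be packaged as a small Request subclass. This is a sketch under assumptions: MethodRequest is a hypothetical class, the URL is a placeholder, and the urllib.request fallback is there only so the code also runs on Python 3.

```python
try:
    import urllib2  # Python 2, as used in this article
except ImportError:
    import urllib.request as urllib2  # assumption: Python 3 equivalent


class MethodRequest(urllib2.Request):
    """A Request that reports a caller-chosen HTTP method (hypothetical)."""

    def __init__(self, url, http_method, **kwargs):
        # Call the base initializer explicitly; Request is an old-style
        # class on Python 2, so super() is avoided here.
        urllib2.Request.__init__(self, url, **kwargs)
        self._http_method = http_method

    def get_method(self):
        # urllib2 asks the request object which method to send.
        return self._http_method


# The URL below is a placeholder.
put_request = MethodRequest('http://www.example.com/resource', 'PUT', data=b'payload')
delete_request = MethodRequest('http://www.example.com/resource', 'DELETE')

# Usage (performs real network I/O, so it is left commented out):
# response = urllib2.urlopen(put_request)
```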

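Getting the HTTP status code

For 200 OK, the getcode() method of the response object returned by urlopen gives the HTTP status code. For error status codes (for example 4XX or 5XX), however, urlopen raises an exception. In that case, check the code attribute of the exception object: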

import urllib2
try:
    response = urllib2.urlopen('http://restrict.web.com')
except urllib2.HTTPError as e:
    print e.code
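
Both cases can be folded into one function that always returns the status code. The following is a sketch only: get_status_code is a hypothetical helper, its opener parameter exists so it can be tried without a network, and the urllib.request fallback is an assumption for Python 3.

```python
try:
    import urllib2  # Python 2, as used in this article
except ImportError:
    import urllib.request as urllib2  # assumption: Python 3 equivalent


def get_status_code(url, opener=None):
    """Return the HTTP status code for url, whether or not it is an error.

    get_status_code and its opener parameter are hypothetical.
    """
    if opener is None:
        opener = urllib2.build_opener()
    try:
        return opener.open(url).getcode()
    except urllib2.HTTPError as e:
        # Error responses arrive as exceptions that carry the status code.
        return e.code

# Usage (performs real network I/O, so it is left commented out):
# code = get_status_code('http://restrict.web.com')
```

Failures that never produce an HTTP response, such as DNS errors, still propagate as URLError.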

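Debug Log

When using urllib2, you can turn on the debug log as shown below. The contents of sent and received packets are then printed to the screen, which makes debugging easier and can sometimes save you from capturing packets.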

import urllib2

httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)

urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.google.com')

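PS: Using urllib2 to scrape a site and generate RSS

I took a look at OsChina's blog pages and found that they can be scraped with Python. I remembered that a while ago someone used the Python RSS module PyRSS2Gen to generate RSS, so I could not resist trying it myself. Fortunately it worked, and I am sharing the code below.
First install the PyRSS2Gen and BeautifulSoup modules; pip handles that, so I will not go into detail.
The code follows: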

# -*- coding: utf-8 -*-
# Python 2 script: scrape the OSChina blog list and write an RSS 2.0 feed.

from bs4 import BeautifulSoup
import urllib2

import datetime
import PyRSS2Gen
import re
import sys

# Python 2 workaround so that implicit str/unicode conversions use UTF-8.
reload(sys)
sys.setdefaultencoding('utf-8')

class RssSpider():
    def __init__(self):
        self.myrss = PyRSS2Gen.RSS2(title='OSChina',
                                    link='http://my.oschina.net',
                                    description=str(datetime.date.today()),
                                    pubDate=datetime.datetime.now(),
                                    lastBuildDate=datetime.datetime.now(),
                                    items=[]
                                    )
        self.xmlpath = r'/var/www/myrss/oschina.xml'
        self.baseurl = "http://www.oschina.net/blog"

    def useragent(self, url):
        # Fetch a page while sending browser-like headers.
        i_headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) "
                                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                                   "Chrome/36.0.1985.125 Safari/537.36",
                     "Referer": 'http://baidu.com/'}
        req = urllib2.Request(url, headers=i_headers)
        html = urllib2.urlopen(req).read()
        return html

    def enterpage(self, url):
        # Build one RSS item from a single blog post page.
        pattern = re.compile(r'\d{4}\S\d{2}\S\d{2}\s\d{2}\S\d{2}')
        rsp = self.useragent(url)
        soup = BeautifulSoup(rsp)
        timespan = soup.find('div', {'class': 'BlogStat'})
        timespan = str(timespan).strip().replace('\n', '').decode('utf-8')
        match = pattern.search(timespan)
        # Fall back to today's date when no timestamp is found on the page.
        timestr = str(datetime.date.today())
        if match:
            timestr = match.group()
        ititle = soup.title.string
        div = soup.find('div', {'class': 'BlogContent'})
        rss = PyRSS2Gen.RSSItem(
                              title=ititle,
                              link=url,
                              description=str(div),
                              pubDate=timestr
                              )
        return rss

    def getcontent(self):
        # Walk the recent-blogs list and add one RSS item per linked post.
        rsp = self.useragent(self.baseurl)
        soup = BeautifulSoup(rsp)
        ul = soup.find('div', {'id': 'RecentBlogs'})
        for li in ul.findAll('li'):
            div = li.find('div')
            if div is not None:
                alink = div.find('a')
                if alink is not None:
                    link = alink.get('href')
                    print link
                    html = self.enterpage(link)
                    self.myrss.items.append(html)

    def SaveRssFile(self, filename):
        # Note: the filename argument is unused; the feed goes to self.xmlpath.
        finallxml = self.myrss.to_xml(encoding='utf-8')
        f = open(self.xmlpath, 'w')
        f.write(finallxml)
        f.close()

if __name__ == '__main__':
    rssSpider = RssSpider()
    rssSpider.getcontent()
    rssSpider.SaveRssFile('oschina.xml')
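
The timestamp pattern used in enterpage can be tried out in isolation. The sketch below adds capture groups to the same pattern and turns the match into a datetime object; parse_timestamp is a hypothetical helper, and the sample text is made up, since the exact format on the OSChina page is not shown here.

```python
import datetime
import re

# Same pattern as in the script, with capture groups added around the digits.
TIMESTAMP = re.compile(r'(\d{4})\S(\d{2})\S(\d{2})\s(\d{2})\S(\d{2})')


def parse_timestamp(text, default=None):
    """Return the first timestamp found in text as a datetime, else default.

    parse_timestamp is a hypothetical helper for illustration.
    """
    match = TIMESTAMP.search(text)
    if match is None:
        return default
    year, month, day, hour, minute = [int(g) for g in match.groups()]
    return datetime.datetime(year, month, day, hour, minute)


# Made-up sample text; the real page layout may differ.
sample = 'posted 2014-09-17 10:30 by someone'
published = parse_timestamp(sample)
```

Because \S matches any single non-whitespace character, the pattern accepts different separators between the date and time fields.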

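As you can see, the script mainly uses BeautifulSoup to scrape the site and PyRSS2Gen to generate the RSS, which is saved as an XML file.
Here is the address of the RSS feed I generated: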

http://104.224.129.109/myrss/oschina.xml

If you would rather not tinker with this yourself, just subscribe with feedly.
I run the script every 10 minutes.
