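Python Web Crawler in Practice: Key Points

Posted by BIGKAKA on 2017-07-12

Chapter 4: Commonly used Python crawler modules

urllib2.urlopen(url,timeout) sends a request and returns the response; timeout sets how many seconds to wait before giving up.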

#! python2.7
#-*- coding:utf-8 -*-


import sys
import urllib2

def linkBaidu():
    url='http://www.baidu.com'
    try:
        response=urllib2.urlopen(url,timeout=4)
    except urllib2.URLError:
        print("Invalid or unreachable URL")
        sys.exit(1)
    with open('baiduResponse.txt','w') as fp:   # write the response body to a file
        fp.write(response.read())
    print(response.geturl())    # URL of the resource actually retrieved
    print(response.getcode())  # HTTP status code
    print(response.info())  # response headers


if __name__=='__main__':
    linkBaidu()
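
The listing above targets Python 2. In Python 3 the urllib2 module was split into urllib.request and urllib.error, so a rough equivalent might look like the sketch below. The function name fetch_page and its return shape are illustrative choices, not something from the book.

```python
from urllib.error import URLError
from urllib.request import urlopen


def fetch_page(url, timeout=4):
    """Return (final_url, status_code, body_bytes), or None if the request fails."""
    try:
        response = urlopen(url, timeout=timeout)
    except URLError as err:
        # HTTPError is a subclass of URLError, so HTTP error statuses land here too.
        print("URL error:", err.reason)
        return None
    with response:
        return response.geturl(), response.getcode(), response.read()


# Example call (needs network access, so it is left commented out):
# print(fetch_page("http://www.baidu.com"))
```

Returning None instead of exiting lets the caller decide whether to retry or skip the URL.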

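Accessing a URL through a proxy server
Free proxy servers are available, but requests sent through them with urlopen often fail with <urlopen error timed out>, so urlopen() needs to be called repeatedly in a loop.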

#-*-coding:utf-8 -*-
'''
Test whether a proxy is usable
'''
import urllib2,re

class TestProxy():
    def __init__(self,proxy):
        self.proxy=proxy
        self.checkProxyFormat(self.proxy)
        self.url='http://www.baidu.com'
        self.timeout=4
        self.keyword='百度'    # "Baidu" in Chinese; look for this word in the returned page
        self.useProxy(proxy)

    def checkProxyFormat(self,proxy):
        # Expected form: http://a.b.c.d:port or https://a.b.c.d:port (dots escaped so they match literally)
        pattern=re.compile(r'^https?://\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}$')
        if pattern.search(proxy) is None:
            print("The proxy address you entered is not in the correct format")
            exit()
        flag=True
        proxy=proxy.replace('//','')
        try:
            protocol=proxy.split(':')[0]
            ip=proxy.split(':')[1]
            port=proxy.split(':')[2]
        except IndexError:
            print('Index out of range')
            exit()
        octets=ip.split('.')
        flag=flag and octets[0] in map(str,xrange(1,256))   # map applies str to every number
        flag=flag and octets[1] in map(str,xrange(256))
        flag=flag and octets[2] in map(str,xrange(256))
        flag=flag and octets[3] in map(str,xrange(1,255))
        flag=flag and protocol in ['http','https']
        flag=flag and port in map(str,xrange(1,65536))   # valid ports run from 1 to 65535
        if flag:
            print('The proxy server address is valid')
        else:
            print('The proxy address contains an out-of-range value')
            exit()

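    def useProxy(self,proxy):
        protocol=proxy.split('//')[0].replace(':','')
        ip=proxy.split('//')[1]
        print(protocol,ip)
        # build_opener() returns an object whose open() method works like urlopen().
        # install_opener() sets the (global) default opener, so later calls to
        # urlopen() go through the opener you installed.
        opener=urllib2.build_opener(urllib2.ProxyHandler({protocol:ip}))  # protocol: http, ip: 163.125.68.237:8888
        urllib2.install_opener(opener)
        response=None
        for i in range(10):   # retry, because free proxies time out often
            try:
                response=urllib2.urlopen(self.url,timeout=self.timeout)
                break
            except Exception as e:
                print(e)
        if response is None:   # all ten attempts failed
            print('This proxy is not usable')
            return
        html=response.read()   # avoid shadowing the built-in str
        if re.search(self.keyword,html):
            print("Keyword found; this proxy is usable")
        else:
            print('This proxy is not usable')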

if __name__=='__main__':
    proxy=r'http://163.125.68.237:8888'
    TestProxy(proxy)
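
The format check can also be written as a small pure function, which is easier to test than a method that exits the program. This is a sketch in Python 3 under a simplified range rule (every octet 0 to 255, port 1 to 65535), which is an assumption and slightly looser than the listing above; the name is_valid_proxy is hypothetical.

```python
import re

# Capture the scheme, the four octets, and the port of http(s)://a.b.c.d:port
PROXY_PATTERN = re.compile(
    r"^(https?)://(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d{1,5})$"
)


def is_valid_proxy(proxy):
    """Return True if proxy looks like http(s)://a.b.c.d:port with values in range."""
    match = PROXY_PATTERN.match(proxy)
    if match is None:
        return False
    octets = [int(part) for part in match.group(2, 3, 4, 5)]
    port = int(match.group(6))
    # Simplified rule (an assumption): any octet 0-255, any port 1-65535.
    return all(0 <= octet <= 255 for octet in octets) and 1 <= port <= 65535


print(is_valid_proxy("http://163.125.68.237:8888"))
```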

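Modifying headers
A website identifies the browser by the User-Agent value it sends, so some sites may refuse requests that come from a program. When sending a request we therefore change the User-Agent to make the program look like a browser; add_header() adds a header to the request. The same site may also serve different content to different browsers.
The program below visits Youdao Translate as IE and as the mobile UC browser.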

#-*-coding:utf-8 -*-
import userAgents    # helper module providing the pcUserAgent and mobileUserAgent dicts (not shown in these notes)
import urllib2

class ModifyHeader():
    def __init__(self):
        piua=userAgents.pcUserAgent.get('IE 9.0')
        muua=userAgents.mobileUserAgent.get('UC standard')
        print('piua: '+piua)
        self.url='http://fanyi.youdao.com'
        self.userAgent(piua,1)
        self.userAgent(muua,2)

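    def userAgent(self,agent,name):
        request=urllib2.Request(self.url)
        # agent has the form 'Header-Name:value'; split only once so colons inside the value are kept
        headerName,headerValue=agent.split(':',1)
        request.add_header(headerName,headerValue)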

        response=urllib2.urlopen(request)
        filename=str(name)+'.html'
        with open(filename,'w') as fp:
            fp.write('%s\n\n'%agent)
            fp.write(response.read())

if __name__=='__main__':
    ModifyHeader()
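
To see what add_header() does without sending anything, the request object can be inspected directly. This is a Python 3 sketch (urllib.request instead of urllib2); the agent string is an illustrative IE 11 value written in the "Name:value" form the listing above assumes, not an entry taken from the book's userAgents module.

```python
from urllib.request import Request

# Hypothetical entry in "Name:value" form; note the colon inside the value.
agent = "User-Agent:Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"

# Split only on the first colon so "rv:11.0" is not cut off.
name, value = agent.split(":", 1)

request = Request("http://fanyi.youdao.com")
request.add_header(name, value)

# Request stores header names capitalized, so the key becomes "User-agent".
print(request.header_items())
```

Passing this request object to urlopen() would then send the modified User-Agent.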

getpass.getuser()

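Returns the current user's login name. The function checks the environment variables LOGNAME, USER, LNAME and USERNAME in that order and returns the first non-empty value. If none of them is set, it tries to import the pwd module; on systems that support pwd it returns the user name obtained through it, otherwise it raises an exception.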
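
The lookup order can be observed by setting the first variable by hand. A minimal sketch; the value crawler_demo is made up for the demonstration.

```python
import getpass
import os

# Hypothetical value, set only for this demonstration.
os.environ["LOGNAME"] = "crawler_demo"

# LOGNAME is checked first, so its value is what getuser() returns.
current_user = getpass.getuser()
print(current_user)
```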

re module
\A: matches only at the start of the string, e.g. \Aabc
\Z: matches only at the end of the string, e.g. abc\Z
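
The difference from ^ and $ shows up with multi-line text and trailing newlines. A short sketch with a made-up sample string:

```python
import re

text = "abc\nabc\n"

# With re.MULTILINE, ^ matches at the start of every line; \A only at the very start.
caret_hits = re.findall(r"^abc", text, re.MULTILINE)
start_hits = re.findall(r"\Aabc", text, re.MULTILINE)

# $ tolerates one trailing newline; \Z requires the true end of the string.
dollar_match = re.search(r"abc$", text)
end_match = re.search(r"abc\Z", text)
end_match_stripped = re.search(r"abc\Z", text.rstrip("\n"))

print(len(caret_hits), len(start_hits))
print(dollar_match is not None, end_match is not None, end_match_stripped is not None)
```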
