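Python 3 Web Scraping in Practice (the urllib Module)

Posted by Mr_blueD on 2018-01-27

Original URL: https://blog.csdn.net/mr_blued/article/details/79180017

Python web scraping

2018-01-27. My first blog post.

While teaching myself Python, web scraping has been the most fun part, so here is a summary of what I have learned about it.

The first module you meet when learning to scrape with Python is urllib. Below I walk through hands-on examples that build scrapers with urllib's request module. The editor I use is PyCharm.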

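1. Request

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

Use Request() to create a Request object and attach the request parameters. The headers argument is a dict of request headers.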

from urllib import request

url = 'http://www.baidu.com'  # example target, the same URL used in section 2
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                        '(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
req = request.Request(url, headers=header)
# Headers can also be added with the add_header() method
# req = request.Request(url)
# req.add_header('User-Agent','Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36')
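
If you want to check what a Request will send before anything goes over the network, you can inspect the object itself. This is only a small sketch: the example.com URL and the Referer header are placeholders I made up for illustration, not part of the examples in this post.

```python
from urllib import request

# Placeholder URL for illustration only; nothing is sent over the network here
url = 'http://example.com/'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}

req = request.Request(url, headers=header)
req.add_header('Referer', 'http://example.com/')  # hypothetical extra header

# Request stores header names capitalized, e.g. 'User-agent' and 'Referer'
print(req.full_url)                  # http://example.com/
print(req.get_method())              # GET, because no data was passed
print(req.get_header('User-agent'))  # look up a header by its stored name
print(req.header_items())            # all headers as (name, value) pairs
```

Note that the lookup key is 'User-agent', not 'User-Agent', because Request normalizes header names when it stores them.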

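2. urlopen

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

url: the URL to open, either a string or a Request object

timeout: how long to wait, in seconds, before the request times out

After wrapping the request with Request() from section 1, call urlopen() from urllib.request to fetch the page. read() returns the page as bytes, which must be decoded with decode() to get a str.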

from urllib import request

'''
url = 'http://www.baidu.com'
response = request.urlopen(url)
response.read().decode('utf-8')
'''

# Most sites inspect the request headers, so a scraper should set them
# req is the Request object with headers added in section 1
response = request.urlopen(req)
response.read().decode('utf-8')
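
The fetch, timeout, and decode steps can be wrapped in one helper. What follows is a rough sketch, not the only way to do it: fetch_text and its parameters are names I invented, and the data: URLs are used only so the snippet runs without network access (urllib handles that scheme by itself).

```python
from urllib import error, request

def fetch_text(url_or_req, timeout=10, default_charset='utf-8'):
    """Fetch a URL or Request and return the body as str, or None on failure."""
    try:
        # The with-block closes the response even if read() fails
        with request.urlopen(url_or_req, timeout=timeout) as response:
            raw = response.read()
            # Use the charset from the Content-Type header when one is given
            charset = response.headers.get_content_charset() or default_charset
    except error.HTTPError as exc:
        # HTTPError is a subclass of URLError, so it has to be caught first
        print('HTTP error:', exc.code, exc.reason)
        return None
    except error.URLError as exc:
        print('Could not reach the server:', exc.reason)
        return None
    return raw.decode(charset, errors='replace')

# The same two characters, once encoded as UTF-8 and once as GBK
print(fetch_text('data:text/plain;charset=utf-8,%E4%BD%A0%E5%A5%BD'))
print(fetch_text('data:text/plain;charset=gbk,%C4%E3%BA%C3'))
```

Reading the charset from the response headers avoids hard-coding 'utf-8' for pages that are served as GBK. Pages that declare their encoding only inside the HTML would still need the fallback argument.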

Now let's test this in practice.

I. Scraping images from meizitu (target URL: http://www.meizitu.com/)

import urllib.request
import os
import re
import time

def url_open(url):
    # Create a Request object
    req = urllib.request.Request(url)

    # Add a User-Agent header with add_header() to get past basic anti-scraping checks
    req.add_header('User-Agent', "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36")

    # Fetch the page and read the response body as bytes
    response = urllib.request.urlopen(req).read()
    return response

# Another way to fetch a page
'''
def url_open(url):
    header = ('User-Agent', "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36")
    # Create an opener object
    opener = urllib.request.build_opener()

    # Attach the request header to the opener
    opener.addheaders = [header]

    # Fetch the page with open() and read it
    response = opener.open(url).read()
    return response
'''

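def find_imgs(url):
    # Decode the page content; if the page is encoded as GBK, use 'gbk' instead
    html = url_open(url).decode('utf-8')

    # Use a regular expression to pull out the target data (.jpg image URLs)
    p = r'<img src="([^"]+\.jpg)"'
    img_addrs = re.findall(p, html)

    return img_addrs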

def download_mm(base_url, folder='OOXX'):
    # Create the output folder if it does not exist yet, then work inside it
    os.makedirs(folder, exist_ok=True)
    os.chdir(folder)

    page_num = 1  # start from the first page; change as needed
    x = 0  # counter used to name the saved images
    img_addrs = []  # collected image URLs, without duplicates

    # Only crawl the first two pages (adjustable)
    while page_num <= 2:
        page_url = base_url + 'a/more_' + str(page_num) + '.html'
        addrs = find_imgs(page_url)
        print(len(addrs))
        for i in addrs:
            if i not in img_addrs:
                img_addrs.append(i)
        print(len(img_addrs))
        for each in img_addrs:
            print(each)
        page_num += 1
        time.sleep(1)  # pause one second between pages to go easy on the server

    # Download every image, renaming the files 0.jpg, 1.jpg, ...
    for each in img_addrs:
        filename = str(x) + '.' + each.split('.')[-1]
        x += 1
        img = url_open(each)
        with open(filename, 'wb') as f:
            f.write(img)

if __name__ == '__main__':
    url = 'http://www.meizitu.com/'
    download_mm(url)
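
The regex step in find_imgs() is easy to try without touching the network. The sketch below runs the same pattern against a made-up HTML snippet. The sample HTML, the example.com URLs, and the extract_image_urls name are all my own assumptions for illustration.

```python
import re
from urllib.parse import urljoin

# Same pattern as in find_imgs(): src must directly follow '<img ' and end in .jpg
IMG_PATTERN = re.compile(r'<img src="([^"]+\.jpg)"')

def extract_image_urls(html, base_url):
    """Return absolute .jpg URLs found in html, in page order, without duplicates."""
    matches = IMG_PATTERN.findall(html)
    # Turn relative paths such as /b/2.jpg into full URLs
    absolute = [urljoin(base_url, m) for m in matches]
    # dict.fromkeys drops duplicates while keeping the original order
    return list(dict.fromkeys(absolute))

# Made-up sample page for illustration
sample_html = '''
<div>
  <img src="http://example.com/a/1.jpg" alt="first">
  <img src="/b/2.jpg">
  <img src="http://example.com/a/1.jpg">
  <img src="logo.png">
</div>
'''

result = extract_image_urls(sample_html, 'http://example.com/index.html')
print(result)  # ['http://example.com/a/1.jpg', 'http://example.com/b/2.jpg']
```

One limitation to keep in mind: the pattern only matches tags where src is the first attribute, so a tag written as <img alt="x" src="..."> would be skipped.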

II. Scraping images from Baidu Tieba (target URL: https://tieba.baidu.com/p/5085123197)

# -*-coding:utf-8 -*-

import urllib.request
import re
import os

def open_url(url):
    req = urllib.request.Request(url)
    req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0")
    # Read the whole response body as bytes
    response = urllib.request.urlopen(req).read()

    return response

def find_img(url):
    html = open_url(url).decode('utf-8')
    # Match the .jpg images posted in the thread (class "BDE_Image")
    p = r'<img class="BDE_Image" src="([^"]+\.jpg)"'
    img_addrs = re.findall(p, html)

    for each in img_addrs:
        print(each)
    # Save each image under the last segment of its URL
    for each in img_addrs:
        file = each.split("/")[-1]
        img = open_url(each)
        with open(file, "wb") as f:
            f.write(img)

def get_img(url):
    # Create the output folder if needed and work inside it
    os.makedirs("TieBaTu", exist_ok=True)
    os.chdir("TieBaTu")
    find_img(url)

if __name__ == "__main__":
    url = 'https://tieba.baidu.com/p/5085123197'
    get_img(url)
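
Taking the file name with each.split("/")[-1] works for plain image links, but it would keep any query string as part of the name. Here is a hedged sketch of a slightly more careful version that also avoids os.chdir(). The helper names filename_from_url and save_bytes and the example.com URLs are mine, not from the script above.

```python
import os
import tempfile
from urllib.parse import unquote, urlsplit

def filename_from_url(img_url, default='image.jpg'):
    """Return the last path segment of a URL, ignoring any query string."""
    path = urlsplit(img_url).path
    name = os.path.basename(unquote(path))
    return name or default

def save_bytes(folder, img_url, data):
    """Write data into folder under a name derived from img_url; return the path."""
    os.makedirs(folder, exist_ok=True)
    target = os.path.join(folder, filename_from_url(img_url))
    with open(target, 'wb') as f:
        f.write(data)
    return target

# Hypothetical image URL with a query string
print(filename_from_url('https://example.com/forum/pic/item/abc123.jpg?size=large'))  # abc123.jpg

# Write some placeholder bytes into a temporary folder to show the full flow
with tempfile.TemporaryDirectory() as tmp:
    saved = save_bytes(os.path.join(tmp, 'TieBaTu'),
                       'https://example.com/pic/abc123.jpg', b'fake image bytes')
    print(os.path.basename(saved), os.path.getsize(saved))
```

Building the path with os.path.join() means the script's working directory never changes, so running it twice from the same place behaves the same way.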

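3. urllib.parse.urlencode()

urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)

urlencode() converts a dict (or a sequence of two-element tuples) into a percent-encoded query string such as key1=value1&key2=value2. Append that string to the URL for a GET request, or encode it to bytes and pass it as data for a POST request.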
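
To see what urlencode() produces without sending anything, here is a small sketch. The example.com endpoint is a placeholder I made up; the parameter names are borrowed from the form fields used later in this section.

```python
from urllib import parse, request

params = {'i': '你好', 'from': 'AUTO', 'doctype': 'json'}

# Non-ASCII values are percent-encoded as UTF-8 by default
query = parse.urlencode(params)
print(query)  # i=%E4%BD%A0%E5%A5%BD&from=AUTO&doctype=json

# doseq=True expands a list value into repeated keys
multi = parse.urlencode({'smartresult': ['dict', 'rule']}, doseq=True)
print(multi)  # smartresult=dict&smartresult=rule

# GET: append the query string to the URL (placeholder endpoint)
get_req = request.Request('http://example.com/translate?' + query)
print(get_req.get_method())  # GET

# POST: the body must be bytes, so encode the query string first
post_req = request.Request('http://example.com/translate', data=query.encode('utf-8'))
print(post_req.get_method())  # POST

# parse_qs reverses the encoding; every value comes back as a list
print(parse.parse_qs(query))
```

Passing data is what switches a Request from GET to POST, which is why the encoded string has to be turned into bytes with encode() before it is handed over.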

The Youdao Dictionary example below shows what urllib.parse.urlencode() is for.

import urllib.request
import urllib.parse
import json

while True:
    content = input("Enter text to translate: ")
    if content == "":
        print("Goodbye, see you next time")
        break
    # Form fields submitted with the request
    data = {
        'i': content,
        'from': 'AUTO',
        'to': 'AUTO',
        'smartresult': 'dict',
        'client': 'fanyideskweb',
        'salt': '1514345577426',
        'sign': '8a12c3bae1619e0d60247aa90a4d945e',
        'doctype': 'json',
        'version': '2.1',
        'keyfrom': 'fanyi.web',
        'action': 'FY_BY_REALTIME',
        'typoResult': 'false',
    }
    # Encode the form as a query string, then as bytes, so it is sent as a POST body
    data = urllib.parse.urlencode(data).encode('utf-8')
    url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&sessionFrom=null'
    req = urllib.request.Request(url, data)
    req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0")
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')

    # The translation comes back as JSON, so parse it with the json module
    target = json.loads(html)
    # print(html)
    print("Translation:")
    print(target['translateResult'][0][0]['tgt'])
    print()
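
The indexing target['translateResult'][0][0]['tgt'] raises an exception as soon as the server returns anything other than the expected shape. Below is a defensive sketch of that last step. The extract_translation name and the sample payloads are assumptions of mine: the payload shape is inferred only from the indexing above, and the 'src' and 'errorCode' fields are made up for illustration.

```python
import json

def extract_translation(payload_text):
    """Return the translated text from a JSON reply, or None if the shape is unexpected."""
    try:
        payload = json.loads(payload_text)
    except json.JSONDecodeError:
        # The body was not valid JSON at all (for example an HTML error page)
        return None
    try:
        return payload['translateResult'][0][0]['tgt']
    except (KeyError, IndexError, TypeError):
        # A key is missing, a list is empty, or a level has the wrong type
        return None

# Hypothetical replies, shaped after the indexing used in the script above
good_reply = '{"translateResult": [[{"src": "你好", "tgt": "hello"}]]}'
bad_reply = '{"errorCode": 50}'

print(extract_translation(good_reply))        # hello
print(extract_translation(bad_reply))         # None
print(extract_translation('<html>...</html>'))  # None
```

In the loop above you could then print a short message when the helper returns None instead of letting the script stop with a traceback.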

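After these three examples, you should have a good idea of how to scrape the pages you are interested in.

Summary:

1. Get fluent with the functions in urllib, and know what each one does and where to use it;

2. Learn to use regular expressions to match the target data in a page's source;

3. Know what modules such as os do and what they are for.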
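
As a small illustration of the first point, here is the opener approach from the first example (shown there in commented-out form) as a snippet you can run. It is only a sketch: the data: URL is my own choice so that it works without network access, and it stands in for a real page.

```python
from urllib import request

UA = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36')

# An opener carries its default headers with it, so every open() call reuses them
opener = request.build_opener()
opener.addheaders = [('User-Agent', UA)]

# Optional: make plain urlopen() calls use this opener from now on
# request.install_opener(opener)

# A data: URL stands in for a real page here; urllib handles it locally
with opener.open('data:text/plain;charset=utf-8,hello', timeout=5) as response:
    body = response.read().decode('utf-8')

print(body)  # hello
```

Request.add_header() sets headers for one request, while opener.addheaders sets them once for every request made through that opener, which is handy when a scraper fetches many pages.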
