Python3 Web Scraping in Practice (the requests module)

Posted by Mr_blueD on 2018-01-27

Last time I showed, through two hands-on tutorials, how to build a crawler with the urllib module (http://blog.csdn.net/mr_blued/article/details/79180017). This time I want to introduce a better module for writing crawlers: requests.

Before building a crawler with the requests module, it is worth learning the basics of the HTTP protocol and the common request methods.
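If requests is new to you, the two calls you will see most often look roughly like this. This is only a minimal sketch; httpbin.org is just a convenient echo service used for illustration and plays no role in the examples below.

import requests

# GET with query-string parameters
resp = requests.get('https://httpbin.org/get', params={'q': 'python'})
print(resp.status_code)  # HTTP status code, e.g. 200
print(resp.text)         # response body as text

# POST with form data
resp = requests.post('https://httpbin.org/post', data={'name': 'value'})
print(resp.json())       # decode a JSON response body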

With that out of the way, let's get to the actual examples.

Rewriting the code from last time's examples with the requests module

1. Scraping images from Meizitu (target site: http://www.meizitu.com/)

import requests
import os
import re
import time

def url_open(url):
    # Add the request headers as a dictionary
    header = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0"
        }
    # Send a GET request to fetch the page source
    response = requests.get(url, headers=header)
    return response

def find_imgs(url):
    html = url_open(url).text
    p = r'<img src="([^"]+\.jpg)"'

    img_addrs = re.findall(p, html)

    return img_addrs

def download_mm(folder='OOXX'):
    os.mkdir(folder)
    os.chdir(folder)

    page_num = 1  # start crawling from page 1; change as needed
    x = 0  # counter used to name the downloaded images
    img_addrs = []  # collect image URLs across pages and skip duplicates

    # Only crawl the first two pages (adjustable); images are renamed with the counter below
    while page_num <= 2:
        page_url = url + 'a/more_' + str(page_num) + '.html'  # 'url' is the global defined in __main__
        addrs = find_imgs(page_url)
        print(len(addrs))
        for i in addrs:
            if i in img_addrs:
                continue
            else:
                img_addrs.append(i)
        print(len(img_addrs))
        for each in img_addrs:
            print(each)
        page_num += 1
    for each in img_addrs:
        filename = str(x) + '.' + each.split('.')[-1]
        x += 1
        with open(filename, 'wb') as f:
            img = url_open(each).content
            f.write(img)

if __name__ == '__main__':
    url = 'http://www.meizitu.com/'
    download_mm()
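One thing url_open() above glosses over is failure handling: requests.get() can hang on a slow server, and a 404 page comes back like any other response. A slightly more defensive version, assuming you would rather skip a failed download than crash, might look like this sketch (not the original code; callers then need to check for None):

import requests

def url_open(url):
    header = {'User-Agent': 'Mozilla/5.0'}  # placeholder User-Agent
    try:
        # timeout stops the request from hanging forever;
        # raise_for_status() turns 4xx/5xx responses into exceptions
        response = requests.get(url, headers=header, timeout=10)
        response.raise_for_status()
        return response
    except requests.RequestException as e:
        print('request failed:', url, e)
        return None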



2. Scraping images from a Baidu Tieba thread (target URL: https://tieba.baidu.com/p/5085123197)

import requests
import re
import os

def open_url(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0"}
    response = requests.get(url, headers=headers)

    return response

def find_img(url):
    html = open_url(url).text
    # Images embedded in Tieba posts carry the class "BDE_Image"; capture the .jpg source URL
    p = r'<img class="BDE_Image" src="([^"]+\.jpg)"'
    img_addrs = re.findall(p, html)

    for each in img_addrs:
        print(each)
    for each in img_addrs:
        file = each.split("/")[-1]
        with open(file, "wb") as f:
            img = open_url(each).content
            f.write(img)

def get_img():
    os.mkdir("TieBaTu")
    os.chdir("TieBaTu")
    find_img(url)  # 'url' is the global defined in the __main__ block below

if __name__ == "__main__":
    url = 'https://tieba.baidu.com/p/5085123197'
    get_img()
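Reading a whole image into memory with .content is fine for small files. For larger downloads, requests can also stream the body to disk chunk by chunk; here is a sketch of that approach (the chunk size and User-Agent are arbitrary choices, not part of the original code):

import requests

def save_image(img_url, filename):
    headers = {'User-Agent': 'Mozilla/5.0'}
    # stream=True defers downloading the body until we iterate over it
    with requests.get(img_url, headers=headers, stream=True, timeout=10) as resp:
        resp.raise_for_status()
        with open(filename, 'wb') as f:
            for chunk in resp.iter_content(chunk_size=8192):
                f.write(chunk)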


Summary:
1. Get familiar with the methods of the requests module, and understand the HTTP protocol and the common request methods.
2. Understand a site's anti-crawling measures and build the corresponding counter-measures (see the sketch after this list).
3. Know what the other modules used here (os, re, time) are for.
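For point 2, the simplest counter-measures are a realistic User-Agent, a pause between requests, and perhaps a small pool of headers to rotate. A minimal sketch follows; the header strings and delay range are arbitrary choices, not taken from the original project:

import random
import time
import requests

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
]

def polite_get(url):
    headers = {'User-Agent': random.choice(USER_AGENTS)}  # rotate User-Agent strings
    time.sleep(random.uniform(1, 3))                      # be gentle: wait 1-3 seconds between requests
    return requests.get(url, headers=headers, timeout=10)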

Crawler project repository: github
