Python Crawler Beginner Tutorial 4-100: Scraping Moko (美空網) Images Without Logging In

Posted by 夢想橡皮擦 on 2018-12-17

Original URL: https://flycode.co/archives/232809

Python crawler

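Moko images without login: Introduction

It has been a while since the previous post, so let's finish the Moko crawler. The crawlers in this series probably won't add much valuable technique to your day job, because this is only a beginner tutorial. Veterans can skip it, or give me a few pointers.

Moko images without login: Crawler analysis

First, we have already crawled a large number of user profile pages. By joining the links together I got URLs like this one: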

http://www.moko.cc/post/da39db43246047c79dcaef44c201492d/list.html

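On this page we need to find a few key points. Clicking a category such as 平面拍攝 (print photography) leads to an image list page.
Now let's get the code going.

Getting all the list pages

In the previous post I collected data for 70,000 users (50,000+ in actual testing). We read that data into Python.

Here I use pandas, a very handy Python library. If you are not familiar with it, just imitate my code for now; I have commented everything.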

import pandas as pd

# URL template for a user's image list page
user_list_url = "http://www.moko.cc/post/{}/list.html"
# List page URLs for all users
user_profiles = []


def read_data():
    # Read the user data from the CSV file with pandas
    df = pd.read_csv("./moko70000.csv")   # the file can be downloaded at the end of this article
    # Drop rows with a duplicate nickname
    df = df.drop_duplicates(["nikename"])
    # Sort by follower count, descending
    profiles = df.sort_values("follows", ascending=False)["profile"]

    for i in profiles:
        # Build the list page URL
        user_profiles.append(user_list_url.format(i))

if __name__ == "__main__":
    read_data()
    print(user_profiles)
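
If pandas is new to you, the sketch below shows roughly what the dedupe-and-sort step does, using only the standard library. The sample rows are made up for illustration; only the column names (nikename, follows, profile) and the URL template come from the code above.

```python
import csv
import io

# Made-up sample rows; the real data lives in moko70000.csv.
SAMPLE_CSV = """nikename,follows,profile
alice,120,aaa111
bob,300,bbb222
alice,90,ccc333
carol,50,ddd444
"""

URL_TEMPLATE = "http://www.moko.cc/post/{}/list.html"


def build_list_urls(csv_text):
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Keep only the first row seen for each nickname
    # (the default behaviour of drop_duplicates).
    seen = set()
    unique_rows = []
    for row in rows:
        if row["nikename"] not in seen:
            seen.add(row["nikename"])
            unique_rows.append(row)
    # Sort by follower count, highest first.
    unique_rows.sort(key=lambda r: int(r["follows"]), reverse=True)
    return [URL_TEMPLATE.format(r["profile"]) for r in unique_rows]


urls = build_list_urls(SAMPLE_CSV)
print(urls)
```

The second "alice" row is dropped, and the remaining users are ordered by follower count before their list page URLs are built.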

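With the data in hand, the next step is to get the image list pages. Look for the pattern in the page source: the key information is the titled link of each category, which carries the link target and the number of items. Once you have found the right place, the rest is a matter of regular expressions.

Quickly write a regular expression:
<p class="title"><a hidefocus="ture".*?href="(.*?)" class="mwC u">.*?\((\d+?)\)</a></p>

Import the re and requests modules: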

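import requests
import re

# Request headers shared by the requests below
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"}

# Fetch the image list page
def get_img_list_page():
    # Use a fixed URL to make testing easier
    test_url = "http://www.moko.cc/post/da39db43246047c79dcaef44c201492d/list.html"
    response = requests.get(test_url,headers=headers,timeout=3)
    page_text = response.text
    pattern = re.compile(r'<p class="title"><a hidefocus="ture".*?href="(.*?)" class="mwC u">.*?\((\d+?)\)</a></p>')
    # Extract page_list: a list of (link, item count) tuples
    page_list = pattern.findall(page_text)
    print(page_list)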

Running it gives this result:

[('/post/da39db43246047c79dcaef44c201492d/category/304475/1.html', '85'), ('/post/da39db43246047c79dcaef44c201492d/category/304476/1.html', '2'), ('/post/da39db43246047c79dcaef44c201492d/category/304473/1.html', '0')]
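
To see why the pattern produces (link, count) tuples like these, here is a small sketch that runs it against a hand-written HTML fragment. The fragment is hypothetical and only imitates the markup the pattern targets; the real page may differ in detail.

```python
import re

# Hypothetical fragment imitating the category links on the list page.
sample_html = (
    '<p class="title"><a hidefocus="ture" href="/post/abc/category/1/1.html" '
    'class="mwC u">Portraits (85)</a></p>'
    '<p class="title"><a hidefocus="ture" href="/post/abc/category/2/1.html" '
    'class="mwC u">Travel (0)</a></p>'
)

# \( and \) match the literal parentheses around the count;
# the inner (\d+?) group captures the digits themselves.
pattern = re.compile(
    r'<p class="title"><a hidefocus="ture".*?href="(.*?)" class="mwC u">.*?\((\d+?)\)</a></p>'
)

matches = pattern.findall(sample_html)
for link, count in matches:
    print(link, count)
```

Because the pattern has two capture groups, findall returns one tuple per match, with the count as a string rather than an int.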

Let's keep improving the code. The data above contains entries with a count of "0", which need to be filtered out.

# Fetch the image list page
def get_img_list_page():
    # Use a fixed URL to make testing easier
    test_url = "http://www.moko.cc/post/da39db43246047c79dcaef44c201492d/list.html"
    response = requests.get(test_url,headers=headers,timeout=3)
    page_text = response.text
    pattern = re.compile(r'<p class="title"><a hidefocus="ture".*?href="(.*?)" class="mwC u">.*?\((\d+?)\)</a></p>')
    # Extract page_list
    page_list = pattern.findall(page_text)
    # Filter out empty categories; build a new list instead of
    # removing items from the list while iterating over it
    page_list = [page for page in page_list if page[1] != "0"]
    print(page_list)
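
One thing to watch when filtering: calling remove() on a list while looping over that same list skips the element that follows each removal. The sketch below, with made-up tuples, compares that approach with building a new list.

```python
# Made-up sample data: two empty categories in a row, then a real one.
pages = [("/a/1.html", "0"), ("/b/1.html", "0"), ("/c/1.html", "85")]

# Removing while iterating: after "/a" is removed, "/b" shifts into
# its slot and the loop moves past it without checking it.
buggy = list(pages)
for page in buggy:
    if page[1] == "0":
        buggy.remove(page)

# Building a new list checks every element exactly once.
filtered = [page for page in pages if page[1] != "0"]

print(buggy)
print(filtered)
```

With the sample result shown earlier the difference would not be visible, because the only "0" entry is the last one; it appears as soon as two empty categories sit next to each other.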

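Now that we have the entry point of each list, the next step is to collect every list page. Open the link below to take a look:

http://www.moko.cc/post/da39db43246047c79dcaef44c201492d/category/304475/1.html

This category is paginated into 4 pages, and each page shows 4*7=28 items.
So the basic formula for the page count is math.ceil(85/28), which gives 4.
Next comes link generation. We need to turn the link above into: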

http://www.moko.cc/post/da39db43246047c79dcaef44c201492d/category/304475/1.html
http://www.moko.cc/post/da39db43246047c79dcaef44c201492d/category/304475/2.html
http://www.moko.cc/post/da39db43246047c79dcaef44c201492d/category/304475/3.html
http://www.moko.cc/post/da39db43246047c79dcaef44c201492d/category/304475/4.html

    page_count =  math.ceil(int(totle)/28)+1
    for i in range(1,page_count):
        # Replace the page number with a regular expression
        pages = re.sub(r"\d+?\.html",str(i)+".html",start_page)
        all_pages.append(base_url.format(pages))
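
As a self-contained sketch of the same calculation, the function below builds the page URLs for one category. BASE_URL is an assumed value inferred from the example links, and build_page_urls and PAGE_SIZE are names invented for this illustration.

```python
import math
import re

PAGE_SIZE = 28  # 4 x 7 items per page, as observed above
BASE_URL = "http://www.moko.cc{}"  # assumed site prefix for the relative links


def build_page_urls(start_page, total):
    """Return the full URL of every page in a category with `total` items."""
    page_count = math.ceil(int(total) / PAGE_SIZE)
    return [
        # Swap the trailing page number for i: ".../1.html" -> ".../i.html"
        BASE_URL.format(re.sub(r"\d+\.html$", "{}.html".format(i), start_page))
        for i in range(1, page_count + 1)
    ]


urls = build_page_urls(
    "/post/da39db43246047c79dcaef44c201492d/category/304475/1.html", "85"
)
for url in urls:
    print(url)
```

Anchoring the pattern with $ and escaping the dot keeps the substitution limited to the page number at the end of the path.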

Once we have collected enough links, a beginner can take one simple step first: save the links to a CSV file to make later development easier.

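import math

# Site prefix for the relative links (inferred from the example URLs; not shown in the original)
base_url = "http://www.moko.cc{}"
# Buffer of list page URLs waiting to be written to disk
all_pages = []

# Build all the list pages of one category
def get_all_list_page(start_page,totle):

    page_count =  math.ceil(int(totle)/28)+1
    for i in range(1,page_count):
        pages = re.sub(r"\d+?\.html",str(i)+".html",start_page)
        all_pages.append(base_url.format(pages))

    print("Collected {} links so far".format(len(all_pages)))
    if(len(all_pages)>1000):
        # Append the batch; header=False keeps repeated header rows out of the file
        pd.DataFrame(all_pages).to_csv("./pages.csv",mode="a",header=False)
        all_pages.clear()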
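
One detail of batching like this is that whatever remains in the buffer at the end of the crawl (fewer than the threshold) still has to be written. Below is a minimal standard-library sketch of a flush helper with a final forced flush. flush_pages, BATCH_SIZE and the demo URLs are hypothetical, and unlike the pandas version it writes a single url column without an index column.

```python
import csv
import os
import tempfile

BATCH_SIZE = 1000  # same threshold as in the code above


def flush_pages(pages, csv_path, force=False):
    """Append buffered URLs to csv_path once the batch is large enough,
    or unconditionally when force=True. Returns the number of rows written."""
    if not pages or (len(pages) <= BATCH_SIZE and not force):
        return 0
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for url in pages:
            writer.writerow([url])
    written = len(pages)
    pages.clear()
    return written


# Demo with a temporary file and made-up URLs
demo_path = os.path.join(tempfile.mkdtemp(), "pages.csv")
buffer = ["http://example.com/{}.html".format(i) for i in range(3)]
first = flush_pages(buffer, demo_path)               # below the threshold: nothing written
second = flush_pages(buffer, demo_path, force=True)  # final flush at the end of the crawl
print(first, second)
```

Calling the helper once more with force=True after the last category makes sure the tail of the buffer is not lost.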

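Let the crawler run for a while. I ended up with 80,000+ links.

OK, the list data is ready, so let's keep working with it. Does it feel a bit slow, and does the code look a bit crude? Fine, I admit I wrote it this way for beginners (really I was just being lazy). I will rewrite it as object-oriented, multithreaded code in a later article.

Next, we analyze the crawled data further.

Take http://www.moko.cc/post/nimusi/category/31793/1.html as an example. On this page we need the address marked with the red box in the screenshot (the VIEW MORE link). Why do we need it? Because the complete image list only appears after clicking through that image.
We again use the crawler to get it.
The steps:

Loop over the list data we just saved
Fetch the page source
Match all the links with a regular expression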

import os

def read_list_data():
    # Read the saved list page URLs
    img_list = pd.read_csv("./pages.csv",names=["no","url"])["url"]

    # Loop over the data
    for img_list_page in img_list:
        try:
            response = requests.get(img_list_page,headers=headers,timeout=3)
        except Exception as e:
            print(e)
            continue
        # Match (title, link) pairs for each album on the list page
        pattern = re.compile(r'<a hidefocus="ture" alt="(.*?)".*? href="(.*?)".*?>VIEW MORE</a>')
        img_box = pattern.findall(response.text)

        need_links = []  # albums still to be crawled
        for img in img_box:
            need_links.append(img)

            # Build the directory name the same way downs_imgs does
            file_path = "./downs/{}".format(str(img[0]).replace("/", "").strip())

            if not os.path.exists(file_path):
                os.makedirs(file_path)  # also creates ./downs if it is missing

        for need in need_links:
            # Fetch the image links on the detail page
            get_my_imgs(base_url.format(need[1]), need[0])

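A few key points in the code above:

        pattern = re.compile(r'<a hidefocus="ture" alt="(.*?)".*? href="(.*?)".*?>VIEW MORE</a>')
        img_box = pattern.findall(response.text)

        need_links = []  # albums still to be crawled
        for img in img_box:
            need_links.append(img)

This collects the albums to crawl. I match two parts here: the link, and the title, which is mainly used to create the folder.
Creating folders needs the os module, so remember to import it.

            # Build the directory name the same way downs_imgs does
            file_path = "./downs/{}".format(str(img[0]).replace("/", "").strip())

            if not os.path.exists(file_path):
                os.makedirs(file_path)  # also creates ./downs if it is missing

After getting the detail page link, we make one more request to collect all the image links on it.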

# Fetch the detail page and download every image on it
def get_my_imgs(img,title):
    print(img)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"}
    response = requests.get(img, headers=headers, timeout=3)
    # The image address is stored in the src2 attribute
    pattern = re.compile(r'<img src2="(.*?)".*?>')
    all_imgs = pattern.findall(response.text)
    for download_img in all_imgs:
        downs_imgs(download_img,title)

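Finally, write a function that downloads an image, and all the code is complete. The local file name of each image is a timestamp.

import time

def downs_imgs(img,title):

    headers ={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"}
    response = requests.get(img,headers=headers,timeout=3)
    content = response.content
    # Millisecond timestamp: a whole-second timestamp would let images
    # downloaded within the same second overwrite each other
    file_name = str(int(time.time()*1000))+".jpg"
    file = "./downs/{}/{}".format(str(title).replace("/","").strip(),file_name)
    with open(file,"wb+") as f:
        f.write(content)

    print("Done")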

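Run the code and wait for the images to come in.

When I ran the code, it raised an error.
The cause is the path: the special character … appears in it, and it needs to be handled the same way as the / above. I leave that to you.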
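
One possible way to handle it is to strip problem characters from the title before using it as a folder name. The sketch below assumes the failure comes from characters the file system rejects; safe_dir_name, the character list, and the "untitled" fallback are choices made for this illustration, not part of the original code.

```python
import re

# Characters that are not allowed in Windows file names, plus the
# ellipsis character mentioned above.
_INVALID = re.compile(r'[\\/:*?"<>|…]')


def safe_dir_name(title):
    """Return a version of title that is safer to use as a folder name."""
    cleaned = _INVALID.sub("", str(title)).strip()
    # Windows also drops trailing dots and spaces from directory names.
    cleaned = cleaned.rstrip(". ")
    # Hypothetical fallback so an empty title still gets a folder.
    return cleaned or "untitled"


print(safe_dir_name("a/b…"))
print(safe_dir_name("  photo: set 1... "))
```

The same function would need to be used both when creating the folder and when building the file path, so the two names always agree.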

Once the download finishes, the images end up under ./downs, one folder per album title.

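What still needs improving in the code

The code is split into two parts and is procedural, which is poor style and should be refactored
The network request code is duplicated in too many places; it should be factored out and given error handling, since it can currently raise errors
The code is single-threaded and inefficient; see the previous two articles for ways to improve it
There is no simulated login, so at most 6 images can be crawled; this is also why the data was saved first, to make later changes easier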
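
As one possible direction for the request-handling point, the sketch below wraps any fetch function with a simple retry loop. fetch_with_retry and flaky_fetch are hypothetical names, and the demo uses a fake fetcher so that it runs without network access.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=0.0):
    """Call fetch(url); on an exception try again, up to `retries` attempts.
    Returns the result of fetch, or None if every attempt failed."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            print("attempt {} failed for {}: {}".format(attempt, url, exc))
            if attempt < retries:
                time.sleep(delay)
    return None


# Demo with a fake fetcher that fails twice before succeeding
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network error")
    return "<html>ok</html>"


result = fetch_with_retry(
    flaky_fetch, "http://www.moko.cc/post/nimusi/category/31793/1.html"
)
print(result)
```

In the crawler, fetch could be something like lambda u: requests.get(u, headers=headers, timeout=3).text, so that the three functions above share one code path for requests and error handling.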

GitHub code and CSV file links
