Python: Scraping All Kinds of iPhone Wallpapers

Posted by 星海千尋 on 2020-12-11

Original article: https://blog.csdn.net/qq_29367075/article/details/111027702

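It is 6:30 China time as I write this, and I am still happily learning how to scrape images from websites. After my previous experiment I have a basic grasp of the key packages, and I already had some HTML background. This time I found a site and got the whole thing working on my own from start to finish. I am a little excited: it feels like my ability to connect the pieces has gone up a level, so I am writing it down right away as a keepsake.
Enough talk, let's get started!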

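We will scrape the images under https://divnil.com/wallpaper/iphone/ .
It is a Japanese site, but that does not stop us from working out its structure and scraping the images. Overall the site is layered: https://divnil.com/wallpaper/iphone/ is the root URL, and many tag pages (HTML files) hang beneath it.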

The link to each tag page has class "tag_link", which makes it easy to find all the tags, but remember that the results need to be deduplicated.
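As a small illustration of that deduplication step, here is a minimal sketch that collects the tag links into a set. The sample markup and the helper name collect_tag_links are made up for the example; only the "tag_link" class comes from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page is much larger.
SAMPLE_HTML = """
<div>
  <a class="tag_link" href="a.html">A</a>
  <a class="tag_link" href="b.html">B</a>
  <a class="tag_link" href="a.html">A again</a>
  <a class="other" href="c.html">C</a>
</div>
"""

def collect_tag_links(html_text):
    """Return the unique hrefs of all <a class="tag_link"> elements."""
    soup = BeautifulSoup(html_text, features="html.parser")
    links = set()  # a set drops duplicate hrefs automatically
    for link in soup.find_all('a', {'class': 'tag_link'}):
        href = link.get('href')
        if href:  # skip anchors that have no href
            links.add(href)
    return links

unique_tags = collect_tag_links(SAMPLE_HTML)
print(sorted(unique_tags))  # ['a.html', 'b.html']
```

The same tag shows up several times on a page, so using a set means each tag page is visited only once.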

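Selecting a tag leads to a series of paginated HTML pages. Their naming is very regular: _num is appended to the name.
The code prints the list of pages it visits while it scrapes. The complete code is at the end of this post; if you only want to see the pages to be visited, comment out the call to save_all_images, which is much faster because no images are downloaded.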

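Open any tag page and a button appears at the bottom of the page. It is in Japanese, but it is easy to guess that it means "next page". Clicking it takes you to the next page of results.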

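So once we have picked a tag, we only need to keep finding this button and following its href to move on to the next page and scrape its images. From observation, as long as there are more pages, the number of elements with class="btn_next" is always exactly 2, and the first one is always the [Next] button we are looking for.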

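What happens when the pages run out?
From observation, on the last page the number of class="btn_next" elements is no longer 2, and the [Next] button is absent. This tells us whether pagination has ended.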
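The rule can be tried out on a couple of hand-written snippets. This is only a sketch: the markup below is invented to imitate a middle page and a last page, and find_next_href is a name chosen for the example.

```python
from bs4 import BeautifulSoup

# Invented markup: a middle page has two btn_next items, the last page does not.
MIDDLE_PAGE = """
<ul>
  <li class="btn_next"><a href="tag_2.html">next</a></li>
  <li class="btn_next"><a href="other.html">other</a></li>
</ul>
"""
LAST_PAGE = """
<ul>
  <li class="btn_next"><a href="other.html">other</a></li>
</ul>
"""

def find_next_href(html_text):
    """Return the href of the next page, or None when pagination has ended."""
    soup = BeautifulSoup(html_text, features="html.parser")
    buttons = soup.find_all('li', {'class': 'btn_next'})
    if len(buttons) != 2:  # anything other than 2 means there is no next page
        return None
    anchor = buttons[0].find('a')  # the first item is the "next" button
    return anchor.get('href') if anchor else None

print(find_next_href(MIDDLE_PAGE))  # tag_2.html
print(find_next_href(LAST_PAGE))    # None
```

Relying on a count of exactly 2 is fragile if the site changes its layout, but it matches what was observed on the pages at the time.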

Next, let's look at the pattern in each image's URL.

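Each image area is an <a> with rel="wallpaper". The image URL sits on the img inside it, but the attribute holding it is sometimes src and sometimes original, so both cases need to be handled.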
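Here is a minimal sketch of handling both attributes. The sample markup and the function name wallpaper_image_paths are invented for the example; the placeholder path is the one the complete code checks for, on the assumption that an img whose src is that placeholder keeps its real path in original.

```python
from bs4 import BeautifulSoup

# Placeholder image path taken from the complete code in this post.
PLACEHOLDER = '/wallpaper/img/app/white.png'

# Invented sample markup covering both cases plus an unrelated link.
SAMPLE = """
<a rel="wallpaper" href="p1.html"><img src="img/app/one.jpg"></a>
<a rel="wallpaper" href="p2.html">
  <img src="/wallpaper/img/app/white.png" original="img/app/two.jpg">
</a>
<a href="p3.html"><img src="img/app/ignored.jpg"></a>
"""

def wallpaper_image_paths(html_text):
    """Return the image path of every <a rel="wallpaper"> in document order."""
    soup = BeautifulSoup(html_text, features="html.parser")
    paths = []
    for anchor in soup.find_all('a', {'rel': 'wallpaper'}):
        img = anchor.find('img')
        if img is None:
            continue
        path = img.get('src')
        if path == PLACEHOLDER:  # real path is stored in "original" instead
            path = img.get('original')
        if path:
            paths.append(path)
    return paths

print(wallpaper_image_paths(SAMPLE))  # ['img/app/one.jpg', 'img/app/two.jpg']
```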

The complete test code is as follows:

import os
import requests
from bs4 import BeautifulSoup

rootrurl = 'https://divnil.com/wallpaper/iphone/'
save_dir = 'D:/estimages/'
no_more_pages = 'END'

# A set holds no duplicates, so it is used to avoid downloading the same image twice
image_cache = set()
index = len(image_cache)

# step 1: find all the tag pages
# Picked one page that lists the largest and most complete set of tags
tmpurl = 'https://divnil.com/wallpaper/iphone/%E6%98%9F%E7%A9%BA%E3%81%AE%E5%A3%81%E7%B4%99.html'
html = BeautifulSoup(requests.get(tmpurl).text, features="html.parser")
tag_list = set()
for link in html.find_all('a', {'class': 'tag_link'}):
    tag_list.add(link.get('href'))

print("the number of unique tag is : %d" % len(tag_list))
print(tag_list)

# step 2: for each tag page, keep following the pagination and scrape page by page
def save_all_images(html, saveDir):
    global index
    imgs = html.find_all('a', {'rel': 'wallpaper'})
    print('total imgs is %d:' % len(imgs))
    for img in imgs:

        # Some images keep their URL in src and others in original, so handle both.
        href = img.find('img').get('src')
        if href == '/wallpaper/img/app/white.png':
            href = img.find('img').get('original')
        print(href)

        # Skip images that have already been downloaded
        if href not in image_cache:
            # Request the image and write it to a local file
            with open('{}/{}'.format(saveDir, href.split("/")[-1]), 'wb') as jpg:
                jpg.write(requests.get(rootrurl + href).content)

            image_cache.add(href)
            print("Downloading image #%s" % index)
            index += 1

def findNextPage(html):
    # Find the "next" buttons to get the location of the next page
    nextBtn = html.find_all('li', {'class': 'btn_next'})
    if len(nextBtn) != 2:  # while more pages remain, the count is always 2
        print('no more page ============================ ')
        return no_more_pages
    else:
        # While more pages remain, the first element is always the "next" button we want.
        tmpurl = nextBtn[0].find('a').get('href')
        return rootrurl + tmpurl

def searchSubPages(tag):

    # Create a subdirectory per tag; there are too many images, so this keeps them organized
    tag_name = tag.split(".")[0]
    # makedirs also creates save_dir itself if it does not exist yet
    os.makedirs(save_dir + tag_name, exist_ok=True)
    print("current tag is : " + tag_name)

    url = rootrurl + tag
    while True:
        print("next page: " + url)
        html = BeautifulSoup(requests.get(url).text, features="html.parser")
        save_all_images(html, save_dir + tag_name)
        url = findNextPage(html)
        if url == no_more_pages:
            break

if __name__ == '__main__':
    for tag in tag_list:
        searchSubPages(tag)

After testing it for a while I stopped the program, because it was too slow.
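The code above downloads one image at a time, which is probably the main reason it is slow. One possible speed-up, which I have not tried on this site, is to fetch several images concurrently with a thread pool. The sketch below is only an illustration: download_all and fake_fetch are hypothetical names, and the fetch function is passed in so that the example runs without touching the network.

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(urls, fetch, max_workers=8):
    """Fetch every unique URL concurrently and return {url: content}."""
    unique_urls = list(dict.fromkeys(urls))  # drop duplicates, keep order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as unique_urls
        contents = list(pool.map(fetch, unique_urls))
    return dict(zip(unique_urls, contents))

def fake_fetch(url):
    """Stand-in for a real HTTP request, used only for this demonstration."""
    return ("bytes of " + url).encode()

result = download_all(['a.jpg', 'b.jpg', 'a.jpg'], fake_fetch)
print(sorted(result))  # ['a.jpg', 'b.jpg']
```

In real use, fetch could be something like lambda u: requests.get(u, timeout=10).content. Keep max_workers small so the site is not flooded with requests.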
