Python "Scraping Mobile and Desktop Wallpapers"

Posted by 星海千尋 on 2020-12-25

Original URL: https://blog.csdn.net/qq_29367075/article/details/111659783

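This time we scrape a wallpaper site. The site is entirely static and has no anti-scraping measures, so it makes good practice for beginners.
http://www.win4000.com/mobile.html
http://www.win4000.com/wallpaper.html
These are the mobile wallpaper and desktop wallpaper pages, respectively.
Open the mobile wallpaper page, for example, and you will see that it lists many tags.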

Click one of the tags to open that tag's page.

The tag page contains many image groups and is paginated.
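To walk through a paginated tag page, a scraper needs to find the "next page" link and stop when there is none. Below is a minimal sketch using only the standard library. It assumes the pager marks the link with the class `next`, which is what the scraper code later in this post looks for. The names `NextLinkFinder` and `find_next_link` and the sample markup are made up for illustration and are not taken from the site.

```python
from html.parser import HTMLParser


class NextLinkFinder(HTMLParser):
    """Remembers the href of the first <a> whose class list contains 'next'."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        # Only look at anchors, and stop once a match has been found.
        if tag != "a" or self.next_href is not None:
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "next" in classes:
            self.next_href = attr_map.get("href")


def find_next_link(page_html):
    """Return the next-page URL, or None when this is the last page."""
    finder = NextLinkFinder()
    finder.feed(page_html)
    return finder.next_href


# Hypothetical pager markup, invented for this example only.
sample_page = ('<div class="pages"><a class="prev" href="p1.html">prev</a>'
               '<a class="next" href="p3.html">next</a></div>')
last_page = '<div class="pages"><a class="prev" href="p2.html">prev</a></div>'
```

A crawl loop would call `find_next_link` on each fetched page and stop once it returns `None`.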

Finally, click one of the image groups and you will find that it contains several high-definition wallpapers.

Within one image group, the URLs of the HTML pages follow a regular pattern:
http://www.win4000.com/mobile_detail_178569_1.html
http://www.win4000.com/mobile_detail_178569_2.html
http://www.win4000.com/mobile_detail_178569_3.html
http://www.win4000.com/mobile_detail_178569_4.html
………
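The pattern above can be generated programmatically. The sketch below assumes the group's own URL is the same address without the `_N` part (for example `mobile_detail_178569.html`), which is how the scraper code in this post treats it. The helper name `build_page_urls` is hypothetical.

```python
def build_page_urls(group_url, count):
    """Return the URLs of pages 1..count of one image group.

    Assumes group_url ends in '.html' and that page i lives at
    '<group_url without .html>_<i>.html'.
    """
    suffix = ".html"
    if not group_url.endswith(suffix):
        raise ValueError("expected a URL ending in .html: " + group_url)
    base = group_url[:-len(suffix)]
    return ["{}_{}{}".format(base, i, suffix) for i in range(1, count + 1)]


urls = build_page_urls("http://www.win4000.com/mobile_detail_178569.html", 4)
for u in urls:
    print(u)
```

With a count of 4 this yields exactly the four URLs listed above.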

That completes the page analysis. Here is the code:

import os
import time
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

rootrurl = 'http://www.win4000.com/'
save_dir = 'D:/estimages/'

# Request headers that make the script look like a browser.
headers = {
    "Referer": rootrurl,
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    'Accept-Language': 'en-US,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive'
}


def saveOneImg(target_dir, img_url):
    # Build fresh headers for every image, with the image URL as the Referer,
    # to avoid HTTP 403 responses caused by hotlink protection.
    new_headers = {
        "Referer": img_url,
        'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
        'Accept-Language': 'en-US,en;q=0.8',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive'
    }

    try:
        img = requests.get(img_url, headers=new_headers)  # request the actual image URL
        if img.status_code == 200:
            # File name: last path segment of the URL, without any query string.
            name = img_url.split('/')[-1].split('?')[0]
            if not os.path.splitext(name)[1]:
                name += '.jpg'  # only add an extension when the URL has none
            with open('{}/{}'.format(target_dir, name), 'wb') as jpg:
                jpg.write(img.content)  # write the image bytes to a local file
            print(img_url)
            return True
        return False
    except Exception as e:
        print('exception occurs: ' + img_url)
        print(e)
        return False

def processOnePage(tag, url):
    print('current group page is: %s' % url)

    html = BeautifulSoup(requests.get(url, headers=headers).text, features="html.parser")
    div = html.find('div', {'class': 'ptitle'})
    title = div.find('h1').get_text()  # group title, used as the folder name
    num = int(div.find('em').get_text())  # page count of the group, used as the loop bound

    tmpDir = '{}{}'.format(tag, title)
    os.makedirs(tmpDir, exist_ok=True)  # safe even if another thread creates it first

    for i in range(1, num + 1):
        # Page i of the group: strip '.html' and append '_<i>.html'.
        tmpurl = '{}_{}{}'.format(url[:-5], i, '.html')
        html = BeautifulSoup(requests.get(tmpurl, headers=headers).text, features="html.parser")
        img = html.find('img', {'class': 'pic-large'})
        if img is not None:
            saveOneImg(tmpDir, img.get('src'))


def processPages(tag, a_s):
    # Each link points to one image group; pause between groups.
    for a in a_s:
        processOnePage(tag, a.get('href'))
        time.sleep(1)


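def tagSpiders(tag, url):
    while True:
        html = BeautifulSoup(requests.get(url, headers=headers).text, features="html.parser")

        # Every link inside the tab_box div is treated as one image group.
        a_s = html.find('div', {'class': 'tab_box'}).find_all('a')
        processPages(tag, a_s)

        # Follow the "next" link until there is none left.
        next_link = html.find('a', {'class': 'next'})
        if next_link is None:
            break
        url = next_link.get('href')
        time.sleep(1)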


def getAllTags(taglist):
    # Map each tag's save directory to the URL of that tag's listing page.
    result = {}
    for tag, url in taglist.items():
        html = BeautifulSoup(requests.get(url, headers=headers).text, features="html.parser")
        tags = html.find('div', {'class': 'cont1'}).find_all('a')[1:]  # skip the first link
        for a in tags:
            result['{}{}/{}/'.format(save_dir, tag, a.get_text())] = a.get('href')
    return result

if __name__ == '__main__':
    # Collect all tags.
    categories = {'手機桌布': 'http://www.win4000.com/mobile.html',     # mobile wallpapers
                  '桌面桌布': 'http://www.win4000.com/wallpaper.html'}  # desktop wallpapers
    taglist = getAllTags(categories)
    print(taglist)

    # Submit one task per tag to a pool of at most 40 threads.
    # Leaving the with block waits for all submitted tasks to finish,
    # so no extra wait loop is needed afterwards.
    with ThreadPoolExecutor(max_workers=40) as t:
        for tag, url in taglist.items():
            t.submit(tagSpiders, tag, url)

    # To test a single tag instead:
    # tagSpiders('D:/estimages/手機桌布/美女/', 'http://www.win4000.com/mobile_2340_0_0_1.html')
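A note on waiting for the workers: when `ThreadPoolExecutor` is used as a context manager, leaving the `with` block calls `shutdown(wait=True)`, so all submitted tasks have finished by the time the next statement runs. The sketch below illustrates this with a stand-in task. `fake_spider` and `run_all` are hypothetical names, and the short sleep merely simulates download work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_spider(tag):
    """Stand-in for tagSpiders: sleeps briefly instead of downloading."""
    time.sleep(0.05)
    return tag


def run_all(tags, max_workers=4):
    """Run one task per tag and return the results in submission order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fake_spider, tag) for tag in tags]
    # The with block has exited, so every future is already done here.
    # result() also re-raises any exception that occurred inside a worker,
    # which would otherwise pass unnoticed.
    return [f.result() for f in futures]


print(run_all(["a", "b", "c"]))
```

Keeping the futures and calling `result()` is optional, but it is a simple way to surface errors from individual tags.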

The result looks like this:

[image]
