Python: Multithreaded Concurrent Crawler

Posted by 星海千尋 on 2020-12-12

Original article: https://blog.csdn.net/qq_29367075/article/details/111056339

Today we crawl another website: http://pic.netbian.com/
First, let's pick a few images from this site and try downloading them one by one.

Checking the local files confirms that the images download successfully.
The code is as follows:

import requests   # HTTP client used to fetch the images

def run4():
    # Send a referer and a browser-like user-agent with every request
    headers = {'referer': 'http://pic.netbian.com/',
               'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0'}
    # Each with-block closes its file automatically, so no explicit close() call is needed
    with open("D:/estimages/ali.jpg", "wb") as f:
        f.write(requests.get("http://pic.netbian.com/uploads/allimg/201207/233228-16073551488aeb.jpg", headers=headers).content)

    with open("D:/estimages/ali1.jpg", "wb") as f:
        f.write(requests.get("http://pic.netbian.com/uploads/allimg/200618/005100-1592412660d6f4.jpg", headers=headers).content)

    with open("D:/estimages/ali2.jpg", "wb") as f:
        f.write(requests.get("http://pic.netbian.com/uploads/allimg/190518/174718-1558172838db13.jpg", headers=headers).content)

if __name__ == "__main__":   # program entry point
    run4()    # call run4() defined above

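Now let's get started.
Next we need to analyze the structure of the site. http://pic.netbian.com/ is the root URL; it lists several tags, and the content is organized in layers beneath them.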

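The set of tags is fixed, and so is their number, so we again start by crawling the tags.
Any page will do; the home page is enough to collect all of them.
Each tag's href value is the address of the next layer, and its text is the tag name.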
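
As a rough illustration of the href/text pairing described above, the sketch below pulls tag names and addresses out of a small piece of sample markup. The markup is hypothetical (only `/4kmeinv/` and the `classify clearfix` class come from this article), and `extract_tags` is a helper name made up for the example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical sample markup that mimics the tag bar; the real page may differ.
SAMPLE_HTML = """
<div class="classify clearfix">
  <a href="/4kfengjing/">4K风景</a>
  <a href="/4kmeinv/">4K美女</a>
</div>
"""

def extract_tags(html_text, base_url="http://pic.netbian.com/"):
    """Return {tag text: absolute URL} for every link in the tag bar."""
    soup = BeautifulSoup(html_text, features="html.parser")
    bar = soup.find("div", {"class": "classify clearfix"})
    if bar is None:          # tag bar not found: nothing to crawl
        return {}
    return {
        a.get_text(strip=True): urljoin(base_url, a.get("href"))
        for a in bar.find_all("a")
        if a.get("href")
    }

tags = extract_tags(SAMPLE_HTML)
print(tags)
```

Resolving each href against the root URL with `urljoin` avoids having to strip the leading slash by hand.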

Clicking a tag opens a paginated list of the images that belong to it.
For example, after clicking 【4K美女】 the address bar shows http://pic.netbian.com/4kmeinv/
On that page, every image sits inside the div with class="slist".

So we need to extract every img under that div.
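
A minimal sketch of that extraction, run against hypothetical sample markup rather than the live page (the image paths and the `list_image_paths` helper are invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: two images inside div.slist, one outside it.
SAMPLE_PAGE = """
<div class="slist">
  <ul>
    <li><a href="/tupian/1.html"><img src="/uploads/allimg/a/1.jpg" alt="one"></a></li>
    <li><a href="/tupian/2.html"><img src="/uploads/allimg/a/2.jpg" alt="two"></a></li>
  </ul>
</div>
<div class="other"><img src="/static/logo.png"></div>
"""

def list_image_paths(html_text):
    """Return the src of every img inside the div with class 'slist'."""
    soup = BeautifulSoup(html_text, features="html.parser")
    block = soup.find("div", {"class": "slist"})
    if block is None:        # no image list on this page
        return []
    return [img.get("src") for img in block.find_all("img") if img.get("src")]

paths = list_image_paths(SAMPLE_PAGE)
print(paths)
```

Searching inside the `slist` block first keeps unrelated images, such as a site logo, out of the result.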

The page also has a pagination bar.

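The pagination links follow a regular naming pattern. We can guess that http://pic.netbian.com/4kmeinv/ is equivalent to http://pic.netbian.com/4kmeinv/index.html, and entering the latter in the browser confirms it. Every other page is named index_ followed by a number.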
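
The naming pattern can be written down as a small function. This is only a sketch: `page_url` is a hypothetical helper, and it assumes the numbering continues from the first page, so that the second page is `index_2.html`; the article only states that later pages use `index_` plus a number.

```python
def page_url(tag_path, page, root="http://pic.netbian.com/"):
    """Build the URL of the given 1-based page of a tag."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # Page 1 is index.html; later pages are assumed to be index_<n>.html.
    name = "index.html" if page == 1 else "index_%d.html" % page
    return root + tag_path.strip("/") + "/" + name

print(page_url("4kmeinv", 1))
print(page_url("4kmeinv", 2))
```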

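The complete code is posted further down. The traversal result shown in the original post was obtained with the call to save_all_images commented out and with the threads run sequentially (that is, with t.join() added right after t.start()). Because there are far too many pages, I also added a cutoff: each tag is crawled only up to a fixed number of pages (max_pages in the code).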
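
To see why calling join() right after start() serializes the threads, here is a small self-contained sketch. The `worker` function and the tag names are placeholders, not part of the crawler.

```python
import threading

order = []   # records the sequence of events across all threads

def worker(name):
    order.append(name + " start")
    order.append(name + " end")

for name in ("tag-A", "tag-B", "tag-C"):
    t = threading.Thread(target=worker, args=(name,))
    t.start()
    t.join()   # block until this thread finishes before starting the next one

print(order)
```

Because each thread is joined before the next one starts, the events of different threads never interleave. Removing the join() call lets the threads run concurrently again.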

Here I again choose to follow the href of the [Next page] link to reach the next page.

The link sits in the div with class="page". Finding the [Next page] link and reading its href gives the address of the next page.
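
A sketch of that lookup on hypothetical pagination markup. The label 下一页 (Simplified Chinese for "next page") is an assumption about the text the site uses, and `find_next_href` is a name invented for this example.

```python
from bs4 import BeautifulSoup

NEXT_LABEL = "下一页"   # assumed link text for "next page"

# Hypothetical pagination bars: one with a next link, one on the last page.
SAMPLE_PAGER = """
<div class="page">
  <a href="/4kmeinv/index.html">1</a>
  <b>2</b>
  <a href="/4kmeinv/index_3.html">3</a>
  <a href="/4kmeinv/index_3.html">下一页</a>
</div>
"""
LAST_PAGER = '<div class="page"><a href="/4kmeinv/index.html">1</a><b>2</b></div>'

def find_next_href(html_text):
    """Return the href of the next-page link, or None if there is none."""
    soup = BeautifulSoup(html_text, features="html.parser")
    pager = soup.find("div", {"class": "page"})
    if pager is None:
        return None
    for a in pager.find_all("a"):
        if a.get_text(strip=True) == NEXT_LABEL:
            return a.get("href")
    return None

print(find_next_href(SAMPLE_PAGER))
print(find_next_href(LAST_PAGER))
```

Returning None when the link is missing gives the crawl loop a clear stop condition on the last page.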

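So our strategy is:
1: Get all the tags first.
2: Create one thread per tag to handle all the images under that tag.
3: Within each tag, follow the pagination, and download every image on each page visited.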
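
The three steps can be sketched without any network access by replacing the site with an in-memory stand-in. Everything here (`FAKE_SITE`, `crawl_tag`, the file names) is hypothetical and only shows the one-thread-per-tag shape of the strategy.

```python
import threading

# Hypothetical stand-in for the site: tag -> list of pages -> image names.
FAKE_SITE = {
    "tag-A": [["a1.jpg", "a2.jpg"], ["a3.jpg"]],
    "tag-B": [["b1.jpg"]],
}

downloaded = {}   # tag -> images collected by that tag's thread

def crawl_tag(tag):
    """Step 3: walk the tag's pages in order and collect every image."""
    images = []
    for page in FAKE_SITE[tag]:
        images.extend(page)
    downloaded[tag] = images   # each thread writes only its own key

# Step 1: the tags are the keys of FAKE_SITE.
# Step 2: one thread per tag.
threads = [threading.Thread(target=crawl_tag, args=(tag,)) for tag in FAKE_SITE]
for t in threads:
    t.start()
for t in threads:
    t.join()   # wait for all tags to finish

print(downloaded)
```

Starting every thread before joining any of them lets the tags be processed concurrently, while the final joins make sure all results are in before they are used.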

The full code is as follows:

# -*- coding: utf-8 -*-
import os
import requests
from bs4 import BeautifulSoup
import threading
import time

rootrurl = 'http://pic.netbian.com/'
save_dir = 'D:/estimages/'
no_more_pages = 'END'
max_pages = 8

image_cache = set()              # image paths already downloaded, shared by all threads
index = len(image_cache)         # running count of downloaded images
cache_lock = threading.Lock()    # guards image_cache and index across threads

# step 1: collect all tags from the home page
# The encode/decode round-trip assumes requests decoded the body as ISO-8859-1
# (its fallback when the response declares no charset) and re-decodes it as GBK.
html = BeautifulSoup(requests.get(rootrurl).text.encode('iso-8859-1').decode('gbk'), features="html.parser")
tag_list = {}
for link in html.find('div', {'class': 'classify clearfix'}).find_all('a'):
    tag_list[link.string] = link.get('href')

print("the number of unique tag is : %d" % len(tag_list))
print(tag_list)

# step 2: crawl each tag separately; every tag opens a paginated list,
# so we follow the pagination, and because there are so many images
# we use one thread per tag


class myThread(threading.Thread):   # subclass of threading.Thread
    def __init__(self, threadID, key, value):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.key = key
        self.value = value

    def run(self):                   # start() invokes run() in the new thread
        time.sleep(1)
        print("Thread %d is running..." % self.threadID)
        self.serachSubPages(self.key, self.value)
        print("Thread %d is over..." % self.threadID)

    def serachSubPages(self, key, value):

        # create a subdirectory per tag so the many images stay organized
        tag_name = key
        os.makedirs(save_dir + tag_name, exist_ok=True)
        print("current tag is : " + tag_name)

        url = rootrurl + value + 'index.html'
        pages = 0
        while True:
            print("next page: " + url)
            html = BeautifulSoup(requests.get(url).text.encode('iso-8859-1').decode('gbk'), features="html.parser")
            self.save_all_images(html, save_dir + tag_name)

            if pages >= max_pages:  # cap on the number of pages crawled per tag
                break
            pages = pages + 1

            url = self.findNextPage(html)
            if url == no_more_pages:
                break
            url = rootrurl + url

    def save_all_images(self, html, saveDir):
        global index
        imgs = html.find('div', {'class': 'slist'}).find_all('img')
        print('total imgs is %d:' % len(imgs))
        for img in imgs:

            # some images keep their address in src, others in original, so check both
            href = img.get('src') or img.get('original')
            if not href:
                continue
            print(href)

            # skip images that were already downloaded;
            # the lock makes the check-and-add step atomic across threads
            with cache_lock:
                if href in image_cache:
                    continue
                image_cache.add(href)
                count = index
                index += 1

            # request the image and write it to a local file
            with open('{}/{}'.format(saveDir, href.split("/")[-1]), 'wb') as jpg:
                jpg.write(requests.get(rootrurl + href.lstrip('/')).content)
            print("Fetching item %s" % count)

    def findNextPage(self, html):
        nextBtn = html.find('div', {'class': 'page'}).find_all('a')  # all links in the pagination bar
        for link in nextBtn:
            if link.string == '下一页':   # "next page" label (Simplified Chinese, as the site is GBK-encoded)
                return link.get('href')[1:]   # drop the leading '/'

        return no_more_pages


if __name__ == '__main__':
    i = 0
    thread_list = []
    for key, value in tag_list.items():
        thread1 = myThread(i, key, value[1:])  # one thread per tag; value[1:] drops the leading '/'
        thread_list.append(thread1)
        i = i + 1

    for t in thread_list:
        # t.setDaemon(True)  # daemon threads are killed when the main thread exits, so leave this off
        t.start()

Result: the images are saved into one subdirectory per tag under D:/estimages/.
