I have a friend who likes browsing (glamour) photos on a certain image site. Just looking wasn't enough for him: he is a collector at heart and wanted to download every image on the site, so he asked me for help.
After [quite a long stretch of] research, this amateur finally got it working and wrote it up as this tutorial. My degree is in economics and programming is strictly a hobby, so please point out anything I got wrong, but go easy on me. Thanks.
This article has two parts: the first covers the basic approach, i.e., the single-threaded flow for crawling the images; the second adds multithreading, which greatly improves crawling efficiency.
Preface
This crawl is based on the BeautifulSoup and urllib/urllib2 modules. Python has another efficient crawling framework, Scrapy, but I still haven't figured it out, so I'm not using it for now.
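For readers following along, here is a quick sanity check that the modules used below import cleanly. This assumes Python 2 with beautifulsoup4 and lxml installed (e.g. via pip); it is not part of the crawler itself:

```python
# Python 2 dependency check for the crawl below
import urllib, urllib2                           # standard library in Python 2
from bs4 import BeautifulSoup                    # pip install beautifulsoup4
soup = BeautifulSoup("<p>hello</p>", "lxml")     # the "lxml" parser needs: pip install lxml
print soup.p.get_text()                          # prints: hello
```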
Basic workflow
Overview
The only inputs this crawl needs are an initial URL (written as URL below, to avoid getting in trouble with that site) and a local save path (written as PATH, to protect my privacy); please keep that in mind while reading the code.
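For reference, this is roughly what those two placeholders stand for in the code below; both values here are made up, not the real site or my real path, and the exact URL shape depends on the site:

```python
# Hypothetical values for the placeholders used throughout this post
URL = "http://www.example-gallery-site.com/albums/XiuRen.html"   # initial album-index URL
PATH = "/Users/Adam/Pictures/crawl/"                             # local save folder (note the trailing slash)
```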
Downloading the images from the site and handling the files takes the following steps: [if only I knew how to draw a flowchart]
1. Open the site's front page, get the total number of pages, and get the link to each album;
2. Enter an album, get its title (used as the name of the folder to save into) and the number of pages in that album;
3. Get the link to each image;
4. Download each image, saving it locally under the filename used on the site, into the folder from step 2.
Code and explanation
```python
# -*- coding: utf-8 -*-
"""
@author: Adam
"""
import urllib2, urllib, os
from bs4 import BeautifulSoup

root = PATH
url = URL
req = urllib2.Request(url)
content = urllib2.urlopen(req).read()
soup = BeautifulSoup(content, "lxml")
page = soup.find_all('a')
pagenum1 = page[-3].get_text()                        # Note 1
for i in range(0, int(pagenum1) + 1):
    if i == 0:
        url1 = URL
    else:
        url1 = URL + str(i + 1) + ".html"             # Note 2
    req1 = urllib2.Request(url1)
    #print url
    content1 = urllib2.urlopen(req1).read()
    soup1 = BeautifulSoup(content1, "lxml")
    table = soup1.find_all('td')
    title = soup1.find_all('div', class_='title')     # Note 3
    #print title
    for j in range(1, 19):
        folder = title[j - 1].get_text()
        folder = folder.replace('\n', '')             # Note 4
        curl = table[j].a['href']                     # Note 5
        purl = URL + curl
        # second page
        preq = urllib2.Request(purl)
        pcontent = urllib2.urlopen(preq).read()
        psoup = BeautifulSoup(pcontent, "lxml")
        page2 = psoup.find_all('a')
        pagenum2 = page2[-4].get_text()
        if not os.path.exists(root + folder):
            os.mkdir(root + folder)
        else:
            os.chdir(root + folder)
        #print folder
        for t in range(1, int(pagenum2) + 1):
            if t == 1:
                purl1 = purl
            else:
                purl1 = purl[:-5] + '-' + str(t) + '.html'
            preq2 = urllib2.Request(purl1)
            pcontent2 = urllib2.urlopen(preq2).read()
            psoup2 = BeautifulSoup(pcontent2, "lxml")
            picbox = psoup2.find_all('div', class_='pic_box')   # Note 6
            for k in range(1, 7):
                filename = root + folder + "/" + str(k + 6 * (t - 1)) + ".jpg"
                if not os.path.exists(filename):
                    try:
                        pic = picbox[k].find('img')
                        piclink = pic.get('src')      # Note 7
                        urllib.urlretrieve(piclink, filename)
                    except:
                        continue
```
Note 1: this is how the page count is obtained, because the HTML source for the last page link is <a href="/albums/XiuRen-27.html">27</a>
Note 2: I found that the URL after turning a page is simply the front-page URL with the page number appended
Note 3: the HTML source for an album title is <div class="title"><span class="name">album title</span></div>
Note 4: the album title becomes the folder name, which requires a little cleanup of the title string, explained later
Note 5: the HTML source holding each album's own link is <td><a href="/photos/XiuRen-5541.html" target="_blank"></a></td>
Notes 6 and 7: the HTML source holding an image is <div class="pic_box"><img src=" " alt=" "></div>. find_all('div', class_='pic_box') locates the blocks that hold images, find('img') finds the image tag inside each block, and get('src') retrieves the image link.
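As a small self-contained illustration of notes 6 and 7, here is that chain applied to a made-up HTML snippet shaped like the site's markup (the URL and snippet are hypothetical):

```python
# Minimal illustration of the pic_box -> img -> src chain (hypothetical HTML)
from bs4 import BeautifulSoup

html = '<div class="pic_box"><img src="http://img.example.com/001.jpg" alt=""></div>'
soup = BeautifulSoup(html, "lxml")
for box in soup.find_all('div', class_='pic_box'):    # the blocks that hold images
    piclink = box.find('img').get('src')              # the image tag, then its link
    print piclink                                     # -> http://img.example.com/001.jpg
```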
With the code above, the flow of downloading all the images and saving them into the corresponding folders is clumsily complete. The approach is extremely inefficient: it runs in a single thread and uses several levels of nested loops, so I turned to multithreading to speed things up.
Multithreaded approach
Introduction
There are plenty of articles online about multithreading in Python, but they all seem very complicated (or maybe my level is just too low). Then I found a module that gets the job done in just a few lines.
```python
from multiprocessing.dummy import Pool as ThreadPool
import urllib2

url = "http://www.cnblogs.com"
urls = [url] * 50
pool = ThreadPool(4)
results = pool.map(urllib2.urlopen, urls)
pool.close()
pool.join()
```
Here, urls is a list; the module's map(func, list) feeds the elements of list into func one after another, from front to back, and results comes back as a list.
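A tiny illustration of that behaviour, independent of the crawler (the shout function and its inputs are made up):

```python
# map(func, list) through a thread pool; results comes back as an ordinary list
from multiprocessing.dummy import Pool as ThreadPool

def shout(word):                              # hypothetical stand-in for urllib2.urlopen
    return word.upper()

pool = ThreadPool(4)
results = pool.map(shout, ['a', 'b', 'c'])    # elements are fed to shout in input order
pool.close()
pool.join()
print results                                 # -> ['A', 'B', 'C'], same order as the input
```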
Applying it to this case
<span class="hljs-comment"># -*- coding: utf-8 -*-</span> <span class="hljs-string">""" @author: Adam """</span> <span class="hljs-keyword">import</span> os <span class="hljs-keyword">import</span> urllib2 <span class="hljs-keyword">import</span> urllib <span class="hljs-keyword">from</span> bs4 <span class="hljs-keyword">import</span> BeautifulSoup <span class="hljs-keyword">import</span> re <span class="hljs-keyword">import</span> inspect <span class="hljs-keyword">from</span> multiprocessing.dummy <span class="hljs-keyword">import</span> Pool <span class="hljs-keyword">as</span> ThreadPool <span class="hljs-keyword">from</span> multiprocessing <span class="hljs-keyword">import</span> cpu_count <span class="hljs-keyword">as</span> cpu <span class="hljs-comment">#程式池個數等於CPU個數</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_url</span><span class="hljs-params">(url)</span>:</span> req = urllib2.Request(url) fails = <span class="hljs-number">0</span> <span class="hljs-keyword">while</span> fails < <span class="hljs-number">5</span>: <span class="hljs-keyword">try</span>: content = urllib2.urlopen(req, timeout=<span class="hljs-number">20</span>).read() <span class="hljs-keyword">break</span> <span class="hljs-keyword">except</span>: fails += <span class="hljs-number">1</span> <span class="hljs-keyword">print</span> inspect.stack()[<span class="hljs-number">1</span>][<span class="hljs-number">3</span>] + <span class="hljs-string">' occused error'</span> <span class="hljs-comment">#注1</span> <span class="hljs-keyword">raise</span> <span class="hljs-comment">#注2</span> soup = BeautifulSoup(content, <span class="hljs-string">"lxml"</span>) <span class="hljs-keyword">return</span> soup <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_links_first</span><span class="hljs-params">(url)</span>:</span> soup = read_url(url) page = soup.find_all(<span class="hljs-string">'a'</span>) pagenum = page[-<span class="hljs-number">3</span>].get_text() link = [url[:-<span class="hljs-number">5</span>] + <span class="hljs-string">'-'</span> + str(i) + <span class="hljs-string">'.html'</span> <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">2</span>, int(pagenum)+<span class="hljs-number">1</span>)] link.insert(<span class="hljs-number">0</span>, url) <span class="hljs-comment">#注3</span> <span class="hljs-keyword">return</span> link <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_links_second</span><span class="hljs-params">(url)</span>:</span> purlist = [] soup = read_url(url) table = soup.find_all(<span class="hljs-string">'td'</span>) <span class="hljs-keyword">for</span> j <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>,<span class="hljs-number">19</span>): <span class="hljs-keyword">try</span>: curl = table[j].a[<span class="hljs-string">'href'</span>] purl = URL + curl <span class="hljs-keyword">if</span> purl <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> purlist: <span class="hljs-comment"># print purl</span> purlist.append(purl) <span class="hljs-keyword">except</span>: <span class="hljs-keyword">continue</span> <span class="hljs-keyword">return</span> purlist <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_links_third</span><span class="hljs-params">(url)</span>:</span> <span 
class="hljs-comment"># print url</span> download = {} soup = read_url(url) page = soup.find_all(<span class="hljs-string">'a'</span>) pagenum = page[-<span class="hljs-number">4</span>].get_text() link = [url[:-<span class="hljs-number">5</span>] + <span class="hljs-string">'-'</span> + str(i) + <span class="hljs-string">'.html'</span> <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">2</span>, int(pagenum)+<span class="hljs-number">1</span>)] link.insert(<span class="hljs-number">0</span>, url) title = soup.find_all(<span class="hljs-string">'div'</span>, class_ = <span class="hljs-string">'inline'</span>)[<span class="hljs-number">0</span>].get_text() <span class="hljs-comment">#注4</span> title = re.findall(<span class="hljs-string">'\\S'</span>, title) folder = <span class="hljs-string">''</span>.join(title) <span class="hljs-comment">#注5</span> <span class="hljs-comment"># print folder</span> download[folder] = link <span class="hljs-comment">#注6</span> <span class="hljs-keyword">return</span> download <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_picture</span><span class="hljs-params">((url, folder))</span>:</span> <span class="hljs-comment">#注7</span> soup = read_url(url) picbox = soup.find_all(<span class="hljs-string">'div'</span>, class_ = <span class="hljs-string">'pic_box'</span>) <span class="hljs-keyword">for</span> k <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, <span class="hljs-number">7</span>): <span class="hljs-keyword">try</span>: pic = picbox[k].find(<span class="hljs-string">'img'</span>) piclink = pic.get(<span class="hljs-string">'src'</span>) filename = folder + <span class="hljs-string">'/'</span> + os.path.basename(piclink) <span class="hljs-keyword">print</span> folder <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.path.exists(filename): urllib.urlretrieve(piclink, filename) <span class="hljs-comment"># print filename</span> <span class="hljs-keyword">except</span>: <span class="hljs-comment"># raise</span> <span class="hljs-keyword">continue</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">multi_get_sec_link</span><span class="hljs-params">(url)</span>:</span> pool = ThreadPool(cpu()) linkset = pool.map(get_links_second, url) pool.close() pool.join() <span class="hljs-keyword">return</span> linkset <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">multi_get_third_link</span><span class="hljs-params">(url)</span>:</span> pool = ThreadPool(<span class="hljs-number">4</span>) linkset = pool.map(get_links_third, url) pool.close() pool.join() <span class="hljs-keyword">return</span> linkset <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">multi_get_picture</span><span class="hljs-params">(download, root=PATH)</span>:</span> pool = ThreadPool(<span class="hljs-number">4</span>) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(download)): picurlset = download[download.keys()[i]] folder = root + <span class="hljs-string">'/'</span> + download.keys()[i] <span class="hljs-comment">#注8</span> <span class="hljs-comment"># print folder</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.path.exists(folder): os.mkdir(folder) <span class="hljs-comment"># else:</span> <span class="hljs-comment"># 
os.chdir(folder)</span> pool.map(get_picture, zip(picurlset, folder)) <span class="hljs-comment">#注9</span> pool.close() pool.join() <span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>: url = URL linkset = multi_get_sec_link(get_links_first(url)) linkset = [j <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> linkset <span class="hljs-keyword">for</span> j <span class="hljs-keyword">in</span> i] <span class="hljs-comment">#注10</span> linkset2 = multi_get_third_link(linkset) finalset = {} <span class="hljs-keyword">for</span> dic <span class="hljs-keyword">in</span> linkset2: finalset.update(dic) <span class="hljs-comment">#注11</span> multi_get_picture(finalset) |
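One detail worth calling out is the get_picture((url, folder)) signature: it is Python 2 tuple parameter unpacking, which lets pool.map hand over a single (url, folder) pair that is unpacked into two names inside the function. A minimal illustration, with made-up data (this syntax was removed in Python 3, where you would unpack inside the function body instead):

```python
# Python 2 tuple parameter unpacking, the same style as get_picture((url, folder))
from multiprocessing.dummy import Pool as ThreadPool

def show((url, folder)):                   # one argument (a pair), unpacked into url and folder
    return folder + ' <- ' + url

pairs = [('http://a.example/1.html', 'album1'),
         ('http://a.example/2.html', 'album1')]
pool = ThreadPool(2)
print pool.map(show, pairs)                # each pair arrives as a single tuple argument
pool.close()
pool.join()
```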
Addendum
My network environment is truly bizarre: web pages open fine and video streams smoothly, yet urlopen hits timeout after timeout. So I added the following code to save the final link dictionary to a file.
```python
f = open(FILENAME, "w")     # write the link dictionary out as plain text
f.write(str(finalset))
f.close()
```
After that, the program simply reads this file back in. And to make it keep retrying on its own through the countless maddening timeout errors, I wrapped the call in one more loop.
f = open("/Users/Adam/Documents/Python_Scripts/Photos/links.txt","r") finalset = eval(f.read()) while True: try: multi_get_picture(finalist) #轉成字串的字典再轉回來 break except: continue |
And with that, the program is done. All that remains is to run it and watch the folders pop up on your computer one after another, with images springing up like bamboo shoots after rain.
That's all.
Finally, feel free to adapt my code for your own use; if you repost this article, please credit the source. (Not that I expect anyone to repost it.)