Crawling Links to Zhihu Answers with Over 1,000 Upvotes Using Multithreaded Python

Posted by shallowlearning on 2015-08-10

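I maintain a WeChat official account and recently wanted to use Python to automatically collect Zhihu answers that have received more than 1,000 upvotes.

A crawler was the obvious choice. The first version was very simple, just a single-threaded crawl. It turned out to be far too slow, so I added multithreading.

The key parts are commented and the idea is not complicated, so here is the code: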

#coding=utf-8
import os
import re
import threading
import urllib2

# Use regular expressions to extract the parts we need
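def spider(url):
    try:
        #user_agent = {'User-agent': 'spider'}
        r = urllib2.Request(url)
        data = urllib2.urlopen(r).read()
        title_reg = re.compile(r'<title>\s*(.*?)\s*</title>')
        title = re.findall(title_reg, data)[0]
        # decode() fixes the problem with Chinese characters in the name
        title = re.findall(r'(.*?) -.* -.*', title)[0].decode('utf-8')
        filename = title + '.txt'
        if not os.path.exists(filename):  # skip questions that were already saved
            vote = map(int, re.findall('data-votecount="(.*?)"', data))  # upvote counts
            link = re.findall('target="_blank" href="(.*?)"', data)  # links to the answers
            if max(vote) <= 1000:  # no answer has more than 1,000 upvotes
                return
            with open(filename, 'w+') as f:
                for i in range(len(vote)):
                    if vote[i] > 1000:  # save the link if the answer has more than 1,000 upvotes
                        f.write('www.zhihu.com' + link[i] + '\n' + str(vote[i]) + ' upvotes' + '\n\n')
    except Exception:
        # Any failure (network error, missing title, no answers) skips this question
        return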

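def CrawlTopic(topicURL):
    basicURL = 'http://www.zhihu.com'
    topicURL += '/questions?page='
    # This file stores the page number reached at the end of each run. A full
    # crawl takes a long time and rarely finishes in one go, so the next run
    # resumes from here. The slice is the 8-digit topic ID in the URL.
    filename = topicURL[27:35] + '.txt'
    if not os.path.exists(filename):
        with open(filename, 'w+') as f_page:
            f_page.write('1')
        page = 1
    else:
        with open(filename, 'r+') as f_page:
            page = int(f_page.read())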
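    while 1:
        r1 = urllib2.Request(topicURL + str(page))
        try:
            res = urllib2.urlopen(r1).read()
        except Exception:
            # The request failed: skip this page and record the progress
            page += 1
            with open(filename, 'w+') as f_page:
                f_page.write(str(page))
            continue
        if not res:
            # Empty response: record the progress and request the same page again
            with open(filename, 'w+') as f_page:
                f_page.write(str(page))
            continue
        # Relative links to every question listed on this page
        questions = re.findall('<a target="_blank" class="question_link" href="(.*?)">', res)
        print page
        for q in questions:
            spider(basicURL + q)
        # Page done: move on and record the progress
        page += 1
        with open(filename, 'w+') as f_page:
            f_page.write(str(page))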

threads = []
topics = [  # topic pages to crawl
    'http://www.zhihu.com/topic/19551147',
    'http://www.zhihu.com/topic/19569848',
    'http://www.zhihu.com/topic/19691659',
    'http://www.zhihu.com/topic/19556423',
    'http://www.zhihu.com/topic/19550564',
    'http://www.zhihu.com/topic/19566266',
    'http://www.zhihu.com/topic/19556758',
    'http://www.zhihu.com/topic/19694211'
    ]
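for item in topics:
    t = threading.Thread(target=CrawlTopic, args=(item,))
    threads.append(t)  # one thread object per topic

for t in threads:
    # Mark the thread as a daemon. This must be done before start() is called;
    # if the threads are not daemons, the program hangs indefinitely.
    t.setDaemon(True)
    t.start()  # start the thread

# join() blocks the parent thread until the child thread has finished.
# Without it, the parent would exit right away and shut the child threads down.
for t in threads:
    t.join()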
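
The thread-per-topic pattern can be tried without touching the network. The following is a minimal sketch in Python 3 (the listing above is Python 2). `crawl_topic_stub` and `run_all` are hypothetical names introduced only for this illustration, and the stub records a topic ID instead of crawling. The sketch starts one daemon thread per topic and joins every thread, so the main thread waits for all of them.

```python
import threading


def crawl_topic_stub(topic_url, results, lock):
    """Hypothetical stand-in for CrawlTopic: records the topic ID instead of crawling."""
    topic_id = topic_url.rstrip('/').split('/')[-1]
    with lock:  # guard the shared list while appending
        results.append(topic_id)


def run_all(topic_urls):
    """Start one daemon thread per topic and wait for all of them to finish."""
    results = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=crawl_topic_stub, args=(url, results, lock), daemon=True)
        for url in topic_urls
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every thread, not only the last one created
    return sorted(results)


print(run_all([
    'http://www.zhihu.com/topic/19551147',
    'http://www.zhihu.com/topic/19569848',
]))
```

Passing the shared list and lock as arguments keeps the worker free of globals, which makes it easier to test in isolation.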

After the crawl finishes, the results look like this:
[Figure: crawl results]

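Each txt file is named after a question and lists, for every qualifying answer, its link followed by its upvote count.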
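
To make the file layout concrete, here is a minimal sketch in Python 3 of the filtering step, run on a hand-written HTML fragment rather than a real Zhihu page. The fragment, the function names `popular_answers` and `format_entries`, and the `threshold` parameter are assumptions for illustration. The two regular expressions are the ones used in `spider`.

```python
import re


def popular_answers(html, threshold=1000):
    """Return (link, votes) pairs for answers with more than `threshold` upvotes."""
    votes = [int(v) for v in re.findall(r'data-votecount="(.*?)"', html)]
    links = re.findall(r'target="_blank" href="(.*?)"', html)
    return [(link, vote) for link, vote in zip(links, votes) if vote > threshold]


def format_entries(entries):
    """Lay out each entry as: link, vote count, blank line."""
    return ''.join('www.zhihu.com%s\n%d upvotes\n\n' % (link, vote) for link, vote in entries)


# Hand-written sample, not real Zhihu markup.
SAMPLE = (
    '<div data-votecount="2500"><a target="_blank" href="/question/1/answer/10">a</a></div>'
    '<div data-votecount="12"><a target="_blank" href="/question/1/answer/11">b</a></div>'
    '<div data-votecount="1001"><a target="_blank" href="/question/1/answer/12">c</a></div>'
)

print(format_entries(popular_answers(SAMPLE)))
```

Pairing links and vote counts with `zip` assumes both patterns match once per answer and in the same order, which is the same assumption the index-based loop in the script makes.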

This filters out a lot of useless content, which makes reading Zhihu much more efficient.
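
Because a crawl like this rarely completes in one sitting, the resume logic in `CrawlTopic` is the part most worth reusing elsewhere. Below is a minimal sketch in Python 3, assuming a plain-text checkpoint file that holds a single page number. `load_page` and `save_page` are hypothetical helper names, not functions from the script.

```python
import os
import tempfile


def load_page(path):
    """Return the saved page number, or 1 if no checkpoint exists yet."""
    if not os.path.exists(path):
        return 1
    with open(path) as f:
        return int(f.read().strip())


def save_page(path, page):
    """Overwrite the checkpoint with the next page to crawl."""
    with open(path, 'w') as f:
        f.write(str(page))


# Usage: resume from the checkpoint, handle three pages, save progress after each.
checkpoint = os.path.join(tempfile.mkdtemp(), '19551147.txt')
page = load_page(checkpoint)
for _ in range(3):
    page += 1  # a real crawler would fetch and process the page here
    save_page(checkpoint, page)
print(load_page(checkpoint))  # prints 4
```

Saving after every page means an interrupted run loses at most the page that was in progress.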
