Distributed Crawling: Scraping Zhihu User Information

Posted by NGU on 2018-08-31

Original URL: https://juejin.im/post/5b88c46b6fb9a019f82fbbe5

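Preface

It has been a while since I last posted a crawler project, which I'm a little embarrassed about. In the spirit of learning together with fellow Python enthusiasts, this time I'm bringing you a distributed crawler built on Scrapy.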

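Crawler logic

The goal this time is to crawl Zhihu user information: the user you pick, that user's followers, and then the followers of those followers. It sounds convoluted, but it is essentially recursion as used in algorithms: start from one user, fan out to that user's followers, then to their followers, and so on. The crawl starts from the chosen user's followers page.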
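The fan-out described above can be sketched without any network access. This is a minimal illustration, not the spider itself: `crawl_order`, `followers_of`, and the toy graph are hypothetical names invented for the example, and a queue is used in place of literal recursion so that each user is visited exactly once.

```python
from collections import deque


def crawl_order(start, followers_of, max_users=100):
    """Visit `start`, then its followers, then their followers, skipping repeats."""
    seen = {start}            # users already queued, so nobody is crawled twice
    queue = deque([start])
    order = []
    while queue and len(order) < max_users:
        user = queue.popleft()
        order.append(user)
        for follower in followers_of(user):
            if follower not in seen:
                seen.add(follower)
                queue.append(follower)
    return order


# Toy follower graph (made-up data): user -> list of followers.
toy_graph = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": ["a"]}
print(crawl_order("a", lambda user: toy_graph.get(user, [])))
# → ['a', 'b', 'c', 'd']
```

The `seen` set plays the same role as the userinfo.txt file in the spider: it stops the crawl from looping when two users follow each other.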

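In that URL, the segment after people is the username, so you can change it to crawl any user you like. The followers segment selects the user's followers; to crawl the people a user follows instead, change it to following. Opening the developer tools shows that the page itself does not contain the data we want to extract, because the page loads it via Ajax. Switch to the XHR tab to find the request we actually need.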

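There we find the request that loads the data and can copy its URL. The response is JSON, which makes data collection much easier. To fetch more followers, change the value of limit to a multiple of 20; the spider code in this post instead keeps limit at 20 and advances offset by 20 for each page. That covers the whole crawler logic.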
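As a rough sketch of the pagination just described, the helpers below build the followers API URL and advance its offset using only the standard library. The function names `followers_url` and `next_page_url` are hypothetical, chosen for this example. The endpoint and the include value are taken from the spider's start URL; `urlencode` escapes a few more characters than that URL does, but the decoded query is the same.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Decoded form of the `include` parameter used in the spider's start URL.
INCLUDE = ("data[*].answer_count,articles_count,gender,follower_count,"
           "is_followed,is_following,badge[?(type=best_answerer)].topics")


def followers_url(url_token, offset=0, limit=20):
    """Build the followers API URL for one user and one page."""
    query = urlencode({"include": INCLUDE, "offset": offset, "limit": limit})
    return f"https://www.zhihu.com/api/v4/members/{url_token}/followers?{query}"


def next_page_url(url):
    """Return the same URL with offset advanced by one page (offset += limit)."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    params["offset"] = str(int(params["offset"]) + int(params["limit"]))
    return urlunsplit(parts._replace(query=urlencode(params)))


first_page = followers_url("bu-xin-ming-71", offset=20)
second_page = next_page_url(first_page)
print(dict(parse_qsl(urlsplit(second_page).query))["offset"])
# → 40
```

Parsing the query string rather than doing a string replace on `&offset=...&` keeps working even if offset happens to be the last parameter in the URL.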

Source code

1. Without distribution

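import json
import re

from scrapy import Request, Spider

from ..items import ZhihuItem  # assumption: ZhihuItem is defined in the project's items.py


class ZhihuinfoSpider(Spider):
    name = 'zhihuinfo'
    # redis_key = 'ZhihuinfoSpider:start_urls'  # only used by the distributed version
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com/api/v4/members/bu-xin-ming-71/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=20&limit=20']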

    def parse(self, response):
        # The API returns JSON; the follower records are under the "data" key.
        responses = json.loads(response.body.decode('utf-8'))["data"]
        count = len(responses)
        # A full page (20 records) means there may be more, so request the next page.
        if count >= 20:
            page_offset = int(re.findall('&offset=(.*?)&', response.url)[0])
            new_page_offset = page_offset + 20
            new_page_url = response.url.replace(
                '&offset=' + str(page_offset) + '&',
                '&offset=' + str(new_page_offset) + '&'
            )
            yield Request(url=new_page_url, callback=self.parse)
        for user in responses:
            item = ZhihuItem()
            item['name'] = user['name']
            item['id'] = user['id']
            item['headline'] = user['headline']
            item['url_token'] = user['url_token']
            item['user_type'] = user['user_type']
            item['gender'] = user['gender']
            item['articles_count'] = user['articles_count']
            item['answer_count'] = user['answer_count']
            item['follower_count'] = user['follower_count']

            # Deduplicate on url_token; 'a+' creates userinfo.txt if it does not exist yet.
            with open('userinfo.txt', 'a+') as f:
                f.seek(0)
                user_list = f.read().split('-----')
            if user['url_token'] not in user_list:
                with open('userinfo.txt', 'a') as f:
                    f.write(user['url_token'] + '-----')
                yield item

                # Recurse: crawl the followers of this newly seen user.
                new_url = 'https://www.zhihu.com/api/v4/members/' + user['url_token'] + '/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=20&limit=20'
                yield Request(url=new_url, callback=self.parse)
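The spider above re-reads userinfo.txt for every user to decide whether that user has been seen. One possible alternative, shown here only as a sketch, is to load the file once into a set and append new tokens as they arrive. `SeenTokens` is a hypothetical helper, not part of the original project, and it stores one token per line rather than using the '-----' separator.

```python
import os
import tempfile


class SeenTokens:
    """Remember url_tokens in memory, persisting them to a file, one per line."""

    def __init__(self, path):
        self.path = path
        self.tokens = set()
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.tokens = {line.strip() for line in f if line.strip()}

    def add(self, token):
        """Return True and record the token if it is new, False if already seen."""
        if token in self.tokens:
            return False
        self.tokens.add(token)
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(token + "\n")
        return True


# Demo in a temporary directory so nothing in the project is touched.
path = os.path.join(tempfile.mkdtemp(), "userinfo.txt")
seen = SeenTokens(path)
first = seen.add("bu-xin-ming-71")    # new token
second = seen.add("bu-xin-ming-71")   # repeat
reloaded = SeenTokens(path)           # a fresh instance reads the file back
print(first, second, "bu-xin-ming-71" in reloaded.tokens)
```

In the spider, `if seen.add(user['url_token']):` would then take the place of the two file opens. Set membership is an exact match, so one token that happens to be a substring of another is not mistaken for a duplicate.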

2. With distribution

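import json
import re

from scrapy import Request
from scrapy_redis.spiders import RedisCrawlSpider

from ..items import ZhihuItem  # assumption: ZhihuItem is defined in the project's items.py


class ZhihuinfoSpider(RedisCrawlSpider):
    name = 'zhihuinfo'
    # Redis key that the workers read their start URLs from.
    redis_key = 'ZhihuinfoSpider:start_urls'
    allowed_domains = ['www.zhihu.com']
    #start_urls = ['https://www.zhihu.com/api/v4/members/bu-xin-ming-71/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=20&limit=20']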

    def parse(self, response):
        # The API returns JSON; the follower records are under the "data" key.
        responses = json.loads(response.body.decode('utf-8'))["data"]
        count = len(responses)
        # A full page (20 records) means there may be more, so request the next page.
        if count >= 20:
            page_offset = int(re.findall('&offset=(.*?)&', response.url)[0])
            new_page_offset = page_offset + 20
            new_page_url = response.url.replace(
                '&offset=' + str(page_offset) + '&',
                '&offset=' + str(new_page_offset) + '&'
            )
            yield Request(url=new_page_url, callback=self.parse)
        for user in responses:
            item = ZhihuItem()
            item['name'] = user['name']
            item['id'] = user['id']
            item['headline'] = user['headline']
            item['url_token'] = user['url_token']
            item['user_type'] = user['user_type']
            item['gender'] = user['gender']
            item['articles_count'] = user['articles_count']
            item['answer_count'] = user['answer_count']
            item['follower_count'] = user['follower_count']

            # Deduplicate on url_token; 'a+' creates userinfo.txt if it does not exist yet.
            # Note: this file is local to each worker and is not shared between machines.
            with open('userinfo.txt', 'a+') as f:
                f.seek(0)
                user_list = f.read().split('-----')
            if user['url_token'] not in user_list:
                with open('userinfo.txt', 'a') as f:
                    f.write(user['url_token'] + '-----')
                yield item

                # Recurse: crawl the followers of this newly seen user.
                new_url = 'https://www.zhihu.com/api/v4/members/' + user['url_token'] + '/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=20&limit=20'
                yield Request(url=new_url, callback=self.parse)

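As you can see, switching to distributed Scrapy takes only two changes in the spider: the base class, and a Redis key in place of start_urls. After that, add the corresponding settings to the project's configuration file. To run it distributed, start the spider in several terminals; without distribution, a single terminal is enough.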
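The post does not show the configuration it refers to. A typical scrapy-redis setup looks roughly like the fragment below, which would go in the project's settings.py. The setting names are the ones scrapy-redis documents; the Redis address is an assumption (a local Redis on the default port), so adjust it to your environment.

```python
# settings.py fragment (a sketch; the Redis address below is an assumption)

# Use the scrapy-redis scheduler so all workers share one request queue in Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Share the request fingerprint filter through Redis so workers skip duplicate requests.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and fingerprints in Redis when a spider stops, so a crawl can resume.
SCHEDULER_PERSIST = True

# Where the workers find Redis; assumed here to run locally on the default port.
REDIS_URL = "redis://localhost:6379"
```

With this in place, each terminal runs `scrapy crawl zhihuinfo` and waits until a start URL is pushed to the key named by redis_key, for example with `redis-cli lpush ZhihuinfoSpider:start_urls <start URL>`.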

Running the crawler

Results

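The distributed crawler is fast: in my test it collected more than 20,000 records in under half a minute. If you're interested, give it a try. If you have no background in the Scrapy framework, that's fine too, because the crawler logic is the same everywhere; you can copy just the spider part of the code and run that.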
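To illustrate that the logic does not depend on Scrapy, here is a small framework-free sketch of the extraction step. `parse_followers` is a hypothetical function written for this example; the field names come from the spider, and the sample payload is made-up data shaped like the API response, not real output.

```python
import json

# Fields the spider copies from each follower record.
FIELDS = ("name", "id", "headline", "url_token", "user_type",
          "gender", "articles_count", "answer_count", "follower_count")
PAGE_SIZE = 20


def parse_followers(body):
    """Return (records, has_more) for one JSON page of followers."""
    users = json.loads(body)["data"]
    records = [{field: user.get(field) for field in FIELDS} for user in users]
    # Same rule as the spider: a full page suggests another page may follow.
    return records, len(users) >= PAGE_SIZE


# Made-up payload with a single follower.
sample = json.dumps({"data": [{"name": "Example User", "id": "0001",
                               "url_token": "example-user", "gender": 1,
                               "follower_count": 3}]})
records, has_more = parse_followers(sample)
print(records[0]["url_token"], has_more)
# → example-user False
```

Any HTTP client can supply `body`. Fields missing from a record come back as None instead of raising KeyError, which is a deliberate difference from the spider's direct indexing.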

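Recommended reading:

Advanced Crawling: Qunar Hotels (Domestic and International)

Scrapy: Scraping Taobao Food Listings

Large-Scale Crawler Case Study: Crawling Qunar

If you're interested in crawlers, data analysis, or algorithms, follow the WeChat official account TWcoding, and let's have fun with Python together.

If it works for you, please star.

Heaven helps those who help themselves.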
