Preface
It has been a while since I shared a crawler project with you, which I feel a little guilty about. In the spirit of learning together with fellow Python enthusiasts, this time I am bringing you a distributed crawler built with Scrapy.
Crawler Logic
The goal of this crawler is to collect Zhihu user information: starting from a user you choose, it gathers that user's followers, then those followers' followers, and so on. It works like recursion, spreading outward from one user to their followers and then to the followers of those followers. The initial URL to crawl looks like the one below.
Notice that the segment after people is the username, so you can change it to whichever user you want to crawl; followers refers to that user's followers, and if you want the people the user follows instead, change it to following. Opening the developer tools, we find that the page itself does not contain the information we want to extract, because the content is loaded via Ajax. We need to switch to the XHR tab to find the request we are after, as shown in the figure below.
There we find the request that loads the data and its URL, and the response turns out to be JSON, which makes data collection much easier. If we want to fetch more followers per request, we only need to change the limit value to a larger multiple of 20. That covers the crawler logic.
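Before wiring this into Scrapy, you can verify the endpoint and its JSON layout with a quick standalone request. Below is a minimal sketch using the requests library and the followers URL from the spider further down; the User-Agent header (and possibly login cookies) are assumptions about what Zhihu currently requires, not something from the original post.

# Minimal sketch: inspect the followers API response outside Scrapy.
# Assumes a browser-like User-Agent is enough; Zhihu may also require login cookies.
import requests

url = 'https://www.zhihu.com/api/v4/members/bu-xin-ming-71/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=20&limit=20'
headers = {'User-Agent': 'Mozilla/5.0'}

resp = requests.get(url, headers=headers)
followers = resp.json()['data']   # each element describes one follower
for user in followers:
    print(user['name'], user['follower_count'])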
Source Code
1. Without distribution
import json
import re

from scrapy import Spider, Request

# Assumes a ZhihuItem with the fields used below is defined in the project's items.py
from ..items import ZhihuItem


class ZhihuinfoSpider(Spider):
    name = 'zhihuinfo'
    # redis_key = 'ZhihuinfoSpider:start_urls'  # only used in the distributed version
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com/api/v4/members/bu-xin-ming-71/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=20&limit=20']

    def parse(self, response):
        responses = json.loads(response.body.decode('utf-8'))['data']
        count = len(responses)
        if count < 20:
            # Fewer than 20 entries means this was the last page for this user
            pass
        else:
            # Move the offset forward by 20 to request the next page of followers
            page_offset = int(re.findall('&offset=(.*?)&', response.url)[0])
            new_page_offset = page_offset + 20
            new_page_url = response.url.replace(
                '&offset=' + str(page_offset) + '&',
                '&offset=' + str(new_page_offset) + '&'
            )
            yield Request(url=new_page_url, callback=self.parse)

        for user in responses:
            item = ZhihuItem()
            item['name'] = user['name']
            item['id'] = user['id']
            item['headline'] = user['headline']
            item['url_token'] = user['url_token']
            item['user_type'] = user['user_type']
            item['gender'] = user['gender']
            item['articles_count'] = user['articles_count']
            item['answer_count'] = user['answer_count']
            item['follower_count'] = user['follower_count']

            # Plain text-file deduplication: only yield users we have not seen before
            try:
                with open('userinfo.txt') as f:
                    user_list = f.read()
            except FileNotFoundError:
                user_list = ''
            if user['url_token'] not in user_list:
                with open('userinfo.txt', 'a') as f:
                    f.write(user['url_token'] + '-----')
                yield item
                # Recurse into this follower's own followers
                new_url = 'https://www.zhihu.com/api/v4/members/' + user['url_token'] + '/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=20&limit=20'
                yield Request(url=new_url, callback=self.parse)
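The spider above fills a ZhihuItem, which is not shown in the original post. A minimal, hypothetical sketch of what such an item class could look like in the project's items.py, covering exactly the fields the spider populates:

# Hypothetical items.py: a ZhihuItem with the fields used by the spider.
import scrapy

class ZhihuItem(scrapy.Item):
    name = scrapy.Field()
    id = scrapy.Field()
    headline = scrapy.Field()
    url_token = scrapy.Field()
    user_type = scrapy.Field()
    gender = scrapy.Field()
    articles_count = scrapy.Field()
    answer_count = scrapy.Field()
    follower_count = scrapy.Field()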
2. With distribution
import json
import re

from scrapy import Request
from scrapy_redis.spiders import RedisCrawlSpider

# Assumes a ZhihuItem with the fields used below is defined in the project's items.py
from ..items import ZhihuItem


class ZhihuinfoSpider(RedisCrawlSpider):
    name = 'zhihuinfo'
    # scrapy-redis pops start URLs from this Redis list (the attribute is redis_key)
    redis_key = 'ZhihuinfoSpider:start_urls'
    allowed_domains = ['www.zhihu.com']
    # start_urls is no longer needed; the start URL is pushed into Redis instead
    # start_urls = ['https://www.zhihu.com/api/v4/members/bu-xin-ming-71/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=20&limit=20']

    def parse(self, response):
        responses = json.loads(response.body.decode('utf-8'))['data']
        count = len(responses)
        if count < 20:
            # Fewer than 20 entries means this was the last page for this user
            pass
        else:
            # Move the offset forward by 20 to request the next page of followers
            page_offset = int(re.findall('&offset=(.*?)&', response.url)[0])
            new_page_offset = page_offset + 20
            new_page_url = response.url.replace(
                '&offset=' + str(page_offset) + '&',
                '&offset=' + str(new_page_offset) + '&'
            )
            yield Request(url=new_page_url, callback=self.parse)

        for user in responses:
            item = ZhihuItem()
            item['name'] = user['name']
            item['id'] = user['id']
            item['headline'] = user['headline']
            item['url_token'] = user['url_token']
            item['user_type'] = user['user_type']
            item['gender'] = user['gender']
            item['articles_count'] = user['articles_count']
            item['answer_count'] = user['answer_count']
            item['follower_count'] = user['follower_count']

            # Plain text-file deduplication: only yield users we have not seen before
            try:
                with open('userinfo.txt') as f:
                    user_list = f.read()
            except FileNotFoundError:
                user_list = ''
            if user['url_token'] not in user_list:
                with open('userinfo.txt', 'a') as f:
                    f.write(user['url_token'] + '-----')
                yield item
                # Recurse into this follower's own followers
                new_url = 'https://www.zhihu.com/api/v4/members/' + user['url_token'] + '/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=20&limit=20'
                yield Request(url=new_url, callback=self.parse)
As you can see, making the Scrapy spider distributed only takes two changes: inherit from RedisCrawlSpider instead of Spider, and replace start_urls with redis_key. After that, add the corresponding settings to the configuration file, as sketched below. For a distributed run, open several terminals and start the spider in each of them; for a non-distributed run, a single terminal is enough.
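For reference, here is a minimal sketch of the scrapy-redis configuration such a setup typically needs in settings.py. It assumes a Redis instance running on localhost; the exact values are assumptions, not taken from the original post.

# settings.py (sketch): hand scheduling and deduplication over to scrapy-redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True              # keep the request queue across restarts
REDIS_URL = "redis://localhost:6379"  # assumes a local Redis instance

With these settings in place, start scrapy crawl zhihuinfo in every terminal, then seed the queue once by pushing the followers URL from the non-distributed spider into the Redis list the spider listens on, for example with redis-cli lpush ZhihuinfoSpider:start_urls followed by that URL.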
Running the crawler
Run results
The distributed crawler is fast: in my test it collected more than 20,000 records in under half a minute. Interested readers can give it a try. If you have no background with the Scrapy framework, that is fine too; the crawler logic is the same, and you can run it by simply copying the spider code.
Recommended reading:
If you are interested in crawlers, data analysis, or algorithms, follow the WeChat official account TWcoding and let's have fun with Python together.
If it works for you, please star it.
Heaven helps those who help themselves.