Following up on the previous post, "Crawling Douban Group Data with Scrapy (Part 1)": http://my.oschina.net/chengye/blog/124157
How to make a Spider automatically crawl Douban group pages in Scrapy
1. Import CrawlSpider, another of Scrapy's predefined spiders
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
2. Define a new class GroupSpider based on CrawlSpider, and add the corresponding crawl rules.
class GroupSpider(CrawlSpider):
    name = "Group"  # assumed spider name
    allowed_domains = ["douban.com"]
    start_urls = [
        "http://www.douban.com/group/explore?tag=%E8%B4%AD%E7%89%A9",
        "http://www.douban.com/group/explore?tag=%E7%94%9F%E6%B4%BB",
        "http://www.douban.com/group/explore?tag=%E7%A4%BE%E4%BC%9A",
        "http://www.douban.com/group/explore?tag=%E8%89%BA%E6%9C%AF",
        "http://www.douban.com/group/explore?tag=%E5%AD%A6%E6%9C%AF",
        "http://www.douban.com/group/explore?tag=%E6%83%85%E6%84%9F",
        "http://www.douban.com/group/explore?tag=%E9%97%B2%E8%81%8A",
        "http://www.douban.com/group/explore?tag=%E5%85%B4%E8%B6%A3"
    ]

    rules = [
        Rule(SgmlLinkExtractor(allow=('/group/[^/]+/$', )),
             callback='parse_group_home_page', process_request='add_cookie'),
        Rule(SgmlLinkExtractor(allow=('/group/explore\?tag', )),
             follow=True, process_request='add_cookie'),
    ]
start_urls predefines all of Douban's group category pages; the spider sets out from these pages to discover groups.
The rules definition is the most important part of a CrawlSpider. It can be read as: when the spider sees a certain type of page, this is how it should handle it.
For example, the following rule handles pages whose URL ends in /group/XXXX/, calls parse_group_home_page as the handler, and calls add_cookie to attach cookie information before the request is sent.
Rule(SgmlLinkExtractor(allow=('/group/[^/]+/$', )),
     callback='parse_group_home_page', process_request='add_cookie'),
As another example, the following rule fetches the page and automatically extracts its links for further crawling, but does not otherwise process the page content.
Rule(SgmlLinkExtractor(allow=('/group/explore\?tag', )),
     follow=True, process_request='add_cookie'),
How to add cookies
Define the following function and, as described above, add process_request='add_cookie' to the Rule definitions.
def add_cookie(self, request):
    # Request.replace() returns a new Request instead of modifying the original,
    # so the replaced request has to be returned to the Rule
    return request.replace(cookies=[
        {'name': 'COOKIE_NAME', 'value': 'VALUE', 'domain': '.douban.com', 'path': '/'},
    ])
Most websites keep the user's session information in client-side cookies, so attaching cookie information lets the spider crawl as if it were a logged-in user.
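Rather than hard-coding the values, one option (my own sketch; the file name and format are assumptions, not from this post) is to keep the session cookies exported from a logged-in browser in a small JSON file and load them inside add_cookie:

import json

def load_douban_cookies(path="douban_cookies.json"):
    # the file is assumed to hold a JSON list of
    # {"name": ..., "value": ..., "domain": ..., "path": ...} dicts
    with open(path) as f:
        return json.load(f)

def add_cookie(self, request):
    # drop-in variant of the add_cookie method above;
    # Request.replace() returns a new Request, so hand the copy back to the Rule
    return request.replace(cookies=load_douban_cookies())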
How to keep the spider from being banned
First, try attaching a logged-in user's cookies when fetching pages. Even if you are only crawling public pages, sending cookies may keep the spider from being blocked at the application layer. I have not actually verified this, but it certainly does no harm.
Second, even as an authorized user, your IP may get banned if you visit too frequently, so you generally want the spider to pause for 1-2 seconds between requests.
Also configure the User-Agent, and try to rotate through different User-Agent strings while crawling.
Add the following settings to the Scrapy project's settings.py:
DOWNLOAD_DELAY = 2  # assumed value; the pause between requests discussed above
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
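The USER_AGENT setting above is a single fixed string; to actually rotate User-Agents per request as suggested, one common approach (a sketch under my own assumptions, not something from the original post) is a small downloader middleware enabled through DOWNLOADER_MIDDLEWARES:

# e.g. douban/middlewares.py (module path assumed); enable it in settings.py with
# DOWNLOADER_MIDDLEWARES = {'douban.middlewares.RandomUserAgentMiddleware': 400}
import random

class RandomUserAgentMiddleware(object):
    # a few sample desktop User-Agent strings; extend as needed
    user_agents = [
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64; rv:21.0) Gecko/20100101 Firefox/21.0',
    ]

    def process_request(self, request, spider):
        # overwrite the User-Agent header on every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agents)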
================
That completes the spider for crawling Douban group pages. Next, you can define a Spider for the group discussion pages following the same pattern, and then let the spiders loose! Have fun! (A rough sketch of such a spider follows the full listing below.) Here is the complete GroupSpider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from douban.items import DoubanItem

import re

class GroupSpider(CrawlSpider):
    name = "Group"  # assumed spider name
    allowed_domains = ["douban.com"]
    start_urls = [
        "http://www.douban.com/group/explore?tag=%E8%B4%AD%E7%89%A9",
        "http://www.douban.com/group/explore?tag=%E7%94%9F%E6%B4%BB",
        "http://www.douban.com/group/explore?tag=%E7%A4%BE%E4%BC%9A",
        "http://www.douban.com/group/explore?tag=%E8%89%BA%E6%9C%AF",
        "http://www.douban.com/group/explore?tag=%E5%AD%A6%E6%9C%AF",
        "http://www.douban.com/group/explore?tag=%E6%83%85%E6%84%9F",
        "http://www.douban.com/group/explore?tag=%E9%97%B2%E8%81%8A",
        "http://www.douban.com/group/explore?tag=%E5%85%B4%E8%B6%A3"
    ]

    rules = [
        Rule(SgmlLinkExtractor(allow=('/group/[^/]+/$', )),
             callback='parse_group_home_page', process_request='add_cookie'),
        Rule(SgmlLinkExtractor(allow=('/group/explore\?tag', )),
             follow=True, process_request='add_cookie'),
    ]

    def __get_id_from_group_url(self, url):
        # extract the group id from a group home page URL
        m = re.search("^http://www.douban.com/group/([^/]+)/$", url)
        if m:
            return m.group(1)
        else:
            return 0

    def add_cookie(self, request):
        # Request.replace() returns a new Request, so return the copy
        return request.replace(cookies=[
            {'name': 'COOKIE_NAME', 'value': 'VALUE', 'domain': '.douban.com', 'path': '/'},
        ])

    def parse_group_topic_list(self, response):
        self.log("Fetch group topic list page: %s" % response.url)

    def parse_group_home_page(self, response):
        self.log("Fetch group home page: %s" % response.url)

        hxs = HtmlXPathSelector(response)
        item = DoubanItem()

        # group name
        item['groupName'] = hxs.select('//h1/text()').re("^\s+(.*)\s+$")[0]

        # group URL and id
        item['groupURL'] = response.url
        groupid = self.__get_id_from_group_url(response.url)

        # total number of members, taken from the "(N)" after the members link
        members_url = "http://www.douban.com/group/%s/members" % groupid
        members_text = hxs.select('//a[contains(@href, "%s")]/text()' % members_url).re("\((\d+)\)")
        item['totalNumber'] = members_text[0]

        # related groups
        item['RelativeGroups'] = []
        groups = hxs.select('//div[contains(@class, "group-list-item")]')
        for group in groups:
            url = group.select('div[contains(@class, "title")]/a/@href').extract()[0]
            item['RelativeGroups'].append(url)

        return item
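As for the next step mentioned above, here is a rough sketch of a discussion-page spider built on the same pattern. The URL regex, the start URL, and the XPath are my own assumptions about Douban's markup, not something taken from the original post:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class GroupTopicSpider(CrawlSpider):
    name = "GroupTopic"                                    # assumed name
    allowed_domains = ["douban.com"]
    start_urls = ["http://www.douban.com/group/beijing/"]  # any group home page (sample URL)

    rules = [
        # follow paginated discussion lists and parse each page
        Rule(SgmlLinkExtractor(allow=('/group/[^/]+/discussion\?start=\d+', )),
             callback='parse_topic_list', follow=True),
    ]

    def parse_topic_list(self, response):
        self.log("Fetch group topic list page: %s" % response.url)
        hxs = HtmlXPathSelector(response)
        # assumed XPath: every topic row links to its topic page from a title cell
        for link in hxs.select('//td[contains(@class, "title")]/a/@href').extract():
            self.log("Found topic: %s" % link)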