Using scrapy for web scraping

Posted by 鴨脖 on 2012-05-09

I have recently been using scrapy for web scraping. For a Pythoner it is very convenient to use; the detailed documentation is here: http://doc.scrapy.org/en/0.14/index.html

To scrape web page information with scrapy, you first need to create a project: scrapy startproject myproject

Once the project is created there is a myproject/myproject subdirectory containing items.py (where you define the things you want to scrape), pipelines.py (for processing the scraped data, e.g. saving it to a database), and a spiders folder, where you write the spider scripts.
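For reference, the skeleton that scrapy startproject generates looks roughly like this (the exact files may differ slightly between scrapy versions):

myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py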

This example scrapes book information from a website:

items.py is as follows:


from scrapy.item import Item, Field

class BookItem(Item):
    # define the fields for your item here like:
    name = Field()
    author = Field()        # the spider below also extracts the author
    publisher = Field()
    publish_date = Field()
    price = Field()


Everything we want to scrape is defined above: the name, author, publisher, publish date and price.
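As a quick aside, a scrapy Item behaves much like a dict, so these fields are read and written with the usual dict syntax. A minimal sketch (the values here are made up):

from myproject.items import BookItem

item = BookItem()
item['name'] = u'Some Book'       # set a declared field
item['price'] = u'29.80'
print item.get('publisher', '')   # read a field, falling back to a default
# assigning to a field that is not declared on BookItem raises KeyError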

Next we write the spider that goes to the website and scrapes the information.

spiders/book.py is as follows:


from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from myproject.items import BookItem

class BookSpider(CrawlSpider):
    name = 'bookspider'
    allowed_domains = ['test.com']
    start_urls = [
        "http://test_url.com",   #這裡寫開始抓取的頁面地址(這裡網址是虛構的,實際使用時請替換)
    ]
    rules = (
        #下面是符合規則的網址,但是不抓取內容,只是提取該頁的連結(這裡網址是虛構的,實際使用時請替換)
        Rule(SgmlLinkExtractor(allow=(r'http://test_url/test?page_index=\d+'))),
        #下面是符合規則的網址,提取內容,(這裡網址是虛構的,實際使用時請替換)
        Rule(SgmlLinkExtractor(allow=(r'http://test_rul/test?product_id=\d+')), callback="parse_item"),
    )

        
    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = BookItem()
        item['name'] = hxs.select('//div[@class="h1_title book_head"]/h1/text()').extract()[0]
        item['author'] = hxs.select('//div[@class="book_detailed"]/p[1]/a/text()').extract()
        publisher = hxs.select('//div[@class="book_detailed"]/p[2]/a/text()').extract()
        item['publisher'] = publisher and publisher[0] or ''
        # the regex captures the digits/dashes that follow the CJK label and the full-width colon "："
        publish_date = hxs.select('//div[@class="book_detailed"]/p[3]/text()').re(u"[\u2e80-\u9fff]+\uff1a([\d-]+)")
        item['publish_date'] = publish_date and publish_date[0] or ''
        prices = hxs.select('//p[@class="price_m"]/text()').re(r"(\d+\.?\d*)")
        item['price'] = prices and prices[0] or ''
        return item
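Before wiring XPath expressions into parse_item, it is handy to try them out interactively with scrapy shell against a real product page. A rough sketch (the URL is made up; in scrapy 0.14 the shell exposes the page through the hxs selector):

# from the command line:  scrapy shell "http://test_url/test?product_id=123"
# then, inside the shell:
hxs.select('//div[@class="h1_title book_head"]/h1/text()').extract()
hxs.select('//div[@class="book_detailed"]/p[2]/a/text()').extract()
hxs.select('//p[@class="price_m"]/text()').re(r"(\d+\.?\d*)")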

After the information is scraped it needs to be stored, which is where pipelines.py comes in (scrapy is built on twisted, so for the details of the database operations you can consult the twisted documentation; here we just briefly show how to save the data to a database):


from scrapy import log
from twisted.enterprise import adbapi
import MySQLdb
import MySQLdb.cursors


class MySQLStorePipeline(object):

    def __init__(self):
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
                db = 'test',
                user = 'user',
                passwd = '******',
                cursorclass = MySQLdb.cursors.DictCursor,
                charset = 'utf8',
                use_unicode = False
        )

    def process_item(self, item, spider):
        
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        
        query.addErrback(self.handle_error)
        return item
  
    def _conditional_insert(self, tx, item):
        # only insert rows for items that actually have a name
        if item.get('name'):
            tx.execute(
                "insert into book (name, publisher, publish_date, price) "
                "values (%s, %s, %s, %s)",
                (item['name'], item['publisher'], item['publish_date'],
                 item['price'])
            )

    def handle_error(self, failure):
        # log database errors raised inside the interaction
        log.err(failure)
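The insert above assumes a book table already exists in the test database. A minimal sketch of a possible schema (an assumption, adjust the column types to your data), created once from Python before crawling:

# one-off helper to create the table the pipeline writes to (assumed schema)
import MySQLdb

conn = MySQLdb.connect(db='test', user='user', passwd='******', charset='utf8')
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS book (
        id INT AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(255),
        publisher VARCHAR(255),
        publish_date VARCHAR(32),
        price VARCHAR(32)
    ) DEFAULT CHARSET=utf8
""")
conn.commit()
conn.close()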

After that, add the pipeline in settings.py:


ITEM_PIPELINES = ['myproject.pipelines.MySQLStorePipeline']

Finally, run scrapy crawl bookspider to start crawling.
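If you just want to sanity-check the spider without the MySQL pipeline, scrapy's built-in feed export can dump the items to a file instead, for example: scrapy crawl bookspider -o books.json -t json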


Original article: http://www.chengxuyuans.com/Python/39302.html

