Using scrapy for web scraping

Posted by 鴨脖 on 2012-05-09

I have recently been using scrapy for web scraping. For a Pythoner it is very convenient to use; the detailed documentation is here: http://doc.scrapy.org/en/0.14/index.html

To scrape web page information with scrapy, you first need to create a project: scrapy startproject myproject

Once the project is created there is a myproject/myproject subdirectory containing items.py (where you define the things you want to scrape), pipelines.py (for processing the scraped data, e.g. saving it to a database), and a spiders folder, where you write the spider scripts.
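For reference, the skeleton that scrapy startproject generates looks roughly like this (the exact files may differ slightly between scrapy versions):

myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py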

This example scrapes book information from a website:

items.py is as follows:


from scrapy.item import Item, Field

class BookItem(Item):
    # define the fields for your item here like:
    name = Field()
    author = Field()        # the spider below also extracts the author
    publisher = Field()
    publish_date = Field()
    price = Field()


Everything we want to scrape is defined above: the name, author, publisher, publish date and price.
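As a quick aside, a scrapy Item behaves much like a dict, so these fields are read and written with the usual dict syntax. A minimal sketch (the values here are made up):

from myproject.items import BookItem

item = BookItem()
item['name'] = u'Some Book'       # set a declared field
item['price'] = u'29.80'
print item.get('publisher', '')   # read a field, falling back to a default
# assigning to a field that is not declared on BookItem raises KeyError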

Next we write the spider that goes to the website and scrapes the information.

spiders/book.py is as follows:


from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from myproject.items import BookItem

class BookSpider(CrawlSpider):
    name = 'bookspider'
    allowed_domains = ['test.com']
    start_urls = [
        "http://test_url.com",   #這裡寫開始抓取的頁面地址(這裡網址是虛構的,實際使用時請替換)
    ]
    rules = (
        #下面是符合規則的網址,但是不抓取內容,只是提取該頁的連結(這裡網址是虛構的,實際使用時請替換)
        Rule(SgmlLinkExtractor(allow=(r'http://test_url/test?page_index=\d+'))),
        #下面是符合規則的網址,提取內容,(這裡網址是虛構的,實際使用時請替換)
        Rule(SgmlLinkExtractor(allow=(r'http://test_rul/test?product_id=\d+')), callback="parse_item"),
    )

        
    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = BookItem()
        item['name'] = hxs.select('//div[@class="h1_title book_head"]/h1/text()').extract()[0]
        item['author'] = hxs.select('//div[@class="book_detailed"]/p[1]/a/text()').extract()
        publisher = hxs.select('//div[@class="book_detailed"]/p[2]/a/text()').extract()
        item['publisher'] = publisher and publisher[0] or ''
        # the regex captures the digits/dashes that follow the CJK label and the full-width colon "："
        publish_date = hxs.select('//div[@class="book_detailed"]/p[3]/text()').re(u"[\u2e80-\u9fff]+\uff1a([\d-]+)")
        item['publish_date'] = publish_date and publish_date[0] or ''
        prices = hxs.select('//p[@class="price_m"]/text()').re(r"(\d+\.?\d*)")
        item['price'] = prices and prices[0] or ''
        return item
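Before wiring XPath expressions into parse_item, it is handy to try them out interactively with scrapy shell against a real product page. A rough sketch (the URL is made up; in scrapy 0.14 the shell exposes the page through the hxs selector):

# from the command line:  scrapy shell "http://test_url/test?product_id=123"
# then, inside the shell:
hxs.select('//div[@class="h1_title book_head"]/h1/text()').extract()
hxs.select('//div[@class="book_detailed"]/p[2]/a/text()').extract()
hxs.select('//p[@class="price_m"]/text()').re(r"(\d+\.?\d*)")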

After the information is scraped it needs to be stored, which is where pipelines.py comes in (scrapy is built on twisted, so for the details of the database operations you can consult the twisted documentation; here we just briefly show how to save the data to a database):


from scrapy import log
from twisted.enterprise import adbapi
import MySQLdb
import MySQLdb.cursors


class MySQLStorePipeline(object):

    def __init__(self):
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
                db = 'test',
                user = 'user',
                passwd = '******',
                cursorclass = MySQLdb.cursors.DictCursor,
                charset = 'utf8',
                use_unicode = False
        )

    def process_item(self, item, spider):
        
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        
        query.addErrback(self.handle_error)
        return item
  
    def _conditional_insert(self, tx, item):
        # only insert rows for items that actually have a name
        if item.get('name'):
            tx.execute(
                "insert into book (name, publisher, publish_date, price) "
                "values (%s, %s, %s, %s)",
                (item['name'], item['publisher'], item['publish_date'],
                 item['price'])
            )

    def handle_error(self, failure):
        # log database errors raised inside the interaction
        log.err(failure)
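The insert above assumes a book table already exists in the test database. A minimal sketch of a possible schema (an assumption, adjust the column types to your data), created once from Python before crawling:

# one-off helper to create the table the pipeline writes to (assumed schema)
import MySQLdb

conn = MySQLdb.connect(db='test', user='user', passwd='******', charset='utf8')
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS book (
        id INT AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(255),
        publisher VARCHAR(255),
        publish_date VARCHAR(32),
        price VARCHAR(32)
    ) DEFAULT CHARSET=utf8
""")
conn.commit()
conn.close()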

After that, add the pipeline in settings.py:


ITEM_PIPELINES = ['myproject.pipelines.MySQLStorePipeline']

Finally, run scrapy crawl bookspider to start crawling.
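If you just want to sanity-check the spider without the MySQL pipeline, scrapy's built-in feed export can dump the items to a file instead, for example: scrapy crawl bookspider -o books.json -t json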


Original article: http://www.chengxuyuans.com/Python/39302.html

