Distributed crawling with Scrapy: scrapy-redis
What scrapy_redis does
scrapy_redis builds on Scrapy and adds more powerful capabilities. Concretely, by persisting the request queue and the set of request fingerprints in Redis, it gives you:
- resumable crawls (pick up where a stopped crawl left off)
- fast distributed crawling across multiple machines
Other conceptual background is easy to look up elsewhere; here we only cover how to turn an ordinary Scrapy spider into a distributed one.
Step 1: import the scrapy-redis spider class (copy it from the official example)
Step 2: inherit from that class (copy it if you can't remember the name)
Step 3: comment out the start URLs and the allowed domains
Step 4: set a redis_key (any name works; the official docs show the usual convention)
Step 5: add an __init__ that sets the allowed domains (copy the official example); a minimal skeleton combining all five steps is sketched right after this list
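Putting the five steps together, a converted spider has roughly this shape. This is only a minimal sketch: the class name, spider name, and redis_key below are placeholders, not taken from the projects that follow.
# Minimal sketch of the five changes; every name here is a placeholder
from scrapy_redis.spiders import RedisSpider   # Step 1: import the scrapy-redis spider class


class ExampleSpider(RedisSpider):               # Step 2: inherit from RedisSpider (or RedisCrawlSpider)
    name = 'example'
    # Step 3: start_urls and allowed_domains are commented out
    # allowed_domains = ['example.com']
    # start_urls = ['http://example.com/']
    redis_key = 'example:start_urls'            # Step 4: the Redis list the spider pops start URLs from

    # Step 5: set the allowed domains dynamically instead of hard-coding them
    def __init__(self, *args, **kwargs):
        domain = kwargs.pop('domain', '')
        self.allowed_domains = list(filter(None, domain.split(',')))
        super(ExampleSpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        # normal parsing logic goes here
        pass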
Previously we wrote two kinds of crawlers depending on the pages involved: CrawlSpider-based ones and plain Spider-based ones. Below we convert one of each into a distributed spider.
First, clone the official example project from GitHub:
git clone https://github.com/rolando/scrapy-redis.git
Converting a CrawlSpider
Target: profile listings on youyuan.com.
The converted spider looks like this:
from scrapy.spiders import CrawlSpider, Rule
from youyuanwang.items import YouyuanwangItem
from scrapy.linkextractors import LinkExtractor
# Step 1: import the class we need
from scrapy_redis.spiders import RedisCrawlSpider


# Step 2: inherit from RedisCrawlSpider
class MarrigeSpider(RedisCrawlSpider):
    name = 'marrige'
    # Step 3: comment out the start URLs and the allowed domains
    # allowed_domains = ['youyuan.com']
    # start_urls = ['http://www.youyuan.com/find/xian/mm18-0/advance-0-0-0-0-0-0-0/p1/']
    # Step 4: set the redis_key (the Redis list you will push start URLs into)
    redis_key = 'guoshaosong'

    rules = (
        Rule(LinkExtractor(allow=r'^.*youyuan.*xian.*'), callback='parse_item', follow=True),
    )

    # Step 5: set the allowed domains via __init__
    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        domain = kwargs.pop('domain', '')
        self.allowed_domains = list(filter(None, domain.split(',')))
        super(MarrigeSpider, self).__init__(*args, **kwargs)  # use the current class name here

    def parse_item(self, response):
        student_list = response.xpath('//div[@class="student"]/ul/li')
        for li in student_list:
            item = YouyuanwangItem()  # create a fresh item per profile
            item['name'] = li.xpath('./dl/dd/a[1]/strong/text()').extract_first()
            item['desc'] = li.xpath('./dl/dd/font//text()').extract()
            item['img'] = li.xpath('./dl/dt/a/img/@src').extract_first()
            yield item
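The spider fills in name, desc, and img fields on YouyuanwangItem. The project's items.py isn't shown in this post; inferred from those fields, it would look roughly like this sketch:
# youyuanwang/items.py -- a sketch inferred from the fields used above, not the original file
import scrapy


class YouyuanwangItem(scrapy.Item):
    name = scrapy.Field()   # profile name
    desc = scrapy.Field()   # description text fragments
    img = scrapy.Field()    # avatar image URL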
Next comes settings.py, again copied from the official example:
# Scrapy settings for youyuanwang project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/topics/settings.html
#
SPIDER_MODULES = ['youyuanwang.spiders']
NEWSPIDER_MODULE = 'youyuanwang.spiders'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True  # keep the Redis queue and dupefilter when the spider closes, so the crawl can resume
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"
ITEM_PIPELINES = {
    # 'youyuanwang.pipelines.YouyuanwangPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
LOG_LEVEL = 'DEBUG'
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
# REDIS_PASS = 'root'
SPIDER_MIDDLEWARES = { 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None, }
# Introduce an artificial delay between requests so the target site is not hit too hard.
DOWNLOAD_DELAY = 1
To run it, first push a start URL into Redis. The key must match the spider's redis_key ('guoshaosong' above):
lpush guoshaosong http://www.youyuan.com/find/xian/mm18-0/advance-0-0-0-0-0-0-0/p1/
Then run the spider from PyCharm's terminal (or any shell):
scrapy runspider <spider_file>.py
Run the same command on every machine that points at the same Redis instance and they will share one queue. You can also stop the spider partway (Ctrl+C) and start it again to check that it really resumes from where it left off.
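To see what the persisted request queue and fingerprint set look like while the spider is stopped, you can peek at the keys scrapy-redis keeps in Redis. The snippet below is a small sketch using the redis-py client; it assumes the default scrapy-redis key names (<spider_name>:requests, <spider_name>:dupefilter, <spider_name>:items) and a local Redis.
# Quick check of the persisted state while the spider is stopped
# (assumes default scrapy-redis key names and a local Redis instance)
import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379)

# Pending requests the scheduler will pick up again on restart (a sorted set)
print('queued requests  :', r.zcard('marrige:requests'))

# Fingerprints of requests already seen (the dupefilter set)
print('seen fingerprints:', r.scard('marrige:dupefilter'))

# Items stored by RedisPipeline, if enabled (a list)
print('items collected  :', r.llen('marrige:items'))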
Converting a plain Spider
The process is exactly the same; if you get stuck, copy the official example. This is mainly another worked example.
Target: scrape Sina news content.
import scrapy
from news.items import NewsItem
# Step 1: import the class we need
from scrapy_redis.spiders import RedisSpider


# Step 2: inherit from RedisSpider
class SinaNewsSpider(RedisSpider):
    name = 'sina_news'
    # Step 3: comment out the allowed domains and start URLs
    # allowed_domains = ['sina.com.cn']
    # start_urls = ['http://news.sina.com.cn/guide/']
    # Step 4: set the redis_key
    redis_key = 'myspider:start_urls'

    # Step 5: set the allowed domains via __init__
    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        domain = kwargs.pop('domain', '')
        self.allowed_domains = list(filter(None, domain.split(',')))
        super(SinaNewsSpider, self).__init__(*args, **kwargs)

    # Step 6: update the settings file (Ctrl+Shift+Alt+J selects all occurrences in PyCharm)
    def parse(self, response):
        # URLs and titles of all parent categories
        parentUrls = response.xpath('//div[@id="tab01"]/div/h3/a/@href').extract()
        parentTitle = response.xpath('//div[@id="tab01"]/div/h3/a/text()').extract()
        # URLs and titles of all sub-categories
        subUrls = response.xpath('//div[@id="tab01"]/div/ul/li/a/@href').extract()
        subTitle = response.xpath('//div[@id="tab01"]/div/ul/li/a/text()').extract()

        # Match each sub-category to its parent category
        for i in range(0, len(parentUrls)):
            for j in range(0, len(subUrls)):
                # A sub-category belongs to a parent if its URL starts with the parent URL
                if subUrls[j].startswith(parentUrls[i]):
                    item = NewsItem()
                    item['headline'] = parentTitle[i]
                    item['headline_url'] = parentUrls[i]
                    item['subtitle'] = subTitle[j]  # was subTitle[i] in the original, which mismatched the URL
                    item['subtitle_url'] = subUrls[j]
                    yield scrapy.Request(url=item['subtitle_url'], callback=self.subtitle_parse, meta={'meta_1': item})

    def subtitle_parse(self, response):
        meta_1 = response.meta['meta_1']
        item = NewsItem()
        item['content'] = response.xpath('//title/text()').extract_first()
        item['headline'] = meta_1['headline']
        item['headline_url'] = meta_1['headline_url']
        item['subtitle'] = meta_1['subtitle']
        item['subtitle_url'] = meta_1['subtitle_url']
        yield item
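As with the first project, the items.py isn't shown. Judging from the fields used above, NewsItem would look roughly like this sketch (not the original file):
# news/items.py -- a sketch inferred from the fields used above
import scrapy


class NewsItem(scrapy.Item):
    headline = scrapy.Field()      # parent category title
    headline_url = scrapy.Field()  # parent category URL
    subtitle = scrapy.Field()      # sub-category title
    subtitle_url = scrapy.Field()  # sub-category URL
    content = scrapy.Field()       # page title of the sub-category page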
And settings.py, much the same as before:
# Scrapy settings for news project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/topics/settings.html
#
SPIDER_MODULES = ['news.spiders']
NEWSPIDER_MODULE = 'news.spiders'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"
ITEM_PIPELINES = {
    # 'news.pipelines.ExamplePipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
# Redis connection settings
REDIS_HOST = "127.0.0.1"
REDIS_PORT = 6379
LOG_LEVEL = 'DEBUG'
SPIDER_MIDDLEWARES = { 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None, }
# Introduce an artificial delay between requests so the target site is not hit too hard.
DOWNLOAD_DELAY = 1
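With RedisPipeline enabled, scraped items are not written to a file; they are serialized into a Redis list, by default keyed <spider_name>:items. Here is a small sketch of pulling them back out with redis-py, assuming that default key name and a local Redis:
# Drain scraped items out of Redis (assumes RedisPipeline's default key name)
import json
import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379)

while True:
    raw = r.lpop('sina_news:items')  # spider name is 'sina_news'
    if raw is None:
        break                        # list is empty, nothing left to read
    item = json.loads(raw)
    print(item['headline'], '-', item['subtitle'], '-', item['content'])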
That's basically it. I also looked into deploying and managing distributed crawlers, mainly with scrapyd and Gerapy, but since I'm on Windows 10 Education and kept hitting errors installing Docker, I didn't pursue it further. There are plenty of blog posts on CSDN about those tools if you're interested.