1. Scrapy log levels
- When you run a spider with scrapy crawl spiderFileName, the output printed in the terminal is Scrapy's log information.
- Types of log messages:
ERROR: general errors
WARNING: warnings
INFO: general information
DEBUG: debugging information
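- Controlling which log messages are output:
In the settings.py configuration file, add
LOG_LEVEL = 'the log level you want' to show only messages at that level or above.
LOG_FILE = 'log.txt' writes the log messages to the specified file instead of printing them to the terminal.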
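Scrapy's logging is built on Python's standard logging module, so LOG_LEVEL behaves like an ordinary logging threshold. The following is a minimal sketch using only the standard library (not Scrapy itself) to show which messages survive a given level. The function name emit_all_levels and the logger names are hypothetical, chosen only for this illustration.

```python
import io
import logging


def emit_all_levels(threshold):
    """Log one message per level and return the level names that pass the threshold."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s"))

    # Hypothetical logger name; a separate logger per threshold keeps runs independent.
    logger = logging.getLogger("demo." + threshold)
    logger.propagate = False
    logger.handlers = [handler]
    logger.setLevel(threshold)  # plays the role of LOG_LEVEL in settings.py

    logger.debug("debugging information")
    logger.info("general information")
    logger.warning("a warning")
    logger.error("a general error")
    return stream.getvalue().split()


print(emit_all_levels("DEBUG"))    # ['DEBUG', 'INFO', 'WARNING', 'ERROR']
print(emit_all_levels("WARNING"))  # ['WARNING', 'ERROR']
```

With the threshold set to WARNING, the DEBUG and INFO messages are dropped, which is why raising LOG_LEVEL makes the terminal output of a crawl much quieter.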
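2. Passing data between requests
- Sometimes the data we want to scrape is not all on one page. For example, on a movie site the movie's name and rating are on the first-level (list) page, while the other details we want are on its second-level (detail) page. In that case the partially filled item has to be passed along with the request for the detail page, so that the callback which parses the detail page can finish filling it in.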
Handling the parameters of a POST request:
Create the project:
Code:
import scrapy


class PostSpider(scrapy.Spider):
    name = 'post'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://fanyi.baidu.com/sug']

    def start_requests(self):
        # Form fields sent in the body of the POST request
        data = {
            'kw': 'dog'
        }
        for url in self.start_urls:
            # FormRequest sends the form data as a POST request
            yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse)

    def parse(self, response):
        print(response.text)
settings.py

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
View the data returned by the request:
Example 2:
# -*- coding: utf-8 -*-
import scrapy
from moviePro.items import MovieproItem


class MovieSpider(scrapy.Spider):
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567tv.tv/frim/index1.html']

    # Parse the data on the detail page
    def parse_detail(self, response):
        # response.meta returns the meta dict that was attached to the request
        item = response.meta['item']
        actor = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[3]/a/text()').extract_first()
        item['actor'] = actor
        yield item

    # Parse the list page: one item per movie, then follow its detail link
    def parse(self, response):
        li_list = response.xpath('//li[@class="col-md-6 col-sm-4 col-xs-3"]')
        for li in li_list:
            item = MovieproItem()
            name = li.xpath('./div/a/@title').extract_first()
            detail_url = 'https://www.4567tv.tv' + li.xpath('./div/a/@href').extract_first()
            item['name'] = name
            # meta argument: the meta dict is passed on to the callback
            # and is available there as response.meta
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={'item': item})
settings.py

LOG_LEVEL = "ERROR"
LOG_FILE = './log.txt'  # write the log to this file
items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


# The class name must match the one imported in the spider (MovieproItem)
class MovieproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    actor = scrapy.Field()