Handling POST request parameters and log levels in Scrapy

Bound_w, 2019-03-04

1. Scrapy log levels

  - When you run a spider with scrapy crawl spiderFileName, what gets printed to the terminal is Scrapy's log output.

  - Types of log messages:

        ERROR : general errors

        WARNING : warnings

        INFO : general information

        DEBUG : debugging information

  - Restricting which log messages are output:

    In the settings.py configuration file, add:

        LOG_LEVEL = 'ERROR' (or 'WARNING', 'INFO', 'DEBUG') to emit only messages at or above that severity.

        LOG_FILE = 'log.txt' to write the log output to the given file instead of the terminal.
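Messages at these levels can also be emitted from inside a spider through its built-in logger. A minimal sketch (the spider name and URL are placeholders, not part of the original example):

import scrapy


class LogDemoSpider(scrapy.Spider):
    name = 'log_demo'
    start_urls = ['https://example.com']

    def parse(self, response):
        # only messages at or above LOG_LEVEL are actually emitted
        self.logger.debug('debug message')
        self.logger.info('info message')
        self.logger.warning('warning message')
        self.logger.error('error message')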

2. Passing data along with requests

  - In some cases the data we want to scrape is not all on one page. For example, when crawling a movie site, the title and rating sit on the first-level listing page, while the remaining details sit on each movie's second-level detail page. In that case we need to pass data along with the request, as sketched below.
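In outline, the pattern looks like this; a minimal self-contained sketch with placeholder URLs and field names (the full working version is Case 2 below):

import scrapy


class ChainSpider(scrapy.Spider):
    name = 'chain'
    start_urls = ['https://example.com/list']

    def parse(self, response):
        item = {'name': 'placeholder'}
        # attach the partially filled item to the follow-up request
        yield scrapy.Request(url='https://example.com/detail',
                             callback=self.parse_detail,
                             meta={'item': item})

    def parse_detail(self, response):
        # the meta dict reappears on the response in the callback
        item = response.meta['item']
        item['detail'] = 'parsed from the detail page'
        yield item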

Handling the parameters of a POST request:

Create the project:
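
A typical command sequence for this step (the project name postPro is an assumption; the spider name post matches the code below):

scrapy startproject postPro
cd postPro
scrapy genspider post www.xxx.com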


Code:

import scrapy


class PostSpider(scrapy.Spider):
    name = 'post'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://fanyi.baidu.com/sug']

    # By default, start_requests issues GET requests for start_urls.
    # Override it and yield a FormRequest so the form data is sent as
    # the body of a POST request instead.
    def start_requests(self):
        data = {
            'kw': 'dog'
        }
        for url in self.start_urls:
            yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse)

    def parse(self, response):
        print(response.text)

settings.py

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

Inspecting the returned data:
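Running scrapy crawl post prints the raw response body. The sug endpoint returns JSON, so parse could decode it instead of printing text; a minimal variant of the parse method above (treating the exact JSON structure as unknown):

import json  # add at the top of the spider file

def parse(self, response):
    # decode the JSON body rather than printing the raw text
    result = json.loads(response.text)
    print(result)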

Case 2: passing an item between callbacks via meta

# -*- coding: utf-8 -*-
import scrapy
from moviePro.items import MovieproItem

class MovieSpider(scrapy.Spider):
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567tv.tv/frim/index1.html']

    # parse the data on the second-level detail page
    def parse_detail(self, response):
        # response.meta returns the meta dict attached to the request
        item = response.meta['item']
        actor = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[3]/a/text()').extract_first()
        item['actor'] = actor

        yield item

    def parse(self, response):
        li_list = response.xpath('//li[@class="col-md-6 col-sm-4 col-xs-3"]')
        for li in li_list:
            item = MovieproItem()
            name = li.xpath('./div/a/@title').extract_first()
            detail_url = 'https://www.4567tv.tv' + li.xpath('./div/a/@href').extract_first()
            item['name'] = name
            # meta parameter: the dict passed here travels with the request
            # and is delivered to the callback via response.meta
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={'item': item})

settings.py

LOG_LEVEL = 'ERROR'
LOG_FILE = './log.txt'    # write the log to this file

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MovieproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    actor = scrapy.Field()
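
With the item class name matching the import in the spider, the crawler can be run from the project directory; with the settings above, errors go to ./log.txt:

scrapy crawl movie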

 
