Introduction and Usage of item - 2.0

Posted by yuhui_2000 on 2020-10-11

Using item

We will learn how to use item by scraping the Sunshine Hotline Q&A platform (陽光熱線問政平臺).
Goal: for every complaint post, collect its number, link, title, and content.
url: http://wz.sun0769.com/political/index/politicsNewest?id=1

About the website

1. Each post corresponds to one <li> tag (the spider below selects them with li[@class="clear"]).

2. Find the URL whose response contains the data we need.

3. How the URL changes from page to page (a small sketch for building these URLs follows this list):

   http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1
   http://wz.sun0769.com/political/index/politicsNewest?id=1&page=2
   http://wz.sun0769.com/political/index/politicsNewest?id=1&page=3

   http://wz.sun0769.com/political/index/politicsNewest?id=1&page=<page number>

4. The request for a detail page is a GET request.
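
As a quick illustration of point 3, the listing-page URLs can be generated from the pattern (a standalone sketch, independent of the spider code below):

    # Standalone sketch: build listing-page URLs from the pattern in point 3
    base_url = "http://wz.sun0769.com/political/index/politicsNewest?id=1&page={}"

    # pages 1 to 3, as in the examples above
    page_urls = [base_url.format(page) for page in range(1, 4)]
    for url in page_urls:
        print(url)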

Project file layout
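
The original screenshot of the directory tree is not reproduced here; assuming the project was created with the commands listed in the cmd section at the end, scrapy startproject Sun produces a layout roughly like this:

    Sun/
    ├── scrapy.cfg
    └── Sun/
        ├── __init__.py
        ├── items.py
        ├── middlewares.py
        ├── pipelines.py
        ├── settings.py
        └── spiders/
            ├── __init__.py
            └── sun.py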

sun.py

    # -*- coding: utf-8 -*-
    import scrapy
    from ..items import SunItem


    class SunSpider(scrapy.Spider):
        name = 'sun'
        allowed_domains = ['sun0769.com']
        start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1']

        def parse(self, response):
            # split the listing page into one li element per post
            li_list = response.xpath('//div[@class="width-12"]/ul[@class="title-state-ul"]//li[@class="clear"]')
            for li in li_list:
                item = SunItem()
                item["num"] = li.xpath('.//span[@class="state1"]/text()').extract_first()
                item["title"] = li.xpath('.//span[@class="state3"]/a[1]/text()').extract_first()
                # default="" avoids an AttributeError if the span is missing
                item["response_time"] = li.xpath('.//span[@class="state4"]/text()').extract_first(default="").strip()
                item["response_time"] = item["response_time"].split(":")[-1]
                item["ask_time"] = li.xpath('.//span[@class="state5 "]/text()').extract_first()
                item["detail_url"] = "http://wz.sun0769.com" + li.xpath('.//span[@class="state3"]/a[1]/@href').extract_first()

                yield scrapy.Request(item["detail_url"],
                                     callback=self.parse_detail_url,
                                     meta={"item": item}  # pass the item along via meta
                                     )

            # pagination: request the following listing pages
            for page in range(2, 4):
                next_url = f"http://wz.sun0769.com/political/index/politicsNewest?id=1&page={page}"
                yield scrapy.Request(next_url,
                                     callback=self.parse
                                     )

        def parse_detail_url(self, response):
            """Parse the data on a detail page."""
            item = response.meta["item"]  # retrieve the item passed via meta

            item["content"] = response.xpath('//div[@class="details-box"]/pre/text()').extract_first()

            # note: there may be more than one image/video, or none at all
            item["img"] = response.xpath('//div[@class="clear details-img-list Picture-img"]/img/@src').extract()
            item["video"] = response.xpath('//div[@class="vcp-player"]/video/@src').extract()

            yield item
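
A side note on meta={"item": item}: Scrapy 1.7 and later also offer cb_kwargs for passing data to a callback, which makes the hand-off explicit in the callback's signature. Below is a minimal sketch of the same idea, written as if it sat next to sun.py in the spiders package; the spider name sun_cb_kwargs and the shortened XPath are illustrative assumptions, not part of the original project.

    # Minimal sketch (not part of the original project): the same item hand-off
    # using cb_kwargs instead of meta. cb_kwargs is available in Scrapy >= 1.7.
    import scrapy
    from ..items import SunItem


    class SunCbKwargsSpider(scrapy.Spider):
        name = 'sun_cb_kwargs'  # hypothetical name, to avoid clashing with 'sun'
        allowed_domains = ['sun0769.com']
        start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1']

        def parse(self, response):
            # shortened XPath for brevity; the full path from sun.py works the same way
            for li in response.xpath('//li[@class="clear"]'):
                item = SunItem()
                item["title"] = li.xpath('.//span[@class="state3"]/a[1]/text()').extract_first()
                item["detail_url"] = "http://wz.sun0769.com" + li.xpath('.//span[@class="state3"]/a[1]/@href').extract_first()
                # the item is handed to the callback as a keyword argument
                yield scrapy.Request(item["detail_url"],
                                     callback=self.parse_detail_url,
                                     cb_kwargs={"item": item})

        def parse_detail_url(self, response, item):
            # item arrives directly as a parameter, no response.meta lookup needed
            item["content"] = response.xpath('//div[@class="details-box"]/pre/text()').extract_first()
            yield item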
    
    
    

items.py

Define which fields we want to scrape.

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class SunItem(scrapy.Item):
        # define the fields for your item here like:
        num = scrapy.Field()
        title = scrapy.Field()
        response_time = scrapy.Field()
        ask_time = scrapy.Field()
        detail_url = scrapy.Field()
        content = scrapy.Field()
        img = scrapy.Field()
        video = scrapy.Field()
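
As a quick illustration of why we declare fields (a standalone sketch, assumed to run from the project root so that Sun.items is importable): an Item supports dict-style access, but only for fields that were declared, so typos in field names fail loudly.

    # Standalone sketch: scrapy.Item behaves like a dict, but only for declared fields
    from Sun.items import SunItem

    item = SunItem()
    item["title"] = "some complaint title"   # OK: "title" is a declared Field
    print(item["title"])
    print(dict(item))                        # an Item converts cleanly to a plain dict

    try:
        item["tittle"] = "typo"              # not declared -> raises KeyError
    except KeyError as error:
        print("undeclared field rejected:", error)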
    

pipelines.py

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

    import logging

    logger = logging.getLogger(__name__)


    class SunPipeline:
        def process_item(self, item, spider):
            item["content"] = self.process_content(item["content"])
            # logger.warning(item)
            return item

        def process_content(self, content):
            """Clean up the content string of an item."""
            # guard against a missing <pre> on the detail page (extract_first may return None)
            if content is None:
                return ""
            new_content = content.replace("\r\n", "").replace("\xa0", "")
            return new_content
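
A quick check of what process_content does, using a made-up content string (a standalone sketch, assumed to run from the project root):

    # Standalone sketch: the cleaning step applied to a made-up detail-page string
    from Sun.pipelines import SunPipeline

    pipeline = SunPipeline()
    raw = "Dear Sir,\r\n\xa0\xa0the road floods after every rain.\r\n"
    print(repr(pipeline.process_content(raw)))
    # -> 'Dear Sir,the road floods after every rain.'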
    

settings.py

    # -*- coding: utf-8 -*-
    
    # Scrapy settings for Sun project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://docs.scrapy.org/en/latest/topics/settings.html
    #     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'Sun'
    
    SPIDER_MODULES = ['Sun.spiders']
    NEWSPIDER_MODULE = 'Sun.spiders'
    
    LOG_LEVEL = "WARN"
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'Sun.middlewares.SunSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'Sun.middlewares.SunDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://docs.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
       'Sun.pipelines.SunPipeline': 300,
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    
    

cmd

    scrapy startproject Sun

    cd Sun

    scrapy genspider sun sun0769.com

    scrapy crawl sun
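
To keep the scraped items, the crawl can also write them to a feed file with Scrapy's -o option (an optional extra, not part of the original write-up):

    # optional: export the yielded items to a JSON Lines file
    scrapy crawl sun -o sun_items.jl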
    
