My first Scrapy spider

Posted by prepared on 2019-02-16

Install Python

I won't cover this step; there are plenty of tutorials online.

Install the scrapy package

pip install scrapy

Create a Scrapy project

scrapy startproject aliSpider

Go into the project directory and create the spider file

In cmd, change into the project directory and run:

scrapy genspider -t crawl alispi job.alibaba.com

Write items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class AlispiderItem(scrapy.Item):
    # define the fields for your item here like:
    detail = scrapy.Field()
    workPosition = scrapy.Field()
    jobclass = scrapy.Field()
    

Write alispi.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from aliSpider.items import AlispiderItem


class AlispiSpider(CrawlSpider):
    name = 'alispi'
    allowed_domains = ['job.alibaba.com']
    start_urls = ['https://job.alibaba.com/zhaopin/positionList.html#page/0']
    # Follow links whose URL contains one or more digits
    pagelink = LinkExtractor(allow=r'\d+')
    rules = (
        Rule(pagelink, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # for each in response.xpath("//tr[@style='display:none']"):
        for each in response.xpath("//tr"):
            item = AlispiderItem()
            # Detail link
            item['detail'] = each.xpath("./td[1]/span/a/@href").extract()
            # Work location (going by the field name)
            item['workPosition'] = each.xpath("./td[3]/span/text()").extract()
            # Job category
            item['jobclass'] = each.xpath("./td[2]/span/text()").extract()
            yield item
            yield item
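
The allow argument of LinkExtractor is treated as a regular expression, so the backslash in the pattern matters: \d+ means "one or more digits", while d+ without the backslash means "one or more letter d characters". The sketch below only illustrates that difference with the standard re module. The second and third URLs are made up for the example, and re.search is assumed here to approximate how the extractor tests a URL.

```python
import re

# Only the first URL comes from the spider above; the others are hypothetical
urls = [
    "https://job.alibaba.com/zhaopin/positionList.html#page/0",
    "https://job.alibaba.com/zhaopin/position_detail.htm?positionId=12345",
    "https://job.alibaba.com/zhaopin/download.htm",
]

digits = re.compile(r"\d+")    # one or more digits
literal_d = re.compile("d+")   # one or more letter "d" characters


def matching(pattern, candidates):
    """Return the candidates in which the pattern occurs anywhere."""
    return [url for url in candidates if pattern.search(url)]


digit_hits = matching(digits, urls)
literal_hits = matching(literal_d, urls)

print(digit_hits)    # URLs containing a digit
print(literal_hits)  # URLs containing the letter "d"
```

With the backslash missing, the list page is skipped because its URL has no letter d, and the download page is followed even though it contains no number.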

Run

scrapy crawl alispi

Write the output to items.json

scrapy crawl alispi -o items.json
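
Because parse_item stores the result of .extract() directly, every field in items.json is a list, and any tr row without matching cells produces an item of empty lists. Below is a minimal sketch of tidying the export afterwards. The load_jobs helper and the sample data are hypothetical, written only to match the shape of the items the spider yields.

```python
import json


def load_jobs(json_text):
    """Parse the JSON export and flatten each field to a single string."""
    jobs = []
    for row in json.loads(json_text):
        # .extract() returns a list of matches; keep the first one, if any
        flat = {
            key: (values[0].strip() if values else "")
            for key, values in row.items()
        }
        # Skip rows where nothing matched, e.g. header or spacer rows
        if any(flat.values()):
            jobs.append(flat)
    return jobs


# Made-up sample in the shape the spider yields
sample = """[
  {"detail": [], "workPosition": [], "jobclass": []},
  {"detail": ["/zhaopin/position_detail.htm?positionId=1"],
   "workPosition": [" Hangzhou "],
   "jobclass": ["Engineering"]}
]"""

jobs = load_jobs(sample)
print(jobs)
```

To use it on a real export, read the file first, for example with open("items.json", encoding="utf-8") and pass the text to load_jobs.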

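On a successful run, Scrapy prints the scraped items and a closing stats summary to the console (the original screenshot is not included here).

Version notes

Python 3.5.5

Reference: https://scrapy-chs.readthedoc…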
