Install Python
No need for me to cover this; there are plenty of tutorials online.
Install the scrapy package
pip install scrapy
Create a scrapy project
scrapy startproject aliSpider
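After startproject runs, Scrapy generates a project skeleton. It typically looks roughly like the tree below (the exact set of files can vary slightly by Scrapy version):

```
aliSpider/
    scrapy.cfg            # deploy configuration
    aliSpider/            # the project's Python module
        __init__.py
        items.py          # item definitions (edited below)
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # spider code goes here
            __init__.py
```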
Create the spider file inside the project directory
Open cmd in the project directory and run:
scrapy genspider -t crawl alispi job.alibaba.com
Write the items.py file
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class AlispiderItem(scrapy.Item):
    # define the fields for your item here like:
    detail = scrapy.Field()
    workPosition = scrapy.Field()
    jobclass = scrapy.Field()
Write the alispi.py file
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from aliSpider.items import AlispiderItem


class AlispiSpider(CrawlSpider):
    name = 'alispi'
    allowed_domains = ['job.alibaba.com']
    start_urls = ['https://job.alibaba.com/zhaopin/positionList.html#page/0']

    # follow pagination links whose URL contains a number
    pagelink = LinkExtractor(allow=(r'\d+',))

    rules = (
        Rule(pagelink, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # for each in response.xpath("//tr[@style='display:none']"):
        for each in response.xpath("//tr"):
            item = AlispiderItem()
            # detail link
            item['detail'] = each.xpath("./td[1]/span/a/@href").extract()
            # work location
            item['workPosition'] = each.xpath("./td[3]/span/text()").extract()
            # job category
            item['jobclass'] = each.xpath("./td[2]/span/text()").extract()
            yield item
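To see what the XPath expressions in parse_item actually pull out, here is a minimal standard-library sketch run against a made-up table row; the row markup and values are invented for illustration and the real page's structure may differ:

```python
import xml.etree.ElementTree as ET

# A made-up <tr> mimicking one row of the position list (illustrative only).
row = ET.fromstring(
    '<tr>'
    '<td><span><a href="/job/123.html">Engineer</a></span></td>'
    '<td><span>Technology</span></td>'
    '<td><span>Hangzhou</span></td>'
    '</tr>'
)

detail = row.find('./td[1]/span/a').get('href')   # like ./td[1]/span/a/@href
jobclass = row.find('./td[2]/span').text          # like ./td[2]/span/text()
workPosition = row.find('./td[3]/span').text      # like ./td[3]/span/text()

print(detail, jobclass, workPosition)  # → /job/123.html Technology Hangzhou
```

In the spider itself, response.xpath returns selector lists and .extract() returns every match as a list of strings, which is why each item field ends up as a list.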
Run the spider
scrapy crawl alispi
Export the output to a file, items.json
scrapy crawl alispi -o items.json
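The -o option exports the scraped items as a JSON array, and each field is a list because .extract() returns all matches. A quick sketch of consuming that file with the standard library; the record below is a made-up example of the shape, not real scraped data:

```python
import json

# A made-up record in the shape the exporter produces
# (each field is a list, because .extract() returns every match).
sample = '''[
  {"detail": ["/job/123.html"],
   "workPosition": ["Hangzhou"],
   "jobclass": ["Technology"]}
]'''

items = json.loads(sample)
# For the real file: items = json.load(open("items.json", encoding="utf-8"))

for it in items:
    # take the first match from each list, if present
    detail = it["detail"][0] if it["detail"] else None
    print(detail, it["jobclass"][0], it["workPosition"][0])
```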
A successful run prints the scraped items and the crawl statistics in the log.
Versions used
Python 3.5.5
Reference: https://scrapy-chs.readthedoc…