A First Look at Python Crawlers: My First Python Crawler Project

Posted by smh2208 on 2018-05-18

I spent the last two days going over basic Python syntax and doing some simple exercises. Today I built my first Python crawler project using the Scrapy framework. Below I record the steps, starting from installing Python.

I. Installing Python and PyCharm

1. Download Python 3.6.5 from the official site: https://www.python.org/downloads/. I chose not to install it in the default location; instead I installed it under D:\py\python3.6.5.

2. Download PyCharm from the official site: http://www.jetbrains.com/pycharm/. Choose the Professional edition and install it under D:\ProgramFiles. Make sure the install path contains no spaces, or the installation will fail.

3. Set the Python environment variable: add the Python install directory D:\py\python3.6.5 (the folder containing python.exe, not the executable itself) to the PATH variable.
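To verify the PATH change took effect, open a new command prompt and run:

python --version

It should print Python 3.6.5. If the command is not found, double-check the PATH entry and reopen the terminal.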


II. Installing the Scrapy Library

Installing Scrapy took real effort; there was a pitfall at every other step, a bit like configuring Spring XML, haha. But I digress. Let's look at the concrete steps.

Many tutorials online tell you to install everything from the command line, which is actually rather tedious. Since we have a tool, why not use it? Open the freshly installed PyCharm and create a new project. Note that the new-project dialog has a Project Interpreter option: expand it, choose Existing interpreter, and point it at the python.exe we installed above. Otherwise PyCharm will create a new Python environment, which would be redundant since we already installed one.

Then click File - Default Settings - Project Interpreter in turn, and under Project Interpreter select the python.exe installed above.


Then open File - Settings - Project Interpreter; the Project Interpreter shown here should have updated as well. Click the green plus sign on the right, search for each package, and click Install Package to install the following libraries in order (a pip command-line equivalent is shown after the list):


1. lxml

2. zope.interface

3. twisted: during installation you may see the error "error: Microsoft Visual C++ 10.0 is required (Unable to find vcvarsall.bat)". If so, switch to installing it manually.

Download the matching Twisted wheel from https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted. My Python version is 3.6.5, so I chose

Twisted-18.4.0-cp36-cp36m-win32.whl (cp36 matches Python 3.6). I put the file in D:\py\python3.6.5\Scripts, opened a command prompt in that directory, and ran pip install Twisted-18.4.0-cp36-cp36m-win32.whl. A moment later it installed successfully.
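To confirm the wheel installed correctly, you can ask Python to import Twisted and print its version (this one-liner is just a sanity check):

python -c "import twisted; print(twisted.version)"

It should print something like [twisted, version 18.4.0].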
4. Back in PyCharm, install pyOpenSSL (the pip package name for OpenSSL support) the same way.

5. Finally, install scrapy itself; it should now install successfully.
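For reference, the whole set of installs can also be done with pip on the command line (these are the pip package names; Twisted still needs the manual wheel above if the compiler error appears):

pip install lxml zope.interface pyOpenSSL scrapy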
III. Creating a Scrapy Project

1. First add Scrapy to the PATH environment variable the same way as above (add D:\py\python3.6.5\Scripts, where scrapy.exe lives), then run scrapy startproject scrapy_exam on the command line. You can see that a Scrapy project is generated.
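The generated layout looks roughly like this (exact files vary slightly across Scrapy versions):

scrapy_exam/
    scrapy.cfg            # deploy configuration
    scrapy_exam/
        __init__.py
        items.py          # item definitions (edited below)
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # spider code goes here
            __init__.py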
Open the project in PyCharm and edit the items.py file:
import scrapy
from scrapy import Item, Field


class ScrapyExamItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class TestItem(Item):
    address = Field()
    price = Field()
    lease_type = Field()
    bed_amount = Field()
    suggestion = Field()
    comment_star = Field()
    comment_amount = Field()

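Scrapy Items behave like dictionaries with a fixed set of keys. A minimal sketch of how TestItem is used (the values here are made up for illustration):

from scrapy_exam.items import TestItem

item = TestItem()
item['address'] = 'Chaoyang'  # hypothetical value
item['price'] = '398'         # hypothetical value
print(item['address'], item['price'])

Assigning to a field that was not declared on the class raises a KeyError, which catches typos early.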
2. Create a Python file named spider_test.py under the spiders directory:

from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy_exam.items import TestItem
from scrapy import cmdline


class XiaozhuSpider(CrawlSpider):
    # No crawl rules are defined, so we override parse() directly
    name = 'xiaozhu'
    start_urls = ['http://bj.xiaozhu.com/search-duanzufang-p1-0/']

    def parse(self, response):
        selector = Selector(response)
        # Each <li> under the listing <ul> is one short-term rental
        commodities = selector.xpath('//ul[@class="pic_list clearfix"]/li')

        for commodity in commodities:
            # Create a fresh item per listing instead of reusing one instance
            item = TestItem()
            item['address'] = commodity.xpath('div[2]/div/a/span/text()').extract()[0]
            item['price'] = commodity.xpath('div[2]/span[1]/i/text()').extract()[0]

            # The <em> text packs "lease type / bed count / suggested guests"
            detail = commodity.xpath('div[2]/div/em/text()').extract()[0]
            item['lease_type'] = detail.split('/')[0].strip()
            item['bed_amount'] = detail.split('/')[1].strip()
            item['suggestion'] = detail.split('/')[2].strip()

            # The <span> inside <em> holds "star rating / review count",
            # or just the review count when there is no rating yet
            infos = commodity.xpath('div[2]/div/em/span/text()').extract()[0].strip()
            item['comment_star'] = infos.split('/')[0] if '/' in infos else '無'
            item['comment_amount'] = infos.split('/')[1] if '/' in infos else infos

            yield item

        # Queue pages 1-13; Scrapy's built-in duplicate filter drops
        # requests for URLs that have already been crawled
        urls = ['http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(i) for i in range(1, 14)]
        for url in urls:
            yield Request(url, callback=self.parse)


# Guard the run call: at module level it would re-execute every time
# Scrapy imports this spider module
if __name__ == '__main__':
    cmdline.execute("scrapy crawl xiaozhu".split())
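As an aside, the cmdline.execute call is just a convenience for running inside PyCharm. From the project root you can also launch the spider directly, and the -o flag exports the scraped items to a file whose format Scrapy infers from the extension:

scrapy crawl xiaozhu
scrapy crawl xiaozhu -o result.json

(result.json is only an example filename.)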

3. Right-click spider_test.py and choose Run 'spider_test.py'; you can see the crawler's output in the console.


That's all for today; tomorrow I'll walk through the program in more detail.
