Scraping Internship Listings with Python (First Steps with Scrapy)

Published 2016-10-19

Original article: http://python.jobbole.com/86651/

Python

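1. Goal

For a course project due in the next couple of days, I need to crawl internship postings from the internship boards of the Shuimu community (newsmth.net) and Peking University's Weiming community, and store them in a MongoDB database.
I had been wanting to learn the Scrapy framework anyway, so I happily decided to build the crawler with it.

2. Introduction

Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It uses the Twisted asynchronous networking library to handle network communication. Overall architecture: [figure: Scrapy architecture diagram]

The most important resource for learning Scrapy is the official documentation, which is also the main reference for this post.
I won't cover installing Scrapy here: once its dependencies are in place, a single pip command does it.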

pip install scrapy

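3. Getting started

3.1 First, create a new Scrapy project.

Change into your target directory and run the following command to create a project named intern.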

$ scrapy startproject intern

The directory structure looks like this:

.
├── scrapy.cfg
└── intern
  ├── __init__.py
  ├── items.py
  ├── pipelines.py
  ├── settings.py
  └── spiders
    └── __init__.py

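Commit this directory structure to memory.

scrapy.cfg: the global configuration file
intern/: the project's Python module
intern/items.py: the project's items file, which defines the structure used to store scraped data
intern/pipelines.py: the project's pipelines file, which cleans, filters, and stores the scraped data
intern/settings.py: the project's settings file
intern/spiders: the directory that holds the spiders

3.2 Write items.py.

Define the item's fields as follows: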

import scrapy
class InternItem(scrapy.Item):
  title = scrapy.Field()
  href = scrapy.Field()
  author = scrapy.Field()
  time = scrapy.Field()
  content = scrapy.Field()
  is_dev = scrapy.Field()
  is_alg = scrapy.Field()
  is_fin = scrapy.Field()
  base_url_index = scrapy.Field()

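Defining the fields is simple: assign scrapy.Field() to each one.
Usage: an item works like a Python dict, so to read an item's title you write item['title'].

3.3 Write the spider.

Now we finally get to writing the spider itself. Take the spider for the Shuimu community as the example. In the spiders directory, create smSpider.py.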

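# Python 2 code
import scrapy
from bs4 import BeautifulSoup
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher  # assumed import path; not shown in the original code
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from intern.items import InternItem

class SMSpider(scrapy.spiders.CrawlSpider):
  '''
  To create a Spider, subclass scrapy.spider.BaseSpider and define three main, mandatory attributes:
  name: the spider's identifier. It must be unique; different spiders need different names.
  start_urls: the list of URLs the spider starts from. The first pages downloaded come from these URLs; further URLs are generated from them.
  parse(): the spider's callback, called with the Response object returned for each URL as its only argument.
  It is responsible for parsing the response, extracting the scraped data (as items), and following more URLs.
  '''
  name = "sm"
  base_url = 'http://www.newsmth.net/nForum/board/Intern'
  start_urls = [base_url]
  start_urls.extend([base_url+'?p='+str(i) for i in range(2,4)])
  platform = getPlatform()  # helper that returns 'linux' or 'win'; its definition is not shown here

  def __init__(self):
    scrapy.spiders.Spider.__init__(self)
    if self.platform == 'linux':
      self.driver = webdriver.PhantomJS()
    elif self.platform == 'win':
      self.driver = webdriver.PhantomJS(executable_path='F:/runtime/python/phantomjs-2.1.1-windows/bin/phantomjs.exe')
    self.driver.set_page_load_timeout(10)
    # Quit PhantomJS when the spider closes
    dispatcher.connect(self.spider_closed, signals.spider_closed)

  def spider_closed(self, spider):
    self.driver.quit()

  def parse(self, response):
    ...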

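Let me explain this code step by step, from the simple parts to the deeper ones.
First, SMSpider inherits from CrawlSpider, which in turn inherits from BaseSpider. BaseSpider is usually enough; CrawlSpider adds crawling Rules, which I don't actually use here. Three attributes must be defined:
name: the spider's name (unique).
start_urls: the list of URLs the spider starts crawling from.
parse(): the spider's crawl callback. It is called with a response object, the response to the request for one of the links.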
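The start_urls list in SMSpider is the board URL plus its ?p=2 and ?p=3 pages. A minimal sketch of the same construction as a standalone function, written for Python 3; build_start_urls is a hypothetical helper name, not part of the original project:

```python
def build_start_urls(base_url, last_page):
    """Return the board URL followed by its ?p=2 .. ?p=last_page pages."""
    # Page 1 is the bare board URL; later pages add a ?p=N query string.
    urls = [base_url]
    urls.extend('{0}?p={1}'.format(base_url, page)
                for page in range(2, last_page + 1))
    return urls


# Same three URLs that SMSpider starts from.
start_urls = build_start_urls('http://www.newsmth.net/nForum/board/Intern', 3)
```

Building the list in a function also sidesteps a Python 3 scoping rule: a list comprehension written directly in a class body cannot see other class attributes such as base_url.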
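While crawling the Shuimu community, I found that its internship listings are loaded dynamically.

In other words, the page source does not contain the internship listings we want. This is where combining Selenium with PhantomJS comes in. Selenium is widely used for automated testing; it can imitate what a user does in a browser, such as clicking buttons. PhantomJS is a browser without a UI. Together, Selenium and PhantomJS make it easy to scrape dynamically loaded pages.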

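Back to the SMSpider code: we detect the current operating system and then load PhantomJS through Selenium's webdriver. On Linux no path is needed; on Windows you must pass the path to the executable. At the end of __init__(), we also connect a signal dispatcher so that PhantomJS is shut down after the spider exits.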
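The spider calls getPlatform() without showing its definition. A minimal sketch of what such a helper could look like, assuming it only needs to tell Linux from Windows; the name get_platform, the optional argument, and the 'other' fallback label are all made up for this example:

```python
import sys


def get_platform(platform_name=None):
    """Map a sys.platform string to the short labels 'linux' or 'win'."""
    # Default to the running interpreter's own platform string.
    name = sys.platform if platform_name is None else platform_name
    if name.startswith('linux'):
        return 'linux'
    if name.startswith('win'):  # 'win32' on Windows; 'darwin' and 'cygwin' do not match
        return 'win'
    return 'other'
```

Taking the platform string as an optional argument keeps the function easy to check on any machine, for example get_platform('win32').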

self.driver.set_page_load_timeout(10)

This line keeps PhantomJS from hanging on the request for some link: it limits each page load to at most 10 seconds.
The parse method itself:

def parse(self, response):
  self.driver.get(response.url)
  print response.url
  # Wait until a <table> tag is present
  try:
    element = WebDriverWait(self.driver, 30).until(
      EC.presence_of_all_elements_located((By.TAG_NAME, 'table')))
    print 'element:\n', element
  except Exception as e:
    print Exception, ":", e
    print "wait failed"
  page_source = self.driver.page_source
  bs_obj = BeautifulSoup(page_source, "lxml")
  print bs_obj
  table = bs_obj.find('table', class_='board-list tiz')
  print table
  if table is None:
    return  # the listing table never appeared, so there is nothing to parse
  print "find message ====================================\n"
  intern_messages = table.find_all('tr', class_=False)
  for message in intern_messages:
    title, href, time, author = '', '', '', ''
    td_9 = message.find('td', class_='title_9')
    if td_9:
      title = td_9.a.get_text().encode('utf-8', 'ignore')
      href = td_9.a['href']
    td_10 = message.find('td', class_='title_10')
    if td_10:
      time = td_10.get_text().encode('utf-8', 'ignore')
    td_12 = message.find('td', class_='title_12')
    if td_12:
      author = td_12.a.get_text().encode('utf-8', 'ignore')
    item = InternItem()
    print 'title:', title
    print 'href:', href
    print 'time:', time
    print 'author:', author
    item['title'] = title
    item['href'] = href
    item['time'] = time
    item['author'] = author
    item['base_url_index'] = 0
    # Nested crawl: fetch the full content of each posting
    root_url = 'http://www.newsmth.net'
    if href != '':
      content = self.parse_content(root_url+href)
      item['content'] = content
    yield item

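This code first waits for the dynamically loaded target tag to appear, then scrapes the list of internship postings, and then, nested inside that loop, fetches the full content of each posting. I use bs4 to parse the HTML here. You could also use plain XPath or Scrapy's selectors. I won't go into detail on those; it helps to know several approaches and be fluent in one. Scraped values are stored in the item, as in item['title'] = title. Note that the method ends with yield, not return: parse works as a generator, producing and handling items one at a time.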
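To make the yield point concrete, here is a small sketch outside Scrapy, written for Python 3. It assumes the rows have already been reduced to (title, href) pairs; iter_items and the sample href are hypothetical. It also swaps the root_url+href concatenation for urljoin from the standard library, and it skips rows without a link, which is a choice of this sketch rather than what parse does:

```python
from urllib.parse import urljoin

ROOT_URL = 'http://www.newsmth.net'


def iter_items(rows, root_url=ROOT_URL):
    """Yield one item dict per (title, href) row, skipping rows without a link."""
    for title, href in rows:
        if not href:
            continue
        # urljoin copes with both relative and already-absolute hrefs.
        yield {'title': title, 'url': urljoin(root_url, href)}


# Hypothetical sample rows, not real data from the site.
rows = [('Backend intern', '/nForum/article/Intern/1'), ('header row', '')]
items = iter_items(rows)  # nothing has run yet: this is a generator object
first = next(items)       # runs the loop body up to the first yield
```

Because the function yields, the caller receives each item as soon as it is ready instead of waiting for the whole page to be processed.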
The code that fetches the full content of a posting:

def parse_content(self, url):
  self.driver.get(url)
  # Wait until a <table> tag is present
  try:
    element = WebDriverWait(self.driver, 30).until(
      EC.presence_of_all_elements_located((By.TAG_NAME, 'table')))
    print 'element:\n', element
  except Exception as e:
    print Exception, ":", e
    print "wait failed"
  page_source = self.driver.page_source
  bs_obj = BeautifulSoup(page_source, "lxml")
  # The posting body is the first <p> inside the td with class a-content
  return bs_obj.find('td', class_='a-content').p.get_text().encode('utf-8', 'ignore')

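3.4 Write pipelines.py.

Next, we want to store the scraped data in MongoDB. That job can be handed to a pipeline. Pipelines are one of the reasons I like Scrapy: you can throw the scraped data, in the form of items, into a pipeline for filtering, de-duplication, storage, or any other custom processing. The order of the pipelines is set in settings.py, which makes them even more flexible.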
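As an illustration of the de-duplication use case, here is a minimal sketch of a pipeline that drops items whose href was already seen. It is not part of the original project: DuplicatesPipeline is a made-up name, and DropItem is redefined locally as a stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy installed:

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem (assumption for this sketch)."""


class DuplicatesPipeline(object):
    """Drop any item whose href has already been seen during this crawl."""

    def open_spider(self, spider):
        # Called once when the spider starts: begin with an empty seen-set.
        self.seen_hrefs = set()

    def process_item(self, item, spider):
        href = item['href']
        if href in self.seen_hrefs:
            raise DropItem('Duplicate item: {0}'.format(href))
        self.seen_hrefs.add(href)
        return item  # handed on to the next pipeline
```

In a real project the stand-in class would be replaced by the import from scrapy.exceptions, and the pipeline would be registered in settings.py like any other.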
Here is MongoDBPipeline:

import pymongo
from scrapy.conf import settings
from scrapy.exceptions import DropItem

class MongoDBPipeline(object):
  def open_spider(self, spider):
    # Connect to MongoDB when the spider opens
    self.client = pymongo.MongoClient(
      settings['MONGODB_SERVER'],
      settings['MONGODB_PORT'])
    self.db = self.client[settings['MONGODB_DB']]
    self.collection = self.db[settings['MONGODB_COLLECTION']]

  def close_spider(self, spider):
    # Close the connection when the spider closes
    self.client.close()

  def process_item(self, item, spider):
    # Iterating over an item yields its field names
    for data in item:
      if not data:
        raise DropItem("Missing {0}!".format(data))
    if item['title'] == '':
      raise DropItem("title is '' ")
    # Reached only when nothing above raised DropItem
    self.collection.insert_one(dict(item))
    return item

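Some explanation.
First we create the class MongoDBPipeline; it does not need to inherit from any predefined pipeline class. It must, however, have a process_item method that takes an item and a spider and returns the processed item. open_spider and close_spider are callbacks invoked when the spider opens and closes. Since we use MongoDB, we connect a Mongo client when the spider opens and close it again when the spider closes. The database details are all kept in settings.py, as follows: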

MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "intern"
MONGODB_COLLECTION = "items"

Parameters defined in settings.py can be read as follows:

from scrapy.conf import settings
settings['xxxxxx']

where 'xxxxxx' is the name of the setting.
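As far as I know, scrapy.conf has been deprecated in later Scrapy releases, where the documented route is the from_crawler class method together with crawler.settings. A hedged sketch of that approach: from_crawler and crawler.settings are Scrapy's hooks, while the class name MongoSettingsPipeline, its attribute names, and the default values are made up for this example. It only stores the settings; connecting in open_spider would stay as in MongoDBPipeline.

```python
class MongoSettingsPipeline(object):
    """Pipeline skeleton that receives its MongoDB settings via from_crawler."""

    def __init__(self, server, port, db_name, collection_name):
        self.server = server
        self.port = port
        self.db_name = db_name
        self.collection_name = collection_name

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook, when it is defined, to build the pipeline;
        # crawler.settings exposes the values from settings.py.
        s = crawler.settings
        return cls(
            server=s.get('MONGODB_SERVER', 'localhost'),
            port=s.get('MONGODB_PORT', 27017),
            db_name=s.get('MONGODB_DB'),
            collection_name=s.get('MONGODB_COLLECTION'),
        )
```

Passing the settings in through the constructor also makes the pipeline easy to exercise with a fake crawler object, without starting a crawl.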
After writing MongoDBPipeline, you also need to register it in settings.py, as follows:

ITEM_PIPELINES = {    
    'intern.pipelines.TagPipeline': 100, 
    'intern.pipelines.MongoDBPipeline':300                  
}

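The smaller the number, the earlier the pipeline runs. The values are integers up to 1000. This makes it very easy to set the order of the pipelines and to switch an individual pipeline on or off.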
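The ordering rule can be checked with a few lines of plain Python. This is only an illustration of "lowest number first", not Scrapy's internal code, and execution_order is a hypothetical helper name:

```python
ITEM_PIPELINES = {
    'intern.pipelines.MongoDBPipeline': 300,
    'intern.pipelines.TagPipeline': 100,
}


def execution_order(pipelines):
    """Return the pipeline paths sorted by their priority value, lowest first."""
    return [path for path, priority in
            sorted(pipelines.items(), key=lambda pair: pair[1])]


order = execution_order(ITEM_PIPELINES)
```

Here TagPipeline (100) comes before MongoDBPipeline (300), whatever order the entries are written in.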

4. Run

From the project directory, run the following command:

scrapy crawl sm

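Our SMSpider now happily starts crawling.

5. Next steps

There is still a lot to learn about the Scrapy framework, such as writing extensions and middlewares and using the Crawler API.
On crawling in general, further topics to study include:

Using proxies
Simulating login

Over the next while I'll be building a crawler for Sina Weibo, and I'll share whatever I learn then.
Source code for this post: github
If you like it, give it a star!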
