What this article covers
- Scrape the Douban Movie Top 250 pages. Fields: ranking, title, director, one-line quote (sometimes empty), rating, number of raters, release year, country, genre.
- Store the scraped data.
Introduction to Scrapy
Creating the project
scrapy startproject dbmovie
Creating the spider
cd dbmovie
scrapy genspider dbmovie_spider movie.douban.com/top250
Note: the spider name must not be the same as the project name.
Configuring against anti-scraping measures
- Open settings.py and set ROBOTSTXT_OBEY to False:
ROBOTSTXT_OBEY = False
- Set the default request headers, including a browser User-Agent:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'Accept-Encoding': 'gzip, deflate, br',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Host': 'movie.douban.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
}
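Beyond spoofing headers, slowing the crawl down also helps avoid getting banned. A minimal sketch of further settings.py options (the values shown are illustrative choices, not from the original article):

```python
# settings.py — optional throttling (illustrative values)
DOWNLOAD_DELAY = 2                # wait 2 seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True   # jitter the delay between 0.5x and 1.5x
AUTOTHROTTLE_ENABLED = True       # let Scrapy adapt the delay to server latency
```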
Running the spider
scrapy crawl dbmovie_spider
Defining the item
Based on the analysis above, we need to capture nine fields; define them as an item in items.py:
import scrapy

class DoubanItem(scrapy.Item):
    # ranking
    ranking = scrapy.Field()
    # title
    title = scrapy.Field()
    # director
    director = scrapy.Field()
    # one-line quote (sometimes empty)
    movie_desc = scrapy.Field()
    # rating
    rating_num = scrapy.Field()
    # number of raters
    people_count = scrapy.Field()
    # release year
    online_date = scrapy.Field()
    # country
    country = scrapy.Field()
    # genre
    category = scrapy.Field()
Extracting the fields
This step needs some XPath. As a shortcut, you can copy XPath expressions straight from Chrome's developer tools: right-click the element, then Copy → Copy XPath.
def parse(self, response):
    movies = response.xpath('//div[@class="item"]')
    for movie in movies:
        # create a fresh item per movie, so yielded items don't share state
        item = DoubanItem()
        # ranking
        item['ranking'] = movie.xpath('div[@class="pic"]/em/text()').extract()[0]
        # title (the span may list several titles; keep the first)
        item['title'] = movie.xpath('div[@class="info"]/div[1]/a/span/text()').extract()[0]
        # director
        info_director = movie.xpath('div[2]/div[2]/p[1]/text()[1]').extract()[0] \
            .replace("\n", "").replace(" ", "").split('\xa0')[0]
        item['director'] = info_director
        # release year
        online_date = movie.xpath('div[2]/div[2]/p[1]/text()[2]').extract()[0] \
            .replace("\n", "").replace('\xa0', '').split("/")[0].replace(" ", "")
        # country
        country = movie.xpath('div[2]/div[2]/p[1]/text()[2]').extract()[0] \
            .replace("\n", "").split("/")[1].replace('\xa0', '')
        # genre
        category = movie.xpath('div[2]/div[2]/p[1]/text()[2]').extract()[0] \
            .replace("\n", "").split("/")[2].replace('\xa0', '').replace(" ", "")
        item['online_date'] = online_date
        item['country'] = country
        item['category'] = category
        # one-line quote; some movies have none, and indexing an empty
        # list would raise IndexError and lose the rest of the item
        movie_desc = movie.xpath('div[@class="info"]/div[@class="bd"]/p[@class="quote"]/span/text()').extract()
        item['movie_desc'] = movie_desc[0] if movie_desc else ' '
        item['rating_num'] = movie.xpath('div[@class="info"]/div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
        item['people_count'] = movie.xpath('div[@class="info"]/div[@class="bd"]/div[@class="star"]/span[4]/text()').extract()[0]
        yield item
    # follow the next page
    next_url = response.xpath('//span[@class="next"]/a/@href').extract()
    if next_url:
        next_url = 'https://movie.douban.com/top250' + next_url[0]
        yield scrapy.Request(next_url, callback=self.parse, dont_filter=True)
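The second text node on each movie card is a single string of the form "year / country / genres", where the separators are slashes padded with non-breaking spaces ('\xa0'). That is why the parse code strips '\xa0' and splits on '/'. A standalone sketch of that cleanup (the sample string is illustrative, not copied from the site):

```python
# Illustrative sample of the raw text node scraped from a movie card
raw = "\n                            1994\xa0/\xa0美國\xa0/\xa0犯罪 劇情\n                        "

# strip newlines, split into the three fields, then drop NBSPs and spaces
parts = raw.replace("\n", "").split("/")
online_date = parts[0].replace("\xa0", "").replace(" ", "")
country = parts[1].replace("\xa0", "")
category = parts[2].replace("\xa0", "").replace(" ", "")

print(online_date, country, category)  # → 1994 美國 犯罪劇情
```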
Storing the data in MySQL
Watch out for MySQL error 1064: it occurs when a table column name collides with a MySQL reserved word.
Reference: Scrapy入門教程之寫入資料庫 (a Scrapy beginner tutorial on writing to a database)
import pymysql

def dbHandle():
    conn = pymysql.connect(
        host='localhost',
        user='root',
        passwd='pwd',
        db='dbmovie',
        charset='utf8',
        use_unicode=False
    )
    return conn

class DoubanPipeline(object):
    def process_item(self, item, spider):
        dbObject = dbHandle()
        cursor = dbObject.cursor()
        sql = ("insert into db_info(ranking,title,director,movie_desc,rating_num,"
               "people_count,online_date,country,category) "
               "values(%s,%s,%s,%s,%s,%s,%s,%s,%s)")
        try:
            cursor.execute(sql, (item['ranking'], item['title'], item['director'],
                                 item['movie_desc'], item['rating_num'],
                                 item['people_count'], item['online_date'],
                                 item['country'], item['category']))
            dbObject.commit()
        except Exception as e:
            print(e)
            dbObject.rollback()
        return item
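For the pipeline to run at all, it must be registered in settings.py. A minimal sketch, assuming the project module is named dbmovie as above (the priority 300 is an arbitrary conventional choice; lower numbers run first, in the range 0–1000):

```python
# settings.py — register the pipeline so Scrapy actually calls process_item
ITEM_PIPELINES = {
    'dbmovie.pipelines.DoubanPipeline': 300,
}
```

Also note that the db_info table must already exist in the dbmovie database before the first crawl; the pipeline only inserts rows, it does not create the table.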