[Python] Writing a Web Scraper from Scratch: Turning to Scrape Douban Movies

Posted by 跑呀跑 on 2018-08-16

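Douban fits the principle of "honest people speak plainly" fairly well, so Douban is what we will scrape. Without further ado, here is the code.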

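# `app` is assumed to be the author's own helper module from earlier posts in
# this series; it is not something the Scrapy framework provides.
from scrapy import app
import re

# Request headers that make the request look like it comes from a desktop browser
header = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
    "Host": "movie.douban.com",
    "Accept-Language": "zh-CN,zh;q=0.9"
}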

movie_url = "https://movie.douban.com/subject/26985127/?from=showing"

# The first run of digits in the URL is the movie's subject id
m_id = re.search("[0-9]+", movie_url).group()

# Fetch the page and get a soup object
soup = app.get_soup(url=movie_url, headers=header, charset="utf-8")
content = soup.find(id="content")

# Movie title and release year
m_name = content.find("h1").find("span").string
m_year = content.find(class_="year").string

# Director
info = content.find(id="info")
m_directer = info.find(attrs={"rel": "v:directedBy"}).string
# Release date
m_date = info.find(attrs={"property": "v:initialReleaseDate"}).string

# Genres (at most the first two)
types = info.find_all(attrs={"property": "v:genre"}, limit=2)
m_types = []
for type_ in types:
    m_types.append(type_.string)


# Lead actors, first five only
actors = info.find(class_="actor").find_all(attrs={"rel": "v:starring"}, limit=5)
m_actors = []
for actor in actors:
    m_actors.append(actor.string)

# Runtime
m_time = info.find(attrs={"property": "v:runtime"}).string
# m_adaptor = info.select()

print("id", m_id, "name", m_name, "year", m_year, "director", m_directer, "cast", m_actors)
print("release date", m_date, "genres", m_types, "runtime", m_time)

Output:

id 26985127 name 一出好戲 year (2018) director 黃渤 cast ['黃渤', '舒淇', '王寶強', '張藝興', '於和偉']
release date 2018-08-10(中國大陸) genres ['劇情', '喜劇'] runtime 134分鐘
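
The values printed above are raw strings taken straight from the page: the year still carries its parentheses, the release date has the region appended, and the runtime includes its unit. If they are going to be stored or compared, a small normalization step may help. The following is only a sketch using the standard library. The function names are hypothetical and not part of the original script, and the patterns assume the formats shown in the output above.

```python
import re
from datetime import date


def parse_subject_id(url):
    """Return the digits that follow /subject/ in a Douban movie URL, as a string."""
    match = re.search(r"/subject/(\d+)", url)
    return match.group(1) if match else None


def parse_year(text):
    """'(2018)' -> 2018"""
    match = re.search(r"\d{4}", text)
    return int(match.group()) if match else None


def parse_release(text):
    """'2018-08-10(中國大陸)' -> (date(2018, 8, 10), '中國大陸')"""
    match = re.match(r"(\d{4})-(\d{2})-(\d{2})(?:\((.+)\))?", text)
    if not match:
        return None, None
    year, month, day, region = match.groups()
    # region is None when the string has no parenthesized suffix
    return date(int(year), int(month), int(day)), region


def parse_runtime_minutes(text):
    """'134分鐘' -> 134"""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None
```

Anchoring the id pattern on `/subject/` is slightly stricter than `[0-9]+`, which simply takes the first run of digits anywhere in the URL. Each function returns `None` (or a pair of `None`s) when the input does not match, so a page with a different layout does not crash the script at this step.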

Simple and blunt.
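
The script depends on `app.get_soup`, which is not shown in this post. As a rough idea of what such a helper might look like, here is a minimal sketch. It assumes the helper simply downloads the page with the given headers and hands the decoded HTML to BeautifulSoup (the third-party `beautifulsoup4` package). `build_request`, the `timeout` parameter, and the body of `get_soup` are guesses for illustration, not the author's actual code.

```python
import urllib.request


def build_request(url, headers):
    """Wrap the URL and the request headers in a urllib Request object."""
    return urllib.request.Request(url, headers=headers)


def get_soup(url, headers, charset="utf-8", timeout=10):
    """Download a page and parse it (hypothetical stand-in for app.get_soup)."""
    # BeautifulSoup is third-party (pip install beautifulsoup4). Importing it
    # here keeps build_request usable even when the package is not installed.
    from bs4 import BeautifulSoup

    with urllib.request.urlopen(build_request(url, headers), timeout=timeout) as resp:
        # Decode with the caller's charset; undecodable bytes are replaced
        html = resp.read().decode(charset, errors="replace")
    return BeautifulSoup(html, "html.parser")
```

With a helper of this shape, the call in the script, `get_soup(url=movie_url, headers=header, charset="utf-8")`, would return a soup object that supports the `find` and `find_all` calls used above.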

 

