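Python Crawler in Practice (1): Analyzing Reviews of the Latest Movies on Douban

Published on 2017-08-11

Original URL: http://python.jobbole.com/88325/

Python crawler

Introduction

I picked up Python only recently, so I built a small project for practice. A few days ago I watched "Wolf Warrior 2" (《戰狼2》) and noticed that it ranked first among the newly released movies, as shown in the figure below. I decided to analyze its reviews on Douban.

Goals at a glance

The project does three things:

Scrape the web page data
Clean the data
Display the result as a word cloud

The Python version used is 3.5.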

1. Scraping the Web Page Data

The first step is to request the page. In Python this is done with the urllib library. The code is as follows:

from urllib import request
resp = request.urlopen('https://movie.douban.com/nowplaying/hangzhou/')
html_data = resp.read().decode('utf-8')

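Here https://movie.douban.com/nowp… is Douban's page of movies now playing; you can open it in a browser to take a look.

html_data is a string variable that holds the page's HTML source.
Run print(html_data) to inspect it, as shown in the figure below: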
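
Some sites turn away requests that carry urllib's default User-Agent, and a request without a timeout can hang indefinitely. The following is a minimal sketch of the same fetch with both set explicitly. The helper names build_request and fetch_html, the User-Agent string and the 10-second timeout are assumptions for illustration and are not part of the original project.

```python
from urllib import request

# Hypothetical values for illustration; neither appears in the original post.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; review-demo/0.1)"}
TIMEOUT_SECONDS = 10


def build_request(url, headers=None):
    """Return a urllib Request that carries explicit headers."""
    merged = dict(DEFAULT_HEADERS)
    merged.update(headers or {})
    return request.Request(url, headers=merged)


def fetch_html(url, encoding="utf-8"):
    """Download a page and decode it; urllib.error.URLError is raised on failure."""
    with request.urlopen(build_request(url), timeout=TIMEOUT_SECONDS) as resp:
        return resp.read().decode(encoding)
```

With these helpers, the fetch above would read html_data = fetch_html('https://movie.douban.com/nowplaying/hangzhou/'). Whether Douban accepts this particular User-Agent is not something the sketch can guarantee.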

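The second step is to parse the HTML and extract the data we need.
In Python, the BeautifulSoup library handles HTML parsing.
(Note: if the library is not installed, run pip install beautifulsoup4; the package is then imported as bs4.)
BeautifulSoup is called like this: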

BeautifulSoup(html,"html.parser")

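The first argument is the HTML to extract data from, and the second names the parser. After that, find_all() reads the contents of the HTML tags.

But the HTML contains a great many tags, so which ones should we read? The simplest approach is to open the HTML source of the page being scraped, find the tag that holds the data we need, and read that tag. As shown in the figure below:

The figure shows that the data we want starts at the div id="nowplaying" tag, which contains each movie's title, rating, cast and so on. The corresponding code is: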

from bs4 import BeautifulSoup as bs
soup = bs(html_data, 'html.parser')    
nowplaying_movie = soup.find_all('div', id='nowplaying')
nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')

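nowplaying_movie_list is a list; print(nowplaying_movie_list[0]) shows its first element, as in the figure below:

The figure shows that the data-subject attribute holds the movie's id and that the alt attribute of the img tag holds the movie's title, so we use these two attributes to get each movie's id and name. (Note: the id is needed later to open the movie's short-review page, which is why we extract it.) The code is: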

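nowplaying_list = []
for item in nowplaying_movie_list:
    nowplaying_dict = {}
    # The movie id is stored in the data-subject attribute of each <li>
    nowplaying_dict['id'] = item['data-subject']
    # The title is the alt text of the poster image (one <img> per item is assumed)
    for tag_img_item in item.find_all('img'):
        nowplaying_dict['name'] = tag_img_item['alt']
        nowplaying_list.append(nowplaying_dict)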
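
The loop above can be tried offline against a small hand-written fragment. This is only a sketch: SAMPLE_HTML imitates the structure described in the text, the ids and titles in it are made up, and extract_movies is a hypothetical helper name. Unlike the loop above, it takes only the first img of each item with find(), so each movie is appended exactly once even if an item were to contain several images.

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating the structure described above;
# the ids and titles are invented for illustration.
SAMPLE_HTML = """
<div id="nowplaying">
  <ul>
    <li class="list-item" data-subject="1000001"><img alt="Movie A" src="a.jpg"></li>
    <li class="list-item" data-subject="1000002"><img alt="Movie B" src="b.jpg"></li>
  </ul>
</div>
"""


def extract_movies(html):
    """Return a list of {'id': ..., 'name': ...} dicts, one per list item."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="nowplaying")
    if container is None:
        # The page layout differs from what we expect; return nothing
        return []
    movies = []
    for item in container.find_all("li", class_="list-item"):
        img = item.find("img")  # first <img> inside the item, or None
        movies.append({
            "id": item["data-subject"],
            "name": img["alt"] if img is not None else None,
        })
    return movies


movies = extract_movies(SAMPLE_HTML)
print(movies)
```

Returning an empty list when the container is missing avoids the IndexError that indexing an empty find_all() result would raise.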

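The list nowplaying_list now holds the id and name of each current movie; print(nowplaying_list) displays it, as shown in the figure below:

You can see that it matches what the Douban page shows, so we now have the information for the latest movies. The next step is to analyze their short reviews. For example, the short-review URL for 《戰狼2》 is: https://movie.douban.com/subject/26363254/comments?start=0&limit=20

Here 26363254 is the movie's id, and start=0 means the page begins at comment number 0.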
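
The URL pattern above can be wrapped in a small helper. This is a sketch under assumptions: comments_url and PAGE_SIZE are hypothetical names, and the page-to-offset rule (page 1 maps to start=0, page 2 to start=20) is taken from the (pageNum-1) * 20 arithmetic in the complete code at the end of this post.

```python
from urllib.parse import urlencode

PAGE_SIZE = 20  # matches limit=20 in the URL above


def comments_url(movie_id, page_num, page_size=PAGE_SIZE):
    """Build the short-review URL for a 1-based page number."""
    if page_num < 1:
        raise ValueError("page_num starts at 1")
    # A list of pairs keeps the parameter order fixed on every Python 3 version
    query = urlencode([
        ("start", (page_num - 1) * page_size),
        ("limit", page_size),
    ])
    return "https://movie.douban.com/subject/{}/comments?{}".format(movie_id, query)


print(comments_url("26363254", 1))
print(comments_url("26363254", 3))
```

Raising ValueError for a page number below 1 is a design choice of this sketch; it makes a bad argument visible instead of silently producing a negative offset.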

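Next we parse this URL. Opening the HTML source of the short-review page shown above, we find that the review data sits inside div tags whose class attribute is comment, as shown in the figure below:

So we parse these tags with the following code: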

requrl = 'https://movie.douban.com/subject/' + nowplaying_list[0]['id'] + '/comments' +'?' +'start=0' + '&limit=20' 
resp = request.urlopen(requrl) 
html_data = resp.read().decode('utf-8') 
soup = bs(html_data, 'html.parser') 
comment_div_lits = soup.find_all('div', class_='comment')

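The list comment_div_lits now holds the HTML of every div whose class is comment. The figure above also shows that each user's review text sits in a p tag, as shown in the figure below:

So we parse the HTML in comment_div_lits further: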

eachCommentList = []
for item in comment_div_lits:
    # .string is None when the <p> holds more than one child node, so skip those
    comment_text = item.find_all('p')[0].string
    if comment_text is not None:
        eachCommentList.append(comment_text)
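
The None check matters because BeautifulSoup's .string only returns text when a tag has a single child. The sketch below shows the difference on a made-up fragment and uses get_text() as one possible alternative that keeps such comments. SAMPLE_HTML and the variable names are assumptions for illustration, not code from the original project.

```python
from bs4 import BeautifulSoup

# Made-up fragment: the second comment contains a nested tag
SAMPLE_HTML = """
<div class="comment"><p>Plain comment</p></div>
<div class="comment"><p>Comment with <a href="#">a link</a> inside</p></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
paragraphs = [div.find("p") for div in soup.find_all("div", class_="comment")]

# .string gives the text only when the tag has exactly one child
via_string = [p.string for p in paragraphs]

# get_text() concatenates all nested text, so nothing is dropped
via_get_text = [p.get_text().strip() for p in paragraphs]

print(via_string)
print(via_get_text)
```

Whether dropped comments matter depends on how often Douban reviews contain nested tags, which this sketch does not measure.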

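Run print(eachCommentList) to view the list; it holds the reviews we wanted, as shown in the figure below:

At this point we have scraped the review data for a movie now playing on Douban. Next we clean the data and display it as a word cloud.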

2. Cleaning the Data

To make cleaning easier, we concatenate the items of the list into a single string:

comments = ''
for k in range(len(eachCommentList)):
    comments = comments + (str(eachCommentList[k])).strip()

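Run print(comments) to view it, as shown in the figure below:

All the reviews are now one string, but it still contains a lot of punctuation and other symbols. They contribute nothing to word-frequency counting, so we remove them with a regular expression. In Python, regular expressions are provided by the re module: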

import re

pattern = re.compile(r'[\u4e00-\u9fa5]+')
filterdata = re.findall(pattern, comments)
cleaned_comments = ''.join(filterdata)

Run print(cleaned_comments) to check again, as shown in the figure below:

The punctuation is gone and the data is much "cleaner".
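
To see exactly what the pattern keeps, here is a small sketch on a made-up sentence. The character class is the same one used above; the sample text and the names HAN_RUN, kept and cleaned are assumptions for illustration. Note that the pattern removes digits and Latin letters along with the punctuation, since it keeps only characters in the range U+4E00 to U+9FA5.

```python
import re

# Same character class as above: runs of characters from U+4E00 to U+9FA5
HAN_RUN = re.compile(r'[\u4e00-\u9fa5]+')

# Invented sample mixing Chinese text, full-width punctuation, digits and English
sample = "太好看了！！！10/10，吳京 is great。"

kept = HAN_RUN.findall(sample)   # each uninterrupted run of Chinese characters
cleaned = ''.join(kept)          # glue the runs back together

print(kept)
print(cleaned)
```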

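Since we want word frequencies, we first need to segment the Chinese text into words. I use jieba (結巴) for this. If it is not installed, run pip install jieba in a console. (Note: pip list shows which libraries are installed.) The code is: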

import jieba    # Chinese word segmentation package
import pandas as pd

segment = jieba.lcut(cleaned_comments)
words_df = pd.DataFrame({'segment': segment})

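pandas is loaded here to hold the segmented words in a DataFrame for the filtering and counting steps that follow; jieba itself does not depend on it. words_df.head() shows the result of the segmentation, as in the figure below:

The figure shows that our data contains function words (stop words) such as "看", "太" and "的". Such words are frequent in almost any context and carry no real meaning, so we remove them.

I keep the stop words in a file named stopwords.txt and compare our data against it. (Note: searching Baidu for stopwords.txt turns up the file for download.) The stop-word removal code is: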

# quoting=3 is csv.QUOTE_NONE: quote characters are treated as ordinary text
stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3, sep="\t",
                        names=['stopword'], encoding='utf-8')
words_df = words_df[~words_df.segment.isin(stopwords.stopword)]

Run words_df.head() again to check the result; as the figure below shows, the stop words have been removed.
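
The same filtering can also be written without pandas, using a plain set for the membership test. This is a swapped-in technique shown only to make the logic explicit. load_stopwords and remove_stopwords are hypothetical helper names, and the tokens and stop words are made up.

```python
def load_stopwords(lines):
    """Build a set from an iterable of lines, one stop word per line."""
    return {line.strip() for line in lines if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not stop words, preserving order."""
    return [tok for tok in tokens if tok not in stopwords]


# Made-up data; with a real file, pass the open file object instead:
#     with open("stopwords.txt", encoding="utf-8") as f:
#         stopwords = load_stopwords(f)
tokens = ["電影", "的", "動作", "太", "精彩", "的"]
stopwords = load_stopwords(["的\n", "太\n", "\n"])

filtered = remove_stopwords(tokens, stopwords)
print(filtered)
```

A set gives constant-time lookups on average, which keeps the filter fast even for a long stop-word list.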

Next comes the word-frequency count:

# size() counts the rows for each word; reset_index(name=...) names the count column
words_stat = words_df.groupby('segment').size().reset_index(name='計數')  # "計數" means "count"
words_stat = words_stat.sort_values(by='計數', ascending=False)

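words_stat.head() shows the result:

Because we have only scraped the first page of reviews so far, the data is sparse. The complete code at the end scrapes 10 pages, so its results are more meaningful.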
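
As a quick cross-check of the counts, the standard library's collections.Counter performs the same tally on a plain list of words. This is an alternative sketch, not the approach used in this post, and the token list is made up.

```python
from collections import Counter

# Made-up tokens standing in for the segmented, stop-word-free words
tokens = ["動作", "精彩", "動作", "愛國", "動作", "精彩"]

word_counts = Counter(tokens)            # maps each word to its number of occurrences
top_words = word_counts.most_common(2)   # the two most frequent (word, count) pairs

print(top_words)
```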

3. Displaying the Word Cloud

The code is as follows:

import matplotlib
import matplotlib.pyplot as plt
# Jupyter/IPython magic; remove the next line when running as a plain script
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
from wordcloud import WordCloud    # word cloud package

# Set the font file, the background color and the maximum font size
wordcloud = WordCloud(font_path="simhei.ttf", background_color="white", max_font_size=80)
# Map each of the 1000 most frequent words to its count
word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}

# Current wordcloud releases expect a word -> frequency mapping here
wordcloud = wordcloud.fit_words(word_frequence)
plt.imshow(wordcloud)

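simhei.ttf specifies the font. Search Baidu for simhei.ttf, download it, and put it in the program's root directory. The resulting image is shown below:

That concludes the walkthrough of the project. I am still a beginner who has not been using Python for long, so the code is far from polished. This is also my first technical blog post, and the writing is somewhat wordy; please bear with me, and point out anything that is wrong. I plan to write up my future small projects in the same way to share with everyone. The complete code follows.

Complete Code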

#coding:utf-8
__author__ = 'hang'

import warnings
warnings.filterwarnings("ignore")
import re
import jieba    # Chinese word segmentation package
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from urllib import request
from bs4 import BeautifulSoup as bs
# Jupyter/IPython magic; remove the next line when running as a plain script
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
from wordcloud import WordCloud    # word cloud package

# Parse the now-playing page and return the id and name of each movie
def getNowPlayingMovie_list():
    resp = request.urlopen('https://movie.douban.com/nowplaying/hangzhou/')
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    nowplaying_movie = soup.find_all('div', id='nowplaying')
    nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')
    nowplaying_list = []
    for item in nowplaying_movie_list:
        nowplaying_dict = {}
        nowplaying_dict['id'] = item['data-subject']
        for tag_img_item in item.find_all('img'):
            nowplaying_dict['name'] = tag_img_item['alt']
            nowplaying_list.append(nowplaying_dict)
    return nowplaying_list

# Scrape one page of short reviews (pageNum starts at 1)
def getCommentsById(movieId, pageNum):
    eachCommentList = []
    if pageNum > 0:
        start = (pageNum - 1) * 20
    else:
        # Invalid page number: return an empty list so callers can still extend with it
        return eachCommentList
    requrl = 'https://movie.douban.com/subject/' + movieId + '/comments' + '?' + 'start=' + str(start) + '&limit=20'
    print(requrl)
    resp = request.urlopen(requrl)
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    comment_div_lits = soup.find_all('div', class_='comment')
    for item in comment_div_lits:
        # .string is None when the <p> holds more than one child node, so skip those
        comment_text = item.find_all('p')[0].string
        if comment_text is not None:
            eachCommentList.append(comment_text)
    return eachCommentList

def main():
    # Fetch the first 10 pages of reviews for the first movie in the list
    commentList = []
    NowPlayingMovie_list = getNowPlayingMovie_list()
    for i in range(10):
        num = i + 1
        commentList_temp = getCommentsById(NowPlayingMovie_list[0]['id'], num)
        commentList.extend(commentList_temp)

    # Join the reviews into a single string
    comments = ''
    for k in range(len(commentList)):
        comments = comments + (str(commentList[k])).strip()

    # Keep only Chinese characters, which removes the punctuation
    pattern = re.compile(r'[\u4e00-\u9fa5]+')
    filterdata = re.findall(pattern, comments)
    cleaned_comments = ''.join(filterdata)

    # Segment the text into words with jieba
    segment = jieba.lcut(cleaned_comments)
    words_df = pd.DataFrame({'segment': segment})

    # Remove stop words; quoting=3 is csv.QUOTE_NONE
    stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3, sep="\t",
                            names=['stopword'], encoding='utf-8')
    words_df = words_df[~words_df.segment.isin(stopwords.stopword)]

    # Count word frequencies ("計數" means "count")
    words_stat = words_df.groupby('segment').size().reset_index(name='計數')
    words_stat = words_stat.sort_values(by='計數', ascending=False)

    # Display the word cloud
    wordcloud = WordCloud(font_path="simhei.ttf", background_color="white", max_font_size=80)
    word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}

    # Current wordcloud releases expect a word -> frequency mapping here
    wordcloud = wordcloud.fit_words(word_frequence)
    plt.imshow(wordcloud)

# Entry point
main()

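The result is shown below:

The figure above broadly reflects how 《戰狼2》 was received. PS: Personally I don't like this movie. The content is hollow and unconvincing, patriotism for patriotism's sake, which I find pointless. These past two years have really been a low point for Chinese cinema, with not a single domestic film worth showing off. Compare India's "Dangal" (《摔跤吧，爸爸》), which has real depth. It also expresses patriotism, and Chinese films still have a lot to learn from other countries.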
