Learning Python Web Scraping: Scraping QQ Posts (說說, "Shuoshuo") and Generating a Word Cloud, Full of Memories

Posted by 程式猿tx on 2018-05-13

Original article: https://juejin.im/post/5af7ef69f265da0b9b0769cb

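I taught myself Python for a while, built a website with Django, and have scraped a few simple sites with requests + BeautifulSoup. Over the weekend I did some more digging and decided to scrape my Qzone posts (說說, "Shuoshuo"), save the text to a txt file, and read it back to generate a word cloud.
I hadn't logged in to QQ for ages, and I hadn't touched Qzone posts for years. They are full of memories from my school days. I found myself laughing as I read, and as I laughed... hahaha~~
Pics or it didn't happen.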

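Back then I was in my prime, witty and humorous...
Back to the point: this time I use selenium to simulate the login, BeautifulSoup4 to scrape the data, and wordcloud to generate the word cloud.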

Installing BeautifulSoup

pip install beautifulsoup4
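See the official beautifulsoup4 documentation for details.
A parser is also needed. I chose the html5lib parser: pip install html5lib
The table below lists the main parsers with their strengths and weaknesses: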

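Parser	Usage	Strengths	Weaknesses
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; decent speed; lenient with malformed documents	Poor leniency in versions before Python 2.7.3 / 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Very fast; lenient with malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")	Very fast; the only supported XML parser	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	Most lenient; parses pages the way a browser does; produces valid HTML5	Very slow; pure Python (no C extension), but an extra package to install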
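To make the trade-offs in the table concrete, here is a minimal sketch that tries several parsers in order and falls back to the built-in html.parser when the others are not installed. The helper name make_soup and the preference order are assumptions for illustration; the original script simply passes "html5lib" directly.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("html5lib", "lxml", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing
            continue
    raise RuntimeError("no usable parser found")


# Deliberately unclosed tags, to see how the chosen parser repairs them
soup, used = make_soup("<p>hello <b>world")
print(used)
print(soup.get_text())
```

Since html.parser ships with Python, the loop should always find at least one parser; which one is picked depends on what is installed locally.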

Simulating Login with selenium

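Use selenium to simulate logging in to Qzone. Install it with pip install selenium
I use the Chrome browser; webdriver.Chrome() returns the driver object for Chrome.
The matching browser driver also has to be downloaded and installed. Otherwise the script fails at run time with the error chromedriver executable needs to be in PATH. I'm on a Mac and followed an article I found online about downloading the driver: https://blog.csdn.net/zxy987872674/article/details/53082896
The same goes for Windows: download the matching driver, unzip it, and put the extracted .exe file in the Python installation directory, for example D:\python . The Python installation directory also has to be added to the system PATH environment variable.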
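Before launching the browser, it can help to check whether the driver is actually visible on PATH. The following is a small sketch using only the standard library; the function name find_driver is made up for this example.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable on PATH, or None."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("chromedriver was not found on PATH; selenium may fail to start Chrome")
else:
    print("chromedriver found at", path)
```

If this prints that the driver was not found, the "needs to be in PATH" error described above is the likely result when webdriver.Chrome() is called.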

The QQ login page is http://i.qq.com. Use webdriver to open the Qzone login page:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://i.qq.com")

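After the page opens, right-click and choose Inspect to view the page elements. The account/password login form lives inside login_frame, so first switch to that frame with driver.switch_to.frame("login_frame"). Then click the account/password login button automatically, type in the account and password to log in, and open the posts page. The full code is as follows: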

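from selenium import webdriver
from selenium.webdriver.common.by import By

friend = ''  # Friend's QQ number (their Qzone must be accessible to you); your own QQ number also works
user = ''  # Your QQ number
pw = ''  # Your QQ password

# Get the browser driver
driver = webdriver.Chrome()
# Maximize the browser window
driver.maximize_window()
# Navigate to the QQ login page
driver.get("http://i.qq.com")

# Switch to the frame that contains the login form
driver.switch_to.frame("login_frame")

# Click the account/password login option
driver.find_element(By.ID, "switcher_plogin").click()
# Type the QQ account into the account field
driver.find_element(By.ID, "u").send_keys(user)
# Type the password into the password field
driver.find_element(By.ID, "p").send_keys(pw)
# Click the login button
driver.find_element(By.ID, "login_button").click()
# Switch back to the top-level page
driver.switch_to.default_content()
# Go to the posts URL; change friend to whichever Qzone you want to visit (here, my own)
driver.get("http://user.qzone.qq.com/" + friend + "/311")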
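The script above keeps the QQ number and password as string literals. As one possible alternative, the sketch below reads them from environment variables so they stay out of the source file. The variable names QQ_USER, QQ_PASSWORD and QQ_FRIEND, and the helper load_credentials, are hypothetical and not part of the original code.

```python
import os


def load_credentials(env=os.environ):
    """Read the account details from environment variables (names are hypothetical)."""
    user = env.get("QQ_USER", "")
    pw = env.get("QQ_PASSWORD", "")
    # Default to visiting your own Qzone when no friend is given
    friend = env.get("QQ_FRIEND", "") or user
    if not user or not pw:
        raise ValueError("set QQ_USER and QQ_PASSWORD before running the script")
    return friend, user, pw


# Demonstration with a plain dict standing in for os.environ
friend, user, pw = load_credentials({"QQ_USER": "10001", "QQ_PASSWORD": "secret"})
print(friend, user)
```

In a real run, calling load_credentials() with no argument would read the process environment instead of the demo dict.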

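At this point the QQ posts page is open. Note that some Qzone pages show a prompt dialog after opening; it has to be closed first by simulating a click.

Damn, I can't believe I used to have Yellow Diamond membership, how scary~~ And my Qzone avatar was so young and trendy...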

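from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By

try:
    # Find the close button and dismiss the prompt dialog
    driver.find_element(By.ID, "dialog_button_111").click()
except WebDriverException:
    # No dialog appeared (or it could not be clicked), so carry on
    pass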

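Because the posts are loaded dynamically, the script also has to scroll down automatically until everything on the page has loaded, and then simulate a click on "next page" to load more. The code is shown below.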

Scraping the Posts with BeautifulSoup

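Press F12 to inspect the page. The posts sit inside the <div> with class feed_wrap, as an array of <li> tags within its <ol>. The text of each post is in the <pre> tag inside <div class="bd">.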
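The structure just described can be tried offline before involving the browser. The sketch below parses a hand-written HTML sample that only mimics that nesting (it is not real Qzone markup), and the function name extract_posts is an assumption for this example. It uses html.parser so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hand-written sample that mimics the structure described above (not real Qzone markup)
SAMPLE_HTML = """
<div class="feed_wrap">
  <ol>
    <li class="feed"><div class="bd"><pre>first post</pre></div></li>
    <li class="feed"><div class="bd"><pre>second post</pre></div></li>
  </ol>
</div>
"""


def extract_posts(html):
    """Return the text of every <pre> under div.feed_wrap > ol > li.feed > div.bd."""
    soup = BeautifulSoup(html, "html.parser")
    wrap = soup.find("div", class_="feed_wrap")
    if wrap is None or wrap.ol is None:
        return []
    posts = []
    for li in wrap.ol.find_all("li", class_="feed"):
        bd = li.find("div", class_="bd")
        # Skip entries that have no text body
        if bd is not None and bd.pre is not None:
            posts.append(bd.pre.get_text())
    return posts


print(extract_posts(SAMPLE_HTML))
```

The None checks are a defensive choice in this sketch; whether real Qzone pages contain entries without a <pre> tag is not something the article states.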


import time

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By

next_num = 0  # Id suffix of the first "next page" button
while True:
    # Scroll down so the browser loads all of the page content;
    # five scrolls are used to load each page of posts
    for i in range(0, 5):
        # Scroll a further 20000 pixels each time
        driver.execute_script("window.scrollBy(0, 20000)")
        time.sleep(2)

    # Switch to the frame that holds the posts; otherwise the elements below cannot be found
    driver.switch_to.frame("app_canvas_frame")
    # Parse the page
    content = BeautifulSoup(driver.page_source, "html5lib")
    # Find the <ol> inside the "feed_wrap" div
    ol = content.find("div", class_="feed_wrap").ol
    # Collect the <li> tags with find_all
    lis = ol.find_all("li", class_="feed")

    # Append the posts to the file; mode 'a' keeps earlier content instead of truncating it
    with open('qq_word.txt', 'a', encoding='utf-8') as f:
        for li in lis:
            bd = li.find("div", class_="bd")
            # The post text is in the <pre> tag
            ss_content = bd.pre.get_text()
            f.write(ss_content + "\n")

    # On the last page the "next page" button no longer has an id, so stop
    if driver.page_source.find('pager_next_' + str(next_num)) == -1:
        break
    # Click "next page"; its id changes on every page, so it is tracked in next_num
    driver.find_element(By.ID, 'pager_next_' + str(next_num)).click()
    # Id suffix of the following "next page" button
    next_num += 1
    # The next iteration starts by scrolling, so switch back to the outer frame
    driver.switch_to.parent_frame()

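The loop above scrolls a fixed five times per page. A possible variation, sketched here, keeps scrolling until the page height stops growing. The function name scroll_until_stable and its parameters are assumptions, and whether Qzone's page height behaves this way has not been verified; the function only relies on the driver's execute_script method.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=20):
    """Scroll to the bottom until the page height stops growing.

    `driver` is any object with selenium's execute_script(script) method.
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give the page time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was loaded, so the page is complete
        last_height = new_height
    return max_rounds
```

In the script, scroll_until_stable(driver) would take the place of the inner for loop; max_rounds caps the work in case the page keeps growing.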

At this point the QQ posts have been scraped and saved in the qq_word.txt file.
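Because the file is opened in append mode ('a'), running the scraper a second time adds the same posts again. The sketch below is one way to clean the file afterwards; the helper name dedupe_lines is made up for this example. It works line by line, so a post that spans several lines is treated as several entries.

```python
def dedupe_lines(path):
    """Rewrite the file at `path` without blank or repeated lines, keeping first occurrences."""
    with open(path, encoding="utf-8") as f:
        lines = [line.strip() for line in f]
    # dict.fromkeys keeps insertion order, so the original ordering survives
    unique = list(dict.fromkeys(line for line in lines if line))
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(line + "\n" for line in unique)
    return len(unique)


# Example call once the scraper has produced the file:
# dedupe_lines('qq_word.txt')
```

Deleting qq_word.txt before each run would have a similar effect; rewriting the file is simply the less destructive option.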
Next, generate the word cloud.

Word Cloud

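Use the wordcloud package to generate the word cloud: pip install wordcloud
jieba could also be used here for word segmentation. I didn't use it, because I feel the posts only read right as whole sentences; that is a personal preference. With jieba segmentation you would see the most frequent words in the posts.
Set a few wordcloud properties. Note that font_path must be set here, otherwise Chinese characters come out garbled.
One more reminder: if you use a virtual environment, don't run the following script inside it, or it may fail with RuntimeError: Python is not installed as a framework. The Mac OS X backend will not be able to function correctly if Python is not installed as a framework. See the Python documentation for more information on installing Python as a framework on Mac OS X. Please either reinstall Python as a framework, or try one of the other backends. If you are using (Ana)Conda please install python.app and replace the use of 'python' with 'pythonw'. See 'Working with Matplotlib on OSX' in the Matplotlib FAQ for more information. I ran into exactly this, so I ran deactivate to leave the virtual environment and then ran the script again.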
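Since font_path has to point at a font that contains Chinese glyphs, and that path differs between systems, a small lookup can make the script more portable. The sketch below returns the first candidate that exists. Only the PingFang path comes from this article; the Windows and Linux paths are assumptions and may not match a given machine.

```python
import os

# Candidate fonts with Chinese glyphs; every path except PingFang.ttc is an assumption
FONT_CANDIDATES = [
    "/System/Library/Fonts/PingFang.ttc",  # macOS, the font used in this article
    r"C:\Windows\Fonts\msyh.ttc",  # Windows, Microsoft YaHei (assumed location)
    "/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc",  # Linux, Noto CJK (assumed location)
]


def pick_font(candidates=FONT_CANDIDATES):
    """Return the first font file that exists, or None if none is installed."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


font = pick_font()
print(font or "no CJK font found; pass font_path explicitly")
```

The returned path could then be passed as font_path when constructing WordCloud.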

# coding:utf-8

from wordcloud import WordCloud
import matplotlib.pyplot as plt


# Generate the word cloud
def create_word_cloud(filename):
    # Read the file contents
    with open("{}.txt".format(filename), encoding='utf-8') as f:
        text = f.read()
    # Configure the word cloud
    wc = WordCloud(
        # Background color
        background_color="white",
        # Maximum number of words to display
        max_words=2000,
        # Fonts live in the system font folder: C:\Windows\Fonts\ on Windows; on a Mac I chose /System/Library/Fonts/PingFang.ttc
        font_path='/System/Library/Fonts/PingFang.ttc',
        height=1200,
        width=2000,
        # Maximum font size
        max_font_size=100,
        # Seed for the random number generator, so the layout and colors are reproducible
        random_state=30,
    )

    myword = wc.generate(text)  # Generate the word cloud
    # Display the word cloud
    plt.imshow(myword)
    plt.axis("off")
    plt.show()
    wc.to_file('qq_word.png')  # Save the word cloud to a file


if __name__ == '__main__':
    create_word_cloud('qq_word')

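For the word-frequency variant mentioned earlier, the posts would be split into words and counted before drawing. The sketch below shows only the counting step with the standard library. The function top_terms and its parameters are hypothetical, and whitespace splitting stands in for a real Chinese segmenter; with jieba installed, jieba.lcut could be passed as the tokenizer instead.

```python
from collections import Counter


def top_terms(text, tokenize, n=10, min_length=2):
    """Count tokens produced by `tokenize` and return the n most common as a dict."""
    tokens = (tok.strip() for tok in tokenize(text))
    counts = Counter(tok for tok in tokens if len(tok) >= min_length)
    return dict(counts.most_common(n))


# Whitespace splitting stands in for a real segmenter in this demo;
# with jieba installed, jieba.lcut could be passed as `tokenize` instead
sample = "school exam school holiday exam school a"
freqs = top_terms(sample, str.split, n=2)
print(freqs)
```

A dict like this could be handed to WordCloud's generate_from_frequencies method in place of wc.generate(text), which would size each word by its count.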

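That completes scraping the QQ posts and generating the word cloud.
Source code on GitHub: github.com/taixiang/sp…

You're welcome to follow my blog: blog.manjiexiang.cn/
You're welcome to follow my WeChat account: 春風十里不如認識你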
