Scraping public information from a Douban group with Puppeteer

Published by 王老闆的前端 on 2020-05-21

Original article: https://learnku.com/articles/44832

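Off topic

Lao Wang recently picked a pen name. Not just a pen name: a courtesy name and an art name were arranged as well.

上官追風 (Shangguan Zhuifeng), courtesy name 追風, art name 追風居士.

If it must have an explanation, it is "the wind-chasing youth who stays at home".

The way Lao Wang's writing unfolds is really just the way his thinking unfolds.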

Puppeteer

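When facing something unfamiliar, the best teacher is clearly a search engine, and the search engine generally regarded as the best is Google.

[Image: Google search for Puppeteer]

Puppeteer documentation

GitHub: github.com/puppeteer/puppeteer

English documentation: pptr.dev

Chinese documentation: zhaoqize.github.io/puppeteer-api-z...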

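Introduction to Puppeteer

The following introduction is excerpted from the Chinese documentation.

Puppeteer, pronounced /puh·puh·teer/, is a Node library that provides a high-level API for controlling Chromium or Chrome over the DevTools Protocol. Puppeteer runs in headless mode by default, but it can be configured to run in "headful" (non-headless) mode.

Some of the things Puppeteer can do:

Generate PDFs of pages.
Crawl a SPA (single-page application) and generate pre-rendered content (i.e. SSR, server-side rendering).
Automate form submission, UI testing, keyboard input, and so on.
Create an up-to-date, automated testing environment. Run tests directly in the latest version of Chrome using the latest JavaScript and browser features.
Capture a timeline trace of a site to help diagnose performance issues.
Test browser extensions.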

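Project background

Lao Wang has started handling the "Call for Submissions" board of the 電鴨 community and needs to contact a large number of people who solicit submissions, inviting them to post their calls there.

[Image: check-in thread]

By chance I found a Douban group with a check-in thread for people soliciting submissions. What to do? Copying and pasting by hand is out of the question, so I immediately set about writing a small scraper.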

Hands-on code

Step 1: Create the project

[Image: creating the project]

Create a douban.js file
Paste in the example code from the official site

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://douban.com');
    await page.screenshot({path: 'example.png'});
    await browser.close();
})();

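Install Puppeteer with npm

Hold on, the code cannot run yet. Open a terminal in the project root and install Puppeteer with npm:

npm i puppeteer

Wait for Chromium to finish downloading. If your network connection is poor, you will have to find a workaround yourself.

[Image: installing Puppeteer]

Edit the package.json file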

{
  "name": "douban",
  "version": "1.0.0",
  "scripts": {
    "start": "node ./douban.js"
  },
  "dependencies": {
    "puppeteer": "^3.1.0"
  }
}

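Step 2: Simulate the login

Visiting the target page shows that a login is required.

[Image: login required]

Analyzing the structure of the login page

I chose password login to keep things simple.

[Image: login page]

What do we need to do?

Open the page
Click "Password login"
Enter the account name
Enter the password
Click "Log in"

Code example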

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        headless: true,
        timeout: 50000
    })

    const page = await browser.newPage()

    // Open the Douban login page
    await page.goto('https://accounts.douban.com/passport/login', {
        waitUntil: 'networkidle2'  // at most 2 network connections remain, so the page is treated as loaded
    });

    // Switch to the password-login tab
    const clickPhoneLogin = await page.$('.account-tab-account')

    await clickPhoneLogin.click()

    // Type the account name and password (replace the placeholders with real credentials)
    const name = 'xxxxxxxxxx'
    await page.type('input[id="username"]', name, {delay: 0})

    const pwd = 'xxxxxxxxxx'
    await page.type('input[id="password"]', pwd, {delay: 1})

    // Get the login button element
    const loginElement = await page.$('div.account-form-field-submit > a')

    // Start waiting for the navigation before clicking, so a fast redirect is not missed
    await Promise.all([
        page.waitForNavigation(),
        loginElement.click()
    ])

    await browser.close()
})();

Final result

[Image: simulated login]
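
Since the same login sequence reappears in the scripts of the following steps, the five steps can also be wrapped in one reusable function. A minimal sketch, assuming the selectors from the example above still match Douban's login page; the function name `login` and its `credentials` parameter are hypothetical, not part of the original article:

```javascript
// Hypothetical helper (not from the original article): performs the five
// login steps on an already-created Puppeteer page.
async function login(page, credentials) {
    // 1. Open the login page.
    await page.goto('https://accounts.douban.com/passport/login', {
        waitUntil: 'networkidle2'
    })

    // 2. Switch to the password-login tab.
    await page.click('.account-tab-account')

    // 3 and 4. Type the account name and the password.
    await page.type('input[id="username"]', credentials.username)
    await page.type('input[id="password"]', credentials.password)

    // 5. Click the login button. The navigation wait is started before the
    //    click so that a fast redirect cannot be missed.
    await Promise.all([
        page.waitForNavigation(),
        page.click('div.account-form-field-submit > a')
    ])
}
```

It would be called as `await login(page, { username: '...', password: '...' })` right after `browser.newPage()`.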

Step 3: Scrape the data

With the groundwork above in place, I won't go into as much detail from here on.

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        headless: false,
        timeout: 50000
    })

    const page = await browser.newPage()

    // Open the Douban login page
    await page.goto('https://accounts.douban.com/passport/login', {
        waitUntil: 'networkidle2'  // at most 2 network connections remain, so the page is treated as loaded
    });

    // Switch to the password-login tab
    const clickPhoneLogin = await page.$('.account-tab-account')

    await clickPhoneLogin.click()

    // Type the account name and password (replace the placeholders with real credentials)
    const name = 'xxxxxxxx'
    await page.type('input[id="username"]', name, {delay: 0})

    const pwd = 'xxxxxxxx'
    await page.type('input[id="password"]', pwd, {delay: 1})

    // Get the login button element
    const loginElement = await page.$('div.account-form-field-submit > a')

    // Click the button to log in; the navigation wait starts before the click
    await Promise.all([
        page.waitForNavigation(),
        loginElement.click()
    ])


    // Target page URL; the reply offset is appended to it
    const url = 'https://www.douban.com/group/topic/112565224/?start='
    // Pagination offsets
    const pages = [0, 100, 200, 300, 400, 500]

    // Scrape the reply text from one page
    async function next(url) {
        await page.goto(url, {
            waitUntil: 'networkidle2'  // at most 2 network connections remain, so the page is treated as loaded
        })

        // Return the text of every reply paragraph on the page
        return await page.$$eval("div.reply-doc.content > p", e => {
            return e.map(element => element.innerText)
        })
    }

    // Collect the replies of every page in thread order
    const replies = []

    for (const index of pages) {
        const res = await next(url + index)
        replies.push(...res)
    }

    // Join all replies into one string, with a separator between replies
    const data = replies.join('\n\n\n-----------------------------------------------------------\n\n')

    // Take a look at the data
    console.log(data)

    await browser.close()
})();

Final result

[Image: scraped data]
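
The hard-coded offsets `[0, 100, 200, 300, 400, 500]` can also be computed. A small sketch, assuming (as those offsets suggest) that each page of the topic starts 100 replies after the previous one; `pageUrls` and its parameters are hypothetical names:

```javascript
// Hypothetical helper: build the URL of every page of a topic.
// pageSize = 100 is an assumption taken from the offsets used in this article.
function pageUrls(baseUrl, pageCount, pageSize = 100) {
    return Array.from({ length: pageCount }, (_, i) => baseUrl + i * pageSize)
}

const urls = pageUrls('https://www.douban.com/group/topic/112565224/?start=', 6)
console.log(urls[0])  // https://www.douban.com/group/topic/112565224/?start=0
console.log(urls[5])  // https://www.douban.com/group/topic/112565224/?start=500
```

The page count still has to be supplied by hand here; reading it from the pagination links on the first page would be a further refinement.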

Step 4: Write the data to a file

const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
    const browser = await puppeteer.launch({
        headless: false,
        timeout: 50000
    })

    const page = await browser.newPage()

    // Use a desktop-sized viewport
    await page.setViewport({
        width: 1920,
        height: 1080
    })

    // Open the Douban login page
    await page.goto('https://accounts.douban.com/passport/login', {
        waitUntil: 'networkidle2'  // at most 2 network connections remain, so the page is treated as loaded
    });

    // Switch to the password-login tab
    const clickPhoneLogin = await page.$('.account-tab-account')

    await clickPhoneLogin.click()

    // Type the account name and password (replace the placeholders with real credentials)
    const name = 'xxxxxxxx'
    await page.type('input[id="username"]', name, {delay: 0})

    const pwd = 'xxxxxxxx'
    await page.type('input[id="password"]', pwd, {delay: 1})

    // Get the login button element
    const loginElement = await page.$('div.account-form-field-submit > a')

    // Click the button to log in; the navigation wait starts before the click
    await Promise.all([
        page.waitForNavigation(),
        loginElement.click()
    ])


    // Target page URL; the reply offset is appended to it
    const url = 'https://www.douban.com/group/topic/112565224/?start='
    // Pagination offsets
    const pages = [0, 100, 200, 300, 400, 500]

    // Scrape the reply text from one page
    async function next(url) {
        await page.goto(url, {
            waitUntil: 'networkidle2'  // at most 2 network connections remain, so the page is treated as loaded
        })

        // Return the text of every reply paragraph on the page
        return await page.$$eval("div.reply-doc.content > p", e => {
            return e.map(element => element.innerText)
        })
    }


    // Collect the replies of every page in thread order
    const replies = []

    for (const index of pages) {
        const res = await next(url + index)
        replies.push(...res)
    }

    // Join all replies into one string, with a separator between replies
    const data = replies.join('\n\n\n-----------------------------------------------------------\n\n')

    // Write the result to a file
    fs.writeFile('douban.txt', data, 'utf8', function (error) {
        if (error) {
            console.error(error);
            return;
        }
        console.log('Write succeeded');
    })

    await browser.close()
})();
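
The callback form of `fs.writeFile` works, but inside an `async` function the promise-based `fs.promises.writeFile` lets the write be awaited and its errors handled with `try`/`catch`. A minimal sketch; the helper name `saveText` and its boolean return value are made up for illustration:

```javascript
const fsPromises = require('fs').promises

// Hypothetical helper: write text to a file and report the outcome.
// Resolves to true on success and false on failure.
async function saveText(filePath, text) {
    try {
        await fsPromises.writeFile(filePath, text, 'utf8')
        console.log('Write succeeded')
        return true
    } catch (error) {
        console.error(error)
        return false
    }
}
```

Inside the script, `await saveText('douban.txt', data)` could then take the place of the `fs.writeFile` call, so the file is known to be written before the browser is closed.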

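Reflections

The code still needs optimizing; the pagination in particular is poorly written.
Could it be split into modules? In this code, the simulated login, the scraping, and the file writing are all tangled together.
That's all for now.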
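
One possible way to separate these concerns is to give each its own function and pass the shared `page` object in. The sketch below covers only the scraping part and is an assumption about how such a split could look, not the author's code: `scrapeReplies`, `collectReplies` and `SEPARATOR` are hypothetical names, and the selector is the one used earlier in this article.

```javascript
const SEPARATOR = '\n\n\n-----------------------------------------------------------\n\n'

// Scrape the reply paragraphs from one page of the topic.
async function scrapeReplies(page, url) {
    await page.goto(url, { waitUntil: 'networkidle2' })
    return page.$$eval('div.reply-doc.content > p', els => els.map(el => el.innerText))
}

// Visit every URL in order and join all replies into one string.
async function collectReplies(page, urls) {
    const replies = []
    for (const url of urls) {
        replies.push(...await scrapeReplies(page, url))
    }
    return replies.join(SEPARATOR)
}
```

With the login and the file writing factored out in the same way, the main function would shrink to launching the browser, logging in, calling `collectReplies`, writing the result, and closing the browser.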

Complete code

gist.github.com/w3cfed/75217423f86...

const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
    const browser = await puppeteer.launch({
        headless: false,
        timeout: 50000
    })

    const page = await browser.newPage()

    // Open the Douban login page
    await page.goto('https://accounts.douban.com/passport/login', {
        waitUntil: 'networkidle2'  // at most 2 network connections remain, so the page is treated as loaded
    });

    // Switch to the password-login tab
    const clickPhoneLogin = await page.$('.account-tab-account')

    await clickPhoneLogin.click()

    // Type the account name and password (replace the placeholders with real credentials)
    const name = 'xxxxxxx'
    await page.type('input[id="username"]', name, {delay: 0})

    const pwd = 'xxxxxxxx'
    await page.type('input[id="password"]', pwd, {delay: 1})

    // Get the login button element
    const loginElement = await page.$('div.account-form-field-submit > a')

    // Click the button to log in; the navigation wait starts before the click
    await Promise.all([
        page.waitForNavigation(),
        loginElement.click()
    ])


    // Target page URL; the reply offset is appended to it
    const url = 'https://www.douban.com/group/topic/112565224/?start='
    // Pagination offsets
    const pages = [0, 100, 200, 300, 400, 500]

    // Scrape the reply text from one page
    async function next(url) {
        await page.goto(url, {
            waitUntil: 'networkidle2'  // at most 2 network connections remain, so the page is treated as loaded
        })

        // Return the text of every reply paragraph on the page
        return await page.$$eval("div.reply-doc.content > p", e => {
            return e.map(element => element.innerText)
        })
    }


    // Collect the replies of every page in thread order
    const replies = []

    for (const index of pages) {
        const res = await next(url + index)
        replies.push(...res)
    }

    // Join all replies into one string, with a separator between replies
    const data = replies.join('\n\n\n-----------------------------------------------------------\n\n')

    // Write the result to a file
    fs.writeFile('douban.txt', data, 'utf8', function (error) {
        if (error) {
            console.error(error);
            return;
        }
        console.log('Write succeeded');
    })

    await browser.close()
})();

This work is licensed under a CC license; reposts must credit the author and link to this article.
