Writing Crawlers with Node.js (Part 2): Scraping Popular GitHub Projects

Posted by lyreal666 on 2019-04-05

Original URL: https://juejin.im/post/5ca71f326fb9a05e545e4c66

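Writing crawlers is a craft that demands fairly broad computing skills.

First, you need a basic understanding of network protocols, HTTP in particular, and you need to be able to analyse a site's requests and responses. Learn a few tools: in simple cases the Network panel of Chrome DevTools is enough. I usually pair it with Postman or Charles, and more complex cases may call for a dedicated packet-capture tool such as Wireshark. The better you understand a site, the easier it is to find a simple way to get the information you want.

Besides networking knowledge, you need some skill at string processing, which in practice means being fluent with regular expressions. Everyday use rarely needs the advanced features; the slightly tricky parts that come up most often are groups and non-greedy matching. As the saying goes, learn regular expressions well and no string will scare you.

Then there are the techniques for getting past anti-crawler measures. You will run into all sorts of problems when writing crawlers, but don't be afraid: even a site as complex as 12306 has been crawled, so what could stop us? Common obstacles include servers that check cookies, checks on the Host and Referer headers, hidden form fields, captchas, rate limits, the need for proxies, and SPA sites. In fact, the vast majority of these problems can ultimately be solved by driving a real browser.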

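This is the second article in the series on writing crawlers with Node.js. It walks through a small, practical crawler that scrapes popular GitHub projects. The goals are:

Learn the most basic kind of crawling: extracting data from a page's HTML source
Save the scraped data to a JSON file
Get familiar with some of the modules introduced in the previous article
Learn how to handle user input in Node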

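Analysing the requirements

Our goal is to scrape data on popular GitHub projects, that is, the projects with the most stars. GitHub does not seem to have a page that lists the top-ranked projects, though. In cases like this, the site's search feature is usually the first thing a crawler author should analyse.

A while ago, while idling on v2ex, I saw a thread about 996 that happened to show a way to view the most-starred repositories on GitHub. It is simple: add a star-count filter to a GitHub search, for example stars:>60000, and the search returns every repository with more than 60000 stars. Look at the screenshot below and note the annotations in the image:

The following can be read off from it:

The search results page is an HTML document returned by a GET request (I had the Doc filter selected in the Network panel).
The URL carries three query parameters: p (page) is the page number, q (query) is the search text, and type is the kind of content being searched.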
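
The three query parameters can be assembled with Node's built-in URLSearchParams. This is only a minimal sketch: buildSearchUrl is a hypothetical helper name that does not appear in the project, and the parameter order is chosen to match the URL used in the curl test in this article.

```javascript
// Hypothetical helper (not part of the article's project): builds the GitHub
// search URL from the three query parameters p, q and type.
const buildSearchUrl = (minStars, page = 1) => {
    const params = new URLSearchParams({
        p: String(page),          // page number
        q: `stars:>${minStars}`,  // search text: the star-count filter
        type: 'Repositories',     // kind of content to search
    });
    // URLSearchParams percent-encodes ':' and '>' when serialised
    return `https://github.com/search?${params.toString()}`;
};

console.log(buildSearchUrl(60000, 2));
// prints https://github.com/search?p=2&q=stars%3A%3E60000&type=Repositories
```

Letting URLSearchParams do the encoding avoids having to escape the filter by hand.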

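Next I wondered whether GitHub checks cookies and other request headers such as Referer and Host, and decides whether to return the page based on them.

A simple way to test this is the command-line tool curl. In Git Bash, run curl "request URL":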

curl "https://github.com/search?p=2&q=stars%3A%3E60000&type=Repositories"

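As expected, the page source came back normally, so our crawler script does not need to send extra request headers or cookies.

Using Chrome's search function, we can see that the project information we need is right there in the page source.

That completes the analysis. This is a very simple crawler: configure the query parameters, fetch the page source over HTTP, parse it with a parsing library to extract the project-related information, shape the data into an array, and finally serialise it to a JSON string and save it to a JSON file.

Implementing the crawler

Fetching the page source

To fetch the page source from Node, we first fill in the URL parameters and then request that URL with superagent, a module for sending HTTP requests.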

'use strict';
const requests = require('superagent');
const cheerio = require('cheerio');
const constants = require('../config/constants');
const logger = require('../config/log4jsConfig').log4js.getLogger('githubHotProjects');
const requestUtil = require('./utils/request');
const models = require('./models');

/**
 * Fetch the source of page `page` of the projects with more than starCount k stars
 * @param {number} starCount lower bound on the star count, in thousands (k)
 * @param {number} page page number
 */
const crawlSourceCode = async (starCount, page = 1) => {
    // The lower bound is given in k, so convert it to a plain star count (1k = 1000)
    starCount = starCount * 1000;
    // Fill in the placeholders in the URL
    const url = constants.searchUrl.replace('${starCount}', starCount).replace('${page}', page);
    // response.text is the returned page source
    const { text: sourceCode } = await requestUtil.logRequest(requests.get(encodeURI(url)));
    return sourceCode;
}

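The constants module in the code above holds the project's constant configuration. When a constant needs changing, you only edit this one file, and keeping the configuration in one place makes it easier to review.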

module.exports = {
    searchUrl: 'https://github.com/search?q=stars:>${starCount}&p=${page}&type=Repositories',
};
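
Note that searchUrl is a plain single-quoted string, so ${starCount} and ${page} are literal placeholders, not template-literal substitutions; they only get values when String.prototype.replace is called. The sketch below shows what the URL looks like after replacement and after encodeURI. fillSearchUrl is a hypothetical name used only for this illustration.

```javascript
// Same value as constants.searchUrl in the article. Single quotes keep the
// ${...} markers as literal text.
const searchUrl = 'https://github.com/search?q=stars:>${starCount}&p=${page}&type=Repositories';

// Hypothetical helper: substitutes both placeholders.
const fillSearchUrl = (template, starCount, page) =>
    template.replace('${starCount}', String(starCount)).replace('${page}', String(page));

const filled = fillSearchUrl(searchUrl, 60000, 2);
// encodeURI escapes '>' but leaves URL delimiters such as ':', '?', '&' and '=' alone
const encoded = encodeURI(filled);

console.log(filled);  // https://github.com/search?q=stars:>60000&p=2&type=Repositories
console.log(encoded); // https://github.com/search?q=stars:%3E60000&p=2&type=Repositories
```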

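Parsing the source to extract project information

Here I modelled the project information as a Repository class. It lives in Repository.js under the project's models directory.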

const fs = require('fs-extra');
const path = require('path');


module.exports = class Repository {
    // Serialise the repositories to out/repositories.json with the given indent
    static async saveToLocal(repositories, indent = 2) {
        await fs.writeJSON(path.resolve(__dirname, '../../out/repositories.json'), repositories, { spaces: indent})
    }

    constructor({
        name,
        author,
        language,
        digest,
        starCount,
        lastUpdate,
    } = {}) {
        this.name = name;
        this.author = author;
        this.language = language;
        this.digest = digest;
        this.starCount = starCount;
        this.lastUpdate = lastUpdate;
    }

    // Print a short human-readable summary of this repository
    display() {
        console.log(`   Project: ${this.name} Author: ${this.author} Language: ${this.language} Stars: ${this.starCount}
Summary: ${this.digest}
Last update: ${this.lastUpdate}
`);
    }
}
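
For readers who want to see the save step without fs-extra, here is a rough equivalent using only Node's standard library. It is a sketch under assumptions: saveAsJson is a hypothetical helper, the sample record is made up, and the file goes to the OS temp directory rather than the project's out directory.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Made-up sample record with the same fields as Repository (not real scraped data)
const repositories = [
    { name: 'demo', author: 'someone', language: 'JavaScript', digest: 'A sample entry', starCount: '1k', lastUpdate: '' },
];

// Hypothetical helper: serialise data to indented JSON and write it to filePath
const saveAsJson = (data, filePath, indent = 2) => {
    // Create the target directory first so the write cannot fail on a missing folder
    fs.mkdirSync(path.dirname(filePath), { recursive: true });
    fs.writeFileSync(filePath, JSON.stringify(data, null, indent) + '\n', 'utf8');
};

const outFile = path.join(os.tmpdir(), 'github-hot-projects-demo', 'repositories.json');
saveAsJson(repositories, outFile);

// Read the file back to confirm the round trip
const restored = JSON.parse(fs.readFileSync(outFile, 'utf8'));
console.log(restored[0].name); // demo
```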

To parse the fetched source we use cheerio, a parsing library whose API is very similar to jQuery's.

/**
 * Get the list of projects with more than starCount k stars on page `page`
 * @param {number} starCount lower bound on the star count, in thousands (k)
 * @param {number} page page number
 */
const crawlProjectsByPage = async (starCount, page = 1) => {
    const sourceCode = await crawlSourceCode(starCount, page);
    const $ = cheerio.load(sourceCode);

    // If you know jQuery, the cheerio calls below should pose no difficulty; otherwise see the API
    // docs in cheerio's GitHub repository, the API is fairly small
    // In the Elements panel each repository sits inside an li tag; it helps to keep the
    // developer tools' Elements panel open while reading the code below
    const repositoryLiSelector = '.repo-list-item';
    const repositoryLis = $(repositoryLiSelector);
    const repositories = [];
    repositoryLis.each((index, li) => {
        const $li = $(li);

        // The a link that holds the repository author and name
        const nameLink = $li.find('h3 a');

        // Extract the author and repository name, trimming stray whitespace
        const [author, name] = nameLink.text().trim().split('/');

        // Project summary
        const digestP = $($li.find('p')[0]);
        const digest = digestP.text().trim();

        // Language
        // First find the span with class .repo-language-color, then take its parent div, which contains the language text
        // Note that some repositories have no language, so the span is missing and language is an empty string
        const languageDiv = $li.find('.repo-language-color').parent();
        // Use String.prototype.trim() to strip whitespace on both sides
        const language = languageDiv.text().trim();

        // Star count
        const starCountLinkSelector = '.muted-link';
        const links = $li.find(starCountLinkSelector);
        // The .muted-link selector may also match the issues link
        const starCountLink = $(links.length === 2 ? links[1] : links[0]);
        const starCount = starCountLink.text().trim();

        // Last update time
        const lastUpdateElementSelector = 'relative-time';
        const lastUpdate = $li.find(lastUpdateElementSelector).text().trim();
        const repository = new models.Repository({
            name,
            author,
            language,
            digest,
            starCount,
            lastUpdate,
        });
        repositories.push(repository);
    });
    return repositories;
}
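
The text pulled out of the page usually needs a little cleaning before it is useful. The sketch below shows two hypothetical helpers that are not in the article's project: splitFullName, which splits the "author/name" link text, and parseStarCount, which turns a star label into a number. It assumes the star text looks like "61.5k" or "980"; the article itself simply stores the raw text.

```javascript
// Hypothetical helper: split link text such as " vuejs / vue " into its parts
const splitFullName = (text) => {
    const [author = '', name = ''] = text.split('/').map(part => part.trim());
    return { author, name };
};

// Hypothetical helper: convert a star label to a number.
// Assumption: labels look like "61.5k", "980" or "1,234"; anything else yields NaN.
const parseStarCount = (text) => {
    const match = /^([\d.,]+)\s*(k?)$/i.exec(text.trim());
    if (!match) {
        return NaN;
    }
    const value = Number(match[1].replace(/,/g, ''));
    // A trailing "k" means thousands
    return match[2] ? Math.round(value * 1000) : value;
};

console.log(splitFullName(' vuejs / vue \n')); // { author: 'vuejs', name: 'vue' }
console.log(parseStarCount('61.5k'));          // 61500
```

Storing a numeric star count makes it possible to sort or filter the saved JSON later.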

Search results often span many pages, so I wrote another function that fetches the repositories from a given number of pages.

const crawlProjectsByPagesCount = async (starCount, pagesCount) => {
    if (pagesCount === undefined) {
        // getPagesCount is not shown in this article
        pagesCount = await getPagesCount(starCount);
        logger.warn(`No page count given, crawling all repositories, ${pagesCount} pages in total`);
    }

    const allRepositories = [];

    const tasks = Array.from({ length: pagesCount }, (ele, index) => {
        // Page numbers start at 1, hence index + 1
        return crawlProjectsByPage(starCount, index + 1);
    });

    // Use Promise.all to run the requests concurrently
    const resultRepositoriesArray = await Promise.all(tasks);
    resultRepositoriesArray.forEach(repositories => allRepositories.push(...repositories));
    return allRepositories;
}
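
Promise.all as used above starts every page request at once. Since rate limiting was listed earlier as a common obstacle, one possible variation is to run the pages in small batches. This is a different technique from the article's code, shown only as a sketch: crawlInBatches and fakeCrawlPage are hypothetical names, and fakeCrawlPage returns made-up strings instead of making an HTTP request.

```javascript
// Hypothetical stand-in for crawlProjectsByPage: resolves with fake entries
// instead of requesting GitHub, so the sketch runs offline.
const fakeCrawlPage = async (page) => [`page-${page}-repo-a`, `page-${page}-repo-b`];

// Crawl pages 1..pagesCount, at most batchSize pages concurrently
const crawlInBatches = async (pagesCount, batchSize, crawlPage) => {
    const all = [];
    for (let start = 1; start <= pagesCount; start += batchSize) {
        const end = Math.min(start + batchSize - 1, pagesCount);
        const pages = Array.from({ length: end - start + 1 }, (_, i) => start + i);
        // Pages inside one batch run concurrently; batches run one after another
        const results = await Promise.all(pages.map(page => crawlPage(page)));
        results.forEach(repositories => all.push(...repositories));
    }
    return all;
};

crawlInBatches(5, 2, fakeCrawlPage).then(all => console.log(all.length)); // 10
```

Results stay in page order because Promise.all preserves the order of its input and the batches are processed sequentially.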

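Making the crawler friendlier to use

A script where you set parameters in the code and then run it is a bit too bare-bones. Here I use readline-sync, a library that reads user input synchronously, to add a little interaction. In later tutorials I may consider building a simple interface with Electron. Below is the program's startup code.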

const readlineSync = require('readline-sync');
const { crawlProjectsByPage, crawlProjectsByPagesCount } = require('./crawlHotProjects');
const models = require('./models');
const logger = require('../config/log4jsConfig').log4js.getLogger('githubHotProjects');

const main = async () => {
    let isContinue = true;
    do {
        const starCount = readlineSync.questionInt(`Enter the lower bound on stars for the GitHub projects to crawl, in k: `, { encoding: 'utf-8'});
        const crawlModes = [
            'Crawl a single page',
            'Crawl a number of pages',
            'Crawl all pages'
        ];
        // keyInSelect returns the zero-based index of the chosen item
        const index = readlineSync.keyInSelect(crawlModes, 'Choose a crawl mode');

        let repositories = [];
        switch (index) {
            case 0: {
                const page = readlineSync.questionInt('Enter the page number to crawl: ');
                repositories = await crawlProjectsByPage(starCount, page);
                break;
            }
            case 1: {
                const pagesCount = readlineSync.questionInt('Enter the number of pages to crawl: ');
                repositories = await crawlProjectsByPagesCount(starCount, pagesCount);
                break;
            }
            // 'Crawl all pages' is the third item, so its index is 2
            case 2: {
                repositories = await crawlProjectsByPagesCount(starCount);
                break;
            }
        }
        
        repositories.forEach(repository => repository.display());
        
        const isSave = readlineSync.keyInYN('Save to a local file (JSON format)?');
        // Wait for the file to be written before prompting again or exiting
        if (isSave) {
            await models.Repository.saveToLocal(repositories);
        }
        isContinue = readlineSync.keyInYN('Continue (y) or quit (n)?');
    } while (isContinue);
    logger.info('Program exited normally...')
}

main().catch(err => logger.error(err));
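
A switch on the menu index is easy to get out of step with the menu array. One possible alternative is a lookup keyed by position, sketched below. resolveMode and MODE_KEYS are hypothetical names; the only library behaviour relied on is that readline-sync's keyInSelect returns -1 when the user picks CANCEL.

```javascript
// Menu labels in the same order as the crawlModes array (translated)
const crawlModes = ['Crawl a single page', 'Crawl a number of pages', 'Crawl all pages'];

// Hypothetical mode names, one per menu entry, in the same order
const MODE_KEYS = ['single', 'count', 'all'];

// Hypothetical helper: map a keyInSelect-style index to a mode name.
// Any index outside the menu (including -1 for CANCEL) is treated as cancelled.
const resolveMode = (index) => {
    if (!Number.isInteger(index) || index < 0 || index >= MODE_KEYS.length) {
        return 'cancelled';
    }
    return MODE_KEYS[index];
};

crawlModes.forEach((label, index) => console.log(`${label} -> ${resolveMode(index)}`));
console.log(resolveMode(-1)); // cancelled
```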

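Let's look at the final result

One readline-sync bug is worth mentioning: on Windows, when Git Bash is used inside VS Code, Chinese text comes out garbled whether or not the file is saved as UTF-8. After searching through some issues, I found that it displays correctly in PowerShell once the encoding is switched to UTF-8, that is, once the code page is set to 65001 (for example with chcp 65001).

The complete source for this project, and for later tutorials, is kept in my GitHub repository: Spiders. If this tutorial helped you, please don't be stingy with your star. The next tutorial will probably be a more complex case: analysing AJAX requests to call the API directly.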
