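Node.js Crawler: Scraping High-Resolution Images from a Waterfall (Infinite-Scroll) Page

Posted by Bougie on 2018-05-17

Original article: https://juejin.im/post/5afd29bb518825429d1f8241

Pages that are mostly static can usually be fetched in full with a single GET request. Dynamic pages, that is, pages that request their data asynchronously, have to finish loading in a browser before they can be scraped. This article shows how to crawl a waterfall (infinite-scroll) page continuously.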

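On Zhihu, any mention of Python brings out a crowd talking about crawlers. Writing a crawler in Node.js is just as simple; compared with Python, the difference mostly comes down to the performance trade-off between "asynchronous" and "multithreaded" execution. I don't know Python well, so I won't pass judgment on it.

phantomjs is a headless, WebKit-based browser (think of it as a Chrome without a window); see phantomjs.org for installation instructions. phantomjs ships a command-line tool, and scripts are run with the command phantomjs xxx.js. The phantomjs-node library (the phantom package on npm) lets you drive phantomjs from Node.js, which means you can use pm2 for process supervision and load balancing.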

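Goal

Crawl at least 200 anime wallpapers at 1920*1080 resolution. The source is a Baidu Images waterfall page.

Approach

A waterfall page decides whether to load more content based on the scroll position, so phantomjs has to scroll the page to obtain more image links. When an image's detail page first opens, it shows a compressed image, which is how Baidu speeds up page access; after a few seconds the image's src is replaced with the link to the full-size image. So after opening a detail page, wait a few seconds before reading the src. How long to wait depends on your network speed.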
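
Instead of guessing a fixed delay, one option is to poll until the src actually changes. The following is a minimal sketch of that idea, not the approach used in this article; waitFor, waitForNewSrc, getSrc and the interval/timeout options are hypothetical names introduced for illustration.

```javascript
// Hypothetical helper: poll `check` until it returns a truthy value,
// or reject once `timeout` milliseconds have passed.
function waitFor(check, { interval = 500, timeout = 10000 } = {}) {
    return new Promise((resolve, reject) => {
        const started = Date.now()
        async function poll() {
            try {
                const value = await check()
                if (value) return resolve(value)
                if (Date.now() - started >= timeout) {
                    return reject(new Error('waitFor: timed out'))
                }
                setTimeout(poll, interval)
            } catch (err) {
                reject(err)
            }
        }
        poll()
    })
}

// Hypothetical helper: resolve with the new src once it differs from the
// initial (compressed) one. `getSrc` is any async function supplied by the
// caller that returns the image's current src.
async function waitForNewSrc(getSrc, options) {
    const initial = await getSrc()
    return waitFor(async () => {
        const current = await getSrc()
        return current && current !== initial ? current : null
    }, options)
}
```

With phantom, getSrc could read page.property('content') and select #currentImg with cheerio, as the code later in this article does.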

Steps

Getting the links

First, open the page with phantom

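const phantom = require('phantom');  // semicolon required: the next statement starts with "("

// url is the Baidu image search URL (defined in the full code at the end)
(async function() {
    const instance = await phantom.create();
    const page = await instance.createPage();
    const status = await page.open(url);
    // Set the viewport to the target resolution
    await page.property('viewportSize', {
        width: 1920,
        height: 1080
    });
    // ...the snippets below run here, inside this async function
}())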

Count the links; if there are fewer than 200, scroll the page

// A delay helper: wait for the page to load before scrolling
function delay(second) {
    return new Promise((resolve) => {
        setTimeout(resolve, second * 1000);
    });
}

const cheerio = require('cheerio')

let $ // declared outside pageScroll so the link-extraction step can reuse the parsed page
async function pageScroll(i) {
    await delay(5)
    await page.property('scrollPosition', {
        left: 0,
        top: 1000 * i
    })
    let content = await page.property('content')
    $ = cheerio.load(content)
    console.log($('.imgbox').length)
    if($('.imgbox').length < 200) {
        await pageScroll(++i)
    }
}
await pageScroll(0)
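
As written, pageScroll keeps recursing until 200 items appear, so it would not return if the page ran out of results first. Below is a sketch of a bounded variant under that assumption; scrollUntil, scrollTo, countItems, target, step and maxScrolls are hypothetical names, and the two callbacks are supplied by the caller.

```javascript
// Hypothetical bounded scroll loop.
// `scrollTo(top)` scrolls the page to the given offset (and may wait for loading);
// `countItems()` returns how many items are currently loaded.
// With phantom these could be, for example:
//   scrollTo:   async (top) => { await page.property('scrollPosition', { left: 0, top }); await delay(1) }
//   countItems: async () => cheerio.load(await page.property('content'))('.imgbox').length
async function scrollUntil({ scrollTo, countItems, target = 200, step = 1000, maxScrolls = 50 }) {
    let count = await countItems()
    for (let i = 1; i <= maxScrolls && count < target; i++) {
        await scrollTo(step * i)
        count = await countItems()
    }
    return count // may be below `target` if the page ran out of content
}
```

The caller can compare the returned count with the target to decide whether to continue with fewer images or stop.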


Extract the image links

let urlList = []
$('.imgbox').each(function() {
    urlList.push('https://image.baidu.com'+$(this).find('a').attr('href'))
})
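
String concatenation works as long as every href is a root-relative path. As a hedged alternative, the sketch below resolves links with the URL class built into Node.js, skips elements without an href, and drops duplicates. toAbsoluteUrls is a hypothetical helper, not part of the original code.

```javascript
// Hypothetical helper: turn raw href values into unique absolute URLs.
function toAbsoluteUrls(hrefs, base = 'https://image.baidu.com') {
    const seen = new Set()
    for (const href of hrefs) {
        if (!href) continue // skip elements that have no href attribute
        try {
            seen.add(new URL(href, base).href)
        } catch (err) {
            // ignore values that cannot be parsed as a URL
        }
    }
    return [...seen]
}
```

With cheerio, the hrefs could be collected first, for example by pushing $(this).find('a').attr('href') into an array inside the each callback, and then passed to this helper.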

Saving the images

Define a function that saves an image

const request = require('request')
const fs = require('fs')

// Note: the ./image directory must already exist, otherwise the write stream fails
function save(url) {
    let ext = url.split('.').pop()
    request(url).pipe(fs.createWriteStream(`./image/${new Date().getTime()}.${ext}`));
}
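
Note that url.split('.').pop() returns everything after the last dot, which can include a query string when the image URL has one. Here is a small sketch of a more defensive way to build the file name with Node's path module and URL class; fileNameFor, the index suffix and the .jpg fallback are assumptions made for illustration.

```javascript
const path = require('path')

// Hypothetical helper: build a local file name for an image URL.
// The extension comes from the URL's path only, so query strings are ignored;
// `index` keeps names unique when several images are saved in the same millisecond.
function fileNameFor(url, index, fallbackExt = '.jpg') {
    const { pathname } = new URL(url)
    const ext = path.extname(pathname) || fallbackExt
    return `${Date.now()}-${index}${ext}`
}
```

The result could then be used in place of the template string passed to fs.createWriteStream, for example `./image/${fileNameFor(url, i)}`.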

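Traverse urlList. I recommend recursion: in a forEach loop the delay has no effect, because forEach does not wait for async callbacks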

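async function imgSave(i) {
    if (i >= urlList.length) return  // stop once every link has been visited
    // Reuse the existing page; `let page = await page.open(...)` would shadow it and throw
    await page.open(urlList[i])
    await delay(1)                   // wait for src to be swapped to the large image
    let content = await page.property('content')
    $ = cheerio.load(content)
    let src = $('#currentImg').attr('src')
    if (src) save(src)               // skip detail pages where no image was found
    await imgSave(i + 1)
}
await imgSave(0)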
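
For comparison, the same traversal can be written as a plain for loop, because await inside a for loop in an async function does pause each iteration. This is only a sketch; processAll and handle are hypothetical names, and handle stands in for the open/read/save steps shown above.

```javascript
function delay(second) {
    return new Promise((resolve) => setTimeout(resolve, second * 1000))
}

// Hypothetical helper: call `handle` on each item in order,
// pausing `pauseSeconds` after each one.
async function processAll(items, handle, pauseSeconds = 1) {
    for (let i = 0; i < items.length; i++) {
        await handle(items[i], i)
        await delay(pauseSeconds)
    }
}
```

A loop also avoids building one nested promise chain per image, which can matter for long lists.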

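The final results (shown in a screenshot in the original post) are all high-resolution images; some images are protected by anti-scraping measures.

Full code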

const phantom = require('phantom')
const cheerio = require('cheerio')
const request = require('request')
const fs = require('fs')
function delay(second) {
    return new Promise((resolve) => {
        setTimeout(resolve, second * 1000);
    });
}
let url = 'https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=index&fr=&hs=0&xthttps=111111&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&word=%E5%8A%A8%E6%BC%AB+%E5%A3%81%E7%BA%B8&oq=%E5%8A%A8%E6%BC%AB+%E5%A3%81%E7%BA%B8&rsp=-1'
function save(url) {
    let ext = url.split('.').pop()
    request(url).pipe(fs.createWriteStream(`./image/${new Date().getTime()}.${ext}`));
}
(async function() {
    let instance = await phantom.create();
    let page = await instance.createPage();
    let status = await page.open(url);
    let size = await page.property('viewportSize', {
        width: 1920,
        height: 1080
    })
    let $
    async function pageScroll(i) {
        await delay(1)
        await page.property('scrollPosition', {
            left: 0,
            top: 1000 * i
        })
        let content = await page.property('content')
        $ = cheerio.load(content)
        if($('.imgbox').length < 200) {
            await pageScroll(++i)
        }
    }
    await pageScroll(0)
    let urlList = []
    $('.imgbox').each(function() {
        urlList.push('https://image.baidu.com'+$(this).find('a').attr('href'))
    })
    async function imgSave(i) {
        if (i >= urlList.length) return // stop after the last link
        await page.open(urlList[i])
        await delay(1)
        let content = await page.property('content')
        $ = cheerio.load(content)
        let src = $('#currentImg').attr('src')
        if (src) save(src) // skip detail pages where no image was found
        await imgSave(i + 1)
    }
    await imgSave(0)
    await instance.exit()
}());

My blog: www.bougieblog.cn. Feel free to drop by and chat.
