A Hands-On Walkthrough of Building a Crawler with TypeScript

Posted by azumia on 2019-02-27

Original URL: https://flycode.co/archives/273537

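I recently started learning TypeScript and was looking for something to build with it as practice. I also had a project idea that first needed a data source, so I decided to see whether I could write a crawler in TypeScript and store the results in a database for my own use. What follows is a record of that attempt.

Setting up the development environment

Install TypeScript globally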

npm install -g typescript 

Create the project folder

mkdir ts-spider

Enter the folder and initialize the project

npm init -y

Next, install the modules the project uses

axios (HTTP requests)
cheerio (jQuery-style selector parsing)
mysql (database access)

npm i --save axios cheerio mysql

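Next, install the corresponding type declaration packages as dev dependencies

npm i --save-dev @types/cheerio @types/mysql

axios ships its own type declarations, so it needs no separate @types package (@types/axios is only a deprecated stub). Newer cheerio releases (1.x) also bundle their own types, in which case @types/cheerio can be skipped as well.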

Now install TypeScript locally in the project (this step is required)

npm i --save typescript 

Open the project in VS Code, create a tsconfig.json file in the root directory, and add the following options:

{ 
    "compilerOptions": { 
        "target": "ES6", 
        "module": "commonjs", 
        "noEmitOnError": true, 
        "noImplicitAny": true, 
        "experimentalDecorators": true, 
        "sourceMap": false, 
     // "sourceRoot": "./", 
        "outDir": "./out" 
    }, 
    "exclude": [ 
        "node_modules" 
    ] 
}  
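
At this point the environment is basically set up. Let's test it.

Testing the development environment

Create an api.ts file in the project root with the following code: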

import axios from 'axios'
 
/** Network request: fetch a URL and resolve with the response body */
export const remote_get = async function (url: string): Promise<string> { 
  // axios already returns a promise, so no extra Promise wrapper is needed;
  // a failed request rejects and propagates to the caller
  const res = await axios.get<string>(url);
  return res.data;
}

Create an app.ts file with the following code:

import { remote_get } from './api'

const go = async () => { 
    let res = await remote_get('http://www.baidu.com/'); 
    console.log(`Fetched data: ${res}`);
} 
go();

Run the following command:

tsc

A new /out folder now appears in the project root, containing the compiled JS files.

Let's run it:

node out/app
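
If the output is the page's HTML source, our crawler has successfully fetched the page and the environment test has passed! Next, let's try extracting data from it.

Analyzing the page and extracting data

Let's refactor app.ts, bring in cheerio, and start extracting the data we need. This time we switch targets and scrape data from Douban.

As mentioned earlier, cheerio provides jQuery-style selector parsing; see the cheerio documentation for details on its usage.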

import { remote_get } from './api'
import * as cheerio from 'cheerio'

const go = async () => { 
  const res: any = await remote_get('https://www.douban.com/group/szsh/discussion?start=0');
  // Load the page
  const $ = cheerio.load(res);
  let urls: string[] = [];
  let titles: string[] = [];
  // Extract the data from the page into two arrays
  $('.olt').find('tr').find('.title').find('a').each((index, element) => { 
      const title = $(element).attr('title');
      const href = $(element).attr('href');
      // attr() returns undefined for a missing attribute; skip such links
      // so the two arrays stay aligned
      if (title && href) {
          titles.push(title.trim());
          urls.push(href.trim());
      }
  })
  // Print the arrays
  console.log(titles, urls);
} 
go(); 
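
This code fetches the topic titles of a Douban group together with their links, stores them in two arrays, and prints both. Let's run it and check the output.

As you can see, we've got the data we wanted. Next, let's try writing it to a database.

Writing the data to the database

I originally wanted to write the data to MongoDB, but since I'm not yet familiar with it, and given the tiny amount of space on my trial-tier server, I dropped that idea and decided to try MySQL first.

First install MySQL locally (I won't go into the installation details), then create a new table in the local database.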
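
The table definition itself is not shown in the text. Below is a minimal sketch of a matching schema, kept as a TypeScript string constant so it can live alongside the rest of the project. The table name info_list and the columns title and url come from the insert statement used later in this article; the id column, the column types, and the charset are assumptions.

```typescript
// Hypothetical schema: only the table name and the title/url columns come
// from the article. The id column, column types and charset are assumptions.
const CREATE_INFO_LIST_SQL = `
CREATE TABLE IF NOT EXISTS info_list (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(255) NOT NULL,
  url VARCHAR(512) NOT NULL
) DEFAULT CHARSET = utf8mb4`;

// Print the statement so it can be pasted into the mysql client
console.log(CREATE_INFO_LIST_SQL);
```

The statement can be run once in the mysql client against the zhufang database that the connection settings in this article point to.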

Add a util.ts file to the project root with the following code:

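import * as mysql from 'mysql'

/* Delay helper */
export function sleep(msec: number) {
  return new Promise<void>(resolve => setTimeout(resolve, msec));
}

/**
 * Wraps opening a connection, running one query, and closing the connection
 * @param {string} sql SQL statement
 * @param arg values bound to the placeholders in the SQL statement
 * @param callback called with the query's error and result
 */
export function db(sql: string, arg: any, callback?: any) {
  // 1. Create the connection
  const config = mysql.createConnection({
      host: 'localhost', // database host
      user: 'root', // database user name
      password: '', // database password
      port: 3306, // port
      database: 'zhufang' // name of the database to use
  });
  // 2. Connect to the database
  config.connect();
  // 3. Run the query (insert, delete, update, or select)
  config.query(sql, arg, (err:any, data:any) => {
      // callback is optional, so only invoke it when one was supplied
      if (callback) {
          callback(err, data);
      }
  });
  // 4. Close the connection (end() lets queued queries finish first)
  config.end();
}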
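
Because the rest of the project uses async/await, it may be convenient to await the database helper as well. The following is a minimal sketch of a Promise wrapper around a callback-style function shaped like db. The names DbFn and dbAsync are hypothetical and not part of the original article.

```typescript
// Shape of the callback-style helper from util.ts (types loosened for this sketch).
type DbFn = (
  sql: string,
  arg: unknown,
  callback?: (err: unknown, data: unknown) => void
) => void;

/**
 * Hypothetical helper: wraps a callback-style db function so that
 * the query result can be awaited.
 */
function dbAsync(db: DbFn, sql: string, arg: unknown): Promise<unknown> {
  return new Promise((resolve, reject) => {
    db(sql, arg, (err, data) => {
      if (err) {
        reject(err);   // query failed: surface the error to the caller
      } else {
        resolve(data); // query succeeded: hand back the result
      }
    });
  });
}
```

With this in place, a call such as await dbAsync(db, 'insert into info_list(title,url) values(?,?)', [title, url]) can sit inside a try/catch block instead of a nested callback.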

We now have a wrapped database helper that also holds the connection settings. Next, modify app.ts to import the db helper and add the code that writes the data:

import { remote_get } from './api'
import * as cheerio from 'cheerio'
import { sleep, db } from './util'

const go = async () => { 
  const res: any = await remote_get('https://www.douban.com/group/szsh/discussion?start=0');
  // Load the page
  const $ = cheerio.load(res);
  let urls: string[] = [];
  let titles: string[] = [];
  // Extract the data from the page into two arrays
  $('.olt').find('tr').find('.title').find('a').each((index, element) => { 
      const title = $(element).attr('title');
      const href = $(element).attr('href');
      // attr() returns undefined for a missing attribute; skip such links
      // so the two arrays stay aligned
      if (title && href) {
          titles.push(title.trim());
          urls.push(href.trim());
      }
  })
  // Print the arrays
  console.log(titles, urls);
  // Write the data to the database, one row per topic
  titles.forEach((item, index) => {
    db('insert into info_list(title,url) values(?,?)', [item, urls[index]], (err: any, data: any) => {
        if (err) {
            console.log('Failed to insert data', err)
        } else {
            console.log('Data inserted successfully!')
        }
    })
  })
} 
go(); 
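
Here we insert the contents of the titles and urls arrays into the database. Run the code; the output shows no problems, so let's check the database.

The data has been written! That wraps up this exercise.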
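
The loop above calls db once per topic, and each call opens its own connection. One possible refinement is to send all rows in a single statement. The sketch below only prepares the rows; toRows is a hypothetical helper, and the commented call relies on the mysql package expanding a nested array bound to a single ? into grouped value lists for bulk inserts.

```typescript
/**
 * Hypothetical helper: pairs each title with the URL at the same index,
 * producing one [title, url] row per topic. Extra items in the longer
 * array are ignored.
 */
function toRows(titles: string[], urls: string[]): [string, string][] {
  const count = Math.min(titles.length, urls.length);
  const rows: [string, string][] = [];
  for (let i = 0; i < count; i++) {
    rows.push([titles[i], urls[i]]);
  }
  return rows;
}

// Intended use with the db helper from util.ts (one connection, one statement):
// db('insert into info_list(title,url) values ?', [toRows(titles, urls)], callback);
```

Note the extra pair of brackets around the rows: the outer array lists the placeholder values, and the nested array is what fills the single ? placeholder.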

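The next things to consider are a delay mechanism to avoid crawling too fast, fetching data page by page, fetching the detail page behind each collected link, splitting the functionality into modules, and so on. I won't go into those here.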
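
As a rough illustration of the first two items, here is a minimal sketch that builds paginated list URLs and fetches them one at a time with a pause in between. The names buildPageUrls and crawlSequentially are hypothetical, and the page size is an assumption about the target site rather than something stated in this article.

```typescript
/** Resolves after msec milliseconds (same idea as sleep in util.ts). */
function sleep(msec: number): Promise<void> {
  return new Promise<void>(resolve => setTimeout(resolve, msec));
}

/**
 * Builds list-page URLs by stepping the start query parameter.
 * pageSize is an assumption about how many topics one page shows.
 */
function buildPageUrls(baseUrl: string, pages: number, pageSize: number): string[] {
  const urls: string[] = [];
  for (let page = 0; page < pages; page++) {
    urls.push(`${baseUrl}?start=${page * pageSize}`);
  }
  return urls;
}

/**
 * Fetches the URLs one at a time, pausing delayMs between requests.
 * fetchPage is injected so that remote_get from api.ts can be passed in.
 */
async function crawlSequentially(
  urls: string[],
  fetchPage: (url: string) => Promise<unknown>,
  delayMs: number
): Promise<unknown[]> {
  const pages: unknown[] = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) {
      await sleep(delayMs); // throttle: wait before every request except the first
    }
    pages.push(await fetchPage(urls[i]));
  }
  return pages;
}
```

A call might look like crawlSequentially(buildPageUrls('https://www.douban.com/group/szsh/discussion', 3, 25), remote_get, 2000), where 3 pages, a page size of 25, and a 2 second delay are all placeholder values to adjust.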

References

https://cloud.tencent.com/info/d0dd52a4a2b1f90055afe4fac4dcd76b.html
https://hpdell.github.io/%E7%88%AC%E8%99%AB/crawler-cheerio-ts/
