『No. 20: Getting Started with Crawlers in Golang』

Posted by weixin_33872660 on 2018-08-19

Original URL: https://blog.csdn.net/weixin_33872660/article/details/88069751


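Hello everyone. My name is Xie Wei, and I am a programmer.

I have written many crawlers, and this is the last article in which I will cover them. From here on I want to explore other areas.

The topic of this section: how to get started with crawlers in Golang.

The work breaks down into the following steps:

Fetch the page source
Parse the data
Store the data

1. Fetching the page source

The native net/http library is all you need to make the request:

GET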

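// Imports used: fmt, io (Go 1.16+), net/http,
// golang.org/x/text/encoding/simplifiedchinese, golang.org/x/text/transform.

// GetHttpResponse sends a GET request and returns the response body.
// Pass ok = true when the page is GBK-encoded so the body is converted to UTF-8.
func GetHttpResponse(url string, ok bool) ([]byte, error) {
	request, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}

	// Send a browser-like User-Agent; some sites reject the default Go client.
	request.Header.Add("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36")

	response, err := http.DefaultClient.Do(request)
	if err != nil {
		return nil, fmt.Errorf("send request: %w", err)
	}
	defer response.Body.Close()

	// Treat every non-2xx status as a failure, including 5xx responses.
	if response.StatusCode < 200 || response.StatusCode >= 300 {
		return nil, fmt.Errorf("unexpected status code %d", response.StatusCode)
	}

	if ok {
		// Decode GBK into UTF-8 while reading the body.
		utf8Content := transform.NewReader(response.Body, simplifiedchinese.GBK.NewDecoder())
		return io.ReadAll(utf8Content)
	}
	return io.ReadAll(response.Body)
}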

POST

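// Imports used: fmt, io (Go 1.16+), net/http, strings,
// golang.org/x/text/encoding/simplifiedchinese, golang.org/x/text/transform.

// PostHttpResponse sends a form-encoded POST request and returns the response body.
// body must already be URL-encoded, for example "key=value&page=1".
// Pass ok = true when the page is GBK-encoded so the body is converted to UTF-8.
func PostHttpResponse(url string, body string, ok bool) ([]byte, error) {
	payload := strings.NewReader(body)
	request, err := http.NewRequest("POST", url, payload)
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	request.Header.Add("Content-Type", "application/x-www-form-urlencoded")
	request.Header.Add("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36")

	response, err := http.DefaultClient.Do(request)
	if err != nil {
		return nil, fmt.Errorf("send request: %w", err)
	}
	defer response.Body.Close()

	// Treat every non-2xx status as a failure, matching the GET helper.
	if response.StatusCode < 200 || response.StatusCode >= 300 {
		return nil, fmt.Errorf("unexpected status code %d", response.StatusCode)
	}

	if ok {
		// Decode GBK into UTF-8 while reading the body.
		utf8Content := transform.NewReader(response.Body, simplifiedchinese.GBK.NewDecoder())
		return io.ReadAll(utf8Content)
	}
	return io.ReadAll(response.Body)
}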

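With these two functions you can fetch the page source whether the request is a GET or a POST. The one thing to watch is that a POST request must carry the correct parameters.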
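Since getting the POST parameters right is the fiddly part, here is a minimal sketch of building a form-encoded body with the standard net/url package. The helper name buildFormBody and the field names keyword and page are my own placeholders, not something from a real site.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildFormBody encodes key/value pairs as application/x-www-form-urlencoded.
// The helper name and the fields used in main are hypothetical.
func buildFormBody(params map[string]string) string {
	values := url.Values{}
	for k, v := range params {
		values.Set(k, v)
	}
	// Encode sorts by key and escapes special characters.
	return values.Encode()
}

func main() {
	body := buildFormBody(map[string]string{
		"keyword": "go crawler",
		"page":    "1",
	})
	fmt.Println(body)
}
```

The resulting string is the kind of value you would pass as the body argument of a POST helper that sets Content-Type to application/x-www-form-urlencoded.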

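The native library takes a fair amount of code. Is there a more concise way?

Someone has already wrapped the native net/http library into a library of its own: gorequest.

gorequest documentation

The interface it exposes is very simple: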

resp, body, errs := gorequest.New().Get("http://example.com/").End()

A single line of code completes a request.

A POST request is just as straightforward:

request := gorequest.New()
resp, body, errs := request.Post("http://example.com").
  Set("Notes","gorequst is coming!").
  Send(`{"name":"backy", "species":"dog"}`).
  End()

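Either approach works, so pick whichever you prefer to fetch the page source. That completes step one.

2. Parsing the data

The page source we fetched needs further parsing to extract the data we want.

The method to use depends on the type of the response.

If the response is HTML, regular expressions or CSS selectors are a convenient way to pull out the content we need.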
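As a rough illustration of the regular-expression route, the sketch below pulls link targets and link text out of an HTML fragment with the standard regexp package. The markup, the class name title, and the names Link and extractLinks are assumptions made up for this example; a real page needs its own pattern.

```go
package main

import (
	"fmt"
	"regexp"
)

// linkRe captures the href and the text of <a class="title" href="...">...</a>.
// The markup it targets is hypothetical.
var linkRe = regexp.MustCompile(`<a class="title" href="([^"]+)">([^<]+)</a>`)

// Link is one extracted anchor.
type Link struct {
	URL  string
	Text string
}

// String formats a link for printing.
func (l Link) String() string {
	return fmt.Sprintf("%s (%s)", l.Text, l.URL)
}

// extractLinks returns every anchor in html that matches linkRe.
func extractLinks(html string) []Link {
	var links []Link
	for _, m := range linkRe.FindAllStringSubmatch(html, -1) {
		// m[0] is the whole match; m[1] and m[2] are the capture groups.
		links = append(links, Link{URL: m[1], Text: m[2]})
	}
	return links
}

func main() {
	html := `<ul>
<li><a class="title" href="/post/1">First post</a></li>
<li><a class="title" href="/post/2">Second post</a></li>
</ul>`
	for _, l := range extractLinks(html) {
		fmt.Println(l)
	}
}
```

Regular expressions are fine for small, regular fragments like this; for deeply nested pages a CSS-selector or XPath library is usually less brittle.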

If the response is JSON, we can deserialize it with the native encoding/json library and read the data from the result.
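A minimal sketch of that deserialization step follows. The response shape (a total count plus a list of movies) and the names Movie, apiResponse, and parseMovies are assumptions for illustration; in practice the struct fields mirror whatever the API actually returns.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Movie is one item in the hypothetical API response.
type Movie struct {
	Title string  `json:"title"`
	Score float64 `json:"score"`
}

// apiResponse mirrors the assumed top-level JSON object.
type apiResponse struct {
	Total  int     `json:"total"`
	Movies []Movie `json:"movies"`
}

// parseMovies deserializes the raw response body into a slice of Movie.
func parseMovies(data []byte) ([]Movie, error) {
	var resp apiResponse
	if err := json.Unmarshal(data, &resp); err != nil {
		return nil, fmt.Errorf("decode response: %w", err)
	}
	return resp.Movies, nil
}

func main() {
	raw := []byte(`{"total":2,"movies":[{"title":"Movie A","score":8.5},{"title":"Movie B","score":7.9}]}`)
	movies, err := parseMovies(raw)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	for _, m := range movies {
		fmt.Printf("%s: %.1f\n", m.Title, m.Score)
	}
}
```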

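Now that we know the methods, our goals are:

Get comfortable with regular expressions and know how to write one for a given situation
Get comfortable with JSON serialization and deserialization
Learn what each symbol in a CSS selector means, and be able to write CSS selectors in the Chrome DevTools window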

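1. Basic approach

Be clear about what content you need
Analyze the page
Fetch the page
Parse the page

2. Analyzing the page

Use Inspect Element in Chrome to view the page source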

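3. Response types

json: usually an API call; easy to analyze, just parse the JSON data
xml: uncommon
html: common; use regular expressions, CSS selectors, or XPath to get the content you need

4. Request types

Get: common; just send the request
Post: analyze the request parameters, build the request, send it to the server, then parse the response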

5. Request headers

The User-Agent header

6. Storage

Local files: text, JSON, CSV
Databases: relational (Postgres, MySQL), non-relational (MongoDB), search engine (Elasticsearch)
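For the local-file options, CSV is probably the quickest to get going. Below is a minimal sketch using the standard encoding/csv package; the helper names writeCSV and renderCSV and the sample rows are placeholders of mine.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"io"
	"os"
)

// writeCSV writes a header row followed by the data rows to w.
func writeCSV(w io.Writer, header []string, rows [][]string) error {
	cw := csv.NewWriter(w)
	if err := cw.Write(header); err != nil {
		return fmt.Errorf("write header: %w", err)
	}
	// WriteAll flushes the writer and reports any buffered write error.
	if err := cw.WriteAll(rows); err != nil {
		return fmt.Errorf("write rows: %w", err)
	}
	return nil
}

// renderCSV is a convenience wrapper that returns the CSV as a string.
func renderCSV(header []string, rows [][]string) (string, error) {
	var buf bytes.Buffer
	if err := writeCSV(&buf, header, rows); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	header := []string{"title", "score"}
	rows := [][]string{
		{"Movie A", "8.5"},
		{"Movie B, Part 2", "7.9"}, // the comma forces quoting
	}
	// Swap os.Stdout for a file from os.Create("movies.csv") to persist the data.
	if err := writeCSV(os.Stdout, header, rows); err != nil {
		fmt.Println("save failed:", err)
	}
}
```

Because writeCSV takes an io.Writer, the same function can target a file, a buffer, or standard output.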

7. Handling images

Request the image
Save it
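The request-then-save flow for images can be sketched with net/http, os, and io.Copy. To keep the example runnable offline, it serves fake image bytes from a local httptest server; that server, the helper names downloadFile and newImageServer, and the file name are all assumptions for the demo.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
)

// downloadFile requests url and streams the response body into the file at path.
// It returns the number of bytes written.
func downloadFile(client *http.Client, url, path string) (int64, error) {
	resp, err := client.Get(url)
	if err != nil {
		return 0, fmt.Errorf("request image: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("unexpected status code %d", resp.StatusCode)
	}

	out, err := os.Create(path)
	if err != nil {
		return 0, fmt.Errorf("create file: %w", err)
	}

	// Copy in chunks instead of loading the whole image into memory.
	n, err := io.Copy(out, resp.Body)
	if closeErr := out.Close(); err == nil {
		err = closeErr
	}
	return n, err
}

// newImageServer is a local stand-in for an image host (an assumption for
// this demo) so the sketch runs without touching a real site.
func newImageServer(content []byte) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "image/png")
		w.Write(content)
	}))
}

func main() {
	srv := newImageServer([]byte("not really a PNG"))
	defer srv.Close()

	path := filepath.Join(os.TempDir(), "crawler-demo.png")
	n, err := downloadFile(srv.Client(), srv.URL+"/demo.png", path)
	if err != nil {
		fmt.Println("download failed:", err)
		return
	}
	fmt.Printf("saved %d bytes to %s\n", n, path)
}
```

Against a real site you would pass the image URL found during parsing and, most likely, a client that also sets a User-Agent header.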

8. Miscellaneous

Proxies: an IP pool
User-Agent: mimic a browser
Apps: app data requires a packet-capture tool, such as Charles on Mac or Fiddler on Windows, to work out the API
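The proxy and User-Agent points can be sketched with the standard library alone: http.Transport accepts a proxy function, and the header is set per request. The proxy addresses below are placeholders that do not point at real proxies, and the helper names rotate, newProxyClient, and newBrowserRequest are my own.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"time"
)

const browserUA = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"

// rotate picks the i-th proxy address, wrapping around the pool.
func rotate(pool []string, i int) string {
	return pool[i%len(pool)]
}

// newProxyClient returns a client that routes every request through proxyAddr.
func newProxyClient(proxyAddr string) (*http.Client, error) {
	proxyURL, err := url.Parse(proxyAddr)
	if err != nil {
		return nil, fmt.Errorf("parse proxy address: %w", err)
	}
	return &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
		Timeout:   10 * time.Second,
	}, nil
}

// newBrowserRequest builds a GET request that carries a browser-like User-Agent.
func newBrowserRequest(target, userAgent string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	req.Header.Set("User-Agent", userAgent)
	return req, nil
}

func main() {
	// Placeholder addresses; a real pool would come from a proxy provider.
	pool := []string{"http://127.0.0.1:8080", "http://127.0.0.1:8081"}
	for i := 0; i < 3; i++ {
		addr := rotate(pool, i)
		client, err := newProxyClient(addr)
		if err != nil {
			fmt.Println("skip proxy:", err)
			continue
		}
		req, err := newBrowserRequest("http://example.com/", browserUA)
		if err != nil {
			fmt.Println("skip request:", err)
			continue
		}
		// client.Do(req) would send the request through the proxy; it is
		// left out because the placeholder proxies do not exist.
		fmt.Printf("request %d to %s via %s (timeout %s)\n", i, req.URL, addr, client.Timeout)
	}
}
```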

9. Hard parts

Distributed crawling
Large-scale crawling
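Distributed crawling is beyond a short example, but the first step toward larger crawls on one machine is usually bounded concurrency. The sketch below is one common pattern, a fixed pool of goroutines reading from a channel, and not the only way to do it. The names crawlAll and result are mine, and the fetch function is injected so the demo can use a fake one instead of real HTTP calls.

```go
package main

import (
	"fmt"
	"sync"
)

// result holds the outcome of fetching one URL.
type result struct {
	URL  string
	Body []byte
	Err  error
}

// crawlAll fetches every URL with at most `workers` goroutines running at once.
// Results are returned in the same order as urls.
func crawlAll(urls []string, workers int, fetch func(string) ([]byte, error)) []result {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan int)
	results := make([]result, len(urls))

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each index is handed to exactly one worker, so writes never overlap.
			for i := range jobs {
				body, err := fetch(urls[i])
				if err != nil {
					err = fmt.Errorf("fetch %s: %w", urls[i], err)
				}
				results[i] = result{URL: urls[i], Body: body, Err: err}
			}
		}()
	}

	for i := range urls {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	urls := []string{"https://example.com/1", "https://example.com/2", "https://example.com/3"}
	// fakeFetch stands in for a real HTTP call so the sketch runs offline.
	fakeFetch := func(u string) ([]byte, error) {
		return []byte("body of " + u), nil
	}
	for _, r := range crawlAll(urls, 2, fakeFetch) {
		fmt.Printf("%s -> %d bytes, err=%v\n", r.URL, len(r.Body), r.Err)
	}
}
```

Capping the number of workers keeps the crawler from opening an unbounded number of connections, which matters both for your own machine and for the site being crawled.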

Example

Key points

How to fetch the page source

Native net/http
gorequest (a wrapper around the native net/http)

Web client request methods

Get: the vast majority of requests
Post

Web server responses

json
html

How to handle web server responses

json: use the native JSON deserialization, or the third-party gjson
html: regular expressions, CSS selectors, XPath

Ways to store the data

Text
Json
Csv
db

The first three involve reading and writing files; the last involves database operations.
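As one instance of the file-based options, here is a minimal sketch that saves scraped records to a JSON file and reads them back. The Article type, the helper names saveJSON and loadJSON, and the file name are assumptions; os.WriteFile and os.ReadFile need Go 1.16 or later.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Article is a hypothetical scraped record.
type Article struct {
	Title string `json:"title"`
	URL   string `json:"url"`
}

// saveJSON writes v to path as indented JSON.
func saveJSON(path string, v interface{}) error {
	data, err := json.MarshalIndent(v, "", "  ")
	if err != nil {
		return fmt.Errorf("encode: %w", err)
	}
	if err := os.WriteFile(path, data, 0o644); err != nil {
		return fmt.Errorf("write %s: %w", path, err)
	}
	return nil
}

// loadJSON reads path and decodes its JSON content into v.
func loadJSON(path string, v interface{}) error {
	data, err := os.ReadFile(path)
	if err != nil {
		return fmt.Errorf("read %s: %w", path, err)
	}
	if err := json.Unmarshal(data, v); err != nil {
		return fmt.Errorf("decode %s: %w", path, err)
	}
	return nil
}

func main() {
	articles := []Article{
		{Title: "First post", URL: "https://example.com/post/1"},
		{Title: "Second post", URL: "https://example.com/post/2"},
	}
	if err := saveJSON("articles.json", articles); err != nil {
		fmt.Println("save failed:", err)
		return
	}

	var loaded []Article
	if err := loadJSON("articles.json", &loaded); err != nil {
		fmt.Println("load failed:", err)
		return
	}
	fmt.Printf("round-tripped %d articles\n", len(loaded))
}
```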

Source code

For reference only: reference

That's all. I'm Xie Wei; see you next time.
