Parsing web pages in Go: now you can write a crawler

Posted by 1956587218 on 2017-06-21

Original URL: https://gocn.vip/topics/8520?locale=zh-CN

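> Java has Jsoup and Node.js has cheerio, and both make parsing web pages quite convenient. I found an equally handy HTML parsing tool for Go, goquery, whose selectors work the same way as jQuery's.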

Installation

go get github.com/PuerkitoBio/goquery

Usage

This is essentially the demo from the project's README.md:

package main

import (
  "fmt"
  "log"

  "github.com/PuerkitoBio/goquery"
)

func ExampleScrape() {
  // Note: goquery.NewDocument is deprecated in newer goquery releases; there,
  // fetch the page with net/http and pass res.Body to goquery.NewDocumentFromReader.
  doc, err := goquery.NewDocument("http://metalsucks.net")
  if err != nil {
    log.Fatal(err)
  }

  // Find the review items
  doc.Find(".sidebar-reviews article .content-block").Each(func(i int, s *goquery.Selection) {
    // For each item found, get the band and title
    band := s.Find("a").Text()
    title := s.Find("i").Text()
    fmt.Printf("Review %d: %s - %s\n", i, band, title)
  })
}

func main() {
  ExampleScrape()
}

Garbled text (encoding problems)

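Chinese pages that are not encoded in UTF-8 (GB2312, for example) come out garbled, because goquery expects UTF-8 input. In that case you need a transcoder.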

Install iconv-go

go get github.com/djimenez/iconv-go

How to use it

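// Requires the imports "fmt", "net/http", "github.com/PuerkitoBio/goquery"
// and iconv "github.com/djimenez/iconv-go".
func ExampleScrape() {
  res, err := http.Get(baseUrl) // baseUrl is the address of the GB2312-encoded page
  if err != nil {
    fmt.Println(err.Error())
    return
  }
  defer res.Body.Close()

  // Wrap the response body so that GB2312 bytes are converted to UTF-8 on read
  utfBody, err := iconv.NewReader(res.Body, "gb2312", "utf-8")
  if err != nil {
    fmt.Println(err.Error())
    return
  }

  doc, err := goquery.NewDocumentFromReader(utfBody)
  if err != nil {
    fmt.Println(err.Error())
    return
  }
  // doc can now be used to extract structured data from the page, for example:
  doc.Find("li").Each(func(i int, s *goquery.Selection) {
    fmt.Println(i, s.Text())
  })
}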
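
The source encoding passed to the transcoder has to match what the server actually sends. One way to find it, sketched here with only the standard library, is to read the charset parameter of the response's Content-Type header. The helper name `charsetFromContentType` and its `fallback` parameter are illustrative and not part of goquery or iconv-go. Many pages declare their charset only in a `<meta>` tag, which this sketch does not inspect.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContentType returns the lower-cased charset parameter of a
// Content-Type header value, or fallback when the header is missing,
// malformed, or carries no charset. (Illustrative helper.)
func charsetFromContentType(contentType, fallback string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return fallback
	}
	if cs, ok := params["charset"]; ok && cs != "" {
		return strings.ToLower(cs)
	}
	return fallback
}

func main() {
	fmt.Println(charsetFromContentType("text/html; charset=GB2312", "utf-8")) // gb2312
	fmt.Println(charsetFromContentType("text/html", "utf-8"))                 // utf-8
}
```

In a scraper, the input would be `res.Header.Get("Content-Type")`, and the result could decide whether the body needs transcoding at all before it is handed to goquery.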

Advanced

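Some sites validate the Cookie, Referer, and similar headers. You can set these request headers before sending the HTTP request.

This is no longer part of goquery; to learn more, see the documentation for Go's net/http package.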

baseUrl := "http://baidu.com"
client := &http.Client{}
req, err := http.NewRequest("GET", baseUrl, nil)
if err != nil {
  log.Fatal(err)
}
req.Header.Add("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")
req.Header.Add("Referer", baseUrl)
req.Header.Add("Cookie", "your cookie") // cookies can also be set with req.AddCookie()
res, err := client.Do(req)
if err != nil {
  log.Fatal(err)
}
defer res.Body.Close()
// Finally, pass res straight to goquery to parse the page
doc, err := goquery.NewDocumentFromResponse(res)
if err != nil {
  log.Fatal(err)
}
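
As a rough, standard-library-only sketch of the same idea, the header setup can be wrapped in a small helper that uses `Request.AddCookie` from net/http instead of writing the Cookie header by hand. The helper name `newScrapeRequest`, its parameters, and the example URL and values are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/http"
)

// newScrapeRequest builds a GET request that carries a User-Agent, a Referer
// and a single cookie. (Illustrative helper, not part of goquery.)
func newScrapeRequest(url, userAgent, cookieName, cookieValue string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", userAgent)
	req.Header.Set("Referer", url)
	// AddCookie serializes the cookie into the Cookie request header.
	req.AddCookie(&http.Cookie{Name: cookieName, Value: cookieValue})
	return req, nil
}

func main() {
	req, err := newScrapeRequest("http://example.com/", "my-crawler/0.1", "session", "abc123")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(req.Header.Get("Cookie"))  // session=abc123
	fmt.Println(req.Header.Get("Referer")) // http://example.com/
}
```

The returned request can be sent with `client.Do(req)`, and the response body can then be parsed with goquery.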

References

Now you can happily crawl other people's websites.

Original post: https://tomoya92.github.io/2017/06/21/golang-goquery/

