# Scraping data with Ruby and writing it to xls (unirest, nokogiri, spreadsheet)

Posted by SSBun on 2017-12-13

Installing unirest

To make HTTP requests in Ruby we can use open-uri, but it is not particularly convenient once we need different kinds of requests, so we use unirest for network requests instead.

gem install unirest   # install unirest

Using unirest

unirest requires Ruby 2.0 or later and is very simple to use. The most common methods are shown below (see unirest.io for the details).

Creating a request

response = Unirest.post "http://httpbin.org/post", 
                        headers:{ "Accept" => "application/json" }, 
                        parameters:{ :age => 23, :foo => "bar" }

response.code # Status code
response.headers # Response headers
response.body # Parsed body
response.raw_body # Unparsed body
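
The fields listed above are usually checked before the body is used. The following is a minimal offline sketch, assuming `response.code` is an integer HTTP status. `FakeResponse` and `body_or_raise` are hypothetical names introduced here for illustration and are not part of unirest; the struct only stands in for a real response so the example runs without a network connection.

```ruby
# Hypothetical stand-in for a unirest response, so the sketch runs offline.
FakeResponse = Struct.new(:code, :headers, :body, :raw_body)

# Returns the parsed body for a 2xx response and raises for anything else.
# (Hypothetical helper, not a unirest method.)
def body_or_raise(response)
  unless (200..299).cover?(response.code)
    raise "request failed with status #{response.code}"
  end
  response.body
end

ok = FakeResponse.new(200, {}, { "age" => "23" }, '{"age":"23"}')
puts body_or_raise(ok)["age"]

missing = FakeResponse.new(404, {}, nil, "")
begin
  body_or_raise(missing)
rescue RuntimeError => e
  puts e.message
end
```

With a real request, the object returned by `Unirest.post` or `Unirest.get` would be passed to the helper in place of the struct.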

Asynchronous requests

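# Passing a block makes the request asynchronous; the block is called with the response.
# A do...end block is used so that it binds to Unirest.post rather than to the hash literal.
Unirest.post "http://httpbin.org/post",
             headers: { "Accept" => "application/json" },
             parameters: { :age => 23, :foo => "bar" } do |response|
  response.code # Status code
  response.headers # Response headers
  response.body # Parsed body
  response.raw_body # Unparsed body
end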

Basic GET request with authentication

response = Unirest.get "http://httpbin.org/get", auth:{:user=>"username", :password=>"password"}

Installing nokogiri

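Once we have fetched the data, we need to analyze it. If the structure is simple we can use a regular expression directly. If it is more complex, we need nokogiri to work with the HTML DOM. If you are not familiar with the DOM structure, read up on it first: [HTML DOM tutorial](http://www.runoob.com/htmldom/htmldom-tutorial.html)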

gem install nokogiri
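
For comparison, here is a rough sketch of the regular-expression route mentioned above for simple structures. The HTML fragment and the `extract_links` helper are made up for illustration. The pattern only works while every link has exactly this attribute layout, which is the kind of fragility that makes a DOM parser preferable for real pages.

```ruby
# Hypothetical helper: pulls [href, text] pairs out of very regular HTML.
# It assumes each link is written exactly as <a href="...">text</a>.
def extract_links(html)
  html.scan(/<a href="([^"]+)">([^<]+)<\/a>/)
end

sample = '<ul><li><a href="/html/">HTML</a></li><li><a href="/css/">CSS</a></li></ul>'

extract_links(sample).each do |href, text|
  puts "#{text}: #{href}"
end
```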

Using nokogiri

Requiring the libraries

require 'rubygems'
require 'nokogiri'


Opening an HTML document

page = Nokogiri::HTML(File.open("index.html"))
puts page.class # => Nokogiri::HTML::Document

# You can also parse the data fetched with unirest directly from response.body
response = Unirest.get "http://httpbin.org/get"
page = Nokogiri::HTML(response.body)

Parsing a URL directly with open-uri

Fetch the document directly with an HTTP request.

require 'rubygems'
require 'nokogiri'
require 'open-uri'
   
# URI.open is used because Kernel#open no longer accepts URLs as of Ruby 3.0
page = Nokogiri::HTML(URI.open("http://en.wikipedia.org/"))
puts page.class   # => Nokogiri::HTML::Document

CSS selectors

Selecting nodes from the Document object

page.css('title') # find all `title` tags in page; returns an array-like NodeSet
page.css('li')[0].text # text content of the first `li` tag in page
page.css('li')[0]['href'] # value of the `href` attribute of the first `li` tag (nil if absent)
page.css("li[data-category='news']") # `li` tags with the attribute `data-category='news'`
page.css('div#funstuff')[0] # the `div` node with `id='funstuff'`
page.css('div#reference a') # all `a` nodes under the `div` with `id='reference'`

For more about nokogiri, see "Parsing HTML with Nokogiri".

Installing spreadsheet

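Spreadsheet is a gem implemented in Ruby that makes it easy to work with Excel files. We need to save the fetched data locally so that it can be recorded and processed later. It can be installed with `gem install spreadsheet`.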

# Load the spreadsheet gem
require "spreadsheet"

# Declare the encoding Spreadsheet uses when handling Excel files
Spreadsheet.client_encoding = "UTF-8"

# Create a Workbook object, which corresponds to an Excel file
book = Spreadsheet::Workbook.new
# Create a worksheet in the Excel file and name it "Test Excel"
sheet1 = book.create_worksheet :name => "Test Excel"

# Define a cell format
default_format = Spreadsheet::Format.new(:weight => :bold, # bold font
                             :size => 14,
                             :horizontal_align => :merge, # merge cells
                             :color => "red",
                             :border => 1,
                             :border_color => "black",
                             :pattern => 1,
                             :pattern_fg_color => "yellow") # note: unless pattern is set explicitly, pattern_fg_color has no visible effect

# Get the first row of the worksheet
test_row = sheet1.row(0)
# Apply the format to the first cell of the row (column index 0)
test_row.set_format(0, default_format)

# Set the value of the first column in the first row
test_row[0] = "row 1 col 1"
# Set the value of the second column in the first row
test_row[1] = "row 1 col 2"

# Write the Workbook object to a file to produce the spreadsheet
book.write 'book2.xls'

The crawler

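We will scrape the tutorial list and the tutorial URLs from RUNOOB.COM (http://www.runoob.com/). This hardly counts as a crawler, but requesting, filtering, and storing data with Ruby gems is the essential toolkit for building a more complex one. Being fluent with the available gems is also a good way to see how concise Ruby can be.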

#!/usr/bin/ruby
require 'unirest'
require 'nokogiri'
require 'open-uri'
require 'spreadsheet'

# Fetch the page
response = Unirest.get "http://www.runoob.com/"
page = Nokogiri::HTML(response.body)

# Get the list of top-level categories
datas = page.css('div.codelist')
puts datas.count

# Create a workbook
Spreadsheet.client_encoding = 'UTF-8'
book = Spreadsheet::Workbook.new

# Create a sheet
sheet = book.create_worksheet :name => "my excel"

index = 0
datas.each do |category|
  puts category.css('h2').text # name of the top-level category
  items = category.css('a.item-top')
  items.each do |item|
    sheet.row(index)[0] = item.css('h4').text # write the tutorial name
    sheet.row(index)[1] = item['href'] # write the tutorial link
    index += 1
  end
end

book.write '/users/ssbun/desktop/runoob.xls' # write to a local file (adjust the path for your machine)

You should now see an xls file on your desktop. Open it to view the data.
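
The `href` values written to the sheet are stored exactly as they appear in the page, so they may be relative or protocol-relative. Below is a small sketch of normalizing them with the standard library's `URI.join` before writing. The `absolute_url` helper and the sample paths are hypothetical and were not taken from the actual site.

```ruby
require 'uri'

BASE_URL = "http://www.runoob.com/"

# Hypothetical helper: resolves an href against the page it was found on.
def absolute_url(href, base = BASE_URL)
  URI.join(base, href).to_s
end

# Sample hrefs invented for illustration.
puts absolute_url("//www.runoob.com/html/html-tutorial.html")
puts absolute_url("/python/python-tutorial.html")
```

In the script above this would replace `item['href']` with `absolute_url(item['href'])` when filling the second column.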
