Scraping Google Scholar data with the rvest package

Posted by 子華 on 2016-01-12

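This post shows how to scrape Google Scholar pages. As an example, we use the rvest package to scrape the Google Scholar profile of the author's PhD advisor: his coauthors, how many times they have been cited, and their affiliations. Writing on the RStudio Blog, Hadley Wickham described rvest as inspired by libraries like beautiful soup, which make it easy to scrape data from HTML web pages. Because rvest is designed to work with magrittr, complex operations can be expressed as pipelines built from simple, easy-to-understand pieces of code.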

Load the R packages:

We use rvest for scraping and ggplot2 for plotting.

    library(rvest)
    library(ggplot2)

How many times have his papers been cited?

Use SelectorGadget to find the CSS selector for the "cited by" column.

    page <- read_html("https://scholar.google.com/citations?user=sTR9SIQAAAAJ&hl=en&oi=ao")

Pass the CSS selector to html_nodes(), then extract the text with html_text(). Finally, convert the strings to numbers with as.numeric().

    citations <- page %>%
      html_nodes("#gsc_a_b .gsc_a_c") %>%
      html_text() %>%
      as.numeric()

Check the citation counts:

    citations
    148 96 79 64 57 57 57 55 52 50 48 37 34 33 30 28 26 25 23 22

Draw a bar plot of the citation counts:

    barplot(citations, main = "How many times has each paper been cited?",
            ylab = "Number of citations", col = "skyblue", xlab = "")
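The extraction step above can be tried offline before pointing it at a live page. The sketch below is a minimal illustration, assuming a small hand-written HTML fragment that only mimics the structure the selector "#gsc_a_b .gsc_a_c" targets; the fragment, the paper titles, and the variable name cited_text are made up for the example and are not taken from the real page.

```r
library(rvest)

# Hypothetical, hand-written HTML that imitates a table body with id
# "gsc_a_b" and citation cells of class "gsc_a_c". Not the real page.
html <- '
<table>
  <tbody id="gsc_a_b">
    <tr><td class="gsc_a_t">Paper A</td><td class="gsc_a_c">148</td></tr>
    <tr><td class="gsc_a_t">Paper B</td><td class="gsc_a_c">96</td></tr>
    <tr><td class="gsc_a_t">Paper C</td><td class="gsc_a_c"></td></tr>
  </tbody>
</table>'

# read_html() also accepts a literal HTML string.
page <- read_html(html)

# Same pipeline shape as in the article: select nodes, then take their text.
cited_text <- page %>%
  html_nodes("#gsc_a_b .gsc_a_c") %>%
  html_text()

# An empty cell (a paper with no citations yet) becomes NA.
citations <- as.numeric(cited_text)
citations
```

If such NA values appear in real data, they may need to be replaced (for example with 0) before plotting, depending on what the chart is meant to show.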

Coauthors, their affiliations, and how many times they have been cited

Again, we use SelectorGadget to find the CSS selector that matches the coauthors:

    page <- read_html("https://scholar.google.com/citations?view_op=list_colleagues&hl=en&user=sTR9SIQAAAAJ")
    Coauthors <- page %>%
      html_nodes(css = ".gsc_1usr_name a") %>%
      html_text()
    Coauthors <- as.data.frame(Coauthors)
    names(Coauthors) <- "Coauthors"

Take a look at the coauthors:

    head(Coauthors)
                      Coauthors
    1               Jason Evans
    2             Mutlu Ozdogan
    3            Rasmus Houborg
    4          M. Tugrul Yilmaz
    5 Joseph A. Santanello, Jr.
    6              Seth Guikema

    dim(Coauthors)
    [1] 27  1

As of January 1, 2016, his profile listed 27 coauthors.

How many times have his coauthors been cited?

    page <- read_html("https://scholar.google.com/citations?view_op=list_colleagues&hl=en&user=sTR9SIQAAAAJ")
    citations <- page %>%
      html_nodes(css = ".gsc_1usr_cby") %>%
      html_text()

    citations
     [1] "Cited by 2231"  "Cited by 1273"  "Cited by 816"   "Cited by 395"   "Cited by 652"   "Cited by 1531"
     [7] "Cited by 674"   "Cited by 467"   "Cited by 7967"  "Cited by 3968"  "Cited by 2603"  "Cited by 3468"
    [13] "Cited by 3175"  "Cited by 121"   "Cited by 32"    "Cited by 469"   "Cited by 50"    "Cited by 11"
    [19] "Cited by 1187"  "Cited by 1450"  "Cited by 12407" "Cited by 1939"  "Cited by 9"     "Cited by 706"
    [25] "Cited by 336"   "Cited by 186"   "Cited by 192"
Extract the numeric part of each string by globally replacing the "Cited by" prefix:

    citations <- gsub("Cited by", "", citations)
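The prefix removal above relies on every string having exactly the form "Cited by N". As a hedged alternative, the sketch below compares that approach with one that drops every non-digit character. The input vector raw and the names by_prefix and by_digits are invented for illustration; the empty string stands in for a profile with no citation text.

```r
# Made-up sample strings; the last one imitates a missing citation label.
raw <- c("Cited by 2231", "Cited by 9", "Cited by 12407", "")

# Approach from the article: remove the fixed prefix, then convert.
# as.numeric() accepts the leading space that gsub() leaves behind.
by_prefix <- as.numeric(gsub("Cited by", "", raw))

# Alternative: remove everything that is not a digit, then convert.
# This does not depend on the exact wording of the label.
by_digits <- as.numeric(gsub("[^0-9]", "", raw))

by_prefix
by_digits
```

Both approaches turn the empty string into NA, which makes missing values visible instead of silently treating them as zero.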

Printing citations again shows that the prefix is gone (a leading space remains):

    citations
     [1] " 2231"  " 1273"  " 816"   " 395"   " 652"   " 1531"  " 674"   " 467"   " 7967"  " 3968"  " 2603"  " 3468"  " 3175"
    [14] " 121"   " 32"    " 469"   " 50"    " 11"    " 1187"  " 1450"  " 12407" " 1939"  " 9"     " 706"   " 336"   " 186"
    [27] " 192"

Convert the strings to numbers, then to a data frame that ggplot2 can use:

    citations <- as.numeric(citations)
    citations <- as.data.frame(citations)

Affiliations of the coauthors:

    page <- read_html("https://scholar.google.com/citations?view_op=list_colleagues&hl=en&user=sTR9SIQAAAAJ")
    affilation <- page %>%
      html_nodes(css = ".gsc_1usr_aff") %>%
      html_text()
    affilation <- as.data.frame(affilation)
    names(affilation) <- "Affilation"

Build a data frame that combines the coauthors, their citations, and their affiliations:

    cauthors <- cbind(Coauthors, citations, affilation)
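Combining three separately scraped vectors assumes that they line up row by row. A minimal sketch of the same assembly step, with an explicit length check first, might look like the following. The names coauthor_names, cited_by, affiliations, and cauthors_demo, as well as all values, are placeholders invented for the example; the column names follow the ones used in the article.

```r
# Placeholder vectors standing in for the three scraped results.
coauthor_names <- c("Author A", "Author B", "Author C")
cited_by       <- c(120, 45, 300)
affiliations   <- c("University X", "Institute Y", "Lab Z")

# Stop early if the three vectors do not have the same length,
# since rows would otherwise not describe the same person.
stopifnot(
  length(coauthor_names) == length(cited_by),
  length(cited_by) == length(affiliations)
)

# Build the data frame in a single call instead of three conversions
# followed by cbind().
cauthors_demo <- data.frame(
  Coauthors  = coauthor_names,
  citations  = cited_by,
  Affilation = affiliations,
  stringsAsFactors = FALSE
)

cauthors_demo
```

A length check does not prove that the rows are correctly matched, but it catches the common case where one selector returns fewer nodes than the others.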

Printing cauthors shows the combined data frame:

    cauthors
      Coauthors                  citations  Affilation
    1 Jason Evans                     2231  University of New South Wales
    2 Mutlu Ozdogan                   1273  Assistant Professor of Environmental Science and Forest Ecology, University of Wisconsin
    3 Rasmus Houborg                   816  Research Scientist at King Abdullah University of Science and Technology
    4 M. Tugrul Yilmaz                 395  Assistant Professor, Civil Engineering Department, Middle East Technical University, Turkey
    5 Joseph A. Santanello, Jr.        652  NASA-GSFC Hydrological Sciences Laboratory
    .....

Re-ordering the coauthors by citation count

Re-order the coauthors by citation count so that the chart shows them in decreasing order:

    cauthors$Coauthors <- factor(cauthors$Coauthors,
                                 levels = cauthors$Coauthors[order(cauthors$citations, decreasing = FALSE)])

    ggplot(cauthors, aes(Coauthors, citations)) +
      geom_bar(stat = "identity", fill = "#ff8c1a", size = 5) +
      theme(axis.title.y = element_blank()) +
      ylab("# of citations") +
      theme(plot.title = element_text(size = 18, colour = "blue"),
            axis.text.y = element_text(colour = "grey20", size = 12)) +
      ggtitle("Citations of his coauthors") +
      coord_flip()
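The re-ordering used before the plot works because ggplot2 draws a discrete axis in the order of the factor levels. The idea can be checked on a toy data frame; the data frame toy and its values below are invented for illustration.

```r
# Toy data with made-up names and citation counts.
toy <- data.frame(
  Coauthors = c("Author A", "Author B", "Author C"),
  citations = c(120, 45, 300),
  stringsAsFactors = FALSE
)

# order() returns the row indices that sort citations in increasing order,
# so the names are listed from least cited to most cited and then used
# as the factor levels.
toy$Coauthors <- factor(
  toy$Coauthors,
  levels = toy$Coauthors[order(toy$citations)]
)

levels(toy$Coauthors)
# [1] "Author B" "Author A" "Author C"
```

With coord_flip(), the first level is drawn at the bottom of the chart and the last level at the top, so increasing levels produce a chart that reads from most cited at the top to least cited at the bottom.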

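His coauthors include a scientist who has been cited more than 12,000 times. They also include students of his like me (the last one in the chart), who are still at the "learning to walk" stage.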

Summary

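In this post, we saw how to scrape Google Scholar data. I scraped my advisor's profile and obtained the citation counts of his papers, the affiliations of his coauthors, and how many times those coauthors have been cited.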

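As this post shows, the rvest package makes it easy to scrape data from HTML web pages. Just as importantly, SelectorGadget helps us find the CSS selectors for the data we are interested in.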

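Correction: my advisor told me that Google Scholar lists only a small fraction of his coauthors. Some scientists he has published with, and some of his highly cited papers, do not show up. Moreover, some of the results above look implausible (for example, more senior people who have published more papers having fewer citations than junior people). Google Scholar data should therefore be used with caution.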

This article was translated and compiled by 雪晴資料網. For the original, see "Google scholar scraping with rvest package" by Fisseha Berhane. When reposting, please cite the original link: http://www.xueqing.cc/cms/article/109
