Reading a PDF with R and Mining Its Data

Posted by jieforest on 2012-10-01
Reading a PDF with R and mining its data. An example follows:

# here is a pdf for mining
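url <- "..." # address of the PDF to download (not preserved in this copy)
dest <- tempfile(fileext = ".pdf") # assumed: a temporary file, matching the tempdir() cleanup at the end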
download.file(url, dest, mode = "wb")

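# set path to pdftotext.exe and convert pdf to text
exe <- "..." # full path to pdftotext.exe (path not preserved in this copy)
# wait = TRUE blocks until the conversion has finished, so the .txt file exists before it is opened
system(paste("\"", exe, "\" \"", dest, "\"", sep = ""), wait = TRUE)

# get txt-file name and open it
filetxt <- sub("\\.pdf$", ".txt", dest)
shell.exec(filetxt)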

# do something with it, e.g. a simple word cloud
library(tm)
library(wordcloud)
library(Rstem) # provides wordStem(); if Rstem cannot be installed, SnowballC also offers a wordStem()

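txt <- readLines(filetxt)

txt <- tolower(txt)
txt <- removeWords(txt, c("\\f", stopwords()))

corpus <- Corpus(VectorSource(txt))
corpus <- tm_map(corpus, removePunctuation)
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)
d <- data.frame(freq = sort(rowSums(m), decreasing = TRUE))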

# Stem words
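d$stem <- wordStem(row.names(d), language = "english")

# and put words to column, otherwise they would be lost when aggregating
d$word <- row.names(d)

# remove web address (very long string):
d <- d[nchar(row.names(d)) < 20, ] # the 20-character cutoff is an assumed value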

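# aggregate frequency by word stem and
# keep first words..
agg_freq <- aggregate(freq ~ stem, data = d, sum)
agg_word <- aggregate(word ~ stem, data = d, function(x) x[1])

d <- cbind(freq = agg_freq[, 2], agg_word)

# sort by frequency
d <- d[order(d$freq, decreasing = TRUE), ]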

# print wordcloud:
wordcloud(d$word, d$freq)

# remove files
file.remove(dir(tempdir(), full.names = TRUE))
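
The step that is easiest to get wrong in the script above is the aggregation by word stem, so here is a minimal sketch of it on a toy table that needs no PDF and no extra packages. The words, counts, and stems below are made up for illustration, and the stems are assigned by hand rather than computed with wordStem(). The sketch also joins the two aggregates with merge() instead of relying on both results coming back in the same row order; that is a substitution, not what the original post necessarily did.

```r
# Toy term-frequency table standing in for the one built from the PDF text.
# All values are hypothetical; stems are hand-assigned, not produced by wordStem().
d <- data.frame(
  freq = c(5, 3, 2, 4, 1),
  word = c("jobs", "job", "taxes", "tax", "america"),
  stem = c("job", "job", "tax", "tax", "america"),
  stringsAsFactors = FALSE
)

# Sum the frequencies of all words that share a stem.
agg_freq <- aggregate(freq ~ stem, data = d, FUN = sum)

# Keep the first word seen for each stem as its display label.
agg_word <- aggregate(word ~ stem, data = d, FUN = function(x) x[1])

# Join the two results on the stem, then sort by descending frequency.
result <- merge(agg_freq, agg_word, by = "stem")
result <- result[order(result$freq, decreasing = TRUE), ]
print(result)
```

The resulting `result$word` and `result$freq` columns can be passed to wordcloud() in the same way as `d$word` and `d$freq` above.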

From the ITPUB blog, link: http://blog.itpub.net/301743/viewspace-745512/. If you repost this, please credit the source; otherwise legal liability will be pursued.
