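A simple example!
Environment: CentOS 6.5
Hadoop cluster, Hive, R, and RHive. For installation and debugging details, see the other documents on this blog.
Terminology:
Prior probability: a probability obtained from the analysis of past data.
Posterior probability: the probability after it has been revised in light of new information. Bayesian classification assigns an item to the class with the highest posterior probability.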
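The relationship between the two terms can be written with Bayes' theorem. The symbols here are my own notation and do not appear in the original post: C is a class and x is the observed information.

```latex
P(C \mid x) = \frac{P(x \mid C)\, P(C)}{P(x)}
% P(C)        : prior probability of class C
% P(x \mid C) : probability of observing x when the class is C
% P(C \mid x) : posterior probability of C after observing x
```

Because P(x) is the same for every class, comparing classes only requires the numerator.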
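Steps of the Bayesian classification algorithm:
Step 1: Preparation
This stage does the groundwork for naive Bayes classification. The main tasks are to choose the feature attributes for the problem at hand and to partition each attribute into suitable values. A subset of the items to be classified is then labeled by hand to form the training samples.
The input to this stage is the full set of items to be classified; the output is the feature attributes and the training samples. The quality of the classifier depends heavily on the feature attributes, how they are partitioned, and the quality of the training samples.
Step 2: Classifier training
The main work is to compute how often each class appears in the training samples and to estimate the conditional probability of each feature value given each class. The input is the feature attributes and the training samples; the output is the classifier.
Step 3: Application
This stage uses the classifier to classify new items. Its input is the classifier and the items to be classified; its output is the mapping from each item to a class.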
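The training and application stages above could be sketched in R roughly as follows. This is a minimal illustration, not the script used in this post: the toy data frame and the helper names train_nb and predict_nb are hypothetical.

```r
# Hypothetical toy training set (not the data used in this post).
train <- data.frame(
  wind  = c("weak", "strong", "weak", "weak", "strong", "weak"),
  label = c("yes",  "no",     "yes",  "yes",  "no",     "no")
)

# Training stage: class frequencies and per-class conditional frequencies.
train_nb <- function(df, label_col) {
  labels   <- as.character(df[[label_col]])
  features <- setdiff(names(df), label_col)
  classes  <- sort(unique(labels))
  # Fraction of training rows that belong to each class.
  prior <- sapply(classes, function(cl) mean(labels == cl))
  # One table per feature: rows are classes, columns are feature values.
  # margin = 1 makes each row sum to 1, giving P(value | class).
  cond <- lapply(features, function(f) {
    prop.table(table(labels, as.character(df[[f]])), margin = 1)
  })
  names(cond) <- features
  list(prior = prior, cond = cond)
}

# Application stage: score every class and return the best one.
predict_nb <- function(model, item) {
  scores <- sapply(names(model$prior), function(cl) {
    p <- model$prior[[cl]]
    for (f in names(item)) {
      p <- p * model$cond[[f]][cl, item[[f]]]
    }
    p
  })
  names(which.max(scores))
}

model <- train_nb(train, "label")
predict_nb(model, c(wind = "strong"))  # "no"
```

The sketch assumes every feature value in the item also occurs in the training set; an unseen value would cause a subscript error in the table lookup.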
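One point deserves special attention: the core of naive Bayes is the assumption that all components of the feature vector are independent of one another given the class.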
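Written out, the independence assumption lets the class-conditional probability factor into one term per feature. The notation is my own, not the post's: x_1, ..., x_n are the feature values and C ranges over the classes.

```latex
P(x_1, \dots, x_n \mid C) = \prod_{i=1}^{n} P(x_i \mid C)

\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

The second line is the unnormalized posterior score: the classifier picks the class that maximizes it.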
Example: writing the R script:
#!/usr/bin/Rscript

# Build the training set
data <- matrix(c("sunny","hot","high","weak","no",
                 "sunny","hot","high","strong","no",
                 "overcast","hot","high","weak","yes",
                 "rain","mild","high","weak","yes",
                 "rain","cool","normal","weak","yes",
                 "rain","cool","normal","strong","no",
                 "overcast","cool","normal","strong","yes",
                 "sunny","mild","high","weak","no",
                 "sunny","cool","normal","weak","yes",
                 "rain","mild","normal","weak","yes",
                 "sunny","mild","normal","strong","yes",
                 "overcast","mild","high","strong","yes",
                 "overcast","hot","normal","weak","yes",
                 "rain","mild","high","strong","no"),
               byrow = TRUE,
               dimnames = list(day = NULL,
                               condition = c("outlook","temperature","humidity","wind","playtennis")),
               nrow = 14, ncol = 5)

# Compute the prior probabilities
prior.yes <- sum(data[,5] == "yes") / length(data[,5])
prior.no  <- sum(data[,5] == "no") / length(data[,5])

# Naive Bayes model
naive.bayes.prediction <- function(condition.vec) {
  # Calculate unnormalized posterior probability for playtennis = yes.
  playtennis.yes <-
    sum((data[,1] == condition.vec[1]) & (data[,5] == "yes")) / sum(data[,5] == "yes") * # P(outlook = f_1 | playtennis = yes)
    sum((data[,2] == condition.vec[2]) & (data[,5] == "yes")) / sum(data[,5] == "yes") * # P(temperature = f_2 | playtennis = yes)
    sum((data[,3] == condition.vec[3]) & (data[,5] == "yes")) / sum(data[,5] == "yes") * # P(humidity = f_3 | playtennis = yes)
    sum((data[,4] == condition.vec[4]) & (data[,5] == "yes")) / sum(data[,5] == "yes") * # P(wind = f_4 | playtennis = yes)
    prior.yes # P(playtennis = yes)

  # Calculate unnormalized posterior probability for playtennis = no.
  playtennis.no <-
    sum((data[,1] == condition.vec[1]) & (data[,5] == "no")) / sum(data[,5] == "no") * # P(outlook = f_1 | playtennis = no)
    sum((data[,2] == condition.vec[2]) & (data[,5] == "no")) / sum(data[,5] == "no") * # P(temperature = f_2 | playtennis = no)
    sum((data[,3] == condition.vec[3]) & (data[,5] == "no")) / sum(data[,5] == "no") * # P(humidity = f_3 | playtennis = no)
    sum((data[,4] == condition.vec[4]) & (data[,5] == "no")) / sum(data[,5] == "no") * # P(wind = f_4 | playtennis = no)
    prior.no # P(playtennis = no)

  # Return both scores and the class with the larger one (ties go to "yes").
  return(list(post.pr.yes = playtennis.yes,
              post.pr.no = playtennis.no,
              prediction = ifelse(playtennis.yes >= playtennis.no, "yes", "no")))
}

# Predict
naive.bayes.prediction(c("overcast", "mild", "normal", "weak"))
Result:
$post.pr.yes
[1] 0.05643739

$post.pr.no
[1] 0

$prediction
[1] "yes"

The prediction is: yes
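The two scores can be checked by hand from counts in the training set. The snippet below is only a sanity check with the counts typed in manually; the variable names are my own. It also shows that the score for "no" is exactly 0 because "overcast" never occurs together with "no" in the 14 rows.

```r
# Counts read off the 14-row training set: 9 "yes" rows and 5 "no" rows.
n <- 14
n.yes <- 9
n.no <- 5

# yes: overcast 4/9, mild 4/9, normal 6/9, weak 6/9, prior 9/14
score.yes <- (4 / n.yes) * (4 / n.yes) * (6 / n.yes) * (6 / n.yes) * (n.yes / n)

# no: overcast 0/5, mild 2/5, normal 1/5, weak 2/5, prior 5/14
score.no <- (0 / n.no) * (2 / n.no) * (1 / n.no) * (2 / n.no) * (n.no / n)

print(score.yes)  # 0.05643739
print(score.no)   # 0

# Dividing by the sum turns the unnormalized scores into probabilities.
post.yes <- score.yes / (score.yes + score.no)
print(post.yes)   # 1
```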