The Naive Bayes Classifier (NBC) is rooted in classical mathematical theory, so it rests on a solid mathematical foundation and delivers stable classification performance. At the same time, an NBC model requires very few estimated parameters, is not very sensitive to missing data, and the algorithm itself is fairly simple. It is called "naive" because the entire formalization makes only the most primitive, simplest assumption: that features are conditionally independent given the class. Naive Bayes remains effective even when data is scarce, and it can handle multi-class problems.
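For context, the underlying rule (standard Bayes' theorem plus the independence assumption; a sketch added here, not part of the original post) can be written as: for a document with words w1, …, wn and class c,

P(c | w1, …, wn) ∝ P(c) · P(w1 | c) · P(w2 | c) · … · P(wn | c)

so classifying a document reduces to multiplying per-word conditional probabilities under each class and picking the class with the larger product. This is exactly what the training function below estimates.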
A detailed walkthrough of the Naive Bayes algorithm: https://boywithacoin.cn/article/fen-lei-su...
Spam email filtering, the overall workflow:
- Collect the data; the dataset used here is at https://github.com/Freen247/database/tree/...
- Parse the text files into token vectors
- Inspect the tokens to make sure parsing is correct
- Train / test / use the algorithm
Since the program relies on third-party libraries (the code below also uses numpy), we first need to install the dependencies:
pip install feedparser numpy
0x00 Implementing the word-list-to-vector conversion
Using an object-oriented approach, we construct a Bayes class:
#!/usr/bin/python
# -*- coding: utf-8 -*-
#__author__ : stray_camel
#pip_source : https://mirrors.aliyun.com/pypi/simple
import os
import random
import re
import numpy as np

class Bayes():
    def __init__(self,
                 # directory of the current file
                 absPath: str = os.path.dirname(os.path.abspath(__file__)),
                 ):
        self.absPath = absPath
Create a function that returns a list of every unique word appearing across all the documents:
    # collect all documents into one list without duplicate words
    def createVocabList(self,
                        dataSet: list,  # the source data: a list of tokenized documents
                        ) -> list:      # the deduplicated vocabulary list
        vocabSet = set()  # create an empty set; a set stores no duplicate words
        for document in dataSet:
            vocabSet = vocabSet | set(document)  # take the union of the two sets
        return list(vocabSet)
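As a quick sanity check, here is a minimal usage sketch; the two toy documents are made up for illustration:

# two toy tokenized documents, invented for this example
docs = [['my', 'dog', 'has', 'flea', 'problems'],
        ['stop', 'posting', 'stupid', 'garbage']]
b = Bayes()
print(b.createVocabList(docs))  # every unique word; order varies because sets are unordered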
We also need a function that takes the vocabulary list and the words to check as input, and builds one feature per vocabulary word. Given a document, that document is then converted into a word vector.
    # determine whether each vocabulary term appears in the document
    def setOfWords2Vec(self,
                       vocabList: list,  # the vocabulary
                       inputSet: list,   # the words you want to check
                       ) -> list:        # the resulting word vector
        # returns a document vector whose entries mark with 1/0
        # whether each vocabulary word appeared in the input document
        returnVec = [0] * len(vocabList)
        for word in inputSet:
            if word in vocabList:
                returnVec[vocabList.index(word)] = 1
            else:
                print("the word: %s is not in my vocabulary!" % word)
        return returnVec
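Continuing the toy example above (again, purely illustrative):

b = Bayes()
vocab = b.createVocabList([['my', 'dog', 'has', 'flea', 'problems']])
print(b.setOfWords2Vec(vocab, ['my', 'dog', 'is', 'cute']))
# 1 at the positions of 'my' and 'dog'; 'is' and 'cute' trigger the not-in-vocabulary warning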
0x01 Implementing the Bayes classifier training function
The naive Bayes classifier training function:
    # naive bayes classification training function
    def trainNB0(self, trainMatrix, trainCategory):
        numTrainDocs = len(trainMatrix)
        numWords = len(trainMatrix[0])
        # prior probability that a document is abusive/spam (class 1)
        pAbusive = sum(trainCategory) / float(numTrainDocs)
        # initialize counts to 1 and denominators to 2 (Laplace smoothing),
        # so an unseen word never forces a zero probability
        p0Num = np.ones(numWords)
        p1Num = np.ones(numWords)
        p0Denom = 2.0
        p1Denom = 2.0
        for i in range(numTrainDocs):  # iterate through all documents
            if trainCategory[i] == 1:
                p1Num += trainMatrix[i]
                p1Denom += sum(trainMatrix[i])
            else:
                p0Num += trainMatrix[i]
                p0Denom += sum(trainMatrix[i])
        # take logs to avoid underflow when many small probabilities are multiplied
        p1Vect = np.log(p1Num / p1Denom)
        p0Vect = np.log(p0Num / p0Denom)
        return p0Vect, p1Vect, pAbusive
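A minimal sketch of calling trainNB0 on the toy data from earlier (documents and labels are invented for illustration):

import numpy as np
docs = [['my', 'dog', 'has', 'flea', 'problems'],
        ['stop', 'posting', 'stupid', 'garbage']]
labels = [0, 1]  # made-up labels: the second toy document is "spam"
b = Bayes()
vocab = b.createVocabList(docs)
trainMat = [b.setOfWords2Vec(vocab, d) for d in docs]
p0V, p1V, pSpam = b.trainNB0(np.array(trainMat), np.array(labels))
print(pSpam)  # 0.5: one of the two training documents is class 1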
0x02 Implementing the spam email test function
spamTest() automates testing of the Bayes spam classifier. It imports the text files under the spam and ham folders and parses them into word lists. The example contains 50 emails in total, 10 of which are randomly selected as the test set; the probabilities the classifier needs are computed only from the training-set documents. This process of randomly selecting part of the data for training and holding out the rest for testing is called hold-out cross-validation.
    # filtering email: training + testing
    def spamTest(self):
        docList = []; classList = []; fullText = []
        # iterate through all the sample files: 25 spam and 25 ham
        for i in range(1, 26):
            wordList = self.textParse(open(self.absPath + '/email/spam/%d.txt' % i, "rb").read().decode('GBK', 'ignore'))
            docList.append(wordList)
            fullText.extend(wordList)
            classList.append(1)
            wordList = self.textParse(open(self.absPath + '/email/ham/%d.txt' % i, "rb").read().decode('GBK', 'ignore'))
            docList.append(wordList)
            fullText.extend(wordList)
            classList.append(0)
        vocabList = self.createVocabList(docList)
        trainingSet = list(range(50))
        testSet = []
        # hold out 10 documents at random for testing
        for i in range(10):
            # random.uniform(x, y) returns a random float between x and y
            randIndex = int(random.uniform(0, len(trainingSet)))
            testSet.append(trainingSet[randIndex])
            del(trainingSet[randIndex])
        trainMat = []; trainClasses = []
        for docIndex in trainingSet:
            trainMat.append(self.setOfWords2Vec(vocabList, docList[docIndex]))
            trainClasses.append(classList[docIndex])
        p0V, p1V, pSpam = self.trainNB0(np.array(trainMat), np.array(trainClasses))
        errorCount = 0
        for docIndex in testSet:
            wordVector = self.setOfWords2Vec(vocabList, docList[docIndex])
            if self.classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
                errorCount += 1
                print("misclassified: %s" % docList[docIndex])  # print the misclassified document itself
        print('the error rate is:', float(errorCount) / len(testSet))
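Note that spamTest() calls two helper methods, textParse and classifyNB, that are not shown above. Here is a minimal sketch of what they could look like on the Bayes class, following the standard naive Bayes log-probability comparison; the exact bodies are my assumption, not necessarily the author's originals:

    # split a big string into lowercase tokens, dropping very short ones
    def textParse(self, bigString: str) -> list:
        listOfTokens = re.split(r'\W+', bigString)
        return [tok.lower() for tok in listOfTokens if len(tok) > 2]

    # compare the log-probabilities of the two classes and return the winner
    def classifyNB(self, vec2Classify, p0Vec, p1Vec, pClass1):
        p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)
        p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
        return 1 if p1 > p0 else 0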
A final run of the function produces output like the following:
if __name__ == "__main__":
    test = Bayes()
    test.spamTest()
misclassified: scifinance now automatically generates gpu-enabled pricing & risk model source code that runs up to 50-300
misclassified: tended in the latest release. this includes:
the error rate is: 0.2