Filtering Spam Email with Naive Bayes

Posted by 娃哈哈店長 on 2020-02-03

The naive Bayes classifier (NBC) originates in classical mathematical theory, rests on a solid mathematical foundation, and delivers stable classification performance. The NBC model also needs to estimate very few parameters, is not very sensitive to missing data, and the algorithm itself is fairly simple. It is called "naive" because the entire formalization makes only the most primitive, simplest assumption: the features are treated as independent of one another given the class. Naive Bayes remains effective even when data is scarce, and it can handle multi-class problems.
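Stated formally (this equation is my summary of the standard model, not from the original post): for a document represented as a word vector $w = (w_1, \dots, w_n)$ and a class $c$, the naive independence assumption lets the posterior factor into per-word terms:

```latex
P(c \mid w) \;\propto\; P(c)\,\prod_{i=1}^{n} P(w_i \mid c)
```

The classifier simply picks the class $c$ with the larger value; in practice the product is computed as a sum of logarithms, which is exactly what the training function below returns.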

A detailed explanation of the naive Bayes algorithm: https://boywithacoin.cn/article/fen-lei-su...

The concrete workflow for email spam filtering

Because the program relies on third-party libraries, we first need to install the dependencies: pip install numpy feedparser (the code below uses numpy).

0x00 Implementing word-list to vector conversion

Using an object-oriented approach, construct a Bayes class:

#!/usr/bin/python
# -*- coding: utf-8 -*-
#__author__ : stray_camel
#pip_source : https://mirrors.aliyun.com/pypi/simple
import sys, os
import re
import random
import numpy as np

class Bayes():
    def __init__(self,
                 # directory of the current file
                 absPath: str = os.path.dirname(os.path.abspath(__file__))):
        self.absPath = absPath

Create a function that returns a list of all the unique words appearing across the documents:

    # return a deduplicated list of every word that appears in the documents
    def createVocabList(self, dataSet: list) -> list:
        vocabSet = set()  # a set is a collection with no duplicate entries
        for document in dataSet:
            vocabSet = vocabSet | set(document)  # union of the two sets
        return list(vocabSet)

We also need a function that takes the vocabulary list and the words we want to check as input, and builds a feature for each word: once a document is given, it is converted into a word vector.


    # determine whether each word of the input appears in the vocabulary
    def setOfWords2Vec(self, vocabList: list, inputSet: list) -> list:
        returnVec = [0] * len(vocabList)
        for word in inputSet:
            if word in vocabList:
                returnVec[vocabList.index(word)] = 1
            else:
                print("the word: %s is not in my vocabulary!" % word)
        # return a document vector whose 1/0 entries indicate whether
        # each vocabulary word appeared in the input document
        return returnVec
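As a quick sanity check, the two methods above can be exercised on a toy corpus. The standalone functions and sample documents here are my own illustration, mirroring the class methods:

```python
# standalone versions of the two methods above, for a quick sanity check
def createVocabList(dataSet):
    vocabSet = set()
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
    return returnVec

docs = [['my', 'dog', 'is', 'cute'], ['stop', 'posting', 'stupid', 'garbage']]
vocab = createVocabList(docs)  # 8 unique words, in arbitrary order
vec = setOfWords2Vec(vocab, ['my', 'stupid', 'dog'])
# vec contains exactly three 1s, at the positions of 'my', 'stupid', 'dog'
```

Note that the vocabulary order is arbitrary (it comes from a set), so the vector must always be read against the same vocabList it was built from.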

0x01 Implementing the Bayes classifier training function

The naive Bayes classifier training function:

    # naive Bayes classifier training function
    def trainNB0(self, trainMatrix, trainCategory):
        numTrainDocs = len(trainMatrix)
        numWords = len(trainMatrix[0])
        pAbusive = sum(trainCategory) / float(numTrainDocs)
        # initialize counts to 1 and denominators to 2 (Laplace smoothing),
        # so a word unseen in one class cannot zero out the whole product
        p0Num = np.ones(numWords)
        p1Num = np.ones(numWords)
        p0Denom = 2.0
        p1Denom = 2.0
        for i in range(numTrainDocs):  # iterate through all documents
            if trainCategory[i] == 1:
                p1Num += trainMatrix[i]
                p1Denom += sum(trainMatrix[i])
            else:
                p0Num += trainMatrix[i]
                p0Denom += sum(trainMatrix[i])
        # take logs to avoid floating-point underflow when many small
        # probabilities are later combined
        p1Vect = np.log(p1Num / p1Denom)
        p0Vect = np.log(p0Num / p0Denom)
        return p0Vect, p1Vect, pAbusive
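The spamTest() function below calls a classifyNB method that the post does not show. A minimal sketch consistent with trainNB0's log-probability outputs (this is the standard decision rule for this algorithm, but the exact implementation here is my assumption) would be:

```python
import numpy as np

# classify a word vector by comparing log-posteriors; because trainNB0
# returns log-probabilities, the per-word product becomes a sum
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = np.sum(vec2Classify * p1Vec) + np.log(pClass1)
    p0 = np.sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    return 1 if p1 > p0 else 0
```

Inside the Bayes class this would take self as its first parameter; the logic is otherwise identical.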

0x02 Implementing the spam test function

spamTest() automates the Bayes spam classifier end to end. It imports the text files under the spam and ham folders and parses them into word lists. The example has 50 emails in total, of which 10 are randomly selected as the test set; the probabilities the classifier needs are computed using only the documents in the training set. This process of randomly selecting one portion of the data as the training set while holding out the remainder as the test set is called hold-out cross-validation.


    # filter email: training + testing
    def spamTest(self):
        docList = []; classList = []; fullText = []
        # iterate through all the sample files: 25 spam and 25 ham
        for i in range(1, 26):
            wordList = self.textParse(open(self.absPath + '/email/spam/%d.txt' % i, "rb").read().decode('GBK', 'ignore'))
            docList.append(wordList)
            fullText.extend(wordList)
            classList.append(1)
            wordList = self.textParse(open(self.absPath + '/email/ham/%d.txt' % i, "rb").read().decode('GBK', 'ignore'))
            docList.append(wordList)
            fullText.extend(wordList)
            classList.append(0)
        vocabList = self.createVocabList(docList)

        trainingSet = list(range(50))
        testSet = []
        # randomly hold out 10 of the 50 documents as the test set
        for i in range(10):
            # random.uniform(x, y) returns a random float between x and y
            randIndex = int(random.uniform(0, len(trainingSet)))
            testSet.append(trainingSet[randIndex])
            del(trainingSet[randIndex])
        trainMat = []; trainClasses = []
        for docIndex in trainingSet:
            trainMat.append(self.setOfWords2Vec(vocabList, docList[docIndex]))
            trainClasses.append(classList[docIndex])
        p0V, p1V, pSpam = self.trainNB0(np.array(trainMat), np.array(trainClasses))
        errorCount = 0
        for docIndex in testSet:
            wordVector = self.setOfWords2Vec(vocabList, docList[docIndex])
            if self.classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
                errorCount += 1
                print("misclassified: %s" % docList[docIndex])
        print('the error rate is:', float(errorCount) / len(testSet))
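spamTest() also relies on a textParse helper that the post never shows. A sketch matching how it is called (raw decoded string in, token list out); the exact tokenization rules here are my assumption:

```python
import re

def textParse(bigString):
    # split on runs of non-word characters, then keep lowercase tokens
    # longer than two characters to drop punctuation and short noise
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]
```

Inside the Bayes class this would take self as its first parameter. Filtering out very short tokens is a common choice because fragments like "a" or "to" carry little class information in this bag-of-words setup.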

Running the function finally produces output like the following:

if __name__ == "__main__":
    test = Bayes()
    test.spamTest()
misclassified:
scifinance now automatically generates gpu-enabled pricing & risk model source code that runs up to 50-300
misclassified: tended in the latest release. this includes:

the error rate is: 0.2
This work is licensed under the CC agreement; reproduction must credit the author and link back to this article.
The article first appeared on my blog Stray_Camel(^U^)ノ~YO
