Applying Machine Learning to Intrusion Detection - Training an Intrusion-Detection Discriminative Model on the ADFA-LD Dataset

Andrew.Hann, published 2018-02-13

1. Overview of the ADFA-LD Dataset

ADFA-LD is a host-level intrusion-detection dataset released by the Australian Defence Force Academy (the ADFA IDS datasets cover both Linux and Windows; ADFA-LD is the Linux portion). It is a collection of system-call (syscall) traces that include intrusion events, where each trace is the sequence of syscall APIs issued by a single process within a time window.

In ADFA-LD each system call has already been encoded as a feature (a numeric ID) and every trace is labelled; the categories and trace counts are listed in the table below.

Category            Traces    Label
Training            833       normal
Validation          4373      normal
Hydra-FTP           162       attack
Hydra-SSH           148       attack
Adduser             91        attack
Java-Meterpreter    125       attack
Meterpreter         75        attack
Webshell            118       attack
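
As a sanity check, these per-category counts can be reproduced by walking the dataset directory. A minimal sketch, assuming the directory layout used by the scripts later in this article (Training_Data_Master/, Validation_Data_Master/, and per-run sub-directories under Attack_Data_Master/; the exact sub-directory naming may vary between releases):

import os
from collections import Counter

ADFA_ROOT = "../data/ADFA-LD"  # adjust to the local copy of the dataset

def count_traces(root):
    counts = Counter()
    counts["Training"] = len(os.listdir(os.path.join(root, "Training_Data_Master")))
    counts["Validation"] = len(os.listdir(os.path.join(root, "Validation_Data_Master")))
    attack_root = os.path.join(root, "Attack_Data_Master")
    for subdir in os.listdir(attack_root):
        # attack traces sit in per-run sub-directories, e.g. Web_Shell_3
        category = subdir.rsplit("_", 1)[0]
        counts[category] += len(os.listdir(os.path.join(attack_root, subdir)))
    return counts

if __name__ == '__main__':
    for category, n in sorted(count_traces(ADFA_ROOT).items()):
        print("%s: %d" % (category, n))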

Each data file in ADFA-LD independently records the sequence of system calls made over a period of time; every system call is represented by its numeric ID (as defined in unistd.h):

/*
 * This file contains the system call numbers, based on the
 * layout of the x86-64 architecture, which embeds the
 * pointer to the syscall in the table.
 *
 * As a basic principle, no duplication of functionality
 * should be added, e.g. we don't use lseek when llseek
 * is present. New architectures should use this file
 * and implement the less feature-full calls in user space.
 */

#ifndef __SYSCALL
#define __SYSCALL(x, y)
#endif

#if __BITS_PER_LONG == 32 || defined(__SYSCALL_COMPAT)
#define __SC_3264(_nr, _32, _64) __SYSCALL(_nr, _32)
#else
#define __SC_3264(_nr, _32, _64) __SYSCALL(_nr, _64)
#endif

#define __NR_io_setup 0
__SYSCALL(__NR_io_setup, sys_io_setup)
#define __NR_io_destroy 1
__SYSCALL(__NR_io_destroy, sys_io_destroy)
#define __NR_io_submit 2
__SYSCALL(__NR_io_submit, sys_io_submit)
#define __NR_io_cancel 3
__SYSCALL(__NR_io_cancel, sys_io_cancel)
#define __NR_io_getevents 4
__SYSCALL(__NR_io_getevents, sys_io_getevents)

/* fs/xattr.c */
#define __NR_setxattr 5
__SYSCALL(__NR_setxattr, sys_setxattr)
#define __NR_lsetxattr 6
__SYSCALL(__NR_lsetxattr, sys_lsetxattr)
#define __NR_fsetxattr 7
__SYSCALL(__NR_fsetxattr, sys_fsetxattr)
#define __NR_getxattr 8
__SYSCALL(__NR_getxattr, sys_getxattr)
#define __NR_lgetxattr 9
__SYSCALL(__NR_lgetxattr, sys_lgetxattr)

0x1: Attack types included

1. Hydra-FTP: FTP brute-force password-guessing attack
2. Hydra-SSH: SSH brute-force password-guessing attack
3. Adduser: adding a new superuser account as a backdoor
4. Meterpreter: uploads of Java and Linux executable Meterpreter payloads for remote compromise of the target host
5. Webshell: privilege escalation using the C100 webshell

0x2: Feature analysis of the dataset

Before doing any feature engineering, let us first run a brief analysis of the training data and try to find patterns that can guide the subsequent feature-engineering work.

1. Syscall sequence length

The sequence length reflects how many syscalls the process issued in total, from start-up until the attack completed (or the process was compromised). We visualize the probability density function (PDF) of the trace length for each label class.

The trace lengths fall roughly in the [100, 500] range, but there is no obvious line or surface that separates the different label classes, which suggests that trace length alone is unlikely to be a good feature.
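
A minimal plotting sketch, assuming the traces have already been loaded as lists of syscall tokens (for example with the load_adfa_* helpers shown later in this article); traces_by_label is a hypothetical dict, and matplotlib/scipy are used purely for the visualization:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

def plot_trace_length_pdf(traces_by_label):
    # traces_by_label: dict mapping a label (e.g. "normal", "webshell") to a list of
    # syscall traces, each trace being a list of syscall tokens
    for label, traces in traces_by_label.items():
        lengths = np.array([len(t) for t in traces], dtype=float)
        kde = gaussian_kde(lengths)            # estimate the probability density of the lengths
        xs = np.linspace(lengths.min(), lengths.max(), 200)
        plt.plot(xs, kde(xs), label=label)
    plt.xlabel("Trace length (number of syscalls)")
    plt.ylabel("Estimated density")
    plt.legend()
    plt.show()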

2. From a word-model perspective: do the different label classes in the sample set exhibit their own common patterns?

This step essentially asks whether the dataset is (approximately) linearly separable, i.e. whether the regularity we are after is clearly present in the samples; only if the data itself carries a separable signal is it worth modelling it with these algorithms.

The syscall trace in each sample is essentially a word sequence, so we apply a 2-gram transformation and plot the term-frequency histogram of the resulting 2-grams.

We find that in the Adduser class the 2-grams "168 168" and "168 265" are the most frequent, while in the Webshell class "5 5" and "5 3" occur most often. To some extent this suggests that the two classes are separable at the level of 2-gram term frequencies.
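
A minimal counting sketch, assuming each trace is a whitespace-separated string of syscall numbers (the same format the KNN scripts below consume); adduser_traces and webshell_traces are hypothetical lists of traces for the two classes:

from sklearn.feature_extraction.text import CountVectorizer

def top_2grams(traces, k=5):
    # traces: list of strings, each a whitespace-separated sequence of syscall numbers.
    # The default token_pattern drops single-character tokens, which would silently
    # discard one-digit syscall IDs such as "5", so an explicit digit pattern is used here.
    vectorizer = CountVectorizer(ngram_range=(2, 2), token_pattern=r"\b\d+\b")
    counts = vectorizer.fit_transform(traces).sum(axis=0).A1
    vocab = vectorizer.get_feature_names()
    return sorted(zip(vocab, counts), key=lambda x: -x[1])[:k]

# usage sketch:
# print(top_2grams(adduser_traces))   # expected to surface pairs like "168 168"
# print(top_2grams(webshell_traces))  # expected to surface pairs like "5 5"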

Relevant Link:

Evaluating host-based anomaly detection systems: A preliminary analysis of ADFA-LD
https://www.unsw.adfa.edu.au/australian-centre-for-cyber-security/cybersecurity/ADFA-IDS-Datasets/ 

 

2. How to Do Feature Engineering

A token set composed of syscall APIs is essentially a word sequence, so we can engineer features from the samples using word models.

0x1: Bag-of-words model - the basic unit is a single word (i.e. 1-gram), and term frequency is the value space of the vectorization

The Bag-of-Words model (BoW model) originated in Natural Language Processing and Information Retrieval. The model ignores grammar and word order, treating text as a mere collection of words and assuming each word occurrence is independent of the others; BoW represents a piece of text or a document as an unordered set of words.

First, consider two simple text documents:

John likes to watch movies. Mary likes too.
John also likes to watch football games.

Based on the words appearing in these two documents, build the following dictionary:

{"John": 1, "likes": 2,"to": 3, "watch": 4, "movies": 5,"also": 6, "football": 7, "games": 8,"Mary": 9, "too": 10}

The dictionary above contains 10 words, each with a unique index (the ordering of the indices carries no meaning), so each document can be represented by a 10-dimensional vector, as follows:

[1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
[1, 1, 1, 1, 0, 1, 1, 1, 0, 0]

These vectors do not depend on the order in which words appear in the original text; each component is simply the frequency, in that document, of the corresponding dictionary word (whether or not the word occurs in the sample).

scikit-learn extracts bag-of-words features with CountVectorizer(), which implements tokenization and counting in a single class:

# -*- coding:utf-8 -*-

from sklearn.feature_extraction.text import CountVectorizer

if __name__ == '__main__':
    vectorizer = CountVectorizer()
    corpus = [
        'This is the first document.',
        'This is the second second document.',
        'And the third one.',
        'Is this the first document?',
    ]
    X = vectorizer.fit_transform(corpus)
    # print the vocabulary learned from the corpus
    print vectorizer.vocabulary_

    # print the count-vector matrix produced by bag-of-words encoding of the corpus
    print X.toarray()

Sparsity

Most documents use only a small subset of the corpus vocabulary, so the resulting matrix has many zero values (often more than 99%). For example, a collection of 10,000 short texts (such as emails) might use a total vocabulary of 100,000 words while each individual document uses only 100 to 1,000 unique words.

To keep such a matrix in memory while still allowing fast matrix/vector algebra, scikit-learn stores and operates on these features as sparse matrices.
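
A small illustration of this, reusing the corpus from the example above: fit_transform returns a scipy sparse matrix, and .toarray() is only needed when a dense view is wanted:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = CountVectorizer().fit_transform(corpus)  # a scipy.sparse CSR matrix, not a dense ndarray
print(type(X))
density = X.nnz / float(X.shape[0] * X.shape[1])
print("stored non-zeros: %d of %d cells (density %.2f)" % (X.nnz, X.shape[0] * X.shape[1], density))
# X.toarray() materializes the dense matrix, which is what the earlier examples print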

Training-set coverage

The bag-of-words vocabulary is fixed at training time, so words that did not appear in the training corpus are completely ignored in later calls to the transform method:

vectorizer.transform(['Something completely new.']).toarray()
                           
array([[0, 0, 0, 0, 0, 0, 0, 0, 0]]...)

This limits, to some extent, the generalization ability of the bag-of-words model.

0x2: TF-IDF term weighting - adding weights on top of raw term frequencies

In large corpora some words occur extremely often (e.g. "the", "a", "is" in English) yet carry very little information. Using their raw counts directly in a classifier would drown out the rarer terms we actually care about, so the counts need to be re-weighted into floating-point values that are more useful to the classifier; this is what the tf-idf transform does.

The more common a word is, the larger the denominator and hence the smaller the inverse document frequency, approaching 0. The 1 is added to the denominator to avoid division by zero (i.e. when no document contains the word); log means taking the logarithm of the result.

In other words, a word's frequency within a single sample and its frequency across the whole corpus balance each other and jointly determine the word's weight.
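
Written out explicitly, this is the simple textbook formulation (the one described in the ruanyifeng reference listed at the end of this chapter); note that scikit-learn's TfidfVectorizer uses a slightly different smoothed variant and L2-normalizes each row vector. A minimal sketch:

import math

def tf(term, doc_tokens):
    # term frequency: share of the document's tokens that are this term
    return doc_tokens.count(term) / float(len(doc_tokens))

def idf(term, corpus_docs):
    # inverse document frequency: rarer across the corpus -> larger weight;
    # the +1 in the denominator avoids division by zero
    n_containing = sum(1 for doc in corpus_docs if term in doc)
    return math.log(len(corpus_docs) / (1.0 + n_containing))

def tf_idf(term, doc_tokens, corpus_docs):
    return tf(term, doc_tokens) * idf(term, corpus_docs)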

scikit-learn implements TF-IDF through TfidfTransformer() (which re-weights an existing count matrix) and TfidfVectorizer() (which combines CountVectorizer and TfidfTransformer, and is used below):

# -*- coding:utf-8 -*-

from sklearn.feature_extraction.text import TfidfVectorizer

if __name__ == '__main__':
    vectorizer = TfidfVectorizer(min_df=1)
    corpus = [
        'This is the first document.',
        'This is the second second document.',
        'And the third one.',
        'Is this the first document?',
    ]
    tfidf = vectorizer.fit_transform(corpus)

    # print the vector matrix produced by TF-IDF encoding of the corpus
    print tfidf.toarray()

'''
[[ 0.          0.43877674  0.54197657  0.43877674  0.          0.
   0.35872874  0.          0.43877674]
 [ 0.          0.27230147  0.          0.27230147  0.          0.85322574
   0.22262429  0.          0.27230147]
 [ 0.55280532  0.          0.          0.          0.55280532  0.
   0.28847675  0.55280532  0.        ]
 [ 0.          0.43877674  0.54197657  0.43877674  0.          0.
   0.35872874  0.          0.43877674]]
'''

Like the bag-of-words model, TF-IDF turns each raw sample into a fixed-length vector; the difference is that the raw term counts are replaced by TF-IDF weights.

0x3: N-gram model - adding multi-word context on top of the term-frequency model

A set of unigrams (i.e. a bag of words) cannot capture phrases and multi-word expressions, so we can combine words into sliding n-gram windows and compute term-frequency vectors over those n-grams.

CountVectorizer also implements n-grams; in other words, n-grams are simply a parameter option of the term-frequency model:

# -*- coding:utf-8 -*-

from sklearn.feature_extraction.text import CountVectorizer

if __name__ == '__main__':
    vectorizer = CountVectorizer(min_df=1, analyzer='word', ngram_range=(2, 3))
    corpus = [
        'This is the first document.',
        'This is the second second document.',
        'And the third one.',
        'Is this the first document?',
    ]
    tfidf = vectorizer.fit_transform(corpus)

    # the n-gram vocabulary
    print vectorizer.get_feature_names()

    # print the vector matrix produced by n-gram encoding of the corpus
    print tfidf.toarray()

'''
[u'and the', u'and the third', u'first document', u'is the', u'is the first', u'is the second', u'is this', u'is this the', u'second document', u'second second', u'second second document', u'the first', u'the first document', u'the second', u'the second second', u'the third', u'the third one', u'third one', u'this is', u'this is the', u'this the', u'this the first']
[[0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0]
 [0 0 0 1 0 1 0 0 1 1 1 0 0 1 1 0 0 0 1 1 0 0]
 [1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0]
 [0 0 1 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 1 1]]
'''

0x4: Word2Vec word-embedding model - features drawn from word-similarity distances in a high-dimensional space rather than from term frequency

Training word2vec amounts to training a shallow neural network that maps every word in the training set into a vector space of a specified dimensionality.

# -*- coding:utf-8 -*-

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
import gensim
import os

if __name__ == '__main__':
    modelpath = "./word2vec.test.txt"
    if os.path.isfile(modelpath):
        # load the saved model
        print "load modeling..."
        model = gensim.models.Word2Vec.load(modelpath)
    else:
        # the corpus: a list of token sequences (one list of words per document)
        corpus = [
            ['This', 'is', 'the', 'first', 'document.'],
            ['This', 'is', 'the', 'second', 'second', 'document.'],
            ['And', 'the', 'third', 'one.'],
            ['Is', 'this', 'the', 'first', 'document?']
        ]
        # embed the words into a 100-dimensional vector space
        model = gensim.models.Word2Vec(corpus, min_count=1, size=100)
        # save the model
        model.save(modelpath)

    print model['This']
'''
load modeling...
[  4.21720184e-03  -4.96086199e-03   3.77745135e-03   2.94174161e-03
  -1.84197503e-03  -2.94078956e-03   1.41434965e-03  -1.12752395e-03
   3.44854128e-03  -1.56023342e-03   2.58653867e-03   2.33289364e-04
   3.44703044e-03  -2.01581535e-03   4.42115450e-03  -2.88038654e-03
  -2.38809455e-03  -4.50134743e-03  -1.49860769e-03   7.91519240e-04
   4.98433039e-03   1.85355416e-03   2.31889612e-03  -1.69523829e-03
  -3.30593879e-03   4.40168194e-03  -4.88520879e-03   2.60615419e-03
   6.49481721e-04  -2.49359757e-03  -3.32681416e-03   2.01359508e-03
   3.97601305e-03   6.56171120e-04   3.81603022e-03   2.93262041e-04
  -2.28614034e-03  -2.23138509e-03  -2.07091100e-03  -2.18214374e-03
  -1.24846201e-03  -4.72204387e-03   1.10300467e-03   2.74274289e-03
   3.69609370e-05   2.28803046e-03   1.93586131e-03  -3.52792139e-03
   6.02113956e-04  -4.30466002e-03  -1.68499397e-03   4.44801664e-03
   3.73569527e-03  -2.87452945e-03  -4.44274070e-03   1.91680994e-03
   3.03726265e-04  -2.60479492e-03   3.86350509e-03  -3.56708956e-03
  -4.24962817e-03  -2.64985068e-03   4.89832275e-03   4.93438961e-03
  -8.93970719e-04  -4.92232037e-04  -2.22921767e-03  -2.13925354e-03
   3.71658040e-04   2.85526551e-03   3.21991998e-03   3.41509795e-03
  -4.62498562e-03  -2.23036925e-03   4.81000589e-03   3.47611774e-03
  -4.62327013e-03  -2.20024776e-05   4.42962535e-03   2.17637443e-03
   1.95589405e-03   3.56489979e-03   2.77884956e-03  -1.01689191e-03
  -3.14383302e-03   1.79978073e-04  -4.77676420e-03   4.18598717e-03
  -2.46347464e-03  -4.86065960e-03   2.29529128e-03   2.09548216e-06
   4.92842309e-03   4.01797617e-04  -4.82031086e-04   1.20579556e-03
   2.56112689e-04  -1.17955834e-03  -4.68734046e-03   3.14474717e-04]
'''

As can be seen, the basic unit of word2vec vectorization is the word: each word is mapped to a vector of the chosen dimensionality, so a word sequence (a sentence) becomes a vector matrix (number of words x word2vec embedding dimension). However, machine-learning algorithms expect each input sample to be a single one-dimensional tensor, so one more feature-processing step is needed: encode the original corpus using the word-vector table. There are several ways to do this, for example:

1. Sum all word vectors and take the per-dimension mean as the sample vector
2. Do the same as 1, but weight each word vector by its TF-IDF weight

Compared with the bag-of-words model, the vector obtained by word2vec embedding of a sample (a word sequence) has the dimensionality of the embedding space (100 in the code below), whereas a bag-of-words vector has the dimensionality of the vocabulary; in most cases the word-vector encoding is therefore lower-dimensional.

# -*- coding:utf-8 -*-

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
import gensim
from collections import Counter, defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import os
import numpy as np


class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.dim = len(word2vec.itervalues().next())

    def fit(self):
        return self

    # for every word in the input sequence, look up its vector in the word-vector table;
    # words not in the table (i.e. unseen during training) are skipped, and a zero vector is used if none are found
    def transform(self, X):
        return np.array([np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                                or [np.zeros(self.dim)], axis=0)
                        for words in X
                        ])



# and a tf-idf version of the same
class TfidfEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.word2weight = None
        self.dim = len(word2vec.itervalues().next())

    def fit(self, X):
        tfidf = TfidfVectorizer(analyzer=lambda x: x)
        tfidf.fit(X)
        # if a word was never seen - it must be at least as infrequent
        # as any of the known words - so the default idf is the max of
        # known idf's
        max_idf = max(tfidf.idf_)
        self.word2weight = defaultdict(
            lambda: max_idf,
            [(w, tfidf.idf_[i]) for w, i in tfidf.vocabulary_.items()])

        return self

    # weight each word vector by its TF-IDF weight before averaging
    def transform(self, X):
        return np.array([np.mean([self.word2vec[w] * self.word2weight[w]
                                 for w in words if w in self.word2vec] or
                                [np.zeros(self.dim)], axis=0)
                        for words in X
                        ])


corpus = [
    ['This', 'is', 'the', 'first', 'document.'],
    ['This', 'is', 'the', 'second', 'second', 'document.'],
    ['And', 'the', 'third', 'one.'],
    ['Is', 'this', 'the', 'first', 'document?']
]


if __name__ == '__main__':
    modelpath = "./word2vec.test.txt"
    model = None
    if os.path.isfile(modelpath):
        # load the saved model
        print "load modeling..."
        model = gensim.models.Word2Vec.load(modelpath)
    else:
        # embed the words into a 100-dimensional vector space
        model = gensim.models.Word2Vec(corpus, min_count=1, size=100)
        # save the model
        model.save(modelpath)

    print "word embedding vocab: ", model.wv.vocab.keys()
    # build the word-vector table
    words_vocab = dict()
    for key in model.wv.vocab.keys():
        nums = map(float, model[key])
        words_vocab[key] = np.array(nums)

    meanVectorizer = MeanEmbeddingVectorizer(words_vocab)
    # fit() can be skipped here
    # encode the training corpus into one row vector per document (mean of its word vectors)
    corpusVecs = meanVectorizer.transform(corpus)
    for i in range(len(corpus)):
        print corpus[i]
        print corpusVecs[i]
        print ""

    tfidfVectorizer = TfidfEmbeddingVectorizer(words_vocab)
    tfidfVectorizer.fit(corpus)
    # encode the training corpus into one row vector per document (TF-IDF-weighted mean of its word vectors)
    corpusVecs = tfidfVectorizer.transform(corpus)
    for i in range(len(corpus)):
        print corpus[i]
        print corpusVecs[i]
        print ""

    # print words_vocab




Note that averaging also handles the fact that the sentences to be encoded have different lengths: taking the mean keeps the scale of the resulting vector independent of sentence length. This is arguably more reasonable than fixing a length and padding.

0x5: Doc2Vec document-embedding model - mapping an entire word sequence directly to a fixed-length row vector

Doc2Vec (also called paragraph2vec or sentence embeddings) is an unsupervised algorithm that learns vector representations for sentences/paragraphs/documents; it is an extension of word2vec.
The learned vectors can be compared by distance to find similar sentences/paragraphs/documents, or used further to label documents.

# -*- coding:utf-8 -*-

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
import gensim
gensimLabeledSentence = gensim.models.doc2vec.LabeledSentence
import os


# train the model on a collection of labelled documents
class LabeledLineSentence(object):
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list

    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            # gensim trains on individual words, so both sentences and documents are split into words
            yield gensimLabeledSentence(words=doc.split(), tags=[self.labels_list[idx]])




corpus = [
    'This is the first document.This is the first document.This is the first document.This is the first document.This is the first document.',
    'This is the second second document.This is the second second document.This is the second second document.This is the second second document.',
    'And the third one.And the third one.And the third one.And the third one.And the third one.',
    'Is this the first document?Is this the first document?Is this the first document?Is this the first document?Is this the first document?',
    'This is the first document.This is the first document.This is the first document.This is the first document.This is the first document.',
    'This is the second second document.This is the second second document.This is the second second document.This is the second second document.',
    'And the third one.And the third one.And the third one.And the third one.And the third one.',
    'Is this the first document?Is this the first document?Is this the first document?Is this the first document?Is this the first document?',
    'This is the first document.This is the first document.This is the first document.This is the first document.This is the first document.',
    'This is the second second document.This is the second second document.This is the second second document.This is the second second document.',
    'And the third one.And the third one.And the third one.And the third one.And the third one.',
    'Is this the first document?Is this the first document?Is this the first document?Is this the first document?Is this the first document?'
]
corpus_label = [
    'normal', 'normal', 'normal', 'bad',
    'normal', 'normal', 'normal', 'bad',
    'normal', 'normal', 'normal', 'bad'
]


if __name__ == '__main__':
    modelpath = "./doc2vec.test.txt"
    model = None
    if os.path.isfile(modelpath):
        # load the saved model
        print "load modeling..."
        model = gensim.models.Doc2Vec.load(modelpath)
        # inspect the vector learned for the 'normal' tag
        print model.docvecs['normal']
    else:
        # load the sample set
        it = LabeledLineSentence(corpus, corpus_label)
        # train Doc2Vec and save the model
        model = gensim.models.Doc2Vec(size=300, window=10, min_count=5, workers=11, alpha=0.025, min_alpha=0.025)
        model.build_vocab(it)

        for epoch in range(10):
            model.train(it, total_examples=model.corpus_count, epochs=model.iter)
            model.alpha -= 0.002  # decrease the learning rate
            model.min_alpha = model.alpha  # fix the learning rate, no decay

        model.save(modelpath)
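
Once trained, the model can produce a fixed-length vector for any new token sequence via infer_vector(), and the vectors learned for the training tags are available through model.docvecs. A usage sketch, continuing from the script above (model is the trained Doc2Vec instance and 'normal' is one of the tags in corpus_label):

new_doc = "This is the first document.".split()
vec = model.infer_vector(new_doc)         # 300-dim vector, same size as the trained vectors
print(vec.shape)
print(model.docvecs['normal'])            # vector learned for the 'normal' tag
print(model.docvecs.most_similar([vec]))  # nearest trained tags to the inferred vector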

Relevant Link:

http://scikit-learn.org/stable/modules/feature_extraction.html
http://blog.csdn.net/u010213393/article/details/40987945
http://www.ruanyifeng.com/blog/2013/03/tf-idf.html
http://d0evi1.com/sklearn/feature_extraction/ 
http://blog.csdn.net/sinsa110/article/details/76855428
http://cloga.info/2014/01/19/sklearn_text_feature_extraction 
http://blog.csdn.net/jerr__y/article/details/52967351
http://www.52nlp.cn/中英文維基百科語料上的word2vec實驗
http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/
https://github.com/nadbordrozd/blog_stuff/blob/master/classification_w2v/benchmarking.ipynb
https://rare-technologies.com/doc2vec-tutorial/
http://www.jianshu.com/p/854a59b93e09
http://cs.stanford.edu/~quocle/paragraph_vector.pdf
http://blog.csdn.net/lenbow/article/details/52120230

 

3. Detecting Webshells with KNN (K-Nearest Neighbors)

0x1: Using bag-of-words features

# -*- coding:utf-8 -*-

import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

import os
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def load_one_flle(filename):
    x = []
    with open(filename) as f:
        line = f.readline()
        line = line.strip('\n')
    return line

def load_adfa_training_files(rootdir):
    x = []
    y = []
    list = os.listdir(rootdir)
    for i in range(0, len(list)):
        path = os.path.join(rootdir, list[i])
        if os.path.isfile(path):
            x.append(load_one_flle(path))
            y.append(0)
    return x, y

def dirlist(path, allfile):
    filelist = os.listdir(path)

    for filename in filelist:
        filepath = os.path.join(path, filename)
        if os.path.isdir(filepath):
            dirlist(filepath, allfile)
        else:
            allfile.append(filepath)
    return allfile

def load_adfa_webshell_files(rootdir):
    x = []
    y = []
    allfile=dirlist(rootdir,[])
    for file in allfile:
        # match webshell traces regardless of path separator (the original pattern assumed a Windows-style "\")
        if re.match(r"\.\./data/ADFA-LD/Attack_Data_Master/Web_Shell_\d+[/\\]UAD-W", file):
            x.append(load_one_flle(file))
            y.append(1)
    return x, y



if __name__ == '__main__':

    x1, y1 = load_adfa_training_files("../data/ADFA-LD/Training_Data_Master/")  # training set (normal)
    x2, y2 = load_adfa_webshell_files("../data/ADFA-LD/Attack_Data_Master/")    # training set (attack)
    x3, y3 = load_adfa_training_files("../data/ADFA-LD/Validation_Data_Master/")  # validation set (normal)

    # mix benign and attack samples for the training set
    x_train = x1 + x2
    y_train = y1 + y2
    x_validate = x3 + x2
    y_validate = y3 + y2

    # bag-of-words model: only counts occurrences of individual tokens
    vectorizer = CountVectorizer(min_df=1)
    vecmodel = vectorizer.fit(x_train)  # fit the vocabulary on the training set
    x_train = vecmodel.transform(x_train).toarray()  # encode the training set as count vectors
    x_validate = vecmodel.transform(x_validate).toarray()  # encode the validation set with the same vocabulary

    # fit a KNN model on the training set
    clf = KNeighborsClassifier(n_neighbors=4).fit(x_train, y_train)
    scores = cross_val_score(clf, x_train, y_train, n_jobs=-1, cv=10)
    # cross-validation scores reflect how well the KNN model fits the training data
    print "Training accuracy: "
    print scores
    print np.mean(scores)

    # Make predictions using the validate set
    # print x_train.shape
    # print x_validate.shape
    y_pre = clf.predict(x_validate)
    print "Predict result: ", y_pre
    # prediction accuracy on the validation set
    print "Prediction accuracy: %.2f" % np.mean(y_pre == y_validate)
    

This reaches roughly 93% accuracy on the validation set.
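
Because attack traces are only a small minority of the validation mix (118 webshell traces against 4373 normal ones), accuracy alone can be misleading. A hedged sketch of a fuller evaluation on the same y_validate / y_pre arrays produced by the script above:

from sklearn.metrics import classification_report, confusion_matrix

# y_validate and y_pre come from the script above
print(confusion_matrix(y_validate, y_pre))
print(classification_report(y_validate, y_pre, target_names=["normal", "webshell"]))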

0x2: Using TF-IDF features

# -*- coding:utf-8 -*-

import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score

import os
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def load_one_flle(filename):
    x = []
    with open(filename) as f:
        line = f.readline()
        line = line.strip('\n')
    return line

def load_adfa_training_files(rootdir):
    x = []
    y = []
    list = os.listdir(rootdir)
    for i in range(0, len(list)):
        path = os.path.join(rootdir, list[i])
        if os.path.isfile(path):
            x.append(load_one_flle(path))
            y.append(0)
    return x, y

def dirlist(path, allfile):
    filelist = os.listdir(path)

    for filename in filelist:
        filepath = os.path.join(path, filename)
        if os.path.isdir(filepath):
            dirlist(filepath, allfile)
        else:
            allfile.append(filepath)
    return allfile

def load_adfa_webshell_files(rootdir):
    x = []
    y = []
    allfile = dirlist(rootdir, [])
    for file in allfile:
        # match webshell traces regardless of path separator (the original pattern assumed a Windows-style "\")
        if re.match(r"\.\./data/ADFA-LD/Attack_Data_Master/Web_Shell_\d+[/\\]UAD-W", file):
            x.append(load_one_flle(file))
            y.append(1)
    return x, y



if __name__ == '__main__':
    x1, y1 = load_adfa_training_files("../data/ADFA-LD/Training_Data_Master/")  # training set (normal)
    x2, y2 = load_adfa_webshell_files("../data/ADFA-LD/Attack_Data_Master/")    # training set (attack)
    x3, y3 = load_adfa_training_files("../data/ADFA-LD/Validation_Data_Master/")  # validation set (normal)

    # mix benign and attack samples for the training set
    x_train = x1 + x2
    y_train = y1 + y2
    x_validate = x3 + x2
    y_validate = y3 + y2

    # TF-IDF model
    vectorizer = TfidfVectorizer(min_df=1)
    vecmodel = vectorizer.fit(x_train)  # fit the vocabulary on the training set
    print "vocabulary_: "
    print vecmodel.vocabulary_

    x_train = vecmodel.transform(x_train).toarray()
    x_validate = vecmodel.transform(x_validate).toarray()
    print "x_train[0]: ", x_train[0]
    print "x_validate[0]: ", x_validate[0]

    # fit a KNN model on the training set
    clf = KNeighborsClassifier(n_neighbors=4).fit(x_train, y_train)
    # how well the KNN model fits the training data
    y_train_pre = clf.predict(x_train)
    print "Train result: ", y_train_pre
    print "Train accuracy: %.2f" % np.mean(y_train_pre == y_train)

    # Make predictions using the validate set
    # print x_train.shape
    # print x_validate.shape
    y_valid_pre = clf.predict(x_validate)
    print "Predict result: ", y_valid_pre
    # prediction accuracy on the validation set
    print "Prediction accuracy: %.2f" % np.mean(y_valid_pre == y_validate)

0x3: Using N-gram features

# -*- coding:utf-8 -*-

import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

import os
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def load_one_flle(filename):
    x = []
    with open(filename) as f:
        line = f.readline()
        line = line.strip('\n')
    return line

def load_adfa_training_files(rootdir):
    x = []
    y = []
    list = os.listdir(rootdir)
    for i in range(0, len(list)):
        path = os.path.join(rootdir, list[i])
        if os.path.isfile(path):
            x.append(load_one_flle(path))
            y.append(0)
    return x, y

def dirlist(path, allfile):
    filelist = os.listdir(path)

    for filename in filelist:
        filepath = os.path.join(path, filename)
        if os.path.isdir(filepath):
            dirlist(filepath, allfile)
        else:
            allfile.append(filepath)
    return allfile

def load_adfa_webshell_files(rootdir):
    x = []
    y = []
    allfile=dirlist(rootdir,[])
    for file in allfile:
        # match webshell traces regardless of path separator (the original pattern assumed a Windows-style "\")
        if re.match(r"\.\./data/ADFA-LD/Attack_Data_Master/Web_Shell_\d+[/\\]UAD-W", file):
            x.append(load_one_flle(file))
            y.append(1)
    return x, y



if __name__ == '__main__':

    x1, y1 = load_adfa_training_files("../data/ADFA-LD/Training_Data_Master/")  # training set (normal)
    x2, y2 = load_adfa_webshell_files("../data/ADFA-LD/Attack_Data_Master/")    # training set (attack)
    x3, y3 = load_adfa_training_files("../data/ADFA-LD/Validation_Data_Master/")  # validation set (normal)

    # mix benign and attack samples for the training set
    x_train = x1 + x2
    y_train = y1 + y2
    x_validate = x3 + x2
    y_validate = y3 + y2

    # n-gram model (2-grams and 3-grams)
    vectorizer = CountVectorizer(min_df=1, analyzer='word', ngram_range=(2, 3))
    vecmodel = vectorizer.fit(x_train)
    x_train = vecmodel.transform(x_train).toarray()
    x_validate = vecmodel.transform(x_validate).toarray()

    # fit a KNN model on the training set
    clf = KNeighborsClassifier(n_neighbors=4).fit(x_train, y_train)
    scores = cross_val_score(clf, x_train, y_train, n_jobs=-1, cv=10)
    # cross-validation scores reflect how well the KNN model fits the training data
    print "Training accuracy: "
    print scores
    print np.mean(scores)

    # Make predictions using the validate set
    y_pre = clf.predict(x_validate)
    print "Predict result: ", y_pre
    # prediction accuracy on the validation set
    print "Prediction accuracy: %.2f" % np.mean(y_pre == y_validate)

0x4: Using Word2Vec features

# -*- coding:utf-8 -*-

import re
from sklearn.model_selection import cross_val_score

from sklearn.neighbors import KNeighborsClassifier

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
import gensim
from collections import Counter, defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import os
import numpy as np


class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.dim = len(word2vec.itervalues().next())

    def fit(self):
        return self

    # for every word in the input sequence, look up its vector in the word-vector table;
    # words not in the table (i.e. unseen during training) are skipped, and a zero vector is used if none are found
    def transform(self, X):
        return np.array([np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                                or [np.zeros(self.dim)], axis=0)
                        for words in X
                        ])

def load_one_flle(filename):
    x = []
    with open(filename) as f:
        line = f.readline()
        x = line.strip('\n').split()
    return x

def load_adfa_training_files(rootdir):
    x = []
    y = []
    list = os.listdir(rootdir)
    for i in range(0, len(list)):
        path = os.path.join(rootdir, list[i])
        if os.path.isfile(path):
            x.append(load_one_flle(path))
            y.append(0)
    return x, y

def dirlist(path, allfile):
    filelist = os.listdir(path)

    for filename in filelist:
        filepath = os.path.join(path, filename)
        if os.path.isdir(filepath):
            dirlist(filepath, allfile)
        else:
            allfile.append(filepath)
    return allfile

def load_adfa_webshell_files(rootdir):
    x = []
    y = []
    allfile=dirlist(rootdir,[])
    for file in allfile:
        # match webshell traces regardless of path separator (the original pattern assumed a Windows-style "\")
        if re.match(r"\.\./data/ADFA-LD/Attack_Data_Master/Web_Shell_\d+[/\\]UAD-W", file):
            x.append(load_one_flle(file))
            y.append(1)
    return x, y



if __name__ == '__main__':
    x1, y1 = load_adfa_training_files("../data/ADFA-LD/Training_Data_Master/")  # training set (normal)
    x2, y2 = load_adfa_webshell_files("../data/ADFA-LD/Attack_Data_Master/")    # training set (attack)
    x3, y3 = load_adfa_training_files("../data/ADFA-LD/Validation_Data_Master/")  # validation set (normal)

    # mix benign and attack samples for the training set
    x_train = x1 + x2
    y_train = y1 + y2
    x_validate = x3 + x2
    y_validate = y3 + y2

    modelpath = "./word2vec.test.txt"
    model = None
    if os.path.isfile(modelpath):
        # load the saved model
        print "load modeling..."
        model = gensim.models.Word2Vec.load(modelpath)
    else:
        # embed the words into a 100-dimensional vector space
        model = gensim.models.Word2Vec(x_train, min_count=1, size=100)
        # save the model
        model.save(modelpath)

    print "word embedding vocab: ", model.wv.vocab.keys()

    # build the word-vector table
    words_vocab = dict()
    for key in model.wv.vocab.keys():
        nums = map(float, model[key])
        words_vocab[key] = np.array(nums)

    meanVectorizer = MeanEmbeddingVectorizer(words_vocab)
    # encode the training corpus into one row vector per trace (mean of its word vectors)
    x_trainVecs = meanVectorizer.transform(x_train)
    #for i in range(len(x_train)):
    #    print x_train[i]
    #    print x_trainVecs[i]
    #    print ""
    # encode the validation corpus into one row vector per trace (mean of its word vectors)
    x_validateVecs = meanVectorizer.transform(x_validate)
    #for i in range(len(x_train)):
    #    print x_validate[i]
    #    print x_validateVecs[i]
    #    print ""

    # fit a KNN model on the training set
    clf = KNeighborsClassifier(n_neighbors=4).fit(x_trainVecs, y_train)
    scores = cross_val_score(clf, x_trainVecs, y_train, n_jobs=-1, cv=10)
    # cross-validation scores reflect how well the KNN model fits the training data
    print "Training accuracy: "
    print scores
    print np.mean(scores)

    # Make predictions using the validate set
    y_pre = clf.predict(x_validateVecs)
    print "Predict result: ", y_pre
    # prediction accuracy on the validation set
    print "Prediction accuracy: %.2f" % np.mean(y_pre == y_validate)

Relevant Link:

https://arxiv.org/pdf/1611.01726.pdf - LSTM-BASED SYSTEM-CALL LANGUAGE MODELING AND ROBUST ENSEMBLE METHOD FOR DESIGNING HOST-BASED INTRUSION DETECTION SYSTEMS
http://www.internationaljournalssrg.org/IJCSE/2015/Volume2-Issue6/IJCSE-V2I6P109.pdf - Review of A Semantic Approach to Host-based Intrusion Detection Systems Using Contiguous and Dis-contiguous System Call Patterns
http://www.ijirst.org/articles/IJIRSTV1I11121.pdf - A Host Based Intrusion Detection System Using Improved Extreme Learning Machine

Copyright (c) 2017 LittleHann All rights reserved

 
