I recently completed my Advanced Artificial Intelligence module at the University of Hull. It was fantastic. "Machine learning" techniques particularly fascinated me, and the huge range of potential applications built on them looks extremely promising. Once I had climbed the steep learning curve of understanding how (artificial neural) networks work, I decided it was time to build something.
An Artificial Neural Network (ANN) Writer
While scouring the internet researching the wonders of machine learning, I stumbled across a project on GitHub that used a recurrent neural network to imitate Shakespeare's writing style. I loved the idea, so I set out to create my own, almost entirely different version. I decided to use the scikit machine-learning library, because it is particularly easy to use and configure.
Scikit also has a large community with plenty of tutorials, as well as many example datasets for training your own neural network. The writer I built uses multiple support vector machine (SVM) engines: one vector machine handles sentence structuring, and a set of smaller vector machines handle the algorithm that selects words from the vocabulary.
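The post does not include the project's source, but the structure engine can be sketched with scikit-learn. Everything here is an illustrative assumption, not the project's actual code: an `SVR` regressor is trained to map a window of normalised tag values to the normalised value of the next tag, using the values quoted later in the post.

```python
from sklearn.svm import SVR

# Illustrative data only: each row is a 3-tag context encoded as normalised
# values, and each target is the normalised value of the tag that follows.
X_train = [
    [0.2237823, 0.82392, 0.342323],    # DET NN VP
    [0.82392, 0.342323, 0.12121212],   # NN VP PRP
    [0.342323, 0.12121212, 0.2237823],
]
y_train = [0.12121212, 0.2237823, 0.82392]

# One SVM engine for sentence structure (hyper-parameters are guesses).
structure_svm = SVR(kernel="rbf", C=10.0, epsilon=0.01)
structure_svm.fit(X_train, y_train)

# Predict the normalised tag that should follow a new context.
prediction = structure_svm.predict([[0.2237823, 0.82392, 0.342323]])
print(prediction[0])
```

The smaller word-selection SVMs would follow the same pattern, one per tag bucket of the vocabulary.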
Sentence structuring
Sentence structuring has been very successful; the algorithm I am currently using already produces highly accurate results. The biggest obstacle at this stage was normalising the training data. I used NLTK, a natural language library, to convert the training data into phrase identifiers, e.g. NN (noun), DET (determiner), $ (symbol), and so on.
This meant I could use those identifiers to normalise the data, like so:
```
["The", "cat", "jumped"] = ['DET', 'NN', 'VP']
```
Once normalised, it looks like this:
```
['DET', 'NN', 'VP'] = [0.2237823, 0.82392, 0.342323]
```
Now I just need to get a target normal and feed it into the neural network to start training. When reading text from a BLOB (binary large object), the training word is simply the next word in the BLOB, so:
```
["The", "cat", "jumped"]["towards"] = ['DET', 'NN', 'VP']["PRP"] = [0.2237823, 0.82392, 0.342323][0.12121212]
```
The next step was to grab a load of J.K. Rowling's Harry Potter material and start imitating her sentence structure.
Vocabulary
Vocabulary was by far the hardest part of this project. I am well aware that there is no good reason not to use a recurrent neural network here, and that predicting each letter would be the better approach. However, the method I chose produces results that are particularly interesting to look at.
The vocabulary is stored in the training BLOB as a matrix of word sequences. Each word is broken down into a phrase identifier and then normalised. Both the normalised values and the words are kept, because they are needed later as a lookup table for converting normalised values back into words. The vocabulary looks like this:
```
[[(cat, [0.232342]), (bat, [0.2553535]), (dog, [0.345454]), (horse, [0.4544646]) ... ]
 [(run, [0.12131]), (jump, [0.232323]), (fall, [0.43443434]) ... ] ... ]
```
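Converting a predicted normal back into a word then becomes a nearest-neighbour lookup in that table. A minimal sketch using the values above (the per-tag bucketing is my assumption about how the project organises its table):

```python
# One vocabulary bucket per tag, as in the matrix above (nouns here).
noun_bucket = [("cat", 0.232342), ("bat", 0.2553535),
               ("dog", 0.345454), ("horse", 0.4544646)]

def nearest_word(bucket, predicted_normal):
    """Map a predicted normalised value back to the closest vocabulary word."""
    return min(bucket, key=lambda pair: abs(pair[1] - predicted_normal))[0]

print(nearest_word(noun_bucket, 0.34))  # -> "dog"
```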
Trying it out
Using HarryPotter(small).txt
This dataset contains 346 training vectors. It is the smallest training set.
```
Input:  "Harry dont look"
Output: "Harry dont look had at eleven for witches had been for eleven then with nothing had been for eleven then with nothing had been for eleven then with nothing had been for eleven "
```
```
Input:  "Wizards sometimes do"
Output: "wizards sometimes do , Harry had been been , Harry had been been , Harry had been been , Harry had been been , Harry had been been , Harry had been been "
```
You can see the network trying hard to learn, but it lacks training data. It practically falls into an infinite loop.
Using HarryPotter(medium).txt
This dataset contains 2,500 training vectors.
```
Input:  "Harry don't look"
Output: "Harry don't look , incredibly that a year for been parents in . followers , Harry , and Potter was been curse . Harry was up a year , Harry was been curse "
```
```
Input:  "Wizards sometimes do"
Output: "wizards sometimes do parents."
```
Now you can see the network producing the kind of output we were hoping for. With the extra training data, its performance has improved dramatically.
Using HarryPotter(large).txt
This dataset contains 4,550 training vectors.
```
Input:  "Harry don't look"
Output: "Harry don't look Petunia Harry , incredibly had on the address."
```
Now, apart from a missing full stop between 'Petunia' and 'Harry', the sentence structure is almost perfect.
```
Input:  "Wizards sometimes do"
Output: "Wizards sometimes do the address on the address on the address."
```
Here, the increase in training data has given a negative result. This is rare.
Using MacbookAirBlog(large).txt
I grabbed this text from a random blog. Surprisingly, this dataset often produces more accurate results. Let's take a look. It contains 3,576 training vectors.
```
Input:  "Harry don't look"
Output: "Harry don't look , also of a scaling on playing . Use for Control weight goes so cable and they've of placed it . you do to want things at at 2015."
```
```
Input:  "Wizards sometimes do"
Output: "Wizards sometimes do When ports a scaling the have object , also of a scaling on playing ."
```
The results are grammatically correct, but the vocabulary is limited. Datasets with over 10,000 training vectors are coming soon.
Running some unit tests
The program produces very accurate results when predicting the next word in a sequence, but accuracy drops once it starts generating longer sequences. I built a unit test that compares each generated next word with the word J.K. Rowling actually wrote, and got the following results:
```
Failed Tests: (6/48)
[('very', 'RB'), ('likely', 'JJ'), ('replace', 'NN')] Target: [('the', 'DT')] Prediction: ['.'] 20.0%
[('entirely', 'RB'), ('once', 'RB'), ('Apple', 'NNP')] Target: [('is', 'VBZ')] Prediction: ['RBS'] 20.0%
[('once', 'RB'), ('Apple', 'NNP'), ('is', 'VBZ')] Target: [('able', 'JJ')] Prediction: ['RBR'] 20.0%
[('able', 'JJ'), ('to', 'TO'), ('bring', 'VB')] Target: [('its', 'PRP$')] Prediction: ['RP'] 20.0%
[('down', 'IN'), ('enough', 'RB'), (',', ',')] Target: [('though', 'IN')] Prediction: ['VBN'] 20.0%
[('though', 'IN'), ('this', 'DT'), ('may', 'MD')] Target: [('take', 'VB')] Prediction: ['.'] 20.0%

Non-Fatal failed Tests: (24/48)
[('The', 'DT'), ('12-inch', 'JJ'), ('Retina', 'NNP')] Target: [('MacBook', 'NN')] Prediction: [','] 40.0%
[('Retina', 'NNP'), ('MacBook', 'NNP'), ('is', 'VBZ')] Target: [('Apple', 'NNP')] Prediction: ['IN'] 40.0%
[('MacBook', 'NN'), ('is', 'VBZ'), ('Apple', 'NNP')] Target: [("'", 'POS')] Prediction: ['IN'] 40.0%
[('Apple', 'NNP'), ("'", 'POS'), ('s', 'NNS')] Target: [('latest', 'JJS')] Prediction: ['VBP'] 40.0%
[("'", 'POS'), ('s', 'NNS'), ('latest', 'JJS')] Target: [('and', 'CC')] Prediction: ['IN'] 40.0%
[('latest', 'JJS'), ('and', 'CC'), ('greatest', 'JJS')] Target: [('notebook', 'NN')] Prediction: ['.'] 60.0%
[('and', 'CC'), ('greatest', 'JJS'), ('notebook', 'NN')] Target: [(',', ',')] Prediction: ['NN'] 40.0%
[('greatest', 'JJS'), ('notebook', 'NN'), (',', ',')] Target: [('and', 'CC')] Prediction: ['DT'] 40.0%
[('notebook', 'NN'), (',', ','), ('and', 'CC')] Target: [('will', 'MD')] Prediction: ['NN'] 40.0%
[('and', 'CC'), ('will', 'MD'), ('very', 'RB')] Target: [('likely', 'JJ')] Prediction: [','] 40.0%
[('will', 'MD'), ('very', 'RB'), ('likely', 'JJ')] Target: [('replace', 'NN')] Prediction: ['TO'] 40.0%
[('the', 'DT'), ('MacBook', 'NNP'), ('Air', 'NNP')] Target: [('entirely', 'RB')] Prediction: ['NN'] 40.0%
[('MacBook', 'NN'), ('Air', 'NNP'), ('entirely', 'RB')] Target: [('once', 'RB')] Prediction: ['VBZ'] 60.0%
[('Air', 'NNP'), ('entirely', 'RB'), ('once', 'RB')] Target: [('Apple', 'NNP')] Prediction: ['RB'] 40.0%
[('Apple', 'NNP'), ('is', 'VBZ'), ('able', 'JJ')] Target: [('to', 'TO')] Prediction: ['NN'] 40.0%
[('to', 'TO'), ('bring', 'VB'), ('its', 'PRP$')] Target: [('costs', 'NNS')] Prediction: ['VB'] 40.0%
[('its', 'PRP$'), ('costs', 'NNS'), ('down', 'IN')] Target: [('enough', 'RB')] Prediction: ['DT'] 40.0%
[('costs', 'NNS'), ('down', 'RB'), ('enough', 'RB')] Target: [(',', ',')] Prediction: ['RB'] 40.0%
[(',', ','), ('though', 'IN'), ('this', 'DT')] Target: [('may', 'MD')] Prediction: ['JJS'] 40.0%
[('a', 'DT'), ('few', 'JJ'), ('generations', 'NNS')] Target: [('.', '.')] Prediction: ['VBP'] 60.0%
[('few', 'JJ'), ('generations', 'NNS'), ('.', '.')] Target: [('It', 'PRP')] Prediction: ['WRB'] 40.0%
[('.', '.'), ('It', 'PRP'), ('is', 'VBZ')] Target: [('fresh', 'JJ')] Prediction: ['DT'] 40.0%
[('It', 'PRP'), ('is', 'VBZ'), ('fresh', 'JJ')] Target: [('on', 'IN')] Prediction: ['TO'] 60.0%
[('is', 'VBZ'), ('fresh', 'JJ'), ('on', 'IN')] Target: [('the', 'DT')] Prediction: ['JJ'] 40.0%

Passed Tests: (14/48)
[('12-inch', 'JJ'), ('Retina', 'NNP'), ('MacBook', 'NNP')] Target: [('is', 'VBZ')] Prediction: ['NNP'] 100.0%
[('is', 'VBZ'), ('Apple', 'NNP'), ("'", 'POS')] Target: [('s', 'NNS')] Prediction: ['NNP'] 40.0%
[('s', 'NNS'), ('latest', 'VBP'), ('and', 'CC')] Target: [('greatest', 'JJS')] Prediction: ['JJ'] 40.0%
[(',', ','), ('and', 'CC'), ('will', 'MD')] Target: [('very', 'RB')] Prediction: ['RB'] 20.0%
[('likely', 'JJ'), ('replace', 'NN'), ('the', 'DT')] Target: [('MacBook', 'NN')] Prediction: ['NN'] 40.0%
[('replace', 'NN'), ('the', 'DT'), ('MacBook', 'NNP')] Target: [('Air', 'NNP')] Prediction: ['NN'] 40.0%
[('is', 'VBZ'), ('able', 'JJ'), ('to', 'TO')] Target: [('bring', 'VBG')] Prediction: ['VB'] 60.0%
[('bring', 'VBG'), ('its', 'PRP$'), ('costs', 'NNS')] Target: [('down', 'IN')] Prediction: ['IN'] 40.0%
[('enough', 'RB'), (',', ','), ('though', 'IN')] Target: [('this', 'DT')] Prediction: ['DT'] 40.0%
[('this', 'DT'), ('may', 'MD'), ('take', 'VB')] Target: [('a', 'DT')] Prediction: ['DT'] 40.0%
[('may', 'MD'), ('take', 'VB'), ('a', 'DT')] Target: [('few', 'JJ')] Prediction: ['NN'] 80.0%
[('take', 'VB'), ('a', 'DT'), ('few', 'JJ')] Target: [('generations', 'NNS')] Prediction: ['NN'] 40.0%
[('generations', 'NNS'), ('.', '.'), ('It', 'PRP')] Target: [('is', 'VBZ')] Prediction: ['VBP'] 40.0%
[('fresh', 'JJ'), ('on', 'IN'), ('the', 'DT')] Target: [('market', 'NN')] Prediction: ['NN'] 40.0%

Passed: 14 Non-Fatals: 24 Fails: 6
Network accuracy: 13.6%
```
This test can be run from the command line:
```
python3 main.py -utss -td "Datasets/MacbookAirBlog(large).txt"
```
I tested the vocabulary using the same idea:
```
Failed Tests: (19/46)
(12-inch, JJ) Target: MacBook Pred: Retina 20.0%
(latest, JJS) Target: greatest Pred: heaviest 20.0%
(and, CC) Target: notebook Pred: faster 20.0%
(MacBook, NNP) Target: entirely Pred: now 20.0%
(Air, NNP) Target: once Pred: now 20.0%
(entirely, RB) Target: Apple Pred: micro-USB 20.0%
(once, RB) Target: is Pred: theres 20.0%
(Apple, NNP) Target: able Pred: want 20.0%
(its, PRP$) Target: down Pred: on 20.0%
(costs, NNS) Target: enough Pred: portable 20.0%
(enough, JJ) Target: though Pred: of 20.0%
(though, IN) Target: may Pred: can 20.0%
(this, DT) Target: take Pred: have 20.0%
(may, MD) Target: a Pred: the 20.0%
(take, VB) Target: few Pred: later 20.0%
(a, DT) Target: generations Pred: thats 20.0%
(It, PRP) Target: fresh Pred: same 20.0%
(is, VBZ) Target: on Pred: in 20.0%
(on, IN) Target: market Pred: playing 20.0%

Non-Fatal failed Tests: (13/46)
(,, ,) Target: 12-inch Pred: many 40.0%
(The, DT) Target: Retina Pred: MacBook 40.0%
(MacBook, NNP) Target: Apples Pred: X 40.0%
(is, VBZ) Target: latest Pred: best 40.0%
(Apples, NNP) Target: and Pred: but 40.0%
(,, ,) Target: will Pred: can 60.0%
(and, CC) Target: very Pred: not 60.0%
(will, MD) Target: likely Pred: easy 40.0%
(very, RB) Target: replace Pred: power 40.0%
(the, DT) Target: Air Pred: MacBook 40.0%
(able, JJ) Target: bring Pred: be 40.0%
(to, TO) Target: its Pred: my 60.0%
(bring, VB) Target: costs Pred: things 40.0%

Passed Tests: (13/46)
(Retina, NNP) Target: is Pred: is 40.0%
(greatest, JJS) Target: , Pred: , 60.0%
(notebook, NN) Target: and Pred: and 80.0%
(likely, JJ) Target: the Pred: the 100.0%
(replace, NN) Target: MacBook Pred: MacBook 20.0%
(is, VBZ) Target: to Pred: to 60.0%
(down, IN) Target: , Pred: , 100.0%
(,, ,) Target: this Pred: the 80.0%
(few, JJ) Target: . Pred: . 100.0%
(generations, NNS) Target: It Pred: It 60.0%
(., .) Target: is Pred: is 40.0%
(fresh, JJ) Target: the Pred: the 100.0%
(the, DT) Target: , Pred: , 100.0%

Passed: 13 Non-Fatals: 13 Fails: 19
```
```
python3 main.py -utv -td "Datasets/MacbookAirBlog(large).txt"
```
If the prediction estimation is above 80%, the test is classed as "passed".
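The passed / non-fatal / failed split in the listings above suggests a simple threshold scheme. A sketch of one plausible version follows; the post only states the 80% pass threshold, so the exact-match rule and the 40% non-fatal cut-off are my assumptions:

```python
def classify(target, prediction, estimate):
    """Classify one unit test. An exact match or a high-confidence
    prediction passes; a middling estimate is non-fatal; the rest fail.
    Only the 80% cut-off is quoted in the post, the rest is guessed."""
    if prediction == target or estimate >= 80.0:
        return "passed"
    if estimate >= 40.0:
        return "non-fatal"
    return "failed"

print(classify("the", "the", 100.0))        # -> "passed"
print(classify("will", "can", 60.0))        # -> "non-fatal"
print(classify("market", "playing", 20.0))  # -> "failed"
```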
All of the results above come from an "unfinished" program, which is why they don't look very accurate.
This experiment is for educational purposes only and will never be commercialised.
If you'd like to take a look at the project, you can find it on GitHub.