Generating Pseudo-Random Text with Markov Chains in Python

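Let's start with the definition from Wolfram:

A Markov chain is a collection of random variables {X_t} (where the index t runs through 0, 1, …) having the property that, given the present, the future is conditionally independent of the past.

Wikipedia's definition is a little clearer:

…a Markov chain is a stochastic process with the Markov property…[this means that] state changes are probabilistic, and the future state depends only on the current state.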
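
Written as a formula, the property described in both definitions is usually stated as follows. This is the standard textbook form for a discrete-time chain, not a quotation from either source, and the lowercase symbols x, x_t, … are simply names introduced here for particular values of the state:

```latex
P(X_{t+1} = x \mid X_t = x_t, X_{t-1} = x_{t-1}, \dots, X_0 = x_0)
  = P(X_{t+1} = x \mid X_t = x_t)
```

In words: once the current state X_t is known, the earlier states add no further information about the next one.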

Markov chains have many uses. Here, let's see how to use one to generate plausible-looking gibberish.

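The algorithm is as follows:

1. Find a text to serve as the corpus. The corpus is used to choose the next transition.
2. Start with two consecutive words from the text. The last two words constitute the current state.
3. Generating the next word is the Markov transition. To generate it, look in the corpus for the words that follow these two words, and pick one of them at random.
4. Repeat from step 2, with the last two generated words as the new state, until the generated text reaches the required size.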

The code is as follows:

import random

class Markov(object):

    def __init__(self, open_file):
        self.cache = {}
        self.open_file = open_file
        self.words = self.file_to_words()
        self.word_size = len(self.words)
        self.database()

    def file_to_words(self):
        # Read the whole file and split on whitespace; punctuation stays attached to words.
        self.open_file.seek(0)
        data = self.open_file.read()
        words = data.split()
        return words

    def triples(self):
        """ Generates triples from the given data string. So if our string were
                "What a lovely day", we'd generate (What, a, lovely) and then
                (a, lovely, day).
        """

        if len(self.words) < 3:
            return

        for i in range(len(self.words) - 2):
            yield (self.words[i], self.words[i+1], self.words[i+2])

    def database(self):
        # Map each pair of consecutive words to the list of words that follow it.
        for w1, w2, w3 in self.triples():
            key = (w1, w2)
            if key in self.cache:
                self.cache[key].append(w3)
            else:
                self.cache[key] = [w3]

    def generate_markov_text(self, size=25):
        # Pick a random starting pair; any pair starting at 0..word_size-3 has a follower.
        seed = random.randint(0, self.word_size-3)
        seed_word, next_word = self.words[seed], self.words[seed+1]
        w1, w2 = seed_word, next_word
        gen_words = []
        for i in range(size):  # range instead of xrange, so this also runs on Python 3
            gen_words.append(w1)
            if (w1, w2) not in self.cache:
                # The last pair of the corpus may have no follower; stop early.
                break
            w1, w2 = w2, random.choice(self.cache[(w1, w2)])
        gen_words.append(w2)
        return ' '.join(gen_words)

To see a sample result, we took P. G. Wodehouse's "My Man Jeeves" from Project Gutenberg as the text. A sample result is shown below.

In [1]: file_ = open('/home/shabda/jeeves.txt')

In [2]: import markovgen

In [3]: markov = markovgen.Markov(file_)

In [4]: markov.generate_markov_text()
Out[4]: 'Can you put a few years of your twin-brother Alfred,
who was apt to rally round a bit. I should strongly advocate
the blue with milk'

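[To run this example, download jeeves.txt and markovgen.py.]

How is this a Markov algorithm?

The last two words are the current state.
The next word depends only on the last two words, that is, on the current state.
The next word is chosen at random from a statistical model of the corpus.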

Here is a sample text.

“The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead.”

The corpus built from this text looks like this:

{('The', 'quick'): ['brown'],
 ('brown', 'fox'): ['jumps', 'who', 'who'],
 ('fox', 'jumps'): ['over'],
 ('fox', 'who'): ['is', 'is'],
 ('is', 'slow'): ['jumps'],
 ('jumps', 'over'): ['the', 'the'],
 ('over', 'the'): ['brown', 'brown'],
 ('quick', 'brown'): ['fox'],
 ('slow', 'jumps'): ['over'],
 ('the', 'brown'): ['fox', 'fox'],
 ('who', 'is'): ['slow', 'dead.']}
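
To check the table by hand, it can be rebuilt from the sample sentence in a few lines. This is a minimal standalone sketch, not the markovgen.py module itself; the name `build_table` is made up for this illustration.

```python
from collections import defaultdict

text = ("The quick brown fox jumps over the brown fox who is slow "
        "jumps over the brown fox who is dead.")

def build_table(text):
    """Map each pair of consecutive words to the list of words that follow it."""
    words = text.split()
    table = defaultdict(list)
    # zip over three offset views of the word list yields every consecutive triple.
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        table[(w1, w2)].append(w3)
    return dict(table)

table = build_table(text)
print(table[('brown', 'fox')])   # ['jumps', 'who', 'who']
print(table[('who', 'is')])      # ['slow', 'dead.']
```

Followers are kept in a list with repeats rather than in a set, so a word that follows a pair more often is also picked more often by random.choice.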

Now, if we start with "brown fox", the next word can be "jumps" or "who". If we choose "jumps", the current state becomes "fox jumps", the next word is "over", and so on.
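
The two choices are not equally likely: "who" appears twice in the follower list for "brown fox", so a uniform random pick from that list selects it twice as often as "jumps". A small sketch that makes the implied transition probabilities explicit (the helper name `transition_probabilities` is an assumption for this example):

```python
from collections import Counter
from fractions import Fraction

# Follower list for the state ('brown', 'fox'), copied from the table above.
followers = ['jumps', 'who', 'who']

def transition_probabilities(followers):
    """Return the probability of each next word under a uniform pick from the list."""
    counts = Counter(followers)
    total = len(followers)
    return {word: Fraction(n, total) for word, n in counts.items()}

probs = transition_probabilities(followers)
print(probs['jumps'])  # 1/3
print(probs['who'])    # 2/3
```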

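Hints

The larger the text we choose, the more choices there are at each transition, and the better the generated text looks.
The state can be set to depend on one word, two words, or any number of words. As the number of words per state increases, the generated text becomes less random.
Do not strip out punctuation and the like. It makes the corpus more representative, and the random text looks better.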
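
The second hint can be tried out with a variant of the generator in which the number of words per state is a parameter. The following is a rough sketch under that assumption, not part of markovgen.py; the names `build_table`, `generate`, `order` and `rng` are invented for this example.

```python
import random
from collections import defaultdict

def build_table(words, order=2):
    """Map each run of `order` consecutive words to the words that follow it."""
    table = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        table[state].append(words[i + order])
    return table

def generate(words, order=2, size=25, rng=random):
    """Generate up to `size` words; stops early if a state has no follower."""
    table = build_table(words, order)
    if not table:
        # Corpus too short to form a single transition: return it unchanged.
        return ' '.join(words)
    # Any start index below len(words) - order has at least one follower.
    start = rng.randrange(len(words) - order)
    state = tuple(words[start:start + order])
    out = list(state)
    while len(out) < size and state in table:
        out.append(rng.choice(table[state]))
        state = tuple(out[-order:])
    return ' '.join(out)

corpus = ("The quick brown fox jumps over the brown fox who is slow "
          "jumps over the brown fox who is dead.").split()
print(generate(corpus, order=1, size=12))  # one-word state: more random
print(generate(corpus, order=3, size=12))  # three-word state: closer to the source
```

With a corpus this small, a three-word state mostly reproduces stretches of the original sentence, which is the "less random" effect the hint describes.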
