python統計英文文字中的迴文單詞數

老瀟的摸魚日記發表於2020-05-13

1. 要求:

給定一篇純英文的文字,統計其中迴文單詞的比列,並輸出其中的迴文單詞,文字資料如下:

This is Everyday Grammar. I am Madam Lucija
And I am Kaveh. Why the title, Lucija?
Well, it is a special word. Madam?
Yeah, maybe I should spell it for you forward or backward?
I am lost. The word Madam is a Palindrome.
I just learned about them the other day and I am having a lot of fun!
Palindrome, huh? Let me try!
But first, we need to explain what a Palindrome is.
That is easy! Palindromes are words, phrases or numbers that read the same back and forward, like DAD.
So, Palindromes can be serious or just silly.
Yup, like, A nut for a jar of tuna.
Or, Borrow or Rob. Probably borrow!
And if you are hungry, you can always have a Taco cat?
That is gross. What about this one?
A man, a plan, a cat, a ham, a yak, a yam, a hat, a canal panama!
That is a real tongue twister. But I prefer Italy. Amore Roma!
So how do we make palindromes?
One, read words backwards and see if they make sense.
Two, try to make palindromes where even the spacing between words is consistent. Like, NotATon.
And three, you can always check the internet for great palindromes!
And that is Everyday Grammar.

注意:

  • 區分單詞的大小寫,即同一個單詞的大寫和小寫視為不同的單詞;

2. 分析

本次任務的思路很簡單,基本步驟如下:

  • 第一步:讀入文字資料,然後去掉文字中的換行符;
  • 第二步:去掉第一步處理後的文字中的標點符號,這裡使用正規表示式將文字中的單詞保留,從而達到去標點符號的目的。之後使用一個列表存入每一行去掉標點之後的文字。
  • 第三步:根據預處理之後的文字統計詞頻,因為一篇文字里面可能有很多重複的單詞,那麼只須判斷文字構成的子典中的單詞是否是迴文單詞即可。
  • 第四步:遍歷字典中的鍵,並判斷是否是迴文單詞,具體實現方法見程式碼;
  • 第五步:根據找到的迴文單詞計算文字中迴文單詞的比例;

3. 程式碼

import re
from collections import Counter


# 文字預處理,返回值為['This', 'is', 'Everyday']這種形式
def process(path):
    token = []
    with open(path, 'r') as f:
        text = f.readlines()
        for row_text in text:
            row_text_prod = row_text.rstrip('\n')
            row_text_prod = re.findall(r'\b\w+\b', row_text_prod)
            token = token + row_text_prod
        return token


# 統計迴文單詞
def palindrome(processed_text):
    c = Counter(processed_text)  # 詞頻字典
    palindrome_word = []  # 迴文單詞列表
    not_palindrome_word = []  # 非迴文單詞列表
    # 遍歷詞頻字典
    for word in c.keys():
        flag = True
        i, j = 0, len(word)-1
        # 判斷是否是迴文單詞
        while i < j:
            if word[i] != word[j]:
                not_palindrome_word.append(word)  # 不是迴文單詞
                flag = False
                break
            i += 1
            j -= 1
        if flag:
            palindrome_word.append(word)  # 是迴文單詞
    print("迴文單詞:")
    print(palindrome_word)
    print("非迴文單詞:")
    print(not_palindrome_word)
    # 統計迴文單詞的比率
    total_palindrome_word = 0
    for word in palindrome_word:
        total_palindrome_word += c[word]
    print("迴文單詞的比例為:{:.3f}".format(total_palindrome_word / len(processed_text)))


def main():
    text_path = 'test.txt'
    processed_text = process(text_path)
    palindrome(processed_text)


if __name__ == '__main__':
    main()

reference:
python3小技巧之:妙用string.punctuation
迴文字串(Palindromic_String)

相關文章