PythonShowMeTheCode(0004): Counting Word Occurrences

Posted by weixin_33860553 on 2016-08-18

1. Problem

Problem 0004: Given any plain-text English file, count how many times each word occurs in it.

2. Result

#------1.txt-----------
  There are moments in life when you miss only
 one life and one chance to do
 you want to do.is 
isn't don't word_d common

#------Output------------
do: 2
word_d: 1
want: 1
to: 2
is: 1
you: 2
isn't: 1
don't: 1
...
  • All words are converted to lowercase before counting
  • Tokens such as isn't and word_d should each be counted as a single word
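
The two rules above can be sketched in a few lines. This is only an illustration: the sample string and the variable names are made up for the demo, and the naive `\w+` pattern is shown purely for contrast (the post does not say it was ever used).

```python
import re

# Hypothetical sample line, not taken from 1.txt
line = "Isn't DON'T Word_d"

# Rule 1: lowercase everything before matching
lowered = line.lower()

# A naive \w+ pattern splits at the apostrophe, because ' is not a word character
naive = re.findall(r"\w+", lowered)
print(naive)   # ['isn', 't', 'don', 't', 'word_d']

# Rule 2: allow ', _ and - inside a word so each token stays whole
whole = re.findall(r"[a-z'_-]+\b", lowered)
print(whole)   # ["isn't", "don't", 'word_d']
```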

3. Implementation

# -*- coding:utf-8 -*-
import re


def get_word_dict(file_path=None):
    if file_path is None:
        print("Error")
        return

    word_dict = {}
    with open(file_path, "r", encoding="utf-8") as file:
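        for line in file:  # iterate line by line instead of loading the whole file
            # Lowercase the line, then match runs of letters, ', _ and -
            words = re.findall(r"[a-z\'_-]+\b", line.lower())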
            for word in words:
                if word not in word_dict:
                    word_dict[word] = 1
                else:
                    word_dict[word] += 1
    for word, count in word_dict.items():
        # print() already ends the line, so no explicit "\n" is needed
        print("%s: %d" % (word, count))
    return word_dict


if __name__ == "__main__":
    get_word_dict("1.txt")
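
As a possible alternative to the manual dictionary bookkeeping above, the standard library's collections.Counter can do the counting. The following is a minimal sketch, not the author's method: `count_words` and `WORD_PATTERN` are names invented here, and the text is passed in as a string (copied from the 1.txt sample) so the example runs without a file.

```python
import re
from collections import Counter

# Same character class as in the implementation above
WORD_PATTERN = re.compile(r"[a-z'_-]+\b")


def count_words(text):
    """Return a Counter mapping each lowercase word to its number of occurrences."""
    return Counter(WORD_PATTERN.findall(text.lower()))


# Sample text copied from the 1.txt example
sample = (
    "  There are moments in life when you miss only\n"
    " one life and one chance to do\n"
    " you want to do.is \n"
    "isn't don't word_d common\n"
)

counts = count_words(sample)
for word in ("do", "to", "you", "is", "isn't", "word_d"):
    print("%s: %d" % (word, counts[word]))
```

A Counter is a dict subclass, so existing code that reads word_dict would work unchanged; it also returns 0 for words that never appeared instead of raising KeyError.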

4. Problems Solved

<i>I. Words such as isn't were not recognized</i>
The regular expression needs a \b added to it to act as a word boundary.
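
The post does not show the earlier pattern that failed, so the sketch below only illustrates what the trailing \b changes for the final character class. The sample string is invented for the demo.

```python
import re

# Hypothetical input with a standalone hyphen and a trailing hyphen
text = "isn't don't word_d - end-"

# Without \b, punctuation-only runs and trailing punctuation are kept
without_boundary = re.findall(r"[a-z'_-]+", text)
print(without_boundary)   # ["isn't", "don't", 'word_d', '-', 'end-']

# With \b, a match must end right after a word character (letter, digit or _),
# so the lone "-" is dropped and "end-" is trimmed to "end"
with_boundary = re.findall(r"[a-z'_-]+\b", text)
print(with_boundary)      # ["isn't", "don't", 'word_d', 'end']
```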

<i>II. Encoding error when reading the file</i>
Pass the encoding argument to open().
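
A short sketch of why the explicit argument helps: when encoding is omitted, open() in text mode falls back to a platform-dependent default, which presumably did not match the file in the author's case. The helper `read_lines` and the temporary demo file are assumptions made for this example.

```python
import os
import tempfile


def read_lines(file_path, encoding="utf-8"):
    """Read a text file with an explicit encoding instead of the platform default."""
    with open(file_path, "r", encoding=encoding) as file:
        return file.read().splitlines()


# Write a small UTF-8 file containing non-ASCII characters, then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as file:
        file.write("café isn't naïve\n")
    lines = read_lines(demo_path)

print(lines)
```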
