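Spell-Checking Words with Python

Published 2015-12-25

While going through some old code over the past few days, I found a lot of misspelled words in the comments I had written. The mistakes are not far off the mark, so a tool should be able to correct the vast majority of them automatically. Writing a spell-checking script in Python is easy, and it gets even easier if you make good use of ready-made utilities such as aspell/ispell.

Key points

1. Feed a misspelled word to aspell -a to get a list of candidate corrections, then use edit distance to narrow those down to the most likely words. For example, running aspell -a and typing 'hella' produces the following suggestions: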

hell, Helli, hello, heal, Heall, he’ll, hells, Heller, Ella, Hall, Hill, Hull, hall, heel, hill, hula, hull, Helga, Helsa, Bella, Della, Mella, Sella, fella, Halli, Hally, Hilly, Holli, Holly, hallo, hilly, holly, hullo, Hell’s, hell’s
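To make the format concrete, here is a minimal sketch of turning one line of aspell -a output into a Python list. It assumes the ispell-compatible pipe format, where a misspelled word with suggestions is reported as "& word count offset: s1, s2, ...". The function name parse_aspell_line and the shortened sample line are made up for illustration.

```python
def parse_aspell_line(line):
    """Return the suggestions found on one `aspell -a` result line.

    Assumes the ispell-compatible pipe format, where a misspelled word
    with suggestions is reported as
    "& <word> <count> <offset>: <s1>, <s2>, ...".
    Any other line (for example "*" for a correct word) yields [].
    """
    line = line.strip()
    if not line.startswith("& ") or ": " not in line:
        return []
    # Everything after the first ": " is the comma-separated suggestion list.
    _, suggestions = line.split(": ", 1)
    return [s.strip() for s in suggestions.split(", ") if s.strip()]


# Shortened, hand-written sample line (not real aspell output).
sample = "& hella 5 0: hell, Helli, hello, heal, he'll"
print(parse_aspell_line(sample))
print(parse_aspell_line("*"))
```

Lines that do not start with "&" are ignored on purpose, so the version banner and the "*" reply for a correct word never get mistaken for suggestions.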

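2. What is edit distance (also called Levenshtein distance)? It is the minimum number of single-character insertions, deletions and substitutions needed to turn one word into another; the variant used here also counts swapping two adjacent characters as a single edit. Applying every such single edit to a word enumerates all the strings one edit away from it, and the plausible correct spellings are among them. For 'hella', the enumeration gives: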

‘helkla’, ‘hjlla’, ‘hylla’, ‘hellma’, ‘khella’, ‘iella’, ‘helhla’, ‘hellag’, ‘hela’, ‘vhella’, ‘hhella’, ‘hell’, ‘heglla’, ‘hvlla’, ‘hellaa’, ‘ghella’, ‘hellar’, ‘heslla’, ‘lhella’, ‘helpa’, ‘hello’, …
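The distance itself can also be computed directly. The sketch below is the textbook dynamic-programming version with insert, delete and substitute only, so a swap of two adjacent characters costs 2 here rather than 1. The function name levenshtein is my own choice, not something the script in this post uses.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


for candidate in ["hello", "hell", "hall", "holly"]:
    print(candidate, levenshtein("hella", candidate))
```

Under this measure 'hello' and 'hell' are one edit from 'hella', while 'hall' and 'holly' are two edits away, which is why only the first two also show up in the one-edit enumeration.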

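3. Combine the two sets above, and apply some knowledge of how typos happen to improve accuracy: misspellings are usually unintentional slips or mistyped keys, a word is rarely completely wrong, and the first letter is seldom the one that is wrong. So candidates whose first letter does not match could be dropped, for example 'Sella', 'Mella', 'khella' and 'iella'. Rather than deleting them, VPSee moves these words to the end of the queue (lowering their priority), so words starting with another letter are considered only when nothing starting with h matches.

4. The program relies on the external tool aspell. How can Python capture an external program's input and output so that they can be processed in Python? The subprocess module, introduced in Python 2.4, handles this with subprocess.Popen.

5. Peter Norvig of Google wrote How to Write a Spelling Corrector, which is well worth reading: he solves the problem in 21 lines of Python, with no external tools, needing only a dictionary file read in beforehand. The edits1 function in this post's program is copied from his article.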
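For a rough idea of how a dictionary-only corrector can work, here is a simplified sketch in the spirit of that article. It is not Norvig's code: it only looks one edit away, and WORD_COUNTS is a tiny hypothetical table standing in for word frequencies that would normally be counted from a large text file. The name correct_from_counts is also made up.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# Hypothetical stand-in for word frequencies counted from a large corpus.
WORD_COUNTS = Counter({"hello": 50, "help": 30, "hell": 20, "hall": 10, "held": 5})


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct_from_counts(word):
    """Return the most frequent known word within one edit, else the word itself."""
    if word in WORD_COUNTS:
        return word  # already a known word
    known = [w for w in edits1(word) if w in WORD_COUNTS]
    return max(known, key=WORD_COUNTS.get) if known else word


print(correct_from_counts("hella"))
print(correct_from_counts("xyzzy"))
```

The frequency table plays the role that aspell's ranking plays in my script: when several known words are one edit away, the more common one wins.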

Code

#!/usr/bin/env python3
# A simple spell checker
# written by http://www.vpsee.com

import signal
import subprocess
import sys

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def found(word, args, cwd=None, shell=True):
    # run the external checker and talk to it through pipes
    child = subprocess.Popen(args,
        shell=shell,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        cwd=cwd,
        universal_newlines=True)
    # send the word, close stdin and collect everything aspell printed
    (stdout, stderr) = child.communicate(word + "\n")
    for line in stdout.splitlines():
        # a suggestion line looks like "& word count offset: s1, s2, ...";
        # the version banner and other replies are skipped
        if not line.startswith("&") or ": " not in line:
            continue
        # remove left part until :
        left, candidates = line.split(": ", 1)
        candidates = candidates.split(", ")
        # making an error on the first letter of a word is less
        # probable, so we move those candidates to the tail of the
        # queue to give them lower priority (relative order is kept)
        same = [c for c in candidates if c[0] == word[0]]
        other = [c for c in candidates if c[0] != word[0]]
        return same + other
    return None

# copy from http://norvig.com/spell-correct.html
def edits1(word):
    n = len(word)
    return set([word[0:i]+word[i+1:] for i in range(n)] +
        [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] +
        [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] +
        [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])

def correct(word):
    candidates1 = found(word, 'aspell -a')
    if not candidates1:
        print("no suggestion")
        return

    candidates2 = edits1(word)
    # keep aspell's suggestions that are one edit away, in priority order
    candidates = [c for c in candidates1 if c in candidates2]
    if not candidates:
        print("suggestion: %s" % candidates1[0])
    else:
        # take the highest-priority match instead of max(), which would
        # pick the alphabetically last word
        print("suggestion: %s" % candidates[0])

def signal_handler(signum, frame):
    # exit quietly on Ctrl-C
    sys.exit(0)

if __name__ == '__main__':
    signal.signal(signal.SIGINT, signal_handler)
    while True:
        word = input().strip()
        if word:
            correct(word)

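A simpler approach

Of course, the simplest option is to call a suitable module directly from the program. A library called PyEnchant supports spell checking; once PyEnchant and Enchant are installed, it can be imported straight into a Python program: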

>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>> d.suggest("Helo")
['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
>>>
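Coming back to where this started, misspellings in code comments, here is a rough sketch of pulling the words out of the comments in a piece of Python source and running each through a checker. The checker is passed in as a function, so something like the check method of an enchant dictionary could presumably be supplied. To keep the example runnable without any installation, it uses a tiny hypothetical word set instead. The names misspelled_in_comments and KNOWN are made up.

```python
import io
import re
import tokenize


def misspelled_in_comments(source, is_known):
    """Yield (line_number, word) for each comment word rejected by is_known."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok in tokens:
        if tok.type != tokenize.COMMENT:
            continue  # only look inside comments, not code or strings
        for word in re.findall(r"[A-Za-z]+", tok.string):
            if not is_known(word):
                yield tok.start[0], word


# Hypothetical tiny vocabulary; a real dictionary lookup would replace it.
KNOWN = {"remove", "the", "left", "part", "until", "colon"}

code = "x = 1  # remve the left part\ny = 2  # until the colon\n"
print(list(misspelled_in_comments(code, lambda w: w.lower() in KNOWN)))
```

Each reported word could then be handed to a corrector to get a suggestion for that line.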
