A Fuzzy Finder in 10 Lines of Python

Posted by _zero on 2015-07-26

Python

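Introduction:

Fuzzy matching has become a must-have feature in modern editors (when choosing a file to open). Given a partial string typed by the user, it guesses which file names the user wants and offers a list of suggestions to choose from.

Examples: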

Vim (Ctrl-P)

Sublime Text (Cmd-P)

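Fuzzy matching is an extremely useful feature, and it is also very easy to implement.

Problem analysis:

We have a collection of strings (file names) that we filter repeatedly as the user types; the input may be only part of a string. We will use the following collection as an example: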

>>> collection = ['django_migrations.py',
                'django_admin_log.py',
                'main_generator.py',
                'migrations.py',
                'api_user.doc',
                'user_group.doc',
                'accounts.txt',
                ]

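When the user types 'djm', we expect it to match 'django_migrations.py' and 'django_admin_log.py'. The simplest way to implement this is with regular expressions.

Solution:

1. Plain regex matching

Convert 'djm' into 'd.*j.*m' and try this regex against every string in the collection; every string that matches becomes a candidate.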

>>> import re
>>> def fuzzyfinder(user_input, collection):
        suggestions = []
        pattern = '.*'.join(user_input) # Converts 'djm' to 'd.*j.*m'
        regex = re.compile(pattern)     # Compiles a regex.
        for item in collection:
            match = regex.search(item)  # Checks if the current item matches the regex.
            if match:
                suggestions.append(item)
        return suggestions

>>> print(fuzzyfinder('djm', collection))
['django_migrations.py', 'django_admin_log.py']

>>> print(fuzzyfinder('mig', collection))
['django_migrations.py', 'django_admin_log.py', 'main_generator.py', 'migrations.py']

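This gives us a list of suggestions based on the user's input, but the suggestions are not ranked by importance, so the best match may end up last.

In fact, in this very example, when the user types 'mig' the best choice, 'migrations.py', comes last.

2. Match list with rank ordering

Here we sort the results by the position at which the match first starts.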

'main_generator.py'     - 0
'migrations.py'         - 0
'django_migrations.py'  - 7
'django_admin_log.py'   - 9

Here is the code:

>>> import re
>>> def fuzzyfinder(user_input, collection):
        suggestions = []
        pattern = '.*'.join(user_input) # Converts 'djm' to 'd.*j.*m'
        regex = re.compile(pattern)     # Compiles a regex.
        for item in collection:
            match = regex.search(item)  # Checks if the current item matches the regex.
            if match:
                suggestions.append((match.start(), item))
        return [x for _, x in sorted(suggestions)]

>>> print(fuzzyfinder('mig', collection))
['main_generator.py', 'migrations.py', 'django_migrations.py', 'django_admin_log.py']

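This time we build a list of 2-tuples: the first element of each tuple is the position where the match starts and the second is the corresponding file name. sorted() orders the tuples by that position, and the list comprehension then extracts the file names.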
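The ordering relies on how Python compares tuples: element by element, from left to right. A minimal illustration, using hand-written (start, name) pairs copied from the table above instead of real regex results:

```python
# (start position, file name) pairs, as built inside the loop.
suggestions = [
    (7, 'django_migrations.py'),
    (0, 'migrations.py'),
    (9, 'django_admin_log.py'),
    (0, 'main_generator.py'),
]

# Tuples compare element by element: the start position decides first,
# and the file name breaks ties alphabetically.
ranked = [name for _, name in sorted(suggestions)]
print(ranked)
```

This is why 'main_generator.py' lands ahead of 'migrations.py': both start at position 0, so the comparison falls through to the names.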

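We are now close to the final result, but it is not perfect yet: the user wants 'migrations.py', yet we suggest 'main_generator.py' first.

3. Sort by compactness of the match

When users start typing, they tend to type consecutive characters, aiming for an exact match. Someone who types 'mig' is more likely looking for 'migrations.py' or 'django_migrations.py' than for 'main_generator.py'. So the change we make here is to look for the most compact match.

This is easy in Python, because when a regex matches a string, the matched text is available from match.group(). Assuming the input is 'mig', the matches against the 'collection' defined at the start are: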

regex = '(m.*i.*g)'

'main_generator.py'    ->  'main_g'
'migrations.py'        ->  'mig'
'django_migrations.py' ->  'mig'
'django_admin_log.py'  ->  'min_log'
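The table above can be reproduced with a few lines. This is only a sketch: the names `files` and `matched` are chosen here for illustration, and the greedy pattern for 'mig' is written out by hand.

```python
import re

files = ['main_generator.py', 'migrations.py',
         'django_migrations.py', 'django_admin_log.py']
regex = re.compile('m.*i.*g')  # greedy pattern for the input 'mig'

# Map each file name to the text the regex actually matched.
matched = {name: regex.search(name).group() for name in files}
for name, text in matched.items():
    print('%-22s -> %-8s (length %d)' % (name, text, len(text)))
```

The shorter the matched text, the more compact the match, which is the quantity used for ranking in this step.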

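Here each suggestion is a 3-tuple: the first element is the length of the matched text, the second is the position where the match starts, and the third is the file name. We sort by match length and then by start position, and return the file names.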

>>> import re
>>> def fuzzyfinder(user_input, collection):
        suggestions = []
        pattern = '.*'.join(user_input) # Converts 'djm' to 'd.*j.*m'
        regex = re.compile(pattern)     # Compiles a regex.
        for item in collection:
            match = regex.search(item)  # Checks if the current item matches the regex.
            if match:
                suggestions.append((len(match.group()), match.start(), item))
        return [x for _, _, x in sorted(suggestions)]

>>> print(fuzzyfinder('mig', collection))
['migrations.py', 'django_migrations.py', 'main_generator.py', 'django_admin_log.py']

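For our input, the results are now close to perfect, but we are not done yet.

4. Non-greedy matching

Daniel Rocco spotted a subtle problem: when the collection contains both 'api_user' and 'user_group' and the user types 'user', the expected result (in relative order) is ['user_group', 'api_user'], but what we actually get is: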

>>> print(fuzzyfinder('user', collection))
['api_user.doc', 'user_group.doc']

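In the result above, 'api_user' is ranked ahead of 'user_group'. Digging a little deeper, we find that a search for 'user' expands to the regex 'u.*s.*e.*r'. Because 'user_group' contains two 'r's, the pattern matches 'user_gr' instead of the expected 'user'. The longer match pushes the item down in the final ranking, which is counterintuitive. The fix is easy: make the regex non-greedy.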
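The difference between the two quantifiers can be seen in isolation. The snippet below is a small illustration with both patterns written out by hand; the variable names are chosen here and are not part of fuzzyfinder.

```python
import re

name = 'user_group.doc'

# Greedy: each '.*' grabs as much as it can, so the match runs to the last 'r'.
greedy = re.search('u.*s.*e.*r', name).group()

# Non-greedy: each '.*?' grabs as little as it can, so the match stops at the first 'r'.
lazy = re.search('u.*?s.*?e.*?r', name).group()

print(greedy)  # user_gr
print(lazy)    # user
```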

>>> import re
>>> def fuzzyfinder(user_input, collection):
        suggestions = []
        pattern = '.*?'.join(user_input)    # Converts 'djm' to 'd.*?j.*?m'
        regex = re.compile(pattern)         # Compiles a regex.
        for item in collection:
            match = regex.search(item)      # Checks if the current item matches the regex.
            if match:
                suggestions.append((len(match.group()), match.start(), item))
        return [x for _, _, x in sorted(suggestions)]

>>> fuzzyfinder('user', collection)
['user_group.doc', 'api_user.doc']

>>> print(fuzzyfinder('mig', collection))
['migrations.py', 'django_migrations.py', 'main_generator.py', 'django_admin_log.py']

Now fuzzyfinder works correctly (in the cases above), and it took only 10 lines of code.
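One case the examples above do not exercise is input that contains regex metacharacters such as '.' or '+'. Because the pattern is built directly from what the user typed, such characters are interpreted as regex syntax and may match unintended text or fail to compile. A possible hardening, offered as a sketch and not as part of the 10-line version, is to pass each character through re.escape before joining. The function name fuzzyfinder_escaped is made up for this example.

```python
import re

def fuzzyfinder_escaped(user_input, collection):
    """Non-greedy fuzzy finder that treats every input character literally."""
    suggestions = []
    # re.escape turns '.' into '\.', '+' into '\+', and so on.
    pattern = '.*?'.join(map(re.escape, user_input))
    regex = re.compile(pattern)
    for item in collection:
        match = regex.search(item)
        if match:
            suggestions.append((len(match.group()), match.start(), item))
    return [x for _, _, x in sorted(suggestions)]

collection = ['django_migrations.py', 'django_admin_log.py',
              'main_generator.py', 'migrations.py',
              'api_user.doc', 'user_group.doc', 'accounts.txt']

print(fuzzyfinder_escaped('.doc', collection))
print(fuzzyfinder_escaped('c++', collection))
```

With escaping, '.doc' only matches names that contain a literal dot followed by 'd', 'o', 'c', and 'c++' simply returns an empty list.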

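Conclusion:

This is a record of how I designed and implemented fuzzy matching in my pgcli project (a PostgreSQL command-line client with auto-completion).

I have extracted fuzzyfinder into a standalone Python package. You can install it with 'pip install fuzzyfinder' and use it in your own project.

Thanks to Micah Zoltu and Daniel Rocco for reviewing the algorithm and fixing problems.

If you find this interesting, you can find me on Twitter.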

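Epilogue:

When I first thought about implementing fuzzy matching in Python, I already knew of an excellent library called fuzzywuzzy. It takes a different approach from the one here: it uses Levenshtein distance to find the closest string in a collection. Levenshtein distance is well suited to correcting spelling mistakes, but it does not work as well for matching long file names against a partial substring, which is why it is not used here.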
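To make the contrast concrete, here is a small pure-Python Levenshtein distance. It is the textbook dynamic-programming form, written for this illustration and not taken from fuzzywuzzy, and 'max.py' is a made-up file name. Under plain edit distance, a short unrelated name can score closer to 'mig' than a long name that actually starts with it.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from '' to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                   # distance from a[:i] to ''
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or keep
        prev = cur
    return prev[-1]

print(levenshtein('mig', 'migrations.py'))  # 10
print(levenshtein('mig', 'max.py'))         # 5
```

The ten extra characters in 'migrations.py' each count as an edit, so the name the user most likely wants ends up looking like the worse candidate.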

=EOF=
