Python NLP Tools

Posted by sljwy on 2020-10-20

A Roundup of Python Natural Language Processing (NLP) Tools


NLTK

  • Overview:

NLTK is the leading tool for working with natural language in Python. It provides an interface to lexical resources such as WordNet, along with libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

  • Website:

Natural Language Toolkit

  • Installation:

Install NLTK:

[root@master ~]# pip install nltk
Collecting nltk
  Downloading nltk-3.2.1.tar.gz (1.1MB)
    100% |████████████████████████████████| 1.1MB 664kB/s
Installing collected packages: nltk
  Running setup.py install for nltk ... done
Successfully installed nltk-3.2.1
  • Notes:
    After installing the package, you must also download the NLTK corpora before NLTK is usable. The download is a compressed archive that must be extracted into the nltk_data directory. The directory layout looks like this:
zang@ZANG-PC D:\nltk_data
> ls -al
total 44
drwxrwx---+ 1 Administrators None   0 Oct 25  2015 .
drwxrwx---+ 1 SYSTEM         SYSTEM 0 May 30 10:55 ..
drwxrwx---+ 1 Administrators None   0 Oct 25  2015 chunkers
drwxrwx---+ 1 Administrators None   0 Oct 25  2015 corpora
drwxrwx---+ 1 Administrators None   0 Oct 25  2015 grammers
drwxrwx---+ 1 Administrators None   0 Oct 25  2015 help
drwxrwx---+ 1 Administrators None   0 Oct 25  2015 stemmers
drwxrwx---+ 1 Administrators None   0 Oct 25  2015 taggers
drwxrwx---+ 1 Administrators None   0 Oct 25  2015 tokenizers
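A quick sanity check once the package is installed, using NLTK's Porter stemmer (which works without any downloaded corpora; the example words are arbitrary):

```python
from nltk.stem import PorterStemmer

# Reduce inflected word forms to a common stem.
stemmer = PorterStemmer()
stems = [stemmer.stem(w) for w in ["running", "runs", "runner"]]
print(stems)
```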

Pattern

  • Overview:

Pattern is a web mining module for Python. It includes the following tools:
* Data mining: web service interfaces (Google, Twitter, Wikipedia), a web crawler, and an HTML DOM parser.
* Natural language processing: part-of-speech tagging, n-gram search, sentiment analysis, and WordNet.
* Machine learning: vector space model (VSM), clustering, and classification (KNN, SVM, perceptron).
* Network analysis: graph centrality and visualization.


  • Website:

GitHub page

  • Installation:
[root@master ~]# pip install pattern
Collecting pattern
  Downloading pattern-2.6.zip (24.6MB)
    100% |████████████████████████████████| 24.6MB 43kB/s
Installing collected packages: pattern
  Running setup.py install for pattern ... done
Successfully installed pattern-2.6
[root@master ~]#

TextBlob

  • Overview:

    TextBlob builds on NLTK and pattern and inherits features from both, including:

    • Noun phrase extraction
    • Part-of-speech tagging
    • Sentiment analysis
    • Classification (Naive Bayes, decision tree)
    • Translation via Google Translate
    • Word and sentence tokenization
    • Word and phrase frequency statistics
    • Parsing
    • n-grams
    • Word inflection and stemming
    • Spelling correction
    • WordNet integration, plus the ability to add new models and languages through extensions
  • Website:

TextBlob: Simplified Text Processing

  • Installation:
[root@master ~]# pip install -U textblob
Collecting textblob
  Downloading textblob-0.11.1-py2.py3-none-any.whl (634kB)
    100% |████████████████████████████████| 634kB 1.1MB/s
Requirement already up-to-date: nltk>=3.1 in /usr/lib/python2.7/site-packages (from textblob)
Installing collected packages: textblob
Successfully installed textblob-0.11.1
[root@master ~]# python -m textblob.download_corpora
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.

Gensim

  • Overview:

Gensim is a Python library for topic modelling, document indexing, and similarity retrieval over large corpora. It can handle input larger than available memory. Its author calls it "the most robust, efficient and hassle-free piece of software to realize unsupervised semantic modelling from plain text."


  • Website:

Gensim HomePage

GitHub - piskvorky/gensim: Topic Modelling for Humans

  • Installation:
[root@master ~]# pip install -U gensim
Collecting gensim
  Downloading gensim-0.12.4.tar.gz (2.4MB)
    100% |████████████████████████████████| 2.4MB 358kB/s
Collecting numpy>=1.3 (from gensim)
  Downloading numpy-1.11.0-cp27-cp27mu-manylinux1_x86_64.whl (15.3MB)
    100% |████████████████████████████████| 15.3MB 66kB/s
Collecting scipy>=0.7.0 (from gensim)
  Downloading scipy-0.17.1-cp27-cp27mu-manylinux1_x86_64.whl (39.5MB)
    100% |████████████████████████████████| 39.5MB 27kB/s
Requirement already up-to-date: six>=1.5.0 in /usr/lib/python2.7/site-packages/six-1.10.0-py2.7.egg (from gensim)
Collecting smart_open>=1.2.1 (from gensim)
  Downloading smart_open-1.3.3.tar.gz
Collecting boto>=2.32 (from smart_open>=1.2.1->gensim)
  Downloading boto-2.40.0-py2.py3-none-any.whl (1.3MB)
    100% |████████████████████████████████| 1.4MB 634kB/s
Requirement already up-to-date: bz2file in /usr/lib/python2.7/site-packages (from smart_open>=1.2.1->gensim)
Collecting requests (from smart_open>=1.2.1->gensim)
  Downloading requests-2.10.0-py2.py3-none-any.whl (506kB)
    100% |████████████████████████████████| 512kB 1.4MB/s
Installing collected packages: numpy, scipy, boto, requests, smart-open, gensim
  Found existing installation: numpy 1.10.1
    Uninstalling numpy-1.10.1:
      Successfully uninstalled numpy-1.10.1
  Found existing installation: scipy 0.12.1
    DEPRECATION: Uninstalling a distutils installed project (scipy) has been deprecated and will be removed in a future version. This is due to the fact that uninstalling a distutils project will only partially uninstall the project.
    Uninstalling scipy-0.12.1:
      Successfully uninstalled scipy-0.12.1
  Found existing installation: boto 2.38.0
    Uninstalling boto-2.38.0:
      Successfully uninstalled boto-2.38.0
  Found existing installation: requests 2.8.1
    Uninstalling requests-2.8.1:
      Successfully uninstalled requests-2.8.1
  Found existing installation: smart-open 1.3.1
    Uninstalling smart-open-1.3.1:
      Successfully uninstalled smart-open-1.3.1
  Running setup.py install for smart-open ... done
  Found existing installation: gensim 0.12.3
    Uninstalling gensim-0.12.3:
      Successfully uninstalled gensim-0.12.3
  Running setup.py install for gensim ... done
Successfully installed boto-2.40.0 gensim-0.12.4 numpy-1.11.0 requests-2.6.0 scipy-0.17.1 smart-open-1.3.3

PyNLPl

  • Overview:

Its full name is the Python Natural Language Processing Library (pronounced "pineapple"). PyNLPl is a library for natural language processing tasks that collects a variety of standalone or loosely interrelated modules, both common and uncommon, useful for NLP work. PyNLPl can be used for n-gram search, for computing frequency lists and distributions, and for building language models. It also handles more advanced data structures such as priority queues, and more advanced algorithms such as beam search.

  • Website:

Github

PyNLPl HomePage

  • Installation:

Download the source from GitHub, extract it, then build and install:

[root@master pynlpl-master]# python setup.py install
Preparing build
running install
running bdist_egg
running egg_info
creating PyNLPl.egg-info
writing requirements to PyNLPl.egg-info/requires.txt
writing PyNLPl.egg-info/PKG-INFO
writing top-level names to PyNLPl.egg-info/top_level.txt
writing dependency_links to PyNLPl.egg-info/dependency_links.txt
writing manifest file 'PyNLPl.egg-info/SOURCES.txt'
reading manifest file 'PyNLPl.egg-info/SOURCES.txt'
writing manifest file 'PyNLPl.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib
creating build/lib/pynlpl
copying pynlpl/tagger.py -> build/lib/pynlpl
......
byte-compiling build/bdist.linux-x86_64/egg/pynlpl/__init__.py to __init__.pyc
byte-compiling build/bdist.linux-x86_64/egg/pynlpl/mt/__init__.py to __init__.pyc
byte-compiling build/bdist.linux-x86_64/egg/pynlpl/mt/wordalign.py to wordalign.pyc
byte-compiling build/bdist.linux-x86_64/egg/pynlpl/statistics.py to statistics.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying PyNLPl.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying PyNLPl.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying PyNLPl.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying PyNLPl.egg-info/not-zip-safe -> build/bdist.linux-x86_64/egg/EGG-INFO
copying PyNLPl.egg-info/requires.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying PyNLPl.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
creating dist
creating 'dist/PyNLPl-0.9.2-py2.7.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing PyNLPl-0.9.2-py2.7.egg
creating /usr/lib/python2.7/site-packages/PyNLPl-0.9.2-py2.7.egg
Extracting PyNLPl-0.9.2-py2.7.egg to /usr/lib/python2.7/site-packages
Adding PyNLPl 0.9.2 to easy-install.pth file

Installed /usr/lib/python2.7/site-packages/PyNLPl-0.9.2-py2.7.egg
Processing dependencies for PyNLPl==0.9.2
Searching for httplib2>=0.6
Reading https://pypi.python.org/simple/httplib2/
Best match: httplib2 0.9.2
Downloading https://pypi.python.org/packages/ff/a9/5751cdf17a70ea89f6dde23ceb1705bfb638fd8cee00f845308bf8d26397/httplib2-0.9.2.tar.gz#md5=bd1b1445b3b2dfa7276b09b1a07b7f0e
Processing httplib2-0.9.2.tar.gz
Writing /tmp/easy_install-G32Vg8/httplib2-0.9.2/setup.cfg
Running httplib2-0.9.2/setup.py -q bdist_egg --dist-dir /tmp/easy_install-G32Vg8/httplib2-0.9.2/egg-dist-tmp-IgKi70
zip_safe flag not set; analyzing archive contents…
httplib2.init: module references file
Adding httplib2 0.9.2 to easy-install.pth file

Installed /usr/lib/python2.7/site-packages/httplib2-0.9.2-py2.7.egg
Searching for numpy==1.11.0
Best match: numpy 1.11.0
Adding numpy 1.11.0 to easy-install.pth file

Using /usr/lib64/python2.7/site-packages
Searching for lxml==3.2.1
Best match: lxml 3.2.1
Adding lxml 3.2.1 to easy-install.pth file

Using /usr/lib64/python2.7/site-packages
Finished processing dependencies for PyNLPl==0.9.2


spaCy

  • Overview:

spaCy is commercial open-source software that combines the strengths of Python and Cython into a fast, state-of-the-art natural language processing toolkit.

  • Website:

HomePage

GitHub

  • Installation:
[root@master pynlpl-master]# pip install spacy
Collecting spacy
  Downloading spacy-0.101.0-cp27-cp27mu-manylinux1_x86_64.whl (5.7MB)
    100% |████████████████████████████████| 5.7MB 161kB/s
Collecting thinc<5.1.0,>=5.0.0 (from spacy)
  Downloading thinc-5.0.8-cp27-cp27mu-manylinux1_x86_64.whl (1.4MB)
    100% |████████████████████████████████| 1.4MB 287kB/s
Collecting murmurhash<0.27,>=0.26 (from spacy)
  Downloading murmurhash-0.26.4-cp27-cp27mu-manylinux1_x86_64.whl
Collecting cloudpickle (from spacy)
  Downloading cloudpickle-0.2.1-py2.py3-none-any.whl
Collecting plac (from spacy)
  Downloading plac-0.9.1.tar.gz (151kB)
    100% |████████████████████████████████| 153kB 3.2MB/s
Requirement already satisfied (use --upgrade to upgrade): numpy>=1.7 in /usr/lib64/python2.7/site-packages (from spacy)
Requirement already satisfied (use --upgrade to upgrade): six in /usr/lib/python2.7/site-packages/six-1.10.0-py2.7.egg (from spacy)
Collecting cymem<1.32,>=1.30 (from spacy)
  Downloading cymem-1.31.2-cp27-cp27mu-manylinux1_x86_64.whl (66kB)
    100% |████████████████████████████████| 71kB 4.3MB/s
Collecting preshed<0.47,>=0.46.1 (from spacy)
  Downloading preshed-0.46.4-cp27-cp27mu-manylinux1_x86_64.whl (223kB)
    100% |████████████████████████████████| 225kB 2.4MB/s
Collecting sputnik<0.10.0,>=0.9.2 (from spacy)
  Downloading sputnik-0.9.3-py2.py3-none-any.whl
Collecting semver (from sputnik<0.10.0,>=0.9.2->spacy)
  Downloading semver-2.5.0.tar.gz
Installing collected packages: murmurhash, cymem, preshed, thinc, cloudpickle, plac, semver, sputnik, spacy
  Running setup.py install for plac ... done
  Running setup.py install for semver ... done
Successfully installed cloudpickle-0.2.1 cymem-1.31.2 murmurhash-0.26.4 plac-0.9.1 preshed-0.46.4 semver-2.5.0 spacy-0.101.0 sputnik-0.9.3 thinc-5.0.8
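Note that the log above installs an old 0.x release; with a current spaCy release, a blank pipeline already tokenizes text without downloading a trained model (a minimal sketch):

```python
import spacy

# A blank English pipeline provides tokenization with no model download.
nlp = spacy.blank("en")
doc = nlp("spaCy combines Python and Cython for speed.")
print([token.text for token in doc])
```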

Polyglot

  • Overview:

Polyglot supports large-scale multilingual applications. Its features, and the number of languages each one covers, are:

Tokenization (165 languages)
Language detection (196 languages)
Named entity recognition (40 languages)
Part-of-speech tagging (16 languages)
Sentiment analysis (136 languages)
Word embeddings (137 languages)
Morphological analysis (135 languages)
Transliteration (69 languages)

  • Website:

Github

  • Installation:
[root@master pynlpl-master]# pip install polyglot
Collecting polyglot
  Downloading polyglot-15.10.03-py2.py3-none-any.whl (54kB)
    100% |████████████████████████████████| 61kB 153kB/s
Collecting pycld2>=0.3 (from polyglot)
  Downloading pycld2-0.31.tar.gz (14.3MB)
    100% |████████████████████████████████| 14.3MB 71kB/s
Collecting wheel>=0.23.0 (from polyglot)
  Downloading wheel-0.29.0-py2.py3-none-any.whl (66kB)
    100% |████████████████████████████████| 71kB 4.2MB/s
Collecting futures>=2.1.6 (from polyglot)
  Downloading futures-3.0.5-py2-none-any.whl
Requirement already satisfied (use --upgrade to upgrade): six>=1.7.3 in /usr/lib/python2.7/site-packages/six-1.10.0-py2.7.egg (from polyglot)
Collecting PyICU>=1.8 (from polyglot)
  Downloading PyICU-1.9.3.tar.gz (179kB)
    100% |████████████████████████████████| 184kB 2.9MB/s
Collecting morfessor>=2.0.2a1 (from polyglot)
  Downloading Morfessor-2.0.2alpha3.tar.gz
Installing collected packages: pycld2, wheel, futures, PyICU, morfessor, polyglot
  Running setup.py install for pycld2 ... done
  Running setup.py install for PyICU ... done
  Running setup.py install for morfessor ... done
Successfully installed PyICU-1.9.3 futures-3.0.5 morfessor-2.0.2a3 polyglot-15.10.3 pycld2-0.31 wheel-0.29.0

MontyLingua

  • Overview:

MontyLingua is a free, powerful, end-to-end English text processor. Feed raw English text into MontyLingua and you get back a semantic interpretation of that text. It is suited to information retrieval and extraction, request processing, and question answering. From English text it can extract subject-verb-object tuples; adjectives, noun phrases, and verb phrases; names of people, places, and events; dates and times; and other semantic information.

  • Website:

HomePage

Github

  • Usage:

Webservice:

python server.py

The webservice runs on port 8001 at /service by default. For parameters etc. see the NIF spec. You can then curl your query like this:

curl "http://localhost:8001/service?nif=true&input-type=text&input=This%20is%20a%20city%20called%20Berlin."

or simply use your browser to query the target.

Console:

python nif.py

This method is mainly for debugging purposes and supports only hardcoded options.


BLLIP Parser

  • Overview:

BLLIP Parser (also known as the Charniak-Johnson parser) is a statistical natural language parser that combines a generative constituent parser with a maximum-entropy reranker. It provides both command-line and Python interfaces.

  • Website:

GitHub

HomePage

  • Installation:
[root@master pynlpl-master]# pip install --user bllipparser
Collecting bllipparser
  Downloading bllipparser-2015.12.3.tar.gz (548kB)
    100% |████████████████████████████████| 552kB 1.2MB/s
Requirement already satisfied (use --upgrade to upgrade): six in /usr/lib/python2.7/site-packages/six-1.10.0-py2.7.egg (from bllipparser)
Building wheels for collected packages: bllipparser
  Running setup.py bdist_wheel for bllipparser ... done
  Stored in directory: /root/.cache/pip/wheels/6f/7a/d8/037a4aa0fa275f43e1129008eb7834dc8522ef158d2e96534b
Successfully built bllipparser
Installing collected packages: bllipparser
Successfully installed bllipparser

Quepy

  • Overview:

Quepy is a Python framework for transforming natural language questions into queries in a database query language. It makes it easy to define custom mappings between different kinds of natural language questions and database queries, so with Quepy you can build your own natural-language database query system by changing only a few lines of code.

  • Website:

GitHub - machinalis/quepy: A python framework to transform natural language questions to queries in a database query language.
Quepy: A Python framework to transform natural language questions to queries.

  • Installation:
[root@master pynlpl-master]# pip install quepy
Collecting quepy
  Downloading quepy-0.2.tar.gz (42kB)
    100% |████████████████████████████████| 51kB 128kB/s
Collecting refo (from quepy)
  Downloading REfO-0.13.tar.gz
Requirement already satisfied (use --upgrade to upgrade): nltk in /usr/lib/python2.7/site-packages (from quepy)
Collecting SPARQLWrapper (from quepy)
  Downloading SPARQLWrapper-1.7.6.zip
Collecting rdflib>=4.0 (from SPARQLWrapper->quepy)
  Downloading rdflib-4.2.1.tar.gz (889kB)
    100% |████████████████████████████████| 890kB 823kB/s
Collecting keepalive>=0.5 (from SPARQLWrapper->quepy)
  Downloading keepalive-0.5.zip
Collecting isodate (from rdflib>=4.0->SPARQLWrapper->quepy)
  Downloading isodate-0.5.4.tar.gz
Requirement already satisfied (use --upgrade to upgrade): pyparsing in /usr/lib/python2.7/site-packages (from rdflib>=4.0->SPARQLWrapper->quepy)
Collecting html5lib (from rdflib>=4.0->SPARQLWrapper->quepy)
  Downloading html5lib-0.9999999.tar.gz (889kB)
    100% |████████████████████████████████| 890kB 854kB/s
Requirement already satisfied (use --upgrade to upgrade): six in /usr/lib/python2.7/site-packages/six-1.10.0-py2.7.egg (from html5lib->rdflib>=4.0->SPARQLWrapper->quepy)
Building wheels for collected packages: quepy, refo, SPARQLWrapper, rdflib, keepalive, isodate, html5lib
  Running setup.py bdist_wheel for quepy ... done
  Stored in directory: /root/.cache/pip/wheels/c8/04/bf/495b88a68aa5c1e9dd1629b09ab70261651cf517d1b1c27464
  Running setup.py bdist_wheel for refo ... done
  Stored in directory: /root/.cache/pip/wheels/76/97/81/825976cf0a2b9ad759bbec13a649264938dffb52dfd56ac6c8
  Running setup.py bdist_wheel for SPARQLWrapper ... done
  Stored in directory: /root/.cache/pip/wheels/50/fe/25/be6e98daa4f576494df2a18d5e86a182e3d7e0735d062cc984
  Running setup.py bdist_wheel for rdflib ... done
  Stored in directory: /root/.cache/pip/wheels/fb/93/10/4f8a3e95937d8db410a490fa235bd95e0e0d41b5f6274b20e5
  Running setup.py bdist_wheel for keepalive ... done
  Stored in directory: /root/.cache/pip/wheels/16/4f/c1/121ddff67b131a371b66d682feefac055fbdbb9569bfde5c51
  Running setup.py bdist_wheel for isodate ... done
  Stored in directory: /root/.cache/pip/wheels/61/c0/d2/6b4a10c222ba9261ab9872a8f05d471652962284e8c677e5e7
  Running setup.py bdist_wheel for html5lib ... done
  Stored in directory: /root/.cache/pip/wheels/6f/85/6c/56b8e1292c6214c4eb73b9dda50f53e8e977bf65989373c962
Successfully built quepy refo SPARQLWrapper rdflib keepalive isodate html5lib
Installing collected packages: refo, isodate, html5lib, rdflib, keepalive, SPARQLWrapper, quepy
Successfully installed SPARQLWrapper-1.7.6 html5lib-0.9999999 isodate-0.5.4 keepalive-0.5 quepy-0.2 rdflib-4.2.1 refo-0.13

MBSP

  • Overview:

MBSP is a text analysis system based on the TiMBL and MBT memory based learning applications developed at CLiPS and ILK. It provides tools for Tokenization and Sentence Splitting, Part of Speech Tagging, Chunking, Lemmatization, Relation Finding and Prepositional Phrase Attachment.
The general English version of MBSP has been trained on data from the Wall Street Journal corpus.


  • Website:

HomePage

Github

  • Installation:

Download, extract, then build and install:

[root@master MBSP]# python setup.py install
..... build output .....
..... takes about 2 minutes .....
