A Quick Introduction to Natural Language Processing (NLP)

https://mp.weixin.qq.com/s/J-vndnycZgwVrSlDCefHZA

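[Editor's note] Natural language processing has become an important branch of artificial intelligence. It studies the theories and methods that let people and computers communicate effectively in natural language. This article offers a brief introduction to help readers get started quickly.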

Author | George Seif

Translation | Xiaowen

An easy introduction to Natural Language Processing

Using computers to understand human language

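Computers are very good at handling standardized, structured data such as database tables and financial records. They can process that data far faster than we humans can. But we humans don't communicate in “structured data”, and we don't speak binary! We communicate with words, which are a form of unstructured data.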

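Unfortunately, computers struggle with unstructured data because there is no standardized technique for processing it. When we program a computer in a language such as C, Java, or Python, we are really giving it a set of rules to operate by. For unstructured data, those rules are very abstract and hard to define concretely.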

There is a lot of unstructured natural language on the Internet; sometimes even Google doesn't know what you are searching for!

How humans and computers understand language

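Humans have been writing things down for thousands of years. Over that time, our brains have gained a tremendous amount of experience in understanding natural language. When we read something on a piece of paper or in a blog post on the Internet, we understand what it really means in the real world. We feel the emotions that the text evokes, and we often picture what the thing would look like in real life.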

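Natural language processing (NLP) is a subfield of artificial intelligence dedicated to enabling computers to understand and process human language, bringing them closer to a human-level understanding of it. Computers do not yet have the intuitive grasp of natural language that humans do; they cannot really understand what the language is trying to say. In short, a computer cannot read between the lines.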

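Even so, recent advances in machine learning (ML) have enabled computers to do a lot of useful things with natural language! Deep learning lets us write programs that perform tasks such as language translation, semantic understanding, and text summarization. All of these add real-world value, making it easy to understand and run computations over large blocks of text without manual effort.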

Let's start with a quick primer on how NLP works conceptually. After that, we'll dive into some Python code so you can start using NLP yourself!

The real reason NLP is hard

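The process of reading and understanding language is far more complex than it seems at first glance. A lot has to happen to truly understand what a piece of text means in the real world. For example, what do you think the following text means?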

“Steph Curry was on fire last night. He totally destroyed the other team”

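To a human, the meaning of this sentence is obvious. We know that Steph Curry is a basketball player; even if you didn't know that, you would gather that he plays on some kind of team, probably a sports team. When we see “on fire” and “destroyed”, we know it means that Steph Curry played really well last night and beat the other team.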

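Computers tend to take things too literally. Reading literally, a computer would see “Steph Curry” and, based on the capitalization, assume it is a person, a place, or something else important. But then it sees that Steph Curry “was on fire”... A computer might tell you that someone literally set Steph Curry on fire yesterday! Yikes. After that, the computer might say that Curry has destroyed the other team... they no longer exist... great...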

Steph Curry really is on fire!

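But not everything the machine does is so grim. Thanks to machine learning, we can actually do some very clever things to quickly extract and understand information from natural language! Let's see how to do that in a few lines of code with a couple of simple Python libraries.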

Solving NLP problems with Python code

To see how NLP works, we will use the following text from Wikipedia as our running example:

Amazon.com, Inc., doing business as Amazon, is an American electronic commerce and cloud computing company based in Seattle, Washington, that was founded by Jeff Bezos on July 5, 1994. The tech giant is the largest Internet retailer in the world as measured by revenue and market capitalization, and second largest after Alibaba Group in terms of total sales. The amazon.com website started as an online bookstore and later diversified to sell video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry. The company also produces consumer electronics - Kindle e-readers, Fire tablets, Fire TV, and Echo - and is the world’s largest provider of cloud infrastructure services (IaaS and PaaS). Amazon also sells certain low-end products under its in-house brand AmazonBasics.

A few required libraries

First, we'll install some useful Python NLP libraries that will help us analyze this text.


### Installing spaCy, general Python NLP lib
pip3 install spacy

### Downloading the English dictionary model for spaCy
python3 -m spacy download en_core_web_lg

### Installing textacy, basically a useful add-on to spaCy
pip3 install textacy
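Before moving on, it can help to confirm that the three installs above actually landed in the Python environment you are about to use. The snippet below is only a minimal sketch using the standard library: `missing_packages` is a hypothetical helper name, and it assumes the downloaded model is importable under the name `en_core_web_lg`, as pip-installed spaCy models normally are.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The two libraries plus the English model installed above.
required = ["spacy", "textacy", "en_core_web_lg"]

missing = missing_packages(required)
if missing:
    print("Not installed yet:", ", ".join(missing))
else:
    print("All packages are importable.")
```

If anything is reported as missing, the usual culprit is that `pip3` and `python3` point at different environments.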

Entity analysis

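Now that everything is installed, we can run a quick entity analysis on our text. Entity analysis goes through the text and identifies all of the important words, or “entities”, in it. When we say “important”, we really mean words that carry some kind of real-world semantic meaning or significance.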

Take a look at the code below, which does all of the entity analysis for us:


# coding: utf-8

import spacy

### Load spaCy's English NLP model
nlp = spacy.load('en_core_web_lg')

### The text we want to examine
### (adjacent string literals inside the parentheses are joined into one string)
text = ("Amazon.com, Inc., doing business as Amazon, "
        "is an American electronic commerce and cloud computing "
        "company based in Seattle, Washington, that was founded "
        "by Jeff Bezos on July 5, 1994. The tech giant is the "
        "largest Internet retailer in the world as measured by "
        "revenue and market capitalization, and second largest "
        "after Alibaba Group in terms of total sales. The amazon.com "
        "website started as an online bookstore and later "
        "diversified to sell video downloads/streaming, MP3 "
        "downloads/streaming, audiobook downloads/streaming, "
        "software, video games, electronics, apparel, furniture, "
        "food, toys, and jewelry. The company also produces "
        "consumer electronics - Kindle e-readers, Fire tablets, "
        "Fire TV, and Echo - and is the world's largest provider "
        "of cloud infrastructure services (IaaS and PaaS). "
        "Amazon also sells certain low-end products under "
        "its in-house brand AmazonBasics.")

### Parse the text with spaCy
### Our 'document' variable now contains a parsed version of text.
document = nlp(text)

### print out all the named entities that were detected
for entity in document.ents:
    print(entity.text, entity.label_)

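We first load spaCy's trained ML model and initialize the text we want to process. We then run the ML model on the text to extract the entities. When you run that code, you get the following output: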


Amazon.com, Inc. ORG
Amazon ORG
American NORP
Seattle GPE
Washington GPE
Jeff Bezos PERSON
July 5, 1994 DATE
second ORDINAL
Alibaba Group ORG
amazon.com ORG
Fire TV ORG
Echo -  LOC
PaaS ORG
Amazon ORG
AmazonBasics ORG

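The short uppercase codes [1] next to each piece of text are labels that indicate the type of entity we are looking at. Our model seems to have done a pretty good job! Jeff Bezos is indeed a person, the date is correct, Amazon is an organization, and Seattle and Washington are both geopolitical entities (countries, cities, states, and so on). The only tricky part is that things like Fire TV and Echo are actually products, not organizations. The model also missed the other products Amazon sells, “video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry”, probably because they appear in one huge list and therefore look relatively unimportant.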

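Overall, our model has done what we wanted. Imagine we had a huge document filled with hundreds of pages of text: this NLP model could quickly give you a sense of what the document is about and what its key entities are.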
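For a long document, a flat list of entities gets unwieldy, so one simple next step is to group them by label. The sketch below is only an illustration: `group_by_label` is a hypothetical helper, and the `(text, label)` pairs are copied from the output shown above so that the snippet runs without a model. With spaCy loaded, you could build the same list with `[(ent.text, ent.label_) for ent in document.ents]`.

```python
from collections import defaultdict

# (text, label) pairs copied from the entity output shown above.
entities = [
    ("Amazon.com, Inc.", "ORG"), ("Amazon", "ORG"), ("American", "NORP"),
    ("Seattle", "GPE"), ("Washington", "GPE"), ("Jeff Bezos", "PERSON"),
    ("July 5, 1994", "DATE"), ("second", "ORDINAL"), ("Alibaba Group", "ORG"),
    ("amazon.com", "ORG"), ("Fire TV", "ORG"), ("Echo -", "LOC"),
    ("PaaS", "ORG"), ("Amazon", "ORG"), ("AmazonBasics", "ORG"),
]


def group_by_label(pairs):
    """Map each label to its distinct entity texts, in first-seen order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


summary = group_by_label(entities)
for label, texts in summary.items():
    print(f"{label:8} {len(texts)}: {', '.join(texts)}")
```

Grouping this way makes it easy to see at a glance that the passage is mostly about organizations, with a couple of places and one person.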

Operating on entities

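Let's try something more practical. Suppose you have the same block of text as above, but for privacy reasons you want to automatically remove the names of all people and organizations. Because spaCy labels each entity with its category, a small scrub function can strip out any entity categories we don't want to see.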
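One way to write a scrub step like the one described above is to walk over the detected entities and splice a placeholder into the text wherever a person or organization appears. This is a minimal sketch, not necessarily the author's version: it relies only on the spaCy attributes `Doc.text`, `Doc.ents`, `Span.start_char`, `Span.end_char`, and `Span.label_`. The `FakeSpan` and `FakeDoc` stand-ins, the `span_for` helper, and the `[PRIVATE]` placeholder are all hypothetical and exist only so the example runs without downloading a model.

```python
from collections import namedtuple

# Stand-ins that mimic the few spaCy attributes used below. With a loaded
# model you would skip these and call scrub(nlp(text)) directly.
FakeSpan = namedtuple("FakeSpan", "start_char end_char label_")
FakeDoc = namedtuple("FakeDoc", "text ents")


def span_for(text, phrase, label):
    """Build a stand-in entity span for the first occurrence of `phrase`."""
    start = text.index(phrase)
    return FakeSpan(start, start + len(phrase), label)


def scrub(doc, labels=("PERSON", "ORG"), placeholder="[PRIVATE]"):
    """Return doc.text with every entity whose label is in `labels` replaced."""
    pieces = []
    cursor = 0
    for ent in doc.ents:  # entities arrive in document order and do not overlap
        if ent.label_ in labels:
            pieces.append(doc.text[cursor:ent.start_char])
            pieces.append(placeholder)
            cursor = ent.end_char
    pieces.append(doc.text[cursor:])
    return "".join(pieces)


sample = "Amazon was founded by Jeff Bezos in Seattle."
doc = FakeDoc(sample, [
    span_for(sample, "Amazon", "ORG"),
    span_for(sample, "Jeff Bezos", "PERSON"),
    span_for(sample, "Seattle", "GPE"),
])
print(scrub(doc))  # [PRIVATE] was founded by [PRIVATE] in Seattle.
```

Working from character offsets rather than string replacement means only the spans the model actually tagged are removed, so an unrelated word that happens to match a name is left alone.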

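This works well, and it is actually a very powerful technique. People use Ctrl+F on their computers all the time to find and replace words in a document. With NLP, though, we can find and replace specific entities based on their semantic meaning, not just their raw text.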

Extracting information from text

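The textacy library we installed earlier implements several common NLP information-extraction algorithms on top of spaCy. It lets us do things that are more advanced than what is available out of the box.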

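One of the algorithms it implements is semi-structured statement extraction. In essence, this algorithm takes some of the information that spaCy's NLP model extracts and uses it to pull out more specific information about particular entities! In short, we can extract certain “facts” about an entity of our choice.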

Let's see what that looks like in code. For this one, we'll take the entire summary of the Washington, D.C. Wikipedia page.

# coding: utf-8

import spacy
import textacy.extract

### Load spaCy's English NLP model
nlp = spacy.load('en_core_web_lg')

### The text we want to examine
text = """Washington, D.C., formally the District of Columbia and commonly referred to as Washington or D.C., is the capital of the United States of America.[4] Founded after the American Revolution as the seat of government of the newly independent country, Washington was named after George Washington, first President of the United States and Founding Father.[5] Washington is the principal city of the Washington metropolitan area, which has a population of 6,131,977.[6] As the seat of the United States federal government and several international organizations, the city is an important world political capital.[7] Washington is one of the most visited cities in the world, with more than 20 million annual tourists.[8][9]

The signing of the Residence Act on July 16, 1790, approved the creation of a capital district located along the Potomac River on the country's East Coast. The U.S. Constitution provided for a federal district under the exclusive jurisdiction of the Congress and the District is therefore not a part of any state. The states of Maryland and Virginia each donated land to form the federal district, which included the pre-existing settlements of Georgetown and Alexandria. Named in honor of President George Washington, the City of Washington was founded in 1791 to serve as the new national capital. In 1846, Congress returned the land originally ceded by Virginia; in 1871, it created a single municipal government for the remaining portion of the District.

Washington had an estimated population of 693,972 as of July 2017, making it the 20th largest American city by population. Commuters from the surrounding Maryland and Virginia suburbs raise the city's daytime population to more than one million during the workweek. The Washington metropolitan area, of which the District is the principal city, has a population of over 6 million, the sixth-largest metropolitan statistical area in the country.

All three branches of the U.S. federal government are centered in the District: U.S. Congress (legislative), President (executive), and the U.S. Supreme Court (judicial). Washington is home to many national monuments and museums, which are primarily situated on or around the National Mall. The city hosts 177 foreign embassies as well as the headquarters of many international organizations, trade unions, non-profit, lobbying groups, and professional associations, including the Organization of American States, AARP, the National Geographic Society, the Human Rights Campaign, the International Finance Corporation, and the American Red Cross.

A locally elected mayor and a 13‑member council have governed the District since 1973. However, Congress maintains supreme authority over the city and may overturn local laws. D.C. residents elect a non-voting, at-large congressional delegate to the House of Representatives, but the District has no representation in the Senate. The District receives three electoral votes in presidential elections as permitted by the Twenty-third Amendment to the United States Constitution, ratified in 1961."""

### Parse the text with spaCy
### Our 'document' variable now contains a parsed version of text.
document = nlp(text)

### Extracting semi-structured statements
### (this positional call matches the textacy release the article was written for;
### newer releases may expect keyword arguments such as entity= and cue=, so check
### the documentation for your installed version)
statements = textacy.extract.semistructured_statements(document, "Washington")

print("**** Information from Washington's Wikipedia page ****")
count = 1
for statement in statements:
    subject, verb, fact = statement
    print(str(count) + " - Statement: ", statement)
    print(str(count) + " - Fact: ", fact)
    count += 1

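Our NLP model found three useful facts about Washington, D.C. in this text:

(1) Washington is the capital of the United States

(2) Washington's population, and the fact that it is a metropolis

(3) Its many national monuments and museums

The best part is that these are the most important pieces of information in the whole passage!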
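To build some intuition for what a “semi-structured statement” is, here is a deliberately crude, pure-Python approximation: it looks for sentences shaped like “ENTITY is FACT.” using a regular expression. This is not how textacy does it (textacy works from spaCy's parse, not from pattern matching), and `simple_statements` is a hypothetical helper. The two sample sentences are shortened from the Washington passage above.

```python
import re


def simple_statements(text, entity, cue="is"):
    """Yield (entity, cue, fact) for text shaped like 'ENTITY CUE FACT.'"""
    pattern = re.compile(
        rf"\b({re.escape(entity)})\s+({re.escape(cue)})\s+([^.]+)\."
    )
    for match in pattern.finditer(text):
        yield match.group(1), match.group(2), match.group(3).strip()


# Two sentences shortened from the Washington, D.C. passage above.
sample = (
    "Washington is the principal city of the Washington metropolitan area. "
    "Washington is home to many national monuments and museums."
)

facts = list(simple_statements(sample, "Washington"))
for subject, verb, fact in facts:
    print(subject, "|", verb, "|", fact)
```

A pattern like this breaks as soon as the fact contains a period (for example “D.C.”) or the sentence is phrased differently, which is exactly why a parser-based approach is worth the extra dependency.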

Going deeper into NLP

That concludes our simple introduction to NLP. We learned a lot, but this was only a small taste...

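NLP has many more great applications, such as language translation, chatbots, and more specific and sophisticated analyses of text documents. Much of this work today is done with deep learning, in particular recurrent neural networks (RNNs) and long short-term memory (LSTM) networks.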

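If you'd like to play with more NLP yourself, the spaCy documentation [2] and the textacy documentation [3] are great places to start! You'll find many examples of ways to work with parsed text and extract very useful information from it. Everything is fast and simple, and you can get a lot of value out of it. Now it's time to do bigger and better things with deep learning!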

Reference links:

[1]

[2]

[3]

Original article link:
