The amazing power of word vectors | the morning paper (acolyer.org)
What is a word vector?
At one level, it’s simply a vector of weights. In a simple 1-of-N (or ‘one-hot’) encoding every element in the vector is associated with a word in the vocabulary. The encoding of a given word is simply the vector in which the corresponding element is set to one, and all other elements are zero.
Suppose our vocabulary has only five words: King, Queen, Man, Woman, and Child. We could encode the word ‘Queen’ as:
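A minimal sketch of this 1-of-N encoding, assuming the five-word vocabulary above in the order King, Queen, Man, Woman, Child:

```python
# One-hot encoding over the five-word vocabulary from the text.
vocab = ["King", "Queen", "Man", "Woman", "Child"]

def one_hot(word):
    """Return the 1-of-N vector for `word`: a single 1, zeros elsewhere."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("Queen"))  # [0, 1, 0, 0, 0]
```

Note that under this encoding every pair of distinct words is equally far apart, which is exactly why only equality comparisons are meaningful.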
Using such an encoding, there’s no meaningful comparison we can make between word vectors other than equality testing.
In word2vec, a distributed representation of a word is used. Take a vector with several hundred dimensions (say 1000). Each word is represented by a distribution of weights across those elements. So instead of a one-to-one mapping between an element in the vector and a word, the representation of a word is spread across all of the elements in the vector, and each element in the vector contributes to the definition of many words.
In this distributed representation most of the weights are nonzero, and it is the overall pattern of weights, rather than any single element, that defines the word. This is what we use in place of the one-to-one mapping of the one-hot encoding.
In the figure, the hypothetical dimensions are Royalty, Masculinity, Femininity, and Age. As we can see, King and Queen both relate strongly to Royalty, as we would expect, so both should score highly on that dimension; the other dimensions follow the same intuition.

Another way to think about it: let V_t = V_Queen - V_Woman = [0.97, 0.04, -0.069, 0.1]. What relationship does this offset vector represent? And what is V_Man + V_t? Is it approximately V_King? The answer is given below.
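That offset arithmetic can be sketched with toy vectors. The numbers below are illustrative stand-ins for the figure's (Royalty, Masculinity, Femininity, Age) weights, not values from the paper:

```python
# Hypothetical 4-dim word vectors: (Royalty, Masculinity, Femininity, Age).
king  = [0.99, 0.95, 0.05, 0.7]
queen = [0.99, 0.05, 0.95, 0.7]
man   = [0.02, 0.95, 0.05, 0.7]
woman = [0.02, 0.05, 0.95, 0.7]

def add(a, b): return [x + y for x, y in zip(a, b)]
def sub(a, b): return [x - y for x, y in zip(a, b)]

# V_t = queen - woman isolates the "royalty" offset;
# adding it to man lands (approximately) on king.
v_t = sub(queen, woman)
approx_king = add(man, v_t)
print(approx_king)  # close to `king`, up to floating-point error
```

The point is that the gender and age components cancel in the subtraction, leaving only the offset that distinguishes royal from non-royal words.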
Reasoning with word vectors
We find that the learned word representations in fact capture meaningful syntactic and semantic regularities in a very simple way. Specifically, the regularities are observed as constant vector offsets between pairs of words sharing a particular relationship. For example, if we denote the vector for word i as xi, and focus on the singular/plural relation, we observe that xapple – xapples ≈ xcar – xcars, xfamily – xfamilies ≈ xcar – xcars, and so on. Perhaps more surprisingly, we find that this is also the case for a variety of semantic relations, as measured by the SemEval 2012 task of measuring relation similarity.
The vectors are very good at answering analogy questions of the form a is to b as c is to ?. For example, man is to woman as uncle is to ? (aunt) using a simple vector offset method based on cosine distance.
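A minimal sketch of that offset method, assuming a small hand-made embedding table (the vectors are illustrative, not learned word2vec weights): to answer "a is to b as c is to ?", rank the vocabulary by cosine similarity to b - a + c, excluding the query words themselves.

```python
import math

# Illustrative embeddings (not real word2vec weights);
# dims roughly (royalty, masculinity, femininity).
embeddings = {
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "uncle": [0.3, 0.9, 0.1],
    "aunt":  [0.3, 0.1, 0.9],
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def analogy(a, b, c):
    """a is to b as c is to ? : nearest word to (b - a + c) by cosine."""
    target = [vb - va + vc for va, vb, vc
              in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = {w: v for w, v in embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("man", "woman", "uncle"))  # aunt
```

The same function answers the "King – Man + Woman = ?" question discussed below: `analogy("man", "woman", "king")` returns "queen" with these toy vectors.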
For example, here are vector offsets for three word pairs illustrating the gender relation:
And here we see the singular plural relation:
This kind of vector composition also lets us answer the question “King – Man + Woman = ?” and arrive at the result “Queen”!