Research Study Group · Beginner Track | Session 6: A First Look at Bayesian Networks

Posted by user d8a171 on 2017-06-10

In recent years, artificial intelligence has made remarkable breakthroughs, drawing strong interest from students and practitioners across many technical fields. For anyone entering AI, The Master Algorithm is arguably one of the must-read books, and its importance needs little elaboration. Author Pedro Domingos may appear to give only a broad-brush tour of the main schools of thought in machine learning, yet he touches on nearly every important application, existing or yet to come. The book lets beginners survey machine learning from a macro perspective while also planting countless threads that motivated readers can follow into specific technical problems, making it a rare guiding introduction. Its witty, humorous style also makes it a pleasure to read.



Using this book as our guide, Synced (機器之心) will soon officially launch its "AI Research Study Group · Beginner Track".

How to Join

We invite all beginners interested in artificial intelligence and machine learning to join us. Through reading and discussing The Master Algorithm, we aim to build a broad, comprehensive understanding of the history and technical principles of artificial intelligence.





Chapter #6 Review

【Chapter Summary】

Bayes' theorem

The theorem that runs the world — Bayes' theorem


Bayes’ theorem is just a simple rule for updating your degree of belief in a hypothesis when you receive new evidence: if the evidence is consistent with the hypothesis, the probability of the hypothesis goes up; if not, it goes down.

History of Bayes' theorem:

  • Thomas Bayes: the preacher who first described a new way to think about chance.
  • Pierre-Simon de Laplace: the Frenchman who first codified Bayes' insight into what we now call Bayes' theorem.

Humans, it turns out, are not very good at Bayesian inference, at least when verbal reasoning is involved. A problem is that we tend to neglect the cause’s prior probability.

HIV example: suppose you test positive for HIV, and the test gives only 1% false positives. At first sight, it seems your chance of having HIV is now 99%. Applying Bayes' theorem: p(HIV|positive) = p(HIV) × p(positive|HIV) / p(positive) = 0.003 × 0.99 / 0.01 = 0.297. Here we assume p(HIV), the probability of having HIV in the general population, is 0.003 in the US, and p(positive), the probability that the test comes out positive whether or not you have HIV, is 0.01. So given a positive test result, the probability of actually having HIV is only 0.297, far lower than the intuitive 99%.
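
To make the arithmetic concrete, here is a minimal Python sketch of the same calculation. The numbers are the illustrative figures used above (prior 0.003, true-positive rate 0.99, assumed overall positive rate 0.01), not real epidemiological data.

    # Bayes' theorem with the illustrative numbers from the HIV example above.
    p_hiv = 0.003            # prior: p(HIV) in the general population
    p_pos_given_hiv = 0.99   # likelihood: p(positive | HIV)
    p_pos = 0.01             # evidence: p(positive), assumed in the text

    p_hiv_given_pos = p_hiv * p_pos_given_hiv / p_pos
    print(round(p_hiv_given_pos, 3))   # -> 0.297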

Frequentists: probability is a frequency. They estimate probabilities by counting how often the corresponding events occur.

Bayesians: probability is a subjective degree of belief. You should update your prior beliefs with new evidence to obtain your posterior beliefs.

Naive Bayes Classifier ("All models are wrong, but some are useful," as George Box of Box-Cox fame put it)

[Figure: the Naive Bayes classifier as a cause → effects graphical model]

  • The Naive Bayes classifier can be expressed as a cause → effects graphical model, like the figure shown above.
  • Naive Assumption: all features are conditionally independent given the class label. For example, X1 is independent of X2 given Y (a minimal sketch follows this list).

p(X1, X2|Y) = p(X1|Y) * p(X2|Y), by conditional independence

  • Runtime complexity = O(CD). C=the number of classes, D=the number of features
  • An oversimplified model that you have enough data to estimate is better than a perfect one with insufficient data.
  • Advantages: fast; immune to overfitting; this simple model performs well empirically, even though the naive assumption is not realistic.
  • Pairwise connection
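
To ground the classifier described above, here is a minimal Naive Bayes sketch for binary features, with Laplace smoothing so no estimated likelihood is ever exactly zero. The toy data and variable names are invented for illustration; this is a sketch, not a production implementation.

    import numpy as np

    # Toy data: rows are examples, columns are binary features; y holds class labels.
    X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
    y = np.array([0, 0, 1, 1])

    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}                        # p(Y = c)
    # p(X_d = 1 | Y = c) with Laplace smoothing to avoid zero likelihoods
    cond = {c: (X[y == c].sum(axis=0) + 1) / (np.sum(y == c) + 2) for c in classes}

    def predict(x):
        # naive assumption: p(x | c) = prod_d p(x_d | c); O(C * D) work per example
        scores = {}
        for c in classes:
            likelihood = np.prod(np.where(x == 1, cond[c], 1 - cond[c]))
            scores[c] = priors[c] * likelihood
        return max(scores, key=scores.get)

    print(predict(np.array([1, 0, 0])))   # classify a new example -> 0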

Markov Chain & HMM (From Eugene Onegin to Siri)

  • Markov assumed (wrongly but usefully) that the probability of an event is the same at every position in the text.



  • Hidden Markov model (HMM): assumes a Markov process over hidden states, with each observation generated from the current hidden state.
  • Speech recognition (Siri): hidden states are the written words; observations are the sounds spoken to Siri; the goal is to infer the words from the sounds.


  • Other applications: computational biology, part-of-speech tagging (a toy forward-algorithm sketch follows below).
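
To connect the description above to something executable, here is a toy forward-algorithm sketch for a two-state HMM. All states, symbols, and probabilities are invented for illustration and are not taken from the chapter.

    import numpy as np

    # Toy HMM: 2 hidden states, 2 observation symbols (all numbers invented).
    start = np.array([0.6, 0.4])                 # p(initial hidden state)
    trans = np.array([[0.7, 0.3],                # p(next state | current state)
                      [0.4, 0.6]])
    emit = np.array([[0.9, 0.1],                 # p(observation | hidden state)
                     [0.2, 0.8]])

    def forward(obs):
        """Probability of an observation sequence under the HMM (forward algorithm)."""
        alpha = start * emit[:, obs[0]]
        for o in obs[1:]:
            alpha = (alpha @ trans) * emit[:, o]
        return alpha.sum()

    print(forward([0, 1, 0]))   # likelihood of observing symbols 0, 1, 0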

Bayesian Network and its applications (Everything is connected, but not directly)

  • Bayesian Network(Judea Pearl) is a very complex network of dependencies among random variables, provided that each variable only directly depends on a few others.
  • Naive Bayes, Markov Chain, and Hidden Markov Model are special cases of Bayesian Networks
  • Example: Burglar alarms.
  • The alarm in your house can be triggered by a burglar attempting to break in, or by an earthquake. Neighbor Bob or neighbor Claire will call you if the alarm sounds.


  • Bob's call depends on Burglary and Earthquake, but only through Alarm: given that the alarm has sounded, the event that Bob calls is conditionally independent of both Burglary and Earthquake. Without this independence structure we would need to learn the full joint table of 2^5 = 32 probabilities; with it we only need 1 + 1 + 4 + 2 + 2 = 10 (one each for Burglary and Earthquake, four for Alarm given its two parents, and two each for Bob's and Claire's calls given Alarm). This is how the structure avoids the curse of dimensionality (a small enumeration sketch follows this list).
  • Applications: Need domain knowledge to identify the structure of the graph!
  • Biology: how genes regulate each other in a given cell. Ads: choosing which ads to place on the web. Games: rating players and matching them based on similar skill.
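
The enumeration sketch promised above: it encodes the Burglary/Earthquake → Alarm → Bob structure from the example and computes p(Burglary | Bob calls) by summing the joint distribution over the unobserved variables. All probability values are invented placeholders, and Claire is omitted for brevity.

    from itertools import product

    # Toy alarm network (structure from the example; all probabilities are invented).
    p_b = 0.001                                   # p(Burglary)
    p_e = 0.002                                   # p(Earthquake)
    p_a = {(True, True): 0.95, (True, False): 0.94,
           (False, True): 0.29, (False, False): 0.001}   # p(Alarm | B, E)
    p_bob = {True: 0.90, False: 0.05}             # p(Bob calls | Alarm)

    def joint(b, e, a, bob):
        # 1 + 1 + 4 + 2 parameters for this reduced network (Claire omitted)
        pb = p_b if b else 1 - p_b
        pe = p_e if e else 1 - p_e
        pa = p_a[(b, e)] if a else 1 - p_a[(b, e)]
        pc = p_bob[a] if bob else 1 - p_bob[a]
        return pb * pe * pa * pc

    # Inference by enumeration: p(Burglary | Bob calls)
    num = sum(joint(True, e, a, True) for e, a in product([True, False], repeat=2))
    den = sum(joint(b, e, a, True) for b, e, a in product([True, False], repeat=3))
    print(num / den)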

Inference (The inference problem)

  • The inference problem is how to compute a specific probability without building a full table of probability.
  • In many cases, we can do this and avoid the exponential blowup
  • Loopy belief propagation:
  • The graph may contain loops. We pretend it has no loops and just keep propagating probabilities back and forth until they converge. But it can converge to a wrong answer, or not converge at all.
  • Markov Chain Monte Carlo:
  • Design a Markov chain that converges to the distribution defined by our Bayesian network, then take a long sequence of steps (samples). A proposal distribution Q (often tractable) is used to explore the complex (often intractable, high-dimensional) target distribution P. Advantage: a well-behaved Markov chain converges to a stable (stationary) distribution. Disadvantage: convergence can be slow and hard to verify, results can be poor if the chain has not mixed, and the initial burn-in phase must be discarded (a minimal sampler sketch follows this list).
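
A minimal Metropolis-Hastings sketch of the idea: a Gaussian proposal distribution Q explores an unnormalized target P, and the burn-in samples are discarded. The one-dimensional target here is invented for illustration and simply stands in for the Bayesian network's distribution.

    import random, math

    def target(x):
        # unnormalized target distribution P (invented; stands in for the network's distribution)
        return math.exp(-0.5 * (x - 3.0) ** 2) + 0.5 * math.exp(-0.5 * (x + 3.0) ** 2)

    def mh_samples(n_steps=10_000, burn_in=1_000, step=1.0):
        x, samples = 0.0, []
        for t in range(n_steps):
            proposal = x + random.gauss(0.0, step)          # proposal distribution Q
            accept = min(1.0, target(proposal) / target(x)) # Metropolis acceptance ratio
            if random.random() < accept:
                x = proposal
            if t >= burn_in:                                # discard the burn-in phase
                samples.append(x)
        return samples

    samples = mh_samples()
    print(sum(samples) / len(samples))   # rough estimate of the target's mean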

Maximum A Posteriori & Maximum Likelihood Estimation (Learning the Bayesian way)

p(hypo|data) * p(data) = p(data|hypo) * p(hypo)

We can ignore p(data) because it is the same for all hypotheses.

p(hypo|data) ∝ p(data|hypo) * p(hypo)

prior: p(hypo)
likelihood: p(data|hypo)
posterior: p(hypo|data) ∝ p(data|hypo) * p(hypo)
  • Frequentist: Maximum Likelihood Estimation (MLE): when doing inference, we only care about the likelihood, and choose the hypothesis that gives the maximum of p(data|hypo) as our prediction.
  • Bayesian: Maximum A Posteriori (MAP): we also need to take the prior p(hypo) into account, not just the likelihood, and choose the hypothesis that gives the maximum p(data|hypo) * p(hypo) as our prediction.
  • If we assume that all hypotheses are uniformly distributed: MAP = MLE
  • Computing the full posterior (rather than just the MAP point estimate) requires p(data). In practice, p(data) involves summing or integrating over high-dimensional spaces, so it is hard to compute exactly; often we can only bound or approximate it numerically. Beyond the computational cost, MAP can also inherit the biases of the data distribution, i.e. it tends to overfit relative to averaging over the full posterior. It is always important to choose the method that suits the given problem.
  • Disadvantage of MLE: if an event has never occurred so far (estimated likelihood = 0), MLE predicts it will never occur in the future (a tiny numeric comparison follows this list).
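
The tiny numeric comparison promised above: estimating a coin's heads probability after observing no heads in three flips. MLE returns exactly zero, while MAP with a Beta(2, 2) prior (pseudo-counts chosen only for illustration) does not.

    # MLE vs. MAP for estimating a coin's heads probability (illustrative numbers).
    heads, tails = 0, 3            # observed data: no heads in three flips

    # MLE: maximize p(data | theta)  ->  heads / (heads + tails)
    theta_mle = heads / (heads + tails)            # = 0.0, "heads will never occur"

    # MAP with a Beta(2, 2) prior: maximize p(data | theta) * p(theta)
    alpha, beta = 2, 2                             # assumed prior pseudo-counts
    theta_map = (heads + alpha - 1) / (heads + tails + alpha + beta - 2)

    print(theta_mle, theta_map)                    # 0.0 vs. 0.2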

Markov network / Markov random field (Markov weights the evidence)

  • A Markov network is a set of features with corresponding weights, which all together define a probability distribution. (p.171)
  • It is an undirected graphical model (a tiny log-linear sketch follows this list).

[Image: https://quip.com/-/blob/KbdAAAsU1Hh/y9n31veW589ysNSGgQGPEQ]

  • Application: image segmentation (view every pixel as a node)
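
The tiny log-linear sketch promised above: a Markov network over two binary variables, defined exactly as "a set of features with corresponding weights", where p(x) is proportional to the exponential of the weighted feature sum. The two features and their weights are invented for illustration.

    import math
    from itertools import product

    # Tiny log-linear Markov network over two binary variables x1, x2.
    features = [lambda x1, x2: float(x1 == x2),      # "agreement" feature
                lambda x1, x2: float(x1 == 1)]       # "x1 is on" feature
    weights = [1.5, 0.5]                             # invented weights

    def score(x1, x2):
        # unnormalized probability: exp(sum_i w_i * f_i(x))
        return math.exp(sum(w * f(x1, x2) for w, f in zip(weights, features)))

    Z = sum(score(a, b) for a, b in product([0, 1], repeat=2))   # partition function
    for a, b in product([0, 1], repeat=2):
        print((a, b), round(score(a, b) / Z, 3))                 # normalized probabilities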

Reflection & Highlights:

  1. There are two kinds of statisticians: frequentists, who regard probability as frequency, and Bayesians, who update a prior probability with new data to obtain a posterior probability.
  2. Bayesian networks are directed graphical models, and the underlying rule is Bayes' theorem.
  3. Bayesian network inference can be exact or approximate.
  4. MLE and MAP are different estimation methods: MAP additionally weighs in the prior.
  5. Markov random field is an undirected graphical model.

Week 6 Q & A Collection

  1. What is Bayes' theorem?
     P(A|B) = P(A) P(B|A) / P(B). “Bayes' theorem is just a simple rule for updating your degree of belief in a hypothesis when you receive new evidence: if the evidence is consistent with the hypothesis, the probability of the hypothesis goes up; if not, it goes down.”
  2. What is the naive assumption, and what role does it play in the Naive Bayes classifier?
     “Naive assumption: all features are conditionally independent given the class label.” Role: it is the foundation of the Naive Bayes classifier.
  3. What is the difference between an HMM and a Markov model?
     An HMM is a type of Markov model whose system states are unobserved (or only partially observed).
  4. Why is domain knowledge important when building and running inference on graphical models?
     “Applications: need domain knowledge to identify the structure of the graph!”

Chapter #7 Preview

【Chapter Summary】

This chapter describes the “analogizers” tribe of machine learning and pinpoints the algorithms that identify similarities. The author begins with a fun example of a culprit who committed crimes using great impersonation skills. By the same logic, algorithms can perform well based on simple comparisons. Analogy was the spark that ignited many historical advances in science, and analogical reasoning has inspired many intellectual developments in machine learning. This branch of learning is the least cohesive of the five tribes. The chapter introduces 1) nearest neighbor and k-nearest neighbors; 2) SVMs; 3) general analogical methods.

【Important Sections】

  • Nearest neighbor: applications in facial recognition and drawing national borders
  • k-nearest neighbors (KNN): application in basic recommendation systems
  • Solving the problem of rising dimensions
  • Support Vector Machine (SVM)
  • The main subproblems of analogical algorithms

【Key Concepts】

  • History
  • The development started slowly, with an obscure 1951 technical report by two Berkeley statisticians describing what was later called the nearest-neighbor algorithm. By the 2000s, support vector machines had been developed.
  • Application of “nearest-neighbor”: simple classification based on labeled data points; the bigger the database, the better.
  • Facebook facial recognition: nearest-neighbor finds the most similar image in Facebook's entire database of labeled photos; if the database image most similar to the one Jane just uploaded is a photo of Jane, then the new face is Jane's. Positiville vs. Negapolis: nearest-neighbor can draw the border between two countries; with only the two capitals as data, the border is the set of points closer to one capital than the other, passing through their midpoint. For higher accuracy, the border can be refined using the locations of more nearby cities. The border is implicit in the locations of the data points and the distance measure, and the only cost is query time.
  • Application of “k-nearest-neighbor”: a test example is classified by finding its k nearest neighbors and letting them “vote”, which is more robust than relying on a single nearest neighbor (a minimal k-NN sketch follows this list).
  • Basic recommendation system: people who agree right now are likely to keep agreeing in the future, so users with similar rating habits can serve as references for others in the group. A trick: a simple way to make nearest-neighbor more efficient is to delete all the examples that are correctly classified by their neighbors, because only the “borderline” data points are needed.
  • Skepticism: can nearest neighbor learn the true borders between concepts?
  • In 1967, Tom Cover and Peter Hart proved that, given enough data, nearest-neighbor is at worst only twice as error-prone as the best imaginable classifier.
  • Problem of Dimensionality
  • In low dimensions nearest-neighbor works well, but not in high dimensions. Nearest-neighbor is easily confused by irrelevant attributes in higher dimensions, and in higher dimensions the notion of a border itself breaks down. The problem of dimensionality is the second-worst problem in machine learning, after overfitting. Solution: for nearest-neighbor, we can discard all attributes whose information gain is below some threshold and measure similarity only in the reduced space. Or we can “wrap” the attribute selection around the learner itself, with a hill-climbing search that keeps deleting attributes as long as doing so does not hurt nearest-neighbor's accuracy on held-out data.
  • Support Vector Machine (SVM)
  • History: invented by Vladimir Vapnik at Bell Labs. It looks like weighted k-nearest-neighbor: the frontier between two classes is defined by a set of weighted examples, together with a similarity measure; these examples are called support vectors because they “support” the frontier. With an SVM we get a very smooth boundary. How an SVM learns: we choose the support vectors and their weights; the SVM finds the fattest “snake” that fits between the positive and negative landmines, i.e. it minimizes the weights under the constraint that all examples have a given margin (which can be taken as one, since the precise value is arbitrary). SVM vs. perceptron: SVMs did as well out of the box as multilayer perceptrons that had been carefully crafted for digit recognition over the years. An SVM can be seen as a generalization of the perceptron, because a hyperplane boundary between classes is what you get when you use a particular similarity measure (the dot product between vectors). But SVMs have a major advantage over multilayer perceptrons: the weights have a single optimum instead of many local ones, so learning them reliably is much easier. The fewer support vectors an SVM selects, the better it generalizes; the expected error rate of an SVM is at most the fraction of training examples that are support vectors. As the number of dimensions goes up, that fraction goes up too, so the SVM is not immune to the curse of dimensionality, but it is more resistant than most methods. Straight lines: no matter how curvy the frontier looks in the original space, in the kernel-induced feature space it is always a straight line (a hyperplane). (A short library-based sketch follows this list.)
  • Two Main Subproblems of Analogy:
  • 1) Figuring out how similar two things are; 2) deciding what else to infer from their similarities. The analogizers' neatest trick is learning across problem domains. Analogizers can do a lot using structure mapping: it takes two descriptions, finds a coherent correspondence between some of their parts and relationships, and then, based on that correspondence, transfers further properties from one to the other. Finally, according to Douglas Hofstadter, there is nothing that analogical reasoning can't solve.
  • Debate Between Symbolists and Analogizers
  • Cognitive science has seen a long-running debate between symbolists and analogizers. Symbolists point to something they can model that analogizers can't; then analogizers figure out how to do it, come up with something they can model that symbolists can't, and the cycle repeats. Domingos designed an algorithm called RISE that was a step toward the Master Algorithm because it combined symbolic and analogical reasoning.
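
The k-nearest-neighbor sketch promised in the list above: compute distances to all labeled points, take the k nearest, and let them vote. The toy points are invented for illustration.

    import numpy as np

    # Minimal k-nearest-neighbor classifier (toy data invented for illustration).
    X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8]])
    y_train = np.array([0, 0, 1, 1])

    def knn_predict(x, k=3):
        dists = np.linalg.norm(X_train - x, axis=1)      # distances to all labeled points
        nearest = y_train[np.argsort(dists)[:k]]         # labels of the k nearest neighbors
        return np.bincount(nearest).argmax()             # majority vote

    print(knn_predict(np.array([0.2, 0.1])))             # -> 0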
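
The SVM sketch referenced above, assuming scikit-learn is available; the Iris data restricted to two classes is only a convenient stand-in for a real problem. It fits a linear SVM and reports how many support vectors define the frontier.

    from sklearn import datasets
    from sklearn.svm import SVC

    X, y = datasets.load_iris(return_X_y=True)
    X, y = X[y < 2], y[y < 2]          # keep two classes for a binary problem
    clf = SVC(kernel="linear", C=1.0)  # larger C -> narrower margin, fewer violations allowed
    clf.fit(X, y)
    print("support vectors per class:", clf.n_support_)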

【Quiz】

  1. How does k-nearest neighbor improve based on nearest-neighbor?
  2. Why would rising dimensionality create problems for nearest-neighbor algorithms?
  3. Explain the mechanism of SVM.
