Research Society · Beginner Group | Issue 8: A Possible Path to the Master Algorithm

Posted by user d8a171 on 2017-06-24

In recent years, artificial intelligence has made breakthrough advances, sparking strong interest in the field among students and practitioners across many areas of technology. On the road into artificial intelligence, The Master Algorithm is arguably required reading, and its importance needs no elaboration. The author, Pedro Domingos, may seem to give only a rough introduction to the mainstream schools of thought in machine learning, yet he touches on nearly all of today's important applications, existing or yet to appear. The book both suits beginners who want a big-picture overview of machine learning and plants countless leads that attentive readers can follow to study specific technical problems in depth, making it a rare guiding introductory text. Its witty, humorous writing style also makes reading it a pleasure.

With this book as its text, Synced (機器之心) will soon officially launch its "AI Research Society · Beginner Group"!

How to Join


We invite all beginners interested in artificial intelligence and machine learning to join us and, through reading and discussing The Master Algorithm, gain a broad, comprehensive understanding of the history and technical principles of artificial intelligence.


Chapter #8 Review

【Chapter Summary】

Unlike the previous chapters, which focus on labeled data, Chapter 8, "Learning without a Teacher," introduces unsupervised learning. Cognitive scientists describe their theories of children's learning in the form of algorithms, and the author does the same, seeking algorithmic solutions for clustering, dimensionality reduction, reinforcement learning, chunking, and relational learning. The chapter is a collection of concepts and algorithms for recreating the brain's learning process in a newborn robot.

Clustering comes first: it spontaneously groups similar objects together. The author explains the Expectation-Maximization (EM) algorithm, one of the most popular algorithms, along with two of its notable special cases: Markov models and k-means. To learn hidden Markov models, we alternate between inferring the hidden states and estimating the transition and observation probabilities based on that inference. Whenever we want to learn a statistical model but are missing some crucial information (e.g., the classes of the examples), we can use the EM algorithm. K-means suits the situation where all the attributes have normal distributions with very small variance. We can start with a coarse clustering and then further divide each primary cluster into smaller subclusters.
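The assign-then-update loop at the heart of k-means can be sketched in a few lines of Python (a minimal illustration with made-up 2-D data, not code from the book):

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid
        dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        centroids = np.array([points[labels == j].mean(axis=0)
                              for j in range(k)])
    return labels, centroids

# Two well-separated blobs, one near (0, 0) and one near (10, 10)
pts = np.array([[0., 0.], [.5, 0.], [0., .5],
                [10., 10.], [10.5, 10.], [10., 10.5]])
labels, cents = kmeans(pts, 2)
```

The two alternating steps mirror EM's expectation and maximization steps, with hard all-or-nothing assignments in place of EM's soft probabilities.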

Subsequently, the author introduces two popular unsupervised learning techniques, PCA (principal-component analysis) and Isomap, which are mainly used for dimensionality reduction. Dimensionality reduction is the process of reducing a large number of visible dimensions (e.g., pixels) to a few implicit ones (e.g., expression, facial features), which is essential for coping with big data. PCA tries to come up with a linear combination of the various dimensions in the hyperspace such that the total variance of the data across all dimensions is maximized. Isomap, a nonlinear dimensionality reduction technique, connects each data point in a high-dimensional space (a face, say) to all nearby points (very similar faces), computes the shortest distances between all pairs of points along the resulting network, and finds the reduced coordinates that best approximate these distances.
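PCA's search for the variance-maximizing linear combination reduces to an eigen-decomposition of the data's covariance matrix. A minimal sketch (the 3-D data below is invented; it varies along a single line, so one component captures everything):

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its directions of largest variance."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = np.cov(Xc, rowvar=False)           # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = eigvals.argsort()[::-1]          # largest variance first
    components = eigvecs[:, order[:n_components]]
    return Xc @ components                   # reduced coordinates

# Points on the line (t, 2t, -t): all variance lies in one direction
X = np.array([[i, 2.0 * i, -1.0 * i] for i in range(10)])
Z = pca(X, 1)
```

Because every point lies on a single line, the one reduced coordinate preserves the full variance of the original three dimensions.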

After introducing clustering and dimensionality reduction, the author turns to reinforcement learning, a technique in which the learner relies on the environment's immediate responses to its various actions. The author describes the history and development of reinforcement learning in detail. In the early 1980s, Rich Sutton and Andy Barto at the University of Massachusetts observed that learning depends crucially on interaction with the environment, which supervised algorithms did not capture, and they therefore found inspiration in the psychology of animal learning. Sutton became the leading proponent of reinforcement learning. Another key step came in 1989, when Chris Watkins at Cambridge, initially motivated by his experimental observations of children's learning, arrived at the modern formulation of reinforcement learning as optimal control in an unknown environment. A recent example of a successful startup combining neural networks and reinforcement learning is DeepMind, a company acquired by Google for half a billion dollars.
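Watkins' formulation, now known as Q-learning, fits in one update rule: nudge the value of a state-action pair toward the observed reward plus the discounted value of the best next action. A toy sketch on an invented five-state corridor (all states, rewards, and parameters here are made up for illustration):

```python
import random

# Corridor of states 0..4; action 0 moves left, action 1 moves right.
# Reward 1.0 only upon reaching the goal state 4.
N_STATES, GOAL = 5, 4

def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    return nxt, (1.0 if nxt == GOAL else 0.0)

random.seed(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]     # Q[state][action]
alpha, gamma, eps = 0.5, 0.9, 0.1             # learning rate, discount, exploration

for _ in range(500):                          # 500 episodes from the left end
    s = 0
    while s != GOAL:
        greedy = 1 if Q[s][1] >= Q[s][0] else 0
        a = random.randrange(2) if random.random() < eps else greedy
        s2, r = step(s, a)
        # Watkins' update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

policy = [(1 if Q[s][1] > Q[s][0] else 0) for s in range(GOAL)]
```

After training, the greedy policy heads right from every state. The instability the chapter alludes to appears only once Q is generalized by a function approximator instead of this exact table.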

Inspired by psychology, chunking is a technique with the potential to become part of the Master Algorithm. The author gives a basic outline of its core idea. Chunking and reinforcement learning are not as widely used in business as supervised learning, clustering, or dimensionality reduction. A simpler type of learning by interacting with the environment is A/B testing.

The chapter ends with an explanation of another potential killer algorithm, relational learning: the world is a web of interconnections, and every feature template we create ties together the parameters of all its instances. As a conclusion, the author recommends combining all the elements, solutions, and algorithms mentioned in Chapter 8 into the ultimate Master Algorithm.

Week 8 Q & A Collection

  1. Give two applications each of Markov models and the k-means algorithm.
  • a. Markov models: speech recognition; handwriting recognition; biological sequence analysis, etc.
  • b. K-means: making recommendations (Amazon, Netflix, etc.); choosing store locations; summarizing articles, etc.
  2. Describe the situation in which the Isomap algorithm performs best for dimensionality reduction.
  • a. “One of the most popular algorithms for nonlinear dimensionality reduction, called Isomap, does just this.” “From understanding motion in video to detecting emotion in speech, Isomap has a surprising ability to zero in on the most important dimensions of complex data.” [from the book]
  3. Reinforcement learning with generalization often fails to settle on a stable solution. Why?
  • a. “In supervised learning the target value for a state is always the same, but in reinforcement learning, it keeps changing as a consequence of updates to nearby states.” [from the book]
  4. What is the difference between clustering and chunking?
  • a. They are fundamentally different. Clustering uses unsupervised learning to group similar data together, while chunking intentionally divides data so that problems become easier to solve.

Chapter #9 Preview

【Chapter Summary】

Welcome to Chapter 9. In this chapter, the author introduces a path to the Master Algorithm by combining some of the algorithms mentioned before and comparing them explicitly. Of course, the author encounters many challenges in this combination process. He explains how he overcomes those problems and arrives at the unifier of machine learning, Alchemy. Finally, he shares his insights about Alchemy from different perspectives, such as its advantages, disadvantages, and applications.

【Important Sections】

  • Out of many models, one
  • The author surveys the metalearning methods (stacking, bagging, and boosting) and summarizes the characteristics and shortcomings of each.
  • The Master Algorithm
  • The author guides us on a fantastic tour through an imaginary city. Along the way, we learn the features of the important algorithms, the relationships among them, and how to combine them all to reach the Master Algorithm.
  • Markov logic networks
  • The author introduces the Markov network, then bridges the final gap in the combination by deriving the Markov logic network. He also defines the duty of each algorithm within the Master Algorithm.
  • From Hume to your housebot
  • The author names the initial version of the Master Algorithm Alchemy. He draws some conclusions about Alchemy and shows how to customize it for different situations; some practical applications are also shown in this section.
  • Planetary-scale machine learning
  • Alchemy does not yet scale because its computations are very expensive. What should we do? Why can it work well in the real world? What is the next step? This section answers these questions.
  • The doctor will see you now
  • CanceRx is one of the successful applications of Alchemy. It has tremendous potential and is constantly being improved.

【Key Concepts】

  • Unifier: A unifier is someone or something that brings others together.
  • Metalearning: Metalearning is the process of learning to learn.
  • Stacking: Stacking involves training a learning algorithm to combine the predictions of several other learning algorithms.
  • Bagging: Bagging is to make each model in the ensemble vote with equal weight.
  • Boosting: Boosting is incrementally building an ensemble by training each new model instance to emphasize the training instances which previous models misclassified.
  • Representation: It is the formal language in which the learner expresses the model.
  • Evaluation: It is a scoring function that shows how good a model is.
  • Optimization: It is an algorithm that searches for the highest-scoring model and returns it.
  • Markov network: It is defined by a weighted sum of features, much like a perceptron.
  • Markov Logic Network (MLN): An MLN is a set of logical formulas and their weights.
  • Alchemy: It defines our candidate universal learner in simple terms. Here it mainly refers to the Markov logic network algorithms/models developed by the author's team.
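The contrast between bagging's equal-weight vote and a combiner that has learned to trust some models more (as stacking and boosting do, each in its own way) can be made concrete with toy votes (the models, labels, and weights below are all invented for illustration):

```python
from collections import Counter

def bagging_vote(predictions):
    """Bagging-style combination: every model's vote counts equally."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Weighted combination: each model's vote carries a learned weight."""
    scores = {}
    for pred, w in zip(predictions, weights):
        scores[pred] = scores.get(pred, 0.0) + w
    return max(scores, key=scores.get)

# Three toy models classify the same e-mail
votes = ["spam", "spam", "ham"]
print(bagging_vote(votes))                    # majority wins: spam
print(weighted_vote(votes, [0.2, 0.2, 0.9]))  # the trusted model prevails: ham
```

In full stacking, the weights would themselves be produced by a metalearner trained on the base models' predictions over held-out data, rather than fixed by hand.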

【Quiz】

  1. What is the difference among stacking, bagging and boosting?
  2. How does the author combine logic and probability?
  3. What are the challenges facing Alchemy?
  4. Share your opinion on this section.
