Probabilistic Graphical Models
Statistical and Algorithmic Foundations of Deep Learning
Author: Eric Xing
01 An overview of DL components
Historical remarks: early days of neural networks
We know that a biological neuron works like this: an upstream cell sends neurotransmitters along its axon to the dendrites of a downstream cell. Inspired by this principle, artificial intelligence constructs the artificial neuron (or perceptron) by analogy. Likewise, the biological neural network becomes the artificial neural network.
Reverse-mode automatic differentiation (aka backpropagation)
Let us now look at the perceptron learning algorithm in detail. Suppose we have a regression problem x → y with y = f(x) + η. The objective function is

J(w) = 1/2 Σ_n (y_n − wᵀx_n)²

To minimize it, we take its derivative with respect to w:

∂J/∂w = −Σ_n (y_n − wᵀx_n) x_n

which gives the update rule for w:

w ← w + α Σ_n (y_n − wᵀx_n) x_n
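A minimal numerical sketch of this update rule (the toy data, learning rate α, and number of iterations are illustrative choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = f(x) + noise, with f linear for simplicity.
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w, alpha = np.zeros(3), 0.05
for _ in range(500):
    err = y - X @ w                  # residuals y_n - w^T x_n
    grad = -X.T @ err                # dJ/dw for J(w) = 1/2 * sum_n (y_n - w^T x_n)^2
    w = w - alpha * grad / len(X)    # gradient-descent update of w
```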
Now let us turn to the neural network model, where the hidden units have no targets of their own.
An artificial neural network is nothing but a complex function that can be represented by a computation graph. By applying the chain rule with reverse accumulation, we obtain the gradients; this algorithm is commonly called backpropagation. What if some of the functions are stochastic? Use stochastic backpropagation! Modern software packages can do this automatically (more on this later).
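As a concrete illustration, packages such as PyTorch record the computation graph during the forward pass and then apply the chain rule by reverse accumulation; a minimal sketch (the function and shapes are arbitrary):

```python
import torch

x = torch.randn(4, 3)                       # inputs (no gradient tracking needed)
W = torch.randn(3, 1, requires_grad=True)   # parameters tracked in the computation graph
b = torch.zeros(1, requires_grad=True)

y = torch.sigmoid(x @ W + b)                # forward pass records the graph
loss = ((y - 1.0) ** 2).mean()
loss.backward()                             # reverse accumulation (backpropagation)

print(W.grad, b.grad)                       # d(loss)/dW and d(loss)/db
```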
Modern building blocks: units, layers, activations functions, loss functions, etc.
Commonly used activation functions:
- Linear and ReLU
- Sigmoid and tanh
- Etc.
Network layers:
- Fully connected
- Convolutional & pooling
- Recurrent
- ResNets
- Etc.
In other words, these basic building blocks can be combined arbitrarily. With multiple loss functions, we can do things such as multi-target prediction and transfer learning. Given enough data, deeper architectures keep improving.
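A minimal sketch of composing such building blocks with more than one loss (the layer sizes and the two heads are illustrative choices, not from the lecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Shared trunk with two heads: one classification loss plus one regression loss.
trunk = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU())
cls_head, reg_head = nn.Linear(32, 5), nn.Linear(32, 1)

x = torch.randn(8, 16)
labels = torch.randint(0, 5, (8,))
targets = torch.randn(8, 1)

h = trunk(x)
loss = F.cross_entropy(cls_head(h), labels) + F.mse_loss(reg_head(h), targets)
loss.backward()   # gradients flow into both heads and the shared trunk
```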
Feature learning: successful learning of intermediate representations [Lee et al., ICML 2009; Lee et al., NIPS 2009]. Representation learning: the network learns increasingly abstract representations of the data that become "disentangled", i.e., linearly separable.
02 Similarities and differences between GMs and NNs
Graphical models vs. computational graphs
Graphical models:
- A representation for encoding meaningful knowledge and the associated uncertainty in graphical form
- Learning and inference are based on a rich toolbox of well-studied (structure-dependent) techniques (e.g., EM, message passing, VI, MCMC)
- Graphs represent models
Utility of the graph
- A tool for synthesizing a global loss function from local structure (potential functions, feature functions, etc.)
- A tool for designing sound and efficient inference algorithms (sum-product, mean field, etc.)
- A tool to inspire approximations and penalties (structured mean field, tree approximations, etc.)
- A tool for monitoring theoretical and empirical behavior and the accuracy of inference
Utility of the loss function
- The major measure of the quality of the learning algorithm and the model
Deep neural networks:
- Learn representations that are useful for computation and for performance on the end metric (the intermediate representations are not guaranteed to be meaningful)
- Learning is mainly based on gradient descent (aka backpropagation); inference is often trivial and done via a "forward pass"
- Graphs represent computation
Utility of the network
- A tool for conceptually synthesizing complex decision hypotheses (stage-wise projection and aggregation)
- A tool for organizing computational operations (stage-wise updates of latent states)
- A tool for designing processing steps and computing modules (layer-wise parallelization)
- No obvious utility for evaluating DL inference algorithms
So far, graphical models are representations of probability distributions, whereas neural networks are function approximators (with no probabilistic meaning). Some neural networks, however, are actually graphical models (i.e., their units/neurons represent random variables):
- Boltzmann machines (Hinton & Sejnowski, 1983)
- Restricted Boltzmann machines (Smolensky, 1986)
- Learning and inference in sigmoid belief networks (Neal, 1992)
- Fast learning in deep belief networks (Hinton, Osindero & Teh, 2006)
- Deep Boltzmann machines (Salakhutdinov & Hinton, 2009)
We will go through each of these in turn.
I: Restricted Boltzmann Machines
A restricted Boltzmann machine (RBM) is a Markov random field represented by a bipartite graph: every node in one layer/part is connected to every node in the other layer, and there are no within-layer connections. The joint distribution is

p(v, h) = (1/Z) exp(−E(v, h)), where E(v, h) = −vᵀWh − bᵀv − cᵀh.

The log-likelihood of a single data point (with the unobserved h marginalized out) is

log p(v) = log Σ_h exp(−E(v, h)) − log Z,

and its gradient with respect to the model parameters θ is

∂ log p(v)/∂θ = E_{p(h|v)}[ −∂E(v, h)/∂θ ] − E_{p(v, h)}[ −∂E(v, h)/∂θ ],

or, in the alternative form for θ = W,

∂ log p(v)/∂W = E_{p(h|v)}[ v hᵀ ] − E_{p(v, h)}[ v hᵀ ].

Both expectations can be approximated by sampling. Sampling from the posterior is exact (the RBM factorizes over h given v); sampling from the joint is done via MCMC (e.g., Gibbs sampling).
In the neural network literature:
- Computing the first term is called the clamped / wake / positive phase (the network is "awake" because it is conditioned on the visible variables)
- Computing the second term is called the unclamped / sleep / free / negative phase (the network is "asleep" because it samples the visible variables from the joint; metaphorically, it "dreams" the visible inputs)
Learning is done by optimizing the log-likelihood of the data via stochastic gradient descent (SGD). Estimating the second (negative-phase) term depends heavily on the mixing properties of the Markov chain, which often leads to slow convergence and requires extra computation.
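A minimal sketch of this learning rule for a Bernoulli RBM, using a single Gibbs step (CD-1) to approximate the negative phase (the sizes, learning rate, and toy batch are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, b, c, lr=0.1):
    """One stochastic gradient step on log p(v) using contrastive divergence (CD-1)."""
    # Positive (clamped / "wake") phase: p(h|v0) is exact because the RBM factorizes.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative (free / "sleep") phase: one block-Gibbs step approximates a sample from the joint.
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # Gradient estimate = positive statistics - negative statistics.
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / len(v0)
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)

# Toy usage: 6 visible units, 4 hidden units, a batch of 32 binary vectors.
W = 0.01 * rng.normal(size=(6, 4))
b, c = np.zeros(6), np.zeros(4)
batch = (rng.random((32, 6)) < 0.5).astype(float)
cd1_step(batch, W, b, c)
```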
II: Sigmoid Belief Networks
Sigmoid belief nets are simple Bayesian networks whose binary variables have conditional probabilities given by sigmoid functions:

p(X_i = 1 | pa(X_i)) = σ( Σ_{j ∈ pa(i)} w_ij X_j + b_i ).

Bayesian networks exhibit a phenomenon called "explaining away": if A is correlated with C, then the chance of B being correlated with C decreases ⇒ A and B become dependent given C. Notably, because of explaining away, when we condition on the visible layer of a belief network, all of the hidden variables become dependent.
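A minimal sketch of ancestral sampling in a two-layer sigmoid belief net (the sizes and random weights are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Two layers: 3 hidden parents h feed 5 visible children v.
W = rng.normal(size=(3, 5))
b = np.zeros(5)

h = (rng.random(3) < 0.5).astype(float)      # sample the top layer from its prior
p_v = sigmoid(h @ W + b)                     # p(v_i = 1 | pa(v_i)) = sigmoid(sum_j w_ji h_j + b_i)
v = (rng.random(5) < p_v).astype(float)      # sample the visibles given their parents
```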
Sigmoid Belief Networks as graphical models
Neal proposed Monte Carlo methods for learning and inference (Neal, 1992).
RBMs are infinite belief networks
To perform gradient updates of the model parameters, we need to compute the expectations by sampling:
- We can sample exactly from the posterior in the first (positive-phase) term
- We run block Gibbs sampling to approximately sample from the joint distribution (negative-phase term)
The conditional distributions p(v|h) and p(h|v) are expressed via sigmoids, so we can view Gibbs sampling from the joint distribution represented by an RBM as top-down propagation in an infinitely deep sigmoid belief network! RBMs are equivalent to infinitely deep belief networks. When we train an RBM, we are in fact training an infinitely deep belief network, just with the weights of all layers tied. If we untie the weights to some extent, we obtain a deep belief network.
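A minimal sketch of block Gibbs sampling in an RBM; every v → h → v step reuses the same sigmoid conditionals, which is exactly top-down propagation through tied layers of an (in the limit, infinitely) deep sigmoid belief network (toy sizes, biases omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

W = 0.1 * rng.normal(size=(6, 4))            # visible-to-hidden weights
v = (rng.random(6) < 0.5).astype(float)      # arbitrary starting state

for _ in range(100):                         # the chain's stationary distribution is the RBM joint
    h = (rng.random(4) < sigmoid(v @ W)).astype(float)    # "layer above", weights W
    v = (rng.random(6) < sigmoid(h @ W.T)).astype(float)  # "layer below", tied weights W^T
```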
Deep Belief Networks and Boltzmann Machines
III: Deep Belief Nets
A DBN is a hybrid graphical model (a chain graph). Its joint probability distribution can be written as

p(v, h¹, h², h³) = p(v | h¹) p(h¹ | h²) p(h², h³),

where the top two layers form an RBM and the layers below form a sigmoid belief network.
The challenge this implies: exact inference in DBNs is problematic because of the explaining away effect. Training proceeds in two stages:
- Greedy pre-training plus ad-hoc fine-tuning; there is no proper joint training
- Approximate inference is feed-forward (bottom-up)
Layer-wise pre-training
- Pre-train and freeze the first RBM
- Stack another RBM on top and train it
- The weights of layer 2 and above remain tied
- We repeat this process: pre-train and untie (a sketch of the procedure follows this list)
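A minimal sketch of the greedy layer-wise procedure, using a CD-1 trainer like the one above (layer sizes, epochs, and the toy binary data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, epochs=10, lr=0.1):
    """CD-1 training of one RBM; returns its weights, hidden bias, and hidden activations."""
    n_visible = data.shape[1]
    W = 0.01 * rng.normal(size=(n_visible, n_hidden))
    b, c = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        ph0 = sigmoid(data @ W + c)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T + b)                 # reconstruction (negative phase, CD-1)
        ph1 = sigmoid(pv1 @ W + c)
        W += lr * (data.T @ ph0 - pv1.T @ ph1) / len(data)
        b += lr * (data - pv1).mean(axis=0)
        c += lr * (ph0 - ph1).mean(axis=0)
    return W, c, sigmoid(data @ W + c)

data = (rng.random((200, 20)) < 0.5).astype(float)   # toy binary data
stack, inputs = [], data
for n_hidden in (16, 8):          # pre-train and freeze each RBM, then stack the next one on top
    W, c, inputs = train_rbm(inputs, n_hidden)
    stack.append((W, c))
```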
Fine-tuning
- Pre-training is quite ad hoc and is unlikely to lead to a good probabilistic model per se
- However, the layers of representations could perhaps be useful for some other downstream tasks!
- We can further “fine-tune” a pre-trained DBN for some other task
Setting A: Unsupervised learning (DBN → autoencoder)
- Pre-train a stack of RBMs in a greedy layer-wise fashion
- “Unroll” the RBMs to create an autoencoder
- Fine-tune the parameters by optimizing the reconstruction error
Setting B: Supervised learning (DBN → classifier)
- Pre-train a stack of RBMs in a greedy layer-wise fashion
- “Unroll” the RBMs to create a feedforward classifier
- Fine-tune the parameters by optimizing the prediction error (a sketch follows this list)
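A minimal sketch of Setting B: "unrolling" two pre-trained RBMs into a feedforward classifier and fine-tuning it on the prediction error (the "pre-trained" weights here are random stand-in tensors):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for pre-trained RBM weights, stored as (visible x hidden).
W1, W2 = torch.randn(20, 16), torch.randn(16, 8)

layer1, layer2 = nn.Linear(20, 16), nn.Linear(16, 8)
with torch.no_grad():
    layer1.weight.copy_(W1.T)      # nn.Linear stores weights as (out_features, in_features)
    layer2.weight.copy_(W2.T)

classifier = nn.Sequential(layer1, nn.Sigmoid(), layer2, nn.Sigmoid(), nn.Linear(8, 3))

x, labels = torch.randn(32, 20), torch.randint(0, 3, (32,))
opt = torch.optim.SGD(classifier.parameters(), lr=0.1)
loss = F.cross_entropy(classifier(x), labels)   # fine-tune all layers on the prediction error
loss.backward()
opt.step()
```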
Deep Belief Nets and Boltzmann Machines
DBMs are fully undirected models (Markov random fields). They can be trained similarly to RBMs via MCMC (Hinton & Sejnowski, 1983). A variational approximation of the data distribution can be used for faster training (Salakhutdinov & Hinton, 2009). Similarly, they can be used to initialize other networks for downstream tasks.
A few ==critical points== to note about all these models:
- The primary goal of deep generative models is to represent the distribution of the observable variables. Adding layers of hidden variables allows us to represent increasingly complex distributions.
- Hidden variables are secondary (auxiliary) elements used to facilitate learning of complex dependencies between the observables.
- Training of the model is ad-hoc, but what matters is the quality of learned hidden representations.
- Representations are judged by their usefulness on a downstream task (the probabilistic meaning of the model is often discarded at the end).
- In contrast, classical graphical models are often concerned with the correctness of learning and inference of all variables.
Conclusion
- DL & GM: the fields are similar in the beginning (structure, energy, etc.), and then diverge to their own signature pipelines
- DL: most effort is directed to comparing different architectures and their components (models are driven by evaluating empirical performance on downstream tasks)
- DL models are good at learning robust hierarchical representations from the data and suitable for simple reasoning (call it “low-level cognition”)
- GM: the effort is directed towards improving inference accuracy and convergence speed
- GMs are best for provably correct inference and suitable for high-level complex reasoning tasks (call it "high-level cognition")
- Convergence of both fields is very promising!
03 Combining DL methods and GMs
Using outputs of NNs as inputs to GMs
Combining sequential NNs and GMs: e.g., hybrids of neural networks and HMMs (hidden Markov models).
Hybrid NNs + conditional GMs: in a standard CRF (conditional random field), each of the factor cells is a parameter. In a hybrid model, these values are computed by a neural network (a sketch follows the list below).
- GMs with potential functions represented by NNs
- NNs with structured outputs
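A minimal sketch of the hybrid idea: the unary potentials of a linear-chain CRF are produced by a small neural network instead of being free parameters, and the forward algorithm computes the log partition function (all sizes and weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

T, D, K = 6, 4, 3                     # sequence length, input dim, number of labels
x = rng.normal(size=(T, D))

# Hybrid part: unary (emission) scores are computed by a tiny neural network.
W1, W2 = rng.normal(size=(D, 8)), rng.normal(size=(8, K))
unary = np.tanh(x @ W1) @ W2          # shape (T, K)
trans = rng.normal(size=(K, K))       # pairwise label-transition scores (still free parameters)

def logsumexp(a, axis):
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis) + np.log(np.exp(a - m).sum(axis=axis))

# Forward algorithm over the chain: alpha[k] = log-sum over prefixes ending in label k.
alpha = unary[0]
for t in range(1, T):
    alpha = unary[t] + logsumexp(alpha[:, None] + trans, axis=0)
log_Z = logsumexp(alpha, axis=0)      # log partition function of the CRF
```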
Using GMs as Prediction Explanations
How do we build a powerful predictive model whose predictions we can interpret in terms of semantically meaningful features?
Contextual Explanation Networks (CENs)
- The final prediction is made by a linear GM.
- Each coefficient assigns a weight to a meaningful attribute.
- Allows us to judge predictions in terms of GMs produced by the context encoder.
CEN: Implementation Details
Workflow:
- Maintain a (sparse) dictionary of GM parameters.
- Process complex inputs (images, text, time series, etc.) using deep nets; use soft attention to either select or combine models from the dictionary.
- Use the constructed GMs (e.g., CRFs) to make predictions.
- Inspect the GM parameters to understand the reasoning behind predictions (a sketch of this workflow follows the list).
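A minimal sketch of this workflow with made-up shapes; the "encoder" here is just a random projection, whereas in CEN it would be a deep network:

```python
import numpy as np

rng = np.random.default_rng(0)

n_models, n_attrs, d_ctx = 10, 8, 32
dictionary = rng.normal(size=(n_models, n_attrs))   # dictionary of linear-GM parameters
context = rng.normal(size=(d_ctx,))                 # deep-net encoding of the complex input
W_attn = rng.normal(size=(d_ctx, n_models))

scores = context @ W_attn
attn = np.exp(scores - scores.max()); attn /= attn.sum()   # soft attention over dictionary entries
theta = attn @ dictionary                                  # context-specific linear model

attributes = rng.normal(size=(n_attrs,))             # semantically meaningful features
prob = 1.0 / (1.0 + np.exp(-(attributes @ theta)))   # the final prediction is made by the linear GM
# Inspecting theta tells us which attributes drove this particular prediction.
```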
Results: imagery as context
Based on the imagery, CEN learns to select different models for urban and rural areas.
Results: classical image & text datasets
CEN architectures for survival analysis
04 Bayesian Learning of NNs
- Bayesian learning of NN parameters
- Deep kernel learning
A neural network as a probabilistic model. Likelihood: p(y|x, θ):
- Categorical distribution for classification ⇒ cross-entropy loss
- Gaussian distribution for regression ⇒ squared loss
Prior: p(θ):
- Gaussian prior ⇒ L2 regularization
- Laplace prior ⇒ L1 regularization (a sketch of this correspondence follows this list)
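A minimal sketch of the correspondence for regression with a Gaussian likelihood and a Gaussian prior (the data, noise variance, and prior variance are assumed values):

```python
import numpy as np

rng = np.random.default_rng(0)

X, y = rng.normal(size=(50, 3)), rng.normal(size=(50,))
theta = rng.normal(size=(3,))
sigma2, tau2 = 1.0, 10.0    # observation-noise variance and prior variance (assumed)

neg_log_lik = ((y - X @ theta) ** 2).sum() / (2 * sigma2)     # Gaussian likelihood -> squared loss
neg_log_prior = (theta ** 2).sum() / (2 * tau2)               # Gaussian prior -> L2 penalty
map_objective = neg_log_lik + neg_log_prior                   # MAP = loss + regularizer (up to constants)
```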
Bayesian learning [MacKay 1992, Neal 1996, de Freitas 2003]