Probabilistic Graphical Models
Statistical and Algorithmic Foundations of Deep Learning
Author: Eric Xing
01 An overview of DL components
Historical remarks: early days of neural networks
We know that a biological neuron works like this: an upstream cell sends neurotransmitters along its axon to the dendrites of a downstream cell. Inspired by this principle, artificial intelligence constructs the artificial neuron (or perceptron) by analogy. Likewise, the biological neural network becomes the artificial neural network.
Reverse-mode automatic differentiation (aka backpropagation)
Let us now look at the perceptron learning algorithm in detail. Suppose we have a regression problem x → y with y = f(x) + η. The objective function is

J(w) = 1/2 Σ_n (y_n − wᵀx_n)²

To minimize it, we take its derivative with respect to w:

∂J/∂w = −Σ_n (y_n − wᵀx_n) x_n

which gives the update rule for w:

w ← w + α Σ_n (y_n − wᵀx_n) x_n
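A minimal numerical sketch of this update rule (the toy data, learning rate α, and number of iterations are illustrative choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = f(x) + noise, with f linear for simplicity.
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w, alpha = np.zeros(3), 0.05
for _ in range(500):
    err = y - X @ w                  # residuals y_n - w^T x_n
    grad = -X.T @ err                # dJ/dw for J(w) = 1/2 * sum_n (y_n - w^T x_n)^2
    w = w - alpha * grad / len(X)    # gradient-descent update of w
```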
Now let us turn to the neural network model, where the hidden units have no targets of their own.
An artificial neural network is nothing but a complex function that can be represented by a computation graph. By applying the chain rule with reverse accumulation, we obtain the gradients; this algorithm is commonly called backpropagation. What if some of the functions are stochastic? Use stochastic backpropagation! Modern software packages can do this automatically (more on this later).
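As a concrete illustration, packages such as PyTorch record the computation graph during the forward pass and then apply the chain rule by reverse accumulation; a minimal sketch (the function and shapes are arbitrary):

```python
import torch

x = torch.randn(4, 3)                       # inputs (no gradient tracking needed)
W = torch.randn(3, 1, requires_grad=True)   # parameters tracked in the computation graph
b = torch.zeros(1, requires_grad=True)

y = torch.sigmoid(x @ W + b)                # forward pass records the graph
loss = ((y - 1.0) ** 2).mean()
loss.backward()                             # reverse accumulation (backpropagation)

print(W.grad, b.grad)                       # d(loss)/dW and d(loss)/db
```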
Modern building blocks: units, layers, activations functions, loss functions, etc.
Commonly used activation functions:
- Linear and ReLU
- Sigmoid and tanh
- Etc.
Network layers:
- Fully connected
- Convolutional & pooling
- Recurrent
- ResNets
- Etc.
In other words, these basic building blocks can be combined arbitrarily. With multiple loss functions, we can do things such as multi-target prediction and transfer learning. Given enough data, deeper architectures keep improving.
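A minimal sketch of composing such building blocks with more than one loss (the layer sizes and the two heads are illustrative choices, not from the lecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Shared trunk with two heads: one classification loss plus one regression loss.
trunk = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU())
cls_head, reg_head = nn.Linear(32, 5), nn.Linear(32, 1)

x = torch.randn(8, 16)
labels = torch.randint(0, 5, (8,))
targets = torch.randn(8, 1)

h = trunk(x)
loss = F.cross_entropy(cls_head(h), labels) + F.mse_loss(reg_head(h), targets)
loss.backward()   # gradients flow into both heads and the shared trunk
```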
Feature learning: successful learning of intermediate representations [Lee et al., ICML 2009; Lee et al., NIPS 2009]. Representation learning: the network learns increasingly abstract representations of the data that become "disentangled", i.e., linearly separable.
02 Similarities and differences between GMs and NNs
Graphical models vs. computational graphs
Graphical models:
- A representation for encoding meaningful knowledge and the associated uncertainty in graphical form
- Learning and inference are based on a rich toolbox of well-studied (structure-dependent) techniques (e.g., EM, message passing, VI, MCMC)
- Graphs represent models
Utility of the graph
- A tool for synthesizing a global loss function from local structure (potential functions, feature functions, etc.)
- A tool for designing sound and efficient inference algorithms (sum-product, mean field, etc.)
- A tool to inspire approximations and penalties (structured mean field, tree approximations, etc.)
- A tool for monitoring theoretical and empirical behavior and the accuracy of inference
Utility of the loss function
- The major measure of the quality of the learning algorithm and the model
Deep neural networks:
- Learn representations that are useful for computation and for performance on the end metric (the intermediate representations are not guaranteed to be meaningful)
- Learning is mainly based on gradient descent (aka backpropagation); inference is often trivial and done via a "forward pass"
- Graphs represent computation
Utility of the network
- A tool for conceptually synthesizing complex decision hypotheses (stage-wise projection and aggregation)
- A tool for organizing computational operations (stage-wise updates of latent states)
- A tool for designing processing steps and computing modules (layer-wise parallelization)
- No obvious utility for evaluating DL inference algorithms
So far, graphical models are representations of probability distributions, whereas neural networks are function approximators (with no probabilistic meaning). Some neural networks, however, are actually graphical models (i.e., their units/neurons represent random variables):
- Boltzmann machines (Hinton & Sejnowski, 1983)
- Restricted Boltzmann machines (Smolensky, 1986)
- Learning and inference in sigmoid belief networks (Neal, 1992)
- Fast learning in deep belief networks (Hinton, Osindero & Teh, 2006)
- Deep Boltzmann machines (Salakhutdinov & Hinton, 2009)
We will go through each of these in turn.
I: Restricted Boltzmann Machines
A restricted Boltzmann machine (RBM) is a Markov random field represented by a bipartite graph: every node in one layer/part is connected to every node in the other layer, and there are no within-layer connections. The joint distribution is

p(v, h) = (1/Z) exp(−E(v, h)), where E(v, h) = −vᵀWh − bᵀv − cᵀh.

The log-likelihood of a single data point (with the unobserved h marginalized out) is

log p(v) = log Σ_h exp(−E(v, h)) − log Z,

and its gradient with respect to the model parameters θ is

∂ log p(v)/∂θ = E_{p(h|v)}[ −∂E(v, h)/∂θ ] − E_{p(v, h)}[ −∂E(v, h)/∂θ ],

or, in the alternative form for θ = W,

∂ log p(v)/∂W = E_{p(h|v)}[ v hᵀ ] − E_{p(v, h)}[ v hᵀ ].

Both expectations can be approximated by sampling. Sampling from the posterior is exact (the RBM factorizes over h given v); sampling from the joint is done via MCMC (e.g., Gibbs sampling).
In the neural network literature:
- Computing the first term is called the clamped / wake / positive phase (the network is "awake" because it is conditioned on the visible variables)
- Computing the second term is called the unclamped / sleep / free / negative phase (the network is "asleep" because it samples the visible variables from the joint; metaphorically, it "dreams" the visible inputs)
Learning is done by optimizing the log-likelihood of the data via stochastic gradient descent (SGD). Estimating the second (negative-phase) term depends heavily on the mixing properties of the Markov chain, which often leads to slow convergence and requires extra computation.
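A minimal sketch of this learning rule for a Bernoulli RBM, using a single Gibbs step (CD-1) to approximate the negative phase (the sizes, learning rate, and toy batch are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, b, c, lr=0.1):
    """One stochastic gradient step on log p(v) using contrastive divergence (CD-1)."""
    # Positive (clamped / "wake") phase: p(h|v0) is exact because the RBM factorizes.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative (free / "sleep") phase: one block-Gibbs step approximates a sample from the joint.
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # Gradient estimate = positive statistics - negative statistics.
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / len(v0)
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)

# Toy usage: 6 visible units, 4 hidden units, a batch of 32 binary vectors.
W = 0.01 * rng.normal(size=(6, 4))
b, c = np.zeros(6), np.zeros(4)
batch = (rng.random((32, 6)) < 0.5).astype(float)
cd1_step(batch, W, b, c)
```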
II: Sigmoid Belief Networks
Sigmoid belief nets are simple Bayesian networks whose binary variables have conditional probabilities given by sigmoid functions:

p(X_i = 1 | pa(X_i)) = σ( Σ_{j ∈ pa(i)} w_ij X_j + b_i ).

Bayesian networks exhibit a phenomenon called "explaining away": if A is correlated with C, then the chance of B being correlated with C decreases ⇒ A and B become dependent given C. Notably, because of explaining away, when we condition on the visible layer of a belief network, all of the hidden variables become dependent.
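A minimal sketch of ancestral sampling in a two-layer sigmoid belief net (the sizes and random weights are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Two layers: 3 hidden parents h feed 5 visible children v.
W = rng.normal(size=(3, 5))
b = np.zeros(5)

h = (rng.random(3) < 0.5).astype(float)      # sample the top layer from its prior
p_v = sigmoid(h @ W + b)                     # p(v_i = 1 | pa(v_i)) = sigmoid(sum_j w_ji h_j + b_i)
v = (rng.random(5) < p_v).astype(float)      # sample the visibles given their parents
```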
Sigmoid Belief Networks as graphical models
Neal proposed Monte Carlo methods for learning and inference (Neal, 1992).
RBMs are infinite belief networks
To perform gradient updates of the model parameters, we need to compute the expectations by sampling:
- We can sample exactly from the posterior in the first (positive-phase) term
- We run block Gibbs sampling to approximately sample from the joint distribution (negative-phase term)
The conditional distributions p(v|h) and p(h|v) are expressed via sigmoids, so we can view Gibbs sampling from the joint distribution represented by an RBM as top-down propagation in an infinitely deep sigmoid belief network! RBMs are equivalent to infinitely deep belief networks. When we train an RBM, we are in fact training an infinitely deep belief network, just with the weights of all layers tied. If we untie the weights to some extent, we obtain a deep belief network.
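A minimal sketch of block Gibbs sampling in an RBM; every v → h → v step reuses the same sigmoid conditionals, which is exactly top-down propagation through tied layers of an (in the limit, infinitely) deep sigmoid belief network (toy sizes, biases omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

W = 0.1 * rng.normal(size=(6, 4))            # visible-to-hidden weights
v = (rng.random(6) < 0.5).astype(float)      # arbitrary starting state

for _ in range(100):                         # the chain's stationary distribution is the RBM joint
    h = (rng.random(4) < sigmoid(v @ W)).astype(float)    # "layer above", weights W
    v = (rng.random(6) < sigmoid(h @ W.T)).astype(float)  # "layer below", tied weights W^T
```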
Deep Belief Networks and Boltzmann Machines
III: Deep Belief Nets
A DBN is a hybrid graphical model (a chain graph). Its joint probability distribution can be written as

p(v, h¹, h², h³) = p(v | h¹) p(h¹ | h²) p(h², h³),

where the top two layers form an RBM and the layers below form a sigmoid belief network.
The challenge this implies: exact inference in DBNs is problematic because of the explaining away effect. Training proceeds in two stages:
- Greedy pre-training plus ad-hoc fine-tuning; there is no proper joint training
- Approximate inference is feed-forward (bottom-up)
Layer-wise pre-training
- Pre-train and freeze the first RBM
- Stack another RBM on top and train it
- The weights of layer 2 and above remain tied
- We repeat this process: pre-train and untie (a sketch of the procedure follows this list)
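A minimal sketch of the greedy layer-wise procedure, using a CD-1 trainer like the one above (layer sizes, epochs, and the toy binary data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, epochs=10, lr=0.1):
    """CD-1 training of one RBM; returns its weights, hidden bias, and hidden activations."""
    n_visible = data.shape[1]
    W = 0.01 * rng.normal(size=(n_visible, n_hidden))
    b, c = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        ph0 = sigmoid(data @ W + c)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T + b)                 # reconstruction (negative phase, CD-1)
        ph1 = sigmoid(pv1 @ W + c)
        W += lr * (data.T @ ph0 - pv1.T @ ph1) / len(data)
        b += lr * (data - pv1).mean(axis=0)
        c += lr * (ph0 - ph1).mean(axis=0)
    return W, c, sigmoid(data @ W + c)

data = (rng.random((200, 20)) < 0.5).astype(float)   # toy binary data
stack, inputs = [], data
for n_hidden in (16, 8):          # pre-train and freeze each RBM, then stack the next one on top
    W, c, inputs = train_rbm(inputs, n_hidden)
    stack.append((W, c))
```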
Fine-tuning
- Pre-training is quite ad hoc and is unlikely to lead to a good probabilistic model per se
- However, the layers of representations could perhaps be useful for some other downstream tasks!
- We can further “fine-tune” a pre-trained DBN for some other task
Setting A: Unsupervised learning (DBN → autoencoder)
- Pre-train a stack of RBMs in a greedy layer-wise fashion
- “Unroll” the RBMs to create an autoencoder
- Fine-tune the parameters by optimizing the reconstruction error
Setting B: Supervised learning (DBN → classifier)
- Pre-train a stack of RBMs in a greedy layer-wise fashion
- “Unroll” the RBMs to create a feedforward classifier
- Fine-tune the parameters by optimizing the prediction error (a sketch follows this list)
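A minimal sketch of Setting B: "unrolling" two pre-trained RBMs into a feedforward classifier and fine-tuning it on the prediction error (the "pre-trained" weights here are random stand-in tensors):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for pre-trained RBM weights, stored as (visible x hidden).
W1, W2 = torch.randn(20, 16), torch.randn(16, 8)

layer1, layer2 = nn.Linear(20, 16), nn.Linear(16, 8)
with torch.no_grad():
    layer1.weight.copy_(W1.T)      # nn.Linear stores weights as (out_features, in_features)
    layer2.weight.copy_(W2.T)

classifier = nn.Sequential(layer1, nn.Sigmoid(), layer2, nn.Sigmoid(), nn.Linear(8, 3))

x, labels = torch.randn(32, 20), torch.randint(0, 3, (32,))
opt = torch.optim.SGD(classifier.parameters(), lr=0.1)
loss = F.cross_entropy(classifier(x), labels)   # fine-tune all layers on the prediction error
loss.backward()
opt.step()
```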
Deep Belief Nets and Boltzmann Machines
DBMs are fully undirected models (Markov random fields). They can be trained similarly to RBMs via MCMC (Hinton & Sejnowski, 1983). A variational approximation of the data distribution can be used for faster training (Salakhutdinov & Hinton, 2009). Similarly, they can be used to initialize other networks for downstream tasks.
A few ==critical points== to note about all these models:
- The primary goal of deep generative models is to represent the distribution of the observable variables. Adding layers of hidden variables allows us to represent increasingly complex distributions.
- Hidden variables are secondary (auxiliary) elements used to facilitate learning of complex dependencies between the observables.
- Training of the model is ad-hoc, but what matters is the quality of learned hidden representations.
- Representations are judged by their usefulness on a downstream task (the probabilistic meaning of the model is often discarded at the end).
- In contrast, classical graphical models are often concerned with the correctness of learning and inference of all variables.
Conclusion
- DL & GM: the fields are similar in the beginning (structure, energy, etc.), and then diverge to their own signature pipelines
- DL: most effort is directed to comparing different architectures and their components (models are driven by evaluating empirical performance on downstream tasks)
- DL models are good at learning robust hierarchical representations from the data and suitable for simple reasoning (call it “low-level cognition”)
- GM: the effort is directed towards improving inference accuracy and convergence speed
- GMs are best for provably correct inference and suitable for high-level complex reasoning tasks (call it "high-level cognition")
- Convergence of both fields is very promising!
03 Combining DL methods and GMs
Using outputs of NNs as inputs to GMs
Combining sequential NNs and GMs: e.g., hybrids of neural networks and HMMs (hidden Markov models).
Hybrid NNs + conditional GMs: in a standard CRF (conditional random field), each of the factor cells is a parameter. In a hybrid model, these values are computed by a neural network (a sketch follows the list below).
- GMs with potential functions represented by NNs
- NNs with structured outputs
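A minimal sketch of the hybrid idea: the unary potentials of a linear-chain CRF are produced by a small neural network instead of being free parameters, and the forward algorithm computes the log partition function (all sizes and weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

T, D, K = 6, 4, 3                     # sequence length, input dim, number of labels
x = rng.normal(size=(T, D))

# Hybrid part: unary (emission) scores are computed by a tiny neural network.
W1, W2 = rng.normal(size=(D, 8)), rng.normal(size=(8, K))
unary = np.tanh(x @ W1) @ W2          # shape (T, K)
trans = rng.normal(size=(K, K))       # pairwise label-transition scores (still free parameters)

def logsumexp(a, axis):
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis) + np.log(np.exp(a - m).sum(axis=axis))

# Forward algorithm over the chain: alpha[k] = log-sum over prefixes ending in label k.
alpha = unary[0]
for t in range(1, T):
    alpha = unary[t] + logsumexp(alpha[:, None] + trans, axis=0)
log_Z = logsumexp(alpha, axis=0)      # log partition function of the CRF
```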
Using GMs as Prediction Explanations
How do we build a powerful predictive model whose predictions we can interpret in terms of semantically meaningful features?
Contextual Explanation Networks (CENs)
- The final prediction is made by a linear GM.
- Each coefficient assigns a weight to a meaningful attribute.
- Allows us to judge predictions in terms of GMs produced by the context encoder.
CEN: Implementation Details
Workflow:
- Maintain a (sparse) dictionary of GM parameters.
- Process complex inputs (images, text, time series, etc.) using deep nets; use soft attention to either select or combine models from the dictionary.
- Use the constructed GMs (e.g., CRFs) to make predictions.
- Inspect the GM parameters to understand the reasoning behind predictions (a sketch of this workflow follows the list).
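A minimal sketch of this workflow with made-up shapes; the "encoder" here is just a random projection, whereas in CEN it would be a deep network:

```python
import numpy as np

rng = np.random.default_rng(0)

n_models, n_attrs, d_ctx = 10, 8, 32
dictionary = rng.normal(size=(n_models, n_attrs))   # dictionary of linear-GM parameters
context = rng.normal(size=(d_ctx,))                 # deep-net encoding of the complex input
W_attn = rng.normal(size=(d_ctx, n_models))

scores = context @ W_attn
attn = np.exp(scores - scores.max()); attn /= attn.sum()   # soft attention over dictionary entries
theta = attn @ dictionary                                  # context-specific linear model

attributes = rng.normal(size=(n_attrs,))             # semantically meaningful features
prob = 1.0 / (1.0 + np.exp(-(attributes @ theta)))   # the final prediction is made by the linear GM
# Inspecting theta tells us which attributes drove this particular prediction.
```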
Results: imagery as context
Based on the imagery, CEN learns to select different models for urban and rural areas.
Results: classical image & text datasets
CEN architectures for survival analysis
04 Bayesian Learning of NNs
- Bayesian learning of NN parameters
- Deep kernel learning
A neural network as a probabilistic model. Likelihood: p(y|x, θ):
- Categorical distribution for classification ⇒ cross-entropy loss
- Gaussian distribution for regression ⇒ squared loss
Prior: p(θ):
- Gaussian prior ⇒ L2 regularization
- Laplace prior ⇒ L1 regularization (a sketch of this correspondence follows this list)
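A minimal sketch of the correspondence for regression with a Gaussian likelihood and a Gaussian prior (the data, noise variance, and prior variance are assumed values):

```python
import numpy as np

rng = np.random.default_rng(0)

X, y = rng.normal(size=(50, 3)), rng.normal(size=(50,))
theta = rng.normal(size=(3,))
sigma2, tau2 = 1.0, 10.0    # observation-noise variance and prior variance (assumed)

neg_log_lik = ((y - X @ theta) ** 2).sum() / (2 * sigma2)     # Gaussian likelihood -> squared loss
neg_log_prior = (theta ** 2).sum() / (2 * tau2)               # Gaussian prior -> L2 penalty
map_objective = neg_log_lik + neg_log_prior                   # MAP = loss + regularizer (up to constants)
```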
Bayesian learning [MacKay 1992, Neal 1996, de Freitas 2003]