Probabilistic Models

Posted by 拉普拉斯的汪 on 2020-12-17

Reference:

Bishop, C. M. *Pattern Recognition and Machine Learning*. Springer, 2006.

Slides of CS4220, TUD

Bayesian Inference

Bayes’ Theorem for Gaussian Variables (2.3.1-2.3.3)

Suppose $\mathbf x$ is a $D$-dimensional vector with Gaussian distribution $\mathcal N(\mathbf x\mid\boldsymbol\mu,\boldsymbol\Sigma)$ and that we partition $\mathbf x$ into two disjoint subsets $\mathbf x_a$ and $\mathbf x_b$. Without loss of generality, we can take $\mathbf x_a$ to form the first $M$ components of $\mathbf x$, with $\mathbf x_b$ comprising the remaining $D-M$ components, i.e.
$$\mathbf x=\begin{pmatrix}\mathbf x_a\\\mathbf x_b\end{pmatrix}\tag{BI.1}$$
We also define corresponding partitions of the mean vector $\boldsymbol\mu$ given by
$$\boldsymbol\mu=\begin{pmatrix}\boldsymbol\mu_a\\\boldsymbol\mu_b\end{pmatrix}\tag{BI.2}$$
and of the covariance matrix $\boldsymbol\Sigma$ given by
$$\boldsymbol\Sigma=\begin{pmatrix}\boldsymbol\Sigma_{aa} & \boldsymbol\Sigma_{ab}\\ \boldsymbol\Sigma_{ba} & \boldsymbol\Sigma_{bb}\end{pmatrix}\tag{BI.3}$$
Note that the symmetry $\boldsymbol\Sigma^{\mathrm T}=\boldsymbol\Sigma$ of the covariance matrix implies that $\boldsymbol\Sigma_{aa}$ and $\boldsymbol\Sigma_{bb}$ are symmetric, while $\boldsymbol\Sigma_{ba}=\boldsymbol\Sigma_{ab}^{\mathrm T}$.

In many situations, it will be convenient to work with the inverse of the covariance matrix
$$\boldsymbol\Lambda\equiv\boldsymbol\Sigma^{-1}\tag{BI.4}$$
which is known as the precision matrix. In fact, we shall see that some properties of Gaussian distributions are most naturally expressed in terms of the covariance, whereas others take a simpler form when viewed in terms of the precision. We therefore also introduce the partitioned form of the precision matrix
$$\boldsymbol\Lambda=\begin{pmatrix}\boldsymbol\Lambda_{aa} & \boldsymbol\Lambda_{ab}\\ \boldsymbol\Lambda_{ba} & \boldsymbol\Lambda_{bb}\end{pmatrix}\tag{BI.5}$$
Because the inverse of a symmetric matrix is also symmetric, we see that $\boldsymbol\Lambda_{aa}$ and $\boldsymbol\Lambda_{bb}$ are symmetric, while $\boldsymbol\Lambda_{ab}^{\mathrm T}=\boldsymbol\Lambda_{ba}$.

An important property of the multivariate Gaussian distribution is that if two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian. Similarly, the marginal distribution of either set is also Gaussian.

Using the expressions above: since $\mathbf x_a$ and $\mathbf x_b$ are jointly Gaussian, the conditional distributions $p(\mathbf x_a\mid\mathbf x_b)$ and $p(\mathbf x_b\mid\mathbf x_a)$, as well as the marginal distributions $p(\mathbf x_a)$ and $p(\mathbf x_b)$, are also Gaussian. We will show how to prove this.

Conditional Gaussian distributions

Let us begin by finding an expression for the conditional distribution $p(\mathbf x_a\mid\mathbf x_b)$. It can be evaluated from the joint distribution $p(\mathbf x)=p(\mathbf x_a,\mathbf x_b)$ by
$$p(\mathbf x_a\mid\mathbf x_b)=\frac{p(\mathbf x_a,\mathbf x_b)}{p(\mathbf x_b)}=\frac{p(\mathbf x_a,\mathbf x_b)}{\int p(\mathbf x_a,\mathbf x_b)\,d\mathbf x_a}$$
where
$$p(\mathbf x)=p(\mathbf x_a,\mathbf x_b)=\frac{1}{\sqrt{(2\pi)^D|\boldsymbol\Sigma|}}\exp\left(-\frac{1}{2}(\mathbf x-\boldsymbol\mu)^{\mathrm T}\boldsymbol\Sigma^{-1}(\mathbf x-\boldsymbol\mu)\right)$$
This can be viewed as fixing $\mathbf x_b$ to the observed value and normalizing the resulting expression to obtain a valid probability distribution over $\mathbf x_a$. Therefore, we can first ignore $p(\mathbf x_b)$, focus on the quadratic form in the exponent of the Gaussian distribution $p(\mathbf x_a,\mathbf x_b)$, and then reinstate the normalization coefficient at the end of the calculation.

If we make use of the partitioning (BI.1), (BI.2) and (BI.5), we obtain
$$\begin{aligned}
-\frac{1}{2}(\mathbf x-\boldsymbol\mu)^{\mathrm T}\boldsymbol\Sigma^{-1}(\mathbf x-\boldsymbol\mu)=&-\frac{1}{2}(\mathbf x_a-\boldsymbol\mu_a)^{\mathrm T}\boldsymbol\Lambda_{aa}(\mathbf x_a-\boldsymbol\mu_a)-\frac{1}{2}(\mathbf x_a-\boldsymbol\mu_a)^{\mathrm T}\boldsymbol\Lambda_{ab}(\mathbf x_b-\boldsymbol\mu_b)\\
&-\frac{1}{2}(\mathbf x_b-\boldsymbol\mu_b)^{\mathrm T}\boldsymbol\Lambda_{ba}(\mathbf x_a-\boldsymbol\mu_a)-\frac{1}{2}(\mathbf x_b-\boldsymbol\mu_b)^{\mathrm T}\boldsymbol\Lambda_{bb}(\mathbf x_b-\boldsymbol\mu_b)
\end{aligned}\tag{BI.6}$$
We see that as a function of $\mathbf x_a$, this is again a quadratic form, and hence the corresponding conditional distribution $p(\mathbf x_a\mid\mathbf x_b)$ will be Gaussian.

Because this distribution is completely characterized by its mean and its covariance, our goal will be to identify expressions for the mean and covariance of $p(\mathbf x_a\mid\mathbf x_b)$ by inspection of (BI.6). Such problems can be solved straightforwardly by noting that the exponent in a general Gaussian distribution $\mathcal N(\mathbf x\mid\boldsymbol\mu,\boldsymbol\Sigma)$ can be written
$$-\frac{1}{2}(\mathbf x-\boldsymbol\mu)^{\mathrm T}\boldsymbol\Sigma^{-1}(\mathbf x-\boldsymbol\mu)=-\frac{1}{2}\mathbf x^{\mathrm T}\boldsymbol\Sigma^{-1}\mathbf x+\mathbf x^{\mathrm T}\boldsymbol\Sigma^{-1}\boldsymbol\mu+\mathrm{const}\tag{BI.7}$$
Thus if we take our general quadratic form and express it in the form given by the right-hand side of (BI.7), then we can immediately equate the matrix of coefficients entering the second-order term in $\mathbf x$ to the inverse covariance matrix $\boldsymbol\Sigma^{-1}$ and the coefficient of the linear term in $\mathbf x$ to $\boldsymbol\Sigma^{-1}\boldsymbol\mu$, from which we can obtain $\boldsymbol\mu$.

Now let us apply this procedure to (BI.6). Denote the mean and covariance of this distribution by $\boldsymbol\mu_{a|b}$ and $\boldsymbol\Sigma_{a|b}$, respectively. If we pick out all terms that are second order in $\mathbf x_a$, we have
$$-\frac{1}{2}\mathbf x_a^{\mathrm T}\boldsymbol\Lambda_{aa}\mathbf x_a$$
from which we can immediately conclude that the covariance (inverse precision) of $p(\mathbf x_a\mid\mathbf x_b)$ is given by
$$\boldsymbol\Sigma_{a|b}=\boldsymbol\Lambda_{aa}^{-1}\tag{BI.8}$$
Now consider all of the terms in (BI.6) that are linear in $\mathbf x_a$:
$$\mathbf x_a^{\mathrm T}\left\{\boldsymbol\Lambda_{aa}\boldsymbol\mu_a-\boldsymbol\Lambda_{ab}(\mathbf x_b-\boldsymbol\mu_b)\right\}$$
where we have used $\boldsymbol\Lambda_{ba}^{\mathrm T}=\boldsymbol\Lambda_{ab}$. From our discussion of the general form (BI.7), the coefficient of $\mathbf x_a$ in this expression must equal $\boldsymbol\Sigma_{a|b}^{-1}\boldsymbol\mu_{a|b}$ and hence
$$\boldsymbol\mu_{a|b}=\boldsymbol\Sigma_{a|b}\left\{\boldsymbol\Lambda_{aa}\boldsymbol\mu_a-\boldsymbol\Lambda_{ab}(\mathbf x_b-\boldsymbol\mu_b)\right\}\overset{(\mathrm{BI.8})}{=}\boldsymbol\mu_a-\boldsymbol\Lambda_{aa}^{-1}\boldsymbol\Lambda_{ab}(\mathbf x_b-\boldsymbol\mu_b)\tag{BI.9}$$
We can also express these results in terms of the corresponding partitioned covariance matrix. To do this, we make use of the following identity for the inverse of a partitioned matrix
$$\begin{pmatrix}\mathbf A & \mathbf B\\ \mathbf C & \mathbf D\end{pmatrix}^{-1}=\begin{pmatrix}\mathbf M & -\mathbf M\mathbf B\mathbf D^{-1}\\ -\mathbf D^{-1}\mathbf C\mathbf M & \mathbf D^{-1}+\mathbf D^{-1}\mathbf C\mathbf M\mathbf B\mathbf D^{-1}\end{pmatrix}\tag{BI.10}$$
where we have defined
$$\mathbf M=\left(\mathbf A-\mathbf B\mathbf D^{-1}\mathbf C\right)^{-1}\tag{BI.11}$$
The quantity $\mathbf M^{-1}$ is known as the Schur complement of the matrix on the left-hand side of (BI.10) with respect to the submatrix $\mathbf D$. Using the definition
$$\begin{pmatrix}\boldsymbol\Sigma_{aa} & \boldsymbol\Sigma_{ab}\\ \boldsymbol\Sigma_{ba} & \boldsymbol\Sigma_{bb}\end{pmatrix}^{-1}=\begin{pmatrix}\boldsymbol\Lambda_{aa} & \boldsymbol\Lambda_{ab}\\ \boldsymbol\Lambda_{ba} & \boldsymbol\Lambda_{bb}\end{pmatrix}$$
and making use of (BI.10), we have
$$\begin{aligned}\boldsymbol\Lambda_{aa}&=\left(\boldsymbol\Sigma_{aa}-\boldsymbol\Sigma_{ab}\boldsymbol\Sigma_{bb}^{-1}\boldsymbol\Sigma_{ba}\right)^{-1}\\ \boldsymbol\Lambda_{ab}&=-\left(\boldsymbol\Sigma_{aa}-\boldsymbol\Sigma_{ab}\boldsymbol\Sigma_{bb}^{-1}\boldsymbol\Sigma_{ba}\right)^{-1}\boldsymbol\Sigma_{ab}\boldsymbol\Sigma_{bb}^{-1}\end{aligned}$$
From this we obtain the following expressions for the mean and covariance of the conditional distribution $p(\mathbf x_a\mid\mathbf x_b)$:
$$\boldsymbol\mu_{a|b}=\boldsymbol\mu_a+\boldsymbol\Sigma_{ab}\boldsymbol\Sigma_{bb}^{-1}(\mathbf x_b-\boldsymbol\mu_b)\tag{BI.12}$$

$$\boldsymbol\Sigma_{a|b}=\boldsymbol\Sigma_{aa}-\boldsymbol\Sigma_{ab}\boldsymbol\Sigma_{bb}^{-1}\boldsymbol\Sigma_{ba}\tag{BI.13}$$
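To make (BI.8)-(BI.13) concrete, here is a minimal NumPy sketch (not part of the original text) that builds a random partitioned Gaussian and checks that the precision-form and covariance-form expressions for the conditional mean and covariance agree; all dimensions, seeds, and variable names are illustrative.

```python
import numpy as np

# Illustrative check of (BI.8)-(BI.9) against (BI.12)-(BI.13) on a random Gaussian.
rng = np.random.default_rng(0)
D, M = 5, 2                             # total dimension and size of the "a" block

A_ = rng.standard_normal((D, D))
Sigma = A_ @ A_.T + D * np.eye(D)       # random symmetric positive-definite covariance
mu = rng.standard_normal(D)
x_b = rng.standard_normal(D - M)        # an observed value for x_b

# Partition mu and Sigma as in (BI.2)-(BI.3)
mu_a, mu_b = mu[:M], mu[M:]
S_aa, S_ab = Sigma[:M, :M], Sigma[:M, M:]
S_ba, S_bb = Sigma[M:, :M], Sigma[M:, M:]

# Covariance form (BI.12)-(BI.13)
mu_cond = mu_a + S_ab @ np.linalg.solve(S_bb, x_b - mu_b)
Sigma_cond = S_aa - S_ab @ np.linalg.solve(S_bb, S_ba)

# Precision form (BI.8)-(BI.9)
Lam = np.linalg.inv(Sigma)
L_aa, L_ab = Lam[:M, :M], Lam[:M, M:]
mu_cond_prec = mu_a - np.linalg.solve(L_aa, L_ab @ (x_b - mu_b))
Sigma_cond_prec = np.linalg.inv(L_aa)

print(np.allclose(mu_cond, mu_cond_prec))        # True
print(np.allclose(Sigma_cond, Sigma_cond_prec))  # True
```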

Marginal Gaussian distributions

Now we turn to a discussion of the marginal distribution given by
$$p(\mathbf x_a)=\int p(\mathbf x_a,\mathbf x_b)\,d\mathbf x_b\tag{BI.14}$$
Once again, our strategy for evaluating this distribution efficiently will be to focus on the quadratic form in the exponent of the joint distribution $p(\mathbf x_a,\mathbf x_b)$ and thereby to identify the mean and covariance of the marginal distribution $p(\mathbf x_a)$.

The quadratic form for the joint distribution can be expressed, using the partitioned precision matrix, in the form (BI.6). Because our goal is to integrate out $\mathbf x_b$, this is most easily achieved by first considering the terms involving $\mathbf x_b$ and then completing the square in order to facilitate integration. Picking out just those terms that involve $\mathbf x_b$, we have
$$-\frac{1}{2}\mathbf x_b^{\mathrm T}\boldsymbol\Lambda_{bb}\mathbf x_b+\mathbf x_b^{\mathrm T}\mathbf m=-\frac{1}{2}\left(\mathbf x_b-\boldsymbol\Lambda_{bb}^{-1}\mathbf m\right)^{\mathrm T}\boldsymbol\Lambda_{bb}\left(\mathbf x_b-\boldsymbol\Lambda_{bb}^{-1}\mathbf m\right)+\frac{1}{2}\mathbf m^{\mathrm T}\boldsymbol\Lambda_{bb}^{-1}\mathbf m\tag{BI.15}$$
where we have defined
$$\mathbf m=\boldsymbol\Lambda_{bb}\boldsymbol\mu_b-\boldsymbol\Lambda_{ba}(\mathbf x_a-\boldsymbol\mu_a)\tag{BI.16}$$
We see that the dependence on $\mathbf x_b$ has been cast into the standard quadratic form of a Gaussian distribution corresponding to the first term on the right-hand side of (BI.15), plus a term that does not depend on $\mathbf x_b$ (but that does depend on $\mathbf x_a$). Thus, when we take the exponential of this quadratic form, we see that the integration over $\mathbf x_b$ required by (BI.14) will take the form
$$\begin{aligned}
&\int\exp\left\{-\frac{1}{2}\left(\mathbf x_b-\boldsymbol\Lambda_{bb}^{-1}\mathbf m\right)^{\mathrm T}\boldsymbol\Lambda_{bb}\left(\mathbf x_b-\boldsymbol\Lambda_{bb}^{-1}\mathbf m\right)\right\}\mathrm d\mathbf x_b\\
=&\sqrt{(2\pi)^{D-M}\left|\boldsymbol\Lambda_{bb}^{-1}\right|}\int\frac{1}{\sqrt{(2\pi)^{D-M}\left|\boldsymbol\Lambda_{bb}^{-1}\right|}}\exp\left\{-\frac{1}{2}\left(\mathbf x_b-\boldsymbol\Lambda_{bb}^{-1}\mathbf m\right)^{\mathrm T}\boldsymbol\Lambda_{bb}\left(\mathbf x_b-\boldsymbol\Lambda_{bb}^{-1}\mathbf m\right)\right\}\mathrm d\mathbf x_b\\
=&\sqrt{(2\pi)^{D-M}\left|\boldsymbol\Lambda_{bb}^{-1}\right|}
\end{aligned}\tag{BI.17}$$
This integration is easily performed by noting that it is the integral over an unnormalized Gaussian, so the result is the reciprocal of the corresponding normalization coefficient. Thus, by completing the square with respect to $\mathbf x_b$, we can integrate out $\mathbf x_b$, and the only term remaining from the contributions on the left-hand side of (BI.15) that depends on $\mathbf x_a$ is $\frac{1}{2}\mathbf m^{\mathrm T}\boldsymbol\Lambda_{bb}^{-1}\mathbf m$, in which $\mathbf m$ is given by (BI.16). Combining this term with the remaining terms from (BI.6) that depend on $\mathbf x_a$, we obtain
$$\begin{aligned}
&\frac{1}{2}\left[\boldsymbol\Lambda_{bb}\boldsymbol\mu_b-\boldsymbol\Lambda_{ba}(\mathbf x_a-\boldsymbol\mu_a)\right]^{\mathrm T}\boldsymbol\Lambda_{bb}^{-1}\left[\boldsymbol\Lambda_{bb}\boldsymbol\mu_b-\boldsymbol\Lambda_{ba}(\mathbf x_a-\boldsymbol\mu_a)\right]\\
&\quad-\frac{1}{2}\mathbf x_a^{\mathrm T}\boldsymbol\Lambda_{aa}\mathbf x_a+\mathbf x_a^{\mathrm T}\left(\boldsymbol\Lambda_{aa}\boldsymbol\mu_a+\boldsymbol\Lambda_{ab}\boldsymbol\mu_b\right)+\mathrm{const}\\
=&-\frac{1}{2}\mathbf x_a^{\mathrm T}\left(\boldsymbol\Lambda_{aa}-\boldsymbol\Lambda_{ab}\boldsymbol\Lambda_{bb}^{-1}\boldsymbol\Lambda_{ba}\right)\mathbf x_a+\mathbf x_a^{\mathrm T}\left(\boldsymbol\Lambda_{aa}-\boldsymbol\Lambda_{ab}\boldsymbol\Lambda_{bb}^{-1}\boldsymbol\Lambda_{ba}\right)\boldsymbol\mu_a+\mathrm{const}
\end{aligned}\tag{BI.18}$$
Again, by comparison with (BI.7), we see that the covariance of the marginal distribution $p(\mathbf x_a)$ is given by
$$\boldsymbol\Sigma_a=\left(\boldsymbol\Lambda_{aa}-\boldsymbol\Lambda_{ab}\boldsymbol\Lambda_{bb}^{-1}\boldsymbol\Lambda_{ba}\right)^{-1}\tag{BI.19}$$
Similarly, the mean is given by
$$\boldsymbol\Sigma_a\left(\boldsymbol\Lambda_{aa}-\boldsymbol\Lambda_{ab}\boldsymbol\Lambda_{bb}^{-1}\boldsymbol\Lambda_{ba}\right)\boldsymbol\mu_a\overset{(\mathrm{BI.19})}{=}\boldsymbol\mu_a\tag{BI.20}$$
The covariance in (BI.19) is expressed in terms of the partitioned precision matrix. We can rewrite this in terms of the corresponding partitioning of the covariance matrix, as we did for the conditional distribution. Making use of (BI.10), we then have
$$\left(\boldsymbol\Lambda_{aa}-\boldsymbol\Lambda_{ab}\boldsymbol\Lambda_{bb}^{-1}\boldsymbol\Lambda_{ba}\right)^{-1}=\boldsymbol\Sigma_{aa}\tag{BI.21}$$
Thus we obtain the intuitively satisfying result that the marginal distribution $p(\mathbf x_a)$ has mean and covariance given by
$$\mathbb E[\mathbf x_a]=\boldsymbol\mu_a\tag{BI.22}$$
$$\operatorname{cov}[\mathbf x_a]=\boldsymbol\Sigma_{aa}\tag{BI.23}$$

We see that for a marginal distribution, the mean and covariance are most simply expressed in terms of the partitioned covariance matrix, in contrast to the conditional distribution for which the partitioned precision matrix gives rise to simpler expressions.

Partitioned Gaussians

We summarize the result for the marginal and conditional distributions of a partitioned Gaussian here.

Theorem: Partitioned Gaussians

Given a joint Gaussian distribution $\mathcal N(\mathbf x\mid\boldsymbol\mu,\boldsymbol\Sigma)$ with $\boldsymbol\Lambda\equiv\boldsymbol\Sigma^{-1}$ and
$$\mathbf x=\begin{pmatrix}\mathbf x_a\\\mathbf x_b\end{pmatrix},\quad\boldsymbol\mu=\begin{pmatrix}\boldsymbol\mu_a\\\boldsymbol\mu_b\end{pmatrix}$$

$$\boldsymbol\Sigma=\begin{pmatrix}\boldsymbol\Sigma_{aa} & \boldsymbol\Sigma_{ab}\\ \boldsymbol\Sigma_{ba} & \boldsymbol\Sigma_{bb}\end{pmatrix},\quad\boldsymbol\Lambda=\begin{pmatrix}\boldsymbol\Lambda_{aa} & \boldsymbol\Lambda_{ab}\\ \boldsymbol\Lambda_{ba} & \boldsymbol\Lambda_{bb}\end{pmatrix}$$

Conditional distribution:
$$\begin{aligned}
p(\mathbf x_a\mid\mathbf x_b)&=\mathcal N(\mathbf x_a\mid\boldsymbol\mu_{a|b},\boldsymbol\Sigma_{a|b})\\
\boldsymbol\mu_{a|b}&=\boldsymbol\mu_a-\boldsymbol\Lambda_{aa}^{-1}\boldsymbol\Lambda_{ab}(\mathbf x_b-\boldsymbol\mu_b)=\boldsymbol\mu_a+\boldsymbol\Sigma_{ab}\boldsymbol\Sigma_{bb}^{-1}(\mathbf x_b-\boldsymbol\mu_b)\\
\boldsymbol\Sigma_{a|b}&=\boldsymbol\Lambda_{aa}^{-1}=\boldsymbol\Sigma_{aa}-\boldsymbol\Sigma_{ab}\boldsymbol\Sigma_{bb}^{-1}\boldsymbol\Sigma_{ba}
\end{aligned}$$
Marginal distribution:
$$p(\mathbf x_a)=\mathcal N(\mathbf x_a\mid\boldsymbol\mu_a,\boldsymbol\Sigma_{aa})$$
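As a quick sanity check of the marginal result (again an illustrative sketch, not from the text), one can sample from the joint Gaussian and compare the empirical mean and covariance of the $\mathbf x_a$ block with $\boldsymbol\mu_a$ and $\boldsymbol\Sigma_{aa}$:

```python
import numpy as np

# Monte Carlo check of p(x_a) = N(x_a | mu_a, Sigma_aa); dimensions and seed are arbitrary.
rng = np.random.default_rng(1)
D, M = 4, 2

B = rng.standard_normal((D, D))
Sigma = B @ B.T + D * np.eye(D)
mu = rng.standard_normal(D)

samples = rng.multivariate_normal(mu, Sigma, size=200_000)
xa = samples[:, :M]                      # keep only the x_a block, i.e. marginalize x_b

print(np.abs(xa.mean(axis=0) - mu[:M]).max())       # should be close to 0
print(np.abs(np.cov(xa.T) - Sigma[:M, :M]).max())   # should be close to 0
```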

Bayes’ theorem for Gaussian variables

In [Conditional Gaussian distributions](#Conditional Gaussian distributions) and [Marginal Gaussian distributions](#Marginal Gaussian distributions), we considered a Gaussian $p(\mathbf x)$ in which we partitioned the vector $\mathbf x$ into two subvectors $\mathbf x=(\mathbf x_a,\mathbf x_b)$ and then found expressions for the conditional distribution $p(\mathbf x_a\mid\mathbf x_b)$ and the marginal distribution $p(\mathbf x_a)$.

Here we shall suppose that we are given a Gaussian marginal distribution $p(\mathbf x)$ and a Gaussian conditional distribution $p(\mathbf y\mid\mathbf x)$ in which $p(\mathbf y\mid\mathbf x)$ has a mean that is a linear function of $\mathbf x$ and a covariance that is independent of $\mathbf x$, i.e.,
$$p(\mathbf x)=\mathcal N(\mathbf x\mid\boldsymbol\mu,\boldsymbol\Lambda^{-1})\tag{BI.24}$$

$$p(\mathbf y\mid\mathbf x)=\mathcal N(\mathbf y\mid\mathbf A\mathbf x+\mathbf b,\mathbf L^{-1})\tag{BI.25}$$

where $\boldsymbol\Lambda$ and $\mathbf L$ are precision matrices.

We wish to find the marginal distribution $p(\mathbf y)$ and the conditional distribution $p(\mathbf x\mid\mathbf y)$. For that, we will first find the joint distribution for $\mathbf x$ and $\mathbf y$, and then use the conclusions in [Partitioned Gaussians](#Partitioned Gaussians) to obtain $p(\mathbf y)$ and $p(\mathbf x\mid\mathbf y)$.

Define
$$\mathbf z=\begin{pmatrix}\mathbf x\\\mathbf y\end{pmatrix}\tag{BI.26}$$
and then consider the log of the joint distribution
$$\begin{aligned}
\ln p(\mathbf z)&=\ln p(\mathbf x)+\ln p(\mathbf y\mid\mathbf x)\\
&=-\frac{1}{2}(\mathbf x-\boldsymbol\mu)^{\mathrm T}\boldsymbol\Lambda(\mathbf x-\boldsymbol\mu)-\frac{1}{2}(\mathbf y-\mathbf A\mathbf x-\mathbf b)^{\mathrm T}\mathbf L(\mathbf y-\mathbf A\mathbf x-\mathbf b)+\mathrm{const}
\end{aligned}\tag{BI.27}$$
As before, we see that this is a quadratic function of the components of $\mathbf z$, and hence $p(\mathbf z)$ is a Gaussian distribution. To find the precision of this Gaussian, we consider the second-order terms in (BI.27), which can be written as
$$\begin{aligned}
&-\frac{1}{2}\mathbf x^{\mathrm T}\left(\boldsymbol\Lambda+\mathbf A^{\mathrm T}\mathbf L\mathbf A\right)\mathbf x-\frac{1}{2}\mathbf y^{\mathrm T}\mathbf L\mathbf y+\frac{1}{2}\mathbf y^{\mathrm T}\mathbf L\mathbf A\mathbf x+\frac{1}{2}\mathbf x^{\mathrm T}\mathbf A^{\mathrm T}\mathbf L\mathbf y\\
&=-\frac{1}{2}\begin{pmatrix}\mathbf x\\\mathbf y\end{pmatrix}^{\mathrm T}\begin{pmatrix}\boldsymbol\Lambda+\mathbf A^{\mathrm T}\mathbf L\mathbf A & -\mathbf A^{\mathrm T}\mathbf L\\ -\mathbf L\mathbf A & \mathbf L\end{pmatrix}\begin{pmatrix}\mathbf x\\\mathbf y\end{pmatrix}=-\frac{1}{2}\mathbf z^{\mathrm T}\mathbf R\mathbf z
\end{aligned}\tag{BI.28}$$
and so the Gaussian distribution over $\mathbf z$ has precision (inverse covariance) matrix given by
$$\mathbf R=\begin{pmatrix}\boldsymbol\Lambda+\mathbf A^{\mathrm T}\mathbf L\mathbf A & -\mathbf A^{\mathrm T}\mathbf L\\ -\mathbf L\mathbf A & \mathbf L\end{pmatrix}\tag{BI.29}$$
The covariance matrix is found by taking the inverse of the precision, which can be done using the matrix inversion formula (BI.10), to give
$$\operatorname{cov}[\mathbf z]=\mathbf R^{-1}=\begin{pmatrix}\boldsymbol\Lambda^{-1} & \boldsymbol\Lambda^{-1}\mathbf A^{\mathrm T}\\ \mathbf A\boldsymbol\Lambda^{-1} & \mathbf L^{-1}+\mathbf A\boldsymbol\Lambda^{-1}\mathbf A^{\mathrm T}\end{pmatrix}\tag{BI.30}$$
Similarly, we can find the mean of the Gaussian distribution over $\mathbf z$ by identifying the linear terms in (BI.27), which are given by
$$\mathbf x^{\mathrm T}\boldsymbol\Lambda\boldsymbol\mu-\mathbf x^{\mathrm T}\mathbf A^{\mathrm T}\mathbf L\mathbf b+\mathbf y^{\mathrm T}\mathbf L\mathbf b=\begin{pmatrix}\mathbf x\\\mathbf y\end{pmatrix}^{\mathrm T}\begin{pmatrix}\boldsymbol\Lambda\boldsymbol\mu-\mathbf A^{\mathrm T}\mathbf L\mathbf b\\\mathbf L\mathbf b\end{pmatrix}\tag{BI.31}$$
Using our earlier result (BI.7), obtained by completing the square over the quadratic form of a multivariate Gaussian, we find that the mean of $\mathbf z$ is given by
$$\mathbb E[\mathbf z]=\mathbf R^{-1}\begin{pmatrix}\boldsymbol\Lambda\boldsymbol\mu-\mathbf A^{\mathrm T}\mathbf L\mathbf b\\\mathbf L\mathbf b\end{pmatrix}\overset{(\mathrm{BI.30})}{=}\begin{pmatrix}\boldsymbol\mu\\\mathbf A\boldsymbol\mu+\mathbf b\end{pmatrix}\tag{BI.32}$$
Now we can obtain the mean and covariance of the marginal distribution $p(\mathbf y)$ using the conclusions in [Partitioned Gaussians](#Partitioned Gaussians):
$$\mathbb E[\mathbf y]=\mathbf A\boldsymbol\mu+\mathbf b\tag{BI.33}$$
$$\operatorname{cov}[\mathbf y]=\mathbf L^{-1}+\mathbf A\boldsymbol\Lambda^{-1}\mathbf A^{\mathrm T}\tag{BI.34}$$

A special case of this result is when $\mathbf A=\mathbf I$, in which case it reduces to the convolution of two Gaussians, for which we see that the mean of the convolution is the sum of the means of the two Gaussians, and the covariance of the convolution is the sum of their covariances.

The conditional distribution $p(\mathbf x\mid\mathbf y)$ has mean and covariance given by
$$\mathbb E[\mathbf x\mid\mathbf y]=\left(\boldsymbol\Lambda+\mathbf A^{\mathrm T}\mathbf L\mathbf A\right)^{-1}\left\{\mathbf A^{\mathrm T}\mathbf L(\mathbf y-\mathbf b)+\boldsymbol\Lambda\boldsymbol\mu\right\}\tag{BI.35}$$
$$\operatorname{cov}[\mathbf x\mid\mathbf y]=\left(\boldsymbol\Lambda+\mathbf A^{\mathrm T}\mathbf L\mathbf A\right)^{-1}\tag{BI.36}$$

Having found the marginal and conditional distributions, we have effectively expressed the joint distribution $p(\mathbf z)=p(\mathbf x)p(\mathbf y\mid\mathbf x)$ in the form $p(\mathbf x\mid\mathbf y)p(\mathbf y)$. These results are summarized below.

Theorem: Marginal and Conditional Gaussians

Given a marginal Gaussian distribution for $\mathbf x$ and a conditional Gaussian distribution for $\mathbf y$ given $\mathbf x$ in the form
$$\begin{aligned}p(\mathbf x)&=\mathcal N(\mathbf x\mid\boldsymbol\mu,\boldsymbol\Lambda^{-1})\\ p(\mathbf y\mid\mathbf x)&=\mathcal N(\mathbf y\mid\mathbf A\mathbf x+\mathbf b,\mathbf L^{-1})\end{aligned}$$
the marginal distribution of $\mathbf y$ and the conditional distribution of $\mathbf x$ given $\mathbf y$ are given by
$$\begin{aligned}p(\mathbf y)&=\mathcal N(\mathbf y\mid\mathbf A\boldsymbol\mu+\mathbf b,\mathbf L^{-1}+\mathbf A\boldsymbol\Lambda^{-1}\mathbf A^{\mathrm T})\\ p(\mathbf x\mid\mathbf y)&=\mathcal N\!\left(\mathbf x\mid\boldsymbol\Sigma\left\{\mathbf A^{\mathrm T}\mathbf L(\mathbf y-\mathbf b)+\boldsymbol\Lambda\boldsymbol\mu\right\},\boldsymbol\Sigma\right)\end{aligned}$$
where
$$\boldsymbol\Sigma=\left(\boldsymbol\Lambda+\mathbf A^{\mathrm T}\mathbf L\mathbf A\right)^{-1}$$
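The following sketch (illustrative values only, not from the text) evaluates the theorem numerically: it constructs a linear-Gaussian pair $p(\mathbf x)$, $p(\mathbf y\mid\mathbf x)$, computes the closed-form moments of $p(\mathbf y)$ and $p(\mathbf x\mid\mathbf y)$, and checks the marginal against Monte Carlo samples.

```python
import numpy as np

# Linear-Gaussian model: p(x) = N(mu, Lambda^-1), p(y|x) = N(Ax + b, L^-1).
rng = np.random.default_rng(2)
dx, dy = 3, 2

Lam = np.eye(dx) * 2.0                  # precision of p(x)
L = np.eye(dy) * 4.0                    # precision of p(y|x)
A = rng.standard_normal((dy, dx))
b = rng.standard_normal(dy)
mu = rng.standard_normal(dx)

# Closed form: p(y) = N(A mu + b, L^-1 + A Lambda^-1 A^T)   (BI.33)-(BI.34)
y_mean = A @ mu + b
y_cov = np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T

# Closed form: p(x|y) = N(Sigma {A^T L (y - b) + Lambda mu}, Sigma)   (BI.35)-(BI.36)
Sigma = np.linalg.inv(Lam + A.T @ L @ A)
y_obs = np.zeros(dy)                    # an arbitrary observed y
x_post_mean = Sigma @ (A.T @ L @ (y_obs - b) + Lam @ mu)

# Monte Carlo check of p(y): sample x ~ p(x), then y ~ p(y|x)
n = 200_000
xs = rng.multivariate_normal(mu, np.linalg.inv(Lam), size=n)
noise = rng.multivariate_normal(np.zeros(dy), np.linalg.inv(L), size=n)
ys = xs @ A.T + b + noise
print(np.abs(ys.mean(axis=0) - y_mean).max())   # should be close to 0
print(np.abs(np.cov(ys.T) - y_cov).max())       # should be close to 0
```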

Bayesian Linear Regression (3.3.1-3.3.2)

Consider a data set of inputs $\mathbf X=\{\mathbf x_1,\dots,\mathbf x_N\}$ with corresponding target values $\mathbf t=[t_1,\dots,t_N]^{\mathrm T}$. The likelihood function, which is a function of the adjustable parameters $\mathbf w$ and $\beta$, has the form
$$p(\mathbf t\mid\mathbf X,\mathbf w,\beta)=\prod_{n=1}^N\mathcal N\!\left(t_n\mid\mathbf w^{\mathrm T}\boldsymbol\phi(\mathbf x_n),\beta^{-1}\right)=\mathcal N\!\left(\mathbf t\mid\boldsymbol\Phi\mathbf w,\beta^{-1}\mathbf I\right)\tag{BI.37}$$
where $\boldsymbol\Phi=[\boldsymbol\phi(\mathbf x_1)\cdots\boldsymbol\phi(\mathbf x_N)]^{\mathrm T}$ is the design matrix.

We begin our discussion of the Bayesian treatment of linear regression by introducing a prior probability distribution over the model parameters $\mathbf w$. For the moment, we shall treat the noise precision parameter $\beta$ as a known constant.

Parameter distribution

To simplify the notation, we denote the likelihood function $p(\mathbf t\mid\mathbf X,\mathbf w,\beta)$ defined in (BI.37) by $p(\mathbf t\mid\mathbf w)$. Note that $p(\mathbf t\mid\mathbf w)$ is the exponential of a quadratic function of $\mathbf w$; the corresponding conjugate prior is therefore a Gaussian distribution of the form
$$p(\mathbf w)=\mathcal N(\mathbf w\mid\mathbf m_0,\mathbf S_0)\tag{BI.38}$$
having mean $\mathbf m_0$ and covariance $\mathbf S_0$.

Next we compute the posterior distribution, which is proportional to the product of the likelihood function and the prior. Due to the choice of a conjugate Gaussian prior distribution, the posterior will also be Gaussian. Applying the conclusions in [Bayes’ theorem for Gaussian variables](#Bayes’ theorem for Gaussian variables) with $\mathbf x=\mathbf w$, $\mathbf y=\mathbf t$, $\mathbf A=\boldsymbol\Phi$, $\mathbf b=\mathbf 0$, $\mathbf L=\beta\mathbf I$, $\boldsymbol\mu=\mathbf m_0$, and $\boldsymbol\Lambda=\mathbf S_0^{-1}$, we have
$$p(\mathbf w\mid\mathbf t)=\mathcal N(\mathbf w\mid\mathbf m_N,\mathbf S_N)\tag{BI.39}$$

$$\mathbf m_N=\mathbf S_N\left(\mathbf S_0^{-1}\mathbf m_0+\beta\boldsymbol\Phi^{\mathrm T}\mathbf t\right)\tag{BI.40}$$

$$\mathbf S_N^{-1}=\mathbf S_0^{-1}+\beta\boldsymbol\Phi^{\mathrm T}\boldsymbol\Phi\tag{BI.41}$$

Note that because the posterior distribution is Gaussian, its mode coincides with its mean. Thus the maximum posterior weight vector is simply given by $\mathbf w_{\mathrm{MAP}}=\mathbf m_N$.

We shall consider a particular form of Gaussian prior in order to simplify the treatment. Specifically, we consider a zero-mean isotropic Gaussian governed by a single precision parameter $\alpha$, so that
$$p(\mathbf w\mid\alpha)=\mathcal N(\mathbf w\mid\mathbf 0,\alpha^{-1}\mathbf I)\tag{BI.42}$$
and the corresponding posterior distribution over $\mathbf w$ is then given by (BI.39) with
$$\mathbf m_N=\beta\mathbf S_N\boldsymbol\Phi^{\mathrm T}\mathbf t\tag{BI.43}$$

$$\mathbf S_N^{-1}=\alpha\mathbf I+\beta\boldsymbol\Phi^{\mathrm T}\boldsymbol\Phi\tag{BI.44}$$
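A minimal sketch of (BI.42)-(BI.44) on synthetic sinusoidal data, assuming Gaussian basis functions plus a bias term; the values of $\alpha$, $\beta$, and the basis centres are arbitrary illustrative choices, not taken from the book.

```python
import numpy as np

# Posterior over w for a zero-mean isotropic prior, with Gaussian basis functions.
rng = np.random.default_rng(3)
alpha, beta = 2.0, 25.0                       # prior precision, noise precision (assumed known)

def phi(x, centres=np.linspace(0, 1, 9), s=0.1):
    """Gaussian basis functions plus a bias term; returns the design matrix."""
    x = np.atleast_1d(x)
    basis = np.exp(-0.5 * (x[:, None] - centres[None, :]) ** 2 / s ** 2)
    return np.hstack([np.ones((len(x), 1)), basis])

# Synthetic data: t = sin(2 pi x) + Gaussian noise
N = 25
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 1 / np.sqrt(beta), N)

Phi = phi(x)                                                  # N x M design matrix
S_N_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi   # (BI.44)
S_N = np.linalg.inv(S_N_inv)
m_N = beta * S_N @ Phi.T @ t                                  # (BI.43), also the MAP weights
```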

Predictive distribution

In practice, we are not usually interested in the value of $\mathbf w$ itself but rather in making predictions of $t$ for new values of $\mathbf x$. This requires that we evaluate the predictive distribution defined by
$$p(t\mid\mathbf x,\mathbf t,\alpha,\beta)=\int p(t\mid\mathbf x,\mathbf w,\beta)\,p(\mathbf w\mid\mathbf x,\mathbf t,\alpha,\beta)\,d\mathbf w\tag{BI.45}$$
where the conditional distribution $p(t\mid\mathbf x,\mathbf w,\beta)$ is given by
$$p(t\mid\mathbf x,\mathbf w,\beta)=\mathcal N\!\left(t\mid\mathbf w^{\mathrm T}\boldsymbol\phi(\mathbf x),\beta^{-1}\right)\tag{BI.46}$$
and $p(\mathbf w\mid\mathbf x,\mathbf t,\alpha,\beta)$ is given by (BI.39), (BI.43), and (BI.44).

Similarly, using the conclusions in [Bayes’ theorem for Gaussian variables](#Bayes’ theorem for Gaussian variables), viewing $p(\mathbf w\mid\mathbf x,\mathbf t,\alpha,\beta)$ as the prior and $p(t\mid\mathbf x,\mathbf w,\beta)$ as the likelihood, we obtain the marginal distribution $p(t\mid\mathbf x,\mathbf t,\alpha,\beta)$ from (BI.33) and (BI.34):
$$p(t\mid\mathbf x,\mathbf t,\alpha,\beta)=\mathcal N\!\left(t\mid\mathbf m_N^{\mathrm T}\boldsymbol\phi(\mathbf x),\sigma_N^2(\mathbf x)\right)\tag{BI.47}$$
where the variance $\sigma_N^2(\mathbf x)$ of the predictive distribution is given by
$$\sigma_N^2(\mathbf x)=\frac{1}{\beta}+\boldsymbol\phi(\mathbf x)^{\mathrm T}\mathbf S_N\boldsymbol\phi(\mathbf x)\tag{BI.48}$$
Note: to see this better, simplify the notation of (BI.45) as
$$p(t\mid\mathbf t)=\int p(t,\mathbf w\mid\mathbf t)\,d\mathbf w=\int p(t\mid\mathbf w,\mathbf t)\,p(\mathbf w\mid\mathbf t)\,d\mathbf w$$
The first term in (BI.48) represents the noise on the data, whereas the second term reflects the uncertainty associated with the parameters $\mathbf w$. Because the noise process and the distribution of $\mathbf w$ are independent Gaussians, their variances are additive.
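Continuing the sketch from the parameter-distribution section (it reuses `phi`, `m_N`, `S_N`, and `beta` defined there), the predictive mean and variance of (BI.47)-(BI.48) can be evaluated on a grid of new inputs as follows.

```python
import numpy as np

# Predictive mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x).
def predictive(x_new, phi, m_N, S_N, beta):
    Phi_new = phi(x_new)                              # shape (n_points, M)
    mean = Phi_new @ m_N
    var = 1.0 / beta + np.einsum('ij,jk,ik->i', Phi_new, S_N, Phi_new)
    return mean, var

x_grid = np.linspace(0, 1, 100)
mean, var = predictive(x_grid, phi, m_N, S_N, beta)
std = np.sqrt(var)   # one-standard-deviation band, as in the red shaded region of Figure 3.8
```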

As an illustration of the predictive distribution for Bayesian linear regression models, let us discuss the synthetic sinusoidal data set example. In Figure 3.8, we fit a model comprising a linear combination of Gaussian basis functions to data sets of various sizes and then look at the corresponding posterior distributions.

Here the green curves correspond to the function $\sin(2\pi x)$ from which the data points were generated (with the addition of Gaussian noise). Data sets of size $N=1$, $N=2$, $N=4$, and $N=25$ are shown in the four plots by the blue circles. For each plot, the red curve shows the mean of the corresponding Gaussian predictive distribution, and the red shaded region spans one standard deviation either side of the mean. Note that the predictive uncertainty depends on $x$ and is smallest in the neighborhood of the data points. Also note that the level of uncertainty decreases as more data points are observed.

[Figure 3.8]

The plots in Figure 3.8 only show the point-wise predictive variance as a function of $x$. In order to gain insight into the covariance between the predictions at different values of $x$, we can draw samples from the posterior distribution over $\mathbf w$ and then plot the corresponding functions $y(x,\mathbf w)$, as shown in Figure 3.9.

[Figure 3.9]

Graphical Models (8.2.1-8.2.2)

Bayesian Networks

In order to motivate the use of directed graphs to describe probability distributions, consider first an arbitrary joint distribution $p(a,b,c)$ over three variables $a$, $b$, and $c$. We can write the joint distribution in the form
$$p(a,b,c)=p(c\mid a,b)\,p(a,b)=p(c\mid a,b)\,p(b\mid a)\,p(a)\tag{GM.1}$$
Note that at this stage, we do not need to specify anything further about these variables, such as whether they are discrete or continuous.

We now represent the right-hand side of (GM.1) in terms of a simple graphical model as follows

[Figure: directed graph for the factorization (GM.1)]

Example:

[Figure 8.2]

The joint distribution corresponding to Fig. 8.2 is given by
$$p(x_1)\,p(x_2)\,p(x_3)\,p(x_4\mid x_1,x_2,x_3)\,p(x_5\mid x_1,x_3)\,p(x_6\mid x_4)\,p(x_7\mid x_4,x_5)$$
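As a purely illustrative way of reading this factorization, the sketch below fills in random conditional probability tables for binary variables $x_1,\dots,x_7$, multiplies them according to the graph of Fig. 8.2, and checks that the resulting joint distribution is normalized; all tables are hypothetical.

```python
import itertools
import numpy as np

# Random CPTs for binary x1..x7 wired as in Fig. 8.2. Each table's last axis is the
# child variable; leading axes index the parents.
rng = np.random.default_rng(4)

def cpt(*parent_shape):
    """Random conditional probability table, normalized over the last (child) axis."""
    t = rng.random(parent_shape + (2,))
    return t / t.sum(axis=-1, keepdims=True)

p1, p2, p3 = cpt(), cpt(), cpt()          # root marginals p(x1), p(x2), p(x3)
p4 = cpt(2, 2, 2)                          # p(x4 | x1, x2, x3)
p5 = cpt(2, 2)                             # p(x5 | x1, x3)
p6 = cpt(2)                                # p(x6 | x4)
p7 = cpt(2, 2)                             # p(x7 | x4, x5)

def joint(x1, x2, x3, x4, x5, x6, x7):
    return (p1[x1] * p2[x2] * p3[x3] * p4[x1, x2, x3, x4]
            * p5[x1, x3, x5] * p6[x4, x6] * p7[x4, x5, x7])

total = sum(joint(*xs) for xs in itertools.product([0, 1], repeat=7))
print(total)   # should be 1.0 up to floating-point error
```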
An important concept for probability distributions over multiple variables is that of conditional independence. Consider three variables $a$, $b$, and $c$, and suppose that the conditional distribution of $a$, given $b$ and $c$, is such that it does not depend on the value of $b$, so that
$$p(a\mid b,c)=p(a\mid c)\tag{GM.2}$$
We say that $a$ is conditionally independent of $b$ given $c$. This can be expressed in a slightly different way if we consider the joint distribution of $a$ and $b$ conditioned on $c$, which we can write in the form
$$p(a,b\mid c)=p(a\mid b,c)\,p(b\mid c)=p(a\mid c)\,p(b\mid c)\tag{GM.3}$$
Thus we see that, conditioned on $c$, the joint distribution of $a$ and $b$ factorizes into the product of the marginal distribution of $a$ and the marginal distribution of $b$ (again both conditioned on $c$). This says that the variables $a$ and $b$ are statistically independent, given $c$.

We shall sometimes use a shorthand notation for conditional independence, in which
$$a\perp b\mid c\tag{GM.4}$$

Three Example Graphs

We begin our discussion of the conditional independence properties of directed graphs by considering three simple examples each involving graphs having just three nodes. Together, these will motivate and illustrate the key concepts of d-separation.

  • tail-to-tail

[Figure: tail-to-tail graph, c → a and c → b]

The joint distribution corresponding to this graph is
$$p(a,b,c)=p(a\mid c)\,p(b\mid c)\,p(c)\tag{GM.5}$$
If none of the variables are observed, then we can investigate whether $a$ and $b$ are independent by marginalizing both sides of (GM.5) with respect to $c$ to give
$$p(a,b)=\sum_{c}p(a\mid c)\,p(b\mid c)\,p(c)\tag{GM.6}$$
In general, this does not factorize into the product $p(a)p(b)$, and so
$$a\not\perp b\mid\emptyset\tag{GM.7}$$

[Figure: tail-to-tail graph with c observed]

Now suppose we condition on the variable $c$, as represented by the graph above. From (GM.5), we can easily write down the conditional distribution of $a$ and $b$, given $c$, in the form
$$p(a,b\mid c)=\frac{p(a,b,c)}{p(c)}=p(a\mid c)\,p(b\mid c)\tag{GM.8}$$
and so we obtain the conditional independence property
$$a\perp b\mid c\tag{GM.4}$$

  • Chain

[Figure: head-to-tail (chain) graph, a → c → b]

The joint distribution corresponding to this graph is
$$p(a,b,c)=p(a)\,p(c\mid a)\,p(b\mid c)\tag{GM.9}$$
First of all, suppose that none of the variables are observed. Again, we can test to see if $a$ and $b$ are independent by marginalizing over $c$ to give
$$p(a,b)=p(a)\sum_{c}p(c\mid a)\,p(b\mid c)=p(a)\,p(b\mid a)\tag{GM.10}$$
which in general does not factorize into $p(a)p(b)$, and so
$$a\not\perp b\mid\emptyset\tag{GM.7}$$
as before.

[Figure: chain graph with c observed]

Now suppose we condition on node $c$. Using Bayes' theorem, together with (GM.9), we obtain
$$\begin{aligned}p(a,b\mid c)&=\frac{p(a,b,c)}{p(c)}\\&=\frac{p(a)\,p(c\mid a)\,p(b\mid c)}{p(c)}\\&=p(a\mid c)\,p(b\mid c)\end{aligned}\tag{GM.11}$$
and so again we obtain the conditional independence property
$$a\perp b\mid c\tag{GM.4}$$

  • Collider/head-to-head

[Figure: head-to-head (collider) graph, a → c ← b]

The joint distribution can again be written as
$$p(a,b,c)=p(a)\,p(b)\,p(c\mid a,b)\tag{GM.12}$$
Consider first the case where none of the variables are observed. Marginalizing both sides of (GM.12) over $c$, we obtain
$$p(a,b)=p(a)\,p(b)\tag{GM.13}$$
and so $a$ and $b$ are independent with no variables observed, in contrast to the two previous examples. We can write this result as
$$a\perp b\mid\emptyset\tag{GM.14}$$

[Figure: collider graph with c observed]

Now suppose we condition on $c$. The conditional distribution of $a$ and $b$ is then given by
$$\begin{aligned}p(a,b\mid c)&=\frac{p(a,b,c)}{p(c)}\\&=\frac{p(a)\,p(b)\,p(c\mid a,b)}{p(c)}\end{aligned}\tag{GM.15}$$
which in general does not factorize into the product $p(a)p(b)$, and so
$$a\not\perp b\mid c\tag{GM.16}$$
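A tiny numerical illustration of this "explaining away" effect (the probability tables are made up for the example): with binary $a$ and $b$, the joint $p(a,b)$ factorizes as $p(a)p(b)$, but the conditional $p(a,b\mid c)$ does not.

```python
import numpy as np

# Binary collider a -> c <- b with illustrative probabilities.
p_a = np.array([0.7, 0.3])                 # p(a)
p_b = np.array([0.6, 0.4])                 # p(b)
p_c_ab = np.array([[[0.9, 0.1],            # p(c | a=0, b=0)
                    [0.4, 0.6]],           # p(c | a=0, b=1)
                   [[0.3, 0.7],            # p(c | a=1, b=0)
                    [0.1, 0.9]]])          # p(c | a=1, b=1)

# Joint p(a, b, c) = p(a) p(b) p(c | a, b), indexed as [a, b, c]
joint = p_a[:, None, None] * p_b[None, :, None] * p_c_ab

# Unconditionally, a and b are independent: p(a,b) = p(a) p(b)   (GM.13)
p_ab = joint.sum(axis=2)
print(np.allclose(p_ab, np.outer(p_a, p_b)))           # True

# Conditioned on c = 1, the factorization fails: a and b become dependent   (GM.16)
p_ab_given_c1 = joint[:, :, 1] / joint[:, :, 1].sum()
p_a_given_c1 = p_ab_given_c1.sum(axis=1)
p_b_given_c1 = p_ab_given_c1.sum(axis=0)
print(np.allclose(p_ab_given_c1, np.outer(p_a_given_c1, p_b_given_c1)))   # False
```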

D-separation

We now give a general statement of the d-separation property for directed graphs. Consider a general directed graph in which $A$, $B$, and $C$ are arbitrary nonintersecting sets of nodes (whose union may be smaller than the complete set of nodes in the graph). We wish to ascertain whether a particular conditional independence statement $A\perp B\mid C$ is implied by a given directed acyclic graph.

To do so, we consider all possible paths from any node in $A$ to any node in $B$. Any such path is said to be blocked if it includes a node such that either

  • the arrows on the path meet either head-to-tail or tail-to-tail at the node, and the node is in the set $C$, or
  • the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, is in the set $C$.

If all paths are blocked, then $A$ is said to be d-separated from $B$ by $C$, and the joint distribution over all of the variables in the graph will satisfy $A\perp B\mid C$.
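The brute-force checker below (an illustrative sketch, not an optimized algorithm) implements exactly these two blocking rules by enumerating all paths in the undirected skeleton of a small DAG; the encoding of the graph as a dict of children sets is my own convention.

```python
import itertools

# d-separation by exhaustive path enumeration; fine for small graphs.
def descendants(dag, node):
    out, stack = set(), [node]
    while stack:
        for child in dag.get(stack.pop(), ()):
            if child not in out:
                out.add(child)
                stack.append(child)
    return out

def paths(dag, start, goal):
    """All simple paths between start and goal in the undirected skeleton."""
    nbrs = {n: set() for n in dag}
    for u, chs in dag.items():
        for v in chs:
            nbrs[u].add(v)
            nbrs.setdefault(v, set()).add(u)
    def walk(path):
        if path[-1] == goal:
            yield path
            return
        for v in nbrs.get(path[-1], ()):
            if v not in path:
                yield from walk(path + [v])
    yield from walk([start])

def blocked(dag, path, C):
    for u, n, w in zip(path, path[1:], path[2:]):
        head_to_head = n in dag.get(u, ()) and n in dag.get(w, ())
        if head_to_head:
            if n not in C and not (descendants(dag, n) & C):
                return True                # rule 2: unobserved collider with no observed descendant
        elif n in C:                       # rule 1: head-to-tail or tail-to-tail node in C
            return True
    return False

def d_separated(dag, A, B, C):
    return all(blocked(dag, p, set(C))
               for a, b in itertools.product(A, B)
               for p in paths(dag, a, b))

# Example: the chain a -> c -> b and the collider a -> c <- b from the previous section
chain    = {'a': {'c'}, 'c': {'b'}, 'b': set()}
collider = {'a': {'c'}, 'b': {'c'}, 'c': set()}
print(d_separated(chain,    {'a'}, {'b'}, {'c'}))   # True:  a ⊥ b | c
print(d_separated(collider, {'a'}, {'b'}, set()))   # True:  a ⊥ b | ∅
print(d_separated(collider, {'a'}, {'b'}, {'c'}))   # False: a not⊥ b | c
```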

The concept of d-separation is illustrated in Figure 8.22. In graph (a), the path from $a$ to $b$ is not blocked by node $f$, because $f$ is a tail-to-tail node for this path and is not observed, nor is it blocked by node $e$, because, although $e$ is a head-to-head node, it has a descendant $c$ that is in the conditioning set. Thus the conditional independence statement $a\perp b\mid c$ does *not* follow from this graph. In graph (b), the path from $a$ to $b$ is blocked by node $f$, because $f$ is a tail-to-tail node that is observed, and so the conditional independence property $a\perp b\mid f$ will be satisfied by any distribution that factorizes according to this graph. Note that this path is also blocked by node $e$, because $e$ is a head-to-head node and neither it nor its descendants are in the conditioning set.

[Figure 8.22]
