Probabilistic Models
Reference:
Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.
Slides of CS4220, TUD
Bayesian Inference
Bayes’ Theorem for Gaussian Variables (2.3.1-2.3.3)
Suppose $\mathbf x$ is a $D$-dimensional vector with Gaussian distribution $\mathcal N(\mathbf x|\boldsymbol \mu,\boldsymbol \Sigma)$ and that we partition $\mathbf x$ into two disjoint subsets $\mathbf x_a$ and $\mathbf x_b$. Without loss of generality, we can take $\mathbf x_a$ to form the first $M$ components of $\mathbf x$, with $\mathbf x_b$ comprising the remaining $D-M$ components, i.e.

$$\mathbf x=\left(\begin{matrix}\mathbf x_a\\\mathbf x_b \end{matrix}\right) \tag{BI.1}$$

We also define corresponding partitions of the mean vector $\boldsymbol \mu$ given by

$$\boldsymbol \mu=\left(\begin{matrix}\boldsymbol \mu_a\\\boldsymbol \mu_b \end{matrix}\right) \tag{BI.2}$$

and of the covariance matrix $\boldsymbol \Sigma$ given by

$$\boldsymbol \Sigma =\left(\begin{matrix}\boldsymbol \Sigma_{aa} & \boldsymbol \Sigma_{ab}\\ \boldsymbol \Sigma_{ba} & \boldsymbol \Sigma_{bb}\end{matrix}\right) \tag{BI.3}$$

Note that the symmetry $\boldsymbol \Sigma^{\mathrm{T}}=\boldsymbol\Sigma$ of the covariance matrix implies that $\boldsymbol \Sigma_{aa}$ and $\boldsymbol \Sigma_{bb}$ are symmetric, while $\boldsymbol \Sigma_{ba}=\boldsymbol \Sigma_{ab}^{\mathrm{T}}$.

In many situations, it will be convenient to work with the inverse of the covariance matrix

$$\boldsymbol{\Lambda} \equiv \boldsymbol{\Sigma}^{-1}\tag{BI.4}$$

which is known as the precision matrix. In fact, we shall see that some properties of Gaussian distributions are most naturally expressed in terms of the covariance, whereas others take a simpler form when viewed in terms of the precision. We therefore also introduce the partitioned form of the precision matrix

$$\boldsymbol\Lambda=\left(\begin{matrix} \boldsymbol\Lambda_{a a} & \boldsymbol\Lambda_{a b} \\ \boldsymbol\Lambda_{b a} & \boldsymbol\Lambda_{b b} \end{matrix}\right)\tag{BI.5}$$

Because the inverse of a symmetric matrix is also symmetric, we see that $\boldsymbol \Lambda_{aa}$ and $\boldsymbol \Lambda_{bb}$ are symmetric, while $\boldsymbol \Lambda_{ab}^{\mathrm T}=\boldsymbol \Lambda_{ba}$.
An important property of the multivariate Gaussian distribution is that if two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian. Similarly, the marginal distribution of either set is also Gaussian.
Using the expressions above, since $\mathbf x_a$ and $\mathbf x_b$ are jointly Gaussian, the conditional distributions $p(\mathbf x_a|\mathbf x_b)$ and $p(\mathbf x_b|\mathbf x_a)$, and the marginal distributions $p(\mathbf x_a)$ and $p(\mathbf x_b)$, are also Gaussian. We will show how to prove this.
Conditional Gaussian distributions
Let us begin by finding an expression for the conditional distribution $p(\mathbf x_a|\mathbf x_b)$. It can be evaluated from the joint distribution $p(\mathbf x)=p(\mathbf x_a,\mathbf x_b)$ by

$$p(\mathbf x_a|\mathbf x_b)=\frac{p(\mathbf x_a,\mathbf x_b)}{p(\mathbf x_b)}=\frac{p(\mathbf x_a,\mathbf x_b)}{\int p(\mathbf x_a,\mathbf x_b)\,d\mathbf x_a}$$

where

$$p(\mathbf x)=p(\mathbf x_a,\mathbf x_b)=\frac{1}{\sqrt{(2\pi)^D|\boldsymbol \Sigma|}}\exp\left(-\frac{1}{2} (\mathbf x-\boldsymbol \mu)^{\mathrm T} \boldsymbol \Sigma^{-1}(\mathbf x-\boldsymbol\mu) \right)$$

This can be viewed as fixing $\mathbf x_b$ to the observed value and normalizing the resulting expression to obtain a valid probability distribution over $\mathbf x_a$. Therefore, we can first ignore $p(\mathbf x_b)$ and focus on the quadratic form in the exponent of the joint Gaussian distribution $p(\mathbf x_a,\mathbf x_b)$, and then reinstate the normalization coefficient at the end of the calculation.
If we make use of the partitioning $(\text{BI.1})$, $(\text{BI.2})$ and $(\text{BI.5})$, we obtain

$$\begin{aligned} -\frac{1}{2} (\mathbf x-\boldsymbol \mu)^{\mathrm T} \boldsymbol \Sigma^{-1}(\mathbf x-\boldsymbol\mu)=&-\frac{1}{2} (\mathbf x_a-\boldsymbol \mu_a)^{\mathrm T} \boldsymbol \Lambda_{aa}(\mathbf x_a-\boldsymbol\mu_a)-\frac{1}{2} (\mathbf x_a-\boldsymbol \mu_a)^{\mathrm T} \boldsymbol \Lambda_{ab}(\mathbf x_b-\boldsymbol\mu_b)\\ &-\frac{1}{2} (\mathbf x_b-\boldsymbol \mu_b)^{\mathrm T} \boldsymbol \Lambda_{ba}(\mathbf x_a-\boldsymbol\mu_a)-\frac{1}{2} (\mathbf x_b-\boldsymbol \mu_b)^{\mathrm T} \boldsymbol \Lambda_{bb}(\mathbf x_b-\boldsymbol\mu_b) \end{aligned}\tag{BI.6}$$
We see that as a function of $\mathbf x_a$, this is again a quadratic form, and hence the corresponding conditional distribution $p(\mathbf x_a|\mathbf x_b)$ will be Gaussian.

Because this distribution is completely characterized by its mean and its covariance, our goal will be to identify expressions for the mean and covariance of $p(\mathbf x_a|\mathbf x_b)$ by inspection of $(\text{BI.6})$. Such problems can be solved straightforwardly by noting that the exponent in a general Gaussian distribution $\mathcal N(\mathbf x|\boldsymbol \mu, \boldsymbol \Sigma)$ can be written

$$-\frac{1}{2} (\mathbf x-\boldsymbol \mu)^{\mathrm T} \boldsymbol \Sigma^{-1}(\mathbf x-\boldsymbol\mu)=-\frac{1}{2}\mathbf x^{\mathrm T} \boldsymbol \Sigma^{-1}\mathbf x+\mathbf x^{\mathrm T} \boldsymbol \Sigma^{-1}\boldsymbol \mu+\mathrm{const}\tag{BI.7}$$
Thus if we take our general quadratic form and express it in the form given by the right-hand side of $(\text{BI.7})$, then we can immediately equate the matrix of coefficients entering the second-order term in $\mathbf x$ to the inverse covariance matrix $\boldsymbol \Sigma^{-1}$ and the coefficient of the linear term in $\mathbf x$ to $\boldsymbol \Sigma^{-1}\boldsymbol \mu$, from which we can obtain $\boldsymbol \mu$.

Now let us apply this procedure to $(\text{BI.6})$. Denote the mean and covariance of this distribution by $\boldsymbol \mu_{a|b}$ and $\boldsymbol \Sigma_{a|b}$, respectively. If we pick out all terms that are second order in $\mathbf x_a$, we have

$$-\frac{1}{2}\mathbf x_a^{\mathrm T} \boldsymbol \Lambda_{aa}\mathbf x_a$$

from which we can immediately conclude that the covariance (inverse precision) of $p(\mathbf x_a|\mathbf x_b)$ is given by

$$\boldsymbol \Sigma_{a|b}=\boldsymbol \Lambda_{aa}^{-1}\tag{BI.8}$$
Now consider all of the terms in $(\text{BI.6})$ that are linear in $\mathbf x_a$

$$\mathbf{x}_{a}^{\mathrm{T}}\left\{\boldsymbol{\Lambda}_{a a} \boldsymbol{\mu}_{a}-\boldsymbol{\Lambda}_{a b}\left(\mathbf{x}_{b}-\boldsymbol{\mu}_{b}\right)\right\}$$

where we have used $\boldsymbol{\Lambda}_{b a}^{\mathrm{T}}=\boldsymbol{\Lambda}_{a b}$. From our discussion of the general form $(\text{BI.7})$, the coefficient of $\mathbf{x}_{a}$ in this expression must equal $\boldsymbol\Sigma_{a | b}^{-1} \boldsymbol\mu_{a | b}$ and hence

$$\boldsymbol{\mu}_{a | b} =\boldsymbol{\Sigma}_{a | b}\left\{\boldsymbol{\Lambda}_{a a} \boldsymbol{\mu}_{a}-\boldsymbol{\Lambda}_{a b}\left(\mathbf{x}_{b}-\boldsymbol{\mu}_{b}\right)\right\} \stackrel{\text{BI.8}}{=}\boldsymbol{\mu}_{a}-\boldsymbol{\Lambda}_{a a}^{-1} \boldsymbol{\Lambda}_{a b}\left(\mathbf{x}_{b}-\boldsymbol{\mu}_{b}\right)\tag{BI.9}$$
We can also express these results in terms of the corresponding partitioned covariance matrix. To do this, we make use of the following identity for the inverse of a partitioned matrix

$$\left(\begin{array}{cc} \mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D} \end{array}\right)^{-1}=\left(\begin{array}{cc} \mathbf{M} & -\mathbf{M} \mathbf{B} \mathbf{D}^{-1} \\ -\mathbf{D}^{-1} \mathbf{C M} & \mathbf{D}^{-1}+\mathbf{D}^{-1} \mathbf{C M B D}^{-1} \end{array}\right)\tag{BI.10}$$

where we have defined

$$\mathbf{M}=\left(\mathbf{A}-\mathbf{B} \mathbf{D}^{-1} \mathbf{C}\right)^{-1}\tag{BI.11}$$
The quantity $\mathbf{M}^{-1}$ is known as the Schur complement of the matrix on the left-hand side of $(\text{BI.10})$ with respect to the submatrix $\mathbf D$.
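The partitioned-inverse identity is easy to verify numerically. The short NumPy sketch below is my own illustration (the matrix and the block sizes are arbitrary choices, not from the text): it builds a random symmetric positive-definite matrix, applies $(\text{BI.10})$ and $(\text{BI.11})$, and compares the result against a direct inverse.

```python
import numpy as np

rng = np.random.default_rng(42)

# Random symmetric positive-definite matrix, partitioned into blocks of size 2 and 3.
X = rng.normal(size=(5, 5))
P = X @ X.T + 5 * np.eye(5)
A, B = P[:2, :2], P[:2, 2:]
C, D = P[2:, :2], P[2:, 2:]

M = np.linalg.inv(A - B @ np.linalg.inv(D) @ C)        # BI.11 (inverse of the Schur complement)
Dinv = np.linalg.inv(D)
P_inv_blocks = np.block([
    [M,             -M @ B @ Dinv],
    [-Dinv @ C @ M, Dinv + Dinv @ C @ M @ B @ Dinv],
])                                                      # right-hand side of BI.10

print(np.allclose(P_inv_blocks, np.linalg.inv(P)))      # True
```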
Using the definition

$$\left(\begin{array}{cc} \boldsymbol\Sigma_{a a} & \boldsymbol\Sigma_{a b} \\ \boldsymbol\Sigma_{b a} & \boldsymbol\Sigma_{b b} \end{array}\right)^{-1}=\left(\begin{array}{cc} \boldsymbol\Lambda_{a a} & \boldsymbol\Lambda_{a b} \\ \boldsymbol\Lambda_{b a} & \boldsymbol\Lambda_{b b} \end{array}\right)$$

and making use of $(\text{BI.10})$, we have

$$\begin{aligned} \boldsymbol{\Lambda}_{a a} &=\left(\boldsymbol{\Sigma}_{a a}-\boldsymbol{\Sigma}_{a b} \boldsymbol{\Sigma}_{b b}^{-1} \boldsymbol{\Sigma}_{b a}\right)^{-1} \\ \boldsymbol{\Lambda}_{a b} &=-\left(\boldsymbol{\Sigma}_{a a}-\boldsymbol{\Sigma}_{a b} \boldsymbol{\Sigma}_{b b}^{-1} \boldsymbol{\Sigma}_{b a}\right)^{-1} \boldsymbol{\Sigma}_{a b} \boldsymbol{\Sigma}_{b b}^{-1} \end{aligned}$$
From this we obtain the following expressions for the mean and covariance of the conditional distribution $p(\mathbf x_a|\mathbf x_b)$:

$$\boldsymbol \mu_{a|b}=\boldsymbol \mu _a+\boldsymbol \Sigma_{ab}\boldsymbol \Sigma_{bb}^{-1}(\mathbf x_b-\boldsymbol \mu_b)\tag{BI.12}$$

$$\boldsymbol \Sigma_{a|b}=\boldsymbol \Sigma_{aa}-\boldsymbol \Sigma_{ab}\boldsymbol \Sigma _{bb}^{-1}\boldsymbol \Sigma_{ba}\tag{BI.13}$$
Marginal Gaussian distributions
Now we turn to a discussion of the marginal distribution given by

$$p(\mathbf x_a)=\int p(\mathbf x_a,\mathbf x_b)\,d\mathbf x_b\tag{BI.14}$$

Once again, our strategy for evaluating this distribution efficiently will be to focus on the quadratic form in the exponent of the joint distribution $p(\mathbf x_a,\mathbf x_b)$ and thereby to identify the mean and covariance of the marginal distribution $p(\mathbf x_a)$.

The quadratic form for the joint distribution can be expressed, using the partitioned precision matrix, in the form $(\text{BI.6})$. Because our goal is to integrate out $\mathbf x_b$, this is most easily achieved by first considering the terms involving $\mathbf x_b$ and then completing the square in order to facilitate integration. Picking out just those terms that involve $\mathbf x_b$, we have

$$-\frac{1}{2} \mathbf{x}_{b}^{\mathrm{T}} \boldsymbol{\Lambda}_{b b} \mathbf{x}_{b}+\mathbf{x}_{b}^{\mathrm T} \mathbf{m}=-\frac{1}{2}\left(\mathbf{x}_{b}-\boldsymbol{\Lambda}_{b b}^{-1} \mathbf{m}\right)^{\mathrm{T}} \boldsymbol{\Lambda}_{b b}\left(\mathbf{x}_{b}-\boldsymbol{\Lambda}_{b b}^{-1} \mathbf{m}\right)+\frac{1}{2} \mathbf{m}^{\mathrm{T}} \boldsymbol{\Lambda}_{b b}^{-1} \mathbf{m}\tag{BI.15}$$
where we have defined

$$\mathbf{m}=\boldsymbol{\Lambda}_{b b} \boldsymbol{\mu}_{b}-\boldsymbol{\Lambda}_{b a}\left(\mathbf{x}_{a}-\boldsymbol{\mu}_{a}\right)\tag{BI.16}$$

We see that the dependence on $\mathbf{x}_{b}$ has been cast into the standard quadratic form of a Gaussian distribution corresponding to the first term on the right-hand side of $(\text{BI.15})$, plus a term that does not depend on $\mathbf{x}_{b}$ (but that does depend on $\mathbf{x}_{a}$). Thus, when we take the exponential of this quadratic form, we see that the integration over $\mathbf{x}_{b}$ required by $(\text{BI.14})$ will take the form
$$\begin{aligned} &\int \exp \left\{-\frac{1}{2}\left(\mathbf{x}_{b}-\boldsymbol{\Lambda}_{b b}^{-1} \mathbf{m}\right)^{\mathrm{T}} \boldsymbol{\Lambda}_{b b}\left(\mathbf{x}_{b}-\boldsymbol{\Lambda}_{b b}^{-1} \mathbf{m}\right)\right\} \mathrm{d} \mathbf{x}_{b}\\ =&\ \sqrt{(2\pi)^{D-M}|\boldsymbol \Lambda_{bb}^{-1}|}\int \frac{1}{\sqrt{(2\pi)^{D-M}|\boldsymbol \Lambda_{bb}^{-1}|}}\exp \left\{-\frac{1}{2}\left(\mathbf{x}_{b}-\boldsymbol{\Lambda}_{b b}^{-1} \mathbf{m}\right)^{\mathrm{T}} \boldsymbol{\Lambda}_{b b}\left(\mathbf{x}_{b}-\boldsymbol{\Lambda}_{b b}^{-1} \mathbf{m}\right)\right\} \mathrm{d} \mathbf{x}_{b}\\ =&\ \sqrt{(2\pi)^{D-M}|\boldsymbol \Lambda_{bb}^{-1}|} \end{aligned}\tag{BI.17}$$

where $D-M$ is the dimension of $\mathbf x_b$.
This integration is easily performed by noting that it is the integral over an unnormalized Gaussian, and so the result will be the reciprocal of the normalization coefficient. Thus, by completing the square with respect to $\mathbf x_b$, we can integrate out $\mathbf x_b$, and the only term remaining from the contributions on the left-hand side of $(\text{BI.15})$ that depends on $\mathbf x_a$ is $\frac{1}{2}\mathbf{m}^{\mathrm{T}} \boldsymbol{\Lambda}_{b b}^{-1} \mathbf{m}$, in which $\mathbf m$ is given by $(\text{BI.16})$. Combining this term with the remaining terms from $(\text{BI.6})$ that depend on $\mathbf x_a$, we obtain
$$\begin{aligned} &\frac{1}{2}\left[\boldsymbol{\Lambda}_{b b}\boldsymbol{\mu}_{b}-\boldsymbol{\Lambda}_{b a}\left(\mathbf{x}_{a}-\boldsymbol{\mu}_{a}\right)\right]^{\mathrm{T}} \boldsymbol{\Lambda}_{b b}^{-1}\left[\boldsymbol{\Lambda}_{b b} \boldsymbol{\mu}_{b}-\boldsymbol{\Lambda}_{b a}\left(\mathbf{x}_{a}-\boldsymbol{\mu}_{a}\right)\right] -\frac{1}{2} \mathbf{x}_{a}^{\mathrm{T}} \boldsymbol{\Lambda}_{a a} \mathbf{x}_{a}+\mathbf{x}_{a}^{\mathrm{T}}\left(\boldsymbol{\Lambda}_{a a} \boldsymbol{\mu}_{a}+\boldsymbol{\Lambda}_{a b} \boldsymbol{\mu}_{b}\right)+\mathrm{const} \\ =&-\frac{1}{2} \mathbf{x}_{a}^{\mathrm{T}}\left(\boldsymbol{\Lambda}_{a a}-\boldsymbol{\Lambda}_{a b} \boldsymbol{\Lambda}_{b b}^{-1} \boldsymbol{\Lambda}_{b a}\right) \mathbf{x}_{a} +\mathbf{x}_{a}^{\mathrm{T}}\left(\boldsymbol{\Lambda}_{a a}-\boldsymbol{\Lambda}_{a b} \boldsymbol{\Lambda}_{b b}^{-1} \boldsymbol{\Lambda}_{b a}\right) \boldsymbol{\mu}_{a}+\mathrm{const} \end{aligned}\tag{BI.18}$$
Again, by comparison with $(\text{BI.7})$, we see that the covariance of the marginal distribution $p(\mathbf{x}_{a})$ is given by

$$\boldsymbol{\Sigma}_{a}=\left(\boldsymbol{\Lambda}_{a a}-\boldsymbol{\Lambda}_{a b} \boldsymbol{\Lambda}_{b b}^{-1} \boldsymbol{\Lambda}_{b a}\right)^{-1}\tag{BI.19}$$

Similarly, the mean is given by

$$\boldsymbol{\Sigma}_{a}\left(\boldsymbol{\Lambda}_{a a}-\boldsymbol{\Lambda}_{a b} \boldsymbol{\Lambda}_{b b}^{-1} \boldsymbol{\Lambda}_{b a}\right) \boldsymbol{\mu}_{a}\stackrel{\text{BI.19}}{=}\boldsymbol{\mu}_{a}\tag{BI.20}$$
The covariance in $(\text{BI.19})$ is expressed in terms of the partitioned precision matrix. We can rewrite this in terms of the corresponding partitioning of the covariance matrix, as we did for the conditional distribution. Making use of $(\text{BI.10})$, we then have

$$\left(\boldsymbol{\Lambda}_{a a}-\boldsymbol{\Lambda}_{a b} \boldsymbol{\Lambda}_{b b}^{-1} \boldsymbol{\Lambda}_{b a}\right)^{-1}=\boldsymbol{\Sigma}_{a a}\tag{BI.21}$$
Thus we obtain the intuitively satisfying result that the marginal distribution $p\left(\mathbf{x}_{a}\right)$ has mean and covariance given by

$$\mathbb{E}\left[\mathbf{x}_{a}\right] =\boldsymbol{\mu}_{a} \tag{BI.22}$$

$$\operatorname{cov}\left[\mathbf{x}_{a}\right] =\boldsymbol{\Sigma}_{a a} \tag{BI.23}$$
We see that for a marginal distribution, the mean and covariance are most simply expressed in terms of the partitioned covariance matrix, in contrast to the conditional distribution for which the partitioned precision matrix gives rise to simpler expressions.
Partitioned Gaussians
We summarize the result for the marginal and conditional distributions of a partitioned Gaussian here.
Theorem: Partitioned Gaussians
Given a joint Gaussian distribution $\mathcal N(\mathbf x|\boldsymbol \mu,\boldsymbol \Sigma)$ with $\boldsymbol \Lambda\equiv \boldsymbol \Sigma^{-1}$ and

$$\mathbf x=\left(\begin{matrix}\mathbf x_a\\\mathbf x_b \end{matrix}\right),\quad \boldsymbol \mu=\left(\begin{matrix}\boldsymbol \mu_a\\\boldsymbol \mu_b \end{matrix}\right),\quad \boldsymbol \Sigma =\left(\begin{matrix}\boldsymbol \Sigma_{aa} & \boldsymbol \Sigma_{ab}\\ \boldsymbol \Sigma_{ba} & \boldsymbol \Sigma_{bb}\end{matrix}\right),\quad \boldsymbol\Lambda=\left(\begin{matrix} \boldsymbol\Lambda_{a a} & \boldsymbol\Lambda_{a b} \\ \boldsymbol\Lambda_{b a} & \boldsymbol\Lambda_{b b} \end{matrix}\right)$$
Conditional distribution:
$$\begin{aligned} p(\mathbf x_a|\mathbf x_b)&=\mathcal N(\mathbf x_a|\boldsymbol \mu_{a|b},\boldsymbol \Sigma_{a|b})\\ \boldsymbol{\mu}_{a | b} &=\boldsymbol{\mu}_{a}-\boldsymbol{\Lambda}_{a a}^{-1} \boldsymbol{\Lambda}_{a b}\left(\mathbf{x}_{b}-\boldsymbol{\mu}_{b}\right)=\boldsymbol \mu _a+\boldsymbol \Sigma_{ab}\boldsymbol \Sigma_{bb}^{-1}(\mathbf x_b-\boldsymbol \mu_b) \\ \boldsymbol \Sigma_{a|b}&=\boldsymbol \Lambda_{aa}^{-1}=\boldsymbol \Sigma_{aa}-\boldsymbol \Sigma_{ab}\boldsymbol \Sigma _{bb}^{-1}\boldsymbol \Sigma_{ba} \end{aligned}$$
Marginal distribution:
$$p(\mathbf x_a)=\mathcal N(\mathbf x_a|\boldsymbol \mu_a,\boldsymbol \Sigma_{aa})$$
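These formulas translate directly into a few lines of NumPy. The sketch below is a minimal illustration of my own (not from PRML); the names `mu_a`, `Sigma_ab`, and so on simply mirror the partitioned quantities defined above, and the toy 3-dimensional Gaussian is an arbitrary example.

```python
import numpy as np

def conditional_gaussian(mu, Sigma, idx_a, idx_b, x_b):
    """Mean and covariance of p(x_a | x_b) for a jointly Gaussian x (BI.12, BI.13)."""
    mu_a, mu_b = mu[idx_a], mu[idx_b]
    S_aa = Sigma[np.ix_(idx_a, idx_a)]
    S_ab = Sigma[np.ix_(idx_a, idx_b)]
    S_ba = Sigma[np.ix_(idx_b, idx_a)]
    S_bb = Sigma[np.ix_(idx_b, idx_b)]
    K = S_ab @ np.linalg.inv(S_bb)            # Sigma_ab Sigma_bb^{-1}
    mu_cond = mu_a + K @ (x_b - mu_b)         # BI.12
    Sigma_cond = S_aa - K @ S_ba              # BI.13
    return mu_cond, Sigma_cond

def marginal_gaussian(mu, Sigma, idx_a):
    """Mean and covariance of the marginal p(x_a) (BI.22, BI.23)."""
    return mu[idx_a], Sigma[np.ix_(idx_a, idx_a)]

# Toy 3-dimensional Gaussian with x_a = (x_0, x_1) and x_b = (x_2).
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.3, 0.5],
                  [0.3, 1.0, 0.2],
                  [0.5, 0.2, 1.5]])           # symmetric positive definite
print(conditional_gaussian(mu, Sigma, [0, 1], [2], x_b=np.array([0.5])))
print(marginal_gaussian(mu, Sigma, [0, 1]))
```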
Bayes’ theorem for Gaussian variables
In [Conditional Gaussian distributions](#Conditional Gaussian distributions) and [Marginal Gaussian distributions](#Marginal Gaussian distributions), we considered a Gaussian $p(\mathbf x)$ in which we partitioned the vector $\mathbf x$ into two subvectors $\mathbf x=(\mathbf x_a,\mathbf x_b)$ and then found expressions for the conditional distribution $p(\mathbf x_a|\mathbf x_b)$ and the marginal distribution $p(\mathbf x_a)$.
Here we shall suppose that we are given a Gaussian marginal distribution $p(\mathbf x)$ and a Gaussian conditional distribution $p(\mathbf y|\mathbf x)$ in which $p(\mathbf y|\mathbf x)$ has a mean that is a linear function of $\mathbf x$, and a covariance which is independent of $\mathbf x$, i.e.,

$$p(\mathbf x)=\mathcal N(\mathbf x|\boldsymbol \mu,\boldsymbol \Lambda^{-1})\tag{BI.24}$$

$$p(\mathbf y|\mathbf x)=\mathcal N(\mathbf y|\mathbf A \mathbf x+\mathbf b,\mathbf L^{-1})\tag{BI.25}$$

where $\boldsymbol \Lambda$ and $\mathbf L$ are precision matrices.

We wish to find the marginal distribution $p(\mathbf y)$ and the conditional distribution $p(\mathbf x|\mathbf y)$. For that, we will first find the joint distribution for $\mathbf x$ and $\mathbf y$, and then use the conclusions in [Partitioned Gaussians](#Partitioned Gaussians) to obtain $p(\mathbf y)$ and $p(\mathbf x|\mathbf y)$.
Define

$$\mathbf z=\left(\begin{matrix}\mathbf x\\ \mathbf y \end{matrix} \right)\tag{BI.26}$$
and then consider the log of the joint distribution

$$\begin{aligned} \ln p(\mathbf{z})=& \ln p(\mathbf{x})+\ln p(\mathbf{y} \mid \mathbf{x}) \\ =&-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Lambda}(\mathbf{x}-\boldsymbol{\mu}) \\ &-\frac{1}{2}(\mathbf{y}-\mathbf{A} \mathbf{x}-\mathbf{b})^{\mathrm{T}} \mathbf{L}(\mathbf{y}-\mathbf{A} \mathbf{x}-\mathbf{b})+\mathrm{const} \end{aligned}\tag{BI.27}$$
As before, we see that this is a quadratic function of the components of $\mathbf{z}$, and hence $p(\mathbf{z})$ is a Gaussian distribution. To find the precision of this Gaussian, we consider the second-order terms in $(\text{BI.27})$, which can be written as
$$\begin{aligned} &-\frac{1}{2} \mathbf{x}^{\mathrm{T}}\left(\boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}} \mathbf{L} \mathbf{A}\right) \mathbf{x}-\frac{1}{2} \mathbf{y}^{\mathrm{T}} \mathbf{L} \mathbf{y}+\frac{1}{2} \mathbf{y}^{\mathrm{T}} \mathbf{L} \mathbf{A} \mathbf{x}+\frac{1}{2} \mathbf{x}^{\mathrm{T}} \mathbf{A}^{\mathrm{T}} \mathbf{L} \mathbf{y} \\ =&-\frac{1}{2}\left(\begin{array}{l} \mathbf{x} \\ \mathbf{y} \end{array}\right)^{\mathrm{T}}\left(\begin{array}{cc} \boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}} \mathbf{L} \mathbf{A} & -\mathbf{A}^{\mathrm{T}} \mathbf{L} \\ -\mathbf{L} \mathbf{A} & \mathbf{L} \end{array}\right)\left(\begin{array}{l} \mathbf{x} \\ \mathbf{y} \end{array}\right)=-\frac{1}{2} \mathbf{z}^{\mathrm{T}} \mathbf{R} \mathbf{z} \end{aligned}\tag{BI.28}$$

and so the Gaussian distribution over $\mathbf{z}$ has precision (inverse covariance) matrix given by

$$\mathbf{R}=\left(\begin{array}{cc} \boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}} \mathbf{L} \mathbf{A} & -\mathbf{A}^{\mathrm{T}} \mathbf{L} \\ -\mathbf{L} \mathbf{A} & \mathbf{L} \end{array}\right)\tag{BI.29}$$
The covariance matrix is found by taking the inverse of the precision, which can be done using the matrix inversion formula $(\text{BI.10})$ to give

$$\operatorname{cov}[\mathbf{z}]=\mathbf{R}^{-1}=\left(\begin{array}{cc} \boldsymbol{\Lambda}^{-1} & \boldsymbol{\Lambda}^{-1} \mathbf{A}^{\mathrm{T}} \\ \mathbf{A} \boldsymbol{\Lambda}^{-1} & \mathbf{L}^{-1}+\mathbf{A} \boldsymbol{\Lambda}^{-1} \mathbf{A}^{\mathrm{T}} \end{array}\right)\tag{BI.30}$$
Similarly, we can find the mean of the Gaussian distribution over $\mathbf{z}$ by identifying the linear terms in $(\text{BI.27})$, which are given by

$$\mathbf{x}^{\mathrm{T}} \boldsymbol{\Lambda} \boldsymbol{\mu}-\mathbf{x}^{\mathrm{T}} \mathbf{A}^{\mathrm{T}} \mathbf{L} \mathbf{b}+\mathbf{y}^{\mathrm{T}} \mathbf{L} \mathbf{b}=\left(\begin{array}{l} \mathbf{x} \\ \mathbf{y} \end{array}\right)^{\mathrm{T}}\left(\begin{array}{c} \boldsymbol{\Lambda} \boldsymbol{\mu}-\mathbf{A}^{\mathrm{T}} \mathbf{L} \mathbf{b} \\ \mathbf{L b} \end{array}\right)\tag{BI.31}$$

Using our earlier result $(\text{BI.7})$, obtained by completing the square over the quadratic form of a multivariate Gaussian, we find that the mean of $\mathbf z$ is given by

$$\mathbb{E}[\mathbf{z}]=\mathbf{R}^{-1}\left(\begin{array}{c} \boldsymbol\Lambda \boldsymbol\mu-\mathbf{A}^{\mathrm{T}} \mathbf{L} \mathbf{b} \\ \mathbf{L b} \end{array}\right)\stackrel{\text{BI.30}}{=}\left(\begin{array}{c} \boldsymbol{\mu} \\ \mathbf{A} \boldsymbol\mu+\mathbf{b} \end{array}\right)\tag{BI.32}$$
Now we can obtain the mean and covariance of the marginal distribution $p(\mathbf{y})$ using the conclusions in [Partitioned Gaussians](#Partitioned Gaussians):

$$\mathbb{E}[\mathbf{y}] =\mathbf{A} \boldsymbol{\mu}+\mathbf{b} \tag{BI.33}$$

$$\operatorname{cov}[\mathbf{y}] =\mathbf{L}^{-1}+\mathbf{A} \boldsymbol{\Lambda}^{-1} \mathbf{A}^{\mathrm{T}} \tag{BI.34}$$
A special case of this result is when $\mathbf{A}=\mathbf{I}$, in which case it reduces to the convolution of two Gaussians, for which we see that the mean of the convolution is the sum of the means of the two Gaussians, and the covariance of the convolution is the sum of their covariances.
The conditional distribution $p(\mathbf{x} | \mathbf{y})$ has mean and covariance given by

$$\mathbb{E}[\mathbf{x} |\mathbf{y}] =\left(\boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}} \mathbf{L} \mathbf{A}\right)^{-1}\left\{\mathbf{A}^{\mathrm{T}} \mathbf{L}(\mathbf{y}-\mathbf{b})+\boldsymbol{\Lambda} \boldsymbol{\mu}\right\} \tag{BI.35}$$

$$\operatorname{cov}[\mathbf{x}| \mathbf{y}] =\left(\boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}} \mathbf{L} \mathbf{A}\right)^{-1} \tag{BI.36}$$

Having found the marginal and conditional distributions, we effectively expressed the joint distribution $p(\mathbf{z})=p(\mathbf{x}) p(\mathbf{y} | \mathbf{x})$ in the form $p(\mathbf{x}| \mathbf{y}) p(\mathbf{y})$. These results are summarized below.
Theorem: Marginal and Conditional Gaussians
Given a marginal Gaussian distribution for $\mathbf{x}$ and a conditional Gaussian distribution for $\mathbf y$ given $\mathbf x$ in the form

$$\begin{aligned} p(\mathbf{x}) &=\mathcal{N}\left(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1}\right) \\ p(\mathbf{y} | \mathbf{x}) &=\mathcal{N}\left(\mathbf{y}| \mathbf{A} \mathbf{x}+\mathbf{b}, \mathbf{L}^{-1}\right) \end{aligned}$$

the marginal distribution of $\mathbf{y}$ and the conditional distribution of $\mathbf{x}$ given $\mathbf{y}$ are given by

$$\begin{aligned} p(\mathbf{y}) &=\mathcal{N}\left(\mathbf{y} | \mathbf{A} \boldsymbol{\mu}+\mathbf{b}, \mathbf{L}^{-1}+\mathbf{A} \boldsymbol{\Lambda}^{-1} \mathbf{A}^{\mathrm{T}}\right) \\ p(\mathbf{x}|\mathbf{y}) &=\mathcal{N}\left(\mathbf{x} | \boldsymbol{\Sigma}\left\{\mathbf{A}^{\mathrm{T}} \mathbf{L}(\mathbf{y}-\mathbf{b})+\boldsymbol{\Lambda} \boldsymbol{\mu}\right\}, \boldsymbol{\Sigma}\right) \end{aligned}$$
where
$$\boldsymbol{\Sigma}=\left(\boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}} \mathbf{L} \mathbf{A}\right)^{-1}$$
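As a quick sanity check of these formulas, the following NumPy sketch (my own toy example, with arbitrary parameters for $\boldsymbol\Lambda$, $\mathbf L$, $\mathbf A$, and $\mathbf b$) computes $p(\mathbf y)$ and $p(\mathbf x|\mathbf y)$ from the theorem and compares the marginal of $\mathbf y$ against ancestral samples drawn from $p(\mathbf x)p(\mathbf y|\mathbf x)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions: x is 2-D, y is 3-D.
mu = np.array([1.0, -0.5])
Lam = np.array([[2.0, 0.3], [0.3, 1.0]])          # precision of p(x)
A = rng.normal(size=(3, 2))
b = np.array([0.1, 0.0, -0.2])
L = np.diag([4.0, 2.0, 1.0])                      # precision of p(y|x)

# Marginal p(y): mean A mu + b, covariance L^{-1} + A Lam^{-1} A^T  (BI.33, BI.34)
mean_y = A @ mu + b
cov_y = np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T

# Conditional p(x|y) for a particular observation y  (BI.35, BI.36)
y = np.array([0.5, -1.0, 0.3])
Sigma_post = np.linalg.inv(Lam + A.T @ L @ A)
mean_post = Sigma_post @ (A.T @ L @ (y - b) + Lam @ mu)

# Monte Carlo sanity check of the marginal of y.
n = 200_000
x_s = rng.multivariate_normal(mu, np.linalg.inv(Lam), size=n)
y_s = x_s @ A.T + b + rng.multivariate_normal(np.zeros(3), np.linalg.inv(L), size=n)
print(np.allclose(y_s.mean(axis=0), mean_y, atol=0.02))
print(np.allclose(np.cov(y_s, rowvar=False), cov_y, atol=0.1))
```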
Bayesian Linear Regression (3.3.1-3.3.2)
Consider a data set of inputs $\mathbf X = \{\mathbf x_1,\ldots, \mathbf x_N\}$ with corresponding target values $\mathbf t=[t_1,\ldots,t_N]^{\mathrm T}$. The expression for the likelihood function, which is a function of the adjustable parameters $\mathbf w$ and $\beta$, has the form

$$p(\mathbf t|\mathbf X,\mathbf w,\beta)=\prod_{n=1}^N \mathcal N(t_n|\mathbf w^{\mathrm T} \phi(\mathbf x_n),\beta^{-1})=\mathcal N(\mathbf t|\boldsymbol \Phi \mathbf w,\beta^{-1}\mathbf I)\tag{BI.37}$$

where $\boldsymbol \Phi=[\phi(\mathbf x_1)\cdots \phi(\mathbf x_N)]^{\mathrm T}$ is the design matrix.
We begin our discussion of the Bayesian treatment of linear regression by introducing a prior probability distribution over the model parameters $\mathbf w$. For the moment, we shall treat the noise precision parameter $\beta$ as a known constant.
Parameter distribution
To simplify the notation, we denote the likelihood function $p(\mathbf t|\mathbf X,\mathbf w,\beta)$ defined in $(\text{BI.37})$ as $p(\mathbf t|\mathbf w)$. Note that $p(\mathbf t|\mathbf w)$ is the exponential of a quadratic function of $\mathbf w$; the corresponding conjugate prior is therefore given by a Gaussian distribution of the form

$$p(\mathbf w)=\mathcal N (\mathbf w|\mathbf m_0,\mathbf S_0)\tag{BI.38}$$

having mean $\mathbf m_0$ and covariance $\mathbf S_0$.
Next we compute the posterior distribution, which is proportional to the product of the likelihood function and the prior. Due to the choice of a conjugate Gaussian prior distribution, the posterior will also be Gaussian. By applying the conclusions in [Bayes’ theorem for Gaussian variables](#Bayes’ theorem for Gaussian variables), where $\mathbf x=\mathbf w$, $\mathbf y=\mathbf t$, $\mathbf A=\boldsymbol\Phi$, $\mathbf b=\mathbf 0$, $\mathbf L=\beta\mathbf I$, $\boldsymbol\mu=\mathbf m_0$, and $\boldsymbol\Lambda=\mathbf S_0^{-1}$, we have

$$p(\mathbf w|\mathbf t)=\mathcal N(\mathbf w|\mathbf m_N,\mathbf S_N)\tag{BI.39}$$

$$\mathbf m_N=\mathbf S_N(\mathbf S_0^{-1}\mathbf m_0+\beta \boldsymbol \Phi^{\mathrm T}\mathbf t)\tag{BI.40}$$

$$\mathbf S_N^{-1}=\mathbf S_0^{-1}+\beta \boldsymbol \Phi^{\mathrm T} \boldsymbol \Phi \tag{BI.41}$$

Note that because the posterior distribution is Gaussian, its mode coincides with its mean. Thus the maximum posterior weight vector is simply given by $\mathbf w_{\mathrm{MAP}}=\mathbf m_N$.
We shall consider a particular form of Gaussian prior in order to simplify the treatment. Specifically, we consider a zero-mean isotropic Gaussian governed by a single precision parameter $\alpha$ so that

$$p(\mathbf w|\alpha)=\mathcal N(\mathbf w|\mathbf 0,\alpha^{-1}\mathbf I)\tag{BI.42}$$

and the corresponding posterior distribution over $\mathbf w$ is then given by $(\text{BI.39})$ with

$$\mathbf m_N=\beta\mathbf S_N \boldsymbol \Phi^{\mathrm T}\mathbf t\tag{BI.43}$$

$$\mathbf S_N^{-1}=\alpha \mathbf I+\beta \boldsymbol \Phi^{\mathrm T} \boldsymbol \Phi \tag{BI.44}$$
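A minimal NumPy sketch of this posterior update, assuming Gaussian basis functions and arbitrary values of $\alpha$ and $\beta$ (my choices, not from the text), could look like the following; it also generates the kind of synthetic sinusoidal data discussed later.

```python
import numpy as np

def gaussian_basis(x, centers, s=0.1):
    """Feature vector phi(x): a bias term followed by Gaussian basis functions."""
    return np.concatenate(([1.0], np.exp(-(x - centers) ** 2 / (2 * s ** 2))))

rng = np.random.default_rng(1)
alpha, beta = 2.0, 25.0                       # assumed prior precision and noise precision
centers = np.linspace(0, 1, 9)

# Synthetic sinusoidal data set
N = 25
x_train = rng.uniform(0, 1, N)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 1 / np.sqrt(beta), N)

Phi = np.array([gaussian_basis(x, centers) for x in x_train])    # N x M design matrix
S_N_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi       # BI.44
S_N = np.linalg.inv(S_N_inv)
m_N = beta * S_N @ Phi.T @ t_train                                # BI.43
```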
Predictive distribution
In practice, we are not usually interested in the value of $\mathbf w$ itself but rather in making predictions of $t$ for new values of $\mathbf x$. This requires that we evaluate the predictive distribution defined by

$$p(t|\mathbf x,\mathbf t,\alpha, \beta)=\int p(t|\mathbf x,\mathbf w,\beta)p(\mathbf w|\mathbf x,\mathbf t,\alpha, \beta)\,d\mathbf w\tag{BI.45}$$

where the conditional distribution $p(t|\mathbf x,\mathbf w,\beta)$ is given by

$$p(t|\mathbf x,\mathbf w,\beta)=\mathcal N(t|\mathbf w^{\mathrm T}\phi(\mathbf x),\beta^{-1})\tag{BI.46}$$

and $p(\mathbf w|\mathbf x,\mathbf t,\alpha, \beta)$ is given by $(\text{BI.39})$, $(\text{BI.43})$, $(\text{BI.44})$.
Similarly, using the conclusions in [Bayes’ theorem for Gaussian variables](#Bayes’ theorem for Gaussian variables), viewing $p(\mathbf w|\mathbf x,\mathbf t,\alpha, \beta)$ as the prior and $p(t|\mathbf x,\mathbf w,\beta)$ as the likelihood, we obtain the marginal distribution $p(t|\mathbf x,\mathbf t,\alpha, \beta)$ from $(\text{BI.33})$ and $(\text{BI.34})$:

$$p(t | \mathbf{x}, \mathbf{t}, \alpha, \beta)=\mathcal{N}\left(t | \mathbf{m}_{N}^{\mathrm T}\phi(\mathbf{x}), \sigma_{N}^{2}(\mathbf{x})\right)\tag{BI.47}$$

where the variance $\sigma_{N}^{2}(\mathbf{x})$ of the predictive distribution is given by

$$\sigma_{N}^{2}(\mathbf{x})=\frac{1}{\beta}+\phi(\mathbf{x})^{\mathrm{T}} \mathbf{S}_{N} \phi(\mathbf{x})\tag{BI.48}$$
Note: To see this better, simplify the notation of $(\text{BI.45})$ as

$$p(t|\mathbf t)=\int p(t,\mathbf w|\mathbf t)\,d\mathbf w=\int p(t|\mathbf w,\mathbf t)p(\mathbf w|\mathbf t)\,d\mathbf w$$

The first term in $(\text{BI.48})$ represents the noise on the data, whereas the second term reflects the uncertainty associated with the parameters $\mathbf w$. Because the noise process and the distribution of $\mathbf w$ are independent Gaussians, their variances are additive.
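Continuing the earlier sketch (and reusing its `m_N`, `S_N`, `gaussian_basis`, `centers`, and `beta`), the predictive mean and variance of $(\text{BI.47})$ and $(\text{BI.48})$ at a new input can be evaluated as follows; this is only an illustrative sketch.

```python
def predict(x_new, m_N, S_N, centers, beta):
    """Predictive mean and variance of t at x_new (BI.47, BI.48)."""
    phi = gaussian_basis(x_new, centers)
    mean = m_N @ phi
    var = 1.0 / beta + phi @ S_N @ phi
    return mean, var

print(predict(0.3, m_N, S_N, centers, beta))
```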
As an illustration of the predictive distribution for Bayesian linear regression models, let us discuss the synthetic sinusoidal data set example. In Figure 3.8, we fit a model comprising a linear combination of Gaussian basis functions to data sets of various sizes and then look at the corresponding posterior distributions.
Here the green curves correspond to the function $\sin(2\pi x)$ from which the data points were generated (with the addition of Gaussian noise). Data sets of size $N=1$, $N=2$, $N=4$, and $N=25$ are shown in the four plots by the blue circles. For each plot, the red curve shows the mean of the corresponding Gaussian predictive distribution, and the red shaded region spans one standard deviation on either side of the mean. Note that the predictive uncertainty depends on $\mathbf x$ and is smallest in the neighborhood of the data points. Also note that the level of uncertainty decreases as more data points are observed.

The plots in Figure 3.8 only show the point-wise predictive variance as a function of $\mathbf x$. In order to gain insight into the covariance between the predictions at different values of $\mathbf x$, we can draw samples from the posterior distribution over $\mathbf w$, and then plot the corresponding functions $y(\mathbf x,\mathbf w)$, as shown in Figure 3.9.
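Drawing such function samples is straightforward once the posterior is available. The snippet below is a minimal sketch of the idea behind Figure 3.9, again reusing `rng`, `m_N`, `S_N`, `gaussian_basis`, and `centers` from the earlier illustration.

```python
# Draw weight vectors from the posterior and evaluate the corresponding functions y(x, w).
x_grid = np.linspace(0, 1, 100)
Phi_grid = np.array([gaussian_basis(x, centers) for x in x_grid])
w_samples = rng.multivariate_normal(m_N, S_N, size=5)
y_samples = w_samples @ Phi_grid.T        # each row is one sampled function over x_grid
```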
Graphical Models (8.2.1-8.2.2)
Bayesian Networks
In order to motivate the use of directed graphs to describe probability distributions, consider first an arbitrary joint distribution $p(a, b, c)$ over three variables $a$, $b$, and $c$. We can write the joint distribution in the form

$$p(a,b,c)=p(c|a,b)p(a,b)=p(c|a,b)p(b|a)p(a)\tag{GM.1}$$
Note that at this stage, we do not need to specify anything further about these variables, such as whether they are discrete or continuous.
We now represent the right-hand side of $(\text{GM.1})$ in terms of a simple graphical model as follows.
Example:
The joint distribution corresponding to Fig. 8.2 is given by
$$p(x_1)p(x_2)p(x_3)p(x_4|x_1,x_2,x_3)p(x_5|x_1,x_3)p(x_6|x_4)p(x_7|x_4,x_5)$$
An important concept for probability distributions over multiple variables is that of conditional independence. Consider three variables $a$, $b$, and $c$, and suppose that the conditional distribution of $a$, given $b$ and $c$, is such that it does not depend on the value of $b$, so that

$$p(a|b,c)=p(a|c)\tag{GM.2}$$

We say that $a$ is conditionally independent of $b$ given $c$. This can be expressed in a slightly different way if we consider the joint distribution of $a$ and $b$ conditioned on $c$, which we can write in the form
$$p(a,b|c)=p(a|b,c)p(b |c)=p(a|c)p(b|c)\tag{GM.3}$$
Thus we see that, conditioned on $c$, the joint distribution of $a$ and $b$ factorizes into the product of the marginal distribution of $a$ and the marginal distribution of $b$ (again both conditioned on $c$). This says that the variables $a$ and $b$ are statistically independent, given $c$. We shall sometimes use a shorthand notation for conditional independence, in which

$$a \perp b \mid c\tag{GM.4}$$
Three Example Graphs
We begin our discussion of the conditional independence properties of directed graphs by considering three simple examples each involving graphs having just three nodes. Together, these will motivate and illustrate the key concepts of d-separation.
- tail-to-tail
The joint distribution corresponding to this graph is
$$p(a,b,c)=p(a|c)p(b|c)p(c)\tag{GM.5}$$
If none of the variables are observed, then we can investigate whether $a$ and $b$ are independent by marginalizing both sides of $(\text{GM.5})$ with respect to $c$ to give

$$p(a,b)=\sum_{c}p(a|c)p(b|c)p(c)\tag{GM.6}$$

In general, this does not factorize into the product $p(a)p(b)$, and so

$$a\not\perp b \mid \emptyset\tag{GM.7}$$
Now suppose we condition on the variable $c$, as represented by the graph above. From $(\text{GM.5})$, we can easily write down the conditional distribution of $a$ and $b$, given $c$, in the form

$$p(a,b|c)=\frac{p(a,b,c)}{p(c)}=p(a|c)p(b|c)\tag{GM.8}$$

and so we obtain the conditional independence property

$$a \perp b \mid c\tag{GM.4}$$
- Chain
The joint distribution corresponding to this graph is
$$p(a, b, c)=p(a) p(c | a) p(b| c)\tag{GM.9}$$
First of all, suppose that none of the variables are observed. Again, we can test to see if $a$ and $b$ are independent by marginalizing over $c$ to give

$$p(a, b)=p(a) \sum_{c} p(c | a) p(b |c)=p(a) p(b | a)\tag{GM.10}$$

which in general does not factorize into $p(a)p(b)$, and so

$$a\not\perp b \mid \emptyset\tag{GM.7}$$
as before.
Now suppose we condition on node $c$. Using Bayes’ theorem, together with $(\text{GM.9})$, we obtain

$$\begin{aligned} p(a, b | c) &=\frac{p(a, b, c)}{p(c)} \\ &=\frac{p(a) p(c | a) p(b | c)}{p(c)} \\ &=p(a | c) p(b | c) \end{aligned}\tag{GM.11}$$

and so again we obtain the conditional independence property

$$a \perp b \mid c\tag{GM.4}$$
- Collider/head-to-head
The joint distribution can again be written as
$$p(a, b, c)=p(a) p(b) p(c | a, b)\tag{GM.12}$$
Consider first the case where none of the variables are observed. Marginalizing both sides of $(\text{GM.12})$ over $c$ we obtain

$$p(a, b)=p(a) p(b)\tag{GM.13}$$

and so $a$ and $b$ are independent with no variables observed, in contrast to the two previous examples. We can write this result as

$$a \perp b \mid \emptyset \tag{GM.14}$$
Now suppose we condition on $c$. The conditional distribution of $a$ and $b$ is then given by

$$\begin{aligned} p(a, b | c) &=\frac{p(a, b, c)}{p(c)} \\ &=\frac{p(a) p(b) p(c | a, b)}{p(c)} \end{aligned}\tag{GM.15}$$

which in general does not factorize into the product $p(a) p(b)$, and so

$$a\not\perp b \mid c\tag{GM.16}$$
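This "explaining away" behaviour of the collider is easy to check numerically. The following sketch is my own toy example (two independent fair binary variables and a deterministic $c = a \oplus b$, i.e. XOR); it confirms that $a$ and $b$ are marginally independent but become dependent once $c$ is observed.

```python
import itertools
import numpy as np

# Toy collider: a and b are independent fair coins, c = a XOR b.
p = {}
for a, b in itertools.product([0, 1], repeat=2):
    c = a ^ b
    p[(a, b, c)] = 0.25          # p(a) p(b) p(c|a,b) with a deterministic c

def marginal(table, keep):
    """Marginalize the joint table down to the variable positions in `keep` (indices into (a, b, c))."""
    out = {}
    for k, v in table.items():
        key = tuple(k[i] for i in keep)
        out[key] = out.get(key, 0.0) + v
    return out

p_ab = marginal(p, [0, 1])
p_a, p_b = marginal(p, [0]), marginal(p, [1])
print(all(np.isclose(p_ab[(a, b)], p_a[(a,)] * p_b[(b,)])
          for a in [0, 1] for b in [0, 1]))          # True: a and b are marginally independent

# Conditioning on c = 0: p(a, b | c=0) no longer factorizes.
p_ab_given_c0 = {k[:2]: v / 0.5 for k, v in p.items() if k[2] == 0}
print(p_ab_given_c0)  # only (0, 0) and (1, 1) carry mass, so a and b are perfectly correlated given c
```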
D-separation
We now give a general statement of the d-separation property for directed graphs. Consider a general directed graph in which $A$, $B$, and $C$ are arbitrary nonintersecting sets of nodes (whose union may be smaller than the complete set of nodes in the graph). We wish to ascertain whether a particular conditional independence statement $A \perp B \mid C$ is implied by a given directed acyclic graph.

To do so, we consider all possible paths from any node in $A$ to any node in $B$. Any such path is said to be blocked if it includes a node such that either

- the arrows on the path meet either head-to-tail or tail-to-tail at the node, and the node is in the set $C$, or
- the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, is in the set $C$.

If all paths are blocked, then $A$ is said to be d-separated from $B$ by $C$, and the joint distribution over all of the variables in the graph will satisfy $A \perp B \mid C$.
The concept of d-separation is illustrated in Figure 8.22. In graph (a), the path from $a$ to $b$ is not blocked by node $f$, because it is a tail-to-tail node for this path and is not observed, nor is it blocked by node $e$, because, although the latter is a head-to-head node, it has a descendant $c$ that is in the conditioning set. Thus the conditional independence statement $a \perp b \mid c$ does *not* follow from this graph. In graph (b), the path from $a$ to $b$ is blocked by node $f$, because this is a tail-to-tail node that is observed, and so the conditional independence property $a \perp b \mid f$ will be satisfied by any distribution that factorizes according to this graph. Note that this path is also blocked by node $e$, because $e$ is a head-to-head node and neither it nor its descendants are in the conditioning set.