General Bayesian Estimators (MMSE, MAP, LMMSE)
Reference:
Kay S M. Fundamentals of statistical signal processing[M]. Prentice Hall PTR, 1993. (Ch. 11 - 11.5, Ch. 12 - 12.5)
Slides of ET4386, TUD
Content
Motivation:
-
The optimal Bayesian estimators discussed in the previous chapter are difficult to determine in closed form, and in practice too computationally intensive to implement.
-
We now study more general Bayesian estimators and their properties. To do so, the concept of the Bayesian risk function is discussed.
-
Minimization of this criterion results in a variety of estimators. The ones that we will concentrate on are the MMSE estimator and the maximum a posteriori estimator.
-
They involve multidimensional integration for the MMSE estimator and multidimensional maximization for the MAP estimator. Although under the jointly Gaussian assumption these estimators are easily found, in general, they are not. When we are unable to make the Gaussian assumption, another approach must be used. To fill this gap we can choose to retain the MMSE criterion but constrain the estimator to be linear. Then, an explicit form for the estimator may be determined which depends only on the first two moments of the PDF. In many ways this approach is analogous to the BLUE in classical estimation, and some parallels will become apparent.
Risk Functions
Previously, we had derived the MMSE estimator by minimizing
E
[
(
θ
−
θ
^
)
2
]
E[(\theta-\hat \theta)^2]
E[(θ−θ^)2], where the expectation is with respect to the PDF
p
(
x
,
θ
)
p(\mathbf x,\theta)
p(x,θ). If we let
ϵ
=
θ
−
θ
^
\epsilon=\theta-\hat \theta
ϵ=θ−θ^ denote the error of the estimator for a particular realization of
x
\mathbf x
x and
θ
\theta
θ, and also let
C
(
ϵ
)
=
ϵ
2
\mathcal C(\epsilon)=\epsilon^2
C(ϵ)=ϵ2, then the MSE criterion minimizes
E
[
C
(
ϵ
)
]
E[\mathcal C(\epsilon)]
E[C(ϵ)].
C
(
ϵ
)
\mathcal C(\epsilon)
C(ϵ) is termed as cost function, and the average cost or
E
[
C
(
ϵ
)
]
E[\mathcal C (\epsilon)]
E[C(ϵ)] is termed the Bayes risk
R
\mathcal R
R or
R
=
E
[
C
(
ϵ
)
]
(RF.1)
\mathcal R=E[\mathcal C(\epsilon)]\tag{RF.1}
R=E[C(ϵ)](RF.1)
If
C
(
ϵ
)
=
ϵ
2
\mathcal C (\epsilon)=\epsilon^2
C(ϵ)=ϵ2, then the cost function is quadratic and the Bayes risk is just the MSE. Of course, there is no need to restrict ourselves to quadratic cost functions. In Figure 11.1b we have the “absolute” error
C
(
ϵ
)
=
∣
ϵ
∣
(RF.2)
\mathcal C(\epsilon)=|\epsilon|\tag{RF.2}
C(ϵ)=∣ϵ∣(RF.2)
In Figure 11.1c the “hit-or-miss” cost function is displayed
C
(
ϵ
)
=
{
0
∣
ϵ
∣
<
δ
1
∣
ϵ
∣
>
δ
(RF.3)
\mathcal C(\epsilon)=\left\{ \begin{array}{cc} 0 & |\epsilon|<\delta \\ 1 & |\epsilon|>\delta \end{array} \right.\tag{RF.3}
C(ϵ)={01∣ϵ∣<δ∣ϵ∣>δ(RF.3)
We already have seen that the Bayes risk is minimized for a quadratic cost function by the MMSE estimator θ ^ = E ( θ ∣ x ) \hat\theta=E(\theta|\mathbf x) θ^=E(θ∣x). We now determine the optimal estimators for the other cost functions.
The Bayes risk
R
\mathcal R
R is
R
=
E
[
C
(
ϵ
)
]
=
∫
∫
C
(
θ
−
θ
^
)
p
(
x
,
θ
)
d
x
d
θ
=
∫
[
∫
C
(
θ
−
θ
^
)
p
(
θ
∣
x
)
d
θ
]
p
(
x
)
d
x
(RF.4)
\begin{aligned} \mathcal R&=E[\mathcal C(\epsilon)]\\ &=\int \int \mathcal C(\theta-\hat \theta)p(\mathbf x,\theta)d\mathbf x d\theta \\ &=\int \left[\int \mathcal C(\theta-\hat \theta)p(\theta|\mathbf x)d \theta\right]p(\mathbf x)d \mathbf x \end{aligned}\tag{RF.4}
R=E[C(ϵ)]=∫∫C(θ−θ^)p(x,θ)dxdθ=∫[∫C(θ−θ^)p(θ∣x)dθ]p(x)dx(RF.4)
We attempt to minimize the inner integral
g
(
θ
^
)
=
∫
C
(
θ
−
θ
^
)
p
(
θ
∣
x
)
d
θ
g(\hat \theta)=\int \mathcal C(\theta-\hat \theta)p(\theta|\mathbf x)d \theta
g(θ^)=∫C(θ−θ^)p(θ∣x)dθ for each
x
\mathbf x
x.
-
absolute error cost function
g ( θ ^ ) = ∫ ∣ θ − θ ^ ∣ p ( θ ∣ x ) d θ = ∫ − ∞ θ ^ ( θ ^ − θ ) p ( θ ∣ x ) d θ + ∫ θ ^ ∞ ( θ − θ ^ ) p ( θ ∣ x ) d θ \begin{aligned} g(\hat \theta)&=\int|\theta-\hat \theta|p(\theta|\mathbf x)d\theta\\ &=\int_{-\infty}^{\hat \theta} (\hat \theta-\theta)p(\theta|\mathbf x)d\theta+\int_{\hat \theta}^\infty (\theta-\hat \theta)p(\theta|\mathbf x)d\theta \end{aligned} g(θ^)=∫∣θ−θ^∣p(θ∣x)dθ=∫−∞θ^(θ^−θ)p(θ∣x)dθ+∫θ^∞(θ−θ^)p(θ∣x)dθ
Differentiating w.r.t. θ ^ \hat \theta θ^ using Leibnitz’s rule
∂ ∂ u ∫ ϕ 1 ( u ) ϕ 2 ( u ) h ( u , v ) d v = ∫ ϕ 1 ( u ) ϕ 2 ( u ) ∂ h ( u , v ) ∂ u d v + d ϕ 2 ( u ) d u h ( u , ϕ 2 ( u ) ) − d ϕ 1 ( u ) d u h ( u , ϕ 1 ( u ) ) \frac{\partial}{\partial u}\int_{\phi_1(u)}^{\phi_2(u)}h(u,v)dv=\int_{\phi_1(u)}^{\phi_2(u)}\frac{\partial h(u,v)}{\partial u}dv+\frac{d\phi_2(u)}{du}h(u,\phi_2(u))-\frac{d\phi_1(u)}{du}h(u,\phi_1(u)) ∂u∂∫ϕ1(u)ϕ2(u)h(u,v)dv=∫ϕ1(u)ϕ2(u)∂u∂h(u,v)dv+dudϕ2(u)h(u,ϕ2(u))−dudϕ1(u)h(u,ϕ1(u))
we have
d g ( θ ^ ) d θ ^ = ∫ − ∞ θ ^ p ( θ ∣ x ) d θ − ∫ θ ^ ∞ p ( θ ∣ x ) d θ = 0 \frac{dg(\hat\theta)}{d\hat \theta}=\int_{-\infty}^{\hat \theta} p(\theta|\mathbf x)d\theta -\int_{\hat \theta}^\infty p(\theta|\mathbf x)d\theta=0 dθ^dg(θ^)=∫−∞θ^p(θ∣x)dθ−∫θ^∞p(θ∣x)dθ=0
or
∫ − ∞ θ ^ p ( θ ∣ x ) d θ = ∫ θ ^ ∞ p ( θ ∣ x ) d θ \int_{-\infty}^{\hat \theta} p(\theta|\mathbf x)d\theta =\int_{\hat \theta}^\infty p(\theta|\mathbf x)d\theta ∫−∞θ^p(θ∣x)dθ=∫θ^∞p(θ∣x)dθ
By definition θ ^ \hat \theta θ^ is the median of the posterior PDF or the point for which Pr { θ ≤ θ ^ ∣ x } = 1 / 2 \Pr \{ \theta\le \hat \theta|\mathbf x\}=1/2 Pr{θ≤θ^∣x}=1/2. -
hit or miss cost function
g ( θ ^ ) = ∫ − ∞ θ ^ − δ 1 ⋅ p ( θ ∣ x ) d θ + ∫ θ ^ + δ ∞ 1 ⋅ p ( θ ∣ x ) d θ = 1 − ∫ θ ^ − δ θ ^ + δ p ( θ ∣ x ) d θ \begin{aligned} g(\hat{\theta})=&\int_{-\infty}^{\hat{\theta}-\delta} 1 \cdot p(\theta | \mathbf{x}) d \theta+\int_{\hat{\theta}+\delta}^{\infty} 1 \cdot p(\theta |\mathbf{x}) d \theta\\ =&1-\int_{\hat \theta-\delta}^{\hat \theta+\delta}p(\theta|\mathbf x)d\theta \end{aligned} g(θ^)==∫−∞θ^−δ1⋅p(θ∣x)dθ+∫θ^+δ∞1⋅p(θ∣x)dθ1−∫θ^−δθ^+δp(θ∣x)dθ
This is minimized by maximizing
∫ θ ˙ − δ θ ^ + δ p ( θ ∣ x ) d θ \int_{\dot{\theta}-\delta}^{\hat{\theta}+\delta} p(\theta | \mathbf{x}) d \theta ∫θ˙−δθ^+δp(θ∣x)dθ
For δ \delta δ arbitrarily small this is maximized by choosing θ ^ \hat \theta θ^ to correspond to the location of the maximum of p ( θ ∣ x ) p(\theta|\mathbf x) p(θ∣x). The estimator that minimizes the Bayes risk for the “hit-or-miss” cost function is therefore the mode (location of the maximum) of the posterior PDF. It is termed the maximum a posteriori (MAP) estimator and will be described in more detail later.
In summary, the estimators that minimize the Bayes risk for the cost functions of Figure 11.1 are the mean, median, and mode of the posterior PDF. This is illustrated in Figure 11.2a. For some posterior PDFs these three estimators are identical. A notable example is the Gaussian posterior PDF
p
(
θ
∣
x
)
=
1
2
π
σ
θ
∣
x
2
exp
[
−
1
2
σ
θ
∣
x
2
(
θ
−
μ
θ
∣
x
)
2
]
p(\theta|\mathbf x)=\frac{1}{\sqrt{2\pi \sigma^2_{\theta|x}}}\exp\left[-\frac {1}{2\sigma^2_{\theta|x}}(\theta-\mu_{\theta|x})^2 \right]
p(θ∣x)=2πσθ∣x21exp[−2σθ∣x21(θ−μθ∣x)2]
The mean
μ
θ
∣
x
\mu_{\theta|x}
μθ∣x is identical to the median (due to the symmetry) and the mode, as illustrated in Figure 11.2b.
Minimum Mean Square Error Estimators (MMSE)
In The Bayesian Philosophy the MMSE estimator was determined to be E ( θ ∣ x ) E(\theta|\mathbf x) E(θ∣x) or the mean of the posterior PDF. We continue our discussion of this important estimator by first extending it to the vector parameter case and then studying some of its properties.
If
θ
\boldsymbol \theta
θ is a vector parameter of dimension
p
×
1
p \times 1
p×1, then to estimate
θ
1
\theta_1
θ1, for example, we may view the remaining parameters as nuisance parameters
p
(
θ
1
∣
x
)
=
∫
⋯
∫
p
(
θ
∣
x
)
d
θ
2
⋯
d
θ
p
(MMSE.1)
p(\theta_1|\mathbf x)=\int \cdots \int p(\boldsymbol \theta|\mathbf x)d\theta_2\cdots d\theta_p \tag{MMSE.1}
p(θ1∣x)=∫⋯∫p(θ∣x)dθ2⋯dθp(MMSE.1)
where
p
(
θ
∣
x
)
=
p
(
x
∣
θ
)
p
(
θ
)
∫
p
(
x
∣
θ
)
p
(
θ
)
d
θ
(MMSE.2)
p(\boldsymbol \theta|\mathbf x)=\frac{p(\mathbf x|\boldsymbol \theta)p(\boldsymbol \theta)}{\int p(\mathbf x|\boldsymbol \theta)p(\boldsymbol \theta)d\boldsymbol \theta} \tag{MMSE.2}
p(θ∣x)=∫p(x∣θ)p(θ)dθp(x∣θ)p(θ)(MMSE.2)
Then we have
θ
^
i
=
E
(
θ
i
∣
x
)
=
∫
θ
i
p
(
θ
i
∣
x
)
d
θ
i
,
i
=
1
,
2
,
⋯
,
p
(MMSE.3)
\hat \theta_i =E(\theta_i|\mathbf x)=\int \theta_ip(\theta_i|\mathbf x)d\theta_i, \quad i=1,2,\cdots,p \tag{MMSE.3}
θ^i=E(θi∣x)=∫θip(θi∣x)dθi,i=1,2,⋯,p(MMSE.3)
This is the MMSE estimator that minimizes
E
[
(
θ
i
−
θ
^
i
)
2
]
=
∫
∫
(
θ
i
−
θ
^
i
)
2
p
(
x
,
θ
i
)
d
x
d
θ
i
(MMSE.4)
E[(\theta_i-\hat \theta_i)^2]=\int\int (\theta_i-\hat \theta_i)^2 p(\mathbf x,\theta_i)d\mathbf x d\theta_i \tag{MMSE.4}
E[(θi−θ^i)2]=∫∫(θi−θ^i)2p(x,θi)dxdθi(MMSE.4)
or the squared error when averaged with respect to the marginal PDF
p
(
x
,
θ
i
)
p(\mathbf x,\theta_i)
p(x,θi). Alternatively, we can express the MMSE estimator for the first parameter from
(
M
M
S
E
.
1
)
(MMSE.1)
(MMSE.1)
θ
^
1
=
∫
θ
1
p
(
θ
1
∣
x
)
d
θ
1
=
∫
θ
1
[
∫
⋯
∫
p
(
θ
∣
x
)
d
θ
2
⋯
d
θ
p
]
d
θ
1
=
∫
θ
1
p
(
θ
∣
x
)
d
θ
\begin{aligned} \hat \theta_1 &= \int \theta_1 p(\theta_1|\mathbf x)d\theta_1\\ &= \int \theta_1 \left[\int \cdots \int p(\boldsymbol \theta|\mathbf x)d\theta_2\cdots d\theta_p\right]d\theta_1\\ &=\int \theta_1 p(\boldsymbol \theta|\mathbf x)d\boldsymbol \theta \end{aligned}
θ^1=∫θ1p(θ1∣x)dθ1=∫θ1[∫⋯∫p(θ∣x)dθ2⋯dθp]dθ1=∫θ1p(θ∣x)dθ
or in general
θ
^
i
=
∫
θ
i
p
(
θ
∣
x
)
d
θ
,
i
=
1
,
2
,
⋯
,
p
\hat \theta_i=\int \theta_i p(\boldsymbol \theta|\mathbf x)d\boldsymbol \theta,\quad i=1,2,\cdots,p
θ^i=∫θip(θ∣x)dθ,i=1,2,⋯,p
In vector form we have
θ
^
=
[
∫
θ
1
p
(
θ
∣
x
)
d
θ
∫
θ
2
p
(
θ
∣
x
)
d
θ
⋮
∫
θ
p
p
(
θ
∣
x
)
d
θ
]
=
∫
θ
p
(
θ
∣
x
)
d
θ
=
E
(
θ
∣
x
)
(MMSE.5)
\hat {\boldsymbol \theta}=\left[\begin{matrix}\int \theta_1 p(\boldsymbol \theta|\mathbf x)d\boldsymbol \theta\\ \int \theta_2 p(\boldsymbol \theta|\mathbf x)d\boldsymbol \theta\\ \vdots\\ \int \theta_p p(\boldsymbol \theta|\mathbf x)d\boldsymbol \theta \end{matrix} \right]=\int \boldsymbol \theta p(\boldsymbol \theta|\mathbf x)d\boldsymbol \theta=E(\boldsymbol \theta|\mathbf x)\tag{MMSE.5}
θ^=⎣⎢⎢⎢⎡∫θ1p(θ∣x)dθ∫θ2p(θ∣x)dθ⋮∫θpp(θ∣x)dθ⎦⎥⎥⎥⎤=∫θp(θ∣x)dθ=E(θ∣x)(MMSE.5)
where the expectation is with respect to the posterior PDF of the vector parameter or
p
(
θ
∣
x
)
p(\boldsymbol \theta|\mathbf x)
p(θ∣x).
According to the definition, the minimum Bayesian MSE for
θ
^
i
\hat\theta_i
θ^i
Bmse
(
θ
^
1
)
=
E
[
(
θ
1
−
θ
^
1
)
2
]
=
∫
∫
(
θ
1
−
θ
^
1
)
2
p
(
x
,
θ
1
)
d
θ
1
d
x
\operatorname{Bmse}(\hat \theta_1)=E[(\theta_1-\hat {\theta}_1)^2]=\int \int(\theta_1-\hat {\theta}_1)^2p(\mathbf x,\theta_1) d\theta_1 d\mathbf x
Bmse(θ^1)=E[(θ1−θ^1)2]=∫∫(θ1−θ^1)2p(x,θ1)dθ1dx
and since
θ
^
1
=
E
(
θ
1
∣
x
)
\hat \theta_1=E(\theta_1|\mathbf x)
θ^1=E(θ1∣x), we have
Bmse
(
θ
^
1
)
=
∫
∫
[
θ
1
−
E
(
θ
1
∣
x
)
]
2
p
(
x
,
θ
1
)
d
θ
1
d
x
=
∫
[
∫
[
θ
1
−
E
(
θ
1
∣
x
)
]
2
p
(
θ
1
∣
x
)
d
θ
1
]
p
(
x
)
d
x
\begin{aligned} \operatorname{Bmse}(\hat \theta_1)&=\int \int [\theta_1-E(\theta_1|\mathbf x)]^2p(\mathbf x,\theta_1) d\theta_1 d\mathbf x\\ &=\int \left[\int [\theta_1-E(\theta_1|\mathbf x)]^2p(\theta_1|\mathbf x) d\theta_1 \right]p(\mathbf x)d\mathbf x \end{aligned}
Bmse(θ^1)=∫∫[θ1−E(θ1∣x)]2p(x,θ1)dθ1dx=∫[∫[θ1−E(θ1∣x)]2p(θ1∣x)dθ1]p(x)dx
The posterior PDF can be written as
p
(
θ
1
∣
x
)
=
∫
⋯
∫
p
(
θ
∣
x
)
d
θ
2
⋯
d
θ
p
p(\theta_1|\mathbf x)=\int \cdots \int p(\boldsymbol \theta|\mathbf x)d\theta_2\cdots d\theta _p
p(θ1∣x)=∫⋯∫p(θ∣x)dθ2⋯dθp
so that
Bmse
(
θ
^
1
)
=
∫
[
∫
[
θ
1
−
E
(
θ
1
∣
x
)
]
2
p
(
θ
∣
x
)
d
θ
]
p
(
x
)
d
x
(MMSE.6)
\operatorname{Bmse}(\hat \theta_1)=\int \left[\int [\theta_1-E(\theta_1|\mathbf x)]^2p(\boldsymbol \theta|\mathbf x) d\boldsymbol \theta \right]p(\mathbf x)d\mathbf x\tag{MMSE.6}
Bmse(θ^1)=∫[∫[θ1−E(θ1∣x)]2p(θ∣x)dθ]p(x)dx(MMSE.6)
which is just the
[
1
,
1
]
[1,1]
[1,1] element of
C
θ
∣
x
\mathbf C_{\theta|x}
Cθ∣x, the covariance matrix of the posterior PDF. Hence, in general we have the minimum Bayesian MSE is
Bmse
(
θ
^
i
)
=
∫
[
C
θ
∣
x
]
i
i
p
(
x
)
d
x
i
=
1
,
2
,
⋯
,
p
(MMSE.7)
\operatorname{Bmse}(\hat \theta_i)=\int [\mathbf C_{\theta|x}]_{ii}p(\mathbf x)d\mathbf x\quad i=1,2,\cdots,p\tag{MMSE.7}
Bmse(θ^i)=∫[Cθ∣x]iip(x)dxi=1,2,⋯,p(MMSE.7)
where
C
θ
∣
x
=
E
θ
∣
x
[
(
θ
−
E
(
θ
∣
x
)
)
(
θ
−
E
(
θ
∣
x
)
)
T
]
(MMSE.8)
\mathbf C_{\theta|x}=E_{\theta|x}[(\boldsymbol \theta-E(\boldsymbol \theta|\mathbf x))(\boldsymbol \theta-E(\boldsymbol \theta|\mathbf x))^T]\tag{MMSE.8}
Cθ∣x=Eθ∣x[(θ−E(θ∣x))(θ−E(θ∣x))T](MMSE.8)
We can obtain
θ
^
\hat {\boldsymbol \theta}
θ^ and
C
θ
∣
x
\mathbf C_{\theta|x}
Cθ∣x from Theorem 10.2 mentioned in The Bayesian Philosophy.
Theorem 10.2 (Conditional PDF of Multivariate Gaussian) If x \mathbf{x} x and y \mathbf y y are jointly Gaussian, where x \mathbf{x} x is k × 1 k \times 1 k×1 and y \mathbf{y} y is l × 1 , l \times 1, l×1, with mean vector [ E ( x ) T E ( y ) T ] T \left[E(\mathbf{x})^{T} E(\mathbf{y})^{T}\right]^{T} [E(x)TE(y)T]T and
partitioned covariance matrix
C = [ C x x C x y C y x C y y ] = [ k × k k × l l × k l × l ] (BF.25) \mathbf{C}=\left[\begin{array}{ll} \mathbf{C}_{x x} & \mathbf{C}_{x y} \\ \mathbf{C}_{y x} & \mathbf{C}_{y y} \end{array}\right]=\left[\begin{array}{ll} k \times k & k \times l \\ l \times k & l \times l \end{array}\right]\tag{BF.25} C=[CxxCyxCxyCyy]=[k×kl×kk×ll×l](BF.25)
so that
p ( x , y ) = 1 ( 2 π ) k + 1 2 det 1 2 ( C ) exp [ − 1 2 ( [ x − E ( x ) y − E ( y ) ] ) T C − 1 ( [ x − E ( x ) y − E ( y ) ] ) ] p(\mathbf{x}, \mathbf{y})=\frac{1}{(2 \pi)^{\frac{k+1}{2}} \operatorname{det}^{\frac{1}{2}}(\mathbf{C})} \exp \left[-\frac{1}{2}\left(\left[\begin{array}{l} \mathbf{x}-E(\mathbf{x}) \\ \mathbf{y}-E(\mathbf{y}) \end{array}\right]\right)^{T} \mathbf{C}^{-1}\left(\left[\begin{array}{l} \mathbf{x}-E(\mathbf{x}) \\ \mathbf{y}-E(\mathbf{y}) \end{array}\right]\right)\right] p(x,y)=(2π)2k+1det21(C)1exp[−21([x−E(x)y−E(y)])TC−1([x−E(x)y−E(y)])]
Then the conditional PDF p ( y ∣ x ) p(\mathbf y|\mathbf x) p(y∣x) is also Gaussian and
E ( y ∣ x ) = E ( y ) + C y x C x x − 1 ( x − E ( x ) ) (BF.26) E(\mathbf y|\mathbf x)=E(\mathbf y)+\mathbf C_{yx} \mathbf C_{xx}^{-1}(\mathbf x-E(\mathbf x))\tag{BF.26} E(y∣x)=E(y)+CyxCxx−1(x−E(x))(BF.26)C y ∣ x = C y y − C y x C x x − 1 C x y (BF.27) \mathbf C_{y|x}=\mathbf C_{yy}-\mathbf C_{yx}\mathbf C_{xx}^{-1}\mathbf C_{xy}\tag{BF.27} Cy∣x=Cyy−CyxCxx−1Cxy(BF.27)
For general linear model, we can apply the result in Theorem 10.3.
Theorem 10.3 (Posterior PDF for the Bayesian General Linear Model) If the observed data x \mathbf x x can be modeled as
x = H θ + w \mathbf{x}=\mathbf{H} \boldsymbol{\theta}+\mathbf{w} x=Hθ+w
where x \mathbf{x} x is an N × 1 N \times 1 N×1 data vector, H \mathbf{H} H is a known N × p N \times p N×p matrix, θ \boldsymbol{\theta} θ is a p × 1 p \times 1 p×1 random vector with prior PDF N ( μ θ , C θ ) , \mathcal{N}\left(\mu_{\theta}, \mathbf{C}_{\theta}\right), N(μθ,Cθ), and w \mathbf{w} w is an N × 1 N \times 1 N×1 noise vector with PDF N ( 0 , C w ) \mathcal{N}\left(\mathbf{0}, \mathbf{C}_{w}\right) N(0,Cw) and independent of θ , \boldsymbol \theta, θ, then the posterior PDF p ( θ ∣ x ) p(\boldsymbol\theta |\mathbf{x}) p(θ∣x) is Gaussian with mean
E ( θ ∣ x ) = μ θ + C θ H T ( H C θ H T + C w ) − 1 ( x − H μ θ ) (BF.29) E(\boldsymbol{\theta} |\mathbf{x})=\boldsymbol{\mu}_{\theta}+\mathbf{C}_{\theta} \mathbf{H}^{T}\left(\mathbf{H} \mathbf{C}_{\theta} \mathbf{H}^{T}+\mathbf{C}_{w}\right)^{-1}\left(\mathbf{x}-\mathbf{H} \boldsymbol{\mu}_{\theta}\right)\tag{BF.29} E(θ∣x)=μθ+CθHT(HCθHT+Cw)−1(x−Hμθ)(BF.29)
and covariance
C θ ∣ x = C θ − C θ H T ( H C θ H T + C w ) − 1 H C θ (BF.30) \mathbf{C}_{\theta \mid x}=\mathbf{C}_{\theta}-\mathbf{C}_{\theta} \mathbf{H}^{T}\left(\mathbf{H} \mathbf{C}_{\theta} \mathbf{H}^{T}+\mathbf{C}_{w}\right)^{-1} \mathbf{H} \mathbf{C}_{\theta}\tag{BF.30} Cθ∣x=Cθ−CθHT(HCθHT+Cw)−1HCθ(BF.30)
In contrast to the classical general linear model, H \mathbf{H} H need not be full rank to ensure the invertibility of H C θ H T + C w . \mathbf{H C}_{\theta} \mathbf{H}^{T}+\mathbf{C}_{w} . HCθHT+Cw.Alternative formulation using Matrix inversion lemma:
E ( θ ∣ x ) = μ θ + ( C θ − 1 + H T C w − 1 H ) − 1 H T C w − 1 ( x − H μ θ ) (BF.31) E(\boldsymbol{\theta} |\mathbf{x})=\boldsymbol\mu_{\theta}+\left(\mathbf{C}_{\theta}^{-1}+\mathbf{H}^{T} \mathbf{C}_w^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}_w^{-1}\left(\mathbf{x}-\mathbf{H} \boldsymbol\mu_{\theta}\right) \tag{BF.31} E(θ∣x)=μθ+(Cθ−1+HTCw−1H)−1HTCw−1(x−Hμθ)(BF.31)C θ ∣ x = ( C θ − 1 + H T C w − 1 H ) − 1 (BF.32) \mathbf{C}_{\theta \mid x} =\left(\mathbf{C}_{\theta}^{-1}+\mathbf{H}^{T} \mathbf{C}_w^{-1} \mathbf{H}\right)^{-1}\tag{BF.32} Cθ∣x=(Cθ−1+HTCw−1H)−1(BF.32)
It is interesting to note that in the absence of prior knowledge in the Bayesian linear model the MMSE estimator yields the same form as the MVU estimator for the classical linear model. To verify this result note that from
(
B
F
.
31
)
(BF.31)
(BF.31), for no prior knowledge
C
θ
−
1
→
0
\mathbf C_\theta^{-1}\to \mathbf 0
Cθ−1→0, therefore
θ
^
→
(
H
T
C
w
−
1
H
)
−
1
H
T
C
w
−
1
x
\hat {\boldsymbol \theta}\to \left(\mathbf{H}^{T} \mathbf{C}_w^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}_w^{-1}\mathbf{x}
θ^→(HTCw−1H)−1HTCw−1x
which is recognized as the MVU estimator for the general linear model.
Properties
-
The MMSE estimator commutes over linear transformations. Assume that we wish to estimate α \boldsymbol \alpha α for
α = A θ + b \boldsymbol\alpha =\mathbf A\boldsymbol \theta +\mathbf b α=Aθ+b
α \boldsymbol \alpha α is a random vector for which the MMSE estimator is α ^ = E ( α ∣ x ) \hat {\boldsymbol \alpha}=E(\boldsymbol \alpha |\mathbf x) α^=E(α∣x). Because the linearity of the expectation operator
α ^ = E ( A θ + b ∣ x ) = A E ( θ ∣ x ) + b = A θ ^ + b \hat {\boldsymbol \alpha}=E(\mathbf A\boldsymbol \theta+\mathbf b|\mathbf x)=\mathbf A E(\boldsymbol \theta|\mathbf x)+\mathbf b=\mathbf A\hat {\boldsymbol \theta}+\mathbf b α^=E(Aθ+b∣x)=AE(θ∣x)+b=Aθ^+b
This holds regardless of the joint PDF p ( x , θ ) p(\mathbf x,\boldsymbol \theta) p(x,θ). -
The MMSE has an additivity property for independent data sets. Assume x 1 , x 2 \mathbf x_1, \mathbf x_2 x1,x2 are independent data vectors and θ , x 1 , x 2 \boldsymbol \theta, \mathbf x_1,\mathbf x_2 θ,x1,x2 are jointly Gaussian. Letting x = [ x 1 T x 2 T ] T \mathbf x=[\mathbf x_1^T ~ \mathbf x_2^T]^T x=[x1T x2T]T, we have from Theorem 10.2
θ ^ = E ( θ ) + C θ x C x x − 1 ( x − E ( x ) ) \hat {\boldsymbol \theta}=E(\boldsymbol \theta)+\mathbf C_{\theta x} \mathbf C_{xx}^{-1}(\mathbf x-E(\mathbf x)) θ^=E(θ)+CθxCxx−1(x−E(x))
Since x 1 , x 2 \mathbf x_1, \mathbf x_2 x1,x2 are independent,
C x x − 1 = [ C x 1 x 2 0 0 C x 2 x 2 ] − 1 = [ C x 1 x 2 − 1 0 0 C x 2 x 2 − 1 ] \mathbf C_{xx}^{-1}=\left[\begin{matrix}\mathbf C_{x_1 x_2} & \mathbf 0\\ \mathbf 0 & \mathbf C_{x_2x_2} \end{matrix} \right]^{-1}=\left[\begin{matrix}\mathbf C_{x_1 x_2}^{-1} & \mathbf 0\\ \mathbf 0 & \mathbf C_{x_2x_2}^{-1} \end{matrix} \right] Cxx−1=[Cx1x200Cx2x2]−1=[Cx1x2−100Cx2x2−1]C θ x = E [ θ [ x 1 x 2 ] T ] = [ C θ x 1 C θ x 2 ] \mathbf C_{\theta x}=E\left[\boldsymbol \theta\left[\begin{matrix}\mathbf x_1 \\\mathbf x_2 \end{matrix} \right]^T\right]=[\mathbf C_{\theta x_1}~\mathbf C_{\theta x_2}] Cθx=E[θ[x1x2]T]=[Cθx1 Cθx2]
Therefore
θ ^ = E ( θ ) + C θ x 1 C x 1 x 1 − 1 ( x 1 − E ( x 1 ) ) + C θ x 2 C x 2 x 2 − 1 ( x 2 − E ( x 2 ) ) \hat {\boldsymbol \theta}=E(\boldsymbol \theta)+\mathbf C_{\theta x_1} \mathbf C_{x_1x_1}^{-1}(\mathbf x_1-E(\mathbf x_1))+\mathbf C_{\theta x_2} \mathbf C_{x_2x_2}^{-1}(\mathbf x_2-E(\mathbf x_2)) θ^=E(θ)+Cθx1Cx1x1−1(x1−E(x1))+Cθx2Cx2x2−1(x2−E(x2)) -
In the jointly Gaussian case the MMSE is linear in the data, as can be seen from ( B F . 26 ) (BF.26) (BF.26).
Example: Bayesian Fourier Analysis
Data model:
x
[
n
]
=
a
cos
2
π
f
0
n
+
b
sin
2
π
f
0
n
+
w
[
n
]
n
=
0
,
1
,
⋯
,
N
−
1
x[n]=a \cos 2\pi f_0 n+b \sin 2\pi f_0 n+w[n]\quad n=0,1,\cdots, N-1
x[n]=acos2πf0n+bsin2πf0n+w[n]n=0,1,⋯,N−1
where
f
0
f_0
f0 is a multiple of
1
/
N
1/N
1/N, excepting
0
0
0 or
1
/
2
1/2
1/2 (for which
sin
2
π
f
0
n
\sin 2 \pi f_0 n
sin2πf0n is identically zero), and
w
[
n
]
w[n]
w[n] is WGN with variance
σ
2
\sigma^2
σ2. It is desired to estimate
θ
=
[
a
b
]
T
\boldsymbol \theta=[a~b]^T
θ=[a b]T.
To find the MMSE estimator, the data model is rewritten as
x
=
H
θ
+
w
\mathbf x=\mathbf H \boldsymbol \theta +\mathbf w
x=Hθ+w
where
H
=
[
1
0
cos
2
π
f
0
sin
2
π
f
0
⋮
⋮
cos
[
2
π
f
0
(
N
−
1
)
]
sin
[
2
π
f
0
(
N
−
1
)
]
]
\mathbf{H}=\left[\begin{array}{cc} 1 & 0 \\ \cos 2 \pi f_{0} & \sin 2 \pi f_{0} \\ \vdots & \vdots \\ \cos \left[2 \pi f_{0}(N-1)\right] & \sin \left[2 \pi f_{0}(N-1)\right] \end{array}\right]
H=⎣⎢⎢⎢⎡1cos2πf0⋮cos[2πf0(N−1)]0sin2πf0⋮sin[2πf0(N−1)]⎦⎥⎥⎥⎤
Because the columns of
H
\mathbf H
H are orthogonal, we have
H
T
H
=
N
2
I
\mathbf H^T \mathbf H=\frac{N}{2}\mathbf I
HTH=2NI
Therefore it is more convenient to apply
(
B
F
.
31
)
(BF.31)
(BF.31) and
(
B
F
.
32
)
(BF.32)
(BF.32)
θ
^
=
(
1
σ
θ
2
I
+
H
T
1
σ
2
H
)
−
1
H
T
1
σ
2
x
=
1
/
σ
2
1
/
σ
θ
2
+
N
/
2
σ
2
H
T
x
C
θ
∣
x
=
(
1
σ
θ
2
I
+
H
T
1
σ
2
H
)
−
1
=
1
1
/
σ
θ
2
+
N
/
2
σ
2
I
\begin{aligned} \hat{\boldsymbol{\theta}} &=\left(\frac{1}{\sigma_{\theta}^{2}} \mathbf{I}+\mathbf{H}^{T} \frac{1}{\sigma^{2}} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \frac{1}{\sigma^{2}} \mathbf{x}=\frac{1/\sigma^2}{1/\sigma_\theta^2+N/2\sigma^2}\mathbf H^T\mathbf x \\ \mathbf{C}_{\theta \mid x} &=\left(\frac{1}{\sigma_{\theta}^{2}} \mathbf{I}+\mathbf{H}^{T} \frac{1}{\sigma^{2}} \mathbf{H}\right)^{-1}=\frac{1}{1/\sigma_\theta^2+N/2\sigma^2}\mathbf I \end{aligned}
θ^Cθ∣x=(σθ21I+HTσ21H)−1HTσ21x=1/σθ2+N/2σ21/σ2HTx=(σθ21I+HTσ21H)−1=1/σθ2+N/2σ21I
The MMSE estimator is
a
^
=
1
1
+
2
σ
2
/
N
σ
θ
2
[
2
N
∑
n
=
0
N
−
1
x
[
n
]
cos
2
π
f
0
n
]
b
^
=
1
1
+
2
σ
2
/
N
σ
θ
2
[
2
N
∑
n
=
0
N
−
1
x
[
n
]
sin
2
π
f
0
n
]
\begin{aligned} \hat{a} &=\frac{1}{1+\frac{2 \sigma^{2} / N}{\sigma_{\theta}^{2}}}\left[\frac{2}{N} \sum_{n=0}^{N-1} x[n] \cos 2 \pi f_{0} n\right] \\ \hat{b} &=\frac{1}{1+\frac{2 \sigma^{2} / N}{\sigma_{\theta}^{2}}}\left[\frac{2}{N} \sum_{n=0}^{N-1} x[n] \sin 2 \pi f_{0} n\right] \end{aligned}
a^b^=1+σθ22σ2/N1[N2n=0∑N−1x[n]cos2πf0n]=1+σθ22σ2/N1[N2n=0∑N−1x[n]sin2πf0n]
with
B
m
s
e
(
a
^
)
=
B
m
s
e
(
b
^
)
=
1
1
σ
θ
2
+
1
2
σ
2
/
N
\mathrm{Bmse}(\hat{a})=\mathrm{Bmse}(\hat{b})=\frac{1}{\frac{1}{\sigma_{\theta}^{2}}+\frac{1}{2 \sigma^{2} / N}}
Bmse(a^)=Bmse(b^)=σθ21+2σ2/N11
Maximum A Posteriori Estimators (MAP)
In the MAP estimation approach we choose
θ
^
\hat \theta
θ^ to maximize the posterior PDF or
θ
^
=
arg
max
θ
p
(
θ
∣
x
)
\hat \theta=\arg \max _\theta p(\theta|\mathbf x)
θ^=argθmaxp(θ∣x)
This was shown to minimize the Bayes risk for a “hit-or-miss” cost function
E
[
∣
θ
i
−
θ
^
i
∣
]
E[|\theta_i-\hat \theta_i|]
E[∣θi−θ^i∣] . Since
p
(
θ
∣
x
)
=
p
(
x
∣
θ
)
p
(
θ
)
p
(
x
)
p(\theta|\mathbf x)=\frac{p(\mathbf x|\theta)p(\theta)}{p(\mathbf x)}
p(θ∣x)=p(x)p(x∣θ)p(θ)
It is equivalent to
θ
^
=
arg
max
θ
p
(
x
∣
θ
)
p
(
θ
)
(MAP.1)
\hat \theta=\arg \max _\theta p(\mathbf x|\theta)p(\theta)\tag{MAP.1}
θ^=argθmaxp(x∣θ)p(θ)(MAP.1)
or
θ
^
=
arg
max
θ
[
ln
p
(
x
∣
θ
)
+
ln
p
(
θ
)
]
(MAP.2)
\hat \theta=\arg \max _\theta [\ln p(\mathbf x|\theta)+\ln p(\theta)]\tag{MAP.2}
θ^=argθmax[lnp(x∣θ)+lnp(θ)](MAP.2)
To extend the MAP estimator to the vector parameter case in which the posterior PDF is now
p
(
θ
∣
x
)
p(\boldsymbol \theta|\mathbf x)
p(θ∣x), we again employ the result
p
(
θ
1
∣
x
)
=
∫
⋯
∫
p
(
θ
∣
x
)
d
θ
2
⋯
d
θ
p
(MAP.3)
p(\theta_1|\mathbf x)=\int \cdots \int p(\boldsymbol \theta|\mathbf x)d\theta_2\cdots d\theta _p \tag{MAP.3}
p(θ1∣x)=∫⋯∫p(θ∣x)dθ2⋯dθp(MAP.3)
Then the MAP estimator is given by
θ
^
1
=
arg
max
θ
1
p
(
θ
1
∣
x
)
\hat \theta_1=\arg \max _{\theta_1} p(\theta_1|\mathbf x)
θ^1=argθ1maxp(θ1∣x)
or in general
θ
^
i
=
arg
max
θ
i
p
(
θ
i
∣
x
)
i
=
1
,
2
,
⋯
,
p
(MAP.4)
\hat \theta_i=\arg \max _{\theta_i} p(\theta_i|\mathbf x) \quad i=1,2,\cdots,p \tag{MAP.4}
θ^i=argθimaxp(θi∣x)i=1,2,⋯,p(MAP.4)
One of the advantages of the MAP estimator for a scalar parameter is that to numerically determine it we need only maximize
p
(
x
∣
θ
)
p
(
θ
)
p(\mathbf x|\theta)p(\theta)
p(x∣θ)p(θ). No integration is required. This desirable property of the MAP estimator for a scalar parameter does not carry over to the vector parameter case due to the need to obtain
p
(
θ
i
∣
x
)
p(\theta_i|\mathbf x)
p(θi∣x) as per
(
M
A
P
.
3
)
(MAP.3)
(MAP.3). However, we might propose the following vector MAP estimator
θ
^
=
arg
max
θ
p
(
θ
∣
x
)
=
arg
max
θ
p
(
x
∣
θ
)
p
(
θ
)
(MAP.5)
\hat {\boldsymbol \theta}=\arg \max _{\boldsymbol \theta} p(\boldsymbol \theta|\mathbf x)=\arg \max _{\boldsymbol \theta} p(\mathbf x|\boldsymbol \theta)p(\boldsymbol \theta) \tag{MAP.5}
θ^=argθmaxp(θ∣x)=argθmaxp(x∣θ)p(θ)(MAP.5)
in which the posterior PDF for the vector parameter
θ
\boldsymbol \theta
θ is maximized to find the estimator. The estimator in
(
M
A
P
.
5
)
(MAP.5)
(MAP.5) is not in general the same as
(
M
A
P
.
4
)
(MAP.4)
(MAP.4).
Properties
-
If p ( θ ) p(\theta) p(θ) is uniform and p ( x ∣ θ ) p(\mathbf x|\theta) p(x∣θ) falls within this interval, then
θ ^ = arg max θ p ( x ∣ θ ) p ( θ ) = arg max θ p ( x ∣ θ ) \hat \theta=\arg \max _\theta p(\mathbf x|\theta)p(\theta)=\arg \max _\theta p(\mathbf x|\theta) θ^=argθmaxp(x∣θ)p(θ)=argθmaxp(x∣θ)
which is essentially the Bayesian MLE. -
Generally, it holds that if N → ∞ N\to \infty N→∞, the PDF p ( x ∣ θ ) p(\mathbf x|\theta) p(x∣θ) becomes dominant over p ( θ ) p(\theta) p(θ) and the MAP becomes thus identical to the Bayesian MLE.
-
If x \mathbf x x and θ \boldsymbol \theta θ are jointly Gaussian, the posterior PDF is Gaussian and thus the MAP estimator is identical to the MMSE estimator.
-
The invariance property encountered in maximum likelihood theory (MLE) does not hold for the MAP estimator.
Example: DC Level in WGN - Uniform Prior PDF
Recall the introductory example in The Bayesian Philosophy. There we discussed the MMSE estimator of
A
A
A for a DC level in WGN with a uniform prior PDF. The MMSE estimator could not be obtained in explicit form due to the need to evaluate the integrals. The posterior PDF was given as
p
(
A
∣
x
)
=
{
1
2
π
σ
2
/
N
exp
[
−
1
2
σ
2
/
N
(
A
−
x
ˉ
)
2
]
∫
−
A
0
A
0
1
2
π
σ
2
/
N
exp
[
−
1
2
σ
2
/
N
(
A
−
x
ˉ
)
2
]
d
A
∣
A
∣
≤
A
0
0
,
∣
A
∣
>
A
0
(BP.9)
p(A |\mathbf{x})=\left\{\begin{aligned} &\frac{\frac{1}{\sqrt{2 \pi \sigma^{2} / N}} \exp \left[-\frac{1}{2 \sigma^{2} / N}(A-\bar{x})^{2}\right]}{\int_{-A_0}^{A_0}\frac{1}{\sqrt{2 \pi \sigma^{2} / N}} \exp \left[-\frac{1}{2 \sigma^{2} / N}(A-\bar{x})^{2}\right]dA} && |A| \leq A_{0} \\ &0, && |A|>A_{0} \end{aligned}\right.\tag{BP.9}
p(A∣x)=⎩⎪⎪⎪⎪⎨⎪⎪⎪⎪⎧∫−A0A02πσ2/N1exp[−2σ2/N1(A−xˉ)2]dA2πσ2/N1exp[−2σ2/N1(A−xˉ)2]0,∣A∣≤A0∣A∣>A0(BP.9)
The MAP estimator is explicitly given as
A
^
=
{
−
A
0
x
ˉ
<
−
A
0
x
ˉ
−
A
0
≤
x
ˉ
≤
A
0
A
0
x
ˉ
>
A
0
\hat A=\left\{\begin{array}{cc}&-A_0 && \bar x<-A_0\\&\bar x && -A_0\le \bar x\le A_0\\ &A_0 && \bar x>A_0 \end{array}\right.
A^=⎩⎨⎧−A0xˉA0xˉ<−A0−A0≤xˉ≤A0xˉ>A0
It can be seen that the MAP estimator is usually easier to determine since it does not involve any integration but only a maximization.
Linear MMSE Estimation (LMMSE)
We begin our discussion by assuming a scalar parameter θ \theta θ is to be estimated based on the data set x = [ x [ 0 ] x [ 1 ] ⋯ x [ N − 1 ] ] T \mathbf x=[x[0]~x[1]~\cdots~x[N-1]]^T x=[x[0] x[1] ⋯ x[N−1]]T. We do not assume any specific form for the joint PDF p ( x , θ ) p(\mathbf x,\theta) p(x,θ), but as we shall see shortly, only a knowledge of the first two moments.
We now consider the class of all linear estimators of the form
θ
^
=
∑
n
=
0
N
−
1
a
n
x
[
n
]
+
a
N
(LMMSE.1)
\hat \theta =\sum_{n=0}^{N-1}a_n x[n]+a_N\tag{LMMSE.1}
θ^=n=0∑N−1anx[n]+aN(LMMSE.1)
and choosing the weighting coefficients
a
n
a_n
an's to minimize the Bayesian MSE
Bmse
(
θ
^
)
=
E
[
(
θ
−
θ
^
)
2
]
=
∫
∫
(
θ
−
θ
^
)
2
p
(
x
,
θ
)
d
θ
d
x
(LMMSE.2)
\operatorname{Bmse}(\hat \theta)=E\left[(\theta-\hat \theta)^2\right]=\int\int (\theta-\hat \theta)^2 p(\mathbf x,\theta)d\theta d\mathbf x\tag{LMMSE.2}
Bmse(θ^)=E[(θ−θ^)2]=∫∫(θ−θ^)2p(x,θ)dθdx(LMMSE.2)
The resultant estimator is termed the linear minimum mean square error (LMMSE) estimator. Note that we included the
a
N
a_N
aN coefficient to allow for nonzero means of
x
\mathbf x
x and
θ
\boldsymbol \theta
θ.
Before determining the LMMSE estimator we should keep in mind that the estimator will be suboptimal unless the MMSE estimator happens to be linear. Such would be the case, for example, if the Bayesian linear model applied. Since the LMMSE estimator relies on the correlation between random variables, a parameter uncorrelated with the data cannot be linearly estimated.
We now derive the optimal weighting coefficients for use in
(
L
M
M
S
E
.
1
)
(LMMSE.1)
(LMMSE.1). Substituting
(
L
M
M
S
E
.
1
)
(LMMSE.1)
(LMMSE.1) into
(
L
M
M
S
E
.
2
)
(LMMSE.2)
(LMMSE.2) and differentiating w.r.t.
a
N
a_N
aN
∂
∂
a
N
E
[
(
θ
−
∑
n
=
0
N
−
1
a
n
x
[
n
]
−
a
N
)
2
]
=
−
2
E
[
θ
−
∑
n
=
0
N
−
1
a
n
x
[
n
]
−
a
N
]
\frac{\partial}{\partial a_{N}} E\left[\left(\theta-\sum_{n=0}^{N-1} a_{n} x[n]-a_{N}\right)^{2}\right]=-2 E\left[\theta-\sum_{n=0}^{N-1} a_{n} x[n]-a_{N}\right]
∂aN∂E⎣⎡(θ−n=0∑N−1anx[n]−aN)2⎦⎤=−2E[θ−n=0∑N−1anx[n]−aN]
Setting this equal to zero produces
a
N
=
E
(
θ
)
−
∑
n
=
0
N
−
1
a
n
E
(
x
[
n
]
)
(LMMSE.3)
a_{N}=E(\theta)-\sum_{n=0}^{N-1} a_{n} E(x[n]) \tag{LMMSE.3}
aN=E(θ)−n=0∑N−1anE(x[n])(LMMSE.3)
which is zero if the means are zero. Continuing, we need to minimize
B
m
s
e
(
θ
^
)
=
E
{
[
∑
n
=
0
N
−
1
a
n
(
x
[
n
]
−
E
(
x
[
n
]
)
)
−
(
θ
−
E
(
θ
)
)
]
2
}
\mathrm{Bmse}(\hat{\theta})=E\left\{\left[\sum_{n=0}^{N-1} a_{n}(x[n]-E(x[n]))-(\theta-E(\theta))\right]^{2}\right\}
Bmse(θ^)=E⎩⎨⎧[n=0∑N−1an(x[n]−E(x[n]))−(θ−E(θ))]2⎭⎬⎫
over the remaining
a
n
a_{n}
an's, where
a
N
a_{N}
aN has been replaced by
(
L
M
M
S
E
.
3
)
(LMMSE.3)
(LMMSE.3). Letting
a
=
[
a
0
a
1
⋯
a
N
−
1
]
T
,
\mathbf{a}=\left[a_{0}~ a_{1}~ \cdots~ a_{N-1}\right]^{T},
a=[a0 a1 ⋯ aN−1]T, we have
Bmse
(
θ
^
)
=
E
{
[
a
T
(
x
−
E
(
x
)
)
−
(
θ
−
E
(
θ
)
)
]
2
}
=
E
[
a
T
(
x
−
E
(
x
)
)
(
x
−
E
(
x
)
)
T
a
]
−
E
[
a
T
(
x
−
E
(
x
)
)
(
θ
−
E
(
θ
)
)
]
−
E
[
(
θ
−
E
(
θ
)
)
(
x
−
E
(
x
)
)
T
a
]
+
E
[
(
θ
−
E
(
θ
)
)
2
]
=
a
T
C
x
x
a
−
a
T
C
x
θ
−
C
θ
x
a
+
C
θ
θ
(LMMSE.4)
\begin{aligned} \operatorname{Bmse}(\hat{\theta})=& E\left\{\left[\mathbf{a}^{T}(\mathbf{x}-E(\mathbf{x}))-(\theta-E(\theta))\right]^{2}\right\} \\ =& E\left[\mathbf{a}^{T}(\mathbf{x}-E(\mathbf{x}))(\mathbf{x}-E(\mathbf{x}))^{T} \mathbf{a}\right]-E\left[\mathbf{a}^{T}(\mathbf{x}-E(\mathbf{x}))(\theta-E(\theta))\right] \\ &-E\left[(\theta-E(\theta))(\mathbf{x}-E(\mathbf{x}))^{T} \mathbf{a}\right]+E\left[(\theta-E(\theta))^{2}\right] \\ =& \mathbf{a}^{T} \mathbf{C}_{x x} \mathbf{a}-\mathbf{a}^{T} \mathbf{C}_{x \theta}-\mathbf{C}_{\theta x} \mathbf{a}+C_{\theta \theta} \end{aligned}\tag{LMMSE.4}
Bmse(θ^)===E{[aT(x−E(x))−(θ−E(θ))]2}E[aT(x−E(x))(x−E(x))Ta]−E[aT(x−E(x))(θ−E(θ))]−E[(θ−E(θ))(x−E(x))Ta]+E[(θ−E(θ))2]aTCxxa−aTCxθ−Cθxa+Cθθ(LMMSE.4)
where
C
x
x
\mathbf{C}_{x x}
Cxx is the
N
×
N
N \times N
N×N covariance matrix of
x
,
\mathbf{x},
x, and
C
θ
x
\mathbf{C}_{\theta x}
Cθx is the
1
×
N
1 \times N
1×N cross-covariance vector having the property that
C
θ
x
T
=
C
x
θ
,
\mathbf{C}_{\theta x}^{T}=\mathbf{C}_{x \theta},
CθxT=Cxθ, and
C
θ
θ
C_{\theta \theta}
Cθθ is the variance of
θ
.
\theta .
θ. Taking the gradient to yield
∂
B
m
s
e
(
θ
^
)
∂
a
=
2
C
x
x
a
−
2
C
x
θ
\frac{\partial \mathrm{B} \mathrm{mse}(\hat{\theta})}{\partial \mathbf{a}}=2 \mathbf{C}_{x x} \mathbf{a}-2 \mathbf{C}_{x \theta}
∂a∂Bmse(θ^)=2Cxxa−2Cxθ
which when set to zero results in
a
=
C
x
x
−
1
C
x
θ
(LMMSE.5)
\mathbf{a}=\mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta}\tag{LMMSE.5}
a=Cxx−1Cxθ(LMMSE.5)
Using
(
L
M
M
S
E
.
3
)
(LMMSE.3)
(LMMSE.3) and
(
L
M
M
S
E
.
5
)
(LMMSE.5)
(LMMSE.5) in
(
L
M
M
S
E
.
1
)
(LMMSE.1)
(LMMSE.1) produces
θ
^
=
a
T
x
+
a
N
=
C
x
θ
T
C
x
x
−
1
x
+
E
(
θ
)
−
C
x
θ
T
C
x
x
−
1
E
(
x
)
\begin{aligned} \hat{\theta} &=\mathbf{a}^{T} \mathbf{x}+a_{N} \\ &=\mathbf{C}_{x \theta}^{T} \mathbf{C}_{x x}^{-1} \mathbf{x}+E(\theta)-\mathbf{C}_{x \theta}^{T} \mathbf{C}_{x x}^{-1} E(\mathbf{x}) \end{aligned}
θ^=aTx+aN=CxθTCxx−1x+E(θ)−CxθTCxx−1E(x)
or finally the LMMSE estimator is
θ
^
=
E
(
θ
)
+
C
θ
x
C
x
x
−
1
(
x
−
E
(
x
)
)
(LMMSE.6)
\hat{\theta}=E(\theta)+\mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1}(\mathbf{x}-E(\mathbf{x}))\tag{LMMSE.6}
θ^=E(θ)+CθxCxx−1(x−E(x))(LMMSE.6)
Note that it is identical in form to the MMSE estimator for jointly Gaussian
x
\mathbf x
x and
θ
\theta
θ, as can be verified from
(
B
F
.
26
)
(BF.26)
(BF.26). This is because in the Gaussian case the MMSE estimator happens to be linear, and hence our constraint is automatically satisfied.
The minimum Bayesian MSE is obtained by substituting
(
L
M
M
S
E
.
5
)
(LMMSE.5)
(LMMSE.5) into
(
L
M
M
S
E
.
4
)
(LMMSE.4)
(LMMSE.4) to yield
B
m
s
e
(
θ
^
)
=
C
x
θ
T
C
x
x
−
1
C
x
x
C
x
x
−
1
C
x
θ
−
C
x
θ
T
C
x
x
−
1
C
x
θ
−
C
θ
x
C
x
x
−
1
C
x
θ
+
C
θ
θ
=
C
θ
x
C
x
x
−
1
C
x
θ
−
2
C
θ
x
C
x
x
−
1
C
x
θ
+
C
θ
θ
=
C
θ
θ
−
C
θ
x
C
x
x
−
1
C
x
θ
(LMMSE.7)
\begin{aligned} \mathrm{Bmse}(\hat{\theta})=& \mathbf{C}_{x \theta}^{T} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x x} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta}-\mathbf{C}_{x \theta}^{T} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta} -\mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta}+ C_{\theta \theta} \\ =& \mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta}-2 \mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta}+ C_{\theta \theta}\\ =& C_{\theta \theta}-\mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta} \end{aligned}\tag{LMMSE.7}
Bmse(θ^)===CxθTCxx−1CxxCxx−1Cxθ−CxθTCxx−1Cxθ−CθxCxx−1Cxθ+CθθCθxCxx−1Cxθ−2CθxCxx−1Cxθ+CθθCθθ−CθxCxx−1Cxθ(LMMSE.7)
In general, all that is required to determine the LMMSE estimator are the first two moments of
p
(
x
,
θ
)
p(\mathbf x, \theta)
p(x,θ) or
[
E
(
θ
)
E
(
x
)
]
,
[
C
θ
θ
C
θ
x
C
x
θ
C
x
x
]
\left[ \begin{matrix}E(\theta) \\E(\mathbf x) \end{matrix}\right], \left[ \begin{matrix} C_{\theta \theta} & \mathbf C_{\theta x} \\\mathbf C_{x \theta} & \mathbf C_{xx} \end{matrix}\right]
[E(θ)E(x)],[CθθCxθCθxCxx]
Geometrical Interpretations
In this formulation, we assume that
θ
\theta
θ and
x
\mathbf x
x are zero mean. If not, we can always define the zero mean random variables
θ
′
=
θ
−
E
(
θ
)
\theta '=\theta-E(\theta)
θ′=θ−E(θ) and
x
′
=
x
−
E
(
x
)
\mathbf x '=\mathbf x-E(\mathbf x)
x′=x−E(x), and consider estimation of
θ
′
\theta '
θ′ by a linear function of
x
′
\mathbf x'
x′. Now we wish to find the
a
n
a_n
an's so that
θ
^
=
∑
n
=
0
N
−
1
a
n
x
[
n
]
\hat \theta =\sum_{n=0}^{N-1}a_n x[n]
θ^=n=0∑N−1anx[n]
minimizes
Bmse
(
θ
^
)
=
E
[
(
θ
−
θ
^
)
2
]
=
∫
∫
(
θ
−
θ
^
)
2
p
(
x
,
θ
)
d
θ
d
x
\operatorname{Bmse}(\hat \theta)=E\left[(\theta-\hat \theta)^2\right]=\int\int (\theta-\hat \theta)^2 p(\mathbf x,\theta)d\theta d\mathbf x
Bmse(θ^)=E[(θ−θ^)2]=∫∫(θ−θ^)2p(x,θ)dθdx
It can be shown that an appropriate definition, i.e., one that satisfies the properties of an inner product between the vectors
x
x
x and
y
y
y, is
(
x
,
y
)
=
E
(
x
y
)
(LMMSE.8)
(x,y)=E(xy) \tag{LMMSE.8}
(x,y)=E(xy)(LMMSE.8)
With this definition we have that
(
x
,
x
)
=
E
(
x
2
)
=
∥
x
∥
2
(LMMSE.9)
(x,x)=E(x^2)=\|x\|^2\tag{LMMSE.9}
(x,x)=E(x2)=∥x∥2(LMMSE.9)
Also, we can now define two vectors to be orthogonal if
(
x
,
y
)
=
E
(
x
y
)
=
0
(LMMSE.10)
(x,y)=E(xy)=0\tag{LMMSE.10}
(x,y)=E(xy)=0(LMMSE.10)
Since the vectors are zero mean, this is equivalent to saying that two vectors are orthogonal if and only if they are uncorrelated.
Then the Bayesian MSE can be presented as
E
[
(
θ
−
θ
^
)
2
]
=
∥
θ
−
∑
n
=
0
N
−
1
a
n
x
[
n
]
∥
2
E\left[(\theta-\hat \theta)^2\right]=\left\|\theta -\sum_{n=0}^{N-1} a_n x[n] \right\|^2
E[(θ−θ^)2]=∥∥∥∥∥θ−n=0∑N−1anx[n]∥∥∥∥∥2
This means that minimization of the MSE is equivalent to a minimization of the squared length of the error vector
ϵ
=
θ
−
θ
^
\epsilon =\theta-\hat \theta
ϵ=θ−θ^. Clearly, the length of the error vector is minimized when
ϵ
\epsilon
ϵ is orthogonal to the subspace spanned by
{
x
[
0
]
,
x
[
1
]
,
.
.
.
,
x
[
N
−
1
]
}
\{x[0], x[1], ... , x[N - 1]\}
{x[0],x[1],...,x[N−1]}. Hence, we require
ϵ
⊥
x
[
0
]
,
x
[
1
]
,
⋯
,
x
[
N
−
1
]
(LMMSE.11)
\epsilon \perp x[0],x[1],\cdots,x[N-1]\tag{LMMSE.11}
ϵ⊥x[0],x[1],⋯,x[N−1](LMMSE.11)
or by using our definition of orthogonality
E
[
(
θ
−
θ
^
)
x
[
n
]
]
=
0
n
=
0
,
1
,
⋯
,
N
−
1
(LMMSE.12)
E\left[ (\theta-\hat \theta)x[n] \right]=0\quad n=0,1,\cdots,N-1 \tag{LMMSE.12}
E[(θ−θ^)x[n]]=0n=0,1,⋯,N−1(LMMSE.12)
This is the important orthogonality principle. It says that in estimating the realization of a random variable by a linear combination of data samples, the optimal estimator is obtained when the error is orthogonal to each data sample. Using the orthogonality principle, the weighting coefficients are easily found as
∑
n
=
0
N
−
1
a
m
E
(
x
[
m
]
x
[
n
]
)
=
E
(
θ
x
[
n
]
)
n
=
0
,
1
,
⋯
,
N
−
1
\sum_{n=0}^{N-1}a_m E(x[m]x[n])=E(\theta x[n])\quad n=0,1,\cdots,N-1
n=0∑N−1amE(x[m]x[n])=E(θx[n])n=0,1,⋯,N−1
In matrix form this is
[
E
(
x
2
[
0
]
)
E
(
x
[
0
]
x
[
1
]
)
…
E
(
x
[
0
]
x
[
N
−
1
]
)
E
(
x
[
1
]
x
[
0
]
)
E
(
x
2
[
1
]
)
…
E
(
x
[
1
]
x
[
N
−
1
]
)
⋮
⋮
⋱
⋮
E
(
x
[
N
−
1
]
x
[
0
]
)
E
(
x
[
N
−
1
]
x
[
1
]
)
…
E
(
x
2
[
N
−
1
]
)
]
[
a
0
a
1
⋮
a
N
−
1
]
=
[
E
(
θ
x
[
0
]
)
E
(
θ
x
[
1
]
)
⋮
E
(
θ
x
[
N
−
1
]
)
]
\left[\begin{array}{cccc} E\left(x^{2}[0]\right) & E(x[0] x[1]) & \ldots & E(x[0] x[N-1]) \\ E(x[1] x[0]) & E\left(x^{2}[1]\right) & \ldots & E(x[1] x[N-1]) \\ \vdots & \vdots & \ddots & \vdots \\ E(x[N-1] x[0]) & E(x[N-1] x[1]) & \ldots & E\left(x^{2}[N-1]\right) \end{array}\right]\left[\begin{array}{c} a_{0} \\ a_{1} \\ \vdots \\ a_{N-1} \end{array}\right]=\left[\begin{array}{c} E(\theta x[0]) \\ E(\theta x[1]) \\ \vdots \\ E(\theta x[N-1]) \end{array}\right]
⎣⎢⎢⎢⎡E(x2[0])E(x[1]x[0])⋮E(x[N−1]x[0])E(x[0]x[1])E(x2[1])⋮E(x[N−1]x[1])……⋱…E(x[0]x[N−1])E(x[1]x[N−1])⋮E(x2[N−1])⎦⎥⎥⎥⎤⎣⎢⎢⎢⎡a0a1⋮aN−1⎦⎥⎥⎥⎤=⎣⎢⎢⎢⎡E(θx[0])E(θx[1])⋮E(θx[N−1])⎦⎥⎥⎥⎤
These are normal equations. The matrix is recognized as
C
x
x
\mathbf C_{xx}
Cxx, and the right-hand vector as
C
x
θ
\mathbf C_{x\theta}
Cxθ. Therefore,
C
x
x
a
=
C
x
θ
\mathbf C_{xx}\mathbf a=\mathbf C_{x\theta}
Cxxa=Cxθ
and
a
=
C
x
x
−
1
C
x
θ
\mathbf a=\mathbf C_{xx}^{-1}\mathbf C_{x\theta}
a=Cxx−1Cxθ
The LMMSE estimator of
θ
\theta
θ is
θ
^
=
a
T
x
=
C
θ
x
C
x
x
−
1
x
\hat \theta =\mathbf a^T \mathbf x=\mathbf C_{\theta x}\mathbf C_{xx}^{-1}\mathbf x
θ^=aTx=CθxCxx−1x
in agreement with
(
L
M
M
S
E
.
5
)
(LMMSE.5)
(LMMSE.5) and
(
L
M
M
S
E
.
6
)
(LMMSE.6)
(LMMSE.6). The minimum Bayesian MSE is the squared length of the error vector or
E
[
(
θ
−
θ
^
)
2
]
=
E
[
(
θ
−
∑
n
=
0
N
−
1
a
n
x
[
n
]
)
ϵ
]
=
E
[
θ
ϵ
]
=
E
(
θ
2
)
−
∑
n
=
0
N
−
1
a
n
E
(
x
[
n
]
θ
)
=
C
θ
θ
−
a
T
C
x
θ
=
C
θ
θ
−
C
θ
x
C
x
x
−
1
C
x
θ
\begin{aligned} E\left[(\theta-\hat \theta)^2\right]&=E\left[(\theta-\sum_{n=0}^{N-1}a_n x[n])\epsilon\right]\\ &=E\left[\theta\epsilon\right]\\ &=E(\theta^2)-\sum_{n=0}^{N-1}a_n E(x[n]\theta)\\ &=C_{\theta \theta}-\mathbf a^T\mathbf C_{x\theta}\\ &=C_{\theta \theta}-\mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta} \end{aligned}
E[(θ−θ^)2]=E[(θ−n=0∑N−1anx[n])ϵ]=E[θϵ]=E(θ2)−n=0∑N−1anE(x[n]θ)=Cθθ−aTCxθ=Cθθ−CθxCxx−1Cxθ
in agreement with
(
L
M
M
S
E
.
7
)
(LMMSE.7)
(LMMSE.7).
Example: DC Level in WGN - Uniform Prior PDF
The data model is
x
[
n
]
=
A
+
w
[
n
]
n
=
0
,
1
,
…
,
N
−
1
x[n]=A+w[n] \quad n=0,1, \ldots, N-1
x[n]=A+w[n]n=0,1,…,N−1
where
A
∼
U
[
−
A
0
,
A
0
]
,
w
[
n
]
A \sim \mathcal{U}\left[-A_{0}, A_{0}\right], w[n]
A∼U[−A0,A0],w[n] is WGN with variance
σ
2
,
\sigma^{2},
σ2, and
A
A
A and
w
[
n
]
w[n]
w[n] are independent. We wish to estimate
A
A
A. The MMSE estimator cannot be obtained in closed form due to the integration required. Applying the LMMSE estimator, we first note that
E
(
A
)
=
0
,
E(A)=0,
E(A)=0, and hence
E
(
x
[
n
]
)
=
0.
E(x[n])=0 .
E(x[n])=0. since
E
(
x
)
=
0
,
E(\mathbf{x})=\mathbf{0},
E(x)=0, the covariances are
C
x
x
=
E
(
x
x
T
)
=
E
[
(
A
1
+
w
)
(
A
1
+
w
)
T
]
=
E
(
A
2
)
1
1
T
+
σ
2
I
C
θ
x
=
E
(
A
x
T
)
=
E
[
A
(
A
1
+
w
)
T
]
=
E
(
A
2
)
1
T
\begin{aligned} \mathbf{C}_{x x} &=E\left(\mathbf{x} \mathbf{x}^{T}\right) \\ &=E\left[(A \mathbf{1}+\mathbf{w})(A \mathbf{1}+\mathbf{w})^{T}\right] \\ &=E\left(A^{2}\right) \mathbf{1} \mathbf{1}^{T}+\sigma^{2} \mathbf{I} \\ \mathbf{C}_{\theta x} &=E\left(A \mathbf{x}^{T}\right) \\ &=E\left[A(A \mathbf{1}+\mathbf{w})^{T}\right] \\ &=E\left(A^{2}\right) \mathbf{1}^{T} \end{aligned}
CxxCθx=E(xxT)=E[(A1+w)(A1+w)T]=E(A2)11T+σ2I=E(AxT)=E[A(A1+w)T]=E(A2)1T
Hence, from
(
L
M
M
S
E
.
6
)
(LMMSE.6)
(LMMSE.6)
A
^
=
C
θ
x
C
x
x
−
1
x
=
σ
A
2
1
T
(
σ
A
2
1
1
T
+
σ
2
I
)
−
1
x
=
A
0
2
3
A
0
2
3
+
σ
2
N
x
ˉ
\begin{aligned} \hat{A} &=\mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1} \mathbf{x} \\ &=\sigma_{A}^{2} \mathbf{1}^{T}\left(\sigma_{A}^{2} \mathbf{1} \mathbf{1}^{T}+\sigma^{2} \mathbf{I}\right)^{-1} \mathbf{x}\\ &=\frac{\frac{A_{0}^{2}}{3}}{\frac{A_{0}^{2}}{3}+\frac{\sigma^{2}}{N}} \bar{x} \end{aligned}
A^=CθxCxx−1x=σA21T(σA211T+σ2I)−1x=3A02+Nσ23A02xˉ
where we have let
σ
A
2
=
E
(
A
2
)
.
\sigma_{A}^{2}=E\left(A^{2}\right) .
σA2=E(A2).
The Vector LMMSE Estimator
Now we wish to find the linear estimator that minimizes the Bayesian MSE for each element. We assume that
θ
^
i
=
∑
n
=
0
N
−
1
a
i
n
x
[
n
]
+
a
i
N
(LMMSE.13)
\hat{\theta}_{i}=\sum_{n=0}^{N-1} a_{i n} x[n]+a_{i N} \tag{LMMSE.13}
θ^i=n=0∑N−1ainx[n]+aiN(LMMSE.13)
for
i
=
1
,
2
,
…
,
p
i=1,2, \ldots, p
i=1,2,…,p and choose the weighting coefficients to minimize
Bmse
(
θ
^
i
)
=
E
[
(
θ
i
−
θ
^
i
)
2
]
i
=
1
,
2
,
…
,
p
\operatorname{Bmse}\left(\hat{\theta}_{i}\right)=E\left[\left(\theta_{i}-\hat{\theta}_{i}\right)^{2}\right] \quad i=1,2, \ldots, p
Bmse(θ^i)=E[(θi−θ^i)2]i=1,2,…,p
where the expectation is with respect to
p
(
x
,
θ
i
)
p\left(\mathbf{x}, \theta_{i}\right)
p(x,θi). Since we are actually determining
p
p
p separate estimators, the scalar solution can be applied, and we obtain from
(
L
M
M
S
E
.
6
)
(LMMSE.6)
(LMMSE.6)
θ
^
i
=
E
(
θ
i
)
+
C
θ
i
x
C
x
x
−
1
(
x
−
E
(
x
)
)
i
=
1
,
2
,
…
,
p
\hat{\theta}_{i}=E\left(\theta_{i}\right)+\mathbf{C}_{\theta_{i} x} \mathbf{C}_{x x}^{-1}(\mathbf{x}-E(\mathbf{x})) \quad i=1,2, \ldots, p
θ^i=E(θi)+CθixCxx−1(x−E(x))i=1,2,…,p
and the minimum Bayesian MSE is from
(
L
M
M
S
E
.
7
)
(LMMSE.7)
(LMMSE.7)
B
m
s
e
(
θ
^
i
)
=
C
θ
i
θ
i
−
C
θ
i
C
x
x
−
1
C
x
θ
i
i
=
1
,
2
,
…
,
p
\mathrm{Bmse}\left(\hat{\theta}_{i}\right)=C_{\theta_{i} \theta_{i}}-\mathbf{C}_{\theta_{i}} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta_{i}} \quad i=1,2, \ldots, p
Bmse(θ^i)=Cθiθi−CθiCxx−1Cxθii=1,2,…,p
The scalar LMMSE estimators can be combined into a vector estimator as
θ
^
=
[
E
(
θ
1
)
E
(
θ
2
)
⋮
E
(
θ
p
)
]
+
[
C
θ
1
x
C
x
x
−
1
(
x
−
E
(
x
)
)
C
θ
2
x
C
x
x
−
1
(
x
−
E
(
x
)
)
⋮
C
θ
p
x
C
x
x
−
1
(
x
−
E
(
x
)
)
]
=
[
E
(
θ
1
)
E
(
θ
2
)
⋮
E
(
θ
p
)
]
+
[
C
θ
1
x
C
θ
2
x
⋮
C
θ
p
x
]
C
x
x
−
1
(
x
−
E
(
x
)
)
=
E
(
θ
)
+
C
θ
x
C
x
x
−
1
(
x
−
E
(
x
)
)
(LMMSE.14)
\begin{aligned} \hat{\boldsymbol{\theta}} &=\left[\begin{array}{c} E\left(\theta_{1}\right) \\ E\left(\theta_{2}\right) \\ \vdots \\ E\left(\theta_{p}\right) \end{array}\right]+\left[\begin{array}{c} \mathbf{C}_{\theta_{1} x} \mathbf{C}_{x x}^{-1}(\mathbf{x}-E(\mathbf{x})) \\ \mathbf{C}_{\theta_{2} x} \mathbf{C}_{x x}^{-1}(\mathbf{x}-E(\mathbf{x})) \\ \vdots \\ \mathbf{C}_{\theta_{p} x} \mathbf{C}_{x x}^{-1}(\mathbf{x}-E(\mathbf{x})) \end{array}\right] \\ &=\left[\begin{array}{c} E\left(\theta_{1}\right) \\ E\left(\theta_{2}\right) \\ \vdots \\ E\left(\theta_{p}\right) \end{array}\right]+\left[\begin{array}{c} \mathbf{C}_{\theta_{1} x} \\ \mathbf{C}_{\theta_{2} x} \\ \vdots \\ \mathbf{C}_{\theta_{p} x} \end{array}\right] \mathbf{C}_{x x}^{-1}(\mathbf{x}-E(\mathbf{x})) \\ &=E(\boldsymbol{\theta})+\mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1}(\mathbf{x}-E(\mathbf{x})) \end{aligned}\tag{LMMSE.14}
θ^=⎣⎢⎢⎢⎡E(θ1)E(θ2)⋮E(θp)⎦⎥⎥⎥⎤+⎣⎢⎢⎢⎡Cθ1xCxx−1(x−E(x))Cθ2xCxx−1(x−E(x))⋮CθpxCxx−1(x−E(x))⎦⎥⎥⎥⎤=⎣⎢⎢⎢⎡E(θ1)E(θ2)⋮E(θp)⎦⎥⎥⎥⎤+⎣⎢⎢⎢⎡Cθ1xCθ2x⋮Cθpx⎦⎥⎥⎥⎤Cxx−1(x−E(x))=E(θ)+CθxCxx−1(x−E(x))(LMMSE.14)
where now
C
θ
x
\mathbf C_{\theta x}
Cθx is a
p
×
N
p \times N
p×N matrix.
By a similar approach we find that the Bayesian MSE matrix is (see Problem 12.7)
M
θ
^
=
E
[
(
θ
−
θ
^
)
(
θ
−
θ
^
)
T
]
=
C
θ
θ
−
C
θ
x
C
x
x
−
1
C
x
θ
(LMMSE.15)
\begin{aligned} \mathbf{M}_{\hat{\theta}} &=E\left[(\boldsymbol{\theta}-\hat{\boldsymbol{\theta}})(\boldsymbol{\theta}-\hat{\boldsymbol{\theta}})^{T}\right] \\ &=\mathbf{C}_{\theta \theta}-\mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta} \end{aligned}\tag{LMMSE.15}
Mθ^=E[(θ−θ^)(θ−θ^)T]=Cθθ−CθxCxx−1Cxθ(LMMSE.15)
where
C
θ
θ
\mathbf{C}_{\theta \theta}
Cθθ is the
p
×
p
p \times p
p×p covariance matrix. Consequently, the minimum Bayesian
M
S
E
\mathrm{MSE}
MSE is
Bmse
(
θ
^
i
)
=
[
M
θ
^
]
i
i
(LMMSE.16)
\operatorname{Bmse}(\hat \theta_i)=[\mathbf M_{\hat \theta}]_{ii}\tag{LMMSE.16}
Bmse(θ^i)=[Mθ^]ii(LMMSE.16)
Of course, these results are identical to those for the MMSE estimator in the Gaussian case, for which the estimator is linear.
Properties
-
LMMSE estimators are identical in form to the MMSE estimator for jointly Gaussian x \mathbf x x and θ \boldsymbol \theta θ
-
LMMSE estimator commutes over linear transformations. This is to say that if
α = A θ + b {\boldsymbol{\alpha}}=\mathbf{A} {\boldsymbol{\theta}}+\mathbf{b} α=Aθ+b
then the LMMSE estimator of α \alpha α is
α
^
=
A
θ
^
+
b
\hat{\boldsymbol{\alpha}}=\mathbf{A} \hat{\boldsymbol{\theta}}+\mathbf{b}
α^=Aθ^+b
with
θ
^
\hat{\theta}
θ^ given by
(
L
M
M
S
E
.
14
)
(LMMSE.14)
(LMMSE.14).
- The LMMSE estimator of a sum of unknown parameters is the sum of the individual estimators. Specifically, if we wish to estimate α = θ 1 + θ 2 , \alpha=\theta_{1}+\theta_{2}, α=θ1+θ2, then
α ^ = θ ^ 1 + θ ^ 2 \hat{\boldsymbol{\alpha}}=\hat{\boldsymbol{\theta}}_{\mathbf{1}}+\hat{\boldsymbol{\theta}}_{\mathbf{2}} α^=θ^1+θ^2
-
In analogy with the BLUE there is a corresponding Gauss-Markov theorem for the Bayesian case
Theorem 12.1 (Bayesian Gauss-Markov Theorem) If the data are described by the Bayesian linear model form
x = H θ + w \mathbf{x}=\mathbf{H} \boldsymbol{\theta}+\mathbf{w} x=Hθ+w
where x \mathbf{x} x is an N × 1 N \times 1 N×1 data vector, H \mathbf{H} H is a known N × p N \times p N×p observation matrix, θ \boldsymbol{\theta} θ is a p × 1 p \times 1 p×1 random vector of parameters whose realization is to be estimated and has mean E ( θ ) E(\boldsymbol{\theta}) E(θ) and covariance matrix C θ θ \mathrm{C}_{\theta \theta} Cθθ, and w \mathbf w w is an N × 1 N \times 1 N×1 random vector with zero mean and covariance matrix C w \mathrm{C}_{w} Cw and is uncorrelated with θ \boldsymbol \theta θ, then the LMMSE estimator of θ \boldsymbol \theta θ is
θ ^ = E ( θ ) + C θ θ H T ( H C θ θ H T + C w ) − 1 ( x − H E ( θ ) ) = E ( θ ) + ( C θ θ − 1 + H T C w − 1 H ) − 1 H T C w − 1 ( x − H E ( θ ) ) \begin{aligned} \hat{\boldsymbol{\theta}} &=E(\boldsymbol{\theta})+\mathbf{C}_{\theta \theta} \mathbf{H}^{T}\left(\mathbf{H} \mathbf{C}_{\theta \theta} \mathbf{H}^{T}+\mathbf{C}_{w}\right)^{-1}(\mathbf{x}-\mathbf{H} E(\boldsymbol{\theta})) \\ &=E(\boldsymbol{\theta})+\left(\mathbf{C}_{\theta \theta}^{-1}+\mathbf{H}^{T} \mathbf{C}_{w}^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}_{w}^{-1}(\mathbf{x}-\mathbf{H} E(\boldsymbol{\theta})) \end{aligned} θ^=E(θ)+CθθHT(HCθθHT+Cw)−1(x−HE(θ))=E(θ)+(Cθθ−1+HTCw−1H)−1HTCw−1(x−HE(θ))
The performance of the estimator is measured by the error ϵ = θ − θ ^ \boldsymbol\epsilon=\boldsymbol\theta-\hat{\boldsymbol\theta} ϵ=θ−θ^ whose mean is zero and whose covariance matrix is
C ϵ = E x , θ ( ϵ ϵ T ) = C θ θ − C θ θ H T ( H C θ θ H T + C w ) − 1 H C θ θ = ( C θ θ − 1 + H T C w − 1 H ) − 1 \begin{aligned} \mathbf{C}_{\epsilon} &=E_{x, \theta}\left(\boldsymbol\epsilon \boldsymbol\epsilon^{T}\right) \\ &=\mathbf{C}_{\theta \theta}-\mathbf{C}_{\theta \theta} \mathbf{H}^{T}\left(\mathbf{H} \mathbf{C}_{\theta \theta} \mathbf{H}^{T}+\mathbf{C}_{w}\right)^{-1} \mathbf{H} \mathbf{C}_{\theta \theta} \\ &=\left(\mathbf{C}_{\theta \theta}^{-1}+\mathbf{H}^{T} \mathbf{C}_{w}^{-1} \mathbf{H}\right)^{-1} \end{aligned} Cϵ=Ex,θ(ϵϵT)=Cθθ−CθθHT(HCθθHT+Cw)−1HCθθ=(Cθθ−1+HTCw−1H)−1
The error covariance matrix is also the minimum MSE matrix M θ ^ \mathbf M_{\hat{\theta}} Mθ^ whose diagonal elements yield the minimum Bayesian MSE
[ M θ ^ ] i i = [ C ϵ ] i i = Bmse ( θ ^ i ) \begin{aligned} \left[\mathbf{M}_{\hat{\theta}}\right]_{i i} &=\left[\mathbf{C}_{\epsilon}\right]_{i i} =\operatorname{Bmse}\left(\hat{\theta}_{i}\right) \end{aligned} [Mθ^]ii=[Cϵ]ii=Bmse(θ^i)
These results are identical to those in Theorem 10.3 (Posterior PDF for the Bayesian General Linear Model) except that the error vector is not necessarily Gaussian.
相關文章
- Practical Estimators (MLE, BLUE, LS)
- 『演算法』general演算法
- linux general protection faultLinux
- Animal Controller文件——GeneralController
- m基於深度學習的OFDM+QPSK鏈路通道估計和均衡演算法誤位元速率matlab模擬,對比LS,MMSE及LMMSE傳統演算法深度學習演算法Matlab
- 貝葉斯深度學習(bayesian deep learning)深度學習
- General->Identity->VersionIDE
- MySQL8.0:The General Query LogMySql
- What are general rules when deciding on index?Index
- 隨機森林n_estimators 學習曲線隨機森林
- 開啟 mysql 的 general_logMySql
- mysql5.7 General tablespace使用說明MySql
- Run-time Settings--General--Run Logic
- MySQL的general_log和slow_logMySql
- 分散式 | DBLE 的 general 日誌實現分散式
- General options: wwwxbs666999com 17008768000 execute commands
- 【MySQL 5.7參考手冊】8.14.2 General Thread StatesMySqlthread
- 人工智慧之機器學習基礎——貝葉斯(Bayesian Methods)人工智慧機器學習
- Map
- Dual Differential Grouping: A More General Decomposition Method for Large-Scale Optimization
- source map
- JavaScript map()JavaScript
- Map集合
- c++ map和unordered_map比較C++
- SQLSTATE[HY000]: General error: 1615 Prepared statement needs to be re-preparedSQLError
- Map集合&&Map集合的不同遍歷【keySet()&&entrySet()】
- Map delete() 方法delete
- Map forEach() 方法
- Set 和Map
- Map get() 方法
- Linux kernel mapLinux
- forEach與map
- Scala操作Map
- SCSS map物件CSS物件
- JavaScript map()方法JavaScript
- map 對比
- map/ multimap容器
- map和multimap