General Bayesian Estimators (MMSE, MAP, LMMSE)

Published by 拉普拉斯的汪 on 2020-12-06

References:
Kay, S. M. Fundamentals of Statistical Signal Processing. Prentice Hall PTR, 1993. (Ch. 11 - 11.5, Ch. 12 - 12.5)
Slides of ET4386, TUD


Motivation:

  • The optimal Bayesian estimators discussed in the previous chapter are difficult to determine in closed form, and in practice too computationally intensive to implement.

  • We now study more general Bayesian estimators and their properties. To do so, the concept of the Bayesian risk function is discussed.

  • Minimization of this criterion results in a variety of estimators. The ones that we will concentrate on are the MMSE estimator and the maximum a posteriori estimator.

  • They involve multidimensional integration for the MMSE estimator and multidimensional maximization for the MAP estimator. Although under the jointly Gaussian assumption these estimators are easily found, in general, they are not. When we are unable to make the Gaussian assumption, another approach must be used. To fill this gap we can choose to retain the MMSE criterion but constrain the estimator to be linear. Then, an explicit form for the estimator may be determined which depends only on the first two moments of the PDF. In many ways this approach is analogous to the BLUE in classical estimation, and some parallels will become apparent.

Risk Functions

Previously, we derived the MMSE estimator by minimizing $E[(\theta-\hat \theta)^2]$, where the expectation is with respect to the PDF $p(\mathbf x,\theta)$. If we let $\epsilon=\theta-\hat \theta$ denote the error of the estimator for a particular realization of $\mathbf x$ and $\theta$, and also let $\mathcal C(\epsilon)=\epsilon^2$, then the MSE criterion minimizes $E[\mathcal C(\epsilon)]$. $\mathcal C(\epsilon)$ is termed the cost function, and the average cost $E[\mathcal C(\epsilon)]$ is termed the Bayes risk $\mathcal R$, or
$$\mathcal R=E[\mathcal C(\epsilon)]\tag{RF.1}$$
If $\mathcal C(\epsilon)=\epsilon^2$, then the cost function is quadratic and the Bayes risk is just the MSE. Of course, there is no need to restrict ourselves to quadratic cost functions. In Figure 11.1b we have the “absolute” error
$$\mathcal C(\epsilon)=|\epsilon|\tag{RF.2}$$
In Figure 11.1c the “hit-or-miss” cost function is displayed
$$\mathcal C(\epsilon)=\left\{ \begin{array}{cc} 0 & |\epsilon|<\delta \\ 1 & |\epsilon|>\delta \end{array} \right.\tag{RF.3}$$

[Figure 11.1: the quadratic, absolute-error, and hit-or-miss cost functions]

We already have seen that the Bayes risk is minimized for a quadratic cost function by the MMSE estimator $\hat\theta=E(\theta|\mathbf x)$. We now determine the optimal estimators for the other cost functions.

The Bayes risk $\mathcal R$ is
$$\begin{aligned} \mathcal R&=E[\mathcal C(\epsilon)]\\ &=\int \int \mathcal C(\theta-\hat \theta)p(\mathbf x,\theta)d\mathbf x d\theta \\ &=\int \left[\int \mathcal C(\theta-\hat \theta)p(\theta|\mathbf x)d \theta\right]p(\mathbf x)d \mathbf x \end{aligned}\tag{RF.4}$$
We attempt to minimize the inner integral $g(\hat \theta)=\int \mathcal C(\theta-\hat \theta)p(\theta|\mathbf x)d \theta$ for each $\mathbf x$.

  • absolute error cost function
    $$\begin{aligned} g(\hat \theta)&=\int|\theta-\hat \theta|p(\theta|\mathbf x)d\theta\\ &=\int_{-\infty}^{\hat \theta} (\hat \theta-\theta)p(\theta|\mathbf x)d\theta+\int_{\hat \theta}^\infty (\theta-\hat \theta)p(\theta|\mathbf x)d\theta \end{aligned}$$
    Differentiating w.r.t. $\hat \theta$ using Leibnitz's rule
    $$\frac{\partial}{\partial u}\int_{\phi_1(u)}^{\phi_2(u)}h(u,v)dv=\int_{\phi_1(u)}^{\phi_2(u)}\frac{\partial h(u,v)}{\partial u}dv+\frac{d\phi_2(u)}{du}h(u,\phi_2(u))-\frac{d\phi_1(u)}{du}h(u,\phi_1(u))$$
    we have
    $$\frac{dg(\hat\theta)}{d\hat \theta}=\int_{-\infty}^{\hat \theta} p(\theta|\mathbf x)d\theta -\int_{\hat \theta}^\infty p(\theta|\mathbf x)d\theta=0$$
    or
    $$\int_{-\infty}^{\hat \theta} p(\theta|\mathbf x)d\theta =\int_{\hat \theta}^\infty p(\theta|\mathbf x)d\theta$$
    By definition $\hat \theta$ is the median of the posterior PDF, or the point for which $\Pr \{ \theta\le \hat \theta|\mathbf x\}=1/2$.

  • hit-or-miss cost function
    $$\begin{aligned} g(\hat{\theta})=&\int_{-\infty}^{\hat{\theta}-\delta} 1 \cdot p(\theta | \mathbf{x}) d \theta+\int_{\hat{\theta}+\delta}^{\infty} 1 \cdot p(\theta |\mathbf{x}) d \theta\\ =&1-\int_{\hat \theta-\delta}^{\hat \theta+\delta}p(\theta|\mathbf x)d\theta \end{aligned}$$
    This is minimized by maximizing
    $$\int_{\hat{\theta}-\delta}^{\hat{\theta}+\delta} p(\theta | \mathbf{x}) d \theta$$
    For $\delta$ arbitrarily small this is maximized by choosing $\hat \theta$ to correspond to the location of the maximum of $p(\theta|\mathbf x)$. The estimator that minimizes the Bayes risk for the “hit-or-miss” cost function is therefore the mode (location of the maximum) of the posterior PDF. It is termed the maximum a posteriori (MAP) estimator and will be described in more detail later.

In summary, the estimators that minimize the Bayes risk for the cost functions of Figure 11.1 are the mean, median, and mode of the posterior PDF. This is illustrated in Figure 11.2a. For some posterior PDFs these three estimators are identical. A notable example is the Gaussian posterior PDF
$$p(\theta|\mathbf x)=\frac{1}{\sqrt{2\pi \sigma^2_{\theta|x}}}\exp\left[-\frac {1}{2\sigma^2_{\theta|x}}(\theta-\mu_{\theta|x})^2 \right]$$
The mean $\mu_{\theta|x}$ is identical to the median (due to the symmetry) and the mode, as illustrated in Figure 11.2b.

[Figure 11.2: the mean, median, and mode of the posterior PDF]
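Once the posterior is available on a grid, the three Bayes-optimal estimators can be computed numerically. Below is a minimal sketch (a hypothetical illustration, not taken from Kay) that evaluates the mean (quadratic cost), the median (absolute-error cost), and the mode (hit-or-miss cost) of a deliberately skewed posterior, where the three differ, and of a Gaussian posterior, where they coincide.

```python
import numpy as np

def posterior_summaries(theta, post):
    """Mean (MMSE), median (absolute error), and mode (MAP) of a gridded posterior."""
    d = theta[1] - theta[0]
    post = post / np.sum(post * d)               # normalize numerically
    mean = np.sum(theta * post * d)              # minimizes the quadratic cost
    cdf = np.cumsum(post * d)
    median = theta[np.searchsorted(cdf, 0.5)]    # minimizes the absolute-error cost
    mode = theta[np.argmax(post)]                # minimizes the hit-or-miss cost (MAP)
    return mean, median, mode

theta = np.linspace(-5.0, 15.0, 20001)

# Skewed (Gamma-like) posterior: mean, median, and mode all differ.
skewed = np.where(theta > 0, theta * np.exp(-theta / 2.0), 0.0)
print("skewed  :", posterior_summaries(theta, skewed))     # approx (4.0, 3.36, 2.0)

# Gaussian posterior centered at 3: the three estimators coincide.
gauss = np.exp(-0.5 * (theta - 3.0) ** 2)
print("gaussian:", posterior_summaries(theta, gauss))      # approx (3.0, 3.0, 3.0)
```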

Minimum Mean Square Error Estimators (MMSE)

In The Bayesian Philosophy the MMSE estimator was determined to be $E(\theta|\mathbf x)$, or the mean of the posterior PDF. We continue our discussion of this important estimator by first extending it to the vector parameter case and then studying some of its properties.

If $\boldsymbol \theta$ is a vector parameter of dimension $p \times 1$, then to estimate $\theta_1$, for example, we may view the remaining parameters as nuisance parameters
$$p(\theta_1|\mathbf x)=\int \cdots \int p(\boldsymbol \theta|\mathbf x)d\theta_2\cdots d\theta_p \tag{MMSE.1}$$
where
$$p(\boldsymbol \theta|\mathbf x)=\frac{p(\mathbf x|\boldsymbol \theta)p(\boldsymbol \theta)}{\int p(\mathbf x|\boldsymbol \theta)p(\boldsymbol \theta)d\boldsymbol \theta} \tag{MMSE.2}$$
Then we have
$$\hat \theta_i =E(\theta_i|\mathbf x)=\int \theta_ip(\theta_i|\mathbf x)d\theta_i, \quad i=1,2,\cdots,p \tag{MMSE.3}$$
This is the MMSE estimator that minimizes
$$E[(\theta_i-\hat \theta_i)^2]=\int\int (\theta_i-\hat \theta_i)^2 p(\mathbf x,\theta_i)d\mathbf x d\theta_i \tag{MMSE.4}$$
or the squared error when averaged with respect to the marginal PDF $p(\mathbf x,\theta_i)$. Alternatively, we can express the MMSE estimator for the first parameter from (MMSE.1)
$$\begin{aligned} \hat \theta_1 &= \int \theta_1 p(\theta_1|\mathbf x)d\theta_1\\ &= \int \theta_1 \left[\int \cdots \int p(\boldsymbol \theta|\mathbf x)d\theta_2\cdots d\theta_p\right]d\theta_1\\ &=\int \theta_1 p(\boldsymbol \theta|\mathbf x)d\boldsymbol \theta \end{aligned}$$
or in general
$$\hat \theta_i=\int \theta_i p(\boldsymbol \theta|\mathbf x)d\boldsymbol \theta,\quad i=1,2,\cdots,p$$
In vector form we have
$$\hat {\boldsymbol \theta}=\left[\begin{matrix}\int \theta_1 p(\boldsymbol \theta|\mathbf x)d\boldsymbol \theta\\ \int \theta_2 p(\boldsymbol \theta|\mathbf x)d\boldsymbol \theta\\ \vdots\\ \int \theta_p p(\boldsymbol \theta|\mathbf x)d\boldsymbol \theta \end{matrix} \right]=\int \boldsymbol \theta p(\boldsymbol \theta|\mathbf x)d\boldsymbol \theta=E(\boldsymbol \theta|\mathbf x)\tag{MMSE.5}$$
where the expectation is with respect to the posterior PDF of the vector parameter, $p(\boldsymbol \theta|\mathbf x)$.

According to the definition, the minimum Bayesian MSE for $\hat\theta_1$ is
$$\operatorname{Bmse}(\hat \theta_1)=E[(\theta_1-\hat {\theta}_1)^2]=\int \int(\theta_1-\hat {\theta}_1)^2p(\mathbf x,\theta_1) d\theta_1 d\mathbf x$$
and since $\hat \theta_1=E(\theta_1|\mathbf x)$, we have
$$\begin{aligned} \operatorname{Bmse}(\hat \theta_1)&=\int \int [\theta_1-E(\theta_1|\mathbf x)]^2p(\mathbf x,\theta_1) d\theta_1 d\mathbf x\\ &=\int \left[\int [\theta_1-E(\theta_1|\mathbf x)]^2p(\theta_1|\mathbf x) d\theta_1 \right]p(\mathbf x)d\mathbf x \end{aligned}$$
The posterior PDF can be written as
$$p(\theta_1|\mathbf x)=\int \cdots \int p(\boldsymbol \theta|\mathbf x)d\theta_2\cdots d\theta _p$$
so that
$$\operatorname{Bmse}(\hat \theta_1)=\int \left[\int [\theta_1-E(\theta_1|\mathbf x)]^2p(\boldsymbol \theta|\mathbf x) d\boldsymbol \theta \right]p(\mathbf x)d\mathbf x\tag{MMSE.6}$$
The inner integral is just the $[1,1]$ element of $\mathbf C_{\theta|x}$, the covariance matrix of the posterior PDF. Hence, in general the minimum Bayesian MSE is
$$\operatorname{Bmse}(\hat \theta_i)=\int [\mathbf C_{\theta|x}]_{ii}\,p(\mathbf x)d\mathbf x\quad i=1,2,\cdots,p\tag{MMSE.7}$$
where
$$\mathbf C_{\theta|x}=E_{\theta|x}[(\boldsymbol \theta-E(\boldsymbol \theta|\mathbf x))(\boldsymbol \theta-E(\boldsymbol \theta|\mathbf x))^T]\tag{MMSE.8}$$
We can obtain $\hat {\boldsymbol \theta}$ and $\mathbf C_{\theta|x}$ from Theorem 10.2 mentioned in The Bayesian Philosophy.

Theorem 10.2 (Conditional PDF of Multivariate Gaussian) If $\mathbf{x}$ and $\mathbf y$ are jointly Gaussian, where $\mathbf{x}$ is $k \times 1$ and $\mathbf{y}$ is $l \times 1$, with mean vector $\left[E(\mathbf{x})^{T}\ E(\mathbf{y})^{T}\right]^{T}$ and
partitioned covariance matrix
$$\mathbf{C}=\left[\begin{array}{ll} \mathbf{C}_{x x} & \mathbf{C}_{x y} \\ \mathbf{C}_{y x} & \mathbf{C}_{y y} \end{array}\right]=\left[\begin{array}{ll} k \times k & k \times l \\ l \times k & l \times l \end{array}\right]\tag{BF.25}$$
so that
$$p(\mathbf{x}, \mathbf{y})=\frac{1}{(2 \pi)^{\frac{k+l}{2}} \operatorname{det}^{\frac{1}{2}}(\mathbf{C})} \exp \left[-\frac{1}{2}\left(\left[\begin{array}{l} \mathbf{x}-E(\mathbf{x}) \\ \mathbf{y}-E(\mathbf{y}) \end{array}\right]\right)^{T} \mathbf{C}^{-1}\left(\left[\begin{array}{l} \mathbf{x}-E(\mathbf{x}) \\ \mathbf{y}-E(\mathbf{y}) \end{array}\right]\right)\right]$$
then the conditional PDF $p(\mathbf y|\mathbf x)$ is also Gaussian and
$$E(\mathbf y|\mathbf x)=E(\mathbf y)+\mathbf C_{yx} \mathbf C_{xx}^{-1}(\mathbf x-E(\mathbf x))\tag{BF.26}$$

$$\mathbf C_{y|x}=\mathbf C_{yy}-\mathbf C_{yx}\mathbf C_{xx}^{-1}\mathbf C_{xy}\tag{BF.27}$$

For the Bayesian general linear model, we can apply the result in Theorem 10.3.

Theorem 10.3 (Posterior PDF for the Bayesian General Linear Model) If the observed data $\mathbf x$ can be modeled as
$$\mathbf{x}=\mathbf{H} \boldsymbol{\theta}+\mathbf{w}$$
where $\mathbf{x}$ is an $N \times 1$ data vector, $\mathbf{H}$ is a known $N \times p$ matrix, $\boldsymbol{\theta}$ is a $p \times 1$ random vector with prior PDF $\mathcal{N}\left(\boldsymbol\mu_{\theta}, \mathbf{C}_{\theta}\right)$, and $\mathbf{w}$ is an $N \times 1$ noise vector with PDF $\mathcal{N}\left(\mathbf{0}, \mathbf{C}_{w}\right)$ and independent of $\boldsymbol \theta$, then the posterior PDF $p(\boldsymbol\theta |\mathbf{x})$ is Gaussian with mean
$$E(\boldsymbol{\theta} |\mathbf{x})=\boldsymbol{\mu}_{\theta}+\mathbf{C}_{\theta} \mathbf{H}^{T}\left(\mathbf{H} \mathbf{C}_{\theta} \mathbf{H}^{T}+\mathbf{C}_{w}\right)^{-1}\left(\mathbf{x}-\mathbf{H} \boldsymbol{\mu}_{\theta}\right)\tag{BF.29}$$
and covariance
$$\mathbf{C}_{\theta \mid x}=\mathbf{C}_{\theta}-\mathbf{C}_{\theta} \mathbf{H}^{T}\left(\mathbf{H} \mathbf{C}_{\theta} \mathbf{H}^{T}+\mathbf{C}_{w}\right)^{-1} \mathbf{H} \mathbf{C}_{\theta}\tag{BF.30}$$
In contrast to the classical general linear model, $\mathbf{H}$ need not be full rank to ensure the invertibility of $\mathbf{H C}_{\theta} \mathbf{H}^{T}+\mathbf{C}_{w}$.

Alternative formulation using the matrix inversion lemma:
$$E(\boldsymbol{\theta} |\mathbf{x})=\boldsymbol\mu_{\theta}+\left(\mathbf{C}_{\theta}^{-1}+\mathbf{H}^{T} \mathbf{C}_w^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}_w^{-1}\left(\mathbf{x}-\mathbf{H} \boldsymbol\mu_{\theta}\right) \tag{BF.31}$$

$$\mathbf{C}_{\theta \mid x} =\left(\mathbf{C}_{\theta}^{-1}+\mathbf{H}^{T} \mathbf{C}_w^{-1} \mathbf{H}\right)^{-1}\tag{BF.32}$$

It is interesting to note that in the absence of prior knowledge in the Bayesian linear model the MMSE estimator yields the same form as the MVU estimator for the classical linear model. To verify this result, note from (BF.31) that with no prior knowledge $\mathbf C_\theta^{-1}\to \mathbf 0$, and therefore
$$\hat {\boldsymbol \theta}\to \left(\mathbf{H}^{T} \mathbf{C}_w^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}_w^{-1}\mathbf{x}$$
which is recognized as the MVU estimator for the general linear model.
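As a numerical sanity check of (BF.29)-(BF.32), the sketch below builds a small Bayesian linear model with assumed values for $\mathbf H$, $\mathbf C_\theta$, $\mathbf C_w$, and $\boldsymbol\mu_\theta$, computes the posterior mean and covariance in both forms, and verifies that letting $\mathbf C_\theta^{-1}\to\mathbf 0$ reproduces the weighted least-squares (MVU) form.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 8, 3                                       # assumed sizes

H = rng.standard_normal((N, p))                   # known observation matrix
C_theta = np.diag([2.0, 1.0, 0.5])                # assumed prior covariance
C_w = 0.3 * np.eye(N)                             # assumed noise covariance
mu_theta = np.array([1.0, -1.0, 0.5])             # assumed prior mean

theta = rng.multivariate_normal(mu_theta, C_theta)
x = H @ theta + rng.multivariate_normal(np.zeros(N), C_w)

# Form 1: (BF.29) / (BF.30)
G = C_theta @ H.T @ np.linalg.inv(H @ C_theta @ H.T + C_w)
mean1 = mu_theta + G @ (x - H @ mu_theta)
cov1 = C_theta - G @ H @ C_theta

# Form 2: (BF.31) / (BF.32), obtained via the matrix inversion lemma
Cw_inv = np.linalg.inv(C_w)
cov2 = np.linalg.inv(np.linalg.inv(C_theta) + H.T @ Cw_inv @ H)
mean2 = mu_theta + cov2 @ H.T @ Cw_inv @ (x - H @ mu_theta)
print(np.allclose(mean1, mean2), np.allclose(cov1, cov2))   # True True

# No prior knowledge: C_theta^{-1} -> 0 gives the MVU form for the linear model
mvu = np.linalg.solve(H.T @ Cw_inv @ H, H.T @ Cw_inv @ x)
cov_flat = np.linalg.inv(1e-12 * np.eye(p) + H.T @ Cw_inv @ H)
mean_flat = mu_theta + cov_flat @ H.T @ Cw_inv @ (x - H @ mu_theta)
print(np.allclose(mean_flat, mvu))                          # True (to numerical precision)
```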

Properties

  • The MMSE estimator commutes over linear transformations. Assume that we wish to estimate $\boldsymbol \alpha$ for
    $$\boldsymbol\alpha =\mathbf A\boldsymbol \theta +\mathbf b$$
    $\boldsymbol \alpha$ is a random vector for which the MMSE estimator is $\hat {\boldsymbol \alpha}=E(\boldsymbol \alpha |\mathbf x)$. Because of the linearity of the expectation operator,
    $$\hat {\boldsymbol \alpha}=E(\mathbf A\boldsymbol \theta+\mathbf b|\mathbf x)=\mathbf A E(\boldsymbol \theta|\mathbf x)+\mathbf b=\mathbf A\hat {\boldsymbol \theta}+\mathbf b$$
    This holds regardless of the joint PDF $p(\mathbf x,\boldsymbol \theta)$.

  • The MMSE estimator has an additivity property for independent data sets. Assume $\mathbf x_1, \mathbf x_2$ are independent data vectors and $\boldsymbol \theta, \mathbf x_1,\mathbf x_2$ are jointly Gaussian. Letting $\mathbf x=[\mathbf x_1^T ~ \mathbf x_2^T]^T$, we have from Theorem 10.2
    $$\hat {\boldsymbol \theta}=E(\boldsymbol \theta)+\mathbf C_{\theta x} \mathbf C_{xx}^{-1}(\mathbf x-E(\mathbf x))$$
    Since $\mathbf x_1, \mathbf x_2$ are independent,
    $$\mathbf C_{xx}^{-1}=\left[\begin{matrix}\mathbf C_{x_1 x_1} & \mathbf 0\\ \mathbf 0 & \mathbf C_{x_2x_2} \end{matrix} \right]^{-1}=\left[\begin{matrix}\mathbf C_{x_1 x_1}^{-1} & \mathbf 0\\ \mathbf 0 & \mathbf C_{x_2x_2}^{-1} \end{matrix} \right]$$

    $$\mathbf C_{\theta x}=E\left[\boldsymbol \theta\left[\begin{matrix}\mathbf x_1 \\\mathbf x_2 \end{matrix} \right]^T\right]=[\mathbf C_{\theta x_1}~\mathbf C_{\theta x_2}]$$

    Therefore
    $$\hat {\boldsymbol \theta}=E(\boldsymbol \theta)+\mathbf C_{\theta x_1} \mathbf C_{x_1x_1}^{-1}(\mathbf x_1-E(\mathbf x_1))+\mathbf C_{\theta x_2} \mathbf C_{x_2x_2}^{-1}(\mathbf x_2-E(\mathbf x_2))$$

  • In the jointly Gaussian case the MMSE estimator is linear in the data, as can be seen from (BF.26).

Example: Bayesian Fourier Analysis

Data model:
$$x[n]=a \cos 2\pi f_0 n+b \sin 2\pi f_0 n+w[n]\quad n=0,1,\cdots, N-1$$
where $f_0$ is a multiple of $1/N$, excepting $0$ or $1/2$ (for which $\sin 2 \pi f_0 n$ is identically zero), and $w[n]$ is WGN with variance $\sigma^2$. It is desired to estimate $\boldsymbol \theta=[a~b]^T$.

To find the MMSE estimator, the data model is rewritten as
$$\mathbf x=\mathbf H \boldsymbol \theta +\mathbf w$$
where
$$\mathbf{H}=\left[\begin{array}{cc} 1 & 0 \\ \cos 2 \pi f_{0} & \sin 2 \pi f_{0} \\ \vdots & \vdots \\ \cos \left[2 \pi f_{0}(N-1)\right] & \sin \left[2 \pi f_{0}(N-1)\right] \end{array}\right]$$
Because the columns of $\mathbf H$ are orthogonal, we have
$$\mathbf H^T \mathbf H=\frac{N}{2}\mathbf I$$
Assigning the prior $\boldsymbol\theta \sim \mathcal N(\mathbf 0, \sigma_\theta^2 \mathbf I)$, independent of $\mathbf w$, it is therefore more convenient to apply (BF.31) and (BF.32):
$$\begin{aligned} \hat{\boldsymbol{\theta}} &=\left(\frac{1}{\sigma_{\theta}^{2}} \mathbf{I}+\mathbf{H}^{T} \frac{1}{\sigma^{2}} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \frac{1}{\sigma^{2}} \mathbf{x}=\frac{1/\sigma^2}{1/\sigma_\theta^2+N/(2\sigma^2)}\mathbf H^T\mathbf x \\ \mathbf{C}_{\theta \mid x} &=\left(\frac{1}{\sigma_{\theta}^{2}} \mathbf{I}+\mathbf{H}^{T} \frac{1}{\sigma^{2}} \mathbf{H}\right)^{-1}=\frac{1}{1/\sigma_\theta^2+N/(2\sigma^2)}\mathbf I \end{aligned}$$
The MMSE estimator is
$$\begin{aligned} \hat{a} &=\frac{1}{1+\frac{2 \sigma^{2} / N}{\sigma_{\theta}^{2}}}\left[\frac{2}{N} \sum_{n=0}^{N-1} x[n] \cos 2 \pi f_{0} n\right] \\ \hat{b} &=\frac{1}{1+\frac{2 \sigma^{2} / N}{\sigma_{\theta}^{2}}}\left[\frac{2}{N} \sum_{n=0}^{N-1} x[n] \sin 2 \pi f_{0} n\right] \end{aligned}$$
with
$$\mathrm{Bmse}(\hat{a})=\mathrm{Bmse}(\hat{b})=\frac{1}{\frac{1}{\sigma_{\theta}^{2}}+\frac{1}{2 \sigma^{2} / N}}$$
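A short numerical sketch of this example, with hypothetical values for $N$, $f_0$, $\sigma^2$, and $\sigma_\theta^2$: it draws $a$ and $b$ from the assumed prior, generates the data, and evaluates $\hat a$, $\hat b$, and the common Bmse from the expressions above.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 64
f0 = 4.0 / N                          # a multiple of 1/N, not 0 or 1/2
sigma2, sigma_theta2 = 1.0, 2.0       # assumed noise and prior variances

n = np.arange(N)
c, s = np.cos(2 * np.pi * f0 * n), np.sin(2 * np.pi * f0 * n)

a, b = rng.normal(0.0, np.sqrt(sigma_theta2), size=2)   # theta ~ N(0, sigma_theta2 I)
x = a * c + b * s + rng.normal(0.0, np.sqrt(sigma2), size=N)

# MMSE estimator: shrunk Fourier coefficients (uses H^T H = (N/2) I)
gain = 1.0 / (1.0 + (2.0 * sigma2 / N) / sigma_theta2)
a_hat = gain * (2.0 / N) * np.sum(x * c)
b_hat = gain * (2.0 / N) * np.sum(x * s)

bmse = 1.0 / (1.0 / sigma_theta2 + 1.0 / (2.0 * sigma2 / N))
print((a, b), (a_hat, b_hat), bmse)
```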

Maximum A Posteriori Estimators (MAP)

In the MAP estimation approach we choose $\hat \theta$ to maximize the posterior PDF, or
$$\hat \theta=\arg \max _\theta p(\theta|\mathbf x)$$
This was shown to minimize the Bayes risk for the “hit-or-miss” cost function. Since
$$p(\theta|\mathbf x)=\frac{p(\mathbf x|\theta)p(\theta)}{p(\mathbf x)}$$
this is equivalent to
$$\hat \theta=\arg \max _\theta p(\mathbf x|\theta)p(\theta)\tag{MAP.1}$$
or
$$\hat \theta=\arg \max _\theta [\ln p(\mathbf x|\theta)+\ln p(\theta)]\tag{MAP.2}$$
To extend the MAP estimator to the vector parameter case in which the posterior PDF is now $p(\boldsymbol \theta|\mathbf x)$, we again employ the result
$$p(\theta_1|\mathbf x)=\int \cdots \int p(\boldsymbol \theta|\mathbf x)d\theta_2\cdots d\theta _p \tag{MAP.3}$$
Then the MAP estimator is given by
$$\hat \theta_1=\arg \max _{\theta_1} p(\theta_1|\mathbf x)$$
or in general
$$\hat \theta_i=\arg \max _{\theta_i} p(\theta_i|\mathbf x) \quad i=1,2,\cdots,p \tag{MAP.4}$$
One of the advantages of the MAP estimator for a scalar parameter is that to numerically determine it we need only maximize $p(\mathbf x|\theta)p(\theta)$. No integration is required. This desirable property of the MAP estimator for a scalar parameter does not carry over to the vector parameter case due to the need to obtain $p(\theta_i|\mathbf x)$ as per (MAP.3). However, we might propose the following vector MAP estimator
$$\hat {\boldsymbol \theta}=\arg \max _{\boldsymbol \theta} p(\boldsymbol \theta|\mathbf x)=\arg \max _{\boldsymbol \theta} p(\mathbf x|\boldsymbol \theta)p(\boldsymbol \theta) \tag{MAP.5}$$
in which the posterior PDF for the vector parameter $\boldsymbol \theta$ is maximized to find the estimator. The estimator in (MAP.5) is not in general the same as (MAP.4).

Properties

  • If $p(\theta)$ is uniform over an interval and the maximum of $p(\mathbf x|\theta)$ (as a function of $\theta$) falls within that interval, then
    $$\hat \theta=\arg \max _\theta p(\mathbf x|\theta)p(\theta)=\arg \max _\theta p(\mathbf x|\theta)$$
    which is essentially the Bayesian MLE.

  • Generally, as $N\to \infty$ the PDF $p(\mathbf x|\theta)$ becomes dominant over $p(\theta)$, and the MAP estimator thus becomes identical to the Bayesian MLE.

  • If $\mathbf x$ and $\boldsymbol \theta$ are jointly Gaussian, the posterior PDF is Gaussian and thus the MAP estimator is identical to the MMSE estimator.

  • The invariance property encountered in maximum likelihood theory (MLE) does not hold for the MAP estimator.

Example: DC Level in WGN - Uniform Prior PDF

Recall the introductory example in The Bayesian Philosophy. There we discussed the MMSE estimator of $A$ for a DC level in WGN with a uniform prior PDF. The MMSE estimator could not be obtained in explicit form due to the need to evaluate the integrals. The posterior PDF was given as
$$p(A |\mathbf{x})=\left\{\begin{aligned} &\frac{\frac{1}{\sqrt{2 \pi \sigma^{2} / N}} \exp \left[-\frac{1}{2 \sigma^{2} / N}(A-\bar{x})^{2}\right]}{\int_{-A_0}^{A_0}\frac{1}{\sqrt{2 \pi \sigma^{2} / N}} \exp \left[-\frac{1}{2 \sigma^{2} / N}(A-\bar{x})^{2}\right]dA} && |A| \leq A_{0} \\ &0, && |A|>A_{0} \end{aligned}\right.\tag{BP.9}$$

[Figure: the posterior PDF of $A$, a Gaussian centered at $\bar x$ truncated to $[-A_0, A_0]$]

The MAP estimator is explicitly given as
$$\hat A=\left\{\begin{array}{ll}-A_0 & \bar x<-A_0\\ \bar x & -A_0\le \bar x\le A_0\\ A_0 & \bar x>A_0 \end{array}\right.$$
It can be seen that the MAP estimator is usually easier to determine since it does not involve any integration but only a maximization.
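A minimal sketch of this MAP estimator with assumed values for $A_0$, $\sigma$, and $N$: because the posterior is a Gaussian in $A$ truncated to $[-A_0, A_0]$, the MAP estimate is simply the sample mean clipped to that interval.

```python
import numpy as np

rng = np.random.default_rng(2)
A0, sigma, N = 1.0, 2.0, 10                # assumed values
A = rng.uniform(-A0, A0)                   # draw A from the uniform prior
x = A + sigma * rng.standard_normal(N)     # DC level in WGN

x_bar = np.mean(x)
A_map = np.clip(x_bar, -A0, A0)            # MAP: sample mean clipped to [-A0, A0]
print(A, x_bar, A_map)
```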

Linear MMSE Estimation (LMMSE)

We begin our discussion by assuming that a scalar parameter $\theta$ is to be estimated based on the data set $\mathbf x=[x[0]~x[1]~\cdots~x[N-1]]^T$. We do not assume any specific form for the joint PDF $p(\mathbf x,\theta)$ but, as we shall see shortly, require only knowledge of its first two moments.

We now consider the class of all linear estimators of the form
$$\hat \theta =\sum_{n=0}^{N-1}a_n x[n]+a_N\tag{LMMSE.1}$$
and choose the weighting coefficients $a_n$ to minimize the Bayesian MSE
$$\operatorname{Bmse}(\hat \theta)=E\left[(\theta-\hat \theta)^2\right]=\int\int (\theta-\hat \theta)^2 p(\mathbf x,\theta)d\theta d\mathbf x\tag{LMMSE.2}$$
The resultant estimator is termed the linear minimum mean square error (LMMSE) estimator. Note that we have included the $a_N$ coefficient to allow for nonzero means of $\mathbf x$ and $\theta$.

Before determining the LMMSE estimator we should keep in mind that the estimator will be suboptimal unless the MMSE estimator happens to be linear. Such would be the case, for example, if the Bayesian linear model applied. Since the LMMSE estimator relies on the correlation between random variables, a parameter uncorrelated with the data cannot be linearly estimated.

We now derive the optimal weighting coefficients for use in (LMMSE.1). Substituting (LMMSE.1) into (LMMSE.2) and differentiating w.r.t. $a_N$,
$$\frac{\partial}{\partial a_{N}} E\left[\left(\theta-\sum_{n=0}^{N-1} a_{n} x[n]-a_{N}\right)^{2}\right]=-2 E\left[\theta-\sum_{n=0}^{N-1} a_{n} x[n]-a_{N}\right]$$
Setting this equal to zero produces
$$a_{N}=E(\theta)-\sum_{n=0}^{N-1} a_{n} E(x[n]) \tag{LMMSE.3}$$
which is zero if the means are zero. Continuing, we need to minimize
$$\mathrm{Bmse}(\hat{\theta})=E\left\{\left[\sum_{n=0}^{N-1} a_{n}(x[n]-E(x[n]))-(\theta-E(\theta))\right]^{2}\right\}$$
over the remaining $a_{n}$'s, where $a_{N}$ has been replaced by (LMMSE.3). Letting $\mathbf{a}=\left[a_{0}~ a_{1}~ \cdots~ a_{N-1}\right]^{T}$, we have
$$\begin{aligned} \operatorname{Bmse}(\hat{\theta})=& E\left\{\left[\mathbf{a}^{T}(\mathbf{x}-E(\mathbf{x}))-(\theta-E(\theta))\right]^{2}\right\} \\ =& E\left[\mathbf{a}^{T}(\mathbf{x}-E(\mathbf{x}))(\mathbf{x}-E(\mathbf{x}))^{T} \mathbf{a}\right]-E\left[\mathbf{a}^{T}(\mathbf{x}-E(\mathbf{x}))(\theta-E(\theta))\right] \\ &-E\left[(\theta-E(\theta))(\mathbf{x}-E(\mathbf{x}))^{T} \mathbf{a}\right]+E\left[(\theta-E(\theta))^{2}\right] \\ =& \mathbf{a}^{T} \mathbf{C}_{x x} \mathbf{a}-\mathbf{a}^{T} \mathbf{C}_{x \theta}-\mathbf{C}_{\theta x} \mathbf{a}+C_{\theta \theta} \end{aligned}\tag{LMMSE.4}$$
where $\mathbf{C}_{x x}$ is the $N \times N$ covariance matrix of $\mathbf{x}$, $\mathbf{C}_{\theta x}$ is the $1 \times N$ cross-covariance vector having the property that $\mathbf{C}_{\theta x}^{T}=\mathbf{C}_{x \theta}$, and $C_{\theta \theta}$ is the variance of $\theta$. Taking the gradient yields
$$\frac{\partial \mathrm{Bmse}(\hat{\theta})}{\partial \mathbf{a}}=2 \mathbf{C}_{x x} \mathbf{a}-2 \mathbf{C}_{x \theta}$$
which when set to zero results in
$$\mathbf{a}=\mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta}\tag{LMMSE.5}$$
Using (LMMSE.3) and (LMMSE.5) in (LMMSE.1) produces
$$\begin{aligned} \hat{\theta} &=\mathbf{a}^{T} \mathbf{x}+a_{N} \\ &=\mathbf{C}_{x \theta}^{T} \mathbf{C}_{x x}^{-1} \mathbf{x}+E(\theta)-\mathbf{C}_{x \theta}^{T} \mathbf{C}_{x x}^{-1} E(\mathbf{x}) \end{aligned}$$
or finally the LMMSE estimator is
$$\hat{\theta}=E(\theta)+\mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1}(\mathbf{x}-E(\mathbf{x}))\tag{LMMSE.6}$$
Note that it is identical in form to the MMSE estimator for jointly Gaussian $\mathbf x$ and $\theta$, as can be verified from (BF.26). This is because in the Gaussian case the MMSE estimator happens to be linear, and hence our constraint is automatically satisfied.

The minimum Bayesian MSE is obtained by substituting (LMMSE.5) into (LMMSE.4) to yield
$$\begin{aligned} \mathrm{Bmse}(\hat{\theta})=& \mathbf{C}_{x \theta}^{T} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x x} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta}-\mathbf{C}_{x \theta}^{T} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta} -\mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta}+ C_{\theta \theta} \\ =& \mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta}-2 \mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta}+ C_{\theta \theta}\\ =& C_{\theta \theta}-\mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta} \end{aligned}\tag{LMMSE.7}$$
In general, all that is required to determine the LMMSE estimator are the first two moments of $p(\mathbf x, \theta)$, or
$$\left[ \begin{matrix}E(\theta) \\E(\mathbf x) \end{matrix}\right], \quad \left[ \begin{matrix} C_{\theta \theta} & \mathbf C_{\theta x} \\\mathbf C_{x \theta} & \mathbf C_{xx} \end{matrix}\right]$$
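The following sketch wraps (LMMSE.6) and (LMMSE.7) in a small helper that uses only the supplied first and second moments; the example moments ($\theta$ observed in white noise) are assumptions for illustration.

```python
import numpy as np

def lmmse_scalar(x, mean_x, mean_theta, C_xx, C_theta_x, C_theta_theta):
    """Scalar LMMSE estimator (LMMSE.6) and its minimum Bmse (LMMSE.7)."""
    a = np.linalg.solve(C_xx, C_theta_x)          # a = C_xx^{-1} C_x_theta
    theta_hat = mean_theta + a @ (x - mean_x)     # E(theta) + C_theta_x C_xx^{-1}(x - E(x))
    bmse = C_theta_theta - C_theta_x @ a          # C_theta_theta - C_theta_x C_xx^{-1} C_x_theta
    return theta_hat, bmse

# Assumed moments: theta with variance 1 observed as x[n] = theta + w[n],
# with w[n] white, variance 0.5, and N = 4 samples.
N, var_theta, var_w = 4, 1.0, 0.5
C_xx = var_theta * np.ones((N, N)) + var_w * np.eye(N)
C_theta_x = var_theta * np.ones(N)

rng = np.random.default_rng(3)
theta = rng.normal(0.0, np.sqrt(var_theta))
x = theta + rng.normal(0.0, np.sqrt(var_w), size=N)
print(lmmse_scalar(x, np.zeros(N), 0.0, C_xx, C_theta_x, var_theta))
```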

Geometrical Interpretations

In this formulation, we assume that $\theta$ and $\mathbf x$ are zero mean. If not, we can always define the zero mean random variables $\theta '=\theta-E(\theta)$ and $\mathbf x '=\mathbf x-E(\mathbf x)$, and consider estimation of $\theta '$ by a linear function of $\mathbf x'$. Now we wish to find the $a_n$'s so that
$$\hat \theta =\sum_{n=0}^{N-1}a_n x[n]$$
minimizes
$$\operatorname{Bmse}(\hat \theta)=E\left[(\theta-\hat \theta)^2\right]=\int\int (\theta-\hat \theta)^2 p(\mathbf x,\theta)d\theta d\mathbf x$$
We view the zero mean random variables $\theta$ and $x[n]$ as elements of a vector space. It can be shown that an appropriate definition of the inner product between two such “vectors” $x$ and $y$, i.e., one that satisfies the properties of an inner product, is
$$(x,y)=E(xy) \tag{LMMSE.8}$$
With this definition we have that
$$(x,x)=E(x^2)=\|x\|^2\tag{LMMSE.9}$$
Also, we can now define two vectors to be orthogonal if
$$(x,y)=E(xy)=0\tag{LMMSE.10}$$
Since the vectors are zero mean, this is equivalent to saying that two vectors are orthogonal if and only if they are uncorrelated.

Then the Bayesian MSE can be expressed as
$$E\left[(\theta-\hat \theta)^2\right]=\left\|\theta -\sum_{n=0}^{N-1} a_n x[n] \right\|^2$$
This means that minimization of the MSE is equivalent to a minimization of the squared length of the error vector $\epsilon =\theta-\hat \theta$. Clearly, the length of the error vector is minimized when $\epsilon$ is orthogonal to the subspace spanned by $\{x[0], x[1], \ldots , x[N - 1]\}$. Hence, we require
$$\epsilon \perp x[0],x[1],\cdots,x[N-1]\tag{LMMSE.11}$$
or, by using our definition of orthogonality,
$$E\left[ (\theta-\hat \theta)x[n] \right]=0\quad n=0,1,\cdots,N-1 \tag{LMMSE.12}$$
This is the important orthogonality principle. It says that in estimating the realization of a random variable by a linear combination of data samples, the optimal estimator is obtained when the error is orthogonal to each data sample. Using the orthogonality principle, the weighting coefficients are easily found from
$$\sum_{m=0}^{N-1}a_m E(x[m]x[n])=E(\theta x[n])\quad n=0,1,\cdots,N-1$$
In matrix form this is
$$\left[\begin{array}{cccc} E\left(x^{2}[0]\right) & E(x[0] x[1]) & \ldots & E(x[0] x[N-1]) \\ E(x[1] x[0]) & E\left(x^{2}[1]\right) & \ldots & E(x[1] x[N-1]) \\ \vdots & \vdots & \ddots & \vdots \\ E(x[N-1] x[0]) & E(x[N-1] x[1]) & \ldots & E\left(x^{2}[N-1]\right) \end{array}\right]\left[\begin{array}{c} a_{0} \\ a_{1} \\ \vdots \\ a_{N-1} \end{array}\right]=\left[\begin{array}{c} E(\theta x[0]) \\ E(\theta x[1]) \\ \vdots \\ E(\theta x[N-1]) \end{array}\right]$$
These are the normal equations. The matrix is recognized as $\mathbf C_{xx}$, and the right-hand vector as $\mathbf C_{x\theta}$. Therefore,
$$\mathbf C_{xx}\mathbf a=\mathbf C_{x\theta}$$
and
$$\mathbf a=\mathbf C_{xx}^{-1}\mathbf C_{x\theta}$$
The LMMSE estimator of $\theta$ is
$$\hat \theta =\mathbf a^T \mathbf x=\mathbf C_{\theta x}\mathbf C_{xx}^{-1}\mathbf x$$
in agreement with (LMMSE.5) and (LMMSE.6). The minimum Bayesian MSE is the squared length of the error vector, or
$$\begin{aligned} E\left[(\theta-\hat \theta)^2\right]&=E\left[\left(\theta-\sum_{n=0}^{N-1}a_n x[n]\right)\epsilon\right]\\ &=E\left[\theta\epsilon\right]\\ &=E(\theta^2)-\sum_{n=0}^{N-1}a_n E(x[n]\theta)\\ &=C_{\theta \theta}-\mathbf a^T\mathbf C_{x\theta}\\ &=C_{\theta \theta}-\mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta} \end{aligned}$$
in agreement with (LMMSE.7).

[Figure: the error vector $\epsilon$ is orthogonal to the subspace spanned by the data samples]
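The orthogonality principle can also be checked empirically: estimate the second moments from Monte Carlo draws of an assumed zero-mean model, solve the normal equations for $\mathbf a$, and confirm that the resulting error is uncorrelated with every data sample. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
N, trials = 3, 200_000

# Assumed zero-mean model: x[n] = theta + w[n], var(theta) = 1, var(w) = 0.25
theta = rng.standard_normal(trials)
x = theta[:, None] + 0.5 * rng.standard_normal((trials, N))

# Sample second moments E(x x^T) and E(theta x[n])
Cxx = x.T @ x / trials
Cxtheta = x.T @ theta / trials

a = np.linalg.solve(Cxx, Cxtheta)      # normal equations: C_xx a = C_x_theta
theta_hat = x @ a
err = theta - theta_hat

# Orthogonality: E[(theta - theta_hat) x[n]] = 0 (exactly, for the sample moments)
print(err @ x / trials)
# Empirical Bmse, close to 1/(1/1 + 3/0.25) = 1/13
print(np.mean(err ** 2))
```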

Example: DC Level in WGN - Uniform Prior PDF

The data model is
$$x[n]=A+w[n] \quad n=0,1, \ldots, N-1$$
where $A \sim \mathcal{U}\left[-A_{0}, A_{0}\right]$, $w[n]$ is WGN with variance $\sigma^{2}$, and $A$ and $w[n]$ are independent. We wish to estimate $A$. The MMSE estimator cannot be obtained in closed form due to the integration required. Applying the LMMSE estimator, we first note that $E(A)=0$, and hence $E(x[n])=0$. Since $E(\mathbf{x})=\mathbf{0}$, the covariances are
$$\begin{aligned} \mathbf{C}_{x x} &=E\left(\mathbf{x} \mathbf{x}^{T}\right) \\ &=E\left[(A \mathbf{1}+\mathbf{w})(A \mathbf{1}+\mathbf{w})^{T}\right] \\ &=E\left(A^{2}\right) \mathbf{1} \mathbf{1}^{T}+\sigma^{2} \mathbf{I} \\ \mathbf{C}_{\theta x} &=E\left(A \mathbf{x}^{T}\right) \\ &=E\left[A(A \mathbf{1}+\mathbf{w})^{T}\right] \\ &=E\left(A^{2}\right) \mathbf{1}^{T} \end{aligned}$$
Hence, from (LMMSE.6)
$$\begin{aligned} \hat{A} &=\mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1} \mathbf{x} \\ &=\sigma_{A}^{2} \mathbf{1}^{T}\left(\sigma_{A}^{2} \mathbf{1} \mathbf{1}^{T}+\sigma^{2} \mathbf{I}\right)^{-1} \mathbf{x}\\ &=\frac{\sigma_A^2}{\sigma_A^2+\frac{\sigma^2}{N}}\bar x\\ &=\frac{\frac{A_{0}^{2}}{3}}{\frac{A_{0}^{2}}{3}+\frac{\sigma^{2}}{N}} \bar{x} \end{aligned}$$
where we have let $\sigma_{A}^{2}=E\left(A^{2}\right)=A_0^2/3$ (the variance of the uniform prior) and used the matrix inversion lemma to evaluate the inverse.
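A quick Monte Carlo sketch of this example with assumed values for $A_0$, $\sigma$, and $N$: the LMMSE estimate is a shrunk sample mean, and its average squared error matches the Bmse predicted by (LMMSE.7) even though $A$ is uniform rather than Gaussian.

```python
import numpy as np

rng = np.random.default_rng(5)
A0, sigma, N, trials = 1.0, 1.0, 10, 100_000     # assumed values

sigma_A2 = A0 ** 2 / 3                           # variance of U[-A0, A0]
shrink = sigma_A2 / (sigma_A2 + sigma ** 2 / N)

A = rng.uniform(-A0, A0, size=trials)
x = A[:, None] + sigma * rng.standard_normal((trials, N))
A_hat = shrink * x.mean(axis=1)                  # LMMSE: shrunk sample mean

# Predicted minimum Bmse from (LMMSE.7): C_AA - C_Ax C_xx^{-1} C_xA
Cxx = sigma_A2 * np.ones((N, N)) + sigma ** 2 * np.eye(N)
CxA = sigma_A2 * np.ones(N)
bmse = sigma_A2 - CxA @ np.linalg.solve(Cxx, CxA)

print(np.mean((A - A_hat) ** 2), bmse)           # the two agree closely
```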

The Vector LMMSE Estimator

Now we wish to find the linear estimator that minimizes the Bayesian MSE for each element of the vector parameter. We assume that
$$\hat{\theta}_{i}=\sum_{n=0}^{N-1} a_{i n} x[n]+a_{i N} \tag{LMMSE.13}$$
for $i=1,2, \ldots, p$ and choose the weighting coefficients to minimize
$$\operatorname{Bmse}\left(\hat{\theta}_{i}\right)=E\left[\left(\theta_{i}-\hat{\theta}_{i}\right)^{2}\right] \quad i=1,2, \ldots, p$$
where the expectation is with respect to $p\left(\mathbf{x}, \theta_{i}\right)$. Since we are actually determining $p$ separate estimators, the scalar solution can be applied, and we obtain from (LMMSE.6)
$$\hat{\theta}_{i}=E\left(\theta_{i}\right)+\mathbf{C}_{\theta_{i} x} \mathbf{C}_{x x}^{-1}(\mathbf{x}-E(\mathbf{x})) \quad i=1,2, \ldots, p$$
and the minimum Bayesian MSE is, from (LMMSE.7),
$$\mathrm{Bmse}\left(\hat{\theta}_{i}\right)=C_{\theta_{i} \theta_{i}}-\mathbf{C}_{\theta_{i} x} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta_{i}} \quad i=1,2, \ldots, p$$
The scalar LMMSE estimators can be combined into a vector estimator as
$$\begin{aligned} \hat{\boldsymbol{\theta}} &=\left[\begin{array}{c} E\left(\theta_{1}\right) \\ E\left(\theta_{2}\right) \\ \vdots \\ E\left(\theta_{p}\right) \end{array}\right]+\left[\begin{array}{c} \mathbf{C}_{\theta_{1} x} \mathbf{C}_{x x}^{-1}(\mathbf{x}-E(\mathbf{x})) \\ \mathbf{C}_{\theta_{2} x} \mathbf{C}_{x x}^{-1}(\mathbf{x}-E(\mathbf{x})) \\ \vdots \\ \mathbf{C}_{\theta_{p} x} \mathbf{C}_{x x}^{-1}(\mathbf{x}-E(\mathbf{x})) \end{array}\right] \\ &=\left[\begin{array}{c} E\left(\theta_{1}\right) \\ E\left(\theta_{2}\right) \\ \vdots \\ E\left(\theta_{p}\right) \end{array}\right]+\left[\begin{array}{c} \mathbf{C}_{\theta_{1} x} \\ \mathbf{C}_{\theta_{2} x} \\ \vdots \\ \mathbf{C}_{\theta_{p} x} \end{array}\right] \mathbf{C}_{x x}^{-1}(\mathbf{x}-E(\mathbf{x})) \\ &=E(\boldsymbol{\theta})+\mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1}(\mathbf{x}-E(\mathbf{x})) \end{aligned}\tag{LMMSE.14}$$
where now $\mathbf C_{\theta x}$ is a $p \times N$ matrix.

By a similar approach we find that the Bayesian MSE matrix is (see Problem 12.7)
$$\begin{aligned} \mathbf{M}_{\hat{\theta}} &=E\left[(\boldsymbol{\theta}-\hat{\boldsymbol{\theta}})(\boldsymbol{\theta}-\hat{\boldsymbol{\theta}})^{T}\right] \\ &=\mathbf{C}_{\theta \theta}-\mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta} \end{aligned}\tag{LMMSE.15}$$
where $\mathbf{C}_{\theta \theta}$ is the $p \times p$ covariance matrix. Consequently, the minimum Bayesian MSE is
$$\operatorname{Bmse}(\hat \theta_i)=[\mathbf M_{\hat \theta}]_{ii}\tag{LMMSE.16}$$
Of course, these results are identical to those for the MMSE estimator in the Gaussian case, for which the estimator is linear.
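A minimal sketch of the vector LMMSE estimator (LMMSE.14) and its MSE matrix (LMMSE.15), assuming the joint first and second moments are supplied; the example values (a small Bayesian linear model) are hypothetical.

```python
import numpy as np

def lmmse_vector(x, mean_x, mean_theta, C_xx, C_theta_x, C_theta_theta):
    """Vector LMMSE estimator (LMMSE.14) and Bayesian MSE matrix (LMMSE.15)."""
    W = C_theta_x @ np.linalg.inv(C_xx)           # p x N gain matrix
    theta_hat = mean_theta + W @ (x - mean_x)
    M = C_theta_theta - W @ C_theta_x.T           # uses C_x_theta = C_theta_x^T
    return theta_hat, M                           # Bmse(theta_i) = M[i, i]

# Example moments from an assumed Bayesian linear model x = H theta + w
rng = np.random.default_rng(6)
N, p = 6, 2
H = rng.standard_normal((N, p))
C_tt = np.diag([1.0, 0.5])                        # prior covariance of theta
C_w = 0.2 * np.eye(N)                             # noise covariance

C_xx = H @ C_tt @ H.T + C_w                       # covariance of x
C_tx = C_tt @ H.T                                 # p x N cross-covariance E(theta x^T)

theta = rng.multivariate_normal(np.zeros(p), C_tt)
x = H @ theta + rng.multivariate_normal(np.zeros(N), C_w)

theta_hat, M = lmmse_vector(x, np.zeros(N), np.zeros(p), C_xx, C_tx, C_tt)
print(theta_hat, np.diag(M))
```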

Properties

  • LMMSE estimators are identical in form to the MMSE estimator for jointly Gaussian $\mathbf x$ and $\boldsymbol \theta$.

  • The LMMSE estimator commutes over linear transformations. That is, if
    $$\boldsymbol{\alpha}=\mathbf{A} \boldsymbol{\theta}+\mathbf{b}$$
    then the LMMSE estimator of $\boldsymbol\alpha$ is
    $$\hat{\boldsymbol{\alpha}}=\mathbf{A} \hat{\boldsymbol{\theta}}+\mathbf{b}$$
    with $\hat{\boldsymbol\theta}$ given by (LMMSE.14).

  • The LMMSE estimator of a sum of unknown parameters is the sum of the individual estimators. Specifically, if we wish to estimate $\boldsymbol\alpha=\boldsymbol\theta_{1}+\boldsymbol\theta_{2}$, then
    $$\hat{\boldsymbol{\alpha}}=\hat{\boldsymbol{\theta}}_{1}+\hat{\boldsymbol{\theta}}_{2}$$

  • In analogy with the BLUE there is a corresponding Gauss-Markov theorem for the Bayesian case

    Theorem 12.1 (Bayesian Gauss-Markov Theorem) If the data are described by the Bayesian linear model form
    $$\mathbf{x}=\mathbf{H} \boldsymbol{\theta}+\mathbf{w}$$
    where $\mathbf{x}$ is an $N \times 1$ data vector, $\mathbf{H}$ is a known $N \times p$ observation matrix, $\boldsymbol{\theta}$ is a $p \times 1$ random vector of parameters whose realization is to be estimated and has mean $E(\boldsymbol{\theta})$ and covariance matrix $\mathbf{C}_{\theta \theta}$, and $\mathbf w$ is an $N \times 1$ random vector with zero mean and covariance matrix $\mathbf{C}_{w}$ and is uncorrelated with $\boldsymbol \theta$, then the LMMSE estimator of $\boldsymbol \theta$ is
    $$\begin{aligned} \hat{\boldsymbol{\theta}} &=E(\boldsymbol{\theta})+\mathbf{C}_{\theta \theta} \mathbf{H}^{T}\left(\mathbf{H} \mathbf{C}_{\theta \theta} \mathbf{H}^{T}+\mathbf{C}_{w}\right)^{-1}(\mathbf{x}-\mathbf{H} E(\boldsymbol{\theta})) \\ &=E(\boldsymbol{\theta})+\left(\mathbf{C}_{\theta \theta}^{-1}+\mathbf{H}^{T} \mathbf{C}_{w}^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}_{w}^{-1}(\mathbf{x}-\mathbf{H} E(\boldsymbol{\theta})) \end{aligned}$$
    The performance of the estimator is measured by the error $\boldsymbol\epsilon=\boldsymbol\theta-\hat{\boldsymbol\theta}$, whose mean is zero and whose covariance matrix is
    $$\begin{aligned} \mathbf{C}_{\epsilon} &=E_{x, \theta}\left(\boldsymbol\epsilon \boldsymbol\epsilon^{T}\right) \\ &=\mathbf{C}_{\theta \theta}-\mathbf{C}_{\theta \theta} \mathbf{H}^{T}\left(\mathbf{H} \mathbf{C}_{\theta \theta} \mathbf{H}^{T}+\mathbf{C}_{w}\right)^{-1} \mathbf{H} \mathbf{C}_{\theta \theta} \\ &=\left(\mathbf{C}_{\theta \theta}^{-1}+\mathbf{H}^{T} \mathbf{C}_{w}^{-1} \mathbf{H}\right)^{-1} \end{aligned}$$
    The error covariance matrix is also the minimum MSE matrix $\mathbf M_{\hat{\theta}}$, whose diagonal elements yield the minimum Bayesian MSE
    $$\left[\mathbf{M}_{\hat{\theta}}\right]_{i i} =\left[\mathbf{C}_{\epsilon}\right]_{i i} =\operatorname{Bmse}\left(\hat{\theta}_{i}\right)$$

These results are identical to those in Theorem 10.3 (Posterior PDF for the Bayesian General Linear Model) except that the error vector is not necessarily Gaussian.
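Theorem 12.1 requires only the first two moments, not Gaussianity. The sketch below uses an assumed small model with uniform $\boldsymbol\theta$ and Laplacian noise, scaled to the stated covariances, and checks by Monte Carlo that the empirical error covariance of the LMMSE estimator matches $\mathbf C_\epsilon=(\mathbf C_{\theta\theta}^{-1}+\mathbf H^T\mathbf C_w^{-1}\mathbf H)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(7)
N, p, trials = 5, 2, 200_000                     # assumed sizes

H = rng.standard_normal((N, p))
C_tt = np.diag([1.0, 0.5])                       # prior covariance (zero prior mean assumed)
sigma_w2 = 0.3

# Non-Gaussian draws with exactly these second moments: uniform theta components
# and Laplacian noise, scaled to the required variances.
theta = rng.uniform(-1.0, 1.0, size=(trials, p)) * np.sqrt(3.0 * np.diag(C_tt))
w = rng.laplace(0.0, np.sqrt(sigma_w2 / 2.0), size=(trials, N))
x = theta @ H.T + w

# LMMSE / Bayesian Gauss-Markov estimator and its predicted error covariance
C_eps = np.linalg.inv(np.linalg.inv(C_tt) + H.T @ H / sigma_w2)
G = C_eps @ H.T / sigma_w2                       # gain applied to x (zero means)
theta_hat = x @ G.T

err = theta - theta_hat
print(err.T @ err / trials)                      # empirical error covariance
print(C_eps)                                     # predicted C_eps: agrees closely
```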
