General Bayesian Estimators (MMSE, MAP, LMMSE)
Reference:
Kay S M. Fundamentals of statistical signal processing[M]. Prentice Hall PTR, 1993. (Ch. 11 - 11.5, Ch. 12 - 12.5)
Slides of ET4386, TUD
Motivation:
- The optimal Bayesian estimators discussed in the previous chapter are difficult to determine in closed form and, in practice, too computationally intensive to implement.
- We now study more general Bayesian estimators and their properties. To do so, the concept of the Bayesian risk function is introduced.
- Minimization of this criterion results in a variety of estimators. The ones we concentrate on are the MMSE estimator and the maximum a posteriori (MAP) estimator.
- They involve multidimensional integration for the MMSE estimator and multidimensional maximization for the MAP estimator. Although under the jointly Gaussian assumption these estimators are easily found, in general they are not. When we are unable to make the Gaussian assumption, another approach must be used. To fill this gap we can retain the MMSE criterion but constrain the estimator to be linear. An explicit form for the estimator can then be determined which depends only on the first two moments of the PDF. In many ways this approach is analogous to the BLUE in classical estimation, and some parallels will become apparent.
Risk Functions
Previously, we derived the MMSE estimator by minimizing $E[(\theta-\hat \theta)^2]$, where the expectation is with respect to the PDF $p(\mathbf x,\theta)$. If we let $\epsilon=\theta-\hat \theta$ denote the error of the estimator for a particular realization of $\mathbf x$ and $\theta$, and also let $\mathcal C(\epsilon)=\epsilon^2$, then the MSE criterion minimizes $E[\mathcal C(\epsilon)]$.
$\mathcal C(\epsilon)$ is termed the cost function, and the average cost $E[\mathcal C(\epsilon)]$ is termed the Bayes risk $\mathcal R$, i.e.,
$$\mathcal R=E[\mathcal C(\epsilon)]\tag{RF.1}$$
If $\mathcal C(\epsilon)=\epsilon^2$, the cost function is quadratic and the Bayes risk is just the MSE. Of course, there is no need to restrict ourselves to quadratic cost functions. In Figure 11.1b we have the "absolute" error
$$\mathcal C(\epsilon)=|\epsilon|\tag{RF.2}$$
In Figure 11.1c the "hit-or-miss" cost function is displayed:
$$\mathcal C(\epsilon)=\begin{cases} 0 & |\epsilon|<\delta \\ 1 & |\epsilon|>\delta \end{cases}\tag{RF.3}$$
We have already seen that the Bayes risk for the quadratic cost function is minimized by the MMSE estimator $\hat\theta=E(\theta|\mathbf x)$. We now determine the optimal estimators for the other cost functions.
The Bayes risk $\mathcal R$ is
$$\begin{aligned} \mathcal R&=E[\mathcal C(\epsilon)]\\ &=\int \int \mathcal C(\theta-\hat \theta)p(\mathbf x,\theta)\,d\mathbf x\, d\theta \\ &=\int \left[\int \mathcal C(\theta-\hat \theta)p(\theta|\mathbf x)\,d \theta\right]p(\mathbf x)\,d \mathbf x \end{aligned}\tag{RF.4}$$
We attempt to minimize the inner integral $g(\hat \theta)=\int \mathcal C(\theta-\hat \theta)p(\theta|\mathbf x)\,d \theta$ for each $\mathbf x$.
- Absolute error cost function:
  $$\begin{aligned} g(\hat \theta)&=\int|\theta-\hat \theta|p(\theta|\mathbf x)\,d\theta\\ &=\int_{-\infty}^{\hat \theta} (\hat \theta-\theta)p(\theta|\mathbf x)\,d\theta+\int_{\hat \theta}^\infty (\theta-\hat \theta)p(\theta|\mathbf x)\,d\theta \end{aligned}$$
  Differentiating w.r.t. $\hat \theta$ using Leibniz's rule
  $$\frac{\partial}{\partial u}\int_{\phi_1(u)}^{\phi_2(u)}h(u,v)\,dv=\int_{\phi_1(u)}^{\phi_2(u)}\frac{\partial h(u,v)}{\partial u}\,dv+\frac{d\phi_2(u)}{du}h(u,\phi_2(u))-\frac{d\phi_1(u)}{du}h(u,\phi_1(u))$$
  we have
  $$\frac{dg(\hat\theta)}{d\hat \theta}=\int_{-\infty}^{\hat \theta} p(\theta|\mathbf x)\,d\theta -\int_{\hat \theta}^\infty p(\theta|\mathbf x)\,d\theta=0$$
  or
  $$\int_{-\infty}^{\hat \theta} p(\theta|\mathbf x)\,d\theta =\int_{\hat \theta}^\infty p(\theta|\mathbf x)\,d\theta$$
  By definition $\hat \theta$ is the median of the posterior PDF, i.e., the point for which $\Pr \{ \theta\le \hat \theta|\mathbf x\}=1/2$.
- Hit-or-miss cost function:
  $$\begin{aligned} g(\hat{\theta})&=\int_{-\infty}^{\hat{\theta}-\delta} 1 \cdot p(\theta | \mathbf{x})\, d \theta+\int_{\hat{\theta}+\delta}^{\infty} 1 \cdot p(\theta |\mathbf{x})\, d \theta\\ &=1-\int_{\hat \theta-\delta}^{\hat \theta+\delta}p(\theta|\mathbf x)\,d\theta \end{aligned}$$
  This is minimized by maximizing
  $$\int_{\hat{\theta}-\delta}^{\hat{\theta}+\delta} p(\theta | \mathbf{x})\, d \theta$$
  For $\delta$ arbitrarily small this is maximized by choosing $\hat \theta$ to correspond to the location of the maximum of $p(\theta|\mathbf x)$. The estimator that minimizes the Bayes risk for the "hit-or-miss" cost function is therefore the mode (location of the maximum) of the posterior PDF. It is termed the maximum a posteriori (MAP) estimator and will be described in more detail later.
In summary, the estimators that minimize the Bayes risk for the cost functions of Figure 11.1 are the mean, median, and mode of the posterior PDF. This is illustrated in Figure 11.2a. For some posterior PDFs these three estimators are identical. A notable example is the Gaussian posterior PDF
$$p(\theta|\mathbf x)=\frac{1}{\sqrt{2\pi \sigma^2_{\theta|x}}}\exp\left[-\frac {1}{2\sigma^2_{\theta|x}}(\theta-\mu_{\theta|x})^2 \right]$$
The mean $\mu_{\theta|x}$ is identical to the median (due to the symmetry) and the mode, as illustrated in Figure 11.2b.
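As a quick numerical illustration of the three risk functions (a minimal sketch of my own, not from Kay; the Gamma-shaped posterior and the value of $\delta$ are arbitrary choices), one can check on a grid that the quadratic, absolute, and hit-or-miss risks are minimized near the posterior mean, median, and mode, respectively:

```python
import numpy as np

theta = np.linspace(0.01, 15.0, 4000)
d = theta[1] - theta[0]
post = theta**2 * np.exp(-theta)          # unnormalized Gamma(3,1)-shaped "posterior"
post /= post.sum() * d                    # normalize on the grid

def bayes_risk(cost, theta_hat):
    # inner integral g(theta_hat) of (RF.4), approximated by a Riemann sum
    return np.sum(cost(theta - theta_hat) * post) * d

delta = 0.05
costs = {
    "quadratic":   lambda e: e**2,
    "absolute":    np.abs,
    "hit-or-miss": lambda e: (np.abs(e) > delta).astype(float),
}
for name, c in costs.items():
    risks = np.array([bayes_risk(c, th) for th in theta])
    print(f"{name:12s} minimizer = {theta[risks.argmin()]:.3f}")

mean   = np.sum(theta * post) * d
median = theta[np.searchsorted(np.cumsum(post) * d, 0.5)]
mode   = theta[post.argmax()]
print(f"mean {mean:.3f}, median {median:.3f}, mode {mode:.3f}")
```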
Minimum Mean Square Error Estimators (MMSE)
In The Bayesian Philosophy the MMSE estimator was determined to be $E(\theta|\mathbf x)$, the mean of the posterior PDF. We continue our discussion of this important estimator by first extending it to the vector parameter case and then studying some of its properties.
If $\boldsymbol \theta$ is a vector parameter of dimension $p \times 1$, then to estimate $\theta_1$, for example, we may view the remaining parameters as nuisance parameters:
$$p(\theta_1|\mathbf x)=\int \cdots \int p(\boldsymbol \theta|\mathbf x)\,d\theta_2\cdots d\theta_p \tag{MMSE.1}$$
where
$$p(\boldsymbol \theta|\mathbf x)=\frac{p(\mathbf x|\boldsymbol \theta)p(\boldsymbol \theta)}{\int p(\mathbf x|\boldsymbol \theta)p(\boldsymbol \theta)\,d\boldsymbol \theta} \tag{MMSE.2}$$
Then we have
$$\hat \theta_i =E(\theta_i|\mathbf x)=\int \theta_i\, p(\theta_i|\mathbf x)\,d\theta_i, \quad i=1,2,\cdots,p \tag{MMSE.3}$$
This is the MMSE estimator that minimizes
$$E[(\theta_i-\hat \theta_i)^2]=\int\int (\theta_i-\hat \theta_i)^2 p(\mathbf x,\theta_i)\,d\mathbf x\, d\theta_i \tag{MMSE.4}$$
or the squared error when averaged with respect to the marginal PDF $p(\mathbf x,\theta_i)$. Alternatively, we can express the MMSE estimator for the first parameter from (MMSE.1):
$$\begin{aligned} \hat \theta_1 &= \int \theta_1 p(\theta_1|\mathbf x)\,d\theta_1\\ &= \int \theta_1 \left[\int \cdots \int p(\boldsymbol \theta|\mathbf x)\,d\theta_2\cdots d\theta_p\right]d\theta_1\\ &=\int \theta_1 p(\boldsymbol \theta|\mathbf x)\,d\boldsymbol \theta \end{aligned}$$
or in general
$$\hat \theta_i=\int \theta_i\, p(\boldsymbol \theta|\mathbf x)\,d\boldsymbol \theta,\quad i=1,2,\cdots,p$$
In vector form we have
$$\hat {\boldsymbol \theta}=\left[\begin{matrix}\int \theta_1 p(\boldsymbol \theta|\mathbf x)\,d\boldsymbol \theta\\ \int \theta_2 p(\boldsymbol \theta|\mathbf x)\,d\boldsymbol \theta\\ \vdots\\ \int \theta_p p(\boldsymbol \theta|\mathbf x)\,d\boldsymbol \theta \end{matrix} \right]=\int \boldsymbol \theta\, p(\boldsymbol \theta|\mathbf x)\,d\boldsymbol \theta=E(\boldsymbol \theta|\mathbf x)\tag{MMSE.5}$$
where the expectation is with respect to the posterior PDF of the vector parameter, $p(\boldsymbol \theta|\mathbf x)$.
According to the definition, the minimum Bayesian MSE for $\hat\theta_1$ is
$$\operatorname{Bmse}(\hat \theta_1)=E[(\theta_1-\hat {\theta}_1)^2]=\int \int(\theta_1-\hat {\theta}_1)^2p(\mathbf x,\theta_1)\, d\theta_1\, d\mathbf x$$
and since $\hat \theta_1=E(\theta_1|\mathbf x)$, we have
$$\begin{aligned} \operatorname{Bmse}(\hat \theta_1)&=\int \int [\theta_1-E(\theta_1|\mathbf x)]^2p(\mathbf x,\theta_1)\, d\theta_1\, d\mathbf x\\ &=\int \left[\int [\theta_1-E(\theta_1|\mathbf x)]^2p(\theta_1|\mathbf x)\, d\theta_1 \right]p(\mathbf x)\,d\mathbf x \end{aligned}$$
The posterior PDF can be written as
$$p(\theta_1|\mathbf x)=\int \cdots \int p(\boldsymbol \theta|\mathbf x)\,d\theta_2\cdots d\theta _p$$
so that
$$\operatorname{Bmse}(\hat \theta_1)=\int \left[\int [\theta_1-E(\theta_1|\mathbf x)]^2p(\boldsymbol \theta|\mathbf x)\, d\boldsymbol \theta \right]p(\mathbf x)\,d\mathbf x\tag{MMSE.6}$$
The inner integral is just the $[1,1]$ element of $\mathbf C_{\theta|x}$, the covariance matrix of the posterior PDF. Hence, in general the minimum Bayesian MSE is
$$\operatorname{Bmse}(\hat \theta_i)=\int [\mathbf C_{\theta|x}]_{ii}\,p(\mathbf x)\,d\mathbf x,\quad i=1,2,\cdots,p\tag{MMSE.7}$$
where
$$\mathbf C_{\theta|x}=E_{\theta|x}\left[(\boldsymbol \theta-E(\boldsymbol \theta|\mathbf x))(\boldsymbol \theta-E(\boldsymbol \theta|\mathbf x))^T\right]\tag{MMSE.8}$$
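When no closed form is available, (MMSE.2), (MMSE.3), and (MMSE.7) can be evaluated numerically for a scalar parameter. A minimal sketch (my own illustration, not from the text): a DC level in WGN with a Laplacian prior, so the posterior is non-Gaussian, with the posterior mean and variance obtained by grid integration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma2, A_true = 20, 1.0, 0.7
x = A_true + np.sqrt(sigma2) * rng.standard_normal(N)

A = np.linspace(-5, 5, 5001)              # parameter grid
dA = A[1] - A[0]
prior = 0.5 * np.exp(-np.abs(A))          # Laplacian prior p(A) (an assumed choice)
loglik = -np.sum((x[:, None] - A[None, :])**2, axis=0) / (2 * sigma2)
post = np.exp(loglik - loglik.max()) * prior
post /= post.sum() * dA                   # p(A|x), cf. (MMSE.2)

A_mmse = np.sum(A * post) * dA                   # E(A|x), cf. (MMSE.3)
var_post = np.sum((A - A_mmse)**2 * post) * dA   # posterior variance, cf. (MMSE.7)
print("MMSE estimate:", A_mmse, "posterior variance:", var_post)
```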
We can obtain $\hat {\boldsymbol \theta}$ and $\mathbf C_{\theta|x}$ from Theorem 10.2, mentioned in The Bayesian Philosophy.
Theorem 10.2 (Conditional PDF of Multivariate Gaussian) If $\mathbf{x}$ and $\mathbf y$ are jointly Gaussian, where $\mathbf{x}$ is $k \times 1$ and $\mathbf{y}$ is $l \times 1$, with mean vector $\left[E(\mathbf{x})^{T}~E(\mathbf{y})^{T}\right]^{T}$ and partitioned covariance matrix
$$\mathbf{C}=\left[\begin{array}{ll} \mathbf{C}_{xx} & \mathbf{C}_{xy} \\ \mathbf{C}_{yx} & \mathbf{C}_{yy} \end{array}\right]=\left[\begin{array}{ll} k \times k & k \times l \\ l \times k & l \times l \end{array}\right]\tag{BF.25}$$
so that
$$p(\mathbf{x}, \mathbf{y})=\frac{1}{(2 \pi)^{\frac{k+l}{2}} \operatorname{det}^{\frac{1}{2}}(\mathbf{C})} \exp \left[-\frac{1}{2}\left[\begin{array}{l} \mathbf{x}-E(\mathbf{x}) \\ \mathbf{y}-E(\mathbf{y}) \end{array}\right]^{T} \mathbf{C}^{-1}\left[\begin{array}{l} \mathbf{x}-E(\mathbf{x}) \\ \mathbf{y}-E(\mathbf{y}) \end{array}\right]\right]$$
then the conditional PDF $p(\mathbf y|\mathbf x)$ is also Gaussian and
$$E(\mathbf y|\mathbf x)=E(\mathbf y)+\mathbf C_{yx} \mathbf C_{xx}^{-1}(\mathbf x-E(\mathbf x))\tag{BF.26}$$
$$\mathbf C_{y|x}=\mathbf C_{yy}-\mathbf C_{yx}\mathbf C_{xx}^{-1}\mathbf C_{xy}\tag{BF.27}$$
For the general linear model, we can apply the result in Theorem 10.3.
Theorem 10.3 (Posterior PDF for the Bayesian General Linear Model) If the observed data $\mathbf x$ can be modeled as
$$\mathbf{x}=\mathbf{H} \boldsymbol{\theta}+\mathbf{w}$$
where $\mathbf{x}$ is an $N \times 1$ data vector, $\mathbf{H}$ is a known $N \times p$ matrix, $\boldsymbol{\theta}$ is a $p \times 1$ random vector with prior PDF $\mathcal{N}(\boldsymbol\mu_{\theta}, \mathbf{C}_{\theta})$, and $\mathbf{w}$ is an $N \times 1$ noise vector with PDF $\mathcal{N}(\mathbf{0}, \mathbf{C}_{w})$ and independent of $\boldsymbol \theta$, then the posterior PDF $p(\boldsymbol\theta |\mathbf{x})$ is Gaussian with mean
$$E(\boldsymbol{\theta} |\mathbf{x})=\boldsymbol{\mu}_{\theta}+\mathbf{C}_{\theta} \mathbf{H}^{T}\left(\mathbf{H} \mathbf{C}_{\theta} \mathbf{H}^{T}+\mathbf{C}_{w}\right)^{-1}\left(\mathbf{x}-\mathbf{H} \boldsymbol{\mu}_{\theta}\right)\tag{BF.29}$$
and covariance
$$\mathbf{C}_{\theta \mid x}=\mathbf{C}_{\theta}-\mathbf{C}_{\theta} \mathbf{H}^{T}\left(\mathbf{H} \mathbf{C}_{\theta} \mathbf{H}^{T}+\mathbf{C}_{w}\right)^{-1} \mathbf{H} \mathbf{C}_{\theta}\tag{BF.30}$$
In contrast to the classical general linear model, $\mathbf{H}$ need not be full rank to ensure the invertibility of $\mathbf{H C}_{\theta} \mathbf{H}^{T}+\mathbf{C}_{w}$.

An alternative formulation, using the matrix inversion lemma, is
$$E(\boldsymbol{\theta} |\mathbf{x})=\boldsymbol\mu_{\theta}+\left(\mathbf{C}_{\theta}^{-1}+\mathbf{H}^{T} \mathbf{C}_w^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}_w^{-1}\left(\mathbf{x}-\mathbf{H} \boldsymbol\mu_{\theta}\right) \tag{BF.31}$$
$$\mathbf{C}_{\theta \mid x} =\left(\mathbf{C}_{\theta}^{-1}+\mathbf{H}^{T} \mathbf{C}_w^{-1} \mathbf{H}\right)^{-1}\tag{BF.32}$$
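The two formulations are algebraically identical. A small numerical check (a sketch with arbitrary, randomly generated $\mathbf H$, covariances, and data) that (BF.29)/(BF.30) and (BF.31)/(BF.32) give the same posterior mean and covariance:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 8, 3
H = rng.standard_normal((N, p))
C_theta = 2.0 * np.eye(p)                      # prior covariance (illustrative)
C_w = np.diag(rng.uniform(0.5, 1.5, N))        # noise covariance (illustrative)
mu = np.zeros(p)
x = rng.standard_normal(N)

# (BF.29)/(BF.30): one N x N inversion
K = C_theta @ H.T @ np.linalg.inv(H @ C_theta @ H.T + C_w)
mean1 = mu + K @ (x - H @ mu)
cov1 = C_theta - K @ H @ C_theta

# (BF.31)/(BF.32): one p x p inversion, via the matrix inversion lemma
cov2 = np.linalg.inv(np.linalg.inv(C_theta) + H.T @ np.linalg.inv(C_w) @ H)
mean2 = mu + cov2 @ H.T @ np.linalg.inv(C_w) @ (x - H @ mu)

print(np.allclose(mean1, mean2), np.allclose(cov1, cov2))   # True True
```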
It is interesting to note that in the absence of prior knowledge in the Bayesian linear model the MMSE estimator yields the same form as the MVU estimator for the classical linear model. To verify this result, note from (BF.31) that no prior knowledge corresponds to $\mathbf C_\theta^{-1}\to \mathbf 0$, and therefore
$$\hat {\boldsymbol \theta}\to \left(\mathbf{H}^{T} \mathbf{C}_w^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}_w^{-1}\mathbf{x}$$
which is recognized as the MVU estimator for the general linear model.
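A short numerical check of this limit (a sketch with arbitrary made-up $\mathbf H$, $\mathbf C_w$, and data, assuming a zero prior mean): as the prior variance grows, so that $\mathbf C_\theta^{-1}\to\mathbf 0$, the estimator of (BF.31) approaches the MVU form.

```python
import numpy as np

rng = np.random.default_rng(9)
N, p = 10, 2
H = rng.standard_normal((N, p))
Cw_inv = np.linalg.inv(0.5 * np.eye(N))
x = rng.standard_normal(N)

mvu = np.linalg.solve(H.T @ Cw_inv @ H, H.T @ Cw_inv @ x)
for s2 in [1.0, 1e2, 1e6]:                        # prior variance sigma_theta^2 grows
    mmse = np.linalg.solve(np.eye(p) / s2 + H.T @ Cw_inv @ H, H.T @ Cw_inv @ x)
    print(s2, np.linalg.norm(mmse - mvu))         # difference shrinks toward 0
```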
Properties
- The MMSE estimator commutes over linear transformations. Assume that we wish to estimate
  $$\boldsymbol\alpha =\mathbf A\boldsymbol \theta +\mathbf b$$
  where $\boldsymbol \alpha$ is a random vector whose MMSE estimator is $\hat {\boldsymbol \alpha}=E(\boldsymbol \alpha |\mathbf x)$. Because of the linearity of the expectation operator,
  $$\hat {\boldsymbol \alpha}=E(\mathbf A\boldsymbol \theta+\mathbf b|\mathbf x)=\mathbf A E(\boldsymbol \theta|\mathbf x)+\mathbf b=\mathbf A\hat {\boldsymbol \theta}+\mathbf b$$
  This holds regardless of the joint PDF $p(\mathbf x,\boldsymbol \theta)$.
- The MMSE estimator has an additivity property for independent data sets. Assume $\mathbf x_1, \mathbf x_2$ are independent data vectors and $\boldsymbol \theta, \mathbf x_1,\mathbf x_2$ are jointly Gaussian. Letting $\mathbf x=[\mathbf x_1^T ~ \mathbf x_2^T]^T$, we have from Theorem 10.2
  $$\hat {\boldsymbol \theta}=E(\boldsymbol \theta)+\mathbf C_{\theta x} \mathbf C_{xx}^{-1}(\mathbf x-E(\mathbf x))$$
  Since $\mathbf x_1$ and $\mathbf x_2$ are independent,
  $$\mathbf C_{xx}^{-1}=\left[\begin{matrix}\mathbf C_{x_1 x_1} & \mathbf 0\\ \mathbf 0 & \mathbf C_{x_2x_2} \end{matrix} \right]^{-1}=\left[\begin{matrix}\mathbf C_{x_1 x_1}^{-1} & \mathbf 0\\ \mathbf 0 & \mathbf C_{x_2x_2}^{-1} \end{matrix} \right],\qquad \mathbf C_{\theta x}=E\left[\boldsymbol \theta\left[\begin{matrix}\mathbf x_1 \\\mathbf x_2 \end{matrix} \right]^T\right]=[\mathbf C_{\theta x_1}~\mathbf C_{\theta x_2}]$$
  Therefore
  $$\hat {\boldsymbol \theta}=E(\boldsymbol \theta)+\mathbf C_{\theta x_1} \mathbf C_{x_1x_1}^{-1}(\mathbf x_1-E(\mathbf x_1))+\mathbf C_{\theta x_2} \mathbf C_{x_2x_2}^{-1}(\mathbf x_2-E(\mathbf x_2))$$
- In the jointly Gaussian case the MMSE estimator is linear in the data, as can be seen from (BF.26).
Example: Bayesian Fourier Analysis
Data model:
$$x[n]=a \cos 2\pi f_0 n+b \sin 2\pi f_0 n+w[n],\quad n=0,1,\cdots, N-1$$
where $f_0$ is a multiple of $1/N$, excepting $0$ or $1/2$ (for which $\sin 2 \pi f_0 n$ is identically zero), and $w[n]$ is WGN with variance $\sigma^2$. It is desired to estimate $\boldsymbol \theta=[a~b]^T$, where $a$ and $b$ are assigned independent $\mathcal N(0,\sigma_\theta^2)$ priors, independent of $w[n]$.
To find the MMSE estimator, the data model is rewritten as
$$\mathbf x=\mathbf H \boldsymbol \theta +\mathbf w$$
where
$$\mathbf{H}=\left[\begin{array}{cc} 1 & 0 \\ \cos 2 \pi f_{0} & \sin 2 \pi f_{0} \\ \vdots & \vdots \\ \cos \left[2 \pi f_{0}(N-1)\right] & \sin \left[2 \pi f_{0}(N-1)\right] \end{array}\right]$$
Because the columns of $\mathbf H$ are orthogonal, we have
$$\mathbf H^T \mathbf H=\frac{N}{2}\mathbf I$$
Therefore it is more convenient to apply (BF.31) and (BF.32):
$$\begin{aligned} \hat{\boldsymbol{\theta}} &=\left(\frac{1}{\sigma_{\theta}^{2}} \mathbf{I}+\mathbf{H}^{T} \frac{1}{\sigma^{2}} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \frac{1}{\sigma^{2}} \mathbf{x}=\frac{1/\sigma^2}{1/\sigma_\theta^2+N/(2\sigma^2)}\mathbf H^T\mathbf x \\ \mathbf{C}_{\theta \mid x} &=\left(\frac{1}{\sigma_{\theta}^{2}} \mathbf{I}+\mathbf{H}^{T} \frac{1}{\sigma^{2}} \mathbf{H}\right)^{-1}=\frac{1}{1/\sigma_\theta^2+N/(2\sigma^2)}\mathbf I \end{aligned}$$
The MMSE estimator is
$$\begin{aligned} \hat{a} &=\frac{1}{1+\frac{2 \sigma^{2} / N}{\sigma_{\theta}^{2}}}\left[\frac{2}{N} \sum_{n=0}^{N-1} x[n] \cos 2 \pi f_{0} n\right] \\ \hat{b} &=\frac{1}{1+\frac{2 \sigma^{2} / N}{\sigma_{\theta}^{2}}}\left[\frac{2}{N} \sum_{n=0}^{N-1} x[n] \sin 2 \pi f_{0} n\right] \end{aligned}$$
with
$$\mathrm{Bmse}(\hat{a})=\mathrm{Bmse}(\hat{b})=\frac{1}{\frac{1}{\sigma_{\theta}^{2}}+\frac{1}{2 \sigma^{2} / N}}$$
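A small simulation sketch of this example (my own, with assumed values for $N$, $f_0$, $\sigma^2$, and $\sigma_\theta^2$), computing $\hat a,\hat b$ from the closed form above and cross-checking against the general expression (BF.31):

```python
import numpy as np

rng = np.random.default_rng(2)
N, k = 64, 5
f0 = k / N                                   # multiple of 1/N, not 0 or 1/2
sigma2, sigma_theta2 = 1.0, 4.0              # noise and prior variances (assumed)
n = np.arange(N)
H = np.column_stack([np.cos(2*np.pi*f0*n), np.sin(2*np.pi*f0*n)])

theta_true = rng.normal(0.0, np.sqrt(sigma_theta2), 2)     # [a, b] drawn from the prior
x = H @ theta_true + np.sqrt(sigma2) * rng.standard_normal(N)

shrink = 1.0 / (1.0 + (2*sigma2/N) / sigma_theta2)
a_hat = shrink * (2.0/N) * np.sum(x * np.cos(2*np.pi*f0*n))
b_hat = shrink * (2.0/N) * np.sum(x * np.sin(2*np.pi*f0*n))

# cross-check against the general form (BF.31) with zero prior mean
theta_hat = np.linalg.solve(np.eye(2)/sigma_theta2 + H.T @ H / sigma2, H.T @ x / sigma2)
print(a_hat, b_hat, theta_hat)               # the two computations agree
```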
Maximum A Posteriori Estimators (MAP)
In the MAP estimation approach we choose $\hat \theta$ to maximize the posterior PDF, i.e.,
$$\hat \theta=\arg \max _\theta p(\theta|\mathbf x)$$
This was shown to minimize the Bayes risk for the "hit-or-miss" cost function of (RF.3) (for $\delta$ arbitrarily small). Since
$$p(\theta|\mathbf x)=\frac{p(\mathbf x|\theta)p(\theta)}{p(\mathbf x)}$$
it is equivalent to
$$\hat \theta=\arg \max _\theta p(\mathbf x|\theta)p(\theta)\tag{MAP.1}$$
or
$$\hat \theta=\arg \max _\theta [\ln p(\mathbf x|\theta)+\ln p(\theta)]\tag{MAP.2}$$
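A minimal numerical sketch of (MAP.2) (my own illustration, not from the text): the scalar MAP estimate can be found by maximizing $\ln p(\mathbf x|\theta)+\ln p(\theta)$ over a grid, with no integration. Here $x[n]$ is assumed exponential with rate $\theta$ and the prior on $\theta$ is Gamma-shaped, so the grid result can be checked against the known posterior mode.

```python
import numpy as np

rng = np.random.default_rng(3)
theta_true, N = 2.0, 30
x = rng.exponential(1.0 / theta_true, N)      # p(x[n]|theta) = theta * exp(-theta * x[n])

alpha, beta = 2.0, 1.0                        # Gamma(alpha, beta) prior (assumed)
theta = np.linspace(1e-3, 10.0, 10000)
loglik = N * np.log(theta) - theta * np.sum(x)
logprior = (alpha - 1) * np.log(theta) - beta * theta
theta_map = theta[np.argmax(loglik + logprior)]

# closed-form mode of the Gamma(alpha + N, beta + sum x) posterior, for comparison
print(theta_map, (alpha + N - 1) / (beta + np.sum(x)))
```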
To extend the MAP estimator to the vector parameter case, in which the posterior PDF is now $p(\boldsymbol \theta|\mathbf x)$, we again employ the result
$$p(\theta_1|\mathbf x)=\int \cdots \int p(\boldsymbol \theta|\mathbf x)\,d\theta_2\cdots d\theta _p \tag{MAP.3}$$
Then the MAP estimator is given by
$$\hat \theta_1=\arg \max _{\theta_1} p(\theta_1|\mathbf x)$$
or in general
$$\hat \theta_i=\arg \max _{\theta_i} p(\theta_i|\mathbf x), \quad i=1,2,\cdots,p \tag{MAP.4}$$
One of the advantages of the MAP estimator for a scalar parameter is that to determine it numerically we need only maximize $p(\mathbf x|\theta)p(\theta)$; no integration is required. This desirable property does not carry over to the vector parameter case, due to the need to obtain $p(\theta_i|\mathbf x)$ as per (MAP.3). However, we might propose the following vector MAP estimator
$$\hat {\boldsymbol \theta}=\arg \max _{\boldsymbol \theta} p(\boldsymbol \theta|\mathbf x)=\arg \max _{\boldsymbol \theta} p(\mathbf x|\boldsymbol \theta)p(\boldsymbol \theta) \tag{MAP.5}$$
in which the posterior PDF of the vector parameter $\boldsymbol \theta$ is maximized to find the estimator. The estimator in (MAP.5) is not in general the same as (MAP.4).
Properties
- If $p(\theta)$ is uniform and $p(\mathbf x|\theta)$ falls within this interval, then
  $$\hat \theta=\arg \max _\theta p(\mathbf x|\theta)p(\theta)=\arg \max _\theta p(\mathbf x|\theta)$$
  which is essentially the Bayesian MLE.
- Generally, if $N\to \infty$ the PDF $p(\mathbf x|\theta)$ becomes dominant over $p(\theta)$, and the MAP estimator thus becomes identical to the Bayesian MLE.
- If $\mathbf x$ and $\boldsymbol \theta$ are jointly Gaussian, the posterior PDF is Gaussian and thus the MAP estimator is identical to the MMSE estimator.
- The invariance property encountered in maximum likelihood theory (MLE) does not hold for the MAP estimator.
Example: DC Level in WGN - Uniform Prior PDF
Recall the introductory example in The Bayesian Philosophy. There we discussed the MMSE estimator of $A$ for a DC level in WGN with a uniform prior PDF. The MMSE estimator could not be obtained in explicit form due to the need to evaluate the integrals. The posterior PDF was given as
$$p(A |\mathbf{x})=\left\{\begin{aligned} &\frac{\frac{1}{\sqrt{2 \pi \sigma^{2} / N}} \exp \left[-\frac{1}{2 \sigma^{2} / N}(A-\bar{x})^{2}\right]}{\int_{-A_0}^{A_0}\frac{1}{\sqrt{2 \pi \sigma^{2} / N}} \exp \left[-\frac{1}{2 \sigma^{2} / N}(A-\bar{x})^{2}\right]dA} && |A| \leq A_{0} \\ &0 && |A|>A_{0} \end{aligned}\right.\tag{BP.9}$$
The MAP estimator is explicitly given as
$$\hat A=\left\{\begin{array}{ll}-A_0 & \bar x<-A_0\\ \bar x & -A_0\le \bar x\le A_0\\ A_0 & \bar x>A_0 \end{array}\right.$$
It can be seen that the MAP estimator is usually easier to determine since it does not involve any integration but only a maximization.
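A small sketch of this example (my own, with assumed $A_0$, $\sigma^2$, $N$): the MAP estimate is just the sample mean clipped to $[-A_0, A_0]$, while the MMSE estimate requires numerical integration of (BP.9).

```python
import numpy as np

rng = np.random.default_rng(4)
A0, sigma2, N = 1.0, 0.5, 10
A = rng.uniform(-A0, A0)
x = A + np.sqrt(sigma2) * rng.standard_normal(N)

A_map = np.clip(np.mean(x), -A0, A0)          # the clipped sample mean derived above

# MMSE estimate for comparison: numerical integration of (BP.9)
grid = np.linspace(-A0, A0, 20001)
w = np.exp(-N * (grid - np.mean(x))**2 / (2 * sigma2))
A_mmse = np.sum(grid * w) / np.sum(w)
print(A_map, A_mmse)                          # close, but not identical in general
```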
Linear MMSE Estimation (LMMSE)
We begin our discussion by assuming that a scalar parameter $\theta$ is to be estimated based on the data set $\mathbf x=[x[0]~x[1]~\cdots~x[N-1]]^T$. We do not assume any specific form for the joint PDF $p(\mathbf x,\theta)$; as we shall see shortly, only a knowledge of the first two moments is required.
We now consider the class of all linear estimators of the form
$$\hat \theta =\sum_{n=0}^{N-1}a_n x[n]+a_N\tag{LMMSE.1}$$
and choose the weighting coefficients $a_n$ to minimize the Bayesian MSE
$$\operatorname{Bmse}(\hat \theta)=E\left[(\theta-\hat \theta)^2\right]=\int\int (\theta-\hat \theta)^2 p(\mathbf x,\theta)\,d\theta\, d\mathbf x\tag{LMMSE.2}$$
The resultant estimator is termed the linear minimum mean square error (LMMSE) estimator. Note that we included the $a_N$ coefficient to allow for nonzero means of $\mathbf x$ and $\theta$.
Before determining the LMMSE estimator we should keep in mind that the estimator will be suboptimal unless the MMSE estimator happens to be linear. Such would be the case, for example, if the Bayesian linear model applied. Since the LMMSE estimator relies on the correlation between random variables, a parameter uncorrelated with the data cannot be linearly estimated.
We now derive the optimal weighting coefficients for use in (LMMSE.1). Substituting (LMMSE.1) into (LMMSE.2) and differentiating w.r.t. $a_N$ gives
$$\frac{\partial}{\partial a_{N}} E\left[\left(\theta-\sum_{n=0}^{N-1} a_{n} x[n]-a_{N}\right)^{2}\right]=-2 E\left[\theta-\sum_{n=0}^{N-1} a_{n} x[n]-a_{N}\right]$$
Setting this equal to zero produces
$$a_{N}=E(\theta)-\sum_{n=0}^{N-1} a_{n} E(x[n]) \tag{LMMSE.3}$$
which is zero if the means are zero. Continuing, we need to minimize
$$\mathrm{Bmse}(\hat{\theta})=E\left\{\left[\sum_{n=0}^{N-1} a_{n}(x[n]-E(x[n]))-(\theta-E(\theta))\right]^{2}\right\}$$
over the remaining $a_{n}$'s, where $a_{N}$ has been replaced by (LMMSE.3). Letting $\mathbf{a}=\left[a_{0}~ a_{1}~ \cdots~ a_{N-1}\right]^{T}$, we have
$$\begin{aligned} \operatorname{Bmse}(\hat{\theta})=& E\left\{\left[\mathbf{a}^{T}(\mathbf{x}-E(\mathbf{x}))-(\theta-E(\theta))\right]^{2}\right\} \\ =& E\left[\mathbf{a}^{T}(\mathbf{x}-E(\mathbf{x}))(\mathbf{x}-E(\mathbf{x}))^{T} \mathbf{a}\right]-E\left[\mathbf{a}^{T}(\mathbf{x}-E(\mathbf{x}))(\theta-E(\theta))\right] \\ &-E\left[(\theta-E(\theta))(\mathbf{x}-E(\mathbf{x}))^{T} \mathbf{a}\right]+E\left[(\theta-E(\theta))^{2}\right] \\ =& \mathbf{a}^{T} \mathbf{C}_{x x} \mathbf{a}-\mathbf{a}^{T} \mathbf{C}_{x \theta}-\mathbf{C}_{\theta x} \mathbf{a}+C_{\theta \theta} \end{aligned}\tag{LMMSE.4}$$
where $\mathbf{C}_{x x}$ is the $N \times N$ covariance matrix of $\mathbf{x}$, $\mathbf{C}_{\theta x}$ is the $1 \times N$ cross-covariance vector with the property $\mathbf{C}_{\theta x}^{T}=\mathbf{C}_{x \theta}$, and $C_{\theta \theta}$ is the variance of $\theta$. Taking the gradient yields
$$\frac{\partial \mathrm{Bmse}(\hat{\theta})}{\partial \mathbf{a}}=2 \mathbf{C}_{x x} \mathbf{a}-2 \mathbf{C}_{x \theta}$$
which when set to zero results in
$$\mathbf{a}=\mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta}\tag{LMMSE.5}$$
Using (LMMSE.3) and (LMMSE.5) in (LMMSE.1) produces
$$\begin{aligned} \hat{\theta} &=\mathbf{a}^{T} \mathbf{x}+a_{N} \\ &=\mathbf{C}_{x \theta}^{T} \mathbf{C}_{x x}^{-1} \mathbf{x}+E(\theta)-\mathbf{C}_{x \theta}^{T} \mathbf{C}_{x x}^{-1} E(\mathbf{x}) \end{aligned}$$
or finally the LMMSE estimator is
$$\hat{\theta}=E(\theta)+\mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1}(\mathbf{x}-E(\mathbf{x}))\tag{LMMSE.6}$$
Note that it is identical in form to the MMSE estimator for jointly Gaussian $\mathbf x$ and $\theta$, as can be verified from (BF.26). This is because in the Gaussian case the MMSE estimator happens to be linear, and hence our linearity constraint is automatically satisfied.
The minimum Bayesian MSE is obtained by substituting (LMMSE.5) into (LMMSE.4) to yield
$$\begin{aligned} \mathrm{Bmse}(\hat{\theta})=& \mathbf{C}_{x \theta}^{T} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x x} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta}-\mathbf{C}_{x \theta}^{T} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta} -\mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta}+ C_{\theta \theta} \\ =& \mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta}-2 \mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta}+ C_{\theta \theta}\\ =& C_{\theta \theta}-\mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta} \end{aligned}\tag{LMMSE.7}$$
In general, all that is required to determine the LMMSE estimator are the first two moments of $p(\mathbf x, \theta)$, namely
$$\left[ \begin{matrix}E(\theta) \\E(\mathbf x) \end{matrix}\right], \qquad \left[ \begin{matrix} C_{\theta \theta} & \mathbf C_{\theta x} \\\mathbf C_{x \theta} & \mathbf C_{xx} \end{matrix}\right]$$
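A minimal sketch of (LMMSE.5)-(LMMSE.7) (my own illustration, not from the text): a zero-mean uniform parameter observed in Laplacian noise, both deliberately non-Gaussian, where the first two moments are known in closed form. The theoretical Bmse of (LMMSE.7) is compared with a Monte Carlo estimate.

```python
import numpy as np

rng = np.random.default_rng(5)
N, trials = 5, 200000

# assumed non-Gaussian setup: theta uniform on [-1,1], Laplacian noise per sample
theta = rng.uniform(-1.0, 1.0, trials)              # E(theta)=0, var = 1/3
w = rng.laplace(0.0, 0.5, (trials, N))              # var = 2 * 0.5**2 = 0.5
x = theta[:, None] + w                              # x[n] = theta + w[n]

C_theta_theta = 1.0 / 3.0
C_x_theta = np.full(N, C_theta_theta)               # E(theta x[n]) = var(theta)
C_xx = C_theta_theta * np.ones((N, N)) + 0.5 * np.eye(N)

a = np.linalg.solve(C_xx, C_x_theta)                # (LMMSE.5)
theta_hat = x @ a                                   # (LMMSE.6), zero means
bmse_theory = C_theta_theta - C_x_theta @ a         # (LMMSE.7)
bmse_mc = np.mean((theta - theta_hat)**2)
print(bmse_theory, bmse_mc)                         # agree despite non-Gaussian PDFs
```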
Geometrical Interpretations
In this formulation we assume that $\theta$ and $\mathbf x$ are zero mean. If not, we can always define the zero-mean random variables $\theta '=\theta-E(\theta)$ and $\mathbf x '=\mathbf x-E(\mathbf x)$, and consider estimation of $\theta '$ by a linear function of $\mathbf x'$. We now wish to find the $a_n$'s so that
$$\hat \theta =\sum_{n=0}^{N-1}a_n x[n]$$
minimizes
$$\operatorname{Bmse}(\hat \theta)=E\left[(\theta-\hat \theta)^2\right]=\int\int (\theta-\hat \theta)^2 p(\mathbf x,\theta)\,d\theta\, d\mathbf x$$
It can be shown that an appropriate definition of the inner product between the vectors $x$ and $y$, i.e., one that satisfies the properties of an inner product, is
$$(x,y)=E(xy) \tag{LMMSE.8}$$
With this definition we have
$$(x,x)=E(x^2)=\|x\|^2\tag{LMMSE.9}$$
Also, we can now define two vectors to be orthogonal if
$$(x,y)=E(xy)=0\tag{LMMSE.10}$$
Since the vectors are zero mean, this is equivalent to saying that two vectors are orthogonal if and only if they are uncorrelated.
The Bayesian MSE can then be expressed as
$$E\left[(\theta-\hat \theta)^2\right]=\left\|\theta -\sum_{n=0}^{N-1} a_n x[n] \right\|^2$$
This means that minimization of the MSE is equivalent to minimization of the squared length of the error vector $\epsilon =\theta-\hat \theta$. Clearly, the length of the error vector is minimized when $\epsilon$ is orthogonal to the subspace spanned by $\{x[0], x[1], \dots , x[N - 1]\}$. Hence, we require
$$\epsilon \perp x[0],x[1],\cdots,x[N-1]\tag{LMMSE.11}$$
or, by using our definition of orthogonality,
$$E\left[ (\theta-\hat \theta)x[n] \right]=0,\quad n=0,1,\cdots,N-1 \tag{LMMSE.12}$$
This is the important orthogonality principle. It says that in estimating the realization of a random variable by a linear combination of data samples, the optimal estimator is obtained when the error is orthogonal to each data sample. Using the orthogonality principle, the weighting coefficients are easily found as
$$\sum_{m=0}^{N-1}a_m E(x[m]x[n])=E(\theta x[n]),\quad n=0,1,\cdots,N-1$$
In matrix form this is
$$\left[\begin{array}{cccc} E\left(x^{2}[0]\right) & E(x[0] x[1]) & \ldots & E(x[0] x[N-1]) \\ E(x[1] x[0]) & E\left(x^{2}[1]\right) & \ldots & E(x[1] x[N-1]) \\ \vdots & \vdots & \ddots & \vdots \\ E(x[N-1] x[0]) & E(x[N-1] x[1]) & \ldots & E\left(x^{2}[N-1]\right) \end{array}\right]\left[\begin{array}{c} a_{0} \\ a_{1} \\ \vdots \\ a_{N-1} \end{array}\right]=\left[\begin{array}{c} E(\theta x[0]) \\ E(\theta x[1]) \\ \vdots \\ E(\theta x[N-1]) \end{array}\right]$$
These are the normal equations. The matrix is recognized as $\mathbf C_{xx}$ and the right-hand-side vector as $\mathbf C_{x\theta}$. Therefore,
$$\mathbf C_{xx}\mathbf a=\mathbf C_{x\theta}$$
and
$$\mathbf a=\mathbf C_{xx}^{-1}\mathbf C_{x\theta}$$
The LMMSE estimator of $\theta$ is
$$\hat \theta =\mathbf a^T \mathbf x=\mathbf C_{\theta x}\mathbf C_{xx}^{-1}\mathbf x$$
in agreement with (LMMSE.5) and (LMMSE.6). The minimum Bayesian MSE is the squared length of the error vector, or
$$\begin{aligned} E\left[(\theta-\hat \theta)^2\right]&=E\left[\left(\theta-\sum_{n=0}^{N-1}a_n x[n]\right)\epsilon\right]\\ &=E\left[\theta\epsilon\right]\\ &=E(\theta^2)-\sum_{n=0}^{N-1}a_n E(x[n]\theta)\\ &=C_{\theta \theta}-\mathbf a^T\mathbf C_{x\theta}\\ &=C_{\theta \theta}-\mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta} \end{aligned}$$
in agreement with (LMMSE.7).
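A quick numerical check of the orthogonality principle (a sketch with an arbitrary zero-mean, non-Gaussian setup): if the weights are computed from the normal equations using sample moments, the resulting error is orthogonal to every data sample with respect to those same sample averages, up to round-off.

```python
import numpy as np

rng = np.random.default_rng(6)
N, trials = 4, 500000
theta = rng.exponential(1.0, trials) - 1.0          # zero-mean, non-Gaussian "parameter"
x = theta[:, None] + rng.standard_normal((trials, N))

C_xx = x.T @ x / trials                             # sample moments in place of true ones
C_x_theta = x.T @ theta / trials
a = np.linalg.solve(C_xx, C_x_theta)                # normal equations

eps = theta - x @ a                                 # error for each realization
print(np.abs(eps @ x / trials).max())               # ~0: E[(theta - theta_hat) x[n]] = 0
```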
Example: DC Level in WGN - Uniform Prior PDF
The data model is
$$x[n]=A+w[n], \quad n=0,1, \ldots, N-1$$
where $A \sim \mathcal{U}\left[-A_{0}, A_{0}\right]$, $w[n]$ is WGN with variance $\sigma^{2}$, and $A$ and $w[n]$ are independent. We wish to estimate $A$. The MMSE estimator cannot be obtained in closed form due to the integration required. Applying the LMMSE estimator, we first note that $E(A)=0$ and hence $E(x[n])=0$. Since $E(\mathbf{x})=\mathbf{0}$, the covariances are
$$\begin{aligned} \mathbf{C}_{x x} &=E\left(\mathbf{x} \mathbf{x}^{T}\right) \\ &=E\left[(A \mathbf{1}+\mathbf{w})(A \mathbf{1}+\mathbf{w})^{T}\right] \\ &=E\left(A^{2}\right) \mathbf{1} \mathbf{1}^{T}+\sigma^{2} \mathbf{I} \\ \mathbf{C}_{\theta x} &=E\left(A \mathbf{x}^{T}\right) \\ &=E\left[A(A \mathbf{1}+\mathbf{w})^{T}\right] \\ &=E\left(A^{2}\right) \mathbf{1}^{T} \end{aligned}$$
Hence, from (LMMSE.6)
$$\begin{aligned} \hat{A} &=\mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1} \mathbf{x} \\ &=\sigma_{A}^{2} \mathbf{1}^{T}\left(\sigma_{A}^{2} \mathbf{1} \mathbf{1}^{T}+\sigma^{2} \mathbf{I}\right)^{-1} \mathbf{x}\\ &=\frac{\frac{A_{0}^{2}}{3}}{\frac{A_{0}^{2}}{3}+\frac{\sigma^{2}}{N}} \bar{x} \end{aligned}$$
where we have let $\sigma_{A}^{2}=E\left(A^{2}\right)=A_0^2/3$, the variance of the uniform prior.
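A brief simulation sketch of this example (my own, with assumed $A_0$, $\sigma^2$, $N$), confirming that the closed form above equals the general expression $\mathbf C_{\theta x}\mathbf C_{xx}^{-1}\mathbf x$:

```python
import numpy as np

rng = np.random.default_rng(7)
A0, sigma2, N = 1.0, 0.8, 16
A = rng.uniform(-A0, A0)
x = A + np.sqrt(sigma2) * rng.standard_normal(N)

sigma_A2 = A0**2 / 3.0
A_hat_closed = (sigma_A2 / (sigma_A2 + sigma2 / N)) * np.mean(x)   # closed form above

C_xx = sigma_A2 * np.ones((N, N)) + sigma2 * np.eye(N)             # (LMMSE.6), zero means
C_Ax = sigma_A2 * np.ones(N)
A_hat_general = C_Ax @ np.linalg.solve(C_xx, x)
print(A_hat_closed, A_hat_general)                                 # identical
```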
The Vector LMMSE Estimator
Now we wish to find the linear estimator that minimizes the Bayesian MSE for each element of the vector parameter. We assume that
$$\hat{\theta}_{i}=\sum_{n=0}^{N-1} a_{i n} x[n]+a_{i N} \tag{LMMSE.13}$$
for $i=1,2, \ldots, p$ and choose the weighting coefficients to minimize
$$\operatorname{Bmse}\left(\hat{\theta}_{i}\right)=E\left[\left(\theta_{i}-\hat{\theta}_{i}\right)^{2}\right], \quad i=1,2, \ldots, p$$
where the expectation is with respect to $p\left(\mathbf{x}, \theta_{i}\right)$. Since we are actually determining $p$ separate estimators, the scalar solution can be applied, and we obtain from (LMMSE.6)
$$\hat{\theta}_{i}=E\left(\theta_{i}\right)+\mathbf{C}_{\theta_{i} x} \mathbf{C}_{x x}^{-1}(\mathbf{x}-E(\mathbf{x})), \quad i=1,2, \ldots, p$$
and the minimum Bayesian MSE from (LMMSE.7):
$$\mathrm{Bmse}\left(\hat{\theta}_{i}\right)=C_{\theta_{i} \theta_{i}}-\mathbf{C}_{\theta_{i} x} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta_{i}}, \quad i=1,2, \ldots, p$$
The scalar LMMSE estimators can be combined into a vector estimator as
$$\begin{aligned} \hat{\boldsymbol{\theta}} &=\left[\begin{array}{c} E\left(\theta_{1}\right) \\ E\left(\theta_{2}\right) \\ \vdots \\ E\left(\theta_{p}\right) \end{array}\right]+\left[\begin{array}{c} \mathbf{C}_{\theta_{1} x} \mathbf{C}_{x x}^{-1}(\mathbf{x}-E(\mathbf{x})) \\ \mathbf{C}_{\theta_{2} x} \mathbf{C}_{x x}^{-1}(\mathbf{x}-E(\mathbf{x})) \\ \vdots \\ \mathbf{C}_{\theta_{p} x} \mathbf{C}_{x x}^{-1}(\mathbf{x}-E(\mathbf{x})) \end{array}\right] \\ &=\left[\begin{array}{c} E\left(\theta_{1}\right) \\ E\left(\theta_{2}\right) \\ \vdots \\ E\left(\theta_{p}\right) \end{array}\right]+\left[\begin{array}{c} \mathbf{C}_{\theta_{1} x} \\ \mathbf{C}_{\theta_{2} x} \\ \vdots \\ \mathbf{C}_{\theta_{p} x} \end{array}\right] \mathbf{C}_{x x}^{-1}(\mathbf{x}-E(\mathbf{x})) \\ &=E(\boldsymbol{\theta})+\mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1}(\mathbf{x}-E(\mathbf{x})) \end{aligned}\tag{LMMSE.14}$$
where now $\mathbf C_{\theta x}$ is a $p \times N$ matrix.
By a similar approach we find that the Bayesian MSE matrix is (see Problem 12.7)
$$\begin{aligned} \mathbf{M}_{\hat{\theta}} &=E\left[(\boldsymbol{\theta}-\hat{\boldsymbol{\theta}})(\boldsymbol{\theta}-\hat{\boldsymbol{\theta}})^{T}\right] \\ &=\mathbf{C}_{\theta \theta}-\mathbf{C}_{\theta x} \mathbf{C}_{x x}^{-1} \mathbf{C}_{x \theta} \end{aligned}\tag{LMMSE.15}$$
where $\mathbf{C}_{\theta \theta}$ is the $p \times p$ covariance matrix of $\boldsymbol\theta$. Consequently, the minimum Bayesian MSE is
$$\operatorname{Bmse}(\hat \theta_i)=[\mathbf M_{\hat \theta}]_{ii}\tag{LMMSE.16}$$
Of course, these results are identical to those for the MMSE estimator in the Gaussian case, for which the estimator is linear.
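A compact sketch of (LMMSE.14)-(LMMSE.16) as a reusable routine (my own illustration; the function name and the toy numbers in the usage example are made up, and the moment arguments are whatever first two moments happen to be known or estimated elsewhere):

```python
import numpy as np

def lmmse(x, mu_theta, mu_x, C_theta_x, C_xx, C_theta_theta):
    """Vector LMMSE estimate (LMMSE.14) and Bayesian MSE matrix (LMMSE.15)."""
    W = C_theta_x @ np.linalg.inv(C_xx)          # p x N gain matrix
    theta_hat = mu_theta + W @ (x - mu_x)
    M = C_theta_theta - W @ C_theta_x.T          # Bmse(theta_i) = M[i, i], cf. (LMMSE.16)
    return theta_hat, M

# usage sketch: p = 1, N = 2, unit prior variance, unit noise variance
x = np.array([0.9, 1.1])
theta_hat, M = lmmse(x, np.zeros(1), np.zeros(2),
                     C_theta_x=np.array([[1.0, 1.0]]),
                     C_xx=np.array([[2.0, 1.0], [1.0, 2.0]]),
                     C_theta_theta=np.array([[1.0]]))
print(theta_hat, M)                              # [0.667], [[0.333]]
```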
Properties
- The LMMSE estimator is identical in form to the MMSE estimator for jointly Gaussian $\mathbf x$ and $\boldsymbol \theta$.
- The LMMSE estimator commutes over linear transformations. This is to say that if
  $$\boldsymbol{\alpha}=\mathbf{A} \boldsymbol{\theta}+\mathbf{b}$$
  then the LMMSE estimator of $\boldsymbol\alpha$ is
  $$\hat{\boldsymbol{\alpha}}=\mathbf{A} \hat{\boldsymbol{\theta}}+\mathbf{b}$$
  with $\hat{\boldsymbol\theta}$ given by (LMMSE.14).
- The LMMSE estimator of a sum of unknown parameters is the sum of the individual estimators. Specifically, if we wish to estimate $\boldsymbol\alpha=\boldsymbol\theta_{1}+\boldsymbol\theta_{2}$, then
  $$\hat{\boldsymbol{\alpha}}=\hat{\boldsymbol{\theta}}_{1}+\hat{\boldsymbol{\theta}}_{2}$$
- In analogy with the BLUE, there is a corresponding Gauss-Markov theorem for the Bayesian case.

Theorem 12.1 (Bayesian Gauss-Markov Theorem) If the data are described by the Bayesian linear model form
$$\mathbf{x}=\mathbf{H} \boldsymbol{\theta}+\mathbf{w}$$
where $\mathbf{x}$ is an $N \times 1$ data vector, $\mathbf{H}$ is a known $N \times p$ observation matrix, $\boldsymbol{\theta}$ is a $p \times 1$ random vector of parameters whose realization is to be estimated and has mean $E(\boldsymbol{\theta})$ and covariance matrix $\mathbf{C}_{\theta \theta}$, and $\mathbf w$ is an $N \times 1$ random vector with zero mean and covariance matrix $\mathbf{C}_{w}$ and is uncorrelated with $\boldsymbol \theta$, then the LMMSE estimator of $\boldsymbol \theta$ is
$$\begin{aligned} \hat{\boldsymbol{\theta}} &=E(\boldsymbol{\theta})+\mathbf{C}_{\theta \theta} \mathbf{H}^{T}\left(\mathbf{H} \mathbf{C}_{\theta \theta} \mathbf{H}^{T}+\mathbf{C}_{w}\right)^{-1}(\mathbf{x}-\mathbf{H} E(\boldsymbol{\theta})) \\ &=E(\boldsymbol{\theta})+\left(\mathbf{C}_{\theta \theta}^{-1}+\mathbf{H}^{T} \mathbf{C}_{w}^{-1} \mathbf{H}\right)^{-1} \mathbf{H}^{T} \mathbf{C}_{w}^{-1}(\mathbf{x}-\mathbf{H} E(\boldsymbol{\theta})) \end{aligned}$$
The performance of the estimator is measured by the error $\boldsymbol\epsilon=\boldsymbol\theta-\hat{\boldsymbol\theta}$, whose mean is zero and whose covariance matrix is
$$\begin{aligned} \mathbf{C}_{\epsilon} &=E_{x, \theta}\left(\boldsymbol\epsilon \boldsymbol\epsilon^{T}\right) \\ &=\mathbf{C}_{\theta \theta}-\mathbf{C}_{\theta \theta} \mathbf{H}^{T}\left(\mathbf{H} \mathbf{C}_{\theta \theta} \mathbf{H}^{T}+\mathbf{C}_{w}\right)^{-1} \mathbf{H} \mathbf{C}_{\theta \theta} \\ &=\left(\mathbf{C}_{\theta \theta}^{-1}+\mathbf{H}^{T} \mathbf{C}_{w}^{-1} \mathbf{H}\right)^{-1} \end{aligned}$$
The error covariance matrix is also the minimum MSE matrix $\mathbf M_{\hat{\theta}}$, whose diagonal elements yield the minimum Bayesian MSE
$$\left[\mathbf{M}_{\hat{\theta}}\right]_{i i} =\left[\mathbf{C}_{\epsilon}\right]_{i i} =\operatorname{Bmse}\left(\hat{\theta}_{i}\right)$$
These results are identical to those in Theorem 10.3 (Posterior PDF for the Bayesian General Linear Model) except that the error vector is not necessarily Gaussian.
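A Monte Carlo sketch of Theorem 12.1 (my own, with arbitrary $\mathbf H$ and covariances): the parameter is uniform and the noise Laplacian, so neither is Gaussian, yet the empirical error covariance matches $\mathbf C_\epsilon$ because only the first two moments enter.

```python
import numpy as np

rng = np.random.default_rng(8)
N, p, trials = 6, 2, 200000
H = rng.standard_normal((N, p))
C_theta = np.diag([1.0, 0.5])
C_w = 0.3 * np.eye(N)

# non-Gaussian draws that have the stated first two moments (zero mean)
theta = rng.uniform(-1.0, 1.0, (trials, p)) * np.sqrt(3.0 * np.diag(C_theta))
w = rng.laplace(0.0, np.sqrt(0.3 / 2.0), (trials, N))
x = theta @ H.T + w

C_eps = np.linalg.inv(np.linalg.inv(C_theta) + H.T @ np.linalg.inv(C_w) @ H)
W = C_eps @ H.T @ np.linalg.inv(C_w)              # LMMSE gain, E(theta) = 0 here
err = theta - x @ W.T
print(np.round(C_eps, 4))
print(np.round(err.T @ err / trials, 4))          # matches despite non-Gaussian PDFs
```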