Statistical Communication (統計通訊), 黃偉傑
Table of Contents
Estimation Theory
- Assessing Estimator Performance
- Minimum Variance Unbiased (MVU) Estimation
- Cramer-Rao Lower Bound
Assessing Estimator Performance
Consider the data set shown in Figure 1, in which x[n] consists of a DC level A in noise. We could model the data as x[n] = A + w[n], n = 0, 1, …, N−1, where w[n] denotes some zero-mean noise process. (Figure 1)
Assessing Estimator Performance
Based on the data set {x[0], x[1], …, x[N−1]}, we would like to estimate A. It would be reasonable to estimate A by the sample mean of the data, Â = (1/N) Σ x[n]. Several questions come to mind: How close will Â be to A? Are there better estimators than the sample mean?
Assessing Estimator Performance
For the data set in Figure 1, it turns out that Â = 0.9, which is close to the true value of A = 1. Another estimator, call it Ă, might also be considered; for the data set in Figure 1, Ă = 0.95, which is closer to the true value of A than the sample mean estimate. Can we conclude that Ă is a better estimator than Â? Since an estimator is a function of the data, which are random variables, it too is a random variable, subject to many possible outcomes.
Assessing Estimator Performance
Suppose we repeat the experiment by fixing A = 1 and adding different noise realizations. We determine the values of the two estimators for each data set. For 100 realizations the resulting histograms are shown in Figures 2 and 3. (Figure 2: histogram of estimates over 100 realizations.)
Assessing Estimator Performance
It should be evident that Â is better than Ă because its values are more concentrated about the true value of A = 1. Â will usually produce a value closer to the true one than Ă. (Figure 3)
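The experiment above is easy to reproduce numerically. The sketch below is a minimal Monte Carlo run; since the slides do not reproduce the definition of the alternative estimator Ă, the single sample x[0] is used here only as an assumed simple stand-in for comparison.

```python
# Monte Carlo sketch: estimate a DC level A = 1 from N noisy samples over many
# realizations and compare how concentrated the sample mean is versus a simple
# stand-in alternative (the single sample x[0]); x[0] is only an assumed example.
import numpy as np

rng = np.random.default_rng(0)
A, sigma, N, trials = 1.0, 1.0, 50, 100

x = A + sigma * rng.standard_normal((trials, N))   # each row is one data set
A_hat = x.mean(axis=1)                             # sample mean estimator
A_alt = x[:, 0]                                    # stand-in alternative: first sample

print("sample mean : mean = %.3f, std = %.3f" % (A_hat.mean(), A_hat.std()))
print("x[0] only   : mean = %.3f, std = %.3f" % (A_alt.mean(), A_alt.std()))
# The sample mean's values cluster tightly around A = 1 (std ~ sigma/sqrt(N)),
# while the single-sample estimator spreads with std ~ sigma.
```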
Assessing Estimator Performance
To prove that Â is better, we could establish that its variance is less. The modeling assumptions that we must employ are that the w[n]'s, in addition to being zero mean, are uncorrelated and have equal variance σ². We first show that the mean of each estimator is the true value, that is, E(Â) = E[(1/N) Σ x[n]] = (1/N) Σ E(x[n]) = A, and similarly E(Ă) = A.
Assessing Estimator Performance
Second, the variances are compared. Since the w[n]'s are uncorrelated, var(Â) = var[(1/N) Σ x[n]] = (1/N²) Σ var(w[n]) = σ²/N, while a similar calculation shows that var(Ă) is larger, and thus Â is indeed the better estimator.
Table of Contents
Minimum Variance Unbiased (MVU) Estimation
- Unbiased Estimators
- Minimum Variance Criterion
Unbiased Estimators
For an estimator to be unbiased we mean that on the average the estimator will yield the true value of the unknown parameter. Mathematically, an estimator θ̂ is unbiased if E(θ̂) = θ for a < θ < b, where (a, b) denotes the range of possible values of θ. Unbiased estimators tend to have symmetric PDFs centered about the true value of θ. For Example 1 the PDF is shown in Figure 4 and is easily shown to be N(A, σ²/N).
Unbiased Estimators
The restriction that E(θ̂) = θ for all θ is an important one. Letting θ̂ = g(x), where x = [x[0] x[1] … x[N−1]]ᵀ, it asserts that ∫ g(x) p(x; θ) dx = θ for all θ. (Figure 4: probability density function of the sample mean estimator.)
Unbiased Estimators
Example 1 – Unbiased Estimator for DC Level in WGN
Consider the observations x[n] = A + w[n], n = 0, 1, …, N−1, where −∞ < A < ∞. Then a reasonable estimator for the average value of x[n] is the sample mean Â = (1/N) Σ x[n]. Due to the linearity of the expectation operator, E(Â) = (1/N) Σ E(x[n]) = A for all A.
Unbiased Estimators
Example 2 – Biased Estimator for DC Level in White Noise
Consider again Example 1 but with a modified sample mean estimator Ă. Then E(Ă) ≠ A in general, so the estimator is biased. That an estimator is unbiased does not necessarily mean that it is a good estimator; it only guarantees that on the average it will attain the true value. However, a persistent bias will always result in a poor estimator.
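A quick numerical check makes the bias visible. The exact modification used on the slide is not reproduced here, so the sketch below simply scales the sample mean by 1/2 as an assumed illustration of a biased modification.

```python
# Sketch of a biased "modified sample mean": as an assumed illustration we scale
# the sample mean by 1/2. Its empirical mean settles near A/2, not A, exposing
# a persistent bias that averaging over more trials cannot remove.
import numpy as np

rng = np.random.default_rng(1)
A, sigma, N, trials = 1.0, 1.0, 20, 100000

x = A + sigma * rng.standard_normal((trials, N))
A_hat = x.mean(axis=1)            # unbiased sample mean
A_mod = 0.5 * A_hat               # hypothetical biased modification

print("E[A_hat] ~ %.3f (true A = %.1f)" % (A_hat.mean(), A))
print("E[A_mod] ~ %.3f (biased toward A/2)" % A_mod.mean())
```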
Unbiased Estimators
Combining Estimators Problem
It sometimes occurs that multiple estimates of the same parameter are available, say θ̂₁, θ̂₂, …, θ̂ₙ. A reasonable procedure is to combine these estimates into a better one by averaging them to form θ̂ = (1/n) Σ θ̂ᵢ. Assuming the estimators are unbiased, with the same variance, and uncorrelated with each other, E(θ̂) = θ and var(θ̂) = var(θ̂₁)/n.
Unbiased Estimators
Combining Estimators Problem (cont.)
So as more estimates are averaged, the variance will decrease. However, if the estimators are biased, say E(θ̂ᵢ) = θ + b(θ), then E(θ̂) = θ + b(θ), and no matter how many estimators are averaged, θ̂ will not converge to the true value. Here b(θ) = E(θ̂ᵢ) − θ is defined as the bias of the estimator.
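The 1/n variance reduction and the persistence of bias can both be checked numerically. The sketch below uses illustrative assumed values for the parameter, variance, and bias.

```python
# Sketch: averaging n unbiased, uncorrelated estimates with equal variance
# divides the variance by n, while averaging biased estimates leaves the bias.
# The parameter, variance, and bias values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
theta, var1, bias, trials = 3.0, 4.0, 0.5, 50000

for n in (1, 4, 16, 64):
    unbiased = theta + np.sqrt(var1) * rng.standard_normal((trials, n))
    biased = unbiased + bias
    avg_u = unbiased.mean(axis=1)
    avg_b = biased.mean(axis=1)
    print("n=%2d  var(avg)=%.3f (theory %.3f)  mean(biased avg)=%.3f"
          % (n, avg_u.var(), var1 / n, avg_b.mean()))
```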
Minimum Variance Criterion
Mean square error (MSE): mse(θ̂) = E[(θ̂ − θ)²]. Unfortunately, adoption of this natural criterion leads to unrealizable estimators, ones that cannot be written solely as a function of the data. To understand the problem, we first rewrite the MSE as mse(θ̂) = var(θ̂) + [E(θ̂) − θ]² = var(θ̂) + b²(θ).
Minimum Variance Criterion
This equation shows that the MSE is composed of errors due to the variance of the estimator as well as the bias. As an example, for the problem in Example 1 consider the modified estimator Ă = aÂ, where Â is the sample mean and a is a constant. We will attempt to find the a which results in the minimum MSE. Since E(Ă) = aA and var(Ă) = a²σ²/N, we have mse(Ă) = a²σ²/N + (a − 1)²A².
Minimum Variance Criterion
Differentiating the MSE with respect to a yields d mse(Ă)/da = 2aσ²/N + 2(a − 1)A², which upon setting to zero and solving yields the optimum value a_opt = A²/(A² + σ²/N). It is seen that the optimal value of a depends upon the unknown parameter A. The estimator is therefore not realizable.
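A short numerical sketch confirms this dependence: minimizing mse(a) = a²σ²/N + (a − 1)²A² over a grid of a values reproduces a_opt = A²/(A² + σ²/N) for each assumed value of A, and the minimizer clearly changes with A.

```python
# Sketch: the minimizing scale factor a for mse(a) = a^2*sigma^2/N + (a-1)^2*A^2
# depends on the unknown A, so the minimum-MSE estimator is not realizable.
# The sigma^2, N, and test values of A are assumptions for illustration.
import numpy as np

sigma2, N = 1.0, 10
a_grid = np.linspace(0.0, 1.5, 100001)

for A in (0.5, 1.0, 2.0):
    mse = a_grid**2 * sigma2 / N + (a_grid - 1.0)**2 * A**2
    a_best = a_grid[np.argmin(mse)]
    a_theory = A**2 / (A**2 + sigma2 / N)
    print("A=%.1f  grid minimizer a=%.3f  theory a=%.3f" % (A, a_best, a_theory))
```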
Minimum Variance Criterion
It would seem that any criterion which depends on the bias will lead to an unrealizable estimator. Although this is generally true, on occasion a realizable minimum MSE estimator can be found. From a practical viewpoint the minimum MSE estimator usually needs to be abandoned. An alternative approach is to constrain the bias to be zero and then find the estimator which minimizes the variance. Such an estimator is termed the minimum variance unbiased (MVU) estimator.
Minimum Variance Criterion
The variance of an unbiased estimator may depend on θ. If one estimator has the smallest variance for all values of θ, it is sometimes referred to as the uniformly minimum variance unbiased estimator. In general, the MVU estimator does not always exist.
Minimum Variance Criterion
Example 3 – Counterexample to Existence of MVU Estimator. If the form of the PDF changes with θ, then it would be expected that the best estimator would also change with θ. Assume that we have two independent observations x[0] and x[1] whose PDFs depend on θ.
Minimum Variance Criterion
Example 3 (cont.): The two candidate estimators can easily be shown to be unbiased. Their variances are then computed from the variances of x[0] and x[1].
Minimum Variance Criterion
Example 3 (cont.): The resulting variances show that which estimator has the smaller variance depends on the value of θ. Clearly, between these two estimators no MVU estimator exists.
Table of Contents
Cramer-Rao Lower Bound
- Estimator Accuracy Considerations
- Transformation of Parameters
Estimator Accuracy Considerations
If a single sample is observed as x[0] = A + w[0], with w[0] zero-mean Gaussian noise of variance σ², and it is desired to estimate A, then we expect a better estimate if σ² is small. A good unbiased estimator is Â = x[0]. Its variance is σ², so the estimator accuracy improves as σ² decreases. The PDFs for two different variances σ1² and σ2², i = 1, 2, are shown in Figure 5.
Estimator Accuracy Considerations
The PDF has been plotted versus the unknown parameter A for a given value of x[0]. If σ1² < σ2², then we should be able to estimate A more accurately based on p1(x[0]; A). (Figure 5: PDF dependence on the unknown parameter.)
Estimator Accuracy Considerations
When the PDF is viewed as a function of the unknown parameter (with x fixed), it is termed the likelihood function. If we consider the natural logarithm of the PDF, ln p(x[0]; A) = −ln √(2πσ²) − (x[0] − A)²/(2σ²), then the first derivative is ∂ ln p(x[0]; A)/∂A = (x[0] − A)/σ², and the negative of the second derivative becomes −∂² ln p(x[0]; A)/∂A² = 1/σ².
Estimator Accuracy Considerations
The curvature increases as σ² decreases. Since we already know that the estimator Â = x[0] has variance σ², for this example var(Â) = 1 / (−∂² ln p(x[0]; A)/∂A²), and the variance decreases as the curvature increases. A more appropriate measure of curvature is −E[∂² ln p(x[0]; A)/∂A²], which measures the average curvature of the log-likelihood function.
Cramer-Rao Lower Bound
Theorem (Cramer-Rao Lower Bound, Scalar Parameter). It is assumed that the PDF p(x; θ) satisfies the "regularity" condition E[∂ ln p(x; θ)/∂θ] = 0 for all θ. Then the variance of any unbiased estimator θ̂ must satisfy var(θ̂) ≥ 1 / (−E[∂² ln p(x; θ)/∂θ²]), where the derivative is evaluated at the true value of θ and the expectation is taken with respect to p(x; θ).
Cramer-Rao Lower Bound
An unbiased estimator may be found that attains the bound for all θ if and only if ∂ ln p(x; θ)/∂θ = I(θ)(g(x) − θ) for some functions g and I. That estimator is θ̂ = g(x), which is the MVU estimator, and the minimum variance is 1/I(θ).
Cramer-Rao Lower Bound
Example 4 – DC Level in White Gaussian Noise. Consider x[n] = A + w[n], n = 0, 1, …, N−1, where w[n] is WGN with variance σ². To determine the CRLB for A, we work with the log-likelihood ln p(x; A) = −(N/2) ln(2πσ²) − (1/(2σ²)) Σ (x[n] − A)².
Cramer-Rao Lower Bound
Example 4 (cont.): Taking the first derivative, ∂ ln p(x; A)/∂A = (1/σ²) Σ (x[n] − A) = (N/σ²)(x̄ − A), where x̄ is the sample mean. Differentiating again, ∂² ln p(x; A)/∂A² = −N/σ².
Cramer-Rao Lower Bound
Example 4 (cont.): Noting that the second derivative is a constant, we have var(Â) ≥ σ²/N as the CRLB. By comparison, we see that the sample mean estimator attains the bound and must therefore be the MVU estimator.
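For this example the bound is easy to verify numerically: the Fisher information is I(A) = N/σ², so the CRLB is σ²/N, and a Monte Carlo estimate of the sample mean's variance should sit right at that value. The sketch below uses assumed values for A, σ², and N.

```python
# Sketch: for x[n] = A + w[n] with WGN of variance sigma^2, the average
# curvature of the log-likelihood is N/sigma^2, so the CRLB is sigma^2/N,
# and the sample mean's Monte Carlo variance should match that bound.
import numpy as np

rng = np.random.default_rng(3)
A, sigma2, N, trials = 1.0, 2.0, 25, 200000

x = A + np.sqrt(sigma2) * rng.standard_normal((trials, N))
A_hat = x.mean(axis=1)

# Analytic second derivative of ln p(x; A) is -N/sigma^2 (a constant),
# so the Fisher information is I(A) = N/sigma^2.
I_A = N / sigma2
print("CRLB 1/I(A)        :", 1.0 / I_A)
print("var of sample mean :", A_hat.var())
```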
Cramer-Rao Lower Bound
We now prove that when the CRLB is attained, var(θ̂) = 1/I(θ), where I(θ) = −E[∂² ln p(x; θ)/∂θ²]. From the Cramer-Rao lower bound attainment condition, ∂ ln p(x; θ)/∂θ = I(θ)(θ̂ − θ).
Cramer-Rao Lower Bound
Differentiating the latter produces ∂² ln p(x; θ)/∂θ² = [∂I(θ)/∂θ](θ̂ − θ) − I(θ), and taking the negative expected value yields −E[∂² ln p(x; θ)/∂θ²] = I(θ), since E(θ̂) = θ. Therefore var(θ̂) = 1/I(θ). In the next example we will see that the CRLB is not always satisfied with equality.
Cramer-Rao Lower Bound
Example 5 – Phase Estimator. Assume that we wish to estimate the phase φ of a sinusoid embedded in WGN: x[n] = A cos(2πf0 n + φ) + w[n], n = 0, 1, …, N−1. The amplitude A and frequency f0 are assumed known. The PDF is p(x; φ) = (2πσ²)^(−N/2) exp[−(1/(2σ²)) Σ (x[n] − A cos(2πf0 n + φ))²].
Cramer-Rao Lower Bound
Example 5 (cont.): Differentiating the log-likelihood function produces ∂ ln p(x; φ)/∂φ = −(A/σ²) Σ [x[n] sin(2πf0 n + φ) − (A/2) sin(4πf0 n + 2φ)], and ∂² ln p(x; φ)/∂φ² = −(A/σ²) Σ [x[n] cos(2πf0 n + φ) − A cos(4πf0 n + 2φ)].
Cramer-Rao Lower Bound
Example 5 (cont.): Upon taking the negative expected value we have −E[∂² ln p(x; φ)/∂φ²] = (A²/σ²) Σ [1/2 − (1/2) cos(4πf0 n + 2φ)].
Cramer-Rao Lower Bound
Example 5 (cont.): Since (1/N) Σ cos(4πf0 n + 2φ) ≈ 0 for f0 not near 0 or 1/2, we have −E[∂² ln p(x; φ)/∂φ²] ≈ NA²/(2σ²), and therefore var(φ̂) ≥ 2σ²/(NA²).
Cramer-Rao Lower Bound
In this example the condition for the bound to be attained is not satisfied, hence an efficient phase estimator does not exist. An estimator which is unbiased and attains the CRLB, as the sample mean estimator in Example 4 does, is said to be efficient in that it efficiently uses the data.
Transformation of Parameters
In Example 4 we may not be interested in the sign of A but instead may wish to estimate A², the power of the signal. Knowing the CRLB for A, we can easily obtain it for A². If it is desired to estimate α = g(θ), then the CRLB is var(α̂) ≥ (∂g/∂θ)² / (−E[∂² ln p(x; θ)/∂θ²]). For the present example α = g(A) = A², and var(Â²) ≥ (2A)²/(N/σ²) = 4A²σ²/N.
Transformation of Parameters
We saw in Example 4 that the sample mean estimator Â was efficient for A. It might be supposed that Â² is efficient for A². To quickly dispel this notion we first show that Â² is not even an unbiased estimator. Since Â ~ N(A, σ²/N), E(Â²) = A² + σ²/N ≠ A². Hence, we immediately conclude that the efficiency of an estimator is destroyed by a nonlinear transformation.
Transformation of Parameters
That it is maintained for linear transformations is easily verified. Assume that an efficient estimator for θ exists and is given by θ̂. It is desired to estimate g(θ) = aθ + b. We choose the estimator ĝ = aθ̂ + b, which is unbiased since E(aθ̂ + b) = aθ + b. The CRLB for g(θ) is (∂g/∂θ)²/I(θ) = a²/I(θ). But var(aθ̂ + b) = a² var(θ̂) = a²/I(θ), so that the CRLB is achieved.
Transformation of Parameters
Although efficiency is preserved only over linear transformations, it is approximately maintained over nonlinear transformations if the data record is large enough. To see why this property holds, we return to the previous example of estimating A² by Â². Although Â² is biased, we note from E(Â²) = A² + σ²/N that Â² is asymptotically unbiased, i.e., unbiased as N → ∞.
Transformation of Parameters
Since Â ~ N(A, σ²/N), we can evaluate the variance of Â². By using the result that if ξ ~ N(μ, σ²), then E(ξ⁴) = μ⁴ + 6μ²σ² + 3σ⁴, we obtain var(ξ²) = E(ξ⁴) − E²(ξ²) = 4μ²σ² + 2σ⁴.
Transformation of Parameters
For our problem we then have var(Â²) = 4A²σ²/N + 2σ⁴/N². As N → ∞, the variance approaches 4A²σ²/N, the last term converging to zero faster than the first. Our assertion that Â² is an asymptotically efficient estimator of A² is verified. This situation occurs due to the statistical linearity of the transformation, as illustrated in Figure 6.
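This asymptotic behavior is straightforward to check by simulation: drawing Â directly from its N(A, σ²/N) distribution, the empirical variance of Â² should match 4A²σ²/N + 2σ⁴/N² and approach the transformed CRLB 4A²σ²/N as N grows. The values below are assumed for illustration.

```python
# Sketch: Monte Carlo check that var(A_hat^2) = 4*A^2*sigma^2/N + 2*sigma^4/N^2,
# which approaches the transformed CRLB 4*A^2*sigma^2/N as N grows.
import numpy as np

rng = np.random.default_rng(4)
A, sigma2, trials = 1.0, 1.0, 400000

for N in (5, 50, 500):
    A_hat = A + np.sqrt(sigma2 / N) * rng.standard_normal(trials)  # A_hat ~ N(A, sigma2/N)
    emp = (A_hat**2).var()
    exact = 4 * A**2 * sigma2 / N + 2 * (sigma2 / N)**2
    crlb = 4 * A**2 * sigma2 / N
    print("N=%4d  empirical=%.5f  exact=%.5f  asymptotic CRLB=%.5f" % (N, emp, exact, crlb))
```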
Transformation of Parameters
As N increases, the PDF of Â becomes more concentrated about the mean A. Therefore, the values of Â that are observed lie in a small interval about A. Over this small interval the nonlinear transformation is approximately linear. (Figure 6: statistical linearity of nonlinear transformations.)
Minimum Variance Unbiased Estimator for the Linear Model
If the observed data can be modeled as x = Hθ + w, where x is an N × 1 vector of observations, H is a known N × p observation matrix, θ is a p × 1 vector of parameters to be estimated, and w is an N × 1 noise vector with PDF N(0, σ²I), then the MVU estimator is θ̂ = (HᵀH)⁻¹Hᵀx, and the covariance matrix of θ̂ is C_θ̂ = σ²(HᵀH)⁻¹.
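The estimator and its covariance are easy to exercise numerically. The sketch below uses an assumed line-fit model (p = 2, an intercept and a slope) purely as an example; the Monte Carlo covariance of θ̂ should agree with σ²(HᵀH)⁻¹.

```python
# Sketch of the linear-model MVU estimator: theta_hat = (H^T H)^{-1} H^T x with
# covariance sigma^2 (H^T H)^{-1}. The line-fit model below is an assumed example.
import numpy as np

rng = np.random.default_rng(5)
N, sigma2 = 40, 0.5
theta_true = np.array([1.0, -0.3])              # assumed [intercept, slope]
n = np.arange(N)
H = np.column_stack([np.ones(N), n])            # known N x 2 observation matrix

trials = 20000
est = np.empty((trials, 2))
for k in range(trials):
    x = H @ theta_true + np.sqrt(sigma2) * rng.standard_normal(N)
    est[k] = np.linalg.solve(H.T @ H, H.T @ x)  # (H^T H)^{-1} H^T x

print("mean of theta_hat  :", est.mean(axis=0))   # ~ theta_true (unbiased)
print("empirical cov      :\n", np.cov(est.T))
print("sigma^2 (H^T H)^-1 :\n", sigma2 * np.linalg.inv(H.T @ H))
```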
Table of Contents
Least Squares (LS) Estimators
- Linear LSE
- Nonlinear LSE
General Bayesian Estimators
- The Bayesian Philosophy
- Minimum Mean Square Error (MMSE) Estimators
- Maximum A Posteriori (MAP) Estimators
Least Squares A salient feature of the method is that no probabilistic assumptions are made about the data, only a signal model is assumed. The advantage is its broader range of possible applications. On the negative side, no claims about optimality can be made, and furthermore, the statistical performance cannot be assessed without some specific assumptions about the probabilistic structure of the data. The least squares estimator is widely used in practice due to its ease of implementation, amounting to the minimization of a least squares error criterion.
Least Squares In the LS approach we attempt to minimize the squared difference between the given data x[n] and the assumed signal or noiseless data.
Least Squares
The LS error criterion is J(θ) = Σ (x[n] − s[n])², where s[n] is the assumed signal, which depends on θ. The value of θ that minimizes J(θ) is the LSE. Note that no probabilistic assumptions have been made about the data x[n]. LSEs are usually applied in situations where a precise statistical characterization of the data is unknown, or where an optimal estimator cannot be found or is too complicated to apply in practice.
Linear Least Squares
In applying the linear LS approach for a scalar parameter we must assume that s[n] = θ h[n], where h[n] is a known sequence. The LS error criterion becomes J(θ) = Σ (x[n] − θ h[n])². A minimization is readily shown to produce the LSE θ̂ = Σ x[n] h[n] / Σ h²[n].
Linear Least Squares
The minimum LS error is Jmin = Σ x²[n] − (Σ x[n] h[n])² / Σ h²[n].
Linear Least Squares
Alternatively, we can rewrite Jmin as Jmin = Σ x²[n] − θ̂ Σ x[n] h[n].
For the signal s = [s[0] s[1] … s[N−1]]ᵀ to be linear in the unknown parameters, using matrix notation we write s = Hθ. The matrix H, which is a known N × p matrix (N > p) of full rank p, is referred to as the observation matrix.
Linear Least Squares
The LSE is found by minimizing J(θ) = (x − Hθ)ᵀ(x − Hθ). Since J(θ) = xᵀx − 2xᵀHθ + θᵀHᵀHθ, the gradient is ∂J(θ)/∂θ = −2Hᵀx + 2HᵀHθ.
Linear Least Squares
Setting the gradient equal to zero yields the LSE θ̂ = (HᵀH)⁻¹Hᵀx. The minimum LS error is Jmin = (x − Hθ̂)ᵀ(x − Hθ̂) = xᵀ(I − H(HᵀH)⁻¹Hᵀ)x.
Linear Least Squares
The last step results from the fact that I − H(HᵀH)⁻¹Hᵀ is an idempotent matrix, i.e., it has the property A² = A. Other forms for Jmin are Jmin = xᵀ(x − Hθ̂) = xᵀx − xᵀHθ̂.
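The equivalence of these forms is a quick numerical check. The sketch below generates an arbitrary assumed H and x and evaluates all three expressions for Jmin; they agree up to round-off.

```python
# Sketch: numerical check that x^T (I - H(H^T H)^{-1} H^T) x,  x^T (x - H theta_hat),
# and x^T x - x^T H theta_hat all give the same minimum LS error.
# The data and H below are arbitrary assumed values.
import numpy as np

rng = np.random.default_rng(6)
N, p = 30, 3
H = rng.standard_normal((N, p))
x = rng.standard_normal(N)

theta_hat = np.linalg.solve(H.T @ H, H.T @ x)
P = H @ np.linalg.inv(H.T @ H) @ H.T       # projection matrix, idempotent: P @ P ~ P

J1 = x @ (np.eye(N) - P) @ x
J2 = x @ (x - H @ theta_hat)
J3 = x @ x - x @ H @ theta_hat
print(J1, J2, J3)                          # identical up to round-off
```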
Nonlinear Least Squares
Before discussing general methods for determining nonlinear LSEs, we first describe two methods that can reduce the complexity of the problem: 1. transformation of parameters, and 2. separability of parameters. In the first case we seek a one-to-one transformation of θ that produces a linear signal model in the new space. To do so we let α = g(θ), where g is a p-dimensional function of θ whose inverse exists.
Nonlinear Least Squares
If a g can be found so that the signal can be written as s = Hα, then the signal model will be linear in α. We can then easily find the linear LSE of α, and thus the nonlinear LSE of θ by θ̂ = g⁻¹(α̂). This approach relies on the property that the minimization can be carried out in any transformed space that is obtained by a one-to-one mapping and then converted back to the original space.
Nonlinear Least Squares
Example – Sinusoidal Parameter Estimation. For the sinusoidal signal model s[n] = A cos(2πf0 n + φ), it is desired to estimate the amplitude A, where A > 0, and the phase φ. The frequency f0 is assumed known. The LSE is obtained by minimizing J(A, φ) = Σ (x[n] − A cos(2πf0 n + φ))² over A and φ.
Nonlinear Least Squares
Example – Sinusoidal Parameter Estimation (cont.): Because A cos(2πf0 n + φ) = A cos φ cos(2πf0 n) − A sin φ sin(2πf0 n), if we let α1 = A cos φ and α2 = −A sin φ, then the signal model becomes s[n] = α1 cos(2πf0 n) + α2 sin(2πf0 n). In matrix form this is s = Hα.
Nonlinear Least Squares
Example – Sinusoidal Parameter Estimation (cont.): where H is the N × 2 matrix whose nth row is [cos(2πf0 n)  sin(2πf0 n)], which is now linear in the new parameters α = [α1 α2]ᵀ. The LSE of α is α̂ = (HᵀH)⁻¹Hᵀx, and to find the LSE of the original parameters we must find the inverse transformation g⁻¹(α).
Nonlinear Least Squares
Example – Sinusoidal Parameter Estimation (cont.): This is Â = √(α̂1² + α̂2²) and φ̂ = arctan(−α̂2/α̂1), so that the nonlinear LSE for this problem is given by these expressions, where α̂ = [α̂1 α̂2]ᵀ = (HᵀH)⁻¹Hᵀx.
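The whole procedure fits in a few lines. The sketch below uses assumed values of A, φ, f0, and the noise level; it solves the linear problem for α and then inverts the mapping α1 = A cos φ, α2 = −A sin φ.

```python
# Sketch of the transformation-of-parameters LSE for the sinusoid: fit
# alpha = [A*cos(phi), -A*sin(phi)] by linear LS, then invert the mapping.
# A_true, phi_true, f0, and sigma are assumed test values.
import numpy as np

rng = np.random.default_rng(7)
N, f0 = 100, 0.08
A_true, phi_true, sigma = 1.5, 0.6, 0.3

n = np.arange(N)
x = A_true * np.cos(2 * np.pi * f0 * n + phi_true) + sigma * rng.standard_normal(N)

H = np.column_stack([np.cos(2 * np.pi * f0 * n), np.sin(2 * np.pi * f0 * n)])
a1, a2 = np.linalg.solve(H.T @ H, H.T @ x)   # linear LSE of alpha

A_hat = np.hypot(a1, a2)                     # invert alpha1 = A*cos(phi), alpha2 = -A*sin(phi)
phi_hat = np.arctan2(-a2, a1)
print("A_hat = %.3f (true %.1f), phi_hat = %.3f (true %.1f)" % (A_hat, A_true, phi_hat, phi_true))
```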
Nonlinear Least Squares
A second type of nonlinear LS problem that is less complex than the general one exhibits the separability property. Although the signal model is nonlinear, it may be linear in some of the parameters. In general, a separable signal model has the form s = H(α)β, where H(α) is an N × q matrix dependent on α. This model is linear in β but nonlinear in α.
Nonlinear Least Squares
Here the unknown parameters consist of α and β. As a result, the LS error may be minimized with respect to β and thus reduced to a function of α only. Since J(α, β) = (x − H(α)β)ᵀ(x − H(α)β), the β that minimizes J for a given α is β̂ = (Hᵀ(α)H(α))⁻¹Hᵀ(α)x.
Nonlinear Least Squares
The resulting LS error is J(α, β̂) = xᵀ[I − H(α)(Hᵀ(α)H(α))⁻¹Hᵀ(α)]x. The problem now reduces to a maximization of xᵀH(α)(Hᵀ(α)H(α))⁻¹Hᵀ(α)x over α. If, for instance, q = p − 1, so that α is a scalar, then a grid search can possibly be used.
Nonlinear Least Squares
Example – Damped Exponentials. Assume we have a damped exponential signal model in which the unknown parameters are {A1, A2, A3, r}. It is known that 0 < r < 1. Then the model is linear in the amplitudes β = [A1 A2 A3]ᵀ and nonlinear in the damping factor α = r. The nonlinear LSE is obtained by maximizing xᵀH(r)(Hᵀ(r)H(r))⁻¹Hᵀ(r)x over 0 < r < 1.
Nonlinear Least Squares
Example – Damped Exponentials (cont.): where H(r) is the observation matrix formed from the damped exponential terms. Once r̂ is found, we have the LSE for the amplitudes, β̂ = (Hᵀ(r̂)H(r̂))⁻¹Hᵀ(r̂)x. This maximization is easily carried out on a digital computer.
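The sketch below carries out this grid search. Since the slide does not reproduce the exact signal equation, the form s[n] = A1 rⁿ + A2 r²ⁿ + A3 r³ⁿ is an assumed version of the damped exponential model, and all numerical values are illustrative.

```python
# Sketch of the separable LS solution for damped exponentials: grid-search r,
# maximizing x^T H(r) (H(r)^T H(r))^{-1} H(r)^T x, then solve linearly for the
# amplitudes. The signal form and values below are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(8)
N = 60
n = np.arange(N)
r_true, beta_true, sigma = 0.9, np.array([2.0, -1.0, 0.5]), 0.05

def Hmat(r):
    # N x 3 observation matrix with columns r^n, r^(2n), r^(3n)
    return np.column_stack([r**n, r**(2 * n), r**(3 * n)])

x = Hmat(r_true) @ beta_true + sigma * rng.standard_normal(N)

def criterion(r):
    # x^T H (H^T H)^{-1} H^T x, computed via a least-squares fit for stability
    H = Hmat(r)
    beta, *_ = np.linalg.lstsq(H, x, rcond=None)
    return x @ (H @ beta)

r_grid = np.linspace(0.2, 0.99, 80)
r_hat = max(r_grid, key=criterion)
beta_hat, *_ = np.linalg.lstsq(Hmat(r_hat), x, rcond=None)
print("r_hat =", round(float(r_hat), 3), " beta_hat =", beta_hat)
```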
The Bayesian Philosophy
We now depart from the classical approach to statistical estimation, in which the parameter of interest is assumed to be a deterministic but unknown constant. Instead, we assume that θ is a random variable whose particular realization we must estimate. This is the Bayesian approach, so named because its implementation is based directly on Bayes' theorem. The motivation for doing so is twofold. First, if we have available some prior knowledge about θ, we can incorporate it into our estimator. Second, Bayesian estimation is useful in situations where an MVU estimator cannot be found.
Prior Knowledge and Estimation
It is a fundamental rule of estimation theory that the use of prior knowledge will lead to a more accurate estimator. For example, if a parameter is constrained to lie in a known interval, then any good estimator should produce only estimates within that interval. In the earlier example it was shown that the MVU estimator of A is the sample mean x̄. However, this assumed that A could take on any value in the interval −∞ < A < ∞. Due to physical constraints it may be more reasonable to assume that A can take on only values in the finite interval −A0 ≤ A ≤ A0. To retain Â = x̄ as the best estimator would be undesirable, since Â may yield values outside the known interval.
Prior Knowledge and Estimation
As shown in figure (a), this is due to noise effects. Certainly, we would expect to improve our estimation if we used the truncated sample mean estimator Ă, which equals −A0 if x̄ < −A0, x̄ if −A0 ≤ x̄ ≤ A0, and A0 if x̄ > A0.
Prior Knowledge and Estimation
Such an estimator would have a PDF consisting of the Gaussian PDF of x̄ on the interior of the interval together with point masses at ±A0 that account for the truncation. It is seen that Ă is a biased estimator. However, if we compare the MSE of the two estimators, we note that for any A in the interval −A0 ≤ A ≤ A0, mse(Ă) ≤ mse(Â).
Prior Knowledge and Estimation
Hence, Ă, the truncated sample mean estimator, is better than the sample mean estimator in terms of MSE. Although Â is still the MVU estimator, we have been able to reduce the mean square error by allowing the estimator to be biased. Knowing that A must lie in a known interval, we suppose that the true value of A has been chosen from that interval. We then model the process of choosing a value as a random event to which a PDF can be assigned. With knowledge only of the interval and no inclination as to whether A should be nearer any particular value, it makes sense to assign a U[−A0, A0] PDF to the random variable A.
Prior Knowledge and Estimation
The overall data model then appears as in the following figure. As shown there, the act of choosing A according to the given PDF represents the departure of the Bayesian approach from the classical approach. The problem, as always, is to estimate the value of A, or the realization of the random variable, but now we can incorporate our knowledge of how A was chosen.
Prior Knowledge and Estimation
For example, we might attempt to find an estimator Â that would minimize the Bayesian MSE, defined as Bmse(Â) = E[(A − Â)²]. We choose to define the error as A − Â, in contrast to the classical estimation error of Â − A. Now we emphasize that since A is a random variable, the expectation operator is with respect to the joint PDF p(x, A). This is a fundamentally different MSE than in the classical case. We distinguish it by using the Bmse notation.
Prior Knowledge and Estimation
To appreciate the difference, compare the classical MSE, mse(Â) = ∫ (Â − A)² p(x; A) dx, to the Bayesian MSE, Bmse(Â) = ∫∫ (A − Â)² p(x, A) dx dA. Note that whereas the classical MSE will depend on A, and hence estimators that attempt to minimize the MSE will usually depend on A, the Bayesian MSE will not. In effect, we have integrated the parameter dependence away!
Prior Knowledge and Estimation
To complete our example we now derive the estimator that minimizes the Bayesian MSE. First, we use Bayes' theorem to write p(x, A) = p(A|x)p(x), so that Bmse(Â) = ∫ [∫ (A − Â)² p(A|x) dA] p(x) dx. Now since p(x) ≥ 0 for all x, if the integral in brackets can be minimized for each x, then the Bayesian MSE will be minimized.
Prior Knowledge and Estimation
Hence, fixing x so that Â is a scalar variable, we have ∂/∂Â ∫ (A − Â)² p(A|x) dA = −2 ∫ A p(A|x) dA + 2Â ∫ p(A|x) dA, which when set equal to zero results in Â ∫ p(A|x) dA = ∫ A p(A|x) dA, or finally Â = ∫ A p(A|x) dA = E(A|x).
Prior Knowledge and Estimation
It is seen that the optimal estimator in terms of minimizing the Bayesian MSE is the mean of the posterior PDF p(A|x). The posterior PDF refers to the PDF of A after the data have been observed. In contrast, p(A) may be thought of as the prior PDF of A, indicating the PDF before the data are observed. We will henceforth term the estimator that minimizes the Bayesian MSE the minimum mean square error (MMSE) estimator.
Prior Knowledge and Estimation
In determining the MMSE estimator we first require the posterior PDF. We can use Bayes' rule to determine it as p(A|x) = p(x|A)p(A) / ∫ p(x|A)p(A) dA. Note that the denominator is just a normalizing factor, independent of A, needed to ensure that p(A|x) integrates to 1. If we continue our example, we recall that the prior PDF p(A) is U[−A0, A0]. To specify the conditional PDF p(x|A) we need to further assume that the choice of A via p(A) does not affect the PDF of the noise samples, i.e., that w[n] is independent of A.
Prior Knowledge and Estimation
Then, for n = 0, 1, …, N−1, p(x[n]|A) = (1/√(2πσ²)) exp[−(x[n] − A)²/(2σ²)], and therefore p(x|A) = (2πσ²)^(−N/2) exp[−(1/(2σ²)) Σ (x[n] − A)²]. It is apparent that this PDF is identical in form to the usual classical PDF p(x; A).
Prior Knowledge and Estimation
The posterior PDF becomes p(A|x) = p(x|A)p(A) / ∫ p(x|A)p(A) dA, which is zero for |A| > A0 because of the uniform prior. But Σ (x[n] − A)² = N(A − x̄)² + Σ (x[n] − x̄)², so the dependence on A is through a Gaussian function of A − x̄.
Prior Knowledge and Estimation
So that we have p(A|x) = c·exp[−N(A − x̄)²/(2σ²)] for |A| ≤ A0 and p(A|x) = 0 otherwise. The factor c is determined by the requirement that p(A|x) integrate to 1, resulting in c = 1 / ∫ from −A0 to A0 of exp[−N(A − x̄)²/(2σ²)] dA.
Prior Knowledge and Estimation
The PDF is seen to be a truncated Gaussian, as shown in the figure. The MMSE estimator, which is the mean of p(A|x), is Â = ∫ from −A0 to A0 of A exp[−N(A − x̄)²/(2σ²)] dA, divided by ∫ from −A0 to A0 of exp[−N(A − x̄)²/(2σ²)] dA.
Prior Knowledge and Estimation
Although this cannot be evaluated in closed form, we note that Â will be a function of x̄ as well as of A0 and σ². The MMSE estimator will not equal x̄, due to the truncation shown in figure (b), unless A0 is so large that there is effectively no truncation; this occurs when A0 is large relative to |x̄| and the posterior standard deviation √(σ²/N). The effect of the data is to position the posterior mean between A = 0 and A = x̄, in a compromise between the prior knowledge and that contributed by the data. To further appreciate this weighting, consider what happens as N becomes large so that the data knowledge becomes more important.
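Although the posterior mean has no closed form, it is easy to evaluate numerically. The sketch below integrates the truncated-Gaussian posterior on a grid; all numerical values are assumed for illustration.

```python
# Sketch: numerical evaluation of the MMSE estimator for the uniform prior
# U[-A0, A0], i.e., the mean of the truncated-Gaussian posterior
# p(A|x) ~ exp(-N (A - xbar)^2 / (2 sigma^2)) on [-A0, A0].
# A0, sigma, N, and A_true are assumed test values.
import numpy as np

rng = np.random.default_rng(9)
A0, sigma, N, A_true = 1.0, 1.0, 5, 0.8

x = A_true + sigma * rng.standard_normal(N)
xbar = x.mean()

A_grid = np.linspace(-A0, A0, 20001)
w = np.exp(-N * (A_grid - xbar)**2 / (2 * sigma**2))  # unnormalized posterior on the grid
A_mmse = np.sum(A_grid * w) / np.sum(w)               # posterior mean (uniform grid)

print("sample mean xbar =", xbar)
print("MMSE estimate    =", A_mmse)   # pulled toward the interior of [-A0, A0]
```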
Prior Knowledge and Estimation
As shown in Figure 4, as N increases, the posterior PDF becomes more concentrated about x̄ (since σ²/N decreases). Hence, it becomes nearly Gaussian, and its mean becomes just x̄. The MMSE estimator relies less and less on the prior knowledge and more on the data. It is said that the data "swamps out" the prior knowledge.
Prior Knowledge and Estimation
Theorem (Conditional PDF of Multivariate Gaussian). If x and y are jointly Gaussian, where x is k × 1 and y is l × 1, with mean vector [E(x)ᵀ E(y)ᵀ]ᵀ and partitioned covariance matrix C with blocks Cxx, Cxy, Cyx, Cyy, then the conditional PDF p(y|x) is also Gaussian, with E(y|x) = E(y) + Cyx Cxx⁻¹ (x − E(x)) and covariance Cy|x = Cyy − Cyx Cxx⁻¹ Cxy.
Bayesian Linear Model
Theorem (Posterior PDF for the Bayesian General Linear Model). If the observed data can be modeled as x = Hθ + w, where x is an N × 1 data vector, H is a known N × p matrix, θ is a p × 1 random vector with prior PDF N(μ_θ, C_θ), and w is an N × 1 noise vector with PDF N(0, C_w) independent of θ, then the posterior PDF p(θ|x) is Gaussian with mean E(θ|x) = μ_θ + C_θHᵀ(HC_θHᵀ + C_w)⁻¹(x − Hμ_θ) and covariance C_θ|x = C_θ − C_θHᵀ(HC_θHᵀ + C_w)⁻¹HC_θ.
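The posterior mean and covariance formulas translate directly into code. The sketch below uses assumed dimensions, prior, and noise covariance purely to exercise the expressions.

```python
# Sketch of the Bayesian linear-model posterior:
#   E(theta|x) = mu + C H^T (H C H^T + Cw)^{-1} (x - H mu)
#   C_post     = C  - C H^T (H C H^T + Cw)^{-1} H C
# The dimensions, prior, and noise covariance below are assumed for illustration.
import numpy as np

rng = np.random.default_rng(10)
N, p = 20, 2
H = np.column_stack([np.ones(N), np.arange(N)])   # assumed known observation matrix
mu = np.zeros(p)                                  # prior mean
C = np.eye(p)                                     # prior covariance
Cw = 0.25 * np.eye(N)                             # noise covariance

theta = rng.multivariate_normal(mu, C)            # draw one realization of theta
x = H @ theta + rng.multivariate_normal(np.zeros(N), Cw)

S = H @ C @ H.T + Cw
K = C @ H.T @ np.linalg.inv(S)
theta_post_mean = mu + K @ (x - H @ mu)
C_post = C - K @ H @ C

print("true theta        :", theta)
print("posterior mean    :", theta_post_mean)
print("posterior std dev :", np.sqrt(np.diag(C_post)))
```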
Risk Function
Previously, we derived the MMSE estimator by minimizing E[(θ − θ̂)²], where the expectation is with respect to the PDF p(x, θ). If we let ε = θ − θ̂ denote the error of the estimator for a particular realization of x and θ, and also let C(ε) = ε², then the MSE criterion minimizes E[C(ε)]. The deterministic function C(ε) is termed the cost function. It is noted that with this quadratic cost, large errors are particularly costly. Also, the average cost E[C(ε)] is termed the Bayes risk R and measures the performance of a given estimator.
Risk Function
Examples of cost functions.
Risk Function
The Bayes risk R is R = E[C(ε)] = ∫∫ C(θ − θ̂) p(x, θ) dx dθ.
Maximum A Posteriori Estimators
In the MAP estimation approach we choose θ̂ to maximize the posterior PDF, θ̂ = arg max over θ of p(θ|x). In finding the maximum of p(θ|x) we observe that p(θ|x) = p(x|θ)p(θ)/p(x), where p(x) does not depend on θ, so an equivalent maximization is that of p(x|θ)p(θ). Hence, the MAP estimator is θ̂ = arg max over θ of p(x|θ)p(θ), or, equivalently, θ̂ = arg max over θ of [ln p(x|θ) + ln p(θ)].
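As a concrete sketch, the DC-level problem with the U[−A0, A0] prior from the earlier example can be solved by a grid maximization of ln p(x|A) + ln p(A). With a flat prior on the interval, the log-prior is constant over the support, so the MAP estimate is the sample mean clamped to [−A0, A0]; the values below are assumed for illustration.

```python
# Sketch: MAP estimation for the DC level with a U[-A0, A0] prior via a grid
# maximization of the log-posterior. With a flat prior on the interval, the MAP
# estimate equals clip(xbar, -A0, A0); the grid search confirms this.
import numpy as np

rng = np.random.default_rng(11)
A0, sigma, N, A_true = 1.0, 1.0, 10, 0.9

x = A_true + sigma * rng.standard_normal(N)
A_grid = np.linspace(-A0, A0, 20001)                       # support of the prior

# ln p(x|A) up to a constant; ln p(A) is constant on the grid, so it can be dropped
log_post = -np.sum((x[None, :] - A_grid[:, None])**2, axis=1) / (2 * sigma**2)
A_map = A_grid[np.argmax(log_post)]

print("sample mean  :", x.mean())
print("MAP estimate :", A_map)
print("clipped xbar :", np.clip(x.mean(), -A0, A0))
```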