1
1 Parameter Estimation Shyh-Kang Jeng Department of Electrical Engineering/ Graduate Institute of Communication/ Graduate Institute of Networking and Multimedia, National Taiwan University
2
2 Typical Classification Problem We rarely know the complete probabilistic structure of the problem We usually have only vague, general knowledge We do have a number of design samples (training data) representative of the patterns to be classified We must find some way to use this information to design or train the classifier
3
3 Estimating Probabilities It is not difficult to estimate the prior probabilities It is hard to estimate the class-conditional densities –The number of available samples always seems too small –The problem is serious when the dimensionality is large
4
4 Estimating Parameters Many problems permit us to parameterize the conditional densities This simplifies the problem from one of estimating an unknown function to one of estimating a set of parameters –e.g., the mean vector and covariance matrix of a multivariate normal distribution
5
5 Maximum-Likelihood Estimation Views the parameters as quantities whose values are fixed but unknown The best estimate is the one that maximizes the probability of obtaining the samples actually observed Nearly always has good convergence properties as the number of samples increases Often simpler than alternative methods
6
6 I.I.D. Random Variables Separate the data into D_1,..., D_c Samples in D_j are drawn independently according to p(x|ω_j) Such samples are independent and identically distributed (i.i.d.) random variables Assume p(x|ω_j) has a known parametric form and is determined uniquely by a parameter vector θ_j, i.e., p(x|ω_j) = p(x|θ_j, ω_j)
7
7 Simplification Assumptions Samples in D_i give no information about θ_j if i ≠ j We can therefore work with each class separately We have c separate problems of the same form –Use a set D of i.i.d. samples drawn from p(x|θ) to estimate the unknown parameter vector θ
8
8 Maximum-likelihood Estimate
9
9 Maximum-likelihood Estimation
10
10 A Note The likelihood p(D|θ), viewed as a function of θ, is not a probability density function of θ Its area over the θ-domain has no significance The likelihood p(D|θ) can be regarded as the probability of D for a given θ
11
11 Analytical Approach
12
12 MAP Estimators
13
13 Gaussian Case: Unknown μ
14
14 Univariate Gaussian Case: Unknown μ and σ²
15
15 Multivariate Gaussian Case: Unknown μ and Σ
16
16 Bias, Absolutely Unbiased, and Asymptotically Unbiased
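The estimate formulas on slides 13-16 were figures that did not survive the transcript. As a hedged sketch of the standard results, the snippet below (all function and variable names are my own, not the slides') computes the ML mean and the biased (1/n) covariance estimate for a multivariate Gaussian, alongside the unbiased (1/(n-1)) sample covariance that slide 16's bias discussion refers to.

```python
import numpy as np

def ml_gaussian_estimates(X):
    """ML estimates for a multivariate Gaussian from an (n, d) sample matrix X.

    mu_hat    = (1/n) * sum_k x_k
    Sigma_hat = (1/n) * sum_k (x_k - mu_hat)(x_k - mu_hat)^T   (biased)
    """
    n = X.shape[0]
    mu_hat = X.mean(axis=0)
    centered = X - mu_hat
    sigma_ml = centered.T @ centered / n               # biased ML estimate
    sigma_unbiased = centered.T @ centered / (n - 1)   # unbiased sample covariance
    return mu_hat, sigma_ml, sigma_unbiased

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.5], [0.5, 1.0]], size=500)
    mu_hat, sigma_ml, sigma_unb = ml_gaussian_estimates(X)
    print(mu_hat)
    print(sigma_ml)    # E[sigma_ml] = (n-1)/n * Sigma: asymptotically unbiased
    print(sigma_unb)
```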
17
17 Model Error For a reliable model, the ML classifier can give excellent results If the model is wrong, the ML classifier cannot be guaranteed to give the best results, even among the assumed set of models
18
18 Bayesian Estimation (Bayesian Learning) The answers obtained are in general nearly identical to those given by maximum likelihood Basic conceptual difference –The parameter vector θ is regarded as a random variable –The training data convert a prior distribution on this variable into a posterior probability density
19
19 Central Problem
20
20 Parameter Distribution Assume p(x) has a known parametric form with a parameter vector θ of unknown value Thus, p(x|θ) is completely known Information about θ prior to observing the samples is contained in a known prior density p(θ) Observation of the samples converts p(θ) to p(θ|D) –which should be sharply peaked about the true value of θ
21
21 Parameter Distribution
22
22 Univariate Gaussian Case: p(μ|D)
23
23 Reproducing Density
24
24 Bayesian Learning
25
25 Dogmatism
26
26 Univariate Gaussian Case: p(x|D)
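Slides 22-26 treat the univariate Gaussian case with known variance σ² and a Gaussian prior N(μ0, σ0²), but the equations themselves are not reproduced in this transcript. The sketch below implements the standard closed-form results, p(μ|D) ~ N(μ_n, σ_n²) and the predictive density p(x|D) ~ N(μ_n, σ² + σ_n²); names and the toy data are illustrative.

```python
import numpy as np

def bayes_univariate_gaussian(x, mu0, sigma0_sq, sigma_sq):
    """Posterior p(mu|D) and predictive p(x|D) for a univariate Gaussian
    with known variance sigma_sq and a Gaussian prior N(mu0, sigma0_sq)."""
    n = len(x)
    xbar = np.mean(x)                                   # ML estimate of the mean
    denom = n * sigma0_sq + sigma_sq
    mu_n = (n * sigma0_sq / denom) * xbar + (sigma_sq / denom) * mu0
    sigma_n_sq = sigma0_sq * sigma_sq / denom
    # p(mu|D) ~ N(mu_n, sigma_n_sq);  p(x|D) ~ N(mu_n, sigma_sq + sigma_n_sq)
    return mu_n, sigma_n_sq, sigma_sq + sigma_n_sq

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = rng.normal(loc=2.0, scale=1.0, size=50)
    print(bayes_univariate_gaussian(data, mu0=0.0, sigma0_sq=10.0, sigma_sq=1.0))
```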
27
27 Multivariate Gaussian Case
28
28 Multivariate Gaussian Case
29
29 Multivariate Bayesian Learning
30
30 General Bayesian Estimation
31
31 Recursive Bayesian Learning
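Recursive Bayesian learning updates the posterior one sample at a time, p(θ|D^n) ∝ p(x_n|θ) p(θ|D^(n-1)) with p(θ|D^0) = p(θ). The snippet below is a minimal sketch of this recursion on a discretized θ grid; the grid, the Gaussian likelihood, and all names are illustrative assumptions, not the slides' Example 1.

```python
import numpy as np

def recursive_bayes(samples, theta_grid, prior, likelihood):
    """Recursive Bayesian learning on a discretized parameter grid:
    p(theta|D^n) ∝ p(x_n|theta) * p(theta|D^{n-1}), with p(theta|D^0) = p(theta)."""
    posterior = prior.copy()
    for x in samples:
        posterior *= likelihood(x, theta_grid)   # fold in the new sample
        posterior /= posterior.sum()             # renormalize on the grid
    return posterior

if __name__ == "__main__":
    # Illustrative setup: estimate the mean of a unit-variance Gaussian.
    theta_grid = np.linspace(-5.0, 5.0, 1001)
    prior = np.ones_like(theta_grid) / len(theta_grid)          # flat prior
    gauss = lambda x, th: np.exp(-0.5 * (x - th) ** 2)
    rng = np.random.default_rng(2)
    post = recursive_bayes(rng.normal(1.5, 1.0, size=100), theta_grid, prior, gauss)
    print(theta_grid[np.argmax(post)])   # posterior peaks near the true mean 1.5
```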
32
32 Example 1: Recursive Bayes Learning
33
33 Example 1: Recursive Bayes Learning
34
34 Example 1: Bayes vs. ML
35
35 Identifiability When p(x|θ) is identifiable –The sequence of posterior densities p(θ|D^n) converges to a delta function –Only one θ causes p(x|θ) to fit the data On some occasions, more than one value of θ may yield the same p(x|θ) –p(θ|D^n) will peak near all values of θ that explain the data –The ambiguity is erased in the integration for p(x|D^n), which converges to p(x) whether or not p(x|θ) is identifiable
36
36 ML vs. Bayes Methods Computational complexity Interpretability Confidence in prior information –Form of the underlying distribution p(x|θ) Results differ when p(θ|D) is broad or asymmetric around the estimated θ –Bayes methods would exploit such information whereas ML would not
37
37 Classification Errors Bayes or indistinguishability error Model error Estimation error –Parameters are estimated from a finite sample –Vanishes in the limit of infinite training data (ML and Bayes would have the same total classification error)
38
38 Invariance and Non-informative Priors Guidance in creating priors Invariance –Translation invariance –Scale invariance Non-informative with respect to an invariance –Much better than accommodating arbitrary transformation in a MAP estimator –Of great use in Bayesian estimation
39
39 Gibbs Algorithm
40
40 Sufficient Statistics Statistic –Any function of the samples Sufficient statistic s of samples D –s contains all information relevant to estimating some parameter θ –Definition: p(D|s, θ) is independent of θ –If θ can be regarded as a random variable, sufficiency implies p(θ|s, D) = p(θ|s)
41
41 Factorization Theorem A statistic s is sufficient for θ if and only if P(D|θ) can be written as the product P(D|θ) = g(s, θ) h(D) for some functions g(·,·) and h(·)
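As a concrete instance of the theorem (the slides' own multivariate Gaussian example on slide 42 is not reproduced here), consider a univariate Gaussian with known variance σ². The likelihood factors as shown below, exhibiting s = Σ_k x_k as a sufficient statistic for μ:

```latex
p(D\mid\mu)
 = \prod_{k=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}
   \exp\!\Big(-\frac{(x_k-\mu)^2}{2\sigma^2}\Big)
 = \underbrace{\exp\!\Big(\frac{\mu}{\sigma^2}\sum_{k=1}^{n}x_k
                          -\frac{n\mu^2}{2\sigma^2}\Big)}_{g(s,\,\mu),\;\; s=\sum_k x_k}
   \;\underbrace{(2\pi\sigma^2)^{-n/2}
   \exp\!\Big(-\frac{1}{2\sigma^2}\sum_{k=1}^{n}x_k^2\Big)}_{h(D)}
```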
42
42 Example: Multivariate Gaussian
43
43 Proof of Factorization Theorem: The “Only if” Part
44
44 Proof of Factorization Theorem: The “if” Part
45
45 Kernel Density The factoring of P(D|θ) into g(s, θ)h(D) is not unique –If f(s) is any function, g'(s, θ) = f(s)g(s, θ) and h'(D) = h(D)/f(s) are equivalent factors The ambiguity is removed by defining the kernel density, which is invariant to such scaling
46
46 Example: Multivariate Gaussian
47
47 Kernel Density and Parameter Estimation Maximum-likelihood –Maximization of g(s, θ) Bayesian –If prior knowledge of θ is vague, p(θ) tends to be uniform, and p(θ|D) is approximately the same as the kernel density –If p(x|θ) is identifiable, g(s, θ) peaks sharply at some value of θ, and p(θ) is continuous and non-zero there, then p(θ|D) approaches the kernel density
48
48 Sufficient Statistics for Exponential Family
49
49 Error Rate and Dimensionality
50
50 Accuracy and Dimensionality
51
51 Effects of Additional Features In practice, beyond a certain point, inclusion of additional features leads to worse rather than better performance Sources of difficulty –Wrong models –Number of design or training samples is finite and thus the distributions are not estimated accurately
52
52 Computational Complexity for Maximum-Likelihood Estimation
53
53 Computational Complexity for Classification
54
54 Approaches for Inadequate Samples Reduce dimensionality –Redesign the feature extractor –Select an appropriate subset of features –Combine the existing features –Pool the available data by assuming all classes share the same covariance matrix Look for a better estimate for Σ –Use a Bayesian estimate and a diagonal Σ_0 –Threshold the sample covariance matrix –Assume statistical independence
55
55 Shrinkage (Regularized Discriminant Analysis)
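The shrinkage equations on slide 55 are not reproduced in this transcript. The sketch below shows one common regularized-discriminant-analysis form: shrink each class covariance toward the pooled covariance with a parameter α, and optionally shrink the result toward a scaled identity with β. Treat the exact formulas and all names as assumptions rather than the slide's own definition.

```python
import numpy as np

def shrink_covariance(sigma_i, n_i, sigma_pooled, n, alpha):
    """Shrink a class covariance toward the pooled covariance
    (one common regularized-discriminant-analysis form; alpha in [0, 1])."""
    num = (1.0 - alpha) * n_i * sigma_i + alpha * n * sigma_pooled
    den = (1.0 - alpha) * n_i + alpha * n
    return num / den

def shrink_to_identity(sigma, beta):
    """Further shrink a covariance toward a scaled identity (beta in [0, 1])."""
    d = sigma.shape[0]
    return (1.0 - beta) * sigma + beta * (np.trace(sigma) / d) * np.eye(d)

if __name__ == "__main__":
    sigma_i = np.array([[2.0, 0.8], [0.8, 1.0]])   # class covariance (illustrative)
    sigma_p = np.array([[1.5, 0.2], [0.2, 1.2]])   # pooled covariance (illustrative)
    print(shrink_covariance(sigma_i, n_i=20, sigma_pooled=sigma_p, n=200, alpha=0.3))
    print(shrink_to_identity(sigma_p, beta=0.1))
```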
56
56 Concept of Overfitting
57
57 Best Representative Point
58
58 Projection Along a Line
59
59 Best Projection to a Line Through the Sample Mean
60
60 Best Representative Direction
61
61 Principal Component Analysis (PCA)
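The derivation on slides 57-61 (best representative point, best projection direction) appears here only through its titles. As a sketch of the standard result, the principal components are the eigenvectors of the scatter (or covariance) matrix with the largest eigenvalues; the function below is illustrative, not the slides' notation.

```python
import numpy as np

def pca(X, m):
    """Project the (n, d) data X onto the m leading principal components,
    i.e. the eigenvectors of the scatter matrix with the largest eigenvalues."""
    mean = X.mean(axis=0)
    centered = X - mean
    scatter = centered.T @ centered                   # scatter matrix S
    eigvals, eigvecs = np.linalg.eigh(scatter)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:m]             # m largest
    components = eigvecs[:, order]                    # (d, m) projection matrix
    return mean, components, centered @ components    # projected coordinates

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    X = rng.multivariate_normal([0, 0, 0], np.diag([5.0, 1.0, 0.1]), size=300)
    mean, W, Z = pca(X, m=2)
    print(W.shape, Z.shape)   # (3, 2) (300, 2)
```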
62
62 Concept of Fisher Linear Discriminant
63
63 Fisher Linear Discriminant Analysis
64
64 Fisher Linear Discriminant Analysis
65
65 Fisher Linear Discriminant Analysis
66
66 Fisher Linear Discriminant Analysis for Multivariate Normal
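The Fisher discriminant equations on slides 62-66 are not reproduced. For two classes, the standard solution maximizing the ratio of between-class to within-class scatter is w ∝ S_W^(-1)(m1 - m2); the sketch below implements that result with illustrative names and toy data.

```python
import numpy as np

def fisher_lda(X1, X2):
    """Two-class Fisher linear discriminant: w ∝ S_W^{-1} (m1 - m2),
    where S_W is the within-class scatter matrix."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    Sw = S1 + S2                              # within-class scatter
    w = np.linalg.solve(Sw, m1 - m2)          # optimal projection direction
    return w / np.linalg.norm(w)

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    X1 = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], size=100)
    X2 = rng.multivariate_normal([2, 1], [[1, 0.3], [0.3, 1]], size=100)
    w = fisher_lda(X1, X2)
    print(w)
    print((X1 @ w).mean(), (X2 @ w).mean())   # well-separated projected class means
```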
67
67 Concept of Multidimensional Discriminant Analysis
68
68 Multiple Discriminant Analysis
69
69 Multiple Discriminant Analysis
70
70 Multiple Discriminant Analysis
71
71 Multiple Discriminant Analysis
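For the multi-class case outlined on slides 67-71, the standard construction takes the eigenvectors of S_W^(-1) S_B with the largest eigenvalues (at most c - 1 of them) as the columns of the projection matrix W. The following is a hedged sketch of that computation; the function name and the toy data are illustrative.

```python
import numpy as np

def multiple_discriminant_analysis(class_samples):
    """Multi-class discriminant analysis: use the eigenvectors of S_W^{-1} S_B
    with the largest eigenvalues (at most c - 1) as projection directions."""
    all_X = np.vstack(class_samples)
    m = all_X.mean(axis=0)
    d = all_X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for X in class_samples:
        mi = X.mean(axis=0)
        Sw += (X - mi).T @ (X - mi)            # within-class scatter
        diff = (mi - m).reshape(-1, 1)
        Sb += X.shape[0] * (diff @ diff.T)     # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1][: len(class_samples) - 1]
    return eigvecs.real[:, order]              # (d, c-1) projection matrix W

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    classes = [rng.normal(loc=loc, scale=1.0, size=(50, 3)) for loc in (0.0, 2.0, 4.0)]
    W = multiple_discriminant_analysis(classes)
    print(W.shape)   # (3, 2)
```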
72
72 Expectation-Maximization (EM) Finding the maximum-likelihood estimate of the parameters of an underlying distribution –from a given data set when the data is incomplete or has missing values Two main applications –When the data indeed has missing values –When optimizing the likelihood function is analytically intractable but when the likelihood function can be simplified by assuming the existence of and values for additional but missing (or hidden) parameters
73
73 Expectation-Maximization (EM) Full sample D = {x_1,..., x_n}, with x_k = {x_kg, x_kb} Separate the individual features into observed (good) features D_g and missing (bad) features D_b –D is the union of D_g and D_b Form the function Q(θ; θ^i), the expected log-likelihood of the full data given D_g and the current estimate θ^i
74
74 Expectation-Maximization (EM)
begin initialize θ^0, T, i ← 0
do i ← i + 1
E step: compute Q(θ; θ^i)
M step: θ^(i+1) ← arg max_θ Q(θ; θ^i)
until Q(θ^(i+1); θ^i) − Q(θ^i; θ^(i−1)) ≤ T
return θ^(i+1)
end
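The 2D example on slides 76-79 is not reproduced here. As a generic illustration of the E-step/M-step loop above, the sketch below runs EM for a two-component univariate Gaussian mixture, a common instance of the algorithm; it is not the slides' example, and all names are illustrative.

```python
import numpy as np

def em_two_gaussians(x, iters=50):
    """Generic EM illustration: fit a two-component 1D Gaussian mixture.
    E step: responsibilities r; M step: re-estimate weights, means, variances."""
    rng = np.random.default_rng(0)
    mu = rng.choice(x, size=2, replace=False)       # initial means
    var = np.array([x.var(), x.var()])
    w = np.array([0.5, 0.5])
    for _ in range(iters):
        # E step: posterior probability that each point came from each component
        dens = np.stack([w[j] / np.sqrt(2 * np.pi * var[j])
                         * np.exp(-0.5 * (x - mu[j]) ** 2 / var[j]) for j in range(2)])
        r = dens / dens.sum(axis=0)
        # M step: re-estimate the parameters from the weighted data
        n_j = r.sum(axis=1)
        w = n_j / len(x)
        mu = (r * x).sum(axis=1) / n_j
        var = (r * (x - mu[:, None]) ** 2).sum(axis=1) / n_j
    return w, mu, var

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
    print(em_two_gaussians(data))
```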
75
75 Expectation-Maximization (EM)
76
76 Example: 2D Model
77
77 Example: 2D Model
78
78 Example: 2D Model
79
79 Example: 2D Model
80
80 Generalized Expectation-Maximization (GEM) Instead of maximizing Q(θ; θ^i), we find some θ^(i+1) such that Q(θ^(i+1); θ^i) > Q(θ^i; θ^i), which is also guaranteed to converge Convergence will not be as rapid Offers great freedom to choose computationally simpler steps –e.g., using the maximum-likelihood value of the unknown (missing) values, if it leads to a greater likelihood
81
81 Hidden Markov Model (HMM) Used for problems of making a series of decisions –e.g., speech or gesture recognition Problems in which the state at time t is influenced directly by the state at t-1 Further reference: –L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993, Chapter 6.
82
82 First Order Markov Models
83
83 First Order Hidden Markov Models
84
84 Hidden Markov Model Probabilities
85
85 Hidden Markov Model Computation Evaluation problem –Given a_ij and b_jk, determine P(V^T|θ) Decoding problem –Given V^T, determine the most likely sequence of hidden states that leads to V^T Learning problem –Given training observations of visible symbols and the coarse structure of the model but not the probabilities, determine a_ij and b_jk
86
86 Evaluation
87
87 HMM Forward
88
88 HMM Forward and Trellis
89
89 HMM Forward
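The forward equations on slides 86-89 appear here only through their titles. The standard forward recursion for the evaluation problem is α_j(t) = b_j(v_t) Σ_i α_i(t-1) a_ij, with P(V^T) obtained by summing the final α values. The sketch below assumes a simple initial-distribution formulation (π) rather than an explicit initial/final state, so details may differ from the slides; all names are illustrative.

```python
import numpy as np

def hmm_forward(A, B, pi, obs):
    """Forward algorithm for the evaluation problem.
    A: (c, c) transition probs a_ij, B: (c, k) emission probs b_jk,
    pi: (c,) initial state probs, obs: sequence of observed symbol indices."""
    alpha = pi * B[:, obs[0]]             # t = 0
    for v in obs[1:]:
        alpha = B[:, v] * (alpha @ A)     # alpha_j(t) = b_j(v_t) * sum_i alpha_i(t-1) a_ij
    return alpha.sum()                    # P(V^T | model)

if __name__ == "__main__":
    A = np.array([[0.7, 0.3], [0.4, 0.6]])   # illustrative 2-state model
    B = np.array([[0.9, 0.1], [0.2, 0.8]])
    pi = np.array([0.6, 0.4])
    print(hmm_forward(A, B, pi, [0, 1, 1, 0]))
```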
90
90 HMM Backward
91
91 HMM Backward
92
92 Example 3: Hidden Markov Model
93
93 Example 3: Hidden Markov Model
94
94 Example 3: Hidden Markov Model
95
95 Left-to-Right Models for Speech
96
96 HMM Decoding
97
97 Problem of Local Optimization This decoding algorithm depends only on the single previous time step, not on the full sequence It does not guarantee that the path is indeed allowable
98
98 HMM Decoding
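The decoding slides (96-99) are not reproduced, and slide 97 notes that a purely local choice of states need not yield an allowable path. The sketch below is the standard Viterbi decoder, a well-known alternative that keeps the best full path into each state at every step; it is offered as a reference implementation of decoding, not necessarily the slides' own algorithm, and all names are illustrative.

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Viterbi decoding: most likely hidden-state sequence for the observations.
    Keeps, for every state, the log-probability of the best full path reaching it."""
    c, T = A.shape[0], len(obs)
    delta = np.log(pi) + np.log(B[:, obs[0]])     # best path score ending in each state
    back = np.zeros((T, c), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)       # scores[i, j]: best path via i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                 # backtrack through the pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

if __name__ == "__main__":
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.9, 0.1], [0.2, 0.8]])
    pi = np.array([0.6, 0.4])
    print(viterbi(A, B, pi, [0, 0, 1, 1, 0]))
```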
99
99 Example 4: HMM Decoding
100
100 Forward-Backward Algorithm Determines the model parameters a_ij and b_jk from an ensemble of training samples An instance of a generalized expectation-maximization algorithm There is no known method for obtaining the optimal or most likely set of parameters from the data
101
101 Probability of Transition
102
102 Improved Estimate for a_ij
103
103 Improved Estimate for b_jk
104
104 Forward-Backward Algorithm (Baum-Welch Algorithm)
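The re-estimation formulas on slides 101-104 are not reproduced here. The sketch below performs one standard Baum-Welch (forward-backward) re-estimation pass: it computes the forward and backward variables, the transition posteriors, and updated a_ij and b_jk. Indexing is 0-based and all names are illustrative, so details may differ from the slides' notation.

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch re-estimation pass for a_ij and b_jk.
    Standard update: xi_t(i,j) ∝ alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j)."""
    c, T = A.shape[0], len(obs)
    alpha = np.zeros((T, c))
    beta = np.zeros((T, c))
    alpha[0] = pi * B[:, obs[0]]                        # forward pass
    for t in range(1, T):
        alpha[t] = B[:, obs[t]] * (alpha[t - 1] @ A)
    beta[T - 1] = 1.0                                   # backward pass
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[T - 1].sum()                          # P(V^T | current model)
    gamma = alpha * beta / p_obs                        # gamma[t, j]: P(state j at t | V^T)
    xi = np.zeros((T - 1, c, c))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
    xi /= p_obs
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # improved a_ij
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):                                # improved b_jk
        B_new[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return A_new, B_new

if __name__ == "__main__":
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.9, 0.1], [0.2, 0.8]])
    pi = np.array([0.6, 0.4])
    print(baum_welch_step(A, B, pi, [0, 1, 1, 0, 0, 1]))
```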