0 Pattern Classification, Chapter 3 0 Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda,

Slides:



Advertisements
Similar presentations
Component Analysis (Review)
Advertisements

ECE 8443 – Pattern Recognition LECTURE 05: MAXIMUM LIKELIHOOD ESTIMATION Objectives: Discrete Features Maximum Likelihood Resources: D.H.S: Chapter 3 (Part.
CS479/679 Pattern Recognition Dr. George Bebis
2 – In previous chapters: – We could design an optimal classifier if we knew the prior probabilities P(wi) and the class- conditional probabilities P(x|wi)
0 Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John.
LECTURE 11: BAYESIAN PARAMETER ESTIMATION
Chapter 2: Bayesian Decision Theory (Part 2) Minimum-Error-Rate Classification Classifiers, Discriminant Functions and Decision Surfaces The Normal Density.
Pattern Classification, Chapter 2 (Part 2) 0 Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R.
Pattern Classification, Chapter 2 (Part 2) 0 Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R.
Pattern Classification Chapter 2 (Part 2)0 Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O.
Visual Recognition Tutorial
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.
0 Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John.
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.
Pattern Classification, Chapter 3 Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P.
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.
Hidden Markov Model: Extension of Markov Chains
1 lBayesian Estimation (BE) l Bayesian Parameter Estimation: Gaussian Case l Bayesian Parameter Estimation: General Estimation l Problems of Dimensionality.
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.
CHAPTER 4: Parametric Methods. Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1) 2 Parametric Estimation X = {
Introduction to Bayesian Parameter Estimation
CHAPTER 4: Parametric Methods. Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1) 2 Parametric Estimation Given.
Bayesian Estimation (BE) Bayesian Parameter Estimation: Gaussian Case
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.
Chapter 3 (part 1): Maximum-Likelihood & Bayesian Parameter Estimation  Introduction  Maximum-Likelihood Estimation  Example of a Specific Case  The.
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.
Probability of Error Feature vectors typically have dimensions greater than 50. Classification accuracy depends upon the dimensionality and the amount.
Principles of Pattern Recognition
ECE 8443 – Pattern Recognition LECTURE 06: MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION Objectives: Bias in ML Estimates Bayesian Estimation Example Resources:
Speech Recognition Pattern Classification. 22 September 2015Veton Këpuska2 Pattern Classification  Introduction  Parametric classifiers  Semi-parametric.
ECE 8443 – Pattern Recognition LECTURE 07: MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION Objectives: Class-Conditional Density The Multivariate Case General.
ECE 8443 – Pattern Recognition ECE 8423 – Adaptive Signal Processing Objectives: ML and Simple Regression Bias of the ML Estimate Variance of the ML Estimate.
Chapter 3 (part 2): Maximum-Likelihood and Bayesian Parameter Estimation Bayesian Estimation (BE) Bayesian Estimation (BE) Bayesian Parameter Estimation:
: Chapter 3: Maximum-Likelihood and Baysian Parameter Estimation 1 Montri Karnjanadecha ac.th/~montri.
ECE 8443 – Pattern Recognition LECTURE 08: DIMENSIONALITY, PRINCIPAL COMPONENTS ANALYSIS Objectives: Data Considerations Computational Complexity Overfitting.
Chapter 3: Maximum-Likelihood Parameter Estimation l Introduction l Maximum-Likelihood Estimation l Multivariate Case: unknown , known  l Univariate.
ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition LECTURE 07: BAYESIAN ESTIMATION (Cont.) Objectives:
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.
Chapter 20 Classification and Estimation Classification – Feature selection Good feature have four characteristics: –Discrimination. Features.
Over-fitting and Regularization Chapter 4 textbook Lectures 11 and 12 on amlbook.com.
1 Parameter Estimation Shyh-Kang Jeng Department of Electrical Engineering/ Graduate Institute of Communication/ Graduate Institute of Networking and Multimedia,
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.
Univariate Gaussian Case (Cont.)
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.
Pattern Classification All materials in these slides* were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.
Pattern Classification Chapter 2(Part 3) 0 Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O.
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.
Part 3: Estimation of Parameters. Estimation of Parameters Most of the time, we have random samples but not the densities given. If the parametric form.
Univariate Gaussian Case (Cont.)
Chapter 3: Maximum-Likelihood Parameter Estimation
LECTURE 09: BAYESIAN ESTIMATION (Cont.)
CH 5: Multivariate Methods
Pattern Classification, Chapter 3
Chapter 3: Maximum-Likelihood and Bayesian Parameter Estimation (part 2)
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John.
Course Outline MODEL INFORMATION COMPLETE INCOMPLETE
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John.
Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John.
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John.
LECTURE 09: BAYESIAN LEARNING
LECTURE 07: BAYESIAN ESTIMATION
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John.
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John.
Multivariate Methods Berlin Chen
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John.
Multivariate Methods Berlin Chen, 2005 References:
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John.
Chapter 3: Maximum-Likelihood and Bayesian Parameter Estimation (part 2)
Presentation transcript:

0 Pattern Classification, Chapter 3 0 Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000 with the permission of the authors and the publisher

1 lBayesian Estimation (BE) l Bayesian Parameter Estimation: Gaussian Case l Bayesian Parameter Estimation: General Estimation l Problems of Dimensionality l Computational Complexity l Component Analysis and Discriminants l Hidden Markov Models Chapter 3: Maximum-Likelihood and Bayesian Parameter Estimation (part 2)

2 Pattern Classification, Chapter 3 2 Bayesian Estimation (Bayesian learning to pattern classification problems) In MLE  was supposed fix In BE  is a random variable The computation of posterior probabilities P(  i | x) lies at the heart of Bayesian classification Goal: compute P(  i | x, D) Given the sample D, Bayes formula can be written 3

3 Pattern Classification, Chapter 3 3 To demonstrate the preceding equation, use: 3

4 Pattern Classification, Chapter 3 4 Bayesian Parameter Estimation: Gaussian Case Goal: Estimate  using the a-posteriori density P(  | D) The univariate case: P(  | D)  is the only unknown parameter (  0 and  0 are known!) 4

5 Pattern Classification, Chapter 3 5 But we know Plugging in their gaussian expressions and extracting out factors not depending on  yields: 4

6 Pattern Classification, Chapter 3 6 Observation: p(  |D) is an exponential of a quadratic It is again normal! It is called a reproducing density 4

7 Pattern Classification, Chapter 3 7 Identifying coefficients in the top equation with that of the generic Gaussian Yields expressions for  n and  n 2 4

8 Pattern Classification, Chapter 3 8 Solving for  n and  n 2 yields: From these equations we see as n increases: the variance decreases monotonically the estimate of p(  |D) becomes more peaked 4

9 Pattern Classification, Chapter 3 9 4

10 Pattern Classification, Chapter 3 10 The univariate case P(x | D) P(  | D) computed (in preceding discussion) P(x | D) remains to be computed! It provides: We know  2 and how to compute (Desired class-conditional density P(x | D j,  j )) Therefore: using P(x | D j,  j ) together with P(  j ) And using Bayes formula, we obtain the Bayesian classification rule: 4

11 Pattern Classification, Chapter Bayesian Parameter Estimation: General Theory P(x | D) computation can be applied to any situation in which the unknown density can be parameterized: the basic assumptions are: The form of P(x |  ) is assumed known, but the value of  is not known exactly Our knowledge about  is assumed to be contained in a known prior density P(  ) The rest of our knowledge  is contained in a set D of n random variables x 1, x 2, …, x n that follows P(x) 5

12 Pattern Classification, Chapter 3 12 The basic problem is: “Compute the posterior density P(  | D)” then “Derive P(x | D)”, where Using Bayes formula, we have: And by the independence assumption: 5

13 Pattern Classification, Chapter 3 13 Convergence (from notes)

14 Pattern Classification, Chapter 3 14 Problems of Dimensionality Problems involving 50 or 100 features are common (usually binary valued) Note: microarray data might entail ~20000 real-valued features Classification accuracy dependant on dimensionality amount of training data discrete vs continuous 7

15 Pattern Classification, Chapter 3 15 Case of two class multivariate normal with the same covariance P(x|  j ) ~N(  j,  ), j=1,2 Statistically independent features If the priors are equal then: 7

16 Pattern Classification, Chapter 3 16 If features are conditionally independent then: Do we remember what conditional independence is? Example for binary features: Let p i = Pr[x i =1|  1 ] then P(x|  1 ) is the product of the p i 7

17 Pattern Classification, Chapter 3 17 Most useful features are the ones for which the difference between the means is large relative to the standard deviation Doesn’t require independence Adding independent features helps increase r  reduce error Caution: adding features increases cost & complexity of feature extractor and classifier It has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance: we have the wrong model ! we don’t have enough training data to support the additional dimensions 7

18 Pattern Classification, Chapter

19 Pattern Classification, Chapter 3 19 Computational Complexity Our design methodology is affected by the computational difficulty “big oh” notation f(x) = O(h(x)) “big oh of h(x)” If: (An upper bound on f(x) grows no worse than h(x) for sufficiently large x!) f(x) = 2+3x+4x 2 g(x) = x 2 f(x) = O(x 2 ) 7

20 Pattern Classification, Chapter 3 20 “big oh” is not unique! f(x) = O(x 2 ); f(x) = O(x 3 ); f(x) = O(x 4 ) “big theta” notation f(x) =  (h(x)) If: f(x) =  (x 2 ) but f(x)   (x 3 ) 7

21 Pattern Classification, Chapter 3 21 Complexity of the ML Estimation Gaussian priors in d dimensions classifier with n training samples for each of c classes For each category, we have to compute the discriminant function Total = O(d 2 n) Total for c classes = O(cd 2 n)  O(d 2 n) Cost increase when d and n are large! 7

22 Pattern Classification, Chapter 3 22 Overfitting Dimensionality of model vs size of training data Issue: not enough data to support the model Possible solutions: Reduce model dimensionality Make (possibly incorrect) assumptions to better estimate 

23 Pattern Classification, Chapter 3 23 Overfitting Estimate better  use data pooled from all classes normalization issues use pseudo-Bayesian form 0 + (1- )  n “doctor”  by thresholding entries reduces chance correlations assume statistical independence zero all off-diagonal elements

24 Pattern Classification, Chapter 3 24 Shrinkage Shrinkage: weighted combination of common and individual covariances We can also shrink the estimate common covariances toward the identity matrix

25 Pattern Classification, Chapter 3 25 Component Analysis and Discriminants Combine features in order to reduce the dimension of the feature space Linear combinations are simple to compute and tractable Project high dimensional data onto a lower dimensional space Two classical approaches for finding “optimal” linear transformation PCA (Principal Component Analysis) “Projection that best represents the data in a least- square sense” MDA (Multiple Discriminant Analysis) “Projection that best separates the data in a least-squares sense” 8

26 Pattern Classification, Chapter 3 26 PCA (from notes)

27 Pattern Classification, Chapter 3 27 Hidden Markov Models: Markov Chains Goal: make a sequence of decisions Processes that unfold in time, states at time t are influenced by a state at time t-1 Applications: speech recognition, gesture recognition, parts of speech tagging and DNA sequencing, Any temporal process without memory  T = {  (1),  (2),  (3), …,  (T)} sequence of states We might have  6 = {  1,  4,  2,  2,  1,  4} The system can revisit a state at different steps and not every state need to be visited 10

28 Pattern Classification, Chapter 3 28 First-order Markov models Our productions of any sequence is described by the transition probabilities P(  j (t + 1) |  i (t)) = a ij 10

29 Pattern Classification, Chapter

30 Pattern Classification, Chapter 3 30  = (a ij,  T ) P(  T |  ) = a 14. a 42. a 22. a 21. a 14. P(  (1) =  i ) Example: speech recognition “production of spoken words” Production of the word: “pattern” represented by phonemes /p/ /a/ /tt/ /er/ /n/ // ( // = silent state) Transitions from /p/ to /a/, /a/ to /tt/, /tt/ to er/, /er/ to /n/ and /n/ to a silent state 10