ECE 8443 – Pattern Recognition
ECE 8527 – Introduction to Machine Learning and Pattern Recognition

LECTURE 07: BAYESIAN ESTIMATION (Cont.)

Objectives: Bayesian Estimation; Example

Resources:
D.H.S.: Chapter 3 (Part 2)
J.O.S.: Bayesian Parameter Estimation
A.K.: The Holy Trinity
A.E.: Bayesian Methods
J.H.: Euro Coin

ECE 8527: Lecture 07, Slide 1
Univariate Gaussian Case

Case: only the mean is unknown: $p(x|\mu) \sim N(\mu, \sigma^2)$, with $\sigma^2$ known.

Known prior density: $p(\mu) \sim N(\mu_0, \sigma_0^2)$, with $\mu_0$ and $\sigma_0^2$ known.

Using Bayes formula:
$$p(\mu|D) = \frac{p(D|\mu)\,p(\mu)}{\int p(D|\mu)\,p(\mu)\,d\mu} = \alpha \prod_{k=1}^{n} p(x_k|\mu)\,p(\mu)$$

Rationale: once a value of μ is known, the density for x is completely known. α is a normalization factor that depends on the data, D.

ECE 8527: Lecture 07, Slide 2
Univariate Gaussian Case (Cont.)

Substituting the Gaussian forms:
$$p(\mu|D) = \alpha \prod_{k=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{1}{2}\left(\frac{x_k-\mu}{\sigma}\right)^2\right] \frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\!\left[-\frac{1}{2}\left(\frac{\mu-\mu_0}{\sigma_0}\right)^2\right]$$

Rearrange terms so that the dependencies on μ are clear:
$$p(\mu|D) = \alpha' \exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]$$

Since this is the exponential of a quadratic in μ, the posterior is again normal: p(μ|D) ~ N(μ_n, σ_n²). Associate terms related to σ² and μ by equating the coefficients of μ² and μ:
$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \qquad\qquad \frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\,\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}$$
where $\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$ is the sample mean.

There is actually a third equation involving the terms not related to μ, but we can ignore it since it is not a function of μ and is a complicated equation to solve.

ECE 8527: Lecture 07, Slide 3
Univariate Gaussian Case (Cont.)

Two equations and two unknowns. Solve for μ_n and σ_n². First, solve for σ_n²:
$$\sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$

Next, solve for μ_n:
$$\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\hat{\mu}_n + \left(\frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\right)\mu_0$$

Summarizing: μ_n is a weighted combination of the sample mean and the prior mean (the weights are nonnegative and sum to one), and σ_n² decreases monotonically as n grows.
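A minimal Python sketch of these update equations (illustrative only; the function name and example values are not from the slides):

```python
import numpy as np

def posterior_mean_var(x, sigma2, mu0, sigma0_2):
    """Posterior N(mu_n, sigma_n^2) over the unknown mean of a Gaussian
    with known variance sigma2, given the prior N(mu0, sigma0_2) and data x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    sample_mean = x.mean()
    # sigma_n^2 = sigma0^2 sigma^2 / (n sigma0^2 + sigma^2)
    sigma_n2 = (sigma0_2 * sigma2) / (n * sigma0_2 + sigma2)
    # mu_n is a weighted combination of the sample mean and the prior mean
    mu_n = (n * sigma0_2 / (n * sigma0_2 + sigma2)) * sample_mean \
         + (sigma2 / (n * sigma0_2 + sigma2)) * mu0
    return mu_n, sigma_n2

# Example: true mean 2.0, known variance 1.0, prior centered at 0 with variance 4.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=10)
print(posterior_mean_var(data, sigma2=1.0, mu0=0.0, sigma0_2=4.0))
```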

ECE 8527: Lecture 07, Slide 4
Bayesian Learning

μ_n represents our best guess after n samples. σ_n² represents our uncertainty about this guess.
σ_n² approaches σ²/n for large n – each additional observation decreases our uncertainty.
The posterior, p(μ|D), becomes more sharply peaked as n grows large. This is known as Bayesian learning.

ECE 8527: Lecture 07, Slide 5
Class-Conditional Density

How do we obtain p(x|D)? (The derivation is tedious.)
$$p(x|D) = \int p(x|\mu)\,p(\mu|D)\,d\mu$$
where $p(x|\mu) \sim N(\mu, \sigma^2)$ and $p(\mu|D) \sim N(\mu_n, \sigma_n^2)$.

Note that the result is again Gaussian:
$$p(x|D) \sim N(\mu_n, \sigma^2 + \sigma_n^2)$$
The conditional mean, μ_n, is treated as the true mean, and the known variance is inflated by σ_n² to account for the remaining uncertainty about μ. p(x|D) and P(ω_j) can be used to design the classifier.
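A self-contained snippet showing how this predictive density would be used (the numeric posterior parameters are placeholders; in practice they come from the update above):

```python
import numpy as np
from scipy.stats import norm

# Class-conditional (predictive) density p(x|D) = N(mu_n, sigma^2 + sigma_n^2).
mu_n, sigma_n2 = 1.8, 0.09      # example posterior parameters
sigma2 = 1.0                    # known data variance
predictive = norm(loc=mu_n, scale=np.sqrt(sigma2 + sigma_n2))
print(predictive.pdf(2.0))      # density assigned to a new observation x = 2.0
```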

ECE 8527: Lecture 07, Slide 6
Multivariate Case

Assume:
$$p(\mathbf{x}|\boldsymbol{\mu}) \sim N(\boldsymbol{\mu}, \Sigma) \qquad p(\boldsymbol{\mu}) \sim N(\boldsymbol{\mu}_0, \Sigma_0)$$
where Σ, μ_0, and Σ_0 are assumed to be known.

Applying Bayes formula:
$$p(\boldsymbol{\mu}|D) = \alpha \prod_{k=1}^{n} p(\mathbf{x}_k|\boldsymbol{\mu})\,p(\boldsymbol{\mu})$$
which has the form:
$$p(\boldsymbol{\mu}|D) \sim N(\boldsymbol{\mu}_n, \Sigma_n)$$
Once again the posterior has the same form as the prior, and we have a reproducing density.

ECE 8527: Lecture 07, Slide 7
Estimation Equations

Equating coefficients between the two Gaussians:
$$\Sigma_n^{-1} = n\Sigma^{-1} + \Sigma_0^{-1} \qquad\qquad \Sigma_n^{-1}\boldsymbol{\mu}_n = n\Sigma^{-1}\hat{\boldsymbol{\mu}}_n + \Sigma_0^{-1}\boldsymbol{\mu}_0$$
where $\hat{\boldsymbol{\mu}}_n = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$ is the sample mean.

The solution to these equations is:
$$\boldsymbol{\mu}_n = \Sigma_0\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\hat{\boldsymbol{\mu}}_n + \tfrac{1}{n}\Sigma\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\boldsymbol{\mu}_0$$
$$\Sigma_n = \Sigma_0\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\tfrac{1}{n}\Sigma$$

It also follows that:
$$p(\mathbf{x}|D) \sim N(\boldsymbol{\mu}_n, \Sigma + \Sigma_n)$$
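A small NumPy sketch of these multivariate updates (illustrative; the function name and example data are not from the slides):

```python
import numpy as np

def multivariate_posterior(X, Sigma, mu0, Sigma0):
    """Posterior N(mu_n, Sigma_n) over the unknown mean of a multivariate Gaussian
    with known covariance Sigma, prior N(mu0, Sigma0), and data X of shape (n, d)."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    n = X.shape[0]
    sample_mean = X.mean(axis=0)
    A = np.linalg.inv(Sigma0 + Sigma / n)            # (Sigma0 + (1/n) Sigma)^{-1}
    mu_n = Sigma0 @ A @ sample_mean + (Sigma / n) @ A @ mu0
    Sigma_n = Sigma0 @ A @ (Sigma / n)
    return mu_n, Sigma_n

# Example with 2-D data; the predictive covariance is Sigma + Sigma_n.
rng = np.random.default_rng(1)
Sigma = np.eye(2)
X = rng.multivariate_normal(mean=[1.0, -1.0], cov=Sigma, size=25)
mu_n, Sigma_n = multivariate_posterior(X, Sigma, mu0=np.zeros(2), Sigma0=4.0 * np.eye(2))
print(mu_n)
print(Sigma + Sigma_n)    # covariance of p(x|D)
```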

ECE 8527: Lecture 07, Slide 8
General Theory

The p(x|D) computation can be applied to any situation in which the unknown density can be parameterized. The basic assumptions are:
 The form of p(x|θ) is assumed known, but the value of θ is not known exactly.
 Our knowledge about θ is assumed to be contained in a known prior density p(θ).
 The rest of our knowledge about θ is contained in a set D of n random variables x_1, x_2, …, x_n drawn independently according to the unknown probability density function p(x).

ECE 8527: Lecture 07, Slide 9
Formal Solution

The desired density is obtained by integrating over the posterior:
$$p(x|D) = \int p(x|\theta)\,p(\theta|D)\,d\theta$$
Using Bayes formula, we can write the posterior p(θ|D) as:
$$p(\theta|D) = \frac{p(D|\theta)\,p(\theta)}{\int p(D|\theta)\,p(\theta)\,d\theta}$$
and by the independence assumption:
$$p(D|\theta) = \prod_{k=1}^{n} p(x_k|\theta)$$
This constitutes the formal solution to the problem because we have an expression for the probability of the data given the parameters.

ECE 8527: Lecture 07, Slide 10
Comparison to Maximum Likelihood

This also illuminates the relation to the maximum likelihood estimate:
 Suppose p(D|θ) reaches a sharp peak at $\theta = \hat{\theta}$.
 p(θ|D) will also peak at the same place if p(θ) is well-behaved.
 p(x|D) will be approximately $p(x|\hat{\theta})$, which is the ML result.
 If the peak of p(D|θ) is very sharp, then the influence of prior information on the uncertainty of the true value of θ can be ignored.
 However, the Bayes solution tells us how to use all of the available information to compute the desired density p(x|D).

ECE 8527: Lecture 07, Slide 11
Recursive Bayes Incremental Learning

To indicate explicitly the dependence on the number of samples, let:
$$D^n = \{x_1, x_2, \ldots, x_n\}$$
We can then write our expression for p(D|θ):
$$p(D^n|\theta) = p(x_n|\theta)\,p(D^{n-1}|\theta)$$
where $p(D^0|\theta) = 1$ for the empty data set.

We can write the posterior density using a recursive relation:
$$p(\theta|D^n) = \frac{p(x_n|\theta)\,p(\theta|D^{n-1})}{\int p(x_n|\theta)\,p(\theta|D^{n-1})\,d\theta}$$
where $p(\theta|D^0) = p(\theta)$.

This is called recursive Bayes incremental learning because we have a method for incrementally updating our estimates.
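A rough numerical illustration of this recursion (my own sketch; the grid and data are illustrative): represent p(θ|Dⁿ) on a grid over candidate means and fold in one observation at a time. For the Gaussian-mean case the result agrees with the closed-form (μ_n, σ_n²) given earlier.

```python
import numpy as np

# Grid-based recursive Bayes for the unknown mean of N(mu, sigma^2), sigma^2 known.
sigma2, mu0, sigma0_2 = 1.0, 0.0, 4.0
grid = np.linspace(-5.0, 5.0, 2001)            # candidate values of mu
dmu = grid[1] - grid[0]
posterior = np.exp(-0.5 * (grid - mu0) ** 2 / sigma0_2)   # start from the prior p(mu)
posterior /= posterior.sum() * dmu

rng = np.random.default_rng(0)
for x in rng.normal(loc=2.0, scale=1.0, size=10):
    likelihood = np.exp(-0.5 * (x - grid) ** 2 / sigma2)   # p(x_n | mu)
    posterior *= likelihood                    # p(x_n | mu) p(mu | D^{n-1})
    posterior /= posterior.sum() * dmu         # renormalize (the alpha factor)

print((grid * posterior).sum() * dmu)          # posterior mean; matches mu_n in closed form
```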

ECE 8527: Lecture 07, Slide 12
When do ML and Bayesian Estimation Differ?

For infinite amounts of data, the solutions converge. However, limited data is always a problem.
If prior information is reliable, a Bayesian estimate can be superior. Bayesian estimates for uniform priors are similar to an ML solution. If p(θ|D) is broad or asymmetric around the true value, the approaches are likely to produce different solutions.
When designing a classifier using these techniques, there are three sources of error:
 Bayes error: the error due to overlapping distributions.
 Model error: the error due to an incorrect model or an incorrect assumption about the parametric form.
 Estimation error: the error arising from the fact that the parameters are estimated from a finite amount of data.
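To make the limited-data point concrete, a tiny comparison (an illustrative sketch, not from the slides): with only three samples, the ML estimate is just the sample mean, while the Bayesian estimate is pulled toward a reliable prior.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=2.0, scale=1.0, size=3)    # only three samples
sigma2, mu0, sigma0_2 = 1.0, 2.0, 0.25        # a reliable prior centered near the truth

mu_ml = x.mean()                               # maximum likelihood estimate
n = x.size
mu_bayes = (n * sigma0_2 * mu_ml + sigma2 * mu0) / (n * sigma0_2 + sigma2)

print(f"ML estimate:    {mu_ml:.3f}")
print(f"Bayes estimate: {mu_bayes:.3f} (shrunk toward the prior mean {mu0})")
```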

ECE 8527: Lecture 07, Slide 13
Noninformative Priors and Invariance

The information about the prior is based on the designer’s knowledge of the problem domain. We expect the prior distributions to be translation and scale invariant – they should not depend on the actual value of the parameter. A prior that satisfies this property is referred to as a “noninformative prior”:
 The Bayesian approach remains applicable even when little or no prior information is available.
 Such situations can be handled by choosing a prior density giving equal weight to all possible values of θ.
 Priors that seemingly impart no prior preference, the so-called noninformative priors, also arise when the prior is required to be invariant under certain transformations.
 Frequently, the desire to treat all possible values of θ equitably leads to priors with infinite mass. Such noninformative priors are called improper priors.

ECE 8527: Lecture 07, Slide 14
Example of Noninformative Priors

For example, if we assume the prior distribution of the mean of a continuous random variable is independent of the choice of origin, the only prior that could satisfy this is a uniform distribution over the entire real line (which cannot be normalized).
Similarly, consider a scale parameter σ and a rescaling by a positive constant c, giving the new variable σ̃ = cσ. Requiring the prior to be invariant under this rescaling leads to the inverse distribution p(σ) = 1/σ, a noninformative prior on σ which is also improper.
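To check this invariance claim (a worked step added for illustration): under the change of variables σ̃ = cσ, densities transform as
$$p(\tilde{\sigma}) = p(\sigma)\left|\frac{d\sigma}{d\tilde{\sigma}}\right| = \frac{1}{\sigma}\cdot\frac{1}{c} = \frac{1}{\tilde{\sigma}}$$
so the prior keeps the same functional form under any rescaling; equivalently, the mass assigned to an interval [a, b], namely ln(b/a), equals the mass assigned to the rescaled interval [ca, cb].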

ECE 8527: Lecture 07, Slide 15
Sufficient Statistics

Direct computation of p(D|θ) and p(θ|D) for large data sets is challenging (e.g., neural networks). We need a parametric form for p(x|θ) (e.g., Gaussian).
In the Gaussian case, computation of the sample mean and covariance, which was straightforward, contained all the information relevant to estimating the unknown population mean and covariance. This property exists for other distributions.
A sufficient statistic is a function s of the samples D that contains all the information relevant to a parameter, θ. A statistic, s, is said to be sufficient for θ if p(D|s,θ) is independent of θ:
$$p(D|s,\theta) = p(D|s)$$

ECE 8527: Lecture 07, Slide 16
The Factorization Theorem

Theorem: A statistic, s, is sufficient for θ if and only if p(D|θ) can be written as:
$$p(D|\theta) = g(s,\theta)\,h(D)$$
There are many ways to formulate sufficient statistics (e.g., define a vector of the samples themselves). The theorem is useful only when the function g(·) and the sufficient statistic are simple (e.g., a sample mean calculation).
The factoring of p(D|θ) is not unique: for any function f(s) > 0, g′(s,θ) = f(s) g(s,θ) and h′(D) = h(D)/f(s) give an equally valid factorization. Define a kernel density that is invariant to this scaling:
$$\bar{g}(s,\theta) = \frac{g(s,\theta)}{\int g(s,\theta')\,d\theta'}$$
Significance: most practical applications of parameter estimation involve simple sufficient statistics and simple kernel densities.

ECE 8527: Lecture 07, Slide 17
Gaussian Distributions

For a Gaussian with unknown mean θ and known variance σ², the likelihood factors as:
$$p(D|\theta) = \prod_{k=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x_k-\theta)^2}{2\sigma^2}\right]
= \exp\!\left[-\frac{n}{2\sigma^2}\left(\theta^2 - 2\theta\,\hat{\mu}_n\right)\right]\cdot
\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\!\left[-\frac{1}{2\sigma^2}\sum_{k=1}^{n} x_k^2\right]$$
with $s = \hat{\mu}_n$, $g(s,\theta)$ the first factor, and $h(D)$ the second. This isolates the θ dependence in the first term, and hence the sample mean is a sufficient statistic using the Factorization Theorem. The kernel is:
$$\bar{g}(\hat{\mu}_n,\theta) = \frac{1}{\sqrt{2\pi\sigma^2/n}}\exp\!\left[-\frac{(\theta-\hat{\mu}_n)^2}{2\sigma^2/n}\right]$$
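A quick numerical sanity check of this factorization (an illustrative sketch, not from the slides): two data sets of the same size with the same sample mean give log-likelihoods that differ only by a constant that does not depend on θ.

```python
import numpy as np

def log_likelihood(x, theta, sigma2=1.0):
    """Gaussian log-likelihood of data x for a candidate mean theta (variance known)."""
    x = np.asarray(x, dtype=float)
    return -0.5 * np.sum((x - theta) ** 2) / sigma2 \
           - 0.5 * x.size * np.log(2 * np.pi * sigma2)

# Two different data sets with the same size and the same sample mean (2.0).
D1 = np.array([1.0, 2.0, 3.0])
D2 = np.array([0.0, 2.5, 3.5])

thetas = np.linspace(-3.0, 5.0, 9)
diff = [log_likelihood(D1, t) - log_likelihood(D2, t) for t in thetas]
print(np.round(diff, 6))   # constant in theta: only h(D) differs; g(s, theta) is identical
```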

ECE 8527: Lecture 07, Slide 18
The Exponential Family

This can be generalized to the exponential family of distributions:
$$p(x|\boldsymbol{\theta}) = \alpha(x)\,\exp\!\left[a(\boldsymbol{\theta}) + \mathbf{b}(\boldsymbol{\theta})^{t}\,\mathbf{c}(x)\right]$$
and:
$$p(D|\boldsymbol{\theta}) = \left[\prod_{k=1}^{n}\alpha(x_k)\right]\exp\!\left[n\,a(\boldsymbol{\theta}) + \mathbf{b}(\boldsymbol{\theta})^{t}\,\mathbf{s}\right], \qquad \mathbf{s} = \sum_{k=1}^{n}\mathbf{c}(x_k)$$
so $g(\mathbf{s},\boldsymbol{\theta}) = \exp[n\,a(\boldsymbol{\theta}) + \mathbf{b}(\boldsymbol{\theta})^{t}\mathbf{s}]$ and s is a sufficient statistic.
Examples: the Gaussian, exponential, Rayleigh, gamma, beta, Poisson, Bernoulli, binomial, and multinomial distributions all have this form.
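As one concrete instance (a worked example added for illustration), the Bernoulli distribution fits this form:
$$p(x|\theta) = \theta^{x}(1-\theta)^{1-x} = \exp\!\left[\ln(1-\theta) + x\ln\frac{\theta}{1-\theta}\right], \qquad x \in \{0,1\}$$
so α(x) = 1, a(θ) = ln(1−θ), b(θ) = ln(θ/(1−θ)), and c(x) = x; for n coin flips the sufficient statistic is simply the number of heads, s = Σ x_k.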

ECE 8527: Lecture 07, Slide 19
Summary

 Introduction of Bayesian parameter estimation.
 The role of the class-conditional distribution in a Bayesian estimate.
 Estimation of the posterior and the probability density function, assuming the only unknown parameter is the mean and that the conditional density of the “features” given the mean, p(x|θ), can be modeled as a Gaussian distribution.
 Bayesian estimates of the mean for the multivariate Gaussian case.
 General theory for Bayesian estimation.
 Comparison to maximum likelihood estimates.
 Recursive Bayes incremental learning.
 Noninformative priors.
 Sufficient statistics and the kernel density.

ECE 8527: Lecture 07, Slide 20
“The Euro Coin”

Getting ahead a bit, let’s see how we can put these ideas to work on a simple example due to David MacKay, and explained by Jon Hamaker.