PatReco: Estimation/Training. Alexandros Potamianos, Dept. of ECE, Tech. Univ. of Crete, Fall 2004-2005.

Estimation/Training
- Goal: Given observed data, (re-)estimate the parameters of the model; e.g., for a Gaussian model, estimate the mean and variance of each class.

Supervised vs. Unsupervised
- Supervised training: all data have been (manually) labeled, i.e., assigned to classes.
- Unsupervised training: data are not assigned class labels.

Observable data
- Fully observed data: all information necessary for training is available (features, class labels, etc.).
- Partially observed data: some of the features or some of the class labels are missing.

Supervised Training (fully observed data)
- Maximum likelihood estimation (ML)
- Maximum a posteriori estimation (MAP)
- Bayesian estimation (BE)

Training process
The collected training data consists of the examples D = {x_1, x_2, …, x_N}.
- Step 1: Label each example with the corresponding class label ω_1, ω_2, …, ω_K.
- Step 2: For each class separately, estimate the model parameters using ML, MAP, or BE and the corresponding training examples D_1, D_2, …, D_K.

Training Process: Step 1
D = {x_1, x_2, x_3, x_4, x_5, …, x_N}
Label manually with ω_1, ω_2, …, ω_K:
D_1 = {x_11, x_12, x_13, …, x_1N_1}
D_2 = {x_21, x_22, x_23, …, x_2N_2}
…
D_K = {x_K1, x_K2, x_K3, …, x_KN_K}
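As a toy illustration of Step 1 in code (the function name and example values below are hypothetical, not from the slides), splitting a labeled dataset D into per-class subsets D_1, …, D_K might look like:

```python
from collections import defaultdict

def split_by_class(examples, labels):
    """Group feature vectors into per-class training sets D_1, ..., D_K."""
    per_class = defaultdict(list)
    for x, omega in zip(examples, labels):
        per_class[omega].append(x)
    return dict(per_class)

# Example: six labeled 1-D examples from two classes
D = [[1.0], [1.2], [3.1], [0.9], [3.3], [2.8]]
labels = ["w1", "w1", "w2", "w1", "w2", "w2"]
subsets = split_by_class(D, labels)   # {"w1": [...], "w2": [...]}
```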

Training Process: Step 2 (shown for class ω_1)
- Maximum likelihood: θ_1 = argmax_{θ_1} P(D_1|θ_1)
- Maximum a posteriori: θ_1 = argmax_{θ_1} P(D_1|θ_1) P(θ_1)
- Bayesian estimation: P(x|ω_1) = ∫ P(x|θ_1) P(θ_1|D_1) dθ_1

ML Estimation Assumptions
1. P(x|ω_i) follows a parametric distribution with parameters θ.
2. D_j (for j ≠ i) tells us nothing about P(x|ω_i) (functional independence).
3. Observations x_1, x_2, x_3, …, x_N are i.i.d. (independent and identically distributed).
4a. (ML only!) θ is a quantity whose value is fixed but unknown.

ML estimation
θ = argmax_θ P(θ|D) = argmax_θ P(D|θ) P(θ)
  = argmax_θ P(D|θ)                         (by assumption 4a)
  = argmax_θ P(x_1, x_2, …, x_N|θ)
  = argmax_θ ∏_j P(x_j|θ)                   (by assumption 3)
⇒ set ∂[∏_j P(x_j|θ)]/∂θ = 0 and solve for θ.
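For example, carrying out this last step for a 1-D Gaussian with unknown mean μ and known variance σ² (a standard derivation, sketched here as a reminder; maximizing the log-likelihood is equivalent since the logarithm does not change the argmax):

```latex
\begin{align*}
\log \prod_{j=1}^{N} P(x_j \mid \mu)
  &= -\frac{N}{2}\log(2\pi\sigma^{2})
     - \frac{1}{2\sigma^{2}}\sum_{j=1}^{N}(x_j-\mu)^{2} \\
\frac{\partial}{\partial \mu}\log \prod_{j=1}^{N} P(x_j \mid \mu)
  &= \frac{1}{\sigma^{2}}\sum_{j=1}^{N}(x_j-\mu) \;=\; 0
  \quad\Rightarrow\quad
  \hat{\mu} = \frac{1}{N}\sum_{j=1}^{N} x_j
\end{align*}
```

This recovers the sample-mean formula on the next slide.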

ML estimate for a Gaussian pdf
If P(x|ω) = N(μ, σ²) and θ = (μ, σ²), then in 1-D:
μ = (1/N) Σ_{j=1..N} x_j
σ² = (1/N) Σ_{j=1..N} (x_j − μ)²
Multi-D, with θ = (μ, Σ):
μ = (1/N) Σ_{j=1..N} x_j
Σ = (1/N) Σ_{j=1..N} (x_j − μ)(x_j − μ)^T
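A minimal NumPy sketch of these ML formulas (illustrative only; the function name and example data are mine):

```python
import numpy as np

def gaussian_ml_estimate(X):
    """ML estimates for a multivariate Gaussian; X is an (N, d) array of samples."""
    N = X.shape[0]
    mu = X.mean(axis=0)                      # mu = (1/N) sum_j x_j
    centered = X - mu
    Sigma = (centered.T @ centered) / N      # Sigma = (1/N) sum_j (x_j - mu)(x_j - mu)^T
    return mu, Sigma

# Example: estimate the parameters of class omega_1 from its training set D_1
D1 = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.1, 2.1]])
mu1, Sigma1 = gaussian_ml_estimate(D1)
```

Note the division by N (the biased ML estimate) rather than the unbiased N − 1.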

Bayesian Estimation Assumptions
1. P(x|ω_i) follows a parametric distribution with parameters θ.
2. D_j (for j ≠ i) tells us nothing about P(x|ω_i) (functional independence).
3. Observations x_1, x_2, x_3, …, x_N are i.i.d. (independent and identically distributed).
4b. (MAP, BE) θ is a random variable whose prior distribution p(θ) is known.

Bayesian Estimation
P(x|D) = ∫ P(x,θ|D) dθ = ∫ P(x|θ,D) P(θ|D) dθ = ∫ P(x|θ) P(θ|D) dθ
STEP 1: update the prior to the posterior, P(θ) → P(θ|D), using P(θ|D) = P(D|θ) P(θ) / P(D)
STEP 2: integrate over θ to go from P(x|θ) to P(x|D)
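As a toy numerical illustration of the two steps (my own example, not from the slides: it assumes a Gaussian likelihood with known σ and simply discretizes θ on a grid):

```python
import numpy as np
from scipy.stats import norm

sigma = 1.0                                        # known observation std
theta_grid = np.linspace(-5, 5, 1001)              # discretized parameter space
prior = norm.pdf(theta_grid, loc=0.0, scale=2.0)   # P(theta)
D = np.array([0.8, 1.3, 0.9, 1.1])                 # observed training data

# STEP 1: posterior P(theta|D) proportional to P(D|theta) P(theta)
log_lik = norm.logpdf(D[:, None], loc=theta_grid[None, :], scale=sigma).sum(axis=0)
posterior = np.exp(log_lik) * prior
posterior /= np.trapz(posterior, theta_grid)       # normalize by P(D)

# STEP 2: predictive density P(x|D) = integral of P(x|theta) P(theta|D) dtheta
x_new = 1.0
p_x_given_D = np.trapz(norm.pdf(x_new, loc=theta_grid, scale=sigma) * posterior, theta_grid)
```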

Bayesian Estimate for Gaussian pdf and prior
If P(x|θ) = N(μ, σ²) with σ² known (so θ = μ) and prior p(θ) = N(μ_0, σ_0²), then
STEP 1: P(θ|D) = N(μ_n, σ_n²)
STEP 2: P(x|D) = N(μ_n, σ² + σ_n²)
where
μ_n = [σ_0² / (n σ_0² + σ²)] (Σ_j x_j) + [σ² / (n σ_0² + σ²)] μ_0
σ_n² = σ² σ_0² / (n σ_0² + σ²)
For large n (number of training samples), maximum likelihood and Bayesian estimation become equivalent!
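A short NumPy sketch of these closed-form updates (helper name and example values are mine):

```python
import numpy as np

def bayes_gaussian_mean_update(D, sigma2, mu0, sigma0_2):
    """Posterior N(mu_n, sigma_n^2) of the Gaussian mean, given known variance
    sigma2 and a Gaussian prior N(mu0, sigma0_2) on the mean."""
    n = len(D)
    denom = n * sigma0_2 + sigma2
    mu_n = (sigma0_2 / denom) * np.sum(D) + (sigma2 / denom) * mu0
    sigma_n2 = sigma2 * sigma0_2 / denom
    return mu_n, sigma_n2

D = np.array([0.8, 1.3, 0.9, 1.1])
mu_n, sigma_n2 = bayes_gaussian_mean_update(D, sigma2=1.0, mu0=0.0, sigma0_2=2.0)
# Predictive density of a new sample: P(x|D) = N(mu_n, sigma2 + sigma_n2)
```

As n grows, σ_n² → 0 and μ_n approaches the sample mean (the ML estimate), which is the large-n equivalence noted above.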

Conclusions
- Maximum likelihood estimation is simple and gives good estimates when the number of training samples is large.
- Bayesian estimation gives good estimates even for small amounts of training data, provided that a good prior is selected.
- Bayesian estimation is harder and often has no closed-form solution (in which case, try recursive/iterative Bayesian estimation).