Visual Recognition Tutorial


Contents:
Maximum likelihood – an example
Maximum likelihood – another example
Bayesian estimation
EM for a mixture model
EM Algorithm: General Setting
Jensen's inequality

Bayesian Estimation: General Theory
Bayesian learning considers θ (the parameter vector to be estimated) to be a random variable. Before we observe the data, the parameters are described by a prior p(θ), which is typically very broad. Once we have observed the data, we can make use of Bayes' formula to find the posterior p(θ|X^(n)). Since some values of the parameters are more consistent with the data than others, the posterior is narrower than the prior. This is Bayesian learning.

Bayesian parametric estimation
Density function for x, given the training data set X^(n) (it was defined in Lecture 2): p(x|X^(n)). From the definition of conditional probability densities,
p(x|X^(n)) = ∫ p(x, θ|X^(n)) dθ = ∫ p(x|θ, X^(n)) p(θ|X^(n)) dθ.
The first factor is independent of X^(n), since it is just our assumed form for the parameterized density: p(x|θ, X^(n)) = p(x|θ). Therefore
p(x|X^(n)) = ∫ p(x|θ) p(θ|X^(n)) dθ.

Bayesian parametric estimation
Instead of choosing a specific value for θ, the Bayesian approach performs a weighted average over all values of θ. If the weighting factor p(θ|X^(n)), which is the posterior of θ, peaks very sharply about some value θ̂, we obtain p(x|X^(n)) ≈ p(x|θ̂). Thus the optimal estimator is the most likely value of θ given the data and the prior of θ.
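
As a numeric illustration (not from the original slides), the sketch below approximates this weighted average on a grid, assuming an exponential form p(x|θ) = (1/θ)e^{-x/θ} for the parameterized density and a flat prior; all variable names and values are hypothetical.

    import numpy as np

    # Hypothetical setup: exponential likelihood p(x|theta) = (1/theta) * exp(-x/theta)
    # and a posterior p(theta|X^(n)) represented on a uniform grid of theta values.
    rng = np.random.default_rng(0)
    data = rng.exponential(scale=2.0, size=50)        # the observed sample X^(n)

    theta = np.linspace(0.1, 10.0, 1000)              # grid over the parameter
    d_theta = theta[1] - theta[0]
    log_lik = -len(data) * np.log(theta) - data.sum() / theta
    post = np.exp(log_lik - log_lik.max())            # flat prior; unnormalized posterior
    post /= post.sum() * d_theta                      # normalize so it integrates to 1

    def predictive(x):
        # p(x|X^(n)) = integral of p(x|theta) p(theta|X^(n)) dtheta, Riemann sum on the grid
        return (np.exp(-x / theta) / theta * post).sum() * d_theta

    print(predictive(1.0), predictive(3.0))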

Bayesian decision making
Suppose we know the distribution of possible values of θ, that is, a prior p(θ). Suppose we also have a loss function λ(θ̂, θ) which measures the penalty for estimating θ̂ when the actual value is θ. Then we may formulate the estimation problem as Bayesian decision making: choose the value θ̂ which minimizes the risk
R = ∫ λ(θ̂, θ) p(θ|X^(n)) dθ.
Note that the loss function is usually continuous.

Maximum A-Posteriori (MAP) Estimation
Let us look at p(θ|X^(n)): the optimal estimator is the most likely value of θ given the data and the prior of θ. This "most likely value" is given by
θ̂_MAP = argmax_θ p(θ|X^(n)).

Maximum A-Posteriori (MAP) Estimation
By Bayes' formula, p(θ|X^(n)) = p(X^(n)|θ) p(θ) / p(X^(n)), and since the data are i.i.d., p(X^(n)|θ) = ∏_i p(x_i|θ). We can disregard the normalizing factor p(X^(n)) when looking for the maximum.

MAP – continued
So, the θ̂_MAP we are looking for is
θ̂_MAP = argmax_θ p(θ) ∏_i p(x_i|θ) = argmax_θ [ ln p(θ) + Σ_i ln p(x_i|θ) ].
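
A minimal numeric sketch of this argmax (illustrative, not the slides' example): it assumes an exponential density p(x|θ) = (1/θ)e^{-x/θ} (the same model used in the ML example below) and a made-up prior on θ, and takes the argmax of the unnormalized posterior on a grid.

    import numpy as np

    rng = np.random.default_rng(1)
    data = rng.exponential(scale=2.0, size=30)        # i.i.d. sample

    theta = np.linspace(0.1, 10.0, 2000)
    log_prior = -3.0 * np.log(theta) - 1.0 / theta    # a hypothetical prior, for illustration only
    log_lik = -len(data) * np.log(theta) - data.sum() / theta
    log_post = log_prior + log_lik                    # unnormalized: p(X^(n)) is ignored

    theta_map = theta[np.argmax(log_post)]            # argmax of p(theta) * prod_i p(x_i|theta)
    theta_ml = data.mean()                            # the ML estimate, for comparison
    print(theta_map, theta_ml)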

Maximum likelihood
In the MAP estimator, the larger n (the size of the data), the less important p(θ) is in the expression ln p(θ) + Σ_i ln p(x_i|θ). This can motivate us to omit the prior. What we get is the maximum likelihood (ML) method. Informally: we do not use any prior knowledge about the parameters; we seek those values that "explain" the data in the best way:
θ̂_ML = argmax_θ ∏_i p(x_i|θ).
L(θ) = Σ_i ln p(x_i|θ) is the log-likelihood of θ with respect to X^(n). We seek a maximum of the likelihood function, the log-likelihood, or any monotonically increasing function of them.

Maximum likelihood – an example
Let us find the ML estimator for the parameter θ of the exponential density p(x|θ) = (1/θ) e^{-x/θ}, x ≥ 0. The likelihood ∏_i p(x_i|θ) has the same maximizer as its logarithm, so we are actually looking for the maximum of the log-likelihood
L(θ) = Σ_i ln p(x_i|θ) = -n ln θ - (1/θ) Σ_i x_i.
Observe: dL/dθ = -n/θ + (1/θ²) Σ_i x_i. The maximum is achieved where dL/dθ = 0, that is, at
θ̂ = (1/n) Σ_i x_i.
We have got the empirical mean (average).
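
A short numpy check of this result (a sketch with simulated data): the grid maximizer of the log-likelihood coincides with the sample mean.

    import numpy as np

    rng = np.random.default_rng(2)
    data = rng.exponential(scale=3.0, size=1000)      # sample with true theta = 3.0

    theta = np.linspace(0.5, 10.0, 5000)
    log_lik = -len(data) * np.log(theta) - data.sum() / theta

    print(theta[np.argmax(log_lik)])                  # grid maximizer of the log-likelihood
    print(data.mean())                                # the empirical mean -- the ML estimate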

Maximum likelihood – another example
Let us find the ML estimator for the location parameter θ of the Laplacian density p(x|θ) = (1/2) e^{-|x-θ|}. The log-likelihood is L(θ) = -n ln 2 - Σ_i |x_i - θ|. Observe: dL/dθ = Σ_i sign(x_i - θ). The maximum is at the θ where the number of samples above θ equals the number below it. This is the median of the sampled data.
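
An analogous check for the Laplacian case (again an illustrative sketch with simulated data): the maximizer of -Σ_i |x_i - θ| lands on the sample median.

    import numpy as np

    rng = np.random.default_rng(3)
    data = rng.laplace(loc=1.5, scale=1.0, size=1001) # sample with true location theta = 1.5

    theta = np.linspace(-2.0, 5.0, 7001)
    log_lik = -np.abs(data[:, None] - theta[None, :]).sum(axis=0)   # up to an additive constant

    print(theta[np.argmax(log_lik)])                  # grid maximizer
    print(np.median(data))                            # the sample median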

Bayesian estimation – revisited
We saw the Bayesian estimator for the 0/1 loss function (MAP). What happens when we assume other loss functions?
Example 1: λ(θ̂, θ) = |θ̂ - θ| (θ is unidimensional). The total Bayesian risk here is
R = ∫ |θ̂ - θ| p(θ|X^(n)) dθ = ∫_{θ<θ̂} (θ̂ - θ) p(θ|X^(n)) dθ + ∫_{θ>θ̂} (θ - θ̂) p(θ|X^(n)) dθ.
We seek its minimum:
dR/dθ̂ = ∫_{θ<θ̂} p(θ|X^(n)) dθ - ∫_{θ>θ̂} p(θ|X^(n)) dθ = 0.

Bayesian estimation – continued
At the θ̂ which is a solution we have
∫_{θ<θ̂} p(θ|X^(n)) dθ = ∫_{θ>θ̂} p(θ|X^(n)) dθ.
That is, for the absolute-error loss the optimal Bayesian estimator for the parameter is the median of the posterior distribution p(θ|X^(n)).
Example 2: λ(θ̂, θ) = (θ̂ - θ)² (squared error). The total Bayesian risk is
R = ∫ (θ̂ - θ)² p(θ|X^(n)) dθ.
Again, in order to find the minimum, let the derivative be equal to 0:
dR/dθ̂ = 2 ∫ (θ̂ - θ) p(θ|X^(n)) dθ = 0.

Bayesian estimation – continued
The optimal estimator here is
θ̂ = ∫ θ p(θ|X^(n)) dθ,
the conditional expectation of θ given the data X^(n).
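
To contrast the three loss functions, here is a small sketch (the function names and the example posterior are hypothetical) that computes the posterior mode, median and mean from a density tabulated on a grid.

    import numpy as np

    def bayes_estimators(theta, post):
        # post: normalized posterior density on a uniformly spaced grid theta
        d_theta = theta[1] - theta[0]
        theta_map = theta[np.argmax(post)]                    # 0/1 loss      -> posterior mode (MAP)
        cdf = np.cumsum(post) * d_theta
        theta_med = theta[np.searchsorted(cdf, 0.5)]          # absolute loss -> posterior median
        theta_mean = (theta * post).sum() * d_theta           # squared loss  -> posterior mean
        return theta_map, theta_med, theta_mean

    # A hypothetical skewed posterior on a grid, to show that the three estimators differ.
    grid = np.linspace(0.0, 12.0, 3001)
    dens = grid * np.exp(-grid)                               # Gamma(2,1)-shaped density
    dens /= dens.sum() * (grid[1] - grid[0])
    print(bayes_estimators(grid, dens))                       # roughly (1.0, 1.68, 2.0)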

Mixture Models

Mixture Models
Introduce a multinomial random variable Z with components Z^k, k = 1,…,K. Z^k = 1 if and only if Z takes the kth value, and Z^k = 0 otherwise. Note that Σ_k Z^k = 1.

Mixture Models
P(Z^k = 1) = π_k, where 0 ≤ π_k ≤ 1 and Σ_k π_k = 1. The marginal probability of X is
p(x|θ) = Σ_k P(Z^k = 1) p(x|Z^k = 1) = Σ_k π_k f_k(x|θ_k).

Mixture Models
A mixture model as a graphical model: Z is a multinomial latent variable with an arrow to the observed variable X. The model is specified by the prior P(Z^k = 1) = π_k and the conditional p(x|Z^k = 1) = f_k(x|θ_k). Define the posterior
τ^k = P(Z^k = 1|x) = π_k f_k(x|θ_k) / Σ_j π_j f_j(x|θ_j).

Unconditional Mixture Models
Conditional mixture models are used to solve regression and classification (supervised) problems; they require observations of the data X together with labels Y, that is, (X,Y) pairs. Unconditional mixture models are used to solve density estimation problems; they require only observations of the data X. Applications: detection of outliers, compression, unsupervised classification (clustering), …

Unconditional Mixture Models

Gaussian Mixture Models
p(x|θ) = Σ_k π_k N(x|μ_k, Σ_k).
Estimate the parameters θ = {π_k, μ_k, Σ_k} from IID data D = {x1,…,xN}, i.e. maximize the log likelihood
l(θ; D) = Σ_n ln Σ_k π_k N(x_n|μ_k, Σ_k).
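
To make the model concrete, here is a minimal one-dimensional sketch (parameter values and names are hypothetical) of the mixture density and its log likelihood on a simulated sample.

    import numpy as np

    def gmm_log_likelihood(x, pi, mu, sigma):
        # l(theta; D) = sum_n ln sum_k pi_k N(x_n | mu_k, sigma_k^2), one-dimensional case
        x = np.asarray(x)[:, None]                                            # shape (N, 1)
        comp = pi * np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
        return np.log(comp.sum(axis=1)).sum()

    # A hypothetical two-component mixture and a sample drawn from it.
    rng = np.random.default_rng(4)
    z = rng.random(500) < 0.3                                                 # latent indicators
    x = np.where(z, rng.normal(-2.0, 0.5, 500), rng.normal(1.0, 1.0, 500))
    print(gmm_log_likelihood(x, np.array([0.3, 0.7]), np.array([-2.0, 1.0]), np.array([0.5, 1.0])))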

The K-means algorithm
Group data D = {x1,…,xN} into a set of K clusters, where K is given. Represent the i-th cluster by one vector, its mean μ_i. Data points are assigned to the nearest mean. The algorithm alternates two phases.
Phase 1: the values of the indicator variables r_nk are set by assigning each point x_n to the closest mean:
r_nk = 1 if k = argmin_j ||x_n - μ_j||², and r_nk = 0 otherwise.
Phase 2: recompute the means
μ_k = Σ_n r_nk x_n / Σ_n r_nk.
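
A compact numpy sketch of the two phases (illustrative only: the initialization picks random data points, the stopping rule is mean stability, and empty clusters are not handled).

    import numpy as np

    def k_means(X, K, n_iter=100, seed=0):
        # Alternate phase 1 (assignment to the closest mean) and phase 2 (mean recomputation).
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(len(X), size=K, replace=False)]         # K data points as initial means
        for _ in range(n_iter):
            d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # squared distances, (N, K)
            labels = d2.argmin(axis=1)                            # phase 1
            new_mu = np.array([X[labels == k].mean(axis=0) for k in range(K)])  # phase 2
            if np.allclose(new_mu, mu):                           # assignments have stabilized
                break
            mu = new_mu
        return mu, labels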

EM Algorithm
If the Z_n were observed, then they would be "class labels" and the estimate of the mean would be
μ̂_k = Σ_n z_n^k x_n / Σ_n z_n^k.
We do not know them, so we replace them by their conditional expectations, conditioning on the data:
τ_n^k = E[Z_n^k|x_n] = P(Z_n^k = 1|x_n).
But this posterior depends on the parameter estimates, so we should iterate.

EM Algorithm
Iteration formulas:
τ_n^k = π_k N(x_n|μ_k, Σ_k) / Σ_j π_j N(x_n|μ_j, Σ_j)
μ_k^new = Σ_n τ_n^k x_n / Σ_n τ_n^k
Σ_k^new = Σ_n τ_n^k (x_n - μ_k^new)(x_n - μ_k^new)^T / Σ_n τ_n^k
π_k^new = (1/N) Σ_n τ_n^k
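
A sketch of one EM iteration for a one-dimensional Gaussian mixture, following the formulas above (illustrative code, not from the slides):

    import numpy as np

    def em_step(x, pi, mu, sigma):
        # One EM iteration for a one-dimensional Gaussian mixture with K components.
        x = np.asarray(x)[:, None]                                        # shape (N, 1)
        # E step: responsibilities tau_{nk} = pi_k N(x_n|mu_k) / sum_j pi_j N(x_n|mu_j)
        comp = pi * np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
        tau = comp / comp.sum(axis=1, keepdims=True)                      # shape (N, K)
        # M step: parameter updates
        nk = tau.sum(axis=0)                                              # effective counts
        mu_new = (tau * x).sum(axis=0) / nk
        sigma_new = np.sqrt((tau * (x - mu_new) ** 2).sum(axis=0) / nk)
        pi_new = nk / len(x)
        return pi_new, mu_new, sigma_new

Iterating em_step until the log likelihood stops increasing gives the ML estimates; in practice one also guards against components collapsing onto single data points.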

EM Algorithm
The expectation step is the computation of the responsibilities τ_n^k; the maximization step is the parameter updates for μ_k, Σ_k and π_k. What relationship does this algorithm have to the quantity we want to maximize, the log likelihood l(θ; D)? Calculating the derivatives of l with respect to the parameters, we have, for example,
∂l/∂μ_k = Σ_n τ_n^k Σ_k^{-1} (x_n - μ_k).

EM Algorithm
Setting the derivative to zero yields
μ̂_k = Σ_n τ_n^k x_n / Σ_n τ_n^k.
Analogously,
Σ̂_k = Σ_n τ_n^k (x_n - μ̂_k)(x_n - μ̂_k)^T / Σ_n τ_n^k,
and for the mixing proportions
π̂_k = (1/N) Σ_n τ_n^k.
These are exactly the update formulas of the iteration, so the fixed points of EM are stationary points of the log likelihood.

EM General Setting
EM is an iterative technique designed for probabilistic models. We have two sample spaces: X, which is observed (the dataset), and Z, which is missing (latent). A probability model is p(x, z|θ). If we knew Z we would do ML estimation by maximizing the complete log likelihood
l_c(θ; x, z) = ln p(x, z|θ).

EM General Setting
Z is not observed, so we calculate the incomplete log likelihood
l(θ; x) = ln p(x|θ) = ln Σ_z p(x, z|θ).
Given that Z is not observed, the complete log likelihood is a random quantity and cannot be maximized directly. Thus we average over Z using some "averaging distribution" q(z|x) and consider the expected complete log likelihood
Σ_z q(z|x) ln p(x, z|θ).
We hope that maximizing this surrogate expression will yield a value of θ which is an improvement over the initial value of θ.

EM General Setting
The distribution q(z|x) can be used to obtain a lower bound on the log likelihood (by Jensen's inequality, see below):
l(θ; x) = ln Σ_z q(z|x) [p(x, z|θ)/q(z|x)] ≥ Σ_z q(z|x) ln [p(x, z|θ)/q(z|x)] =: L(q, θ).
EM is coordinate ascent on L(q, θ). At the (t+1)st iteration, for fixed θ^(t), we first maximize L(q, θ^(t)) with respect to q, which yields q^(t+1). For this q^(t+1) we then maximize L(q^(t+1), θ) with respect to θ, which yields θ^(t+1).

EM General Setting
E step: q^(t+1) = argmax_q L(q, θ^(t)).
M step: θ^(t+1) = argmax_θ L(q^(t+1), θ).
The M step is equivalently viewed as the maximization of the expected complete log likelihood. Proof:
L(q, θ) = Σ_z q(z|x) ln p(x, z|θ) - Σ_z q(z|x) ln q(z|x).
The second term is independent of θ. Thus maximizing L(q, θ) with respect to θ is equivalent to maximizing the expected complete log likelihood Σ_z q(z|x) ln p(x, z|θ).

EM General Setting
The E step can be solved once and for all: the choice q^(t+1)(z|x) = p(z|x, θ^(t)) yields the maximum:
L(p(z|x, θ^(t)), θ^(t)) = Σ_z p(z|x, θ^(t)) ln [p(x, z|θ^(t)) / p(z|x, θ^(t))] = Σ_z p(z|x, θ^(t)) ln p(x|θ^(t)) = l(θ^(t); x),
i.e. the lower bound is tight at θ^(t).
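
A tiny numeric check of this fact (the discrete model and its numbers are made up for illustration): with q(z|x) = p(z|x, θ), the lower bound L(q, θ) equals the incomplete log likelihood ln p(x|θ), while another q gives a smaller value.

    import numpy as np

    # Hypothetical discrete model: Z takes 3 values; p(x, z|theta) for one fixed observation x
    # is given by an arbitrary vector over z.
    p_xz = np.array([0.10, 0.25, 0.05])

    log_lik = np.log(p_xz.sum())                      # incomplete log likelihood ln p(x|theta)

    def lower_bound(q):
        # L(q, theta) = sum_z q(z|x) ln [ p(x, z|theta) / q(z|x) ]
        return (q * np.log(p_xz / q)).sum()

    q_post = p_xz / p_xz.sum()                        # the E-step choice q(z|x) = p(z|x, theta)
    q_other = np.array([0.5, 0.3, 0.2])               # some other averaging distribution

    print(log_lik, lower_bound(q_post))               # equal: the bound is tight
    print(lower_bound(q_other))                       # smaller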

Jensen's inequality
Definition: a function f is convex over (a, b) if for every x1, x2 in (a, b) and every 0 ≤ λ ≤ 1,
f(λ x1 + (1-λ) x2) ≤ λ f(x1) + (1-λ) f(x2);
f is concave if the reverse inequality holds.
Jensen's inequality: for a convex function f and a random variable X,
E[f(X)] ≥ f(E[X]).

Jensen's inequality
For a discrete random variable with two mass points the inequality is just the definition of convexity:
p1 f(x1) + p2 f(x2) ≥ f(p1 x1 + p2 x2).
Let Jensen's inequality hold for k-1 mass points. Then, writing p_i' = p_i / (1 - p_k) for i = 1,…,k-1,
Σ_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1 - p_k) Σ_{i=1}^{k-1} p_i' f(x_i)
≥ p_k f(x_k) + (1 - p_k) f(Σ_{i=1}^{k-1} p_i' x_i)   (due to the induction assumption)
≥ f(p_k x_k + (1 - p_k) Σ_{i=1}^{k-1} p_i' x_i) = f(Σ_{i=1}^{k} p_i x_i)   (due to convexity).

Jensen’s inequality corollary Let Function log is concave, so from Jensen inequality we have: 236607 Visual Recognition Tutorial