Data Exploration and Pattern Recognition © R. El-Yaniv

The KL-Divergence Let $p$ and $q$ be distributions over a finite set $X$. The Kullback-Leibler (KL) divergence between them is $D(p\|q) = \sum_{x\in X} p(x)\log\frac{p(x)}{q(x)}$. The quantity is given in bits (whenever the logarithm is base 2) and measures the dissimilarity between the distributions. Other popular names: cross-entropy, relative entropy, discrimination information. Although the KL-divergence is not a metric (it is not symmetric and does not obey the triangle inequality) and is therefore not a true distance, it is widely used as a "distance" measure between distributions.
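A minimal numeric sketch (not part of the original slides): computing the KL divergence in bits between two discrete distributions given as probability arrays over the same finite alphabet; the function name kl_divergence is my own.

import numpy as np

def kl_divergence(p, q):
    """KL divergence D(p||q) in bits between two discrete distributions.

    Terms with p(x) = 0 contribute 0; if q(x) = 0 while p(x) > 0,
    the divergence is infinite.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q))   # positive
print(kl_divergence(q, p))   # a different value: KL is not symmetric
print(kl_divergence(p, p))   # 0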

Reminder: Jensen Inequality Lemma (Jensen Inequality): If $f$ is a convex function and $X$ is a random variable, then $E[f(X)] \ge f(E[X])$. If $f$ is strictly convex then equality implies that $X$ is constant (i.e. $X = E[X]$ with probability 1). (Figure: a convex function and a concave function.)

Properties of the KL-Divergence Lemma (Information Inequality): $D(p\|q) \ge 0$, with equality iff $p = q$. Proof. Let $A = \{x : p(x) > 0\}$, the support set of $p$. Then $-D(p\|q) = \sum_{x\in A} p(x)\log\frac{q(x)}{p(x)} \le \log\sum_{x\in A} p(x)\frac{q(x)}{p(x)} = \log\sum_{x\in A} q(x) \le \log 1 = 0$, where the first inequality is Jensen's inequality applied to the concave log. Since log is strictly concave, equality holds iff $q(x)/p(x)$ is constant on $A$, i.e. iff $p = q$.
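A quick sanity check of the information inequality (my own sketch, not from the slides): for random distribution pairs the divergence is never negative, and it is zero when the two distributions coincide.

import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.random(5); p /= p.sum()
    q = rng.random(5); q /= q.sum()
    d = np.sum(p * np.log2(p / q))
    assert d >= 0          # information inequality: D(p||q) >= 0
assert np.isclose(np.sum(p * np.log2(p / p)), 0.0)  # equality when p == q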

Properties of the KL-Divergence Sensitive to zero-probability events under $q$: if $q(x) = 0$ while $p(x) > 0$ then $D(p\|q) = \infty$. Not a true distance (not symmetric, violates the triangle inequality). So why use it? We could use, for example, the Euclidean distance, which is a true metric and is not sensitive to zero-probability events. (Partial) answer: the Euclidean distance doesn't have statistical interpretations and doesn't yield optimal solutions. The KL-divergence does!

Binomial Approximation With KL-Div. Let $X \sim \mathrm{Binomial}(n,p)$, so that $P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$. One method to compute $\binom{n}{k}$ is to use Stirling's approximation $n! \approx \sqrt{2\pi n}\,(n/e)^n$.

Binomial Approximation - Cntd. Writing $q = k/n$, Stirling's approximation gives $\binom{n}{k} \approx 2^{\,nH(q)}$ up to a polynomial factor, where $H(q) = -q\log q - (1-q)\log(1-q)$ is the binary entropy. Hence $P(X=k) = \binom{n}{k} p^k (1-p)^{n-k} \approx 2^{-nD(q\|p)}$, where $D(q\|p) = q\log\frac{q}{p} + (1-q)\log\frac{1-q}{1-p}$ is the binary KL-divergence.

Binomial Approximation - Cntd. Keeping the polynomial factor from Stirling's formula gives the sharper estimate $P(X=k) \approx \frac{2^{-nD(q\|p)}}{\sqrt{2\pi n\,q(1-q)}}$, with $q = k/n$.

Binomial Approximation - Cntd. Example: What is the probability of getting 30 heads when tossing an unbiased coin 100 times?
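A numeric check of the example (my own sketch; the constants below are computed, not taken from the transcript): comparing the exact binomial probability with the KL-based approximation $2^{-nD(k/n\|p)}$, with and without the Stirling correction factor.

import math

n, k, p = 100, 30, 0.5
q = k / n

# Exact binomial probability.
exact = math.comb(n, k) * p**k * (1 - p)**(n - k)

# Binary KL divergence D(q || p) in bits.
D = q * math.log2(q / p) + (1 - q) * math.log2((1 - q) / (1 - p))

crude = 2 ** (-n * D)                                        # exponential term only
refined = crude / math.sqrt(2 * math.pi * n * q * (1 - q))   # with Stirling factor

print(exact)    # ~2.3e-05
print(crude)    # ~2.7e-04 (right exponent, missing the polynomial factor)
print(refined)  # ~2.3e-05 (close to the exact value)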

Learning From Observed Data (Diagram: learning settings organized by hidden vs. observed variables and by unsupervised vs. supervised learning.)

Density Estimation The Bayesian method is optimal (for classification and decision making) but requires that all relevant distributions (prior and class-conditional) are known. Unfortunately, this is rarely the case. We only see data, not distributions. Therefore, in order to use Bayesian classification we want to learn these distributions from the data (called training data). Supervised learning: we get to see samples from each of the classes "separately" (called tagged samples). Tagged samples are "expensive": we need to learn the distributions as efficiently as possible. Two methods: parametric (easier) and nonparametric (harder).

Parameter Estimation Suppose we can assume that the relevant densities are of some parametric form. For example, suppose we are pretty sure that $p(x\mid\omega)$ is normal, $N(\mu,\sigma^2)$, without knowing $\mu$ and $\sigma^2$. It remains to estimate the parameters $\mu$ and $\sigma^2$ from the data. Examples of parameterized densities: Binomial: a sample of $n$ binary points with $k$ 1's and $n-k$ 0's has probability $\binom{n}{k}\theta^k(1-\theta)^{n-k}$. Normal: each data point $x$ is distributed according to $N(\mu,\sigma^2)$; here the parameter vector is $\theta = (\mu,\sigma^2)$.

Two Methods for Parameter Estimation We'll study two methods for parameter estimation: Bayesian estimation and maximum likelihood. The two methods are conceptually different. Maximum likelihood: the unknown parameters are fixed; pick the parameters that best "explain" the data. Bayesian estimation: the unknown parameters are random variables sampled from some prior; we use the observed data to revise the prior (obtaining a "sharper" posterior) and choose the best parameters using the standard (and optimal) Bayesian method. But asymptotically they yield the same results.

Isolating the Problem We get to see a training set $D$ of data points. Each point in $D$ belongs to one of $c$ different classes. Suppose the subset $D_j \subseteq D$ of i.i.d. points is in class $\omega_j$, so that each $x \in D_j$ is drawn according to the class-conditional $p(x\mid\omega_j)$. We assume that $p(x\mid\omega_j)$ has some known parametric form given by $p(x\mid\omega_j,\theta_j)$ for some parameter vector $\theta_j$. Thus, we have $c$ separate parameter estimation problems.

Maximum Likelihood Estimation Recall that the likelihood of $\theta$ with respect to the sample $D = \{x_1,\dots,x_n\}$ is $p(D\mid\theta) = \prod_{k=1}^n p(x_k\mid\theta)$. The maximum likelihood parameter vector $\hat\theta$ is the one that best supports the data; that is, $\hat\theta = \arg\max_\theta p(D\mid\theta)$. Analytically, it is often easier to consider the log of the likelihood function (since the log is monotone, maximizing the log-likelihood is the same as maximizing the likelihood). Example: assume that all the points in $D$ are drawn from some (one-dimensional) normal distribution with some particular variance (and unknown mean).
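A small sketch (my own example, not the lecture's): maximum likelihood done numerically for data assumed to come from $N(\mu,1)$, by evaluating the log-likelihood on a grid of candidate means and picking the maximizer; it coincides with the sample mean derived analytically in the later slides.

import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=50)   # x_k ~ N(mu=2, sigma=1)

def log_likelihood(mu, x):
    # log p(D | mu) for the N(mu, 1) model: sum of log densities
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2)

grid = np.linspace(0.0, 4.0, 4001)
ll = np.array([log_likelihood(m, data) for m in grid])
mu_hat = grid[np.argmax(ll)]

print(mu_hat)        # maximizer of the log-likelihood on the grid
print(data.mean())   # analytic ML estimate: the sample mean (essentially equal)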

Maximum Likelihood - Illustration

Maximum Likelihood - Cntd. If $p(D\mid\theta)$ is "well-behaved" and, in particular, differentiable, we can find $\hat\theta$ using standard differential calculus. Suppose $\theta = (\theta_1,\dots,\theta_p)$ is a vector of $p$ parameters. Then $\hat\theta$ satisfies $\nabla_\theta \log p(D\mid\theta) = 0$, and we must verify that the solution is a global maximum.

Example - Maximum Likelihood Suppose we know that each data point $x_k$ is distributed according to a normal distribution with known standard deviation 1 but with unknown mean $\mu$, so $\log p(D\mid\mu) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{k=1}^n (x_k-\mu)^2$. Differentiating, we have $\frac{\partial}{\partial\mu}\log p(D\mid\mu) = \sum_{k=1}^n (x_k-\mu) = 0$. So $\hat\mu = \frac{1}{n}\sum_{k=1}^n x_k$, the sample mean.

Example: Normal, Unknown $\mu$ and $\sigma^2$ Suppose each data point is distributed $x_k \sim N(\mu,\sigma^2)$ with both parameters unknown. We have $\log p(D\mid\mu,\sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{k=1}^n (x_k-\mu)^2$, so setting the partial derivatives with respect to $\mu$ and $\sigma^2$ to zero gives $\hat\mu = \frac{1}{n}\sum_{k=1}^n x_k$ and $\hat\sigma^2 = \frac{1}{n}\sum_{k=1}^n (x_k-\hat\mu)^2$.
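A hedged sketch of these closed-form ML estimates, checked against data simulated from known parameter values (the values 1.5 and 2.0 are my own choices).

import numpy as np

rng = np.random.default_rng(2)
true_mu, true_sigma = 1.5, 2.0
x = rng.normal(true_mu, true_sigma, size=10_000)

mu_ml = x.mean()                        # hat{mu} = (1/n) sum x_k
sigma2_ml = np.mean((x - mu_ml) ** 2)   # hat{sigma}^2 = (1/n) sum (x_k - hat{mu})^2

print(mu_ml, sigma2_ml)   # close to 1.5 and 4.0 for large n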

Biased Estimators In the last example the ML estimator of $\sigma^2$ was $\hat\sigma^2 = \frac{1}{n}\sum_{k=1}^n (x_k-\hat\mu)^2$. This estimate is biased; that is, $E[\hat\sigma^2] = \frac{n-1}{n}\sigma^2 \ne \sigma^2$. Claim: this estimator is asymptotically unbiased (approaches an unbiased estimate; see below). To see the bias it is sufficient to consider one data point (that is, $n=1$): then $\hat\mu = x_1$ and $\hat\sigma^2 = 0$, regardless of the true $\sigma^2$. Unbiased estimates for $\sigma^2$ (and for the covariance) are obtained by dividing by $n-1$ instead of $n$: $s^2 = \frac{1}{n-1}\sum_{k=1}^n (x_k-\hat\mu)^2$.
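A simulation (my own sketch) illustrating the bias: averaged over many small samples, the ML variance estimate is roughly $\frac{n-1}{n}\sigma^2$, while dividing by $n-1$ instead of $n$ is unbiased.

import numpy as np

rng = np.random.default_rng(3)
n, sigma2, trials = 5, 1.0, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
mu_hat = samples.mean(axis=1, keepdims=True)

ml_var = np.mean((samples - mu_hat) ** 2, axis=1)              # divide by n (biased)
unbiased_var = np.sum((samples - mu_hat) ** 2, axis=1) / (n - 1)

print(ml_var.mean())        # ~ (n-1)/n * sigma2 = 0.8
print(unbiased_var.mean())  # ~ sigma2 = 1.0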

ML Estimators for Multivariate Normal PDF A similar (but much more involved) calculation yields the following ML estimators for the multivariate normal density with unknown mean vector $\mu$ and unknown covariance matrix $\Sigma$: $\hat\mu = \frac{1}{n}\sum_{k=1}^n x_k$ and $\hat\Sigma = \frac{1}{n}\sum_{k=1}^n (x_k-\hat\mu)(x_k-\hat\mu)^T$. The estimator for the mean is unbiased and the estimator for the covariance matrix is biased. An unbiased estimator is $C = \frac{1}{n-1}\sum_{k=1}^n (x_k-\hat\mu)(x_k-\hat\mu)^T$.
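A numpy sketch of these multivariate estimators (variable names and the 2-d example are my own): the sample mean vector, the ML covariance (divide by n), and the unbiased covariance (divide by n-1), compared with numpy's np.cov.

import numpy as np

rng = np.random.default_rng(4)
true_mean = np.array([0.0, 1.0])
true_cov = np.array([[2.0, 0.5],
                     [0.5, 1.0]])
X = rng.multivariate_normal(true_mean, true_cov, size=5_000)   # shape (n, d)
n = X.shape[0]

mu_hat = X.mean(axis=0)
centered = X - mu_hat
cov_ml = centered.T @ centered / n             # biased ML estimator
cov_unbiased = centered.T @ centered / (n - 1)

print(mu_hat)
print(cov_ml)
print(np.allclose(cov_unbiased, np.cov(X, rowvar=False)))   # True: np.cov uses n-1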

Bayesian Parameter Estimation Here again, the form of the source density $p(x\mid\theta)$ is assumed to be known but the parameter $\theta$ is unknown. We assume that $\theta$ is a random variable. Our initial knowledge (guess) about $\theta$, before observing the data $D$, is given by the prior $p(\theta)$. We use the sample data $D$ (drawn independently according to $p(x\mid\theta)$) to compute the posterior $p(\theta\mid D) = \frac{p(D\mid\theta)\,p(\theta)}{p(D)}$. Since $D$ is drawn i.i.d., $p(D\mid\theta) = \prod_{k=1}^n p(x_k\mid\theta)$. Recall that $p(D) = \int p(D\mid\theta)\,p(\theta)\,d\theta$ is a normalizing factor.

Bayesian Estimation The prior is typically "broad" or "flat", reflecting the fact that we don't know a lot about the parameter values. The data we see is more consistent with some values of the parameters, and therefore we expect the posterior to peak sharply around the more likely values.

Bayesian Estimation Recall that our goal (in the isolated problem) is to estimate the class-conditional density of the $j$th class, given the labeled data $D_j$ of that class. Using the posterior $p(\theta\mid D)$ we compute the class-conditional $p(x\mid D) = \int p(x\mid\theta)\,p(\theta\mid D)\,d\theta$, the weighted average of $p(x\mid\theta)$ over all possible values of $\theta$.

Bayesian Estimation - Example Suppose we know that the class-conditional p.d.f. is normal with unknown mean: $p(x\mid\mu) = N(\mu,\sigma^2)$ with $\sigma^2$ known. Also, suppose the prior is $p(\mu) = N(\mu_0,\sigma_0^2)$ with both $\mu_0$ and $\sigma_0^2$ known. We imagine that Nature draws a value for $\mu$ using $p(\mu)$ and then i.i.d. chooses the data $D = \{x_1,\dots,x_n\}$ using $p(x\mid\mu)$. We now calculate the posterior $p(\mu\mid D)$.

Bayesian Estimation - Example cntd. The answer (exercise): $p(\mu\mid D)$ is normal, $p(\mu\mid D) = N(\mu_n,\sigma_n^2)$. Letting $\bar x_n = \frac{1}{n}\sum_{k=1}^n x_k$, we get $\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}\,\bar x_n + \frac{\sigma^2}{n\sigma_0^2+\sigma^2}\,\mu_0$ and $\sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2+\sigma^2}$. Hint: to save algebraic manipulations, note that any p.d.f. of the form $c\,e^{-(a\mu^2+b\mu)}$ with $a>0$ is normal. Notice that $\mu_n$ is a convex combination of $\bar x_n$ and $\mu_0$. Always $\sigma_n^2 \le \sigma_0^2$, and $\sigma_n^2 \to 0$ as $n$ grows; after observing $n$ samples, $\mu_n$ is our "best guess" for $\mu$.
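A sketch of this posterior update (the parameter names mu0, sigma0_sq, sigma_sq and the numeric values are my own): $\mu_n$ moves from the prior mean toward the sample mean, and $\sigma_n^2$ shrinks as $n$ grows.

import numpy as np

def posterior_normal_mean(x, mu0, sigma0_sq, sigma_sq):
    """Posterior N(mu_n, sigma_n^2) of the mean, for known data variance sigma_sq
    and prior N(mu0, sigma0_sq)."""
    n = len(x)
    xbar = np.mean(x)
    mu_n = (n * sigma0_sq * xbar + sigma_sq * mu0) / (n * sigma0_sq + sigma_sq)
    sigma_n_sq = (sigma0_sq * sigma_sq) / (n * sigma0_sq + sigma_sq)
    return mu_n, sigma_n_sq

rng = np.random.default_rng(5)
true_mu, sigma = 3.0, 1.0
for n in (1, 10, 100, 1000):
    x = rng.normal(true_mu, sigma, size=n)
    mu_n, var_n = posterior_normal_mean(x, mu0=0.0, sigma0_sq=4.0, sigma_sq=sigma**2)
    print(n, round(mu_n, 3), round(var_n, 5))   # mu_n -> 3.0, var_n -> 0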

Bayesian Estimation - Example cntd. After determining the posterior it remains to calculate the class-conditional $p(x\mid D) = \int p(x\mid\mu)\,p(\mu\mid D)\,d\mu$, which turns out to be normal: $p(x\mid D) = N(\mu_n,\ \sigma^2+\sigma_n^2)$, where $\mu_n$ and $\sigma_n^2$ are as above.
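A numeric check (my own, with illustrative values for mu_n, sigma_n_sq and sigma_sq, and assuming scipy is available): integrating $p(x\mid\mu)\,p(\mu\mid D)$ over $\mu$ on a fine grid reproduces the closed-form predictive density $N(\mu_n, \sigma^2+\sigma_n^2)$.

import numpy as np
from scipy.stats import norm

# Illustrative posterior parameters (as would come from the previous sketch).
mu_n, sigma_n_sq, sigma_sq = 2.5, 0.04, 1.0

# Riemann-sum approximation of p(x|D) = ∫ p(x|mu) p(mu|D) dmu.
mu_grid = np.linspace(mu_n - 2.0, mu_n + 2.0, 200_001)   # +/- 10 posterior std devs
dmu = mu_grid[1] - mu_grid[0]
post = norm.pdf(mu_grid, loc=mu_n, scale=np.sqrt(sigma_n_sq))

x = 3.1
numeric = np.sum(norm.pdf(x, loc=mu_grid, scale=np.sqrt(sigma_sq)) * post) * dmu
closed_form = norm.pdf(x, loc=mu_n, scale=np.sqrt(sigma_sq + sigma_n_sq))

print(numeric, closed_form)   # agree to several decimal places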

Another Example: Prob. of Sun Rising Question: What is the probability that the sun will rise tomorrow? Laplace's (Bayesian) answer: Assume that each day the sun rises with probability $\theta$ (a Bernoulli process) and that $\theta$ is distributed uniformly in $[0,1]$. Suppose there were $n$ sun rises so far. What is the probability of an $(n+1)$st rise? Denote the data set by $D = \{x_1,\dots,x_n\}$ where $x_i = 1$ for all $i$.

Prob. of Sun Rising - Cntd. We have $p(\theta\mid D) = \frac{p(D\mid\theta)\,p(\theta)}{p(D)} = \frac{\theta^n}{\int_0^1 \theta^n\,d\theta}$. Therefore, $P(x_{n+1}=1\mid D) = \int_0^1 \theta\,p(\theta\mid D)\,d\theta = \frac{\int_0^1 \theta^{n+1}\,d\theta}{\int_0^1 \theta^n\,d\theta} = \frac{n+1}{n+2}$. This is called Laplace's law of succession. Notice that ML gives $\hat\theta = 1$.
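A sketch checking the computation (function name laplace_rule is my own): with a uniform prior and $n$ observed rises, the posterior predictive probability of another rise is $(n+1)/(n+2)$, while the ML estimate is 1.

import numpy as np
from fractions import Fraction

def laplace_rule(n):
    # Posterior predictive P(x_{n+1} = 1 | n rises, uniform prior on theta)
    #   = ∫ theta * theta^n dtheta / ∫ theta^n dtheta = (n+1)/(n+2)
    return Fraction(n + 1, n + 2)

for n in (0, 1, 10, 1_000_000):
    print(n, laplace_rule(n), float(laplace_rule(n)))

# Numerical check of the ratio of integrals for n = 10 (Riemann sums on a grid).
theta = np.linspace(0.0, 1.0, 100_001)
n = 10
print(np.sum(theta ** (n + 1)) / np.sum(theta ** n), (n + 1) / (n + 2))  # ~0.9167 both

# The ML estimate after n rises and no failures is theta_hat = n/n = 1:
# it predicts tomorrow's sunrise with certainty.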

Maximum Likelihood vs. Bayesian ML and Bayesian estimation are asymptotically equivalent: they yield the same class-conditional densities when the size of the training data grows to infinity. ML is typically computationally easier: e.g., consider the case where the p.d.f. is "nice" (i.e. differentiable). In ML we need to do (multidimensional) differentiation, while in Bayesian estimation we need (multidimensional) integration. ML is often easier to interpret: it returns the single best model (parameter), whereas Bayesian estimation gives a weighted average of models. But for finite training data (and given a reliable prior) Bayesian estimation is more accurate (it uses more of the information). Bayesian estimation with a "flat" prior is essentially ML. With asymmetric and broad priors the methods lead to different solutions.