Machine Learning – Lecture 3: Probability Density Estimation II
24.10.2013
Bastian Leibe, RWTH Aachen
Many slides adapted from B. Schiele

Announcements (2)
Different lecture room on Thursday for the next weeks!
- This week (24.10., 10:15 – 11:45): AH I
- Next week (31.10., 10:15 – 11:45): 5056
- After that, we'll see...
We'll also send around an announcement for this.

Announcements
Exercise sheet 1 is now available on L2P
- Bayes decision theory
- Maximum Likelihood
- Kernel density estimation / k-NN
- Submit your results to Ishrat until the evening of the deadline given on the sheet.
Work in teams (of up to 3 people) is encouraged
- Who is not part of an exercise team yet?

Course Outline
Fundamentals (2 weeks)
- Bayes Decision Theory
- Probability Density Estimation
Discriminative Approaches (5 weeks)
- Linear Discriminant Functions
- Support Vector Machines
- Ensemble Methods & Boosting
- Randomized Trees, Forests & Ferns
Generative Models (4 weeks)
- Bayesian Networks
- Markov Random Fields

Topics of This Lecture
Recap: Bayes Decision Theory
Parametric Methods
- Recap: Maximum Likelihood approach
- Bayesian Learning
Non-Parametric Methods
- Histograms
- Kernel density estimation
- K-Nearest Neighbors
- k-NN for Classification
- Bias-Variance tradeoff

Recap: Bayes Decision Theory
Optimal decision rule
- Decide for C_1 if p(C_1|x) > p(C_2|x).
- This is equivalent to p(x|C_1) p(C_1) > p(x|C_2) p(C_2).
- Which is again equivalent to the Likelihood-Ratio test: p(x|C_1)/p(x|C_2) > p(C_2)/p(C_1), where the right-hand side is the decision threshold.
Slide credit: Bernt Schiele
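The following is a minimal Python sketch of this likelihood-ratio test; it is not part of the original slides, and the 1-D Gaussian class-conditional densities and the priors are assumed example values.

```python
import numpy as np
from scipy.stats import norm

# Assumed example: 1-D Gaussian class-conditional densities and given priors.
p_x_given_c1 = norm(loc=0.0, scale=1.0).pdf   # p(x|C1)
p_x_given_c2 = norm(loc=2.0, scale=1.0).pdf   # p(x|C2)
prior_c1, prior_c2 = 0.7, 0.3                 # P(C1), P(C2)

def decide(x):
    # Decide for C1 iff p(x|C1)/p(x|C2) > P(C2)/P(C1)
    likelihood_ratio = p_x_given_c1(x) / p_x_given_c2(x)
    threshold = prior_c2 / prior_c1
    return "C1" if likelihood_ratio > threshold else "C2"

print(decide(0.5), decide(1.8))
```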

Recap: Bayes Decision Theory
Decision regions: R_1, R_2, R_3, ...
Slide credit: Bernt Schiele

Recap: Classifying with Loss Functions
We can formalize the intuition that different decisions have different weights by introducing a loss matrix with entries L_kj = loss incurred when deciding for class j while the true class is k (rows indexed by the truth, columns by the decision).
Example: cancer diagnosis.

Recap: Minimizing the Expected Loss
Optimal solution is the one that minimizes the loss.
- But: the loss function depends on the true class, which is unknown.
Solution: Minimize the expected loss
  E[L] = Σ_k Σ_j ∫_{R_j} L_kj p(x, C_k) dx
This can be done by choosing the regions R_j such that each x is assigned to the class j that minimizes Σ_k L_kj p(C_k|x).
- Adapted decision rule: decide for C_1 if Σ_k L_k1 p(C_k|x) < Σ_k L_k2 p(C_k|x).
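As an illustration (not from the slides), here is a short sketch of picking the decision that minimizes the expected loss Σ_k L_kj p(C_k|x); the loss matrix and the posterior values are assumed, in the spirit of the cancer-diagnosis example.

```python
import numpy as np

# L[k, j] = loss of deciding class j when the true class is k (assumed values).
# Classes: 0 = cancer, 1 = healthy.
L = np.array([[0.0, 100.0],   # true = cancer:  correct / missed diagnosis
              [1.0,   0.0]])  # true = healthy: false alarm / correct
posterior = np.array([0.3, 0.7])   # p(cancer|x), p(healthy|x)  (assumed)

expected_loss = posterior @ L       # R(decision j | x) = sum_k L_kj p(C_k|x)
decision = expected_loss.argmin()
print(expected_loss, "-> decide class", decision)   # decides "cancer" despite lower posterior
```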

Recap: Gaussian (or Normal) Distribution
One-dimensional case (mean µ, variance σ²):
  N(x|µ, σ²) = 1/√(2πσ²) exp(-(x - µ)²/(2σ²))
Multi-dimensional case (mean µ, covariance Σ):
  N(x|µ, Σ) = 1/((2π)^(D/2) |Σ|^(1/2)) exp(-(x - µ)^T Σ^(-1) (x - µ)/2)
Image source: C.M. Bishop, 2006

Recap: Maximum Likelihood Approach
Computation of the likelihood
- Single data point: p(x_n|θ)
- Assumption: all data points are independent, so L(θ) = p(X|θ) = Π_n p(x_n|θ)
- Log-likelihood: ln L(θ) = Σ_n ln p(x_n|θ)
Estimation of the parameters θ (Learning)
- Maximize the likelihood (= minimize the negative log-likelihood).
- Take the derivative with respect to θ and set it to zero.
Slide credit: Bernt Schiele
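A small sketch of ML estimation for a 1-D Gaussian on synthetic data (assumed example); the closed-form estimates follow from setting the derivatives of the negative log-likelihood to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic i.i.d. sample (assumed data)

mu_ml = X.mean()                       # d/dmu of -ln L = 0  ->  sample mean
sigma2_ml = ((X - mu_ml) ** 2).mean()  # d/dsigma2 of -ln L = 0 -> (biased) variance estimate

# Negative log-likelihood at the ML estimates.
neg_log_lik = 0.5 * len(X) * np.log(2 * np.pi * sigma2_ml) \
              + 0.5 * ((X - mu_ml) ** 2).sum() / sigma2_ml
print(mu_ml, sigma2_ml, neg_log_lik)
```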

Topics of This Lecture
Recap: Bayes Decision Theory
Parametric Methods
- Recap: Maximum Likelihood approach
- Bayesian Learning
Non-Parametric Methods
- Histograms
- Kernel density estimation
- K-Nearest Neighbors
- k-NN for Classification
- Bias-Variance tradeoff

Bayesian Approach to Parameter Learning
Conceptual shift
- Maximum Likelihood views the true parameter vector θ as unknown, but fixed.
- In Bayesian learning, we consider θ to be a random variable.
This allows us to use knowledge about the parameters θ
- i.e. to use a prior for θ.
- The training data then converts this prior distribution on θ into a posterior probability density.
- The prior thus encodes the knowledge we have about the type of distribution we expect to see for θ.
Slide adapted from Bernt Schiele

Bayesian Learning Approach
Bayesian view:
- Consider the parameter vector θ as a random variable.
- When estimating the parameters, what we compute is
    p(x|X) = ∫ p(x, θ|X) dθ = ∫ p(x|θ, X) p(θ|X) dθ
- Assumption: given θ, p(x|θ, X) = p(x|θ), i.e. it does not depend on X anymore; it is entirely determined by the parameter θ (i.e. by the parametric form of the pdf).
Slide adapted from Bernt Schiele

Bayesian Learning Approach
Inserting this above, we obtain
    p(x|X) = ∫ p(x|θ) p(θ|X) dθ
Slide credit: Bernt Schiele

Bayesian Learning Approach
Discussion
- The posterior over the parameters follows from Bayes' theorem:
    p(θ|X) = p(X|θ) p(θ) / p(X),   with p(X) = ∫ p(X|θ) p(θ) dθ
  Here p(X|θ) is the likelihood of the parametric form θ given the data set X, p(θ) is the prior for the parameters θ, and p(X) is the normalization obtained by integrating over all possible values of θ.
- If we now plug in a (suitable) prior p(θ), we can estimate p(x|X) from the data set X; p(x|θ) is the estimate for x based on the parametric form θ.

Bayesian Density Estimation
Discussion
- The probability p(θ|X) makes the dependency of the estimate on the data explicit.
- If p(θ|X) is very small everywhere, but is large for one value θ̂, then p(x|X) ≈ p(x|θ̂).
- The more uncertain we are about θ, the more we average over all parameter values.
Slide credit: Bernt Schiele

Bayesian Density Estimation
Problem
- In the general case, the integration over θ is not possible (or only possible stochastically).
Example where an analytical solution is possible
- Normal distribution for the data, σ² assumed known and fixed.
- Estimate the distribution of the mean: p(µ|X).
- Prior: We assume a Gaussian prior over µ, p(µ) = N(µ|µ_0, σ_0²).
Slide credit: Bernt Schiele

Bayesian Learning Approach
Sample mean: x̄ = (1/N) Σ_n x_n
Bayes estimate (the posterior over µ is again Gaussian, N(µ|µ_N, σ_N²)):
    µ_N = (σ² µ_0 + N σ_0² x̄) / (N σ_0² + σ²)
    1/σ_N² = 1/σ_0² + N/σ²
Note: for N → ∞, µ_N → x̄ and σ_N² → 0, so the posterior concentrates on the sample mean.
Slide adapted from Bernt Schiele
Image source: C.M. Bishop, 2006
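A sketch of this analytically tractable case on synthetic data; the prior parameters mu0, sigma0_2 and the data-generating values are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 1.0                        # data variance, assumed known and fixed
mu0, sigma0_2 = 0.0, 2.0            # Gaussian prior N(mu | mu0, sigma0_2) (assumed)
X = rng.normal(loc=1.5, scale=np.sqrt(sigma2), size=20)   # synthetic sample

N = len(X)
x_bar = X.mean()                    # sample mean
mu_N = (sigma2 * mu0 + N * sigma0_2 * x_bar) / (N * sigma0_2 + sigma2)
sigma_N_2 = 1.0 / (1.0 / sigma0_2 + N / sigma2)
print(mu_N, sigma_N_2)              # posterior concentrates on x_bar as N grows
```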

Summary: ML vs. Bayesian Learning
Maximum Likelihood
- Simple approach, often analytically possible.
- Problem: the estimation is biased and tends to overfit to the data.
- Often needs some correction or regularization.
- But: the approximation gets accurate for N → ∞.
Bayesian Learning
- General approach, avoids the estimation bias through a prior.
- Problems: need to choose a suitable prior (not always obvious); the integral over θ is often not analytically feasible anymore.
- But: efficient stochastic sampling techniques are available.
(In this lecture, we'll use both concepts wherever appropriate.)

Topics of This Lecture
Recap: Bayes Decision Theory
Parametric Methods
- Recap: Maximum Likelihood approach
- Bayesian Learning
Non-Parametric Methods
- Histograms
- Kernel density estimation
- K-Nearest Neighbors
- k-NN for Classification
- Bias-Variance tradeoff

Non-Parametric Methods
Non-parametric representations
- Often the functional form of the distribution is unknown.
Estimate the probability density directly from the data
- Histograms
- Kernel density estimation (Parzen window / Gaussian kernels)
- k-Nearest-Neighbor
Slide credit: Bernt Schiele

Histograms
Basic idea:
- Partition the data space into distinct bins with widths Δ_i and count the number of observations, n_i, in each bin:
    p_i = n_i / (N Δ_i)
- Often, the same width is used for all bins, Δ_i = Δ.
- This can be done, in principle, for any dimensionality D... but the required number of bins grows exponentially with D!
Image source: C.M. Bishop, 2006
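A minimal sketch of a 1-D histogram density estimate p_i = n_i/(N Δ); the data and the bin width are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=500)                       # synthetic 1-D sample (assumed)

delta = 0.25                                   # bin width (smoothing parameter)
edges = np.arange(X.min(), X.max() + delta, delta)
counts, edges = np.histogram(X, bins=edges)
p_hist = counts / (len(X) * delta)             # density value per bin, integrates to ~1

def p(x):
    """Histogram density estimate at point x (0 outside the covered range)."""
    i = np.searchsorted(edges, x, side="right") - 1
    return p_hist[i] if 0 <= i < len(p_hist) else 0.0

print(p(0.0), p(5.0))
```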

Histograms
The bin width Δ acts as a smoothing factor.
(Figure: histogram estimates with small, intermediate, and large bin width — not smooth enough / about OK / too smooth.)
Image source: C.M. Bishop, 2006

Summary: Histograms
Properties
- Very general. In the limit (N → ∞), every probability density can be represented.
- No need to store the data points once the histogram is computed.
- Rather brute-force.
Problems
- High-dimensional feature spaces: a D-dimensional space with M bins per dimension will require M^D bins!
- Requires an exponentially growing number of data points ("curse of dimensionality").
- Discontinuities at bin edges.
- Bin size? Too large: too much smoothing; too small: too much noise.
Slide credit: Bernt Schiele

Statistically Better-Founded Approach
Data point x comes from pdf p(x)
- Probability that x falls into a small region R:  P = ∫_R p(y) dy
If R is sufficiently small, p(x) is roughly constant
- Let V be the volume of R; then P ≈ p(x) V.
If the number N of samples is sufficiently large, we can estimate P as
    P ≈ K/N,   and therefore   p(x) ≈ K / (N V)
Slide credit: Bernt Schiele

Statistically Better-Founded Approach
The relation p(x) ≈ K / (N V) can be exploited in two ways:
- Kernel methods: fix the volume V and determine K from the data. Example: determine the number K of data points inside a fixed hypercube.
- K-Nearest Neighbor: fix K and determine V from the data.
Slide credit: Bernt Schiele

Kernel Methods
Parzen Window
- Hypercube of dimension D with edge length h, described by the "kernel function"
    k(u) = 1 if |u_i| ≤ 1/2 for all i = 1,...,D, and 0 otherwise
- Number of points inside the cube centred on x:  K = Σ_n k((x - x_n)/h)
- Probability density estimate:
    p(x) ≈ K / (N V) = (1 / (N h^D)) Σ_n k((x - x_n)/h)
Slide credit: Bernt Schiele
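A sketch of the Parzen-window estimate with the hypercube kernel; the 2-D synthetic data and the edge length h are assumptions.

```python
import numpy as np

def parzen_hypercube(x, X, h):
    """Density estimate at x given data X (N x D array) and edge length h."""
    u = (x - X) / h                              # scaled offsets to all data points
    inside = np.all(np.abs(u) <= 0.5, axis=1)    # hypercube kernel value per data point
    K = inside.sum()                             # number of points in the cube around x
    N, D = X.shape
    return K / (N * h ** D)

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 2))                   # assumed 2-D sample
print(parzen_hypercube(np.zeros(2), X, h=0.5))
```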

Kernel Methods: Parzen Window
Interpretations
1. We place a kernel window k at location x and count how many data points fall inside it.
2. We place a kernel window k around each data point x_n and sum up their influences at location x. This gives a direct visualization of the density.
Still, we have artificial discontinuities at the cube boundaries...
- We can obtain a smoother density model if we choose a smoother kernel function, e.g. a Gaussian.

Kernel Methods: Gaussian Kernel
Gaussian kernel
- Kernel function:  k(u) = (1 / (2πh²)^(D/2)) exp(-||u||² / (2h²))
- Probability density estimate:
    p(x) ≈ (1/N) Σ_n (1 / (2πh²)^(D/2)) exp(-||x - x_n||² / (2h²))
Slide credit: Bernt Schiele
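A corresponding sketch with the Gaussian kernel; the bandwidth h and the data are again assumed example values.

```python
import numpy as np

def gaussian_kde(x, X, h):
    """Gaussian-kernel density estimate at x: p(x) = 1/N sum_n N(x | x_n, h^2 I)."""
    N, D = X.shape
    sq_dist = np.sum((x - X) ** 2, axis=1)           # ||x - x_n||^2
    norm_const = (2 * np.pi * h ** 2) ** (D / 2)
    return np.mean(np.exp(-sq_dist / (2 * h ** 2)) / norm_const)

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 2))
print(gaussian_kde(np.zeros(2), X, h=0.3))           # h controls the smoothing
```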

Gauss Kernel: Examples
h acts as a smoother.
(Figure: kernel density estimates with small, intermediate, and large h — not smooth enough / about OK / too smooth.)
Image source: C.M. Bishop, 2006

Kernel Methods
In general
- Any kernel k(u) such that k(u) ≥ 0 and ∫ k(u) du = 1 can be used.
- Then K = Σ_n k((x - x_n)/h) with volume V = h^D, and we get the probability density estimate
    p(x) ≈ K / (N V) = (1 / (N h^D)) Σ_n k((x - x_n)/h)
Slide adapted from Bernt Schiele

Statistically Better-Founded Approach
- Kernel methods: fixed V, determine K.
- K-Nearest Neighbor: fixed K, determine V — increase the volume V until the K nearest data points are found.
Slide credit: Bernt Schiele

K-Nearest Neighbor
Nearest-Neighbor density estimation
- Fix K, estimate V from the data.
- Consider a hypersphere centred on x and let it grow to a volume V* that includes K of the given N data points.
- Then p(x) ≈ K / (N V*).
Side note
- Strictly speaking, the model produced by K-NN is not a true density model, because the integral over all space diverges.
- E.g. consider K = 1 and evaluate the density exactly on a data point, x = x_j.
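A sketch of k-NN density estimation in this spirit: grow a sphere around x out to the K-th neighbour and use p(x) ≈ K/(N V); the data and the choice of K are illustrative.

```python
import numpy as np
from scipy.special import gamma

def knn_density(x, X, K):
    """k-NN density estimate at x given data X (N x D array)."""
    N, D = X.shape
    dists = np.sort(np.linalg.norm(X - x, axis=1))
    r = dists[K - 1]                                    # radius reaching the K-th neighbour
    V = np.pi ** (D / 2) / gamma(D / 2 + 1) * r ** D    # volume of the D-dimensional ball
    return K / (N * V)

rng = np.random.default_rng(5)
X = rng.normal(size=(2000, 2))                          # assumed 2-D sample
print(knn_density(np.zeros(2), X, K=10))
```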

k-Nearest Neighbor: Examples
K acts as a smoother.
(Figure: k-NN density estimates with small, intermediate, and large K — not smooth enough / about OK / too smooth.)
Image source: C.M. Bishop, 2006

Summary: Kernel and k-NN Density Estimation
Properties
- Very general. In the limit (N → ∞), every probability density can be represented.
- No computation involved in the training phase, simply storage of the training set.
Problems
- Requires storing and computing with the entire dataset.
- Computational cost linear in the number of data points.
- This can be improved, at the expense of some computation during training, by constructing efficient tree-based search structures.
- Kernel size / K in K-NN? Too large: too much smoothing; too small: too much noise.

K-Nearest Neighbor Classification
Bayesian Classification:  p(C_j|x) = p(x|C_j) p(C_j) / p(x)
Here we have (with K_j of the K nearest neighbours of x belonging to class C_j, and N_j training points of class C_j):
- p(x|C_j) ≈ K_j / (N_j V)
- p(C_j) ≈ N_j / N
- p(x) ≈ K / (N V)
- k-Nearest Neighbor classification:  p(C_j|x) ≈ K_j / K
Slide credit: Bernt Schiele
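A sketch of the resulting k-NN classification rule (majority vote, p(C_j|x) ≈ K_j/K), run on assumed synthetic two-class data.

```python
import numpy as np

def knn_classify(x, X, y, K):
    """X: N x D training points, y: N integer labels; returns class and K_j/K estimates."""
    nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:K]   # indices of K nearest points
    counts = np.bincount(y[nearest])                          # K_j per class
    return counts.argmax(), counts / K                        # predicted class, p(C_j|x) estimate

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
print(knn_classify(np.array([2.5, 2.5]), X, y, K=5))
```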

K-Nearest Neighbors for Classification
(Figure: decision boundaries for K = 1 and K = 3.)
Image source: C.M. Bishop, 2006

K-Nearest Neighbors for Classification
Results on an example data set: K acts as a smoothing parameter.
Theoretical guarantee
- For N → ∞, the error rate of the 1-NN classifier is never more than twice the optimal error (obtained from the true conditional class distributions).
Image source: C.M. Bishop, 2006

Bias-Variance Tradeoff
Probability density estimation
- Histograms: bin size? Δ too large: too smooth (too much bias); Δ too small: not smooth enough (too much variance).
- Kernel methods: kernel size? h too large: too smooth; h too small: not smooth enough.
- K-Nearest Neighbor: K? K too large: too smooth; K too small: not smooth enough.
This is a general problem of many probability density estimation methods
- Including parametric methods and mixture models.
Slide credit: Bernt Schiele

Discussion
The methods discussed so far are all simple and easy to apply, and they are used in many practical applications. However...
- Histograms scale poorly with increasing dimensionality: only suitable for relatively low-dimensional data.
- Both k-NN and kernel density estimation require the entire data set to be stored: too expensive if the data set is large.
- Simple parametric models are very restricted in the forms of distributions they can represent: only suitable if the data has the same general form as the model.
We need density models that are both efficient and flexible! (Next lecture...)

References and Further Reading
More information in Bishop's book
- Gaussian distribution and ML
- Bayesian Learning
- Nonparametric methods
Additional information can be found in Duda & Hart
- ML estimation: Ch. 3.2
- Bayesian Learning
- Nonparametric methods

Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd Ed., Wiley-Interscience, 2000