Ch 2. Probability Distributions
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Updated by B.-H. Kim, summarized by M.H. Kim
Biointelligence Laboratory, Seoul National University

2.3 The Gaussian Distribution
- Bayesian inference for the Gaussian
- Student's t-distribution
- Periodic variables
- Mixtures of Gaussians
2.4 The Exponential Family
- Maximum likelihood and sufficient statistics
- Conjugate priors
- Noninformative priors
2.5 Nonparametric Methods
- Kernel density estimators
- Nearest-neighbour methods

Bayesian inference for the Gaussian
- The maximum likelihood framework gives only point estimates of the parameters; a Bayesian treatment of parameter estimation introduces prior distributions over the parameters
- Three cases
  - [C1] When the covariance is known, inferring the mean
  - [C2] When the mean is known, inferring the covariance
  - [C3] Both mean and covariance are unknown
- Conjugate prior
  - Keeps the functional form of the posterior consistent with that of the prior
  - Can be interpreted as effective fictitious data points (a general property of the exponential family)

[C1] When the covariance is known, inferring the mean
- A single Gaussian random variable x, given a set of N observations X = {x_1, …, x_N}
- The likelihood function takes the form of the exponential of a quadratic form in μ
- Thus, if we choose a prior p(μ) given by a Gaussian, it will be a conjugate distribution for this likelihood function
- The conjugate prior distribution is therefore itself Gaussian
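The equations referenced on this slide are not reproduced in the transcript; as a sketch, the standard forms of the likelihood (for known variance σ²) and of the Gaussian conjugate prior are:

\[
p(\mathbf{X}\mid\mu) = \prod_{n=1}^{N}\mathcal{N}(x_n\mid\mu,\sigma^2),
\qquad
p(\mu) = \mathcal{N}(\mu\mid\mu_0,\sigma_0^2)
\]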

[C1] When the covariance is known, inferring the mean
- The posterior distribution is again Gaussian, with posterior mean and variance given below
- The posterior mean is a compromise between the prior mean μ_0 and the maximum likelihood estimate μ_ML
- Precision (inverse variance) is additive: the posterior precision is the prior precision plus one contribution of data precision per observed point
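For reference, the standard expressions for this posterior (not legible in the transcript) are:

\[
p(\mu\mid\mathbf{X}) = \mathcal{N}(\mu\mid\mu_N,\sigma_N^2),
\qquad
\mu_N = \frac{\sigma^2}{N\sigma_0^2+\sigma^2}\,\mu_0
      + \frac{N\sigma_0^2}{N\sigma_0^2+\sigma^2}\,\mu_{\mathrm{ML}},
\qquad
\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}
\]

As N grows, μ_N moves from μ_0 toward μ_ML and the posterior variance shrinks toward zero.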

[C1] When the covariance is known, inferring the mean
- Illustration of this Bayesian inference
  - Data points generated from N(x|0.8, ·); prior N(μ|0, 0.1²)
- Sequential estimation of the mean
  - The Bayesian paradigm leads very naturally to a sequential view of the inference problem
  - The posterior distribution after observing N-1 data points acts as the prior before the N-th data point is observed, and is combined with the likelihood function associated with x_N
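A minimal Python sketch of this sequential update (the generating variance is not legible in the transcript, so a known observation standard deviation σ = 0.1 is assumed here; the prior N(0, 0.1²) and true mean 0.8 follow the slide):

import numpy as np

rng = np.random.default_rng(0)

true_mean = 0.8
sigma = 0.1                      # assumed known observation std (not legible on the slide)
mu_N, sigma2_N = 0.0, 0.1**2     # prior N(mu | 0, 0.1^2), as on the slide

for x_n in rng.normal(true_mean, sigma, size=10):
    # Treat the current posterior as the prior and fold in one observation:
    # posterior precision = prior precision + 1/sigma^2 (precisions add)
    prec = 1.0 / sigma2_N + 1.0 / sigma**2
    mu_N = (mu_N / sigma2_N + x_n / sigma**2) / prec
    sigma2_N = 1.0 / prec
    print(f"mu_N = {mu_N:.3f}, sigma_N = {sigma2_N**0.5:.3f}")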

[C2] When the mean is known, inferring the covariance
- Likelihood function for the precision λ
- The conjugate prior distribution is the gamma distribution
- With a gamma prior Gam(λ|a_0, b_0), the posterior is again a gamma distribution
- [Figure: plots of Gam(λ|a, b) for several values of a and b]
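As a sketch of the omitted equations, the gamma prior and the resulting posterior hyperparameters have the standard form:

\[
\mathrm{Gam}(\lambda\mid a,b) = \frac{1}{\Gamma(a)}\,b^{a}\lambda^{a-1}e^{-b\lambda},
\qquad
a_N = a_0 + \frac{N}{2},
\qquad
b_N = b_0 + \frac{1}{2}\sum_{n=1}^{N}(x_n-\mu)^2 = b_0 + \frac{N}{2}\sigma_{\mathrm{ML}}^2
\]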

Meaning of the conjugate prior for the exponential family
- The effect of observing N data points
  - Increases the value of the coefficient a_0 by N/2, so 2a_0 can be read as the number of 'effective' prior observations
  - Increases the value of the coefficient b_0 by Nσ²_ML/2, so b_0 arises from the 2a_0 effective prior observations having variance b_0/a_0
- A conjugate prior can thus be interpreted as effective fictitious data points, a general property of the exponential family of distributions

[C3] Both mean and covariance are unknown
- The likelihood function now depends on both μ and λ
- Prior distribution: the product of a Gaussian whose precision is a linear function of λ and a gamma distribution over λ (the normal-gamma, or Gaussian-gamma, distribution)
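The normal-gamma prior mentioned here has the standard form (restored as a sketch, since the slide's equation is not in the transcript):

\[
p(\mu,\lambda) = \mathcal{N}\!\bigl(\mu\mid\mu_0,(\beta\lambda)^{-1}\bigr)\,\mathrm{Gam}(\lambda\mid a,b)
\]

Note that the precision of the Gaussian factor is βλ, i.e. a linear function of λ, as stated above.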

In the case of the multivariate Gaussian
- The same three cases arise: [C1] covariance known, inferring the mean; [C2] mean known, inferring the covariance; [C3] both mean and covariance unknown
- The form of the conjugate prior distribution:

  Case | Univariate     | Multivariate
  C1   | Gaussian       | Gaussian
  C2   | Gamma          | Wishart
  C3   | Gaussian-gamma | Gaussian-Wishart

Student's t-distribution
- If we have a univariate Gaussian together with a gamma prior over its precision and we integrate out the precision, we obtain the marginal distribution of x: Student's t-distribution
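Concretely (standard forms, filled in as a sketch), the marginalization and the resulting density are:

\[
p(x\mid\mu,a,b) = \int_0^{\infty}\mathcal{N}\!\bigl(x\mid\mu,\tau^{-1}\bigr)\,\mathrm{Gam}(\tau\mid a,b)\,d\tau
= \mathrm{St}(x\mid\mu,\lambda,\nu),
\]
\[
\mathrm{St}(x\mid\mu,\lambda,\nu) = \frac{\Gamma\!\left(\tfrac{\nu+1}{2}\right)}{\Gamma\!\left(\tfrac{\nu}{2}\right)}
\left(\frac{\lambda}{\pi\nu}\right)^{1/2}
\left[1+\frac{\lambda(x-\mu)^2}{\nu}\right]^{-\frac{\nu+1}{2}},
\qquad \nu = 2a,\ \ \lambda = a/b
\]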

Properties of Student's t-distribution
- Obtained by adding up an infinite number of Gaussian distributions having the same mean but different precisions: an infinite mixture of Gaussians
- Longer 'tails' than a Gaussian, giving robustness: much less sensitive than the Gaussian to outliers
- [Figure: ML solutions for Student's t-distribution (red) and a Gaussian (green), with and without an outlier in the data]

More on Student's t-distribution
- For regression problems
  - The least-squares approach does not exhibit robustness, because it corresponds to ML under a (conditional) Gaussian distribution
  - We obtain a more robust model by basing it on a heavy-tailed distribution such as a t-distribution
- Multivariate t-distribution, where D is the dimensionality of x
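The multivariate form referred to here is, as a sketch of the standard result:

\[
\mathrm{St}(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Lambda},\nu)
= \frac{\Gamma\!\left(\tfrac{D+\nu}{2}\right)}{\Gamma\!\left(\tfrac{\nu}{2}\right)}
\frac{|\boldsymbol{\Lambda}|^{1/2}}{(\pi\nu)^{D/2}}
\left[1+\frac{\Delta^2}{\nu}\right]^{-\frac{D+\nu}{2}},
\qquad
\Delta^2 = (\mathbf{x}-\boldsymbol{\mu})^{\mathrm T}\boldsymbol{\Lambda}(\mathbf{x}-\boldsymbol{\mu})
\]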

Periodic variables
- Problem setting
  - We want to evaluate the mean of a set of observations {θ_1, …, θ_N} of a periodic variable
  - To find an invariant measure of the mean, the observations are considered as points on the unit circle (Cartesian coordinates vs. the angular coordinate)
- The maximum likelihood estimator of the mean direction
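The maximum likelihood estimator referred to here is, in its standard form:

\[
\bar{\theta}_{\mathrm{ML}} = \tan^{-1}\!\left\{\frac{\sum_{n}\sin\theta_n}{\sum_{n}\cos\theta_n}\right\}
\]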

Von Mises distribution (circular normal)
- Setting for a periodic generalization of the Gaussian
  - Conditions on a distribution p(θ) with period 2π: it is non-negative, integrates to one over any interval of length 2π, and satisfies p(θ + 2π) = p(θ)
  - A Gaussian-like distribution that satisfies these properties: a 2D Gaussian conditioned on the unit circle
- The von Mises distribution; its normalizer involves I_0(m), the zeroth-order Bessel function of the first kind
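The von Mises density itself (omitted from the transcript) is:

\[
p(\theta\mid\theta_0,m) = \frac{1}{2\pi I_0(m)}\exp\{m\cos(\theta-\theta_0)\}
\]

where θ_0 is the mean direction and m is the concentration parameter, which plays a role analogous to the inverse variance of a Gaussian.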

Von Mises distribution (circular normal)
- [Figure: plots of the von Mises distribution, shown as a Cartesian plot and as a polar plot]

ML estimators for the parameters of the von Mises distribution
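The slide's equations are not in the transcript; as a sketch, the standard ML results for the von Mises parameters are:

\[
\theta_0^{\mathrm{ML}} = \tan^{-1}\!\left\{\frac{\sum_n\sin\theta_n}{\sum_n\cos\theta_n}\right\},
\qquad
A(m_{\mathrm{ML}}) = \frac{1}{N}\sum_{n=1}^{N}\cos\bigl(\theta_n-\theta_0^{\mathrm{ML}}\bigr),
\qquad
A(m) \equiv \frac{I_1(m)}{I_0(m)}
\]

so m_ML is obtained by inverting A(m) numerically.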

Some alternative techniques for constructing periodic distributions
- The simplest approach is to use a histogram of observations in which the angular coordinate is divided into fixed bins: simple and flexible, but significantly limited (Section 2.5)
- Approach 2: like the von Mises distribution, start from a Gaussian distribution over a Euclidean space, but now marginalize onto the unit circle rather than conditioning on it
- Approach 3: 'wrap' the real axis around the unit circle, mapping successive intervals of width 2π onto (0, 2π)
- Mixtures of von Mises distributions can also be used

Mixtures of Gaussians
- A mixture of Gaussians: from the sum and product rules, the marginal density is a weighted sum of Gaussian components
  - π_k: the mixing coefficients
  - γ_k(x): the responsibilities, which play an important role later on
- [Figures: an example data set that calls for a mixture distribution, and an example Gaussian mixture distribution with 3 Gaussians]
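The mixture density and the responsibilities take the standard form:

\[
p(\mathbf{x}) = \sum_{k=1}^{K}\pi_k\,\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k),
\qquad
0\le\pi_k\le 1,\ \ \sum_{k}\pi_k = 1,
\qquad
\gamma_k(\mathbf{x}) = \frac{\pi_k\,\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)}{\sum_{j}\pi_j\,\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)}
\]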

Mixtures of Gaussians
- The maximum likelihood solution for the parameters
  - No closed-form analytical solution
  - Requires iterative numerical optimization techniques, or expectation maximization (Chapter 9); a small sketch of the EM updates follows below
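As a minimal illustration (not from the slides; EM is treated properly in Chapter 9), one EM pass for a 1-D Gaussian mixture might look like this in Python:

import numpy as np

def em_step(x, pi, mu, var):
    """One EM update for a 1-D Gaussian mixture (illustrative sketch)."""
    # E-step: responsibilities gamma[n, k] = p(component k | x_n)
    x = x[:, None]                                           # shape (N, 1)
    log_norm = -0.5 * ((x - mu) ** 2 / var + np.log(2 * np.pi * var))
    gamma = pi * np.exp(log_norm)
    gamma /= gamma.sum(axis=1, keepdims=True)

    # M-step: re-estimate mixing coefficients, means and variances
    Nk = gamma.sum(axis=0)
    pi = Nk / len(x)
    mu = (gamma * x).sum(axis=0) / Nk
    var = (gamma * (x - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var

# Toy data: two Gaussian clusters
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(1.5, 0.8, 300)])

pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    pi, mu, var = em_step(x, pi, mu, var)
print(pi, mu, var)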

2.4 The Exponential Family
- The exponential family of distributions over x, given parameters η, is defined to be the set of distributions of the form shown below
  - x: scalar or vector
  - η: the natural parameters
  - u(x): some function of x (the sufficient statistic)
  - g(η): the inverse of the normalizer (there is also an equivalent alternative form)
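In symbols, with g(η) the inverse normalizer; the 'alternative form' mentioned on the slide is presumably the version that moves the normalizer into the exponent as a log-partition function:

\[
p(\mathbf{x}\mid\boldsymbol{\eta}) = h(\mathbf{x})\,g(\boldsymbol{\eta})\exp\{\boldsymbol{\eta}^{\mathrm T}\mathbf{u}(\mathbf{x})\}
= h(\mathbf{x})\exp\{\boldsymbol{\eta}^{\mathrm T}\mathbf{u}(\mathbf{x}) - A(\boldsymbol{\eta})\},
\qquad
A(\boldsymbol{\eta}) = -\ln g(\boldsymbol{\eta})
\]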

Bernoulli distribution as an exponential family
- Starting from the Bernoulli distribution, the natural parameter turns out to be the log-odds, and the mean parameter is recovered through the logistic sigmoid function, giving the exponential-family form below
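Written out (a standard manipulation, restored here as a sketch):

\[
\mathrm{Bern}(x\mid\mu) = \mu^{x}(1-\mu)^{1-x}
= (1-\mu)\exp\!\left\{x\ln\frac{\mu}{1-\mu}\right\},
\qquad
\eta = \ln\frac{\mu}{1-\mu},\qquad \mu = \sigma(\eta) = \frac{1}{1+e^{-\eta}},
\]
\[
u(x) = x,\qquad h(x) = 1,\qquad g(\eta) = \sigma(-\eta)
\]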

Multinomial distribution as an exponential family
- Starting from the multinomial distribution and removing the summation constraint by working with only M-1 parameters, the mean parameters are recovered through the softmax function, giving the exponential-family form below
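In symbols (a standard result, restored as a sketch):

\[
p(\mathbf{x}\mid\boldsymbol{\mu}) = \prod_{k=1}^{M}\mu_k^{x_k}
= \exp\!\left\{\sum_{k}x_k\ln\mu_k\right\},
\qquad
\mu_k = \frac{\exp(\eta_k)}{1+\sum_{j=1}^{M-1}\exp(\eta_j)}\ \ \text{(softmax)}
\]

where the M-th parameter is expressed through the constraint that the μ_k sum to one.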

Gaussian distribution as an exponential family
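The slide's derivation is not in the transcript; as a sketch, the standard identification for the univariate case is:

\[
\boldsymbol{\eta} = \begin{pmatrix}\mu/\sigma^{2}\\ -1/(2\sigma^{2})\end{pmatrix},
\qquad
\mathbf{u}(x) = \begin{pmatrix}x\\ x^{2}\end{pmatrix},
\qquad
h(x) = (2\pi)^{-1/2},
\qquad
g(\boldsymbol{\eta}) = (-2\eta_2)^{1/2}\exp\!\left(\frac{\eta_1^{2}}{4\eta_2}\right)
\]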

Maximum likelihood and sufficient statistics
- Maximum likelihood estimation of the parameter vector η in the general exponential family distribution
- Taking the gradient of both sides of the normalization condition yields the key property of the partition function: the negative gradient of ln g(η) equals the expectation of u(x)
  - The covariance of u(x) can be expressed in terms of the second derivatives of ln g(η), and the higher-order moments in terms of the higher derivatives
- (The same statements can be written using the log-partition function of the alternative form)
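The identities referred to here are:

\[
-\nabla\ln g(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{u}(\mathbf{x})],
\qquad
-\nabla\nabla\ln g(\boldsymbol{\eta}) = \operatorname{cov}[\mathbf{u}(\mathbf{x})]
\]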

Sufficient statistics
- With i.i.d. (independent, identically distributed) data, consider the likelihood function
- Setting the gradient of the log likelihood with respect to η to zero, the solution depends on the data only through the sum of u(x_n): the sufficient statistic of the distribution
  - We do not need to store the entire data set itself, only the sufficient statistic
  - Ex) Bernoulli distribution: the sufficient statistic is the sum of the data points {x_n}
  - Ex) Gaussian: the sufficient statistics are the sum of {x_n} and the sum of {x_n²}
- Sufficiency in the Bayesian setting?
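The resulting maximum likelihood condition (standard form) is:

\[
-\nabla\ln g(\boldsymbol{\eta}_{\mathrm{ML}}) = \frac{1}{N}\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_n)
\]

so the ML estimate depends on the data only through the sufficient statistic.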

Conjugate priors
- For the exponential family, the conjugate prior has the same functional structure as the likelihood, with a normalization coefficient and a parameter that can be interpreted as an effective number of pseudo-observations
- The posterior distribution takes the same functional form as the prior, confirming conjugacy
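As a sketch of the omitted equations, the conjugate prior and the posterior it leads to are (ν is the effective number of pseudo-observations and f(χ, ν) the normalization coefficient):

\[
p(\boldsymbol{\eta}\mid\boldsymbol{\chi},\nu) = f(\boldsymbol{\chi},\nu)\,g(\boldsymbol{\eta})^{\nu}\exp\{\nu\,\boldsymbol{\eta}^{\mathrm T}\boldsymbol{\chi}\},
\qquad
p(\boldsymbol{\eta}\mid\mathbf{X},\boldsymbol{\chi},\nu) \propto g(\boldsymbol{\eta})^{\nu+N}\exp\!\left\{\boldsymbol{\eta}^{\mathrm T}\!\left(\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_n)+\nu\boldsymbol{\chi}\right)\right\}
\]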

Noninformative priors
- Role of a prior
  - If prior knowledge can be conveniently expressed through the prior distribution, that is ideal
  - When we have little idea of what the prior should be, we use a noninformative prior
- A noninformative prior is intended to have as little influence on the posterior distribution as possible: 'letting the data speak for themselves'
- Two difficulties in the case of continuous parameters
  - If the domain of the parameter is unbounded, the prior cannot be normalized (it is improper)
  - The transformation behaviour of a probability density under a nonlinear change of variables

Two examples of noninformative priors
- Family of densities with translation invariance (the density keeps its form when x is shifted by a constant)
  - Ex) the mean of a Gaussian distribution
- Family of densities with scale invariance
  - Ex) the standard deviation of a Gaussian distribution
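The corresponding noninformative priors (standard results, not shown in the transcript) are a constant density for a location parameter and a density proportional to 1/σ for a scale parameter:

\[
p(\mu) = \text{const}\quad\text{(location)},
\qquad
p(\sigma) \propto \frac{1}{\sigma}\ \ \Longleftrightarrow\ \ p(\ln\sigma) = \text{const}\quad\text{(scale)}
\]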

2.5 Nonparametric Methods
- Parametric approach to density estimation
  - Uses p.d.f.s with specific functional forms (unimodal!)
  - Governed by a small number of parameters, which are determined from a data set
  - Limitation: the chosen density might be a poor model of the true distribution, leading to poor predictions
- Nonparametric approach
  - Makes few assumptions about the form of the distribution
  - The form of the distribution typically depends on the size of the data set
  - Still contains parameters, but these control the model complexity rather than the form of the distribution
  - Nonparametric Bayesian methods are attracting growing interest
- For more details, see Chapter 4 of Duda & Hart

Three nonparametric methods
- Histogram methods, kernel density estimators, and nearest-neighbour methods
- Common points
  - The concept of locality
  - A smoothing parameter (bin width, kernel width, or number of neighbours)
- The general estimate uses N (# observations), K (# points within some region), and V (the volume of the region): kernel estimators fix V and count K, while nearest-neighbour methods fix K and determine V; for histograms, Δ_i is the width of the i-th bin
- [Figure: estimates on 50 data points drawn from a mixture of 2 Gaussians (green)]
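The general local density estimate and its histogram special case are:

\[
p(\mathbf{x}) \simeq \frac{K}{NV},
\qquad
p_i = \frac{n_i}{N\Delta_i}\ \ \text{(histogram, with } n_i \text{ points falling in bin } i\text{)}
\]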

More on kernel density estimators
- Place a kernel function (or Parzen window) on each data point
- The kernel function describes a local region u around a data point; it must be non-negative and integrate to one
- Uniform kernel function vs. Gaussian kernel function (a smoother choice)
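A minimal sketch of a Gaussian kernel density estimator in Python (the bandwidth h and the toy data are illustrative choices, not taken from the slides):

import numpy as np

def gaussian_kde(x_query, data, h):
    """Kernel density estimate p(x) = (1/N) sum_n N(x | x_n, h^2), for 1-D data."""
    diffs = (x_query[:, None] - data[None, :]) / h           # shape (Q, N)
    kernels = np.exp(-0.5 * diffs**2) / (np.sqrt(2 * np.pi) * h)
    return kernels.mean(axis=1)                               # average over the N kernels

# Toy data from a mixture of two Gaussians, echoing the slides' running example
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-1.0, 0.3, 25), rng.normal(1.0, 0.5, 25)])

xs = np.linspace(-3, 3, 7)
print(gaussian_kde(xs, data, h=0.2))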

Pros and cons
- Histogram methods
  - Once the histogram is computed, the data set can be discarded; easily applied to sequential data processing
  - The choice of bin edges produces (artificial) discontinuities in the density; scales poorly with dimensionality
- Kernel density estimators
  - No computation required for 'training', but the entire training set must be stored, so the computational cost of evaluating the density grows linearly with the size of the data set
  - A fixed bandwidth h: the optimal choice may depend on location within the data space
- K-nearest-neighbour method
  - The resulting model is not a true density model (its integral over all space diverges)

Classification with K-NN
- Apply K-nearest-neighbour density estimation to each class separately, then make use of Bayes' theorem
- Combining the class-conditional densities, the unconditional density, and the class priors gives the posterior probability of class membership, proportional to K_k/K
- To minimize the probability of misclassification, assign the test point x to the class having the largest posterior probability, i.e. the class with the most representatives among the K nearest neighbours; a small sketch follows below
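A minimal K-NN classifier sketch in Python following this rule (the toy data and the value of K are illustrative):

import numpy as np

def knn_classify(x, train_X, train_y, K=5):
    """Assign x to the class with the largest count K_k among its K nearest
    neighbours, i.e. the largest posterior p(C_k | x) ~ K_k / K."""
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = train_y[np.argsort(dists)[:K]]
    classes, counts = np.unique(nearest, return_counts=True)
    return classes[np.argmax(counts)]

# Toy 2-D data: two Gaussian-distributed classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, (30, 2)), rng.normal([2, 2], 0.5, (30, 2))])
y = np.array([0] * 30 + [1] * 30)

print(knn_classify(np.array([1.8, 1.9]), X, y, K=5))   # expected: class 1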

Illustrations of K-NN classification

Good density models
- Nonparametric methods: flexible, but require the entire training data set to be stored
- Parametric methods: very restricted in terms of the forms of distribution they can represent
- What we want: density models that are flexible, yet whose complexity can be controlled independently of the size of the training set
- We shall see in subsequent chapters how to achieve this