Ch 2. Probability Distributions
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Updated by B.-H. Kim, summarized by M.H. Kim
Biointelligence Laboratory, Seoul National University
http://bi.snu.ac.kr/

Content

2.3 The Gaussian Distribution
- 2.3.6 Bayesian inference for the Gaussian
- 2.3.7 Student's t-distribution
- 2.3.8 Periodic variables
- 2.3.9 Mixtures of Gaussians

2.4 The Exponential Family
- 2.4.1 Maximum likelihood and sufficient statistics
- 2.4.2 Conjugate priors
- 2.4.3 Noninformative priors

2.5 Nonparametric Methods
- 2.5.1 Kernel density estimators
- 2.5.2 Nearest-neighbour methods

2.3.6 Bayesian inference for the Gaussian

The maximum likelihood framework gives point estimates of the parameters; a Bayesian treatment instead introduces prior distributions over the parameters.

Three cases:
- [C1] Covariance known, infer the mean
- [C2] Mean known, infer the covariance (precision)
- [C3] Both mean and covariance unknown

Conjugate prior:
- Keeps the posterior in the same functional form as the prior
- Can be interpreted as effective fictitious data points (a general property of the exponential family)

[C1] Covariance known, inferring the mean

A single Gaussian random variable x, given a set of N observations X = {x_1, ..., x_N}.

The likelihood function:
- takes the form of the exponential of a quadratic form in μ
- thus, if we choose a Gaussian prior p(μ), it will be conjugate to this likelihood function

The conjugate prior distribution is therefore Gaussian, as written out below.
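The slide's equation images were not preserved; the standard PRML forms for this case, which I believe the slide reproduced, are:

```latex
% Likelihood of \mu under N i.i.d. observations with known variance \sigma^2
p(\mathbf{X} \mid \mu) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)
 = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\!\left\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 \right\}

% Conjugate Gaussian prior over the mean
p(\mu) = \mathcal{N}(\mu \mid \mu_0, \sigma_0^2)
```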

[C1] Covariance known, inferring the mean (cont.)

The posterior distribution is again Gaussian, with mean and variance given below.
- The posterior mean is a compromise between the prior mean μ_0 and the maximum likelihood solution μ_ML
- Precision (inverse variance) is additive: the posterior precision is the prior precision plus one contribution of data precision per observation
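Reconstructed posterior formulas (the standard PRML results for this case):

```latex
p(\mu \mid \mathbf{X}) = \mathcal{N}(\mu \mid \mu_N, \sigma_N^2), \qquad
\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0
      + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{\mathrm{ML}}, \qquad
\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}
```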

[C1] Covariance known, inferring the mean (cont.)

Illustration of this Bayesian inference:
- Data points generated from N(x | 0.8, 0.1²)
- Prior: N(μ | 0, 0.1²)

Sequential estimation of the mean:
- The Bayesian paradigm leads very naturally to a sequential view of the inference problem
- The posterior distribution after observing N − 1 data points acts as the prior before the N-th data point is observed, and is then combined with the likelihood function associated with x_N
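A minimal Python sketch of this sequential update, assuming the slide's illustrative values (true mean 0.8, known standard deviation 0.1, prior N(0, 0.1²)); the sample size is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 0.1 ** 2                    # known data variance
mu_prior, s2_prior = 0.0, 0.1 ** 2   # prior mean and variance over mu

for x_n in rng.normal(0.8, 0.1, size=10):
    # Precisions are additive: 1/s2_post = 1/s2_prior + 1/sigma2
    s2_post = 1.0 / (1.0 / s2_prior + 1.0 / sigma2)
    # Posterior mean is a precision-weighted compromise of prior and datum
    mu_post = s2_post * (mu_prior / s2_prior + x_n / sigma2)
    mu_prior, s2_prior = mu_post, s2_post  # posterior becomes the next prior

print(mu_prior, s2_prior)  # mean drifts toward 0.8, variance shrinks
```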

[C2] Mean known, inferring the covariance (precision)

The likelihood function for the precision λ = 1/σ² takes the form of a power of λ times the exponential of a linear function of λ. The conjugate prior is therefore the gamma distribution Gam(λ | a, b).

Taking the prior to be Gam(λ | a_0, b_0), the posterior becomes Gam(λ | a_N, b_N), as given below.

(Figure: plots of Gam(λ | a, b) for various settings of a and b.)
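Reconstructed gamma prior and posterior coefficients (standard PRML results):

```latex
\mathrm{Gam}(\lambda \mid a, b) = \frac{1}{\Gamma(a)}\, b^{a}\, \lambda^{a-1} e^{-b\lambda}, \qquad
a_N = a_0 + \frac{N}{2}, \qquad
b_N = b_0 + \frac{1}{2}\sum_{n=1}^{N}(x_n - \mu)^2 = b_0 + \frac{N}{2}\,\sigma_{\mathrm{ML}}^2
```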

Meaning of the conjugate prior for the exponential family

The effect of observing N data points:
- increases the value of the coefficient a_0 by N/2, so 2a_0 can be read as the number of 'effective' prior observations
- increases the value of the coefficient b_0 by N σ²_ML / 2; b_0 arises from the 2a_0 effective prior observations having variance b_0 / a_0

A conjugate prior is thus interpreted as a set of effective fictitious data points. This is a general property of the exponential family of distributions.

[C3] Both mean and precision unknown

The likelihood function now depends on both μ and λ. The conjugate prior factorizes into:
- a Gaussian over μ whose precision is a linear function of λ, and
- a gamma distribution over λ.

This is the normal-gamma (Gaussian-gamma) distribution, given below.
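Reconstructed normal-gamma form (standard PRML result):

```latex
p(\mu, \lambda) = \mathcal{N}\!\left(\mu \mid \mu_0, (\beta\lambda)^{-1}\right) \mathrm{Gam}(\lambda \mid a, b)
```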

In the case of the multivariate Gaussian

The same three cases arise:
- [C1] Covariance known, infer the mean
- [C2] Mean known, infer the covariance
- [C3] Both mean and covariance unknown

The form of the conjugate prior distribution:

Case   Univariate        Multivariate
[C1]   Gaussian          Gaussian
[C2]   Gamma             Wishart
[C3]   Gaussian-gamma    Gaussian-Wishart

2.3.7 Student's t-distribution

If we have a univariate Gaussian N(x | μ, τ⁻¹) together with a gamma prior Gam(τ | a, b) over its precision, and we integrate out the precision, we obtain the marginal distribution of x: Student's t-distribution.
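Reconstructed derivation (standard PRML; with ν = 2a and λ = a/b):

```latex
p(x \mid \mu, a, b) = \int_0^{\infty} \mathcal{N}\!\left(x \mid \mu, \tau^{-1}\right) \mathrm{Gam}(\tau \mid a, b)\, d\tau

\mathrm{St}(x \mid \mu, \lambda, \nu) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}
  \left(\frac{\lambda}{\pi\nu}\right)^{1/2}
  \left[ 1 + \frac{\lambda (x-\mu)^2}{\nu} \right]^{-\frac{\nu+1}{2}}
```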

Properties of Student's t-distribution

- It results from adding up an infinite number of Gaussian distributions having the same mean but different precisions: an infinite mixture of Gaussians
- It has longer 'tails' than a Gaussian, which gives robustness: maximum likelihood fits are much less sensitive to outliers than those of the Gaussian

(Figure: ML solutions for Student's t-distribution, red, and a Gaussian, green; without outliers the two fits nearly coincide, but with outliers added the Gaussian fit is strongly perturbed while the t fit is not.)

More on Student's t-distribution

For regression problems:
- The least-squares approach does not exhibit robustness, because it corresponds to maximum likelihood under a (conditional) Gaussian distribution
- We obtain a more robust model by basing the regression on a heavy-tailed distribution such as a t-distribution

The multivariate t-distribution generalizes this, where D is the dimensionality of x.
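Reconstructed multivariate form (standard PRML; Δ² is the squared Mahalanobis distance):

```latex
\mathrm{St}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, \nu)
 = \frac{\Gamma\!\left(\frac{D+\nu}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}
   \frac{|\boldsymbol{\Lambda}|^{1/2}}{(\pi\nu)^{D/2}}
   \left[ 1 + \frac{\Delta^2}{\nu} \right]^{-\frac{D+\nu}{2}}, \qquad
\Delta^2 = (\mathbf{x}-\boldsymbol{\mu})^{\mathsf T} \boldsymbol{\Lambda} (\mathbf{x}-\boldsymbol{\mu})
```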

2.3.8 Periodic variables

Problem setting:
- We want to evaluate the mean of a set of observations {θ_1, ..., θ_N} of a periodic variable
- Simply averaging the angles would depend on the choice of origin; to find an invariant measure of the mean, the observations are treated as points on the unit circle, averaged in Cartesian coordinates, and converted back to an angular coordinate

This yields the maximum likelihood estimator for the mean direction, given below.
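Reconstructed estimator (standard PRML): each observation is mapped to the point (cos θ_n, sin θ_n) on the unit circle, the Cartesian mean is taken, and the angle of the mean vector is recovered:

```latex
\bar{\mathbf{x}} = \frac{1}{N}\sum_{n=1}^{N} \left(\cos\theta_n, \sin\theta_n\right), \qquad
\bar{\theta}^{\mathrm{ML}} = \tan^{-1}\!\left\{ \frac{\sum_n \sin\theta_n}{\sum_n \cos\theta_n} \right\}
```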

Von Mises distribution (circular normal)

Setting for a periodic generalization of the Gaussian. A distribution p(θ) with period 2π must satisfy:
- p(θ) ≥ 0
- ∫₀^{2π} p(θ) dθ = 1
- p(θ + 2π) = p(θ)

A Gaussian-like distribution satisfying these properties is obtained by conditioning a 2D Gaussian on the unit circle: the von Mises distribution, whose normalizer involves I₀(m), the zeroth-order modified Bessel function of the first kind.
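Reconstructed form (standard PRML), where θ₀ is the mean direction and m is the concentration parameter:

```latex
p(\theta \mid \theta_0, m) = \frac{1}{2\pi I_0(m)} \exp\{ m \cos(\theta - \theta_0) \}
```

For large m the distribution becomes approximately Gaussian around θ₀.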

Von Mises distribution (circular normal), cont.

(Figure: plots of the von Mises distribution, shown both as a Cartesian plot and as a polar plot.)

ML estimators for the parameters of the von Mises distribution
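The slide's formula images were not preserved; the standard PRML estimators are the mean direction θ₀^ML, as for periodic variables above, and the concentration m^ML, obtained by numerically inverting the ratio A(m) = I₁(m)/I₀(m) of modified Bessel functions:

```latex
\theta_0^{\mathrm{ML}} = \tan^{-1}\!\left\{ \frac{\sum_n \sin\theta_n}{\sum_n \cos\theta_n} \right\}, \qquad
A(m^{\mathrm{ML}}) = \frac{I_1(m^{\mathrm{ML}})}{I_0(m^{\mathrm{ML}})}
 = \frac{1}{N}\sum_{n=1}^{N} \cos\!\left(\theta_n - \theta_0^{\mathrm{ML}}\right)
```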

Some alternative techniques for constructing periodic distributions

- Approach 1: the simplest approach is a histogram of observations in which the angular coordinate is divided into fixed bins. Simple and flexible, but significantly limited (see Section 2.5)
- Approach 2: like the von Mises distribution, start from a Gaussian over a Euclidean space, but marginalize onto the unit circle rather than conditioning
- Approach 3: 'wrap' the real axis around the unit circle, mapping successive intervals of width 2π onto (0, 2π)
- Mixtures of von Mises distributions can capture multimodal periodic data

2.3.9 Mixtures of Gaussians

A mixture of Gaussians is a linear superposition of K Gaussian components with mixing coefficients π_k. From the sum and product rules, the marginal density p(x) is the weighted sum given below. The posterior probabilities p(k | x), called responsibilities, play an important role.

(Figures: an example data set that requires a mixture distribution, and an example Gaussian mixture distribution with 3 Gaussians.)
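Reconstructed formulas (standard PRML):

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad
\sum_{k=1}^{K} \pi_k = 1, \quad 0 \le \pi_k \le 1

\gamma_k(\mathbf{x}) \equiv p(k \mid \mathbf{x})
 = \frac{\pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
        {\sum_{j} \pi_j\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}
```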

2.3.9 Mixtures of Gaussians (cont.)

The maximum likelihood solution for the parameters:
- has no closed-form analytical solution
- requires iterative numerical optimization techniques, or
- expectation maximization (Chapter 9)

2.4 The Exponential Family

The exponential family of distributions over x, given parameters η, is defined to be the set of distributions of the form given below, where:
- η: the natural parameters (scalar or vector)
- u(x): some function of x (the sufficient statistic)
- g(η): the inverse of the normalizer (an alternative form uses the log partition function)
- h(x): a factor depending only on x
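Reconstructed definition (standard PRML), together with the alternative log-partition form:

```latex
p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x})\, g(\boldsymbol{\eta}) \exp\{ \boldsymbol{\eta}^{\mathsf T} \mathbf{u}(\mathbf{x}) \}
 = h(\mathbf{x}) \exp\{ \boldsymbol{\eta}^{\mathsf T} \mathbf{u}(\mathbf{x}) - A(\boldsymbol{\eta}) \},
 \qquad A(\boldsymbol{\eta}) = -\ln g(\boldsymbol{\eta})
```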

Bernoulli distribution as an exponential family

Writing the Bernoulli distribution in exponential form shows that the natural parameter η is the log-odds, and inverting this relation expresses μ as the logistic sigmoid function of η.
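Reconstructed derivation (standard PRML):

```latex
p(x \mid \mu) = \mu^{x} (1-\mu)^{1-x}
 = (1-\mu) \exp\!\left\{ x \ln\frac{\mu}{1-\mu} \right\}

\eta = \ln\frac{\mu}{1-\mu}, \qquad
\mu = \sigma(\eta) = \frac{1}{1 + e^{-\eta}}, \qquad
u(x) = x, \quad h(x) = 1, \quad g(\eta) = \sigma(-\eta)
```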

Multinomial distribution as an exponential family

The multinomial distribution can be written in exponential form directly, but its parameters are constrained to sum to one. Removing this constraint and using M − 1 independent parameters gives natural parameters η_k whose inverse relation is the softmax function.
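Reconstructed softmax relation (standard PRML, with μ_k expressed through the M − 1 unconstrained parameters):

```latex
p(\mathbf{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{M} \mu_k^{x_k}
 = \exp\!\left\{ \sum_{k=1}^{M} x_k \ln \mu_k \right\}, \qquad
\mu_k = \frac{\exp(\eta_k)}{1 + \sum_{j=1}^{M-1} \exp(\eta_j)}
```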

Gaussian distribution as an exponential family

The univariate Gaussian fits the exponential-family form with a two-dimensional natural parameter and sufficient statistic, as given below.
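Reconstructed components (standard PRML):

```latex
p(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}

\boldsymbol{\eta} = \begin{pmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{pmatrix}, \quad
\mathbf{u}(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix}, \quad
h(x) = (2\pi)^{-1/2}, \quad
g(\boldsymbol{\eta}) = (-2\eta_2)^{1/2} \exp\!\left( \frac{\eta_1^2}{4\eta_2} \right)
```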

2.4.1 Maximum likelihood and sufficient statistics

Maximum likelihood estimation of the parameter vector η in the general exponential-family distribution rests on a property of the partition function: taking the gradient of both sides of the normalization condition shows that −∇ ln g(η) equals the expectation of u(x), as shown below. Similarly:
- the covariance of u(x) can be expressed in terms of the second derivatives of ln g(η)
- higher-order moments follow from higher derivatives
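Reconstructed identity (standard PRML), starting from the normalization condition and taking gradients:

```latex
g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\{ \boldsymbol{\eta}^{\mathsf T} \mathbf{u}(\mathbf{x}) \}\, d\mathbf{x} = 1
\;\;\Longrightarrow\;\;
-\nabla \ln g(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{u}(\mathbf{x})]
```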

Sufficient statistics

With i.i.d. (independent, identically distributed) data, the likelihood function factorizes, and setting the gradient of the log likelihood with respect to η to zero yields the condition below.

- The solution depends on the data only through Σ_n u(x_n): the sufficient statistic of the distribution
- We therefore do not need to store the entire data set, only the sufficient statistic
- Ex) Bernoulli distribution: the sufficient statistic is the sum of the data points {x_n}
- Ex) Gaussian: the sufficient statistics are the sum of {x_n} and the sum of {x_n²}

What does sufficiency look like in the Bayesian setting? (See conjugate priors, next.)
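Reconstructed ML condition (standard PRML):

```latex
-\nabla \ln g(\boldsymbol{\eta}_{\mathrm{ML}}) = \frac{1}{N} \sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n)
```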

2.4.2 Conjugate priors

For the exponential family, the conjugate prior takes the form given below, where:
- f(χ, ν) is a normalization coefficient
- ν is the effective number of pseudo-observations, each with sufficient-statistic value χ

Multiplying by the likelihood, the posterior distribution takes the same functional form as the prior, confirming conjugacy.
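Reconstructed prior and posterior (standard PRML):

```latex
p(\boldsymbol{\eta} \mid \boldsymbol{\chi}, \nu)
 = f(\boldsymbol{\chi}, \nu)\, g(\boldsymbol{\eta})^{\nu} \exp\{ \nu\, \boldsymbol{\eta}^{\mathsf T} \boldsymbol{\chi} \}

p(\boldsymbol{\eta} \mid \mathbf{X}, \boldsymbol{\chi}, \nu)
 \;\propto\; g(\boldsymbol{\eta})^{\nu + N}
 \exp\!\left\{ \boldsymbol{\eta}^{\mathsf T} \left( \sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n) + \nu \boldsymbol{\chi} \right) \right\}
```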

2.4.3 Noninformative priors

Role of a prior:
- If prior knowledge can be conveniently expressed through the prior distribution, very good
- When we have little idea, we use a noninformative prior

A noninformative prior is intended to have as little influence on the posterior distribution as possible: 'letting the data speak for themselves'.

Two difficulties arise in the case of continuous parameters:
- If the domain of the parameter is unbounded, the prior cannot be normalized: it is improper
- The transformation behaviour of a probability density under a nonlinear change of variables: a density that is constant in one parameterization is not constant in another

Two examples of noninformative priors

- Family of densities with translation invariance (shifting x by a constant leaves the family unchanged). Ex) the mean of a Gaussian distribution
- Family of densities with scale invariance. Ex) the standard deviation of a Gaussian distribution
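The corresponding priors, reconstructed from the standard PRML argument:

```latex
\text{Translation invariance, } f(x - \mu): \quad p(\mu) = \text{const}
\qquad\qquad
\text{Scale invariance, } \tfrac{1}{\sigma} f(x/\sigma): \quad p(\sigma) \propto \frac{1}{\sigma}
```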

2.5 Nonparametric Methods

Parametric approach to density estimation:
- Uses p.d.f.s with specific functional forms (often unimodal!)
- Governed by a small number of parameters, determined from a data set
- Limitation: the chosen density might be a poor model of the true distribution, giving poor predictions

Nonparametric approach:
- Makes few assumptions about the form of the distribution
- The form of the distribution typically depends on the size of the data set
- Still contains parameters, but these control the model complexity rather than the form of the distribution
- Nonparametric Bayesian methods are attracting interest

For more details, see Ch. 4 of Duda & Hart.

Three nonparametric methods

- Histogram methods: partition x into bins of width Δ_i and count the n_i observations falling in the i-th bin
- Kernel density estimators: V is fixed, K is determined from the data
- Nearest-neighbour methods: K is fixed, V is determined from the data

Common points:
- the concept of locality
- a smoothing parameter

Notation: N = number of observations, K = number of points within some region, V = volume of that region.

(Figure: each method applied to 50 points from a mixture of 2 Gaussians, shown in green.)
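Reconstructed estimates (standard PRML): the histogram density per bin, and the general local estimate underlying both kernel and nearest-neighbour methods:

```latex
p_i = \frac{n_i}{N \Delta_i}
\qquad\qquad
p(\mathbf{x}) \simeq \frac{K}{N V}
```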

More on kernel density estimators

Set a kernel function (or Parzen window) k(u) on the local region u around each data point, with k(u) ≥ 0 and ∫ k(u) du = 1:
- uniform kernel function (a hypercube around each point)
- Gaussian kernel function (a smoother estimator)
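A minimal 1-D Gaussian kernel density estimator in Python, assuming the slide's running example of 50 points from a mixture of two Gaussians; the mixture parameters and bandwidth h here are illustrative guesses:

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Estimate p(x) = (1/N) sum_n N(x | x_n, h^2) at the points in x."""
    u = (x[:, None] - data[None, :]) / h          # pairwise scaled differences
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return k.mean(axis=1) / h                     # average kernels, rescale by h

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0.0, 1.0, 25), rng.normal(4.0, 0.5, 25)])
grid = np.linspace(-3.0, 6.0, 200)
density = gaussian_kde(grid, data, h=0.3)
print(density.sum() * (grid[1] - grid[0]))  # ~1: the estimate integrates to one
```

A smaller h gives a spikier estimate, a larger h a smoother one; as the slide notes later, a single fixed h may not be optimal everywhere.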

Pros and cons

Histogram methods:
- Once the histogram is computed, the data set can be discarded
- Easily applied to sequential data processing
- Bin edges produce (artificial) discontinuities in the estimated density
- Poor scaling with dimensionality

Kernel density estimators:
- No computation is needed for 'training'
- Require storage of the entire training set, so the computational cost of evaluating the density grows linearly with the size of the data set
- Fixed bandwidth h: the optimal choice may depend on location

K-nearest-neighbour method:
- The model produced is not a true density model (its integral over all space diverges)

Classification with K-NN

We apply K-nearest-neighbour density estimation to each class separately and then make use of Bayes' theorem. To minimize the probability of misclassification, we assign the test point x to the class having the largest posterior probability, which reduces to K_k / K, as shown below.
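Reconstructed derivation (standard PRML): with a sphere of volume V around x containing K points in total, of which K_k belong to class C_k, and with N_k points of class C_k in the whole data set,

```latex
p(\mathbf{x} \mid \mathcal{C}_k) = \frac{K_k}{N_k V}, \qquad
p(\mathbf{x}) = \frac{K}{N V}, \qquad
p(\mathcal{C}_k) = \frac{N_k}{N}
\;\;\Longrightarrow\;\;
p(\mathcal{C}_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{p(\mathbf{x})} = \frac{K_k}{K}
```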

Illustrations of the K-NN classification

(Figures: K-NN classification results for several values of K.)
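Since the original figures are not reproduced, here is a minimal NumPy sketch of the K-NN classification rule itself; the two-class toy data and its class means are hypothetical:

```python
import numpy as np

def knn_predict(x, train_X, train_y, k):
    """Classify x by majority vote among its k nearest training points,
    i.e. pick the class maximizing the posterior estimate K_j / K."""
    dists = np.linalg.norm(train_X - x, axis=1)   # Euclidean distances
    nearest = train_y[np.argsort(dists)[:k]]      # labels of k nearest points
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(3.0, 1.0, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(knn_predict(np.array([2.5, 2.5]), X, y, k=3))  # likely class 1
```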

Good density models

Nonparametric methods:
- Flexible, but require the entire training data set to be stored

Parametric methods:
- Very restricted in terms of the forms of distribution they can represent

What we want are density models that:
- are flexible, yet
- have a complexity that can be controlled independently of the size of the training set

We shall see in subsequent chapters how to achieve this.

