Ch 2. Probability Distributions
Pattern Recognition and Machine Learning, C. M. Bishop, 2006
Updated by B.-H. Kim; summarized by M.H. Kim
Biointelligence Laboratory, Seoul National University
http://bi.snu.ac.kr/
Contents
- 2.3 The Gaussian Distribution
  - 2.3.6 Bayesian inference for the Gaussian
  - 2.3.7 Student's t-distribution
  - 2.3.8 Periodic variables
  - 2.3.9 Mixtures of Gaussians
- 2.4 The Exponential Family
  - 2.4.1 Maximum likelihood and sufficient statistics
  - 2.4.2 Conjugate priors
  - 2.4.3 Noninformative priors
- 2.5 Nonparametric Methods
  - 2.5.1 Kernel density estimators
  - 2.5.2 Nearest-neighbour methods
2.3.6 Bayesian inference for the Gaussian
- The maximum likelihood framework gives point estimates for the parameters μ and σ².
- A Bayesian treatment of parameter estimation instead introduces prior distributions over the parameters.
- Three cases:
  - [C1] Covariance known, infer the mean
  - [C2] Mean known, infer the covariance
  - [C3] Both mean and covariance unknown
- Conjugate prior:
  - Keeps the posterior in the same functional form as the prior
  - Acts like a set of effective fictitious data points (a general property of the exponential family)
[C1] Covariance known, inferring the mean
- A single Gaussian random variable x, given a set of N observations X = {x_1, ..., x_N}.
- The likelihood function takes the form of the exponential of a quadratic form in μ.
- Thus if we choose a prior p(μ) given by a Gaussian, it will be conjugate to this likelihood function.
- The conjugate prior distribution: p(μ) = N(μ | μ_0, σ_0²).
[C1] Covariance known, inferring the mean (cont.)
- The posterior distribution is again Gaussian, p(μ | X) = N(μ | μ_N, σ_N²).
- The posterior mean μ_N is a compromise between the prior mean μ_0 and the maximum likelihood solution μ_ML.
- Precision (inverse variance) is additive: the posterior precision is the prior precision plus one contribution per observed data point.
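The standard results for the posterior mean and variance:

```latex
\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0
      + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{\mathrm{ML}},
\qquad
\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2},
\qquad
\mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n
```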
[C1] Covariance known, inferring the mean (cont.)
- Illustration of this Bayesian inference:
  - Data points generated from N(x | 0.8, 0.1²), prior N(μ | 0, 0.1²).
- Sequential estimation of the mean:
  - The Bayesian paradigm leads very naturally to a sequential view of the inference problem.
  - The posterior distribution after observing N−1 data points acts as the prior before the N-th data point is observed; it is then combined with the likelihood function associated with x_N.
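A minimal Python sketch of this sequential update, reusing the parameters from the illustration above (true mean 0.8, known σ = 0.1, prior N(0, 0.1²)); the sample size of 10 is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1                         # known observation noise (std dev)
mu_0, sigma_0 = 0.0, 0.1            # prior N(mu_0, sigma_0^2)
x = rng.normal(0.8, sigma, size=10) # observations from N(0.8, 0.1^2)

mu_N = mu_0
prec_N = 1.0 / sigma_0**2           # posterior precision (inverse variance)
for x_n in x:
    # Precisions add; the new mean is a precision-weighted compromise
    # between the current estimate and the incoming observation.
    prec_new = prec_N + 1.0 / sigma**2
    mu_N = (prec_N * mu_N + x_n / sigma**2) / prec_new
    prec_N = prec_new
    print(f"mu_N = {mu_N:.4f}, sigma_N = {prec_N ** -0.5:.4f}")
```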
[C2] Mean known, inferring the covariance
- It is convenient to work with the precision λ = 1/σ²; we consider the likelihood function for λ.
- The conjugate prior distribution is the gamma distribution, Gam(λ | a, b).
- If the prior is Gam(λ | a_0, b_0), the posterior becomes Gam(λ | a_N, b_N), as given below.
- (Figure: plots of Gam(λ | a, b) for several settings of a and b.)
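The gamma density and the standard posterior updates:

```latex
\mathrm{Gam}(\lambda \mid a, b) = \frac{1}{\Gamma(a)}\, b^{a} \lambda^{a-1} \exp(-b\lambda),
\qquad
a_N = a_0 + \frac{N}{2},
\qquad
b_N = b_0 + \frac{1}{2}\sum_{n=1}^{N}(x_n - \mu)^2 = b_0 + \frac{N}{2}\,\sigma_{\mathrm{ML}}^2
```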
Meaning of the conjugate prior for the exponential family
- The effect of observing N data points:
  - Increases the coefficient a_0 by N/2; hence 2a_0 can be read as the number of 'effective' prior observations.
  - Increases the coefficient b_0 by N σ²_ML / 2; hence b_0 arises from those 2a_0 effective prior observations having variance b_0 / a_0.
- A conjugate prior is therefore interpreted as a set of effective fictitious data points, a general property of the exponential family of distributions.
[C3] Both mean and covariance unknown
- The likelihood function now depends on both μ and λ.
- The conjugate prior distribution is the product of:
  - a Gaussian over μ whose precision is a linear function of λ, and
  - a gamma distribution over λ.
- This is the normal-gamma (Gaussian-gamma) distribution, given below.
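The normal-gamma prior:

```latex
p(\mu, \lambda) = \mathcal{N}\!\left(\mu \mid \mu_0, (\beta\lambda)^{-1}\right)\,
\mathrm{Gam}(\lambda \mid a, b)
```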
In the case of the multivariate Gaussian
- The same three cases arise:
  - [C1] Covariance known, infer the mean
  - [C2] Mean known, infer the covariance
  - [C3] Both mean and covariance unknown
- Form of the conjugate prior distribution:

  Case | Univariate      | Multivariate
  -----|-----------------|------------------
  C1   | Gaussian        | Gaussian
  C2   | Gamma           | Wishart
  C3   | Gaussian-gamma  | Gaussian-Wishart
2.3.7 Student's t-distribution
- If we take a univariate Gaussian together with a gamma prior over its precision and integrate the precision out, we obtain the marginal distribution of x: Student's t-distribution.
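The resulting density, with degrees of freedom ν = 2a and scale λ = a/b:

```latex
\mathrm{St}(x \mid \mu, \lambda, \nu)
= \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}
  \left(\frac{\lambda}{\pi\nu}\right)^{1/2}
  \left[1 + \frac{\lambda(x-\mu)^2}{\nu}\right]^{-\frac{\nu+1}{2}}
```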
Properties of Student's t-distribution
- It amounts to adding up an infinite number of Gaussian distributions having the same mean but different precisions: an infinite mixture of Gaussians.
- Longer 'tails' than a Gaussian => robustness: much less sensitive than the Gaussian to outliers.
- (Figure: ML fits of Student's t-distribution (red) and a Gaussian (green) to data containing outliers.)
More on Student's t-distribution
- For regression problems:
  - The least-squares approach does not exhibit robustness, because it corresponds to ML under a (conditional) Gaussian distribution.
  - We obtain a more robust model by using a heavy-tailed distribution such as a t-distribution.
- The multivariate t-distribution is given below, where D is the dimensionality of x.
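The multivariate form:

```latex
\mathrm{St}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, \nu)
= \frac{\Gamma\!\left(\frac{D+\nu}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}
  \frac{|\boldsymbol{\Lambda}|^{1/2}}{(\pi\nu)^{D/2}}
  \left[1 + \frac{\Delta^2}{\nu}\right]^{-\frac{D+\nu}{2}},
\qquad
\Delta^2 = (\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Lambda}\, (\mathbf{x}-\boldsymbol{\mu})
```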
2.3.8 Periodic variables
- Problem setting: we want to evaluate the mean of a set of observations {θ_1, ..., θ_N} of a periodic variable.
- A simple average of the angles would depend on the choice of origin; to find an invariant measure of the mean, the observations are treated as points on the unit circle.
- Converting between Cartesian and angular coordinates then yields the maximum likelihood estimator of the mean direction (sketched below).
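A minimal Python sketch of this estimator (the example angles are illustrative):

```python
import numpy as np

def circular_mean(theta):
    """ML mean direction: view each angle as a unit vector,
    average the vectors, and take the angle of the resultant."""
    return np.arctan2(np.mean(np.sin(theta)), np.mean(np.cos(theta)))

# Angles clustered near 0 that straddle the wrap-around point:
theta = np.array([0.1, -0.1, 2 * np.pi - 0.05, 0.05])
print(circular_mean(theta))   # close to 0, unlike np.mean(theta)
```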
Von Mises distribution (circular normal)
- Setting for a periodic generalization of the Gaussian.
- Conditions on a distribution p(θ) with period 2π: p(θ) ≥ 0, it integrates to 1, and p(θ + 2π) = p(θ).
- A Gaussian-like distribution that satisfies these properties is obtained by taking a 2D Gaussian and conditioning on the unit circle: the von Mises distribution.
- Its normalizer involves I_0(m), the zeroth-order Bessel function of the first kind.
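The von Mises density and its normalizer:

```latex
p(\theta \mid \theta_0, m) = \frac{1}{2\pi I_0(m)} \exp\{m \cos(\theta - \theta_0)\},
\qquad
I_0(m) = \frac{1}{2\pi}\int_0^{2\pi} \exp\{m \cos\theta\}\, d\theta
```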
Von Mises distribution (circular normal)
- (Figure: plots of the von Mises distribution, shown as a Cartesian plot and as a polar plot.)
ML estimators for the parameters
- Maximizing the log-likelihood gives estimators for the mean direction θ_0 and the concentration m, as given below.
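The standard results:

```latex
\theta_0^{\mathrm{ML}} = \tan^{-1}\left\{\frac{\sum_n \sin\theta_n}{\sum_n \cos\theta_n}\right\},
\qquad
A(m_{\mathrm{ML}}) = \frac{1}{N}\sum_{n=1}^{N} \cos\!\left(\theta_n - \theta_0^{\mathrm{ML}}\right),
\quad \text{where } A(m) = \frac{I_1(m)}{I_0(m)}
```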
Some alternative techniques for constructing periodic distributions
- The simplest approach is a histogram of observations in which the angular coordinate is divided into fixed bins => simple and flexible, but significantly limited (Section 2.5).
- Approach 2: like the von Mises distribution, start from a Gaussian distribution over a Euclidean space, but marginalize onto the unit circle rather than conditioning.
- Approach 3: 'wrap' the real axis around the unit circle, mapping successive intervals of width 2π onto (0, 2π).
- Multimodal densities can be handled with mixtures of von Mises distributions.
2.3.9 Mixtures of Gaussians
- A mixture of Gaussians: from the sum and product rules, the marginal density is a weighted sum of Gaussian components with mixing coefficients π_k.
- The posterior probabilities γ_k(x), the responsibilities, play an important role.
- (Figures: an example data set that calls for a mixture model, and an example Gaussian mixture distribution with 3 Gaussians.)
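The mixture density and the responsibilities:

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad 0 \le \pi_k \le 1, \quad \sum_{k=1}^{K} \pi_k = 1,
\qquad
\gamma_k(\mathbf{x}) =
\frac{\pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
     {\sum_{j} \pi_j\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}
```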
2.3.9 Mixtures of Gaussians (cont.)
- The maximum likelihood solution for the parameters has no closed-form analytical solution.
- It requires iterative numerical optimization techniques, or expectation maximization (Chapter 9).
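A minimal Python sketch of the quantity being maximized, the mixture log-likelihood for a 1-D, two-component model (all parameter values are illustrative):

```python
import numpy as np
from scipy.stats import norm

def gmm_log_likelihood(x, pi, mu, sigma):
    """Mixture log-likelihood: sum_n log sum_k pi_k N(x_n | mu_k, sigma_k^2).
    The log acts on a sum over components, so setting its gradient to zero
    yields no closed-form solution -- hence iterative methods or EM."""
    # dens[n, k] = pi_k * N(x_n | mu_k, sigma_k^2), via broadcasting
    dens = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)
    return np.sum(np.log(dens.sum(axis=1)))

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(3.0, 0.5, 50)])
print(gmm_log_likelihood(x,
                         pi=np.array([2/3, 1/3]),
                         mu=np.array([-2.0, 3.0]),
                         sigma=np.array([1.0, 0.5])))
```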
2.4 The Exponential Family
- The exponential family of distributions over x, given parameters η, is the set of distributions of the form given below, where
  - x may be scalar or vector
  - η are the natural parameters
  - u(x) is some function of x (the sufficient statistic)
  - g(η) is the inverse of the normalizer (an alternative form writes the normalizer in the exponent)
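The general form:

```latex
p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x})\, g(\boldsymbol{\eta})\,
\exp\{\boldsymbol{\eta}^{\mathrm{T}} \mathbf{u}(\mathbf{x})\},
\qquad
g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\{\boldsymbol{\eta}^{\mathrm{T}} \mathbf{u}(\mathbf{x})\}\, d\mathbf{x} = 1
```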
Bernoulli distribution as an exponential family
- Rewriting the Bernoulli distribution in exponential-family form, the natural parameter is η = ln(μ / (1 − μ)), whose inverse is the logistic sigmoid function μ = σ(η).
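The rewriting:

```latex
p(x \mid \mu) = \mu^{x} (1-\mu)^{1-x}
= (1-\mu) \exp\left\{x \ln\frac{\mu}{1-\mu}\right\}
\;\Rightarrow\;
u(x) = x, \quad h(x) = 1, \quad g(\eta) = \sigma(-\eta)
```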
Multinomial distribution as an exponential family
- The multinomial distribution can likewise be written in exponential-family form.
- The parameters μ_k are constrained to sum to 1; removing this constraint and using M−1 independent parameters leads to the softmax function.
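With M−1 unconstrained parameters, the natural parameters and their inverse (the softmax function) are:

```latex
\eta_k = \ln\frac{\mu_k}{1 - \sum_{j=1}^{M-1}\mu_j},
\qquad
\mu_k = \frac{\exp(\eta_k)}{1 + \sum_{j=1}^{M-1}\exp(\eta_j)}
```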
Gaussian distribution as an exponential family
- The Gaussian distribution also belongs to the exponential family, with sufficient statistics u(x) = (x, x²)ᵀ and natural parameters built from μ/σ² and −1/(2σ²).
2.4.1 Maximum likelihood and sufficient statistics
- Maximum likelihood estimation of the parameter vector η in the general exponential-family distribution.
- Taking the gradient of both sides of the normalization condition yields the key property of the partition function: −∇ ln g(η) = E[u(x)].
- The covariance of u(x) is given by the second derivatives of −ln g(η), and higher-order moments by higher derivatives.
Sufficient statistics
- With i.i.d. (independent, identically distributed) data, the likelihood function factorizes over data points.
- Setting the gradient of the log-likelihood with respect to η to zero gives the ML condition below.
- The solution depends on the data only through Σ_n u(x_n), the sufficient statistic of the distribution: we do not need to store the entire data set, only the sufficient statistic.
  - Ex) Bernoulli distribution: the s.s. is the sum of the data points {x_n}.
  - Ex) Gaussian: the s.s. are the sum of {x_n} and the sum of {x_n²}.
- Sufficiency plays a corresponding role in the Bayesian treatment (see conjugate priors, next).
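The ML condition:

```latex
-\nabla \ln g(\boldsymbol{\eta}_{\mathrm{ML}})
= \frac{1}{N}\sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n)
```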
2.4.2 Conjugate priors
- For any member of the exponential family, there exists a conjugate prior, with f(χ, ν) a normalization coefficient and ν interpretable as an effective number of pseudo-observations.
- The posterior distribution takes the same functional form as the prior, confirming conjugacy.
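The conjugate prior and the resulting posterior:

```latex
p(\boldsymbol{\eta} \mid \boldsymbol{\chi}, \nu)
= f(\boldsymbol{\chi}, \nu)\, g(\boldsymbol{\eta})^{\nu}
  \exp\{\nu\, \boldsymbol{\eta}^{\mathrm{T}} \boldsymbol{\chi}\},
\qquad
p(\boldsymbol{\eta} \mid \mathbf{X}, \boldsymbol{\chi}, \nu)
\propto g(\boldsymbol{\eta})^{\nu + N}
\exp\left\{\boldsymbol{\eta}^{\mathrm{T}}
\left(\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_n) + \nu\boldsymbol{\chi}\right)\right\}
```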
2.4.3 Noninformative priors
- Role of a prior: if prior knowledge can be conveniently expressed through the prior distribution, so much the better.
- When we have little idea of the appropriate form, we use a noninformative prior, intended to have as little influence on the posterior distribution as possible: 'letting the data speak for themselves'.
- Two difficulties in the case of continuous parameters:
  - If the domain of the parameter is unbounded, the prior cannot be normalized: it is improper.
  - The transformation behaviour of a probability density under a nonlinear change of variables: a prior that is constant in one parameterization will not be constant in another.
Two examples of noninformative priors
- Families of densities with translation invariance (shifting x by a constant leaves the family unchanged) call for a constant prior. Ex) the mean of a Gaussian distribution.
- Families of densities with scale invariance call for a prior proportional to 1/σ. Ex) the standard deviation of a Gaussian distribution.
2.5 Nonparametric Methods
- Parametric approach to density estimation:
  - Uses p.d.f.s with specific functional forms (often unimodal!), governed by a small number of parameters determined from a data set.
  - Limitation: the chosen density might be a poor model of the true distribution => poor predictions.
- Nonparametric approach:
  - Makes few assumptions about the form of the distribution.
  - The form of the distribution typically depends on the size of the data set.
  - Such models still contain parameters, but these control the model complexity rather than the form of the distribution.
  - Nonparametric Bayesian methods are attracting interest.
- For more details, see Ch. 4 of Duda & Hart.
Three nonparametric methods
- Histogram methods (Δ_i: width of the i-th bin)
- Kernel density estimators (V is fixed, K is determined from the data)
- Nearest-neighbour methods (K is fixed, V is determined from the data)
- Common points: a notion of locality and a smoothing parameter.
- Notation: N = number of observations; K = number of points falling within some region; V = the volume of that region.
- (Figure: the methods applied to 50 data points drawn from a mixture of 2 Gaussians; the true density is shown in green.)
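Both kernel and nearest-neighbour methods are instances of the local density estimate:

```latex
p(\mathbf{x}) \simeq \frac{K}{N V}
```

Fixing V and determining K from the data gives kernel density estimators; fixing K and determining V gives the K-nearest-neighbour method.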
More on kernel density estimators
- Place a kernel function (or Parzen window) k(u) on each data point, where u ranges over the local region around that point.
- The kernel must be non-negative and integrate to 1, so that the resulting estimate is a valid density.
- A uniform kernel function gives a discontinuous estimate; a Gaussian kernel function gives a smoother one.
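A minimal Python sketch of the Gaussian-kernel estimator in 1-D (the data set and bandwidth h = 0.2 are illustrative):

```python
import numpy as np

def gaussian_kde(x_query, data, h):
    """Kernel density estimate with a Gaussian kernel of bandwidth h (1-D):
    p(x) = (1/N) * sum_n N(x | x_n, h^2)."""
    # diff[i, n] = x_query_i - data_n
    diff = x_query[:, None] - data[None, :]
    kernels = np.exp(-0.5 * (diff / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)   # average the per-point kernels

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-1.0, 0.3, 30), rng.normal(2.0, 0.5, 20)])
grid = np.linspace(-3, 4, 200)
density = gaussian_kde(grid, data, h=0.2)   # h controls the smoothing
```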
Pros and cons
- Histogram methods
  - Once the histogram is computed, the data set can be discarded; easily applied to sequential data processing.
  - But the bin edges produce artificial discontinuities in the density, and the method scales badly with dimensionality.
- Kernel density estimators
  - No computation for 'training', but they require storage of the entire training set => the computational cost of evaluating the density grows linearly with the size of the data set.
  - Fixed bandwidth h: the optimal choice may depend on location within the data space.
- K-nearest-neighbour method
  - The model is not a true density model (its integral over all space diverges).
Classification with K-NN
- Apply K-nearest-neighbour density estimation to each class separately, then make use of Bayes' theorem.
- Draw a sphere around the test point x containing exactly K points, K_k of them from class C_k, with volume V and N_k training points per class:
  - density of each class: p(x | C_k) = K_k / (N_k V)
  - unconditional density: p(x) = K / (N V)
  - class priors: p(C_k) = N_k / N
  - posterior probability of class membership: p(C_k | x) = p(x | C_k) p(C_k) / p(x) = K_k / K
- To minimize the probability of misclassification, assign the test point x to the class having the largest posterior probability, i.e. the largest K_k / K.
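A minimal Python sketch of the resulting rule, a majority vote among the K nearest training points (the toy data set is illustrative):

```python
import numpy as np

def knn_classify(x, X_train, y_train, K):
    """Assign x to the class with the largest posterior K_k / K,
    i.e. the majority class among the K nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(dists)[:K]]
    classes, counts = np.unique(nearest, return_counts=True)
    return classes[np.argmax(counts)]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(np.array([0.2, 0.1]), X_train, y_train, K=3))  # -> 0
```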
Illustrations of the K-NN classification
- (Figures only.)
Good density models
- Nonparametric methods: flexible, but require the entire training data set to be stored.
- Parametric methods: very restricted in the forms of distribution they can represent.
- What we want: density models that are flexible, yet whose complexity can be controlled independently of the size of the training set.
- We shall see in subsequent chapters how to achieve this.