1 Pattern Recognition and Machine Learning, Chapter 2: Probability Distributions
July 2018, Chonbuk National University

2 Probability Distributions: General
Density Estimation: given a finite set x_1, …, x_N of observations, find the distribution p(x) of x.
Frequentist's Way: choose specific parameter values by optimizing a criterion (e.g., the likelihood).
Bayesian Way: place a prior distribution over the parameters and compute the posterior distribution with Bayes' rule.
Conjugate Prior: leads to a posterior distribution of the same functional form as the prior (makes life a lot easier :)

3 Binary Variables: Frequentist’s Way
Given a binary random variable x ∈ {0, 1} (e.g., tossing a coin) with
p(x = 1 | µ) = µ,  p(x = 0 | µ) = 1 − µ.  (2.1)
p(x) can be described by the Bernoulli distribution:
Bern(x | µ) = µ^x (1 − µ)^(1−x).  (2.2)
The maximum likelihood estimate for µ is:
µ_ML = m / N,  with m = (# observations of x = 1).  (2.8)
Yet this can lead to overfitting (especially for small N), e.g., N = m = 3 yields µ_ML = 1.
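
A minimal numerical sketch of the ML estimate (2.8), using NumPy and a made-up coin-flip sample (the data values are illustrative only):

```python
import numpy as np

# Hypothetical data set: three coin flips, all heads (x = 1).
x = np.array([1, 1, 1])

N = len(x)        # number of observations
m = np.sum(x)     # number of observations with x = 1
mu_ml = m / N     # maximum likelihood estimate (2.8)

print(mu_ml)      # 1.0 -> the ML estimate overfits for small N
```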

4 Binary Variables: Bayesian Way (1)
The binomial distribution describes the number m of observations of x = 1 out of a data set of size N:
N = fixed number of trials, m = number of successes in the N trials,
µ = probability of success, (1 − µ) = probability of failure.
Bin(m | N, µ) = (N choose m) µ^m (1 − µ)^(N−m),  (2.9)
where
(N choose m) = N! / ((N − m)! m!).  (2.10)
[Figure: histogram plot of the binomial distribution for N = 10 and µ = 0.25.]
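
A short sketch that evaluates the binomial probabilities (2.9)-(2.10) for the figure's setting N = 10, µ = 0.25, using only the Python standard library:

```python
from math import comb

def binom_pmf(m, N, mu):
    """Binomial probability of m successes in N trials, eqs. (2.9)-(2.10)."""
    return comb(N, m) * mu**m * (1 - mu)**(N - m)

# Probabilities for the setting shown in the figure: N = 10, mu = 0.25.
probs = [binom_pmf(m, 10, 0.25) for m in range(11)]
print(sum(probs))   # 1.0 (up to floating-point error)
```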

5 Binary Variables: Bayesian Way (2)
For a Bayesian treatment, we take the beta distribution as conjugate prior by introducing a prior distribution p(µ) over the parameter µ:
Beta(µ | a, b) = Γ(a + b) / (Γ(a) Γ(b)) µ^(a−1) (1 − µ)^(b−1).  (2.13)
(The gamma function extends the factorial to real numbers, i.e., Γ(n) = (n − 1)!.)
Mean and variance are given by:
E[µ] = a / (a + b),  (2.15)
var[µ] = a b / ((a + b)^2 (a + b + 1)).  (2.16)
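
A small sketch, assuming SciPy is available, that checks the mean and variance formulas (2.15)-(2.16) against scipy.stats.beta for one example choice of hyperparameters:

```python
from scipy.stats import beta

a, b = 2.0, 3.0   # example hyperparameters

mean_formula = a / (a + b)                        # (2.15)
var_formula = a * b / ((a + b)**2 * (a + b + 1))  # (2.16)

mean_scipy, var_scipy = beta.stats(a, b, moments='mv')
print(mean_formula, mean_scipy)   # both 0.4
print(var_formula, var_scipy)     # both 0.04
```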

6 Binary Variables: Beta Distribution
Some plots of the beta distribution:
[Figure: plots of the beta distribution Beta(µ | a, b) given by (2.13) as a function of µ for various values of the hyperparameters: (a, b) = (0.1, 0.1), (1, 1), (2, 3), (8, 4).]

7 Binary Variables: Bayesian Way (3)
Multiplying the binomial likelihood function (2.9) and the beta prior (2.13), the posterior is again a beta distribution and has the form:
p(µ | m, l, a, b) ∝ µ^(m+a−1) (1 − µ)^(l+b−1),  (2.17)
with l = N − m.
Simple interpretation of the hyperparameters a and b as the effective number of (a priori) observations of x = 1 and x = 0, respectively.
As we observe new data, a and b are updated.
As N → ∞, the variance (uncertainty) decreases and the mean converges to the ML estimate.
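
A minimal sketch of the conjugate update behind (2.17): starting from an assumed prior Beta(µ | a, b) and hypothetical counts m and l, the posterior hyperparameters are simply a + m and b + l.

```python
def beta_binomial_update(a, b, m, l):
    """Return posterior hyperparameters after observing m ones and l zeros."""
    return a + m, b + l

# Assumed prior Beta(mu | 2, 2) and a hypothetical batch of 8 ones and 3 zeros.
a_post, b_post = beta_binomial_update(2, 2, m=8, l=3)

posterior_mean = a_post / (a_post + b_post)   # eq. (2.15) applied to the posterior
print(a_post, b_post, posterior_mean)         # 10 5 0.666...
```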

8 Multinomial Variables: Frequentist’s Way
A random variable with K mutually exclusive states can be represented as a K-dimensional vector x in which one element x_k = 1 and all remaining elements equal 0.
The Bernoulli distribution can be generalized to
p(x | µ) = ∏_{k=1}^{K} µ_k^{x_k},  (2.26)
with Σ_k µ_k = 1.
For a data set D with N independent observations x_1, …, x_N, the corresponding likelihood function takes the form
p(D | µ) = ∏_{k=1}^{K} µ_k^{m_k},  with m_k = Σ_n x_{nk}.  (2.29)
The maximum likelihood estimate for µ_k is:
µ_k^ML = m_k / N.  (2.33)
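
A short sketch of the ML estimate (2.33) for 1-of-K coded data, with a made-up data matrix X whose rows are one-hot vectors:

```python
import numpy as np

# Hypothetical data set: N = 5 observations of a K = 3 state variable,
# each row is a 1-of-K (one-hot) coded vector x_n.
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 1, 0],
              [0, 0, 1],
              [0, 1, 0]])

N = X.shape[0]
m = X.sum(axis=0)   # counts m_k of observations with x_k = 1
mu_ml = m / N       # maximum likelihood estimate (2.33)
print(mu_ml)        # [0.2 0.6 0.2]
```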

9 Multinomial Variables: Bayesian Way (1)
The multinomial distribution is the joint distribution of the counts m_1, …, m_K, conditioned on µ and N:
Mult(m_1, m_2, …, m_K | µ, N) = (N choose m_1 m_2 … m_K) ∏_{k=1}^{K} µ_k^{m_k}.  (2.34)
The normalization coefficient is the number of ways of partitioning N objects into K groups of size m_1, …, m_K and is given by:
(N choose m_1 m_2 … m_K) = N! / (m_1! m_2! ⋯ m_K!),  (2.35)
where the variables m_k are subject to the constraint:
Σ_{k=1}^{K} m_k = N.  (2.36)
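
A minimal evaluation of (2.34)-(2.35) using only the standard library, for an assumed µ and an assumed set of counts:

```python
from math import factorial, prod

def multinomial_pmf(m, mu):
    """Multinomial probability (2.34) of counts m under parameters mu."""
    N = sum(m)
    coeff = factorial(N) // prod(factorial(mk) for mk in m)   # (2.35)
    return coeff * prod(mu_k**m_k for mu_k, m_k in zip(mu, m))

# Hypothetical example: N = 4 draws over K = 3 states.
print(multinomial_pmf([2, 1, 1], [0.5, 0.3, 0.2]))   # 0.18
```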

10 Multinomial Variables: Bayesian Way (2)
For a Bayesian treatment, the Dirichlet distribution can be taken as conjugate prior:
Dir(µ | α) = Γ(α_0) / (Γ(α_1) ⋯ Γ(α_K)) ∏_{k=1}^{K} µ_k^(α_k − 1),  (2.38)
where Γ(·) is the gamma function and
α_0 = Σ_{k=1}^{K} α_k.  (2.39)
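
A quick empirical sketch of the Dirichlet prior (2.38) for an assumed α, drawing samples with NumPy and checking that they lie on the simplex with mean α_k / α_0:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])   # assumed hyperparameters

samples = rng.dirichlet(alpha, size=100_000)
print(np.allclose(samples.sum(axis=1), 1.0))      # each sample lies on the simplex
print(samples.mean(axis=0), alpha / alpha.sum())  # empirical mean approx alpha_k / alpha_0
```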

11 Multinomial Variables: Dirichlet Distribution
Some plots of a Dirichlet distribution over 3 variables:
[Figure: Dirichlet distributions over the simplex, with values (clockwise from top left) α = (6, 2, 2), (3, 7, 5), (6, 2, 6), (2, 3, 4), and (from left to right) α = (0.1, 0.1, 0.1), (1, 1, 1).]

12 Multinomial Variables: Bayesian Way (3)
Multiplying the prior (2.38) by the likelihood function (2.34) yields the posterior for {µ_k} in the form:
p(µ | D, α) ∝ ∏_{k=1}^{K} µ_k^(α_k + m_k − 1).  (2.40)
Again the posterior distribution takes the form of a Dirichlet distribution, confirming that the Dirichlet is the conjugate prior for the multinomial. This allows the normalization coefficient to be found by comparison with (2.38):
p(µ | D, α) = Dir(µ | α + m),  (2.41)
with m = (m_1, …, m_K)^T.
Similarly to the binomial distribution with its beta prior, α_k can be interpreted as the effective number of (a priori) observations of x_k = 1.
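
A minimal sketch of the Dirichlet-multinomial update (2.40)-(2.41): the posterior hyperparameters are just α_k + m_k, shown here with assumed values in NumPy.

```python
import numpy as np

alpha = np.array([2.0, 2.0, 2.0])   # assumed prior hyperparameters
m = np.array([5, 1, 3])             # hypothetical observed counts m_k

alpha_post = alpha + m              # posterior Dir(mu | alpha + m), eq. (2.41)
posterior_mean = alpha_post / alpha_post.sum()
print(alpha_post, posterior_mean)
```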

13 The Gaussian distribution
The Gaussian law of a D-dimensional vector x is:
N(x | µ, Σ) = 1 / ((2π)^(D/2) |Σ|^(1/2)) exp( −(1/2) (x − µ)^T Σ^(−1) (x − µ) ),  (2.43)
where µ is a D-dimensional mean vector, Σ is a D×D covariance matrix and |Σ| denotes the determinant of Σ.
Motivations: maximum entropy, central limit theorem.
[Figure: histogram plots of the mean of N uniformly distributed numbers for various values of N; as N increases, the distribution tends towards a Gaussian.]

14 The Gaussian distribution: Properties
The law is a function of the Mahalanobis distance from x to µ:
Δ^2 = (x − µ)^T Σ^(−1) (x − µ).  (2.44)
The expectation of x under the Gaussian distribution is:
E[x] = µ.  (2.59)
The covariance matrix of x is:
cov[x] = Σ.  (2.64)

15 The Gaussian distribution: Properties
The law is constant on elliptical surfaces.
[Figure: the red curve shows the elliptical surface of constant probability density for a Gaussian in a two-dimensional space x = (x_1, x_2), on which the density is exp(−1/2) of its value at x = µ. The major axes of the ellipse are defined by the eigenvectors u_i of the covariance matrix Σ, with corresponding eigenvalues λ_i; the axis lengths scale as λ_i^(1/2).]

16 The Gaussian distribution: Conditional and marginal laws
Given a Gaussian distribution N(x | µ, Σ) with:
x = (x_a, x_b)^T,  µ = (µ_a, µ_b)^T,  (2.94)
Σ = ( Σ_aa  Σ_ab
      Σ_ba  Σ_bb ),  (2.95)
the conditional distribution p(x_a | x_b) is again a Gaussian law, with parameters:
µ_{a|b} = µ_a + Σ_ab Σ_bb^(−1) (x_b − µ_b),  (2.81)
Σ_{a|b} = Σ_aa − Σ_ab Σ_bb^(−1) Σ_ba.  (2.82)
The marginal distribution p(x_a) is a Gaussian law with parameters (µ_a, Σ_aa).

17 The Gaussian distribution: Bayes' theorem
Given a marginal Gaussian distribution for x and a conditional Gaussian distribution for y given x in the form
p(x) = N(x | µ, Λ^(−1)),  (2.113)
p(y | x) = N(y | Ax + b, L^(−1)),  (2.114)
(i.e., y = Ax + b + s, where x is Gaussian and s is zero-mean Gaussian noise). Λ and L are precision matrices; µ, A and b are parameters governing the means. Then
p(y) = N(y | Aµ + b, L^(−1) + A Λ^(−1) A^T),  (2.115)
p(x | y) = N(x | Σ(A^T L (y − b) + Λµ), Σ),  (2.116)
where
Σ = (Λ + A^T L A)^(−1).  (2.117)

18 The Gaussian distribution: Maximum likelihood
Assume we have a data set X of N i.i.d. observations following a Gaussian law. The ML estimates of the mean and covariance of the law are:
µ_ML = (1/N) Σ_{n=1}^{N} x_n,  (2.121)
Σ_ML = (1/N) Σ_{n=1}^{N} (x_n − µ_ML)(x_n − µ_ML)^T.  (2.122)
The empirical mean is unbiased, but this is not the case for the empirical covariance. The bias can be corrected by multiplying Σ_ML by the factor N/(N − 1).

19 The Gaussian distribution: Maximum likelihood
The mean estimated from N data points is a revision of the estimate obtained from the first (N − 1) data points:
µ_ML^(N) = µ_ML^(N−1) + (1/N)(x_N − µ_ML^(N−1)).  (2.126)
This is a particular case of the Robbins-Monro algorithm, which iteratively searches for the root of a regression function.

