Pattern Recognition and Machine Learning, Chapter 2: Probability Distributions. July 2018, Chonbuk National University

Probability Distributions: General
Density estimation: given a finite set x_1, ..., x_N of observations, find the distribution p(x) of x.
Frequentist's way: choose specific parameter values by optimizing a criterion (e.g., the likelihood).
Bayesian way: place a prior distribution over the parameters and compute the posterior distribution with Bayes' rule.
Conjugate prior: leads to a posterior distribution of the same functional form as the prior (makes life a lot easier).

Binary Variables: Frequentist's Way
Given a binary random variable x ∈ {0, 1} (tossing a coin) with
p(x = 1|µ) = µ, p(x = 0|µ) = 1 − µ, (2.1)
p(x) can be described by the Bernoulli distribution:
Bern(x|µ) = µ^x (1 − µ)^(1−x). (2.2)
The maximum likelihood estimate for µ is
µ_ML = m/N, with m = (# observations of x = 1). (2.8)
Yet this can lead to overfitting (especially for small N), e.g., N = m = 3 yields µ_ML = 1.
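A minimal NumPy sketch (not part of the original slides) of the ML estimate (2.8) and the small-N overfitting case:

```python
import numpy as np

def bernoulli_ml(x):
    """Maximum likelihood estimate mu_ML = m / N for Bernoulli data (Eq. 2.8)."""
    x = np.asarray(x)
    return x.sum() / x.size

# Three tosses that all came up heads: mu_ML = 1, i.e. the ML estimate
# assigns probability one to heads -- the overfitting case mentioned above.
print(bernoulli_ml([1, 1, 1]))                         # 1.0
print(bernoulli_ml([1, 0, 1, 1, 0, 0, 1, 0, 1, 1]))    # 0.6
```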

Binary Variables: Bayesian Way (1)
The binomial distribution describes the number m of observations of x = 1 out of a data set of size N:
N = fixed number of trials, m = number of successes in N trials,
µ = probability of success, (1 − µ) = probability of failure.
Bin(m|N, µ) = (N choose m) µ^m (1 − µ)^(N−m), (2.9)
where
(N choose m) = N! / ((N − m)! m!). (2.10)
[Figure: histogram plot of the binomial distribution for N = 10 and µ = 0.25.]
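A short check of the binomial probabilities (2.9) with SciPy, using the N = 10, µ = 0.25 setting of the figure:

```python
from scipy.stats import binom

# Binomial pmf Bin(m | N, mu) for N = 10, mu = 0.25 (the distribution in the figure above).
N, mu = 10, 0.25
for m in range(N + 1):
    print(m, binom.pmf(m, N, mu))
```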

Binary Variables: Bayesian Way (2)
For a Bayesian treatment, we take the beta distribution as conjugate prior, introducing a prior distribution p(µ) over the parameter µ:
Beta(µ|a, b) = Γ(a + b) / (Γ(a)Γ(b)) µ^(a−1) (1 − µ)^(b−1). (2.13)
(The gamma function extends the factorial to real numbers, i.e., Γ(n) = (n − 1)!.)
Mean and variance are given by:
E[µ] = a / (a + b), (2.15)
var[µ] = ab / ((a + b)^2 (a + b + 1)). (2.16)
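A quick sanity check of (2.15)-(2.16) against SciPy's beta distribution, for illustrative hyperparameters a = 2, b = 3:

```python
from scipy.stats import beta

# Compare scipy's moments with the closed-form expressions (2.15)-(2.16).
a, b = 2.0, 3.0
print(beta.mean(a, b), a / (a + b))                          # 0.4   0.4
print(beta.var(a, b), a * b / ((a + b) ** 2 * (a + b + 1)))  # 0.04  0.04
```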

Binary Variables: Beta Distribution
Some plots of the beta distribution:
[Figure: plots of the beta distribution Beta(µ|a, b) given by (2.13) as a function of µ for various values of the hyperparameters: (a, b) = (0.1, 0.1), (1, 1), (2, 3), (8, 4).]

Binary Variables: Bayesian Way (3)
Multiplying the binomial likelihood function (2.9) and the beta prior (2.13), the posterior is again a beta distribution and has the form
p(µ|m, l, a, b) ∝ µ^(m+a−1) (1 − µ)^(l+b−1), (2.17)
with l = N − m.
Simple interpretation of the hyperparameters a and b as the effective number of (a priori) observations of x = 1 and x = 0.
As we observe new data, a and b are updated.
As N → ∞, the variance (uncertainty) decreases and the mean converges to the ML estimate.
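A minimal sketch of the conjugate update behind (2.17), using an illustrative prior Beta(2, 2) and the N = m = 3 data set from the frequentist slide:

```python
import numpy as np

# Beta-binomial posterior update (Eq. 2.17): prior Beta(a, b), data with m ones out of N.
a, b = 2.0, 2.0          # prior hyperparameters (effective prior counts)
x = np.array([1, 1, 1])  # N = m = 3, the overfitting example above
m, l = x.sum(), (1 - x).sum()

a_post, b_post = a + m, b + l
print(a_post / (a_post + b_post))   # posterior mean 0.714..., pulled away from mu_ML = 1
```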

Multinomial Variables: Frequentist's Way
A random variable with K mutually exclusive states can be represented as a K-dimensional vector x in which one element x_k = 1 and all remaining elements equal 0.
The Bernoulli distribution can be generalized to
p(x|µ) = ∏_{k=1..K} µ_k^(x_k), (2.26)
with ∑_k µ_k = 1.
For a data set D with N independent observations x_1, ..., x_N, the corresponding likelihood function takes the form
p(D|µ) = ∏_{k=1..K} µ_k^(m_k), where m_k = ∑_n x_nk. (2.29)
The maximum likelihood estimate for µ_k is
µ_k^ML = m_k / N. (2.33)
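A small NumPy illustration of the 1-of-K representation and the ML estimate (2.33); the data matrix is made up for the example:

```python
import numpy as np

# One-hot (1-of-K) data: ML estimate mu_k = m_k / N (Eq. 2.33).
X = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 0, 1],
              [0, 1, 0]])   # N = 4 observations, K = 3 states
m = X.sum(axis=0)           # counts m_k
mu_ml = m / X.shape[0]
print(mu_ml)                # [0.25 0.25 0.5 ]
```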

Multinomial Variables: Bayesian Way (1)
The multinomial distribution is the joint distribution of the counts m_1, ..., m_K, conditioned on µ and N:
Mult(m_1, m_2, ..., m_K|µ, N) = (N choose m_1 m_2 ... m_K) ∏_{k=1..K} µ_k^(m_k). (2.34)
The normalization coefficient is the number of ways of partitioning N objects into K groups of sizes m_1, ..., m_K and is given by
(N choose m_1 m_2 ... m_K) = N! / (m_1! m_2! ... m_K!), (2.35)
where the variables m_k are subject to the constraint
∑_{k=1..K} m_k = N. (2.36)
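A short check of the multinomial probability (2.34) with SciPy, using arbitrary example values for µ, N and the counts m:

```python
from scipy.stats import multinomial

# Multinomial probability Mult(m | mu, N) (Eq. 2.34) for a small example.
mu = [0.2, 0.3, 0.5]
N = 10
m = [2, 3, 5]
print(multinomial.pmf(m, n=N, p=mu))
```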

Multinomial Variables: Bayesian Way (2)
For a Bayesian treatment, the Dirichlet distribution can be taken as conjugate prior:
Dir(µ|α) = Γ(α_0) / (Γ(α_1) ··· Γ(α_K)) ∏_{k=1..K} µ_k^(α_k − 1), (2.38)
where Γ(·) is the gamma function and
α_0 = ∑_{k=1..K} α_k. (2.39)
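A minimal SciPy sketch evaluating the Dirichlet density (2.38) at an arbitrary point of the simplex:

```python
from scipy.stats import dirichlet

# Dirichlet density Dir(mu | alpha) (Eq. 2.38) and its mean E[mu_k] = alpha_k / alpha_0.
alpha = [2.0, 3.0, 4.0]
mu = [0.2, 0.3, 0.5]       # a point on the probability simplex
print(dirichlet.pdf(mu, alpha))
print(dirichlet.mean(alpha))
```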

Multinomial Variables: Dirichlet Distribution
Some plots of a Dirichlet distribution over 3 variables:
[Figure: Dirichlet distributions over three variables, with values (from left to right) α = (0.1, 0.1, 0.1), (1, 1, 1); and (clockwise from top left) α = (6, 2, 2), (3, 7, 5), (6, 2, 6), (2, 3, 4).]

Multinomial Variables: Bayesian Way (3)
Multiplying the prior (2.38) by the likelihood function (2.34) yields the posterior for {µ_k} in the form
p(µ|D, α) ∝ ∏_{k=1..K} µ_k^(α_k + m_k − 1). (2.40)
The posterior again takes the form of a Dirichlet distribution, confirming that the Dirichlet is the conjugate prior for the multinomial. Comparison with (2.38) gives the normalization coefficient:
p(µ|D, α) = Dir(µ|α + m) = Γ(α_0 + N) / (Γ(α_1 + m_1) ··· Γ(α_K + m_K)) ∏_{k=1..K} µ_k^(α_k + m_k − 1), (2.41)
with m = (m_1, ..., m_K)^T.
Similarly to the binomial distribution with its beta prior, α_k can be interpreted as the effective number of (a priori) observations of x_k = 1.
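A minimal sketch of the posterior update implied by (2.41): the posterior parameters are simply α + m (the prior and counts below are illustrative):

```python
import numpy as np

# Dirichlet-multinomial posterior update (Eq. 2.41): alpha_post = alpha + m.
alpha = np.array([1.0, 1.0, 1.0])   # uniform prior over the simplex
m = np.array([2, 3, 5])             # observed counts
alpha_post = alpha + m
print(alpha_post / alpha_post.sum())  # posterior mean; approaches m / N as N grows
```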

The Gaussian Distribution
The Gaussian density of a D-dimensional vector x is
N(x|µ, Σ) = 1 / ((2π)^(D/2) |Σ|^(1/2)) exp{−(1/2)(x − µ)^T Σ^(−1) (x − µ)}, (2.43)
where µ is a D-dimensional mean vector, Σ is a D×D covariance matrix and |Σ| denotes the determinant of Σ.
Motivations: maximum entropy, central limit theorem.
[Figure: histogram plots of the mean of N uniformly distributed numbers for various values of N; as N increases, the distribution tends towards a Gaussian.]
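A quick NumPy demonstration of the central-limit behaviour shown in the figure, using means of N uniform numbers (sample size chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)

# Means of N uniform(0, 1) numbers: as N grows, their distribution approaches a Gaussian.
for N in (1, 2, 10):
    means = rng.uniform(size=(100_000, N)).mean(axis=1)
    print(N, means.mean(), means.var())   # variance shrinks like 1 / (12 N)
```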

The Gaussian Distribution: Properties
The density depends on x through the Mahalanobis distance from x to µ:
∆^2 = (x − µ)^T Σ^(−1) (x − µ). (2.44)
The expectation of x under the Gaussian distribution is
E[x] = µ. (2.59)
The covariance matrix of x is
cov[x] = Σ. (2.64)
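A minimal sketch of the squared Mahalanobis distance (2.44); the mean and covariance below are arbitrary examples:

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu) (Eq. 2.44)."""
    d = x - mu
    return float(d @ np.linalg.solve(Sigma, d))

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mahalanobis_sq(np.array([1.0, 1.0]), mu, Sigma))
```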

The Gaussian Distribution: Properties
The density is constant on elliptical surfaces.
[Figure: the red curve shows the elliptical surface of constant probability density for a Gaussian in a two-dimensional space x = (x_1, x_2), on which the density is exp(−1/2) of its value at x = µ. The axes of the ellipse are defined by the eigenvectors u_i of the covariance matrix Σ, with corresponding eigenvalues λ_i; the axis lengths are proportional to λ_i^(1/2).]
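A short NumPy sketch of the eigendecomposition that defines the ellipse axes (the covariance matrix is chosen arbitrarily):

```python
import numpy as np

# Eigendecomposition of the covariance: the eigenvectors u_i give the axis directions,
# and the axis lengths are proportional to sqrt(lambda_i).
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
lam, U = np.linalg.eigh(Sigma)   # lam: eigenvalues, U: columns are eigenvectors u_i
print(lam)
print(U)
```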

The Gaussian Distribution: Conditional and Marginal Distributions
Given a Gaussian distribution N(x|µ, Σ) with
x = (x_a, x_b)^T, µ = (µ_a, µ_b)^T, (2.94)
Σ = ( Σ_aa  Σ_ab
      Σ_ba  Σ_bb ), (2.95)
the conditional distribution p(x_a|x_b) is again a Gaussian, with parameters
µ_{a|b} = µ_a + Σ_ab Σ_bb^(−1) (x_b − µ_b), (2.81)
Σ_{a|b} = Σ_aa − Σ_ab Σ_bb^(−1) Σ_ba. (2.82)
The marginal distribution p(x_a) is a Gaussian with parameters (µ_a, Σ_aa).
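A minimal NumPy sketch of the conditional parameters (2.81)-(2.82), for an illustrative two-dimensional Gaussian partitioned into scalar blocks x_a and x_b:

```python
import numpy as np

# Parameters of p(x_a | x_b) for a partitioned Gaussian (Eqs. 2.81-2.82).
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
a, b = [0], [1]               # index sets for the two blocks
x_b = np.array([2.0])         # observed value of x_b

S_aa, S_ab = Sigma[np.ix_(a, a)], Sigma[np.ix_(a, b)]
S_ba, S_bb = Sigma[np.ix_(b, a)], Sigma[np.ix_(b, b)]

mu_cond = mu[a] + S_ab @ np.linalg.solve(S_bb, x_b - mu[b])
Sigma_cond = S_aa - S_ab @ np.linalg.solve(S_bb, S_ba)
print(mu_cond, Sigma_cond)
```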

The Gaussian Distribution: Bayes' Theorem
Given a marginal Gaussian distribution for x and a conditional Gaussian distribution for y given x in the form
p(x) = N(x|µ, Λ^(−1)), (2.113)
p(y|x) = N(y|Ax + b, L^(−1)), (2.114)
(i.e., y = Ax + b + ε, where x is Gaussian and ε is zero-mean Gaussian noise), where Λ and L are precision matrices and µ, A and b are parameters governing the means, then
p(y) = N(y|Aµ + b, L^(−1) + AΛ^(−1)A^T), (2.115)
p(x|y) = N(x|Σ{A^T L(y − b) + Λµ}, Σ), (2.116)
where
Σ = (Λ + A^T L A)^(−1). (2.117)
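A minimal sketch of (2.115)-(2.117) for a one-dimensional linear-Gaussian model; all parameter values are arbitrary examples:

```python
import numpy as np

# Linear-Gaussian model: parameters of p(y) and p(x|y) (Eqs. 2.115-2.117), with 1-D x and y.
mu = np.array([0.0]);   Lam = np.array([[1.0]])   # prior mean and precision of x
A = np.array([[2.0]]);  b = np.array([1.0])       # mean of y given x is A x + b
L = np.array([[4.0]])                             # noise precision

mean_y = A @ mu + b
cov_y = np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T        # Eq. (2.115)

y = np.array([3.0])
Sigma = np.linalg.inv(Lam + A.T @ L @ A)                       # Eq. (2.117)
mean_x_given_y = Sigma @ (A.T @ L @ (y - b) + Lam @ mu)        # Eq. (2.116)
print(mean_y, cov_y, mean_x_given_y, Sigma)
```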

The Gaussian Distribution: Maximum Likelihood
Assume X is a set of N i.i.d. observations drawn from a Gaussian distribution. The ML estimates of the mean and covariance are
µ_ML = (1/N) ∑_{n=1..N} x_n, (2.121)
Σ_ML = (1/N) ∑_{n=1..N} (x_n − µ_ML)(x_n − µ_ML)^T. (2.122)
The empirical mean is unbiased, but the empirical covariance is not. The bias can be corrected by multiplying Σ_ML by the factor N/(N − 1).
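A short NumPy illustration of the ML estimates (2.121)-(2.122) and the N/(N − 1) bias correction, on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[2.0, 0.5], [0.5, 1.0]], size=500)

mu_ml = X.mean(axis=0)                                     # Eq. (2.121)
diff = X - mu_ml
Sigma_ml = diff.T @ diff / X.shape[0]                      # Eq. (2.122), biased
Sigma_unbiased = Sigma_ml * X.shape[0] / (X.shape[0] - 1)  # N / (N - 1) correction
print(mu_ml, Sigma_ml, Sigma_unbiased, sep="\n")
```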

The Gaussian Distribution: Maximum Likelihood
The mean estimated from N data points is a revision of the estimator obtained from the first (N − 1) data points:
µ_ML^(N) = µ_ML^(N−1) + (1/N)(x_N − µ_ML^(N−1)). (2.126)
This is a particular case of the Robbins-Monro algorithm, which iteratively searches for the root of a regression function.
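A minimal sketch of the sequential update (2.126), checked against the batch mean on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=1000)

# Sequential ML estimate of the mean (Eq. 2.126): one data point at a time.
mu = 0.0
for n, x_n in enumerate(x, start=1):
    mu = mu + (x_n - mu) / n
print(mu, x.mean())   # identical up to floating-point rounding
```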