Crash course in probability theory and statistics – part 2 Machine Learning, Wed Apr 16, 2008

Motivation All models are wrong, but some are useful. This lecture introduces distributions that have proven useful in constructing models.

Densities, statistics and estimators A probability (density) is any function X ↦ p(X) that satisfies the axioms of probability theory. A statistic is any function of the observed data, x ↦ f(x). An estimator, x ↦ μ, is a statistic used for estimating a parameter μ of the probability density.

Estimators Assume D = {x_1, x_2, ..., x_N} are independent, identically distributed (i.i.d.) outcomes of our experiments (the observed data). Desirable properties of an estimator are consistency, i.e. the estimate converges to the true parameter value for N → ∞, and unbiasedness, i.e. the expected value of the estimate equals the true parameter.

Estimators A general way to get an estimator is the maximum likelihood (ML) or maximum a posteriori (MAP) approach. ML: μ_ML = argmax_μ p(D | μ). MAP: μ_MAP = argmax_μ p(μ | D) = argmax_μ p(D | μ) p(μ). Possibly compensating afterwards for any bias in these estimators.
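
Not on the slides, but as a concrete illustration of the two point estimates: a minimal Python sketch that evaluates an i.i.d. Bernoulli likelihood on a grid of μ values. The data set and the Beta-shaped prior are assumptions made up for this example.

```python
import numpy as np

# Hypothetical data: 1 = heads, 0 = tails (made up for this illustration).
data = np.array([1, 1, 0, 1, 1, 1, 0, 1])
m, N = data.sum(), len(data)

# Grid over the parameter mu in (0, 1).
mu = np.linspace(0.001, 0.999, 999)

# Likelihood p(D | mu) for i.i.d. Bernoulli data.
likelihood = mu**m * (1 - mu)**(N - m)

# Illustrative prior p(mu) favouring values near 0.5 (an unnormalized
# Beta(5, 5) shape; normalization does not change the argmax).
prior = mu**4 * (1 - mu)**4

mu_ml = mu[np.argmax(likelihood)]            # ML: maximize p(D | mu)
mu_map = mu[np.argmax(likelihood * prior)]   # MAP: maximize p(D | mu) p(mu)

print(f"ML estimate : {mu_ml:.3f}")    # equals m/N = 0.75
print(f"MAP estimate: {mu_map:.3f}")   # pulled towards the prior's 0.5
```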

Bayesian estimation In a fully Bayesian approach we instead update our distribution over the parameter based on the observed data: p(μ | D) = p(D | μ) p(μ) / p(D).
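
Continuing the same made-up example, a fully Bayesian treatment keeps the whole (normalized) posterior over μ instead of a single point estimate; this is only a grid-based sketch.

```python
import numpy as np

data = np.array([1, 1, 0, 1, 1, 1, 0, 1])   # same hypothetical coin flips
m, N = data.sum(), len(data)

mu = np.linspace(0.001, 0.999, 999)
dmu = mu[1] - mu[0]

likelihood = mu**m * (1 - mu)**(N - m)      # p(D | mu)
prior = mu**4 * (1 - mu)**4                 # illustrative Beta(5, 5) shape

# Posterior p(mu | D) = p(D | mu) p(mu) / p(D), with the evidence p(D)
# approximated by a Riemann sum over the grid.
unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * dmu)

print("posterior integrates to ~", round(float(posterior.sum() * dmu), 3))
print("posterior mean of mu    ~", round(float((mu * posterior).sum() * dmu), 3))
```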

Conjugate priors If the prior distribution belongs to a certain class of functions, p(μ) ∈ C, and the product of prior and likelihood belongs to the same class, p(x | μ) p(μ) ∈ C, then the prior is called a conjugate prior.

Conjugate priors Typical situation: the posterior p(μ | x) ∝ p(x | μ) p(μ) has the same functional form f as the prior, where f is a well-known function with a known normalizing constant.

Conjugate priors

Bernoulli and binomial distribution Bernoulli distribution: a single event with binary outcome, p(x | μ) = μ^x (1 - μ)^(1 - x) for x ∈ {0, 1}. Binomial: the sum of N Bernoulli outcomes, i.e. the number of successes m in N trials. Mean for Bernoulli: E[x] = μ. Mean for Binomial: E[m] = N μ.

Bernoulli and binomial distribution Bernoulli maximum likelihood for D = {x_1, ..., x_N}: p(D | μ) = Π_n μ^(x_n) (1 - μ)^(1 - x_n), which depends on the data only through the sufficient statistic Σ_n x_n. Maximizing the log-likelihood gives the ML estimate μ_ML = (1/N) Σ_n x_n, i.e. the average of the observations, so μ_ML = m/N, where m is the number of ones (heads).

Bernoulli and binomial distribution Similar for the binomial distribution Bin(m | N, μ): the ML estimate is μ_ML = m/N.
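
A quick sanity check (a sketch with simulated data, not from the slides): the ML estimate of μ is simply the fraction of ones in the sample.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = 0.3
x = rng.binomial(n=1, p=true_mu, size=1000)   # 1000 simulated Bernoulli trials

m = x.sum()           # sufficient statistic: number of ones ("heads")
mu_ml = m / len(x)    # ML estimate = sample average = m / N

print(f"true mu = {true_mu}, ML estimate = {mu_ml:.3f}")
```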

Beta distribution Beta distribution: Beta(μ | a, b) = [Γ(a + b) / (Γ(a) Γ(b))] μ^(a - 1) (1 - μ)^(b - 1), where Γ is the Gamma function. The factor Γ(a + b) / (Γ(a) Γ(b)) = 1 / B(a, b) is the normalizing constant, with B the Beta function.
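
A small sketch (with assumed hyperparameters a and b) that evaluates the Beta density directly from the Gamma-function formula and checks numerically that the normalizing constant makes it integrate to one:

```python
import numpy as np
from math import gamma

def beta_pdf(mu, a, b):
    """Beta(mu | a, b) with its Gamma-function normalizing constant."""
    const = gamma(a + b) / (gamma(a) * gamma(b))
    return const * mu**(a - 1) * (1 - mu)**(b - 1)

a, b = 2.0, 5.0                           # illustrative hyperparameters
mu = np.linspace(1e-6, 1 - 1e-6, 10000)
dmu = mu[1] - mu[0]

density = beta_pdf(mu, a, b)
print("integral ~", round(float(density.sum() * dmu), 4))                # ~ 1.0
print("mean     ~", round(float((mu * density).sum() * dmu), 4),
      "(exact a/(a+b) =", a / (a + b), ")")
```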

Beta distribution Conjugate to the Bernoulli/Binomial: with a Beta(μ | a, b) prior and observations of m ones (heads) and l zeros (tails), the posterior distribution is p(μ | m, l, a, b) = Beta(μ | a + m, b + l).
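
A sketch of the conjugate update: starting from a Beta(a, b) prior, observing m ones and l zeros simply adds the counts to the hyperparameters. The numbers below are illustrative assumptions.

```python
# Prior hyperparameters (illustrative choice).
a, b = 2.0, 2.0

# Hypothetical observations: m ones ("heads") and l zeros ("tails").
m, l = 7, 3

# Conjugacy: Beta prior * Bernoulli/Binomial likelihood -> Beta posterior.
a_post, b_post = a + m, b + l

print(f"posterior      = Beta({a_post}, {b_post})")
print(f"posterior mean = {a_post / (a_post + b_post):.3f}")   # (a+m)/(a+b+m+l)
print(f"ML estimate    = {m / (m + l):.3f}")                  # m/N, for comparison
```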

Multinomial distribution One out of K classes: x is a bit vector with Σ_k x_k = 1. Distribution: p(x | μ) = Π_k μ_k^(x_k). Likelihood: p(D | μ) = Π_k μ_k^(m_k), where the class counts m_k = Σ_n x_nk are the sufficient statistic.

Multinomial distribution Maximum likelihood estimate: μ_k = m_k / N.
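
A minimal sketch with simulated one-of-K vectors, showing that the ML estimate is just the normalized class counts (the true probabilities here are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
true_mu = np.array([0.2, 0.5, 0.3])    # assumed class probabilities
N = 1000

# Draw N one-of-K ("bit vector") observations; each row sums to 1.
X = rng.multinomial(1, true_mu, size=N)

m = X.sum(axis=0)     # sufficient statistic: counts per class
mu_ml = m / N         # ML estimate: mu_k = m_k / N

print("counts     :", m)
print("ML estimate:", np.round(mu_ml, 3))
```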

Dirichlet distribution Dirichlet distribution: Dir(μ | α) = [Γ(α_0) / (Γ(α_1) ... Γ(α_K))] Π_k μ_k^(α_k - 1), where α_0 = Σ_k α_k; the Gamma-function factor is the normalizing constant. Conjugate to the multinomial: with observed counts m = (m_1, ..., m_K), the posterior is p(μ | D, α) = Dir(μ | α + m).
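
A sketch of the Dirichlet-multinomial update, mirroring the Beta-Bernoulli case: the posterior adds the observed counts to the concentration parameters. All numbers are illustrative.

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # illustrative Dirichlet prior (uniform)
m = np.array([20, 50, 30])          # hypothetical observed counts per class

alpha_post = alpha + m              # conjugacy: Dir(alpha) -> Dir(alpha + m)

print("posterior concentration:", alpha_post)
print("posterior mean         :", np.round(alpha_post / alpha_post.sum(), 3))
print("ML estimate            :", np.round(m / m.sum(), 3))
```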

Gaussian/Normal distribution Scalar variable: N(x | μ, σ²) = (1 / √(2πσ²)) exp(-(x - μ)² / (2σ²)). Vector variable: N(x | μ, Σ) = (1 / ((2π)^(D/2) |Σ|^(1/2))) exp(-½ (x - μ)^T Σ⁻¹ (x - μ)), where Σ is a symmetric, real D × D covariance matrix.
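
A sketch that evaluates the multivariate Gaussian density directly from the formula above; the mean and covariance values are made up for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """N(x | mu, Sigma) for a D-dimensional vector x."""
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff        # (x - mu)^T Sigma^-1 (x - mu)
    norm = 1.0 / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * quad)

mu = np.array([0.0, 1.0])                            # illustrative mean
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])                       # symmetric D x D covariance

print(gaussian_pdf(np.array([0.5, 0.5]), mu, Sigma))
```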

Geometry of a Gaussian Sufficient statistic: the density depends on x only through the quadratic form Δ² = (x - μ)^T Σ⁻¹ (x - μ). There is a linear (orthogonal) transformation U such that U^T Σ U = Λ, where Λ is a diagonal matrix of eigenvalues λ_i. Then Δ² = Σ_i y_i² / λ_i, where y = U^T (x - μ).

Geometry of a Gaussian The Gaussian is constant when Δ² is constant, i.e. on an ellipse (ellipsoid) in the U coordinate system, with axes along the eigenvectors and axis lengths proportional to √λ_i.
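
A sketch of the eigendecomposition view: diagonalizing an (illustrative) covariance matrix gives the rotation U, the eigenvalues λ_i, and hence the axis lengths √λ_i of the constant-density ellipses.

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])           # illustrative covariance matrix
mu = np.zeros(2)

# Symmetric real matrix: eigendecomposition Sigma = U diag(lam) U^T.
lam, U = np.linalg.eigh(Sigma)           # lam: eigenvalues, U: eigenvectors (columns)

print("eigenvalues lambda_i        :", np.round(lam, 3))
print("axis lengths sqrt(lambda_i) :", np.round(np.sqrt(lam), 3))

# In the rotated coordinates y = U^T (x - mu) the quadratic form becomes
# sum_i y_i^2 / lambda_i, i.e. an axis-aligned ellipse.
x = np.array([1.0, -0.5])
y = U.T @ (x - mu)
print("quadratic form:", (x - mu) @ np.linalg.inv(Sigma) @ (x - mu))
print("same via y    :", np.sum(y**2 / lam))
```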

Parameters of a Gaussian Parameters are the mean and variance/covariance: μ and σ² (or Σ). ML estimates are: μ_ML = (1/N) Σ_n x_n (unbiased) and σ²_ML = (1/N) Σ_n (x_n - μ_ML)² (biased). Intuition: the variance estimator is based on the mean estimator, which is fitted to the data and therefore fits the data better than the real mean does, so the variance is underestimated. Correction: σ² = (N/(N - 1)) σ²_ML = (1/(N - 1)) Σ_n (x_n - μ_ML)².
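
A small simulation sketch of the bias: averaged over many small samples, the ML variance estimate falls short of the true variance by the factor (N - 1)/N, while the corrected estimate does not. Sample size and parameters are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
true_var = 4.0
N = 5                                    # a small sample makes the bias visible

ml_vars, corrected_vars = [], []
for _ in range(100_000):
    x = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=N)
    ml_vars.append(np.var(x))                 # ML estimate: divide by N (biased)
    corrected_vars.append(np.var(x, ddof=1))  # divide by N - 1 (corrected)

print("true variance          :", true_var)
print("avg ML estimate        :", round(float(np.mean(ml_vars)), 3))        # ~ (N-1)/N * 4.0
print("avg corrected estimate :", round(float(np.mean(corrected_vars)), 3)) # ~ 4.0
```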

Gaussian is its own conjugate For fixed variance σ², a Gaussian prior on the mean, p(μ) = N(μ | μ_0, σ_0²) (with its usual normalizing constant), is conjugate to the Gaussian likelihood: the posterior is again Gaussian, p(μ | D) = N(μ | μ_N, σ_N²), with μ_N = (σ² μ_0 + N σ_0² μ_ML) / (N σ_0² + σ²) and 1/σ_N² = 1/σ_0² + N/σ², where μ_ML is the sample mean.
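
A sketch of this known-variance update with illustrative numbers: the posterior precision is the sum of the prior precision and N data precisions, and the posterior mean is the corresponding weighted average of the prior mean and the sample mean.

```python
import numpy as np

rng = np.random.default_rng(3)

sigma2 = 1.0                  # known (fixed) data variance
mu0, sigma0_2 = 0.0, 2.0      # illustrative Gaussian prior on the mean
true_mean = 1.5

x = rng.normal(true_mean, np.sqrt(sigma2), size=20)
N, mu_ml = len(x), x.mean()

# Posterior N(mu | mu_N, sigma_N^2):
sigmaN_2 = 1.0 / (1.0 / sigma0_2 + N / sigma2)             # precisions add up
muN = sigmaN_2 * (mu0 / sigma0_2 + N * mu_ml / sigma2)     # precision-weighted average

print(f"sample mean    : {mu_ml:.3f}")
print(f"posterior mean : {muN:.3f}, posterior variance: {sigmaN_2:.4f}")
```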

Mixtures of Gaussians A Gaussian has a single mode (peak) and cannot model multi-modal distributions. Instead, we can use mixtures: p(x) = Σ_k π_k N(x | μ_k, Σ_k), where π_k is the probability of selecting the k-th Gaussian and N(x | μ_k, Σ_k) is the conditional distribution given that component. Numerical methods are needed for parameter estimation.
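
A sketch of a two-component, one-dimensional mixture: evaluating the mixture density and sampling from it by first choosing a component with probability π_k and then drawing from that Gaussian. All mixture parameters are made up.

```python
import numpy as np

# Illustrative 1-D mixture: weights pi_k, means mu_k, standard deviations sigma_k.
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 3.0])
sigma = np.array([0.5, 1.0])

def mixture_pdf(x):
    """p(x) = sum_k pi_k N(x | mu_k, sigma_k^2)."""
    comps = pi * np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return comps.sum()

# Sampling: first pick a component k with probability pi_k, then draw from it.
rng = np.random.default_rng(0)
k = rng.choice(len(pi), size=5, p=pi)
samples = rng.normal(mu[k], sigma[k])

print("density at x = 0:", round(mixture_pdf(0.0), 4))
print("samples         :", np.round(samples, 2))
```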

Summary
● Bernoulli and Binomial, with Beta prior
● Multinomial with Dirichlet prior
● Gaussian with Gaussian prior
● Mixtures of Gaussians