Crash course in probability theory and statistics – part 2 Machine Learning, Wed Apr 16, 2008

Motivation All models are wrong, but some are useful. This lecture introduces distributions that have proven useful in constructing models.

Densities, statistics and estimators A probability (density) is any function X ↦ p(X) that satisfies the axioms of probability theory. A statistic is any function of the observed data, x ↦ f(x). An estimator, x ↦ μ, is a statistic used for estimating a parameter μ of the probability density.

Estimators Assume D = {x_1, x_2, ..., x_N} are independent, identically distributed (i.i.d.) outcomes of our experiments (the observed data). Desirable properties of an estimator are consistency, i.e. the estimate converges to the true parameter value for N → ∞, and unbiasedness, i.e. the expected value of the estimate equals the true parameter.

Estimators A general way to get an estimator is the maximum likelihood (ML) or maximum a posteriori (MAP) approach. ML: μ_ML = argmax_μ p(D | μ). MAP: μ_MAP = argmax_μ p(μ | D) = argmax_μ p(D | μ) p(μ). Possibly compensating afterwards for any bias in these estimators.
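
Not on the slides, but as a concrete illustration of the two point estimates: a minimal Python sketch that evaluates an i.i.d. Bernoulli likelihood on a grid of μ values. The data set and the Beta-shaped prior are assumptions made up for this example.

```python
import numpy as np

# Hypothetical data: 1 = heads, 0 = tails (made up for this illustration).
data = np.array([1, 1, 0, 1, 1, 1, 0, 1])
m, N = data.sum(), len(data)

# Grid over the parameter mu in (0, 1).
mu = np.linspace(0.001, 0.999, 999)

# Likelihood p(D | mu) for i.i.d. Bernoulli data.
likelihood = mu**m * (1 - mu)**(N - m)

# Illustrative prior p(mu) favouring values near 0.5 (an unnormalized
# Beta(5, 5) shape; normalization does not change the argmax).
prior = mu**4 * (1 - mu)**4

mu_ml = mu[np.argmax(likelihood)]            # ML: maximize p(D | mu)
mu_map = mu[np.argmax(likelihood * prior)]   # MAP: maximize p(D | mu) p(mu)

print(f"ML estimate : {mu_ml:.3f}")    # equals m/N = 0.75
print(f"MAP estimate: {mu_map:.3f}")   # pulled towards the prior's 0.5
```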

Bayesian estimation In a fully Bayesian approach we instead update our distribution over the parameter based on the observed data: p(μ | D) = p(D | μ) p(μ) / p(D).
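
Continuing the same made-up example, a fully Bayesian treatment keeps the whole (normalized) posterior over μ instead of a single point estimate; this is only a grid-based sketch.

```python
import numpy as np

data = np.array([1, 1, 0, 1, 1, 1, 0, 1])   # same hypothetical coin flips
m, N = data.sum(), len(data)

mu = np.linspace(0.001, 0.999, 999)
dmu = mu[1] - mu[0]

likelihood = mu**m * (1 - mu)**(N - m)      # p(D | mu)
prior = mu**4 * (1 - mu)**4                 # illustrative Beta(5, 5) shape

# Posterior p(mu | D) = p(D | mu) p(mu) / p(D), with the evidence p(D)
# approximated by a Riemann sum over the grid.
unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * dmu)

print("posterior integrates to ~", round(float(posterior.sum() * dmu), 3))
print("posterior mean of mu    ~", round(float((mu * posterior).sum() * dmu), 3))
```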

Conjugate priors If the prior distribution belongs to a certain class of functions, p(μ) ∈ C, and the product of prior and likelihood belongs to the same class, p(x | μ) p(μ) ∈ C, then the prior is called a conjugate prior.

Conjugate priors Typical situation: the posterior p(μ | x) ∝ p(x | μ) p(μ) has the same functional form f as the prior, where f is a well-known function with a known normalizing constant.

Conjugate priors

Bernoulli and binomial distribution Bernoulli distribution: a single event with binary outcome, p(x | μ) = μ^x (1 - μ)^(1 - x) for x ∈ {0, 1}. Binomial: the sum of N Bernoulli outcomes, i.e. the number of successes m in N trials. Mean for Bernoulli: E[x] = μ. Mean for Binomial: E[m] = N μ.

Bernoulli and binomial distribution Bernoulli maximum likelihood for D = {x_1, ..., x_N}: p(D | μ) = Π_n μ^(x_n) (1 - μ)^(1 - x_n), which depends on the data only through the sufficient statistic Σ_n x_n. Maximizing the log-likelihood gives the ML estimate μ_ML = (1/N) Σ_n x_n, i.e. the average of the observations, so μ_ML = m/N, where m is the number of ones (heads).

Bernoulli and binomial distribution Similar for the binomial distribution Bin(m | N, μ): the ML estimate is μ_ML = m/N.
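
A quick sanity check (a sketch with simulated data, not from the slides): the ML estimate of μ is simply the fraction of ones in the sample.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = 0.3
x = rng.binomial(n=1, p=true_mu, size=1000)   # 1000 simulated Bernoulli trials

m = x.sum()           # sufficient statistic: number of ones ("heads")
mu_ml = m / len(x)    # ML estimate = sample average = m / N

print(f"true mu = {true_mu}, ML estimate = {mu_ml:.3f}")
```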

Beta distribution Beta distribution: Beta(μ | a, b) = [Γ(a + b) / (Γ(a) Γ(b))] μ^(a - 1) (1 - μ)^(b - 1), where Γ is the Gamma function. The factor Γ(a + b) / (Γ(a) Γ(b)) = 1 / B(a, b) is the normalizing constant, with B the Beta function.
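
A small sketch (with assumed hyperparameters a and b) that evaluates the Beta density directly from the Gamma-function formula and checks numerically that the normalizing constant makes it integrate to one:

```python
import numpy as np
from math import gamma

def beta_pdf(mu, a, b):
    """Beta(mu | a, b) with its Gamma-function normalizing constant."""
    const = gamma(a + b) / (gamma(a) * gamma(b))
    return const * mu**(a - 1) * (1 - mu)**(b - 1)

a, b = 2.0, 5.0                           # illustrative hyperparameters
mu = np.linspace(1e-6, 1 - 1e-6, 10000)
dmu = mu[1] - mu[0]

density = beta_pdf(mu, a, b)
print("integral ~", round(float(density.sum() * dmu), 4))                # ~ 1.0
print("mean     ~", round(float((mu * density).sum() * dmu), 4),
      "(exact a/(a+b) =", a / (a + b), ")")
```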

Beta distribution Conjugate to the Bernoulli/Binomial: with a Beta(μ | a, b) prior and observations of m ones (heads) and l zeros (tails), the posterior distribution is p(μ | m, l, a, b) = Beta(μ | a + m, b + l).
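
A sketch of the conjugate update: starting from a Beta(a, b) prior, observing m ones and l zeros simply adds the counts to the hyperparameters. The numbers below are illustrative assumptions.

```python
# Prior hyperparameters (illustrative choice).
a, b = 2.0, 2.0

# Hypothetical observations: m ones ("heads") and l zeros ("tails").
m, l = 7, 3

# Conjugacy: Beta prior * Bernoulli/Binomial likelihood -> Beta posterior.
a_post, b_post = a + m, b + l

print(f"posterior      = Beta({a_post}, {b_post})")
print(f"posterior mean = {a_post / (a_post + b_post):.3f}")   # (a+m)/(a+b+m+l)
print(f"ML estimate    = {m / (m + l):.3f}")                  # m/N, for comparison
```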

Multinomial distribution One out of K classes: x is a bit vector with Σ_k x_k = 1. Distribution: p(x | μ) = Π_k μ_k^(x_k). Likelihood: p(D | μ) = Π_k μ_k^(m_k), where the class counts m_k = Σ_n x_nk are the sufficient statistic.

Multinomial distribution Maximum likelihood estimate: μ_k = m_k / N.
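
A minimal sketch with simulated one-of-K vectors, showing that the ML estimate is just the normalized class counts (the true probabilities here are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
true_mu = np.array([0.2, 0.5, 0.3])    # assumed class probabilities
N = 1000

# Draw N one-of-K ("bit vector") observations; each row sums to 1.
X = rng.multinomial(1, true_mu, size=N)

m = X.sum(axis=0)     # sufficient statistic: counts per class
mu_ml = m / N         # ML estimate: mu_k = m_k / N

print("counts     :", m)
print("ML estimate:", np.round(mu_ml, 3))
```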

Dirichlet distribution Dirichlet distribution: Dir(μ | α) = [Γ(α_0) / (Γ(α_1) ... Γ(α_K))] Π_k μ_k^(α_k - 1), where α_0 = Σ_k α_k; the Gamma-function factor is the normalizing constant. Conjugate to the multinomial: with observed counts m = (m_1, ..., m_K), the posterior is p(μ | D, α) = Dir(μ | α + m).
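
A sketch of the Dirichlet-multinomial update, mirroring the Beta-Bernoulli case: the posterior adds the observed counts to the concentration parameters. All numbers are illustrative.

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # illustrative Dirichlet prior (uniform)
m = np.array([20, 50, 30])          # hypothetical observed counts per class

alpha_post = alpha + m              # conjugacy: Dir(alpha) -> Dir(alpha + m)

print("posterior concentration:", alpha_post)
print("posterior mean         :", np.round(alpha_post / alpha_post.sum(), 3))
print("ML estimate            :", np.round(m / m.sum(), 3))
```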

Gaussian/Normal distribution Scalar variable: N(x | μ, σ²) = (1 / √(2πσ²)) exp(-(x - μ)² / (2σ²)). Vector variable: N(x | μ, Σ) = (1 / ((2π)^(D/2) |Σ|^(1/2))) exp(-½ (x - μ)^T Σ⁻¹ (x - μ)), where Σ is a symmetric, real D × D covariance matrix.
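
A sketch that evaluates the multivariate Gaussian density directly from the formula above; the mean and covariance values are made up for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """N(x | mu, Sigma) for a D-dimensional vector x."""
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff        # (x - mu)^T Sigma^-1 (x - mu)
    norm = 1.0 / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * quad)

mu = np.array([0.0, 1.0])                            # illustrative mean
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])                       # symmetric D x D covariance

print(gaussian_pdf(np.array([0.5, 0.5]), mu, Sigma))
```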

Geometry of a Gaussian Sufficient statistic: the density depends on x only through the quadratic form Δ² = (x - μ)^T Σ⁻¹ (x - μ). There is a linear (orthogonal) transformation U such that U^T Σ U = Λ, where Λ is a diagonal matrix of eigenvalues λ_i. Then Δ² = Σ_i y_i² / λ_i, where y = U^T (x - μ).

Geometry of a Gaussian The Gaussian is constant when Δ² is constant, i.e. on an ellipse (ellipsoid) in the U coordinate system, with axes along the eigenvectors and axis lengths proportional to √λ_i.
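
A sketch of the eigendecomposition view: diagonalizing an (illustrative) covariance matrix gives the rotation U, the eigenvalues λ_i, and hence the axis lengths √λ_i of the constant-density ellipses.

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])           # illustrative covariance matrix
mu = np.zeros(2)

# Symmetric real matrix: eigendecomposition Sigma = U diag(lam) U^T.
lam, U = np.linalg.eigh(Sigma)           # lam: eigenvalues, U: eigenvectors (columns)

print("eigenvalues lambda_i        :", np.round(lam, 3))
print("axis lengths sqrt(lambda_i) :", np.round(np.sqrt(lam), 3))

# In the rotated coordinates y = U^T (x - mu) the quadratic form becomes
# sum_i y_i^2 / lambda_i, i.e. an axis-aligned ellipse.
x = np.array([1.0, -0.5])
y = U.T @ (x - mu)
print("quadratic form:", (x - mu) @ np.linalg.inv(Sigma) @ (x - mu))
print("same via y    :", np.sum(y**2 / lam))
```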

Parameters of a Gaussian Parameters are the mean and variance/covariance: μ and σ² (or Σ). ML estimates are: μ_ML = (1/N) Σ_n x_n (unbiased) and σ²_ML = (1/N) Σ_n (x_n - μ_ML)² (biased). Intuition: the variance estimator is based on the mean estimator, which is fitted to the data and therefore fits the data better than the real mean does, so the variance is underestimated. Correction: σ² = (N/(N - 1)) σ²_ML = (1/(N - 1)) Σ_n (x_n - μ_ML)².
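
A small simulation sketch of the bias: averaged over many small samples, the ML variance estimate falls short of the true variance by the factor (N - 1)/N, while the corrected estimate does not. Sample size and parameters are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
true_var = 4.0
N = 5                                    # a small sample makes the bias visible

ml_vars, corrected_vars = [], []
for _ in range(100_000):
    x = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=N)
    ml_vars.append(np.var(x))                 # ML estimate: divide by N (biased)
    corrected_vars.append(np.var(x, ddof=1))  # divide by N - 1 (corrected)

print("true variance          :", true_var)
print("avg ML estimate        :", round(float(np.mean(ml_vars)), 3))        # ~ (N-1)/N * 4.0
print("avg corrected estimate :", round(float(np.mean(corrected_vars)), 3)) # ~ 4.0
```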

Gaussian is its own conjugate For fixed variance σ², a Gaussian prior on the mean, p(μ) = N(μ | μ_0, σ_0²) (with its usual normalizing constant), is conjugate to the Gaussian likelihood: the posterior is again Gaussian, p(μ | D) = N(μ | μ_N, σ_N²), with μ_N = (σ² μ_0 + N σ_0² μ_ML) / (N σ_0² + σ²) and 1/σ_N² = 1/σ_0² + N/σ², where μ_ML is the sample mean.
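
A sketch of this known-variance update with illustrative numbers: the posterior precision is the sum of the prior precision and N data precisions, and the posterior mean is the corresponding weighted average of the prior mean and the sample mean.

```python
import numpy as np

rng = np.random.default_rng(3)

sigma2 = 1.0                  # known (fixed) data variance
mu0, sigma0_2 = 0.0, 2.0      # illustrative Gaussian prior on the mean
true_mean = 1.5

x = rng.normal(true_mean, np.sqrt(sigma2), size=20)
N, mu_ml = len(x), x.mean()

# Posterior N(mu | mu_N, sigma_N^2):
sigmaN_2 = 1.0 / (1.0 / sigma0_2 + N / sigma2)             # precisions add up
muN = sigmaN_2 * (mu0 / sigma0_2 + N * mu_ml / sigma2)     # precision-weighted average

print(f"sample mean    : {mu_ml:.3f}")
print(f"posterior mean : {muN:.3f}, posterior variance: {sigmaN_2:.4f}")
```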

Mixtures of Gaussians A Gaussian has a single mode (peak) and cannot model multi-modal distributions. Instead, we can use mixtures: p(x) = Σ_k π_k N(x | μ_k, Σ_k), where π_k is the probability of selecting the k-th Gaussian and N(x | μ_k, Σ_k) is the conditional distribution given that component. Numerical methods are needed for parameter estimation.
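
A sketch of a two-component, one-dimensional mixture: evaluating the mixture density and sampling from it by first choosing a component with probability π_k and then drawing from that Gaussian. All mixture parameters are made up.

```python
import numpy as np

# Illustrative 1-D mixture: weights pi_k, means mu_k, standard deviations sigma_k.
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 3.0])
sigma = np.array([0.5, 1.0])

def mixture_pdf(x):
    """p(x) = sum_k pi_k N(x | mu_k, sigma_k^2)."""
    comps = pi * np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return comps.sum()

# Sampling: first pick a component k with probability pi_k, then draw from it.
rng = np.random.default_rng(0)
k = rng.choice(len(pi), size=5, p=pi)
samples = rng.normal(mu[k], sigma[k])

print("density at x = 0:", round(mixture_pdf(0.0), 4))
print("samples         :", np.round(samples, 2))
```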

Summary
● Bernoulli and Binomial, with Beta prior
● Multinomial with Dirichlet prior
● Gaussian with Gaussian prior
● Mixtures of Gaussians