Parametric Methods, Berlin Chen, 2005. References: 1. E. Alpaydın, Introduction to Machine Learning, Chapter 4

Introduction
A statistic is any value that is calculated from a given sample. Statistical inference: make a decision using the information provided by a sample (or samples).
Parametric methods assume that the samples are drawn from some distribution that obeys a known model. Advantage: the model is well defined up to a small number of parameters; e.g., mean and variance are sufficient statistics for the Gaussian distribution. Model parameters are typically estimated by either maximum likelihood estimation or Bayesian (MAP) estimation. Data samples are assumed to be univariate (scalar variables) here.

Maximum Likelihood Estimation (MLE)
Assume the samples X = {x^t}, t = 1..N, are independent and identically distributed (iid) and drawn from some known probability distribution p(x|θ), where θ denotes the model parameters (assumed to be fixed but unknown here).
MLE attempts to find the θ that makes X the most likely to be drawn, namely, to maximize the likelihood of the samples: l(θ|X) = p(X|θ) = ∏_t p(x^t|θ).

MLE (cont.)
Because the logarithm does not change where a function takes its maximum, finding the θ that maximizes the likelihood of the samples is equivalent to finding the θ that maximizes the log likelihood of the samples: L(θ|X) = log l(θ|X) = Σ_t log p(x^t|θ).
As we will see, the logarithmic operation further simplifies the computation when estimating the parameters of distributions that have exponents.
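A minimal numerical sketch of this idea (Python; the synthetic Gaussian sample and the use of scipy's general-purpose optimizer are illustrative choices, not part of the slides): evaluate the log likelihood as a sum of log densities and maximize it; the result should match the closed-form estimates derived on the later Gaussian slides.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.5, size=500)        # assumed iid sample

    def neg_log_likelihood(params, x):
        mu, log_sigma = params                          # optimize log(sigma) so sigma stays positive
        sigma = np.exp(log_sigma)
        # L(mu, sigma | X) = sum_t log N(x^t | mu, sigma^2)
        ll = -0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (x - mu)**2 / sigma**2)
        return -ll                                      # minimize the negative log likelihood

    res = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(x,))
    mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
    print(mu_hat, sigma_hat)                            # close to the sample mean and std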

MLE: Bernoulli Distribution
A random variable x takes either the value 1 (with probability p) or the value 0 (with probability 1 − p); it can be thought of as being generated from two distinct states.
The associated probability distribution: P(x) = p^x (1 − p)^(1 − x), x ∈ {0, 1}.
The log likelihood for a set of iid samples X = {x^t} drawn from a Bernoulli distribution: L(p|X) = Σ_t [ x^t log p + (1 − x^t) log(1 − p) ].

MLE: Bernoulli Distribution (cont.)
MLE of the distribution parameter p: p̂ = (Σ_t x^t) / N.
The estimate for p is the ratio of the number of occurrences of the event (x = 1) to the number of experiments.
The expected value for x: E[x] = p.
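A one-line check of this estimate (Python; the simulated coin flips and the true p = 0.3 are assumed for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.binomial(n=1, p=0.3, size=1000)   # 1000 Bernoulli trials
    p_hat = x.mean()                          # ML estimate: #occurrences of x = 1 / #experiments
    print(p_hat)                              # close to 0.3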

MLE: Bernoulli Distribution (cont.) Appendix A

MLE: Multinomial Distribution
A generalization of the Bernoulli distribution: the value of a random event can be one of K mutually exclusive and exhaustive states, with indicator variables x_i ∈ {0, 1}, Σ_i x_i = 1, and state probabilities p_i, Σ_i p_i = 1.
The associated probability distribution: P(x_1, ..., x_K) = ∏_i p_i^{x_i}.
The log likelihood for a set of iid samples drawn from a multinomial distribution: L(p_1, ..., p_K | X) = Σ_t Σ_i x_i^t log p_i.

MLE: Multinomial Distribution (cont.)
MLE of the distribution parameters: p̂_i = (Σ_t x_i^t) / N.
The estimate for p_i is the ratio of the number of experiments with outcome of state i (x_i = 1) to the number of experiments.
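A sketch of this count-and-normalize estimate (Python; the die-like distribution with K = 6 states is an assumed example):

    import numpy as np

    rng = np.random.default_rng(2)
    K = 6
    true_p = np.array([0.1, 0.1, 0.2, 0.2, 0.2, 0.2])
    outcomes = rng.choice(K, size=2000, p=true_p)   # state index observed in each experiment
    counts = np.bincount(outcomes, minlength=K)     # number of experiments per state
    p_hat = counts / counts.sum()                   # ML estimates: relative frequencies
    print(p_hat)                                    # close to true_p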

MLE: Multinomial Distribution (cont.) Appendix B: derivation using a Lagrange multiplier to enforce the constraint Σ_i p_i = 1.

MLE: Gaussian Distribution
Also called the normal distribution; characterized by its mean μ and variance σ²: p(x) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²)).
Recall that the mean and variance are sufficient statistics for the Gaussian.
The log likelihood for a set of iid samples drawn from a Gaussian distribution: L(μ, σ | X) = −(N/2) log(2π) − N log σ − Σ_t (x^t − μ)² / (2σ²).

MLE: Gaussian Distribution (cont.)
MLE of the distribution parameters μ and σ²: m = (Σ_t x^t) / N and s² = (Σ_t (x^t − m)²) / N.
A reminder that μ and σ² are still fixed but unknown; m and s² are estimates computed from the sample.
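A quick check of the closed-form estimates (Python; note the ML variance divides by N, which matches np.var's default ddof=0, not the unbiased N − 1 version):

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.normal(loc=2.0, scale=1.5, size=500)   # assumed sample

    m  = x.mean()                                  # ML estimate of the mean
    s2 = np.mean((x - m) ** 2)                     # ML estimate of the variance (divides by N)
    print(m, s2, np.var(x))                        # np.var(x) (ddof=0) equals s2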

MLE: Gaussian Distribution (cont.) Appendix C

Evaluating an Estimator: Bias
Suppose X is a sample from a population distribution with a parameter θ, and let d = d(X) be an estimator of θ. The bias of the estimator is defined as b_θ(d) = E[d] − θ, the expected difference between d and θ.
An unbiased estimator has the property that E[d] = θ. An asymptotically unbiased estimator is one whose bias goes to zero as the sample size goes to infinity.

Evaluating an Estimator: Variance
The variance var(d) = E[(d − E[d])²] measures how much, on average, the estimator varies around its expected value. As we will see later, the smaller the sample size N, the larger the variance.
The mean square error of the estimator is defined as r(d, θ) = E[(d − θ)²], which measures how much the estimator differs from the true parameter θ.

Evaluating an Estimator (cont.)
The mean square error of the estimator can be decomposed into two parts, one due to bias and one due to variance: r(d, θ) = E[(d − θ)²] = (E[d] − θ)² + E[(d − E[d])²] = bias² + variance.
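A reconstruction of the derivation behind this decomposition (standard argument, written out here rather than copied from the slides; it uses the fact that E[d] − θ is a constant):

    \begin{aligned}
    r(d,\theta) &= E\big[(d-\theta)^2\big] \\
                &= E\big[(d - E[d] + E[d] - \theta)^2\big] \\
                &= E\big[(d - E[d])^2\big] + (E[d]-\theta)^2
                   + 2\,(E[d]-\theta)\,E\big[d - E[d]\big] \\
                &= \underbrace{E\big[(d - E[d])^2\big]}_{\text{variance}}
                   + \underbrace{(E[d]-\theta)^2}_{\text{bias}^2}
    \end{aligned}

    \text{since } E\big[d - E[d]\big] = 0.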

Evaluating an Estimator (cont.)

Evaluating an Estimator (cont.)
Example 1: sample average and sample variance. Assume the samples x^t, t = 1..N, are independent and identically distributed (iid) and drawn from some known probability distribution with mean μ and variance σ².
Sample average (mean) of the observed samples: m = (Σ_t x^t) / N. Sample variance of the observed samples: s² = (Σ_t (x^t − m)²) / N. How good are m and s² as estimators of μ and σ²?

Evaluating an Estimator (cont.)
Example 1 (cont.): The sample average m is an unbiased estimator of the mean: E[m] = μ. It is also a consistent estimator: var(m) = σ²/N → 0 as N → ∞.

Evaluating an Estimator (cont.)
Example 1 (cont.): The sample variance s² is an asymptotically unbiased estimator of the variance: E[s²] = ((N − 1)/N) σ² ≠ σ².

Evaluating an Estimator (cont.)
Example 1 (cont.): The sample variance s² is an asymptotically unbiased estimator of the variance: its bias, −σ²/N, goes to zero as N, the size of the observed sample set, grows.
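A small simulation of this bias (Python; the sample size N = 5, the number of repetitions, and the underlying normal distribution are arbitrary illustrative choices):

    import numpy as np

    rng = np.random.default_rng(4)
    N, reps, sigma2 = 5, 100_000, 4.0

    samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, N))
    s2 = samples.var(axis=1, ddof=0)   # sample variance with divisor N
    print(s2.mean())                   # close to (N-1)/N * sigma2 = 3.2, not 4.0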

Prior Information
Given a likelihood density p(X|θ) with parameter θ to be estimated, a prior density p(θ) carries prior information on the possible value range that θ may take; it is especially helpful when the set of training samples is small.
θ is treated as a random variable, and p(θ) tells the likely values that θ may take before looking at the samples. The parameters of p(θ) itself are called hyperparameters.
The prior density can be combined with the likelihood density to form the posterior density of θ: p(θ|X) ∝ p(X|θ) p(θ).

Prior Information
Conjugate priors: a prior density chosen so that the posterior density, the likelihood density, and the prior density itself belong to the same distribution family, e.g., the Gaussian (normal) density family.

Prior Information (cont.)
Example 2: the prior belief is that θ is approximately normal and lies between 5 and 9, symmetrically around 7, with 90 percent confidence.

Posterior Density
The posterior density of the parameters, p(θ|X) = p(X|θ) p(θ) / p(X), tells the likely values that θ may take after looking at the samples.

Density/Output Estimation
Density estimation of a new sample x using X and the posterior: p(x|X) = ∫ p(x|θ) p(θ|X) dθ.
Output estimation of y for a new input x: y = ∫ y(x|θ) p(θ|X) dθ.
Both take an average over the predictions (densities or function outputs) using all values of θ, weighted by their posterior probabilities p(θ|X).

MAP and ML Estimators
However, evaluating the integrals in the above equations is often not feasible, so estimation is instead based on a single point (point estimators): the maximum a posteriori (MAP) estimator or the maximum likelihood (ML) estimator.
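The two point estimators, written out (a reconstruction of the missing formulas in standard notation):

    \hat{\theta}_{\text{MAP}} = \arg\max_{\theta}\, p(\theta \mid \mathcal{X})
                              = \arg\max_{\theta}\, p(\mathcal{X} \mid \theta)\, p(\theta),
    \qquad
    \hat{\theta}_{\text{ML}}  = \arg\max_{\theta}\, p(\mathcal{X} \mid \theta)

If the prior p(θ) is flat (uninformative), the MAP and ML estimators coincide.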

Bayes’ Estimators
The Bayes’ estimator is defined as the expected value of θ given that its posterior density is known: θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ.
Suppose an estimate with value d is made for θ; the mean square error of the estimate is E[(d − θ)²|X].

Bayes’ Estimators (cont.)
The mean square error is minimized when d = E[θ|X]: the best estimate of a random variable (under squared error) is its mean.
If the likelihood density p(X|θ) and the prior density p(θ) are both normal, the posterior p(θ|X) is also normal.

Bayes’ Estimators (cont.)
Example 2: given a normal likelihood x^t ~ N(θ, σ²) and a normal prior θ ~ N(μ0, σ0²), what is the estimate E[θ|X]? The posterior mean is a weighted average of the prior mean μ0 and the sample mean m, with weights determined by the sample size N and the variances σ² and σ0².
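A sketch of this weighted average (Python; the known likelihood variance sigma2, the prior N(7, 1), and the data are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(5)
    theta_true, sigma2 = 7.5, 4.0
    mu0, s0_2 = 7.0, 1.0                        # prior: theta ~ N(7, 1)

    x = rng.normal(theta_true, np.sqrt(sigma2), size=20)
    N, m = len(x), x.mean()

    # Posterior mean: precision-weighted average of the sample mean and the prior mean
    w_data = (N / sigma2) / (N / sigma2 + 1 / s0_2)
    theta_bayes = w_data * m + (1 - w_data) * mu0
    print(m, theta_bayes)                       # the Bayes estimate shrinks m toward mu0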

Parametric Classification
Bayes’ classifier revisited: use the discriminant function g_i(x) = p(x|C_i) P(C_i), or equivalently g_i(x) = log p(x|C_i) + log P(C_i).
How can we interpret p(x|C_i) and P(C_i)? Here x is assumed to be univariate; the denominator p(x) is dropped because it is common to all classes, and the logarithm can be taken because it is monotonic.

Parametric Classification (cont.)
Bayes’ classifier revisited: assume p(x|C_i) is Gaussian with mean μ_i and variance σ_i², so that (dropping constants) g_i(x) = −log σ_i − (x − μ_i)² / (2σ_i²) + log P(C_i). The prior P(C_i) is simply estimated as the proportion of samples that belong to class C_i.

Parametric Classification (cont.)
Bayes’ classifier revisited: how can we estimate the class means and variances? Perform maximum likelihood estimation (MLE) on the given (training) samples of each class. Bayesian or MAP estimation could be used instead.

Parametric Classification (cont.)
Bayes’ classifier revisited: if the class variances and priors are further set to be equal among the classes, the discriminant reduces to g_i(x) = −(x − m_i)², i.e., assign x to the class with the nearest mean.
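A compact sketch of this likelihood-based classifier for the univariate Gaussian case with unequal variances (Python; the two-class synthetic data are assumed):

    import numpy as np

    rng = np.random.default_rng(6)
    x0 = rng.normal(-1.0, 1.0, size=300)                 # class 0 samples (assumed)
    x1 = rng.normal( 2.0, 1.5, size=200)                 # class 1 samples (assumed)
    x_train = np.concatenate([x0, x1])
    y_train = np.concatenate([np.zeros(300, int), np.ones(200, int)])

    # Per-class ML estimates and priors
    means, variances, priors = [], [], []
    for c in (0, 1):
        xc = x_train[y_train == c]
        means.append(xc.mean())
        variances.append(xc.var())                       # ML variance (divides by N_c)
        priors.append(len(xc) / len(x_train))

    def g(x, c):
        # g_c(x) = log p(x | C_c) + log P(C_c), dropping the common constant
        return (-0.5 * np.log(variances[c])
                - (x - means[c]) ** 2 / (2 * variances[c])
                + np.log(priors[c]))

    x_new = 0.7
    print(0 if g(x_new, 0) > g(x_new, 1) else 1)         # predicted class

With equal variances and priors, the variance and prior terms cancel and the rule reduces to choosing the nearest class mean, as stated above.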

Parametric Classification (cont.) E.g., Classes with Equal Priors and Variances

Parametric Classification (cont.) E.g., Classes with Equal Priors and Unequal Variances (one of the classes has a larger variance)

Parametric Classification (cont.)
Two common approaches for classification problems:
- Likelihood-based approach (as in the classifiers above): estimate the probability distribution (likelihood density) of the samples and get the discriminant function using Bayes’ rule. Gaussian densities are usually assumed for continuous variables, so a normality test is needed: the density should be bell-shaped (unimodal, symmetric around the center). Example classifiers: HMMs.
- Discriminant-based approach: bypass the estimation of densities and directly estimate the discriminants. Example classifiers: neural networks, support vector machines.

Regression
Function approximation: assume the observed numeric output r is the sum of a deterministic function of the input and random noise, r = f(x) + ε, with Gaussian noise ε ~ N(0, σ²).
An estimator g(x|θ), with θ a set of parameters, is used to approximate f(x).

Regression (cont.)
Find a parameter setting of θ that maximizes the logarithm of the product of the likelihoods over all training samples. With Gaussian noise, maximizing this log likelihood is equivalent to minimizing the error function E(θ|X) = (1/2) Σ_t (r^t − g(x^t|θ))², i.e., least squares estimation (minimizing the sum of squared errors).
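A reconstruction of the step from the log likelihood to the squared-error criterion, under the Gaussian-noise assumption (standard derivation, not quoted verbatim from the slides):

    L(\theta \mid \mathcal{X})
      = \log \prod_{t=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}
        \exp\!\Big[-\frac{(r^t - g(x^t \mid \theta))^2}{2\sigma^2}\Big]
      = -N \log(\sqrt{2\pi}\,\sigma)
        - \frac{1}{2\sigma^2} \sum_{t=1}^{N} \big(r^t - g(x^t \mid \theta)\big)^2

The first term is constant in θ, so maximizing L is equivalent to minimizing E(θ|X) = (1/2) Σ_t (r^t − g(x^t|θ))².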

Regression (cont.)
Further assume that g is a linear model: g(x|w1, w0) = w1 x + w0 (linear regression). Find the minimum of E by taking partial derivatives with respect to w1 and w0 and setting them to zero.

Regression (cont.)
The two resulting equations can be expressed in vector-matrix form, A w = y, and solved for w; the same formulation extends to polynomial regression by adding powers of x as extra columns.
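A minimal sketch of both cases (Python; the noisy cubic data and the choice of degree 3 are assumptions for illustration). Linear regression builds a design matrix with columns [1, x]; polynomial regression only adds further columns of powers of x.

    import numpy as np

    rng = np.random.default_rng(7)
    x = np.linspace(-2, 2, 50)
    r = 0.5 * x**3 - x + rng.normal(scale=0.3, size=x.shape)   # noisy observed outputs

    # Linear regression: design matrix [1, x], solve by least squares
    A_lin = np.column_stack([np.ones_like(x), x])
    w_lin, *_ = np.linalg.lstsq(A_lin, r, rcond=None)

    # Polynomial regression (degree 3): columns [1, x, x^2, x^3]
    A_poly = np.column_stack([x**k for k in range(4)])
    w_poly, *_ = np.linalg.lstsq(A_poly, r, rcond=None)

    print(w_lin)    # [w0, w1]
    print(w_poly)   # [w0, w1, w2, w3], roughly [0, -1, 0, 0.5]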

HW-5: Parametric Classification
Perform two-class classification (dichotomy) on the two data sets (MaleData, FemaleData) given in HW-3:
- The first 1000 samples from each data set are reserved for testing and are blended together.
- The remaining samples of each data set (the training sets) are used to estimate the likelihood densities and prior densities.
- All samples are projected onto the first eigenvector (dimension) of the LDA or PCA matrix; the LDA or PCA matrix is obtained from the training data sets only.
- Normality is assumed for the samples.
- Analyze the classification accuracy (using the testing samples and the Bayes’ classifier) and plot two figures showing the likelihood densities and posterior densities (using the training samples), like those shown in Fig. 4.3 (Alpaydin).