Maximum Likelihood Estimation Psych 818 - DeShon.

MLE vs. OLS
Ordinary least squares (OLS) estimation
- Typically yields a closed-form solution that can be directly computed
- Closed-form solutions often require very strong assumptions
Maximum likelihood estimation (MLE)
- The default method for most estimation problems
- Generally equal to OLS when the OLS assumptions are met
- Yields desirable "asymptotic" estimation properties
- The foundation for Bayesian inference
- Requires numerical methods :(

MLE logic
- MLE reverses the direction of the probability inference.
- Recall p(X | θ), where θ represents the parameters of a model (i.e., of a pdf). Forward question: what is the probability of observing a score of 73 from a N(70, 10) distribution?
- In MLE, you know the data (X_i). The primary question: which of a potentially infinite number of distributions is most likely responsible for generating the data? That is, p(θ | X).
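
As a quick illustration (my own, not in the original slides), the forward question can be evaluated directly in R. For a continuous variable, dnorm() returns the density at a point, which is what the slide loosely calls the probability of observing a score of 73:

  dnorm(73, mean = 70, sd = 10)   # density of a score of 73 under N(70, 10); about 0.038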

Likelihood
- Likelihood may be thought of as an unbounded or unnormalized probability measure.
- The pdf is a function of the data given the parameters, on the data scale.
- The likelihood is a function of the parameters given the data, on the parameter scale.

Likelihood
The likelihood function:
- Likelihood is the joint (product) probability of the observed data given the parameters of the pdf.
- Assume you have X_1, ..., X_n, independent samples from a given pdf f_θ.

Likelihood
The log-likelihood function:
- Working with products is a pain.
- Maxima are unaffected by monotone transformations, so we can take the logarithm of the likelihood and turn the product into a sum.
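
In symbols (a standard statement of the definitions above), for independent observations X_1, ..., X_n from a pdf f_θ:

  L(\theta \mid X_1, \dots, X_n) = \prod_{i=1}^{n} f_\theta(X_i),
  \qquad
  \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f_\theta(X_i)

Because the logarithm is monotone, the θ that maximizes ℓ also maximizes L.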

Maximum Likelihood
- Find the value(s) of θ that maximize the likelihood function.
- The maximum can sometimes be found analytically; maximization (or minimization) is the focus of calculus and of derivatives of functions.
- It often requires iterative numerical methods.

Likelihood
Normal distribution example: the normal pdf, its likelihood, and its log-likelihood (written out below).
- Note: C is a constant that vanishes once derivatives are taken, since it does not involve μ or σ.
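
Written out (the standard expressions; any constants are gathered into C):

  f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)

  \ell(\mu, \sigma \mid x_1, \dots, x_n) = C - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2,
  \qquad C = -\frac{n}{2}\log(2\pi)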

Likelihood
- The maximum of this log-likelihood function can be computed directly (analytically).
- It is more relevant, and more fun, to estimate it numerically!
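
For the record, setting the derivatives of this log-likelihood to zero gives the familiar closed-form solutions:

  \hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
  \qquad
  \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2

Note that the ML estimate of the variance divides by n rather than n - 1, which is why it is slightly biased in small samples.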

Normal distribution example
- Assume you obtain 100 samples from a normal distribution:
  rv.norm <- rnorm(100, mean=5, sd=2)
  This is the true data-generating model!
- Now assume you don't know the mean of this distribution and have to estimate it.
- Let's compute the log-likelihood of the observations under N(4, 2).

Normal distribution example
  sum(dnorm(rv.norm, mean=4, sd=2, log=T))
- dnorm gives the probability (density) of an observation under a given distribution; summing the logged values across observations gives the log-likelihood.
- The result is a single number: the log-likelihood of the data for the given pdf parameters (its exact value depends on the random sample).
- This is the log-likelihood for one possible distribution; we need to examine it for all possible distributions and select the one that yields the largest value.

Normal distribution example
- Make a sequence of possible means:
  m <- seq(from = 1, to = 10, by = 0.1)
- Now compute the log-likelihood for each of the possible means. This is a simple "grid search" algorithm:
  log.l <- sapply(m, function(x) sum(dnorm(rv.norm, mean=x, sd=2, log=T)))

Normal distribution example
  [Table: the candidate means (m) and their log-likelihood values (log.l)]
- Why are these numbers negative?

- dnorm gives us the probability (density) of an observation from the given distribution, and the log of a value between 0 and 1 is negative (e.g., log(.05) ≈ -3.0).
- What's the MLE? The candidate mean with the largest log-likelihood:
  m[which(log.l == max(log.l))]
  = 5.1

Normal distribution example
- What about estimating both the mean and the SD simultaneously?
- Use the grid search approach again: compute the log-likelihood at each combination of mean and SD (a sketch of this two-parameter grid follows).
  [Table: log-likelihood values (log.l) over the grid of candidate means and SDs]
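
A minimal sketch of that two-parameter grid (my own code, assuming the rv.norm sample from above; outer() evaluates the log-likelihood at every mean/SD pair):

  m <- seq(from = 1, to = 10, by = 0.1)    # candidate means
  s <- seq(from = 0.5, to = 5, by = 0.1)   # candidate SDs
  log.l <- outer(m, s, Vectorize(function(mu, sigma)
    sum(dnorm(rv.norm, mean = mu, sd = sigma, log = TRUE))))
  # log.l is a matrix: rows index the candidate means, columns the candidate SDs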

Normal distribution example
- Get the location of max(log.l). With log.l stored as a matrix over the candidate means (m) and SDs (s), which(..., arr.ind=T) returns the row and column of the maximum:
  idx <- which(log.l == max(log.l), arr.ind = T)
  m[idx[1]]   # = 5.0
  s[idx[2]]   # = 1.9
- Note: this could be done the same way for a simple linear regression (two parameters).

Algorithms
- Grid search works for these simple problems with few estimated parameters.
- Much more advanced search algorithms are needed for more complex problems.
- More advanced algorithms take advantage of the slope, or gradient, of the likelihood surface to make good guesses about the direction of the search in parameter space.
- We'll use the mle() routine in R (in the stats4 package).

Algorithms
- Grid search: vary each parameter in turn, compute the log-likelihood, then find the parameter combination yielding the largest log-likelihood (equivalently, the smallest negative log-likelihood).
- Gradient search: vary all parameters simultaneously, adjusting the relative magnitudes of the variations so that the direction of propagation in parameter space is along the direction of steepest ascent of the log-likelihood.
- Expansion methods: find an approximate analytical function that describes the log-likelihood hypersurface and use this function to locate the optimum. Fewer points need to be computed, but the computations are considerably more complicated.
- Marquardt method: a gradient-expansion combination.
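
In practice these strategies are packaged in general-purpose optimizers. A minimal sketch (my own, not from the slides) using R's optim() with a gradient-based method on the same problem:

  # negative log-likelihood for the rv.norm sample; par = c(mean, sd)
  nll <- function(par) -sum(dnorm(rv.norm, mean = par[1], sd = par[2], log = TRUE))
  fit <- optim(par = c(4, 2), fn = nll, method = "L-BFGS-B",
               lower = c(-Inf, 1e-6))   # the bound keeps the SD strictly positive
  fit$par   # should land near c(5, 2) for data simulated from N(5, 2)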

R – mle routine
- First we need to define a function to maximize. Wait! Most general routines focus on minimization (e.g., root finding for solving equations), so we usually minimize the negative log-likelihood:
  norm.func <- function(x, y) {
    sum(sapply(rv.norm, function(z) -1*dnorm(z, mean=x, sd=y, log=T)))
  }

R – mle routine
  norm.mle <- mle(norm.func, start=list(x=4, y=2), method="L-BFGS-B", lower=c(0, 0))
- Many interesting points:
  - Starting values
  - Global vs. local maxima or minima
  - Bounds (the SD can't be negative)
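
Putting the pieces together, a self-contained version of this fit might look like the sketch below (it assumes the stats4 package, which provides mle(); the exact estimates depend on the random sample):

  library(stats4)                      # provides mle()

  set.seed(1)
  rv.norm <- rnorm(100, mean = 5, sd = 2)

  # negative log-likelihood as a function of the mean (x) and the SD (y)
  norm.func <- function(x, y) -sum(dnorm(rv.norm, mean = x, sd = y, log = TRUE))

  norm.mle <- mle(norm.func, start = list(x = 4, y = 2),
                  method = "L-BFGS-B", lower = c(-Inf, 1e-6))
  summary(norm.mle)                    # estimates, standard errors, and -2 log L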

R – mle routine
- Output: summary(norm.mle)
  [Output: a Coefficients table with the Estimate and Std. Error for x and y, followed by -2 log L]
- Standard errors come from the inverse of the Hessian matrix.
- Convergence!!
- -2(log-likelihood) = deviance; it serves as an overall index of model fit, much as R² does in regression.

Maximum Likelihood Regression
- A standard regression, y = b0 + b1*x + e, may be broken down into two components: a structural part, the predicted value b0 + b1*x, and a random part, the error e, assumed normal with mean 0 and standard deviation sigma. Equivalently, y ~ N(b0 + b1*x, sigma).

Maximum Likelihood Regression
- First define our x's and y's:
  x <- 1:100
  y <- 10 + 2*x + rnorm(100, mean=5, sd=20)     # 10 and 2 are placeholders; the slide's original intercept and slope did not carry over
- Define the negative log-likelihood function:
  reg.func <- function(b0, b1, sigma) {
    if (sigma <= 0) return(NA)                  # no sd of 0 or less!
    yhat <- b0 + b1*x                           # the estimated (structural) part: b0 = intercept, b1 = slope
    -sum(dnorm(y, mean=yhat, sd=sigma, log=T))  # the -log-likelihood
  }

Maximum Likelihood Regression
- Call mle() to minimize the negative log-likelihood:
  lm.mle <- mle(reg.func, start=list(b0=2, b1=2, sigma=35))
- Get the results: summary(lm.mle)
  [Output: a Coefficients table with the Estimate and Std. Error for b0, b1, and sigma, followed by -2 log L]

Maximum Likelihood Regression
- Compare to the OLS results: lm(y ~ x)
  [Output: the usual lm() summary table, with Estimate, Std. Error, t value, and Pr(>|t|) for the intercept and for x (the slope significant at <2e-16), significance codes, the residual standard error on 98 degrees of freedom, and the multiple R-squared]
- Recall that the ML and OLS estimates coincide when the OLS assumptions are met, so the two sets of coefficients should agree closely.

Standard Errors of Estimates
- The behavior of the likelihood function near the maximum is important.
- If it is flat, the observations have little to say about the parameters: changes in the parameters do not cause large changes in the probability.
- If the likelihood has a pronounced peak near the maximum, small changes in the parameters cause large changes in probability. In this case we say the observations carry more information about the parameters.
- This is expressed as the second derivative (curvature) of the log-likelihood function; with more than one parameter, the second partial derivatives.

Standard Errors of Estimates
- A rate of change of a rate of change is a second derivative (e.g., velocity is the first derivative of position and acceleration the second).
- The Hessian matrix is the matrix of second partial derivatives of the -log-likelihood function.
- The entries in the Hessian are called the observed information for an estimate.

Standard Errors
- The information is used to obtain the expected variance (and hence the standard error) of the estimated parameters.
- When the sample size becomes large, the maximum likelihood estimator becomes approximately normally distributed, with variance close to the inverse of the information. More precisely, Var(θ̂) ≈ 1 / I(θ), where I(θ) = -E[∂²ℓ/∂θ²]; with several parameters, the variance-covariance matrix of the estimates is the inverse of the information matrix.
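
A numerical sketch of the same idea (my own code, reusing the nll function from the optim() sketch above): ask the optimizer for the Hessian of the negative log-likelihood at the maximum, invert it, and take square roots of the diagonal. Because nll is already the negative log-likelihood, its Hessian is the observed information and needs no sign change.

  fit <- optim(par = c(4, 2), fn = nll, method = "L-BFGS-B",
               lower = c(-Inf, 1e-6), hessian = TRUE)
  vcov.hat <- solve(fit$hessian)   # inverse of the observed information matrix
  sqrt(diag(vcov.hat))             # approximate standard errors for the mean and SD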

Likelihood Ratio Test
- Let L_F be the maximum of the likelihood function for an unrestricted (full) model.
- Let L_R be the maximum of the likelihood function for a restricted model nested in the full model.
- L_F must be greater than or equal to L_R: removing a variable or adding a constraint can only hurt model fit (the same logic as R²).
- Question: does adding the constraint, or removing the variable (a constraint of zero), significantly impact model fit? Model fit will decrease, but does it decrease more than would be expected by chance?

Likelihood Ratio Test
- The likelihood ratio statistic:
  R = -2 ln(L_R / L_F) = 2(log L_F - log L_R)
- R is distributed as chi-square with m degrees of freedom, where m is the difference in the number of estimated parameters between the two models.
- The expected value of R is m, so an R much bigger than the difference in parameters suggests the constraint hurts model fit.
- More formally, reference the chi-square distribution with m degrees of freedom to find the probability of getting an R this large by chance alone, assuming the null hypothesis (the restricted model) is true.

Likelihood Ratio Example
- Go back to our simple regression example.
- Does the variable X significantly improve our predictive ability or model fit? Alternatively, does removing X, or constraining its parameter estimate to zero, significantly decrease prediction or model fit?
- Compare -2 log-L for the full model with -2 log-L for the reduced model; the difference is the likelihood ratio statistic.
- Chi-square critical value (1 df, alpha = .05) = 3.84
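
A sketch of the computation in R (hypothetical fits on the x and y from the regression example; logLik() works on lm objects and on most mle-style fits):

  full    <- lm(y ~ x)   # unrestricted model
  reduced <- lm(y ~ 1)   # X removed (its slope constrained to zero)

  R <- as.numeric(2 * (logLik(full) - logLik(reduced)))
  R                                        # likelihood ratio statistic
  qchisq(0.95, df = 1)                     # critical value, about 3.84
  pchisq(R, df = 1, lower.tail = FALSE)    # p-value for the test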

Fit Indices
- Akaike's information criterion (AIC), pronounced "Ah-kah-ee-key":
  AIC = -2 log L + 2K
  where K is the number of estimated parameters in the model.
- The 2K term penalizes the log-likelihood for using many parameters to increase fit.
- Choose the model with the smallest AIC value.

Fit Indices
- Bayesian information criterion (BIC), also known as SIC, the Schwarz information criterion:
  BIC = -2 log L + K log(n)
  where n is the sample size.
- Choose the model with the smallest BIC.
- The likelihood is the probability of obtaining the data you did under the given model, so it makes sense to choose a model that makes this probability as large as possible; putting the minus sign in front switches the maximization to a minimization.
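
Both criteria are available directly in R; a brief sketch using the hypothetical regression fits from the likelihood ratio example above:

  AIC(full, reduced)   # -2 log L + 2K for each model
  BIC(full, reduced)   # -2 log L + K*log(n); the penalty grows with the sample size
  # pick the model with the smaller value of whichever criterion you are using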

Multiple Regression
- The negative log-likelihood function for multiple regression:
  # theta is a vector of parameters; its first element is the error variance
  # (note the sqrt() below, so theta[1] is sigma^2 rather than the SD)
  # theta[-1] is all values of theta except the first: the regression coefficients
  # X %*% theta[-1] uses matrix multiplication to form the vector of predicted values
  ols.lf3 <- function(theta, y, X) {
    if (theta[1] <= 0) return(NA)
    -sum(dnorm(y, mean = X %*% theta[-1], sd = sqrt(theta[1]), log = TRUE))
  }
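
A usage sketch (my own simulated data; the coefficient values are purely illustrative): build a design matrix with a leading column of ones for the intercept and pass it, along with y, to optim().

  set.seed(2)
  x1 <- rnorm(100); x2 <- rnorm(100)
  y  <- 1 + 2*x1 - 1.5*x2 + rnorm(100, sd = 3)   # hypothetical true coefficients
  X  <- cbind(1, x1, x2)                         # design matrix with an intercept column

  # theta = c(error variance, intercept, slope for x1, slope for x2)
  fit <- optim(par = c(1, 0, 0, 0), fn = ols.lf3, y = y, X = X,
               method = "L-BFGS-B", lower = c(1e-6, -Inf, -Inf, -Inf))
  fit$par   # compare with coef(lm(y ~ x1 + x2)) and the residual variance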