Applied Econometrics William Greene Department of Economics Stern School of Business.

Presentation transcript:


Applied Econometrics 18. Maximum Likelihood Estimation

Maximum Likelihood Estimation This defines a class of estimators based on the particular distribution assumed to have generated the observed random variable. The main advantage of ML estimators is that, among all consistent and asymptotically normal (CAN) estimators, MLEs have optimal asymptotic properties. The main disadvantage is that they are not necessarily robust to failures of the distributional assumptions; they depend heavily on the particular assumptions made. The oft-cited disadvantage of their mediocre small-sample properties is overstated in view of the usual paucity of viable alternatives.

Setting up the MLE The distribution of the observed random variable is written as a function of the parameters to be estimated: P(yᵢ|data, β) = probability density given the parameters. The likelihood function is constructed from this density. Construction: the joint probability density function of the observed sample of data, which is generally the product of the individual densities when the data are a random sample.

Regularity Conditions
What they are:
1. log f(.) has three continuous derivatives with respect to the parameters.
2. Conditions needed to obtain expectations of derivatives are met (e.g., the range of the variable is not a function of the parameters).
3. The third derivative has finite expectation.
What they mean: moment conditions and convergence. We need to obtain expectations of derivatives, we need to be able to truncate Taylor series, and we will use central limit theorems.

The MLE The log-likelihood function: log-L(θ|data). The likelihood equation(s): the first derivatives of log-L equal zero at the MLE, (1/n)Σᵢ ∂log f(yᵢ|θ_MLE)/∂θ_MLE = 0. (A sample statistic; the 1/n is irrelevant.) These are the "first order conditions" for maximization, and a moment condition whose population counterpart is the fundamental result E[∂log-L/∂θ] = 0. How do we use this result? An analogy principle.

Average Time Until Failure Estimating the average time until failure, θ, of light bulbs. yᵢ = observed life until failure.
f(yᵢ|θ) = (1/θ)exp(-yᵢ/θ)
L(θ) = Πᵢ f(yᵢ|θ) = θ^(-N) exp(-Σᵢyᵢ/θ)
logL(θ) = -N log(θ) - Σᵢyᵢ/θ
Likelihood equation: ∂logL(θ)/∂θ = -N/θ + Σᵢyᵢ/θ² = 0, which is solved by the sample mean, Σᵢyᵢ/N.
Note: ∂log f(yᵢ|θ)/∂θ = -1/θ + yᵢ/θ². Since E[yᵢ] = θ, E[∂log f(yᵢ|θ)/∂θ] = 0. ('Regular')
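
As an illustration (not part of the original slides), a short Python simulation can confirm that the likelihood equation above is solved by the sample mean; the value of θ and the sample size are arbitrary choices for the sketch.

import numpy as np

# Hypothetical illustration: simulate lifetimes from an exponential
# distribution with mean theta and check that the MLE (the sample mean)
# solves the likelihood equation.
rng = np.random.default_rng(0)
theta_true = 1200.0                     # assumed true mean time to failure
y = rng.exponential(theta_true, size=500)

theta_mle = y.mean()                    # solves -N/theta + sum(y)/theta^2 = 0

# Score at the MLE should be (numerically) zero.
score = -len(y) / theta_mle + y.sum() / theta_mle**2
print(theta_mle, score)                 # score ~ 0 up to rounding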

Properties of the Maximum Likelihood Estimator We will sketch formal proofs of these results:
- The log-likelihood function, again.
- The likelihood equation and the information matrix.
- A linear Taylor series approximation to the first order conditions: g(θ_ML) = 0 ≈ g(θ) + H(θ)(θ_ML - θ) (under regularity, higher order terms will vanish in large samples). Our usual approach: large sample behavior of the left and right hand sides is the same.
- A proof of consistency (Property 1).
- The limiting variance of √n(θ_ML - θ). We are using the central limit theorem here. Leads to asymptotic normality (Property 2). We will derive the asymptotic variance of the MLE.
- Efficiency (we have not developed the tools to prove this). The Cramer-Rao lower bound for efficient estimation (an asymptotic version of Gauss-Markov).
- Estimating the variance of the maximum likelihood estimator.
- Invariance (a VERY handy result). Coupled with the Slutsky theorem and the delta method, the invariance property makes estimation of nonlinear functions of parameters very easy.

Testing Hypotheses – A Trinity of Tests
- The likelihood ratio test: based on the proposition (Greene's) that restrictions always "make life worse." Is the reduction in the criterion (log-likelihood) large? Leads to the LR test.
- The Lagrange multiplier test: underlying basis is to reexamine the first order conditions; form a test of whether the gradient is significantly "nonzero" at the restricted estimator.
- The Wald test: the usual distance measure.

The Linear (Normal) Model Definition of the likelihood function: the joint density of the observed data, written as a function of the parameters we wish to estimate. Definition of the maximum likelihood estimator: that function of the observed data that maximizes the likelihood function, or its logarithm. For the model yᵢ = β′xᵢ + εᵢ, where εᵢ ~ N[0, σ²], the maximum likelihood estimators of β and σ² are b = (X′X)⁻¹X′y and s² = e′e/n. That is, least squares is ML for the slopes, but the variance estimator makes no degrees of freedom correction, so the MLE is biased.
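
A minimal numerical sketch of this point, using simulated data (all names and values here are illustrative, not from the text): the ML slope estimator coincides with OLS, while the ML variance estimator e′e/n differs from the degrees-of-freedom-corrected e′e/(n-K).

import numpy as np

# Hypothetical simulation: compare the ML variance estimator with the
# usual unbiased one in the normal linear model.
rng = np.random.default_rng(1)
n, K = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta = np.array([1.0, 0.5, -0.25])
y = X @ beta + rng.normal(scale=2.0, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)   # OLS = ML for the slopes
e = y - X @ b
s2_mle = e @ e / n                      # ML estimator (no df correction, biased)
s2_ols = e @ e / (n - K)                # usual unbiased estimator
print(b, s2_mle, s2_ols)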

Normal Linear Model The log-likelihood function = Σᵢ log f(yᵢ|θ) = sum of logs of densities. For the linear regression model with normally distributed disturbances, log-L = Σᵢ [-½ log 2π - ½ log σ² - ½(yᵢ - β′xᵢ)²/σ²].

Likelihood Equations The estimator is defined by the function of the data that equates ∂log-L/∂θ to 0 (the likelihood equation). The derivative vector of the log-likelihood function is the score function. For the regression model, g = [∂log-L/∂β, ∂log-L/∂σ²]′:
∂log-L/∂β = Σᵢ (1/σ²)xᵢ(yᵢ - β′xᵢ)
∂log-L/∂σ² = Σᵢ [-1/(2σ²) + (yᵢ - β′xᵢ)²/(2σ⁴)]
In matrix form, the first derivative vector of log-L is (1/σ²)X′(y - Xβ) (K×1) and (1/(2σ²))Σᵢ[(yᵢ - β′xᵢ)²/σ² - 1] (1×1). Note that we could compute these functions at any β and σ². If we compute them at b and e′e/n, the functions will be identically zero.
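
A quick check of the likelihood equations, again with simulated data chosen only for illustration: evaluated at b and s² = e′e/n, both pieces of the score are numerically zero.

import numpy as np

# Hypothetical check that the score vanishes at (b, e'e/n).
rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, 1.0]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / n

g_beta = X.T @ e / s2                               # (1/s^2) X'(y - Xb)
g_s2 = np.sum(-1 / (2 * s2) + e**2 / (2 * s2**2))   # score for sigma^2
print(g_beta, g_s2)                                 # both ~ 0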

Moment Equations Note that g = Σᵢ gᵢ is a random vector and that each term in the sum has expectation zero. It follows that E[(1/n)g] = 0. Our estimator is found by finding the parameter values that set the sample mean of the gᵢ to 0. That is, theoretically, E[gᵢ(β, σ²)] = 0, and we find the estimator as that function which produces (1/n)Σᵢ gᵢ(b, s²) = 0. Note the similarity to the way we would estimate any mean: if E[xᵢ] = μ, then E[xᵢ - μ] = 0, and we estimate μ by finding the function of the data, m, that produces (1/n)Σᵢ(xᵢ - m) = 0, which is, of course, the sample mean. There are two main components to the "regularity conditions" for maximum likelihood estimation. The first is that the first derivative has expected value 0; that moment equation motivates the MLE.

Information Matrix The negative of the second derivatives matrix of the log-likelihood, -H = -∂²log-L/∂θ∂θ′, is called the information matrix. It is usually a random matrix, also. For the linear regression model, the blocks of the Hessian are shown on the next slide.

Hessian for the Linear Model Differentiating the score once more gives
∂²log-L/∂β∂β′ = -(1/σ²)X′X
∂²log-L/∂β∂σ² = -(1/σ⁴)X′(y - Xβ) = -(1/σ⁴)X′ε
∂²log-L/∂(σ²)² = n/(2σ⁴) - (1/σ⁶)ε′ε
Note that the off diagonal elements have expectation zero, since E[X′ε] = 0.

Estimated Information Matrix Taking expected values of the parts of -H gives
-E[H] = [ (1/σ²)X′X   0 ; 0   n/(2σ⁴) ]
(which should look familiar). The off diagonal terms go to zero (one of the assumptions of the linear model). This can be computed at any vector β and scalar σ²; evaluated at b and s² it gives the estimated information matrix.
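
A brief check with simulated data (illustrative only): inverting the β block of the estimated information matrix reproduces the familiar s²(X′X)⁻¹, with s² here being the ML variance estimate rather than the degrees-of-freedom-corrected one.

import numpy as np

# Hypothetical check: the inverse of the beta block of -E[H], evaluated
# at (b, s2), equals s2 * (X'X)^{-1}.
rng = np.random.default_rng(8)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / n                                   # ML variance estimate

V_beta = np.linalg.inv(X.T @ X / s2)             # inverse of the beta block
print(np.allclose(V_beta, s2 * np.linalg.inv(X.T @ X)))   # True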

Properties of the MLE
- Consistent: not necessarily unbiased, however.
- Asymptotically normally distributed: proof based on central limit theorems.
- Asymptotically efficient: among the possible estimators that are consistent and asymptotically normally distributed.
- Invariant: the MLE of g(θ) is g(the MLE of θ).

Asymptotic Variance
- The asymptotic variance is {-E[H]}⁻¹.
- There are several ways to estimate this matrix: the inverse of the negative expected second derivatives; the inverse of the negative of the actual second derivatives; the inverse of the sum of squares of first derivatives; and a robust matrix for some special cases.

Deriving the Properties of the Maximum Likelihood Estimator

The MLE

Consistency:

Consistency Proof

Asymptotic Variance

Asymptotic Distribution

Other Results 1 – Variance Bound

Invariance The maximum likelihood estimator of a function of θ, say h(θ), is h(MLE). This is not always true of other kinds of estimators. To get the variance of this function, we would use the delta method. E.g., the MLE of θ = β/σ is b/√(e′e/n).
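
A hedged sketch of the delta method for this example, with simulated data and an arbitrary parameterization: by invariance the MLE of β/σ is b divided by the square root of e′e/n, and its standard error comes from the gradient of the transformation together with the (block diagonal) asymptotic covariance of (b, s²).

import numpy as np

# Hypothetical delta-method sketch for gamma = beta/sigma in the normal
# linear model; the data and coefficient index are illustrative.
rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 0.8]) + rng.normal(scale=1.5, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / n
sigma = np.sqrt(s2)

# Asymptotic covariance of (b, s2) from the expected information matrix:
V_b = s2 * np.linalg.inv(X.T @ X)
V_s2 = 2 * s2**2 / n                     # block diagonal with V_b

k = 1                                    # slope coefficient of interest
gamma = b[k] / sigma                     # MLE of beta_k / sigma by invariance
grad = np.zeros(3)                       # gradient of gamma wrt (b0, b1, s2)
grad[k] = 1 / sigma
grad[2] = -b[k] / (2 * s2 * sigma)       # d(b/sqrt(s2))/d(s2)
V = np.zeros((3, 3))
V[:2, :2] = V_b
V[2, 2] = V_s2
se_gamma = np.sqrt(grad @ V @ grad)      # delta-method standard error
print(gamma, se_gamma)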

Invariance

Reparameterizing the Log Likelihood

Estimating the Tobit Model

Computing the Asymptotic Variance We want to estimate {-E[H]}⁻¹. Three ways: (1) Just compute the negative of the actual second derivatives matrix and invert it. (2) Insert the maximum likelihood estimates into the known expected values of the second derivatives matrix. Sometimes (1) and (2) give the same answer (for example, in the linear regression model). (3) Since -E[H] is the variance of the first derivatives, estimate it with the sample variance (i.e., mean square) of the first derivatives. This will almost always be different from (1) and (2). Since they are estimating the same thing, in large samples all three will give the same answer. Current practice in econometrics often favors (3); Stata rarely uses (3), but others do.
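
The following sketch (simulated data, not the gasoline or health care series used later) fits a Poisson regression by Newton's method and then computes the three estimators. For the Poisson model the actual and expected Hessians coincide, so (1) and (2) give the same matrix here, while the BHHH estimator (3) differs in finite samples.

import numpy as np

# Hypothetical Poisson example comparing the three covariance estimators.
rng = np.random.default_rng(4)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ beta_true))

# Fit by Newton's method; log-likelihood kernel: sum(y*xb - exp(xb)).
beta = np.zeros(2)
for _ in range(25):
    lam = np.exp(X @ beta)
    g = X.T @ (y - lam)                  # score
    H = -(X.T * lam) @ X                 # Hessian
    beta -= np.linalg.solve(H, g)

lam = np.exp(X @ beta)
V_hessian = np.linalg.inv((X.T * lam) @ X)   # (1) inverse of -H at the MLE
V_expected = V_hessian                       # (2) equals -E[H]^{-1} for Poisson
G = X * (y - lam)[:, None]                   # individual score contributions
V_bhhh = np.linalg.inv(G.T @ G)              # (3) BHHH / outer product of gradients
print(np.sqrt(np.diag(V_hessian)), np.sqrt(np.diag(V_bhhh)))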

Linear Regression Model Example: Different Estimators of the Variance of the MLE Consider, again, the gasoline data. We use a simple equation: Gₜ = β₁ + β₂Yₜ + β₃Pgₜ + εₜ.

Linear Model

BHHH Estimator

Newton’s Method

Poisson Regression

Asymptotic Variance of the MLE

Estimators of the Asymptotic Covariance Matrix

ROBUST ESTIMATION
- Sandwich estimator: H⁻¹(G′G)H⁻¹.
- Is this appropriate? Why do we do this?
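
A minimal sketch of the sandwich formula for a Poisson fit, assuming the coefficient vector has already been estimated; the function name and setup are illustrative only.

import numpy as np

# Hypothetical sketch of the sandwich estimator H^{-1}(G'G)H^{-1} for a
# Poisson model; beta_hat is assumed to be the (pseudo-) MLE.
def poisson_robust_cov(X, y, beta_hat):
    lam = np.exp(X @ beta_hat)
    H = -(X.T * lam) @ X                 # Hessian of the log-likelihood
    G = X * (y - lam)[:, None]           # individual score contributions
    H_inv = np.linalg.inv(H)
    return H_inv @ (G.T @ G) @ H_inv     # robust ("sandwich") covariance

# In practice one would compare np.sqrt(np.diag(...)) from this matrix with
# the conventional standard errors from -H^{-1} alone.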

Application: Doctor Visits
- German individual health care data: N = 27,326.
- Model for the number of visits to the doctor: Poisson regression (fit by maximum likelihood), with income, education, and gender as covariates.

Poisson Regression Iterations
poisson ; lhs = doctor ; rhs = one,female,hhninc,educ ; mar ; output=3 $
Method = Newton; maximum iterations = 100. Convergence criteria: gtHg .1000D-05, chg.F .0000D+00, max|db| .0000D+00. Convergence diagnostics by iteration (function values and parameter vectors omitted):
Itr 2: gtHg = .2832D+03, chg.F = .1587D+06, max|db| = .1346D+01
Itr 3: gtHg = .9725D+02, chg.F = .4716D+05, max|db| = .6348D+00
Itr 4: gtHg = .1545D+02, chg.F = .5162D+04, max|db| = .1437D+00
Itr 5: gtHg = .5006D+00, chg.F = .1218D+03, max|db| = .6542D-02
Itr 6: gtHg = .6215D-03, chg.F = .1254D+00, max|db| = .9678D-05
Itr 7: gtHg = .9957D-09, chg.F = .1941D-06, max|db| = .1602D-10 * Converged

Regression and Partial Effects The output lists the Poisson coefficient estimates for Constant, FEMALE, HHNINC, and EDUC, with standard errors, z-ratios, p-values, and variable means. Below them are the partial derivatives of the expected value with respect to the vector of characteristics, averaged over individuals (all observations used for the means), together with the conditional mean at the sample point and the scale factor for the marginal effects.

Comparison of Standard Errors The table compares, for Constant, FEMALE, HHNINC, and EDUC, the standard errors based on the negative inverse of the second derivatives with those from BHHH. Why are they so different? Model failure: this is a panel, and there is autocorrelation.

Testing Hypotheses Wald tests, using the familiar distance measure. Likelihood ratio tests: logL_U = log likelihood without restrictions; logL_R = log likelihood with restrictions; logL_U > logL_R for any nested restrictions; 2(logL_U - logL_R) ~ chi-squared[J].
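
A small Python sketch of the LR test, with simulated data standing in for the doctor-visits application: the restricted model drops the covariate, and 2(logL_U - logL_R) is referred to a chi-squared distribution with one degree of freedom. Constants such as log y! cancel in the difference.

import numpy as np
from scipy import stats
from scipy.optimize import minimize

# Hypothetical LR test: Poisson model with and without a single covariate.
rng = np.random.default_rng(5)
n = 1000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = rng.poisson(np.exp(0.4 + 0.2 * x))

def negloglik(beta, X, y):
    xb = X @ beta
    return -np.sum(y * xb - np.exp(xb))   # kernel of the Poisson log-likelihood

fit_u = minimize(negloglik, np.zeros(2), args=(X, y))          # unrestricted
fit_r = minimize(negloglik, np.zeros(1), args=(X[:, :1], y))   # slope restricted to 0
lr = 2 * (fit_r.fun - fit_u.fun)          # = 2*(logL_U - logL_R)
p_value = stats.chi2.sf(lr, df=1)
print(lr, p_value)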

Testing the Model Poisson regression, maximum likelihood estimates. Dependent variable DOCVIS; 7 iterations completed; 4 parameters. The output reports the log likelihood function, the restricted log likelihood (log likelihood with only a constant term), McFadden's pseudo R-squared, and the chi squared statistic 2*[logL - logL(0)] with 3 degrees of freedom and its p-value: the likelihood ratio test that all three slopes are zero.

Wald Test
--> MATRIX ; List ; b1 = b(2:4) ; v11 = varb(2:4,2:4) ; B1'<V11>B1 $
Matrix B1 (3 rows, 1 column) holds the three slope estimates and matrix V11 (3 rows, 3 columns) their estimated covariance; the 1×1 result is the Wald statistic B1′V11⁻¹B1, to be compared with the LR statistic computed above.

LM Test
? Unconstrained
namelist ; x = one,female,hhninc,educ $
poisson ; lhs = docvis ; rhs = x $
create ; e = docvis - exp(x'b) $
matrix ; list ; b ; dlogldb = X'e $
? Constrained
poisson ; lhs = docvis ; rhs = x ; cml: female=0, hhninc=0, educ=0 $
create ; lambda0 = exp(x'b) $
matrix ; list ; b ; dlogldb = X'e $
matrix ; list ; H = ; lm = dlogldb'[h]dlogldb $
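
The commands above are NLOGIT. The following is a hypothetical Python analogue of the same calculation, with simulated data: estimate the constant-only model, evaluate the full-model score X′(y - λ₀) and information X′ΛX at the restricted estimates, and form LM = g′H⁻¹g.

import numpy as np
from scipy import stats

# Hypothetical LM test for a Poisson regression with the slopes
# restricted to zero; data and dimensions are illustrative.
rng = np.random.default_rng(6)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.poisson(np.exp(0.4 + 0.2 * X[:, 1]))

# Restricted MLE: constant only, so exp(b0) = mean(y).
beta_r = np.r_[np.log(y.mean()), 0.0, 0.0]
lam0 = np.exp(X @ beta_r)
g = X.T @ (y - lam0)                    # full-model score at the restricted MLE
H = (X.T * lam0) @ X                    # negative Hessian (information)
lm = g @ np.linalg.solve(H, g)          # LM = g' H^{-1} g
print(lm, stats.chi2.sf(lm, df=2))      # two restrictions in this sketch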

Constrained Regression Poisson regression; dependent variable DOCVIS; estimation based on N = 27326, K = 1; linear constraints imposed. Only the constant is estimated (and is significant); FEMALE, HHNINC, and EDUC appear as fixed parameters, constrained to zero. The log likelihood function for this restricted model is reported for use in the tests.

Scores In the unrestricted model the elements of DLOGLDB are on the order of 1E-05 to 1E-09, i.e., numerically zero at the MLE. In the restricted model the derivatives with respect to the constrained coefficients are not zero, which is what the LM statistic measures.

Test results The LM, Wald, and LR statistics for the three exclusion restrictions are listed side by side for comparison.

MLE vs. Nonlinear LS

Chow Style Test for Structural Change

Poisson Regressions Poisson regression, dependent variable DOCVIS. Log likelihood functions are reported for the pooled sample (N = 27326), for males (N = 14243), and for females (N = 13083). Each regression includes Constant, AGE, EDUC, HSAT, HHNINC, and HHKIDS; every coefficient is significant at the 1% level (for example, the AGE coefficient is .00791 in the pooled sample, .01232 for males, and .00379 for females).

Chi Squared Test
Namelist ; X = one,age,educ,hsat,hhninc,hhkids $
Sample ; All $
Poisson ; Lhs = Docvis ; Rhs = X $
Calc ; Lpool = logl $
Poisson ; For [female = 0] ; Lhs = Docvis ; Rhs = X $
Calc ; Lmale = logl $
Poisson ; For [female = 1] ; Lhs = Docvis ; Rhs = X $
Calc ; Lfemale = logl $
Calc ; K = Col(X) ; list ; Chisq = 2*(Lmale + Lfemale - Lpool) ; Ctb(.95,k) $
The listed calculator result compares CHISQ with the 95% chi squared critical value; the hypothesis that the same model applies to both groups is rejected.
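
For readers working outside NLOGIT, a hedged Python sketch of the same likelihood ratio comparison, on simulated data with an artificial male/female split: the statistic 2(logL_male + logL_female - logL_pooled) is compared with the chi-squared critical value with K degrees of freedom.

import numpy as np
from scipy import stats

# Hypothetical structural-change ("Chow style") test for a Poisson model.
def poisson_loglik(X, y):
    beta = np.zeros(X.shape[1])
    for _ in range(50):                       # Newton's method
        lam = np.exp(X @ beta)
        beta += np.linalg.solve((X.T * lam) @ X, X.T @ (y - lam))
    lam = np.exp(X @ beta)
    return np.sum(y * (X @ beta) - lam)       # log-likelihood kernel (log y! cancels)

rng = np.random.default_rng(7)
n = 4000
female = rng.integers(0, 2, size=n).astype(bool)
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
# Simulated structural difference between the two groups:
y = rng.poisson(np.exp(np.where(female, 0.6 + 0.1 * x, 0.3 + 0.4 * x)))

chisq = 2 * (poisson_loglik(X[female], y[female])
             + poisson_loglik(X[~female], y[~female])
             - poisson_loglik(X, y))
K = X.shape[1]
print(chisq, stats.chi2.ppf(0.95, K))         # reject if chisq exceeds the critical value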