M. K. GARBA, W. B. Yahya and B. A. Oyejola Department of Statistics,

Slides:



Advertisements
Similar presentations
Introduction Describe what panel data is and the reasons for using it in this format Assess the importance of fixed and random effects Examine the Hausman.
Advertisements

Dynamic panels and unit roots
Forecasting Using the Simple Linear Regression Model and Correlation
Random effects estimation RANDOM EFFECTS REGRESSIONS When the observed variables of interest are constant for each individual, a fixed effects regression.
Econometric Details -- the market model Assume that asset returns are jointly multivariate normal and independently and identically distributed through.
The Simple Linear Regression Model: Specification and Estimation
12.3 Correcting for Serial Correlation w/ Strictly Exogenous Regressors The following autocorrelation correction requires all our regressors to be strictly.
Chapter 15 Panel Data Analysis.
© 2000 Prentice-Hall, Inc. Chap Forecasting Using the Simple Linear Regression Model and Correlation.
12 Autocorrelation Serial Correlation exists when errors are correlated across periods -One source of serial correlation is misspecification of the model.
2-1 MGMG 522 : Session #2 Learning to Use Regression Analysis & The Classical Model (Ch. 3 & 4)
Chap 14-1 Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chapter 14 Additional Topics in Regression Analysis Statistics for Business.
Various topics Petter Mostad Overview Epidemiology Study types / data types Econometrics Time series data More about sampling –Estimation.
Chapter 4 The Classical Model Copyright © 2011 Pearson Addison-Wesley. All rights reserved. Slides by Niels-Hugo Blunch Washington and Lee University.
1/25 Introduction to Econometrics. 2/25 Econometrics Econometrics – „economic measurement“ „May be defined as the quantitative analysis of actual economic.
Heteroscedasticity Chapter 8
Ch5 Relaxing the Assumptions of the Classical Model
Degree of Business Administration and Management
Esman M. Nyamongo Central Bank of Kenya
Chapter 15 Panel Data Models.
Chapter 7. Classification and Prediction
REGRESSION DIAGNOSTIC III: AUTOCORRELATION
Vera Tabakova, East Carolina University
Esman M. Nyamongo Central Bank of Kenya
Econometrics ITFD Week 8.
PANEL DATA REGRESSION MODELS
Kakhramon Yusupov June 15th, :30pm – 3:00pm Session 3
REGRESSION DIAGNOSTIC II: HETEROSCEDASTICITY
THE LINEAR REGRESSION MODEL: AN OVERVIEW
Chapter 5: The Simple Regression Model
Chapter 6: Autoregressive Integrated Moving Average (ARIMA) Models
Econometric methods of analysis and forecasting of financial markets
Simultaneous equation system
Multivariate Regression
Fundamentals of regression analysis
Chapter 11 Simple Regression
Chapter 3: TWO-VARIABLE REGRESSION MODEL: The problem of Estimation
Chapter 15 Panel Data Analysis.
HETEROSCEDASTICITY: WHAT HAPPENS IF THE ERROR VARIANCE IS NONCONSTANT?
Simple Linear Regression - Introduction
Advanced Panel Data Methods
Chapter 12 – Autocorrelation
Autocorrelation.
Serial Correlation and Heteroskedasticity in Time Series Regressions
Chapter 6: MULTIPLE REGRESSION ANALYSIS
Migration and the Labour Market
REGRESSION DIAGNOSTIC I: MULTICOLLINEARITY
MULTICOLLINEARITY: WHAT HAPPENS IF THE REGRESSORS ARE CORRELATED
OVERVIEW OF LINEAR MODELS
Serial Correlation and Heteroscedasticity in
HETEROSCEDASTICITY: WHAT HAPPENS IF THE ERROR VARIANCE IS NONCONSTANT?
MULTICOLLINEARITY: WHAT HAPPENS IF THE REGRESSORS ARE CORRELATED
Econometric Analysis of Panel Data
Simultaneous equation models Prepared by Nir Kamal Dahal(Statistics)
Chapter 4, Regression Diagnostics Detection of Model Violation
Simple Linear Regression
OVERVIEW OF LINEAR MODELS
Multicollinearity Susanta Nag Assistant Professor Department of Economics Central University of Jammu.
Linear Panel Data Models
Chapter 13 Additional Topics in Regression Analysis
Linear Regression Summer School IFPRI
Autocorrelation.
3.2. SIMPLE LINEAR REGRESSION
Multicollinearity What does it mean? A high degree of correlation amongst the explanatory variables What are its consequences? It may be difficult to separate.
Financial Econometrics Fin. 505
Financial Econometrics Fin. 505
Serial Correlation and Heteroscedasticity in
Qi Li,Qing Wang,Ye Yang and Mingshu Li
Advanced Panel Data Methods
Presentation transcript:

M. K. GARBA, W. B. Yahya and B. A. Oyejola Department of Statistics, A COMPARATIVE STUDY OF SOME ESTIMATORS FOR PANEL DATA WITH HETEROSCEDASTICITY AND CORRELATIONS By M. K. GARBA, W. B. Yahya and B. A. Oyejola Department of Statistics, University of Ilorin, Ilorin, Nigeria. garba.mk@unilorin.edu.ng

ABSTRACT Failure of the orthogonality assumption in panel data models leads to biased ordinary least squares (OLS) estimates. For OLS to be efficient at estimating such models, errors have to be independent and homoscedastic. These conditions are so atypical and mostly unrealistic while modeling real life data using OLS. Panel dataset is commonly afflicted with problems that typically plague time series and cross sectional data because it combines the characteristics of both datasets. This paper, therefore, examines four methods of estimating panel data models (Pooling (OLS), First-Differenced (FD), Between (BTW) and Feasible Generalized Least Squares (FGLS)) in which the assumptions of homoscedasticity, no autocorrelation and no multicollinearity are jointly violated.

Monte-Carlo studies were carried out at different sample sizes and varying degrees of heteroscedasticity and levels of autocorrelation and collinearity at different time periods. The results from this work showed that in small sample, irrespective of number of time periods, FGLS is preferable when heteroscedasticity is severe regardless of levels of autocorrelation and multicollinearity. But when heteroscedasticity is low or mild with moderate autocorrelation level, both FD and FGLS are preferred, while BTW performs better only when there is no autocorrelation and low degree of heteroscedasticity. However, in large sample with little time periods, both FD and BTW could be used when there is no autocorrelation and low degree of heteroscedasticity, while FGLS is preferred elsewhere.

INTRODUCTION A panel data set is one where there are repeated observations on the same units. The units may be persons, families, firms, states or nations. It has the combination of the characteristics of both time- series and cross-sectional data. Therefore, problems that generally plague time-series data (i.e. autocorrelation) and cross-sectional data (i.e. heteroscedasticity) need to be addressed while analyzing panel data.

Panel data like other aspects of econometrics uses regression analysis as one of the statistical tools to formulate, describe and evaluate models. This regression analysis relies on some assumptions which if violated might result to one problem or the other. Among the assumptions of classical linear regression model (CLRM) are homoscedasticity, no serial correlation and orthogonality. If these assumptions are violated, we surmise the presence of heteroscedasticity, autocorrelation and multicollinearity respectively. See Schmidt (2005), Chatterjee (2006), Maddala (2008), Greene (2008), Gujarati & Porter (2009), Creel (2011), Wooldridge (2012).

Situation where the assumptions of CLRM are satisfied are rarely found Situation where the assumptions of CLRM are satisfied are rarely found. The violation of each of the assumptions has attracted the attention of researchers. When heteroscedasticity and autocorrelation are present, the OLS estimators (naive approach) may still be linear, unbiased, and asymptotically normally distributed but not efficient relative to other linear and unbiased estimators . (See Greene, 2008; Baltagi et al., 2008; Olofin et al., 2010). The presence of multicollinearity renders the estimates of the parameters indeterminate. In the cases where the estimates are obtainable, the accompanying confidence intervals tend to be too large and the standard errors become infinitely large. For OLS to be optimal, it is necessary that the errors have the same variance (homoscedasticity) and that all of the errors are independent of each other (Stimson 1985; Hicks 1994; Beck and Katz 1995 & 1996).

This study examines the behaviours of four methods of estimating panel data models when the assumptions of no multicollinearity, no serial correlation and homoscedasticity are jointly violated. The best estimator, in terms of efficiency, among those considered in this study that is robust to the violations of the aforementioned basic assumptions is determined using absolute biases, variances and root mean square errors (RMSE) of parameter estimates. Results from this work would serve as useful guides to econometricians and students while modelling panel data that are characterized by the structure conceptualised here.

MATERIALS AND METHODS This work considers one-way error component model with an endogenous and two exogenous variables. Heteroscedasticity was introduced into the model through the individual-specific error component, i.e. αi ~ (0, σαi2) and uit ~ IID(0, σu2) while performing Monte-Carlo experiments. This is in consonance with the earlier studies of Mazodier and Trongon (1978), Baltagi and Griffin (1988), Roy (2002), Bresson et al. (2006) to mention but few. First-order serial correlation was used to implant autocorrelation into the model as did by Lillard and Wallis (1978), Bhargava et al. (1983), Burke et al. (1990), Galbraith and Zind-Walsh (1995) and others. As earlier mentioned, the panel data model considered has two exogenous variables with the likelihood of existence of multicollinearity between the variables. We then studied the effects of this collinearity in terms of efficiency and stability of the four methods of estimating panel data models examined.

General Static Panel Data Model A general static panel data model is Yit = αi + β1X1it + β2X2it + . . . + βkXkit + uit ...1a or in matrix form as Yit = αi + Xʹit β + uit …1b for i = 1, 2, …., n and t = 1, 2, …, T where Yit is the response for unit i at time t, αi is the individual-specific intercept, vector Xʹit contains k regressors for unit i at time t, vector β contains k regression coefficients to be estimated and uit is the error component for unit i at time t.

The model considered in this study has two exogenous and one endogenous variables as shown below; Yit = αi + β1X1it + β2X2it + uit … 2 where αi = α + εi The model becomes Yit = α + β1X1it + β2X2it + εi + uit … 3 where εi is the individual-specific error component and uit is the combined time-series and cross-section error component with variances σε2 and σu2 respectively.

Structure of the Panel Data Used for this Work C/S units Time Period Yit X1it X2it 1 . 2 3 T y11 y12 y13 y1T x111 x112 x113 x11T X211 x212 x213 x21T y21 y22 y23 y2T x121 x122 x123 x12T x221 x222 x223 x22T n yn1 yn2 yn3 ynt x1n1 x1n2 x1n3 x1nt x2n11 x2n 2 x2n 3 x2n T

Source of Data Several Monte Carlo experiments were carried out to generate the datasets used for this work in the environment of R statistical package (www.cran.org). Two sizes of cross-sectional units (50 and 250), three time periods (10, 40, and 100), five levels of autocorrelation (ρ = ±0.9, ±0.5, 0), five levels of collinearity (r = ±0.9, ±0.5, 0) and three degrees of heteroscedasticity (low, mild & severe) were used for simulation (i.e 2 x 3 x 5 x 5 x 3 combinations). Each of the combinations was iterated 1000 times.

Brief Overview of Panel Data Estimation Methods Considered

1. Pooled Estimator (OLS) This estimator ignores the individual specific effects, stacks the data and treats them as if there is only a single index. In this case, all coefficients are constant across individuals and time periods. The estimator assumes that the intercept values are the same for all individuals and that the slope coefficients of all explanatory variables are identical.

The Pooled Estimator is given by This is the simplest estimator for panel data models. In most cases, it is unlikely to be adequate. In such cases, it provides a baseline for comparison with more complex estimators.

2. Between Estimator (BTW) The between estimator regresses the group means of Y on the group means of X’s in a regression of n observations. Basically, it uses the cross-sectional variation by averaging the observations over period t and converts all the observations into individual-specific averages then performs OLS on the transformed data. See Baltagi 2005; Cameron and Trivedi 2005; Greene 2008; Wooldridge, 2012

The Between Estimator (BTW) is given as

It should be noted that if OLS estimation on the pooled sample is consistent, the BTW estimator is also consistent but not as efficient as OLS when the assumptions are valid. (See Heineck, 2004). However, if the homoscedasticity assumption which says the error variances among individuals are the same is not valid, BTW estimator is expected to perform better than OLS because it takes into consideration the variation between the individuals.

3. First-differenced Estimator (FD) The first-differenced estimator exploits the special features of panel data models in that it measures the association between individual- specific one-period changes in regressors and individual-specific one-period changes in the regressand. See Wooldridge (2012) for details.

ΔYit = β1 ΔX1it + β2 ΔX2it + Δuit ; Considering the model (1a), the one-period lag gives Yi, t-1 = αi + β1X1i, t-1 + β2X2i, t-1 + ui, t-1 … 6 Subtracting the model 6 (lagged model) from model (1a), we have Yit - Yi, t-1 = β1 (X1it - X1i, t-1) + β2 (X2it - X2i, t-1) + (uit - ui, t-1), … 8 Such that ΔYit = β1 ΔX1it + β2 ΔX2it + Δuit ; for i = 1, 2, …n and t = 2, 3, …, T

4. Feasible Generalized Least Squares Estimator (FGLS) The feasible generalized least squares estimator can be obtained from the OLS estimation of the transformed model shown below Yit* = αi* + Xʹit* β + uit* ; i = 1, 2, …., n and t = 1, 2, …, T …9 where Yit* = Yit – λi. ; Xʹit* = Xʹit – λi. ; uit* = λi and αi* = 1 – λ The term λ gives a measure of the relative sizes of the within and between unit variances. Note that if all the cross-sectional units have the same intercept, so that σα2 = 0, λ becomes 0 and the FGLS estimator reduces to OLS.

In compact form, GLS = (XʹΩ-1X)-1 XʹΩ-1Y … 10

When Ω is known GLS based on the true variance component is BLUE and all the feasible GLS estimators considered are asymptotically efficient as n or T approaches Infinity. Since Ω is often unknown, FGLS is more frequently used rather than GLS. See Hsiao (2002), Arellano (2003), Wooldridge (2003 & 2012), Baltagi (2001 & 2005), Greene (2008), Creel (2011).

Criteria for Assessing the Performances of the Estimators The assessments of the estimators considered were based on the following criteria. Absolute bias Variance Root mean squared error Each of the criteria are defined in the next slide

Friedman Test and LSD After the estimators were evaluated based on the above criteria they were all being ranked according to their relative performances to one another. A test of significance was carried out to verify if the sums of ranks of other estimators are actually significantly different from that of the estimator proclaimed best. Because the ranks are on ordinal scale of measurements, a non-parametric statistical test developed by Milton Friedman (1939) was employed to perform this task.

RESULTS AND DISCUSSION The four estimators were assessed using each of the criteria stated earlier. These estimators were ranked using the ranks 1, 2, 3 and 4 with rank 1 assigned to the best estimator that has the lowest value of the absolute bias, variance and root mean square error. A rank of 2 is assigned to the second best estimator and so on. For instance, the performances of the estimators using variance criterion when we have 50 cross-sectional units and 10 time periods at all levels of autocorrelation and multicollinearity and degrees of heteroscedasticity are presented in the next slide.

Table 1: Ranks of the Estimators Using Variance Criterion when N = 50 and T = 10

The ranks for each of the estimators were summed for all levels of multicollinearity to determine an estimator with lowest sum of ranks at each of the autocorrelation levels and degrees of heteroscedasticity. Here, the effects of multicollinearity are repressed to assess the upshots of autocorrelation and heteroscedasticity which are respectively the problems that are peculiar to time- series and cross-sectional data which are brass tacks of panel data.

Table 2: Sum of Ranks of the Estimators Using Variance Criterion when N = 50 and T = 10

Table 3: Preference of the Estimators for each of the Sample Sizes

It can be deduced from Table 3 that at N = 50; T = 10 and at high level of autocorrelation (±0.9) irrespective of degree of heteroscedasticity, the performances of FD and FGLS estimators are not significantly different from each other. Therefore, each of them is efficient for this scenario. Meanwhile, for moderate autocorrelation level (±0.5), any of FD and BTW estimators could be used at low degree of heteroscedasticity. The FD estimator is preferred when there is mild heteroscedasticity, while under severe heteroscedasticity, the FD and FGLS estimators performed better. But at no autocorrelation, BTW estimator which regresses the group means of the dependent variable on the group means of independent variables using OLS is efficient irrespective of the degree of heteroscedasticity for sample size N =50; T = 10. The efficient estimators for other sample sizes at various conditions are presented accordingly in Table 3.

CONCLUSION Various results obtained in this work generally showed that the behaviours of the four estimators investigated for modeling various panel data vary as the violations are varied. Failure of the orthogonality assumption makes the OLS estimators to be biased and imprecise. For OLS to be accurately used in estimating the parameters of panel data models, errors have to be independent and homoscedastic. These conditions are so atypical and mostly unrealistic in many real life situations that would have warranted the use of OLS for modeling panel data efficiently.

Our findings from Monte Carlo experiments for several combinations of violations show that in small sample, irrespective of number of time periods, FGLS is preferable when heteroscedasticity is severe regardless of autocorrelation level. But when heteroscedasticity is low or mild with moderate autocorrelation level, both FD and FGLS are preferred, while BTW performs better only when there is no autocorrelation and low degree of heteroscedasticity. That is, in large sample with little time periods, both FD and BTW could be used when there is no

Also when the degree of heteroscedasticity is mild and there is no autocorrelation in large sample with little time periods, any of the FD and FGLS would produce efficient results. Finally, when severe degree of heteroscedasticity is present and autocorrelation is apparent in large sample regardless of time periods, FGLS is superior. Meanwhile, both FD and FGLS are suitable when there is low heteroscedasticity despite the existence of autocorrelation and multicollinearity.

Generally, we observe based on the various results obtained in this study that, it is always necessary to assess the degree of heteroscedasticity and level of autocorrelation while modeling panel data in order to ensure efficient results. In case there is more than one predictor in the model, the strength of relationship between or among the predictors needs to be examined for possible presence of multicollinearity in the data as to avoid erroneous inferences that may arise from the use wrong method for estimating the model.

REFERENCES Baltagi, B. H. (1988), An alternative heteroscedastic error component model, Econometric Theory 4, 349-350. Baltagi, B. H. (2005), Econometrics analysis of panel data, 3rd edition, John Wiley and Sons Ltd, England. Baltagi, B. H. and J. M. Griffin (1988), A generalized error component model with heteroscedastic disturbances, International Economic Review 29, 745-753. Baltagi, B. H. and Q. Li (1995), Testing AR(1) against MA(1) disturbances in a error component model. Journal of Econometrics 68, 133-151. Baltagi, B. H., B. C. Jung and S. H. Song (2008), Testing for heteroscedasticity and serial correlation in a random effects panel data model. Working paper No. 111, Syracuse University, USA. Bhargava, A., L. Franzini and W. Narendranathan (1983), Serial correlation and the fixed effects model, Review of Economic Studies 49, 533-549. Bresson, G., C. Hsiao and A Pirotte (2006), Heteroskedasticity and ranodom coeeficient model on panel data, Working Papers ERMES 0601, ERMES, University Paris 2. Galbraith, J. W. and V. Zinde-Walsh (1995), Transforms the error component model for estimation with general ARMA disturbances, Journal of Econometrics 66, 349-355.

Gujarati, D. N. And Porter, D. C Gujarati, D. N. And Porter, D. C. (2009), Basic Econometrics, 5th edition, McGraw-Hill/Irwin, New York. Li, Q. and T. Stengos (1994), Adaptive estimation in the panel data error component model with heteroscedasticity of unknown form, International Economic Review 35, 981-1000. Lillard, L.A. and R.J. Wallis (1978), Dynamic aspects of earning mobility, Econometrica 46, 985-1012. Maddala, G.S. (2008), Introduction to econometrics, 3rd edition, John Wiley & Sons, Ltd, Chichester, UK. Magnus, J.R. (1982), Multivariate error components analysis of linear and non-linear regression models by maximum likelihood, Journal of econometrics 19, 239-285. Mazodier, P. and A. Trognon (1978), Heteroskedasticity and stratification in error components models, Annales de l’INSEE 30-31, 451-482. Olofin, S. O., Kouassi, E. and Salisu, A. A. (2010), Testing for heteroscedasticity and serial correlation in a two-way error component model. Ph.D dissertation submitted to the Department of Economics, University of Ibadan, Nigeria. Roy, N. (2002), Is adaptive estimation useful for panel modles with heteroscedasticity in the individual specific error component? Some Monte Carlo evidence, Econometric Reviews 21, 189-203. Schmidt J. S. (2005), Econometrics, McGraw-Hill/Irwin, New York. Wansbeek, T.J. (1989), An alternative heteroscedastic error component model, Econometric Theory 5, 326. Wooldridge, J. M. (2012), Introductory Econometrics: A Modern Approach, 5th edition, South-Western College

ABRIDGE CODES FOR SIMULATION set.seed(123) id = rep(1:samp, each = period) time = rep(1:period, samp) sigma <- matrix(c(var1, r*sqrt(var1*var2), r*sqrt(var1*var2), var2), 2) mv <- mvrnorm(samp*period, c(mu1,mu2), sigma) pData$x1 <- mv[,1] pData$x2 <- mv[,2] U_it <- rnorm(samp*period, 0, sigma^2_u) u_it <- rho*U_itlag + e_it e_it <- rnorm(samp*period, 0, sigma^2_e) hetero_i <- rep(rnorm(samp, sd = round(seq(2, het, length = samp))), each = period) alpha_i <- alpha + hetero_i pData$y <- alpha_i + beta_1*pData$x1 + beta_2*pData$x2 + u_it pData <- data.frame(id, time, pData$y, pData$x1, pData$x2, u_it) pdat = function(samp, period, mu1, mu2, var1, var2, r, rho, alpha, beta_1, beta_2, sigma^2_u, sigma^2_e, het)

THANK YOU