Econometrics Course: Cost at the Dependent Variable (II) Paul G. Barnett, PhD December 4, 2013.

Slides:



Advertisements
Similar presentations
Continued Psy 524 Ainsworth
Advertisements

Econometric Modeling Through EViews and EXCEL
© Department of Statistics 2012 STATS 330 Lecture 32: Slide 1 Stats 330: Lecture 32.
Brief introduction on Logistic Regression
Logistic Regression Psy 524 Ainsworth.
Logistic Regression.
Nguyen Ngoc Anh Nguyen Ha Trang
Université d’Ottawa / University of Ottawa 2001 Bio 4118 Applied Biostatistics L10.1 CorrelationCorrelation The underlying principle of correlation analysis.
LINEAR REGRESSION: Evaluating Regression Models Overview Assumptions for Linear Regression Evaluating a Regression Model.
LINEAR REGRESSION: Evaluating Regression Models. Overview Assumptions for Linear Regression Evaluating a Regression Model.
Multiple Linear Regression Model
Chapter Seventeen HYPOTHESIS TESTING
1-1 Regression Models  Population Deterministic Regression Model Y i =  0 +  1 X i u Y i only depends on the value of X i and no other factor can affect.
Chapter 10 Simple Regression.
Statistics II: An Overview of Statistics. Outline for Statistics II Lecture: SPSS Syntax – Some examples. Normal Distribution Curve. Sampling Distribution.
QUALITATIVE AND LIMITED DEPENDENT VARIABLE MODELS.
Linear Regression with One Regression
Class 5: Thurs., Sep. 23 Example of using regression to make predictions and understand the likely errors in the predictions: salaries of teachers and.
Lecture 19: Tues., Nov. 11th R-squared (8.6.1) Review
The Simple Regression Model
Topic 2: Statistical Concepts and Market Returns
So far, we have considered regression models with dummy variables of independent variables. In this lecture, we will study regression models whose dependent.
Statistical Analysis SC504/HS927 Spring Term 2008 Session 7: Week 23: 7 th March 2008 Complex independent variables and regression diagnostics.
Topic 3: Regression.
OLS versus MLE Example YX Here is the data:
Regression Model Building Setting: Possibly a large set of predictor variables (including interactions). Goal: Fit a parsimonious model that explains variation.
Generalized Linear Models
SAS Lecture 5 – Some regression procedures Aidan McDermott, April 25, 2005.
MODELS OF QUALITATIVE CHOICE by Bambang Juanda.  Models in which the dependent variable involves two ore more qualitative choices.  Valuable for the.
Regression and Correlation Methods Judy Zhong Ph.D.
1 MULTI VARIATE VARIABLE n-th OBJECT m-th VARIABLE.
Regression Method.
Cost as the Dependent Variable (II) Paul G. Barnett, PhD VA Health Economics Resource Center.
ALISON BOWLING THE GENERAL LINEAR MODEL. ALTERNATIVE EXPRESSION OF THE MODEL.
Various topics Petter Mostad Overview Epidemiology Study types / data types Econometrics Time series data More about sampling –Estimation.
Limited Dependent Variables Ciaran S. Phibbs May 30, 2012.
University of Warwick, Department of Sociology, 2014/15 SO 201: SSAASS (Surveys and Statistics) (Richard Lampard) Week 7 Logistic Regression I.
Linear correlation and linear regression + summary of tests
Why Model? Make predictions or forecasts where we don’t have data.
Adjusted from slides attributed to Andrew Ainsworth
Limited Dependent Variables Ciaran S. Phibbs. Limited Dependent Variables 0-1, small number of options, small counts, etc. 0-1, small number of options,
Introduction to logistic regression and Generalized Linear Models July 14, 2011 Introduction to Statistical Measurement and Modeling Karen Bandeen-Roche,
Evaluating Risk Adjustment Models Andy Bindman MD Department of Medicine, Epidemiology and Biostatistics.
The Simple Linear Regression Model: Specification and Estimation ECON 4550 Econometrics Memorial University of Newfoundland Adapted from Vera Tabakova’s.
Beyond the Two-Part Model: Methods for Handling Truncated and Skewed Dependent Variables Eric P. Slade, PhD 1,2 1.VISN5 Capitol Network Mental Illness.
Econometrics Course: Cost as the Dependent Variable (I) Paul G. Barnett, PhD November 20, 2013.
1 Topic 4 : Ordered Logit Analysis. 2 Often we deal with data where the responses are ordered – e.g. : (i) Eyesight tests – bad; average; good (ii) Voting.
Handbook for Health Care Research, Second Edition Chapter 13 © 2010 Jones and Bartlett Publishers, LLC CHAPTER 13 Statistical Methods for Continuous Measures.
Qualitative and Limited Dependent Variable Models ECON 6002 Econometrics Memorial University of Newfoundland Adapted from Vera Tabakova’s notes.
KNN Ch. 3 Diagnostics and Remedial Measures Applied Regression Analysis BUSI 6220.
Lecture 1: Basic Statistical Tools. A random variable (RV) = outcome (realization) not a set value, but rather drawn from some probability distribution.
Dependent Variable Discrete  2 values – binomial  3 or more discrete values – multinomial  Skewed – e.g. Poisson Continuous  Non-normal.
Chapter 13 Understanding research results: statistical inference.
Nonparametric Statistics
Instructor: R. Makoto 1richard makoto UZ Econ313 Lecture notes.
Logistic Regression: Regression with a Binary Dependent Variable.
Nonparametric Statistics
BINARY LOGISTIC REGRESSION
Logistic Regression APKC – STATS AFAC (2016).
Statistics in MSmcDESPOT
Drop-in Sessions! When: Hillary Term - Week 1 Where: Q-Step Lab (TBC) Sign up with Alice Evans.
Chapter 12: Regression Diagnostics
Nonparametric Statistics
Introduction to Logistic Regression
Statistics II: An Overview of Statistics
Chapter 13 Additional Topics in Regression Analysis
Logistic Regression.
Nazmus Saquib, PhD Head of Research Sulaiman AlRajhi Colleges
Presentation transcript:

Econometrics Course: Cost at the Dependent Variable (II) Paul G. Barnett, PhD December 4, 2013

2 Review of Ordinarily Least Squares (OLS) Classic linear model Assume dependent variable can be expressed as a linear function of the chosen independent variables, e.g.: Y i = α + β X i + ε i

3 Review of OLS assumptions Expected value of error is zero E(ε i )=0 Errors are independent E(ε i ε j )=0 Errors have identical variance E(ε i 2 )=σ 2 Errors are normally distributed Errors are not correlated with independent variables E(X i ε i )=0

4 Cost is a difficult variable Skewed by rare but extremely high cost events Zero cost incurred by enrollees who don’t use care No negative values

5 Review from last session Applying Ordinary Least Squares OLS to data that aren’t normal can result in biased parameters –OLS can predict negative costs Log transformation can make cost more normally distributed Predicted cost is affected by re-transformation bias –Corrected using smearing estimator –Assumes constant error (homoscedasticity)

6 Topics for today’s course What is heteroscedasticity, and what should be done about it? What should be done when there are many zero values? How to test differences in groups with no assumptions about distribution? How to determine which method is best?

7 Topics for today’s course What is heteroscedasticity and what should be done about it? What should be done when there are many zero values? How to test differences in groups with no assumptions about distribution? How to determine which method is best?

What is heterscedasticity? A violation of one of the assumptions needed to use OLS regarding the error tems. OLS assumes Homoscedasticity –Identical variance E(ε i 2 )=σ 2 Heteroscedasticity –Variance depends on x (or on predicted y) 8

9 Homoscedasticity –Errors have identical variance E(ε i 2 )=σ 2

10 Heteroscedasticity –Errors depend on x (or on predicted y)

11 Why worry about heteroscedasticity? Predictions based on OLS model can be biased Re-transformation assumes homoscedastic errors Predicted cost when the error is heteroscedastic can be “appreciably biased”

12 What should be done about heteroscedasticity? Use a Generalized Linear Models (GLM) Analyst specifies a link function g( ) Analyst specifies a variance function –Key reading: “Estimating log models: to transform or not to transform,” Mullahy and Manning JHE 20:461, 2001

13 Link function g( ) in GLM g (E (y | x) )=α + βx Link function can be natural log, square root, or other function –E.g. ln ( E ( y | x)) = α + βx –When link function is natural log, then β represents percent change in y

14 GLM vs. OLS OLS of log estimate: E ( ln ( y) | x)) GLM estimate: ln (E ( y | x)) –Log of expectation of y is not the same as expectation of log Y! With GLM to find predicted Y –No retransformation bias with GLM –Smearing estimator not used

15 Variance function GLM does not assume constant variance GLM assumes there is function that explains the relationship between the variance and mean –v (y | x)

16 Variance assumptions for GLM cost models Gamma Distribution (most common) –Variance is proportional to the square of the mean Poisson Distribution –Variance is proportional to the mean

17 Estimation methods How to specify log link and gamma distribution with dependent variable Y and independent variables X1, X2, X3

GLM with log link and gamma distribution in Stata GLM Y X1 X2 X3, FAM(GAM) LINK(LOG) 18

GLM with log link and gamma distribution in SAS Basic syntax (drops zero cost observations) PROC GENMOD MODEL Y=X1 X2 X3 / DIST=GAMMA LINK=LOG; Refined syntax (keeps zero cost observations) PROC GENMOD; A = _MEAN_; B = _RESP_; D = B/A + LOG(A) VARIANCE VAR = A**2 DEVIANCE DEV = D; MODEL Y=X1 X2 X3 / LINK=LOG; 19

Choice between GLM and OLS of log transform GLM advantages: –GLM can correct for heteroscedasticity –GLM does not lead to retransformation error OLS of log transform advantages –OLS is more efficient (standard errors are smaller than with GLM) 20

Which link function? –Box-Cox regression –Stata command: boxcox cost {indep. vars} if y > 0 21

Which link function? Box-Cox parameter 22 Link functionTheta Inverse (1/cost) Log(cost)0 Square root (cost).5 Cost1 Cost Squared2

23 Which variance structure with GLM? Modified Park test GLM regression & find residual Square the residuals Second regression by OLS –Dependent variable squared residuals –Independent variable predicted y

24 Which variance structure with GLM? Parameter from GLM family test (modified Park test) γ1γ1 Variance 0Gaussian (Norma) 1Poisson 2Gamma 3Wald (Inverse Normal)

25 Other models for skewed data Generalized gamma models –Estimate link function, distribution, and parameters in single model –See: Basu & Rathouz (2005)

26 Questions?

27 Topics for today’s course What is heteroscedasticity, and what should be done about it? (GLM models) What should be done when there are many zero values? How to test differences in groups with no assumptions about distribution? How to determine which method is best?

28 What should be done when there are many zero values? Example of participants enrolled in a health plan who have no utilization

29 Annual per person VHA costs FY09 among those who used VHA in FY10

30 The two-part model Part 1: Dependent variable is indicator any cost is incurred –1 if cost is incurred (Y > 0) –0 if no cost is incurred (Y=0) Part 2: Regression of how much cost, among those who incurred any cost

31 The two-part model Expected value of Y conditional on X Is the product of: Part 1. The probability that Y is greater than zero, conditional on X } Part 2. Expected value of Y, conditional on Y being greater than zero, conditional on X }

32 Predicted cost in two-part model Predicted value of Y Is the product of: Part 1. Probability of any cost being incurred } Part 2. Predicted cost conditional on incurring any cost }

33 Question for class Part one estimates probability Y > 0 –Y > 0 is dichotomous indicator –1 if cost is incurred (Y > 0) –0 if no cost is incurred (Y=0) What type of regression should be used when the dependent variable is dichotomous (takes a value of either zero or one)?

34 First part of model Regression with dichotomous variable Logistic regression or probit Logistic regression uses maximum likelihood function to estimate log odds ratio:

35 Logistic regression syntax in SAS Proc Logistic; Model Y = X1 X2 X3 / Descending; Output out={dataset} prob={variable name}; Output statement saves the predicted probability that the dependent variable equals one (cost was incurred) Descending option in model statement is required, otherwise SAS estimates the probability that the dependent variable equals zero

36 Logistic regression syntax in Stata Logit Y = X1 X2 X3 Predict {variable name}, pr Predict statement generates the predicted probability that the dependent variable equals one (cost was incurred)

37 Second part of model Conditional quantity Regression involves only observations with non-zero cost (conditional cost regression) Use GLM or OLS with log cost

38 Two-part models Separate parameters for participation and conditional quantity –How independent variables predict  participation in care  quantity of cost conditional on participation –each parameter may have its policy relevance Disadvantage: hard to predict confidence interval around predicted Y given X

39 Alternate to two-part model OLS with untransformed cost OLS with log cost, using small positive values in place of zero Certain GLM models

40 Topics for today’s course What is heteroscedasticity, and what should be done about it? (GLM models) What should be done when there are many zero values? (Two-part models) How to test differences in groups with no assumptions about distribution? How to determine which method is best?

41 Non-parametric statistical tests Make no assumptions about distribution, variance Wilcoxon rank-sum test Assigns rank to every observation Compares ranks of groups Calculates the probability that the rank order occurred by chance alone

42 Extension to more than two groups Group variable with more than two mutually exclusive values Kruskall Wallis test –is there any difference between any pairs of the mutually exclusive groups? If KW is significant, then a series of Wilcoxon tests allows comparison of pairs of groups

43 Limits of non-parametric test It is too conservative –Compares ranks, not means –Ignores influence of outliers –E.g. all other ranks being equal, Wilcoxon will give same result regardless of whether  Top ranked observation is $1 million more costly than second observation, or  Top ranked observation just $1 more costly Doesn’t allow for additional explanatory variables

44 Topics for today’s course What is heteroscedasticity, and what should be done about it? (GLM models) What should be done when there are many zero values? (Two-part models) How to test differences in groups with no assumptions about distribution? (Non- parametric statistical tests) How to determine which method is best?

45 Which method is best? Find predictive accuracy of models Estimate regressions with half the data, test their predictive accuracy on the other half of the data Find –Mean Absolute Error (MAE) –Root Mean Square Error (RMSE)

46 Mean Absolute Error For each observation –find difference between observed and predicted cost –take absolute value –find the mean Model with smallest value is best

47 Root Mean Square Error Square the differences between predicted and observed, find their mean, find its square root Best model has smallest value

48 Evaluations of residuals Mean residual (predicted less observed) or Mean predicted ratio (ratio of predicted to observed) –calculate separately for each decile of observed Y –A good model should have equal residuals (or equal mean ratio) for all deciles

49 Formal tests of residuals Variant of Hosmer-Lemeshow Test –F test of whether residuals in raw scale in each decile are significantly different Pregibon’s Link Test –Tests if linearity assumption was violated  See Manning, Basu, & Mullahy, 2005

50 Questions?

51 Review of presentation Cost is a difficult dependent variable –Skewed to the right by high outliers –May have many observations with zero values –Cost is not-negative

52 When cost is skewed OLS of raw cost is prone to bias –Especially in small samples with influential outliers –“A single case can have tremendous influence”

53 When cost is skewed (cont.) Log transformed cost –Log cost is more normally distributed than raw cost –Log cost can be estimated with OLS

54 When cost is skewed (cont.) To find predicted cost, must correct for retransformation bias –Smearing estimator assumes errors are homoscedastic –Biased if errors are heteroscedasctic

55 When cost is skewed and errors are heteroscedastic GLM with log link and gamma variance –Considers heteroscedasctic errors –Not subject to retransformation bias –May not be very efficient –Alternative GLM specification  Poisson instead of gamma variance function  Square root instead of log link function

56 When cost has many zero values Two part model –Logit or probit is the first part –Conditional cost regression is the second part

57 Comparison without distributional assumptions Non-parametric tests can be useful May be too conservative Don’t allow co-variates

58 Evaluating models Mean Absolute Error Root Mean Square Error Other evaluations and tests of residuals

59 Key sources on GLM MANNING, W. G. (1998) The logged dependent variable, heteroscedasticity, and the retransformation problem, J Health Econ, 17, * MANNING, W. G. & MULLAHY, J. (2001) Estimating log models: to transform or not to transform?, J Health Econ, 20, * MANNING, W. G., BASU, A. & MULLAHY, J. (2005) Generalized modeling approaches to risk adjustment of skewed outcomes data, J Health Econ, 24, BASU, A. & Rathouz P.J. (2005) Estimating marginal and incremental effects on health outcomes using flexible link and variance function models, Biostatistics 6(1): , 2005.

60 Key sources on two-part models * MULLAHY, J. (1998) Much ado about two: reconsidering retransformation and the two- part model in health e conometrics, J Health Econ, 17, JONES, A. (2000) Health econometrics, in: Culyer, A. & Newhouse, J. (Eds.) Handbook of Health Economics, pp (Amsterdam, Elsevier).

61 References to worked examples FLEISHMAN, J. A., COHEN, J. W., MANNING, W. G. & KOSINSKI, M. (2006) Using the SF-12 health status measure to improve predictions of medical expenditures, Med Care, 44, I MONTEZ-RATH, M., CHRISTIANSEN, C. L., ETTNER, S. L., LOVELAND, S. & ROSEN, A. K. (2006) Performance of statistical models to predict mental health and substance abuse cost, BMC Med Res Methodol, 6, 53.

62 References to worked examples (cont). MORAN, J. L., SOLOMON, P. J., PEISACH, A. R. & MARTIN, J. (2007) New models for old questions: generalized linear models for cost prediction, J Eval Clin Pract, 13, DIER, P., YANEZ D., ASH, A., HORNBROOK, M., LIN, D. Y. (1999). Methods for analyzing health care utilization and costs Ann Rev Public Health (1999) 20: (Also gives accessible overview of methods, but lacks information from more recent developments)

63 Link to HERC Cyberseminar HSR&D study of worked example Performance of Statistical Models to Predict Mental Health and Substance Abuse Cost Maria Montez-Rath, M.S. 11/8/2006 The audio: ars/HERC asx ars/HERC asx The Power point slides: ars/HERC pdf ars/HERC pdf

64 Book chapters MANNING, W. G. (2006) Dealing with skewed data on costs and expenditures, in: Jones, A. (Ed.) The Elgar Companion to Health Economics, pp (Cheltenham, UK, Edward Elgar).