1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

2 Part I Introduction

3 The workhorse statistical model in the social sciences is the multivariate regression model, estimated by ordinary least squares (OLS)
y_i = β_0 + x_1i β_1 + x_2i β_2 + … + x_ki β_k + ε_i
y_i = x_i β + ε_i

4 Linear model: y_i = α + βx_i + ε_i
–α and β are “population” values that represent the true relationship between x and y
–Unfortunately, these values are unknown
–The job of the researcher is to estimate these values
–Notice that if we differentiate y with respect to x, we obtain dy/dx = β

5 β represents how much y will change for a given change in x
–Increase in income for more education
–Change in crime or bankruptcy when slots are legalized
–Increase in test score if you study more

6 Put some concreteness on the problem State of Maryland budget problems –Drop in revenues –Expensive k-12 school spending initiatives Short-term solution – raise tax on cigarettes by 34 cents/pack Problem – a tax hike will reduce consumption of taxable product Question for state – as taxes are raised, how much will cigarette consumption fall?

7 Simple model: y_i = α + βx_i + ε_i
–Suppose y is a state’s per capita consumption of cigarettes
–x represents taxes on cigarettes
–Question: how much will y fall if x is increased by 34 cents/pack?
–Problem: there are many reasons why people smoke, and cost is but one of them

8 Data
–(Y) State per capita cigarette consumption for the years
–(X) tax (State + Federal) in real cents per pack
–“Scatter plot” of the data
–Negative covariance between variables: when x > x̄, more likely that y < ȳ; when x < x̄, more likely that y > ȳ
–Goal: pick values of α and β that “best fit” the data
–Define best fit in a moment

9 Notation
–True model: y_i = α + βx_i + ε_i
–We observe data points (y_i, x_i)
–The parameters α and β are unknown
–The actual error (ε_i) is unknown
–Estimated model: (a, b) are estimates for the parameters (α, β)
–e_i is an estimate of ε_i, where e_i = y_i − a − bx_i
–How do you estimate a and b?

10 Objective: minimize the sum of squared errors (SSE)
Min Σ_i e_i² = Σ_i (y_i − a − bx_i)²
–Treats (+) and (−) errors equally: over- or under-predicting by “5” is the same magnitude of error
–“Quadratic form”
–The optimal values for a and b are those that make the first derivatives equal to zero
–Functions reach min or max values when derivatives are zero
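For the one-regressor case, setting the two first derivatives of the SSE to zero gives the familiar closed form b = Σ(x_i − x̄)(y_i − ȳ)/Σ(x_i − x̄)² and a = ȳ − b·x̄. The following is a minimal sketch in plain Python rather than STATA; the function names and the five tax/consumption pairs are invented for illustration and are not the state data discussed above.

```python
def ols_fit(x, y):
    """Return (a, b) that minimize sum((y_i - a - b*x_i)**2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


def sse(x, y, a, b):
    """Sum of squared errors for a candidate (a, b)."""
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))


# Hypothetical data: tax in cents per pack, packs consumed per capita
tax = [10.0, 20.0, 30.0, 40.0, 50.0]
packs = [120.0, 112.0, 101.0, 95.0, 82.0]

a, b = ols_fit(tax, packs)
print(f"a = {a:.2f}, b = {b:.3f}, SSE = {sse(tax, packs, a, b):.3f}")
```

Moving a or b away from the fitted values in either direction raises the SSE, which is what the zero-derivative condition says.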

13 The model has a lot of nice features
–Statistical properties easy to establish
–Optimal estimates easy to obtain
–Parameter estimates are easy to interpret
–Model maximizes prediction: if you minimize SSE you maximize R²
–The model does well as a first order approximation to lots of problems

14 Discrete and Qualitative Data
–The OLS model works well when y is a continuous variable: income, wages, test scores, weight, GDP
–It does not have as many nice properties when y is not continuous
–Example: doctor visits. Integer values, low counts for most people, mass of observations at zero

15 Downside of forcing non-standard outcomes into OLS world? Can predict outside the allowable range –e.g., negative MD visits Does not describe the data generating process well –e.g., mass of observations at zero Violates many properties of OLS –e.g. heteroskedasticity

16 This talk
–Look at situations when the data generating process does not lend itself well to OLS models
–Mathematically describe the data generating process
–Show how we use a different optimization procedure to obtain estimates
–Describe the statistical properties

17 Show how to interpret parameters Illustrate how to estimate the models with popular program STATA

18 Types of data generating processes we will consider
Dichotomous events (yes or no)
–1=yes, 0=no
–Graduate high school? Work? Are obese? Smoke?
Ordinal data
–Self reported health (poor, fair, good, excellent)
–Strongly disagree, disagree, agree, strongly agree

19 Count data
–Doctor visits, lost workdays, fatality counts
Duration data
–Time to failure, time to death, time to re-employment

20 Recommended Textbooks Jeffrey Wooldridge, “Econometric analysis of cross sectional and panel data” –Lots of insight and mathematical/statistical detail –Very good examples William Greene, “Econometric Analysis” –more topics –Somewhat dated examples

21 Course web page Contains –These notes –All STATA programs and data sets –A couple of “Introduction to STATA” handouts –Links to some useful web sites

22 STATA Resources: Discrete Outcomes
–“Regression Models for Categorical Dependent Variables Using STATA”, by J. Scott Long and Jeremy Freese
–Available for sale from the STATA website for $52
–Post-estimation subroutines that translate results
–Do not need to buy the book to use the subroutines

23 In the STATA command line, type
    net search spost
This will give you a list of available programs to download. One is spostado. Click on the link and install the files

24 Part II A brief introduction to STATA

25 STATA Very fast, convenient, well-documented, cheap and flexible statistical package Excellent for cross-section/panel data projects, not as great for time series but getting better Not as easy to manipulate large data sets from flat files as SAS I usually clean data in SAS, estimate models in STATA

26 Key characteristic of STATA –All data must be loaded into RAM –Computations are very fast –But, size of the project is limited by available memory Results can be generated two different ways –Command line –Write a program, (*.do) then submit from the command line

27 Sample program to get you started: cps87_or.do
The program gets you to the point where you can
–Load data into memory
–Construct new variables
–Get simple statistics
–Run a basic regression
–Store the results on a disk

28 Data (cps87_or.dta)
–Random sample of data from the 1987 Current Population Survey outgoing rotation group
–Sample selection: males, aged 21-64, working 30+ hours/week
–19,906 observations

29 Major caveat Hardest thing to learn/do: get data from some other source and get it into STATA data set We skip over that part All the data sets are loaded into a STATA data file that can be called by saying: use data file name

30 Housekeeping at the top of the program
    * this line defines the semicolon as the ;
    * end of line delimiter;
    # delimit ;
    * set memory for 10 meg;
    set memory 10m;
    * write results to a log file;
    * the replace option writes over old;
    * log files;
    log using cps87_or.log,replace;
    * open stata data set;
    use c:\bill\stata\cps87_or;
    * list variables and labels in data set;
    desc;

> - storage display value variable name type format label variable label > - age float %9.0g age in years race float %9.0g 1=white, non-hisp, 2=place, n.h, 3=hisp educ float %9.0g years of education unionm float %9.0g 1=union member, 2=otherwise smsa float %9.0g 1=live in 19 largest smsa, 2=other smsa, 3=non smsa region float %9.0g 1=east, 2=midwest, 3=south, 4=west earnwke float %9.0g usual weekly earnings

32 Constructing new variables
Use the ‘gen’ command to generate new variables
Syntax
–gen new variable name=math statement
Easily construct new variables via
–Algebraic operations
–Math/trig functions (ln, exp, etc.)
–Logical operators (when true, =1, when false, =0)

33 From program
    * generate new variables;
    * lines 1-2 illustrate basic math functions;
    * lines 3-4 illustrate logical operators;
    * line 5 illustrates the OR statement;
    * line 6 illustrates the AND statement;
    * after you construct new variables, compress the data again;
    gen age2=age*age;
    gen earnwkl=ln(earnwke);
    gen union=unionm==1;
    gen topcode=earnwke==999;
    gen nonwhite=((race==2)|(race==3));
    gen big_ne=((region==1)&(smsa==1));

34 Getting basic statistics desc -- describes variables in the data set sum – gets summary statistics tab – produces frequencies (tables) of discrete variables

35 From program
    * get descriptive statistics;
    sum;
    * get detailed descriptive statistics for continuous variables;
    sum earnwke, detail;
    * get frequencies of discrete variables;
    tabulate unionm;
    tabulate race;
    * get two-way table of frequencies;
    tabulate region smsa, row column cell;

36 Results from sum
Variable | Obs | Mean | Std. Dev. | Min | Max
(rows: age, race, educ, unionm, smsa)

37 Detailed summary: usual weekly earnings
[STATA output of sum earnwke, detail: percentiles from 1% to 99% (the 50% entry is 449), smallest and largest values, Obs, Sum of Wgt., Mean, Std. Dev., Variance, Skewness, Kurtosis]

38 Results for tab
1=union member, 2=otherwise | Freq. | Percent | Cum.
(rows: 1, 2, Total)

39 Two-way table of frequencies: region (rows; 1=east, 2=midwest, 3=south, 4=west) by smsa (columns; 1=live in 19 largest smsa, 2=other smsa, 3=non smsa), counts only
region | smsa=1 | smsa=2 | smsa=3 | Total
1 | 2,806 | 1,349 | 842 | 4,997
2 | 1,501 | 1,742 | 1,592 | 4,835
3 | 1,501 | 2,542 | 1,904 | 5,947
4 | 1,487 | 1,507 | 1,133 | 4,127
Total | 7,295 | 7,140 | 5,471 | 19,906

40 Running a regression
Syntax
    reg dependent-variable independent-variables
Example from program
    *run simple regression;
    reg earnwkl age age2 educ nonwhite union;

41 Source | SS df MS Number of obs = F( 5, 19900) = Model | Prob > F = Residual | R-squared = Adj R-squared = Total | Root MSE = earnwkl | Coef. Std. Err. t P>|t| [95% Conf. Interval] age | age2 | educ | nonwhite | union | _cons |

42 Analysis of variance
R² = .3085
–Variables explain 31% of the variation in log weekly earnings
F(5,19900)
–Tests the hypothesis that all coefficients (except the constant) are jointly zero

43 Interpret results
Y = β_0 + β_1 X_i + ε_i, so dY/dX = β_1
But in this case Y = ln(W), where W is weekly wages
d ln(W)/dX = (dW/W)/dX = β_1
–Proportional (percentage) change in wages given a one-unit change in X

44 For each additional year of education, wages increase by 6.9% Non whites earn 17.2% less than whites Union members earn 13% more than nonunion members

45 Part III Some notes about probability distributions

46 Continuous Distributions Random variables with infinite number of possible values Examples -- units of measure (time, weight, distance) Many discrete outcomes can be treated as continuous, e.g., SAT scores

47 How to describe a continuous random variable: the Probability Density Function (PDF)
–The PDF for a random variable x is defined as f(x), where f(x) ≥ 0 and ∫ f(x)dx = 1
–Calculus review: the integral of a function gives the “area under the curve”

49 Cumulative Distribution Function (CDF)
–Suppose x is a “measure” like distance or time, with 0 ≤ x < ∞
–We may be interested in Pr(x ≤ a)

50 CDF What if we consider all values?

51 Properties of CDF
–Note that Pr(x ≤ b) + Pr(x > b) = 1
–Pr(x > b) = 1 − Pr(x ≤ b)
–Many times, it is easier to work with complements

52 General notation for continuous distributions The PDF is described by lower case such as f(x) The CDF is defined as upper case such as F(a)

53 Standard Normal Distribution Most frequently used continuous distribution Symmetric “bell-shaped” distribution As we will show, the normal has useful properties Many variables we observe in the real world look normally distributed. Can translate normal into ‘standard normal’

54 Examples of variables that look normally distributed
–IQ scores
–SAT scores
–Heights of females
–Log income
–Average gestation (weeks of pregnancy)
–As we will show in a few weeks, sample means are approximately normally distributed in large samples

55 Standard Normal Distribution
PDF: φ(z) = (1/√(2π)) exp(−z²/2), for −∞ < z < ∞

56 Notation
–φ(z) is the standard normal PDF evaluated at z
–Φ[a] = Pr(z ≤ a) is the standard normal CDF evaluated at a

58 Standard Normal
Notice that:
–Normal is symmetric: φ(a) = φ(−a)
–Normal is “unimodal”
–Median=mean
–Area under curve=1
–Almost all area is between (−3,3)
Evaluations of the CDF are done with
–Statistical functions (excel, SAS, etc)
–Tables

59 Standard Normal CDF: Pr(z ≤ −0.98) = Φ[−0.98] = 0.1635

61 Pr(z ≤ 1.41) = Φ[1.41] = 0.9207

63 Pr(z > 1.17) = 1 − Pr(z ≤ 1.17) = 1 − Φ[1.17] = 1 − 0.8790 = 0.1210

65 Pr(0.1 ≤ z ≤ 1.9) = Pr(z ≤ 1.9) − Pr(z ≤ 0.1) = Φ(1.9) − Φ(0.1) = 0.9713 − 0.5398 = 0.4315

69 Important Properties of Normal Distribution
–Pr(z ≤ A) = Φ[A]
–Pr(z > A) = 1 − Φ[A]
–Pr(z ≤ −A) = Φ[−A]
–Pr(z > −A) = 1 − Φ[−A] = Φ[A]
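The table look-ups above can also be done numerically. A minimal sketch in plain Python (not STATA), using the identity Φ(a) = ½[1 + erf(a/√2)] with the standard library's math.erf; the function names are chosen here for illustration only.

```python
import math


def std_normal_pdf(z):
    """Standard normal PDF evaluated at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(a):
    """Standard normal CDF, Pr(z <= a), via the error function."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))


p_below = std_normal_cdf(-0.98)                        # Pr(z <= -0.98)
p_above = 1.0 - std_normal_cdf(1.17)                   # Pr(z > 1.17)
p_between = std_normal_cdf(1.9) - std_normal_cdf(0.1)  # Pr(0.1 <= z <= 1.9)

print(round(p_below, 4), round(p_above, 4), round(p_between, 4))
```

The symmetry property Pr(z > −A) = Φ[A] can be checked the same way, by comparing 1 − std_normal_cdf(−A) with std_normal_cdf(A) for any A.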

70 Part IV Maximum likelihood estimation

71 Maximum likelihood estimation
–Observe n independent outcomes, all drawn from the same distribution: (y_1, y_2, y_3, …, y_n)
–y_i is drawn from f(y_i; θ), where θ is an unknown parameter for the PDF f
–Recall the definition of independence: if a and b are independent, Pr(a and b) = Pr(a)Pr(b)

72 Because all the draws are independent, the probability that these particular n values of y would be drawn at random is called the ‘likelihood function’, and it equals
L = Pr(y_1)Pr(y_2)…Pr(y_n)
L = f(y_1; θ)f(y_2; θ)…f(y_n; θ)

73 MLE: pick the value of θ that maximizes the chance these n values of y would have been generated randomly
–To maximize L, maximize a monotonically increasing function of L
–Recall ln(abcd) = ln(a) + ln(b) + ln(c) + ln(d)

74 Max ℒ = ln(L) = ln[f(y_1; θ)] + ln[f(y_2; θ)] + … + ln[f(y_n; θ)] = Σ_i ln[f(y_i; θ)]
Pick θ so that ℒ is maximized: dℒ/dθ = 0

75 [Figure: ℒ plotted against θ, with θ1 and θ2 marked]

76 Example: Poisson
–Suppose y measures ‘counts’ such as doctor visits
–y_i is drawn from a Poisson distribution: f(y_i; λ) = e^(−λ) λ^(y_i)/y_i!, for λ > 0
–E[y_i] = Var[y_i] = λ

77 Given n observations (y_1, y_2, y_3, …, y_n), pick the value of λ that maximizes ℒ
Max ℒ = Σ_i ln[f(y_i; λ)]
= Σ_i ln[e^(−λ) λ^(y_i)/y_i!]
= Σ_i [−λ + y_i ln(λ) − ln(y_i!)]
= −nλ + ln(λ) Σ_i y_i − Σ_i ln(y_i!)

78 ℒ = −nλ + ln(λ) Σ_i y_i − Σ_i ln(y_i!)
dℒ/dλ = −n + (1/λ) Σ_i y_i = 0
Solve for λ: λ = Σ_i y_i/n = ȳ = sample mean of y
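The Poisson result can be checked numerically. The sketch below, in plain Python rather than STATA, evaluates the log likelihood at the sample mean and at nearby values; the ten visit counts are made up for illustration, and math.lgamma(y + 1) is used for ln(y!).

```python
import math


def poisson_loglik(lam, ys):
    """Sum over i of [-lam + y_i*ln(lam) - ln(y_i!)]."""
    return sum(-lam + y * math.log(lam) - math.lgamma(y + 1) for y in ys)


# Hypothetical doctor-visit counts for ten people
visits = [0, 0, 1, 0, 2, 3, 0, 1, 5, 0]

lam_hat = sum(visits) / len(visits)  # closed-form MLE: the sample mean

for lam in (1.0, 1.1, lam_hat, 1.3, 1.4):
    print(f"lambda = {lam:.2f}  log likelihood = {poisson_loglik(lam, visits):.4f}")
```

The printed log likelihood peaks at the sample mean and falls off on either side.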

79 In most cases, however, we cannot find a ‘closed form’ solution for the parameter in ln[f(y_i; θ)]
–Must ‘search’ over all possible solutions
–How does the search work?
–Start with a candidate value of θ. Calculate dℒ/dθ

80 If dℒ/dθ > 0, increasing θ will increase ℒ, so we increase θ some
If dℒ/dθ < 0, decreasing θ will increase ℒ, so we decrease θ some
Keep changing θ until dℒ/dθ = 0
How far you ‘step’ when you change θ is determined by a number of different factors
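The search rule can be sketched for the Poisson case, where the derivative of the log likelihood is −n + Σy_i/λ. This is plain Python, not STATA, and it uses a fixed step size purely as a simplifying assumption; actual software chooses the step more cleverly (for example with second-derivative information). The data, starting value, and step size are all invented for illustration.

```python
def poisson_score(lam, ys):
    """Derivative of the Poisson log likelihood with respect to lambda."""
    return -len(ys) + sum(ys) / lam


def climb(ys, lam=0.5, step=0.01, tol=1e-8, max_iter=10000):
    """Move lambda in the direction of the derivative until it is about zero."""
    for _ in range(max_iter):
        g = poisson_score(lam, ys)
        if abs(g) < tol:
            break
        lam += step * g  # g > 0: increase lambda; g < 0: decrease lambda
    return lam


visits = [0, 0, 1, 0, 2, 3, 0, 1, 5, 0]  # hypothetical counts
lam_search = climb(visits)
print(lam_search, sum(visits) / len(visits))
```

With these numbers the search settles on the same value as the closed-form answer, the sample mean.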

81 [Figure: ℒ plotted against θ; at θ1, dℒ/dθ > 0]

82 [Figure: ℒ plotted against θ; at θ3, dℒ/dθ < 0]

83 Properties of MLE estimates
–Sometimes called efficient estimation: in large samples, no other consistent estimator has a smaller variance than the one obtained by MLE
–Parameter estimates are normally distributed when sample sizes are large
–Therefore, if we divide the parameter estimate by its standard error, the ratio should be normally distributed with mean zero and variance 1 if the null (parameter=0) is correct

84 Part V Dichotomous outcomes

85 Dichotomous Data Suppose data is discrete but there are only 2 outcomes Examples –Graduate high school or not –Patient dies or not –Working or not –Smoker or not In data, y i =1 if yes, y i =0 if no

86 How to model the data generating process? There are only two outcomes Research question: What factors impact whether the event occurs? To answer, will model the probability the outcome occurs Pr(Y i =1) when y i =1 or Pr(Y i =0) = 1- Pr(Y i =1) when y i =0

87 Think of the problem from an MLE perspective
Likelihood for i’th observation: L_i = Pr(y_i=1)^(y_i) [1 − Pr(y_i=1)]^(1−y_i)
–When y_i=1, the only relevant part is Pr(y_i=1)
–When y_i=0, the only relevant part is [1 − Pr(y_i=1)]

88 ℒ = Σ_i ln[L_i] = Σ_i {y_i ln[Pr(y_i=1)] + (1−y_i) ln[Pr(y_i=0)]}
Notice that up to this point, the model is generic. The log likelihood function will be determined by the assumptions concerning how we determine Pr(y_i=1)
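The generic log likelihood for a dichotomous outcome can be written down before any choice of probit or logit is made. A small sketch in plain Python (not STATA); the four outcomes and the two sets of candidate probabilities are hypothetical.

```python
import math


def binary_loglik(y, p):
    """Sum over i of y_i*ln(p_i) + (1 - y_i)*ln(1 - p_i), where p_i = Pr(y_i = 1)."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
               for yi, pi in zip(y, p))


y = [1, 0, 0, 1]                 # hypothetical outcomes
p_flat = [0.5, 0.5, 0.5, 0.5]    # same probability for everyone
p_sharp = [0.9, 0.2, 0.1, 0.8]   # probabilities that track the outcomes

ll_flat = binary_loglik(y, p_flat)
ll_sharp = binary_loglik(y, p_sharp)
print(ll_flat, ll_sharp)
```

Probabilities that line up better with the observed outcomes give a log likelihood closer to zero, which is what the estimation procedure rewards.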

89 Modeling the probability There is some process (biological, social, decision theoretic, etc) that determines the outcome y Some of the variables impacting are observed, some are not Requires that we model how these factors impact the probabilities Model from a ‘latent variable’ perspective

90 Consider a woman’s decision to work
y_i* = the person’s net benefit to work
Two components of y_i*
–Characteristics that we can measure: education, age, income of spouse, prices of child care
–Some we cannot measure: how much you like spending time with your kids, how much you like/hate your job

91 We aggregate these two components into one equation y i * = β 0 + x 1i β 1 + x 2i β 2 +… x ki β k + ε i = x i β + ε i x i β (measurable characteristics but with uncertain weights) ε i random unmeasured characteristics Decision rule: person will work if y i * > 0 (if net benefits are positive) y i =1 if y i *>0 y i =0 if y i * ≤0

92 y i =1 if y i *>0 y i * = x i β + ε i > 0 only if ε i > - x i β y i =0 if y i * ≤0 y i * = x i β + ε i ≤ 0 only if ε i ≤ - x i β

93 How to interpret ε? When we look at certain people, we have expectations about whether y should equal 1 or 0 These expectations do not always hold true The error ε represents deviations from what we expect Go back to the work example, suppose x i β is ‘big.’ We observe a woman with: –High wages –Low husband’s income –Low cost of child care

94 We would expect this person to work, UNLESS there is some unmeasured variable that counteracts this
For example:
–Suppose a mom really likes spending time with her kids, or she hates her job
–The unmeasured benefit of working is then a large negative value of ε_i

95 If we observe someone working, then ε_i must have exceeded a certain value: y_i = 1 if ε_i > −x_i β
Consider the opposite. Suppose we observe someone NOT working. Then ε_i must not have been big, or it was a large negative number, since y_i = 0 if ε_i ≤ −x_i β

96 The Probabilities The estimation procedure used is determined by the assumed distribution of ε What is the probability we observe someone with y=1? –Use definition of the CDF –Pr(y i =1) = Pr(y i * >0) = Pr(ε i > - x i β) = 1 – F(-x i β)

97 What is the probability we observe someone with y=0? –Use definition of the CDF –Pr(y i =0) = Pr(y i * ≤ 0) = Pr(ε i ≤ - x i β) = F(-x i β) Two standard models: ε is either –normal or –logistic

98 Normal (probit) Model ε is distributed as a standard normal –Mean zero –Variance 1 Evaluate probability (y=1) –Pr(y i =1) = Pr(ε i > - x i β) = 1 – Ф(-x i β) –Given symmetry: 1 – Ф(-x i β) = Ф(x i β) Evaluate probability (y=0) –Pr(y i =0) = Pr(ε i ≤ - x i β) = Ф(-x i β) –Given symmetry: Ф(-x i β) = 1 - Ф(x i β)

99 Summary
–Pr(y_i=1) = Ф(x_i β)
–Pr(y_i=0) = 1 − Ф(x_i β)
Notice that Ф(a) is increasing in a. Therefore, if one of the x’s increases the probability of observing y=1, we would expect the coefficient on that variable to be (+)

100 The standard normal assumption (variance=1) is not critical. In practice, the variance may not be equal to 1, but given the math of the problem, we cannot identify the variance. It is absorbed into the parameter estimates

101 Logit CDF: F(a) = exp(a)/(1+exp(a)) –Symmetric, unimodal distribution –Looks a lot like the normal –Incredibly easy to evaluate the CDF and PDF –Mean of zero, variance > 1 (more variance than normal) Evaluate probability (y=1) –Pr(y i =1) = Pr(ε i > - x i β) = 1 – F(-x i β) –Given symmetry: 1 – F(-x i β) = F(x i β) –F(x i β) = exp(x i β)/(1+exp(x i β))

102 Evaluate probability (y=0) –Pr(y i =0) = Pr(ε i ≤ - x i β) = F(-x i β) –Given symmetry: F(-x i β) = 1 - F(x i β) –1 - F(x i β) = 1 /(1+exp(x i β)) When ε i is a logistic distribution –Pr(y i =1) = exp(x i β)/(1+exp(x i β)) –Pr(y i =0) = 1/(1+exp(x i β))
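The two probability formulas are easy to compare side by side. A minimal sketch in plain Python (not STATA): probit_prob uses the standard normal CDF and logit_prob uses exp(a)/(1+exp(a)), written in the algebraically equivalent form 1/(1+exp(−a)). The function names and the grid of index values are illustrative assumptions.

```python
import math


def probit_prob(xb):
    """Pr(y=1) when the error is standard normal: Phi(x*beta)."""
    return 0.5 * (1.0 + math.erf(xb / math.sqrt(2.0)))


def logit_prob(xb):
    """Pr(y=1) when the error is logistic: exp(xb)/(1+exp(xb))."""
    return 1.0 / (1.0 + math.exp(-xb))


for xb in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x*beta = {xb:+.1f}  probit = {probit_prob(xb):.4f}  logit = {logit_prob(xb):.4f}")
```

At the same index value the logit probabilities sit closer to 0.5 than the probit ones because the logistic error has the larger variance; this is one reason estimated logit and probit coefficients are not directly comparable.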

103 Example: Workplace smoking bans
–Smoking supplements to 1991 and 1993 National Health Interview Survey
–Asked all respondents whether they currently smoke
–Asked workers about workplace tobacco policies
–Sample: workers
–Key variables: current smoking and whether they faced a workplace ban

104 Data: workplace1.dta Sample program: workplace1.doc Results: workplace1.log

105 Description of variables in data. desc; storage display value variable name type format label variable label > - smoker byte %9.0g is current smoking worka byte %9.0g has workplace smoking bans age byte %9.0g age in years male byte %9.0g male black byte %9.0g black hispanic byte %9.0g hispanic incomel float %9.0g log income hsgrad byte %9.0g is hs graduate somecol byte %9.0g has some college college float %9.0g

106 Summary statistics sum; Variable | Obs Mean Std. Dev. Min Max smoker | worka | age | male | black | hispanic | incomel | hsgrad | somecol | college |

107 Running a probit probit smoker age incomel male black hispanic hsgrad somecol college worka; The first variable after ‘probit’ is the discrete outcome, the rest of the variables are the independent variables Includes a constant as a default

108 Running a logit logit smoker age incomel male black hispanic hsgrad somecol college worka; Same as probit, just change the first word

109 Running linear probability
    reg smoker age incomel male black hispanic hsgrad somecol college worka, robust;
–Simple regression
–Conventional standard errors are incorrect (heteroskedasticity)
–The robust option produces standard errors that are valid under an arbitrary form of heteroskedasticity

110 Probit Results Probit estimates Number of obs = LR chi2(9) = Prob > chi2 = Log likelihood = Pseudo R2 = smoker | Coef. Std. Err. z P>|z| [95% Conf. Interval] age | incomel | male | black | hispanic | hsgrad | somecol | college | worka | _cons |

111 How to measure fit?
Regression (OLS)
–Minimize sum of squared errors
–Or, maximize R²
–The model is designed to maximize predictive capacity
Not the case with Probit/Logit
–MLE models pick distribution parameters so as to best describe the data generating process
–May or may not ‘predict’ the outcome well

112 Pseudo R²
–LL_k: log likelihood with all variables
–LL_1: log likelihood with only a constant
–0 > LL_k > LL_1, so |LL_k| < |LL_1|
–Pseudo R² = 1 − |LL_k/LL_1|
–Bounded between 0 and 1
–Not anything like an R² from a regression
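The calculation is a one-liner once the two log likelihoods are in hand. In this plain Python sketch (not STATA) the ratio is taken as full-model over constant-only, LL_k/LL_1, which is the form that stays between 0 and 1; the two log likelihood values are made up for illustration and do not come from the smoking example.

```python
def pseudo_r2(ll_full, ll_const):
    """Pseudo R-squared: 1 - LL_k / LL_1, with both log likelihoods negative."""
    return 1.0 - ll_full / ll_const


ll_const = -100.0  # hypothetical log likelihood with only a constant (LL_1)
ll_full = -80.0    # hypothetical log likelihood with all variables (LL_k)

print(pseudo_r2(ll_full, ll_const))
```

A model whose covariates add nothing has LL_k = LL_1 and a pseudo R² of zero.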

113 Predicting Y Let b be the estimated value of β For any candidate vector of x i, we can predict probabilities, P i P i = Ф(x i b) Once you have P i, pick a threshold value, T, so that you predict Y p = 1 if P i > T Y p = 0 if P i ≤ T Then compare, fraction correctly predicted

114 Question: what value to pick for T?
Can pick .5
–Intuitive: more likely to engage in the activity than to not engage in it
–However, when the sample mean of y (ȳ) is small, this criterion does a poor job of predicting Y_i=1
–However, when ȳ is close to 1, this criterion does a poor job of predicting Y_i=0

115
    *predict probability of smoking;
    predict pred_prob_smoke;
    * get detailed descriptive data about predicted prob;
    sum pred_prob, detail;
    * predict binary outcome with 50% cutoff;
    gen pred_smoke1=pred_prob_smoke>=.5;
    label variable pred_smoke1 "predicted smoking, 50% cutoff";
    * compare actual values;
    tab smoker pred_smoke1, row col cell;

116. sum pred_prob, detail; Pr(smoker) Percentiles Smallest 1% % % Obs % Sum of Wgt % Mean Largest Std. Dev % % Variance % Skewness % Kurtosis

117 Notice two things
–The sample mean of the predicted probabilities is close to the sample mean outcome
–99% of the probabilities are less than .5, so we should predict few smokers if we use a 50% cutoff

118 | predicted smoking, is current | 50% cutoff smoking | 0 1 | Total | 12, | 12,167 | | | | | | | 4, | 4,091 | | | | | | Total | 16, | 16,258 | | | | | |

119 Check the on-diagonal elements. The last number in each cell of the 2x2 table is the fraction of observations in the cell. The model correctly predicts 74.90% of the obs. It predicts only a small fraction of smokers

120 Do not be amazed by the 75 percent correct prediction. If you said everyone has a ȳ chance of smoking (a case of no covariates), you would be correct Max[ȳ, (1−ȳ)] percent of the time

121 In this case, 25.16% smoke. If everyone had the same chance of smoking, we would assign everyone Pr(y=1) = .2516. We would be correct for the 1 − .2516 = .7484 share of people who do not smoke
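The comparison between a threshold rule and the no-covariate benchmark can be sketched as follows, in plain Python rather than STATA. The function names and the four-person toy sample are hypothetical; the only figures taken from the example are the counts 12,167 non-smokers and 4,091 smokers out of 16,258.

```python
def fraction_correct(y, p, threshold=0.5):
    """Share of observations where the rule 'predict 1 if P_i > T' matches y_i."""
    preds = [1 if pi > threshold else 0 for pi in p]
    return sum(1 for yi, yhat in zip(y, preds) if yi == yhat) / len(y)


def naive_benchmark(y_bar):
    """Share correct when everyone gets the same prediction: max(y_bar, 1 - y_bar)."""
    return max(y_bar, 1.0 - y_bar)


# Toy sample: one 'smoker' whose predicted probability stays below the cutoff
y_toy = [1, 0, 0, 0]
p_toy = [0.4, 0.3, 0.2, 0.1]
toy_correct = fraction_correct(y_toy, p_toy)

# Benchmark for the smoking example, from the counts in the 2x2 table
smoking_benchmark = naive_benchmark(4091 / 16258)
print(toy_correct, round(smoking_benchmark, 4))
```

In the toy sample the rule never predicts a smoker yet is right three times out of four, the same situation as in the smoking data.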

122 Key points about prediction MLE models are not designed to maximize prediction Should not be surprised they do not predict well In this case, not particularly good measures of predictive capacity

123 Translating coefficients in probit: Continuous Covariates
Pr(y_i=1) = Φ[β_0 + x_1i β_1 + x_2i β_2 + … + x_ki β_k]
Suppose that x_1i is a continuous variable
d Pr(y_i=1)/d x_1i = ?
What is the change in the probability of an event given a change in x_1i?

124 Marginal Effect
d Pr(y_i=1)/d x_1i = β_1 φ[β_0 + x_1i β_1 + x_2i β_2 + … + x_ki β_k]
Notice two things: the marginal effect is a function of the other parameters, and it is a function of the values of x

125 Translating Coefficients: Discrete Covariates Pr(y i =1) = Φ[β 0 + x 1i β 1 + x 2i β 2 +… x ki β k ] Suppose that x 2i is a dummy variable (1 if yes, 0 if no) Marginal effect makes no sense, cannot change x 2i by a little amount. It is either 1 or 0. Redefine the variable of interest. Compare outcomes with and without x 2i

126 y 1 = Pr(y i =1 | x 2i =1) = Φ[β 0 + x 1i β 1 + β 2 + x 3i β 3 +… ] y 0 = Pr(y i =1 | x 2i =0) = Φ[β 0 + x 1i β 1 + x 3i β 3 … ] Marginal effect = y 1 – y 0. Difference in probabilities with and without x 2i?
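Both calculations reduce to evaluating φ and Φ at an index value. A minimal sketch in plain Python (not STATA); the function names are illustrative, and the index values and coefficients passed in are invented numbers, not estimates from the smoking model.

```python
import math


def phi(z):
    """Standard normal PDF."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def marginal_effect_continuous(beta_j, xb):
    """d Pr(y=1)/d x_j = beta_j * phi(x*beta), evaluated at index value xb."""
    return beta_j * phi(xb)


def discrete_change(xb_without, beta_dummy):
    """Pr(y=1 | dummy=1) - Pr(y=1 | dummy=0), other x's held fixed."""
    return Phi(xb_without + beta_dummy) - Phi(xb_without)


# Hypothetical index (with the dummy set to 0) and hypothetical coefficients
xb0 = -0.6
print(marginal_effect_continuous(-0.15, xb0))
print(discrete_change(xb0, -0.2))
```

Because φ depends on the index, the same coefficient implies a different marginal effect for people with different x's, which is why a reference point such as the sample means has to be chosen.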

127 In STATA
–Marginal effects for continuous variables, and changes in probabilities for dichotomous covariates
–STATA evaluates both at the sample means of the X’s

128 STATA command for Marginal Effects
    mfx compute;
Must come after the estimation command, while the estimates are still active in the program

129 Marginal effects after probit y = Pr(smoker) (predict) = variable | dy/dx Std. Err. z P>|z| [ 95% C.I. ] X age | incomel | male*| black*| hispanic*| hsgrad*| somecol*| college*| worka*| (*) dy/dx is for discrete change of dummy variable from 0 to 1

130 Interpret results
–A 10% increase in income will reduce smoking by 2.9 percentage points
–A 10 year increase in age will decrease smoking rates by .4 percentage points
–Those with a college degree are 21.5 percentage points less likely to smoke
–Those that face a workplace smoking ban have a 6.7 percentage point lower probability of smoking

131 Do not confuse percentage point and percent differences
–A 6.7 percentage point drop is about 28% of the sample mean of 24 percent
–Blacks have smoking rates that are 3.2 percentage points lower than others, which is 13 percent of the sample mean

132 Comparing Marginal Effects Variable | LP | Probit | Logit age incomel male black hispanic hsgrad college worka

133 Marginal effects for specific characteristics Can generate marginal effects for a specific x prchange, x(age=40 black=0 hispanic=0 hsgrad=0 somecol=0 worka=0); If an x is not specified, STATA will use the sample mean (e.g., log income in this case) Make sure when you specify a particular dummy variable (=1) you set the rest to zero

134 probit: Changes in Predicted Probabilities for smoker min->max 0->1 -+1/2 -+sd/2 MargEfct age incomel male black hispanic hsgrad somecol college worka

135 Testing significance of individual parameters In large samples, MLE estimates are normally distributed Null hypothesis, β j =0 If the null is true and the sample is large, β j is distributed as a normal with mean 0 and variance σ j 2. Using notes from before, if we divide β j by its standard deviation, we get a standard normal

136 β j /se(β j ) should be N(0,1) β j /se(β j ) = z-score 95% of the distribution of a N(0,1) is between -1.96, 1.96 Reject null if |z-score| > 1.96 Only age is statistically insignificant (cannot reject null)

137 When will results differ? Normal and logit CDF look: –Similar in the mid point of the distribution –Different in the tails You obtain more observations in the tails of the distribution when –Sample sizes are large –ȳ (the sample mean of y) approaches 1 or 0 These situations will produce more differences in estimates

138 Some nice properties of the Logit Outcome, y=1 or 0 Treatment, x=1 or 0 Other covariates, z Context, –y = whether a baby is born with a low birth weight –x = whether the mom smoked or not during pregnancy

139 Risk ratio RR = Prob(y=1|x=1)/Prob(y=1|x=0) Differences in the probability of an event when x is and is not observed How much does smoking elevate the chance your child will be a low weight birth

140 Let Y yx be the probability y=1 or 0 given x=1 or 0 Think of the risk ratio the following way Y 11 is the probability Y=1 when X=1 Y 10 is the probability Y=1 when X=0 Y 11 = RR*Y 10

141 Odds Ratio OR=A/B = [Y 11 /Y 01 ]/[Y 10 /Y 00 ] A = [Pr(Y=1|X=1)/Pr(Y=0|X=1)] = odds of Y occurring if you are a smoker B = [Pr(Y=1|X=0)/Pr(Y=0|X=0)] = odds of y happening if you are not a smoker What are the relative odds of Y happening if you do or do not experience X

142 Suppose Pr(Y i =1) = F(β o + β 1 X i + β 2 Z) and F is the logistic function Can show that OR = exp(β 1 ) = e β1 This number is typically reported by most statistical packages

143 Details
$Y_{11} = \exp(\beta_0 + \beta_1 + \beta_2 Z) / (1 + \exp(\beta_0 + \beta_1 + \beta_2 Z))$
$Y_{10} = \exp(\beta_0 + \beta_2 Z) / (1 + \exp(\beta_0 + \beta_2 Z))$
$Y_{01} = 1 / (1 + \exp(\beta_0 + \beta_1 + \beta_2 Z))$
$Y_{00} = 1 / (1 + \exp(\beta_0 + \beta_2 Z))$
$Y_{11}/Y_{01} = \exp(\beta_0 + \beta_1 + \beta_2 Z)$
$Y_{10}/Y_{00} = \exp(\beta_0 + \beta_2 Z)$
$OR = A/B = [Y_{11}/Y_{01}] / [Y_{10}/Y_{00}] = \exp(\beta_0 + \beta_1 + \beta_2 Z) / \exp(\beta_0 + \beta_2 Z) = \exp(\beta_1)$
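The cancellation in the derivation above can be checked numerically. The sketch below uses hypothetical coefficients (`B1`, `B2`, and a range of intercepts) to show that the odds ratio equals exp(β1) whatever the intercept or Z, while the risk ratio only approaches it when the outcome is rare.

```python
import math

def logistic(v):
    """Logistic CDF F(v) = exp(v) / (1 + exp(v))."""
    return 1.0 / (1.0 + math.exp(-v))

def odds_and_risk_ratio(b0, b1, b2, z):
    """Return (odds ratio, risk ratio) for X=1 versus X=0 at covariate value z."""
    y11 = logistic(b0 + b1 + b2 * z)  # Pr(Y=1 | X=1)
    y10 = logistic(b0 + b2 * z)       # Pr(Y=1 | X=0)
    odds_ratio = (y11 / (1.0 - y11)) / (y10 / (1.0 - y10))
    risk_ratio = y11 / y10
    return odds_ratio, risk_ratio

B1, B2 = 0.7, 0.5  # hypothetical slope coefficients
results = {}
for b0 in (-6.0, -3.0, 0.0):      # smaller intercept means a rarer outcome
    for z in (0.0, 1.0):
        results[(b0, z)] = odds_and_risk_ratio(b0, B1, B2, z)
        o, r = results[(b0, z)]
        print(f"b0={b0:5.1f} z={z:.0f}  OR={o:.4f}  RR={r:.4f}")

print(f"exp(B1) = {math.exp(B1):.4f}")
```

With the most negative intercept the two ratios nearly coincide; with an intercept of zero the risk ratio is well below the odds ratio.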

144 Suppose Y is rare, ȳ close to 0 –Pr(Y=0|X=1) and Pr(Y=0|X=0) are both close to 1, so they cancel Therefore, when ȳ is close to 0 –Odds Ratio ≈ Risk Ratio Why is this nice?

145 Population attributable risk Average outcome in the population ȳ = (1 – x̄) Y 10 + x̄ Y 11 = (1 – x̄) Y 10 + x̄ (RR) Y 10, where x̄ is the fraction of the population with X=1 Average outcomes are a weighted average of outcomes for X=0 and X=1 What would the average outcome be in the absence of X (e.g., reduce smoking rates to 0) Y a = Y 10

146 Population Attributable Risk PAR Fraction of outcome attributed to X The difference between the current rate and the rate that would exist without X, divided by the current rate PAR = (ȳ – Y a )/ȳ = x̄ (RR – 1) /[(1 – x̄) + x̄ RR ]
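The step from the definition of PAR to the closed form is short but easy to lose. A sketch of the algebra, writing $\bar{x}$ for the share of the population with X=1 and $\bar{y}$ for the average outcome (notation assumed here, following the definitions on the two slides above):

```latex
\begin{aligned}
\bar{y} &= (1-\bar{x})\,Y_{10} + \bar{x}\,Y_{11}
         = Y_{10}\,\bigl[(1-\bar{x}) + \bar{x}\,RR\bigr], \qquad Y_a = Y_{10} \\[4pt]
PAR &= \frac{\bar{y} - Y_a}{\bar{y}}
     = \frac{Y_{10}\,\bigl[(1-\bar{x}) + \bar{x}\,RR - 1\bigr]}{Y_{10}\,\bigl[(1-\bar{x}) + \bar{x}\,RR\bigr]}
     = \frac{\bar{x}\,(RR-1)}{(1-\bar{x}) + \bar{x}\,RR}
\end{aligned}
```

The baseline risk $Y_{10}$ cancels, so PAR depends only on the exposure share and the risk ratio.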

147 Example: Maternal Smoking and Low Weight Births 6% births are low weight –< 2500 grams (5.5 lbs) –Average birth is 3300 grams (about 7.3 lbs) Maternal smoking during pregnancy has been identified as a key cofactor –13% of mothers smoke –This number was falling about 1 percentage point per year during 1980s/90s –Doubles chance of low weight birth

148 Natality detail data Census of all births (4 million/year) Annual files starting in the 60s Information about –Baby (birth weight, length, date, sex, plurality, birth injuries) –Demographics (age, race, marital, educ of mom) –Birth (who delivered, method of delivery) –Health of mom (smoke/drank during preg, weight gain)

149 Smoking not available from CA or NY ~3 million usable observations I pulled a 0.5% random sample from 1995 About 12,500 obs Variables: birthweight (grams), smoked, married, 4-level race, 5 level education, mothers age at birth

variable name   storage type   display format   variable label
birthw          int            %9.0g            birth weight in grams
smoked          byte           %9.0g            =1 if mom smoked during pregnancy
age             byte           %9.0g            moms age at birth
married         byte           %9.0g            =1 if married
race4           byte           %9.0g            1=white,2=black,3=asian,4=other
educ5           byte           %9.0g            1=0-8, 2=9-11, 3=12, 4=13-15, 5=16+
visits          byte           %9.0g            prenatal visits

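151 Frequencies: low birth weight dummy (=1 if BW<2500 grams) by smoking dummy (=1 if mom smoked during pregnancy)
low BW    smoked=0   smoked=1   Total
0         11,626     1,745      13,371
1         659        200        859
Total     12,285     1,945      14,230
(Row 1 counts for each smoking group are implied by the column totals; 859 low weight births are 6.04% of all births.)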

152 Notice a few things –13.7% of women smoke –6% have low weight birth Pr(LBW | Smoke) =10.28% Pr(LBW |~ Smoke) = 5.36% RR = Pr(LBW | Smoke)/ Pr(LBW |~ Smoke) = 10.28 / 5.36 = 1.92
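The risk ratio on this slide can be reproduced from the cell counts in the cross-tabulation. In this sketch the low-weight counts are not read from the table directly; they are backed out from the row and column totals that survive in the output, which is an assumption worth keeping in mind.

```python
# Counts from the cross-tabulation of low birth weight by maternal smoking.
normal_nonsmoker, normal_smoker = 11626, 1745   # row "0": not low weight
total_nonsmoker, total_smoker = 12285, 1945     # column totals

# Low-weight counts implied by the totals (assumption: totals are exact).
low_nonsmoker = total_nonsmoker - normal_nonsmoker
low_smoker = total_smoker - normal_smoker

p_lbw_smoke = low_smoker / total_smoker            # Pr(LBW | Smoke)
p_lbw_nonsmoke = low_nonsmoker / total_nonsmoker   # Pr(LBW | ~Smoke)
risk_ratio = p_lbw_smoke / p_lbw_nonsmoke
share_smokers = total_smoker / (total_smoker + total_nonsmoker)

print(f"Pr(LBW | Smoke)  = {100 * p_lbw_smoke:.2f}%")
print(f"Pr(LBW | ~Smoke) = {100 * p_lbw_nonsmoke:.2f}%")
print(f"RR = {risk_ratio:.2f}, share of smokers = {100 * share_smokers:.1f}%")
```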

153 Logit results Log likelihood = Pseudo R2 = lowbw | Coef. Std. Err. z P>|z| [95% Conf. Interval] smoked | age | married | _Ieduc5_2 | _Ieduc5_3 | _Ieduc5_4 | _Ieduc5_5 | _Irace4_2 | _Irace4_3 | _Irace4_4 | _cons |

154 Odds Ratios Smoked –exp(0.674) = 1.96 –Smokers are twice as likely to have a low weight birth _Irace4_2 (Blacks) –exp(0.707) = 2.02 –Blacks are twice as likely to have a low weight birth

155 Asking for odds ratios logistic y x1 x2; In this case xi: logistic lowbw smoked age married i.educ5 i.race4;

156 Log likelihood = Pseudo R2 = lowbw | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] smoked | age | married | _Ieduc5_2 | _Ieduc5_3 | _Ieduc5_4 | _Ieduc5_5 | _Irace4_2 | _Irace4_3 | _Irace4_4 |

157 PAR PAR = x̄ (RR – 1) /[(1 – x̄) + x̄ RR ] x̄ = 0.137 RR = 1.96 PAR = 0.116, so about 11.6% of low weight births are attributed to maternal smoking
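The PAR arithmetic is easy to verify. The sketch below plugs in the smoking share reported earlier (13.7%) and the odds ratio of 1.96 used in place of the risk ratio; the function name is hypothetical.

```python
def population_attributable_risk(exposed_share, risk_ratio):
    """PAR = x(RR - 1) / [(1 - x) + x*RR], with x the share of the population exposed."""
    numerator = exposed_share * (risk_ratio - 1.0)
    denominator = (1.0 - exposed_share) + exposed_share * risk_ratio
    return numerator / denominator

# Inputs taken from the slides: 13.7% of mothers smoke, RR approximated by OR = 1.96.
par = population_attributable_risk(0.137, 1.96)
print(f"PAR = {par:.3f}")
```

Because low weight births are fairly rare, using the odds ratio as a stand-in for the risk ratio changes the answer only slightly.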

158 Hypothesis Testing in MLE models MLE are asymptotically normally distributed, one of the properties of MLE Therefore, standard t-tests of hypothesis will work as long as samples are ‘large’ What ‘large’ means is open to question What to do when samples are ‘small’ – table for a moment

159 Testing a linear combination of parameters Suppose you have a probit model Φ[β 0 + x 1i β 1 + x 2i β 2 + x 3i β 3 +… ] Test a linear combination of parameters Simplest example, test that a subset are zero β 1 = β 2 = β 3 = β 4 =0 To fix the discussion N observations K parameters J restrictions (count the equals signs, J=4)

160 Wald Test Based on the fact that the parameters are distributed asymptotically normal Probability theory review –Suppose you have m independent draws from a standard normal distribution (z i ) –$M = z_1^2 + z_2^2 + \dots + z_m^2$ –M is distributed as a Chi-square with m degrees of freedom

161 Wald test constructs a ‘quadratic form’ suggested by the test you want to perform This combination, because it contains squares of the true parameters, should, if the hypothesis is true, be distributed as a Chi square with J degrees of freedom. If the test statistic is ‘large’, relative to the degrees of freedom of the test, we reject, because there is a low probability we would have drawn that value at random from the distribution

162 Reading critical values from a table All stats books will report the ‘percentiles’ of a chi-square –Vertical axis (degrees of freedom) –Horizontal axis (percentiles) –Entry is the value where ‘percentile’ of the distribution falls below

163 Example: Suppose 4 restrictions 95% of a chi-square distribution with 4 degrees of freedom falls below 9.488 So there is only a 5% chance a number drawn at random will exceed 9.488 If your test statistic is below 9.488, cannot reject null If your test statistic is above 9.488, reject null
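Instead of reading a table, the tail probability can be computed directly. The sketch below relies on a closed form that holds only for an even number of degrees of freedom, which covers the 4-restriction example; the test statistic of 12.3 is a made-up value for illustration.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square with an even number of df.

    Uses P(X > x) = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k   # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total

# The critical value for 4 restrictions leaves about 5% in the upper tail.
p_at_critical = chi2_sf_even_df(9.488, 4)

# Hypothetical Wald statistic with 4 restrictions.
wald_stat = 12.3
p_value = chi2_sf_even_df(wald_stat, 4)
reject = p_value < 0.05

print(f"P(chi2(4) > 9.488) = {p_at_critical:.4f}")
print(f"p-value for {wald_stat}: {p_value:.4f}, reject null: {reject}")
```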

164 Chi-square

165 Wald test in STATA Default test in MLE models Easy to do. Look at program test hsgrad somecol college Does not estimate the ‘restricted’ model ‘Lower power’ than other tests, i.e., high chance of false negative

166. test hsgrad somecol college; ( 1) hsgrad = 0 ( 2) somecol = 0 ( 3) college = 0 chi2( 3) = Prob > chi2 =

167 Notice the high value of the test statistic. There is a low chance that a variable, drawn at random from a chi-square with three degrees of freedom, will be this large. Reject null

Log likelihood test
* how to run the same tests with a -2 log like test;
* estimate the unrestricted model and save the estimates ;
* in urmodel;
probit smoker age incomel male black hispanic hsgrad somecol college worka;
estimates store urmodel;
* estimate the restricted model. save results in rmodel;
probit smoker age incomel male black hispanic worka;
estimates store rmodel;
lrtest urmodel rmodel;
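For reference, the statistic behind the "-2 log likelihood" test compares the maximized log likelihoods of the two stored models. Writing $\ln L_U$ and $\ln L_R$ for the unrestricted and restricted values (symbols assumed here, not from the slides):

```latex
LR = -2\,\bigl[\ln L_R - \ln L_U\bigr] = 2\,\bigl[\ln L_U - \ln L_R\bigr]
\;\overset{a}{\sim}\; \chi^2_J
```

Here $J$ is the number of restrictions, which is 3 in this example because hsgrad, somecol and college are dropped. Since the restricted model can never fit better, $LR \ge 0$, and the null is rejected when $LR$ exceeds the chi-square critical value with $J$ degrees of freedom.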

169 I prefer -2 log likelihood test –Estimates the restricted and unrestricted model –Therefore, has more power than a Wald test In most cases, they give the same ‘decision’ (reject/not reject)

170 Section VI Categorical Data

171 Ordered Probit Many discrete outcomes are responses to questions that have a natural ordering but no quantitative interpretation: Examples: –Self reported health status (excellent, very good, good, fair, poor) –Do you agree with the following statement: strongly agree, agree, disagree, strongly disagree

172 Can use the same type of model as in the previous section to analyze these outcomes Another ‘latent variable’ model Key to the model: there is a monotonic ordering of the qualitative responses

173 Self reported health status Excellent, very good, good, fair, poor Coded as 1, 2, 3, 4, 5 on National Health Interview Survey We will code as 5,4,3,2,1 (easier to think of this way) Asked on every major health survey Important predictor of health outcomes, e.g. mortality Key question: what predicts health status?

174 Important to note – the numbers 1-5 mean nothing in terms of their value, just an ordering to show you the lowest to highest The example below is easily adapted to include categorical variables with any number of outcomes

175 Model y i * = latent index of reported health The latent index measures your own scale of health. Once y i * crosses a certain value you report poor, then good, then very good, then excellent health

176 y i = (1,2,3,4,5) for (poor, fair, good, VG, excel) Interval decision rule y i =1 if y i * ≤ u 1 y i =2 if u 1 < y i * ≤ u 2 y i =3 if u 2 < y i * ≤ u 3 y i =4 if u 3 < y i * ≤ u 4 y i =5 if y i * > u 4

177 As with logit and probit models, we will assume y i * is a function of observed and unobserved variables y i * = β 0 + x 1i β 1 + x 2i β 2 …. x ki β k + ε i y i * = x i β + ε i

178 The threshold values (u 1, u 2, u 3, u 4 ) are unknown. We do not know the value of the index necessary to push you from very good to excellent. In theory, the threshold values are different for everyone Computer will not only estimate the β’s, but also the thresholds – average across people

179 As with probit and logit, the model will be determined by the assumed distribution of ε In practice, most people pick normal, generating an ‘ordered probit’ (I have no idea why) We will generate the math for the probit version

180 Probabilities Lets do the outliers, Pr(y i =1) and Pr(y i =5) first Pr(y i =1) = Pr(y i * ≤ u 1 ) = Pr(x i β +ε i ≤ u 1 ) =Pr(ε i ≤ u 1 - x i β) = Φ[u 1 - x i β] = 1- Φ[x i β – u 1 ]

181 Pr(y i =5) = Pr(y i * > u 4 ) = Pr(x i β +ε i > u 4 ) =Pr(ε i > u 4 - x i β) = 1 - Φ[u 4 - x i β] = Φ[x i β – u 4 ]

182 Sample one for y=3 Pr(y i =3) = Pr(u 2 < y i * ≤ u 3 ) = Pr(y i * ≤ u 3 ) – Pr(y i * ≤ u 2 ) = Pr(x i β +ε i ≤ u 3 ) – Pr(x i β +ε i ≤ u 2 ) = Pr(ε i ≤ u 3 - x i β) - Pr(ε i ≤ u 2 - x i β) = Φ[u 3 - x i β] - Φ[u 2 - x i β] = 1 - Φ[x i β - u 3 ] – 1 + Φ[x i β - u 2 ] = Φ[x i β - u 2 ] - Φ[x i β - u 3 ]

183 Summary Pr(y i =1) = 1- Φ[x i β – u 1 ] Pr(y i =2) = Φ[x i β – u 1 ] - Φ[x i β – u 2 ] Pr(y i =3) = Φ[x i β – u 2 ] - Φ[x i β – u 3 ] Pr(y i =4) = Φ[x i β – u 3 ] - Φ[x i β – u 4 ] Pr(y i =5) = Φ[x i β – u 4 ]
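The five probabilities in the summary telescope, so they must sum to one for any index and any increasing set of thresholds. A minimal sketch, with a hypothetical index value `xb` and hypothetical thresholds `cuts`:

```python
import math

def norm_cdf(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ordered_probit_probs(xb, cuts):
    """Return [Pr(y=1), ..., Pr(y=K)] for index xb and increasing thresholds cuts."""
    # Phi[xb - u_j] is Pr(y* > u_j). Pad with 1 (below every threshold)
    # and 0 (above every threshold) so each category is a simple difference.
    upper = [1.0] + [norm_cdf(xb - u) for u in cuts] + [0.0]
    return [upper[j] - upper[j + 1] for j in range(len(cuts) + 1)]

xb = 0.3                        # hypothetical value of x_i * beta
cuts = [-1.5, -0.5, 0.4, 1.2]   # hypothetical u1 < u2 < u3 < u4

probs = ordered_probit_probs(xb, cuts)
for k, p in enumerate(probs, start=1):
    print(f"Pr(y={k}) = {p:.4f}")
print(f"sum = {sum(probs):.4f}")
```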

184 Likelihood function There are 5 possible choices for each person Only 1 is observed L = Σ i ln[Pr(y i =k i )], where k i is the category actually observed for person i

185 Programming example Cancer control supplement to 1994 National Health Interview Survey Question: what observed characteristics predict self reported health (1-5 scale) 1=poor, 5=excellent Key covariates: income, education, age, current and former smoking status Programs sr_health_status.do,.dta,.log

186 desc;
variable name   storage type   display format   variable label
male            byte           %9.0g            =1 if male
age             byte           %9.0g            age in years
educ            byte           %9.0g            years of education
smoke           byte           %9.0g            current smoker
smoke5          byte           %9.0g            smoked in past 5 years
black           float          %9.0g            =1 if respondent is black
othrace         float          %9.0g            =1 if other race (white is ref)
sr_health       float          %9.0g            1-5 self reported health, 5=excel, 1=poor
famincl         float          %9.0g            log family income

187 tab sr_health; 1-5 self | reported | health, | 5=excel, | 1=poor | Freq. Percent Cum | | | 3, | 3, | 4, Total | 12,

188 In STATA oprobit sr_health male age educ famincl black othrace smoke smoke5;

189 Ordered probit estimates Number of obs = LR chi2(8) = Prob > chi2 = Log likelihood = Pseudo R2 = sr_health | Coef. Std. Err. z P>|z| [95% Conf. Interval] male | age | educ | famincl | black | othrace | smoke | smoke5 | _cut1 | (Ancillary parameters) _cut2 | _cut3 | _cut4 |

190 Interpret coefficients Marginal effects/changes in probabilities are now a function of 2 things –Point of expansion (x’s) –Frame of reference for outcome (y) STATA –Picks mean values for x’s –You pick the value of y

191 Continuous x’s Consider y=5 d Pr(y i =5)/dx i = d Φ[x i β – u 4 ]/dx i = βφ[x i β – u 4 ] Consider y=3 d Pr(y i =3)/dx i = βφ[x i β – u 2 ] - βφ[x i β – u 3 ]
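Differentiating each of the five category probabilities, Pr(y=k) = Φ[xβ – u(k-1)] – Φ[xβ – u(k)], gives one marginal effect per outcome, and because the probabilities sum to one the effects must sum to zero. A sketch with a hypothetical coefficient `beta`, index `xb`, and thresholds `cuts`:

```python
import math

def norm_pdf(z):
    """Standard normal density, phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def ordered_probit_marginal_effects(beta, xb, cuts):
    """Return [dPr(y=1)/dx, ..., dPr(y=K)/dx] for a continuous covariate."""
    # dPr(y=k)/dx = beta * (phi[xb - u_{k-1}] - phi[xb - u_k]),
    # where the density term is 0 at the two outer boundaries.
    dens = [0.0] + [norm_pdf(xb - u) for u in cuts] + [0.0]
    return [beta * (dens[j] - dens[j + 1]) for j in range(len(cuts) + 1)]

beta = 0.25                     # hypothetical coefficient on x
xb = 0.3                        # hypothetical value of x_i * beta
cuts = [-1.5, -0.5, 0.4, 1.2]   # hypothetical u1 < u2 < u3 < u4

effects = ordered_probit_marginal_effects(beta, xb, cuts)
for k, e in enumerate(effects, start=1):
    print(f"dPr(y={k})/dx = {e:+.4f}")
print(f"sum of effects = {sum(effects):.1e}")
```

With a positive coefficient the lowest category always loses probability and the highest always gains; the sign for middle categories depends on where the index sits relative to the thresholds.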

192 Discrete X’s x i β = β 0 + x 1i β 1 + x 2i β 2 …. x ki β k –X 2i is yes or no (1 or 0) ΔPr(y i =5) = Φ[β 0 + x 1i β 1 + β 2 + x 3i β 3 …. x ki β k – u 4 ] - Φ[β 0 + x 1i β 1 + x 3i β 3 …. x ki β k – u 4 ] Change in the probabilities when x 2i =1 and x 2i =0

193 Ask for marginal effects mfx compute, predict(outcome(5));

194 mfx compute, predict(outcome(5)); Marginal effects after oprobit y = Pr(sr_health==5) (predict, outcome(5)) = variable | dy/dx Std. Err. z P>|z| [ 95% C.I. ] X male*| age | educ | famincl | black*| othrace*| smoke*| smoke5*| (*) dy/dx is for discrete change of dummy variable from 0 to 1

195 Interpret the results Males are 4.7 percentage points more likely to report excellent Each year of age decreases chance of reporting excellent by 0.7 percentage points Current smokers are 7.5 percentage points less likely to report excellent health

196 Minor notes about estimation Wald tests/-2 log likelihood tests are done the exact same way as in PROBIT and LOGIT Tests of individual parameters are done the same way (z-score)

197 Use PRCHANGE to calculate marginal effect for a specific person prchange, x(age=40 black=0 othrace=0 smoke=0 smoke5=0 educ=16); –When a variable is NOT specified (famincl), STATA takes the sample mean.

198 PRCHANGE will produce results for all outcomes male Avg|Chg| > >

199 age Avg|Chg| Min->Max / sd/ MargEfct

200 Section VII Count Data Models

201 Introduction Many outcomes of interest are integer counts –Doctor visits –Lost work days –Cigarettes smoked per day –Missed school days OLS models can easily handle some integer outcomes

202 Example –SAT scores are essentially integer values –Few at ‘tails’ –Distribution is fairly continuous –OLS models well In contrast, suppose –High fraction of zeros –Small positive values

203 OLS models will –Predict negative values –Do a poor job of predicting the mass of observations at zero Example –Dr visits in past year, Medicare patients(65+) –1987 National Medical Expenditure Survey –Top code (for now) at 10 –17% have no visits

204 visits | Freq. Percent Cum | | | | | | | | | | | Total | 5,

205 Poisson Model y i is drawn from a Poisson distribution Poisson parameter varies across observations $f(y_i; \lambda_i) = e^{-\lambda_i} \lambda_i^{y_i} / y_i!$ for $\lambda_i > 0$ $E[y_i] = Var[y_i] = \lambda_i = f(x_i, \beta)$

206 λ i must be positive at all times Therefore, we CANNOT let λ i = x i β Let λ i = exp(x i β) ln(λ i ) = (x i β)

207 d ln(λ i )/dx i = β Remember that d ln(λ i ) = dλ i /λ i Interpret β as the percentage change in mean outcomes for a change in x
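The Poisson setup above has three checkable properties: the probabilities sum to one, the mean and variance both equal λ, and a one-unit change in x shifts the mean by roughly 100·β percent. A small sketch with hypothetical coefficients `beta0` and `beta1`:

```python
import math

def poisson_pmf(y, lam):
    """f(y; lam) = exp(-lam) * lam^y / y!"""
    return math.exp(-lam) * lam ** y / math.factorial(y)

beta0, beta1 = 0.5, 0.1   # hypothetical coefficients

def mean_count(x):
    """lambda_i = exp(x_i * beta), which is positive for any x."""
    return math.exp(beta0 + beta1 * x)

lam = mean_count(2.0)
support = range(0, 60)    # the tail beyond 60 is negligible for this lam
total_prob = sum(poisson_pmf(y, lam) for y in support)
mean = sum(y * poisson_pmf(y, lam) for y in support)
variance = sum((y - mean) ** 2 * poisson_pmf(y, lam) for y in support)

# Effect of raising x by one unit: exact change versus the 100*beta reading.
exact_pct = 100.0 * (mean_count(3.0) / mean_count(2.0) - 1.0)
approx_pct = 100.0 * beta1

print(f"lambda = {lam:.4f}, mean = {mean:.4f}, variance = {variance:.4f}")
print(f"exact change = {exact_pct:.2f}%, approximation = {approx_pct:.2f}%")
```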

208 Problems with Poisson Variance grows with the mean –E[y i ]= Var[y i ] = λ i = f(x i, β) Most data sets have over dispersion, where the variance exceeds the mean In dr. visits sample, ȳ = 5.6, s=6.7 (so s² ≈ 44.9, far above the mean) Imposing Mean=Var is a severe restriction and tends to understate standard errors

209 Negative Binomial Model Where γ i = exp(x i β) and δ ≥ 0 E[y i ] = δγ i = δexp(x i β) Var[y i ] = δ (1+δ) γ i Var[y i ]/ E[y i ] = (1+δ)

210 δ must always be ≥ 0 In this case, the variance grows faster than the mean If δ=0, the model collapses into the Poisson Always estimate negative binomial If you cannot reject the null that δ=0, report the Poisson estimates

211 Notice that ln(E[y i ]) = ln(δ) + ln(γ i ), so d ln(E[y i ]) /dx i = β Parameters have the same interpretation as in the Poisson model

212 In STATA POISSON estimates a MLE model for poisson –Syntax POISSON y independent variables NBREG estimates MLE negative binomial –Syntax NBREG y independent variables

213 Interpret results for Poisson Those with CHRONIC condition have 50% more mean MD visits Those in EXCELent health have 78% fewer MD visits BLACKS have 33% fewer visits than whites Income elasticity is 0.021, 10% increase in income generates a 2.1% increase in visits

214 Negative Binomial Interpret results the same way as Poisson Look at coefficient/standard error on delta Ho: delta = 0 (Poisson model is correct) In this case, delta = 5.21 standard error is 0.15, easily reject null. Var/Mean = 1+delta = 6.21, Poisson is mis-specified, should see very small standard errors in the wrong model

215 Selected Results, Count Models: Parameter (Standard Error)
Variable   Poisson           Negative Binomial
Age        (0.026)           0.103 (0.055)
Age        (0.026)           0.204 (0.054)
Chronic    0.500 (0.014)     0.509 (0.029)
Excel      -0.784 (0.031)    -0.527 (0.059)
Ln(Inc)    0.021 (0.007)     0.038 (0.016)
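Reading a coefficient directly as a percentage change is a first-order approximation that works well for small values. For large dummy-variable coefficients such as Chronic and Excel, the exact proportional change in the mean is exp(β) – 1. The sketch below applies that conversion to the Poisson coefficients in the table; the dictionary and function names are hypothetical.

```python
import math

# Poisson coefficients on dummy variables, as reported in the table.
poisson_coefs = {"Chronic": 0.500, "Excel": -0.784}

def exact_percent_change(beta):
    """Exact percent change in the mean when a dummy switches from 0 to 1."""
    return 100.0 * (math.exp(beta) - 1.0)

exact = {name: exact_percent_change(b) for name, b in poisson_coefs.items()}
for name, b in poisson_coefs.items():
    print(f"{name}: approx {100 * b:+.1f}%, exact {exact[name]:+.1f}%")
```

The gap between the two readings grows with the size of the coefficient, so the distinction matters most for Excel.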

216 Section VIII Duration Data

217 Introduction Sometimes we have data on length of time of a particular event or ‘spells’ –Time until death –Time on unemployment –Time to complete a PhD Techniques we will discuss were originally used to examine lifespan of objects like light bulbs or machines. These models are often referred to as “time to failure”

218 Notation T is a random variable that indicates duration (time til death, find a new job, etc) t is the realization of that variable f(t) is a PDF that describes the process that determines the time to failure CDF is F(t) represents the probability an event will happen by time t

219 F(t) represents the probability that the event happens by ‘t’. What is the probability a person will die on or before the 65 th birthday?

220 Survivor function, what is the chance you live past (t) S(t) = 1 – F(t) If 10% of a cohort dies by their 65 th birthday, 90% will die sometime after their 65 th birthday

221 Hazard function, h(t) What is the probability the spell will end at time t, given that it has already lasted t What is the chance you find a new job in month 12 given that you’ve been unemployed for 12 months already

222 PDF, CDF (Failure function), survivor function and hazard function are all related λ(t) = f(t)/S(t) = f(t)/(1-F(t)), where λ(t) is the hazard function, written h(t) on the previous slide We focus on the ‘hazard’ rate because its relationship to time indicates ‘duration dependence’

223 Example: suppose the longer someone is out of work, the lower the chance they will exit unemployment – ‘damaged goods’ This is an example of duration dependence, the probability of exiting a state of the world is a function of the length

224 Mathematically d λ(t) /dt = 0 then there is no duration dep. d λ(t) /dt > 0 there is + duration dependence the probability the spell will end increases with time d λ(t) /dt < 0 there is – duration dependence the probability the spell will end decreases over time

225 Your choice, is to pick values for f(t) that have +, - or no duration dependence

226 Different Functional Forms Exponential –λ(t)= λ –Hazard is the same over time, a ‘memoryless’ process Weibull –$F(t) = 1 - \exp(-\gamma t^{\rho})$ where ρ,γ > 0 –$\lambda(t) = \rho \gamma t^{\rho-1}$ –if ρ >1, increasing hazard –if ρ <1, decreasing hazard –if ρ =1, exponential
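The Weibull hazard follows from the definition λ(t) = f(t)/(1 – F(t)), and the three duration-dependence cases can be seen by evaluating it at two points in time. A sketch with a hypothetical scale `gamma` and three hypothetical values of ρ:

```python
import math

def weibull_cdf(t, gamma, rho):
    """F(t) = 1 - exp(-gamma * t^rho)."""
    return 1.0 - math.exp(-gamma * t ** rho)

def weibull_pdf(t, gamma, rho):
    """f(t) = dF/dt = gamma * rho * t^(rho-1) * exp(-gamma * t^rho)."""
    return gamma * rho * t ** (rho - 1.0) * math.exp(-gamma * t ** rho)

def weibull_hazard(t, gamma, rho):
    """lambda(t) = f(t) / (1 - F(t)) = rho * gamma * t^(rho-1)."""
    return rho * gamma * t ** (rho - 1.0)

gamma = 0.05   # hypothetical scale parameter
hazards = {}
for rho in (0.7, 1.0, 1.5):   # decreasing, constant, increasing hazard
    hazards[rho] = (weibull_hazard(1.0, gamma, rho), weibull_hazard(10.0, gamma, rho))
    early, late = hazards[rho]
    print(f"rho={rho}: hazard at t=1 is {early:.4f}, at t=10 is {late:.4f}")
```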

227 Others: Lognormal, log-logistic, Gompertz. Little more difficult – can examine when you get comfortable with Weibull

228 A note about most data sets Most data sets have ‘censored’ spells –Follow people over time –All will eventually die, but some do not in your period of analysis –Incomplete spells or censored data Must build into the log likelihood function

229 Let t i be the duration we observe for each person Some people die, and we know they lived until period t i Others are observed for t i periods, but do not die Let d i =1 if the data is a complete spell, d i =0 if incomplete (censored)

230 Recall, that f(s) is the PDF for someone who dies at period s F(t) is the probability you die by t 1-F(t) = the probability you die after (t)

231 If d i =1 then we observe f(t i ), someone who died in period t i If d i =0 then someone lived past period t i and the probability of that is [1-F(t i )] L = Σ i {d i ln[f(t i )] + (1-d i )ln[1-F(t i )]}
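For the exponential case the censored log likelihood has a simple closed-form maximum, which makes it a convenient check on the formula above. The durations and failure indicators below are made-up values (spells censored at 60 months), not the NHIS data.

```python
import math

def exponential_loglik(lam, durations, died):
    """L = sum_i { d_i * ln f(t_i) + (1 - d_i) * ln[1 - F(t_i)] } for the exponential.

    Complete spell:  ln f(t) = ln(lam) - lam * t
    Censored spell:  ln[1 - F(t)] = -lam * t
    """
    total = 0.0
    for t, d in zip(durations, died):
        total += d * (math.log(lam) - lam * t) + (1 - d) * (-lam * t)
    return total

# Hypothetical data: months observed and whether the spell ended in death.
durations = [12, 60, 35, 60, 48, 60, 7, 60]
died = [1, 0, 1, 0, 1, 0, 1, 0]

# Setting dL/dlam = sum(d)/lam - sum(t) = 0 gives the MLE in closed form.
lam_hat = sum(died) / sum(durations)

print(f"lambda_hat = {lam_hat:.5f}")
print(f"log likelihood at lambda_hat = {exponential_loglik(lam_hat, durations, died):.4f}")
```

Dropping the censored spells, or treating them as deaths at month 60, would both overstate the hazard; the (1 – d) term is what keeps the information that those people survived at least that long.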

232 Introducing covariates Look at exponential λ(t)= λ Allow this to vary across people λ i (t)= λ i But like Poisson, λ i is always positive Let λ i = exp(β 0 + x 1i β 1 + x 2i β 2 …. x ki β k )

233 In the Weibull λ(t) = αγt α-1 Allow it to vary across people λ i (t) = αγ i t α-1 γ i = exp(β 0 + x 1i β 1 + x 2i β 2 …. x ki β k )

234 Interpreting Coefficients This is the same for both Weibull and Exponential In Weibull, λ(t i ) = αγ i t α-1 Suppose x 1i is a dummy variable When x i1 =1, then γ i1 = e β0 + β1 + x2i β2 …. xki βk When x i1 =0, then γ i0 = e β0 + x2i β2 …. xki βk

235 When you construct the ratio of γ i1 / γ i0, all the other parameters cancel, so (αγ i1 t α-1 – αγ i0 t α-1 ) / αγ i0 t α-1 = e β1 – 1 Percentage change in the hazard when x 1i turns from 0 to 1. STATA prints out e β1, just subtract 1

236 Suppose x 2i is continuous Suppose we increase x 2i by 1 unit γ i1 = exp(β 0 + x 1i β 1 + x 2i β 2 …. x ki β k ) γ i2 = exp(β 0 + x 1i β 1 + (x 2i +1) β 2 …. x ki β k ) Can show that (αγ i2 t α-1 – αγ i1 t α-1 ) / αγ i1 t α-1 = exp(β 2 ) – 1 Percentage change in the hazard for 1 unit increase in x 2i

237 NHIS Multiple Cause of Death NHIS –annual survey of 60K households –Data on individuals –Self-reported health, DR visits, lost workdays, etc. MCOD –Linked NHIS respondents from to National Death Index through Dec 31, 1995 –Identified whether respondent died and of what cause

238 Our sample –Males, 50-70, who were married at the time of the survey – surveys –Give everyone 5 years (60 months) of followup

239 Key Variables max_mths maximum months in the survey. diedin5 respondent died during the 5 years of followup Note if diedin5=0, then max_mths=60. diedin5 identifies whether the data is censored or not.

240 Variable | Obs Mean Std. Dev. Min Max age_s_yrs | max_mths | black | hispanic | income | educ | diedin5 |

241 Duration Data in STATA Need to identify which is the duration data stset length, failure(failvar) Length=duration variable Failvar=1 when durations end in failure, =0 for censored values If all data is uncensored, omit failure(failvar)

242 In our case stset max_mths, failure(diedin5)

243 Getting Kaplan-Meier Curves Tabular presentation of results sts list Graphical presentation sts graph Results by subgroup sts graph, by(educ)

244

245

246 MLE of duration model with Covariates Basic syntax streg covariates, d(distribution) streg age_s_yrs black hispanic _Ie* _Ii*, d(weibull); In this model, STATA will print out exp(β) If you want the coefficients, add ‘nohr’ option (no hazard ratio)

247 Weibull coefficients No. of subjects = Number of obs = No. of failures = 3245 Time at risk = LR chi2(10) = Log likelihood = Prob > chi2 = _t | Coef. Std. Err. z P>|z| [95% Conf. Interval] age_s_yrs | black | hispanic | _Ieduc_2 | _Ieduc_3 | _Ieduc_4 | _Iincome_2 | _Iincome_3 | _Iincome_4 | _Iincome_5 | _cons | /ln_p | p | /p |

248 The sign of the parameters is informative –Hazard increasing in age –Blacks, hispanics have higher mortality rates –Hazard decreases with income and education The parameter ρ = –Check 95% confidence interval (1.13, 1.21). Can reject null ρ=1 (exponential) –Hazard is increasing over time

249 Hazard ratios No. of subjects = Number of obs = No. of failures = 3245 Time at risk = LR chi2(10) = Log likelihood = Prob > chi2 = _t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] age_s_yrs | black | hispanic | _Ieduc_2 | _Ieduc_3 | _Ieduc_4 | _Iincome_2 | _Iincome_3 | _Iincome_4 | _Iincome_5 | /ln_p | p | /p |

250 Interpret coefficients Age: every year hazard increases by 4.6% Black: have 61% greater hazard than whites Hispanics: 14% greater hazard than non-hispanics Educ 2, 3, 4 are some 9-11, and 16+ years of school

251 Educ 3: those with years of educ have 0.93 – 1 = -0.07 or a 7% lower hazard than those with <9 years of school Educ 4: those with a college degree have 0.88 – 1 = -0.12 or a 12% lower hazard than those with <9 years of school

252 Income 2-5 are dummies for people with $10-$20K, $20-$30K, $30-$40K, >$40K Income 2: Those with $10-$20K have 0.83 – 1 = -0.17 or a 17% lower hazard than those with income <$10K Income 5: those with >$40K in income have 0.58 – 1 = -0.42 or a 42% lower hazard than those with income <$10K

253 Topics not covered Time varying covariates Competing risk models –Die from multiple causes Cox proportional hazard model –Heterogeneity in baseline hazard