Sociology 601 Class 19: November 3, 2008
Review of correlation and standardized coefficients
Statistical inference for the slope (9.5)
Violations of model assumptions, and their effects (9.6)
9.5 Inference for a slope
Problem: we have measures of the strength of the linear association between two variables, but no measure of the statistical significance of that association.
We know the slope and intercept for our sample; what can we say about the slope and intercept for the population?
Solution: hypothesis tests and confidence intervals for a slope. Both need a standard error for the coefficients.
Difficulties: additional assumptions, and complications in estimating a standard error for a slope.
Assumptions Needed to Make Population Inferences for Slopes
The sample is selected randomly.
X and Y are interval scale variables.
The mean of Y is related to X by the linear equation E{Y} = α + βX.
The conditional standard deviation of Y is identical at each X value (no heteroscedasticity).
The conditional distribution of Y at each value of X is normal.
There is no error in the measurement of X.
Common Ways to Violate These Assumptions
The sample is selected randomly.
  o Cluster sampling (e.g., census tracts / neighborhoods) makes observations within a cluster more similar to each other than to observations outside the cluster.
  o Two or more siblings in the same family.
  o Sample = population (e.g., states in the U.S.)
X and Y are interval scale variables.
  o Ordinal scale attitude measures
  o Nominal scale categories (e.g., race/ethnicity, religion)
Common Ways to Violate These Assumptions (2)
The mean of Y is related to X by the linear equation E{Y} = α + βX.
  o U-shape: e.g., Kuznets inverted-U curve (inequality <- GDP/capita)
  o Thresholds
  o Logarithmic (e.g., earnings <- education)
The conditional standard deviation of Y is identical at each X value (no heteroscedasticity).
  o earnings <- education
  o hours worked <- time
  o adult child occupational status <- parental occupational status
Common Ways to Violate These Assumptions (3)
The conditional distribution of Y at each value of X is normal.
  o earnings (skewed) <- education
  o Y is binary, or a percentage
There is no error in the measurement of X.
  o almost everything
  o what is the effect of measurement error in X on b?
The Null Hypothesis for Slopes
Null hypothesis: the variables are statistically independent.
H0: β = 0. The null hypothesis is that there is no linear relationship between X and Y.
Implication for α: E{Y} = α + 0*X = α, so under H0 the intercept equals the mean of Y.
(Draw figure of distribution of Y, X when H0 is true)
Test Statistic for Slopes
What is the range of b's we would get if we took repeated samples from a population and calculated b for each of those samples? That is, what is the standard error of the sample slope b?
Test statistic: t = b / σ̂_b
  o where σ̂_b is the standard error of the sample slope b.
  o df for the t statistic (with one x-variable) is n-2
  o when n is large, the t statistic is asymptotically equivalent to a z-statistic
What would make σ̂_b smaller?
Calculating the s.e. of b
σ̂_b = σ̂ / (s_X * sqrt(n-1)), where σ̂ = sqrt(SSE/(n-2)) (= root MSE)
The standard error of b is smaller when…
  o the sample size is large
  o the standard deviation of X is large (there is a wide range of X values)
  o the conditional standard deviation of Y is small.
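The formula can be checked numerically. The sketch below is only an illustration: the function names are made up here, the first call uses the figures quoted in the poverty and murder example (SSE = 3904.3, n = 51, s_X = 4.585), and the three variations are hypothetical values chosen to show each of the three effects.

```python
import math

def root_mse(sse, n):
    """Estimated conditional s.d. of Y: sqrt(SSE / (n - 2))."""
    return math.sqrt(sse / (n - 2))

def slope_standard_error(sigma_hat, s_x, n):
    """Standard error of b: sigma_hat / (s_x * sqrt(n - 1))."""
    return sigma_hat / (s_x * math.sqrt(n - 1))

# Figures quoted in the poverty and murder example.
sigma_hat = root_mse(sse=3904.3, n=51)
se_b = slope_standard_error(sigma_hat, s_x=4.585, n=51)
print(round(se_b, 3))

# Hypothetical variations, changing one ingredient at a time.
se_larger_n = slope_standard_error(sigma_hat, s_x=4.585, n=201)      # more cases
se_wider_x = slope_standard_error(sigma_hat, s_x=2 * 4.585, n=51)    # wider spread of X
se_less_noise = slope_standard_error(sigma_hat / 2, s_x=4.585, n=51) # smaller conditional s.d.
print(se_larger_n < se_b, se_wider_x < se_b, se_less_noise < se_b)
```

Each variation gives a smaller standard error than the baseline, which matches the three bullet points.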
Conclusions about Population
P-value: calculated as in any t-test, but remember df = n-2
  o a z-test is appropriate when n > 30 or so
Conclusions: evaluate the p-value against a previously selected alpha level.
Rule of thumb: b should be at least twice its standard error.
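A minimal sketch of this decision step, using the large-sample z approximation mentioned above rather than an exact t distribution. The t value 4.81 comes from the poverty and murder example (n = 51); the alpha of 0.05 and the function name are assumptions made for illustration.

```python
from statistics import NormalDist

def two_sided_p_from_z(test_stat):
    """Two-sided p-value, treating the statistic as standard normal (large n)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(test_stat)))

t_stat = 4.81   # from the poverty and murder example
alpha = 0.05    # assumed alpha level, chosen before looking at the data

p_value = two_sided_p_from_z(t_stat)
reject_null = p_value < alpha
passes_rule_of_thumb = abs(t_stat) >= 2  # b at least twice its standard error

print(p_value, reject_null, passes_rule_of_thumb)
```

With small samples the exact t distribution with n-2 degrees of freedom gives a somewhat larger p-value than this approximation.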
Example of Inference about a Slope
In an analysis of poverty and crime in the 50 states plus DC, a computer output provides the following:
E{Murder rate} = … + …*{Poverty rate}
(Poverty rate in %, murder rate per 100,000)
SSE = 3904.3, SST = …, n = 51, s_x = 4.58
Do a hypothesis test to determine whether there is a linear relationship between crime rates and poverty rates.
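Stata Example of Inference about a Slope
In an analysis of poverty and crime in the 50 states plus DC, Stata output provides the following:

. regress murder poverty

      Source |       SS       df       MS         Number of obs =
-------------+------------------------------      F(  1,    49) =
       Model |                                    Prob > F      =
    Residual |                                    R-squared     =
-------------+------------------------------      Adj R-squared =
       Total |                                    Root MSE      =

------------------------------------------------------------------------------
      murder |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     poverty |
       _cons |
------------------------------------------------------------------------------

Interpret whether there is a linear relationship between crime rates and poverty rates.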
Example of Inference about a Slope
SSE = 3904.3, SST = …, n = 51, s_x = 4.585, b = …
se_b = sqrt(SSE / (n-2)) / (s_x * sqrt(n-1))
     = sqrt(3904.3 / 49) / (4.585 * sqrt(50))
     = sqrt(79.68) / (4.585 * 7.071)
     = 8.926 / 32.42 = 0.275
t = b / se_b = … / 0.275 = 4.81
p < …
95% confidence interval for b = … to …
Confidence Interval for a Slope
Confidence interval for a slope: c.i. = b ± t * σ̂_b
The standard t-score for a 95% confidence interval is t.025, with df = n-2.
An alternative to a confidence interval is to report both b and σ̂_b.
Example of Confidence Interval of a Slope
SSE = 3904.3, SST = …, n = 51, s_x = 4.58, b = …, se_b = 0.275
95% confidence interval for b = b ± t.025 * 0.275 = … to …
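A rough sketch of the interval arithmetic for this example. Only se_b = 0.275 and t = 4.81 are taken from the example. The slope b is an assumption, back-calculated here as t * se_b (about 1.32), and the critical value 2.01 is an assumption read from a standard t table for df = 49. The function name is also made up for illustration.

```python
def slope_confidence_interval(b, se_b, t_crit):
    """Return (lower, upper) for b ± t_crit * se_b."""
    margin = t_crit * se_b
    return b - margin, b + margin

se_b = 0.275        # standard error of the slope, from the example
t_stat = 4.81       # test statistic, from the example
b = t_stat * se_b   # ASSUMPTION: slope back-calculated from t and se_b
t_crit = 2.01       # ASSUMPTION: t.025 for df = 49, from a standard t table

lower, upper = slope_confidence_interval(b, se_b, t_crit)
print(round(lower, 2), round(upper, 2))
```

Under these assumptions the interval excludes zero, which agrees with rejecting H0 at the 5% level.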
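Inference for a Slope Using Stata

. regress attend regul

      Source |       SS       df       MS         Number of obs =
-------------+------------------------------      F(  1,    16) =    9.65
       Model |                                    Prob > F      =
    Residual |                                    R-squared     =
-------------+------------------------------      Adj R-squared =
       Total |                                    Root MSE      =

------------------------------------------------------------------------------
      attend |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       regul |
       _cons |
------------------------------------------------------------------------------

The significance test and confidence interval for b appear on the line with the name of the x-variable.
Can you find SSE and SST? df for the model? r?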
Inferences for Correlation
Inferences for a Pearson correlation:
The t-score for r is the same as the t-score for b.
We don’t focus on inferences for correlation in this class.
Things to watch out for: extrapolation
Extrapolation beyond observed values of X is dangerous.
The pattern may be nonlinear.
Even if the pattern is linear, the standard error of a predicted value grows as X moves farther from the observed data.
Be especially careful interpreting the Y-intercept: it may lie outside the observed data.
  o e.g., year zero
  o e.g., zero education in the U.S.
  o e.g., zero parity
Things to watch out for: outliers
Influential observations and outliers may unduly influence the fit of the model.
Both the slope and the standard error of the slope may be affected by influential observations.
This is an inherent weakness of least squares regression.
You may wish to evaluate two models: one with and one without the influential observations.
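The two-model comparison can be sketched as follows. The data are hypothetical, made up to show how a single influential point can change a least squares slope, and the helper name is not from any library.

```python
def ols_slope(x, y):
    """Least squares slope: sum of cross-products over sum of squares of x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    return s_xy / s_xx

# Hypothetical data: five points lying close to the line y = x.
x = [1, 2, 3, 4, 5]
y = [1.1, 1.9, 3.2, 3.9, 5.1]

# One hypothetical influential observation, far out on X and far off the line.
x_with = x + [10]
y_with = y + [0.0]

slope_without = ols_slope(x, y)
slope_with = ols_slope(x_with, y_with)
print(slope_without, slope_with)
```

In this made-up case one added point is enough to turn a clearly positive slope into a negative one, which is why fitting both models is worth the effort.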
Things to watch out for: outlier example
Example: discussion between Kahn and Udry 1986 (American Sociological Review 51(5)) and Jasso 1986 (ASR 51(5)).
Topic: time and age trends in marital coital frequency
Issues: outliers, sample truncation, nonlinear effects.
Things to watch out for: truncated samples
Truncated samples cause the opposite problem from influential observations and outliers.
Truncation on the X axis reduces the correlation coefficient for the remaining data.
Truncation on the Y axis is a worse problem, because it violates the assumption of normally distributed errors.
Examples: top-coded income data; health as measured by number of days spent in a hospital in a year.
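A small simulation can illustrate the effect of truncating on X. Everything here is an assumption made for illustration: the simulated model (Y = X + normal error), the cutoff |X| < 0.5, the seed, and the helper name.

```python
import random

def pearson_r(x, y):
    """Pearson correlation computed from sums of squares and cross-products."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / (s_xx * s_yy) ** 0.5

random.seed(19)  # fixed seed so the run is repeatable
x = [random.gauss(0, 1) for _ in range(5000)]
y = [xi + random.gauss(0, 1) for xi in x]  # hypothetical linear model

r_full = pearson_r(x, y)

# Truncate on X: keep only cases in a narrow band of X values.
kept = [(xi, yi) for xi, yi in zip(x, y) if abs(xi) < 0.5]
r_truncated = pearson_r([p[0] for p in kept], [p[1] for p in kept])

print(round(r_full, 2), round(r_truncated, 2), len(kept))
```

The underlying relationship is the same in both samples; only the range of X differs, and the correlation in the truncated sample comes out noticeably lower.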
Things to watch out for: measurement error
Error in measurement of the X variable creates a bias that makes the correlation appear weaker.
This problem can be a measurement issue or an interpretation issue.
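The same kind of bias shows up in the slope, and a simulation makes it visible. This is a sketch under stated assumptions: a hypothetical true slope of 2, random measurement error in X with the same variance as X itself, a fixed seed, and a made-up helper name.

```python
import random

def ols_slope(x, y):
    """Least squares slope: sum of cross-products over sum of squares of x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    return s_xy / s_xx

random.seed(601)  # fixed seed so the run is repeatable
n = 20000
true_x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * xi + random.gauss(0, 1) for xi in true_x]  # hypothetical true slope = 2

# Observed X = true X plus random measurement error of equal variance.
noisy_x = [xi + random.gauss(0, 1) for xi in true_x]

slope_clean = ols_slope(true_x, y)
slope_noisy = ols_slope(noisy_x, y)
print(round(slope_clean, 2), round(slope_noisy, 2))
```

With error variance equal to the variance of X, the estimated slope falls to roughly half the true value, so the relationship looks weaker than it is.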