Inference for Simple Regression Social Research Methods 2109 & 6507 Spring 2006 March 15, 16, 2006
Regression Equation Equation of a regression line: (y_hat) = α +βx y = α +βx + ε y = dependent variable x = independent variable β = slope = predicted change in y with a one unit change in x α= intercept = predicted value of y when x is 0 y_hat = predicted value of dependent variable
補充 : Proportional Reduction of Error (PRE)( 消減錯誤的比例 ) PRE measures compare the errors of predictions under different prediction rules; contrasts a naïve to sophisticated rule R 2 is a PRE measure Naïve rule = predict y_bar Sophisticated rule = predict y_hat R 2 measures reduction in predictive error from using regression predictions as contrasted to predicting the mean of y
Example: SPSS Regression Procedures and Output To get a scatterplot (): 統計圖 (G) → 散佈圖 (S) → 簡單 → 定義(選 x 及 y ) To get a correlation coefficient: 分析 (A) → 相關 (C) → 雙變量 To perform simple regression 分析 (A) → 回歸方法 (R) → 線性 (L) (選 x 及 y ) (還可選擇儲存預測值及殘差)
SPSS Example: Infant mortality vs. Female Literacy, 1995 UN Data
Example: correlation between infant mortality and female literacy
Regression: infant mortality vs. female literacy, 1995 UN Data
Diagnosis: a residual plot
Global test--F 檢定 : 檢定迴歸方程式 有無解釋能力 ( β= 0 )
The regression model ( 迴歸模型 ) Note: the slope and intercept of the regression line are statistics (i.e., from the sample data). To do inference, we have to think of α and β as estimates of unknown parameters.
Regression as conditional means Ways to think about regression: 1.Straight-line description of association 2.Prediction 3.Conditional means ( 條件平均數 ) Conditional mean: a mean computed conditional on the value of another variable Regression line predicts the conditional mean of y given x
Assumptions for regression inference Think about there as being a population or “true” regression line Assumptions: For any fixed value of x, the response (y) varies according to a normal distribution. Repeated responses y are independent of each other. μ y = α +βx (means of y conditional on x fall in a straight line) The standard deviation of y (call it σ) for each value of x is the same. The value of σ is unknown.
“True” regression line
Inference for regression Population regression line: μ y = α +βx estimated from sample: (y_hat) = a + bx b is an unbiased estimator ( 不偏估計式 )of the true slope β, and a is an unbiased estimator of the true intercept α
Sampling distribution of a (intercept) and b (slope) Mean of the sampling distribution of a is α Mean of the sampling distribution of b is β
Sampling distribution of a (intercept) and b (slope) Mean of the sampling distribution of a is α Mean of the sampling distribution of b is β The standard error of a and b are related to the amount of spread about the regression line (σ) Normal sampling distributions; with σ estimated use t-distribution for inference
The standard error of the least-squares line Estimate σ (spread about the regression line using residuals from the regression) recall that residual = (y –y_hat) Estimate the population standard deviation about the regression line (σ) using the sample estimates
Estimate σ from sample data
Standard Error of Slope (b) The standard error of the slope has a sampling distribution given by: Small standard errors of b means our estimate of b is a precise estimate of SE b is directly related to s; inversely related to sample size (n) and S x
Confidence Interval for regression slope A level C confidence interval for the slope of “true” regression line β is b ± t * SE b Where t* is the upper (1-C)/2 critical value from the t distribution with n-2 degrees of freedom To test the hypothesis H 0 : β= 0, compute the t statistic: t = b/ SE b In terms of a random variable having the t,n-2 distribution
Significance Tests for the slope Test hypotheses about the slope of β. Usually: H 0 : β= 0 (no linear relationship between the independent and dependent variable) Alternatives: H A : β > 0 or H A : β < 0 or H A : β ≠ 0
Statistical inference for intercept We could also do statistical inference for the regression intercept, α Possible hypotheses: H 0 : α = 0 H A : α≠ 0 t-test based on a, very similar to prior t-tests we have done For most substantive applications, interested in slope (β), not usually interested in α
Regression: infant mortality vs. female literacy, 1995 UN Data
Hypothesis test example 大華正在分析教育成就的世代差異,他蒐集到 117 組父子教 育程度的資料。父親的教育程度是自變項,兒子的教育 程度是依變項。他的迴歸公式是: y_hat = *x 迴歸斜率的標準誤差 (standard error) 是 : 在 α=0.05 ,大華可得出父親與兒子的教育程度是有關連 的嗎? 2. 對所有父親的教育程度是大學畢業的男孩而言,這些男 孩的平均教育程度預測值是多少? 3. 有一男孩的父親教育程度是大學畢業,預測這男孩將來 的教育程度會是多少?