Slide 1: Part IB. Descriptive Statistics: Multivariate Statistics
Focus: multiple regression. Spring 2007.
Slide 2: Regression Analysis
Y = f(X): Y is a function of X.
Regression analysis: a method of determining the specific function relating Y to X.
Linear regression: a popular model in social science.
A brief review is offered here; see the ppt files on the course website.
Slide 3: Example: summarize the relationship with a straight line.
Slide 4: Draw a straight line, but how?
Slide 5: Notice that some predictions are not completely accurate.
Slide 6: How to draw the line?
Purpose: draw the regression line that gives the most accurate predictions of y given x.
Criterion for "accurate": sum of (observed y - predicted y)^2 = sum of (prediction errors)^2.
This quantity is called the sum of squared errors, or sum of squared residuals (SSE).
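To make the criterion concrete, here is a minimal sketch (illustrative toy data, not from the slides) that computes the SSE for two candidate lines; the line with the smaller SSE is the better fit by this criterion:

```python
def sse(x, y, a, b):
    """Sum of squared prediction errors for the candidate line y_hat = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Toy data lying exactly on y = 2x + 1
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

print(sse(x, y, a=1.0, b=2.0))  # 0.0  (this line fits the toy data exactly)
print(sse(x, y, a=0.0, b=2.5))  # 3.75 (a worse candidate line)
```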
Slide 7: Ordinary Least Squares (OLS) Regression
The regression line is drawn so as to minimize the sum of the squared vertical distances from the points to the line (i.e., to minimize SSE).
This line minimizes squared prediction error, and it passes through the middle of the point cloud, which makes it a natural summary of the relationship.
Slide 8: Describing a regression line (equation)
Algebraically, a line is described by its intercept and slope.
Notation:
y = the dependent variable
x = the independent variable
y_hat = predicted y, based on the regression line
β = slope of the regression line
α = intercept of the regression line
Slide 9: The meaning of slope and intercept
Slope = the change in y_hat for a one-unit change in x.
Intercept = the value of y_hat when x is 0.
When interpreting the intercept and slope, pay attention to the units of x and y.
Slide 10: General equation of a regression line
y_hat = α + βx, where α and β are chosen to minimize the sum of (observed y - predicted y)^2.
Formulas for the α and β that minimize this sum are programmed into statistical packages and calculators.
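The minimizing values have closed-form solutions: b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and a = ȳ - b·x̄. A minimal sketch (illustrative, not from the slides) computing them directly:

```python
def ols_fit(x, y):
    """Return (intercept a, slope b) minimizing the sum of squared errors."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b

# Toy data lying exactly on y = 2x + 1
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
a, b = ols_fit(x, y)
print(a, b)  # 1.0 2.0
```

Note the fitted line always passes through the point (x_bar, y_bar), which is one way to see why it runs through the middle of the point cloud.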
Slide 11: An example of a regression line (figure).
Slide 12: Fit: how much can regression explain?
Look at the regression equation again:
y_hat = α + βx
y = α + βx + ε
Data = what we explain + what we don't explain
Data = predicted + residual
Slide 13: In regression, we can think of "fit" this way:
Total variation = total sum of squares of y
Explained variation = the part of the total variation explained by our predictions
Unexplained variation = sum of squares of the residuals
R^2 = (explained variation) / (total variation): the coefficient of determination, i.e., the share of the total variation in y that the regression explains.
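A sketch (illustrative toy data, not from the slides) of computing R^2 from these sums of squares, given a fitted line y_hat = a + b·x:

```python
def r_squared(x, y, a, b):
    """R^2 = 1 - (unexplained variation) / (total variation)."""
    y_bar = sum(y) / len(y)
    ss_total = sum((yi - y_bar) ** 2 for yi in y)                       # total variation
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))    # unexplained variation
    return 1 - ss_resid / ss_total

# Toy data; a = 0.6, b = 2.2 are the OLS estimates for these points
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 12]
r2 = r_squared(x, y, a=0.6, b=2.2)
print(round(r2, 4))  # 0.9918
```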
Slide 14: R^2 = r^2
Note: this is a special feature of simple (OLS) regression; it does not hold for multiple regression or other regression methods.
Slide 15: Some cautions about regression and R^2
It is dangerous to use R^2 to judge how "good" a regression is; the appropriateness of regression is not a function of R^2.
When to use regression?
- Not suitable for non-linear shapes (though non-linear shapes can sometimes be transformed).
- Regression is appropriate when r (correlation) is appropriate as a measure.
Slide 16: Supplement: Proportional Reduction of Error (PRE)
PRE measures compare the errors of predictions under different prediction rules, contrasting a naive rule with a sophisticated one.
R^2 is a PRE measure:
- Naive rule: predict y_bar.
- Sophisticated rule: predict y_hat.
R^2 measures the reduction in predictive error from using the regression predictions rather than predicting the mean of y.
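To make the PRE idea concrete, a small sketch (illustrative toy data, not from the slides): E1 is the squared error from predicting y_bar for everyone, E2 is the squared error from the regression predictions, and PRE = (E1 - E2) / E1, which equals R^2:

```python
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 12]
a, b = 0.6, 2.2   # OLS estimates for these points

y_bar = sum(y) / len(y)
e1 = sum((yi - y_bar) ** 2 for yi in y)                      # errors under the naive rule
e2 = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))   # errors under the regression rule
pre = (e1 - e2) / e1
print(round(pre, 4))  # 0.9918, identical to R^2 for this fit
```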
Slide 17: Cautions about correlation and regression
Extrapolation is not appropriate.
In regression, pay attention to lurking or omitted variables: variables that influence the relationship between the two variables studied but are not included among them; they are a problem in establishing causation.
Association does not imply causation:
- Association alone is weak evidence about causation.
- Experiments with random assignment are the best way to establish causation.
Slide 18: Inference for Simple Regression
Slide 19: Regression Equation
Equation of a regression line:
y_hat = α + βx
y = α + βx + ε
y = dependent variable
x = independent variable
β = slope = predicted change in y with a one-unit change in x
α = intercept = predicted value of y when x is 0
y_hat = predicted value of the dependent variable
Slide 20: Global test (F test): tests whether the regression equation has any explanatory power (H0: β = 0).
Slide 22: The regression model
Note: the slope and intercept of the fitted regression line are statistics (computed from the sample data). To do inference, we treat them as estimates of the unknown population parameters α and β.
Slide 23: Inference for regression
Population regression line: μ_y = α + βx
Estimated from the sample: y_hat = a + bx
b is an unbiased estimator of the true slope β, and a is an unbiased estimator of the true intercept α.
Slide 25: Sampling distribution of a (intercept) and b (slope)
The mean of the sampling distribution of a is α; the mean of the sampling distribution of b is β.
The standard errors of a and b are related to the amount of spread about the regression line (σ).
The sampling distributions are normal; when σ is estimated, use the t distribution for inference.
Slide 26: The standard error of the least-squares line
Estimate σ (the spread about the regression line) using the residuals from the regression; recall that residual = y - y_hat.
The population standard deviation about the regression line (σ) is estimated from these sample residuals.
Slide 27: Estimating σ from sample data
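The formula on this slide did not survive extraction; the standard estimate, consistent with the surrounding slides (residuals in the numerator, n - 2 degrees of freedom because two parameters, a and b, are estimated), is:

```latex
s = \sqrt{\frac{\sum_i (y_i - \hat{y}_i)^2}{n - 2}} = \sqrt{\frac{\mathrm{SSE}}{n - 2}}
```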
Slide 28: Standard Error of the Slope (b)
The sampling variability of the slope is summarized by its standard error, SE_b = s / sqrt(Σ(x_i - x_bar)^2).
A small standard error of b means our estimate b is a precise estimate of β.
SE_b is directly related to s, and inversely related to the sample size n and to S_x (the spread of x).
Slide 29: Confidence interval for the regression slope
A level-C confidence interval for the slope β of the "true" regression line is b ± t* SE_b, where t* is the upper (1 - C)/2 critical value of the t distribution with n - 2 degrees of freedom.
To test the hypothesis H0: β = 0, compute the t statistic t = b / SE_b, which under H0 follows a t distribution with n - 2 degrees of freedom.
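A sketch pulling the last few slides together (illustrative toy data, stdlib only, not from the slides): estimate s, then SE_b, then the t statistic and a 95% confidence interval for β. Python's standard library has no t-quantile function, so the critical value t*(df = 3, two-sided 95%) ≈ 3.182 is hard-coded here:

```python
import math

x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 12]
a, b = 0.6, 2.2                      # OLS estimates for these points
n = len(x)
x_bar = sum(x) / n

# s: estimated spread about the regression line (df = n - 2)
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

# Standard error of the slope: SE_b = s / sqrt(sum of (x - x_bar)^2)
sxx = sum((xi - x_bar) ** 2 for xi in x)
se_b = s / math.sqrt(sxx)

# t statistic for H0: beta = 0, and a 95% CI for beta
t_stat = b / se_b
t_star = 3.182                       # t critical value, df = 3, two-sided 95%
ci = (b - t_star * se_b, b + t_star * se_b)
print(round(t_stat, 2), [round(v, 3) for v in ci])
```

With only 5 points the interval is based on 3 degrees of freedom, which is why the hard-coded critical value is so much larger than the familiar 1.96.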
Slide 30: Significance tests for the slope
Test hypotheses about the slope β. Usually H0: β = 0 (no linear relationship between the independent and dependent variables).
Alternatives: HA: β > 0, HA: β < 0, or HA: β ≠ 0.
Slide 32: Statistical inference for the intercept
We can also do statistical inference for the regression intercept α.
Possible hypotheses: H0: α = 0; HA: α ≠ 0.
The t test, based on a, is very similar to the prior t tests we have done.
For most substantive applications we are interested in the slope (β); α is usually not of interest.
Slide 33: Example: SPSS regression procedures and output
To get a scatterplot: Graphs → Scatter → Simple → Define (choose x and y).
To get a correlation coefficient: Analyze → Correlate → Bivariate.
To perform a simple regression: Analyze → Regression → Linear (choose x and y; you can also choose to save the predicted values and residuals).
Slide 34: SPSS example: infant mortality vs. female literacy, 1995 UN data
Slide 35: Example: correlation between infant mortality and female literacy
Slide 36: Regression: infant mortality vs. female literacy, 1995 UN data
Slide 37: Regression: infant mortality vs. female literacy, 1995 UN data (continued)
38
38 Hypothesis test example 大華正在分析教育成就的世代差異,他蒐集到 117 組父子教 育程度的資料。父親的教育程度是自變項,兒子的教育 程度是依變項。他的迴歸公式是: y_hat = 0.2915*x +10.25 迴歸斜率的標準誤差 (standard error) 是 : 0.10 1. 在 α=0.05 ,大華可得出父親與兒子的教育程度是有關連 的嗎? 2. 對所有父親的教育程度是大學畢業的男孩而言,這些男 孩的平均教育程度預測值是多少? 3. 有一男孩的父親教育程度是大學畢業,預測這男孩將來 的教育程度會是多少?