Reading – Linear Regression Le (Chapter 8 through 8.1.6) C &S (Chapter 5:F,G,H)

Issues with hypothesis testing
Significance does not imply causality
–Need a proper prospective experiment
Significance does not imply practical importance
–Differences can be trivial but significant
Run lots of tests and some will be significant by chance
–With α = 0.05, expect 1 in 20 tests to be significant by chance when no real difference exists
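The 1-in-20 figure can be checked with a little arithmetic. The sketch below is written in Python rather than the SAS used later in these slides. It assumes the tests are independent and that every null hypothesis is true; the function names are illustrative only.

```python
def expected_false_positives(num_tests, alpha=0.05):
    """Expected number of significant results when every null hypothesis is true."""
    return num_tests * alpha


def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """Chance of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    for k in (1, 5, 20, 100):
        print(k, expected_false_positives(k),
              round(prob_at_least_one_false_positive(k), 3))
```

With 20 independent tests at α = 0.05, about one false positive is expected, and the chance of at least one is roughly 64%.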

Issues with hypothesis testing
Large p-values may occur because the sample size is small
–An effect could exist but the sample may not be large enough to detect it
Outliers may cause problems, especially in small samples

Issues With Hypothesis Testing
What is the population of inference?
Example: A statistics class of n = 15 women and n = 5 men yields the following exam scores:
Women: mean = 90%, SD = 10%
Men: mean = 85%, SD = 11%
Test the hypothesis that women did better on the exam than men.
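One way to examine this example is a pooled two-sample t test computed from the summary statistics. This is a sketch, not necessarily the analysis intended on the slide. It assumes equal variances in the two groups and treats the class as a random sample from some larger population, which is exactly what the slide questions. The one-sided critical value 1.734 for 18 degrees of freedom is taken from a standard t table.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled two-sample t statistic and degrees of freedom from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_err, df


# Women: mean 90, SD 10, n 15; men: mean 85, SD 11, n 5 (from the slide).
t_stat, df = pooled_t(90, 10, 15, 85, 11, 5)

# Assumed one-sided 5% critical value for 18 df, read from a t table.
T_CRIT_ONE_SIDED_18DF = 1.734

print(f"t = {t_stat:.3f} on {df} df")
print("significant at 0.05 (one-sided):", t_stat > T_CRIT_ONE_SIDED_18DF)
```

The statistic is about 0.95, well below the critical value, so a 5-point difference is not significant with groups this small.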

If the 95% CI for a difference excludes 0, then the two-sided p-value will be < 0.05.
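The link between the interval and the test can be sketched in a few lines. The numbers are illustrative: a difference of 5 points with a standard error of about 5.28 on 18 degrees of freedom, which is roughly what a pooled calculation gives for the class example if equal variances are assumed. The two-sided critical value 2.101 is taken from a t table, and the function names are hypothetical.

```python
def confidence_interval(estimate, std_err, t_crit):
    """Two-sided confidence interval: estimate +/- t_crit * std_err."""
    margin = t_crit * std_err
    return estimate - margin, estimate + margin


def rejects_zero(estimate, std_err, t_crit):
    """Two-sided t test of 'true value is 0' at the level matching t_crit."""
    return abs(estimate / std_err) > t_crit


T_CRIT_18DF = 2.101  # assumed: two-sided 5% critical value for 18 df, from a t table

# Difference of 5 points with a pooled standard error of about 5.28.
low, high = confidence_interval(5.0, 5.28, T_CRIT_18DF)
print(f"95% CI: ({low:.2f}, {high:.2f})")
print("CI excludes 0:", low > 0 or high < 0)
print("test rejects 0:", rejects_zero(5.0, 5.28, T_CRIT_18DF))
```

Here the interval contains 0 and the test does not reject, so the two statements agree.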

Linear Regression
Investigate the relationship between two variables
–Does blood pressure relate to age?
–Does weight loss relate to blood pressure loss?
–Does income relate to education?
–Do sales relate to years of experience?
Dependent variable
–The variable that is being predicted or explained
Independent variable
–The variable that is doing the predicting or explaining
Think of data in pairs $(x_i, y_i)$

Linear Regression - Purpose
Is there an association between the two variables?
–Does weight change relate to BP change?
Estimation of impact
–How much BP change occurs per pound of weight change?
Prediction
–If a person loses 10 pounds, how much of a drop in blood pressure can be expected?

Regression History
Sir Francis Galton (1822-1911) studied the relationship between a father’s height and his son’s height. He found that although the two heights were related, the relationship was not perfect. If the father was above average in height, so (typically) was the son, but not by as much. This was called regression to the mean.

Example of Regression Equation
We know systolic BP increases with age. How much does it increase per year, and is the increase constant over time?
SBP = 90 + 0.8*AGE
Interpretation: For each additional year of age, SBP increases by 0.8 mmHg.
At age 50: SBP = 90 + 0.8*50 = 130 mmHg
At age 60: SBP = 90 + 0.8*60 = 138 mmHg
SBP: Y or Dependent Variable
AGE: X or Independent Variable
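The two predictions can be reproduced with a short sketch. The function name and default arguments are illustrative; the intercept and slope come from the example equation.

```python
def predicted_sbp(age, intercept=90.0, slope=0.8):
    """Predicted systolic BP (mmHg) from the example equation SBP = 90 + 0.8*AGE."""
    return intercept + slope * age


for age in (50, 60):
    print(age, round(predicted_sbp(age), 1))

# Ten years of age correspond to 10 * 0.8 = 8 mmHg, whatever the starting age.
print(round(predicted_sbp(60) - predicted_sbp(50), 1))
```

Because the equation is a straight line, the increase per year is the same at every age. Whether that holds in real data is the question the slide raises.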

Simple Linear Regression Equation
The simple linear regression equation is:
$\mu_y = \beta_0 + \beta_1 x$
The graph of the regression equation is a straight line.
$\beta_0$ is the y intercept of the regression line.
$\beta_1$ is the slope of the regression line.
$\mu_y$ is the mean value of y for a given x value (labeled E(y) on the graphs).

Simple Linear Regression Model
The equation that describes how y is related to x and an error term is called the regression model.
The simple linear regression model is:
$y = \beta_0 + \beta_1 x + \varepsilon$
$\beta_0$ and $\beta_1$ are called parameters of the model.
$\varepsilon$ is a random variable called the error term.
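To see what the error term does, the sketch below simulates data from the model. The parameter values (borrowed from the blood pressure example), the normal distribution for the errors, and its standard deviation are all assumptions made for illustration; the slides do not specify them.

```python
import random


def simulate_y(x_values, beta0, beta1, error_sd, seed=0):
    """Draw y = beta0 + beta1*x + error, with error ~ Normal(0, error_sd)."""
    rng = random.Random(seed)
    return [beta0 + beta1 * x + rng.gauss(0.0, error_sd) for x in x_values]


ages = [30, 40, 50, 60, 70]
mean_line = [90 + 0.8 * a for a in ages]            # mean of y at each x
observed = simulate_y(ages, 90, 0.8, error_sd=10)   # mean plus random error

for a, m, y in zip(ages, mean_line, observed):
    print(a, round(m, 1), round(y, 1))
```

The simulated values scatter around the line, and setting `error_sd` to 0 puts every point exactly on it.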

Simple Linear Regression Equation: Positive Linear Relationship
[Figure: regression line of E(y) against x, with intercept $\beta_0$; the slope $\beta_1$ is positive]

Simple Linear Regression Equation: Negative Linear Relationship
[Figure: regression line of E(y) against x, with intercept $\beta_0$; the slope $\beta_1$ is negative]

Simple Linear Regression Equation: No Relationship
[Figure: regression line of E(y) against x, with intercept $\beta_0$; the slope $\beta_1$ is 0]

Estimated Simple Linear Regression Equation
The estimated simple linear regression equation is:
$\hat{y} = b_0 + b_1 x$
The graph is called the estimated regression line.
$b_0$ is the y intercept of the line.
$b_1$ is the slope of the line.
$\hat{y}$ is the estimated value of y for a given x value.

Estimation Process
Regression Model: $y = \beta_0 + \beta_1 x + \varepsilon$
Regression Equation: $\mu_y = \beta_0 + \beta_1 x$
Unknown Parameters: $\beta_0$, $\beta_1$
Sample Data: pairs $(x_1, y_1), \ldots, (x_n, y_n)$
Estimated Regression Equation: $\hat{y} = b_0 + b_1 x$
Sample Statistics: $b_0$, $b_1$
$b_0$ and $b_1$ provide estimates of $\beta_0$ and $\beta_1$

Least Squares Method
Least Squares Criterion: choose $\beta_0$ and $\beta_1$ to minimize
$S = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$
where:
$y_i$ = observed value of the dependent variable for the ith observation
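The criterion can be tried out numerically. The sketch below evaluates S for a few hand-picked candidate lines, using the restaurant data from the example later in these slides. The candidates other than (60, 5) are arbitrary choices for comparison, and the function name is illustrative.

```python
# Student population (thousands) and quarterly sales for the 10 restaurants.
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]


def sum_of_squares(b0, b1):
    """S = sum of squared vertical distances from the points to the line b0 + b1*x."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))


# (60, 5) is the least squares line for these data; the rest are arbitrary.
candidates = [(60, 5), (70, 4), (50, 6), (60, 4.5), (65, 5)]
for b0, b1 in candidates:
    print(b0, b1, sum_of_squares(b0, b1))
```

Every other candidate gives a larger S than the line with intercept 60 and slope 5.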

Estimation

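The Least Squares Estimates
Slope: $b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
Intercept: $b_0 = \bar{y} - b_1 \bar{x}$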

Example
Restaurant   Student Population (Thousands)   Quarterly Sales
1            2                                58
2            6                                105
3            8                                88
4            8                                118
5            12                               117
6            16                               137
7            20                               157
8            20                               169
9            22                               149
10           26                               202

X-Y PLOT OF DATA

Calculations
Obs   Xi    Yi     Xi-XBAR   Yi-YBAR   (Xi-XBAR)*(Yi-YBAR)   (Xi-XBAR)^2
1     2     58     -12       -72       864                   144
2     6     105    -8        -25       200                   64
3     8     88     -6        -42       252                   36
4     8     118    -6        -12       72                    36
5     12    117    -2        -13       26                    4
6     16    137    2         7         14                    4
7     20    157    6         27        162                   36
8     20    169    6         39        234                   36
9     22    149    8         19        152                   64
10    26    202    12        72        864                   144
Tot   140   1300                       2840                  568

Estimates for Dataset
$b_1$ = 2840/568 = 5
$b_0$ = 130 - 5*14 = 60
Y = Sales; X = # thousands of students
Equation: Y = 60 + 5*X
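The hand calculation can be checked with a short sketch that applies the same sums of deviations to the restaurant data. The function and variable names are illustrative.

```python
student_pop = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]              # X, thousands
quarterly_sales = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]  # Y


def least_squares(x, y):
    """Return (b0, b1) for the line y = b0 + b1*x fitted by least squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares(student_pop, quarterly_sales)
print(f"Y = {b0:.1f} + {b1:.1f}*X")
```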

DATA sales;
  INFILE DATALINES;
  INPUT restaurant studentpop quarsales;
  DATALINES;
1 2 58
2 6 105
3 8 88
4 8 118
5 12 117
6 16 137
7 20 157
8 20 169
9 22 149
10 26 202
;

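PROC PRINT DATA=sales;               /* list the data */
RUN;
PROC MEANS DATA=sales;               /* descriptive statistics */
RUN;
PROC REG DATA=sales SIMPLE;          /* SIMPLE adds descriptive statistics to the output */
  MODEL quarsales = studentpop;      /* regress quarterly sales on student population */
  PLOT quarsales * studentpop;       /* plot sales against student population */
RUN;
QUIT;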

OUTPUT FROM PROC REG
The REG Procedure
Descriptive Statistics
Variable     Sum          Mean        Uncorrected SS   Variance     Standard Deviation
Intercept    10.00000     1.00000     10.00000         0            0
studentpop   140.00000    14.00000    2528.00000       63.11111     7.94425
quarsales    1300.00000   130.00000   184730           1747.77778   41.80643

Parameter Estimates
Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1    60.00000             9.22603          6.50      0.0002
studentpop   1    5.00000              0.58027          8.62      <.0001

REGRESSION EQUATION: Y = 60.0 + 5.0*X
QUARSALES = 60 + 5*STUDENTPOP
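The standard errors and t values in this table can be reproduced from the data with the usual simple-regression formulas. The slides do not show these formulas, so the sketch below should be read as one standard way to get the same numbers; the variable names are illustrative.

```python
import math

x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Root mean square error: residual sum of squares over n - 2 degrees of freedom.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
root_mse = math.sqrt(sse / (n - 2))

se_b1 = root_mse / math.sqrt(sxx)
se_b0 = root_mse * math.sqrt(1 / n + x_bar ** 2 / sxx)
t_b1 = b1 / se_b1
t_b0 = b0 / se_b0

print(f"Intercept  {b0:.5f}  {se_b0:.5f}  {t_b0:.2f}")
print(f"studentpop {b1:.5f}  {se_b1:.5f}  {t_b1:.2f}")
```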

The Coefficient of Determination
Relationship Among SST, SSR, SSE
SST = SSR + SSE
$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error

The Coefficient of Determination
The coefficient of determination is:
$r^2 = SSR/SST$
where:
SST = total sum of squares
SSR = sum of squares due to regression

OUTPUT FROM PROC REG
Dependent Variable: quarsales
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             1    14200 (SSR)      14200         74.25     <.0001
Error             8    1530 (SSE)       191.25000
Corrected Total   9    15730 (SST)

Root MSE         13.82932    R-Square    0.9027 (Coefficient of Determination)
Dependent Mean   130.00000   Coeff Var   10.63794
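The sums of squares in this table can be recomputed directly from the data and the fitted line Y = 60 + 5*X. A minimal sketch follows; the variable names are illustrative.

```python
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

n = len(y)
y_bar = sum(y) / n
fitted = [60 + 5 * xi for xi in x]   # estimated regression equation

sst = sum((yi - y_bar) ** 2 for yi in y)               # total
ssr = sum((fi - y_bar) ** 2 for fi in fitted)          # due to regression
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # due to error

r_squared = ssr / sst
f_value = (ssr / 1) / (sse / (n - 2))  # model mean square over error mean square

print(sst, ssr, sse)
print(round(r_squared, 4), round(f_value, 2))
```

About 90% of the variation in quarterly sales is accounted for by the linear relationship with student population.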

Your TURN
Find the regression equation SBP = b0 + b1*age for the data below. The first value is age and the second value is SBP.
Age   SBP
42    130
46    115
42    148
71    100
80    156
74    162
70    151
80    156
85    162
72    158
64    155
81    160
41    125
61    150
75    165
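After working the exercise by hand or in SAS, an answer can be checked with the sketch below. It applies the same least squares formulas as the restaurant example; the function name is illustrative, and no expected output is shown so that the exercise stays an exercise.

```python
# (age, SBP) pairs from the exercise.
pairs = [(42, 130), (46, 115), (42, 148), (71, 100), (80, 156),
         (74, 162), (70, 151), (80, 156), (85, 162), (72, 158),
         (64, 155), (81, 160), (41, 125), (61, 150), (75, 165)]


def fit_line(data):
    """Return (b0, b1) for SBP = b0 + b1*age fitted by least squares."""
    n = len(data)
    x_bar = sum(x for x, _ in data) / n
    y_bar = sum(y for _, y in data) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in data)
    sxx = sum((x - x_bar) ** 2 for x, _ in data)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


b0, b1 = fit_line(pairs)
print(f"SBP = {b0:.2f} + {b1:.3f}*age")
```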
