1
Regression Analysis Presentation 13
2
Regression In Chapter 15, we looked at associations between two categorical variables. We will now focus on relationships between two continuous variables. Regression is used to describe a relationship between a quantitative response variable and one or more quantitative predictor variables. In this class we discuss simple linear regression, which describes a linear relationship between a single response Y and a single predictor X. The approach is to determine an equation by which the average value of a particular random variable (Y) can be estimated from the values of the other variable (X). This problem is called regression. Example: Is there a relationship between X = the concentration of iron in the diet and Y = iron in the blood? If we can determine a relationship (i.e. an equation) between the two variables, then we might use this equation 1. To find the mean concentration of iron in the blood for individuals with a specific concentration of iron in their diet, e.g. for X = 80 ppm. 2. To predict someone's concentration of iron in the blood based on the concentration of iron in his/her diet.
3
Some terms and notation Y = the response variable (dependent variable), which is of primary interest. X = the predictor variable (explanatory, or independent variable). We want to find an equation for E(Y) in terms of X. We will call this function the regression line of Y on X. This equation is of the form E(Y) = β0 + β1x, where E(Y) is actually E(Y | X = x), the expected value of Y for individuals in the population with the same particular value of X; β0 is the intercept of the straight line (i.e. the value of E(Y) for x = 0); and β1 is the slope of this line. (When do we have β1 = 0?) Once we know the slope and the intercept, then for a given value of X we can obtain the expected value of Y. However, we cannot know the values of β0 and β1 (they are population parameters). Our goal is to estimate the parameters of the regression line using the observed data (x1, y1), …, (xn, yn).
4
What is the first thing we need to check? We first need to determine whether it is appropriate to use the linear regression model. One way to check this is to plot the observed data pairs (x1, y1), …, (xn, yn). Such a plot is called a scatterplot, and it can be obtained in Minitab by clicking on Graph/Scatterplot. The first plot indicates that there is a linear relationship between these two variables, so it is reasonable to proceed with simple (linear) regression analysis. On the other hand, the data in the second plot demonstrate a nonlinear relationship between the X and the Y variable.
5
Two assumptions about deviations from the regression line Furthermore, in order to make statistical inferences about the population, we need to make two assumptions about how the y-values vary from the population regression line: 1. The general size of the deviation of the y values from the line is the same for all values of x (constant variance assumption). 2. For any specific value of x, the distribution of y values is normal.
6
Simple Regression Model for a Population The model we are going to use is y = Mean + Deviation. 1. Mean: in the population, the mean is the line E(Y) = β0 + β1x if the relationship is linear. 2. Individual's deviation = y − mean, which is what is left unexplained after accounting for the mean y value at that individual's x value. Putting all the assumptions together (linear relation between X and Y, constant variance, and normality) we have that: yi = β0 + β1xi + εi = E(yi) + εi, where the εi are assumed to follow a normal distribution with mean 0 and standard deviation σ (i.e. the same s.d. for all i's).
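One way to make the population model above concrete is to simulate data from it. The parameter values, sample size, and x-values below are made up purely for illustration:

```python
import random

# Sketch of the population model y_i = beta0 + beta1*x_i + eps_i,
# where eps_i ~ Normal(0, sigma), the same sigma for every i.
# All parameter values here are hypothetical.
random.seed(0)  # for reproducibility
beta0, beta1, sigma = 2.0, 0.5, 1.0

xs = [float(x) for x in range(1, 21)]
# Each y is the mean E(y_i) = beta0 + beta1*x_i plus a normal deviation:
ys = [beta0 + beta1 * x + random.gauss(0, sigma) for x in xs]
print(len(ys))  # 20
```

Plotting these simulated pairs would show points scattered around the line E(Y) = 2.0 + 0.5x, which is exactly the situation the scatterplot check on the previous slide is looking for.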
7
Regression Line in the Sample Once we decide that the relationship between X and Y is linear, we use our data set to estimate the parameters of the regression equation, β0 and β1. But how do we estimate the regression line? Which line is the "optimal" one?
8
Method of Least Squares We will use the method of least squares to obtain the estimates of β0 and β1 (i.e. to specify the line). The idea behind this method is to choose the line that comes as close as possible to all the data points simultaneously. The estimates of these parameters are denoted by b0 and b1 respectively, and the estimated regression line is ŷ = b0 + b1x, where ŷ (y-hat) is the estimated value of y for X = x, b0 = the sample intercept of the linear regression line, and b1 = the sample slope of the linear regression line.
9
Method of LS - Deviations from the Regression Line The distance between an actual data point yi and the estimated regression line is called a residual (or error). Thus, for an observation yi in the sample, the residual is ei = yi − ŷi = yi − (b0 + b1xi), where xi is the value of the explanatory variable for the observation. Therefore, we have a residual for each data point, and they are denoted e1, e2, …, en. The method of least squares finds the values of b0 and b1 minimizing the sum of the squared residuals, SSE = e1² + e2² + … + en².
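The least-squares criterion above has a closed-form solution: b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and b0 = ȳ − b1·x̄. A minimal sketch of that computation, with made-up data chosen to lie exactly on a line so the fit is easy to check:

```python
def least_squares(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # b1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar  # the line always passes through (xbar, ybar)
    return b0, b1

# Hypothetical points lying exactly on y = 2 + 3x, so the fit recovers the line:
x = [1, 2, 3, 4]
y = [5, 8, 11, 14]
b0, b1 = least_squares(x, y)
print(b0, b1)  # 2.0 3.0
```

With real (noisy) data the residuals would not all be zero, and this (b0, b1) pair is the one making SSE as small as possible.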
10
Example The regression line on the previous plot is the dotted line. Using the regression equation we can predict the average response value (ŷ) when the predictor variable assumes some value x. For example, in the dads-and-sons problem, using the data, the estimated intercept and slope turned out to be b0 = 3.41 and b1 = 0.97. We can estimate the average height of a man whose dad is 70 in. tall by ŷ = 3.41 + 0.97(70) ≈ 71.3 in.
11
Notes… For x = 0, ŷ = b0. The estimated slope, b1, tells us how much of an increase (or decrease, if negative) there is in ŷ when the x variable increases by one unit. You CANNOT use a regression line to predict the response for observations that fall outside your predictor range.
12
Standard deviation for regression We can estimate the population standard deviation of y about the line, σ, with s = sqrt(SSE / (n − 2)). This is called the standard deviation for regression, and it roughly measures the average deviation of y values from the mean (the regression line). This is a useful statistic for describing individual variation in a regression problem. A small s indicates that individual data points fall close to the line; thus, it provides information about how accurately the regression equation might predict y values.
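A sketch of this computation, using a small set of hypothetical residuals (the n − 2 divisor accounts for the two estimated parameters, b0 and b1):

```python
import math

# Standard deviation for regression: s = sqrt(SSE / (n - 2)).
# The residuals below are hypothetical, for illustration only.
residuals = [1.2, -0.8, 0.5, -1.1, 0.2, 0.0]
n = len(residuals)

sse = sum(e ** 2 for e in residuals)  # sum of squared residuals
s = math.sqrt(sse / (n - 2))
print(round(s, 3))  # 0.946
```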
13
Example - Height and Weight Data: x = heights (in inches) y = weight (pounds) of n = 43 male students. Standard deviation s = 24.00 (pounds): Roughly measures, for any given height, the general size of the deviations of individual weights from the mean weight for the height.
14
Correlation The correlation, r, between two quantitative variables is a number that indicates the strength and the direction of a straight-line relationship. Some properties of r are: 1. It is always between −1 and 1. 2. The magnitude of the correlation indicates the strength of the relationship. A correlation of either −1 or +1 indicates that there is a perfect linear relationship. A correlation of zero means no linear relationship. 3. The sign of the correlation indicates the direction of the relationship. A positive correlation indicates that when one variable increases the other is likely to increase as well, and a negative correlation indicates that when one variable increases the other is likely to decrease. 4. Thus, the sign of r is the same as the sign of b1!
15
Correlation Examples
16
Proportion of Variation Explained by x The squared correlation, r², is between 0 and 1 and indicates the proportion of variation in the response explained by x: r² = (SSTO − SSE) / SSTO = 1 − SSE/SSTO. SSTO = sum of squares total = sum of squared differences between observed y values and ȳ (the sample mean of the y's). SSE = sum of squared errors (residuals) = sum of squared differences between observed y values and the predicted values based on the least squares line.
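The identity r² = 1 − SSE/SSTO can be sketched directly from a set of observed and fitted values; the numbers below are hypothetical:

```python
# r^2 = 1 - SSE/SSTO: proportion of variation in y explained by x.
y = [10.0, 12.0, 15.0, 19.0]      # observed responses (hypothetical)
yhat = [10.5, 12.5, 14.5, 18.5]   # fitted values from some line (hypothetical)

ybar = sum(y) / len(y)
ssto = sum((yi - ybar) ** 2 for yi in y)                  # total variation
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))      # unexplained variation
r_sq = 1 - sse / ssto
print(ssto, sse)  # 46.0 1.0
```

Here nearly all of the variation in y is explained by the line (r² ≈ 0.978), because the fitted values sit very close to the observations.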
17
Example Is there a relationship between X = the concentration of iron in the diet and Y = iron in the blood?

Iron Diet   Iron Blood
99.02       31.67
73.29       18.71
95.73       23.71
66.49       23.23
59.14       20.79
98.91       25.91
76.40       22.45
…           …

Regression Summary: b0 = 5.95, b1 = 0.194, r = 0.839, r-sq = 0.703
18
Explanation of Terms b1 = the sample slope. For every unit increase in X we expect Y to increase by b1. Example: For every increase of 1 mg of iron in the diet we expect blood iron to increase by 0.194 mg. r = the correlation, which varies between −1 and 1. A correlation of −1 means a perfect negative linear relationship, a correlation of +1 means a perfect positive linear relationship, and a correlation of zero means no linear relationship. Example: Our correlation of 0.839 indicates a strong positive relationship. r-sq = the percent of variation in the response variable that is explained by the predictor. Example: Our r-sq of 0.703 means that 70.3% of the individual variation in blood iron concentration can be explained by iron in the diet.
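As a quick check of the interpretation above, the fitted equation ŷ = 5.95 + 0.194x from the regression summary can estimate mean blood iron at, say, x = 80 ppm (the example value used earlier in the deck):

```python
# Estimated mean blood iron for a diet iron concentration of x = 80,
# using the b0 and b1 from the iron regression summary.
b0, b1 = 5.95, 0.194
x = 80
y_hat = b0 + b1 * x
print(round(y_hat, 2))  # 21.47
```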
19
Example: Driver Age and Maximum Legibility Distance of Highway Signs Study to examine relationship between age and maximum distance at which drivers can read a newly designed sign. Average Distance = 577 – 3.01 × Age
20
Example: Age and Distance Cont. s = 49.76 and R-sq = 64.2%. Thus, the average deviation of the data from the regression line is about 50 feet, and 64.2% of the variation in sign-reading distances is explained by age. SSE = 69334, SSTO = 193667.
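The reported R-sq can be verified from the SSE and SSTO given above:

```python
# Check R-sq = 1 - SSE/SSTO for the age/distance example.
sse, ssto = 69334, 193667
r_sq = 1 - sse / ssto
print(round(100 * r_sq, 1))  # 64.2
```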
21
Inference About Linear Regression Relationship Inference about a linear relationship can be evaluated through inference about the slope, β1. We will see how to create a CI for β1 and how to test whether or not β1 is 0. As in any other type of CI or hypothesis test, we will need a sample estimate of the parameter of interest, β1, and the standard error of this estimate. These quantities are b1 and se(b1); you do not need to know their formulas or how to calculate them from the data. You will need to know how to get them from the Minitab output and how to use them. The results of the CI or hypothesis test analysis are meaningful only if the assumptions of the regression model are valid. We will see how we can check them using different plots towards the end of this chapter.
22
CI for Slope A confidence interval for the population slope β1 is b1 ± t* se(b1), where the multiplier t* is the value in a t-distribution with degrees of freedom df = n − 2 such that the area between −t* and t* equals the desired confidence level. (Found from Table A.2.) Interpretation: This CI gives a range for the expected increase in y for a one-unit increase in x.
23
Testing For Significance How do we test whether there is a significant relationship between two quantitative variables? Perform a test of the slope! H0: β1 = 0 (No relationship) vs. Ha: β1 ≠ 0 (There is a relationship). Remember the test statistic formula: t = b1 / se(b1). The test statistic has a t distribution with n − 2 df if the null hypothesis is true. Thus, p-value = 2P(T(df = n−2) > |t|). If the p-value is less than the significance level (usually 0.05), then reject the null hypothesis. Conclude there IS a linear relationship between the two variables, and say whether it is positive or negative depending on the sign of b1.
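A sketch of the test statistic, using the b1 and se(b1) values from the sleep/GPA Minitab output that appears later in this deck:

```python
# t = b1 / se(b1); compare the resulting p-value (from a t table or
# software, df = n - 2) with the significance level.
b1, se_b1 = 0.06152, 0.02114
t = b1 / se_b1
print(round(t, 2))  # 2.91, matching the T column in the Minitab output
```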
24
Example: Age and Distance (cont) If we consider the test H0: β1 = 0 vs Ha: β1 ≠ 0, we have: the p-value suggests that the probability that the observed slope could be as far from 0 or farther, if there were no linear relationship in the population, is virtually 0. The relationship in the sample is significant and represents a real relationship in the population. 95% CI for the slope: With 95% confidence, we can estimate that in the population of drivers represented by this sample, the mean sign-reading distance decreases somewhere between 3.88 and 2.14 feet for each one-year increase in age.
25
Prediction and Confidence Intervals A 95% prediction interval estimates the value of y for an individual with a particular value of x. This interval can be interpreted in two equivalent ways: 1. It estimates the (central) 95% of the values of y for members of the population with the specified value of x. 2. With probability 0.95, the response of a randomly selected individual from the population with the specified value of x falls into the 95% prediction interval. A 95% confidence interval for the mean estimates the mean value of the response variable y, E(Y), for (all) individuals with a particular value of x. You do not need to know the formulas for these intervals, just how to get them from the Minitab output and how to interpret them. For a given x, which interval is wider, the PI or the CI?
26
Example: Age and Distance (cont) Probability is 0.95 that a randomly selected … 21-year-old will read the sign at somewhere between 407 and 620 feet. 30-year-old will read the sign at somewhere between 381 and 592 feet. 45-year-old will read the sign at somewhere between 338 and 545 feet. With 95% confidence, we can estimate that the mean reading distance of... 21-year-old is somewhere between 482 and 546 feet. 30-year-old is somewhere between 460 and 513 feet. 45-year-old is somewhere between 422 and 461 feet.
27
How to Check Conditions for Simple Linear Regression 1. The relationship must be linear. If you make a scatterplot of X and Y and the relationship is obviously curved, then this assumption is violated. 2. There should not be any extreme outliers. Check the scatterplot of X and Y for extreme outlying values. 3. Constant variance: the standard deviation of the values of y from the fitted line is the same regardless of the x-variable. Check this with a scatterplot of residuals versus x. What should it look like? 4. The residuals are normally distributed. Check this with a histogram of the residuals. What should it look like? This condition can be relaxed if the sample size is large.
28
Detailed Example: Suppose we are interested in the relationship between high school GPA and the amount of sleep a student gets. For 100 students we record their GPA and average hours of sleep. A. Fit a simple linear regression line to the data. B. Check the conditions for a hypothesis test and CI of slope. C. Test to see if there is a significant relationship between the 2 variables. D. Construct and interpret a 95% CI for the slope. E. Suppose a student gets 10 hours of sleep. What would their expected GPA be? Is this a good estimate? Explain in terms of r-sq. F. Suppose a student gets 18 hours of sleep. Can we predict the GPA of this student using the regression equation?
29
A. MINITAB: Fitted Line Plot R = 0.282, R-Sq = 8.0%, b1 = 0.0615, b0 = 2.50, S = 0.29379
30
B. Check Conditions 1. From the scatterplot, the relationship seems reasonably linear. 2. From the scatterplot, there don't seem to be any extreme outliers. 3. Variance seems constant along X. 4. Residuals are approx. normal.
31
C. Hypothesis Test for Slope: Regression Analysis: GPA versus Sleep (Hours) The regression equation is GPA = 2.50 + 0.0615 Sleep (Hours) Predictor Coef SE Coef T P Constant 2.5047 0.1708 14.67 0.000 Sleep 0.06152 0.02114 2.91 0.004 S = 0.2938 R-Sq = 8.0% R-Sq(adj) = 7.0% Based on the p-value of.004 we can REJECT the null hypothesis. Conclude there is a significant positive relationship between GPA and sleep.
32
D. 95% CI for Slope: b1 ± t* SE(b1) = 0.0615 ± 1.99 × 0.02114, so CI = (0.0194, 0.104). We are 95% confident that the true population slope is between 0.0194 and 0.104. Equivalently, we are 95% confident that for each additional hour of sleep the expected GPA will increase by between 0.0194 and 0.104 units.
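The interval arithmetic above can be checked directly:

```python
# 95% CI for the slope: b1 +/- t* * se(b1), with df = n - 2 = 98
# and t* ~= 1.99 as given in the slide.
b1, se_b1, t_star = 0.0615, 0.02114, 1.99

half_width = t_star * se_b1
lo, hi = b1 - half_width, b1 + half_width
print(round(lo, 4), round(hi, 4))  # 0.0194 0.1036
```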
33
Expected GPA E. The equation is: ŷ = 2.50 + 0.0615x. The predicted GPA for 10 hours of sleep is 2.5 + 0.0615 × 10 = 3.115. For someone who gets 10 hours of sleep, we expect a GPA of 3.115. This will NOT be a very good predictor because the r-squared value is only 0.08: sleeping hours explain only 8% of the variation in GPA, so most of the variation in GPA is unaccounted for. F. We cannot use the regression equation to predict the GPA of a student who sleeps 18 hours per day, because 18 hours is not in the range of the values of X in the data set.
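Parts E and F together suggest a small sketch: compute the prediction inside the observed range and refuse to extrapolate outside it. The observed sleep range used here (4 to 11 hours) is an assumption, since the raw data set is not shown:

```python
def predict_gpa(hours, lo=4.0, hi=11.0):
    """Predict GPA from hours of sleep using the fitted line.

    The (lo, hi) bounds stand in for the observed range of X in the
    data set; they are hypothetical, since the raw data are not shown.
    """
    if not (lo <= hours <= hi):
        raise ValueError("x is outside the observed range; cannot extrapolate")
    return 2.50 + 0.0615 * hours

print(round(predict_gpa(10), 3))  # 3.115
# predict_gpa(18) raises ValueError: extrapolation, as in part F.
```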
34
Exercise 14.47: Height and Foot Length. a. There is a linear relationship with a positive slope, and there is an obvious outlier in the data. b. With the outlier omitted from the data set, the Minitab regression output is: The regression equation is height = ____ + ____ foot Predictor Coef SE Coef T P Constant 30.150 6.541 4.61 0.000 foot 1.4952 0.2351 6.36 0.000 S = 2.029 R-Sq = 57.4% R-Sq(adj) = 56.0% What is the regression equation? What is r? What is se(b 1 )? What is the test statistic for testing the hypothesis that the slope is zero? Verify the value.
35
Exercise 14.34 d. The regression line doesn't provide particularly accurate predictions of height based on foot length. Notice the standard deviation from the regression line is given in the output as s = 2.029 inches. This is roughly the average difference between actual heights and predicted heights determined from the line. e. The residual plot shows that a linear equation is probably appropriate, there are no outliers, and it's reasonable to make the constant variance assumption (although it may be that there is less variation among residuals for small foot lengths than for large foot lengths).