The Practice of Statistics in the Life Sciences, Fourth Edition
Chapter 4: Relationships: Regression. Copyright © 2018 W. H. Freeman and Company
Objectives
- Regression
- The least-squares regression line
- Facts about least-squares regression
- Outliers and influential observations
- Working with logarithm transformations
- Cautions about correlation and regression
- Association does not imply causation
The least-squares regression line
The least-squares regression line is the line that makes the sum of the squared vertical distances of the data points from the line as small as possible. Because it is the “vertical distances” that are minimized, the distinction between x and y is crucial in regression. If you switched the axes, you’d get a different regression line. Always use the response variable for y and the explanatory variable for x.
Residuals The vertical distances from each point to the least-squares regression line are called residuals. We can show with algebra that the sum of all the residuals is 0. Note that outliers have unusually large residuals (in absolute value).
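As a quick illustration, here is a minimal Python sketch (the data values are made up for the example, not taken from the textbook) showing that the least-squares residuals sum to zero:

```python
import numpy as np

# Made-up example data (not from the textbook)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Least-squares fit: y-hat = a + b*x
b, a = np.polyfit(x, y, 1)

residuals = y - (a + b * x)      # vertical distances from each point to the line
print(residuals)
print(residuals.sum())           # ~0, up to floating-point rounding
```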
Notation: ŷ ("y-hat") is the predicted y value on the regression line.
ŷ = intercept + slope × x
ŷ = a + bx
This textbook uses the notation b for the slope and a for the intercept. Not all calculators and software use this convention, and students should know that different technology platforms may use variations of it. Other notations include:
ŷ = ax + b
ŷ = b₀ + b₁x
ŷ = (variable name) x + constant
Interpreting the regression line
The slope of the regression line describes how much we expect y to change, on average, for every unit change in x. The intercept is a necessary mathematical descriptor of the regression line. It does not necessarily describe a specific property of the data.
Finding the least-squares regression line
The slope of the regression line is b = r (s_y / s_x), where:
- r is the correlation coefficient between x and y,
- s_y is the standard deviation of the response variable y,
- s_x is the standard deviation of the explanatory variable x.
The intercept is a = ȳ − b x̄, where x̄ and ȳ are the respective means of the x and y variables.
This means that we don't have to calculate a lot of squared distances to find the least-squares regression line for a data set; we can rely on these equations instead. In practice, though, we typically use a two-variable statistics calculator or statistical software, as in the sketch below.
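A short Python sketch (again with made-up data) showing that the formulas b = r·s_y/s_x and a = ȳ − b·x̄ reproduce the least-squares fit computed by software:

```python
import numpy as np

# Made-up example data
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([10.0, 14.0, 15.0, 19.0, 24.0])

r = np.corrcoef(x, y)[0, 1]               # correlation between x and y
b = r * y.std(ddof=1) / x.std(ddof=1)     # slope: b = r * s_y / s_x
a = y.mean() - b * x.mean()               # intercept: a = y-bar - b * x-bar

# The direct least-squares fit gives the same line:
b_fit, a_fit = np.polyfit(x, y, 1)
print(b, a)
print(b_fit, a_fit)
```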
Plotting the least-squares regression line
Use the regression equation to find the value of ŷ for two distinct values of x, and draw the line that goes through those two points. Hint: the regression line always passes through the point of means (x̄, ȳ). The points used for drawing the regression line are derived from the equation; they are NOT actual points from the data set (except by pure coincidence). The sketch below illustrates the procedure.
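A minimal plotting sketch (assuming matplotlib is available; same made-up data as above). The two points that define the drawn line come from the equation, and the line passes through (x̄, ȳ):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([10.0, 14.0, 15.0, 19.0, 24.0])
b, a = np.polyfit(x, y, 1)

# Two distinct x values; the corresponding y values come from the
# equation, not from the data set
x_line = np.array([x.min(), x.max()])
y_line = a + b * x_line

plt.scatter(x, y, label="raw data")            # always plot the raw data
plt.plot(x_line, y_line, label="regression line")
plt.plot(x.mean(), y.mean(), "k+", markersize=12,
         label="(x-bar, y-bar)")               # the line passes through this point
plt.xlabel("x (explanatory)")
plt.ylabel("y (response)")
plt.legend()
plt.show()
```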
Facts about the least-squares regression line
Fact 1. There is a distinction between the explanatory variable and the response variable. If their roles are reversed, we get a different regression line.
Fact 2. The slope of the regression line is proportional to the correlation between the two variables.
Fact 3. The regression line always passes through the point (x̄, ȳ).
Fact 4. The correlation measures the strength of the association, while the square of the correlation measures the percent of the variation that is explained by the regression line.
Linear associations only (1 of 2)
Don’t compute the regression line until you have confirmed that there is a linear relationship between x and y. ALWAYS PLOT THE RAW DATA. These data sets all give roughly the same linear regression equation, but don’t report that equation until you have plotted the data.
Linear associations only (2 of 2)
Is regression appropriate for these data sets?
A: Moderate linear association; regression is appropriate.
B: Obvious nonlinear relationship; regression is inappropriate.
C: One extreme outlier, requiring further examination; the outlier is very suspicious (likely a typo or an experimental error), so regression is inappropriate as the data stand.
D: Only two values for x; the study should be redesigned.
Only data set A is clearly suitable for linear regression.
The coefficient of determination, r²
r², the coefficient of determination, is the square of the correlation coefficient. r² represents the fraction of the variance in y that can be explained by the regression model.
Example: r = 0.87, so r² = 0.76; this model explains 76% of the individual variations in BAC.
Top graph: distance of each yᵢ to the mean of y; the sum of these squared deviations (SSTotal) measures the total variation in y. Bottom graph: distance of each yᵢ to its predicted value on the regression line (the residuals); the sum of the squared residuals (SSError) represents the variation in y that is left after taking the regression model into account.
SSTotal = SSRegression + SSError. Mathematically, it can be shown that r² = SSRegression / SSTotal = 1 − SSError / SSTotal. Therefore, r² is the fraction of the total sum of squares that is explained by the regression model. Note that the notation R² is also commonly used.
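The sum-of-squares decomposition can be checked numerically. A sketch with made-up data, verifying that 1 − SSError/SSTotal equals r²:

```python
import numpy as np

# Made-up example data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

ss_total = np.sum((y - y.mean()) ** 2)   # SSTotal: total variation in y
ss_error = np.sum((y - y_hat) ** 2)      # SSError: variation left after the fit
r2 = 1 - ss_error / ss_total             # coefficient of determination

r = np.corrcoef(x, y)[0, 1]
print(r2, r ** 2)                        # the two values agree
```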
Interpreting r² (1 of 3)
r = –0.3, so r² = 0.09, or 9%: the regression model explains less than 10% of the variation in y.
r = –0.7, so r² = 0.49, or 49%: the regression model explains nearly half of the variation in y.
Recall that r quantifies the strength and direction of a linear relationship between two quantitative variables: r is positive for positive linear relationships and negative for negative ones, and the closer r is to zero, the weaker the linear relationship. Beware that r has this particular meaning for linear relationships only.
Interpreting r² (2 of 3)
r = –0.99, so r² ≈ 0.98, or about 98%: the regression model explains almost all of the variation in y.
Interpreting r² (3 of 3)
r represents the direction and strength of a linear relationship; r² indicates what fraction of the variation in y can be explained by the linear regression model.
Left: r = –0.972, r² = 0.946. This linear relationship is negative and strong, and the regression model explains 94.6% of the variation in cold symptom severity scores.
Right: r = –0.538, r² = 0.290. This linear relationship is negative and moderate to weak, and the regression model explains only 29% of the variation in straightforwardness scores.
Outliers and influential points
Outlier: an observation that lies outside the overall pattern.
Influential observation: an observation that markedly changes the regression line if removed. This is often an isolated point.
One child (top) is an outlier of the relationship: it is unusually far from the regression line vertically, and thus has a large residual. Another child (bottom right) is isolated from the rest of the points and might be an influential point.
Outlier example: The rightmost point changes the regression line substantially when it is removed, which makes it an influential point. The topmost point is an outlier of the relationship, but it is not influential: the regression line changes very little when it is removed.
Regression with a transformation
Logarithm transformations are often used when data are strongly right-skewed. If the response variable is transformed with logarithms, regression is performed as usual, except that the predicted response must be transformed back into the original units.
To predict brain weight when body weight is 100 kg (so log body weight = log 100 = 2):
log(brain weight) = 1.01 + 0.72 × log(body weight)
log(brain weight) = 1.01 + 0.72 × 2 = 2.45
brain weight = 10^2.45 ≈ 282 g
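A sketch of the back-transformation step, using the coefficients reconstructed in the example above (treat them as illustrative; the key point is predicting on the log scale and then undoing the log):

```python
import numpy as np

# Regression on the log10 scale (coefficients from the example above)
a, b = 1.01, 0.72

body_kg = 100.0
log_brain = a + b * np.log10(body_kg)   # prediction on the log scale: 2.45
brain_g = 10 ** log_brain               # back-transform to original units

print(log_brain)   # 2.45
print(brain_g)     # about 282 (grams)
```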
Making predictions (1 of 4)
Use the equation of the least-squares regression line to predict y for any value of x within the range studied. Prediction outside that range is extrapolation; avoid extrapolation.
What would we expect the BAC to be after drinking 6.5 beers? Substituting x = 6.5 into the regression equation gives ŷ = 0.0944 mg/mL.
Nobody in the study drank exactly 6.5 beers, but by finding the value of ŷ from the regression line for x = 6.5, we would expect a BAC of about 0.0944 mg/mL. With the data collected, we can use this equation for a number of beers drunk between 1 and 8 (prediction within range). Don’t use this regression model to predict the BAC after drinking 30 beers: that’s extrapolation, and the person would most likely be passed out or dead!
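The "within range only" rule is easy to enforce in code. In the sketch below, the coefficients a and b are hypothetical placeholders, NOT the study's fitted values (they were chosen only so the example returns roughly the slide's 0.0944 mg/mL at x = 6.5); the structure of the range check is the point:

```python
X_MIN, X_MAX = 1, 8           # range of beers observed in the study

# Hypothetical placeholder coefficients, NOT the study's fitted values
a, b = 0.0, 0.01452

def predict_bac(beers: float) -> float:
    """Predict BAC (mg/mL), refusing to extrapolate outside the observed range."""
    if not (X_MIN <= beers <= X_MAX):
        raise ValueError(f"x = {beers} is outside [{X_MIN}, {X_MAX}]: extrapolation")
    return a + b * beers

print(predict_bac(6.5))       # within range: about 0.094 mg/mL
# predict_bac(30)             # would raise ValueError: extrapolation
```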
Making predictions (2 of 4)
Note that using this model to make a prediction for 200,000 powerboats leads to a NEGATIVE predicted number of manatee deaths from collisions. This is clearly nonsense and reflects the unreliability of extrapolation. Using the model to make a prediction for 1,500,000 powerboats would not give a negative value, but it would be just as unreliable, because it too would be extrapolation.
[Figure: scatterplot of manatee deaths versus registered powerboats, showing a positive linear relationship.]
Making predictions (3 of 4)
If Florida were to limit the number of powerboats to 500,000, what could we expect the number of manatee deaths to be in that year?
A) ~21   B) ~65   C) ~109   D) ~65,006
What if Florida were to limit the number of powerboats to 200,000? (As noted on the previous slide, the model would then predict a negative number of deaths: nonsense, and a sign of unreliable extrapolation.)
Making predictions (4 of 4)
Year  Boats  Deaths     Year  Boats  Deaths     Year  Boats  Deaths
1977   447     13        1989   711     50       2001   944     81
1978   460     21        1990   719     47       2002   962     95
1979   481     24        1991   681     55       2003   978     73
1980   498     16        1992   679     38       2004   983     69
1981   513      —        1993   678     35       2005  1010     79
1982   512     20        1994   696     49       2006  1024     92
1983   526     15        1995   713     42       2007  1027      —
1984   559     34        1996   732     60       2008     —     90
1985   585     33        1997   755     54       2009   982     97
1986   614      —        1998   809     66       2010   942     83
1987   645     39        1999   830     82       2011   922     87
1988   675     43        2000   880     78       2012   902      —
(— marks a value missing from the source.)
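To make the extrapolation warning concrete, here is a sketch that fits a least-squares line to the complete rows of the table above (assuming, as the slides' questions imply, that boats are recorded in thousands of registrations, so 200,000 boats corresponds to x = 200). The printed values are whatever numpy computes from these data, not numbers quoted from the slides:

```python
import numpy as np

# Complete (boats, deaths) rows from the table; years with a missing
# value (1981, 1986, 2007, 2008, 2012) are omitted.
boats = np.array([447, 460, 481, 498, 512, 526, 559, 585, 645, 675,
                  711, 719, 681, 679, 678, 696, 713, 732, 755, 809,
                  830, 880, 944, 962, 978, 983, 1010, 1024, 982, 942, 922])
deaths = np.array([13, 21, 24, 16, 20, 15, 34, 33, 39, 43,
                   50, 47, 55, 38, 35, 49, 42, 60, 54, 66,
                   82, 78, 81, 95, 73, 69, 79, 92, 97, 83, 87])

b, a = np.polyfit(boats, deaths, 1)

print(a + b * 500)   # 500,000 boats: within the data range, a usable prediction
print(a + b * 200)   # 200,000 boats: far below the data, a negative (nonsense) value
```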
Association does not imply causation
Association, however strong, does NOT imply causation. The observed association could have an external cause. A lurking variable is a variable that is not among the explanatory or response variables in a study, and yet may influence the relationship between the variables studied. We say that two variables are confounded when their effects on a response variable cannot be distinguished from each other. Only careful experimentation can show causation. The important distinction between experiments and observational studies is made in Chapters 7 and 8. The confounded variables may be either explanatory variables or lurking variables.
Lurking variables (1 of 4)
What is most likely the lurking variable, if any, in each case?
- Strong positive association between shoe size and reading skills in young children. (Child age is the most likely lurking variable: as kids grow, their feet get bigger and their reading skills improve with practice and schooling.)
- Negative association between moderate wine-drinking and death rates from heart disease in developed nations. (We can think of a long list of possible lurking variables: diet type, socialization, stress, general quality of life. But nothing as obvious as child age in the first example.)
Lurking variables (2 of 4)
Clear positive association between per capita chocolate consumption and the concentration of Nobel laureates across world nations! There is no good explanation here; the association is most likely spurious (a coincidence).
Lurking variables (3 of 4)
Relationship between muscle sympathetic nerve activity and a measure of arterial stiffness in young adults. Gender is a lurking variable. A confounding variable (here, gender) can also sometimes mask a relationship between two variables (here, sympathetic nerve activity and augmented aortic blood pressure).
Lurking variables (4 of 4)
Same data broken down by gender.
Establishing causation (1 of 2)
Establishing causation from an observed association can be done if:
- The association is strong.
- The association is consistent.
- Higher doses are associated with stronger responses.
- The alleged cause precedes the effect.
- The alleged cause is plausible.
Establishing causation (2 of 2)
Lung cancer is clearly associated with smoking. What if a genetic mutation (lurking variable) caused people to both get lung cancer and become addicted to smoking? It took years of research and accumulated indirect evidence to reach the conclusion that smoking causes lung cancer.