CHAPTER 4 DESCRIBING BIVARIATE NUMERICAL DATA Created by Kathy Fritz.


1 CHAPTER 4 DESCRIBING BIVARIATE NUMERICAL DATA Created by Kathy Fritz

2 Forensic scientists must often estimate the age of an unidentified crime victim. Prior to 2010, this was usually done by analyzing teeth and bones, and the resulting estimates were not very reliable. A study described in the paper “Estimating Human Age from T-Cell DNA Rearrangements” (Current Biology [2010]) examined the relationship between age and a measure based on a blood test. Age and the blood test measure were recorded for 195 people ranging in age from a few weeks to 80 years. A scatterplot of the data appears to the right.

3 CORRELATION Pearson’s Sample Correlation Coefficient Properties of r

4 Does it look like there is a relationship between the two variables? If so, is the relationship linear?

5 Does it look like there is a relationship between the two variables? If so, is the relationship linear?

6 Does it look like there is a relationship between the two variables? If so, is the relationship linear?

7 Does it look like there is a relationship between the two variables? If so, is the relationship linear?

8 Does it look like there is a relationship between the two variables? If so, is the relationship linear?

9 Interpreting Scatterplots To interpret a scatterplot, follow the basic strategy of data analysis from Chapters 1 and 2: look for patterns and important departures from those patterns. How to Examine a Scatterplot: As in any graph of data, look for the overall pattern and for striking departures from that pattern. You can describe the overall pattern of a scatterplot by the direction, form, and strength of the relationship. An important kind of departure is an outlier, an individual value that falls outside the overall pattern of the relationship.

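10 Interpreting Scatterplots Definition: Two variables have a positive association when above-average values of one tend to accompany above-average values of the other, and below-average values also tend to occur together. Two variables have a negative association when above-average values of one tend to accompany below-average values of the other. Consider the SAT problem. Interpret the scatterplot in terms of direction, form, and strength.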

11 Interpreting Scatterplots: Direction, Form, Strength, Outlier

12 When the points in a scatterplot tend to cluster tightly around a line, the relationship is described as strong. Try to order the scatterplots from the strongest relationship to the weakest. These four scatterplots, labeled A, B, C, and D, were constructed using data from graphs in Archives of General Psychiatry (June 2010).

13 PEARSON’S SAMPLE CORRELATION COEFFICIENT Usually referred to as just the correlation coefficient Denoted by r Measures the strength and direction of a linear relationship between two numerical variables

14 Measuring Linear Association: Correlation A scatterplot displays the strength, direction, and form of the relationship between two quantitative variables. Linear relationships are important because a straight line is a simple pattern that is quite common. Unfortunately, our eyes are not good judges of how strong a linear relationship is. Definition: The correlation r measures the direction and strength of the linear relationship between two quantitative variables.

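15 PROPERTIES OF R 1. The sign of r matches the direction of the linear relationship. r is positive when the points slope upward from left to right; r is negative when they slope downward.

16 PROPERTIES OF R 2. The value of r is always between -1 and +1, inclusive. Values near 0 indicate a weak correlation, values near -1 or +1 indicate a strong correlation, and values in between indicate a moderate correlation.

17 PROPERTIES OF R 3. r = 1 only when all the points in the scatterplot fall on a straight line that slopes upward. Similarly, r = -1 only when all the points fall on a straight line that slopes downward.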

18 PROPERTIES OF R 4. r is a measure of the extent to which x and y are linearly related. Sketch the scatterplot and compute the correlation coefficient for these points:
x:  2   4   6   8   10  12  14
y:  40  20  8   4   8   __  40

19 PROPERTIES OF R 5. The value of r does not depend on the unit of measurement for either variable.

20 Measuring Linear Association: Correlation

21 CALCULATING CORRELATION COEFFICIENT The correlation coefficient is calculated using the following formula: $r = \frac{\sum z_x z_y}{n - 1}$, where $z_x = \frac{x - \bar{x}}{s_x}$ and $z_y = \frac{y - \bar{y}}{s_y}$.

22 The web site www.collegeresults.org (The Education Trust) publishes data on U.S. colleges and universities. The following six-year graduation rates and student-related expenditures per full-time student for 2007 were reported for the seven primarily undergraduate public universities in California with enrollments between 10,000 and 20,000.
Expenditures:      8810  7780  8112  8149  8477  7342  7984
Graduation rates:  66.1  52.4  48.9  48.1  42.0  38.3  31.3
Here is the scatterplot: Does the relationship appear linear? Explain.

23 COLLEGE EXPENDITURES CONTINUED: To compute the correlation coefficient, first find the z-scores.
x      y      z_x     z_y     z_x z_y
8810   66.1   1.52    1.74    2.64
7780   52.4   -0.66   0.51    -0.34
8112   48.9   0.04    0.19    0.01
8149   48.1   0.12    0.12    0.01
8477   42.0   0.81    -0.42   -0.34
7342   38.3   -1.59   -0.76   1.21
7984   31.3   -0.23   -1.38   0.32
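The z-score calculation for the college data can be checked with a short script. This is a minimal sketch in plain Python; the helper names `z_scores` and `correlation` are illustrative choices, not part of the slides.

```python
from statistics import mean, stdev

# Data from the college expenditures example (seven California universities).
expenditures = [8810, 7780, 8112, 8149, 8477, 7342, 7984]
grad_rates = [66.1, 52.4, 48.9, 48.1, 42.0, 38.3, 31.3]

def z_scores(values):
    """Standardize values using the sample mean and sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def correlation(x, y):
    """Pearson's r: the sum of z-score products divided by n - 1."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

r = correlation(expenditures, grad_rates)
print(round(r, 2))
```

The result should be roughly 0.58, in line with summing the last column of the table (about 3.51) and dividing by n - 1 = 6.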

24 Facts about Correlation How correlation behaves is more important than the details of the formula. Here are some important facts about r.
1. Correlation makes no distinction between explanatory and response variables.
2. r does not change when we change the units of measurement of x, y, or both.
3. The correlation r itself has no unit of measurement.
Cautions:

25 DOES A VALUE OF R CLOSE TO 1 OR -1 MEAN THAT A CHANGE IN ONE VARIABLE CAUSES A CHANGE IN THE OTHER VARIABLE? Consider the following examples: The relationship between the number of cavities in a child’s teeth and the size of his or her vocabulary is strong and positive. Consumption of hot chocolate is negatively correlated with crime rate.

26 LINEAR REGRESSION Least Squares Regression Line

27 Suppose there is a relationship between two numerical variables. Let x be the amount spent on advertising and y be the amount of sales for the product during a given period. You might want to predict product sales (y) for a month when the amount spent on advertising is $10,000 (x).

28 The equation of a line is: y = a + bx, where: b is the slope, the amount by which y increases when x increases by 1 unit; a is the intercept (sometimes called the y-intercept), the height of the line above x = 0. In some contexts, it is not reasonable to interpret the intercept.

29 Regression Line Linear (straight-line) relationships between two quantitative variables are common and easy to understand. A regression line summarizes the relationship between two variables, but only in settings where one of the variables helps explain or predict the other. Definition: A regression line is a line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x. This is a scatterplot of the change in nonexercise activity (cal) and measured fat gain (kg) after 8 weeks for 16 healthy young adults. The plot shows a moderately strong, negative, linear association between NEA change and fat gain with no outliers. The regression line predicts fat gain from change in NEA. When the change in nonexercise activity is 800 cal, our line predicts a fat gain of about 0.8 kg after 8 weeks.

30 Interpreting a Regression Line A regression line is a model for the data, much like density curves. The equation of a regression line gives a compact mathematical description of what this model tells us about the relationship between the response variable y and the explanatory variable x. Definition: Suppose that y is a response variable (plotted on the vertical axis) and x is an explanatory variable (plotted on the horizontal axis). A regression line relating y to x has an equation of the form ŷ = a + bx. In this equation, ŷ (read “y hat”) is the predicted value of the response variable y for a given value of the explanatory variable x; b is the slope, the amount by which y is predicted to change when x increases by one unit; and a is the y-intercept, the predicted value of y when x = 0.

31 Interpreting a Regression Line Consider the regression line from the example “Does Fidgeting Keep You Slim?”: predicted fat gain = 3.505 - 0.00344(NEA change). Identify the slope and y-intercept and interpret each value in context. The y-intercept a = 3.505 kg is the fat gain estimated by this model if NEA does not change when a person overeats. The slope b = -0.00344 tells us that the amount of fat gained is predicted to go down by 0.00344 kg for each added calorie of NEA.

32 Prediction We can use a regression line to predict the response ŷ for a specific value of the explanatory variable x. Use the NEA and fat gain regression line to predict the fat gain for a person whose NEA increases by 400 cal when she overeats. Substituting x = 400 gives ŷ = 3.505 - 0.00344(400) = 2.13, so we predict a fat gain of 2.13 kg for a person whose NEA increases by 400 calories.
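The prediction step is a direct substitution into the regression equation. A minimal sketch, using the intercept 3.505 and slope -0.00344 quoted for the fidgeting example; the function name `predicted_fat_gain` is made up for illustration.

```python
def predicted_fat_gain(nea_change_cal):
    """Predicted fat gain (kg) from the line y-hat = 3.505 - 0.00344 x."""
    return 3.505 - 0.00344 * nea_change_cal

# NEA increases of 400 cal and 800 cal, the two values discussed in the slides.
prediction_400 = predicted_fat_gain(400)
prediction_800 = predicted_fat_gain(800)
print(round(prediction_400, 2), round(prediction_800, 2))
```

The two results correspond to the 2.13 kg prediction and the "about 0.8 kg" reading from the regression line plot.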

33 Extrapolation We can use a regression line to predict the response ŷ for a specific value of the explanatory variable x. The accuracy of the prediction depends on how much the data scatter about the line. While we can substitute any value of x into the equation of the regression line, we must exercise caution in making predictions outside the observed values of x. Definition: Extrapolation is the use of a regression line for prediction far outside the interval of values of the explanatory variable x used to obtain the line. Such predictions are often not accurate. Don’t make predictions using values of x that are much larger or much smaller than those that actually appear in your data.

34 HOW DO YOU FIND AN APPROPRIATE LINE FOR DESCRIBING A BIVARIATE DATA SET? y = 10 + 2x

35 Residuals In most cases, no line will pass exactly through all the points in a scatterplot. A good regression line makes the vertical distances of the points from the line as small as possible. Definition: A residual is the difference between an observed value of the response variable and the value predicted by the regression line. That is, residual = observed y - predicted y = y - ŷ. Points above the line have positive residuals; points below the line have negative residuals.

36 LEAST SQUARES REGRESSION LINE Different regression lines produce different residuals. The least squares regression line is the line that minimizes the sum of squared deviations or residuals.

37 Let’s investigate the meaning of the least squares regression line. Suppose we have a data set that consists of the observations (0,0), (3,10), and (6,2). Find the vertical deviations from the line.
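The three-point data set is small enough to verify the least squares idea by hand or in a few lines of code. The sketch below fits the least squares line and compares its sum of squared residuals with that of one competing line. The competing line y = 2 + 0.5x is a hypothetical choice made only for comparison, and the function names are illustrative.

```python
points = [(0, 0), (3, 10), (6, 2)]

def sum_squared_residuals(intercept, slope, data):
    """Sum of squared vertical deviations of the points from a given line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in data)

def least_squares_line(data):
    """Return (intercept, slope) of the least squares line for (x, y) pairs."""
    n = len(data)
    x_bar = sum(x for x, _ in data) / n
    y_bar = sum(y for _, y in data) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in data)
    sxx = sum((x - x_bar) ** 2 for x, _ in data)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

a, b = least_squares_line(points)
ss_best = sum_squared_residuals(a, b, points)

# Hypothetical competing line, chosen only to have something to compare against.
ss_other = sum_squared_residuals(2.0, 0.5, points)
print(a, b, ss_best, ss_other)
```

For these points the fitted line has intercept 3 and slope 1/3, with residuals -3, 6, and -3. Any other line, including the hypothetical one here, gives a larger sum of squared residuals.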

38 Pomegranate, a fruit native to Persia, has been used in the folk medicines of many cultures to treat various ailments. Researchers are now investigating whether pomegranate's antioxidant properties are useful in the treatment of cancer. In one study, mice were injected with cancer cells and randomly assigned to one of three groups: plain water, water supplemented with 0.1% pomegranate fruit extract (PFE), and water supplemented with 0.2% PFE. The average tumor volume for mice in each group was recorded for several points in time. For the mice assigned to plain water, x = number of days after injection of cancer cells and y = average tumor volume (in mm³).
x:  11   15   19   23   27
y:  150  270  450  580  740
Sketch a scatterplot for this data set.

39 Let’s find the least squares regression line for the data from the previous slide.

40 Pomegranate study continued Predict the average volume of the tumor for 20 days after injection. Predict the average volume of the tumor for 5 days after injection.
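The pomegranate calculations can be sketched in code as follows. The data are the plain-water group values from the earlier slide; `fit_line` and `predict` are illustrative helper names, not anything prescribed by the slides.

```python
# Plain-water group: x = days after injection, y = average tumor volume (mm^3).
days = [11, 15, 19, 23, 27]
volume = [150, 270, 450, 580, 740]

def fit_line(x, y):
    """Return (intercept, slope) of the least squares line y-hat = a + b x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

intercept, slope = fit_line(days, volume)

def predict(day):
    """Predicted average tumor volume on a given day."""
    return intercept + slope * day

volume_day_20 = predict(20)  # inside the observed range of 11 to 27 days
volume_day_5 = predict(5)    # outside the observed range (extrapolation)
print(intercept, slope, volume_day_20, volume_day_5)
```

Under these calculations the line is ŷ = -269.75 + 37.25x. The day-20 prediction is about 475 mm³, while the day-5 prediction comes out negative, which is impossible for a volume and illustrates why predictions outside the observed x-values should not be trusted.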

41 ASSESSING THE FIT OF A LINE Residuals Residual Plots Outliers and Influential Points Coefficient of Determination Standard Deviation about the Line

42 ASSESSING THE FIT OF A LINE Important questions are:
1. Is the line an appropriate way to summarize the relationship between x and y?
2. Are there any unusual aspects of the data set that you need to consider before proceeding to use the least squares regression line to make predictions?
3. If you decide that it is reasonable to use the line as a basis for prediction, how accurate can you expect predictions to be?

43 RESIDUALS Recall that the vertical deviations of points from the least squares regression line are called residuals. Each residual is the observed y-value minus the predicted y-value: residual = y - ŷ.

44 In a study, researchers were interested in how the distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters.
Distance from Debris (x)   Distance Traveled (y)
6.94                       0.00
5.23                       6.13
5.21                       11.29
7.10                       14.35
8.16                       12.03
5.50                       22.72
9.19                       20.11
9.05                       26.16
9.36                       30.65
Calculate the predicted y and the residuals.
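One way to carry out the predicted-value and residual calculation for the deer mouse data is sketched below. It fits the least squares line directly from the nine data points; `fit_line` is an illustrative helper name.

```python
# Deer mouse data: x = distance from debris (m), y = distance traveled (m).
debris = [6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36]
traveled = [0.00, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]

def fit_line(x, y):
    """Return (intercept, slope) of the least squares line y-hat = a + b x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

intercept, slope = fit_line(debris, traveled)
predicted = [intercept + slope * xi for xi in debris]
residuals = [yi - pi for yi, pi in zip(traveled, predicted)]

# One row per observation: x, y, predicted y, residual.
for xi, yi, pi, ri in zip(debris, traveled, predicted, residuals):
    print(f"{xi:5.2f} {yi:6.2f} {pi:6.2f} {ri:7.2f}")
```

The fitted coefficients should agree with the regression output shown later for this data set (intercept about -7.69, slope about 3.234), and the residuals from a least squares fit sum to zero up to rounding.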

45 Residual Plots One of the first principles of data analysis is to look for an overall pattern and for striking departures from the pattern. A regression line describes the overall pattern of a linear relationship between two variables. We see departures from this pattern by looking at the residuals. Definition: A residual plot is a scatterplot of the residuals against the explanatory variable. Residual plots help us assess how well a regression line fits the data.

46 RESIDUAL PLOTS A residual plot is a scatterplot of the (x, residual) pairs. Residuals can also be graphed against the predicted y-values. Isolated points or a pattern of points in the residual plot indicate potential problems.

47 Interpreting Residual Plots A residual plot magnifies the deviations of the points from the line, making it easier to see unusual observations and patterns. A pattern in the residuals indicates that a linear model is not appropriate.

48 Deer mice continued
Distance from Debris (x)   Distance Traveled (y)   Predicted y   Residual
6.94                       0.00                    14.76         -14.76
5.23                       6.13                    9.23          -3.10
5.21                       11.29                   9.16          2.13
7.10                       14.35                   15.28         -0.93
8.16                       12.03                   18.70         -6.67
5.50                       22.72                   10.10         12.62
9.19                       20.11                   22.04         -1.93
9.05                       26.16                   21.58         4.58
9.36                       30.65                   22.59         8.06
Plot the residuals against the distance from debris (x).

49 Deer mice continued

50 Residual plots can be plotted against either the x-values or the predicted y-values. Deer mice continued

51 Residual plots continued Let’s examine the accompanying data on x = height (in inches) and y = average weight (in pounds) for American females, ages 30-39 (from The World Almanac and Book of Facts).
x:  58   59   60   61   62   63   64   65   66   67   68   69   70   71   72
y:  113  115  118  121  124  128  131  134  137  141  145  150  153  159  164
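A quick numerical look at the residuals for the height and weight data shows why a residual plot is worth making even when the scatterplot looks nearly straight. This is a sketch in plain Python; `fit_line` is an illustrative helper name.

```python
# x = height (inches), y = average weight (pounds), American females ages 30-39.
heights = list(range(58, 73))
weights = [113, 115, 118, 121, 124, 128, 131, 134,
           137, 141, 145, 150, 153, 159, 164]

def fit_line(x, y):
    """Return (intercept, slope) of the least squares line y-hat = a + b x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

intercept, slope = fit_line(heights, weights)
residuals = [w - (intercept + slope * h) for h, w in zip(heights, weights)]

# Print the sign pattern: positive at both ends, negative in the middle.
for h, r in zip(heights, residuals):
    print(h, round(r, 2))
```

The residuals are positive at the shortest and tallest heights and negative in between. That curved pattern suggests a straight line is not the best summary of this relationship.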

52 Let’s examine the data set for 12 black bears from the Boreal Forest. x = age (in years) and y = weight (in kg)
x:  10.5  6.5  28.5  10.5  6.5  7.5  6.5  5.5  7.5  11.5  9.5  5.5
y:  54    40   62    51    55   56   62   42   40   59    51   50
Sketch a scatterplot with the fitted regression line. The observation with x = 28.5 has an x-value that differs greatly from the others in the data set.

53 Black bears continued
x:  10.5  6.5  28.5  10.5  6.5  7.5  6.5  5.5  7.5  11.5  9.5  5.5
y:  54    40   62    51    55   56   62   42   40   59    51   50
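To see how much a single observation with an unusual x-value can affect the line, the sketch below fits the black bear data twice: once with all 12 bears and once without the bear aged 28.5 years. Dropping the point is done here purely for comparison; `fit_line` is an illustrative helper name.

```python
# Black bear data: x = age (years), y = weight (kg).
ages = [10.5, 6.5, 28.5, 10.5, 6.5, 7.5, 6.5, 5.5, 7.5, 11.5, 9.5, 5.5]
weights = [54, 40, 62, 51, 55, 56, 62, 42, 40, 59, 51, 50]

def fit_line(x, y):
    """Return (intercept, slope) of the least squares line y-hat = a + b x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

a_all, b_all = fit_line(ages, weights)

# Remove the bear whose age (28.5) is far from all the other ages.
kept = [(x, y) for x, y in zip(ages, weights) if x != 28.5]
a_kept, b_kept = fit_line([x for x, _ in kept], [y for _, y in kept])

print(round(b_all, 3), round(b_kept, 3))
```

With this data the slope roughly doubles when the one bear is removed, which is the kind of change that marks an observation as influential.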

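54 COEFFICIENT OF DETERMINATION The coefficient of determination is the proportion of variation in y that can be attributed to an approximate linear relationship between x and y. It is denoted by r². The value of r² is often converted to a percentage.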

55 Let’s explore the meaning of r² by revisiting the deer mouse data set. x = the distance from the food to the nearest pile of fine woody debris, y = distance a deer mouse will travel for food
x:  6.94  5.23  5.21   7.10   8.16   5.50   9.19   9.05   9.36
y:  0     6.13  11.29  14.35  12.03  22.72  20.11  26.16  30.65
Suppose you didn’t know any x-values. What distance would you expect deer mice to travel? To find the total amount of variation in the distance traveled (y), you need to find the sum of the squares of the deviations from the mean.

56 Deer mice continued x = the distance from the food to the nearest pile of fine woody debris, y = distance a deer mouse will travel for food
x:  6.94  5.23  5.21   7.10   8.16   5.50   9.19   9.05   9.36
y:  0     6.13  11.29  14.35  12.03  22.72  20.11  26.16  30.65
Now let’s find how much variation there is in the distance traveled (y) from the least squares regression line. To find this amount of variation, find the sum of the squared residuals.

57 Deer mice continued x = the distance from the food to the nearest pile of fine woody debris, y = distance a deer mouse will travel for food. The total amount of variation in the distance traveled (y) is SSTo = 773.95 m². The amount of variation in y values from the regression line is SSResid = 526.27 m². Approximately what percent of the variation in distance traveled (y) can be explained by the linear relationship? r² = 1 - SSResid/SSTo = 1 - 526.27/773.95 ≈ 0.32, or 32%.

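58 STANDARD DEVIATION ABOUT THE LEAST SQUARES REGRESSION LINE The standard deviation about the least squares regression line is $s_e = \sqrt{\frac{SSResid}{n - 2}}$. The value of s_e can be interpreted as the typical amount by which an observation deviates from the least squares regression line.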

59 Partial output from the regression analysis of deer mouse data:
Predictor            Coef    SE Coef   T       P
Constant             -7.69   13.33     -0.58   0.582
Distance to debris   3.234   1.782     1.82    0.112
S = 8.67071   R-sq = 32.0%   R-sq(adj) = 22.3%
Analysis of Variance
Source        DF   SS       MS       F      P
Regression    1    247.68   247.68   3.29   0.112
Resid Error   7    526.27   75.18
Total         8    773.95
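The summary values in the regression output can be recovered from the sums of squares in the Analysis of Variance table. A minimal sketch, assuming the usual definitions r² = 1 - SSResid/SSTo and s_e = sqrt(SSResid/(n - 2)); the variable names are illustrative.

```python
from math import sqrt

# Values read from the Analysis of Variance table for the deer mouse data.
ss_resid = 526.27   # residual (error) sum of squares
ss_total = 773.95   # total sum of squares
n = 9               # number of observations (total DF 8, plus 1)

r_squared = 1 - ss_resid / ss_total   # proportion of variation explained
s_e = sqrt(ss_resid / (n - 2))        # standard deviation about the line

print(round(r_squared, 3), round(s_e, 2))
```

These reproduce R-sq = 32.0% and S = 8.67 from the output, up to rounding of the printed sums of squares.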

60 INTERPRETING THE VALUES OF s_e AND r² A small value of s_e indicates that residuals tend to be small. This value tells you how much accuracy you can expect when using the least squares regression line to make predictions. A large value of r² indicates that a large proportion of the variability in y can be explained by the approximate linear relationship between x and y. This tells you that knowing the value of x is helpful for predicting y. A useful regression line will have a reasonably small value of s_e and a reasonably large value of r².

61 A study (Archives of General Psychiatry[2010]: 570-577) looked at how working memory capacity was related to scores on a test of cognitive functioning and to scores on an IQ test. Two groups were studied – one group consisted of patients diagnosed with schizophrenia and the other group consisted of healthy control subjects.

62 EXAMPLE - CALCULATOR Find the least squares regression line, r, r², and s_e for the following data.
Hours, x:   3   5   2   8   2   4   4   5   6   3
Scores, y:  65  80  60  88  66  78  85  90  __  71

63 PUTTING IT ALL TOGETHER Describing Linear Relationships Making Predictions

64 STEPS IN A LINEAR REGRESSION ANALYSIS
1. Summarize the data graphically by constructing a scatterplot.
2. Based on the scatterplot, decide if it looks like the relationship between x and y is approximately linear. If so, proceed to the next step.
3. Find the equation of the least squares regression line.
4. Construct a residual plot and look for any patterns or unusual features that may indicate that a line is not the best way to summarize the relationship between x and y. If none are found, proceed to the next step.
5. Compute the values of s_e and r² and interpret them in context.
6. Based on what you have learned from the residual plot and the values of s_e and r², decide whether the least squares regression line is useful for making predictions. If so, proceed to the last step.
7. Use the least squares regression line to make predictions.

65 REVISIT THE CRIME SCENE DNA DATA Recall that the scientists were interested in predicting the age of a crime scene victim (y) using the blood test measure (x). Step 1: The scientists first constructed a scatterplot of the data. Step 2: Based on the scatterplot, it does appear that there is a reasonably strong negative linear relationship between age and the blood test measure.

66 Step 4: A residual plot constructed from these data showed a few observations with large residuals, but these observations were not far removed from the rest of the data in the x direction. The observations were not judged to be influential. Also, there were no unusual patterns in the residual plot that would suggest a nonlinear relationship between age and the blood test measure. Step 5: s_e = 8.9 and r² = 0.835. Approximately 83.5% of the variability in age can be explained by the linear relationship. A typical difference between the predicted age and the actual age would be about 9 years.

67 Step 6: Based on the residual plot, the large value of r², and the relatively small value of s_e, the scientists proposed using the blood test measure and the least squares regression line as a way to estimate ages of crime victims.

68 COMMON MISTAKES

69 AVOID THESE COMMON MISTAKES 1. Correlation does not imply causation. A strong correlation implies only that the two variables tend to vary together in a predictable way, but there are many possible explanations for why this is occurring other than one variable causing change in the other.

70 AVOID THESE COMMON MISTAKES 2. A correlation coefficient near 0 does not necessarily imply that there is no relationship between two variables. Although the variables may be unrelated, it is also possible that there is a strong but nonlinear relationship.
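A small numerical example makes this mistake concrete. The data below are hypothetical, generated from the exact quadratic y = (x - 8)² + 4 purely for illustration, so y is completely determined by x, yet the correlation coefficient is zero. The helper name `pearson_r` is also illustrative.

```python
from math import sqrt

# Hypothetical data following an exact curve: y = (x - 8)^2 + 4.
x = [2, 4, 6, 8, 10, 12, 14]
y = [(xi - 8) ** 2 + 4 for xi in x]

def pearson_r(x, y):
    """Pearson's sample correlation coefficient."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

r = pearson_r(x, y)
print(r)
```

Because the curve is symmetric about x = 8, the positive and negative products cancel and r comes out to 0, even though the relationship is perfectly predictable. This is why a scatterplot should always accompany r.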

71 AVOID THESE COMMON MISTAKES 3. The least squares regression line for predicting y from x is NOT the same line as the least squares regression line for predicting x from y. The ages (x, in months) and heights (y, in inches) of seven children are given.
x:  16  24  42  60  75  102  120
y:  24  30  35  40  48  56   60
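The difference between the two regression lines can be checked with the children's data. The sketch below fits height on age and then age on height. If the two lines were the same, the second slope would be the reciprocal of the first. `slope_of` is an illustrative helper name.

```python
# Ages (months) and heights (inches) of seven children.
age = [16, 24, 42, 60, 75, 102, 120]
height = [24, 30, 35, 40, 48, 56, 60]

def slope_of(response, explanatory):
    """Slope of the least squares line predicting response from explanatory."""
    n = len(response)
    e_bar = sum(explanatory) / n
    r_bar = sum(response) / n
    s_er = sum((e - e_bar) * (r - r_bar) for e, r in zip(explanatory, response))
    s_ee = sum((e - e_bar) ** 2 for e in explanatory)
    return s_er / s_ee

b_height_on_age = slope_of(height, age)   # inches per month
b_age_on_height = slope_of(age, height)   # months per inch

# The same line would require b_age_on_height == 1 / b_height_on_age.
print(b_age_on_height, 1 / b_height_on_age)

# The product of the two slopes equals r squared, which is below 1 here.
slope_product = b_height_on_age * b_age_on_height
print(slope_product)
```

The two printed slopes are close but not equal, because each line minimizes squared deviations in a different direction (vertical for one, horizontal for the other). They would coincide only if r were exactly 1 or -1.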

72 AVOID THESE COMMON MISTAKES 4. Beware of extrapolation. Using the least squares regression line to make predictions outside the range of x values in the data set often leads to poor predictions. Predict the height of a child that is 15 years (180 months) old.
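The extrapolation question can be worked through numerically with the ages (in months) and heights (in inches) of the seven children given for these common-mistake examples. This is a sketch; `fit_line` is an illustrative helper name.

```python
# Ages (months) and heights (inches) of seven children.
age = [16, 24, 42, 60, 75, 102, 120]
height = [24, 30, 35, 40, 48, 56, 60]

def fit_line(x, y):
    """Return (intercept, slope) of the least squares line y-hat = a + b x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

intercept, slope = fit_line(age, height)

# 180 months is well beyond the largest observed age of 120 months.
predicted_height_180 = intercept + slope * 180
print(round(predicted_height_180, 1))
```

The line predicts a height of roughly 82 inches, close to 7 feet, for a 15-year-old. The linear growth seen between 16 and 120 months does not continue at the same rate, so the prediction is unrealistic.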

73 AVOID THESE COMMON MISTAKES 5. Be careful in interpreting the value of the intercept of the least squares regression line. In many instances, interpreting the intercept as the value of y that would be predicted when x = 0 is equivalent to extrapolating way beyond the range of x values in the data set. The ages (x, in months) and heights (y, in inches) of seven children are given.
x:  16  24  42  60  75  102  120
y:  24  30  35  40  48  56   60

74 AVOID THESE COMMON MISTAKES 6. Remember that the least squares regression line may be the “best” line, but that doesn’t necessarily mean that the line will produce good predictions.

75 AVOID THESE COMMON MISTAKES 7. It is not enough to look at just r² or just s_e when evaluating the regression line. Remember to consider both values. In general, you would like to have both a small value for s_e and a large value for r².

76 AVOID THESE COMMON MISTAKES 8. The value of the correlation coefficient, as well as the values for the intercept and slope of the least squares regression line, can be sensitive to influential observations in the data set, particularly if the sample size is small.

