STATISTICS INFORMED DECISIONS USING DATA



Presentation on theme: "STATISTICS INFORMED DECISIONS USING DATA"— Presentation transcript:

1 STATISTICS INFORMED DECISIONS USING DATA
Fifth Edition Chapter 4 Describing the Relation between Two Variables

2 4.1 Scatter Diagrams and Correlation Learning Objectives
1. Draw and interpret scatter diagrams
2. Describe the properties of the linear correlation coefficient
3. Compute and interpret the linear correlation coefficient
4. Determine whether a linear relation exists between two variables
5. Explain the difference between correlation and causation

3 4.1 Scatter Diagrams and Correlation Draw and Interpret Scatter Diagrams (1 of 6)
The response variable is the variable whose value can be explained by the value of the explanatory, or predictor, variable. A scatter diagram is a graph that shows the relationship between two quantitative variables measured on the same individual. Each individual in the data set is represented by a point in the scatter diagram. The explanatory variable is plotted on the horizontal axis, and the response variable is plotted on the vertical axis.

4 4.1 Scatter Diagrams and Correlation Draw and Interpret Scatter Diagrams (2 of 6)
EXAMPLE Drawing and Interpreting a Scatter Diagram
The data below are based on a study for drilling rock. The researchers wanted to determine whether the time it takes to dry drill a distance of 5 feet in rock increases with the depth at which the drilling begins. So, depth at which drilling begins is the explanatory variable, x, and time (in minutes) to drill five feet is the response variable, y. Draw a scatter diagram of the data.
Depth at Which Drilling Begins, x (in feet): 35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190
Time to Drill 5 Feet, y (in minutes): 5.88, 5.99, 6.74, 6.1, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.9
Source: Penner, R., and Watts, D.G. "Mining Information." The American Statistician, Vol. 45, No. 1, Feb. 1991, p. 6.

5 4.1 Scatter Diagrams and Correlation Draw and Interpret Scatter Diagrams (3 of 6)

6 4.1 Scatter Diagrams and Correlation Draw and Interpret Scatter Diagrams (4 of 6)
Various Types of Relations in a Scatter Diagram

7 4.1 Scatter Diagrams and Correlation Draw and Interpret Scatter Diagrams (5 of 6)
Two variables that are linearly related are positively associated when above-average values of one variable are associated with above-average values of the other variable and below-average values of one variable are associated with below-average values of the other variable. That is, two variables are positively associated if, whenever the value of one variable increases, the value of the other variable also increases.

8 4.1 Scatter Diagrams and Correlation Draw and Interpret Scatter Diagrams (6 of 6)
Two variables that are linearly related are negatively associated when above-average values of one variable are associated with below-average values of the other variable. That is, two variables are negatively associated if, whenever the value of one variable increases, the value of the other variable decreases.

9 4.1 Scatter Diagrams and Correlation Describe the Properties of the Linear Correlation Coefficient (1 of 6)
The linear correlation coefficient, or Pearson product moment correlation coefficient, is a measure of the strength and direction of the linear relation between two quantitative variables. The Greek letter ρ (rho) represents the population correlation coefficient, and r represents the sample correlation coefficient. We present only the formula for the sample correlation coefficient.

10 4.1 Scatter Diagrams and Correlation Describe the Properties of the Linear Correlation Coefficient (2 of 6)
Sample Linear Correlation Coefficient
r = Σ [ ((x_i − x̄)/s_x) ((y_i − ȳ)/s_y) ] / (n − 1)
where x̄ and s_x are the sample mean and sample standard deviation of the explanatory variable, ȳ and s_y are the sample mean and sample standard deviation of the response variable, and n is the number of individuals in the sample.
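The sample correlation coefficient can be computed directly from this definition (the sum of z-score products divided by n − 1). A sketch in plain Python using the drilling data, whose correlation later slides report as 0.773:

```python
import math

# Drilling data from the earlier example: depth (feet) and time to drill 5 feet (minutes)
x = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]
y = [5.88, 5.99, 6.74, 6.1, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.9]

def sample_corr(x, y):
    """Pearson's sample r: sum of z-score products divided by n - 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / (n - 1)

r = sample_corr(x, y)
print(round(r, 3))  # 0.773, matching the value reported on the later slides
```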

11 4.1 Scatter Diagrams and Correlation Describe the Properties of the Linear Correlation Coefficient (3 of 6)
Properties of the Linear Correlation Coefficient
1. The linear correlation coefficient is always between −1 and 1, inclusive. That is, −1 ≤ r ≤ 1.
2. If r = +1, then a perfect positive linear relation exists between the two variables.
3. If r = −1, then a perfect negative linear relation exists between the two variables.
4. The closer r is to +1, the stronger the evidence is of a positive association between the two variables.
5. The closer r is to −1, the stronger the evidence is of a negative association between the two variables.

12 4.1 Scatter Diagrams and Correlation Describe the Properties of the Linear Correlation Coefficient (4 of 6)
6. If r is close to 0, then little or no evidence exists of a linear relation between the two variables. So r close to 0 does not imply no relation, just no linear relation.
7. The linear correlation coefficient is a unitless measure of association. So the unit of measure for x and y plays no role in the interpretation of r.
8. The correlation coefficient is not resistant. Therefore, an observation that does not follow the overall pattern of the data could affect the value of the linear correlation coefficient.
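Properties 7 and 8 are easy to check numerically. A minimal sketch with the drilling data (the feet-to-meters conversion and the added outlier point are illustrative choices, not from the slides):

```python
import math

x = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]  # depth in feet
y = [5.88, 5.99, 6.74, 6.1, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.9]

def corr(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Property 7: r is unitless, so converting depths from feet to meters leaves r unchanged
r = corr(x, y)
r_meters = corr([xi * 0.3048 for xi in x], y)
print(abs(r - r_meters) < 1e-9)  # True

# Property 8: r is not resistant; one point far from the pattern changes r noticeably
r_out = corr(x + [200], y + [2.0])
print(r_out < r)  # True: the added point pulls r sharply downward
```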

13 4.1 Scatter Diagrams and Correlation Describe the Properties of the Linear Correlation Coefficient (5 of 6)

14 4.1 Scatter Diagrams and Correlation Describe the Properties of the Linear Correlation Coefficient (6 of 6)

15 4.1 Scatter Diagrams and Correlation Compute and Interpret the Linear Correlation Coefficient (1 of 5)
EXAMPLE Determining the Linear Correlation Coefficient
Determine the linear correlation coefficient of the drilling data.
Depth at Which Drilling Begins, x (in feet): 35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190
Time to Drill 5 Feet, y (in minutes): 5.88, 5.99, 6.74, 6.1, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.9

16 4.1 Scatter Diagrams and Correlation Compute and Interpret the Linear Correlation Coefficient (2 of 5)

17 4.1 Scatter Diagrams and Correlation Compute and Interpret the Linear Correlation Coefficient (3 of 5)

18 4.1 Scatter Diagrams and Correlation Compute and Interpret the Linear Correlation Coefficient (4 of 5)
IN CLASS ACTIVITY Correlation
Randomly select six students from the class and have them determine their at-rest pulse rates and then discuss the following:
1. When determining each at-rest pulse rate, would it be better to count beats for 30 seconds and multiply by 2 or count beats for 1 full minute? Explain.
2. What are some other ways to find the at-rest pulse rate? Do any of these methods have an advantage?
3. What effect will physical activity have on pulse rate?
4. Do you think the at-rest pulse rate will have any effect on the pulse rate after physical activity? If so, how? If not, why not?
Have the same six students jog in place for 3 minutes and then immediately determine their pulse rates using the same technique as for the at-rest pulse rates.

19 4.1 Scatter Diagrams and Correlation Compute and Interpret the Linear Correlation Coefficient (5 of 5)
Draw a scatter diagram for the pulse data using the at-rest data as the explanatory variable. Comment on the relationship, if any, between the two variables. Is this consistent with your expectations?
Based on the graph, estimate the linear correlation coefficient for the data. Then compute the correlation coefficient and compare it to your estimate.

20 4.1 Scatter Diagrams and Correlation Determine whether a Linear Relation Exists between Two Variables (1 of 2)
Testing for a Linear Relation
Step 1 Determine the absolute value of the correlation coefficient.
Step 2 Find the critical value in Table II for the given sample size.
Step 3 If the absolute value of the correlation coefficient is greater than the critical value, we say a linear relation exists between the two variables. Otherwise, no linear relation exists.

21 4.1 Scatter Diagrams and Correlation Determine whether a Linear Relation Exists between Two Variables (2 of 2)
EXAMPLE Does a Linear Relation Exist?
Determine whether a linear relation exists between time to drill five feet and depth at which drilling begins. Comment on the type of relation that appears to exist between the two variables.
The correlation between drilling depth and time to drill is 0.773. The critical value for n = 12 observations is 0.576. Since 0.773 > 0.576, a positive linear relation exists between time to drill five feet and depth at which drilling begins.
Table II Critical Values for the Correlation Coefficient
n: 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
Critical value: 0.997, 0.950, 0.878, 0.811, 0.754, 0.707, 0.666, 0.632, 0.602, 0.576, 0.553, 0.532
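The three-step test is mechanical enough to sketch in code. The critical values below are the Table II entries quoted in this deck, plus the n = 15 entry (0.514) cited later in the bone mineral density example:

```python
# Table II critical values for the correlation coefficient, keyed by sample size n
CRITICAL = {3: 0.997, 4: 0.950, 5: 0.878, 6: 0.811, 7: 0.754, 8: 0.707,
            9: 0.666, 10: 0.632, 11: 0.602, 12: 0.576, 13: 0.553, 14: 0.532,
            15: 0.514}

def linear_relation_exists(r, n):
    """Steps 1-3 of the test: compare |r| with the critical value for sample size n."""
    return abs(r) > CRITICAL[n]

print(linear_relation_exists(0.773, 12))   # True: the drilling data
print(linear_relation_exists(-0.806, 15))  # True: the cola data (negative relation)
```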

22 4.1 Scatter Diagrams and Correlation Explain the Difference between Correlation and Causation (1 of 8)
According to data obtained from the Statistical Abstract of the United States, the correlation between the percentage of the female population with a bachelor's degree and the percentage of births to unmarried mothers since 1990 is positive and high. Does this mean that a higher percentage of females with bachelor's degrees causes a higher percentage of births to unmarried mothers?

23 4.1 Scatter Diagrams and Correlation Explain the Difference between Correlation and Causation (2 of 8)
Certainly not! The correlation exists only because both percentages have been increasing since 1990. It is this relation that causes the high correlation. In general, time series data (data collected over time) may have high correlations because each variable is moving in a specific direction over time (both going up or down over time, or one increasing while the other is decreasing). When data are observational, we cannot claim a causal relation exists between two variables. We can only claim causality when the data are collected through a designed experiment.

24 4.1 Scatter Diagrams and Correlation Explain the Difference between Correlation and Causation (3 of 8)
Another way that two variables can be related even though there is not a causal relation is through a lurking variable. A lurking variable is related to both the explanatory and response variable. For example, ice cream sales and crime rates have a very high correlation. Does this mean that local governments should shut down all ice cream shops? No! The lurking variable is temperature. As air temperatures rise, both ice cream sales and crime rates rise.

25 4.1 Scatter Diagrams and Correlation Explain the Difference between Correlation and Causation (4 of 8)
EXAMPLE Lurking Variables in a Bone Mineral Density Study
Because colas tend to replace healthier beverages, and colas contain caffeine and phosphoric acid, researchers Katherine L. Tucker and associates wanted to know whether cola consumption is associated with lower bone mineral density in women. Table 4 lists the typical number of cans of cola consumed in a week and the femoral neck bone mineral density for a sample of 15 women. The data were collected through a prospective cohort study.
Table 4
Number of Colas per Week: 0, 0, 1, 1, 2, 2, 3, 3, 4, 5, 6, 7, 7, 8
Bone Mineral Density (g/cm²): 0.893, 0.882, 0.891, 0.881, 0.888, 0.871, 0.868, 0.876, 0.873, 0.875, 0.867, 0.862, 0.872, 0.865

26 4.1 Scatter Diagrams and Correlation Explain the Difference between Correlation and Causation (5 of 8)
EXAMPLE Lurking Variables in a Bone Mineral Density Study
The figure on the next slide shows the scatter diagram of the data. The correlation between number of colas per week and bone mineral density is −0.806. The critical value for correlation with n = 15 from Table II in Appendix A is 0.514. Because |−0.806| = 0.806 > 0.514, we conclude a negative linear relation exists between number of colas consumed and bone mineral density. Can the authors conclude that an increase in the number of colas consumed causes a decrease in bone mineral density? Identify some lurking variables in the study.

27 4.1 Scatter Diagrams and Correlation Explain the Difference between Correlation and Causation (6 of 8)

28 4.1 Scatter Diagrams and Correlation Explain the Difference between Correlation and Causation (7 of 8)
EXAMPLE Lurking Variables in a Bone Mineral Density Study
In prospective cohort studies, data are collected on a group of subjects through questionnaires and surveys over time. Therefore, the data are observational. So the researchers cannot claim that increased cola consumption causes a decrease in bone mineral density. Some lurking variables in the study that could confound the results are:
body mass index
height
smoking
alcohol consumption
calcium intake
physical activity

29 4.1 Scatter Diagrams and Correlation Explain the Difference between Correlation and Causation (8 of 8)
EXAMPLE Lurking Variables in a Bone Mineral Density Study
The authors were careful to say that increased cola consumption is associated with lower bone mineral density because of potential lurking variables. They never stated that increased cola consumption causes lower bone mineral density.

30 4.2 Least-squares Regression Learning Objectives
1. Find the least-squares regression line and use the line to make predictions
2. Interpret the slope and the y-intercept of the least-squares regression line
3. Compute the sum of squared residuals

31 4.2 Least-squares Regression
EXAMPLE Finding an Equation that Describes Linearly Related Data (1 of 2)
Using the following sample data:
x: 2, 3, 5, 6
y: 5.8, 5.7, 5.2, 2.8, 1.9, 2.2
32 4.2 Least-squares Regression
EXAMPLE Finding an Equation that Describes Linearly Related Data (2 of 2)
(b) Graph the equation on the scatter diagram.

33 4.2 Least-squares Regression Find the Least-Squares Regression Line and Use the Line to Make Predictions (1 of 7)
The difference between the observed value of y and the predicted value of y is the error, or residual. Using the line from the last example, the predicted value at x = 3 is 4.75, so:
residual = observed y − predicted y = 5.2 − 4.75 = 0.45

34 4.2 Least-squares Regression Find the Least-Squares Regression Line and Use the Line to Make Predictions (2 of 7)
Least-Squares Regression Criterion
The least-squares regression line is the line that minimizes the sum of the squared errors (residuals); that is, it is the line that minimizes Σ(y − ŷ)².

35 4.2 Least-squares Regression Find the Least-Squares Regression Line and Use the Line to Make Predictions (3 of 7)
The Least-Squares Regression Line
The equation of the least-squares regression line is given by ŷ = b1 x + b0, where
b1 = r (s_y / s_x) is the slope of the least-squares regression line, and
b0 = ȳ − b1 x̄ is the y-intercept of the least-squares regression line.

36 4.2 Least-squares Regression Find the Least-Squares Regression Line and Use the Line to Make Predictions (4 of 7)
The Least-Squares Regression Line

37 4.2 Least-squares Regression Find the Least-Squares Regression Line and Use the Line to Make Predictions (5 of 7)
EXAMPLE Finding the Least-Squares Regression Line
Using the drilling data:
(a) Find the least-squares regression line.
(b) Predict the drilling time if drilling starts at 130 feet.
(c) Is the observed drilling time at 130 feet above, or below, average?
(d) Draw the least-squares regression line on the scatter diagram of the data.
Depth at Which Drilling Begins, x (in feet): 35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190
Time to Drill 5 Feet, y (in minutes): 5.88, 5.99, 6.74, 6.1, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.9
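A sketch of the computation for parts (a) through (c) in plain Python, using the equivalent form b1 = Sxy/Sxx (algebraically the same as the slope formula b1 = r(s_y/s_x)):

```python
# Drilling data: depth (feet) and time to drill 5 feet (minutes)
x = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]
y = [5.88, 5.99, 6.74, 6.1, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.9]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)

b1 = sxy / sxx       # slope; algebraically equal to r * (sy / sx)
b0 = my - b1 * mx    # y-intercept
print(round(b1, 4), round(b0, 2))  # 0.0116 5.53

# (b) predicted time at 130 feet, and (c) comparison with the observed 6.93 minutes
y_hat_130 = b0 + b1 * 130
print(round(y_hat_130, 2))   # 7.03
print(6.93 < y_hat_130)      # True: the observed time at 130 feet is below average
```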

38 4.2 Least-squares Regression Find the Least-Squares Regression Line and Use the Line to Make Predictions (6 of 7)
The observed drilling time at 130 feet is 6.93 minutes. The predicted drilling time is about 7.03 minutes. Because the observed time is less than the predicted time, the drilling time of 6.93 minutes is below average.

39 4.2 Least-squares Regression Find the Least-Squares Regression Line and Use the Line to Make Predictions (7 of 7)

40 4.2 Least-squares Regression Interpret the Slope and the y-Intercept of the Least-Squares Regression Line (1 of 3)
Interpretation of Slope: The slope of the regression line is 0.0116. For each additional foot of depth at which we start drilling, the time to drill five feet increases by 0.0116 minutes, on average.

41 4.2 Least-squares Regression Interpret the Slope and the y-Intercept of the Least-Squares Regression Line (2 of 3)
Interpretation of the y-Intercept: The y-intercept of the regression line is 5.53. To interpret the y-intercept, we must first ask two questions:
1. Is 0 a reasonable value for the explanatory variable?
2. Do any observations near x = 0 exist in the data set?
A value of 0 is reasonable for the drilling data (it indicates that drilling begins at the surface of Earth). The smallest observation in the data set is x = 35 feet, which is reasonably close to 0. So, interpretation of the y-intercept is reasonable: the time to drill five feet when we begin drilling at the surface of Earth is 5.53 minutes.

42 4.2 Least-squares Regression Interpret the Slope and the y-Intercept of the Least-Squares Regression Line (3 of 3)
If the least-squares regression line is used to make predictions based on values of the explanatory variable that are much larger or much smaller than the observed values, we say the researcher is working outside the scope of the model. Never use a least-squares regression line to make predictions outside the scope of the model, because we can't be sure the linear relation continues to exist.

43 4.2 Least-squares Regression Compute the Sum of Squared Residuals
To illustrate the fact that the sum of squared residuals for a least-squares regression line is less than the sum of squared residuals for any other line, use the "regression by eye" applet.
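For readers without the applet, the same fact can be checked numerically. A minimal sketch using the drilling data and rounded least-squares coefficients computed from it (5.53 and 0.0116), perturbing the line in several directions:

```python
x = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]
y = [5.88, 5.99, 6.74, 6.1, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.9]

def sse(b0, b1):
    """Sum of squared residuals for the line y = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

b0, b1 = 5.53, 0.0116   # least-squares coefficients (rounded) for the drilling data
base = sse(b0, b1)

# Moving the line away from the least-squares fit increases the sum of squared residuals
perturbed = [sse(b0 + d0, b1 + d1)
             for d0 in (-0.5, 0.0, 0.5) for d1 in (-0.005, 0.0, 0.005)
             if (d0, d1) != (0.0, 0.0)]
print(all(base < s for s in perturbed))  # True
```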

44 4.3 Diagnostics on the Least-squares Regression Line Learning Objectives
1. Compute and interpret the coefficient of determination
2. Perform residual analysis on a regression model
3. Identify influential observations

45 4.3 Diagnostics on the Least-squares Regression Line Compute and Interpret the Coefficient of Determination (1 of 18)
The coefficient of determination, R², measures the proportion of total variation in the response variable that is explained by the least-squares regression line. The coefficient of determination is a number between 0 and 1, inclusive. That is, 0 ≤ R² ≤ 1. If R² = 0, the line has no explanatory value. If R² = 1, the line explains 100% of the variation in the response variable.

46 4.3 Diagnostics on the Least-squares Regression Line Compute and Interpret the Coefficient of Determination (2 of 18)
The data below are based on a study for drilling rock. The researchers wanted to determine whether the time it takes to dry drill a distance of 5 feet in rock increases with the depth at which the drilling begins. So, depth at which drilling begins is the predictor variable, x, and time (in minutes) to drill five feet is the response variable, y.
Depth at Which Drilling Begins, x (in feet): 35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190
Time to Drill 5 Feet, y (in minutes): 5.88, 5.99, 6.74, 6.1, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.9
Source: Penner, R., and Watts, D.G. "Mining Information." The American Statistician, Vol. 45, No. 1, Feb. 1991, p. 6.

47 4.3 Diagnostics on the Least-squares Regression Line Compute and Interpret the Coefficient of Determination (3 of 18)

48 4.3 Diagnostics on the Least-squares Regression Line Compute and Interpret the Coefficient of Determination (4 of 18)
Sample Statistics
Depth: mean 126.2, standard deviation 52.2
Time: mean 6.99, standard deviation 0.781
Correlation between Depth and Time: 0.773
Regression Analysis
The regression equation is Time = 5.53 + 0.0116 Depth

49 4.3 Diagnostics on the Least-squares Regression Line Compute and Interpret the Coefficient of Determination (5 of 18)
Suppose we were asked to predict the time to drill an additional 5 feet, but we did not know the current depth of the drill. What would be our best "guess"?

50 4.3 Diagnostics on the Least-squares Regression Line Compute and Interpret the Coefficient of Determination (6 of 18)
Suppose we were asked to predict the time to drill an additional 5 feet, but we did not know the current depth of the drill. What would be our best "guess"?
ANSWER: The mean time to drill an additional 5 feet: 6.99 minutes

51 4.3 Diagnostics on the Least-squares Regression Line Compute and Interpret the Coefficient of Determination (7 of 18)
Now suppose that we are asked to predict the time to drill an additional 5 feet when the current depth of the drill is 160 feet.
ANSWER: Use the least-squares regression line: predicted time = 5.53 + 0.0116(160) ≈ 7.38 minutes

52 4.3 Diagnostics on the Least-squares Regression Line Compute and Interpret the Coefficient of Determination (8 of 18)

53 4.3 Diagnostics on the Least-squares Regression Line Compute and Interpret the Coefficient of Determination (9 of 18)
The difference between the observed value of the response variable and the mean value of the response variable is called the total deviation and is equal to y − ȳ.

54 4.3 Diagnostics on the Least-squares Regression Line Compute and Interpret the Coefficient of Determination (10 of 18)
The difference between the predicted value of the response variable and the mean value of the response variable is called the explained deviation and is equal to ŷ − ȳ.

55 4.3 Diagnostics on the Least-squares Regression Line Compute and Interpret the Coefficient of Determination (11 of 18)
The difference between the observed value of the response variable and the predicted value of the response variable is called the unexplained deviation and is equal to y − ŷ.

56 4.3 Diagnostics on the Least-squares Regression Line Compute and Interpret the Coefficient of Determination (12 of 18)
Total Deviation = Unexplained Deviation + Explained Deviation
(y − ȳ) = (y − ŷ) + (ŷ − ȳ)

57 4.3 Diagnostics on the Least-squares Regression Line Compute and Interpret the Coefficient of Determination (13 of 18)
Total Deviation = Unexplained Deviation + Explained Deviation

58 4.3 Diagnostics on the Least-squares Regression Line Compute and Interpret the Coefficient of Determination (14 of 18)
Total Variation = Unexplained Variation + Explained Variation
Σ(y − ȳ)² = Σ(y − ŷ)² + Σ(ŷ − ȳ)²
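This identity (which holds exactly for the least-squares line) can be verified numerically with the drilling data:

```python
x = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]
y = [5.88, 5.99, 6.74, 6.1, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.9]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx
y_hat = [b0 + b1 * xi for xi in x]

total       = sum((yi - my) ** 2 for yi in y)                  # sum of (y - ybar)^2
explained   = sum((yh - my) ** 2 for yh in y_hat)              # sum of (yhat - ybar)^2
unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # sum of (y - yhat)^2

print(abs(total - (explained + unexplained)) < 1e-9)  # True
```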

59 4.3 Diagnostics on the Least-squares Regression Line Compute and Interpret the Coefficient of Determination (15 of 18)
R² = Explained Variation / Total Variation. To determine R² for the linear regression model, simply square the value of the linear correlation coefficient: R² = r².

60 4.3 Diagnostics on the Least-squares Regression Line Compute and Interpret the Coefficient of Determination (16 of 18)
EXAMPLE Determining the Coefficient of Determination
Find and interpret the coefficient of determination for the drilling data. Because the linear correlation coefficient, r, is 0.773, we have that R² = 0.773² = 0.5975 = 59.75%. So, 59.75% of the variability in drilling time is explained by the least-squares regression line.
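A quick check in plain Python (squaring the unrounded r gives about 0.597; the slide's 59.75% comes from squaring the rounded value 0.773):

```python
import math

x = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]
y = [5.88, 5.99, 6.74, 6.1, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.9]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

r = sxy / math.sqrt(sxx * syy)   # sample linear correlation coefficient
r2 = r ** 2                      # coefficient of determination
print(round(r, 3))       # 0.773
print(0.59 < r2 < 0.61)  # True: roughly 59.7% of the variation is explained
```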

61 4.3 Diagnostics on the Least-squares Regression Line Compute and Interpret the Coefficient of Determination (17 of 18)
DATA SET A, DATA SET B, DATA SET C (each with variables X and Y)
Draw a scatter diagram for each of these data sets. For each data set, the variance of y is the same.

62 4.3 Diagnostics on the Least-squares Regression Line Compute and Interpret the Coefficient of Determination (18 of 18)
Data Set A: 99.99% of the variability in y is explained by the least-squares regression line.
Data Set B: 94.7% of the variability in y is explained by the least-squares regression line.
Data Set C: 9.4% of the variability in y is explained by the least-squares regression line.

63 4.3 Diagnostics on the Least-squares Regression Line Perform Residual Analysis on a Regression Model (1 of 14)
Residuals play an important role in determining the adequacy of the linear model. In fact, residuals can be used for the following purposes:
1. To determine whether a linear model is appropriate to describe the relation between the predictor and response variables.
2. To determine whether the variance of the residuals is constant.
3. To check for outliers.

64 4.3 Diagnostics on the Least-squares Regression Line Perform Residual Analysis on a Regression Model (2 of 14)
If a plot of the residuals against the predictor variable shows a discernible pattern, such as a curve, then the response and predictor variable may not be linearly related.

65 4.3 Diagnostics on the Least-squares Regression Line Perform Residual Analysis on a Regression Model (3 of 14)

66 4.3 Diagnostics on the Least-squares Regression Line Perform Residual Analysis on a Regression Model (4 of 14)
EXAMPLE Is a Linear Model Appropriate?
A chemist has a 1000-gram sample of a radioactive material. She records the amount of radioactive material remaining in the sample every day for a week and obtains the following data.
Day: 0, 1, 2, 3, 4, 5, 6, 7
Weight (in grams): 1000.0, 897.1, 802.5, 719.8, 651.1, 583.4, 521.7, 468.3

67 4.3 Diagnostics on the Least-squares Regression Line Perform Residual Analysis on a Regression Model (5 of 14)
Linear correlation coefficient: −0.994

68 4.3 Diagnostics on the Least-squares Regression Line Perform Residual Analysis on a Regression Model (6 of 14)

69 4.3 Diagnostics on the Least-squares Regression Line Perform Residual Analysis on a Regression Model (7 of 14)
Linear model not appropriate
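The sign pattern of the residuals makes the curvature visible even without a plot. A sketch with the chemist's data (fitting the least-squares line and listing residual signs):

```python
day = [0, 1, 2, 3, 4, 5, 6, 7]
weight = [1000.0, 897.1, 802.5, 719.8, 651.1, 583.4, 521.7, 468.3]

n = len(day)
mx, my = sum(day) / n, sum(weight) / n
b1 = sum((d - mx) * (w - my) for d, w in zip(day, weight)) / \
     sum((d - mx) ** 2 for d in day)
b0 = my - b1 * mx

residuals = [w - (b0 + b1 * d) for d, w in zip(day, weight)]
signs = ''.join('+' if res > 0 else '-' for res in residuals)
print(signs)  # ++----++ : positive at the ends, negative in the middle, i.e. a curve
```

The run of signs (+ + − − − − + +) is exactly the U-shaped residual pattern that signals a nonlinear (here, exponential-decay) relation, so a linear model is not appropriate.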

70 4.3 Diagnostics on the Least-squares Regression Line Perform Residual Analysis on a Regression Model (8 of 14)
If a plot of the residuals against the explanatory variable shows the spread of the residuals increasing or decreasing as the explanatory variable increases, then a strict requirement of the linear model is violated. This requirement is called constant error variance. The statistical term for constant error variance is homoscedasticity.

71 4.3 Diagnostics on the Least-squares Regression Line Perform Residual Analysis on a Regression Model (9 of 14)

72 4.3 Diagnostics on the Least-squares Regression Line Perform Residual Analysis on a Regression Model (10 of 14)
A plot of residuals against the explanatory variable may also reveal outliers. These values will be easy to identify because the residual will lie far from the rest of the plot.

73 4.3 Diagnostics on the Least-squares Regression Line Perform Residual Analysis on a Regression Model (11 of 14)

74 4.3 Diagnostics on the Least-squares Regression Line Perform Residual Analysis on a Regression Model (12 of 14)
EXAMPLE Residual Analysis
Draw a residual plot of the drilling time data. Comment on the appropriateness of the linear least-squares regression model.

75 4.3 Diagnostics on the Least-squares Regression Line Perform Residual Analysis on a Regression Model (13 of 14)

76 4.3 Diagnostics on the Least-squares Regression Line Perform Residual Analysis on a Regression Model (14 of 14)
Boxplot of Residuals for the Drilling Data

77 4.3 Diagnostics on the Least-squares Regression Line Identify Influential Observations (1 of 8)
An influential observation is an observation that significantly affects the least-squares regression line's slope and/or y-intercept, or the value of the correlation coefficient.

78 4.3 Diagnostics on the Least-squares Regression Line Identify Influential Observations (2 of 8)
Influential observations typically exist when the point is an outlier relative to the values of the explanatory variable. So, Case 3 is likely influential.

79 4.3 Diagnostics on the Least-squares Regression Line Identify Influential Observations (3 of 8)
Influence is affected by two factors: (1) the relative vertical position of the observation (residuals) and (2) the relative horizontal position of the observation (leverage).

80 4.3 Diagnostics on the Least-squares Regression Line Identify Influential Observations (4 of 8)
EXAMPLE Influential Observations
Suppose an additional data point is added to the drilling data. At a depth of 300 feet, it took minutes to drill 5 feet. Is this point influential?
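The transcript omits the recorded drilling time for the 300-foot point, so the sketch below uses a hypothetical value of 12.5 minutes purely to illustrate how a single high-leverage point can move the slope:

```python
x = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]
y = [5.88, 5.99, 6.74, 6.1, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.9]

def slope(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
           sum((a - mx) ** 2 for a in x)

b1 = slope(x, y)
# Hypothetical added point: x = 300 has high leverage (far from the other depths);
# y = 12.5 minutes is an assumed value, not the one from the study
b1_new = slope(x + [300], y + [12.5])
print(round(b1, 4), round(b1_new, 4))
print(b1_new > 1.5 * b1)  # True: one added point changes the slope substantially
```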

81 4.3 Diagnostics on the Least-squares Regression Line Identify Influential Observations (5 of 8)

82 4.3 Diagnostics on the Least-squares Regression Line Identify Influential Observations (6 of 8)

83 4.3 Diagnostics on the Least-squares Regression Line Identify Influential Observations (7 of 8)

84 4.3 Diagnostics on the Least-squares Regression Line Identify Influential Observations (8 of 8)
As with outliers, influential observations should be removed only if there is justification to do so. When an influential observation occurs in a data set and its removal is not warranted, there are two courses of action:
(1) Collect more data so that additional points near the influential observation are obtained, or
(2) Use techniques that reduce the influence of the influential observation, such as a transformation or a different method of estimation (e.g., minimizing absolute deviations). These techniques are beyond the scope of this text.

85 4.4 Contingency Tables and Association Learning Objectives
1. Compute the marginal distribution of a variable
2. Use the conditional distribution to identify association among categorical data
3. Explain Simpson's Paradox

86 4.4 Contingency Tables and Association Example: Data Information
A professor at a community college in New Mexico conducted a study to assess the effectiveness of delivering an introductory statistics course via three methods: the traditional lecture-based method, online delivery (no classroom instruction), and hybrid instruction (an online course with weekly meetings). The grades students received under each delivery method were tallied.

Grade  Traditional  Online  Hybrid
A            36       39      24
B            52       55      66
C            57       68      90
D            46       38      41
F            46       54      31

87 4.4 Contingency Tables and Association Compute the Marginal Distribution of a Variable (1 of 3) A marginal distribution of a variable is a frequency or relative frequency distribution of either the row or column variable in the contingency table.

88 4.4 Contingency Tables and Association Compute the Marginal Distribution of a Variable (2 of 3) EXAMPLE Determining Frequency Marginal Distributions A professor at a community college in New Mexico conducted a study to assess the effectiveness of delivering an introductory statistics course via three methods: the traditional lecture-based method, online delivery (no classroom instruction), and hybrid instruction (an online course with weekly meetings). The grades students received under each delivery method were tallied. Find the frequency marginal distributions for course grade and delivery method.

Grade  Traditional  Online  Hybrid  Total
A            36       39      24      99
B            52       55      66     173
C            57       68      90     215
D            46       38      41     125
F            46       54      31     131
Total       237      254     252     743
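The marginal distributions can be verified with a short script. This is a sketch, not part of the text; the cell counts are the grade counts from the professor's study, reconstructed to match the slide's row and column totals.

```python
# Frequency marginal distributions for the grade-by-delivery-method table.
# Rows are course grades; columns are delivery methods.

grades = ["A", "B", "C", "D", "F"]
methods = ["Traditional", "Online", "Hybrid"]
counts = {
    "A": [36, 39, 24],
    "B": [52, 55, 66],
    "C": [57, 68, 90],
    "D": [46, 38, 41],
    "F": [46, 54, 31],
}

# Marginal distribution of grade: sum across each row.
row_totals = {g: sum(counts[g]) for g in grades}

# Marginal distribution of delivery method: sum down each column.
col_totals = {m: sum(counts[g][j] for g in grades) for j, m in enumerate(methods)}

grand_total = sum(row_totals.values())

print(row_totals)   # {'A': 99, 'B': 173, 'C': 215, 'D': 125, 'F': 131}
print(col_totals)   # {'Traditional': 237, 'Online': 254, 'Hybrid': 252}
print(grand_total)  # 743
```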

89 4.4 Contingency Tables and Association Compute the Marginal Distribution of a Variable (3 of 3) EXAMPLE Determining Relative Frequency Marginal Distributions Determine the relative frequency marginal distribution for course grade and delivery method.

Grade       Traditional  Online  Hybrid  Relative Frequency
A                 36       39      24      0.133
B                 52       55      66      0.233
C                 57       68      90      0.289
D                 46       38      41      0.168
F                 46       54      31      0.176
Rel. Freq.     0.319    0.342   0.339      1.000
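A relative frequency marginal distribution divides each marginal total by the grand total. A minimal sketch, using the marginal totals from the example above:

```python
# Relative frequency marginal distributions: each marginal total divided
# by the grand total of 743 students, rounded to three decimal places.

row_totals = {"A": 99, "B": 173, "C": 215, "D": 125, "F": 131}
col_totals = {"Traditional": 237, "Online": 254, "Hybrid": 252}
grand_total = 743

rel_rows = {g: round(t / grand_total, 3) for g, t in row_totals.items()}
rel_cols = {m: round(t / grand_total, 3) for m, t in col_totals.items()}

print(rel_rows)  # {'A': 0.133, 'B': 0.233, 'C': 0.289, 'D': 0.168, 'F': 0.176}
print(rel_cols)  # {'Traditional': 0.319, 'Online': 0.342, 'Hybrid': 0.339}
```

Each relative frequency distribution sums to 1 up to rounding error, which is why the slide reports 1.000 in the total cell.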

90 4.4 Contingency Tables and Association Use the Conditional Distribution to Identify Association among Categorical Data (1 of 4) A conditional distribution lists the relative frequency of each category of the response variable, given a specific value of the explanatory variable in the contingency table.

91 4.4 Contingency Tables and Association Use the Conditional Distribution to Identify Association among Categorical Data (2 of 4) EXAMPLE Determining a Conditional Distribution Construct a conditional distribution of course grade by method of delivery. Comment on any type of association that may exist between course grade and delivery method.

Counts:
Grade  Traditional  Online  Hybrid
A            36       39      24
B            52       55      66
C            57       68      90
D            46       38      41
F            46       54      31

Conditional distribution of grade, given delivery method (each column divided by its column total):
Grade  Traditional  Online  Hybrid
A          0.152    0.154   0.095
B          0.219    0.217   0.262
C          0.241    0.268   0.357
D          0.194    0.150   0.163
F          0.194    0.213   0.123

It appears that students in the hybrid course are more likely to pass (A, B, or C) than students under either of the other two delivery methods.
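The conditional distribution can be computed directly. A sketch, again using the study's cell counts (reconstructed to match the slide's totals), organized column by column since we are conditioning on delivery method:

```python
# Conditional distribution of grade given delivery method: within each
# column, divide every cell count by that column's total.

methods = ["Traditional", "Online", "Hybrid"]
grades = ["A", "B", "C", "D", "F"]
counts = {
    "Traditional": [36, 52, 57, 46, 46],
    "Online":      [39, 55, 68, 38, 54],
    "Hybrid":      [24, 66, 90, 41, 31],
}

conditional = {}
for m in methods:
    total = sum(counts[m])
    conditional[m] = [round(c / total, 3) for c in counts[m]]

print(conditional["Hybrid"])  # [0.095, 0.262, 0.357, 0.163, 0.123]

# Proportion passing (grade A, B, or C) under each delivery method:
for m in methods:
    print(m, round(sum(conditional[m][:3]), 3))
# Traditional 0.612, Online 0.639, Hybrid 0.714
```

Comparing the pass proportions across columns is what justifies the slide's conclusion that hybrid students are more likely to pass.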

92 4.4 Contingency Tables and Association Use the Conditional Distribution to Identify Association among Categorical Data (3 of 4) EXAMPLE Drawing a Bar Graph of a Conditional Distribution Using the results of the previous example, draw a bar graph that represents the conditional distribution of method of delivery by grade earned.

93 4.4 Contingency Tables and Association Use the Conditional Distribution to Identify Association among Categorical Data (4 of 4) The following contingency table shows the survival status and demographics of passengers on the ill-fated Titanic. Draw a conditional bar graph of survival status by demographic characteristic.

           Men   Women  Boys  Girls
Survived   334     318    29     27
Died      1360     104    35     18

94 4.4 Contingency Tables and Association Explain Simpson's Paradox (1 of 6) EXAMPLE Illustrating Simpson's Paradox Insulin-dependent (Type 1) diabetes is a disease that results in the permanent destruction of the insulin-producing beta cells of the pancreas. Type 1 diabetes is lethal unless treated with insulin injections to replace the missing hormone. Individuals with insulin-independent (Type 2) diabetes can produce insulin internally. The data in the table below represent the survival status of 902 patients with diabetes, by type, over a 5-year period.

           Type 1  Type 2  Total
Survived      253     326    579
Died          105     218    323
Total         358     544    902

95 4.4 Contingency Tables and Association Explain Simpson's Paradox (2 of 6) EXAMPLE Illustrating Simpson's Paradox From the aggregated data, the five-year survival rate is 253/358 ≈ 70.7% for Type 1 patients and 326/544 ≈ 59.9% for Type 2 patients, so Type 2 diabetes appears to be the more lethal form.

           Type 1  Type 2  Total
Survived      253     326    579
Died          105     218    323
Total         358     544    902

96 4.4 Contingency Tables and Association Explain Simpson's Paradox (3 of 6) However, Type 2 diabetes is usually contracted after the age of 40. If we account for the variable age and divide the patients into two groups (those 40 or younger and those over 40), we obtain the data in the table below.

              Type 1        Type 2
           ≤ 40   > 40   ≤ 40   > 40   Total
Survived    129    124     15    311     579
Died          1    104      0    218     323
Total       130    228     15    529     902

97 4.4 Contingency Tables and Association Explain Simpson's Paradox (4 of 6) Among patients 40 or younger, the survival rate is 129/130 ≈ 99.2% for Type 1 and 15/15 = 100% for Type 2. Among patients over 40, it is 124/228 ≈ 54.4% for Type 1 and 311/529 ≈ 58.8% for Type 2. Within each age group, Type 2 patients survive at a rate at least as high as Type 1 patients, reversing the direction of the aggregated comparison.

              Type 1        Type 2
           ≤ 40   > 40   ≤ 40   > 40   Total
Survived    129    124     15    311     579
Died          1    104      0    218     323
Total       130    228     15    529     902

98 4.4 Contingency Tables and Association Explain Simpson's Paradox (5 of 6)

              Type 1        Type 2
           ≤ 40   > 40   ≤ 40   > 40   Total
Survived    129    124     15    311     579
Died          1    104      0    218     323
Total       130    228     15    529     902

99 4.4 Contingency Tables and Association Explain Simpson's Paradox (6 of 6) Simpson's Paradox describes a situation in which an association between two variables reverses or disappears when a third variable is introduced into the analysis.
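The reversal in the diabetes example can be checked with a few lines of arithmetic. A sketch, using the counts from this section (the two cells not legible in the transcript, Type 2 patients 40 or younger who died and that column's total, are recovered from the marginal totals):

```python
# Simpson's paradox with the diabetes survival data: aggregated,
# Type 1 patients survive at a higher rate, but stratified by age
# the comparison goes the other way in both age groups.

# Each value is a (survived, died) pair of counts.
aggregate = {"Type 1": (253, 105), "Type 2": (326, 218)}
by_age = {
    ("Type 1", "<=40"): (129, 1),
    ("Type 1", ">40"):  (124, 104),
    ("Type 2", "<=40"): (15, 0),    # died count recovered from the totals
    ("Type 2", ">40"):  (311, 218),
}

def survival_rate(survived, died):
    return survived / (survived + died)

for t, (s, d) in aggregate.items():
    print(t, round(survival_rate(s, d), 3))
# Type 1 0.707, Type 2 0.599: Type 1 looks better overall.

for (t, age), (s, d) in by_age.items():
    print(t, age, round(survival_rate(s, d), 3))
# <=40: Type 1 0.992 vs Type 2 1.0; >40: Type 1 0.544 vs Type 2 0.588.
# Within each age group, Type 2 survives at least as well: the
# aggregated comparison reverses once age is taken into account.
```

The lurking variable here is age: Type 2 patients are concentrated in the over-40 group, where survival is lower for everyone, which drags down their aggregated rate.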

