Published by Paula Stanley. Modified over 8 years ago.
1
The Practice of Statistics for Business and Economics Third Edition David S. Moore George P. McCabe Layth C. Alwan Bruce A. Craig William M. Duckworth © 2011 W.H. Freeman and Company
2
Examining Relationships Scatterplots PSBE Chapter 2.1 © 2011 W. H. Freeman and Company
3
© 2010 Pearson Education 3 Class Exercise – Real Estate, House Prices
4
A scatterplot, which plots one quantitative variable against another, can be an effective display for data. Scatterplots are the ideal way to picture associations between two quantitative variables.
5
Assigning Roles to Variables in Scatterplots
To make a scatterplot of two quantitative variables, assign one to the y-axis and the other to the x-axis. Be sure to label the axes clearly, and indicate the scales of the axes with numbers. Each variable has units, and these should appear with the display, usually near each axis.
6
Assigning Roles to Variables in Scatterplots
Each point is placed on a scatterplot at a position that corresponds to the values of the two variables. The point’s horizontal location is specified by its x-value, and its vertical location is specified by its y-value. Together, these values are known as coordinates and are written (x, y).
7
Assigning Roles to Variables in Scatterplots
One variable plays the role of the explanatory or predictor variable, while the other takes on the role of the response variable. We place the explanatory variable on the x-axis and the response variable on the y-axis. The x- and y-variables are sometimes referred to as the independent and dependent variables, respectively. In this class, use the terms explanatory or predictor variable (x) and response variable (y).
8
Looking at Scatterplots – Diamond Prices
Carat   Price
0.33    1079
0.39    1030
0.4     1150
0.41    1110
0.42    1210
0.46    1570
0.47    2113
0.48    2147
0.51    1770
0.56    1720
0.61    2500
0.62    3116
0.63    3165
0.64    2600
0.7     3080
0.7     3390
0.71    3440
0.71    3530
0.71    4481
0.72    4562
0.75    5069
0.8     5847
0.83    4930
Which variable will be the explanatory variable and which will be the response variable?
9
Looking at Scatterplots – Diamond Prices (scatterplot of Price against Carat)
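As a rough illustration of how the diamond data map onto a scatterplot, the sketch below draws a text-only plot with carat on the x-axis and price on the y-axis. The helper name `ascii_scatter` and its parameters are hypothetical, and the carat/price pairs are transcribed from the slide's table, so treat the split of each pair as an assumption.

```python
# Diamond data transcribed from the slide: (carat, price).
diamonds = [
    (0.33, 1079), (0.39, 1030), (0.4, 1150), (0.41, 1110), (0.42, 1210),
    (0.46, 1570), (0.47, 2113), (0.48, 2147), (0.51, 1770), (0.56, 1720),
    (0.61, 2500), (0.62, 3116), (0.63, 3165), (0.64, 2600), (0.7, 3080),
    (0.7, 3390), (0.71, 3440), (0.71, 3530), (0.71, 4481), (0.72, 4562),
    (0.75, 5069), (0.8, 5847), (0.83, 4930),
]


def ascii_scatter(points, width=50, height=15, x_label="x", y_label="y"):
    """Return a rough text scatterplot (x left to right, y bottom to top).

    Assumes the x-values are not all equal and the y-values are not all equal.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in points:
        # Scale each coordinate to a column/row index inside the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 of the grid is the top line
    lines = ["|" + "".join(cells) for cells in grid]
    lines.append("+" + "-" * width)
    lines.append(
        f"{x_label}: {x_min} to {x_max} (left to right), "
        f"{y_label}: {y_min} to {y_max} (bottom to top)"
    )
    return "\n".join(lines)


plot = ascii_scatter(diamonds, x_label="Carat", y_label="Price")
print(plot)
```

Carat goes on the x-axis because it plays the explanatory role, and price goes on the y-axis as the response.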
10
Looking at Scatterplots
The direction of the association is important. A pattern that runs from the upper left to the lower right is said to be negative. A pattern running from the lower left to the upper right is called positive.
11
Looking at Scatterplots – Diamond Prices
Direction? Positive
12
Looking at Scatterplots
The second thing to look for in a scatterplot is its form. If there is a straight-line relationship, it will appear as a cloud or swarm of points stretched out in a generally consistent, straight form. This is called linear form. Sometimes the relationship curves gently, while still increasing or decreasing steadily; sometimes it curves sharply up then down.
13
Looking at Scatterplots – Diamond Prices
Form? Linear
14
Looking at Scatterplots
The third feature to look for in a scatterplot is the strength of the relationship. Do the points appear tightly clustered in a single stream or do the points seem to be so variable and spread out that we can barely discern any trend or pattern?
15
Looking at Scatterplots – Diamond Prices
Strength? Moderately Strong
16
Looking at Scatterplots
Finally, always look for the unexpected. An outlier is an unusual observation, standing away from the overall pattern of the scatterplot.
17
Looking at Scatterplots – Diamond Prices
Outliers? No Outliers
18
Examining relationships
Most statistical studies involve more than one variable.
Questions:
What individuals do the data describe?
What variables are present and how are they measured?
Are all of the variables quantitative?
Do some of the variables explain or even cause changes in other variables?
19
Looking at relationships
Start with a graph.
Look for an overall pattern and deviations from the pattern.
Use numerical descriptions of the data and overall pattern (if appropriate).
20
Explanatory and response variables
A response variable measures or records an outcome of a study (also called the dependent variable).
An explanatory variable explains changes in the response variable (also called the independent variable).
21
Scatterplot
A scatterplot shows the relationship between two quantitative variables measured on the same individuals. Typically, the explanatory or independent variable is plotted on the x-axis, and the response or dependent variable is plotted on the y-axis. Each individual in the data appears as a point in the plot.
22
Scatterplot example
Botnet     Bots (thousands)   Spams per day (billions)
Srizbi     315                60
Bobax      185                9
Rustock    150                30
Cutwail    125                16
Storm      85                 3
Grum       50                 2
Ozdok      35                 10
Nucrypt    20                 5
Wopla      20                 0.06
Spamthru   10                 0.035
Here, we have two quantitative variables for each of 10 botnets: number of bots (thousands) and spams per day (billions). We are interested in the relationship between the two variables: How is one affected by changes in the other one?
23
Scatterplot example
24
Scatterplots
Some plots don’t have clear explanatory and response variables. Do calories explain sodium amounts?
25
Scatterplots
Some plots don’t have clear explanatory and response variables. Does percent return on Treasury bills explain percent return on common stocks?
26
Interpreting scatterplots
After plotting two variables on a scatterplot, we describe the relationship by examining the form, direction, and strength of the association.
We look for an overall pattern …
Form: linear, curved, clusters, no pattern
Direction: positive, negative, no direction
Strength: how closely the points fit the “form”
… and deviations from that pattern.
Outliers
27
Form and direction of an association
Panels: Linear, Nonlinear, No relationship
28
Positive association: High values of one variable tend to occur together with high values of the other variable.
Negative association: High values of one variable tend to occur together with low values of the other variable.
29
No relationship: X and Y vary independently. Knowing X tells you nothing about Y.
30
Strength of the association
The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form. With a strong relationship, you can get a pretty good estimate of y if you know x. With a weak relationship, for any x you might get a wide range of y values.
31
Strength of the association
This is a weak relationship. For a particular state median household income, you can’t predict the state per capita income very well.
32
Strength of the association
This is a very strong relationship. The daily amount of gas consumed can be predicted quite accurately for a given temperature value.
33
Stronger association?
Two scatterplots of the same data. The straight-line pattern in the lower plot appears stronger because of the surrounding open space.
34
How to scale a scatterplot
Using an inappropriate scale for a scatterplot can give an incorrect impression. Both variables should be given a similar amount of space:
Plot roughly square
Points should occupy all the plot space (no blank space)
35
Outliers
An outlier is a data value that has a low probability of occurrence (i.e., it is unusual or unexpected). In a scatterplot, outliers are points that fall outside of the overall pattern of the relationship.
36
Outliers
Not an outlier: The upper-right-hand point here is not an outlier of the relationship. It is what you would expect for this number of bots given the linear relationship between spams per day and bots.
Outlier: This point is not in line with the others, so it is an outlier of the relationship.
37
Outliers
IQ score and Grade Point Average
a) Describe in words what this plot shows.
b) Describe the direction, shape, and strength. Are there outliers?
c) What might explain these people?
38
Categorical variables in scatterplots
Often, things are not simple and one-dimensional. We need to group the data into categories to reveal trends. What may look like a positive linear relationship is in fact a series of negative linear associations. Plotting different habitats in different colors allows us to make that important distinction.
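The effect described here can be reproduced with a small made-up data set. The sketch below uses three hypothetical groups (labeled A, B, C; these are not the habitat data from the slide) and the correlation coefficient r from Section 2.2 as a numerical stand-in for "direction": the pooled data show a positive association, while each group on its own shows a negative one.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: within each group y falls as x rises,
# but groups with larger x also sit at larger y overall.
groups = {}
for g, name in enumerate("ABC"):
    xs = [10 * g + i for i in range(5)]
    ys = [10 * g + 8 - 2 * i for i in range(5)]
    groups[name] = (xs, ys)

all_x = [x for xs, _ in groups.values() for x in xs]
all_y = [y for _, ys in groups.values() for y in ys]

overall_r = pearson_r(all_x, all_y)
group_r = {name: pearson_r(xs, ys) for name, (xs, ys) in groups.items()}

print("pooled r:", round(overall_r, 3))
for name, r in group_r.items():
    print(f"group {name} r:", round(r, 3))
```

Without the group labels, only the pooled positive pattern would be visible.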
39
Categorical variables in scatterplots
Comparison of men’s and women’s racing records over time. Each group shows a very strong negative linear relationship that would not be apparent without the gender categorization.
40
Categorical variables in scatterplots
Relationship between lean body mass and metabolic rate in men and women. Both men and women follow the same positive linear trend, but women show a stronger association. As a group, males typically have larger values for both variables.
41
Categorical variables in scatterplots
Comparison of GDP of high- and low-ranked countries versus unemployment rate. Countries with higher GDP are ranked higher. Unemployment does not appear to be a factor in determining ranking.
42
Categorical explanatory variables
When the explanatory variable is categorical, you cannot make a scatterplot, but you can compare the different categories side by side on the same graph (boxplots, or mean +/- standard deviation).
Comparison of income (quantitative response variable) for different education levels (five categories). But be careful in your interpretation: This is NOT a positive association because education is not quantitative.
43
The log transformation
When the data are skewed toward large values, a log transformation may provide more information. Notice the data fill up much of the central part of the graph after the log transformation.
44
Examining Relationships Correlation PSBE Chapter 2.2 © 2011 W.H. Freeman and Company
45
Understanding Correlation
Correlation Conditions
Correlation measures the strength of the linear association between two quantitative variables.
46
Objectives (PSBE Chapter 2.2)
Correlation
The correlation coefficient “r”
r does not distinguish between x and y
r has no units of measurement
r ranges from -1 to +1
r is strongly affected by influential points and outliers
47
Understanding Correlation
The ratio of the sum of the products $z_x z_y$ for every point in the scatterplot to n – 1 is called the correlation coefficient. Two of the more common alternative formulas for correlation are:
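As a sketch, the definition in words corresponds to the first line below, where $z_x$ and $z_y$ are the standardized values of x and y. The second line gives two algebraically equivalent forms that are commonly used; they are assumed here, since the text does not list them.

```latex
\[
r = \frac{\sum z_x z_y}{n - 1},
\qquad
z_x = \frac{x - \bar{x}}{s_x},
\quad
z_y = \frac{y - \bar{y}}{s_y}
\]
\[
r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}
  = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}
\]
```

The equivalence follows from $s_x^2 = \sum (x - \bar{x})^2 / (n - 1)$ and the matching expression for $s_y^2$.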
48
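The correlation coefficient “r”
Bots: $\bar{x} = 99.5$, $s_x = 96.9$
Spams per day: $\bar{y} = 13.51$, $s_y = 18.71$
$r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) = 0.885$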
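The calculation can be checked with a short script. This is a minimal sketch that uses the botnet values from the scatterplot example (bots in thousands, spams per day in billions) and the z-score definition of r; the variable names are illustrative only.

```python
from statistics import mean, stdev

# Botnet data from the scatterplot example.
bots = [315, 185, 150, 125, 85, 50, 35, 20, 20, 10]      # thousands
spams = [60, 9, 30, 16, 3, 2, 10, 5, 0.06, 0.035]        # billions per day

n = len(bots)
x_bar, s_x = mean(bots), stdev(bots)      # stdev uses the n - 1 divisor
y_bar, s_y = mean(spams), stdev(spams)

# Standardize each variable, then average the products with divisor n - 1.
z_x = [(x - x_bar) / s_x for x in bots]
z_y = [(y - y_bar) / s_y for y in spams]
r = sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)

print("x_bar =", round(x_bar, 1), " s_x =", round(s_x, 1))
print("y_bar =", round(y_bar, 2), " s_y =", round(s_y, 2))
print("r =", round(r, 3))
```

Rounded as on the slide, the results agree with the stated means, standard deviations, and r = 0.885.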
49
Understanding Correlation
Correlation Conditions
Before you use correlation, you must check three conditions:
Quantitative Variables Condition: Correlation applies only to quantitative variables.
Linearity Condition: Correlation measures the strength only of the linear association.
Outlier Condition: Unusual observations can distort the correlation.
50
Correlation only describes linear relationships
No matter how strong the association, r does not describe curved relationships.
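A small made-up example makes the point concrete. In the sketch below, y is completely determined by x through y = x², yet the correlation is zero because the relationship is curved and symmetric. The data and the helper `pearson_r` are illustrative assumptions.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: a perfect but curved (quadratic) relationship.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curved = pearson_r(xs, ys)
print("r for y = x^2 on symmetric x:", r_curved)
```

A value of r near zero therefore does not mean "no relationship", only no linear relationship, which is why the scatterplot should be inspected first.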
51
Understanding Correlation
Correlation Properties
The sign of a correlation coefficient gives the direction of the association.
Correlation is always between –1 and +1.
Correlation measures the strength of the linear association between the two variables.
Correlation treats x and y symmetrically.
Correlation has no units.
Correlation is not affected by changes in the center or scale of either variable.
Correlation is sensitive to unusual observations.
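Several of these properties can be checked numerically. The sketch below uses hypothetical data and an illustrative helper `pearson_r`; it swaps the variables, shifts and rescales x (as in a change of units), and reverses the sign of x.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data with a strong positive linear pattern.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

r = pearson_r(xs, ys)
r_swapped = pearson_r(ys, xs)                             # x and y exchanged
r_rescaled = pearson_r([32 + 1.8 * x for x in xs], ys)    # shift and rescale x
r_flipped = pearson_r([-x for x in xs], ys)               # reverse direction of x

print("r          =", round(r, 4))
print("r swapped  =", round(r_swapped, 4))
print("r rescaled =", round(r_rescaled, 4))
print("r flipped  =", round(r_flipped, 4))
```

Swapping and rescaling leave r unchanged, while multiplying x by a negative number changes only its sign.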
52
The correlation coefficient “r”
The correlation coefficient is a measure of the direction and strength of a linear relationship. It is calculated using the mean and the standard deviation of both the x and y variables.
Correlation can only be used to describe quantitative variables. Categorical variables don’t have means and standard deviations.
53
Facts about correlation
r ignores the distinction between response and explanatory variables
r measures the strength and direction of a linear relationship between two quantitative variables
r is not affected by changes in the unit of measurement
A positive value of r means the association between the two variables is positive
A negative value of r means the association between the two variables is negative
r is always between -1 and +1
r is strongly affected by outliers
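To illustrate the last fact, the sketch below computes r for a small hypothetical data set and then again after adding a single point that lies far from the pattern. The numbers are made up for illustration and `pearson_r` is an illustrative helper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data following a clear upward linear pattern.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 5, 4, 6, 7, 8, 9]
r_before = pearson_r(xs, ys)

# Add one observation far below the pattern at the right edge of the plot.
r_after = pearson_r(xs + [9], ys + [0])

print("r without the outlier:", round(r_before, 3))
print("r with the outlier:   ", round(r_after, 3))
```

A single unusual observation is enough to turn a strong correlation into a weak one here, so r should always be read alongside the scatterplot.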
54
“r” ranges from -1 to +1
Strength: how closely the points follow a straight line.
Direction: r is positive when individuals with higher X values tend to have higher values of Y.
55
Review example
Estimate r
1. r = 1.00
2. r = -0.94
3. r = 1.12
4. r = 0.94
5. r = 0.21
(in 1000’s)
56
Understanding Correlation
Correlation Tables
Sometimes the correlations between each pair of variables in a data set are arranged in a table like the one below.
57
Straightening Scatterplots
Example: After the Dow Jones Industrial Average, the S&P 500 is the most widely watched index of U.S. stocks. The time series plot of the data does not seem to indicate a strong linear association:
58
Straightening Scatterplots
However, if we look at the logarithm of the S&P 500 over time, the plot looks straighter, so the correlation is now a more appropriate measure of association.
59
Straightening Scatterplots
Simple transformations such as the logarithm, square root, or reciprocal can sometimes straighten a scatterplot’s form.
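The effect of a log transformation can be sketched with a synthetic series. The values below are hypothetical (steady 8% growth per period, not actual S&P 500 data), and `pearson_r` is an illustrative helper; the point is only that the logarithm of an exponentially growing series is linear in time.

```python
from math import log, sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical index that grows 8% per period for 40 periods.
periods = list(range(40))
index = [100 * 1.08 ** t for t in periods]

r_raw = pearson_r(periods, index)                      # curved: r understates the pattern
r_log = pearson_r(periods, [log(v) for v in index])    # straightened by the log

print("r, time vs. index:     ", round(r_raw, 3))
print("r, time vs. log(index):", round(r_log, 3))
```

After the transformation the points fall on a straight line, so r becomes an appropriate summary of the association.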
60
Lurking Variables and Causation
There is no way to conclude from a high correlation alone that one variable causes the other. There’s always the possibility that some third variable (a lurking variable) is simultaneously affecting both of the variables you have observed.
61
What Can Go Wrong?
Don’t say “correlation” when you mean “association.”
Don’t correlate categorical variables.
Make sure the association is linear.
Beware of outliers.
Don’t confuse correlation with causation.
Watch out for lurking variables.
62
What Have We Learned?
Begin our investigation by looking at a scatterplot.
The sign of the correlation tells us the direction of the association.
The magnitude of the correlation tells us the strength of a linear association.
Correlation has no units, so shifting or scaling the data, standardizing, or even swapping the variables has no effect on the numerical value.
63
What Have We Learned?
To use correlation we have to check certain conditions for the analysis to be valid:
Check the Linearity Condition.
Watch out for unusual observations.
We’ve learned not to make the mistake of assuming that a high correlation or strong association is evidence of a cause-and-effect relationship.