
1 Scatterplots

2 Learning Objectives By the end of this lecture, you should be able to:
Describe what a scatterplot is.
Be comfortable with the terms explanatory variable and response variable.
Describe a scatterplot in terms of form, direction, and strength.
Define what is meant by an outlier, and be able to identify outliers on a scatterplot.
Recognize why poorly chosen scales on a scatterplot can give misleading impressions of the data.

3 Examining Relationships
Up to this point, we have focused on single-variable (“univariate”) data, e.g. women’s heights, the percentage of Hispanics in each state, SAT scores, etc.
Most statistical studies involve more than one variable. For example, a great deal of analysis goes into examining the relationship between two variables.
Example: For a group of students, we may be interested in the relationship between the number of beers they consumed at a party and their blood alcohol level (BAC).
With the proper statistical tools we can try to determine things like:
Is there a relationship? That is, does the number of beers affect blood alcohol level?
If there is a relationship, can we predict how much each beer contributes to BAC?
A great human flaw: It is tempting to just intuitively assume that there is a relationship between two variables. However, this can lead to some highly erroneous conclusions. As humans, we love to make assumptions, find patterns that don’t truly exist, and then jump to conclusions. This is a very well-known flaw in the human character and we should be aware of it. We will discuss this topic in more detail as we progress through the course.

4 Here, we have two quantitative variables for each of 16 students (n=16): 1) how many beers they drank, and 2) their blood alcohol level (BAC). We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Student  Beers  Blood Alcohol
S1       5      0.1
S2       2      0.03
S3       9      0.19
S4       7      0.095
S5       3      0.07
S6       ?      0.02
S7       4      ?
S8       ?      0.085
S9       8      0.12
S10      ?      0.04
S11      ?      0.06
S12      ?      0.05
S13      6      ?
S14      ?      0.09
S15      1      0.01
S16      ?      ?
(? marks a value missing from this transcript.)

5 Looking for relationships between variables
Always start with a graph (if possible).
Look for an overall pattern and for deviations from the pattern (deviations such as outliers are sometimes the most interesting part!).
If appropriate, try to provide numerical descriptions of the data and overall pattern.

6 Scatterplots
In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.
[Table: Student, Beers, and BAC for the 16 students, shown beside the scatterplot.]
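To make the idea of "one axis per variable, one point per student" concrete, here is a minimal sketch that draws a rough text scatterplot using only the Python standard library. It uses just the seven (beers, BAC) pairs that are fully legible in the data table, and the function name `text_scatterplot` and the grid size are assumptions made for this illustration, not part of the lecture.

```python
# The seven (beers, BAC) pairs that are fully legible in the data table.
# Rows with a missing value are left out rather than guessed.
beers = [5, 2, 9, 7, 3, 8, 1]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.12, 0.01]


def text_scatterplot(xs, ys, width=30, height=10):
    """Return a list of text rows with '*' marking each (x, y) point."""
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    x_span = (x_hi - x_lo) or 1  # avoid dividing by zero if all x are equal
    y_span = (y_hi - y_lo) or 1
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_lo) / x_span * (width - 1))
        row = round((y - y_lo) / y_span * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top of the plot
    return ["".join(cells) for cells in grid]


rows = text_scatterplot(beers, bac)
for line in rows:
    print("|" + line)       # vertical axis: BAC (response)
print("+" + "-" * 30)       # horizontal axis: number of beers (explanatory)
```

In practice a plotting library would be used instead; the point of the sketch is only that each student becomes one mark whose horizontal position is the number of beers and whose vertical position is the BAC.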

7 Explanatory and response variables
A response variable measures or records an outcome of a study. An explanatory variable explains (“causes”) the changes in the response variable.
Typically, the explanatory variable is plotted on the x axis, and the response variable is plotted on the y axis.
In the beer example: x = Number of Beers (explanatory variable), y = Blood Alcohol Content (response variable).

8 Terminology: Dependent / Independent
Instead of explanatory / response, you will often encounter the terms independent and dependent:
Independent for explanatory
Dependent for response
The terms are pretty much interchangeable, but there is a subtle difference. It is more accurate to use the terms explanatory and response, so I would like you to focus on those terms. You will occasionally see SPSS use dependent/independent.

9 Which should be the explanatory, and which the response?
The variable that you think “causes” the change in the other variable should be the explanatory variable. (This is why it is frequently called the ‘independent’ variable. But as was just mentioned, there is a subtle distinction between the two terms, which we may get to down the road.) The variable that “responds” to a change in the explanatory variable is, then, the response variable.
Example: Exercise vs. calories burned?
Answer: The amount of exercise will (hopefully!) result in a change in calories burned, whereas burning calories does not ‘cause’ a change in exercise. So exercise should be our explanatory variable, and calories the response variable.
Example: Exam score vs. hours studying?
Answer: We would expect that the number of hours studying would cause a change in exam score rather than the other way around. So ‘hours studying’ would be our explanatory variable, and exam score the response variable.

10 Describing/Interpreting scatterplots
When describing a scatterplot, we describe the relationship by examining the form, direction, and strength of the association. We look for an overall pattern:
Form: linear (a straight line), curved, clusters, no pattern
Direction: positive, negative, no direction
Strength: how closely the points fit the “form”

11 Form of an association: Linear / Nonlinear / No Relationship

12 Direction of a linear association: Positive or Negative
A linear relationship is given a directional description of positive or negative.
Positive association: High values of one variable tend to occur together with high values of the other variable.
Negative association: High values of one variable tend to occur together with low values of the other variable.
Note that we only describe the direction of the relationship when the relationship is linear.
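The definitions of positive and negative association can be sketched in code by comparing each point with the two means: a point "agrees" with a positive association when x and y are both above, or both below, their means. This counting rule and the function name `direction` are assumptions made for illustration; it is not the tool the course introduces later, and it does not check that the form is linear, which still has to be judged from the plot.

```python
def direction(xs, ys):
    """Classify direction by comparing each point with the two means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    same = 0      # x and y both above, or both below, their means
    opposite = 0  # one above its mean while the other is below
    for x, y in zip(xs, ys):
        product = (x - mean_x) * (y - mean_y)
        if product > 0:
            same += 1
        elif product < 0:
            opposite += 1
    if same > opposite:
        return "positive"
    if opposite > same:
        return "negative"
    return "no clear direction"


# The seven (beers, BAC) pairs that are fully legible in the data table.
beers = [5, 2, 9, 7, 3, 8, 1]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.12, 0.01]
print(direction(beers, bac))
```

For these pairs, high beer counts go with high BAC values, so the sketch reports a positive direction.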

13 Scatterplot Direction: No Relationship
Sometimes there isn’t any relationship: X and Y may vary, but are independent of each other. Knowing a value for X tells you nothing about the value for Y. We describe this as ‘no relationship’.

14 Scatterplot: Strength of the association
The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.
With a strong relationship, you can get a pretty good estimate of y if you know x.
With a weak relationship, for any x you might get a wide range of y values. (You could probably make a reasonable argument that the relationship in this plot isn’t even linear.)

15 This is a relatively weak relationship
This is a relatively weak relationship. For a given state median household income, you can’t predict the state per capita income very well.
This is a strong relationship. The daily amount of gas consumed can be predicted quite accurately for a given temperature value.

16 Describing the strength
For now we are using the admittedly vague terms ‘strong, moderate, weak’. In a subsequent lecture on scatterplots, we will learn a technique for quantifying the strength.

17 Describing/Interpreting scatterplots
As mentioned earlier, when you are asked to interpret a scatterplot, you should be familiar with these 3 terms in particular:
Form: linear, curved, clusters, no pattern
Direction: positive, negative, no direction
Strength: how closely the points fit the “form”
Note: Recall that if the relationship is not linear, we will not bother to describe direction or strength.

18 Examples – Describe each plot
Form: Linear, Direction: positive, Strength: strong
Form: Linear, Direction: negative, Strength: moderate
Form: No relationship. Note that knowing x for a given point does not tell us anything about y. As a result, the terms ‘positive/negative’ don’t apply. Neither does the strength.

19 Examples
Form: Non-linear. Therefore, we don’t bother trying to describe direction or strength.
Form: Linear, Direction: positive, Strength: moderate
In our next lecture on scatterplots, we will discuss a tool for quantifying the strength of the relationship.

20 Lying with statistics: How (not) to scale a scatterplot
Same data in all four plots.
Using an inappropriate scale for a scatterplot can give an incorrect impression. Ideally, both variables should be given a similar amount of space:
Plot roughly square
Points should occupy most of the plot space
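The guideline that points should occupy most of the plot space can be checked with a little arithmetic. The sketch below picks axis limits just outside the data range and measures what fraction of an axis the data actually spans. The helper names `padded_limits` and `fill_fraction`, and the 5% padding, are assumptions chosen for this illustration.

```python
def padded_limits(values, pad_fraction=0.05):
    """Return (low, high) axis limits slightly wider than the data range."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # fall back to 1.0 if every value is identical
    pad = span * pad_fraction
    return lo - pad, hi + pad


def fill_fraction(values, axis_lo, axis_hi):
    """Fraction of the axis length that the data actually spans."""
    return (max(values) - min(values)) / (axis_hi - axis_lo)


# Beer counts from the rows that are fully legible in the data table.
beers = [5, 2, 9, 7, 3, 8, 1]

good_lo, good_hi = padded_limits(beers)
print(fill_fraction(beers, good_lo, good_hi))  # data fills most of the axis
print(fill_fraction(beers, 0, 100))            # data squeezed into a corner
```

An axis running from 0 to 100 for data that only ranges from 1 to 9 leaves most of the plot empty, which is the kind of scaling that gives a misleading impression.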

21 How to scale a scatterplot
Same data in all four plots.
In other words, if faced with this group of plots, you should be suspicious of most of them!

22 Outliers
An outlier is a data value that has a very low probability of occurrence (i.e., it is unusual or unexpected). In a scatterplot, outliers are points that fall outside of the overall pattern of the relationship.

23 Outliers
Not an outlier: The upper right-hand point here is not an outlier of the relationship. It is what you would expect for this many beers given the linear relationship between beers/weight and blood alcohol.
Outlier: This point is not in line with the others, so it is an outlier of the relationship.

24 IQ score and Grade point average
Describe in words what this plot shows.
We are looking to see if there is a relationship between IQ score and GPA.
Describe the direction, shape, and strength. Are there outliers?
Shape: linear
Direction: positive
Strength: appears somewhat weak
Outliers present? There appear to be outliers, but it is hard to say.

25 IQ score and Grade point average
Are there outliers present? The circled datapoints (and perhaps some of the others too) appear to be outliers. Still, it is hard to say. How do we decide? Recall that on a scatterplot, we consider a datapoint to be an outlier if it is way off the “line”. If the “regression” line (the line through the points) looks like the one here, then both circled datapoints would almost certainly be considered outliers.

26 IQ score and Grade point average
Are there outliers present? If the regression line looks like the one drawn here, then certainly the lower circled datapoint (and probably some of the others nearby as well) would be considered outliers.

27 IQ score and Grade point average
Are there outliers present? Conversely, if the regression line looks like the one drawn here, then certainly the upper circled datapoint (and probably several of the others nearby as well) would be considered outliers. But the lower one would not be.
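The dependence of the outlier call on the chosen line can be sketched with made-up numbers. Everything in the block below is hypothetical: the points are not the IQ/GPA data, the three lines are drawn "by eye", and the cutoff of 2.5 for being "way off the line" is an arbitrary assumption.

```python
def far_from_line(points, slope, intercept, cutoff=2.5):
    """Return the points whose vertical distance from the line exceeds cutoff."""
    flagged = []
    for x, y in points:
        predicted = slope * x + intercept
        if abs(y - predicted) > cutoff:
            flagged.append((x, y))
    return flagged


# Hypothetical data: five points on a straight line, plus one point
# well above the pattern and one point well below it.
points = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (3, 6), (4, 1)]

# Three candidate lines with the same slope but different heights.
print(far_from_line(points, 1.0, 0.0))   # middle line: both odd points flagged
print(far_from_line(points, 1.0, 1.5))   # higher line: only the low point
print(far_from_line(points, 1.0, -1.5))  # lower line: only the high point
```

The same datapoints are or are not outliers depending on where the line is placed, which is why a principled way of choosing the line is needed.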

28 WHICH line, then, is the “correct” regression line?
Answer: Once again, we use a mathematical model to draw a regression line. We will discuss how to do so in our next lecture on scatterplots.

