Association between 2 variables
We've described the distribution of one variable (univariate), but what if two variables are measured on the same individual (bivariate)? Examples? How could you describe the association between the two?
Our descriptions will depend upon the types of variables (categorical or quantitative):
  categorical vs. categorical - Examples?
  categorical vs. quantitative - Examples?
  quantitative vs. quantitative - Examples?
Explanatory variable vs. Response variable
One common task is to show that one variable can be used to explain variation in the other: an explanatory variable vs. a response variable (sometimes these are called independent vs. dependent variables).
These associations can be explored both graphically and numerically:
  begin your analysis with graphics
  find a pattern & look for deviations from the pattern
  look for a mathematical model to describe the pattern
But again, how we do the above depends upon what types of variables we have. We'll start with quantitative vs. quantitative ...
A scatterplot is the best graph for showing relationships between two quantitative variables.
In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.
[Table and scatterplot: Student, Beers, BAC for 16 students]
Explanatory and response variables
A response variable measures or records an outcome of a study. An explanatory variable explains changes in the response variable. Typically, the explanatory or independent variable is plotted on the x axis, and the response or dependent variable is plotted on the y axis.
  Explanatory (independent) variable, x: number of beers
  Response (dependent) variable, y: blood alcohol content
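A minimal sketch of this axis convention in R. The values below are made up for illustration (they are not the class data), and the names `beers` and `bac` are hypothetical:

```r
# Hypothetical illustrative values, NOT the data from the slide
beers <- c(1, 2, 3, 4, 5, 6, 7, 8)                          # explanatory (x)
bac   <- c(0.01, 0.03, 0.04, 0.06, 0.07, 0.10, 0.11, 0.13)  # response (y)

# plot(x, y): the first argument goes on the horizontal axis,
# the second on the vertical axis
plot(beers, bac,
     xlab = "Number of beers (explanatory)",
     ylab = "Blood alcohol content (response)")
```

Swapping the two arguments would put the response on the x axis, which is usually not what you want when one variable is meant to explain the other.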
Describe the pattern of the relationship between the two variables in a scatterplot by its direction, strength, and form.
  direction: positive, negative, or flat (no direction)
  strength: strong, weak, moderately strong, etc.
  form: linear, curved (non-linear), clusters, no pattern
See example to the right…
Form and direction of an association
[Figure panels: Linear, Nonlinear, No relationship]
Positive association: High values of one variable tend to occur together with high values of the other variable.
Negative association: High values of one variable tend to occur together with low values of the other variable.
The scatterplots below show perfect linear associations.
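As a rough illustration of the two directions, here is a small R sketch that draws one perfect positive and one perfect negative linear association. The slopes and intercepts are arbitrary choices, not values from the slides:

```r
x <- 1:10
y_pos <- 2 + 3 * x    # high x goes with high y
y_neg <- 40 - 3 * x   # high x goes with low y

op <- par(mfrow = c(1, 2))   # two panels side by side
plot(x, y_pos, main = "Perfect positive")
plot(x, y_neg, main = "Perfect negative")
par(op)                      # restore the previous layout
```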
No relationship: X and Y vary independently. Knowing X tells you nothing about Y.
One way to think about this: imagine a line through the data points. If that line is horizontal, its equation is something like y = 5; x is not involved.
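A small simulated sketch of "no relationship" in R, assuming made-up data in which y is generated without using x at all (the seed, sample size, and spread are arbitrary):

```r
set.seed(1)                        # arbitrary seed, for reproducibility
x <- runif(50, min = 0, max = 10)
y <- 5 + rnorm(50, sd = 1)         # y is generated without using x

plot(x, y)
abline(h = 5, lty = 2)             # the horizontal line y = 5
```

The points scatter around the dashed line with no upward or downward drift as x increases.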
Strength of the relationship or association ...
This is a weak relationship. For a given state's median household income, you can't predict the state's per capita income very well.
This is a very strong relationship. The daily amount of gas consumed can be predicted quite accurately for a given temperature value.
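To see what "strong" and "weak" look like, the sketch below simulates two data sets with the same underlying line but different amounts of scatter. The data are made up, and using `cor()` as a numerical summary of linear strength is an assumption here, since the slides have not yet introduced correlation:

```r
set.seed(2)                                # arbitrary seed
x <- seq(1, 10, length.out = 60)
y_strong <- 2 * x + rnorm(60, sd = 0.5)    # little scatter about the line
y_weak   <- 2 * x + rnorm(60, sd = 8)      # a lot of scatter about the line

op <- par(mfrow = c(1, 2))
plot(x, y_strong, main = "Strong")
plot(x, y_weak,   main = "Weak")
par(op)

# One possible numerical summary of linear strength
cor(x, y_strong)
cor(x, y_weak)
```

In the strong panel a given x pins down y fairly closely; in the weak panel the same x is compatible with a wide range of y values.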
What if there are categorical variables involved, either as the explanatory variable or as a "lurking variable"?
A scatterplot can sometimes help by indicating the categories of the lurking variable with different plotting symbols or colors...
Often, though, the best way to see the pattern when the explanatory variable is categorical is to draw side-by-side boxplots. Put the categorical variable on the horizontal axis, and draw a boxplot for each category, side by side.
Here are some examples of various explanatory, lurking, and response variables...
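A minimal sketch of side-by-side boxplots in R, using simulated data. The group labels and the names `group` and `response` are hypothetical:

```r
set.seed(3)                                 # arbitrary seed
# Hypothetical data: a quantitative response measured in three made-up groups
group    <- factor(rep(c("A", "B", "C"), each = 20))
response <- rnorm(60, mean = rep(c(10, 12, 15), each = 20), sd = 2)

# Formula interface: quantitative response ~ categorical explanatory variable
boxplot(response ~ group, xlab = "Group", ylab = "Response")
```

R draws one box per level of the factor, with the categorical variable on the horizontal axis.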
Categorical variables in scatterplots Often, things are not simple and one-dimensional. We need to group the data into categories to reveal trends. Lurking Variable! What may look like a positive linear relationship is in fact a series of negative linear associations. Plotting different habitats (the lurking variable) in different colors allows us to make that important distinction.
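The sketch below simulates this kind of situation in R: the pooled points trend upward, while each group trends downward. All values are made up, the habitat labels are hypothetical, and `cor()` is used only as a quick numerical check:

```r
set.seed(4)                                  # arbitrary seed
habitat <- factor(rep(c("h1", "h2", "h3"), each = 15))
centre  <- rep(c(2, 6, 10), each = 15)       # each habitat sits further up and to the right
x <- centre + runif(45, -1.5, 1.5)
y <- 2 * centre - 1.5 * (x - centre) + rnorm(45, sd = 0.3)

# One color and plotting symbol per habitat
plot(x, y, col = as.integer(habitat), pch = as.integer(habitat))
legend("topleft", legend = levels(habitat), col = 1:3, pch = 1:3)

cor(x, y)                                    # all points pooled together
sapply(split(data.frame(x, y), habitat),     # separately within each habitat
       function(d) cor(d$x, d$y))
```

Plotted in a single color, the cloud looks like one positive trend; the colors are what reveal the negative association inside each habitat.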
Comparison of men and women racing records over time. Each group shows a very strong negative linear relationship that would not be apparent without the gender categorization. Relationship between lean body mass and metabolic rate in men and women. Both men and women follow the same positive linear trend, but women show a stronger association. As a group, males typically have larger values for both variables.
Look at this figure. Note the ordinal scale of the explanatory variable, education level. Are these two variables associated? Why? The next slide is tricky...
Example: Beetles trapped on boards of different colors
Beetles were trapped on sticky boards scattered throughout a field. The sticky boards were of four different colors (categorical explanatory variable). The number of beetles trapped (response variable) is shown on the graph below.
[Figure: number of beetles trapped by board color, drawn twice with the colors in different orders: Blue, White, Green, Yellow and Yellow, White, Green, Blue]
What association? What relationship?
Describe one category at a time. When both variables are quantitative, the order of the data points is defined entirely by their values. This is not true for categorical data: the categories can be listed in any order, so an apparent upward or downward trend across them may only reflect the order chosen.
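A short R sketch of why the order of categories matters. The counts are made up (they are not the values from the slide), and the data frame `boards` is hypothetical:

```r
# Hypothetical counts, NOT the values from the slide
boards <- data.frame(
  color   = rep(c("Blue", "Green", "White", "Yellow"), each = 3),
  beetles = c(12, 15, 14,  30, 35, 33,  10, 13, 11,  45, 50, 47)
)

op <- par(mfrow = c(1, 2))
# Same data, two equally valid category orders
boxplot(beetles ~ factor(color, levels = c("Blue", "White", "Green", "Yellow")),
        data = boards, xlab = "Board color", ylab = "Beetles trapped")
boxplot(beetles ~ factor(color, levels = c("Yellow", "White", "Green", "Blue")),
        data = boards, xlab = "Board color", ylab = "Beetles trapped")
par(op)
```

The left panel appears to rise and the right panel appears to fall, yet nothing about the data has changed, which is why each category should be described on its own.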
HW: Start reading Notes 2.1 on Bivariate Data with R. Then . . .
1. Load the lean body mass data (lbm.csv) into R using the read.csv function. We are interested in knowing if lean body mass explains metabolic rate.
> # first, save the file on your desktop … then read it into R
> bodymass = read.csv(file=file.choose())
> str(bodymass)   # to see the structure of the data frame
> attach(bodymass)
> plot(x,y)       # to see a scatterplot of the two variables
> # which variable is x? y?
> # how would you describe the relationship you see?
> # don't forget: direction, strength, and form.
> # is the relationship different for males and females?
2. Bring in bivariate data on two quantitative variables in your field that you can analyze with R - we'll plot it, correlate it, do regression on it… Is one of your variables explanatory while the other is the response? Or not?
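As a possible alternative to attach(), columns can be referred to through the data frame itself. The sketch below shows the idea on a small made-up file; the column names `mass`, `rate`, and `sex` are hypothetical and are not the column names in lbm.csv:

```r
# Made-up stand-in for a data file; column names are hypothetical
demo <- data.frame(
  mass = c(36, 42, 48, 51, 55, 62),
  rate = c(1000, 1150, 1300, 1400, 1500, 1750),
  sex  = c("F", "F", "F", "M", "M", "M")
)
path <- tempfile(fileext = ".csv")
write.csv(demo, path, row.names = FALSE)

d <- read.csv(path)
str(d)                       # check the column names and types first

# Formula interface: response ~ explanatory, with columns looked up in d
plot(rate ~ mass, data = d,
     col = ifelse(d$sex == "F", "red", "blue"))
```

Checking `str()` before plotting tells you which names to put on each side of the `~`.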