Module 5: Examining Relationships... Quantitative Data
Before we formally learn about analysis of bivariate data... What do you see?
Remember... It starts with a topic, followed by a question... Dr. Gould, UCLA
Consider two variables & come up with a question relating those two variables... Dr. Gould, UCLA Two variables means bivariate data. We will focus on numeric data for now; later in the course we will further explore relationships between two (or more) categorical variables.
Do you believe there is a relationship between... Time spent studying and GPA? # of cigarettes smoked daily & life expectancy? Salary and education level? Age and height? How could we find out? The data cycle!
Relationships When we consider data that comes in pairs (data with two variables), the data is referred to as bivariate data. Much of the bivariate data we will examine is numeric. There may or may not exist a relationship/an association between the 2 variables. Does one variable influence the other? Or vice versa? Or do the two variables just ‘go together’ by chance? Or is the relationship influenced by other variable(s) that we are unaware of? Does one variable ‘cause’ the other? Caution!
Bivariate Data Proceed similarly as with univariate distributions… (Review: What is univariate data? Which graphical models do we typically use with univariate numerical data?)
Bivariate Data Like we were saying... proceed similarly as with univariate distributions. With bivariate data, we still graph: we use visual model(s) to describe the data, such as a scatter plot and the Least Squares Regression Line (LSRL). With bivariate data, we still look at overall patterns and deviations from those patterns (DOFS: Direction, Outlier(s), Form, Strength). Review: How did we look for patterns in univariate numeric data? What did we use? With bivariate data, we still analyze numerical summaries/descriptive statistics (what are these?).
Bivariate Distributions
* Explanatory variable, x (‘factor’): may help predict or explain changes in the response variable; usually on the horizontal axis.
* Response variable, y: measures an outcome of a study; usually on the vertical axis.
Bivariate Data Distributions For example... Alcohol (explanatory) and body temperature (response). Generally, the more alcohol consumed, the lower the body temperature. Still use caution with ‘cause.’ Sometimes we don’t have variables that are clearly explanatory and response. Sometimes there could be two ‘explanatory’ variables, such as ACT scores and SAT scores, or activity level and physical fitness. Discuss with a partner for 1 minute; come up with a situation where we have two variables that are related, but neither is clearly explanatory or response.
Graphical models… Many graphical models display univariate numeric data exclusively (review). The main graphical representations used to display bivariate data (two quantitative variables) are the scatterplot and the least squares regression line (LSRL).
Scatterplots
* Scatterplots show the relationship between two quantitative variables measured on the same individuals or objects.
* Each individual/object in the data appears as a point (x, y) on the scatterplot.
* Plot the explanatory variable (if there is one) on the horizontal axis. If there is no distinction between explanatory and response, either can be plotted on the horizontal axis.
* Label both axes. Scale both axes with uniform intervals (the two scales don’t have to match). An axis doesn’t have to start at zero; with scatterplots this is not considered misleading.
Variables: clearly explanatory and response? Practice: trends?
Creating & Interpreting Scatterplots Let’s collect some data: your age in years and the number of states you have visited in your lifetime. Input into Stat Crunch & create scatter plot; which is our explanatory and which is our response variable? Let’s do some predicting... to the best of our ability...
Interpreting Scatterplots Look for overall patterns (DOFS), including:
* Direction: up or down; positive (+) or negative (−) association?
* Outliers/deviations: individual value(s) that fall outside the overall pattern; unlike univariate data, there is no outlier rule for bivariate data.
* Form: linear? curved? clusters? gaps?
* Strength: how closely do the points follow a clear form? Strong, moderate, or weak?
Measuring Linear Association Scatterplots (bivariate data) show the direction, outliers/deviation(s), form, and strength of the relationship between two quantitative variables. Linear relationships are important: they are a common, simple pattern, and they are our focus in this course. A linear relationship is strong if the points are close to a straight line and weak if they are scattered about. Other relationships (quadratic, logarithmic, etc.) also exist.
Linear relationships
Non-linear relationships
Let’s go back to previous scatterplots... With a partner, look at one of the previous scatterplots (your choice) and analyze it through DOFS (direction, outlier(s), form, strength). Three minutes... Then report out in groups (those that chose the same scatterplot). Be ready to make predictions based on the scatterplot.
Creating & Interpreting Scatterplots Go to my website, download the COC Math 140 Survey Data Fall 2015 OR Spring 2016. Copy & paste columns (‘Height’ and ‘Weight’). Is the data messy? Does it need to be ‘fixed’? Hint: scan for ordered pairs (this is bivariate data); each and every point must be an ordered pair. Graph it; do we need to evaluate any points (any possible inaccuracies)? Person 131 & 61; what should we do?
Creating & Interpreting Scatterplots ‘Height’ & ‘Weight’ Create a scatter plot of the data. Analyze (DOFS). Let’s do some predictions... Is it sometimes difficult to make predictions? We will get back to this with a ‘better’ model... Person 131 & 61; what should we do?
How strong are these relationships? Which one is stronger?
Measuring Linear Association: Correlation or “r” Sometimes our eyes are not a good judge. We need to specify just how strong or weak a linear relationship is with bivariate data. We need a numeric measure: correlation, or ‘r’.
Measuring Linear Association: Correlation or “r”
* Correlation (r) is a numeric measure of the direction and strength of a linear relationship between two quantitative variables.
* Correlation (r) is always between -1 and 1.
* Correlation (r) is not resistant (look at the formula; it is based on the mean).
* r doesn’t tell us about individual data points, but rather about trends in the data.
* Never calculate by formula; use Stat Crunch (dependent on having raw data).
Calculating Correlation “r” The formula uses: n; x₁, x₂, etc.; x̄; y₁, y₂, etc.; ȳ; s_x; s_y; …
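For reference, the formula built from these symbols is usually written as r = (1/(n − 1)) Σ ((xᵢ − x̄)/s_x)((yᵢ − ȳ)/s_y). That standard textbook form is an assumption here, since the slide lists only the symbols. Below is a minimal Python sketch of it, not a replacement for Stat Crunch. It uses the five complete (hours, percentage) pairs from the study-time table later in these slides (the row with no hours value is left out), and the helper name `correlation` is hypothetical.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Return r as the sum of z-score products divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (len(xs) - 1)

# Five complete pairs from the study-time table (assumption: the
# incomplete row is dropped rather than guessed).
hours = [3, 4, 4.5, 1, 1.5]
percent = [89, 92, 94, 85, 86]

r = correlation(hours, percent)
print(f"r = {r:.3f}")
```

For these five points r comes out close to 1, which matches the strong positive linear pattern in the table. Because one row is left out, the value will not match what the class gets from the full data set.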
Measuring Linear Association: Correlation or “r” r ≈ 0: weak or no linear relationship. r close to 1: strong positive linear relationship. r close to -1: strong negative linear relationship. Go back to our height/weight data & calculate ‘r,’ the correlation. PRACTICE: Go to my website, data sets, Cereal Data from Lock 5, and copy/paste the Calories and Fat columns into Stat Crunch; create a scatterplot; calculate ‘r’; make some observations and some predictions.
Correlation; ‘r’
Guess the correlation www.rossmanchance.com/applets (also Stat Crunch). ‘March Madness’ bracket-style Guess the Correlation tournament. Playing cards match up head-to-head competition/rounds. Look at a scatterplot, make your guess. The student who is closest survives until the next round.
Correlation & regression applet partner activity Go to www.whfreeman.com/tps5e; go to applets; go to Correlation & Regression. Now download (from my website, under ‘articles, assignments, and activities’) the Correlation Partner Activity & follow the directions. Partner up with someone you have not partnered with yet; this should take no more than 15-20 minutes, including the write-up; print out & turn in with both your names on it.
Caution… interpreting correlation Note: be careful when addressing form in scatterplots. A strong positive linear relationship ► correlation ≈ 1. But correlation ≈ 1 does not necessarily mean the relationship is linear; always plot the data!
r ≈ 0.816 for each of these
Facts about Correlation Correlation doesn’t care which variable is considered explanatory and which is considered response; we can switch x & y and still get the same correlation (r) value. Try with the height & weight Math 140 data; try with the cereal calories and fat data. CAUTION! Switching x & y WILL change your scatterplot; try with our data sets!… it just won’t change ‘r’.
Facts about Correlation r is in standard units, so r doesn’t change if units are changed. If we change from yards to feet, or years to months, or gallons to liters... r is not affected. Positive r: positive association. Negative r: negative association.
Facts about Correlation Correlation is always between -1 & 1. It makes no sense for r = 13 or r = -5. r = 0 means no linear relationship (r near 0 means a very weak one). r = 1 or -1 means a perfect linear association (r near 1 or -1 means a strong one).
Facts about Correlation Both variables must be quantitative (numerical). It doesn’t make any sense to discuss r for qualitative or categorical data. Like the mean and SD, correlation is not resistant. Be careful using r when outliers are present (think of the formula, think of our partner activity).
Facts about Correlation r isn’t enough! … if we just consider r, it could be misleading; we must also consider the distribution’s mean, standard deviation, graphical representation, etc. Correlation does not imply causation; e.g., weekly ice cream sales and the number of pool accidents in the same week.
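The two facts above (switching x & y, and changing units) can be checked with a short Python sketch. It is only an illustration: the data are the five complete (hours, percentage) pairs from the study-time table later in these slides, the hours-to-minutes conversion is an example chosen here, and the helper name `correlation` is hypothetical.

```python
import math
from statistics import mean

def correlation(xs, ys):
    """Return r from the sums of squares and cross-products."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

hours = [3, 4, 4.5, 1, 1.5]
percent = [89, 92, 94, 85, 86]
minutes = [h * 60 for h in hours]  # same study times, different units

r_hours = correlation(hours, percent)
r_minutes = correlation(minutes, percent)  # units changed
r_swapped = correlation(percent, hours)    # x and y switched

print(f"{r_hours:.4f} {r_minutes:.4f} {r_swapped:.4f}")
```

All three values agree up to floating-point rounding, even though the scatterplots for the three cases would look different.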
Absurd examples… correlation does not imply causation… Did you know that eating chocolate makes winning a Nobel Prize more likely? The correlation between per capita chocolate consumption and the number of Nobel laureates per 10 million people for 23 selected countries is r = 0.791. Did you know that statistics is causing global warming? As the number of statistics courses offered has grown over the years, so has the average global temperature!
Least Squares Regression Last section… scatterplots of two quantitative variables; r measures the strength and direction of the linear relationship in a scatterplot.
What would we expect the sodium level to be in a hot dog that has 170 calories?
Least Squares Regression A BETTER model: summarize the overall pattern by drawing a line on the scatterplot. Not just any line; we want a best-fit line over the scatterplot: the Least Squares Regression Line (LSRL), or Regression Line.
Least-Squares Regression Line
Let’s do some predicting by using the LSRL... About how much would a home cost if it were: 2,000 square feet? 2,600 square feet? 1,600 square feet? Categorical data embedded; sometimes scatterplots include this ‘extra’ information.
Let’s do some predicting by using the LSRL... About how large would a home be if it were worth: $450,000? $350,000? $220,000? Also, let’s discuss where the x and y axes start... Categorical data embedded; sometimes scatterplots include this ‘extra’ information.
Least Squares Regression equation to predict values LSRL Model: ŷ = a + bx
* ŷ is the predicted value of the response variable
* a is the y-intercept of the LSRL
* b is the slope of the LSRL; the slope is the predicted (expected) rate of change
* x is the explanatory variable
Least Squares Regression equation It is typical to be asked to interpret the slope & y-intercept of the equation of the LSRL, in context. Caution: Interpret the slope of the LSRL as the predicted (average or expected) change in the response variable for a one-unit change in the explanatory variable, NOT as the actual change in y for a unit change in x. The LSRL is a model; models are not perfect.
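To make the roles of a and b concrete, here is a minimal Python sketch that finds them with the usual textbook formulas b = r·(s_y/s_x) and a = ȳ − b·x̄. Those formulas are an assumption here (the slides rely on Stat Crunch and do not show them), and the helper name `lsrl` is hypothetical. The data are the five complete (hours, percentage) pairs from the study-time table later in these slides.

```python
from statistics import mean, stdev

def lsrl(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Correlation as the sum of z-score products divided by n - 1.
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(xs, ys)) / (len(xs) - 1)
    b = r * s_y / s_x      # slope
    a = y_bar - b * x_bar  # y-intercept: the line passes through (x-bar, y-bar)
    return a, b

hours = [3, 4, 4.5, 1, 1.5]
percent = [89, 92, 94, 85, 86]

a, b = lsrl(hours, percent)
print(f"predicted percent = {a:.2f} + {b:.2f}(hours)")
```

Read in context, the slope here says that each additional hour of studying goes with a predicted increase of about 2.5 percentage points, and the y-intercept is the predicted percentage for 0 hours of studying.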
Interpret slope & y-intercept... Notice the embedded context in the equation of the LSRL
LSRL: Our Data Go back to our data (age & # states visited; height and weight data from Math 140; calories & fat cereal data). Create a scatter plot; then put the LSRL on the scatter plot; also determine the equation of the LSRL. Stat Crunch: stat, regression, simple linear, x variable, y variable, graphs, fitted line plot.
LSRL: Our Data Look at graph of our LSRL for our data Look at our LSRL equation for our data Our line fits scatterplot well (best fit) but not perfectly Make some predictions… do we use our graph or our equation? Which is easier? Which is better? More on this in a minute... Interpret our y-intercept; does it make sense? Interpretation of our slope?
Another example… value of a truck
Truck example… Suppose we were given the LSRL equation for our truck data as predicted price = 38,257 − 0.1629(miles driven). We want to find a more precise estimate of the value if we have driven 100,000 miles. Use the LSRL equation. Using the graph, estimate the price if we have driven 40,000 miles. Then use the above LSRL equation to calculate the predicted value of the truck.
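The two truck predictions amount to plugging a mileage into the equation from this slide. A minimal Python sketch (the function name `predicted_price` is hypothetical; the coefficients are the slide's):

```python
def predicted_price(miles_driven):
    """Predicted truck price from the slide's LSRL equation."""
    return 38257 - 0.1629 * miles_driven

for miles in (100_000, 40_000):
    print(f"{miles:>7,} miles -> predicted price {predicted_price(miles):,.0f}")
```

The graph gives a quick visual estimate; the equation gives the same kind of prediction with more precision.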
Ages & Heights…
Age (years)    Height (inches)
(missing)      18
1              28
4              40
5              42
8              49
Let’s review for a moment… Input the data into Stat Crunch. Create a scatterplot and describe it (what do we include in a description?). Calculate r (different from slope; why?) and the equation of the LSRL; interpret the equation of the LSRL in context; does the y-intercept make sense? Create a graph of the LSRL. Based on the graph of the LSRL or the equation of the LSRL (you choose), make a prediction as to the height of a person at age 35.
LSRL: Our Data Extrapolation: Use of a regression line (or equation of a regression line) for prediction outside the range of values of the explanatory variable, x, used to obtain the line/equation of the line. Such predictions are often not accurate. Friends don’t let friends extrapolate
Detour… memory Monday (or way-back Wednesday)… What is r? What is r’s range? What does it describe?
Detour… memory Monday (or way-back Wednesday)…
* r (or correlation) is a numerical measure of how closely the scatter plot follows a straight line
* r (or correlation) tells us the direction of the scatterplot
* r (or correlation) ranges from -1 to 1
* r (or correlation) describes the scatterplot only (not the LSRL)
Now... We need a numerical measurement that tells us how well the LSRL fits/accurately describes the scatter plot points, the data: the Coefficient of Determination, or r²
Coefficient of determination … Do all the points on the scatterplot fall exactly on the LSRL? Sometimes too high and sometimes too low Is LSRL a good model to use for a particular data set? How well does our model fit our data?
Coefficient of determination or r² “R-sq” in software (Stat Crunch) output. Always 0 ≤ r² ≤ 1. Never calculate by hand; always use Stat Crunch. No need to memorize the formula; trust me... It’s ugly!
Coefficient of Determination or r² Remember: r, the correlation, gives the direction and strength of the linear relationship in a scatterplot; −1 ≤ r ≤ 1. r², the coefficient of determination, is the fraction of the variation in the values of y that is explained by the LSRL; it describes the LSRL; 0 ≤ r² ≤ 1.
Coefficient of determination or r² Interpretation of r²: We say, “x% of the variation in (y variable) is explained by the least squares regression line relating (y variable) to (x variable).” Let’s practice calculating r² and interpreting it for our data sets (age & # states; height & weight Math 140 data; cereal calories & fat). Stat, regression, simple linear, ... Remember, this describes the LSRL, not the scatter plot.
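To see where “fraction of the variation explained” comes from, the Python sketch below computes r² as 1 − SSE/SST, where SSE is the sum of squared vertical misses of the line and SST is the total squared variation of y about its mean. This is the standard definition, assumed here rather than taken from the slides, and the helper name `fit_line` is hypothetical. The data are the five complete (hours, percentage) pairs from the study-time table later in these slides.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares (a, b) for y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

hours = [3, 4, 4.5, 1, 1.5]
percent = [89, 92, 94, 85, 86]

a, b = fit_line(hours, percent)
predicted = [a + b * x for x in hours]

# SSE: variation left over after using the line; SST: total variation in y.
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(percent, predicted))
sst = sum((y - mean(percent)) ** 2 for y in percent)
r_squared = 1 - sse / sst

print(f"About {r_squared:.1%} of the variation in percent is explained "
      f"by the LSRL relating percent to hours.")
```

The printed sentence follows the interpretation template on this slide. For a least-squares line this quantity equals the square of the correlation r.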
General Facts to remember about bivariate data The distinction between explanatory and response variables matters. If they are switched, the scatterplot changes and the LSRL changes (but what doesn’t change?). The LSRL minimizes the sum of the squared vertical distances from the data points to the line.
General Facts to remember about bivariate data Correlation (r) describes the direction and strength of straight-line relationships in scatterplots. The coefficient of determination (r²) is the fraction of the variation in the values of y explained by the LSRL.
Correlation & Regression Wisdom Which of the following scatterplots has the highest correlation?
Correlation & Regression Wisdom All have r = 0.816; all have the exact same LSRL equation. Lesson: Always graph your data! … because correlation and regression describe only linear relationships.
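The four scatterplots on this slide look like the well-known Anscombe's quartet. That identification is an assumption, since the slide does not name the data. Under that assumption, the Python sketch below uses the quartet's values as commonly reproduced and computes r and the LSRL for each set; the helper name `summarize` is hypothetical.

```python
import math
from statistics import mean

def summarize(xs, ys):
    """Return (r, a, b) where y-hat = a + b*x is the least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    b = s_xy / s_xx
    return s_xy / math.sqrt(s_xx * s_yy), y_bar - b * x_bar, b

# Anscombe's quartet (assumed to be the data behind the slide's plots).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (r, a, b) in results.items():
    print(f"{name}: r = {r:.3f}, y-hat = {a:.2f} + {b:.3f}x")
```

The summary numbers are nearly identical across the four sets, yet only one of them shows a plain linear pattern when plotted.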
Correlation & Regression Wisdom Correlation and regression describe only linear relationships
Correlation & Regression Wisdom Correlation is not causation! Association does not imply causation… want a Nobel Prize? Eat some chocolate! How about Methodist ministers & rum imports?
Year    Number of Methodist Ministers in New England    Cuban Rum Imported to Boston (in # of barrels)
1860    63      8,376
1865    48      6,506
1870    53      7,005
1875    64      8,486
1890    85      11,265
1900    80      10,547
1915    140     18,559
Beware of nonsense associations… r = 0.9749, but there is no economic relationship between these variables. The strong association is due entirely to the fact that both imports & health spending grew rapidly in these years. The common variable is the year. Any two variables that both increase over time will show a strong association. That doesn’t mean one explains or influences the other.
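As a quick check on the ministers-and-rum table, the Python sketch below computes r for the seven rows exactly as listed. It is only an illustration, and the helper name `correlation` is hypothetical.

```python
import math
from statistics import mean

def correlation(xs, ys):
    """Return r from the sums of squares and cross-products."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Rows of the table, in year order (1860 through 1915).
ministers = [63, 48, 53, 64, 85, 80, 140]
rum_barrels = [8376, 6506, 7005, 8486, 11265, 10547, 18559]

r_ministers_rum = correlation(ministers, rum_barrels)
print(f"r = {r_ministers_rum:.4f}")
```

The correlation is extremely close to 1, yet nobody would claim that ministers drive rum imports or the reverse. A strong r by itself says nothing about cause.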
Correlation & Regression Wisdom Correlation is not resistant; always plot data and look for unusual trends. … what if Bill Gates walked into a bar?
Correlation & Regression Wisdom Extrapolation! Don’t do it… ever. Example: Growth data for children from age 1 month to age 12 years… LSRL: predicted height = 1.5 ft + 0.25(age in years). What is the predicted height of a 40-year-old?
Outliers & Influential Points All influential points are outliers, but not all outliers are influential points. Outliers: observations that lie outside the overall pattern.
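Plugging age 40 into this slide's equation shows why extrapolation is risky. In the Python sketch below, the equation and the observed age range (1 month to 12 years) come from the slide; the function names and the range check are illustrative assumptions.

```python
# Ages the line was built from, per the slide: 1 month to 12 years.
OBSERVED_AGE_RANGE_YEARS = (1 / 12, 12)

def predicted_height_ft(age_years):
    """Predicted height in feet from the slide's LSRL equation."""
    return 1.5 + 0.25 * age_years

def is_extrapolation(age_years, observed_range=OBSERVED_AGE_RANGE_YEARS):
    """True when the age lies outside the ages used to fit the line."""
    low, high = observed_range
    return not (low <= age_years <= high)

height_at_40 = predicted_height_ft(40)
print(f"Predicted height at 40: {height_at_40} ft "
      f"(extrapolation: {is_extrapolation(40)})")
```

A predicted height of 11.5 feet is clearly absurd: the line describes growing children, not adults.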
Outliers & Influential Points Influential points/observations: If removed would significantly change LSRL (slope and/or y-intercept)
Input the following data... Graph the scatterplot and the LSRL, & calculate the equation of the LSRL.
# Hours Spent Studying for the Stats Test    Percentage Earned on the Stats Test
3            89
4            92
4.5          94
1            85
1.5          86
(missing)    83
Input the following data... Now, calculate the equation of the LSRL again with this additional piece of data (last line in the table below)... What do you observe about the scatter plot and the equation of the LSRL?
# Hours Spent Studying for the Stats Test    Percentage Earned on the Stats Test
3            89
4            92
4.5          94
1            85
1.5          86
(missing)    83
7            70
Input the following data... Now, calculate the equation of the LSRL again with this (slightly different; only the last line has changed) set of data... What do you observe about the scatter plot and the equation of the LSRL?
# Hours Spent Studying for the Stats Test    Percentage Earned on the Stats Test
3            89
4            92
4.5          94
1            85
1.5          86
(missing)    83
(missing)    (missing)
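The effect of the added point (7 hours, 70%) can be sketched in Python by fitting the line with and without it. This is only an illustration: it uses the five rows that have both values (the row with no hours value is left out rather than guessed), so the numbers will differ from what Stat Crunch gives on the full table. The helper name `fit_line` is hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares (a, b) for y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

# Complete rows only (assumption: the incomplete row is dropped).
hours = [3, 4, 4.5, 1, 1.5]
percent = [89, 92, 94, 85, 86]

a_before, b_before = fit_line(hours, percent)
a_after, b_after = fit_line(hours + [7], percent + [70])  # add (7, 70)

print(f"without (7, 70): y-hat = {a_before:.2f} + {b_before:.2f}x")
print(f"with    (7, 70): y-hat = {a_after:.2f} + {b_after:.2f}x")
```

One extra point is enough to flip the slope from positive to negative, which is what makes it an influential point.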
Class Activity… Groups of 2; go to my website, data sets; choose 2 numerical categories that you believe are associated (that we have not used as examples yet). Be sure to go through your data and ‘clean’ it up; justify any ‘cleaning’ you do. Create a scatterplot and describe the association between the two variables using DOFS. Calculate the correlation of the scatter plot (r). Do you think that a regression line is appropriate for your data? Why or why not? Even if you believe a line is not appropriate for your data, go ahead and create the LSRL graph & calculate the equation of the LSRL; calculate the coefficient of determination (r²) & interpret r². Interpret the slope and the y-intercept of the LSRL in context. Which variable should be your ‘x’ and which should be your ‘y’? Continue on next slide for more questions...
Class Activity… Come up with a question about a prediction (such as: if a person weighs 140 pounds, what would we expect their height to be?). Based on your LSRL graph or equation, calculate your prediction; show your work. If there are outliers and/or influential point(s) on your scatter plot, circle them in red and label them appropriately as ‘outlier’ and/or ‘influential point.’ Print everything out, put each group member’s name on it, and turn it in.
OLI Assignments...