DESCRIBING RELATIONSHIPS …
RELATIONSHIPS BETWEEN... Talk to the person next to you. Think of two things that you believe may be related. For example, height and weight are generally related... The taller the person, generally, the more they weigh. Share out two numerical categories that you believe are related on the board.
DO YOU BELIEVE THERE IS A RELATIONSHIP BETWEEN... TIME SPENT STUDYING AND GPA? # OF CIGARETTES SMOKED DAILY & LIFE EXPECTANCY? SALARY AND EDUCATION LEVEL? AGE AND HEIGHT? AGE OF AUTOMOBILE AND VALUE OF AUTOMOBILE?
RELATIONSHIPS When we consider (possible) relationships between 2 (numeric) variables, the data is referred to as bi-variate data. There may or may not exist a relationship/an association between the 2 variables. Does one variable ‘cause’ the other? Caution! Does one variable influence the other? Or is the relationship influenced by another variable(s) that we are unaware of?
BIVARIATE DATA Proceed similarly as uni-variate distributions … Still graph (use model to describe data; scatter plot; LSRL) Still look at overall patterns and deviations from those patterns (DOFS; Direction, Outlier(s), Form, Strength) Still analyze numerical summary (descriptive statistics)
BIVARIATE DISTRIBUTIONS Explanatory variable, x, ‘factor,’ may help predict or explain changes in response variable; usually on horizontal axis Response variable, y, measures an outcome of a study, usually on vertical axis
BIVARIATE DATA DISTRIBUTIONS For example... Alcohol (explanatory) and body temperature (response). Generally, the more alcohol consumed, the lower the body temperature. Still use caution with ‘cause.’ Sometimes we don’t have variables that are clearly explanatory and response. Sometimes there could be two ‘explanatory’ variables. Examples: Discuss with a partner for 1 minute
EXPLANATORY & RESPONSE OR TWO EXPLANATORY VARIABLES? ACT Score and SAT Score Activity level and physical fitness SAT Math and SAT Verbal Scores
GRAPHICAL MODELS… Many graphing models display uni-variate data exclusively (review). Discuss for 30 seconds and share out. Main graphical representation used to display bivariate data (two quantitative variables) is scatterplot.
SCATTERPLOTS Scatterplots show relationship between two quantitative variables measured on the same individuals Each individual in data appears as a point (x, y) on the scatterplot. Plot explanatory variable (if there is one) on horizontal axis. If no distinction between explanatory and response, either can be plotted on horizontal axis. Label both axes. Scale both axes with uniform intervals (but scales don’t have to match)
LABEL & SCALE SCATTERPLOT VARIABLES: CLEARLY EXPLANATORY AND RESPONSE??
CREATING & INTERPRETING SCATTERPLOTS Let’s collect some data; on the board write your height in inches and your hand span in inches (to nearest ½ inch) Input into Minitab & create scatterplot Let’s do some predicting... to the best of our ability...
INTERPRETING SCATTERPLOTS Look for overall patterns (DOFS) including: direction: up or down, + or – association? outliers/deviations: individual value(s) that fall outside the overall pattern; no outlier rule for bi-variate data, unlike uni-variate data form: linear? curved? clusters? gaps? strength: how closely do the points follow a clear form? Strong, weak, moderate?
CREATING & INTERPRETING SCATTERPLOTS Now let’s go to Kathy Kubo’s website, get height/weight data from Math 075 Spring; copy/paste into Minitab (graph, scatterplot) Is data messy? Does it need to be ‘fixed?’ Interpret scatterplot
SCATTERPLOTS: NOTE Might be asked to graph a scatterplot from data Might need to sketch what’s on Minitab Doesn’t have to be 100% exactly accurate; do your best Scaling, labeling: a must!
MEASURING LINEAR ASSOCIATION Scatterplots (bi-variate data) show direction, outliers/ deviation(s), form, strength of relationship between two quantitative variables Linear relationships are important; common, simple pattern Linear relationship is strong if points are close to a straight line; weak if scattered about Other relationships (quadratic, logarithmic, etc.)
HOW STRONG ARE THESE RELATIONSHIPS? WHICH ONE IS STRONGER?
MEASURING LINEAR ASSOCIATION: CORRELATION OR “R” Eyes are not a good judge Need to specify just how strong or weak a linear relationship is Need a numeric measure Correlation or ‘r’
MEASURING LINEAR ASSOCIATION: CORRELATION OR “R” * Correlation (r) is a numeric measure of direction and strength of a linear relationship between two quantitative variables Correlation (r) is always between -1 and 1 Correlation (r) is not resistant (look at formula; based on mean) r doesn’t tell us about individual data points, but rather trends in the data * Never calculate by formula; use Minitab (dependent on having raw data)
MEASURING LINEAR ASSOCIATION: CORRELATION OR “R” r ≈ 0: weak or no linear relationship r close to 1: strong positive linear relationship r close to -1: strong negative linear relationship Go back to our height/hand span data & calculate ‘r,’ correlation
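For anyone curious what the software is doing behind the scenes, here is a minimal Python sketch of r as the average product of z-scores. The height and hand span values are made up for illustration (they are not our class data), and the function name `correlation` is just a label chosen here.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r: sum of z-score products divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (n - 1 divisor)
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    z_products = (((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                  for x, y in zip(xs, ys))
    return sum(z_products) / (n - 1)

# Hypothetical data: height (inches) and hand span (inches)
heights = [62, 64, 65, 67, 68, 70, 72, 74]
hand_spans = [7.0, 7.5, 7.5, 8.0, 8.5, 8.5, 9.0, 9.5]

r = correlation(heights, hand_spans)
print("r =", round(r, 3))
```

Because the formula uses the mean and standard deviation of each variable, it also hints at why r is not resistant.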
GUESS THE CORRELATION ‘March Madness’ bracket-style Guess the Correlation tournament Playing cards; match up head-to-head competition/rounds Look at a scatterplot, each write down your guess on notecards and reveal at same time Student who is closest survives until the next round
CORRELATION & REGRESSION APPLET PARTNER ACTIVITY Go to Go to applets Go to Correlation & Regression Follow the directions on the hand out (or see my website) Partner up with the person next to you; this should take no more than minutes, including the write-up
CAUTION… INTERPRETING CORRELATION Note: be careful when addressing form in scatterplots Strong positive linear relationship ► correlation ≈ 1 But Correlation ≈ 1 does not necessarily mean relationship is linear; always plot data!
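A quick way to see this caution in action: the sketch below uses a clearly curved relationship (y = x squared, chosen here only as an illustration) and still gets a correlation close to 1. The helper `correlation` is a hand-rolled function, not a Minitab command.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r from sums of squares and cross-products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = list(range(1, 11))        # 1, 2, ..., 10
ys = [x ** 2 for x in xs]      # a curved (quadratic) pattern, not a line

r_curved = correlation(xs, ys)
print("r for a curved pattern =", round(r_curved, 3))
```

The number alone would suggest a strong linear relationship; only the plot reveals the curve.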
R ≈ FOR EACH OF THESE
CALCULATING CORRELATION “R”
Let’s calculate r for Math 075 Spring height & weight data and determine how weak or strong the linear relationship is with our data Stat, regression, fitted line
FACTS ABOUT CORRELATION Correlation doesn’t care which variable is considered explanatory and which is considered response Can switch x & y Still same correlation (r) value CAUTION! Switching x & y WILL change your scatterplot… just not ‘r’
FACTS ABOUT CORRELATION r is in standard units, so r doesn’t change if units are changed If we change from yards to feet, r is not affected + r, positive association - r, negative association
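These two facts (switching x and y, and changing units) can be checked numerically. The sketch below uses made-up distances in yards and an arbitrary second variable; all values and names are hypothetical.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r from sums of squares and cross-products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: a distance in yards and some response
xs_yards = [1.0, 2.0, 3.5, 5.0, 6.0, 8.0]
ys = [12.0, 15.0, 21.0, 24.0, 30.0, 37.0]

xs_feet = [3 * x for x in xs_yards]      # same distances, now in feet

r_yards = correlation(xs_yards, ys)
r_feet = correlation(xs_feet, ys)        # units changed
r_swapped = correlation(ys, xs_yards)    # x and y switched

print(round(r_yards, 4), round(r_feet, 4), round(r_swapped, 4))
```

All three printed values agree, even though the scatterplots would look different.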
FACTS ABOUT CORRELATION Correlation is always between -1 & 1 Makes no sense for r = 13 or r = -5 r = 0 means no linear relationship; r near 0 means a very weak one r = 1 or -1 means a perfect linear association; r close to 1 or -1 means a strong one
FACTS ABOUT CORRELATION Both variables must be quantitative, numerical. Doesn’t make any sense to discuss r for qualitative or categorical data Correlation is not resistant (like mean and SD). Be careful using r when outliers are present
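To see what "not resistant" means, the sketch below computes r for a made-up, nearly linear data set, then adds a single outlier and computes r again. The data and the outlier are hypothetical.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r from sums of squares and cross-products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical, nearly linear data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
r_clean = correlation(xs, ys)

# Same data plus one outlier at (9, 0)
r_outlier = correlation(xs + [9], ys + [0.0])

print("without outlier:", round(r_clean, 3))
print("with outlier:   ", round(r_outlier, 3))
```

One point out of nine is enough to pull r from nearly 1 down to a weak value.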
FACTS ABOUT CORRELATION r isn’t enough! … mean, standard deviation, graphical representation Correlation does not imply causation; e.g., # ice cream sales in a given week and # of pool accidents
ABSURD EXAMPLES… CORRELATION DOES NOT IMPLY CAUSATION… Did you know that eating chocolate makes winning a Nobel Prize more likely? The correlation between per capita chocolate consumption and the number of Nobel laureates per 10 million people for 23 selected countries is r = Did you know that statistics is causing global warming? As the number of statistics courses offered has grown over the years, so has the average global temperature!
LEAST SQUARES REGRESSION Last section… scatterplots of two quantitative variables r measures strength and direction of linear relationship of scatterplot
WHAT WOULD WE EXPECT THE SODIUM LEVEL TO BE IN A HOT DOG THAT HAS 170 CALORIES?
LEAST SQUARES REGRESSION BETTER model to summarize overall pattern by drawing a line on scatterplot Not any line; we want a best-fit line over scatterplot Least Squares Regression Line (LSRL) or Regression Line
LEAST-SQUARES REGRESSION LINE
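A minimal sketch of how a least-squares line can be computed and used for prediction. The calorie and sodium values below are invented for illustration (they are not the hot dog data from the slide), and `least_squares_line` is a name chosen here.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x   # the line passes through (mean_x, mean_y)
    return slope, intercept

# Hypothetical data: calories (x) and sodium in mg (y)
calories = [110, 130, 140, 150, 170, 180, 190]
sodium = [300, 370, 390, 420, 480, 500, 550]

slope, intercept = least_squares_line(calories, sodium)
predicted_sodium = intercept + slope * 170   # prediction at 170 calories

print("predicted sodium = %.1f + %.3f * calories" % (intercept, slope))
print("prediction at 170 calories:", round(predicted_sodium, 1))
```

The prediction is what the model expects on average, not a guarantee for any one hot dog.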
LET’S DO SOME PREDICTING...
LEAST SQUARES REGRESSION (PREDICTS VALUES)
Often will be asked to interpret slope of LSRL & y- intercept, in context Caution: Interpret slope of LSRL as the predicted or average change or expected change in the response variable given a unit change in the explanatory variable NOT change in y for a unit change in x; LSRL is a model; models are not perfect
INTERPRET SLOPE & Y-INTERCEPT...
LSRL: OUR DATA Go back to Kathy Kubo’s data on height and weight (or we can choose 2 other ‘related’ numerical distributions... your choice) Now let’s put our LSRL on our scatterplot & determine the equation of the LSRL Minitab: stat, regression, fitted line plot
LSRL: OUR DATA Look at graph of our LSRL for our data Look at our LSRL equation for our data Our line fits scatterplot well (best fit) but not perfectly Make some predictions… what if our height was … what if our weight was … Interpret our y-intercept; does it make sense? Interpretation of our slope?
ANOTHER EXAMPLE… VALUE OF A TRUCK
TRUCK EXAMPLE…
AGES & HEIGHTS… Age (years) | Height (inches)
LET’S REVIEW FOR A MOMENT, SHALL WE … Input into Minitab Create scatterplot and describe scatterplot (what do we include in a description?) Calculate r (btw, different from slope; why?), equation of LSRL; interpret equation of LSRL in context; does y-intercept make sense? Based on this data, make a prediction as to the height of a person at age 35.
LSRL: OUR DATA Extrapolation: Use of a regression line for prediction outside the range of values of the explanatory variable x used to obtain the line. Such predictions are often not accurate. Friends don’t let friends extrapolate!
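A small sketch of why extrapolation goes wrong. The ages and heights below are made up (children aged 2 to 12); the line fits them well, but a prediction at age 35 lands far outside anything plausible.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: age (years) and height (inches) for children
ages = [2, 4, 6, 8, 10, 12]
heights = [34, 40, 45, 50, 55, 59]

slope, intercept = least_squares_line(ages, heights)

inside_range = intercept + slope * 7     # age 7 is within the data (2 to 12)
outside_range = intercept + slope * 35   # age 35 is far outside the data

print("predicted height at age 7: ", round(inside_range, 1), "inches")
print("predicted height at age 35:", round(outside_range, 1), "inches")
```

With these made-up numbers the age-35 prediction comes out at nearly 10 feet, because the line assumes growth continues at the childhood rate forever.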
CALCULATING THE EQUATION OF THE LSRL: WHAT IF WE DON’T HAVE THE RAW DATA?
CALCULATING THE EQUATION FOR THE LSRL: WHAT IF WE DON’T HAVE THE RAW DATA?
EXAMPLE: CREATING EQUATION OF LSRL (WITHOUT RAW DATA)
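Without raw data, the LSRL can still be built from summary statistics using the standard relationships slope = r × (s_y / s_x) and intercept = ȳ − slope × x̄. The sketch below applies them to made-up summary values; the numbers and the function name are hypothetical.

```python
def lsrl_from_summary(mean_x, sd_x, mean_y, sd_y, r):
    """Return (slope, intercept) of the LSRL from summary statistics."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x   # line passes through (mean_x, mean_y)
    return slope, intercept

# Hypothetical summary statistics
mean_x, sd_x = 10.0, 2.0    # explanatory variable
mean_y, sd_y = 50.0, 8.0    # response variable
r = 0.5                     # correlation

slope, intercept = lsrl_from_summary(mean_x, sd_x, mean_y, sd_y, r)
print("predicted y = %.1f + %.1f x" % (intercept, slope))
```

Note that the slope and r always share the same sign, since the standard deviations are positive.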
INTERPRETING SOFTWARE OUTPUT… Age vs. Gesell Score
DETOUR… MEMORY MONDAY (OR WAY-BACK WEDNESDAY)… What is r? What is r’s range? r tells us the strength and direction of the linear relationship in the scatterplot. ‘r’ ranges from -1 to 1. ‘r’ describes the scatterplot only (not LSRL)
NOW…
COEFFICIENT OF DETERMINATION … Do all the points on the scatterplot fall exactly on the LSRL? Sometimes too high and sometimes too low Is LSRL a good model to use for a particular data set? How well does our model fit our data?
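One way to put a number on "how well does our model fit?" is the coefficient of determination, r². The sketch below computes it as 1 − SSE/SST (the fraction of variation in y accounted for by the line) and checks that it matches the square of r. The data are made up and the helper names are chosen here.

```python
from math import sqrt

def sums(xs, ys):
    """Return (mean_x, mean_y, sxx, syy, sxy) for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return mean_x, mean_y, sxx, syy, sxy

# Hypothetical data
xs = [1, 2, 3, 4, 5, 6]
ys = [2.0, 4.1, 5.9, 8.3, 9.7, 12.4]

mean_x, mean_y, sxx, syy, sxy = sums(xs, ys)
slope = sxy / sxx
intercept = mean_y - slope * mean_x
r = sxy / sqrt(sxx * syy)

# SSE: squared vertical misses of the line; SST: total variation in y
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
sst = syy
r_squared = 1 - sse / sst

print("r =", round(r, 4), " r^2 =", round(r_squared, 4))
```

An r² close to 1 means the line accounts for most of the variation in the response; it still does not say the relationship is linear, so the plot matters.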
FACTS TO REMEMBER ABOUT LSRL Distinction between explanatory and response variables matters. If switched, scatterplot changes and LSRL changes (but what doesn’t change?) LSRL minimizes the sum of the squared vertical distances (residuals) from the data points to the line; vertical only, not horizontal
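Here is a short sketch, on made-up data, of what happens to the LSRL when x and y trade places. Because the line only measures vertical misses, regressing y on x and regressing x on y give two different lines.

```python
def slope_of(xs, ys):
    """Slope of the least-squares line predicting ys from xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 5, 4, 6, 7]

slope_y_on_x = slope_of(xs, ys)   # predict y from x
slope_x_on_y = slope_of(ys, xs)   # predict x from y (variables switched)

# If both were the same line, slope_y_on_x would equal 1 / slope_x_on_y
print("y on x slope:           ", round(slope_y_on_x, 3))
print("x on y slope, inverted: ", round(1 / slope_x_on_y, 3))
print("product of the slopes:  ", round(slope_y_on_x * slope_x_on_y, 3))
```

The product of the two slopes equals r², which is one way to see that r itself is unchanged by the switch.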
FACTS TO REMEMBER ABOUT LSRL
CORRELATION & REGRESSION WISDOM Which of the following scatterplots has the highest correlation?
CORRELATION & REGRESSION WISDOM All r = 0.816; all have same exact LSRL equation Lesson: Always graph your data! … because correlation and regression describe only linear relationships
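The figure for this slide is not reproduced here. Assuming the four scatterplots are the well-known Anscombe's quartet (an assumption, though r = 0.816 matches it), the sketch below recomputes r and the LSRL for each data set.

```python
from math import sqrt

def summarize(xs, ys):
    """Return (r, slope, intercept) for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    return sxy / sqrt(sxx * syy), slope, mean_y - slope * mean_x

# Anscombe's quartet (assumed to be the data behind the slide)
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

results = [summarize(xs, ys) for xs, ys in quartet]
for number, (r, slope, intercept) in enumerate(results, start=1):
    print("set %d: r = %.3f, predicted y = %.2f + %.3f x"
          % (number, r, intercept, slope))
```

The summaries are nearly identical, yet the four plots look nothing alike: one is roughly linear, one is curved, one has an outlier in y, and one is driven by a single point in x.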
CORRELATION & REGRESSION WISDOM Correlation and regression describe only linear relationships
CORRELATION & REGRESSION WISDOM Correlation is not causation! Association does not imply causation… want a Nobel Prize? Eat some chocolate! How about Methodist ministers & rum imports? [Table: Year | Number of Methodist Ministers in New England | Cuban Rum Imported to Boston (in # of barrels)]
BEWARE OF NONSENSE ASSOCIATIONS… r = , but no economic relationship between these variables Strong association is due entirely to the fact that both imports & health spending grew rapidly in these years. Common year is other variable. Any two variables that both increase over time will show a strong association. Doesn’t mean one explains the other or influences the other
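To illustrate the "common year" idea, the sketch below invents two series that have nothing to do with each other except that both grow over the same ten years. The numbers are entirely hypothetical and are not the data from the slide.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r from sums of squares and cross-products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Two hypothetical series recorded over the same 10 years
years = list(range(2000, 2010))
series_a = [10, 12, 15, 18, 22, 27, 31, 36, 42, 49]            # e.g. imports
series_b = [200, 215, 228, 247, 260, 281, 300, 318, 340, 365]  # e.g. spending

r_trend = correlation(series_a, series_b)
print("r between the two series =", round(r_trend, 3))
```

The strong r reflects only the shared upward trend over time, which plays the role of the lurking variable.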
CORRELATION & REGRESSION WISDOM Correlation is not resistant; always plot data and look for unusual trends. … what if Bill Gates walked into a bar?
CORRELATION & REGRESSION WISDOM
OUTLIERS & INFLUENTIAL POINTS All influential points are outliers, but not all outliers are influential points.
OUTLIERS & INFLUENTIAL POINTS Outlier: observation lies outside overall pattern Points that are outliers in the ‘y’ direction of scatterplot have large residuals. Points that are outliers in the ‘x’ direction of scatterplot may not necessarily have large residuals.
OUTLIERS & INFLUENTIAL POINTS Influential points/observations: If removed would significantly change LSRL (slope and/or y-intercept)
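A minimal sketch of an influential point, using made-up data: six points with a clear upward pattern, plus one point far out in the x direction. Fitting the LSRL with and without that point shows how much it moves the line.

```python
def least_squares_line(points):
    """Return (slope, intercept) of the LSRL for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data with an upward pattern
base_points = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 6), (6, 7)]
far_point = (20, 3)   # extreme in the x direction

slope_without, intercept_without = least_squares_line(base_points)
slope_with, intercept_with = least_squares_line(base_points + [far_point])

print("without the point: slope = %.3f" % slope_without)
print("with the point:    slope = %.3f" % slope_with)
```

Here a single observation flips the slope from positive to slightly negative, which is what makes it influential.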
CLASS ACTIVITY… Groups of 2 or 3; measure each other’s head circumferences & arm spans (both in inches, rounded to the nearest ½ inch). Write data on board Create scatterplot and describe the association between head circumference and arm span. Is a regression line appropriate for our data? Why or why not? If so, create LSRL graph & equation, calculate the correlation and the coefficient of determination Interpret the slope and the y-intercept of the LSRL in context What does it mean if a point falls above the LSRL? Below the LSRL?