Describing relationships …

Presentation on theme: "Describing relationships …"— Presentation transcript:

1 Describing relationships …

2 Relationships between ...
Talk to the person next to you. Think of two things that you believe may be related. For example, height and weight are generally related... The taller the person, generally, the more they weigh. Write your two numerical categories that you believe are related on the board.

3 Do you believe there is a relationship between...
Time spent studying and GPA? Number of cigarettes smoked daily and life expectancy? Salary and education level? Age and height? Age of an automobile and its value? Possibly discuss categorical vs. numerical data (spiral).

4 Relationships When we consider (possible) relationships between 2 (numeric) variables, the data is referred to as bi-variate data. There may or may not exist a relationship/an association between the 2 variables. Does one variable ‘cause’ the other? Caution! Does one variable influence the other? Or is the relationship influenced by another variable(s) that we are unaware of?

5 Bivariate Data Proceed as with uni-variate distributions. Still graph (use a model to describe the data: scatterplot, LSRL). Still look at overall patterns and deviations from those patterns (DOFS: Direction, Outlier(s), Form, Strength; or Trends, Strength, Shape). Still analyze numerical summaries (descriptive statistics).

6 Bivariate Distributions
Explanatory variable, x, ‘factor,’ may help predict or explain changes in response variable; usually on horizontal axis Response variable, y, measures an outcome of a study, usually on vertical axis Notice the lower left corner of graph. In SPs we do not require that the lower left corner be (0, 0). Reason: want to zoom in on the data and not show a substantial amount of empty space.

7 Bivariate Data Distributions
For example ... Alcohol (explanatory) and body temperature (response). Generally, the more alcohol consumed, the lower the body temperature. Still use caution with ‘cause.’ Sometimes we don’t have variables that are clearly explanatory and response. Sometimes there could be two ‘explanatory’ variables. Examples: discuss with a partner for 1 minute.

8 Explanatory & Response or Two Explanatory Variables?
ACT Score and SAT Score Activity level and physical fitness SAT Math and SAT Verbal Scores

9 Graphical models… Many graphing models display uni-variate data exclusively (review). Discuss for 30 seconds and share out. Main graphical representation used to display bivariate data (two quantitative variables) is scatterplot.

10 Scatterplots Scatterplots show relationship between two quantitative variables measured on the same individuals Each individual in data appears as a point (x, y) on the scatterplot. Plot explanatory variable (if there is one) on horizontal axis. If no distinction between explanatory and response, either can be plotted on horizontal axis. Label both axes. Scale both axes with uniform intervals (but scales don’t have to match)

11 Label & Scale Scatterplot Variables: Clearly Explanatory and Response??
Can either variable be explanatory or response? Yes, in this case. Notice the graph doesn’t start at 0 on either axis. That’s OK for scatterplots, although it can be deceptive for bar graphs.

12 Creating & Interpreting Scatterplots
Let’s collect some data On board, write your height (in inches) and your weight (in pounds) Input into Minitab (graph, scatterplot)

13 Interpreting Scatterplots
Look for overall patterns (DOFS), including: Direction: up or down; + or – association? Outliers/deviations: individual value(s) that fall outside the overall pattern; there is no outlier rule for bi-variate data, unlike uni-variate data. Form: linear? curved? clusters? gaps? Strength: how closely do the points follow a clear form? Strong, weak, moderate? Can also look at page 135 and describe those SPs.

14 Disregard ‘correlation r = …’ for now
Let’s just look at the scatterplots. Take 5 minutes with your partner to describe each SP using DOFS (direction, outlier(s), form, strength).

15 Sometimes there is clearly a non-linear, curved (could be quadratic, log, etc.) shape to scatterplots. We will focus primarily on linear scatterplots in this course (very common). But if we did have a curved SP, we could describe it through DOFS (direction, outlier(s), form, strength). Two of these SPs could be labeled and scaled much better. It is important, though, that we examine the SP to be sure that the trend is linear. If we apply the techniques in this chapter to a nonlinear trend, it is (very often) incorrect, big time.

16 Scatterplots: Note You might be asked to graph a scatterplot from data. You might need to sketch what’s on Minitab. It doesn’t have to be 100% accurate; do your best. Scaling and labeling: a must!

17 Measuring Linear Association
Scatterplots (bi-variate data) show the direction, outliers/deviation(s), form, and strength of the relationship between two quantitative variables. Linear relationships are important: a common, simple pattern. A linear relationship is strong if the points are close to a straight line and weak if they are scattered about. There are other relationships too (quadratic, logarithmic, etc.). We need more than ‘strong’ or ‘weak’, which leads us into r, or correlation.

18 How strong are these relationships? Which one is stronger?

19 Measuring Linear Association: Correlation or “r”
Eyes are not a good judge Need to specify just how strong or weak a linear relationship is Need a numeric measure Correlation or ‘r’

20 Measuring Linear Association: Correlation or “r”
Correlation (r) is a numeric measure of the direction and strength of a linear relationship between two quantitative variables. Correlation (r) is always between -1 and 1. Correlation (r) is not resistant (look at the formula; it is based on the mean). r doesn’t tell us about individual data points, but rather about trends in the data. Never calculate by formula; use Minitab (this depends on having the raw data). r makes sense only in a linear context (not curved).

21 Measuring Linear Association: Correlation or “r”
r ≈ 0 → no strong linear relationship. r close to 1 → strong positive linear relationship. r close to -1 → strong negative linear relationship. -1 and 1 are equally strong; the sign just tells us the direction. Likewise, 0.9 and -0.9 are equally strong.

22 To estimate the correlation from a scatterplot, have students imagine drawing an oval around the points. The rounder the oval, the closer the correlation will be to 0. The longer and skinnier the oval, the closer the correlation will be to + or – 1. Rounder ovals look like 0’s and skinnier ovals look like tilted 1’s.

23 Guess the correlation www.rossmanchance.com/applets
‘March Madness’ bracket-style Guess the Correlation tournament. Have students number off, then randomly choose numbers to set up the head-to-head match-ups (use RDG in the calculator, RFT, or the RDT on page 221 of the textbook). For each match-up, present one SP. Each student writes down a guess on a notecard, and both reveal at the same time. The student whose guess is closest survives until the next round.

24 Caution… interpreting correlation
Note: be careful when addressing form in scatterplots. A strong positive linear relationship ► correlation ≈ 1. But correlation ≈ 1 does not necessarily mean the relationship is linear; always plot the data! Example: y = x^2 can give correlation = 0.97. The two statements are not bi-conditional (a bi-conditional holds in both directions).
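The y = x^2 caution can be checked with a short sketch. The x values 1 through 10 are an assumption (the slide does not say which points it used), and `pearson_r` is a hypothetical helper written here, not a Minitab command.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation r for two equal-length lists of numbers."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = list(range(1, 11))      # assumed x values: 1, 2, ..., 10
ys = [x ** 2 for x in xs]    # perfectly curved (quadratic), not linear
r_curved = pearson_r(xs, ys)
print(f"r for y = x^2 on 1..10: {r_curved:.3f}")
```

The correlation comes out close to 1 even though every point lies exactly on a curve, which is why the scatterplot has to be checked first.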

25 r ≈ … for each of these. Anscombe's quartet comprises four datasets that have nearly identical simple statistical properties, yet appear very different when graphed. Each dataset consists of eleven (x, y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties.

26 Calculating Correlation “r”
n, x_1, x_2, …, x̄, y_1, y_2, …, ȳ, s_x, s_y, … Calculating r by hand is not the way to go; explain the formula only to show students why r is not resistant.
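A minimal sketch of why r is not resistant, assuming the standard textbook formula (the average product of standardized x and y values, dividing by n − 1). The height and weight numbers are hypothetical, not class data.

```python
import statistics

def correlation_by_formula(xs, ys):
    """r = sum of z_x * z_y over all points, divided by (n - 1)."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample SDs
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical heights (inches) and weights (pounds).
heights = [60, 62, 64, 66, 68, 70]
weights = [110, 120, 128, 140, 150, 162]

r_clean = correlation_by_formula(heights, weights)
# Add one unusual person: tall but very light.
r_outlier = correlation_by_formula(heights + [72], weights + [100])

print(f"without outlier: r = {r_clean:.3f}")
print(f"with outlier:    r = {r_outlier:.3f}")
```

Because the formula is built from means and standard deviations, a single unusual point is enough to pull r down sharply.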

27 Calculating Correlation “r”
Let’s calculate r for our height & weight data and determine how weak or strong the linear relationship is with our data Stat, regression, fitted line

28 Facts about Correlation
Correlation doesn’t care which variable is considered explanatory and which is considered response. We can switch x and y and still get the same correlation (r) value. CAUTION! Switching x and y WILL change your scatterplot, just not r. Try switching in Minitab and convince yourself: the SP changes, but the correlation does not.

29 Facts about Correlation
r is in standard units, so r doesn’t change if units are changed. If we change from yards to feet, r is not affected. Positive r means positive association; negative r means negative association. So if we changed our height units from inches to cm, r would not change. If we changed our weight units from pounds to grams, r wouldn’t change.
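Both facts (switching x and y, and changing units) can be verified with a small sketch. The data are hypothetical, and `pearson_r` is a helper defined here for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation r for two equal-length lists of numbers."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical heights (inches) and weights (pounds).
heights_in = [60, 62, 64, 66, 68, 70]
weights_lb = [110, 120, 128, 140, 150, 162]

r_xy = pearson_r(heights_in, weights_lb)
r_yx = pearson_r(weights_lb, heights_in)        # x and y switched

heights_cm = [h * 2.54 for h in heights_in]     # inches -> centimeters
weights_g = [w * 453.592 for w in weights_lb]   # pounds -> grams
r_metric = pearson_r(heights_cm, weights_g)

print(r_xy, r_yx, r_metric)
```

All three values agree up to floating-point rounding.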

30 Facts about Correlation
Correlation is always between -1 and 1. It makes no sense to have r = 13 or r = -5. r = 0 means no linear relationship. r = 1 or -1 means a perfect linear association. r only makes sense with linear relationships.

31

32 Facts about Correlation
Both variables must be quantitative, numerical. Doesn’t make any sense to discuss r for qualitative or categorical data Correlation is not resistant (like mean and SD). Be careful using r when outliers are present

33 Facts about Correlation
r isn’t enough! … mean, standard deviation, graphical representation Correlation does not imply causation; i.e., # students who own cell phones and # students passing AP exams

34 Absurd examples… correlation does not imply causation…
Did you know that eating chocolate makes winning a Nobel Prize more likely? The correlation between per capita chocolate consumption and the number of Nobel laureates per 10 million people for 23 selected countries is r = … Did you know that statistics is causing global warming? As the number of statistics courses offered has grown over the years, so has the average global temperature! Correlation does NOT imply causation! Sometimes there is another variable in the background that is influencing one or both of the variables.

35 Least Squares Regression
Last section… scatterplots of two quantitative variables r measures strength and direction of linear relationship of scatterplot

36 Least Squares Regression
BETTER model to summarize overall pattern by drawing a line on scatterplot Not any line; we want a best-fit line over scatterplot Least Squares Regression Line (LSRL)

37 Least-Squares Regression Line

38 Least Squares Regression (predicts values)
LSRL model: ŷ = a + bx. ŷ is the predicted value of the response variable; a is the y-intercept of the LSRL; b is the slope of the LSRL (the predicted, or expected, rate of change); x is the explanatory variable. Review y = mx + b; then discuss the LSRL model equation.

39 Least Squares Regression (predicts values)
Often you will be asked to interpret the slope and y-intercept of the LSRL in context. Caution: interpret the slope of the LSRL as the predicted (average, expected) change in the response variable for a one-unit change in the explanatory variable, NOT as the change in y for a unit change in x. The LSRL is a model; models are not perfect.

40 LSRL: Our Data Go back to whole-class data on height and weight Now let’s put our LSRL on our scatterplot & determine the equation of the LSRL Minitab: stat, regression, fitted line plot Always embed context into LSRL equation (and on graph); also talk about r and r^2 and s

41 LSRL: Our Data Look at graph of our LSRL for our data Look at our LSRL equation for our data Our line fits scatterplot well (best fit) but not perfectly Make some predictions… what if our height was … what if our weight was … Interpret our y-intercept; does it make sense? Interpretation of our slope? Remember switching x & y will change SP… so WILL change LSRL; will not change r; what does it mean for a point to be above LSRL? Below LSRL?

42 Another example… value of a truck
The data points do not fit exactly on the line; it is a best fit. The line gives only predictions: expected values given a certain number of miles driven. It’s a model, and models are not perfect. What if we had driven the truck 100,000 miles? What would we expect the truck to be valued at? And vice versa.

43 Truck example… Suppose we were given the LSRL equation for our truck data as predicted price = 38,257 − 0.1629(miles driven). We want to find a more precise estimate of the value if we have driven 100,000 miles, so use the LSRL equation. Using the graph, estimate the price if we have driven 40,000 miles; then use the LSRL equation above to calculate the predicted value of the truck. Answers: (100,000, 21,967); (40,000, 31,741). Sometimes we are given a predicted price and asked for the approximate mileage according to our model. Sometimes use the graph and sometimes use the LSRL equation.
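The truck predictions can be reproduced with a few lines. The intercept and slope come from the slide; the function names are hypothetical, and the inverse calculation simply solves the same equation for miles.

```python
INTERCEPT = 38257.0   # predicted price (dollars) at 0 miles, from the slide
SLOPE = -0.1629       # predicted change in price per mile driven, from the slide

def predicted_price(miles):
    """Predicted truck price from the LSRL: price = 38,257 - 0.1629(miles)."""
    return INTERCEPT + SLOPE * miles

def miles_for_price(price):
    """Solve the LSRL for miles, given a predicted price."""
    return (price - INTERCEPT) / SLOPE

print(round(predicted_price(100_000)))   # matches the slide's (100,000, 21,967)
print(round(predicted_price(40_000)))    # matches the slide's (40,000, 31,741)
print(round(miles_for_price(21_967)))    # "vice versa": mileage for a given price
```

These are model predictions only; an actual truck with that mileage may sell for more or less.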

44 Ages & Heights…
Age (years)    Height (inches)
…              18
1              28
4              40
5              42
8              49
Input into Minitab

45 Let’s review for a moment, shall we …
Input into Minitab Create scatterplot and describe scatterplot (what do we include in a description?) Calculate r (btw, different from slope; why?), equation of LSRL; interpret equation of LSRL in context; does y-intercept make sense? Based on this data, make a prediction as to the height of a person at age 25.

46 LSRL: Our Data Extrapolation: Use of a regression line for prediction outside the range of values of the explanatory variable x used to obtain the line. Such predictions are often not accurate. Friends don’t let friends extrapolate! Maybe look at our data from height & weight. What if we had a height of 15 inches? Does that even make sense?

47 Calculating the equation of the LSRL: What if we don’t have the raw data?
We can still calculate the equation of the LSRL, but it is a little more time-consuming. Note: every LSRL goes through the point (x̄, ȳ). Formula for the slope of the LSRL: b = r(s_y / s_x). LSRL: ŷ = a + bx.

48 Calculating the equation for the LSRL: What if we don’t have the raw data?
Equation of LSRL: ŷ = a + bx. If you do not have raw data but still need to calculate an LSRL, you will be given x̄, ȳ, r (or r²), s_y, and s_x. Remember, (x̄, ȳ) is an ordered pair that is on the graph of the LSRL.

49 Example: Creating Equation of LSRL (without raw data)
predicted BAL = a + b(# of beers consumed) (equation of the LSRL in context; better than x & y). Remember the slope formula of the LSRL: b = r(s_y / s_x). Givens: x̄ = 4.8125, ȳ = …, s_x = 2.1975, s_y = 0.0441, and r² = 0.80. Calculate the slope for the equation of the LSRL.

50 Example: Creating Equation of LSRL (without raw data)
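predicted BAL = a + b(# of beers consumed). Givens: x̄ = 4.8125, ȳ = …, s_x = 2.1975, s_y = 0.0441, and r² = 0.80. So, slope = b = r(s_y / s_x) = √0.80 × (0.0441 / 2.1975) ≈ 0.0179 (taking the positive root, since BAL rises with the number of beers). Remember, the equations of all LSRLs go through (x̄, ȳ) … so what’s next?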

51 Example: Creating Equation of LSRL (without raw data)
predicted BAL = a + b(# of beers consumed). Givens: x̄ = 4.8125, ȳ = …, s_x = 2.1975, s_y = 0.0441, and r² = 0.80. ȳ = a + b·x̄. Substitute (x̄, ȳ) into the equation.

52 Example: Creating Equation of LSRL (without raw data)
… = a + (0.0179)(4.8125); solve for ‘a’. predicted BAL = a + b(# of beers consumed). predicted BAL = … + 0.0179(# of beers consumed)
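The two steps of this example (slope from b = r(s_y / s_x), then intercept from the point (x̄, ȳ)) can be sketched as a small function. The name `lsrl_from_summary` is hypothetical. Only the slope is computed from the slide's givens, because the value of ȳ is not available in this transcript, and the positive square root of r² is an assumption based on the positive slope.

```python
import math

def lsrl_from_summary(x_bar, y_bar, s_x, s_y, r):
    """Return (a, b) for y_hat = a + b*x using only summary statistics."""
    b = r * s_y / s_x        # slope formula: b = r * (s_y / s_x)
    a = y_bar - b * x_bar    # the line passes through (x_bar, y_bar)
    return a, b

# Slope for the BAL example, using the givens on the slide.
r = math.sqrt(0.80)                # assumption: positive association
slope = r * 0.0441 / 2.1975
print(f"slope b is about {slope:.4f}")

# Once y_bar is known, the intercept follows the same way, e.g.:
# a, b = lsrl_from_summary(4.8125, y_bar, 2.1975, 0.0441, r)
```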

53 Interpreting Software output…
Age at first word vs. Gesell score: the age at which a child’s first word was spoken and the child’s score on a later aptitude test (the Gesell score).

54

55 Detour… memory Monday (or way-back Wednesday)…
What is r? What is r’s range? r tells us how linear (and direction) scatterplot is. ‘r’ ranges from -1 to 1. ‘r’ describes the scatterplot only (not LSRL)

56 Now… We need a numerical measurement that tells us how well the LSRL fits: the coefficient of determination, or r².

57 Coefficient of determination …
Do all the points on the scatterplot fall exactly on the LSRL? Sometimes too high and sometimes too low Is LSRL a good model to use for a particular data set? How well does our model fit our data?

58 Coefficient of Determination or r²
“R-sq” in software output. Always 0 ≤ r² ≤ 1. Never calculate by hand; always use Minitab. No need to memorize the formula; trust me … it’s ugly.

59 Coefficient of Determination or r²
Remember: r, the correlation, gives the direction and strength of the linear relationship in a scatterplot; −1 ≤ r ≤ 1. r², the coefficient of determination, is the fraction of the variation in the values of y that is explained by the LSRL; it describes the LSRL; 0 ≤ r² ≤ 1.

60 Coefficient of Determination or r²
Interpretation of r²: We say, “x% of the variation in (y variable) is explained by the LSRL relating (y variable) to (x variable).”
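For readers who want to see where the “fraction of variation explained” comes from, here is a minimal sketch that computes r² as 1 − SSE/SST for a fitted LSRL. The data and the function names are hypothetical; in class, Minitab reports this value as “R-sq”.

```python
def r_squared(xs, ys):
    """Fraction of the variation in y explained by the LSRL of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx                 # LSRL slope
    a = y_bar - b * x_bar         # LSRL intercept
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained
    sst = sum((y - y_bar) ** 2 for y in ys)                    # total
    return 1 - sse / sst

def interpret(r2, y_name, x_name):
    """Fill in the interpretation sentence from the slide."""
    return (f"{100 * r2:.1f}% of the variation in {y_name} is explained "
            f"by the LSRL relating {y_name} to {x_name}.")

# Hypothetical heights (inches) and weights (pounds).
heights = [60, 62, 64, 66, 68, 70]
weights = [110, 120, 128, 140, 150, 162]
r2 = r_squared(heights, weights)
print(interpret(r2, "weight", "height"))
```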

61 Facts to remember about LSRL
Distinction between explanatory and response variables. If switched, scatterplot changes and LSRL changes (but what doesn’t change?) LSRL minimizes distances from data points to line only vertically
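A short sketch of both points: the LSRL is fitted by minimizing vertical distances only, so switching the explanatory and response variables gives a different line. The data and the helper `fit_lsrl` are hypothetical.

```python
def fit_lsrl(xs, ys):
    """Return (a, b) for y_hat = a + b*x, minimizing squared vertical distances."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical heights (inches) and weights (pounds).
heights = [60, 62, 64, 66, 68, 70]
weights = [110, 120, 128, 140, 150, 162]

a_wh, b_wh = fit_lsrl(heights, weights)   # weight predicted from height
a_hw, b_hw = fit_lsrl(weights, heights)   # height predicted from weight

print(f"weight on height: y_hat = {a_wh:.2f} + {b_wh:.3f}x")
print(f"height on weight: y_hat = {a_hw:.2f} + {b_hw:.4f}x")
# The second slope is not simply 1 / (first slope) unless r is exactly +1 or -1;
# the product of the two slopes equals r squared, which does not change.
print(f"product of slopes = {b_wh * b_hw:.4f}")
```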

62 Facts to remember about LSRL
b = r(s_y / s_x). There is a close relationship between the correlation (r) and the slope of the LSRL, but r and b are (often) not the same; when would r and b have the same value? The LSRL always passes through (x̄, ȳ). We don’t need raw data to find the equation of the LSRL.

63 Facts to remember about LSRL
Correlation (r) describes the direction and strength of straight-line relationships in scatterplots. The coefficient of determination (r²) is the fraction of variation in the values of y explained by the LSRL.

64 Correlation & Regression Wisdom
Which of the following scatterplots has the highest correlation? Already saw this; reminder.

65 Correlation & Regression Wisdom
All r = 0.816; all have same exact LSRL equation Lesson: Always graph your data! … because correlation and regression describe only linear relationships
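Assuming the four scatterplots on this slide are Anscombe's quartet (which matches r = 0.816), the claim can be checked with the published 1973 data values. The helper `summarize` is hypothetical.

```python
import math

def summarize(xs, ys):
    """Return (r, a, b): correlation plus LSRL intercept and slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    b = sxy / sxx
    return sxy / math.sqrt(sxx * syy), y_bar - b * x_bar, b

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared x for sets I-III
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (r, a, b) in results.items():
    print(f"set {name}: r = {r:.3f}, y_hat = {a:.2f} + {b:.3f}x")
```

The numerical summaries agree almost exactly across the four sets; only the graphs reveal that just one of them is a sensible candidate for a straight-line model.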

66 Correlation & Regression Wisdom
Correlation and regression describe only linear relationships

67 Correlation & Regression Wisdom
Correlation is not causation! Association does not imply causation… want a Nobel Prize? Eat some chocolate! How about Methodist ministers & rum imports?
Year    Number of Methodist Ministers in New England    Cuban Rum Imported to Boston (in # of barrels)
1860    63     8,376
1865    48     6,506
1870    53     7,005
1875    64     8,486
1890    85     11,265
1900    80     10,547
1915    140    18,559
Sometimes there are ‘nonsense’ variables.
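The ministers and rum data from this slide can be run through the correlation calculation directly. The helper `pearson_r` is written here for illustration; the numbers are the seven rows shown on the slide.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation r for two equal-length lists of numbers."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

years = [1860, 1865, 1870, 1875, 1890, 1900, 1915]
ministers = [63, 48, 53, 64, 85, 80, 140]                     # Methodist ministers
rum_barrels = [8376, 6506, 7005, 8486, 11265, 10547, 18559]   # Cuban rum imports

r_nonsense = pearson_r(ministers, rum_barrels)
print(f"r between ministers and rum imports: {r_nonsense:.4f}")
```

The correlation is extremely close to 1, yet nobody would argue that ministers cause rum imports (or the reverse).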

68 Beware of nonsense associations…
r = …, but there is no economic relationship between these variables. The strong association is due entirely to the fact that both imports and health spending grew rapidly in these years. The common year is the other variable. Any two variables that both increase over time will show a strong association. That doesn’t mean one explains or influences the other.

69 Correlation & Regression Wisdom
Correlation is not resistant; always plot data and look for unusual trends. … what if Bill Gates walked into a bar?

70 Correlation & Regression Wisdom
Extrapolation! Don’t do it… ever. Example: growth data for children from age 1 month to age 12 years gives the LSRL predicted height = 1.5 ft + 0.25(age in years). What is the predicted height of a 40-year-old?
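Plugging age 40 into the slide's equation shows the problem. This is a minimal sketch: the function name and the range check are hypothetical additions, while the equation and the data range (1 month to 12 years) come from the slide.

```python
MIN_AGE = 1 / 12   # youngest child in the data: 1 month, in years
MAX_AGE = 12.0     # oldest child in the data: 12 years

def predicted_height_ft(age_years):
    """Return (predicted height in feet, True if this is an extrapolation)."""
    height = 1.5 + 0.25 * age_years
    is_extrapolation = not (MIN_AGE <= age_years <= MAX_AGE)
    return height, is_extrapolation

for age in (6, 12, 40):
    height, extrapolated = predicted_height_ft(age)
    note = "  <-- outside the data range!" if extrapolated else ""
    print(f"age {age}: predicted height {height} ft{note}")
```

At age 40 the model predicts 11.5 ft, which is absurd: the line was fitted only to children who were still growing.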

71 Outliers & Influential Points
All influential points are outliers, but not all outliers are influential points.

72 Outliers & Influential Points
Outlier: observation lies outside overall pattern Points that are outliers in the ‘y’ direction of scatterplot have large residuals. Points that are outliers in the ‘x’ direction of scatterplot may not necessarily have large residuals.

73 Outliers & Influential Points
Influential points/observations: If removed would significantly change LSRL (slope and/or y-intercept)
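One way to test whether a point is influential is to fit the LSRL with and without it and compare. The sketch below uses hypothetical data: five points on a perfect line plus one point far out in the x direction.

```python
def fit_lsrl(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Hypothetical data: y = 2x exactly for the first five points.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
suspect = (20, 5)   # far from the others in x, and off the pattern

a_without, b_without = fit_lsrl(xs, ys)
a_with, b_with = fit_lsrl(xs + [suspect[0]], ys + [suspect[1]])

print(f"without the point: y_hat = {a_without:.2f} + {b_without:.3f}x")
print(f"with the point:    y_hat = {a_with:.2f} + {b_with:.3f}x")
```

Removing the single point changes the slope from nearly flat back to 2, so the point is influential.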

74 Class Activity… In groups of 2 or 3, measure each other’s head circumferences and arm spans (both in inches, rounded to the nearest ½ inch). Write the data on the board. Create a scatterplot and describe the association between head circumference and arm span. Is a regression line appropriate for our data? Why or why not? If so, create the LSRL graph and equation, and calculate the correlation and the coefficient of determination. Interpret the slope and the y-intercept of the LSRL. What does it mean if a point falls above the LSRL? Below the LSRL? Put all your names on the paper. Due at the beginning of next class.

