Published by Dwayne Shepherd. Modified over 9 years ago.
1
Bivariate Data analysis
2
Bivariate Data
In this PowerPoint we look at sets of data which contain two variables.
Topics: scatter plots, correlation, outliers, causation.
3
Variables
Quantitative (numerical; measurements and counts): discrete or continuous.
Qualitative (categorical; define groups): ordinal (fall in a natural order) or categorical (no idea of order).
We are only going to consider quantitative variables in this AS.
4
Quantitative
Discrete (many repeated values): age groups, marks.
Continuous (few repeated values): height, length, weight.
5
Qualitative
Categorical: gender, religious denomination, blood types, sports numbers (e.g. he wears the number ‘8’ jersey).
Ordinal: grades, places in a race (e.g. 1st, 2nd, 3rd).
6
We often want to know if there is a relationship between two numerical variables. A scatter plot, which gives a visual display of the relationship between two variables, provides a good starting point.
7
In a relationship involving two variables, if the values of one variable ‘depend’ on the values of another variable, then the former variable is referred to as the dependent (or response) variable and the latter variable is referred to as the independent (or explanatory) variable.
y-axis: dependent (response) variable
x-axis: independent (explanatory) variable
8
Consider data on ‘hours of study’ vs ‘test score’.
Hours  Score  Hours  Score  Hours  Score
18     59     14     54     17     59
16     67     17     72     16     76
22     74     14     63     14     59
27     90     19     72     29     89
15     62     20     58     30     93
28     89     10     47     30     96
18     71     28     85     23     82
19     60     25     75     26     35
22     84     18     63     22     78
30     98     19     61
9
We may want to see if we could predict the test score (response variable) based on the hours of study (explanatory variable).
y-axis: Test score
x-axis: Hours of study
10
We look for a pattern in the way the points lie. Certain patterns tell us about the relationship. This is called correlation. This point is an outlier (marked on the plot).
11
We could describe the rest of the data as having a linear form.
12
Scatter plots
Use hollow circles for points.
Label axes correctly, with units.
What you want to predict (the response variable) goes on the y-axis.
Give the graph a title.
No background; no gridlines.
No legend unless you need to show categories.
Show different categories on a single graph in different colours rather than on separate graphs.
Adjust the scale and the font size (14pt for pasting).
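The checklist above can be sketched in code. This is a minimal illustration, assuming Python with matplotlib is available (the slides do not name any software); the axis units ("hours", "marks") and the title are assumptions, and only six of the hours/score pairs are used.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# A few (hours, score) pairs for illustration only, not the full data set.
hours = [10, 14, 18, 22, 27, 30]
scores = [47, 54, 59, 74, 90, 98]


def make_scatter(x, y):
    """Draw a scatter plot that follows the checklist."""
    plt.rcParams.update({"font.size": 14})  # 14pt font for pasting
    fig, ax = plt.subplots()
    # Hollow circles: no fill colour, black outline.
    ax.scatter(x, y, facecolors="none", edgecolors="black")
    # Explanatory variable on the x-axis, response variable on the y-axis.
    ax.set_xlabel("Hours of study (hours)")  # unit label is an assumption
    ax.set_ylabel("Test score (marks)")      # unit label is an assumption
    ax.set_title("Test score vs hours of study")
    ax.set_facecolor("white")  # no background
    ax.grid(False)             # no gridlines
    return fig, ax


fig, ax = make_scatter(hours, scores)
```

Calling `fig.savefig("scatter.png")` afterwards would write the plot to a file for pasting into a report.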
13
What to look for in your plot?
Direction of the relationship: positive or negative.
Form of the graph: linear or curved.
Strength: whether it is strong, moderate or weak.
Scatter: constant scatter, a fan effect…
Outliers.
Groupings.
14
Page 22
19
What do you see in this scatter plot?
[Scatter plot: Mean January Air Temperatures for 30 New Zealand Locations; axes: Latitude (°S) and Temperature (°C)]
There appears to be a linear trend. There appears to be moderate constant scatter. Negative association. No outliers or groupings visible.
20
What do you see in this scatter plot?
[Scatter plot: % of population who are Internet Users vs GDP per capita for 202 Countries; axis shown: Internet Users (%)]
There appears to be a non-linear trend. There appears to be non-constant scatter about the trend line. Positive association. One possible outlier (large GDP, low % Internet Users).
21
What do you see in this scatter plot? Two non-linear trends (Male and Female). Very little scatter about the trend lines. Negative association until about 1970, then a positive association. Gap in the data collection (Second World War).
22
Rank these relationships from weakest (1) to strongest (4): 1 2 3 4
23
Describe these relationships
Perfect, negative, linear relationship
Perfect, positive, linear relationship
No relationship
Moderate, negative, linear relationship
Weak, positive, linear relationship
24
Describe this relationship.
25
As the hours of study increase, the test score....?...
26
Pearson’s product-moment correlation coefficient, r
Correlation measures the strength of the linear association between two quantitative variables.
[Row of example scatter plots: r = -1, r = -0.7, r = -0.4, r = 0, r = 0.3, r = 0.8, r = 1]
r = -1 or r = 1: points fall exactly on a straight line.
r = 0: no linear relationship (uncorrelated).
The correlation coefficient may take any value between -1.0 and +1.0.
27
r - what does it tell you?
How close the points in the scatter plot come to lying on the line.
[Two scatter plots of y against x: one with r = 0.99, one with r = 0.57]
28
Interpreting r
0.75 to 1: strong positive linear association
0.5 to 0.75: moderate positive linear association
0.25 to 0.5: weak positive linear association
-0.25 to 0.25: no association or weak linear association
-0.5 to -0.25: weak negative linear association
-0.75 to -0.5: moderate negative linear association
-1 to -0.75: strong negative linear association
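The bands above can be turned into a small lookup. This is only a sketch: the function name `describe_r` is hypothetical, and because the bands share their end points, the choice to put a boundary value such as 0.25 or 0.75 into the stronger band is an assumption.

```python
def describe_r(r):
    """Return the verbal description for a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)
    # Assumption: a boundary value (0.25, 0.5, 0.75) goes in the stronger band.
    if size < 0.25:
        return "no association or weak linear association"
    if size < 0.5:
        strength = "weak"
    elif size < 0.75:
        strength = "moderate"
    else:
        strength = "strong"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} linear association"


print(describe_r(0.85))   # strong positive linear association
print(describe_r(-0.6))   # moderate negative linear association
```

A description like this is only meaningful once a scatter plot has confirmed that the relationship is linear.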
29
Useful websites
Regression by eye: http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/index.html
Guessing: http://istics.net/stat/Correlations/
Effect of outliers: http://illuminations.nctm.org/LessonDetail.aspx?ID=L455#whatif
30
Assumptions
There is a linear relationship between x and y.
x and y are continuous random variables.
The residuals must be normally distributed.
The (x, y) observations must be independent of each other.
All individuals must be selected at random from the population.
All individuals must have an equal chance of being selected.
31
What is correlation? A measure of the strength of a LINEAR association between two quantitative variables.
32
Sure, you can calculate a correlation coefficient for any pair of variables, but correlation measures only the strength of the linear association and will be misleading if the relationship is not linear.
33
Do you know that: Correlation applies only to quantitative variables. Check you know the units and what they measure. Outliers can distort the correlation dramatically.
34
Some facts about the correlation coefficient
The sign gives the direction of the association.
Correlation is always between -1 and 1.
Correlation treats x and y symmetrically. The correlation of x and y is the same as the correlation of y with x.
Correlation has no units and is generally given as a decimal.
r is a multiple of the slope.
Note: variables can have a strong association but still have a small correlation if the association isn’t linear.
Correlation is sensitive to outliers. A single outlying value can make a small correlation large or make a large one small.
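Three of these facts can be checked numerically: symmetry, no units, and r as a multiple of the slope. The sketch below uses made-up foot size and height values (not data from the slides) and a hand-written helper `pearson_r`; the multiple linking r to the least-squares slope b is the ratio of standard deviations, r = b × s_x / s_y.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation from sums of deviations about the means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ls_slope(x, y):
    """Slope of the least-squares line of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Made-up illustrative values, not taken from the slides.
foot_cm = [22, 24, 24.5, 26, 27, 28]
height_cm = [150, 160, 165, 172, 180, 191]

r = pearson_r(foot_cm, height_cm)
r_swapped = pearson_r(height_cm, foot_cm)                   # symmetry
r_metres = pearson_r(foot_cm, [h / 100 for h in height_cm])  # change of units
b = ls_slope(foot_cm, height_cm)
r_from_slope = b * statistics.stdev(foot_cm) / statistics.stdev(height_cm)

print(r, r_swapped, r_metres, r_from_slope)  # all four agree
```

Because r is the slope rescaled by the two standard deviations, r and the slope always share the same sign.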
35
The sign gives the direction of the association. [Two example plots: Positive and Negative]
36
Correlation treats x and y symmetrically. The correlation of x and y is the same as the correlation of y with x.
37
r is a multiple of the slope
38
Variables can have a strong association but still have a small correlation if the association isn’t linear. Always plot the data before looking at the correlation!
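A small made-up example shows how extreme this can be. Below, y is completely determined by x (y = x²), yet the correlation is zero because the relationship is a symmetric curve rather than a line. The data and the helper `pearson_r` are illustrative, not from the slides.

```python
import math


def pearson_r(x, y):
    """Pearson correlation from sums of deviations about the means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# y depends perfectly on x, but along a U-shaped curve, not a straight line.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]

r = pearson_r(x, y)
print(r)  # zero: no LINEAR association, despite a perfect relationship
```

A scatter plot of these points would show the pattern immediately, which is why the plot comes first.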
39
Would it be OK to use a correlation coefficient to describe the strength of the relationship?
[Four scatter plots; the slide marks two of them with a tick (√) and two with a cross (X)]
Distances of Planets from the Sun: Distance (million miles) and Position Number
Reaction Times (seconds) for 30 Year 10 Students: Non-dominant Hand and Dominant Hand
Mean January Air Temperatures for 30 New Zealand Locations: Temperature (°C) and Latitude (°S)
Average Weekly Income for Employed New Zealanders in 2001: Female ($) and Male ($)
40
Correlation is sensitive to outliers. A single outlying value can make a small correlation large or make a large one small.
41
You should be cautious in interpreting the correlation - these graphs all have the same correlation coefficient (0.817)
42
Data set 1
43
Data set 2
44
Data set 3
45
Data set 4
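The four data sets themselves are not reproduced in this transcript. The sketch below assumes they are Anscombe's quartet, a well-known group of four small data sets whose correlations all round to about 0.82 even though their scatter plots look completely different; that identification is an assumption, not something the slides state.

```python
import math


def pearson_r(x, y):
    """Pearson correlation from sums of deviations about the means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Anscombe's quartet (assumed to be the data behind "Data set 1" to "Data set 4").
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

r_values = {k: pearson_r(x, y) for k, (x, y) in quartet.items()}
for k, r in r_values.items():
    print(f"Data set {k}: r = {r:.3f}")
```

Only the first set is a roughly linear cloud; the second is a curve, the third is a line with one outlier, and the fourth has a single point creating the whole correlation.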
46
Outliers can distort the correlation dramatically. An outlier can make an otherwise small correlation look big or hide a large correlation. It can even give an otherwise positive association a negative correlation coefficient (and vice versa).
47
What do you see in this scatter plot?
[Scatter plot: Height and Foot Size for 30 Year 10 Students; axes: Foot size (cm) and Height (cm)]
There appears to be a linear trend, with a possible outlier (a tall person with a small foot size). There appears to be constant scatter. Positive association.
48
What will happen to the correlation coefficient if the tallest Year 10 student is removed?
[Scatter plot: Height and Foot Size for 30 Year 10 Students; axes: Foot size (cm) and Height (cm)]
It will get smaller
It will get bigger
It will stay the same
49
What do you see in this scatter plot?
[Scatter plot: Life Expectancies and Gestation Period for a sample of non-human Mammals; axes: Gestation (Days) and Life Expectancy (Years); one point labelled Elephant]
There appears to be a strong linear trend. Outlier in X (the elephant). There appears to be constant scatter. Positive association.
50
What will happen to the correlation coefficient if the elephant is removed?
[Scatter plot: Life Expectancies and Gestation Period for a sample of non-human Mammals; axes: Gestation (Days) and Life Expectancy (Years); one point labelled Elephant]
It will get smaller
It will get bigger
It will stay the same
51
How does the outlier affect the r - value?
57
When you see an outlier, it’s often a good idea to report the correlations with and without the point.
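Reporting both values might look like the sketch below. It uses the hours of study and test score data from earlier in the presentation, read on the assumption that every value in the flattened table is a two-digit number, and it treats the point (26 hours, score 35) as the outlier, which is also an assumption since the transcript does not say which point the slide marks.

```python
import math


def pearson_r(x, y):
    """Pearson correlation from sums of deviations about the means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# (hours, score) pairs, assuming two-digit values throughout the table.
pairs = [
    (18, 59), (14, 54), (17, 59), (16, 67), (17, 72), (16, 76),
    (22, 74), (14, 63), (14, 59), (27, 90), (19, 72), (29, 89),
    (15, 62), (20, 58), (30, 93), (28, 89), (10, 47), (30, 96),
    (18, 71), (28, 85), (23, 82), (19, 60), (25, 75), (26, 35),
    (22, 84), (18, 63), (22, 78), (30, 98), (19, 61),
]
outlier = (26, 35)  # assumed outlier: many hours of study, very low score

kept = [p for p in pairs if p != outlier]

r_with = pearson_r([h for h, _ in pairs], [s for _, s in pairs])
r_without = pearson_r([h for h, _ in kept], [s for _, s in kept])

print(f"r with the outlier:    {r_with:.2f}")
print(f"r without the outlier: {r_without:.2f}")
```

Because the outlier sits far below an otherwise upward trend, removing it makes the positive correlation stronger.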
58
Don’t confuse correlation with causation. Scatter plots and correlation never prove causation.
59
Using the information in the plot, can you suggest what needs to be done in a country to increase the life expectancy? Explain.
[Scatter plot: Life Expectancy and Availability of Doctors for a Sample of 40 Countries; axes: People per Doctor and Life Expectancy]
Perhaps if you have fewer people per doctor (i.e. more doctors per person), then the life expectancy will increase.
60
Using the information in this plot, can you make another suggestion as to what needs to be done in a country to increase life expectancy?
[Scatter plot: Life Expectancy and Availability of Televisions for a Sample of 40 Countries; axes: People per Television and Life Expectancy]
It looks like if you decrease the number of people per television (i.e. have more TVs per person), then the life expectancy will increase!
61
Can you suggest another variable, linked to both life expectancy and the availability of doctors (and televisions), which explains the association between them? Some measure of the wealth of a country, e.g. average income per person or GDP.
62
Damaged for life by too much TV
63
Damaged for life by too much TV
Watching too much television as a child causes serious health problems years later, and raises the risk of heart disease, a New Zealand study of 1000 children has found…. It links the amount of time spent in front of the box as a child with obesity, high cholesterol, poor fitness and smoking….
64
[Scatter plot: Health Score and TV watching; r = -0.93]
65
Causal relationships
Two general types of studies: experiments and observational studies.
In an experiment, the experimenter determines which experimental units receive which treatments.
In an observational study, we simply compare units that happen to have received each of the treatments.
66
Causal relationships
Only properly designed and carefully executed experiments can reliably demonstrate causation. An observational study is often useful for identifying possible causes of effects, but it cannot reliably establish causation.
67
Causal relationships
In observational studies, strong relationships are not necessarily causal relationships. Correlation does not imply causation. Be aware of the possibility of lurking variables.
68
Watch out for lurking variables. Damage ($) vs number of firemen would show a strong correlation, but damage doesn’t cause firemen, and although firemen do seem to cause damage (spraying water and chopping holes), the underlying variable is the size of the blaze.
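The firemen example can be imitated with a small simulation. Everything below is synthetic: the blaze sizes, the coefficients (2 firemen and $1000 of damage per unit of blaze size) and the noise levels are made-up assumptions. Firemen and damage each depend only on blaze size, never on each other, yet they come out strongly correlated.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation from sums of deviations about the means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


random.seed(1)  # fixed seed so the sketch is repeatable

# Lurking variable: size of the blaze (arbitrary units), 200 synthetic fires.
blaze = [random.uniform(1, 10) for _ in range(200)]
# Both observed variables are driven by blaze size plus independent noise.
firemen = [2 * b + random.gauss(0, 1) for b in blaze]
damage = [1000 * b + random.gauss(0, 500) for b in blaze]

r_observed = pearson_r(firemen, damage)

# Subtract the part explained by blaze size, using the known simulation
# coefficients, and correlate what is left over.
firemen_left = [f - 2 * b for f, b in zip(firemen, blaze)]
damage_left = [d - 1000 * b for d, b in zip(damage, blaze)]
r_adjusted = pearson_r(firemen_left, damage_left)

print(f"firemen vs damage:               r = {r_observed:.2f}")
print(f"after allowing for blaze size:   r = {r_adjusted:.2f}")
```

Once blaze size is accounted for, the leftover correlation is close to zero, which is what a lurking variable explanation predicts.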
69
Although there was plenty of evidence that increased smoking was associated with increased levels of lung cancer, it took years to provide evidence that smoking actually causes lung cancer.
70
It would be a good idea to read the two pages of notes you have that discuss correlation and causation!
71
So now you want to know how to calculate the correlation coefficient, r. Here is one version of the formula!
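One common version of the formula, which may not be the exact version shown on the slide, is r = Σ(x − x̄)(y − ȳ) / √( Σ(x − x̄)² × Σ(y − ȳ)² ). The sketch below implements it directly, along with the equivalent "sums" version that is easier to work by hand; the function names and the tiny data sets are illustrative.

```python
import math


def r_from_deviations(x, y):
    """r = S_xy / sqrt(S_xx * S_yy), using deviations from the means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_from_sums(x, y):
    """Equivalent form that uses only the sums of x, y, xy, x^2 and y^2."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    top = n * sum_xy - sum_x * sum_y
    bottom = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return top / bottom


# Points exactly on a rising line and exactly on a falling line.
x = [1, 2, 3, 4, 5]
rising = [3, 5, 7, 9, 11]
falling = [10, 8, 6, 4, 2]

print(r_from_deviations(x, rising), r_from_sums(x, rising))
print(r_from_deviations(x, falling), r_from_sums(x, falling))
```

Points exactly on a straight line give r = 1 or r = -1, matching the ends of the scale described earlier.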
73
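Luckily the computer will calculate R² and you can take the square root of this to get the size of r. The sign of r matches the direction of the trend (negative for a downward slope). Remember: do this only when the association is linear.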
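Getting r back from R² needs one extra step, because a square root is never negative: the sign has to come from the slope of the fitted line. The sketch below fits a least-squares line to made-up latitude and temperature values (not the slide's data), computes R² as the proportion of variation explained, and recovers r; the helper names are illustrative.

```python
import math


def fit_line(x, y):
    """Least-squares slope and intercept of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(x, y):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    slope, intercept = fit_line(x, y)
    my = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot


def pearson_r(x, y):
    """Pearson correlation, for comparison."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up values with a downward trend, for illustration only.
latitude = [35, 37, 39, 41, 43, 45, 46]
temperature = [19.5, 19.0, 18.1, 17.0, 16.8, 15.2, 14.9]

slope, _ = fit_line(latitude, temperature)
r2 = r_squared(latitude, temperature)
# Square root gives the size of r; the slope supplies the sign.
r_recovered = math.copysign(math.sqrt(r2), slope)

print(f"R^2 = {r2:.3f}, slope = {slope:.3f}, r = {r_recovered:.3f}")
```

Forgetting the sign here would report a strong positive relationship for data that clearly trend downwards.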
74
r measures the strength of the relationship, NOT R²!
75
The words you use
There is a strong, positive, linear relationship between ‘x’ and ‘y’, and when the x-values increase, the y-values increase also. This is indicated by the value of the correlation coefficient, i.e. r = 0.85, which is close to 1.
(Note: do not use ‘x’ and ‘y’; use what they represent.)
© 2025 SlidePlayer.com. Inc.
All rights reserved.