Bivariate Data analysis

Presentation on theme: "Bivariate Data analysis"— Presentation transcript:

1 Bivariate Data analysis

2 Bivariate Data
In this PowerPoint we look at sets of data which contain two variables. Topics: scatter plots, correlation, outliers, causation.

3 Variables
We are only going to consider quantitative variables in this AS. Variables are either quantitative (numerical: measurements and counts) or qualitative (categorical: define groups). Quantitative variables are continuous or discrete. Qualitative variables are categorical (no idea of order) or ordinal (fall in natural order).

4 Quantitative
Discrete: many repeated values (age groups, marks). Continuous: few repeated values (height, length, weight).

5 Qualitative
Categorical: gender, religious denomination, blood types, sports numbers (e.g. he wears the number ‘8’ jersey). Ordinal: grades, places in a race (e.g. 1st, 2nd, 3rd).

6 We often want to know if there is a relationship between two numerical variables.
A scatter plot, which gives a visual display of the relationship between two variables, provides a good starting point.

7 In a relationship involving two variables, if the values of one variable ‘depend’ on the values of another variable, then the former variable is referred to as the dependent (or response) variable and the latter variable is referred to as the independent (or explanatory) variable.
y-axis: dependent (response) variable. x-axis: independent (explanatory) variable.

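8 Consider data on ‘hours of study’ vs ‘test score’
[Table of hours of study and test scores; the column layout was lost in the transcript. Values as extracted:] 18 59 14 54 17 16 67 72 76 22 74 63 27 90 19 29 89 15 62 20 58 30 93 28 10 47 96 71 85 23 82 60 25 75 26 35 84 78 98 61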

9 We may want to see if we could predict the test score (response variable) based on the hours of study (explanatory variable). y - axis: Test score x - axis: Hours of study

10 This is called correlation
We look for a pattern in the way the points lie. Certain patterns tell us about the relationship. This is called correlation.
[Scatter plot with one point labelled: "This point is an outlier".]

11 We could describe the rest of the data as having a linear form.

12 Scatter plots
Use hollow circles for points. Label axes correctly with units. What you want to predict goes on the y-axis (response variable). Give the graph a title. No background; no gridlines. No legend unless you need to show categories. Show different categories on a single graph in different colours rather than on separate graphs. Adjust scale and size of font (14pt for pasting).

13 What to look for in your plot?
Direction of the relationship: positive or negative. Form of the graph: linear or curved. The strength: whether it is strong, moderate or weak. Scatter: constant scatter, a fan effect… Outliers. Groupings.

14 Page 22

19 What do you see in this scatter plot?
There appears to be a linear trend. There appears to be moderate constant scatter. Negative association. No outliers or groupings visible.
[Scatter plot: Mean January Air Temperatures for 30 New Zealand Locations. Axes: Latitude (°S), Temperature (°C).]

20 What do you see in this scatter plot?
There appears to be a non-linear trend. There appears to be non-constant scatter about the trend line. Positive association. One possible outlier (large GDP, low % Internet Users).
[Scatter plot: % of population who are Internet Users vs GDP per capita for 202 Countries. Axes: GDP per capita (thousands of dollars), Internet Users (%).]

21 What do you see in this scatter plot?
Two non-linear trends (Male and Female). Very little scatter about the trend lines. Negative association until about 1970, then a positive association. Gap in the data collection (Second World War).
[Plot: Average Age New Zealanders are First Married. Axes: Year, Age.]

22 Rank these relationships from weakest (1) to strongest (4):
[Four scatter plots to rank; the labels surviving in the transcript are 2, 1, 3.]

23 Describe these relationships
Perfect, positive, linear relationship; perfect, negative, linear relationship; no relationship; moderate, negative linear relationship; weak, positive linear relationship.

24 Describe this relationship.

25 As the hours of study increase, the test score . . . .? . . .

26 Pearson’s product-moment correlation coefficient, r
Correlation measures the strength of the linear association between two quantitative variables. The correlation coefficient may take any value between -1.0 and +1.0. When r = -1 or r = 1 the points fall exactly on a straight line. When r = 0 there is no linear relationship (uncorrelated).
[Example scatter plots: r = -1, r = -0.7, r = -0.4, r = 0, r = 0.3, r = 0.8, r = 1.]

27 r - what does it tell you?
How close the points in the scatter plot come to lying on the line.
[Two example scatter plots of y against x: r = 0.99 and r = 0.57.]

28 Interpreting r
0.75 to 1: strong positive linear association. The remaining bands follow in decreasing order of r (their boundary values are not preserved in this transcript): moderate positive linear association; weak positive linear association; no association or weak linear association; weak negative linear association; moderate negative linear association; strong negative linear association.

29 Useful websites
http://istics.net/stat/Correlations/
Guessing. Regression by eye. Guessing effect of outliers.

30 Assumptions
There is a linear relationship between x and y. x and y are continuous random variables. The residuals must be normally distributed. The observations (the x, y pairs) must be independent of each other. All individuals must be selected at random from the population. All individuals must have an equal chance of being selected.

31 What is correlation? A measure of the strength of a LINEAR association between two quantitative variables.

32 Sure, you can calculate a correlation coefficient for any pair of variables, but correlation measures the strength only of the linear association and will be misleading if the relationship is not linear.

33 Do you know that: Correlation applies only to quantitative variables. Check you know the units and what they measure. Outliers can distort the correlation dramatically.

34 Some facts about the correlation coefficient
The sign gives the direction of the association. Correlation is always between -1 and 1. Correlation treats x and y symmetrically: the correlation of x with y is the same as the correlation of y with x. Correlation has no units and is generally given as a decimal. r is a multiple of the slope. Note: variables can have a strong association but still have a small correlation if the association isn’t linear. Correlation is sensitive to outliers: a single outlying value can make a small correlation large or make a large one small.

35 The sign gives the direction of the association.
[Two scatter plots: one labelled Positive, one labelled Negative.]

36 Correlation treats x and y symmetrically
The correlation of x with y is the same as the correlation of y with x.
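A minimal Python sketch of two facts from slide 34, symmetry and the absence of units, is given here. The helper `pearson_r` and the height and foot-size values are hypothetical illustrations, not data from the slides.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical measurements for six students (not from the slides).
foot_cm = [22.5, 23.0, 24.5, 25.0, 26.5, 27.0]
height_cm = [152, 160, 165, 171, 178, 185]

r_xy = pearson_r(foot_cm, height_cm)   # foot size as x, height as y
r_yx = pearson_r(height_cm, foot_cm)   # the two variables swapped

# Changing units (cm to m) rescales the data but leaves r unchanged.
height_m = [h / 100 for h in height_cm]
r_metres = pearson_r(foot_cm, height_m)

print(r_xy, r_yx, r_metres)
```

Swapping the variables or changing the units gives the same value of r, up to floating-point rounding.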

37 r is a multiple of the slope
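The slide gives no formula, so as a hedged illustration: for the least-squares line of y on x, the slope b and the correlation r are related as below, where s_x and s_y denote the standard deviations of x and y (notation assumed here, not taken from the slides).

```latex
b = r \, \frac{s_y}{s_x}
```

Because s_x and s_y are positive, r and the slope always have the same sign, and r = 0 exactly when the fitted line is flat.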

38 Variables can have a strong association but still have a small correlation if the association isn’t linear. Always plot the data before looking at the correlation!
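A small Python sketch of this point, using made-up values rather than data from the slides: y is completely determined by x (y = x²), yet the correlation coefficient is zero because the association is not linear. The helper `pearson_r` is a hypothetical name introduced for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Illustrative data: a perfect curved (quadratic) relationship.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(r)  # essentially zero, despite the perfect curved relationship
```

A scatter plot of these points would show the U-shape immediately, which is why the plot comes before the coefficient.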

39 Would it be OK to use a correlation coefficient to describe the strength of the relationship?
[Four scatter plots, two of them marked with a cross: Distances of Planets from the Sun (axes: Position Number, Distance (million miles)); Reaction Times (seconds) for 30 Year 10 Students (axes: Non-dominant Hand, Dominant Hand); Average Weekly Income for Employed New Zealanders in 2001 (axes: Female ($), Male ($)); Mean January Air Temperatures for 30 New Zealand Locations (axes: Latitude (°S), Temperature (°C)).]

40 Correlation is sensitive to outliers
A single outlying value can make a small correlation large or make a large one small.

41 You should be cautious in interpreting the correlation - these graphs all have the same correlation coefficient (0.817)

42 Data set 1

43 Data set 2

44 Data set 3

45 Data set 4
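The four data sets on slides 42 to 45 are not reproduced in this transcript. As a sketch only, the code below assumes they are Anscombe's quartet, a well-known set of four small data sets with nearly identical correlation coefficients but very different scatter plots; the transcript does not confirm that this is what the slides used. The helper `pearson_r` is a hypothetical name.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Anscombe's quartet (assumed here; not confirmed by the slides).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared by sets 1 to 3
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

rs = [
    pearson_r(x123, y1),  # roughly linear with scatter
    pearson_r(x123, y2),  # smooth curve
    pearson_r(x123, y3),  # straight line plus one outlier
    pearson_r(x4, y4),    # one extreme x value drives the correlation
]
for i, r in enumerate(rs, start=1):
    print(f"Data set {i}: r = {r:.3f}")
```

All four coefficients come out at roughly 0.82, so the number alone cannot distinguish the four very different patterns.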

46 Outliers can distort the correlation dramatically
An outlier can make an otherwise small correlation look big or hide a large correlation. It can even give an otherwise positive association a negative correlation coefficient (and vice versa).

47 What do you see in this scatterplot?
Appears to be a linear trend, with a possible outlier (tall person with a small foot size). Appears to be constant scatter. Positive association.
[Scatter plot: Height and Foot Size for 30 Year 10 Students. Axes: Foot size (cm), Height (cm).]

48 Height and Foot Size for 30 Year 10 Students
What will happen to the correlation coefficient if the tallest Year 10 student is removed? It will get smaller. It will get bigger. It will stay the same.
[Scatter plot: Height and Foot Size for 30 Year 10 Students. Axes: Foot size (cm), Height (cm).]

49 What do you see in this scatter plot?
Appears to be a strong linear trend. Outlier in X (the elephant). Appears to be constant scatter. Positive association.
[Scatter plot: Life Expectancies and Gestation Period for a sample of non-human Mammals, with the elephant labelled. Axes: Gestation (Days), Life Expectancy (Years).]

50 What will happen to the correlation coefficient if the elephant is removed?
It will get smaller. It will get bigger. It will stay the same.
[Scatter plot: Life Expectancies and Gestation Period for a sample of non-human Mammals, with the elephant labelled. Axes: Gestation (Days), Life Expectancy (Years).]

51 to 56 How does the outlier affect the r-value?

57 When you see an outlier, it’s often a good idea to report the correlations with and without the point.
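A minimal sketch of reporting the correlation with and without a suspect point is given below. The data are made up for illustration (six points close to a straight line plus one outlier), and `pearson_r` is a hypothetical helper, not something from the slides.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Illustrative data: the last point (7, 1.0) is the outlier.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 1.0]

r_with = pearson_r(xs, ys)
r_without = pearson_r(xs[:-1], ys[:-1])  # drop the outlying point

print(f"r with the outlier:    {r_with:.2f}")
print(f"r without the outlier: {r_without:.2f}")
```

Here a single point turns a very strong positive correlation into a weak one, so quoting both values tells the reader far more than either alone.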

58 Don’t confuse correlation with causation
Scatterplots and correlation never prove causation.

59 Using the information in the plot, can you suggest what needs to be done in a country to increase the life expectancy? Explain.
Perhaps if you have fewer people per doctor (i.e. more doctors per person), then the life expectancy will increase.
[Scatter plot: Life Expectancy and Availability of Doctors for a Sample of 40 Countries. Axes: People per Doctor, Life Expectancy.]

60 Life Expectancy and Availability of Televisions for a Sample of 40 Countries
Using the information in this plot, can you make another suggestion as to what needs to be done in a country to increase life expectancy? It looks like if you decrease the number of people per television (i.e. have more TVs per person), then the life expectancy will increase!
[Scatter plot: Life Expectancy and Availability of Televisions for a Sample of 40 Countries. Axes: People per Television, Life Expectancy.]

61 Can you suggest another variable that is linked to life expectancy and the availability of doctors (and televisions) which explains the association between the life expectancy and the availability of doctors (and televisions)? Some measure of the wealth of a country, e.g. average income per person or GDP.

62 Damaged for life by too much TV

63 Damaged for life by too much TV
Watching too much television as a child causes serious health problems years later, and raises the risk of heart disease, a New Zealand study of 1000 children has found…. It links the amount of time spent in front of the box as a child with obesity, high cholesterol, poor fitness and smoking….

64 Damaged for life by too much TV
[Plot of Health Score against TV watching.]

65 Causal relationships
Two general types of studies: experiments and observational studies. In an experiment, the experimenter determines which experimental units receive which treatments. In an observational study, we simply compare units that happen to have received each of the treatments.

66 Causal relationships
Only properly designed and carefully executed experiments can reliably demonstrate causation. An observational study is often useful for identifying possible causes of effects, but it cannot reliably establish causation.

67 Causal relationships
In observational studies, strong relationships are not necessarily causal relationships. Correlation does not imply causation. Be aware of the possibility of lurking variables.

68 Watch out for lurking variables
Damage ($) vs number of firemen would show a strong correlation, but damage doesn’t cause firemen, and although firemen do seem to cause damage (spraying water and chopping holes), that is not what drives the pattern. The underlying variable is the size of the blaze.
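The firemen example can be sketched as a small simulation. Everything here is an assumption made for illustration: the blaze sizes, the coefficients, the noise levels and the helper names `pearson_r` and `simulate` are invented, and in this toy model firemen have no effect on damage at all.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

random.seed(1)  # fixed seed so the sketch is repeatable

def simulate(sizes):
    """Firemen and damage each depend only on blaze size (plus noise)."""
    firemen = [2 * s + random.gauss(0, 1) for s in sizes]
    damage = [5 * s + random.gauss(0, 2) for s in sizes]
    return firemen, damage

# 200 fires of varying size: the lurking variable is free to vary.
mixed_sizes = [random.uniform(1, 10) for _ in range(200)]
r_mixed = pearson_r(*simulate(mixed_sizes))

# 200 fires of identical size: the lurking variable is held fixed.
same_size = [5.0] * 200
r_same = pearson_r(*simulate(same_size))

print(f"varying blaze size: r = {r_mixed:.2f}")
print(f"fixed blaze size:   r = {r_same:.2f}")
```

When blaze size varies, firemen and damage are strongly correlated; once blaze size is held fixed, the correlation largely disappears, which is the signature of a lurking variable.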

69 Although there was plenty of evidence that increased smoking was associated with increased levels of lung cancer, it took years to provide evidence that smoking actually causes lung cancer.

70 It would be a good idea to read the two pages of notes you have that discuss correlation and causation!

71 So now you want to know how to calculate the correlation coefficient, r. Here is one version of the formula!

72

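73 Luckily the computer will calculate R² and you can take the square root of this to get the size of r. The square root is always positive, so give r a negative sign when the trend slopes downwards. Remember this only applies when the association is linear.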
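One caution when taking the square root: it only gives the size of r, so the sign has to come from the direction of the trend. A minimal sketch, assuming a simple linear fit with one explanatory variable; the function name `r_from_r_squared` and the example numbers are hypothetical.

```python
from math import copysign, sqrt

def r_from_r_squared(r_squared, slope):
    """Recover r from R^2, taking its sign from the fitted slope."""
    return copysign(sqrt(r_squared), slope)

# Illustrative values: the same R^2 with an upward and a downward trend.
r_up = r_from_r_squared(0.7225, 2.4)     # positive slope, so r is positive
r_down = r_from_r_squared(0.7225, -2.4)  # negative slope, so r is negative

print(r_up, r_down)
```

The same R² corresponds to r of about 0.85 for an upward trend and about -0.85 for a downward one.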

74 r measures the strength of the relationship, NOT R²!

75 The words you use
There is a strong, positive, linear relationship between ‘x’ and ‘y’, and when the x-values increase, the y-values increase also. This is indicated by the value of the correlation coefficient, i.e. r = 0.85, which is close to 1. (Note: do not use ‘x’ and ‘y’; use what they represent.)

