CRITICAL NUMBERS Bivariate Data: When two variables meet


Recap: types of data
Categorical (qualitative):
- Nominal (no natural ordering), e.g. haemoglobin types, gender
- Ordered categorical, e.g. anaemic / borderline / not anaemic
Quantitative (numerical):
- Count (can only take certain values), e.g. number of positive tests for anaemia
- Continuous (limited only by the accuracy of the instrument), e.g. haemoglobin concentration (g/dl)

Population and Sample

The Standard Error
The standard error (se) is an estimate of the precision of a population parameter estimate that does not require lots of repeated samples. It indicates how far from the true value (the population parameter) the sample estimate is likely to be. Thus, all other things being equal, we expect estimates to become more precise, and the value of the se to decrease, as the sample size increases.
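A minimal sketch of the calculation in Python (the haemoglobin values below are made up for illustration): the se of a sample mean is the sample standard deviation divided by √n.

```python
import math

# Hypothetical haemoglobin concentrations (g/dl) for a small sample
sample = [12.1, 13.4, 11.8, 12.9, 13.1, 12.5, 13.0, 12.2]

n = len(sample)
mean = sum(sample) / n
# Sample standard deviation (n - 1 denominator)
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))

# Standard error of the mean: sd / sqrt(n)
se = sd / math.sqrt(n)
print(f"mean = {mean:.2f}, sd = {sd:.2f}, se = {se:.2f}")
```

Because the se falls with √n, quadrupling the sample size halves the standard error.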

Confidence Intervals
A confidence interval describes the variability surrounding the sample estimate. It gives limits within which we are confident (in terms of probability) that the true population parameter lies. For example, a 95% CI means that if you could sample an infinite number of times:
- 95% of the time the CI would contain the true population parameter
- 5% of the time the CI would fail to contain the true population parameter
Alternatively: a confidence interval gives a range of values that will include the true population value for 95% of all possible samples.
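Continuing the sketch above, one common construction is estimate ± t × se. A minimal version with the same made-up data, assuming scipy is available:

```python
import numpy as np
from scipy import stats

sample = np.array([12.1, 13.4, 11.8, 12.9, 13.1, 12.5, 13.0, 12.2])
n = len(sample)
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)

# 97.5th percentile of the t distribution with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)
lower, upper = mean - t_crit * se, mean + t_crit * se
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```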

Hypothesis testing: the main steps
1. Set the null hypothesis
2. Set the study (alternative) hypothesis
3. Carry out the significance test
4. Obtain the test statistic
5. Compare the test statistic to the hypothesized critical value
6. Obtain the p-value
7. Make a decision

P-values
A p-value is the probability of obtaining your results, or results more extreme, if the null hypothesis is true. It is used to decide whether or not to reject the null hypothesis.
- Small p-value: the results are unlikely when the null hypothesis is true
- Large p-value: the results are likely when the null hypothesis is true
But how small is small? The significance level is usually set at 0.05; if the p-value is less than this value, we reject the null hypothesis.

P-values
We say that our results are statistically significant if the p-value is less than the significance level (α), set at 5%:
- p ≤ 0.05: the result is statistically significant. We decide that there is sufficient evidence to reject the null hypothesis and accept the alternative hypothesis.
- p > 0.05: the result is not statistically significant. We decide that there is insufficient evidence to reject the null hypothesis.
We cannot say that the null hypothesis is true, only that there is not enough evidence to reject it.
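To make the decision rule concrete, a small sketch (made-up data; scipy assumed, and a one-sample t-test used purely as an example of a significance test):

```python
import numpy as np
from scipy import stats

# Hypothetical sample; H0: population mean haemoglobin = 12 g/dl
sample = np.array([12.1, 13.4, 11.8, 12.9, 13.1, 12.5, 13.0, 12.2])
t_stat, p_value = stats.ttest_1samp(sample, popmean=12.0)

alpha = 0.05  # significance level
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
print("reject H0" if p_value <= alpha else "insufficient evidence to reject H0")
```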

At the end of the session, you should know about:
- Approaches to analysis for simple continuous bivariate data
At the end of the session, you should be able to:
- Construct and interpret scatterplots for quantitative bivariate data
- Identify when to use correlation
- Interpret the results of correlation coefficients
- Identify when to use linear regression
- Interpret the results for linear regression

The Scenario
“Our doctor has noticed that since she moved practices, from one in a wealthy suburb of the city to one in a more deprived area, she is seeing many more teenage pregnancies. She wants to know whether it is worth her setting up a contraceptive advice clinic especially for teenagers…”

What do we mean when we talk about bivariate data?
- Data where there are two variables
- The two variables can be either categorical or numerical
- This session we are dealing with continuous bivariate data, i.e. both variables are continuous
- During the risk lecture last year we looked at categorical bivariate data…

… categorical bivariate data: example from the risk lecture

                                          Baycol       Other statins
Number who die from rhabdomyolysis        2            1
Number alive or who die of other causes   999 998      9 999 999
Total                                     1 000 000    10 000 000

There are two binary (categorical) variables:
- Type of statin (Baycol / other)
- Whether the person died of rhabdomyolysis or not
From these data we examined the risk of death from rhabdomyolysis on Baycol compared with other statins.
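As a quick sketch of what the table implies (the arithmetic only restates the counts above):

```python
# Risk of rhabdomyolysis death in each statin group, from the table
risk_baycol = 2 / 1_000_000
risk_other = 1 / 10_000_000

relative_risk = risk_baycol / risk_other
print(f"relative risk = {relative_risk:.0f}")  # 20x the risk on Baycol
```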

Association between two variables: correlation or regression?
There are two basic situations:
- There is no distinction between the two variables; no causation is implied, simply association: use correlation
- One variable Y is a response to another variable X, and you could use the value of X to predict what Y would be: use regression

Correlation: are two variables associated?
When examining the relationship between two continuous variables, ALWAYS look at the scatterplot, as it lets you see visually the pattern of the relationship between them.

Teenage pregnancy example

Teenage pregnancy example
There appears to be a linear relationship between adult smoking rates and teenage pregnancy. So, what do you do now? You could calculate the correlation coefficient:
- This is a measure of the linear association between two variables
- It is used when you are not interested in predicting the value of one variable for a given value of the other
- Any relationship is not assumed to be a causal one; it may be caused by other factors

Teenage pregnancy example

Properties of Pearson’s correlation coefficient (r)
r must lie between -1 and +1:
- +1 = perfect positive linear association
- -1 = perfect negative linear association
- 0 = no linear relation at all
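A sketch of computing r in practice (the smoking and pregnancy figures below are invented for illustration; scipy's pearsonr is assumed available):

```python
from scipy import stats

# Hypothetical paired data: adult smoking rate (%) and
# teenage pregnancy rate (per 1000 women aged 15-17) by area
smoking = [18, 22, 25, 30, 33, 38, 41, 45]
pregnancy = [15, 19, 22, 28, 30, 37, 40, 46]

r, p_value = stats.pearsonr(smoking, pregnancy)
print(f"r = {r:.2f}, p = {p_value:.3f}")  # r near +1: strong positive linear association
```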

Consider the following graphs: what do you think their values of r could be?

A = 1.0, B = 0.8, C = 0.0, D = -0.8, E = -1.0

Confidence interval for the correlation coefficient
Complicated to calculate by hand, but useful.
Hypothesis tests
These can be done; the null hypothesis is that the population correlation ρ = 0. However, the test is not very useful as an estimate of the strength of an association, because it is influenced by the number of observations (see next slide)…

Sample size    Value at which the correlation coefficient becomes significant at the 5% level
10             0.63
20             0.44
50             0.28
100            0.20
150            0.16
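These thresholds can be reproduced from the t test for r given in the formulas at the end of this session: r is significant at the 5% level once |r| > t / √(t² + n − 2), where t is the 97.5th percentile of the t distribution with n − 2 degrees of freedom. A sketch (scipy assumed):

```python
import math
from scipy import stats

for n in (10, 20, 50, 100, 150):
    t = stats.t.ppf(0.975, df=n - 2)
    # Smallest |r| that is significant at the 5% level for this sample size
    r_crit = t / math.sqrt(t ** 2 + n - 2)
    print(f"n = {n:3d}: r significant beyond {r_crit:.2f}")
```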

And so what do correlations of 0.63 and 0.16 look like?

Teenage pregnancy example: null & alternative hypothesis
State the null and alternative hypotheses:
H0: there is no relationship (correlation) between adult smoking and teenage pregnancy rates, i.e. the population correlation coefficient ρ = 0
HA: there is a relationship (correlation) between adult smoking and teenage pregnancy rates, i.e. ρ ≠ 0

Teenage pregnancy example

Example: answers
The correlation coefficient is 0.94 (p < 0.001).
What does p < 0.001 mean? Your results are very unlikely if the null hypothesis is true.
Is this result statistically significant? Yes: the result is statistically significant at the 5% level because the p-value is less than the significance level (α) of 0.05.
You decide? That there is sufficient evidence to reject the null hypothesis, and you therefore accept the alternative hypothesis that there is a correlation between adult smoking and teenage pregnancy rates.

Points to note
- Do not assume causality: a different variable could have caused both to change together. In this case it is unlikely that smoking increases the risk of conception!
- Be careful comparing r from different studies with different n
- Do not assume the scatterplot looks the same outside the range of the axes
- Avoid multiple testing
- Always examine the scatterplot!

Teenage pregnancy example

Teenage pregnancy example

Association between two variables: correlation or regression?
There are two basic situations:
- There is no distinction between the two variables; no causation is implied, simply association: use correlation
- One variable Y is a response to another variable X, and you could use the value of X to predict what Y would be: use regression

Regression: quantifying the relationship between two continuous variables
Teenage pregnancy example: if you believe that the relationship is causal, i.e. that the level of smoking in an area affects the teenage pregnancy rate for that area, you may want to:
- Quantify the relationship between smoking and the teenage pregnancy rate
- Predict on average what the pregnancy rate would be, given a particular level of smoking

Regression: quantifying the relationship between two continuous variables
Teenage pregnancy example: however, in this case that would not be sensible, as both smoking and teenage pregnancy are mediated by deprivation. So let’s look at rates of teenage pregnancy by area deprivation. If we believe that deprivation is causally linked with teenage pregnancy, we could:
- Quantify the relationship between deprivation and the teenage pregnancy rate
- Predict on average what the pregnancy rate would be, given a particular level of deprivation

(Figure: scatterplot with Y, the response / dependent variable, on the vertical axis and X, the predictor / explanatory / independent variable, on the horizontal axis)

Always plot the graph this way round, with the explanatory (independent) variable on the horizontal axis and the dependent variable on the vertical axis. We try to fit the “best” straight line; if the relationship is linear, this should give the best prediction of Y for any value of X.

Estimating the best-fitting line
The standard way to do this is with a method called least squares, using a computer. The method chooses the line for which the sum of the squared vertical distances between each point and the line is minimised.
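A minimal sketch of a least-squares fit on made-up points; numpy's polyfit performs exactly this minimisation:

```python
import numpy as np

# Hypothetical (x, y) points, roughly following y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# polyfit with deg=1 minimises the sum of squared vertical distances;
# it returns the coefficients highest power first: [slope, intercept]
b, a = np.polyfit(x, y, deg=1)
print(f"fitted line: y = {a:.2f} + {b:.2f} x")
```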

(Figure: scatterplot with fitted line; Y, the response / dependent variable, on the vertical axis and X, the predictor / explanatory / independent variable, on the horizontal axis)

Estimating the best-fitting line
The line can be represented numerically by an equation (the regression equation) with two coefficients: the intercept (the value of the dependent variable when the independent variable equals zero) and the slope (the average change in the dependent variable for a unit change in the x variable):
y = a + bx
where y is the dependent variable, x the independent variable, a the intercept and b the slope.

Equation of the line
Y = a + bX
- b is the slope or gradient of the line: the amount of change in Y for a one-unit change in X
- a is the intercept: the value of Y when X is zero
Y is the response (dependent) variable; X is the predictor / explanatory (independent) variable.

Teenage pregnancy example
(Figure: annotated scatterplot with regression line. The slope is the average change in the Y variable for a change of one unit in the X variable; the intercept is where the line crosses the y axis. Y = response / dependent variable, X = predictor / explanatory / independent variable.)

Teenage pregnancy example: equation
Pregnancy rate = 13.04 + 0.006 × deprivation score
Here a = 13.04 (intercept) and b = 0.006 (slope), i.e. for every unit increase in deprivation score there are an additional 0.006 pregnancies per 1000 women aged 15-17 (an extra 6 per million women aged 15-17; equivalently, for every increase in deprivation of 1000 units there are 6 extra teenage pregnancies per 1000 women).
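A sketch of using the quoted equation for prediction; the coefficients are from the slide, while the helper function and example scores are ours:

```python
def predicted_pregnancy_rate(deprivation_score: float) -> float:
    """Predicted pregnancy rate per 1000 women aged 15-17, from the fitted line."""
    a, b = 13.04, 0.006  # intercept and slope quoted in the example
    return a + b * deprivation_score

# Each extra 1000 units of deprivation adds 6 pregnancies per 1000 women
diff = predicted_pregnancy_rate(2000) - predicted_pregnancy_rate(1000)
print(f"{diff:.1f}")  # 6.0
```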

Teenage pregnancy example
(Figure: scatterplot with regression line, pregnancy rate = 13.04 + 0.006 × deprivation score. Slope = 0.006, the change in pregnancy rate for a unit change in deprivation score; intercept = 13.04, where the line crosses the y axis.)

Teenage pregnancy example: equation
Papers presenting the results of a regression analysis often quote a quantity known as r². This is the proportion of variance explained by the predictor variable and is a measure of the fit of the model to the data; it can be expressed as a percentage. For our example the r² value is 0.646, so 64.6% of the variability in the teenage pregnancy rate is explained by variation in the deprivation score. NB: this is the square of the correlation coefficient: 0.804² = 0.646.
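The arithmetic, as a one-line sketch:

```python
r = 0.804
print(f"r squared = {r ** 2:.3f}")  # 0.646, i.e. 64.6% of variability explained
```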

Prediction
Regression slopes can be used to predict the value of the dependent variable at a particular value of the predictor / explanatory / independent variable. The slope, b, indicates the strength of the relationship between x and y. We are often interested in how likely we would be to obtain our value of b if there were actually no relationship between x and y in the population; one way to assess this is a test of significance for the slope, b.
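A sketch of fitting the line and testing the slope in one call. The data are hypothetical, chosen only to roughly mimic the example, and scipy's linregress is assumed available:

```python
from scipy import stats

# Hypothetical deprivation scores and pregnancy rates by area
deprivation = [500, 1200, 2100, 2800, 3500, 4300, 5100, 6000]
rate = [15.2, 20.1, 25.9, 30.3, 33.8, 39.0, 44.6, 48.9]

fit = stats.linregress(deprivation, rate)
print(f"slope b = {fit.slope:.4f}, intercept a = {fit.intercept:.2f}")
print(f"p-value for H0: slope = 0 -> {fit.pvalue:.4f}")
```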

Caveats
- Do not use the graph or regression model to predict outside the range of the observations
- Do not assume that just because you have an equation, X causes Y
- As with correlation, it is always a good idea to look at the scatterplot

Teenage pregnancy example Ref: www.empho.org.uk/whatsnew/teenage-pregnancy-presentation.ppt

Teenage pregnancy example
Regression line: pregnancy rate = 26.4 + 0.003 × deprivation score
Regression line: pregnancy rate = 13.04 + 0.006 × deprivation score
Ref: www.empho.org.uk/whatsnew/teenage-pregnancy-presentation.ppt

Association between two variables: correlation or regression? (1)
We have now learned that there are two basic situations:
- There is no distinction between the two variables; no causation is implied, simply association: use correlation
- One variable Y is a response to another variable X, and you could use the value of X to predict what Y would be: use regression

Association between two variables: correlation or regression? (2)
- Correlation is used to denote association between two quantitative variables
- The degree of association is estimated using the correlation coefficient
- It measures the level of linear association between the two variables

Association between two variables: correlation or regression? (3)
- Regression quantifies the relationship between two quantitative variables
- It involves estimating the best straight line with which to summarise the association
- The relationship is represented by an equation, the regression equation
- It is useful when we want to describe the relationship between the variables, or to predict the value of one variable for a given value of the other

You should now know about:
- Approaches to analysis for simple continuous bivariate data: correlation and regression
You should now be able to:
- Construct and interpret scatterplots for quantitative bivariate data
- Identify when it is appropriate to use correlation
- Interpret the results of correlation coefficients
- Identify when it is appropriate to use linear regression
- Interpret the results of a linear regression

Formula for Pearson’s r
Given a set of n pairs of observations (x1, y1), (x2, y2), …, (xn, yn), the Pearson correlation coefficient r is given by:
r = Σ(xi − x̄)(yi − ȳ) / √[ Σ(xi − x̄)² × Σ(yi − ȳ)² ]
For this equation to be valid, X and Y must both be continuous variables (and Normally distributed if the CI and hypothesis test are to be valid). It is easier to do it on a computer!
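A sketch of the formula computed directly, on made-up pairs:

```python
import math

# Made-up pairs of observations
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Numerator and denominator terms of the formula above
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

r = sxy / math.sqrt(sxx * syy)
print(f"r = {r:.3f}")
```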

Hypothesis test for r
To test whether the population correlation coefficient, ρ, is significantly different from zero, calculate:
t = r √(n − 2) / √(1 − r²)
and compare the test statistic with the t distribution with n − 2 degrees of freedom.
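A sketch applying this to the first row of the earlier sample-size table (scipy assumed):

```python
import math
from scipy import stats

r, n = 0.63, 10  # first row of the earlier table
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
p = 2 * stats.t.sf(abs(t), df=n - 2)  # two-sided p-value
print(f"t = {t:.2f}, p = {p:.3f}")    # p is right around 0.05, as the table suggests
```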

Confidence interval for r
A 100(1 − α)% CI for the population correlation coefficient, ρ, is:
r − t(1−α/2) × SE(r)  to  r + t(1−α/2) × SE(r)
where t(1−α/2) comes from tables of the t distribution with n − 2 degrees of freedom.

Formula for estimating a and b
Given a set of n pairs of observations (x1, y1), (x2, y2), …, (xn, yn), the regression coefficient b of y given x is:
b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
and the intercept is a = ȳ − b x̄.
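The same made-up pairs as in the Pearson's r sketch, run through these formulas:

```python
# Made-up pairs, as before
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# b from the formula above, then a = y_bar - b * x_bar
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar
print(f"y = {a:.2f} + {b:.2f} x")
```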

Significance test and CI for b
To test whether b is significantly different from zero, compare t = b / SE(b) with the t distribution with n − 2 degrees of freedom.
A 100(1 − α)% CI for the population slope, β, with n − 2 degrees of freedom is given by:
b − t(1−α/2) × SE(b)  to  b + t(1−α/2) × SE(b)
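A sketch of the test and CI using scipy's linregress, which reports SE(b) as stderr (same hypothetical data as in the prediction sketch):

```python
from scipy import stats

# Hypothetical data, as in the prediction sketch
deprivation = [500, 1200, 2100, 2800, 3500, 4300, 5100, 6000]
rate = [15.2, 20.1, 25.9, 30.3, 33.8, 39.0, 44.6, 48.9]

fit = stats.linregress(deprivation, rate)  # fit.stderr is SE(b)
n = len(deprivation)
t_crit = stats.t.ppf(0.975, df=n - 2)      # for a 95% CI

print(f"t = {fit.slope / fit.stderr:.2f}, p = {fit.pvalue:.4f}")
print(f"95% CI for b: ({fit.slope - t_crit * fit.stderr:.4f}, "
      f"{fit.slope + t_crit * fit.stderr:.4f})")
```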

Residuals
Residuals are the observed value minus the fitted value, Yobs − Yfit, i.e. the dashed lines on the previous slide. Plots involving residuals can be very informative; they can:
- help assess whether the assumptions are valid
- help assess whether other variables need to be taken into account
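A sketch of computing residuals from a fitted line (made-up points, as in the least-squares sketch):

```python
import numpy as np

# Made-up points and a least-squares fit
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b, a = np.polyfit(x, y, deg=1)

residuals = y - (a + b * x)  # observed minus fitted
print(residuals)             # should scatter around zero with no pattern
```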

Assumptions for the linear regression model to be valid:
- The residuals are Normally distributed for each value of X (the predictor variable)
- The variance of Y is the same at each value of X
- The relationship between the two variables is linear
You do not have to have X random or X Normally distributed.