CRITICAL NUMBERS Bivariate Data: When two variables meet
Recap: types of data

Categorical (qualitative):
- Nominal (no natural ordering): haemoglobin types, gender
- Ordered categorical: anaemic / borderline / not anaemic

Quantitative (numerical):
- Count (can only take certain values): number of positive tests for anaemia
- Continuous (limited only by the accuracy of the instrument): haemoglobin concentration (g/dl)
Population and Sample
The Standard Error

The standard error (se) is an estimate of the precision of the sample estimate of a population parameter, and it does not require lots of repeated samples. It tells us how far from the true value (the population parameter) the sample estimate is likely to be. Thus, all other things being equal, we expect estimates to become more precise, and the value of the se to decrease, as the sample size increases.
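A minimal sketch of this idea in Python (the values are made up for illustration, not data from the lecture): the standard error of the mean is the sample standard deviation divided by √n, so it shrinks as the sample grows.

```python
import math
import statistics

def standard_error(sample):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

small = [11.2, 12.8, 13.1, 12.4, 11.9]   # hypothetical values, n = 5
large = small * 4                         # same spread of values, n = 20
# all else being equal, the bigger sample gives the more precise estimate
print(standard_error(small) > standard_error(large))  # True
```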
Confidence Intervals

A confidence interval (CI) describes the variability surrounding the sample estimate. It gives limits within which we are confident (in terms of probability) that the true population parameter lies. For example, a 95% CI means that if you could sample an infinite number of times:
- 95% of the time the CI would contain the true population parameter
- 5% of the time the CI would fail to contain the true population parameter

Alternatively: a confidence interval gives a range of values that will include the true population value for 95% of all possible samples.
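This coverage interpretation can be checked by simulation. A sketch in Python (the population mean and SD, the sample size, and the seed are all arbitrary choices, not from the lecture):

```python
import random
import statistics
from statistics import NormalDist

random.seed(42)                            # arbitrary seed, for reproducibility
TRUE_MEAN, TRUE_SD, N = 50.0, 10.0, 100    # hypothetical population and sample size
z = NormalDist().inv_cdf(0.975)            # about 1.96, for a 95% CI

trials, covered = 2000, 0
for _ in range(trials):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / N ** 0.5
    # does the 95% CI around this sample's mean contain the true mean?
    if mean - z * se <= TRUE_MEAN <= mean + z * se:
        covered += 1

print(f"CI covered the true mean in {covered / trials:.1%} of samples")
```

The printed coverage comes out close to 95%, matching the definition above.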
Hypothesis testing: the main steps

1. Set the null hypothesis
2. Set the study (alternative) hypothesis
3. Carry out the significance test
4. Obtain the test statistic
5. Compare the test statistic to the hypothesized critical value
6. Obtain the p-value
7. Make a decision
P-values

A p-value is the probability of obtaining your results, or results more extreme, if the null hypothesis is true. It is used to decide whether to reject, or not reject, the null hypothesis.
- Small p-value: the results are unlikely when the null hypothesis is true
- Large p-value: the results are likely when the null hypothesis is true

But how small is small? The significance level is usually set at 5% (0.05). Thus, if the p-value is less than this value, we reject the null hypothesis.
P-values

We say that our results are statistically significant if the p-value is less than the significance level (α), set at 5%.
- P ≤ 0.05: the result is statistically significant. We decide that there is sufficient evidence to reject the null hypothesis and accept the alternative hypothesis.
- P > 0.05: the result is not statistically significant. We decide that there is insufficient evidence to reject the null hypothesis.

We cannot say that the null hypothesis is true, only that there is not enough evidence to reject it.
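The decision rule is simple enough to write as a one-line function; a trivial sketch in Python (the function name is mine, not from the lecture):

```python
ALPHA = 0.05  # the conventional 5% significance level

def interpret(p_value, alpha=ALPHA):
    """Reject H0 when p <= alpha; otherwise there is insufficient evidence.

    Note: a large p-value never proves the null hypothesis true.
    """
    return "statistically significant" if p_value <= alpha else "not statistically significant"

print(interpret(0.001))  # statistically significant
print(interpret(0.20))   # not statistically significant
```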
At the end of the session, you should know about:
- Approaches to analysis for simple continuous bivariate data

At the end of the session, you should be able to:
- Construct and interpret scatterplots for quantitative bivariate data
- Identify when to use correlation
- Interpret the results of correlation coefficients
- Identify when to use linear regression
- Interpret the results for linear regression

Lecturer's note: as with last week's concepts, those introduced today are not easy, so students shouldn't panic if they don't grasp them immediately. That said, they are important, and students should be aware of them and their implications, as this is what the previous sessions have been building towards. There are no numerical calculations today and no video!
The Scenario

“Our Dr has noticed that since she moved practices, from one in a wealthy suburb of the city to one in a more deprived area, she is seeing many more teenage pregnancies. She wants to know whether it is worth her setting up a contraceptive advice clinic especially for teenagers…”
What do we mean when we talk about bivariate data?

Data where there are two variables. The two variables can be either categorical or numerical. This session we are dealing with continuous bivariate data, i.e. both variables are continuous. During the risk lecture last year we looked at categorical bivariate data…
… categorical bivariate data example from Risk lecture

(2 × 2 table: Baycol vs other statins, by number who died from rhabdomyolysis vs number alive or dead of other causes, with totals)

There are two binary (categorical) variables:
- Type of statin (Baycol / other)
- Whether the patient died of rhabdomyolysis or not

From these data we examined the risk of death from rhabdomyolysis on Baycol compared with other statins.
Association between two variables: Correlation or regression?
There are two basic situations:
- There is no distinction between the two variables. No causation is implied, simply association: use correlation.
- One variable Y is a response to another variable X. You could use the value of X to predict what Y would be: use regression.
Correlation: are two variables associated? When examining the relationship between two continuous variables, ALWAYS look at the scatterplot: you will be able to see visually the pattern of the relationship between them.
Teenage pregnancy example
There appears to be a linear relationship between adult smoking rates and teenage pregnancy. So, now what do you do? You could calculate the correlation coefficient. This is a measure of the linear association between two variables, used when you are not interested in predicting the value of one variable for a given value of the other. Any relationship is not assumed to be a causal one: it may be caused by other factors.
Properties of Pearson’s correlation coefficient (r)
r must lie between −1 and +1:
- +1 = perfect positive linear association
- −1 = perfect negative linear association
- 0 = no linear association at all
Consider the following graphs: what do you think their values for r could be?
A = 1.0, B = 0.8, C = 0.0, D = -0.8, E = -1.0
Confidence interval for the correlation coefficient

Complicated to calculate by hand, but useful.

Hypothesis tests can be done; the null hypothesis is that the population correlation ρ = 0. However, this test is not very useful as an estimate of the strength of an association, because it is influenced by the number of observations (see next slide)…
Sample size | Value at which the correlation coefficient becomes significant at the 5% level
10          | 0.63
20          | 0.44
50          | 0.28
100         | 0.20
150         | 0.16
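These thresholds follow from the t test for a correlation (see the appendix slides): r is significant at the 5% level when |r| > t / √(t² + n − 2), where t is the two-sided 5% critical value with n − 2 degrees of freedom. A sketch in Python, with the t critical values taken from standard tables:

```python
import math

# two-sided 5% critical values of t, keyed by sample size n (df = n - 2),
# taken from standard t tables
T_CRIT = {10: 2.306, 20: 2.101, 50: 2.011, 100: 1.984, 150: 1.976}

def critical_r(n):
    """Smallest correlation that reaches significance at the 5% level for sample size n."""
    t = T_CRIT[n]
    return t / math.sqrt(t ** 2 + n - 2)

for n in sorted(T_CRIT):
    print(n, round(critical_r(n), 2))   # reproduces the table above
```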
And so what do correlations of 0.63 and 0.16 look like?
Teenage pregnancy example: null & alternative hypothesis
State the null and alternative hypotheses:
H0: there is no relationship or correlation between adult smoking and teenage pregnancy rates, i.e. the population correlation coefficient (ρ) = 0.0
HA: there is a relationship or correlation between adult smoking and teenage pregnancy rates, i.e. the population correlation coefficient (ρ) ≠ 0.0
Example: Answers

The correlation coefficient is 0.94 (p < 0.001).

What does p < 0.001 mean? Your results are unlikely when the null hypothesis is true.

Is this result statistically significant? The result is statistically significant at the 5% level because the p-value is less than the significance level (α) set at 5%, or 0.05.

You decide? That there is sufficient evidence to reject the null hypothesis, and therefore you accept the alternative hypothesis that there is a correlation between adult smoking and teenage pregnancy rates.
Points to note

- Do not assume causality: a different variable could have caused both to change together. In this case it is unlikely that smoking increases the risk of conception!
- Be careful comparing r from different studies with different n.
- Do not assume the scatterplot looks the same outside the range of the axes.
- Avoid multiple testing.
- Always examine the scatterplot!
Association between two variables: Correlation or regression?
There are two basic situations:
- There is no distinction between the two variables. No causation is implied, simply association: use correlation.
- One variable Y is a response to another variable X. You could use the value of X to predict what Y would be: use regression.
Regression: Quantifying the relationship between two continuous variables
Teenage pregnancy example: if you believe that the relationship is causal, i.e. that the level of smoking in an area affects the teenage pregnancy rate for that area, you may want to:
- Quantify the relationship between smoking and the teenage pregnancy rate
- Predict on average what the pregnancy rate would be, given a particular level of smoking
Regression: Quantifying the relationship between two continuous variables
Teenage pregnancy example: however, in this case that would not be sensible, as both are mediated by deprivation. So let's look at the rates of teenage pregnancy by area deprivation. If we believe that deprivation is causally linked with teenage pregnancy we could:
- Quantify the relationship between deprivation and the teenage pregnancy rate
- Predict on average what the pregnancy rate would be, given a particular level of deprivation
Y: response variable (dependent variable)
X: predictor / explanatory variable (independent variable)
Always plot the graph this way round, with the explanatory (independent) variable on the horizontal axis and the dependent variable on the vertical axis. We try to fit the “best” straight line; if the relationship is linear, this should give the best prediction of Y for any value of X.
Estimating the best fitting line
The standard way to do this is a method called least squares, carried out using a computer. The method chooses the line for which the sum of the squared vertical distances between the points and the line is minimised.
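A small numeric illustration of this criterion (with made-up points, not the lecture's data): among candidate lines, least squares picks the one with the smallest sum of squared vertical distances.

```python
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical points lying close to y = 2x

def sum_of_squares(a, b):
    """Sum of squared vertical distances from the points to the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# y = 2x hugs these points; y = 1 + 2x is shifted away, so its sum of squares is larger
print(sum_of_squares(0.0, 2.0) < sum_of_squares(1.0, 2.0))  # True
```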
Estimating the best fitting line

The line can be represented numerically by an equation (the regression equation):

y = a + b x

where y is the dependent variable, x is the independent variable, a is the intercept (the value of the dependent variable when the independent variable equals zero) and b is the slope (the average change in the dependent variable for a unit change in the x variable).
Equation of the line: Y = a + bX

b is the slope or gradient of the line: the amount of change in Y (the response / dependent variable) for a one-unit change in X (the predictor / explanatory / independent variable).
a is the intercept: the value of Y when X is zero.
Teenage pregnancy example
Regression line: the slope is the average change in the Y (response) variable for a change of one unit in the X (predictor) variable; the intercept is where the line crosses the y axis.
Teenage pregnancy example equation
Pregnancy rate = 13.04 + 0.006 × deprivation score. Here, a = 13.04 (intercept) and b = 0.006 (slope); i.e. for every unit increase in deprivation score there are an additional 0.006 pregnancies per 1000 women aged 15-17 (or an extra 6 per million women aged 15-17). (Or: for every increase in deprivation of 1000 units there are 6 extra teenage pregnancies per 1000 women.)
Teenage pregnancy example
Regression line: pregnancy rate = 13.04 + 0.006 × deprivation score. Slope = 0.006, i.e. the change in pregnancy rate for a unit change in deprivation score. Intercept = 13.04 (where the line crosses the y axis).
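Plugging the fitted coefficients into the regression equation gives predictions directly; a sketch in Python using the slope and intercept quoted above (the function name is mine):

```python
def predicted_pregnancy_rate(deprivation_score, a=13.04, b=0.006):
    """Prediction from the fitted line: pregnancy rate = a + b * deprivation score."""
    return a + b * deprivation_score

print(predicted_pregnancy_rate(0))               # the intercept, 13.04
print(round(predicted_pregnancy_rate(1000), 2))  # 19.04: 6 extra per 1000 deprivation units
```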
Teenage pregnancy example equation
Often in papers presenting the results of a regression analysis you will see a quantity known as r². This is the proportion of the variance explained by the predictor variable and is a measure of the fit of the model to the data; it can be expressed as a percentage. For our example the r² value is 0.646, thus 64.6% of the variability in the teenage pregnancy rate is explained by variation in the deprivation score. NB: this is the square of the correlation coefficient between the two variables.
Prediction

Regression slopes can be used to predict the value of the dependent variable for a particular value of the predictor / explanatory / independent variable. The slope, b, indicates the strength of the relationship between x and y. We are often interested in how likely we are to obtain our value of b if there is actually no relationship between x and y in the population. One way to assess this is a test of significance for the slope (b).
Caveats

- Do not use the graph or regression model to predict outside the range of the observations.
- Do not assume that just because you have an equation, X causes Y.
- As with correlation, it is always a good idea to look at the scatterplot.
Teenage pregnancy example

Regression line: pregnancy rate = 13.04 + 0.006 × deprivation score
Association between two variables: Correlation or regression (1)
We have now learned that there are two basic situations:
- There is no distinction between the two variables. No causation is implied, simply association: use correlation.
- One variable Y is a response to another variable X. You could use the value of X to predict what Y would be: use regression.
Association between two variables: Correlation or regression (2)
Correlation is used to denote association between two quantitative variables The degree of association is estimated using the correlation coefficient It measures the level of linear association between the two variables
Association between two variables: Correlation or regression (3)
Regression quantifies the relationship between two quantitative variables It involves estimating the best straight line with which to summarise the association The relationship is represented by an equation, the regression equation It is useful when we want to describe the relationship between the variables, or even predict a value of one variable for a given value of the other
You should now know about:
- Approaches to analysis for simple continuous bivariate data: correlation and regression

You should now be able to:
- Construct and interpret scatterplots for quantitative bivariate data
- Identify when it is appropriate to use correlation
- Interpret the results of correlation coefficients
- Identify when it is appropriate to use linear regression
- Interpret the results of a linear regression
Formula for Pearson’s r
Given a set of n pairs of observations (x₁,y₁), (x₂,y₂), …, (xₙ,yₙ), the Pearson correlation coefficient r is given by:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σ(xᵢ − x̄)² × Σ(yᵢ − ȳ)² ]

For this equation to be appropriate, X and Y must both be continuous variables (and Normally distributed if the CI and hypothesis test are to be valid). It is easier to do it on a computer!
Hypothesis test for r

To test whether the population correlation coefficient, ρ, is significantly different from zero, calculate:

t = r √(n − 2) / √(1 − r²)

Compare the test statistic with the t-distribution with n − 2 degrees of freedom.
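This statistic also explains the sample-size table earlier in the lecture: r = 0.63 with n = 10 lands almost exactly on the 5% critical value t = 2.306 (df = 8). A sketch:

```python
import math

def t_statistic(r, n):
    """Test statistic for H0: the population correlation is zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(round(t_statistic(0.63, 10), 2))   # 2.29, just below the 2.306 cut-off for df = 8
print(round(t_statistic(0.16, 150), 2))  # 1.97, near the 1.976 cut-off for df = 148
```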
Confidence interval for r

A 100(1 − α)% CI for the population correlation coefficient is:

r − t₁₋α/₂ × SE(r) to r + t₁₋α/₂ × SE(r)

where t₁₋α/₂ is the value from tables of the t distribution with n − 2 degrees of freedom.
Formula for estimating a and b

Given a set of n pairs of observations (x₁,y₁), (x₂,y₂), …, (xₙ,yₙ), the regression coefficient (slope) b of y given x is:

b = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

and the intercept is a = ȳ − b x̄.
Significance test and CI for b
To test whether b is significantly different from zero, calculate t = b / SE(b) and compare it with a t distribution with n − 2 degrees of freedom. A 100(1 − α)% CI for the population slope, with n − 2 degrees of freedom, is given by:

b − t₁₋α/₂ × SE(b) to b + t₁₋α/₂ × SE(b)
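A sketch of this slope test in Python with made-up data. The slides do not spell out SE(b); the usual formula, assumed here, computes it from the residual variance on n − 2 degrees of freedom:

```python
import math

def slope_inference(xs, ys):
    """Least-squares slope b, its standard error, and the test statistic t = b / SE(b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    # residual variance on n - 2 degrees of freedom
    s2 = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    se_b = math.sqrt(s2 / sxx)
    return b, se_b, b / se_b

b, se_b, t = slope_inference([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
# compare t with the two-sided 5% critical value for n - 2 = 3 df (3.182 from tables)
print(t > 3.182)  # True: this slope is significantly different from zero
```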
Residuals

Residuals are the observed value minus the fitted value: Yobs − Yfit, i.e. the dashed vertical distances in the least-squares figure. Plots involving residuals can be very informative. They can:
- help assess whether the assumptions are valid
- help assess whether other variables need to be taken into account
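One useful property to verify: when the model includes an intercept, least-squares residuals always sum to (numerically) zero. A sketch with invented data:

```python
import statistics

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical observations

# least-squares fit by hand: b = Sxy / Sxx, a = ybar - b * xbar
mx, my = statistics.fmean(xs), statistics.fmean(ys)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

# residual = observed value minus fitted value
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(abs(sum(residuals)) < 1e-9)  # True: the residuals balance out around the line
```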
Assumptions for the linear regression model to be valid:
- The residuals are Normally distributed for each value of X (the predictor variable).
- The variance of Y is the same at each value of X.
- The relationship between the two variables is linear.
- X does not have to be random or Normally distributed.