Association between quantitative variables: Correlation and regression

Correlation
Examining relationships between two variables
  paired by time
  paired by person or team
  paired by region or country
Leads on to regression

Is Third World Aid a Good Thing?
Argument for: infant mortality against government health expenditure, paired by country

Argument against: aid and GDP over time, paired by year

How good a relationship between Government Health Expenditure & Infant Mortality rate does there have to be to convince you that they really are linked?

Key Questions:
  When does a relationship start to suggest that there really is an association, rather than it just being a coincidence?
  Can we measure how strong the association is?
  How do we interpret the results?

Relationships between 2 variables
First draw a scatter graph. Each point represents an x value and a y value.

Perfect positive correlation
Can draw a straight line through all points, with a positive slope. As x increases, so does y: a positive relationship.

Perfect negative correlation
Can draw a straight line through all points, with a negative slope. As x increases, y decreases: a negative relationship.

No correlation
An increase or decrease in x doesn't appear to have any effect on the value of y.

Non-linear relationship
A bit more complicated than the straight lines we saw earlier.

Some correlation but how much?
As x increases, y tends to increase, but not always.

Outliers
Unusual value
  investigate and find out the reason
  may need to remove it from the data

Important note about x and y
x is the explanatory (or independent) variable
y is the response (or dependent) variable
Important to get these the right way round
Ways of thinking about this:
  y is the variable we are trying to predict
  x is the variable we can control (sometimes)
  we think that x might cause y (but see later: the data can’t prove this)

Describing scatter plots
Direction: does the trend go up or down?
Curvature: is the pattern linear or curved?
Variation: are the points tightly clustered around the trend?
Outliers: is there something unexpected?

Measuring association
We would like to measure the amount of association
Measure using covariance and correlation

Example
A company has varied promotional marketing expenditure over the last 6 months to see the effect on sales.
  x variable is marketing, y variable is sales
  Data is paired by month (assume marketing has the effect in the same month, e.g. direct promotions)

Scatter chart

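Covariance
$$\mathrm{cov}(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
Like the variance:
$$\mathrm{var}(x) = \frac{\sum_i (x_i - \bar{x})(x_i - \bar{x})}{n - 1}$$
As for the standard deviation and variance, the population version divides by n rather than n – 1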
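
The covariance formula can be sketched in a few lines of Python. This is a minimal illustration only: the data values are hypothetical and chosen so the arithmetic is easy to check by hand; they are not the marketing and sales figures from the example.

```python
def mean(values):
    return sum(values) / len(values)


def sample_covariance(xs, ys):
    """Sample covariance: sum of (x - x_bar)(y - y_bar), divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two paired lists with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


# Hypothetical paired data (not from the slides)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = sample_covariance(x, y)  # 6 / 4 = 1.5
var_x = sample_covariance(x, x)   # covariance of x with itself is the variance: 10 / 4 = 2.5
print(cov_xy, var_x)
```

Passing the same list twice shows why the covariance is "like the variance": the variance is the covariance of a variable with itself.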

Covariance
The four quadrants of the scatter chart, split at the mean of x and the mean of y:
  x below mean, y above mean, so covariance –ve
  x above mean, y above mean, so covariance +ve
  x below mean, y below mean, so covariance +ve
  x above mean, y below mean, so covariance –ve

Covariance for our data
Covariance = 15.2 (units are £000²)
But: if we measure our data in £ rather than £000 then covariance = 15,200,000 (units are £²)
Issues:
  Can’t tell whether the number indicates strong or weak association
  Value depends on the units
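
The dependence on units can be checked numerically. The sketch below uses hypothetical figures (not the slide data), treats them as £000, and then converts them to £. Each value is multiplied by 1,000, so the covariance is multiplied by 1,000 × 1,000 = 1,000,000, the same factor that takes 15.2 to 15,200,000.

```python
def sample_covariance(xs, ys):
    """Sample covariance with the n - 1 divisor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical figures, read as thousands of pounds (illustrative only)
x_k = [1, 2, 3, 4, 5]
y_k = [2, 4, 5, 4, 5]

# The same figures expressed in pounds
x_gbp = [v * 1000 for v in x_k]
y_gbp = [v * 1000 for v in y_k]

cov_k = sample_covariance(x_k, y_k)        # units: £000 squared
cov_gbp = sample_covariance(x_gbp, y_gbp)  # units: £ squared
ratio = cov_gbp / cov_k                    # one factor of 1,000 for each variable
print(cov_k, cov_gbp, ratio)
```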

Correlation
Usually denoted r
Often called Pearson’s correlation coefficient
$$r = \frac{\mathrm{cov}(x, y)}{s_x s_y}$$
where s_x and s_y are the standard deviations of x and y
Dividing by the standard deviations takes away the effects of the size of the numbers and the number of points, i.e. it makes r scale invariant
Correlation doesn’t have any units
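
A short Python sketch of the correlation coefficient, again on hypothetical data rather than the slide figures. It also rescales both variables by 1,000 to show that r, unlike the covariance, does not change with the units.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    # The (n - 1) divisors in the covariance and standard deviations cancel.
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired data (not from the slides)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_rescaled = pearson_r([v * 1000 for v in x], [v * 1000 for v in y])
print(round(r, 4), round(r_rescaled, 4))
```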

Interpretation of r
–1 ≤ r ≤ 1 (ALWAYS!!)
r = 1 means perfect positive correlation
r = –1 means perfect negative correlation
r = 0 means no correlation
closer to 1 or –1 means stronger correlation
if we calculate r greater than 1 or smaller than –1, there is a mistake!

Correlations (Stine and Foster, Statistics for Business p113)

Scatter chart For our data, r = 0.946

Critical values (CV) for a hypothesis test on correlation at significance level α: there is evidence of correlation at significance level α if the absolute value of r is greater than CV

  n     CV (α=0.05)   CV (α=0.01)
  4     .950          .990
  5     .878          .959
  6     .811          .917
  7     .754          .875
  8     .707          .834
  9     .666          .798
  10    .632          .765
  11    .602          .735
  12    .576          .708
  13    .553          .684
  14    .532          .661
  15    .514          .641
  16    .497          .623
  17    .482          .606
  18    .468          .590
  19    .456          .575
  20    .444          .561
  25    .396          .505
  30    .361          .463
  35    .335          .430
  40    .312          .402
  45    .294          .378
  50    .279
  60    .254          .330
  70    .236          .305
  80    .220          .286
  90    .207          .269
  100   .196          .256
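
Using the table amounts to a lookup followed by a comparison. The sketch below stores a few rows copied from the table above and checks the marketing example (n = 6, r = 0.946). The function name `is_significant` and the third test value are hypothetical, introduced only for illustration.

```python
# Critical values copied from the table above: n -> (CV at 0.05, CV at 0.01)
CRITICAL_VALUES = {
    5: (0.878, 0.959),
    6: (0.811, 0.917),
    10: (0.632, 0.765),
    13: (0.553, 0.684),
    20: (0.444, 0.561),
    30: (0.361, 0.463),
    100: (0.196, 0.256),
}


def is_significant(r, n, alpha=0.05):
    """Return True if |r| exceeds the tabulated critical value for n points."""
    if n not in CRITICAL_VALUES:
        raise KeyError(f"no critical value stored for n = {n}")
    if alpha not in (0.05, 0.01):
        raise ValueError("only alpha = 0.05 and 0.01 are tabulated")
    cv_05, cv_01 = CRITICAL_VALUES[n]
    cv = cv_05 if alpha == 0.05 else cv_01
    return abs(r) > cv


marketing_05 = is_significant(0.946, 6)        # 0.946 > 0.811
marketing_01 = is_significant(0.946, 6, 0.01)  # 0.946 > 0.917
weak_example = is_significant(0.60, 10)        # hypothetical r: 0.60 < 0.632
print(marketing_05, marketing_01, weak_example)
```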

Hypothesis testing
If the absolute value of r is greater than the critical value then it is “statistically significant” (at significance level α)
α = 0.05 is commonly used
  but really just an arbitrary cut-off level
  just means that, if there is really no relationship, the chance of getting a correlation at least this strong is no more than 5% (1 in 20), i.e. fairly unlikely
Therefore “statistically significant” just means reasonably strong evidence of a pattern in the data
  sometimes occurs when there really isn’t a relationship
  sometimes doesn’t occur when there really is a relationship

Hypothesis testing - lottery
n = 13, correlation = –0.836, CV (0.05) = 0.553, CV (0.01) = 0.684
Significant at the 1% level!
Does the Saturday draw have some influence on the draw on the next Wednesday?

Hypothesis testing
If we do lots of studies and analyse them separately then there is very likely to be one that is “statistically significant” just by chance
  lottery example: 47 quarters, 3 are statistically significant at the 5% level
Need to combine studies together
  lottery example: 630 pairs, r = 0.025, CV (5%) = 0.078, not statistically significant
Hence: one small scientific study with a “statistically significant” result doesn’t prove much
Also “statistically significant” does not mean causal relationship (see next slide)
Note: media generally doesn’t understand these points (and nor do some scientists!!)
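
The lottery figures can be put in context with a little arithmetic. The sketch below makes two assumptions that are not stated in the slides: that the 47 quarterly tests are independent, and that for large n the 5% critical value is roughly 1.96 / √n. Under those assumptions, about 2 or 3 "significant" quarters are expected by chance alone, and the approximate critical value for 630 pairs agrees with the 0.078 quoted above.

```python
import math


def prob_at_least_one_false_positive(k, alpha=0.05):
    """Chance that at least one of k independent tests is 'significant'
    when there is really no relationship in any of them."""
    return 1 - (1 - alpha) ** k


def approx_critical_value(n, z=1.96):
    """Large-sample approximation to the 5% critical value for r."""
    return z / math.sqrt(n)


quarters = 47
expected_hits = quarters * 0.05                     # about 2.35 chance results
p_any = prob_at_least_one_false_positive(quarters)  # about 0.91
cv_630 = approx_critical_value(630)                 # about 0.078

print(round(expected_hits, 2), round(p_any, 2), round(cv_630, 3))
```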

Where do babies come from?
Old European folk stories explain that babies are delivered to new parents by storks

Where do babies come from?
High correlation does NOT imply cause and effect

Other examples of correlations
Hand size and writing ability
  Predict top authors by hand size?
CO2 and football ticket prices
  More CO2 makes people prepared to pay more?
Taking medicine and being ill
  Taking medicine causes illness?
Ice cream and deaths from drowning
  Ban ice cream?

High correlation
High correlation (positive or negative) just means there is an interesting pattern in the data. Possible reasons are:
  changes in x cause changes in y
  changes in y cause changes in x
  changes in z cause changes in both x and y, where z is a 3rd factor
  apparent correlation is just caused by chance
    particularly if not much data
    can test for statistical significance
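
The third-factor case can be illustrated with a small simulation. Everything here is made up for illustration: z is a hidden factor, x and y each depend on z plus their own random noise, and neither x nor y has any effect on the other. The two are still strongly correlated, while x and an unrelated noise series are not.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two paired lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


random.seed(1)  # fixed seed so the run is repeatable
n = 1000
z = [random.gauss(0, 1) for _ in range(n)]      # hidden third factor
x = [zi + random.gauss(0, 0.5) for zi in z]     # x depends on z only
y = [zi + random.gauss(0, 0.5) for zi in z]     # y depends on z only
noise = [random.gauss(0, 1) for _ in range(n)]  # unrelated to x, y and z

r_xy = pearson_r(x, y)          # high, although x does not cause y
r_x_noise = pearson_r(x, noise) # close to zero
print(round(r_xy, 2), round(r_x_noise, 2))
```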

Regression

Modelling
We have our data, probably a sample representing a much larger population
We've drawn a scatter diagram
We've calculated the correlation coefficient
What do we do now?
  want a model
  estimate of what happens more generally
  can apply model to other hypothetical situations
Assuming we believe there is a causal relationship

Regression model
A linear regression model is a mathematical equation of a straight line involving y and x
Formula is y = a + bx, where a and b are constants
Not exact: usually points won't all lie exactly on a straight line
We need to choose the line of best fit

Equation of a straight line
y = a + bx (where a and b are constants)
a is the intercept on the y axis
b is the gradient or slope (negative if the line slopes downwards)
b = H / L, where H is the change in y along the line over a horizontal distance L

Line of best fit
Assumes: x is explanatory variable, y is response

Line of best fit
We are trying to predict the y values. We will measure how well the line fits using the error distance in the y direction
Regression minimises the sum of the squared differences

Regression line
“Least squares” regression line: minimises $\sum (y - y_L)^2$ where y is the y value of the data point and y_L is the y value of the line
Squaring the differences stops the positive and negative values cancelling out
Squaring the distances penalises large distances from the line
  otherwise tends to follow greatest concentration of points rather than going through the middle
  i.e. corresponds to what a well fitting line should look like
Important: putting x and y the other way round alters the equation of the regression line
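
A minimal sketch of a least squares fit in Python. The slides do not give the formulas for a and b; the ones used here are the standard least squares results (slope = Σ(x – x̄)(y – ȳ) / Σ(x – x̄)², and the line passes through the point of means). The data are hypothetical. The second fit swaps x and y to show that the resulting line is different.

```python
def fit_line(xs, ys):
    """Least squares line y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx          # slope
    a = y_bar - b * x_bar  # intercept: the line passes through (x_bar, y_bar)
    return a, b


# Hypothetical paired data (not from the slides)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)          # y as response: y = 2.2 + 0.6x
a_rev, b_rev = fit_line(y, x)  # x as response: x = -1 + 1.0y

# Rearranging x = a_rev + b_rev*y into the form y = ... gives a different line
a_from_rev = -a_rev / b_rev
b_from_rev = 1 / b_rev
print(a, b, a_from_rev, b_from_rev)
```

The two lines only coincide when the points lie exactly on a straight line, which is why it matters which variable is treated as the response.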

Previous example

Regression in Excel

Plot of regression line
y = 40 + 7.6x (x = marketing, y = sales, both in £000)

Plot of regression line
y = 40 + 7.6x (x = marketing, y = sales, both in £000)
Predict the sales if we spend £4,500 on marketing
Note: this assumes a causal relationship

Comments
Line looks sensible
  always plot to check it looks o.k.
We can only be reasonably confident in the regression line within the range of the data
  here: between £3k and £7k advertising
If we believe there is a causal relationship then we can make predictions:
  if we spend £4,500 on marketing we will get sales of £74,200
Data cannot prove causation

Comments
Interpretation of constants:
  intercept, a = 40: if the line applies down to x = 0, this is the sales (£40,000) if no advertising
  slope, b = 7.6: extra sales (£7,600) for each extra unit (£1,000) of advertising
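
The prediction can be reproduced directly from the constants a = 40 and b = 7.6, with x and y in £000. The sketch below also flags whether the input lies inside the £3k to £7k range of the data. The function name and the out-of-range spend of £12,000 are hypothetical, added only to illustrate extrapolation.

```python
A = 40.0                 # intercept from the slides, in £000
B = 7.6                  # slope from the slides: £000 of sales per £000 of marketing
X_MIN, X_MAX = 3.0, 7.0  # range of the marketing data, in £000


def predict_sales(marketing_k):
    """Return (predicted sales in £000, True if x is inside the data range)."""
    within_range = X_MIN <= marketing_k <= X_MAX
    return A + B * marketing_k, within_range


sales, ok = predict_sales(4.5)  # £4,500 of marketing
print(round(sales, 1), ok)      # 74.2 True

sales_far, ok_far = predict_sales(12.0)  # hypothetical spend well outside the data
print(round(sales_far, 1), ok_far)       # 131.2 False
```

The second call still returns a number, but the flag is a reminder that the line has not been checked against any data at that level of spending.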

Extrapolation

The regression model
How "good" is the model?
r² (where r = correlation coefficient) gives an indication of how closely the straight line fits the points
r² measures the proportion of variation in y that can be explained by the regression equation
  but it's not the only possible measure of the model
    adjusted r²
    hypothesis tests
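
The "proportion of variation explained" reading can be checked numerically: it is 1 minus the ratio of the squared errors around the line to the total squared variation of y around its mean. The sketch below does this on hypothetical data (not the slide figures) and also squares the correlation quoted earlier, r = 0.946, which gives roughly 0.895, i.e. about 89% of the variation in sales.

```python
def fit_line(xs, ys):
    """Least squares line y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b


# Hypothetical paired data (not from the slides)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)
y_bar = sum(y) / len(y)
predicted = [a + b * xi for xi in x]

sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))  # variation left unexplained
sst = sum((yi - y_bar) ** 2 for yi in y)                   # total variation in y
r_squared = 1 - sse / sst
print(round(r_squared, 3))  # 0.6

slide_r_squared = 0.946 ** 2  # r from the marketing example
print(round(slide_r_squared, 3))
```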

Correlation & regression - summary
Draw scatter graph of data
  x value is explanatory variable; y response variable
  look at the pattern of the data
  look for outliers and decide whether to discard them
Relationship is strong if:
  r is close to 1 or –1 (r² close to 1)
  AND many data points
  can do a hypothesis test
Correlation does not necessarily imply cause and effect

Correlation & regression - summary
Regression line gives the line of “best” fit
  minimises squared errors in y
Can use regression line for prediction as long as confident there is a relationship
  dangerous to extrapolate regression line beyond range of data
  don’t use regression line after circumstances change

Further regression topics
Multiple regression
Transforming variables to deal with non-linear relationships
Choosing the best model
Hypothesis testing
