
Correlation and Regression Mathematics & Statistics Help University of Sheffield

Learning outcomes By the end of this session you should know about: Approaches to analysis for simple continuous bivariate data By the end of this session you should be able to: Construct and interpret scatterplots in SPSS Identify when it is appropriate to use correlation Calculate a correlation coefficient in SPSS Interpret a correlation coefficient Identify when it is appropriate to use linear regression Run a simple regression model in SPSS Interpret the results of a linear regression model

Download the slides from the MASH website MASH > Resources > Statistics Resources > Workshop materials

Association between two continuous variables: correlation or regression? Two basic questions: Is there a relationship? No causation is implied, simply association Use CORRELATION How can we use the value of one variable to predict the value of the other variable? May be causal, may not be Use REGRESSION

Correlation: are two continuous variables associated? When examining the relationship between two continuous variables ALWAYS look at the scatterplot, to see visually the pattern of the relationship between them

Scatterplot: Relationship between two continuous variables Explores the way the two co-vary (correlate): Positive / negative Linear / non-linear Strong / weak Presence of outliers The scatter graph or scatterplot is one of the most common graphs in statistics. It is used to explore the relationship between two continuous variables. Is there a linear or non-linear pattern? The summary statistic we calculate to represent this relationship is called Pearson’s correlation coefficient. Is the correlation positive or negative? Is it strong or weak? This graph will help us answer these questions. It will also help us detect any outliers, that is, observations outside the pattern that the other observations show. As usual, the scatter graph can be built using SPSS, Excel or any other statistical software.

Scatterplots

Correlation Coefficient r Measures strength of a linear relationship between 2 continuous variables, can take values between -1 to +1 r = 0.9 r = 0.01 r = -0.9

Correlation: Interpretation An interpretation of the size of the coefficient has been described by Cohen (1992): Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155-159. Correlation coefficient value / effect size: -0.3 to +0.3 = Weak; -0.5 to -0.3 or 0.3 to 0.5 = Moderate; -0.9 to -0.5 or 0.5 to 0.9 = Strong; -1.0 to -0.9 or 0.9 to 1.0 = Very strong
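Cohen’s bands can be turned into a small lookup helper. This function is our own illustration; how boundary values such as exactly 0.3 are classified is our choice, since the table leaves them ambiguous:

```python
def cohen_label(r: float) -> str:
    """Map a correlation coefficient to Cohen's (1992) effect-size label.

    Boundary values (e.g. exactly 0.3) go to the stronger band here,
    which is our own convention, not specified by the table.
    """
    a = abs(r)
    if a < 0.3:
        return "weak"
    if a < 0.5:
        return "moderate"
    if a < 0.9:
        return "strong"
    return "very strong"

print(cohen_label(0.27))   # weak
print(cohen_label(-0.60))  # strong
```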

Relationship is not assumed to be a causal one – it may be caused by other factors Does chocolate make you clever or crazy? A paper in the New England Journal of Medicine claimed there was a relationship between chocolate and Nobel Prize winners http://www.nejm.org/doi/full/10.1056/NEJMon1211064

Chocolate and serial killers What else is related to chocolate consumption? http://www.replicatedtypo.com/chocolate-consumption-traffic-accidents-and-serial-killers/5718.html

Dataset for today: Birthweight_reduced_data Factors affecting birth weight of babies Mother smokes = 1 Standard gestation = 40 weeks

Exercise 1: Gestational age and birthweight Draw a line of best fit through the data (with roughly half the points above and half below). Describe the relationship Is the relationship: strong/ weak? positive/ negative? linear?

Exercise 2: Interpretation Interpret the following correlation coefficients using Cohen’s classification and explain what they mean. Which correlations seem meaningful?
Relationship | Correlation
Average IQ and chocolate consumption | 0.27
Road fatalities and Nobel winners | 0.55
Gross Domestic Product and Nobel winners | 0.70
Mean temperature and Nobel winners | -0.60

Scatterplot in SPSS Graphs → Legacy Dialogs → Scatter/Dot

Correlation in SPSS Analyze → Correlate → Bivariate → Pearson Use Spearman’s correlation for ordinal variables or skewed scale data

Scatterplot and correlation SPSS output using reduced baby weight data set Pearson correlation r = 0.708 Strong relationship

Hypothesis test for the correlation coefficient Can be done: the null hypothesis is that the population correlation is 0 Not recommended, as the result is strongly influenced by the number of observations Better to use Cohen’s interpretation

Hypothesis test: Influence of sample size Value at which the correlation coefficient becomes significant at the 5% level (i.e. p < 0.05):
n = 10 | r = 0.63
n = 20 | r = 0.44
n = 50 | r = 0.28
n = 100 | r = 0.20
n = 150 | r = 0.16

And so what do correlations of 0.63 (n=10) and 0.16 (n=150) look like? Correlation=0.63, p=0.048 (n=10) Correlation=0.16, p=0.04 (n=150)

Points to note Do not assume causality Be careful comparing the correlation coefficient, r, from different studies with different n Do not assume the scatterplot looks the same outside the range of the axes Use Cohen’s scale to interpret, rather than the p-value Always examine the scatterplot!

Association between two continuous variables: correlation or regression? Two basic questions: Is there a relationship? No causation is implied, simply association Use CORRELATION How can we use the value of one variable to predict the value of the other variable? May be causal, may not be Use REGRESSION

Simple linear regression Regression quantifies the relationship between two continuous variables It involves estimating the best straight line with which to summarise the association The relationship is represented by an equation, the regression equation It is useful when we want to look for significant relationships between two variables, or to predict the value of one variable for a given value of the other

Independent / dependent variables Does attendance have an association with exam score? Does temperature have an impact on the growth rate of a cell culture? The INDEPENDENT (explanatory/predictor) variable (x) affects the DEPENDENT (outcome) variable (y) You will need to distinguish between independent (explanatory) variables and dependent (outcome) variables based on the research question. The explanatory variables are thought to have an effect on the dependent variable, and the distinction between the two is important when carrying out statistical analysis. For example, if you were investigating the relationship between attendance and exam score, the independent variable is attendance and the dependent variable is the exam score.

Does gestational age have an association with birth weight?

Regression Simple linear regression looks at the relationship between two continuous variables by producing an equation for a straight line of the form y = a + b x, where y is the dependent variable, x is the independent variable, a is the intercept and b is the slope You can use this to predict the value of the dependent (outcome) variable for any value of the independent (explanatory) variable

Birth weight example equation Birth weight (y) = -3.03 + 0.16 * gestational age (x) here, a = -3.03 (intercept) b = 0.16 (slope) i.e. for every extra week of gestation, birth weight increases by 0.16 kgs
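The fitted equation can be used directly for prediction. A sketch, with the coefficients taken from this slide:

```python
def predict_weight(gestation_weeks: float) -> float:
    """Predicted birth weight (kg) from the fitted line y = -3.03 + 0.16x."""
    a, b = -3.03, 0.16  # intercept and slope from the SPSS output
    return a + b * gestation_weeks

# At the standard gestation of 40 weeks: -3.03 + 0.16 * 40 = 3.37 kg
print(round(predict_weight(40), 2))  # 3.37
```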

Birth weight example - Slope The slope b is the average change in the Y variable for a change of one unit in the X variable Here b = 0.16, so an extra 0.16 kg for every extra week of gestation

Birth weight example - Intercept [Scatterplot: response/dependent variable (Y) against predictor/explanatory variable (X); the intercept a is where the fitted line crosses the y axis]

Estimating the best fitting line We try to fit the “best” straight line The standard way to do this (by computer) is the method of least squares Residuals = differences between observed and predicted values for each observation The least squares method chooses the line that minimises the sum of the squared residuals
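What least squares does can be shown by hand on a tiny made-up dataset: the slope is Sxy/Sxx, and the intercept is chosen so the line passes through the mean point (x̄, ȳ). A sketch, not the workshop data:

```python
from statistics import fmean

# Tiny made-up dataset for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

xbar, ybar = fmean(x), fmean(y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)

b = sxy / sxx        # slope
a = ybar - b * xbar  # intercept: line passes through (xbar, ybar)

# Residuals = observed - predicted; for a least squares line they sum to ~0
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(round(b, 2), round(a, 2))  # 1.99 0.05
```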

Line of best fit Residuals = observed - predicted

Hypothesis testing in regression Regression finds the best straight line with which to summarise an association It is useful when we want to look for significant relationships between variables The slope is tested for significance. If there is no relationship, the gradient of the line (b) would be 0; i.e. the regression line would be a horizontal line crossing the y axis at the average value for the y variable

Regression in SPSS Analyse → Regression → Linear

Output from SPSS: key regression table P – value < 0.001 Y = -3.029 + 0.162X As p < 0.05, gestational age is a significant predictor of birth weight Weight increases by 0.16 kgs for each week of gestation

Output from SPSS: ANOVA table ANOVA compares the null model (mean birth weight for all babies) with the regression model Null model: y = 3.31 Regression model: y = -3.02 + 0.16x

Output from SPSS: ANOVA table Does a model containing gestational age predict significantly more accurately than just using the mean birth weight for all babies? Yes as p < 0.001 Total: number of subjects included in the analysis – 1

How reliable are predictions? Using R2 How much of the variation in birth weight is explained by the model including gestational age? R2 = 0.502, so the model explains about 50% of the variation in birth weight Predictions using the model are fairly reliable Which other variables may help improve the fit of the model? Compare models using Adjusted R2, as this adjusts for the number of variables in the model
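R2 itself is simple to compute from the residuals: R2 = 1 - SSres/SStot, the proportion of variation explained. A sketch with made-up observed and fitted values, not the workshop data:

```python
from statistics import fmean

# Made-up observed values and the corresponding fitted values from a line
y_obs = [2.1, 3.9, 6.2, 7.8, 10.1]
y_fit = [2.04, 4.03, 6.02, 8.01, 10.00]

ybar = fmean(y_obs)
ss_res = sum((o - f) ** 2 for o, f in zip(y_obs, y_fit))  # unexplained
ss_tot = sum((o - ybar) ** 2 for o in y_obs)              # total variation
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))  # close to 1: the line explains almost all variation
```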

Assumptions for regression, with the plot to check each one: 1) The relationship between the independent and dependent variable is linear – check the original scatterplot of the two variables 2) Homoscedasticity: the variance of the residuals about the predicted responses should be the same for all predicted responses – check a scatterplot of standardised predicted values against residuals; there should be no obvious pattern 3) The residuals are normally distributed – check a histogram of the residuals

Checking assumptions: normality of residuals A residual is the observed value minus the value predicted by the model (fitted value): Yobs - Yfit, i.e. the vertical distance of each point from the fitted line It is the residuals that need to be normally distributed, not the data

Checking assumptions: normality of residuals Use standardised residuals to check the assumptions. Outliers are those values < -3 or > 3 Select histogram of residuals Scatterplot of predicted vs residuals

Checking assumptions: normality Histogram looks approximately normally distributed When writing up, just say ‘normality checks were carried out on the residuals and the assumption of normality was met’

Predicted values against residuals Are there any patterns as the predicted values increase? Homoscedasticity is violated if the scatter is not random, e.g. if it shows a funnel or curved shape

What if assumptions are not met? Regression is fairly robust to violations of the assumptions If the residuals are heavily skewed or the residuals show different variances as predicted values increase, the data needs to be transformed Try taking the natural log (ln) of the dependent variable first. Then repeat the analysis and check the assumptions

Exercise 3 Investigate whether mother’s pre-pregnancy weight and birth weight are associated using a scatterplot, correlation and simple regression

Exercise 3: correlation Pearson’s correlation coefficient = Describe the relationship using the scatterplot and correlation coefficient

Exercise 3: regression Adjusted R2 = Does the model result in reliable predictions? ANOVA p-value = Is the model an improvement on the null model (where every baby is predicted to be the mean weight)?

Exercise 3: Regression Pre-pregnancy weight coefficient and p-value: Regression equation: Interpretation:

Caveats Do not use the graph or regression model to predict outside of the range of observations Do not assume just because you have an equation that means that X causes Y As with correlation, it is always a good idea to have a look at the scatterplot

Effect of outliers on regression model Original regression model: y = -3.03 + 0.16x R2 =0.502 Adjusted regression model: y = -1.689 + 0.126x R2 =0.323

Multiple regression Multiple regression has several categorical or scale independent variables: y = α + β1x1 + β2x2 + … + βixi The effect of the other variables is removed (controlled for) when assessing each relationship You can only include binary categorical variables using: Analyse → Regression → Linear If you want to adjust for a categorical variable with more than 2 levels you need to use the General Linear Model procedure: Analyse → General Linear Model → Univariate (not covered here)

Multiple regression Example: What factors affect the birth weight of babies? Dependent: Birth weight Possible independents: gestational age, mother’s pre-pregnancy weight, mother’s height, mother smoking, etc.

Relationships with categorical variables Identify smokers by different markers on the scatterplot Graphs → Legacy Dialogs → Scatter/Dot

Adding smoking status Identify smokers by different markers on the scatterplot Is there a difference between smokers and non-smokers?

Regression output from SPSS Y = -2.661 + 0.156(gestation) – 0.298(smoker) Both p-values < 0.05 R2 has increased from 0.502 to 0.563 so 56.3% of variation explained with gestation and smoking. Gestational age (p < 0.001) and smoking status (p=0.024) are significant predictors of birth weight. Weight increases by 0.16 kgs for each week of gestation and decreases by 0.30 kgs for smokers Note: Smoker = 1 and Non-smoker = 0
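A sketch of using this slide’s two-predictor equation for prediction, with the coefficients copied from the SPSS output:

```python
def predict_weight(gestation_weeks: float, smoker: int) -> float:
    """Birth weight (kg) from Y = -2.661 + 0.156*gestation - 0.298*smoker.

    smoker is coded 1 for a smoker and 0 for a non-smoker, as in the data.
    """
    return -2.661 + 0.156 * gestation_weeks - 0.298 * smoker

# At 40 weeks, a smoker's baby is predicted to be 0.298 kg lighter
gap = predict_weight(40, 0) - predict_weight(40, 1)
print(round(gap, 3))  # 0.298
```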

Effect of smoking status Binary variables affect the intercept only Note that if you have an interaction between a binary and a scale variable, the lines for the two groups will not be parallel

Comparing models using R2 Adding predictor variables will always increase R2. It’s the size of the change that’s important Use adjusted R2 as it makes an adjustment for the number of variables in a model
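Adjusted R2 can be computed from R2, the number of cases n and the number of predictors p. A sketch using the standard adjustment formula; the R2 values are from the slides, while n = 42 is an assumed sample size for illustration, not taken from the slides:

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# R2 values from the slides; n = 42 is an assumption for illustration
print(round(adjusted_r2(0.502, 42, 1), 3))  # gestation only
print(round(adjusted_r2(0.563, 42, 2), 3))  # gestation + smoking status
```

Adding a predictor always raises plain R2, but adjusted R2 only rises if the new variable improves the fit by more than the penalty for the extra term.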

Multiple regression In addition to the standard linear regression checks, relationships BETWEEN independent variables should be assessed Multicollinearity is a problem where continuous independent variables are too related (r > 0.8) Relationships can be assessed using scatterplots and correlation for continuous variables

Exercise 4: correlations Which variables are most strongly related to each other?

Model selection If models are to be used for prediction, only significant predictors should be included unless they are being used as controls Methods include forward, backward and stepwise regression Backward means that the predictor with the highest p-value is removed and the model re-run. Keep going until only significant predictors are left

Important point By default, SPSS only includes cases with no missing values on any variable included in multiple regression. This can seriously reduce the number of cases being used To change this, select ‘Options’ in the linear regression dialog and then ‘Exclude cases pairwise’

Exercise 5 Run a multiple regression model for birth weight with gestational age, mother’s weight and smoking status as independent variables Check the assumptions and interpret the output Does the model give more reliable predictions than the model with just gestational age? Add mother’s height to the model. Does anything change? Note: you will need to create a variable for smoking status based on the number of cigarettes that the mother smokes (assuming that 0 cigarettes indicates someone who does not smoke)

Exercise 5: model 1 summary
Variable | Coefficient (β) | P-value | Significant?
Constant | | |
Gestation | | |
Smoker | | |
Pre-pregnancy weight | | |
Adjusted R2 =
Interpretation:

Exercise 5: model 2 summary
Variable | Coefficient (β) | P-value | Significant?
Constant | | |
Gestation | | |
Smoker | | |
Pre-pregnancy weight | | |
Height | | |
Adjusted R2 =
Interpretation:

Exercise 5: Compare p-values
Model | Gestation | Smoking | Weight | Height
Gestation | < 0.001 | | |
+ Smoker | | 0.028 | |
+ Weight | | | |
+ Height | | | |

Exercise 5: Compare R2
Model | R2 | Adjusted R2
Gestation | 0.499 | 0.486
+ Smoker | 0.558 | 0.535
+ Weight | |
+ Height | |

Regression summary Use correlation to look at relationships between dependent and independent variables Use scatterplots to look for linear relationships Use regression to quantify the relationship between variables Check normality of residuals Check the scatterplot of predicted values vs residuals Interpret significance, coefficients and R2

Learning outcomes You should now know about: Approaches to analysis for simple continuous bivariate data You should be able to: Construct and interpret scatterplots in SPSS Identify when it is appropriate to use correlation Calculate a correlation coefficient in SPSS Interpret a correlation coefficient Identify when it is appropriate to use linear regression Run a simple regression model in SPSS Interpret the results of a linear regression model

Maths And Statistics Help Statistics appointments: Mon-Fri (10am-1pm) Statistics drop-in: Mon-Fri (10am-1pm), Weds (4-7pm) http://www.sheffield.ac.uk/mash

Resources: All resources are available in paper form at MASH or on the MASH website

Contacts Staff: Jenny Freeman (j.v.freeman@sheffield.ac.uk), Basile Marquier (b.marquier@sheffield.ac.uk), Marta Emmett (m.emmett@sheffield.ac.uk) Website: http://www.sheffield.ac.uk/mash Follow MASH on twitter: @mash_uos