SIMPLE LINEAR REGRESSION AND CORRELATION By Mpembeni RNM, School of Public Health and Social Sciences, Dept of Epidemiology and Biostatistics, MUHAS
LEARNING OBJECTIVES After successful completion of this session, you should be able to: Describe the correlation coefficient Describe the linear regression model Understand and check model assumptions Understand meaning of regression coefficients
ANALYSING RELATIONSHIPS BETWEEN TWO OR MORE QUANTITATIVE VARIABLES Two commonly used methods are: correlation, and linear regression (simple or multiple)
CORRELATION The (Pearson's) correlation coefficient, r, measures the closeness (strength) of the linear association, i.e. how closely the points lie along a straight line. r is a bivariate correlation coefficient summarizing the magnitude and direction of the relationship between two variables.
Characteristics of r Ranges between -1 and +1. r = 0: no linear relationship. r = 1: perfect positive linear relationship. r = -1: perfect negative linear relationship.
Interpretation of r If r > 0, the variables are positively correlated, i.e. as x increases, y tends to increase, while as x decreases, y tends to decrease. If r < 0, the variables are negatively correlated, i.e. as x increases, y tends to decrease, while as x decreases, y tends to increase.
Rule of thumb for r
Correlation                 Strong          Weak
Positive (up and right)     0.7 to 1.0      0.3 to 0.7
Negative (down and left)    -1.0 to -0.7    -0.7 to -0.3
Little or no correlation: -0.3 to 0.3
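The rule of thumb can be written as a small lookup. The Python sketch below is illustrative only: the function name `describe_r` is hypothetical, and because the table lists the boundary values 0.3 and 0.7 in two categories at once, the code assumes a boundary value belongs to the stronger category.

```python
def describe_r(r):
    """Label a correlation coefficient using the rule-of-thumb table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)
    # Assumption: |r| exactly 0.3 counts as weak, exactly 0.7 as strong.
    if size < 0.3:
        return "little or no correlation"
    strength = "strong" if size >= 0.7 else "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"


print(describe_r(0.76))   # strong positive correlation
print(describe_r(-0.45))  # weak negative correlation
print(describe_r(0.10))   # little or no correlation
```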
SCATTER DIAGRAM The first step in investigating the relationship between two variables. The two related variables are plotted on a graph as points or dots. Each point on the diagram represents a pair of values, one on the X-scale and the other on the Y-scale. The X-scale refers to the explanatory or independent variable and the Y-scale refers to the response or dependent variable. The diagram shows visually the shape and degree of closeness of the relationship.
Head circumference and Gestational age of 100 LBW babies
Scatter Plot From the scatter plot, head circumference tends to increase with increasing gestational age
Strong positive correlation
Weak negative correlation
No correlation
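CORRELATION COEFFICIENT
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}} = \frac{\sum xy - (\sum x)(\sum y)/n}{\sqrt{\left[\sum x^2 - (\sum x)^2/n\right]\left[\sum y^2 - (\sum y)^2/n\right]}}$$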
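The two forms of the formula (deviations from the means, and raw sums) give the same value. The Python sketch below checks this on a small made-up data set; the function names and the toy values of x and y are hypothetical and are not taken from the lecture data.

```python
from math import sqrt


def pearson_r_deviation(x, y):
    """r from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def pearson_r_computational(x, y):
    """r from the raw sums (the 'computational' form)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    sxx = sum(xi ** 2 for xi in x) - sum_x ** 2 / n
    syy = sum(yi ** 2 for yi in y) - sum_y ** 2 / n
    return sxy / sqrt(sxx * syy)


# Hypothetical toy data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_dev = pearson_r_deviation(x, y)
r_comp = pearson_r_computational(x, y)
print(round(r_dev, 4), round(r_comp, 4))  # 0.7746 0.7746
```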
Example: Association between Body weight and Plasma volume
Calculation of r
$\sum xy - (\sum x)(\sum y)/n = 1615.295 - (535 \times 24.02)/8 = 8.96$
$\sum x^2 - (\sum x)^2/n = 35983.5 - 535^2/8 = 205.38$
$\sum y^2 - (\sum y)^2/n = 72.798 - 24.02^2/8 = 0.678$
$r = \dfrac{8.96}{\sqrt{205.38 \times 0.678}} = 0.76$
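The last step of this calculation can be checked in a few lines of Python. This is only a sketch that reuses the three corrected sums quoted on the slide (8.96, 205.38 and 0.678); the variable names are hypothetical.

```python
from math import sqrt

# Corrected sums of products and squares, as quoted on the slide
sxy = 8.96     # sum of xy minus (sum of x)(sum of y)/n
sxx = 205.38   # sum of x squared minus (sum of x) squared / n
syy = 0.678    # sum of y squared minus (sum of y) squared / n

r = sxy / sqrt(sxx * syy)
print(round(r, 2))  # 0.76
```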
STRENGTH OF THE ASSOCIATION BETWEEN WEIGHT AND PLASMA VOLUME How strong is the association?
Simple Linear Regression The two quantitative variables should be defined: y refers to the dependent variable (also known as the response or outcome variable); x refers to the independent variable (also known as the explanatory or predictor variable).
Simple linear regression The objectives of the analysis are to see whether a change in the independent variable, x, is associated with a change in the dependent variable, y, and to be able to predict the value of the dependent variable given the value of the independent variable. E.g. age and weight of a child under five years of age.
EXAMPLE Data on body weight and plasma volume of eight healthy men. The objective of the analysis is to see whether a change in body weight (x) is associated with a change in plasma volume (y).
ASSOCIATION BETWEEN QUANTITATIVE VARIABLES
Scatter Diagram of Body weight and Plasma Volume
Body weight and plasma volume Plasma volume tends to increase with increasing body weight
LINEAR REGRESSION When a linear relationship exists, it can be summarized by a line drawn through the scatter of points. Any straight line drawn on a graph can be represented by the equation y = a + bx, where y refers to the values of the response (dependent) variable and x to the values of the explanatory (independent) variable.
LINEAR REGRESSION The constant 'a' is the intercept, the point at which the line crosses the y-axis, that is, the value of y when x = 0. The coefficient of the x variable ('b') is the slope of the line. It tells us the average change (increase or decrease) in y associated with a unit increase in x. b is also called the regression coefficient.
METHOD OF LEAST SQUARES A mathematical technique for fitting a straight line to a set of points, i.e. it is used to estimate a and b. The fitted line is the one that minimises the sum of the squared vertical distances between the observed points and the line.
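LINEAR REGRESSION The least squares estimates are
$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x}$$
where the numerator of b is Sxy and the denominator is Sxx.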
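As a minimal sketch of the least squares calculation, the Python code below computes the slope as Sxy / Sxx and the intercept as the mean of y minus the slope times the mean of x, then confirms that nudging either estimate increases the residual sum of squares. The function names and the toy data are hypothetical and are not the lecture data.

```python
def least_squares_line(x, y):
    """Return (a, b) for the fitted line y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b


def residual_sum_of_squares(x, y, a, b):
    """Sum of squared vertical distances from the points to the line."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical toy data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = least_squares_line(x, y)
print(round(a, 3), round(b, 3))  # 2.2 0.6

# Moving either coefficient away from its estimate makes the fit worse
rss_fit = residual_sum_of_squares(x, y, a, b)
print(rss_fit < residual_sum_of_squares(x, y, a + 0.1, b))  # True
print(rss_fit < residual_sum_of_squares(x, y, a, b + 0.1))  # True
```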
LINEAR REGRESSION The resultant line is called the regression line, which estimates the average value of y for a given value of x.
Calculating the Least Squares Estimates Example: data on plasma volume and body weight
Example
Example Regression line: Plasma volume = 0.09 + 0.04 × body weight
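The coefficients of this line can be reproduced from the sums quoted earlier in the calculation of r (n = 8, sum of x = 535, sum of y = 24.02, sum of xy = 1615.295, sum of x squared = 35983.5). The Python sketch below assumes x is body weight and y is plasma volume; the variable names are hypothetical.

```python
# Sums quoted in the calculation of r (x = body weight, y = plasma volume)
n = 8
sum_x = 535
sum_y = 24.02
sum_xy = 1615.295
sum_x2 = 35983.5

sxy = sum_xy - sum_x * sum_y / n   # numerator of the slope
sxx = sum_x2 - sum_x ** 2 / n      # denominator of the slope

b = sxy / sxx                      # slope
a = sum_y / n - b * sum_x / n      # intercept = mean of y - b * mean of x

print(round(a, 2), round(b, 2))    # 0.09 0.04
print(round(a, 4), round(b, 4))    # 0.0857 0.0436
```

The second line of output shows the same estimates to four decimal places, before rounding to the two decimals used on the slide.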
ESTIMATION Once you have the values of a and b, you can substitute various values of x into the equation of the line and solve for the corresponding values of y. E.g. what would be the estimated plasma volume for a man weighing 62 kg? 77 kg?
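A short Python sketch of this substitution, using the rounded regression line from the example (plasma volume = 0.09 + 0.04 × body weight). The function name `predict_plasma_volume` is hypothetical, and the results are in whatever units the plasma volume data use.

```python
def predict_plasma_volume(weight_kg, a=0.09, b=0.04):
    """Estimated plasma volume from the rounded regression line."""
    return a + b * weight_kg


for weight in (62, 77):
    print(weight, round(predict_plasma_volume(weight), 2))
# 62 2.57
# 77 3.17
```

Because the coefficients are rounded to two decimal places, estimates computed from less heavily rounded coefficients will differ somewhat from these values.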
Regression line
INFERENCES FOR REGRESSION COEFFICIENTS As with any other estimate, a standard error can be calculated for the regression coefficient. A t test can be used to test whether b differs significantly from a hypothesised value b0 (usually 0); the test statistic is t = (b - b0) / SE(b), with n - 2 degrees of freedom. The t value and the corresponding p-value are both shown in the output table.
Evaluation of the model The coefficient of determination, R², which is the square of Pearson's correlation coefficient, r, is used to assess how well the model fits the data. It is the proportion of the variability among the observed values of y that is explained by the linear regression of y on x.
Model Evaluation If, for example, R² is 0.6095, almost 61% of the variation among the observed values of y is explained by its linear relationship with the independent variable.
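For the body weight and plasma volume example, the same interpretation can be obtained by squaring the correlation coefficient calculated earlier (r = 0.76). A minimal Python sketch, with hypothetical variable names:

```python
r = 0.76            # correlation between body weight and plasma volume
r_squared = r ** 2  # coefficient of determination

print(round(r_squared, 2))  # 0.58
print(f"{r_squared:.0%}")   # 58%
```

That is, about 58% of the variation in plasma volume is explained by its linear relationship with body weight.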
EXERCISE Using the provided data set (LBW babies):
1. Correlate birth weight with gestational age. What is the correlation coefficient between bweight and Gestage?
2. Regress birth weight on gestational age. What is the equation of the line?
3. What is the estimated birth weight for a baby with 42 weeks of gestation? 36 weeks?
4. What proportion of the variability of birth weight is explained by gestational age?
Coefficients(a)
Model            Unstandardized Coefficients     Standardized Coefficients    t         Sig.
                 B            Std. Error         Beta
1  (Constant)    -932.404     234.488                                         -3.976    .000
   gestage       70.310       8.086              .660                         8.695     .000
a. Dependent Variable: birthwt
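The output table can be read with a few lines of Python. The sketch below recomputes each t value as B divided by its standard error (a test against zero) and uses the fitted line, birthwt = -932.404 + 70.310 × gestage, to estimate birth weight at a given gestational age. The dictionary layout and the function name `predicted_birthwt` are hypothetical, and the result is in the units of birthwt, which the table does not state.

```python
# Values copied from the Coefficients table
coefficients = {
    "(Constant)": {"B": -932.404, "SE": 234.488},
    "gestage": {"B": 70.310, "SE": 8.086},
}

# t statistic for each coefficient: estimate / standard error
for name, c in coefficients.items():
    print(name, round(c["B"] / c["SE"], 3))
# (Constant) -3.976
# gestage 8.695


def predicted_birthwt(gestage_weeks):
    """Estimated birth weight from the fitted regression line."""
    a = coefficients["(Constant)"]["B"]
    b = coefficients["gestage"]["B"]
    return a + b * gestage_weeks


print(round(predicted_birthwt(42), 1))  # 2020.6
print(round(predicted_birthwt(36), 1))  # 1598.8
```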