Scientific Practice Regression
Where We Are/Where We Are Going
We have looked at how correlation shows how two things might be associated, e.g. arm length and leg length.
No causality is implied, i.e. leg length is not responsible for arm length!
The correlation coefficient, r, assumes a linear ‘fit’ between the data and reflects how far away from that ‘line of best fit’ the data lie.
Regression takes this one step further:
it describes the fit mathematically;
it generally implies a causal link, e.g. increased BP causes increased mortality.
Independent/Dependent Variables
As a causal link is implied:
the independent variable is the thing doing the influencing;
the dependent variable is the one being influenced.
By convention, if we were to plot this graphically:
the x-axis represents the independent variable;
the y-axis represents the dependent variable.
E.g. blood pressure on the x-axis, cardiovascular mortality on the y-axis.
The Equation of the Straight Line
Any linear relationship can be described as y = mx + c.
y can be calculated (predicted) for any x using:
c, the intercept, where the line crosses the y-axis (the value of y when x = 0);
m, the slope of the line, i.e. the change in y per unit change in x. The slope is non-zero if there is a relationship.
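The roles of m and c can be checked numerically. The sketch below is only an illustration: the function name predict_y and the line used (slope 2, intercept 1) are made up for the example and do not come from the lecture data.

```python
def predict_y(x, m, c):
    """Return the value of y on the straight line y = m*x + c."""
    return m * x + c

# Hypothetical line chosen only for illustration: slope 2, intercept 1.
m, c = 2.0, 1.0

# The intercept is the predicted y when x = 0.
intercept_value = predict_y(0.0, m, c)

# The slope is the change in y when x increases by one unit.
change_per_unit_x = predict_y(4.0, m, c) - predict_y(3.0, m, c)

print(intercept_value, change_per_unit_x)
```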
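The Line of Best Fit
For a given set of data, linear regression derives the straight line equation that best describes the data.
It does this by minimising the overall distance of the data points from that straight line (in the usual least-squares method, the sum of the squared vertical distances).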
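As a rough sketch of how such a line can be derived, the code below uses the standard least-squares formulas, assuming that "best" means the smallest sum of squared vertical distances. The function name least_squares_line and the five data points are invented for illustration; they are not the class data.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line that minimises the
    sum of squared vertical distances from the points to the line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squares of x about its mean, and sum of cross-products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points that lie close to, but not exactly on, a straight line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = least_squares_line(xs, ys)
print(f"y = {slope:.2f}x + {intercept:.2f}")
```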
The Power of Linear Regression
E.g. data from a practical class looked at the relationship between the latency of the Achilles tendon stretch reflex (ms) and height (cm).
The taller the person, the longer the nerve pathway mediating the reflex.
The Power of Linear Regression
The class results (n = 85) [figure: reflex latency (ms) plotted against height (cm)]
The Power of Linear Regression
The line of best fit: y = 0.1337x + 12.225, where y is latency (ms) and x is height (cm).
The non-zero slope suggests a relationship.
Testing the Line of Best Fit
But even if we used random data, the line of best fit would have a non-zero slope! So how do we know if the slope is significantly different from zero?
The reported slope is the best estimate based on data that do not perfectly fit a straight line.
When we use a collection of data to estimate a single value, that estimate comes with a standard error, e.g. data mean +/- SE of the mean; in our case, data slope +/- SE of the slope.
From the slope and its SE we can calculate the 95% CI of the slope. Does the 95% CI encompass zero?
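The point about random data can be tried out directly. The sketch below fits a line to two unrelated sets of random numbers and then computes the SE and an approximate 95% CI of the slope. Everything in it is an assumption made for illustration: the seed, the ranges standing in for "heights" and "latencies", and the t value of 1.989 (the approximate two-sided 95% value for 83 degrees of freedom, taken from t tables).

```python
import math
import random

random.seed(1)  # arbitrary seed so the sketch is repeatable

n = 85
xs = [random.uniform(150.0, 200.0) for _ in range(n)]  # random "heights"
ys = [random.uniform(20.0, 50.0) for _ in range(n)]    # random "latencies", unrelated to xs

# Least-squares slope and intercept.
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual standard deviation (n - 2 degrees of freedom), then SE of the slope.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))
se_slope = s / math.sqrt(sxx)

T_CRIT = 1.989  # assumed: approximate two-sided 95% t value for 83 df
ci_low = slope - T_CRIT * se_slope
ci_high = slope + T_CRIT * se_slope
ci_includes_zero = ci_low <= 0.0 <= ci_high

print(f"slope = {slope:.4f}, 95% CI = ({ci_low:.4f}, {ci_high:.4f})")
print("95% CI includes zero:", ci_includes_zero)
```

The slope from random data is never exactly zero, so the useful question is whether zero lies inside the interval.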
The Power of Linear Regression
In reality, Minitab will do the significance calculation for us: a t-test on the slope and its SE.
The null hypothesis is that the slope is zero.

The regression equation is
latency = 12.2 + 0.134 height

Predictor     Coef  SE Coef     T      P
Constant    12.225    7.072  1.73  0.088
height     0.13366  0.04144  3.23  0.002

S = 4.16466   R-Sq = 11.1%   R-Sq(adj) = 10.1%

For height, p = 0.002 < 0.05, so the slope is significantly different from zero.
Reflex latency increases with height (0.134 ms/cm).
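The T value and the confidence interval for the slope can be reproduced by hand from the reported Coef and SE Coef. This is a sketch rather than Minitab's own procedure; the t value of 1.989 is an assumption (the approximate two-sided 95% value for n - 2 = 83 degrees of freedom, from t tables).

```python
# Values copied from the "height" row of the Minitab output.
coef_slope = 0.13366
se_slope = 0.04144

n = 85
df = n - 2      # degrees of freedom for simple linear regression
T_CRIT = 1.989  # assumed: approximate two-sided 95% t value for 83 df

# The t statistic is the estimate divided by its standard error.
t_slope = coef_slope / se_slope

# 95% confidence interval for the slope.
ci_low = coef_slope - T_CRIT * se_slope
ci_high = coef_slope + T_CRIT * se_slope
slope_differs_from_zero = not (ci_low <= 0.0 <= ci_high)

print(f"t = {t_slope:.2f} on {df} df")
print(f"95% CI for slope: ({ci_low:.3f}, {ci_high:.3f}) ms/cm")
print("CI excludes zero:", slope_differs_from_zero)
```

The interval lies wholly above zero, which agrees with the reported p value of 0.002.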
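The Power of Linear Regression
The intercept is also reported as a non-zero value. Is it significantly different from zero (the null hypothesis)?
This is a t-test on the intercept and its SE.

The regression equation is
latency = 12.2 + 0.134 height

Predictor     Coef  SE Coef     T      P
Constant    12.225    7.072  1.73  0.088
height     0.13366  0.04144  3.23  0.002

S = 4.16466   R-Sq = 11.1%   R-Sq(adj) = 10.1%

For the constant, p = 0.088 > 0.05, so the intercept is not significantly different from zero.
Note that simply dropping the intercept and keeping the same slope (latency = 0.134 height) predicts latencies about 12 ms shorter than the fitted line at every height. If the line is to be forced through the origin, the regression must be refitted, which changes the slope.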
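The same hand calculation can be applied to the intercept, and it is worth checking what leaving the intercept out does to a prediction. In this sketch the t value of 1.989 (approximate two-sided 95% value for 83 degrees of freedom) and the example height of 170 cm are assumptions chosen for illustration.

```python
# Values copied from the Minitab output.
coef_intercept = 12.225
se_intercept = 7.072
coef_slope = 0.13366

T_CRIT = 1.989  # assumed: approximate two-sided 95% t value for 83 df

# t statistic and 95% confidence interval for the intercept.
t_intercept = coef_intercept / se_intercept
ci_low = coef_intercept - T_CRIT * se_intercept
ci_high = coef_intercept + T_CRIT * se_intercept
intercept_ci_includes_zero = ci_low <= 0.0 <= ci_high

# Predicted latency for a hypothetical person 170 cm tall,
# with and without the intercept term.
height_cm = 170.0
with_intercept = coef_intercept + coef_slope * height_cm
without_intercept = coef_slope * height_cm
gap = with_intercept - without_intercept

print(f"t = {t_intercept:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f}) ms")
print(f"with intercept: {with_intercept:.1f} ms, without: {without_intercept:.1f} ms")
```

The two predictions differ by the size of the intercept, about 12.2 ms, whatever height is chosen.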
The Power of Linear Regression
The analysis also yields r-squared.
This is the square of the correlation coefficient: the proportion of variation in the y-axis variable that can be explained by variation in the x-axis variable.

The regression equation is
latency = 12.2 + 0.134 height

Predictor     Coef  SE Coef     T      P
Constant    12.225    7.072  1.73  0.088
height     0.13366  0.04144  3.23  0.002

S = 4.16466   R-Sq = 11.1%   R-Sq(adj) = 10.1%

At about 11%, this is very low, but the relationship is still highly significant (p = 0.002)!
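Two quick checks tie r-squared back to the rest of the output. The first recovers the correlation coefficient from the reported R-Sq. The second uses a standard identity for regression with a single predictor, R-squared = t² / (t² + df), where t is the slope's t statistic and df = n - 2. The identity is not stated in the slides and is brought in here only as a cross-check.

```python
import math

# Values copied from the Minitab output.
r_squared_reported = 0.111   # R-Sq = 11.1%
coef_slope = 0.13366
se_slope = 0.04144
n = 85

# Correlation coefficient: positive root, because the slope is positive.
r = math.sqrt(r_squared_reported)

# Cross-check: for one predictor, R-squared = t^2 / (t^2 + df).
df = n - 2
t_slope = coef_slope / se_slope
r_squared_from_t = t_slope ** 2 / (t_slope ** 2 + df)

print(f"r = {r:.3f}")
print(f"R-squared from t = {100 * r_squared_from_t:.1f}%")
```

This shows why a small r-squared can still be significant: with 83 degrees of freedom, even a modest t value of about 3.2 corresponds to an r-squared of only about 11%.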
R-squared and Significance
A low r-squared does not mean that a relationship is not significant; it means that the relationship has low predictive power.
E.g. the more clothes worn, the greater a person’s measured weight.
Clothes significantly influence weight, but they can only account for a small amount of the variation in weight that we see, i.e. r-squared is small.
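The clothes example can be mimicked with simulated numbers. All figures below are invented for illustration and are not measurements: body weights drawn from a normal distribution (mean 70 kg, SD 12 kg), clothing weights between 0.5 and 3 kg, and a large hypothetical sample of 20,000 people so that the small effect is detectable.

```python
import math
import random

random.seed(0)  # arbitrary seed so the sketch is repeatable

n = 20000  # hypothetical sample size
clothes = [random.uniform(0.5, 3.0) for _ in range(n)]  # kg of clothing worn
body = [random.gauss(70.0, 12.0) for _ in range(n)]     # kg, unclothed body weight
weight = [b + c for b, c in zip(body, clothes)]         # kg, weight on the scales

# Least-squares fit of measured weight (y) on clothing weight (x).
mean_x = sum(clothes) / n
mean_y = sum(weight) / n
sxx = sum((x - mean_x) ** 2 for x in clothes)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(clothes, weight))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# r-squared: 1 minus (residual variation / total variation in y).
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(clothes, weight))
sst = sum((y - mean_y) ** 2 for y in weight)
r_squared = 1.0 - sse / sst

# t statistic for the slope.
se_slope = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)
t_slope = slope / se_slope

print(f"slope = {slope:.2f} kg/kg, t = {t_slope:.1f}, r-squared = {100 * r_squared:.2f}%")
```

In this setup the slope's t statistic is far above 2, so the effect of clothing is clearly significant, yet r-squared is well under 5% because most of the variation in weight comes from the people themselves.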
Summary
Linear regression extends correlation by reporting the mathematical ‘line of best fit’, y = mx + c.
The slope needs to be tested statistically to ‘prove’ the relationship is ‘real’, e.g. the null hypothesis is that m = 0.
The intercept should also be tested to see if it is non-zero.
The equation of the ‘line of best fit’ is predictive, i.e. given a value of x, you can predict y.
The usefulness of this depends on r-squared, the proportion of variation in y ‘explained’ by x (0-100%).