Correlation – Recap
Correlation provides an estimate of how strongly change in 'x' is associated with change in 'y' (correlation alone does not prove causation). The relationship has a magnitude (the r value) and a direction (the sign on the r value). The r value measures how close the untransformed data points are to a straight line. The r value is therefore a very important statistic for regression analysis, because it tells you how accurate your predicted values will be. That is why we tested the correlation value so thoroughly: being able to predict the future is pretty cool.
Regression
Regression analysis is another method by which the relationship between dependent and independent variables can be estimated. Unlike correlation, which tells you only the strength and direction of a relationship, regression tells you much more about each point and its place in the relationship. Regression also tells you how to use an 'x' value to predict a 'y' value using the mathematical expression of a trend line. This is how you can predict the future.
Trend Lines
A trend line is a line drawn through a frequency distribution of paired values called a scatterplot. The scatterplot shows the overall pattern of the points around the trend line.
TWO IMPORTANT POINTS:
- The data values are the finest degree of resolution in your data.
- The trend line is the coarsest degree.
Each shows you a different type of information.
Trend Lines
Trend lines can be straight, with their 'straightness' defined by their angle: are they steep or shallow? Trend lines can also be curvilinear, with their 'curviness' defined by polynomial, logistic, or exponential (log) functions. Both types of lines also have the 'fit' of their data points to the line defined by their correlation coefficient.
Types of Trend Lines
Polynomial trend lines are defined by the exponent n in y = xⁿ and are labelled by their 'degree':
- Quadratic = 2
- Cubic = 3
- Quartic = 4
- Quintic = 5
Linear Trend Lines
The other aspect of trend lines, apart from their shape, is whether they have one or more than one independent variable; that is, are they bivariate or multivariate? We have seen only bivariate trend lines so far: lines having a y and one x. For our discussion on regression we will stick with these bivariate linear trend lines.
How are linear trend lines created? First, the line always passes through the arithmetic means of the x and the y variables. Second, the trend line is always as close as it can possibly be to every data point. Third, the difference between each data point and the line is as small as it can be when all points are considered. This is done by minimising the sum of the squared differences. This is why the Pearson formulation we shall use is called the “least squares” method.
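The least-squares recipe above can be sketched directly: the slope is the sum of cross-deviations over the sum of squared x-deviations, and the intercept is chosen so the line passes through the point (mean of x, mean of y). A minimal Python sketch (toy data, not the course dataset):

```python
def least_squares(xs, ys):
    """Fit y = m*x + b by minimising the sum of squared differences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: sum of cross-deviations over sum of squared x-deviations.
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    # Intercept: forces the line through the point (mean_x, mean_y).
    b = mean_y - m * mean_x
    return m, b

m, b = least_squares([1, 2, 3, 4], [2.0, 4.0, 6.0, 8.0])
# For this toy data the fit is exact: m = 2.0, b = 0.0
```

Because b is defined as mean_y − m × mean_x, the fitted line always passes through the means, exactly as the slide states.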
An example: 42 pairs of grades.
Each student has a high school grade (the X or independent variable) and a 1st year university grade (the Y or dependent variable). They are labelled such because the HS grade could influence the Uni grade but not the other way around – i.e. Y is dependent on X and not X on Y.

| Student # | Best 6 HS Grade (X) | 1st Yr Uni Grade (Y) | Student # | Best 6 HS Grade (X) | 1st Yr Uni Grade (Y) |
|---|---|---|---|---|---|
| 1 | 91.83 | 87.00 | 22 | 73.67 | 69.00 |
| 2 | 90.83 | 88.00 | 23 | 73.17 | 67.50 |
| 3 | 84.67 | 82.40 | 24 | 72.83 | 76.90 |
| 4 | 83.83 | 76.30 | 25 | 72.83 | 63.00 |
| 5 | 83.67 | 78.30 | 26 | 72.67 | 65.00 |
| 6 | 83.33 | 80.20 | 27 | 72.33 | 67.00 |
| 7 | 82.00 | 75.40 | 28 | 72.00 | 71.00 |
| 8 | 81.17 | 67.40 | 29 | 71.50 | 65.00 |
| 9 | 81.00 | 84.30 | 30 | 71.17 | 54.00 |
| 10 | 78.33 | 76.80 | 31 | 71.00 | 63.50 |
| 11 | 77.83 | 78.60 | 32 | 70.67 | 67.00 |
| 12 | 77.83 | 67.90 | 33 | 70.67 | 67.90 |
| 13 | 77.67 | 55.80 | 34 | 70.50 | 69.00 |
| 14 | 76.67 | 76.00 | 35 | 70.33 | 65.40 |
| 15 | 76.50 | 76.60 | 36 | 70.17 | 65.10 |
| 16 | 75.67 | 42.70 | 37 | 70.17 | 64.70 |
| 17 | 74.83 | 80.30 | 38 | 70.00 | |
| 18 | 74.67 | 82.90 | 39 | 70.00 | |
| 19 | 74.67 | 67.00 | 40 | 69.67 | 40.60 |
| 20 | 74.50 | 71.00 | 41 | 69.67 | 61.00 |
| 21 | 74.33 | 65.80 | 42 | 69.33 | 67.00 |

Mean of all X = 75.48; Mean of all Y = 69.76; r = 0.617; r² = 0.38; SEE = 8.03
Means of the Regression Line
The regression line passes through the mean of x (75.48) and the mean of y (69.76). The sum of the squared distances from each point to the line is as small as it can be when all points are considered. That is, the line cannot get any closer overall to the points.
Prediction – Regression
A high school grade of 80% will predict a first year university grade of almost 75%. But can we get a more accurate prediction than "almost"? Yes, using the linear equation.
Regression for Real
Regression is a mathematical method which uses a linear equation by which one value (y) can be predicted from another value (x). Furthermore, the predicted value can be given 'margins of error' – that is, x will predict y to within ± some amount, in whatever units y is in. The accuracy of the predicted value of y and the size of the margins of error will depend on how well the data points match a straight line – AND THAT DEPENDS ON HOW HIGH YOUR r VALUE IS. It also depends on how many pairs of data points you have – that is, your 'n'.
Linear Regression – Equating the Line
ŷ = mx₁ + b, where:
- ŷ (called 'y-hat') is the predicted value of y for a given x
- b is the intercept value
- m is the slope of the line
- x₁ is the given value in the x (independent) dataset from which you want to predict y.
This is the formula that Excel uses. You sometimes also see it written as y = a + bx.
Prediction – Regression
This is the linear regression equation used to predict the ŷ value in Excel: ŷ = mx + b, with 'm' as the slope and 'b' as the intercept.
Prediction Reprise – "Almost" Regression Using The Line
A high school grade of 80% will predict a first year university grade of almost 75%. But can we get a more accurate prediction than "almost"? Yes, using the linear equation.
Predicting an 80% Incoming HS Grade from the Linear Regression Equation
ŷ = 1.0862x − 12.22
ŷ = 1.0862 × 80 − 12.22
ŷ = 74.7%
This is our "almost" 75%.
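The arithmetic on this slide can be checked in a couple of lines. The slope (1.0862) and intercept (−12.22) are the fitted values from the slides:

```python
def predict(x, m=1.0862, b=-12.22):
    """Predicted first-year grade from a HS grade, using the slides' fitted line."""
    return m * x + b

y_hat = predict(80)
# 1.0862 * 80 - 12.22 = 74.676, i.e. the slide's "almost 75%"
print(round(y_hat, 1))  # 74.7
```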
Standard Error of the Estimate (SEE) of ŷ
The predicted ŷ is not exact even if we have an exact x to start with, because:
- There is likely more than one y value for every x value, and…
- The line is based on the correlation coefficient, which was not a perfect 1.0 but an imperfect 0.617, and…
- Our r² of 0.38 explains only 38% of the variability, and…
- Our 'n' is only a sample of 42 pairs and not everyone in the population from which the sample of 42 pairs came.
These two students have the same HS grade but widely differing first year grades. These two students have the same HS grade and very close first year grades. If you plug 70.17 into the equation you get a predicted first year value of 63.99, which is very close to the actual grades. If you plug 72.83 into the equation you get a predicted first year value of 66.89, which is not very close to the actual grades. This variability between the predicted grades and the actual grades is called the error of estimate, and it can be calculated as a statistic called the standard error of estimate (SEE).
These lines represent the idea of variability of the data points from the line. The SEE is, roughly, the standard deviation of these differences: the square root of the average squared difference from the data points to the line.
Standard Error of Estimate of ŷ
Luckily we don't have to calculate this by hand – Excel calculates it for you. The formula is:
SEE = √( Σ(y − ŷ)² / (n − 2) )
The result is the ± value on ŷ, in whatever units y was in (e.g. in this case, student CGPA in %). Again, the SEE is similar to using the standard deviation. Note the '2': that's because we have a pair of values and not just one value (and remember that n − 1 is the sample n). Note the squared differences of the observed y's from the predicted ŷ's.
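A minimal sketch of that formula (the n − 2 divisor reflects the two estimated parameters, slope and intercept; the numbers below are toy values, not the course data):

```python
import math

def standard_error_of_estimate(ys, y_hats):
    """SEE = sqrt( sum((y - y_hat)^2) / (n - 2) )."""
    n = len(ys)
    ss_resid = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))
    return math.sqrt(ss_resid / (n - 2))

# Toy observations with residuals of 1, -1, 2, -2 around the fitted values:
see = standard_error_of_estimate([11, 9, 22, 18], [10, 10, 20, 20])
print(round(see, 3))  # sqrt(10 / 2) = 2.236
```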
Standard Error of Estimate of ŷ
Now look again at the SEE formula, SEE = √( Σ(y − ŷ)² / (n − 2) ), and compare it to the standard deviation formula, s = √( Σ(y − ȳ)² / (n − 1) ). Once again the usefulness of the arithmetic mean and the standard deviation is evident.
Interpreting the SEE
The SEE for the example data is 8.03%. This number is the ± value on ŷ, in whatever units y was in (e.g. student 1st year University CGPA in %). Since the SEE is similar to the standard deviation, saying 1.96 × SEE is the same as saying 1.96 × s. Thus you can say that the population value of ŷ (labelled as upper case Ŷ) will, with 95% certainty, fall within ±1.96 × SEE, or…
±1.96 × 8.03% = ±15.74%
These lines represent the average variability of the data points from the line. This average variability is calculated as the SEE and represents the average margin of error of any data point from the trend line.
Predicting the ŷ Value for x = 80% from the Linear Regression Equation
ŷ = 1.0862x − 12.22
ŷ = 1.0862 × 80 − 12.22
ŷ = 74.7%
Predicting Margins of Error at 95% Confidence from the Linear Regression Equation
Predicted value ŷ = 74.7%
Predicted margin at 95% = 1.96 × SEE = 1.96 × 8.03% = ±15.74%
Range within which the population value (Ŷ) falls with 95% certainty = 74.7% ± 15.74% = 59.0% to 90.4%.
The large range of the margin is a function of the relatively modest 'r' value (0.617) and the small 'n' (42 pairs). How might you reduce these margins?
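That interval arithmetic, sketched in Python (1.96 is the usual two-tailed z-score for 95% confidence):

```python
y_hat = 74.7   # predicted first-year grade (%)
see = 8.03     # standard error of estimate (%)
z = 1.96       # two-tailed z-score for 95% confidence

margin = z * see
low, high = y_hat - margin, y_hat + margin
print(round(margin, 2))               # 15.74
print(round(low, 1), round(high, 1))  # 59.0 90.4
```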
Reducing Margins of Error
The good news is that, theoretically, the margins can be reduced by increasing the number of pairs, 'n'. The bad news is that they likely cannot be reduced by very much. Why? Consider the following: if your sample were to have caught only the red-circled students, then your r would be small and hence your SEE high. The larger the sample, the more likely you are to approximate the distribution and the r seen on the graph – but no better. Remember what happens when you increase sample size: precision improves only with √n.
Diminishing Returns on Sample Size
[Chart: confidence interval width against doublings of 'n' starting at n = 30 (n = 30, 60, 120, 240, 480) – the change in the CI shrinks with every doubling of 'n'.]
We'll look at this more closely later in the Sampling lecture.
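The diminishing returns come from the √n in the standard-error formula: CI width scales as 1/√n, so every doubling of n shrinks the interval by only a factor of 1/√2 ≈ 0.71. A sketch of the relative widths (arbitrary units, assuming the 1/√n scaling):

```python
import math

def relative_ci_width(n, baseline=30):
    """CI width relative to n = 30, assuming width scales as 1 / sqrt(n)."""
    return math.sqrt(baseline / n)

for n in [30, 60, 120, 240, 480]:
    print(n, round(relative_ci_width(n), 3))
# Each doubling multiplies the width by 1/sqrt(2) ≈ 0.707,
# so going from n=240 to n=480 buys far less than going from n=30 to n=60.
```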
Relationships Summary
Relationships measure the effect of one variable (the independent, or x) on another (the dependent, or y). The direction and strength of the effect is given by the correlation coefficient (r), and its reliability by the standard error of r and its significance test. The degree (in % terms) to which x explains change in y is given by the coefficient of determination (r²). Using line equations, regression allows us to use the relationship measured by correlation to forecast values of y for given values of x. Using the standard error (called the SEE) allows us to put margins of error on the predicted values.
Regression and Residuals
Linear equations express how well a straight line fits your data. The actual regression line is calculated as the line that minimizes the squared distances of all points in the dataset from the line. More precisely, it is the line for which the sum of the squared differences of all values of y from ŷ, for any value of x, is the smallest.
What Are Residuals?
What Does A Residuals Analysis Tell You?
First: a residuals plot confirms whether a linear trend exists in your data. Second: a residuals plot can also indicate whether another type of trend exists in your data.

| If the trend in the residuals plot is… | Then… |
|---|---|
| Strong | The trend in the data is not linear but could be non-linear. |
| Weak/None | The trend in the data is linear, not non-linear. |
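One way to see this numerically: fit a straight line to clearly curved data and inspect the residuals. In the sketch below (toy data y = x², hand-rolled least squares), the residuals show an obvious U-shaped pattern, which is the signal that a straight line is the wrong model:

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return m, my - m * mx

xs = [1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]          # curved (quadratic) data

m, b = fit_line(xs, ys)            # m = 6.0, b = -7.0 for this data
residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0] – a clear U-shaped pattern
```

Note the residuals rise, fall, and rise again in order of x: a systematic pattern, even though the data came from a single smooth curve.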
[Diagram annotations:] An observed value of a 'y' for a given 'x', and another observed value of a 'y' for an identical value of 'x'. The predicted value of ŷ is obtained using the line equation. The red lines (or their mathematical equivalents) are called residuals. The best-fit line will minimize the total of the squared 'lengths' of all the red lines.
How Residuals Are Calculated
The ŷ is predicted by the linear equation for each 'x'. The residual e is given by e = y − ŷ.
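Applying that to the worked example from the earlier slides (line ŷ = 1.0862x − 12.22; the two students with HS grade 70.17 scored 65.10 and 64.70):

```python
def residual(y, x, m=1.0862, b=-12.22):
    """e = y - y_hat, with y_hat = m*x + b (slope/intercept from the slides)."""
    return y - (m * x + b)

# Two students with the same HS grade (70.17) but different first-year grades:
for actual in (65.10, 64.70):
    print(round(residual(actual, 70.17), 2))
# Both residuals are small (about 1.1 and 0.7), matching the slides'
# observation that the prediction is very close for these two students.
```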
[Scatterplot of y against x for HS grades and first year University grades: a linear pattern to the data and a moderate r.]
[Residuals plot of e against x: no pattern to the data and an almost zero r.]
Residuals Summary
The stronger the linear trend, the weaker the pattern in the residuals. The weaker the linear trend, the stronger the pattern in the residuals. BUT! Take a look at these non-linear trend lines and their residuals.
The natural raw-data scatter is strongly skewed. The r², 38%, is fair, but the shape of these data clearly does not fit a linear trend line (they fit the red logarithmic line much better).
And the residuals plot for those same data shows another strong pattern (observe the r²) – but it's not linear, indicating that a straight-line fit to these data is inadvisable.
Residuals Summary
If you calculate your residuals from a linear trend line and your data is not linear, then there will be a pattern in the residuals. The poorer a linear trend line fits your data, the stronger the residuals pattern will be. If you calculate your residuals from a linear trend line and your data is linear, then there will be a weaker pattern in the residuals. At r = ±1.0, r² = 1.0 and there will be no pattern in the residuals, because all points will fall on the line and there will be no residual values – that is, the residuals will all be zero.