1
Action Research Correlation and Regression
INFO 515 Glenn Booker INFO 515 Lecture #7
2
Measures of Association
Measures of association are used to determine how strong the relationship between two variables or measures is, and how well one can be predicted from the other.
They apply only to interval or ratio scale variables. Everything this week applies only to interval or ratio scale variables!
3
Measures of Association
For example, I have GRE and GPA scores for a random sample of graduate students.
How strong is the relationship between GRE scores and GPA?
Do these variables relate to each other in some way?
If there is a strong relationship, how well can we predict the values of one variable when values of the other variable are known?
4
Strength of Prediction
Two techniques are used to describe the strength of a relationship, and to predict values of one variable when another variable’s value is known:
Correlation: describes the degree (strength) to which the two variables are related.
Regression: used to predict the values of one variable when values of the other are known.
5
Strength of Prediction
Correlation and regression are linked -- the ability to predict one variable when another variable is known depends on the degree and direction of the variables’ relationship in the first place.
We find correlation before we calculate regression, so generating a regression without checking for a correlation first is pointless (though we’ll do both at once).
6
Correlation
There are different types of statistical measures of correlation.
They give us a measure known as the correlation coefficient.
The most common procedure used is known as Pearson’s Product Moment Correlation, or Pearson’s ‘r’.
7
Pearson’s ‘r’
Can only be calculated for interval or ratio scale data.
Its value is a real number from -1 to +1.
Strength: as the value of ‘r’ approaches -1 or +1, the relationship is stronger. As the magnitude of ‘r’ approaches zero, we see little or no relationship.
8
Pearson’s ‘r’
For example, ‘r’ might equal 0.89, -0.9, 0.613, or -0.3. Which would be the strongest correlation?
Direction: the sign of Pearson’s ‘r’ gives the direction of the correlation (positive ‘r’ for a positive correlation, negative ‘r’ for a negative one). The ‘R’ that SPSS reports in its regression output is never negative, however, so there the direction has to be read from the type of equation used and the signs of the constants obtained for it.
9
Example of Relationships
Positive direction -- as the independent variable increases, the dependent variable tends to increase.
[Data table with columns: Student, GRE (X), GPA1 (Y)]
10
Example of Relationships
Negative direction -- as the independent variable increases, the dependent variable tends to decrease.
[Data table with columns: Student, GRE (X), GPA2 (Y)]
11
Positive and Negative Correlation
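[Two scatter plots: positive correlation, |r| = 1.0 (data from slide 9); negative correlation, |r| = 1.0 (data from slide 10)]
Notice that the magnitude of ‘r’ alone (1.0 in both plots) doesn’t tell whether the correlation is positive or negative! The sign of ‘r’ does, and it is negative for the slide 10 data.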
12
*Important Note*
An association value provided by a correlation analysis, such as Pearson’s ‘r’, tells us nothing about causation.
In this case, high GRE scores don’t necessarily cause high or low GPA scores, and vice versa.
13
Significance of r
We can test for the significance of r (to see whether our relationship is statistically significant) by consulting a table of critical values for r (Action Research p. 41/42): the table “VALUES OF THE CORRELATION COEFFICIENT FOR DIFFERENT LEVELS OF SIGNIFICANCE”, where df = (number of data pairs) – 2.
14
Significance of r
We test the null hypothesis that the correlation between the two variables is equal to zero (there is no relationship between them).
Reject the null hypothesis (H0) if the absolute value of r is greater than the critical r value: reject H0 if |r| > rcrit.
This is similar to evaluating actual versus critical ‘t’ values.
15
Significance of r Example
Suppose we had 20 pairs of data. For two-tail 95% confidence (P=.05), the critical ‘r’ value at df=20-2=18 is 0.444.
So reject the null hypothesis (hence the correlation is statistically significant) if r > 0.444 or r < -0.444.
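The decision rule in this example can be sketched in a few lines of Python. The function name and the sample r values are illustrative assumptions, not part of the lecture; the critical value 0.444 is the table value quoted for df = 18.

```python
def is_significant(r, r_crit):
    """Return True when |r| exceeds the critical value, i.e. reject H0 (no correlation)."""
    return abs(r) > r_crit

n_pairs = 20
df = n_pairs - 2      # degrees of freedom = (number of data pairs) - 2
r_crit = 0.444        # table value for df = 18, two-tail P = .05

# Hypothetical sample correlations, chosen only to exercise the rule.
for r in (0.50, -0.50, 0.30):
    verdict = "significant" if is_significant(r, r_crit) else "not significant"
    print(f"r = {r:+.2f}: {verdict}")
```

Taking the absolute value is what makes the single comparison cover both tails of the test.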
16
Strength of “|r|”
The absolute value of Pearson’s ‘r’ indicates the strength of a correlation:
1.0 to 0.9: very strong correlation
0.9 to 0.7: strong
0.7 to 0.4: moderate to substantial
0.4 to 0.2: moderate to low
0.2 to 0.0: low to negligible correlation
Notice that a correlation can be strong, but still not be statistically significant (especially for small data sets)!
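As a rough illustration, the strength scale can be written as a small lookup function. The function name is hypothetical, and because the listed ranges share their end points, assigning each boundary value to the stronger category is an assumption made here.

```python
def strength_label(r):
    """Map Pearson's r to the verbal strength scale from the slide."""
    magnitude = abs(r)          # strength ignores the sign of r
    if magnitude > 1.0:
        raise ValueError("r must lie between -1 and +1")
    if magnitude >= 0.9:        # assumption: boundaries go to the stronger label
        return "very strong"
    if magnitude >= 0.7:
        return "strong"
    if magnitude >= 0.4:
        return "moderate to substantial"
    if magnitude >= 0.2:
        return "moderate to low"
    return "low to negligible"

# The example values of r mentioned earlier in the lecture.
for r in (0.89, -0.9, 0.613, -0.3):
    print(r, strength_label(r))
```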
17
*Important Notes*
The stronger the r, the smaller the standard error of the estimate, and the better the prediction!
A significant r does not necessarily mean that you have a strong correlation.
A significant r means that whatever correlation you do have is unlikely to be due to random chance.
18
Coefficient of Determination
By squaring r, we can determine the amount of variance the two variables share (called “explained variance”).
R Square is the coefficient of determination.
So, an “R Square” of 0.94 means that 94% of the variance in the Y variable is explained by the variance of the X variable.
19
What is R Squared?
The coefficient of determination, R^2, is a measure of the goodness of fit.
R^2 ranges from 0 to 1.
R^2 = 1 is a perfect fit (all data points fall on the estimated line or curve).
R^2 = 0 means that the variable(s) have no explanatory power.
20
What is R Squared?
Having R^2 closer to 1 helps choose which regression model is best suited to a problem.
Getting an R^2 of exactly zero is very difficult: a sample of ten random numbers from Excel still obtained an R^2 of 0.006.
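One common way to compute R^2 as a goodness-of-fit measure is 1 minus the ratio of the residual sum of squares to the total sum of squares. The slides do not give this formula, so the sketch below is an assumption about how the number is obtained; the function name and data are made up for illustration.

```python
def r_squared(y_actual, y_predicted):
    """R^2 = 1 - (sum of squared residuals) / (total sum of squares about the mean)."""
    mean_y = sum(y_actual) / len(y_actual)
    ss_res = sum((y - yp) ** 2 for y, yp in zip(y_actual, y_predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in y_actual)
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]                  # hypothetical observed values

perfect = r_squared(y, y)                 # every point falls on the model
no_power = r_squared(y, [2.5] * len(y))   # model only ever predicts the mean

print(perfect, no_power)
```

The two calls reproduce the end points described above: a model that hits every point, and a model with no explanatory power.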
21
Scatter Plots
It’s nice to use R^2 to determine the strength of a relationship, but visual feedback helps verify whether the model fits the data well.
It also helps to look for data fliers (outliers).
A scatter plot (or scatter gram) allows us to compare any two interval or ratio scale variables, and see how data points are related to each other.
22
Scatter Plots
Scatter plots are two-dimensional graphs with an axis for each variable (independent variable X and dependent variable Y).
To construct: place an * on the graph for each X and Y value from the data.
Seeing data this way can help choose the correct mathematical model for the data.
23
Scatter Plots
[Figure: scatter plot axes with X (independent) horizontal and Y (dependent) vertical, origin at (0, 0), and one data point * at (2, 3), i.e. X=2, Y=3]
24
Models
Models allow us to focus on select elements of the problem at hand, and ignore irrelevant ones.
They may show how parts of the problem relate to each other.
They may be expressed as equations, mappings, or diagrams.
They may be chosen or derived before or after measurement (theory vs. empirical).
25
Modeling
Often we look for a linear relationship – one described by fitting a straight line to the data as well as possible.
More generally, any equation could be used as the basis for regression modeling, or describing the relationship between two variables.
You could have Y = a*X**2 + b*ln(X) + c*sin(d*X-e)
26
Linear Model
Y = m*X + b or Y = b0 + b1*X
[Figure: straight line on X (independent) and Y (dependent) axes; b = Y axis intercept; m = slope, the change in Y per 1 unit of X]
27
Linear Model
Pearson’s ‘r’ for linear regression is calculated per (Action Research p. 29/30). Define:
N = number of data pairs
SX = sum of all X values
SX2 = sum of all (X values squared)
SY = sum of all Y values
SY2 = sum of all (Y values squared)
SXY = sum of all (X values times Y values)
Pearson’s r = [N*(SXY) – (SX)*(SY)] / sqrt[(N*(SX2) – (SX)^2)*(N*(SY2) – (SY)^2)]
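The sums-based formula translates directly into code. The sketch below is one possible Python rendering; the variable names mirror the definitions on the slide, while the function name and the two small data sets are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from the raw sums N, SX, SX2, SY, SY2 and SXY."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sx2 = sum(x * x for x in xs)
    sy2 = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sxy - sx * sy
    denominator = sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))
    return numerator / denominator

xs = [1, 2, 3, 4, 5]            # made-up X values
ys_rising = [2, 4, 6, 8, 10]    # Y rises in step with X
ys_falling = [10, 8, 6, 4, 2]   # Y falls in step with X

print(pearson_r(xs, ys_rising))
print(pearson_r(xs, ys_falling))
```

With these perfectly linear data sets the two results have the same magnitude and opposite signs, because only the numerator of the formula can change sign.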
28
Linear Model
For the linear model, you could find the slope ‘m’ and Y-intercept ‘b’ from
m = (r) * (standard deviation of Y) / (standard deviation of X)
b = (mean of Y) – (m)*(mean of X)
But it’s a lot easier to use SPSS’ slope = b1 and Y intercept = b0.
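For readers without SPSS at hand, the two formulas for ‘m’ and ‘b’ can be checked with a short Python sketch. It uses only the standard library; the function name and the five data pairs are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def line_from_correlation(xs, ys):
    """Slope m = r * sd(Y) / sd(X); intercept b = mean(Y) - m * mean(X)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    r = sxy / sqrt(sxx * syy)          # Pearson's r in deviation form
    m = r * stdev(ys) / stdev(xs)      # slope
    b = my - m * mx                    # Y-axis intercept
    return m, b

xs = [1, 2, 3, 4, 5]   # hypothetical independent variable
ys = [2, 4, 5, 4, 5]   # hypothetical dependent variable
m, b = line_from_correlation(xs, ys)
print(f"Y = {m:.3f}*X + {b:.3f}")
```

The sample standard deviation is used for both variables; the ratio is the same if the population form is used for both instead.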
29
Regression Analysis
Regression analysis allows us to predict the likely value of one variable from knowledge of another variable.
The two variables should be fairly highly correlated (close to a straight line).
The regression equation is a mathematical expression of the relationship between 2 variables on, for example, a straight line.
30
Regression Equation
Y = mX + b
In this linear equation, you predict Y values (the dependent variable) from known values of X (the independent variable); this is called the regression of Y on X.
The regression equation is fundamentally an equation for plotting a straight line, so the stronger our correlation, the closer our data points will fall to a straight line, and the better our prediction will be.
31
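Linear Regression
[Figure: data points and a fitted line on x and y axes, labeled ŷ = a + b*x (predicted value) and y = ŷ + e (actual value = predicted value + error)]
Choose the “best” line by minimizing the sum of the squares of the vertical distances between the data points and the regression line.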
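The least-squares criterion can be illustrated with a minimal Python sketch. The closed-form solution used here is the standard one for a straight line; the function names and data are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Return (a, b) for y_hat = a + b*x that minimize the sum of squared vertical distances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

def sum_squared_errors(xs, ys, a, b):
    """Sum of squared vertical distances e = y - y_hat between the points and the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)
best = sum_squared_errors(xs, ys, a, b)

# Nudging the intercept or the slope away from the fitted values increases the error.
print(best, sum_squared_errors(xs, ys, a + 0.1, b), sum_squared_errors(xs, ys, a, b - 0.1))
```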
32
Standard Error of the Estimate
The standard error of the estimate is the standard deviation of the data around the regression line.
It tells how much the actual values of Y deviate from the predicted values of Y.
33
Standard Error of the Estimate
After you calculate the standard error of the estimate, you add it to and subtract it from your predicted values of Y to get a band around the regression line within which you would expect a given % of repeated actual values to occur or cluster if you took many samples (sort of like a sampling distribution for the mean).
34
Standard Error of Estimate
The Standard Error of Estimate for Y predicted by X is
sy/x = sqrt[sum of (Y – predicted Y)^2 / (N–2)]
where ‘Y’ is each actual Y value, ‘predicted Y’ is the Y value predicted by the linear regression, and ‘N’ is the number of data pairs.
For example, on (Action Research p. 33/34), sy/x = sqrt(2.641/(10-2)) = 0.574
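The same calculation can be sketched in Python. The function name and the five-point data set are hypothetical; the last lines redo the textbook arithmetic from the sum of squares 2.641 and N = 10 quoted above, which comes out at about 0.57.

```python
from math import sqrt

def standard_error_of_estimate(ys, predicted):
    """sqrt( sum of (Y - predicted Y)^2 / (N - 2) )."""
    n = len(ys)
    ss_residual = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    return sqrt(ss_residual / (n - 2))

# Hypothetical actual values and the values predicted by a fitted line.
ys = [2, 4, 5, 4, 5]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]
se = standard_error_of_estimate(ys, predicted)
print(se)

# The textbook example: sum of squared differences 2.641 with N = 10 data pairs.
book_value = sqrt(2.641 / (10 - 2))
print(book_value)
```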
35
Standard Error of the Estimate
So, if the standard error of the estimate is equal to 0.574, and if you have a predicted Y value of 4.560, then 68% of your actual values, with repeated sampling, would fall between 3.986 and 5.134 (predicted Y +/- 1 std error).
The smaller the standard error, the closer your actual values are to the regression line, and the more confident you can be in your prediction.
36
SPSS Regression Equations
Instead of constants called ‘m’ and ‘b’, ‘b0’ and ‘b1’ are used for most equations.
The meaning of ‘b0’ and ‘b1’ varies, depending on the type of equation being modeled.
You can suppress the use of ‘b0’ by unchecking “Include constant in equation”.
37
SPSS Regression Models
Linear model: Y = b0 + b1*X
Logarithmic model: Y = b0 + b1*ln(X), where ‘ln’ = natural log
Inverse model: Y = b0 + b1/X (similar to the form X*Y = constant, which is a hyperbola)
38
SPSS Regression Models
Power model: Y = b0*(X**b1)
Compound model: Y = b0*(b1**X)
A variant of this is the Logistic model, which requires a constant input ‘u’ that is larger than Y for any actual data point: Y = 1/[ 1/u + b0*(b1**X) ]
Here “**” indicates “to the power of”.
39
SPSS Regression Models
“exp” means “e to the power of”; e = …
Exponential model: Y = b0*exp(b1*X)
Other exponential functions:
S model: Y = exp(b0 + b1/X)
Growth model (almost identical to the exponential model): Y = exp(b0 + b1*X)
40
SPSS Regression Models
Polynomials beyond the Linear model (linear is a first order polynomial):
Quadratic (second order): Y = b0 + b1*X + b2*X**2
Cubic (third order): Y = b0 + b1*X + b2*X**2 + b3*X**3
These are the only equations which use constants b2 and b3.
Higher order polynomials require the Regression module of SPSS, which can do regression using any equation you enter.
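To keep the model equations straight, it can help to write each one as a small function. The sketch below restates the formulas from the SPSS Regression Models slides in Python; the function names are illustrative, and this is not SPSS code.

```python
from math import exp, log

def linear(x, b0, b1):       return b0 + b1 * x
def logarithmic(x, b0, b1):  return b0 + b1 * log(x)          # natural log, needs x > 0
def inverse(x, b0, b1):      return b0 + b1 / x               # needs x != 0
def power(x, b0, b1):        return b0 * (x ** b1)
def compound(x, b0, b1):     return b0 * (b1 ** x)
def logistic(x, u, b0, b1):  return 1.0 / (1.0 / u + b0 * (b1 ** x))   # u = upper bound
def exponential(x, b0, b1):  return b0 * exp(b1 * x)
def s_model(x, b0, b1):      return exp(b0 + b1 / x)          # needs x != 0
def growth(x, b0, b1):       return exp(b0 + b1 * x)
def quadratic(x, b0, b1, b2):     return b0 + b1 * x + b2 * x ** 2
def cubic(x, b0, b1, b2, b3):     return b0 + b1 * x + b2 * x ** 2 + b3 * x ** 3

# Growth and exponential describe the same curve: exp(b0 + b1*x) = exp(b0) * exp(b1*x),
# so the growth model's b0 is the natural log of the exponential model's b0.
x = 2.0
print(growth(x, 0.5, 0.3), exponential(x, exp(0.5), 0.3))
```

The final line shows why the growth model is described as almost identical to the exponential model: only the meaning of b0 differs.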
41
Y = whattheflock?
To help picture these equations:
Make an X variable over some typical range (0 to 10 in a small increment, maybe 0.01).
Define a Y variable.
Calculate the Y variable using Transform > Compute… and whatever equation you want to see.
Pick values for b0 and b1 that aren’t 0, 1, or 2.
Have SPSS plot the results of a regression of Y vs X for that type of equation.
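The first steps of this exercise can also be tried outside SPSS. The Python sketch below builds the X range and computes Y for one equation; the choice of the exponential model and the values of b0 and b1 are arbitrary assumptions made only for illustration.

```python
from math import exp

# Constants picked arbitrarily; the only requirement is that they aren't 0, 1, or 2.
b0, b1 = 3.5, -0.3

# X from 0 to 10 in steps of 0.01 (1001 points); building each value from an integer
# index avoids the rounding drift of repeatedly adding 0.01.
xs = [i * 0.01 for i in range(1001)]

# Y from the exponential model Y = b0*exp(b1*X); swap in any other equation to picture it.
ys = [b0 * exp(b1 * x) for x in xs]

print(xs[0], xs[-1], ys[0], ys[-1])
```

The resulting (X, Y) pairs can then be plotted or fed to a regression of the matching type to see what that curve looks like.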
42
How to Apply This?
Given a set of data containing two variables of interest, generate a scatter plot to get some idea of what the data looks like.
Choose which types of models are most likely to be useful.
For only linear models, use Analyze / Regression / Linear...
43
How to Apply This?
Select the Independent (X) and Dependent (Y) variables.
Rules may be applied to limit the scope of the analysis, e.g. gender=1.
Dozens of other characteristics may also be obtained, which are beyond our scope here.
44
How to Apply This?
Then check for the R Square value in the Model Summary.
Check the Coefficients to make sure they are all significant (e.g. Sig. < 0.050).
If so, use the ‘b0’ and ‘b1’ coefficients from under the ‘B’ column (see Statistics for Software Process Improvement handout), plus or minus the standard errors “SE B”.
45
Regression Example
For example, go back to the “GSS91 political.sav” data set.
Generate a linear regression (Analyze > Regression > Linear) with ‘age’ as the Independent variable, and ‘partyid’ as the Dependent variable.
Notice that R^2 and the ANOVA summary are given, with F and its significance.
46
Regression Example
47
Regression Example
The R Square of … means there is a very slight correlation (little strength).
But the ANOVA Significance well under … confirms there is a statistically significant relationship here -- it’s just a really weak one.
48
Regression Example
[Figure: output from Analyze > Regression > Linear]
[Figure: output from Analyze > Regression > Curve Estimation]
49
Regression Example
The heart of the regression analysis is in the Coefficients section.
We could look up ‘t’ on a critical values table, but it’s easier to see if all values of Sig are < 0.050; if they are, reject the null hypothesis, meaning there is a significant relationship.
If so, use the values under B for b0 and b1.
If any coefficient has Sig > 0.050, don’t use that regression (the coefficient might be zero).
50
Regression Example
The answer for “what is the effect of age on political view?” is that there is a very weak but statistically significant linear relationship, with a reduction of … (b1) political view categories per year.
From the Variable View of the data, since low values are liberal and large values conservative, this means that people tend to get slightly more liberal as they get older.
51
Curve Estimation Example
For the other regression options, choose Analyze / Regression / Curve Estimation…
Define the Dependent variable(s) and the Independent variable -- note that multiple Dependents may be selected.
Check which math models you want used.
Display the ANOVA table for reference.
52
Curve Estimation Example
SPSS Tip: up to three regression models can be plotted at once, so don’t select more than that if you want a scatter plot to go with the data and the regressions.
For the same example just used, get a summary for the linear and quadratic models (Analyze > Regression > Curve Estimation).
Find “R Square” for each model.
Generally pick the model with the largest R Square.
We already saw the Linear output; now see the Quadratic.
53
Curve Estimation Example
For the quadratic regression, R Square is slightly higher, and the ANOVA is still significant.
54
Curve Estimation Example
The Quadratic coefficients are all significant at the … level.
Interpret as partyid = ( / ) + ( / )*age ( / )*age**2
Edit the data table, then double click on the cells to get the values of b2 and its std error.
55
Curve Estimation Example
The data set will be plotted as the Observed points, with the regression models shown for comparison.
Look to see which model most closely matches the data.
Look for regions of data which do or don’t match the model well (if any).
56
Curve Estimation Example
[Figure: observed data points with the fitted quadratic and linear curves labeled]
57
Curve Estimation Procedure
See which models are significant (throw out the rest!).
Compare the R Square values to see which provides the best fit.
Use the graph to verify visually that the correct model was chosen.
Use the model equation’s ‘B’ values and their standard errors to describe and predict the data’s behavior.
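The first two steps of this procedure amount to a filter followed by a maximum, which can be sketched in Python. The function name, the dictionary layout, and every number below are made up for illustration; they are not taken from the GSS91 output or from SPSS.

```python
def choose_model(summaries, alpha=0.05):
    """Drop models with any non-significant coefficient, then pick the largest R Square."""
    significant = [s for s in summaries
                   if all(sig < alpha for sig in s["sig"])]
    if not significant:
        return None                     # nothing usable: no model passes the check
    return max(significant, key=lambda s: s["r_square"])

# Hypothetical model summaries: R Square plus the Sig value of each coefficient.
candidates = [
    {"name": "linear",    "r_square": 0.010, "sig": [0.000, 0.001]},
    {"name": "quadratic", "r_square": 0.015, "sig": [0.000, 0.002, 0.010]},
    {"name": "cubic",     "r_square": 0.016, "sig": [0.000, 0.300, 0.450, 0.600]},
]

best = choose_model(candidates)
print(best["name"] if best else "no significant model")
```

In this made-up case the cubic model has the highest R Square but is thrown out because some of its coefficients are not significant. The visual check against the scatter plot still has to be done by eye.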