Action Research Correlation and Regression


Action Research Correlation and Regression INFO 515 Glenn Booker INFO 515 Lecture #7

Measures of Association: Measures of association are used to determine how strong the relationship between two variables or measures is, and how well we can predict one from the other. Everything this week applies only to interval or ratio scale variables!

Measures of Association: For example, suppose I have GRE and GPA scores for a random sample of graduate students. How strong is the relationship between GRE scores and GPA? Do these variables relate to each other in some way? If there is a strong relationship, how well can we predict the values of one variable when values of the other variable are known?

Strength of Prediction: Two techniques are used to describe the strength of a relationship and to predict values of one variable when another variable’s value is known. Correlation describes the degree (strength) to which the two variables are related. Regression is used to predict the values of one variable when values of the other are known.

Strength of Prediction: Correlation and regression are linked. The ability to predict one variable when another is known depends on the degree and direction of the variables’ relationship in the first place. We therefore find the correlation before we calculate a regression; generating a regression without checking for a correlation first is pointless (though we’ll do both at once).

Correlation: There are different statistical measures of correlation. Each gives us a measure known as the correlation coefficient. The most common procedure is Pearson’s Product Moment Correlation, or Pearson’s ‘r’.

Pearson’s ‘r’: Can only be calculated for interval or ratio scale data. Its value is a real number from -1 to +1. Strength: as the value of ‘r’ approaches -1 or +1, the relationship is stronger; as the magnitude of ‘r’ approaches zero, we see little or no relationship.

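Pearson’s ‘r’: For example, ‘r’ might equal 0.89, -0.9, 0.613, or -0.3. Which would be the strongest correlation? (It is -0.9, because strength depends on the magnitude of ‘r’, not its sign.) Direction: the sign of Pearson’s ‘r’ gives the direction of a linear relationship, but the ‘R’ that SPSS reports for a regression is never negative, so positive and negative correlation cannot be distinguished from that value alone. There, the direction depends on the type of equation used and the resulting constants obtained for it (for a linear model, the sign of the slope).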

Example of Relationships: Positive direction -- as the independent variable increases, the dependent variable tends to increase:

Student   GRE (X)   GPA1 (Y)
1         1500      4.0
2         1400      3.8
3         1250      3.5
4         1050      3.1
5         950       2.9

Example of Relationships: Negative direction -- as the independent variable increases, the dependent variable tends to decrease:

Student   GRE (X)   GPA2 (Y)
1         1500      2.9
2         1400      3.1
3         1250      3.4
4         1050      3.7
5         950       4.0

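Positive and Negative Correlation: [Two scatter plots. Left: positive correlation, r = 1.0, data from slide 9 (the GPA1 table). Right: negative correlation, r = 1.0, data from slide 10 (the GPA2 table).] Notice that the high ‘r’ reported on these plots doesn’t tell whether the correlation is positive or negative! Both plots are labelled 1.0 because the reported value is effectively a magnitude; computed directly, Pearson’s ‘r’ for the slide 10 data is about -0.996.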
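
As a side note, the two GRE/GPA tables can be checked by hand. The sketch below is not part of the lecture; it computes Pearson's r from deviations about the means, and the function and variable names are made up for illustration. It should give roughly 1.0 for the GPA1 data and roughly -0.996 for the GPA2 data, so the sign does carry the direction when r is computed directly.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

gre = [1500, 1400, 1250, 1050, 950]
gpa1 = [4.0, 3.8, 3.5, 3.1, 2.9]  # positive-direction table (slide 9)
gpa2 = [2.9, 3.1, 3.4, 3.7, 4.0]  # negative-direction table (slide 10)

r_pos = pearson_r(gre, gpa1)
r_neg = pearson_r(gre, gpa2)
print(round(r_pos, 3), round(r_neg, 3))
```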

*Important Note*: An association value provided by a correlation analysis, such as Pearson’s ‘r’, tells us nothing about causation. In this case, high GRE scores don’t necessarily cause high or low GPA scores, and vice versa.

Significance of r: We can test for the significance of r (to see whether our relationship is statistically significant) by consulting a table of critical values for r (Action Research p. 41/42, table “VALUES OF THE CORRELATION COEFFICIENT FOR DIFFERENT LEVELS OF SIGNIFICANCE”), where df = (number of data pairs) - 2.

Significance of r: We test the null hypothesis that the correlation between the two variables is equal to zero (there is no relationship between them). Reject the null hypothesis (H0) if the absolute value of r is greater than the critical r value, that is, reject H0 if |r| > r_crit. This is similar to evaluating actual versus critical ‘t’ values.

Significance of r Example: Suppose we had 20 pairs of data. For a two-tailed test at 95% confidence (P = .05), the critical ‘r’ value at df = 20 - 2 = 18 is 0.444. So reject the null hypothesis (and conclude that the correlation is statistically significant) if r > 0.444 or r < -0.444.
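
The lecture reads the critical value from a table. As a rough cross-check that is not in the slides, a critical r can also be derived from a critical t using the standard relation r_crit = t_crit / sqrt(t_crit^2 + df). The sketch below assumes the usual two-tailed table value t = 2.101 for P = .05 and df = 18; the function names are hypothetical.

```python
from math import sqrt

def critical_r(t_crit, df):
    """Convert a critical t value into the equivalent critical r."""
    return t_crit / sqrt(t_crit ** 2 + df)

def is_significant(r, t_crit, n_pairs):
    """Reject H0 (no correlation) when |r| exceeds the critical r."""
    df = n_pairs - 2
    return abs(r) > critical_r(t_crit, df)

# Assumed input: two-tailed critical t for P = .05, df = 18, from a t table.
T_CRIT_DF18 = 2.101

r_crit = critical_r(T_CRIT_DF18, 18)
print(f"critical r at df=18: {r_crit:.3f}")
```

With 20 data pairs this gives a critical r of about 0.444, which agrees with the table value quoted on the slide.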

Strength of “|r|”: The absolute value of Pearson’s ‘r’ indicates the strength of a correlation:
1.0 to 0.9: very strong correlation
0.9 to 0.7: strong
0.7 to 0.4: moderate to substantial
0.4 to 0.2: moderate to low
0.2 to 0.0: low to negligible correlation
Notice that a correlation can be strong but still not be statistically significant (especially for small data sets)!
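
The strength bands above can be turned into a small lookup. This is only an illustrative sketch: the slide's ranges share their endpoints, so assigning each boundary value (0.9, 0.7, 0.4, 0.2) to the stronger band is an assumption made here, and the function name is hypothetical.

```python
def strength_label(r):
    """Map Pearson's r to the lecture's strength bands, using |r|."""
    magnitude = abs(r)
    if magnitude >= 0.9:   # boundary placement is an assumption
        return "very strong"
    if magnitude >= 0.7:
        return "strong"
    if magnitude >= 0.4:
        return "moderate to substantial"
    if magnitude >= 0.2:
        return "moderate to low"
    return "low to negligible"

for r in (0.89, -0.9, 0.613, -0.3):
    print(r, strength_label(r))
```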

*Important Notes*: The stronger the r, the smaller the standard error of the estimate, and the better the prediction! A significant r does not necessarily mean that you have a strong correlation. A significant r means that whatever correlation you do have is unlikely to be due to random chance.

Coefficient of Determination: By squaring r, we can determine the amount of variance the two variables share (called “explained variance”). R Square is the coefficient of determination. So an “R Square” of 0.94 means that 94% of the variance in the Y variable is explained by the variance of the X variable.

What is R Squared? The coefficient of determination, R^2, is a measure of the goodness of fit. R^2 ranges from 0 to 1. R^2 = 1 is a perfect fit (all data points fall on the estimated line or curve). R^2 = 0 means that the variable(s) have no explanatory power.

What is R Squared? Having R^2 closer to 1 helps choose which regression model is best suited to a problem. Having R^2 actually equal zero is very difficult: a sample of ten random numbers from Excel still obtained an R^2 of 0.006.
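
The random-number experiment can be repeated outside Excel. The sketch below is an illustration only: it fits a least-squares line to ten pseudo-random values and computes R^2 as 1 - SS_res / SS_tot. The seed and the choice of X = 1..10 are arbitrary assumptions, so the result will not match the 0.006 quoted above; the point is that it typically comes out small but not exactly zero.

```python
import random

def r_squared(xs, ys):
    """R^2 of the least-squares line through the points: 1 - SS_res / SS_tot."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

rng = random.Random(515)          # arbitrary seed, for repeatability only
xs = list(range(1, 11))           # X = 1..10 (an assumption)
ys = [rng.random() for _ in xs]   # ten random Y values
r2 = r_squared(xs, ys)
print(f"R^2 for ten random values: {r2:.3f}")
```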

Scatter Plots: It’s nice to use R^2 to determine the strength of a relationship, but visual feedback helps verify whether the model fits the data well. It also helps to look for data fliers (outliers). A scatter plot (or scattergram) allows us to compare any two interval or ratio scale variables and see how data points are related to each other.

Scatter Plots: Scatter plots are two-dimensional graphs with an axis for each variable (independent variable X and dependent variable Y). To construct one, place an * on the graph for each (X, Y) pair from the data. Seeing data this way can help choose the correct mathematical model for the data.

Scatter Plots: [Figure: axes X (Indep.) and Y (Dep.) with origin (0, 0); the data point (2, 3) is marked with an * at X=2, Y=3.]

Models: Allow us to focus on select elements of the problem at hand and ignore irrelevant ones. May show how parts of the problem relate to each other. May be expressed as equations, mappings, or diagrams. May be chosen or derived before or after measurement (theory vs. empirical).

Modeling: Often we look for a linear relationship, one described by fitting a straight line to the data as well as possible. More generally, any equation could be used as the basis for regression modeling, or for describing the relationship between two variables. You could have Y = a*X**2 + b*ln(X) + c*sin(d*X-e).

Linear Model: Y = m*X + b, or equivalently Y = b0 + b1*X, where m (= b1) is the slope, the change in Y for 1 unit of X, and b (= b0) is the Y axis intercept. [Figure: a straight line plotted on axes X (Indep.) and Y (Dep.), with the slope and Y axis intercept labelled.]

Linear Model: Pearson’s ‘r’ for linear regression is calculated per (Action Research p. 29/30). Define:
N = number of data pairs
SX = sum of all X values
SX2 = sum of all (X values squared)
SY = sum of all Y values
SY2 = sum of all (Y values squared)
SXY = sum of all (X values times Y values)
Pearson’s r = [N*(SXY) - (SX)*(SY)] / sqrt[(N*(SX2) - (SX)^2)*(N*(SY2) - (SY)^2)]

Linear Model: For the linear model, you could find the slope ‘m’ and Y-intercept ‘b’ from
m = (r) * (standard deviation of Y) / (standard deviation of X)
b = (mean of Y) - (m)*(mean of X)
But it’s a lot easier to use SPSS’ slope = b1 and Y intercept = b0.
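
To make the hand calculation concrete, here is a minimal sketch (not from the lecture) that applies the sums formula for r and then the slope and intercept formulas to the GRE/GPA1 table from earlier in the lecture. The function name is hypothetical, and sample standard deviations (dividing by N - 1) are assumed; the choice does not affect the slope because the same factor appears in both standard deviations.

```python
from math import sqrt

def linear_fit(xs, ys):
    """Return (r, m, b) using the sums formula for r, then m and b from r."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sx2 = sum(x * x for x in xs)
    sy2 = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))

    r = (n * sxy - sx * sy) / sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

    # Sample standard deviations (assumed); the (n - 1) cancels in sd_y / sd_x.
    sd_x = sqrt((sx2 - sx ** 2 / n) / (n - 1))
    sd_y = sqrt((sy2 - sy ** 2 / n) / (n - 1))

    m = r * sd_y / sd_x              # slope
    b = sy / n - m * (sx / n)        # Y-intercept
    return r, m, b

gre = [1500, 1400, 1250, 1050, 950]
gpa1 = [4.0, 3.8, 3.5, 3.1, 2.9]

r, m, b = linear_fit(gre, gpa1)
print(f"r = {r:.3f}, m = {m:.4f}, b = {b:.2f}")
```

For this table the points lie exactly on a line, so the fit should come out near r = 1, m = 0.002, and b = 1.0 (that is, GPA1 = 1.0 + 0.002*GRE).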

Regression Analysis: Allows us to predict the likely value of one variable from knowledge of another variable. The two variables should be fairly highly correlated (close to a straight line). The regression equation is a mathematical expression of the relationship between two variables, for example as a straight line.

Regression Equation: Y = mX + b. In this linear equation, you predict Y values (the dependent variable) from known values of X (the independent variable); this is called the regression of Y on X. The regression equation is fundamentally an equation for plotting a straight line, so the stronger our correlation, the closer our variables will fall to a straight line, and the better our prediction will be.

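Linear Regression: The fitted line gives the predicted value ŷ = a + b*x, and each actual value is y = ŷ + e, where e is the error (the vertical distance from the point to the line). Choose the “best” line by minimizing the sum of the squares of the vertical distances between the data points and the regression line. [Figure: scatter of points on x and y axes with the fitted line.]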

Standard Error of the Estimate: Is the standard deviation of the data around the regression line. Tells how much the actual values of Y deviate from the predicted values of Y.

Standard Error of the Estimate: After you calculate the standard error of the estimate, you add it to and subtract it from your predicted values of Y to get a band around the regression line within which you would expect a given percentage of repeated actual values to occur or cluster if you took many samples (sort of like a sampling distribution for the mean).

Standard Error of Estimate: The standard error of estimate for Y predicted by X is
s_y/x = sqrt[sum of (Y - predicted Y)^2 / (N - 2)]
where ‘Y’ is each actual Y value, ‘predicted Y’ is the Y value predicted by the linear regression, and ‘N’ is the number of data pairs. For example, on (Action Research p. 33/34), s_y/x = sqrt(2.641/(10-2)) = 0.574.

Standard Error of the Estimate: So, if the standard error of the estimate is equal to 0.574, and you have a predicted Y value of 4.560, then about 68% of your actual values, with repeated sampling, would fall between 3.986 and 5.134 (predicted Y +/- 1 std error). The smaller the standard error, the closer your actual values are to the regression line, and the more confident you can be in your prediction.
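
The arithmetic in this example can be reproduced in a few lines. This sketch is an addition, not part of the lecture: it defines the standard error of the estimate as a function (the name is hypothetical), recomputes the textbook figure from the quoted sum of squared residuals (2.641) and N = 10, and then builds the +/- 1 standard error band using the rounded 0.574 from the slide.

```python
from math import sqrt

def standard_error_of_estimate(actual, predicted):
    """sqrt( sum (Y - predicted Y)^2 / (N - 2) ) for paired values."""
    n = len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return sqrt(ss_res / (n - 2))

# Figures quoted in the lecture: sum of squared residuals 2.641, N = 10.
s_yx = sqrt(2.641 / (10 - 2))
print(f"s_y/x = {s_yx:.4f}")  # close to the 0.574 quoted on the slide

# Band of predicted Y +/- 1 standard error, using the slide's rounded value.
predicted_y = 4.560
std_error = 0.574
low, high = predicted_y - std_error, predicted_y + std_error
print(f"about 68% of actual values expected between {low:.3f} and {high:.3f}")
```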

SPSS Regression Equations: Instead of constants called ‘m’ and ‘b’, ‘b0’ and ‘b1’ are used for most equations. The meaning of ‘b0’ and ‘b1’ varies, depending on the type of equation being modeled. You can suppress the use of ‘b0’ by unchecking “Include constant in equation”.

SPSS Regression Models:
Linear model: Y = b0 + b1*X
Logarithmic model: Y = b0 + b1*ln(X), where ‘ln’ = natural log
Inverse model: Y = b0 + b1/X (similar to the form X*Y = constant, which is a hyperbola)

SPSS Regression Models (where “**” indicates “to the power of”):
Power model: Y = b0*(X**b1)
Compound model: Y = b0*(b1**X)
Logistic model (a variant of the Compound model, which requires a constant input ‘u’ that is larger than Y for any actual data point): Y = 1/[ 1/u + b0*(b1**X) ]

SPSS Regression Models (“exp” means “e to the power of”; e = 2.7182818…):
Exponential model: Y = b0*exp(b1*X)
Other exponential functions:
S model: Y = exp(b0 + b1/X)
Growth model (almost identical to the exponential model): Y = exp(b0 + b1*X)

SPSS Regression Models: Polynomials beyond the Linear model (linear is a first order polynomial):
Quadratic (second order): Y = b0 + b1*X + b2*X**2
Cubic (third order): Y = b0 + b1*X + b2*X**2 + b3*X**3
These are the only equations which use constants b2 and b3. Higher order polynomials require the Regression module of SPSS, which can do regression using any equation you enter.
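
For readers who want the model list in one place, the equations above can be written as plain Python functions. This is only a transcription sketch of the formulas as given in the lecture; the dictionary and function names are invented here and have nothing to do with SPSS itself.

```python
from math import exp, log

# Two-constant models, written exactly as in the lecture's equations.
MODELS = {
    "linear":      lambda x, b0, b1: b0 + b1 * x,
    "logarithmic": lambda x, b0, b1: b0 + b1 * log(x),   # natural log, needs x > 0
    "inverse":     lambda x, b0, b1: b0 + b1 / x,        # needs x != 0
    "power":       lambda x, b0, b1: b0 * x ** b1,
    "compound":    lambda x, b0, b1: b0 * b1 ** x,
    "exponential": lambda x, b0, b1: b0 * exp(b1 * x),
    "s":           lambda x, b0, b1: exp(b0 + b1 / x),   # needs x != 0
    "growth":      lambda x, b0, b1: exp(b0 + b1 * x),
}

def logistic(x, b0, b1, u):
    """Logistic model; u is the upper-bound constant supplied by the user."""
    return 1 / (1 / u + b0 * b1 ** x)

def quadratic(x, b0, b1, b2):
    return b0 + b1 * x + b2 * x ** 2

def cubic(x, b0, b1, b2, b3):
    return b0 + b1 * x + b2 * x ** 2 + b3 * x ** 3

print(MODELS["power"](2, 3, 2))   # 3 * 2**2
```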

Y = whattheflock? To help picture these equations: Make an X variable over some typical range (0 to 10 in a small increment, maybe 0.01). Define a Y variable. Calculate the Y variable using Transform > Compute… and whatever equation you want to see. Pick values for b0 and b1 that aren’t 0, 1, or 2. Have SPSS plot the results of a regression of Y vs X for that type of equation.
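
The same exercise can be tried outside SPSS. The sketch below is an assumption-laden stand-in for the Transform > Compute step: it builds an X grid in steps of 0.01 and evaluates the S model on it. The constants b0 = 3.0 and b1 = -0.5 are arbitrary picks, and the grid starts at 0.01 rather than 0 because the S model divides by X.

```python
from math import exp

b0, b1 = 3.0, -0.5   # arbitrary constants (not 0, 1, or 2)
step = 0.01

# X from 0.01 to 10.00; X = 0 is skipped because the S model uses b1 / X.
xs = [round(i * step, 2) for i in range(1, 1001)]
ys = [exp(b0 + b1 / x) for x in xs]   # S model: Y = exp(b0 + b1/X)

# Print a few rows to get a feel for the curve's shape.
for x, y in list(zip(xs, ys))[::200]:
    print(f"X = {x:5.2f}  Y = {y:8.4f}")
```

Swapping the expression on the `ys` line for any of the other model equations gives a quick table to plot or inspect.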

How to Apply This? Given a set of data containing two variables of interest, generate a scatter plot to get some idea of what the data looks like. Choose which types of models are most likely to be useful. For only linear models, use Analyze / Regression / Linear...

How to Apply This? Select the Independent (X) and Dependent (Y) variables. Rules may be applied to limit the scope of the analysis, e.g. gender=1. Dozens of other characteristics may also be obtained, which are beyond our scope here.

How to Apply This? Then check for the R Square value in the Model Summary. Check the Coefficients to make sure they are all significant (e.g. Sig. < 0.050). If so, use the ‘b0’ and ‘b1’ coefficients from under the ‘B’ column (see Statistics for Software Process Improvement handout), plus or minus the standard errors “SE B”.

Regression Example: For example, go back to the “GSS91 political.sav” data set. Generate a linear regression (Analyze > Regression > Linear) for ‘age’ as the Independent variable and ‘partyid’ as the Dependent variable. Notice that R^2 and the ANOVA summary are given, with F and its significance.

Regression Example: The R Square of 0.006 means there is a very slight correlation (little strength). But the ANOVA Significance well under 0.050 confirms there is a statistically significant relationship here; it’s just a really weak one.

Regression Example: [Two SPSS output screenshots: output from Analyze > Regression > Linear, and output from Analyze > Regression > Curve Estimation.]

Regression Example: The heart of the regression analysis is in the Coefficients section. We could look up ‘t’ on a critical values table, but it’s easier to see if all values of Sig are < 0.050. If they are, reject the null hypothesis, meaning there is a significant relationship, and use the values under B for b0 and b1. If any coefficient has Sig > 0.050, don’t use that regression (the coefficient might be zero).

Regression Example: The answer for “what is the effect of age on political view?” is that there is a very weak but statistically significant linear relationship, with a reduction of 0.009 (b1) political view categories per year. From the Variable View of the data, since low values are liberal and large values conservative, this means that people tend to get slightly more liberal as they get older.

Curve Estimation Example: For the other regression options, choose Analyze / Regression / Curve Estimation… Define the Dependent variable(s) and the Independent variable; note that multiple Dependents may be selected. Check which math models you want used. Display the ANOVA table for reference.

Curve Estimation Example: SPSS Tip: up to three regression models can be plotted at once, so don’t select more than that if you want a scatter plot to go with the data and the regressions. For the same example just used, get a summary for the linear and quadratic models (Analyze > Regression > Curve Estimation). Find “R Square” for each model. Generally pick the model with the largest R Square. We already saw the Linear output; now see the Quadratic.

Curve Estimation Example: For the quadratic regression, R Square is slightly higher, and the ANOVA is still significant.

Curve Estimation Example: The Quadratic coefficients are all significant at the 0.050 level. Interpret as partyid = (4.191 +/- 0.412) + (-0.048 +/- 0.018)*age + (0.0003918 +/- 0.0001754)*age**2. SPSS tip: to get the full values of b2 and its std error, edit the output table, then double click on the cells.
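
As an illustration of using the fitted equation, the sketch below plugs the central B values from the slide into the quadratic and ignores the +/- standard errors, which is a simplification. The function name is hypothetical. Because the age**2 coefficient is positive, the curve has a minimum, which falls at age = -b1 / (2*b2), a little past age 61.

```python
# Central coefficient estimates quoted on the slide (standard errors ignored).
B0, B1, B2 = 4.191, -0.048, 0.0003918

def predicted_partyid(age):
    """Quadratic model: partyid = b0 + b1*age + b2*age**2."""
    return B0 + B1 * age + B2 * age ** 2

# Vertex of the parabola: the age at which predicted partyid is lowest.
turning_age = -B1 / (2 * B2)

for age in (20, 40, 60, 80):
    print(age, round(predicted_partyid(age), 3))
print(f"predicted partyid is lowest near age {turning_age:.1f}")
```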

Curve Estimation Example: The data set will be plotted as the Observed points, with the regression models shown for comparison. Look to see which model most closely matches the data. Look for regions of data which do or don’t match the model well (if any).

Curve Estimation Example: [Figure: Observed points with the fitted quadratic and linear curves, each labelled.]

Curve Estimation Procedure: See which models are significant (throw out the rest!). Compare the R Square values to see which provides the best fit. Use the graph to verify visually that the correct model was chosen. Use the model equation’s ‘B’ values and their standard errors to describe and predict the data’s behavior.
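
The first two steps of this procedure are mechanical enough to sketch in code. Everything below is hypothetical: the R Square and Sig values are made-up placeholders, not the GSS91 output, and the dictionary layout and function name are invented for illustration. The visual check against the plot still has to be done by eye.

```python
ALPHA = 0.050

def choose_model(summaries, alpha=ALPHA):
    """Keep models whose coefficients are all significant, then take max R Square."""
    significant = [
        m for m in summaries
        if all(sig < alpha for sig in m["coef_sig"])
    ]
    if not significant:
        return None
    return max(significant, key=lambda m: m["r_square"])

# Made-up summaries for illustration only (NOT real SPSS output).
candidates = [
    {"name": "linear",    "r_square": 0.12, "coef_sig": [0.000, 0.004]},
    {"name": "quadratic", "r_square": 0.15, "coef_sig": [0.000, 0.008, 0.026]},
    {"name": "cubic",     "r_square": 0.16, "coef_sig": [0.000, 0.210, 0.340, 0.410]},
]

best = choose_model(candidates)
print(best["name"] if best else "no significant model")
```

In this invented example the cubic has the largest R Square but is thrown out because some of its coefficients are not significant, so the quadratic is chosen.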