Correlation and Simple Linear Regression PSY440 June 10, 2008

A few points of clarification For the chi-squared test, the results are unreliable if the expected frequency in too many of your cells is too low. A rule of thumb is that the minimum expected frequency should be 5 (i.e., no cells with expected counts less than 5). A more conservative rule recommended by some is a minimum expected frequency of 10. If your minimum is too low, you need a larger sample! The more categories you have, the larger your sample must be. SPSS will warn you if you have any cells with an expected frequency less than 5.
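As a supplement (not part of the original slides), here is a minimal Python sketch of where expected frequencies come from; the cell counts are made up purely for illustration:

```python
# Sketch (hypothetical counts): expected frequencies for a chi-squared test.
observed = [
    [12, 3],   # row 1: observed counts
    [8, 2],    # row 2: observed counts
]

grand_total = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

for i, row in enumerate(observed):
    for j in range(len(row)):
        # Expected count = (row total x column total) / grand total
        expected = row_totals[i] * col_totals[j] / grand_total
        warn = "  <-- below 5: chi-squared may be unreliable" if expected < 5 else ""
        print(f"cell ({i},{j}): expected = {expected:.1f}{warn}")
```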

Regarding threats to internal validity One of the strengths of well-designed single- subject research is the use of repeated observations during each phase. Repeated observations during baseline and intervention (during an AB study, e.g.) helps rule out testing, instrumentation (somewhat) and regression. These effects would be unlikely to result in a marked change between experimental phases that is not apparent during repeated observations before and after the phase change.

Regarding histograms The difference between a histogram and a bar graph: in a histogram, the variable on the x axis (which represents the score on the variable being graphed, as opposed to the frequency of observations) is conceptualized as continuous, whereas a bar graph represents discrete categories along the x axis.

About the exam…. The exam on Thursday will cover material from the first three weeks of class (lectures 1-6, or everything through chi-squared tests). The emphasis of the exam will be on generating results with computers (calculations by hand will not be emphasized) and on interpreting those results. Exam questions will be based mainly on lecture material and modeled on previous active learning experiences (homework and in-class demonstrations and exercises). Knowledge of the material on qualitative methods and experimental & single-subject design is expected.

Before we move on….. Any questions?

Today’s lecture and next homework Today’s lecture will cover correlation and simple (bivariate) regression. Homework based on today’s lecture will be distributed on Thursday and due on Tuesday (June 17).

Correlation A correlation is the association between scores on two variables –age and coordination skills in children: as kids get older, their motor coordination tends to improve –price and quality: generally, the more expensive something is, the higher its quality

Correlation and Causality Correlational research –Correlation as a statistical procedure is generally used to measure the association between two (or more) continuous variables –Correlation as a kind of research design refers to observational studies in which there is no experimental manipulation.

Correlation and Causality Correlational research –Not all “correlational” (i.e., observational) research designs use correlation as the statistical procedure for analyzing the data (example: a comparison of verbal abilities between boys and girls is an observational study - gender is not manipulated - but the mean difference would probably be analyzed with a t-test). –But: virtually all of the inferential statistical methods covered in 440 (including t-tests, ANOVA, ANCOVA) can be represented in terms of correlation/regression models (the general linear model - we’ll talk more about this later). –Bottom line: don’t confuse design with analytic strategy.

Correlation and Causality Correlations (like other linear statistical models) describe relationships between variables, but DO NOT explain why the variables are related. Suppose that Dr. Steward finds that rates of spilled coffee and severity of plane turbulence are strongly positively correlated. One might argue that turbulence causes coffee spills. One might argue that spilling coffee causes turbulence.

Correlation and Causation Suppose that Dr. Cranium finds a positive correlation between head size and digit span (roughly the number of digits you can remember). One might argue that the bigger your head, the larger your digit span. One might argue that head size and digit span both increase with age (but head size and digit span aren’t directly related).

Correlation and Causation Observational research and correlational statistical methods (including regression and path analysis) can be used to compare competing models of causation, to see which model fits the data best. One might argue that the bigger your head, the larger your digit span. One might argue that head size and digit span both increase with age (but head size and digit span aren’t directly related).

Relationships between variables Properties of a statistical correlation –Form (linear or non-linear) –Direction (positive or negative) –Strength (none, weak, strong, perfect) To examine a relationship you should: –Make a scatterplot - a picture of the relationship –Compute the correlation coefficient - a numerical description of the relationship

Graphing Correlations Steps for making a scatterplot (scatter diagram): 1. Draw axes and assign variables to them 2. Determine the range of values for each variable and mark it on the axes 3. Mark a dot for each person’s pair of scores

Scatterplot A scatterplot plots one variable against the other; each point corresponds to a different individual. The example data (one X, Y pair per person):
Person  X  Y
A       6  6
B       1  2
C       5  6
D       3  4
E       3  2
Imagine a line through the data points. The plot is useful for “seeing” the relationship –Form, Direction, and Strength

Scatterplots with Excel and SPSS In SPSS: Graphs menu => Legacy Dialogs => Scatter/Dot => Simple Scatter. Click Define, and select which variable you want on the x axis and which on the y axis. In Excel: Insert menu => Chart => XY Scatter. Specify whether the variables are arranged in rows or columns and select the cells with the relevant data.
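As an aside (not one of the SPSS/Excel routes above), the same plot takes a few lines of Python, assuming matplotlib is installed; the five (X, Y) pairs are the ones from the example data:

```python
# Sketch: the lecture's five (X, Y) pairs plotted with matplotlib.
import matplotlib.pyplot as plt

x = [6, 1, 5, 3, 3]   # persons A-E, variable X
y = [6, 2, 6, 4, 2]   # persons A-E, variable Y

plt.scatter(x, y)
plt.xlabel("X")
plt.ylabel("Y")
plt.title("One point per person")
plt.show()
```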

Form Example scatterplots: one non-linear (points following a curve) and one linear (points following a straight line)

Direction Positive: X & Y vary in the same direction - as X goes up, Y goes up - positive Pearson’s r. Negative: X & Y vary in opposite directions - as X goes up, Y goes down - negative Pearson’s r.

Strength The strength of the relationship –Spread around the line (note the axis scales) –The correlation coefficient will range from -1 to +1. Zero means “no relationship”; the farther r is from zero, the stronger the relationship –In general when we talk about correlation coefficients: correlation coefficient = Pearson’s product moment coefficient = Pearson’s r = r

Strength r = 1.0 “perfect positive corr.” (r² = 100%) r = -1.0 “perfect negative corr.” (r² = 100%) r = 0.0 “no relationship” (r² = 0%) The farther r is from zero, the stronger the relationship

The Correlation Coefficient Formulas for the correlation coefficient (a conceptual formula and a common alternative): r = SP / sqrt(SS_X × SS_Y), or equivalently, r = Σ(z_X × z_Y) / N

Computing Pearson’s r (using SP) Step 1: SP (Sum of the Products) SP = Σ(X − M_X)(Y − M_Y), with M_X = 3.6 and M_Y = 4.0
Deviations from the means (quick check: each column of deviations sums to zero):
X − M_X: 2.4, −2.6, 1.4, −0.6, −0.6
Y − M_Y: 2.0, −2.0, 2.0, 0.0, −2.0
Products: (2.4)(2.0) = 4.8; (−2.6)(−2.0) = 5.2; (1.4)(2.0) = 2.8; (−0.6)(0.0) = 0.0; (−0.6)(−2.0) = 1.2
SP = 4.8 + 5.2 + 2.8 + 0.0 + 1.2 = 14.0

Computing Pearson’s r (using SP) Step 2: SS_X & SS_Y
SS_X = Σ(X − M_X)² = 2.4² + (−2.6)² + 1.4² + (−0.6)² + (−0.6)² = 5.76 + 6.76 + 1.96 + 0.36 + 0.36 = 15.2
SS_Y = Σ(Y − M_Y)² = 2.0² + (−2.0)² + 2.0² + 0.0² + (−2.0)² = 4.0 + 4.0 + 4.0 + 0.0 + 4.0 = 16.0

Computing Pearson’s r (using SP) Step 3: compute r
r = SP / sqrt(SS_X × SS_Y) = 14.0 / sqrt(15.2 × 16.0) = 14.0 / 15.6 ≈ .90
Looking back at the scatterplot, this fits: the relationship appears linear, positive, and fairly strong (.90 is far from 0, near +1).
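For readers following along outside SPSS or Excel, a minimal Python sketch of the same SP computation (not part of the original slides):

```python
# Sketch of the SP method: r = SP / sqrt(SS_X * SS_Y)
from math import sqrt

x = [6, 1, 5, 3, 3]
y = [6, 2, 6, 4, 2]
n = len(x)

mx = sum(x) / n   # 3.6
my = sum(y) / n   # 4.0

sp = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))   # 14.0
ss_x = sum((xi - mx) ** 2 for xi in x)                    # 15.2
ss_y = sum((yi - my) ** 2 for yi in y)                    # 16.0

r = sp / sqrt(ss_x * ss_y)
print(f"r = {r:.3f}")   # r = 0.898
```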

The Correlation Coefficient The same correlation coefficient can also be computed from the z-score formula, r = Σ(z_X × z_Y) / N, as follows.

Computing Pearson’s r (using z-scores) Step 1: compute the standard deviation for X and Y (note: keep track of whether the data are a sample or a population). For this example we will assume the data are from a population, so SD = sqrt(SS / N):
Std dev of X = sqrt(15.2 / 5) = 1.74
Std dev of Y = sqrt(16.0 / 5) = 1.79

Computing Pearson’s r (using z-scores) Step 2: compute z-scores, z = (score − mean) / std dev
z_X: 2.4/1.74 = 1.38; −2.6/1.74 = −1.49; 1.4/1.74 = 0.80; −0.6/1.74 = −0.34; −0.6/1.74 = −0.34
z_Y: 2.0/1.79 = 1.12; −2.0/1.79 = −1.12; 2.0/1.79 = 1.12; 0.0/1.79 = 0.00; −2.0/1.79 = −1.12
Quick check: each set of z-scores sums to (approximately) zero.

Computing Pearson’s r (using z-scores) Step 3: compute r = Σ(z_X × z_Y) / N
Products: (1.38)(1.12) = 1.55; (−1.49)(−1.12) = 1.67; (0.80)(1.12) = 0.90; (−0.34)(0.00) = 0.00; (−0.34)(−1.12) = 0.38
r = (1.55 + 1.67 + 0.90 + 0.00 + 0.38) / 5 = 4.50 / 5 = .90
As before: the relationship appears linear, positive, and fairly strong (.90 is far from 0, near +1), matching the SP method.
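Again as a supplement to the slides, the z-score version in Python; treating the five scores as a population (dividing by N) reproduces the same r as the SP method:

```python
# Sketch of the z-score method: r = sum(z_x * z_y) / N.
from math import sqrt

x = [6, 1, 5, 3, 3]
y = [6, 2, 6, 4, 2]
n = len(x)

def z_scores(values):
    m = sum(values) / len(values)
    sd = sqrt(sum((v - m) ** 2 for v in values) / len(values))  # population SD
    return [(v - m) / sd for v in values]

zx, zy = z_scores(x), z_scores(y)
r = sum(a * b for a, b in zip(zx, zy)) / n
print(f"r = {r:.3f}")   # r = 0.898, the same value as the SP method
```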

Correlation in Research Articles Correlation matrix –A display of the correlations among more than two variables (the slides show an example matrix of acculturation-related variables). Questions to consider: Why does the table have a “-” entry? Why is only half the table filled with numbers?

Correlations with SPSS & Excel SPSS: Analyze => Correlate => Bivariate. Then select the variables you want correlation(s) for (you can select just one pair, or many variables to get a correlation matrix). Try this with height and shoe size in our data. Now try it with height, shoe size, mother’s height, and number of shoes owned. Excel: arrange the data for two variables in two columns or rows and use the formula bar to request a correlation: =CORREL(array1,array2)

SPSS correlation output

Invalid inferences from correlations Why you should always look at the scatterplot before computing (and certainly before interpreting) Pearson’s r: Correlations are greatly affected by the range of scores in the data –Consider the height and age relationship –Restricted range example from the text (SAT and GPA) Extreme scores can have dramatic effects on correlations –A single extreme score can radically change r, especially when your sample is small. Relations between variables may differ for subgroups, resulting in misleading r values for aggregate data Curvilinear relations are not captured by Pearson’s r

What to do about a curvilinear pattern If the pattern is monotonically increasing or decreasing, convert the scores to ranks and compute r (using the same formula) on the rank scores. The result is called Spearman’s Rank Correlation Coefficient, or Spearman’s rho, and can be requested in your SPSS output by checking the appropriate box when you select the variables for which you want correlations. If the pattern is more complicated (U-shaped or S-shaped, for example), consult more advanced statistics resources.
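A supplementary sketch (not from the slides): if SciPy is available, Spearman’s rho can be requested directly; it is equivalent to Pearson’s r computed on the ranks:

```python
# Sketch (assumes SciPy is installed): Spearman's rho for a monotonic
# but curvilinear pattern (y = x squared).
from scipy.stats import spearmanr

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

rho, p = spearmanr(x, y)
print(rho)   # 1.0 -- the ranks line up perfectly even though the relation is curved
```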

Coefficient of determination When considering “how good” a relationship is, we really should consider r² (the coefficient of determination), not just r. This coefficient tells you the percent of the variance in one variable that is explained or accounted for by the other variable. For example, the r of about .90 computed earlier gives r² ≈ .81, so about 81% of the variance in Y is accounted for by X.

From Correlation to Regression With correlation, we can examine whether variables X & Y are related With regression, we try to predict the value of one variable given what we know about the other variable and the relationship between the two.

Regression Last time: “it doesn’t matter which variable goes on the X-axis or the Y-axis.” For regression this is NOT the case. The variable that you are predicting (the criterion or “dependent” variable, e.g., quiz performance) goes on the Y-axis. The variable that you are making the prediction based on (the predictor or “independent” variable, e.g., hours of study) goes on the X-axis.

Regression Correlation: “imagine a line through the points.” But there are lots of possible lines, and one line is the “best fitting line.” Regression: compute the equation corresponding to this “best fitting line” (e.g., predicting quiz performance from hours of study).

The equation for a line A brief review of geometry: Y = (X)(slope) + (intercept)
The intercept is the value of Y when X = 0 (here, 2.0).
The slope is the change in Y divided by the change in X (here, 0.5).
For this example: Y = (X)(0.5) + 2.0

Regression A brief review of geometry. Consider a perfect correlation: Y = (X)(0.5) + (2.0). We can make specific predictions about Y based on X. X = 5, Y = ? Y = (5)(0.5) + (2.0) = 2.5 + 2.0 = 4.5

Regression Consider a less than perfect correlation. The line still represents the predicted values of Y given X: Y = (X)(0.5) + (2.0). X = 5: Y = (5)(0.5) + (2.0) = 2.5 + 2.0 = 4.5

Regression The “best fitting line” is the one that minimizes the error (the differences) between the predicted scores (the line) and the actual scores (the points). Rather than comparing the errors from different lines and picking the best, we will directly compute the equation for the best fitting line.

Regression The linear model: Y = intercept + slope(X) + error. The betas (β) are sometimes called parameters. They come in two types: standardized and unstandardized. Now let’s go through an example computing these things.

Scatterplot Using the dataset from our correlation example: X = 6, 1, 5, 3, 3; Y = 6, 2, 6, 4, 2

From when we computed Pearson’s r: M_X = 3.6, M_Y = 4.0, SS_X = 15.2, SS_Y = 16.0, SP = 14.0

Computing regression line (with raw scores)
slope: b = SP / SS_X = 14.0 / 15.2 = 0.92
intercept: a = M_Y − b(M_X) = 4.0 − (0.92)(3.6) = 0.69
Prediction model: predicted Y = 0.69 + (0.92)(X)
Note that the point (M_X, M_Y) - the two means - will be on the line.
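A supplementary Python sketch of the raw-score computation, reusing the quantities from the correlation example:

```python
# Sketch: slope and intercept from quantities already computed above.
sp, ss_x = 14.0, 15.2
mx, my = 3.6, 4.0

slope = sp / ss_x              # 0.92 (rounded)
intercept = my - slope * mx    # 0.69 (rounded)
print(f"predicted Y = {intercept:.2f} + {slope:.2f} * X")

# Check: the point (mean of X, mean of Y) is on the line
print(intercept + slope * mx)  # 4.0, the mean of Y
```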

Computing regression line (standardized, using z-scores) Sometimes the regression equation is standardized –Computed based on z-scores rather than raw scores Prediction model: –Predicted z-score (on the criterion variable) = the standardized regression coefficient multiplied by the z-score on the predictor variable –Formula: predicted Z_Y = (β)(Z_X) –The standardized regression coefficient (β): in bivariate prediction, β = r

Computing regression line (with z-scores) [Scatterplot of Z_Y against Z_X: in standardized form the line passes through the origin - the mean of both sets of z-scores - with slope β = r]

Regression The linear equation isn’t the whole thing: Y = intercept + slope(X) + error. We also need a measure of error. For example, Y = X(.5) + (2.0) + error. The same line can describe different relationships (a strength difference): the points may cluster tightly around the line or spread widely around it.

Regression Error –The actual score minus the predicted score Measures of error –r² (r-squared) –Proportionate reduction in error Note: total squared error when predicting from the mean = SS_Total = SS_Y –Squared error using the prediction model = sum of the squared residuals = SS_residual = SS_error

R-squared r² represents the percent variance in Y accounted for by X. r = 0.8 gives r² = 0.64: 64% of the variance explained. r = 0.5 gives r² = 0.25: 25% of the variance explained.

Computing Error around the line Compute the difference between the predicted values and the observed values (the “residuals”). Square the differences. Add up the squared differences: the sum of the squared residuals = SS_residual = SS_error.

Computing Error around the line Predicted values of Y (points on the line), using predicted Y = 0.69 + (0.92)(X):
X = 6: 0.69 + (0.92)(6) = 6.21
X = 1: 0.69 + (0.92)(1) = 1.61
X = 5: 0.69 + (0.92)(5) = 5.29
X = 3: 0.69 + (0.92)(3) = 3.45
X = 3: 0.69 + (0.92)(3) = 3.45

Computing Error around the line Residuals (observed Y minus predicted Y):
6 − 6.21 = −0.21
2 − 1.61 = 0.39
6 − 5.29 = 0.71
4 − 3.45 = 0.55
2 − 3.45 = −1.45
Quick check: the residuals sum to (approximately) zero. Squaring and summing them gives SS_error ≈ 3.1 (compare SS_Y = 16.0).

Computing Error around the line Sum of the squared residuals = SS_residual = SS_error. The standard error of estimate (from the textbook) is analogous to a standard deviation: it is the square root of the average squared error, s_Y.X = sqrt(SS_error / df). The standard error of estimate is also related to r² and to the standard deviation of Y: s_Y.X = s_Y × sqrt(1 − r²)

Computing Error around the line Proportionate reduction in error = (SS_Y − SS_error) / SS_Y = (16.0 − 3.1) / 16.0 ≈ .81. Like r², this represents the percent variance in Y accounted for by X. In fact, it is mathematically identical to r².
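A supplementary sketch (not in the original slides) computing the residuals, SS_error, the standard error of estimate, and the proportionate reduction in error in Python, using the lecture’s data:

```python
# Sketch: error around the regression line for the lecture's data.
from math import sqrt

x = [6, 1, 5, 3, 3]
y = [6, 2, 6, 4, 2]
slope = 14.0 / 15.2
intercept = 4.0 - slope * 3.6

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
ss_error = sum(e ** 2 for e in residuals)      # ~3.1
ss_y = sum((yi - 4.0) ** 2 for yi in y)        # 16.0

print(sum(residuals))                          # ~0: residuals sum to zero (up to floating point)
print(round(ss_error, 2))                      # 3.11
print(round((ss_y - ss_error) / ss_y, 2))      # 0.81, identical to r squared
# Standard error of estimate; df = n - 2 here (the usual choice for bivariate regression)
print(round(sqrt(ss_error / (len(x) - 2)), 2)) # 1.02
```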

Seeing patterns in the error Residual plots The sum of the residuals should always equal 0 (as should their mean). –The least squares regression line splits the data in half: half of the error is above the line and half is below the line. In addition to summing to zero, we also want the residuals to be randomly distributed. –That is, there should be no pattern to the residuals. –If there is a pattern, it may suggest that there is more than a simple linear relationship between the two variables. Residual plots are very useful tools to examine the relationship even further. –These are basically scatterplots of the residuals (Y_observed − Y_predicted) against the explanatory (X) variable (note: the examples that follow actually plot residuals that have been transformed into z-scores).

Seeing patterns in the error Scatter plot: the scatterplot shows a nice linear relationship. Residual plot: the residuals fall randomly above and below the line; critically, there doesn’t seem to be a discernible pattern to the residuals.

Seeing patterns in the error Scatter plot: the scatterplot also shows a nice linear relationship. Residual plot: the residuals get larger as X increases. This suggests that the variability around the line is not constant across values of X, which is referred to as a violation of homogeneity of variance.

Seeing patterns in the error Scatter plot: the scatterplot shows what may be a linear relationship. Residual plot: the residual plot suggests that a non-linear relationship may be more appropriate (see how a curved pattern appears in the residual plot).
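A supplementary sketch of a residual plot in Python (assuming matplotlib), using the lecture’s five data points; with so few points no strong pattern can emerge, but the mechanics are the same:

```python
# Sketch: residual plot -- residuals (observed minus predicted Y) against X.
import matplotlib.pyplot as plt

x = [6, 1, 5, 3, 3]
y = [6, 2, 6, 4, 2]
slope = 14.0 / 15.2
intercept = 4.0 - slope * 3.6

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

plt.scatter(x, residuals)
plt.axhline(0)   # reference line: residuals should scatter randomly around 0
plt.xlabel("X")
plt.ylabel("residual (observed Y - predicted Y)")
plt.show()
```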

Regression in SPSS Running the analysis in SPSS is pretty easy –Analyze => Regression => Linear –The X or predictor variable(s) go into the ‘independent variable’ field –The Y or predicted variable goes into the ‘dependent variable’ field –You can save the residuals as a new variable and plot them against X as shown in the previous slide You get a lot of output

Regression in SPSS The output includes the variables in the model, r and r², the unstandardized coefficients (the slope, labeled with the name of the independent variable, and the intercept, labeled “constant”), and the standardized coefficients. We’ll get back to the other numbers in a few weeks.

In Excel With the Data Analysis ToolPak you can perform regression analysis. With the standard software package, you can get the bivariate correlation (which is the same as the standardized regression coefficient), you can create a scatterplot, and you can request a trend line (as we did when plotting data for single-subject research), which is a regression line (what is y and what is x in that case?).

Considerations: The slope depends on the variances of x and y. The standardized slope = r (weaker associations between x and y result in flatter slopes). This means that as the association becomes weaker, your prediction of y is influenced more by the mean of y than by changes in x. Regression to the mean is a special case of this…..

Regression to the mean Sometimes reliability is represented as r values (test-retest, split-half). If you have a test with low test-retest reliability, your score on the first administration is only weakly related to your score on the second administration: it is influenced by a considerable amount of error variance. Score(observed) = Score(true) + Error. Any time you take a measurement, the observed score reflects your true score plus error. The further your observed score is from the mean score for the test, the more likely it is that the distance from the mean is due at least in part to error. If error is randomly distributed, then your next observed score is more likely to be closer to the mean than farther from it.

Regression to the mean If x = obs1 and y = obs2, and the test-retest reliability of your measure is relatively low (say, r = .5), then your first score only helps predict your second score somewhat. The standardized regression equation is y = .5x + error. On a standardized test with mean = 0 and sd = 1, if you get a score above the mean, say 1.2, the first time you take the test (obs1 = x = 1.2), and the test-retest reliability is only .5, your predicted score the next time you take the test is .5 × 1.2 = .6. You are more likely to score closer to the mean. This doesn’t mean that you will definitely score closer to the mean; it just means that, on average, people who score 1.2 sd above the mean the first time tend to have scores closer to .6 the next time they are tested. This is because the test isn’t that reliable, and the original observation of 1.2 includes error. For the average person with that score (but not for everyone), the error is part of what accounts for the difference between the score and the mean. If your test has higher reliability, then the regression to the mean effect is reduced.
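A supplementary simulation (not from the slides) illustrating the point: if observed = true + error, with the error variance equal to the true-score variance, test-retest reliability is about .5, and people selected for high first scores average noticeably lower second scores:

```python
# Sketch: observed = true + error, measured twice with independent error.
import random

random.seed(1)
n = 100_000
true = [random.gauss(0, 1) for _ in range(n)]
obs1 = [t + random.gauss(0, 1) for t in true]   # error variance = true variance -> r ~ .5
obs2 = [t + random.gauss(0, 1) for t in true]

# People scoring well above the mean on the first administration:
high = [(o1, o2) for o1, o2 in zip(obs1, obs2) if o1 > 1.5]
mean1 = sum(o1 for o1, _ in high) / len(high)
mean2 = sum(o2 for _, o2 in high) / len(high)
print(mean1)   # well above 1.5
print(mean2)   # roughly half of mean1: closer to the mean on retest
```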

Multiple Regression Multiple regression prediction models use several predictors at once: predicted Y = intercept + b1(X1) + b2(X2) + … The predicted part of the model is the “fit”; what is left over (observed Y minus predicted Y) is the “residual.”
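A supplementary Python sketch (not from the slides) of fitting such a model by least squares; the data here are made up purely for illustration, and NumPy is assumed to be available:

```python
# Sketch (made-up data): least-squares fit of predicted Y = a + b1*X1 + b2*X2;
# the "fit" and "residual" are as defined above.
import numpy as np

x1 = np.array([1.0, 2, 3, 4, 5, 6])
x2 = np.array([2.0, 1, 4, 3, 6, 5])
y = np.array([3.0, 4, 8, 9, 14, 14])

X = np.column_stack([np.ones_like(x1), x1, x2])   # intercept column + predictors
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ coefs        # the "fit"
residuals = y - fitted    # the "residual"
print(coefs)              # intercept a, then slopes b1 and b2
```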

Prediction in Research Articles Bivariate prediction models are rarely reported. Multiple regression results are commonly reported.