Presentation transcript:

Slide 1 Using a combination of tables and plots from SPSS plus spreadsheets from Excel, we will show the linkage between correlation and linear regression. Correlation and regression provide us with different, but complementary, information on the relationship between two quantitative variables.

Slide 2 The goal of this analysis is to study the relationship between family size and number of credit cards. Finding the relationship will help us predict the number of credit cards a family typically has relative to the number of family members. If a family had fewer cards than expected, it would be a good candidate for another credit card offer. CreditCardData.sav has five variables for 8 cases. The data for the 8 cases are shown in the Data View to the left. The names and labels for each of the variables are shown below in the Variable View.

Slide 3 Creating a histogram of the dependent variable, ncards, shows a distribution that is about as normal as we could expect for only 8 cases. I have superimposed the red normal curve and blue mean line on the histogram. For any quantitative variable, our best estimate of the values for cases in the distribution is the mean, because it minimizes the errors or differences between the estimated value and the actual score represented by each of the bars in the histogram.

Slide 4 To demonstrate that the mean is the best value to estimate, I created a worksheet in Excel that compares the error associated with three different estimates of the value for each case: the mean of 7, an estimate below the mean (6), and an estimate above the mean (8). Error is calculated as the sum of the squared deviations from the value used as the estimate. Columns C, F, and I contain the deviations from each of the estimates (7, 6, and 8). Columns D, G, and J contain the squared deviations, with the summed total at the base of each column. Using the mean of 7 as the estimate, there are 22 units of error. Using either 6 or 8 results in 30 units of error. This measure of error is called the Total Sum of Squares.
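
A minimal Python sketch of this worksheet. The eight ncards values below are an assumption: they are reconstructed to be consistent with the statistics quoted on these slides (a mean of 7 and a Total Sum of Squares of 22), not read directly from CreditCardData.sav.

```python
# Sum of squared deviations for three candidate estimates of ncards.
# The data values are assumed (consistent with the slides), not verbatim.
ncards = [4, 6, 6, 7, 7, 8, 8, 10]

def sum_of_squares(values, estimate):
    """Total squared error from using a single value as the estimate."""
    return sum((v - estimate) ** 2 for v in values)

for estimate in (7, 6, 8):
    print(estimate, sum_of_squares(ncards, estimate))
# 7 -> 22 units of error (the mean); 6 -> 30; 8 -> 30
```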

Slide 5 The graph for the relationship between two quantitative variables is the scatterplot, with the independent variable Family Size on the horizontal x-axis, and the dependent variable Number of Credit Cards on the vertical y-axis. I have superimposed the blue dotted mean line for Number of Credit Cards on the scatterplot. We see that the scores for two cases actually fall on the mean line, while the other six are at varying distances from the mean line. Each dot represents the combination of scores for one case. For example, this dot represents a family of 5 that had 8 credit cards.

Slide 6 The purple lines are the deviations – the differences between individual scores and the mean of the dependent variable. If we square the deviations and sum the squares, we have the Total Sum of Squares. The differences are often phrased as distances, i.e. the vertical distance between the mean line and the score on the dependent variable for this case is 3.

Slide 7 I have added the green vertical dotted line at the mean family size. The regression line will pass through the intersection of the means of both variables, and will minimize the total sum of the squared differences between the individual scores and the regression line.

Slide 8 One way to think about linear regression is that we are rotating a line through the intersection of the means of the two variables. Each time we rotate the line, we compute the sum of squared differences between the individual scores and the line. We stop when we have found the line with the smallest sum of squared errors. There is a direct method for finding the regression line that does not require this trial and error strategy.
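
A brute-force sketch of this rotation idea, assuming the same reconstructed (famsize, ncards) pairs as before: try a grid of slopes for lines through the point of means and keep the one with the smallest squared error.

```python
# Rotate a line through the point of means; keep the slope with minimum SSE.
# The (famsize, ncards) pairs are assumed, consistent with the slides.
pairs = [(2, 4), (2, 6), (4, 6), (4, 7), (5, 7), (5, 8), (6, 8), (6, 10)]
mean_x = sum(x for x, _ in pairs) / len(pairs)  # 4.25
mean_y = sum(y for _, y in pairs) / len(pairs)  # 7.0

def sse(slope):
    # Sum of squared vertical distances from each point to the line.
    return sum((y - (mean_y + slope * (x - mean_x))) ** 2 for x, y in pairs)

best = min((s / 1000 for s in range(3001)), key=sse)  # slopes 0.000 to 3.000
print(best, round(sse(best), 3))  # about 0.971, the least-squares slope
```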

Slide 9 If there is no relationship, the blue regression line will be on top of (or very close to) the dotted blue mean line for the dependent variable. No relationship means that we cannot reduce the error, or total sum of squares, of the dependent variable by using the relationship to the independent variable.

Slide 10 The points along the regression line represent the estimated values for all possible values of the independent variable. For example, if we wanted to estimate the number of cards for a family of 4, we would draw a vertical line from the 4 on the horizontal axis up to the regression line, and from the regression line left to the vertical axis. The location on the vertical axis is the estimated number of cards that a family of 4 would have, i.e. about 6.8 cards.
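
A quick check of this read-off, using the regression equation from Slide 12 below (the intercept of 2.871 is reconstructed from the slides' numbers; the slope of .971 is quoted on Slide 22):

```python
# Estimate ncards for a family of 4 from the regression equation.
b0, b1 = 2.871, 0.971  # intercept reconstructed; slope quoted on Slide 22
print(round(b0 + b1 * 4, 3))  # 6.755, i.e. "about 6.8 cards"
```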

Slide 11 The differences between the estimated value and the actual value for the cases are deviations that are called residuals (the light blue lines). They represent errors in predicting the values of the dependent variable based on the value of the independent variable. We had two cases with a family size of 4. Our estimated value was overstated for one of the cases, and understated for the other case.

Slide 12 The formula for the regression line can be extracted from the SPSS output. For this example, the regression equation is: ncards = 2.871 + .971 x famsize.
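
The direct method mentioned on Slide 8 is the least-squares formula. A sketch, again assuming the reconstructed data, that reproduces these coefficients:

```python
# Least-squares slope and intercept computed directly from the data.
pairs = [(2, 4), (2, 6), (4, 6), (4, 7), (5, 7), (5, 8), (6, 8), (6, 10)]
n = len(pairs)
mean_x = sum(x for x, _ in pairs) / n
mean_y = sum(y for _, y in pairs) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)  # co-deviation
sxx = sum((x - mean_x) ** 2 for x, _ in pairs)            # deviation in x
b1 = sxy / sxx             # slope: 0.971
b0 = mean_y - b1 * mean_x  # intercept: 2.871
print(round(b0, 3), round(b1, 3))
```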

Slide 13 We can plug the regression equation into Excel and estimate the number of cards for each case. To compute the residuals, we subtract the estimated value from the actual value of ncards for the case. If we square the residuals, and sum the squares, we have the amount of error associated with using the regression line to estimate each case.
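
The same computation sketched in Python, with the assumed data and the reconstructed equation:

```python
# Residuals and the sum of squared residuals from the regression estimates.
pairs = [(2, 4), (2, 6), (4, 6), (4, 7), (5, 7), (5, 8), (6, 8), (6, 10)]
b0, b1 = 2.871, 0.971

residual_ss = 0.0
for famsize, ncards in pairs:
    estimate = b0 + b1 * famsize
    residual = ncards - estimate  # actual minus estimated
    residual_ss += residual ** 2
print(round(residual_ss, 3))  # about 5.49 units of error left
```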

Slide 14 If we plug the total sum of squares and the sum of squared residuals into an Excel spreadsheet, we can compute the reduction in the total sum of squared errors associated with using the information in the independent variable, as represented by the regression equation. If we compute the percentage of total error reduced by the regression equation, we end up with the value of R², the percentage of variance explained by the regression relationship. Our calculation for R² agrees with the value of R Square in the SPSS output.
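
The arithmetic, using the figures from the sketches above (the residual sum of squares comes from the assumed data):

```python
# R squared as the proportion of the total sum of squares removed.
total_ss = 22.0      # error using the mean alone (Slide 4)
residual_ss = 5.486  # error using the regression line (Slide 13 sketch)
r_squared = (total_ss - residual_ss) / total_ss
print(round(r_squared, 3))  # about 0.75
```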

Slide 15 R² is often interpreted as the percentage of variance explained. We can convert our Sum of Squares column to Variance by dividing by the number of cases in the sample minus one (8 – 1). If we compute the percentages using variances instead of sums of squares, we end up with exactly the same value for R². R² is also interpreted as the proportional reduction in error (a PRE statistic), which we can also phrase as an increase in accuracy. We should remember that no matter whether we interpret R² as explaining variance or reducing error, the statistic applies to the total error in the distribution, not to the error in individual cases.
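
This point is easy to verify: dividing both sums of squares by the same n - 1 leaves the ratio, and hence R², unchanged.

```python
# Dividing both sums of squares by (n - 1) does not change R squared.
n = 8
total_var = 22.0 / (n - 1)
residual_var = 5.486 / (n - 1)
print(round((total_var - residual_var) / total_var, 3))  # same 0.75
```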

Slide 16 We can also think of regression and correlation as based on the pattern of deviations for the two variables across the cases in the distribution. To present this, we will first compute the standard scores for each variable. As standard scores, the value for each case is its deviation from 0, the mean of a distribution of standard scores.

Slide 17 Plotting the z-scores for both variables produces the same pattern in the scatterplot that we found with the raw data. As we would expect for standard scores, the green dotted line for the mean z-score for family size is at zero, as is the dotted blue line for the standard scores for number of credit cards.

Slide 18 We add lines for the deviation from the means for both variables. The green deviation lines represent differences from the mean z-score for family size. The blue deviation lines represent differences from the mean z-score for number of credit cards.

Slide 19 For some points, the length of the green deviation line is similar to the length of the blue deviation line. For other points, the length of the green deviation line is shorter than the length of the blue deviation line. The strength of the relationship will depend on the agreement of the deviations for each case, i.e. the extent to which the green line deviation for a case agrees with the blue line deviation.

Slide 20 Overall, the pattern of the deviations is similar. Green deviations above the mean are paired with blue deviations above the mean. Green deviations below the mean are paired with blue deviations below the mean. Though the length of the deviations for individual cases varies, the overall pattern suggests a strong relationship.

Slide 21 To compute the correlation coefficient, we multiply the z-scores for the two variables for each case, and sum the products across all the cases. To compute Pearson's r, we divide the sum of the z-score products by the number of cases minus one. The value for Pearson's r that we computed agrees with the value supplied by SPSS. Finally, if we square the value of Pearson's r, we have the same value as R Square in the SPSS regression output.
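
A sketch of the whole z-score computation, again assuming the reconstructed data; the .866 it prints is my reconstruction of the Pearson's r on the original slide, implied by the slope of .971 quoted on Slide 22.

```python
# Pearson's r as the average product of z-scores (dividing by n - 1).
from statistics import mean, stdev  # stdev divides by n - 1, as SPSS does

famsize = [2, 2, 4, 4, 5, 5, 6, 6]
ncards = [4, 6, 6, 7, 7, 8, 8, 10]

def z_scores(values):
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

zx, zy = z_scores(famsize), z_scores(ncards)
r = sum(x * y for x, y in zip(zx, zy)) / (len(famsize) - 1)
print(round(r, 3), round(r ** 2, 3))  # about .866 and .75 (R Square)
```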

Slide 22 If we return to the regression results for the raw data instead of the standard scores, we can show the link between Pearson's r and the slope in the regression equation. Recall that the slope of the regression line represents the change in the dependent variable associated with a one unit change in the independent variable. Thus, when a family had one more member, we would predict that they had .971 more credit cards. Think of the standard deviation as a measure of the average difference from the mean across all of the cases for each of the variables. The standard deviation for number of cards is 1.773 and the standard deviation for family size is 1.581.

Slide 23 If the relationship between the two variables were perfect (one predicted the other without error), we could compute the slope of the line using the average amount of difference in each of the distributions – the standard deviations. On average, the number of cards would go up 1.773 cards for a difference of 1.581 members in a family. We can simplify this by dividing the standard deviation for number of cards by the standard deviation for family size: 1.773 ÷ 1.581 = 1.121. Thus, if the relationship were perfect, we would increase our estimate of the number of cards in a family by 1.121 for every additional member of a family.

Slide 24 If the slope of the regression line were 1.121 when the relationship was perfect, then we might expect the slope to be Pearson's r x 1.121 when the relationship was less than perfect. And in fact, that turns out to be true, since: .866 x 1.121 = .971. The slope of the regression line is the ratio of the standard deviations multiplied by the correlation coefficient. If the relationship between the two variables were perfect, Pearson's r would be 1.0 (or -1.0 if the relationship were inverse). However, we know that Pearson's r is less than that; it is actually .866.
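
A final sketch tying the pieces together; all three inputs are the reconstructed values discussed above.

```python
# Slope = Pearson's r times the ratio of standard deviations.
sd_cards, sd_famsize = 1.773, 1.581  # sample SDs (reconstructed values)
r = 0.866                            # Pearson's r (reconstructed value)
perfect_slope = sd_cards / sd_famsize   # slope if r were 1.0: about 1.121
print(round(r * perfect_slope, 3))      # 0.971, the regression slope
```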