Describing the Relation Between Two Variables


Learning Objectives
1. Construct and interpret scatter diagrams
2. Compute and interpret correlation coefficients
3. Compute and interpret least-squares lines
4. Interpret residual plots

Scatter Diagrams; Correlation
Bivariate data are data in which two variables are measured on each individual. Often the purpose is either to study the relationship between the two variables (a correlation problem) or to predict one variable from the other (a least-squares regression problem, also called a simple linear regression problem). In a bivariate problem, the response variable y (also called the dependent variable) is the variable whose value can be explained or predicted from the value of the predictor variable x (also called the independent variable).

Examples of Bivariate Data
1. What is the relationship between hand-size and height? One needs to collect two variables (hand-size and height) from each subject in a sample of n subjects. It is a bivariate study. We can conduct two studies:
(a) Find out the relationship between Hand-size and Height.
(b) Predict Height using Hand-size and investigate whether Hand-size is a good predictor of Height.
2. Is the weight of a car a good predictor of mileage? One needs to collect the weight and mileage of a sample of n cars in order to study this problem. We can conduct two studies:
(a) Find out the relationship between Weight and Mileage.
(b) Predict Mileage using Weight and investigate whether Weight is a good predictor of Mileage.

Real-Time Activity Is Hand-size a good predictor of Height? How do you measure Hand-size? Measure your Hand-size by measuring Hand-Width and Hand Length as discussed in the class. Now go to the Real-Time Online activity site at http://stat.cst.cmich.edu/statact/ Go to Data Entry, select Activity: Hand-Size. Use the Activity Code: To be provided in class

Is Hand size a good predictor of height?
Here are 20 cases from previous students. How can we demonstrate the relationship between hand length and height?
Graphical method: scatter diagram.
Numerical method: correlation coefficient and least-squares regression.

Row  Gender  Length  Width  Height
1    female  8.50    9.50   68.5
2    female  8.40    9.00   68.0
3    female  7.50    8.00   68.0
4    female  7.25    8.00   68.0
5    male    7.40    7.70   70.0
6    male    7.50    8.75   71.0
7    female  6.50    7.25   66.0
8    male    8.00    7.00   68.0
9    male    8.00    8.75   72.0
10   male    8.75    9.50   76.8
11   male    8.00    9.00   71.0
12   female  6.00    7.50   62.0
13   male    6.20    11.50  69.0
14   female  6.50    7.50   61.5
15   male    8.00    9.50   69.0
16   female  7.00    8.00   69.0
17   male    7.00    9.10   72.2
18   female  6.50    7.50   63.0
19   female  6.50    7.00   61.0
20   female  7.25    7.50   63.5

How can we demonstrate the relationship between Hand Length and Height?
Graphical method: a scatter diagram (scatter plot) shows the relationship between two quantitative variables measured on the same individual. Each individual in the data set is represented by a point in the scatter diagram. The predictor variable (independent variable) is plotted on the horizontal axis and the response variable (dependent variable) on the vertical axis. Note: the points are not connected in a scatter diagram.

For the example of using Hand Length to predict Height:
What is the response variable? ______________________
What is the predictor variable? _____________________
[Figure: scatter plot of Height (vertical axis) vs. Hand Length (horizontal axis).]

Scatter plot using Minitab Go to Graph, choose Scatterplot, choose Simple, select variable name, OK. Is the relation positive? Is the relation strong?

[Figure: six example scatter plots. Panel titles: Positive, perfectly correlated; Positive, highly correlated; Positive, moderately correlated; Nonlinear, positive correlation (two panels); Nonlinear, no correlation.]

[Figure: six example scatter plots. Panel titles: Negative, moderately correlated; Negative, perfectly correlated; Negative, highly correlated; Nonlinear, negative correlation; No correlation (two panels).]

How can we quantify the correlation?
Numerical method: Pearson correlation. The linear correlation coefficient, or Pearson product-moment correlation coefficient, is a measure of the strength of the linear relation between two quantitative variables.
NOTATION: We use the Greek letter ρ (rho) to represent the population correlation coefficient and r to represent the sample correlation coefficient, where -1 ≤ r ≤ +1.

Properties of the Linear Correlation Coefficient
1. If r = +1 (or -1), there is a perfect positive (negative) linear relation between the two variables.
2. The closer r is to +1 (or -1), the stronger the evidence of positive (negative) association between the two variables.
3. If r is close to 0, there is no evidence of a linear relation between the two variables. Because the linear correlation coefficient measures only the linear relation, r close to 0 does not imply no relation, just no linear relation.

[Figure: the positive-relation scatter plots annotated with correlation coefficients. Linear panels (perfectly, highly, and moderately correlated): r = +1, r = +.9, r = +.4. Nonlinear panels (two with positive correlation, one with no correlation): r = +.8, r = +.6, r ≈ 0.]

[Figure: the negative-relation scatter plots annotated with correlation coefficients. Linear panels (perfectly, highly, and moderately correlated): r = -1.0, r = -0.9, r = -0.4. Nonlinear negative correlation: r = -0.8. No correlation (two panels): r ≈ 0.0.]

Important distinction between Association and Cause and Effect
A strong relationship between two variables does not imply cause and effect!
Examples:
1. A study shows a strong correlation between math IQ and foot size for kindergarten children. Does this mean that foot size is the cause of math IQ for kindergarten children?
2. A study shows a positive correlation between CEO salary and stock price. Does this mean that CEO salary is the cause of stock price?
In each of these examples, the relationship is not causal. There is a hidden variable that is related to both variables and is the cause. This hidden variable is often called a ‘lurking variable’. Can you identify the lurking variable in each case?

[Figure: scatter plot of Height vs. Hand Length.]
How do we determine the Pearson correlation coefficient, r, for the hand size and height data?

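Computing r
$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}$$
where
$$SS_{xx} = (n-1)\,s_x^2, \qquad SS_{yy} = (n-1)\,s_y^2, \qquad SS_{xy} = \sum xy - n\,\bar{x}\,\bar{y}.$$
Here $s_x^2$ is the sample variance of the x variable and $s_y^2$ is the sample variance of the y variable.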

Compute the linear correlation coefficient to quantify the relationship between Height and Hand Length, using the 20 cases listed earlier (hand length as x, height as y).

EXAMPLE Height & Hand Length: Compute the linear correlation coefficient.

Sample Statistics
                 N   Mean    Standard Deviation
Hand Length (X)  20  7.338   0.808
Height (Y)       20  67.875  4.056

Use Minitab: Go to Stat, Basic Statistics, Correlation, select Height and Hand-length. OK.
The computer result is r = .668. The correlation is moderately high.
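The same computation can be sketched in plain Python, as a cross-check on the Minitab result. This is a minimal illustration, not part of the original slides: the function name `pearson_r` and the variable names are assumptions chosen for this example, and the data are the 20 hand length and height cases from the table.

```python
import math

# The 20 (hand length, height) cases from the data table.
hand_length = [8.50, 8.40, 7.50, 7.25, 7.40, 7.50, 6.50, 8.00, 8.00, 8.75,
               8.00, 6.00, 6.20, 6.50, 8.00, 7.00, 7.00, 6.50, 6.50, 7.25]
height = [68.5, 68.0, 68.0, 68.0, 70.0, 71.0, 66.0, 68.0, 72.0, 76.8,
          71.0, 62.0, 69.0, 61.5, 69.0, 69.0, 72.2, 63.0, 61.0, 63.5]


def pearson_r(x, y):
    """Sample correlation coefficient r = SSxy / sqrt(SSxx * SSyy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # SSxx and SSyy equal (n - 1) times the sample variances.
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    # SSxy = sum(xy) - n * x_bar * y_bar
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    return ss_xy / math.sqrt(ss_xx * ss_yy)


r = pearson_r(hand_length, height)
print(round(r, 3))  # 0.668
```

Running it should reproduce the value reported by Minitab, r = .668.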

Online Applet Activity: Visualizing correlation using scatter plots
Go to the site: http://bcs.whfreeman.com/scc/content/cat_040/spt/correlation/correlationregression.html
To review scatter plots and the correlation coefficient:
(a) Create a scatter plot using 10 pairs of data with near-zero correlation.
(b) Create a scatter plot with a nonlinear relation and near-zero correlation using 10 pairs of data.
(c) Create a scatter plot with a nonlinear relation and correlation near .8 using 10 pairs of data.
(d) Create a scatter plot using 10 pairs of data with near-zero correlation, then add one additional point that greatly increases the correlation.
(e) Create a scatter plot using 10 pairs of data with correlation near one, then add one additional point that greatly decreases the correlation.

Finding the Least-squares Regression Line
Recall: a mathematical line y = mx + b.
m is the slope: the change in y for a one-unit increase in x.
b is the intercept: the y-value when x = 0.
Examples:
(1) Graph the line y = 2x – 3 and determine the slope and intercept.
(2) Graph the line y = (-.5)x + 1 and determine the slope and intercept.
(3) Determine the line y = mx + b that passes through the two points (1, 5) and (3, 2).
Ans: First determine the slope: m = (y2 - y1)/(x2 - x1) = (2 - 5)/(3 - 1) = -1.5, so the equation is y = (-1.5)x + b. To determine the intercept b, substitute a point, say (1, 5), into the equation: 5 = (-1.5)(1) + b, which gives b = 6.5. So the equation is y = (-1.5)x + 6.5.
(4) Exercise: determine the line passing through (2, 3) and (4, 9).
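The two-point procedure in example (3) can be sketched in Python as follows. The helper name `line_through` is hypothetical and introduced only for this illustration.

```python
def line_through(p1, p2):
    """Return (slope m, intercept b) of the line y = m*x + b through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        # A vertical line cannot be written as y = m*x + b.
        raise ValueError("the two points must have different x-values")
    m = (y2 - y1) / (x2 - x1)   # slope: change in y per unit change in x
    b = y1 - m * x1             # solve y1 = m*x1 + b for the intercept
    return m, b


m, b = line_through((1, 5), (3, 2))
print(m, b)  # -1.5 6.5
```

The same function can be used to check your answer to exercise (4).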

How do we determine a line that can be used to predict Height using Hand Length? That is, we want to determine a line ŷ = b1·x + b0, where b1 is the slope and b0 is the intercept.
An intuitive approach: draw your best-guess line, pick two points on it, and then obtain the equation of the line from those two points.

Use Fathom to Demonstrate the Least-Squares Method: Predictions, Residuals, Sum of Squares of Residuals
Problem: How well can Hand_size predict Height?
Data: Hand_size_20cases

What is a Residual?
The difference between the observed value of y and the predicted value of y is the error, or residual. That is, residual = observed – predicted.
Notation: e = y - ŷ (for the i-th case, ei = yi - ŷi).

The line is ŷ = b1·x + b0, where b1 is the slope and b0 is the intercept. One way to determine the best line is to find the b1 and b0 that make the sum of the squared residuals as small as possible.

Predicting Height using Hand Length
(a) Find the least-squares regression line.

Sample Statistics
                 N   Mean         Standard Deviation
Hand Length (X)  20  7.338 (x̄)   0.808 (sx)
Height (Y)       20  67.875 (ȳ)   4.056 (sy)

$b_1 = r\,\dfrac{s_y}{s_x} = .668\,(4.056)/(.808) = 3.353$
$b_0 = \bar{y} - b_1\,\bar{x} = 67.875 - (3.353)(7.338) = 43.27$

The regression line is Height = 3.353(Hand Length) + 43.27.
(b) Interpret the slope: for each one-inch increase in Hand Length, the predicted Height increases by 3.353 inches on average.
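The hand computation above can be reproduced with a few lines of Python. This is only a sketch that starts from the rounded summary statistics on the slide (r = .668 and the means and standard deviations), so the variable names are illustrative and the result matches the rounded hand calculation rather than a fit to the raw data.

```python
# Rounded summary statistics from the slide.
r = 0.668                     # sample correlation coefficient
x_bar, s_x = 7.338, 0.808     # Hand Length: mean and standard deviation
y_bar, s_y = 67.875, 4.056    # Height: mean and standard deviation

b1 = r * s_y / s_x            # slope
b0 = y_bar - b1 * x_bar       # intercept

print(f"Height = {b1:.3f}(Hand Length) + {b0:.2f}")
# Height = 3.353(Hand Length) + 43.27
```

Fitting the raw 20 cases directly gives slightly different coefficients (3.351 and 43.29, as in the Minitab output) because no intermediate rounding is involved.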

Use Minitab to obtain the regression line: Go to Stat, Regression, Fitted Line Plot, choose Y and X. OK.

Predicting Height using Hand Length
(c) Predict the height for individuals with Hand Length 7.5” and 6.5”, respectively.
(d) Draw the least-squares regression line on the scatter diagram of the data.
(e) Compute the residual, y - ŷ: in the data, there are individuals whose Hand Length is 7.5” and whose height is 68”. Find the residual of the height when using the model to predict the height. Do the same for (6.5”, 63”).
(f) Find the sum of the squared residuals.
(g) Does any other line yield a smaller sum of squared residuals?

Predicting Height using Hand Length
The fitted line is ŷ = 3.351x + 43.29 (computed from the unrounded data, so it differs slightly from the hand computation with rounded statistics). For each of the 20 cases listed earlier (Row, Gender, length, width, height), can you find the Predicted Height ŷ and the corresponding Residual y - ŷ?
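As a hedged illustration of how a predicted value and a residual are obtained, the sketch below applies the line ŷ = 3.351x + 43.29 to the two cases mentioned in the exercise, (7.5”, 68”) and (6.5”, 63”). The function name `predict_height` is an assumption made for this example.

```python
def predict_height(hand_length):
    """Predicted height (inches) from the fitted line y-hat = 3.351x + 43.29."""
    return 3.351 * hand_length + 43.29


# (hand length, observed height) pairs taken from the data table.
cases = [(7.5, 68.0), (6.5, 63.0)]

for x, y in cases:
    y_hat = predict_height(x)
    residual = y - y_hat          # residual = observed - predicted
    print(f"x = {x}: predicted = {y_hat:.2f}, residual = {residual:.2f}")
```

Both residuals come out negative, which means the line overestimates the height of these two individuals. The same loop extended to all 20 cases fills in the Predicted Height and Residual columns.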

Should the line be used to predict Height when Hand Length = 3” or 10”?
Do not use a least-squares regression line to make predictions for x values far outside the scope of the model. In this case the observed hand lengths range from 6” to 8.75”, and we cannot be sure that the linear relation continues to hold when hand length is below 6” or above 8.75”.
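One way to make this rule concrete in code is to refuse predictions outside the observed range. The sketch below is an assumption about how one might enforce it, not something prescribed by the slides; the names `X_MIN`, `X_MAX`, and `predict_height_checked` are hypothetical.

```python
# Observed range of hand length in the 20-case sample (inches).
X_MIN, X_MAX = 6.0, 8.75


def predict_height_checked(hand_length):
    """Predict height from y-hat = 3.351x + 43.29, refusing to extrapolate."""
    if not (X_MIN <= hand_length <= X_MAX):
        raise ValueError(
            f"hand length {hand_length} is outside the observed range "
            f"[{X_MIN}, {X_MAX}]; the linear model may not hold there"
        )
    return 3.351 * hand_length + 43.29


print(round(predict_height_checked(7.0), 2))  # 66.75

for x in (3.0, 10.0):
    try:
        predict_height_checked(x)
    except ValueError as err:
        print("refused:", err)
```

Raising an error is a deliberately strict choice; a warning would also be reasonable when the value is only slightly outside the range.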

Diagnostics on the Least-squares Regression Line
Questions:
1. How do I know if this is a ‘good’ model? That is, how much of the information in Height can be explained by Hand Length?
2. Is there any unusual Height, that is, an outlier in the response variable Y?
3. Is there any unusual X value that may dramatically affect the model, that is, an influential case?

Suppose we are asked to predict Height using only the Height data, without any information about the relation between height and hand length. Our ‘typical guess’ is the average height:
(1) Use the sample average of Height: ȳ = 67.875”.
When we also know the Hand Length, and we are asked to predict the Height of an individual whose hand length is 7”, we can apply the model we derived:
(2) Use the least-squares regression line: ŷ = 3.351(7) + 43.29 = 66.75”.

The difference between the two predictions, ŷ - ȳ, is the additional information explained by the Hand Length: the explained deviation. What remains, y - ŷ, is the unexplained deviation (the residual).

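Total Deviation = Unexplained Deviation + Explained Deviation
$$(y - \bar{y}) = (y - \hat{y}) + (\hat{y} - \bar{y})$$
Total Variation = Unexplained Variation + Explained Variation, which is computed as follows:
$$\sum (y - \bar{y})^2 = \sum (y - \hat{y})^2 + \sum (\hat{y} - \bar{y})^2$$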

R²: Coefficient of Determination
Variation explained by the X variable: SS due to Regression, SS(Regression).
Variation due to error: the sum of squared residuals, SS(Error).
R² = the proportion of variation explained by the X variable.

Where do I find SS(Total), SS(Error) and SS(Regression)?
This information can be easily obtained from computer output.

The regression equation is: Height = 43.29 + 3.351 Hand_length
S = 3.10061   R-Sq = 44.6%   R-Sq(adj) = 41.6%

Analysis of Variance
Source      DF  SS       MS       F      P
Regression   1  139.469  139.469  14.51  0.001
Error       18  173.048    9.614
Total       19  312.518

SS(Total) = SS(Error) + SS(Regression)
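To see where the numbers in this output come from, the sums of squares can be recomputed from the 20 cases. The following Python sketch is illustrative only (variable names are assumptions); it fits the least-squares line, splits the total variation into its regression and error parts, and derives S, R-Sq, and F from them. The P-value and R-Sq(adj) are not computed here.

```python
import math

# The 20 (hand length, height) cases from the data table.
hand_length = [8.50, 8.40, 7.50, 7.25, 7.40, 7.50, 6.50, 8.00, 8.00, 8.75,
               8.00, 6.00, 6.20, 6.50, 8.00, 7.00, 7.00, 6.50, 6.50, 7.25]
height = [68.5, 68.0, 68.0, 68.0, 70.0, 71.0, 66.0, 68.0, 72.0, 76.8,
          71.0, 62.0, 69.0, 61.5, 69.0, 69.0, 72.2, 63.0, 61.0, 63.5]

n = len(hand_length)
x_bar = sum(hand_length) / n
y_bar = sum(height) / n

# Least-squares slope and intercept.
ss_xx = sum((x - x_bar) ** 2 for x in hand_length)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hand_length, height))
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

predicted = [b0 + b1 * x for x in hand_length]

# Total variation = unexplained (error) + explained (regression).
ss_total = sum((y - y_bar) ** 2 for y in height)
ss_error = sum((y - p) ** 2 for y, p in zip(height, predicted))
ss_regression = sum((p - y_bar) ** 2 for p in predicted)

ms_error = ss_error / (n - 2)          # error has n - 2 degrees of freedom
s = math.sqrt(ms_error)                # the "S" in the output
f_stat = ss_regression / ms_error      # regression has 1 degree of freedom
r_sq = ss_regression / ss_total

print(f"Height = {b0:.2f} + {b1:.3f} Hand_length")
print(f"SS(Regression) = {ss_regression:.3f}")
print(f"SS(Error)      = {ss_error:.3f}")
print(f"SS(Total)      = {ss_total:.3f}")
print(f"S = {s:.5f}, R-Sq = {r_sq:.1%}, F = {f_stat:.2f}")
```

The printed values should agree with the Minitab table up to rounding in the last digit.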

R² = SS(Regression)/SS(Total) = 1 - SS(Error)/SS(Total)
The coefficient of determination, R², is the percentage of variation in the response variable that is explained by the predictor variable. To determine R² for a simple linear regression model, simply square the linear correlation coefficient: R² = r² × 100%. [NOTE: This shortcut does not work for regression equations that have more than 1 predictor variable.]

Determining the Coefficient of Determination for the model: Predicting Height using Hand Length
Find and interpret the coefficient of determination for the model that predicts Height using Hand Length:
R² = 139.469/312.518 × 100% = 44.6%
OR, using r = .668: R² = (.668)² × 100% = 44.6%
Hand Length explains 44.6% of the variation in Height.

Some concept questions
Determine whether each of the following statements is true or false:
1. If the Pearson coefficient r > 0, then the slope b1 > 0.
2. If r = 0, it is possible that b1 > 0.
3. If r < 0, then R² < 0.
4. If R² = .64 and b1 = 2.35, then r = .8.
5. If R² = .64 and b1 = -2.35, then r = .8.
NOTE:
(a) The slope must have the same sign as the correlation coefficient r, and R² cannot be negative.
(b) To compute r from R² for simple linear regression: $r = \pm\sqrt{R^2}$, where the sign of r is the same as the sign of the slope b1.
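Note (b) can be expressed as a short Python sketch. The function name `r_from_r_squared` is hypothetical, and R² is taken here as a proportion between 0 and 1 rather than a percentage.

```python
import math


def r_from_r_squared(r_squared, slope):
    """Recover r for simple linear regression: sqrt(R^2) with the sign of the slope."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("R^2 must lie between 0 and 1")
    # copysign attaches the sign of the slope to the magnitude sqrt(R^2).
    return math.copysign(math.sqrt(r_squared), slope)


print(round(r_from_r_squared(0.64, 2.35), 3))    # 0.8
print(round(r_from_r_squared(0.64, -2.35), 3))   # -0.8
```

The two calls use the values from concept questions 4 and 5, so they can be used to check your answers.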

How do we know if the model is adequate?
Is a linear model adequate? Are there any outliers in the response variable? Are there any influential cases? All of these questions can be answered by analyzing the residuals, ei.

Online Applet Activity: Visualizing the effects of outliers and influential cases using scatter plots
Go to the site: http://bcs.whfreeman.com/scc/content/cat_040/spt/correlation/correlationregression.html
(a) Create a scatter plot of 10 cases with high positive correlation. Add one case in the far lower-right corner, and observe the change in the correlation coefficient and in the regression line (pay special attention to the change in the slope). What do you find?
(b) Create 10 pairs of data points with one in the upper-right corner and the rest showing a high negative correlation, such that the regression line has almost zero slope. Delete the upper-right case, and observe the effect of the deletion on the slope and correlation.
(c) Create 10 pairs of cases with high negative correlation and one outlier case whose x-value is around the middle and whose y-value is much higher than the rest. Now delete the outlier case, and observe the change in the slope and correlation.

Useful Residual Plots for Model Diagnosis
1. Residuals vs. the order of the data: If the linear model is adequate, this plot looks random, with no identifiable pattern. A curved pattern indicates that the relationship between x and y is not linear.
[Figure: two plots of residuals vs. data order. One shows a clear nonlinear pattern: the linear model is not adequate. The other shows no specific pattern: the model is adequate.]

2. Residuals vs. predicted Y: If the model is adequate, the residuals should show no specific pattern around the zero line. Two common problems can be identified from this plot:
(a) A curved pattern indicates that the relation between Y and X is nonlinear.
(b) Residuals far away from zero indicate outliers in the response variable.

[Figure: two plots of residuals vs. predicted Y. One shows a nonlinear pattern: the model is not linear. The other shows no unusual pattern: the model is adequate.]

3. Residuals vs. the predictor variable: This plot may also reveal outliers. They are easy to identify because their residuals lie far from the rest of the plot.
[Figure: residual plot with one outlier case marked.]

The Effect of Influential Observations
An influential observation is one that has a disproportionate effect on the value of the slope and y-intercept in the least-squares regression equation.
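A small numerical experiment can make this definition tangible. The Python sketch below refits the hand length and height line after adding one extra point. That point, (10.0, 60.0), is entirely hypothetical: it is not in the class data and is chosen here only because it lies far to the right of, and well below, the other cases. The helper name `fit_line` is also an assumption.

```python
def fit_line(points):
    """Least-squares (slope, intercept) for a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    ss_xx = sum((x - x_bar) ** 2 for x, _ in points)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = ss_xy / ss_xx
    return slope, y_bar - slope * x_bar


# The 20 (hand length, height) cases from the data table.
hand_length = [8.50, 8.40, 7.50, 7.25, 7.40, 7.50, 6.50, 8.00, 8.00, 8.75,
               8.00, 6.00, 6.20, 6.50, 8.00, 7.00, 7.00, 6.50, 6.50, 7.25]
height = [68.5, 68.0, 68.0, 68.0, 70.0, 71.0, 66.0, 68.0, 72.0, 76.8,
          71.0, 62.0, 69.0, 61.5, 69.0, 69.0, 72.2, 63.0, 61.0, 63.5]
cases = list(zip(hand_length, height))

# HYPOTHETICAL extra case, invented for illustration only.
hypothetical_point = (10.0, 60.0)

slope_without, intercept_without = fit_line(cases)
slope_with, intercept_with = fit_line(cases + [hypothetical_point])

print(f"without the extra point: slope = {slope_without:.3f}")
print(f"with the extra point:    slope = {slope_with:.3f}")
```

A single added case out of 21 pulls the slope down sharply, which is what "disproportionate effect" means in the definition above.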

Re-visit the Online Applet Activity: Visualizing the effects of outliers and influential cases using scatter plots
Open the activity worksheet: Online Applet Activity-Outlier&InfluCases(Reg&Corr). Work with your group to answer the questions asked in the worksheet. We will work on some of the problems in class, and your team will complete the rest. It is due next class period.

If there are outliers or influential cases, how do we deal with them?
As with outliers, influential observations should be removed only if there is justification to do so. When an influential observation occurs in a data set and its removal is not warranted, there are two courses of action: (1) collect more data so that additional points near the influential observation are obtained, or (2) use more advanced techniques, such as transformations (for example, log transformations).

Activity: How well can Hand_length or Hand_width predict height?
Open the activity worksheet: Activity-Regression-Hand-size. Work with your group to answer the questions asked in the worksheet. We will work on some of the problems in class, and you will complete the rest after class. It is due next class period.