Linear Regression Dr. Richard Jackson

Presentation transcript:

Linear Regression Dr. Richard Jackson jackson_r@mercer.edu This module covers a topic you probably covered back in high school: linear regression. You might remember it as constructing a line of best fit through a set of data points. © Mercer University 2005 All Rights Reserved

Linear Regression Used to Predict One Variable from Another Example: Predict Blood Level From Dose Linear regression is used to predict one variable from another — for example, to predict the blood level from a given dose of a drug. Once data have been collected, that information can be used to predict one variable from the other in future cases.

Linear Regression Takes Observed Data From Group of Subjects on Two Variables (X and Y) Calculate Formula for Predicting Y from X Construct Graph with Regression Line (Line of Best Fit): Straight Line Linear regression takes observed data from a group of subjects on two variables. It then uses those data to calculate a formula for predicting y, the dependent variable, from x, the independent variable. It also involves constructing a graph with a regression line, or line of best fit: a straight line through the set of data points that lets one predict one variable from the other using the graph. Let's go back and review a few basic fundamentals about the formula for a straight line.

Formula for Straight Line Y = A + bX Y = Dependent Variable A = Y Intercept b = Slope of Line (Change in Y per Unit Change in X) X = Independent Variable Example: Y = 4 + 2X Let's take a look at the formula for a straight line: Y = A + bX. You may recall that y is the dependent variable, A is the y intercept, and b is the slope of the line. The slope is how many units y changes for every unit change (a change of 1) in the x variable; x is the independent variable. If we consider the equation y = 4 + 2x, we can identify the straight line that corresponds to that formula. By substituting two values for x into the equation, we can determine the corresponding y values.

Example of Straight Line If X = 0, then Y = 4; if X = 1, then Y = 6. By plotting those two points, (0, 4) and (1, 6), we can identify a particular straight line, recalling that it takes only two points to identify a specific straight line. On your handout you can see, correctly represented, the line identified by the equation y = 4 + 2x; as you can see, it crosses the y axis at 4 and the slope of the line is 2. In other words, for every unit change in x — for example, going from 0 to 1 — we have a change of 2 units on the y axis.
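The arithmetic above can be checked with a couple of lines of code (a minimal Python sketch; the function name is illustrative, not from the slides):

```python
def line(x, a=4, b=2):
    """Straight line y = a + b*x with intercept a and slope b."""
    return a + b * x

# Two points are enough to identify the line y = 4 + 2x.
print(line(0))  # 4  (the y intercept)
print(line(1))  # 6  (one unit change in x changes y by the slope, 2)
```

Changing `a` and `b` in the call gives any other straight line in the same form.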

Clinical Example (See Scatter Diagram, Table I) Independent Variable: Plasma Atenolol Dependent Variable: Maximum HR Let's take a look at a clinical example; refer to Table I. This scatter diagram represents the relationship between plasma atenolol and maximum exercise heart rate: as the plasma atenolol blood level increases, the maximum exercise heart rate decreases. This is an example of negative correlation, because the pattern slopes downward to the right: as one variable (x) increases, the other variable (y) decreases.

Clinical Example (See Scatter Diagram, Table I) Patient: 1, 2, 3, 4 Plasma Atenolol: 500, 400, 800, 1000 Max. HR: 80, 75, 70, 65 These data are the x and y measures for 4 of the roughly 30 patients in this study. As you can see from the diagram, patient 1 has a plasma atenolol level of 500 and a maximum heart rate of 80; patient 2 has a level of 400, patient 3 a level of 800, and patient 4 a level of 1000.

Formula For Regression Line (Line of Best Fit) Y' = 100 + (-0.03) X It is possible, through formulas we will not concern ourselves with (we will let the computer do that), to draw a line of best fit through these data: to take the observed data on the x and y variables and determine the y intercept and slope of the line of best fit. This has been done: the y intercept, A, is 100 and the slope, b, is -0.03. It is negative because y decreases as x increases. This line of best fit crosses the y axis at 100, and for every unit change in the x variable we see a change of -0.03 on the y axis. This is the formula that represents the line of best fit through these points.
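Although the transcript leaves the fitting formulas to the computer, the least-squares calculation itself is short. The sketch below (Python; function name is illustrative) fits a line to just the four patients shown on the slide. Because the slide's Y' = 100 + (-0.03)X was fitted to the full group of about 30 patients, the four-point coefficients come out only roughly similar:

```python
def least_squares(xs, ys):
    """Ordinary least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# The four patients from the slide (plasma atenolol, max HR).
a, b = least_squares([500, 400, 800, 1000], [80, 75, 70, 65])
print(round(a, 2), round(b, 4))  # negative slope, as expected
```

The slope is negative, matching the negative correlation seen on the scatter diagram.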

Using Y' = 100 + (-0.03) X X Y' 400 88 1000 70 If we pick any two values for x, we can use the formula to compute the corresponding y values and then plot those two points; they identify the line of best fit through these data. If we arbitrarily choose the x values 400 and 1000 and solve for the corresponding y values, we get 88 and 70, respectively. If we identify those two points on the graph (marked with asterisks on the table) and connect them, we have drawn the line of best fit through the data. It is then possible to predict the y variable from a given x variable using the graph.
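Plugging x values into the prediction equation can be sketched in a few lines (Python; the function name is an assumption, not from the slides):

```python
def predict_hr(plasma_atenolol):
    """Predicted maximum HR from the fitted line Y' = 100 + (-0.03)X."""
    return 100 + (-0.03) * plasma_atenolol

print(predict_hr(400))   # 88.0
print(predict_hr(1000))  # 70.0
```

Any two such predictions are enough to draw the regression line on the graph.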

Predicted Y's (Y') Not Same As Observed Y's Y' (or Y Prime) Is the Predicted Y Patient PA(X) MHR(Y) Y' 2 400 75 88 4 1000 65 70 We could also use the formula to calculate a y value, as we have just done, by plugging in a value of x (400) and calculating y. Note that this y is designated y prime; it is also sometimes called y hat, and it is the predicted y value, which differs from the observed y value. The predicted y values all fall along the straight line of best fit, but the observed y values vary about that line. The predicted y values, designated y prime, are not the same as the observed y values, except in one circumstance that I will describe momentarily. For example, for patient 2 the plasma atenolol (x) is 400 and the maximum heart rate (y) is 75, but the predicted value y prime is 88. For patient 4, x is 1000, y is 65, and y prime is 70. So, except in that particular circumstance, the predicted y values will differ from the observed y values, because all of the predicted y values fall along the regression line, the line of best fit.

Predicted Y's (Y') Will Be Same as Observed Y's when r = +1.00 or -1.00 The circumstance in which the predicted y's are the same as the observed y's occurs when the Pearson r between the two variables is either +1 or -1. Recall from our previous discussion of the Pearson r that the closer the points fall to a straight line, the closer the Pearson r is to +1 or -1. If the correlation between the two variables is +1 or -1, then all of the observed values already fall along a straight line, and in that case the predicted y's are the same as the observed y's.

Accuracy of Prediction The Closer the Points on the Scatter Diagram Fall on a Straight Line (the Closer the Pearson r Is to +1.00 or -1.00), the More Accurate the Prediction There are two ways to predict a y value from an x value using linear regression. One is simply to substitute an x value into the equation for the straight line and solve for y. The other is to read a value on the x axis up to the regression line and then across to the corresponding y value on the y axis. Once a prediction is made, one may ask how accurate it is. The closer the points fall to the straight line, the more accurate the prediction; in other words, the closer the Pearson r is to +1 or -1, the more accurate the prediction. The more the points are spread out and the closer the Pearson r approaches zero, the less accurate the prediction.

Accuracy of Prediction Quantified by Standard Error of Estimate, Syx Formula: Syx = Sy √(1 - r²) The accuracy of prediction with linear regression is quantified with a statistic known as the standard error of estimate. Its symbol is the letter S with the subscript yx. As you might guess from the lowercase s, it is a standard deviation; the subscript yx means it is the standard error of estimate in predicting y from x. Most of the time one predicts the y variable from x, though in certain circumstances one might want to predict x from y; that, however, would involve a different equation for a straight line. The formula for the standard error of estimate is Syx = Sy √(1 - r²), where Sy is the standard deviation of the y values and r is the Pearson r between the two variables.

Clinical Example Sy = 5, r = +0.8 Syx = Sy √(1 - r²) Syx = 5 √(1 - (0.8)²) = 3 Assume that the standard deviation of the y values is 5 and the Pearson r is +0.8. Plugging those numbers into the equation gives a standard error of estimate of 3.
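This calculation can be reproduced directly from the formula (Python sketch; the function name is illustrative):

```python
import math

def std_error_estimate(s_y, r):
    """Standard error of estimate: Syx = Sy * sqrt(1 - r**2)."""
    return s_y * math.sqrt(1 - r ** 2)

# Sy = 5 and r = +0.8 give Syx = 5 * sqrt(1 - 0.64) = 5 * 0.6 = 3.
print(round(std_error_estimate(5, 0.8), 6))  # 3.0
```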

Interpretation of Syx At Any Point on the Regression Line, 95% of Observed Y's Fall Within Plus or Minus 2 Syx Example: Y' for Plasma Level of 800: Y' = 100 + (-0.03)(800) = 76 The interpretation of the standard error of estimate is as follows. At any point on the regression line, 95% of the observed y's — that is, the actual points on the scattergram — fall within plus or minus 2 standard errors of estimate of the regression line. In this example, the predicted y for a plasma level of 800 is 76.

Standard Error of Estimate Predicted Y' = 76 Syx = 3 95% of Y's Fall Within Plus or Minus 2 × 3, or 70-82 From the plasma atenolol data, substituting 800 as the x value into the prediction equation gives a predicted y of 76. The standard error of estimate is 3, which means that 95% of the observed y values fall within plus or minus 2 times the standard error of estimate (2 × 3 = 6): that is, within a range of 70 to 82 around the predicted y of 76 on the regression line. That gives us an idea of how accurately we are predicting y values from a given x value.
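The 70-82 range can be computed in a short sketch (Python; function and variable names are illustrative):

```python
def prediction_interval(y_pred, syx):
    """Approximate 95% range: within 2 standard errors of the regression line."""
    return (y_pred - 2 * syx, y_pred + 2 * syx)

y_pred = 100 + (-0.03) * 800   # predicted maximum HR at plasma level 800
print(y_pred)                   # 76.0
print(prediction_interval(y_pred, 3))  # (70.0, 82.0)
```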

Other Observations About Syx When r = +1.00 or -1.00, Syx = 0 (See Formula) When r = 0, Syx Is at Its Maximum and Equals Sy Other observations about the standard error of estimate: when the Pearson r between the two variables is either +1 or -1, the standard error of estimate is zero. If you look at the formula, when r is +1 or -1 and you square it, the term under the square root radical becomes zero, so the standard error of estimate is zero. That follows from what we said earlier: if the Pearson r is +1 or -1, the observed y's already fall along the straight line, so there is no error in predicting one variable from the other; the observed y's are not spread out but lie on the line already. Further, if there is no relationship between the two variables — that is, if the Pearson r equals 0 — then the standard error of estimate is at its maximum value, which equals the standard deviation of the y values. Again, look at the formula and substitute zero for r: you are left with the square root of one, which is one, multiplied by the standard deviation of the y values.
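Both boundary cases can be verified numerically from the Syx formula (self-contained Python sketch; the function name is illustrative):

```python
import math

def std_error_estimate(s_y, r):
    """Standard error of estimate: Syx = Sy * sqrt(1 - r**2)."""
    return s_y * math.sqrt(1 - r ** 2)

# Perfect correlation: all points on the line, so no prediction error.
print(std_error_estimate(5, 1.0))   # 0.0
print(std_error_estimate(5, -1.0))  # 0.0
# No correlation: Syx reaches its maximum, the standard deviation of y.
print(std_error_estimate(5, 0.0))   # 5.0
```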

Also, When r = 0 Regression Line Is Parallel to the X Axis Crosses the Y Axis at the Mean of the Y Values Slope = 0, Therefore the Y' for Any X Value = A, the Mean of the Y Values Also, when the Pearson r is zero the regression line can still be drawn, but it is parallel to the x axis and crosses the y axis at the mean of the y values. The slope of that regression line is zero; therefore the predicted y for any x value equals A, the y intercept, which equals the mean of the y values.

Summary of Linear Regression Provides Formula and Graphic Device for Predicting One Variable from Another Accuracy of Prediction Indicated by Standard Error of Estimate Close Association With Pearson r To summarize, linear regression provides a formula and a graphic device for predicting one variable from another. The accuracy of the prediction is indicated by a statistic known as the standard error of estimate, and there is a close association between linear regression, the Pearson r, and the accuracy of prediction.

How to Perform Linear Regression Using the Statistix Software Enter Data for Two Variables Select Statistics, Linear Models, Linear Regression Highlight and Move Dependent and Independent Variable Names to Appropriate Boxes then OK Check on Results, Select Plots Read Prediction Equation at Bottom of Graph