The Simple Regression Model

Presentation transcript:

The Simple Regression Model
Interval Estimation, Section 15.3 (also read confidence and prediction intervals)
Correlation, Section 15.4
Estimation and Tests, Sections 15.5, 15.6, 15.7

The Standard Error of the Estimate - Se or Sy.x
The least squares method fits a line that minimizes the sum of squared distances between the observed y and the predicted y, the SSE. We would like a statistic that measures the variability of the observed y values around the sample regression line. This statistic is also our estimate of the scatter of the y values in the population around the population regression line. Thus it is an estimate of σy|x. PP 9

Standard Error of the Estimate
Se is an estimate of σy|x.
[Figure: Y plotted against X at X = 20, 50, and 80, showing the conditional means E(Y|X=20), E(Y|X=50), and E(Y|X=80) and the spread σy|x; labels A - Sample, B - Sample, C - Population]

Standard Error of the Estimate
The standard error has the units of the dependent variable, y. The formula requires us to find, first, the predicted value for each observation in the data set and, second, the error term for that observation. We can calculate the error, or residual, for an observation as $e_i = y_i - \hat{y}_i$.

Calculating a Residual
For a given x value, you should be able to calculate the error term, $e_i = y_i - \hat{y}_i$.
Values from the slide's table (columns $x_i$, $y_i$, $\hat{y}_i$, $e_i$; the row layout was lost in extraction): 40 165 54 85 125.2 -40.2 9 37.5 -28.5
To calculate the standard error of the estimate for a sample, use $S_e = \sqrt{\mathrm{SSE}/(n-2)}$, where $\mathrm{SSE} = \sum (y_i - \hat{y}_i)^2$. In practice you will want to use the computational formula, which is faster.
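
The residual calculation can be sketched in a few lines of Python. The two (observed, predicted) pairs below are the ones that can be read unambiguously from the slide's table; the function name is illustrative, not part of the original presentation.

```python
def residual(observed_y: float, predicted_y: float) -> float:
    """Residual e_i = y_i - y_hat_i (observed minus predicted)."""
    return observed_y - predicted_y

# (observed y, predicted y) pairs recoverable from the slide's table.
pairs = [(85.0, 125.2), (9.0, 37.5)]

residuals = [residual(y, y_hat) for y, y_hat in pairs]
for (y, y_hat), e in zip(pairs, residuals):
    print(f"observed = {y}, predicted = {y_hat}, residual = {e:.1f}")
```

A negative residual means the regression line over-predicts mortality for that observation.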

Calculating the Standard Error of the Estimate
Substituting SSE = 27922.98 and n = 20 gives $S_e = \sqrt{27922.98/18} \approx 39.39$. The units are deaths per 1000 live births. Since we chose b0 and b1 to minimize the SSE, we were implicitly minimizing the standard error of the estimate.
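
As a rough check on this calculation, the sketch below computes the standard error of the estimate as the square root of SSE divided by n - 2. The SSE (27922.98) and sample size (20) are the values reported elsewhere in this presentation for the immunization example; the function name is an illustrative assumption.

```python
import math

def standard_error_of_estimate(sse: float, n: int) -> float:
    """Return S_e = sqrt(SSE / (n - 2)) for a one-predictor regression."""
    if n <= 2:
        raise ValueError("need more than two observations")
    return math.sqrt(sse / (n - 2))

# Values reported in this presentation for the mortality example.
sse = 27922.98  # unexplained variation
n = 20          # number of observations in the sample

s_e = standard_error_of_estimate(sse, n)
print(f"S_e = {s_e:.2f} deaths per 1000 live births")
```

The divisor is n - 2 rather than n because two parameters, b0 and b1, were estimated from the same data.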

The Coefficient of Determination
We want a measure of how well the independent variable predicts the dependent variable. That is, we want to answer the following question: of the total variation among the y's, how much can be attributed to the relationship between X and Y, and how much can be attributed to chance?

The Coefficient of Determination
By total variation among the y's, we mean the changes in Y from one sample observation to another. Why do the values of Y differ from observation to observation? According to our hypothesized regression model, the variation in Y is partly due to changes in X, which lead to changes in the expected value of Y, and partly due to chance, that is, the effect of the random error term.

The Coefficient of Determination
We ask how much of the observed variation in Y can be attributed to the variation in X and how much is due to other factors (error). To do so, we define the "sample variation of Y". If there were no variation in Y, all the values of Y, when plotted against X, would lie on a horizontal straight line at the average value of Y.

No Variation in Y
[Figure: plot against X illustrating no variation in Y]

The Coefficient of Determination
In reality, the observed values of Y are scattered around this line. Variation in Y can be measured as the distance of each observed yi from the average of Y.
[Figure: a point (xi, yi) plotted against X, with its distance from the average of Y]

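SST = SSR + SSE
Total variation (SST) can be decomposed into explained variation (SSR) and unexplained variation (SSE): SST = SSR + SSE.
[Figure: the deviation of yi from the average of Y at Xi, split into its explained and unexplained parts]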
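
The decomposition can be verified numerically. The sketch below fits a least-squares line to a small, purely hypothetical data set (not the immunization data, whose raw observations are not reproduced in this transcript) and checks that the total variation equals the explained plus the unexplained variation. The helper name `fit_line` is an assumption for illustration.

```python
def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data, chosen only to keep the arithmetic easy to follow.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = fit_line(xs, ys)
y_bar = sum(ys) / len(ys)
y_hat = [b0 + b1 * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)               # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained variation
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # unexplained variation

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"SSR + SSE = {ssr + sse:.2f}")
```

The identity holds only for a line fitted by least squares with an intercept; for an arbitrary line the cross-product term does not vanish.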

Coefficient of Determination or R2
R2 is the proportion of the variation of Y that can be attributed to the variation of X:
R2 = SSR/SST or R2 = 1 - SSE/SST
The two forms are equivalent:
SST = SSR + SSE
SST/SST = SSR/SST + SSE/SST
1 = R2 + SSE/SST

Coefficient of Determination or R2
R2 describes how well the sample regression line fits the observed data. It tells us the proportion of the total variation in the dependent variable explained by variation in the explanatory variable. R2 is an index with no units associated, and 0 ≤ R2 ≤ 1.

Interpreting R2
R2 = 1 indicates a perfect fit. An R2 close to zero indicates a very poor fit of the regression line to the data.
[Figure: example with R2 = 0]

Computational Formulas
SSR = SST - SSE = 89315.20 - 27922.98 = 61392.21
R2 = SSR/SST = 61392.21/89315.20 = 0.68737
Interpreting the R2 value in terms of our problem: 68.74% of the variation in mortality rates is explained by variation in immunization rates.
Here is an interesting question. If we have the R2 value and want to know the correlation coefficient (the R value), can we determine it? Not from R2 alone, because R2 does not tell us the sign of the correlation coefficient. We can determine the sign of the relationship between the two variables by looking at the sign of the estimated slope of the regression equation.
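
The arithmetic on this slide can be reproduced with a short sketch. The totals 89315.20 (SST) and 27922.98 (SSE) are the figures quoted in the slide; the variable names are illustrative.

```python
# Sums of squares quoted on the slide for the mortality example.
sst = 89315.20  # total variation in mortality rates
sse = 27922.98  # unexplained variation

ssr = sst - sse                # explained variation
r_squared = ssr / sst          # R2 = SSR / SST
r_squared_alt = 1 - sse / sst  # equivalent form, R2 = 1 - SSE / SST

print(f"SSR = {ssr:.2f}")
print(f"R2  = {r_squared:.5f}")
print(f"Share of variation explained: {100 * r_squared:.2f}%")
```

Both forms of R2 give the same value up to floating-point rounding, which is a quick way to catch a transcription error in SST or SSE.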

Interpretation of R2 as a Descriptive Statistic
Suppose we find a very low R2 for a given sample. This implies that the sample regression line fits the observations poorly. One possible explanation is that X is a poor explanatory variable. This is a statement about the population regression line: that is, the population regression line is horizontal. We can test this with reference to the sample data. The null hypothesis is H0: β1 = 0. If we do not reject this null hypothesis, we have found no evidence that X influences Y, and Y appears to be influenced only by the random error term. Another explanation of a low R2 is that X is a relevant explanatory variable but that its influence on Y is weak compared to the influence of the error term.

Pearson's Correlation Coefficient
Correlation is used to measure the strength of the linear association between two variables. The correlation coefficient is an index: it has no units of measurement, and it carries a positive or negative sign. The boundaries for the correlation coefficient are -1 ≤ r ≤ 1. The values r = 1 and r = -1 occur when there is an exact linear relationship between x and y.

Pearson's Correlation Coefficient
[Figure: three panels: X and Y are perfectly negatively correlated; X and Y are perfectly positively correlated; X and Y are uncorrelated]

Pearson's Correlation Coefficient
As the relationship between x and y deviates from perfect linearity, |r| moves away from 1 toward 0. If y tends to decrease as x increases, the correlation is negative. If y tends to increase as x increases, the correlation is positive. If r = 0, we say x and y are uncorrelated: there is no linear relationship between the two variables. However, a non-linear relationship may exist.
[Figure: plot of Y against X showing data for which the correlation model should not be applied]

Computational Formula for r
Based on this sample, there appears to be a fairly strong linear relationship between the percentage of children immunized in a specified country and its under-5 mortality rate: the correlation coefficient, r = -0.829, is fairly close to 1 in absolute value. In addition, the relationship is negative: mortality decreases as percent immunized increases.
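
The slide's own formula did not survive extraction, so the sketch below uses one common computational form of Pearson's r, built from the sums of x, y, xy, x squared, and y squared. This form is an assumption about what the slide showed. The data are hypothetical (the raw immunization data are not reproduced in this transcript), and `pearson_r` is an illustrative name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation via the sums-based computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

# Hypothetical data: one noisy increasing series, one exactly linear decreasing series.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys_up = [2.0, 4.0, 5.0, 4.0, 5.0]
ys_down = [10.0, 8.0, 6.0, 4.0, 2.0]

r_up = pearson_r(xs, ys_up)
r_down = pearson_r(xs, ys_down)
print(f"r (noisy, increasing) = {r_up:.4f}")
print(f"r (exact, decreasing) = {r_down:.4f}")
```

The function divides by zero if either variable is constant, which matches the fact that correlation is undefined when one variable has no variation.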

Pearson's Correlation Coefficient
Limitations of the Correlation Model
The correlation model does not specify the nature of the relationship. Do not infer causality. An effective immunization program might be the primary reason for the decrease in mortality, but it is possible that the immunization program is a small part of an overall health care system that is responsible for the decrease in mortality.
The model measures linear relationships.
The Y values for a given X are assumed to be normally distributed, and the X values for a given Y are also assumed to be normally distributed. That is, we are sampling from a "bivariate normal distribution".
The model is very sensitive to outliers. If there are pairs of data points far outside the range of the other data points, they can alter the value of the correlation coefficient and give misleading results.
Do not extrapolate the correlation coefficient outside the range of the data points. The relationship between X and Y may change outside the range of sample points.

Testing Hypotheses about the Population Correlation Coefficient
Test whether there is a significant correlation, ρ, in the population between X and Y.
H0: ρ = 0 There is no linear association
H1: ρ ≠ 0 There is a significant linear association
The sample correlation coefficient is an approximately unbiased estimator of the population correlation coefficient, which we designate as ρ. That is, E(r) ≈ ρ, with equality when ρ = 0. The sampling distribution of the statistic r is approximately normal.

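Testing Hypotheses about the Population Correlation Coefficient
The standard error of the sample correlation coefficient is $s_r = \sqrt{(1 - r^2)/(n - 2)}$.
The test statistic is $t = (r - 0)/s_r = r\sqrt{n - 2}/\sqrt{1 - r^2}$, with n - 2 degrees of freedom.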

Testing Hypotheses about the Population Correlation Coefficient
Critical value at α = 0.05: t18,.05/2 = t18,.025 = 2.101
Degrees of freedom: df = n - 2 = 18
Decision rule: if -2.101 ≤ t ≤ 2.101, do not reject H0. Here t = -6.291, which falls outside this interval; therefore, reject H0.
Comparing the test statistic with the critical value, we reject the null hypothesis and conclude that there is a significant linear association between immunization rates and mortality rates.
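
The test can be sketched as follows, assuming the usual standard error of r, the square root of (1 - r squared) divided by (n - 2); that formula is an assumption here, since the slide's own formula was not preserved. The values r = -0.829075272, n = 20, and the critical value 2.101 are taken from this presentation; the function name is illustrative.

```python
import math

def correlation_t_statistic(r: float, n: int) -> float:
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    standard_error = math.sqrt((1 - r ** 2) / (n - 2))
    return r / standard_error

r = -0.829075272    # sample correlation reported in the presentation
n = 20              # sample size
t_critical = 2.101  # two-tailed critical value for df = 18, alpha = 0.05 (from the slide)

t_stat = correlation_t_statistic(r, n)
reject_h0 = abs(t_stat) > t_critical

print(f"t = {t_stat:.3f}")
print("reject H0" if reject_h0 else "do not reject H0")
```

The critical value is hard-coded from the slide rather than looked up, so the sketch needs only the standard library.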

Relationship between Correlation, R, and Coefficient of Determination, R2
In simple regression, r = R = the square root of the coefficient of determination, R2, taking the sign of the estimated slope.
R = -0.829 (correlation coefficient)
R2 = 0.687 (coefficient of determination)
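
A minimal sketch of recovering the correlation from R2 follows. Because R2 carries no sign, the sign is borrowed from the estimated slope (b1 = -2.831 in this presentation). The function name is an illustrative assumption, and the approach applies only to regression with a single explanatory variable.

```python
import math

def correlation_from_r_squared(r_squared: float, slope: float) -> float:
    """Square root of R2, carrying the sign of the estimated slope."""
    return math.copysign(math.sqrt(r_squared), slope)

r = correlation_from_r_squared(0.687, -2.831)
print(f"r = {r:.3f}")
```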

Computer Presentation of Correlation Matrix
Correlation of MORTRATE with MORTRATE: 1
Correlation of MORTRATE with IMMUNRATE: -0.829075272

Inferences about the Population Parameters
We want to use statistical inference to draw conclusions about the population parameters. For example, we might want to create a confidence interval for the slope (or intercept), or we might want to test whether the population slope, β1, (or intercept) equals zero.
We considered earlier the properties of the OLS estimators. These properties describe the sampling distributions. The estimators, b0 and b1, are linear combinations of the yi. This implies that the distribution of each estimator follows the distribution of the yi (or the error term). If the error terms are normal, the distribution of each estimator is normal. If the sample is large, the distribution of each estimator will be approximately normal even if the error terms are not normal. We also saw that the expected values of b0 and b1 are β0 and β1, respectively: E(b0) = β0 and E(b1) = β1.
[Figure: normal sampling distributions of b0 and b1]

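Inferences about the Population Parameters
Among all linear unbiased estimators, the OLS estimators have the smallest variance. The standard errors of b0 and b1 are
$\sigma_{b_0} = \sigma_{y|x}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}}$ and $\sigma_{b_1} = \frac{\sigma_{y|x}}{\sqrt{\sum (x_i - \bar{x})^2}}$.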

Inferences about the Population Parameters
Since σy|x is unknown, we substitute the standard error of the estimate, Se, and use the t distribution with n - 2 degrees of freedom. In order to use the t distribution, we now have to assume the yi's are normal. In large samples, the t distribution provides a good approximation even if the yi's are not normal.

Confidence Intervals Population Slope and Intercept
Use information about the sampling distributions to construct confidence intervals for the population slope and intercept. If the conditional probability distribution of Y|X follows a normal distribution, the intervals are
$b_1 \pm t_{n-2,\alpha/2}\, S_{b_1}$ for the slope and $b_0 \pm t_{n-2,\alpha/2}\, S_{b_0}$ for the intercept.

Confidence Intervals Population Slope and Intercept
t18,.05/2 = t18,.025 = 2.101
For the slope: -3.77 ≤ β1 ≤ -1.89 with a degree of confidence of .95
For the intercept: 203.77 ≤ β0 ≤ 352.75 with a degree of confidence of .95
If we drew repeated samples and built an interval from each, about 95 out of 100 such intervals would contain the population parameter.
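
These intervals can be approximately reproduced from summary statistics that appear elsewhere in this presentation. The sketch below makes several assumptions that the slides do not state explicitly: the slope is recovered as r times the ratio of the standard deviations, the intercept as the mean of Y minus the slope times the mean of X, and the standard errors use the usual OLS formulas with Se in place of σy|x. All variable names are illustrative.

```python
import math

# Summary statistics reported in the presentation.
n = 20
t_critical = 2.101                   # t with 18 df, alpha/2 = 0.025
x_mean, y_mean = 76.3, 62.2          # immunization and mortality means
x_variance = 402.9578947             # sample variance of X
x_sd, y_sd = 20.07381117, 68.56238036
r = -0.829075272                     # sample correlation
sse = 27922.98                       # unexplained variation

# Assumed route to the estimates: b1 = r * s_y / s_x, b0 = y_bar - b1 * x_bar.
b1 = r * y_sd / x_sd
b0 = y_mean - b1 * x_mean

s_e = math.sqrt(sse / (n - 2))       # standard error of the estimate
ss_x = x_variance * (n - 1)          # sum of squared deviations of X
s_b1 = s_e / math.sqrt(ss_x)
s_b0 = s_e * math.sqrt(1 / n + x_mean ** 2 / ss_x)

slope_ci = (b1 - t_critical * s_b1, b1 + t_critical * s_b1)
intercept_ci = (b0 - t_critical * s_b0, b0 + t_critical * s_b0)

print(f"slope:     {slope_ci[0]:.2f} to {slope_ci[1]:.2f}")
print(f"intercept: {intercept_ci[0]:.2f} to {intercept_ci[1]:.2f}")
```

Small differences in the last digit relative to the slide are expected, since the slide's inputs were rounded before the intervals were computed.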

Confidence Intervals Population Slope and Intercept
The interval estimates appear wide. Possible reasons: the sample size is small, and there is large variation in mortality for given immunization rates, so Se is large.

Tests of Hypotheses
The most common type of hypothesis tested with the regression model is that there is no relationship between the explanatory variable X and the dependent variable Y. The relationship between X and Y is given by the linear dependence of the mean value of Y on X, that is, E(Y|X) = β0 + β1x. To say there is no relationship means E(Y|X) is not linearly dependent on X, which is to say β1 equals zero.
H0: β1 = 0 There is no relationship between X and Y
H1: β1 ≠ 0 There is a significant relationship between X and Y
If we have a theory that suggests the direction of the relationship, then we will want a one-tail test.
The test statistic is $t = (b_1 - 0)/S_{b_1}$, with n - 2 degrees of freedom.

H0: β1 = 0 There is no relationship between X and Y

Tests of Hypotheses
The test statistic is $t = b_1/S_{b_1}$.
Set the level of significance.
Find the critical value in the t-table with df = n - 2.
Decision rule: if -tcv ≤ t-test ≤ tcv, do not reject H0.
[Figure: sampling distribution under the null hypothesis, shown for b1 (normal) and for t with n - 2 degrees of freedom, with "reject" regions in both tails and a "do not reject" region in the center]

Tests of Hypotheses
For our problem:
H0: β1 ≥ 0 No relationship between X and Y
H1: β1 < 0 An inverse relationship between X and Y
Test statistic: t = b1/Sb1 = -6.291, with b1 = -2.831
Let α = 0.05. Critical value: -t18,0.05 = -1.734
Decision rule: if -tcv ≤ t-test, do not reject H0. Since -1.734 > -6.291, reject H0.
[Figure: sampling distribution under the null hypothesis, shown for b1 and for t with n - 2 degrees of freedom; the observed values b1 = -2.831 and t = -6.291 fall in the "reject" region to the left of the critical value -1.734]
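
The two decision rules used in these slides can be written as small predicates. This is a sketch only: the function names are illustrative, and the critical values (2.101 two-tailed and 1.734 one-tailed, both for 18 degrees of freedom) are copied from the slides rather than computed.

```python
def reject_two_tailed(t_stat: float, t_critical: float) -> bool:
    """Reject H0: beta1 = 0 when t falls outside [-t_critical, t_critical]."""
    return abs(t_stat) > t_critical

def reject_lower_tailed(t_stat: float, t_critical: float) -> bool:
    """Reject H0: beta1 >= 0 when t falls below -t_critical."""
    return t_stat < -t_critical

t_stat = -6.291  # test statistic reported on the slide

two_tailed = reject_two_tailed(t_stat, 2.101)
lower_tailed = reject_lower_tailed(t_stat, 1.734)

print("two-tailed test:  ", "reject H0" if two_tailed else "do not reject H0")
print("lower-tailed test:", "reject H0" if lower_tailed else "do not reject H0")
```

Note that a large positive t would be rejected by the two-tailed rule but not by the lower-tailed rule, which is why the direction of the alternative hypothesis has to be fixed before looking at the data.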

Tests of Hypotheses
We conclude that the immunization rate is significantly and inversely related to the mortality rate. Remember: you want to reject the null, because rejecting it means you have found that your independent variable is related to the dependent variable.

Computer Output of the Problem
                     MORTALITY, Y    IMMUNIZED, X
Mean                 62.2            76.3
Standard Error       15.33101432     4.488640634
Median               31              83
Mode                 9               (not shown)
Standard Deviation   68.56238036     20.07381117
Sample Variance      4700.8          402.9578947
Range                220             72
Minimum              6               26
Maximum              226             98
Sum                  1244            1526
Count                20              20
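
Several rows of this output are tied together by simple identities, which the sketch below checks for the mortality column. It also shows how the sample variance links back to the total variation SST = 89315.20 used in the R2 calculation. The variable names are illustrative; the numbers are those in the output above.

```python
import math

# Mortality (Y) column of the descriptive statistics output.
n = 20
y_sd = 68.56238036
y_sum = 1244
y_min, y_max = 6, 226

y_mean = y_sum / n                      # mean = sum / count
y_variance = y_sd ** 2                  # sample variance = sd squared
y_standard_error = y_sd / math.sqrt(n)  # standard error of the mean
y_range = y_max - y_min                 # range = maximum - minimum
sst = y_variance * (n - 1)              # total variation of Y around its mean

print(f"mean = {y_mean}, variance = {y_variance:.1f}, "
      f"standard error = {y_standard_error:.4f}, range = {y_range}")
print(f"SST = {sst:.2f}")
```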

Excel Output
[Figure: Excel regression output annotated to show where Se, b0, b1, Sb0, and Sb1 appear]

Online Homework - Chapter 15 Overview Simple Regression
CengageNOW fourteenth assignment