The Regression Equation: Using the regression equation to individualize prediction and move beyond saying that everyone is equal, that everyone should score right at the mean.


Best fitting line: A review

The definition of the best fitting line plotted on t axes: A "best fitting line" minimizes the average squared vertical distance of the Y scores in the sample (expressed as t_Y scores) from the line. The best fitting line is a least squares, unbiased estimate of the values of Y in the sample. The generic formula for a line is Y = mX + b, where m is the slope and b is the Y intercept. Thus, any specific line, such as the best fitting line, can be defined by its slope and its intercept.

The intercept of the best fitting line plotted on t axes: The origin is the point where both t_X and t_Y = 0.000, so the origin represents the mean of both the X and the Y variable. When plotted on t axes, all best fitting lines go through the origin. Thus, the t_Y intercept of the best fitting line = 0.000.

The slope of and formula for the best fitting line: When plotted on t axes, the slope of the best fitting line = r, the correlation coefficient. To define a line we need its slope and Y intercept: here the slope = r and the t_Y intercept = 0.000. The formula for the best fitting line is therefore t_Y = r·t_X.
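
This is easy to check numerically. Below is a minimal Python sketch (my addition, not part of the slides; the data are made up and NumPy is assumed) showing that once both variables are expressed as t scores, the best fitting line has slope r and intercept 0.000:

```python
import numpy as np

# Hypothetical paired scores, for illustration only.
X = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
Y = np.array([1.0, 3.0, 6.0, 6.0, 8.0])

# t scores: deviations from the sample mean divided by the
# estimated standard deviation (ddof=1 gives the n - 1 estimate).
t_X = (X - X.mean()) / X.std(ddof=1)
t_Y = (Y - Y.mean()) / Y.std(ddof=1)

# r is the slope of the best fitting line on t axes; the intercept
# is 0.000, so the line passes through the origin (the point
# representing the means of both X and Y).
r = np.corrcoef(X, Y)[0, 1]
best_fit_t_Y = r * t_X            # t_Y = r * t_X
```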

Here’s how a visual representation of the best fitting line (slope = r, Y intercept = 0.000) and the dots representing the t_X and t_Y scores might be described. (Whether the correlation is positive or negative doesn’t matter.) Perfect - scores fall exactly on a straight line. Strong - most scores fall near the line. Moderate - some are near the line, some not. Weak - the scores are only mildly linear. Independent - the scores are not linear at all.

Strength of a relationship: Perfect

Strength of a relationship: Strong (r about .800)

Strength of a relationship: Moderate (r about .500)

Strength of a relationship: Independent (r about .000)

r = .800: the formula for the best fitting line = ???

r = -.800: the formula for the best fitting line = ???

r = 0.000: the formula for the best fitting line is t_Y = 0.000(t_X) = 0.000

Notice what that formula for independent variables says: t_Y = r·t_X = 0.000(t_X) = 0.000. When t_Y = 0.000, you are at the mean of Y. So, when variables are independent, the best fitting line says that the best estimate of Y scores in the sample is back to the mean of Y, regardless of the score on X. Thus, when variables are independent we go back to saying everyone will score right at the mean.

A note of caution: Watch out for the plot for which the best fitting line is a curve

Moving from the best fitting line to the regression equation and the regression line.

The best fitting line (t_Y = r·t_X) was the line closest to the Y values in the sample. But what should we do if we want to go beyond our sample and use a version of our best fitting line to make individualized predictions for the rest of the population?

What do we need to do to be able to use the regression equation t_Y' = r·t_X? Notice this is not quite the same as the formula for the best fitting line. The formula now reads t_Y' (read "t-Y-prime"), not t_Y. t_Y' is the predicted score on the Y variable for every X score in the population within the range observed in our random sample. Before, we were describing the linear relationship of the X and Y variables in our sample. Now we are predicting estimated Z scores (t scores) for most or all of the population.
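
To make the mechanics concrete, here is a short Python sketch (my addition; the function name and data are hypothetical, NumPy is assumed) that applies t_Y' = r·t_X to a new X score and converts the predicted t score back into raw Y units:

```python
import numpy as np

def predict_y(x_new, X, Y):
    """Predict Y' for a new X score: express x_new as a t score,
    apply the regression equation t_Y' = r * t_X, then convert
    the predicted t score back into raw Y units."""
    r = np.corrcoef(X, Y)[0, 1]
    t_x = (x_new - X.mean()) / X.std(ddof=1)    # new score on t axes
    t_y_prime = r * t_x                         # the regression equation
    return Y.mean() + t_y_prime * Y.std(ddof=1)

# Hypothetical sample; predict for someone who scored 6 on X.
X = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
Y = np.array([1.0, 3.0, 6.0, 6.0, 8.0])
print(predict_y(6.0, X, Y))
```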

This is one of the key points in the course: a point when things change radically. Up to this point, we have just been describing scores, means, and relationships. We have not yet gone beyond predicting that everyone in the population who was not in our sample will score at the mean of Y. But now we want to be able to make individualized predictions for the rest of the population, the people who were not in our sample and for whom we don’t have Y scores.

That’s dangerous. Our first rule as scientists is “Do not increase error.” Individualizing prediction can easily do that. Let me give you an example.

Assume you are the personnel officer for a mid-size company. You need to hire a typist. There are 2 applicants for the job. You give the applicants a typing test. Which would you hire: someone who types 6 words a minute with 12 mistakes, or someone who types 100 words a minute with 1 mistake?

Who would you hire? Of course, you would predict that the second person will be a better typist and hire that person. Notice that we never gave the person with 6 words/minute a chance to be a typist in our firm. We prejudged her on the basis of the typing test. That is probably valid in this case – a typing test probably predicts fairly well how good a typist someone will be.

But say the situation is a little more complicated! You have several applicants for a leadership position in your firm. But it is not 2002, it is 1957, when we “knew” that only white males were capable of leadership in corporate America. That is, we all “knew” that leadership ability was correlated with both gender and skin color: white and male were associated with high leadership ability, darker skin color and female gender with lower leadership ability. We now know this is absurd, but lots of people were never given a chance to try their hand at leadership because of pre-judgment that you can now see as obvious prejudice.

We would have been much better off saying that everyone is equal, everyone should be predicted to score at the mean. Pre-judgements on the basis of supposed relationships between variables that have no real scientific support are a form of prejudice. They cost potential leaders jobs in which they could have shown their ability. That is unfair. Moreover, by excluding such individuals, you narrow the talent pool of potential leaders. The more restricted the group of potential leaders, the less talented the average leader will be. This is why aristocracies don’t work in the long run. The talent pool is too small.

So, to avoid prejudice you must start with the notion that everyone will score at the mean. In correlational language, to make that prediction you have to hypothesize that rho = 0.000. Only if you can disprove the notion that rho = 0.000, and at no other time, should you make any other prediction.

We call the hypothesis that rho = 0.000 the null hypothesis. The symbol for the null hypothesis is H_0. We will see the null hypothesis many times during the rest of this course. It is the hypothesis that you will learn to test statistically.

Confidence intervals around rho_T

Confidence intervals around rho_T: relation to Chapter 6. In Chapter 6 we learned to create confidence intervals around mu_T that allowed us to test a theory. To test our theory about mu we took a random sample, computed the sample mean and standard deviation, and determined whether the sample mean fell into that interval. If the sample mean fell into the confidence interval, there was some support for our theory, and we held onto it. The interesting case was when the sample mean fell outside the confidence interval. In that case, the data from our sample falsified our theory, so we had to discard the theory and the estimate of mu specified by the theory.

If we discard a theory-based prediction, what do we use in its place? Generally, our best estimate of a population parameter is the sample statistic that estimates it. Our best estimate of mu has been and is the sample mean, X-bar. Since X-bar fell outside the confidence interval, we discarded our theory. Then we were back to using X-bar, the sample mean that fell outside the confidence interval and falsified our theory, as our best (least squares, unbiased, consistent) estimate of mu.
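
Here is what that Chapter 6 procedure looks like in code — a sketch of my own, assuming NumPy, SciPy, and made-up data, with the CI.95 built around the theorized mu_T using the estimated standard error of the mean:

```python
import numpy as np
from scipy import stats

def theory_survives(mu_T, sample, alpha=0.05):
    """Build a CI around the theorized mu_T and report whether
    the sample mean falls inside it."""
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    s = sample.std(ddof=1)                    # estimated sigma
    t_crit = stats.t.ppf(1 - alpha / 2, n - 1)
    half_width = t_crit * s / np.sqrt(n)      # t_crit * estimated SE
    return mu_T - half_width <= sample.mean() <= mu_T + half_width

# Hypothetical use: the theory says mu_T = 50.
print(theory_survives(50.0, [47.0, 52.0, 49.0, 55.0, 51.0]))
```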

To test any theory about any population parameter, we go through similar steps:
–We theorize about the value of the population parameter.
–We obtain some measure of the variability of sample-based estimates of the population parameter.
–We create a test of the theory about the population parameter by creating a confidence interval, almost always a CI.95.
–We then obtain a random sample and measure the sample statistic that estimates the parameter.

The sample statistic will fall inside or outside of the CI.95. If the sample statistic falls inside the confidence interval, our theory has received some support and we hold on to it. But the more interesting case is when the sample statistic falls outside the confidence interval. Then we must discard the theory and the theory-based estimate of the population parameter. In that case, our best estimate of the population parameter is the sample statistic. Remember, the sample statistic is a least squares, unbiased, consistent estimate of its population parameter.

We are going to do the same thing with a theory about rho. Rho is the correlation coefficient for the population. If we have a theory about rho, we can create a 95% confidence interval into which we expect r will fall. An r computed from a random sample will then fall inside or outside the confidence interval.

When r falls inside or outside of the CI.95 around rho_T: If r falls inside the confidence interval, our theory about rho has received some support and we hold on to it. But the more interesting case is when r falls outside the confidence interval. Then we must discard the theory and the theory-based estimate of the population parameter. In that case, our best estimate of rho is the r we found in our random sample. Thus, when r falls outside the CI.95 we can go back to using it as a least squares, unbiased estimate of rho.

Then what? Then we can use the r from our sample, the r that falsified the theory that rho = 0.000, in the regression equation: t_Y' = r·t_X

To repeat: If rho = 0.000, we should go back to saying everyone is equal, everyone will score at the mean of Y. To be fair and avoid doing damage, we must test the hypothesis that rho = 0.000 before doing anything else. To test the theory that rho = 0.000, we create a CI.95 for rho = 0.000. If, and only if, we disprove the notion that rho = 0.000 by having r fall outside the CI.95 can we use r in the regression equation, t_Y' = r·t_X.

I could teach you how to calculate the confidence interval for rho = 0.000. But other people have already calculated the intervals for many different df. Those calculations are summarized in the r table.

How the r table is laid out: the important columns
–Column 1 of the r table shows degrees of freedom for correlation and regression (df_REG); df_REG = n_P - 2.
–Column 2 shows the CI.95 for varying degrees of freedom.
–Column 3 shows the absolute value of the r that falls just outside the CI.95. Any r this far or further from 0.000 falsifies the hypothesis that rho = 0.000 and can be used in the regression equation to make predictions of Y scores for people who were not in the original sample but who were part of the population from which the sample was drawn.

[The r table. Find your degrees of freedom (n_P - 2) in the df column. Column 2 gives the CI.95 around 0.000: if r falls within that interval, the result is not significant; you cannot reject the null hypothesis and must assume that rho = 0.000. Column 3: if the absolute value of r equals or exceeds the value in this column, r is significant with alpha = .05; you can consider it an unbiased, least squares estimate of rho and use it in the regression equation to estimate Y scores.]
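
The entries in such a table can also be computed directly from the t distribution. The sketch below is my own (SciPy assumed; the function name is made up), using the standard identity r_crit = t_crit / sqrt(t_crit² + df):

```python
from scipy import stats

def critical_r(n_p, alpha=0.05):
    """The |r| that falls just outside the two-tailed CI around
    rho = 0.000 for df_REG = n_p - 2."""
    df = n_p - 2
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return t_crit / (t_crit ** 2 + df) ** 0.5

print(critical_r(10))         # df = 8  -> about .632 (a Column 3 entry)
print(critical_r(30))         # df = 28 -> about .361 (a Column 3 entry)
print(critical_r(30, 0.01))   # df = 28 -> about .463 (alpha = .01)
```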

Pizza and horror films. H_1: People who enjoy food with strong flavors also enjoy other strong sensations. H_0: There is no relationship between enjoying food with strong flavors and enjoying other strong sensations. [Data: each person rates liking for anchovies and liking for horror films on a 0-9 scale.] Can we reject the null hypothesis?

Can we reject the null hypothesis? [Scatterplot of horror film ratings against pizza ratings.]

Can we reject the null hypothesis? We do the math and we find that: r = .352, df = 8.

[The r table, df = 8 row: the CI.95 around rho = 0.000 runs from -.632 to +.632.]

This finding falls within the CI.95 around 0.000. We call such findings “nonsignificant.” Nonsignificant is abbreviated n.s. We would report this finding as follows: r(8) = 0.352, n.s. Given that it fell inside the CI.95, we must assume that rho actually equals zero and that our sample r differs from zero solely because of sampling fluctuation. We go back to predicting that everyone will score at the mean of Y.
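
The same table lookup can be reproduced in a couple of lines — again a sketch of mine, with SciPy assumed:

```python
from scipy import stats

r, df = 0.352, 8                              # pizza / horror film sample
t_crit = stats.t.ppf(0.975, df)               # two-tailed, alpha = .05
r_crit = t_crit / (t_crit ** 2 + df) ** 0.5   # about .632 for df = 8
print(abs(r) >= r_crit)                       # False -> r(8) = 0.352, n.s.
```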

That seems like a good idea: I would guess that, like most pairs of variables, desire for anchovy pizza and desire for horror movies are not really correlated. This sample probably has an r of .352 solely because of the way samples of this size fluctuate around a rho of zero.

How to report a significant r. For example, let’s say that you had a sample (n_P = 30) and r = -.400. Looking under n_P - 2 = 28 df_REG, we find the interval consistent with the null is between -.361 and +.361. So we are outside the CI.95 for rho = 0.000. We would write that result as r(28) = -.400, p<.05. That tells you the df_REG, the value of r, and that you can expect an r that far from 0.000 five or fewer times in 100 when rho = 0.000.

Then there is Column 4. Column 4 shows the values that lie outside a CI.99. (The CI.99 itself isn’t shown like the CI.95 in Column 2 because it isn’t important enough.) However, Column 4 gives you bragging rights. If your r is as far or further from 0.000 as the number in Column 4, you can say there is 1 or fewer chance in 100 of an r being this far from zero (p<.01). For example, let’s say that you had a sample (n_P = 30) and r = -.525. The critical value at .01 is .463. You are further from 0.000 than that, so you can brag. You write that result as r(28) = -.525, p<.01.

To summarize: If r falls inside the CI.95 around 0.000, it is nonsignificant (n.s.) and you can’t use the regression equation (e.g., r(28) = .300, n.s.). If r falls outside the CI.95, but not as far from 0.000 as the number in Column 4, you have a significant finding and can use the regression equation (e.g., r(28) = -.400, p<.05). If r is as far or further from zero as the number in Column 4, you can use the regression equation and brag while doing it (e.g., r(28) = -.525, p<.01).