The Regression Equation: How we can move beyond predicting that everyone should score right at the mean by using the regression equation to individualize prediction.

Prediction using X scores and r We are trying to predict, as accurately as possible, the values of Y (or t_Y) using our knowledge of a person's X score and of the correlation between X and Y scores. There are both dangers and advantages to individualizing predictions.

Potential advantages and disadvantages The potential advantage is the possibility of making more precise (less wrong) predictions than we would make by saying everyone will score right at the estimated population mean, that is, the sample mean. We are saying, in effect, that we know something about this person, and that this kind of person should score a specific number of points above or below the mean of Y. The potential danger is that we will make things a good deal worse than they would have been had we stayed with predicting that everyone will score right at the mean.

Best fitting line: A review

The definition of the best fitting line plotted on t axes The best fitting line is a least squares, unbiased estimate of the values of Y in the sample. A "best fitting line" minimizes the average squared vertical distance of the Y scores in the sample (expressed as t_Y scores) from the line. The generic formula for a line is Y = mX + b, where m is the slope and b is the Y intercept. Thus, any specific line, such as the best fitting line, can be defined by its slope and its intercept.

The intercept of the best fitting line plotted on t axes The origin is the point where both t_X and t_Y = 0.000. So the origin represents the mean of both the X and the Y variable. When plotted on t axes, all best fitting lines go through the origin. Thus, the t_Y intercept of the best fitting line = 0.000.

The slope of and formula for the best fitting line When plotted on t axes, the slope of the best fitting line = r, the estimated correlation coefficient. To define a line we need its slope and its Y intercept: r = the slope and the t_Y intercept = 0.000. The formula for the best fitting line is therefore t_Y = r·t_X.
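That claim is easy to check numerically. The sketch below (Python with NumPy, on made-up data, so the numbers themselves are not from the course) standardizes both variables into t scores, fits a least squares line, and shows that the slope comes out equal to r and the intercept comes out at 0.000.

```python
# A minimal sketch, assuming NumPy and synthetic data: on t axes, the
# least squares line has slope r and intercept 0.000.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)        # build in a moderate relationship

# Convert each variable to t scores: (score - mean) / estimated sd (n - 1)
t_x = (x - x.mean()) / x.std(ddof=1)
t_y = (y - y.mean()) / y.std(ddof=1)

r = np.corrcoef(x, y)[0, 1]                 # the estimated correlation coefficient
slope, intercept = np.polyfit(t_x, t_y, 1)  # least squares fit on the t axes

print(round(slope, 3) == round(r, 3))    # True: the slope is r
print(round(abs(intercept), 3))          # 0.0: the line passes through the origin
```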

Here's how a visual representation of the best fitting line (slope = r, t_Y intercept = 0.000) and the dots representing t_X and t_Y scores might be described. (Whether the correlation is positive or negative doesn't matter.) Perfect - scores fall exactly on a straight line. Strong - most scores fall near the line. Moderate - some are near the line, some not. Weak - the scores are only mildly linear. Independent - the scores are not linear at all.

Strength of a relationship: Perfect

Strength of a relationship: Strong (r about .800)

Strength of a relationship: Moderate (r about .500)

Strength of a relationship: Independent (r about 0.000)

An Important Warning Notice that each best fitting line starts at the lowest value of X and ends at the highest value of X. Outside the range of the X scores you saw in your random sample, you cannot assume the correlation stays linear. If it doesn't, the best fitting line becomes a terrible fit.

What would happen if you only had pairs with positive t_X scores when you fit a line here? Outside the range of X scores in your sample, you can't assume linearity.

Notice what the formula says for independent variables: t_Y = r·t_X = 0.000(t_X) = 0.000. When t_Y = 0.000, you are at the mean of Y. So, when variables are independent, the best fitting line says that making the best estimate of the Y scores in the sample requires you to go back to predicting that everyone will score at the mean of Y (regardless of his or her score on X). Thus, when variables are independent, we go back to saying everyone will score right at the mean. The visual representation of this is the t_X axis: the horizontal line on which, for every value of t_X, the predicted score is the mean of Y (t_Y = 0.000).
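To see that consequence concretely, here is a tiny sketch (Python, with hypothetical numbers): with r = 0.000 the regression equation hands back the mean of Y for every X score.

```python
# With r = 0.000, t_Y' = 0.000 * t_X = 0.000 for any t_X, so the predicted
# raw score is always the mean of Y. The mean and sd here are hypothetical.
y_bar, s_y = 100.0, 15.0

for t_x in (-2.0, 0.0, 3.0):             # any t_X score at all
    t_y_prime = 0.000 * t_x              # the regression equation with r = 0
    print(y_bar + t_y_prime * s_y)       # prints 100.0 every time
```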

Moving from the best fitting line to the regression equation and the regression line.

The best fitting line (t_Y = r·t_X) was the line closest to the Y values in the sample. But what should we do if we want to go beyond our sample and use a version of our best fitting line to make individualized predictions for the rest of the population?

t_Y' = r·t_X Notice this is not quite the same as the formula for the best fitting line. The formula now reads t_Y' (read "t-Y-prime"), not t_Y. t_Y' is the predicted score on the Y variable for every X score in the population falling within the range of X scores observed in our random sample. Until this point, we have been describing the linear relationship of the X and Y variables in our sample. Now we are predicting t_Y' scores (estimated Z_Y scores) for everyone in the population whose X score is in the range of X scores in our random sample.
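Here is a sketch of how such a prediction might be computed. The function name and every statistic below are hypothetical, invented for illustration; in practice X-bar, s_X, Y-bar, s_Y, and r come from your random sample.

```python
# A minimal sketch of the regression equation t_Y' = r * t_X, applied to a
# raw X score and converted back into a raw predicted Y' score.
def predict_y(x, x_bar, s_x, y_bar, s_y, r):
    t_x = (x - x_bar) / s_x              # turn the raw X score into t_X
    t_y_prime = r * t_x                  # the regression equation
    return y_bar + t_y_prime * s_y       # turn t_Y' back into raw units

# Hypothetical sample statistics: X-bar = 50, s_X = 10, Y-bar = 100, s_Y = 15, r = .400
print(predict_y(60, 50, 10, 100, 15, 0.400))   # 106.0
# An X score 1.000 t above the mean predicts a Y score .400 t above the mean.
```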

This is one of the key points in the course; a point when things change radically. Up to this point, we have just been describing scores, means and relationships. We have continued to predict that everyone in the population who was not in our sample will score at the mean of Y. But now we want to be able to make individualized predictions for the rest of the population: the people who were not in our sample, for whom we can obtain X scores but who don't have Y scores.

In the context of correlation and regression (Ch. 7 & 8), this means using the correlation between X and Y and someone's score on the X variable to predict a Y score from a pre-existing difference among individuals. In Chapter 9 we will determine whether such predictions can be made from the different ways people are treated in an experiment.

Both are somewhat dangerous. Our first rule as scientists is: “Do not increase error.” Individualizing prediction can easily do that. Let me give you an example.

Assume you are the personnel officer for a mid-size company. You need to hire a typist. There are two applicants for the job. You give the applicants a typing test. Which would you hire: someone who types 6 words a minute with 12 mistakes, or someone who types 100 words a minute with 1 mistake?

Whom would you hire? Of course, you would predict that the second person will be a better typist and hire that person. Notice that we never gave the person who typed 6 words a minute a chance to be a typist in our firm. We prejudged her on the basis of the typing test. That is probably valid in this case – a typing test probably predicts fairly well how good a typist someone will be.

But say the situation is a little more complicated! You have several applicants for a leadership position in your firm. But it is not 2006, it is 1956, when we all "knew" that only white males were capable of leadership in corporate America. That is, we all "knew" that leadership ability is correlated with both gender and skin color: being white and male was associated with high leadership ability, and darker skin color and female gender with lower leadership ability.

In 1956, it would have been just as absurd to hire someone of color or a woman for a leadership position as it would be to hire the bad (6-words-a-minute-with-12-mistakes) typist now. Everyone knew that 1) they couldn't do the job, and/or that, even if they had some talent, 2) no subordinate would be comfortable following him or her. We now know this is absurd, but lots of people were never given a chance to try their hand at leadership because of a socially based pre-judgment that you can now see as obvious prejudice.

We would have been much better off saying that everyone is equal, that everyone should be predicted to score at the mean. Pre-judgments on the basis of supposed relationships between variables that have no real scientific support may well be mere prejudice. In the case we just discussed, they cost potential leaders jobs in which they could have shown their ability. That is unfair.

We would have been much better off saying that everyone is equal, that everyone should be predicted to score at the mean. Moreover, by excluding such individuals, you narrow the talent pool of potential leaders. The more restricted the group of potential leaders, the less talented the average leader will be. This is why aristocracies don't work in the long run: the talent pool is too small.

So, to avoid prejudice, you must start with the notion that everyone will score at the mean of Y no matter how they scored on the X variable, the predictor variable. In math language, you have to predict that Z_Y' ≈ t_Y' = 0.00 for every value of X. How can you do that? To figure it out, look carefully at the regression equation: t_Y' = r·t_X.

If t_Y' = r·t_X, the only way that t_Y' will always equal 0.00 is when r = 0.000. (Actually, in this case we are using a theory that X and Y scores cannot be used to predict each other. When you are talking theory, the proper form of the regression equation is Artie's sister, Rosie: Z_Y' = rho·Z_X.) Thus, to predict that everyone in the population will score at the mean of Y (Z_Y' = 0.00), you have to hypothesize that rho = 0.000.

So, to avoid the possibility of disastrous mistakes and prejudice, only if you can disprove the notion that rho = 0.000, and at no other time, should you make any prediction other than "Everyone is equal; everyone should be predicted to score right at the mean."

In regression analysis, we call the hypothesis that rho = 0.000 the null hypothesis. The symbol for the null hypothesis is H_0. We will see the null hypothesis many times during the rest of this course. Though it appears in several forms, the null hypothesis is the only hypothesis that you will learn to test statistically in this course.

To use the regression equation to predict Y scores for the part of the population that was not in your random sample, you must falsify and reject the null. You must start with the assumption that rho = 0.000, and that the best prediction is that everyone will score right at the mean, unless you can prove otherwise. Proof does not mean beyond any doubt; it means beyond a reasonable doubt. There have to be five chances in 100 or less of our sample looking as it does, if the null were true, for us to declare the null false and reject it.

Can you say with absolute certainty that the null hypothesis is wrong? NO! NEVER! However, you can say that there are 5 or fewer chances in 100 that you would get a correlation as strong as the one you found in your random sample when the null is true (p<.05). If your results are particularly strong, you may be able to brag by saying that there is one or fewer chance in 100 of finding your data when the null is true (p<.01). But there is always some chance that you have simply obtained an unusual random sample and that the null really is true.

Statistical Significance In either event (p<.05 or p<.01), your results are considered statistically significant: you must consider the null so wrong about the data you obtained from your random sample that you must discard it. You then use the regression equation to make predictions of Y scores, using each person's X score and the correlation of X and Y in your random sample.

Remember: when your results are statistically significant, you have falsified, and must reject, the null hypothesis.

ONE FURTHER CRUCIAL WARNING

You can't use the regression equation to predict t_Y' when you have an X score outside the range of scores you observed in your random sample. The one thing you know about someone with an X score outside the range of scores in your random sample is that YOU KNOW NOTHING ABOUT SUCH PEOPLE. You never saw one before.

The regression line has end points, just like the best fitting line. Notice that each regression line starts at the lowest value of X and ends at the highest value of X. Outside the range of the X scores you saw in your random sample, you cannot assume the correlation stays linear. If it doesn't, your predictions of Y can become absurdly wrong.

Like the example in your book: Imagine you measured the height of girls between 2 and 14, with a mean age of 8.00 and a standard deviation of 3 years. Mean height was 48", with a standard deviation of 8". You find a correlation of .600 between height and age: older girls tend to be taller. So, the regression equation is t_Y' = .600·t_X. This allows you to predict that at age 5 (t_X = -1.00, so t_Y' = -.600) the average girl will be about 43.2" tall, while at age 14 (t_X = 2.00, so t_Y' = 1.200) girls will average about 57.6" tall.

Now you want to predict the height of the average 83-year-old. Let's say you forget the rule about not predicting outside the range of X scores in your sample. If we translate 83 to a t_X score, we get t_X = (83-8)/3 = 25.00. Using the regression equation, you would predict that t_Y' = .600(25.00) = 15.00, so Y' = Y-bar + t_Y'(s_Y) = 48" + 15.00(8") = 168". You just predicted that the average 83-year-old will be 14 feet tall.
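The same arithmetic written out as a short script (Python assumed; the statistics are the ones given in the example above):

```python
# Extrapolating far outside the sampled age range of 2 to 14.
age_bar, s_age = 8.00, 3.0          # sample mean and sd of age
height_bar, s_height = 48.0, 8.0    # sample mean and sd of height (inches)
r = 0.600

t_x = (83 - age_bar) / s_age                   # (83 - 8) / 3 = 25.00
t_y_prime = r * t_x                            # .600 * 25.00 = 15.00
y_prime = height_bar + t_y_prime * s_height    # 48 + 15.00 * 8 = 168 inches

print(y_prime, y_prime / 12)                   # 168.0 inches = 14.0 feet
```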

What went wrong? You assumed linearity outside the range of scores you had seen on the X variable (age). You saw kids aged 2 to 14. They are growing, and within that age group age and height have a positive, linear relationship. But by 14 or 15 growth stops, and the curve goes from rising to flat, eventually going down a little. It doesn't stay linear. It becomes curvilinear as soon as you include kids 15 and over and adults along with kids.

So, outside the range of X scores in your sample, you can't assume linearity. (Figure: age and height in a sample that includes older people; here mean age = 12.)

Predict they will score right at the mean of Y Remember: the one thing you know about someone with an X score outside the range of scores in your random sample is that YOU KNOW NOTHING ABOUT SUCH PEOPLE. You never saw one before. So you do what we always do when we don't know anything about someone: we predict they will be average, that they will score right at the mean of Y.

Confidence intervals around rho_T

In Chapter 6 we learned to create confidence intervals around mu_T that allowed us to test a theory. To test our theory about mu, we took a random sample, computed the sample mean and standard deviation, and determined whether the sample mean fell into that interval. If the sample mean fell into the confidence interval, there was some support for our theory, and we held onto it.

Confidence intervals around mu_T The interesting case was when the sample mean fell outside the confidence interval around mu_T. In that case, the data from our sample falsified our theory, so we had to discard the theory and the estimate of mu specified by the theory.

If we discard a theory-based prediction, what do we use in its place? Generally, our best estimate of a population parameter is the sample statistic that estimates it. Our best estimate of mu has been, and is, the sample mean, X-bar. Since X-bar fell outside the confidence interval, we discarded our theory about the value of mu. If we reject the theory (hypothesis) about mu, we must go back to using X-bar, the sample mean that fell outside the confidence interval and falsified our theory, as our best (least squares, unbiased, consistent) estimate of mu.

To test any theory about any population parameter, we go through similar steps: We theorize about the value of the population parameter. We obtain some measure of the variability of sample-based estimates of the population parameter. We create a test of the theory about the population parameter by creating a confidence interval, almost always a CI.95. We then obtain a random sample and measure the sample statistic that estimates the parameter.
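As a concrete illustration of those steps for the Chapter 6 case (a theory about mu), the sketch below builds a CI.95 around a theorized mu and checks whether the sample mean falls inside it. SciPy is assumed for the critical t value, and the data are invented for the example.

```python
# A minimal sketch: test a theory about mu with a CI.95 around mu_T.
import numpy as np
from scipy import stats

mu_theory = 100.0                                  # the theorized value of mu
sample = np.array([108, 112, 99, 105, 110, 107, 103, 111])  # hypothetical data

n = len(sample)
sem = sample.std(ddof=1) / np.sqrt(n)              # estimated standard error
t_crit = stats.t.ppf(0.975, df=n - 1)              # two-tailed .05 critical t

ci_low = mu_theory - t_crit * sem
ci_high = mu_theory + t_crit * sem
x_bar = sample.mean()

print((round(ci_low, 2), round(ci_high, 2)), x_bar)
print(ci_low <= x_bar <= ci_high)   # False here: the theory about mu is falsified
```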

Which hypothesis about rho (the population correlation coefficient) do we always test in this way? THE NULL HYPOTHESIS. The null says that there is no relationship, in the population as a whole, between any two variables we choose to study. Mathematically, we would say that rho = 0.000. A corollary: if rho = 0.000, any nonzero correlation found in a random sample is simply a poor estimate of rho. Thus, whatever nonzero correlation you found in the sample is illusory. It is simply a mediocre estimate of zero.

Testing the null hypothesis (rho = 0.000) in correlation and regression

Testing the theory To test the theory that rho = 0.000, we create a CI.95 around rho = 0.000. We then obtain r from a random sample. If r falls outside the CI.95, we have shown that the theory that rho = 0.000 does not explain our data. Therefore we hold the theory to be false.

Summary on significance If rho = 0.000, we should go back to saying everyone is equal, everyone will score at the mean of Y. To be fair and avoid doing damage, we must test the hypothesis that rho = 0.000 before doing anything else. To test the theory that rho = 0.000, we create a CI.95 around rho = 0.000. If, and only if, we disprove the notion that rho = 0.000 by having r fall outside the CI.95 can we use r in the regression equation, t_Y' = r·t_X.

Two examples: Example 1: John was not a member of our random sample, but he is a member of the population from which it was drawn. His X score falls inside the range seen in our random sample. r fell inside the confidence interval around 0.000 (the interval consistent with the null hypothesis). Should we use the regression equation to predict what score John will obtain on the Y variable?

NO. Predict John will score right at the mean of Y.

Two examples: Example 2: John was not a member of our random sample, but he is a member of the population from which it was drawn. His X score falls inside the range seen in our random sample. r fell outside the confidence interval around 0.000 (the interval consistent with the null hypothesis). Should we use the regression equation to predict what score John will obtain on the Y variable?

YES. Use the regression equation to predict John’s score on the Y variable.

I could teach you how to calculate the confidence interval around rho = 0.000, but other people have already calculated the intervals for many different df. Those calculations are summarized in the r table.

[r table: columns show df, the nonsignificant interval (the CI.95 around rho = 0.000, running from -r to +r), and the critical values of r at alpha = .05 and .01.]

How the r table is laid out: the important columns
– Column 1 of the r table shows degrees of freedom for correlation and regression (df_REG = n_P - 2).
– Column 2 shows the CI.95 around rho = 0.000 for varying degrees of freedom.
– Column 3 shows the absolute value of the r that falls just outside the CI.95. Any r this far or further from 0.000 falsifies the hypothesis that rho = 0.000 and can be used in the regression equation to make predictions of Y scores for people who were not in the original sample but who are part of the population from which the sample was drawn.
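As a sketch of what each row encodes (the three rows below are taken from a standard two-tailed table of critical values for Pearson's r; the dictionary itself is just an illustration, not part of the course materials):

```python
# df_REG -> (.05 critical value, .01 critical value), from a standard r table.
R_TABLE = {
    8:  (0.632, 0.765),
    10: (0.576, 0.708),
    28: (0.361, 0.463),
}

def ci_95(df):
    """Column 2: the CI.95 around rho = 0.000 runs from -crit to +crit."""
    crit_05, _ = R_TABLE[df]
    return (-crit_05, crit_05)

print(ci_95(28))   # (-0.361, 0.361): any r inside this interval is nonsignificant
```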


How to read the r table: Find your degrees of freedom (n_P - 2) in the df column. If r falls within the 95% CI around 0.000 shown there, the result is not significant: you cannot reject the null hypothesis, and you must assume that rho = 0.000. Does the absolute value of r equal or exceed the value in the .05 column? Then r is significant with alpha = .05: you can consider it an unbiased, least squares estimate of rho and use it in the regression equation to estimate Y scores.

Testing H_0: rho = 0.000 To test the null, select a random sample, then see if the resultant r falls inside or outside the CI.95 around 0.000.

Let's test the hypothesis that liking for strong sensations in one area is related to liking for strong sensations in other areas. To test our hypothesis, we ask a random sample about their liking for two things that usually produce strong sensations: anchovy pizza and horror movies.

Ratings of liking for anchovy pizza and horror films H_1: People who enjoy food with strong flavors also enjoy other strong sensations. H_0: There is no relationship between enjoying food with strong flavors and enjoying other strong sensations. Each person rated anchovy pizza and horror films on a 0-9 scale. Can we reject the null hypothesis?

Is the scatterplot of these ratings (horror films plotted against pizza) more or less linear? Yes.

We do the math and we find that r = .352, df = 8. Can we reject the null hypothesis?


This finding falls within the CI.95 around 0.000. We call such findings "nonsignificant." Nonsignificant is abbreviated n.s. We would report these findings as follows: r(8) = 0.352, n.s. In English, we would say that the correlation with 8 degrees of freedom was .352. That finding is nonsignificant, and we fail to falsify the null. Therefore, we cannot use the regression equation, as we have no evidence that the correlation in the population as a whole is not 0.000. We go back to predicting everyone will score right at the mean of Y.
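In code, the decision for this example might look like the sketch below (the .632 critical value for df = 8 comes from a standard r table; r and df come from the pizza and horror film data above):

```python
r, df = 0.352, 8
crit_05 = 0.632    # .05 critical value of r for df = 8 (standard r table)

if abs(r) >= crit_05:
    print(f"r({df}) = {r}, p < .05: reject the null, use the regression equation")
else:
    print(f"r({df}) = {r}, n.s.: predict everyone scores at the mean of Y")
# Prints the n.s. branch: .352 falls inside the CI.95 around 0.000.
```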

This system prevents plausible, but incorrect, theories from affecting people's futures. I would guess that, like most variables, liking for anchovy pizza and liking for horror movies are not really correlated. This sample probably has an r of .352 solely because of the way samples of this size fluctuate around a rho of zero.

How to report a significant r For example, let's say that you had a sample (n_P = 30) and r = -.400. Looking under df_REG = n_P - 2 = 28, we find that the interval consistent with the null runs from -.361 to +.361. So we are outside the CI.95 for rho = 0.000. We would write that result as r(28) = -.400, p<.05. This tells you that there were 28 df for r, that r = -.400, and that you can expect an r that far from 0.000 five or fewer times in 100 when rho = 0.000.

Then there is Column 4. Column 4 shows the values that lie outside a CI.99. (The CI.99 itself isn't shown like the CI.95 in Column 2 because it isn't as important.) However, Column 4 gives you bragging rights. If your r is as far or further from 0.000 as the number in Column 4, you can say there is 1 or fewer chance in 100 of an r being this far from zero (p<.01). For example, let's say that you had a sample (n_P = 30) and r = -.525. The critical value at .01 is .463. You are further from 0.000 than that, so you can brag. You write that result as r(28) = -.525, p<.01.

To summarize (assuming that the X score falls inside the range of X scores seen in your random sample): If r falls inside the CI.95 around 0.000, it is nonsignificant (n.s.) and you can't use the regression equation (e.g., r(28) = .300, n.s.). If r falls outside the CI.95, but is not as far from 0.000 as the number in Column 4, you have a significant finding and can use the regression equation (e.g., r(28) = -.400, p<.05). If r is as far or further from zero as the number in Column 4, you can use the regression equation and brag while doing it (e.g., r(28) = -.525, p<.01).
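That three-way rule is easy to express as a small (hypothetical) helper function; the df = 28 critical values are the ones discussed above.

```python
def report_r(r, df, crit_05, crit_01):
    """Summarize an r against one row of the r table (Columns 3 and 4)."""
    if abs(r) >= crit_01:
        return f"r({df}) = {r:.3f}, p < .01 (use the regression equation and brag)"
    if abs(r) >= crit_05:
        return f"r({df}) = {r:.3f}, p < .05 (use the regression equation)"
    return f"r({df}) = {r:.3f}, n.s. (predict everyone scores at the mean of Y)"

print(report_r(0.300, 28, 0.361, 0.463))    # n.s.
print(report_r(-0.400, 28, 0.361, 0.463))   # p < .05
print(report_r(-0.525, 28, 0.361, 0.463))   # p < .01
```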

The rest of this course is largely about hypothesis (theory) testing. The one and only hypothesis that we will test statistically from this point on is the NULL HYPOTHESIS. As a result of our statistical tests, we will either reject the null or fail to reject the null, based on the data from a random sample.

Remember, why must the X score be within the range of X scores observed in your random sample?

Why must the X score be within the range of X scores observed in your random sample? Because outside that range, you cannot assume linearity. The very direction of the relationship may well suddenly change.

What does the null say about rho? THE NULL HYPOTHESIS. The null says that there is no relationship, in the population as a whole, between any two variables we choose to study. Mathematically, we would say that rho = 0.000. A corollary: if rho = 0.000, any nonzero correlation found in a random sample is simply a poor estimate of rho. Thus, whatever nonzero correlation you found in the sample is illusory. It is simply a mediocre estimate of zero.