Stat 13, Tue 5/29/12. 1. Drawing the reg. line. 2. Making predictions. 3. Interpreting b and r. 4. RMS residual. 5. r². 6. Residual plots. Final exam.

Presentation transcript:

Stat 13, Tue 5/29/12. 1. Drawing the reg. line. 2. Making predictions. 3. Interpreting b and r. 4. RMS residual. 5. r². 6. Residual plots. Final exam is Thur, 6/7, in class. Hw7 is due Tue, 6/5, and is from the handout, which is from "An Introduction to the Practice of Statistics," 3rd ed., by Moore and McCabe. Problems 2.20, 2.30, 2.38, 2.44, 2.46, and 2.102. On 2.20, you don't need to make the scatterplot on the computer, and it doesn't have to look perfect. On 2.102, the last sentence should be "In particular, suggest some other variables that may be confounded with heavy TV viewing and may contribute to poor grades."

1. Drawing the regression line. A couple more useful facts about the least squares regression line: (a) the reg. line always goes through (x̄, ȳ). (b) the sum of the residuals always = 0. Fact (a) is useful for plotting the regression line. Put one point at (x̄, ȳ). Choose some number c so that x̄ + c is on the graph, and put another pt. at (x̄ + c, ȳ + bc), where b is the regression slope. Connect the 2 points.
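Here is a minimal Python sketch of that two-point recipe (the data and the offset c are made up for illustration, not from the lecture): compute (x̄, ȳ), step c units to the right and bc units up, and connect the two points.

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical data
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
y = np.array([2.1, 2.6, 2.9, 3.8, 4.1, 4.7])

x_bar, y_bar = x.mean(), y.mean()
r = np.corrcoef(x, y)[0, 1]
b = r * y.std() / x.std()          # slope: b = r * s_y / s_x

c = 1.0                            # any convenient offset that stays on the graph
p1 = (x_bar, y_bar)                # first point: the point of means
p2 = (x_bar + c, y_bar + b * c)    # second point, c units to the right of the mean

plt.scatter(x, y)
plt.plot([p1[0], p2[0]], [p1[1], p2[1]], color="red")  # connect the 2 points
plt.xlabel("x"); plt.ylabel("y")
plt.show()
```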


2. Making predictions. Given that the regression line is Ŷ_i = a + b X_i, where you've found a and b, to make a prediction of Y for a given value of X, just plug that X in for X_i in the equation. Graphically, you can also make this prediction using the "up and over" method, i.e. by drawing the corresponding vertical line on the scatterplot, seeing where it intersects the regression line, and then drawing the corresponding horizontal line and seeing where it intersects the y axis. For instance, on the scatterplot shown on the slide, to make a prediction for X = 1.8, draw the red vertical line X = 1.8, see where it intersects the reg. line, and then draw a horizontal line from there; the predicted Ŷ is where that horizontal line meets the y axis. Note that, for pure prediction, you don't really care about confounding factors. If it works, it works. Confounding is a problem when trying to infer causation, though, e.g. ice cream sales and crime.
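A small sketch of the plug-in prediction (hypothetical data; the value X = 1.8 mirrors the slide's example, but the numbers here are invented):

```python
import numpy as np

# hypothetical data
x = np.array([1.0, 1.4, 1.9, 2.3, 2.8, 3.1])
y = np.array([3.2, 3.9, 4.1, 5.0, 5.6, 6.1])

r = np.corrcoef(x, y)[0, 1]
b = r * y.std() / x.std()       # slope
a = y.mean() - b * x.mean()     # intercept: the line goes through (x_bar, y_bar)

x_new = 1.8
y_hat = a + b * x_new           # regression prediction for X = 1.8
print(f"predicted Y at X = {x_new}: {y_hat:.2f}")
```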

3. Interpreting b and r. The correlation, r, tells you the strength of the linear relationship between X and Y. If the points fall exactly on a line sloping up, then r is exactly 1, and if the line slopes down, then r = -1. The slope, b, of the regression line tells you how much your predicted y value, Ŷ, would increase if you increase X by one unit. Note that neither tells you anything about causation, especially if the data are observational rather than experimental. But if you are interested in prediction rather than in causation, b is extremely important. To say that the relationship between X and Y is causal would mean that if you increase X by one unit, and everything else is held constant, then you actually increase that person's Y value by b units. The regression line is Ŷ_i = a + b X_i, where b = r s_y / s_x and a = ȳ - b x̄.
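As a quick numerical check (made-up data, not from the lecture), the formulas b = r s_y / s_x and a = ȳ - b x̄ reproduce the least squares fit that numpy's polyfit computes directly:

```python
import numpy as np

# hypothetical data
x = np.array([2.0, 3.0, 4.5, 5.0, 6.5, 7.0])
y = np.array([1.1, 2.0, 2.7, 3.6, 4.0, 5.2])

r = np.corrcoef(x, y)[0, 1]
b = r * y.std() / x.std()          # slope from the formula
a = y.mean() - b * x.mean()        # intercept from the formula

b_np, a_np = np.polyfit(x, y, 1)   # least squares slope and intercept
print(np.allclose([b, a], [b_np, a_np]))   # True
```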

Now we will talk about ways to assess how well the regression line fits. 4. RMS residual. RMS means root mean square. The RMS of a bunch of numbers is found by taking those numbers, squaring them, taking the mean of those squares, and then taking the square root of that mean. e.g. RMS{-1, 2, -3, 4, 10} = √ mean{(-1)², 2², (-3)², 4², 10²} = √ mean{1, 4, 9, 16, 100} = √26 ≈ 5.10. The RMS of some numbers indicates their typical size (or abs. value). For instance, σ is the RMS of the deviations from µ. Thus the RMS of the residuals, Y_i - Ŷ_i, tells us the typical size of these residuals, which indicates how far off our regression predictions would typically be. Smaller RMS residual indicates better fit. It turns out that the RMS of the residuals = s_y √(1 - r²). Very simple! Note that the sum of the residuals = 0. Therefore, SD of residuals = RMS of (residuals - mean residual) = RMS of residuals. Suppose the RMS residual = 10. Then the +/- for your reg. prediction is 10.
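A short sketch of the RMS recipe and the identity RMS residual = s_y √(1 - r²) (hypothetical data; SDs use divisor n, numpy's default, which is what makes the identity exact):

```python
import numpy as np

def rms(v):
    """Root mean square: square the numbers, average the squares, take the square root."""
    return np.sqrt(np.mean(np.asarray(v, dtype=float) ** 2))

print(rms([-1, 2, -3, 4, 10]))     # sqrt(26) ~ 5.10, as in the slide's example

# hypothetical data
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([1.2, 1.9, 2.1, 3.3, 3.4, 4.2])

r = np.corrcoef(x, y)[0, 1]
b = r * y.std() / x.std()
a = y.mean() - b * x.mean()
residuals = y - (a + b * x)        # Y_i - Y_hat_i

print(np.isclose(rms(residuals), y.std() * np.sqrt(1 - r**2)))   # True
```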

Another way to assess how well the regression line fits. 5. r². r² is simply the correlation squared, and it indicates the proportion of the variation in Y that is explained by the regression line. Larger r² indicates better fit. The justification for that sentence stems from the fact that, as mentioned on the previous slide, SD of residuals = RMS of residuals = s_y √(1 - r²). Therefore, squaring both sides, Var(e_i) = Var(Y_i)(1 - r²) = Var(Y_i) - r² Var(Y_i). So r² Var(Y_i) = Var(Y_i) - Var(e_i), and r² = [Var(Y_i) - Var(e_i)] / Var(Y_i). Var(Y_i) is the amount of variation in the observed y values, and Var(e_i) is the variation in the residuals, i.e. the variation left over after fitting the line by regression. So Var(Y_i) - Var(e_i) is the variation explained by the regression line, and this divided by Var(Y_i) is the proportion of variation explained by the reg. line. e.g. if r = 0.80, then r² = 0.64, so we'd say that 64% of the variation in Y is explained by the regression line, and 36% is left over, or residual variation, after fitting the regression line. If instead r = 0.10, then r² = 0.01, so only 1% of the variation in Y is explained by the regression line.
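A numerical check of that derivation (made-up data, variances with divisor n): r² equals [Var(Y) - Var(residuals)] / Var(Y).

```python
import numpy as np

# hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.1, 3.9, 4.2, 4.8, 6.1])

r = np.corrcoef(x, y)[0, 1]
b = r * y.std() / x.std()
a = y.mean() - b * x.mean()
e = y - (a + b * x)                 # residuals

# proportion of variation explained by the regression line
explained = (np.var(y) - np.var(e)) / np.var(y)
print(np.isclose(explained, r**2))  # True
```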

Another way to assess how well the regression line fits. 6. Residual plots. For each observed point (X_i, Y_i), plot X_i on the x axis and e_i on the y axis, where e_i = Y_i - Ŷ_i. From the residual plot, you can eyeball the typical size of the residuals, and you can also see potential: * outliers * curvature * heteroskedasticity. Outliers may deserve further attention, or reg. predictions may be good for some observations but lousy for others. The prediction +/- RMS residual may be a lousy summary if you're typically off by very little and occasionally off by a ton.

6. Residual plots, continued: curvature. When curvature is present, the regression line may still be the best fitting line, but it does not predict optimally: we could do better with a curve rather than a line.

6. Residual plots, continued: heteroskedasticity. Heteroskedasticity means non-constant variation; homoskedasticity means constant variation. When we use the RMS residual as a +/- for our predictions, we're assuming homoskedasticity. If the data are heteroskedastic, then the RMS residual will be a lousy summary of how much we're off by. e.g. in the example plotted on the slide, when x is near 0 we're off by very little, and when x is large we are often off by a lot.
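A minimal residual-plot sketch (simulated data chosen so the spread grows with x, to imitate the kind of heteroskedastic pattern the slide describes; none of the numbers come from the lecture):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 60)
# simulated y whose scatter around the line grows with x (heteroskedastic)
y = 2 + 0.5 * x + rng.normal(scale=0.3 * (1 + x), size=x.size)

r = np.corrcoef(x, y)[0, 1]
b = r * y.std() / x.std()
a = y.mean() - b * x.mean()
e = y - (a + b * x)                 # residuals e_i = Y_i - Y_hat_i

plt.scatter(x, e)
plt.axhline(0, color="red")         # residuals should scatter around 0
plt.xlabel("x"); plt.ylabel("residual")
plt.title("Residual plot: a fan shape suggests heteroskedasticity")
plt.show()
```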