5-1 bivar. Unit 5 Correlation and Regression: Examining and Modeling Relationships Between Variables. Chapters 8 - 12.


5-1 bivar. Unit 5 Correlation and Regression: Examining and Modeling Relationships Between Variables. Chapters 8 - 12. Outline: Two variables. Scatter diagrams to display bivariate data. Correlation: concept, interpretation, computation, cautions. Regression model: using a LINE to describe the relation between two variables & for prediction; finding "the" line; interpreting its coefficients; residuals, prediction errors. Extensions of simple linear regression.

5-2 bivar. Four Scatter Diagrams. [Four scatter plots: # applicants vs. size of help wanted ad; CUME rating vs. cost per min. ($); % delinquent vs. age of credit account (years); last year's sales ($1000) vs. entertainment expenses (x $100)]

5-3 bivar. Association. If there is STRONG ASSOCIATION between 2 variables, then knowing one helps a lot in predicting the other. If there is WEAK ASSOCIATION between 2 variables, then information about one variable does not help much in predicting the other. Usually, the INDEPENDENT variable is thought to influence the DEPENDENT variable.

5-4 bivar. Summarizing the Relationship Between Two Variables. 1. Plot the points in a scatter diagram. 2. Find the average for X and the average for Y; plot the point of averages. 3. Find SD(X), which measures the horizontal spread of the points, and SD(Y), which measures the vertical spread. 4. Find the correlation coefficient (r), which measures the degree of clustering / spread of the points about a line (the SD line).

5-5 bivar. Wood Products Shipments and Employment, by state, 1989, excl. California. [Scatter plot: Shipments ($ million) vs. Employment x 100]

5-6 bivar. Wood Products Data. [Data table: Shipments ($ million) and Employment, by state]

5-7 bivar. Wood Products Shipments and Employment, by state, 1989, excl. California. [Scatter plot: Shipments ($ million) vs. Employment x 100]

5-9 bivar. Linear Association. The correlation coefficient measures the LINEAR relationship between TWO variables. It is a measure of LINEAR association or clustering around a line. [Six scatter plots, labeled: r near +1; r near -1; r positive, near 0; r negative, near 0; r = 1; r = -1]

5-10 bivar. Interpretation of r. The closer the correlation coefficient is to 1 (or -1), the more tightly clustered the points are around a line (the SD line). The SD line passes through all points which are an equal number of SDs away from the average for both variables. [Two plots: positive association; negative association]

5-11 bivar. Twelve Plots, with r Look in your textbook, pages 127 and 129.

5-13 bivar. Computing the Correlation Coefficient, r Convert each variable to standard units. The average of the products gives the correlation coefficient. r = average of (z-score for X) (z-score for Y)
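The recipe above can be sketched in code. This is only an illustration — the course prescribes no software, and the data below are made up:

```python
# Correlation via standard units, as on the slide: convert each
# variable to z-scores, then average the products.
# (The x, y data are made up for illustration.)
from math import sqrt

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
n = len(x)

avg_x = sum(x) / n
avg_y = sum(y) / n
# SDs computed by dividing by n, matching the textbook's convention.
sd_x = sqrt(sum((xi - avg_x) ** 2 for xi in x) / n)
sd_y = sqrt(sum((yi - avg_y) ** 2 for yi in y) / n)

# r = average of (z-score for X)(z-score for Y)
r = sum(((xi - avg_x) / sd_x) * ((yi - avg_y) / sd_y)
        for xi, yi in zip(x, y)) / n
```

For these five points r comes out to 0.8, a fairly tight positive clustering.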

5-14 bivar. Example: Computation of r. [Worked table with columns: X, Y, X − avg(X), (X − avg(X))², Y − avg(Y), (Y − avg(Y))², z-score for X, z-score for Y, product]

5-15 bivar. Some Cases When the Correlation Coefficient, r, Does Not Give a Good Indication of Clustering. [Two scatter plots of Y vs. X: r = .155 and r = .536]

5-16 bivar. [Scatter plot: brain weight in kg vs. body weight in kg; r = .933 (36 data values)]

5-17 bivar. "No Elephants". [Scatter plot: brain weight in grams vs. body weight in kg; r = .596] (r = .887, excluding dinosaurs, elephants, humans)

5-18 bivar. All brain data, log transformed. [Scatter plot: log(brain weight) vs. log(body weight); r = .856 (all data)]
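The improvement from the log transform can be mimicked numerically. A sketch with synthetic data — an exact power law, not the textbook's brain/body measurements:

```python
# Synthetic illustration (made-up power-law data): an exact power law
# is linear on the log scale, so r on the logs is 1, while r on the
# raw values is pulled down by curvature and the skewed weights.
import math

body = [0.1, 1.0, 10.0, 100.0, 1000.0, 5000.0]  # kg, hypothetical
brain = [w ** 0.75 for w in body]               # exact power law

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

r_raw = pearson_r(body, brain)
r_log = pearson_r([math.log(x) for x in body],
                  [math.log(y) for y in brain])
```

On the log scale the relation is exactly linear, so r_log is 1, while r_raw is strictly smaller.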

5-19 bivar. [Scatter plot: PRICE vs. COUPON; r = .883 (all data), r = .984 (without flower bonds)] (Siegel)

5-20 bivar. Interpretation of Empirical Association 1. Descriptive Example: Height versus Weight 2. Causal Example: Total Cost vs. Volume of Production 3. Nonsense Example: Polio Incidence vs. Soft Drink Sales

5-21 bivar. Prediction Using Correlation 1. What is the best prediction of the dependent variable? What if the value of the independent variable is available? 2. What is the likely size of the prediction error? Fundamental Principle of Prediction 1. Use the mean of the relevant group. 2. SD of the group gives the "likely size of error."

5-23 bivar. Diamond State Telephone Company. Demand for LINES versus proposed MONTHLY charge per line ($). [Scatter plot: LINES vs. MONTHLY]

5-24 bivar. Look at the Vertical Strip Corresponding to the Given X Value. [Scatter plot of Y vs. X with one vertical strip highlighted]

5-25 bivar. Graph of Averages. [Plot of LINES vs. MONTHLY with the average of each vertical strip marked x, and the fitted relation: estimated LINES = ___ + ___ × MONTHLY]

5-27 bivar. Linearly Related Variables The REGRESSION LINE is to a scatter diagram as the AVERAGE is to a list of numbers. The regression line estimates the average values for the dependent variable, Y, corresponding to each value, x, of the independent variable.

5-28 bivar. Linearly Related Variables. If we have 2 variables, linearly related to one another, then knowing the value of one variable (for a particular individual) can help to estimate / predict the value of the other variable. If we know nothing regarding the value of the independent variable (X), then we estimate the value of the dependent variable to be the OVERALL AVERAGE of the dependent variable (Y). If we know that the independent variable (X) has a particular value for a given individual, then we can take a "more educated guess" at the value of the dependent variable (Y).

5-29 bivar. Regression and SD Lines. The REGRESSION LINE for modeling the relation between X (independent variable) and Y (dependent variable) passes through the POINT OF AVERAGES and has slope r × SD(Y) / SD(X). That is, associated with each increase of one SD in X, there is an increase of r SD's in Y, on the average. The SD LINE for modeling the relation between X and Y passes through the POINT OF AVERAGES and has slope ± SD(Y) / SD(X), taking the sign of r.

5-30 bivar. Estimating the Intercept and Slope of the Regression Line. The REGRESSION LINE for modeling the relation between X (independent variable) and Y (dependent variable) is also known as the REGRESSION LINE for predicting Y from X, and has the form Y = a + b x = intercept + slope × x. Here, b = slope = r × SD(Y) / SD(X), and a = intercept = avg(Y) − b × avg(X) = avg(Y) − r [SD(Y) / SD(X)] avg(X).
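As a sketch, the slope and intercept formulas in code. The summary statistics here are illustrative (a made-up height-weight setting), not data from this course:

```python
# Regression line from the five summary statistics, following the
# slide's formulas. All numbers are hypothetical.
avg_x, sd_x = 70.0, 3.0    # X: height in inches (made up)
avg_y, sd_y = 162.0, 30.0  # Y: weight in lbs (made up)
r = 0.47

b = r * sd_y / sd_x        # slope = r SD(Y)/SD(X)
a = avg_y - b * avg_x      # intercept = avg(Y) - b avg(X)

def predict(x):
    """Predicted Y for a given x: a + b x."""
    return a + b * x
```

A height 1 SD above average (73 inches) gets a predicted weight r SDs above average: 162 + 0.47 × 30 = 176.1 lbs.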

5-31 bivar. Prediction from a Regression Model. The predicted value of Y corresponding to a given value of X is predicted Y = a + b x = avg(Y) + r [SD(Y) / SD(X)] (x − avg(X)).

5-33 bivar. [Computer output: summary statistics for LINES and MONTHLY (21 observations): N of cases, minimum, maximum, mean, variance, standard deviation, and the Pearson correlation matrix]

5-34 bivar. Diamond State Questions. In the Diamond State Telephone Company example: avg(LINES) = ___, SD(LINES) = ___, avg(MONTHLY) = ___, SD(MONTHLY) = ___, r = ___. What are the coordinates of the point of averages? What is the slope of the regression line? Suppose the MONTHLY charge was set at $___. What would you estimate to be the demand for # LINES from the 62 new businesses? Suppose instead the MONTHLY charge was set at $___. What would you estimate the demand to be?

5-35 bivar. Another Diamond State Question. Suppose the MONTHLY charge was set at $___. What would you estimate to be the demand for # LINES from the 62 new businesses?

5-36 bivar. Regression Computer Output. [Regression output: DEP VAR: LINES, N: 21; multiple R, squared multiple R, adjusted squared multiple R, standard error of estimate; coefficient table (coefficient, std error, std coef, tolerance, t, 2-tail p) for CONSTANT and MONTHLY; analysis of variance table (regression and residual sums of squares, df, mean squares, F-ratio, p)]

5-37 bivar. Interpreting the Regression Coefficients

5-38 bivar. Other Examples. 1. X = Educational expenditure, Y = Test scores. 2. X = Height of a person, Y = Weight of the person. 3. X = # Service years of an automobile, Y = Operating cost per year. 4. X = Total weight of mail bags, Y = # Mail orders. 5. X = Price of product, Y = Unit sales. 6. X = Volume, Y = Total cost of production. 7. X = Calories in a candy bar, Y = Grams of fat in the candy bar. 8. X = Baseball slugging percentage, Y = Player salary. 9. X = Weight of a diamond, Y = Price of the diamond.

5-39 bivar. Wood Products. [Computer output: summary statistics for SHIPMENT and EMPLOY (23 observations): N of cases, minimum, maximum, mean, variance, standard deviation, and the Pearson correlation matrix]

5-41 bivar. [Scatter plot with fitted line: y = SHIPMENT, x = EMPLOY]

5-42 bivar. [Scatter plot with fitted line: y = EMPLOY, x = SHIPMENT]

5-43 bivar. Computer Output - 1. [Regression output: DEP VAR: SHIPMENT, N: 23; multiple R, squared multiple R, adjusted squared multiple R, standard error of estimate; coefficient table for CONSTANT and EMPLOY; analysis of variance table]

5-44 bivar. Computer Output - 2. [Regression output: DEP VAR: EMPLOY, N: 23; multiple R, squared multiple R, adjusted squared multiple R, standard error of estimate; coefficient table for CONSTANT and SHIPMENT; analysis of variance table]

5-45 bivar. Insurance Availability in Chicago

5-46 bivar. Chicago Plots

5-47 bivar. Chicago Insurance, cont. For cases with income less than or equal to $15,000: avg(Voluntary) = ___, SD(Voluntary) = ___, avg(Income) = $10,___, SD(Income) = $2,___, r = ___. Derive the equation for the regression line. According to this linear model, what is the estimated value for "Voluntary" in a ZIP code area with Income $12,000?... with Income $9,500?

5-49 bivar. Regression Effect In virtually all test-retest situations, the bottom group on the first test will, on average, show some improvement on the 2nd test, and the top group will, on average, fall back. This is called the REGRESSION EFFECT. The REGRESSION FALLACY is thinking that the regression effect must be due to something important, not just the spread of points around the line.
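The regression effect can be reproduced in a toy simulation. All numbers here are synthetic, and the model — a shared true ability plus independent test noise — is an assumption for illustration:

```python
# Simulating the regression effect: the bottom group on test 1
# improves on average, and the top group falls back, with no real
# change in ability. (All data synthetic.)
import random

random.seed(0)
n = 10_000
ability = [random.gauss(70, 10) for _ in range(n)]
test1 = [a + random.gauss(0, 5) for a in ability]
test2 = [a + random.gauss(0, 5) for a in ability]

pairs = sorted(zip(test1, test2))   # sort by first-test score
bottom = pairs[: n // 10]           # bottom 10% on test 1
top = pairs[-(n // 10):]            # top 10% on test 1

bottom_change = sum(t2 - t1 for t1, t2 in bottom) / len(bottom)
top_change = sum(t2 - t1 for t1, t2 in top) / len(top)
# bottom_change comes out positive, top_change negative.
```

Nothing "important" drives the change — only the spread of points around the line, exactly the regression fallacy the slide warns against.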

5-51 bivar. Residuals Regression methods allow us to estimate the average value of the dependent variable for each value of the independent variable. Individuals will differ somewhat from the regression estimates. How much?

5-52 bivar. [Scatter plot; the labeled point is Algeria]

5-53 bivar. Residuals Prediction error = actual - predicted = vertical distance from the point to the regression line
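In code, a residual is just actual minus predicted. A sketch on made-up data, fitting the least-squares line first:

```python
# Residual = actual - predicted = vertical distance from each point
# to the regression line. (Data made up for illustration.)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
n = len(xs)

avg_x, avg_y = sum(xs) / n, sum(ys) / n
# Least-squares slope and intercept.
b = (sum((x - avg_x) * (y - avg_y) for x, y in zip(xs, ys))
     / sum((x - avg_x) ** 2 for x in xs))
a = avg_y - b * avg_x

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
# For the least-squares line the residuals average out to zero.
```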

5-54 bivar. Residuals for Economically Active Women and Crude Birth Rates

5-55 bivar. Residual Plots A residual plot should NOT look systematic (no trend or pattern) -- just a cloud of points around the horizontal axis. Problem plots also can tell us something about the data.

5-56 bivar. Residual Plot for Economically Active Women and Crude Birth Rates

5-57 bivar. Chicago Insurance Case Residual Plot (versus Income)

5-58 bivar. The Least Squares Property of the Regression Line Of all lines, the regression line is the one which has smallest sum of squared residuals (and also the smallest rms error). Thus, it is The Least Squares Line.
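The least squares property can be checked numerically. A sketch on made-up data: perturbing the fitted line's intercept or slope never reduces the sum of squared residuals:

```python
# Least squares property (made-up data): among all lines, the
# regression line minimizes the sum of squared residuals.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]
n = len(xs)

avg_x, avg_y = sum(xs) / n, sum(ys) / n
b = (sum((x - avg_x) * (y - avg_y) for x, y in zip(xs, ys))
     / sum((x - avg_x) ** 2 for x in xs))
a = avg_y - b * avg_x

def sse(a0, b0):
    """Sum of squared residuals for the line y = a0 + b0 * x."""
    return sum((y - (a0 + b0 * x)) ** 2 for x, y in zip(xs, ys))

best = sse(a, b)
# Nudging the intercept or slope in any direction only increases SSE.
others = [sse(a + da, b + db)
          for da in (-0.5, 0.0, 0.5)
          for db in (-0.2, 0.0, 0.2)]
```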

5-59 bivar. Look at the Scatter Diagram Before Fitting a Regression Model! For each of the following data sets, the regression equation is Y = ___ + ___ X and r = 0.82. Sorry, I didn't scan in these plots yet.

5-61 bivar. How Big Are The Residuals? R.M.S. Error of the Regression Line: the rms error of the regression line says how far typical points are above or below the regression line. Standard Deviation of Y: the SD of Y says how far typical points are above or below a horizontal line through the average of Y. In other words, the SD of Y is the rms error for predicting Y by its average, just ignoring the X-values.

5-62 bivar. How Big Are The Residuals? The overall size of the residuals is measured by computing their standard deviation. The average of the residuals is zero. Computing the rms error of the regression line: the rms error of the regression line estimating Y from X can be figured as rms error = √(1 − r²) × SD(Y). Note that here Y is the dependent variable! The rms error is to the regression line as the SD is to the average.

5-63 bivar. How Big Are the Residuals? Recall the first-order linear model: Y = a + b X + e, where e = prediction error = residual. The mean of the residuals is zero. The SD of the residuals is also known as the "root mean squared error of the regression line" (rms error).

5-64 bivar. rms error. The overall size of the residuals is measured by computing their standard deviation. The rms error is to the regression line as the SD is to the average. Computing the rms error: the rms error of the regression line estimating Y from X can be figured as rms error = √(1 − r²) × SD(Y). Notes: here Y is the dependent variable! And here we are dividing by n, rather than n − 2.
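Both routes to the rms error can be compared on made-up data — the direct root-mean-square of the residuals, and the shortcut √(1 − r²) × SD(Y) — dividing by n throughout, as the slide notes:

```python
# Two ways to the rms error of the regression line (made-up data):
# directly from the residuals, and via sqrt(1 - r^2) * SD(Y).
from math import sqrt

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]
n = len(xs)

avg_x, avg_y = sum(xs) / n, sum(ys) / n
sd_x = sqrt(sum((x - avg_x) ** 2 for x in xs) / n)
sd_y = sqrt(sum((y - avg_y) ** 2 for y in ys) / n)
r = (sum((x - avg_x) * (y - avg_y) for x, y in zip(xs, ys))
     / (n * sd_x * sd_y))

# Fit the regression line and take the rms of its residuals.
b = r * sd_y / sd_x
a = avg_y - b * avg_x
rms_direct = sqrt(sum((y - (a + b * x)) ** 2
                      for x, y in zip(xs, ys)) / n)
rms_formula = sqrt(1 - r ** 2) * sd_y
# The two values agree (about 0.85 for this data).
```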

5-65 bivar. Looking At Vertical Strips

5-66 bivar. Looking At Vertical Strips For an oval cloud of points, the points in a vertical strip are off the regression line (up and down) by amounts similar in size to the rms error of the regression line. If the diagram is heteroscedastic, the rms error should not be used for individual strips.

5-67 bivar. Using the Normal Curve Inside A Vertical Strip For an oval cloud of points, the SD within a vertical strip is about equal to the rms error of the regression line.

5-69 bivar. Uses for r: (1) Describes the clustering of the scatter diagram around the SD line, relative to the SD's. (2) Says how the average value of Y depends on X: associated with each increase of 1 SD in X, there is an increase of r SD's in Y, on average (slope of the regression line = r × SD(Y) / SD(X)). (3) Gives the accuracy of the regression estimates (the SD of the prediction errors) via the rms error for the regression line.

5-70 bivar. Coefficient of Determination. How much of the variation of Y has been explained by X? (How much better are we at predicting Y when we do know the value of X?) Compare the rms error of the regression line with SD(Y). The proportion of the variation of Y which is NOT explained by X is (rms error / SD(Y))² = 1 − r², and the proportion of the variation of Y which IS explained by X is r².
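The bookkeeping on this slide can be verified numerically (made-up data again): the squared ratio of the rms error to SD(Y) is the unexplained proportion, leaving r² explained:

```python
# Proportion of the variation of Y explained by X.
# (Data made up for illustration.)
from math import sqrt

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]
n = len(xs)

avg_x, avg_y = sum(xs) / n, sum(ys) / n
sd_x = sqrt(sum((x - avg_x) ** 2 for x in xs) / n)
sd_y = sqrt(sum((y - avg_y) ** 2 for y in ys) / n)
r = (sum((x - avg_x) * (y - avg_y) for x, y in zip(xs, ys))
     / (n * sd_x * sd_y))

# Fit the regression line and measure its mean squared error.
b = r * sd_y / sd_x
a = avg_y - b * avg_x
mse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / n

unexplained = mse / sd_y ** 2   # = (rms error / SD(Y))^2 = 1 - r^2
explained = 1 - unexplained     # = r^2
```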