5-1 bivar. Unit 5: Correlation and Regression - Examining and Modeling Relationships Between Variables (Chapters 8-12)
Outline:
- Two variables
- Scatter diagrams to display bivariate data
- Correlation: concept, interpretation, computation, cautions
- Regression: using a LINE to describe the relation between two variables and for prediction
- Finding "the" line; interpreting its coefficients
- Residuals (prediction errors)
- Extensions of simple linear regression
5-2 bivar. Four Scatter Diagrams
[Four scatter plots: # applicants vs. size of help-wanted ad; CUME rating vs. cost per minute ($); % delinquent vs. age of credit account (years); entertainment expenses (x $100) vs. last year's sales ($1000)]
5-3 bivar. Association
If there is STRONG association between two variables, then knowing one helps a lot in predicting the other. If there is WEAK association between two variables, then information about one variable does not help much in predicting the other. Usually, the INDEPENDENT variable is thought to influence the DEPENDENT variable.
5-4 bivar. Summarizing the Relationship Between Two Variables
1. Plot the points in a scatter diagram.
2. Find the average for X and the average for Y. Plot the point of averages.
3. Find SD(X), which measures the horizontal spread of the points, and SD(Y), which measures the vertical spread of the points.
4. Find the correlation coefficient (r), which measures the degree of clustering / spread of the points about a line (the SD line).
5-5 bivar. Wood Products Shipments and Employment, by state, 1989, excluding California
[Scatter plot: Shipments ($ million) vs. Employment x 100]
5-6 bivar. Wood Products Data

Shipments ($ million) | Employment
469.8 | 7,900
246.4 | 4,400
205.4 | 2,800
186.5 | 3,600
175.8 | 3,800
142.9 | 2,100
139.7 | 2,400
120.6 | 1,900
118.0 | 1,500
104.3 | 1,500
 89.9 | 1,600
 73.5 | 1,500
 72.6 | 1,400
 71.4 | 1,200
 53.9 |   800
 52.4 | 1,400
 50.1 | 1,200
 48.1 | 1,400
 47.0 | 1,100
 36.7 |   800
 27.4 |   500
 27.3 |   400
 22.9 |   300
5-7 bivar. Wood Products Shipments and Employment, by state, 1989, excluding California
[The same scatter plot, shown again]
5-9 bivar. Linear Association
The correlation coefficient measures the LINEAR relationship between TWO variables. It is a measure of LINEAR association, or clustering around a line.
[Six panels: r near +1; r near -1; r positive, near 0; r negative, near 0; r = 1; r = -1]
5-10 bivar. Interpretation of r
The closer the correlation coefficient is to 1 (or -1), the more tightly clustered the points are around a line (the SD line). The SD line passes through all points which are an equal number of SDs away from the average for both variables.
[Two panels: positive association; negative association]
5-11 bivar. Twelve Plots, with r Look in your textbook, pages 127 and 129.
5-13 bivar. Computing the Correlation Coefficient, r Convert each variable to standard units. The average of the products gives the correlation coefficient. r = average of (z-score for X) (z-score for Y)
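This recipe can be sketched directly in Python. The data below is made up for illustration, and the SDs use the divide-by-n convention the course uses elsewhere:

```python
def correlation(xs, ys):
    """r = average of (z-score for X)(z-score for Y)."""
    n = len(xs)
    avg_x, avg_y = sum(xs) / n, sum(ys) / n
    # SDs divide by n, matching the textbook convention.
    sd_x = (sum((x - avg_x) ** 2 for x in xs) / n) ** 0.5
    sd_y = (sum((y - avg_y) ** 2 for y in ys) / n) ** 0.5
    # Average of the products of the standard units:
    return sum(((x - avg_x) / sd_x) * ((y - avg_y) / sd_y)
               for x, y in zip(xs, ys)) / n

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)   # about 0.775 for this made-up data
```

Points exactly on a line with positive slope give r = 1, as the earlier panels suggest.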
5-14 bivar. Example: Computation of r

X | Y | X - avg(X) | (X - avg(X))^2 | Y - avg(Y) | (Y - avg(Y))^2 | z-score for X | z-score for Y | product
5-15 bivar. Some Cases When the Correlation Coefficient, r, Does Not Give a Good Indication of Clustering
[Two scatter plots: one with r = .155, one with r = .536]
5-16 bivar. [Scatter plot: brain weight in kg vs. body weight in kg; r = .933 (36 data values)]
5-17 bivar. "No Elephants"
[Scatter plot: brain weight in grams vs. body weight in kg; r = .596 (r = .887, excluding dinosaurs, elephants, humans)]
5-18 bivar. All Brain Data, Log Transformed
[Scatter plot: log(brain weight) vs. log(body weight); r = .856 (all data)]
5-19 bivar. [Scatter plot: bond PRICE vs. COUPON; r = .883 (all data), r = .984 (without flower bonds) (Siegel)]
5-20 bivar. Interpretation of Empirical Association
1. Descriptive. Example: height versus weight.
2. Causal. Example: total cost vs. volume of production.
3. Nonsense. Example: polio incidence vs. soft drink sales.
5-21 bivar. Prediction Using Correlation
1. What is the best prediction of the dependent variable? What if the value of the independent variable is available?
2. What is the likely size of the prediction error?
Fundamental Principle of Prediction:
1. Use the mean of the relevant group.
2. The SD of the group gives the "likely size of error."
5-23 bivar. Diamond State Telephone Company
[Scatter plot: demand for LINES vs. proposed MONTHLY charge per line ($)]
5-24 bivar. Look at the Vertical Strip Corresponding to the Given X Value
[Scatter plot with one vertical strip of points highlighted]
5-25 bivar. Graph of Averages
[Scatter plot of LINES vs. MONTHLY with the average of each vertical strip marked]
estimated LINES = 237.495 - 3.867 MONTHLY
5-27 bivar. Linearly Related Variables The REGRESSION LINE is to a scatter diagram as the AVERAGE is to a list of numbers. The regression line estimates the average values for the dependent variable, Y, corresponding to each value, x, of the independent variable.
5-28 bivar. Linearly Related Variables
If we have two variables, linearly related to one another, then knowing the value of one variable (for a particular individual) can help to estimate / predict the value of the other variable. If we know nothing regarding the value of the independent variable (X), then we estimate the value of the dependent variable to be the OVERALL AVERAGE of the dependent variable (Y). If we know that the independent variable (X) has a particular value for a given individual, then we can make a "more educated guess" at the value of the dependent variable (Y).
5-29 bivar. Regression and SD Lines
The REGRESSION LINE for modeling the relation between X (independent variable) and Y (dependent variable) passes through the POINT OF AVERAGES and has slope r x SD(Y) / SD(X). That is, associated with each increase of one SD in X, there is an increase of r SDs in Y, on the average. The SD LINE for modeling the relation between X and Y passes through the POINT OF AVERAGES and has slope SD(Y) / SD(X), taking the sign of r.
5-30 bivar. Estimating the Intercept and Slope of the Regression Line The REGRESSION LINE for modeling the relation between X (independent variable) and Y (dependent variable) is also known as The REGRESSION LINE for predicting Y from X, and has the form Y = a + b x = intercept + slope x. Here, b = slope = r SD(Y) / SD(X) a = intercept = avg(Y) - b avg(X) = avg(Y) - r [SD(Y) / SD(X)] avg(X)
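These two formulas translate into a short Python sketch. The numbers here are made up purely to show the arithmetic:

```python
def regression_line(avg_x, sd_x, avg_y, sd_y, r):
    """Intercept and slope of the regression line from the five summary statistics."""
    b = r * sd_y / sd_x        # slope = r * SD(Y) / SD(X)
    a = avg_y - b * avg_x      # intercept = avg(Y) - slope * avg(X)
    return a, b

# Illustration with made-up summary statistics:
a, b = regression_line(avg_x=10, sd_x=2, avg_y=50, sd_y=8, r=0.5)
# b = 0.5 * 8 / 2 = 2.0 and a = 50 - 2.0 * 10 = 30.0
```

Note that only the five summary statistics are needed, not the raw data.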
5-31 bivar. Prediction from a Regression Model
The predicted value of Y corresponding to a given value of X is Y = a + b X, where a is the intercept and b is the slope of the regression line.
5-33 bivar. Summary Statistics

TOTAL OBSERVATIONS: 21

                LINES    MONTHLY
N OF CASES         21         21
MINIMUM       105.000     10.320
MAXIMUM       201.000     34.000
MEAN          154.048     21.581
VARIANCE     1122.648     69.623
STANDARD DEV   33.506      8.344

PEARSON CORRELATION MATRIX
            LINES   MONTHLY
LINES       1.000
MONTHLY    -0.963     1.000
5-34 bivar. Diamond State Questions
In the Diamond State Telephone Company example,
avg(LINES) = 154.048, SD(LINES) = 33.506
avg(MONTHLY) = 21.581, SD(MONTHLY) = 8.344
r = -0.963
What are the coordinates of the point of averages? What is the slope of the regression line? Suppose the MONTHLY charge was set at $25.00. What would you estimate to be the demand for # LINES from the 62 new businesses? Suppose the MONTHLY charge was set at $15.00. What would you estimate to be the demand for # LINES from the 62 new businesses?
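A quick Python check of these questions, using the summary statistics above (the printed statistics are rounded, so treat the results as approximate):

```python
avg_m, sd_m = 21.581, 8.344      # MONTHLY
avg_l, sd_l = 154.048, 33.506    # LINES
r = -0.963

b = r * sd_l / sd_m              # slope, about -3.867
a = avg_l - b * avg_m            # intercept, about 237.5

def predicted_lines(monthly):
    return a + b * monthly

# At $25.00 the estimate is about 141 lines; at $15.00, about 179 lines.
```

The point of averages is (21.581, 154.048), and the fitted line agrees with the "estimated LINES = 237.495 - 3.867 MONTHLY" equation on the graph-of-averages slide.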
5-35 bivar. Another Diamond State Question Suppose the MONTHLY charge was set at $50.00. What would you estimate to be the demand for # LINES from the 62 new businesses?
5-36 bivar. Regression Computer Output

DEP VAR: LINES   N: 21   MULTIPLE R: 0.963   SQUARED MULTIPLE R: 0.927
ADJ SQRD MULTIPLE R: 0.923   STANDARD ERROR OF ESTIMATE: 9.273

VARIABLE    COEFF   STD ERROR   STD COEF   TOLERANCE        T   P(2 TAIL)
CONSTANT  237.495       5.732      0.000           .   41.432       0.000
MONTHLY    -3.867       0.249     -0.963       1.000  -15.560       0.000

ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES   DF   MEAN-SQUARE   F-RATIO       P
REGRESSION       20819.092    1     20819.092   242.103   0.000
RESIDUAL          1633.860   19        85.993
5-37 bivar. Interpreting the Regression Coefficients
5-38 bivar. Other Examples
1. X = Educational expenditure, Y = Test scores
2. X = Height of a person, Y = Weight of the person
3. X = # Service years of an automobile, Y = Operating cost per year
4. X = Total weight of mail bags, Y = # Mail orders
5. X = Price of product, Y = Unit sales
6. X = Volume, Y = Total cost of production
7. X = Calories in a candy bar, Y = Grams of fat in the candy bar
8. X = Baseball slugging percentage, Y = Player salary
9. X = Weight of a diamond, Y = Price of the diamond
10.
11.
12.
5-39 bivar. Wood Products

TOTAL OBSERVATIONS: 23

               SHIPMENT    EMPLOY
N OF CASES           23        23
MINIMUM          22.900     3.000
MAXIMUM         469.800    79.000
MEAN            112.287    19.783
VARIANCE       9931.683   281.087
STANDARD DEV     99.658    16.766

PEARSON CORRELATION MATRIX
             SHIPMENT    EMPLOY
SHIPMENT        1.000
EMPLOY          0.979     1.000
5-41 bivar. [Scatter plot with regression line: y = SHIPMENT, x = EMPLOY]
5-42 bivar. [Scatter plot with regression line: y = EMPLOY, x = SHIPMENT]
5-43 bivar. Computer Output - 1

DEP VAR: SHIPMENT   N: 23   MULT R: 0.979   SQRD MULT R: 0.958
ADJ SQRD MULTIPLE R: 0.956   STD ERROR OF ESTIMATE: 21.018

VARIABLE    COEFF   STD ERROR   STD COEF   TOLER        T   P(2 TAIL)
CONSTANT   -2.781       6.868      0.000       .   -0.405       0.690
EMPLOY      5.817       0.267      0.979   1.000   21.763       0.000

ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES   DF   MEAN-SQUARE    F-RATIO       P
REGRESSION      209220.316    1    209220.316    473.619   0.000
RESIDUAL          9276.710   21       441.748
5-44 bivar. Computer Output - 2

DEP VAR: EMPLOY   N: 23   MULT R: 0.979   SQRD MULT R: 0.958
ADJ SQRD MULT R: 0.956   STD ERROR OF ESTIMATE: 3.536

VARIABLE    COEFF   STD ERROR   STD COEF   TOLER        T   P(2 TAIL)
CONSTANT    1.298       1.125      0.000       .    1.154       0.262
SHIPMENT    0.165       0.008      0.979   1.000   21.763       0.000

ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES   DF   MEAN-SQUARE   F-RATIO       P
REGRESSION        5921.363    1      5921.363   473.619   0.000
RESIDUAL           262.550   21        12.502
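A side note on these two outputs: regressing SHIPMENT on EMPLOY and EMPLOY on SHIPMENT give two different lines, but the product of the two slopes equals r squared. A quick check with the printed coefficients (rounded in the output, so the match is approximate):

```python
b_shipment_on_employ = 5.817   # slope from Computer Output - 1
b_employ_on_shipment = 0.165   # slope from Computer Output - 2
r = 0.979

product = b_shipment_on_employ * b_employ_on_shipment
# product is about 0.960, while r**2 is about 0.958; they agree
# up to the rounding in the printed coefficients.
```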
5-45 bivar. Insurance Availability in Chicago
5-46 bivar. Chicago Plots
5-47 bivar. Chicago Insurance, cont.
For cases with income less than or equal to $15,000:
avg(Voluntary) = 6.376, SD(Voluntary) = 3.959
avg(Income) = $10,332.756, SD(Income) = $2,109.819
r = 0.896
Derive the equation for the regression line. According to this linear model, what is the estimated value for "Voluntary" in a ZIP code area with Income $12,000? ... with Income $9,500?
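A sketch of the requested derivation in Python, using the summary statistics above (the results are approximate because the statistics are rounded):

```python
avg_inc, sd_inc = 10332.756, 2109.819   # Income (dollars)
avg_vol, sd_vol = 6.376, 3.959          # Voluntary
r = 0.896

b = r * sd_vol / sd_inc     # slope, about 0.00168 per dollar of income
a = avg_vol - b * avg_inc   # intercept, about -11.0

est_12000 = a + b * 12000   # about 9.2
est_9500 = a + b * 9500     # about 5.0
```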
5-49 bivar. Regression Effect In virtually all test-retest situations, the bottom group on the first test will, on average, show some improvement on the 2nd test, and the top group will, on average, fall back. This is called the REGRESSION EFFECT. The REGRESSION FALLACY is thinking that the regression effect must be due to something important, not just the spread of points around the line.
5-51 bivar. Residuals Regression methods allow us to estimate the average value of the dependent variable for each value of the independent variable. Individuals will differ somewhat from the regression estimates. How much?
5-53 bivar. Residuals Prediction error = actual - predicted = vertical distance from the point to the regression line
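The definition translates directly into code. A minimal sketch, where a and b stand for the fitted intercept and slope:

```python
def residuals(xs, ys, a, b):
    # residual = actual y - predicted y = vertical distance to the line
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# With the line y = 1 + 2x: the first two points lie exactly on the
# line (residual 0); the third sits one unit above it (residual 1).
res = residuals([0, 1, 2], [1, 3, 6], a=1, b=2)   # [0, 0, 1]
```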
5-54 bivar. Residuals for Economically Active Women and Crude Birth Rates
5-55 bivar. Residual Plots A residual plot should NOT look systematic (no trend or pattern) -- just a cloud of points around the horizontal axis. Problem plots also can tell us something about the data.
5-56 bivar. Residual Plot for Economically Active Women and Crude Birth Rates
5-57 bivar. Chicago Insurance Case Residual Plot (versus Income)
5-58 bivar. The Least Squares Property of the Regression Line Of all lines, the regression line is the one which has smallest sum of squared residuals (and also the smallest rms error). Thus, it is The Least Squares Line.
5-59 bivar. Look at the Scatter Diagram Before Fitting a Regression Model ! For each of the following data sets, the regression equation is Y = 3.0 + 0.5 X and r = 0.82 Sorry, I didn’t scan in these plots yet.
5-61 bivar. How Big Are the Residuals?
R.M.S. error of the regression line: the rms error of the regression line says how far typical points are above or below the regression line.
Standard deviation of Y: the SD of Y says how far typical points are above or below a horizontal line through the average of Y. In other words, the SD of Y is the rms error for predicting Y by its average, just ignoring the X-values.
5-62 bivar. How Big Are the Residuals?
The overall size of the residuals is measured by computing their standard deviation. The average of the residuals is zero. Computing the rms error of the regression line: the rms error of the regression line estimating Y from X can be figured as
rms error = sqrt(1 - r^2) x SD(Y).
Note that here Y is the dependent variable! The rms error is to the regression line as the SD is to the average.
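This formula can be checked against the Diamond State computer output. A sketch; note that the output's "standard error of estimate" divides by n - 2 rather than n, so it comes out slightly larger:

```python
def rms_error(r, sd_y):
    # rms error of the regression line = sqrt(1 - r^2) * SD(Y)
    return (1 - r ** 2) ** 0.5 * sd_y

# Diamond State: r = -0.963, SD(LINES) = 33.506
err = rms_error(-0.963, 33.506)
# err is about 9.0, versus the printed 9.273 (which divides by n - 2)
```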
5-63 bivar. How Big Are the Residuals?
Recall the first-order linear model: Y = a + bX + e, where e = Y - (a + bX) is the prediction error, or residual. The mean of the residuals is zero. The SD of the residuals is also known as the "root mean squared error of the regression line" (rms error).
5-64 bivar. rms Error
The overall size of the residuals is measured by computing their standard deviation. The rms error is to the regression line as the SD is to the average. Computing the rms error: the rms error of the regression line estimating Y from X can be figured as
rms error = sqrt(1 - r^2) x SD(Y).
Notes:
1. This equals the square root of the mean of the squared residuals.
2. Here Y is the dependent variable!
3. Here we are dividing by n, rather than n-2.
5-65 bivar. Looking At Vertical Strips
5-66 bivar. Looking At Vertical Strips For an oval cloud of points, the points in a vertical strip are off the regression line (up and down) by amounts similar in size to the rms error of the regression line. If the diagram is heteroscedastic, the rms error should not be used for individual strips.
5-67 bivar. Using the Normal Curve Inside A Vertical Strip For an oval cloud of points, the SD within a vertical strip is about equal to the rms error of the regression line.
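As a hedged illustration of this idea, reusing the Diamond State line: treat the regression estimate as the center of the strip and the rms error (about 9) as its SD, then read areas off the normal curve. The strip at MONTHLY = 15 and the cutoff of 170 lines are assumed values chosen just for the example:

```python
from statistics import NormalDist

# Strip at MONTHLY = 15: center = regression estimate, spread = rms error
center = 237.495 - 3.867 * 15      # about 179.5 lines
spread = 9.0                        # roughly the rms error of the line

# Estimated share of points in this strip with LINES above 170:
share_above_170 = 1 - NormalDist(center, spread).cdf(170)
# about 0.85
```

This normal-curve shortcut is only reasonable for an oval (homoscedastic) cloud, as the slide says.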
5-69 bivar. Uses for r:
(1) Describes the clustering of the scatter diagram around the SD line, relative to the SDs.
(2) Says how the average value of Y depends on X: associated with each increase of one SD in X, there is an increase of r SDs in Y (slope = r x SD(Y) / SD(X)).
(3) Gives the accuracy of the regression estimates (the SD of the prediction errors) via the rms error for the regression line.
5-70 bivar. Coefficient of Determination
How much of the variation of Y has been explained by X? (How much better are we at predicting Y when we do know the value of X?) Compare the rms error of the regression line with SD(Y). The proportion of the variation of Y which is NOT explained by X is (rms error)^2 / SD(Y)^2 = 1 - r^2, and the proportion of the variation of Y which IS explained by X is r^2.
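In code form, using the Diamond State correlation as the example:

```python
r = -0.963
sd_y = 33.506                              # SD(LINES)

rms = (1 - r ** 2) ** 0.5 * sd_y           # rms error of the line
prop_not_explained = (rms / sd_y) ** 2     # = 1 - r^2, about 0.073
prop_explained = r ** 2                    # about 0.927
# prop_explained matches the "SQUARED MULTIPLE R: 0.927" line
# in the regression computer output.
```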