Correlation/Regression - part 2 Consider Example 2.12 in section 2.3. Look at the scatterplot… Example 2.13 shows that the prediction line is given by.


1 Correlation/Regression - part 2 Consider Example 2.12 in section 2.3. Look at the scatterplot… Example 2.13 shows that the prediction line is given by the linear equation (nea = "nonexercise activity"): predicted fat gain = 3.505 - 0.00344 × nea. The intercept (3.505 kg) is the predicted fat gain when the increase in nonexercise activity is 0, and the slope (-0.00344) is the rate of change of fat gain per calorie increase in nea; i.e., the predicted fat gain decreases by about 0.00344 kg for each additional calorie of nea. To get the predicted fat gain for an nea of, say, 400 calories, you can either estimate it graphically from the line (next page) or evaluate the equation at nea = 400: predicted fat gain at nea of 400 = 3.505 - 0.00344 (400) = 2.129 kg. But be careful about extrapolation: the line is only trustworthy within the range of nea values actually observed.
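The numerical method on this slide can be sketched in a few lines of Python. The coefficients come from the slide; the function name `predicted_fat_gain` is only an illustrative choice, and the function does not guard against extrapolation because the observed range of nea is not given here.

```python
# Coefficients taken from the prediction line on the slide (Example 2.13).
INTERCEPT_KG = 3.505          # predicted fat gain when nea increase is 0
SLOPE_KG_PER_CAL = -0.00344   # change in predicted fat gain per calorie of nea


def predicted_fat_gain(nea_calories):
    """Return the predicted fat gain (kg) for a given nea increase (calories)."""
    return INTERCEPT_KG + SLOPE_KG_PER_CAL * nea_calories


print(round(predicted_fat_gain(400), 3))  # 2.129
```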

2 The graphical method: find nea=400 on the x-axis, draw a vertical line to intersect the regression line, then draw a horizontal line to intersect the y-axis - the place of intersection will be the predicted y for that value of x.

3 The least squares line makes the errors (or residuals) as small as possible by minimizing their sum of squares.

4 The least squares process finds the values of $b_0$ and $b_1$ that minimize the sum of the squares of the errors, giving $\hat{y} = b_0 + b_1 x$, where $b_1 = r \, (s_y / s_x)$ and $b_0 = \bar{y} - b_1 \bar{x}$. As we've noted before, use software to do these calculations for you - but notice a couple of things from these equations:
–$b_1$ and r have the same sign (since $s_y$ and $s_x$ are > 0)
–the prediction line always passes through the point $(\bar{x}, \bar{y})$
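As a rough illustration of these two formulas, the sketch below computes the slope and intercept from r, the standard deviations, and the means. The data values and the helper names (`sample_sd`, `correlation`, `least_squares_line`) are hypothetical and are not from the textbook examples.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """Correlation r as the average product of standardized values."""
    mx, my = mean(xs), mean(ys)
    sx, sy = sample_sd(xs), sample_sd(ys)
    products = (((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)


def least_squares_line(xs, ys):
    """Return (b0, b1) using b1 = r * s_y / s_x and b0 = ybar - b1 * xbar."""
    b1 = correlation(xs, ys) * sample_sd(ys) / sample_sd(xs)
    b0 = mean(ys) - b1 * mean(xs)
    return b0, b1


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = least_squares_line(xs, ys)
print(b0, b1)
# The fitted line evaluated at xbar reproduces ybar.
print(b0 + b1 * mean(xs), mean(ys))
```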

5 Besides having the same sign as the slope of the regression line, the correlation coefficient r has another property: its square, $r^2$, gives the proportion of the total variability in y that is explained by the regression line of y on x. Another important idea is that if you regress y on x (i.e., treat y as the response) you will get a different line than if you regress x on y (treat x as the response), even though the value of r is the same in both cases! See Figure 2.15 on the following slide, and read about this important set of data in Example 2.17 on page 116.

6 Regress velocity on distance (solid line) and distance on velocity (dashed line) to get two distinct lines - however, r = 0.7842 in both cases…
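A small sketch can make the "two distinct lines, one r" point concrete. The data below are hypothetical (they are not the velocity and distance data in Figure 2.15), and the `regress` helper is an illustrative name. Because each slope is r times a ratio of standard deviations, the two slopes multiply to $r^2$, so the lines coincide only when r = ±1.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation coefficient r; unchanged if xs and ys are swapped."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / ((len(xs) - 1) * stdev(xs) * stdev(ys))


def regress(response, explanatory):
    """Least squares (intercept, slope) for response on explanatory."""
    r = correlation(explanatory, response)
    slope = r * stdev(response) / stdev(explanatory)
    return mean(response) - slope * mean(explanatory), slope


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]

r = correlation(x, y)
_, slope_y_on_x = regress(y, x)   # treat y as the response
_, slope_x_on_y = regress(x, y)   # treat x as the response

print("r:", r)
print("slope of y-on-x line:", slope_y_on_x)
# Drawn on the same x-y axes, the x-on-y line has slope 1 / slope_x_on_y.
print("slope of x-on-y line on the same axes:", 1 / slope_x_on_y)
print("product of the two slopes vs r^2:", slope_y_on_x * slope_x_on_y, r ** 2)
```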

7 Cautions about regression and correlation: –Always look at the plot of the residuals (recall that for every observed point on the scatterplot, we have: residual at $x_i$ = observed $y_i$ - predicted $y_i$). A plot of the residuals against the explanatory variable should show a "random" scatter around zero - see Fig. 2.20. There should be no pattern to the residuals. Go over Ex. 2.20, p. 128.
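To see what a "pattern" in the residuals looks like, here is a minimal sketch using hypothetical data that follow a curve (y = x squared) rather than a line. The helper names are illustrative; the slope is computed as Sxy/Sxx, which is algebraically the same as r times $s_y/s_x$.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least squares (intercept, slope) for y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residuals(xs, ys):
    """Residual = observed y - predicted y, for each observation."""
    b0, b1 = fit_line(xs, ys)
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


# Hypothetical curved data, for illustration only: y = x^2.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 4, 9, 16, 25, 36]
res = residuals(xs, ys)

for x, e in zip(xs, res):
    print(x, round(e, 3))
```

The residuals come out positive at both ends and negative in the middle, a U-shaped pattern that signals a straight line is not an adequate description, even though the residuals still sum to zero.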

8 –Look out for outliers (in either the explanatory or response variable) and influential values (in the explanatory variable). Go over examples 2.21-2.22 (2.4, 2/5) carefully… note that #18 is influential and #15 is an outlier in the y-direction.

9

10 Note that outliers in the y-direction can have large residuals, while outliers in the x-direction (possibly influential values) might not have large residuals, because an influential point pulls the fitted line toward itself.
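The following sketch illustrates this with hypothetical data: five points that lie close to a line, plus one added point far out in the x-direction. The numbers are invented for illustration and are not the data from Examples 2.21-2.22.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least squares (intercept, slope) for y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data, for illustration only.
base_x = [1, 2, 3, 4, 5]
base_y = [1.1, 1.9, 3.2, 3.9, 5.0]
# One extra observation that is an outlier in the x-direction.
all_x = base_x + [15]
all_y = base_y + [5.0]

_, slope_without = fit_line(base_x, base_y)
b0, slope_with = fit_line(all_x, all_y)
res_with = [y - (b0 + slope_with * x) for x, y in zip(all_x, all_y)]

print("slope without the extra point:", slope_without)
print("slope with the extra point:   ", slope_with)
print("residual of the extra point:  ", res_with[-1])
print("largest |residual| elsewhere: ", max(abs(e) for e in res_with[:-1]))
```

In this made-up case the added point changes the slope substantially, yet its own residual is not the largest one, which is why a residual plot alone may not flag an influential observation.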

11 All HW for Chapter 2:
–section 2.1: #2.6-2.9, 2.11, 2.13-2.15, 2.18, 2.19, 2.21, 2.26
–section 2.2: #2.29-2.32, 2.35, 2.39, 2.41, 2.43, 2.46, 2.50, 2.51
–section 2.3: #2.57-2.58, 2.62, 2.64, 2.66, 2.68, 2.73, 2.74
–section 2.4: #2.85, 2.87, 2.89, 2.94, 2.96, 2.97, 2.101
–section 2.5: #2.111-2.113, 2.119, 2.121
–section 2.6: #2.122, 2.125, 2.127, 2.129, 2.130
–Chapter 2 Exercises: Do several of those on pp. 161-169

12 Transformations:
–Look at the dataset of 62 mammals' brain and body weights ("Beyond the Basics: Transformations" after section 2.3 in the eBook)… copy the link location from my website. Analyze it with JMP… what are the difficulties?
–Fix this by transforming the data with an appropriate mathematical function - easy to do in JMP: create a new column to contain the transformed data.
–Try the log-log transform and then re-analyze…
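The slide does this in JMP; as a language-neutral illustration, here is a Python sketch of the same idea. The data are synthetic, generated from an assumed power law (brain = 10 × body^0.75), and are not the 62-mammal dataset. The point is only that data spanning several orders of magnitude can look poorly linear on the raw scale and linear after taking logs of both variables.

```python
from math import log10
from statistics import mean, stdev


def correlation(xs, ys):
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / ((len(xs) - 1) * stdev(xs) * stdev(ys))


def slope(xs, ys):
    """Least squares slope for y on x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sum((x - mx) ** 2 for x in xs)


# Synthetic data from an assumed power law, for illustration only.
body = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 5000.0]
brain = [10 * b ** 0.75 for b in body]

# "New columns" holding the transformed data.
log_body = [log10(b) for b in body]
log_brain = [log10(b) for b in brain]

r_raw = correlation(body, brain)
r_log = correlation(log_body, log_brain)
slope_log = slope(log_body, log_brain)

print("r on the raw scale:    ", r_raw)
print("r on the log-log scale:", r_log)
print("log-log slope (the assumed exponent):", slope_log)
```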

13 Look ahead to two-way tables… Two-way tables organize data about two categorical variables (factors) obtained from a two-way design. (There are now two ways to group the data.) See below from US Census data… [Table labels: first factor, age (group by age); second factor, education (record education).]

14 Marginal distributions: We can look at each categorical variable separately in a two-way table by studying the row totals and the column totals. They represent the marginal distributions, expressed in counts or percentages. (They are written as if in a margin.) [Table: 2000 U.S. census]

15 The marginal distributions summarize each categorical variable independently. But the two-way table actually describes the relationship between both categorical variables. The cells of a two-way table represent the intersection of a given level of one categorical factor with a given level of the other categorical factor. Get the joint distribution by computing the percentage of each cell with respect to the grand total of the table. Because raw counts can be misleading (for instance, one level of one factor might be much less represented than the other levels), we prefer to compare percents or proportions computed within each level of a factor, i.e., each cell as a percent of its row total or of its column total. These make up what are called the conditional distributions. Conditioning on rows and conditioning on columns give different conditional distributions… Try it with the Education data…

16 Here's a summary:
–Two-way tables consist of counts obtained by crosstabulating two categorical variables - the goal is to understand the relationship or association between these two variables.
–The first method of looking for the relationship is to compute percentages - there are three types: those based on the grand total in the table (the joint distribution of the two variables), and those based on the column totals or on the row totals (the conditional distributions).
–To look for association, consider all of the percentages above, but usually compute percents with respect to the explanatory variable's totals and compare the resulting conditional distributions across levels of the explanatory variable.
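The three kinds of percentages can be sketched as follows. The counts, the age groups, and the education labels are hypothetical and are not the 2000 U.S. census figures; the dictionary layout is just one convenient way to hold a small two-way table.

```python
# Hypothetical counts: rows = age group, columns = education level.
table = {
    "25-34": {"no college": 30, "college": 20},
    "35-54": {"no college": 40, "college": 10},
}

grand_total = sum(sum(row.values()) for row in table.values())
row_totals = {age: sum(row.values()) for age, row in table.items()}
col_totals = {}
for row in table.values():
    for edu, count in row.items():
        col_totals[edu] = col_totals.get(edu, 0) + count

# Joint distribution: each cell divided by the grand total.
joint = {age: {edu: c / grand_total for edu, c in row.items()}
         for age, row in table.items()}

# Marginal distributions: row totals and column totals over the grand total.
marginal_age = {age: t / grand_total for age, t in row_totals.items()}
marginal_edu = {edu: t / grand_total for edu, t in col_totals.items()}

# Conditional distributions: each cell divided by its row or column total.
edu_given_age = {age: {edu: c / row_totals[age] for edu, c in row.items()}
                 for age, row in table.items()}
age_given_edu = {edu: {age: table[age][edu] / col_totals[edu] for age in table}
                 for edu in col_totals}

print("joint:", joint)
print("marginal (age):", marginal_age)
print("marginal (education):", marginal_edu)
print("education given age:", edu_given_age)
print("age given education:", age_given_edu)
```

If age is treated as the explanatory variable, the comparison of interest is across the rows of `edu_given_age`: in these made-up counts the share with college differs between the two age groups, which is what an association looks like.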

