Inference for Linear Relationships

Inference for Linear Relationships
Lesson 2: Section 12.1 (part 2)

objectives Construct and interpret a confidence interval for the slope β of the population (true) regression line.

Homework even answers 2. No. The conditions for performing inference are not met. There is curvature in this plot, which suggest that the original scatterplot has curvature to it. 4. Linear: the residual plot is reasonably centered at 0, so the scatterplot is approximately linear. Independent: Due to the random assignment, the observations can be viewed as independent. Normal: the histogram is mound shaped and approximately symmetric, so the residuals could follow a Normal distribution. Equal variance: The residual plot shows roughly equal scatter for all x-values. Random: This was a randomized experiment. The conditions are met.

Parameters vs. Statistics
About a population (usually unknown) About a sample (use this to estimate or test about a popoulation) p y=α + βx Mean Proportion Linear Regression

Back to the sampling distribution of b (sample regression slope)
Shape: Is it Normal? Center: µb = β (b is the unbiased estimator of β) Spread: Since we do not typically know what σ is, we will have to use the standard deviation of the residuals (s) Remember that standard deviation (error) measures how far we expect our estimate to be from the truth in repeated random sampling or repeated random assignments. Because we are using s instead of the true σ, we will be using t distributions, not z!

Let’s review some other notations with standard deviation:
σ : TRUE standard deviation of the y variable We often do not know this value, so we use s as a substitute s : standard deviation of residuals When this is larger, that means the standard error of the slope is larger. Think about it! If data are more scattered about the population (true) regression line, we should expect more variability in the slopes of the sample regression lines. Use this when you do not know σ sx : standard deviation of the x variable When this is larger, that means the standard error of the slope is smaller. Think about it! The more variability in x, the less variability there will be in the slope.

Constructing a Confidence Interval for Slope
When the conditions for regression inference are met, a level C confidence interval for the slope β of the population (true) regression line is: With degrees of freedom = n – 2 Standard Error: You will rarely have to calculate this by hand – it will be given to you by computer output Sample regression line’s slope Critical value: look in Table B at the % confidence at the certain df

EXAMPLE: Back to “Does Seat Location Matter”?
Remember from last class, we analyzed the results of an experiment designed to see if sitting closer to the front of a classroom causes higher achievement. We verified the conditions for inference earlier. Here is the scatterplot of the data and some output from the analysis: Identify the standard error of the slopes SEb from the computer output. Interpret this value in context. SEb = If we repeated the random assignment many times, the slope of the estimated regression line would typically vary by the from the slope of the true regression line. Predictor Coef SE Coef T P Constant Row S = R-sq = 4.7% R-sq (adj) = 1.3%

Remember from last class, we analyzed the results of an experiment designed to see if sitting closer to the front of a classroom causes higher achievement. We verified the conditions for inference earlier. Here is the scatterplot of the data and some output from the analysis: Calculate the 95% confidence interval for the true slope. Because n=30, our df = 30-2 = 28. Using Table B, we t* = The 95% confidence interval is ± 2.048(0.9472) = ± = ( , ) Predictor Coef SE Coef T P Constant Row S = R-sq = 4.7% R-sq (adj) = 1.3%

Interpret the interval from part (b) in context. We are 95% confident that the interval from to captures the slope of the true regression line relating a student’s test score y and the student’s row number x. Based on your interval, is there convincing evidence that seat location affects scores? (d) Because the interval of plausible slopes includes 0, we do not have convincing evidence that there is an association between test score and row number.

EXAMPLE: Fresh flowers?
Mrs. Childrey wants to know if she puts sugar in the water of her favorite bouquet of flowers, will it keep them fresher? Two AP statistic students decide to help her out and investigate the effect of sugar on the life of cut flowers. They went to the local grocery store and randomly select 12 “fire-and-ice” roses. All the roses seemed equally healthy when they were selected. When the students got home, they prepared 12 identical vases with exactly the same amount of water in each vase. They put one tablespoon of sugar in 3 vases, two tablespoons of sugar in 3 vases. In the remaining 3 vases, they put no sugar. After the vases were prepared and placed in the same location, the students randomly assigned one flower to each vase and observed how many hours each flower continued to look fresh. Here are the data: Sugar (tbs) 1 2 3 Freshness (hours) 168 180 192 204 210 222 228 234

Minitab output from a least-squares regression analysis for these data is shown below:

(a) Construct and interpret a 99% confidence interval for the slope of the true regression line. (b) Would you feel confident predicting the number of hours of freshness if 10 tablespoons of sugar are used? Explain. Answer: STATE: We want to estimate the slope β of the true regression line relating hours of freshness y to amount of sugar x at the 99% confidence level.

PLAN: If conditions are met, we will use a t interval for the slope to estimate β: LINEAR: The scatterplot shows a linear pattern, and there is no obvious leftover curvature in the residual plot, even though all the residuals are negative for x = 2 INDEPENDENT: All the flowers were in different vases, so knowing the hours of freshness for one flower shouldn’t provide additional information about the hours of freshness for other flowers. NORMAL: The histogram of residuals does not show any skewness or outliers. EQUAL VARIANCE: Although the variability of the residuals is not constant at each value of x, there is no systematic pattern such as increasing variation as x increases. Random: Flowers were randomly assigned to the treatments.

DO: Using df = 12 – 2 = 10, the t* = (from Table B). Thus, the 99% confidence interval is 15.2 ± 3.169(1.943) = 15.2 ± 6.16 = (9.04, 21.36) CONCLUDE: We are 99% confident that the interval from 9.04 to captures the slope of the true regression line relation hours of freshness y to amount of sugar x. (b) No. This would be an extrapolation. We have no idea if the linear form will continue beyond x = 3 tablespoons. At some point there will be more sugar than the flowers can handle.

When you test β… Think about what testing a null hypothesis of β = 0 means? A regression line with slope 0 is horizontal, which means that y does not change at all when x changes. This says that there is NO linear relationship between x and y in the population. Alternate Hypothesis: β ≠ 0: that there is a linear relationship between x and y (Most computer outputs use this – to get one of the P-values for the alternates below, just divide the computer output P-value by 2) β > 0: that there is a positive linear relationship between x and y β < 0: that there is a negative linear relationship between x and y

Constructing a Significance Test for Slope
When the conditions for regression inference are met, to test the hypothesis H0: β = hypothesized value , compute the t statistic Using computer output, Table B, or Calculator, find the P-value for given alternate hypothesis with degrees of freedom = n – 2 Null Hypothesis Value Sample regression line’s slope Standard Error: You will rarely have to calculate this by hand – it will be given to you by computer output

EXAMPLE: Tipping at a Buffet
Do customers who stay longer at buffets give larger tips? Charlotte, an AP Statistics student who worked at an Asian buffet, decided to investigate this question for her second-semester project. While she was doing her job as a hostess, she obtained a random sample of receipts, which included the length of time (in minutes) the party was in the restaurant and the amount of the tip (in dollars). Do these data provide convincing evidence that customers who stay longer give larger tips? Here are the data: Time (min) 23 39 44 55 61 65 67 70 74 85 90 99 Tip ($) 5.00 2.75 7.75 7.00 8.88 9.01 7.29 7.50 6.00 6.50

(a) Here is a scatterplot of the data with the least-squares regression line added. Describe what this graph tells you about the relationship between the two variables. Answer: There is a weak positive linear association between length of stay and tip amount. People who stay longer tend to give larger tips.

EXAMPLE : Tipping at a Buffet
Minitab output from a linear regression analysis on these data is shown here: (b) What is the equation of the least-square regression line for predicting the amount of the tip from the length of the stay? Define any variables you use. Answer: Where y hat = predicted tip amount (in dollars) and x = length of stay (in minutes) (c) Interpret the slope and y intercept of the least-square regression line in context. Slope: For each additional minute that a party stays at the buffet, the least-squares regression line predicts an increase in the tip of $0.030 (3 cents). Y-intercept: the model predicts that a party who stays at a buffet for 0 minutes will leave a tip of $ This seems pretty unreasonable and is an extrapolation since we have no data around x = 0.

(d) Carry out an appropriate test to answer Charlotte’s question. STATE: We want to perform a test of H0: β = 0 Ha: β > 0 Where β is the true slope of the population regression line relating length of stay x to tip amount y. We will use α = 0.05. PLAN: If the conditions are met, we will do a t test for the slope β.

(d) Carry out an appropriate test to answer Charlotte’s question. PLAN: If the conditions are met, we will do a t test for the slope β. Linear: The scatterplot shows a weak positive linear relationship between length of stay and tip amount. The residual plot shows random scatter about the residual = 0 line. Independent: Knowing one tip amount shouldn’t provide additional information about other tip amounts. Since we are sampling without replacement, w must assume that there are more than 10(12) = 120 receipts in the population from which these receipts were selected. Normal: The Normal probability plot looks roughly linear. Equal variance: The residual plot shows a fairly equal amount of scatter around the residual = 0 line. Random: The receipts were randomly selected.

(d) Carry out an appropriate test to answer Charlotte’s question. DO: From the computer output: Test statistic: t = 1.23 P-value: Since the P-value in the computer output is for a two-sided test, we must divide it in half for a one-sided test. P = 0.247/2 = using df = 12 – 2 = 10. CONCLUDE: Since the P-value is larger than 0.05, we fail to reject H0. We do NOT have convincing evidence that parties who stay longer at buffets leave larger tips.

Confidence Intervals give more information
In Example 2, we could have done a 90% confidence interval and that would have given the same result as a one and two-sided significance test at α = 0.05. A Confidence interval gives more information – not only does it tell us if we could reject that there is no linear relationship (β=0), but it also gives us plausible values for the slope .

Example : Used Hondas A random sample of 11 used Honda CR-Vs from the model years was selected from the inventory at The number of miles driven and the advertise price were recorded for each Cr-V. A 95% confidence interval for the slope of the true least-squares regression line for predicting advertised price form number of miles (in thousands) driven is (-122.3, -50.1). Based on this interval, what conclusion should we draw from a test of H0: β = 0 and Ha: β ≠ 0 at the 0.05 significance level? Answer: Since 0 is not in the interval of plausible slopes, we reject H0. There is convincing evidence of a linear relationship between the number of miles driven and the advertised price.

TECHNOLOGY: In your calculator…
Enter the x-values into L1 and the y-values into L2 Confidence interval: STAT  TESTS and G:LinRegTInt… Make sure the C-level is correct (in decimal form) Highlight Calculate and hit ENTER (notice how the output has 2 screens – you can scroll down) (some older versions may not have this option – so use the formula: SEb = b/t) Significance test: STAT  TESTS and G:LinRegTTest… Choose correct alternate (leave RegEQ blank)

homework Assigned reading: Section 12.1 Complete HW problems:
(put on the same page as Lesson 1 HW) Check answers to odd problems. STUDY FOR QUIZ on Section 12.1

Inference for Linear Relationships

Similar presentations

Presentation on theme: "Inference for Linear Relationships"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Inference for Linear Relationships

Similar presentations

Presentation on theme: "Inference for Linear Relationships"— Presentation transcript:

Similar presentations

About project

Feedback