Two Quantitative Variables


Chapter 10: Two Quantitative Variables. Topics: Inference for Correlation: Simulation-Based Methods; Least-Squares Regression; Inference for Correlation and Regression: Theory-Based Methods.

Sections 10.1 and 10.2: Summarizing Two Quantitative Variables and Inference for Correlation (Simulation). I was interested in knowing the relationship between the time it takes a student to take a test and the resulting score, so I collected some data:
Time:  30  41  43  47  48  51  54  56  57  58  60  61  62  63  66  69  72  78  79
Score: 100 84  94  90  88  99  85  65  64  89  83  86  92  74  73  75  53  91  68

Scatterplot Put explanatory variable on the horizontal axis. Put response variable on the vertical axis.

Describing Scatterplots: When we describe data in a scatterplot, we describe the direction (positive or negative), the form (linear or not), and the strength (strong, moderate, or weak; we will let correlation help us decide). How would you describe the time and test-score scatterplot?

Correlation: Correlation measures the strength and direction of a linear association between two quantitative variables. Correlation is a number between -1 and 1. With positive correlation, one variable increases, on average, as the other increases; with negative correlation, one variable decreases, on average, as the other increases. The closer the correlation is to -1 or 1, the more closely the points follow a line. The correlation for the test data is -0.56.

Correlation Guidelines
0.5 to 1.0 (Strong): the points will appear to lie nearly on a straight line.
0.3 to 0.5 (Moderate): the increasing/decreasing pattern will be clear in the graph, but it won't be nearly a line.
0.1 to 0.3 (Weak): with some effort you will be able to see a slightly increasing/decreasing pattern.
0 to 0.1 (None): no discernible increasing/decreasing pattern.
The same strength guidelines apply to negative correlations.

Back to the test data I was not completely honest. There were three more test scores. The last three people to finish the test had scores of 93, 93, and 97. What does this do to the correlation?

Influential Observations: The correlation changed from -0.56 (a fairly strong negative correlation) to -0.12 (a fairly weak negative correlation). Points that are far to the left or right and do not follow the overall direction of the scatterplot can greatly change the correlation; such points are called influential observations.

Correlation: Correlation measures the strength and direction of a linear association between two quantitative variables. It always satisfies -1 ≤ r ≤ 1. Correlation makes no distinction between explanatory and response variables. Correlation has no units. Correlation is not a resistant measure. Correlation based on averages is usually higher than correlation based on individuals. (Correlation game)
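For reference, the correlation reported here is the usual sample correlation coefficient (a standard formula, not shown on the slide):

\[ r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right) \]

where \(\bar{x}\) and \(\bar{y}\) are the sample means and \(s_x\) and \(s_y\) are the sample standard deviations of the two variables.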

Inference for Correlation with Simulation 1. Compute the observed statistic. (Correlation) 2. Scramble the response variable, compute the simulated statistic, and repeat this process many times. 3. Reject the null hypothesis if the observed statistic is in the tail of the null distribution.
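For readers who want to reproduce this procedure outside the applet, here is a minimal Python sketch of the three steps; the function name, the inputs x and y, the number of shuffles, and the seed are illustrative choices, not part of the original activity.

import random
import numpy as np

def shuffle_test_for_correlation(x, y, num_shuffles=1000, seed=1):
    """Approximate a one-sided p-value for a positive correlation by shuffling y."""
    rng = random.Random(seed)
    observed_r = np.corrcoef(x, y)[0, 1]          # step 1: the observed statistic
    y_shuffled = list(y)
    count_at_least_as_large = 0
    for _ in range(num_shuffles):                 # step 2: shuffle the response and recompute, many times
        rng.shuffle(y_shuffled)
        simulated_r = np.corrcoef(x, y_shuffled)[0, 1]
        if simulated_r >= observed_r:
            count_at_least_as_large += 1
    return observed_r, count_at_least_as_large / num_shuffles   # step 3: proportion in the tail

Calling shuffle_test_for_correlation(temps, heart_rates) on lists holding the data below would return the observed correlation and an estimated p-value.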

Temperature and Heart Rate: Hypotheses. Null: There is no association between heart rate and body temperature (ρ = 0). Alternative: There is a positive linear association between heart rate and body temperature (ρ > 0). Here ρ is the Greek letter rho, the population correlation.

Temperature and Heart Rate: Collect the Data
Temp: 98.3 98.2 98.7 98.5 97.0 98.8 99.3 97.8 99.9 98.6 98.4 97.4 96.7 98.0
HR: 72 69 71 80 81 68 82 65 79 86 58 84 73 57 62 89

Temperature and Heart Rate Explore the Data r = 0.378

Temperature and Heart Rate: If there were no association between heart rate and body temperature, what is the probability we would get a correlation as high as 0.378 just by chance? If there is no association, we can break apart the temperatures and their corresponding heart rates. We will do this by shuffling one of the variables. (The applet shuffles the response.)

Three shuffles of the heart rates (the temperatures stay in the same order each time):
Temp: 98.30 98.20 98.70 98.50 97 98.80 98.50 98.70 99.30 97.80 98.20 99.90 98.60 98.60 97.80 98.40 98.70 97.40 96.70 98
Shuffle 1 (r = -0.216), HR: 81 73 57 86 79 72 71 68 58 89 82 69 71 72 62 84 82 80 65 68
Shuffle 2 (r = 0.212), HR: 58 86 79 73 71 82 57 80 71 72 81 82 89 62 84 72 68 68 69 65
Shuffle 3 (r = -0.097), HR: 86 65 69 71 73 58 81 72 57 82 71 80 68 79 84 89 68 82 62 72

Temperature and Heart Rate: In three shuffles, we obtained correlations of -0.216, 0.212, and -0.097. These are all closer to 0 than our sample correlation of 0.378. Let's shuffle 10,000 times and compute a simulated correlation for each shuffle.

Temperature and Heart Rate: Notice our null distribution is centered at 0 and roughly symmetric. In 530 of the 10,000 shuffles, the simulated correlation was greater than or equal to 0.378.

Temperature and Heart Rate: With a p-value of 0.053 we don't have strong evidence of a positive linear association between body temperature and heart rate. However, we do have moderate evidence of such an association, and perhaps a larger sample would give a smaller p-value. If this were significant, to what population could we make our inference?

Temperature and Heart Rate Let’s look at a different data set comparing temperature and heart rate (Example 10.5A) using the Analyzing Two Quantitative Variables applet. Pay attention to which variable is explanatory and which is response. The default statistic used is not correlation, so we need to change that.

Work on Exploration 10.2 to see if there is an association between birth date and 1970 draft lottery number.

Least Squares Regression Section 10.3

Introduction: If we decide an association is linear, it's helpful to develop a mathematical model of that association, which helps us make predictions about the response variable. The least-squares regression line is the most common way of doing this.

Introduction: Unless the points form a perfect line, there won't be a single line that goes through every point. We want a line that gets as close as possible to all the points.

Introduction: We want a line that minimizes the vertical distances between the line and the points. These distances are called residuals. The line we will find actually minimizes the sum of the squares of the residuals; this is called a least-squares regression line.
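In symbols (a standard way to state what this slide describes), the least-squares line chooses the intercept a and slope b that minimize the sum of squared residuals:

\[ \sum_{i=1}^{n} \bigl(y_i - (a + b x_i)\bigr)^2 \]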

Are Dinner Plates Getting Larger? Example 10.3

Growing Plates? There are many recent articles and TV reports about the obesity problem. One reason some have given is that the size of dinner plates is increasing. Are these black circles the same size, or is one larger than the other?

Growing Plates? They appear to be the same size to many people, but the circle on the right is about 20% larger than the one on the left. This suggests that people may put more food on larger dinner plates without knowing it.

Portion Distortion: There is a name for this phenomenon: the Delboeuf illusion.

Growing Plates? Researchers gathered data to investigate the claim that dinner plates are growing: American dinner plates sold on eBay on March 30, 2010 (Van Ittersum and Wansink, 2011), recording size and year manufactured.
Year: 1950 1956 1957 1958 1963 1964 1969 1974 1975 1978 1980 1986 1990 1995 2004 2007 2008 2009
Size (in.): 10 10.8 10.1 10.6 10.5 10.38 10.4 11 11.5 11.1

Growing Plates? Both year (explanatory variable) and size (response variable) are quantitative. Each dot represents one plate in this scatterplot. Describe the association here.

Growing Plates? The association appears to be roughly linear. The least-squares regression line is added to the scatterplot. How can we describe this line?

Growing Plates? The regression equation is ŷ = a + bx, where a is the y-intercept, b is the slope, x is a value of the explanatory variable, and ŷ is the predicted value of the response variable. For a specific value of x, the corresponding distance y − ŷ is a residual.

Growing Plates? The least-squares line for the dinner plate data is ŷ = -14.8 + 0.0128x, or predicted diameter = -14.8 + 0.0128(year). This allows us to predict plate diameter for a particular year. Substituting 2000 for year, we predict a diameter of -14.8 + 0.0128(2000) = 10.8 in.
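As a rough illustration of how such a line could be computed outside the applet, here is a minimal Python sketch; the toy arrays below are made-up values (not the real plate data), and np.polyfit is just one convenient routine for getting the least-squares coefficients.

import numpy as np

def fit_least_squares_line(x, y):
    """Return (intercept, slope) of the least-squares line for arrays x and y."""
    slope, intercept = np.polyfit(x, y, 1)   # degree-1 fit; coefficients come back highest power first
    return intercept, slope

# Tiny made-up example, just to show the call:
toy_years = np.array([1950.0, 1970.0, 1990.0, 2010.0])
toy_sizes = np.array([10.0, 10.3, 10.7, 11.2])
intercept, slope = fit_least_squares_line(toy_years, toy_sizes)
predicted_2000 = intercept + slope * 2000.0   # plug a year into y-hat = a + b*x

With the published fit (intercept -14.8, slope 0.0128), this prediction step reproduces the 10.8 in. value on the slide.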

Growing Plates? ŷ = -14.8 + 0.0128x. What is the predicted diameter for a plate manufactured in the year 2001? -14.8 + 0.0128(2001) = 10.8128 in. How does this compare to our prediction for the year 2000 (10.8 in.)? It is 0.0128 in. larger. The slope b = 0.0128 means that diameters are predicted to increase by 0.0128 inches per year, on average.

Growing Plates? The slope is the predicted change in the response variable for a one-unit change in the explanatory variable. Both the slope and the correlation coefficient for this study were positive; the slope and the correlation coefficient will always have the same sign.

Growing Plates? The y-intercept is where the regression line crosses the y-axis, that is, the predicted response when the explanatory variable equals 0. We had a y-intercept of -14.8 in the dinner plate equation. What does this tell us about our dinner plate example? It says dinner plates in year 0 were -14.8 inches across. How can a diameter be negative? The equation works well within the range of values given for the explanatory variable but fails outside that range, so our equation should only be used to predict the size of dinner plates from about 1950 to 2010.

Growing Plates? Predicting values for the response variable for values of the explanatory variable that are outside of the range of the original data is called extrapolation.

Predicting Height from Footprints Exploration 10.3

Inference for the Regression Slope: Theory-Based Approach Section 10.5

Beta vs. Rho: Testing the slope of the regression line is equivalent to testing the correlation, hence these hypotheses are equivalent: H0: β = 0, Ha: β > 0 (slope) and H0: ρ = 0, Ha: ρ > 0 (correlation). The sample slope is b and the population slope is β (beta); the sample correlation is r and the population correlation is ρ (rho). When we do the theory-based test, we will be using the t-statistic, which can be calculated from either the slope or the correlation.
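For reference, the equivalence holds because the same t-statistic (with n - 2 degrees of freedom) can be written from either the sample slope or the sample correlation; these are the standard formulas, not shown on the slide:

\[ t = \frac{b}{SE_b} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \]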

Introduction We have seen in this chapter that our null distributions are again bell-shaped and centered at 0 (for either correlation or slope as our statistic). This is also true if we use the t-statistic.

Validity Conditions: Under certain conditions, theory-based inference for the correlation or the slope of the regression line uses t-distributions. We are going to use the theory-based methods for the slope of the regression line (the applet allows us to compute confidence intervals for the slope). We would get the same p-value if we used correlation as our statistic.

Validity conditions for theory-based inference about the slope of the regression line: the values of the response variable at each possible value of the explanatory variable must have a normal distribution in the population from which the sample was drawn, and each of these normal distributions must have the same standard deviation.
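These conditions are the usual way of saying the data follow the simple linear regression model (a standard formulation, not spelled out on the slide):

\[ y = \alpha + \beta x + \varepsilon \]

where the errors ε are independent, normally distributed with mean 0, and have the same standard deviation at every value of x.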

Validity Conditions: Suppose the graph represents a scatterplot for an entire population where the explanatory variable is heart rate and the response is body temperature. The temperatures at each heart rate (BPM) follow a normal distribution, and each of these normal distributions has the same standard deviation.

Validity Conditions How realistic are these conditions? In practice, you can check these conditions by examining your scatterplot. Let’s look at some scatterplots that do not meet the requirements.

Smoking and Drinking: The relationship between number of drinks and cigarettes per week for a random sample of students at Hope College. The dot at (0, 0) represents 524 students. Are the conditions met?

Smoking and Drinking: Since these conditions aren't met, we shouldn't apply theory-based inference.

When checking conditions for traditional tests, use the following logic: assume that the condition is met in your data, examine the data in the appropriate manner, and if there is strong evidence that the condition is not met, don't use the test.

Validity Conditions: What do you do when validity conditions aren't met for theory-based inference? Use the simulation-based approach. Another strategy is to "transform" the data to a different scale so the conditions are met; the logarithmic scale is common.

Predicting Heart Rate from Body Temperature Example 10.5A

Heart Rate and Body Temp: Earlier we looked at the relationship between heart rate and body temperature with 130 healthy adults: predicted heart rate = -166.3 + 2.44(temp), with r = 0.257.

Heart Rate and Body Temp: Earlier we used a simulation-based approach to test whether we had convincing evidence of a positive association between heart rate and body temperature in the population. (This time we will make the test two-sided.) Null hypothesis: There is no association between heart rate and body temperature in the population (β = 0). Alternative hypothesis: There is an association between heart rate and body temperature in the population (β ≠ 0).

Heart Rate and Body Temp: We get a very small p-value (≈ 0.006), since a slope as extreme as our observed 2.44 would be very rare if the null hypothesis were true.

Heart Rate and Body Temp: We can also approximate a 95% confidence interval as observed statistic ± 2 SDs of the statistic: 2.44 ± 2(0.850) = 0.74 to 4.14. What does this mean? We're 95% confident that, in the population of healthy adults, each 1° increase in body temperature is associated with an increase in average heart rate of between 0.74 and 4.14 beats per minute.
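The theory-based interval previewed on the following slides replaces the rough multiplier of 2 with a t critical value with n - 2 degrees of freedom (the standard formula):

\[ b \pm t^{*}_{n-2}\, SE_b \]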

Heart Rate and Body Temp: The theory-based approach should work well since the distribution has a nice bell shape. Also check the scatterplot.

Heart Rate and Body Temp Let’s do the theory-based test using the applet. We will use the t-statistic to get our theory-based p-value. We will find a theory-based confidence interval for the slope. After we are done with that, do Exploration 10.5: Predicting Brain Density from Facebook Friends