Chapter 3: Examining relationships between Data


Explanatory variable: attempts to explain the observed outcomes. Response variable: measures the outcome of a study. Calling one variable explanatory and the other response does not imply a cause-and-effect relationship.

Analyzing data
Start with a graph. Look for patterns or deviations from patterns. Examine numerical descriptors of the data (mean, median, IQR, etc.).
Scatterplots
A scatterplot shows the relationship between two quantitative variables. Always put the explanatory variable on the x-axis if one can be identified. (Chicken or egg?)

Interpreting scatterplots
Look for the overall pattern:
- Direction: positive or negative association.
- Form: linear, curved, clusters, etc.
- Strength: strong if the points lie close to a discernible line or curve; weak if the points are widely scattered.
- Pay careful attention to any outliers or clusters. You may want to split the data into categories to reveal information.
Activity: lengths of the ring finger and first finger.

So let's look at this scatterplot. Overall pattern: in general, blood alcohol content (BAC) increases with the number of beers you drink. There is quite some variation in BAC for the same number of beers drunk. A person's blood volume is a factor that we have overlooked. Now we change the explanatory variable from the number of beers to the number of beers divided by the person's weight in pounds. Note how much smaller the variation is. An individual's weight was indeed influencing our response variable, blood alcohol content.

Correlation tells us about the strength (scatter) and direction of the linear relationship between two quantitative variables. But which line best describes our data?

If x_i and y_i tend to fall on the same side of their respective means (both above or both below), then r will be positive; if they tend to fall on opposite sides, then r will be negative. Insert Definition 14.6 from page 724
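The transcript does not reproduce the definition of r. The sketch below assumes the usual sample correlation formula (the average product of standardized x and y values, dividing by n - 1); this formula and the small data sets are assumptions for illustration, not taken from the slides. It shows why points on the same side of both means push r up and points on opposite sides push it down.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation r: average product of standardized x and y values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # A product is positive when x and y lie on the same side of their means.
    products = [((x - mean_x) / s_x) * ((y - mean_y) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 5]     # tends to rise with x
y_down = [5, 4, 5, 4, 2]   # tends to fall with x

r_up = correlation(x, y_up)
r_down = correlation(x, y_down)
print(round(r_up, 3), round(r_down, 3))
```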

Correlation coefficient app http://www.gifted.uconn.edu/siegle/research/Correlation/scatter/scatterplotdemo.html


Properties of r
1. Positive r indicates positive association; negative r indicates negative association.
2. r always lies between -1 and 1. Values near -1 indicate a strong negative association, values near 0 indicate no linear association, and values near 1 indicate a strong positive association.
3. Changing units will not change r (for example, converting lbs. to kg).
4. The correlation r describes the strength and direction of a straight-line relationship.
5. r is nonresistant: like the mean and standard deviation, it is affected by outliers or extreme observations.
6. Correlation is not a complete description of two-variable data. The means and standard deviations of both variables are important as well.
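Two of these properties are easy to check numerically. The sketch below uses made-up weights and heights (hypothetical values, not data from the slides) to illustrate that converting pounds to kilograms leaves r unchanged, while a single extreme observation can change r substantially.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation r computed from sums of squares and cross-products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical data, for illustration only.
weights_lb = [120, 150, 165, 180, 210]
heights_in = [62, 66, 68, 71, 74]

# Changing units: pounds to kilograms.
weights_kg = [w * 0.45359237 for w in weights_lb]
r_lb = correlation(weights_lb, heights_in)
r_kg = correlation(weights_kg, heights_in)

# Nonresistance: add one extreme observation and recompute.
r_with_outlier = correlation(weights_lb + [400], heights_in + [60])

print(round(r_lb, 4), round(r_kg, 4), round(r_with_outlier, 4))
```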

A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x. The point formed by the average x value and the average y value always lies on the least-squares regression line. Least squares regression line app http://hspm.sph.sc.edu/courses/J716/demos/LeastSquares/LeastSquaresDemo.html


The least-squares regression line is the unique line such that the sum of the squared vertical (y) distances between the data points and the line is the smallest possible. Distances between the points and line are squared so all are positive values. This is done so that distances can be properly added (Pythagoras).
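As a rough numerical check of this definition, the sketch below fits a line with the standard closed-form least-squares formulas and compares its sum of squared vertical distances with that of a few nearby lines. The data and the particular nearby lines are hypothetical choices made for illustration.

```python
def fit_line(xs, ys):
    """Closed-form least-squares intercept a and slope b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

def sum_squared_residuals(a, b, xs, ys):
    """Sum of squared vertical distances between the points and y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)
best = sum_squared_residuals(a, b, x, y)

# Perturb the intercept and slope; every other line should do worse.
nearby = [
    sum_squared_residuals(a + da, b + db, x, y)
    for da in (-0.5, 0.0, 0.5)
    for db in (-0.2, 0.0, 0.2)
    if (da, db) != (0.0, 0.0)
]
print(best, min(nearby))
```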

The least-squares regression line can be shown to have this equation: $\hat{y} = a + bx$, where $\hat{y}$ is the predicted y value (y hat), b is the slope, and a is the y-intercept. "a" is in units of y; "b" is in units of y per unit of x.

Determine the regression line
First we calculate the slope of the line, b, from statistics we already know: $b = r \frac{s_y}{s_x}$, where r is the correlation, $s_y$ is the standard deviation of the response variable y, and $s_x$ is the standard deviation of the explanatory variable x. Once we know b, the slope, we can calculate a, the y-intercept: $a = \bar{y} - b\bar{x}$, where $\bar{x}$ and $\bar{y}$ are the sample means of the x and y variables. This means that we don't have to calculate a lot of squared distances to find the least-squares regression line for a data set. We can instead rely on the equation. But typically, we use a two-variable statistics calculator or statistical software.
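The two-step calculation can be sketched in a few lines of Python. The function name and the data are hypothetical; the slope is computed as r times s_y / s_x and the intercept as the mean of y minus b times the mean of x, as described above.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x using b = r*s_y/s_x and a = y_bar - b*x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    # Correlation r from the standardized cross-products.
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # y-intercept
    return a, b

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = least_squares_line(x, y)
print(round(a, 3), round(b, 3))

# The fitted line passes through the point of means.
predicted_at_mean = a + b * mean(x)
```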

The equation completely describes the regression line. To plot the regression line, you only need to plug two x values into the equation, get the predicted y for each, and draw the line that goes through those two points. The points you use for drawing the regression line are derived from the equation. They are NOT points from your sample data (except by pure coincidence). Hint: The regression line always passes through the point given by the mean of x and the mean of y.

Residuals
The vertical distances from each point to the least-squares regression line give us potentially useful information about the contribution of individual data points to the overall pattern of scatter. These distances are called “residuals”: residual = observed y − predicted y, that is, $y - \hat{y}$. The sum of these residuals is always 0. Points above the line have a positive residual. Points below the line have a negative residual.
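A short sketch makes the sign convention and the zero-sum property concrete. The data are hypothetical, and the line is fitted with the standard closed-form least-squares formulas.

```python
# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept (closed form).
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar

# Residual = observed y minus predicted y.
predicted = [a + b * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

for xi, yi, res in zip(x, y, residuals):
    side = "above" if res > 0 else "below"
    print(f"x={xi}, observed y={yi}, residual={res:+.2f} ({side} the line)")

print("sum of residuals:", round(sum(residuals), 10))
```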

Residual plots
Residuals are the differences between y-observed and y-predicted. We plot them in a residual plot. If residuals are scattered randomly around 0, chances are your data fit a linear model, were normally distributed, and you didn’t have outliers.

Residuals are randomly scattered: good! A curved pattern means the relationship you are looking at is not linear. A change in variability across the plot is a warning sign. You need to find out why it occurs, and remember that predictions made in areas of larger variability will not be as good.

The x-axis in a residual plot is the same as on the scatterplot. The line on both plots is the regression line. Only the y-axis is different.

Coefficient of determination, r²
r², the coefficient of determination, is the square of the correlation coefficient. r² represents the fraction of the variance in y that can be explained by changes in x; the remaining variance is the vertical scatter about the regression line. Go over point that missing zero is OK: this is messy data, a prediction based on messy data. Residuals should be scattered randomly around the line, or there is something wrong with your data (not linear, outliers, etc.).

r = 0.87, r² = 0.76: Here the change in x explains only 76% of the change in y. The rest of the change in y (the vertical scatter, shown as red arrows) must be explained by something other than x.
r = −1, r² = 1: Changes in x explain 100% of the variation in y. y can be entirely predicted for any given value of x.
r = 0, r² = 0: Changes in x explain 0% of the variation in y. The value(s) y takes show no linear dependence on the value x takes.
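The "fraction of variation explained" reading can be checked directly. In this sketch (hypothetical data; standard least-squares formulas), r² is computed as 1 minus the ratio of the residual sum of squares to the total sum of squares, and compared with the square of the correlation r.

```python
from math import sqrt

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)  # total variation in y

# Least-squares line and correlation.
b = s_xy / s_xx
a = y_bar - b * x_bar
r = s_xy / sqrt(s_xx * s_yy)

# Variation left unexplained: squared vertical scatter about the line.
unexplained = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
r_squared = 1 - unexplained / s_yy

print(round(r, 4), round(r_squared, 4))
```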

Grade performance
If class attendance explains 16% of the variation in grades, what is the correlation between percent of classes attended and grade?
1. We need to make an assumption: attendance and grades are positively correlated, so r will be positive.
2. r² = 0.16, so r = +√0.16 = +0.4. A weak correlation.

So let’s look at this scatterplot. Overall pattern: in general, the BAC increases with the number of beers you drink. There is quite some variation in BAC for the same number of beers drunk (r = 0.7, r² = 0.49). A person’s blood volume is a factor that was overlooked here. We changed the number of beers to the number of beers divided by the person's weight in pounds (r = 0.9, r² = 0.81). In the first plot, number of beers explains only 49% of the variation in blood alcohol content, but number of beers/weight explains 81% of the variation. Additional factors contribute to variations in BAC among individuals (like maybe some genetic ability to process alcohol).

Outliers and influential points
Outlier: An observation that lies outside the overall pattern of observations.
“Influential individual”: An observation that markedly changes the regression if removed. This is often an outlier on the x-axis.
Child 19 is an outlier in the y direction, that is, an outlier of the relationship. Child 18 is only an outlier in the x direction and thus might be an influential point.

Are these points influential? [Figure: regression lines fitted to all data, without child 18, and without child 19. Plot labels: "Outlier in y-direction", "Influential".]
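One way to judge influence is to refit the line with the suspect point removed and see how much the slope moves. The sketch below does this on made-up data (not the children's data from the slides): one added point is an outlier in the x direction and another is an outlier in the y direction near the middle of the x range.

```python
def slope(points):
    """Least-squares slope for a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    return s_xy / s_xx

# Hypothetical data, for illustration only.
base = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1), (6, 6.0)]
y_outlier = (3.5, 7.0)   # unusual y, ordinary x
x_outlier = (15, 6.0)    # unusual x, far from the other points

all_points = base + [y_outlier, x_outlier]

slope_all = slope(all_points)
slope_without_y_outlier = slope(base + [x_outlier])
slope_without_x_outlier = slope(base + [y_outlier])

print("all data:         ", round(slope_all, 3))
print("without y-outlier:", round(slope_without_y_outlier, 3))
print("without x-outlier:", round(slope_without_x_outlier, 3))
```

In this made-up example, removing the x-direction outlier changes the slope far more than removing the y-direction outlier does, which is the behavior the slide describes as "influential".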

That’s why growth charts typically show a range of values (here from the 5th to the 95th percentile). This is a more comprehensive way of displaying the same information. So, any individual boy may be well within the range of height for his age, but also be within the distributions for boys a year or two older or younger as well. Two possible size distributions are shown.

There is a positive linear relationship between the number of powerboats registered (in 1000’s) and the number of manatee deaths. The least-squares regression line has the equation: Thus, if we were to limit the number of powerboat registrations to 500,000, what could we expect for the number of manatee deaths? Roughly 21 manatees.