Chapter 4 The Relation between Two Variables


Chapter 4 The Relation between Two Variables Prof. Felix Apfaltrer Fapfaltrer@bmcc.cuny.edu Office:N518 Phone: 212-220 8000X 7421 Office hours: Tue/Thu 1:30-3pm

A mathematical model is a mathematical expression that represents some phenomenon. It can be a deterministic model or a probabilistic model. Models often describe the relationship between two variables.

Learning objectives: 1. Draw and interpret scatter diagrams. 2. Understand the properties of the linear correlation coefficient. 3. Compute and interpret the linear correlation coefficient.

4.1. Scatter Diagrams and Correlation: When dealing with two variables, we try to see the relationship between them. Sometimes there is a third variable that is not considered but that affects the results (a lurking variable). For example, shoe size does not cause height to change; age affects both variables. Therefore, a relationship between A and B does not by itself let us conclude that variable A causes B. Some examples are: rainfall amounts and plant growth (possible lurking variable: sunlight); exercise and cholesterol levels for a group of people (possible lurking variable: diet); height and weight for a group of people; height and the fastest speed you have ever driven a car. When we have two variables, they could be related in one of several different ways: they could be unrelated; one variable (the explanatory or predictor variable) could be used to explain the other (the response or dependent variable); or one variable could be thought of as causing the other variable to change.

Scatter Diagrams: The scatter diagram is a graph that shows the relationship between two quantitative variables visually. The explanatory variable is plotted on the horizontal axis (x-axis) and the response variable on the vertical axis (y-axis). The response variable is the variable whose value can be explained by the value of the explanatory variable.

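Linear Correlation: The linear correlation coefficient is a measure of the strength and direction of the linear relation between two quantitative variables. The sample correlation coefficient r is $r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$, where $\bar{x}$ and $\bar{y}$ are the sample means and $s_x$ and $s_y$ are the sample standard deviations. This should be computed with software (and not by hand) whenever possible.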
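As a minimal sketch of what software does here, the Python snippet below computes r as the sum of products of standardized x and y values, divided by n - 1. The function name `sample_correlation` is an illustrative choice, not something from the slides; the four data pairs are the ones used in the slides' worked example.

```python
import math

def sample_correlation(xs, ys):
    """Sample linear correlation coefficient r from standardized scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Sum the products of the standardized values, then divide by n - 1.
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

xs = [1, 1, 3, 5]
ys = [2, 8, 6, 4]
r = sample_correlation(xs, ys)
print(round(r, 3))  # -0.135
```

The result is close to 0, which suggests a very weak negative linear relation for these four points.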

Coefficient of Correlation: Answers ‘How strong is the linear relationship between two variables?’ The population correlation coefficient is denoted ρ (rho). Values range from -1 to +1. It measures the degree of association. The sign of r indicates the direction of the relationship: positive means the two variables tend to increase together; negative means that as one variable increases, the other is likely to decrease. Used mainly for understanding.

[Figure: correlation scale running from -1.0 (perfect negative correlation) through 0 (no correlation) to +1.0 (perfect positive correlation), with the degree of negative correlation increasing toward -1.0 and the degree of positive correlation increasing toward +1.0.]

Examples of positive correlation: strong positive (r = 0.8), moderate positive (r = 0.5), very weak (r = 0.1). Examples of negative correlation: strong negative (r = -0.8), moderate negative (r = -0.5), very weak (r = -0.1). In general, if the correlation is visible to the eye, then it is likely to be strong.

Shortcut formula: $r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$
Data:
x: 1 1 3 5
y: 2 8 6 4
With n = 4, Σx = 10, Σy = 20, Σx² = 36, Σy² = 120, Σxy = 48: $r = \frac{4(48) - (10)(20)}{\sqrt{4(36) - (10)^2}\,\sqrt{4(120) - (20)^2}} = \frac{-8}{59.329} = -0.135$
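The shortcut computation can be checked with a few lines of Python. This is only a sketch: the function name `correlation_shortcut` is made up for illustration, and the data are the four pairs whose sums appear in the worked example (n = 4, Σx = 10, Σy = 20, Σx² = 36, Σy² = 120, Σxy = 48).

```python
import math

def correlation_shortcut(xs, ys):
    """Correlation coefficient r from the raw sums (shortcut formula)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

xs = [1, 1, 3, 5]
ys = [2, 8, 6, 4]
r_shortcut = correlation_shortcut(xs, ys)
print(round(r_shortcut, 3))  # -0.135
```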

Correlation is not causation! Just because two variables are correlated does not mean that one causes the other to change. There is a strong correlation between shoe sizes and vocabulary sizes for grade school children. Clearly, larger shoe sizes do not cause larger vocabularies, and larger vocabularies do not cause larger shoe sizes. Often lurking variables result in confounding.

Summary: Chapter 4 – Section 1. Correlation between two variables can be described with both visual and numeric methods. Visual methods: scatter diagrams, analogous to histograms for single variables. Numeric methods: the linear correlation coefficient, analogous to the mean and variance for single variables. Care should be taken in the interpretation of linear correlation (nonlinearity and causation).

Chapter 4 – Section 2. Learning objectives: 1. Find the least-squares regression line and use the line to make predictions and estimations. 2. Interpret the slope and the y-intercept of the least-squares regression line. 3. Compute the sum of squared residuals.

If we have two variables X and Y, we often would like to model the relation as a line. Draw a line through the scatter diagram. We want to find the line that “best” describes the linear relationship … the regression line.

Linear Equations: We want to use a linear model. Linear models can be written in several different (equivalent) ways: y = m x + b; y – y1 = m (x – x1); y = b1 x + b0. Because the slope and the intercept are important to analyze, we will use y = b1 x + b0.


One difference between math and statistics is that statistics assumes that the measurements are not exact, that there is an error or residual. The formula for the residual is always Residual = Observed – Predicted. [Figure: what the residual is on the scatter diagram, showing the model line, the x value of interest, the observed value y, the predicted value ŷ, and the residual.] The equation for the least-squares regression line is given by ŷ = b1 x + b0, where b1 is the slope of the least-squares regression line (marginal change) and b0 is the y-intercept of the least-squares regression line.

Example: the line ŷ = 5 + 4x for the data
x: 1 2 4 5
y: 4 24 8 32
Least-Squares Property: A straight line satisfies this property if the sum of the squares of the residuals is the smallest sum possible.
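To make the least-squares property concrete, the sketch below computes the residuals (observed minus predicted) for the line ŷ = 5 + 4x on the four data points, and compares its sum of squared residuals with two nearby lines. The alternative slopes and intercepts are arbitrary choices made only for comparison, and the helper name `sum_squared_residuals` is illustrative.

```python
def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of (observed - predicted)^2 for the line y = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 4, 5]
ys = [4, 24, 8, 32]

# Residual = Observed - Predicted, for the line y-hat = 5 + 4x.
residuals = [y - (5 + 4 * x) for x, y in zip(xs, ys)]
print(residuals)  # [-5, 11, -13, 7]

sse_line = sum_squared_residuals(xs, ys, slope=4, intercept=5)
# Two arbitrary nearby lines, chosen only to compare against.
sse_steeper = sum_squared_residuals(xs, ys, slope=4.5, intercept=5)
sse_higher = sum_squared_residuals(xs, ys, slope=4, intercept=6)
print(sse_line, sse_steeper, sse_higher)  # 364 375.5 368
```

Both alternative lines give a larger sum of squared residuals than ŷ = 5 + 4x, which is what the property says should happen for any other line.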

Slope of the least-squares regression line (shortcut): $b_1 = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$. y-intercept: $b_0 = \bar{y} - b_1\bar{x}$. Calculators or computers can compute these values.

Finding the values of b1 and b0 by hand is a very tedious process. You should use software for this. Finding the coefficients b1 and b0 is only the first step of a regression analysis. We need to interpret the slope b1. We need to interpret the y-intercept b0. We need to do quite a bit more statistical analysis … this is covered in Section 4.3 and also in Chapter 14.

Worked example. Data:
x: 1 1 3 5
y: 2 8 6 4
n = 4, Σx = 10, Σy = 20, Σx² = 36, Σy² = 120, Σxy = 48.
Slope: $b_1 = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{4(48) - (10)(20)}{4(36) - (10)^2} = \frac{-8}{44} = -0.181818$.
y-intercept: $b_0 = \bar{y} - b_1\bar{x} = 5 - (-0.181818)(2.5) = 5.45$.
The estimated equation of the regression line is: $\hat{y} = 5.45 - 0.182x$.
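The same arithmetic can be reproduced in Python. This is a minimal sketch of the shortcut formulas for the slope and y-intercept; the function name `least_squares_line` is an illustrative choice, and the data are the four pairs behind the sums quoted in the example.

```python
def least_squares_line(xs, ys):
    """Return (b1, b0): slope and y-intercept of the least-squares line."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Shortcut formula for the slope.
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: mean of y minus slope times mean of x.
    b0 = sum_y / n - b1 * (sum_x / n)
    return b1, b0

xs = [1, 1, 3, 5]
ys = [2, 8, 6, 4]
b1, b0 = least_squares_line(xs, ys)
print(round(b1, 3), round(b0, 2))  # -0.182 5.45
```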

Guidelines for Using The Regression Equation 1. If there is no significant linear correlation, don’t use the regression equation to make predictions. 2. When using the regression equation for predictions, stay within the scope of the available sample data. 3. A regression equation based on old data is not necessarily valid now. 4. Don’t make predictions about a population that is different from the population from which the sample data was drawn.

Chapter 4 – Section 3. Learning objectives: 1. Compute and interpret the coefficient of determination. 2. Perform residual analysis on a regression model. 3. Identify influential observations. The relationship is Total Deviation = Explained + Unexplained. The larger the explained deviation, the better the model is at prediction / explanation. The larger the unexplained deviation, the worse the model is at prediction / explanation.

We began with $y - \bar{y}$, the total deviation. Our regression model reduces this to $y - \hat{y}$, the unexplained deviation. The amount of reduction, $\hat{y} - \bar{y}$, is the explained deviation.

Instead of straight deviations, we use variations: Variation = Deviation². It is also true that Total Variation = Explained + Unexplained. A measure of the explanatory power of the model is the proportion of variation that is explained, that is, the explained variation divided by the total variation.

[Figure: plot of Y marking the unexplained sum of squares $\sum (Y - \hat{Y})^2$ and the total sum of squares $\sum (Y - \bar{Y})^2$.]

Proportion of variation ‘explained’ by the relationship between X and Y: simply square the correlation r. r² is called the coefficient of determination. 0 ≤ r² ≤ 1 (as a percentage, the percentage of variation explained by X).
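A short Python sketch can confirm that the explained share of the variation equals r². It fits the least-squares line, splits the total variation into explained and unexplained parts, and takes the ratio. The function name `variation_breakdown` is illustrative, and the data are the four pairs from the earlier correlation example (where r = -0.135).

```python
def variation_breakdown(xs, ys):
    """Return (total, explained, unexplained) variation for the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    predicted = [b0 + b1 * x for x in xs]
    total = sum((y - y_bar) ** 2 for y in ys)
    explained = sum((p - y_bar) ** 2 for p in predicted)
    unexplained = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    return total, explained, unexplained

total, explained, unexplained = variation_breakdown([1, 1, 3, 5], [2, 8, 6, 4])
r_squared = explained / total
print(round(total, 3), round(explained, 3), round(unexplained, 3))
print(round(r_squared, 4))  # 0.0182, i.e. about 1.8% of the variation
```

Here the line explains under 2% of the variation in y, consistent with the very weak correlation found for these points.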

How can we tell how good our model is? To check whether a linear model is appropriate, plot the residuals (error) on the vertical axis against the explanatory variable (the x) on the horizontal axis. If the plot shows a pattern (such as a curve), then the response (y) and explanatory (x) variables may have a nonlinear relationship. If there is no obvious pattern, we could be ok …
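The following sketch shows the kind of pattern a residual plot can reveal. The data are hypothetical (y = x², chosen only because the relationship is clearly curved), and the helper name `fit_and_residuals` is made up for illustration. Printing the residuals in order of x stands in for the plot.

```python
def fit_and_residuals(xs, ys):
    """Fit the least-squares line and return the residuals (observed - predicted)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Hypothetical curved data: y = x squared.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]
residuals = fit_and_residuals(xs, ys)
for x, e in zip(xs, residuals):
    print(x, e)
```

The residuals are positive at both ends and negative in the middle, a U-shaped pattern that suggests a straight line is not an appropriate model for these data.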

The least-squares regression model assumes that the variance of the residuals is constant across values of the explanatory variable. To check whether the variance of the residuals is constant, plot the residuals (error) on the vertical axis against the explanatory variable (the x) on the horizontal axis. This is the same plot as the plot checking linearity. [Figure: two example residual plots, one with no spread and one with spread.] If there is a spread (the dotted blue line), then a linear relationship is not very reliable.

Definition: Outliers for a least-squares regression are those observations that are unusually far away from the model line. There are several ways to identify outliers. The scatter diagram may show the outlier as a point away from the main pattern of points. The residual plot may show the outlier as an unusually high or unusually low residual. The boxplot of residuals may identify the outlier as a value outside the upper or lower fence.
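The boxplot check can be sketched in Python. This assumes the usual convention that the fences lie 1.5 times the interquartile range below the first quartile and above the third quartile; the slides do not spell out the fence rule, and quartile conventions vary between textbooks and software. The residual values are hypothetical, and `fence_outliers` is an illustrative name.

```python
import statistics

def fence_outliers(values):
    """Return the values outside the lower or upper fence (1.5 * IQR rule)."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical residuals from some regression; one is unusually large.
residuals = [-1.2, -0.8, -0.5, -0.1, 0.0, 0.3, 0.6, 0.9, 1.1, 6.0]
outliers = fence_outliers(residuals)
print(outliers)  # [6.0]
```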

[Figure: three ways to identify outliers: from a scatter diagram, from a residual plot, and from a boxplot.]

Influential Points: An influential point strongly affects the graph of the regression line. It has a significant effect on the value of the slope, or a significant effect on the value of the intercept. Usually influential observations are those with unusually high or unusually low values of the predictor (x) variable. [Figure: three example points, labeled outlier, influential, and definitely influential.] For one point, the x value is large compared to the others and it is not along the general linear pattern of the data. Another point is not along the general linear pattern of the data; however, it is likely not to be influential. For a third point, the x and y values are large compared to the others; however, it is along the general linear pattern of the data.
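One simple way to see influence is to fit the line with and without the suspect observation and compare the coefficients. The sketch below does that on hypothetical data: five points lying exactly on y = 2x, plus one added point with a much larger x value that does not follow the pattern. The names `least_squares_line` and `suspect` are illustrative.

```python
def least_squares_line(points):
    """Return (slope, intercept) of the least-squares line through (x, y) points."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in points)
             / sum((x - x_bar) ** 2 for x, _ in points))
    return slope, y_bar - slope * x_bar

# Hypothetical data: five points on y = 2x, plus one point far out in x.
main_points = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10)]
suspect = (20, 5)

slope_without, intercept_without = least_squares_line(main_points)
slope_with, intercept_with = least_squares_line(main_points + [suspect])
print(round(slope_without, 3), round(intercept_without, 3))
print(round(slope_with, 3), round(intercept_with, 3))
```

Adding the single point pulls the slope from 2 down to nearly 0 and raises the intercept to almost 6, so this point would count as influential.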

If a particular observation is influential, we should investigate that observation. If the observation is a valid observation, we have a variety of options. We could collect additional points near the influential observation. We could collect additional points between the main part of our data and the influential point (to check whether the data is nonlinear, for example). We could use techniques that are resistant to influential observations.

Summary: Chapter 4 – Section 3. Diagnostics are very important in assessing the quality of a least-squares regression model. The coefficient of determination measures the percent of total variation explained by the model. The plot of residuals can detect nonlinear patterns, error variances that are not constant, and outliers. We must be careful when there are influential observations because they have an unusually large effect on the computation of our model parameters.