Linear Regression and Correlation


Linear Regression and Correlation

The explanatory and response variables are both numeric. The relationship between the mean of the response variable and the level of the explanatory variable is assumed to be approximately linear (a straight line).

Model: Y = β0 + β1X + ε
- β1 > 0 → Positive association
- β1 < 0 → Negative association
- β1 = 0 → No association

Least Squares Estimation of β0, β1

β0 → Mean response when X = 0 (Y-intercept)
β1 → Change in mean response when X increases by 1 unit (slope)
β0, β1 are unknown parameters (like μ)
β0 + β1X → Mean response when the explanatory variable takes on the value X

Goal: Choose values (estimates b0, b1) that minimize the sum of squared errors (SSE) of the observed values about the fitted straight line: SSE = Σ(Yi − (b0 + b1Xi))²
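
As a minimal sketch of these computations in Python with numpy (the x and y arrays are hypothetical stand-ins for illustration, not the LSD data from the example below):

```python
import numpy as np

# Hypothetical illustration data (not the values from the LSD example)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([78.9, 58.2, 67.5, 37.5, 45.7, 32.9, 30.0])

n = len(x)
Sxx = np.sum((x - x.mean()) ** 2)              # corrected sum of squares for X
Sxy = np.sum((x - x.mean()) * (y - y.mean()))  # corrected cross-product

b1 = Sxy / Sxx                  # slope estimate
b0 = y.mean() - b1 * x.mean()   # intercept estimate

fitted = b0 + b1 * x
resid = y - fitted
SSE = np.sum(resid ** 2)        # sum of squared errors, minimized by (b0, b1)
MSE = SSE / (n - 2)             # estimate of the error variance
```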

Example - Pharmacodynamics of LSD

Response (Y): Math score (mean among 5 volunteers)
Predictor (X): LSD tissue concentration (mean of 5 volunteers)
[Raw data and scatterplot of Score vs. LSD concentration]
Source: Wagner et al. (1968)

Least Squares Computations

Summary calculations: Sxx = Σ(Xi − X̄)², Sxy = Σ(Xi − X̄)(Yi − Ȳ)
Parameter estimates: b1 = Sxy / Sxx, b0 = Ȳ − b1X̄

Example - Pharmacodynamics of LSD

[Computation table; column totals given in the bottom row]

SPSS Output and Plot of Equation

Inference Concerning the Slope (β1)

Parameter: Slope in the population model (β1)
Estimator: Least squares estimate b1
Estimated standard error: SE(b1) = √(MSE / Sxx)
Methods of making inference regarding the population:
- Hypothesis tests (2-sided or 1-sided)
- Confidence intervals

Hypothesis Test for β1

Test statistic: t = b1 / SE(b1), with n − 2 degrees of freedom

1-sided test: H0: β1 = 0 vs. HA+: β1 > 0 or HA−: β1 < 0
2-sided test: H0: β1 = 0 vs. HA: β1 ≠ 0
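
A sketch of the t-test, continuing from the quantities (n, b1, MSE, Sxx) computed in the earlier block:

```python
from scipy import stats

se_b1 = (MSE / Sxx) ** 0.5     # estimated standard error of b1
t_stat = b1 / se_b1            # test statistic for H0: beta1 = 0
df = n - 2
p_two_sided = 2 * stats.t.sf(abs(t_stat), df)  # 2-sided P-value
p_upper = stats.t.sf(t_stat, df)               # 1-sided P-value for HA: beta1 > 0
```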

(1 − α)100% Confidence Interval for β1

b1 ± t(α/2, n−2) · SE(b1)
- Conclude a positive association if the entire interval lies above 0
- Conclude a negative association if the entire interval lies below 0
- Cannot conclude an association if the interval contains 0
- The conclusion based on the interval is the same as for the 2-sided hypothesis test
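
The corresponding interval, continuing the same sketch:

```python
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df)          # t(alpha/2, n-2)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)  # 95% CI for beta1
# interval entirely above 0 -> positive association; entirely below 0 ->
# negative association; contains 0 -> cannot conclude an association
```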

Example - Pharmacodynamics of LSD

Testing H0: β1 = 0 vs. HA: β1 ≠ 0
95% confidence interval for β1:
[Computations shown on slide]

Confidence Interval for the Mean When X = X*

Mean response at a specific level X*: E(Y) = β0 + β1X*
Estimated mean response and standard error (replacing the unknown β0 and β1 with estimates):
ŷ = b0 + b1X*, SE(ŷ) = √(MSE · (1/n + (X* − X̄)² / Sxx))
Confidence interval for the mean response: ŷ ± t(α/2, n−2) · SE(ŷ)

Prediction Interval for a Future Response at X = X*

Response at a specific level X*: Y = β0 + β1X* + ε
Estimated response and standard error (replacing the unknown β0 and β1 with estimates):
ŷ = b0 + b1X*, SE(pred) = √(MSE · (1 + 1/n + (X* − X̄)² / Sxx))
Prediction interval for a future response: ŷ ± t(α/2, n−2) · SE(pred)
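
A sketch covering both intervals from the two slides above, reusing b0, b1, MSE, Sxx, n, and x from the earlier block:

```python
def mean_ci_and_pi(x_star, alpha=0.05):
    """CI for the mean response and PI for a future response at X = x_star."""
    y_hat = b0 + b1 * x_star
    t_crit = stats.t.ppf(1 - alpha / 2, n - 2)
    se_mean = (MSE * (1 / n + (x_star - x.mean()) ** 2 / Sxx)) ** 0.5
    se_pred = (MSE * (1 + 1 / n + (x_star - x.mean()) ** 2 / Sxx)) ** 0.5
    ci = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
    pi = (y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)
    return ci, pi  # the PI is always wider than the CI at the same x_star
```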

Analysis of Variance in Regression

Goal: Partition the total variation in Y into variation "explained" by X and random variation: Σ(Yi − Ȳ)² = Σ(Ŷi − Ȳ)² + Σ(Yi − Ŷi)²
The three sums of squares and their degrees of freedom are:
- Total (TSS): DF(Total) = n − 1
- Error (SSE): DF(Error) = n − 2
- Regression (SSR): DF(Regression) = 1

Analysis of Variance for Regression

F-test: F = MSR / MSE = (SSR/1) / (SSE/(n−2)), compared to the F(1, n−2) distribution
H0: β1 = 0 vs. HA: β1 ≠ 0
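
The F-test in the same sketch (SSE and MSE from the earlier block):

```python
TSS = np.sum((y - y.mean()) ** 2)  # total sum of squares, df = n - 1
SSR = TSS - SSE                    # regression sum of squares, df = 1
MSR = SSR / 1
F = MSR / MSE                      # F statistic for H0: beta1 = 0
p_F = stats.f.sf(F, 1, n - 2)
# In simple regression, F equals t_stat**2 and the P-values agree
```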

Example - Pharmacodynamics of LSD

Total sum of squares:
Error sum of squares:
Regression sum of squares:
[Computations shown on slide]

Example - Pharmacodynamics of LSD

Analysis of variance F-test: H0: β1 = 0 vs. HA: β1 ≠ 0

Example - SPSS Output

Correlation Coefficient

- Measures the strength of the linear association between two variables
- Takes on the same sign as the slope estimate from the linear regression
- Not affected by linear transformations of Y or X
- Does not distinguish between dependent and independent variables (e.g., height and weight)
- Population parameter: ρ
- Pearson's correlation coefficient: r = Sxy / √(Sxx · Syy)

Correlation Coefficient

- Values close to 1 in absolute value → strong linear association (positive or negative, from the sign)
- Values close to 0 imply little or no linear association
- If the data contain outliers (are non-normal), Spearman's coefficient of correlation can be computed from the ranks of the X and Y values
- The test of H0: ρ = 0 is equivalent to the test of H0: β1 = 0
- Coefficient of determination (r²): the proportion of variation in Y "explained" by the regression on X: r² = SSR / TSS = 1 − SSE / TSS
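
Both measures are available directly in scipy, continuing the same sketch:

```python
# Pearson's r (equivalently Sxy / sqrt(Sxx * Syy)) and Spearman's rank version
r_pearson, p_pearson = stats.pearsonr(x, y)
r_spearman, p_spearman = stats.spearmanr(x, y)  # rank-based, robust to outliers
r_squared = SSR / TSS   # coefficient of determination, also r_pearson ** 2
```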

Example - Pharmacodynamics of LSD

[Computation of Syy and SSE shown on slide]

Example - SPSS Output: Pearson's and Spearman's Measures

Hypothesis Test for ρ

Test statistic: t = r√(n − 2) / √(1 − r²), with n − 2 degrees of freedom

1-sided test: H0: ρ = 0 vs. HA+: ρ > 0 or HA−: ρ < 0
2-sided test: H0: ρ = 0 vs. HA: ρ ≠ 0

Large-Sample Confidence Interval for ρ

In general, when the population correlation ρ is not 0, the sample correlation r has a skewed sampling distribution. To obtain an approximate large-sample confidence interval for ρ, Fisher's z transformation is applied: z' = (1/2)·ln((1 + r)/(1 − r)), with approximate standard error 1/√(n − 3). An interval is formed on the z' scale and the endpoints are back-transformed to the correlation scale.
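
A minimal sketch of this transformation:

```python
def fisher_z_ci(r, n, conf=0.95):
    """Approximate large-sample CI for rho via Fisher's z transformation."""
    z = 0.5 * np.log((1 + r) / (1 - r))   # z' = artanh(r)
    se = 1 / np.sqrt(n - 3)               # approximate standard error of z'
    zc = stats.norm.ppf(1 - (1 - conf) / 2)
    # back-transform the endpoints to the correlation scale with tanh
    return np.tanh(z - zc * se), np.tanh(z + zc * se)
```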

Example - Pharmacodynamics of LSD

Note that this is hardly a large sample; the confidence interval is given only to demonstrate the calculations.

Model Diagnostics

Inferences for the simple regression model are based on the following assumptions (which extend to multiple regression):
- The relation between Y and X is linear
- Errors are normally distributed
- Errors have constant variance
- Errors are independent
These assumptions can be checked graphically and by formal tests.

Data Description / Model

Heights (X) and weights (Y) for 505 NBA players in the 2013/14 season. Other variables included in the dataset: age, position.
Simple linear regression model: Y = β0 + β1X + ε

Linearity of Regression (SLR)

Lack-of-fit F-test: with c distinct X levels, compare the linear model (reduced) to a model with a separate mean at each X level (full), H0: μj = β0 + β1Xj:
F = [(SSE(R) − SSE(F)) / (c − 2)] / [SSE(F) / (n − c)]

Height and Weight Data – n = 505, c = 18 groups

Do not reject H0: μj = β0 + β1Xj (the linear relation is plausible)

Checking Normality of Errors

Graphically:
- Histogram: should be mound-shaped around 0
- Normal probability plot: residuals versus their expected values under normality should follow a straight line.
  1. Rank the residuals from smallest (large negative) to largest (k = 1, …, n)
  2. Compute the quantile for each ranked residual: p = (k − 0.375) / (n + 0.25)
  3. Obtain the Z-score corresponding to each quantile: z(p)
  4. Expected residual = √MSE · z(p)
  5. Plot the ordered residuals versus the expected residuals

Numerical tests (see the sketch below):
- Correlation test: obtain the correlation between the ordered residuals and z(p). Critical values for n up to 100 are provided by Looney and Gulledge (1985).
- Shapiro-Wilk test: similar to the correlation test, with more complex calculations; printed directly by statistical software packages.
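
A sketch of both numerical checks, assuming a residual array and MSE from a fitted model:

```python
def normality_checks(resid, MSE):
    """Correlation test and Shapiro-Wilk test for normality of residuals."""
    n = len(resid)
    k = np.arange(1, n + 1)
    p = (k - 0.375) / (n + 0.25)       # plotting positions
    z = stats.norm.ppf(p)              # quantiles under normality
    expected = np.sqrt(MSE) * z        # expected ordered residuals
    ordered = np.sort(resid)
    r_corr = np.corrcoef(ordered, expected)[0, 1]  # compare to critical value
    W, p_sw = stats.shapiro(resid)                 # Shapiro-Wilk test
    return r_corr, W, p_sw
```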

Normal Probability Plot / Correlation Test

[Extreme and middle residuals shown on slide]
The correlation between the residuals and their expected values under normality is 0.9972. Based on the Shapiro-Wilk test in R, the P-value for H0: errors are normal is P = 0.0859 (do not reject normality).

Box-Cox Transformations

Automatically selects a transformation from the power family with the goal of obtaining normality, linearity, and constant variance (not always successful, but widely used).
Goal: Fit the model Y' = β0 + β1X + ε for various power transformations of Y, selecting the transformation that produces the minimum SSE (maximum likelihood).
Procedure: over a range of λ from, say, −2 to +2, obtain the scaled transformed values Wi and regress Wi on X (assuming all Yi > 0, although adding a constant won't affect the shape or spread of the Y distribution):
Wi = (Yi^λ − 1) / (λ · G^(λ−1)) for λ ≠ 0, and Wi = G · ln(Yi) for λ = 0, where G is the geometric mean of the Yi.
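
A sketch of the SSE profile over λ, under the scaled-transform convention stated above (the grid and helper name are illustrative):

```python
def boxcox_sse_profile(x, y, lambdas=np.linspace(-2, 2, 81)):
    """SSE of regressing the scaled Box-Cox transform W on x, per lambda.
    The lambda minimizing SSE is the (approximate) maximum likelihood choice."""
    g = np.exp(np.mean(np.log(y)))   # geometric mean of y (requires y > 0)
    sses = []
    for lam in lambdas:
        if abs(lam) < 1e-10:
            w = g * np.log(y)                              # lambda = 0 case
        else:
            w = (y ** lam - 1) / (lam * g ** (lam - 1))    # scaled transform
        # simple linear regression of w on x
        b1 = np.sum((x - x.mean()) * (w - w.mean())) / np.sum((x - x.mean()) ** 2)
        b0 = w.mean() - b1 * x.mean()
        sses.append(np.sum((w - (b0 + b1 * x)) ** 2))
    return lambdas, np.array(sses)
```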

Box-Cox Transformation – Obtained in R

The maximum occurs near λ = 0 (the interval contains 0) – try taking logs of Weight.

Checking the Constant Variance Assumption

Plot residuals versus X or the predicted values:
- Random cloud around 0 → linear relation with constant variance
- Funnel shape → non-constant variance
- Outliers fall far above (positive) or below (negative) the general cloud pattern
Plot absolute residuals, squared residuals, or the square root of the absolute residuals:
- Positive association → non-constant variance
Numerical tests (see the sketch below):
- Brown-Forsythe test: 2-sample t-test of absolute deviations from the group medians
- Breusch-Pagan test: regresses squared residuals on the model predictors (X variables)
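
A sketch of both tests for the simple regression case; the Breusch-Pagan variant shown is Koenker's studentized (n·R²) version, an assumption on my part since the slide does not say which form was used:

```python
def brown_forsythe(resid, groups):
    """2-sample t-test on absolute deviations from each group's median residual.
    `groups` is a boolean array splitting the observations (e.g., x <= cutoff)."""
    d1 = np.abs(resid[groups] - np.median(resid[groups]))
    d2 = np.abs(resid[~groups] - np.median(resid[~groups]))
    return stats.ttest_ind(d1, d2)

def breusch_pagan(resid, x):
    """Regress squared residuals on x; LM = n * R^2, compared to chi-square(1)."""
    e2 = resid ** 2
    Sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (e2 - e2.mean())) / Sxx
    fitted = e2.mean() + b1 * (x - x.mean())
    r2 = np.sum((fitted - e2.mean()) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    lm = len(x) * r2
    return lm, stats.chi2.sf(lm, df=1)
```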

Equal (Homogeneous) Variance - I

Equal (Homogeneous) Variance - II

Brown-Forsythe and Breusch-Pagan Tests

Brown-Forsythe test: Group 1: heights ≤ 79"; Group 2: heights ≥ 80". H0: equal variances among errors (reject H0).
Breusch-Pagan test: H0: equal variances among errors (reject H0).

Test for Independence - Durbin-Watson Test

When data are collected over time, the errors may be serially correlated. The Durbin-Watson statistic is DW = Σ(et − e(t−1))² / Σet². Values near 2 are consistent with independent errors; values well below 2 suggest positive autocorrelation.
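
A minimal sketch of the statistic:

```python
def durbin_watson(resid):
    """Durbin-Watson statistic: sum of squared successive differences of the
    residuals (in time order) divided by the residual sum of squares."""
    diff = np.diff(resid)
    return np.sum(diff ** 2) / np.sum(resid ** 2)
```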

New Orleans Average Annual Temperature, 1957-2014

Y = average annual temperature in New Orleans
X = year − 1957

Detecting Influential Observations

- Studentized residuals: residuals divided by their estimated standard errors (like t-statistics). Observations with values larger than 3 in absolute value are considered outliers.
- Leverage values (hat diagonals): measure how far an observation is from the others in terms of the levels of the independent variables (not the dependent variable). Observations with values larger than 2p*/n are considered potentially highly influential, where p* is the number of model parameters (predictors plus the intercept) and n is the sample size.
- DFFITS: measures how much an observation has affected its own fitted value from the regression model. Values larger than 2√(p*/n) in absolute value are considered highly influential. Use standardized DFFITS in SPSS.

Detecting Influential Observations

- DFBETAS: measures how much an observation has affected the estimate of a regression coefficient (there is one DFBETA for each regression coefficient, including the intercept). Values larger than 2/√n in absolute value are considered highly influential.
- Cook's D: measures the aggregate impact of each observation on the group of regression coefficients, as well as on the group of fitted values. Values larger than F(0.50; p*, n−p*) are considered highly influential.
- COVRATIO: measures the impact of each observation on the variances (and standard errors) of the regression coefficients and their covariances. Values outside the interval 1 ± 3p*/n are considered highly influential.
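
All of these diagnostics are available from statsmodels; a sketch assuming height and weight arrays x and y are available as before:

```python
import numpy as np
import statsmodels.api as sm

X = sm.add_constant(x)            # design matrix with an intercept column
fit = sm.OLS(y, X).fit()
infl = fit.get_influence()

student = infl.resid_studentized_external  # studentized (deleted) residuals
leverage = infl.hat_matrix_diag            # hat diagonal values
dffits, _ = infl.dffits                    # DFFITS and a suggested threshold
dfbetas = infl.dfbetas                     # one column per coefficient
cooks_d, _ = infl.cooks_distance           # Cook's distances
covratio = infl.cov_ratio                  # COVRATIO values

n_obs, pstar = X.shape
flags = (np.abs(student) > 3) | (leverage > 2 * pstar / n_obs) \
        | (np.abs(dffits) > 2 * np.sqrt(pstar / n_obs))  # rule-of-thumb screen
```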