Presentation is loading. Please wait.

Presentation is loading. Please wait.

Linear Regression and Correlation

Similar presentations


Presentation on theme: "Linear Regression and Correlation"— Presentation transcript:

1 Linear Regression and Correlation

2 Linear Regression and Correlation
Explanatory and Response Variables are Numeric Relationship between the mean of the response variable and the level of the explanatory variable assumed to be approximately linear (straight line) Model: b1 > 0  Positive Association b1 < 0  Negative Association b1 = 0  No Association

3 Least Squares Estimation of b0, b1
b0  Mean response when X = 0 (Y-intercept) b1  Change in mean response when X increases by 1 unit (slope) b0, b1 are unknown parameters (like m) b0+b1X  Mean response when explanatory variable takes on the value X Goal: Choose values (estimates) that minimize the sum of squared errors (SSE) of observed values to the straight-line:

4 Example - Pharmacodynamics of LSD
Response (Y) - Math score (mean among 5 volunteers) Predictor (X) - LSD tissue concentration (mean of 5 volunteers) Raw Data and scatterplot of Score vs LSD concentration: Source: Wagner, et al (1968)

5 Least Squares Computations
Parameter Estimates Summary Calculations

6 Example - Pharmacodynamics of LSD
(Column totals given in bottom row of table)

7 SPSS Output and Plot of Equation

8 Inference Concerning the Slope (b1)
Parameter: Slope in the population model (b1) Estimator: Least squares estimate: Estimated standard error: Methods of making inference regarding population: Hypothesis tests (2-sided or 1-sided) Confidence Intervals

9 Hypothesis Test for b1 1-sided Test 2-Sided Test H0: b1 = 0 H0: b1 = 0
HA+: b1 > 0 or HA-: b1 < 0 2-Sided Test H0: b1 = 0 HA: b1  0

10 (1-a)100% Confidence Interval for b1
Conclude positive association if entire interval above 0 Conclude negative association if entire interval below 0 Cannot conclude an association if interval contains 0 Conclusion based on interval is same as 2-sided hypothesis test

11 Example - Pharmacodynamics of LSD
Testing H0: b1 = 0 vs HA: b1  0 95% Confidence Interval for b1 :

12 Confidence Interval for Mean When X = X*
Mean Response at a specific level X* is Estimated Mean response and standard error (replacing unknown b0 and b1 with estimates): Confidence Interval for Mean Response:

13 Prediction Interval of Future Response @ X=X*
Response at a specific level X* is Estimated response and standard error (replacing unknown b0 and b1 with estimates): Prediction Interval for Future Response:

14 Analysis of Variance in Regression
Goal: Partition the total variation in Y into variation “explained” by X and random variation These three sums of squares and degrees of freedom are: Total (TSS) DFTotal = n-1 Error (SSE) DFError = n-2 Regression (SSR) DFRegression = 1

15 Analysis of Variance for Regression
Analysis of Variance - F-test H0: b1 = HA: b1  0

16 Example - Pharmacodynamics of LSD
Total Sum of squares: Error Sum of squares: Regression Sum of Squares:

17 Example - Pharmacodynamics of LSD
Analysis of Variance - F-test H0: b1 = HA: b1  0

18 Example - SPSS Output

19 Correlation Coefficient
Measures the strength of the linear association between two variables Takes on the same sign as the slope estimate from the linear regression Not effected by linear transformations of Y or X Does not distinguish between dependent and independent variable (e.g. height and weight) Population Parameter: r Pearson’s Correlation Coefficient:

20 Correlation Coefficient
Values close to 1 in absolute value  strong linear association, positive or negative from sign Values close to 0 imply little or no association If data contain outliers (are non-normal), Spearman’s coefficient of correlation can be computed based on the ranks of the X and Y values Test of H0:r = 0 is equivalent to test of H0:b1=0 Coefficient of Determination (r2) - Proportion of variation in Y “explained” by the regression on X:

21 Example - Pharmacodynamics of LSD
Syy SSE

22 Example - SPSS Output Pearson’s and Spearman’s Measures

23 Hypothesis Test for r 1-sided Test 2-Sided Test H0: r = 0 H0: r = 0
HA+: r > 0 or HA-: r < 0 2-Sided Test H0: r = 0 HA: r  0

24 Large-Sample Confidence Interval for r
In general, when the population correlation r is not 0, the sample correlation r has a skewed sampling distribution. To obtain a an approximate large-sample Confidence Interval for r, Fisher’s z transformation is applied.

25 Example - Pharmacodynamics of LSD
Note that this is hardly a large sample, the Confidence Interval is given to demonstrate calculations

26 Model Diagnostics Inferences for the Simple Regression model are based on the following assumptions (which extend to Multiple Regression). Relation between Y and X is linear Errors are normally distributed Errors have constant variance Errors are independent These assumptions can be checked graphically and by formal tests.

27 Data Description / Model
Heights (X) and Weights (Y) for 505 NBA Players in 2013/14 Season. Other Variables included in the Dataset: Age, Position Simple Linear Regression Model: Y = b0 + b1X + e

28

29 Linearity of Regression (SLR)

30 Height and Weight Data – n=505, c=18 Groups
Do not reject H0: mj = b0 + b1Xj

31 Checking Normality of Errors
Graphically Histogram – Should be mound shaped around 0 Normal Probability Plot – Residuals versus expected values under normality should follow a straight line. Rank residuals from smallest (large negative) to highest (k = 1,…,n) Compute the quantile for the ranked residual: p=(k-0.375)/(n+0.25) Obtain the Z-score corresponding to the quantiles: z(p) Expected Residual = √MSE*z(p) Plot Ordered residuals versus Expected Residuals Numerical Tests: Correlation Test: Obtain correlation between ordered residuals and z(p). Critical Values for n up to 100 are provided by Looney and Gulledge (1985)). Shapiro-Wilk Test: Similar to Correlation Test, with more complex calculations. Printed directly by statistical software packages

32 Normal Probability Plot / Correlation Test
Extreme and Middle Residuals The correlation between the Residuals and their expected values under normality is Based on the Shapiro-Wilk test in R, the P-value for H0: Errors are normal is P = (Do not reject Normality)

33 Box-Cox Transformations
Automatically selects a transformation from power family with goal of obtaining: normality, linearity, and constant variance (not always successful, but widely used) Goal: Fit model: Y’ = b0 + b1X + e for various power transformations on Y, and selecting transformation producing minimum SSE (maximum likelihood) Procedure: over a range of l from, say -2 to +2, obtain Wi and regress Wi on X (assuming all Yi > 0, although adding constant won’t affect shape or spread of Y distribution)

34 Box-Cox Transformation – Obtained in R
Maximum occurs near l = 0 (Interval Contains 0) – Try taking logs of Weight

35 Checking the Constant Variance Assumption
Plot Residuals versus X or Predicted Values Random Cloud around 0  Linear Relation Funnel Shape  Non-constant Variance Outliers fall far above (positive) or below (negative) the general cloud pattern Plot absolute Residuals, squared residuals, or square root of absolute residuals Positive Association  Non-constant Variance Numerical Tests Brown-Forsyth Test – 2 Sample t-test of absolute deviations from group medians Breusch-Pagan Test – Regresses squared residuals on model predictors (X variables)

36

37

38 Equal (Homogeneous) Variance - I

39 Equal (Homogeneous) Variance - II

40 Brown-Forsyth and Breusch-Pagan Tests
Brown-Forsyth Test: Group 1: Heights ≤ 79” Group 2: Heights ≥ 80” H0: Equal Variances Among Errors (Reject H0) Breusch-Pagan Test: H0: Equal Variances Among Errors (Reject H0)

41 Test For Independence - Durbin-Watson Test

42 New Orleans Average Annual Temperature 1957-2014
Y = Average Annual Temperature in New Orleans X = Year

43 Detecting Influential Observations
Studentized Residuals – Residuals divided by their estimated standard errors (like t-statistics). Observations with values larger than 3 in absolute value are considered outliers. Leverage Values (Hat Diag) – Measure of how far an observation is from the others in terms of the levels of the independent variables (not the dependent variable). Observations with values larger than 2p*/n are considered to be potentially highly influential, where p is the number of predictors and n is the sample size. DFFITS – Measure of how much an observation has effected its fitted value from the regression model. Values larger than 2sqrt(p*/n) in absolute value are considered highly influential. Use standardized DFFITS in SPSS.

44 Detecting Influential Observations
DFBETAS – Measure of how much an observation has effected the estimate of a regression coefficient (there is one DFBETA for each regression coefficient, including the intercept). Values larger than 2/sqrt(n) in absolute value are considered highly influential. Cook’s D – Measure of aggregate impact of each observation on the group of regression coefficients, as well as the group of fitted values. Values larger than F.50,p*,n-p* are considered highly influential. COVRATIO – Measure of the impact of each observation on the variances (and standard errors) of the regression coefficients and their covariances. Values outside the interval 1 +/- 3p*/n are considered highly influential.


Download ppt "Linear Regression and Correlation"

Similar presentations


Ads by Google