Covariance and Correlation Assaf Oron, May 2008

Presentation transcript:

Stat 391 – Lecture 13a: Covariance and Correlation (Assaf Oron, May 2008)

Relationships between Two r.v.’s
- We have seen some relationships between events: independence, inclusion, exclusion (= “disjoint”).
- Translated to relationships between r.v.’s, these become independence or determinism (= “degeneracy”).
- But those are special, extreme cases. We need a quantitative measure of the strength of the relationship between r.v.’s.
- This is where covariance and correlation come in (we will start mathematically, and concentrate on the continuous case).

Covariance
- Assume X, Y are r.v.’s with expectations μX, μY.
- Recall that Var[X] is defined as E[(X − μX)²].
- The variance measures how X “co-varies with itself”. What if we wanted to see how it co-varies with someone else?

Covariance (2)
- The covariance Cov[X, Y] is defined as Cov[X, Y] = E[(X − μX)(Y − μY)].
- With the variance, the integrand was guaranteed to be non-negative, always.
- With the covariance, the integrand is negative whenever one r.v. ventures above its mean while the other goes below.
- Hands-on: prove the shortcut formula Cov[X, Y] = E[XY] − μX μY (a derivation sketch follows).
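For reference, here is one way the shortcut formula falls out of the definition, by expanding the product inside the expectation and using linearity (these are the steps the hands-on asks you to fill in):

\begin{aligned}
\mathrm{Cov}[X,Y] &= E\big[(X-\mu_X)(Y-\mu_Y)\big] \\
&= E[XY] - \mu_Y E[X] - \mu_X E[Y] + \mu_X \mu_Y \\
&= E[XY] - \mu_X \mu_Y .
\end{aligned}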

Covariance (3)
- Hands-on, continued: use the shortcut formula to show that if X and Y are independent, then Cov[X, Y] = 0 (hint: write out E[XY] as an integral, then use independence to factor the joint density).
- Now, show how Cov[aX, Y] relates to Cov[X, Y], where a is some constant.

Covariance and Correlation
- It can be shown that Cov[X, Y] is bounded: |Cov[X, Y]| ≤ σX σY.
- This means it can be normalized to yield the population correlation coefficient ρ = Cov[X, Y] / (σX σY), which always lies in [−1, 1].
- Determinism: ρ = ±1. Independence: ρ = 0.
- We got what we wanted: a quantitative measure of dependence.

Covariance and Correlation (2)
- Covariance is used more often in modeling and theory; correlation is more common in data analysis.
- Correlation is estimated from the data in a straightforward, MME (method-of-moments) manner, yielding the sample correlation coefficient
  r = Σ(xi − x̄)(yi − ȳ) / sqrt( Σ(xi − x̄)² · Σ(yi − ȳ)² ).
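A minimal sketch of this plug-in estimate in R (the variables x and y below are made-up illustration data); the hand-computed value should agree with the built-in cor():

# Sample correlation "by hand" vs. R's built-in cor()
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)               # a noisy linear relationship

r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_builtin <- cor(x, y)
c(manual = r_manual, builtin = r_builtin)   # the two should match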

Properties of Correlation
- Covariance and correlation are symmetric between X and Y.
- They are not linear: the covariance is a second moment of sorts, and correlation is not even a moment.
- However, they quantify the mean linear association between X and Y.
- If the association is nonlinear, they may miss it.
- They may also fare poorly if the data are clustered, or because of outliers.

Correlation: Resolving Issues
- If you suspect a nonlinear association of the form g(x) ~ h(y), then look at Cor(g(x), h(y)). Visually, this amounts to transforming the data (see the sketch after this list).
- Outliers need to be addressed (here’s a repeat of how): determine whether they are errors, “a foreign element”, or “part of the story”; then, accordingly, remove, “ignore”, or include them (more robust methods can also be used).
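A small illustration in R with simulated data (the exponential relationship here is an assumed example, not from the lecture): the raw correlation understates a perfectly monotone association, while correlating a transformed version recovers it.

# Transforming before correlating: a toy example
set.seed(2)
x <- runif(100, 0, 4)
y <- exp(x) * exp(rnorm(100, sd = 0.2))   # y grows exponentially in x, with multiplicative noise

cor(x, y)          # attenuated: the relationship is far from linear
cor(x, log(y))     # log(y) is nearly linear in x, so the correlation is much stronger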

Cov/Cor with Several r.v.’s
- With several r.v.’s we have a covariance matrix: always symmetric, with each r.v.’s variance on the diagonal.
- Similarly, we have a correlation matrix: symmetric, with 1’s on the diagonal.
- Independent r.v.’s will have zeroes in the off-diagonal entries.
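In R, the sample versions of these matrices come from cov() and cor(); here is a quick sketch on made-up data:

# Sample covariance and correlation matrices
set.seed(3)
dat <- data.frame(
  u = rnorm(200),
  v = rnorm(200),
  w = rnorm(200)
)
dat$v <- dat$u + dat$v    # make v depend on u; w stays independent of both

round(cov(dat), 2)   # symmetric; variances on the diagonal
round(cor(dat), 2)   # symmetric; 1's on the diagonal; near-zero entries for w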

Stat 391 – Lecture 13b: Regression, Part A, or “Black-Box Statistics 101” (Assaf Oron, May 2008)

Overview and Disclaimer
- We are entering different waters: regression is the gateway to pattern recognition (it answers the question “how do we draw a line through the data?”), a subject for countless Ph.D. theses.
- We cannot expect to cover it at the same depth as probability, MLE’s, etc.; as said earlier, that requires a separate course.
- What I’ll do here is help you become an informed and responsible user. This, too, is a tall task, because using regression is deceptively easy.

Overview and Disclaimer (2)
- To use a familiar analogy: up until now in the course we have been walking, running and riding bicycles. Doing regression is more like driving a car.
- It should require a license… but, unfortunately, it does not.
- So think of the next 3 hours as a crash drivers-permit course.
- Here we go, starting from the simplest case.

Simple Linear Regression
- We have n data points ((X, Y) pairs), and limit ourselves to a straight line (still a subject for countless Ph.D. theses).
- Unlike correlation, here we typically abandon symmetry: we use X to explain Y and not vice versa.
- One reason: this makes the math simpler. Another reason: often, this is the practical question we want to answer.

Simple Linear Regression (2)
- If you just ask for “regression”, you will get the standard, least-squares solution: a linear formula ŷ = β̂0 + β̂1 x that minimizes the sum Σ (yi − β̂0 − β̂1 xi)² over the n data points.
- (What’s with the betas and the hats? We’ll see next lecture.)
- In the scatterplot, the fitted line minimizes the sum of the squares of the vertical distances from the points to the line (a computational sketch follows).
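A minimal sketch in R of what “asking for regression” does under the hood, using the closed-form least-squares estimates (x and y are made-up illustration data); lm() should reproduce the same two numbers:

# Least-squares estimates by hand, compared with lm()
set.seed(4)
x <- rnorm(30)
y <- 3 + 0.5 * x + rnorm(30)

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

c(intercept = b0, slope = b1)
coef(lm(y ~ x))        # same two numbers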

Simple Linear Regression (3)
So a regression’s main product is just two numbers; but if you run regression in a standard statistical package, you’ll get something like this:

Call:
lm(formula = y1 ~ x1, data = anscombe)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.92127 -0.45577 -0.04136  0.70941  1.83882 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   3.0001     1.1247   2.667  0.02573 * 
x1            0.5001     0.1179   4.241  0.00217 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.002170

The two numbers are the intercept and slope in the Estimate column; all the rest are various quality measures.
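This particular printout comes from R; assuming the built-in anscombe data set (as the Call line shows), it can be reproduced with:

fit <- lm(y1 ~ x1, data = anscombe)   # fit the least-squares line
summary(fit)                          # produces the printout shown above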

Simple Linear Regression (4)
Several numbers in the printout have to do with hypothesis tests, and we’ll discuss them next lecture: the Std. Error, t value and Pr(>|t|) columns of the coefficient table, the significance codes, and the F-statistic line with its p-value.

Simple Linear Regression (5)
Let’s focus on R-squared, reported in the printout as:

Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295

(“Multiple R-squared” is usually known as just “R-squared”.)

R-squared
- Recall: our solution minimizes Σ (yi − β̂0 − β̂1 xi)² (under the linearity constraint).
- The value of this sum at our solution is known as the Sum of Squared Errors (SSerr.), or the residual sum of squares.
- Furthermore, it can be shown that SSy = SSreg. + SSerr., where
  SSy: overall variability of y (n − 1 times the sample variance);
  SSreg.: amount explained by the regression model;
  SSerr.: amount left unexplained.
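Continuing with the anscombe fit above, a quick sketch checking the decomposition numerically:

fit <- lm(y1 ~ x1, data = anscombe)
y   <- anscombe$y1

SS_y   <- sum((y - mean(y))^2)              # total variability of y
SS_err <- sum(resid(fit)^2)                 # residual sum of squares
SS_reg <- sum((fitted(fit) - mean(y))^2)    # amount explained by the model

c(SS_y = SS_y, SS_reg_plus_SS_err = SS_reg + SS_err)  # should agree
SS_reg / SS_y                               # this ratio is the R-squared (about 0.67)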

R-squared (2)
Let’s illustrate this: in a scatterplot showing the fitted line and the horizontal line at ȳ, R-squared is the sum of the squared distances from the fitted values to ȳ, divided by the sum of the original squared distances of the data points from ȳ.

R-squared, Regression, Correlation
- R-squared is defined as R² = SSreg. / SSy = 1 − SSerr. / SSy; in simple linear regression it equals r², with r being the sample correlation coefficient defined last hour.
- (“Adjusted R-squared” penalizes for the number of parameters used; more detail perhaps next time.)
- The relationship between correlation and regression goes further: the least-squares slope is β̂1 = r · (sy / sx), where sx and sy are the sample standard deviations.
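A quick numerical check of both claims in R, again on the anscombe data from the printout:

fit <- lm(y1 ~ x1, data = anscombe)
r   <- cor(anscombe$x1, anscombe$y1)

c(R_squared = summary(fit)$r.squared, r_squared = r^2)   # should match (~0.6665)

# The slope equals r * (sd of y) / (sd of x)
c(lm_slope = unname(coef(fit)["x1"]),
  from_r   = r * sd(anscombe$y1) / sd(anscombe$x1))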

Back to the Printout
All that’s left are the summary stats related to residuals:

Residuals:
     Min       1Q   Median       3Q      Max 
-1.92127 -0.45577 -0.04136  0.70941  1.83882 

(plus the residual standard error: 1.237 on 9 degrees of freedom).

Seems like we have everything about the residuals except their mean (which is always 0); but this is a classic example where “one picture is worth a thousand numbers”. Let’s do the Anscombe examples!
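A sketch in R of the classic Anscombe demonstration: all four x-y pairs in the built-in anscombe data set give nearly identical regression printouts, but the scatterplots tell four very different stories.

op <- par(mfrow = c(2, 2))            # 2x2 grid of plots
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Anscombe data set", i))
  abline(lm(y ~ x), col = "red")      # the fitted least-squares line
  print(round(coef(lm(y ~ x)), 3))    # roughly 3.00 and 0.50 every time
}
par(op)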

Diagnostics and Residual Analysis
- Regression should always begin with vigorous plotting/tabulation of all variables, alone and vs. each other, AND be followed up by an equally vigorous visual inspection of the residuals (a few standard residual plots are sketched below).
- Failing to do both is just like driving while looking only at the dashboard.
- Residuals should:
  - be symmetrically distributed, with thin tails and no outliers;
  - show no pattern when plotted vs. the responses, the fitted values, any explanatory variable, or observation order.
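A minimal sketch of such residual checks in R, again on the anscombe fit used earlier (plot(fit) would also produce a standard set of diagnostic plots):

fit <- lm(y1 ~ x1, data = anscombe)
r   <- resid(fit)

op <- par(mfrow = c(1, 3))
hist(r, main = "Residual distribution")             # roughly symmetric, thin-tailed?
plot(fitted(fit), r, main = "Residuals vs fitted")  # no pattern expected
abline(h = 0, lty = 2)
plot(anscombe$x1, r, main = "Residuals vs x1")      # no pattern expected
abline(h = 0, lty = 2)
par(op)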