
Regression and correlation: dependence of two quantitative variables

Regression – we know which variable is dependent and which is independent.

Similarly, these will depend: the height of a plant on the nutrient content of the soil; the intensity of photosynthesis on the amount of light; species diversity on latitude; the rate of an enzymatic reaction on temperature – and not vice versa.

Correlation – both variables are “equal”

Similarly, we can be interested in correlations of: Pb and Cd content in water; test scores in maths and chemistry; cover of Cirsium and Agropyron in a meadow plot – anywhere it is hard to say what depends on what.

Even with “equal” variables we can use one of them as a predictor. Regression is then used even in cases where there is no clear causality: I can predict the height of a tree on the basis of its DBH (the easier measurement).

Model of simple linear regression: Y_i = α + β·X_i + ε_i, where Y is the dependent variable (response), α the intercept, β the slope (coefficient of regression), X the independent variable (predictor), and ε_i the error variability, distributed N(0, σ²).

Coefficient of regression β = slope of the line: how much Y changes if X is changed by one unit. It is therefore a value dependent on the units in which X and Y are measured, and it ranges from −∞ to +∞. α = the value of Y when X = 0; β = the tangent of the angle of the line's slope.

So, we presume: X is measured exactly; the measurement of Y is subject to error; the mean value of Y depends linearly on X; the variance “around the line” is always the same (homogeneity of variances).
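
These presumptions can be made concrete with a short Python sketch (an editorial addition; the parameter values are arbitrary assumptions, not lecture data):

import numpy as np

rng = np.random.default_rng(1)

# Arbitrary illustrative parameters
alpha, beta, sigma = 2.0, 0.5, 1.0

x = np.linspace(0, 20, 50)                 # X is fixed and measured exactly
eps = rng.normal(0.0, sigma, size=x.size)  # errors ~ N(0, sigma^2): same variance everywhere
y = alpha + beta * x + eps                 # the mean of Y depends linearly on X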

Which line is the best one?

This one probably not – but how can we tell?

The best line is the one satisfying the criterion of least squares (LS), i.e. the one with the smallest sum of squared deviations between the predicted and the observed values of the dependent variable.

I.e., the best line is the one with the smallest sum of squared residuals – vertical, not horizontal, distances to the line!

Can the parameters of the line be computed from this condition? Substituting the prediction Ŷ = a + bX, we minimize SS = Σ(Y − Ŷ)². X and Y are the measured values and are treated as fixed, so we are searching for the minimum of a function of two variables, a and b. We take the derivatives with respect to a and b, set ∂SS/∂a = 0 and ∂SS/∂b = 0, and by solving those equations get the parameters.

We get b = Σ(X_i − X̄)(Y_i − Ȳ) / Σ(X_i − X̄)² and a = Ȳ − b·X̄. The line always goes through the point of the averages of both variables, (X̄, Ȳ). α and β are the real values; a and b are their estimates.
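
A minimal Python sketch of these estimators (the tiny data set is invented for illustration):

import numpy as np

def least_squares_line(x, y):
    """Estimates a, b minimizing SS = sum((y - (a + b*x))**2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()   # hence the line passes through (x-bar, y-bar)
    return a, b

x = [1.0, 2.0, 3.0, 4.0, 5.0]     # invented example data
y = [2.1, 2.9, 4.2, 4.8, 6.1]
a, b = least_squares_line(x, y)
# the fitted line goes through the point of the averages:
assert abs((a + b * np.mean(x)) - np.mean(y)) < 1e-12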

b is the (sample) estimate of the real value β. Every estimate is subject to error arising from the variability of the data; Statistica computes the standard error of the estimate b.

In the case of independence, β = 0. The P-value for the test of H₀: β = 0 is the probability of getting such a good dependence by chance if the variables are independent.

For the test of H₀: β = 0 a t-test is used, t = b / SE(b), with n − 2 degrees of freedom; one-tailed tests can be used. A similar test can be used for the parameter a: it then tests whether the line goes through zero, which is in most cases uninteresting.
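
A sketch of this t-test in Python, assuming the standard formula SE(b) = s / sqrt(Σ(X_i − X̄)²) with s² the residual variance (the function name is my own):

import numpy as np
from scipy import stats

def slope_t_test(x, y):
    """Two-tailed t-test of H0: beta = 0, with n - 2 degrees of freedom."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    b, a = np.polyfit(x, y, deg=1)                     # LS slope and intercept
    s2 = np.sum((y - (a + b * x)) ** 2) / (n - 2)      # residual variance
    se_b = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))   # standard error of b
    t = b / se_b
    p = 2 * stats.t.sf(abs(t), df=n - 2)               # halve p for a one-tailed test
    return t, p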

Test using the ANALYSIS OF VARIANCE of the regression model. We test the null hypothesis that our model explains nothing (the variables are independent); then β = 0 holds. [So the test should be in agreement with the previous one; it just does not enable a one-tailed hypothesis.] Again, as in classic ANOVA, the principle is the partitioning of the sums of squares.

Grand variability = the sum of squared deviations of the observations from the grand mean. Variability explained by the model = the sum of squared deviations of the predicted values from the grand mean. [Figure: wing length, Y, in centimeters, plotted against age, X, in days.]

Error variability = the sum of squared deviations of the observed values from the predicted values. It holds: SS_total = SS_model + SS_error. [Figure: wing length, Y, in centimeters, plotted against age, X, in days.]

As in classic ANOVA, MS = SS/DF holds; each MS is an estimate of the population variance if the null hypothesis is true. Here too we test using the ratio of the two estimates of the grand variance – the one based on the variance explained by the model and the one based on the variance unexplained: F = MS_model / MS_error.
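
The decomposition and the F-test can be sketched as follows (function name mine; the last line anticipates R², defined two slides below):

import numpy as np
from scipy import stats

def regression_anova(x, y):
    """SS decomposition and F-test of 'the model explains nothing'."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    b, a = np.polyfit(x, y, deg=1)
    y_hat = a + b * x
    ss_total = np.sum((y - y.mean()) ** 2)       # grand variability
    ss_model = np.sum((y_hat - y.mean()) ** 2)   # explained by the model
    ss_error = np.sum((y - y_hat) ** 2)          # residual variability
    F = (ss_model / 1) / (ss_error / (n - 2))    # MS_model / MS_error
    p = stats.f.sf(F, 1, n - 2)
    r2 = ss_model / ss_total                     # coefficient of determination
    return F, p, r2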

[Statistica output for the wing-length example.] Note that the “beta” shown there (the standardized coefficient) is something different from the β used so far. The output also contains the test of the null hypothesis that birds are wingless at hatching time (at day zero the length is zero), and the ANOVA of the model.

Coefficient of determination R² – the percentage of variability explained: R² = SS_model / SS_total.

Confidence belt – where, with a given probability [95% here], the mean value of Y lies for a given X. Basically: where the line is.

Prediction (tolerance) belt – where the next observation will be.

Reliability is best around the mean (both belts are narrowest near X̄).
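
Both belts can be computed at a given x0 from the standard textbook half-widths; a sketch (function name mine):

import numpy as np
from scipy import stats

def belts(x, y, x0, level=0.95):
    """Confidence belt (for the mean of Y) and prediction belt at x0."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    b, a = np.polyfit(x, y, deg=1)
    s = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))
    t = stats.t.ppf(1 - (1 - level) / 2, df=n - 2)
    d = (x0 - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)  # grows away from x-bar
    half_conf = t * s * np.sqrt(1 / n + d)       # where the line is
    half_pred = t * s * np.sqrt(1 + 1 / n + d)   # where the next observation is
    y0 = a + b * x0
    return (y0 - half_conf, y0 + half_conf), (y0 - half_pred, y0 + half_pred)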

Regression going through zero – it is possible, but: how was it in reality?

My regression has “proved” with high certainty that at the time of the volcanic island's birth there was a negative number of species.

Regression going through zero – it is possible, but: how was it in reality? Would a regression forced through zero do such a thing?
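
A small illustration (invented data with a clearly non-zero intercept) of how the two fits differ at X = 0:

import numpy as np

x = np.array([5.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([12.0, 13.1, 13.9, 15.2, 16.0])

b, a = np.polyfit(x, y, deg=1)        # ordinary fit: intercept is estimated
b0 = np.sum(x * y) / np.sum(x ** 2)   # LS slope when forced through zero

print(f"free line at X = 0: {a:.2f}")                 # far from zero here
print(f"forced line at X = 0: 0 by construction (slope {b0:.2f})")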

We do not use linear regression because we believe the dependence is linear over its whole range; rather, we often (and legitimately) believe that we can rationally approximate it by a linear function within the range of the values used. Be careful with extrapolations (extrapolations to zero are especially dangerous).

Using regression does not imply causal dependence. Significant, for example, are: the dependence of the number of murders on the number of frost days per year across US states; the dependence of the number of divorces on the number of fridges over the years; the dependence of the number of inhabitants of India on the concentration of CO₂ over the years. Causal dependence can be proved only by a manipulative experiment.

[Figure: dependence of the number of murders (Murders) on the number of frost days (Frost) in individual states of the USA. Results of the regression analysis of the number of murders per inhabitants in 1976 (Murders) in individual US states on the number of frost days in the capital of the given state (Frost); P < 0.01.]

Power of the test depends on the number of observations and on the strength of the relation (i.e., on R² in the whole population). In experimental studies we can increase R² by increasing the range of the independent variable (keep in mind that this usually makes the linearity of the relation worse).

In interpretation, distinguish when we are more interested in the strength of the relation (and thus in the R² value) and when we are happy that “it is significant”. How well does a new, cheap analytical method track the real concentration? (If I did not already believe that H₀ – the method is completely independent of the concentration – is false, I would not use the method at all; I am interested in R², or in the error of estimation.)

The declaration “The method is excellent; its dependence on the real concentrations is highly significant (p < 0.001)” says only one thing – we are very sure that the method is better than a random number generator. We are interested mainly in R² [and a value of 0.8 can still be too low for us] (and here especially in the error of estimation).

On the other side, the declaration “The number of species is positively dependent on soil pH (F₁,₃₃ = 12.3, p < 0.01)” is interesting, as it is not clear a priori that the null hypothesis is false. But I am interested in R² too (though here I might be satisfied even with very low values, e.g. 0.2).

Changing X for Y, I get logically different results (the regression formulas are not inverse functions), but R², F, and P are the same. [Figures: one estimates DBH with the help of height, the other height with the help of DBH; each minimizes the distances of its own dependent variable from the line.]
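
This can be checked directly, e.g. with scipy's linregress (invented data):

from scipy import stats

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1]

xy = stats.linregress(x, y)   # Y regressed on X
yx = stats.linregress(y, x)   # X regressed on Y

# The two lines are not inverse functions: the slopes are not reciprocals...
print(xy.slope, 1 / yx.slope)
# ...but r^2 (and hence F and P) is identical either way
print(xy.rvalue ** 2, yx.rvalue ** 2, xy.pvalue, yx.pvalue)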

Even simple regression is computed in Statistica with the help of the “Multiple regression” module. In my results I nevertheless write that I used simple regression!!!

Data transformation in regression. Attention – the variables are not equal: the independent variable is considered exact, while the dependent variable contains the error (and it is on this variable that the sum of squared errors is minimized).

Make the distinction: with a transformation of the independent variable I change the shape of the dependence, but not the distribution of the residuals; with a transformation of the dependent variable I change both – the shape and the residual distribution.

Linearized regression. The most common transformation is the logarithmic one. Taking the logarithm of the independent variable, I get Y = a + b·log(X) – e.g. the species–area form S = a + b·log(A). (In the output table, the first line is usually deleted, and usually the second one too, in the case of publication.) Presumption: the residuals were not dependent on the mean, so the transformation has not done anything to them.
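
A sketch of this transformation with invented species–area numbers:

import numpy as np

A = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])  # area (invented)
S = np.array([5.0, 12.0, 18.0, 26.0, 33.0])        # species richness (invented)

b, a = np.polyfit(np.log10(A), S, deg=1)  # fit S = a + b*log10(A)
# only the shape of the dependence changes; the residual distribution does not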

The relationship is exponential; the residuals are linearly dependent on the mean.

It does not matter whether I use ln or log – but if I want to estimate a growth rate, then use ln! I take the logarithm of just the dependent variable, and thereby I “homogenize” the residuals.

Popular is the power relationship, Y = c·X^z; it always goes through zero – allometric relationships, species–area curves.

Use either ln or log. The log transformation of both variables linearizes most monotonic relationships that have no inflection point and go through zero [S = c·A^z becomes log S = log c + z·log A]. The residuals are assumed to be positively dependent on the mean. Attention: using the logarithm, positive deviations from the prediction are “decreased” more than the negative ones.
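
A sketch of the log–log fit with invented numbers:

import numpy as np

A = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])  # area (invented)
S = np.array([3.0, 7.2, 13.8, 30.5, 61.0])         # species richness (invented)

z, log_c = np.polyfit(np.log(A), np.log(S), deg=1)  # log S = log c + z*log A
c = np.exp(log_c)
# the back-transformed curve S = c * A**z goes through zero, as required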