Biostatistics, statistical software VI. Relationship between two continuous variables, correlation, linear regression, transformations. Relationship between two discrete variables, contingency tables, test for independence.


Biostatistics, statistical software VI. Relationship between two continuous variables, correlation, linear regression, transformations. Relationship between two discrete variables, contingency tables, test for independence. Krisztina Boda, PhD, Department of Medical Informatics, University of Szeged

Krisztina Boda INTERREG 2 Relationship between two continuous variables: correlation, linear regression, transformations.

Imagine that 6 students are given a battery of tests by a vocational guidance counsellor, with the results shown in the following table. Variables measured on the same individuals are often related to each other.

Let us draw a graph, called a scattergram, to investigate relationships. Scatterplots show the relationship between two quantitative variables measured on the same cases. In a scatterplot, we look for the direction, form, and strength of the relationship between the variables. The simplest relationship is linear in form and reasonably strong. Scatterplots also reveal deviations from the overall pattern.

Creating a scatterplot. When one variable in a scatterplot explains or predicts the other, place it on the x-axis. Place the variable that responds to the predictor on the y-axis. If neither variable explains or responds to the other, it does not matter which axis you assign each to.

Possible relationships: positive correlation, negative correlation, no correlation.

Describing a linear relationship with a number: the coefficient of correlation. Correlation is a numerical measure of the strength of a linear association. The formula for the coefficient of correlation treats x and y identically; there is no distinction between explanatory and response variable. Denoting the two samples by x1, x2, …, xn and y1, y2, …, yn, the coefficient of correlation can be computed according to the following formula: r = Σ(xi − x̄)(yi − ȳ) / √( Σ(xi − x̄)² · Σ(yi − ȳ)² )
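As a minimal sketch, Pearson's r can be computed directly from the deviations from the means. The six students' actual scores are not reproduced in this transcript, so the data below are hypothetical:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient, computed from the deviations
    from the means: r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores for 6 students (the slide's real data are not shown)
math_score = [450, 480, 500, 520, 560, 600]
lang_score = [470, 510, 520, 550, 580, 640]
r = pearson_r(math_score, lang_score)
```

Because the formula treats x and y symmetrically, swapping the two argument lists leaves r unchanged.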

Properties of r. The value of r is always between −1 and +1 (−1 ≤ r ≤ 1); either extreme indicates a perfect linear association. a) If r is near +1 or −1, we say that we have high correlation. b) If r = 1, there is a perfect positive correlation; if r = −1, there is a perfect negative correlation. c) A correlation of zero indicates the absence of linear association. When there is no tendency for the points to lie on a straight line, we say that there is no correlation (r = 0) or low correlation (r is near 0).

Effect of outliers. Even a single outlier can change the correlation substantially. Outliers can create an apparently strong correlation where none would be found otherwise, or hide a strong correlation by making it appear weak. (The four scatterplots on the slide have r = −0.21, r = 0.74, r = 0.998 and r = −0.26.)

Two variables may be closely related and still have a small correlation if the form of the relationship is not linear. (The two scatterplots on the slide have r = 2.8·10⁻¹⁵ and r = 0.157.)

Correlation and causation. A correlation between two variables does not show that one causes the other. Causation is a subtle concept, best demonstrated statistically by designed experiments.

Correlation by eye. This applet lets you estimate the regression line and guess the value of Pearson's correlation. Five possible values of Pearson's correlation are listed; one of them is the correlation for the data displayed in the scatterplot. Guess which one it is. To see the correct value, click on the "Show r" button.

When is a correlation "high"? What is considered a high correlation varies with the field of application. The statistician must decide when a sample value of r is far enough from zero to reflect a real correlation in the population.

Testing the significance of the coefficient of correlation. The statistician must decide when a sample value of r is far enough from zero to be significant, that is, when it is sufficiently far from zero to reflect a correlation in the population. H0: ρ = 0 (Greek rho; the correlation coefficient in the population is 0). Ha: ρ ≠ 0 (the correlation coefficient in the population differs from 0). This test can be carried out by expressing a t statistic in terms of r. The following t statistic has n − 2 degrees of freedom: t = r·√(n − 2) / √(1 − r²). Decision using a statistical table: if |t| > t(α, n−2), the difference is significant at level α; we reject H0 and state that the population correlation coefficient is different from 0. If |t| < t(α, n−2), the difference is not significant at level α; we accept H0 and state that the population correlation coefficient is not different from 0. Decision using the p-value: if p < α, we reject H0 at level α and state that the population correlation coefficient is different from 0.
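A minimal sketch of this significance test, using the standard t statistic t = r·√(n−2)/√(1−r²) and, for illustration, the r = 0.9989 reported later in the slides for math vs. language skill:

```python
import math

def t_from_r(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom:
    t = r * sqrt(n - 2) / sqrt(1 - r^2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# r = 0.9989 with n = 6 students (values taken from the slides' Example 1)
t = t_from_r(0.9989, 6)
t_crit = 2.776               # critical value t(0.05, 4) from a t table
significant = abs(t) > t_crit
```

With these inputs t comes out near 42.6, far above the critical value, so the correlation is significant at the 5% level.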

Example 1. The correlation coefficient between math skill and language skill was found to be r = 0.9989. Is it significantly different from 0? H0: the correlation coefficient in the population is 0, ρ = 0. Ha: the correlation coefficient in the population is different from 0. Let's compute the test statistic: t = 0.9989·√4 / √(1 − 0.9989²) = 42.6. Degrees of freedom: df = 6 − 2 = 4. The critical value in the table is t(0.05, 4) = 2.776. Because 42.6 > 2.776, we reject H0 and claim that there is a significant linear correlation between the two variables at the 5% level.

Example 1, cont. Since p < 0.05, the correlation is significantly different from 0 at the 5% level.

Example 2. The correlation coefficient between math skill and retailing skill was found to be r = −0.9993. Is it significantly different from 0? H0: the correlation coefficient in the population is 0, ρ = 0. Ha: the correlation coefficient in the population is different from 0. Let's compute the test statistic: t = −0.9993·√4 / √(1 − 0.9993²) = −53.42. Degrees of freedom: df = 6 − 2 = 4. The critical value in the table is t(0.05, 4) = 2.776. Because |−53.42| = 53.42 > 2.776, we reject H0 and claim that there is a significant linear correlation between the two variables at the 5% level.

Example 2, cont.

Example 3. The correlation coefficient between math skill and theater skill was found to be r = . Is it significantly different from 0? H0: the correlation coefficient in the population is 0, ρ = 0. Ha: the correlation coefficient in the population is different from 0. Let's compute the test statistic. Degrees of freedom: df = 6 − 2 = 4. The critical value in the table is t(0.05, 4) = 2.776. Because |t| < 2.776, we do not reject H0 and claim that there is no significant linear correlation between the two variables at the 5% level.

Example 3, cont.

Prediction based on linear correlation: linear regression. When the form of the relationship in a scatterplot is linear, we usually want to describe that linear form more precisely with numbers. We can rarely hope to find data values lined up perfectly, so we fit lines to scatterplots with a method that compromises among the data values. This method is called the method of least squares. The key to finding, understanding, and using least-squares lines is an understanding of their failures to fit the data: the residuals. The straight line that best fits the data, y = bx + a, is called the regression line. Geometrical meaning of a and b: b is called the regression coefficient, the slope of the best-fitting (regression) line; a is the y-intercept of the regression line.

Equation of the regression line for the data of Example 1: y = 1.016·x + 15.5; the slope of the line is 1.016. Prediction based on the equation: what is the predicted language score for a student having 400 points in math? y_predicted = 1.016·400 + 15.5 = 421.9.
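Prediction is just plugging the new x value into the fitted line; a quick check of the slide's arithmetic, using the slope 1.016 and intercept 15.5 reported on the slide:

```python
# Regression line reported on the slide: y = 1.016*x + 15.5
b, a = 1.016, 15.5

def predict(x):
    """Predicted language score for a given math score."""
    return b * x + a

y400 = predict(400)   # predicted language score for 400 math points
```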

How to get the formula for the line that gives the best point estimates: the least-squares line minimizes the sum of squared residuals, which leads to b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and a = ȳ − b·x̄.
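A sketch of the least-squares computation (b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)², a = ȳ − b·x̄), tried on small made-up data that lie exactly on a line:

```python
def least_squares(x, y):
    """Slope b and intercept a of the least-squares line y = b*x + a:
    b = sum((x - mx)*(y - my)) / sum((x - mx)^2),  a = my - b*mx."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return b, a

# Points lying exactly on y = 2x + 1, so the fit should recover b = 2, a = 1
b, a = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
```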

Computation of the correlation coefficient from the regression coefficient. There is a relationship between the correlation and the regression coefficient: r = b·(sx / sy), where sx and sy are the standard deviations of the samples. From this relationship it can be seen that r and b have the same sign: if there is a negative correlation between the variables, the slope of the regression line is also negative. It can be shown that the same t-test can be used to test the significance of r and the significance of b.
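A numerical check of the identity r = b·(sx / sy), on small made-up data (the slides' own data are not reproduced here):

```python
import math

# Made-up bivariate sample
x = [1.0, 2.0, 4.0, 5.0, 7.0]
y = [2.0, 3.0, 5.0, 4.0, 8.0]
n = len(x)

mx, my = sum(x) / n, sum(y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)

r = sxy / math.sqrt(sxx * syy)       # correlation coefficient
b_slope = sxy / sxx                  # regression slope
sx = math.sqrt(sxx / (n - 1))        # sample standard deviation of x
sy = math.sqrt(syy / (n - 1))        # sample standard deviation of y

# r and b * (sx / sy) should agree (up to floating-point rounding)
r_from_b = b_slope * sx / sy
```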

Coefficient of determination. The square of the correlation coefficient, multiplied by 100, is called the coefficient of determination. It shows the percentage of the total variation explained by the linear regression. Example: the correlation between math aptitude and language aptitude was found to be r = 0.9989. The coefficient of determination is r² = 0.9978, so 99.78% of the total variation of Y is explained by its linear relationship with X.
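A quick check of this arithmetic, squaring the reported r = 0.9989:

```python
r = 0.9989                    # correlation between math and language aptitude (from the slides)
r2 = r * r                    # coefficient of determination
percent_explained = 100 * r2  # percentage of the total variation explained
```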

Regression using transformations. Sometimes useful models are not linear in their parameters. Examining the scatterplot of the data shows a functional, but not linear, relationship between the data.

Example. A fast food chain opened in 1974. Each year from 1974 to 1988 the number of steakhouses in operation was recorded. The scatterplot of the original data suggests an exponential relationship between x (year) and y (number of steakhouses) (first plot). Taking the logarithm of y, we get a linear relationship (plot at the bottom).

Performing the linear regression procedure on x and ln y, we get an equation of the form ln y = a + b·x, that is, y = e^(a + b·x) = e^a · e^(b·x) = 1.293·e^(b·x), which is the equation of the best-fitting curve to the original data.
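This back-transformation can be sketched as a small fitting routine: regress ln y on x, then exponentiate the intercept to recover the multiplier. The steakhouse counts themselves are not in the transcript, so the data below are synthetic, generated from a known exponential curve:

```python
import math

def fit_exponential(x, y):
    """Fit y = A * exp(b*x) by linear least squares on (x, ln y),
    i.e. ln y = ln A + b*x  (assumes all y > 0)."""
    ly = [math.log(yi) for yi in y]
    n = len(x)
    mx = sum(x) / n
    mly = sum(ly) / n
    b = (sum((xi - mx) * (li - mly) for xi, li in zip(x, ly))
         / sum((xi - mx) ** 2 for xi in x))
    lnA = mly - b * mx
    return math.exp(lnA), b

# Synthetic counts lying exactly on y = 1.5 * e^(0.3*t)
years = [0, 1, 2, 3, 4]
counts = [1.5 * math.exp(0.3 * t) for t in years]
A, b = fit_exponential(years, counts)
```

Because the synthetic data lie exactly on an exponential curve, the fit recovers A = 1.5 and b = 0.3.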

(Plots: the fitted line ln y = a + b·x and the fitted curve y = 1.293·e^(b·x).)

Types of transformations. Some non-linear models can be transformed into a linear model by taking logarithms on one or both sides. Either the base-10 logarithm (denoted lg) or the natural, base-e logarithm (denoted ln) can be used. If a > 0 and b > 0, a logarithmic transformation can be applied to the models summarized on the following slides.

Exponential relationship → take log y. Model: y = a·10^(b·x). Take the logarithm of both sides: lg y = lg a + b·x, so lg y is linear in x.

Logarithmic relationship → take log x. Model: y = a + b·lg x, so y is linear in lg x.

Power relationship → take log x and log y. Model: y = a·x^b. Take the logarithm of both sides: lg y = lg a + b·lg x, so lg y is linear in lg x.
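The power model can be fitted the same way as the exponential one, only now both variables are logged; a sketch on synthetic data generated from a known power curve:

```python
import math

def fit_power(x, y):
    """Fit y = a * x^b by linear least squares on (lg x, lg y):
    lg y = lg a + b * lg x  (assumes all x > 0 and y > 0)."""
    lx = [math.log10(xi) for xi in x]
    ly = [math.log10(yi) for yi in y]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    b = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    a = 10 ** (my - b * mx)
    return a, b

# Synthetic points lying exactly on y = 2 * x^1.5
x = [1, 2, 3, 4, 5]
y = [2 * xi ** 1.5 for xi in x]
a, b = fit_power(x, y)
```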

Reciprocal relationship → take the reciprocal of x. Model: y = a + b/x = a + b·(1/x), so y is linear in 1/x.

Example from the literature.


Relationship between two discrete variables: contingency tables, test for independence.

Comparison of categorical variables (percentages): χ² (chi-square) tests. Example: rates of diabetes in three treatment groups: 31%, 27% and 25%. The frequencies can be arranged into a contingency table (rows: diabetes yes/no; columns: Treatment 1, Treatment 2, Treatment 3; with row and column totals). H0: the occurrence of diabetes is independent of the groups (the rates are the same in the population).

χ² test, assumptions. If H0 is true, the expected frequencies can be computed as E_ij = (row total · column total) / grand total. The χ² statistic is χ² = Σ (O_i − E_i)² / E_i. Degrees of freedom: df = (number of rows − 1)·(number of columns − 1). Decision based on the table: reject H0 if χ² > χ²(α, df). Assumption: at most 20% of the cells may have expected frequencies below 5. Exact test (Fisher): it has no such assumption and gives the exact p-value. For the example: χ² = 0.933, df = (3 − 1)·(2 − 1) = 2, and 0.933 < 5.99 (= χ²(0.05, 2)), p = 0.627. The assumption holds; the exact p = 0.663.
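The expected-frequency and χ² computations above can be sketched directly. The slide's actual cell counts are not in the transcript, so the table below assumes, for illustration, 100 patients per treatment group (which makes the "yes" counts 31, 27 and 25, matching the reported rates); with that assumption the statistic reproduces the slide's χ² = 0.933 and p = 0.627. For df = 2 the chi-square p-value has the closed form p = exp(−χ²/2):

```python
import math

def chi_square(table):
    """Chi-square statistic and df for an r x c table of observed counts:
    E_ij = row_i * col_j / total,  chi2 = sum (O - E)^2 / E."""
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    total = sum(rows)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / total
            chi2 += (obs - exp) ** 2 / exp
    df = (len(rows) - 1) * (len(cols) - 1)
    return chi2, df

# Assumed 2x3 table: diabetes yes/no by treatment, 100 patients per group
obs = [[31, 27, 25],
       [69, 73, 75]]
chi2, df = chi_square(obs)
p = math.exp(-chi2 / 2)   # closed-form chi-square p-value, valid for df = 2 only
```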

2×2 tables. For a 2×2 table with cell frequencies a, b, c, d, the χ² statistic can be computed directly from the cell frequencies: χ² = n·(ad − bc)² / ((a+b)(c+d)(a+c)(b+d)), where n = a + b + c + d. When the assumptions are violated: Yates' correction, or Fisher's exact test.
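A sketch of the 2×2 cell-frequency formula, with Yates' continuity correction as an option (|ad − bc| is reduced by n/2 before squaring); the example counts are made up:

```python
def chi2_2x2(a, b, c, d, yates=False):
    """Chi-square for a 2x2 table [[a, b], [c, d]] from the cell
    frequencies:  chi2 = n*(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)).
    With yates=True, |ad - bc| is replaced by max(|ad - bc| - n/2, 0)
    (Yates' continuity correction)."""
    n = a + b + c + d
    num = abs(a * d - b * c)
    if yates:
        num = max(num - n / 2, 0.0)
    return n * num ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Made-up 2x2 table [[10, 20], [30, 15]]
chi2_plain = chi2_2x2(10, 20, 30, 15)
chi2_corrected = chi2_2x2(10, 20, 30, 15, yates=True)
```

The corrected statistic is always at most the uncorrected one, which is why Yates' correction gives a more conservative test.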

Example. χ² = 10.815, df = 1, p = 0.001; χ² with Yates' correction = 9.809, df = 1, p = 0.002; Fisher's exact test: p = 0.002.

Review questions and exercises. Problems to be solved by hand calculations: ..\Handouts\Problems hand VI.doc. Solutions: ..\Problems hand VI solution.doc. Problems to be solved using a computer: ..\Handouts\Problems comp VI.doc.

Useful web pages.