Association, correlation and regression in biomedical research


Association, correlation and regression in biomedical research Asst. Prof. Georgi Iskrov, PhD Department of Social Medicine

Before we start http://www.raredis.work/edu/

Outline
Choosing a statistical test
Non-parametric tests: Mann–Whitney U test, chi-square χ² test
Correlation analysis: Pearson and Spearman correlation, coefficient of determination
Regression analysis: linear and multiple regression, least squares method

Parametric and non-parametric tests Parametric test – the variable we have measured in the sample is normally distributed in the population to which we plan to generalize our findings Non-parametric test – distribution free, no assumption about the distribution of the variable in the population

Parametric and non-parametric tests

Groups               Non-parametric (nominal)   Non-parametric (ordinal)        Parametric (ordinal, interval, ratio)
1 group              χ² goodness-of-fit test    Wilcoxon signed rank test       1-sample t-test
2 unrelated groups   χ² test                    Mann–Whitney U test             2-sample t-test
2 related groups     McNemar test               Wilcoxon signed rank test       Paired t-test
K unrelated groups                              Kruskal–Wallis H test           ANOVA
K related groups                                Friedman matched samples test   ANOVA with repeated measurements

Mann–Whitney U test Ordinal data, two independent samples. H0: the two sampled populations are equivalent in location (they have the same mean ranks). The observations from both groups are combined and ranked, with the average rank assigned in the case of ties. If the populations are identical in location, the ranks should be randomly mixed between the two samples.

Mann–Whitney U test Aim: Compare the average ranks or medians of two unrelated groups. Example: Comparing pain relief score of patients undergoing two different physiotherapy programmes. Effect size: Difference between the two medians (mean ranks). Null hypothesis: The two population medians (mean ranks) are identical. Meaning of P value: If the two population medians (mean ranks) are identical, what is the chance of observing such a difference (or a bigger one) between medians (mean ranks) by chance alone?

Kruskal–Wallis test Ordinal data, K independent samples. H0: the K sampled populations are equivalent in location (they have the same mean ranks). The observations from all groups are combined and ranked, with the average rank assigned in the case of ties. If the populations are identical in location, the ranks should be randomly mixed between the K samples.
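The K-group version works the same way with `scipy.stats.kruskal`; the three groups below are illustrative values, not slide data:

```python
# Kruskal–Wallis H test: ordinal data, K = 3 independent samples.
from scipy.stats import kruskal

group_1 = [2, 3, 4, 3, 2]   # hypothetical scores
group_2 = [5, 6, 5, 7, 6]
group_3 = [8, 9, 7, 9, 8]

# H0: all three populations have the same location (mean ranks).
h_stat, p_value = kruskal(group_1, group_2, group_3)
print(h_stat, p_value)
```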

Wilcoxon signed rank test Ordinal data, two related samples. H0: the two sampled populations are equivalent in location (they have the same mean ranks). The test takes into account information about the magnitude of differences within pairs and gives more weight to pairs that show large differences than to pairs that show small differences. It is based on the ranks of the absolute values of the differences between the two variables.
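For paired (before/after) data the call is `scipy.stats.wilcoxon`; the scores below are hypothetical:

```python
# Wilcoxon signed rank test: ordinal data, two related samples.
from scipy.stats import wilcoxon

before = [8, 7, 6, 9, 8, 7, 8, 6]   # e.g. symptom scores before treatment (hypothetical)
after  = [5, 6, 4, 7, 5, 6, 5, 5]   # the same patients after treatment

# The statistic is based on the ranks of the absolute within-pair differences.
w_stat, p_value = wilcoxon(before, after)
print(w_stat, p_value)
```

Here every patient improved, so the rank sum for negative differences is 0 and the p-value is small.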

Is there an association? The chi-square χ² test is used to check for an association between two categorical variables. H0: there is no association between the variables. HA: there is an association between the variables. If two categorical variables are associated, the chance that an individual falls into a particular category for one variable depends on the category they fall into for the other variable.

Chi-square χ² test Let's say that we want to determine if there is an association between place of birth and alcohol consumption. When we test if there is an association between these two variables, we are trying to determine if coming from a particular area makes an individual more likely to consume alcohol. If that is the case, then we can say that place of birth and alcohol consumption are related or associated. Assumptions: a large sample of independent observations; all expected counts should be ≥ 1 (no zeros); at least 80% of expected counts should be ≥ 5.

Chi-square χ² test The following table presents the data on place of birth and alcohol consumption. The two variables of interest have r = 4 rows and c = 2 columns, resulting in 4 × 2 = 8 combinations of categories.

Place of birth   Alcohol   No alcohol
Big city         620       75
Rural            240       41
Small town       130       29
Suburban         190       38

Expected counts For i taking values from 1 to r (number of rows) and j taking values from 1 to c (number of columns), denote: Ri = total count of observations in the i-th row; Cj = total count of observations in the j-th column; Oij = observed count for the cell in the i-th row and the j-th column; Eij = expected count for the cell in the i-th row and the j-th column if the two variables were independent, i.e. if H0 were true. These counts are calculated as Eij = (Ri × Cj) / n.

Expected counts

Place of birth   Alcohol      No alcohol   Total
Big city         O11 = 620    O12 = 75     R1 = 695
Rural            O21 = 240    O22 = 41     R2 = 281
Small town       O31 = 130    O32 = 29     R3 = 159
Suburban         O41 = 190    O42 = 38     R4 = 228
Total            C1 = 1180    C2 = 183     n = 1363

E11 = (695 × 1180) / 1363    E12 = (695 × 183) / 1363
E21 = (281 × 1180) / 1363    E22 = (281 × 183) / 1363
E31 = (159 × 1180) / 1363    E32 = (159 × 183) / 1363
E41 = (228 × 1180) / 1363    E42 = (228 × 183) / 1363

Chi-square χ² test The test statistic measures the difference between the observed and the expected counts assuming independence: χ² = Σ (Oij − Eij)² / Eij, summed over all cells. If the statistic is large, it implies that the observed counts are not close to the counts we would expect to see if the two variables were independent. Thus, a 'large' χ² gives evidence against H0 and supports HA. To get the corresponding p-value we need to use a χ² distribution with (r − 1) × (c − 1) df.
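Using the place-of-birth table above, `scipy.stats.chi2_contingency` computes the expected counts, the χ² statistic and the p-value in one call:

```python
# Chi-square test of independence on the place of birth / alcohol table.
from scipy.stats import chi2_contingency

# Observed counts: rows = big city, rural, small town, suburban;
# columns = alcohol, no alcohol.
observed = [[620, 75],
            [240, 41],
            [130, 29],
            [190, 38]]

chi2, p, dof, expected = chi2_contingency(observed)
# dof = (4 - 1) * (2 - 1) = 3
# expected[0][0] = (695 * 1180) / 1363, matching the E11 formula above
print(chi2, p, dof)
```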

Beware! Association is not causation. The observed association between two variables might be due to the action of a third, unobserved variable.

Special case In a lot of cases the categorical variables of interest have two levels each. In this case, we can summarize the data using a contingency table having 2 rows and 2 columns (a 2×2 table):

           Column 1   Column 2   Total
Row 1      A          B          R1
Row 2      C          D          R2
Total      C1         C2         n

In this case, the χ² statistic has a simplified form: χ² = n (AD − BC)² / (R1 × R2 × C1 × C2). Under the null hypothesis, the χ² statistic has a χ² distribution with (2 − 1) × (2 − 1) = 1 degree of freedom.

Special case

Gender   Alcohol   No alcohol   Total
Male     540       52           592
Female   325       31           356
Total    865       83           948
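For the gender table above, the simplified 2×2 formula χ² = n(AD − BC)² / (R1·R2·C1·C2) can be checked against the general computation (with the Yates correction disabled, so both compute the same uncorrected statistic):

```python
# Verify the simplified 2x2 chi-square formula on the gender/alcohol table.
from scipy.stats import chi2_contingency

A, B, C, D = 540, 52, 325, 31
n = A + B + C + D                 # 948
r1, r2 = A + B, C + D             # row totals: 592, 356
c1, c2 = A + C, B + D             # column totals: 865, 83

chi2_manual = n * (A * D - B * C) ** 2 / (r1 * r2 * c1 * c2)
chi2_general, p, dof, _ = chi2_contingency([[A, B], [C, D]], correction=False)
print(chi2_manual, chi2_general)
```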

Limitations No expected count should be less than 1, and no more than 1/5 of the expected counts should be less than 5. To correct for this, you can collect larger samples or combine your data for the smaller expected categories until their combined expected value is 5 or more. Yates correction: when there is only 1 degree of freedom, the regular chi-square test should not be used. Apply the Yates correction by subtracting 0.5 from the absolute value of each calculated O − E term, then continue as usual with the new corrected values.
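`chi2_contingency` applies the Yates correction to 2×2 tables by default (`correction=True`). Because each |O − E| term is shrunk by 0.5, the corrected statistic is never larger than the uncorrected one, and the corrected p-value is never smaller:

```python
# Yates continuity correction on a 2x2 table (the gender/alcohol data above).
from scipy.stats import chi2_contingency

table = [[540, 52],
         [325, 31]]

chi2_corrected, p_corr, dof, _ = chi2_contingency(table)                    # Yates applied
chi2_uncorrected, p_unc, _, _ = chi2_contingency(table, correction=False)   # no correction
print(chi2_corrected, chi2_uncorrected)
```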

Fisher exact test This test is only available for 2 x 2 tables. For small n, the probability can be computed exactly by counting all possible tables that can be constructed based on the marginal frequencies. Thus, the Fisher exact test computes the exact probability under the null hypothesis of obtaining the current distribution of frequencies across cells, or one that is more uneven.

Fisher exact test

Gender   Dieting   Non-dieting   Total
Male     1         11            12
Female   9         3             12
Total    10        14            24

Gender   Dieting   Non-dieting   Total
Male     a         c             a + c
Female   b         d             b + d
Total    a + b     c + d         a + b + c + d

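The dieting table is small enough that the exact probability is the right tool; `scipy.stats.fisher_exact` handles 2×2 tables directly:

```python
# Fisher exact test on the gender/dieting 2x2 table.
from scipy.stats import fisher_exact

# Rows: male, female; columns: dieting, non-dieting.
table = [[1, 11],
         [9, 3]]

odds_ratio, p_value = fisher_exact(table)
# Sample odds ratio = (1 * 3) / (11 * 9) = 1/33
print(odds_ratio, p_value)
```

The exact two-sided p-value sums the probabilities of this table and all tables (with the same margins) that are at least as uneven.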

Correlation Correlation quantifies the linear association between two variables. The direction and magnitude of the correlation are expressed by the correlation coefficient r. Its value can range between −1 and 1: r = 0 means no linear association; if r is positive, the two variables tend to increase or decrease together; if r is negative, the two variables are inversely related; if r is equal to 1 or −1, there is a perfect linear association between the two variables.

Correlation The most widely used type of correlation coefficient is Pearson's r, also called the linear or product-moment correlation. For Pearson correlation, both X and Y values must be sampled from populations that follow a normal distribution. Spearman rank correlation rs does not make this assumption. This non-parametric method separately ranks the X and Y values and then calculates the correlation between the two sets of ranks.

Correlation Pearson correlation quantifies the linear relationship between X and Y. As X goes up, does Y go up a consistent amount? Spearman correlation quantifies the monotonic relationship between X and Y. As X goes up, does Y go up as well (by any amount)?

Correlation

r                    0.77
95% CI for ρ         0.38 to 0.93
r²                   0.59
p                    0.02
Number of XY pairs   24

Coefficient of determination The coefficient of determination r² is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). Example: r = 0.70 (association between height and weight), so r² = 0.49: 49% of the variance in weight is explained by / predictable from height, and 51% of the variance in weight is not.
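The height/weight numbers from the slide, restated as a two-line computation:

```python
# Coefficient of determination from a correlation coefficient.
r = 0.70                        # correlation between height and weight (slide example)
r_squared = r ** 2              # coefficient of determination, 0.49

explained = r_squared * 100     # 49% of the variance in weight explained by height
unexplained = 100 - explained   # 51% not explained
print(explained, unexplained)
```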

Mistakes Correlation and coincidence Source: www.tylervigen.com

Mistakes Do not attempt to interpret a correlation coefficient without looking at the corresponding scatterplot! Each data set has a correlation coefficient of 0.7

Mistakes Source: xkcd.com

Regression Those who are taller tend to weigh more…

Regression Origin of the term regression: children of tall parents tend to be shorter than their parents and vice versa; children "regressed" toward the mean height of all children. Model: an equation that describes the relationship between variables, with parameters (regression coefficients). Simple linear regression: Y = a + b × X (Outcome = Intercept + Slope × Predictor). Multiple regression: Y = a + b1 × X1 + b2 × X2 + … + bn × Xn.

Linear regression

X-axis        Y-axis
independent   dependent
predictor     predicted
carrier       response
input         output

Linear regression In the model y = a + b × x: y is the predicted (dependent) variable; a is the intercept (the estimated value of y when x = 0); b is the slope (the average change in y for each change of 1 unit in x); x is the predictor (independent variable).

Linear regression The association looks like it could be described by a straight line. There are many straight lines that could be drawn through the data. How to choose among them?

Least squares method Residual – the difference between the actual Y value and the Y value predicted by the model. Least squares method finds the values of the slope and intercept that minimize the sum of the squares of the residuals. Coefficient of determination r2 gives information about the goodness of fit of a model. In regression, r2 is a statistical measure of how well the regression line approximates the real-world data. An r2 of 1 indicates that the regression line perfectly fits the data.
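`scipy.stats.linregress` implements exactly this least-squares fit, returning the slope, intercept and r together. The height/weight values below are illustrative, not from the slides; note that with an intercept in the model, the least-squares residuals always sum to zero:

```python
# Least squares simple linear regression on hypothetical height/weight data.
from scipy.stats import linregress

heights = [160, 165, 170, 175, 180, 185]   # cm (hypothetical)
weights = [55, 61, 66, 70, 77, 82]         # kg (hypothetical)

fit = linregress(heights, weights)

# Predicted values and residuals (actual minus predicted).
predicted = [fit.intercept + fit.slope * h for h in heights]
residuals = [w - p for w, p in zip(weights, predicted)]

r_squared = fit.rvalue ** 2                # goodness of fit of the regression line
print(fit.slope, fit.intercept, r_squared)
```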

Mistakes Source: xkcd.com