Association, correlation and regression in biomedical research
Asst. Prof. Georgi Iskrov, PhD Department of Social Medicine
Outline
- Choosing a statistical test
- Non-parametric tests: Mann–Whitney U test, chi-square (χ²) test
- Correlation analysis: Pearson and Spearman correlation, coefficient of determination
- Regression analysis: linear and multiple regression, least squares method
Parametric and non-parametric tests
- Parametric test: the variable we have measured in the sample is normally distributed in the population to which we plan to generalize our findings.
- Non-parametric test: distribution-free; makes no assumption about the distribution of the variable in the population.
Parametric and non-parametric tests

Type of test       | Non-parametric (nominal)  | Non-parametric (ordinal)      | Parametric (ordinal, interval, ratio)
1 group            | χ² goodness-of-fit test   | Wilcoxon signed rank test     | 1-sample t-test
2 unrelated groups | χ² test                   | Mann–Whitney U test           | 2-sample t-test
2 related groups   | McNemar test              | –                             | Paired t-test
K unrelated groups | –                         | Kruskal–Wallis H test         | ANOVA
K related groups   | –                         | Friedman matched samples test | ANOVA with repeated measurements
Mann–Whitney U test
Ordinal data, two independent samples.
H0: The two sampled populations are equivalent in location (they have the same mean ranks). The observations from both groups are combined and ranked, with the average rank assigned in the case of ties. If the populations are identical in location, the ranks should be randomly mixed between the two samples.
Mann–Whitney U test
Aim: Compare the average ranks or medians of two unrelated groups.
Example: Comparing pain relief scores of patients undergoing two different physiotherapy programmes.
Effect size: Difference between the two medians (mean ranks).
Null hypothesis: The two population medians (mean ranks) are identical.
Meaning of P value: If the two population medians (mean ranks) are identical, what is the chance of observing such a difference (or a bigger one) between the medians (mean ranks) by chance alone?
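As a minimal sketch (not part of the original slides), the test can be run in Python with scipy.stats.mannwhitneyu; the pain relief scores below are made-up illustration data.

```python
from scipy import stats

# Hypothetical pain relief scores (ordinal) from two unrelated groups of
# patients on different physiotherapy programmes -- illustration data only.
programme_a = [3, 5, 4, 6, 2, 5, 7, 4]
programme_b = [6, 7, 5, 8, 7, 6, 9, 7]

# Two-sided Mann-Whitney U test; H0: the two populations have the same location.
u_stat, p_value = stats.mannwhitneyu(programme_a, programme_b,
                                     alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.4f}")
```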
Kruskal–Wallis test
Ordinal data, K independent samples.
H0: The K sampled populations are equivalent in location (they have the same mean ranks). The observations from all groups are combined and ranked, with the average rank assigned in the case of ties. If the populations are identical in location, the ranks should be randomly mixed between the K samples.
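A corresponding sketch with scipy.stats.kruskal, again on invented data:

```python
from scipy import stats

# Hypothetical ordinal scores from three unrelated groups.
group1 = [3, 5, 4, 6, 2]
group2 = [6, 7, 5, 8, 7]
group3 = [4, 4, 5, 3, 6]

# Kruskal-Wallis H test; H0: all K populations have the same location.
h_stat, p_value = stats.kruskal(group1, group2, group3)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```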
Wilcoxon signed rank test
Ordinal data, two related samples.
H0: The two sampled populations are equivalent in location (they have the same mean ranks). The test takes into account the magnitude of the differences within pairs, giving more weight to pairs that show large differences than to pairs that show small differences. It is based on the ranks of the absolute values of the differences between the two variables.
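A minimal sketch using scipy.stats.wilcoxon on hypothetical paired scores (e.g. before and after a treatment):

```python
from scipy import stats

# Hypothetical paired ordinal scores for the same patients, before and after.
before = [5, 6, 4, 7, 5, 6, 8, 5]
after  = [3, 5, 3, 5, 4, 4, 6, 4]

# Wilcoxon signed rank test: ranks the absolute within-pair differences,
# so pairs with larger differences carry more weight.
w_stat, p_value = stats.wilcoxon(before, after)
print(f"W = {w_stat}, p = {p_value:.4f}")
```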
Is there an association?

Chi-square (χ²) test
The chi-square (χ²) test is used to check for an association between two categorical variables.
H0: There is no association between the variables.
HA: There is an association between the variables.
If two categorical variables are associated, the chance that an individual falls into a particular category for one variable depends upon the particular category they fall into for the other variable.
Chi-square (χ²) test
Say we want to determine whether there is an association between place of birth and alcohol consumption. When we test whether these two variables are associated, we are trying to determine whether coming from a particular area makes an individual more likely to consume alcohol. If that is the case, we can say that place of birth and alcohol consumption are related or associated.
Assumptions:
- A large sample of independent observations;
- All expected counts should be ≥ 1 (no zeros);
- At least 80% of expected counts should be ≥ 5.
Chi-square (χ²) test
The following table presents the data on place of birth and alcohol consumption. The two variables of interest, place of birth and alcohol consumption, have r = 4 and c = 2, resulting in 4 × 2 = 8 combinations of categories.

Place of birth | Alcohol | No alcohol
Big city       | 620     | 75
Rural          | 240     | 41
Small town     | 130     | 29
Suburban       | 190     | 38
Expected counts
For i taking values from 1 to r (number of rows) and j taking values from 1 to c (number of columns), denote:
- Ri = total count of observations in the i-th row;
- Cj = total count of observations in the j-th column;
- Oij = observed count for the cell in the i-th row and the j-th column;
- Eij = expected count for the cell in the i-th row and the j-th column if the two variables were independent, i.e. if H0 were true.
These counts are calculated as Eij = (Ri × Cj) / n.
Expected counts

Place of birth | Alcohol   | No alcohol | Total
Big city       | O11 = 620 | O12 = 75   | R1 = 695
Rural          | O21 = 240 | O22 = 41   | R2 = 281
Small town     | O31 = 130 | O32 = 29   | R3 = 159
Suburban       | O41 = 190 | O42 = 38   | R4 = 228
Total          | C1 = 1180 | C2 = 183   | n = 1363

E11 = (695 × 1180) / 1363    E12 = (695 × 183) / 1363
E21 = (281 × 1180) / 1363    E22 = (281 × 183) / 1363
E31 = (159 × 1180) / 1363    E32 = (159 × 183) / 1363
E41 = (228 × 1180) / 1363    E42 = (228 × 183) / 1363
Chi-square (χ²) test
The test statistic measures the difference between the observed and the expected counts assuming independence:
χ² = Σ (Oij − Eij)² / Eij
If the statistic is large, it implies that the observed counts are not close to the counts we would expect to see if the two variables were independent. Thus, a 'large' χ² gives evidence against H0 and supports HA. To get the corresponding p-value we use a χ² distribution with (r − 1) × (c − 1) df.
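As a sketch (not from the slides), the whole calculation on the place-of-birth table above can be done with scipy.stats.chi2_contingency, which computes the expected counts, the χ² statistic and the p-value in one call:

```python
from scipy import stats

# Observed counts from the place-of-birth / alcohol table on the slides.
observed = [[620, 75],   # Big city
            [240, 41],   # Rural
            [130, 29],   # Small town
            [190, 38]]   # Suburban

# chi2_contingency computes Eij = (Ri * Cj) / n, the chi-square statistic
# and the p-value from a chi-square distribution with (r-1)(c-1) = 3 df.
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, df = {dof}, p = {p:.4f}")
print(expected)  # expected counts under H0 (independence)
```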
Beware! Association is not causation. The observed association between two variables might be due to the action of a third, unobserved variable.
Special case
In many cases the categorical variables of interest have two levels each. The data can then be summarized in a contingency table having 2 rows and 2 columns (a 2×2 table):

      | Column 1 | Column 2 | Total
Row 1 | A        | B        | R1
Row 2 | C        | D        | R2
Total | C1       | C2       | n

In this case, the χ² statistic has a simplified form:
χ² = n × (A×D − B×C)² / (R1 × R2 × C1 × C2)
Under the null hypothesis, the χ² statistic has a χ² distribution with (2 − 1) × (2 − 1) = 1 degree of freedom.
Special case

Gender | Alcohol | No alcohol | Total
Male   | 540     | 52         | 592
Female | 325     | 31         | 356
Total  | 865     | 83         | 948
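A quick sketch checking the simplified 2×2 formula above against the generic routine (with the Yates correction turned off so the two agree):

```python
from scipy import stats

# 2x2 table from the slides: gender vs. alcohol consumption.
a, b = 540, 52   # male:   alcohol, no alcohol
c, d = 325, 31   # female: alcohol, no alcohol
n = a + b + c + d                      # 948

# Simplified 2x2 statistic: chi2 = n(AD - BC)^2 / (R1 R2 C1 C2).
chi2_short = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Same result from the generic routine, Yates correction disabled.
chi2, p, dof, _ = stats.chi2_contingency([[a, b], [c, d]], correction=False)
print(f"shortcut: {chi2_short:.4f}, scipy: {chi2:.4f}, p = {p:.4f}")
```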
Limitations
- No expected count should be less than 1.
- No more than 1/5 of the expected counts should be less than 5. To correct for this, you can collect a larger sample or combine your data for the smaller expected categories until their combined expected value is 5 or more.
Yates correction
- When there is only 1 degree of freedom, the regular chi-square test should not be used.
- Apply the Yates correction by subtracting 0.5 from the absolute value of each calculated O − E term, then continue as usual with the new corrected values.
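For 2×2 tables (1 df), scipy applies the Yates continuity correction by default; a sketch comparing the corrected and uncorrected statistics on the table above:

```python
from scipy import stats

table = [[540, 52],
         [325, 31]]

# correction=True subtracts 0.5 from |O - E| in each cell before squaring
# (applied only when the table has 1 degree of freedom).
chi2_yates, p_yates, _, _ = stats.chi2_contingency(table, correction=True)
chi2_raw, p_raw, _, _ = stats.chi2_contingency(table, correction=False)
print(f"with Yates correction: chi2 = {chi2_yates:.4f}, p = {p_yates:.4f}")
print(f"without correction:    chi2 = {chi2_raw:.4f}, p = {p_raw:.4f}")
```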
Fisher exact test
This test is only available for 2×2 tables.
For small n, the probability can be computed exactly by counting all possible tables that can be constructed from the marginal frequencies. The Fisher exact test thus computes the exact probability, under the null hypothesis, of obtaining the current distribution of frequencies across the cells, or one that is more uneven.
Fisher exact test

Gender | Dieting | Non-dieting | Total
Male   | 1       | 11          | 12
Female | 9       | 3           | 12
Total  | 10      | 14          | 24

In general, for a 2×2 table:

Gender | Dieting | Non-dieting | Total
Male   | a       | c           | a + c
Female | b       | d           | b + d
Total  | a + b   | c + d       | a + b + c + d
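A minimal sketch running the test on the dieting table with scipy.stats.fisher_exact:

```python
from scipy import stats

# 2x2 dieting table from the slides; rows: male, female.
table = [[1, 11],
         [9, 3]]

# Fisher exact test: exact probability, under H0 of no association, of a
# table this uneven or more so, given the fixed marginal totals.
odds_ratio, p_value = stats.fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.4f}, p = {p_value:.4f}")
```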
Correlation
Correlation quantifies the linear association between two variables. The direction and magnitude of the correlation are expressed by the correlation coefficient r. Its value can range between −1 and 1:
- r = 0: no linear association;
- if r is positive, the two variables tend to increase or decrease together;
- if r is negative, the two variables are inversely related;
- if r is equal to 1 or −1, there is a perfect linear association between the two variables.
Correlation
The most widely used type of correlation coefficient is the Pearson r, also called the linear or product-moment correlation. For the Pearson correlation, both X and Y values must be sampled from populations that follow a normal distribution. The Spearman rank correlation rs does not make this assumption: this non-parametric method separately ranks the X and Y values and then calculates the correlation between the two sets of ranks.
Correlation
Pearson correlation quantifies the linear relationship between X and Y: as X goes up, does Y go up by a consistent amount?
Spearman correlation quantifies the monotonic relationship between X and Y: as X goes up, does Y go up as well (by any amount)?
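A sketch computing both coefficients with scipy on simulated data (the data and the seed are arbitrary, for illustration only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=24)
y = 2 * x + rng.normal(size=24)          # roughly linear, made-up data

r, p_pearson = stats.pearsonr(x, y)      # linear association
rho, p_spearman = stats.spearmanr(x, y)  # monotonic association (on ranks)
print(f"Pearson r = {r:.2f} (p = {p_pearson:.3f}), r^2 = {r**2:.2f}")
print(f"Spearman rho = {rho:.2f} (p = {p_spearman:.3f})")
```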
Correlation
r = 0.77
95% CI for ρ: 0.38 to 0.93
r² = 0.59
p = 0.02
Number of XY pairs = 24
Coefficient of determination
The coefficient of determination r² is the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
Example: r = 0.70 (association between height and weight), so r² = 0.49:
- 49% of the variance in weight is explained by (predictable from) height;
- 51% of the variance in weight is not explained by (not predictable from) height.
Mistakes
Correlation and coincidence. [Chart omitted.]
Mistakes
Do not attempt to interpret a correlation coefficient without looking at the corresponding scatterplot! [Figure omitted: each data set shown has a correlation coefficient of 0.7.]
Mistakes
[Cartoon omitted. Source: xkcd.com]
Regression
Those who are taller tend to weigh more…
Regression
Origin of the term regression: children of tall parents tend to be shorter than their parents, and vice versa; children "regressed" toward the mean height of all children.
Model: an equation that describes the relationship between variables, with parameters (regression coefficients) estimated from the data.
Simple linear regression: Y = a + b × X
(Outcome = Intercept + Slope × Predictor)
Multiple regression: Y = a + b1 × X1 + b2 × X2 + … + bn × Xn
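A short sketch (not from the slides) of fitting such a multiple-regression model by least squares with numpy; the height, age and weight data are simulated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
height = rng.normal(170, 10, n)   # hypothetical predictors
age = rng.normal(40, 12, n)
weight = -40 + 0.6 * height + 0.2 * age + rng.normal(0, 5, n)

# Design matrix with a column of ones for the intercept a.
X = np.column_stack([np.ones(n), height, age])
coef, *_ = np.linalg.lstsq(X, weight, rcond=None)
a, b1, b2 = coef
print(f"weight ≈ {a:.1f} + {b1:.2f} × height + {b2:.2f} × age")
```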
Linear regression

X-axis      | Y-axis
independent | dependent
predictor   | predicted
carrier     | response
input       | output
Linear regression
y = a + b × x
- y is the predicted (dependent) variable;
- a is the intercept (the estimated value of y when x = 0);
- b is the slope (the average change in y for each change of 1 unit in x);
- x is the predictor (independent variable).
Linear regression The association looks like it could be described by a straight line. There are many straight lines that could be drawn through the data. How to choose among them?
Least squares method
Residual: the difference between the actual Y value and the Y value predicted by the model.
The least squares method finds the values of the slope and intercept that minimize the sum of the squares of the residuals.
The coefficient of determination r² gives information about the goodness of fit of a model: in regression, r² is a statistical measure of how well the regression line approximates the real-world data. An r² of 1 indicates that the regression line fits the data perfectly.
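A minimal sketch, assuming scipy and numpy are available: fit a line by least squares with scipy.stats.linregress, then inspect the residuals and r². The height/weight data are simulated for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
height = rng.normal(170, 10, 30)                     # hypothetical data
weight = -40 + 0.65 * height + rng.normal(0, 5, 30)

fit = stats.linregress(height, weight)               # least squares fit
predicted = fit.intercept + fit.slope * height
residuals = weight - predicted                       # actual - predicted

print(f"y = {fit.intercept:.1f} + {fit.slope:.2f} × x")
print(f"r^2 = {fit.rvalue**2:.2f}")                  # goodness of fit
print(f"sum of squared residuals = {np.sum(residuals**2):.1f}")
```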
Mistakes
[Cartoon omitted. Source: xkcd.com]