Statistical Data Analysis 2011/2012 M. de Gunst Lecture 7
Statistical Data Analysis: Introduction
Topics:
- Summarizing data
- Exploring distributions
- Bootstrap
- Robust methods
- Nonparametric tests (continued)
- Analysis of categorical data
- Multiple linear regression
Today's topic: Nonparametric methods for two sample problems (Chapter 6: 6.3, 6.4)
6. Nonparametric methods (continued)
6.1. One sample: two nonparametric tests for location
6.2. Asymptotic efficiency
6.3. Two samples: nonparametric tests for equality of distributions
- Median test
- Wilcoxon two-sample test
- Kolmogorov-Smirnov two-sample test
- Permutation tests
- Asymptotic efficiency (read yourself)
6.4. Two samples: nonparametric tests for correlation
- Rank correlation test of Spearman
- Rank correlation test of Kendall
- Permutation tests
Nonparametric methods: Introduction-recap
Nonparametric tests:
- No assumption of a parametric family for the underlying distribution of the data
- For problems with a large class of distributions belonging to H0
- Distribution of the test statistic is the same under every distribution that belongs to H0
Why these tests?
- Robust w.r.t. the significance level: level α holds for a large class of distributions
- More efficient than tests with more assumptions when these assumptions do not hold: fewer observations necessary for the same power (Dutch: onderscheidend vermogen)
Two-sample problem: equality of distributions (1)
Example
Data: thromboglobulin data of Raynaud patients without organ defects (x) and of patients with another auto-immune disease (z)
> mean(x) =
> mean(z) =
> median(x) =
> median(z) = 62.5
> sd(x) =
> sd(z) =
> length(x) = 32
> length(z) = 23
Is the distribution of x in some way smaller than that of z?
Two-sample problem: equality of distributions (2)
Example (continued)
Is the distribution of x the same as that of z? How to investigate with a plot?
- Boxplots in one figure
- Better: empirical QQ-plot of x and z
(In)equality of the distributions is not clear (see also Chapter 3), so investigate further with test(s).
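The two plots named on this slide could be produced roughly as follows. This is a minimal sketch: the vectors `x` and `z` are simulated stand-ins (an assumption), since the thromboglobulin values themselves are not reproduced in these notes.

```r
# Hypothetical stand-in data with the same sample sizes as the lecture example.
set.seed(1)
x <- rexp(32, rate = 1 / 60)
z <- rexp(23, rate = 1 / 70)

# Boxplots of both samples in one figure
boxplot(list(x = x, z = z), main = "Boxplots of x and z")

# Empirical QQ-plot: sample quantiles of x against those of z.
# Points close to the dashed line y = x suggest equal distributions;
# a straight line with other slope or intercept suggests a location-scale difference.
qq <- qqplot(x, z, main = "Empirical QQ-plot of x and z")
abline(0, 1, lty = 2)
```

Because the samples have different sizes, `qqplot` interpolates the quantiles of the larger sample so that both coordinates have the length of the smaller one.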
Two-sample problem: equality of distributions (3)
Situation:
- $x_1, \dots, x_m$ realizations of $X_1, \dots, X_m$, independent, unknown distribution $F$
- $y_1, \dots, y_n$ realizations of $Y_1, \dots, Y_n$, independent, unknown distribution $G$
Are $F$ and $G$ the same? Which aspect? Location, spread, general shape, …
Case 1. Paired observations $(X_1, Y_1), \dots, (X_n, Y_n)$, $m = n$
Case 2. Unpaired observations $X_1, \dots, X_m$ and $Y_1, \dots, Y_n$: two independent groups of random variables
Paired samples: equality of distributions
$X_1, \dots, X_n \sim F$; $Y_1, \dots, Y_n \sim G$
Case 1. Paired observations $(X_1, Y_1), \dots, (X_n, Y_n)$
Main interest: difference in location of $F$ and $G$
Consider the differences $Z_i = X_i - Y_i$, $i = 1, \dots, n$ → one sample
Now investigate the location of the distribution of $Z_1, \dots, Z_n$ with one-sample test(s).
Paired samples: 3 one-sample tests
$X_1, \dots, X_n \sim F$; $Y_1, \dots, Y_n \sim G$; $F = G$?
Case 1. Paired observations $(X_1, Y_1), \dots, (X_n, Y_n)$
Test whether the location of the distribution of the differences $Z_i = X_i - Y_i$ equals 0 with one-sample test(s):
i) $Z_i$ normal: t-test
ii) $X_i$ and $Y_i$ dependent, pairs $(X_i, Y_i)$ independent: sign test, Wilcoxon's signed rank test
iii) $X_i$ and $Y_i$ independent, pairs $(X_i, Y_i)$ independent: Wilcoxon's signed rank test, because then under $H_0$ the symmetry of the differences around 0 is automatic
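The three one-sample tests on the differences can be sketched in R as follows. The data are simulated (an assumption for illustration only), and the sign test is carried out with `binom.test`, which is one common way to do it rather than necessarily the route taken in the course.

```r
# Hypothetical paired data: y depends on x within each pair.
set.seed(2)
n <- 20
x <- rnorm(n, mean = 10)
y <- x + rnorm(n, mean = 0.5)
d <- x - y                      # differences: a one-sample problem

# i) t-test on the differences (assumes normally distributed differences);
#    equivalent to t.test(x, y, paired = TRUE)
p_t <- t.test(d, mu = 0)$p.value

# ii) sign test: under H0 the number of positive differences is Bin(n, 1/2)
p_sign <- binom.test(sum(d > 0), n = sum(d != 0), p = 0.5)$p.value

# iii) Wilcoxon signed rank test (assumes differences symmetric around 0 under H0)
p_w <- wilcox.test(d, mu = 0)$p.value
```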
Unpaired samples: equality of distributions
$X_1, \dots, X_m \sim F$; $Y_1, \dots, Y_n \sim G$
Case 2. Unpaired observations $X_1, \dots, X_m$ and $Y_1, \dots, Y_n$: two independent groups of random variables
$m$ and $n$ may be different
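Unpaired samples: t-test
Two-sample t-test
Assumptions: $F$ normal, mean $\mu$; $G$ normal, mean $\nu$; equal variances
Test statistic: $T = \dfrac{\bar X_m - \bar Y_n}{S_{X,Y}\sqrt{1/m + 1/n}} \sim t_{m+n-2}$ under $H_0: \mu = \nu$, where $S_{X,Y}^2 = \dfrac{(m-1)S_X^2 + (n-1)S_Y^2}{m+n-2}$ is the pooled sample variance
If the variances are not equal: adjusted denominator
Note: the unequal-variance version is the default in R (var.equal = FALSE): ?t.test
t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)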
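To make the pooled-variance statistic concrete, the sketch below computes it by hand on simulated data (hypothetical values) and compares it with `t.test`. It also shows that the equal-variance version has to be requested explicitly.

```r
# Hypothetical normal samples of different sizes.
set.seed(3)
x <- rnorm(12, mean = 0, sd = 1)
y <- rnorm(15, mean = 0.5, sd = 1)
m <- length(x); n <- length(y)

# Pooled variance and two-sample t statistic, written out by hand
s2_pooled <- ((m - 1) * var(x) + (n - 1) * var(y)) / (m + n - 2)
t_manual <- (mean(x) - mean(y)) / sqrt(s2_pooled * (1 / m + 1 / n))

# Equal-variance test: reference distribution t with m + n - 2 degrees of freedom
t_pooled <- t.test(x, y, var.equal = TRUE)

# R's default (var.equal = FALSE): adjusted denominator and degrees of freedom
t_welch <- t.test(x, y)
```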
Unpaired samples: median test
Median test (nonparametric)
Assumptions: $F$, $G$ continuous distributions
Test statistic: $T = \#\{i : X_i \le \mathrm{med}(X_1, \dots, X_m, Y_1, \dots, Y_n)\} \sim \mathrm{Hyp}(m+n, m, p)$ under $H_0$, with $p$ = "half" of the total number of observations
Suited (efficient) for shift alternatives: $H_1: G(\cdot) = F(\cdot - \theta)$
Does not use much information from the data
(NB. This is not Mood's median test)
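A minimal sketch of such a median test in R is given below. It assumes that the statistic counts the observations of the first sample at or below the median of the combined sample, and it returns a right-sided p-value from the hypergeometric distribution. The function name `median_test` and the toy data are hypothetical.

```r
# Median test sketch: count how many x-values fall in the lower "half"
# of the combined sample and compare with the hypergeometric distribution.
median_test <- function(x, y) {
  combined <- c(x, y)
  med <- median(combined)
  k <- sum(combined <= med)        # size of the lower "half"
  t_obs <- sum(x <= med)           # number of x-values in the lower half
  # Right-sided p-value P(T >= t_obs); phyper(q, m, n, k) gives P(T <= q)
  p_right <- 1 - phyper(t_obs - 1, length(x), length(y), k)
  list(statistic = t_obs, k = k, p.value = p_right)
}

# Toy data in which all x-values lie below all y-values
res <- median_test(c(1, 2, 3, 4, 5), c(6, 7, 8, 9, 10, 11))
```

A large value of the statistic points towards the first sample being stochastically smaller, which is why the right tail is used here.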
Unpaired samples: Wilcoxon two-sample test (1)
Wilcoxon two-sample test, also called Wilcoxon rank sum test or Mann-Whitney test (nonparametric)
Assumptions: $F$, $G$ continuous distribution functions
Test statistic: $W = \sum_{j=1}^{n} R_j$, with $R_1, \dots, R_n$ the ranks of $Y_1, \dots, Y_n$ in the combined sample of size $N = m+n$
With ties or for large $n$, $m$: normal approximation
Especially suited (efficient) for shift alternatives: $H_1: G(\cdot) = F(\cdot - \theta)$
Uses more information from the data
Unpaired samples: Wilcoxon two-sample test (2)
Wilcoxon rank sum test
Assumptions: $F$, $G$ continuous distributions
Test statistic: $W = \sum_{j=1}^{n} R_j$, with $R_1, \dots, R_n$ the ranks of $Y_1, \dots, Y_n$ in the combined sample of size $N = m+n$
Equivalent test statistics used under the same name:
- $U = W - \tfrac{1}{2} n(n+1)$
- with switched roles of first and second sample: $\tilde U = \tilde W - \tfrac{1}{2} m(m+1)$, where $\tilde W$ = sum of ranks of the first sample; $\tilde U$ is used by R
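The relation between the rank sums and the statistic that `wilcox.test` reports can be checked numerically. The sketch below uses simulated data without ties (hypothetical values); the variable names are chosen for this illustration only.

```r
# Hypothetical samples without ties.
set.seed(4)
x <- rnorm(8); y <- rnorm(10, mean = 0.5)
m <- length(x); n <- length(y)

r <- rank(c(x, y))                 # ranks in the combined sample
W_x <- sum(r[1:m])                 # rank sum of the first sample
W_y <- sum(r[(m + 1):(m + n)])     # rank sum of the second sample

U_x <- W_x - m * (m + 1) / 2       # number of pairs (i, j) with x_i > y_j
U_y <- W_y - n * (n + 1) / 2       # number of pairs (i, j) with y_j > x_i

# R reports the first-sample version under the name "W"
w_R <- unname(wilcox.test(x, y)$statistic)
```

Since every pair is counted in exactly one of the two statistics when there are no ties, `U_x + U_y` equals `m * n`, so either statistic determines the other.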
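Unpaired samples: Kolmogorov-Smirnov test
Kolmogorov-Smirnov two-sample test (nonparametric)
Assumptions: $F$, $G$ have continuous distribution functions
Test statistic: $D_{m,n} = \sup_t |\hat F_m(t) - \hat G_n(t)|$, with $\hat F_m$ and $\hat G_n$ the empirical distribution functions of the two samples; it depends on the data only through the ranks of the observations in the combined sample of size $N = m+n$
Especially suited (efficient) for general alternatives: $F$ and $G$ need not have the same shape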
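As an illustration of the two-sample Kolmogorov-Smirnov statistic, the sketch below computes the largest distance between the two empirical distribution functions by hand and compares it with `ks.test`. The data are simulated (an assumption), with equal location but different spread.

```r
# Hypothetical samples: same location, different spread.
set.seed(5)
x <- rnorm(30); y <- rnorm(25, sd = 2)

Fm <- ecdf(x)                      # empirical distribution function of x
Gn <- ecdf(y)                      # empirical distribution function of y

# Both functions are step functions that jump only at data points,
# so the supremum of |Fm - Gn| is attained at one of the observations.
grid <- sort(c(x, y))
D_manual <- max(abs(Fm(grid) - Gn(grid)))

ks <- ks.test(x, y)                # two-sided two-sample test
```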
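Unpaired samples: permutation tests
Permutation tests for unpaired data (nonparametric)
Assumptions: $F$, $G$ have continuous distribution functions
Test statistic: $T = T(X_1, \dots, X_m, Y_1, \dots, Y_n)$ that gives information about the difference between $F$ and $G$, e.g. $\mathrm{Med}(X)_m - \mathrm{Med}(Y)_n$, etc.
Test conditionally on the ordered combined sample $Z_{(1)} \le \dots \le Z_{(N)}$, $N = m+n$: the (right-sided) p-value is the fraction of the $N!$ permutations $\pi$ of $1, \dots, N$ for which $T(Z_{\pi(1)}, \dots, Z_{\pi(N)}) \ge t$, with $t$ the observed value of $T$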
Unpaired samples: illustration (1)
Example
Data: thromboglobulin data of Raynaud patients without organ defects (x) and of patients with another auto-immune disease (z)
F distribution of data x; G distribution of data z
H0: F = G
H1: F stochastically smaller than G (what does this mean??)
Note: for the different tests this H1 becomes in R:
- t.test: difference in expectations is less than 0
  > t.test(x, z, alternative="less")
- median test: difference in location less than 0
  # compute yourself with 1-phyper(18, 32, 23, 56/2)
  # check where the numbers come from!!
- Mann-Whitney/Wilcoxon: difference in location less than 0
  > wilcox.test(x, z, alternative="less")
- Kolmogorov-Smirnov: CDF of x lies above that of z
  > ks.test(x, z, alternative="greater")
Unpaired samples: illustration (2)
Example (continued)
F distribution of data x; G distribution of data z
H0: F = G
H1: F stochastically smaller than G
p-values:
- t.test: 0.12
- median test: 0.11
- Mann-Whitney/Wilcoxon: 0.20 (normal approximation was used, due to ties)
- Kolmogorov-Smirnov: 0.31 (R-warning)
H0 not rejected for these tests.
Note: we have performed all tests, but
- t.test is not a good candidate, because the data are not likely to be normal based on the plots;
- whether the distributions have the same shape, i.e. whether a shift alternative is a good choice, is not clear: the shapes look similar, but the sd's are quite different. If it is, then the median and Mann-Whitney tests are good in terms of power;
- Kolmogorov-Smirnov is a good test also for general types of alternatives. There are ties here and R does not know how to adjust for this, so consider the p-value as an approximation.
Paired samples: correlation
$X_1, \dots, X_n \sim F$; $Y_1, \dots, Y_n \sim G$
Only for paired observations $(X_1, Y_1), \dots, (X_n, Y_n)$: are $X_i$ and $Y_i$ correlated?
How to start the investigation? Make a scatter plot
Measures of correlation?
- (Pearson's) sample correlation
- Kendall's rank correlation
- Spearman's rank correlation
Can all be used for testing
Paired samples: tests for correlation
Only for paired observations $(X_i, Y_i)$: $X_i \sim F$, $Y_i \sim G$ for all $i$
(Pearson's) (linear) correlation test
- Assumptions: $F$ normal, $G$ normal
- Test statistic: $T = r\sqrt{\dfrac{n-2}{1-r^2}} \sim t_{n-2}$ under $H_0$, with $r$ the sample correlation
Kendall's rank correlation test, Spearman's rank correlation test
- Assumptions: $F$, $G$ continuous distribution functions
- Test statistics: Kendall's $\tau$ and Spearman's $r_s$, resp.
- Both based on ranks: nonparametric
R: cor.test(x, y, method = "pearson" / "kendall" / "spearman", …)
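The three correlation measures can be computed by hand and compared with R's built-in functions. The sketch below uses simulated paired data without ties (hypothetical values). The Kendall formula used is the standard one for data without ties and may differ in notation from the course text.

```r
# Hypothetical paired data with positive dependence.
set.seed(6)
n <- 15
x <- rnorm(n); y <- 0.5 * x + rnorm(n)

# Spearman: Pearson correlation of the ranks
r_s <- cor(rank(x), rank(y))

# Kendall: (concordant pairs - discordant pairs) / number of pairs
s <- sign(outer(x, x, "-")) * sign(outer(y, y, "-"))
tau <- sum(s[upper.tri(s)]) / choose(n, 2)

# Pearson test statistic, t-distributed with n - 2 df under H0 for normal data
r <- cor(x, y)
t_pearson <- r * sqrt((n - 2) / (1 - r^2))
```

The corresponding tests are obtained with `cor.test(x, y, method = ...)` for each of the three methods.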
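Paired samples: permutation test for correlation
Only for paired observations $(X_i, Y_i)$: $X_i \sim F$, $Y_i \sim G$ for all $i$
Permutation tests (nonparametric)
Assumptions: $F$, $G$ continuous distribution functions
Test statistic: $T = T((X_1, Y_1), \dots, (X_n, Y_n))$ that gives information about the dependence between $X_i$ and $Y_i$, e.g. Kendall's $\tau$, Spearman's $r_s$
Test conditionally on the first sample $X_1, \dots, X_n$ and the ordered second sample $Y_{(1)} \le \dots \le Y_{(n)}$: the (right-sided) p-value is the fraction of the $n!$ permutations $\pi$ of $1, \dots, n$ for which $T((X_1, Y_{\pi(1)}), \dots, (X_n, Y_{\pi(n)})) \ge t$, with $t$ the observed value of $T$
Conditional, so the results differ from those of the former tests with the same statistics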
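Permutation tests (1)
1. Unpaired observations $X_1, \dots, X_m$ and $Y_1, \dots, Y_n$
2. Paired observations $(X_1, Y_1), \dots, (X_n, Y_n)$, $m = n$
Bootstrap if the p-value is not computable exactly: generate a large number $B$ of randomly chosen permutations $\pi_1, \dots, \pi_B$, and approximate the (right-sided) p-value by a fraction:
1. Replace $\frac{1}{N!}\,\#\{\pi : T(Z_{\pi(1)}, \dots, Z_{\pi(N)}) \ge t\}$ by $\frac{1}{B}\,\#\{b : T(Z_{\pi_b(1)}, \dots, Z_{\pi_b(N)}) \ge t\}$, where $(Z_1, \dots, Z_N) = (X_1, \dots, X_m, Y_1, \dots, Y_n)$ and $N = m+n$
2. Replace $\frac{1}{n!}\,\#\{\pi : T((X_1, Y_{\pi(1)}), \dots, (X_n, Y_{\pi(n)})) \ge t\}$ by $\frac{1}{B}\,\#\{b : T((X_1, Y_{\pi_b(1)}), \dots, (X_n, Y_{\pi_b(n)})) \ge t\}$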
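For very small samples the exact permutation p-value can still be enumerated, which makes it possible to see how good the approximation with B random permutations is. The sketch below does this for unpaired data with a difference in means as statistic. The data values are hypothetical, and the enumeration runs over all divisions of the combined sample rather than over all permutations, which gives the same fraction because the statistic does not depend on the order within each sample.

```r
# Hypothetical small samples, so that exact enumeration is feasible.
set.seed(7)
x <- c(2.9, 4.1, 3.8, 5.0)
y <- c(1.2, 3.4, 2.2, 0.7, 2.5)
z <- c(x, y); m <- length(x); N <- length(z)

T_stat <- function(a, b) mean(a) - mean(b)
t_obs <- T_stat(x, y)

# Exact: every division of the combined sample into groups of size m and N - m
idx <- combn(N, m)
t_all <- apply(idx, 2, function(i) T_stat(z[i], z[-i]))
p_exact <- mean(t_all >= t_obs)          # right-sided p-value

# Approximation: B randomly chosen permutations of the combined sample
B <- 2000
t_rand <- replicate(B, {
  zp <- sample(z)
  T_stat(zp[1:m], zp[-(1:m)])
})
p_approx <- mean(t_rand >= t_obs)
```

The approximation varies from run to run, with a spread that shrinks as B grows.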
Permutation tests (2)
1. Unpaired observations $X_1, \dots, X_m$ and $Y_1, \dots, Y_n$
2. Paired observations $(X_1, Y_1), \dots, (X_n, Y_n)$, $m = n$
How to permute in both cases? Instead of a permutation $\pi$ of $1, \dots, m+n$ and of $1, \dots, n$, resp., it is easier to permute the data:
1. Permute $(X_1, \dots, X_m, Y_1, \dots, Y_n)$ and make a new division into two samples of $m$ and $n$ observations.
2. Permute $(Y_1, \dots, Y_n)$, leave $(X_1, \dots, X_n)$ as it is, and make new pairs.
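For the paired case, permuting only the second sample can be sketched as follows. The data are simulated (hypothetical), Spearman's rank correlation is chosen as the statistic for illustration, and the p-value is right-sided.

```r
# Hypothetical paired data with positive dependence.
set.seed(8)
n <- 25
x <- rnorm(n); y <- 0.6 * x + rnorm(n)

T_stat <- function(a, b) cor(a, b, method = "spearman")
t_obs <- T_stat(x, y)

B <- 1000
# Permute y only; x stays as it is, so every permutation forms new pairs
t_perm <- replicate(B, T_stat(x, sample(y)))

# Right-sided p-value: fraction of permuted statistics at least as large as observed
p_right <- mean(t_perm >= t_obs)
```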
Permutation test: illustration for unpaired samples (1)
Example
Data: thromboglobulin data of Raynaud patients without organ defects (x) and of patients with another auto-immune disease (z)
F distribution of data x; G distribution of data z
H0: F = G
H1: F stochastically smaller than G
If interested in specific characteristics: permutation test
Unpaired data: conditional test, given the sorted values of the combined sample:
- Choose T; suppose H0 is rejected for small (large) values of T
- Then B times, do:
  - generate a random permutation of $(x_1, \dots, x_{32}, z_1, \dots, z_{23})$ and make a new division into two samples of 32 and 23 observations with the R function `sample`
  - xperm = first 32 elements of the permuted data
  - zperm = last 23 elements of the permuted data
  - determine T(xperm, zperm)
- Count the fraction of the B values T(xperm, zperm) smaller (larger) than T(x, z) of the thrombo data: this is the p-value
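The procedure above can be sketched as a small R function. The name `perm_test`, its arguments, and the simulated data are hypothetical; the function returns both a left and a right p-value and uses non-strict inequalities, which is one possible convention.

```r
# Permutation test for unpaired samples, for a user-supplied statistic T_stat.
perm_test <- function(x, z, T_stat, B = 1000) {
  t_obs <- T_stat(x, z)
  combined <- c(x, z)
  m <- length(x); N <- length(combined)
  t_perm <- replicate(B, {
    p <- sample(combined)                 # random permutation of the data
    T_stat(p[1:m], p[(m + 1):N])          # xperm = first m, zperm = last N - m
  })
  c(left = mean(t_perm <= t_obs), right = mean(t_perm >= t_obs))
}

# Hypothetical stand-in data with the sample sizes of the lecture example
set.seed(9)
x <- rexp(32, rate = 1 / 60)
z <- rexp(23, rate = 1 / 75)

p_mean <- perm_test(x, z, function(a, b) mean(a) - mean(b))
p_med  <- perm_test(x, z, function(a, b) median(a) - median(b))
p_sd   <- perm_test(x, z, function(a, b) sd(a) - sd(b))
```

For the alternative that the first distribution is stochastically smaller, the left p-value is the relevant one for the difference in means or medians. Repeating the call gives slightly different values, because each run draws new random permutations.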
Permutation test: illustration for unpaired samples (2)
Example (continued)
F distribution of data x; G distribution of data z
H0: F = G
H1: F stochastically smaller than G
Results of permutation tests:
- Permutation test for difference in mean: T = mean(X) - mean(Y)
  Left p-value (bootstrap approximation; several runs): 0.107, 0.091, 0.114, …
- Permutation test for difference in median: T = median(X) - median(Y)
  Left p-value (bootstrap approximation; several runs): 0.163, 0.19, 0.18, …
- Permutation test for Mann-Whitney: T = U-tilde
  Left p-value (bootstrap approximation; several runs): 0.195, 0.214, 0.201, … (around the value for the unconditional Mann-Whitney test)
- Permutation test for difference in sd: T = sd(X) - sd(Y)
  Left p-value (bootstrap approximation; several runs): 0.043, 0.053, 0.059, …
Recap
6. Nonparametric methods (continued)
6.3. Two samples: nonparametric tests for equality of distributions
- Median test
- Wilcoxon two-sample test
- Kolmogorov-Smirnov two-sample test
- Permutation tests
- Asymptotic efficiency (read yourself)
6.4. Two samples: nonparametric tests for correlation
- Rank correlation test of Spearman
- Rank correlation test of Kendall
- Permutation tests
Nonparametric methods for two sample problems
The end