Categorical and discrete data. Non-parametric tests
Non-parametric tests: estimate differences between samples when assuming a known distribution shape does not help, or may even mislead
Metrics of arbitrary distributions
Median: the value that "splits the sample in half"
Mode: the value that occurs with the greatest frequency
n-th percentile: the value below which n% of the sorted sample lies. Hence quantiles, quartiles, deciles, etc.
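A minimal sketch of these metrics using Python's standard `statistics` module. The sample values are hypothetical, chosen only for illustration, and `statistics.quantiles` uses its default ("exclusive") interpolation method, which is one of several common conventions.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
sample = [7, 3, 9, 2, 3, 8, 5]

median = statistics.median(sample)             # middle value of the sorted sample
mode = statistics.mode(sample)                 # most frequent value
quartiles = statistics.quantiles(sample, n=4)  # three cut points: Q1, Q2, Q3

print("median:", median)
print("mode:", mode)
print("quartiles:", quartiles)
```

Passing `n=10` or `n=100` to `statistics.quantiles` would give deciles or percentiles in the same way.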
Non-parametric correlations
Spearman R: the closest analog of Pearson r, with the difference that, instead of each raw value, its rank in the sample is used
X – Y    X – Z
5 – 6    5 – 8
1 – 1    1 – 6
4 – 3    4 – 3
6 – 5    6 – 2
7 – 8    7 – 4
2 – 2    2 – 5
8 – 7    8 – 1
3 – 4    3 – 7
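As a rough illustration, Spearman R can be computed as the Pearson correlation of the ranks. The sketch below applies this to the X, Y and Z columns from the slide; the helper names `average_ranks`, `pearson` and `spearman` are made up for this example and are not from any library.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    sorted_vals = sorted(values)
    return [sorted_vals.index(v) + (sorted_vals.count(v) + 1) / 2 for v in values]

def pearson(a, b):
    """Plain Pearson product-moment correlation."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def spearman(a, b):
    """Spearman R: Pearson r computed on ranks instead of raw values."""
    return pearson(average_ranks(a), average_ranks(b))

# Data pairs from the slide.
X = [5, 1, 4, 6, 7, 2, 8, 3]
Y = [6, 1, 3, 5, 8, 2, 7, 4]
Z = [8, 6, 3, 2, 4, 5, 1, 7]

r_xy = spearman(X, Y)
r_xz = spearman(X, Z)
print(f"Spearman R (X, Y) = {r_xy:.3f}")  # strongly positive
print(f"Spearman R (X, Z) = {r_xz:.3f}")  # moderately negative
```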
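Non-parametric correlations
Spearman R: interpreted like Pearson r, in terms of variability: (X and Y vary synchronously) / (X and Y vary in total), computed on ranks
Kendall Tau: a difference of probabilities: P(a pair of observations is ordered the same way in X and Y) - P(the pair is ordered in opposite ways)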
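The probability view of Kendall Tau can be sketched by counting concordant and discordant pairs directly. This is the simple tau-a variant, which assumes no ties; the function name `kendall_tau_a` is hypothetical, and the data are the X, Y, Z columns of the Spearman example.

```python
from itertools import combinations

def kendall_tau_a(a, b):
    """Tau-a: (concordant - discordant) / total number of pairs (no tie correction)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(a, b), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1   # the pair is ordered the same way in both variables
        elif direction < 0:
            discordant += 1   # the pair is ordered in opposite ways
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs

X = [5, 1, 4, 6, 7, 2, 8, 3]
Y = [6, 1, 3, 5, 8, 2, 7, 4]
Z = [8, 6, 3, 2, 4, 5, 1, 7]

tau_xy = kendall_tau_a(X, Y)
tau_xz = kendall_tau_a(X, Z)
print(f"Kendall tau (X, Y) = {tau_xy:.3f}")
print(f"Kendall tau (X, Z) = {tau_xz:.3f}")
```

On the same data, Tau is usually smaller in absolute value than Spearman R, because the two coefficients are measured on different scales.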
Comparing two independent samples
The Mann-Whitney U test: the closest analog of the t-test, with the difference that, instead of each raw value, its rank in the pooled sample is used
Wald-Wolfowitz runs test: tests whether two samples differ in any respect, that is, in BOTH means and distribution shapes
– – + – – – – – – – + – – + + – – + + – – + – + – + + – – + + – + – – + + – + + – –
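A minimal sketch of the two test statistics, without p-values. The samples `a` and `b` are hypothetical, and the helper names are made up for this example. U is derived from the rank sum of one sample in the pooled data; the runs statistic counts how often the sample label changes along the sorted pooled data.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    sorted_vals = sorted(values)
    return [sorted_vals.index(v) + (sorted_vals.count(v) + 1) / 2 for v in values]

def mann_whitney_u(a, b):
    """Return (U_a, U_b) computed from the rank sum of sample a in the pooled data."""
    pooled_ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(pooled_ranks[:len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    u_b = len(a) * len(b) - u_a
    return u_a, u_b

def count_runs(a, b):
    """Number of runs of identical sample labels in the sorted pooled data.

    Ties between the two samples make the ordering ambiguous; this sketch
    simply ignores that complication.
    """
    labelled = sorted([(v, "a") for v in a] + [(v, "b") for v in b])
    labels = [label for _, label in labelled]
    return 1 + sum(1 for p, q in zip(labels, labels[1:]) if p != q)

# Hypothetical samples, for illustration only.
a = [1.1, 2.3, 3.0, 4.8]
b = [2.9, 5.1, 6.2, 7.7, 8.4]

u_a, u_b = mann_whitney_u(a, b)
runs = count_runs(a, b)
print("U_a =", u_a, "U_b =", u_b)  # the smaller U is compared with a critical value
print("runs =", runs)              # few runs suggest that the samples differ
```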
Comparing two dependent samples
Wilcoxon matched pairs test: the difference between two paired samples is significant if the sum of the ranks of the pairwise differences of one sign (either positive or negative) is TOO BIG compared with the other
Sign test:
>><<><<>>><<><<>
<<><<<<><<><<<<>
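The sign test can be sketched as a two-sided binomial test on the counts of ">" and "<". The two sign sequences below are the ones shown on the slide; the function name `sign_test_p` is hypothetical.

```python
from math import comb

def sign_test_p(signs):
    """Two-sided sign test: binomial probability of a split at least this uneven."""
    n_greater = signs.count(">")
    n_less = signs.count("<")
    n = n_greater + n_less
    k = min(n_greater, n_less)
    # Probability of k or fewer signs of the rarer kind when both are equally likely.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

balanced = ">><<><<>>><<><<>"
skewed = "<<><<<<><<><<<<>"

p_balanced = sign_test_p(balanced)
p_skewed = sign_test_p(skewed)
print(f"balanced sequence: p = {p_balanced:.3f}")
print(f"skewed sequence:   p = {p_skewed:.3f}")
```

The first sequence has equal numbers of both signs, so it gives no evidence of a difference. The second has far fewer ">" than "<", which yields a much smaller p-value.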
2 x 2 tables
          Smoke   Do not smoke
Female    12      26
Male      34      25
Chi-square test, Fisher's exact test: χ² = Σ (Observed − Expected)² / Expected
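A small worked example of the chi-square formula on the smoking table above. It assumes the usual expected count (row total × column total / grand total) and applies no continuity correction; the p-value uses the fact that a chi-square variable with 1 degree of freedom is a squared standard normal.

```python
from math import erfc, sqrt

# Observed counts from the slide: rows = Female, Male; columns = Smoke, Do not smoke.
observed = [[12, 26],
            [34, 25]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count under independence of sex and smoking.
        expected = row_totals[i] * col_totals[j] / total
        chi2 += (obs - expected) ** 2 / expected

# Upper tail of the chi-square distribution with 1 degree of freedom.
p_value = erfc(sqrt(chi2 / 2))

print(f"chi-square = {chi2:.3f}, p = {p_value:.4f}")
```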
Nonparametric methods are most appropriate when the sample sizes are small. When the data set is large (e.g., n > 100), it often makes little sense to use nonparametric statistics at all: as the samples become very large, the sample means follow the normal distribution even if the respective variable is not normally distributed in the population.
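The last point can be illustrated with a small simulation. The sketch below draws repeated samples from a strongly skewed (exponential) population and measures the skewness of the resulting sample means. The population, sample sizes, replicate count and seed are arbitrary choices made for this illustration.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: mean cubed z-score (close to 0 for a symmetric distribution)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def sample_means(sample_size, replicates, rng):
    """Means of repeated samples drawn from an exponential population."""
    means = []
    for _ in range(replicates):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(sum(sample) / sample_size)
    return means

rng = random.Random(0)  # fixed seed so the run is reproducible

skew_small = skewness(sample_means(5, 2000, rng))
skew_large = skewness(sample_means(200, 2000, rng))

print(f"skewness of sample means, n = 5:   {skew_small:.2f}")
print(f"skewness of sample means, n = 200: {skew_large:.2f}")
```

With the larger sample size, the distribution of the means should be much closer to symmetric, which is what makes parametric tests on means reasonable for large samples.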