BIOL 582 Lecture Set 17 Analysis of frequency and categorical data Part II: Goodness of Fit Tests for Continuous Frequency Distributions; Tests of Independence.


1 BIOL 582 Lecture Set 17 Analysis of frequency and categorical data Part II: Goodness of Fit Tests for Continuous Frequency Distributions; Tests of Independence

2 BIOL 582 Expansion of Goodness of Fit Tests
The first two examples each involved one frequency distribution and some known or true expectation, and both used categorical data. There are two different ways we can (and will) go:
1. Goodness of fit tests for continuous frequency data
2. Goodness of fit tests for more than one distribution
We have to start with one of these, so let's start with 1.
Before proceeding, it is important to distinguish two kinds of hypotheses that are used as the "null" model for frequency expectations. In the previous two examples, the expected frequencies were established by theory (expected genotypes) or by a larger empirical pool of information (species proportions). These are extrinsic hypotheses for the basis of expected frequencies.
Intrinsic hypotheses can also be used for estimating frequencies. For example, if we wish to test whether a continuous frequency distribution is normal, we can generate expected frequencies, but we first need to estimate the mean and variance from the sample. The degrees of freedom for the test are therefore reduced by one for each of these additional parameter estimates.

3 BIOL 582 Goodness of fit: continuous frequency distributions with intrinsic expected frequencies
We have done these types of tests nearly all semester! Kolmogorov-Smirnov and Shapiro-Wilk are such tests.
An old-fashioned way to do GOF tests was to break the continuous data into classes, find the expected frequency of each class, and proceed as we just learned. For example, we could compare the heights of the columns (observed) to the points on the red line at the center of each column (expected) to test whether the continuous frequency distribution for this test statistic is normal.
Figure: histogram with a fitted normal curve. Red notches indicate the height of the curve at the centers of the bins, which indicate the expected frequencies/densities.

4 BIOL 582 Goodness of fit: continuous frequency distributions with intrinsic expected frequencies
The binned method is no longer considered appropriate, because changing the number of columns can change the outcome.
We will use the K-S test as a standard, non-parametric GOF test between one distribution and either an intrinsic or extrinsic expectation of its frequency.
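
The sensitivity of the binned approach to the number of classes can be seen in a minimal sketch. It assumes simulated normal data rather than any data set from the lecture; the function name `binned_gof`, the seed, and the choices of 5 and 8 classes are illustrative only. Because the mean and standard deviation are estimated from the sample (an intrinsic hypothesis), the degrees of freedom are the number of classes minus 1, minus 2 more for the estimated parameters.

```r
# Binned chi-square goodness of fit to a normal distribution (illustrative sketch)
set.seed(1)                          # arbitrary seed for reproducibility
x <- rnorm(100, mean = 10, sd = 2)   # simulated data (assumption)

binned_gof <- function(x, k) {
  # k equal-width classes spanning the range of the data
  breaks <- seq(min(x), max(x), length.out = k + 1)
  obs <- as.vector(table(cut(x, breaks, include.lowest = TRUE)))
  # Expected class probabilities from the fitted normal; the outer classes are
  # extended to -Inf/Inf so that the probabilities sum to 1
  cuts <- breaks
  cuts[1] <- -Inf
  cuts[k + 1] <- Inf
  p <- diff(pnorm(cuts, mean = mean(x), sd = sd(x)))
  expct <- length(x) * p
  X2 <- sum((obs - expct)^2 / expct)
  df <- k - 1 - 2                    # two parameters estimated from the sample
  c(X2 = X2, df = df, p = pchisq(X2, df, lower.tail = FALSE))
}

binned_gof(x, 5)   # same data, 5 classes
binned_gof(x, 8)   # same data, 8 classes: statistic, df, and P-value all change
```

Comparing the two calls shows how the conclusion can depend on an arbitrary binning choice, which is the motivation for using the K-S test instead.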

5 BIOL 582 Goodness of fit: continuous frequency distributions with intrinsic expected frequencies: Kolmogorov-Smirnov
What is the K-S test in a nutshell?
- The K-S test orders data from lowest to highest.
- The observed "cumulative relative frequency" distribution is calculated by dividing rank by n (i.e., 1/n, 2/n, 3/n, ..., n/n).
- In a stepwise fashion, a cumulative distribution function produces the expected cumulative relative frequency at each of the n steps.
- The difference between observed and expected frequencies is measured at each step.
- The largest absolute (vertical) distance is used as the test statistic.
- This distance is compared to critical values from a Kolmogorov distribution (you can see what that is on your own).
There might be some "adjustments" made along the way to estimate the expected frequencies. For now, assume that the canned function handles these, but note that the documentation for R's ks.test states that the parameters of the reference distribution should be specified in advance rather than estimated from the data, so P-values from intrinsic tests are best read as approximate.

6 BIOL 582 Goodness of fit: continuous frequency distributions with intrinsic expected frequencies: Kolmogorov-Smirnov
A good way to understand K-S is to test the normality of residuals from a previous example. This is an intrinsic test.
> # Residuals from an analysis
> snake <- read.csv("snake.data.csv")
> attach(snake)
> Sex <- as.factor(Sex)
> lm.snake <- lm(HS ~ log(SVL) + Sex)
> r <- resid(lm.snake)
> r <- r/sd(r)   # standardize residuals: divide by the standard deviation, not the variance
> r <- sort(r)   # sort residuals from small to large
> n <- length(r)
> # Observed and expected cumulative relative frequencies
> o <- (1:n)/n                             # observed: rank divided by n
> e <- pnorm(r, mean=mean(r), sd=sd(r))    # expected under a fitted normal distribution

7 BIOL 582 Goodness of fit: continuous frequency distributions with intrinsic expected frequencies: Kolmogorov-Smirnov
> # Evaluation
> max(abs(o-e))
[1] 0.1032246
> plot(r, o, ylab="Cumulative relative frequency",
+      xlab="Standardized Residuals",
+      main="Circles = observed; Line = expected")
> points(r, e, type="l")
> # 'pnorm' tells ks.test to use the cumulative area under
> # a curve (p) from a normal distribution
> ks.test(r, 'pnorm', mean(r), sd(r))
        One-sample Kolmogorov-Smirnov test
data:  r
D = 0.1032, p-value = 0.749
alternative hypothesis: two-sided
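
The hand calculation max(abs(o-e)) compares the fitted curve only with the top of each step (rank/n). The two-sided K-S statistic also checks the bottom of each step, (rank-1)/n, so the two values agree only when the largest gap happens to occur at a step top, as it did here. The following sketch shows the full calculation. It assumes simulated data (not the snake residuals), and the object names are illustrative.

```r
# Two-sided K-S statistic computed by hand (illustrative sketch)
set.seed(42)                         # arbitrary seed for reproducibility
x <- sort(rnorm(30))                 # simulated, sorted data (assumption)
n <- length(x)

Fx <- pnorm(x, mean = mean(x), sd = sd(x))   # fitted normal CDF at each point

d_upper <- max((1:n) / n - Fx)         # step top minus the curve
d_lower <- max(Fx - (0:(n - 1)) / n)   # curve minus step bottom
D <- max(d_upper, d_lower)             # two-sided K-S statistic

ks <- ks.test(x, "pnorm", mean(x), sd(x))
c(by_hand = D, ks_test = unname(ks$statistic))
```

The by-hand value D is never smaller than max(abs(o-e)), because it considers both edges of every step.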

8 BIOL 582 Goodness of fit: continuous frequency distributions with intrinsic expected frequencies: Kolmogorov-Smirnov
Let's repeat the test with a different model on the same data, but where the residuals are a little less normal.
> lm.snake <- lm(HS ~ sqrt(SVL+0.5*SVL^2))
> r <- resid(lm.snake)
> r <- r/sd(r)   # standardize residuals: divide by the standard deviation, not the variance
> r <- sort(r)
> n <- length(r)
> # Observed and expected cumulative relative frequencies
> o <- (1:n)/n
> e <- pnorm(r, mean=mean(r), sd=sd(r))
> # Evaluation
> max(abs(o-e))
[1] 0.1616094
> plot(r, o, ylab="Cumulative relative frequency", xlab="Standardized Residuals",
+      main="Circles = observed; Line = expected")
> points(r, e, type="l")
> ks.test(r, 'pnorm', mean(r), sd(r))
        One-sample Kolmogorov-Smirnov test
data:  r
D = 0.1616, p-value = 0.2217
alternative hypothesis: two-sided

9 BIOL 582 Goodness of fit: continuous frequency distributions with intrinsic expected frequencies: Kolmogorov-Smirnov
Let's look at a process that produces one type of data, tested against other distributions.
> # Generate data from a log-normal distribution
> y <- rlnorm(50, meanlog=2.5, sdlog=0.6)
> y <- sort(y)
> r <- (y-mean(y))/sd(y)
> n <- length(r)
> o <- (1:n)/n   # observed cumulative relative frequencies
> # Expected cumulative relative frequencies from three distributions
> e.norm <- pnorm(y, mean=mean(y), sd=sd(y))
> e.poisson <- ppois(y, lambda=mean(y))
> e.log.norm <- plnorm(y, meanlog=mean(log(y)), sdlog=sd(log(y)))
> par(mfrow=c(1,3))
> plot(y, o, main="Compared to Normal", ylab="Cumulative relative frequency")
> points(y, e.norm, type='l')
> plot(y, o, main="Compared to Poisson", ylab="Cumulative relative frequency")
> points(y, e.poisson, type='l')
> plot(y, o, main="Compared to Lognormal", ylab="Cumulative relative frequency")
> points(y, e.log.norm, type='l')

10 BIOL 582 Goodness of fit: continuous frequency distributions with intrinsic expected frequencies: Kolmogorov-Smirnov
> ks.test(y, 'pnorm', mean(y), sd(y))
        One-sample Kolmogorov-Smirnov test
data:  y
D = 0.2198, p-value = 0.01335
alternative hypothesis: two-sided
> ks.test(y, 'ppois', mean(y))
        One-sample Kolmogorov-Smirnov test
data:  y
D = 0.4004, p-value = 9.531e-08
alternative hypothesis: two-sided
> ks.test(y, 'plnorm', mean(log(y)), sd(log(y)))
        One-sample Kolmogorov-Smirnov test
data:  y
D = 0.096, p-value = 0.7096
alternative hypothesis: two-sided

11 BIOL 582 Tests of independence
Now let's consider the case where we have two sets of categorical frequencies and wish to determine whether they have the same distributions of proportional outcomes (irrespective of the sample size).
We have done this already: contingency table analysis.
Contingency tables are often called two-way or multi-way tables because the sample size (n) can be partitioned in two or more ways.
Sokal and Rohlf (2011) also describe and recommend the following:
Model   Frequency totals           Recommended test
I       Not fixed                  G-test for independence*
II      Fixed for one criterion    G-test for independence*
III     Fixed for both criteria    Fisher's Exact Test
* Chi-square tests are also applicable
The following examples are also from Sokal and Rohlf (2011).

12 BIOL 582 Tests of independence
Example Model I (Box 17.6 Sokal and Rohlf 2011)
A plant ecologist samples 100 trees of a rare species in a 400 square-mile area. He records for each tree whether it is rooted in serpentine soil and whether its leaves are pubescent or smooth.
Question: Do trees grown in serpentine soils have different ratios of smooth:pubescent leaves?
H0: Ratios are equal
Soil             Pubescent   Smooth   Total   Ratio
Serpentine       12          22       34      1.833:1
Not Serpentine   16          50       66      3.125:1
Total            28          72       100

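13 BIOL 582 Tests of independence
Example Model I (Box 17.6 Sokal and Rohlf 2011)
Expected values are based on a multinomial distribution. For a two-way table:
                Category 1   Category 2   Total
Sample 1        a            b            a + b
Sample 2        c            d            c + d
Total           a + c        b + d        a + b + c + d
The probability of observing the cell frequencies a, b, c, and d is computed from the multinomial distribution. Via some steps reserved for additional reading, a likelihood ratio L is obtained, and G is -2 ln L.
A computationally easier form, with n = a + b + c + d, is
$$G = 2\,[\,a\ln a + b\ln b + c\ln c + d\ln d - (a+b)\ln(a+b) - (c+d)\ln(c+d) - (a+c)\ln(a+c) - (b+d)\ln(b+d) + n\ln n\,]$$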
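
One common route from the multinomial probability to G is sketched below. The notation is an assumption and may differ from what the original slide displayed: p_ij denotes the cell probabilities, f_i the observed cell frequencies, and f-hat_i the frequencies expected under independence (row total times column total, divided by n).

```latex
\begin{align*}
P(a,b,c,d) &= \frac{n!}{a!\,b!\,c!\,d!}\;
              p_{11}^{\,a}\, p_{12}^{\,b}\, p_{21}^{\,c}\, p_{22}^{\,d},
              \qquad n = a+b+c+d \\[4pt]
L &= \frac{P(\text{data} \mid \text{independence})}
          {P(\text{data} \mid \text{observed proportions})}
   = \prod_i \left(\frac{\hat f_i}{f_i}\right)^{f_i},
   \qquad \hat f_i = \frac{(\text{row total})(\text{column total})}{n} \\[4pt]
G &= -2\ln L = 2\sum_i f_i \ln\frac{f_i}{\hat f_i} \\
  &= 2\Bigl[\sum_{\text{cells}} f\ln f
        - \sum_{\text{row totals}} R\ln R
        - \sum_{\text{column totals}} C\ln C
        + n\ln n\Bigr]
\end{align*}
```

The multinomial coefficient cancels in the ratio, and the last line follows by expanding the logarithm of each expected frequency and collecting terms by row and column totals.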

14 BIOL 582 Tests of independence
Example Model I (Box 17.6 Sokal and Rohlf 2011)
Expected values are based on a multinomial distribution.
Observed:
Soil             Pubescent   Smooth   Total   Ratio
Serpentine       12          22       34      1.833:1
Not Serpentine   16          50       66      3.125:1
Total            28          72       100
G components:
Soil             Pubescent   Smooth     Total
Serpentine       12 ln 12    22 ln 22   34 ln 34
Not Serpentine   16 ln 16    50 ln 50   66 ln 66
Total            28 ln 28    72 ln 72   100 ln 100

15 BIOL 582 Tests of independence
Example Model I (Box 17.6 Sokal and Rohlf 2011): observed and G components (add these); same tables as slide 14.

16 BIOL 582 Tests of independence
Example Model I (Box 17.6 Sokal and Rohlf 2011): observed and G components (then add these); same tables as slide 14.

17 BIOL 582 Tests of independence
Example Model I (Box 17.6 Sokal and Rohlf 2011): observed and G components; same tables as slide 14.

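18 BIOL 582 Tests of independence
Example Model I (Box 17.6 Sokal and Rohlf 2011)
Expected values are based on a multinomial distribution.
Model I two-way tables have type I error rates that are higher than intended. Williams' correction is recommended:
$$q = 1 + \frac{\left(\frac{n}{a+b} + \frac{n}{c+d} - 1\right)\left(\frac{n}{a+c} + \frac{n}{b+d} - 1\right)}{6n}$$
which for the current example is
$$q = 1 + \frac{\left(\frac{100}{34} + \frac{100}{66} - 1\right)\left(\frac{100}{28} + \frac{100}{72} - 1\right)}{600} = 1.02281$$
Thus
$$G_{adj} = G/q = 1.33249/1.02281 = 1.30277$$
The df is equal to (r-1)(c-1) = 1, for two rows and two columns.
The probability of finding a value of 1.30277 or higher from a Chi-square distribution with 1 df is 0.253708; thus, do not reject the null hypothesis of equal ratios. There is no evidence against independence: leaf type appears to be independent of soil type.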
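
The Model I calculation can be checked with a short sketch. The function name `g_test_2x2` is hypothetical (it is not part of base R), and the cell counts used are the ones implied by the table's marginal totals (12, 22, 16, 50). The function applies the computational form of G and the 2 x 2 form of Williams' correction, and it assumes no cell is zero.

```r
# G-test of independence for a 2 x 2 table with Williams' correction (sketch)
g_test_2x2 <- function(tab) {
  stopifnot(all(dim(tab) == c(2, 2)), all(tab > 0))
  n    <- sum(tab)
  rows <- rowSums(tab)
  cols <- colSums(tab)
  xlx  <- function(f) f * log(f)     # the "f ln f" components
  # cells minus row and column totals plus grand total, times 2
  G <- 2 * (sum(xlx(tab)) - sum(xlx(rows)) - sum(xlx(cols)) + xlx(n))
  # Williams' correction factor for a 2 x 2 table
  q <- 1 + (sum(n / rows) - 1) * (sum(n / cols) - 1) / (6 * n)
  G_adj <- G / q
  c(G = G, q = q, G_adj = G_adj,
    p = pchisq(G_adj, df = 1, lower.tail = FALSE))
}

trees <- matrix(c(12, 22,
                  16, 50), nrow = 2, byrow = TRUE,
                dimnames = list(Soil = c("Serpentine", "Not Serpentine"),
                                Leaf = c("Pubescent", "Smooth")))
res <- g_test_2x2(trees)
round(res, 5)
```

The adjusted statistic and P-value from this sketch can be compared directly with the values reported on the slide.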

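19 BIOL 582 Tests of independence
The general formula for the G statistic is, from now on,
$$G = 2\sum_i f_i \ln\frac{f_i}{\hat f_i}$$
where f_i is an observed frequency and \hat f_i is the corresponding expected frequency.
Also, unless otherwise stated, this is the same:
$$G = 2\sum_i f_i \log\frac{f_i}{\hat f_i}$$
When the base is not given, the log is assumed to be natural.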
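
A minimal sketch of the general formula for a table with any number of rows and columns follows. The function names `g_stat` and `g_stat_xlx` and the 2 x 3 table of counts are made up for illustration. Expected frequencies are taken as row total times column total divided by n, and the sketch assumes no zero cells. It also shows that the general form and the computational "f ln f" form give the same value.

```r
# General G statistic: 2 * sum(observed * ln(observed / expected)) (sketch)
g_stat <- function(obs) {
  expct <- outer(rowSums(obs), colSums(obs)) / sum(obs)  # expected under independence
  2 * sum(obs * log(obs / expct))                        # log() is the natural log in R
}

# Computational form, for comparison
g_stat_xlx <- function(obs) {
  xlx <- function(f) f * log(f)
  2 * (sum(xlx(obs)) - sum(xlx(rowSums(obs))) - sum(xlx(colSums(obs))) +
         xlx(sum(obs)))
}

# Hypothetical 2 x 3 table of counts (not from the lecture)
tab <- matrix(c(10, 20, 30,
                15, 15, 10), nrow = 2, byrow = TRUE)

G  <- g_stat(tab)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)
c(G = G, G_xlx = g_stat_xlx(tab), df = df,
  p = pchisq(G, df, lower.tail = FALSE))
```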

20 BIOL 582 Tests of independence
Example Model II (Sokal and Rohlf 2011)
An immunology experiment involved inoculating 111 mice with a pathogenic bacteria; 57 of the mice were also given antiserum. After a sufficient amount of time, the number of dead mice was compared between the two treatments.
This is Model II because the number of mice in each treatment was fixed.
H0: Ratios are equal
Treatment              Dead   Alive   Total   Ratio
Bacteria + Antiserum   13     44      57      0.29545:1
Bacteria only          25     29      54      0.86207:1
Total                  38     73      111

21 BIOL 582 Tests of independence
Example Model II (Sokal and Rohlf 2011)
Observed:
Treatment              Dead   Alive   Total   Ratio
Bacteria + Antiserum   13     44      57      0.29545:1
Bacteria only          25     29      54      0.86207:1
Total                  38     73      111
G components (f ln f):
Treatment              Dead        Alive       Total
Bacteria + Antiserum   33.34434    166.50434   230.45392
Bacteria only          80.47190    97.65158    215.40514
Total                  138.22827   313.20354   522.75785
G = 2[377.97216 - 897.29087 + 522.75785] = 6.87828
G adj = 6.87828/1.01552 = 6.7732
P-value = 0.0093; reject the null hypothesis: the ratios are different.

22 BIOL 582 Tests of independence
Next time… (or next two times)
- Model III and Fisher's Exact Test
- More than 2 rows and columns
- Odds ratios for proportions
- Logistic regression

