Comparing phylogenetic and statistical classification methods for DNA barcoding
Frederic Austerlitz, Olivier David, Brigitte Schaeffer, Sisi Ye, Michel Veuille & Catherine Laredo

Testing the assignment methods
Individuals to be tested:
– X1: ATATGTACCTAGTA
– X2: TTATCTACCTAGAA

Phylogenetic methods
[Figure: phylogenetic tree of the reference individuals 1_1, 1_3, 2_1, 1_2, 2_2 and 2_3, with the test individuals X1 and X2 placed on it]
Unanimity rule: X1 classified as ambiguous, X2 classified as species 1
Majority rule: X1 and X2 classified as belonging to species 1
Two methods tested: neighbor joining and maximum likelihood (PhyML, Guindon and Gascuel 2003)

CART (Classification And Regression Tree)
Builds a classification tree from the reference sample (Breiman et al., 1984, 1996)
Start with all the individuals (node 0) and split to reach: « individuals in each leaf belong to the same species »
Node 0: A: 5, B: 3, C: 8
  Node 1: A: 5, B: 0, C: 4
    Leaf 1: A: 5, B: 0, C: 0
    Leaf 2: A: 0, B: 0, C: 4
  Node 2: A: 0, B: 3, C: 4
    Leaf 3: A: 0, B: 0, C: 4
    Leaf 4: A: 0, B: 3, C: 0

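Computing the impurity of the nodes
node t = subset of individuals
p(j|t) = relative proportion of individuals of class j in node t
Impurity criterion at node t:
– entropy: $I(t) = -\sum_j p(j \mid t) \log p(j \mid t)$
– Gini index: $I(t) = 1 - \sum_j p(j \mid t)^2$
Impurity is maximum when all classes are equally represented in the node and minimum (zero) when the node contains a single class.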
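
The two impurity criteria can be sketched in a few lines of Python. This is only an illustration: the function names `entropy` and `gini` are chosen here, the natural logarithm is assumed, and 0 × log 0 is taken as 0.

```python
import math

def entropy(proportions):
    """Entropy impurity: -sum_j p_j * log(p_j), with 0 * log(0) taken as 0."""
    return sum(-p * math.log(p) for p in proportions if p > 0)

def gini(proportions):
    """Gini index: 1 - sum_j p_j^2."""
    return 1.0 - sum(p * p for p in proportions)

pure = [1.0, 0.0, 0.0]      # a single class in the node
mixed = [1 / 3, 1 / 3, 1 / 3]  # three equally represented classes

print("pure node :", entropy(pure), gini(pure))
print("mixed node:", entropy(mixed), gini(mixed))
```

Both criteria are zero for the pure node and reach their largest value for the evenly mixed node (log 3 for the entropy, 2/3 for the Gini index with three classes).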

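Finding the best split
S: set of variables; t: set of individuals (node)
A split $s \in S$ divides node $t$ into a left node $t_L$ and a right node $t_R$, which receive proportions $P_L$ and $P_R$ of its individuals, with $P_L + P_R = 1$.
Decrease in impurity: $\Delta I(s,t) = I(t) - P_L I(t_L) - P_R I(t_R)$
Rule to select a splitting candidate: $s^*$ selected as $\Delta I(s^*,t) = \max \{ \Delta I(s,t),\ s \in S \}$
Stop splitting rule: e.g. threshold $\beta$: $\max \{ \Delta I(s,t),\ s \in S \} < \beta$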
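
A minimal sketch of the selection and stopping rules, here with the Gini index. The parent counts (A: 5, B: 3, C: 8) and the first candidate split are the class counts shown on the CART slide; the second candidate `split_b`, the function names and the threshold value 0.2 are hypothetical and only serve the illustration.

```python
def gini(counts):
    """Gini index 1 - sum_j p_j^2, computed from the class counts of a node."""
    n = sum(counts)
    if n == 0:
        return 0.0  # empty node treated as pure (assumption of this sketch)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def impurity_decrease(parent, left, right, impurity=gini):
    """Delta I(s, t) = I(t) - P_L * I(t_L) - P_R * I(t_R)."""
    n = sum(parent)
    p_left, p_right = sum(left) / n, sum(right) / n
    return impurity(parent) - p_left * impurity(left) - p_right * impurity(right)

def select_split(parent, candidates, beta=0.0, impurity=gini):
    """Return (name, decrease) of the best candidate, or None if it is below beta."""
    scored = {name: impurity_decrease(parent, left, right, impurity)
              for name, (left, right) in candidates.items()}
    best = max(scored, key=scored.get)
    if scored[best] < beta:
        return None  # stop splitting: no candidate reduces impurity enough
    return best, scored[best]

node0 = (5, 3, 8)  # counts for classes A, B, C
candidates = {
    "split_a": ((5, 0, 4), (0, 3, 4)),  # node 1 / node 2 of the CART slide
    "split_b": ((3, 1, 4), (2, 2, 4)),  # hypothetical, poorly separating split
}

print(select_split(node0, candidates))            # best candidate and its decrease
print(select_split(node0, candidates, beta=0.2))  # None: threshold not reached
```

Passing a different `impurity` function (for example an entropy computed from counts) changes the criterion without touching the selection logic.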

Example: 3 species, 10 individuals, 4 variables
Species  x1  x2  x3  x4
A        a   g   c   g
A        a   a   a   g
B        t   t   a   g
B        t   a   c   g
C        a   c   a   g
C        g   c   a   g
C        a   c   a   g
C        a   c   c   g
Node 0: A: 3/10, B: 3/10, C: 4/10
$I(t) = -\sum_j p(j \mid t) \log p(j \mid t)$ (log is the natural logarithm)
At node 0: $I(\mathrm{node0}) = -\left[\frac{3}{10}\log\frac{3}{10} + \frac{3}{10}\log\frac{3}{10} + \frac{4}{10}\log\frac{4}{10}\right] = 1.0889$

Splitting according to x1
Split: x1 = t (node L) versus the other values (node R); $P_L = 3/10$, $P_R = 7/10$
Node 0: A: 3/10, B: 3/10, C: 4/10
Node L: A: 0, B: 3/3, C: 0
Node R: A: 3/7, B: 0, C: 4/7
At node L: $I_{x_1}(\mathrm{nodeL}) = -\left[0 + \frac{3}{3}\log\frac{3}{3} + 0\right] = 0$
At node R: $I_{x_1}(\mathrm{nodeR}) = -\left[\frac{3}{7}\log\frac{3}{7} + 0 + \frac{4}{7}\log\frac{4}{7}\right] = 0.6829$
$\Delta I(x_1,t) = I(\mathrm{node0}) - P_L I_{x_1}(\mathrm{nodeL}) - P_R I_{x_1}(\mathrm{nodeR}) = 1.0889 - 0.3 \times 0 - 0.7 \times 0.6829 = 0.6109$

Splitting according to x2
Split: x2 = a, g or t (node L) versus x2 = c (node R); $P_L = 6/10$, $P_R = 4/10$
Node 0: A: 3/10, B: 3/10, C: 4/10
Node L: A: 3/6, B: 3/6, C: 0
Node R: A: 0, B: 0, C: 4/4
At node L: $I_{x_2}(\mathrm{nodeL}) = -\left[\frac{3}{6}\log\frac{3}{6} + \frac{3}{6}\log\frac{3}{6} + 0\right] = 0.6931$
At node R: $I_{x_2}(\mathrm{nodeR}) = -\left[0 + 0 + \frac{4}{4}\log\frac{4}{4}\right] = 0$
$\Delta I(x_2,t) = I(\mathrm{node0}) - P_L I_{x_2}(\mathrm{nodeL}) - P_R I_{x_2}(\mathrm{nodeR}) = 1.0889 - 0.6 \times 0.6931 - 0.4 \times 0 = 0.6730$

Splitting according to x3
Split: x3 = a (node L) versus the other values (node R); $P_L = 6/10$, $P_R = 4/10$
Node 0: A: 3/10, B: 3/10, C: 4/10
Node L: A: 1/6, B: 2/6, C: 3/6
Node R: A: 2/4, B: 1/4, C: 1/4
At node L: $I_{x_3}(\mathrm{nodeL}) = -\left[\frac{1}{6}\log\frac{1}{6} + \frac{2}{6}\log\frac{2}{6} + \frac{3}{6}\log\frac{3}{6}\right] = 1.0114$
At node R: $I_{x_3}(\mathrm{nodeR}) = -\left[\frac{2}{4}\log\frac{2}{4} + \frac{1}{4}\log\frac{1}{4} + \frac{1}{4}\log\frac{1}{4}\right] = 1.0397$
$\Delta I(x_3,t) = I(\mathrm{node0}) - P_L I_{x_3}(\mathrm{nodeL}) - P_R I_{x_3}(\mathrm{nodeR}) = 1.0889 - 0.6 \times 1.0114 - 0.4 \times 1.0397 = 0.0662$

Choosing the best split
$\Delta I(x_1,t) = 0.6109$
$\Delta I(x_2,t) = 0.6730$
$\Delta I(x_3,t) = 0.0662$
x4: no division
$\Delta I(s^*,t) = \max_{s \in S} \Delta I(s,t)$, so x2 is selected
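
The worked example can be checked numerically. The sketch below works from the class counts (A, B, C) in each child node rather than from the sequences; for x3 the left node is taken as A: 1, B: 2, C: 3, an assumption made so that the class totals match node 0 (3, 3, 4). Variable and function names are chosen for this illustration.

```python
import math

def entropy(counts):
    """Entropy impurity (natural log) from the class counts of a node."""
    n = sum(counts)
    return sum(-(c / n) * math.log(c / n) for c in counts if c > 0)

def decrease(parent, left, right):
    """Delta I = I(parent) - P_L * I(left) - P_R * I(right)."""
    n = sum(parent)
    return (entropy(parent)
            - sum(left) / n * entropy(left)
            - sum(right) / n * entropy(right))

node0 = (3, 3, 4)  # counts for species A, B, C
splits = {
    "x1": ((0, 3, 0), (3, 0, 4)),  # x1 = t versus the rest
    "x2": ((3, 3, 0), (0, 0, 4)),  # x2 in {a, g, t} versus c
    "x3": ((1, 2, 3), (2, 1, 1)),  # x3 = a versus the rest (assumed counts)
}

gains = {name: decrease(node0, left, right) for name, (left, right) in splits.items()}
best = max(gains, key=gains.get)

print("I(node0) =", round(entropy(node0), 4))
for name, gain in gains.items():
    print(name, round(gain, 4))
print("selected:", best)
```

x4 is left out because it does not divide the node.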

Implementation
Criterion: entropy
Software: R
Package: rpart

Bagging or bootstrap aggregation
– N bootstrap samples from original data
– N classification trees
– N assignments for a new individual
– Majority rule → class of the new individual
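
The bagging steps can be sketched as follows. To keep the example short, the base learner is a nearest-reference classifier on Hamming distance, which is a stand-in and not CART; the reference sequences, species labels and function names are all hypothetical.

```python
import random
from collections import Counter

def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def fit_nearest(sample):
    """Stand-in base learner (NOT CART): assign the species of the closest reference."""
    def predict(seq):
        return min(sample, key=lambda ref: hamming(ref[0], seq))[1]
    return predict

def bagging_predict(reference, new_seq, n_trees=25, fit=fit_nearest, seed=0):
    """Fit one classifier per bootstrap sample and take the majority vote."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_trees):
        # Bootstrap sample: draw len(reference) individuals with replacement.
        boot = [rng.choice(reference) for _ in reference]
        votes[fit(boot)(new_seq)] += 1
    return votes.most_common(1)[0][0], votes

# Hypothetical reference sample: (sequence, species)
reference = [
    ("ATATGTAC", "sp1"), ("ATATGTAC", "sp1"), ("ATATGAAC", "sp1"),
    ("TTCTCTAG", "sp2"), ("TTCTCTAG", "sp2"), ("TTCTCAAG", "sp2"),
]

winner, votes = bagging_predict(reference, "ATATGTAC")
print(winner, dict(votes))
```

Any function that takes a sample and returns a predictor can be passed as `fit`, so a tree-building routine could replace the stand-in without changing the voting step.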

Simulation method
One haploid population splits into two (or more) populations, T generations in the past.
The ancestral population and the new populations have constant size N.
Sequences evolve with mutation rate μ → parameter of interest θ = 2Nμ
Simulations performed with simcoal 2.1.2 (Excoffier et al)

Evaluation of the different classification methods
We simulate n + 1 individuals in each species. n individuals form the reference sample, and the last one is the individual to test. Using repeated simulations, we compute the proportion of cases in which each test individual is correctly assigned to its species.
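
The evaluation loop can be sketched as below. Everything here is an assumption made for illustration: `simulate_species` is a toy generator (a random founder sequence per species, with independent per-site changes), not the coalescent simulations of the study, and `nearest_species` is a stand-in classifier, not one of the methods being compared.

```python
import random

BASES = "ACGT"

def simulate_species(rng, n_species, n_per_species, length=40, mut_prob=0.02):
    """Toy simulator: per species, noisy copies of one random founder sequence."""
    data = []
    for _ in range(n_species):
        founder = [rng.choice(BASES) for _ in range(length)]
        individuals = []
        for _ in range(n_per_species):
            seq = [rng.choice(BASES) if rng.random() < mut_prob else base
                   for base in founder]
            individuals.append("".join(seq))
        data.append(individuals)
    return data

def nearest_species(reference, seq):
    """Stand-in classifier: species of the closest reference (Hamming distance)."""
    def dist(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(reference, key=lambda ref: dist(ref[0], seq))[1]

def success_rate(n_reps, n_species, n, classify=nearest_species, seed=0, **sim_kwargs):
    """Proportion of test individuals assigned to their own species."""
    rng = random.Random(seed)
    correct = total = 0
    for _ in range(n_reps):
        # n + 1 individuals per species: n references plus one test individual.
        data = simulate_species(rng, n_species, n + 1, **sim_kwargs)
        reference = [(seq, sp) for sp, inds in enumerate(data) for seq in inds[:n]]
        for sp, inds in enumerate(data):
            correct += classify(reference, inds[n]) == sp
            total += 1
    return correct / total

print(success_rate(n_reps=100, n_species=4, n=10))
```

Swapping in another `classify` function with the same signature gives the success rate of a different assignment method on the same kind of simulated data.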

Parameters assumed for the simulation study
– θ = 3 (e.g. Litoria) or 30 (e.g. Astraptes)
– Reference sample size n = 3, 5, 10, 25
– Effective population size: N = 1000
– Separation time T = N/10, N/2, N, 5N or 10N
– Number of newly founded populations: from 2 to 5
In all cases, all populations are assumed to be founded simultaneously.

Comparison between phylogenetic methods
2 populations, separation time = 500, θ = 3

Effect of the number of populations
Panels: θ = 3 and θ = 30 (separation time: 500 generations, reference sample size = 10)

Effect of the separation time
Panels: θ = 3 and θ = 30 (4 populations, reference sample size = 10)

Effect of the size of the reference sample
Panels: θ = 3 and θ = 30 (4 populations, separation time = 500)

Adding nuclear genes
We considered the case where polymorphism data from nuclear genes are also available. We assumed that
– these genes are independent;
– they all have the same θ (= 4Nμ) value, equal to the value for the cytoplasmic genes;
– they either show no intragenic recombination or show it at a rate equal to the mutation rate (c = μ, i.e. 4Nc = θ).

Adding nuclear genes
Phylogenetic method, 2 populations, θ = 3, separation time = 500, reference sample size = 10

Adding nuclear genes
Phylogenetic method, θ = 30, separation time = 500, reference sample size = 10

Application to real data
Litoria (Schneider et al, 1998, Mol. Ecol. 7, 487–498)
– 4 species
– Average sample size: 43.75
– Average θ = 1.54

Application to real data
Astraptes (Hebert et al 2004. Proc. Natl Acad. Sci. USA 101, 14812–14817)
– 12 species
– Average sample size: 38.8
– Average θ = 23.5

Application to real data
Cowries (Meyer et al (2005) PLoS Biol 3: e422)
– 357 taxa (species/subspecies)
– Average sample size: 5.7
– Average θ = 2.93

Conclusions
– Regarding phylogenetic methods, maximum likelihood performs better than neighbor joining.
– CART performs better than phylogenetic methods for poorly informative data (low θ value) but not for more polymorphic data (high θ value).
– Adding nuclear loci can help, but at a quite high cost.
– Recombination improves the phylogenetic method for low θ values (ongoing work for CART).

Perspectives
– Developing a statistical method to put a confidence level on a given assignment.
– Evaluating other classification methods (learning methods).
