A Bayesian method for DNA barcoding Kasper Munch, Wouter Boomsma, Eske Willerslev, Rasmus Nielsen, University of Copenhagen
Varieties of barcoding Assignment to existing species. Identification of new species. Assignment to taxonomic levels in general
Motivation 1. Environmental aDNA samples. 2. Putative Neandertal DNA. Often short query sequences. –Little information. Permissive PCR conditions. –Not always from the intended locus.
Given a set of database reference sequences from different species, according to which criteria should we assign new query sequences to taxonomic levels?
True species assignment Requires proper population genetic analyses quantifying variability within species. Often not possible... –small database sample size for each species. –short query PCR products.
Phylogenetic alternative -Purely phylogenetic criteria which ignore population genetic problems. -Taxonomic annotation of database sequences is used to map phylogenetic groups to taxonomic levels. -The simpler approach has its own advantages: less data required, fewer assumptions.
Monophyletic taxonomic group Is the query ingroup or outgroup? [Figure label: Query]
Estimating trees Estimating a single tree is not sufficient because the phylogeny itself is uncertain. We suggest instead using a Bayesian approach, which quantifies this uncertainty.
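Bayesian approach Let Q be the query sequence, X the database data, G a gene tree, and F a desired taxonomic group. Then
$$p(Q \in F \mid X) \approx \frac{1}{N} \sum_{i=1}^{N} I(Q \in F \mid G_i),$$
where G_i is the ith of N gene trees sampled from p(G | X), and I(Q ∈ F | G_i) is 1 if the query falls within F in tree G_i and 0 otherwise.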
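The estimate above can be sketched in a few lines: the posterior probability that Q belongs to F is taken as the fraction of sampled trees G_i in which the query falls inside F. This is only an illustration under stated assumptions. Trees are represented as nested tuples of leaf names (not MrBayes output), and "Q falls inside F" is read here as "every sequence in the query's sister group is annotated as F". The function names `leaves`, `sister_leaves` and `posterior_in_group` and the toy trees are hypothetical and not part of the original method.

```python
from typing import Optional, Sequence, Set, Union

# Assumption: a tree is either a leaf name (str) or a tuple of subtrees.
Tree = Union[str, tuple]


def leaves(tree: Tree) -> Set[str]:
    """Collect all leaf names below a node."""
    if isinstance(tree, str):
        return {tree}
    found: Set[str] = set()
    for child in tree:
        found |= leaves(child)
    return found


def sister_leaves(tree: Tree, query: str) -> Optional[Set[str]]:
    """Return the leaves that share the query's parent node, or None if absent."""
    if isinstance(tree, str):
        return None
    if query in tree:  # the query is a direct child of this node
        sisters: Set[str] = set()
        for child in tree:
            if child != query:
                sisters |= leaves(child)
        return sisters
    for child in tree:
        result = sister_leaves(child, query)
        if result is not None:
            return result
    return None


def posterior_in_group(trees: Sequence[Tree], query: str, group: Set[str]) -> float:
    """Fraction of sampled trees whose sister group lies entirely within `group`."""
    if not trees:
        raise ValueError("need at least one sampled tree")
    hits = 0
    for tree in trees:
        sisters = sister_leaves(tree, query)
        if sisters and sisters <= group:  # indicator I(Q in F | G_i)
            hits += 1
    return hits / len(trees)


# Toy posterior sample of four trees; a1 and a2 belong to group F, b1 does not.
trees = [
    (("Q", "a1"), ("a2", "b1")),
    (("Q", ("a1", "a2")), "b1"),
    (("Q", "b1"), ("a1", "a2")),
    ((("Q", "a1"), "a2"), "b1"),
]
group_f = {"a1", "a2"}
estimate = posterior_in_group(trees, "Q", group_f)
print(estimate)
```

With more sampled trees the same fraction becomes a less noisy estimate of the posterior probability.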
Assignment pipeline Query sequence → NCBI BLAST against the database (GenBank) → retrieval of sequences and taxonomy annotation → homology set → ClustalW → alignment → MrBayes → sampled trees → summary statistics → taxonomy summary
Summary statistics For each tree: –Find the sister clades to the query. –Find the consensus taxonomy for each clade. –Pick sister clade with most specific consensus taxonomy. For each taxonomic rank: –Find the fraction of consensus taxonomies that include taxonomic names of that rank.
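The per-tree step can be illustrated with a minimal sketch. It assumes each database sequence carries its taxonomy as a root-to-tip list of names, that the consensus taxonomy of a clade is the longest prefix shared by all of its lineages, and that "most specific" means the longest such prefix. The function names and the abbreviated toy lineages are hypothetical; real GenBank lineages contain many more ranks.

```python
from typing import List, Sequence

Lineage = Sequence[str]  # taxonomy names ordered from root to tip


def consensus_taxonomy(lineages: Sequence[Lineage]) -> List[str]:
    """Return the names shared, in order from the root, by every lineage."""
    if not lineages:
        return []
    shared: List[str] = []
    # zip stops at the shortest lineage, so ranks missing from one are never shared
    for names in zip(*lineages):
        if all(name == names[0] for name in names):
            shared.append(names[0])
        else:
            break
    return shared


def most_specific_sister(sister_clades: Sequence[Sequence[Lineage]]) -> List[str]:
    """Pick the longest consensus taxonomy among the sister clades.

    Raises ValueError if no sister clades are given.
    """
    return max((consensus_taxonomy(clade) for clade in sister_clades), key=len)


# Toy sister clades (illustrative, heavily abbreviated lineages).
clade_a = [
    ["Viridiplantae", "Poales", "Poaceae", "Poa"],
    ["Viridiplantae", "Poales", "Poaceae", "Festuca"],
]
clade_b = [
    ["Viridiplantae", "Poales", "Juncaceae", "Juncus"],
    ["Viridiplantae", "Rosales", "Rosaceae", "Rosa"],
]
best = most_specific_sister([clade_a, clade_b])
print(best)
```

Here the first clade agrees down to family level while the second only agrees at the root, so the first clade's consensus is selected.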
Summary statistics [Figure labels: Sister group, Query]
Summary statistics For each tree: –Find the sister group to the query. –Find the list of taxonomic levels shared by the sequences in the sister group (consensus taxonomy). For each name of each taxonomic level: –Find the fraction of sampled trees where the consensus taxonomy includes that name.
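The second step, turning per-tree consensus taxonomies into a support value per name, can be sketched as a simple count. This assumes the consensus taxonomy for each sampled tree has already been computed and is given as a list of names. The function name `name_support` and the toy input are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def name_support(consensus_per_tree: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Fraction of sampled trees whose consensus taxonomy includes each name."""
    n_trees = len(consensus_per_tree)
    if n_trees == 0:
        raise ValueError("need at least one sampled tree")
    counts: Counter = Counter()
    for consensus in consensus_per_tree:
        counts.update(set(consensus))  # count each name at most once per tree
    return {name: count / n_trees for name, count in counts.items()}


# Toy input: consensus taxonomies from four sampled trees (illustrative only).
consensus_per_tree = [
    ["Poales", "Poaceae", "Poa"],
    ["Poales", "Poaceae", "Poa"],
    ["Poales", "Poaceae"],
    ["Poales"],
]
support = name_support(consensus_per_tree)
for name, value in sorted(support.items(), key=lambda item: -item[1]):
    print(f"{name}: {value:.2f}")
```

Support can only stay equal or fall when moving from higher to lower ranks, because a tree that supports a genus also supports the family and order containing it.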
Example taxonomy summary
Environmental samples 379 environmental samples (aDNA). rbcL and trnL markers. The aim is to identify the environmental flora.
Orders >90%: Asterales, Brassicales, Caryophyllales, Coniferales, Dipsacales, Ericales, Fabales, Fagales, Lamiales, Lepidoptera, Malpighiales, Poales, Pottiales, Ranunculales, Rosales, Sapindales, Saxifragales, Solanales, Zingiberales
Families >90%: Amaranthaceae, Asteraceae, Betulaceae, Brassicaceae, Caprifoliaceae, Caryophyllaceae, Ericaceae, Fabaceae, Fagaceae, Juncaceae, Musaceae, Papaveraceae, Pinaceae, Plantaginaceae, Poaceae, Rosaceae, Rutaceae, Salicaceae, Saxifragaceae, Solanaceae, Taxaceae, Theaceae
Genera >90%: Achillea, Alnus, Aruncus, Cerastium, Fagus, Musa, Picea, Pinus, Plantago, Poa, Saxifraga, Symphoricarpos, Taxus
Botanical evaluation Temperate climate similar to central Sweden.
Testing putative Neandertal DNA Needless to say we have had several negative examples... One positive example: –Posterior probability of 91%. Croatian sequence with characteristic Neandertal point mutations. –sapiens sapiens with posterior probability 67%.
Problems No population genetic modelling: –Outgroup problem. –Species issues are not addressed. –Lineage sorting - not reciprocal monophyly. Incomplete database.
Advantages Phylogenetic uncertainty and statistical uncertainty of assignment are addressed. Posterior probability of assignment. Alternative to single tree assignment. Can be used on any database.
Conclusions Phylogenetic barcoding does not model the coalescent process. It is the appropriate method for assignment with little data, or when assigning to higher taxonomic levels. The Bayesian approach offers a measure of confidence in assignment.