Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton.

Presentation transcript:

Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton (Computational Biology)

Generating DArTs (DArT = Diversity Array Technologies)
1. Collect DNA from reference individuals.
2. Digest with one 6 bp rare cutter (CTGCAG) and one 4 bp frequent cutter (TCGA).
3. Only fragments with two rare ends are amplified and retained.
4. Create a microarray with these fragments (~2-3% of the genome).
5. Analyse phylogenetic samples by digesting them with the same cutters and running them against the microarray (DNA-DNA hybridisation). Each fragment is scored 1 (present) or 0 (absent).*
*This is in math fantasy land; in real life you also get ?s.

Properties of DArT data: The data are binary (fragments are present or absent, 1/0). The fragments are a random set from across the genome. Fragments are much more likely to be lost in parallel than gained in parallel. The data exhibit an ascertainment bias: we can observe only the fragments on the chip, and these fragments were derived from a small set of reference taxa.

The model: Fragment evolution can be modeled as a stochastic Dollo process, i.e. each fragment is gained once but potentially lost many times. Parallel gains are forbidden. Fragments are lost at a constant rate $r$ (memoryless), so the chance of loss over time $t$ is $1 - e^{-rt}$.
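The loss model above can be sketched in a few lines. This is a minimal illustration only; the function names `loss_probability` and `survival_probability` are invented for this example and do not come from the slides.

```python
import math


def survival_probability(rate: float, time: float) -> float:
    """Probability that a fragment is still present after `time` at loss rate `rate`."""
    return math.exp(-rate * time)


def loss_probability(rate: float, time: float) -> float:
    """Probability that a fragment has been lost after `time` at loss rate `rate`."""
    return 1.0 - survival_probability(rate, time)


# Memorylessness: surviving for t1 and then for t2 is the same as surviving for t1 + t2.
r = 0.5
two_steps = survival_probability(r, 1.0) * survival_probability(r, 2.0)
one_step = survival_probability(r, 3.0)
print(two_steps, one_step)
```

Because survival probabilities multiply along a path, $-\log$ of a survival probability is additive in time, which is what makes it usable as a distance.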

Hamming Horror [Figure: tree with taxa Ref, D, C and B.] Hamming distance: $D = (n_{10} + n_{01})/(n_{11} + n_{00} + n_{10} + n_{01})$. From the figure: $D(\mathrm{Ref},B) = (12+0)/16 = 12/16$ and $D(C,D) = (1+1)/16 = 2/16$.
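As a small sketch of how the Hamming distance is computed from two 0/1 fragment profiles: the helper names `pattern_counts` and `hamming_distance` and the toy profiles are assumptions made for illustration, not data from the talk. Here the first digit of each count name refers to the first taxon and the second digit to the second taxon.

```python
def pattern_counts(a, b):
    """Count the four site patterns for two equal-length 0/1 profiles.

    n10 means present in `a` and absent in `b`; n01 is the reverse.
    """
    counts = {"n11": 0, "n10": 0, "n01": 0, "n00": 0}
    for x, y in zip(a, b):
        counts[f"n{x}{y}"] += 1
    return counts


def hamming_distance(a, b):
    """Proportion of fragments at which the two profiles differ."""
    c = pattern_counts(a, b)
    return (c["n10"] + c["n01"]) / sum(c.values())


# Hypothetical toy profiles, not taken from the slides.
taxon_a = [1, 1, 1, 0, 0, 1, 0, 0]
taxon_b = [1, 0, 1, 1, 0, 0, 0, 0]
print(pattern_counts(taxon_a, taxon_b))
print(hamming_distance(taxon_a, taxon_b))
```

Note that the shared absences ($n_{00}$) enter the denominator, so the result depends on which fragments happen to be on the chip.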

Hamming simulation [Figure panels: tree based on Hamming distances using A as the reference taxon; underlying tree used in the simulation.]

A distance correction is required [Figure: tree with reference taxon R and taxa A and B.]
Let $n_{00}$ be the number of fragments absent at both A and B.
Let $n_{01}$ be the number of fragments absent at A and present at B.
Let $n_{10}$ be the number of fragments present at A and absent at B.
Let $n_{11}$ be the number of fragments present at both A and B.

A distance correction is required: Michael Woodhams's key observation was that, due to the Dollo nature of the process, any fragment that is present at the reference taxon R and at taxon A must also be present at the internal node X. [Figure: tree with reference R, taxa A and B, and internal node X.]

A distance correction is required: Recall that the chance of survival over time $t$ is $e^{-rt}$, so
$d(X,B) = -\log[\text{probability a fragment survives from X to B}]$.
Anything present at A is known to be present at X, so
$d(X,B) = -\log[n_{11}/(n_{11}+n_{10})]$.
Then
$d(A,B) = d(A,X) + d(X,B) = -\log[n_{11}/(n_{11}+n_{01})] - \log[n_{11}/(n_{11}+n_{10})] = \log[(1+n_{01}/n_{11})(1+n_{10}/n_{11})]$.
[Figure: tree with reference R, taxa A and B, and internal node X.]
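The corrected distance can be written directly from the counts. The sketch below uses natural logarithms and a hypothetical function name `dart_distance`; it also checks numerically that the two-term form and the product form of the derivation agree. The counts used are arbitrary illustrative values.

```python
import math


def dart_distance(n11: int, n10: int, n01: int) -> float:
    """d(A,B) = log[(1 + n01/n11)(1 + n10/n11)], using natural logs.

    n11: fragments present at both A and B
    n10: fragments present at A, absent at B
    n01: fragments absent at A, present at B
    """
    if n11 <= 0:
        raise ValueError("the distance is undefined when no fragment is shared")
    return math.log((1 + n01 / n11) * (1 + n10 / n11))


def dart_distance_two_terms(n11: int, n10: int, n01: int) -> float:
    """Same distance written as d(A,X) + d(X,B)."""
    d_ax = -math.log(n11 / (n11 + n01))
    d_xb = -math.log(n11 / (n11 + n10))
    return d_ax + d_xb


# Arbitrary illustrative counts.
print(dart_distance(40, 10, 5))
print(dart_distance_two_terms(40, 10, 5))
```

The count $n_{00}$ does not appear at all, which is the main practical difference from the Hamming distance.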

A zoo of distances
Hamming: $d_H = (n_{01}+n_{10})/(n_{11}+n_{10}+n_{01}+n_{00})$
Log Det: $d_{LD} = \log[\det[D]] - 0.5\sum_k (\log(C_k) + \log(R_k))$
Jaccard: $d_J = (n_{01}+n_{10})/(n_{11}+n_{10}+n_{01})$
Log Jaccard: $d_{LJ} = -\log(1-d_J) = -\log[n_{11}/(n_{11}+n_{10}+n_{01})]$
HS: $d_{HS} = -\log[2n_{11}/(2n_{11}+n_{10}+n_{01})]$
Nei Li: $F = 2n_{11}/(2n_{11}+n_{10}+n_{01})$; $F = Q^2/(2-Q)$; $d_{NL} = -\log(Q)$
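Several of these distances can be computed from the same three counts, as in the sketch below. Function names are invented for illustration. Log Det is omitted because the slide does not define $D$, $C_k$ and $R_k$. For Nei Li, the code assumes $Q$ is the positive root of $Q^2 + FQ - 2F = 0$, which is a rearrangement of $F = Q^2/(2-Q)$.

```python
import math


def jaccard(n11, n10, n01):
    return (n01 + n10) / (n11 + n10 + n01)


def log_jaccard(n11, n10, n01):
    return -math.log(n11 / (n11 + n10 + n01))


def hs_distance(n11, n10, n01):
    return -math.log(2 * n11 / (2 * n11 + n10 + n01))


def nei_li(n11, n10, n01):
    """Solve F = Q^2 / (2 - Q) for the positive root Q, then return -log(Q)."""
    f = 2 * n11 / (2 * n11 + n10 + n01)
    q = (-f + math.sqrt(f * f + 8 * f)) / 2
    return -math.log(q)


# Illustrative counts.
n11, n10, n01 = 2, 6, 2
for name, fn in [("Jaccard", jaccard), ("Log Jaccard", log_jaccard),
                 ("HS", hs_distance), ("Nei Li", nei_li)]:
    print(name, fn(n11, n10, n01))
```

With these counts $F = 1/3$ and the positive root is $Q = 2/3$, which can be checked by substituting back into $F = Q^2/(2-Q)$.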

Simulations [Figure panels: random (Yule) topology with edge lengths chosen from a uniform distribution, $0.05 < l < 0.40$; Yule tree, subject to a minimum edge length of 0.01.]

Simulation details: Choose an arbitrary node at which to start the process. At this node, the number of DArT fragments is drawn from a Poisson distribution with mean $M$. (We use the result from HS 2004 that a stochastic Dollo process is independent of the root.) Propagate outward from the start point along tree edges, so that each new node acquires some new DArT fragments and inherits some of those from its parent. If the edge length is $l$, the probability that a given fragment present in the parent is still present at the end of the edge is $e^{-l}$. The number of new fragments in the child but not the parent is Poisson distributed with mean $(1 - e^{-l})M$.
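The procedure described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the authors' simulation code: the tree is given as a hypothetical adjacency dictionary, fragments are integer ids, and the Poisson draw counts unit-rate exponential arrivals so that only the standard library is needed. The `observe` helper applies the ascertainment step by keeping only fragments present in at least one reference taxon.

```python
import math
import random


def sample_poisson(mean, rng):
    """Poisson(mean) draw: count unit-rate exponential arrivals before time `mean`."""
    count = 0
    t = rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def simulate_dollo(adjacency, start, mean_fragments, rng):
    """Return {node: set of fragment ids} for a tree given as
    {node: [(neighbour, edge_length), ...]} (each edge listed in both directions)."""
    next_id = sample_poisson(mean_fragments, rng)
    present = {start: set(range(next_id))}
    stack = [start]
    while stack:
        parent = stack.pop()
        for child, length in adjacency[parent]:
            if child in present:
                continue  # already visited (this is the edge back to the parent)
            keep = math.exp(-length)
            inherited = {f for f in sorted(present[parent]) if rng.random() < keep}
            n_new = sample_poisson((1.0 - keep) * mean_fragments, rng)
            gained = set(range(next_id, next_id + n_new))
            next_id += n_new
            present[child] = inherited | gained
            stack.append(child)
    return present


def observe(present, taxa, references):
    """Score taxa 0/1 on the fragments found in at least one reference taxon."""
    chip = sorted(set().union(*(present[r] for r in references)))
    return {t: [1 if f in present[t] else 0 for f in chip] for t in taxa}


# Hypothetical four-taxon tree with internal nodes X and Y.
edges = [("A", "X", 0.1), ("B", "X", 0.2), ("X", "Y", 0.15), ("C", "Y", 0.1), ("D", "Y", 0.3)]
adjacency = {}
for u, v, length in edges:
    adjacency.setdefault(u, []).append((v, length))
    adjacency.setdefault(v, []).append((u, length))

rng = random.Random(1)
present = simulate_dollo(adjacency, "X", 200, rng)
data = observe(present, ["A", "B", "C", "D"], references=["A"])
print({taxon: sum(row) for taxon, row in data.items()})
```

Because each fragment id is created exactly once and can only be inherited or dropped afterwards, parallel gains cannot occur, which matches the Dollo assumption.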

Simulations: Selection of Reference Taxa [Figure panels: one ref, included; two refs, included; one ref, excluded; two refs, excluded; all taxa are refs. Reference taxa are labelled R and S, and R, S, T, U, V in the all-references panel.]

Simulations All taxa are references, 9 taxa.

Simulations Single reference, excluded, 9 taxa. Single reference, included, 9 taxa.

Simulations (distance matrix -> tree by FastME)

Simulations

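Multiple References [Figure: taxa A and B with reference taxa R and S.] If R were the only reference, we'd only see the coloured sites: $n_{10}=6$, $n_{01}=2$, $n_{11}=2$, so $d(A,B) = -\log_2(2/4) - \log_2(2/8) = 3$ (the arithmetic implies logarithms to base 2).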

Multiple References [Figure: taxa A and B with reference taxa R and S.] With R and S as references: $n_{10}=7$, $n_{01}=7$, $n_{11}=3$, so $d(A,B) = -2\log_2(3/10) = 3.474$.
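The two worked values on the Multiple References slides can be reproduced with base-2 logarithms. The helper name `dart_distance_log2` is an assumption for this sketch.

```python
import math


def dart_distance_log2(n11: int, n10: int, n01: int) -> float:
    """DArT distance with base-2 logs: -log2[n11/(n11+n01)] - log2[n11/(n11+n10)]."""
    return -math.log2(n11 / (n11 + n01)) - math.log2(n11 / (n11 + n10))


# R as the only reference: n10 = 6, n01 = 2, n11 = 2.
print(dart_distance_log2(2, 6, 2))            # 3.0

# R and S as references: n10 = 7, n01 = 7, n11 = 3.
print(round(dart_distance_log2(3, 7, 7), 3))  # 3.474
```

Pooling the sites from both references changes the estimate from 3 to about 3.474 for the same pair of taxa.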

Generalizing the DArT Distance: The DArT distance does less well when there is more than one reference taxon. Define $d_{RDa}(A,B;R)$ as the DArT distance between A and B calculated only from sites that are 1 at R. Then $d_{GD}(A,B)$ is a weighted average: $d_{GD}(A,B) = \left(\sum_R d_{RDa}(A,B;R)\sqrt{n_R}\right) / \left(\sum_R \sqrt{n_R}\right)$.
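A minimal sketch of this weighted average follows. Two points are assumptions rather than statements from the slide: $n_R$ is taken to be the number of sites scored 1 at R (the slide does not define it), and references for which A and B share no fragment are skipped because the per-reference distance is undefined there. All function names are hypothetical.

```python
import math


def dart_distance_given_ref(a, b, ref):
    """DArT distance between profiles a and b using only sites that are 1 at ref.

    Returns (distance, n_ref), where n_ref is ASSUMED to be the number of sites
    scored 1 at the reference. Distance is None if a and b share no such fragment.
    """
    n11 = n10 = n01 = n_ref = 0
    for x, y, r in zip(a, b, ref):
        if r != 1:
            continue
        n_ref += 1
        if x == 1 and y == 1:
            n11 += 1
        elif x == 1:
            n10 += 1
        elif y == 1:
            n01 += 1
    if n11 == 0:
        return None, n_ref
    return math.log((1 + n01 / n11) * (1 + n10 / n11)), n_ref


def generalized_dart_distance(a, b, references):
    """Average of per-reference DArT distances, weighted by sqrt(n_ref)."""
    numerator = denominator = 0.0
    for ref in references:
        d, n_ref = dart_distance_given_ref(a, b, ref)
        if d is None:
            continue  # assumption: skip references giving an undefined distance
        weight = math.sqrt(n_ref)
        numerator += weight * d
        denominator += weight
    if denominator == 0.0:
        raise ValueError("no reference gives a defined distance")
    return numerator / denominator


# Hypothetical profiles over eight fragments.
ref_r = [1, 1, 1, 1, 0, 0, 0, 0]
ref_s = [0, 0, 1, 1, 1, 1, 1, 1]
tax_a = [1, 1, 1, 0, 1, 1, 0, 0]
tax_b = [1, 0, 1, 1, 1, 0, 1, 0]
print(generalized_dart_distance(tax_a, tax_b, [ref_r, ref_s]))
```

With a single reference the weighted average reduces to the ordinary per-reference DArT distance.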

Partitioned DArT distances (under construction): When the reference taxa are known (typically the case), and it is also known which fragments come from which reference taxon (not always the case), you can define a partitioned DArT distance that takes a weighted average of the DArT distance for each partition.

Simulations All taxa are references, 9 taxa.

Simulations

DArT, Generalized DArT and HS tree (FastME): 94 Eucalypt taxa, 8 reference taxa.

Norwich: Why does the Generalised DArT distance perform so well when the reference taxa are included and so poorly when they are not?

Single reference [Figure: tree with reference R and taxa A, B, C, D; edge loss probabilities $p_r$, $p_a$, $p_{bcd}$, $p_b$, $p_{cd}$, $p_c$, $p_d$.] Pattern probabilities can be computed by rooting the tree at the reference taxon and then only considering loss of fragments. E.g. the probability of seeing R1 A0 B0 C1 D1 is $(1-p_r)\,p_a\,(1-p_{bcd})\,p_b\,(1-p_{cd})(1-p_c)(1-p_d)$.
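The loss-only calculation can be sketched as a recursion over the tree rooted at the reference. The nested-list tree encoding, the function names, and the topology (R joined to A and to the clade (B,(C,D))) are assumptions inferred from the edge labels; the value 0.01 for every edge is used only as an example input.

```python
def leaves_below(subtree):
    """Leaf names under a subtree; a subtree is a leaf name or a list of (p_loss, child)."""
    if isinstance(subtree, str):
        return [subtree]
    return [leaf for _, child in subtree for leaf in leaves_below(child)]


def edge_probability(p_loss, child, pattern):
    """Probability of the observed 0/1 pattern below an edge, given that the
    fragment is present at the top of the edge and can only be lost."""
    if isinstance(child, str):
        return 1.0 - p_loss if pattern[child] == 1 else p_loss
    survive = 1.0 - p_loss
    for child_loss, grandchild in child:
        survive *= edge_probability(child_loss, grandchild, pattern)
    # A loss on this edge explains the data only if every leaf below is absent.
    all_absent = all(pattern[leaf] == 0 for leaf in leaves_below(child))
    return survive + (p_loss if all_absent else 0.0)


p = 0.01  # example value for every edge loss probability
# Assumed topology: R --p_r--> X; X -> A (p_a); X --p_bcd--> Y;
# Y -> B (p_b); Y --p_cd--> Z; Z -> C (p_c); Z -> D (p_d).
below_r = [(p, "A"), (p, [(p, "B"), (p, [(p, "C"), (p, "D")])])]

pattern = {"A": 0, "B": 0, "C": 1, "D": 1}
print(edge_probability(p, below_r, pattern))
print((1 - p) ** 5 * p ** 2)  # the slide's closed form with all probabilities equal
```

Summing `edge_probability` over all 16 patterns for A, B, C and D gives 1, which is a quick sanity check on the recursion.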

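Reference unknown [Figure: tree with taxa R, A, B, C, D and edge loss probabilities $p_r$, $p_a$, $p_{bcd}$, $p_b$, $p_{cd}$, $p_c$, $p_d$; all edge probabilities set to 0.01.] [Table: for each candidate reference (R, B, D), the columns $n_{01}/n$, $n_{10}/n$ and $D(A,C)$; the values were not preserved in the transcript.] $D(A,C) = \log[(1+n_{01}/n_{11})(1+n_{10}/n_{11})]$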

Multiple reference taxa [Figure: tree with taxa R, A, B, C, S; edge probabilities $p_r$, $p_a$, $p_{bcs}$, $p_b$, $p_{cs}$, $p_c$, $p_s$.] In the multiple reference setting you also have to consider gain of fragments down any edge that is above a reference taxon. E.g. the probability of seeing R0 A0 B0 C1 S1 has 4 terms: $p_r\,p_a\,(1-p_{bcs})\,p_b\,(1-p_{cs})(1-p_c)(1-p_s) + p_{bcs}\,p_b\,(1-p_{cs})(1-p_c)(1-p_s) + p_{cs}(1-p_c)(1-p_s) + p_s$.*
*Need to renormalise probabilities.

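[Figure: tree with taxa R, A, B, S; edge probabilities $p_r$, $p_a$, $p_{bs}$, $p_b$, $p_s$; all edge probabilities set to 0.01.] $D(A,B) = \log[(1+n_{01}/n_{11})(1+n_{10}/n_{11})]$ [Tables: for each reference set (R, S, R or S), the columns $n_{01}/n$, $n_{10}/n$ and $D(A,B)$, and a second table of pattern counts; the values were not preserved in the transcript.]
*Need to renormalise probabilities.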

Future ideas: The small examples we worked through in Norwich suggest two new ideas to be tested by simulation. In the case of unknown references, compute $D(X,Y|R)$ for each R and take the max. In the case of known references, use a modification of the Generalised DArT distance that averages only over the references.

Future work - hybridisation

Links to other people's work: Gene content evolution with HGT, aka controlling ancestral genome obesity (Tal Dagan, Bill Martin). Language evolution with borrowing (Geoff Nicholls, Russell Gray).

BIG Thanks to Torsten and Shiju