How to See a Tree for a Forest? Combining Phylogenetic Trees – Reasons, Methods, and Consequences. Tanya Y. Berger-Wolf, DIMACS and UIC

Presentation transcript:

How to See a Tree for a Forest? Combining Phylogenetic Trees – Reasons, Methods, and Consequences
Tanya Y. Berger-Wolf, DIMACS and UIC
"The affinities of all the beings of the same class have sometimes been represented by a great tree… As buds give rise by growth to fresh buds, and these if vigorous, branch out and overtop on all sides many a feeble branch, so by generation I believe it has been with the great Tree of Life, which fills with its dead and broken branches the crust of the earth, and covers the surface with its ever branching and beautiful ramifications."
Charles Darwin, 1859

Phylogeny Reconstruction
Figure: tree with leaves Orangutan, Chimpanzee, Human, Gorilla

Phylogeny Reconstruction Process
1. Get an estimate of evolutionary distance between species
2. Treat the species as a set of points with a pairwise distance measure
3. Find a tree that optimizes {parsimony, likelihood, function of your choice} on that set of points
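Steps 1 and 2 can be illustrated with a minimal sketch. It assumes the simplest possible distance, the fraction of differing sites between aligned sequences (p-distance), which the slides do not prescribe. The sequences and the name `p_distance` are invented for illustration.

```python
from itertools import combinations

def p_distance(a: str, b: str) -> float:
    """Fraction of aligned sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Hypothetical aligned sequences, invented for illustration only.
seqs = {
    "human": "AAGACTT",
    "chimpanzee": "AAGGCTT",
    "gorilla": "AAGGCCT",
    "orangutan": "TGGACTT",
}

# Step 2: the species become points with a pairwise distance measure.
dist = {
    frozenset((s, t)): p_distance(seqs[s], seqs[t])
    for s, t in combinations(seqs, 2)
}

for pair, d in sorted(dist.items(), key=lambda kv: kv[1]):
    print(sorted(pair), round(d, 3))
```

Step 3, the tree search itself, is the hard optimization problem and is not attempted here.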

Phylogeny Reconstruction Problems
1. Get an estimate of evolutionary distance between species
2. Treat the species as a set of points with a pairwise distance measure
3. Find a tree that optimizes {parsimony, likelihood, function of your choice} on that set of points
Problems:
- DNA is not sufficient for deep evolution and is too simple
- Genomes are better, but there are no good distance measures
- Other types of data are subjective and have no good models
- Constraints on possible topologies
- Species are not sampled at the same level and frequency, so some points are "more equal than others"
- Large datasets: efficient storage, query, and representation

Computational Pitfalls
- Resulting optimization problems are hard
- No good bounds
- Existing heuristics are expensive on large datasets
- Same score, many topologies
- True tree is unknown
⇓ When to stop and what to return?

Consensus Methods
Figure: tree on leaves (A B C D E) + tree on leaves (A C B D E) = consensus tree on leaves (A B C D E)
"Consensus is what many people say in chorus but do not believe as individuals." Abba Eban, Israeli diplomat, in "The New Yorker," 23 Apr 1990

Consensus Methods: Strict
McMorris et al. (83)
Input trees on leaves A, B, C, D, E, listed by their clades:
- Tree 1: AB, CD, ABCD, ABCDE
- Tree 2: AB, ABC, DE, ABCDE
- Tree 3: BCD, ABCD, ABCDE
Strict: contains clades common to all trees

Consensus Methods: Majority
Margush & McMorris (81), McMorris et al. (83), Barthelemy & McMorris (86)
Input trees on leaves A, B, C, D, E, listed by their clades:
- Tree 1: AB, CD, ABCD, ABCDE
- Tree 2: AB, ABC, DE, ABCDE
- Tree 3: BCD, ABCD, ABCDE
Majority: contains clades common to a majority of the trees
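A minimal sketch of the two definitions, treating each tree as a set of clades. It assumes the flattened clade lists on the slides group into three trees, {AB, CD, ABCD, ABCDE}, {AB, ABC, DE, ABCDE} and {BCD, ABCD, ABCDE}. The helper names are illustrative.

```python
from collections import Counter

def clades(*names: str) -> set:
    """Build a tree's clade set; each clade is a frozenset of leaf labels."""
    return {frozenset(name) for name in names}

# Assumed grouping of the clade lists shown on the slides.
trees = [
    clades("AB", "CD", "ABCD", "ABCDE"),
    clades("AB", "ABC", "DE", "ABCDE"),
    clades("BCD", "ABCD", "ABCDE"),
]

def strict_consensus(trees):
    """Clades present in every input tree."""
    return set.intersection(*trees)

def majority_consensus(trees):
    """Clades present in more than half of the input trees."""
    counts = Counter(c for tree in trees for c in tree)
    return {c for c, n in counts.items() if n > len(trees) / 2}

def show(clade_set):
    return sorted("".join(sorted(c)) for c in clade_set)

print(show(strict_consensus(trees)))    # ['ABCDE']
print(show(majority_consensus(trees)))  # ['AB', 'ABCD', 'ABCDE']
```

On this input the strict consensus keeps only the root clade, while the majority consensus also recovers AB and ABCD.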

Stopping Maximum Parsimony (joint work with T. Williams, B. M. E. Moret, U. Roshan, T. Warnow)
If we return the majority consensus of the top-scoring trees, how early can we stop without changing the outcome? What stopping criteria should we use?
Biological datasets:
- three567: "three-gene" (rbcL, atpB, and 18S) DNA sequences (Soltis et al., 2000)
- aster328: ITS RNA sequences from the plant family Asteraceae (Gutell Lab, ICMB, UT Austin)
- ocho854: rbcL DNA sequences (Goloboff, 1999)
- lipsc439: rDNA sequences of Eukaryotes (Goloboff, 1999)
- john921: Avian Cytochrome b DNA sequences (Johnson, 2001)
- eern476: Metazoan DNA sequences (Goloboff, 1999)
- will2000: Eukaryotic sRNA sequences (Gutell Lab, ICMB, UT Austin)
- rbcL500: rbcL DNA sequences (Rice et al., 1997)
- mari2594: rbcL DNA sequences (Källersjö et al., 1998)

Experiment Design
- Biological dataset (e.g. ATTCGGAAGCGATAGCTGA, ATCGATCGATCGTATTACGT, TAGCTAGTATGCAGCGGAG, …)
- Run parsimony ratchet (PAUP*): 500 iterations, 5 repetitions
- Save the tree at each iteration
- Majority consensus of optimal trees (PAUP*)
- Output consensus tree
- Optimal: best-scoring trees in all repetitions
- Majority consensus of best and second best so far

Results

Online Consensus: Strict
$$C(SC) = \bigcap_{i=1}^{k} C(T_i)$$
$$C(SC_i) = \bigcap_{j=1}^{i} C(T_j) = \Big(\bigcap_{j=1}^{i-1} C(T_j)\Big) \cap C(T_i) = C(SC_{i-1}) \cap C(T_i)$$
Running time for a new tree: Θ(n), which is optimal
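The recurrence C(SC_i) = C(SC_{i-1}) ∩ C(T_i) can be sketched as an incremental update. The class name is hypothetical. Plain Python sets of frozensets stand in for whatever clade representation achieves the Θ(n) bound, so this sketch does not itself meet that bound.

```python
class OnlineStrictConsensus:
    """Maintains the strict consensus clade set as trees arrive one at a time."""

    def __init__(self):
        self.clades = None  # no tree seen yet
        self.k = 0          # number of trees seen so far

    def add_tree(self, tree_clades):
        tree_clades = {frozenset(c) for c in tree_clades}
        if self.clades is None:
            self.clades = tree_clades   # C(SC_1) = C(T_1)
        else:
            self.clades &= tree_clades  # C(SC_i) = C(SC_{i-1}) ∩ C(T_i)
        self.k += 1
        return self.clades

# Assumed grouping of the example clade lists from the consensus slides.
stream = [
    ["AB", "CD", "ABCD", "ABCDE"],
    ["AB", "ABC", "DE", "ABCDE"],
    ["BCD", "ABCD", "ABCDE"],
]

sc = OnlineStrictConsensus()
for tree in stream:
    current = sc.add_tree(tree)
    print(sorted("".join(sorted(c)) for c in current))
```

Each update touches only the previous consensus and the new tree, so no earlier tree needs to be stored.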

Online Consensus: Majority
Running time for a new tree: Θ(n), which is optimal
$$c \in C(M) \iff \left|\{\, T_i : c \in C(T_i) \,\}\right| > \frac{k}{2}$$
$$C(M_i) \subseteq C(M_{i-1}) \cup C(T_i)$$
- Maintain the set of clades seen so far, with counters
- Update counters for the previous majority and the new tree
- Use a good implementation of a dictionary data structure (Amenta et al., 2003)
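A minimal sketch of the counter-based idea, using `collections.Counter` as the dictionary. The class name is hypothetical. For simplicity it rescans all counters to report the majority, so it illustrates the bookkeeping rather than the Θ(n) update claimed on the slide.

```python
from collections import Counter

class OnlineMajorityConsensus:
    """Keeps one counter per clade seen so far and reports the current majority."""

    def __init__(self):
        self.counts = Counter()
        self.k = 0  # number of trees seen so far

    def add_tree(self, tree_clades):
        self.k += 1
        # Each distinct clade of the new tree increments its counter once.
        self.counts.update({frozenset(c) for c in tree_clades})
        return self.majority()

    def majority(self):
        # c is in C(M) if and only if it occurs in more than k/2 of the trees.
        return {c for c, n in self.counts.items() if n > self.k / 2}

# Assumed grouping of the example clade lists from the consensus slides.
stream = [
    ["AB", "CD", "ABCD", "ABCDE"],
    ["AB", "ABC", "DE", "ABCDE"],
    ["BCD", "ABCD", "ABCDE"],
]

mc = OnlineMajorityConsensus()
history = []
for tree in stream:
    current = mc.add_tree(tree)
    history.append(sorted("".join(sorted(c)) for c in current))
    print(history[-1])
```

After the second tree the majority shrinks to AB and ABCDE; the third tree brings ABCD back in.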

Conclusions
- No need to work hard to get good enough trees?
- Work to get "good" (?) trees, not optimal
- Stopping criteria
- Consensus is not the best representation. What else?
- This is a wide open research area

Using a Different Path: Heterogeneous Data (joint work with Tandy Warnow)

Heterogeneous Data
Molecular data: DNA and genomes
Pros:
- Have distance measure
- Unambiguous
- Many characters
Cons:
- No data for extinct species
- Difficulties with ancient evolutionary events
- Recombination, repeated evolution

Heterogeneous Data
Paleontological, morphological, geographical, historical data
Pros:
- Easy to sample
- Sometimes the only available information
- Has been used for a century
Cons:
- Character states hard to determine
- Genetic basis not known
- No distance measure
- Subjective

Data As Constraints
Constraints, not distance!
- Positive: these species are together (phylogenetic trees, presence of a morphological character)
- Negative: these species are not together (above + geography, fossils)
- Temporal: these events happened in this order (fossils, history)
- Frequency: this event happens more often than another (adaptation mechanisms)

Consensus Methods: Greedy
Input trees on leaves A, B, C, D, E, listed by their clades:
- Tree 1: AB, CD, ABCD, ABCDE
- Tree 2: AB, ABC, DE, ABCDE
- Tree 3: BCD, ABCD, ABCDE
Greedy: resolves majority by adding compatible clades
Further clade lists shown on the slide: AB CD ABCD; AB ABC DE
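A minimal sketch of a greedy consensus, assuming clades are considered in decreasing order of frequency and kept when compatible (nested or disjoint) with everything kept so far. Ties are broken alphabetically here, which is an arbitrary choice not taken from the slides. A different tie-break can give a different resolution.

```python
from collections import Counter

def compatible(a: frozenset, b: frozenset) -> bool:
    """Two clades can coexist in one tree if they are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a

def greedy_consensus(trees):
    counts = Counter(c for tree in trees for c in tree)
    # Most frequent clades first; alphabetical tie-break (an assumption).
    order = sorted(counts, key=lambda c: (-counts[c], "".join(sorted(c))))
    chosen = []
    for clade in order:
        if all(compatible(clade, kept) for kept in chosen):
            chosen.append(clade)
    return chosen

# Assumed grouping of the example clade lists from the slides.
trees = [
    {frozenset(c) for c in ("AB", "CD", "ABCD", "ABCDE")},
    {frozenset(c) for c in ("AB", "ABC", "DE", "ABCDE")},
    {frozenset(c) for c in ("BCD", "ABCD", "ABCDE")},
]

result = ["".join(sorted(c)) for c in greedy_consensus(trees)]
print(result)  # ['ABCDE', 'AB', 'ABCD', 'ABC']
```

The majority clades are pairwise compatible and come first in the ordering, so they are always kept. The remaining clades only refine them.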

Consensus Methods: AMT
Phillips & Warnow (95)
Input trees on leaves A, B, C, D, E, listed by their clades:
- Tree 1: AB, CD, ABCD, ABCDE
- Tree 2: AB, ABC, DE, ABCDE
- Tree 3: BCD, ABCD, ABCDE
Asymmetric Median Tree: maximum (weighted) collection of compatible clades
All clades: AB, ABC, ABCD, BCD, DE, CD
Clade lists shown on the slide: AB CD ABCD ABCDE; AB ABC ABCD ABCDE; AB CD ABCD ABCDE
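The "maximum (weighted) collection of compatible clades" can be sketched by brute force on the small example. This assumes each clade is weighted by the number of input trees that contain it. Exhaustive search over subsets is used only because the example has seven distinct clades; it is not the algorithm of Phillips & Warnow.

```python
from collections import Counter
from itertools import combinations

def compatible(a: frozenset, b: frozenset) -> bool:
    """Two clades can coexist in one tree if they are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a

def max_weight_compatible(trees):
    """Return the best total weight and every pairwise-compatible clade set reaching it."""
    counts = Counter(c for tree in trees for c in tree)
    all_clades = list(counts)
    best_weight, best_sets = -1, []
    for r in range(len(all_clades) + 1):
        for subset in combinations(all_clades, r):
            if all(compatible(a, b) for a, b in combinations(subset, 2)):
                weight = sum(counts[c] for c in subset)
                if weight > best_weight:
                    best_weight, best_sets = weight, [set(subset)]
                elif weight == best_weight:
                    best_sets.append(set(subset))
    return best_weight, best_sets

# Assumed grouping of the example clade lists from the slides.
trees = [
    {frozenset(c) for c in ("AB", "CD", "ABCD", "ABCDE")},
    {frozenset(c) for c in ("AB", "ABC", "DE", "ABCDE")},
    {frozenset(c) for c in ("BCD", "ABCD", "ABCDE")},
]

weight, solutions = max_weight_compatible(trees)
found = sorted(sorted("".join(sorted(c)) for c in s) for s in solutions)
print(weight, found)
```

Under this weighting two collections tie for the maximum: {AB, CD, ABCD, ABCDE} and {AB, ABC, ABCD, ABCDE}.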

Consensus of Positive Constraints
Formalize the constraint, go through the existing consensus methods, and see whether each satisfies it or can be extended.
Table columns: Positive Constraints | Strict + res | Majority + res | Greedy | AMT | Input
Table rows:
- All input have isomorphic T ⇒ all output have T
- One input has isomorphic T, no contradictions ⇒ output have T
- All input have clade ⇒ all output have clade
- One input has clade, no contradictions ⇒ output have clade
Partially from Steel et al. 2000

Consensus of Negative Constraints
1. a and b are separated by C
2. C is closer to a than b (same as positive)
Table columns: Negative Constraints | Strict + res | Majority + res | Greedy | AMT | Input
Table rows:
- All input have 1 ⇒ all output … have 1
- One input has 1, no contradictions ⇒ output have 1

More Conclusions
- Existing methods are insufficient (consensus with respect to temporal and frequency constraints)
- Developing new methods that preserve the 4 types of constraints
- Network phylogeny
- Error measure and evaluation of quality
- This is a wide open research area

Work was supported by the National Science Foundation postdoctoral fellowship grant EIA
Thank you
"The significant problems we face cannot be solved at the same level of thinking we were at when we created them." - Albert Einstein
"A little inaccuracy sometimes saves a ton of explanation." - H. H. Munro (Saki)

Controlled Breeding (joint work with Cris Moore and Jared Saia)
Given an initial population of animals, design a mating strategy that achieves a breeding goal (within the shortest time)

Controlled Breeding: Background
- Conservation Biology and Agriculture
- Breeding strategies: designed and evaluated empirically or using stochastic time-step modeling
- Empirical evaluation: too slow!
- Stochastic modeling: mathematically and biologically inappropriate
- Classic algorithm design problem

Breeding All Possible Animals
Given k binary strings of length n, design an algorithm that
- produces all possible strings
- with the smallest expected number of matings
Greedy: mate the two animals with the highest probability of producing a new string
Upper bound: n

Breeding a Target Animal
Given k strings of length n, design an algorithm that
- produces a target string
- with the smallest expected number of matings
Alg 1: breed for one trait at a time, O(n lg n)
Alg 2: breed the animals closest to the target, O(n^2)

Algorithm: One Trait at a Time

AddOneTrait(11…, 00…010…0)
    x = 11…100…0
    y = 00…010…0
    While (y has < i+1 ones) do
        Mate x and y twice
        y = string with 1 in bit (i+1)
    Return y

The Algorithm(e_1, e_2, …, e_n)
    x = e_1
    For i = 2..n do
        x = AddOneTrait(x, e_i)
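A minimal simulation sketch of the one-trait-at-a-time idea. The slides do not state the mating model, so this assumes a child takes each bit from one of its two parents independently with probability 1/2. It also assumes the inputs are the unit strings e_1..e_n and the target is the all-ones string. Bits are 0-indexed, and parents are mated one child at a time until a child carries the new trait, rather than literally "twice".

```python
import random

def mate(x, y, rng):
    # Assumed mating model (not stated on the slides): each bit of the child
    # comes from one of the two parents, chosen with probability 1/2.
    return [rng.choice((a, b)) for a, b in zip(x, y)]

def add_one_trait(x, e, i, rng):
    """x has ones exactly in bits 0..i-1; e has a single one in bit i.

    Returns a string with ones in bits 0..i and the number of matings used.
    """
    y, matings = list(e), 0
    while sum(y) < i + 1:
        child = mate(x, y, rng)
        matings += 1
        if child[i] == 1:
            # Keep only children carrying the new trait; such a child never
            # has fewer ones than y, so progress is monotone.
            y = child
    return y, matings

def breed_all_ones(n, rng):
    def unit(j):
        return [1 if b == j else 0 for b in range(n)]

    x, total = unit(0), 0
    for i in range(1, n):
        x, used = add_one_trait(x, unit(i), i, rng)
        total += used
    return x, total

rng = random.Random(0)
target, matings = breed_all_ones(16, rng)
print(target, matings)
```

Under the assumed model, each kept child fills in about half of the traits still missing from y. That halving gives roughly lg i rounds per trait, in line with the O(n lg n) bound quoted for Alg 1.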

More Realistic Breeding
- Gender
- Variable probability of outcome
- Deaths
- Minimize number of generations
- Goal: maximum diversity
- On-line: maintain the distribution