Understanding sets of trees CS 394C September 10, 2009.

Basic challenge
Phylogenetic analyses are sometimes based upon a single marker, but often upon many markers. Each marker can be analyzed separately, or the entire set can be combined into one “super-matrix”. Each matrix (each dataset) can result in many trees (almost no matter how you analyze the matrix). What to do with huge numbers of trees?

What to do?
–How to estimate evolutionary history from many trees
–How to efficiently store large sets of trees
–How to enable efficient queries of the set of trees

First, a few questions: Why are gene trees different from the species tree? Why are estimated gene trees different from the true gene tree? Under what conditions is the true evolutionary history not a tree? (i.e., what is “reticulation”?)

Reticulation
Evolutionary histories can be reticulate (meaning non-treelike):
–Horizontal Gene Transfer (HGT)
–Hybrid speciation
–Recombination
Most phylogeny estimation methods produce trees. A good resource about reticulate phylogenies is the book chapter by Luay Nakhleh (see the 394C webpage for the link).

We will assume that all evolutionary histories are treelike for the remainder of today’s presentation. Later in the course we’ll discuss reticulate evolution…

Estimated Gene Trees can differ from Species Trees
Biological reasons:
–Deep coalescent events (alleles)
–Gene duplication and loss (gene families)
Computational reasons:
–Insufficient time
–Poor methods (e.g., UPGMA)
–Poor models (e.g., ML using Jukes-Cantor)
Data issues:
–Insufficient data (meaning not enough sites)
–Poor alignments

Examples of problems
When true gene trees can differ from the species tree: Given a collection of gene trees, find a species tree that minimizes the number of “deep coalescent” events.
When true gene trees should equal the species tree: Given a collection of gene trees, find a species tree that minimizes the total distance to the gene trees.

When gene trees can differ from the species tree
Software/algorithms for deep coalescence (see PhyloNet from Nakhleh’s webpage at Rice):
–GLASS (Roch and Mossel) - distance-based
–MDC (Than and Nakhleh) - parsimony
–STEM (Kubatko) - ML
–BEST (Liu et al.) - Bayesian
–BUCKy (Ané et al.) - Bayesian
Software/algorithms for duplication-loss:
–NOTUNG (Durand)
–DupTree (Bansal et al.)
–Hallett and Lagergren - algorithms/complexity

When gene trees should equal the species tree
The problem here is that estimated gene trees can differ from the true gene trees. Although the problem is “simple”, it is still interesting, both computationally and mathematically. Plus, we can still make novel contributions.

The very simplest problem
Easiest case: one species tree, the true gene trees agree with the species tree, and the estimated trees are on the full set of taxa.
Approaches:
–Consensus methods: return a tree on the entire set S of taxa summarizing the input trees
–Agreement methods: return a tree on a subset of the taxa on which the trees agree
–Clustering, then consensus/agreement

Consensus methods
These are the most common ways of analyzing datasets of trees.
Examples:
–Strict consensus
–Majority consensus
–Greedy consensus (aka “extended majority”)
–Others, less frequently used, include Gordon’s, Adams, the Strict Consensus Supertree, Local Consensus methods, and more. See the survey paper by David Bryant for some of these.
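As a rough illustration of the first two consensus methods, the sketch below treats each input tree as a set of bipartitions of the taxa and keeps the bipartitions found in every tree (strict) or in more than half of the trees (majority). The representation and the function names (`canonical`, `split_counts`, `strict_consensus`, `majority_consensus`) are assumptions made for this example and do not come from the lecture. Assembling the retained bipartitions into a tree is not shown.

```python
from collections import Counter

def canonical(side, taxa):
    """Name a bipartition by the side that excludes the smallest taxon."""
    side = frozenset(side)
    return frozenset(taxa) - side if min(taxa) in side else side

def split_counts(trees, taxa):
    """Count, for each bipartition, the number of input trees containing it."""
    counts = Counter()
    for splits in trees:
        counts.update({canonical(s, taxa) for s in splits})
    return counts

def strict_consensus(trees, taxa):
    """Bipartitions present in every input tree."""
    counts = split_counts(trees, taxa)
    return {s for s, c in counts.items() if c == len(trees)}

def majority_consensus(trees, taxa):
    """Bipartitions present in more than half of the input trees."""
    counts = split_counts(trees, taxa)
    return {s for s, c in counts.items() if 2 * c > len(trees)}

# Hypothetical input: three trees on taxa 1..5, each given by one side
# of every internal edge.
taxa = {1, 2, 3, 4, 5}
trees = [
    [{1, 2}, {4, 5}],      # 12|345 and 123|45
    [{1, 2}, {3, 5}],      # 12|345 and 124|35
    [{1, 2}, {1, 2, 3}],   # the same two bipartitions as the first tree
]
print(strict_consensus(trees, taxa))
print(majority_consensus(trees, taxa))
```

In this toy input, only 12|345 appears in all three trees, while 123|45 appears in two of the three and so survives in the majority consensus only.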

Simplest problems, cont.
“Agreement” methods return trees on subsets of S, on which the trees are the same (or compatible):
–MAST: maximum agreement subtree (used in practice, sometimes)
–MCST: maximum compatible subtree (Ganapathy et al., not used in practice)
The difference between these is how polytomies are handled.

Soft vs. hard polytomies
Polytomy: node of high degree (greater than three for an unrooted tree). Polytomies arise in estimations when consensus methods are used. Polytomies also arise when contracting short branches in estimated trees. Polytomies can be “hard” (representing true radiations) or “soft” (representing lack of information).

Compatible source trees
Estimated trees can be “compatible” when we interpret polytomies as “soft”. “Compatible” means that there is a tree which is a common refinement of all of them. Example: the bipartitions 123|456, 12|3456, and 1235|46 are compatible. We can compute the compatibility tree (when it exists) in O(nk) time, where n=|S| and there are k source trees.
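The example above can be checked with a short sketch. It relies on the standard fact that a set of bipartitions of the same taxon set has a common refinement exactly when every pair is compatible, and that two bipartitions A|B and C|D are compatible when at least one of the four intersections A∩C, A∩D, B∩C, B∩D is empty. This is the naive pairwise test, not the O(nk) algorithm mentioned on the slide, and the helper names (`parse_split`, `compatible`, `all_compatible`) are made up for illustration.

```python
from itertools import combinations

def parse_split(text):
    """Turn a string such as '123|456' into a pair of taxon sets."""
    left, right = text.split("|")
    return frozenset(left), frozenset(right)

def compatible(s1, s2):
    """Two bipartitions are compatible if some pair of sides is disjoint."""
    (a, b), (c, d) = s1, s2
    return any(not (x & y) for x in (a, b) for y in (c, d))

def all_compatible(splits):
    """Pairwise test over the whole collection."""
    return all(compatible(s, t) for s, t in combinations(splits, 2))

slide_example = [parse_split(s) for s in ("123|456", "12|3456", "1235|46")]
print(all_compatible(slide_example))  # True

# A hypothetical conflicting pair: 12|3456 and 13|2456 cannot share a tree.
conflict = [parse_split(s) for s in ("12|3456", "13|2456")]
print(all_compatible(conflict))  # False
```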

Computational complexity
Most consensus methods (which return a tree on the entire set S of taxa) are polynomial time. Most “agreement methods” (which return a tree on the largest subset of the taxa on which the source trees “agree”) are based upon NP-hard problems. Some (e.g., MAST) have fixed-parameter polynomial time solutions.

Supertree problems
Realistic complexity: not all the source trees are on the same set of taxa.
Obvious problems:
–Find a tree with which all the source trees agree (if it exists).
–Find a tree with which a maximum number of the source trees agree.
Both are NP-hard.

Quartet compatibility
Simple case: all the source trees are on four taxa. We ask: does there exist a tree which agrees with all the source trees? NP-hard!

Quartet tree amalgamation
Given a collection of quartet trees, find a tree which agrees with a maximum number of these quartet trees.
–NP-hard, since compatibility is NP-hard
–Hard to approximate, but there is a PTAS if you have a tree on every quartet of taxa (Jiang et al.)
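To make the objective concrete, here is a minimal sketch of how a candidate tree could be scored against a collection of quartet trees. A tree agrees with the quartet ab|cd when one of its edges separates {a, b} from {c, d}. The tree is assumed to be given as a list of bipartitions (one per internal edge), and the names `displays` and `quartet_score` are hypothetical. This only evaluates a given tree; it is not one of the amalgamation algorithms, which have to search for a good tree.

```python
def displays(tree_splits, quartet):
    """True if some bipartition of the tree separates {a, b} from {c, d}."""
    (a, b), (c, d) = quartet
    left, right = {a, b}, {c, d}
    return any(
        (left <= x and right <= y) or (left <= y and right <= x)
        for x, y in tree_splits
    )

def quartet_score(tree_splits, quartets):
    """Number of input quartet trees that the candidate tree agrees with."""
    return sum(displays(tree_splits, q) for q in quartets)

# Hypothetical candidate tree on taxa 1..5 with internal edges 12|345 and 123|45.
tree = [({1, 2}, {3, 4, 5}), ({1, 2, 3}, {4, 5})]
# Three input quartet trees: 12|34, 13|45, and 14|23.
quartets = [((1, 2), (3, 4)), ((1, 3), (4, 5)), ((1, 4), (2, 3))]
print(quartet_score(tree, quartets))  # 2
```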

Quartet amalgamation algorithms
–Quartet Puzzling (Strimmer and von Haeseler)
–Q* (Berry et al.)
–Quartet Cleaning (Berry et al.)
–Weight Optimization (Ranwez and Gascuel)
–Quartets MaxCut (Snir and Rao)
But see also the paper (St. John et al.) evaluating early quartet methods, on the CS 394C webpage.

What about rooted trees? Given a set of rooted source trees, we ask: Is there a tree with which all the rooted source trees agree?

Rooted tree compatibility
Aho, Sagiv, Szymanski, and Ullman: polynomial-time, recursive algorithm:
–If n=1, return the singleton tree.
–If n>1, then compute an equivalence relation on the set of taxa as follows. For each rooted triple ((a,b),c) in the set, put a and b in the same equivalence class. Compute the transitive closure.
–If there is only one equivalence class, reject (the set is incompatible). Otherwise, recurse on each equivalence class (using only the rooted triples whose taxa all lie in that class), and return the tree obtained by making all recursively computed trees sibling subtrees.
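The recursion can be sketched in a few lines. The version below is a minimal illustration that assumes the rooted source trees have already been broken into rooted triples ((a, b), c), that taxa are mutually comparable values such as strings, and that the output is a nested tuple. The function name `build` and these representations are choices made for the example. Equivalence classes are computed with a small union-find, which stands in for the transitive closure step.

```python
def build(taxa, triples):
    """Return a nested-tuple tree agreeing with every triple, or None if none exists."""
    taxa = set(taxa)
    if len(taxa) == 1:
        return next(iter(taxa))
    # Only triples whose three taxa all lie in the current subset matter here.
    relevant = [t for t in triples if {t[0][0], t[0][1], t[1]} <= taxa]

    # Union-find: merging a and b for each triple gives the transitive closure.
    parent = {x: x for x in taxa}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), _c in relevant:
        parent[find(a)] = find(b)

    classes = {}
    for x in taxa:
        classes.setdefault(find(x), set()).add(x)
    if len(classes) == 1:
        return None  # a single class: the triples are incompatible

    subtrees = []
    for block in sorted(classes.values(), key=min):
        sub = build(block, relevant)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)  # the recursively built trees become siblings

print(build("abcd", [(("a", "b"), "c"), (("c", "d"), "a")]))
print(build("abc", [(("a", "b"), "c"), (("b", "c"), "a")]))
```

The first call returns the tree ((a,b),(c,d)). The second returns None, because the two triples force a, b, and c into a single class.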

Subtree compatibility
If source trees are rooted, then compatibility can be tested in polynomial time. Optimization problems are NP-hard, however. If source trees are unrooted, then compatibility is NP-hard. And so optimization problems are also NP-hard.

Supertree problems, in practice
In practice, the most frequently used supertree method is MRP, for “Matrix Representation with Parsimony”. There are, however, many other supertree methods!

Many Supertree Methods
–MRP: Matrix Representation with Parsimony (most commonly used)
–weighted MRP
–Min-Cut
–Modified Min-Cut
–Semi-strict Supertree
–MRF
–MRD
–QILI
–SDM
–Q-imputation
–PhySIC
–Majority-Rule Supertrees
–Maximum Likelihood Supertrees
–and many more...

MRP
Idea: take every source tree, and replace it with a matrix of 0, 1, and ?. Concatenate the matrices. Apply Maximum Parsimony. If all the source trees are compatible, then an exact solution to MRP will return the compatibility trees.
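The matrix encoding step might look like the sketch below. It uses the usual MRP convention: one column per internal edge of each source tree, with 1 for taxa on one side of the edge, 0 for the other taxa of that source tree, and ? for taxa missing from that source tree. The input format (a taxon set plus one side of each internal edge) and the name `mrp_matrix` are assumptions made for this example, and the parsimony search on the resulting matrix is not shown.

```python
def mrp_matrix(source_trees):
    """Encode source trees as one concatenated 0/1/? character matrix.

    source_trees: list of (taxa, sides), where taxa is the taxon set of a
    source tree and sides lists one side of each of its internal edges.
    Returns a dict mapping every taxon to its row, written as a string.
    """
    all_taxa = sorted(set().union(*(taxa for taxa, _ in source_trees)))
    rows = {t: [] for t in all_taxa}
    for taxa, sides in source_trees:
        for side in sides:  # one matrix column per internal edge
            for t in all_taxa:
                if t not in taxa:
                    rows[t].append("?")  # taxon absent from this source tree
                elif t in side:
                    rows[t].append("1")
                else:
                    rows[t].append("0")
    return {t: "".join(chars) for t, chars in rows.items()}

# Two hypothetical source trees with overlapping taxon sets:
# ab|cd on {a, b, c, d} and bc|de on {b, c, d, e}.
sources = [
    ({"a", "b", "c", "d"}, [{"a", "b"}]),
    ({"b", "c", "d", "e"}, [{"b", "c"}]),
]
for taxon, row in mrp_matrix(sources).items():
    print(taxon, row)
```

Here taxon a gets the row "1?" because it is on the 1 side of the first tree's edge and absent from the second tree.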

Homework, due 9/15
Read two papers (linked on the webpage):
–St. John et al., about quartet-based methods
–Moret et al., about sequence-length requirements
Pick one, write a summary, and include questions.

Question! How do you feel about occasionally having class on some Monday or Friday, so we can have guest lectures?