Introduction to Phylogenetic Estimation Algorithms Tandy Warnow.

Slides:

Advertisements

Similar presentations

New methods for simultaneous estimation of trees and alignments Tandy Warnow The University of Texas at Austin.

Advertisements

CS 598AGB What simulations can tell us. Questions that simulations cannot answer Simulations are on finite data. Some questions (e.g., whether a method.

Challenges in computational phylogenetics Tandy Warnow Radcliffe Institute for Advanced Study University of Texas at Austin.

Computer Science and Reconstructing Evolutionary Trees Tandy Warnow Department of Computer Science University of Illinois at Urbana-Champaign.

Computational biology and computational biologists Tandy Warnow, UT-Austin Department of Computer Sciences Institute for Cellular and Molecular Biology.

Molecular Evolution Revised 29/12/06

BNFO 602 Phylogenetics Usman Roshan.

CIS786, Lecture 3 Usman Roshan.

Phylogeny reconstruction BNFO 602 Roshan. Simulation studies.

BNFO 602 Phylogenetics Usman Roshan. Summary of last time Models of evolution Distance based tree reconstruction –Neighbor joining –UPGMA.

CIS786, Lecture 4 Usman Roshan.

CIS786, Lecture 8 Usman Roshan Some of the slides are based upon material by Dennis Livesay and David.

Computational and mathematical challenges involved in very large-scale phylogenetics Tandy Warnow The University of Texas at Austin.

Combinatorial and graph-theoretic problems in evolutionary tree reconstruction Tandy Warnow Department of Computer Sciences University of Texas at Austin.

Phylogeny Estimation: Why It Is "Hard", and How to Design Methods with Good Performance Tandy Warnow Department of Computer Sciences University of Texas.

CIPRES: Enabling Tree of Life Projects Tandy Warnow The Program in Evolutionary Dynamics at Harvard University The University of Texas at Austin.

Phylogenetic Tree Reconstruction Tandy Warnow The Program in Evolutionary Dynamics at Harvard University The University of Texas at Austin.

Complexity and The Tree of Life Tandy Warnow The University of Texas at Austin.

Barking Up the Wrong Treelength Kevin Liu, Serita Nelesen, Sindhu Raghavan, C. Randal Linder, and Tandy Warnow IEEE TCCB 2009.

Computer Science Research for The Tree of Life Tandy Warnow Department of Computer Sciences University of Texas at Austin.

Rec-I-DCM3: A Fast Algorithmic Technique for Reconstructing Large Evolutionary Trees Usman Roshan Department of Computer Science New Jersey Institute of.

Bioinformatics 2011 Molecular Evolution Revised 29/12/06.

Software for Scientists Tandy Warnow Department of Computer Science University of Texas at Austin.

Benjamin Loyle 2004 Cse 397 Solving Phylogenetic Trees Benjamin Loyle March 16, 2004 Cse 397 : Intro to MBIO.

CS 173, Lecture B August 25, 2015 Professor Tandy Warnow.

394C: Algorithms for Computational Biology Tandy Warnow Sept 9, 2013.

394C, Spring 2013 Sept 4, 2013 Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT.

More statistical stuff CS 394C Feb 6, Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.

Statistical stuff: models, methods, and performance issues CS 394C September 16, 2013.

CS 394C Algorithms for Computational Biology Tandy Warnow Spring 2012.

CIPRES: Enabling Tree of Life Projects Tandy Warnow The University of Texas at Austin.

Algorithmic research in phylogeny reconstruction Tandy Warnow The University of Texas at Austin.

394C, October 2, 2013 Topics: Multiple Sequence Alignment

SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas.

Constructing the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign.

Using Divide-and-Conquer to Construct the Tree of Life Tandy Warnow University of Illinois at Urbana-Champaign.

598AGB Basics Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT.

Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.

CS 395T: Computational phylogenetics January 18, 2006 Tandy Warnow.

Statistical stuff: models, methods, and performance issues CS 394C September 3, 2009.

Sequence alignment CS 394C: Fall 2009 Tandy Warnow September 24, 2009.

Approaching multiple sequence alignment from a phylogenetic perspective Tandy Warnow Department of Computer Sciences The University of Texas at Austin.

SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas.

Simultaneous alignment and tree reconstruction Collaborative grant: Texas, Nebraska, Georgia, Kansas Penn State University, Huston-Tillotson, NJIT, and.

Chapter AGB. Today’s material Maximum Parsimony Fixed tree versions (solvable in polynomial time using dynamic programming) Optimal tree search.

The Tree of Life: Algorithmic and Software Challenges Tandy Warnow The University of Texas at Austin.

Absolute Fast Converging Methods CS 598 Algorithmic Computational Genomics.

394C: Algorithms for Computational Biology Tandy Warnow Jan 25, 2012.

Distance-based phylogeny estimation

Phylogenetic basis of systematics

394C, Spring 2012 Jan 23, 2012 Tandy Warnow.

Multiple Sequence Alignment Methods

Tandy Warnow Department of Computer Sciences

Techniques for MSA Tandy Warnow.

Algorithm Design and Phylogenomics

New methods for simultaneous estimation of trees and alignments

BNFO 602 Phylogenetics Usman Roshan.

BNFO 602 Phylogenetics – maximum parsimony

CS 581 Tandy Warnow.

CS 581 Tandy Warnow.

New methods for simultaneous estimation of trees and alignments

Texas, Nebraska, Georgia, Kansas

Ultra-Large Phylogeny Estimation Using SATé and DACTAL

CS 394C: Computational Biology Algorithms

September 1, 2009 Tandy Warnow

Algorithms for Inferring the Tree of Life

Sequence alignment CS 394C Tandy Warnow Feb 15, 2012.

Tandy Warnow The University of Texas at Austin

Tandy Warnow The University of Texas at Austin

New methods for simultaneous estimation of trees and alignments

Presentation transcript:

Introduction to Phylogenetic Estimation Algorithms Tandy Warnow

Questions What is a phylogeny? What data are used? What is involved in a phylogenetic analysis? What are the most popular methods? What is meant by “accuracy”, and how is it measured?

Phylogeny Orangutan GorillaChimpanzee Human From the Tree of the Life Website, University of Arizona

Data Biomolecular sequences: DNA, RNA, amino acid, in a multiple alignment Molecular markers (e.g., SNPs, RFLPs, etc.) Morphology Gene order and content These are “character data”: each character is a function mapping the set of taxa to distinct states (equivalence classes), with evolution modelled as a process that changes the state of a character

Data Biomolecular sequences: DNA, RNA, amino acid, in a multiple alignment Molecular markers (e.g., SNPs, RFLPs, etc.) Morphology Gene order and content These are “character data”: each character is a function mapping the set of taxa to distinct states (equivalence classes), with evolution modelled as a process that changes the state of a character

DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT TAGCCCATAGACTTAGCGCTTAGCACAAAGGGCAT TAGCCCTAGCACTT AAGACTT TGGACTTAAGGCCT AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT AGCGCTTAGCACAATAGACTTTAGCCCAAGGGCAT

Phylogeny Problem TAGCCCATAGACTTTGCACAATGCGCTTAGGGCAT UVWXY U VW X Y

Indels and substitutions at the DNA level …ACGGTGCAGTTACCA… MutationDeletion

Indels and substitutions at the DNA level …ACGGTGCAGTTACCA… MutationDeletion

Indels and substitutions at the DNA level …ACGGTGCAGTTACCA… MutationDeletion …ACCAGTCACCA…

…ACGGTGCAGTTACCA… …ACCAGTCACCA… Mutation Deletion The true pairwise alignment is: …ACGGTGCAGTTACCA… …AC----CAGTCACCA… The true multiple alignment on a set of homologous sequences is obtained by tracing their evolutionary history, and extending the pairwise alignments on the edges to a multiple alignment on the leaf sequences.

Easy Sequence Alignment B_WEAU160 ATGGAAAACAGATGGCAGGTGATGATTGTGTGGCAAGTAGACAGG 45 A_U A.....G A_IFA G A_92UG G A_Q C G B_SF B_LAI B_F B_HXB2R B_LW B_NL B_NY B_MN C C B_JRCSF B_JRFL B_NH G B_OYI B_CAM

Harder Sequence Alignment B_WEAU160 ATGAGAGTGAAGGGGATCAGGAAGAATTATCAGCACTTG 39 A_U T......ACA..G CTTG A_SF T......ACA..T...C.G...AA....A 39 A_92RW G......ACA..C..G..GG..AA A_92UG G.A....ACA..G.....GG A 35 A_92UG T......AGA..G CTTG..G. 35 A_TZ G..A...G.A..G A..A 39 A_UG275A....A..C..T.....CACA..T.....G...AA...G. 39 A_UG273A ACA..G.....GG A_DJ258A T......ACA CA.T...A 39 A_KENYA T.....CACA..G.....G A 39 A_CARGAN T......ACA A A_CARSAS CACA CTCT.C A_CAR A..CACA..G.....GG..CA A_CAR286A CACA..G.....GG..AA A_CAR A A A_CAR423A A A A_VI191A ACA..T.....GG..A

Multiple sequence alignment Objective: Estimate the “true alignment” (defined by the sequence of evolutionary events) Typical approach: 1.Estimate an initial tree 2.Estimate a multiple alignment by performing a “progressive alignment” up the tree, using Needleman- Wunsch (or a variant) to align alignments

UVWXYUVWXY U VW X Y AGTGGAT TATGCCCA TATGACTT AGCCCTA AGCCCGCTT

Input: unaligned sequences S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Phase 1: Multiple Sequence Alignment S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Phase 2: Construct tree S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 S4 S2 S3

So many methods!!! Alignment method Clustal POY (and POY*) Probcons (and Probtree) MAFFT Prank Muscle Di-align T-Coffee Satchmo Etc. Blue = used by systematists Purple = recommended by protein research community Phylogeny method Bayesian MCMC Maximum parsimony Maximum likelihood Neighbor joining UPGMA Quartet puzzling Etc.

So many methods!!! Alignment method Clustal POY (and POY*) Probcons (and Probtree) MAFFT Prank Muscle Di-align T-Coffee Satchmo Etc. Blue = used by systematists Purple = recommended by protein research community Phylogeny method Bayesian MCMC Maximum parsimony Maximum likelihood Neighbor joining UPGMA Quartet puzzling Etc.

So many methods!!! Alignment method Clustal POY (and POY*) Probcons (and Probtree) MAFFT Prank Muscle Di-align T-Coffee Satchmo Etc. Blue = used by systematists Purple = recommended by Edgar and Batzoglou for protein alignments Phylogeny method Bayesian MCMC Maximum parsimony Maximum likelihood Neighbor joining UPGMA Quartet puzzling Etc.

1.Polynomial time distance-based methods: UPGMA, Neighbor Joining, FastME, Weighbor, etc. 2. Hill-climbing heuristics for NP-hard optimization criteria (Maximum Parsimony and Maximum Likelihood) Phylogenetic reconstruction methods Phylogenetic trees Cost Global optimum Local optimum 3.Bayesian methods

UPGMA While |S|>2: find pair x,y of closest taxa; delete x Recurse on S-{x} Insert y as sibling to x Return tree a bc de

UPGMA a bc de Works when evolution is “clocklike”

UPGMA a bc de Fails to produce true tree if evolution deviates too much from a clock!

Performance criteria Running time. Space. Statistical performance issues (e.g., statistical consistency and sequence length requirements) “Topological accuracy” with respect to the underlying true tree. Typically studied in simulation. Accuracy with respect to a mathematical score (e.g. tree length or likelihood score) on real data.

Distance-based Methods

Additive Distance Matrices

Four-point condition A matrix D is additive if and only if for every four indices i,j,k,l, the maximum and median of the three pairwise sums are identical D ij +D kl < D ik +D jl = D il +D jk The Four-Point Method computes trees on quartets using the Four-point condition

Naïve Quartet Method Compute the tree on each quartet using the four-point condition Merge them into a tree on the entire set if they are compatible: –Find a sibling pair A,B –Recurse on S-{A} –If S-{A} has a tree T, insert A into T by making A a sibling to B, and return the tree

Better distance-based methods Neighbor Joining Minimum Evolution Weighted Neighbor Joining Bio-NJ DCM-NJ And others

Quantifying Error FN: false negative (missing edge) FP: false positive (incorrect edge) 50% error rate FN FP

Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001] Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides. Error rates reflect proportion of incorrect edges in inferred trees. NJ No. Taxa Error Rate

“Character-based” methods Maximum parsimony Maximum Likelihood Bayesian MCMC (also likelihood-based) These are more popular than distance- based methods, and tend to give more accurate trees. However, these are computationally intensive!

Standard problem: Maximum Parsimony (Hamming distance Steiner Tree) Input: Set S of n aligned sequences of length k Output: A phylogenetic tree T –leaf-labeled by sequences in S –additional sequences of length k labeling the internal nodes of T such that is minimized.

Maximum parsimony (example) Input: Four sequences –ACT –ACA –GTT –GTA Question: which of the three trees has the best MP scores?

Maximum Parsimony ACT GTTACA GTA ACA ACT GTA GTT ACT ACA GTT GTA

Maximum Parsimony ACT GTT GTA ACA GTA MP score = 5 ACA ACT GTA GTT ACAACT MP score = 7 ACT ACA GTT GTA ACAGTA MP score = 4 Optimal MP tree

Maximum Parsimony: computational complexity ACT ACA GTT GTA ACAGTA MP score = 4 Finding the optimal MP tree is NP-hard Optimal labeling can be computed in linear time O(nk)

But solving this problem exactly is … unlikely # of Taxa # of Unrooted Trees x x x

Local search strategies Phylogenetic trees Cost Global optimum Local optimum

Local search strategies Hill-climbing based upon topological changes to the tree Incorporating randomness to exit from local optima

Evaluating heuristics with respect to MP or ML scores Time Score of best trees Performance of Heuristic 1 Performance of Heuristic 2 Fake study

“Boosting” MP heuristics We use “Disk-covering methods” (DCMs) to improve heuristic searches for MP and ML DCM Base method MDCM-M

Rec-I-DCM3 significantly improves performance (Roshan et al.) Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset Current best techniques DCM boosted version of best techniques

Current methods Maximum Parsimony (MP): –TNT –PAUP* (with Rec-I-DCM3) Maximum Likelihood (ML) –RAxML (with Rec-I-DCM3) –GARLI –PAUP* Datasets with up to a few thousand sequences can be analyzed in a few days Portal at

UVWXYUVWXY U VW X Y AGTGGAT TATGCCCA TATGACTT AGCCCTA AGCCCGCTT But…

Phylogenetic reconstruction methods assume the sequences all have the same length. Standard models of sequence evolution used in maximum likelihood and Bayesian analyses assume sequences evolve only via substitutions, producing sequences of equal length. And yet, almost all nucleotide datasets evolve with insertions and deletions (“indels”), producing datasets that violate these models and methods. How can we reconstruct phylogenies from sequences of unequal length?

Basic Questions Does improving the alignment lead to an improved phylogeny? Are we getting good enough alignments from MSA methods? (In particular, is ClustalW - the usual method used by systematists - good enough?) Are we getting good enough trees from the phylogeny reconstruction methods? Can we improve these estimations, perhaps through simultaneous estimation of trees and alignments?

DNA sequence evolution Simulation using ROSE: 100 taxon model trees, models 1-4 have “long gaps”, and 5-8 have “short gaps”, site substitution is HKY+Gamma

Results Model difficulty