Published by Godfrey Sparks. Modified over 9 years ago.
1
Constructing the Tree of Life: Divide-and-Conquer!
Tandy Warnow
University of Illinois at Urbana-Champaign
2
Phylogeny (evolutionary tree)
[Figure: tree with leaves Orangutan, Gorilla, Chimpanzee, Human]
From the Tree of Life website, University of Arizona
3
Phylogenies and Applications
Basic Biology: How did life evolve?
Applications of phylogenies to:
– protein structure and function
– population genetics
– human migrations
– metagenomics
Figure from https://en.wikipedia.org/wiki/Common_descent
5
DNA Sequence Evolution
[Figure: a DNA sequence evolving down a tree, with a time axis marked -3 mil yrs, -2 mil yrs, -1 mil yrs, today. Sequences shown: AAGACTT; TGGACTT, AAGGCCT; AGGGCAT, TAGCCCT, AGCACTT; and at the leaves TAGCCCA, TAGACTT, AGCGCTT, AGCACAA, AGGGCAT]
6
Phylogenetic Tree Estimation
[Figure: five sequences (TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT) labeled U, V, W, X, Y, and a tree on the leaves U, V, W, X, Y]
7
However…
[Figure: the same leaves U, V, W, X, Y, now with sequences of different lengths: AGAT, TAGACTT, TGCACAA, TGCGCTT, AGGGCATGA]
8
Indels (insertions and deletions)
[Figure: the sequence …ACGGTGCAGTTACCA… evolves by a mutation and a deletion into …ACCAGTCACCA…]
9
The true multiple alignment
[Figure: the sequence …ACGGTGCAGTTACCA… evolves by a substitution, a deletion, and an insertion into …ACCAGTCACCTA…]
…ACGGTGCAGTTACC-A…
…AC----CAGTCACCTA…
– Reflects historical substitution, insertion, and deletion events
– Defined using transitive closure of pairwise alignments computed on edges of the true tree
10
Phylogenetic Tree Estimation
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
11
Input: unaligned sequences
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
12
Phase 1: Alignment
Unaligned input:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
Aligned output:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
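A quick sanity check on the Phase 1 slide: an alignment only inserts gaps, so every aligned row must have the same length, and deleting the gap characters must give back the original sequence. The sketch below checks this for the four sequences on the slide. The helper names `degap` and `is_valid_alignment` are made up for illustration and do not come from any alignment tool.

```python
# Sequences copied from the slide: unaligned input and aligned output.
unaligned = {
    "S1": "AGGCTATCACCTGACCTCCA",
    "S2": "TAGCTATCACGACCGC",
    "S3": "TAGCTGACCGC",
    "S4": "TCACGACCGACA",
}
aligned = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "S4": "-------TCAC--GACCGACA",
}


def degap(row: str) -> str:
    """Remove gap characters from one row of an alignment."""
    return row.replace("-", "")


def is_valid_alignment(aligned: dict, unaligned: dict) -> bool:
    """True if all rows have equal length and degap to the input sequences."""
    same_length = len({len(row) for row in aligned.values()}) == 1
    same_letters = all(degap(aligned[name]) == seq for name, seq in unaligned.items())
    return same_length and same_letters


print(is_valid_alignment(aligned, unaligned))
```

This only checks that the alignment is well formed. It says nothing about whether the alignment matches the true evolutionary history.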
13
Phase 2: Construct tree
Unaligned input:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
Alignment:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
[Figure: tree on the leaves S1, S4, S2, S3]
14
Two-phase estimation
Alignment methods:
– Clustal
– POY (and POY*)
– Probcons (and Probtree)
– Probalign
– MAFFT
– Muscle
– Di-align
– T-Coffee
– Prank (PNAS 2005, Science 2008)
– Opal (ISMB and Bioinf. 2007)
– FSA (PLoS Comp. Bio. 2009)
– Infernal (Bioinf. 2009)
– Etc.
Phylogeny methods:
– Bayesian MCMC
– Maximum parsimony
– Maximum likelihood
– Neighbor joining
– FastME
– UPGMA
– Quartet puzzling
– Etc.
15
Two-phase estimation (same alignment and phylogeny method lists as the previous slide)
RAxML: heuristic for large-scale ML optimization
16
Quantifying Error
– FN: false negative (missing edge)
– FP: false positive (incorrect edge)
[Figure: a true tree and an estimated tree with the FN and FP edges marked; 50% error rate]
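One common way to make these definitions concrete is to describe each internal edge of a tree by the split of the leaf set it induces, and then compare the two sets of splits. The sketch below takes that approach. The five-leaf trees are an invented example, not the trees on the slide, and it assumes that only internal edges are compared.

```python
def bipartition(side, leaves):
    """Order-independent form of the leaf split that an edge induces."""
    side = frozenset(side)
    return frozenset({side, frozenset(leaves) - side})


def error_rates(true_edges, estimated_edges):
    """Return (FN rate, FP rate) for two trees given as sets of bipartitions."""
    missing = true_edges - estimated_edges      # FN: true edges not recovered
    incorrect = estimated_edges - true_edges    # FP: estimated edges not in the true tree
    return len(missing) / len(true_edges), len(incorrect) / len(estimated_edges)


LEAVES = "ABCDE"
# Hypothetical true tree ((A,B),C,(D,E)): internal edges AB|CDE and ABC|DE.
true_tree = {bipartition("AB", LEAVES), bipartition("DE", LEAVES)}
# Hypothetical estimated tree ((A,B),D,(C,E)): internal edges AB|CDE and ABD|CE.
estimated_tree = {bipartition("AB", LEAVES), bipartition("CE", LEAVES)}

fn_rate, fp_rate = error_rates(true_tree, estimated_tree)
print(fn_rate, fp_rate)
```

In this example one of the two true edges is missing and one of the two estimated edges is wrong, so both rates are 50%.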
17
1000-taxon models, ordered by difficulty (Liu et al., Science 19 June 2009)
18
Multiple Sequence Alignment (MSA): a scientific grand challenge [1]
Unaligned input:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
…
Sn = TCACGACCGACA
Alignment:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
…
Sn = -------TCAC--GACCGACA
Novel techniques needed for scalability and accuracy:
– NP-hard problems and large datasets
– Current methods do not provide good accuracy
– Few methods can analyze even moderately large datasets
Many important applications besides phylogenetic estimation
[1] Frontiers in Massive Data Analysis, National Academies Press, 2013
19
1KP: Thousand Transcriptome Project
G. Ka-Shu Wong (U Alberta), N. Wickett (Northwestern), J. Leebens-Mack (U Georgia), N. Matasci (iPlant), T. Warnow, S. Mirarab, and N. Nguyen (UT-Austin)
– First publication: Wickett, Mirarab, et al., PNAS, 2014
– Used SATé (Liu et al., Science 2009 and Syst Biol 2012) to compute multiple sequence alignments and trees
– Used ASTRAL (Mirarab et al., Bioinf 2014 and 2015) to compute the species tree
Upcoming Challenge: Multiple sequence alignment and gene tree estimation on 100,000 sequences
20
Computational Phylogenetics (2005)
Courtesy of the Tree of Life web project, tolweb.org
Current methods can take months to estimate trees on 1000 DNA sequences
Our objective: More accurate trees and alignments on 500,000 sequences in under a week
21
Computational Phylogenetics (2015)
Courtesy of the Tree of Life web project, tolweb.org
– 1997-2001: Distance-based phylogenetic tree estimation from polynomial length sequences
– 2012: Computing accurate trees (almost) without multiple sequence alignments
– 2009-2015: Co-estimation of multiple sequence alignments and gene trees, now on 1,000,000 sequences in under two weeks
– 2014-2015: Species tree estimation from whole genomes in the presence of massive gene tree heterogeneity
23
Key technique: Divide-and-conquer!
In general, small datasets with not too much “heterogeneity” are easy to analyze with good accuracy.
24
Divide-and-Conquer
Divide-and-conquer is a basic algorithmic trick for solving problems!
Three steps:
– divide a dataset into two or more sets,
– solve the problem on each set, and
– combine solutions.
25
Sorting
10 3 54 23 75 5 1 25
Objective: sort this list of integers from smallest to largest.
10, 3, 54, 23, 75, 5, 1, 25 should become 1, 3, 5, 10, 23, 25, 54, 75
26
MergeSort
10 3 54 23 75 5 1 25
– Step 1: Divide into two sublists
– Step 2: Recursively sort each sublist
– Step 3: Merge the two sorted sublists
27
Step 1: break into two lists
X: 10 3 54 23
Y: 75 5 1 25
28
Step 2: sort the two lists
X: 3 10 23 54
Y: 1 5 25 75
29
Step 3: merge the sorted lists
X: 3 10 23 54
Y: 1 5 25 75
Result: (empty)
30
Merging (cont.)
X: 3 10 23 54
Y: 5 25 75
Result: 1
31
Merging (cont.)
X: 10 23 54
Y: 5 25 75
Result: 1 3
32
Merging (cont.)
X: 10 23 54
Y: 25 75
Result: 1 3 5
33
Merging (cont.)
X: 23 54
Y: 25 75
Result: 1 3 5 10
34
Merging (cont.)
X: 54
Y: 25 75
Result: 1 3 5 10 23
35
Merging (cont.)
X: 54
Y: 75
Result: 1 3 5 10 23 25
36
Merging (cont.)
X: (empty)
Y: 75
Result: 1 3 5 10 23 25 54
37
Merging (cont.)
X: (empty)
Y: (empty)
Result: 1 3 5 10 23 25 54 75
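The three MergeSort steps walked through above can be written as a short Python sketch. The function names `merge` and `merge_sort` are chosen here for illustration. The merge step repeatedly moves the smaller front element of X or Y to the result, as in the slides.

```python
def merge(x, y):
    """Merge two already-sorted lists into one sorted list."""
    result = []
    i = j = 0
    # Repeatedly take the smaller of the two front elements.
    while i < len(x) and j < len(y):
        if x[i] <= y[j]:
            result.append(x[i])
            i += 1
        else:
            result.append(y[j])
            j += 1
    # One list is exhausted; append whatever remains of the other.
    result.extend(x[i:])
    result.extend(y[j:])
    return result


def merge_sort(items):
    """Sort a list: divide into two sublists, sort each, merge the results."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))


print(merge_sort([10, 3, 54, 23, 75, 5, 1, 25]))
```

Calling `merge([3, 10, 23, 54], [1, 5, 25, 75])` reproduces the merge from the slides: the result grows as 1, then 1 3, then 1 3 5, and so on.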
39
SATé and PASTA
Input: set of unaligned sequences
Output: multiple sequence alignment and phylogenetic tree
– SATé: Liu et al., Science 2009 (up to 10,000 sequences) and Systematic Biology 2012 (up to 50,000 sequences)
– PASTA: Mirarab et al., J. Comp Biol 2015 (up to 1,000,000 sequences)
40
1000-taxon models, ordered by difficulty (Liu et al., Science 19 June 2009)
41
Re-aligning on a tree
[Figure: a cycle of four steps on a tree with leaves A, B, C, D]
– Decompose dataset (into AB and CD)
– Align subproblems (AB, CD)
– Merge sub-alignments (ABCD)
– Estimate ML tree on merged alignment
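The "decompose dataset" step uses the current tree to split the leaves into smaller, closely related subsets. The sketch below is a deliberately simplified version: it represents a rooted tree as nested tuples and splits at the root until every subset is small enough. The published methods use their own decomposition rules, which the slide does not spell out, so treat the functions `leaves` and `decompose` and the `max_size` parameter as illustrative assumptions.

```python
def leaves(tree):
    """List the leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


def decompose(tree, max_size):
    """Split the leaf set along the tree until each subset has at most max_size leaves."""
    names = leaves(tree)
    if isinstance(tree, str) or len(names) <= max_size:
        return [names]
    subsets = []
    for child in tree:
        # Each child subtree becomes its own (possibly further divided) subproblem.
        subsets.extend(decompose(child, max_size))
    return subsets


tree = (("A", "B"), ("C", "D"))
print(decompose(tree, max_size=2))
```

On the four-leaf tree from the slide this gives the subsets AB and CD, which would then be aligned separately and merged.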
42
SATé and PASTA Algorithms
[Figure, built up over four slides: a loop between a Tree and an Alignment]
– Obtain initial alignment and estimated ML tree
– Use tree to compute new alignment
– Estimate ML tree on new alignment
– Repeat until termination condition, and return the alignment/tree pair with the best ML score
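The loop on this slide can be sketched as a small driver that alternates between re-aligning and re-estimating the tree while remembering the best-scoring pair. This is only a skeleton under stated assumptions: `realign`, `estimate_tree`, and `score` are hypothetical callables supplied by the caller, not functions from SATé or PASTA, and the termination condition is assumed to be a fixed number of iterations.

```python
def iterate_coestimation(initial_alignment, initial_tree,
                         realign, estimate_tree, score, max_iterations=3):
    """Alternate alignment and tree estimation; return the best-scoring pair."""
    best_alignment, best_tree = initial_alignment, initial_tree
    best_score = score(initial_alignment, initial_tree)
    tree = initial_tree
    for _ in range(max_iterations):
        alignment = realign(tree)          # use tree to compute new alignment
        tree = estimate_tree(alignment)    # estimate tree on new alignment
        current = score(alignment, tree)
        if current > best_score:           # keep the best pair seen so far
            best_alignment, best_tree, best_score = alignment, tree, current
    return best_alignment, best_tree


# Toy stand-ins: integers play the role of alignments and trees.
result = iterate_coestimation(
    initial_alignment=0,
    initial_tree=1,
    realign=lambda tree: tree + 1,
    estimate_tree=lambda alignment: alignment * 2,
    score=lambda alignment, tree: -abs(tree - 10),
)
print(result)
```

Returning the best pair rather than the last one matters because the score is not guaranteed to improve on every iteration. In the toy run the third iteration scores worse than the second.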
46
SATé: 24-hour co-estimation of highly accurate alignments and trees on 1000 sequences
1000-taxon models, ordered by difficulty (Liu et al., Science 19 June 2009)
24-hour SATé analysis, on desktop machines
(Similar improvements for biological datasets)
47
SATé-2: even more accurate!
(Liu et al., Syst Biol 61(1):90-106, 2012)
48
PASTA: even more accurate, and can scale to 1,000,000 sequences
– Simulated RNASim datasets from 10K to 200K taxa
– Limited to 24 hours using 12 CPUs
– Not all methods could run (missing bars could not finish)
PASTA, Mirarab et al., J Comp Biol 22(5): 377-386 (2015)
49
Avian Phylogenomics Project
E. Jarvis (HHMI), G. Zhang (BGI), MTP Gilbert (Copenhagen), S. Mirarab (UT-Austin), Md. S. Bayzid (UT-Austin), T. Warnow (UT-Austin), plus many many other people…
First analysis (Jarvis, Mirarab, et al., Science 2014):
– Approx. 50 species, 14,000 loci
– Used SATé for gene sequence alignment and tree estimation
Next analysis will have more species, and will use PASTA
50
1KP: Thousand Transcriptome Project
G. Ka-Shu Wong (U Alberta), N. Wickett (Northwestern), J. Leebens-Mack (U Georgia), N. Matasci (iPlant), T. Warnow, S. Mirarab, and N. Nguyen (UT-Austin)
First analysis (Wickett, Mirarab, et al., PNAS, 2014):
– About 100 species and 800 loci
– Used SATé
Next analysis will be much larger and more difficult:
– Multiple sequence alignment and gene tree estimation on 100,000 sequences, many datasets highly fragmentary
– Will use PASTA and UPP (Nguyen et al., Genome Biology 2015)
52
“Boosters”, or “Meta-Methods”
Meta-methods use divide-and-conquer and iteration (or other techniques) to “boost” the performance of base methods (phylogeny reconstruction, alignment estimation, etc.)
[Figure: a meta-method applied to a base method M yields M*]
53
Main Points
– Innovative algorithm design can improve accuracy as well as reduce running time.
– Divide-and-conquer is a key algorithmic technique that has dramatically changed the toolkit for biologists!
54
Acknowledgments
Funding: Guggenheim Foundation, Packard Foundation, NSF, Microsoft Research New England, David Bruton Jr. Centennial Professorship, Grainger Foundation, and TACC (Texas Advanced Computing Center)
55
Avian Phylogenomics Project
E. Jarvis (HHMI), G. Zhang (BGI), MTP Gilbert (Copenhagen), S. Mirarab (UT-Austin), Md. S. Bayzid (UT-Austin), T. Warnow (UT-Austin), plus many many other people…
Jarvis, Mirarab, et al., Science 2014
Major challenge:
– Massive gene tree heterogeneity consistent with incomplete lineage sorting
– Very poor resolution in the 14,000 gene trees
– Standard coalescent-based species tree estimation methods had poor accuracy
Solution: New technique to improve coalescent-based species tree estimation (statistical binning, Mirarab et al., Science 2014)