Techniques for MSA Tandy Warnow.

Multiple Sequence Alignment (MSA): another grand challenge [1]
Unaligned sequences:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
…
Sn = TCACGACCGACA
Aligned sequences:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT GACCGC--
…
Sn = TCAC--GACCGACA
• Novel techniques needed for scalability and accuracy
  • NP-hard problems and large datasets
  • Current methods do not provide good accuracy
  • Few methods can analyze even moderately large datasets
• Many important applications besides phylogenetic estimation
[1] Frontiers in Massive Data Analysis, National Academies Press, 2013

BeeTLe: a heuristic for GTA
• BeeTLe is designed to improve on POY (the leading software) for optimizing GTA.
• The gap penalty function impacts accuracy, with affine better than simple.
• Alignments obtained using GTA are not very good; there are many better methods, including SATe, SATe-II, Opal, and MAFFT.
• Maximum likelihood (ML) produces more accurate trees than maximum parsimony (MP).
• ML trees on good alignments are better than BeeTLe trees.
See Liu and Warnow, PLOS One 2012.

Today
• Simulation studies and what they revealed about MSA methods
• MSA techniques: progressive alignment, divide-and-conquer, iteration
• New MSA methods with improved accuracy

Impact of guide tree
Most MSA methods use progressive alignment. Hence, there is a potential for the guide tree to impact the final alignment. Many authors have studied this issue; here is our take on it (Nelesen et al., PSB 2008).

Simulation Studies
Unaligned sequences:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
True tree and alignment:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT GACCGC--
S4 = TCAC--GACCGACA
Estimated tree and alignment:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-C--T-----GACCGC--
S4 = T---C-A-CGACCGA----CA
Compare the estimated tree and alignment to the true tree and alignment.
[Figure: true and estimated trees on leaves S1 to S4]

Alignment Error/Accuracy
• SPFN: percentage of homologies in the true alignment that are not recovered (false negative homologies)
• SPFP: percentage of homologies in the estimated alignment that are false (false positive homologies)
• TC: total number of columns correctly recovered
• SP-score: percentage of homologies in the true alignment that are recovered
• Pairs score: 1 - (average of SPFN and SPFP)
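The criteria above can be sketched in a few lines of Python. This is a minimal illustration rather than the scoring code used in the studies: the function names are hypothetical, scores are returned as fractions rather than percentages, both alignments are assumed to contain at least one homology, and TC is assumed to count only columns with at least two residues.

```python
from itertools import combinations


def residue_columns(alignment):
    """List the residues in each column of an alignment.

    `alignment` maps a sequence name to its gapped row; all rows must have
    the same length. A residue is identified by (name, position in the
    ungapped sequence), so it can be matched across two alignments.
    """
    position = {name: 0 for name in alignment}
    width = len(next(iter(alignment.values())))
    cols = []
    for c in range(width):
        col = []
        for name, row in alignment.items():
            if row[c] != "-":
                col.append((name, position[name]))
                position[name] += 1
        cols.append(frozenset(col))
    return cols


def homologies(alignment):
    """Set of residue pairs that the alignment places in the same column."""
    pairs = set()
    for col in residue_columns(alignment):
        pairs.update(combinations(sorted(col), 2))
    return pairs


def alignment_scores(true_aln, est_aln):
    """Compare an estimated alignment to the true one (fractions, not percentages)."""
    true_pairs, est_pairs = homologies(true_aln), homologies(est_aln)
    shared = true_pairs & est_pairs
    spfn = 1 - len(shared) / len(true_pairs)
    spfp = 1 - len(shared) / len(est_pairs)
    # Assumption: only columns with at least two residues count toward TC.
    true_cols = {c for c in residue_columns(true_aln) if len(c) >= 2}
    est_cols = {c for c in residue_columns(est_aln) if len(c) >= 2}
    return {
        "SPFN": spfn,
        "SPFP": spfp,
        "SP-score": 1 - spfn,
        "TC": len(true_cols & est_cols),
        "Pairs score": 1 - (spfn + spfp) / 2,
    }
```

For example, with true alignment {"A": "AC-G", "B": "ACTG"} and estimated alignment {"A": "ACG-", "B": "ACTG"}, two of the three true homologies are recovered, so SPFN and SPFP are both 1/3.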

FN: false negative (missing edge)
FP: false positive (incorrect edge)
[Figure: example tree comparison with a 50% error rate]
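The FN and FP rates can be computed by comparing the internal edges (bipartitions of the leaf set) of the two trees. The sketch below is an illustration under stated assumptions: trees are given as nested tuples of leaf names, the function names are hypothetical, and both trees are assumed to have at least one internal edge.

```python
def leaf_set(tree):
    """All leaf names in a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Nontrivial leaf bipartitions (internal edges) of the tree.

    Each bipartition is stored as the side that does not contain a fixed
    anchor leaf, so the result does not depend on how the tree is rooted.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        # Keep only edges with at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        return below

    visit(tree)
    return splits


def tree_error(true_tree, est_tree):
    """Return (FN rate, FP rate) of the estimated tree against the true tree."""
    true_splits, est_splits = bipartitions(true_tree), bipartitions(est_tree)
    fn = len(true_splits - est_splits) / len(true_splits)  # missing edges
    fp = len(est_splits - true_splits) / len(est_splits)   # incorrect edges
    return fn, fp
```

With the hypothetical trees (("A", "B"), "C", ("D", "E")) as truth and (("A", "B"), "D", ("C", "E")) as estimate, one of the two true edges is missing and one of the two estimated edges is incorrect, so both rates are 50%.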

Nelesen et al., PSB 2008 Pacific Symposium on Biocomputing, 2008
• MSA methods: ClustalW, Muscle, Probcons, MAFFT, and FTA (Fixed Tree Alignment, using POY on the guide tree)
• Guide trees: the default for each method, two different UPGMA trees, and Probtree (ML on the Probcons+GT alignment)
• Examined results on simulated datasets with respect to alignment error and tree error

Figure from Nelesen et al., Pacific Symposium on Biocomputing, 2008

Impact of Guide Tree on Alignment Error

Impact of Guide Tree on Tree Error

Figure from Nelesen et al., Pacific Symposium on Biocomputing, 2008

Observations
• The choice of guide tree can have a big impact on tree error, but less so on “alignment error” (as measured using sum-of-pairs).
• Some methods are greatly improved using topologically more accurate guide trees; others are less so.
• Note the interesting behavior of FTA (POY) when the guide tree is the true tree, compared to ML on the true alignment!

Observations
• Guide tree choice did not seem to affect alignment SP error.
• Guide tree choice affected tree error, but the impact depended on dataset size (25 vs. 100 taxa) and MSA method.
• Probcons is strongly affected by the guide tree (and that may be because its own default guide tree is poorly chosen).
• FTA is strongly affected by the guide tree. Note that FTA on the true tree is MORE accurate than ML on the true alignment.
• For analyses of 100-taxon datasets, Probtree is a good guide tree.

The SATe family of methods
• Liu et al., Science 2009 introduced SATe, a technique to co-estimate alignments and trees on large datasets.
• Subsequently improved in SATe-II and PASTA.
• Basic techniques: divide-and-conquer, plus iteration.

Two-phase estimation
Alignment methods:
• Clustal
• POY
• Probcons (and Probtree)
• Probalign
• MAFFT
• Muscle
• Di-align
• T-Coffee
• Prank (PNAS 2005, Science 2008)
• Opal (ISMB and Bioinf. 2007)
• FSA (PLoS Comp. Bio. 2009)
• Infernal (Bioinf. 2009)
• Etc.
Phylogeny methods:
• Bayesian MCMC
• Maximum parsimony
• Maximum likelihood
• Neighbor joining
• FastME
• UPGMA
• Quartet puzzling
• Etc.

1000-taxon models, ordered by difficulty (roughly, rate of evolution), from Liu et al., Science 2009

What we know so far
• Alignment methods vary dramatically in accuracy, and increasing the rate of evolution increases the error rates.
• Alignment error impacts tree error.
• Some big datasets are easy to align.

SATe “Family” Iterative divide-and-conquer methods
• Each iteration uses the current tree with divide-and-conquer to produce an alignment (running preferred MSA methods on subsets, and aligning alignments together).
• Each iteration computes an ML tree on the current alignment, under Markov models of evolution that do not consider indels.

SATé Algorithm
1. Obtain initial alignment and estimated ML tree
2. Use tree to compute new alignment
3. Estimate ML tree on new alignment
[Figure: cycle alternating between Tree and Alignment]

SATé-1 iteration (actual decomposition produces 32 subproblems)
1. Decompose based on input tree
2. Align subproblems
3. Merge subproblems
4. Estimate ML tree on merged alignment
[Figure labels: A, B, C, D, e]
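The iteration above can be sketched as a loop that takes each step as a pluggable function. This is only an outline of the control flow: `coestimate` and the `toy_*` stand-ins are hypothetical names, and the stand-ins do no real alignment or ML tree estimation. They simply make the loop runnable.

```python
def coestimate(sequences, tree, decompose, align, merge, estimate_tree,
               iterations=3):
    """Alternate between re-aligning (guided by the tree) and re-estimating the tree."""
    alignment = None
    for _ in range(iterations):
        subsets = decompose(tree)                        # decompose based on tree
        parts = [align({name: sequences[name] for name in subset})
                 for subset in subsets]                  # align subproblems
        alignment = merge(parts)                         # merge subproblems
        tree = estimate_tree(alignment)                  # ML tree in the real methods
    return alignment, tree


# Toy stand-ins (assumptions for illustration only).
def toy_decompose(tree):
    # Split the leaf names in half; the real methods split along tree edges.
    names = sorted(tree)
    mid = len(names) // 2
    return [names[:mid], names[mid:]]


def toy_align(seqs):
    # Right-pad with gaps; a real run would call an MSA method here.
    width = max(len(s) for s in seqs.values())
    return {name: s.ljust(width, "-") for name, s in seqs.items()}


def toy_merge(parts):
    # Pool the rows and pad to a common width; real merging aligns alignments.
    merged = {name: row for part in parts for name, row in part.items()}
    return toy_align(merged)


def toy_tree(alignment):
    # Placeholder "tree": just the sorted leaf names.
    return tuple(sorted(alignment))


seqs = {"S1": "AGGCT", "S2": "TAG", "S3": "TAGCTG", "S4": "TC"}
alignment, tree = coestimate(seqs, tuple(seqs), toy_decompose, toy_align,
                             toy_merge, toy_tree, iterations=2)
```

Passing the steps in as functions reflects the point that the subset aligner and the tree estimator are interchangeable components within the same iterative framework.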

1000 taxon models, ordered by difficulty
• For moderate-to-difficult datasets, SATe gets better trees and alignments than all other estimation methods, close to what you might get if you had access to the true alignment.
• This opens up a new realm of possibility: datasets currently considered “unalignable” can in fact be aligned reasonably well.
• It also opens up the feasibility of accurate estimation of deep evolutionary histories using a wider range of markers.
• TRANSITION: can we do better? What about smaller simulated datasets? And what about biological datasets?
[Figure: 24-hour SATé-I analysis, using MAFFT to align the subsets. Similar improvements for biological datasets.]

SATe Family
• SATe-I (2009): up to about 10,000 sequences; good accuracy and reasonable speed; “center-tree” decomposition
• SATe-II (2012): up to about 50,000 sequences; improved accuracy and speed; centroid-edge recursive decomposition
• PASTA (2014): up to 1,000,000 sequences; combines centroid-edge decomposition with transitivity merge
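A centroid-edge recursive decomposition can be sketched as follows. This is a minimal illustration, not the SATe-II or PASTA code. It assumes the centroid edge is the edge whose removal splits the leaves most evenly, represents the unrooted tree as an adjacency dictionary with string node names, and requires `max_size` to be at least 1. The function names are hypothetical.

```python
def component(adj, start, blocked):
    """Nodes on start's side of the tree once edge (start, blocked) is removed."""
    seen = {start, blocked}
    stack = [start]
    while stack:
        node = stack.pop()
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                stack.append(nbr)
    seen.discard(blocked)
    return seen


def centroid_decompose(adj, leaves, max_size):
    """Recursively cut the most balanced edge until every subset has at most max_size leaves."""
    leaves = set(leaves) & set(adj)
    if len(leaves) <= max_size:
        return [sorted(leaves)]
    best_gap, best_side = None, None
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge once
                side = component(adj, u, v)
                # Difference between the leaf counts on the two sides of the edge.
                gap = abs(len(leaves) - 2 * len(side & leaves))
                if best_gap is None or gap < best_gap:
                    best_gap, best_side = gap, side
    # Build the two subtrees left after removing the chosen edge.
    near = {n: [m for m in adj[n] if m in best_side] for n in best_side}
    far = {n: [m for m in adj[n] if m not in best_side]
           for n in adj if n not in best_side}
    return (centroid_decompose(near, leaves, max_size)
            + centroid_decompose(far, leaves, max_size))


# Four-leaf tree: a and b hang off x, c and d hang off y, and x is joined to y.
tree = {"a": ["x"], "b": ["x"], "c": ["y"], "d": ["y"],
        "x": ["a", "b", "y"], "y": ["c", "d", "x"]}
subsets = centroid_decompose(tree, ["a", "b", "c", "d"], max_size=2)
```

On this toy tree the central edge between x and y is removed, giving the subsets [a, b] and [c, d]. Each subset would then be aligned separately and the subset alignments merged.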

SATé-I vs. SATé-II
• SATé-II is faster and more accurate than SATé-I.
• Longer analyses, or use of ML to select the tree/alignment pair, give slightly better results.

PASTA: even better than SATé-2
• Simulated RNASim datasets from 10K to 200K taxa
• Limited to 24 hours using 12 CPUs
• Not all methods could run (missing bars indicate methods that could not finish)

PASTA Running Time and Scalability
• One iteration
• Using 12 CPUs on 1 node of Lonestar (TACC)
• Maximum 24 GB memory
• Showing wall-clock running time: ~1 hour for 10K taxa, ~17 hours for 200K taxa

Boosting BAli-Phy
Nute and Warnow, RECOMB Comparative Genomics and BMC Genomics 2016
Used BAli-Phy within PASTA and got it to scale to 1000 sequences with high accuracy!

Observations about SATe family
• Divide-and-conquer improves alignment estimation, and leads to improved trees.
• Iteration also helps (the first few iterations are the most important).
• All MSA methods examined improve using these techniques.
• Ultra-large datasets can be analyzed efficiently with high accuracy!

Results for co-estimation methods
• Optimizing treelength (POY and BeeTLe) doesn’t produce good alignments, and trees are not as good as those obtained using ML on standard MSA methods.
• Statistical co-estimation of alignments and trees under models of evolution that include indels can produce highly accurate alignments and trees, but running time is a big issue.
• SATé and PASTA are iterative techniques for co-estimating alignments and trees, and produce good results… but have no statistical guarantees.

General trends
• Treelength-based optimization is currently not as accurate as some standard techniques (e.g., ML on MAFFT alignments).
• Many methods give excellent results on small datasets (Probcons, Probalign, BAli-Phy, etc.), but most are not in use because of dataset size limitations.
• Large datasets best using PASTA or UPP? (maybe)
• Co-estimation under statistical models might be the way to go, IF…

Research Projects
• Design your own MSA method, or just modify an existing one in some simple way (e.g., different guide tree)
• Test existing MSA methods with respect to different criteria (e.g., extend Prank study to more methods and datasets)
• Develop different MSA criteria that are more appropriate than TC, SPFN, SPFP
• Compare different MSA methods on some biological dataset
• Parallelize some MSA method
• Consider how to combine MSAs on the same input
