Techniques for MSA Tandy Warnow.

Multiple Sequence Alignment (MSA): another grand challenge [1]
Unaligned sequences:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
…
Sn = TCACGACCGACA
Aligned sequences:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
…
Sn = -------TCAC--GACCGACA
• Novel techniques needed for scalability and accuracy
• NP-hard problems and large datasets
• Current methods do not provide good accuracy
• Few methods can analyze even moderately large datasets
• Many important applications besides phylogenetic estimation
[1] Frontiers in Massive Data Analysis, National Academies Press, 2013

BeeTLe: a heuristic for GTA (Generalized Tree Alignment)
• BeeTLe is designed to improve on POY (the leading software) for optimizing GTA.
• The gap penalty function impacts accuracy, with Affine better than Simple.
• Alignments using GTA are not very good; there are many better methods, including SATé, SATé-II, Opal, and MAFFT.
• Maximum likelihood (ML) produces more accurate trees than maximum parsimony (MP).
• ML trees on good alignments are better than BeeTLe trees.
• See Liu and Warnow, PLOS One 2012.

Today
• Simulation studies and what they revealed about MSA methods
• MSA techniques: progressive alignment, divide-and-conquer, iteration
• New MSA methods with improved accuracy

Impact of guide tree
• Most MSA methods use progressive alignment.
• Hence, the guide tree has the potential to impact the final alignment.
• Many authors have studied this issue; here is our take on it (Nelesen et al., PSB 2008).

Simulation Studies
[Figure: the true tree and alignment are compared with the estimated tree and alignment computed from the unaligned sequences.]
Unaligned sequences:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
True tree and alignment:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
Estimated tree and alignment:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-C--T-----GACCGC--
S4 = T---C-A-CGACCGA----CA

Alignment Error/Accuracy
• SPFN: percentage of homologies in the true alignment that are not recovered (false negative homologies)
• SPFP: percentage of homologies in the estimated alignment that are false (false positive homologies)
• TC: total number of columns correctly recovered
• SP-score: percentage of homologies in the true alignment that are recovered
• Pairs score: 1 - (average of SPFN and SPFP)
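
The alignment criteria above can be computed directly from a true and an estimated alignment. The following is a minimal sketch, not the code used in the studies cited here. It assumes alignments are given as dictionaries mapping sequence names to gapped strings, reports fractions rather than percentages, and (as an assumption of this sketch) counts only columns with at least two residues toward TC. The function names are hypothetical.

```python
from itertools import combinations

def homologies(alignment):
    """Return (pairs, columns) for an alignment given as {name: gapped string}.

    A residue is identified by (sequence name, index in the ungapped sequence),
    so the same residue can be matched across two different alignments.
    """
    names = sorted(alignment)
    length = len(alignment[names[0]])
    position = {name: 0 for name in names}
    pairs, columns = set(), set()
    for col in range(length):
        residues = []
        for name in names:
            if alignment[name][col] != "-":
                residues.append((name, position[name]))
                position[name] += 1
        pairs.update(combinations(residues, 2))   # all homologous pairs in this column
        if len(residues) >= 2:                    # assumption: ignore single-residue columns
            columns.add(frozenset(residues))
    return pairs, columns

def alignment_error(true_aln, est_aln):
    """Compute SPFN, SPFP, SP-score, TC and Pairs score as defined on the slide."""
    true_pairs, true_cols = homologies(true_aln)
    est_pairs, est_cols = homologies(est_aln)
    spfn = len(true_pairs - est_pairs) / len(true_pairs)
    spfp = len(est_pairs - true_pairs) / len(est_pairs)
    return {
        "SPFN": spfn,
        "SPFP": spfp,
        "SP": 1 - spfn,
        "TC": len(true_cols & est_cols),
        "Pairs": 1 - (spfn + spfp) / 2,
    }

true_aln = {"s1": "ACG", "s2": "A-G"}
est_aln = {"s1": "ACG", "s2": "AG-"}
scores = alignment_error(true_aln, est_aln)
```

In this toy case the estimated alignment recovers one of the two true homologies and introduces one false one, so SPFN and SPFP are both 0.5 and one column is recovered exactly.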

[Figure: tree comparison with FN and FP edges marked; 50% error rate]
• FN: false negative (missing edge)
• FP: false positive (incorrect edge)
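
The FN and FP rates can be sketched by comparing the internal edges (non-trivial bipartitions) of the two trees. This is an illustrative sketch only: it assumes trees are written as nested tuples of leaf names, both trees share the same leaf set, and the function names are hypothetical.

```python
def clades(tree):
    """Return (leaf set, list of leaf sets below each internal node)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found

def bipartitions(tree):
    """Non-trivial bipartitions, each stored as the side that excludes a fixed anchor leaf."""
    all_leaves, found = clades(tree)
    anchor = min(all_leaves)
    splits = set()
    for side in found:
        other = all_leaves - side
        if len(side) < 2 or len(other) < 2:
            continue  # trivial split: every tree on these leaves has it
        splits.add(other if anchor in side else side)
    return splits

def tree_error(true_tree, est_tree):
    """Return (FN rate, FP rate): missing edges and incorrect edges."""
    true_splits, est_splits = bipartitions(true_tree), bipartitions(est_tree)
    fn = len(true_splits - est_splits) / len(true_splits) if true_splits else 0.0
    fp = len(est_splits - true_splits) / len(est_splits) if est_splits else 0.0
    return fn, fp

true_tree = (("A", "B"), "C", ("D", "E"))
est_tree = (("A", "C"), "B", ("D", "E"))
fn_rate, fp_rate = tree_error(true_tree, est_tree)
```

Here each tree has two internal edges and they share one of them, so both rates are 0.5.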

Nelesen et al., Pacific Symposium on Biocomputing (PSB), 2008
• MSA methods: ClustalW, Muscle, Probcons, MAFFT, and FTA (Fixed Tree Alignment, using POY on the guide tree)
• Guide trees: the default for each method, two different UPGMA trees, and Probtree (ML on the Probcons+GT alignment)
• Examined results on simulated datasets with respect to alignment error and tree error

Figure from Nelesen et al., Pacific Symposium on Biocomputing, 2008

Impact of Guide Tree on Alignment Error

Impact of Guide Tree on Tree Error

Figure from Nelesen et al., Pacific Symposium on Biocomputing, 2008

Observations
• The choice of guide tree can have a big impact on tree error, but less so on “alignment error” (as measured using sum-of-pairs).
• Some methods are greatly improved by topologically more accurate guide trees; others are less so.
• Note the interesting behavior of FTA (POY) when the guide tree is the true tree, compared to ML on the true alignment!

Observations
• Guide tree choice did not seem to affect alignment SP error.
• Guide tree choice affected tree error, but the impact depended on dataset size (25 vs. 100 taxa) and MSA method.
• Probcons is strongly impacted by the guide tree (possibly because its own default guide tree is poorly chosen).
• FTA is strongly impacted by the guide tree. Note that FTA on the true tree is MORE accurate than ML on the true alignment.
• For analyses of 100-taxon datasets, Probtree is a good guide tree.

The SATé family of methods
• Liu et al., Science 2009 introduced SATé, a technique to co-estimate alignments and trees on large datasets.
• Subsequently improved in SATé-II and PASTA.
• Basic techniques: divide-and-conquer, plus iteration.

Two-phase estimation
Alignment methods:
• Clustal
• POY
• Probcons (and Probtree)
• Probalign
• MAFFT
• Muscle
• Di-align
• T-Coffee
• Prank (PNAS 2005, Science 2008)
• Opal (ISMB and Bioinf. 2007)
• FSA (PLoS Comp. Bio. 2009)
• Infernal (Bioinf. 2009)
• Etc.
Phylogeny methods:
• Bayesian MCMC
• Maximum parsimony
• Maximum likelihood
• Neighbor joining
• FastME
• UPGMA
• Quartet puzzling
• Etc.

1000-taxon models, ordered by difficulty (roughly, rate of evolution), from Liu et al., Science 2009

What we know so far
• Alignment methods vary dramatically in accuracy, and increasing the rate of evolution increases the error rates.
• Alignment error impacts tree error.
• Some big datasets are easy to align.

SATé “Family”
• Iterative divide-and-conquer methods
• Each iteration uses the current tree with divide-and-conquer to produce an alignment (running preferred MSA methods on subsets, and aligning alignments together).
• Each iteration computes an ML tree on the current alignment, under Markov models of evolution that do not consider indels.

SATé Algorithm
• Obtain initial alignment and estimated ML tree.
• Repeat: use the tree to compute a new alignment, then estimate an ML tree on the new alignment.

SATé-I iteration (actual decomposition produces 32 subproblems)
• Decompose based on input tree
• Align subproblems
• Merge subproblems
• Estimate ML tree on merged alignment
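
The control flow of one such iteration, repeated, can be sketched as below. This is only a skeleton under stated assumptions: `co_estimate` and its four callable parameters are hypothetical names, standing in for the decomposition rule, the subset aligner (e.g., MAFFT), the alignment merger, and the ML tree estimator. It is not the SATé implementation.

```python
def co_estimate(sequences, tree, decompose, align_subset, merge, estimate_tree,
                iterations=3):
    """Alternate between re-aligning with the current tree and re-estimating the tree.

    All four callables are placeholders supplied by the caller (assumptions of
    this sketch); the loop only fixes the order in which they are applied.
    """
    alignment = None
    for _ in range(iterations):
        # Decompose the sequence set into subproblems using the current tree.
        subsets = decompose(tree, sequences)
        # Align each subproblem independently with the preferred MSA method.
        sub_alignments = [align_subset(subset) for subset in subsets]
        # Merge the subset alignments into one alignment on all sequences.
        alignment = merge(sub_alignments, tree)
        # Estimate a new ML tree on the merged alignment.
        tree = estimate_tree(alignment)
    return tree, alignment
```

Passing the steps in as functions keeps the sketch independent of any particular aligner or tree estimator, which mirrors the slide's point that the family wraps preferred existing methods.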

1000-taxon models, ordered by difficulty
• 24-hour SATé-I analysis, using MAFFT to align the subsets
• For moderate-to-difficult datasets, SATé gets better trees and alignments than all the other methods examined, close to what you might get if you had access to the true alignment.
• This opens up a new realm of possibility: datasets currently considered “unalignable” can in fact be aligned reasonably well, which makes accurate estimation of deep evolutionary histories feasible using a wider range of markers.
• Can we do better? What about smaller simulated datasets? And what about biological datasets? (Similar improvements for biological datasets.)

SATé Family
SATé-I (2009):
• Up to about 10,000 sequences
• Good accuracy and reasonable speed
• “Center-tree” decomposition
SATé-II (2012):
• Up to about 50,000 sequences
• Improved accuracy and speed
• Centroid-edge recursive decomposition
PASTA (2014):
• Up to 1,000,000 sequences
• Combines centroid-edge decomposition with transitivity merge
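
A centroid-edge recursive decomposition can be sketched as follows: remove the edge that splits the leaf set most evenly, then recurse on each side until every subset is small enough. This is a simplified quadratic-time illustration, not the SATé-II or PASTA code. It assumes an unrooted tree given as an edge list with string node names, and the function names and the `max_size` parameter are hypothetical.

```python
def build_adjacency(edges):
    """Turn a list of (u, v) edges of an unrooted tree into an adjacency dict."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj

def side_nodes(adj, start, blocked):
    """Nodes on the `start` side of the edge (start, blocked)."""
    seen = {start, blocked}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor in adj[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    seen.discard(blocked)
    return seen

def centroid_decompose(adj, leaves, max_size):
    """Recursively cut the most balanced edge until each subset has <= max_size leaves."""
    leaves = set(leaves)
    if len(leaves) <= max_size:
        return [sorted(leaves)]
    best_imbalance, best_side = None, None
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge once
                side = side_nodes(adj, u, v)
                imbalance = abs(len(leaves) - 2 * len(side & leaves))
                if best_imbalance is None or imbalance < best_imbalance:
                    best_imbalance, best_side = imbalance, side
    other_side = set(adj) - best_side
    parts = []
    for side in (best_side, other_side):
        # Restrict the tree to one side of the removed edge and recurse.
        sub_adj = {n: [m for m in adj[n] if m in side] for n in side}
        parts += centroid_decompose(sub_adj, leaves & side, max_size)
    return parts

edges = [("a", "i1"), ("b", "i1"), ("c", "i2"), ("d", "i2"),
         ("e", "i3"), ("f", "i3"), ("g", "i4"), ("h", "i4"),
         ("i1", "j1"), ("i2", "j1"), ("i3", "j2"), ("i4", "j2"), ("j1", "j2")]
adjacency = build_adjacency(edges)
subsets = sorted(centroid_decompose(adjacency, set("abcdefgh"), max_size=2))
```

On this balanced eight-leaf tree with `max_size=2`, the first cut is the central edge and the recursion ends with the four sibling pairs as subsets.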

SATé-I vs. SATé-II
• SATé-II is faster and more accurate than SATé-I.
• Longer analyses, or use of ML to select the tree/alignment pair, give slightly better results.

PASTA: even better than SATé-II
• Simulated RNASim datasets from 10K to 200K taxa
• Limited to 24 hours using 12 CPUs
• Not all methods could run (missing bars indicate methods that could not finish)

PASTA Running Time and Scalability
• One iteration
• Using 12 CPUs on 1 node of Lonestar (TACC)
• Maximum 24 GB memory
• Showing wall-clock running time: ~1 hour for 10k taxa, ~17 hours for 200k taxa
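
As a rough reading of the two running times quoted above, one can ask what exponent k would fit if time grew like n^k. This is back-of-the-envelope arithmetic on two approximate data points, under the assumption of a power-law model, and not a measured scaling result.

```python
import math

def scaling_exponent(n1, t1, n2, t2):
    """Exponent k such that t2 / t1 = (n2 / n1) ** k (power-law assumption)."""
    return math.log(t2 / t1) / math.log(n2 / n1)

# ~1 hour for 10k taxa and ~17 hours for 200k taxa, as quoted on the slide.
exponent = scaling_exponent(10_000, 1.0, 200_000, 17.0)
print(f"implied exponent: {exponent:.2f}")
```

A 20-fold increase in taxa costs about 17 times the wall-clock time, so the implied exponent is slightly below 1, which suggests close to linear growth over this range.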

Boosting BAli-Phy
• Nute and Warnow, RECOMB Comparative Genomics and BMC Genomics 2016
• Used BAli-Phy within PASTA and got it to scale to 1000 sequences with high accuracy!

Observations about the SATé family
• Divide-and-conquer improves alignment estimation, and leads to improved trees.
• Iteration also helps (the first few iterations are the most important).
• All MSA methods examined improve using these techniques.
• Ultra-large datasets can be analyzed efficiently with high accuracy!

Results for co-estimation methods
• Optimizing treelength (POY and BeeTLe) doesn’t produce good alignments, and the trees are not as good as those obtained using ML on standard MSA methods.
• Statistical co-estimation of alignments and trees under models of evolution that include indels can produce highly accurate alignments and trees, but running time is a big issue.
• SATé and PASTA are iterative techniques for co-estimating alignments and trees, and produce good results, but have no statistical guarantees.

General trends
• Treelength-based optimization is currently not as accurate as some standard techniques (e.g., ML on MAFFT alignments).
• Many methods give excellent results on small datasets (Probcons, Probalign, BAli-Phy, etc.), but most are not in use because of dataset size limitations.
• Large datasets are best analyzed using PASTA or UPP? (maybe)
• Co-estimation under statistical models might be the way to go, IF…

Research Projects
• Design your own MSA method, or just modify an existing one in some simple way (e.g., different guide tree).
• Test existing MSA methods with respect to different criteria (e.g., extend Prank study to more methods and datasets).
• Develop different MSA criteria that are more appropriate than TC, SPFN, SPFP.
• Compare different MSA methods on some biological dataset.
• Parallelize some MSA method.
• Consider how to combine MSAs on the same input.