Sequence alignment CS 394C: Fall 2009 Tandy Warnow September 24, 2009.

Slides:



Advertisements
Similar presentations
New methods for simultaneous estimation of trees and alignments Tandy Warnow The University of Texas at Austin.
Advertisements

Computer Science and Reconstructing Evolutionary Trees Tandy Warnow Department of Computer Science University of Illinois at Urbana-Champaign.
Computational biology and computational biologists Tandy Warnow, UT-Austin Department of Computer Sciences Institute for Cellular and Molecular Biology.
Molecular Evolution Revised 29/12/06
Multiple sequence alignment methods: evidence from data CS/BioE 598 Tandy Warnow.
BNFO 602 Multiple sequence alignment Usman Roshan.
Multiple Sequence Alignment Algorithms in Computational Biology Spring 2006 Most of the slides were created by Dan Geiger and Ydo Wexler and edited by.
Expected accuracy sequence alignment
CIS786, Lecture 8 Usman Roshan Some of the slides are based upon material by Dennis Livesay and David.
IPAM Workshop on Multiple Sequence Alignment Tandy Warnow The University of Illinois at Urbana-Champaign.
Computational and mathematical challenges involved in very large-scale phylogenetics Tandy Warnow The University of Texas at Austin.
Combinatorial and graph-theoretic problems in evolutionary tree reconstruction Tandy Warnow Department of Computer Sciences University of Texas at Austin.
Phylogeny Estimation: Why It Is "Hard", and How to Design Methods with Good Performance Tandy Warnow Department of Computer Sciences University of Texas.
Complexity and The Tree of Life Tandy Warnow The University of Texas at Austin.
Introduction to Phylogenomics and Metagenomics Tandy Warnow The Department of Computer Science The University of Texas at Austin.
Barking Up the Wrong Treelength Kevin Liu, Serita Nelesen, Sindhu Raghavan, C. Randal Linder, and Tandy Warnow IEEE TCCB 2009.
Computer Science Research for The Tree of Life Tandy Warnow Department of Computer Sciences University of Texas at Austin.
Rec-I-DCM3: A Fast Algorithmic Technique for Reconstructing Large Evolutionary Trees Usman Roshan Department of Computer Science New Jersey Institute of.
Molecular phylogenetics 1 Level 3 Molecular Evolution and Bioinformatics Jim Provan Page and Holmes: Sections
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
Software for Scientists Tandy Warnow Department of Computer Science University of Texas at Austin.
New techniques that “boost” methods for large-scale multiple sequence alignment and phylogenetic estimation Tandy Warnow Department of Computer Science.
CS 173, Lecture B August 25, 2015 Professor Tandy Warnow.
394C, Spring 2013 Sept 4, 2013 Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT.
More statistical stuff CS 394C Feb 6, Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.
Statistical stuff: models, methods, and performance issues CS 394C September 16, 2013.
CS 394C Algorithms for Computational Biology Tandy Warnow Spring 2012.
Challenge and novel aproaches for multiple sequence alignment and phylogenetic estimation Tandy Warnow Department of Computer Science The University of.
Introduction to Phylogenetic Estimation Algorithms Tandy Warnow.
Algorithmic research in phylogeny reconstruction Tandy Warnow The University of Texas at Austin.
Burkhard Morgenstern Institut für Mikrobiologie und Genetik Molekulare Evolution und Rekonstruktion von phylogenetischen Bäumen WS 2006/2007.
394C, October 2, 2013 Topics: Multiple Sequence Alignment
SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas.
TIPP: Taxon Identification and Phylogenetic Profiling Tandy Warnow The Department of Computer Science The University of Texas at Austin.
Constructing the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign.
Using Divide-and-Conquer to Construct the Tree of Life Tandy Warnow University of Illinois at Urbana-Champaign.
598AGB Basics Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT.
Three approaches to large- scale phylogeny estimation: SATé, DACTAL, and SEPP Tandy Warnow Department of Computer Science The University of Texas at Austin.
CS 395T: Computational phylogenetics January 18, 2006 Tandy Warnow.
Statistical stuff: models, methods, and performance issues CS 394C September 3, 2009.
CS/BIOE 598: Algorithmic Computational Genomics Tandy Warnow Departments of Bioengineering and Computer Science
Algorithms for Ultra-large Multiple Sequence Alignment and Phylogeny Estimation Tandy Warnow Department of Computer Science The University of Texas at.
Approaching multiple sequence alignment from a phylogenetic perspective Tandy Warnow Department of Computer Sciences The University of Texas at Austin.
SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas.
Simultaneous alignment and tree reconstruction Collaborative grant: Texas, Nebraska, Georgia, Kansas Penn State University, Huston-Tillotson, NJIT, and.
The Tree of Life: Algorithmic and Software Challenges Tandy Warnow The University of Texas at Austin.
Advancing Genome-Scale Phylogenomic Analysis Tandy Warnow Departments of Computer Science and Bioengineering Carl R. Woese Institute for Genomic Biology.
394C: Algorithms for Computational Biology Tandy Warnow Jan 25, 2012.
CS 466 and BIOE 498: Introduction to Bioinformatics
Distance-based phylogeny estimation
394C, Spring 2012 Jan 23, 2012 Tandy Warnow.
Multiple Sequence Alignment Methods
Tandy Warnow Department of Computer Sciences
The ideal approach is simultaneous alignment and tree estimation.
Techniques for MSA Tandy Warnow.
Algorithm Design and Phylogenomics
New methods for simultaneous estimation of trees and alignments
CS 581 Tandy Warnow.
CS 581 Tandy Warnow.
CS 581 Algorithmic Computational Genomics
New methods for simultaneous estimation of trees and alignments
Texas, Nebraska, Georgia, Kansas
Ultra-Large Phylogeny Estimation Using SATé and DACTAL
Recent Breakthroughs in Mathematical and Computational Phylogenetics
CS 394C: Computational Biology Algorithms
September 1, 2009 Tandy Warnow
Algorithms for Inferring the Tree of Life
Sequence alignment CS 394C Tandy Warnow Feb 15, 2012.
Tandy Warnow The University of Texas at Austin
New methods for simultaneous estimation of trees and alignments
Presentation transcript:

Sequence alignment CS 394C: Fall 2009 Tandy Warnow September 24, 2009

DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT TAGCCCATAGACTTAGCGCTTAGCACAAAGGGCAT TAGCCCTAGCACTT AAGACTT TGGACTTAAGGCCT AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT AGCGCTTAGCACAATAGACTTTAGCCCAAGGGCAT

TAGCCCATAGACTTTGCACAATGCGCTTAGGGCAT UVWXY U VW X Y

…ACGGTGCAGTTACCA… MutationDeletion …ACCAGTCACCA…

…ACGGTGCAGTTACCA… …AC----CAGTCACCA… The true multiple alignment –Reflects historical substitution, insertion, and deletion events in the true phylogeny … ACGGTGCAGTTACCA … MutationDeletion … ACCAGTCACCA …

Input: unaligned sequences S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Phase 1: Multiple Sequence Alignment S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Phase 2: Construct tree S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 S4 S2 S3

Many methods Alignment methods Clustal POY (and POY*) Probcons (and Probtree) MAFFT Prank Muscle Di-align T-Coffee Opal Etc. Phylogeny methods Bayesian MCMC Maximum parsimony Maximum likelihood Neighbor joining FastME UPGMA Quartet puzzling Etc.

…ACGGTGCAGTTACCA… …AC----CAGTCACCA… The true multiple alignment –Reflects historical substitution, insertion, and deletion events in the true phylogeny But how do we try to estimate this? … ACGGTGCAGTTACCA … MutationDeletion … ACCAGTCACCA …

Pairwise alignments and edit transformations Each pairwise alignment implies one or more edit transformations Each edit transformation implies one or more pairwise alignments So calculating the edit distance (and hence minimum cost edit transformation) is the same as calculating the optimal pairwise alignment

Edit distances Substitution costs may depend upon which nucleotides are involved (e.g, transition/transversion differences) Gap costs –Linear (aka “simple”): gapcost(L) = cL –Affine: gapcost(L) = c+c’(L) –Other: gapcost(L) = c+c’log(L)

Computing optimal pairwise alignments The cost of a pairwise alignment (under a simple gap model) is just the sum of the costs of the columns Under affine gap models, it’s a bit more complicated (but not much)

Computing edit distance Given two sequences and the edit distance function F(.,.), how do we compute the edit distance between two sequences? Simple algorithm for standard gap cost functions (e.g., affine) based upon dynamic programming

DP alg for simple gap costs Given two sequences A[1…n] and B[1…m], and an edit distance function F(.,.) with unit substitution costs and gap cost C, Let –A = A 1,A 2,…,A n –B = B 1,B 2,…,B m Let M(i,j)=F(A[1…i],B[1…j]) (i.e., the edit distance between these two prefixes )

Dynamic programming algorithm Let M(i,j)=F(A[1…i],B[1…j]) M(0,0)=0 M(n,m) stores our answer How do we compute M(i,j) from other entries of the matrix?

Calculating M(i,j) Examine final column in some optimal pairwise alignment of A[1…i] to B[1…j] Possibilities: –Nucleotide over nucleotide: previous columns align A[1…i-1] to B[1…j-1]: M(i,j)=M(i-1,j-1)+subcost(A i,B j ) –Indel (-) over nucleotide: previous columns align A[1…i] to B[1…j-1]: M(i,j)=M(i,j-1)+indelcost –Nucleotide over indel: previous columns align A[1…i-1] to B[1…j]: M(i,j)=M(i-1,j)+indelcost

Calculating M(i,j) Examine final column in some optimal pairwise alignment of A[1…i] to B[1…j] Possibilities: –Nucleotide over nucleotide: previous columns align A[1…i-1] to B[1…j-1]: M(i,j)=M(i-1,j-1)+subcost(A i,B j ) –Indel (-) over nucleotide: previous columns align A[1…i] to B[1…j-1]: M(i,j)=M(i,j-1)+indelcost –Nucleotide over indel: previous columns align A[1…i-1] to B[1…j]: M(i,j)=M(i-1,j)+indelcost

Calculating M(i,j) M(i,j)=max{ M(i-1,j-1)+subcost(A i,B j ), M(i,j-1)+indelcost, M(i-1,j)+indelcost}

O(nm) DP algorithm for pairwise alignment using simple gap costs Initialize M(0,j) = M(j,0) = j  indelcost For i=1…n –For j = 1…m M(i,j)=max {M(i-1,j-1)+subcost(A i,B j ), M(i,j-1)+indelcost, M(i-1,j)+indelcost} Return M(n,m) Add arrows for backtracking (to construct an optimal alignment and edit transformation rather than just the cost) Modification for other gap cost functions is straightforward but leads to an increase in running time

Sum-of-pairs optimal multiple alignment Given set S of sequences and edit cost function F(.,.), Find multiple alignment that minimizes the sum of the implied pairwise alignments (Sum-of-Pairs criterion) NP-hard, but can be approximated Is this useful?

Other approaches to MSA Many of the methods used in practice do not try to optimize the sum-of-pairs Instead they use probabilistic models (HMMs) Often they do a progressive alignment on an estimated tree (aligning alignments) Performance of these methods can be assessed using real and simulated data

Many methods Alignment methods Clustal POY (and POY*) Probcons (and Probtree) MAFFT Prank Muscle Di-align T-Coffee Opal Etc. Phylogeny methods Bayesian MCMC Maximum parsimony Maximum likelihood Neighbor joining FastME UPGMA Quartet puzzling Etc.

Simulation study ROSE simulation: –1000, 500, and 100 sequences –Evolution with substitutions and indels –Varied gap lengths, rates of evolution Computed alignments Used RAxML to compute trees Recorded tree error (missing branch rate) Recorded alignment error (SP-FN)

1000 taxon models ranked by difficulty

Problems with the two phase approach Manual alignment can have a high level of subjectivity (and can take a long time). Current alignment methods fail to return reasonable alignments on markers that evolve with high rates of indels and substitutions, especially if these are large datasets. We discard potentially useful markers if they are difficult to align.

S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA and S1 S4 S2 S3 Simultaneous estimation of trees and alignments

Simultaneous Estimation Methods Likelihood-based (under model of evolution including insertion/deletion events)‏ –ALIFRITZ, BAli-Phy, BEAST, StatAlign –Computationally intensive –Limited to small datasets (< 30 sequences)

Treelength-based Input: Set S of unaligned sequences over an alphabet ∑, and an edit distance function F(.,.) (must account for gaps and substitutions) Output: Tree T with sequences S at the leaves and other sequences at the internal nodes so as to minimize  e F(s v,s w ), where the sum is taken over all edges e=(s v,s w ) in the tree

Minimizing treelength Given set S of sequences and edit distance function F(.,.), Find tree T with S at the leaves and sequences at the internal nodes so as to minimize the treelength (sum of edit distances) NP-hard but can be approximated NP-hard even if the tree is known!

Minimizing treelength The problem of finding sequences at the internal nodes of a fixed tree was introduced by Sankoff. Several algorithmic results related to this problem, with pretty theory Most popular software is POY, which tries to optimize tree length. The accuracy of any tree or alignment depends upon the edit distance function F(.,.)

More SATé: our method for simultaneous estimation and tree alignment POY and POY*: results of how changing the gap penalty from simple to affine impacts the alignment and tree Impact of guide tree on MSA Statistical co-estimation using models that include indel events (Statalign, Alifritz, BAliPhy) Getting “inside” some of the best MSAs