Progressive multiple sequence alignments from triplets by Matthias Kruspe and Peter F Stadler Presented by Syed Nabeel.


Outline
• Background
• Motivation
• Algorithm
• Complexity Analysis
• Experiments and Results
• Discussion and Future Work

Background
• Sequence alignment: a way of arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity.
• Pairwise sequence alignment: alignment of two sequences to maximize the common elements of the pair (usually measured with a scoring scheme).

Multiple sequence alignment (MSA)
• Scoring scheme: used to assess the quality of an alignment. Scores are calculated from substitution matrices such as BLOSUM and PAM.
• Multiple sequence alignment (MSA): an extension of pairwise alignment to more than two sequences at a time. Multiple alignments are often used to identify conserved regions across a group of sequences hypothesized to be evolutionarily related.
• Finding an optimal MSA is an NP-hard problem.

MSA Example

Heuristic methods for MSA
• Progressive methods: ClustalW, T-Coffee, POA, etc.
• Iterative methods: Muscle, DIALIGN, etc.
• Probabilistic methods: Probcons, Hmmt, Muscle, etc.

Progressive method
• Makes explicit use of the evolutionary relatedness of the sequences to build the alignment.
• The complete MSA is computed from pairwise alignments of previously aligned sequences, following the branching order of a precomputed "guide" tree.
• The guide tree is usually reconstructed with a clustering method such as Neighbor-Joining or UPGMA.

Problems with Existing Progressive Methods
• They are not guaranteed to find the optimal alignment.
• They use only a small part of the information potentially available in the complete data set.
• The relative placement of adjacent insertions and deletions leads to score-equivalent alignments, among which the algorithm chooses one by a pragmatic rule (e.g. "Always make insertions before deletions").
• There is no mechanism to identify errors made in earlier steps and to correct them at later stages.

Motivation for aln3nn
• Uses an exact algorithm to compute alignments of sequence and profile triples.
• Instead of a single guide tree, it uses phylogenetic networks constructed by the Neighbor-Net algorithm.
• An aggregation step constructs pairs from triples, subdividing 3-way alignments into pairs of alignments.
• Erroneously inserted gaps can be removed again at later aggregation steps.

Dynamic Programming Approaches
• Needleman-Wunsch algorithm
  - Basic dynamic programming scheme for pairwise sequence comparison.
  - Requires quadratic space and time.
  - Extends easily to a cubic space and time algorithm for three sequences.
  - Uses trivial gap cost functions.
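The quadratic scheme can be sketched in a few lines. This is a minimal illustration, not code from aln3nn: the function name and the match, mismatch, and gap values are assumptions chosen for the example, and the gap score is linear.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of strings a and b.

    Scoring values are illustrative assumptions, not aln3nn defaults.
    """
    n, m = len(a), len(b)
    # D[i][j] = best score for aligning a[:i] with b[:j] (quadratic space).
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        D[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = max(
                D[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                D[i - 1][j] + gap,     # a[i-1] against a gap
                D[i][j - 1] + gap,     # b[j-1] against a gap
            )
    return D[n][m]


print(needleman_wunsch_score("AC", "AGC"))
```

The two nested loops over the (n+1) x (m+1) table are where the quadratic time and space come from; a third sequence adds a third index and a third loop.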

Linear vs Affine Gap Costs
• Linear gap cost
  - Has a single parameter d, the cost per unit length of gap.
  - d is almost always negative, so an alignment with fewer gaps is favoured over an alignment with more gaps.
  - The overall cost of one large gap is the same as that of many small gaps of the same total length.
• Affine gap cost
  - Opening a new gap is penalized more heavily than extending an existing one.
  - This removes the problem with linear gap costs: one large gap now costs less than many small gaps of the same total length.
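A small worked example makes the difference concrete. The parameter values (d = -2, gap open = -4, gap extension = -1) and the convention that a gap of length L scores go + (L - 1) * ge are assumptions made for illustration only.

```python
def linear_gap_score(length, d=-2):
    """Score of one gap of the given length under a linear model."""
    return d * length


def affine_gap_score(length, go=-4, ge=-1):
    """Score of one gap under an affine model: open once, then extend."""
    if length == 0:
        return 0
    return go + (length - 1) * ge


# One gap of length 4 versus four separate gaps of length 1.
print(linear_gap_score(4), 4 * linear_gap_score(1))   # -8 -8
print(affine_gap_score(4), 4 * affine_gap_score(1))   # -7 -16
```

Under the linear model both layouts score the same, so the algorithm has no reason to prefer a single contiguous gap; under the affine model the single long gap is clearly cheaper.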

Gotoh's Algorithm
• Makes use of affine gap costs.
• Quadratic CPU and memory requirements for two sequences.
• Alignment of three sequences with affine gap costs requires O(n^3) time and space.
• aln3nn is based on Gotoh's algorithm with minor modifications.
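For two sequences, Gotoh's scheme can be sketched with three tables, one per "last column" type. This is a pairwise illustration only and does not reproduce the modifications made in aln3nn; the function name, the scoring values, and the convention that a gap of length L scores go + (L - 1) * ge are assumptions.

```python
def gotoh_score(a, b, match=1, mismatch=-1, go=-4, ge=-1):
    """Optimal global alignment score of a and b with affine gap scores."""
    NEG = float("-inf")
    n, m = len(a), len(b)
    # M: last column aligns a[i-1] with b[j-1]
    # X: last column is (a[i-1], '-')   Y: last column is ('-', b[j-1])
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = go + (i - 1) * ge
    for j in range(1, m + 1):
        Y[0][j] = go + (j - 1) * ge
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Extending an existing gap costs ge; any other predecessor opens one.
            X[i][j] = max(M[i - 1][j] + go, X[i - 1][j] + ge, Y[i - 1][j] + go)
            Y[i][j] = max(M[i][j - 1] + go, Y[i][j - 1] + ge, X[i][j - 1] + go)
    return max(M[n][m], X[n][m], Y[n][m])


print(gotoh_score("AAAA", "AA"))
```

Keeping a separate table per column type is what lets the recursion tell "opening" from "extending" while staying quadratic for two sequences.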

Basic Concepts
• Let A, B, and C denote the three sequences.
• Ai, Bj, and Ck refer to the ith, jth, and kth positions in A, B, and C.
• '-' denotes the gap character.
• Scores for the alignment of two or three non-gap characters are denoted by S(α, β) and S(α, β, γ).
• Gap penalties are determined from gap open (go) and gap extension (ge) scores.
• M(i, j, k) denotes the best score of the alignments of the prefixes ending at Ai, Bj, and Ck if the residues (Ai, Bj, Ck) are aligned.
• Ixy(i, j, k) denotes the best score given that (Ai, Bj, -) is the last column of the partial alignment.
• Ix(i, j, k) denotes the best score given that the last column is of the form (Ai, -, -).
• A sum-of-pairs model is used for substitution scores: S(α, β, γ) = S(α, β) + S(α, γ) + S(β, γ).

Recurrences Case 1: (Ai, Bj, Ck). All three sequences are aligned.

Recurrences (contd.) Case 2: (Ai, Bj, -). Gap in the C sequence.

Recurrences (contd.) Case 3: (Ai, -, -). Gaps in the B and C sequences.
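The three column types above are representatives of the seven possible last columns of a three-way alignment. The sketch below shows the cubic recursion over those columns with sum-of-pairs scoring. It is a deliberately simplified illustration: it uses a linear gap score instead of the affine go/ge scheme, so it is not the aln3nn recursion, and all names and scoring values are assumptions.

```python
from itertools import product


def three_way_score(a, b, c, match=1, mismatch=-1, gap=-2):
    """Optimal sum-of-pairs score for aligning three strings (linear gaps)."""

    def pair(x, y):
        if x == "-" and y == "-":
            return 0                      # gap against gap scores nothing
        if x == "-" or y == "-":
            return gap
        return match if x == y else mismatch

    def column(x, y, z):
        # Sum-of-pairs: S(x, y, z) = S(x, y) + S(x, z) + S(y, z)
        return pair(x, y) + pair(x, z) + pair(y, z)

    NEG = float("-inf")
    la, lb, lc = len(a), len(b), len(c)
    D = [[[NEG] * (lc + 1) for _ in range(lb + 1)] for _ in range(la + 1)]
    D[0][0][0] = 0
    # Seven column types: every non-empty choice of which sequences advance.
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    for i in range(la + 1):
        for j in range(lb + 1):
            for k in range(lc + 1):
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    x = a[pi] if di else "-"
                    y = b[pj] if dj else "-"
                    z = c[pk] if dk else "-"
                    cand = D[pi][pj][pk] + column(x, y, z)
                    if cand > D[i][j][k]:
                        D[i][j][k] = cand
    return D[la][lb][lc]


print(three_way_score("AC", "AC", "AC"))
```

With affine gap scores, each column type needs its own table (M, Ixy, Ix, and so on), as in the definitions on the Basic Concepts slide.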

aln3nn Optimization
• The approach described above has cubic memory consumption, which is acceptable only for small sequence lengths n.
• aln3nn optimization: divide and conquer
  - Input sequences that exceed a given threshold length l are successively subdivided into smaller sequences until the length criterion is fulfilled.
  - The partial sequences are aligned separately and the resulting alignments are concatenated afterward.
  - The result is an approximate solution of the global MSA problem.
  - The threshold length depends on sequence properties and on the available memory and CPU resources.

Determining Alignment Order
• The order in which sequences and profiles are aligned has an important influence on the performance of progressive alignment algorithms.
• Pairwise progressive methods use binary guide trees to determine the alignment order.
  - The guide tree encapsulates an approximation to the phylogenetic relationships of the input sequences.
  - The input sequences form the leaves of this tree.
  - Each interior node corresponds to an alignment.
  - The root of the guide tree represents the desired multiple alignment of all input sequences.

Phylogenetic Networks in aln3nn
• The Neighbor-Net (Nnet) approach is used to construct a phylogenetic network from which the alignment order is derived.
  - The input sequences are represented as nodes that are all disconnected at the beginning.
  - In each aggregation step, Nnet selects two nodes using a specific selection criterion.
  - In contrast to Neighbor-Joining, the two nodes are not paired immediately.
  - Nnet waits until a node has been paired up a second time. The corresponding three linked nodes are then replaced by two new linked nodes.
  - The distances of the newly introduced nodes to the remaining "active" nodes are computed as a linear combination of the distances of the nodes prior to aggregation.
  - The entire procedure is repeated until only three active nodes are left.

Agglomeration and Splitting
• Node agglomeration occurs when one of the three involved nodes (B) has two neighbors, while the other two (A and C) have only a single one.
• The alignment ABC is split such that the sequences contained in B are distributed between two subsets B' and B'' so as to maximize the scores of the partial alignments AB' and B''C.

Agglomeration and Splitting (contd.)

Space and Time Complexity
• Simple dynamic programming
  - A 3-way alignment takes O(n^3) space and time (n being the length of the sequences).
  - Thus the alignment of all N sequences takes O(N n^3) time.
• Divide-and-conquer with cutoff length l
  - Space complexity: O(n^2 + l^3) space is required.
  - This is the space needed to store the additional cost matrices plus the space required for aligning the remaining (sub)sequences of length at most l.

Space and Time Complexity (contd.)
• Time complexity: O(n^2 + n l^2) time is required for the alignment of one triplet.
  - The term n^2 results from the time needed to calculate the additional cost matrices plus the time to search for the optimal slicing positions.
  - The term n l^2 comes from the alignment of the triplet itself.
• The total time complexity of the alignment is therefore O(N n^2 + N n l^2).
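To get a feel for what these bounds mean, the leading terms can be evaluated for concrete sizes. The values n = 1000 and l = 100 are arbitrary assumptions, and constant factors are ignored, so the numbers are orders of magnitude only.

```python
def full_dp_cells(n):
    """Leading term for plain 3-way dynamic programming: n^3."""
    return n ** 3


def dc_space_cells(n, l):
    """Leading terms for divide-and-conquer space: n^2 + l^3."""
    return n ** 2 + l ** 3


def dc_time_steps(n, l):
    """Leading terms for divide-and-conquer time per triplet: n^2 + n*l^2."""
    return n ** 2 + n * l ** 2


n, l = 1000, 100
print(full_dp_cells(n))        # 1000000000
print(dc_space_cells(n, l))    # 2000000
print(dc_time_steps(n, l))     # 11000000
```

For these sizes the divide-and-conquer variant needs roughly 500 times fewer table cells and roughly 90 times fewer steps per triplet than the plain cubic scheme.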

Running Time Comparisons

Alignments of Structured RNAs
• The aln3nn software can use RNA secondary structure annotation as additional input for nucleic acid alignments.
• A matrix of equilibrium base pairing probabilities P_ij is computed for each input sequence.
• For each sequence position i, probabilities are calculated for three cases:
  - position i is paired with a position j < i
  - position i is paired with a position j > i
  - position i remains unpaired
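Given a base pairing probability matrix, the three per-position probabilities follow by summing over partners on either side. The sketch below assumes P is a symmetric list of lists with P[i][j] the probability that positions i and j pair; the function name and the toy matrix are illustrative, and computing P itself (for example with an RNA folding package) is outside the scope of the example.

```python
def pairing_profile(P):
    """Return (p_upstream, p_downstream, p_unpaired) for every position.

    P[i][j] is assumed to be the probability that position i pairs with j.
    """
    n = len(P)
    profile = []
    for i in range(n):
        upstream = sum(P[i][j] for j in range(i))            # partner j < i
        downstream = sum(P[i][j] for j in range(i + 1, n))   # partner j > i
        profile.append((upstream, downstream, 1.0 - upstream - downstream))
    return profile


# Toy matrix: positions 0 and 2 pair with probability 0.8.
P = [
    [0.0, 0.0, 0.8],
    [0.0, 0.0, 0.0],
    [0.8, 0.0, 0.0],
]
for row in pairing_profile(P):
    print(row)
```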

Structural Score Contributions
• These probabilities are used as structure annotation.
• For a pair of annotated input sequences A and B, structural score contributions are defined for positions i and j.
• The total (mis)match score is the weighted sum of the sequence score and the structure score.
• ψ is the balance term that measures the relative contribution of sequence and structure similarity.
• For very similar sequences one should use ψ ≈ 1, whereas for very dissimilar sequences one should use a score dominated by the structural component.
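The weighting can be sketched as follows. Both formulas in this block are assumptions, since the slide's equations are not reproduced in the transcript: the structural term is taken to be a sum of square roots of products of matching probabilities, and the total is taken to be ψ times the sequence score plus (1 - ψ) times the structure score, which is at least consistent with ψ = 1 ignoring structure and ψ = 0 ignoring sequence.

```python
from math import sqrt


def structure_score(p, q):
    """Assumed structural similarity of two positions.

    p and q are (p_upstream, p_downstream, p_unpaired) triples.
    """
    return sum(sqrt(x * y) for x, y in zip(p, q))


def combined_score(seq_score, struct_score, psi):
    """Assumed weighted sum: psi * sequence + (1 - psi) * structure."""
    if not 0.0 <= psi <= 1.0:
        raise ValueError("psi must lie in [0, 1]")
    return psi * seq_score + (1.0 - psi) * struct_score


p = (0.2, 0.5, 0.3)
q = (0.2, 0.5, 0.3)
s = structure_score(p, q)
for psi in (0.0, 0.5, 1.0):
    print(psi, combined_score(2.0, s, psi))
```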

Experiments and Results

Pairwise versus Three-Way Alignments
• Sets of artificial sequences were generated using the ROSE package.
• The quality of aln3nn alignments was compared to standard progressive alignments of three sequences using t_coffee.
• The same scoring model was used in aln3nn and t_coffee.
• The analysis indicated that aln3nn produced better scores as the number of gaps increased.

Comparisons for 3 and 10 sequences

Protein Alignments
• Three types of substitution matrices were used: BLOSUM, PAM, and GONNET.
• aln3nn chooses the best-suited matrix of the given type according to sequence identity.
• The median BAliBASE score for each sequence set is used as the measure of alignment quality.
• Although aln3nn does not employ any heuristic rules to alter scoring parameters, it compares well with other common alignment programs.

Comparison of different alignment programs

RNA Alignments
• RNA sequences often evolve much faster than their secondary structure.
• Alignment quality can be increased dramatically by including structural information.
• Six diverse families of RNA data sets from BRaliBase were used for comparisons.
• The structure conservation index (SCI) was used to assess the quality of the calculated alignments.
  - SCI is defined as the ratio of the consensus folding energy of a set of aligned sequences to the average unconstrained folding energy of the individual sequences.
  - SCI is close to 0 for structurally divergent sequences and close to 1 for correctly aligned sequences with a common fold.
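The SCI definition translates directly into a ratio. The sketch assumes the folding energies have already been computed by some RNA folding tool; the function name and the energy values are made up for illustration.

```python
def structure_conservation_index(consensus_energy, individual_energies):
    """Ratio of consensus folding energy to mean individual folding energy."""
    if not individual_energies:
        raise ValueError("need at least one individual folding energy")
    mean_energy = sum(individual_energies) / len(individual_energies)
    if mean_energy == 0:
        raise ValueError("mean individual folding energy is zero")
    return consensus_energy / mean_energy


# Hypothetical energies in kcal/mol (more negative = more stable).
print(structure_conservation_index(-20.0, [-25.0, -15.0, -20.0]))  # 1.0
print(structure_conservation_index(-2.0, [-25.0, -15.0, -20.0]))   # 0.1
```

A consensus structure that is as stable as the individual folds gives a ratio near 1; an alignment that destroys the common fold leaves little consensus energy and pushes the ratio toward 0.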

Alignment accuracies on RNA samples

Influence of parameter ψ on SCI
• The SCI decreases if structural information is completely ignored (ψ = 1).
• On the other hand, ignoring the sequence information (ψ = 0) yields even worse results.
• The reason is that RNA secondary structure prediction has limited accuracy, so alignments based on predicted structures for individual sequences rest on very noisy data.
• The impact of the ψ parameter also varies between different RNA families.

Impact of the balancing parameter on SCI

Gap Removals
• In some data sets, one fifth of the gaps inserted in the early stages of the progressive alignment are later removed again.
• The following table shows the frequency f of gaps that are removed at intermediate division steps and that are not reintroduced at later stages.

Discussion and Future Work
• A direct comparison of aln3nn with progressive alignments of the same three sequences shows that the progressive approach leads to significantly suboptimal scores.
• aln3nn incurs additional computational costs compared to pairwise, guide-tree-based approaches, but it achieves competitive alignment accuracies on both protein and nucleic acid data.
• The performance of t_coffee shows that the shortcomings of initial pairwise alignments cannot be fully overcome later on, whereas aln3nn overcomes this problem.
• Future work
  - Modifications in the division step for 3-way alignments.
  - Improvements in the branch-and-bound approach.

Thanks