UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 14.10: Common Multiple.

Presentation transcript:

UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 14.10: Common Multiple Alignment Methods Lecturer: Dr. Rose Slides by: Dr. Rose April 1, 2003

Common Multiple Alignment Methods: Iterative pairwise alignment
Naïve method:
1. Merge the two strings with minimum edit distance.
2. Successively merge in the string with the smallest edit distance from any string already in the multiple alignment.
Observation: Like Prim’s algorithm for minimum spanning trees.
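
The merge order of the naïve method can be sketched in a few lines. This is only an illustration under assumptions not stated on the slide: unit-cost edit distance, at least two input strings, and hypothetical function names (`edit_distance`, `prim_merge_order`). It computes the order in which strings join the alignment, not the alignment itself.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit (Levenshtein) distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def prim_merge_order(strings):
    """Order in which string indices would join the growing alignment."""
    n = len(strings)
    dist = {(i, j): edit_distance(strings[i], strings[j])
            for i, j in combinations(range(n), 2)}

    def d(i, j):
        return dist[(min(i, j), max(i, j))]

    order = list(min(dist, key=dist.get))   # step 1: the closest pair
    remaining = set(range(n)) - set(order)
    while remaining:                        # step 2: closest outsider joins
        nxt = min(remaining, key=lambda j: min(d(i, j) for i in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

As in Prim's algorithm, a single connected group grows by one string at a time.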

Common Multiple Alignment Methods: Iterative pairwise alignment
Alternative naïve method:
1. Merge the two strings with minimum edit distance.
2. Successively merge multiple alignments of subsets of strings on the basis of pairwise edit distance.
Observation: Like Kruskal’s algorithm for minimum spanning trees.
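
A minimal sketch of the Kruskal-style merge order, assuming the pairwise edit distances are already available as a symmetric matrix. The function name `kruskal_merge_order` and the union-find bookkeeping are illustrative choices, not taken from the slides.

```python
def kruskal_merge_order(dist):
    """Return the sequence of (subset_a, subset_b) merges.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    """
    n = len(dist)
    parent = list(range(n))
    members = {i: [i] for i in range(n)}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Consider pairs in order of increasing distance, as Kruskal does.
    edges = sorted((dist[i][j], i, j)
                   for i in range(n) for j in range(i + 1, n))
    merges = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # both strings already sit in the same sub-alignment
        merges.append((sorted(members[ri]), sorted(members[rj])))
        parent[rj] = ri
        members[ri] = members[ri] + members.pop(rj)
    return merges
```

Unlike the Prim-style order, several sub-alignments can grow independently before they are merged with each other.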

Common Multiple Alignment Methods: Iterative pairwise alignment
Less naïve method: UPGMA
1. Merge the two strings with minimum edit distance.
2. Successively merge multiple alignments of subsets of strings on the basis of average edit distance.
Observation: Average linkage is a common clustering linkage method.
UPGMA: unweighted pair-group method using arithmetic averages.
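
The average-linkage rule can be sketched as follows. The sketch assumes a precomputed symmetric distance matrix and takes the distance between two subsets to be the arithmetic mean over all cross pairs of their members; the name `upgma_merge_order` is hypothetical.

```python
def upgma_merge_order(dist):
    """Return a list of (members_a, members_b, average_distance) merges."""
    clusters = {i: [i] for i in range(len(dist))}
    next_id = len(dist)
    merges = []

    def avg(a, b):
        # Mean distance over all pairs with one member in each cluster.
        pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
        return sum(dist[i][j] for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        ids = sorted(clusters)
        a, b = min(((x, y) for k, x in enumerate(ids) for y in ids[k + 1:]),
                   key=lambda p: avg(*p))
        merges.append((clusters[a], clusters[b], avg(a, b)))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges
```

Only the merge order is computed here; each merge would still need an actual alignment step.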

Common Multiple Alignment Methods: Iterative pairwise alignment
Problem: aligning protein sequences to reveal conserved β-strands.
Q: What is a β-strand?
Let’s refresh our memories concerning proteins and their structures.
For a good overview visit:

Common Multiple Alignment Methods
Q: What is a protein?
A: One or more polypeptide chains.
Q: Ughh, what is a polypeptide chain?
A: A linear polymer of amino acid residues, i.e., a sequence of amino acids.
Defn. The primary structure of a protein is the sequence of amino acids comprising it.

Common Multiple Alignment Methods
General form of an amino acid (borrowed from Jon Cooper, Birkbeck Crystallography Dept.)

Common Multiple Alignment Methods
Amino acids are joined together by peptide bonds. (borrowed from Jon Cooper, Birkbeck Crystallography Dept.)
Here the sequence of R-groups along the chain is called the primary structure.
Secondary structure refers to the local folding of the polypeptide chain.

Common Multiple Alignment Methods
The right-handed spiral conformation is known as the ‘alpha-helix’. (borrowed from Jon Cooper, Birkbeck Crystallography Dept.)

Common Multiple Alignment Methods
A section of polypeptide with residues in the beta-conformation is referred to as a beta-strand. (from J. Cooper, Birkbeck Crystallography Dept.)

Common Multiple Alignment Methods
β-strands can form beta-sheets. (J. Cooper, Birkbeck Crystallography Dept.)

Common Multiple Alignment Methods
Here we see beta-strands in an antiparallel beta-sheet. (from J. Cooper, Birkbeck Crystallography Dept.)

Common Multiple Alignment Methods: Iterative alignment
Problem: aligning protein sequences to reveal conserved β-strands.
Q: First of all, what is meant by conserved β-strands?
A: These are β-strands that are preserved through evolutionary changes.
We want to find the location of these conserved β-strands.

Common Multiple Alignment Methods: Iterative alignment
Scenario: we have one protein sequence with known locations of β-strands.
Q: How do you think we know where the β-strands are located?
A: They were probably found by X-ray crystallography.

Common Multiple Alignment Methods: Iterative alignment
Note: any multiple alignment will entail gaps.
Q: How can we use multiple alignment to find conserved β-strands?
We need an approach that will:
1. Align the conserved β-strands.
2. Not insert gaps into the conserved β-strands.
The method should be tuned to favor similarities in secondary structure.
(Recall: secondary structure refers to the local folding of the polypeptide chain.)

Common Multiple Alignment Methods: Iterative alignment
Broad outline of method:
1. Greedy algorithm.
2. Variant of the maximum spanning tree method.
3. Add strings to the alignment in order of similarity.
First we need to define the similarity metric.
Next we see that it is not simply a multiple alignment consistent with a node labeling of the maximum spanning tree.

Common Multiple Alignment Methods: Iterative alignment
Similarity metric:
For each pair of strings S_i, S_j:
  Compute the pairwise similarity score.
  Repeat 100 times:
    Randomly permute the characters in the two strings.
    Compute the pairwise similarity score of the permuted strings.
  Compute the mean and standard deviation of the permuted scores.
  Define sd(i, j) as the number of standard deviations by which the original similarity score exceeds the mean of the permuted scores.
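
The permutation test above can be sketched as follows. Several things here are assumptions made for illustration: the toy scoring scheme (match +1, mismatch −1, gap −1) stands in for whatever similarity function is actually used, sd(i, j) is read as (score − mean) / standard deviation, which is how the computed mean enters the definition, and the names `similarity` and `sd_score` are hypothetical.

```python
import random
import statistics


def similarity(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment similarity score (toy scoring values)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(prev[j] + gap, curr[j - 1] + gap, diag))
        prev = curr
    return prev[-1]


def sd_score(a, b, trials=100, rng=None):
    """Standard deviations by which score(a, b) exceeds the permuted mean."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is repeatable
    observed = similarity(a, b)
    permuted = []
    for _ in range(trials):
        pa, pb = list(a), list(b)
        rng.shuffle(pa)  # destroys order, keeps composition
        rng.shuffle(pb)
        permuted.append(similarity(pa, pb))
    std = statistics.stdev(permuted)
    if std == 0:
        return 0.0  # degenerate case: every permutation scores the same
    return (observed - statistics.mean(permuted)) / std
```

Because permuting keeps each string's composition, a high score indicates similarity that depends on the order of the characters rather than on shared composition alone.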

Common Multiple Alignment Methods: Iterative alignment
Q: Why should sd(i, j) favor secondary structure similarity?
Expectations:
• Common nonrandom structures will raise sd(i, j).
• These structures will be destroyed in the permuted strings.
Certainly, sd(i, j) favors similarity.
It is not clear that the favored similarity is necessarily secondary structure similarity.

Common Multiple Alignment Methods: Iterative alignment
Limited empirical results with test data:
If the sd(i, j) score is > 5, then there is > 70% agreement with the reference alignment (the reference alignment comes from X-ray data).
Gusfield states, “So sd(i, j) values can be used to give some confidence that the optimal alignment is biologically informative, even when the alignment is obtained from proteins where the secondary structure is not known.”
Anybody skeptical?

Common Multiple Alignment Methods: Iterative alignment
Iterative alignment:
1. Select the pair of strings with maximal sd(i, j) score.
2. Optimally align these two strings.
3. Repeat:
   Compute the profile of the current multiple alignment.
   Find the largest sd(i, j) score where S_i is included in the multiple alignment but S_j is not.
   Merge S_j into the multiple alignment by aligning S_j with the profile.
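
Two pieces of the loop in step 3 are easy to sketch: computing a profile and choosing the next string. The sketch assumes a profile is simply the per-column character frequencies of the current alignment (with `-` as the gap character) and that the sd scores are given as a symmetric matrix. The names `profile` and `next_to_merge` are hypothetical, and the string-to-profile alignment itself is not shown.

```python
from collections import Counter


def profile(alignment):
    """Per-column character frequencies of equal-length aligned rows."""
    n = len(alignment)
    return [{ch: count / n for ch, count in Counter(column).items()}
            for column in zip(*alignment)]


def next_to_merge(sd, included):
    """Pair (i, j) with the largest sd[i][j], i inside the alignment, j outside."""
    outside = [j for j in range(len(sd)) if j not in included]
    return max(((i, j) for i in included for j in outside),
               key=lambda pair: sd[pair[0]][pair[1]])
```

For example, with strings 0 and 1 already aligned, `next_to_merge(sd, {0, 1})` returns the pair whose second element is the string to merge next.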

Common Multiple Alignment Methods: Iterative alignment
Q: Are the sd scores selected in the same order that Prim’s maximum spanning tree algorithm would select them?
A: Yes. So this is a maximum spanning tree clustering method.
Q: Is the multiple alignment consistent with some node labeling of the maximum spanning tree?
A: No.
Q: Why not?

Common Multiple Alignment Methods: Iterative alignment
Do we remember what it means for an alignment to be consistent with a tree?
M is consistent with T if the induced pairwise alignment of S_i and S_j has optimal weighted edit distance for each pair of strings (S_i, S_j) that label adjacent nodes in T.
Q: So why is the multiple alignment not consistent with some node labeling of the maximum spanning tree?

Common Multiple Alignment Methods: Iterative alignment
A: Because S_j is not aligned to have optimal weighted edit distance to S_i. Recall that S_j is optimally aligned with the profile at the time it is merged.
Q: So how well does this algorithm work?
A: Contradictory results compared with optimal pairwise alignments:
• Secondary structure alignment is improved when optimal pairwise alignment gives a poor result.
• The result is poor when optimal pairwise alignment does a good job.

Common Multiple Alignment Methods: Iterative alignment
Q: Why have we been interested in multiple alignment?
A: We have focused on two problems:
1. Characterizing protein families & superfamilies.
2. Identifying important conserved features.
There is a third reason for investigating multiple alignment:
Q: Any ideas?
A: Deducing evolutionary history.

Common Multiple Alignment Methods: Iterative alignment
Consider: iterative alignment successively merges distinct subsets of strings.
• This can be represented by a binary tree T.
• Each leaf is a single string representing a taxon.
• Each internal node v represents:
  1. The merge of the strings in the subtree rooted at v.
  2. The multiple alignment of v’s descendants.
Idea: choose merge criteria that reflect evolutionary history.
• Then T represents a deduced evolutionary tree.

Common Multiple Alignment Methods: Iterative alignment
Approach: progressive alignment
Key idea: pairs of strings with minimum edit distance probably represent recent divergence of taxa.
• Merging such pairs should provide the best information.
• This alignment should conserve the maximum amount of common structure.

Common Multiple Alignment Methods: Iterative alignment
Consequence of optimal structure conservation: gap preservation.
• Never remove a gap in subsequent merges. Any gap inserted in a pairwise alignment should be preserved in the multiple alignment.
Note: if subsequent merges are effected by aligning profiles, then gaps are automatically preserved.
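
A small sketch of why merging through profile columns preserves gaps. It assumes the new string has already been aligned against the existing alignment and that the result is described by a hypothetical `column_map`: one entry per merged column, holding either the index of an existing column or `None` where a fresh gap column must be opened in every existing row.

```python
def merge_preserving_gaps(alignment, column_map, new_row):
    """Add new_row to an alignment without disturbing existing columns.

    alignment  : list of equal-length rows ('-' marks a gap)
    column_map : existing column index, or None for a newly opened gap column
    new_row    : the new string, gapped to the same length as column_map
    """
    if len(column_map) != len(new_row):
        raise ValueError("column_map and new_row must have equal length")
    merged = ["".join(row[c] if c is not None else "-" for c in column_map)
              for row in alignment]
    merged.append(new_row)
    return merged
```

Existing columns are copied whole, so a gap that was present before the merge is still present afterwards; the merge can only add gaps.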

Common Multiple Alignment: Repeated-Motif Methods
Approach (for local & global alignments):
1. Find a “good” motif.
   Q: What is a good motif?
   A: One that is wide (long) and high (common to many of the strings).
2. Shift the strings containing the motif to align the occurrences of the motif.
3. Recursively align the substrings on either side of the motif.

Common Multiple Alignment: Repeated-Motif Methods
Q: What happens when we run out of high & wide motifs?
A: Use iterative alignment to finish up.
Q: What about the strings that didn’t contain the first “good” motif?
A: Align these strings separately, starting with their own “good” motif.
Note: these separately aligned strings will have to be merged afterwards.

Common Multiple Alignment: Repeated-Motif Methods
Q: How can “good” motifs be found?
1. Collect fixed-size substrings as candidates.
2. Examine the “goodness” of the substrings using:
   a. hashing techniques, or
   b. standard substring comparison, or
   c. suffix trees, or
   d. sorting methods.
Q: Suggest how sorting could be used.
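
Of the options listed, the hashing one is the shortest to sketch. The version below is a simplification: it only counts exact occurrences of each fixed-size substring, whereas a practical motif finder would also tolerate mismatches. The names `kmer_heights` and `best_motifs` are hypothetical.

```python
from collections import defaultdict


def kmer_heights(strings, k):
    """Map each length-k substring to the set of strings that contain it."""
    where = defaultdict(set)
    for idx, s in enumerate(strings):
        for start in range(len(s) - k + 1):
            where[s[start:start + k]].add(idx)
    return where


def best_motifs(strings, k):
    """Return (candidates, height) for the most widely shared k-mers."""
    where = kmer_heights(strings, k)
    top = max((len(ids) for ids in where.values()), default=0)
    return sorted(m for m, ids in where.items() if len(ids) == top), top
```

Here the "height" of a candidate is the number of strings it occurs in, and the "width" is fixed at k until the extension step.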

Common Multiple Alignment: Repeated-Motif Methods
Next, try to extend the motif on both ends. Recall that the candidate motifs are all fixed-length.
Gusfield notes: there are many ways to realize this approach. There does not appear to be a best variant.

Common Multiple Alignment: Repeated-Motif Methods
Vingron/Argos repeated-motif method:
(We will limit our discussion to 3 strings for simplicity.)
1. Create a graph of 3n nodes, where node (i, j) stands for position i of string j.
2. Look for similar l-length substrings (l is chosen by the modeler) at nodes (i, j) and (i´, j´), j ≠ j´.
3. If the l-length substrings are sufficiently similar, connect node (i, j) to (i´, j´).
4. Remove any edge in the graph that is not part of a clique of size 3.
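
The graph construction in steps 1 to 4 can be sketched by enumerating the size-3 cliques directly. Assumptions made for illustration: "sufficiently similar" means Hamming distance at most `max_mismatch`, positions are 0-based, and the brute-force triple loop is used only for clarity. The names `hamming` and `motif_cliques` are hypothetical and this is not the published Vingron/Argos implementation.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def motif_cliques(s1, s2, s3, length, max_mismatch=0):
    """Triples (i, i2, i3) of start positions forming a clique of size 3."""
    strings = (s1, s2, s3)

    def similar(j, i, jj, ii):
        # Edge test between node (i, j) and node (ii, jj).
        a = strings[j][i:i + length]
        b = strings[jj][ii:ii + length]
        return hamming(a, b) <= max_mismatch

    starts = [range(len(s) - length + 1) for s in strings]
    return [(i, i2, i3) for i, i2, i3 in product(*starts)
            if similar(0, i, 1, i2)
            and similar(1, i2, 2, i3)
            and similar(0, i, 2, i3)]
```

Keeping only triples in which all three edges exist has the same effect as step 4: edges outside any size-3 clique never appear in the output.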

Common Multiple Alignment: Repeated-Motif Methods
Q: What is the significance of a clique of size 3?
A: This represents a similar l-length substring that appears in each of the 3 strings.
• The clique forms a motif.
• We delete edges that are not part of such cliques since we are only interested in finding motifs.
• Represent the clique at (i, 1), (i´, 2), and (i´´, 3) by (i, i´, i´´).

Common Multiple Alignment: Repeated-Motif Methods
Clique (i, i´, i´´) is to the left of clique (z, z´, z´´) iff i < z, i´ < z´, and i´´ < z´´.
Define two cliques as non-crossing if one is to the left of the other.
5. Find a set of nice non-crossing cliques.
We want “nicely spaced” motifs.
Q: What does this mean?
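
The two definitions translate directly into predicates. A minimal sketch, with hypothetical function names and cliques represented as tuples of start positions:

```python
from itertools import combinations


def is_left_of(c, z):
    """True iff every position of clique c precedes the matching one of z."""
    return all(a < b for a, b in zip(c, z))


def non_crossing(c, z):
    """Two cliques are non-crossing if one is to the left of the other."""
    return is_left_of(c, z) or is_left_of(z, c)


def is_non_crossing_set(cliques):
    """True iff every pair of cliques in the collection is non-crossing."""
    return all(non_crossing(c, z) for c, z in combinations(cliques, 2))
```

A set that passes `is_non_crossing_set` can be laid out left to right in all three strings at once, which is what allows its motifs to be aligned simultaneously.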

Common Multiple Alignment: Repeated-Motif Methods
Graphically, we would prefer a “nicely spaced” set of cliques to a set of cliques that is not “nicely spaced”.
[Figures: a “nicely spaced” set of cliques; a set of cliques that is not “nicely spaced”]

Common Multiple Alignment: Repeated-Motif Methods
We can do this by considering the relative starting positions of the motifs:
• Compare how the differences i´ − i and i´´ − i´ match z´ − z and z´´ − z´.
• Give a high weight to pairs of cliques that are thus “nicely spaced”.
• Give a small weight to pairs of “poorly spaced” cliques.
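
One way to turn this comparison into a number is sketched below. The specific formula, 1 / (1 + total offset mismatch), is purely an illustrative assumption; the slides do not say how the weights are actually computed, and `spacing_weight` is a hypothetical name.

```python
def spacing_weight(c, z):
    """Weight in (0, 1]; 1.0 means both cliques have identical offsets.

    c = (i, i2, i3) and z = (z1, z2, z3) are clique start positions.
    """
    offsets_c = (c[1] - c[0], c[2] - c[1])
    offsets_z = (z[1] - z[0], z[2] - z[1])
    mismatch = sum(abs(a - b) for a, b in zip(offsets_c, offsets_z))
    return 1.0 / (1.0 + mismatch)
```

Pairs of cliques whose motifs keep the same relative offsets across the three strings get weight 1.0, and the weight falls as the offsets diverge.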