Multiple Sequence Alignment


Multiple Sequence Alignment Motivation Definition Scoring (Sum of Pairs scoring) Algorithms Family Representations

Multiple Sequence Alignment Motivation What are we trying to accomplish? Definition Scoring (Sum of Pairs scoring) Algorithms Family Representations

Multiple Sequence Alignment: Motivation
  Representation of protein families.
  Identification and representation of conserved features of DNA/protein sequences that correlate with structure or function.
  Deduction of evolutionary history from DNA/protein sequences.
  Much of this is done by heuristics or intuition and is difficult to automate.
  Reading: pages 333-342.

Biological Motivation
  First Fact of Biological Sequence Comparison (stated previously): In biomolecular sequences (DNA, RNA, amino acid sequences), high sequence similarity usually implies significant functional or structural similarity.
  Second Fact of Biological Sequence Comparison: Evolutionarily and functionally related molecular strings can differ significantly throughout the string yet preserve the same 3D structure(s), 2D substructure(s), active sites, or dispersed residues.

Two strings versus multiple strings
  Two strings: based on the first fact. Find unknown biological relationships using string similarity. Method: database searching.
  Multiple strings: based loosely on the second fact. Given known biological relationships (function, structure, etc.), identify unknown conserved subpatterns in a set of strings. These subpatterns can then be used as known patterns in other database searches.

Multiple Sequence Alignment Motivation Definition Scoring (Sum of Pairs scoring) Algorithms Family Representations

Definition
  A global alignment of a set of k > 2 strings {Si} is obtained by inserting spaces (dashes) into each Si so that all strings end up with the same length, and then placing the strings in rows so that each column holds one character (or dash) from every string. Note that ALL positions of every string are involved.
  A local alignment of a set of k > 2 strings {Si} is obtained by selecting one substring Si' from each string Si and globally aligning those substrings.

Example
  Strings: {abca, ababa, accb, cbbc}
  One global alignment:
    a b c - a
    a b a b a
    a c c b -
    c b - b c
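
The definition can be checked mechanically. The following is a minimal Python sketch, assuming "-" is the space symbol; the function name is_global_alignment is made up for illustration, and the rows are the four-row alignment of the example strings used on the scoring slides.

```python
def is_global_alignment(strings, rows, gap="-"):
    """Return True if `rows` is a global alignment of `strings`.

    Checks that there is one row per string, that all rows have the same
    length, and that deleting the gap symbol from row i gives strings[i].
    """
    if len(rows) != len(strings):
        return False
    if len({len(row) for row in rows}) > 1:
        return False
    return all(row.replace(gap, "") == s for row, s in zip(rows, strings))


strings = ["abca", "ababa", "accb", "cbbc"]
rows = ["abc-a", "ababa", "accb-", "cb-bc"]
print(is_global_alignment(strings, rows))      # aligned rows
print(is_global_alignment(strings, strings))   # unpadded strings differ in length
```

The unpadded strings fail the check only because their lengths differ; the sketch does not judge whether an alignment is good, only whether it is well formed.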

Multiple Sequence Alignment Motivation Definition Scoring (Sum of Pairs scoring) Induced pairwise alignments Definition of sum of pair (SP) scoring Justification (or lack thereof) Algorithms Family Representations

Scoring MSAs
  Key fact: there is no universally accepted score function. My impression is that people evaluate MSAs by feel (they know a good one when they see it).
  Definition: Given an MSA M, the induced pairwise alignment of Si and Sj is obtained from M by removing all rows except the two rows for Si and Sj. Columns in which both rows hold a space (opposing spaces) can be removed if desired.

Definitions (continued)
  The score of an induced pairwise alignment is determined using any chosen scoring scheme for two-string alignment in the standard manner.

Example
  MSA:
    a b c - a
    a b a b a
    a c c b -
    c b - b c
  Induced alignment of abca and accb, with the score of each column:
    a b c - a
    a c c b -
    0 1 0 1 1
  Score: 0 + 1 + 0 + 1 + 1 = 3
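
A small Python sketch of the induced pairwise alignment and its score. The scoring scheme here (match 0, mismatch or space 1) is an assumption inferred from the column scores in the example, not something the slides state explicitly; the function names are illustrative.

```python
def induced_pairwise(msa, i, j, gap="-"):
    """Rows i and j of the MSA, with columns of opposing gaps removed."""
    cols = [(a, b) for a, b in zip(msa[i], msa[j])
            if not (a == gap and b == gap)]
    return "".join(a for a, _ in cols), "".join(b for _, b in cols)


def unit_cost(a, b):
    """Assumed scheme: identical symbols cost 0, anything else costs 1."""
    return 0 if a == b else 1


def pair_score(row_i, row_j, cost=unit_cost):
    """Column-by-column score of a two-row alignment."""
    return sum(cost(a, b) for a, b in zip(row_i, row_j))


msa = ["abc-a", "ababa", "accb-", "cb-bc"]
a, b = induced_pairwise(msa, 0, 2)
print(a, b, pair_score(a, b))   # abc-a accb- 3
```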

Sum of Pairs (SP)
  Definition: The SP score of an MSA M is the sum of the scores of the pairwise global alignments induced by M.
  Example:
    a b c - a
    a b a b a
    a c c b -
    c b - b c
  SP score: 2 + 3 + 4 + 3 + 3 + 4 = 19
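
The SP score of the example can be reproduced with a few lines of Python. This is a sketch under the assumption that the pairwise scheme is match 0, mismatch or space 1, with opposing spaces scoring 0; that assumption is consistent with the six pair scores on the slide. The names unit_cost and sp_score are illustrative.

```python
from itertools import combinations


def unit_cost(a, b):
    """Assumed scheme: identical symbols (including two gaps) cost 0, else 1."""
    return 0 if a == b else 1


def sp_score(msa, cost=unit_cost):
    """Sum over all row pairs of the score of the induced pairwise alignment."""
    return sum(cost(a, b)
               for row_i, row_j in combinations(msa, 2)
               for a, b in zip(row_i, row_j))


msa = ["abc-a", "ababa", "accb-", "cb-bc"]
pair_scores = [sp_score([r1, r2]) for r1, r2 in combinations(msa, 2)]
print(pair_scores, sp_score(msa))   # [2, 3, 4, 3, 3, 4] 19
```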

Justification
  It is difficult to give a sound biological justification for SP or any other scoring scheme.
  Main reasons for studying it:
    It is easy to work with.
    It has been used by many people in studying MSA.
    It is used in several packages.

Multiple Sequence Alignment Motivation Definition Scoring (Sum of Pairs scoring) Algorithms Exact, NP-hard problem Approximation Algorithm (Center Star) Heuristic Methods Family Representations

Formal Problem
  Input: k strings {Si} and a scoring function.
  Output: an MSA of {Si} with minimum (or maximum) SP score.
  Observation: the exact problem is NP-hard. Dynamic programming takes O(n^k) time for k strings of length n, so an exact solution for more than 6 strings of typical length is often not feasible.

Heuristic Speedup
  View the problem as a shortest path problem on a graph with O(n^k) nodes.
  Given an upper bound on the optimal value, we can eliminate exploration of many nodes using branch-and-bound ideas.
  The key is to send values forward rather than backwards.
    Backwards: all nodes will eventually be evaluated.
    Forwards: limit evaluation to those nodes whose value can still be less than the current estimate of the optimum.

Backwards
  [Figure: dynamic programming table D(i,j) for two example strings, evaluated backwards]

Forwards
  [Figure: dynamic programming table D(i,j) for two example strings, evaluated forwards]
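
The forward idea is easiest to see for two strings. The sketch below is a simplification, not the k-string method itself: it pushes values forward through the ordinary edit-distance table (unit costs assumed) and refuses to expand any cell that cannot lead to a result within the given upper bound. The lower bound used for pruning (the remaining length difference) and the function name are assumptions made for illustration.

```python
def forward_edit_distance(s, t, upper_bound):
    """Edit distance by forward propagation with branch-and-bound pruning.

    Returns (distance, number_of_expanded_cells). The distance is correct
    whenever upper_bound is at least the true edit distance; otherwise the
    result may be None.
    """
    n, m = len(s), len(t)
    best = {(0, 0): 0}          # values pushed forward so far
    expanded = 0
    for i in range(n + 1):      # row-major order visits predecessors first
        for j in range(m + 1):
            d = best.pop((i, j), None)
            if d is None:
                continue        # never reached by a promising path
            if (i, j) == (n, m):
                return d, expanded
            # At least |remaining length difference| more edits are needed.
            if d + abs((n - i) - (m - j)) > upper_bound:
                continue        # prune: cannot beat the bound
            expanded += 1
            for di, dj in ((1, 1), (1, 0), (0, 1)):
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = int(s[i] != t[j]) if di and dj else 1
                if d + step < best.get((ni, nj), float("inf")):
                    best[(ni, nj)] = d + step
    return None, expanded


tight = forward_edit_distance("kitten", "sitting", 3)
loose = forward_edit_distance("kitten", "sitting", 100)
print(tight, loose)
```

With a tight bound far fewer cells are expanded than with a loose one, while the distance returned is the same. In the backward formulation every cell is computed regardless of the bound.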

Approximation Algorithms
  Given the hardness of computing the exact solution, how about developing algorithms that compute a solution guaranteed to be close to optimal?
  Goal: find a polynomial-time algorithm A that minimizes sup_I A(I)/OPT(I), the worst-case ratio, over all inputs I, between the score A achieves and the optimal score.
  Only computer scientists seem interested in this; biologists seem to do things more heuristically.

Alignments consistent with a tree
  D(Si, Sj) is the optimal weighted edit distance between Si and Sj.
  Definition: Let T be a tree where each node is labeled with a string from {Si}. Then a multiple alignment of {Si} is consistent with T if the induced pairwise alignment of Si and Sj has score D(Si, Sj) for each pair of strings (Si, Sj) that are connected by an edge in T.

Example
  [Figure: tree whose node labels include AXYZ, XYZ, AYXYZ]
  MSA:
    -AX-Z
    -A-YZ
    -AXYZ
    --XYZ
    AYXYZ
  All alignments induced along tree edges have optimal score. Alignments of other pairs need not, such as AYXYZ with -AXYZ.

Theorem
  For any {Si} and any tree T whose nodes are labeled with distinct strings of {Si}, we can efficiently find an MSA M(T) of {Si} that is consistent with T.
  Proof: Incrementally align the strings at any two adjacent nodes, adding gaps as necessary to the other, already aligned sequences. Two aligned gaps have zero cost.

Example
  [Figure: tree with nodes AYZ, AXZ, AXYZ, XYZ, AYXYZ]
  Align AXYZ and XYZ:
    AXYZ
    -XYZ
  Align AYXYZ and -XYZ, adding a gap to the row for AXYZ:
    A-XYZ        -AXYZ
    --XYZ   or   --XYZ
    AYXYZ        AYXYZ
  …

Triangle Inequality
  Assume an alphabet-weighted scoring scheme s(x, y), where x and y can be any character (or a space).
  A scoring scheme satisfies the triangle inequality if, for any three characters (including a space) x, y, and z, s(x, z) <= s(x, y) + s(y, z).
  Note: not all scoring schemes used in biology satisfy this triangle inequality property.
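
For a small alphabet the property can simply be checked by brute force. A Python sketch follows; both scoring schemes in it are made-up illustrations (unit_cost is the 0/1 scheme, skewed_cost is a hypothetical scheme built to violate the inequality), not schemes taken from the slides.

```python
from itertools import product


def satisfies_triangle_inequality(alphabet, s):
    """True if s(x, z) <= s(x, y) + s(y, z) for all x, y, z in the alphabet."""
    return all(s(x, z) <= s(x, y) + s(y, z)
               for x, y, z in product(alphabet, repeat=3))


def unit_cost(x, y):
    """Match 0, mismatch or space 1."""
    return 0 if x == y else 1


def skewed_cost(x, y):
    """Hypothetical scheme: a<->b costs 5, every other mismatch costs 1."""
    if x == y:
        return 0
    return 5 if {x, y} == {"a", "b"} else 1


alphabet = "abc-"   # "-" stands for the space
print(satisfies_triangle_inequality(alphabet, unit_cost))     # True
print(satisfies_triangle_inequality(alphabet, skewed_cost))   # False
```

The second scheme fails because going from a to b through c costs 1 + 1 = 2, which is less than the direct cost of 5.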

Center Star Method
  For {Si}, define Sc to be the string that minimizes Σ_j D(Sc, Sj), where the sum runs over all strings Sj.
  Define the center star to be the star whose center node is labeled with Sc.
  Define Mc to be an MSA of {Si} that is consistent with the center star.
  Define d(Si, Sj) to be the score of the pairwise alignment of Si and Sj induced by Mc.
  Denote the score of an alignment M as d(M).

Example
  [Figure: center star with center AXYZ and leaves AYZ, AXZ, XYZ, AYXYZ]
  Σ_j D(AXYZ, Sj) = 4
  Mc before AYXYZ is added:
    AXYZ
    AX-Z
    A-YZ
    -XYZ
  Mc after AYXYZ is added:
    A-XYZ
    A-X-Z
    A--YZ
    --XYZ
    AYXYZ

Example continued
  Mc after AYXYZ is added:
    A-XYZ
    A-X-Z
    A--YZ
    --XYZ
    AYXYZ
  d(AYZ, AYXYZ) = 2
  d(Mc) = 1 + 1 + 1 + 1 + 2 + 2 + 2 + 2 + 2 + 2 = 16
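
The whole construction fits in a short Python sketch. It assumes unit costs (match 0, mismatch or space 1, opposing spaces 0), which agrees with the numbers in the example. The helper names align, center_star and sp_score are illustrative, and ties between equally good pairwise alignments are broken arbitrarily by the traceback.

```python
from itertools import combinations

GAP = "-"


def cost(x, y):
    """Assumed unit scheme: identical symbols cost 0, anything else 1."""
    return 0 if x == y else 1


def align(s, t):
    """Optimal global alignment of s and t: (distance, s_row, t_row)."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + cost(s[i - 1], t[j - 1]),
                          D[i - 1][j] + 1, D[i][j - 1] + 1)
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:                      # traceback
        if i and j and D[i][j] == D[i - 1][j - 1] + cost(s[i - 1], t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i and D[i][j] == D[i - 1][j] + 1:
            a.append(s[i - 1]); b.append(GAP); i -= 1
        else:
            a.append(GAP); b.append(t[j - 1]); j -= 1
    return D[n][m], "".join(reversed(a)), "".join(reversed(b))


def center_star(strings):
    """Return (center string, MSA rows in input order)."""
    k = len(strings)
    c = min(range(k),
            key=lambda i: sum(align(strings[i], s)[0] for s in strings))
    rows, order = [strings[c]], [c]            # rows[0] is the gapped center
    for j in range(k):
        if j == c:
            continue
        _, ca, ta = align(strings[c], strings[j])
        merged, new, p, q = [[] for _ in rows], [], 0, 0
        while p < len(rows[0]) or q < len(ca):
            if p < len(rows[0]) and rows[0][p] == GAP:
                # Gap column already in the MSA: new row gets a gap.
                for r, row in zip(merged, rows):
                    r.append(row[p])
                new.append(GAP); p += 1
            elif q < len(ca) and ca[q] == GAP:
                # New gap in the center: open an all-gap column in old rows.
                for r in merged:
                    r.append(GAP)
                new.append(ta[q]); q += 1
            else:
                # Same center character in both: copy the column.
                for r, row in zip(merged, rows):
                    r.append(row[p])
                new.append(ta[q]); p += 1; q += 1
        rows = ["".join(r) for r in merged] + ["".join(new)]
        order.append(j)
    msa = [None] * k
    for idx, row in zip(order, rows):
        msa[idx] = row
    return strings[c], msa


def sp_score(msa):
    return sum(cost(a, b) for r1, r2 in combinations(msa, 2)
               for a, b in zip(r1, r2))


center, msa = center_star(["AXZ", "AYZ", "AXYZ", "XYZ", "AYXYZ"])
print(center, msa, sp_score(msa))
```

On the five example strings this picks AXYZ as the center and reproduces the alignment Mc shown above, with d(Mc) = 16.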

Results
  Lemma: Assuming the triangle inequality, d(Si, Sj) <= d(Si, Sc) + d(Sc, Sj) = D(Si, Sc) + D(Sc, Sj).
  Definition: Let M* be the optimal alignment of {Si} and let d*(Si, Sj) be the score of the pairwise alignment of Si and Sj induced by M*.
  Theorem: d(Mc) / d(M*) <= 2(k-1)/k < 2

Proof
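
The body of this slide is not in the transcript. What follows is a sketch of the standard argument for this kind of bound, not necessarily the proof given in the lecture. It assumes scores are minimized, D is symmetric with D(S, S) = 0, and the lemma from the Results slide holds.

```latex
\begin{align*}
d(M_c) &= \sum_{i<j} d(S_i,S_j)
        \le \sum_{i<j} \bigl[ D(S_i,S_c) + D(S_c,S_j) \bigr]
        = (k-1) \sum_{j} D(S_c,S_j), \\
d(M^*) &= \sum_{i<j} d^*(S_i,S_j)
        \ge \sum_{i<j} D(S_i,S_j)
        = \tfrac{1}{2} \sum_{i} \sum_{j} D(S_i,S_j)
        \ge \tfrac{k}{2} \sum_{j} D(S_c,S_j), \\
\frac{d(M_c)}{d(M^*)} &\le \frac{(k-1)\sum_{j} D(S_c,S_j)}{\tfrac{k}{2}\sum_{j} D(S_c,S_j)}
        = \frac{2(k-1)}{k} < 2 .
\end{align*}
```

The first line uses the lemma and the fact that each non-center string occurs in exactly k-1 pairs. The second line uses that no induced alignment can beat the optimal pairwise distance, and that Sc minimizes the row sum of D, so each of the k row sums is at least the one for Sc.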

Weighted SP
  Each induced pairwise score is multiplied by a weight w(i, j).
  The optimal weighted SP alignment can be computed in time exponential in k using dynamic programming.
  Little is known about approximation of weighted SP.
  Question: why doesn't center star give a guaranteed bound here?

Heuristic Techniques
  In practice, people tend to use more heuristic methods with no proven performance guarantees.
  Basic idea: do some form of iterative or progressive alignment, for example an alignment based on a minimum spanning tree of some sort.
    Find the two closest nodes and join them (how should we define closeness?).
    Then iteratively add the closest non-aligned node to the alignment.
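
As a sketch of the ordering step only (not of the alignment itself), the Python below starts from the closest pair and repeatedly adds the string closest to any string already chosen, in the manner of Prim's minimum spanning tree algorithm. Closeness is taken to be plain unit-cost edit distance, which is an assumption; the slides leave the definition of closeness open. The function names are illustrative.

```python
def edit_distance(s, t):
    """Unit-cost edit distance using two rows of the DP table."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (a != b), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def progressive_order(strings, dist=edit_distance):
    """Order in which a progressive aligner would add the strings."""
    k = len(strings)
    d = {(i, j): dist(strings[i], strings[j])
         for i in range(k) for j in range(i + 1, k)}

    def between(i, j):
        return d[(min(i, j), max(i, j))]

    order = list(min(d, key=d.get))            # the two closest strings
    remaining = set(range(k)) - set(order)
    while remaining:
        # Closest non-aligned string to anything already in the alignment.
        nxt = min(sorted(remaining),
                  key=lambda r: min(between(r, a) for a in order))
        order.append(nxt)
        remaining.remove(nxt)
    return [strings[i] for i in order]


order = progressive_order(["AYXYZ", "XYZ", "AXZ", "AXYZ", "AYZ"])
print(order)
```

Ties are broken by input position here, so a different input order can give a different join order; real tools differ in how they define closeness and resolve ties.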

One method of defining closeness: sd(i, j) scores
  Given a scoring scheme, compute D(Si, Sj).
  Repeat 100 times: "jumble" Si and Sj and compute D(jum(Si), jum(Sj)).
  Compute the mean and standard deviation of these 100 jumbled comparisons.
  Define sd(i, j) = D(Si, Sj) / standard deviation (no mean?).
  Intuition: strings Si and Sj contain non-random structures (hopefully secondary structure) in common if sd(i, j) is high.
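
A Python sketch of the jumbling procedure. It returns two numbers: the ratio exactly as defined on the slide, and a mean-centered variant, which is an assumption added here because the slide itself asks "(no mean?)". The scoring function matches is a deliberately simple stand-in for D (a similarity count, higher is closer), and the two strings are made-up test data.

```python
import random
import statistics


def sd_score(s, t, score, trials=100, seed=0):
    """Return (score / stdev, (score - mean) / stdev) over jumbled pairs."""
    rng = random.Random(seed)          # fixed seed for reproducibility

    def jumble(x):
        chars = list(x)
        rng.shuffle(chars)
        return "".join(chars)

    observed = score(s, t)
    jumbled = [score(jumble(s), jumble(t)) for _ in range(trials)]
    mean = statistics.mean(jumbled)
    stdev = statistics.stdev(jumbled)
    if stdev == 0:
        raise ValueError("jumbled scores have zero spread")
    return observed / stdev, (observed - mean) / stdev


def matches(s, t):
    """Stand-in similarity score: number of positions with equal symbols."""
    return sum(a == b for a, b in zip(s, t))


result = sd_score("ACGTACGTACGT", "ACGTTGCAACGT", matches)
print(result)
```

Jumbling keeps the composition of each string but destroys the order of its characters, so the jumbled scores estimate what the score would be by chance alone.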

Multiple Sequence Alignment Motivation Definition Scoring (Sum of Pairs scoring) Algorithms Family Representations Profiles Regular expressions/motifs

Representation Problem
  Input: a family of sequences that typically have a known biological similarity.
  Desired output: a representation of this family of sequences that reveals any string/sequence similarities that hopefully are related to their biological similarity.

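Profiles
  Strings: {abca, ababa, accb, cbbc}
  Alignment (columns 1-5):
    a b c - a
    a b a b a
    a c c b -
    c b - b c
  Profile (percentage of each symbol in each column):
         1    2    3    4    5
    a   75    0   25    0   50
    b    0   75    0   75    0
    c   25   25   50    0   25
    -    0    0   25   25   25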

Log odds ratio
  Strings: {abca, ababa, accb, cbbc}
  Alignment (columns 1-5):
    a b c - a
    a b a b a
    a c c b -
    c b - b c
  Profile (percentage of each symbol in each column):
         1    2    3    4    5
    a   75    0   25    0   50
    b    0   75    0   75    0
    c   25   25   50    0   25
    -    0    0   25   25   25
  p(a) = 6/20 = 30% (frequency of a in the whole alignment)
  p(a,1) = 3/4 = 75% (frequency of a in column 1)
  The entry for symbol x in column j is log(p(x,j)/p(x)).
  Example (without logs):
          1     2     3     4     5
    a   2.5     0   .83     0   1.7
    b     0   2.5     0   2.5     0
    c     1     1     2     0     1
    -     0     0   1.7   1.7   1.7
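
Both tables can be recomputed from the alignment. The Python sketch below returns frequencies as fractions rather than percentages and stops at the ratio p(x,j)/p(x), as the slide's example does, since log(0) is undefined for symbols that never occur in a column. The function names are illustrative.

```python
from collections import Counter


def profile(msa):
    """Column frequencies: profile[symbol][column] as a fraction of rows."""
    k = len(msa)
    symbols = sorted(set("".join(msa)))
    columns = [Counter(col) for col in zip(*msa)]
    return {x: [c[x] / k for c in columns] for x in symbols}


def background(msa):
    """Overall frequency p(x) of each symbol in the whole alignment."""
    counts = Counter("".join(msa))
    total = sum(counts.values())
    return {x: n / total for x, n in counts.items()}


def odds_ratios(msa):
    """Ratio p(x, j) / p(x) for every symbol x and column j (no logarithm)."""
    prof, bg = profile(msa), background(msa)
    return {x: [p / bg[x] for p in row] for x, row in prof.items()}


msa = ["abc-a", "ababa", "accb-", "cb-bc"]
prof = profile(msa)
ratios = odds_ratios(msa)
for x in "abc-":
    print(x, [round(100 * p) for p in prof[x]], [round(r, 2) for r in ratios[x]])
```

Taking the logarithm of these ratios gives the log odds entries wherever the ratio is nonzero.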

Nice feature of profiles
  The alignment and scoring of strings extend naturally to profiles.
  Aligning a string to a profile: the notions of pairwise string alignment generalize.
  Scoring: compute a weighted sum based on the frequency of characters in the column.
  This generalizes further to profile-to-profile alignments.
  Optimal alignment: dynamic programming can solve it.
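
One way to read the "weighted sum" is sketched below in Python: the score of placing character x against a profile column is the sum, over the symbols y in that column, of the column frequency of y times the pairwise score s(x, y). This reading, the unit-cost scheme, and the function names are assumptions made for illustration. The column used is column 1 of the example profile (a 75%, c 25%).

```python
def unit_cost(x, y):
    """Assumed pairwise scheme: identical symbols cost 0, anything else 1."""
    return 0 if x == y else 1


def column_score(x, column, s=unit_cost):
    """Frequency-weighted score of character x against one profile column.

    `column` maps each symbol to its frequency in that column.
    """
    return sum(freq * s(x, y) for y, freq in column.items())


column_1 = {"a": 0.75, "c": 0.25}
for x in "abc":
    print(x, column_score(x, column_1))
```

With such a column score in place of s(x, y), the usual two-string dynamic programming recurrence aligns a string to a profile.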

Signature representations
  A signature or motif is a pattern contained as a substring in most members of a family; it is typically represented as a regular expression.
  Such a regular expression might be derived from a multiple sequence alignment.
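
A deliberately simple Python sketch of one such derivation: each column of an ungapped block of an alignment becomes either a literal (if fully conserved) or a character class. The block, the test sequences, and the function name are hypothetical; real signature databases use richer pattern languages and curated blocks.

```python
import re


def block_to_regex(block):
    """Turn an ungapped alignment block into a regular expression."""
    parts = []
    for column in zip(*block):
        symbols = sorted(set(column))
        if len(symbols) == 1:
            parts.append(re.escape(symbols[0]))          # conserved column
        else:
            parts.append("[" + "".join(re.escape(c) for c in symbols) + "]")
    return "".join(parts)


block = ["ACDE", "ACDQ", "GCDE"]      # hypothetical conserved block
pattern = block_to_regex(block)
print(pattern)                                     # [AG]CD[EQ]
print(bool(re.search(pattern, "MKGCDQLL")))        # True
print(bool(re.search(pattern, "MKTCDQLL")))        # False
```

Note that the pattern also accepts combinations never seen together in the block (GCDQ above), which is the usual price of summarizing columns independently.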