Aligning Alignments Exactly. By John Kececioglu and Dean Starrett, CS Dept., Univ. of Arizona. Appeared in the 8th ACM RECOMB, 2004. Presented by Jie Meng.


Background; Definition; Hardness; An exponential-time algorithm

Alignments Given two (DNA or protein) sequences, an alignment places one against the other so that similar parts line up as closely as possible, for example:
AT-C-TCGCT
-TG-ATG-AT
There are four kinds of aligned columns: match, mismatch, insertion, and deletion.

Scoring Alignments There are four types of aligned columns:
- Match: score(match) = 0.
- Mismatch: score(mismatch) ≥ 0.
- Insertion: score(insertion) ≥ 0.
- Deletion: score(deletion) ≥ 0.
The score of an alignment is the sum of the scores of its aligned columns. The goal is to minimize the score.

Gap-cost We can replace the single score(indel) by two costs, score(open) and score(extension): a gap of size x then costs score(open) + x * score(extension) instead of x * score(indel). For example, in
AT----CGCTTCAT
-TGCAT--AT-----
the gap of size 4 in the first row costs score(open) + 4 * score(extension).
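
The two gap-cost models above can be compared with a small sketch. The function names and the `-` gap character are illustrative assumptions, not part of the slides; the numeric costs used in the example are the ones given later in the hardness slides (gap open 2, gap extension 1).

```python
def linear_gap_cost(length, indel):
    """Cost of a gap of `length` spaces under the plain indel model."""
    return length * indel


def affine_gap_cost(length, gap_open, gap_extension):
    """Cost of one gap under the open + x * extension model."""
    if length == 0:
        return 0
    return gap_open + length * gap_extension


def row_gap_cost(row, gap_open, gap_extension, gap_char="-"):
    """Sum the affine cost of every maximal run of gap characters in a row."""
    total = 0
    run = 0
    for ch in row:
        if ch == gap_char:
            run += 1
        else:
            total += affine_gap_cost(run, gap_open, gap_extension)
            run = 0
    # Account for a gap that reaches the end of the row.
    total += affine_gap_cost(run, gap_open, gap_extension)
    return total


if __name__ == "__main__":
    # The size-4 gap in AT----CGCTTCAT with open = 2, extension = 1.
    print(row_gap_cost("AT----CGCTTCAT", gap_open=2, gap_extension=1))
```

Under the affine model one long gap is cheaper than several short gaps of the same total length, which is the behaviour the open cost is meant to encourage.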

Multiple Alignments In general we also need to compare multiple sequences and find their similarities. Multiple alignment generalizes the alignment idea to handle many sequences:
AT-C-TCGAT
-TGCAT--AT
ATCCA-CGCT

Sum-of-Pairs (SP) Score Given a multiple alignment, the sum-of-pairs (SP) score is the sum of the scores of the pairwise alignments induced by each pair of rows. For the alignment
AT-C-TCGAT
-TGCAT--AT
ATCCA-CGCT
the SP score is the sum over the three induced pairwise alignments: rows 1 and 2, rows 1 and 3, and rows 2 and 3.
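
A minimal sketch of the SP score without gap-open costs follows. The unit costs (mismatch 1, indel 1), the `-` gap character, and the function names are assumptions made for illustration; columns where both rows of a pair hold a space are skipped, since they do not belong to the induced pairwise alignment.

```python
from itertools import combinations


def pair_score(row_a, row_b, mismatch=1, indel=1, gap="-"):
    """Score the pairwise alignment induced by two rows of a multiple alignment."""
    score = 0
    for x, y in zip(row_a, row_b):
        if x == gap and y == gap:
            continue  # column vanishes in the induced pairwise alignment
        if x == gap or y == gap:
            score += indel
        elif x != y:
            score += mismatch
    return score


def sp_score(alignment, mismatch=1, indel=1):
    """Sum the induced pairwise scores over all pairs of rows."""
    return sum(
        pair_score(a, b, mismatch, indel)
        for a, b in combinations(alignment, 2)
    )


if __name__ == "__main__":
    rows = ["AT-C-TCGAT", "-TGCAT--AT", "ATCCA-CGCT"]
    print(sp_score(rows))  # 5 + 4 + 6 = 15 with unit costs
```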

BAD NEWS Multiple alignment is NP-hard. One approach is to approximate the optimal value, as in progressive alignment. This naturally raises a problem: aligning alignments.

Aligning Alignments Let S be a collection of strings s_1, s_2, s_3, …, s_k over an alphabet Σ. An alignment of S is a matrix A with k rows such that: i) each entry is either a letter or a space; ii) no column consists entirely of spaces; iii) reading across row i and removing the spaces yields string s_i. As before, there are three types of column score: match, mismatch (substitution), and insertion/deletion.
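
The three conditions in this definition can be checked mechanically. The sketch below assumes rows are Python strings and that `-` stands for a space; the function name is hypothetical.

```python
def is_alignment_of(matrix, strings, gap="-"):
    """Check conditions i)-iii) for `matrix` being an alignment of `strings`."""
    if len(matrix) != len(strings):
        return False
    if not matrix:
        return True
    width = len(matrix[0])
    # The matrix must be rectangular. Condition i) holds by construction,
    # because every entry is a single character: a letter or the gap symbol.
    if any(len(row) != width for row in matrix):
        return False
    # Condition ii): no column consists entirely of spaces.
    for column in zip(*matrix):
        if all(ch == gap for ch in column):
            return False
    # Condition iii): removing the spaces from row i yields string s_i.
    return all(row.replace(gap, "") == s for row, s in zip(matrix, strings))


if __name__ == "__main__":
    print(is_alignment_of(["AT-C", "-TGC"], ["ATC", "TGC"]))
```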

Aligning Alignments Given two alignments, A with k sequences of length N and B with l sequences of length M, we want to align the columns of A with the columns of B:
A:
AT-C-TCGAT
-TGCAT--AT
ATCCA-CGAT
B:
CT-ATTGGAT
-TTAT-G--T
CTTA-GGGAT

Aligning Alignments In other words, we treat the columns of A and B as single letters, just as when aligning two sequences. CT GT -T AT -T GT C-T G-T --T -AT --T -GT

Aligning Alignments The score function is still sum-of-pairs, namely … Note that the induced alignment of rows A_i' and B_j' may contain columns with a space in both sequences; such columns are simply removed.
A_i': a----aa-a
B_j': aaa-a-a-a

Aligning Alignments Without gap costs, aligning alignments is solvable in polynomial time. We can apply dynamic programming as for aligning two sequences; the only difference is that we align columns.
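
The column-wise dynamic program described here can be sketched as follows. This is an illustrative sketch, not the paper's code: it assumes unit mismatch and indel costs, uses `-` for a space, and returns only the cost of pairs that cross between A and B (pairs inside A or inside B are unaffected by how the columns are merged). All names are hypothetical.

```python
def column_pair_cost(col_a, col_b, mismatch=1, indel=1, gap="-"):
    """Sum-of-pairs cost between one column of A and one column of B."""
    cost = 0
    for x in col_a:
        for y in col_b:
            if x == gap and y == gap:
                continue  # space against space costs nothing
            if x == gap or y == gap:
                cost += indel
            elif x != y:
                cost += mismatch
    return cost


def align_alignments(A, B, mismatch=1, indel=1, gap="-"):
    """Minimum cross-alignment SP cost of merging A and B, without gap-open costs."""
    cols_a = list(zip(*A))
    cols_b = list(zip(*B))
    n, m = len(cols_a), len(cols_b)
    gap_a = (gap,) * len(A)  # an all-space column standing in for A
    gap_b = (gap,) * len(B)  # an all-space column standing in for B

    def cost(ca, cb):
        return column_pair_cost(ca, cb, mismatch, indel, gap)

    # C[i][j] = best cost of aligning the first i columns of A
    # with the first j columns of B.
    C = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        C[i][0] = C[i - 1][0] + cost(cols_a[i - 1], gap_b)
    for j in range(1, m + 1):
        C[0][j] = C[0][j - 1] + cost(gap_a, cols_b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            C[i][j] = min(
                C[i - 1][j - 1] + cost(cols_a[i - 1], cols_b[j - 1]),
                C[i - 1][j] + cost(cols_a[i - 1], gap_b),
                C[i][j - 1] + cost(gap_a, cols_b[j - 1]),
            )
    return C[n][m]


if __name__ == "__main__":
    A = ["AT-C-TCGAT", "-TGCAT--AT", "ATCCA-CGAT"]
    B = ["CT-ATTGGAT", "-TTAT-G--T", "CTTA-GGGAT"]
    print(align_alignments(A, B))
```

The table has (N + 1)(M + 1) cells and each cell compares one column of k letters with one of l letters, so this sketch runs in O(N M k l) time.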

Aligning Alignments With gap costs, this problem is NP-complete. We can use a reduction from the MAX-CUT problem. MAX-CUT: given a graph G = (V, E) and an integer c, is there a partition of V into L and R (V = L ∪ R, L ∩ R = ∅) such that the size of the cut is at least c? The cut is the set of edges with one endpoint in L and the other in R.
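
To make the MAX-CUT decision problem concrete, here is a brute-force sketch that tries every partition of V. It is exponential in |V| and is meant only to illustrate the definition; the function names and the graph encoding (a list of vertex pairs) are assumptions.

```python
from itertools import product


def cut_size(edges, left):
    """Number of edges with exactly one endpoint in `left`."""
    return sum(1 for u, v in edges if (u in left) != (v in left))


def has_cut_of_size(vertices, edges, c):
    """Decide MAX-CUT by enumerating all 2^|V| partitions (L, R)."""
    vertices = list(vertices)
    for mask in product([False, True], repeat=len(vertices)):
        left = {v for v, in_left in zip(vertices, mask) if in_left}
        if cut_size(edges, left) >= c:
            return True
    return False


if __name__ == "__main__":
    triangle = [(1, 2), (2, 3), (1, 3)]
    print(has_cut_of_size([1, 2, 3], triangle, 2))  # a triangle has a cut of size 2
    print(has_cut_of_size([1, 2, 3], triangle, 3))  # but no cut of size 3
```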

NP-hardness Given an instance of MAX-CUT, G = (V, E) with V = {v_1, v_2, …, v_n} and E = {e_1, e_2, …, e_m}, and an integer c, we construct two multiple alignments A and B over the alphabet {0, 1}. Both A and B have m edge rows and k dummy rows; each edge row corresponds to an edge. A has 2n columns, and every two consecutive columns correspond to a vertex. B has 3n columns, and every three consecutive columns correspond to a vertex.

NP-hardness The dummy rows in A are (0-)^n; the dummy rows in B are (0--)^n. Edge rows in A: for the row of edge e = (v_i, v_j), the columns for v_i and for v_j each hold the substring "-1", with spaces elsewhere. Edge rows in B: for the row of edge e = (v_i, v_j) with i < j, the columns for v_i hold the substring "010" and the columns for v_j hold the substring "-10".
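
The rows of this construction can be generated with a short sketch. Several details are assumptions: vertices are numbered 1..n, `-` stands for a space, the edge rows of B are filled with spaces outside the two vertex blocks (the slides state this only for A), and the number of dummy rows k is passed in as a parameter because the slides do not give its value. The function name is hypothetical.

```python
def build_reduction_alignments(n, edges, k):
    """Build the rows of A (2n columns) and B (3n columns) from a graph.

    n     -- number of vertices, numbered 1..n (assumed numbering)
    edges -- list of (i, j) vertex pairs
    k     -- number of dummy rows (value not specified on the slides)
    """
    rows_a, rows_b = [], []
    for u, v in edges:
        i, j = min(u, v), max(u, v)
        # Edge row of A: "-1" in the blocks of v_i and v_j, spaces elsewhere.
        blocks_a = ["--"] * n
        blocks_a[i - 1] = "-1"
        blocks_a[j - 1] = "-1"
        rows_a.append("".join(blocks_a))
        # Edge row of B: "010" in the block of v_i, "-10" in the block of v_j.
        # Spaces elsewhere is an assumption mirroring the rows of A.
        blocks_b = ["---"] * n
        blocks_b[i - 1] = "010"
        blocks_b[j - 1] = "-10"
        rows_b.append("".join(blocks_b))
    # Dummy rows: (0-)^n in A and (0--)^n in B.
    rows_a += ["0-" * n] * k
    rows_b += ["0--" * n] * k
    return rows_a, rows_b


if __name__ == "__main__":
    A, B = build_reduction_alignments(3, [(1, 3)], k=1)
    print(A)
    print(B)
```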

NP-hardness For simplicity, let the score of a match be 0, the score of a mismatch be 1, the gap-open cost be 2, and the gap-extension cost be 1, and ask whether there is an alignment with score less than d - c. This gives an instance of Aligning Alignments.

HOMEWORK4 Given a set of multiple alignments {A_1, A_2, …, A_n}, where each A_i is a multiple alignment of k_i sequences, and no gap costs: is the problem of multiply aligning {A_1, A_2, …, A_n} hard or easy when we use the method of this paper, i.e. aligning columns? If hard, prove it; otherwise, give an efficient algorithm and prove its complexity and correctness.

Exact Algorithm The basic idea is still dynamic programming. We have to remember extra information in a set, the so-called shape S: for each row of a multiple alignment, we record the column of its right-most letter.
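
As a rough illustration of the information a shape carries, the sketch below records, for each row of an alignment prefix, the index of the column holding its right-most letter. This follows only the one-sentence description on the slide and is a simplification; the function name, the 0-based indexing, and the use of -1 for a row with no letter yet are assumptions.

```python
def rightmost_letter_columns(alignment, gap="-"):
    """For each row, return the 0-based column of its right-most letter.

    A row that contains only spaces so far is reported as -1.
    """
    shape = []
    for row in alignment:
        last = -1
        for index, ch in enumerate(row):
            if ch != gap:
                last = index
        shape.append(last)
    return tuple(shape)


if __name__ == "__main__":
    prefix = ["AT-C-", "-TG--", "-----"]
    print(rightmost_letter_columns(prefix))
```

Knowing where each row's last letter sits is what lets the dynamic program tell whether appending a new column opens a new gap in that row or extends an existing one.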

Exact Algorithm S(i, j)=

Exact Algorithm C(i, j, t) = min … where g(A[i], B[j], s) is the total number of gaps initiated by appending columns A[i] and B[j] onto an alignment that ends in shape s.

Exact Algorithm The optimum value is … The problem here is that the number of shapes may be too large, so in the worst case the time and space complexity is …

Any Questions?