A Sub-quadratic Sequence Alignment Algorithm


Global alignment
Alignment graph for S = aacgacga, T = ctacgaga
Complexity: O(n²)
V(i,j) = max { V(i-1,j-1) + δ(S[i], T[j]), V(i-1,j) + δ(S[i], -), V(i,j-1) + δ(-, T[j]) }
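The recurrence above can be sketched as a plain quadratic-time table fill. This is only an illustration: the function names and the scoring function simple_delta (+1 match, -1 mismatch or gap) are assumptions, not taken from the slides, which allow any scoring function δ.

```python
def global_alignment_score(s, t, delta):
    """Return V(|s|, |t|), the optimal global alignment score of s and t."""
    n, m = len(s), len(t)
    # V[i][j] = best score for aligning s[:i] with t[:j]
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                 # first column: s against gaps
        V[i][0] = V[i - 1][0] + delta(s[i - 1], "-")
    for j in range(1, m + 1):                 # first row: gaps against t
        V[0][j] = V[0][j - 1] + delta("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + delta(s[i - 1], t[j - 1]),  # (mis)match
                V[i - 1][j] + delta(s[i - 1], "-"),           # gap in t
                V[i][j - 1] + delta("-", t[j - 1]),           # gap in s
            )
    return V[n][m]


def simple_delta(a, b):
    # Hypothetical scoring function: +1 for a match, -1 for a mismatch or a gap.
    return 1 if a == b and a != "-" else -1


score = global_alignment_score("aacgacga", "ctacgaga", simple_delta)
```

Every cell is visited once, which is the O(n²) cost that the rest of the deck tries to beat.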

FOUR-RUSSIANS ALGORITHM

UNRESTRICTED SCORING FUNCTION

Main idea: Compress the sequences
S = aacgacga
T = ctacgaga
LZ-78: Divide the sequence into distinct words
– S: a | ac | g | acg | a
– T: c | t | a | cg | ag | a
[Figure: tries of the words of S and T]
The number of distinct words:

Main idea
[Figure: alignment graph of S and T partitioned into blocks by their LZ-78 words (e.g. blocks 3/4, 3/2, 5/4, 5/2), with the trie for S and the trie for T]
Compute the alignment score in each block
Propagate the scores between the adjacent blocks

Main idea
Compress the sequence into words
Pre-compute the score for each block
Do alignment between blocks
Note:
– Replace normal characters by words
– Operate on blocks

COMPRESS THE SEQUENCE LZ-78

S = aacgacga
T = ctacgaga
LZ-78: Divide the sequence into distinct words
– S: a | ac | g | acg | a
– T: c | t | a | cg | ag | a
[Figure: tries of the words of S and T]
The number of distinct words:
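A minimal sketch of the LZ-78 parsing step, assuming the trie is stored as nested dictionaries (the function name lz78_parse is hypothetical). Each new word is a previously seen word extended by one character, so it corresponds to one new trie node.

```python
def lz78_parse(seq):
    """Split seq into LZ-78 phrases using a trie stored as nested dicts."""
    trie = {}
    phrases = []
    node = trie      # current position in the trie
    start = 0        # start index of the phrase being read
    for pos, ch in enumerate(seq):
        if ch in node:
            node = node[ch]                    # keep extending a known word
        else:
            node[ch] = {}                      # new word = known word + one char
            phrases.append(seq[start:pos + 1])
            node = trie
            start = pos + 1
    if start < len(seq):
        phrases.append(seq[start:])            # leftover repeats an earlier word
    return phrases


words_s = lz78_parse("aacgacga")   # ['a', 'ac', 'g', 'acg', 'a']
words_t = lz78_parse("ctacgaga")   # ['c', 't', 'a', 'cg', 'ag', 'a']
```

With this numbering, word 4 of S is "acg" and word 5 of T is "ag", the pair of sub-sequences used in the block examples.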

LZ-78
Theorem (Lempel and Ziv):
– Constant alphabet sequence S
– The maximal number of distinct phrases in S is O(n/log n).
Tighter upper bound: O(hn/log n)
– h is the entropy factor
– a real number, 0 < h ≤ 1
– Entropy is small when the sequence is repetitive

COMPUTE THE ALIGNMENT SCORE IN EACH BLOCK

[Figure: alignment graph partitioned into blocks (e.g. 3/4, 3/2, 5/4, 5/2)]
Compute the alignment score in each block

Given
– Input border: I
– Block
Compute
– Output border: O
[Figure: block G with input border I and output border O]

Matrices
I[i]: the input border value
DIST[i,j]: weight of the optimal path
– From entry i of the input border
– To entry j of its output border
OUT[i,j]: merges the information from input row I and DIST
– OUT[i,j] = I[i] + DIST[i,j]
O[j] = max{OUT[i,j] for i=1..n}
[Figure: block G with input border I and output border O]
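The definitions above translate directly into a naive computation that builds OUT explicitly and takes column maxima. The numbers below are made up for illustration (they are not the values from the slides), and undefined DIST entries are assumed to be represented by -inf.

```python
NEG_INF = float("-inf")   # assumed stand-in for undefined DIST entries


def output_border(I, DIST):
    """Build OUT[i][j] = I[i] + DIST[i][j], then O[j] = max over i of OUT[i][j]."""
    n_rows, n_cols = len(I), len(DIST[0])
    OUT = [[I[i] + DIST[i][j] for j in range(n_cols)] for i in range(n_rows)]
    O = [max(OUT[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return OUT, O


# Hypothetical 3x3 example.
I = [1, 2, 3]
DIST = [
    [0, -1, NEG_INF],
    [1, 0, -1],
    [NEG_INF, 2, 0],
]
OUT, O = output_border(I, DIST)
# OUT == [[1, 0, -inf], [3, 2, 1], [-inf, 5, 3]] and O == [3, 5, 3]
```

This touches every entry of the matrix, which is the cost the later slides avoid.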

DIST and OUT matrix example
Block: sub-sequences “acg”, “ag”
I (input borders): I0 = 1, I1 = 2, I2 = 3, I3 = 2, I4 = 1, I5 = 3
[Figure: DIST matrix and OUT matrix, rows I0 to I5, columns O0 to O5; each O[j] is the maximum of column j of OUT]

For each block, given two sub-sequences S1, S2
Compute (from scratch) DIST in O(n*m) time
Given I and DIST, compute OUT in O(n*m) time
Given OUT[i,j], compute O in O(m*n) time

Review
Compress the sequence
Pre-compute DIST[i,j] for each block
Compute border values of each block
Remaining questions
– How to compute DIST[i,j] efficiently?
– How to compute O[j] from I[i] and DIST[i,j] efficiently?
[Figure: alignment graph partitioned into blocks (e.g. 4/4, 5/4, 5/3)]

COMPUTE O[J] EFFICIENTLY

Compute O[j] efficiently
For each block of two sub-sequences S1, S2
Given
– I[i]
– DIST[i,j]
Compute
– O[j]

Compute O without explicit OUT
Block: sub-sequences “acg”, “ag”
I (input borders): I0 = 1, I1 = 2, I2 = 3, I3 = 2, I4 = 1, I5 = 3
[Figure: DIST matrix with rows I0 to I5 and columns O0 to O5; O is obtained directly from I and DIST with SMAWK]

Given DIST[i,j] and I[i], we can compute O[j] in O(n+m) time
– Without creating OUT[i,j]
How? Why?

Why?
Aggarwal, Park and Schmidt observed that DIST and OUT matrices are Monge arrays.
Definition: a matrix M[0…m,0…n] is totally monotone if either condition 1 or 2 below holds for all a,b=0…m; c,d=0…n:
1. Convex condition: M[a,c] ≥ M[b,c] ⇒ M[a,d] ≥ M[b,d] for all a<b and c<d.
2. Concave condition: M[a,c] ≤ M[b,c] ⇒ M[a,d] ≤ M[b,d] for all a<b and c<d.
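To make the definition concrete, here is a brute-force check of the concave condition. The function name and the test matrices are assumptions chosen for illustration; the check is quartic in the matrix size, so it is a teaching aid rather than part of the algorithm.

```python
from itertools import combinations


def is_concave_totally_monotone(M):
    """Check M[a][c] <= M[b][c]  =>  M[a][d] <= M[b][d] for all a < b, c < d."""
    rows, cols = range(len(M)), range(len(M[0]))
    for a, b in combinations(rows, 2):        # all row pairs with a < b
        for c, d in combinations(cols, 2):    # all column pairs with c < d
            if M[a][c] <= M[b][c] and not M[a][d] <= M[b][d]:
                return False
    return True


# Hypothetical matrix M[i][j] = -(i - j)^2, which satisfies the condition.
good = [[-(i - j) ** 2 for j in range(4)] for i in range(4)]
# A 2x2 matrix that violates it: row 1 wins in column 0 but loses in column 1.
bad = [[0, 1], [1, 0]]

print(is_concave_totally_monotone(good), is_concave_totally_monotone(bad))
```

Adding a per-row constant I[i] to every entry of a row does not change any of these comparisons' structure for a matrix like good, which is why OUT can inherit the property from DIST.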

How?
Aggarwal et al. gave a recursive algorithm, called SMAWK, which can find all row and column maxima of a totally monotone matrix by querying only O(n) elements of the matrix.
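SMAWK itself is fairly intricate, so the sketch below swaps in a simpler divide-and-conquer that relies on the same property: under the concave condition, the bottommost row attaining a column maximum never moves up as the column index grows. It costs O((n+m) log m) evaluations rather than SMAWK's linear bound, and it never materializes OUT. The names and the example DIST values are assumptions, not the data from the slides.

```python
def column_maxima(I, DIST):
    """Return O[j] = max_i (I[i] + DIST[i][j]) without storing OUT.

    Assumes OUT[i][j] = I[i] + DIST[i][j] satisfies the concave condition.
    """
    n_rows, n_cols = len(I), len(DIST[0])
    O = [None] * n_cols

    def out(i, j):
        return I[i] + DIST[i][j]          # one OUT entry, computed on demand

    def solve(col_lo, col_hi, row_lo, row_hi):
        if col_lo > col_hi:
            return
        mid = (col_lo + col_hi) // 2
        best = row_lo
        for i in range(row_lo, row_hi + 1):
            if out(i, mid) >= out(best, mid):   # ">=" keeps the bottommost maximum
                best = i
        O[mid] = out(best, mid)
        solve(col_lo, mid - 1, row_lo, best)    # earlier columns: rows up to best
        solve(mid + 1, col_hi, best, row_hi)    # later columns: rows from best on

    solve(0, n_cols - 1, 0, n_rows - 1)
    return O


# Hypothetical example: DIST[i][j] = -(i - j)^2 with the input border values
# I0..I5 listed on the slides.
I = [1, 2, 3, 2, 1, 3]
DIST = [[-(i - j) ** 2 for j in range(6)] for i in range(6)]
O = column_maxima(I, DIST)
```

The design choice here is clarity over optimality: the recursion shows why monotone maxima let most of the matrix go unread, while a full SMAWK implementation would add the row-reduction step needed to reach linear time.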

Why is DIST[i,j] totally monotone?
The concave condition: if b-c is better than a-c, then b-d is better than a-d.
[Figure: block with entries a, b on the input border I and entries c, d on the output border O]

Other problem
Rectangle problem of DIST: some corner entries of DIST are undefined (△)
– Set upper right corner of OUT to -∞
– Set lower left corner of OUT to -(n+i-1)*k
– Preserve the totally monotone property of OUT

COMPUTE DIST[I,J] EFFICIENTLY

Compute DIST[i,j] for block(5/4)
[Figure: alignment graph partitioned into blocks (e.g. 3/4, 3/2, 5/4, 5/2), with the trie for S and the trie for T]

DIST matrix

Only column m in DIST[i,j] is new
The DIST block can be updated in O(m+n)

MAINTAINING DIRECT ACCESS TO DIST TABLE

[Figure: alignment graph blocks for S and T, with the trie for S, the trie for T, and the DIST table]

Complexity
Assume |S| = |T| = n
Number of words in S, T = O(hn/log n)
Number of blocks in alignment graph: O(h²n²/(log n)²)
For each block
– Update new DIST block: O(t), where t = size of the border
– Create direct access table: O(t)
Propagating I/O across blocks
– SMAWK: O(t)
Sum of the sizes of all borders is O(hn²/log n)
Total complexity: O(hn²/log n)
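The bound on the sum of border sizes follows from a simple count, sketched below under the assumption that the border of a block has size t = (length of its S word) + (length of its T word), ignoring constant corner terms. Summing over all blocks gives (number of T words)·|S| + (number of S words)·|T|, which is O(hn²/log n) when both word counts are O(hn/log n). The helper names are hypothetical.

```python
def lz78_phrases(seq):
    """LZ-78 parse using a set of seen words (equivalent to walking a trie)."""
    seen, phrases, cur = set(), [], ""
    for ch in seq:
        cur += ch
        if cur not in seen:        # first time this word appears: close the phrase
            seen.add(cur)
            phrases.append(cur)
            cur = ""
    if cur:
        phrases.append(cur)        # leftover that repeats an earlier word
    return phrases


def total_border_size(s, t):
    """Sum of assumed border sizes |ws| + |wt| over all blocks (ws, wt)."""
    ws, wt = lz78_phrases(s), lz78_phrases(t)
    return sum(len(a) + len(b) for a in ws for b in wt)


S, T = "aacgacga", "ctacgaga"
ws, wt = lz78_phrases(S), lz78_phrases(T)
total = total_border_size(S, T)
closed_form = len(wt) * len(S) + len(ws) * len(T)   # 6*8 + 5*8 = 88
```

Since each block costs O(t), the total running time is proportional to this sum, which is how the final O(hn²/log n) bound arises.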

Other extensions
– Trace
– Reducing the space complexity for discrete scoring
– Local alignment

References
Crochemore, M.; Landau, G. M. & Ziv-Ukelson, M. A sub-quadratic sequence alignment algorithm for unrestricted cost matrices. ACM-SIAM, 2002.
Some pictures from 葉恆青