A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices. Maxime Crochemore, Gad M. Landau, Michal Ziv-Ukelson.

Presentation transcript:

A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices. Maxime Crochemore, Gad M. Landau, Michal Ziv-Ukelson

Presenters: 蘇展弘, 葉恆青, 呂育恩, 張文亮, 游騰楷

Outline: Introduction and preliminaries (LZ-78; the basic concept). Global alignment. Local alignment. Proof for LZ-76. Proof for the SMAWK algorithm.

LZ-78. [Figure: the LZ78 trie for S = aacgacga; node 0 is the root, with nodes 1 (a), 2 (ac), 3 (g), and 4 (acg).] The number of distinct code words: 4 (a, ac, g, acg).

Sample of LZ. [Figure: the LZ78 tries of both strings. S = aacgacga parses into the phrases a, ac, g, acg; T = ctacgaga parses into c, t, a, cg, ag.]
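The LZ78 parse shown in the figure can be reproduced with a short routine. The sketch below is an illustration (not code from the presentation): it builds the trie of distinct code words as nested dictionaries and returns the phrase list for any string.

```python
def lz78_parse(s):
    """LZ78 parse: split s into phrases, each equal to a previously seen
    phrase extended by one character (illustrative sketch)."""
    trie = {}          # nested dicts: one node per distinct code word
    phrases = []
    i = 0
    while i < len(s):
        node, j = trie, i
        # Extend the current phrase while it matches an existing trie path.
        while j < len(s) and s[j] in node:
            node = node[s[j]]
            j += 1
        if j < len(s):
            node[s[j]] = {}      # a new distinct code word is created
            j += 1
        phrases.append(s[i:j])
        i = j
    return phrases
```

For the two example strings it yields ['a', 'ac', 'g', 'acg', 'a'] and ['c', 't', 'a', 'cg', 'ag', 'a'], matching the blocks used on the next slides.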

Basic Concept. [Figure: the alignment table for S = aacgacga and T = ctacgaga, partitioned into blocks by the LZ78 phrases of the two strings (phrases of T: 1 c, 2 t, 3 a, 4 cg, 5 ag). Each block is an extension of its left-prefix, top-prefix, and diagonal-prefix blocks, and has an input border I (its top row and left column) and an output border O (its bottom row and right column).]

Basic Concept: I/O propagation across G. [Figure: the DIST matrices of the example block and the resulting OUT matrix, OUT[i,j] = I[i] + DIST[i,j]; the output border O_0 … O_5 is read off as the column maxima of OUT.] Entries corresponding to non-existent paths are directly assigned −∞; in practice OUT[i,j] = −(n + i + 1) × k, where k is the maximal absolute value in the penalty matrix.
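As a concrete (naive) illustration of this propagation step, the sketch below computes the output border directly from the relation OUT[i][j] = I[i] + DIST[i][j]; the algorithm itself replaces this O(t²) maximization by a SMAWK computation of the column maxima of OUT. The names are illustrative.

```python
NEG_INF = float("-inf")

def propagate_borders(I, DIST):
    """Naive I/O propagation across one block.

    I[i]       : best score reaching input-border cell i
    DIST[i][j] : best path weight inside the block from input cell i to
                 output cell j (NEG_INF when no such path exists)
    Returns O, where O[j] = max_i (I[i] + DIST[i][j]), i.e. the column
    maxima of the implicit OUT matrix."""
    cols = range(len(DIST[0]))
    return [max(I[i] + DIST[i][j] for i in range(len(I))) for j in cols]
```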

Basic Concept: the Monge property. [Figure: the DIST matrix of the example block.] Aggarwal and Park, and Schmidt, observed that DIST matrices are Monge arrays. Def: a matrix M[m x n] is Monge if either condition 1 or 2 below holds for all a, b = 0…m; c, d = 0…n: 1. convex condition: M[a,c] + M[b,d] ≤ M[b,c] + M[a,d] for all a < b and c < d; 2. concave condition: M[a,c] + M[b,d] ≥ M[b,c] + M[a,d] for all a < b and c < d.

Basic Concept: totally monotone matrices. An important property of Monge arrays is that they are totally monotone. Def: a matrix M[m x n] is totally monotone if either condition 1 or 2 below holds for all a, b = 0…m; c, d = 0…n: 1. convex condition: M[a,c] ≥ M[b,c] ⇒ M[a,d] ≥ M[b,d] for all a < b and c < d; 2. concave condition: M[a,c] ≤ M[b,c] ⇒ M[a,d] ≤ M[b,d] for all a < b and c < d. Both DIST and OUT matrices are totally monotone by the concave condition. Aggarwal et al. gave a recursive algorithm, nicknamed SMAWK, which can compute in O(n) time all row and column maxima of an n x n totally monotone matrix by querying only O(n) elements of the array.
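For small matrices the two conditions can be verified by brute force. The sketch below (an illustration only, quadratic in the number of entry pairs) tests the concave Monge condition and the concave total-monotonicity condition used here for DIST and OUT; a concave Monge matrix always passes the second check as well.

```python
from itertools import combinations

def is_concave_monge(M):
    """Concave Monge: M[a][c] + M[b][d] >= M[b][c] + M[a][d]
    for all a < b and c < d (brute-force check)."""
    rows, cols = range(len(M)), range(len(M[0]))
    return all(M[a][c] + M[b][d] >= M[b][c] + M[a][d]
               for a, b in combinations(rows, 2)
               for c, d in combinations(cols, 2))

def is_concave_totally_monotone(M):
    """Concave total monotonicity: M[a][c] <= M[b][c] implies
    M[a][d] <= M[b][d] for all a < b and c < d."""
    rows, cols = range(len(M)), range(len(M[0]))
    return all(M[a][c] > M[b][c] or M[a][d] <= M[b][d]
               for a, b in combinations(rows, 2)
               for c, d in combinations(cols, 2))
```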

DIST matrix. [Figure: the DIST matrix of the example block, rows labeled I_0 … I_5; entries marked Δ correspond to non-existent paths.]

OUT matrix - -- -

OUT matrix - -- - concave monotonicity: 若左行 的上面比下面小,則右 行的上面也比下面小 No new column maximum : –(n + i + 1) * k - 

The new block. [Figure: the block-partitioned table for S = aacgacga and T = ctacgaga, with the newly computed block highlighted.]

Corresponding matrices

Maintaining Direct Access to DIST Columns. Goal: the OUT matrix needed when running SMAWK must be supplied from DIST and the input border, and every entry of the OUT matrix must be obtainable in constant time, while the space must not exceed the bound. Approach: store only the newly created column, and maintain a data structure.
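A minimal sketch of the idea on this slide, assuming (as an illustration, not the paper's exact bookkeeping) that each block keeps references to the DIST columns it shares with its prefix blocks plus its one newly created column, so that any DIST entry remains an O(1) lookup.

```python
class BlockDIST:
    """Store only the newly created DIST column of a block; the remaining
    columns are shared, by reference, with prefix blocks (illustrative)."""

    def __init__(self, inherited_columns, new_column):
        # inherited_columns: references to columns already stored elsewhere
        # new_column: the single column materialized for this block
        self.columns = list(inherited_columns) + [new_column]

    def dist(self, i, j):
        """O(1) access to DIST[i][j] of this block."""
        return self.columns[j][i]
```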

DIST(5,4): the data structure. [Figure: the block-partitioned table, with the stored DIST columns belonging to block (5,4) indicated.]

DIST(5,4): construction. [Figure: assembling DIST(5,4); only the newly created column is stored, the remaining columns are shared with prefix blocks.]

Time and Space Complexity. Build the new column; build the DIST vector (i.e., locate all the columns of that DIST matrix); run SMAWK on this DIST (together with the input border) to compute the output maxima. Each of these steps takes O(t), where t is the size of the block.

Total complexity. [Figure: the block-partitioned table; each string is parsed into O(hn / log n) phrases.] The total work over all blocks is O(hn² / log n).

Sub-Quadratic Local Alignment. Eric Yu-En Lu, Information Management Dept., National Taiwan University

Sub-Quadratic Global Alignment: exploits the redundancy within the sequences exposed by Lempel-Ziv compression (self-repetition) to obtain the sub-quadratic running time.

Sub-Quadratic Local Alignment: requires additional knowledge of where a locally optimal substring starts and ends. However, since the algorithm works block by block, we have to compute additional block-specific information and then use it as the cue to the final score.

Additional Information. [Figure: a block with its input border I, the vectors S[i] and E[k], and the value C.] F = max { max_{i=0..t} { I[i] + E[i] }, C }
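In code, the combination rule on this slide is a one-liner; the sketch below (names illustrative) computes the best local score F of a block from its input border I, its E vector, and its internal score C.

```python
def block_best_score(I, E, C):
    """F = max( max_{i=0..t} (I[i] + E[i]), C ), as on this slide."""
    return max(max(I[i] + E[i] for i in range(len(I))), C)
```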

Algorithm Body. Given: DIST, G. Encoding: compute the values of E, S, and C. Propagation: compute the values of O′ (modified from the O in global alignment); compute F. Seek highest score: find the highest score F.

Back-Tracking the Exact Path. Global alignment. Local alignment: given the block with the maximum F value, we trace its path by looking at its max{lp, tp, dia} (left-prefix, top-prefix, diagonal-prefix) block recursively until the score reaches 0.

Time/Space Analysis
Encoding:
- E: E[i] = max{ E[i]^lp, E[i]^tp, DIST[i, l_c] } → O(t)
- S: (all others can be copied, except) S[l_r, l_c] = max{ S[l_r-1, l_c] + W, S[l_r, l_c-1] + W, S[l_r-1, l_c-1] + W } → O(t)
- C: C = max{ C^lp, C^tp, S[l_c] } → O(1)
Propagation:
- O[i] = max{ O[i], S[i] } → O(t)
- F = max{ max_{i=0..t} { I[i] + E[i] }, C } → O(t)
Finding F → O(hn² / log² n)
Total complexity → O(hn² / log n)

Further Improvements. An efficient alignment-storage algorithm: conditioned on "discrete weights"; gives a minimal encoding of DIST (O(t) → O(1) per block G); thus we obtain O(hn²/(log n)²) storage complexity for the global-alignment problem, while the time complexity remains O(hn²/log n).

Now, we are going to have presentations on SMAWK & LZ-76 Thank you!

The Maximum Number of Distinct Words. Speaker: Emory Chang. Date: 2002/1/31. Reference: Lempel and Ziv, 1976, "On the Complexity of Finite Sequences".

What is a Distinct Word? Example (LZ78): A = {0,1}, α = |A| = 2, S = …, n = |S| = …; we have four distinct words, and five steps to generate the sequence.

Notation. A: the alphabet; α: the size of the alphabet; S: a sequence in A^n; n: the length of S; C(S): the production complexity of S; N: the maximum possible number of distinct words.

The upper bound. Any sequential encoding procedure employs a parsing rule by which a long string of data is broken down into words that are individually mapped into distinct code words. For every S: …

Special case (1/2). Let N denote the maximum possible number of distinct words. Clearly C(S) < N + 1 (with a possible exception of the last one). Consider the special case: the sequence is formed by all distinct words of length one, two, …, k. Example: S = …, n = …

Special case (2/2). The number of nodes at level i is α^i, and each word at level i has length i, so level i contributes i·α^i symbols. Hence N = Σ_{i=1..k} α^i and n = Σ_{i=1..k} i·α^i.
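A small sketch of this counting (illustrative): for the special case where S consists of all distinct words of length 1, …, k over an alphabet of size α, it returns N and n as just derived.

```python
def special_case_counts(alpha, k):
    """N = number of distinct words, n = length of S, for the sequence S
    made of all words of length 1, 2, ..., k over an alphabet of size alpha."""
    N = sum(alpha ** i for i in range(1, k + 1))
    n = sum(i * alpha ** i for i in range(1, k + 1))
    return N, n

# Example: alpha = 2, k = 2 -> S = 0 1 00 01 10 11, so N = 6 and n = 10.
assert special_case_counts(2, 2) == (6, 10)
```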

General Case. [Figure: the tree extended by a partial level k+1.] The words at level k+1 have length k+1.

Proof: Since …, we have …; therefore, from …, the bound follows.

SMAWK A Linear Time Algorithm for the Maximum Problem on Wide Totally Monotone Matrices

Definition. Let A be an n × m matrix with real entries. A_j denotes the j-th column of A and A_i denotes the i-th row of A. A[i_1,…,i_k; j_1,…,j_k] denotes the submatrix of A with those rows and columns. Let j(i) be the smallest column index j such that A(i,j) equals the maximum value in A_i.

[Example matrix: in row i = 2 the leftmost maximum lies in column 3, so j(i) = 3.]

Definition. An n × m matrix A is monotone if for 1 ≤ i_1 ≤ i_2 ≤ n, j(i_1) ≤ j(i_2). A is totally monotone if every submatrix of A is monotone.
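For reference, j(i) can be computed directly from the definition; the sketch below scans row i once and keeps the leftmost maximum (illustrative, O(m) per row, which is exactly the per-row cost SMAWK avoids paying for every row).

```python
def leftmost_row_maximum(A, i):
    """j(i): the smallest column index j such that A[i][j] equals the
    maximum value in row i."""
    best = 0
    for j in range(1, len(A[i])):
        if A[i][j] > A[i][best]:   # strict '>' keeps the leftmost maximum
            best = j
    return best
```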

Another Definition. In the previous paper, the definition of totally monotone is: a matrix M[0…m, 0…n] is totally monotone if either condition 1 or 2 below holds for all a, b = 0…m; c, d = 0…n: 1. convex condition: M[a,c] ≥ M[b,c] ⇒ M[a,d] ≥ M[b,d] for all a < b and c < d; 2. concave condition: M[a,c] ≤ M[b,c] ⇒ M[a,d] ≤ M[b,d] for all a < b and c < d. We use concave here.

Comparison. Now we want to compare these two definitions. The definition in SMAWK's paper is called D_s; the definition in this paper is called D_c (we need a transpose to match the rows and columns of these two definitions).

Comparison (cont.). To prove D_c ⇒ D_s: D_c holds on a matrix A[0…n, 0…m]. Let A'[i…i', j…j'] be a submatrix of A, with i ≤ i_1 ≤ i_2 ≤ i', j_1 = j(i_1), j_2 = j(i_2); we show j_1 ≤ j_2. [Figure: rows i_1 and i_2 of A', with the entries up to column j_1 labeled a, b, c, d in row i_1 and e, f, g, h in row i_2.] Since d = A'(i_1, j_1) is the maximum of row i_1, a, b, c ≤ d; by D_c (after the transpose), e, f, g ≤ h, so j_1 ≤ j_2.

Comparison (cont.). To prove that D_s does not imply D_c, a counter-example matrix is given that satisfies D_s but not D_c. Hence D_c is stronger.

Lemma 1. We say an entry A[i,j] is dead if j ≠ j(i). Lemma 1: let A be a totally monotone n × m matrix and let 1 ≤ j_1 < j_2 ≤ m. If A(r, j_1) ≥ A(r, j_2), then the entries in {A(i, j_2) : 1 ≤ i ≤ r} are dead. If A(r, j_1) < A(r, j_2), then the entries in {A(i, j_1) : r ≤ i ≤ n} are dead. [Figure: columns j_1 and j_2, with the dead ranges above and below row r marked.]

REDUCE(A):
  C = A; k = 1
  while C has more than n columns do
    case
      C(k,k) ≥ C(k,k+1) and k < n: k = k + 1
      C(k,k) ≥ C(k,k+1) and k = n: delete column C_{k+1}
      C(k,k) < C(k,k+1): delete column C_k; if k > 1 then k = k - 1
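A direct transcription of this pseudocode into Python (a sketch; 0-based indices, with the matrix given as a function A(r, c) over index lists rows and cols):

```python
def reduce_columns(A, rows, cols):
    """REDUCE: keep at most len(rows) columns that may still contain
    row maxima of a totally monotone matrix (0-based sketch of the
    pseudocode above)."""
    C = list(cols)            # surviving column indices
    k = 0                     # 0-based counterpart of k = 1
    n = len(rows)
    while len(C) > n:
        if A(rows[k], C[k]) >= A(rows[k], C[k + 1]):
            if k < n - 1:
                k += 1                    # case 1
            else:
                del C[k + 1]              # case 2: delete column C_{k+1}
        else:                             # C(k,k) < C(k,k+1)
            del C[k]                      # case 3: delete column C_k
            if k > 0:
                k -= 1
    return C
```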

REDUCE(A). [Figures: a step-by-step trace of REDUCE on an example matrix, comparing C(k,k) with C(k,k+1) at each step and deleting columns accordingly.]

Time Complexity. Cases 2 and 3 together occur m − n times (each deletes a column); Case 1 occurs at most n + (m − n) − 1 times; in total at most 2m − n − 1 iterations, i.e., O(m).

MAXCOMPUTE(A):
  B = REDUCE(A)
  if n = 1 then output the maximum and return
  C = B[2, 4, …, 2⌊n/2⌋; 1, 2, …, n]
  MAXCOMPUTE(C)
  from the known positions of the maxima in the even rows of B, find the maxima in its odd rows
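Putting REDUCE and the even/odd recursion together gives a complete sketch of MAXCOMPUTE (illustrative; it reuses reduce_columns from the REDUCE sketch above and returns, for every row, the column of its leftmost maximum):

```python
def maxcompute(A, rows, cols, result=None):
    """SMAWK / MAXCOMPUTE sketch: for each row r of a totally monotone
    matrix return the column of its leftmost maximum, querying only a
    linear number of entries. A(r, c) returns an entry; rows and cols
    are index lists."""
    if result is None:
        result = {}
    if not rows:
        return result
    cols = reduce_columns(A, rows, cols)          # B = REDUCE(A)
    if len(rows) == 1:                            # n = 1: scan what survived
        r = rows[0]
        best = cols[0]
        for c in cols[1:]:
            if A(r, c) > A(r, best):
                best = c
        result[r] = best
        return result
    # C = B[2, 4, ...]: the even rows of B (rows[1::2] in 0-based indexing).
    maxcompute(A, rows[1::2], cols, result)
    # Find the maxima of the odd rows: by monotonicity, j(row) lies between
    # the answers already known for its even neighbours.
    k = 0
    for i in range(0, len(rows), 2):
        r = rows[i]
        stop = result[rows[i + 1]] if i + 1 < len(rows) else cols[-1]
        best = cols[k]
        while True:
            if A(r, cols[k]) > A(r, best):
                best = cols[k]
            if cols[k] == stop:
                break
            k += 1
        result[r] = best
    return result
```

For a totally monotone list-of-lists M, maxcompute(lambda r, c: M[r][c], list(range(len(M))), list(range(len(M[0])))) returns a dictionary mapping each row index to j(row).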

IDEA

Time Complexity. T(n,m) = c_1·m + c_2·n + T(n/2, n) = c_1·m + (c_1 + c_2)·n + c_2·n/2 + T(n/4, n/2) = … ; summing the geometric series gives T(n,m) ≤ 2(c_1 + c_2)·n + c_1·m = O(m) (since m ≥ n).