Efficient Algorithms for Locating Maximum Average Consecutive Substrings Jie Zheng Department of Computer Science UC, Riverside.

Outline: Problem Definition; Applications to Molecular Biology; Two Existing Algorithms; Open Problems

Definition of Problem: Given a sequence of real numbers $A = \langle a_1, a_2, \ldots, a_n \rangle$ and a positive integer $L \le n$, the goal is to find a consecutive substring of $A$ of length at least $L$ such that the average of the numbers in the substring is maximized.
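The definition above can be made concrete with a brute-force search. This is only a minimal reference sketch, not one of the algorithms discussed in the talk: the function name is hypothetical, inputs are assumed to be integers (so that `Fraction` gives exact averages), and the running time is quadratic.

```python
from fractions import Fraction


def max_average_substring_bruteforce(a, L):
    """Return (best_average, start, end) over all substrings a[start..end]
    (0-based, inclusive) of length at least L. Takes O(n^2) time."""
    n = len(a)
    if not 1 <= L <= n:
        raise ValueError("L must satisfy 1 <= L <= n")
    # prefix[k] = a[0] + ... + a[k-1], so a substring sum is one subtraction.
    prefix = [0]
    for x in a:
        prefix.append(prefix[-1] + x)
    best = None
    for start in range(n):
        for end in range(start + L - 1, n):
            avg = Fraction(prefix[end + 1] - prefix[start], end - start + 1)
            # Strict comparison keeps the first optimum found.
            if best is None or avg > best[0]:
                best = (avg, start, end)
    return best


print(max_average_substring_bruteforce([1, 2, 8, 4, 3], 2))
```

For the sequence 1, 2, 8, 4, 3 with L = 2 the best substring is 8, 4 (indices 2 to 3) with average 6.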

Applications in Biology: Locating GC-rich Regions; Post-Processing Sequence Alignments; Annotating Multiple Sequence Alignments; Computing Ungapped Local Alignments with Length Constraints

Two Existing Algorithms: 1. An $O(n \log L)$-time algorithm (Yaw-Ling Lin, Tao Jiang, Kun-Mao Chao, 2001). 2. A linear-time algorithm for binary strings (Hsueh-I Lu, 2002).

The $O(n \log L)$-time Algorithm (Yaw-Ling Lin, Tao Jiang, Kun-Mao Chao, 2001). Basic scheme: find the good partner of each element, i.e. for element $a_i$, locate $a_j$ such that the segment $\langle a_i, \ldots, a_j \rangle$ has the maximum average among all substrings of length at least $L$ starting at $a_i$. Then choose the segment with the maximum average among the $n$ candidates.

Important Concepts: Right-Skew Sequence. A sequence $A = \langle a_1, a_2, \ldots, a_n \rangle$ is right-skew if and only if the average of any prefix is always less than or equal to the average of the remaining suffix.
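The right-skew condition can be checked directly from the definition. The sketch below uses a hypothetical helper name and compares averages by cross-multiplication, which avoids division and is exact for integer inputs.

```python
def is_right_skew(a):
    """True if every proper prefix of a has an average that is less than or
    equal to the average of the remaining suffix."""
    n = len(a)
    total = sum(a)
    prefix_sum = 0
    for k in range(1, n):  # k = prefix length, the suffix has n - k items
        prefix_sum += a[k - 1]
        # prefix_sum / k <= (total - prefix_sum) / (n - k), cross-multiplied.
        if prefix_sum * (n - k) > (total - prefix_sum) * k:
            return False
    return True


print(is_right_skew([1, 3, 2]))  # prefixes 1 and 1,3 both pass the test
print(is_right_skew([4, 1, 6]))  # prefix 4 exceeds the average of 1,6
```

A single element and any non-decreasing sequence are right-skew under this definition, since there is either no proper prefix or every prefix average stays below the suffix average.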

Important Concepts: Decreasingly Right-Skew Partition. A partition $A = A_1 A_2 \cdots A_k$ is decreasingly right-skew if each segment $A_i$ of the partition is right-skew and $\mu(A_i) > \mu(A_j)$ for any $i < j$, where $\mu(\cdot)$ denotes the average.

Big Picture of Right-Skew Partition [Figure: segments labeled A, B, C, D]. Intuition: 1. If A is chosen, B must also be chosen. 2. If C is not chosen, D cannot be chosen either.

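Lemma 7 (Huang): The maximum average substring need not be longer than $2L-1$. Proof: Suppose $C$ is a maximum average substring with length $\ge 2L$. Let $C = AB$, where $|A| \ge L$ and $|B| \ge L$. Since $\mu(AB)$ is a weighted average of $\mu(A)$ and $\mu(B)$, the average of $A$ or $B$ is no less than that of $C$. Say $\mu(A) \ge \mu(B)$; then $\mu(A) \ge \mu(AB)$, so $A$ is a shorter substring of length at least $L$ whose average is at least as large.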
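One practical consequence of Lemma 7 is that a search may ignore every substring longer than $2L-1$, which already gives an $O(nL)$ method. The sketch below illustrates this under stated assumptions: the function names are hypothetical, inputs are integers, and this is not the $O(n \log L)$ algorithm of the talk.

```python
from fractions import Fraction


def best_average(a, L, max_len=None):
    """Maximum average over substrings with L <= length <= max_len.
    max_len=None means no upper limit on the length."""
    n = len(a)
    cap = n if max_len is None else min(max_len, n)
    prefix = [0]
    for x in a:
        prefix.append(prefix[-1] + x)
    best = None
    for start in range(n - L + 1):
        for length in range(L, min(cap, n - start) + 1):
            avg = Fraction(prefix[start + length] - prefix[start], length)
            if best is None or avg > best:
                best = avg
    return best


def best_average_lemma7(a, L):
    """Same optimum, but only lengths L..2L-1 are examined: O(nL) time."""
    return best_average(a, L, max_len=2 * L - 1)


a = [1, 0, 1, 1, 0, 1, 1, 1]
print(best_average(a, 3), best_average_lemma7(a, 3))
```

Both calls return the same value, because any optimal substring of length $2L$ or more can be cut into two pieces of length at least $L$, one of which is no worse.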

Main Idea of the $O(n \log L)$ Algorithm: 1. Compute the decreasingly right-skew partition in $O(n)$ time. 2. Find the good partner of each index in $O(\log L)$ time.

Compute the decreasingly right-skew partition: 1. Lemma 5: Every real sequence $A = \langle a_1, a_2, \ldots, a_n \rangle$ has a unique decreasingly right-skew partition. 2. Lemma 6: All right-skew pointers for a length-$n$ sequence can be computed in $O(n)$ total time (amortized $O(1)$ per pointer).

Compute the right-skew pointers [Figure: worked numeric example of the pointer computation]
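A right-to-left merging scan is one common way to obtain all right-skew pointers within the $O(n)$ bound of Lemma 6. The sketch below is written under assumptions: the names `right_skew_pointers` and `partition_from_pointers` are hypothetical, indices are 0-based, and `p[i]` is taken to mean the right end of the first segment in the decreasingly right-skew partition of the suffix starting at `i`.

```python
def right_skew_pointers(a):
    """p[i] = right end (0-based, inclusive) of the first segment in the
    decreasingly right-skew partition of a[i:]."""
    n = len(a)
    p = list(range(n))  # every element starts as its own segment
    s = list(a)         # s[i] = sum of a[i..p[i]]
    w = [1] * n         # w[i] = length of a[i..p[i]]
    for i in range(n - 1, -1, -1):
        # Absorb the following segment while its average is at least ours.
        while p[i] < n - 1:
            j = p[i] + 1
            if s[i] * w[j] > s[j] * w[i]:  # mu(current) > mu(next): stop
                break
            s[i] += s[j]
            w[i] += w[j]
            p[i] = p[j]
    return p


def partition_from_pointers(a, p):
    """Follow the pointers from index 0 to list the segments of a."""
    segments, i = [], 0
    while i < len(a):
        segments.append(a[i:p[i] + 1])
        i = p[i] + 1
    return segments


a = [5, 1, 3, 2, 4, 0]
p = right_skew_pointers(a)
print(p)                              # [0, 4, 4, 4, 4, 5]
print(partition_from_pointers(a, p))  # [[5], [1, 3, 2, 4], [0]]
```

In the example the segment averages are 5, 2.5 and 0, which are strictly decreasing, and each merge step removes one segment boundary for good, which is where the amortized linear bound comes from.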

Find the good partner in $O(\log L)$. Lemma 9 (Bitonic): Let $P$ be a real sequence, and $A_1 A_2 \cdots A_m$ the decreasingly right-skew partition of a sequence $A$. Suppose that $\mu(P A_1 \cdots A_k) = \max\{\mu(P A_1 \cdots A_i) \mid 0 \le i \le m\}$. Then $\mu(P A_1 \cdots A_i) > \mu(A_{i+1})$ if and only if $i \ge k$.

What does Lemma 9 tell us? Locating the good partner can be done with binary search! To find $k$ such that $\mu(P A_1 \cdots A_k) = \max\{\mu(P A_1 \cdots A_i) \mid 0 \le i \le m\}$, we guess $i$ and move it closer to $k$: 1. $\mu(P A_1 \cdots A_i) > \mu(A_{i+1})$ implies $i \ge k$. 2. $\mu(P A_1 \cdots A_i) \le \mu(A_{i+1})$ implies $i < k$.
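The binary search suggested by Lemma 9 can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, $P$ is taken to be the first $L$ elements starting at index `i`, and the segment end-points are collected into an explicit list (which costs linear time) instead of being reached through pointer-jumping tables, so only the search itself is logarithmic.

```python
def right_skew_pointers(a):
    """p[i] = right end of the first right-skew segment of a[i:]."""
    n = len(a)
    p, s, w = list(range(n)), list(a), [1] * n
    for i in range(n - 1, -1, -1):
        while p[i] < n - 1:
            j = p[i] + 1
            if s[i] * w[j] > s[j] * w[i]:
                break
            s[i] += s[j]
            w[i] += w[j]
            p[i] = p[j]
    return p


def good_partner(a, i, L):
    """Right end g (0-based) maximizing the average of a[i..g], g >= i+L-1."""
    n = len(a)
    p = right_skew_pointers(a)
    prefix = [0]
    for x in a:
        prefix.append(prefix[-1] + x)
    # ends[t] = right end of P A_1 ... A_t, where P = a[i..i+L-1].
    ends = [i + L - 1]
    while ends[-1] < n - 1:
        ends.append(p[ends[-1] + 1])
    m = len(ends) - 1

    def should_stop(t):
        """Lemma 9 test: mu(P A_1..A_t) > mu(A_{t+1}), true at t = m."""
        if t == m:
            return True
        cur_sum = prefix[ends[t] + 1] - prefix[i]
        cur_len = ends[t] - i + 1
        nxt_sum = prefix[ends[t + 1] + 1] - prefix[ends[t] + 1]
        nxt_len = ends[t + 1] - ends[t]
        return cur_sum * nxt_len > nxt_sum * cur_len

    lo, hi = 0, m  # find the smallest t for which the test holds
    while lo < hi:
        mid = (lo + hi) // 2
        if should_stop(mid):
            hi = mid
        else:
            lo = mid + 1
    return ends[lo]


print(good_partner([1, 2, 8, 4, 3], 0, 2))  # 3: a[0..3] has average 3.75
```

The test is false for every index before $k$ and true from $k$ onward, which is exactly the monotone shape a binary search needs.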

Big Picture of Locating Good Partners [Figure: diagram with a length-$L$ prefix and segments numbered 1, 2, 3]

Data Structure for Binary Search: $\lceil \log L \rceil$ Pointer-Jumping Tables. $j^{(k)}$ denotes the right end-point of the $k$-th right-skew segment. $p^{(0)}[i] = p[i]$, where $p[i]$ is the right-skew pointer for $i$, and $p^{(k+1)}[i] = \min\{p^{(k)}[p^{(k)}[i]+1],\, n\}$ for $1 \le k \le \lceil \log L \rceil$. Each level doubles the number of segments jumped over. The precomputation of the jumping tables takes at most $O(n \log L)$ time.
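The recurrence for the jumping tables translates almost line by line into code. In this sketch the function name and the `levels` parameter are assumptions (the slide uses about $\log L$ levels), indices are 0-based, and the cap is therefore `n - 1` rather than `n`.

```python
def build_jump_tables(p, levels):
    """tables[k][i] = right end reached from i after 2**k right-skew
    segments, capped at the last index. tables[0] is p itself."""
    n = len(p)
    tables = [list(p)]
    for _ in range(levels):
        prev = tables[-1]
        nxt = [0] * n
        for i in range(n):
            e = prev[i]
            # Jump again from the element just after the current end,
            # unless the current end is already the last index.
            nxt[i] = prev[e + 1] if e + 1 < n else n - 1
        tables.append(nxt)
    return tables


# Right-skew pointers of the sequence 5, 1, 3, 2, 4, 0 (0-based).
p = [0, 4, 4, 4, 4, 5]
for row in build_jump_tables(p, 2):
    print(row)
```

Each table is filled in one pass over the previous one, so building all levels costs time proportional to `n` times the number of levels.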

The Time Complexity: there are $n$ phases in total, and each phase costs $O(\log L)$, so the overall running time is $O(n \log L)$.

Crying Out for A Linear Time Algorithm!!

A Linear-Time Algorithm for Binary Strings (Hsueh-I Lu, 2002). Builds upon the previous algorithm. Improvements: - Considering an upper bound on the number of right-skew segments - Working simultaneously on the right-skew partitions of the forward and reverse strings - Utilizing properties of binary strings

Basic Scheme: Let $B = \lceil \log^3 n \rceil$ and $b = \lceil (\log\log n)^3 \rceil$, and let $g(i)$ denote the good partner of $i$. 1. Choose $O(n/\log n)$ indices $i$ of $S$ such that if $g(i) - i \ge B$ holds for any such $i$, then $g(i)$ can be found in $O(\log n)$ time. 2. Choose $O(n/\log\log n)$ indices $i$ of $S$ such that if $B \ge g(i) - i \ge b$ holds for any such $i$, then $g(i)$ can be found in $O(\log\log n)$ time. 3. Find $g(i)$ for all indices $i$ such that $g(i) - i \le b$.

Notation: A right-skew decomposition of a substring $S[p, q]$ is a nonempty set of indices $i_1, i_2, \ldots, i_l$ such that $S[i_1, i_2], S[i_2, i_3], \ldots, S[i_{l-1}, i_l]$ form the decreasingly right-skew partition of $S[p, q]$. Let $D_S(i, j)$ denote the right-skew decomposition of $S[i, j]$. If $P = \{p_1, p_2, \ldots, p_k, p_{k+1}\}$, where $p_1 < p_2 < \cdots < p_{k+1}$, then [formula not preserved]

An Intuitive Observation: Right-skew pointers cannot cross. [Figure: consecutive segments A, B, C] 1. By the definition of a right-skew segment: $\mu(A) \le \mu(B) \le \mu(C)$, thus $\mu(A+B) \le \mu(C)$. 2. By the definition of a decreasingly right-skew partition: $\mu(A+B) > \mu(C)$. Contradiction.

The Big Picture of Right-Skew Decomposition

Lemma 3 (from the big picture): If $j \in D_S(P)$, then $D_S(j, n) \subseteq D_S(P)$. Lemma 3 tells us that if $j$ belongs to the right-skew decomposition of some set of indices, then its good partner does too. Thus, we only need to search for its good partner among a limited number of indices.

Lemma 4: $|D_S(i, j)| = O((j - i)^{2/3})$ (holds for binary strings only). Define: a right-skew substring determined by $D_S(i, j)$ is an indivisible right-skew segment. A right-skew substring $S[p, q]$ is long (short) if $q - p \ge l^{1/3}$ ($q - p < l^{1/3}$). Lemma 4 is proved by showing that the numbers of long and of short right-skew substrings of a binary string are each $O((j - i)^{2/3})$.

Phase 1: $g(i) - i \ge B$; $g_R(j) - j \ge B$. Define: $P_{\text{short}} = \{p \mid p \bmod B \equiv 0 \text{ and } 0 \le p < n\} \cup \{n\}$. We have $|D_S(P_{\text{short}})| = O(n/\log n)$. In this phase, we take care of each index $i$ such that $i$ and $g(i)$, the good partner of $i$, are both in $D_S(P_{\text{short}}) \cup D_R(P_{\text{short}})$.

Phase 2: $L + b < g(i) - i \le L + B$; $L + b < g_R(j) - j \le L + B$. Define: $P_{\text{tiny}} = \{p \mid p \bmod b \equiv 0 \text{ and } 0 \le p < n\} \cup \{n\}$. We have $|D_S(P_{\text{tiny}})| = O(n/\log\log n)$. In this phase, we take care of each index $i$ such that $i$ and $g(i)$, the good partner of $i$, are both in $D_S(P_{\text{tiny}}) \cup D_R(P_{\text{tiny}})$.

Phase 3: $g(i) - i \le L + b$, $g_R(j) - j \le L + b$. We set up a table $M$ whose $(x, y)$ entry contains the index $z$ such that, if $C$ is a binary string of $L + b$ bits, $x$ is the number of 1s in the first $L$ bits of $C$, and $y$ is the binary string consisting of the last $b$ bits of $C$, then $z$ is the good partner of index 0 in $C$. Because $b$ is relatively small, the number of possible values for $x$ and $y$ is linear. By looking up the table $M$, we can handle the leftover case in $O(n)$ time.
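The reason the table can be keyed by $(x, y)$ alone is that every candidate substring starting at index 0 with length at least $L$ contains all of the first $L$ bits, so only their count of 1s matters. The sketch below builds such a table for toy parameters. It is an illustration only: the function names are hypothetical, ties are broken toward the shortest substring (an assumption, not stated in the slides), and the table is a plain dictionary.

```python
def good_partner_of_zero(x, y_bits, L):
    """Good partner of index 0 in a binary string whose first L bits
    contain x ones and whose remaining bits are y_bits."""
    best_end, best_sum, best_len = L - 1, x, L
    running = x
    for t, bit in enumerate(y_bits):
        running += bit
        length = L + t + 1
        # running / length > best_sum / best_len, cross-multiplied.
        if running * best_len > best_sum * length:
            best_end, best_sum, best_len = length - 1, running, length
    return best_end


def build_table(L, b):
    """M[(x, y)] for every count x in 0..L and every b-bit tuple y."""
    table = {}
    for x in range(L + 1):
        for code in range(2 ** b):
            y = tuple((code >> (b - 1 - k)) & 1 for k in range(b))
            table[(x, y)] = good_partner_of_zero(x, y, L)
    return table


M = build_table(3, 2)
print(len(M))             # (L + 1) * 2**b = 16 entries
print(M[(0, (1, 1))])     # 4: extending over both trailing 1s helps
print(M[(3, (0, 0))])     # 2: the first three 1s are already optimal
```

Any binary string of $L + b$ bits is then answered by one lookup with `x = sum(C[:L])` and `y = tuple(C[L:])`.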

Open Problems: How to extend the linear time algorithm for binary strings to arbitrary strings.

INTERESTED? Contact: Jie Zheng, Department of Computer Science, Surge Building #350, UC, Riverside. E-mail: zjie@cs.ucr.edu
