Published by Kelley George. Modified over 9 years ago.
1
Efficient Algorithms for Locating Maximum Average Consecutive Substrings Jie Zheng Department of Computer Science UC, Riverside
2
Outline: Problem Definition; Applications to Molecular Biology; Two Existing Algorithms; Open Problems
3
Definition of Problem: Given a sequence of real numbers A = ⟨a_1, a_2, …, a_n⟩ and a positive integer L ≤ n, the goal is to find a consecutive substring of A of length at least L such that the average of the numbers in the substring is maximized.
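As a baseline for the problem just defined, the following is a minimal brute-force sketch, not one of the algorithms discussed in this talk. It tries every substring of length at least L using prefix sums, so it runs in O(n^2) time. The function name and the 0-based, half-open index convention are assumptions made for illustration.

```python
from itertools import accumulate


def max_average_substring_bruteforce(a, L):
    """Return (average, i, j) for the best substring a[i:j] with j - i >= L.

    Hypothetical helper for illustration: O(n^2) time, 0-based half-open indices.
    """
    n = len(a)
    # prefix[k] holds a[0] + ... + a[k-1], so sum(a[i:j]) == prefix[j] - prefix[i].
    prefix = [0, *accumulate(a)]
    best = None
    for i in range(n - L + 1):
        for j in range(i + L, n + 1):
            avg = (prefix[j] - prefix[i]) / (j - i)
            # Strict comparison keeps the first substring found among ties.
            if best is None or avg > best[0]:
                best = (avg, i, j)
    return best
```

For example, with a = [1, 2, 3, 4] and L = 2 the best substring is [3, 4], i.e. a[2:4].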
4
Applications in Biology: Locating GC-rich Regions; Post-Processing Sequence Alignments; Annotating Multiple Sequence Alignments; Computing Ungapped Local Alignments with Length Constraints
5
Two Existing Algorithms: An O(n log L)-time Algorithm (Yaw-Ling Lin, Tao Jiang, Kun-Mao Chao, 2001); A Linear-Time Algorithm for Binary Strings (Hsueh-I Lu, 2002)
6
The O(n log L)-time Algorithm (Yaw-Ling Lin, Tao Jiang, Kun-Mao Chao, 2001). Basic Scheme: Find the good partner of each element, i.e., for element a_i, locate a_j such that the segment ⟨a_i, …, a_j⟩ has the maximum average among all substrings of length at least L starting from a_i. Choose the segment with the maximum average among the n candidates.
7
Important Concepts: Right-Skew Sequence. A sequence A = ⟨a_1, a_2, …, a_n⟩ is right-skew if and only if the average of any prefix ⟨a_1, …, a_k⟩ is always less than or equal to the average of the remaining suffix ⟨a_{k+1}, …, a_n⟩.
8
Important Concept: Decreasingly Right-Skew Partition. A partition A = A_1 A_2 … A_k is decreasingly right-skew if each segment A_i of the partition is right-skew and μ(A_i) > μ(A_j) for any i < j.
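The two definitions (right-skew sequence, decreasingly right-skew partition) can be checked directly. The sketch below is a straightforward test of the definitions, not an efficient algorithm. The function names are hypothetical, and averages are compared by cross-multiplication to avoid floating-point rounding.

```python
def is_right_skew(seg):
    """True if every proper prefix average is <= the remaining suffix average."""
    n, total = len(seg), sum(seg)
    prefix_sum = 0
    for k in range(1, n):
        prefix_sum += seg[k - 1]
        suffix_sum = total - prefix_sum
        # prefix_sum / k <= suffix_sum / (n - k), cross-multiplied (k, n - k > 0).
        if prefix_sum * (n - k) > suffix_sum * k:
            return False
    return True


def is_decreasingly_right_skew(segments):
    """True if each segment is right-skew and segment averages strictly decrease."""
    if not all(is_right_skew(s) for s in segments):
        return False
    # Strict decrease between neighbours implies mu(A_i) > mu(A_j) for all i < j.
    return all(
        sum(s) * len(t) > sum(t) * len(s)
        for s, t in zip(segments, segments[1:])
    )
```

For instance, [5, 1, 4, 2] splits as [5], [1, 4], [2] with averages 5 > 2.5 > 2, and each piece is right-skew.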
9
Big Picture of Right-Skew Partition [Figure: consecutive parts labeled A, B, C, D]. Intuition: 1. If A is chosen, B must also be. 2. If C is not chosen, D cannot be, either.
10
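Lemma 7 (Huang): The maximum average substring need not be longer than 2L-1. Proof: If C is a maximum average substring with length ≥ 2L, let C = AB, where |A| ≥ L and |B| ≥ L. Since μ(AB) is a weighted average of μ(A) and μ(B), the average of A or B is no less than that of C. Say μ(A) ≥ μ(B); then μ(A) ≥ μ(AB), so A is a shorter substring of length at least L whose average is no less than that of C.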
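One direct consequence of Lemma 7 is that only lengths from L to 2L-1 need to be examined, which already gives an O(nL) method. The sketch below illustrates that restriction only. It is not the O(n log L) algorithm, and the function name and half-open index convention are assumptions.

```python
from itertools import accumulate


def max_average_length_bounded(a, L):
    """Return (average, (i, j)) over substrings a[i:j] with L <= j - i <= 2L - 1."""
    n = len(a)
    prefix = [0, *accumulate(a)]  # prefix[k] == sum(a[:k])
    best_avg, best_span = None, None
    for i in range(n - L + 1):
        # By Lemma 7, ends beyond i + 2L - 1 never need to be tried.
        last_end = min(i + 2 * L - 1, n)
        for j in range(i + L, last_end + 1):
            avg = (prefix[j] - prefix[i]) / (j - i)
            if best_avg is None or avg > best_avg:
                best_avg, best_span = avg, (i, j)
    return best_avg, best_span
```

On [3, -1, 3, -1, 3] with L = 2, the best average 5/3 is reached by a substring of length 3, within the bound 2L-1 = 3.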
11
Main Idea of the O(n log L) Algorithm: 1. Compute the decreasingly right-skew partition in O(n) time. 2. Find the good partner for each index in O(log L) time.
12
Compute the decreasingly right-skew partition. 1. Lemma 5: Every real sequence A = ⟨a_1, a_2, …, a_n⟩ has a unique decreasingly right-skew partition. 2. Lemma 6: All right-skew pointers for a length-n sequence can be computed in O(n) total time.
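A minimal sketch of how the right-skew pointers might be computed in linear time: scan from right to left and keep merging the current segment with the segment to its right while the current average does not exceed the next one. Here p[i] is taken to be the (0-based, inclusive) right end of the first segment in the decreasingly right-skew partition of a[i:]. The function names and the indexing are assumptions for illustration.

```python
def right_skew_pointers(a):
    """Return p where a[i : p[i] + 1] is the first right-skew segment of a[i:]."""
    n = len(a)
    p = [0] * n  # right end of the segment starting at i
    s = [0] * n  # sum of that segment
    w = [0] * n  # width of that segment
    for i in range(n - 1, -1, -1):
        p[i], s[i], w[i] = i, a[i], 1
        while p[i] + 1 < n:
            j = p[i] + 1  # start of the next segment, already finalized
            # Stop once mean(current) > mean(next); compared by cross-multiplication.
            if s[i] * w[j] > s[j] * w[i]:
                break
            s[i] += s[j]
            w[i] += w[j]
            p[i] = p[j]
    return p


def partition_from_pointers(a, p, start=0):
    """Follow the pointers to list the segments of the partition of a[start:]."""
    segments, i = [], start
    while i < len(a):
        segments.append(a[i : p[i] + 1])
        i = p[i] + 1
    return segments
```

Each merge removes one segment boundary for good, which is the usual argument for the O(n) total cost.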
13
Compute the right-skew pointers [Figure: worked example of computing right-skew pointers on a numeric sequence]
14
Find good partner in O(log L). Lemma 9 (Bitonic): Let P be a real sequence, and A_1 A_2 … A_m the decreasingly right-skew partition of a sequence A. Suppose that μ(P A_1 … A_k) = max{μ(P A_1 … A_i) | 0 ≤ i ≤ m}. Then μ(P A_1 … A_i) > μ(A_{i+1}) if and only if i ≥ k.
15
What does Lemma 9 tell us? Locating the good partner can be done with binary search! To find k so that μ(P A_1 … A_k) = max{μ(P A_1 … A_i) | 0 ≤ i ≤ m}, we guess i and move it closer to k: 1. μ(P A_1 … A_i) > μ(A_{i+1}) implies i ≥ k. 2. μ(P A_1 … A_i) ≤ μ(A_{i+1}) implies i < k.
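The binary search suggested by Lemma 9 can be sketched as follows. This illustration takes the prefix P and the segments A_1 … A_m as explicit lists and builds cumulative sums in linear time, so only the search itself is logarithmic. The function name is hypothetical. The case i = m (no next segment) is treated as satisfying the condition, which is an assumption of this sketch.

```python
def best_extension_count(P, segments):
    """Return k such that P + A_1 + ... + A_k has the maximum average.

    P must be non-empty; segments is assumed to be a decreasingly
    right-skew partition. Among ties, the largest such k is returned.
    """
    m = len(segments)
    seg_sum = [sum(s) for s in segments]
    seg_len = [len(s) for s in segments]
    # cum_sum[i], cum_len[i]: sum and length of P A_1 ... A_i.
    cum_sum, cum_len = [sum(P)], [len(P)]
    for t in range(m):
        cum_sum.append(cum_sum[-1] + seg_sum[t])
        cum_len.append(cum_len[-1] + seg_len[t])

    def stops_at(i):
        # True when mu(P A_1 ... A_i) > mu(A_{i+1}); segments[i] is A_{i+1}.
        if i == m:
            return True
        return cum_sum[i] * seg_len[i] > seg_sum[i] * cum_len[i]

    lo, hi = 0, m
    while lo < hi:
        mid = (lo + hi) // 2
        if stops_at(mid):
            hi = mid  # mid >= k
        else:
            lo = mid + 1  # mid < k
    return lo
```

With P = [3] and segments [5], [1, 4], [2], the averages for i = 0, 1, 2, 3 are 3, 4, 3.25, 3, so the search returns k = 1.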
16
Big Picture of Locating Good Partners [Figure: prefixes of length L followed by numbered right-skew segments]
17
Data Structure for Binary Search: log L Pointer-Jumping Tables. j^(k) denotes the right end-point of the k-th right-skew segment. p^(0)[i] = p[i], where p[i] is the right-skew pointer for i; p^(k+1)[i] = min{p^(k)[p^(k)[i] + 1], n}. Here 1 ≤ k ≤ log L. The precomputation of the jumping tables takes at most O(n log L) time.
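The recurrence for the jumping tables can be sketched as below, given an array p of right-skew pointers. The sketch uses 0-based indices, so the clamp to n in the slide becomes a clamp to n - 1. The function name and the `levels` parameter (how many doublings to build, about log L) are assumptions.

```python
def build_jump_tables(p, levels):
    """Return tables where tables[k][i] jumps over 2**k right-skew segments from i.

    tables[0] is p itself; tables[k + 1][i] = tables[k][tables[k][i] + 1],
    clamped to the last index when the jump would run past the end.
    """
    n = len(p)
    tables = [list(p)]
    for _ in range(levels):
        prev = tables[-1]
        nxt = []
        for i in range(n):
            start = prev[i] + 1  # first index after the segments covered so far
            nxt.append(prev[start] if start < n else n - 1)
        tables.append(nxt)
    return tables
```

Each level costs O(n), so building about log L levels matches the O(n log L) precomputation bound stated above.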
18
The Time Complexity: There are n phases in total. Each phase costs O(log L). Overall: O(n log L) time.
19
Crying Out for A Linear Time Algorithm!!
20
A Linear-Time Algorithm for Binary Strings (Hsueh-I Lu, 2002). Builds upon the previous algorithm. Improvements: - Considering an upper bound on the number of right-skew segments - Working simultaneously on the right-skew partitions of the forward and reverse strings - Utilizing properties of binary strings
21
Basic Scheme: Let B = log^3 n and b = (log log n)^3. 1. Choose O(n / log n) indices i of S such that if g(i) - i > B holds for any such i, then g(i) can be found in O(log n) time. 2. Choose O(n / log log n) indices i of S such that if B ≥ g(i) - i > b holds for any such i, then g(i) can be found in O(log log n) time. 3. Find g(i) for all indices i such that g(i) - i ≤ b.
22
Notation: A right-skew decomposition of any substring S[p, q] is a nonempty set of indices i_1, i_2, …, i_l so that S[i_1, i_2], S[i_2, i_3], …, S[i_{l-1}, i_l] form the decreasingly right-skew partition of S[p, q]. Let D_S(i, j) denote the right-skew decomposition of S[i, j]. If P = {p_1, p_2, …, p_k, p_{k+1}}, where p_1 < p_2 < … < p_{k+1}, then
23
An Intuitive Observation: Right-skew pointers cannot cross. [Figure: three consecutive segments A, B, C] 1. By definition of right-skew segment: μ(A) ≤ μ(B) ≤ μ(C). Thus μ(A+B) ≤ μ(C). 2. By definition of decreasingly right-skew partition: μ(A+B) > μ(C). Contradiction.
24
The Big Picture of Right-Skew Decomposition
25
Lemma 3 (from the big picture): If j ∈ D_S(P), then D_S(j, n) ⊆ D_S(P). Lemma 3 tells us that if j belongs to the right-skew decomposition of some set of indices, then its good partner will also belong to it. Thus, we only need to search for its good partner among a limited number of indices.
26
Lemma 4: |D_S(i, j)| = O((j - i)^(2/3)) (it holds for binary strings only). Define: A right-skew substring determined by D_S(i, j) is an indivisible right-skew segment. A right-skew substring S[p, q] is long (short) if q - p ≥ l^(1/3) (q - p < l^(1/3)). Prove Lemma 4 by showing that the number of long and short right-skew substrings for a binary string is O((j - i)^(2/3)).
27
Phase 1: g(i) - i > B; g_R(j) - j > B. Define: P_short = {p | p mod B = 0 and 0 ≤ p < n} ∪ {n}. We have |D_S(P_short)| = O(n / log n). In this phase, we take care of each index i such that i and g(i), i.e. the good partner of i, are both in D_S(P_short) D_R(P_short)
28
Phase 2: L + b < g(i) - i ≤ L + B; L + b < g_R(j) - j ≤ L + B. Define: P_tiny = {p | p mod b = 0 and 0 ≤ p < n} ∪ {n}. We have |D_S(P_tiny)| = O(n / log log n). In this phase, we take care of each index i such that i and g(i), i.e. the good partner of i, are both in D_S(P_tiny) D_R(P_tiny)
29
Phase 3: g(i) - i ≤ L + b, g_R(j) - j ≤ L + b. We set up a table M whose (x, y) entry contains the index z such that, if C is a binary string of L + b bits, x is the number of '1's in the first L bits of C, y is the binary string consisting of the last b bits of C, and z is the good partner of index 0 in C. Because b is relatively small, the number of possible values for x and y is linear. By looking up the table M, we can handle the leftover case in O(n) time.
30
Open Problems: How to extend the linear time algorithm for binary strings to arbitrary strings.
31
INTERESTED? Contact: Jie Zheng, Department of Computer Science, Surge Building # 350, UC, Riverside. E-mail: zjie@cs.ucr.edu