Efficient Algorithms for Locating Maximum Average Consecutive Substrings Jie Zheng Department of Computer Science UC, Riverside.

Presentation transcript:

Outline: Problem Definition; Applications to Molecular Biology; Two Existing Algorithms; Open Problems

Definition of Problem: Given a sequence of real numbers $A = \langle a_1, a_2, \ldots, a_n \rangle$ and a positive integer $L \le n$, the goal is to find a consecutive substring of $A$ of length at least $L$ such that the average of the numbers in the substring is maximized.
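As a reference point for the problem statement, the following is a minimal brute-force sketch in Python. It is not one of the algorithms discussed in the talk; the function name `max_average_substring_bruteforce` and the returned `(average, start, end)` convention (0-based, `end` exclusive) are assumptions made for illustration.

```python
from itertools import accumulate


def max_average_substring_bruteforce(a, L):
    """Check every substring of length >= L and return (average, start, end).

    Indices are 0-based and `end` is exclusive. Runs in O(n^2) time.
    """
    n = len(a)
    if not 1 <= L <= n:
        raise ValueError("L must satisfy 1 <= L <= len(a)")
    # prefix[k] = a[0] + ... + a[k-1], so any substring sum is one subtraction.
    prefix = [0, *accumulate(a)]
    best = None
    for start in range(n - L + 1):
        for end in range(start + L, n + 1):
            avg = (prefix[end] - prefix[start]) / (end - start)
            if best is None or avg > best[0]:
                best = (avg, start, end)
    return best
```

For example, with `a = [1, 5, 2, 6, 1]` and `L = 2` the best substring is `[5, 2, 6]`, with average 13/3. A quadratic baseline like this is mainly useful for checking faster implementations on small inputs.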

Applications in Biology: Locating GC-rich Regions; Post-Processing Sequence Alignments; Annotating Multiple Sequence Alignments; Computing Ungapped Local Alignments with Length Constraints

Two Existing Algorithms: An $O(n \log L)$-time algorithm (Yaw-Ling Lin, Tao Jiang, Kun-Mao Chao, 2001); a linear-time algorithm for binary strings (Hsueh-I Lu, 2002)

The $O(n \log L)$-time Algorithm (Yaw-Ling Lin, Tao Jiang, Kun-Mao Chao, 2001) Basic Scheme: Find the good partner of each element, i.e., for each element $a_i$, locate $a_j$ such that the segment $\langle a_i, \ldots, a_j \rangle$ has the maximum average among all substrings of length at least $L$ starting from $a_i$. Then choose the segment with the maximum average among the $n$ candidates.

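Important Concepts: Right-Skew Sequence. A sequence $A = \langle a_1, a_2, \ldots, a_n \rangle$ is right-skew if and only if the average of any prefix $\langle a_1, \ldots, a_i \rangle$ is always less than or equal to the average of the remaining suffix $\langle a_{i+1}, \ldots, a_n \rangle$.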
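The definition can be checked directly. The sketch below is an illustration only (the name `is_right_skew` is an assumption, not from the talk); it compares averages by cross-multiplication so that integer inputs are handled without rounding.

```python
def is_right_skew(a):
    """Return True if every proper prefix has average <= the remaining suffix."""
    n = len(a)
    total = sum(a)
    prefix_sum = 0
    for i in range(1, n):  # prefix a[0:i], suffix a[i:n], both nonempty
        prefix_sum += a[i - 1]
        suffix_sum = total - prefix_sum
        # prefix_sum / i > suffix_sum / (n - i), written without division
        if prefix_sum * (n - i) > suffix_sum * i:
            return False
    return True
```

For instance, `[1, 3, 2]` is right-skew (prefix averages 1 and 2 against suffix averages 2.5 and 2), while `[3, 1]` is not. A single-element sequence is trivially right-skew.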

Important Concepts: Decreasingly Right-Skew Partition. A partition $A = A_1 A_2 \cdots A_k$ is decreasingly right-skew if each segment $A_i$ of the partition is right-skew and $\mu(A_i) > \mu(A_j)$ for any $i < j$, where $\mu(X)$ denotes the average of $X$.

Big Picture of Right-Skew Partition [Figure: consecutive segments A, B, C, D] Intuition: 1. If A is chosen, B must also be chosen. 2. If C is not chosen, D cannot be chosen either.

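Lemma 7 (Huang): The maximum average substring cannot be longer than $2L-1$. Proof: If $C$ is a maximum average substring with length $\ge 2L$, let $C = AB$, where $|A| \ge L$ and $|B| \ge L$. Then the average of $A$ or $B$ is no less than that of $C$. Say $\mu(A) \ge \mu(B)$; then $\mu(A) \ge \mu(AB) = \mu(C)$, so the shorter substring $A$ still has length at least $L$ and an average at least as large.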
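One immediate consequence of Lemma 7 is that it suffices to examine substrings of length between $L$ and $2L-1$, which gives a simple $O(nL)$-time method. The sketch below illustrates that consequence; it is not the $O(n \log L)$ algorithm of the talk, and the name `max_average_within_2L` is hypothetical.

```python
from itertools import accumulate


def max_average_within_2L(a, L):
    """Return (average, start, end) over substrings with L <= length <= 2L-1.

    By Lemma 7 some maximum average substring has length at most 2L-1,
    so this restricted search still finds the optimum average. O(n*L) time.
    Indices are 0-based and `end` is exclusive.
    """
    n = len(a)
    if not 1 <= L <= n:
        raise ValueError("L must satisfy 1 <= L <= len(a)")
    prefix = [0, *accumulate(a)]  # prefix[k] = sum of a[0:k]
    best = None
    for start in range(n - L + 1):
        last_end = min(n, start + 2 * L - 1)
        for end in range(start + L, last_end + 1):
            avg = (prefix[end] - prefix[start]) / (end - start)
            if best is None or avg > best[0]:
                best = (avg, start, end)
    return best
```

With `a = [1, 5, 2, 6, 1]` and `L = 2`, only lengths 2 and 3 are tried, and the substring `[5, 2, 6]` with average 13/3 is returned.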

Main Idea of the $O(n \log L)$ Algorithm: 1. Compute the decreasingly right-skew partition in $O(n)$ time. 2. Find the good partner for each index in $O(\log L)$ time.
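The two steps can be put together in a simplified end-to-end sketch. As an assumption for brevity, the version below finds each good partner by walking the right-skew segments one at a time instead of using the binary search, so it illustrates the structure of the method but does not achieve the $O(n \log L)$ bound. The function name and the 0-based, end-exclusive return convention are hypothetical.

```python
def max_average_substring(a, L):
    """Return (average, start, end) of a maximum average substring, length >= L."""
    n = len(a)
    if not 1 <= L <= n:
        raise ValueError("L must satisfy 1 <= L <= len(a)")

    # Step 1: right-skew pointers. After the loop, a[i..p[i]] is the first
    # segment of the decreasingly right-skew partition of a[i..n-1], with
    # sum s[i] and length w[i].
    p, s, w = list(range(n)), list(a), [1] * n
    for i in range(n - 1, -1, -1):
        while p[i] < n - 1:
            j = p[i] + 1
            if s[i] * w[j] > s[j] * w[i]:  # mean(current) > mean(next): stop
                break
            s[i] += s[j]
            w[i] += w[j]
            p[i] = p[j]

    # Step 2: for each start i, take the mandatory window a[i..i+L-1] and
    # append whole segments while doing so does not lower the average.
    best = None
    window = sum(a[:L])
    for i in range(n - L + 1):
        if i > 0:
            window += a[i + L - 1] - a[i - 1]
        total, length, j = window, L, i + L
        while j < n and total * w[j] <= s[j] * length:
            total += s[j]
            length += w[j]
            j = p[j] + 1
        avg = total / length
        if best is None or avg > best[0]:
            best = (avg, i, j)
    return best
```

On `a = [1, 5, 2, 6, 1]` with `L = 2`, the candidate for start index 1 is the window `[5, 2]` extended by the segment `[6]`, which gives the overall optimum 13/3.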

Compute the decreasingly right-skew partition: 1. Lemma 5: Every real sequence $A = \langle a_1, a_2, \ldots, a_n \rangle$ has a unique decreasingly right-skew partition. 2. Lemma 6: All right-skew pointers for a length-$n$ sequence can be computed in $O(n)$ total time, by an amortized analysis.

Compute the right-skew pointers
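The slide's own pseudocode is not preserved in this transcript. The following is a sketch of one standard way to compute the pointers: scan from right to left and merge a segment with its right neighbor whenever its average does not exceed the neighbor's. The names `right_skew_pointers` and `decreasing_right_skew_partition`, and the 0-based indexing, are assumptions.

```python
def right_skew_pointers(a):
    """Return p, where a[i..p[i]] is the first segment of the decreasingly
    right-skew partition of the suffix a[i..n-1] (0-based, inclusive)."""
    n = len(a)
    p = list(range(n))  # each element starts as its own segment
    s = list(a)         # s[i] = sum of a[i..p[i]]
    w = [1] * n         # w[i] = length of a[i..p[i]]
    for i in range(n - 1, -1, -1):
        while p[i] < n - 1:
            j = p[i] + 1  # start of the next segment to the right
            # Stop once mean(a[i..p[i]]) > mean(a[j..p[j]]).
            if s[i] * w[j] > s[j] * w[i]:
                break
            s[i] += s[j]
            w[i] += w[j]
            p[i] = p[j]
    return p


def decreasing_right_skew_partition(a):
    """Split a into its decreasingly right-skew partition by following p."""
    p = right_skew_pointers(a)
    segments, i = [], 0
    while i < len(a):
        segments.append(a[i:p[i] + 1])
        i = p[i] + 1
    return segments
```

For `a = [5, 1, 3, 2, 2]` the pointers are `[0, 4, 2, 4, 4]` and the partition is `[5]`, `[1, 3, 2, 2]`, with averages 5 and 2. Each merge removes one segment boundary for good, which is the intuition behind the linear total running time.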

Find good partner in $O(\log L)$. Lemma 9 (Bitonic): Let $P$ be a real sequence, and $A_1 A_2 \cdots A_m$ the decreasingly right-skew partition of a sequence $A$. Suppose that $\mu(P A_1 \cdots A_k) = \max\{\mu(P A_1 \cdots A_i) \mid 0 \le i \le m\}$. Then $\mu(P A_1 \cdots A_i) > \mu(A_{i+1})$ if and only if $i \ge k$.

What does Lemma 9 tell us? Locating the good partner can be done with binary search! To find $k$ such that $\mu(P A_1 \cdots A_k) = \max\{\mu(P A_1 \cdots A_i) \mid 0 \le i \le m\}$, we guess $i$ and move it closer to $k$: 1. $\mu(P A_1 \cdots A_i) > \mu(A_{i+1})$ implies $i \ge k$. 2. $\mu(P A_1 \cdots A_i) \le \mu(A_{i+1})$ implies $i < k$.
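The binary search suggested by Lemma 9 can be sketched as follows. As a simplifying assumption, the segments are passed as a list of `(sum, length)` pairs and prefix totals are built inside the function, whereas the talk uses pointer-jumping tables to reach segments. The name `good_partner_index` is hypothetical, and `p_len` is assumed to be positive.

```python
from itertools import accumulate


def good_partner_index(p_sum, p_len, segments):
    """Return k in 0..m maximizing the average of P A_1 ... A_k.

    `segments[i]` is the (sum, length) of A_{i+1} in a decreasingly
    right-skew partition; P has sum `p_sum` and length `p_len` > 0.
    """
    m = len(segments)
    seg_sum = [0, *accumulate(s for s, _ in segments)]
    seg_len = [0, *accumulate(l for _, l in segments)]

    def exceeds_next(i):
        # True iff mean(P A_1..A_i) > mean(A_{i+1}); taken as True at i == m.
        if i == m:
            return True
        total, length = p_sum + seg_sum[i], p_len + seg_len[i]
        next_sum, next_len = segments[i]
        return total * next_len > next_sum * length

    # By Lemma 9 the predicate is False for i < k and True for i >= k,
    # so the smallest i where it holds is k.
    lo, hi = 0, m
    while lo < hi:
        mid = (lo + hi) // 2
        if exceeds_next(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo
```

For example, with $P$ of sum 6 and length 2 and segments of averages 5, 4, and 2 given as `[(5, 1), (4, 1), (8, 4)]`, the running averages are 3, 11/3, 15/4, and 23/8, so the search returns `k = 2`.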

Big Picture of Locating Good Partners [Figure]

Data Structure for Binary Search: $\lceil \log L \rceil$ Pointer-Jumping Tables. $j^{(k)}$ denotes the right end-point of the $k$th right-skew segment. $p^{(0)}[i] = p[i]$, where $p[i]$ is the right-skew pointer for $i$, and $p^{(k+1)}[i] = \min\{p^{(k)}[p^{(k)}[i]+1],\, n\}$. $1 \le k \le \lceil \log L \rceil$. The precomputation of the jumping tables takes at most $O(n \log L)$ time.
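The recurrence for the jumping tables translates almost directly into code. The sketch below assumes 0-based indices, so the last position is `n - 1` instead of `n`, and it assumes the number of levels is the ceiling of $\log_2 L$; the name `build_jump_tables` is hypothetical.

```python
def build_jump_tables(p, L):
    """Return tables where tables[k][i] is the right end-point reached from i
    after covering 2**k consecutive right-skew segments (clamped to n - 1).

    `p` is the list of right-skew pointers (0-based, inclusive end-points).
    """
    n = len(p)
    levels = (L - 1).bit_length()  # equals ceil(log2(L)) for L >= 1
    tables = [list(p)]             # level 0 is the pointer array itself
    for _ in range(levels):
        prev = tables[-1]
        nxt = []
        for i in range(n):
            j = prev[i] + 1        # first index after the current block
            nxt.append(prev[j] if j < n else n - 1)
        tables.append(nxt)
    return tables
```

If every element is its own segment, e.g. `p = [0, 1, 2, 3, 4, 5]` with `L = 4`, then level 1 is `[1, 2, 3, 4, 5, 5]` and level 2 is `[3, 4, 5, 5, 5, 5]`. Each level is filled in $O(n)$ time, which matches the $O(n \log L)$ precomputation bound.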

The Time Complexity: There are $n$ phases in total, and each phase costs $O(\log L)$ time. Overall: $O(n \log L)$ time.

Crying Out for A Linear Time Algorithm!!

A Linear-Time Algorithm for Binary Strings (Hsueh-I Lu, 2002). Builds upon the previous algorithm. Improvements: - Considering an upper bound on the number of right-skew segments - Working simultaneously on the right-skew partitions of the forward and reverse strings - Utilizing properties of binary strings

Basic Scheme: Let $B = \lceil \log^3 n \rceil$ and $b = \lceil (\log\log n)^3 \rceil$, and let $g(i)$ denote the good partner of index $i$. 1. Choose $O(n/\log n)$ indices $i$ of $S$ such that if $g(i) - i \ge B$ holds for any such $i$, then $g(i)$ can be found in $O(\log n)$ time. 2. Choose $O(n/\log\log n)$ indices $i$ of $S$ such that if $B \ge g(i) - i \ge b$ holds for any such $i$, then $g(i)$ can be found in $O(\log\log n)$ time. 3. Find $g(i)$ for all indices $i$ such that $g(i) - i \le b$.

Notation: A right-skew decomposition of any substring $S[p, q]$ is a nonempty set of indices $i_1, i_2, \ldots, i_l$ such that $S[i_1, i_2], S[i_2, i_3], \ldots, S[i_{l-1}, i_l]$ form the decreasingly right-skew partition of $S[p, q]$. Let $D_S(i, j)$ denote the right-skew decomposition of $S[i, j]$. If $P = \{p_1, p_2, \ldots, p_k, p_{k+1}\}$, where $p_1 < p_2 < \cdots < p_{k+1}$, then

An Intuitive Observation: Right-skew pointers cannot cross. [Figure: consecutive pieces A, B, C] 1. By definition of a right-skew segment: $\mu(A) \le \mu(B) \le \mu(C)$. Thus $\mu(A+B) \le \mu(C)$. 2. By definition of a decreasingly right-skew partition: $\mu(A+B) > \mu(C)$. Contradiction.

The Big Picture of Right-Skew Decomposition

Lemma 3 (from the big picture): If $j \in D_S(P)$, then $D_S(j, n) \subseteq D_S(P)$. Lemma 3 tells us that if $j$ belongs to the right-skew decomposition of some set of indices, then its good partner also belongs to it. Thus, we only need to search for its good partner among a limited number of indices.

Lemma 4: $|D_S(i, j)| = O((j - i)^{2/3})$ (it holds for binary strings only). Define: A right-skew substring determined by $D_S(i, j)$ is an indivisible right-skew segment. A right-skew substring $S[p, q]$ is long if $q - p \ge l^{1/3}$ and short if $q - p < l^{1/3}$. Lemma 4 is proved by showing that the numbers of long and of short right-skew substrings of a binary string are both $O((j - i)^{2/3})$.

Phase 1: $g(i) - i \ge B$; $g_R(j) - j \ge B$. Define: $P_{\text{short}} = \{p \mid p \equiv 0 \pmod{B} \text{ and } 0 \le p < n\} \cup \{n\}$. We have $|D_S(P_{\text{short}})| = O(n/\log n)$. In this phase, we take care of each index $i$ such that $i$ and $g(i)$, i.e., the good partner of $i$, are both in $D_S(P_{\text{short}}) \cup D_R(P_{\text{short}})$.

Phase 2: $L + b < g(i) - i \le L + B$; $L + b < g_R(j) - j \le L + B$. Define: $P_{\text{tiny}} = \{p \mid p \equiv 0 \pmod{b} \text{ and } 0 \le p < n\} \cup \{n\}$. We have $|D_S(P_{\text{tiny}})| = O(n/\log\log n)$. In this phase, we take care of each index $i$ such that $i$ and $g(i)$, i.e., the good partner of $i$, are both in $D_S(P_{\text{tiny}}) \cup D_R(P_{\text{tiny}})$.

Phase 3: $g(i) - i \le L + b$; $g_R(j) - j \le L + b$. We set up a table $M$ whose $(x, y)$ entry contains the index $z$ such that, if $C$ is a binary string of $L + b$ bits, $x$ is the number of 1s in the first $L$ bits of $C$, and $y$ is the binary string consisting of the last $b$ bits of $C$, then $z$ is the good partner of index 0 in $C$. Because $b$ is relatively small, the number of possible values for $x$ and $y$ is linear. By looking up the table $M$, we can handle the leftover case in $O(n)$ time.

Open Problems: How to extend the linear time algorithm for binary strings to arbitrary strings.

INTERESTED? Contact: Jie Zheng Department of Computer Science Surge Building # 350 UC, Riverside