An O(N^2) Algorithm for Discovering Optimal Boolean Pattern Pairs Hideo Bannai, Heikki Hyyro, Ayumi Shinohara, Masayuki Takeda, Kenta Nakai and Satoru.

1 An O(N^2) Algorithm for Discovering Optimal Boolean Pattern Pairs
Hideo Bannai, Heikki Hyyro, Ayumi Shinohara, Masayuki Takeda, Kenta Nakai and Satoru Miyano
IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 1, No. 4, October-December 2004
Presented by Siva Ramakrishnan Subramanian, Graduate Student, CPSC, TAMU. siv@tamu.edu

2 Motive
• Finding patterns conserved across a set of biologically related sequences, in order to extract meaning from them, is a common topic in bioinformatics.
• More than one sequence element can affect the biological characteristics of the sequences.
• Past work on finding composite patterns: Structured Motifs, MITRA, Bioprospector…

3 Overview
• Given a set of sequences and a numeric attribute value for each sequence, the problem is to find the optimal (w.r.t. a scoring function) pair of patterns combined with any Boolean function.
• Past work finds combinations of two patterns p and q where (p ∧ q) occurs in each string. This paper's formulation allows all possible combinations, such as (p ∧ ¬q), so conditions like "presence of one element but absence of the other" can be specified.
• Thus this method can be used to find cooperative as well as competing sequence elements.
• The main contributions of this paper are an O(N^2) algorithm and an implementation based on suffix arrays (this is the homework!!!).

4 Preliminaries
• Let Σ be a finite alphabet and let ε denote the empty string.
• Let Ψ(p, s) be a Boolean matching function that is true if and only if p is a substring of s.
• Boolean pattern pair: a triplet ⟨p, F, q⟩, where p and q are patterns and F is a 2-ary Boolean function.
• The matching function value for a pattern pair, Ψ(⟨p, F, q⟩, s), is defined as F(Ψ(p, s), Ψ(q, s)).
• All possible functions F are listed in the following table.

5 All Candidate Boolean Operations (table of the 16 functions F_0, …, F_15; not reproduced in this transcript)
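Since a 2-ary Boolean function is fully determined by its outputs on the four input pairs, the 16 candidates can be enumerated mechanically. The following is a minimal sketch of that enumeration together with the matching function for a pattern pair. The indexing convention (bit j of k gives the output for the j-th input pair) and the names `INPUTS`, `boolean_function`, `ALL_FUNCTIONS`, `psi` and `psi_pair` are assumptions made for illustration; the slide's own numbering F_0, …, F_15 may order the functions differently.

```python
from itertools import product

# The four possible input pairs (a, b) = (Ψ(p, s), Ψ(q, s)).
INPUTS = list(product([False, True], repeat=2))  # (F,F), (F,T), (T,F), (T,T)


def boolean_function(k):
    """Return the 2-ary Boolean function whose truth table is encoded by k.

    Assumed convention (not from the slides): bit j of k is the output
    for INPUTS[j], with 0 <= k <= 15.
    """
    table = {INPUTS[j]: bool((k >> j) & 1) for j in range(4)}
    return lambda a, b: table[(bool(a), bool(b))]


ALL_FUNCTIONS = [boolean_function(k) for k in range(16)]


def psi(p, s):
    """Ψ(p, s): true if and only if p is a substring of s."""
    return p in s


def psi_pair(p, f, q, s):
    """Ψ(⟨p, F, q⟩, s) = F(Ψ(p, s), Ψ(q, s))."""
    return f(psi(p, s), psi(q, s))
```

Under this assumed convention, index 8 is p ∧ q and index 4 is p ∧ ¬q, the "presence of one element but absence of the other" case from the overview.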

6 Preliminaries
• A pattern or a Boolean pattern pair Π matches a string s if and only if Ψ(Π, s) is true. The pattern ε matches any string.
• For a given set of strings S = {s_1, …, s_m}, let M(Π, S) denote the set of indices of strings in S that Π matches, that is, M(Π, S) = {i | Ψ(Π, s_i) = true}, and let its complement be denoted M'(Π, S) = {i | Ψ(Π, s_i) = false}.
• For each s_i ∈ S, we are given an associated numeric attribute value r_i. Let R(Π, S) = Σ_{i ∈ M(Π, S)} r_i denote the sum of r_i over all s_i that Π matches. Let M(Π) and R(Π) be shorthand for M(Π, S) and R(Π, S), respectively. Note that |M(ε)| = m and R(ε) = Σ_{i=1}^{m} r_i.
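The definitions of M(Π, S) and R(Π, S) translate directly into a few lines of code. This is a small sketch for single patterns only; the function names, the example strings and the attribute values are made up for illustration, and indices are 0-based here rather than 1-based as on the slide.

```python
def matched_indices(pattern, strings):
    """M(pattern, S): indices of the strings that contain the pattern."""
    return {i for i, s in enumerate(strings) if pattern in s}


def attribute_sum(pattern, strings, r):
    """R(pattern, S): sum of r_i over the strings the pattern matches."""
    return sum(r[i] for i in matched_indices(pattern, strings))


# Hypothetical data: three strings with attribute values r_1, r_2, r_3.
strings = ["acct", "gctt", "ctctg"]
r = [1.0, 2.0, 4.0]

everything = matched_indices("", strings)   # ε matches every string
only_second = matched_indices("ctt", strings)
```

Here `everything` contains all three indices, so `attribute_sum("", strings, r)` is the total of all attribute values, matching the note that |M(ε)| = m and R(ε) is the sum of all r_i.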

7 Scoring Function
• The objective is to find a pattern that maximizes a suitable scoring function score.
• The paper concentrates on scoring functions whose values for a pattern Π depend on values cumulated over the strings in S that match Π.
• The scoring function score takes the parameters |M(Π)| and R(Π).
• It is also assumed that the score value can be computed in constant time once the parameter values are known.
• The specific choice of scoring function depends heavily on the particular application.

8 Problem Definition
• Given a set S = {s_1, …, s_m} of strings, where each string s_i is assigned a numeric attribute value r_i, and a scoring function score: R × R → R, find the Boolean pattern pair Π ∈ {⟨p, F, q⟩ | p, q ∈ Σ*, F ∈ {F_0, …, F_15}} that maximizes score(|M(Π)|, R(Π)).

9 Suffix tree & GST
• Edges are labeled with substrings of s.
• For a node v, l(v) is the string obtained by concatenating the edge labels from the root to v.
• For each leaf node v, l(v) is a distinct suffix of s, and for each suffix there exists such a leaf v.
• Each internal node has at least 2 children; the first characters of the labels on the edges to its children are distinct.
• GST: given a set S = {s_1, …, s_m}, the GST is a suffix tree for the string s_1 $_1 … s_m $_m, where each $_i is a distinct character that does not belong to Σ.
• All paths end at the first appearance of a $_i, and each leaf is labeled with id_i.
• O(N) space and time.

10 Suffix tree, s = caggaggaccat
The paths of the suffix tree from the root to the leaves (suffixes) are sorted in lexicographic order from left to right, each leaf corresponding to a position in the suffix array.
The integer in the suffix array represents the position in the string from which the corresponding suffix starts: A_s[i] = j indicates that s[j:n] is the i-th suffix in the lexicographic ordering.
The lcp array gives the length of the longest common prefix (the shared path from the root) of consecutive suffixes in the suffix array.
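The two arrays described on this slide can be reproduced for the example string with a deliberately naive construction. The sketch below sorts the suffixes directly, which is far from linear time and is only meant to make the definitions concrete. Positions are 0-based here, which may differ from the figure, and the function names are illustrative.

```python
def suffix_array(s):
    """Start positions of the suffixes of s in lexicographic order (naive sort)."""
    return sorted(range(len(s)), key=lambda j: s[j:])


def lcp_array(s, sa):
    """lcp[t] = length of the longest common prefix of the suffixes
    starting at sa[t - 1] and sa[t]; lcp[0] is set to 0 by convention."""
    lcp = [0] * len(sa)
    for t in range(1, len(sa)):
        i, j, k = sa[t - 1], sa[t], 0
        while i + k < len(s) and j + k < len(s) and s[i + k] == s[j + k]:
            k += 1
        lcp[t] = k
    return lcp


s = "caggaggaccat"
sa = suffix_array(s)
lcp = lcp_array(s, sa)
```

For instance, the two smallest suffixes that share the prefix "agga" sit next to each other in `sa`, and the corresponding `lcp` entry is 4.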

11 GST (Generalized Suffix Tree) A Generalized Suffix Tree and its corresponding suffix array for the strings {facct, gctt, ctctg}.

12 A Naïve O(N^3) Algorithm
• Let N = Σ_{i=1}^{m} length(s_i).
• There are O(N) candidates for a single pattern: patterns of the form l(v), where v is a node in the GST over the set S. (Why???)
• Hence there are O(N^2) candidate pattern pairs.
• For a given pair, the values |M(Π)| and R(Π) can be computed in O(N) time by any linear-time string matching algorithm.
• The scoring function value is then calculated in constant time given |M(Π)| and R(Π).
• Time = O(N^3). Space = O(N) for the suffix tree.
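The naive search can be sketched as a brute-force loop over candidate pairs and all 16 Boolean functions. This sketch simplifies in two labeled ways: it enumerates every distinct substring rather than only the O(N) node labels l(v) of a GST, and it uses Python's substring test in place of a linear-time matcher, so it does not meet the O(N^3) bound. The function names, the truth-table encoding of F by an integer k, and the example data are assumptions for illustration.

```python
from itertools import product


def all_substrings(strings):
    """Every distinct substring of the input strings, plus the empty string."""
    subs = {""}
    for s in strings:
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                subs.add(s[i:j])
    return sorted(subs)


def naive_best_pair(strings, r, score):
    """Try every (p, F, q) and return (best score, p, k, q).

    F is encoded by an integer k in 0..15 (assumed convention): bit
    2*a + b of k is the output for inputs a = Ψ(p, s), b = Ψ(q, s).
    """
    candidates = all_substrings(strings)
    best = None
    for p, q in product(candidates, repeat=2):
        a = [p in s for s in strings]
        b = [q in s for s in strings]
        for k in range(16):
            matched = [i for i in range(len(strings))
                       if (k >> (2 * a[i] + b[i])) & 1]
            value = score(len(matched), sum(r[i] for i in matched))
            if best is None or value > best[0]:
                best = (value, p, k, q)
    return best


# Hypothetical data and a hypothetical score that just returns R(Π).
strings = ["acct", "gctt", "ctctg"]
r = [3.0, -2.0, 1.0]
best_value, best_p, best_k, best_q = naive_best_pair(
    strings, r, lambda count, total: total)
```

With this toy score the best pair has to match the first and third strings while excluding the second, which a single substring pattern joined by conjunction cannot do here; a function involving negation is required.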

13 O(N^2) Algorithm
• Two steps:
• Find |M(l(v))| and R(l(v)) for all nodes v of the GST in O(N) time and space.
• Solve the optimal pair of substring patterns problem in O(N^2) time and O(N) space for any scoring function score, provided that it can be calculated in constant time given its inputs.

14 Algorithm- First step
• If R(l(v)) can be found for all v in O(N) time, so can |M(l(v))| (when r_i = 1 for all i, R(l(v)) = |M(l(v))|).
• Let LF(v) be the set of all leaf nodes in the subtree rooted at node v.
• Let c_i(v) denote the number of leaves in LF(v) that have the label id_i.
• Let the sum of leaf attributes be Σ_{LF(v)} r_i.

15 Algorithm- First step
• Σ_{LF(v)} r_i = Σ_{i ∈ M(l(v))} (c_i(v) · r_i)
• R(l(v)) = Σ_{i ∈ M(l(v))} r_i = Σ_{LF(v)} r_i - Σ_{i ∈ M(l(v))} ((c_i(v) - 1) · r_i) … (1)
• Let the correction factor be corr(l(v), S) = Σ_{i ∈ M(l(v))} ((c_i(v) - 1) · r_i).
• In (1), Σ_{LF(v)} r_i can be calculated for all v by a linear-time post-order traversal, since Σ_{LF(v)} r_i = Σ_{v'} Σ_{LF(v')} r_i, where v' ranges over the child nodes of v.

16 Algorithm- First step
• How to remove the redundancies (correction factors) in (1)?
• Let I(id_i) be the list of all leaves with the label id_i, in the order they appear in the post-order traversal of the tree. The lists I(id_i) can be constructed in linear time for all labels id_i.
• The leaves in LF(v) with the label id_i form a contiguous interval of length c_i(v) in the list I(id_i).
• If c_i(v) > 0, a length-c_i(v) interval in I(id_i) contains (c_i(v) - 1) adjacent (overlapping) leaf pairs.
• If x, y ∈ LF(v), the node lca(x, y) belongs to the subtree rooted at v.
• For any s_i ∈ S, Ψ(l(v), s_i) = true, that is, i ∈ M(l(v)), if and only if there is a leaf x ∈ LF(v) with the label id_i.

17 Algorithm- First step
• Initially, the correction value is 0 for all v.
• For each adjacent leaf pair (x, y) in I(id_i), add r_i to the correction value of the node lca(x, y).
• For each v with c_i(v) > 0, the sum of the correction values over the nodes of the subtree rooted at v is then (c_i(v) - 1) · r_i.
• Repeating this for all lists I(id_i), the preceding total sum becomes Σ_{i ∈ M(l(v))} ((c_i(v) - 1) · r_i) = corr(l(v), S).
• Perform a linear-time bottom-up (post-order) traversal to find R(l(v)).

18 Algorithm- First step
v3: r3 + r2 + r3 - r3 = r2 + r3 = R(l(v3))
v2: R(l(v3)) + r2 - r2 = r2 + r3 = R(l(v2))
v1: r1 + R(l(v2)) + r3 - r3 = r1 + r2 + r3 = R(l(v1))
Correction values at v1, v2, v3 are set to r3, r2, r3.
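The correction-value idea of the first step can be tried out on a small example. The sketch below is not the paper's implementation: it uses an uncompacted suffix trie instead of a compacted GST and a naive parent-climbing LCA, so it is not linear time. It does follow the same bookkeeping: sum the leaf attributes bottom-up, add r_i at the LCA of each adjacent pair of leaves labeled id_i, and subtract the accumulated corrections. All names are illustrative.

```python
def build_suffix_trie(strings):
    """Build an uncompacted generalized suffix trie as parallel lists."""
    children, parent, leaf_id, label = [{}], [None], [None], [""]

    def add_child(v, key, text, string_id=None):
        children.append({})
        parent.append(v)
        leaf_id.append(string_id)
        label.append(text)
        children[v][key] = len(children) - 1

    for i, s in enumerate(strings):
        for start in range(len(s) + 1):  # every suffix, including the empty one
            v = 0
            for ch in s[start:]:
                if ch not in children[v]:
                    add_child(v, ch, label[v] + ch)
                v = children[v][ch]
            add_child(v, ("$", i), label[v], string_id=i)  # leaf labeled id_i
    return children, parent, leaf_id, label


def attribute_sums(strings, r):
    """Return {l(v): R(l(v))} for every non-leaf node v of the trie."""
    children, parent, leaf_id, label = build_suffix_trie(strings)
    n = len(children)
    depth = [0] * n
    for v in range(1, n):  # a parent always has a smaller index than its child
        depth[v] = depth[parent[v]] + 1

    def lca(x, y):
        # Naive LCA: repeatedly move the deeper node up to its parent.
        while x != y:
            if depth[x] >= depth[y]:
                x = parent[x]
            else:
                y = parent[y]
        return x

    # I(id_i): leaves labeled id_i, in depth-first order.
    leaf_lists, stack = {}, [0]
    while stack:
        v = stack.pop()
        if leaf_id[v] is not None:
            leaf_lists.setdefault(leaf_id[v], []).append(v)
        stack.extend(children[v].values())

    # Add r_i at the LCA of each adjacent leaf pair in I(id_i).
    corr = [0] * n
    for i, leaves in leaf_lists.items():
        for x, y in zip(leaves, leaves[1:]):
            corr[lca(x, y)] += r[i]

    # Bottom-up accumulation of leaf sums and correction values.
    total = [0 if leaf_id[v] is None else r[leaf_id[v]] for v in range(n)]
    for v in range(n - 1, 0, -1):  # children are processed before parents
        total[parent[v]] += total[v]
        corr[parent[v]] += corr[v]
    return {label[v]: total[v] - corr[v] for v in range(n) if leaf_id[v] is None}
```

A quick sanity check is to compare each returned value against the direct definition, summing r_i over the strings that contain l(v).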

19 Pseudo code for Step 1

20 O(N^2) Algorithm
• Two steps:
• Find |M(l(v))| and R(l(v)) for all nodes v of the GST in O(N) time and space.
• Solve the optimal pair of substring patterns problem in O(N^2) time and O(N) space for any scoring function score, provided that it can be calculated in constant time given its inputs.

21 Algorithm- Second step
• There are O(N) choices for the first pattern, l(v1).
• For each l(v1), use a modified version of the previous algorithm over the O(N) choices for the second pattern, l(v2).
• Given a fixed l(v1), additionally label each string s_i ∈ S, and the corresponding leaves in the GST, with the Boolean value Ψ(l(v1), s_i). This takes O(N) time.
• Cumulate the sums and correction values separately for the true and false values of the additional label.

22 Algorithm- Second step
• Σ_{i ∈ M(l(v2))} (r_i | Ψ(l(v1), s_i) = true) = Σ_{i ∈ M(l(v2))} (r_i | Ψ(l(v1), s_i) = true, Ψ(l(v2), s_i) = true) = R(⟨l(v1), ∧, l(v2)⟩)
• Σ_{i ∈ M(l(v2))} (r_i | Ψ(l(v1), s_i) = false) = Σ_{i ∈ M(l(v2))} (r_i | Ψ(l(v1), s_i) = false, Ψ(l(v2), s_i) = true) = R(⟨¬l(v1), ∧, l(v2)⟩)
• Σ_{i ∈ M'(l(v2))} (r_i | Ψ(l(v1), s_i) = true) = Σ_{i ∈ M'(l(v2))} (r_i | Ψ(l(v1), s_i) = true, Ψ(l(v2), s_i) = false) = R(⟨l(v1), ∧, ¬l(v2)⟩) = R(l(v1)) - R(⟨l(v1), ∧, l(v2)⟩)
• Σ_{i ∈ M'(l(v2))} (r_i | Ψ(l(v1), s_i) = false) = Σ_{i ∈ M'(l(v2))} (r_i | Ψ(l(v1), s_i) = false, Ψ(l(v2), s_i) = false) = R(⟨¬l(v1), ∧, ¬l(v2)⟩) = R(ε) - R(l(v1)) - R(⟨¬l(v1), ∧, l(v2)⟩)
where R(ε) and R(l(v1)) can be computed in linear time.
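The four identities above say that, for a fixed first pattern, only the two sums over strings matching the second pattern need to be accumulated; the other two follow by subtraction. A minimal sketch of that bookkeeping is shown below. It scans the strings directly as a stand-in for the GST traversal, and the names `four_way_sums` and `pair_sum`, along with the example data, are made up for illustration.

```python
def four_way_sums(p, q, strings, r):
    """Sums of r_i split by (Ψ(p, s_i), Ψ(q, s_i)).

    Only the (true, true) and (false, true) cells are accumulated
    directly; the remaining two cells are derived from R(ε) and R(p).
    """
    in_p = [p in s for s in strings]
    in_q = [q in s for s in strings]
    r_eps = sum(r)                                          # R(ε)
    r_p = sum(ri for ri, a in zip(r, in_p) if a)            # R(p)
    r_tt = sum(ri for ri, a, b in zip(r, in_p, in_q) if a and b)
    r_ft = sum(ri for ri, a, b in zip(r, in_p, in_q) if not a and b)
    return {
        (True, True): r_tt,
        (False, True): r_ft,
        (True, False): r_p - r_tt,           # p matches, q does not
        (False, False): r_eps - r_p - r_ft,  # neither matches
    }


def pair_sum(cells, f):
    """R(⟨p, F, q⟩): add up the cells on which F evaluates to true."""
    return sum(value for (a, b), value in cells.items() if f(a, b))


# Hypothetical data.
strings = ["acct", "gctt", "ctctg"]
r = [1, 2, 4]
cells = four_way_sums("cc", "g", strings, r)
xor_sum = pair_sum(cells, lambda a, b: a != b)
```

Once the four cells are known, R for any of the 16 Boolean functions is a sum of at most four numbers, which is why the score for every F can be evaluated in constant time per pair. The same split with r_i = 1 gives |M(Π)|.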

23 Algorithm- Second step
• All cumulative values of the form Σ_i (r_i | Ψ(l(v1), s_i) = b1, Ψ(l(v2), s_i) = b2), where b1, b2 ∈ {true, false}, can be computed in linear time.
• Thus R(⟨l(v1), F, l(v2)⟩), and hence the score, can be computed in linear time for all pairs of the form ⟨l(v1), F, l(v2)⟩, given a fixed l(v1).
• Thus O(N^2) time for all pattern pairs.
• Since the O(N) calculations for each l(v1) are independent, the same GST can be reused. Hence the space complexity is O(N).

24 Algorithm- Second step

25 The rest of the paper in a nutshell
• Extension to k-ary Boolean functions.
• Implementation using suffix arrays.
• Computational experiments and results.
• Algorithm variations: multiple string attributes, distance restrictions.

26 Homework
• Explain the implementation of the Optimal Boolean Pattern Pair problem using suffix arrays in your own words. Also explain why it is more efficient than the suffix tree approach. Email: siv@tamu.edu

27 THANK YOU