1
An $O(N^2)$ Algorithm for Discovering Optimal Boolean Pattern Pairs
Hideo Bannai, Heikki Hyyro, Ayumi Shinohara, Masayuki Takeda, Kenta Nakai and Satoru Miyano
IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 1, No. 4, October-December 2004
Presented by Siva ramakrishnan Subramanian, Graduate Student, CPSC, TAMU. siv@tamu.edu
2
Motivation
Finding patterns conserved across a set of biologically related sequences, in order to extract meaning, is a common topic in bioinformatics.
More than one sequence element can affect the biological characteristics of the sequences.
Past work on finding composite patterns: Structured Motifs, MITRA, BioProspector…
3
Overview
Given a set of sequences and a numeric attribute value for each sequence, the problem is to find the optimal pair of patterns (with respect to a scoring function), combined with any Boolean function.
Past work finds combinations of two patterns p and q where $p \wedge q$ occurs in each string.
This paper's formulation allows all possible combinations, such as $p \wedge \neg q$, so conditions like “presence of one element but absence of the other” can be specified. The method can therefore find cooperative as well as competing sequence elements.
The main contributions of the paper are the $O(N^2)$ algorithm and an implementation based on suffix arrays (this is the homework!).
4
Preliminaries
Let $\Sigma$ be a finite alphabet and let $\varepsilon$ denote the empty string.
Let $\Psi(p, s)$ be a Boolean matching function that is true if and only if p is a substring of s.
Boolean pattern pair: a triplet $\langle p, F, q \rangle$ where p and q are patterns and F is a 2-ary Boolean function.
The matching function value for a pattern pair, $\Psi(\langle p, F, q \rangle, s)$, is defined as $F(\Psi(p, s), \Psi(q, s))$.
All possible functions F are defined in the following table.
5
All Candidate Boolean Operations on a Pattern Pair $\langle p, F, q \rangle$ (table of the 16 functions $F_0, \ldots, F_{15}$)
6
Preliminaries
A pattern or a Boolean pattern pair $\Pi$ matches a string s if and only if $\Psi(\Pi, s)$ is true. The pattern $\varepsilon$ matches any string.
For a given set of strings $S = \{s_1, \ldots, s_m\}$, let $M(\Pi, S)$ denote the set of indices of strings in S that $\Pi$ matches, that is, $M(\Pi, S) = \{i \mid \Psi(\Pi, s_i) = \text{true}\}$, and let its complement be $M'(\Pi, S) = \{i \mid \Psi(\Pi, s_i) = \text{false}\}$.
For each $s_i \in S$ we are given an associated numeric attribute value $r_i$. Let $R(\Pi, S) = \sum_{i \in M(\Pi, S)} r_i$ denote the sum of $r_i$ over all $s_i$ that $\Pi$ matches.
Let $M(\Pi)$ and $R(\Pi)$ be shorthand for $M(\Pi, S)$ and $R(\Pi, S)$, respectively.
Note that $|M(\varepsilon)| = m$ and $R(\varepsilon) = \sum_{i=1}^{m} r_i$.
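The definitions of Ψ, M and R translate almost directly into code. The following is a minimal Python sketch. The function names `psi`, `M` and `R`, the tuple encoding of a pattern pair, and the attribute values are illustrative assumptions; the example strings are the ones used on the GST slide.

```python
from typing import Callable, List, Set, Tuple, Union

BoolFn = Callable[[bool, bool], bool]
# A pattern is either a plain string p or a Boolean pattern pair <p, F, q>
# (the tuple encoding is an assumption of this sketch).
Pattern = Union[str, Tuple[str, BoolFn, str]]


def psi(pattern: Pattern, s: str) -> bool:
    """Matching function: substring test, or F applied to two substring tests."""
    if isinstance(pattern, str):
        return pattern in s  # the empty pattern matches every string
    p, F, q = pattern
    return F(p in s, q in s)


def M(pattern: Pattern, S: List[str]) -> Set[int]:
    """1-based indices of the strings in S that the pattern matches."""
    return {i for i, s in enumerate(S, start=1) if psi(pattern, s)}


def R(pattern: Pattern, S: List[str], r: List[float]) -> float:
    """Sum of the attribute values r_i over the matched strings."""
    return sum(r[i - 1] for i in M(pattern, S))


S = ["facct", "gctt", "ctctg"]
r = [1.0, 2.0, 4.0]  # hypothetical attribute values

# "ct is present but tc is absent"
ct_without_tc = ("ct", lambda a, b: a and not b, "tc")
print(M("ct", S), M("tc", S), M(ct_without_tc, S))
print(R("", S, r), R(ct_without_tc, S, r))
```

The last pattern pair is an instance of the "presence of one element but absence of the other" condition mentioned in the overview.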
7
Scoring Function
The objective is to find a pattern that maximizes a suitable scoring function score.
The paper concentrates on scoring functions whose value for a pattern $\Pi$ depends on values cumulated over the strings in S that match $\Pi$: score takes the parameters $|M(\Pi)|$ and $R(\Pi)$.
It is also assumed that the score can be computed in constant time once the parameter values are known.
The specific choice of scoring function depends heavily on the particular application.
8
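Problem Definition
Given a set $S = \{s_1, \ldots, s_m\}$ of strings, where each string $s_i$ is assigned a numeric attribute value $r_i$, and a scoring function $\mathit{score}: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, find the Boolean pattern pair $\Pi \in \{\langle p, F, q \rangle \mid p, q \in \Sigma^*, F \in \{F_0, \ldots, F_{15}\}\}$ that maximizes $\mathit{score}(|M(\Pi)|, R(\Pi))$.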
9
Suffix tree & GST
A suffix tree for a string s is a rooted tree whose edges are labeled with substrings of s.
For a node v, l(v) is the string obtained by concatenating the edge labels from the root to v.
For each leaf node v, l(v) is a distinct suffix of s, and for each suffix there exists such a leaf v.
Each internal node has at least 2 children, and the first characters of the labels on the edges to its children are distinct.
GST: given a set $S = \{s_1, \ldots, s_m\}$, the GST is a suffix tree for the string $s_1 \$_1 \cdots s_m \$_m$, where each $\$_i$ is a distinct character that does not belong to $\Sigma$. All paths end at the first appearance of a $\$_i$, and each leaf is labeled with $id_i$.
The GST takes O(N) space and can be built in O(N) time.
10
Suffix tree and suffix array for s = caggaggaccat
The paths of the suffix tree from the root to the leaves (suffixes) are sorted in lexicographic order from left to right, each leaf corresponding to a position in the suffix array.
Each integer in the suffix array is the position in the string at which the corresponding suffix starts: $A_s[i] = j$ indicates that $s[j:n]$ is the i-th suffix in lexicographic order.
The lcp array holds the length of the longest common prefix (the longest shared path from the root) of consecutive suffixes in the suffix array.
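The two arrays for this example string can be reproduced with a few lines of Python. This is a deliberately naive sketch (sorting whole suffixes, comparing neighbours character by character), not the linear-time construction the paper relies on. Positions are 0-based here, which is an assumption of the sketch; the slide's figure may count from 1.

```python
def suffix_array(s: str) -> list:
    """Start positions of the suffixes of s, in lexicographic order of the suffixes."""
    return sorted(range(len(s)), key=lambda j: s[j:])


def lcp_array(s: str, sa: list) -> list:
    """lcp[i] = length of the longest common prefix of the suffixes at sa[i-1] and sa[i]."""
    def common_prefix_len(a: str, b: str) -> int:
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    lcp = [0]  # the first suffix has no predecessor
    for i in range(1, len(sa)):
        lcp.append(common_prefix_len(s[sa[i - 1]:], s[sa[i]:]))
    return lcp


s = "caggaggaccat"
sa = suffix_array(s)
lcp = lcp_array(s, sa)
for pos, shared in zip(sa, lcp):
    print(f"{pos:2d} {shared:2d} {s[pos:]}")
```

Each printed row shows a suffix array entry, the lcp value with the previous row, and the suffix itself, so the lexicographic order can be checked by eye.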
11
GST (Generalized Suffix Tree) A Generalized Suffix Tree and its corresponding suffix array for the strings {facct, gctt, ctctg}.
12
A Naïve $O(N^3)$ Algorithm
Let $N = \sum_{i=1}^{m} |s_i|$ be the total length of the strings.
There are O(N) candidates for a single pattern: patterns of the form l(v), where v is a node in the GST over the set S. (Why?)
Hence there are $O(N^2)$ candidate pattern pairs.
For a given pair, the values $|M(\Pi)|$ and $R(\Pi)$ can be computed in O(N) time with any linear-time string matching algorithm. The scoring function value is then calculated in constant time from $|M(\Pi)|$ and $R(\Pi)$.
Time: $O(N^3)$. Space: O(N) for the suffix tree.
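The exhaustive search can be sketched in Python as below. Several things are assumptions of the sketch rather than the paper's method: it enumerates every distinct substring instead of only the O(N) node labels l(v) (so it is slower than $O(N^3)$), it indexes the 16 Boolean functions by the bits of their truth table, and the attribute values and the scoring function are made up for the example.

```python
from itertools import product


def all_substrings(S):
    """Every distinct substring of the strings in S, plus the empty pattern."""
    subs = {""}
    for s in S:
        subs.update(s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1))
    return sorted(subs)


# Hypothetical indexing: bit (2*a + b) of k is the value of F_k(a, b).
BOOLEAN_FUNCTIONS = [lambda a, b, k=k: bool((k >> (2 * a + b)) & 1) for k in range(16)]


def best_pair_naive(S, r, score):
    """Try every (p, F, q) and return (best score, p, index of F, q)."""
    patterns = all_substrings(S)
    occurs = {p: [p in s for s in S] for p in patterns}  # psi(p, s_i) for all i
    best = None
    for p, q in product(patterns, repeat=2):
        for k, F in enumerate(BOOLEAN_FUNCTIONS):
            matched = [F(a, b) for a, b in zip(occurs[p], occurs[q])]
            count = sum(matched)                                  # |M(pair)|
            total = sum(ri for ri, hit in zip(r, matched) if hit)  # R(pair)
            value = score(count, total)
            if best is None or value > best[0]:
                best = (value, p, k, q)
    return best


S = ["facct", "gctt", "ctctg"]
r = [3.0, -2.0, 5.0]  # hypothetical attribute values
value, p, k, q = best_pair_naive(S, r, score=lambda count, total: total)
print(value, repr(p), k, repr(q))
```

With this toy score (the attribute sum itself), the best pair is any one that matches the first and third strings but not the second.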
13
$O(N^2)$ Algorithm
Two steps:
Step 1: find $|M(l(v))|$ and $R(l(v))$ for all nodes v of the GST in O(N) time and space.
Step 2: solve the optimal pair of substring patterns problem in $O(N^2)$ time and O(N) space, for any scoring function score that can be calculated in constant time given its inputs.
14
Algorithm- First step
If $R(l(v))$ can be found for all v in O(N) time, so can $|M(l(v))|$: when $r_i = 1$ for all i, $R(l(v)) = |M(l(v))|$.
Let LF(v) be the set of all leaf nodes in the subtree rooted at node v.
Let $c_i(v)$ denote the number of leaves in LF(v) that have the label $id_i$.
Let the sum of leaf attributes be $\sum_{LF(v)} r_i$.
15
Algorithm- First step
$\sum_{LF(v)} r_i = \sum_{i \in M(l(v))} c_i(v) \, r_i$
$R(l(v)) = \sum_{i \in M(l(v))} r_i = \sum_{LF(v)} r_i - \sum_{i \in M(l(v))} (c_i(v) - 1) \, r_i$ …(1)
Let the correction factor be $\mathit{corr}(l(v), S) = \sum_{i \in M(l(v))} (c_i(v) - 1) \, r_i$.
In (1), $\sum_{LF(v)} r_i$ can be calculated for all v using a linear-time post-order traversal, as $\sum_{LF(v)} r_i = \sum_{v'} \sum_{LF(v')} r_i$, where v' ranges over the child nodes of v.
16
Algorithm- First step
How can the redundancies (correction factors) in (1) be removed?
Let $I(id_i)$ be the list of all leaves with the label $id_i$, in the order they appear in the post-order traversal of the tree. The lists I can be constructed in linear time for all labels $id_i$.
The leaves in LF(v) with the label $id_i$ form a contiguous interval of length $c_i(v)$ in the list $I(id_i)$.
If $c_i(v) > 0$, a length-$c_i(v)$ interval in $I(id_i)$ contains $c_i(v) - 1$ pairs of adjacent leaves.
If $x, y \in LF(v)$, the node lca(x, y) belongs to the subtree rooted at v.
For any $s_i \in S$, $\Psi(l(v), s_i) = \text{true}$, that is, $i \in M(l(v))$, if and only if there is a leaf $x \in LF(v)$ with the label $id_i$.
17
Algorithm- First step
Initially, the correction value is 0 for all v.
For each pair of adjacent leaves x, y in $I(id_i)$, add $r_i$ to the correction value of the node lca(x, y).
After this, for each v, the sum of the correction values over the nodes of the subtree rooted at v is $(c_i(v) - 1) \, r_i$.
Repeating this for all lists $I(id_i)$, the preceding total becomes $\sum_{i \in M(l(v))} (c_i(v) - 1) \, r_i = \mathit{corr}(l(v), S)$.
Perform a linear-time bottom-up (post-order) traversal to find $R(l(v))$.
18
Algorithm- First step
Correction values at $v_1$, $v_2$, $v_3$ are set to $r_3$, $r_2$, $r_3$.
$v_3$: $r_3 + r_2 + r_3 - r_3 = r_2 + r_3 = R(l(v_3))$
$v_2$: $R(l(v_3)) + r_2 - r_2 = r_2 + r_3 = R(l(v_2))$
$v_1$: $r_1 + R(l(v_2)) + r_3 - r_3 = r_1 + r_2 + r_3 = R(l(v_1))$
19
Pseudo code for Step 1
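The first step can be sketched in Python as follows. To stay short, the sketch swaps in two simplifications that are not the paper's method: an uncompressed suffix trie instead of a GST (so the number of nodes is quadratic in the worst case, not O(N)), and a naive parent-walking `lca` instead of constant-time LCA queries. The correction-value bookkeeping itself follows the slides. All function names are hypothetical.

```python
def build_suffix_trie(S):
    """Insert every suffix of every s_i, closed by a per-string terminator ("$", i)."""
    children, parent, depth, char, leaf_id = [{}], [-1], [0], [None], [None]
    for i, s in enumerate(S):
        for j in range(len(s) + 1):
            v = 0
            for ch in list(s[j:]) + [("$", i)]:
                if ch not in children[v]:
                    children.append({})
                    parent.append(v)
                    depth.append(depth[v] + 1)
                    char.append(ch)
                    leaf_id.append(None)
                    children[v][ch] = len(children) - 1
                v = children[v][ch]
            leaf_id[v] = i  # the leaf carries the label id_i
    return children, parent, depth, char, leaf_id


def postorder(children):
    """Iterative post-order traversal starting at the root (node 0)."""
    order, stack = [], [(0, False)]
    while stack:
        v, expanded = stack.pop()
        if expanded:
            order.append(v)
        else:
            stack.append((v, True))
            stack.extend((c, False) for c in reversed(list(children[v].values())))
    return order


def lca(x, y, parent, depth):
    """Naive lowest common ancestor: walk up the parent pointers."""
    while depth[x] > depth[y]:
        x = parent[x]
    while depth[y] > depth[x]:
        y = parent[y]
    while x != y:
        x, y = parent[x], parent[y]
    return x


def attribute_sums(S, r):
    """Return {pattern: R(pattern)} for every substring pattern of S."""
    children, parent, depth, char, leaf_id = build_suffix_trie(S)
    order = postorder(children)
    acc = [0] * len(children)   # leaf sums, later reduced by the corrections
    corr = [0] * len(children)  # correction values, initially 0
    last_leaf = {}              # previous leaf of each id in post-order, i.e. I(id_i)
    for v in order:
        i = leaf_id[v]
        if i is not None:
            acc[v] = r[i]
            if i in last_leaf:  # adjacent pair in I(id_i): charge r_i to their lca
                corr[lca(last_leaf[i], v, parent, depth)] += r[i]
            last_leaf[i] = v
    for v in order:             # bottom-up: children are finished before parents
        acc[v] -= corr[v]
        if parent[v] >= 0:
            acc[parent[v]] += acc[v]
    label, result = [""] * len(children), {"": acc[0]}
    for v in range(1, len(children)):  # a parent always has a smaller index
        if isinstance(char[v], str):   # skip terminator leaves
            label[v] = label[parent[v]] + char[v]
            result[label[v]] = acc[v]
    return result


S = ["facct", "gctt", "ctctg"]
sums = attribute_sums(S, [1, 2, 4])     # hypothetical attribute values r_i
counts = attribute_sums(S, [1, 1, 1])   # r_i = 1 gives |M(l(v))|
print(sums["ct"], sums["tc"], counts["ct"])
```

Calling the same routine with all $r_i = 1$ yields the match counts, which mirrors the remark that $|M(l(v))|$ comes for free once $R(l(v))$ can be computed.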
20
$O(N^2)$ Algorithm
Two steps:
Step 1: find $|M(l(v))|$ and $R(l(v))$ for all nodes v of the GST in O(N) time and space.
Step 2: solve the optimal pair of substring patterns problem in $O(N^2)$ time and O(N) space, for any scoring function score that can be calculated in constant time given its inputs.
21
Algorithm- Second step
There are O(N) choices for the first pattern $l(v_1)$.
For each $l(v_1)$, use a modified version of the previous algorithm to handle the O(N) choices for the second pattern $l(v_2)$.
Given a fixed $l(v_1)$, additionally label each string $s_i \in S$, and the corresponding leaves in the GST, with the Boolean value $\Psi(l(v_1), s_i)$. This takes O(N) time.
Cumulate the sums and correction values separately for the true and false values of the additional label.
22
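Algorithm- Second step
In the pattern pairs below, F is written as an expression in $a = \Psi(l(v_1), s_i)$ and $b = \Psi(l(v_2), s_i)$.
$\sum_{i \in M(l(v_2))} (r_i \mid \Psi(l(v_1), s_i) = \text{true}) = \sum_{i \in M(l(v_2))} (r_i \mid \Psi(l(v_1), s_i) = \text{true}, \Psi(l(v_2), s_i) = \text{true}) = R(\langle l(v_1), a \wedge b, l(v_2) \rangle)$
$\sum_{i \in M(l(v_2))} (r_i \mid \Psi(l(v_1), s_i) = \text{false}) = \sum_{i \in M(l(v_2))} (r_i \mid \Psi(l(v_1), s_i) = \text{false}, \Psi(l(v_2), s_i) = \text{true}) = R(\langle l(v_1), \neg a \wedge b, l(v_2) \rangle)$
$\sum_{i \in M'(l(v_2))} (r_i \mid \Psi(l(v_1), s_i) = \text{true}) = \sum_{i \in M'(l(v_2))} (r_i \mid \Psi(l(v_1), s_i) = \text{true}, \Psi(l(v_2), s_i) = \text{false}) = R(\langle l(v_1), a \wedge \neg b, l(v_2) \rangle) = R(l(v_1)) - R(\langle l(v_1), a \wedge b, l(v_2) \rangle)$
$\sum_{i \in M'(l(v_2))} (r_i \mid \Psi(l(v_1), s_i) = \text{false}) = \sum_{i \in M'(l(v_2))} (r_i \mid \Psi(l(v_1), s_i) = \text{false}, \Psi(l(v_2), s_i) = \text{false}) = R(\langle l(v_1), \neg a \wedge \neg b, l(v_2) \rangle) = R(\varepsilon) - R(l(v_1)) - R(\langle l(v_1), \neg a \wedge b, l(v_2) \rangle)$
Here $R(\varepsilon)$ and $R(l(v_1))$ can be computed in linear time.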
23
Algorithm- Second step
All cumulative values of the form $\sum_i (r_i \mid \Psi(l(v_1), s_i) = b_1, \Psi(l(v_2), s_i) = b_2)$, where $b_1, b_2 \in \{\text{true}, \text{false}\}$, can be computed in linear time.
Thus, given a fixed $l(v_1)$, $R(\langle l(v_1), F, l(v_2) \rangle)$ and hence the score can be computed in linear time for all pairs of the form $\langle l(v_1), F, l(v_2) \rangle$.
This gives $O(N^2)$ time for all pattern pairs.
Since the O(N) calculations for each $l(v_1)$ are independent, the same GST can be reused. Hence the space complexity is O(N).
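The bookkeeping for one fixed pair of patterns can be sketched in Python as below. The two directly accumulated sums are computed by brute force here, as a stand-in for the labelled GST traversal; the other two cells are derived by subtraction as on the previous slide, and R for any of the 16 Boolean functions is then a sum of the selected cells. The function names and the bit-based indexing of $F_k$ are assumptions of the sketch.

```python
def four_cells(p, q, S, r):
    """Sums of r_i split by (psi(p, s_i), psi(q, s_i)); p plays l(v1), q plays l(v2)."""
    in_p = [p in s for s in S]
    in_q = [q in s for s in S]
    # Accumulated over the strings that q matches, separately per label of p
    # (stand-in for the traversal with the additional true/false label).
    tt = sum(ri for ri, a, b in zip(r, in_p, in_q) if a and b)
    ft = sum(ri for ri, a, b in zip(r, in_p, in_q) if not a and b)
    r_eps = sum(r)                                    # R(epsilon)
    r_p = sum(ri for ri, a in zip(r, in_p) if a)      # R(l(v1))
    tf = r_p - tt                                     # p true,  q false
    ff = r_eps - r_p - ft                             # p false, q false
    return {(True, True): tt, (False, True): ft, (True, False): tf, (False, False): ff}


def r_of_pair(cells, k):
    """R(<p, F_k, q>), where bit (2*a + b) of k is F_k(a, b) (hypothetical indexing)."""
    return sum(val for (a, b), val in cells.items() if (k >> (2 * a + b)) & 1)


S = ["facct", "gctt", "ctctg"]
r = [1, 2, 4]  # hypothetical attribute values
cells = four_cells("ct", "tc", S, r)
print(cells)
print([r_of_pair(cells, k) for k in range(16)])
```

Because every Boolean function of two inputs selects a subset of the four cells, the per-pair work after the traversal is constant, which is what keeps the total at $O(N^2)$.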
24
Algorithm- Second step
25
The rest of the paper in a nutshell
Extension to k-ary Boolean functions.
Implementation using suffix arrays.
Computational experiments and results.
Algorithm variations: multiple string attributes, distance restrictions.
26
Homework
Explain, in your own words, the implementation of the Optimal Boolean Pattern Pair problem using suffix arrays. Also explain why it is more efficient than the suffix tree approach.
Email: siv@tamu.edu
27
THANK YOU