An $O(N^2)$ Algorithm for Discovering Optimal Boolean Pattern Pairs. Hideo Bannai, Heikki Hyyro, Ayumi Shinohara, Masayuki Takeda, Kenta Nakai and Satoru Miyano.

An $O(N^2)$ Algorithm for Discovering Optimal Boolean Pattern Pairs
Hideo Bannai, Heikki Hyyro, Ayumi Shinohara, Masayuki Takeda, Kenta Nakai and Satoru Miyano
IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 1, no. 4, October-December 2004
Presented by Siva ramakrishnan Subramanian, Graduate Student, CPSC, TAMU

Motive
• Finding patterns conserved across a set of biologically related sequences, in order to extract meaning, is a common topic in bioinformatics.
• More than one sequence element can affect the biological characteristics of the sequences.
• Past work on finding composite patterns: Structured Motifs, MITRA, Bioprospector, …

Overview
• Given a set of sequences and a numeric attribute value for each sequence, the problem is to find the optimal (w.r.t. a scoring function) pair of patterns combined with any Boolean function.
• Past work finds a combination of two patterns $p$ and $q$ where $(p \wedge q)$ occurs in each string.
• This paper's formulation allows all possible combinations, such as $(p \wedge \neg q)$, so conditions like "presence of one element but absence of the other" can be specified.
• Thus the method can be used to find cooperative as well as competing sequence elements.
• The main contributions of the paper are the $O(N^2)$ algorithm and an implementation based on suffix arrays (the latter is the homework!).

Preliminaries
• Let $\Sigma$ be a finite alphabet and let $\varepsilon$ denote the empty string.
• Let $\Psi(p, s)$ be a Boolean matching function that is true if and only if $p$ is a substring of $s$.
• Boolean pattern pair: a triplet $\langle F, p, q \rangle$, where $p$ and $q$ are patterns and $F$ is a 2-ary Boolean function.
• The matching function value for a pattern pair, $\Psi(\langle F, p, q \rangle, s)$, is defined as $F(\Psi(p, s), \Psi(q, s))$.
• All possible functions $F$ are listed in the following table.

All Candidate Boolean Operations on $\langle F, p, q \rangle$
[Table: the 16 two-argument Boolean functions $F_0, \ldots, F_{15}$]
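As a rough illustration of what "all candidate Boolean operations" covers, the sketch below enumerates the 16 two-argument Boolean functions. The numbering used here (bit $2a + b$ of $k$ gives the output of $F_k$ on inputs $(a, b)$) and the names `boolean_function` and `ALL_FUNCTIONS` are assumptions made for this example; the paper's own table may order the functions differently.

```python
from itertools import product


def boolean_function(k):
    """Return the k-th two-argument Boolean function (0 <= k <= 15).

    Assumed numbering (not taken from the paper): bit 2*a + b of k
    is the output for inputs (a, b), with False = 0 and True = 1.
    """
    def f(a, b):
        return bool((k >> (2 * int(a) + int(b))) & 1)
    return f


ALL_FUNCTIONS = [boolean_function(k) for k in range(16)]

# Print each truth table as outputs for (F,F), (F,T), (T,F), (T,T).
for k, f in enumerate(ALL_FUNCTIONS):
    row = [int(f(a, b)) for a, b in product([False, True], repeat=2)]
    print(k, row)
```

Under this numbering, `ALL_FUNCTIONS[8]` is $p \wedge q$ and `ALL_FUNCTIONS[4]` is $p \wedge \neg q$, the "presence of one element but absence of the other" case.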

Preliminaries
• A pattern or a Boolean pattern pair $\Pi$ matches a string $s$ if and only if $\Psi(\Pi, s)$ is true. The pattern $\varepsilon$ matches any string.
• For a given set of strings $S = \{s_1, \ldots, s_m\}$, let $M(\Pi, S)$ denote the set of indices of the strings in $S$ that $\Pi$ matches, that is, $M(\Pi, S) = \{\, i \mid \Psi(\Pi, s_i) = \text{true} \,\}$, and let its complement be denoted $M'(\Pi, S) = \{\, i \mid \Psi(\Pi, s_i) = \text{false} \,\}$.
• For each $s_i \in S$ we are given an associated numeric attribute value $r_i$. Let $R(\Pi, S) = \sum_{i \in M(\Pi, S)} r_i$ denote the sum of $r_i$ over all $s_i$ that $\Pi$ matches. Let $M(\Pi)$ and $R(\Pi)$ be shorthand for $M(\Pi, S)$ and $R(\Pi, S)$, respectively. Note that $|M(\varepsilon)| = m$ and $R(\varepsilon) = \sum_{i=1}^{m} r_i$.
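The definitions of $\Psi$, $M$ and $R$ can be tried out directly with naive substring matching. This is only a sketch: the function names, the example strings and the attribute values are made up for illustration, and Python's `in` operator stands in for the matching function.

```python
def psi(p, s):
    """Psi(p, s): True iff p is a substring of s (the empty pattern always matches)."""
    return p in s


def psi_pair(F, p, q, s):
    """Psi(<F, p, q>, s) = F(Psi(p, s), Psi(q, s))."""
    return F(psi(p, s), psi(q, s))


def match_set(matches, strings):
    """M(Pi, S): 1-based indices i of the strings s_i for which matches(s_i) is true."""
    return {i for i, s in enumerate(strings, start=1) if matches(s)}


def attribute_sum(matches, strings, r):
    """R(Pi, S): sum of r_i over the indices in M(Pi, S)."""
    return sum(r[i - 1] for i in match_set(matches, strings))


S = ["facct", "gctt", "ctctg"]
r = [1, 2, 4]  # hypothetical attribute values


def and_not(a, b):
    """F(a, b) = a AND NOT b."""
    return a and not b


print(match_set(lambda s: psi("tt", s), S))
print(attribute_sum(lambda s: psi_pair(and_not, "ct", "tt", s), S, r))
```

Here the single pattern "tt" matches only the second string, while the pair ("ct" present, "tt" absent) matches the first and third.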

Scoring Function
• The objective is to find a pattern that maximizes a suitable scoring function, score.
• The paper concentrates on scoring functions whose value for a pattern $\Pi$ depends on values accumulated over the strings in $S$ that match $\Pi$.
• The scoring function takes $|M(\Pi)|$ and $R(\Pi)$ as parameters.
• It is also assumed that the score can be computed in constant time once the parameter values are known.
• The specific choice of scoring function depends heavily on the particular application.

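Problem Definition
• Given a set $S = \{s_1, \ldots, s_m\}$ of strings, where each string $s_i$ is assigned a numeric attribute value $r_i$, and a scoring function $\mathit{score}: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, find the Boolean pattern pair $\Pi \in \{\, \langle F, p, q \rangle \mid p, q \in \Sigma^*,\ F \in \{F_0, \ldots, F_{15}\} \,\}$ that maximizes $\mathit{score}(|M(\Pi)|, R(\Pi))$.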

Suffix tree & GST
• Edges are labeled with substrings of $s$.
• For a node $v$, $l(v)$ is the string obtained by concatenating the edge labels on the path from the root to $v$.
• For each leaf node $v$, $l(v)$ is a distinct suffix of $s$, and for each suffix of $s$ there exists such a leaf $v$.
• Each internal node has at least 2 children, and the labels on the edges to its children start with distinct characters.
• GST: given a set $S = \{s_1, \ldots, s_m\}$, the generalized suffix tree is a suffix tree for the string $s_1 \$_1 \cdots s_m \$_m$, where each $\$_i$ is a distinct character that does not belong to $\Sigma$.
• All paths end at the first appearance of a $\$_i$, and each leaf is labeled with $id_i$.
• Construction takes $O(N)$ space and time.

Suffix tree and suffix array for $s$ = caggaggaccat
The paths of the suffix tree from the root to the leaves (suffixes) are sorted in lexicographic order from left to right, so each leaf corresponds to a position in the suffix array. Each integer in the suffix array is the position in the string at which the corresponding suffix starts: $A_s[i] = j$ indicates that $s[j:n]$ is the $i$-th suffix in lexicographic order. The lcp array stores the length of the longest common prefix shared by consecutive suffixes in the suffix array.
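For readers who want to reproduce the arrays for this example, here is a minimal sketch that builds the suffix array by plain sorting and the lcp array by direct comparison. It is deliberately naive (not the linear-time construction), uses 0-based positions rather than the slide's indexing, and the function names are chosen for this example only.

```python
def suffix_array(s):
    """Start positions (0-based) of the suffixes of s in lexicographic order."""
    return sorted(range(len(s)), key=lambda j: s[j:])


def lcp_array(s, sa):
    """lcp[i] = length of the longest common prefix of suffixes sa[i-1] and sa[i]; lcp[0] = 0."""
    def common(a, b):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n
    return [0] + [common(s[sa[i - 1]:], s[sa[i]:]) for i in range(1, len(sa))]


s = "caggaggaccat"
sa = suffix_array(s)
lcp = lcp_array(s, sa)
for rank, (start, shared) in enumerate(zip(sa, lcp)):
    print(rank, start, shared, s[start:])
```

Each printed row shows the rank, the start position, the lcp with the previous suffix, and the suffix itself, which makes the correspondence with the left-to-right leaf order easy to check by eye.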

GST (Generalized Suffix Tree)
A generalized suffix tree and its corresponding suffix array for the strings {facct, gctt, ctctg}.

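A Naïve $O(N^3)$ Algorithm
• Let $N = \sum_{i=1}^{m} \mathrm{length}(s_i)$.
• There are $O(N)$ candidates for a single pattern: patterns of the form $l(v)$, where $v$ is a node in the GST over the set $S$. (Why? A pattern that ends in the middle of an edge matches exactly the same strings as $l(v)$ for the node $v$ at the lower end of that edge, so it has the same score, and the GST has $O(N)$ nodes.)
• Hence there are $O(N^2)$ candidate pattern pairs.
• For a given pair, the values $|M(\Pi)|$ and $R(\Pi)$ can be computed in $O(N)$ time by any of the linear-time string matching algorithms.
• The scoring function value is then calculated in constant time given $|M(\Pi)|$ and $R(\Pi)$.
• Time = $O(N^3)$. Space = $O(N)$ for the suffix tree.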
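The exhaustive search can be sketched in a few lines for toy inputs. Several things here are assumptions made for brevity: the candidate set is every distinct substring (a superset of the $O(N)$ GST node labels, so this sketch is slower than the bound on the slide), Python's `in` replaces a linear-time matcher, $F_k$ is numbered by the bits of $k$, and `make_mean_gap_score` is a hypothetical scoring function, not one from the paper.

```python
from itertools import product


def make_mean_gap_score(r):
    """Hypothetical score: gap between the mean attribute of matched and unmatched strings."""
    m, total = len(r), sum(r)

    def score(count, matched_sum):
        if count == 0 or count == m:
            return 0.0
        return abs(matched_sum / count - (total - matched_sum) / (m - count))
    return score


def naive_best_pair(strings, r, score):
    """Score every <F_k, p, q> exhaustively; return (best_score, k, p, q)."""
    # Every distinct substring, including the empty string (s[i:i]).
    cands = sorted({s[i:j] for s in strings
                    for i in range(len(s)) for j in range(i, len(s) + 1)})
    best = None
    for p, q in product(cands, repeat=2):
        hits = [(p in s, q in s) for s in strings]
        for k in range(16):
            # Assumed numbering: bit 2*a + b of k is F_k(a, b).
            matched = [ri for (a, b), ri in zip(hits, r) if (k >> (2 * a + b)) & 1]
            value = score(len(matched), sum(matched))
            if best is None or value > best[0]:
                best = (value, k, p, q)
    return best


S = ["facct", "gctt", "ctctg"]
r = [1, 9, 3]  # made-up attribute values
print(naive_best_pair(S, r, make_mean_gap_score(r)))
```

With these made-up values the best pairs are those that separate the second string from the other two.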

$O(N^2)$ Algorithm
• Two steps:
• Step 1: find $|M(l(v))|$ and $R(l(v))$ for all nodes $v$ of the GST in $O(N)$ time and space.
• Step 2: solve the optimal pair of substring patterns problem in $O(N^2)$ time and $O(N)$ space, for any scoring function score that can be calculated in constant time given its inputs.

Algorithm: First step
• If $R(l(v))$ can be found for all $v$ in $O(N)$ time, so can $|M(l(v))|$: when $r_i = 1$ for all $i$, $R(l(v)) = |M(l(v))|$.
• Let $LF(v)$ be the set of all leaf nodes in the subtree rooted at node $v$.
• Let $c_i(v)$ denote the number of leaves in $LF(v)$ that have the label $id_i$.
• Let the sum of leaf attributes be $\sum_{LF(v)} r_i$.

Algorithm: First step
• $\sum_{LF(v)} r_i = \sum_{i \in M(l(v))} c_i(v)\, r_i$
• $R(l(v)) = \sum_{i \in M(l(v))} r_i = \sum_{LF(v)} r_i - \sum_{i \in M(l(v))} (c_i(v) - 1)\, r_i$ … (1)
• Let the correction factor be $\mathrm{corr}(l(v), S) = \sum_{i \in M(l(v))} (c_i(v) - 1)\, r_i$.
• In (1), $\sum_{LF(v)} r_i$ can be calculated for all $v$ using a linear-time post-order traversal, since $\sum_{LF(v)} r_i = \sum_{v' \text{ child of } v} \sum_{LF(v')} r_i$.

Algorithm: First step
• How can the redundancies (correction factors) in (1) be removed?
• Let $I(id_i)$ be the list of all leaves with the label $id_i$, in the order in which they appear in the post-order traversal of the tree. The lists $I(id_i)$ can be constructed in linear time for all labels $id_i$.
• The leaves in $LF(v)$ with the label $id_i$ form a contiguous interval of length $c_i(v)$ in the list $I(id_i)$.
• If $c_i(v) > 0$, a length-$c_i(v)$ interval in $I(id_i)$ contains $c_i(v) - 1$ adjacent (overlapping) leaf pairs.
• If $x, y \in LF(v)$, the node $\mathrm{lca}(x, y)$ belongs to the subtree rooted at $v$.
• For any $s_i \in S$, $\Psi(l(v), s_i) = \text{true}$, that is, $i \in M(l(v))$, if and only if there is a leaf $x \in LF(v)$ with the label $id_i$.

Algorithm: First step
• Initially the correction value is 0 for all $v$.
• For each adjacent leaf pair $(x, y)$ in $I(id_i)$, add $r_i$ to the correction value of the node $\mathrm{lca}(x, y)$.
• For each $v$ with $c_i(v) > 0$, the sum of the correction values over the nodes of the subtree rooted at $v$ is then $(c_i(v) - 1)\, r_i$.
• Repeat this for all lists $I(id_i)$; the preceding total sum becomes $\sum_{i \in M(l(v))} (c_i(v) - 1)\, r_i = \mathrm{corr}(l(v), S)$.
• Perform a linear-time bottom-up (post-order) traversal to find $R(l(v))$.

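Algorithm: First step
• Correction values at $v_1$, $v_2$, $v_3$ are set to $r_3$, $r_2$, $r_3$.
• $v_3$: $r_3 + r_2 + r_3 - r_3 = r_2 + r_3 = R(l(v_3))$
• $v_2$: $R(l(v_3)) + r_2 - r_2 = r_2 + r_3 = R(l(v_2))$
• $v_1$: $r_1 + R(l(v_2)) + r_3 - r_3 = r_1 + r_2 + r_3 = R(l(v_1))$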

Pseudo code for Step 1
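A minimal Python sketch of Step 1 as described on the preceding slides is given below. It follows the same recipe (leaf sums, one correction of $r_i$ at the LCA of each adjacent pair of leaves labeled $id_i$, then a bottom-up pass), but with simplifications that are assumptions of this sketch: it uses an uncompressed suffix trie instead of a linear-size GST and a parent-climbing LCA instead of constant-time LCA queries, so it is not linear time. The name `attribute_sums` is hypothetical.

```python
def attribute_sums(strings, r):
    """Map every substring l(v) of the input strings to R(l(v))."""
    children, parent, depth = [{}], [-1], [0]
    text, leaf_id = [""], [None]  # leaf_id[v] = i for a leaf labeled id_i

    def add_child(v, key, ch, label):
        children.append({})
        parent.append(v)
        depth.append(depth[v] + 1)
        text.append(text[v] + ch)
        leaf_id.append(label)
        children[v][key] = len(children) - 1
        return children[v][key]

    # Insert every suffix, ending in its own terminator leaf.
    for i, s in enumerate(strings):
        for j in range(len(s)):
            v = 0
            for ch in s[j:]:
                v = children[v][ch] if ch in children[v] else add_child(v, ch, ch, None)
            add_child(v, ("$", i, j), "", i)

    # Iterative post-order traversal.
    order, stack = [], [(0, False)]
    while stack:
        v, done = stack.pop()
        if done:
            order.append(v)
        else:
            stack.append((v, True))
            stack.extend((c, False) for c in children[v].values())

    # I(id_i): leaves with label i, in post-order.
    lists = {}
    for v in order:
        if leaf_id[v] is not None:
            lists.setdefault(leaf_id[v], []).append(v)

    def lca(x, y):
        while x != y:
            if depth[x] < depth[y]:
                x, y = y, x
            x = parent[x]
        return x

    # Add r_i at the LCA of each adjacent leaf pair in I(id_i).
    corr = [0] * len(children)
    for i, leaves in lists.items():
        for x, y in zip(leaves, leaves[1:]):
            corr[lca(x, y)] += r[i]

    # Bottom-up: children's totals minus this node's correction value.
    total = [0] * len(children)
    for v in order:
        if leaf_id[v] is not None:
            total[v] = r[leaf_id[v]]
        else:
            total[v] = sum(total[c] for c in children[v].values()) - corr[v]
    return {text[v]: total[v] for v in order if leaf_id[v] is None}


S = ["facct", "gctt", "ctctg"]
r = [1, 2, 4]  # made-up attribute values
print(attribute_sums(S, r)["ct"])  # 7: "ct" occurs in all three strings
```

Setting every $r_i$ to 1 in the same function yields $|M(l(v))|$, as noted on the first slide of this step.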

$O(N^2)$ Algorithm
• Two steps:
• Step 1: find $|M(l(v))|$ and $R(l(v))$ for all nodes $v$ of the GST in $O(N)$ time and space.
• Step 2: solve the optimal pair of substring patterns problem in $O(N^2)$ time and $O(N)$ space, for any scoring function score that can be calculated in constant time given its inputs.

Algorithm: Second step
• There are $O(N)$ choices for the first pattern, $l(v_1)$.
• For each $l(v_1)$, use a modified version of the previous algorithm over the $O(N)$ choices for the second pattern, $l(v_2)$.
• Given a fixed $l(v_1)$, additionally label each string $s_i \in S$, and the corresponding leaves in the GST, with the Boolean value $\Psi(l(v_1), s_i)$. This takes $O(N)$ time.
• Accumulate the sums and correction values separately for the true and false values of the additional label.

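Algorithm: Second step
Below, $R(l(v_1) \wedge l(v_2))$ denotes $R(\langle F, l(v_1), l(v_2) \rangle)$ for $F(a, b) = a \wedge b$, and similarly for the negated forms.
• $\sum_{i \in M(l(v_2))} (r_i \mid \Psi(l(v_1), s_i) = \text{true}) = \sum_{i} (r_i \mid \Psi(l(v_1), s_i) = \text{true},\ \Psi(l(v_2), s_i) = \text{true}) = R(l(v_1) \wedge l(v_2))$
• $\sum_{i \in M(l(v_2))} (r_i \mid \Psi(l(v_1), s_i) = \text{false}) = \sum_{i} (r_i \mid \Psi(l(v_1), s_i) = \text{false},\ \Psi(l(v_2), s_i) = \text{true}) = R(\neg l(v_1) \wedge l(v_2))$
• $\sum_{i \in M'(l(v_2))} (r_i \mid \Psi(l(v_1), s_i) = \text{true}) = \sum_{i} (r_i \mid \Psi(l(v_1), s_i) = \text{true},\ \Psi(l(v_2), s_i) = \text{false}) = R(l(v_1) \wedge \neg l(v_2)) = R(l(v_1)) - R(l(v_1) \wedge l(v_2))$
• $\sum_{i \in M'(l(v_2))} (r_i \mid \Psi(l(v_1), s_i) = \text{false}) = \sum_{i} (r_i \mid \Psi(l(v_1), s_i) = \text{false},\ \Psi(l(v_2), s_i) = \text{false}) = R(\neg l(v_1) \wedge \neg l(v_2)) = R(\varepsilon) - R(l(v_1)) - R(\neg l(v_1) \wedge l(v_2))$
where $R(\varepsilon)$ and $R(l(v_1))$ can be computed in linear time.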

Algorithm: Second step
• All cumulative values of the form $\sum_{i} (r_i \mid \Psi(l(v_1), s_i) = b_1,\ \Psi(l(v_2), s_i) = b_2)$, where $b_1, b_2 \in \{\text{true}, \text{false}\}$, can be computed in linear time.
• Thus $R(\langle F, l(v_1), l(v_2) \rangle)$, and hence the score, can be computed in linear time for all pattern pairs of the form $\langle F, l(v_1), l(v_2) \rangle$, given a fixed $l(v_1)$.
• Thus $O(N^2)$ time suffices for all pattern pairs.
• Since the $O(N)$ calculations for each $l(v_1)$ are independent, the same GST can be reused. Hence the space complexity is $O(N)$.
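To make the bookkeeping concrete, the sketch below derives the four $(b_1, b_2)$ cells from the two sums accumulated over $M(l(v_2))$ and then adds up the cells on which a given $F_k$ is true. It works equally for attribute sums and for counts. The function names and the numbering of $F_k$ (bit $2 b_1 + b_2$ of $k$) are assumptions of this example, and the numbers are made up.

```python
def four_cells(total_all, total_p, sum_tt, sum_ft):
    """Cells indexed by (b1, b2) = (first pattern matches, second pattern matches).

    sum_tt and sum_ft are the two sums accumulated over M(l(v2));
    the remaining two cells follow by subtraction from R(l(v1)) and R(epsilon).
    """
    return {
        (True, True): sum_tt,
        (False, True): sum_ft,
        (True, False): total_p - sum_tt,
        (False, False): total_all - total_p - sum_ft,
    }


def combine(cells, k):
    """Sum the cells on which F_k is true (assumed numbering: bit 2*b1 + b2 of k)."""
    return sum(value for (b1, b2), value in cells.items()
               if (k >> (2 * b1 + b2)) & 1)


# Made-up example: first pattern "ct", second pattern "tt",
# strings facct, gctt, ctctg with attribute values 1, 2, 4.
# "ct" matches all three strings; "tt" matches only the second.
cells = four_cells(total_all=7, total_p=7, sum_tt=2, sum_ft=0)
print(combine(cells, 4))  # k = 4 is "first AND NOT second" under the assumed numbering
```

Because each of the 16 functions needs only a constant number of additions per pair, evaluating all of them does not change the overall $O(N^2)$ bound.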

Algorithm- Second step

The rest of the paper in a nutshell
• Extension to $k$-ary Boolean functions.
• Implementation using suffix arrays.
• Computational experiments and results.
• Algorithm variations: multiple string attributes, distance restrictions.

Homework
• Explain the implementation of the Optimal Boolean Pattern Pair problem using suffix arrays in your own words. Also explain why it is more efficient than the suffix tree approach.

THANK YOU