Property Matching and Weighted Matching
Amihood Amir, Eran Chencinski, Costas Iliopoulos, Tsvi Kopelowitz and Hui Zhang


Results [diagram]: Weighted Matching, Property Matching, Pattern Matching (General Reduction); Property Indexing

Property Matching
Def: A property π of a string T = t_1, …, t_n is a set of intervals π = {(s_1, f_1), (s_2, f_2), …, (s_t, f_t)} such that s_i, f_i ∈ {1, …, n} and s_i ≤ f_i.
Property Matching Problem: Given a text T with property π and a pattern P, find all locations where P matches T and is fully contained in an interval in π.

Property Matching - Example: Property Swap Matching Problem [figure: text AAADBBADBDBA with the strings D BD and ADB]

Property Matching
Solving the Property Matching Problem:
- Solve the regular pattern matching problem.
- Eliminate results not in a property interval.
- Eliminating results can be done in linear time.
- If the regular problem takes Ω(n) time => property matching time = regular problem time.
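The two steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function names, the 0-based inclusive interval convention, and the intervals in the usage example are assumptions, and the pattern is assumed to be non-empty. The `reach` array is one way to do the elimination step in time linear in the text length plus the number of intervals.

```python
def find_occurrences(text, pattern):
    """All 0-based start positions of pattern in text (overlaps allowed)."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def property_match(text, pattern, intervals):
    """Keep the occurrences that lie fully inside one property interval.

    intervals: list of (s, f) pairs, 0-based and inclusive (an assumption;
    the slides index from 1).
    """
    n, m = len(text), len(pattern)
    # reach[i] = largest f over intervals whose start s is <= i (-1 if none).
    reach = [-1] * n
    for s, f in intervals:
        reach[s] = max(reach[s], f)
    for i in range(1, n):
        reach[i] = max(reach[i], reach[i - 1])
    # [i, i+m-1] is covered iff some interval starting at or before i
    # ends at or after i+m-1.
    return [i for i in find_occurrences(text, pattern) if reach[i] >= i + m - 1]


# Hypothetical intervals on the example text from the slides.
print(property_match("AAADBBADBDBA", "DB", [(2, 5), (8, 11)]))
```

The occurrence of DB at position 7 is dropped because it straddles the boundary of the second interval.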

Property Indexing Problem
- Preprocess T such that, given a pattern P, we can find the occurrences of P in T in which P is contained in a property interval.
- Time: proportional to |P| and tocc, the number of reported occurrences.
- Our solution: query time O(|P| log|Σ| + tocc), preprocessing time O(n log|Σ| + n log log n).

Weighted Sequence
Def 1: A weighted sequence T = t_1, …, t_n over alphabet Σ is a sequence of sets t_i, where each t_i is a set of pairs (σ, π_i(σ)) with σ ∈ Σ, and π_i(σ) is the probability of having symbol σ at location i.

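Weighted Sequence
Def 2: Given a probability ε, P = p_1, …, p_m occurs at location i of weighted text T w.p. at least ε if:
π_i(p_1) · π_{i+1}(p_2) · … · π_{i+m-1}(p_m) ≥ ε,
where π_k(σ) is the probability of symbol σ at location k.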
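As a concrete reading of Def 2, the sketch below multiplies the per-location probabilities of the pattern's symbols. The representation of a weighted text as a list of dictionaries (symbol to probability) and the function names are assumptions made for illustration; positions are 0-based.

```python
def occurrence_probability(wtext, pattern, i):
    """Probability that pattern appears at location i of the weighted text."""
    prob = 1.0
    for j, ch in enumerate(pattern):
        # A symbol that is absent at a location has probability 0 there.
        prob *= wtext[i + j].get(ch, 0.0)
    return prob


def weighted_occurrences(wtext, pattern, eps):
    """All locations where pattern occurs with probability at least eps."""
    m = len(pattern)
    return [i for i in range(len(wtext) - m + 1)
            if occurrence_probability(wtext, pattern, i) >= eps]


# Hypothetical weighted text of length 4.
wtext = [{"A": 1.0}, {"A": 0.5, "B": 0.5}, {"C": 0.5, "D": 0.5}, {"A": 1.0}]
print(weighted_occurrences(wtext, "AB", 0.5))
```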

Weighted Sequence [example figure; string shown: ADCC]

Goal
- Weighted Matching problems = Pattern Matching problems with weighted text.
- Goal: Find a general reduction for solving weighted matching problems using regular pattern matching algorithms.

Naive Algorithm
Algorithm A
1. Find all possible patterns appearing in the weighted text.
2. Concatenate all patterns to create a new text.
3. Run a regular pattern matching algorithm on the new regular text.
4. Check each pattern found for prob. ≥ ε.

Naive Algorithm [figure: enumeration of the strings spelled by a weighted text over the symbols A, B, C, D; strings shown: ABA, DBB]

Naive Algorithm
- Clearly this algorithm is inefficient and can be exponential even for |Σ| = 2.
- Notice that there is a lot of waste:
  – Many patterns share the same substrings.
  – Given ε, we can ignore patterns w.p. < ε.
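The exponential blow-up is easy to see in a small sketch of step 1 of Algorithm A. The list-of-dictionaries representation and the function name are assumptions for illustration; the enumeration simply takes one symbol per location in every possible way.

```python
from itertools import product


def all_strings(wtext):
    """Every full-length string the weighted text can spell (step 1 of Algorithm A)."""
    return ["".join(chars) for chars in product(*(sorted(d) for d in wtext))]


# With two equally likely symbols at each of n locations there are 2**n strings.
n = 10
wtext = [{"A": 0.5, "B": 0.5} for _ in range(n)]
print(len(all_strings(wtext)))
```

Each of these strings has probability 0.5 to the power n, so for any fixed ε almost all of them could have been discarded early, which is the waste the slide points out.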

Maximal Factor
Def 3: Given ε and a weighted text T, a string X is a maximal factor of T at location i if:
(a) X appears at location i w.p. ≥ ε, and
(b) if we extend X by one character to the right or to the left, the probability drops below ε.
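Def 3 can be checked directly for a given string and location. The sketch below is an illustration under assumptions: the weighted text is a list of non-empty dictionaries (symbol to probability), positions are 0-based, X fits inside the text, and the function names are made up. It tests the most probable one-character extension on each side, since if that one drops below ε then every extension does.

```python
def factor_probability(wtext, x, i):
    """Probability that string x appears at location i."""
    prob = 1.0
    for j, ch in enumerate(x):
        prob *= wtext[i + j].get(ch, 0.0)
    return prob


def is_maximal_factor(wtext, x, i, eps):
    """Conditions (a) and (b) of Def 3 for string x at location i."""
    p = factor_probability(wtext, x, i)
    if p < eps:                      # condition (a) fails
        return False
    end = i + len(x)
    # Most probable neighbour on each side; 0 if the text ends there.
    left = max(wtext[i - 1].values()) if i > 0 else 0.0
    right = max(wtext[end].values()) if end < len(wtext) else 0.0
    return p * left < eps and p * right < eps   # condition (b)


wtext = [{"A": 0.5, "B": 0.5}, {"C": 1.0}, {"A": 0.5, "D": 0.5}]
print(is_maximal_factor(wtext, "AC", 0, 0.5))  # AC cannot be extended
print(is_maximal_factor(wtext, "C", 1, 0.5))   # C extends to AC, BC, CA, CD
```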

Maximal Factor [example figure; strings shown: AC, DB]

Algorithm B
1. Find all maximal factors in the text.
2. Concatenate the factors to create a new text.
3. Run a regular pattern matching algorithm on the new regular text.
Note: A pattern appearing in the new text has prob. of appearance ≥ ε.

Total Length of Maximal Factors
What is the total length of all maximal factors?
Consider the following case: [example text not preserved] such that (1-δ)^{n/3} = ε.
⇒ n/3 maximal factors of length (2/3)·n
⇒ Total length of all maximal factors is Ω(n^2).

Classifying Text Locations
Given ε, we classify each location i of the weighted text into 3 categories:
- Solid positions: one character w.p. exactly 1.
- Leading positions: at least one character w.p. greater than 1-ε (and less than 1).
- Branching positions: all characters have probability of appearance at most 1-ε.

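Classifying Text Locations
If ε ≤ 1/2, there is at most 1 “eligible” character at a leading position: the leading character has probability greater than 1-ε, so every other character at that position has probability less than ε.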
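The three categories translate into a short test on the largest probability at a location. This is a sketch under assumptions: a location is a non-empty dictionary from symbol to probability, and the function name is hypothetical.

```python
def classify_position(dist, eps):
    """Return 'solid', 'leading' or 'branching' for one text location."""
    top = max(dist.values())
    if top == 1.0:
        return "solid"        # one character w.p. exactly 1
    if top > 1.0 - eps:
        return "leading"      # some character w.p. in (1 - eps, 1)
    return "branching"        # every character w.p. at most 1 - eps


eps = 0.25
for dist in ({"A": 1.0}, {"A": 0.875, "B": 0.125}, {"A": 0.75, "B": 0.25}):
    print(dist, classify_position(dist, eps))
```

Note that the same location can change category with ε: a character with probability 0.75 is leading for ε = 0.5 but only branching for ε = 0.25.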

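LST Transformation
Def 4: The Leading to Solid Transformation of weighted text T = t_1, …, t_n is LST(T) = t'_1, …, t'_n, where t'_i = {(σ, 1)} if i is a leading position with leading character σ, and t'_i = t_i otherwise. Here the leading character has prob. of appearance ≥ max{1-ε, ε}.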
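A minimal sketch of the transformation, assuming ε ≤ 1/2 (the case the proof concentrates on), a weighted text stored as a list of non-empty dictionaries, and a hypothetical function name. A location counts as leading when its most probable character has probability strictly between 1-ε and 1.

```python
def lst(wtext, eps):
    """Leading to Solid Transformation: leading positions become solid."""
    out = []
    for dist in wtext:
        ch, p = max(dist.items(), key=lambda kv: kv[1])
        if 1.0 - eps < p < 1.0:
            out.append({ch: 1.0})      # leading position: keep only its leader
        else:
            out.append(dict(dist))     # solid or branching: unchanged copy
    return out


wtext = [{"A": 0.75, "B": 0.25}, {"A": 0.5, "B": 0.5}, {"C": 1.0}]
print(lst(wtext, 0.5))
```

Only the first location changes here: 0.75 exceeds 1-ε = 0.5, so A is promoted to probability 1, while the evenly split second location stays branching.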


Extended Maximal Factor
Def 5: X is an extended maximal factor of T if X is a maximal factor of LST(T).

Lemma 1: The total length of all extended maximal factors is at most O(n·(1/ε)^2 log(1/ε)).
Corollary: For constant ε, the total length of all extended maximal factors is linear.

Lemma 1
Why can we assume constant ε?
- In practice we want patterns that appear with noticeable probabilities, e.g. 90%, 50% or 20%.
- Finding patterns w.p. at least 20% => 1/ε = 5.
- A smaller percentage means a smaller ε, which is rarely needed in practice.

Proof of Lemma 1
Case 1: ε > 1/2, i.e. we search for patterns w.p. > 50%.
Observation: At each location at most 1 character has probability > 50%.
⇒ Total length of all factors is ≤ n.
For the rest of the proof we assume ε ≤ 1/2.

Proof of Lemma 1
Claim 1: An (extended) maximal factor passes by at most O((1/ε)·log(1/ε)) branching positions.
Proof: Denote by l_b the maximum number of branching positions passed. In a branching position all characters have prob. of appearance ≤ 1-ε, so a factor that appears w.p. ≥ ε must satisfy (1-ε)^{l_b} ≥ ε.
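The bound in Claim 1 follows from a short calculation, written out here as a worked derivation. It uses only the fact that each branching position contributes a factor of at most 1-ε to the probability, together with the standard inequality -ln(1-ε) ≥ ε for 0 < ε < 1; the symbol l_b is the slide's.

```latex
\begin{align*}
(1-\varepsilon)^{l_b} \ge \varepsilon
  &\;\Longrightarrow\; l_b \,\ln\frac{1}{1-\varepsilon} \le \ln\frac{1}{\varepsilon} \\
  &\;\Longrightarrow\; l_b \le \frac{\ln(1/\varepsilon)}{\ln\bigl(1/(1-\varepsilon)\bigr)}
     \le \frac{1}{\varepsilon}\,\ln\frac{1}{\varepsilon}
     \qquad\text{since } \ln\frac{1}{1-\varepsilon} \ge \varepsilon .
\end{align*}
```

For example, with ε = 1/5 this gives l_b ≤ 5 ln 5, so a factor crosses at most 8 branching positions.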

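Proof of Lemma 1
Claim 2: At most O(1/ε) extended maximal factors start at each location.
Intuition: two maximal factors that start at the same location cannot be prefixes of one another, so they describe disjoint events and their probabilities sum to at most 1. Each has probability ≥ ε, which leaves room for at most 1/ε of them.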

Proof of Lemma 1
Claim 1: An (extended) maximal factor passes by ≤ O((1/ε) log(1/ε)) branching positions.
Claim 2: At most O(1/ε) extended maximal factors start at each location.
Corollary: Each location is in ≤ O((1/ε)^2 log(1/ε)) extended maximal factors.

Proof of Lemma 1
There are l_b starting locations, and from each location there are ≤ O(1/ε) extended maximal factors.
Corollary: Each location is in ≤ O((1/ε)^2 log(1/ε)) extended maximal factors.

Finding Extended Maximal Factors
Algorithm for finding extended maximal factors:
1. Transform T to LST(T).
2. Find all maximal factors in LST(T) by:
(a) At each starting location, try to extend until the prob. drops below ε.
(b) Backtrack to the previous branching position and try to extend the factor, and so on.
Run time: linear in the output length.
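Step 2 can be sketched as a recursive extend-and-backtrack search. This is an illustration only: it assumes its input has already been transformed by LST, is stored as a list of non-empty dictionaries, and uses hypothetical names. It also explores factors that later turn out not to be left-maximal, so it does not achieve the run time stated on the slide.

```python
def maximal_factors(wtext, eps):
    """All (start, string) pairs that are maximal factors of wtext."""
    n = len(wtext)
    found = []

    def extend(start, pos, prefix, prob):
        # Step (a): try every character that keeps the probability >= eps.
        extended = False
        if pos < n:
            for ch, p in wtext[pos].items():
                if prob * p >= eps:
                    extended = True
                    extend(start, pos + 1, prefix + ch, prob * p)
        # Returning from the recursion is the backtracking of step (b).
        if not extended and prefix:
            # Right-maximal; report it only if no left extension survives.
            left = max(wtext[start - 1].values()) if start > 0 else 0.0
            if prob * left < eps:
                found.append((start, prefix))

    for start in range(n):
        extend(start, start, "", 1.0)
    return found


wtext = [{"A": 0.5, "B": 0.5}, {"C": 1.0}, {"A": 0.5, "D": 0.5}]
print(maximal_factors(wtext, 0.5))
```

The recursion depth equals the factor length, so a long text would need an explicit stack instead.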

Framework for Solving Weighted Matching Problems
Solving Weighted Matching Problems:
1. Find all extended maximal factors of T.
2. Concatenate the factors (adding $'s between them) to get T'.
3. Compute the property π by extending probabilities until they drop below ε.
4. Run the property matching algorithm on text T' with property π.
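Steps 2 and 3 can be sketched as follows. This is one possible reading of the slide, not the authors' code: it assumes the extended maximal factors are given as (start, string) pairs, that the property holds one interval per location of T', and that an interval extends as far as the true probability in the original text T stays at least ε. All names are hypothetical, and the quadratic inner loop is chosen for clarity rather than speed.

```python
def build_text_and_property(wtext, factors, eps):
    """Concatenate factors with '$' and compute property intervals on T'.

    wtext:   original weighted text, a list of dicts (symbol -> probability)
    factors: (start, string) pairs, e.g. the extended maximal factors
    Returns (T', intervals), with 0-based inclusive intervals.
    """
    pieces, intervals, offset = [], [], 0
    for start, x in factors:
        # True probabilities in T of the factor's characters.
        probs = [wtext[start + k].get(ch, 0.0) for k, ch in enumerate(x)]
        for a in range(len(x)):
            prob, b = 1.0, a
            # Extend to the right while the probability stays >= eps.
            while b < len(x) and prob * probs[b] >= eps:
                prob *= probs[b]
                b += 1
            if b > a:
                intervals.append((offset + a, offset + b - 1))
        pieces.append(x)
        offset += len(x) + 1          # +1 for the '$' separator
    return "$".join(pieces), intervals


wtext = [{"A": 0.75, "B": 0.25}, {"C": 1.0}, {"A": 0.5, "D": 0.5}]
# For eps = 0.5, location 0 is leading, so ACA and ACD are factors of LST(T).
print(build_text_and_property(wtext, [(0, "ACA"), (0, "ACD")], 0.5))
```

In the example, ACA is a factor of LST(T) but its true probability in T is 0.375, so no interval covers all of it; a property matching algorithm run on T' would therefore not report ACA, while AC and CA are still found.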

Conclusions
- Our framework yields:
  – Solutions to unsolved weighted matching problems (scaled, swapped, parameterized matching, indexing)
  – Efficient solutions to others (exact and approximate)
- For constant ε:
  – Weighted matching problems can be solved in the same running times as regular pattern matching
  – Weighted indexing can be solved in the same times except for O(n log log n) preprocessing