Property Matching and Weighted Matching Amihood Amir, Eran Chencinski, Costas Iliopoulos, Tsvi Kopelowitz and Hui Zhang
Results Weighted Matching Property Matching Pattern Matching General Reduction Property Indexing
Property Matching Def: A property π of a string T = t_1, …, t_n is a set of intervals π = {(s_1, f_1), (s_2, f_2), …, (s_t, f_t)}, s.t. s_i, f_i ∈ {1, …, n} and s_i ≤ f_i. Property Matching Problem: Given a text T with property π and a pattern P, find all locations where P matches T and is fully contained in an interval in π.
Property Matching - Example Property Swap Matching Problem AAADBBADBDBA D BD ADB
Property Matching: Solving the Property Matching Problem. Solve the regular pattern matching problem. Eliminate results not contained in a property interval. Eliminating results can be done in linear time. If the regular problem takes Ω(n) time => property matching time = regular problem time.
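The two steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the 0-based inclusive interval convention, and the use of `str.find` as the "regular" matcher are all assumptions made for the example. The filter precomputes, for each position i, the furthest right endpoint of any interval starting at or before i, so elimination takes linear time.

```python
def find_occurrences(text, pattern):
    """Return all 0-based start positions of a non-empty pattern in text."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def property_match(text, pattern, intervals):
    """Keep occurrences fully contained in some interval (s, f).

    Intervals are 0-based and inclusive (an assumption of this sketch;
    the slides use 1-based positions).
    """
    n, m = len(text), len(pattern)
    # reach[i] = largest f over all intervals with s <= i, or -1 if none.
    reach = [-1] * n
    for s, f in intervals:
        reach[s] = max(reach[s], f)
    for i in range(1, n):
        reach[i] = max(reach[i], reach[i - 1])
    # An occurrence at i ends at i + m - 1 and must fit inside an interval.
    return [i for i in find_occurrences(text, pattern)
            if reach[i] >= i + m - 1]
```

For example, `property_match("AAADBBADBDBA", "ADB", [(0, 4), (7, 11)])` keeps the occurrence at position 2 and drops the one at position 6, which straddles the gap between the two intervals.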
Property Indexing Problem: Preprocess T s.t. given a P, find the occurrences of P in T s.t. P is contained in a property interval. Time: proportional to |P| and tocc. Our solution: query time O(|P| log|Σ| + tocc), preprocessing time O(n log|Σ| + n log log n).
Weighted Sequence Def 1: A weighted sequence T = t_1, …, t_n over alphabet Σ is a sequence of sets t_i of pairs (σ_j, π_i(σ_j)), where σ_j ∈ Σ and π_i(σ_j) is the probability of having symbol σ_j at location i.
Weighted Sequence Def 2: Given a probability ε, P = p_1, …, p_m occurs at location i of weighted text T w.p. at least ε if: π_i(p_1) ∙ π_{i+1}(p_2) ∙ … ∙ π_{i+m-1}(p_m) ≥ ε.
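Def 2 can be checked directly by multiplying per-position probabilities. The sketch below assumes a weighted text is represented as a list of dictionaries mapping each symbol to its probability at that location, with 0-based positions; the representation and function names are illustrative choices, not taken from the slides.

```python
def occurrence_probability(wtext, pattern, i):
    """Probability that pattern occurs at 0-based location i of wtext."""
    prob = 1.0
    for j, ch in enumerate(pattern):
        # A symbol missing from a location has probability 0 there.
        prob *= wtext[i + j].get(ch, 0.0)
    return prob


def weighted_occurrences(wtext, pattern, eps):
    """All locations where pattern occurs with probability at least eps."""
    m = len(pattern)
    return [i for i in range(len(wtext) - m + 1)
            if occurrence_probability(wtext, pattern, i) >= eps]
```

This brute-force check costs O(nm) time; it only illustrates the definition and is not the reduction the talk proposes.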
Weighted Sequence ADCC
Goal: Weighted Matching problems = Pattern Matching problems with weighted text. Goal: Find a general reduction for solving weighted matching problems using regular pattern matching algorithms.
Naive Algorithm Algorithm A 1. Find all possible patterns appearing in weighted text. 2. Concatenate all patterns to create new text. 3. Run regular pattern matching algorithm on new regular text. 4. Check each pattern found for prob. ≥ ε.
Naive Algorithm: example figure (character columns not recoverable from the extraction; surviving labels: ABA, DBB).
Naive Algorithm: Clearly this algorithm is inefficient and can be exponential even for |Σ|=2. Notice that there is a lot of waste: –Many patterns share the same substrings. –Given ε, we can ignore patterns w.p. < ε.
Maximal Factor Def 3: Given ε, weighted text T, string X is maximal factor of T at location i if: (a) X appears at location i w.p. ≥ ε (b) if we extend X with 1 character to right or left – the probability drops below ε.
Maximal Factor AC DB
Algorithm B 1. Find all maximal factors in text. 2. Concatenate factors to create new text. 3. Run regular pattern matching algorithm on new regular text. Note: A pattern appearing in new text has prob. of appearance ≥ ε.
Total Length of Maximal Factors: What is the total length of all maximal factors? Consider the following case (figure omitted), with δ such that (1-δ)^{n/3} = ε: there are n/3 maximal factors, each of length (2/3)∙n. Total length of all maximal factors is Ω(n^2).
Classifying Text Locations: Given ε, we classify location i of the weighted text into 3 categories: Solid positions: one character w.p. exactly 1. Leading positions: at least one character w.p. greater than 1-ε (and less than 1). Branching positions: all characters have probability of appearance at most 1-ε.
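The three categories depend only on the largest probability at a location, which the following sketch makes explicit. It assumes each location is a dictionary from symbol to probability; the function name and the string labels are illustrative.

```python
def classify_position(dist, eps):
    """Classify one location of a weighted text as solid, leading or branching.

    dist maps each symbol at this location to its probability.
    """
    top = max(dist.values())
    if top == 1.0:
        return "solid"        # one character with probability exactly 1
    if top > 1.0 - eps:
        return "leading"      # some character above 1 - eps, below 1
    return "branching"        # every character at most 1 - eps
```

Exact comparison with 1.0 is used for simplicity; with probabilities that come from floating-point arithmetic a small tolerance may be preferable.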
Classifying Text Locations If ε ≤ 1/2, at most 1 “eligible” character at leading position
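LST Transformation Def 4: The Leading to Solid Transformation of weighted text T = t_1, …, t_n, LST(T) = t'_1, …, t'_n is: t'_i = {(σ, 1)} if i is a leading position with leading character σ, and t'_i = t_i otherwise, where the leading character has prob. of app. ≥ max{1-ε, ε}.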
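A minimal sketch of the transformation, assuming a weighted text is a list of symbol-to-probability dictionaries. A position is treated as leading when its most probable character lies strictly between 1-ε and 1, and it is made solid only if that character also meets the max{1-ε, ε} threshold; combining the two conditions this way is an interpretation of the slides, not a quotation from them.

```python
def lst(wtext, eps):
    """Leading-to-Solid Transformation: turn leading positions into solid ones."""
    threshold = max(1.0 - eps, eps)
    result = []
    for dist in wtext:
        # Most probable character at this location.
        ch, p = max(dist.items(), key=lambda kv: kv[1])
        is_leading = 1.0 - eps < p < 1.0
        if is_leading and p >= threshold:
            result.append({ch: 1.0})      # keep only the leading character
        else:
            result.append(dict(dist))     # solid and branching positions unchanged
    return result
```

The function returns a new list and leaves the input untouched, so the original probabilities remain available for later steps that need them.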
LST Transformation
Extended Maximal Factor Def 5: X is an extended maximal factor of T if X is a maximal factor of LST(T).
Lemma 1: Total length of all extended maximal factors is at most O(n∙(1/ε)^2 log(1/ε)). Corollary: For constant ε, the total length of all extended maximal factors is linear.
Lemma 1: Why can we assume constant ε? In practice: we want patterns that appear with noticeable probabilities, e.g. 90%, 50% or 20%. Finding patterns w.p. at least 20% => 1/ε=5. Smaller percentage = smaller ε, which is rarely needed in practice.
Proof of Lemma 1 Case 1: ε > 1/2, search patterns w.p. > 50%. Obv: At each location at most 1 char w.p. > 50%. Total length of all factors is ≤ n. For rest of proof we assume ε ≤ 1/2.
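Proof of Lemma 1 Claim 1: An (extended) maximal factor passes by at most O((1/ε)∙log(1/ε)) branching positions. Proof: Denote l_b = max. # of branching positions passed. In a branching position all characters have prob. of appearance ≤ 1-ε, so a factor passing l_b branching positions has probability at most (1-ε)^{l_b}. This must be ≥ ε, hence l_b ≤ log(ε)/log(1-ε) ≤ (1/ε)∙ln(1/ε).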
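Proof of Lemma 1 Claim 2: At most 1/ε extended maximal factors start at each location. Intuition: distinct maximal factors starting at the same location are not prefixes of one another, so they describe disjoint events, each with probability ≥ ε; since these probabilities sum to at most 1, there can be at most 1/ε of them.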
Proof of Lemma 1 Claim 1: An (extended) maximal factor passes by ≤ O((1/ε) log(1/ε)) branching positions. Claim 2: At most 1/ε extended maximal factors start at each location. Corollary: each location is in ≤ O((1/ε)^2 log(1/ε)) extended maximal factors.
Proof of Lemma 1: There are l_b starting locations, and from each location there are ≤ 1/ε extended maximal factors. Corollary: each location is in ≤ O((1/ε)^2 log(1/ε)) extended maximal factors.
Finding Extended Maximal Factors Algorithm for finding extended maximal factors: 1. Transform T to LST(T) 2. Find all maximal factors in LST(T) by: (a) At each starting location try to extend until the prob. drops below ε. (b) Backtrack to previous branching position and try to extend the factor and so on... Run time: linear in the output length.
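Step 2 can be sketched as a depth-first search that extends a candidate to the right while its probability stays at least ε, and reports it only if no one-character extension to the left is possible. This is an illustrative sketch under assumptions: the weighted text is a list of symbol-to-probability dictionaries (apply the LST first to obtain extended maximal factors), positions are 0-based, and the simple recursion below does not attempt to achieve the output-linear running time stated on the slide.

```python
def maximal_factors(wtext, eps):
    """Return (start, string) for every maximal factor of wtext w.r.t. eps."""
    n = len(wtext)
    found = []

    def extend(start, pos, prefix, prob):
        extended = False
        if pos < n:
            # Try every character that keeps the probability at least eps.
            for ch, p in sorted(wtext[pos].items()):
                if prob * p >= eps:
                    extended = True
                    extend(start, pos + 1, prefix + ch, prob * p)
        if not extended and prefix:
            # Cannot grow to the right; maximal only if it cannot grow left.
            left_ok = start > 0 and any(
                q * prob >= eps for q in wtext[start - 1].values())
            if not left_ok:
                found.append((start, prefix))

    for i in range(n):
        extend(i, i, "", 1.0)
    return found
```

Returning to the caller after a dead end plays the role of the backtracking in step (b). Recursion depth grows with factor length, so an explicit stack would be needed for long texts.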
Framework for Solving Weighted Matching Problems: 1. Find all extended maximal factors of T. 2. Concatenate the factors (adding $'s between them) to get T'. 3. Compute property π by extending probabilities until they drop below ε. 4. Run the property matching algorithm on text T' with property π.
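Steps 2 and 3 might look like the sketch below. It assumes the factors are given as (start, string) pairs with 0-based starts into T, that T is a list of symbol-to-probability dictionaries holding the original (untransformed) probabilities, and that the property gets one interval per position of T', reaching as far right as the original probability stays at least ε. These choices, the function name, and the quadratic inner loop are assumptions of the example rather than details confirmed by the slides.

```python
def build_text_and_property(wtext, factors, eps, sep="$"):
    """Concatenate factors into T' and compute property intervals over T'.

    Intervals are 0-based and inclusive, expressed in T' coordinates.
    """
    pieces, intervals, offset = [], [], 0
    for start, factor in factors:
        for k in range(len(factor)):
            # Extend from position k of the factor using original probabilities.
            prob, end = 1.0, k
            while end < len(factor):
                p = prob * wtext[start + end].get(factor[end], 0.0)
                if p < eps:
                    break
                prob, end = p, end + 1
            if end > k:
                intervals.append((offset + k, offset + end - 1))
        pieces.append(factor)
        offset += len(factor) + len(sep)
    return sep.join(pieces), intervals
```

Because the separator never falls inside an interval, a match in T' that is contained in a property interval cannot span two factors.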
Conclusions: Our framework yields: –Solutions to unsolved weighted matching problems (scaled, swapped, param. matching, indexing) –Efficient solutions to others (exact and approx.). For constant ε: –Weighted matching problems can be solved in the same running times as regular pattern matching –Weighted indexing can be solved in the same times except for O(n log log n) preprocessing.