Bar Ilan University And Georgia Tech Artistic Consultant: Aviya Amir
Given: A glass ball. An n-story building. Find: The floor k such that the ball breaks when dropped from it, but does not break if dropped from floor k-1.
STRATEGY 1: Only one ball given to experiment with. O(n) experiments necessary. Sequential search.
STRATEGY 2: As many balls as necessary are given to experiment with. O(log n) experiments necessary. Binary search.
STRATEGY 3: Only two balls given to experiment with. $O(\sqrt{n})$ experiments necessary. Bounded divide-and-conquer. (Figure: experiments with the 1st ball; experiments with the 2nd ball.)
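A minimal sketch of the two-ball strategy, assuming a hypothetical oracle `breaks(floor)` that is monotone (once the ball breaks at some floor, it breaks at every higher floor). The function name and the return convention (n + 1 when the ball never breaks) are illustrative choices, not part of the slides.

```python
import math

def find_breaking_floor(n, breaks):
    """Return (k, drops): k is the lowest floor in 1..n where the ball breaks,
    or n + 1 if it never breaks; drops is the number of experiments used."""
    step = max(1, math.isqrt(n))   # stride of about sqrt(n) floors
    drops = 0
    low = 0        # highest floor known to be safe
    high = n + 1   # lowest floor known to break (n + 1 means "never")

    # First ball: works on large groups, jumping one stride at a time.
    floor = step
    while floor <= n:
        drops += 1
        if breaks(floor):
            high = floor
            break
        low = floor
        floor += step

    # Second ball: works within the group, scanning it floor by floor.
    for floor in range(low + 1, min(high, n + 1)):
        drops += 1
        if breaks(floor):
            return floor, drops
    return high, drops
```

For example, `find_breaking_floor(100, lambda f: f >= 37)` locates floor 37; the first ball is dropped once per stride and the second ball at most once per floor inside a single stride, which is where the square-root bound comes from.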
Meaning of two Balls “Bounded” Divide-and-Conquer. In reality: Different paradigms. 1. Works on large groups. 2. Works within a group.
In Pattern Matching: 1. Works on large groups: Convolutions: O(n log m) using FFT.
Problem: O(n log m) only in algebraically closed fields, e.g. the complex numbers C. Solution: Reduce problem to (Boolean/integer/real) multiplication. This reduction costs! Example: Hamming distance. Counting mismatches is equivalent to counting matches. (Figure: the strings A B A B C A B B B A.)
Example: Count all “hits” of 1 in pattern and 1 in text
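For $a \in \Sigma$ define: $\chi_a(b) = 1$ if $a = b$, $0$ otherwise; applied to a string, $\chi_a$ acts symbol by symbol. Example: $\chi_a(abbaabb) = 1001100$.
For $\Sigma = \{a, b, c\}$ do: $\chi_a(T) \otimes \chi_a(P^R) + \chi_b(T) \otimes \chi_b(P^R) + \chi_c(T) \otimes \chi_c(P^R)$, where $\chi_x(S)$ is the 0/1 indicator string of symbol x in S, $P^R$ is the reversed pattern, and $\otimes$ is convolution. Result: The number of times a in pattern matches a in text + the number of times b in pattern matches b in text + the number of times c in pattern matches c in text.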
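The per-symbol counting above can be sketched as follows. This is an illustration under stated assumptions: the helper names `fft`, `convolve`, and `count_matches` are hypothetical, locations are 0-indexed, and the FFT is a plain recursive radix-2 version over the complex numbers with rounding back to integers.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def convolve(x, y):
    """Integer convolution of two small-integer lists via FFT."""
    size = 1
    while size < len(x) + len(y) - 1:
        size *= 2
    fx = fft(list(x) + [0] * (size - len(x)))
    fy = fft(list(y) + [0] * (size - len(y)))
    prod = fft([p * q for p, q in zip(fx, fy)], invert=True)
    # The inverse transform is unnormalized, so divide by size here.
    return [round(v.real / size) for v in prod[: len(x) + len(y) - 1]]

def count_matches(text, pattern):
    """matches[i] = number of positions j with text[i + j] == pattern[j]."""
    n, m = len(text), len(pattern)
    matches = [0] * (n - m + 1)
    for a in set(pattern):  # one convolution per alphabet symbol
        chi_t = [1 if ch == a else 0 for ch in text]
        chi_p_rev = [1 if ch == a else 0 for ch in reversed(pattern)]
        conv = convolve(chi_t, chi_p_rev)
        for i in range(n - m + 1):
            # Alignment i of the pattern lands at index i + m - 1.
            matches[i] += conv[i + m - 1]
    return matches
```

The Hamming distance at location i is then `len(pattern) - matches[i]`.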
So for alphabet with a symbols (a fixed) the time is: O(n a log m) = O(n log m) Problem: Infinite alphabets.
Without loss of generality: $|\Sigma| = m + 1$, since every element of T not in P is replaced by some symbol x not in P. Example: text ABCDEFGH has the same number of errors against pattern ABBBBBGH as text ABXXXXGH.
Divide and Conquer Idea (Wrong). Split $\Sigma$ into $\Sigma_1 \cup \Sigma_2$ of size m/2 each. Construct $T_1, P_1$ and $T_2, P_2$, where for $S \in \{T, P\}$ and $e \in \{1, 2\}$: $S_e[i] = S[i]$ if $S[i] \in \Sigma_e$, and $S_e[i] = x$ otherwise.
The Algorithm: 1. Find num1 = number of matches of $P_1$ in $T_1$. 2. Find num2 = number of matches of $P_2$ in $T_2$. 3. matches ← num1 + num2. Time: O(n) every iteration for changing alphabet.
Time: $T(m) = 2T(m/2) + n$. Closed form: $T(m) = 2^i T(m/2^i) + (2^{i-1} + 2^{i-2} + \dots + 1)\,n = (2^{\log m} + \dots)\,n = O(mn)$. THIS IS BAD!!!
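A small sketch of this reduction, assuming strings as input and a caller-chosen filler symbol; `reduce_alphabet` and `mismatches` are hypothetical helper names used only for illustration.

```python
def reduce_alphabet(text, pattern, filler="\0"):
    """Replace every text symbol that does not occur in the pattern by one
    filler symbol, so the text uses at most len(set(pattern)) + 1 symbols."""
    alphabet = set(pattern)
    if filler in alphabet:
        raise ValueError("filler must not occur in the pattern")
    return [ch if ch in alphabet else filler for ch in text]

def mismatches(text, pattern):
    """Naive mismatch count at every alignment, used here only as a check."""
    m = len(pattern)
    return [sum(text[i + j] != pattern[j] for j in range(m))
            for i in range(len(text) - m + 1)]
```

Calling `reduce_alphabet("ABCDEFGH", "ABBBBBGH", "X")` turns the text into the symbols of ABXXXXGH, and `mismatches` reports the same counts before and after the replacement.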
Needed: Faster way to compute matches of x to itself. Such a method exists if x appears in the pattern a very small number of times. Assume: x appears in pattern c times. For every occurrence of x in text, update just the appropriate counters of the c occurrences of x in the pattern. Time: O(nc).
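The counter update can be sketched like this, assuming 0-indexed locations; `count_matches_by_positions` is a hypothetical name. Each text symbol touches only the counters of the pattern positions holding the same symbol, so the work is O(nc) when no symbol occurs more than c times in the pattern.

```python
from collections import defaultdict

def count_matches_by_positions(text, pattern):
    """counters[i] = number of positions j with text[i + j] == pattern[j]."""
    n, m = len(text), len(pattern)
    positions = defaultdict(list)      # symbol -> its positions in the pattern
    for j, ch in enumerate(pattern):
        positions[ch].append(j)

    counters = [0] * (n - m + 1)
    for t, ch in enumerate(text):
        # Text position t matches pattern position j at alignment i = t - j.
        for j in positions.get(ch, ()):
            i = t - j
            if 0 <= i <= n - m:
                counters[i] += 1
    return counters
```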
Problem: In general it could be that x occurs in the pattern O(m) times, then total time becomes O(nm). BAD again. Tradeoff: If x appears in the pattern more than c times, count matches by FFT, in time O(n log m), per x. For all x’s that appear in the pattern less than c times, count matches (simultaneously) in time O(nc).
How many elements appear at least c times? At most m/c. For these elements, time: O((m/c) n log m). For all other elements, time: O(nc). The optimal case is when they are equal, i.e. $c = \sqrt{m \log m}$. Total Time: $O(n \sqrt{m \log m})$ (A-87, K-87).
In our Tower Metaphor: Symbols with more than c occurrences: a separate convolution for each group of floors (repetitions of a number x). Symbols with fewer than c occurrences: every element within the group is taken care of individually. However, all groups are “scanned” together.
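Balancing the two costs can be spelled out as a short derivation. The pattern has only m positions, so at most m/c distinct symbols can each occupy at least c of them; equating the FFT cost for those symbols with the counter cost for the rest gives the threshold.

```latex
\begin{align*}
\underbrace{\frac{m}{c}\, n \log m}_{\text{frequent symbols}}
  &= \underbrace{n c}_{\text{infrequent symbols}} \\
c^2 &= m \log m \\
c &= \sqrt{m \log m} \\
\text{total time} &= O(nc) = O\!\left(n \sqrt{m \log m}\right)
\end{align*}
```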
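A sketch of the frequent/infrequent split, with loud assumptions: the name `count_matches_tradeoff` is hypothetical, and the per-symbol correlation for frequent symbols is computed directly for brevity, where the slides use an FFT-based convolution. The code therefore shows the structure of the tradeoff, not its running time.

```python
import math
from collections import Counter, defaultdict

def count_matches_tradeoff(text, pattern, c=None):
    """matches[i] = number of positions j with text[i + j] == pattern[j]."""
    n, m = len(text), len(pattern)
    if c is None:
        # Threshold of about sqrt(m log m); assumes a non-empty pattern.
        c = max(1, math.isqrt(int(m * max(1.0, math.log2(m)))))
    freq = Counter(pattern)
    matches = [0] * (n - m + 1)

    # Frequent symbols: one correlation per symbol (stand-in for one FFT each).
    for a in (s for s, k in freq.items() if k > c):
        chi_t = [ch == a for ch in text]
        chi_p = [ch == a for ch in pattern]
        for i in range(n - m + 1):
            matches[i] += sum(chi_t[i + j] and chi_p[j] for j in range(m))

    # Infrequent symbols: all of them handled together in one text scan.
    rare_positions = defaultdict(list)
    for j, ch in enumerate(pattern):
        if freq[ch] <= c:
            rare_positions[ch].append(j)
    for t, ch in enumerate(text):
        for j in rare_positions.get(ch, ()):
            if 0 <= t - j <= n - m:
                matches[t - j] += 1
    return matches
```

Any choice of `c` gives the same counts; only the division of work between the two phases changes.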
Weighted Sequences. Alignment of “similar” sequences is one of the challenges of string matching. Assume: from a set of sequences over alphabet $\Sigma = \{a_1, a_2, \dots, a_k\}$, a set of “probabilities” is constructed as follows: Text: a table with one row per symbol $a_1, a_2, \dots, a_k$ and one column per text location i, where the entry in row $a_j$ and column i is the probability that symbol $a_j$ occurs in text location i.
This text of probabilities is called a weighted sequence. Our problem: Given: Weighted sequence T, pattern $P = s_1, \dots, s_m$, and a threshold probability. Find: All text locations i such that P occurs there with probability greater than the threshold, i.e. the product over $j = 1, \dots, m$ of the probability of $s_j$ at text location $i + j - 1$ exceeds the threshold. Example: Pattern ACDB occurs at location 2 of the text with probability equal to the product of the four corresponding entries of the text table.
Iliopoulos et al., in a number of recent papers, answer the following questions about weighted sequences: 1. Do exact matching. 2. Construct weighted suffix tree for indexing. Exact Matching: 1. Convert probabilities to logarithms. Now we use sums rather than products. 2. Consider every text row separately. Let $T_a$ be the text row of a, for some $a \in \Sigma$. Then the log probability of the pattern at every location is given by the formula: $\sum_{a \in \Sigma} \chi_a(P^R) \otimes T_a$, where $\chi_a(P^R)$ is the 0/1 indicator string of symbol a in the reversed pattern and $\otimes$ is convolution.
Example: P = ABABCAB. $\chi_A(P^R) \otimes T_A$ gives the sum of the log probabilities of A. $\chi_B(P^R) \otimes T_B$ gives the sum of the log probabilities of B. $\chi_C(P^R) \otimes T_C$ gives the sum of the log probabilities of C. Here $\chi_a(P)$ is the 0/1 indicator string of symbol a in P, so $\chi_A(P) = 1010010$, $\chi_B(P) = 0101001$, $\chi_C(P) = 0000100$, and $P^R$ is the reversed pattern. Add them all up and get the result. Time: O(n log m).
Weighted Hamming Distance (A, Iliopoulos, Kapah 06). Compute the smallest number of mismatches for every location. Mismatches are not symmetric. If errors are assumed to be in the text: How many text elements need to be changed (so that they will have probability 1 matching the corresponding pattern symbol) to produce a match at each location? Example: For the example text, pattern ACDB, and threshold 1/3, there exists a match at location 2 with 1 mismatch.
If errors are assumed to be in the pattern: How many pattern symbols need to be replaced in order to have a match at a given location? Example: For the example text, pattern ACDB, and threshold 1/3, no match exists in location 2 even with 4 mismatches, since every element already has the highest probability. So changing the pattern letter D to A, B, or C will leave the same probability.
We solve both types of mismatch weighted sequences problems (as well as a few flavors of edit distance). Here we show the simpler of the two mismatch definitions: changes to text. We solve a more general problem: Input: Text $T = t_1, \dots, t_n$ where $t_i \in \mathbb{N}$. Pattern $P = p_1, \dots, p_m$ where $p_j \in \{0, 1\}$. Natural number e. Find: For every text location i, the smallest number of text locations that, when changed to 0, bring the convolution result to be no greater than e. (We changed the negatives to positives and dropped the requirement that the numbers be log probabilities of weighted sequences.)
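A sketch of exact matching in the log domain, under stated assumptions: the weighted sequence is represented as a list of dictionaries (symbol to probability, one per text location), locations are 0-indexed, and the function names are hypothetical. Each row is correlated with the pattern's indicator directly; an FFT-based convolution would replace that inner sum to reach the stated running time.

```python
import math

def log_match_scores(weighted_text, pattern):
    """scores[i] = sum over j of log P(pattern[j] at text location i + j)."""
    n, m = len(weighted_text), len(pattern)
    scores = [0.0] * (n - m + 1)
    for a in set(pattern):
        # Text row of symbol a, converted to log probabilities.
        row = [math.log(col[a]) if col.get(a, 0) > 0 else -math.inf
               for col in weighted_text]
        chi = [1 if ch == a else 0 for ch in pattern]
        for i in range(n - m + 1):
            scores[i] += sum(row[i + j] for j in range(m) if chi[j])
    return scores

def occurrences(weighted_text, pattern, threshold):
    """Locations where the pattern occurs with probability above threshold."""
    log_threshold = math.log(threshold)
    return [i for i, s in enumerate(log_match_scores(weighted_text, pattern))
            if s > log_threshold]
```

With a made-up four-location text such as `[{"A": 1.0}, {"A": 0.5, "B": 0.5}, {"B": 1.0}, {"A": 0.25, "B": 0.75}]` and pattern `"AB"`, the pattern has probability 0.5 at locations 0 and 1 and probability 0 at location 2.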
We use the Tower Metaphor. Assumption: n<2m+1. Observation: For every text location we need to sort all O(m) text elements, and find out what is the precise point where the sum of all elements becomes less than e.
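The observation can be checked against a direct baseline, sketched here under the assumption that the pattern is a 0/1 list and the text a list of natural numbers; `min_changes` is a hypothetical name and locations are 0-indexed. At every location the text values facing a 1 in the pattern are sorted, and the largest ones are zeroed until the sum is no greater than e. This is the slow per-location sort that the block technique is meant to avoid.

```python
def min_changes(text, pattern, e):
    """result[i] = fewest text values to set to 0 so that the sum of
    text[i + j] over all j with pattern[j] == 1 is at most e."""
    n, m = len(text), len(pattern)
    result = []
    for i in range(n - m + 1):
        values = sorted((text[i + j] for j in range(m) if pattern[j] == 1),
                        reverse=True)
        total = sum(values)
        changes = 0
        for v in values:          # zero the biggest values first
            if total <= e:
                break
            total -= v
            changes += 1
        result.append(changes)
    return result
```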
Text elements sorted (biggest at bottom, smallest on top). First find the last block at which the running sum is still no greater than e. Then find where the change occurs within the seam block that follows. Need to know: For each text location, how many text elements from each block.
How many text elements in each block? One convolution per block. Let $T_j$ be such that $T_j[i] = 1$ if $t_i$ is in block j, and $0$ otherwise. Do the convolution of $T_j$ with the pattern for every block j and save the result in each text location. Time: one O(n log m) convolution per block.
Let $T_j$ be such that $T_j[i] = t_i$ if $t_i$ is in block j, and $0$ otherwise. For every block j do the convolution of $T_j$ with the pattern. We now know for each text location what is the sum of block values less than e and how many such values exist. All we need to do is find the exact number within the seam block. For every text location: For every element in the seam block, from top to bottom: If the element matches a 1 in the pattern, accumulate it, until the running total exceeds e.
Implementation: Keep the index of every element in every block, subtract from it the index of text location i, and check whether it hits a 1 in the pattern. Total time for correction: proportional to the seam block size for each text location.
As always, taking block sizes $\sqrt{m \log m}$ rather than $\sqrt{m}$ will make the total time $O(n \sqrt{m \log m})$.