Authors: Ricardo A. Baeza-Yates, Gaston H. Gonnet
Publisher: Communications of the ACM, 1992
Presenter: Yuen-Shuo Li
Date: 2013/08/14
String searching is a very important component of many problems, including text editing, bibliographic retrieval, and symbol manipulation.
Example with the pattern ababc, written right to left as cbaba so that the leftmost digit corresponds to the last pattern character. Each entry T[x] has a 1 where x mismatches the pattern character and a 0 where it matches:
T[a] = 11010
T[b] = 10101
T[c] = 01111
T[d] = 11111
Text: cbbabababcaba…
State 50301, reading a: shift to 03010, add T[a] = 11010, new state 14020.
State 14020, reading b: shift to 40200, add T[b] = 10101, new state 50301.
State 50301, reading c: shift to 03010, add T[c] = 01111, new state 04121.
The leftmost digit of 04121 is 0, so the pattern occurs ending at that c.
To update the state after reading a new character of the text, we must:
1. Shift the state vector b bits to the left, to reflect that we have advanced one position in the text.
2. Update the individual states according to the new character.
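The two update steps can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, each individual state is assumed to occupy b = 3 bits (enough to count up to 5 mismatches for a 5-character pattern without carrying into the neighbouring field), and the state is assumed to start at 0.

```python
def build_table(pattern, alphabet, b):
    """T[x]: field i holds 1 when pattern[i] != x, else 0 (b bits per field)."""
    table = {}
    for x in alphabet:
        value = 0
        for i, p in enumerate(pattern):
            if p != x:
                value |= 1 << (i * b)
        table[x] = value
    return table


def digits(state, m, b):
    """Render the state one digit per pattern position, last position leftmost."""
    field_mask = (1 << b) - 1
    return "".join(
        str((state >> (i * b)) & field_mask) for i in reversed(range(m))
    )


def trace_mismatch_counts(pattern, text, alphabet, b=3):
    """Return the state (as a digit string) after each text character."""
    m = len(pattern)
    full_mask = (1 << (m * b)) - 1
    table = build_table(pattern, alphabet, b)
    state = 0  # assumed starting value for this sketch
    trace = []
    for ch in text:
        # Step 1: shift b bits left. Step 2: add the entry for the new character.
        state = ((state << b) & full_mask) + table[ch]
        trace.append(digits(state, m, b))
    return trace
```

For the pattern ababc and the first ten characters of the text cbbabababcaba…, the last four entries of the trace are 50301, 14020, 50301 and 04121. States produced before a full pattern length of text has been read are not meaningful under the assumed starting value.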
The number of mismatches recorded in each individual state is 0 or 1, so b = 1.
Let {a, b, c, d} be the alphabet, and ababc the pattern.
T[a] = 11010
T[b] = 10101
T[c] = 01111
T[d] = 11111
The initial state is 11111. For the text abdabababc, the state after each character is:
text:  a     b     d     a     b     a     b     a     b     c
T[x]:  11010 10101 11111 11010 10101 11010 10101 11010 10101 01111
state: 11110 11101 11111 11110 11101 11010 10101 11010 10101 01111
The match at the end of the text is indicated by the value 0 in the leftmost bit of the state.
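The exact-matching case (b = 1) can be sketched as below. This is an illustrative sketch under stated assumptions, not code from the paper: the function name is hypothetical, the state is assumed to start as all ones, and the update is assumed to be a one-bit left shift followed by a bitwise OR with the table entry of the new character.

```python
def shift_or_search(pattern, text):
    """Return (match start positions, state after each character as a bit string)."""
    m = len(pattern)
    mask = (1 << m) - 1

    # T[x]: bit i is 0 when pattern[i] == x and 1 otherwise.
    table = {}
    for i, p in enumerate(pattern):
        table[p] = table.get(p, mask) & ~(1 << i)

    state = mask  # all ones: no prefix of the pattern has matched yet
    matches = []
    states = []
    for j, ch in enumerate(text):
        # Characters that do not occur in the pattern mismatch everywhere.
        state = ((state << 1) | table.get(ch, mask)) & mask
        states.append(format(state, f"0{m}b"))
        # A 0 in the leftmost bit means the whole pattern ends at position j.
        if (state & (1 << (m - 1))) == 0:
            matches.append(j - m + 1)
    return matches, states
```

Calling `shift_or_search("ababc", "abdabababc")` reports a single match starting at index 5 (0-based), and the final state is 01111, whose leftmost bit is 0.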
m: pattern size; w: word size
T[a] = T[b] = T[c] = T[d] = 01101
We allow up to k characters of the pattern to mismatch with the corresponding text. For example, if k = 2 and the pattern is "mismatch":
mismatch (match)
dispatch (match)
respatch (mismatch)
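The definition can be checked directly by counting position-by-position differences. The helper names below are hypothetical and serve only to illustrate the "mismatch" example; this is a brute-force check, not the bit-parallel method.

```python
def count_mismatches(pattern, window):
    """Number of positions where two equal-length strings differ."""
    if len(pattern) != len(window):
        raise ValueError("pattern and window must have the same length")
    return sum(p != t for p, t in zip(pattern, window))


def matches_within(pattern, window, k):
    """True when the window matches the pattern with at most k mismatches."""
    return count_mismatches(pattern, window) <= k
```

With k = 2, "dispatch" differs from "mismatch" in 2 positions and is accepted, while "respatch" differs in 3 positions and is rejected.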
At each step we record the overflow bits in an overflow state, and we reset the overflow bits of all individual states.
We want to search for all occurrences of ababc with at most 2 mismatches. Because b = 3 for 2 mismatches, every position in the state is represented by a 3-bit number. We report a match when the sum of the leftmost digits of the state and the overflow is less than 3.
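A sketch of the search with mismatches and an overflow state might look like this. Several details are assumptions made for the illustration and are not confirmed by the slides: the function name is hypothetical, b is computed as ceil(log2(k + 1)) + 1 (which gives 3 for k = 2), the top bit of each b-bit field serves as its overflow bit, both the state and the overflow start at 0, and matches are only reported once a full pattern length of text has been read.

```python
from math import ceil, log2


def search_with_mismatches(pattern, text, k):
    """Return (start, mismatches) for every window with at most k mismatches."""
    m = len(pattern)
    b = ceil(log2(k + 1)) + 1  # assumed field width; 3 bits when k = 2
    full_mask = (1 << (m * b)) - 1
    field_mask = (1 << b) - 1
    # The top bit of every b-bit field acts as that field's overflow bit.
    overflow_mask = sum(1 << (i * b + b - 1) for i in range(m))

    def table_entry(x):
        # Field i holds 1 when pattern[i] != x, else 0.
        return sum(1 << (i * b) for i, p in enumerate(pattern) if p != x)

    table = {x: table_entry(x) for x in set(pattern) | set(text)}

    state = 0
    overflow = 0
    top_shift = (m - 1) * b
    hits = []
    for j, ch in enumerate(text):
        state = ((state << b) & full_mask) + table[ch]
        # Record the overflow bits, then reset them in the individual states.
        overflow = ((overflow << b) & full_mask) | (state & overflow_mask)
        state &= ~overflow_mask
        if j >= m - 1:
            total = ((state >> top_shift) & field_mask) + (
                (overflow >> top_shift) & field_mask
            )
            if total <= k:
                hits.append((j - m + 1, total))
    return hits
```

For the pattern ababc, the text abdabababc and k = 2, the sketch reports the window starting at index 3 (ababa, 1 mismatch) and the window starting at index 5 (ababc, 0 mismatches).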
Experimental results for searching 100 times for all possible matches of a pattern in a 50,000-character English text (a legal document)
BMH: Boyer-Moore, as suggested by Horspool
The execution time when searching for 1,000 words chosen at random from the same English text