A Fast Algorithm for Multi-Pattern Searching Sun Wu, Udi Manber May 1994.


1 A Fast Algorithm for Multi-Pattern Searching Sun Wu, Udi Manber May 1994

2 Basic Idea Boyer-Moore: starts by comparing the last character of the pattern. The algorithm applies the "skip" idea of Boyer-Moore (the bad-character shift) to multiple patterns. The text is examined in blocks of characters instead of one character at a time. Hash functions and tables are used.

3 The preprocessing stage Compute the minimum length of a pattern, m, and consider only the first m chars of each pattern. ∴ if there are k patterns, the total size is M = k*m. Three tables are built: a SHIFT table, a HASH table, and a PREFIX table.

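4 Scanning steps 1. Compute a hash value h based on the current B characters from the text (starting with t_(m-B+1) ... t_m). 2. Check the value of SHIFT[h]: if it is > 0, shift the text and go back to 1; otherwise go to 3. 3. Compute the hash value of the prefix of the text (starting m characters to the left of the current position); call it text_prefix. 4. Check for each p, HASH[h] <= p < HASH[h+1], whether PREFIX[p] = text_prefix. When they are equal, check the actual pattern against the text directly.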
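The four scanning steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: dictionaries keyed by the block itself stand in for the hashed SHIFT and HASH tables, the prefix "hash" is simply the first B characters (so B' = B), and the names build_tables and search are hypothetical.

```python
def build_tables(patterns, B=2):
    """Build SHIFT and HASH/PREFIX stand-ins (assumption: dicts keyed by the block)."""
    m = min(len(p) for p in patterns)
    assert m >= B, "every pattern must be at least B characters long"
    default = m - B + 1                 # shift for blocks that occur in no pattern
    shift, buckets = {}, {}
    for idx, pat in enumerate(patterns):
        head = pat[:m]                  # only the first m characters are used
        for q in range(B, m + 1):       # q = 1-based position where the block ends
            block = head[q - B:q]
            shift[block] = min(shift.get(block, default), m - q)
        # group patterns by their last block; keep the prefix for filtering
        buckets.setdefault(head[m - B:], []).append((head[:B], idx))
    return m, default, shift, buckets


def search(text, patterns, B=2):
    """Return (start, pattern) for every occurrence of any pattern in text."""
    m, default, shift, buckets = build_tables(patterns, B)
    hits = []
    pos = m                             # text[pos - m:pos] is the current window
    while pos <= len(text):
        block = text[pos - B:pos]       # step 1: the current B characters
        s = shift.get(block, default)   # step 2: look up the shift value
        if s > 0:
            pos += s
            continue
        start = pos - m
        text_prefix = text[start:start + B]          # step 3
        for prefix, idx in buckets.get(block, []):   # step 4: filter, then verify
            if prefix == text_prefix and text.startswith(patterns[idx], start):
                hits.append((start, patterns[idx]))
        pos += 1
    return hits
```

With these definitions, search("CPM_annual_conference_announce", ["announce", "annual", "annually"]) returns [(4, 'annual'), (22, 'announce')]. Most windows are skipped by a non-zero shift, and the full pattern is compared only after both the suffix block and the prefix agree.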

5 SHIFT table Let B be the size of the block; each string X of size B over the alphabet is mapped by a hash function to an index i into the SHIFT table. 1. X doesn't appear as a substring in any pattern: SHIFT[i] = m - B + 1 (not m). 2. X appears in some patterns: SHIFT[i] = m - q, where q is a position at which X ends in some pattern; SHIFT[i] is set to the minimum such value.
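A small sketch of how the SHIFT table could be filled, assuming a fixed-size array indexed by a simple shift-and-add hash. The hash function, the table size of 4096 and the names block_hash and build_shift are illustrative choices, not taken from the slides.

```python
def block_hash(block, table_size=4096):
    """Hypothetical hash: shift-and-add over the characters of the block."""
    h = 0
    for ch in block:
        h = ((h << 5) + ord(ch)) % table_size
    return h


def build_shift(patterns, B=2, table_size=4096):
    m = min(len(p) for p in patterns)
    # Case 1: blocks that appear in no pattern keep the default m - B + 1.
    shift = [m - B + 1] * table_size
    for pat in patterns:
        for q in range(B, m + 1):       # q = 1-based end position of the block
            i = block_hash(pat[q - B:q], table_size)
            # Case 2: keep the minimum of m - q over all occurrences
            # (and over any other blocks that collide on index i).
            shift[i] = min(shift[i], m - q)
    return m, shift


m, shift = build_shift(["announce", "annual", "annually"])
print(shift[block_hash("al")], shift[block_hash("an")], shift[block_hash("zz")])
```

For these three patterns m is 6, so the block "al" (which ends "annual") gets shift 0, "an" gets 4, and a block such as "zz" that occurs in no pattern keeps the default 5. Hash collisions can only lower a shift value, which costs speed but never skips a match.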

6 HASH table Uses the same hash function as the SHIFT table, applied to the last B chars of each pattern. HASH[i] contains a pointer p that points to the list of pointers to the patterns whose last B characters hash into i; p is also an index into the PREFIX table.

7 PREFIX table Maps the first B' chars of all patterns into the PREFIX table. It contains the hash value of each prefix of size B'. It is used to filter out patterns whose suffix is the same but whose prefix is different.
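One way the HASH and PREFIX tables could be laid out is sketched below: the pattern pointers are sorted by the hash of their last B characters, HASH[i] holds the index of the first pointer for hash value i, and PREFIX holds the prefix hash at the same index. The hash function, the table size, B' = B, and the names pat_point, build_hash_prefix and candidates are assumptions made for illustration.

```python
def block_hash(block, table_size=4096):
    """Hypothetical hash: shift-and-add over the characters of the block."""
    h = 0
    for ch in block:
        h = ((h << 5) + ord(ch)) % table_size
    return h


def build_hash_prefix(patterns, B=2, table_size=4096):
    m = min(len(p) for p in patterns)
    suffix_hash = [block_hash(p[m - B:m], table_size) for p in patterns]
    # Pattern pointers (here: indices) sorted by the hash of the last B chars.
    pat_point = sorted(range(len(patterns)), key=lambda i: suffix_hash[i])
    # PREFIX[p] is the hash of the first B chars of the pattern at pat_point[p].
    prefix = [block_hash(patterns[i][:B], table_size) for i in pat_point]
    # HASH[h] = index of the first pointer whose suffix hash is h (counting sort).
    hash_table = [0] * (table_size + 1)
    for h in suffix_hash:
        hash_table[h + 1] += 1
    for h in range(table_size):
        hash_table[h + 1] += hash_table[h]
    return m, hash_table, pat_point, prefix


def candidates(window, tables, B=2, table_size=4096):
    """Patterns whose suffix hash and prefix hash both agree with the window."""
    m, hash_table, pat_point, prefix = tables
    h = block_hash(window[m - B:m], table_size)
    text_prefix = block_hash(window[:B], table_size)
    return [pat_point[p] for p in range(hash_table[h], hash_table[h + 1])
            if prefix[p] == text_prefix]


patterns = ["announce", "annual", "annually", "ritual"]
tables = build_hash_prefix(patterns)
print(candidates("annual", tables))   # [1, 2]
print(candidates("casual", tables))   # []
```

Here "annual", "annually" and "ritual" all end their first six characters with "al", so they share one HASH bucket; the PREFIX check removes "ritual" for the window "annual" and removes every candidate for "casual" before any character-by-character comparison is needed.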

8 [Figure: the SHIFT table, the HASH table with its pattern pointer list (entries for hash = i and hash = i+1), and the PREFIX table]

9 Performance There are c^B entries in the SHIFT table (c is the alphabet size), and it is constructed in O(M) time. Since it takes O(B) to compute one hash function, the total amount of work in the cases of non-zero shifts is O(BN/m), where N is the size of the text. Assume B' = B; then the amount of work for the case of shift value 0 is also O(B), and the expected total amount for this step is also O(BN/m).

10 [Figure: a comparison of different search routines on a 15.8MB text]

11 A comparison of different search routines on a 15.8MB text (figure above) Pattern sizes range from 5 to 15, with an average size slightly above 6. The original egrep and fgrep cannot handle more than a few hundred patterns.

12 [Figure: a comparison of running times for different numbers of patterns]

13 A comparison of running times for different numbers of patterns (figure above) The running time improves once the number of patterns exceeds about 8000. This is related to the way greps work rather than to the specific algorithm. Agrep (and every other grep) outputs the lines that match the query. Above 8000 patterns, most lines are matched, so less work is needed.

14 [Figure: the effect of the minimum pattern length on the running time]

15 The effect of the minimum pattern length on the running time (figure above) The larger m is, the more chances of shifting there are, leading to less work. The running times match a curve determined by the average shift value. Preprocessing is very fast. Ex.: for 10000 patterns, agrep takes 0.17 seconds and GNU-grep 0.9 seconds.

16 Additional applications Finding all similar files in a large file system normally needs a data structure to handle the searches. If the data is fetched from disk anyway, we can store the records as we obtain them, without sorting them, or put each record together with its identifier on one line.

17 Additional applications Benefits: No need for any additional space for the data structure. No need for preprocessing or organizing the data structure, e.g. sorting. More flexible search.

18 Additional applications Another application: match-and-replace. Each pattern is associated with a replacement pattern. Whenever a pattern is discovered in the text, it is replaced in the output by its replacement.
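The match-and-replace idea can be illustrated with a short Python sketch. To keep it self-contained, a naive longest-pattern-first scan stands in for the multi-pattern search, so it shows the pairing of patterns with replacements rather than the fast algorithm; the function name match_and_replace and the longest-match tie-breaking rule are assumptions.

```python
def match_and_replace(text, replacements):
    """Replace every pattern found in text by its associated replacement.

    `replacements` maps each pattern to its replacement string. A naive
    scan is used here as a stand-in for the multi-pattern search.
    """
    # Try longer patterns first so that overlapping candidates resolve
    # to the longest match (an assumption of this sketch).
    patterns = sorted((p for p in replacements if p), key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for pat in patterns:
            if text.startswith(pat, i):
                out.append(replacements[pat])   # emit the replacement
                i += len(pat)
                break
        else:
            out.append(text[i])                 # no pattern starts here
            i += 1
    return "".join(out)


print(match_and_replace("colour and flavour",
                        {"colour": "color", "flavour": "flavor"}))
# prints: color and flavor
```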

19 Conclusion Aho and Corasick: a linear-time algorithm, optimal in the worst case. Boyer and Moore: a regular (single-pattern) string-searching algorithm that can skip large portions of the text, leading to a faster-than-linear algorithm, but not suitable for multi-pattern search.

20 Conclusion Wu and Manber: concentrate on typical searches rather than on worst-case behavior. This focus is crucial to making the algorithm significantly faster than other algorithms in practice.

