On the Use of Regular Expressions for Searching Text
Charles L.A. Clarke and Gordon V. Cormack

Fast Text Searching for Regular Expressions or Automaton Searching on Tries
Ricardo Baeza-Yates, Gaston H. Gonnet

On the Use of Regular Expressions for Searching Text
New perspective, particularly relevant to structured text
Definition of the search problem
–Does a given string of text match a particular pattern? (regular expression recognition problem)
–Locate the substrings of a text that match a particular pattern (searching problem)
–Given a universe U, identify all elements of U that contain a substring x matching a particular pattern r (more precise definition)

On the Use of Regular Expressions for Searching Text
Given a string x and a regular expression r, locate all substrings of x that match r
Problems on a continuous stream of text: the number of matching substrings can be quadratic in the length of x, and the results may overlap and nest
–Restricting the search to linearize the solutions is not simple
–The most common restriction is the “leftmost longest match” rule
–Problems: what is the next match? Where should a new search start from?

On the Use of Regular Expressions for Searching Text
This article proposes an alternative linearizing restriction: “Locate the set of shortest nonnested (but possibly overlapping) strings that each match the pattern.”
Related work: Thompson’s algorithm, Baeza-Yates

On the Use of Regular Expressions for Searching Text
Shortest substring
–Definition of the search problem
–Comparison between longest-match and shortest-match search: shortest-match reports all occurrences of the members of L that are in G(L) and no others; the result of longest-match depends on the entire text
–A string may be recognized as a member of a regular language by a single left-to-right scan with constant store; longest-match search does not have such properties
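
The difference between the two rules can be seen on a tiny input. The following is a minimal brute-force sketch, not the single-scan algorithm from the article: it enumerates every matching substring with Python's `re` module and then keeps only the spans that contain no other matching span. The function names `all_matches` and `shortest_nonnested` are hypothetical and chosen for illustration.

```python
import re

def all_matches(pattern, text):
    """Every non-empty (start, end) span whose substring fully matches the pattern."""
    rx = re.compile(pattern)
    n = len(text)
    return [(i, j)
            for i in range(n + 1)
            for j in range(i + 1, n + 1)
            if rx.fullmatch(text, i, j)]

def shortest_nonnested(pattern, text):
    """Keep only spans that do not contain another matching span (overlap is allowed)."""
    spans = all_matches(pattern, text)
    return [(i, j) for (i, j) in spans
            if not any((a, b) != (i, j) and i <= a and b <= j for (a, b) in spans)]

text = "a1b a2b"
print(all_matches("a.*b", text))         # [(0, 3), (0, 7), (4, 7)]
print(shortest_nonnested("a.*b", text))  # [(0, 3), (4, 7)]
# A greedy search returns the long span that swallows both short ones.
print(re.search("a.*b", text).span())    # (0, 7)
# Overlapping but non-nested matches are both kept.
print(shortest_nonnested("a.a", "ababa"))  # [(0, 3), (2, 5)]
```

The brute-force enumeration is only there to make the definition concrete; it examines a quadratic number of substrings, which is exactly the cost the linearizing restriction is meant to avoid.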

On the Use of Regular Expressions for Searching Text
Explicit containment
–A regular expression may be used to define an explicit universe for the search
–Implemented by running two concurrent copies of the algorithm
Search tool: CGREP was developed on the basis of the theory in this article
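
As an illustration of explicit containment, the sketch below uses one regular expression to define the universe and a second one as the search pattern, and reports the universe elements that contain a match. It is an assumption-laden simplification: it makes two separate passes with Python's `re` module rather than running two concurrent copies of the shortest-match algorithm, and it does not reflect how CGREP is implemented. The function name `containing_elements` and the `<p>` markup are hypothetical.

```python
import re

def containing_elements(universe, pattern, text):
    """Return the universe elements (matches of `universe`) that contain a match of `pattern`."""
    element_rx = re.compile(universe)
    pattern_rx = re.compile(pattern)
    return [m.group(0)
            for m in element_rx.finditer(text)
            if pattern_rx.search(m.group(0))]

text = "<p>regex search</p><p>trie index</p><p>regex on tries</p>"
# The non-greedy quantifier approximates "shortest" paragraph elements.
print(containing_elements(r"<p>.*?</p>", r"regex", text))
# ['<p>regex search</p>', '<p>regex on tries</p>']
```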

On the Use of Regular Expressions for Searching Text
Concluding comments:
–Explores the properties of the shortest-match search rule for regular expressions
–The shortest substring rule provides a precise definition of which strings will be selected during a search, without any dependence on the contents of the remainder of the text
–A single left-to-right scan is enough
–Storage requirements depend only on the properties of the regular expression
–Can define search universes; useful in structured text (no predefined retrieval units)

Fast Text Searching for Regular Expressions or Automaton Searching on Tries
Presents algorithms for efficient searching of regular expressions on preprocessed text, using a Patricia tree as a logical model for the index.
The algorithms run in logarithmic expected time in the size of the text for some restricted regular expressions, and in sublinear expected time for any regular expression.

Fast Text Searching for Regular Expressions or Automaton Searching on Tries
Pattern matching: find occurrences of a given pattern in a long string
Variations depend on whether the text is preprocessed and on the language used to specify the query
In this article the authors consider preprocessed text and a query specified by a regular expression
The problem: determine whether the text string t ∈ Σ* q Σ* (q is the query) and report any combination of
1) the location of an occurrence,
2) the number of occurrences,
3) all locations where the pattern occurs
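
The three kinds of answer can be shown with a short sketch. It scans the raw text with Python's `re` module, so it illustrates only the query types, not the preprocessed-text setting of the article. The function names are hypothetical, and "location" is taken here to mean the start position of a match, which is an assumption.

```python
import re

def first_location(q, t):
    """1) Start position of the first occurrence of q in t, or None if t is not in Σ* q Σ*."""
    m = re.search(q, t)
    return m.start() if m else None

def all_locations(q, t):
    """3) Every start position at which a match of q begins (overlaps included)."""
    # A lookahead consumes no text, so overlapping occurrences are all reported.
    return [m.start() for m in re.finditer(f"(?=(?:{q}))", t)]

def count_occurrences(q, t):
    """2) Number of start positions at which q matches."""
    return len(all_locations(q, t))

t = "ababa"
print(first_location("aba", t))     # 0
print(count_occurrences("aba", t))  # 2
print(all_locations("aba", t))      # [0, 2]
```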

Fast Text Searching for Regular Expressions or Automaton Searching on Tries
Main idea: simulate the finite automaton of the query over a digital tree (or Patricia tree) of the text.
Run the automaton on all paths of the digital tree from the root to the leaves, stopping when possible.
The time savings come from the fact that each edge of the tree is traversed at most once, and that every edge represents pairs of symbols in many places of the text.

Fast Text Searching for Regular Expressions or Automaton Searching on Tries
Static databases
Logical index for the text
Definition of sistrings
Construction of the text index, a binary trie consisting of the set of sistrings of the text
Use of a Patricia tree to reduce the number of internal nodes
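
A sistring (semi-infinite string) is the suffix of the text that starts at a given position. The sketch below builds a plain character-level trie over all sistrings to make the index concrete. It is a simplification under stated assumptions: the article uses a binary trie, whereas this version branches on characters, appends a hypothetical end marker `$`, and does no path compression. The function names are illustrative only.

```python
def sistrings(text):
    """One sistring per starting position of the text."""
    return [text[i:] for i in range(len(text))]

def build_trie(text, end="$"):
    """Nested-dict trie over all sistrings.

    Each leaf stores the sistring's start position under the key None.
    The end marker is assumed not to occur in the text.
    """
    root = {}
    for pos, s in enumerate(sistrings(text)):
        node = root
        for ch in s + end:
            node = node.setdefault(ch, {})
        node[None] = pos
    return root

def count_nodes(node):
    """Number of trie nodes, including the root."""
    return 1 + sum(count_nodes(child)
                   for key, child in node.items() if key is not None)

trie = build_trie("abab")
print(sistrings("abab"))           # ['abab', 'bab', 'ab', 'b']
print(trie["a"]["b"]["$"][None])   # 2  (the sistring "ab" starts at position 2)
print(count_nodes(trie))
```

An uncompressed trie like this one can need space quadratic in the text length, because long single-child chains are stored node by node. A Patricia tree collapses those chains, which is the reduction in internal nodes mentioned on the slide.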

Fast Text Searching for Regular Expressions or Automaton Searching on Tries
General automaton searching
–The authors present an algorithm that can search for arbitrary regular expressions in time sublinear in n, the size of the text, on average. They simulate a DFA on a binary trie built from all the sistrings of a text.
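
The simulation can be sketched as a traversal that carries a DFA state down the trie. A branch is abandoned as soon as the DFA has no transition, and when an accepting state is reached the whole subtree is reported without being searched further. This is a minimal illustration, not the authors' algorithm: it uses a character-level trie instead of a binary Patricia tree, and the DFA for `ab*c` is written by hand as a transition table. All names (`build_trie`, `leaves`, `search`, `delta`) are hypothetical.

```python
def build_trie(text, end="$"):
    """Nested-dict trie of all sistrings; leaves hold the start position under key None."""
    root = {}
    for pos in range(len(text)):
        node = root
        for ch in text[pos:] + end:
            node = node.setdefault(ch, {})
        node[None] = pos
    return root

def leaves(node):
    """All sistring start positions stored at or below this node."""
    out = []
    for key, child in node.items():
        if key is None:
            out.append(child)          # child is the stored position
        else:
            out.extend(leaves(child))
    return out

def search(trie, delta, start, accepting):
    """Start positions of sistrings that have a prefix accepted by the DFA."""
    hits = []
    stack = [(trie, start)]
    while stack:
        node, state = stack.pop()
        if state in accepting:
            hits.extend(leaves(node))  # every sistring below here matches
            continue
        for ch, child in node.items():
            if ch is None:
                continue
            nxt = delta.get((state, ch))
            if nxt is not None:        # no transition: prune this branch
                stack.append((child, nxt))
    return sorted(hits)

# Hand-written DFA for the regular expression ab*c.
delta = {(0, "a"): 1, (1, "b"): 1, (1, "c"): 2}
trie = build_trie("abbc ac abx")
print(search(trie, delta, start=0, accepting={2}))  # [0, 5]
```

Each trie edge is visited at most once, and one visit stands for every text position that shares that prefix, which is where the savings over a plain scan of the text come from.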

Fast Text Searching for Regular Expressions or Automaton Searching on Tries
Concluding comments
–Using a trie or Patricia tree, many types of string searching queries can be answered in logarithmic average time, independently of the size of the answer
–Automaton searching in a trie is sublinear in the size of the text on average for any regular expression
–The worst case of automaton searching is linear (for unusual pieces of text)