RE-Tree: An Efficient Index Structure for Regular Expressions

Chee-Yong Chan, Minos Garofalakis, Rajeev Rastogi
Information Sciences Research Center, Bell Laboratories, Lucent Technologies

Motivation
Regular expressions (REs) provide a simple yet powerful formalism for pattern/structure specifications.
Example applications:
- XPath pattern language for XML documents
- Policy language of the Border Gateway Protocol (BGP)
RE Filtering Problem: given a set R of REs and an input string s, the RE filter returns the subset of R that matches s.

Our Approach: RE-Tree
Idea: Partition the RE data set using a height-balanced hierarchical index structure to maximize pruning of the search space.
Challenge: REs generally define infinite sets, and there is no well-defined metric for clustering REs.

RE-Tree Overview
- Dynamic, height-balanced, hierarchical index structure.
- REs are stored as finite automata (FAs) in the leaf nodes.
- Internal nodes contain directory entries pointing to nodes at the next level; each directory entry = (FA, Pointer).
[Figure: a tree whose internal nodes hold internal FAs and whose leaf nodes hold leaf FAs M1, M2, ..., M8, ...]

RE-Tree: Containment Property
The FA in a directory entry accepts every string accepted by the FAs in the node it points to:
$$L(M_1) \supseteq L(M_2) \cup L(M_3) \cup L(M_4)$$
Example: M1 = bounding FA of { M2, M3, M4 }
  M2 = a (a | b) c*
  M3 = aa (a | b | c)* c
  M4 = ab (bc | cc)*
  M1 = a (a | b) (a | b | c)*
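The containment property is what lets a search skip whole subtrees: if a bounding FA rejects the input string, nothing below it can match. The following is a minimal sketch of that idea, not the authors' implementation. It assumes Python's `re` module as a stand-in for finite automata, and the `Node` class, the `search` function, and the second subtree (`b(a|c)*` bounded by `b(a|b|c)*`) are hypothetical names and data added for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical RE-tree node; regex strings stand in for FAs."""
    is_leaf: bool
    # Leaf: list of RE strings. Internal: list of (bounding RE, child Node).
    entries: list


def search(node, s):
    """Return the leaf REs that match s, pruning by bounding REs."""
    if node.is_leaf:
        return [p for p in node.entries if re.fullmatch(p, s)]
    matches = []
    for bound, child in node.entries:
        # Containment property: if the bound rejects s, so does every
        # RE stored below it, and the subtree can be skipped.
        if re.fullmatch(bound, s):
            matches.extend(search(child, s))
    return matches


# Leaf REs and bounding RE taken from the slide's example.
leaf1 = Node(True, ["a(a|b)c*", "aa(a|b|c)*c", "ab(bc|cc)*"])
# Second subtree is an assumption, added only to show pruning.
leaf2 = Node(True, ["b(a|c)*"])
root = Node(False, [("a(a|b)(a|b|c)*", leaf1), ("b(a|b|c)*", leaf2)])

print(search(root, "aac"))
```

For the input `aac`, only the first subtree is visited, because the second bounding RE rejects any string that starts with `a`.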

Bounding Finite Automata
- Many possible bounding FAs exist for a given set of FAs.
- The most precise FA accepts the union of L(Mi) for all Mi in the set.
- The least precise FA accepts every string over the alphabet.
- Space-precision tradeoff for bounding FAs: a more precise FA improves search pruning, but its size could be large, resulting in lower fan-out of the index node.
- The RE-tree controls fan-out by bounding the maximum number of states per internal FA (an index parameter).
Goal: Optimize search performance by maximizing the precision of bounding FAs.

RE-Trees vs. R-Trees
RE-trees are similar in spirit to R-trees.

                         R-trees                             RE-trees
Data type                Multi-dimensional rectangles        Regular languages
Internal node entries    Minimal bounding rectangles (MBR)   Bounding FAs
Update operations        Minimize volume of MBRs             Minimize size of languages accepted by bounding FAs

RE-Tree Algorithms
RE-tree construction involves three key operations:
- Selecting an optimal insertion node
- Computing an optimal bounding FA
- Computing an optimal node split

RE-Tree Optimization Problems
Let S = {M1, M2, ..., Mn} be the set of FAs in a node N. Each problem below optimizes a language size that is possibly infinite.
- Selecting an optimal insertion node: select the node corresponding to the Mi that maximizes $|L(M \cap M_i)|$, where M is the FA to be inserted.
- Computing an optimal bounding FA: compute M, a bounding FA of S with at most the maximum allowed number of states, that minimizes |L(M)|.
- Computing an optimal node split: partition S into S1 and S2 such that |L(union of FAs in S1)| + |L(union of FAs in S2)| is minimized.

Main Challenge
Problem: How to measure the size of REs?
Observe: Infinite REs may not have the same size. Example: (a|b)* is larger than a(a|b)*.
Idea: Need a computable measure for the size of REs that captures the intuition of the "larger than" relationship.
Let L(M, i) = set of length-i strings in L(M). Intuitively, L(M) is larger than L(M') iff
$$\exists N \in \mathbb{Z}^{+} \ \text{s.t.}\ \forall k > N:\ \sum_{i=1}^{k} |L(M, i)| > \sum_{i=1}^{k} |L(M', i)|$$

Max-Count Size Measure
Idea: Count the size of L(M) up to some maximum length k.
$$|L(M)| = |L(M,1)| + |L(M,2)| + \cdots + |L(M,k)|$$
Cons: Sensitive to the maximum length parameter value.
Example:
  L(M1) = (b|c)* d (a|b)* d (b|c)* d
  L(M2) = dd (a|b|c)* d
L(M2) is larger than L(M1), but the max-count measure is correct iff the maximum length parameter value > 15.
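The max-count measure can be computed without enumerating strings by pushing path counts through a deterministic FA one symbol at a time. The sketch below is one possible way to do this under stated assumptions: the FA is deterministic and encoded as a hypothetical `(start, finals, delta)` tuple, where `delta` maps `(state, symbol)` to the next state. It uses the earlier pair (a|b)* and a(a|b)* as the example.

```python
def count_by_length(dfa, k):
    """Return [|L(M,1)|, ..., |L(M,k)|] for a deterministic FA."""
    start, finals, delta = dfa
    # reach[q] = number of strings of the current length that lead to q.
    reach = {start: 1}
    counts = []
    for _ in range(k):
        nxt = {}
        for (p, _sym), q in delta.items():
            if p in reach:
                nxt[q] = nxt.get(q, 0) + reach[p]
        reach = nxt
        counts.append(sum(c for q, c in reach.items() if q in finals))
    return counts


def max_count(dfa, k):
    """Max-count size: accepted strings of length 1..k."""
    return sum(count_by_length(dfa, k))


# (a|b)*: a single accepting state looping on a and b.
ANY = (0, {0}, {(0, "a"): 0, (0, "b"): 0})
# a(a|b)*: must start with a, then loops on a and b.
A_THEN_ANY = (0, {1}, {(0, "a"): 1, (1, "a"): 1, (1, "b"): 1})

print(max_count(ANY, 3), max_count(A_THEN_ANY, 3))
```

With k = 3 the counts are 2 + 4 + 8 for (a|b)* and 1 + 2 + 4 for a(a|b)*, which agrees with the intuition that the first language is larger.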

MDL-based Size Measure
MDL Principle: Provides an information-theoretic definition of an optimal model for a given data set.
Observation: if $L(M_1) \supseteq S$ and $L(M_2) \supseteq S$, then
$$\sum_{w \in S} \mathrm{Encode}(w, M_1) < \sum_{w \in S} \mathrm{Encode}(w, M_2) \implies M_1 \text{ is more precise than } M_2$$
MDL-based Measure: L(M2) is larger than L(M1) if
$$\sum_{w \in S_1} \frac{\mathrm{Encode}(w, M_1)}{|w|} < \sum_{w \in S_2} \frac{\mathrm{Encode}(w, M_2)}{|w|}$$

Definition of Encode(w, M)
How to encode $w \in L(M)$ using M? Let p = <s0, s1, ..., sn> be the accepting path of w in M.
$$\mathrm{Encode}(w, M) = \sum_{i=0}^{n-1} \log(\text{number of out-going transitions in } s_i)$$
Example: Encode(ddbd, M) = log(1) + log(2) + log(4) = 5
[Figure: the FA M, with transitions labeled a,b,c; b; and d.]
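The encoding cost can be sketched as a walk along the accepting path that adds the log of each visited state's fan-out. This is an illustration under assumptions, not the authors' code: the FA is deterministic, logarithms are base 2, each `(state, symbol)` pair counts as one out-going transition, and the `(start, finals, delta)` tuple and the example automaton for a(a|b)* are hypothetical.

```python
import math


def encode(w, dfa):
    """Bits needed to encode w along its accepting path in a deterministic FA."""
    start, finals, delta = dfa
    state, bits = start, 0.0
    for ch in w:
        if (state, ch) not in delta:
            raise ValueError(f"{w!r} is not accepted")
        # Fan-out of the current state: one choice per out-going transition.
        fanout = sum(1 for (p, _sym) in delta if p == state)
        bits += math.log2(fanout)
        state = delta[(state, ch)]
    if state not in finals:
        raise ValueError(f"{w!r} is not accepted")
    return bits


# Hypothetical example FA for a(a|b)*.
A_THEN_ANY = (0, {1}, {(0, "a"): 1, (1, "a"): 1, (1, "b"): 1})

print(encode("abba", A_THEN_ANY))
```

Here the start state has one out-going transition (cost 0 bits) and the accepting state has two (cost 1 bit per symbol), so a more permissive automaton makes the same string more expensive to encode.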

Algorithm to Optimize Bounding FA
Compute a bounding FA M for a given set of FAs S s.t. (1) M has at most the maximum allowed number of states, and (2) |L(M)| is minimized.
The problem is NP-hard.
Heuristic: Compute the most precise FA for S and then incrementally relax its precision (by greedily merging pairs of states) until the space constraint is satisfied.

An Example
Compute a bounding FA for S = { abb*, aa*b } with the maximum number of states set to 3.
[Figure: FAs with transitions labeled a and b.]
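The greedy state-merging heuristic can be sketched on this example as follows. This is a rough illustration under several assumptions, not the authors' algorithm: automata are nondeterministic and encoded as hypothetical `(starts, finals, transitions)` tuples, the most precise FA is taken to be the disjoint union of the inputs, and each step picks the merge with the smallest max-count size, computed here by brute-force enumeration of strings up to length `k`.

```python
from itertools import combinations, product


def accepts(nfa, w):
    """Simulate an NFA given as (starts, finals, {(src, sym, dst)})."""
    starts, finals, trans = nfa
    cur = set(starts)
    for ch in w:
        cur = {q for (p, a, q) in trans if p in cur and a == ch}
    return bool(cur & set(finals))


def states_of(nfa):
    starts, finals, trans = nfa
    return set(starts) | set(finals) | {p for p, _, _ in trans} | {q for _, _, q in trans}


def max_count(nfa, alphabet, k):
    """Brute-force count of accepted strings of length 1..k."""
    return sum(accepts(nfa, w) for n in range(1, k + 1)
               for w in product(alphabet, repeat=n))


def union(nfas):
    """Disjoint union: accepts exactly the union of the input languages."""
    starts, finals, trans = set(), set(), set()
    for i, (s, f, t) in enumerate(nfas):
        starts |= {(i, q) for q in s}
        finals |= {(i, q) for q in f}
        trans |= {((i, p), a, (i, q)) for p, a, q in t}
    return starts, finals, trans


def merge(nfa, keep, drop):
    """Merge state `drop` into `keep`; the language can only grow."""
    r = lambda s: keep if s == drop else s
    starts, finals, trans = nfa
    return ({r(s) for s in starts}, {r(s) for s in finals},
            {(r(p), a, r(q)) for p, a, q in trans})


def bounding_fa(nfas, limit, alphabet, k):
    m = union(nfas)
    while len(states_of(m)) > limit:
        pairs = combinations(sorted(states_of(m)), 2)
        # Greedy step: keep the merge that enlarges the language least.
        m = min((merge(m, p, q) for p, q in pairs),
                key=lambda c: max_count(c, alphabet, k))
    return m


ABB_STAR = ({0}, {2}, {(0, "a", 1), (1, "b", 2), (2, "b", 2)})   # abb*
AA_STAR_B = ({0}, {2}, {(0, "a", 1), (1, "a", 1), (1, "b", 2)})  # aa*b

m = bounding_fa([ABB_STAR, AA_STAR_B], 3, "ab", 6)
print(len(states_of(m)), accepts(m, "abb"), accepts(m, "aab"))
```

Merging two states never removes an accepting path, so every intermediate automaton still bounds the input set. The greedy choice only affects how much extra language is admitted.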

Other RE-Tree Algorithms
- Selecting an optimal insertion node: select the node corresponding to the Mi that maximizes $|L(M \cap M_i)|$, where M is the FA to be inserted.
- Computing an optimal node split: partition S into S1 and S2 (each with at least m FAs) such that |L(union of FAs in S1)| + |L(union of FAs in S2)| is minimized. The problem is NP-hard. The heuristic used is similar to the R-tree's Quadratic Split Algorithm.

Optimizing RE-Tree Operations
RE-tree algorithms involve many FA operations (i.e., union and intersection). Speed up performance using sampling techniques.
Example: Selecting the optimal insertion node requires computing $|L(M_i \cap M)|$ for each Mi in the current node. An unbiased estimate of $|L(M_i \cap M, k)|$ is given by
$$\frac{(\text{number of strings in } S \text{ accepted by } M_i) \cdot |L(M, k)|}{|S|}$$
where S = uniform random sample of L(M, k).
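A minimal sketch of this estimator follows. It assumes M is a deterministic FA encoded as a hypothetical `(start, finals, delta)` tuple and that membership in L(Mi) is supplied as a callable. The uniform sampler, which weights each transition by the number of accepting completions, is one standard way to draw from L(M, k) and is not taken from the slides.

```python
import random
import re


def completions(dfa, k):
    """ways[j][q] = number of length-j strings leading from q to a final state."""
    start, finals, delta = dfa
    states = {start} | set(finals) | {p for p, _ in delta} | set(delta.values())
    ways = [{q: int(q in finals) for q in states}]
    for _ in range(k):
        prev = ways[-1]
        ways.append({q: sum(prev[t] for (p, _), t in delta.items() if p == q)
                     for q in states})
    return ways


def sample(dfa, k, ways, rng):
    """Draw one string uniformly at random from L(M, k); requires |L(M, k)| > 0."""
    q, out = dfa[0], []
    for j in range(k, 0, -1):
        moves = [(a, t) for (p, a), t in dfa[2].items() if p == q]
        # Weight each move by how many accepting strings continue through it.
        weights = [ways[j - 1][t] for _, t in moves]
        a, q = rng.choices(moves, weights=weights)[0]
        out.append(a)
    return "".join(out)


def estimate_intersection(dfa, k, accepts_i, n, rng):
    """Estimate |L(Mi ∩ M, k)| from n uniform samples of L(M, k)."""
    ways = completions(dfa, k)
    total = ways[k][dfa[0]]  # |L(M, k)|
    if total == 0:
        return 0.0
    hits = sum(1 for _ in range(n) if accepts_i(sample(dfa, k, ways, rng)))
    return hits * total / n


# Hypothetical example: M = a(a|b)*, Mi = strings ending in b.
M = (0, {1}, {(0, "a"): 1, (1, "a"): 1, (1, "b"): 1})
rng = random.Random(0)
print(estimate_intersection(M, 4, lambda w: re.fullmatch(".*b", w), 200, rng))
```

The estimate avoids building the intersection automaton: it needs only a count for M and cheap membership tests against Mi, at the cost of sampling error that shrinks as the sample grows.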

Related Work
- A lot of work exists on the traditional RE search problem (how to speed up searching with an RE query), but none on the RE filtering problem.
- Indexes for filtering XPath expressions: XFilter [VLDB’00], YFilter [ICDE’02], XTrie [ICDE’02], matchMaker [EDBT’02].
  - The class of REs supported in XPath is more restrictive.
  - Indexes for filtering XPath are all main-memory structures.

Experimental Evaluation
- Algorithms: RE-tree vs. sequential file approach.
- Data set: Generated synthetic RE data sets. Vary RE similarity, the index parameter (maximum number of states per internal FA), and the size of the data set.
- Queries: Generated 1000 random query strings from the RE data set.
- System: 700 MHz Intel Pentium III with 512 MB memory running FreeBSD 4.1.
NITF: News Industry Text Format

Varying Similarity of REs
[Figure: experimental results for varying similarity of REs.]

Conclusions
- RE-Tree, a novel index structure for REs.
- Novel size measures for REs.
- Update algorithms to optimize bounding FAs.
- Sampling-based techniques to speed up RE-tree operations.
