RE-Tree: An Efficient Index Structure for Regular Expressions


RE-Tree: An Efficient Index Structure for Regular Expressions Chee-Yong Chan, Minos Garofalakis, Rajeev Rastogi Information Sciences Research Center Bell Laboratories, Lucent Technologies

Motivation
Regular expressions (REs) provide a simple yet powerful formalism for pattern/structure specifications.
Example applications: the XPath pattern language for XML documents, and the policy language of the Border Gateway Protocol (BGP).
RE filtering problem: given a set R of REs and an input string s, the RE filter returns the subset of R that matches s.
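As a point of reference, the filtering problem can be sketched as a naive scan that tests every RE against the input string. This is only a minimal illustration built on Python's `re` module, not the RE-tree itself; the function name `filter_res` is hypothetical, and the sample patterns are borrowed from the containment example elsewhere in these slides.

```python
import re

def filter_res(patterns, s):
    """Return the patterns whose language contains the whole string s."""
    # re.fullmatch requires the entire string to match, which corresponds
    # to membership of s in the language of the RE.
    return [p for p in patterns if re.fullmatch(p, s) is not None]

patterns = ["a(a|b)c*", "aa(a|b|c)*c", "ab(bc|cc)*"]
print(filter_res(patterns, "aac"))
```

Every query touches every RE here, which is the cost an index is meant to avoid.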

Our Approach: RE-Tree
Idea: partition the RE data set using a height-balanced hierarchical index structure to maximize pruning of the search space.
Challenge: REs generally define infinite sets, and there is no well-defined metric for clustering REs.

RE-Tree Overview
Dynamic, height-balanced, hierarchical index structure.
REs are stored as finite automata (FAs) in the leaf nodes.
Internal nodes contain directory entries pointing to nodes at the next level; each directory entry = (FA, pointer).
[Figure: RE-tree diagram with FAs labeled M1 through M8, distinguishing internal FAs from leaf FAs.]

RE-Tree: Containment Property
Example: M1 is a bounding FA of { M2, M3, M4 }, so $L(M_1) \supseteq L(M_2) \cup L(M_3) \cup L(M_4)$.
M1 = a(a|b)(a|b|c)*
M2 = a(a|b)c*
M3 = aa(a|b|c)*c
M4 = ab(bc|cc)*
[Figure: nodes N' and N; the directory entry holding M1 points to the node holding M2, M3, and M4.]
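The containment property is what makes pruning safe: if a bounding FA rejects the input string, nothing in the subtree below it can accept it. The sketch below illustrates that search under simplifying assumptions: compiled-on-demand `re` patterns stand in for stored FAs, the `Node` class and `search` function are hypothetical names, and the second leaf is invented purely for the example.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Node:
    is_leaf: bool
    # Leaf entries: (pattern, None). Internal entries: (bounding pattern, child Node).
    entries: list = field(default_factory=list)

def search(node, s):
    """Collect the leaf patterns that match s, skipping pruned subtrees."""
    matches = []
    for pattern, child in node.entries:
        if re.fullmatch(pattern, s) is None:
            continue  # s is outside this entry's language, so nothing below can match
        if node.is_leaf:
            matches.append(pattern)
        else:
            matches.extend(search(child, s))
    return matches

# Leaf with M2, M3, M4 from the slide, bounded by M1 in the root.
leaf_a = Node(True, [("a(a|b)c*", None), ("aa(a|b|c)*c", None), ("ab(bc|cc)*", None)])
leaf_b = Node(True, [("b(a|c)*", None)])  # hypothetical second leaf
root = Node(False, [("a(a|b)(a|b|c)*", leaf_a), ("b(a|b|c)*", leaf_b)])
print(search(root, "aac"))
```

For the input "aac" the second root entry rejects the string, so `leaf_b` is never visited.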

Bounding Finite Automata
There are many possible bounding FAs for a given set of FAs.
The most precise bounding FA accepts the union of $L(M_i)$ for all $M_i$ in the set.
The least precise bounding FA accepts $\Sigma^*$.
Space-precision tradeoff for bounding FAs: a more precise FA improves search pruning, but its size could be large, resulting in lower fan-out of the index node.
The RE-tree controls fan-out by bounding the maximum number of states per internal FA (using an index parameter $\alpha$).
Goal: optimize search performance by maximizing the precision of bounding FAs.

RE-Trees vs. R-Trees
RE-trees are similar in spirit to R-trees.

                      | R-trees                            | RE-trees
Data type             | Multi-dimensional rectangles       | Regular languages
Internal node entries | Minimal bounding rectangles (MBRs) | Bounding FAs
Update operations     | Minimize volume of MBRs            | Minimize size of languages accepted by bounding FAs

RE-Tree Algorithms
RE-tree construction involves three key operations:
(1) Selecting an optimal insertion node
(2) Computing an optimal bounding FA
(3) Computing an optimal node split

RE-Tree Optimization Problems
Let S = {M1, M2, ..., Mn} be the set of FAs in a node N.
Selecting an optimal insertion node: select the node corresponding to the $M_i$ that maximizes $|L(M \cap M_i)|$, where M is the FA to be inserted.
Computing an optimal bounding FA: compute M, a bounding FA of S with at most $\alpha$ states, that minimizes $|L(M)|$.
Computing an optimal node split: partition S into S1 and S2 such that |L(union of FAs in S1)| + |L(union of FAs in S2)| is minimized.
Note: the language sizes in these objectives are possibly infinite!

Main Challenge
Problem: how to measure the size of an RE?
Observation: REs that define infinite languages need not have the same size. Example: (a|b)* is larger than a(a|b)*.
Idea: we need a computable size measure for REs that captures the intuition of the "larger than" relationship.
Let $L(M, i)$ be the set of length-i strings in $L(M)$. Intuitively, $L(M)$ is larger than $L(M')$ iff $\exists N \in \mathbb{Z}^+$ s.t. $\forall k > N$: $\sum_{i=1}^{k} |L(M, i)| > \sum_{i=1}^{k} |L(M', i)|$.

Max-Count Size Measure
Idea: count the size of L(M) up to some maximum length k.
$|L(M)| = |L(M,1)| + |L(M,2)| + \cdots + |L(M,k)|$
Drawback: the measure is sensitive to the value of the maximum-length parameter.
Example: L(M1) = (b|c)* d (a|b)* d (b|c)* d and L(M2) = dd (a|b|c)* d. L(M2) is larger than L(M1), but the max-count measure is correct iff the maximum-length parameter value is > 15.
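The max-count measure can be computed without enumerating strings, by counting paths through a deterministic automaton one length at a time. The following is a minimal sketch under stated assumptions: the DFA encoding (a `(start, accepting, delta)` tuple) and the name `max_count` are hypothetical, and the sum starts at length 1 as in the formula above. The two automata encode the earlier example, (a|b)* and a(a|b)*.

```python
def max_count(dfa, k):
    """Sum of |L(M, i)| for i = 1..k, for a DFA given as (start, accepting, delta)."""
    start, accepting, delta = dfa
    counts = {start: 1}  # state -> number of distinct strings that reach it
    total = 0
    for _ in range(k):
        nxt = {}
        for state, n in counts.items():
            for _symbol, target in delta.get(state, {}).items():
                nxt[target] = nxt.get(target, 0) + n
        counts = nxt
        # In a DFA each string follows exactly one path, so path counts are string counts.
        total += sum(n for state, n in counts.items() if state in accepting)
    return total

# (a|b)*  : a single accepting state with a loop on a and b.
any_ab = (0, {0}, {0: {"a": 0, "b": 0}})
# a(a|b)* : must start with a, then loops on a and b.
a_then_ab = (0, {1}, {0: {"a": 1}, 1: {"a": 1, "b": 1}})

print(max_count(any_ab, 3), max_count(a_then_ab, 3))  # 14 and 7
```

With k = 3 the counts are 2 + 4 + 8 = 14 and 1 + 2 + 4 = 7, which agrees with the intuition that (a|b)* is the larger language.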

MDL-based Size Measure
MDL principle: provides an information-theoretic definition of an optimal model for a given data set.
Observation: if $L(M_1) \supseteq S$, $L(M_2) \supseteq S$, and $\sum_{w \in S} \mathrm{Encode}(w, M_1) < \sum_{w \in S} \mathrm{Encode}(w, M_2)$, then M1 is more precise than M2.
MDL-based measure: L(M2) is larger than L(M1) when $\sum_{w \in S_1} \mathrm{Encode}(w, M_1) / |w| < \sum_{w \in S_2} \mathrm{Encode}(w, M_2) / |w|$.

Definition of Encode(w, M)
How to encode $w \in L(M)$ using M?
Let $p = \langle s_0, s_1, \ldots, s_n \rangle$ be the accepting path of w in M.
$\mathrm{Encode}(w, M) = \sum_{i=0}^{n-1} \log(\text{number of outgoing transitions in } s_i)$
Example: Encode(ddbd, M) = log(1) + log(2) + log(4) = 5
[Figure: automaton M with transitions labeled a,b,c; b; and d.]
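The encoding cost can be sketched for a deterministic automaton by walking the accepting path and adding the log of each visited state's out-degree. Several things here are assumptions rather than part of the slides: the logarithm is taken base 2, each labeled symbol counts as one outgoing transition, and the automaton used (for a(a|b)*) is a hypothetical stand-in, since the slide's own figure is not available.

```python
import math

def encode(word, start, accepting, delta):
    """Bits needed to describe word's accepting path in a DFA (assumes log base 2)."""
    state = start
    bits = 0.0
    for symbol in word:
        out = delta.get(state, {})
        if symbol not in out:
            raise ValueError(f"{word!r} is not accepted")
        bits += math.log2(len(out))  # cost of choosing one of the outgoing transitions
        state = out[symbol]
    if state not in accepting:
        raise ValueError(f"{word!r} is not accepted")
    return bits

# Hypothetical DFA for a(a|b)*: state 0 has one outgoing transition, state 1 has two.
delta = {0: {"a": 1}, 1: {"a": 1, "b": 1}}
print(encode("aba", 0, {1}, delta))  # log2(1) + log2(2) + log2(2)
```

A state with a single outgoing transition contributes nothing, so automata that leave fewer choices along the path yield shorter encodings.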

Algorithm to Optimize Bounding FA
Compute a bounding FA M for a given set of FAs S such that (1) M has at most $\alpha$ states, and (2) $|L(M)|$ is minimized.
The problem is NP-hard.
Heuristic: compute the most precise FA for S, then incrementally relax its precision (by greedily merging pairs of states) until the space constraint is satisfied.

An Example
Compute a bounding FA for S = { abb*, aa*b } with $\alpha = 3$.
[Figure: automata for the example, with transitions labeled a and b.]
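The greedy state-merging heuristic can be sketched on this example as follows. This is an illustration under several assumptions, not the authors' implementation: the NFA encoding `(starts, accepts, transitions)` and all function names are hypothetical, the merge candidate is scored with a max-count style size truncated at a hypothetical length `k = 6`, and the result is not claimed to match the automaton in the slide's figure. Merging two states can only enlarge the accepted language, so the result remains a bounding FA.

```python
from itertools import combinations

def states_of(nfa):
    starts, accepts, trans = nfa
    return set(starts) | set(accepts) | {s for s, _, _ in trans} | {d for _, _, d in trans}

def step(trans, subset, symbol):
    return frozenset(d for s, a, d in trans if s in subset and a == symbol)

def accepts_word(nfa, word):
    starts, accepts, trans = nfa
    current = frozenset(starts)
    for symbol in word:
        current = step(trans, current, symbol)
    return bool(current & accepts)

def count_upto(nfa, k):
    """Number of distinct accepted strings of length 1..k, via on-the-fly subsets."""
    starts, accepts, trans = nfa
    alphabet = sorted({a for _, a, _ in trans})
    level = {frozenset(starts): 1}  # reachable state subset -> number of strings
    total = 0
    for _ in range(k):
        nxt = {}
        for subset, n in level.items():
            for symbol in alphabet:
                target = step(trans, subset, symbol)
                if target:
                    nxt[target] = nxt.get(target, 0) + n
        level = nxt
        total += sum(n for subset, n in level.items() if subset & accepts)
    return total

def merge(nfa, keep, drop):
    """Redirect every use of state `drop` to state `keep`."""
    starts, accepts, trans = nfa
    rename = lambda s: keep if s == drop else s
    return (frozenset(map(rename, starts)), frozenset(map(rename, accepts)),
            frozenset((rename(s), a, rename(d)) for s, a, d in trans))

def bound(nfa, alpha, k=6):
    """Greedily merge the state pair that enlarges the language least."""
    while len(states_of(nfa)) > alpha:
        pairs = combinations(sorted(states_of(nfa)), 2)
        nfa = min((merge(nfa, p, q) for p, q in pairs), key=lambda m: count_upto(m, k))
    return nfa

# Most precise FA for { abb*, aa*b }: the two automata share start state 0.
union = (frozenset({0}), frozenset({2, 4}),
         frozenset({(0, "a", 1), (1, "b", 2), (2, "b", 2),
                    (0, "a", 3), (3, "a", 3), (3, "b", 4)}))
bounding = bound(union, alpha=3)
print(len(states_of(bounding)))
```

Scoring every candidate pair on every round is quadratic in the number of states, which is acceptable for a sketch but is one reason cheaper size estimates matter in practice.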

Other RE-Tree Algorithms
Selecting an optimal insertion node: select the node corresponding to the $M_i$ that maximizes $|L(M \cap M_i)|$, where M is the FA to be inserted.
Computing an optimal node split: partition S into S1 and S2 (each with at least m FAs) such that |L(union of FAs in S1)| + |L(union of FAs in S2)| is minimized. The problem is NP-hard. The heuristic used is similar to the R-tree's Quadratic Split algorithm.

Optimizing RE-Tree Operations
RE-tree algorithms involve many FA operations (i.e., union and intersection). Sampling techniques are used to speed them up.
Example: selecting an optimal insertion node requires computing $|L(M_i \cap M)|$ for each $M_i$ in the current node.
An unbiased estimate of $|L(M_i \cap M, k)|$ is given by
$\frac{\#\{w \in S : w \in L(M_i)\}}{|S|} \cdot |L(M, k)|$,
where S is a uniform random sample of $L(M, k)$.
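The estimator above can be sketched as follows. To keep the example short, `L(M, k)` is enumerated by brute force over a tiny alphabet and then sampled with replacement; a real implementation would presumably sample from the automaton directly. The helper names, the alphabet, and the choice of k = 4 are all assumptions made for illustration, and `re` patterns again stand in for FAs.

```python
import itertools
import random
import re

def language_at_length(pattern, alphabet, k):
    """Brute-force L(M, k): every length-k string over alphabet that the pattern accepts."""
    rx = re.compile(pattern)
    words = ("".join(t) for t in itertools.product(alphabet, repeat=k))
    return [w for w in words if rx.fullmatch(w)]

def estimate_intersection(pattern_i, sample, size_m_k):
    """Estimate |L(Mi ∩ M, k)| from a sample S of L(M, k) and the size |L(M, k)|."""
    rx = re.compile(pattern_i)
    hits = sum(1 for w in sample if rx.fullmatch(w))
    return hits * size_m_k / len(sample)

m = "a(a|b)(a|b|c)*"    # FA to be inserted (M)
m_i = "a(a|b)c*"        # FA already in the node (Mi)
lang = language_at_length(m, "abc", 4)

rng = random.Random(0)                 # fixed seed only for repeatability
sample = rng.choices(lang, k=50)       # uniform sample of L(M, 4), with replacement
print(estimate_intersection(m_i, sample, len(lang)))
```

Using all of `lang` as the sample reproduces the exact intersection size, which is a convenient sanity check on the estimator.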

Related Work
Much work addresses the traditional RE search problem (how to speed up the search for a given RE query), but none addresses the RE filtering problem.
Indexes for filtering XPath expressions: XFilter [VLDB'00], YFilter [ICDE'02], XTrie [ICDE'02], matchMaker [EDBT'02].
The class of REs supported in XPath is more restrictive.
The indexes for filtering XPath expressions are all main-memory structures.

Experimental Evaluation
Algorithms: RE-tree vs. sequential file approach.
Data set: generated synthetic RE data sets, varying RE similarity, $\alpha$, and the size of the data set.
Queries: generated 1000 random query strings from the RE data set.
System: 700 MHz Intel Pentium III with 512 MB memory running FreeBSD 4.1.
NITF: News Industry Text Format

Varying Similarity of REs
[Figure: experimental results for varying similarity of REs (two slides).]

Conclusions
RE-tree, a novel index structure for REs.
Novel size measures for REs.
Update algorithms to optimize bounding FAs.
Sampling-based techniques to speed up RE-tree operations.