BLAS : An Efficient XPATH Processing System Presented by: Moran Birenbaum Published by: Yi Chen Susan B. Davidson Yifeng Zheng.

Slides:



Advertisements
Similar presentations
XML: Extensible Markup Language
Advertisements

Querying Workflow Provenance Susan B. Davidson University of Pennsylvania Joint work with Zhuowei Bao, Xiaocheng Huang and Tova Milo.
Binary Trees, Binary Search Trees CMPS 2133 Spring 2008.
Constant-Time LCA Retrieval
1 CS 561 Presentation: Indexing and Querying XML Data for Regular Path Expressions A Paper by Quanzhong Li and Bongki Moon Presented by Ming Li.
The Trie Data Structure Basic definition: a recursive tree structure that uses the digital decomposition of strings to represent a set of strings for searching.
BLAS: An Efficient XPath Processing System Chen Y., Davidson S., Zheng Y. Νίκος Λούτας.
Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.
Query Evaluation. SQL to ERA SQL queries are translated into extended relational algebra. Query evaluation plans are represented as trees of relational.
Paper by: A. Balmin, T. Eliaz, J. Hornibrook, L. Lim, G. M. Lohman, D. Simmen, M. Wang, C. Zhang Slides and Presentation By: Justin Weaver.
Implementation of Graph Decomposition and Recursive Closures Graph Decomposition and Recursive Closures was published in 2003 by Professor Chen. The project.
Selective Dissemination of Streaming XML By Hyun Jin Moon, Hetal Thakkar.
Xyleme A Dynamic Warehouse for XML Data of the Web.
DYNAMIC ELEMENT RETRIEVAL IN A STRUCTURED ENVIRONMENT MAYURI UMRANIKAR.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
Lec 15 April 9 Topics: l binary Trees l expression trees Binary Search Trees (Chapter 5 of text)
Storing and Querying Ordered XML Using Relational Database System Swapna Dhayagude.
1 Indexing and Querying XML Data for Regular Path Expressions A Paper by Quanzhong Li and Bongki Moon Presented by Amnon Shochot.
Storing and Querying Ordered XML Using a Relational Database System By Khang Nguyen Based on the paper of Igor Tatarinov and Statis Viglas.
B + -Trees (Part 1). Motivation AVL tree with N nodes is an excellent data structure for searching, indexing, etc. –The Big-Oh analysis shows most operations.
16.5 Introduction to Cost- based plan selection Amith KC Student Id: 109.
Recursive Graph Deduction and Reachability Queries Yangjun Chen Dept. Applied Computer Science, University of Winnipeg 515 Portage Ave. Winnipeg, Manitoba,
Inbal Yahav A Framework for Using Materialized XPath Views in XML Query Processing VLDB ‘04 DB Seminar, Spring 2005 By: Andrey Balmin Fatma Ozcan Kevin.
XML –Query Languages, Extracting from Relational Databases ADVANCED DATABASES Khawaja Mohiuddin Assistant Professor Department of Computer Sciences Bahria.
1 Efficiently Mining Frequent Trees in a Forest Mohammed J. Zaki.
1/25 Pointer Logic Changki PSWLAB Pointer Logic Daniel Kroening and Ofer Strichman Decision Procedure.
Chapter 61 Chapter 6 Index Structures for Files. Chapter 62 Indexes Indexes are additional auxiliary access structures with typically provide either faster.
1 Efficient packet classification using TCAMs Authors: Derek Pao, Yiu Keung Li and Peng Zhou Publisher: Computer Networks 2006 Present: Chen-Yu Lin Date:
XML-to-Relational Schema Mapping Algorithm ODTDMap Speaker: Artem Chebotko* Wayne State University Joint work with Mustafa Atay,
Xpath Query Evaluation. Goal Evaluating an Xpath query against a given document – To find all matches We will also consider the use of types Complexity.
“On an Algorithm of Zemlyachenko for Subtree Isomorphism” Yefim Dinitz, Alon Itai, Michael Rodeh (1998) Presented by: Masha Igra, Merav Bukra.
A Summary of XISS and Index Fabric Ho Wai Shing. Contents Definition of Terms XISS (Li and Moon, VLDB2001) Numbering Scheme Indices Stored Join Algorithms.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
Lecture 10 Trees –Definiton of trees –Uses of trees –Operations on a tree.
Querying Structured Text in an XML Database By Xuemei Luo.
Processing of structured documents Spring 2003, Part 7 Helena Ahonen-Myka.
Binary Trees, Binary Search Trees RIZWAN REHMAN CENTRE FOR COMPUTER STUDIES DIBRUGARH UNIVERSITY.
Advanced Databases: Lecture 6 Query Optimization (I) 1 Introduction to query processing + Implementing Relational Algebra Advanced Databases By Dr. Akhtar.
Web Data Management Indexes. In this lecture Indexes –XSet –Region algebras –Indexes for Arbitrary Semistructured Data –Dataguides –T-indexes –Index Fabric.
10/10/2012ISC239 Isabelle Bichindaritz1 Physical Database Design.
CS 415 – A.I. Slide Set 5. Chapter 3 Structures and Strategies for State Space Search – Predicate Calculus: provides a means of describing objects and.
Database Systems Part VII: XML Querying Software School of Hunan University
BLAS: An Efficient XPath Processing System Zhimin Song Advanced Database System Professor: Dr. Mengchi Liu.
Data Structures and Algorithm Analysis Trees Lecturer: Jing Liu Homepage:
Data Structures TREES.
Union-find Algorithm Presented by Michael Cassarino.
Sept. 27, 2002 ISDB’02 Transforming XPath Queries for Bottom-Up Query Processing Yoshiharu Ishikawa Takaaki Nagai Hiroyuki Kitagawa University of Tsukuba.
CE 221 Data Structures and Algorithms Chapter 4: Trees (Binary) Text: Read Weiss, §4.1 – 4.2 1Izmir University of Economics.
Trees : Part 1 Section 4.1 (1) Theory and Terminology (2) Preorder, Postorder and Levelorder Traversals.
Session 1 Module 1: Introduction to Data Integrity
APEX: An Adaptive Path Index for XML data Chin-Wan Chung, Jun-Ki Min, Kyuseok Shim SIGMOD 2002 Presentation: M.S.3 HyunSuk Jung Data Warehousing Lab. In.
XPath --XML Path Language Motivation of XPath Data Model and Data Types Node Types Location Steps Functions XPath 2.0 Additional Functionality and its.
1 Holistic Twig Joins: Optimal XML Pattern Matching Nicolas Bruno, Nick Koudas, Divesh Srivastava ACM SIGMOD 2002 Presented by Jun-Ki Min.
BINARY TREES Objectives Define trees as data structures Define the terms associated with trees Discuss tree traversal algorithms Discuss a binary.
Author: Akiyoshi Matonoy, Toshiyuki Amagasay, Masatoshi Yoshikawaz, Shunsuke Uemuray.
1 Trees : Part 1 Reading: Section 4.1 Theory and Terminology Preorder, Postorder and Levelorder Traversals.
Written By: Presented By: Swarup Acharya,Amr Elkhatib Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy Join Synopses for Approximate Query Answering.
Chapter 11. Chapter Summary  Introduction to trees (11.1)  Application of trees (11.2)  Tree traversal (11.3)  Spanning trees (11.4)
XML Query languages--XPath. Objectives Understand XPath, and be able to use XPath expressions to find fragments of an XML document Understand tree patterns,
Querying Structured Text in an XML Database Shurug Al-Khalifa Cong Yu H. V. Jagadish (University of Michigan) Presented by Vedat Güray AFŞAR & Esra KIRBAŞ.
Data Integrity & Indexes / Session 1/ 1 of 37 Session 1 Module 1: Introduction to Data Integrity Module 2: Introduction to Indexes.
Generic Trees—Trie, Compressed Trie, Suffix Trie (with Analysi
Querying and Transforming XML Data
Database Management System
Methodology – Physical Database Design for Relational Databases
Physical Database Design for Relational Databases Step 3 – Step 8
Binary Trees, Binary Search Trees
OrientX: an Integrated, Schema-Based Native XML Database System
Indexing and Hashing Basic Concepts Ordered Indices
XML indexing – A(k) indices
Presentation transcript:

BLAS : An Efficient XPATH Processing System Presented by: Moran Birenbaum Published by: Yi Chen Susan B. Davidson Yifeng Zheng

Introduction XML is rapidly emerging as de facto standard for exchanging data on the web. Due to its complex, tree like structure, languages for querying XML are based on path navigation and typically include the ability to traverse from a given node to a child node (the “child axis”) or from a given node to a descendant node(the “descendant axis”).

BLAS System BLAS is a Bi-Labeling based system, for efficiently processing complex XPath queries over XML data. BLAS uses P-labeling to process queries over consecutive child axes and D-labeling to process queries involving descendant axis traversal. The XML data is stored in labeled form, and indexed to optimize descendant axis traversals. Complex XPath queries are then translated to SQL expressions. A query engine is used to process the SQL queries.

BLAS System - Continued The primary bottleneck for evaluating complex queries efficiently is the number of joins and disk accesses. Therefore the BLAS system attempts to reduce the number of joins and disk accesses required for complex XPath queries, as well as optimize the join operations. BLAS is composed of three parts : –Index Generator – Stores the labeling and data. – Query Translator – Generates SQL queries from the XPath queries. –Query Engine – Evaluates the SQL queries.

Main XML Example cytochrome c [validated] cytochrome c … Evans, M.J. … 2001 The human somatic cytochrome c gene … …

Main XPath Query Example Q = /proteinDatabase/proteinEntry[protein//superfamily cytochrome c”]/reference/refinfo[//author = Evans, M.J.” and year = “2001”]/title

Definitions Queries without branches can be represented as paths, and are called path queries. The evaluation of a path expression P returns the set of nodes in an XML tree T which are reachable by P starting from the root of T. This set of XML nodes is denoted as [[P]]. A path expression P is contained in a path expression Q, denoted P C Q, if and only if for any XML tree T, [[P]] C [[Q]]. Path expression P and Q are non-overlapping, denoted P ∩ Q = Ø, if and only if for any XML tree T, [[P]] ∩ [[Q]] = Ø.

Definitions - Continued A suffix path expression is a path expression P which optionally begins with a descendant axis step (//), followed by zero or more child axis steps(/). Example: //proteinDataBase/name A simple suffix path expression contains only child axis steps. Example: /proteinDataBase/proteinEntry/protein/name A source path of a node n in an XML tree T, denoted as SP(n), is the unique simple path P from the root to the node. Evaluating a suffix path query Q entails finding all the nodes n such that SP(n) C Q.

The Labeling Scheme BLAS uses a bi-labeling scheme which transforms XML data into relations, and XPath queries into SQL which can be efficiently evaluated over the transformed relations. The labeling scheme consists of two labels, one for speeding up descendant axis steps (D-label), and the other for speeding up consecutive child axis steps (P-Label). Then a generic B+ tree index on the labels is built.

D-labeling Definition: A D-label of an XML node is a triplet:, such that for any two nodes n and m, n ≠ m: –validation: n.d1 < n.d2. –Descendant: m is a descendant of n if n.d1 < m.d1 and n.d2 > m.d2. –Child: m is a child of n if and only if m is a descendant of n and n.d3 +1 = m.d3. n and m have no ancestor - descendant relationship if and only if n.d2 m.d2. * In this way, the ancestor – descendant relationship between any two nodes can be determined solely by checking their D-labels.

D-labeling continued An XML node m is a descendant of another node n if and only if m is nested within n in the XML document. To distinguish a child from descendant, d3 is set to be the level of n in the XML tree. The level of n is defined as the length of the path from the root to n. Example: In the main example the first node tagged classification begins at position 7 and ends at position 11, its level is 4. * Every start tag, end tag and text are treated as a separate unit. So, a D-label will be represented as.

D-labeling continued How it’s done: To use this labeling scheme for processing descendant axis queries such As //t 1 //t 2, we first retrieve all the nodes reachable by t1 and by t 2, resulting in two lists l 1 and l 2. We then test for the ancestor-descendant relationship between nodes in list l 1 and those in list l 2. Interpreting l 1 and l 2 as relations, this test is a join with the “descendant” property as the join predicate, therefore it’s called D-join.

D-labeling Example Query: //ProteinDatabase//refinfo and let pDB and refinfo be relations which store nodes tagged by proteinDatabase and refinfo, respectively. The D-join could be expressed in SQL as follows: select pDB.start pDB.end, refinfo.start, refinfo.end from pDB, refinfo Where pDB.start refinfo.end

P-labeling The P-labeling scheme is used to efficiently process consecutive child axis steps (a suffix path query). Intuition: Each XML node n is annotated with a label according to its source path SP(n), and a suffix path query Q is also annotated with a label, such that the containment relationship between SP(n) and Q can be determined by examining their labels. Hence suffix path queries can be evaluated efficiently.

P-labeling Continued Definition 1: A P-label for a suffix path P is an interval I p =, such that for any two suffix path queries P, Q: Validation: P.p1 < P.p2 Containment: P C Q if and only if interval Ip is contained in I Q, i.e. Q.p1 P.p2. Nonintersection: P ∩ Q = Ø if and only if Ip and I Q do not overlap, i.e. P.p1 > Q.p2 or P.p2 < Q.p1. Therefore, for any two suffix path queries P, Q, either P and Q have a containment relationship or they are non-overlapping.

P-labeling Continued Since the evaluation of a suffix path query Q entails finding all XML nodes n such that SP(n) C Q, the evaluation can be implemented as finding all n such that the P-label of SP(n) is contained in the P-label of Q. It is equivalent to finding all nodes n such that Q.p1 < SP(n).p1 < Q.p2. Definition 2: For an XML node n, such that SP(n) =, the P-label for this XML node, denoted as n.plabel, is the integer p1.

P-labeling Continued [[Q]] = {n| Q.p1 < n.plabel < Q.p2} If Q is a simple path then: [[Q]] = {n| Q.p1 = n.plabel}

P-labeling Construction Suppose that there are n distinct tags (t 1,t 2, …,t n ) We assign / a ratio r 0, and each tag ti a ratio r i, such that ∑r i =1. Let r i = 1/(n+1) for all i. Define the domain of the numbers in a P-label to be integers in [0, m-1]. The m is chosen such that m > (n+1)^h, where h is the length of the longest path in an XML tree.

P-labeling Construction - Continued Path // is assigned an interval (P-label) of. Partition the interval in tag order proportional to t i ’s ratio r i for each path //t i and /’s ratio r 0. Assuming that the order of tags is t 1, t 2,…, t n, this means that we allocate an interval to / and to each t i, such that (p i+ 1 − p i )/m = ri and p 1 /m = r 0. For the interval of a path //t i, we further partition it into subintervals by tags in order according to their ratios. Each path //t j /t i (or /t i ) is now assigned a subinterval, and the proportion of the length of interval of //t j /t i (or /t i ) over the length of interval of //t i is the ratio r j (or r 0 ). Continue to partition over each subinterval as needed.

P-labeling Example The partition procedure for m = 10,001 and tags t1, t2, t3,…t9. The P-label assigned to /t1/t2 is.

P-Labeling Example2 We want to construct P-labels for the sample XML data in the main example. For simplicity, assume m = 10^12 and that there are 99 tags. Each tag is assigned a ratio Suppose the order is /, ProteinDataBase, proteinEntry, protein, name… We want to assign a P-label for suffix path P = /ProteinDataBase/proteinEntry/protein/name We begin by assigning P-label to suffix path //name. Then we extract a subinterval from it and get the P-label for //protein/name.

P-Labeling Example2 Continued P-labels for some suffix path expressions: Example: Suppose we wish to evaluate query //protein/name. we first compute its P-label which is. Then we find all the nodes n such that 4.03*10^10 < n.plabel < 4.04*10^10-1 The suffix path query can be evaluated by the following SQL select * from nodes where nodes.plabel > 4.03*10^10 and nodes.plabel < 4.04*10^10-1

The BLAS Index Generator The BLAS index generator builds P-labels and D-labels for each element node, and stores text values. A tuple is generated for every node n, where plabel is the P-label of n, start and end are the start and end tag positions of n in the XML document, level is the level of n, and data is used to store the value of n if there is any, otherwise data is set to null. The tuples are clustered by {plabel, start}, and B+ tree indexes are built on start, plabel and data to facilitate searches.

The BLAS Query Translator The BLAS query translator translates an input XPath query into standard SQL. It is composed of three modules: –Query Decomposition – Generates a tree representation of the input XPath query, splits the query into a set of suffix path queries and records the ancestor-descendant relationship between the results of these suffix path queries. –SQL Generation – Computes the query’s P-labeling and generates a corresponding subquery in SQL. –SQL Composition – Combines the subqueries into a single SQL query based on D-labeling and the ancestor- descendant relationship between the suffix path queries results.

Query Translator - Continued There are three query translation algorithms that translate a complex XPath query into an efficient SQL query:  Split – which is used only for purposes of exposition.  Unfold – which is used when scheme information is present.  Push-up – which is used when scheme information is absent.

Split Algorithm This algorithm is the simplest query translator. The algorithm splits the query tree into one or more parts, where each part is a suffix path query. Split consists of two steps: –Descendant axis elimination - The basic operation of the algorithm is to take a query as input, do a depth-first traversal and split any descendant-axis of form p//q, into p and //q. –Branch elimination – The basic operation of the algorithm is to take query as input, do a depth-first traversal and split any branch axis of form p[q 1, q 2,…q l ]/r into p, //q 1, //q 2,…//q l, //r.

The Descendant Axis Elimination Algorithm Algorithm 3 D-elimination(query tree Q) 1: List intermediate-result 2: Depth-first search(Q) { 3: if current node reached by a // edge then 4: Q’ = the subtree rooted at the current // edge 5: Cut Q’ from Q; 6: intermediate-result.add(D-elimination(Q’)) 7: end if 8: } 9: result = answer(Q) 10: for all r in intermediate-result do 11: result = D-join(result,r) 12: end for 13: return result

Branch Elimination Algorithm Algorithm 4 B-elimination(query tree Q) 1: List intermediate-result 2: Depth-first search(Q) { 3: if current node has more than one child then 4: for all child of Q: Q’ do 5: cut Q’ from Q 6: Q’ = //Q’ 7: intermediate-result.add(B-elimination(Q’)) 8: end for 9: end if 10: } 11: result = answer(Q) 12: for all r in intermediate-result do 13: result = D-join(result,r) 14: end for 15: return result

Split Algorithm - Example For example we translate the main query example.

Split Algorithm – Example Continued Since Q1 contains branching point, the branch elimination procedure must be invoked to further decompose it.

Split Algorithm – Example Continued

After that, we can see that each resulting subquery is a suffix path query, which can be evaluated directly using P- labeling. Example: Evaluation of Q4 results in a list of nodes pEntry that are reachable by path /ProteinDataBase/proteinEntry, and the evaluation of Q7 results in a list of nodes refinfo that are reachable by path //reference/refinfo. pEntry and refinfo are D- joined as follow select pEntry.start, pEntry.end, refinfo.start, refinfo.end from pEntry, refinfo where pEntry.start refinfo.start and pEntry.level = refinfo.level - 2

Push-up Algorithm The branch elimination in the Split algorithm eliminates a branch of the form p[q 1, q 2,…..q l ]/r into p, //q 1, //q 2 /…q l,//r. and ignores the fact that the root of qi and r is a child of the leaf of p. Therefore, rather than evaluate //q i and//r, we should evaluate p/q i and p/r. Since p/qi and p/r are more specific than //q i and //r. This evaluation reduces the number of disk accesses and the size of the intermediate results.

Push-up Algorithm - Example

Unfold Algorithm A further optimization of the descendant-axis elimination is possible when scheme information is available. For non-recursive schemas, path expressions with wildcards can be evaluated over the schema graph and wildcards can be substituted with actual tags. Thus a query of form p//q can be enumerated by all possibilities p/r 1 /q, p/r 2 /q, …p/r l /q and the result of the query is the union of the results of p/r 1 /q, p/r 2 /q, …,p/r l /q. For a recursive scheme, given statistics about the depth of the XML tree, queries can be unfolded to this depth and the occurrences of the // can be eliminated.

Unfold Algorithm - Continued One advantage with Unfold is that we replace D-joins with a process that first performs selections on P-labels and then unions the results.This is very efficient because selections using an index are cheap, and the union is very simple since there are no duplicates. Another advantage is that the subqueries are all simple path queries, which can be implemented as a select operation with equality predicates instead of range predicates. The number of disk accesses is also reduced.

Query Translator -Final Example We take query Q and first apply push-up branch elimination and get the following set of subqueries: Q 4, Q’ 5, Q’ 7, Q’ 8, Q’ 9 from the previous example and the following subqueries: Q ’’ 2 = /ProteinDataBase/ProteinEntry/protein//superfamily=“cytochrome c” Q ’’ 3 = /ProteinDataBase/ProteinEntry/reference/refinfo//author=“Evans, M.J.”. After applying the Unfold descendant-axis elimination to subqueries Q’’ 2 and Q’’ 3 we get: Q ’’’ 2 = /ProteinDataBase/ProteinEntry/protein/classification/ superfamily=“cytochrome c”. Q ’’’ 3 = /ProteinDataBase/ProteinEntry/reference/refinfo/ authors/author = “Evans, M.J.”.

The Architecture of BLAS System

Efficiency Of The Algorithm * The algorithm is more efficient than an approach which uses only D-labeling because: 1. The number of joins is reduced – With D-labeling, a query which contains l tags requires (l - 1) D-joins. However, if b is the number of outgoing edges, which are not annotated with //, of a branching point and d is the number of descendant axis steps, then the number of D-joins is the Split and Push-up is bounded by (b + d), which is always less than (l – 1). In the presence of scheme information, the Unfold algorithm can be applied and reduce the number of D-joins to b. 2. The number of disk accesses are reduced. Since T-labels are clustered by {plabel, start}, the number of disk accesses of T-labeling is less than that of D-labeling.

Summary The BLAS system is used for processing complex XPath queries over XML data. The system is based on a labeling scheme which combines P-labeling (for processing suffix path queries efficiently) with D-labeling (for processing queries involving the descendant axis). The query translator algorithms decompose an XPath query into a set of suffix path sub-queries. P-labels of these sub- queries are then calculated, and the sub-queries are translated into an SQL expressions. The final SQL query plan is obtained by taking their D-joins. Experimental results demonstrate that the BLAS system has a substantial performance improvement compared to traditional XPath processing using D-labeling.

References 1. “BLAS: An Efficient XPATH Processing System” – Yi Chen, Susan B. Davidson, Yifeng Zheng.