BLAS: An Efficient XPath Processing System Chen Y., Davidson S., Zheng Y. Νίκος Λούτας.


Outline: Problem being addressed in the paper; Related work; BLAS; Experimental results; Evaluation.

Problem: The number of disk accesses and joins is the primary bottleneck for evaluating complex queries efficiently.

Motivation: Can we improve XPath processing that uses relational technology?
D-labeling: processes a descendant-axis traversal using a single join rather than a transitive closure of joins. Observation: D-labeling processes / and // in the same way, using joins.
XPRESS (queriable compressed XML files): uses reverse arithmetic encoding, which represents each label path as a distinct interval in [0.0, 1.0) and handles path expressions through containment relationships between intervals.

Goals: Process / (simple path expressions) more efficiently. Reduce the number of disk accesses and joins. Optimize the join operations.


Related work: XML storage and query processing. Option 1: store the XML data naively as a file. The whole file needs to be traversed whenever a query is processed, which is not efficient for large XML data sets. Option 2: store the XML in a commercial RDBMS, which brings indexing and query processing capabilities.

Related work (cont’d): XML storage and query processing. An XML document can be treated as a graph, with one tuple generated for every edge. This gives a simple, general and automatically generated mapping from XML queries to SQL, but an XML query may involve many self-joins. Self-joins can be eliminated by inlining the distinct child information into the parent tuple, at the cost of a more complex mapping from XML queries to SQL. Problem: in all of the above approaches, we typically need auxiliary code in a general-purpose programming language together with SQL to express an XML query.

Related work (cont’d): Indexing. Structural indexes create a structural summary, extracted from the XML document as a directed graph; queries are evaluated by pruning the search space. Path and tree queries. Indexing for branching path queries restricts the class of queries indexed to achieve performance benefits. Materialized views.

Related work (cont’d): Labeling. D-labeling: build D-labels of minimum label size; build a B+ tree over the D-labels to support tree queries; effective for translating XQuery to SQL. XPRESS: an XML data compression technique that uses reverse arithmetic encoding to encode each label path as a distinct interval within [0.0, 1.0). It also supports query evaluation over the compressed document using the containment relationships among the intervals.


Bi-LAbeling based System (BLAS): Based on D-labeling and P-labeling. Processes XPath queries that can be represented as trees. Index generator: stores the D-labeling, the P-labeling and the data values of an XML document. Query engine: an RDBMS or a twig join.

BLAS (cont’d): Query translator. It decomposes an XPath query into a set of suffix path queries, encodes each suffix path query using P-labeling, generates a corresponding SQL query for each suffix path query, and composes the SQL subqueries into a complete SQL query plan using D-labeling.

Architecture of BLAS [figure; only the labels survive]. Index generation: XML, SAX parser, events, P-labeling generator, D-labeling generator, data loader, storage (P-labelings, D-labelings, data values). Query translator: XPath query, query decomposition, suffix path queries, subquery generator (based on P-labeling), subquery composition (based on D-labeling, using the ancestor-descendant relationship between the results of the suffix path queries). Query engine: query, query result.

BLAS: D-labeling. A D-label of an XML node is a triplet <d1, d2, d3> such that for any two nodes n and m, n ≠ m:
(validation) n.d1 ≤ n.d2
(descendant) m is a descendant of n if and only if n.d1 < m.d1 and n.d2 > m.d2
(child) m is a child of n if and only if m is a descendant of n and n.d3 + 1 = m.d3
(nonoverlap) n and m have no ancestor-descendant relationship if and only if n.d2 < m.d1 or n.d1 > m.d2

BLAS: D-labeling (cont’d): For a node n, d1 is the position of the start tag of n in the XML document, d2 is the position of the end tag of n in the XML document, and d3 is the level of n in the XML tree.
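The D-label conditions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: positions are taken from a simple counter that increases at every start and end tag (the slides do not say how positions are measured), the root is placed at level 1, and the names `DLabel`, `assign_dlabels`, `is_descendant`, `is_child` and `no_ancestor_descendant` are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, NamedTuple


class DLabel(NamedTuple):
    start: int  # d1: position of the start tag
    end: int    # d2: position of the end tag
    level: int  # d3: depth in the tree (root = 1, an assumption)


def assign_dlabels(root: ET.Element) -> Dict[ET.Element, DLabel]:
    """Label every element using one counter bumped at each start and end tag."""
    labels: Dict[ET.Element, DLabel] = {}
    counter = 0

    def visit(node: ET.Element, level: int) -> None:
        nonlocal counter
        counter += 1              # start tag seen
        start = counter
        for child in node:
            visit(child, level + 1)
        counter += 1              # end tag seen
        labels[node] = DLabel(start, counter, level)

    visit(root, 1)
    return labels


def is_descendant(m: DLabel, n: DLabel) -> bool:
    """m is a descendant of n: n's interval strictly encloses m's."""
    return n.start < m.start and n.end > m.end


def is_child(m: DLabel, n: DLabel) -> bool:
    """m is a child of n: a descendant exactly one level below."""
    return is_descendant(m, n) and n.level + 1 == m.level


def no_ancestor_descendant(m: DLabel, n: DLabel) -> bool:
    """The two intervals do not overlap at all."""
    return n.end < m.start or n.start > m.end


doc = ET.fromstring(
    "<books><book><title/><section><figure/></section></book></books>"
)
labels = assign_dlabels(doc)
book = labels[doc.find("book")]
title = labels[doc.find("book/title")]
figure = labels[doc.find("book/section/figure")]
```

In this toy document `figure` is a descendant but not a child of `book`, while `title` and `figure` have no ancestor-descendant relationship; each check is a constant-time comparison of integers.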

BLAS: D-labeling (cont’d): Descendant axis query //t1//t2. Retrieve all the nodes reachable by t1 and by t2, giving two lists l1 and l2, then test for ancestor-descendant relationships between the nodes in l1 and those in l2 (D-join). Example: //proteinDatabase//refinfo, where pDB and refinfo are the relations that store the nodes tagged proteinDatabase and refinfo:
Select pDB.start, pDB.end, refinfo.start, refinfo.end
From pDB, refinfo
Where pDB.start < refinfo.start and pDB.end > refinfo.end
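To make the D-join concrete, the following sketch runs the same kind of query against an in-memory SQLite database from Python. The table layout (start, end, level), the sample rows, and the helper names `build_demo_db` and `d_join` are assumptions made for illustration; the slides do not specify the relational schema BLAS uses.

```python
import sqlite3
from typing import List, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create relations pDB and refinfo holding made-up D-labels."""
    conn = sqlite3.connect(":memory:")
    # "end" is an SQL keyword, so it is quoted as an identifier.
    conn.executescript(
        """
        CREATE TABLE pDB     (start INTEGER, "end" INTEGER, level INTEGER);
        CREATE TABLE refinfo (start INTEGER, "end" INTEGER, level INTEGER);
        """
    )
    conn.execute("INSERT INTO pDB VALUES (1, 100, 1)")
    conn.executemany(
        "INSERT INTO refinfo VALUES (?, ?, ?)",
        # The last row lies outside the pDB interval and must not match.
        [(5, 8, 4), (20, 23, 4), (150, 153, 2)],
    )
    return conn


def d_join(conn: sqlite3.Connection) -> List[Tuple[int, int, int, int]]:
    """Pair every pDB node with the refinfo nodes nested inside it."""
    query = """
        SELECT pDB.start, pDB."end", refinfo.start, refinfo."end"
        FROM pDB, refinfo
        WHERE pDB.start < refinfo.start AND pDB."end" > refinfo."end"
        ORDER BY refinfo.start
    """
    return conn.execute(query).fetchall()


rows = d_join(build_demo_db())
```

One join with two inequality predicates is enough regardless of how many levels separate the two nodes, which is the point the slide makes about avoiding a transitive closure of joins.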

D-labeling scheme: The labeling (start, end, level) can be used to detect ancestor-descendant relationships between nodes in a tree. [Figure: example tree with the elements books, book, title, section, figure and description, the text values “The lord of the rings …”, “Locating middle-earth”, “A hall fit for a king” and “King Theoden's golden hall”, and the sample labels (1, 20000, 1), (6, 1200, 2), (10, 80, 3), (81, 250, 3), ..., (100, 200, 4).]

BLAS: P-labeling. Goal: efficiently process consecutive child axis steps (suffix path queries). A P-label for a suffix path P is an interval I_P = <P.p1, P.p2> such that for any two suffix path expressions P, Q:
(validation) P.p1 ≤ P.p2
(containment) P ⊆ Q if and only if interval I_P is contained in I_Q, i.e. Q.p1 ≤ P.p1 and P.p2 ≤ Q.p2
(nonintersection) P ∩ Q = ∅ if and only if I_P and I_Q do not overlap, i.e. P.p1 > Q.p2 or P.p2 < Q.p1

BLAS: P-labeling (cont’d): For an XML node n whose source path SP(n) has the P-label <p1, p2>, the P-label of the node, denoted n.plabel, is the integer p1. A suffix path query Q is evaluated by finding all nodes n such that Q.p1 ≤ SP(n).p1 ≤ Q.p2, i.e. the set of XML nodes whose P-labels are contained in the P-label of Q: [[Q]] = {n | Q.p1 ≤ n.plabel ≤ Q.p2}
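Because a node's P-label is a single integer and a query's P-label is an interval, evaluating [[Q]] reduces to a range lookup. The sketch below shows this with a sorted list and binary search standing in for an index; the label values, node identifiers and function names are made up for the example.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

Interval = Tuple[int, int]  # (p1, p2)


def contains(q: Interval, p: Interval) -> bool:
    """Interval p lies inside interval q, i.e. path P is contained in path Q."""
    return q[0] <= p[0] and p[1] <= q[1]


def disjoint(p: Interval, q: Interval) -> bool:
    """The two intervals do not overlap, i.e. the paths share no nodes."""
    return p[0] > q[1] or p[1] < q[0]


def evaluate_suffix_path(
    plabels: List[int], node_ids: List[str], q: Interval
) -> List[str]:
    """Return the nodes whose P-label falls in q; plabels must be sorted."""
    lo = bisect_left(plabels, q[0])
    hi = bisect_right(plabels, q[1])
    return node_ids[lo:hi]


# Hypothetical node P-labels, sorted, with parallel node identifiers.
plabels = [250, 275, 275, 280, 375]
node_ids = ["n1", "n2", "n3", "n4", "n5"]

narrow = evaluate_suffix_path(plabels, node_ids, (275, 279))
wide = evaluate_suffix_path(plabels, node_ids, (250, 374))
```

No join is involved: a more specific query simply selects a narrower range of the same sorted label column.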

BLAS: Intuition for P-labels. Assign each node a number and each suffix path an interval such that: for any two suffix paths Q1 and Q2, Q1 is contained in Q2 iff Q1's interval is contained in Q2's; and a node is contained in a suffix path iff its number is contained in the path's interval. This replaces a sequence of joins by a selection.

BLAS: P-labeling Construction (for paths and for XML trees).
Assign / a ratio r0 and each tag ti a ratio ri; with n distinct tags, ri = 1 / (n + 1).
Define the domain [0, m-1] with m ≥ (n + 1)^h, h being the length of the longest path.
Construct P-labels for suffix paths:
1. Assign // the interval <0, m-1>.
2. Partition that interval in tag order, proportionally to each ti's ratio ri: allocate one subinterval to the suffix paths starting with /, and one to the suffix paths starting with //ti for each tag ti.
3. Partition each subinterval of a path //ti again by tags, according to their ratios.
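A minimal sketch of this partitioning for the uniform-ratio case, assuming / and every tag receive the same ratio 1/(n+1) so that all boundaries can be computed with exact integer arithmetic. The function name, the tag order, and the choice m = (n+1)^(max_depth+1) (one extra level, so that even the longest rooted path keeps a non-empty interval) are assumptions of this sketch and are not taken from the slides.

```python
from typing import List, Tuple


def suffix_path_plabel(
    path: str, tags: List[str], max_depth: int
) -> Tuple[int, int]:
    """Return (p1, p2) for a suffix path such as '//a/b' or '/a/b'."""
    base = len(tags) + 1                 # slot 0 is '/', slots 1..n are tags
    slot = {tag: i + 1 for i, tag in enumerate(tags)}
    width = base ** (max_depth + 1)      # m: size of the whole domain
    rooted = not path.startswith("//")   # '/a/b' is rooted, '//a/b' is not
    steps = [s for s in path.split("/") if s]

    low = 0
    # Refine from the last tag backwards: each step keeps the slot of that
    # tag inside the current interval, shrinking it by a factor of `base`.
    for tag in reversed(steps):
        width //= base
        low += slot[tag] * width
    if rooted:
        width //= base                   # keep slot 0, reserved for '/'
    return low, low + width - 1


TAGS = ["books", "book", "title", "section"]  # tag order is arbitrary here
all_paths = suffix_path_plabel("//", TAGS, 3)
any_book = suffix_path_plabel("//book", TAGS, 3)
books_book = suffix_path_plabel("//books/book", TAGS, 3)
rooted_books_book = suffix_path_plabel("/books/book", TAGS, 3)
any_title = suffix_path_plabel("//title", TAGS, 3)
```

With four tags and `max_depth` 3 the domain is [0, 624]; the intervals for //book, //books/book and /books/book are nested in that order, and //title falls in a separate slot. Under the node definition given earlier, a node whose source path is /books/book would carry the first number of `rooted_books_book` as its P-label.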

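BLAS: Constructing P-label for paths [figure; interval boundaries are only partly legible]. The figure shows the P-label intervals of the suffix paths /, //books, //book, //title, //section, /book, //books/book, //book/book and /books/book, with boundaries given as multiples of 10^4 (for example 3*10^4, 4*10^4 and 5*10^4).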

BLAS: P-labeling Construction (cont’d): Example: m = (value missing) and 99 tags. Each tag is assigned the ratio r = 1 / (99 + 1) = 0.01. Construct a P-label for the suffix path P = /ProteinDatabase/ProteinEntry/protein/name.

Sample XML Protein Repository

BLAS: Constructing P-label for XML nodes (cont’d): The P-label of an XML node is m, where [m, n] is the P-label of the path from the root to that node. Evaluating a suffix path query Q means finding all nodes whose P-label is contained in the P-label of Q, e.g. /books/book/section: [42100, 42110]. [Figure: the books example tree with the elements books, book, title, section, figure and description and the text values “The lord of the rings …”, “Locating middle-earth”, “A hall fit for a king” and “King Theoden's golden hall”.]

BLAS: Query Language. XPath queries containing /, //, * and predicates (branches), i.e. tree queries.
The evaluation of a path expression P returns the set of nodes [[P]] in an XML tree T that are reachable by P starting from the root of T.
The source path SP(n) of a node n in an XML tree T is the unique simple path P from the root to n.
A path expression P is contained in a path expression Q, written P ⊆ Q, if and only if [[P]] ⊆ [[Q]] for any XML tree T.
Path expressions P and Q are non-overlapping, written P ∩ Q = ∅, if and only if [[P]] ∩ [[Q]] = ∅ for any XML tree T.

BLAS: Query Translator, Split. Steps: descendant axis elimination, branch elimination, DFS traversal. Descendant axis elimination (D-elimination) rewrites p//q into p and //q, whose results are then combined by a D-join.

BLAS: Query Translator: (I) Decomposition. Q: //book[//title]/section/figure [Figure sequence over four slides: the query tree with the nodes book, title, section and figure is split step by step into subquery trees.]

BLAS: Query Translator: (II) Selection on P-labels. Q: //book[//title]/section/figure [Figure: the decomposed subquery trees over book, title, section and figure, each evaluated by a selection on P-labels.]

BLAS: Query Translator: (III) Join on D-labels. Q: //book[//title]/section/figure [Figure: the subquery results over book, title, section and figure, combined by joins on D-labels.]

BLAS: Query Translator - Push-up. Used when schema information is absent. Steps: descendant axis elimination, then push-up branch elimination: p[q1…qn]/r → p, p/q1, …, p/qn, p/r.
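The two rewriting rules can be illustrated on plain strings. This sketch assumes the query has already been parsed into a prefix p, a list of child-axis branches q1…qn and a remainder r; it does not parse XPath, does not handle nested predicates, and the function names are hypothetical rather than taken from BLAS.

```python
from typing import List


def eliminate_descendant_axis(path: str) -> List[str]:
    """Rewrite p//q into [p, //q], cutting before every inner '//'."""
    pieces = path.split("//")
    head = [pieces[0]] if pieces[0] else []   # empty when path starts with //
    return head + ["//" + piece for piece in pieces[1:]]


def push_up(p: str, branches: List[str], r: str) -> List[str]:
    """Rewrite p[q1]...[qn]/r into p, p/q1, ..., p/qn, p/r."""
    return [p] + [f"{p}/{q}" for q in branches] + [f"{p}/{r}"]


split_paths = eliminate_descendant_axis("/a/b//c/d")
pushed = push_up("/a/b", ["c", "d"], "e")
```

Every subquery produced by `push_up` carries the full prefix p, so each one is a longer and therefore more selective suffix path than the bare branch would be; the results of the subqueries still have to be related to one another, which the deck attributes to composition based on D-labeling.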

BLAS: Query Translator - Unfold. Used when schema information is present; handles both non-recursive and recursive schemas. Unfold replaces D-joins with a process that first performs selections on P-labels and then unions the results. This is very efficient: selections using an index are cheap; the union is very simple since there are no duplicates; the subqueries are all simple path queries, which can be implemented as a select operation with equality predicates; and the number of disk accesses is reduced.
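One way to picture unfolding is to enumerate, from the schema, every root-to-node simple path that matches a suffix path query; each such path can then be evaluated as an equality selection on its P-label and the results unioned. The sketch below covers only a non-recursive schema, represented as a hypothetical parent-to-children dictionary; the function name and the schema are assumptions, and the recursive case mentioned on the slide is not handled.

```python
from typing import Dict, List


def unfold(schema: Dict[str, List[str]], root: str, suffix_query: str) -> List[str]:
    """List all rooted simple paths whose trailing tags match '//t1/.../tk'."""
    steps = [s for s in suffix_query.split("/") if s]
    matches: List[str] = []

    def walk(tag: str, prefix: List[str]) -> None:
        path = prefix + [tag]
        # A path matches when its last len(steps) tags equal the query steps.
        if path[-len(steps):] == steps:
            matches.append("/" + "/".join(path))
        for child in schema.get(tag, []):
            walk(child, path)

    walk(root, [])
    return matches


# Hypothetical non-recursive schema loosely based on the books example.
SCHEMA = {
    "books": ["book"],
    "book": ["title", "section"],
    "section": ["title", "figure"],
}

title_paths = unfold(SCHEMA, "books", "//title")
section_title_paths = unfold(SCHEMA, "books", "//section/title")
```

Each returned path is a distinct simple path, so the node sets selected for them cannot overlap, which is consistent with the slide's remark that the union needs no duplicate elimination.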

BLAS: Query Translator – Unfold (cont’d)

BLAS: Comparison with D-labeling. BLAS: fewer joins, fewer disk accesses. [Figure: the query plans for BLAS and for D-labeling side by side, over the nodes book, title, section and figure.]


Experiment Setup: Data sets. Query sets: suffix path queries, path queries, XPath queries, benchmark queries. Query engine: TwigStack join.

Query Execution Time [chart]. Query names: A = Auction, P = Protein, S = Shakespeare; 1 = suffix path query, 2 = path query, 3 = XPath query.

Number of data elements visited [chart]. Query names: A = Auction, P = Protein, S = Shakespeare; 1 = suffix path query, 2 = path query, 3 = XPath query.

Benchmark Query Execution Time

Scalability BLAS


Contributions: The P-labeling scheme is proposed to evaluate suffix path queries efficiently. BLAS combines P-labeling and D-labeling to evaluate XPath queries. BLAS is more efficient than state-of-the-art work because the queries translated from XPath queries require fewer disk accesses and fewer joins. Experiments show the effectiveness of BLAS.

Evaluation: A successful effort. There is a trade-off between the additional cost and the execution time. Open question: BLAS vs. RDBMS?