BLAS: An Efficient XPath Processing System Chen Y., Davidson S., Zheng Y. Νίκος Λούτας
Outline Problem being addressed in the paper; Related work; BLAS; Experimental Results; Evaluation
Problem The number of disk accesses and joins is the primary bottleneck in evaluating complex queries efficiently!
Motivation Can we improve XPath processing that uses relational technology? D-labeling: processes descendant axis traversal using a single join rather than a transitive closure of joins. Observation: D-labeling processes / and // in the same way, using joins. XPRESS (queriable compressed XML files): reverse arithmetic encoding represents a label path as a distinct interval in [0.0, 1.0), and path expressions are handled through containment relationships.
Goals Process / (simple path expressions) more efficiently; reduce the number of disk accesses and joins; optimize the join operations.
Related work XML storage and query processing. Store XML data naively as a file: the whole file needs to be traversed whenever a query is processed, which is not efficient for large XML data sets. Store XML using a commercial RDBMS: provides indexing and query processing capabilities.
Related work (cont’d) XML storage and query processing. Treat an XML document as a graph and generate a tuple for every edge: the XML-query-to-SQL mapping is simple, general and automatically generated, but an XML query may involve many self-joins. Self-joins can be eliminated by inlining the distinct child information into the parent tuple, at the cost of a complex XML-query-to-SQL mapping. Problem: in all of the above approaches, we typically need to rely on auxiliary code in a general-purpose programming language together with SQL to express an XML query.
Related work (cont’d) Indexing. Structural indexes: create a structural summary, extracted from the XML document as a directed graph; queries are evaluated by pruning the search space; support path / tree queries. Indexing for branching path queries: restrict the class of queries indexed to achieve performance benefits. Materialized views.
Related work (cont’d) Labeling. D-labeling: build minimum-label-size D-labels; build a B+ tree over D-labels to support tree queries; effective for translating XQuery to SQL. XPRESS: an XML data compression technique that uses reverse arithmetic encoding to encode each label path as a distinct interval within [0.0, 1.0). It also supports query evaluation over the compressed document using the containment relationship among the intervals.
Bi-LAbeling based System (BLAS) Based on D-labeling and P-labeling. Processes XPath queries that can be represented as trees. Index generator: stores the D-labeling, P-labeling and data values of an XML document. Query engine: an RDBMS or a twig join.
BLAS (cont’d) Query translator: decomposes an XPath query into a set of suffix path queries; encodes each suffix path query using P-labeling; generates a corresponding SQL query for each suffix path query; composes the SQL subqueries into a complete SQL query plan using D-labeling.
Architecture of BLAS [Figure: BLAS architecture. Index side: XML input, SAX parser (events), P-labeling generator, D-labeling generator, data loader, and storage holding P-labelings, D-labelings and data values. Query translator: query decomposition of the XPath query, subquery generator (based on P-labeling) producing suffix path queries, and subquery composition (based on D-labeling) using the ancestor-descendant relationship between the results of the suffix path queries. Query engine: takes the translated query and returns the query result.]
BLAS: D-labeling A D-label of an XML node is a triplet <d1, d2, d3>, such that for any two nodes n and m, n ≠ m: n.d1 ≤ n.d2 (validation); m is a descendant of n if and only if n.d1 < m.d1 and n.d2 > m.d2 (descendant); m is a child of n if and only if m is a descendant of n and n.d3 + 1 = m.d3 (child); n and m have no ancestor-descendant relationship if and only if n.d2 < m.d1 or n.d1 > m.d2 (nonoverlap).
BLAS: D-labeling (cont’d) For a node n: d1 is the position of the start tag of n in the XML document; d2 is the position of the end tag of n in the XML document; d3 is the level of n in the XML tree.
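The D-label conditions can be written down directly as predicates. The following is a minimal sketch, not the system's implementation: the `DLabel` class and the function names are hypothetical, and the sample labels are the ones that appear on the D-labeling scheme slide.

```python
from typing import NamedTuple


class DLabel(NamedTuple):
    start: int  # d1: position of the start tag
    end: int    # d2: position of the end tag
    level: int  # d3: level of the node in the XML tree


def is_descendant(n: DLabel, m: DLabel) -> bool:
    """True if m is a descendant of n (n's interval strictly encloses m's)."""
    return n.start < m.start and n.end > m.end


def is_child(n: DLabel, m: DLabel) -> bool:
    """True if m is a descendant of n exactly one level below it."""
    return is_descendant(n, m) and n.level + 1 == m.level


def no_overlap(n: DLabel, m: DLabel) -> bool:
    """True if n and m have no ancestor-descendant relationship."""
    return n.end < m.start or n.start > m.end


# Labels taken from the example slide; which tag each belongs to is not assumed here.
a = DLabel(1, 20000, 1)
b = DLabel(6, 1200, 2)
c = DLabel(10, 80, 3)
d = DLabel(81, 250, 3)
```

With these labels, `is_child(b, c)` holds, while `c` and `d` are siblings whose intervals do not overlap.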
BLAS: D-labeling (cont’d) Descendant axis query //t1//t2: retrieve all the nodes reachable by t1 and by t2 as two lists, l1 and l2, then test for ancestor-descendant relationships between nodes in l1 and nodes in l2 (D-join). Example: //proteinDatabase//refinfo, where the relations pDB and refinfo store the nodes tagged proteinDatabase and refinfo: SELECT pDB.start, pDB.end, refinfo.start, refinfo.end FROM pDB, refinfo WHERE pDB.start < refinfo.start AND pDB.end > refinfo.end
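The D-join can be sketched in a few lines. This is a naive nested-loop version for illustration only; a real engine would use a structural or merge join instead. The function name `d_join` and the sample `(start, end)` values are assumptions, not data from the paper.

```python
def d_join(ancestors, descendants):
    """Return (a, d) pairs where d lies strictly inside a.

    Each element is a (start, end) pair, mirroring the start/end
    columns of the pDB and refinfo relations in the SQL query.
    """
    return [
        (a, d)
        for a in ancestors
        for d in descendants
        if a[0] < d[0] and a[1] > d[1]
    ]


# Hypothetical labels: one proteinDatabase node and two refinfo nodes,
# only the first of which is nested inside it.
pdb_nodes = [(1, 100)]
refinfo_nodes = [(5, 10), (120, 130)]
pairs = d_join(pdb_nodes, refinfo_nodes)
```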
D-labeling scheme The labeling (start, end, level) can be used to detect ancestor-descendant relationships between nodes in a tree. [Figure: example XML tree rooted at books, with book, title, section, figure and description nodes and text values such as “The lord of the rings …”, “Locating middle-earth”, “A hall fit for a king” and “King Theoden's golden hall”; nodes are annotated with D-labels such as (1, 20000, 1), (6, 1200, 2), (10, 80, 3), (81, 250, 3), ... (100, 200, 4).]
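One way to produce such labels is a depth-first traversal that advances a counter at every start and end tag. The sketch below makes simplifying assumptions: the tree is given as nested `(tag, children)` tuples, and positions count tags only, so the numbers are smaller than the document positions shown on the slide. All names are hypothetical.

```python
def assign_dlabels(tree):
    """Return (tag, start, end, level) tuples for every node, in post-order."""
    labels = []
    counter = 0

    def visit(node, level):
        nonlocal counter
        tag, children = node
        counter += 1            # position of the start tag
        start = counter
        for child in children:
            visit(child, level + 1)
        counter += 1            # position of the end tag
        labels.append((tag, start, counter, level))

    visit(tree, 1)
    return labels


# A small tree loosely modelled on the books example.
tree = ("books", [
    ("book", [
        ("title", []),
        ("section", [("title", []), ("figure", [])]),
    ]),
])
labels = assign_dlabels(tree)
```

Every child interval produced this way is strictly nested inside its parent's interval, which is the property the descendant test relies on.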
BLAS: P-labeling Efficiently processes consecutive child axis steps (suffix path queries). A P-label for a suffix path P is an interval I_P = <p1, p2>, such that for any two suffix path expressions P, Q: P.p1 ≤ P.p2 (validation); P ⊆ Q if and only if interval I_P is contained in I_Q, i.e. Q.p1 ≤ P.p1 and P.p2 ≤ Q.p2 (containment); P ∩ Q = ∅ if and only if I_P and I_Q do not overlap, i.e. P.p1 > Q.p2 or P.p2 < Q.p1 (nonintersection).
BLAS: P-labeling (cont’d) For an XML node n such that SP(n) = <p1, p2>, the P-label of this XML node, denoted n.plabel, is the integer p1. To evaluate a suffix path query Q, find all nodes n such that Q.p1 ≤ SP(n).p1 ≤ Q.p2, i.e. the set of XML nodes whose P-labels are contained in the P-label of Q: [[Q]] = {n | Q.p1 ≤ n.plabel ≤ Q.p2}
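The interval conditions and the range selection [[Q]] can be sketched as follows. This is an illustration under assumptions: intervals are closed `(p1, p2)` pairs, the sorted list stands in for an index over node P-labels, and the function names and sample numbers are hypothetical.

```python
from bisect import bisect_left, bisect_right


def contained_in(p, q):
    """P is contained in Q: Q.p1 <= P.p1 and P.p2 <= Q.p2."""
    return q[0] <= p[0] and p[1] <= q[1]


def disjoint(p, q):
    """P and Q do not overlap: P.p1 > Q.p2 or P.p2 < Q.p1."""
    return p[0] > q[1] or p[1] < q[0]


def select_by_plabel(sorted_plabels, q):
    """Return the node P-labels n with Q.p1 <= n <= Q.p2.

    sorted_plabels must be sorted ascending, as an index scan would deliver them.
    """
    lo = bisect_left(sorted_plabels, q[0])
    hi = bisect_right(sorted_plabels, q[1])
    return sorted_plabels[lo:hi]


# Hypothetical node P-labels and a hypothetical query interval.
node_plabels = [21000, 42100, 42100, 43105]
query = (42000, 42999)
hits = select_by_plabel(node_plabels, query)
```

The point of the design is visible here: a chain of child steps costs one range lookup rather than one join per step.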
BLAS: Intuition for P-labels Assign each node a number, and each suffix path an interval, such that: for any two suffix paths Q1 and Q2, Q1 is contained in Q2 iff Q1’s interval is contained in Q2’s; a node is contained in a suffix path iff its number is contained in the path’s interval. This replaces a sequence of joins by a selection.
BLAS: P-labeling Construction (for paths, then for XML trees) Assign / a ratio r_0 and each tag t_i a ratio r_i = 1 / (n + 1). Define the domain [0, m-1], with m ≥ (n + 1)^h. Construct P-labels for suffix paths: assign // the whole interval [0, m-1]; partition the interval in tag order, proportionally to each t_i’s ratio r_i, allocating one subinterval to suffix paths starting with / and one subinterval to suffix paths starting with //t_i; then partition each subinterval of path //t_i by tags according to their ratios.
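The partitioning scheme can be sketched as a function that walks a suffix path from its last tag to its first, narrowing the interval at each step. This is a minimal sketch under stated assumptions, not the paper's algorithm verbatim: ratios are equal (1 / (n + 1)), intervals are half-open `(lo, hi)` pairs, and `plabel`, `TAGS` and `M` are hypothetical names. The first four tags follow the order suggested by the books example; `t7` to `t9` are padding so that each ratio is 0.1.

```python
def plabel(path, tags, m):
    """Return the half-open interval [lo, hi) of a suffix path.

    path starts with '/' (anchored at the root) or '//' (anywhere).
    Slot 0 of every partition is reserved for '/', slot i for the i-th tag.
    """
    slots = len(tags) + 1
    anchored = not path.startswith("//")
    steps = [s for s in path.split("/") if s]
    lo, hi = 0, m                      # '//' owns the whole domain
    for tag in reversed(steps):        # last tag selects the outermost slot
        width = (hi - lo) // slots
        idx = tags.index(tag) + 1
        lo, hi = lo + idx * width, lo + (idx + 1) * width
    if anchored:                       # leading '/' takes slot 0
        hi = lo + (hi - lo) // slots
    return lo, hi


TAGS = ["books", "book", "title", "section", "figure", "description",
        "t7", "t8", "t9"]
M = 10 ** 5
```

Under these assumptions `plabel("/books/book/section", TAGS, M)` gives `(42100, 42110)`, which lines up with the interval quoted for that path in the books example, and the interval for `/books/book` nests inside the one for `//books/book`, which nests inside `//book`.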
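BLAS: Constructing P-label for paths [Figure: nested partition of the P-label domain showing the intervals assigned to the suffix paths /, //books, //book, //title and //section, and, inside them, to /book, //books/book, //book/book and /books/book; interval bounds are multiples of 10^4.]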
BLAS: P-labeling Construction (cont’d) Example: m = and 99 tags. Each tag is assigned a ratio r = 0.01. Construct a P-label for the suffix path P = /ProteinDatabase/ProteinEntry/protein/name
Sample XML Protein Repository
BLAS: Constructing P-label for XML nodes (cont’d) [Figure: the books example tree (books, book, title, section, figure and description nodes with their text values).] P-label of an XML node: m, where the P-label for the path from the root is [m, n]. Evaluating a suffix path query Q: find all nodes whose P-label is contained in the P-label of Q. E.g. /books/book/section: [42100, 42110]
BLAS: Query Language XPath queries containing /, //, *, and predicates (branches), i.e. tree queries. The evaluation of a path expression P returns the set of nodes [[P]] in an XML tree T that are reachable by P starting from the root of T. The source path SP(n) of a node n in an XML tree T is the unique simple path P from the root to n. A path expression P is contained in a path expression Q, P ⊆ Q, if and only if for any XML tree T, [[P]] ⊆ [[Q]]. Path expressions P and Q are non-overlapping, P ∩ Q = ∅, if and only if for any XML tree T, [[P]] ∩ [[Q]] = ∅.
BLAS: Query Translator Split algorithm steps: descendant axis elimination; branch elimination; DFS traversal. D-elimination: p//q is split into p and //q, which are connected by a D-join.
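Descendant axis elimination can be illustrated on predicate-free paths by cutting the path before every inner //. This is a simplified sketch: the function name `d_eliminate` is hypothetical, and branches in square brackets are not handled.

```python
import re


def d_eliminate(path):
    """Split a predicate-free path into suffix path queries at each inner '//'.

    Each returned piece contains only child steps after its leading '/' or '//';
    consecutive pieces would be connected by a D-join.
    """
    return re.findall(r"//?[^/]+(?:/[^/]+)*", path)


pieces = d_eliminate("/a/b//c/d//e")
```

Here `pieces` holds three suffix path queries, so evaluating the original path needs two D-joins instead of one join per step.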
BLAS: Query Translator: (I) Decomposition Q: //book[//title]/section/figure [Figure, built up over four slides: the query tree with nodes book, title, section and figure is decomposed step by step into subquery trees.]
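BLAS: Query Translator: (II) Selection on P-labels Q: //book[//title]/section/figure [Figure: the decomposed query tree (nodes book, title, section, figure), with each subquery evaluated by a selection on P-labels.]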
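BLAS: Query Translator: (III) Join on D-labels Q: //book[//title]/section/figure [Figure: the decomposed query tree (nodes book, title, section, figure), with the subquery results combined by joins on D-labels.]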
BLAS: Query Translator - Push-up Used when schema information is absent. Descendant axis elimination. Push-up branch elimination: p[q1…qn]/r → p, p/q1, …, p/qn, p/r
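The push-up rule is simple enough to state as a function over strings. This is only a sketch of the rewriting rule as written on the slide, with hypothetical names; it assumes the branches and the remainder are relative child paths and does not parse XPath syntax.

```python
def push_up(p, branches, r):
    """Apply p[q1...qn]/r -> p, p/q1, ..., p/qn, p/r.

    p is the path up to the branching point, branches the relative paths
    q1..qn inside the predicate, and r the remainder after the predicate.
    Every branch gets the prefix p pushed onto it, so each result is a
    path starting at the same place as p.
    """
    return [p] + [f"{p}/{q}" for q in branches] + [f"{p}/{r}"]


subqueries = push_up("//book", ["title"], "section/figure")
```

The results of these subqueries would then be related to each other through their D-labels.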
BLAS: Query Translator - Unfold Used when schema information is present (both non-recursive and recursive schemas). Replaces D-joins with a process that first performs selections on P-labels and then unions the results. Very efficient: selections using an index are cheap; the union is very simple since there are no duplicates; the subqueries are all simple path queries, which can be implemented as select operations with equality predicates; the number of disk accesses is reduced.
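The idea behind unfolding can be sketched for the non-recursive case: given a schema, a query starting with // is expanded into every simple root-to-node path that ends with the same steps. The schema below and the names `root_paths` and `unfold` are assumptions made for illustration; recursive schemas, which the slide also mentions, would need a depth bound and are not handled here.

```python
def root_paths(schema, root):
    """Yield every simple path from the root allowed by a non-recursive schema."""
    stack = [(root, "/" + root)]
    while stack:
        tag, path = stack.pop()
        yield path
        for child in schema.get(tag, []):
            stack.append((child, path + "/" + child))


def unfold(suffix_query, schema, root):
    """Expand '//a/b' into the sorted list of simple paths ending in '/a/b'."""
    suffix = suffix_query[1:]          # '//a/b' -> '/a/b'
    return sorted(p for p in root_paths(schema, root) if p.endswith(suffix))


# Hypothetical parent -> children schema for the books example.
schema = {
    "books": ["book"],
    "book": ["title", "section"],
    "section": ["title", "figure"],
    "figure": ["description"],
}
simple_paths = unfold("//title", schema, "books")
```

Each resulting simple path has a single-point P-label interval, so it can be answered by an equality selection, and the answers are unioned.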
BLAS: Query Translator – Unfold (cont’d)
BLAS: Comparison with D-labeling BLAS: fewer joins, fewer disk accesses. [Figure: query plans over the nodes book, title, section and figure, shown side by side for BLAS and for D-labeling.]
Experiment Setup Data sets. Query sets: suffix path queries, path queries, XPath queries, benchmark queries. Query engine: TwigStack join.
Query Execution Time Query names: A = Auction, P = Protein, S = Shakespeare; 1 = suffix path query, 2 = path query, 3 = XPath query
Number of data elements visited Query names: A = Auction, P = Protein, S = Shakespeare; 1 = suffix path query, 2 = path query, 3 = XPath query
Benchmark Query Execution Time
Scalability BLAS
Contributions A P-labeling scheme is proposed to evaluate suffix path queries efficiently. BLAS combines P-labeling and D-labeling to evaluate XPath queries. BLAS is more efficient than state-of-the-art work because the queries translated from XPath queries require fewer disk accesses and fewer joins. Experiments show the effectiveness of BLAS.
Evaluation Successful effort. Trade-off between additional cost and execution time. BLAS vs RDBMS?