1 XPath Queries on Streaming Data Feng Peng and Sudarshan S. Chawathe İsmail GÜNEŞ Ayşe GENÇ 12.11.2003.

Slides:



Advertisements
Similar presentations
Automata Theory Part 1: Introduction & NFA November 2002.
Advertisements

Advanced XSLT. Branching in XSLT XSLT is functional programming –The program evaluates a function –The function transforms one structure into another.
Bottom-up Evaluation of XPath Queries Stephanie H. Li Zhiping Zou.
Twig 2 Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents Songting Chen, Hua-Gang Li *, Junichi Tatemura Wang-Pin Hsiung,
C O N T E X T - F R E E LANGUAGES ( use a grammar to describe a language) 1.
The Dictionary ADT Definition A dictionary is an ordered or unordered list of key-element pairs, where keys are used to locate elements in the list. Example:
Binary Trees CSC 220. Your Observations (so far data structures) Array –Unordered Add, delete, search –Ordered Linked List –??
Macro Processor.
The Big Picture Chapter 3. We want to examine a given computational problem and see how difficult it is. Then we need to compare problems Problems appear.
CS 267: Automated Verification Lecture 10: Nested Depth First Search, Counter- Example Generation Revisited, Bit-State Hashing, On-The-Fly Model Checking.
An Algorithm for Streaming XPath Processing with Forward and Backward Axes Charles Barton, Philippe Charles, Deepak Goyal, Mukund Raghavchari IBM T. J.
1 Conditional XPath, the first order complete XPath dialect Maarten Marx Presented by: Einav Bar-Ner.
Introduction to Computability Theory
Selective Dissemination of Streaming XML By Hyun Jin Moon, Hetal Thakkar.
Querying Streaming XML Data. Layout of the presentation  Introduction  Common Problems faced  Solution proposed  Basic Building blocks of the solution.
Validating Streaming XML Documents Luc Segoufin & Victor Vianu Presented by Harel Paz.
1 Indexing and Querying XML Data for Regular Path Expressions A Paper by Quanzhong Li and Bongki Moon Presented by Amnon Shochot.
Prof. Fateman CS 164 Lecture 91 Bottom-Up Parsing Lecture 9.
CS5371 Theory of Computation Lecture 8: Automata Theory VI (PDA, PDA = CFG)
Table-driven parsing Parsing performed by a finite state machine. Parsing algorithm is language-independent. FSM driven by table (s) generated automatically.
Lesson 6. Refinement of the Operator Model This page describes formally how we refine Figure 2.5 into a more detailed model so that we can connect it.
Binary Trees Chapter 6.
CHP - 9 File Structures. INTRODUCTION In some of the previous chapters, we have discussed representations of and operations on data structures. These.
Xpath Query Evaluation. Goal Evaluating an Xpath query against a given document – To find all matches We will also consider the use of types Complexity.
Lecture 6 of Advanced Databases XML Schema, Querying & Transformation Instructor: Mr.Ahmed Al Astal.
WORKING WITH XSLT AND XPATH
Efficient Keyword Search over Virtual XML Views Feng Shao and Lin Guo and Chavdar Botev and Anand Bhaskar and Muthiah Chettiar and Fan Yang Cornell University.
1 CIS336 Website design, implementation and management (also Semester 2 of CIS219, CIS221 and IT226) Lecture 6 XSLT (Based on Møller and Schwartzbach,
Processing of structured documents Spring 2002, Part 2 Helena Ahonen-Myka.
Searching: Binary Trees and Hash Tables CHAPTER 12 6/4/15 Nyhoff, ADTs, Data Structures and Problem Solving with C++, Second Edition, © 2005 Pearson Education,
COP3530 Data Structures600 Stack Stack is one the most useful ADTs. Like list, it is a collection of data items. Supports “LIFO” (Last In First Out) discipline.
Lecture 10 Trees –Definiton of trees –Uses of trees –Operations on a tree.
XSLT Kanda Runapongsa Dept. of Computer Engineering Khon Kaen University.
CSC 3210 Computer Organization and Programming Chapter 1 THE COMPUTER D.M. Rasanjalee Himali.
Chapter 6 Binary Trees. 6.1 Trees, Binary Trees, and Binary Search Trees Linked lists usually are more flexible than arrays, but it is difficult to use.
Lecture 05: Theory of Automata:08 Kleene’s Theorem and NFA.
Data Structures Week 8 Further Data Structures The story so far  Saw some fundamental operations as well as advanced operations on arrays, stacks, and.
CS 415 – A.I. Slide Set 5. Chapter 3 Structures and Strategies for State Space Search – Predicate Calculus: provides a means of describing objects and.
Database Systems Part VII: XML Querying Software School of Hunan University
1 More About Turing Machines “Programming Tricks” Restrictions Extensions Closure Properties.
2. Regular Expressions and Automata 2007 년 3 월 31 일 인공지능 연구실 이경택 Text: Speech and Language Processing Page.33 ~ 56.
Early Profile Pruning on XML-aware Publish- Subscribe Systems Mirella M. Moro, Petko Bakalov, Vassilis J. Tsotras University of California VLDB 2007 Presented.
XML Access Control Koukis Dimitris Padeleris Pashalis.
Streaming XPath Engine Oleg Slezberg Amruta Joshi.
Automata & Formal Languages, Feodor F. Dragan, Kent State University 1 CHAPTER 3 The Church-Turing Thesis Contents Turing Machines definitions, examples,
1Computer Sciences Department. Book: INTRODUCTION TO THE THEORY OF COMPUTATION, SECOND EDITION, by: MICHAEL SIPSER Reference 3Computer Sciences Department.
Grouping Robin Burke ECT 360. Outline Extra credit Numbering, revisited Grouping: Sibling difference method Uniquifying in XPath Grouping: Muenchian method.
Grouping Robin Burke ECT 360. Outline Grouping: Sibling difference method Uniquifying in XPath Grouping: Muenchian method Generated ids Keys Moded Templates.
1 Holistic Twig Joins: Optimal XML Pattern Matching Nicolas Bruno, Nick Koudas, Divesh Srivastava ACM SIGMOD 2002 Presented by Jun-Ki Min.
Copyright © Curt Hill Other Trees Applications of the Tree Structure.
BINARY TREES Objectives Define trees as data structures Define the terms associated with trees Discuss tree traversal algorithms Discuss a binary.
1 Section 13.1 Turing Machines A Turing machine (TM) is a simple computer that has an infinite amount of storage in the form of cells on an infinite tape.
1 Turing Machines and Equivalent Models Section 13.1 Turing Machines.
Overview of Previous Lesson(s) Over View  A token is a pair consisting of a token name and an optional attribute value.  A pattern is a description.
Bottom Up Parsing CS 671 January 31, CS 671 – Spring Where Are We? Finished Top-Down Parsing Starting Bottom-Up Parsing Lexical Analysis.
XML Schema – XSLT Week 8 Web site:
4 - Conditional Control Structures CHAPTER 4. Introduction A Program is usually not limited to a linear sequence of instructions. In real life, a programme.
DATA STRUCURES II CSC QUIZ 1. What is Data Structure ? 2. Mention the classifications of data structure giving example of each. 3. Briefly explain.
COMPILER CONSTRUCTION
Domain Name System: DNS To identify an entity, TCP/IP protocols use the IP address, which uniquely identifies the Connection of a host to the Internet.
CHP - 9 File Structures.
Finite State Machines Dr K R Bond 2009
Querying and Transforming XML Data
Data Structure Interview Question and Answers
Table-driven parsing Parsing performed by a finite state machine.
DATA STRUCTURES AND OBJECT ORIENTED PROGRAMMING IN C++
Efficient Filtering of XML Documents with XPath Expressions
Net 323 D: Networks Protocols
Lesson Objectives Aims
Early Profile Pruning on XML-aware Publish-Subscribe Systems
Presentation transcript:

1 XPath Queries on Streaming Data Feng Peng and Sudarshan S. Chawathe İsmail GÜNEŞ Ayşe GENÇ

2 Design and Implementation of the XSQ System for Querying Streaming XML Data Using XPath 1.0. XSQ; Supports multiple predicates, closures and aggregation. Supports multiple predicates, closures and aggregation. Not only provides high throughput but is also memory efficient. Not only provides high throughput but is also memory efficient. Buffers only data that must be buffered by any streaming XML processor. Buffers only data that must be buffered by any streaming XML processor.

3 XML is becoming the de facto standart XML is becoming the de facto standart Streaming XML : Data occurs naturally in streaming form and data that is best accessed in streaming form. Streaming XML : Data occurs naturally in streaming form and data that is best accessed in streaming form. Problem is, evaluating XPath queries over streaming XML. Problem is, evaluating XPath queries over streaming XML. XPath; XPath; Well accepted language for addressing XML. Well accepted language for addressing XML. Often used in a host langugage but also serves a stand-alone query language for XML. Often used in a host langugage but also serves a stand-alone query language for XML.

4 An XPath Query consists; An XPath Query consists; Location Path Location Path Output Expression Output Expression EX: //book[year>2000]/name/text( ) EX: //book[year>2000]/name/text( ) Location Path : Location Path : Location Step Location Step Specify the path from the document root to a desired element. Specify the path from the document root to a desired element. Axis, node test and optional predicate. Axis, node test and optional predicate. //book[year>2000]/name. //book[year>2000]/name. Output Expression. Output Expression. Appears in result. Appears in result. text( ). text( ).

5 Contributions of Method First method that handles closures, aggregations, and multiple predicates. First method that handles closures, aggregations, and multiple predicates. Easy to understand, implement, and expand to more complex queries. Easy to understand, implement, and expand to more complex queries. Illustrates the costs and benefits of different XPath features and implementation trade-offs. Illustrates the costs and benefits of different XPath features and implementation trade-offs.

6 Example 1: Query for the XML data; /pub[year=2002]/book[price<11]/author. Problems: Problems: 1. Buffering the potential result items. 2. Items in the buffer have to be marked seperately. 3. Have to encode the logic of the predicates in the automation.

7 Example 2: More complex example, using a query with closures, and data with recursive structure; //pub[year=2002]//book[author]//name. Difficulties : Difficulties : Elements in an XML stream may come in an order that does not match the order of the corresponding predicates in the query. Elements in an XML stream may come in an order that does not match the order of the corresponding predicates in the query. Recursive structure in the data. Recursive structure in the data. Multiple matches for an item may evaluate predicates to true.(Avoid duplicates) Multiple matches for an item may evaluate predicates to true.(Avoid duplicates) Closure axis and multiple predicates make more difficult to keep track information needed for proper buffer management. Closure axis and multiple predicates make more difficult to keep track information needed for proper buffer management.

8 Data Model for XML Streams Parsers based on SAX API process XML document and generates a sequence of SAX events. Parsers based on SAX API process XML document and generates a sequence of SAX events. For each opening(and closing) tag of an element, the SAX parser generates a begin(or end) event and text event for the text content. For each opening(and closing) tag of an element, the SAX parser generates a begin(or end) event and text event for the text content. The begin event of an element comes with a list of pairs with the attribute name as the key. The begin event of an element comes with a list of pairs with the attribute name as the key. XML stream is modeled as a sequence of SAX events(a sequence{e1,e2,..ei..}where ei belongs B,T,E. ) XML stream is modeled as a sequence of SAX events(a sequence{e1,e2,..ei..}where ei belongs B,T,E. ) B={(a,attrs,d)}.(a,attrs,d) is the begin event of an element with the tag “a” that is at depth d in the XML data and attrs is a list of the attribute name-value pairs. B={(a,attrs,d)}.(a,attrs,d) is the begin event of an element with the tag “a” that is at depth d in the XML data and attrs is a list of the attribute name-value pairs. E={(/a,d)}.(/a,d) is the end event of an element with tag “a” at depth d. E={(/a,d)}.(/a,d) is the end event of an element with tag “a” at depth d. T={(a,text(),d)}.(a,text(),d) is the text event in the element with tag “a” at depth d. The content of the text event can be retrieved using text() T={(a,text(),d)}.(a,text(),d) is the text event in the element with tag “a” at depth d. The content of the text event can be retrieved using text()

9 XPath XSQ implements all of XPath except reverse axes and position functions. XSQ implements all of XPath except reverse axes and position functions. XPath query is in the form of N1N2..Nn/O which consist of location path and output expression. XPath query is in the form of N1N2..Nn/O which consist of location path and output expression. An element matches the location path if the path matches the labels in the location path and satisfies all predicates. An element matches the location path if the path matches the labels in the location path and satisfies all predicates. For each matching element, the result of applying the output function to the element is added to the query result. For each matching element, the result of applying the output function to the element is added to the query result.

10 BASIC PUSHDOWN TRANSDUCER A pushdown transducer(PDT) is a pushdown automaton(PDA) with actions defined along with the transition arcs on the automaton. A pushdown transducer(PDT) is a pushdown automaton(PDA) with actions defined along with the transition arcs on the automaton. It has a finite set of states which includes a start state and a set of final states, a set of input symbols, and a set of stack symbols. It has a finite set of states which includes a start state and a set of final states, a set of input symbols, and a set of stack symbols. At each step, it fetches an input symbol from the input sequence. Based on the input symbol and the symbols in the stack, it changes the current state and operates the stack according to the transition function. Transition function also generates output. At each step, it fetches an input symbol from the input sequence. Based on the input symbol and the symbols in the stack, it changes the current state and operates the stack according to the transition function. Transition function also generates output. Traditional PDT’s don’t have an extra buffer and the operations for the buffer. However, evaluating XPath queries over XML streams requires buffering potential results. Traditional PDT’s don’t have an extra buffer and the operations for the buffer. However, evaluating XPath queries over XML streams requires buffering potential results.

11 Simple PDA for XML Streams PDA that accepts XML streams that have certain string. PDA that accepts XML streams that have certain string. For each begin event, puts the tag element into stack and for end event pop from the stack. For each begin event, puts the tag element into stack and for end event pop from the stack. It is not simple to extend simple PDA to PDT that answers XPath queries. It is not simple to extend simple PDA to PDT that answers XPath queries. PDA has no memory for the previously processed data. However, we need the results for all the predicates. A direct solution is mark every item with a flag that indicates which predicates are satisfied and which are not yet. PDA has no memory for the previously processed data. However, we need the results for all the predicates. A direct solution is mark every item with a flag that indicates which predicates are satisfied and which are not yet. Every time a predicate is evaluated, whole buffer should be checked if some items are affected by its result. Every time a predicate is evaluated, whole buffer should be checked if some items are affected by its result. System becomes complex and low-performance. System becomes complex and low-performance.

12 Building the BPDT Example 3: In a PDT for this query we need to implement at least 3 tasks for this location step. Example 3: In a PDT for this query we need to implement at least 3 tasks for this location step. /pub[year>2000]/book[author]/name/text( ) 1. If the book element does have an author subelement, we need to remember the fact for the future use. 2. If the book element does not have an author subelement, we need to make sure that if the name of the current book element has been in the buffer, it is deleted from the buffer. 3. If the book element does have an author subelement, we need to make sure that if the name of the current book has been in the buffer, it is sent to output if all thepredicates have evaluated to true. If some of the predicates have not been evaluated, we should hold the content in the buffer and handle it later.

13 XPath Queries Categorization 1. Test whether the current element has a specified attribute, or whether the attribute satisfies some condition.(e.g., 2. Test whether the current element contains some text, or whether the text value satisfies some condition(e.g., /year[text( )=2000]) 3. Test whether the current element has a specified type of child(e.g., /book[author]) 4. Test whether the current element’s specified child contains an attribute or whether the value of the attribute satisfies some condition(e.g., 5. Test whether the specified child of the current element has a value that satisfies some condition.(e.g.,/book[year<2000])

14 Based on previous categorization, a template is designed for each category. Based on previous categorization, a template is designed for each category. In each template, there is a START state, a TRUE state that indicates the predicate in this location step has evaluated to true, and an NA state that indicates the predicate has not yet been evaluated. In each template, there is a START state, a TRUE state that indicates the predicate in this location step has evaluated to true, and an NA state that indicates the predicate has not yet been evaluated. The PDT generated from a location step using the template is called a basic pushdown automaton(BPDT). The PDT generated from a location step using the template is called a basic pushdown automaton(BPDT). The BPDT has 2 important features: The BPDT has 2 important features: It is easy to show the current state. It is easy to show the current state. The logic(behaviour) of the predicate is encoded in the BPDT. The logic(behaviour) of the predicate is encoded in the BPDT.

15 Buffer Operations in BPDT In contrast to simple PDA, each BPDT has a buffer of its own that is organized as a queue. The operations on the buffer are: In contrast to simple PDA, each BPDT has a buffer of its own that is organized as a queue. The operations on the buffer are: Q.enqueue(v): add v to the end of the queue. Q.enqueue(v): add v to the end of the queue. Q.clear( ): remove all the items in the queue. Q.clear( ): remove all the items in the queue. Q.flush( ): send all items in the queue to the output in FIFO order. Q.flush( ): send all items in the queue to the output in FIFO order. Q.upload( ): move all items in the queue to the end of the queue of the BPDT that is the parent of this BPDT. Q.upload( ): move all items in the queue to the end of the queue of the BPDT that is the parent of this BPDT.

16 State Transitions in BPDT When the closures and multiple predicates are needed, BPDT is non-deterministic; When the closures and multiple predicates are needed, BPDT is non-deterministic; BPDT first matches e(SAX event) with the labels on all the transition arcs. If it does not find a match, it ignores e. If it finds a matched arc, it first checks the predicate f. If f is not null, the BPDT evaluates the f using e. If f evaluates to false, it does nothing. Otherwise, it replaces s(current state) with a new state s2 determined by the transition arc. BPDT first matches e(SAX event) with the labels on all the transition arcs. If it does not find a match, it ignores e. If it finds a matched arc, it first checks the predicate f. If f is not null, the BPDT evaluates the f using e. If f evaluates to false, it does nothing. Otherwise, it replaces s(current state) with a new state s2 determined by the transition arc.

17 State Transitions in BPDT(Cont’d) If closures are not needed, the BPDT is deterministic. If closures are not needed, the BPDT is deterministic. It always has a single current state. Moreover, there is at most one transition arc that matches the current event. Thus, after it finds one match it can terminate the searching process and process next incoming event. It always has a single current state. Moreover, there is at most one transition arc that matches the current event. Thus, after it finds one match it can terminate the searching process and process next incoming event.

18 HIERARCHICAL PDT The BPDT’s are combined into one hierarchical pushdown transducer, in the form of a binary tree to process XPath queries. The key idea is to use the position of the BPDT in the HPDT to encode the results of all predicates. The BPDT can determine whether a predicate has been evaluated or not by its own position, which is fixed and easy to get in a binary tree.

19 Building HPDT from XPath Queries position of each BPDT is defined by a unique ID (l, k), where l >= 0 is the depth for the BPDT and k >=0 is its sequence number within the layer. ID’s are generation procedure: generate a root BPDT with an ID (0,0) go through all the BPDTs bpdt(i-1, k) For each existing bpdt(i-1, k), if it has an NA state, we generate a bpdt(i,2k) as its right child, which use the NA state of bpdt(i-1, k) as its START state. If bpdt(i-1, k) does not have an NA state, we set bpdt(i,2k) to NULL. Similarly, we generate a bpdt(i,2k+1) as the left child of of bpdt(i-1, k), which uses the TRUE state of bpdt(i-1, k) as its START state.

20 Building HPDT from XPath Queries (Cont’d) After connecting BPDT’s, the buffer operation bpdt(l,k) can be determined as follows: There is the fact that if k = (k0k2...kn)2, when the HPDT reaches a state (not including the START state) in this BPDT, the ith predicate has evaluated to true if and only if ki = 1. Therefore, the buffer operations of this BPDT can be determined given the results of the predicates. Thus, in every bpdt(i,2i-1) i=1,...,n, the BPDT sends the content in the buffer to the output if the predicate in itself evaluates to true.

21 Building HPDT from XPath Queries (Cont’d) Add the output functions to the lowest layer BPDTs. In bpdt(n,2n -1), the value is sent to the output directly. In all the other BPDTs in layer n, the output will be sent to the buffer. If the output expression O is specified, the corresponding attribute or function is added to the transitions in the lowest layer BPDTs. Otherwise, a catchall transition is added to the lowest layer BPDTs.

22 IMPLEMENTATION AND EXPERIMENTS XSQ system is implemented in Java using Sun Java SDK version 1.4. The XML parser used is Xerces 1.0 for Java. Two versions of the XSQ system is implemented: XSQ-NC supports multiple predicates and aggregations, but not closures XSQF supports multiple predicates, aggregations, and closures.

23 Experimental Setup The experiments are conducted on a Pentium III 900MHZ machine with 1 GB memory running the Redhat 7.2 distribution ofGNU/Linux (kernel ). The maximum amount of memory the Java Virtual Machine could use was set to 512 MB.

24 Experimental Setup (Cont’d) We compare the XSQ system with the systems in this table which process XPath queries or XPath- like queries.

25 Experimental Setup (Cont’d) However, the goal is not simply to compare their performance. Through our study of these XPath processors, it is wanted to get more insights of the cost to support certain XPath features such as closures and to predict which system will perform better in what kind of environment.

26 Experimental Setup (Cont’d) In experiments, the above systems are used to evaluate queries over datasets –in table -that differ in size and characteristics, including real and synthetic datasets.

27 Throughput Throughput is an important metric for streaming systems since the data size varies and could be unbounded. The throughput of a SAX parser, which parses the XML data but does nothing else, gives an upper bound of the throughput for any XML query system. The throughput of a SAX parser, which parses the XML data but does nothing else, gives an upper bound of the throughput for any XML query system.

28 Throughput Throughput (Cont’d)

29 Throughput Throughput (Cont’d)

30 Throughput Throughput (Cont’d)

31 Memory Usage Memory usage is critical for the scalability of the streaming system. Non-streaming systems need memory linear in the size of the input since they need to load the whole dataset into memory. In contrast, streaming systems need to store only a small fraction of the stream.

32 Memory Usage (Cont’d)

33 Memory Usage (Cont’d)

34 CONCLUSION A distinguishing feature of XSQ is that it buffers only data that must be buffered by any streaming XPath query processor. Further, XSQ has a clean design based on a hierarchical network of pushdown transducers augmented with buffers.

35 CONCLUSION (Cont’d) The XSQ system is fully implemented, and supports features such as multiple predicates, closures, and aggregation. It is also presented an empirical study of XSQ and related systems in order to explore the costs and benefits of XPath features and implementation choices.