XML Stream Processing Yanlei Diao University of Massachusetts Amherst.

Slides:



Advertisements
Similar presentations
Lecture 24 MAS 714 Hartmut Klauck
Advertisements

Twig 2 Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents Songting Chen, Hua-Gang Li *, Junichi Tatemura Wang-Pin Hsiung,
DFA Minimization Jeremy Mange CS 6800 Summer 2009.
1 1 CDT314 FABER Formal Languages, Automata and Models of Computation Lecture 3 School of Innovation, Design and Engineering Mälardalen University 2012.
Boosting XML filtering through a scalable FPGA-based architecture A. Mitra, M. Vieira, P. Bakalov, V. Tsotras, W. Najjar.
Fine Grained Access Control in XML DataBase Systems Naveen Yajamanam April 27,2006.
Compiler Construction
Finite Automata Great Theoretical Ideas In Computer Science Anupam Gupta Danny Sleator CS Fall 2010 Lecture 20Oct 28, 2010Carnegie Mellon University.
Paper by: A. Balmin, T. Eliaz, J. Hornibrook, L. Lim, G. M. Lohman, D. Simmen, M. Wang, C. Zhang Slides and Presentation By: Justin Weaver.
Selective Dissemination of Streaming XML By Hyun Jin Moon, Hetal Thakkar.
Engine Issues for Data Stream Processing Mike Franklin UC Berkeley 1 st Duodecennial SWiM Meeting January 9, 2003.
Containment and Equivalence for an XPath Fragment By Gerom e Mikla Dan Suciu Presented By Roy Ionas.
Aho-Corasick String Matching An Efficient String Matching.
1 Fast and Memory-Efficient Regular Expression Matching for Deep Packet Inspection Department of Computer Science and Information Engineering National.
Buffering in Query Evaluation over XML Streams Ziv Bar-Yossef Technion Marcus Fontoura Vanja Josifovski IBM Almaden Research Center.
Sorting and Query Processing Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November 29, 2005.
1 CD5560 FABER Formal Languages, Automata and Models of Computation Lecture 7 Mälardalen University 2010.
Xpath Query Evaluation. Goal Evaluating an Xpath query against a given document – To find all matches We will also consider the use of types Complexity.
1 Distributed Monitoring of Peer-to-Peer Systems By Serge Abiteboul, Bogdan Marinoiu Docflow meeting, Bordeaux.
Lexical Analysis — Part II: Constructing a Scanner from Regular Expressions Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved.
CMPS 3223 Theory of Computation Automata, Computability, & Complexity by Elaine Rich ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Slides provided.
Buffering in Query Evaluation over XML Streams Ziv Bar-Yossef Technion Marcus Fontoura Vanja Josifovski IBM Almaden Research Center.
1 Chapter 3 Scanning – Theory and Practice. 2 Overview Formal notations for specifying the precise structure of tokens are necessary –Quoted string in.
Lexical Analysis — Part II: Constructing a Scanner from Regular Expressions.
CS-5800 Theory of Computation II PROJECT PRESENTATION By Quincy Campbell & Sandeep Ravikanti.
Schema-Based Query Optimization for XQuery over XML Streams Hong Su Elke A. Rundensteiner Murali Mani Worcester Polytechnic Institute, Massachusetts, USA.
Automating Construction of Lexers. Example in javacc TOKEN: { ( | | "_")* > | ( )* > | } SKIP: { " " | "\n" | "\t" } --> get automatically generated code.
Cayuga: A General Purpose Event Monitoring System Mirek Riedewald Joint work with Alan Demers, Johannes Gehrke, Biswanath Panda, Varun Sharma (IIT Delhi),
Streaming XPath / XQuery Evaluation and Course Wrap-Up Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems December.
Lexical Analysis Constructing a Scanner from Regular Expressions.
Fast and Memory-Efficient Regular Expression Matching for Deep Packet Inspection Authors: Fang Yu, Zhifeng Chen, Yanlei Diao, T. V. Lakshman, Randy H.
4b 4b Lexical analysis Finite Automata. Finite Automata (FA) FA also called Finite State Machine (FSM) –Abstract model of a computing entity. –Decides.
CS412/413 Introduction to Compilers Radu Rugina Lecture 4: Lexical Analyzers 28 Jan 02.
Lexical Analysis: Finite Automata CS 471 September 5, 2007.
XML and Database.
Streaming XPath Engine Oleg Slezberg Amruta Joshi.
Author : Randy Smith & Cristian Estan & Somesh Jha Publisher : IEEE Symposium on Security & privacy,2008 Presenter : Wen-Tse Liang Date : 2010/10/27.
Advance Database Systems Query Optimization Ch 15 Department of Computer Science The University of Lahore.
Chapter 9: Web Services and Databases Title: NiagaraCQ: A Scalable Continuous Query System for Internet Databases Authors: Jianjun Chen, David J. DeWitt,
Fast and Memory-Efficient Regular Expression Matching for Deep Packet Inspection Publisher : ANCS’ 06 Author : Fang Yu, Zhifeng Chen, Yanlei Diao, T.V.
using Deterministic Finite Automata & Nondeterministic Finite Automata
Finite Automata Great Theoretical Ideas In Computer Science Victor Adamchik Danny Sleator CS Spring 2010 Lecture 20Mar 30, 2010Carnegie Mellon.
Processing XML Streams with Deterministic Automata Denis Mindolin Gaurav Chandalia.
SEMI-STRUCTURED DATA (XML) 1. SEMI-STRUCTURED DATA ER, Relational, ODL data models are all based on schema Structure of data is rigid and known is advance.
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture 1 Ahmed Ezzat.
LECTURE 5 Scanning. SYNTAX ANALYSIS We know from our previous lectures that the process of verifying the syntax of the program is performed in two stages:
1 XPath Queries on Streaming Data Feng Peng and Sudarshan S. Chawathe İsmail GÜNEŞ Ayşe GENÇ
June 13, 2016 Prof. Abdelaziz Khamis 1 Chapter 2 Scanning – Part 2.
CSC-305 Design and Analysis of AlgorithmsBS(CS) -6 Fall-2014CSC-305 Design and Analysis of AlgorithmsBS(CS) -6 Fall-2014 Design and Analysis of Algorithms.
Design of a Notification Engine for Grid Monitoring Events and Prototype Implementation Natascia De Bortoli INFNGRID Technical Board Bologna Feb.
1 Chapter 2 Finite Automata (part a) Hokkaido, Japan.
Department of Software & Media Technology
Mapping Data to Queries Martin Hentschel Systems Group, ETH Zurich.
Topic 3: Automata Theory 1. OutlineOutline Finite state machine, Regular expressions, DFA, NDFA, and their equivalence, Grammars and Chomsky hierarchy.
WELCOME TO A JOURNEY TO CS419 Dr. Hussien Sharaf Dr. Mohammad Nassef Department of Computer Science, Faculty of Computers and Information, Cairo University.
6. Pushdown Automata CIS Automata and Formal Languages – Pei Wang.
Lexical analysis Finite Automata
High-Performance XML Filtering with YFilter
Efficient Filtering of XML Documents with XPath Expressions
13 Text Processing Hongfei Yan June 1, 2016.
Chapter 2 FINITE AUTOMATA.
(b) Tree representation
Query Processing for High-Volume XML Message Brokering
Towards an Internet-Scale XML Dissemination Service
4b Lexical analysis Finite Automata
Early Profile Pruning on XML-aware Publish-Subscribe Systems
Advance Database Systems
4b Lexical analysis Finite Automata
Chapter 1 Regular Language
Lecture 5 Scanning.
Presentation transcript:

XML Stream Processing Yanlei Diao University of Massachusetts Amherst

3/11/2016 Yanlei Diao, University of Massachusetts Amherst XML Data Streams  XML is the “wire format” for data exchanged online. Purchase orders News feeds Stock tickers Transactions of online auctions Network monitoring data

3/11/2016 Yanlei Diao, University of Massachusetts Amherst XML Stream Processing  XML data continuously arrives from external sources, queries are evaluated every time a new data item is received.  A key feature of stream processing is the ability to process as it arrives. Natural fit for XML message brokering and web services where messages need to be filtered and transformed on-the-fly.  XML stream processing allows query execution to start before a message is completely received. Short delay in producing results. No need to buffer the entire message for query processing. Both are crucial when input messages are large, e.g. the equivalent of a database’s worth of data.

3/11/2016 Yanlei Diao, University of Massachusetts Amherst Event-Based Parsing  XML stream processing is often performed on the granularity of an XML parsing event produced by an event-based API. Pub-Sub XML Processing Scalability < Start Document < Start Element: report < Start Element: section < Start Element: title Characters:Pub/Sub > End Element: title < Start Element: section < Start Element: figure < Start Element: title Characters: XML Processing > End Element: title > End Element: figure > End Element: section … > End Element: section > End Element: report > End Document

3/11/2016 Yanlei Diao, University of Massachusetts Amherst Matching a Single Path Expression XFilter [ Altinel&Franklin’00 ], YFilter [ Diao et al.’02 ], Tukwila [ Ives et al.’02 ]  Simple paths: ( (“/” | “//”) (ElementName | “*”) )+ A simple path can be transformed to a regular expression Let  = set of element names: −‘/’ is translated to the ‘·’ concatenation operator −“//” is translated to  * −‘*’ is translated to  −“/a//b” can be translated to “a  * b”  A finite automaton (FA) for each path: mapping location steps to machine states.

3/11/2016 Yanlei Diao, University of Massachusetts Amherst Query Compilation Location steps Automaton fragments /a //a /* a * a  * Map location steps to automaton fragments Concatenate automaton fragments for location steps in a query a * b  a * b  Query “/a//b” Is this Automaton deterministic or non-deterministic? For multi-path processing; o.w. not needed

3/11/2016 Yanlei Diao, University of Massachusetts Amherst Event-Driven Query Execution  Query execution retrieves all matches of a path expression.  Event-driven execution of FA: Parsing events (esp. start of elements) drive the FA execution. Elements that trigger transitions to the accepting states are returned. figure  section * < Start Document < Start Element: report < Start Element: section < Start Element: title Characters:Pub/Sub > End Element: title < Start Element: section < Start Element: figure < Start Element: title Characters: XML Processing > End Element: title > End Element: figure > End Element: section … > End Element: section > End Element: report > End Document  Run FA over XML with nested structure, retrieve all results Approach 1: shred XML into a set of linear paths Approach 2: augment FA execution with backtracking

3/11/2016 Yanlei Diao, University of Massachusetts Amherst Multi-Query Processing  Problem: evaluate a set Q = Q1, …, Qn of path queries against each incoming XML document.  Brute force: iterate the query set, one query at a time  Indexing of queries: inverse problem of traditional query processing Traditional DB: Data is stored persistently; queries executed against the data. Indexes of data enable selective searches of the data. XML stream processing: Queries are persistently stored; documents or their parsing events drive the matching of queries. Indexes of queries enable selective matching of documents to queries.  Sharing of processing: commonalities exist among queries. Shared processing avoids redundant work.

3/11/2016 Yanlei Diao, University of Massachusetts Amherst YFilter [Diao et al.’03] builds a combined FA for all paths.  Complete prefix sharing among paths.  Moore machine with output: accepting states → partition of query ids.  Nondeterministic Finite Automaton (NFA)-based implementation: a small machine size, flexible, easy to maintain, etc. Constructing the Combined FSM a {Q1} b Q1=/a/b Q2=/a/c Q3=/a/b/c Q4=/a//b/c Q5=/a/*/b Q6=/a//c Q7=/a/*/*/c Q8=/a/b/c a {Q2} c c {Q3}  {Q4} c b * * c {Q5} c {Q6} * c {Q7} {Q3, Q8}

3/11/2016 Yanlei Diao, University of Massachusetts Amherst YFilter uses a stack mechanism to handle XML.  Backtracking in the NFA.  No repeated work for the same element. An XML fragment … Execution Algorithm read initial 1 Runtime Stack NFA match Q1 9 read match Q3 Q Q5Q6Q4 c c b {Q1} {Q3, Q8} {Q2} {Q4} {Q6} {Q5} {Q7} a * c c * c c *  b

3/11/2016 Yanlei Diao, University of Massachusetts Amherst Implementation Choices  Non-Deterministic Automata (NFA) versus Deterministic Automata (DFA) NFA: small machine, large numbers of transitions per element DFA: potentially large machine, one transition per element  Worse-case comparison for regular expressions: Single regular expression of length n m regular expressions compiled together Processing complexity Machine size Processing complexity Machine size NFAO(n2)O(n2)O(n)O(n)O(n2m)O(n2m)O(nm) DFAO(1)O(1) O(n)O(n) O(1)O(1) O(  nm ) YFilter specific O(3n) O(3nm) Possible in practice?  Restricted path expressions:

3/11/2016 Yanlei Diao, University of Massachusetts Amherst Eager DFA  Green et al. studied the size of DFA for the restricted set of path expressions [Green et al.’03]  Eager DFA: Query compile time, translates from NFA to DFA Creates a state for every possible situation that runtime may require  Single path: Linear for “//e1/e2/e3/e4/e5” Exponential for “//e1/*/*/*/e5”  Multiple paths: Exponential for “//e1//b”, “//e2//b”, …, “//e5//b”

3/11/2016 Yanlei Diao, University of Massachusetts Amherst Example of an Eager DFA The DFA size is O(2 w+1 ) where w is the number of ‘*’. //a/*/*/c/d “a a b b c d” “a b a b c d” Need to remember all the A’s in three consecutive characters, as different combinations of As may yield different results.

3/11/2016 Yanlei Diao, University of Massachusetts Amherst Lazy DFA  Lazy DFA is constructed at run time, on demand Initially, it has a single state Whenever it attempts to make a transition into a missing state, it computes it and updates the transition  Hope: only a small set of the DFA states is needed.  Exploits DTD to derive upper bounds…

3/11/2016 Yanlei Diao, University of Massachusetts Amherst Lazy DFA (contd.)  A DTD graph is simple, if the only loops are self-loops.  Theorem: the size of a lazy DFA is exponential only in the maximal number of simple cycles a path can intersect. Not exponential in # of paths! DTD graph  More complex recursive DTDs: e.g. “table” contains “list” and “list” contains “table” Even a lazy DFA grows large…

3/11/2016 Yanlei Diao, University of Massachusetts Amherst Predicates in Path Expressions  Predicates can address attributes, text data, or positions of elements. Value of an attribute in an element, e.g., = “easy”]. Text data of an element, e.g., //section/title[text()=“XPath”]. Position of an element, e.g., //section/figure[text()=“XPath”][1].

3/11/2016 Yanlei Diao, University of Massachusetts Amherst Predicate Evaluation  Extend the NFA including additional states representing successful evaluation of predicates and transitions to them  Potential problems A potentially huge increase in the machine size Destroy sharing of path expressions  Recent work [Gupta & Suciu 2003] : Possible to build an efficient pushdown automaton using lazy construction if No mixed content, such as “ 1 2 ”. Can afford to periodically rebuild the automaton from scratch. Can afford to train the automaton in each construction.

3/11/2016 Yanlei Diao, University of Massachusetts Amherst YFilter: Mapping XML to Relational P1: //section//figure P2: //section/section/figure path-tuple streams Shared Path Matching Engine  section * {P2} section figure {P1}  figure * 3 title section 4 figure 5 title 6 2 section 7 figure title 8 1 report  XQuery is much more complex than regular expressions.  Leverage efficient relational processing  XQuery stream processing = relational operations on pathtuple streams! parsing events

3/11/2016 Yanlei Diao, University of Massachusetts Amherst Example Query Plan for FWR F: //section W1://section/title W2://section/figure/title R1://section//section//title R2://section//figure Q1: for $s in where $s/title = “Pub/Sub” and $s/figure/title = “XML processing” return { $s//section//title } { $s//figure }   Π DupElim OuterJoin-Select Shared Path Matching Engine   FW1W2R1R2 An external (post-processing) plan for each query:  Selection: evaluates value-based predicates.  Projection: projects onto specific fields and removes duplicates.  Semijoin: handles correlations between for and where paths, finds query matches.  Outerjoin-Select: handles correlations between for and return paths, generates query results. Push all paths into the path engine. Π DupElim

3/11/2016 Yanlei Diao, University of Massachusetts Amherst Buffering in XML Stream Processing  Buffering in XQuery stream processing Whether an element belongs to the result set is uncertain as it depends on predicates that have not been evaluated  Goal: avoid materialization and minimize buffering  Issues to address: What queries require buffering? What elements to buffer? When and how to prune dynamic data structures? −Buffered elements −State in any dynamic data structures

3/11/2016 Yanlei Diao, University of Massachusetts Amherst Buffering in XML Stream Processing //section[title=“XML”]  Requires buffering? Yes  What element to buffer? section  When to prune? until the end of section is encountered, or when no more title can occur in this section. (Use DTD!) “XML processing” 3 title section 4 figure 5 title 6 2 section 7 figure title 8 1

3/11/2016 Yanlei Diao, University of Massachusetts Amherst Buffering in XML Stream Processing //section[figure=“XML”]/title  Requires buffering? Yes  What element to buffer? title  When to prune? until a figure matches the predicate, or when no more figure can occur in this section (use DTD!), or the end of section is encountered. “XML processing” 3 title section 4 figure 5 title 6 2 section 7 figure title 8 1

3/11/2016 Yanlei Diao, University of Massachusetts Amherst Buffering in XML Stream Processing  Requires buffering? Yes  What element to buffer? figure figure5?, figure7?  When to prune? until the predicate is true, or when no more title can occur in this section (use DTD!), or the end of section is encountered. “XML processing” 3 title section 4 figure 5 title 6 2 section 7 figure title 8 1 //section[contains(.//title, “XML”)]//figure

3/11/2016 Yanlei Diao, University of Massachusetts Amherst

3/11/2016 Yanlei Diao, University of Massachusetts Amherst XQuery Usage Scenarios  XML message brokers Simple path expressions, for authentication, authorization, routing, etc. Single input message, relatively small Transient and streaming data (no indexes)  XML transformation language in Web Services Large and complex queries, mostly for transformation Input message + external data sources, small/medium sized data sets (xK -> xM) Transient and streaming data (no indexes)  Semantic data verification Mostly messages Potentially complex (but small) queries Streaming and multi-query optimization required

3/11/2016 Yanlei Diao, University of Massachusetts Amherst Example 2 of an Eager DFA //a/b//c/d //e/f//g/h If there are l patterns with “//” per pattern, we need O(2^l) states to record the matching of the power set of the prefixes. The DFA size is O((x+1) l ), where x is the number of “//” per pattern, and l is the number of patterns.

3/11/2016 Yanlei Diao, University of Massachusetts Amherst Full XQuery Stream Processor  Stream processing for the entire XQuery language [Daniela et al.’04] General algebra for XQuery Pull-based token-at-a-time execution model Lazy evaluation (like other functional languages) Some preliminary work on sharing [Diao et al.’04]  Buffering is a big concern [Barton et al.’03, Peng&Chawathe’03, Koch et al.’04]  Sharing is crucial for performance and scalability for large numbers of queries