Query Processing for High-Volume XML Message Brokering

Query Processing for High-Volume XML Message Brokering
Yanlei Diao, Michael Franklin
University of California, Berkeley
11/24/2018

XML Message Brokers
Data exchange in XML: Web services, data and application integration, information dissemination.
XML message brokers: central exchange points for messages sent between applications/users.
Main functions, for a large set of queries:
  Filtering: matches messages to predicates representing interest specifications.
  Transformation: restructures matched messages according to recipient-specific requirements.
  Routing: delivers the customized data to the recipients.
Yanlei Diao 11/24/2018

Personalized Content Delivery
[Figure labels: Data Source, XML streams, Message Broker, user queries, query results]
User subscriptions: specifications of user interests, written in an XML query language.
XML streams: continuously arriving XML data items.
The message broker matches data items to queries, transforms them, and routes the results.

XML Filtering and YFilter
XML filtering systems: XFilter, YFilter, XMLTK, XTrie, Index-Filter, MatchMaker.
YFilter: a high-performance shared path matching engine.
A large set of linear path expressions:
  Q1=/a/b    Q2=/a/c    Q3=/a/b/c    Q4=/a//b/c
  Q5=/a/*/c  Q6=/a//c   Q7=/a/*/*/c  Q8=/a/b/c
A single Non-Deterministic Finite Automaton, sharing all the common prefixes.
[Figure: NFA with transitions labeled a, b, c, *, and ε, and accepting states {Q1}, {Q2}, {Q3, Q8}, {Q4}, {Q5}, {Q6}, {Q7}]
Path sharing is the key to efficiency and scalability: orders-of-magnitude performance improvement!
Diao et al. Path sharing and predicate evaluation for high-performance XML filtering. TODS, Dec. 2003 (to appear).
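
The prefix sharing on this slide can be illustrated with a small Python sketch. It is a simplification and not YFilter's actual implementation: the "//" axis is handled by keeping a state alive for deeper elements instead of using ε-transitions, and the names `State`, `build_nfa`, `count_states`, and `run` are hypothetical, introduced only for this example.

```python
import re

class State:
    """One NFA state; edges are keyed by (axis, label) location steps."""
    def __init__(self):
        self.edges = {}       # (axis, label) -> State
        self.queries = set()  # ids of the queries accepted in this state

def parse_steps(path):
    # "/a//b/*" -> [("/", "a"), ("//", "b"), ("/", "*")]
    return re.findall(r"(//|/)([^/]+)", path)

def build_nfa(queries):
    """Insert every path into one trie so that common prefixes share states."""
    root = State()
    for qid, path in queries.items():
        state = root
        for step in parse_steps(path):
            state = state.edges.setdefault(step, State())
        state.queries.add(qid)
    return root

def count_states(state):
    return 1 + sum(count_states(child) for child in state.edges.values())

def run(root, events):
    """Feed ("start", tag) / ("end", tag) parsing events; return matched query ids."""
    matched = set()
    # Each stack level holds (state, direct) pairs. direct=False means the state
    # is kept alive only so that its "//" edges can fire deeper in the document.
    stack = [{(root, True)}]
    for kind, tag in events:
        if kind == "end":
            stack.pop()
            continue
        active = set()
        for state, direct in stack[-1]:
            for (axis, label), target in state.edges.items():
                if axis == "/" and not direct:
                    continue
                if label in (tag, "*"):
                    active.add((target, True))
                    matched |= target.queries
            if any(axis == "//" for axis, _ in state.edges):
                active.add((state, False))
        stack.append(active)
    return matched

queries = {"Q1": "/a/b", "Q2": "/a/c", "Q3": "/a/b/c", "Q4": "/a//b/c",
           "Q5": "/a/*/c", "Q6": "/a//c", "Q7": "/a/*/*/c", "Q8": "/a/b/c"}
nfa = build_nfa(queries)
# Parsing events for the document <a><b><c/></b></a>
events = [("start", "a"), ("start", "b"), ("start", "c"),
          ("end", "c"), ("end", "b"), ("end", "a")]
print(count_states(nfa), sorted(run(nfa, events)))
```

In this sketch the eight queries contain 22 location steps in total but need only 12 shared states (including the root), and Q3 and Q8 end in the same accepting state, as on the slide.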

Efficient Transformation
Goal: customized result generation for tens of thousands of queries!
Leverage prior work on shared path matching (i.e., YFilter):
  How, and to what extent, can a shared path matching engine be exploited?
Build customization functionality on top of it:
  What post-processing of path matching output is needed?
  How can this be done most efficiently?

Message Broker Architecture

Query Specification
A query is a FLWR expression enclosed by a constant tag.
  <sections> {
    for $s in $doc//section
    where $s/title = "XML" and $s/figure/title = "XML processing"
    return <section> { $s//section//title } { $s//figure } </section>
  } </sections>
Annotated in the query: the binding path ($doc//section), the predicate paths ($s/title and $s/figure/title), and the return paths ($s//section//title and $s//figure).

PathTuple Streams
A PathTuple stream for each matched path expression.
  <section>
    <section> <figure> … </figure> </section>
    <figure> … </figure>
  </section>
[Figure: node-labeled tree (node ids 1 to 4; tags section and figure); parsing events feed YFilter, which emits PathTuple streams for //section//figure and /section/section/figure]
PathTuple: a unique path match, one field per location step.
Ordering: PathTuples in a stream are always output in increasing order of node ids in the last field.
Path-oriented shredding: query processing = operations on tuple streams.
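
As a rough illustration of the PathTuple definition, the Python sketch below enumerates the path matches for the slide's document and orders them by the last field. It assumes a pre-order node numbering (section 1, nested section 2, figures 3 and 4) and sorts after the fact, whereas the path engine emits tuples while parsing; `Node`, `path_tuples`, and `path_tuple_stream` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    id: int
    tag: str
    children: list = field(default_factory=list)

def descendants(node):
    """Yield all proper descendants of node in document (pre-)order."""
    for child in node.children:
        yield child
        yield from descendants(child)

def path_tuples(context, steps):
    """Yield one tuple of node ids per distinct match of steps below context."""
    if not steps:
        yield ()
        return
    (axis, tag), rest = steps[0], steps[1:]
    candidates = context.children if axis == "/" else descendants(context)
    for node in candidates:
        if tag in (node.tag, "*"):
            for tail in path_tuples(node, rest):
                yield (node.id,) + tail  # one field per location step

def path_tuple_stream(doc, steps):
    # Order the stream by the node id in the last field, as on the slide.
    return sorted(path_tuples(doc, steps), key=lambda t: t[-1])

# Assumed numbering for the slide's document; node 0 is a virtual document root.
doc = Node(0, "#doc", [
    Node(1, "section", [
        Node(2, "section", [Node(3, "figure")]),
        Node(4, "figure"),
    ]),
])

desc_stream = path_tuple_stream(doc, [("//", "section"), ("//", "figure")])
child_stream = path_tuple_stream(
    doc, [("/", "section"), ("/", "section"), ("/", "figure")])
print(desc_stream)   # figure 3 appears once per enclosing section
print(child_stream)
```

Under this numbering, figure 3 is matched twice by //section//figure (once through each enclosing section), which is the kind of repetition the later duplicate-elimination slides deal with.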

Output of Query Processor
GroupSequence-ListSequence format for all the nodes selected from the input message.
  <sections> {
    for $s in $doc//section
    where $s/title = "XML" and $s/figure/title = "XML processing"
    return <section> { $s//section//title } { $s//figure } </section>
  } </sections>
Output:
  section_2: {title_6, title_8}, {figure_5, figure_9}
  section_11: {title_14}, {figure_15, figure_16}

Basic Approaches
Three query processing approaches exploiting shared path matching:
  They post-process path tuple streams to generate results.
  Plans consist of relational-style and tree-search-based operators.
  They differ in the extent to which they push work down to the path engine.
Tension between shared path matching and result customization!
  PathTuples in a stream are returned in a single, fixed order for all queries containing the path.
  They can be used differently in the post-processing of each query.

Alternative 1: PathSharing-F
Insert part of the binding path from the for clauses into the path engine.
  for $s in $doc//section
  where $s/title = "XML" and $s/figure/title = "XML processing"
  return <section> { $s//section//title } { $s//figure } </section>
[Figure: the path engine emits the //section stream with node ids 2, 4, 7, 11, 13, …]
An external plan for each query:
  Selection: value-based comparisons in the binding path (//section[@id <= 2]).
  DupElim: needed when the same node is bound multiple times in the stream.
  Where-Filter: tests predicate paths in the where clause (tree-search routine).
  Return-Select: applies the return clause (tree-search routine).

Duplicate Elimination
  <figures> { for $f in $doc//section[@id<=2]//figure where … return … } </figures>
[Figure: tree with section nodes 1 (id = 1) and 2 (id = 2) and figure node 3; plan from bottom to top: YFilter //section//figure, Selection (id <= 2), DupElim, Where-Filter, Return-Select]
Duplicates for the binding path: PathTuples containing the same node id in the last field. They cause redundant work in later operators and a duplicate result.
DupElim ensures that the same node is emitted only once.
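
A minimal Python sketch of the Selection and DupElim steps for this example follows. It assumes the stream is ordered by its last field, as stated on the PathTuple Streams slide, so duplicates are adjacent and remembering the previous node id suffices. The `attrs` table and the function names are hypothetical.

```python
def selection(stream, attrs, max_id):
    """Keep tuples whose first-step node satisfies @id <= max_id."""
    for tup in stream:
        if attrs[tup[0]]["id"] <= max_id:
            yield tup

def dup_elim(stream):
    """Emit each bound node (last field) once.

    Assumes the input is ordered by the last field, so tuples that bind
    the same node are adjacent.
    """
    last_seen = None
    for tup in stream:
        if tup[-1] != last_seen:
            last_seen = tup[-1]
            yield tup

# PathTuples for //section//figure on the slide's tree: figure 3 is reached
# through section 1 and through the nested section 2.
stream = [(1, 3), (2, 3)]
attrs = {1: {"id": 1}, 2: {"id": 2}}   # assumed @id attribute values

selected = list(selection(stream, attrs, max_id=2))
bound = [tup[-1] for tup in dup_elim(selected)]
print(selected, bound)
```

Both tuples pass the selection, but only one binding of figure 3 reaches the later operators.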

Alternative 2: PathSharing-FW
In addition: push predicate paths from the where clause into the path engine.
  for $s in $doc//section
  where $s/title = "XML" and $s/figure/title = "XML processing"
  return <section> { $s//section//title } { $s//figure } </section>
Paths in the path engine: //section, //section/title, //section/figure/title.
[Figure: PathTuple streams for //section (2, 4, 7, 11, 13, …), //section/title, and //section/figure/title, combined by semijoins]
Semijoins: find query matches after the paths in the for and the where clauses are matched.
Semijoin implementations: order-preserving hash-based vs. merge-based; hash-based joins are more expensive.
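
The two semijoin variants named on this slide can be sketched in Python as follows. This is an illustrative sketch only: the tuple data is made up, value comparisons such as title = "XML" are left out, and the join is assumed to be between the last field of the binding stream and the first field of the predicate stream.

```python
def hash_semijoin(binding, predicate, key_field=0):
    """Order-preserving hash semijoin: works for any predicate-stream order."""
    keys = {tup[key_field] for tup in predicate}       # build phase
    return [b for b in binding if b[-1] in keys]       # probe in binding order

def merge_semijoin(binding, predicate, key_field=0):
    """Merge semijoin: correct only if both inputs are sorted on the join key."""
    out, i = [], 0
    for b in binding:
        while i < len(predicate) and predicate[i][key_field] < b[-1]:
            i += 1
        if i < len(predicate) and predicate[i][key_field] == b[-1]:
            out.append(b)
    return out

# Hypothetical streams: //section and //section/title as (section, title).
sections = [(2,), (4,), (7,), (11,), (13,)]
titles = [(2, 3), (7, 8), (11, 12)]
print(hash_semijoin(sections, titles), merge_semijoin(sections, titles))

# Nested sections can break the merge precondition: the stream is ordered by
# title id (last field), so the section ids in the join field are out of order.
nested_sections = [(1,), (2,)]
nested_titles = [(2, 3), (1, 4)]
print(hash_semijoin(nested_sections, nested_titles),
      merge_semijoin(nested_sections, nested_titles))
```

In the nested case the merge variant misses section 1, which is one way to read why a plan would need the more expensive hash-based operator unless the stream order on the join field is known.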

Alternative 3: PathSharing-FWR
Also push return paths from the return clause into the path engine.
OuterJoin-Select: generates results.
  Create a group for each binding path tuple in the leftmost input.
  Left outer join the binding path tuple with a return stream to create a list.
  Implementations: order-preserving hash-based vs. merge-based.
Duplicates for a return path:
  Defined on the join field and the last field of the return path stream.
  Need DupElim on return paths before outer joins.
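
The following Python sketch shows one possible hash-based reading of OuterJoin-Select, producing the group and list structure from the Output of Query Processor slide. The return-stream tuples, in particular the intermediate section ids 4, 7, and 13, are hypothetical, and `dup_elim_return` and `outer_join_select` are illustrative names.

```python
from collections import defaultdict

def dup_elim_return(stream, join_field=0):
    """Drop return-path tuples that repeat the same (join field, last field)."""
    seen, out = set(), []
    for tup in stream:
        key = (tup[join_field], tup[-1])
        if key not in seen:
            seen.add(key)
            out.append(tup)
    return out

def outer_join_select(binding, return_streams, join_field=0):
    """One group per binding tuple, with one list per return stream."""
    indexes = []
    for stream in return_streams:
        index = defaultdict(list)
        for tup in dup_elim_return(stream, join_field):
            index[tup[join_field]].append(tup[-1])   # keeps stream order
        indexes.append(index)
    # Left outer join: a binding without matches still yields empty lists.
    return [(b[-1], [index.get(b[-1], []) for index in indexes])
            for b in binding]

binding = [(2,), (11,)]   # sections that passed the where clause
# //section//section//title as (section, section, title); title 8 is reached
# through two nested sections, so (2, ..., 8) appears twice.
titles = [(2, 4, 6), (2, 4, 8), (2, 7, 8), (4, 7, 8), (11, 13, 14)]
# //section//figure as (section, figure)
figures = [(2, 5), (4, 5), (2, 9), (11, 15), (11, 16)]

groups = outer_join_select(binding, [titles, figures])
print(groups)
```

With this data, the groups correspond to section_2: {title_6, title_8}, {figure_5, figure_9} and section_11: {title_14}, {figure_15, figure_16}.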

Optimizations
Observation: more path sharing leads to more sophisticated processing plans.
Tension between shared path streams and result customization:
  Different notions of duplicates for binding and return paths.
  Different stream orders for the inputs of join operators.
Optimizations based on query / DTD inspection:
  Removing unnecessary DupElim operators.
  Turning hash-based operators into merge/scan-based ones.

Performance Comparison
Three alternatives with and without optimizations, non-recursive data.
Settings: Bib DTD, number of distinct queries = 5000, number of predicate paths = 1, number of return paths = 2, "//" probability = 0.2.
Multi-Query Processing Time (MQPT) = wall clock time of processing a message minus message parsing time (msec).

Other Results
Further experiments:
  Three alternatives with and without optimizations, recursive data
  Vary number of predicate paths
  Vary number of return paths
  Vary "//" probability
Summary of the results:
  PathSharing-FWR, when combined with optimizations based on queries and the DTD, usually provides the best performance. It performs rather poorly without optimizations.
  Effectiveness of optimizations: query inspection improves the performance of all alternatives; adding DTD-based optimizations improves them further.
  Recursive data challenges the effectiveness of optimizations.

Shared Post-processing
So far, a separate post-processing plan per query.
The best performing approach (PathSharing-FWR) only uses relational-style operators.
Sharing techniques similar to shared Continuous Query processing, but highly tailored for XML message brokering:
  Query rewriting
  Shared group by for outer joins
  Selection pullup over semijoins (NiagaraCQ)
  Shared selection (TriggerMan, NiagaraCQ, TelegraphCQ)
Shared post-processing can provide great improvement in scalability!

Conclusions
Result customization for a large set of queries:
  Sharing is key to high performance.
  Can exploit existing path sharing technology, but need to resolve the inherent tension between path sharing and result customization.
  Results show that aggressive path sharing performs best when using optimizations.
  Relational-style operators in post-processing enable use of techniques from the literature (multi-query optimization, CQ processing).

Future work
Extending the range of shared post-processing.
Additional features in result customization: OrderBy, aggregation, nested FLWR expressions, etc.
Customization solutions based on shared tree pattern matching.
Third component of the XML message broker: content-based routing in an overlay network deployment.