Towards an Internet-Scale XML Dissemination Service

Slides:



Advertisements
Similar presentations
17 th International World Wide Web Conference 2008 Beijing, China XML Data Dissemination using Automata on top of Structured Overlay Networks Iris Miliaraki.
Advertisements

P2P data retrieval DHT (Distributed Hash Tables) Partially based on Hellerstein’s presentation at VLDB2004.
Evaluation of a Scalable P2P Lookup Protocol for Internet Applications
SkewReduce YongChul Kwon Magdalena Balazinska, Bill Howe, Jerome Rolia* University of Washington, *HP Labs Skew-Resistant Parallel Processing of Feature-Extracting.
Yongluan Zhou University of Southern Denmark Ali Salehi, Karl Aberer EPFL, Switzerland - Manisha Mishra.
Paper by: A. Balmin, T. Eliaz, J. Hornibrook, L. Lim, G. M. Lohman, D. Simmen, M. Wang, C. Zhang Slides and Presentation By: Justin Weaver.
Selective Dissemination of Streaming XML By Hyun Jin Moon, Hetal Thakkar.
Traffic Engineering With Traditional IP Routing Protocols
Network Monitoring for Internet Traffic Engineering Jennifer Rexford AT&T Labs – Research Florham Park, NJ 07932
Overview of Search Engines
The RSS Editor Programme: RSS_broker A.Annunziato, C. Best JRC Ispra
Aalborg University – Department of Production XML Extensible Markup Language Kaj A. Jørgensen Aalborg University, Department of Production XML – Extensible.
Developing Health Geographic Information Systems (HGIS) for Khorasan Province in Iran (Technical Report) S.H. Sanaei-Nejad, (MSc, PhD) Ferdowsi University.
Achieving fast (approximate) event matching in large-scale content- based publish/subscribe networks Yaxiong Zhao and Jie Wu The speaker will be graduating.
Word Wide Cache Distributed Caching for the Distributed Enterprise.
USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION Tools to build your big data application Ameya Kanitkar.
1 Distributed Monitoring of Peer-to-Peer Systems By Serge Abiteboul, Bogdan Marinoiu Docflow meeting, Bordeaux.
ISV Innovation Presented by ISV Innovation Presented by Business Intelligence Fundamentals: Data Loading Ola Ekdahl IT Mentors 9/12/08.
G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs Sherif SakrSameh ElniketyYuxiong He NICTA & UNSW Sydney, Australia Microsoft Research Redmond,
DBSQL 14-1 Copyright © Genetic Computer School 2009 Chapter 14 Microsoft SQL Server.
1 Emergency Alerts as RSS Feeds with Interdomain Authorization Filippo Gioachin 1, Ravinder Shankesi 1, Michael J. May 1,2, Carl A. Gunter 1, Wook Shin.
SPARQL Query Graph Model (How to improve query evaluation?) Ralf Heese and Olaf Hartig Humboldt-Universität zu Berlin.
Early Profile Pruning on XML-aware Publish- Subscribe Systems Mirella M. Moro, Petko Bakalov, Vassilis J. Tsotras University of California VLDB 2007 Presented.
ICDCS Beijing China Routing of XML and XPath Queries in Data Dissemination Networks Guoli Li, Shuang Hou Hans-Arno Jacobsen Middleware Systems Research.
VLDB2005 CMS-ToPSS: Efficient Dissemination of RSS Documents Milenko Petrovic Haifeng Liu Hans-Arno Jacobsen University of Toronto.
Scalable Hybrid Keyword Search on Distributed Database Jungkee Kim Florida State University Community Grids Laboratory, Indiana University Workshop on.
INRIA - Progress report DBGlobe meeting - Athens November 29 th, 2002.
High-Speed Policy-Based Packet Forwarding Using Efficient Multi-dimensional Range Matching Lakshman and Stiliadis ACM SIGCOMM 98.
Peer-to-Peer Result Dissemination in High-Volume Data Filtering Shariq Rizvi and Paul Burstein CS 294-4: Peer-to-Peer Systems.
Chapter 9: Web Services and Databases Title: NiagaraCQ: A Scalable Continuous Query System for Internet Databases Authors: Jianjun Chen, David J. DeWitt,
An overlay for latency gradated multicasting Anwitaman Datta SCE, NTU Singapore Ion Stoica, Mike Franklin EECS, UC Berkeley
REED : Robust, Efficient Filtering and Event Detection in Sensor Network Daniel J. Abadi, Samuel Madden, Wolfgang Lindner Proceedings of the 31st VLDB.
XML Stream Processing Yanlei Diao University of Massachusetts Amherst.
Processing XML Streams with Deterministic Automata Denis Mindolin Gaurav Chandalia.
Design of a Notification Engine for Grid Monitoring Events and Prototype Implementation Natascia De Bortoli INFNGRID Technical Board Bologna Feb.
Information Retrieval in Practice
Algorithms and Problem Solving
Efficient Evaluation of XQuery over Streaming Data
Introduction to Load Balancing:
Ad-hoc Networks.
LOCO Extract – Transform - Load
Towards GLUE Schema 2.0 Sergio Andreozzi INFN-CNAF Bologna, Italy
Open Source distributed document DB for an enterprise
A Study of Group-Tree Matching in Large Scale Group Communications
Software Design and Architecture
High-Performance XML Filtering with YFilter
Efficient Filtering of XML Documents with XPath Expressions
RE-Tree: An Efficient Index Structure for Regular Expressions
CHAPTER 3 Architectures for Distributed Systems
Probabilistic Data Management
Plethora: Infrastructure and System Design
Associative Query Answering via Query Feature Similarity
SONATA: Query-Driven Network Telemetry
Model-Driven Analysis Frameworks for Embedded Systems
CS 31006: Computer Networks – The Routers
Query Processing for High-Volume XML Message Brokering
Prof. Leonardo Mostarda University of Camerino
SAT-Based Area Recovery in Technology Mapping
Ch 4. The Evolution of Analytic Scalability
Paraskevi Raftopoulou, Euripides G.M. Petrakis
Composite Subscriptions in Content-based Pub/Sub Systems
A Component-based Architecture for Mobile Information Access
Early Profile Pruning on XML-aware Publish-Subscribe Systems
Database management concepts
Foundations for Highly-Available Content-based Publish/Subscribe Overlays Young Yoon, Vinod Muthusamy and Hans-Arno Jacobsen.
Algorithms and Problem Solving
Market-based Dynamic Task Allocation in Mobile Surveillance Systems
Query Optimization.
A Semantic Peer-to-Peer Overlay for Web Services Discovery
OU BATTLECARD: Oracle Utilities Learning Subscription
Presentation transcript:

Towards an Internet-Scale XML Dissemination Service Yanlei Diao Shariq Rizvi Michael J. Franklin EECS, U.C. Berkeley 12/9/2018

XML dissemination services Outline XML dissemination services System model Core techniques Status and conclusions Yanlei Diao 12/9/2018

Applications of XML Dissemination News feeds via RSS (Really Simple Syndication) My Yahoo!: updated headlines from BBC, CNet, NPR. Mobile services Mobile operators: connect content providers with millions of clients running a multitude of operating systems. Stock tickers QuoteMedia: fast access to real-time and historical stock data. Online auctions: freebidingtools.com: create your own feed for your favorite eBay search. Network monitoring: Ganglia: a distributed monitoring system for clusters and grids. Yanlei Diao 12/9/2018

YFilter: An XML Dissemination Service XML streams user queries query results Data Source YFilter User queries: Specification of data interests, written in an XML query language. Data sources: Continuously publish XML data items. The service: Delivers to each user the XML data items that match her data interests; the delivered results are presented in a customized format. Yanlei Diao 12/9/2018

ONYX: Large-Scale XML Dissemination Operator Network using YFilter for XML Dissemination Data Source Broker YFilter An overlay network of information brokers running YFilter. U1 U2 U3 U4 U5 Underlying infrastructures: A dedicated network Peer-to-peer Collaboration among administrative domains Yanlei Diao 12/9/2018

Design Space: Expressiveness Expressiveness: data model + query language a service supports Subject-based Messages: a subject label Queries: a specific label or a wildcard Predicate-based Messages: attribute-value pairs Queries: a set of predicates XML filtering Messages: XML Queries: subset of XPath 1.0 XML filtering and transformation Queries: subset of XQuery Yanlei Diao 12/9/2018

Design Space: Why Distributed Processing? Privacy Regulations: e.g., CA Senate Bill No. 1386. Policies: e.g., customers’ data stay behind the firewall. Locality of data interests Disseminate regional data directly to local subscribers. Scalability Data volume: number of messages per second up to thousands, message size from 1 KB to 20 KB. Query population: up to millions. Frequency of query updates: from a daily basis to every few minutes. Result Volume: can amplify the input data volume by a large factor. Yanlei Diao 12/9/2018

Related Systems Yanlei Diao 12/9/2018

Content of the Paper Content-driven routing Need to handle both structural and value-based constraints. Leverage YFilter: NFA-based operator networks, distributed construction. Filtering power of routing (i.e., fraction of messages filtered) Filtering power can be inherently limited. Use query partitioning (if possible) to improve it. Distributed transformation Currently either at the publishers’ side or at the edge brokers. Perform cascading message transformation during routing. Efficient XML transmission Verbosity of XML, and XML parsing at each routing step. Investigate different XML formats for XML transmission. Detailed architectural design Other optimization techniques… Yanlei Diao 12/9/2018

Outline System model Core techniques Status and conclusions XML dissemination services System model Core techniques Status and conclusions Yanlei Diao 12/9/2018

Operations on Data/Query flows Data Source Broker 1 Broker 2 Broker 4 Broker 3 Broker 5 Broker 6 data flow Broker 2: /nitf/head/pubdata[@edition.area=“NY”] Broker 4: /nitf/head/pubdata[@edition.area=“SF”] Broker 5: /nitf//toject.subject[@toject.subject.type=“Stock”] or /nitf//toject.subject[@toject.subject.matter=“fishing”] Broker 6: /nitf//toject.subject[@toject.subject.type =“Sports”] … [a transformation query*] Q1: /nitf[head/pubdata[@edition.area=“SF”]] [.//toject.subject[@toject.subject.type=“Stock”]] Q2: /nitf[head/pubdata[@edition.area=”SF”]] [.//toject.subject[@toject.subject.matter=“fishing”]] Q3: /nitf[head/pubdata[@edition.area=“SF”]] [.//series[@series.name=“Tide Forecasts”]] Q5: … Q4: … query flow Yanlei Diao 12/9/2018

System Tasks on Data/Query Planes Processing planes: query plane and data plane Planes System Tasks Query Plane Data Plane Content-driven routing Incremental transformation Final query processing Build routing tables Lookup in routing tables Build transformation plans Execute transformation Build query plans Execute query plans Yanlei Diao 12/9/2018

Outline Core techniques XML dissemination services System model Status and conclusions Next, let me talk about some core techniques Yanlei Diao 12/9/2018

Routing Table Design A routing table: mapping from output links to routing queries. a routing query: the data interests of queries down from an output link. data interest of a query: XPath expressions, for and where clauses of FLWOR expressions. Routing table design a canonical form of routing queries; a representation of routing tables; and an algorithm constructing them from a distributed query population. Two (conflicting) goals High filtering power of routing Fraction of messages filtered in routing. High routing efficiency Number of messages routed per second. Yanlei Diao 12/9/2018

YFilter Basics An XML filtering and transformation engine that processes multiple queries in a shared fashion. A Non-Deterministic Finite Automaton (NFA)-based operator network. Benefits for routing: Fast structure matching. A small maintenance cost for query updates. Extensibility for supporting new operators. Q1: /nitf [head/pubdata[@edition.area=“SF”]] [.//tobject.subject[@tobject.subject.type=“Stock”]] pubdata toject. subject nitf  head * 1 2 3 5 4 6  Q1 Q2: /nitf [head/pubdata[@edition.area=“SF”]] [.//tobject.subject[@tobject.subject.matter=“fishing”]]  Q2 Y. Diao and M.J. Franklin. Query Processing for High-Volume XML Message Brokering. VLDB 2003. Y. Diao, et al. Path Sharing and Predicate Evaluation for High-Performance XML Filtering. TODS, Dec. 2003.  YFilter v1.0 release: Coming later this month! Yanlei Diao 12/9/2018

Our Solution Routing queries are a disjunction of path expressions Each XPath expression (equivalent of the for and where clauses of FLOWR expressions) is a routing query. Multiple routing queries can be connected by or. Routing table representation Merge routing queries into a single combined operator network. Construction algorithm Map(): a user query  a routing query in the canonical form. Collect(): routing queries sent from child brokers  a routing table. Aggregate(): all the routing queries (at a node) a new routing query. Yanlei Diao 12/9/2018

(mapping from links to routing queries) An Example Scenario A new routing query Aggregate( ) (4d) Broker 4 from Broker 5 from Broker 6 Routing Table (mapping from links to routing queries) Collect( ) (4c) Broker 5 Q1: /nitf[head/pubdata/@edition.area=“SF”] [.//toject.subject/@toject.subject.type=“Stock” ] Q2: /nitf[head/pubdata/@edition.area=“SF”] [.//toject.subject/@toject.subject.matter=“fishing”] A new routing query Aggregate( ) (5b) Broker 6 Q3: /nitf[head/pubdata/@edition.area=“SF”] [.//series/series.name=“Tide Forecasts” ] A new routing query Aggregate( ) (6b) Routing queries Map( ) (5a) Routing queries Map( ) (6a) Yanlei Diao 12/9/2018

Example (continued)  Broker5 Broker6 Broker6 Broker5 Q1 Q2 Q3 pubdata toject. subject nitf  head * 1 2 3 5 4 6 Broker5 (6b)  pubdata nitf head  * 1 2 3 4 5 series 7 Broker6 Collect( ) Map( ) & Aggregate( ) Broker6 (4c’)  pubdata toject. subject nitf  head * 1 2 3 5 4 6 series 7 Broker5 Map( ) & Aggregate( ) (5a)  Q1 Q2 pubdata toject. subject nitf  head * 1 2 3 5 4 6 (6a)  Q3 pubdata nitf head  * 1 2 3 4 5 series 7 Yanlei Diao 12/9/2018

Sharing and Short-cut Evaluation A problem with sharing Separate routing query representations: short-cut evaluation. Combined one: sharing may sacrifice the short-cut evaluation strategy. (4c) Broker 5: Broker 6: (5b) (6b) (5b)  pubdata toject. subject nitf  head * 1 2 3 5 4 6 Broker5 (6b) series 7 Broker6 Solution: dynamic pruning of the operator network at runtime Each operator/NFA state has a static set of broker ids that it can reach. System keeps a dynamic set of broker ids that have been reached. YFilter execution is extended to prune the operator network using these sets. (4c’) pubdata toject. subject nitf  head * 1 2 3 5 4 6 series 7  Broker5 Broker6 Yanlei Diao 12/9/2018

Other Routing Considerations: Content Generalization Large routing tables can be a problem. Introduce content generation as an additional step in Collect( ) or Aggregate( ). Generalization methods. Trade off filtering power for routing (space) efficiency. Filtering Power of Routing Fraction of messages filtered by routing. Selectivity of the union of the user queries at the node. Loss in precision in the routing queries representing this node. If inherently low, partition the query population to improve it. An Exclusiveness Pattern: e.g., “/a/b[@id=?]” Identify a set of such patterns, and partition queries using them. Yanlei Diao 12/9/2018

Status and Conclusions Queries bring intelligence to the network routing fabric. We present a detailed architectural design of ONYX. We address fundamental issues. YFilter’s NFA-based operator networks are good for routing! Locality of data interests is key to filtering power! Status: YFilter release, XML transmission, other implementation underway. This is an area full of opportunities for optimization. Improving routing efficiency. Improving filtering power of routing. Incremental message transformation. Sharing among different processing tasks. Schema-based optimization… Yanlei Diao 12/9/2018

Questions ONYX Yanlei Diao 12/9/2018