Towards an Internet-Scale XML Dissemination Service Yanlei Diao Shariq Rizvi Michael J. Franklin EECS, U.C. Berkeley 12/9/2018
XML dissemination services Outline XML dissemination services System model Core techniques Status and conclusions Yanlei Diao 12/9/2018
Applications of XML Dissemination News feeds via RSS (Really Simple Syndication) My Yahoo!: updated headlines from BBC, CNet, NPR. Mobile services Mobile operators: connect content providers with millions of clients running a multitude of operating systems. Stock tickers QuoteMedia: fast access to real-time and historical stock data. Online auctions: freebidingtools.com: create your own feed for your favorite eBay search. Network monitoring: Ganglia: a distributed monitoring system for clusters and grids. Yanlei Diao 12/9/2018
YFilter: An XML Dissemination Service XML streams user queries query results Data Source YFilter User queries: Specification of data interests, written in an XML query language. Data sources: Continuously publish XML data items. The service: Delivers to each user the XML data items that match her data interests; the delivered results are presented in a customized format. Yanlei Diao 12/9/2018
ONYX: Large-Scale XML Dissemination Operator Network using YFilter for XML Dissemination Data Source Broker YFilter An overlay network of information brokers running YFilter. U1 U2 U3 U4 U5 Underlying infrastructures: A dedicated network Peer-to-peer Collaboration among administrative domains Yanlei Diao 12/9/2018
Design Space: Expressiveness Expressiveness: data model + query language a service supports Subject-based Messages: a subject label Queries: a specific label or a wildcard Predicate-based Messages: attribute-value pairs Queries: a set of predicates XML filtering Messages: XML Queries: subset of XPath 1.0 XML filtering and transformation Queries: subset of XQuery Yanlei Diao 12/9/2018
Design Space: Why Distributed Processing? Privacy Regulations: e.g., CA Senate Bill No. 1386. Policies: e.g., customers’ data stay behind the firewall. Locality of data interests Disseminate regional data directly to local subscribers. Scalability Data volume: number of messages per second up to thousands, message size from 1 KB to 20 KB. Query population: up to millions. Frequency of query updates: from a daily basis to every few minutes. Result Volume: can amplify the input data volume by a large factor. Yanlei Diao 12/9/2018
Related Systems Yanlei Diao 12/9/2018
Content of the Paper Content-driven routing Need to handle both structural and value-based constraints. Leverage YFilter: NFA-based operator networks, distributed construction. Filtering power of routing (i.e., fraction of messages filtered) Filtering power can be inherently limited. Use query partitioning (if possible) to improve it. Distributed transformation Currently either at the publishers’ side or at the edge brokers. Perform cascading message transformation during routing. Efficient XML transmission Verbosity of XML, and XML parsing at each routing step. Investigate different XML formats for XML transmission. Detailed architectural design Other optimization techniques… Yanlei Diao 12/9/2018
Outline System model Core techniques Status and conclusions XML dissemination services System model Core techniques Status and conclusions Yanlei Diao 12/9/2018
Operations on Data/Query flows Data Source Broker 1 Broker 2 Broker 4 Broker 3 Broker 5 Broker 6 data flow Broker 2: /nitf/head/pubdata[@edition.area=“NY”] Broker 4: /nitf/head/pubdata[@edition.area=“SF”] Broker 5: /nitf//toject.subject[@toject.subject.type=“Stock”] or /nitf//toject.subject[@toject.subject.matter=“fishing”] Broker 6: /nitf//toject.subject[@toject.subject.type =“Sports”] … [a transformation query*] Q1: /nitf[head/pubdata[@edition.area=“SF”]] [.//toject.subject[@toject.subject.type=“Stock”]] Q2: /nitf[head/pubdata[@edition.area=”SF”]] [.//toject.subject[@toject.subject.matter=“fishing”]] Q3: /nitf[head/pubdata[@edition.area=“SF”]] [.//series[@series.name=“Tide Forecasts”]] Q5: … Q4: … query flow Yanlei Diao 12/9/2018
System Tasks on Data/Query Planes Processing planes: query plane and data plane Planes System Tasks Query Plane Data Plane Content-driven routing Incremental transformation Final query processing Build routing tables Lookup in routing tables Build transformation plans Execute transformation Build query plans Execute query plans Yanlei Diao 12/9/2018
Outline Core techniques XML dissemination services System model Status and conclusions Next, let me talk about some core techniques Yanlei Diao 12/9/2018
Routing Table Design A routing table: mapping from output links to routing queries. a routing query: the data interests of queries down from an output link. data interest of a query: XPath expressions, for and where clauses of FLWOR expressions. Routing table design a canonical form of routing queries; a representation of routing tables; and an algorithm constructing them from a distributed query population. Two (conflicting) goals High filtering power of routing Fraction of messages filtered in routing. High routing efficiency Number of messages routed per second. Yanlei Diao 12/9/2018
YFilter Basics An XML filtering and transformation engine that processes multiple queries in a shared fashion. A Non-Deterministic Finite Automaton (NFA)-based operator network. Benefits for routing: Fast structure matching. A small maintenance cost for query updates. Extensibility for supporting new operators. Q1: /nitf [head/pubdata[@edition.area=“SF”]] [.//tobject.subject[@tobject.subject.type=“Stock”]] pubdata toject. subject nitf head * 1 2 3 5 4 6 Q1 Q2: /nitf [head/pubdata[@edition.area=“SF”]] [.//tobject.subject[@tobject.subject.matter=“fishing”]] Q2 Y. Diao and M.J. Franklin. Query Processing for High-Volume XML Message Brokering. VLDB 2003. Y. Diao, et al. Path Sharing and Predicate Evaluation for High-Performance XML Filtering. TODS, Dec. 2003. YFilter v1.0 release: Coming later this month! Yanlei Diao 12/9/2018
Our Solution Routing queries are a disjunction of path expressions Each XPath expression (equivalent of the for and where clauses of FLOWR expressions) is a routing query. Multiple routing queries can be connected by or. Routing table representation Merge routing queries into a single combined operator network. Construction algorithm Map(): a user query a routing query in the canonical form. Collect(): routing queries sent from child brokers a routing table. Aggregate(): all the routing queries (at a node) a new routing query. Yanlei Diao 12/9/2018
(mapping from links to routing queries) An Example Scenario A new routing query Aggregate( ) (4d) Broker 4 from Broker 5 from Broker 6 Routing Table (mapping from links to routing queries) Collect( ) (4c) Broker 5 Q1: /nitf[head/pubdata/@edition.area=“SF”] [.//toject.subject/@toject.subject.type=“Stock” ] Q2: /nitf[head/pubdata/@edition.area=“SF”] [.//toject.subject/@toject.subject.matter=“fishing”] A new routing query Aggregate( ) (5b) Broker 6 Q3: /nitf[head/pubdata/@edition.area=“SF”] [.//series/series.name=“Tide Forecasts” ] A new routing query Aggregate( ) (6b) Routing queries Map( ) (5a) Routing queries Map( ) (6a) Yanlei Diao 12/9/2018
Example (continued) Broker5 Broker6 Broker6 Broker5 Q1 Q2 Q3 pubdata toject. subject nitf head * 1 2 3 5 4 6 Broker5 (6b) pubdata nitf head * 1 2 3 4 5 series 7 Broker6 Collect( ) Map( ) & Aggregate( ) Broker6 (4c’) pubdata toject. subject nitf head * 1 2 3 5 4 6 series 7 Broker5 Map( ) & Aggregate( ) (5a) Q1 Q2 pubdata toject. subject nitf head * 1 2 3 5 4 6 (6a) Q3 pubdata nitf head * 1 2 3 4 5 series 7 Yanlei Diao 12/9/2018
Sharing and Short-cut Evaluation A problem with sharing Separate routing query representations: short-cut evaluation. Combined one: sharing may sacrifice the short-cut evaluation strategy. (4c) Broker 5: Broker 6: (5b) (6b) (5b) pubdata toject. subject nitf head * 1 2 3 5 4 6 Broker5 (6b) series 7 Broker6 Solution: dynamic pruning of the operator network at runtime Each operator/NFA state has a static set of broker ids that it can reach. System keeps a dynamic set of broker ids that have been reached. YFilter execution is extended to prune the operator network using these sets. (4c’) pubdata toject. subject nitf head * 1 2 3 5 4 6 series 7 Broker5 Broker6 Yanlei Diao 12/9/2018
Other Routing Considerations: Content Generalization Large routing tables can be a problem. Introduce content generation as an additional step in Collect( ) or Aggregate( ). Generalization methods. Trade off filtering power for routing (space) efficiency. Filtering Power of Routing Fraction of messages filtered by routing. Selectivity of the union of the user queries at the node. Loss in precision in the routing queries representing this node. If inherently low, partition the query population to improve it. An Exclusiveness Pattern: e.g., “/a/b[@id=?]” Identify a set of such patterns, and partition queries using them. Yanlei Diao 12/9/2018
Status and Conclusions Queries bring intelligence to the network routing fabric. We present a detailed architectural design of ONYX. We address fundamental issues. YFilter’s NFA-based operator networks are good for routing! Locality of data interests is key to filtering power! Status: YFilter release, XML transmission, other implementation underway. This is an area full of opportunities for optimization. Improving routing efficiency. Improving filtering power of routing. Incremental message transformation. Sharing among different processing tasks. Schema-based optimization… Yanlei Diao 12/9/2018
Questions ONYX Yanlei Diao 12/9/2018