Tree-Pattern Aggregation for Scalable XML Data Dissemination

Presentation transcript:

Tree-Pattern Aggregation for Scalable XML Data Dissemination Minos Garofalakis [ Joint work with Chee-Yong Chan, Wenfei Fan, Pascal Felber, Rajeev Rastogi ] Information Sciences Research Center Bell Labs, Lucent Technologies http://www.bell-labs.com/user/{cychan, wenfei, minos, rastogi} http://www.eurecom.fr/~felber/

Outline
- Introduction & Motivation
  - Content-based XML data dissemination
- Problem Formulation
  - Tree-pattern model
  - Pattern aggregation problem
- Our Solution: Basic Algorithmic Tools
  - Tree-pattern containment and minimization algorithms
  - Least-Upper-Bound (LUB) computation
- Our Solution: Selectivity-based Tree-Pattern Aggregation
  - Statistical synopsis and algorithms for estimating aggregate “quality”
  - The overall tree-pattern aggregation algorithm
- Experimental Study
  - Results with real-life DTDs
- Conclusions

Content-based XML Data Dissemination
- XML: dominant standard for data exchange on the Internet (B2B/B2C)
- Key problem: content-based filtering and routing of XML documents
  - Effective XML data delivery based on document contents and user subscriptions (Publish/Subscribe model)
  - User subscriptions indicate patterns of XML content that interest users (e.g., in XPath)
- Content-based XML routers
  - Quickly match incoming XML documents against standing subscriptions
  - Route documents to interested data consumers
- Work on effective indexing structures for fast subscription matching
  - XFilter/YFilter [VLDB’00, ICDE’02], XTrie [ICDE’02]

XML Data Dissemination in the Wide Area
- Large, complex network of data producers and data consumers
- To effectively route XML traffic, routers in the core/backbone of the distribution network need to be aware of all user subscriptions
  - Potentially huge volume of subscriptions!
  - Filtering speed at the core will suffer!
  - Serious scalability concerns for Pub/Sub systems
- Need a technique that can effectively aggregate user subscriptions to a smaller set of aggregated content specifications
  - Networking analog: heavy aggregation of IP addresses in the routing tables of routers on the Internet backbone

Wide-Area XML Data Dissemination (cont.)
- However, subscription aggregation also implies a “precision loss”
  - False positives: documents matching the aggregated content specifications without matching the original subscriptions
  - Implies that users may receive content that they are not interested in
- Our goal: aggregate user subscriptions to a small collection while minimizing the “precision loss” due to aggregation
- Several novel challenges for XML/XPath-based Publish/Subscribe
  - Aggregating hierarchically-structured subscriptions with possible wildcards
  - Quantifying “precision loss” due to aggregation in the context of streaming, hierarchical XML documents
  - Effectively aggregating large subscription collections

User-Subscription Model: Tree Patterns
- Tree patterns: unordered, node-labeled trees specifying content and structure conditions on XML documents
  - Wildcards: “*” = any tag, “//” = any subpath (descendant operator)
  - Significant fragment of XPath (used earlier in XML/LDAP applications)
- A tree pattern basically specifies an existential condition for each one of its paths, with conjunctions at each branching node
  - Special root node “/.” allows for conjunctive conditions at the root level
- For example: a pattern over the labels “/.”, “a”, “*”, “//”, “b”, “c” that requires a root node with tag “a” such that (1) on some document path “a” has a “b” grandchild AND (2) on some document path “a” has a “c” descendant
[Figure: example tree patterns and example document trees]
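The matching semantics described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: patterns and documents are both encoded as hypothetical `(label, children)` tuples built with a helper called `node`, and “//” is read as a path of zero or more document nodes.

```python
def node(label, *children):
    """Build a tree node as (label, children); used for patterns and documents."""
    return (label, tuple(children))

def children_match(v, d):
    # Conjunction over pattern branches: every child sub-pattern of v
    # must match at SOME child of document node d (existential per path).
    return all(any(embeds(vc, dc) for dc in d[1]) for vc in v[1])

def embeds(v, d):
    """Does the sub-pattern rooted at v match starting at document node d?"""
    label = v[0]
    if label == "//":
        return (
            all(embeds(vc, d) for vc in v[1])      # "//" maps to the empty path
            or children_match(v, d)                 # "//" maps to the single node d
            or any(embeds(v, dc) for dc in d[1])    # "//" maps to a longer path
        )
    # "*" matches any tag; otherwise the tags must be equal
    return (label == "*" or label == d[0]) and children_match(v, d)

def satisfies(doc_root, pattern):
    """The special root "/." imposes all of its children on the document root."""
    assert pattern[0] == "/."
    return all(embeds(v, doc_root) for v in pattern[1])

# Pattern from the slide's description: root "a" with a "b" grandchild
# AND a "c" descendant (assumed shape: a -> (* -> b), a -> (// -> c)).
pattern = node("/.", node("a", node("*", node("b")), node("//", node("c"))))
doc_ok = node("a", node("d", node("b")), node("x", node("y", node("c"))))
doc_no_c = node("a", node("d", node("b")))
print(satisfies(doc_ok, pattern), satisfies(doc_no_c, pattern))
```

The recursion mirrors the slide's wording: conjunction across the branches of a pattern node, existential choice of a document child for each branch.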

Tree Patterns: Basic Definitions
- Tree pattern p contains tree pattern q ($q \sqsubseteq p$) iff every document T that satisfies q also satisfies p
  - p “generalizes” q
- Extends naturally to sets of tree patterns S, S’: $S \sqsubseteq S'$ iff for each $q \in S$ there exists $p \in S'$ s.t. $q \sqsubseteq p$
- Size of a tree pattern p (|p|) = number of tree nodes in p
[Figure: example tree patterns illustrating containment]

Problem Statement
- Given a set of tree patterns S and a space bound k, compute a new set S’ of aggregate patterns such that:
  (1) $S \sqsubseteq S'$ (i.e., S’ “generalizes” S)
  (2) $\sum_{p \in S'} |p| \le k$ (i.e., S’ is concise)
  (3) S’ is as precise as possible (i.e., any other set of patterns satisfying (1) and (2) is at least as general as S’)
- Minimize extra coverage (false positives) for the aggregated set S’
- Basic algorithmic tools
  - Containment, Minimization, Least-Upper-Bound (LUB) computation
  - May be of independent interest (e.g., XML query optimization)

Basic Algorithms: Pattern Containment and Minimization
- Basic question: “Given tree patterns p and q, does p contain q?”
- Propose an algorithm based on Dynamic Programming
- Basic DP recurrence, with p(v), q(w) = sub-patterns rooted at nodes v, w of patterns p, q respectively:

$$\mathrm{CONTAINS}[p(v), q(w)] = \big[\mathrm{tag}(v) \ge \mathrm{tag}(w)\big] \;\wedge \bigwedge_{v' = \mathrm{child}(v)} \Big( \bigvee_{w' = \mathrm{child}(w)} \mathrm{CONTAINS}[p(v'), q(w')] \Big)$$

  - tag(v) >= tag(w): tag(v) is at least as general; e.g., // >= * >= a
- If tag(v) = “//” then

$$\mathrm{CONTAINS}[p(v), q(w)] = \mathrm{CONTAINS}[p(v), q(w)] \;\vee \bigwedge_{v' = \mathrm{child}(v)} \mathrm{CONTAINS}[p(v'), q(w)] \;\vee \bigvee_{w' = \mathrm{child}(w)} \mathrm{CONTAINS}[p(v), q(w')]$$

  - Second term: “//” maps to empty path; third term: “//” maps to path >= 2

Basic Algorithms: Pattern Containment and Minimization (cont.)
- Theorem: Our CONTAINS[p, q] algorithm determines whether p contains q in O(|p|*|q|) time
- Tree-pattern minimization: we are interested in patterns with minimal no. of nodes; want to eliminate “redundant” sub-trees
- Algorithm MINIMIZE[p]: minimize pattern p by recursive, top-down applications of the CONTAINS[] algorithm
- Theorem: Our MINIMIZE[p] algorithm minimizes the tree pattern p in O(|p|^2) time
- Example: a sub-pattern that contains the left-child sub-pattern can be eliminated without changing pattern semantics
[Figure: a tree pattern over the labels “/.”, “a”, “//”, “b”, “c” before and after minimization]
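A minimal Python sketch of the containment recurrence and of top-down minimization may help make the two slides above concrete. It is an illustration, not the authors' implementation. The `(label, children)` tuple encoding, the helper `pat`, and the tie-breaking rule for equivalent siblings are all assumptions, and the recurrence is read as: every child sub-pattern of v must contain some child sub-pattern of w.

```python
from functools import lru_cache

def pat(label, *children):
    """Hypothetical pattern encoding: a hashable (label, children) tuple."""
    return (label, tuple(children))

RANK = {"//": 2, "*": 1}  # plain tags (and "/.") rank 0

def at_least_as_general(a, b):
    # "//" >= "*" >= any tag; a label is also as general as itself
    return a == b or RANK.get(a, 0) > RANK.get(b, 0)

@lru_cache(maxsize=None)  # memoization gives the dynamic-programming table
def contains(v, w):
    """True if the sub-pattern rooted at v contains the one rooted at w."""
    v_label, v_kids = v
    w_label, w_kids = w
    result = at_least_as_general(v_label, w_label) and all(
        any(contains(vc, wc) for wc in w_kids) for vc in v_kids
    )
    if not result and v_label == "//":
        result = (
            all(contains(vc, w) for vc in v_kids)      # "//" maps to empty path
            or any(contains(v, wc) for wc in w_kids)   # "//" maps to longer path
        )
    return result

def minimize(p):
    """Drop every child sub-pattern that contains one of its siblings."""
    label, kids = p
    kept = []
    for i, c in enumerate(kids):
        redundant = any(
            j != i and contains(c, d)
            # if c and d are equivalent, keep only the earlier sibling
            and (not contains(d, c) or j < i)
            for j, d in enumerate(kids)
        )
        if not redundant:
            kept.append(minimize(c))
    return (label, tuple(kept))

general = pat("/.", pat("a", pat("//", pat("c"))))
specific = pat("/.", pat("a", pat("b", pat("c"))))
redundant = pat("/.", pat("a", pat("//", pat("c")), pat("b", pat("c"))))
print(contains(general, specific), minimize(redundant) == specific)
```

In the last example the branch “//” → “c” is implied by the sibling branch “b” → “c”, so it is the more general branch that is removed.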

Basic Algorithms: Least-Upper-Bound (LUB) Computation
- Given tree patterns p and q (in general, a set of patterns), we want to find the most precise/specific tree pattern containing both p and q
  - Least-Upper-Bound of p, q: LUB(p,q) = tightest generalization of p, q
- Shown that LUB(p,q) exists and is unique (up to pattern equivalence)
  - Straightforward generalization to any set of input tree patterns
- Proposed an algorithm for LUB computation
  - Makes use of our pattern containment and minimization algorithms
  - Similar dynamic-programming flavor as our CONTAINS[] procedure, but somewhat more complicated
  - Need to keep track of several possible container sub-patterns
- Details of LUB algorithm in the paper ...

Quantifying Precision Loss: Pattern Selectivities
- Consider aggregated pattern p that generalizes a set of patterns S (i.e., p contains q for each $q \in S$)
- Want to quantify the “loss in precision” when using p instead of S
  - Selectivity(p) = fraction of incoming documents matching p
  - Selectivity(S) = fraction of documents matching any $q \in S$
  - Clearly, Selectivity(p) >= Selectivity(S)
  - Difference = fraction of “false positives” induced by the aggregate p
  - Loss of precision due to aggregation = Selectivity(p) - Selectivity(S)
- Idea: use document distribution statistics to estimate selectivities and quantify precision loss during tree-pattern aggregation
  - Cannot afford to keep the entire document distribution!
  - Use coarse statistics (“Document Tree” synopsis) computed on-the-fly over the streaming XML documents
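The definitions on this slide can be checked on a toy sample. The sketch below assumes that the set of documents matched by each pattern is already known; the document ids and match sets are made up for illustration. It computes the precision loss exactly, which is the quantity the synopsis-based approach only estimates.

```python
from fractions import Fraction

def selectivity(matching_docs, all_docs):
    """Fraction of documents that match."""
    return Fraction(len(matching_docs), len(all_docs))

def precision_loss(aggregate_matches, subscription_matches, all_docs):
    """Return (Selectivity(p) - Selectivity(S), set of false positives).

    aggregate_matches: documents matching the aggregate pattern p
    subscription_matches: one set of matching documents per q in S
    """
    matched_by_s = set().union(*subscription_matches)  # documents matching ANY q
    false_positives = aggregate_matches - matched_by_s
    loss = selectivity(aggregate_matches, all_docs) - selectivity(matched_by_s, all_docs)
    return loss, false_positives

# Hypothetical sample of five documents
docs = {"d1", "d2", "d3", "d4", "d5"}
q_matches = [{"d1"}, {"d1", "d2"}]   # original subscriptions S
p_matches = {"d1", "d2", "d4"}       # aggregate p, a superset since p generalizes S
loss, extra = precision_loss(p_matches, q_matches, docs)
print(loss, extra)
```

Here Selectivity(p) = 3/5 and Selectivity(S) = 2/5, so the loss of 1/5 corresponds to the single false positive.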

The Document-Tree Synopsis
- Compute summary of path-distribution characteristics as documents are streaming by
- Document-Tree Synopsis = label paths with frequency counts (indicating no. of documents containing that path)
- Construction
  - Identify distinct document paths: build the Skeleton Tree of each XML document, which contains all distinct label paths in the document (coalesce same-tag siblings)
  - Install all Skeleton-Tree paths in the Document-Tree synopsis: trace each path from the root of the synopsis, increasing the frequency counts and adding new nodes where necessary
[Figure: an XML document and its Skeleton Tree]

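Example Document-Tree Synopsis
- Merge low-frequency nodes for further compression
[Figure: three XML documents over the tags a, b, c, d, x; their Document-Tree synopsis rooted at “/.” with frequency counts 3, 2, 1; and a compressed synopsis with “*” nodes and counts 3, 1.5, 2.3]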
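The construction steps can be sketched as follows. This is a simplified illustration under assumptions: documents are hypothetical `(tag, children)` tuples, the synopsis is kept as a flat counter keyed by label path rather than as a tree, and the merging of low-frequency nodes is not shown.

```python
from collections import Counter
from fractions import Fraction

def doc(tag, *children):
    """Hypothetical document encoding: (tag, children)."""
    return (tag, tuple(children))

def skeleton_paths(d, prefix=()):
    """Distinct label paths of a document; same-tag siblings coalesce
    automatically because the result is a set."""
    tag, children = d
    path = prefix + (tag,)
    paths = {path}
    for c in children:
        paths |= skeleton_paths(c, path)
    return paths

class DocumentTreeSynopsis:
    def __init__(self):
        self.freq = Counter()  # label path -> number of documents containing it
        self.n_docs = 0

    def add(self, d):
        """Install all skeleton paths of one streamed document."""
        self.n_docs += 1
        self.freq.update(skeleton_paths(d))  # each path counted once per document

    def selectivity(self, path):
        """Fraction of documents seen so far that contain the label path."""
        return Fraction(self.freq[tuple(path)], self.n_docs)

synopsis = DocumentTreeSynopsis()
for d in (
    doc("a", doc("b"), doc("b"), doc("d", doc("x"))),
    doc("a", doc("b"), doc("c")),
    doc("a", doc("d")),
):
    synopsis.add(d)
print(synopsis.freq[("a",)], synopsis.selectivity(("a", "b")))
```

Counting each path once per document, rather than once per element, matches the slide's definition of the frequency counts.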

Estimating Pattern Selectivities
- Problem is different from traditional XML selectivity estimation
  - Want selectivity at the level of documents rather than XML elements
- For patterns that are simple label paths (no branching or wildcards), get the selectivity directly from the synopsis
- For branching label paths: assume independence at branch points
  - Selectivity = $\prod$ (individual branch selectivities)
- Selectivity(set of patterns S) = Selectivity($\bigvee_{q \in S} q$)
  - Summing all q selectivities can overestimate (overlap!)
  - We define: Selectivity(S) = $\max_{q \in S}$ { Selectivity(q) } (like “fuzzy-OR”)
- Same idea for handling wildcards
  - Max. over all possible wildcard instantiations
[Figure: example synopsis over the tags a, b, c, d, x with frequency counts 3, 2, 1, and two example patterns, one with Selectivity = (2/3)*(2/3) = 4/9 and one with Selectivity = 2/3]

Selectivity Estimation Algorithm
- Estimate selectivity of pattern p over document-tree synopsis T
- Apply our estimation model in a Dynamic-Programming recurrence
  - p(v) = sub-pattern rooted at node v of p; t = node of T

$$\mathrm{SEL}[p(v), t] = \prod_{v' = \mathrm{child}(v)} \; \max_{t' = \mathrm{child}(t)} \{ \mathrm{SEL}[p(v'), t'] \}$$

- If tag(v) = “//” then

$$\mathrm{SEL}[p(v), t] = \max \Big\{ \mathrm{SEL}[p(v), t],\;\; \mathrm{SEL}[p(v'), t],\;\; \max_{t' = \mathrm{child}(t)} \{ \mathrm{SEL}[p(v), t'] \} \Big\}, \quad v' = \mathrm{child}(v)$$

  - Second term: “//” maps to empty path; third term: “//” maps to path >= 2
- Estimate tree-pattern selectivity in O(|p|*|T|) time
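One way to read the estimation model is sketched below in Python. It is an interpretation under assumptions, not the algorithm from the paper: the synopsis is a hypothetical tree of `(tag, count, children)` tuples whose root count is the number of documents, a leaf pattern node takes its selectivity from the count of the synopsis node it maps to, branches are multiplied (independence), and wildcard choices are resolved with `max`.

```python
from fractions import Fraction
from functools import lru_cache
from math import prod

def pat(label, *children):
    return (label, tuple(children))          # hypothetical pattern node

def syn(tag, count, *children):
    return (tag, count, tuple(children))     # hypothetical synopsis node

def estimate(pattern, synopsis):
    n_docs = synopsis[1]  # assumed: root "/." count = number of documents

    @lru_cache(maxsize=None)  # one table entry per (pattern node, synopsis node)
    def sel(v, t):
        label, kids = v
        tag, count, t_kids = t

        def mapped_to_t():
            if not kids:  # leaf: selectivity read directly from the synopsis
                return Fraction(count, n_docs)
            # independence at branch points, best synopsis child per branch
            return prod(max((sel(c, tc) for tc in t_kids), default=0) for c in kids)

        if label == "//":
            return max(
                mapped_to_t(),                                   # "//" maps to t
                prod(sel(c, t) for c in kids) if kids else 0,    # empty path
                max((sel(v, tc) for tc in t_kids), default=0),   # longer path
            )
        if label not in ("*", tag):
            return Fraction(0)
        return mapped_to_t()

    return sel(pattern, synopsis)

# Toy synopsis over 3 documents: a(3) with children b(2), d(2) -> x(1), c(1)
T = syn("/.", 3, syn("a", 3, syn("b", 2), syn("d", 2, syn("x", 1)), syn("c", 1)))
print(estimate(pat("/.", pat("a", pat("b"))), T))              # simple path a/b
print(estimate(pat("/.", pat("a", pat("b"), pat("d"))), T))    # branching a[b][d]
```

On this toy synopsis the simple path has selectivity 2/3 and the two-branch pattern has (2/3)*(2/3) = 4/9, the same arithmetic as the slide's worked values.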

Selectivity-based Pattern Aggregation
Algorithm AGGREGATE( S , k )   // S = set of tree patterns; k = space bound
  Initialize S’ = S
  while ( $\sum_{p \in S'} |p| > k$ ) do
    C = candidate aggregate patterns generated using LUB computations & node pruning on patterns in S’
    Select pattern x in C such that BENEFIT(x) is maximized
    S’ = S’ + { x } - { p in S’ that are contained in x }
- BENEFIT(x) based on marginal gain: maximize the gain in space per unit of “precision loss” (let c(x) = { p in S’ that are contained in x })

$$\mathrm{BENEFIT}(x) = \frac{\sum_{p \in c(x)} |p| \; - \; |x|}{\mathrm{Selectivity}(x) - \mathrm{Selectivity}(c(x))}$$
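The greedy loop can be illustrated on a deliberately simplified model. Everything specific in this sketch is an assumption made for the example: patterns are linear label paths (tuples of tags), candidates come only from pruning the last node (standing in for the LUB-based candidate generation), containment is the prefix test, and the path counts are invented.

```python
from fractions import Fraction

# Hypothetical statistics: number of documents (out of 10) containing each path
N_DOCS = 10
PATH_COUNTS = {("a",): 10, ("a", "b"): 4, ("a", "b", "c"): 2,
               ("a", "b", "d"): 2, ("a", "x"): 5}

def selectivity(path):
    return Fraction(PATH_COUNTS.get(path, 0), N_DOCS)

def set_selectivity(paths):
    # "fuzzy-OR": maximum over the member selectivities
    return max((selectivity(p) for p in paths), default=Fraction(0))

def contains(x, p):
    # toy containment: a path generalizes every path it is a prefix of
    return p[:len(x)] == x

def candidates(current):
    # toy candidate generation: prune the last node of each pattern
    return {p[:-1] for p in current if len(p) > 1}

def benefit(x, current):
    covered = {p for p in current if contains(x, p)}
    gain = sum(len(p) for p in covered) - len(x)           # space saved
    loss = selectivity(x) - set_selectivity(covered)       # precision lost
    if gain <= 0:
        return None, covered
    return (float("inf") if loss <= 0 else gain / loss), covered

def aggregate(patterns, k):
    current = set(patterns)
    while sum(len(p) for p in current) > k:
        scored = [(benefit(x, current), x) for x in candidates(current)]
        scored = [(b, cov, x) for (b, cov), x in scored if b is not None]
        if not scored:
            break  # no candidate reduces the total size any further
        _, covered, best = max(scored, key=lambda s: s[0])
        current = (current - covered) | {best}
    return current

S = {("a", "b", "c"), ("a", "b", "d"), ("a", "x")}
print(aggregate(S, 5))
```

With k = 5, the candidate ("a", "b") saves 4 nodes for a loss of 0.2 (benefit 20), while ("a",) saves 7 nodes for a loss of 0.5 (benefit 14), so the loop picks ("a", "b") and stops at total size 4.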

Experimental Study
- Our selectivity-based aggregation algorithm (AGGR) against a “naive” generalization algorithm based on node pruning (PRUNE)
  - PRUNE: delete “prunable” nodes with highest frequencies from patterns
- Key metrics
  - Selectivity loss (due to aggregation) = (#False matches) / (#Documents not matching any of the original patterns)
  - Filtering speed
- XML documents and tree patterns generated using IBM’s XML generator tool with the XHTML and NITF DTDs
  - Used Zipfian parameters to inject skew into document and/or pattern tags
  - 1,000 documents used to “learn” the document-tree synopsis, another 1,000 to measure algorithm performance
  - 10,000 tree patterns, max. height = 10, Prob[branch] = Prob[wildcard] = .1 (>= 100,000 tree nodes)

Skewed Data

Skewed Patterns

Skewed Patterns & Skewed Data

Filtering Speed (XTrie Index)

Conclusions
- Introduced the Tree-Pattern Aggregation problem
  - Crucial for building scalable XML-based Pub/Sub systems
- Novel, selectivity-based pattern-aggregation algorithm
  - LUB computations & coarse document statistics to compute “precise” aggregates
  - Selection of aggregates based on marginal gains
- Basic algorithmic tools may be of independent interest
  - E.g., XML query optimization
- Experimental validation with real-life DTDs
- Future
  - Build more accurate document statistics on the fly?
  - Increasing the expressiveness of subscription model (e.g., value predicates)

Thank you!