Tree-Pattern Similarity Estimation for Scalable Content-based Routing

Slides:

Advertisements

Similar presentations

Computing Structural Similarity of Source XML Schemas against Domain XML Schema Jianxin Li 1 Chengfei Liu 1 Jeffrey Xu Yu 2 Jixue Liu 3 Guoren Wang 4 Chi.

Advertisements

XML: Extensible Markup Language

Alex Cheung and Hans-Arno Jacobsen August, 14 th 2009 MIDDLEWARE SYSTEMS RESEARCH GROUP.

Counting Distinct Objects over Sliding Windows Presented by: Muhammad Aamir Cheema Joint work with Wenjie Zhang, Ying Zhang and Xuemin Lin University of.

XML and Enterprise Computing. What is XML? Stands for “Extensible Markup Language” –similar to SGML and HTML –document “tags” are used to define content.

3/13/2012Data Streams: Lecture 161 CS 410/510 Data Streams Lecture 16: Data-Stream Sampling: Basic Techniques and Results Kristin Tufte, David Maier.

File Processing : Hash 2015, Spring Pusan National University Ki-Joune Li.

New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007.

Paper by: A. Balmin, T. Eliaz, J. Hornibrook, L. Lim, G. M. Lohman, D. Simmen, M. Wang, C. Zhang Slides and Presentation By: Justin Weaver.

Selective Dissemination of Streaming XML By Hyun Jin Moon, Hetal Thakkar.

DYNAMIC ELEMENT RETRIEVAL IN A STRUCTURED ENVIRONMENT MAYURI UMRANIKAR.

Computer Science Spatio-Temporal Aggregation Using Sketches Yufei Tao, George Kollios, Jeffrey Considine, Feifei Li, Dimitris Papadias Department of Computer.

Estimating Set Expression Cardinalities over Data Streams Sumit Ganguly Minos Garofalakis Rajeev Rastogi Internet Management Research Department Bell Labs,

XML –Query Languages, Extracting from Relational Databases ADVANCED DATABASES Khawaja Mohiuddin Assistant Professor Department of Computer Sciences Bahria.

XML Primer. 2 History: SGML vs. HTML vs. XML SGML (1960) XML(1996) HTML(1990) XHTML(2000)

P2P Course, Structured systems 1 Introduction (26/10/05)

Ecole Polytechnique Fédérale de Lausanne, Switzerland Efficient processing of XPath queries with structured overlay networks Gleb Skobeltsyn, Manfred Hauswirth,

CS 580S Sensor Networks and Systems Professor Kyoung Don Kang Lecture 7 February 13, 2006.

1 Advanced Topics XML and Databases. 2 XML u Overview u Structure of XML Data –XML Document Type Definition DTD –Namespaces –XML Schema u Query and Transformation.

Hashed Samples Selectivity Estimators for Set Similarity Selection Queries.

By Ravi Shankar Dubasi Sivani Kavuri A Popularity-Based Prediction Model for Web Prefetching.

Event-Condition-Action Rule Languages over Semistructured Data George Papamarkos.

The main mathematical concepts that are used in this research are presented in this section. Definition 1: XML tree is composed of many subtrees of different.

Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.

Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Standford University VLDB2002.

Electronic Commerce COMP3210 Session 4: Designing, Building and Evaluating e-Commerce Initiatives – Part II Dr. Paul Walcott Department of Computer Science,

Querying Structured Text in an XML Database By Xuemei Luo.

IPDPS 2005, slide 1 Automatic Construction and Evaluation of “Performance Skeletons” ( Predicting Performance in an Unpredictable World ) Sukhdeep Sodhi.

BNCOD07Indexing & Searching XML Documents based on Content and Structure Synopses1 Indexing and Searching XML Documents based on Content and Structure.

Web Technologies COMP6115 Session 4: Adding a Database to a Web Site Dr. Paul Walcott Department of Computer Science, Mathematics and Physics University.

New Sampling-Based Summary Statistics for Improving Approximate Query Answers Yinghui Wang

MIDDLEWARE SYSTEMS RESEARCH GROUP Adaptive Content-based Routing In General Overlay Topologies Guoli Li, Vinod Muthusamy Hans-Arno Jacobsen Middleware.

CS 157B: Database Management Systems II February 11 Class Meeting Department of Computer Science San Jose State University Spring 2013 Instructor: Ron.

ICDCS Beijing China Routing of XML and XPath Queries in Data Dissemination Networks Guoli Li, Shuang Hou Hans-Arno Jacobsen Middleware Systems Research.

Space-Efficient Online Computation of Quantile Summaries SIGMOD 01 Michael Greenwald & Sanjeev Khanna Presented by ellery.

1 Approximate XML Query Answers Presenter: Hongyu Guo Authors: N. polyzotis, M. Garofalakis, Y. Ioannidis.

Author: Haoyu Song, Murali Kodialam, Fang Hao and T.V. Lakshman Publisher/Conf. : IEEE International Conference on Network Protocols (ICNP), 2009 Speaker:

Database Management Systems, R. Ramakrishnan 1 Algorithms for clustering large datasets in arbitrary metric spaces.

Working with XML. Markup Languages Text-based languages based on SGML Text-based languages based on SGML SGML = Standard Generalized Markup Language SGML.

XCluster Synopses for Structured XML Content Alkis Polyzotis (UC Santa Cruz) Minos Garofalakis (Intel Research, Berkeley)

APEX: An Adaptive Path Index for XML data Chin-Wan Chung, Jun-Ki Min, Kyuseok Shim SIGMOD 2002 Presentation: M.S.3 HyunSuk Jung Data Warehousing Lab. In.

More SQL: Complex Queries, Triggers, Views, and Schema Modification

Unit 4 Representing Web Data: XML

By A. Aboulnaga, A. R. Alameldeen and J. F. Naughton Vldb’01

Tree-Pattern Aggregation for Scalable XML Data Dissemination

Updating SF-Tree Speaker: Ho Wai Shing.

Indexing & querying text

Distributed database approach,

A paper on Join Synopses for Approximate Query Answering

DATA STRUCTURES AND OBJECT ORIENTED PROGRAMMING IN C++

The Variable-Increment Counting Bloom Filter

Database System Implementation CSE 507

Distance Computation “Efficient Distance Computation Between Non-Convex Objects” Sean Quinlan Stanford, 1994 Presentation by Julie Letchner.

Efficient Filtering of XML Documents with XPath Expressions

RE-Tree: An Efficient Index Structure for Regular Expressions

Structure and Value Synopses for XML Data Graphs

Probabilistic Data Management

Query-Friendly Compression of Graph Streams

Chapter 7 Representing Web Data: XML

Random Sampling over Joins Revisited

Indexing and Hashing Basic Concepts Ordered Indices

Towards an Internet-Scale XML Dissemination Service

Structure and Content Scoring for XML

Feifei Li, Ching Chang, George Kollios, Azer Bestavros

Range-Efficient Computation of F0 over Massive Data Streams

Database Design and Programming

2018, Spring Pusan National University Ki-Joune Li

Structure and Content Scoring for XML

A Semantic Peer-to-Peer Overlay for Web Services Discovery

Lu Tang , Qun Huang, Patrick P. C. Lee

Presentation transcript:

Tree-Pattern Similarity Estimation for Scalable Content-based Routing R. Chand University of Geneva Raphael.Chand@cui.unige.ch http://cui.unige.ch/ Based on joint work with: P. Felber (University of Neuchatel) M. Garofalakis (Yahoo Research)

Content-based publish/subscribe Consumers register subscriptions Producers publish events (messages) Messages are routed to interested consumers Interested  message matches subscription Matching based on the content of messages Stock Quotes Symbol = LU and Price ≥ 10 Weather Forecast City = Nice Volume > 100,000 Internet Symbol: LU Price: $10 Volume: 101,000 City: Nice Weather: Sunny Temp: 24ºC 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

Broker-based approach Fixed infrastructure of reliable brokers (Subset of) subscriptions stored at brokers in routing tables Typically takes advantage of “containment” relationship Filtering engine matches message against subscriptions to determine next hop(s) Cons: dedicated infrastructure, large routing tables, complex filtering algorithms 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

P2P approach Gather consumers in semantic communities according to interests (subscriptions) Disseminate messages in community & stop when reaching boundaries Pros: broker-less, space- efficient, low filtering cost Challenge: identify subscription proximity “are two distinct subscriptions likely to match the same set of documents?” 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

Problem statement Given compute the similarity between p and q S: valid tree patterns (subscriptions) D: valid documents p, q  S compute the similarity between p and q p ~ q ~: S2  [0,1] (probability that p matches the same subset of D as q) Algorithms use H  D: historical data about document stream k: space bound 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

Basic approach Summarize the document stream Synopsis maintained incrementally Accurate yet compact (compression, pruning) Evaluate selectivity of tree pattern using synopsis Recursive algorithm matches TP against synopsis Estimate similarity using various metrics Similarity computed from selectivity 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

1. Document-tree synopsis Maintain a concise, accurate synopsis HS Built on-line as documents stream by Captures the path distribution of documents in H Captures cross-pattern correlations in the stream p, q match the same documents (not just the same number) Allows us to estimate the fraction of documents matching different patterns 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

1. Document-tree synopsis Document-tree synopsis: tree with paths labeled with matching sets (documents containing path) Summary of path-distribution characteristics of documents Adding a document to the synopsis: Trace each path from the root of the synopsis, updating the matching sets and adding new nodes where necessary 1 3 Synopsis: x x x a b c d {1,2,3} {2,3} {1,3} {1} {2} /. 2 a a b x a a b b c c d a b c b d a c d a a d XML documents: c d 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

1. Matching set compression Problem: cannot maintain full matching set With N documents: O(N) Approach 1: only maintain document count Independence assumption unrealistic (no cross-pattern correlation) P(S1) = 2/3 * 1/3 = 2/9 vs. 0 P(S2) = 2/3 * 2/3 = 4/9 vs. 2/3 /. x #3 // // a b #3 #3 S1 S2 b a b b c d a d #2 #2 #2 #1 #3 a d d a c d #1 #2 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

1. Matching set compression Approach 2: use fixed-size sample sets Keep uniform sample of s documents [Vitter’s reservoir-sampling scheme] P(kth document in synopsis) = min(1,s/k) Once replaced, document ID deleted from whole tree Sampling rate decided uniformly over all nodes  Inefficient utilization of the space budget  Poor estimates 1 1 1 1 4/5 2/3 4/7 1/2 4/9 Document stream 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 11 12 Sample set 1 5 2 7 3 4 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

1. Matching set compression Approach 3: use per-node hash samples Gibbons’ distinct sampling: hash function maps document IDs on logarithmic range of levels Pr[h(x)≥l] = 1/2l Hash samples start at level l=0, keep d  h(d)≥l Once sample is full, increment level and “sub-sample” Fine sampling granularity, keep low frequency paths  Much better estimates  Good utilization of the space budget 1 2 3 Hash sample 1 2 3 4 l=0 4 5 6 7 8 l=0 l=1 l=2 3 … 5 2 8 4 l=1 1/2 3/4 7/8 1 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

1. Matching set compression Approach 3: (cont’d) Computing union/intersection: sub-sample lower level to higher prior to union/intersection, then possibly once more Estimate cardinality of sample with n elements: n2l Only need to store document ID in hash samples at final nodes of incoming paths Matching set of parent can be reconstructed by recursively unioning those of descendants  Reduced memory requirements /. x { } 1 x x a b { } { } a a b a b b c d a d {1,3} {2,3} { } {1} {1,2,3} b c c d b c d c d {2} {2,3} 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

1. Synopsis pruning Synopsis may grow very large (due to path diversity)  Prune nodes with little influence on selectivity estimation 1. Merge same-label leaf nodes with high similarity 2. Fold leaf nodes in parent with high similarity 3. Delete low-cardinality nodes Similar: S(t)  S(t’) / S(t)  S(t’)  1 1 3 2 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

2. Evaluate selectivity Recursive algorithm matches TP against synopsis u: node of tree pattern p v: node of synopsis ≥: at least as general (// ≥ * ≥ a)  No matching path Found matching path for rootu in synopsis  Return matching set Leaf of synopsis, not leaf of TP  No matching path Look for any path in synopsis for each branch of TP For Counters max * Maps // to empty path… …or to path of length ≥2 Selectivity is # of matching documents / # total documents 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

3. Estimate similarity Metrics to estimate similarity using selectivity Conditional probability of p given q (if p and q match the same set of documents as q alone, then p ~ q) Symmetrical conditional probability Ratio of joint to union probability (also symmetric) P(pq) computed by merging root nodes of p and q 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

Evaluation: setup NITF and xCBL DTDs D: 10,000 documents with approx. 100 tag pairs, 10 levels Sp: 1,000 “positive” TPs (some match in D) * (10%), // (10%), branches (10%), ≤10 levels, Zipf skew (1) Sn: 1,000 “negative” TPs (no match in D) Synopses with 3 variants for matching sets Different space budgets (sizes of matching sets, compression degrees for pruning) Compare result of proximity metrics with exact value computed from sets of matching documents 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

Evaluation: error metrics Let P(p): exact selectivity of p P’(p): our estimate of the selectivity of p Mi(p,q): exact proximity of p and q using metric Mi M’i(p,q): our estimate of the proximity of p and q using Mi Positive error: Negative error: Metrics error: 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

Positive error vs. MS size Hashes outperforms other approaches in terms of accuracy Less than 5% with 1,000 entries 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

Negative error vs. MS size Hashes also outperforms other approaches (no error with xCBL for Hashes & Sets) 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

Positive error vs. synopsis size For a given space budget, Hashes is the most accurate (after some threshold) Hashes becomes more accurate than Counters 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

Error of proximity metrics Hashes produces the best estimates 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

Error vs. compression ratio Error remains small even for relatively high compression degrees Less than 15% error with 1:5 compression 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

Conclusion Goal: semantic communities for publish/subscribe Problem: estimate similarity of (seemingly unrelated) tree patterns Similarity metric is very accurate and consistent Other usages (e.g., approximate XML queries) 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

Thank You! 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

Context: XML (messages) Extensible Markup Language: universal interchange (meta-)language, standard, semi-structured Type/structure (tags, defined by DTD or schema) vs. content (values, data associated with tags) Well-formed: syntactically correct Valid: matches DTD or schema XML documents : single-rooted trees <quotes> <stock> <name>Lucent Tech.</name> <symbol>LU</symbol> <price>10</price> </stock> <name>Cisco Systems, Inc.</name> <symbol>CSCO</symbol> <price>17</price> </quotes> Start/End Tags (properly nested) quotes stock stock Character Data name symbol price name symbol price Lucent LU 10 Cisco CSCO 17 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

Context: XPath (subscriptions) Simple language: navigate/select parts of XML tree XPath Expression: sequence of node tests, child (/), descendant (//), wildcard (*), qualifiers ([...]) Constraints on structure and content of messages Using qualifiers, define tree pattern: specifies existential condition for paths with conjunctions at branching nodes XPath fragment, binary output: selection ⇒ match quotes // * // quotes stock price stock stock stock stock symbol symbol symbol price name symbol price name symbol price /quotes/stock/symbol =LU =LU >15 Lucent LU 10 Cisco CSCO 17 //price /*/stock[symbol=“LU”] //stock[price>15][symbol=“LU”] 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

Document-tree synopsis 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

Tree patterns and XML tree 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

1. Document-tree synopsis Document-tree synopsis: tree with paths labeled with matching sets (IDs of documents containing that path) Summary of path-distribution characteristics of documents Adding a document to the synopsis: I) Identify distinct document paths  skeleton-tree x Coalesce same-tag siblings x a a b a b b c c d b c d XML document Skeleton-tree 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand

1. Document-tree synopsis II) Install all skeleton-tree paths in synopsis Trace each path from the root of the synopsis, updating the matching sets and adding new nodes where necessary 1 3 Synopsis: x x x a b c d {1,2,3} {2,3} {1,3} {1} {2} /. 2 a a b x a a b b c c d a b c b d a c d a a d XML documents: c d 5/10/2019 Tree-Pattern Similarity Estimation for Scalable Content-based Routing — R. Chand