Correlating XML Data Streams Using Tree-Edit Distance Embeddings Minos Garofalakis & Amit Kumar Internet Management Research Department Bell Labs, Lucent.

Presentation transcript:

Correlating XML Data Streams Using Tree-Edit Distance Embeddings Minos Garofalakis & Amit Kumar Internet Management Research Department Bell Labs, Lucent Technologies

2 Talk Outline
• Introduction & Motivation
  – Relational & XML stream-query processing
  – The problem
• Brief description of our idea & contributions
• Embedding algorithm for XML trees
• Applications to XML stream queries
  – Sketching a massive, streaming XML data tree
  – Similarity joins over streaming XML data sources
• Conclusions

3 Query Processing over Data Streams
• Requirements for stream query processing
  – Single Pass: Each record is examined at most once, in fixed (arrival) order
  – Small Space: Log or poly-log in data stream size
  – Real-time: Per-record processing time must be low
• Example Application: Network Management
• A data stream is a massive, continuous sequence of data records
[Figure labels: IP Network; Measurements/Alarms; Network Operations Center (NOC); Data Streams R1, R2; Query Q; Stream Query Processing Engine; (Approximate) Answer. Example Data-Stream Join: SELECT COUNT(*) FROM R1, R2 WHERE R1.A = R2.B]

4 Processing Relational Data Streams
• Lots of interesting work and results based on the idea of pseudo-random sketches
  – Essentially, randomized linear projections of the data distribution
[Figure labels: Stream R(A,B); Atomic Sketch]
• [...] are random variates from appropriate distribution
• Trivial to maintain over stream: Add [...] whenever tuple (i, j) is seen
• Only O(log N) space (instead of [...])
• Surprisingly powerful idea: Use a small number of sketches to
  – Estimate stream norms, build approximate histograms/wavelet coefficients, ...
  – Estimate join aggregates AGGR(R ⋈ S) (using sketches of R and S)
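The atomic-sketch idea on this slide can be illustrated with a small Python sketch. It is a simplified, one-attribute illustration rather than the construction from the talk: the ±1 variates come from a SHA-256 hash (an assumption standing in for the limited-independence generators used in the sketching literature), the names `xi`, `AtomicSketch`, and `estimate_join_size` are hypothetical, and a plain average over copies replaces the usual median-of-means boosting.

```python
import hashlib
from statistics import mean


def xi(seed: int, value) -> int:
    """Pseudo-random +1/-1 variate for a domain value (hash-based stand-in)."""
    digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
    return 1 if digest[0] & 1 else -1


class AtomicSketch:
    """Maintains X = sum_i f(i) * xi_i as a single counter."""

    def __init__(self, seed: int):
        self.seed = seed
        self.x = 0

    def update(self, value, count: int = 1) -> None:
        # count = +1 for an arriving tuple, -1 for a deletion.
        self.x += count * xi(self.seed, value)


def estimate_join_size(r_stream, s_stream, copies: int = 64):
    """Single pass over each stream; averages X_R * X_S over sketch copies."""
    r_sketches = [AtomicSketch(seed) for seed in range(copies)]
    s_sketches = [AtomicSketch(seed) for seed in range(copies)]
    for value in r_stream:
        for sketch in r_sketches:
            sketch.update(value)
    for value in s_stream:
        for sketch in s_sketches:
            sketch.update(value)
    # Paired sketches share a seed, so cross terms cancel in expectation.
    return mean(a.x * b.x for a, b in zip(r_sketches, s_sketches))
```

Because each sketch is a linear function of the frequency vector, a deletion is just an update with a negative count, and the space used is one counter per copy regardless of stream length.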

5 Processing XML Data Streams
• XML: Much richer, (semi)structured data model
  – Ordered, node-labeled data trees
• Bulk of work on XML streaming: Content-based filtering of XML documents (publish/subscribe systems)
  – Quickly match incoming documents against standing XPath subscriptions (X/Yfilter, Xtrie, etc.)
• Essentially, simple selection queries over a stream of XML records!
• No work on more complex XML stream queries
  – For example, queries trying to correlate different XML data streams

6 Processing XML Data Streams (cont.)
• Example XML stream correlation query: Similarity-Join
  – Degree of content similarity between streaming XML sources
  – |SimJoin(S1, S2)| = |{(T1,T2) ∈ S1 × S2 : dist(T1,T2) ≤ [...]}|
• Correlation metric: Tree-edit distance ed(T1,T2)
  – Node relabels, inserts, deletes - also, allow for subtree moves
  – Different data representation for same information (DTDs, optional elements)
[Figure: example trees T1 (Web Source 1) and T2 (Web Source 2) with node labels publication, book, authors, author, paper, title; one edit is marked "delete"]

7 How About Relational Techniques/Sketches?
• Randomized linear projections (a.k.a. sketches) are useful for points over a numeric vector space
  – Not structured objects over a complex metric space (tree-edit distance)
• Our idea: Build a low-distortion embedding of streaming XML and the tree-edit distance metric in a multi-d normed vector space
• Given such an embedding, sketching techniques now become relevant in the streaming XML context!
  – E.g., use sketches to produce synopses of the data distribution in the image vector space
[Figure: map V() takes trees T1, T2, T3, with pairwise distances ed(T1,T2), ed(T1,T3), ed(T2,T3), to vectors V(T1), V(T2), V(T3). Distances are (approximately) preserved in the image vector space!!]

8 Our Approach & Contributions
• Construct low-distortion embedding for tree-edit distance over streaming XML documents -- Requirements:
  – Small space/time
  – Oblivious: Can compute V(T) independent of other trees in the stream(s)
  – Bourgain’s Lemma is inapplicable!
• First algorithm for low-distortion, oblivious embedding of the tree-edit distance metric in small space/time
  – Fully deterministic, embed into L1 vector space
  – Bound of [...] on distance distortion for trees with n nodes

9 Our Approach & Contributions (cont.)
• Applications in XML stream query processing
  – Combine our embedding with existing pseudo-random sketching techniques
  – Build a small-space sketch synopsis for a massive, streaming XML data tree
    Concise surrogate for tree-edit distance computations
  – Approximating tree-edit distance similarity joins over XML streams in small space/time
• First algorithmic results on correlating XML data in the streaming model
• Other important algorithmic applications for our embedding result
  – Approximate tree-edit distance in (near-linear) time

10 Our Embedding Algorithm
• Key Idea: Given an XML tree T, build a hierarchical parsing structure over T by intelligently grouping nodes and contracting edges in T
  – At parsing level i: T(i) is generated by grouping nodes of T(i-1) (T(0) = T)
  – Each node in the parsing structure (T(i), for all i = 0, 1, ...) corresponds to a connected subtree of T
  – Vector image V(T) is basically the characteristic vector of the resulting multiset of subtrees (in the entire parsing structure)
    V(T)[x] = no. of times subtree x appears in the parsing structure for T
• Our parsing guarantees
  – O(log|T|) parsing levels (constant-fraction reduction at each level)
  – V(T) is very sparse: Only O(|T|) non-zero components in V(T)
    Even though dimensionality = [...] ([...] = label alphabet)
    Allows for effective sketching
  – V(T) is constructed in time [...]
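To make the definition V(T)[x] = "number of times subtree x appears in the parsing structure" concrete, the following Python sketch stores V(T) sparsely and compares two images under the L1 norm. The parsing levels are written by hand as canonical strings purely for illustration; they are assumptions and not the output of the grouping algorithm, and the names `characteristic_vector` and `l1_distance` are hypothetical.

```python
from collections import Counter


def characteristic_vector(parsing_levels):
    """Sparse V(T): maps each subtree encoding to its number of occurrences
    across all parsing levels. Absent subtrees implicitly have count 0."""
    v = Counter()
    for level in parsing_levels:
        for subtree in level:
            v[subtree] += 1
    return v


def l1_distance(v1, v2):
    """L1 distance between two sparse vectors; only non-zero keys are visited."""
    return sum(abs(v1[x] - v2[x]) for x in v1.keys() | v2.keys())


# Hand-written stand-ins for the levels T(0), T(1), T(2) of two small trees.
levels_t = [["a", "b", "b", "c"], ["a(b,b)", "c"], ["a(b,b,c)"]]
levels_s = [["a", "b", "c"], ["a(b)", "c"], ["a(b,c)"]]

v_t = characteristic_vector(levels_t)
v_s = characteristic_vector(levels_s)
distance = l1_distance(v_t, v_s)
```

Storing only the non-zero entries is what keeps the representation at O(|T|) size even when the nominal dimensionality is very large.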

11 Our Embedding Algorithm (cont.)
• Node grouping at a given parsing level T(i): Create groups of 2 or 3 nodes of T(i) and merge them into a single node of T(i+1)
  – 1. Group maximal sequence of contiguous leaf children of a node
  – 2. Group maximal sequence of contiguous nodes in a chain
  – 3. Fold leftmost lone leaf child into parent
• Grouping for Cases 1, 2: Deterministic coin-tossing process of Cormode and Muthukrishnan [SODA’02]
  – Key property: Insertion/deletion in a sequence of length k only affects the grouping of nodes in a radius of [...] from the point of change
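The locality property of deterministic coin tossing can be seen in its basic label-reduction step, sketched below in Python. This is the textbook alphabet-reduction round, shown as an assumption about the flavor of the technique; it is not the full grouping procedure cited on the slide, and `reduce_labels` is a hypothetical name. Each new label depends only on an element and its left neighbor, so a local change cannot propagate far.

```python
def reduce_labels(labels):
    """One alphabet-reduction round of deterministic coin tossing.

    Assumes labels are non-negative ints and adjacent labels differ.
    Each element gets 2*k + b, where k is the lowest bit position at which
    it differs from its left neighbor and b is its own bit at position k.
    Adjacent output labels remain distinct, over a much smaller alphabet.
    """
    out = []
    for i, cur in enumerate(labels):
        if i == 0:
            # No left neighbor: use bit 0 of the label itself.
            out.append(cur & 1)
            continue
        diff = cur ^ labels[i - 1]
        k = (diff & -diff).bit_length() - 1  # index of lowest differing bit
        out.append(2 * k + ((cur >> k) & 1))
    return out


reduced = reduce_labels([5, 3, 8, 1, 6])
# Changing the element at position 2 can only alter outputs at positions 2 and 3.
reduced_after_edit = reduce_labels([5, 3, 4, 1, 6])
```

Repeating such rounds shrinks the label alphabet very quickly, after which local minima or maxima of the labels can serve as group boundaries.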

12 Our Embedding Algorithm (cont.)
• Example hierarchical tree parsing
• O(log|T|) levels in the parsing, build V(T) in time [...]
[Figure: parsing levels T(0) = T, T(1), T(2), T(3); callouts: x = [...] (“empty” label), V(T)[x] += 1; y = [...], V(T)[y] += 1]

13 Main Embedding Result
• Theorem: Our embedding algorithm builds a vector V(T) with O(|T|) non-zero components in time [...]; further, given trees T, S with n = max{|T|, |S|}, we have: [...]
• Upper-bound proof highlights
  – Key Idea: Bound the size of “influence region” (i.e., set of affected node groups) for a tree-edit operation on T (= T(0)) at each level of parsing
    We show that this set is of size [...] at level i
  – Then, it is simple to show that any tree-edit operation can change [...] by at most [...]
    L1 norm of subvector at level i changes by at most O(|influence region|)

14 Main Embedding Result (cont.)
• Lower-bound proof highlights
  – Constructive: “Budget” of at most [...] tree-edit operations is sufficient to convert the parsing structure for S into that for T
    Proceed bottom up, level-by-level
    At bottom level (T(0)), use budget to insert/delete appropriate labeled nodes
    At higher levels, use subtree moves to appropriately arrange nodes
• See paper for full details...

15 Sketching a Massive, Streaming XML Tree
• Input: Massive XML data tree T (n = |T| >> available memory), seen in preorder
• Output: Small space surrogate (vector) for high-probability, approximate tree-edit distance computations (to within our distortion bounds)
• Theorem: Can build a [...]-size sketch vector of V(T) for approximate tree-edit distance computations in space [...] and time [...] per element
  – d = depth of T, [...] = probabilistic confidence in ed() approximation
  – XML trees are typically “bushy” (d << n or d = O(polylog(n)))

16 Sketching a Massive, Streaming XML Tree (cont.)
• Key Ideas
  – Incrementally parse T to produce V(T) as elements stream in
  – Just need to retain the influence region nodes for each parsing level and for each node in the current root-to-leaf path
  – While updating V(T), also produce an L1 sketch of the V(T) vector using the techniques of Indyk [FOCS’00]
[Figure: influence regions at parsing levels T = T(0), T(1), T(2), ..., T(O(log n))]
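The L1 sketching step can be illustrated with the stable-distribution approach associated with Indyk: project the vector onto Cauchy (1-stable) random variates and take the median of absolute differences. The Python sketch below is a simplified illustration under stated assumptions: per-coordinate variates are regenerated from a string-seeded `random.Random` (standing in for a proper small-space pseudo-random generator), and the class name `L1Sketch` and the default of 401 counters are hypothetical choices.

```python
import math
import random
from statistics import median


def cauchy_row(seed: int, coord, k: int):
    """k standard Cauchy variates for one coordinate, reproducible from the seed."""
    rng = random.Random(f"{seed}:{coord}")
    return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(k)]


class L1Sketch:
    """k counters, each a Cauchy-weighted linear projection of the vector."""

    def __init__(self, k: int = 401, seed: int = 0):
        self.k = k
        self.seed = seed
        self.counters = [0.0] * k

    def update(self, coord, delta: float = 1.0) -> None:
        # Called whenever component `coord` of the vector changes by `delta`.
        for j, c in enumerate(cauchy_row(self.seed, coord, self.k)):
            self.counters[j] += delta * c

    def distance(self, other: "L1Sketch") -> float:
        """Median of |difference| estimates the L1 distance of the two vectors."""
        return median(abs(a - b) for a, b in zip(self.counters, other.counters))


# Two sparse vectors with true L1 distance |3-1| + |1-0| + |0-2| = 5.
sketch_a, sketch_b = L1Sketch(), L1Sketch()
for coord, value in {"a": 3, "b": 1}.items():
    sketch_a.update(coord, value)
for coord, value in {"a": 1, "c": 2}.items():
    sketch_b.update(coord, value)
estimate = sketch_a.distance(sketch_b)
```

Both sketches must share the same seed so that corresponding counters use the same variates; the estimate is randomized and only approximates the true distance, with accuracy improving as the number of counters grows.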

17 Approximate Similarity Joins over XML Streams
• Input: Long streams S1, S2 of N (short) XML documents ([...] b nodes)
• Output: Estimate for |SimJoin(S1, S2)|
  |SimJoin(S1, S2)| = |{(T1,T2) ∈ S1 × S2 : ed(T1,T2) ≤ [...]}|
• Theorem: Can build an atomic sketch-based estimate for |SimJoin(S1, S2)| where distances are approximated to within [...] in space [...] and time [...] per document
  – [...] = probabilistic confidence in distance estimates
[Figure: document streams S1 and S2]
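As a point of reference for what the sketch-based estimate approximates, the following Python sketch counts similar pairs exactly by brute force over already-embedded documents. It is a quadratic, store-everything baseline and not the streaming algorithm from the talk; documents are assumed to be given as sparse vectors (dicts), and `sim_join_size` and its `threshold` parameter are hypothetical names.

```python
from itertools import product


def l1(u: dict, v: dict) -> float:
    """L1 distance between two sparse vectors stored as dicts."""
    return sum(abs(u.get(x, 0) - v.get(x, 0)) for x in u.keys() | v.keys())


def sim_join_size(s1, s2, threshold: float) -> int:
    """Number of pairs (u, v) in s1 x s2 whose L1 distance is at most threshold."""
    return sum(1 for u, v in product(s1, s2) if l1(u, v) <= threshold)


# Toy "streams" of embedded documents.
stream_1 = [{"a": 1, "b": 1}, {"c": 2}]
stream_2 = [{"a": 1}, {"c": 2, "d": 1}]
pairs_within_1 = sim_join_size(stream_1, stream_2, threshold=1)
```

The baseline needs every document in memory and compares all pairs, which is exactly what the small-space, per-document processing described on the slide is meant to avoid.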

18 Approximate Similarity Joins over XML Streams (cont.)
• Key Ideas
  – Our embedding of streaming document trees, plus two distinct levels of sketching
    One to reduce L1 dimensionality, one to capture the data distribution (for joining)
    Finally, similarity join in lower-dimensional L1 space
    Some technical issues: high-probability L1 dimensionality reduction is not possible, sketching for L1 similarity joins
    Details in the paper...
[Figure labels: TreeEmbed; V(T) ([...] dimensions); L1 sketch (dim. reduction); Join Sketch (distribution); X(S)]

19 Conclusions
• Considered the problem of processing complex correlation queries based on tree-edit distance over XML data streams
• New approach, based on a novel algorithm for low-distortion vector-space embedding of streaming XML and the tree-edit distance metric
  – First oblivious, small space/time algorithm with guaranteed logarithmic worst-case bound on distance distortion
• Combined our embedding with sketching techniques to give first algorithmic results on correlating XML data in the streaming model
  – Sketching an XML database, approximate similarity joins
• Other algorithmic applications of our results
  – Tool for very fast (near-linear time) approximate tree-edit distance computations
• Future: Other metric-space embeddings in the streaming model??

20 Thank you!

21 Stream Data Synopses
• Conventional data summaries fall short
  – Quantiles and 1-d histograms [MRL98,99], [GK01], [GKMS02]
    Cannot capture attribute correlations
    Little support for approximation guarantees
  – Samples (e.g., using Reservoir Sampling)
    Perform poorly for joins [AGMS99]
    Cannot handle deletion of records
  – Multi-d histograms/wavelets
    Construction requires multiple passes over the data
• Different approach: Randomized sketch synopses [AMS96]
  – Only logarithmic space
  – Probabilistic guarantees on the quality of the approximate answer
  – Supports insertion as well as deletion of records

22 Randomized Sketch Synopses for Streams
• Goal: Build small-space summary for distribution vector f(i) (i=1,..., N) seen as a stream of i-values
[Figure: Data stream: 3, 1, 2, 4, 2, 3, 5, ...; frequencies f(1), f(2), f(3), f(4), f(5)]
• Basic Construct: Randomized Linear Projection of f() = inner/dot product of f-vector [...], where [...] = vector of random values from an appropriate distribution
  – Simple to compute over the stream: Add [...] whenever the i-th value is seen
  – Generate [...]’s in small (logN) space using pseudo-random generators
  – Tunable probabilistic guarantees on approximation error
• Used for low-distortion vector-space embeddings [JL84]
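The slide's example stream can be used to check that the running sum maintained over the stream equals the inner product of the frequency vector with the random vector. The Python sketch below assumes ±1 random values and, for clarity only, stores the whole random vector explicitly (which takes space linear in the domain size, unlike the small-space pseudo-random generation mentioned on the slide). The names `make_random_vector`, `frequency_vector`, and `project` are hypothetical.

```python
import random
from collections import Counter


def make_random_vector(domain_size: int, seed: int = 7):
    """Random +1/-1 value per domain element; index 0 is unused padding."""
    rng = random.Random(seed)
    return [0] + [rng.choice((-1, 1)) for _ in range(domain_size)]


def frequency_vector(stream):
    """f(i) = number of occurrences of value i in the stream."""
    return Counter(stream)


def project(stream, xi) -> int:
    """One pass: add xi[i] each time value i is seen."""
    total = 0
    for i in stream:
        total += xi[i]
    return total


stream = [3, 1, 2, 4, 2, 3, 5]
xi = make_random_vector(5)
f = frequency_vector(stream)
sketch = project(stream, xi)
dot_product = sum(f[i] * xi[i] for i in range(1, 6))
```

The equality between `sketch` and `dot_product` holds for any choice of random vector, since the running sum simply regroups the terms of the inner product by arrival order.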

23 More work on Sketches...
• Low-distortion vector-space embeddings (JL Lemma) [Ind01] and applications
  – E.g., approximate nearest neighbors [IM98]
• Wavelet and histogram extraction over data streams [GGI02, GIM02, GKMS01, TGIK02]
• Discovering patterns and periodicities in time-series databases [IKM00, CIK02]
• Quantile estimation over streams [GKMS02]
• Distinct value estimation over streams [CDI02]
• Maintaining top-k item frequencies over a stream [CCF02]
• Stream norm computation [FKS99, Ind00]
• Data cleaning [DJM02]