Approximate XML Query Answers Neoklis Polyzotis (UC Santa Cruz) Minos Garofalakis (Bell Labs) Yannis Ioannidis (U. of Athens, Hellas) Presented by: Gal.

Presentation transcript:

Approximate XML Query Answers
Neoklis Polyzotis (UC Santa Cruz), Minos Garofalakis (Bell Labs), Yannis Ioannidis (U. of Athens, Hellas)
Presented by: Gal Zach

Motivation
XML is the de facto standard for data exchange over the Internet.
There is a conflict between "on-line" use and query execution cost:
- query response times increase
- users might wait for uninteresting results
Approach: process the query over a concise synopsis of the XML data.
The approximate result should be:
- computed fast
- similar in its value content to the true result
- similar in its hierarchical structure to the true result

Outline
- Motivation
- Background: synopsis model
- TreeSketch synopses
  - Summarization model ⇔ structural clustering of elements
  - Efficient processing and construction
- Element Simulation Distance
- Experimental results

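Twig Query - Example
Twig query: for q1 in //a[//b] for q2 in q1//p return q1//n, for q3 in q2//k return q3
[Figure: the twig query, its query tree (nodes q0 to q4, edges labeled //a[//b], //p, //k, //n), and the nesting tree over elements a, b, p, k, n, d.]
Note: a marker in the figure (not recoverable from the transcript) denotes the paths that are specified in the return clause.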

Synopsis Model
Let G = (V_G, E_G) be a directed node-labeled graph. A graph synopsis S(G) = (V_S, E_S) is a directed node-labeled graph where:
1. Each node v ∈ V_S corresponds to a subset of the element (or attribute) nodes in V_G that have the same label. This subset is termed the extent of v, extent(v).
2. An edge (u, v) ∈ E_G is represented in E_S as an edge between the nodes whose extents contain the two endpoints u and v.
Each synopsis node u stores a tag, tag(u), which is the common tag of its elements, and a count field |u| for the size of its extent.
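A minimal sketch of the definition above, assuming the coarsest possible grouping (one synopsis node per tag). The function name `label_split_synopsis` and the dictionary-based input format are illustrative assumptions, not part of the presentation.

```python
from collections import Counter

def label_split_synopsis(labels, edges):
    """Build a graph synopsis with one node per tag (an assumed, coarsest grouping).

    labels: dict mapping element id -> tag
    edges:  iterable of (parent id, child id) document edges
    """
    # |extent(v)| for each synopsis node v, keyed by its tag
    counts = dict(Counter(labels.values()))
    # a document edge (u, v) becomes an edge between the nodes holding u and v
    syn_edges = {(labels[u], labels[v]) for u, v in edges}
    return counts, syn_edges

# Hypothetical document: root r with three children tagged a
labels = {"r": "r", "a1": "a", "a2": "a", "a3": "a"}
edges = [("r", "a1"), ("r", "a2"), ("r", "a3")]
counts, syn_edges = label_split_synopsis(labels, edges)
```

Grouping by tag alone is only one valid choice; the definition also allows several synopsis nodes with the same tag, each covering a different subset of elements.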

Synopsis Model
Synopsis node ⇔ set of elements with the same tag
Synopsis edge ⇔ document edge(s)
[Figure: a document with root r and children a1, a2, a3, and its synopsis with nodes r and a(3).]

Synopsis Model - Example
[Figure: an XML data graph (elements P0, A1, A2, PB3, N4, B5, P6, P7, N8, B9, F10, T11, T12, F13, E14 with value nodes V4, V8, V10 to V14) next to its synopsis graph with nodes P(1), A(2), PB(1), N(2), P(2), T(2), B(2), F(2), E(1).]
Count(A) = |extent(A)| = |{A1, A2}| = 2

Example for Twig-XSketch
[Figure: documents T_1 and T_2 (root r, elements a1 and a2, with b and c descendants) and the Twig-XSketch with nodes R(1), A(2), B(4), C(10); edges are annotated B/F.]
B/F = backward/forward.
Note: the numbers on the edges represent how many edges of that kind exist.

Count-Stability and the TreeSketch Synopsis

Definitions
Let R ⊆ V × V denote an equivalence relation over the nodes of T(V, E), and let (u, v) denote a pair of equivalence classes (i.e., element node partitions) induced by R.
The pair (u, v) is k-stable (k ≥ 0) iff each element e ∈ u has exactly k child elements in v.
The relation R and the graph synopsis S_R(T) resulting from the corresponding element partition are said to be count-stable iff for every possible pair of element partitions (u, v) there exists some k ≥ 0 such that (u, v) is k-stable.
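The k-stability condition can be checked directly from parent pointers. The sketch below is illustrative only: `stable_k`, the `parent_of` dictionary, and the two small trees are assumptions made for the example (the first tree assumes a1 has two b children and a2 has one).

```python
def stable_k(parent_of, u, v):
    """Return k if the pair (u, v) is k-stable, otherwise None.

    parent_of: dict mapping child element id -> parent element id
    u, v:      sets of element ids (two equivalence classes)
    """
    child_counts = {e: 0 for e in u}
    for child in v:
        parent = parent_of.get(child)
        if parent in child_counts:
            child_counts[parent] += 1
    distinct = set(child_counts.values())
    # k-stable iff every element of u has the same number of children in v
    return distinct.pop() if len(distinct) == 1 else None

# Assumed shape: r -> a1, a2; a1 -> b1, b2; a2 -> b3
t1 = {"a1": "r", "a2": "r", "b1": "a1", "b2": "a1", "b3": "a2"}
# Assumed shape: r -> a1, a2; each a has three b children
t2 = {"a1": "r", "a2": "r", "b1": "a1", "b2": "a1", "b3": "a1",
      "b4": "a2", "b5": "a2", "b6": "a2"}

a = {"a1", "a2"}
k_ra = stable_k(t1, {"r"}, a)
k_ab_t1 = stable_k(t1, a, {"b1", "b2", "b3"})
k_ab_t2 = stable_k(t2, a, {"b1", "b2", "b3", "b4", "b5", "b6"})
```

A relation is count-stable when this check returns a value for every pair of classes.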

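Examples
Tree T_1 [Figure: tree with root r, elements a1, a2, b1, b2, b3, and its synopsis S_R(T_1) with nodes r, a, b.]
- The pair (r, a) is 2-stable.
- The pair (a, b) is not k-stable for any k ≥ 0.
Tree T_2 [Figure: tree with root r, elements a1, a2, b1 to b6, and its synopsis S_R(T_2) with nodes r, a, b.]
- The pair (r, a) is 2-stable.
- The pair (a, b) is 3-stable.
- S_R(T_2) is count-stable.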

Lemma
Given a data tree T(V, E), there exists a unique minimal (in terms of the number of equivalence classes) count-stable equivalence relation R ⊆ V × V. Furthermore, there exists a function Expand from stable relations to XML trees such that Expand(R) is isomorphic to the original document tree T.

Example
[Figure: the count-stable synopses S_R(T_1) and S_R(T_2), with nodes r, a, b, c, shown next to the documents T_1 and T_2.]

TreeSketch Synopsis
A TreeSketch synopsis TS for an XML data tree T is a graph synopsis for T where:
1. Each node u in TS stores an element count, count(u) = |extent(u)|.
2. Each edge (u, v) in TS stores an (average) child count, count(u, v), equal to the average number of children in extent(v) for each element in extent(u).
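As a rough illustration of the two stored quantities, the sketch below computes count(u) and count(u, v) from a given partition of elements. The names `treesketch_counts`, `parent_of`, and `cluster_of` are hypothetical, and the small tree is an assumed example.

```python
from collections import defaultdict

def treesketch_counts(parent_of, cluster_of):
    """Compute node counts and average child counts for a given partition.

    parent_of:  dict mapping child element id -> parent element id
    cluster_of: dict mapping every element id -> synopsis node name
    """
    node_count = defaultdict(int)   # count(u) = |extent(u)|
    edge_total = defaultdict(int)   # number of document edges between two extents
    for cluster in cluster_of.values():
        node_count[cluster] += 1
    for child, parent in parent_of.items():
        edge_total[(cluster_of[parent], cluster_of[child])] += 1
    # count(u, v) = average number of children in extent(v) per element of extent(u)
    edge_avg = {(u, v): total / node_count[u] for (u, v), total in edge_total.items()}
    return dict(node_count), edge_avg

# Assumed tree: r -> a1, a2; a1 -> b1, b2; a2 -> b3
parent_of = {"a1": "r", "a2": "r", "b1": "a1", "b2": "a1", "b3": "a2"}
cluster_of = {"r": "r", "a1": "a", "a2": "a", "b1": "b", "b2": "b", "b3": "b"}
node_count, edge_avg = treesketch_counts(parent_of, cluster_of)
```

Here the pair (a, b) is not stable, so count(a, b) is a true average (1.5) rather than an exact per-element count.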

TreeSketch Synopsis
The interpretation of the stored average is simple: all elements in the extent of u are assumed to have count(u, v) child elements in the extent of v.

TreeSketches and Clustering

Let u be a synopsis node with outgoing edges u → v_1, …, u → v_n. The set of outgoing edges defines an n-dimensional space in which an element e ∈ u is mapped to the point (c_1(e), …, c_n(e)) if it has c_i(e) children in node v_i, 1 ≤ i ≤ n. The recorded average edge counts essentially map all points in this space to the point (count(u, v_1), …, count(u, v_n)), which is the centroid of the cluster.

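TreeSketches and Clustering - Example
[Figure: the original tree, in which a1 has 1 b child and 2 c children and a2 has 5 b children and 8 c children, and the synopsis tree with nodes r(1), a(2), b(6), c(10) and average edge counts 3 (a to b) and 5 (a to c).]
Points: a1 = (1, 2), a2 = (5, 8). Centroid: a = (3, 5).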

TreeSketches and Clustering
We can characterize the quality of a TreeSketch synopsis with a metric that quantifies the quality of the induced clustering. The metric used in the article is the squared error of the clustering, which essentially measures the Euclidean distance between the points and their corresponding centroid.
The squared error of a single cluster u is defined as
$sq(u) = \sum_{e \in u} \sum_{i=1}^{n} \left( c_i(e) - count(u, v_i) \right)^2$
sq(TS) for a synopsis TS is simply the sum of the squared errors of all the induced clusters.
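The squared error can be worked through on the two points from the clustering example, a1 = (1, 2) and a2 = (5, 8). This is a small sketch; the helper names `centroid` and `squared_error` are illustrative assumptions.

```python
def centroid(points):
    """Per-dimension average of the child-count vectors in one cluster."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def squared_error(points):
    """sq(u): sum of squared distances between each point and the centroid."""
    c = centroid(points)
    return sum((x - ci) ** 2 for p in points for x, ci in zip(p, c))

# Child-count vectors (b children, c children) of a1 and a2
points_a = [(1, 2), (5, 8)]
center = centroid(points_a)      # (3.0, 5.0), the stored edge counts
error = squared_error(points_a)  # (1-3)^2 + (2-5)^2 + (5-3)^2 + (8-5)^2 = 26.0
```

A cluster whose elements all share the same child counts has squared error zero.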

TreeSketches and Clustering
Note that the squared error of a count-stable synopsis is zero, since all edge-count centroids are exact, i.e., the child counts of all elements in a given synopsis node extent are identical.
Tight clusters ⇔ accurate synopsis
The perfect synopsis corresponds to a perfect clustering.

Building the Count-Stable Summary

BUILDSTABLE Algorithm
Input: XML document T.
Output: Count-stable synopsis S of T.
begin
  H = Ø; S = Ø
  foreach e ∈ T in post-order do
    C = {(u_i, c_i) : u_i is a node in S and c_i = |children(e) ∩ extent(u_i)| > 0}
    if (H[label(e), C] = Ø) then
      add node u to S with label(u) = label(e)
      H[label(e), C] = u
      for (u_i, c_i) ∈ C do add edge u → u_i to S
    endif
    u = H[label(e), C]; extent(u) = extent(u) ∪ {e}
  endfor
end
Running time: O(|T|)
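The pseudocode can be sketched in Python as follows. This is one possible reading, not the authors' implementation: the `Element` class, the dictionary layout of the synopsis, and the use of integer ids for synopsis nodes are all assumptions. The recursive traversal is for brevity and would need an explicit stack for very deep documents.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Element:
    label: str
    children: list = field(default_factory=list)

def build_stable(root):
    """Group elements that have the same label and the same child signature."""
    table = {}      # H: (label, child signature C) -> synopsis node id
    synopsis = {}   # synopsis node id -> label, extent, outgoing edge counts
    cls = {}        # id(element) -> synopsis node id of that element

    def visit(e):
        for c in e.children:            # post-order: children first
            visit(c)
        # C: how many children of e fall into each existing synopsis node
        sig = frozenset(Counter(cls[id(c)] for c in e.children).items())
        key = (e.label, sig)
        if key not in table:            # first element with this signature
            u = len(synopsis)
            table[key] = u
            synopsis[u] = {"label": e.label, "extent": [], "edges": dict(sig)}
        u = table[key]
        synopsis[u]["extent"].append(e)
        cls[id(e)] = u

    visit(root)
    return synopsis

# Assumed document: r -> a1, a2; a1 -> b1, b2; a2 -> b3
doc = Element("r", [Element("a", [Element("b"), Element("b")]),
                    Element("a", [Element("b")])])
s = build_stable(doc)
```

On this document the two a elements end up in different synopsis nodes because they have different numbers of b children, which is what makes every edge count exact.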

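Example
[Figure: document T with root r, a1 (children b1, b2) and a2 (child b3), and the resulting synopsis S with nodes r', a', a'', b'.]
Elements in post-order: b1, b2, a1, b3, a2, r.
- b1, b2: C = Ø; H[(b, Ø)] = b'
- a1: C = {(b', 2)}; H[(a, {(b', 2)})] = a'
- b3: C = Ø
- a2: C = {(b', 1)}; H[(a, {(b', 1)})] = a''
- r: C = {(a', 1), (a'', 1)}; H[(r, {(a', 1), (a'', 1)})] = r'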

Space Budget Limitations
Problem: given an XML tree T, build a TreeSketch of size B.
- This is a difficult clustering problem: the dimensionality of the space depends on the clustering itself.
- Construction is based on bottom-up clustering: compress the perfect synopsis by merging clusters.
- The best merge is determined by marginal gains.
[Figure: successive merges leading from the perfect synopsis down to the space budget.]

TSBUILD Algorithm
- Maintain a pool of candidate operations, each merging two nodes of TS. The pool has size U_h (given as input to the algorithm).
- m(TS) denotes the synopsis that results from applying merge m to TS.
- m.err_d = sq(m(TS)) - sq(TS) is the increase in squared error from TS to m(TS).
- m.size_d = size(TS) - size(m(TS)) is the decrease in synopsis size.
- The pool is organized as a min-heap ordered by the marginal-gain ratio m.err_d / m.size_d.
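A simplified sketch of how candidate merges could be ranked by the marginal-gain ratio. It makes two loud assumptions that the real algorithm does not: every cluster lives in the same fixed-dimensional space (so a merge only changes the error of the two merged clusters), and every merge saves the same amount of space, `size_saved`. The names `sq` and `merge_pool` are hypothetical.

```python
import heapq
from itertools import combinations

def sq(points):
    """Squared error of one cluster of child-count vectors."""
    n = len(points)
    center = [sum(p[i] for p in points) / n for i in range(len(points[0]))]
    return sum((x - c) ** 2 for p in points for x, c in zip(p, center))

def merge_pool(clusters, size_saved=1):
    """Return a min-heap of (err_d / size_d, node a, node b) candidate merges.

    clusters:   dict mapping synopsis node name -> list of equal-length points
    size_saved: assumed constant space saving per merge (a simplification)
    """
    heap = []
    for (a, pa), (b, pb) in combinations(sorted(clusters.items()), 2):
        # increase in squared error caused by merging a and b
        err_d = sq(pa + pb) - sq(pa) - sq(pb)
        heapq.heappush(heap, (err_d / size_saved, a, b))
    return heap

clusters = {"x": [(1, 2), (1, 2)], "y": [(1, 3)], "z": [(5, 8)]}
pool = merge_pool(clusters)
best_ratio, best_a, best_b = heapq.heappop(pool)
```

Popping from the heap yields the merge that adds the least error per unit of space saved, here the two structurally closest clusters x and y.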

TSBUILD: Main Steps
Input: XML tree T; space budget S; upper and lower bounds for the heap size (U_h, L_h).
Output: TreeSketch synopsis TS of T of size ≤ S.
Main steps:
- TS = BUILDSTABLE(T).
- Create the pool of candidate merge operations, of size U_h.
- Apply the merge operations to TS one at a time.
- After each merge, recompute all the affected parameters of TS.
- If the size of TS drops below S, the algorithm stops.
- If the pool size drops below the bound L_h, replenish it.

TSBUILD

CREATEPOOL Algorithm
Generating all possible pairwise merges and keeping the top U_h requires O(N²) merge operations.
Key observation: two elements have similar structure if their children have similar structure.
- Children clusters should be merged first.
- Bottom-up merging, based on depth (depth: distance from the leaves of the tree).
Build a pool of candidate merges by increasing depth.
Replenish the pool when it falls below a given threshold.

CREATEPOOL

Approximate Query Processing

EVALQUERY: Main Steps
Input: TreeSketch TS of document T; twig query Q.
Output: TreeSketch T_Q that approximates the nesting tree N_T(Q).
Main steps:
- Traverse Q in pre-order.
- After q_j has been added, move to its child q_i.
- Add the node q_i ∈ Q if it does not exist yet, and compute the number of paths from q_j to it according to TS.
- Connect q_i to q_j (the parent node) by adding an edge.

EVALQUERY Algorithm

EVALEMBED

Example
[Figure: the query tree (nodes q0 to q5, edges labeled //a, d[/g]//f, b|e, c, //f), the TreeSketch TS (nodes r, A, B, C, D, E, F, G1, G2), and the result TreeSketch TS_Q (nodes r_Q(q0), A_Q(q1), E_Q(q2), B_Q(q2), F_Q(q3), F_Q(q4), C_Q(q5)).]

Example Cont.
Consider the processing of query node q_1, and more specifically the computation of bindings from q_1 to q_3.
Starting from node A, which appears in the bindings of q_1, we can identify exactly one simple embedding of path(q_1, q_3) = d[/g]//f, namely e = A/D/F. The bindings of q_3 are therefore the descendants of A along this embedding.
The number of descendants for each element in A: n_t = count(A, D) · count(D, F) = 2 · 0.5 = 1.
s = … · 0.7 = 0.88.
=> The number of descendants along d[/g]//f for each binding of q_1 is 1 · 0.88 = 0.88.
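The arithmetic of this example can be reproduced with a short sketch: the expected number of descendants along an embedding is the product of the average edge counts on its path. The function name `descendants_along` is an assumption, and the factor s = 0.88 is copied from the slide rather than derived.

```python
from math import prod

def descendants_along(edge_counts, path):
    """Multiply the average child counts along consecutive synopsis nodes."""
    return prod(edge_counts[(u, v)] for u, v in zip(path, path[1:]))

# Edge counts stated in the example: count(A, D) = 2, count(D, F) = 0.5
edge_counts = {("A", "D"): 2, ("D", "F"): 0.5}
n_t = descendants_along(edge_counts, ["A", "D", "F"])  # 2 * 0.5

s = 0.88             # factor taken as given from the slide
estimate = n_t * s   # expected q_3 bindings per binding of q_1
```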

Error of Approximation - Abstract
The error of approximation is quantified by the distance between the two XML trees. The distance expresses how similar the two trees are in terms of structure and meaning.
ESD (Element Simulation Distance) is the metric described in the article that quantifies this distance.

Experimental Study
Data sets:
- IMDB: real-life data set from the Internet Movie Database.
- XMark: synthetic data set that models transactions on an online auction site.
- SwissProt: real-life data set with annotations on proteins.
Workload: 1000 random twig queries.
Evaluation metric: average ESD for approximate answers.

Data Sets Characteristics
Data set    Elements    File size (MB)    Stable synopsis size (KB)
IMDB        102,
XMark       103,
SProt       182,

Approximate Answers IMDB (~102K Elements) Avg. Result Size: 3,477 tuples

Approximate Answers
XMark (~103K elements). Avg. result size: 2,436 tuples.
[Plot: average ESD versus synopsis size (KB) for TreeSketches and Twig-XSketches.]

Approximate Answers
SwissProt (~182K elements). Avg. result size: 104,592 tuples.
[Plot: average ESD versus synopsis size (KB) for TreeSketches and Twig-XSketches.]

Construction Times
Construction times (minutes) for TreeSketches and Twig-XSketches on IMDB, XMark, and SwissProt.

Error of Approximation
Let N_TS(Q) be the approximate nesting tree computed over a concise synopsis TS, and let N_T(Q) be the true nesting tree of the query Q.
The error of approximation is quantified by the distance between the two XML trees, denoted dist_A(N_TS(Q), N_T(Q)).
We will use the tree-edit distance metric, which measures only the syntactic differences.

Tree-edit distance metric
The tree-edit distance dist_E(T_1, T_2) between two XML trees measures the cost of the minimum-cost sequence of edit operations that transforms T_1 into T_2.
Basic operations on tree nodes:
- adding
- deleting
- relabeling

Tree-edit distance metric - Example
[Figure: the query answer T and the approximations T_1 and T_2; each tree has root r with two a children, whose subtrees are copies of S_c and S_d.]
dist_E(T, T_1) = 3·|S_c| + 3·|S_c| = 3·|S_c| + 3·|S_d| = dist_E(T, T_2)

Element Simulation Distance
- New distance metric for XML trees.
- Considers both the overall path structure and the distribution of document edges.
- Defined recursively.
- Uses an existing distance metric such as MAC (match and compare) or EMD (earth mover's distance).
Note: these metrics are not described in the article.

Element Simulation Distance
MAC: a numerical measure that quantifies the quality of an approximate answer to a set-valued query.
EMD: measures the distance between two distributions, reflecting the minimal amount of work that must be performed to transform one distribution into the other by moving "distribution mass" around.

Element Simulation Distance
Let u ∈ T_1 and v ∈ T_2 be elements of the compared trees, where label(u) = label(v).
Let U_t and V_t denote the sets of children of u and v, respectively, that have tag t.
ESD(u', v') denotes the distance between any two elements u' ∈ U_t, v' ∈ V_t.
The distance dist_ς(U_t, V_t) between U_t and V_t is defined using an existing value-set distance metric, such as MAC or EMD.
$\mathrm{ESD}(u, v) = \sum_{t} \mathrm{dist}_{\varsigma}(U_t, V_t)$

Element Simulation Distance
Assume without loss of generality that V_t = Ø. For each element e ∈ U_t, we insert a unique element e_v in V_t with distance ESD(e, e_v) = |e|, where |e| is the size of the subtree of e, and ESD(e', e_v) = ∞ for all e' ∈ U_t, e' ≠ e.
ESD between two trees: ESD(T_1, T_2) = ESD(root(T_1), root(T_2)).

ESD - Example
Let u and v be the left a elements of T and T_1, respectively.
Elements u and v have children with tags c and d, and thus ESD(u, v) = dist_ς(U_c, V_c) + dist_ς(U_d, V_d).
The distances ESD(c_i, c_j), c_i ∈ U_c, c_j ∈ V_c, are equal to 0, since the elements have identical subtrees.
Notice that the two value sets contain equal values but with different multiplicities.
Using the MAC metric: dist_ς(U_c, V_c) = 8 => ESD(u, v) = 8 + 0 = 8.
[Figure: the trees T and T_1, each with root r and two a children over copies of S_c and S_d.]

ESD - Example Cont.
Let v' be the left a element of T_2. Then ESD(u, v') = 6.
[Figure: the trees T and T_2, each with root r and two a children over copies of S_c and S_d.]

Questions

Thank You!