Approximate XML Query Answers Alkis Polyzotis (UC Santa Cruz) Minos Garofalakis (Bell Labs) Yannis Ioannidis (U. of Athens, Hellas)

Presentation transcript:


Motivation
- XML: de facto standard for data exchange
- Development of the “XML Warehouse”
- Conflict between “on-line” and query execution cost
  - Increased query response times
  - Users might wait for uninteresting results
[Figure: query Q is evaluated over the XML data warehouse and returns result R]

Approximate Query Answers
- Evaluate query over a concise data synopsis and obtain an approximation R’ of the true result
- Use approximate result as timely feedback
  - User can assess the “value” of the query
- Goal: reduce number of evaluated queries
[Figure: query Q is evaluated over a synopsis of the XML data warehouse and returns R’, an approximation of the true result R]

Contributions
- TreeSketch Synopses
  - Structural summaries for XML data
  - Approximate answers for complex twig queries
  - Summarization model
  - Structural clustering of elements
  - Efficient processing and construction
- Element Simulation Distance
  - Novel distance metric for XML data
  - Captures “approximate” similarity between two XML trees
- Experimental Results
  - Accurate approximate answers for low space budgets
  - Low-error selectivity estimates
  - Efficient construction algorithm

Outline
- Preliminaries
- TreeSketches
  - Synopsis model
  - Computing approximate answers
  - Summary construction
- Element Simulation Distance
- Experimental Study
- Conclusions

Data and Query Model
[Figure, four panels: "XML Document" (tree rooted at r with paper, section, figure, equation, and caption elements); "Twig Query" (nodes q0, q1, q2, q3 with edges labeled //section, .//equation, ./figure); "Binding Tuples" (one column per query node q0 to q3); "Nesting Tree" (the binding tuples arranged as a tree rooted at r)]

Problem Definition
- Process twig query over a synopsis
- Compute approximation of nesting tree
[Figure: the twig query (q0 to q3; //section, .//equation, ./figure) evaluated over the XML data yields the true nesting tree; evaluated over the synopsis it yields the approximate nesting tree]

Graph Synopsis
- Synopsis node: set of elements of the same tag
- Synopsis edge: document edge(s)
[Figure: XML document and its graph synopsis with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4)]

TreeSketch Synopsis
- Augment graph synopsis with edge counts
- count[u,v]: mean number of children in v per element in u
[Figure, first build: XML document and a TreeSketch with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4), edges annotated with counts]
[Figure, second build: the same document and a TreeSketch with nodes R(1), P(1), S(2), F(4), E(2), C(4)]
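The edge-count definition can be illustrated with a minimal Python sketch. The function name `edge_counts`, the dictionary-based tree encoding, and the toy document are assumptions made for illustration; they are not taken from the slides.

```python
from collections import defaultdict

def edge_counts(children, cluster):
    """Return count[(u, v)]: mean number of children in node v per element in node u.

    children: element id -> list of child element ids (hypothetical encoding)
    cluster:  element id -> synopsis node that contains the element
    """
    size = defaultdict(int)    # number of elements in each synopsis node
    total = defaultdict(int)   # (u, v) -> number of document edges from u into v
    for elem, node in cluster.items():
        size[node] += 1
        for child in children.get(elem, []):
            total[(node, cluster[child])] += 1
    # Average the edge totals over the elements of the source node.
    return {edge: n / size[edge[0]] for edge, n in total.items()}

# Toy document (hypothetical): r has sections s2 and s3;
# s2 has figures f4 and f5; s3 has figure f6.
children = {"r": ["s2", "s3"], "s2": ["f4", "f5"], "s3": ["f6"]}
cluster = {"r": "R", "s2": "S", "s3": "S", "f4": "F", "f5": "F", "f6": "F"}
counts = edge_counts(children, cluster)
```

In this toy input the two sections have 2 and 1 figures, so count[S,F] is the average 1.5, which is exact for neither section. That gap is the approximation error the synopsis introduces.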

TreeSketches and Clustering
- TreeSketch: clustering based on structure
  - All elements in a node are mapped to a “centroid”
  - Tight clusters give an accurate synopsis
  - The perfect synopsis corresponds to a perfect clustering
- Synopsis quality quantified by clustering error
  - Options: Manhattan distance, squared error, …
  - Quality can be measured independent of a workload
  - Key for effective construction

Computing Approximate Answers
- Compute TreeSketch of approximate answer
- Accuracy depends on quality of clustering
[Figure: the twig query (q0 to q3; //section, .//equation, .//caption) is evaluated over the TreeSketch with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4); the approximate nesting tree over R, S, E, C carries edge counts, one of them combined as 1+1=2]
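As a rough illustration of how edge counts drive estimation, the sketch below estimates the number of binding tuples for a single child-axis path by multiplying the size of the first synopsis node by the edge counts along the path. This is a deliberate simplification: the slides' algorithm handles twig queries and descendant steps and builds an approximate nesting tree, which this sketch does not attempt. The function name and the input values are hypothetical.

```python
def estimate_path_tuples(sizes, counts, path):
    """Estimate the number of binding tuples for a child-axis path of synopsis nodes.

    sizes:  synopsis node -> number of elements it holds
    counts: (u, v) -> mean number of children in v per element in u
    path:   list of synopsis nodes, e.g. ["R", "S", "F"]
    """
    estimate = float(sizes[path[0]])
    for u, v in zip(path, path[1:]):
        # A missing synopsis edge means no element of u has a child in v.
        estimate *= counts.get((u, v), 0.0)
    return estimate

# Hypothetical synopsis: 1 root, 2 sections, 3 figures.
sizes = {"R": 1, "S": 2, "F": 3}
counts = {("R", "S"): 2.0, ("S", "F"): 1.5}
estimate = estimate_path_tuples(sizes, counts, ["R", "S", "F"])
```

With these inputs the estimate is 1 × 2.0 × 1.5 = 3 tuples. In general the result depends on how well the average counts describe the individual elements, which is the sense in which accuracy depends on the quality of the clustering.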

TreeSketch Construction
- Given an XML tree T, build a TreeSketch of size B
- Difficult clustering problem
  - Space dimensionality depends on the clustering itself
- Construction based on bottom-up clustering
  - Compress perfect synopsis by merging clusters
  - Best merge determined by marginal gains
[Figure: a sequence of merges leading from the perfect synopsis down to the space budget]
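The bottom-up idea can be sketched as a greedy loop that repeatedly applies the cheapest merge until the budget is met. The sketch makes several assumptions that are not in the slides: each element is a fixed-length vector of child counts, the budget is a number of clusters (at least 1) rather than a byte size, the error is squared distance to the centroid, and the same-tag restriction on synopsis nodes is ignored. Since every merge here saves exactly one cluster, choosing the smallest error increase stands in for the marginal-gain criterion.

```python
from itertools import combinations

def sq_error(vectors):
    """Sum of squared distances of the vectors to their centroid."""
    dims = len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
    return sum((v[d] - centroid[d]) ** 2 for v in vectors for d in range(dims))

def merge_cost(a, b):
    """Increase in total squared error caused by merging clusters a and b."""
    return sq_error(a + b) - sq_error(a) - sq_error(b)

def greedy_merge(clusters, budget):
    """Merge clusters pairwise until at most `budget` clusters remain (budget >= 1)."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > budget:
        # Pick the pair whose merge increases the clustering error the least.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: merge_cost(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Start from a "perfect" clustering: one cluster per distinct structure (toy vectors).
start = [[(2, 0)], [(2, 1)], [(0, 5)]]
result = greedy_merge(start, 2)
```

Here the two structurally close elements (2, 0) and (2, 1) end up in one cluster, while (0, 5) stays alone. Evaluating every pair on every step is quadratic in the number of clusters, which motivates restricting the candidate pool.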

Depth-Guided Merging
- Key observation: two elements have similar structure if their children have similar structure
  - Children clusters should be merged first
- Bottom-up merging, based on depth
  - Depth: distance from the leaves of the tree
  - Build a pool of candidate merges by increasing depth
  - Replenish the pool when it falls below a given threshold
- Improved construction time, good performance
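The depth ordering can be sketched as follows. The slide defines depth only as "distance from the leaves"; the sketch assumes this means the length of the longest downward path to a leaf. The tree encoding and element names are hypothetical, and pool management (threshold, replenishing) is left out.

```python
def depths(children, root):
    """Depth of each element, taken as the longest path down to a leaf (leaves are 0)."""
    depth = {}
    stack = [(root, False)]
    while stack:
        node, visited = stack.pop()
        kids = children.get(node, [])
        if visited:
            # All children are done by now, so their depths are available.
            depth[node] = 1 + max(depth[k] for k in kids) if kids else 0
        else:
            stack.append((node, True))
            stack.extend((k, False) for k in kids)
    return depth

# Toy document (hypothetical): r -> s2, s3; s2 -> f4, f5; s3 is a leaf.
children = {"r": ["s2", "s3"], "s2": ["f4", "f5"]}
depth = depths(children, "r")

# Candidate elements in the order a depth-guided pool would consider them.
order = sorted(depth, key=lambda e: (depth[e], e))
```

Processing elements in this order means the clusters of the children are considered for merging before the clusters of their parents.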


Error of Approximation
- Error: distance between R’ and R
- Popular metric: tree-edit distance
  - Min-cost sequence of operations that transform R’ to R
  - Measures syntactic differences between R and R’
- Not intuitive for approximate answers!
[Figure: a tree T compared with two trees T1 and T2, each of the form r with s children that have e and f children annotated with counts; annotations: "Different counts, similar trait" and "Same counts, opposite trait"]

Element Simulation Distance
- Capture approximate similarity between R and R’
- u simulates v: u and v have identical structure
- ESD(u,v): “degree” of simulation between u and v
  - How well the structure of u matches the structure of v
- Modeled as the distance between multi-sets
- Efficient computation using perfect summaries
[Figure: the trees T, T1, and T2 from the previous slide]


Experimental Methodology
- Data sets: XMark, DBLP, IMDB, SwissProt
- Workload: 1000 random twig queries
- Evaluation metrics:
  - Average ESD for approximate answers
  - Mean absolute relative error for selectivity estimation
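For reference, a minimal sketch of the second metric, assuming the usual definition: the average over the workload of |estimate − true| / true. The slides do not say how queries with an empty true result are treated, so the sketch simply requires positive true counts. The numbers are made up for illustration.

```python
def mean_abs_relative_error(true_counts, estimates):
    """Average of |estimate - true| / true over a workload (true counts must be > 0)."""
    if not true_counts or len(true_counts) != len(estimates):
        raise ValueError("need two non-empty sequences of equal length")
    if any(t <= 0 for t in true_counts):
        raise ValueError("true counts must be positive")
    total = sum(abs(e - t) / t for t, e in zip(true_counts, estimates))
    return total / len(true_counts)

# Three hypothetical queries: true result sizes and the synopsis estimates.
error = mean_abs_relative_error([100, 50, 200], [90, 55, 200])
```

The three queries have relative errors of 10%, 10%, and 0%, so the mean is about 6.7%.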

Approximate Answers
[Plot: IMDB (~102K elements); avg. result size: 3,477 tuples]

Selectivity Estimation - SwissProt
[Plot: SwissProt (~182K elements); avg. result size: 104,592 tuples]

Selectivity Estimation

Data Set    #Elements (x 10^3)    #Tuples (x 10^3)
DBLP        1,500                 78
IMDB        236                   13
S-Prot
XMark       2,

Data Set    Construction Time (min)
DBLP        11
IMDB        2.5
S-Prot      38
XMark       240

Conclusions
- Approximate query answering for XML databases
- TreeSketch Synopses
  - Structural summaries for tree-structured XML
  - Approximate answers for twig queries
  - Model: graph synopsis + edge counts
  - Efficient processing and construction
- Element Simulation Distance
  - Captures approximate similarity between XML trees
- Experimental Results
  - High accuracy for low space budgets
  - Efficient construction

Questions?

TreeSketch Model (2/2)
- Edge count: average number of children
[Figure: XML document and its TreeSketch with nodes R, P(1), S(2), F(2), F(2), E(2), C(4); edges annotated with counts such as #E and #C]

XML
[Figure: XML document tree rooted at r; tag legend: p: paper, s: section, c: caption, t: title, f: figure, e: equation]