Estimating the Selectivity of XML Path Expressions for Internet Scale Applications Ashraf Aboulnaga Alaa R. Alameldeen Jeffrey F. Naughton Computer Sciences.

Presentation transcript:

Estimating the Selectivity of XML Path Expressions for Internet Scale Applications Ashraf Aboulnaga Alaa R. Alameldeen Jeffrey F. Naughton Computer Sciences Department University of Wisconsin - Madison

Motivation: XML enables Internet-scale applications that query data from many sources (Niagara, Xyleme, …). Queries over XML data use path expressions, and optimizing these queries requires estimating the selectivity of the path expressions. Focus of this talk: building statistics for XML data and using them to estimate the selectivity of simple path expressions.

What is XML? Example data (markup lost in the transcript): a play, Pygmalion by Bernard Shaw, and a novel, David Copperfield by Charles Dickens.

Querying XML
FOR $n_auth IN document("*")//novel/author
    $p_auth IN document("*")//play/author
WHERE $n_auth/text() = $p_auth/text()
RETURN $n_auth
Optimizing this query requires estimating the selectivity of the path expressions. This requires information about the structure of the XML data.

Goal of this Work: Build database statistics that capture the structure of XML data. Ensure that the statistics fit in a small amount of memory, for efficient query optimization; this is important for Internet-scale applications. Use the statistics to estimate the selectivity of simple XML path expressions of the form //t1/t2/…/tn.

Outline of Presentation: Introduction, Path Trees, Markov Tables, Performance Evaluation, Conclusions

Path Trees (figure: example path tree; nodes listed as tag and frequency) A 1, C 1, B 2, D 1, D 1, E 3
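The path tree idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the sample document `DOC` is an assumption chosen so that its node frequencies match the figure's list (the actual tree shape on the slide is not recoverable from the transcript), and the names `build_path_tree` and `selectivity` are hypothetical. Each path tree node is represented by its root-to-node tag path, and the selectivity of //t1/…/tn is the total frequency of all nodes whose root path ends with t1/…/tn.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical document; its shape is an assumption, not taken from the slide.
DOC = "<A><B><D/></B><B><E/><E/><E/></B><C><D/></C></A>"


def build_path_tree(xml_text):
    """Map each distinct root-to-element tag path to its element count."""
    root = ET.fromstring(xml_text)
    counts = Counter()

    def visit(elem, prefix):
        path = prefix + (elem.tag,)
        counts[path] += 1          # one more element reached by this root path
        for child in elem:
            visit(child, path)

    visit(root, ())
    return counts


def selectivity(path_tree, tags):
    """Number of elements reached by the simple path //t1/t2/.../tn."""
    tags = tuple(tags)
    n = len(tags)
    # A path tree node matches when its root path ends with the query tags.
    return sum(freq for path, freq in path_tree.items() if path[-n:] == tags)


tree = build_path_tree(DOC)
print(selectivity(tree, ["D"]))       # D elements under any parent
print(selectivity(tree, ["B", "E"]))  # E elements whose parent is B
```

Because an unsummarized path tree stores every distinct root path, these counts are exact; the summarization methods that follow trade this exactness for memory.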

Summarizing Path Trees: Path trees contain all the information needed for selectivity estimation. Problem: they may not fit in available memory (small available memory, Internet scale). Solution: remove low-frequency nodes and replace the removed nodes with *-nodes. Tag name: *, meaning "any tag". Frequency: average frequency of the replaced nodes. Four methods: Sibling-*, Level-*, Global-*, No-*.

Sibling-* Summarization (figure: original path tree; nodes listed as tag and frequency) A 1, C 9, B 13, G 10, F 15, H 6, K 12, E 5, D 7, K 11, J 4, I 2

Sibling-* Summarization (figure) A 1, C 9, B 13, G 10, F 15, H 6, K 12, E 5, D 7, K 11, * (f=6, n=2). Nodes J 4 and I 2 are replaced by one *-node. *-nodes represent deleted sibling nodes. Memory is saved by coalescing nodes.

Sibling-* Summarization (figure) A 1, C 9, B 13, G 10, F 15, H 6, K 12, K 11, * (f=6, n=2), * (f=12, n=2). Nodes E 5 and D 7 are replaced by a second *-node.

Sibling-* Summarization (figure) A 1, C 9, B 13, F 15, K 12, K 11, * (f=6, n=2), * (f=12, n=2), * (f=16, n=2). Nodes G 10 and H 6 are replaced by a third *-node.

Sibling-* Summarization (figure) A 1, C 9, B 13, F 15, * (f=6, n=2), * (f=12, n=2), * (f=16, n=2), K* (f=23, n=2). Nodes K 12 and K 11 are merged into K*.

Sibling-* Summarization (figure: summarized tree) A 1, C 9, B 13, F 15, * 6, * 8, * 3, K* (f=23, n=2). Each *-node now shows the average frequency of the nodes it replaced.

Original Path Tree (figure) A 1, C 9, B 13, G 10, F 15, H 6, K 12, E 5, D 7, K 11, J 4, I 2

Sibling-* Summarization (figure: summarized tree) A 1, C 9, B 13, F 15, *-nodes, K* (f=23, n=2). Try to retain as much information as possible about the deleted nodes.
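The coalescing step shown in the frames above can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm from the talk: it uses a fixed frequency `threshold` instead of deleting nodes until the tree fits a memory budget, it handles only one group of siblings, and it ignores the children of deleted nodes. The function name `coalesce_siblings` and the dictionary layout are hypothetical.

```python
def coalesce_siblings(siblings, threshold):
    """Replace low-frequency siblings with one *-node.

    siblings: dict mapping tag -> frequency for the children of one parent.
    Returns (kept, star): the surviving siblings, and either None or a dict
    with the total frequency f, the number of replaced nodes n, and their
    average frequency avg (the value a *-node finally stores).
    """
    kept = {tag: f for tag, f in siblings.items() if f >= threshold}
    deleted = [f for f in siblings.values() if f < threshold]
    if not deleted:
        return kept, None
    total = sum(deleted)
    star = {"f": total, "n": len(deleted), "avg": total / len(deleted)}
    return kept, star


# Nodes J 4 and I 2 from the example become a *-node with f=6, n=2.
kept, star = coalesce_siblings({"J": 4, "I": 2}, threshold=5)
print(kept, star)
```

Keeping both `f` and `n` while summarization is in progress makes it cheap to fold further deleted siblings into the same *-node; the average is only needed once the tree is final.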

Level-* Summarization (figure: original path tree; nodes listed as tag and frequency) A 1, C 9, B 13, G 10, F 15, H 6, K 12, E 5, D 7, K 11, J 4, I 2

Level-* Summarization (figure: summarized tree) A 1, C 9, B 13, G 10, F 15, K 12, K 11, * 6, * 3. Less information about deleted nodes than sibling-*. Deletes fewer nodes than sibling-*.

Global-* Summarization (figure: original path tree; nodes listed as tag and frequency) A 1, C 9, B 13, G 10, F 15, H 6, K 12, E 5, D 7, K 11, J 4, I 2

Global-* Summarization (figure: summarized tree) C 9, B 13, G 10, F 15, H 6, K 12, D 7, K 11, * 3

No-* Summarization (figure: original path tree; nodes listed as tag and frequency) A 1, C 9, B 13, G 10, F 15, H 6, K 12, E 5, D 7, K 11, J 4, I 2

No-* Summarization (figure: summarized tree) C 9, B 13, G 10, F 15, H 6, K 12, E 5, D 7, K 11. Memory savings similar to global-*. Conservative assumption about deleted nodes.

Outline: Introduction, Path Trees, Markov Tables, Performance Evaluation, Conclusions

Markov Tables: A table of all distinct paths of length up to m and their frequencies. For paths of length greater than m, combine paths from the Markov table. This uses the "short memory" or "Markov" property. Example: f(A/B/C/D) = f(A/B/C) × f(B/C/D) / f(B/C)
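As a rough sketch of how such a table could be populated, the snippet below derives it from a path tree given as a mapping from root-to-node tag paths to frequencies. The input `PATH_TREE` is an assumption: its frequencies are chosen to agree with the example table in this presentation, but placing the D node with frequency 8 under A/C is a guess, since the transcript does not preserve the tree shape. The name `build_markov_table` is hypothetical.

```python
from collections import Counter

# Hypothetical path tree: root-to-node tag path -> frequency.
PATH_TREE = {
    ("A",): 1,
    ("A", "B"): 11,
    ("A", "C"): 6,
    ("A", "D"): 4,
    ("A", "B", "C"): 9,
    ("A", "B", "D"): 7,
    ("A", "C", "D"): 8,  # assumed position of this D node
}


def build_markov_table(path_tree, m):
    """Frequency of every distinct tag path of length 1..m.

    An element contributes to the path formed by the last k tags of its
    root path, for every k up to m that its depth allows.
    """
    table = Counter()
    for path, freq in path_tree.items():
        for k in range(1, min(m, len(path)) + 1):
            table[path[-k:]] += freq
    return table


table = build_markov_table(PATH_TREE, m=2)
print(sorted(table.items()))
```

With m = 2 the table holds single tags and parent/child pairs only, which is why longer paths have to be estimated by chaining entries.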

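Markov Tables (example, m = 2)
Path  Freq    Path  Freq
A     1       AC    6
B     11      AD    4
C     15      BC    9
D     19      BD    7
AB    11      CD    8
(figure: corresponding path tree; nodes listed as tag and frequency) A 1, D 4, C 6, B 11, D 7, C 9, D 8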
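To make the chaining of table entries concrete, here is a small sketch that estimates the frequency of a path longer than m from the example table above. It multiplies the frequency of the first m tags by the ratio f(t_i/…/t_{i+m-1}) / f(t_i/…/t_{i+m-2}) for each later window. The function name `estimate` is hypothetical, and returning 0 when a needed entry is missing is an assumption that corresponds to the No-* behaviour rather than to the *-path methods.

```python
# Example Markov table from the slide (m = 2): tag path -> frequency.
TABLE = {
    ("A",): 1, ("B",): 11, ("C",): 15, ("D",): 19,
    ("A", "B"): 11, ("A", "C"): 6, ("A", "D"): 4,
    ("B", "C"): 9, ("B", "D"): 7, ("C", "D"): 8,
}


def estimate(table, path, m):
    """Estimate the frequency of a tag path from a Markov table of order m."""
    if m < 2:
        raise ValueError("this sketch needs m >= 2")
    path = tuple(path)
    if len(path) <= m:
        # Short paths are stored directly, so the answer is exact.
        return float(table.get(path, 0))
    est = float(table.get(path[:m], 0))
    for i in range(1, len(path) - m + 1):
        num = table.get(path[i:i + m], 0)      # window of m tags
        den = table.get(path[i:i + m - 1], 0)  # its prefix of m - 1 tags
        if num == 0 or den == 0:
            return 0.0  # assumption: a missing path does not occur in the data
        est *= num / den
    return est


# f(A/B/C/D) is estimated as f(AB) * f(BC)/f(B) * f(CD)/f(C), roughly 4.8.
print(estimate(TABLE, ("A", "B", "C", "D"), m=2))
```

The estimate is only as good as the assumption that a tag's children do not depend on anything further up the path than the last m - 1 tags.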

Summarizing Markov Tables: Exact selectivities for paths of length up to m; approximate selectivities for paths longer than m. Problem: the table may not fit in available memory. Remove low-frequency paths: discard removed paths of length > 2, and replace removed paths of length 1 or 2 with *-paths. Three methods: Suffix-*, Global-*, No-*.

Suffix-* Summarization (table: original Markov table; entries listed as path and frequency) A 1, B 11, C 15, D 19, AB 11, AC 6, AD 4, BC 9, BD 7, CD 8

Suffix-* Summarization (table) A 1, B 11, C 15, D 19, AB 11, AC 6, AD 4, BC 9, BD 7, CD 8, * 0, ** 0. Two *-paths, * and **, are added with frequency 0.

Suffix-* Summarization (table) B 11, C 15, D 19, AB 11, AC 6, AD 4, BC 9, BD 7, CD 8, * (f=1, n=1), ** 0. Path A is deleted and added to *. S_D = { }, where S_D is the set of deleted paths of length 2.

Suffix-* Summarization (table) B 11, C 15, D 19, AB 11, AC 6, BC 9, BD 7, CD 8, * (f=1, n=1), ** 0. Path AD is deleted. S_D = { (AD,4) }

Suffix-* Summarization (table) B 11, C 15, D 19, AB 11, A* (f=10, n=2), BC 9, BD 7, CD 8, * (f=1, n=1), ** 0. Path AC is deleted and combined with AD into A*. S_D = { }

Suffix-* Summarization (table) B 11, C 15, D 19, AB 11, A* (f=10, n=2), BC 9, CD 8, * (f=1, n=1), ** 0. Path BD is deleted. S_D = { (BD,7) }

Suffix-* Summarization (table) B 11, C 15, D 19, AB 11, A* (f=10, n=2), BC 9, * (f=1, n=1), ** 0. Path CD is deleted. S_D = { (BD,7), (CD,8) }

Suffix-* Summarization (table) B 11, C 15, D 19, AB 11, A* (f=10, n=2), B* (f=16, n=2), * (f=1, n=1), ** 0. Path BC is deleted and combined with BD into B*. S_D = { (CD,8) }

Suffix-* Summarization (table) B 11, C 15, D 19, AB 11, B* (f=16, n=2), * (f=1, n=1), ** (f=10, n=2). Path A* is deleted and added to **. S_D = { (CD,8) }

Suffix-* Summarization (table: summarized Markov table) B 11, C 15, D 19, AB 11, B* 8, * 1, ** 6. The path remaining in S_D is added to **, and each *-path now shows the average frequency of the paths it replaced. S_D = { }

Global-*, No-* Summarization: Global-* uses two *-paths, * and **, and deletes fewer paths than suffix-* to summarize the Markov table. No-* uses no *-paths and conservatively assumes that paths not in the Markov table do not exist in the data.

Outline: Introduction, Path Trees, Markov Tables, Performance Evaluation, Conclusions

Data Sets for Experiments: Synthetic data set: 100,000 XML elements; path tree with 3197 nodes, 6 levels, 38 KB; element frequencies Zipfian (z=1). DBLP data set: 1,399,765 XML elements; path tree with 5883 nodes, 6 levels, 69 KB.

Query Workloads: 1,000 paths of length between 1 and 4. Random paths: all query paths exist in the data. Random tags: most query paths of length 2 or more do not exist in the data. Available memory: between 5 and 50 KB.

Best Summarization Methods: Path trees: Global-* when query paths are in the data, No-* when they are not. Markov tables: m = 2 is best; Suffix-* when query paths are in the data, No-* when they are not.

Path Trees vs. Markov Tables: When to use path trees and when to use Markov tables? Also compared against Pruned Suffix Trees (PSTs) [Chen et al, ICDE 2001], which can handle branching path expressions and conditions on element values.

Synthetic Data – Random Paths

Synthetic Data – Random Tags

DBLP Data – Random Paths

DBLP Data – Random Tags

When are Markov Tables Better? DBLP: repeated sub-structures are effectively captured by Markov tables. (figure: DBLP structure, … … …)

Conclusions: Novel statistics for estimating the selectivity of XML path expressions. They scale to "all the XML data on the Internet" and are more accurate than the best previously known alternative. Repeated sub-structures: Markov tables. No repeated sub-structures: path trees. Query paths exist in the data: Global-*, Suffix-*. Query paths do not exist in the data: No-*. To appear in VLDB 2001.