XPathLearner: An On-Line Self-Tuning Markov Histogram for XML Path Selectivity Estimation Authors: Lipyeow Lim, Min Wang, Sriram Padmanabhan, Jeffrey.

XPathLearner: An On-Line Self-Tuning Markov Histogram for XML Path Selectivity Estimation. Authors: Lipyeow Lim, Min Wang, Sriram Padmanabhan, Jeffrey Scott Vitter, Ronald Parr. Speaker: Ho Wai Shing

Contents: Introduction (the problems in XML path selectivity estimation); XPathLearner (its properties and details); Experiment Results; Conclusions; Future Work

Introduction: XML is becoming the standard for data exchange. We need to query both the structure and the text data of XML documents. Selectivity estimates are essential for optimizing query evaluation plans.

Introduction: Example: FOR $b IN document("*")//book WHERE $b/publisher = "Morgan Kaufmann" AND $b/year = "1998" RETURN $b/title. The path expressions are: //book/publisher = "Morgan Kaufmann"; //book/year = "1998"; //book/title.

Introduction: We need a structure that stores some statistics about the data, and then we calculate the estimated selectivity from these statistics. Problem: estimate the selectivity of (simple, single-value, multi-value) path expressions within limited space.

Related Work: Path Trees; Markov Tables; k-RO (in Lore)

Path Trees: Aggregate siblings with the same tag. Store tag names only (no data values).
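The aggregation described on this slide can be sketched in Python. This is a minimal illustration and not the authors' code: the nested-tuple document format `(tag, [children])`, the toy document `doc`, and the function name `build_path_tree` are all assumptions made for the example. Nodes that share the same tag under the same path-tree parent are merged into one node carrying a count.

```python
def build_path_tree(root):
    """Merge document nodes that share a tag under the same path-tree parent.

    A document node is assumed to be a tuple (tag, [child nodes]).
    Each path-tree node records how many document nodes it stands for.
    """
    def merge(nodes):
        # All nodes in this list have the same tag and the same merged parent.
        by_tag = {}
        for _, children in nodes:
            for child in children:
                by_tag.setdefault(child[0], []).append(child)
        return {
            "tag": nodes[0][0],
            "count": len(nodes),
            "children": [merge(group) for group in by_tag.values()],
        }

    return merge([root])


# Hypothetical toy document: A has two B children, each B has one C,
# and the two C elements have two and one D children respectively.
doc = ("A", [
    ("B", [("C", [("D", []), ("D", [])])]),
    ("B", [("C", [("D", [])])]),
])

path_tree = build_path_tree(doc)
b_node = path_tree["children"][0]
c_node = b_node["children"][0]
d_node = c_node["children"][0]
```

In this toy case the path tree has one node per distinct path (A, A/B, A/B/C, A/B/C/D), and only tags and counts are kept, which matches the "no data values" remark on the slide.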

Markov Table: The selectivity of short paths, up to length k, is stored. The selectivity of longer paths is estimated using a Markov model.
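A minimal Python sketch of this idea follows. It is an illustration under stated assumptions, not the implementation from any of the cited systems: the nested-tuple document format, the toy document `doc`, and the names `build_markov_table` and `estimate` are hypothetical. Counts are stored for every tag sequence of length 1 to k, and a longer path is estimated by chaining overlapping length-k counts.

```python
from collections import Counter


def build_markov_table(root, k=2):
    """Count every downward tag sequence of length 1..k in the document.

    A document node is assumed to be a tuple (tag, [child nodes]).
    """
    table = Counter()

    def visit(node, ancestors):
        tag, children = node
        chain = ancestors + [tag]
        # Each path instance is counted once, at the node where it ends.
        for length in range(1, min(k, len(chain)) + 1):
            table[tuple(chain[-length:])] += 1
        for child in children:
            visit(child, chain)

    visit(root, [])
    return table


def estimate(table, path, k=2):
    """Estimate the frequency of a tag path; assumes k >= 2."""
    path = tuple(path)
    if len(path) <= k:
        return float(table.get(path, 0))  # short paths are stored exactly
    est = float(table.get(path[:k], 0))
    for i in range(1, len(path) - k + 1):
        numer = table.get(path[i:i + k], 0)
        denom = table.get(path[i:i + k - 1], 0)
        if denom == 0:
            return 0.0
        est *= numer / denom
    return est


# Hypothetical toy document.
doc = ("A", [
    ("B", [("C", [("D", []), ("D", [])])]),
    ("B", [("C", [("D", [])])]),
])

table = build_markov_table(doc, k=2)
est_bcd = estimate(table, ["B", "C", "D"], k=2)  # f(BC) * f(CD) / f(C)
```

With k = 2 the table holds only single tags and tag pairs, so //B/C/D is never looked up directly; it is derived from f(BC), f(CD), and f(C).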

k-RO: Used in the Lore system. Very similar to the Markov table. Data values are also objects. Stored as a graph.

Twigs: Can answer "twig" queries, i.e., structural queries with a small branch. Based on a suffix tree (for simple paths) plus signatures in each node (for estimating branching).

Problems Faced: Offline: the whole repository must be scanned beforehand to gather statistics, which is infeasible if the data is remote and extremely large. Can handle simple path expressions (SPEs) only, or else the structure is too large. Ignore data values.

Problems Faced: Not adaptive to the query workload: much space is wasted on infrequently queried paths. No quick update: a periodic rescan of the repository is needed.

Objective: XPathLearner uses a Markov-based approach; uses an online algorithm; is adaptive to the workload; can answer simple paths, single-value paths (//A/B='3'), and multi-value paths (//A='2'/B='3'); considers data values; and can be easily updated.

XPathLearner

Architecture

A More Detailed Example

What to Store? A Markov table (1st order in this discussion).

What to Store? The table may be large if there are many data values. Solution: only the "tag-tag", "tag", and top-k value entries are stored exactly; other entries are stored within buckets. The default is 1.

What is Actually Stored? A compressed 1st-order Markov table (or Markov histogram). Assumption: v1-v4 start with 'a', v5-v8 start with 'b', and k = 1.
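The compression step can be sketched as follows. This is only an illustration under assumptions: the slide does not say what a bucket stores, so the sketch assumes each bucket keeps a total frequency and a count of distinct values and answers lookups with their average. The value names, frequencies, and the functions `compress_values` and `lookup` are hypothetical.

```python
from collections import defaultdict


def compress_values(value_freqs, k=1):
    """Keep the k most frequent values exactly; bucket the rest by first character."""
    ranked = sorted(value_freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    exact = dict(ranked[:k])
    # first character -> [total frequency, number of distinct values]
    buckets = defaultdict(lambda: [0, 0])
    for value, freq in ranked[k:]:
        bucket = buckets[value[0]]
        bucket[0] += freq
        bucket[1] += 1
    return exact, dict(buckets)


def lookup(exact, buckets, value):
    """Return the exact frequency if stored, else the bucket average (assumption)."""
    if value in exact:
        return float(exact[value])
    total, count = buckets.get(value[0], (0, 0))
    return total / count if count else 0.0


# Hypothetical frequencies: four values starting with 'a', four with 'b'.
freqs = {"a1": 9, "a2": 3, "a3": 2, "a4": 1,
         "b1": 4, "b2": 2, "b3": 1, "b4": 1}

exact, buckets = compress_values(freqs, k=1)
```

With k = 1 only the single most frequent value keeps its own entry; the other seven values share two buckets, one per leading character, as in the slide's assumption.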

How to Retrieve Selectivity? Use this formula: [formula not preserved], where σ is the selectivity; t_1, t_2, ..., t_n are tags; t_1 t_2 ... t_n is the path with these tags; and N is the total number of data items.

How to Retrieve Selectivity? Use this formula (it is what we calculate): [formula not preserved], where σ is the selectivity; t_1, t_2, ..., t_n are tags; t_1 t_2 ... t_n is the path with these tags; and f(p) is the frequency of the path p.
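Judging from the worked example on the Heavy Tail Rule slide, where the estimate for ACD is computed as f(AC) / f(C) x f(CD), the first-order estimate presumably has the form below. This is a reconstruction from that example and the symbol definitions on this slide, not a copy of the slide's own formula.

```latex
\sigma(t_1 t_2 \dots t_n) \;\approx\; f(t_1 t_2) \prod_{i=2}^{n-1} \frac{f(t_i t_{i+1})}{f(t_i)}
```

For n = 3 and the path ACD this reduces to f(AC) x f(CD) / f(C), which agrees with the example.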

How to Retrieve Selectivity? Use this formula (if it is multi-valued): [formula not preserved], where σ is the selectivity; t_1, t_2, ..., t_n are tags; t_1 t_2 ... t_n is the path with these tags; and f(t,v) is the frequency of the value v in tag t.

Retrieval Example: For path //B/C/D, estimated selectivity = [value not preserved]. For path //B/C/D=v3, estimated selectivity = [value not preserved].

How to Update? Get the query feedback, e.g., (BCD, 5). Update the histogram entries contained in the query so that future estimates are more accurate, e.g., update B, C, D, BC, CD so that the estimate is nearer to 5 than before. Two update approaches: the Heavy-Tail Rule and the Delta Rule.

Heavy Tail Rule: Put more correction towards the end (tail) of the path. Equation: [equation not preserved], where f_k() refers to the frequency before the update and f_{k+1}() refers to the frequency after the update. Suggestion: w_i = 2^i.

Heavy Tail Rule: Updating the one-'tag' entries. Safeguards the terms that were set by exact query feedback.

Heavy Tail Rule: A reminder of what is stored.

Heavy Tail Rule: Example: query feedback = (ACD, 6). By the table, estimation = f(AC) / f(C) x f(CD) = 3 / 7 x 6 ≈ 3.

Heavy Tail Rule: Updates: new estimation = 4 / 8 x 8 = 4.
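The arithmetic of this example can be checked directly. The sketch below uses only the table values quoted on the two example slides (f(AC), f(C), and f(CD) before and after the update); the dictionary layout and the function name `estimate_acd` are illustrative assumptions, and the update equation itself is not reproduced.

```python
def estimate_acd(table):
    """First-order estimate for the path ACD: f(AC) / f(C) * f(CD)."""
    return table["AC"] / table["C"] * table["CD"]


feedback = 6  # true selectivity reported by the query feedback (ACD, 6)

before = {"AC": 3, "C": 7, "CD": 6}  # entries before the update
after = {"AC": 4, "C": 8, "CD": 8}   # entries after the update

est_before = estimate_acd(before)  # 18 / 7, which the slide rounds to 3
est_after = estimate_acd(after)    # 4 / 8 * 8

error_before = abs(est_before - feedback)
error_after = abs(est_after - feedback)
```

The tail entry f(CD) moves by 2 while f(AC) moves by 1, which is the "more correction towards the tail" behaviour, and the estimate moves from about 2.57 to 4, closer to the feedback value of 6.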

Delta Rule: First proposed by Rumelhart et al. Basic idea: [formula not preserved].
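Since the slide's formula is missing, the sketch below shows only the generic delta rule, that is, one gradient-descent step on the squared error between the estimate and the feedback, applied to the first-order estimate. It should be read as an assumption about the general shape of the update, not as the paper's exact rule. The learning rate `rate`, the tuple-keyed table, and the function names are hypothetical.

```python
def estimate(table, path):
    """First-order estimate: f(t1 t2) * product of f(ti ti+1) / f(ti)."""
    est = table[path[0:2]]
    for i in range(1, len(path) - 1):
        est *= table[path[i:i + 2]] / table[path[i:i + 1]]
    return est


def delta_update(table, path, feedback, rate=0.05):
    """One gradient step on E = (estimate - feedback)^2 / 2.

    Assumes path is a tuple of at least two tags and all entries are positive.
    """
    est = estimate(table, path)
    error = est - feedback
    grads = {}
    # d(est)/d f(pair) = est / f(pair) for each tag pair on the path.
    for i in range(len(path) - 1):
        pair = path[i:i + 2]
        grads[pair] = grads.get(pair, 0.0) + est / table[pair]
    # d(est)/d f(tag) = -est / f(tag) for each interior tag.
    for i in range(1, len(path) - 1):
        single = path[i:i + 1]
        grads[single] = grads.get(single, 0.0) - est / table[single]
    for key, grad in grads.items():
        # Keep entries strictly positive so later estimates stay defined.
        table[key] = max(table[key] - rate * error * grad, 1e-9)
    return est


# Same starting entries as the Heavy Tail Rule example, feedback (ACD, 6).
table = {("A", "C"): 3.0, ("C",): 7.0, ("C", "D"): 6.0}
path = ("A", "C", "D")

est_before = delta_update(table, path, feedback=6)
est_after = estimate(table, path)
```

Unlike the heavy-tail rule, the size of each correction here follows from the gradient rather than from position-based weights; a larger `rate` moves the estimate faster but can overshoot.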

Experiments

Data Set: DBLP (other experiments were done but are not included in the paper). Metrics: average absolute error and average relative error.
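The two metrics can be sketched as below. The slide names them without defining them, so the definitions here are the conventional ones and should be treated as assumptions; in particular, how zero true selectivities are handled in the relative error is not stated, and this sketch simply assumes every true value is positive. The sample numbers are made up for illustration.

```python
def average_absolute_error(estimates, actuals):
    """Mean of |estimate - actual| over all queries."""
    return sum(abs(e - a) for e, a in zip(estimates, actuals)) / len(actuals)


def average_relative_error(estimates, actuals):
    """Mean of |estimate - actual| / actual; assumes every actual value is > 0."""
    return sum(abs(e - a) / a for e, a in zip(estimates, actuals)) / len(actuals)


# Hypothetical estimated and true selectivities for two queries.
estimates = [4.0, 10.0]
actuals = [6.0, 8.0]

abs_err = average_absolute_error(estimates, actuals)
rel_err = average_relative_error(estimates, actuals)
```

The absolute error is dominated by high-selectivity paths, while the relative error weights every query equally, which is presumably why both are reported.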

Experiments

Conclusions: XPathLearner is a new method for estimating the selectivity of path expressions. It is online, is based on query feedback, and does not need a database scan. It uses Markov histograms to store statistics.

Future Work: Change from a fixed-length Markov table to a variable-length Markov table. Choose the paths to be stored more carefully. Apply the update method to other areas, e.g., graph-based structures or answering branching queries.

References: [1] Lipyeow Lim, Min Wang, Sriram Padmanabhan, Jeffrey Scott Vitter, Ronald Parr, "XPathLearner: An On-Line Self-Tuning Markov Histogram for XML Path Selectivity Estimation", VLDB'02.