Structure and Content Scoring for XML

Amélie Marian (Rutgers University)
Joint work with:
- Sihem Amer-Yahia (AT&T Research Labs)
- Nick Koudas (University of Toronto)
- Divesh Srivastava (AT&T Research Labs)
- David Toman (University of Waterloo)

Motivations: XML Data Heterogeneity
[Figure: three heterogeneous XML "book" fragments. Each holds title (Great Expectations), author (Dickens), and edition (paperback), but arranges them differently under book and info.]
Heterogeneous XML data about books.
Query: book[./info[./title="Great Expectations" and ./author="Dickens"] and ./edition="paperback"]
[Figure: the query as a tree pattern. book has children info and edition (paperback); info has children title (Great Expectations) and author (Dickens).]
Query root node: distinguished node.
1/2/2019 Amélie Marian - Rutgers University

XML Query Relaxation
[Amer-Yahia et al. EDBT’02]
Tree pattern relaxations:
- Leaf node deletion
- Edge generalization
- Subtree promotion
[Figure: the query tree pattern (book with children info and edition; info with children title and author) shown next to the three book fragments from the data; one node is labeled "edition?".]
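The three relaxations listed above can be sketched on a small tree-pattern encoding. This is only an illustration, not the authors' implementation: the dictionary encoding, the function names, and the assumption that node labels are unique within a pattern are all hypothetical. Subtree promotion is modeled here as re-attaching a node (and, implicitly, its subtree) to its grandparent with a "//" edge.

```python
from typing import Dict, Optional, Tuple

# Hypothetical encoding: node label -> (parent label, edge type).
# "/" is a parent-child edge, "//" an ancestor-descendant edge.
# The root has parent None. Labels are assumed unique in a pattern.
Pattern = Dict[str, Tuple[Optional[str], str]]

QUERY: Pattern = {
    "book": (None, ""),
    "info": ("book", "/"),
    "title": ("info", "/"),
    "author": ("info", "/"),
    "edition": ("book", "/"),
}


def is_leaf(pattern: Pattern, node: str) -> bool:
    """A node is a leaf if no other node names it as parent."""
    return all(parent != node for parent, _ in pattern.values())


def edge_generalization(pattern: Pattern, node: str) -> Pattern:
    """Relax the edge above `node` from "/" to "//"."""
    parent, _ = pattern[node]
    if parent is None:
        raise ValueError("the root has no incoming edge")
    relaxed = dict(pattern)
    relaxed[node] = (parent, "//")
    return relaxed


def leaf_deletion(pattern: Pattern, node: str) -> Pattern:
    """Remove a non-root leaf node from the pattern."""
    if pattern[node][0] is None or not is_leaf(pattern, node):
        raise ValueError("only non-root leaves can be deleted")
    relaxed = dict(pattern)
    del relaxed[node]
    return relaxed


def subtree_promotion(pattern: Pattern, node: str) -> Pattern:
    """Re-attach `node` (with its subtree) under its grandparent via "//"."""
    parent, _ = pattern[node]
    if parent is None or pattern[parent][0] is None:
        raise ValueError("node needs a grandparent to be promoted")
    relaxed = dict(pattern)
    relaxed[node] = (pattern[parent][0], "//")
    return relaxed
```

For example, `subtree_promotion(QUERY, "title")` yields a pattern that also matches books where title sits directly under book rather than under info, which is the kind of heterogeneity shown in the data.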

Motivations
- Top-k query processing is well suited to relaxed XML queries over heterogeneous collections
- Return the k XML nodes that are closest to the query structure
- Opportunity for more efficient query processing
- Need a scoring mechanism to identify the best k answers

Contributions
- Scoring mechanism for XML queries
- Data structures for top-k query processing
- Experimental evaluation

Scoring Functions Critical for Top-k Query Processing
- Top-k answer quality depends on the scoring function
- Efficient top-k query processing requires a scoring function that is:
  - Monotonic
  - Fast to compute
- Little attention has been given to scoring functions for structured and semi-structured data
  - Extensively studied over text data (e.g., tf.idf)
- Proposed scoring function for XML data is inspired by tf.idf

Adaptation of tf.idf to XML Queries
Document Collection (Information Retrieval) | XML Document
Document | XML Node (result is a subtree rooted at a distinguished node, i.e., a node with a given label and structural properties)
Keyword(s) | Query Pattern
idf (inverse document frequency) is a function of the fraction of documents that contain the keyword(s) | idf is a function of the fraction of distinguished nodes that match the query pattern
tf (term frequency) is a function of the number of occurrences of the keyword in the document | tf is a function of the number of ways the query pattern matches the distinguished node
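To make the XML side of this analogy concrete, the sketch below counts, for a simple path pattern, how many distinguished nodes match (the input to idf) and in how many ways a single node matches (the input to tf). The sample document, the `lib` wrapper element, and the helper names are made up for illustration; path patterns are evaluated with the limited XPath subset of Python's `xml.etree.ElementTree`, which is a stand-in for a real twig matcher.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample collection; "book" is the distinguished node label.
DOC = """
<lib>
  <book><info><title>Great Expectations</title><author>Dickens</author></info><edition>paperback</edition></book>
  <book><title>Great Expectations</title><info><author>Dickens</author></info><edition>paperback</edition></book>
  <book><info><title>Great Expectations</title><title>Great Expectations</title></info></book>
</lib>
"""

root = ET.fromstring(DOC)
books = root.findall("book")  # all distinguished nodes


def matching_nodes(nodes, path):
    """Distinguished nodes that match the path pattern at least once."""
    return [n for n in nodes if n.find(path) is not None]


def ways_matched(node, path):
    """Number of ways the path pattern matches one distinguished node."""
    return len(node.findall(path))


exact = "./info/title"   # title must be a child of info
relaxed = ".//title"     # title may appear anywhere below book

# Fraction of distinguished nodes matching each pattern (idf is a function of this).
exact_fraction = len(matching_nodes(books, exact)) / len(books)
relaxed_fraction = len(matching_nodes(books, relaxed)) / len(books)

# Number of matches inside the third book (tf is a function of this).
third_book_matches = ways_matched(books[2], exact)
```

The relaxed pattern matches a larger fraction of the book nodes than the exact one, so any idf that decreases with that fraction gives the relaxed pattern a lower score.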

Scoring Function for XML Approximate Matches
Required properties:
- Exact matches should be scored higher than relaxed matches (idf)
- Distinguished nodes with several matches should be ranked higher than those with fewer matches (tf)
How to combine tf and idf?
- tf.idf, as used in IR, violates the above properties
- Ranking based on idf, then breaking ties using tf, satisfies the properties
[Figure: the query tree pattern and two approximate matches, (a) and (b), annotated with score(a) <= score(b) and score(a) >= score(b).]
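The difference between the two combinations can be seen on two made-up answers. The sketch below is illustrative only: the `Answer` type, the function names, and the idf and tf values are assumptions, chosen so that a relaxed match with many occurrences overtakes an exact match under the product but not under the lexicographic (idf, then tf) order.

```python
from typing import List, NamedTuple


class Answer(NamedTuple):
    node_id: str
    idf: float  # score of the (possibly relaxed) pattern the node matches
    tf: int     # number of ways the node matches that pattern


def rank_idf_then_tf(answers: List[Answer], k: int) -> List[Answer]:
    """Order by idf first; tf is only used to break idf ties."""
    return sorted(answers, key=lambda a: (a.idf, a.tf), reverse=True)[:k]


def rank_by_product(answers: List[Answer], k: int) -> List[Answer]:
    """IR-style combination, shown for contrast."""
    return sorted(answers, key=lambda a: a.idf * a.tf, reverse=True)[:k]


# Illustrative values: an exact match found once, a relaxed match found five times.
ANSWERS = [
    Answer("exact", idf=1.228, tf=1),
    Answer("relaxed", idf=1.049, tf=5),
]
```

With these values, `rank_by_product` puts the relaxed answer first (1.049 * 5 exceeds 1.228 * 1), while `rank_idf_then_tf` keeps the exact answer on top, as the first required property demands.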

A Family of Scoring Methods for XML Path Queries
[Figure: the query twig. book has children info and edition (paperback); info has children title (Great Expectations) and author (Dickens).]
- Twig predicate: high quality, expensive computation
- Path predicates
- Binary predicates: low quality, fast computation
[Figure: the query decomposed into path predicates and into binary predicates, with the component scores combined by "+".]

Matrix Representation of Twigs
Twigs (queries and tuples) can be represented by matrices that capture all relationships in the query.
[Figure: a query over nodes a, b, c, d, e and a partial tuple (a1, b1, c1, d1, e1), each shown as a matrix indexed by a-e with entries =, /, //, X, and ?. The tuple is shown in three states: not joined with e yet, no matches for e, and e1 matches.]
Matrix subsumption is used to compare tuples and queries.

Representing Relaxed Query Patterns: DAG Structure
- Each child is more relaxed (has more matches) than its parent
- The idf of a child is no higher than the idf of its parent
- idf scores are accessible in constant time for any match (complete or partial) using a hash function
- Exhaustive algorithm to build the DAG
[Figure: DAG of relaxed patterns over nodes a, b, c, from the original query at the top down to the single node a.]
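An exhaustive construction of such a DAG can be sketched as a breadth-first expansion with a hash table of visited patterns. This is a simplified illustration under stated assumptions, not the authors' algorithm: a pattern is modeled as a set of binary predicates, one relaxation step either generalizes a "/" edge to "//" or drops a "//" predicate, and the type and function names are hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical encoding: (ancestor label, edge, descendant label).
Predicate = Tuple[str, str, str]
Pattern = FrozenSet[Predicate]


def relax_once(pattern: Pattern) -> List[Pattern]:
    """All patterns reachable by relaxing exactly one predicate."""
    children = []
    for pred in pattern:
        ancestor, edge, descendant = pred
        rest = pattern - {pred}
        if edge == "/":
            # Edge generalization: "/" becomes "//".
            children.append(rest | {(ancestor, "//", descendant)})
        else:
            # Simplification: a "//" predicate is dropped altogether.
            children.append(rest)
    return children


def build_dag(query: Pattern) -> Dict[Pattern, List[Pattern]]:
    """Map each reachable relaxed pattern to its direct relaxations."""
    dag: Dict[Pattern, List[Pattern]] = {}
    queue = deque([query])
    while queue:
        pattern = queue.popleft()
        if pattern in dag:
            continue  # already expanded; hashing gives O(1) lookup
        dag[pattern] = relax_once(pattern)
        queue.extend(dag[pattern])
    return dag


QUERY: Pattern = frozenset({("a", "/", "b"), ("a", "/", "c")})
DAG = build_dag(QUERY)
```

Because patterns are hashable, the same dictionary can also hold a precomputed score per pattern, which is one way to get the constant-time access mentioned above. In this simplified model each predicate has three states ("/", "//", absent), so the DAG grows exponentially with the number of predicates.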

Information stored in the DAG
idf score information:
- idf = (1 + |a|) / (1 + |a_p|), where |a| is the number of a nodes and |a_p| is the number of a nodes that satisfy the query predicate
For query processing:
- Best possible score from here
- Best possible score after each remaining join operation
- Number of matches (useful for tf)
[Figure: the relaxation DAG over nodes a, b, c with an idf score on each node; values shown are 1.228, 1.2, 1.195, 1.167, 1.156, 1.049, and 1.]
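The idf formula on this slide is easy to evaluate directly. The sketch below uses made-up node counts (they are not taken from the talk) to show that a more relaxed pattern, which matches more a nodes, gets a lower idf, and that the most relaxed pattern, which every a node satisfies, gets exactly 1.

```python
def idf(total_nodes: int, matching_nodes: int) -> float:
    """idf = (1 + |a|) / (1 + |a_p|), as written on the slide."""
    return (1 + total_nodes) / (1 + matching_nodes)


# Hypothetical counts: 1000 a nodes in the collection, and the number of
# them that satisfy three increasingly relaxed patterns.
TOTAL_A = 1000
MATCH_COUNTS = {
    "original query": 120,
    "one edge generalized": 300,
    "a alone": 1000,
}

SCORES = {name: idf(TOTAL_A, count) for name, count in MATCH_COUNTS.items()}
```

Since the matching count can only grow as a pattern is relaxed, the idf values never increase along a path down the DAG.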

Query Processing using the DAG
Benefits:
- Score computation is done in a preprocessing phase (using exact or approximate information)
- Score access during query processing takes constant time
- Additional information needed for query processing (e.g., score upper bound) is precomputed and accessed in constant time
- tf is estimated at runtime based on available information
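One common way a precomputed score upper bound helps top-k processing is pruning: a partial match whose best possible score cannot beat the current k-th best answer does not need its remaining joins. The sketch below shows that generic idea only; the candidate format, the `finish` callback standing in for the remaining join work, and the function name are assumptions, and the talk's actual query processing strategy may differ.

```python
import heapq
from typing import Callable, Iterable, List, Tuple

# Hypothetical candidate: (node id, score upper bound, function that performs
# the remaining joins and returns the final score).
Candidate = Tuple[str, float, Callable[[], float]]


def top_k(candidates: Iterable[Candidate], k: int) -> List[Tuple[float, str]]:
    """Return the k best (score, node id) pairs, best first."""
    if k <= 0:
        return []
    heap: List[Tuple[float, str]] = []  # min-heap holding the current top k
    for node_id, upper_bound, finish in candidates:
        if len(heap) == k and upper_bound <= heap[0][0]:
            continue  # cannot enter the top k, so skip the remaining joins
        score = finish()
        if len(heap) < k:
            heapq.heappush(heap, (score, node_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, node_id))
    return sorted(heap, reverse=True)
```

The pruning test is a single comparison per candidate, so it only pays off if the upper bound itself is available in constant time, which is what the precomputation is for.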

Quality/Space/Time tradeoff
Binary predicates:
- Smaller DAG (O(4^q))
- Faster preprocessing (and processing)
- Lower quality (fewer possible scores)
Path predicates and twig:
- DAG is O(4^(q^2/2)) in space (still reasonable in practice)
- More preprocessing
- Higher quality (more differences between scores)

Experimental Setup
Data:
- Synthetic heterogeneous document collections generated with ToXgene
- Real dataset: Wall Street Journal Treebank corpora
Pregenerated queries exhibiting different sizes, query structures, and predicates
Measures:
- DAG size
- DAG preprocessing time
- Query processing time
- Precision (percentage of top-k answers that are actual top-k answers, as given by Twig)

XML Scoring Precision

XML Scoring Preprocessing Time

XML Scoring Real data

Conclusions
- Scoring method for XML queries
  - Inspired by tf.idf
  - Accounts for structure and content
  - Accounts for structural relaxations
- Efficient data structures to compute and access scores during top-k query processing
  - DAG
  - Matrix representation of queries and tuples
- Evaluation of the scoring methods' tradeoffs
  - Answer quality vs. preprocessing time

Related Work
- IR scoring
  - Content only
- XML scoring
  - Content with structure
  - XIRQL [XML&IR’00], JuruXML [SIGIR’03], IR-CADG [WebDB’04]
  - None of these techniques accounts for structural relaxations (with the exception of our previous work [ICDE’05])
- XML structural relaxation
  - FleXPath [SIGMOD’04], Kanza and Sagiv [PODS’01], Schlieder [EDBT’02], Delobel and Rousset [FMII’01]

Future Work
- Streaming scenarios
  - Incremental updates on DAG
  - Approximate scoring
- Integration with approximate text scoring
  - Extend proposed XML scoring function to handle text content approximation (e.g., misspellings)
  - Unify structure and content score
- Quality evaluation (INEX)
