Structure and Content Scoring for XML

Amélie Marian (Columbia University)
Joint work with: Sihem Amer-Yahia (AT&T Research Labs), Nick Koudas (University of Toronto), Divesh Srivastava (AT&T Research Labs), David Toman (University of Waterloo)

Motivations: XML Data Heterogeneity
- Heterogeneous XML data about books
  [Figure: three book trees that arrange book, info, title (Great Expectations), author (Dickens), and edition (paperback) nodes differently]
- Query: book[./info[./title="Great Expectations" and ./author="Dickens"] and ./edition="paperback"]
  [Figure: the query drawn as a tree pattern over book, info, title, author, and edition]
- Query root node: distinguished node
4/9/2019 Amélie Marian - Columbia University

XML Query Relaxation [Amer-Yahia et al. EDBT’02]
Tree pattern relaxations:
- Leaf node deletion
- Edge generalization
- Subtree promotion
[Figure: the query tree pattern (book, info, title, author, edition), carrying an "edition?" annotation, shown next to the three heterogeneous book data trees]
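The three relaxations named on this slide can be sketched on a small tree-pattern type. This is a minimal illustration, not the authors' implementation: `PNode`, `path_to`, and the three functions are hypothetical names, value predicates such as "Great Expectations" are left out, and the code does not enforce any preconditions the original definitions may place on when a relaxation applies.

```python
import copy
from dataclasses import dataclass, field
from typing import List

@dataclass(eq=False)
class PNode:
    label: str
    edge: str = "/"  # edge to the parent: "/" (child) or "//" (descendant)
    children: List["PNode"] = field(default_factory=list)

def to_xpath(n):
    # Serialise a pattern as nested XPath-style predicates.
    return n.label + "".join(f"[.{c.edge}{to_xpath(c)}]" for c in n.children)

def path_to(root, label):
    # Nodes from the root down to the first node carrying `label` (empty if absent).
    if root.label == label:
        return [root]
    for child in root.children:
        sub = path_to(child, label)
        if sub:
            return [root] + sub
    return []

def edge_generalization(query, label):
    q = copy.deepcopy(query)
    path_to(q, label)[-1].edge = "//"      # parent-child becomes ancestor-descendant
    return q

def leaf_deletion(query, label):
    q = copy.deepcopy(query)
    *_, parent, leaf = path_to(q, label)
    assert not leaf.children, "only leaf nodes can be deleted"
    parent.children.remove(leaf)
    return q

def subtree_promotion(query, label):
    q = copy.deepcopy(query)
    *_, grandparent, parent, node = path_to(q, label)
    parent.children.remove(node)
    node.edge = "//"                       # the subtree now hangs below its grandparent
    grandparent.children.append(node)
    return q

# The slide's query, labels only: book[./info[./title and ./author] and ./edition]
query = PNode("book", children=[
    PNode("info", children=[PNode("title"), PNode("author")]),
    PNode("edition"),
])
print(to_xpath(edge_generalization(query, "info")))
print(to_xpath(leaf_deletion(query, "edition")))
print(to_xpath(subtree_promotion(query, "title")))
```

Each function returns a relaxed copy and leaves the input query untouched, so the same query can serve as the starting point for several relaxations.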

Motivations
- Top-k query processing suitable for relaxed XML queries over heterogeneous collections
- Return k XML nodes that are closest to query structure
- Opportunity for more efficient query processing
- Need scoring mechanism to identify best k answers

Contributions
- Scoring mechanism for XML queries
- Data structures for top-k query processing
- Experimental evaluation

Scoring Functions Critical for Top-k Query Processing
- Top-k answer quality depends on scoring function
- Efficient top-k query processing requires scoring function:
  - Monotonic
  - Fast to compute
- Little attention given to scoring functions for structured and semi-structured data
- Extensively studied over text data (e.g., tf.idf)
- Proposed scoring function inspired by tf.idf for XML data

Adaptation of tf.idf to XML
Information Retrieval -> XML
- Document collection -> XML document
- Document -> XML node (result is a subtree rooted at a distinguished node, i.e., a node with a given label and structural properties)
- Keyword(s) -> Query pattern
- idf (inverse document frequency) is a function of the fraction of documents that contain the keyword(s) -> idf is a function of the fraction of distinguished nodes that match the query pattern
- tf (term frequency) is a function of the number of occurrences of the keyword in the document -> tf is a function of the number of ways the query pattern matches the distinguished node
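The XML side of this mapping can be illustrated with a small counter. The sketch below is only an assumption-laden illustration: the tuple encodings, `count_matches`, and the three sample book trees are hypothetical, leaf values are ignored, tf is taken to be the raw number of embeddings, and idf uses the smoothed ratio (1+|a|)/(1+|ap|) quoted on the later "Information stored in the DAG" slide as one possible function of the matching fraction.

```python
from math import prod

# Data node: (label, [children]); pattern: (label, [(edge, subpattern), ...])

def descendants(node):
    for child in node[1]:
        yield child
        yield from descendants(child)

def count_matches(pattern, node):
    """Number of ways `pattern` can be embedded with its root mapped to `node`."""
    label, preds = pattern
    if label != node[0]:
        return 0
    # Each predicate is matched independently, so the counts multiply.
    return prod(
        sum(count_matches(sub, cand)
            for cand in (node[1] if edge == "/" else descendants(node)))
        for edge, sub in preds
    )

# Pattern book[./info[./title and ./author]], labels only.
pattern = ("book", [("/", ("info", [("/", ("title", [])), ("/", ("author", []))]))])

books = [  # hypothetical distinguished nodes
    ("book", [("info", [("title", []), ("author", [])]), ("edition", [])]),
    ("book", [("title", []), ("info", [("author", []), ("author", [])])]),
    ("book", [("info", [("title", []), ("author", []), ("author", [])])]),
]

tfs = [count_matches(pattern, b) for b in books]   # matches per distinguished node
matching = sum(1 for tf in tfs if tf > 0)          # distinguished nodes that match
idf = (1 + len(books)) / (1 + matching)
print(tfs, idf)
```

The second book has no title under info, so it contributes no match, while the third matches twice because either author can satisfy the author predicate.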

Scoring Function for XML Approximate Matches
Required properties:
- Exact matches should be scored higher than relaxed matches (idf)
- Distinguished nodes with several matches should be ranked higher than those with fewer matches (tf)
How to combine tf and idf?
- tf.idf, as used by IR, violates above properties
- Ranking based on idf, then breaking ties using tf satisfies the properties
[Figure: example book tree patterns and two fragments labeled (a) and (b), annotated score(a) <= score(b) and score(a) >= score(b)]
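A small numeric example shows why the product can violate the first property while the idf-then-tf ordering cannot. The answers below are hypothetical: the idf values are borrowed from the DAG slide purely for illustration and the tf values are invented.

```python
# Hypothetical answers: one exact match and two matches to a relaxed pattern.
answers = [
    {"node": "exact", "idf": 1.228, "tf": 1},
    {"node": "relaxed_many", "idf": 1.049, "tf": 3},
    {"node": "relaxed_one", "idf": 1.049, "tf": 1},
]

# IR-style product: a relaxed answer with many matches overtakes the exact one.
by_product = sorted(answers, key=lambda a: a["idf"] * a["tf"], reverse=True)

# Lexicographic order: rank on idf first and use tf only to break ties.
by_idf_then_tf = sorted(answers, key=lambda a: (a["idf"], a["tf"]), reverse=True)

print([a["node"] for a in by_product])
print([a["node"] for a in by_idf_then_tf])
```

With the product, the relaxed answer scores 3 x 1.049 and lands above the exact match at 1.228. Comparing (idf, tf) pairs keeps the exact match first and still orders the two relaxed answers by tf.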

A Family of Scoring Methods for XML Path Queries
- Twig predicate: high quality, expensive computation
- Path predicates
- Binary predicates: low quality, fast computation
[Figure: the query tree pattern (book, info, title, author, edition) and its decompositions into smaller patterns joined by "+"]
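One plausible reading of the decomposition is sketched below. The slide does not spell out the exact form of the predicates, so two assumptions are made here and should be checked against the paper: path predicates are the root-to-leaf paths of the twig, and binary predicates pair the query root with each other node, using "//" whenever the node is not a direct child of the root. All function names are hypothetical.

```python
# Pattern encoding (hypothetical): (label, [(edge, subpattern), ...])
query = ("book", [
    ("/", ("info", [("/", ("title", [])), ("/", ("author", []))])),
    ("/", ("edition", [])),
])

def root_to_leaf_paths(pattern, prefix=""):
    """Assumed path predicates: one path per leaf of the twig."""
    label, preds = pattern
    here = prefix + label
    if not preds:
        return [here]
    paths = []
    for edge, sub in preds:
        paths += root_to_leaf_paths(sub, here + edge)
    return paths

def binary_predicates(pattern):
    """Assumed binary predicates: the root paired with every other node."""
    root, preds = pattern
    out = []

    def visit(preds, depth):
        for edge, (label, sub) in preds:
            direct = depth == 0 and edge == "/"
            out.append(f"{root}{'/' if direct else '//'}{label}")
            visit(sub, depth + 1)

    visit(preds, 0)
    return out

print(root_to_leaf_paths(query))
print(binary_predicates(query))
```

Under these assumptions, a method that scores smaller pieces sees less of the query structure at once, which is in line with the quality and cost ordering listed on the slide.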

Contributions
- Scoring mechanism for XML queries
- Data structures for top-k query processing
- Experimental evaluation

Matrix Representation of Twigs
- Twigs (queries and tuples) can be represented by matrices that capture all relationships in the query
- Matrix subsumption used to compare tuple and queries
[Figure: matrices over nodes a, b, c, d, e for a query and for partial tuples built from a1, b1, c1, d1, e1 (cases shown: not joined with e yet, no matches for e, e1 matches); entries are drawn from =, /, //, ?, X]

Representing Relaxed Query Patterns: DAG Structure
- Each child is more relaxed (has more matches) than its parent
- idf of a child is no higher than the idf of its parent
- idf scores are accessible in constant time for any match (complete or partial) using hash function
- Exhaustive algorithm to build the DAG
[Figure: DAG of relaxed patterns over nodes a, b, c, ending at the single node a]
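An exhaustive construction of this kind can be sketched as a breadth-first closure over single relaxation steps. This is a simplified illustration, not the authors' algorithm: patterns are hypothetical immutable tuples, every relaxation is applied wherever it is syntactically possible, and a plain Python dict stands in for the hash function that gives constant-time lookup of a pattern.

```python
from collections import deque

def node(label, edge="/", kids=()):
    # Canonical immutable pattern; children are sorted so equal patterns hash alike.
    return (label, edge, tuple(sorted(kids)))

def relax(p, is_root=True):
    """Yield every pattern obtained from p by one relaxation step."""
    label, edge, kids = p
    if not is_root and edge == "/":
        yield node(label, "//", kids)                  # edge generalization
    for i, kid in enumerate(kids):
        rest = kids[:i] + kids[i + 1:]
        if not kid[2]:
            yield node(label, edge, rest)              # leaf node deletion
        for j, grandkid in enumerate(kid[2]):          # subtree promotion
            slim = node(kid[0], kid[1], kid[2][:j] + kid[2][j + 1:])
            moved = node(grandkid[0], "//", grandkid[2])
            yield node(label, edge, rest + (slim, moved))
        for relaxed_kid in relax(kid, is_root=False):  # relax deeper in the tree
            yield node(label, edge, rest + (relaxed_kid,))

def build_dag(query):
    """Map each reachable pattern to the set of its one-step relaxations."""
    dag = {query: set()}
    queue = deque([query])
    while queue:
        p = queue.popleft()
        for child in relax(p):
            if child not in dag:
                dag[child] = set()
                queue.append(child)
            dag[p].add(child)
    return dag

# Path query a/b/c over the slide's node labels.
dag = build_dag(node("a", kids=(node("b", kids=(node("c"),)),)))
print(len(dag))
```

For the path query a/b/c this closure reaches ten distinct patterns, with the single node a as the only pattern that has no further relaxation.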

Information stored in the DAG
- idf score information: idf = (1+|a|)/(1+|ap|), where |a| is the number of a nodes and |ap| is the number of a nodes that satisfy the query predicate
- For query processing:
  - Best possible score from here
  - Best possible score after each remaining join operation
- Number of matches (useful for tf)
[Figure: the DAG over a, b, c annotated with idf scores; in extraction order: 1.228, 1.2, 1.195, 1.167, 1.195, 1.167, 1.156, 1.049, 1.156, 1]
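The idf formula on this slide explains why scores can only decrease along DAG edges. A short derivation follows, writing $a_Q$ for the slide's $a_p$ when the predicate is the pattern $Q$ (a notational assumption made here):

```latex
\[
\mathrm{idf}(Q) = \frac{1 + |a|}{1 + |a_Q|}
\]
If $Q'$ is a relaxation of $Q$, every $a$ node that satisfies $Q$ also satisfies $Q'$, hence
\[
|a_Q| \le |a_{Q'}| \le |a|
\quad\Longrightarrow\quad
1 \le \mathrm{idf}(Q') \le \mathrm{idf}(Q).
\]
For the most relaxed pattern, the root $a$ alone, $|a_Q| = |a|$ and therefore $\mathrm{idf}(Q) = 1$.
```

This agrees with the annotated figure, where the largest score sits at the original query and the root-only pattern carries the score 1.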

Query Processing using the DAG
Benefits:
- Score computation done in a preprocessing phase (using exact or approximate information)
- Score access during query processing done in constant time
- Additional information needed for query processing precomputed and accessed in constant time (e.g., score upper bound)
- tf estimated at runtime based on available information
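How a precomputed score upper bound can help during top-k processing is sketched below. This is a generic pruning loop under stated assumptions, not the processing strategy of the talk: `top_k`, `partials`, and `complete` are hypothetical names, each partial match is assumed to come with an upper bound looked up in constant time, and `complete` stands in for the remaining join work.

```python
import heapq

def top_k(partials, k, complete):
    """partials: iterable of (match_id, upper_bound); complete(match_id) -> final score."""
    best = []    # min-heap holding the k highest (score, match_id) pairs seen so far
    pruned = 0
    for match_id, upper_bound in partials:
        # A partial match whose best possible score cannot beat the current
        # k-th score is dropped without paying for the remaining joins.
        if len(best) == k and upper_bound <= best[0][0]:
            pruned += 1
            continue
        score = complete(match_id)
        if len(best) < k:
            heapq.heappush(best, (score, match_id))
        elif score > best[0][0]:
            heapq.heapreplace(best, (score, match_id))
    return sorted(best, reverse=True), pruned

# Hypothetical data: final scores never exceed the precomputed upper bounds.
final_scores = {"m1": 1.228, "m2": 1.195, "m3": 1.049, "m4": 1.0}
partials = [("m1", 1.228), ("m2", 1.2), ("m3", 1.156), ("m4", 1.049)]
answers, pruned = top_k(partials, 2, final_scores.__getitem__)
print(answers, pruned)
```

In this run the last two partial matches are discarded on their upper bounds alone, so their final scores are never computed.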

Quality/Space/Time tradeoff
Binary Predicates:
- Smaller DAG (O(4^q))
- Faster pre-processing (and processing)
- Lower quality (fewer possible scores)
Path Predicates and Twig:
- DAG is O(4^(q^2/2)) in space (still reasonable in practice)
- More pre-processing
- Higher quality (more differences between scores)

Contributions
- Scoring mechanism for XML queries
- Data structures for top-k query processing
- Experimental evaluation

Experimental Setup
Data:
- Synthetic heterogeneous document collections generated with ToXgene
- Real dataset: Wall Street Journal Treebank corpora
Pregenerated queries exhibiting different sizes, query structures and predicates
Measures:
- DAG size
- DAG preprocessing time
- Query processing time
- Precision (percentage of top-k answers that are actual top-k answers, as given by Twig)
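The precision measure can be written down directly from its definition. The sketch below assumes rankings are lists of node identifiers and ignores ties; the function name and the sample rankings are hypothetical, with the Twig ranking playing the role of the reference.

```python
def precision_at_k(ranking, reference, k):
    """Fraction of the first k answers in `ranking` that are also in the first k of `reference`."""
    return len(set(ranking[:k]) & set(reference[:k])) / k

twig_ranking = ["n1", "n2", "n3", "n4", "n5"]     # hypothetical reference order
binary_ranking = ["n1", "n3", "n5", "n2", "n4"]   # hypothetical approximate order

p = precision_at_k(binary_ranking, twig_ranking, 3)
print(p)
```

Here two of the three answers returned by the approximate ranking belong to the reference top 3, so the precision at k=3 is two thirds.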

XML Scoring Precision

XML Scoring Preprocessing Time

XML Scoring Real data

Conclusions
- Scoring method for XML queries
  - Inspired by tf.idf
  - Accounts for structure and content
  - Accounts for structural relaxations
- Efficient data structures to compute and access scores during top-k query processing
  - DAG
  - Matrix representation of queries and tuples
- Evaluation of the scoring methods' tradeoffs
  - Answer quality vs. preprocessing time

Related Work
- IR scoring: content only
- XML scoring: content with structure
  - XIRQL [XML&IR’00], JuruXML [SIGIR’03], IR-CADG [WebDB’04]
  - None of these techniques account for structural relaxations (with the exception of our previous work [ICDE’05])
- XML structural relaxation
  - FleXPath [SIGMOD’04], Kanza and Sagiv [PODS’01], Schlieder [EDBT’02], Delobel and Rousset [FMII’01]

Future Work
- Streaming scenarios
  - Incremental updates on DAG
  - Approximate scoring
- Integration with approximate text scoring
  - Extend proposed XML scoring function to handle text content approximation (e.g., misspellings)
  - Unify structure and content score
- Quality evaluation (INEX)