MCN: A New Semantics Towards Effective XML Keyword Search


MCN: A New Semantics Towards Effective XML Keyword Search. Junfeng Zhou, Xiaofeng Meng (Renmin University of China); Zhifeng Bao, Tok Wang Ling (National University of Singapore).

Outline: Introduction; Preliminaries; XML Keyword Search Semantics; Query Processing; Experiment; Conclusions. 2019/2/18


Introduction. XML data is pervasive, so an effective search mechanism is indispensable. Structured query languages such as XQuery and XPath may be too complex for naïve users. XML keyword search is effective when the query language is complex (e.g. XQuery or XPath), when the underlying schema is complex or unavailable (e.g. XMark, with 327 schema elements [1]), or when querying heterogeneous XML documents. [1] C. Yu et al. Querying complex structured databases. VLDB 2007.

Introduction. Observations for XML keyword search on data-centric XML documents: Each data element has a category, e.g. entity or attribute in the ER model. Users' query intentions are based on the relationships between entity nodes. A data fragment is meaningful if the relationships between its entity nodes can be interpreted from the information in the data fragment itself.

Introduction. Existing XML keyword search semantics: Lowest Common Ancestor (LCA), Smallest Lowest Common Ancestor (SLCA), and Connected Network (CN). Existing semantics consider only structural information when determining whether a data fragment is a match. [Figure: XML document]
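The LCA and SLCA semantics named on this slide can be illustrated with a minimal sketch. It assumes nodes are identified by Dewey labels (tuples of child positions), a labeling the slides do not specify; the function names and the sample labels are hypothetical.

```python
from itertools import product


def lca(a, b):
    """Return the longest common prefix of two Dewey labels."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


def lca_set(matches_per_keyword):
    """Collect the LCA of every combination of one match per keyword."""
    result = set()
    for combo in product(*matches_per_keyword):
        node = combo[0]
        for other in combo[1:]:
            node = lca(node, other)
        result.add(node)
    return result


def slca_set(matches_per_keyword):
    """Keep only the LCAs that have no other LCA as a proper descendant."""
    lcas = lca_set(matches_per_keyword)

    def is_proper_ancestor(anc, desc):
        return len(anc) < len(desc) and desc[:len(anc)] == anc

    return {n for n in lcas if not any(is_proper_ancestor(n, m) for m in lcas)}


# Hypothetical matches: one node containing "Mike", two containing "John".
mike = [(0, 1, 0)]
john = [(0, 1, 1), (0, 2, 0)]
print(lca_set([mike, john]))
print(slca_set([mike, john]))
```

Both functions look only at label prefixes, which is the structural-only behaviour the slide criticises: nothing in them asks whether the connecting nodes express a relationship a user would recognise.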

Introduction. Problems of existing semantics: they return meaningless results (R1, R2, R4) and lose meaningful results (R1’, R4’). The relationship of the two person nodes cannot be interpreted by R4 itself, whereas R4’ means that Mike bought the item sold by John. The relationship of the two person nodes cannot be interpreted by R2 itself. The relationship of the two photo nodes cannot be interpreted by R1 itself, whereas R1’ means that both Mike and John provided photos of the same item. [Figure: XML document]

Introduction. Contributions: an XML keyword search semantics, MCN (Meaningful Connected Network). R3 means that Mike and John are watching the same auction. R4’ means that Mike bought the item sold by John. R1’ means that both Mike and John provided photos of the same item. [Figure: XML document]

Introduction. Query processing of directed-graph-based methods [1,5-8]: Finding all Connected Networks from an XML graph is NP-complete [9]. The first group [7,8] finds only a subset of all results. The second group [1,5,6] uses a two-step strategy: (1) identify the set of structured Query Patterns; (2) evaluate the Query Patterns to get the matching results. [1] C. Yu et al. Querying Complex Structured Databases. VLDB 2007. [5] V. Hristidis et al. Keyword Proximity Search on XML Graphs. ICDE 2003. [6] S. Cohen et al. Interconnection Semantics for Keyword Search in XML. CIKM 2005. [7] K. Golenberg et al. Keyword Proximity Search in Complex Data Graphs. SIGMOD 2008. [8] H. He et al. BLINKS: Ranked Keyword Searches on Graphs. SIGMOD 2007. [9] G. Reich et al. Beyond Steiner's Problem: A VLSI Oriented Generalization. WG Workshop, 1990.

Introduction. Problems of the two-step methods [1,5,6]. For the first step, [1,6] find Query Patterns of schema elements from the schema graph, where text values are attached to different schema elements. Because the keywords used to compute query patterns are schema elements, these methods cannot process queries in which text values are attached to two schema elements of the same name. Example queries: Q1 = {person: Mike, auction: }; Q2 = {person: Mike, person: John}. [Figure: schema graph] [1] C. Yu et al. Querying Complex Structured Databases. VLDB 2007. [5] V. Hristidis et al. Keyword Proximity Search on XML Graphs. ICDE 2003. [6] S. Cohen et al. Interconnection Semantics for Keyword Search in XML. CIKM 2005.

Introduction. Problems of the two-step methods [1,5,6], continued. For the first step, [1,6] find Query Patterns of schema elements from the schema graph, where text values are attached to different schema elements, and [5] needs to scan data elements to produce Query Patterns. For the second step, all methods suffer from costly structural join operations when processing all Query Patterns. A structural join is a join that determines an ancestor-descendant or parent-child relationship. [1] C. Yu et al. Querying Complex Structured Databases. VLDB 2007. [5] V. Hristidis et al. Keyword Proximity Search on XML Graphs. ICDE 2003. [6] S. Cohen et al. Interconnection Semantics for Keyword Search in XML. CIKM 2005.

Introduction. Contributions: (1) an XML keyword search semantics, MCN (Meaningful Connected Network); (2) an efficient query processing algorithm that uses the two-step strategy. In the first step it avoids scanning real data; in the second step it avoids costly structural join operations.


Preliminaries. The schema is modeled as a directed graph. A schema is assumed to be available; when it is not, it can be inferred using [1,2]. [Figure: schema graph, with legend: containment edge, reference edge, 1:1 relationship, 1:n relationship] [1] G. J. Bex et al. XML Schema Definitions from XML Data. VLDB 2007. [2] G. J. Bex et al. Inference of Concise DTDs from XML Data. VLDB 2006.

Preliminaries. Node categories: entity node, attribute node, and connection node. How is the category of each node specified? By heuristics [1,2] plus manual adjustment. An entity node corresponds to a “*” node and is the same as the entity concept of the ER model. An attribute node corresponds to a “leaf” node. A connection node is used to organize data. [Figure: schema graph] [1] C. Yu et al. Querying complex structured databases. VLDB 2007. [2] Z. Liu et al. Identifying meaningful return information for XML keyword search. SIGMOD 2007.


XML Keyword Search Semantics. Observations about a “meaningful” data fragment: The directions of the edges on a path from one element to another may conflict with each other. Traversing the XML graph to compute all meaningful data fragments is infeasible because of its large size; an alternative is to find meaningful relationships from the schema graph. A possible relationship of mixed direction obtained from the schema graph may be meaningless; for example, R’ is meaningless because it has no database instances in practice. We therefore need to first identify all possible relationships of schema elements, then identify the meaningless ones and keep only the meaningful ones. [Figure: schema graph]

XML Keyword Search Semantics. For the mixed-direction problem, define a “walk” to return all connected relationships. A walk treats the schema graph as an undirected graph: it starts at a schema node and, after a series of nodes and edges, ends at another schema node. For the meaningfulness problem, define a “Meaningful Entity Walk (MEW)” to filter out useless relationships. A MEW is a walk denoting a relationship between two entity nodes that can be interpreted by the walk itself. Annotated examples from the figure: Meaningful: the keywords may be about the same person. Meaningless: W2 has no database instances in practice. Meaningless: W5 cannot tell what the relationship between item and person is. Meaningful: the photo and video contain information about the same item. Meaningful: a person is watching an auction.
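The “walk” definition can be sketched as a bounded enumeration over the undirected view of a schema graph. This is only an illustration: the edge list, the `watch` element name, and the `max_edges` bound are assumptions, and the MEW filter (which needs knowledge of node categories and instances) is not reproduced here.

```python
from collections import defaultdict


def undirected_adjacency(edges):
    """Ignore edge direction, as the walk definition does."""
    adj = defaultdict(set)
    for src, dst in edges:
        adj[src].add(dst)
        adj[dst].add(src)
    return adj


def walks(edges, start, end, max_edges):
    """List every walk from start to end that uses at most max_edges edges."""
    adj = undirected_adjacency(edges)
    found = []

    def extend(walk):
        # Record the walk whenever it currently ends at the target node.
        if len(walk) > 1 and walk[-1] == end:
            found.append(tuple(walk))
        if len(walk) - 1 == max_edges:
            return
        for nxt in sorted(adj[walk[-1]]):
            extend(walk + [nxt])

    extend([start])
    return found


# Hypothetical schema fragment: person contains watch, watch references auction.
schema_edges = [("person", "watch"), ("watch", "auction")]
print(walks(schema_edges, "person", "auction", 2))
print(walks(schema_edges, "person", "person", 2))
```

Because nodes may repeat, a walk can start and end at the same schema node, which is how a relationship between two elements of the same name can be expressed at the schema level.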

XML Keyword Search Semantics. For a keyword search query, define a “Meaningful Connected Network” (MCN): it contains each keyword at least once, and there exists at least one Meaningful Entity Walk between each pair of its entity nodes. Data fragments containing Mike and John: Meaningless: the relationship of the two person nodes in R2 and R4 cannot be interpreted by R2 and R4 themselves. Meaningless: the relationship of the two photo nodes cannot be interpreted by R1. Meaningful: the two persons provided photos of the same item. Meaningful: one person bought the item sold by another person. Meaningful: the two persons are watching the same auction.
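The two MCN conditions can be written as a small check. The sketch below assumes the set of entity-instance pairs connected by a Meaningful Entity Walk has already been computed and is passed in as `mew_pairs`; that input, the function name, and the node identifiers are all hypothetical.

```python
from itertools import combinations


def is_mcn(query, node_text, entity_nodes, mew_pairs):
    """Check the two MCN conditions on one data fragment.

    query        -- list of keywords
    node_text    -- dict mapping node id to its text content
    entity_nodes -- ids of the entity nodes in the fragment
    mew_pairs    -- set of frozensets, each a pair of entity ids that
                    is connected by a Meaningful Entity Walk (assumed given)
    """
    # Condition 1: every keyword occurs at least once in the fragment.
    covers_keywords = all(
        any(k.lower() in text.lower() for text in node_text.values())
        for k in query
    )
    # Condition 2: every pair of entity nodes has at least one MEW.
    pairwise_meaningful = all(
        frozenset(pair) in mew_pairs for pair in combinations(entity_nodes, 2)
    )
    return covers_keywords and pairwise_meaningful


# Hypothetical fragment: two persons and one auction.
text = {"p1": "Mike", "p2": "John", "a1": "auction 7"}
entities = ["p1", "p2", "a1"]
pairs = {frozenset(p) for p in [("p1", "p2"), ("p1", "a1"), ("p2", "a1")]}
print(is_mcn(["Mike", "John"], text, entities, pairs))
```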

XML Keyword Search Semantics. Because the joining sequence of two data elements is data bound (the size of a result may be as large as the number of nodes in the XML document), users are usually required to specify the maximum size of the returned results. Keyword search problem: for a given keyword query Q, find all matched MCNs in the given XML document D, where each MCN contains at most C entity instances. Example: R3.size = 12, R1’.size = 7, R4’.size = 11. With MAXsize = 10, R3 and R4’ would be discarded, yet all three results contain 3 entity nodes. In our method, C is the maximum number of entity nodes, so we avoid returning data fragments that convey very weak semantics through an overwhelming number of entity nodes. C = 3 by default.
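The difference between the two size limits can be checked with the numbers given on this slide. The dictionary layout and function names below are illustrative assumptions; only the sizes, the entity counts, MAXsize = 10, and C = 3 come from the slide.

```python
# Node counts and entity counts as stated on the slide.
results = {
    "R3": {"nodes": 12, "entities": 3},
    "R1'": {"nodes": 7, "entities": 3},
    "R4'": {"nodes": 11, "entities": 3},
}


def within_node_limit(results, max_size):
    """Filter by total node count, as a plain MAXsize bound would."""
    return sorted(n for n, r in results.items() if r["nodes"] <= max_size)


def within_entity_limit(results, c):
    """Filter by number of entity nodes, as the bound C does."""
    return sorted(n for n, r in results.items() if r["entities"] <= c)


print(within_node_limit(results, 10))
print(within_entity_limit(results, 3))
```

With the node-count bound only R1’ survives, while the entity-count bound keeps all three fragments.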


Query Processing. From schema graph to entity graph. Entity path: a meaningful entity walk of the schema graph that contains just two entity nodes. Partial path: a path of the schema graph that starts from an entity node and ends at an attribute node. Entity graph: a schema graph that keeps only the entity nodes and their connection relationships, i.e. the entity paths. [Figure: schema graph and the derived entity graph, with a partial path marked]
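One possible in-memory representation of these three notions is sketched below. The class and field names are assumptions made for illustration, and the sample paths are hypothetical rather than taken from the slide's figure.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EntityPath:
    """A meaningful entity walk whose only entity nodes are its two ends."""
    name: str
    nodes: tuple  # schema nodes listed from one entity node to the other

    @property
    def endpoints(self):
        return (self.nodes[0], self.nodes[-1])


@dataclass(frozen=True)
class PartialPath:
    """A schema path from an entity node down to an attribute node."""
    nodes: tuple


@dataclass
class EntityGraph:
    """Entity nodes connected by entity paths."""
    entity_paths: list = field(default_factory=list)

    def neighbours(self, entity):
        """Entity nodes reachable from `entity` through one entity path."""
        out = set()
        for path in self.entity_paths:
            a, b = path.endpoints
            if a == entity:
                out.add(b)
            if b == entity:
                out.add(a)
        return out


# Hypothetical entity paths; "watch" is a connection node inside e1.
graph = EntityGraph([
    EntityPath("e1", ("person", "watch", "auction")),
    EntityPath("e2", ("auction", "item")),
])
name_of_person = PartialPath(("person", "name"))
print(graph.neighbours("auction"))
```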

Query Processing. Identify Query Patterns (from the entity graph). Observation: a Query Pattern consists of a set of entity paths and a set of partial paths. The query patterns are identified from the entity graph, so there is no need to scan real data. selfE(k) returns the set of entity nodes that have entity instances containing k as an attribute or an attribute value. Theorem: for a given keyword query Q, our method produces all query patterns that have at most C entity nodes. [Figure: entity graph and the resulting query patterns; e3 is an entity path; only entity paths are shown]

Query Processing. Process all Query Patterns (entity graph and XML document). Observation: a Query Pattern consists of a set of entity paths and a set of partial paths. Entity Path Index (EPI): for each entity path e, it stores the set of path instances of e. Partial Path Index (PPI): for each keyword k, it records the set of partial paths, and their database instances, that contain k as their text value.
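The two indices can be pictured as dictionaries. The sketch below is an assumption about their shape, not the authors' storage layout: path instances are reduced to tuples of entity instance ids, keywords are whitespace-separated lower-cased tokens, and all names and sample data are hypothetical.

```python
from collections import defaultdict


def build_epi(path_instances):
    """Entity Path Index: entity path name -> list of path instances."""
    epi = defaultdict(list)
    for path_name, instance in path_instances:
        epi[path_name].append(instance)
    return dict(epi)


def build_ppi(attribute_values):
    """Partial Path Index: keyword -> list of (partial path, entity id)."""
    ppi = defaultdict(list)
    for partial_path, entity_id, text in attribute_values:
        for keyword in text.lower().split():
            ppi[keyword].append((partial_path, entity_id))
    return dict(ppi)


# Hypothetical document content.
epi = build_epi([
    ("person-watch-auction", ("p1", "a1")),
    ("person-watch-auction", ("p2", "a1")),
])
ppi = build_ppi([
    ("person/name", "p1", "Mike"),
    ("person/name", "p2", "John"),
])
print(epi["person-watch-auction"])
print(ppi["mike"])
```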

Query Processing. Process all Query Patterns; example Q = {Mike, John}. A keyword query Q corresponds to a set of Query Patterns, and a Query Pattern consists of a set of entity paths and a set of partial paths. The result set of each entity path and each partial path can be obtained by probing the Entity Path Index and the Partial Path Index. Theorem: let Q be a given keyword query; using EPI and PPI, structural join operations can be avoided in the evaluation of Q. A structural join is a join that determines an ancestor-descendant or parent-child relationship.
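How index probes can replace structural joins is sketched below for one hypothetical query pattern: two persons matching the two keywords, connected through the same far endpoint of one entity path. The index contents, the pattern shape, and the function name are assumptions; the point is only that every comparison is an equality test on ids (a value join), with no ancestor-descendant check.

```python
def evaluate_pattern(epi, ppi, path_name, kw1, kw2):
    """Join index probes on entity ids for a 'kw1 - shared target - kw2' pattern."""
    # Probe the Partial Path Index for the entities matching each keyword.
    left = {entity_id for _, entity_id in ppi.get(kw1, [])}
    right = {entity_id for _, entity_id in ppi.get(kw2, [])}

    # Probe the Entity Path Index and group path instances by their target.
    by_target = {}
    for source, target in epi.get(path_name, []):
        by_target.setdefault(target, []).append(source)

    # Value join: equality on the shared target id.
    results = []
    for target, sources in sorted(by_target.items()):
        for a in sources:
            for b in sources:
                if a != b and a in left and b in right:
                    results.append((a, target, b))
    return results


# Hypothetical index contents.
epi = {"person-watch-auction": [("p1", "a1"), ("p2", "a1"), ("p3", "a2")]}
ppi = {"mike": [("person/name", "p1")], "john": [("person/name", "p2")]}
print(evaluate_pattern(epi, ppi, "person-watch-auction", "mike", "john"))
```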

Query Processing. Process all Query Patterns: identifying redundant QPs. “~” denotes a containment relationship. Example: Q = {Mike, John}. In the redundant pattern, both provider nodes denote the same data element and their text values are identical, which contradicts the given XML document. [Figure: schema, document, query patterns, results, and the correct answer]

Query Processing. Process all Query Patterns; example Q = {Mike, John}. First identify which selections will produce an empty result set. [Figure: Entity Path Index and Partial Path Index]
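The empty-selection check can be sketched as a cheap filter that runs before any join. The slide gives no detail on how the check is done, so the pattern representation (a dict of entity path names and keyword/partial-path pairs) and the index layout below are assumptions.

```python
def has_empty_selection(pattern, epi, ppi):
    """True if any index probe needed by the pattern returns nothing."""
    for path_name in pattern["entity_paths"]:
        if not epi.get(path_name):
            return True
    for keyword, partial_path in pattern["partial_paths"]:
        hits = [p for p, _ in ppi.get(keyword, []) if p == partial_path]
        if not hits:
            return True
    return False


def prune(patterns, epi, ppi):
    """Drop the query patterns that cannot produce any result."""
    return [p for p in patterns if not has_empty_selection(p, epi, ppi)]


# Hypothetical indices and patterns.
epi = {"e1": [("p1", "a1")], "e2": []}
ppi = {"mike": [("person/name", "p1")]}
patterns = [
    {"name": "qp1", "entity_paths": ["e1"], "partial_paths": [("mike", "person/name")]},
    {"name": "qp2", "entity_paths": ["e2"], "partial_paths": [("mike", "person/name")]},
    {"name": "qp3", "entity_paths": ["e1"], "partial_paths": [("john", "person/name")]},
]
print([p["name"] for p in prune(patterns, epi, ppi)])
```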


Experiment. Experiment setup: we implemented the SLCA, XSEarch, and IM (our method) algorithms using Microsoft Visual C++ 6.0. Query engines used in our experiment: X-Hive (http://www.x-hive.com) and MonetDB (http://monetdb.cwi.nl/projects/monetdb/XQuery/index.html).

Experiment. Datasets and indices: the indices are PPI + EPI + an assistant index. Queries: 40 keyword queries in 4 groups, with 2, 3, 4, and 5 keywords respectively.

Experiment. Evaluation metrics: precision and recall, and running time. For precision and recall, users submit a keyword query; we ask for their query intention and write the corresponding XQuery expressions, and the result set R is obtained by running these on the MonetDB query engine. We then process the keyword query with each algorithm; the result of a specific algorithm is RQ.
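The transcript does not reproduce the formulas, so the sketch below assumes the standard definitions: precision = |R ∩ RQ| / |RQ| and recall = |R ∩ RQ| / |R|, with R the result of the hand-written XQuery and RQ the result of the algorithm under test. The function name and sample ids are hypothetical.

```python
def precision_recall(r, rq):
    """Compute precision and recall of result set rq against ground truth r."""
    r, rq = set(r), set(rq)
    hits = len(r & rq)
    precision = hits / len(rq) if rq else 0.0
    recall = hits / len(r) if r else 0.0
    return precision, recall


# Hypothetical result ids.
ground_truth = {"a", "b", "c"}
algorithm_output = {"b", "c", "d", "e"}
print(precision_recall(ground_truth, algorithm_output))
```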

Experiment. Experimental results.


Conclusions. We proposed a new semantics, MCN, based on the relationships of entity nodes, to capture meaningful information while taking IDREF into account. We proposed an entity-graph-based method that produces all query patterns while avoiding scanning real data. We proposed two efficient indices with which our method avoids structural join operations by equivalently transforming them into value join operations. We conducted experiments to verify the effectiveness and efficiency of our method. Ongoing work: a good ranking mechanism that considers node categories.

Thank you!