MCN: A New Semantics Towards Effective XML Keyword Search


1 MCN: A New Semantics Towards Effective XML Keyword Search
Junfeng Zhou, Xiaofeng Meng (Renmin University of China); Zhifeng Bao, Tok Wang Ling (National University of Singapore)

2 Outline Introduction Preliminaries XML Keyword Search Semantics
Query Processing Experiment Conclusions 2019/2/18

3 Outline (current section: Introduction)

4 Introduction
- XML data is pervasive nowadays, so an effective search mechanism is indispensable.
- Structured query languages, e.g. XQuery or XPath, may be too complex for naïve users.
- XML keyword search is effective when:
  - the query language is complex, e.g. XQuery or XPath;
  - the underlying schema is complex or unavailable, e.g. XMark (327 schema elements) [1];
  - the query runs over heterogeneous XML documents.
[1] C. Yu et al. Querying complex structured databases. VLDB 2007.

5 Introduction
Observations for XML keyword search on data-centric XML documents:
- Each data element has a category, e.g. entity or attribute in the ER model.
- Users' query intentions are based on the relationships between entity nodes.
- A data fragment is meaningful if the relationships between its entity nodes can be interpreted from the information in the fragment itself.

6 Introduction
Existing XML Keyword Search Semantics
- Lowest Common Ancestor (LCA)
- Smallest Lowest Common Ancestor (SLCA)
- Connected Network (CN)
Existing semantics consider only structural information when determining whether a data fragment is a match.
[Figure: XML Document]
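To make the purely structural nature of LCA and SLCA concrete, here is a minimal sketch. It assumes nodes are identified by Dewey labels (tuples giving the child position at each level), an encoding the slides do not mention; the function names and sample labels are hypothetical, and the brute-force enumeration is for illustration only.

```python
from itertools import product

def lca(labels):
    """Return the longest common prefix of Dewey labels (tuples of ints)."""
    prefix = []
    for parts in zip(*labels):
        if all(p == parts[0] for p in parts):
            prefix.append(parts[0])
        else:
            break
    return tuple(prefix)

def slca(keyword_lists):
    """Brute-force SLCA: take the LCA of every combination of keyword nodes,
    then drop each LCA that has another LCA as a proper descendant."""
    candidates = {lca(combo) for combo in product(*keyword_lists)}
    return sorted(
        c for c in candidates
        if not any(o != c and o[:len(c)] == c for o in candidates)
    )

# Hypothetical Dewey labels of the nodes matching each keyword.
mike = [(0, 0, 1), (0, 1, 1)]
john = [(0, 0, 2), (0, 2, 0)]
print(slca([mike, john]))
```

Nothing in this computation looks at what the nodes mean, which is the limitation the following slides illustrate.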

7 Introduction
Problems of Existing Semantics
- They return meaningless results: R1, R2, R4.
  - The relationship of the two photo nodes cannot be interpreted by R1 itself.
  - The relationship of the two person nodes cannot be interpreted by R2 itself.
  - The relationship of the two person nodes cannot be interpreted by R4 itself.
- They lose meaningful results: R1’, R4’.
  - R1’ means that both Mike and John provided photos about the same item.
  - R4’ means that Mike bought the item sold by John.
[Figure: XML Document]

8 Introduction
Contributions
- An XML keyword search semantics: MCN (Meaningful Connected Network)
  - R3 means that Mike and John are watching the same auction.
  - R4’ means that Mike bought the item sold by John.
  - R1’ means that both Mike and John provided photos about the same item.
[Figure: XML Document]

9 Introduction
Query Processing of directed graph based methods [1,5-8]
- Finding all Connected Networks from an XML graph is NP-complete [9].
- The first group [7,8] finds only a subset of all results.
- The second group [1,5,6] uses a two-step strategy:
  (1) identify the set of structured Query Patterns;
  (2) evaluate the Query Patterns to get the matching results.
[1] Cong, Y., et al. Querying Complex Structured Databases. VLDB 2007.
[5] Vagelis, H., et al. Keyword Proximity Search on XML Graphs. ICDE 2003.
[6] Sara, C., et al. Interconnection semantics for keyword search in XML. CIKM 2005.
[7] Konstantin, G., et al. Keyword proximity search in complex data graphs. SIGMOD 2008.
[8] Hao, H., et al. BLINKS: ranked keyword searches on graphs. SIGMOD 2007.
[9] Reich, G., et al. Beyond Steiner’s problem: a VLSI oriented generalization. WG Workshop (1990).

10 Introduction
Problems of the two-step methods [1,5,6]
- For the first step, [1,6] find Query Patterns of schema elements from the schema graph, where text values are attached to different schema elements.
  - Keywords used to compute query patterns are schema elements.
  - They cannot process queries involving text values attached to two schema elements of the same name.
- Example queries: Q1 = {person: Mike, auction: }, Q2 = {person: Mike, person: John}
[Figure: Schema graph]
[1] Cong, Y., et al. Querying Complex Structured Databases. VLDB 2007.
[5] Vagelis, H., et al. Keyword Proximity Search on XML Graphs. ICDE 2003.
[6] Sara, C., et al. Interconnection semantics for keyword search in XML. CIKM 2005.

11 Introduction
Problems of the two-step methods [1,5,6]
- For the first step:
  - [1,6] find Query Patterns of schema elements from the schema graph, where text values are attached to different schema elements.
  - [5] needs to scan data elements to produce Query Patterns.
- For the second step:
  - All methods suffer from costly structural join operations¹ to process all Query Patterns.
¹ A structural join operation is a join that determines an ancestor-descendant or parent-child relationship.
[1] Cong, Y., et al. Querying Complex Structured Databases. VLDB 2007.
[5] Vagelis, H., et al. Keyword Proximity Search on XML Graphs. ICDE 2003.
[6] Sara, C., et al. Interconnection semantics for keyword search in XML. CIKM 2005.

12 Introduction
Contributions
- An XML keyword search semantics: MCN (Meaningful Connected Network)
- An efficient query processing algorithm that uses the two-step strategy
  - First step: avoids scanning real data
  - Second step: avoids costly structural join operations

13 Outline (current section: Preliminaries)

14 Preliminaries
Schema Directed Graph
- Always available; otherwise it can be inferred using [1,2].
[Figure: Schema Graph. Legend: containment edge, reference edge, 1:1 relationship, 1:n relationship]
[1] Geert, J.B., et al. XML Schema Definitions from XML Data. VLDB 2007.
[2] Geert, J.B., et al. Inference of Concise DTDs from XML Data. VLDB 2006.

15 Preliminaries
Node Categories: Entity, Attribute, Connection Node
- Entity node: corresponds to a “*” node; same as the concept in the ER model.
- Attribute node: corresponds to a “leaf” node.
- Connection node: used to organize data.
- How to specify the category of each node? Heuristics [1,2] plus manual adjustment.
[Figure: Schema Graph]
[1] C. Yu et al. Querying complex structured databases. VLDB 2007.
[2] Ziyang, L., et al. Identifying meaningful return information for XML keyword search. SIGMOD 2007.
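The category heuristic can be sketched as follows. This is only an illustration under stated assumptions: the schema encoding (a dict with `children` and `starred` fields), the rule order, and the sample element names are hypothetical, and the slides additionally rely on manual adjustment that is not modeled here.

```python
def categorize(schema):
    """Assign a category to every schema element.

    schema maps an element name to {"children": [...], "starred": bool},
    where "starred" marks elements that may repeat ("*" in a DTD).
    Assumed rule order: starred -> entity, leaf -> attribute,
    everything else -> connection.
    """
    categories = {}
    for name, info in schema.items():
        if info["starred"]:
            categories[name] = "entity"
        elif not info["children"]:
            categories[name] = "attribute"
        else:
            categories[name] = "connection"
    return categories

# Hypothetical fragment of an auction schema.
schema = {
    "people": {"children": ["person"], "starred": False},
    "person": {"children": ["name"], "starred": True},
    "name": {"children": [], "starred": False},
}
print(categorize(schema))
```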

16 Outline (current section: XML Keyword Search Semantics)

17 XML Keyword Search Semantics
Observation of a “meaningful” data fragment
- The directions of the edges on a path from one element to another may conflict with each other.
- Traversing the XML graph to compute all meaningful data fragments is infeasible because of its large size.
- An alternative is to find meaningful relationships from the schema graph.
- A possible mixed-direction relationship obtained from the given schema graph may be meaningless.
  - Example: R’ is meaningless because it has no database instances in practice.
- We need to:
  - first identify all possible relationships of schema elements;
  - then identify the meaningless ones and keep only the meaningful ones.
[Figure: Schema graph]

18 XML Keyword Search Semantics
- For the mixed-direction problem: define “Walk” to return all connected relationships.
  - A walk treats the schema graph as an undirected graph: it starts at a schema node, passes through a series of nodes and edges, and ends at another schema node.
- For the meaningfulness problem: define “Meaningful Entity Walk (MEW)” to filter out useless relationships.
  - A MEW is a walk that denotes a relationship between two entity nodes that can be interpreted by the walk itself.
Annotations on the example walks (figure not preserved):
- Meaningful: the keywords may be about the same person.
- Meaningless: W2 has no database instances in practice.
- Meaningless: W5 cannot tell what the relationship between item and person is.
- Meaningful: the photo and video contain information about the same item.
- Meaningful: a person is watching an auction.
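The notion of a walk can be sketched as a bounded enumeration over the schema graph with edge directions ignored. The edge list, the `max_edges` bound, and the function names are assumptions made for illustration; the sketch only enumerates candidate walks and does not decide which of them are meaningful.

```python
from collections import defaultdict

def undirected(edges):
    """Build an adjacency map that ignores edge direction."""
    adj = defaultdict(set)
    for src, dst in edges:
        adj[src].add(dst)
        adj[dst].add(src)
    return adj

def walks(edges, start, end, max_edges):
    """Enumerate walks from start to end using at most max_edges edges.
    Nodes may repeat, so the length bound is what keeps the search finite."""
    adj = undirected(edges)
    found = []
    stack = [(start,)]
    while stack:
        walk = stack.pop()
        if len(walk) > 1 and walk[-1] == end:
            found.append(walk)
        if len(walk) - 1 < max_edges:
            for nxt in sorted(adj[walk[-1]]):
                stack.append(walk + (nxt,))
    return sorted(found)

# Hypothetical schema edges; ("seller", "person") stands for a reference edge.
edges = [("person", "watch"), ("watch", "auction"),
         ("auction", "seller"), ("seller", "person")]
print(walks(edges, "person", "auction", max_edges=2))
```

Filtering these candidates down to Meaningful Entity Walks would need the interpretability test described on the slide, which is not reproduced here.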

19 XML Keyword Search Semantics
For a keyword search query, define “Meaningful Connected Network” (MCN), a connected network in which:
- each keyword is contained at least once;
- there exists at least one Meaningful Entity Walk between each pair of entity nodes.
Data fragments containing Mike and John (figure not preserved):
- Meaningless: the relationship of the two person nodes in R2 and R4 cannot be interpreted by R2 and R4 themselves.
- Meaningless: the relationship of the two photo nodes cannot be interpreted by R1.
- Meaningful: the two persons provided photos about the same item.
- Meaningful: one person bought the item sold by another person.
- Meaningful: the two persons are watching the same auction.
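The two MCN conditions can be written down as a simple check. This is a sketch under assumptions: the set `mew_pairs` stands in for whatever procedure decides that a Meaningful Entity Walk connects two entity instances, and all identifiers are hypothetical.

```python
from itertools import combinations

def is_mcn(fragment_keywords, entities, query, mew_pairs):
    """Check the two MCN conditions for one data fragment.

    fragment_keywords: keywords occurring in the fragment
    entities: ids of the entity instances in the fragment
    query: the keywords of the query
    mew_pairs: set of frozensets {a, b} for which a MEW is assumed to exist
    """
    covers_query = set(query) <= set(fragment_keywords)
    all_pairs_linked = all(
        frozenset(pair) in mew_pairs for pair in combinations(entities, 2)
    )
    return covers_query and all_pairs_linked

query = {"Mike", "John"}
# A fragment shaped like R3: two persons watching the same auction.
mews = {frozenset({"person1", "auction1"}),
        frozenset({"auction1", "person2"}),
        frozenset({"person1", "person2"})}
print(is_mcn({"Mike", "John"}, ["person1", "auction1", "person2"], query, mews))
```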

20 XML Keyword Search Semantics
- Since the joining sequence of two data elements is data bound¹, users are usually required to specify the maximum size of the returned results.
- Keyword Search Problem: for a given keyword query Q, find all matched MCNs from the given XML document D, where each MCN contains at most C entity instances.
- Example: R3.size = 12, R1’.size = 7, R4’.size = 11. If MAXsize = 10, R3 and R4’ will be discarded, although all three results contain 3 entity nodes.
- In our method, C is the maximum number of entity nodes, so we can avoid returning data fragments that convey very weak semantics with an overwhelming number of entity nodes.
- C = 3 by default.
¹ Data bound means that the size of a result may be as large as the number of nodes in the XML document.
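The difference between bounding results by total node count and by entity count can be replayed with the numbers from the slide. The dictionary layout and function names are illustrative assumptions; only the sizes, the entity counts, MAXsize = 10, and C = 3 come from the slide.

```python
# Sizes and entity counts as given on the slide.
results = {
    "R3": {"size": 12, "entities": 3},
    "R1'": {"size": 7, "entities": 3},
    "R4'": {"size": 11, "entities": 3},
}

def keep_by_size(results, max_size):
    """Filter by total number of nodes in the result."""
    return sorted(name for name, r in results.items() if r["size"] <= max_size)

def keep_by_entities(results, c=3):
    """Filter by number of entity nodes, the bound C used by MCN."""
    return sorted(name for name, r in results.items() if r["entities"] <= c)

print(keep_by_size(results, 10))
print(keep_by_entities(results))
```

With the node-count bound only R1' survives, whereas the entity-count bound keeps all three results.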

21 Outline (current section: Query Processing)

22 Query Processing
- Entity Path: a meaningful entity walk of the schema graph that contains just two entity nodes.
- Partial Path: a path of the schema graph that starts from an entity node and ends at an attribute node.
- Entity Graph: a schema graph that keeps only the entity nodes and their connection relationships, i.e., entity paths.
[Figure: Schema Graph, Entity Graph, and an example partial path]
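Partial paths can be enumerated directly from the schema. The sketch below assumes an acyclic containment hierarchy, follows containment edges only, and assumes that a partial path does not pass through a second entity node; the slides do not state these details, and the sample schema is hypothetical.

```python
def partial_paths(children, category):
    """Enumerate paths that start at an entity node and end at an attribute node.

    children: element name -> list of child element names (containment edges)
    category: element name -> "entity" | "attribute" | "connection"
    """
    paths = []

    def descend(path):
        for child in children.get(path[-1], []):
            if category[child] == "attribute":
                paths.append(path + (child,))
            elif category[child] == "connection":
                descend(path + (child,))
            # Assumption: stop at nested entity nodes; they start their own paths.

    for node, cat in category.items():
        if cat == "entity":
            descend((node,))
    return sorted(paths)

# Hypothetical schema fragment.
children = {"person": ["name", "profile", "watch"],
            "profile": ["age"],
            "auction": ["price"]}
category = {"person": "entity", "name": "attribute", "profile": "connection",
            "age": "attribute", "watch": "entity",
            "auction": "entity", "price": "attribute"}
print(partial_paths(children, category))
```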

23 Query Processing
Identify Query Patterns
- Query Patterns are identified from the entity graph, so there is no need to scan real data.
- Observation: a Query Pattern consists of a set of Entity Paths and a set of Partial Paths.
- Theorem: for a given keyword query Q, our method produces all query patterns that have at most C entity nodes.
- Note: selfE(k) returns the set of entity nodes that have entity instances containing k as their attribute or attribute value.
[Figure: Entity Graph and the resulting Query Patterns; e3 is an entity path, and only entity paths are shown]

24 Query Processing
Process all Query Patterns
- Observation: a Query Pattern consists of a set of Entity Paths and a set of Partial Paths.
- Entity Path Index (EPI): for each entity path e, stores the set of path instances of e.
- Partial Path Index (PPI): for each keyword k, records the set of partial paths, together with their database instances, that contain k as their text value.
[Figure: Entity Graph and XML document]

25 Query Processing
Process all Query Patterns
- Example: Q = {Mike, John}
- A keyword query Q corresponds to a set of Query Patterns.
- A Query Pattern consists of a set of Entity Paths and a set of Partial Paths.
- The result set of each entity path and each partial path can be obtained by probing the Entity Path Index (EPI) and the Partial Path Index (PPI).
- Theorem: let Q be a given keyword query. Using EPI and PPI, structural join operations¹ can be avoided in the evaluation of Q.
¹ A structural join operation is a join that determines an ancestor-descendant or parent-child relationship.
[Figure: entity paths and partial paths of the example Query Patterns]
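How index probing can replace structural joins may be easier to see in a toy form. The sketch assumes a deliberately simplified layout that is not taken from the slides: the EPI maps an entity path name to pairs of entity instance ids, the PPI maps a keyword to (partial path, entity instance id) pairs, and a query pattern is one entity path with one keyword at each end. Matching is then an equality test on ids, i.e. a value join.

```python
def evaluate(pattern, epi, ppi):
    """Evaluate one simplified query pattern by probing the two indices.

    pattern: (entity_path, left_keyword, right_keyword)
    epi: entity path -> list of (left_entity_id, right_entity_id)
    ppi: keyword -> list of (partial_path, entity_id)
    """
    entity_path, left_kw, right_kw = pattern
    left_ids = {eid for _path, eid in ppi.get(left_kw, [])}
    right_ids = {eid for _path, eid in ppi.get(right_kw, [])}
    # Value join: compare entity ids, no ancestor-descendant test needed.
    return sorted(
        (a, b) for a, b in epi.get(entity_path, [])
        if a in left_ids and b in right_ids
    )

# Hypothetical index contents.
epi = {"person-auction-person": [("p1", "p2"), ("p1", "p3")]}
ppi = {"Mike": [("person/name", "p1")], "John": [("person/name", "p2")]}
print(evaluate(("person-auction-person", "Mike", "John"), epi, ppi))
```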

26 Query Processing
Process all Query Patterns: identifying redundant QPs
- Example: Q = {Mike, John}
- “~” denotes the containment relationship.
- Redundant QP: both provider nodes denote the same data element and their text values are identical, which contradicts the given XML document.
[Figure: schema, document, Query Patterns, Results, and the correct answer]

27 Query Processing
Process all Query Patterns
- Example: Q = {Mike, John}
- First identify which selections will produce an empty result set.
[Figure: Entity Path Index and Partial Path Index]

28 Outline (current section: Experiment)

29 Experiment
Experiment Setup
- Implemented the SLCA, XSEarch, and IM (our method) algorithms using Microsoft Visual C++ 6.0.
- Query engines used in our experiment: X-Hive and MonetDB.

30 Experiment
- Datasets and indices: PPI + EPI + Assistant index (table not preserved)
- Queries: 40 keyword queries, in 4 groups with 2, 3, 4, and 5 keywords respectively

31 Experiment
Evaluation Metrics
- Precision & Recall
  - Users submit a keyword query.
  - We ask the users for their query intention and write the corresponding XQuery expressions; running them on the MonetDB query engine gives the result set R.
  - We process the given keyword query using the different algorithms; the result of a specific algorithm is RQ.
- Running time
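The formulas themselves are not preserved in the transcript. A minimal sketch, assuming the standard definitions precision = |R ∩ RQ| / |RQ| and recall = |R ∩ RQ| / |R|, with R the ground-truth set and RQ the set returned by an algorithm; the function name and sample data are hypothetical.

```python
def precision_recall(r, rq):
    """Compute (precision, recall) of an algorithm's result RQ against R.

    r: ground-truth result set (from the hand-written XQuery)
    rq: result set returned by the keyword search algorithm
    Empty denominators yield 0.0 by convention (an assumption).
    """
    r, rq = set(r), set(rq)
    hits = len(r & rq)
    precision = hits / len(rq) if rq else 0.0
    recall = hits / len(r) if r else 0.0
    return precision, recall

# Hypothetical result identifiers.
p, rec = precision_recall({"a", "b", "c", "d"}, {"a", "b", "x"})
print(p, rec)
```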

32 Experiment
Experimental Results (figures not preserved)

33 Outline (current section: Conclusions)

34 Conclusions
- We proposed a new semantics, MCN, based on the relationships of entity nodes, to capture meaningful information while taking IDREF into account.
- We proposed an entity graph based method that produces all query patterns while avoiding scanning real data.
- We proposed two efficient indices, with which our method avoids structural join operations by equivalently transforming them into value join operations.
- We conducted experiments to verify the effectiveness and efficiency of our method.
- Ongoing work: a good ranking mechanism that considers node categories.

35 Thank you!

