
1 Cooperative XML (CoXML) Query Answering

2 Motivation
XML has become the standard format for information representation and data exchange.
An explosive increase in the amount of XML data available on the web, e.g.:
  Bills at the Library of Congress
  IEEE Computer Society's publications
  SwissProt (protein sequence databases)
  XMark (online auction data)
  …
Effective XML search methods are needed!

3 Challenges
XML schemas are usually very complex.
  E.g., the schema for the IEEE Computer Society publication dataset contains about 170 distinct tags and more than 1000 distinct paths.
It is often unrealistic for users to fully understand a schema before asking queries.
Exact query answering is inadequate; approximate query answering is more appropriate!

4 Approach: CoXML
[Figure: a query is submitted to Cooperative XML Query Answering, which sits on top of an XML database engine over XML documents and returns approximate answers.]
Derive approximate answers by relaxing query conditions, i.e., query relaxation.

5 Roadmap: Introduction; Background; CoXML; Related Work; Conclusion

6 XML Data Model
XML data is often modeled as an ordered labeled tree.
  Tree nodes: elements
  Tree edges: element-nesting relationships
Example data tree (elements numbered in document order, element content in quotes):
  1 article
    2 title: "Search engine spam detection"
    3 author
      4 name: "XYZ"
      5 title: "IEEE Fellow"
    6 year: "2003"
    7 body
      8 section: "…a spam detection technique by content analysis…"

7 XML Query Model
XML queries are often modeled as trees.
Structure conditions: a set of query nodes connected by
  Parent-to-child ('/'): directly connected
  Ancestor-to-descendant ('//'): connected either directly or indirectly
Content conditions: either value predicates or keyword constraints on query nodes.
Example query tree: article with title ("search engine"), year (2003), and section ("spam detection").

8 XML Query Answer
An answer for a query is a set of nodes in a data tree that satisfies both the structure and the content conditions.
Example
  Data tree: article (1) with title (2) "Search engine spam detection", author (3) with name (4) "XYZ" and title (5) "IEEE Fellow", year (6) "2003", and body (7) containing section (8) "…a spam detection technique by content analysis…".
  Query tree: article with title ("search engine"), year (2003), and section ("spam detection").

9 XML Query Relaxation Types [1]
Value relaxation: enlarging a value condition's search scope.
  Example: the condition year = 2003 is relaxed to year in 2000-2005.
Node relabel: changing the label of a node to a similar or more general label using domain knowledge.
  Example: the node article is relabeled to document.
[1] Tree Pattern Relaxation (S. Amer-Yahia et al., 2002)

10 XML Query Relaxation Types
Edge generalization: relaxing a '/' edge to a '//' edge.
Node deletion: dropping a node from a query tree.
  Example: the title node is dropped from the query tree, while its keyword condition ("search engine") is kept.
[Figure: the example query tree (article with title, year, and section) before and after each relaxation.]

11 XML Relaxation Properties
Definition
  Relaxation operation: an application of a relaxation type to a specific query node or edge.
Lemma
  Given a query tree with n applicable relaxation operations, there are potentially up to 2^n relaxed trees.
  Possible combinations: each of the n operations is either applied or not applied.
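The counting argument behind the lemma can be sketched in a few lines of Python. This is only an illustration: the operation names are hypothetical strings borrowed from the gen/del notation used later in the deck, and the sketch ignores the fact that some combinations may be inapplicable or yield identical trees (hence "up to" 2^n).

```python
from itertools import combinations

def relaxation_combinations(operations):
    """Yield every non-empty subset of the given relaxation operations.

    Each subset stands for one candidate relaxed tree; the empty subset
    (the original, unrelaxed query) is deliberately left out.
    """
    for size in range(1, len(operations) + 1):
        for subset in combinations(operations, size):
            yield frozenset(subset)

# Hypothetical operations on a small query tree.
ops = ["gen(e$1,$2)", "gen(e$3,$4)", "del($2)"]
relaxed = list(relaxation_combinations(ops))
print(len(relaxed))  # 2**3 - 1 = 7 relaxed candidates
```

Adding the unrelaxed query gives the 2^n bound, which is why exhaustive enumeration quickly becomes impractical as n grows.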

12 Roadmap: Introduction; Background; CoXML; Related Work; Conclusion

13 Challenges
Query relaxation is often user-specific.
  Different users may have different approximate matching specifications for a given query tree.
  How to provide user-specific approximate query answering?
A query with n relaxation operations has potentially up to 2^n relaxed queries.
  How to systematically relax a query?
Query relaxation generates a set of approximate answers.
  How to effectively rank the returned approximate answers?

14 CoXML System Overview
[Figure: CoXML architecture.
  Components: relaxation engine, ranking module, relaxation index builder (XTAH), XML database engine, XML documents.
  Data flows: RLXQuery, query, relaxed query, exact answers, results, ranked results.
  Annotations: relaxation language, relaxation indexes, similarity metrics.]

15 Roadmap: Introduction; Background; CoXML (Relaxation Language, Relaxation Indexes, Ranking, Evaluation, Testbed); Related Work; Conclusion

16 Relaxation Language
Motivation
  Enabling users to specify approximate conditions in queries and to control the approximate matching process.
RLXQuery: relaxation-enabled XQuery
  Extends the standard XML query language (XQuery) with relaxation constructs and controls, such as
    ~ : approximate conditions
    ! : non-relaxable conditions
    REJECT : unacceptable relaxations
    AT-LEAST : minimum number of answers to be returned
    RELAX-ORDER : relaxation order among multiple conditions
    USE : allowable relaxation types

17 RLXQuery Example
  FOR $a in doc("bib.xml")//article
  WHERE $a/year = ~2003 V-COND-LABEL t1
    and ~($a[about(./!title, "search engine")]/body/section)[about(., "spam detection")] S-COND-LABEL t2
  RETURN $a
  RELAX-ORDER (t1, t2)
  USE (edge generalization, node deletion)
  AT-LEAST 20
[Figure: the corresponding query tree: article with a non-relaxable (!) title ("search engine"), year (2003, value condition t1), and body/section ("spam detection", structure condition t2).]

18 Roadmap: Introduction; Background; CoXML (Relaxation Language, Relaxation Indexes, Ranking, Evaluation, Testbed); Related Work; Conclusion

19 Relaxation Index
Naïve approach
  Generate all possible relaxed queries and iteratively select the best relaxed query to derive approximate answers.
  Exhaustive, but not scalable.
Observation
  Many queries share the same (or similar) tree structures.
Our approach: relaxation index
  Consider the structure of a query tree T as a template.
  Build indexes on the relaxed trees of T.
  Use the index to guide the relaxations of any query with the same (or similar) tree structure as that of T.

20 Relaxation Index: XTAH
XTAH
  A hierarchical, multi-level, labeled cluster of relaxed trees.
Building an XTAH
  Given a query structure template T, generate all possible relaxed trees.
    Each relaxed tree uses a unique set of relaxation operations.
  Cluster the relaxed trees into groups based on relaxation operations and distances, similar to "suffix-tree" clustering.
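To make the idea of a labeled, multi-level grouping concrete, here is a deliberately simplified Python sketch. It is not the XTAH construction algorithm from the talk: it ignores distances and the suffix-tree-style clustering, and simply files each relaxed tree (represented by its set of operations) under the relaxation type and then the name of its first operation. The function names and the string encoding of operations are assumptions made for illustration.

```python
from itertools import combinations

def relaxation_type(op):
    """Map a hypothetical operation string to its relaxation type."""
    return {"gen": "edge_generalization", "del": "node_deletion"}[op[:3]]

def build_toy_xtah(operations):
    """Group all relaxed trees of a template into a two-level labeled hierarchy.

    Level 1 label: relaxation type. Level 2 label: a single operation.
    Leaves: relaxed trees, each identified by its set of operations.
    """
    xtah = {}
    for size in range(1, len(operations) + 1):
        for subset in combinations(operations, size):
            first = subset[0]  # grouping key: first operation in input order
            by_type = xtah.setdefault(relaxation_type(first), {})
            by_type.setdefault(first, []).append(frozenset(subset))
    return xtah

ops = ["gen(e$1,$2)", "gen(e$3,$4)", "del($2)"]
toy = build_toy_xtah(ops)
for rtype, groups in toy.items():
    for label, trees in groups.items():
        print(rtype, label, len(trees))
```

Because every group carries a label describing the operations its trees share, a search can accept or discard a whole group by looking at the label alone, which is the property the later slides rely on.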

21 XTAH Example
Template structure T: article ($1), title ($2), body ($3), section ($4).
[Figure: a sample XTAH for the template structure T. The root (relax) branches into node_relabel, edge_generalization, and node_deletion. Groups below are labeled with relaxation operations, e.g., {gen(e$1,$2)}, {gen(e$3,$4)}, {gen(e$1,$2), gen(e$3,$4)}, {gen(e$3,$4), gen(e$1,$3)}, {del($2)}, and {del($2), del($3)}. The leaves are relaxed trees T1, T2, T3, T4, T6, T7, …; for example, T6 (under {del($2)}) keeps article, body, and section, and T7 (under {del($2), del($3)}) keeps article and section.]
Notation
  gen(e$u,$v): relaxing the edge between $u and $v
  del($u): deleting the node $u

22 XTAH Properties
Each group consists of a set of relaxed trees obtained by using similar relaxation operations.
  Efficient location of relaxed trees based on relaxation operations.
The higher the level of a group, the less relaxed the trees in the group.
  Queries can be relaxed at different granularities by traversing up and down the XTAH.

23 XTAH-Guided Query Relaxation
Problem
  Given a query with relaxation specifications (constructs and controls), how to search an XTAH for relaxed queries that satisfy the specifications?
Approach
  First, prune XTAH groups containing trees that use unacceptable relaxations as specified in the query.
    This step can be done efficiently by using the internal node labels.
  Then, iteratively search the XTAH for the best relaxed query.
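The pruning step can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's algorithm: the XTAH is flattened to a dictionary from group labels to relaxed trees, operations are encoded as hypothetical (type, nodes) tuples, USE is modeled as a set of allowed relaxation types, and "!" is assumed to protect a node from deletion only.

```python
def is_acceptable(op, allowed_types, protected_nodes):
    """Check one operation against the (assumed) relaxation controls."""
    kind, nodes = op
    if kind not in allowed_types:
        return False
    # Assumption: a non-relaxable ("!") node may not be deleted.
    return not (kind == "node_deletion" and nodes[0] in protected_nodes)

def prune(xtah, allowed_types, protected_nodes):
    """Drop groups whose label already contains an unacceptable operation,
    then filter the remaining groups tree by tree."""
    kept = {}
    for label, trees in xtah.items():
        if not all(is_acceptable(op, allowed_types, protected_nodes) for op in label):
            continue  # the label alone rules out the whole group
        survivors = [t for t in trees
                     if all(is_acceptable(op, allowed_types, protected_nodes) for op in t)]
        if survivors:
            kept[label] = survivors
    return kept

GEN_12 = ("edge_generalization", ("$1", "$2"))
GEN_34 = ("edge_generalization", ("$3", "$4"))
DEL_2 = ("node_deletion", ("$2",))
DEL_3 = ("node_deletion", ("$3",))

xtah = {
    frozenset({GEN_12}): [frozenset({GEN_12}), frozenset({GEN_12, GEN_34})],
    frozenset({DEL_2}): [frozenset({DEL_2}), frozenset({DEL_2, DEL_3})],
    frozenset({DEL_3}): [frozenset({DEL_3}), frozenset({DEL_3, GEN_34})],
}
pruned = prune(xtah, {"edge_generalization", "node_deletion"}, protected_nodes={"$2"})
print(len(pruned))  # the {del($2)} group is gone, two groups remain
```

Checking the label first is what makes the step cheap: a group whose label contains a forbidden operation is skipped without inspecting any of its trees.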

24 Query Relaxation Process Example
[Figure: the sample XTAH for the template structure T (article $1, title $2, body $3, section $4), with groups such as {gen(e$1,$2)}, {gen(e$3,$4)}, {del($2)}, and {del($2), del($3)} and relaxed trees T1, T2, T3, T4, T6, T7, shown next to a sample RLXQuery: article with a non-relaxable (!) title ("search engine"), year (2003, t1), and body/section ("spam detection", t2).]
Relaxation control of the sample RLXQuery: USE (edge generalization, node deletion) AT-LEAST 20

25 XTAH-Guided Query Relaxation
Problem
  Given a query and an XTAH, how to efficiently locate the best relaxation candidate at the leaf level?
Approach: M-tree [2]
  Assign representatives to internal groups.
  Representatives summarize the distance properties of the trees within groups.
  Use representatives to guide the search path to the best relaxation candidate.
[Figure: representatives R0, R1, R2, R3, R5, R8, R11 guiding the search toward relaxed tree j.]
[2] M-tree: An efficient access method for similarity search in metric spaces (P. Ciaccia et al., VLDB 97)

26 Roadmap: Introduction; Background; CoXML (Relaxation Language, Relaxation Indexes, Ranking, Evaluation, Testbed); Related Work; Conclusion

27 Ranking
Ranking criteria
  Based on both content and structure similarities between a query and an answer, i.e., a set of data nodes.
Approach
  Content similarity: extended vector space model
  Structure similarity: tree editing distance with a model for assigning operation costs
  Overall relevancy: a ranking model combining both content and structure similarities

28 Content Similarity
Traditional IR ranking: content similarity between a query and a document
  Term frequency
  Inverse document frequency
  Vector space model
XML content ranking: content similarity between a query and an answer (i.e., a set of data nodes)
  Weighted term frequency
  Inverse element frequency
  Extended vector space model

29 Weighted Term Frequency
Terms under different paths of a node are weighted differently.
The weighted term frequency for a term t in a node v is: [equation image not included in the transcript]
  p_i: a path under the node v to a term t
  m: number of different paths under the node v that contain the term t
Example
  Query: section ("spam detection")
  Data: a section with title "Spam Detection By Content Analysis", paragraph "…an approach to detect spam by …", and reference "Spam detection taxonomy"
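Since the equation did not survive in the transcript, the following Python sketch only illustrates the idea stated on the slide: occurrences of a term are counted per path and each path contributes with its own weight. The summation form, the function name, and the path weights are all assumptions chosen for illustration, not values from the talk.

```python
def weighted_term_frequency(occurrences, path_weights, default_weight=1.0):
    """Sum per-path term counts, each scaled by the weight of its path.

    occurrences:  {path: number of times the term occurs under that path}
    path_weights: {path: weight}; paths without an entry use default_weight
    """
    return sum(path_weights.get(path, default_weight) * count
               for path, count in occurrences.items())

# Hypothetical weights: a hit in a title counts more than one in a reference.
path_weights = {"title": 2.0, "paragraph": 1.0, "reference": 0.5}

# The term "spam" in the example section occurs once under each of m = 3 paths.
spam_occurrences = {"title": 1, "paragraph": 1, "reference": 1}
tf_w = weighted_term_frequency(spam_occurrences, path_weights)
print(tf_w)  # 3.5
```

With uniform weights the function reduces to an ordinary term count, which is the sense in which the weighted version extends the traditional term frequency.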

30 Inverse Element Frequency
The more XML elements contain a term, the less disambiguating power the term has.
  E.g., the term "spam" is less disambiguating than the term "detection".
The inverse element frequency for a query term t is: [equation image not included in the transcript]
  $u: a query node whose content condition contains the term t
  N_1: number of data nodes that match the structure condition related to $u
  N_2: number of data nodes that match the structure condition related to $u and contain t
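The equation itself is missing from the transcript. By analogy with the classic inverse document frequency, one plausible form is the logarithm of N_1 / N_2; the sketch below uses that form as an explicit assumption to show how the measure behaves, and the function name is hypothetical.

```python
import math

def inverse_element_frequency(n1, n2):
    """Assumed form log(N1 / N2), by analogy with inverse document frequency.

    n1: data nodes matching the structure condition related to the query node
    n2: those of the n1 nodes that also contain the term
    """
    if n2 <= 0 or n2 > n1:
        raise ValueError("expected 0 < n2 <= n1")
    return math.log(n1 / n2)

# A term found in few of the structurally matching elements is more
# discriminating than one found in many of them.
rare = inverse_element_frequency(1000, 10)
common = inverse_element_frequency(1000, 100)
print(rare > common)  # True
```

Note that, unlike a collection-wide document frequency, both counts are restricted to data nodes that match the structure condition of the query node, so the same term can have different weights for different query nodes.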

31 Extended Vector Space Model
The content similarity between an answer A and a query Q is: [equation image not included in the transcript]
  n: number of nodes in Q
  {$u_1, …, $u_n}: the set of query nodes in Q
  {v_1, …, v_n}: the set of data nodes in A, where v_i matches $u_i (1 ≤ i ≤ n)
  |$u_i.cont|: the number of terms in the content condition on the node $u_i
  t_ij: a term in the content condition on the query node $u_i
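The equation is not preserved here. One form that is consistent with the symbols defined on this slide and with the two preceding slides, offered as an assumption rather than as the authors' exact definition, sums the product of weighted term frequency and inverse element frequency over every term of every query node:

```latex
\[
\mathrm{cont\_sim}(A, Q) \;=\;
\sum_{i=1}^{n} \; \sum_{j=1}^{|\$u_i.\mathrm{cont}|}
\mathrm{tf}_w(v_i, t_{ij}) \cdot \mathrm{ief}(\$u_i, t_{ij})
\]
```

Here tf_w(v_i, t_ij) is the weighted term frequency of the term t_ij in the matching data node v_i, and ief($u_i, t_ij) is the inverse element frequency of t_ij for the query node $u_i.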

32 Structure Distance Function
Both XML data and queries are modeled as trees.
Similarities between trees are often computed by editing distances, i.e., the cost of the cheapest sequence of editing operations that transforms one tree into the other.
The structure distance between an answer A and a query Q can be measured as the total cost of the relaxation operations used to derive A:
  struct_dist(A, Q) = cost(r_1) + … + cost(r_k)
  {r_1, …, r_k}: the set of relaxation operations used to derive A
  cost(r_i): the cost of r_i (0 ≤ cost(r_i) ≤ 1)

33 Relaxation Operation Cost
Naïve approach
  Assign a uniform cost to all relaxation operations.
  Simple but ineffective.
Our approach
  Assign an operation cost based on the similarity between the two nodes being approximated by the operation.
  The closer the two nodes, the less the operation costs.
  r_i: a relaxation operation
  $u, $v: the two nodes that are being approximated by r_i

34 Nodes Approximated By Relaxation Operations
  Relaxation operation   Nodes being approximated ($u, $v)                        Example
  Node relabel           (a node with the old label, a node with the new label)   (article, document)
  Node deletion          (a child node, the parent node)                          (section, body)
  Edge generalization    (a child node, a descendant node)                        (article/title, article//title)
[Figure: four trees labeled T1 to T4: the query tree (article, title, body, section) and its relaxations by node relabel (article becomes document), node deletion (section removed), and edge generalization.]

35 [Figure: overall relevancy, derived from content similarity and structure distance]

36 Overall Relevancy Function
The overall relevancy of an answer A to a query Q, sim(A, Q), is a function of cont_sim(A, Q) and struct_dist(A, Q).
Properties
  sim(A, Q) = cont_sim(A, Q) if struct_dist(A, Q) = 0
  sim(A, Q) increases as cont_sim(A, Q) increases
  sim(A, Q) decreases as struct_dist(A, Q) increases
Implementation
  α is a small constant between 0 and 1
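The implementation formula is not preserved in the transcript. One simple function that satisfies all three listed properties is exponential decay of the content similarity in the structure distance, with the constant as the base. The Python sketch below uses that form as an assumption; the function name and the default value of alpha are hypothetical.

```python
def overall_relevancy(cont_sim, struct_dist, alpha=0.5):
    """Assumed form: sim(A, Q) = alpha ** struct_dist(A, Q) * cont_sim(A, Q)."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if struct_dist < 0:
        raise ValueError("struct_dist must be non-negative")
    return (alpha ** struct_dist) * cont_sim

exact = overall_relevancy(0.8, 0)    # no relaxation: equals cont_sim
relaxed = overall_relevancy(0.8, 1)  # one unit of structure distance
print(exact, relaxed)
```

Under this form a smaller alpha penalizes structural relaxation more heavily, which matches the role alpha plays as the tuning parameter in the evaluation tables.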

37 Roadmap: Introduction; Background; CoXML (Relaxation Indexes, Relaxation Language, Ranking, Evaluation, Testbed); Related Work; Conclusion

38 Evaluation Studies
INEX (Initiative for the Evaluation of XML Retrieval)
  Similar to TREC for text retrieval.
Document collections
  Scientific articles from the IEEE Computer Society, 1995 – 2002
  About 500 MB
  Each article consists of 1500 XML nodes on average
Queries
  Strict content and structure (SCAS)
  Vague content and structure (VCAS)
Gold standard
  Relevance assessments provided by INEX

39 Evaluation of Content Similarity
Dataset: INEX 03 test collection
Query set: 30 SCAS queries
Comparison: 38 submissions in INEX 03
[Figure: recall-precision curve; average precision 0.3309]

40 Evaluation of the Cost Model
Dataset: INEX 05 test collection
Query set: 22 simple VCAS queries
Evaluation metric: normalized extended cumulative gain (nxCG), the official evaluation metric used in INEX 05
  Given a number i (i ≥ 1), nxCG@i, similar to precision@i, measures the relative gain users have accumulated up to rank i.
  E.g., nxCG@10, nxCG@25, nxCG@50, …
Cost models
  UCost: uniform cost for each relaxation operation (baseline)
  SCost: our proposed cost model
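As a rough illustration of what nxCG@i measures, the Python sketch below divides the gain accumulated by a run up to rank i by the gain an ideal ranking would have accumulated at the same rank. It is a simplified reading of the metric: the official INEX 05 definition also covers details such as quantization of assessments and overlap handling, which are not modeled here, and the function name is an assumption.

```python
def nxcg_at(i, run_gains, ideal_gains):
    """Cumulated gain of the run up to rank i, relative to the ideal ranking.

    run_gains:   gain of each returned element, in ranked order
    ideal_gains: gains of all relevant elements (any order)
    """
    ideal = sum(sorted(ideal_gains, reverse=True)[:i])
    if ideal == 0:
        return 0.0
    return sum(run_gains[:i]) / ideal

# A run that places a non-relevant element at rank 2.
run = [1, 0, 1]
ideal = [1, 1, 0]
print(nxcg_at(2, run, ideal))  # 0.5
print(nxcg_at(3, run, ideal))  # 1.0
```

A value of 1.0 at rank i means that the run has collected as much gain by that rank as the best possible ranking could have.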

41 Retrieval performance improvements with semantic cost model
Query set: all content-and-structure queries in INEX 05
nxCG@10 for each pair (α, cost model):
  α          0.1               0.3               0.5               0.7           0.9
  Uniform    0.2584            0.2616            0.2828            0.2894        0.2916
  Semantic   0.3319 (+28.44%)  0.3190 (+21.94%)  0.3196 (+13.04%)  0.3068 (+6%)  0.2957 (+4.08%)
Assigning different costs to relaxation operations based on the similarities of the nodes being operated on improves retrieval performance!
nxCG@25 and nxCG@50 yield similar results.

42 Evaluation of the Cost Model: Result
Each cell: nxCG@10 for a given pair (α, cost model), with the % of improvement over the baseline in parentheses.
  α        0.1               0.3               0.5               0.7           0.9
  UCost    0.2584            0.2616            0.2828            0.2894        0.2916
  SCost    0.3319 (+28.44%)  0.3190 (+21.94%)  0.3196 (+13.04%)  0.3068 (+6%)  0.2957 (+4.08%)
Utilizing node similarities to distinguish the costs of different operations improves retrieval performance!
Similar results are observed using nxCG@25 and nxCG@50.

43 Expressiveness of the Relaxation Language
INEX 05 Topic 267
  //article//fm//atl[about(., "digital libraries")]
  Articles containing "digital libraries" in their title. I'm interested in articles discussing Digital Libraries as their main subject. Therefore I require that the title of any relevant article mentions "digital library" explicitly. Documents that mention digital libraries only under the bibliography are not relevant, as well as documents that do not have the phrase "digital library" in their title.
Expressing Topic 267 using RLXQuery
  FOR $a in doc("inex.xml")//article
  LET $b = $a//fm//!atl
  REJECT(fm, bb)
  WHERE $b[about(., "digital libraries")]
  RETURN $b

44 Expressing Topic 267 with RLXQuery: Results
  FOR $a in doc("inex.xml")//article
  LET $b = $a//fm//!atl
  REJECT(fm, bb)
  WHERE $b[about(., "digital libraries")]
  RETURN $b
Effectiveness of the relaxation control
  Method                    nxCG@10                  nxCG@25
  No relaxation control     0.1013                   0.2365
  With relaxation control   1.0 (perfect accuracy)   0.8986
Relaxation control enables the system to provide answers with greater relevancy!

45 Evaluation of the Ranking Function
Dataset: INEX 05 test collection
Query set: 4 official VCAS queries with available relevance assessments
Comparison: top-1 submission in INEX 05
Results
  Topic     nxCG@10 Top-1    nxCG@10 CoXML   nxCG@25 Top-1   nxCG@25 CoXML
  256       0.4293           0.4248          0.4733          0.5555
  264       0.0              0.0069          0.0             0.0033
  275       0.7715           0.638           0.589           0.5922
  284       0.0              0.1259          0.0             0.1233
  Average   0.3002 (+0.4%)   0.2989          0.2656          0.3186 (+20%)
The systematic relaxation approach enables our system to derive more approximate answers!
Our ranking function, based on both content and structure relevancy, outperforms other ranking functions using content similarities only!

46 Roadmap: Introduction; Background; CoXML (Relaxation Indexes: XTAH, Relaxation Language: RLXQuery, Ranking, Evaluation, Testbed); Related Work; Conclusion

47 CoXML Testbed
Team members: Prof. Chu, S. Liu, T. Lee, E. Sung, C. Cardenas, A. Putnam, J. Chen, R. Shahinian
[Figure: testbed architecture with RLXQuery preprocessor, RLXQuery parser, relaxation controller, relaxation manager, database manager, ranking module, relaxation index builder (XTAH), XML database engine, and XML documents; input: RLXQuery; output: approximate answers.]

48 Relaxation Examples using the Testbed

49 Relaxation Examples using the Testbed

50 Roadmap: Introduction; Background; CoXML; Related Work; Conclusion

51 Related Work: Query Relaxation
Relaxation based on schema conversions [LC01, LMC01, LMC03]
  No structure relaxation.
Native XML relaxation
  Propose structure relaxation types [e.g., KS01, ACS02].
    We use the relaxation types introduced in [ACS02].
  Investigate efficient algorithms for deriving top-K answers based on the relaxation types supported [e.g., Sch02, ACS02, ALP04, AKM05].
    No relaxation control.

52 Related Work: XML Ranking
Content ranking
  Most approaches extend ranking models for text retrieval to the XML scenario, e.g., HyRex, XXL, JuruXML, XSearch.
  We utilize structure to distinguish terms of different weights occurring in different parts of a document.
Structure ranking
  Based on tree editing distance algorithms without considering operation costs [NJ02].
  Based on the occurrence frequency of the query trees, paths, or predicates in the data [MAK05, AKM05].
  Our structure ranking is similar to editing distance, but we consider operation costs.

53 Conclusion
Cooperative XML (CoXML) query answering
  RLXQuery enables users to effectively express approximate query conditions and to control the approximate matching process.
  XTAH provides systematic query relaxation guidance.
  Both content and structure similarity metrics are used for evaluating the relevancy of approximate answers.
Evaluation studies with the INEX test collections demonstrate the effectiveness of our methodology.

