Reasoning and Identifying Relevant Matches for XML Keyword Search
Ziyang Liu, Yi Chen
Arizona State University
VLDB 2008, Auckland, New Zealand
Motivation
Identifying relevant matches is a critical step in processing XML keyword search.
Query: “Gasol, position”
[Figure: relevant matches vs. irrelevant matches]
How to Evaluate Various Strategies?
Existing approaches for identifying relevant matches:
  XKSearch (SLCA) [Xu and Papakonstantinou 2005]
  XRank [Guo et al. 2003]
  XSEarch [Cohen et al. 2003]
    Star-semantics
    All-semantics
  Schema-free XQuery (MLCA) [Li et al. 2004]
  CVLCA [Li et al. 2007]
How to Evaluate Various Strategies?
The traditional approach
  Obtain the ground truth of query results through user studies on a large number of documents and queries.
  Measure the precision and recall of a strategy with respect to the ground truth.
  Costly
An axiomatic approach
  Formalize broad intuitions as a collection of simple axioms and evaluate strategies against the axioms.
  It has been successful in many areas, e.g. mathematical economics, clustering, location theory, and collaborative filtering.
  Cost-effective
Problem: Is it possible to evaluate and reason about XML keyword search strategies in a formal axiomatic framework?
Roadmap
Motivation and Problem Definition
Challenges and Contributions
Four properties that an XML search engine should satisfy
  Query Monotonicity/Consistency
  Data Monotonicity/Consistency
MaxMatch: the first system that satisfies all four properties
Experimental Evaluation
Conclusions
Challenge
It is easy for an individual to assess the relevance of matches.
But it is extremely difficult to formalize relevance assessment independently of any query, data, algorithm, and user.
Query: “Gasol, position”
[Figure: relevant matches vs. irrelevant matches]
Example: Similar Queries
Interestingly, we discovered that some abnormal behaviors can be clearly observed when examining the results that the same search engine produces for two similar queries, or for one query on two similar documents.
Q1: “Gasol, position”
Q2: “Grizzlies, Gasol, position”
These two “position” nodes should still be irrelevant.
Example: Similar Data
Q: “Grizzlies, Gasol, Brown, position”
[Figure: data after an insertion, with the labels "position" and "forward"]
An empty result after data insertion is abnormal.
How can we capture the logical connection between query results?
Contributions of This Work
This is the first work to formally reason about keyword search in an axiomatic framework.
We identified four desirable properties that an XML search engine should satisfy.
  Data/Query Monotonicity capture the desirable changes to the number of query results.
  Data/Query Consistency capture the desirable changes to the content of a query result.
We reasoned about existing XML keyword search strategies.
We proposed MaxMatch, the only XML keyword search strategy that possesses all four properties.
Experiments verified our intuition and demonstrated the effectiveness and efficiency of MaxMatch.
Properties with Respect to Similar Queries
Query Monotonicity
  When we add a keyword to the query, the query becomes more restrictive; therefore, the number of query results should not increase.
Query Consistency
  When we add a new keyword to the query, each delta subtree that newly becomes (part of) a query result should contain the new keyword.
Example: Query Monotonicity/Consistency
Q1: “forward, name”
Q2: “forward, USA, name” (new keyword: “USA”)
Monotonicity: the number of query results drops from 2 to 1.
Consistency: in each result, the delta subtree (if one exists) contains “USA”.
Example Revisited: Violation of Query Consistency
Q1: “Gasol, position”
Q2: “Grizzlies, Gasol, position”
An XML keyword search engine that considers these nodes relevant for the new query violates query consistency.
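The two query properties can be checked mechanically once results are modeled as sets of node IDs. The sketch below is only an illustration under stated assumptions: it reads a "delta subtree" as a maximal subtree of a new result that shares no node with any old result, and the tree, the node IDs, and the helper names (`delta_subtrees`, `query_monotonic`, `query_consistent`) are hypothetical, not taken from the paper or from any system.

```python
from typing import Dict, FrozenSet, List, Optional, Set

Result = FrozenSet[str]  # node IDs that make up one query result tree


def delta_subtrees(old: List[Result], new: List[Result],
                   parent: Dict[str, Optional[str]]) -> List[Set[str]]:
    """Maximal subtrees of the new results that share no node with any old result."""
    old_nodes: Set[str] = set().union(*old)
    deltas: List[Set[str]] = []
    for result in new:
        # Nodes whose subtree (inside this result) contains an old node.
        touches_old: Set[str] = set()
        for node in result & old_nodes:
            cur: Optional[str] = node
            while cur in result and cur not in touches_old:
                touches_old.add(cur)
                cur = parent.get(cur)
        # Group the remaining nodes under their topmost all-new ancestor.
        groups: Dict[str, Set[str]] = {}
        for node in result - touches_old:
            root = node
            while parent.get(root) in result and parent[root] not in touches_old:
                root = parent[root]
            groups.setdefault(root, set()).add(node)
        deltas.extend(groups.values())
    return deltas


def query_monotonic(old: List[Result], new: List[Result]) -> bool:
    """Adding a keyword must not increase the number of results."""
    return len(new) <= len(old)


def query_consistent(old: List[Result], new: List[Result],
                     parent: Dict[str, Optional[str]],
                     matches: Dict[str, Set[str]], new_keyword: str) -> bool:
    """Every delta subtree must contain a match to the added keyword."""
    return all(
        any(new_keyword in matches.get(node, set()) for node in delta)
        for delta in delta_subtrees(old, new, parent)
    )


# Hypothetical team document: node ID -> parent ID.
parent = {
    "team": None, "team/name": "team",
    "team/p1": "team", "team/p1/name": "team/p1", "team/p1/position": "team/p1",
    "team/p2": "team", "team/p2/position": "team/p2",
}
# Keywords matched by each node (tag name or value).
matches = {
    "team/name": {"Grizzlies"}, "team/p1/name": {"Gasol"},
    "team/p1/position": {"position"}, "team/p2/position": {"position"},
}

q1 = [frozenset({"team/p1", "team/p1/name", "team/p1/position"})]
q2_good = [q1[0] | {"team", "team/name"}]
q2_bad = [q2_good[0] | {"team/p2", "team/p2/position"}]  # extra "position" node

print(query_monotonic(q1, q2_bad))                               # True
print(query_consistent(q1, q2_good, parent, matches, "Grizzlies"))  # True
print(query_consistent(q1, q2_bad, parent, matches, "Grizzlies"))   # False
```

In this toy setup the second engine keeps the result count at one, so monotonicity alone does not flag it; the violation only shows up in the consistency check, because the newly added player subtree contains no match to "Grizzlies".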
Properties with Respect to Similar Data
Data Monotonicity
  When we add a node to the data, the data content becomes richer, and the number of query results should not decrease.
Data Consistency
  After we add a node to the data, each delta subtree that becomes (part of) a query result should contain the newly inserted node.
Example: Data Monotonicity/Consistency
Q: “forward, name”
[Figure: inserted data with the labels "position" and "forward", marked as a new match]
Monotonicity: the number of query results increases from 1 to 2.
Consistency: in each result, the delta subtree (if one exists) contains the new data node.
Example Revisited: Violation of Data Monotonicity
Q: “Grizzlies, Gasol, Brown, position”
[Figure: data after an insertion, with the labels "position" and "forward"]
An XML keyword search engine that outputs an empty result on the updated data violates data monotonicity.
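The data-side properties can be sketched the same way, comparing the results of one query before and after a node is inserted. As before, this is an illustration under assumptions: a delta subtree is read as a maximal subtree of a new result that shares no node with any old result, and the document, node IDs, and function names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional, Set

Result = FrozenSet[str]  # node IDs that make up one query result tree


def delta_subtrees(old: List[Result], new: List[Result],
                   parent: Dict[str, Optional[str]]) -> List[Set[str]]:
    """Maximal subtrees of the new results that share no node with any old result."""
    old_nodes: Set[str] = set().union(*old)
    deltas: List[Set[str]] = []
    for result in new:
        # Mark every node whose subtree (inside this result) holds an old node.
        touches_old: Set[str] = set()
        for node in result & old_nodes:
            cur: Optional[str] = node
            while cur in result and cur not in touches_old:
                touches_old.add(cur)
                cur = parent.get(cur)
        # Group the all-new nodes under their topmost all-new ancestor.
        groups: Dict[str, Set[str]] = {}
        for node in result - touches_old:
            root = node
            while parent.get(root) in result and parent[root] not in touches_old:
                root = parent[root]
            groups.setdefault(root, set()).add(node)
        deltas.extend(groups.values())
    return deltas


def data_monotonic(before: List[Result], after: List[Result]) -> bool:
    """Inserting a node must not decrease the number of results."""
    return len(after) >= len(before)


def data_consistent(before: List[Result], after: List[Result],
                    parent: Dict[str, Optional[str]], inserted: str) -> bool:
    """Every delta subtree must contain the newly inserted node."""
    return all(inserted in delta for delta in delta_subtrees(before, after, parent))


# Hypothetical document after the insertion: node ID -> parent ID.
inserted = "team/p2/position"
parent = {
    "team": None, "team/name": "team", "team/coach": "team",
    "team/p1": "team", "team/p1/name": "team/p1", "team/p1/position": "team/p1",
    "team/p2": "team", "team/p2/name": "team/p2", inserted: "team/p2",
}

before = [frozenset({"team", "team/name", "team/p1", "team/p1/name",
                     "team/p1/position", "team/p2", "team/p2/name"})]
after_good = [before[0] | {inserted}]
after_empty: List[Result] = []                # engine returns nothing after the insert
after_odd = [after_good[0] | {"team/coach"}]  # unrelated node newly appears

print(data_monotonic(before, after_empty))                    # False
print(data_consistent(before, after_good, parent, inserted))  # True
print(data_consistent(before, after_odd, parent, inserted))   # False
```

The empty result corresponds to the abnormal behavior on the slide: the data only became richer, yet the result count dropped from one to zero.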
The Proposed Axiomatic Framework
Four desirable properties
  Query Monotonicity
  Query Consistency
  Data Monotonicity
  Data Consistency
These properties are:
  Non-trivial: no prior XML keyword search system satisfies all of them.
  Non-redundant: an algorithm may violate any one of them while satisfying the others.
  Satisfiable: we propose a novel technique, MaxMatch, that satisfies all four properties.
MaxMatch
The name MaxMatch comes from “Maximal Match”.
MaxMatch preserves each subtree whose set of descendant keyword matches is “maximal” among its siblings.
Intuitively, the subtrees that are removed are strictly less relevant to the query, since they contain fewer keywords.
MaxMatch
Q: “Grizzlies, Gasol, Brown, position”
[Figure: a subtree that is not as informative as its siblings is discarded]
MaxMatch satisfies all four properties.
Proof details and algorithms can be found in the paper.
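The pruning rule described above can be sketched in a few lines. This is a minimal illustration rather than the algorithm from the paper: it assumes the root of a result is already given, treats a keyword as matched when it equals a node's tag name or text value (case-insensitively), and discards a child when its set of descendant keyword matches is a strict subset of a sibling's. The `Node` class, the function names, and the sample document (including the player "Miller" and the "guard" values) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Set


@dataclass
class Node:
    label: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def descendant_matches(node: Node, keywords: Set[str]) -> FrozenSet[str]:
    """Keywords (lowercase) matched by this node or by any of its descendants."""
    found = {k for k in keywords if k in (node.label.lower(), node.text.lower())}
    for child in node.children:
        found |= descendant_matches(child, keywords)
    return frozenset(found)


def prune(node: Node, keywords: Set[str]) -> Node:
    """Copy the subtree, dropping children strictly dominated by a sibling."""
    match_sets = [descendant_matches(c, keywords) for c in node.children]
    kept = [
        prune(child, keywords)
        for child, own in zip(node.children, match_sets)
        # `own < other` is the proper-subset test on frozensets.
        if not any(own < other for other in match_sets)
    ]
    return Node(node.label, node.text, kept)


# Hypothetical document for the query on the slide.
team = Node("team", children=[
    Node("name", "Grizzlies"),
    Node("player", children=[Node("name", "Gasol"), Node("position", "forward")]),
    Node("player", children=[Node("name", "Brown"), Node("position", "guard")]),
    Node("player", children=[Node("name", "Miller"), Node("position", "guard")]),
])
query = {"grizzlies", "gasol", "brown", "position"}

pruned = prune(team, query)
kept_players = [c.children[0].text for c in pruned.children if c.label == "player"]
print(kept_players)  # ['Gasol', 'Brown']
```

The third player matches only "position", which is a strict subset of what the Gasol subtree matches, so it is discarded. The team name node survives because its match set is not contained in any sibling's. Recomputing match sets at every level keeps the sketch short at the cost of repeated traversals; a real implementation would presumably compute them once.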
Search Quality
Data sets: Baseball, Mondial
Query set: 36 queries in total
Ground truth: obtained by a user study
User perception of search results on query pairs and document pairs confirms our intuition behind the proposed properties.
[Figure: F-measure of MaxMatch vs. existing approaches]
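For reference, precision, recall, and F-measure against a ground-truth set of relevant nodes can be computed as below. The slides do not say which F-measure variant was used, so this sketch assumes the balanced F1 score; the function name and the node IDs are hypothetical.

```python
from typing import Set, Tuple


def precision_recall_f1(returned: Set[str], relevant: Set[str]) -> Tuple[float, float, float]:
    """Score the returned nodes against the ground-truth relevant nodes."""
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    # Balanced F-measure (harmonic mean); an assumption, see the note above.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical node IDs: what an engine returned vs. what users judged relevant.
returned = {"n1", "n2", "n3", "n4"}
relevant = {"n1", "n2", "n5"}
precision, recall, f1 = precision_recall_f1(returned, relevant)
print(precision, recall, f1)
```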
Processing Time
[Figure: processing time on Mondial data (515KB) and Baseball data (1014KB)]
Conclusions
This is the first work on reasoning about and evaluating XML keyword search strategies using a formal axiomatic framework.
Four intuitive and elegant properties are proposed: query monotonicity/consistency and data monotonicity/consistency.
We designed and developed MaxMatch, the only XML keyword search strategy that satisfies all four properties.
Experiments verified the intuition behind the properties and the effectiveness and efficiency of MaxMatch.
MaxMatch is incorporated as part of XSeek [Liu & Chen SIGMOD 07].
Thank You! Questions? Welcome to try MaxMatch at: xseek.asu.edu