Identifying Meaningful Return Information for XML Keyword Search
  Ziyang Liu, Yi Chen
  Arizona State University
SIGMOD 2007
Searching XML Data
  Task: find the position of the player with name "Mutombo"
  XQuery:
    for $x in doc("DB.xml")//player, $y in $x/name
    where $y = "Mutombo"
    return $x/position
  Keyword search: Mutombo, position
  [Figure: sample XML tree. A league contains team nodes; a team has name (Rockets), founded (1967), division (southwest), stadium (Toyota), and players; each player has name, position, and nationality, e.g. Mutombo / center / Congo and Wells / guard / U.S.]
Challenges in XML Keyword Search
  How to select relevant keyword matches and connect them?
    Inferring for clauses (with variable bindings) and where clauses in XQuery
    Has been much studied
      XRank [Guo et al 03]
      XSEarch [Cohen et al 03]
      Meaningful LCA [Li et al 04]
      Smallest LCA [Xu et al 05]
  How to identify meaningful return information?
    Inferring return clauses in XQuery
    Limited research has been done
      Users or system administrators specify [Hristidis et al 03, Li et al 04]
      Whole document [Carmel et al 02]
      Subtree Return [Cohen et al 03, Guo et al 03, Xu et al 05]
      Path Return variants [Hristidis et al 06]
  XSeek: automatically and intelligently identifies return information
Selecting and Connecting Keyword Matches
  Identify relevant matches using variants of the lowest common ancestor (LCA) concept [Cohen et al 03, Li et al 04, Xu et al 05]
  Q1: Mutombo, position
  [Figure: the sample league/team/player XML tree]
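The plain LCA idea behind this slide can be sketched as follows. This is a minimal illustration, not XSeek's implementation: it assumes nodes are addressed by Dewey-style labels (tuples of child positions from the root), and both the labels and the `lca` helper are hypothetical. The cited variants (Meaningful LCA, Smallest LCA) add further conditions that are not shown here.

```python
from typing import Sequence, Tuple

# Hypothetical node address: the path of child positions from the root.
Dewey = Tuple[int, ...]


def lca(labels: Sequence[Dewey]) -> Dewey:
    """Return the lowest common ancestor of the given nodes.

    With Dewey-style labels, the LCA is the longest common prefix.
    """
    if not labels:
        raise ValueError("need at least one node")
    prefix = labels[0]
    for label in labels[1:]:
        n = 0
        while n < min(len(prefix), len(label)) and prefix[n] == label[n]:
            n += 1
        prefix = prefix[:n]
    return prefix


# Assumed labels for the sample tree: league=(), first team=(0,),
# its players node=(0, 4), first player=(0, 4, 0).
mutombo_name = (0, 4, 0, 0)   # match for "Mutombo" under player/name
position_node = (0, 4, 0, 1)  # match for the element name "position"
print(lca([mutombo_name, position_node]))  # (0, 4, 0)
```

Under these assumed labels, the LCA of the two matches for Q1 is the player node, which is the subtree that connects them.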
Selecting and Connecting Keyword Matches
  Q1: Mutombo, position
  [Figure: the sample league/team/player XML tree]
  Given relevant matches, what should be returned?
Example I: Subtree Return
  Q1: Mutombo, position
  Q2: Mutombo, center
  [Figure: the sample league/team/player XML tree]
Example I: Path Return
  Q1: Mutombo, position
  Q2: Mutombo, center
  [Figure: the sample league/team/player XML tree]
Example I: XSeek
  Q1: Mutombo, position
  Q2: Mutombo, center
  [Figure: the sample league/team/player XML tree]
Example II: Subtree Return, Path Return
  Q3: Rockets
  [Figure: the sample league/team/player XML tree]
Example II: XSeek
  Q3: Rockets
  [Figure: the sample league/team/player XML tree]
Contributions
  XSeek: automatically infers meaningful return information for XML keyword search
    No elicitation from users or system administrators is required
    No schema information is required
  Inferring search semantics
    Analyzing XML data structure
    Analyzing keyword match patterns
    Determining search results based on node types and match types
  Efficient implementation of the search semantics
  Experimental verification of effectiveness and efficiency
Roadmap
  Motivation
  Inferring search semantics
    Analyzing keyword match patterns
    Analyzing XML data structure
    Identifying search results
  XSeek architecture
  Experiments
  Conclusions
Analyzing Keyword Match Patterns
  Identifying search predicates and return nodes in keywords
  Examples of keyword searches
    Q1: Mutombo, position
    Q2: Mutombo, center
    Q3: Rockets
  Examples of structured queries
    SQL: select position from Player where name = "Mutombo"
    XQuery: for $x in doc("DB.xml")//player where $x/name = "Mutombo" return $x/position
  [Annotations on the slide: Return Nodes and Search Predicates, marked on the example queries]
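A rough sketch of how keywords might be split into the two roles is given below. The rule used is a simplifying assumption made for illustration, not a rule stated on the slide: a keyword that equals an element name is treated as a return node, and any other keyword that equals a text value is treated as a search predicate. The `SAMPLE` document is an abridged version of the sample tree, and `analyze_keywords` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Abridged version of the sample tree (division and stadium omitted).
SAMPLE = """
<league>
  <team>
    <name>Rockets</name>
    <founded>1967</founded>
    <players>
      <player><name>Mutombo</name><position>center</position><nationality>Congo</nationality></player>
      <player><name>Wells</name><position>guard</position><nationality>U.S.</nationality></player>
    </players>
  </team>
</league>
"""


def analyze_keywords(root, keywords):
    """Split keywords into (return_nodes, search_predicates).

    Assumption: element-name matches are return nodes; text-value
    matches are search predicates. Unmatched keywords are ignored.
    """
    names = {el.tag.lower() for el in root.iter()}
    values = {(el.text or "").strip().lower() for el in root.iter()}
    values.discard("")
    return_nodes, predicates = [], []
    for kw in keywords:
        k = kw.lower()
        if k in names:
            return_nodes.append(kw)
        elif k in values:
            predicates.append(kw)
    return return_nodes, predicates


root = ET.fromstring(SAMPLE)
print(analyze_keywords(root, ["Mutombo", "position"]))  # (['position'], ['Mutombo'])
print(analyze_keywords(root, ["Mutombo", "center"]))    # ([], ['Mutombo', 'center'])
print(analyze_keywords(root, ["Rockets"]))              # ([], ['Rockets'])
```

Under this assumption, Q1 has one return node (position) and one search predicate (Mutombo), while Q2 and Q3 contain search predicates only.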
Analyzing XML Data Structure
  Three types of data nodes
    Entity nodes
    Attribute nodes
    Connection nodes
  Related work on identifying node types [Xu et al 06]
  [Figure: the sample league/team/player XML tree]
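The slide names the three node types without stating how they are told apart. The sketch below uses one plausible data-driven heuristic, assumed here purely for illustration: a node is an entity if it can repeat under its parent, an attribute if it is not an entity and holds only a text value, and a connection node otherwise. The `classify` helper and the second team in `SAMPLE` (named "TeamB") are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Abridged sample tree; the second team ("TeamB") is a made-up placeholder
# so that team elements visibly repeat under league.
SAMPLE = """
<league>
  <team>
    <name>Rockets</name>
    <players>
      <player><name>Mutombo</name><position>center</position></player>
      <player><name>Wells</name><position>guard</position></player>
    </players>
  </team>
  <team><name>TeamB</name></team>
</league>
"""


def classify(root):
    """Map (parent_tag, tag) to 'entity', 'attribute', or 'connection'."""
    # A (parent, child) tag pair is repeatable if some parent element
    # has at least two children with that tag.
    repeatable = set()
    for parent in root.iter():
        counts = Counter(child.tag for child in parent)
        repeatable.update((parent.tag, tag) for tag, n in counts.items() if n > 1)

    kinds = {}
    for parent in root.iter():
        for child in parent:
            key = (parent.tag, child.tag)
            if key in repeatable:
                kinds[key] = "entity"
            elif len(child) == 0 and (child.text or "").strip():
                kinds.setdefault(key, "attribute")
            else:
                kinds.setdefault(key, "connection")
    return kinds


kinds = classify(ET.fromstring(SAMPLE))
for key in [("league", "team"), ("team", "players"), ("players", "player"), ("player", "name")]:
    print(key, kinds[key])
```

The root element has no parent and is left unclassified in this sketch. With a schema available, repeatability could be read from the schema instead of being inferred from the data.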
Identifying Search Results
  Search results consist of
    Matches to search predicates
      This allows users to verify the relevance of search results
    Matches to return nodes
      This is what the user is searching for
    Nodes that connect these matches
  Matches are output according to node types
    Attribute node: display name, value
    Entity node: display name, attributes, optionally entity and connection descendants
    Connection node: display name, optionally entity and connection descendants
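The per-type output rules above can be sketched as a small recursive renderer. This is an illustration under assumptions, not XSeek's result generator: the node types are hard-coded in a hypothetical `KINDS` table, and the `expand` flag stands in for the "optionally" in the rules.

```python
import xml.etree.ElementTree as ET

# Assumed node types for the sample tree; anything else counts as an attribute.
KINDS = {"team": "entity", "player": "entity", "players": "connection"}


def kind_of(node):
    return KINDS.get(node.tag, "attribute")


def render(node, kind_of, expand=False):
    """Return output lines for `node` following the per-type display rules."""
    kind = kind_of(node)
    if kind == "attribute":
        # Attribute node: name and value.
        return [f"{node.tag}: {(node.text or '').strip()}"]
    lines = [node.tag]  # Entity and connection nodes: name first.
    for child in node:
        if kind_of(child) == "attribute":
            if kind == "entity":  # Entities also show their attributes.
                lines += ["  " + s for s in render(child, kind_of)]
        elif expand:  # Entity and connection descendants are optional.
            lines += ["  " + s for s in render(child, kind_of, expand)]
    return lines


TEAM = ET.fromstring(
    "<team><name>Rockets</name><players>"
    "<player><name>Mutombo</name><position>center</position></player>"
    "</players></team>"
)
print("\n".join(render(TEAM, kind_of)))               # team and its attributes only
print("\n".join(render(TEAM, kind_of, expand=True)))  # also descends into players/player
```

Keeping the optional descendants behind a flag lets a caller show a compact entity summary first and expand it on demand.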
A Search Result Example
  Q1: Mutombo, position
  [Figure: the sample league/team/player XML tree]
What if Return Nodes Are Absent?
  Explicit return nodes: nodes that are explicitly identified in input keywords
  Inferring implicit return nodes if there are no explicit return nodes in input keywords
    Users may be interested in general information about entities that are relevant to the search
    Master entity: the lowest ancestor-or-self entity of the LCA node, or the XML tree root if there is no such entity
    Relevant entity: the entities on a path from a master entity to a relevant keyword match, inclusive
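The two definitions above can be sketched directly. This is a minimal illustration, assuming the tree is given as a child-to-parent map and that the set of entity nodes is already known; the node identifiers, `PARENT`, `ENTITIES`, and both helper functions are hypothetical.

```python
# Hypothetical node ids for part of the sample tree, as a child -> parent map.
PARENT = {
    "team": "league",
    "team/name": "team",
    "players": "team",
    "player": "players",
    "player/name": "player",
    "player/position": "player",
}
ENTITIES = {"team", "player"}  # assumed result of node-type analysis


def master_entity(lca_node, parent, entities, root):
    """Lowest ancestor-or-self entity of the LCA node, else the tree root."""
    node = lca_node
    while node is not None:
        if node in entities:
            return node
        node = parent.get(node)
    return root


def relevant_entities(master, match, parent, entities):
    """Entities on the path from `master` down to `match`, inclusive.

    Assumes `master` is an ancestor-or-self of `match`.
    """
    path = []
    node = match
    while node is not None:
        path.append(node)
        if node == master:
            break
        node = parent.get(node)
    return [n for n in reversed(path) if n in entities]


# Q2 (Mutombo, center): the LCA of the two matches is the player node.
print(master_entity("player", PARENT, ENTITIES, "league"))     # player
# Q3 (Rockets): a single match, so the LCA is the team's name node.
print(master_entity("team/name", PARENT, ENTITIES, "league"))  # team
# Entities between a team master entity and a match on a player's name.
print(relevant_entities("team", "player/name", PARENT, ENTITIES))  # ['team', 'player']
```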
Search with Implicit Return Nodes (I)
  Q2: Mutombo, center
  [Figure: the sample league/team/player XML tree]
Search with Implicit Return Nodes (II)
  Q3: Rockets
  [Figure: the sample league/team/player XML tree]
Roadmap
  Motivation
  Inferring search semantics
    Analyzing keyword match patterns
    Analyzing XML data structure
    Identifying search results
  XSeek architecture
  Experiments
  Conclusions
Architecture of XSeek
  [Figure: architecture diagram]
  Components: Data Analyzer, Index Builder, Keyword Matcher, Match Grouper, Keyword Analyzer, Return Node Recognizer, Result Generator
  Data shown in the diagram: XML, keywords, indexes, search result
  Intermediate results shown in the diagram
    Entities, attributes, connection nodes
    Search predicates, return nodes
    Explicit return nodes, implicit return nodes
Experimental Setup
  Compare the performance of
    XSeek
    Subtree Return
    Path Return
  Measurements
    Search quality
    Speed
    Scalability
  Data sets: Mondial, WSU, XMark benchmark
  Query sets: eight queries for each data set
Search Quality: Precision
  Precision: measures the soundness of search results
  XSeek in general has a precision as good as Path Return
  [Figure: precision results; visible query labels: "open auction, person257" and "seller, person179, buyer, price, date"]
Search Quality: Recall
  Recall: measures the completeness of search results
  XSeek in general has a recall as good as Subtree Return
Search Quality: F-Measure
  F-Measure is a weighted harmonic mean of precision and recall
  XSeek has the best F-Measure
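The three quality measures can be computed as sketched below, using the standard definitions over sets of returned and relevant nodes. The slides do not say which weight was used for the F-Measure, so the `beta` parameter (1.0 gives the balanced F1) and the example node ids are assumptions for illustration.

```python
def precision_recall_f(returned, relevant, beta=1.0):
    """Return (precision, recall, F-measure) for two collections of node ids.

    F is the weighted harmonic mean:
        F = (1 + beta^2) * P * R / (beta^2 * P + R)
    """
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical result: four nodes returned, two of them relevant,
# and no relevant node missed.
p, r, f = precision_recall_f({"n1", "n2", "n3", "n4"}, {"n1", "n2"})
print(p, r, round(f, 3))  # 0.5 1.0 0.667
```

A result that returns a whole subtree tends to score high on recall and low on precision, as in this example, while a result that returns only the connecting path tends toward the opposite trade-off; the F-Measure summarizes both in one number.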
Speed: Benchmark Data
  [Figure: speed results on benchmark data; visible query labels: "seller, person179, buyer, price, date" and "person257, person133"]
Conclusions
  The first work that automatically infers meaningful return information for XML keyword search
    No elicitation from users or system administrators and no schema information is required
  Analyzing keyword match patterns
    Search predicates
    Return nodes
  Analyzing XML node types
    Entities
    Attributes
    Connection nodes
  Identifying two types of return information
    Explicit return nodes
    Implicit return nodes
  Outputting an XML node based on its match type and node type
  Experiments verify XSeek's effectiveness and efficiency
Thank You! Questions?
  You are welcome to visit the XSeek demo at VLDB 07