Query Biased Snippet Generation in XML Search
Yi Chen
Yu Huang, Ziyang Liu, Yi Chen
Arizona State University

SIGMOD Snippets in Text Search
- Snippets are widely used in text search engines to help users quickly identify relevant query results.

Fragment of an XML Search Result
- Query: find the apparel retailers in Texas
- Keyword search: Texas, apparel, retailer
- [Figure: XML tree of the query result. A retailer (name: Brook Brothers, product: apparel) has stores (names shown: Galleria, West Village; state: Texas; cities shown: Houston, Austin). Each store has merchandises containing clothes, described by fitting, situation, and category.]
- There can be many search results, and each can be large.
- Good snippets help users judge relevance quickly and easily.

A Sample Snippet
- [Figure: snippet tree containing retailer (name: Brook Brothers, product: apparel), store (state: Texas, city: Houston), merchandises, and clothes (fitting: men; situation: casual; category: outwear).]
- From the snippet, we know that:
  - The corresponding query result contains matches to all keywords.
  - The retailer is "Brook Brothers", which differentiates this query result from other apparel retailers (e.g. Carter's).
  - This retailer has many stores in Houston.
  - The retailer features casual men's clothes and outwear.
- How can we generate good snippets for XML search? There is no existing work on XML snippet generation.

Challenges and Our Contributions
- What are the desirable properties of a good snippet?
  - Identified three properties: self-contained, distinguishable, representative.
- What information in the query result is significant for achieving these properties?
  - Designed an algorithm that generates a ranked list of significant information (IList).
- How can a snippet maximally cover the significant information within a size bound?
  - Proved that this problem is NP-hard.
  - Designed an efficient and effective algorithm for snippet generation.
- eXtract: the first system for snippet generation in XML search.

Roadmap
- Identifying desirable properties of a good snippet
  - Self-contained
  - Distinguishable
  - Representative
- Constructing an information list (IList)
  - IList is a ranked list of the significant information in the query result, chosen to achieve the properties above.
- Building snippets based on IList within a snippet size bound
- Experimental evaluation
- Conclusions

Self-contained Snippet
- Snippets should be self-contained in order to be understandable.
- Text search: snippets usually preserve self-contained semantic units, i.e. the phrases or sentences surrounding keyword matches.
- XML search: semantic units should be preserved as well.
- Challenge: what is a semantic unit?

Query Result Fragment (revisited)
- [Figure: fragment of the query result tree.]
- Data contain entities and attributes.
- A self-contained snippet should contain the names of the entities whose attributes appear in the snippet.
- Adding keywords and their corresponding entity names to IList.
- IList: Texas, apparel, retailer, store

Distinguishable Snippet
- Snippets should be distinguishable, so that users can easily differentiate query results.
- Text search: the title of the document is included.
- XML search: the "key" of the result should be included.
- Challenge: what is the key of an XML search result?

Query Result Fragment
- [Figure: fragment of the query result tree, with nodes labelled "return entity" and "support entity".]
- We can mine the keys of entities.
- We identify two types of entities in a query result: return entities and support entities.
- Inferring return entities: an entity whose name or attribute name matches a keyword is a return entity; if there is none, the highest entity is used.
- Key of a query result: the keys of its return entities.
- Adding the key of the query result to IList.
- IList: Texas, apparel, retailer, store, Brook Brothers

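The return-entity rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Entity` class, the `depth` field, and the `infer_return_entities` function are hypothetical names that do not come from the talk, and the depths assigned to the sample entities are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str                      # entity name, e.g. "store"
    depth: int                     # assumed: distance from the root of the query result
    attributes: dict = field(default_factory=dict)  # attribute name -> value


def infer_return_entities(entities, keywords):
    """Apply the slide's rule to a non-empty list of entities."""
    wanted = {k.lower() for k in keywords}
    matched = [
        e for e in entities
        if e.name.lower() in wanted
        or any(a.lower() in wanted for a in e.attributes)
    ]
    if matched:
        return matched
    # Fallback: no entity or attribute name matched, so take the highest entity.
    top = min(e.depth for e in entities)
    return [e for e in entities if e.depth == top]


retailer = Entity("retailer", 0, {"name": "Brook Brothers", "product": "apparel"})
store = Entity("store", 1, {"state": "Texas", "city": "Houston"})
clothes = Entity("clothes", 3, {"fitting": "men", "situation": "casual"})
entities = [retailer, store, clothes]

by_name = infer_return_entities(entities, ["Texas", "apparel", "retailer"])
by_fallback = infer_return_entities(entities, ["Texas", "Houston"])
```

In the first call the keyword "retailer" matches an entity name, so the retailer is the return entity. In the second call the keywords only match attribute values, so the rule falls back to the highest entity, which is again the retailer.
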
Representative Snippet
- Snippets should summarize the query results, so that users can quickly grasp the essence of the results.
- Text search: an active research area; sometimes the first and/or last sentence of a paragraph is used as a summary.
- XML search: include the "dominant features" of query results.
- Challenges: what are features? What are dominant features?

Features of Query Result
- We define a feature as (entity, attribute, value).
- A feature type is an (entity, attribute) pair.
- [Figure: fragment of the query result tree.]
- Some feature statistics:

  entity   attribute  value: # of occurrences
  store    city       Houston: 6, Austin: 1, other values (3): 3
  clothes  fitting    Men: 600, Women: 360, Children: 40
  clothes  situation  Casual: 700, Formal: 300
  clothes  category   Outwear: 220, Suit: 120, Skirt: 80, Sweaters: 70, other values (7): 510

Dominant Features of Query Result
- [Table: feature statistics, as on the "Features of Query Result" slide.]
- A feature that occurs often is likely to be dominant.
- But the raw count is not always reliable, because counts are not comparable across feature types. For example, Houston occurs only 6 times yet covers most stores, while Children occurs 40 times yet is rare among clothes.
- Dominance score = (# of occurrences of the feature) / (average # of occurrences of features of the same type)
- Dominant features: features with dominance score ≥ 1

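As a worked example, the dominance score can be computed for the feature statistics shown on the slides. This is a sketch, not the eXtract implementation: the `FeatureType` tuple and `dominance_scores` function are hypothetical names, and it assumes that "other values (n): m" means n further distinct values with m occurrences in total, so scores are computed only for the values that are named individually.

```python
from collections import namedtuple

# named: value -> count; other_values / other_total: the "other values (n): m" row
FeatureType = namedtuple("FeatureType", "named other_values other_total")

stats = {
    ("store", "city"): FeatureType({"Houston": 6, "Austin": 1}, 3, 3),
    ("clothes", "fitting"): FeatureType({"men": 600, "women": 360, "children": 40}, 0, 0),
    ("clothes", "situation"): FeatureType({"casual": 700, "formal": 300}, 0, 0),
    ("clothes", "category"): FeatureType(
        {"outwear": 220, "suit": 120, "skirt": 80, "sweaters": 70}, 7, 510
    ),
}


def dominance_scores(stats):
    """Score = occurrences of a feature / average occurrences within its type."""
    scores = {}
    for (entity, attribute), ft in stats.items():
        total = sum(ft.named.values()) + ft.other_total
        distinct = len(ft.named) + ft.other_values
        average = total / distinct
        for value, count in ft.named.items():
            scores[(entity, attribute, value)] = count / average
    return scores


scores = dominance_scores(stats)
# Dominant features (score >= 1), highest score first.
dominant = sorted(
    (f for f, s in scores.items() if s >= 1),
    key=lambda f: scores[f],
    reverse=True,
)
```

With these numbers the average for (store, city) is 10 / 5 = 2, so Houston scores 6 / 2 = 3, the highest of all named features. The next highest are outwear, men, and casual, in that order. Suit and women also score at least 1 under this reading of the table.
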
Representative Snippet
- [Table: feature statistics, as on the "Features of Query Result" slide.]
- [Figure: fragment of the query result tree.]
- Adding dominant features to IList in the order of their dominance scores.
- IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual

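Putting the three steps together, the IList for the running example can be assembled by concatenating the ranked groups and dropping repeats. This is a minimal sketch: `build_ilist` and its parameter names are hypothetical, and the case-insensitive de-duplication is an assumption made to reproduce the list on the slide.

```python
def build_ilist(keywords, entity_names, result_keys, dominant_features):
    """Concatenate the groups in rank order, keeping the first copy of each item."""
    ilist, seen = [], set()
    for item in [*keywords, *entity_names, *result_keys, *dominant_features]:
        folded = item.lower()
        if folded not in seen:
            seen.add(folded)
            ilist.append(item)
    return ilist


ilist = build_ilist(
    keywords=["Texas", "apparel", "retailer"],
    entity_names=["retailer", "store"],        # entities that the keywords belong to
    result_keys=["Brook Brothers"],            # key of the return entity
    dominant_features=["Houston", "outwear", "men", "casual"],  # by dominance score
)
```

Here "retailer" is both a keyword and an entity name, so it appears only once, at its earlier rank.
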
Roadmap
- Identifying desirable properties of a good snippet
  - Self-contained: related entity names
  - Distinguishable: key of query result (return entities)
  - Representative: dominant features
- Constructing an information list (IList)
  - IList is a ranked list of the significant information in the query result, chosen to achieve the properties above.
- Building snippets based on IList within a snippet size bound
- Experimental evaluation
- Conclusions

Instance Selection Problem
- IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual
- Input: query result R, IList, a snippet size bound B
- Output: snippet S
- Instance selection problem: how do we select node instances in R to form a snippet S that covers as many IList items as possible, in ranked order, within the bound B?

Instance Selection Problem
- [Figure: the query result tree, with good and bad choices of node instances marked.]
- IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual

Instance Selection Problem
- Challenges:
  - The cost of covering an IList item is dynamic.
  - The number of IList items that can be covered is unknown until the very end.
- The instance selection problem is NP-hard.
- We designed an efficient and effective greedy algorithm to tackle this problem.

Instance Selection Algorithm
- [Figure: the query result tree.]
- IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual
- Weight (in IList order): 1/2, 1/4, 1/8, 1/16, 1/32, 1/64
- Path-based instance selection:
  - Coverage: the entities on the path and their attributes
  - Benefit: the total weight of the IList items covered
  - Cost: the path length

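The benefit and cost of a path can be illustrated with a small calculation. This is a sketch under assumptions: the function names are hypothetical, the weights are assumed to keep halving beyond the six values shown on the slide, and the path length of 3 in the example is made up.

```python
from fractions import Fraction


def item_weights(ilist):
    """Weight 1/2 for the first IList item, halved for each later rank."""
    return {item: Fraction(1, 2 ** (rank + 1)) for rank, item in enumerate(ilist)}


def benefit(covered_items, weights):
    """Total weight of the IList items that a path covers."""
    return sum((weights[i] for i in covered_items if i in weights), Fraction(0))


def benefit_cost_ratio(covered_items, path_length, weights):
    """Benefit per edge; path_length is the number of edges on the path."""
    return benefit(covered_items, weights) / path_length


ilist = ["Texas", "apparel", "retailer", "store", "Brook Brothers",
         "Houston", "outwear", "men", "casual"]
weights = item_weights(ilist)

# A hypothetical path through one Houston store, assumed to be 3 edges long.
store_path_items = {"retailer", "store", "Texas", "Houston"}
store_path_benefit = benefit(store_path_items, weights)
store_path_ratio = benefit_cost_ratio(store_path_items, 3, weights)
```

The path covers Texas (1/2), retailer (1/8), store (1/16), and Houston (1/64), so its benefit is 45/64 and its benefit/cost ratio is 15/64. Because each weight is larger than the sum of all later weights, covering a higher-ranked item always outweighs covering any set of lower-ranked ones.
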
Instance Selection Algorithm
- [Figure: the query result tree.]
- IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual
- Steps:
  1. For the next uncovered item in IList, choose the path with the highest benefit/cost that covers it.
  2. Update the benefits and costs of the other paths.
  3. Go to step 1 until the size bound is reached or the whole IList is covered.

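The three steps can be sketched as a short greedy loop. This is a simplified illustration, not the eXtract implementation. It assumes each candidate path is given as a set of edges plus the set of IList items it covers, that the snippet size is the number of distinct edges, that an edge already in the snippet costs nothing when reused, and that an item which cannot fit within the bound is skipped. The function name, the path names, and the edge labels are all hypothetical.

```python
from fractions import Fraction


def greedy_select(ilist, paths, bound):
    """paths maps a path name to (set of edges, set of IList items it covers)."""
    weights = {item: Fraction(1, 2 ** (rank + 1)) for rank, item in enumerate(ilist)}
    snippet_edges, covered, chosen = set(), set(), []
    for item in ilist:                          # step 1: next uncovered item, by rank
        if item in covered:
            continue
        best, best_ratio = None, None
        for name, (edges, items) in paths.items():
            if item not in items:
                continue
            if len(snippet_edges | edges) > bound:
                continue                        # path would exceed the size bound
            new_edges = edges - snippet_edges   # step 2: cost counts only new edges
            gain = sum((weights[i] for i in items - covered if i in weights), Fraction(0))
            ratio = gain / max(len(new_edges), 1)   # guard against a zero-cost path
            if best is None or ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            continue                            # assumption: skip items that do not fit
        edges, items = paths[best]
        snippet_edges |= edges
        covered |= items & set(ilist)
        chosen.append(best)
    return chosen, [i for i in ilist if i in covered]


ilist = ["Texas", "apparel", "retailer", "store", "Brook Brothers",
         "Houston", "outwear", "men", "casual"]
paths = {
    "store1": ({"retailer/store1", "store1/state", "store1/city"},
               {"retailer", "store", "Texas", "Houston"}),
    "store2": ({"retailer/store2", "store2/state", "store2/city"},
               {"retailer", "store", "Texas"}),
    "product": ({"retailer/product"}, {"retailer", "apparel"}),
    "name": ({"retailer/name"}, {"retailer", "Brook Brothers"}),
    "clothes": ({"retailer/store1", "store1/merchandises", "merchandises/clothes",
                 "clothes/category", "clothes/fitting", "clothes/situation"},
                {"retailer", "store", "outwear", "men", "casual"}),
}

small_chosen, small_covered = greedy_select(ilist, paths, bound=6)
large_chosen, large_covered = greedy_select(ilist, paths, bound=10)
```

With a bound of 6 edges the loop picks the Houston store over the Austin store for "Texas", because the Houston path also covers a dominant feature at the same cost. It then adds the product and name paths, and cannot afford the clothes path. With a bound of 10 edges the clothes path fits and the whole IList is covered.
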
Final Snippet
- [Figure: the final snippet, shown within the query result tree.]
- IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual

Experimental Setup
- Approaches compared:
  - Greedy algorithm for instance selection (eXtract)
  - Optimal (but exponential) algorithm for instance selection
  - Google Desktop
- Measurements: search quality, speed, scalability
- Data sets: Films, Retailer
- Query sets: eight queries for each data set

Search Quality: User Study
- Ten users were asked to score the snippets generated by the three approaches for the same query results.
- The two approaches designed specifically for XML snippet generation score much higher than Google Desktop.
- The scores of the Greedy algorithm (eXtract) are close to those of the Optimal algorithm.

Search Quality: Precision & Recall
- The ground truth for the snippets was obtained through another user study.
- The snippets generated by the Greedy algorithm in eXtract have precision and recall close to those of the Optimal algorithm.
- [Figures: precision; recall.]

Speed
- [Figures: Film data set; Retailer data set.]
- The Greedy algorithm is much faster than the Optimal algorithm.

Scalability
- [Figures: scalability on snippet size (number of edges); scalability on query result size (KB).]
- The Greedy algorithm scales much better than the Optimal algorithm.

Conclusions
- The first work that generates result snippets for keyword search on XML data.
- Identified the desirable properties of snippets: self-contained, distinguishable, representative.
- Designed an algorithm that generates IList, a ranked list of significant items to be included in snippets.
- Proved that the instance selection problem is NP-hard.
- Designed an efficient algorithm that covers IList when building a snippet within a size bound.
- Experiments verified the effectiveness and efficiency of the approach.

Thank You! Questions?
- You are welcome to visit the eXtract demo at VLDB.

Architecture of eXtract
- [Figure: architecture diagram. Component labels: Data Analyzer, Index Builder, XML Index, Return Entity Identifier, Query Result Key Identifier, Dominant Feature Identifier, Instance Selector. Data labels: Query & Result, Query Result, IList, Result Snippet.]

SIGMOD 2008 Snippets Comparison
- [Figure: side-by-side snippets from eXtract and Google Desktop. The eXtract snippet contains retailer (name: Brook Brothers, product: apparel), store (state: Texas, city: Houston), merchandises, and clothes (fitting: men; situation: casual; category: outwear).]