1
Query Biased Snippet Generation in XML Search
Yu Huang, Ziyang Liu, Yi Chen
Arizona State University
2
SIGMOD 2008
Snippets in Text Search
Snippets are widely used in text search engines to help users quickly identify relevant query results.
3
Fragment of an XML Search Result
Find the apparel retailers in Texas. Keyword search: Texas, apparel, retailer
[Figure: XML tree of the query result. A retailer node (name: Brook Brothers, product: apparel) has store nodes (names Galleria, West Village; state Texas; cities Houston, Austin), each with merchandises containing clothes nodes described by category, fitting and situation. The tree continues beyond the fragment shown.]
There can be many large search results. Good snippets can help users quickly and easily judge their relevance.
4
A Sample Snippet
[Figure: snippet tree with a retailer (name: Brook Brothers, product: apparel), a store (state: Texas, city: Houston) and clothes nodes carrying fitting: men, situation: casual and category: outwear]
From the snippet, we know:
  The corresponding query result contains matches to all keywords.
  The retailer is “Brook Brothers”.
  This retailer has many stores in Houston.
  Which clothes this retailer features.
The snippet helps us differentiate this query result from other apparel retailers (e.g. Carter’s).
How can we generate good snippets for XML search? There is no existing work on XML snippet generation yet.
5
Challenges and Our Contributions
What are desirable properties of a good snippet?
  Identified three properties: self-contained, distinguishable, representative.
What information in the query result is significant in order to achieve the properties?
  Designed an algorithm to generate a ranked list of significant information (IList).
How to generate a snippet that maximally covers the significant information within a size bound?
  Proved the NP-hardness of this problem.
  Designed an efficient and effective algorithm for snippet generation.
eXtract: the first system for snippet generation in XML search
6
Roadmap
Identifying desirable properties of a good snippet
  Self-contained
  Distinguishable
  Representative
Constructing an information list (IList)
  IList is a ranked list of the significant information in the query result that is needed to achieve the properties.
Building snippets based on IList within a snippet size bound
Experimental evaluation
Conclusions
7
Self-contained Snippet
Snippets should be self-contained in order to be understandable.
Text search: snippets usually preserve self-contained semantic units, i.e. the phrases or sentences surrounding keyword matches.
XML search: semantic units should be preserved.
Challenge: What is a semantic unit?
8
Query Result Fragment (revisited)
[Figure: the query result tree, with nodes marked as entities or attributes]
Data contain entities and attributes.
A self-contained snippet should contain the names of the entities whose attributes appear in the snippet.
Adding keywords and their corresponding entity names to IList.
IList: Texas, apparel, retailer, store
9
Distinguishable Snippet
Snippets should be distinguishable, so that users can easily differentiate query results.
Text search: the title of the document is included.
XML search: the “key” of the result should be included.
Challenge: What is the key of an XML search result?
10
Query Result Fragment
[Figure: the query result tree, with labels "return entity" and "support entity"]
We identify two types of entities in a query result: return entities and support entities.
Inferring return entities: an entity whose name or attribute name matches a keyword; otherwise the highest entity.
Key of a query result: the keys of its return entities. We can mine the keys of entities.
Adding the key of the query result to IList.
IList: Texas, apparel, retailer, store, Brook Brothers
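The return-entity rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Entity` class, the `depth` field, and the case-insensitive exact matching are hypothetical choices, not details given in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str                      # entity (element) name, e.g. "store"
    depth: int                     # 0 for the root of the query result (assumed encoding)
    attributes: list = field(default_factory=list)  # attribute names of the entity


def infer_return_entities(entities, keywords):
    """Return entities whose name or attribute name matches a keyword;
    if none match, fall back to the highest (shallowest) entity."""
    kw = {k.lower() for k in keywords}
    matched = [
        e for e in entities
        if e.name.lower() in kw or any(a.lower() in kw for a in e.attributes)
    ]
    if matched:
        return matched
    top = min(e.depth for e in entities)
    return [e for e in entities if e.depth == top]


entities = [
    Entity("retailer", 0, ["name", "product"]),
    Entity("store", 1, ["name", "state", "city"]),
    Entity("clothes", 3, ["category", "fitting", "situation"]),
]
by_name = infer_return_entities(entities, ["Texas", "apparel", "retailer"])
fallback = infer_return_entities(entities, ["Texas", "Houston"])
```

With the running query, "retailer" matches an entity name, so retailer is the return entity. With keywords that match only values, the sketch falls back to the highest entity.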
11
Representative Snippet
Snippets should provide summaries of the query results, so that users can quickly grasp the essence of the results.
Text search: an active research area; sometimes the first and/or last sentence of a paragraph is used as a summary.
XML search: include the “dominant features” of query results.
Challenges: What are features? What are dominant features?
12
Features of Query Result
We define a feature as (entity, attribute, value). The pair (entity, attribute) is the feature type.
[Figure: the query result tree]
Some feature statistics (entity: attribute: value: # of occurrences)
  store: city: Houston: 6, Austin: 1, other values (3): 3
  clothes: fitting: men: 600, women: 360, children: 40
  clothes: situation: casual: 700, formal: 300
  clothes: category: outwear: 220, suit: 120, skirt: 80, sweaters: 70, other values (7): 510
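As a rough illustration of how (entity, attribute, value) features could be counted, the sketch below walks a small XML document with Python's standard `xml.etree.ElementTree`. The set of entity tags is passed in by hand, and treating every text-only child of an entity as an attribute is an assumption of this sketch; the talk does not say how entities and attributes are detected.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_features(root, entity_tags):
    """Count (entity, attribute, value) triples under `root`.

    Assumption: an attribute is a child element of an entity that has
    no child elements of its own and carries non-empty text.
    """
    counts = Counter()
    for elem in root.iter():
        if elem.tag not in entity_tags:
            continue
        for child in elem:
            text = (child.text or "").strip()
            if len(child) == 0 and text:
                counts[(elem.tag, child.tag, text)] += 1
    return counts


doc = """
<retailer>
  <name>Brook Brothers</name><product>apparel</product>
  <store>
    <state>Texas</state><city>Houston</city>
    <merchandises>
      <clothes><category>suit</category><fitting>men</fitting><situation>casual</situation></clothes>
      <clothes><fitting>men</fitting><situation>formal</situation></clothes>
    </merchandises>
  </store>
</retailer>
"""
features = count_features(ET.fromstring(doc), {"retailer", "store", "clothes"})
```

The toy document is made up for the example; only its element names follow the slides.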
13
Dominant Features of Query Result
[Figure: the query result tree]
Feature statistics (entity: attribute: value: # of occurrences)
  store: city: Houston: 6, Austin: 1, other values (3): 3
  clothes: fitting: men: 600, women: 360, children: 40
  clothes: situation: casual: 700, formal: 300
  clothes: category: outwear: 220, suit: 120, skirt: 80, sweaters: 70, other values (7): 510
A feature that occurs often is likely to be dominant. But this is not always reliable.
Dominance score: (# of occurrences of a feature) / (avg. # of occurrences of features of the same type)
Dominant features: features with dominance score ≥ 1
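A small worked example may make the dominance score concrete. The sketch below applies the slide's formula to the slide's own statistics. It assumes that the average for a feature type is the total number of occurrences divided by the number of distinct values, counting the "other values"; the function name and the input layout are hypothetical.

```python
def dominance_scores(stats):
    """stats maps a feature type (entity, attribute) to a triple
    (known value counts, number of other values, total count of other values)."""
    scores = {}
    for ftype, (known, n_other, other_total) in stats.items():
        n_values = len(known) + n_other              # distinct values of this type
        total = sum(known.values()) + other_total    # all occurrences of this type
        avg = total / n_values
        for value, count in known.items():
            scores[ftype + (value,)] = count / avg
    return scores


# Numbers copied from the feature statistics on the slide.
stats = {
    ("store", "city"): ({"Houston": 6, "Austin": 1}, 3, 3),
    ("clothes", "fitting"): ({"men": 600, "women": 360, "children": 40}, 0, 0),
    ("clothes", "situation"): ({"casual": 700, "formal": 300}, 0, 0),
    ("clothes", "category"): (
        {"outwear": 220, "suit": 120, "skirt": 80, "sweaters": 70}, 7, 510),
}
scores = dominance_scores(stats)
# Dominant features: score >= 1, listed by decreasing score.
dominant = sorted(
    (f for f, s in scores.items() if s >= 1),
    key=lambda f: scores[f],
    reverse=True,
)
```

Under this reading, Houston scores 6 / (10 / 5) = 3, and the four highest-scoring features come out as Houston, outwear, men, casual. Raw counts alone would instead rank casual (700) first, which shows why frequency by itself is not reliable.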
14
Representative Snippet
[Figure: the query result tree, with the feature statistics table from slide 12]
Adding dominant features to IList in the order of their dominance scores.
IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual
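Putting the three properties together, the IList can be viewed as a concatenation of four ranked groups. The following is a minimal sketch of that assembly. The function name and the duplicate-skipping rule are assumptions made for illustration, and the inputs are taken from the slides rather than computed.

```python
def build_ilist(keywords, entity_names, result_keys, dominant_features):
    """Concatenate the four groups in rank order, skipping repeated items."""
    ilist = []
    for item in [*keywords, *entity_names, *result_keys, *dominant_features]:
        if item not in ilist:
            ilist.append(item)
    return ilist


ilist = build_ilist(
    keywords=["Texas", "apparel", "retailer"],          # the query
    entity_names=["retailer", "store"],                 # self-contained
    result_keys=["Brook Brothers"],                     # distinguishable
    dominant_features=["Houston", "outwear", "men", "casual"],  # representative
)
```

For the running example this yields the same nine items, in the same order, as the IList shown on the slide.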
16
Roadmap
Identifying desirable properties of a good snippet
  Self-contained: related entity names
  Distinguishable: key of the query result (return entities)
  Representative: dominant features
Constructing an information list (IList)
  IList is a ranked list of the significant information in the query result that is needed to achieve the properties.
Building snippets based on IList within a snippet size bound
Experimental evaluation
Conclusions
18
Instance Selection Problem
IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual
Input: query result R, IList, a snippet size bound B
Output: snippet S
Instance Selection Problem: How do we select node instances in R to form a snippet S that covers as many IList items as possible, in ranked order, within the bound B?
19
Instance Selection Problem
[Figure: the full query result tree, with one choice of node instances marked "Good" and another marked "Bad"]
IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual
Input: query result R, IList, a snippet size bound B
Output: snippet S
Instance Selection Problem: How do we select node instances in R to form a snippet S that covers as many IList items as possible, in ranked order, within the bound B?
20
Instance Selection Problem
Challenges:
  The cost of covering an IList item is dynamic.
  The number of IList items that can be covered is unknown until the very end.
The Instance Selection Problem is NP-hard.
We designed an efficient and effective greedy algorithm to tackle this problem.
21
Instance Selection Algorithm
[Figure: the full query result tree]
IList:  Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual
Weight: 1, 1, 1, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64
Path-based instance selection
  Coverage: the entities on the path and their attributes
  Benefit: the total weight of the IList items covered
  Cost: the path length
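The weight row and the benefit/cost definitions can be reproduced with a short sketch. The slide does not state the rule behind the weights, so the rule used here (weight 1 for each query keyword, then halving at each later rank) is an inference from the numbers shown, and both function names are hypothetical.

```python
def ilist_weights(ilist, num_keywords):
    """Weight 1 for the first `num_keywords` items, then 1/2, 1/4, ...
    (inferred from the weight row on the slide, not stated there)."""
    weights = {}
    for rank, item in enumerate(ilist):
        exponent = max(0, rank - num_keywords + 1)
        weights[item] = 1 / 2 ** exponent
    return weights


def benefit_cost_ratio(covered_items, path_length, weights):
    """Benefit is the total weight of the covered IList items; cost is the path length."""
    benefit = sum(weights.get(item, 0.0) for item in covered_items)
    return benefit / path_length


ilist = ["Texas", "apparel", "retailer", "store", "Brook Brothers",
         "Houston", "outwear", "men", "casual"]
weights = ilist_weights(ilist, num_keywords=3)
# A hypothetical path retailer -> store -> state (2 edges) covering three items.
ratio = benefit_cost_ratio({"retailer", "store", "Texas"}, 2, weights)
```

Because every weight is a power of two, the floating-point values are exact. The example path has benefit 1 + 1/2 + 1 = 2.5 and cost 2, for a ratio of 1.25.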
22
Instance Selection Algorithm
[Figure: the full query result tree]
IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual
1. For the next uncovered item in IList, choose the path with the highest benefit/cost that covers it.
2. Update the benefits and costs of the other paths.
3. Go to step 1 until the size bound is reached or the whole IList is covered.
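The three steps can be sketched as a simplified greedy loop. This is not the eXtract implementation. It assumes that each candidate path is given as a set of edges plus the IList items it covers, that the snippet size is the number of edges, and that an item is skipped when no path covering it fits in the remaining budget. All names and the toy data are hypothetical.

```python
def greedy_select(ilist, weights, paths, bound):
    """paths: name -> (set of edges, set of covered items); bound: max edges."""
    snippet_edges, covered, chosen = set(), set(), []
    for item in ilist:                           # step 1: next item in ranked order
        if len(snippet_edges) >= bound:
            break                                # size bound reached
        if item in covered:
            continue
        best, best_ratio = None, -1.0
        for name, (edges, covers) in paths.items():
            if item not in covers:
                continue
            # Step 2 is implicit: cost and benefit are recomputed against the
            # current snippet, so edges already selected cost nothing.
            cost = len(edges - snippet_edges)
            if len(snippet_edges) + cost > bound:
                continue
            benefit = sum(weights[i] for i in covers - covered if i in weights)
            ratio = benefit / cost if cost else float("inf")
            if ratio > best_ratio:               # ties keep the first path seen
                best, best_ratio = name, ratio
        if best is None:
            continue                             # assumption: skip items that do not fit
        edges, covers = paths[best]
        snippet_edges |= edges
        covered |= covers & set(ilist)
        chosen.append(best)
    return chosen, covered


ilist = ["Texas", "apparel", "retailer", "store", "Brook Brothers", "Houston"]
weights = {"Texas": 1.0, "apparel": 1.0, "retailer": 1.0,
           "store": 0.5, "Brook Brothers": 0.25, "Houston": 0.125}
paths = {
    "retailer/name": ({("retailer", "name")}, {"retailer", "Brook Brothers"}),
    "retailer/product": ({("retailer", "product")}, {"retailer", "apparel"}),
    "retailer/store1/state": ({("retailer", "store1"), ("store1", "state")},
                              {"retailer", "store", "Texas"}),
    "retailer/store1/city": ({("retailer", "store1"), ("store1", "city")},
                             {"retailer", "store", "Houston"}),
    "retailer/store2/state": ({("retailer", "store2"), ("store2", "state")},
                              {"retailer", "store", "Texas"}),
    "retailer/store2/city": ({("retailer", "store2"), ("store2", "city")},
                             {"retailer", "store", "Austin"}),
}
chosen, covered = greedy_select(ilist, weights, paths, bound=5)
```

In this toy run the path to Houston initially costs two edges, but only one once store 1 is already in the snippet. That illustrates why the cost of covering an item is dynamic.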
24
Final Snippet
[Figure: the full query result tree with the node instances selected for the final snippet]
IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual
25
Roadmap
Identifying desirable properties of a good snippet
  Self-contained
  Distinguishable
  Representative
Constructing an information list (IList)
  IList is a ranked list of the significant information in the query result that is needed to achieve the properties.
Building snippets based on IList within a snippet size bound
Experimental evaluation
Conclusions
26
Experimental Setup
Comparing the performance of:
  Greedy algorithm for instance selection (eXtract)
  Optimal (but exponential) algorithm for instance selection
  Google Desktop
Measurements:
  Search quality
  Speed
  Scalability
Data sets: Films, Retailer
Query sets: eight queries for each data set
27
Search Quality: User Study
Ten users were asked to score the snippets generated by the three approaches on the same query results.
The two approaches designed specifically for XML snippet generation score much higher than Google Desktop.
The Greedy algorithm (eXtract) scores close to the Optimal algorithm.
28
Search Quality: Precision & Recall
[Figures: two charts, Precision and Recall]
The ground-truth snippets were obtained through another user study.
The snippets generated by the Greedy algorithm in eXtract achieve precision and recall close to those of the Optimal algorithm.
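The slide does not spell out how precision and recall are computed. Assuming the standard set-based definitions over the items in a generated snippet and in a ground-truth snippet, a minimal sketch looks like this (the function name and the example sets are made up):

```python
def precision_recall(snippet_items, ground_truth_items):
    """Standard set-based precision and recall (an assumption; the talk
    does not define the measures)."""
    snippet, truth = set(snippet_items), set(ground_truth_items)
    hits = len(snippet & truth)
    precision = hits / len(snippet) if snippet else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical snippet and ground truth, not data from the experiments.
generated = {"Texas", "Houston", "men", "casual"}
ground_truth = {"Texas", "Houston", "outwear", "men", "Brook Brothers"}
precision, recall = precision_recall(generated, ground_truth)
```

Here three of the four generated items appear in the ground truth (precision 0.75), and three of the five ground-truth items are recovered (recall 0.6).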
29
Speed
[Figures: two charts, Film Data Set and Retailer Data Set]
The Greedy algorithm is much faster than the Optimal algorithm.
30
Scalability
[Figures: two charts, Scalability on Snippet Size (number of edges) and Scalability on Query Result Size (KB)]
The Greedy algorithm scales much better than the Optimal algorithm.
31
Conclusions
The first work that generates result snippets for keyword search on XML data
Identified the desirable properties of snippets
  Self-contained
  Distinguishable
  Representative
Designed an algorithm to generate IList, a ranked list of significant items to be included in snippets
Proved that the instance selection problem is NP-hard
Designed an efficient algorithm to cover IList when building a snippet within a size bound
Experiments verified the effectiveness and efficiency of the approach
32
Thank You! Questions?
You are welcome to visit the eXtract demo at VLDB 2008.
http://eXtract.asu.edu/
33
Architecture of eXtract
[Figure: architecture diagram. Components: Index Builder, XML Index, Data Analyzer, Return Entity Identifier, Query Result Key Identifier, Dominant Feature Identifier, Instance Selector. Labelled data flows: Query & Result; IList, Query Result; Result Snippet.]
34
Snippets Comparison
[Figure: the eXtract snippet (retailer Brook Brothers, product apparel; store in Houston, Texas; clothes with fitting men, situation casual, category outwear) shown side by side with the Google Desktop snippet]