Query Biased Snippet Generation in XML Search Yi Chen Yu Huang, Ziyang Liu, Yi Chen Arizona State University.

Presentation transcript:

Query Biased Snippet Generation in XML Search Yi Chen Yu Huang, Ziyang Liu, Yi Chen Arizona State University

SIGMOD Snippets in Text Search
Snippets are widely used in text search engines to help users quickly identify relevant query results.

Fragment of an XML Search Result
Information need: find the apparel retailers in Texas.
Keyword search: Texas, apparel, retailer
[Figure: XML tree of one query result. A retailer (name: Brook Brothers, product: apparel) has several stores; store 1 is in Houston, Texas and store 2 is in Austin, Texas, and the store names shown include Galleria and West Village. Each store has a merchandises element containing clothes entries with fitting, situation, and category attributes.]
There can be many large search results. Good snippets help users judge relevance quickly and easily.

A Sample Snippet
[Figure: snippet tree. A retailer (name: Brook Brothers, product: apparel) with a store (state: Texas, city: Houston) whose merchandises show clothes with fitting: men, situation: casual, and category: outwear.]
From the snippet, we know:
- The corresponding query result contains matches to all keywords.
- The retailer is "Brook Brothers".
- This retailer has many stores in Houston.
- Which clothes this retailer features.
The snippet helps us differentiate this query result from those of other apparel retailers (e.g. Carter's).
How can we generate good snippets for XML search? There is no existing work on XML snippet generation yet.

Challenges and Our Contributions
- What are the desirable properties of a good snippet? We identified three properties: self-contained, distinguishable, and representative.
- What information in the query result is significant for achieving these properties? We designed an algorithm that generates a ranked list of significant information, the IList.
- How can a snippet maximally cover the significant information within a size bound? We proved that this problem is NP-hard and designed an efficient and effective algorithm for snippet generation.
eXtract: the first system for snippet generation in XML search.

Roadmap
- Identifying desirable properties of a good snippet: self-contained, distinguishable, representative
- Constructing an information list (IList): a ranked list of the significant information in the query result needed to achieve these properties
- Building snippets based on the IList within a snippet size bound
- Experimental evaluation
- Conclusions

Self-contained Snippet
Snippets should be self-contained in order to be understandable.
- Text search: snippets usually preserve self-contained semantic units, i.e. the phrases or sentences surrounding keyword matches.
- XML search: semantic units should be preserved as well.
Challenge: what is a semantic unit?

Query Result Fragment (revisited)
[Figure: part of the query result tree for the retailer Brook Brothers (product: apparel), showing store 1 (name: Galleria, state: Texas, city: Houston) and some of its clothes.]
The data contain entities and attributes. A self-contained snippet should contain the names of the entities whose attributes appear in the snippet.
Step: add the keywords and their corresponding entity names to the IList.
IList: Texas, apparel, retailer, store

Distinguishable Snippet
Snippets should be distinguishable, so that users can easily differentiate query results.
- Text search: the title of the document is included.
- XML search: the "key" of the result should be included.
Challenge: what is the key of an XML search result?

Query Result Fragment
[Figure: the same query result tree, with labels marking a return entity and a support entity.]
We identify two types of entities in a query result: return entities and support entities.
Inferring return entities: an entity whose name or attribute name matches a keyword is a return entity; if no entity matches, the highest entity is used.
Key of a query result: the keys of its return entities. The keys of entities can be mined.
Step: add the key of the query result to the IList.
IList: Texas, apparel, retailer, store, Brook Brothers
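
The return-entity rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Entity` class, the case-insensitive matching, and the reading of "highest entity" as the entity closest to the root are all hypothetical choices, not details given in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical summary of one entity type in a query result."""
    name: str
    depth: int                                  # 0 = root of the query result
    attributes: list = field(default_factory=list)


def infer_return_entities(entities, keywords):
    """Entities whose name or attribute name matches a keyword;
    if none match, fall back to the highest (shallowest) entities."""
    wanted = {k.lower() for k in keywords}
    matches = [
        e for e in entities
        if e.name.lower() in wanted
        or any(a.lower() in wanted for a in e.attributes)
    ]
    if matches:
        return matches
    top = min(e.depth for e in entities)
    return [e for e in entities if e.depth == top]


# Entity types loosely modelled on the retailer example.
ENTITIES = [
    Entity("retailer", 0, ["name", "product"]),
    Entity("store", 1, ["name", "state", "city"]),
    Entity("clothes", 3, ["fitting", "situation", "category"]),
]

if __name__ == "__main__":
    found = infer_return_entities(ENTITIES, ["Texas", "apparel", "retailer"])
    print([e.name for e in found])
```

With the keywords Texas, apparel, retailer, only the entity name "retailer" matches, since "Texas" and "apparel" are attribute values rather than attribute names. That agrees with the slide adding the retailer's key, Brook Brothers, to the IList.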

SIGMOD Representative Snippet Snippets should provide summaries of the query results, so that users can quickly grasp the essence of the results Text search: active research area; sometimes the first and/or last sentence of a paragraph is used as a summary. XML search: include “dominant features” of query results Challenges: What are features? What are dominant features?

Features of Query Result
We define a feature as (entity, attribute, value).
[Figure: the query result tree for the retailer Brook Brothers, with store 1 and some of its clothes.]
Some feature statistics, listed by feature type (entity, attribute) with the number of occurrences of each value:
- store, city: Houston 6, Austin 1, other values (3) 3
- clothes, fitting: Men 600, Women 360, Children 40
- clothes, situation: Casual 700, Formal 300
- clothes, category: Outwear 220, Suit 120, Skirt 80, Sweaters 70, other values (7) 510

Dominant Features of Query Result
A feature that occurs often is likely to be dominant, but the occurrence count alone is not always reliable.
Dominance score of a feature = (number of occurrences of the feature) / (average number of occurrences of features of the same type)
Dominant features: features with dominance score ≥ 1
Example from the feature statistics: the five city values occur 10 times in total, an average of 2, so Houston, with 6 occurrences, has a dominance score of 3.
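
As a rough check of the dominance score definition, the following Python sketch applies it to the feature statistics shown on the slides. The `FeatureType` class and its field names are hypothetical. The "other values" buckets are treated as aggregates, which is an assumption, so scores for the individual values hidden in those buckets cannot be computed here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class FeatureType:
    """Occurrence counts for one (entity, attribute) pair."""
    known: dict[str, int]    # value -> number of occurrences
    other_values: int = 0    # number of further distinct values
    other_total: int = 0     # combined occurrences of those values

    def average(self) -> float:
        distinct = len(self.known) + self.other_values
        total = sum(self.known.values()) + self.other_total
        return total / distinct


def dominance_scores(ftype: FeatureType) -> dict[str, float]:
    """Occurrences of each listed value divided by the type's average."""
    avg = ftype.average()
    return {value: count / avg for value, count in ftype.known.items()}


def dominant_features(stats):
    """(score, entity, attribute, value) for scores >= 1, highest first."""
    ranked = []
    for (entity, attribute), ftype in stats.items():
        for value, score in dominance_scores(ftype).items():
            if score >= 1:
                ranked.append((score, entity, attribute, value))
    ranked.sort(key=lambda row: row[0], reverse=True)
    return ranked


# Statistics copied from the slide.
STATS = {
    ("store", "city"): FeatureType({"Houston": 6, "Austin": 1}, 3, 3),
    ("clothes", "fitting"): FeatureType(
        {"men": 600, "women": 360, "children": 40}),
    ("clothes", "situation"): FeatureType({"casual": 700, "formal": 300}),
    ("clothes", "category"): FeatureType(
        {"outwear": 220, "suit": 120, "skirt": 80, "sweaters": 70}, 7, 510),
}

if __name__ == "__main__":
    for score, entity, attribute, value in dominant_features(STATS):
        print(f"{entity}.{attribute} = {value}: {score:.2f}")
```

Under these assumptions the ranking starts with Houston, outwear, men, casual, the same order in which these features appear in the IList on the slides. Suit and women also score above 1 in this calculation.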

Representative Snippet
Step: add the dominant features to the IList in order of their dominance scores.
IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual
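
The three steps that build the IList (keywords and entity names, then the key of the query result, then dominant features) can be put together in a short Python sketch. The function name `build_ilist` and the rule of dropping repeated items are assumptions made for illustration; the talk only shows the resulting list.

```python
def build_ilist(keywords, entity_names, result_keys, dominant_values):
    """Concatenate the four groups in order, keeping the first
    occurrence of each item so the list stays ranked and duplicate-free."""
    ilist = []
    for item in [*keywords, *entity_names, *result_keys, *dominant_values]:
        if item not in ilist:
            ilist.append(item)
    return ilist


if __name__ == "__main__":
    ilist = build_ilist(
        keywords=["Texas", "apparel", "retailer"],
        entity_names=["retailer", "store"],   # entities owning the matches
        result_keys=["Brook Brothers"],        # key of the return entity
        dominant_values=["Houston", "outwear", "men", "casual"],
    )
    print(ilist)
```

With the inputs taken from the slides, this reproduces the IList shown above.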

Roadmap
- Identifying desirable properties of a good snippet; each property contributes items to the IList:
  - Self-contained: related entity names
  - Distinguishable: key of the query result (return entities)
  - Representative: dominant features
- Constructing an information list (IList): a ranked list of the significant information in the query result needed to achieve these properties
- Building snippets based on the IList within a snippet size bound
- Experimental evaluation
- Conclusions

Instance Selection Problem
IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual
Input: query result R, the IList, and a snippet size bound B
Output: snippet S
Instance Selection Problem: how should node instances in R be selected to form S, so that S covers as many IList items as possible, in ranked order, within the bound B?

Instance Selection Problem
[Figure: the full query result tree (retailer Brook Brothers with store 1 in Houston, store 2 in Austin, and their clothes), with one selection of node instances marked "Good" and another marked "Bad".]

Instance Selection Problem
Challenges:
- The cost of covering an IList item is dynamic.
- The number of IList items that can be covered is unknown until the very end.
The Instance Selection Problem is NP-hard. We designed an efficient and effective greedy algorithm to tackle it.

Instance Selection Algorithm
[Figure: the full query result tree.]
IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual
Weight (by rank): 1/2, 1/4, 1/8, 1/16, 1/32, 1/64
Path-based instance selection:
- Coverage: the entities on the path and their attributes
- Benefit: the total weight of the IList items covered
- Cost: the path length
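
To make the weights concrete, here is a small Python sketch. It assumes the weights keep halving beyond the six values shown on the slide, and it uses exact fractions. The helper names `rank_weights` and `benefit` are hypothetical.

```python
from fractions import Fraction


def rank_weights(ilist):
    """Weight 1/2 for the first IList item, 1/4 for the second, and so on
    (assumes the halving pattern on the slide continues)."""
    return {item: Fraction(1, 2 ** (rank + 1)) for rank, item in enumerate(ilist)}


def benefit(covered_items, weights):
    """Total weight of the IList items that a path covers."""
    return sum((weights[i] for i in covered_items if i in weights), Fraction(0))


ILIST = ["Texas", "apparel", "retailer", "store", "Brook Brothers",
         "Houston", "outwear", "men", "casual"]
WEIGHTS = rank_weights(ILIST)

if __name__ == "__main__":
    # A hypothetical path that covers only the retailer and its attributes.
    print(benefit({"retailer", "apparel", "Brook Brothers"}, WEIGHTS))
    # Each item outweighs all lower-ranked items combined.
    for rank, item in enumerate(ILIST):
        rest = sum((WEIGHTS[later] for later in ILIST[rank + 1:]), Fraction(0))
        assert WEIGHTS[item] > rest
```

One consequence of halving weights is that any item weighs more than all lower-ranked items together, so benefit favours covering higher-ranked items first. This is an observation about the numbers, not a claim about the authors' stated rationale.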

Instance Selection Algorithm
[Figure: the full query result tree.]
IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual
1. For the next uncovered item in the IList, choose the path with the highest benefit/cost ratio that covers it.
2. Update the benefits and costs of the other paths.
3. Go to step 1 until the size bound is reached or the whole IList is covered.
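
The three steps can be sketched in Python as follows. This is a simplified illustration, not the eXtract implementation. Its assumptions: candidate paths and the items they cover are given up front, cost counts only edges between entity nodes that are not yet in the snippet, the bound is a number of edges, and an item whose paths no longer fit is skipped rather than ending the loop. The `Path` class and the example data are hypothetical, loosely based on the retailer tree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    name: str
    nodes: tuple       # node ids from the root down to an entity instance
    covers: frozenset  # IList items covered by entities on the path

    def edges(self):
        return set(zip(self.nodes, self.nodes[1:]))


def greedy_select(ilist, paths, bound):
    weights = {item: 0.5 ** (rank + 1) for rank, item in enumerate(ilist)}
    chosen, used_edges, covered = [], set(), set()
    for item in ilist:                      # step 1: next uncovered item
        if item in covered:
            continue
        best, best_ratio = None, -1.0
        for path in paths:
            if item not in path.covers:
                continue
            # Step 2 happens implicitly: cost and benefit are recomputed
            # against what the snippet already contains.
            new_edges = path.edges() - used_edges
            if len(used_edges) + len(new_edges) > bound:
                continue
            gain = sum(weights[i] for i in path.covers - covered if i in weights)
            ratio = gain / len(new_edges) if new_edges else float("inf")
            if ratio > best_ratio:
                best, best_ratio = path, ratio
        if best is None:                    # nothing fits: skip this item
            continue
        chosen.append(best)
        used_edges |= best.edges()
        covered |= {i for i in best.covers if i in weights}
    return chosen, used_edges, covered


ILIST = ["Texas", "apparel", "retailer", "store", "Brook Brothers",
         "Houston", "outwear", "men", "casual"]
RETAILER = {"retailer", "apparel", "Brook Brothers"}
STORE1 = RETAILER | {"store", "Texas", "Houston"}
STORE2 = RETAILER | {"store", "Texas"}
PATHS = [
    Path("store1", ("retailer", "store1"), frozenset(STORE1)),
    Path("store2", ("retailer", "store2"), frozenset(STORE2)),
    Path("clothes1", ("retailer", "store1", "merchandises1", "clothes1"),
         frozenset(STORE1 | {"men", "casual"})),
    Path("clothes3", ("retailer", "store1", "merchandises1", "clothes3"),
         frozenset(STORE1 | {"outwear", "casual"})),
    Path("clothes5", ("retailer", "store2", "merchandises2", "clothes5"),
         frozenset(STORE2 | {"outwear"})),
]

if __name__ == "__main__":
    chosen, used_edges, covered = greedy_select(ILIST, PATHS, bound=5)
    print([p.name for p in chosen], len(used_edges), sorted(covered))
```

In this toy run the path to clothes1 costs three edges at the start but only one edge after the store1 and clothes3 paths have been chosen, which illustrates why the cost of covering an IList item is dynamic.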

Final Snippet
[Figure: the full query result tree with the node instances selected for the final snippet.]
IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual

Roadmap
- Identifying desirable properties of a good snippet: self-contained, distinguishable, representative
- Constructing an information list (IList): a ranked list of the significant information in the query result needed to achieve these properties
- Building snippets based on the IList within a snippet size bound
- Experimental evaluation
- Conclusions

Experimental Setup
Approaches compared:
- Greedy algorithm for instance selection (eXtract)
- Optimal (but exponential) algorithm for instance selection
- Google Desktop
Measurements: search quality, speed, scalability
Data sets: Films, Retailer
Query sets: eight queries for each data set

Search Quality: User Study
Ten users were asked to score the snippets generated by the three approaches for the same query results.
- The two approaches designed specifically for XML snippet generation score much higher than Google Desktop.
- The Greedy algorithm (eXtract) scores close to the Optimal algorithm.

Search Quality: Precision & Recall
The ground truth for the snippets was obtained through another user study.
The snippets generated by the Greedy algorithm in eXtract have precision and recall close to those of the Optimal algorithm.
[Figure: precision and recall charts.]
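
For reference, precision and recall of a snippet against a ground-truth snippet can be computed as below. This uses the standard set-based definitions. Whether the study measured them over exactly these units is not stated in the talk, so treat the item granularity as an assumption. The example items are hypothetical and are not results from the user study.

```python
def precision_recall(snippet_items, ground_truth):
    """Set-based precision and recall of a generated snippet."""
    snippet_items, ground_truth = set(snippet_items), set(ground_truth)
    hits = len(snippet_items & ground_truth)
    precision = hits / len(snippet_items) if snippet_items else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall


if __name__ == "__main__":
    generated = {"Texas", "apparel", "retailer", "Brook Brothers", "Houston", "men"}
    truth = {"Texas", "apparel", "retailer", "Brook Brothers", "Houston",
             "outwear", "casual"}
    p, r = precision_recall(generated, truth)
    print(f"precision={p:.2f} recall={r:.2f}")
```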

Speed
[Figure: two charts, one for the Film data set and one for the Retailer data set.]
The Greedy algorithm runs much faster than the Optimal algorithm.

Scalability
[Figure: two charts, scalability on snippet size (number of edges) and scalability on query result size (KB).]
The Greedy algorithm scales much better than the Optimal algorithm.

Conclusions
- This is the first work that generates result snippets for keyword search on XML data.
- Identified the desirable properties of snippets: self-contained, distinguishable, representative.
- Designed an algorithm to generate the IList, a ranked list of significant items to be included in snippets.
- Proved that the instance selection problem is NP-hard.
- Designed an efficient algorithm that covers the IList when building a snippet within a size bound.
- Experiments verified the effectiveness and efficiency of the approach.

Thank you! Questions?
You are welcome to visit the eXtract demo at VLDB.

Architecture of eXtract
[Figure: architecture diagram. Components: Index Builder, Data Analyzer, Return Entity Identifier, Query Result Key Identifier, Dominant Feature Identifier, Instance Selector. Data passed between them: XML Index, Query & Result, IList and Query Result, Result Snippet.]

SIGMOD 2008 Snippets Comparison
[Figure: snippets from eXtract and from Google Desktop shown side by side. The tree snippet shows the retailer Brook Brothers (product: apparel) with a store in Houston, Texas and clothes with fitting: men, situation: casual, and category: outwear.]