A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.

Presentation transcript:

A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies Research Center IIIT-Hyderabad ICON 2010

Outline: Query categorization; Related work; Importance of ranking; Challenges; Design goals; Our approach; Results; Conclusion

Query categorization (QC): Automatic categorization (classification) of user queries into one or more pre-defined categories. Note that the categories are pre-defined and may vary across applications. However, for a particular application the categories remain the same over a reasonable amount of time.

Contributions: Solving query categorization purely as an information retrieval problem. Emphasis on the importance of ranking categories in QC systems. Our system, being simple and unsupervised in nature, can establish a new baseline.

Related Work: Text categorization techniques (Shen et al., 2005, 2006): – Solve QC as a text categorization problem. – But queries are not as rich in context as text documents. – Text classifiers are trained with a static vocabulary, which may not account for the dynamic nature of the Web.

…Related Work: Graph-based models (Diemert and Vandelle, 2009): – Concept graphs are built from search query logs. – Once the concept graph is constructed, a query is categorized by traversing the graph. – Not all search engines have the luxury of large search query logs.

Research Questions: Can we solve QC by treating it purely as an IR problem? Can we combine existing, relatively standard IR techniques to solve QC? Can an already categorized corpus be used to conduct query categorization? Can we establish a new baseline for QC systems?

Importance of ranking: Consider the category listings of two hypothetical systems for the query “iPod”. One system returns Entertainment/Music at rank 1 and the other returns it at rank 5, so ranking plays an important role for QC systems.
Rank  System (I)                     System (II)
1     Entertainment/Celebrities      Entertainment/Music
2     Computers/Hardware
3     Computers/Software
4     Info/References & Libraries
5     Entertainment/Music            Entertainment/Celebrities

Challenges: Category representation: – Categories need to be defined (covering most of the Web). – Each category needs to be represented by a set of documents that best describe it. Category representation is needed in order to solve QC purely as an IR problem.

…Challenges: Query expansion/enrichment: Queries are usually very short. The average query length in KDD Cup 2005 was 3.12 words. 22.5% of the queries were exactly 3 words long. 78.7% of the queries had at most 4 words.

Category Representation: The categories of the Open Directory Project (ODP) are used for QC. The Web documents classified under a category represent that category. Approximately 2.4 million English ODP documents are used to represent the categories. These documents are classified into approximately 380K categories. The assumption here is that these categories cover the entire Web. This corpus of ODP documents is used to perform QC.

Design Goals: – Simple – Unsupervised framework – Implementable on Web scale – Solve QC as a “search” problem, since “search” is a task a Web search engine can afford for free.

Our Approach (pipeline): Query expansion module -> Expanded query -> ODP documents retrieval -> ODP documents -> Query categorization -> ODP categories -> Taxonomy mapping (ODP to target space; optional) -> Target categories

Query Expansion: Pseudo-relevance feedback is used for query expansion. Submit the query to a Web search engine. Collect stemmed terms (Q’) from the titles and snippets of the top N search results. Remove stop words. Weight the terms by document frequency (DF). Flow: Query -> Web search engine -> Web documents -> Stemmed terms.

…Query Expansion: Concepts common to a query usually occur in most of the top Web documents retrieved for it. This information is best captured by DF. These common concepts represent the query. Example: for the query “Serena Williams”, the top results from a Web search engine repeatedly mention tennis, sports, WTA, and Wimbledon, so the expanded query consists of Tennis, Sports, WTA, Wimbledon.
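The DF-based expansion described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the snippets, the tiny stop-word list, and the helper names `crude_stem` and `expand_query` are all assumptions made for the example, and the stemmer is a deliberately crude stand-in for a real one.

```python
import re
from collections import Counter

# Tiny illustrative stop list (assumption); a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "at", "is", "for", "to", "on"}

def crude_stem(token):
    # Hypothetical stand-in for a real stemmer: strips one trailing "s",
    # except for words ending in "ss" or "is" (e.g. "tennis").
    if len(token) > 3 and token.endswith("s") and not token.endswith(("ss", "is")):
        return token[:-1]
    return token

def expand_query(snippets, top_k=4):
    """Return the top_k (term, DF) pairs over the given titles/snippets."""
    df = Counter()
    for text in snippets:
        tokens = re.findall(r"[a-z]+", text.lower())
        # Use a set so each term counts at most once per search result.
        terms = {crude_stem(t) for t in tokens if t not in STOP_WORDS}
        df.update(terms)
    # Sort by DF descending, then alphabetically for a deterministic order.
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]

# Made-up snippets standing in for the top results for "Serena Williams".
snippets = [
    "Serena Williams tennis sports WTA",
    "Tennis star at Wimbledon",
    "WTA tennis sports Wimbledon",
]
expanded = expand_query(snippets)
```

Terms that appear in many results (here, the tennis-related ones) rise to the top, while terms that appear in a single result fall below the cut-off.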

Central Idea: The ODP documents that match the query-related concepts are good enough to carry out QC. In essence, they are topically similar documents. This fact is leveraged in our unsupervised approach to QC.

Query Categorization: Search the expanded query on the ODP Web document corpus. Each ODP document retrieved for the query belongs to at least one ODP category, which yields the query's categories. An optional taxonomy mapping is applied when the target categories differ from those of ODP.
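A minimal sketch of this retrieval-then-aggregate step, under stated assumptions: the four-document corpus `ODP_DOCS`, its category labels, the overlap-based scoring, and the function name `categorize` are hypothetical. The slides do not specify the retrieval model or how document scores are combined, so summing scores per category is only one plausible choice.

```python
from collections import defaultdict

# Hypothetical miniature stand-in for the ODP corpus: (document text, ODP category).
ODP_DOCS = [
    ("tennis rankings wta tour results", "Sports/Tennis"),
    ("wimbledon tennis championship history", "Sports/Tennis"),
    ("sport news and scores", "Sports/News"),
    ("guitar chords and music lessons", "Arts/Music"),
]

def categorize(expanded_query, docs=ODP_DOCS, top_n=3):
    """Rank categories by summing the scores of the top_n retrieved documents.

    expanded_query maps each expanded term to its weight (for example its DF).
    """
    scored = []
    for text, category in docs:
        words = set(text.split())
        # Document score: total weight of the expanded-query terms it contains.
        score = sum(w for term, w in expanded_query.items() if term in words)
        if score > 0:
            scored.append((score, category))
    scored.sort(key=lambda sc: -sc[0])
    votes = defaultdict(float)
    for score, category in scored[:top_n]:
        votes[category] += score
    # Highest total first; ties broken alphabetically.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

ranking = categorize({"tennis": 3, "sport": 2, "wimbledon": 2, "wta": 2})
```

In a real deployment the linear scan would be replaced by a query against a search index built over the ODP documents.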

Taxonomy Mapping for the KDD Cup dataset: We map ODP categories to KDD Cup categories in order to evaluate on the KDD Cup dataset. Note that these mappings are computed once, offline.

…Taxonomy Mapping (flow): Take a target category t. Stem the words in t. Search them over the ODP category descriptions. Let C be the set of retrieved ODP categories. Mappings: t maps to every category in C; in reverse, every ODP category in C can be mapped to t. Store these mappings for target category t. Repeat for the other target categories.

…Taxonomy Mapping: Search for the target categories in the ODP category descriptions. For a target category t, let the set of retrieved ODP categories be C. Map every category in C to the target category t. Repeat this for the other target categories to obtain all mappings.
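The offline mapping step can be sketched as below. The category descriptions in `ODP_DESCRIPTIONS`, the target labels, and the function name `build_mappings` are invented for illustration, and stemming is omitted for brevity; matching on any shared word is an assumption, since the slides do not say how the description search is scored.

```python
from collections import defaultdict

# Hypothetical ODP category descriptions, for illustration only.
ODP_DESCRIPTIONS = {
    "Arts/Music/Bands": "music bands and artists",
    "Computers/Hardware/Audio": "computer hardware for audio and music players",
    "Sports/Tennis": "tennis players and tournaments",
}

def build_mappings(target_categories, descriptions=ODP_DESCRIPTIONS):
    """Map each ODP category to the target categories whose words it matches."""
    odp_to_target = defaultdict(set)
    for target in target_categories:
        # Split a label such as "Entertainment/Music" into lowercase words.
        words = set(target.lower().replace("/", " ").split())
        for odp_cat, desc in descriptions.items():
            # "Retrieve" an ODP category if its description shares a word with t.
            if words & set(desc.split()):
                odp_to_target[odp_cat].add(target)
    return dict(odp_to_target)

mappings = build_mappings(["Entertainment/Music", "Computers/Hardware"])
```

Because the result depends only on the two taxonomies, it can be stored and reused for every incoming query.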

…Taxonomy Mapping: Let C(Q) be the set of ODP categories returned for a query Q. The more categories of C(Q) that map to a target category, the higher that target category is ranked. The top K categories in the target space are returned as the target categories for the query.
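A minimal sketch of this voting step, assuming the mappings are already available as a dictionary from ODP category to a set of target categories. The example categories and the function name `rank_target_categories` are hypothetical, and one vote per mapped ODP category is an assumed reading of "most of the categories".

```python
from collections import Counter

def rank_target_categories(odp_categories, odp_to_target, k=5):
    """Return the top-k target categories by number of mapped ODP categories."""
    counts = Counter()
    for odp_cat in odp_categories:
        # Each retrieved ODP category votes once for every target it maps to.
        for target in odp_to_target.get(odp_cat, ()):
            counts[target] += 1
    # Most votes first; ties broken alphabetically for a deterministic order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [target for target, _ in ranked[:k]]

# Hypothetical mappings and retrieved categories C(Q).
odp_to_target = {
    "Arts/Music/Bands": {"Entertainment/Music"},
    "Arts/Music/Styles": {"Entertainment/Music"},
    "Computers/Hardware/Audio": {"Computers/Hardware"},
}
c_q = ["Arts/Music/Bands", "Arts/Music/Styles", "Computers/Hardware/Audio"]
top = rank_target_categories(c_q, odp_to_target)
```

The default `k=5` is chosen only because the evaluation data tags each query with at most five categories.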

Dataset: KDD Cup 2005 dataset (Li et al., 2005). A set of 800K unlabeled queries sampled from MSN search query logs. 67 predefined categories. A set of 800 queries (sampled from the 800K queries) was labeled. Three labelers independently labeled this set. Each query was tagged with at most 5 categories. This dataset serves as the standard dataset for QC evaluation.

Evaluation Metrics Precision, Recall and F1 are defined, respectively, as follows:
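A minimal sketch of these metrics, assuming the micro-averaged form commonly used with the KDD Cup 2005 data: precision is the number of correctly assigned categories divided by the number of categories the system assigned, recall is the number of correctly assigned categories divided by the number of categories the human labeler assigned, and F1 = 2 · P · R / (P + R). The function name `micro_prf` and the example labels are hypothetical.

```python
def micro_prf(predicted, gold):
    """Micro-averaged precision, recall and F1.

    predicted, gold: dicts mapping each query to a set of category labels.
    """
    # Count (query, category) pairs on which the system and the labeler agree.
    correct = sum(len(predicted.get(q, set()) & labels) for q, labels in gold.items())
    tagged = sum(len(predicted.get(q, set())) for q in gold)
    labeled = sum(len(labels) for labels in gold.values())
    precision = correct / tagged if tagged else 0.0
    recall = correct / labeled if labeled else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical labels for two queries.
gold = {
    "ipod": {"Entertainment/Music", "Computers/Hardware"},
    "serena williams": {"Sports/Tennis"},
}
predicted = {
    "ipod": {"Entertainment/Music", "Entertainment/Celebrities"},
    "serena williams": {"Sports/Tennis"},
}
p, r, f1 = micro_prf(predicted, gold)
```

With three labelers, one option is to compute the scores against each labeler separately and average them; that choice is an assumption, not something the slides state.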

Results
State of the art (Shen et al., 2005)
Best today (Shen et al., 2006b): NA
KBS (Diemert and Vandelle, 2009): NA
Our System: (+4.2%)
*The high precision reported by the KBS system is due to binary categorization.

On Results: Though the F1 reported for our system is marginally lower, we believe our system should be viewed from a different perspective. It solves QC purely as an information retrieval problem. It combines relatively standard techniques to solve QC, making it – simple, and – implementable on a very large scale.

…On Results: Our system is unsupervised in nature. It does not make use of resources such as search query logs. Thus, we believe the reported results meet our design goals to a reasonable extent.

Conclusion: A simple, unsupervised, yet effective approach to query categorization. It leverages an already categorized corpus (ODP) to perform QC. Advantages: – Simple approach – Unsupervised – Existing IR techniques can be used – Avoids multiclass classification