Collection Fusion in Carrot2

Collection Fusion in Carrot2 Mithun Sheshagiri

Acknowledgements: Prof. Scott Cost, Srikanth Kallurkar, Hemali Majithia

Overview
- Collection fusion problem in IR
- Possible solutions
  - Equal distribution assumption
  - Comparable similarities
  - Modeling relevant document distribution
  - Query clustering
- Carrot2 system
- Query routing in Carrot2
- Collection fusion in Carrot2
- Future work
- Conclusions
- References

The Collection Fusion Problem
- Centralized indexing and retrieval
- Distributed IR systems
- The collection fusion problem:
  - Determining the number of documents that need to be retrieved from each sub-collection.
  - Interleaving the documents returned by each sub-collection.

Possible Solutions
- Equal distribution assumption
  - Assumes that relevant documents are distributed equally across all sub-collections.
- Comparable similarities
  - Documents in the final result are listed as though the similarities were normalized across sub-collections.
  - Similarity values are dependent on the sub-collection.
  - A rare but not very relevant document can end up with a higher ranking.
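
As a rough illustration of the equal distribution assumption, the sketch below interleaves the ranked lists round-robin, so every sub-collection contributes at the same rate. The function name `round_robin_merge` and the sample document ids are hypothetical and do not come from the slides.

```python
from itertools import zip_longest

def round_robin_merge(ranked_lists, total):
    """Interleave per-collection rankings one document at a time."""
    merged = []
    # zip_longest pads exhausted lists with None so longer lists keep going
    for tier in zip_longest(*ranked_lists):
        for doc in tier:
            if doc is not None:
                merged.append(doc)
    return merged[:total]

# Hypothetical ranked results from three sub-collections
results = [["a1", "a2", "a3"], ["b1", "b2"], ["c1", "c2", "c3"]]
merged = round_robin_merge(results, 6)
print(merged)
```

The merge ignores similarity scores entirely, which sidesteps the comparability problem but does nothing for a query whose relevant documents sit mostly in one sub-collection.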

Possible Solutions
- Modeling relevant document distribution
  - The document distribution model is built using training queries.
  - The document distribution for a query q is obtained by averaging the number of relevant documents retrieved by the k nearest training queries. This is done for every sub-collection.
  - These document distributions, along with the total number of documents to be retrieved, are passed to a maximization procedure.
  - The maximization procedure calculates a cut-off value for each sub-collection.
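
The two steps above can be sketched as follows. This is only an illustration under stated assumptions: queries are sparse term-weight dictionaries compared by cosine similarity, each training query stores a 0/1 relevance flag per rank for every sub-collection, and the maximization is a small dynamic program over cut-offs. The slides do not specify any of these details, and all names (`estimate_distribution`, `choose_cutoffs`, `relevant_at_rank`) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def estimate_distribution(query, training, k):
    """Average the relevance-by-rank lists of the k nearest training queries."""
    nearest = sorted(training, key=lambda t: cosine(query, t["vector"]), reverse=True)[:k]
    dist = {}
    for name in nearest[0]["relevant_at_rank"]:
        lists = [t["relevant_at_rank"][name] for t in nearest]
        depth = min(len(flags) for flags in lists)
        dist[name] = [sum(flags[r] for flags in lists) / len(lists) for r in range(depth)]
    return dist

def choose_cutoffs(dist, total):
    """Pick one cut-off per sub-collection (sum <= total) maximising expected relevant documents."""
    best = {0: (0.0, [])}  # documents used so far -> (expected relevant, cut-offs chosen)
    for name in dist:
        prefix = [0.0]
        for value in dist[name]:
            prefix.append(prefix[-1] + value)
        updated = {}
        for used, (score, cuts) in best.items():
            for cut in range(min(len(dist[name]), total - used) + 1):
                candidate = (score + prefix[cut], cuts + [cut])
                key = used + cut
                if key not in updated or candidate[0] > updated[key][0]:
                    updated[key] = candidate
        best = updated
    _, cuts = max(best.values(), key=lambda item: item[0])
    return dict(zip(dist, cuts))

# Hypothetical training data for two sub-collections, A and B
training = [
    {"vector": {"fusion": 1, "retrieval": 1}, "relevant_at_rank": {"A": [1, 0, 0], "B": [1, 0, 0]}},
    {"vector": {"fusion": 1, "collection": 1}, "relevant_at_rank": {"A": [1, 1, 0], "B": [0, 0, 0]}},
    {"vector": {"image": 1}, "relevant_at_rank": {"A": [0, 0, 0], "B": [1, 1, 1]}},
]
dist = estimate_distribution({"fusion": 1, "retrieval": 1}, training, k=2)
cutoffs = choose_cutoffs(dist, total=3)
print(dist, cutoffs)
```

With these toy numbers the two nearest training queries favour sub-collection A, so it receives the larger cut-off.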

Possible Solutions
- Query clustering
  - Query clusters are formed by grouping training queries that retrieve some of the same documents.
  - A weight is assigned to each cluster, computed from the number of relevant documents returned by the queries belonging to the cluster.
  - The centroid of a query cluster is calculated by averaging the query vectors belonging to that cluster.
  - The cluster whose centroid is most similar to the user query is selected and its weight is returned.
  - The set of weights returned by all the sub-collections is used to apportion the retrieved set. The number of documents taken from sub-collection i is $\frac{w_i}{\sum_j w_j} \cdot N$, where $w_i$ is the weight returned by the cluster for sub-collection i and $N$ is the number of documents in the final result.
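
The apportioning step can be sketched as below: each sub-collection gets a share of the N result slots proportional to its weight. The slides do not say how fractional shares are rounded, so handing leftover slots to the largest fractional parts is an assumption, and the function name `apportion` is hypothetical.

```python
def apportion(weights, total):
    """Split `total` result slots across sub-collections in proportion to their weights."""
    weight_sum = sum(weights.values())
    if weight_sum == 0:
        return {name: 0 for name in weights}
    exact = {name: w * total / weight_sum for name, w in weights.items()}
    quota = {name: int(share) for name, share in exact.items()}
    # Assumption: leftover slots go to the largest fractional remainders
    leftover = total - sum(quota.values())
    by_fraction = sorted(weights, key=lambda name: exact[name] - quota[name], reverse=True)
    for name in by_fraction[:leftover]:
        quota[name] += 1
    return quota

quota = apportion({"A": 3.0, "B": 1.0, "C": 1.0}, 10)
uneven = apportion({"A": 2.0, "B": 1.0}, 10)
print(quota, uneven)
```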

Carrot2 System
- Carrot2 is an agent-based distributed IR system.
- It uses the Jackal communication infrastructure.
- Agents communicate using KQML.
- Agents interface with the IR engine through a wrapper.
- The wrapper provides functionality to index documents as well as metadata.

Carrot2 System
- Metadata is a reduced representation of the sub-collection (8-10%).
- Metadata is a vector consisting of N-grams (terms) and, for each one, the number of documents that contain it.
- On start-up, an agent is allotted a sub-collection.
- Every agent has an associated metadata object.
- An agent also has access to a metadata pool.
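
A metadata vector of this kind can be sketched as a mapping from each N-gram to its document frequency. The sketch assumes word N-grams over whitespace-separated tokens and does not attempt the size reduction mentioned above. The function name `build_metadata` is hypothetical and is not part of the Carrot2 wrapper.

```python
from collections import Counter

def build_metadata(documents, n=1):
    """Map each word n-gram to the number of documents that contain it."""
    doc_freq = Counter()
    for text in documents:
        tokens = text.lower().split()
        # A set ensures each n-gram is counted once per document
        ngrams = {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        doc_freq.update(ngrams)
    return doc_freq

metadata = build_metadata(["distributed retrieval", "retrieval of text", "agent systems"])
print(metadata)
```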

Query Routing in Carrot2
- A query is submitted to a Query Manager.
- The Query Manager picks an agent from a list of agents returned by the Collection Manager.
- Every agent queries its metadata pool and decides to:
  - query its local collection,
  - forward the query, or
  - do a combination of both.

Query Routing in Carrot2
- The process ends when either:
  - no agents remain that have not already received the query, or
  - the number of times the query has been forwarded has reached a threshold value.
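
The routing loop and its two stopping conditions can be sketched as a simple simulation. It assumes the query visits agents one at a time and that routing also stops if an agent chooses not to forward. The `decide` callback, which stands in for the agent's metadata-pool lookup, and the name `route_query` are hypothetical.

```python
def route_query(query, agents, decide, max_forwards):
    """Visit agents until none are left or the forward threshold is reached.

    `decide(agent, query)` is a hypothetical callback returning a pair of
    booleans: (search the local collection, forward the query).
    """
    pending = list(agents)  # agents that have not yet received the query
    searched = []
    forwards = 0
    while pending:
        agent = pending.pop(0)
        search_locally, forward = decide(agent, query)
        if search_locally:
            searched.append(agent)
        if not forward or forwards >= max_forwards:
            break
        forwards += 1
    return searched

# Every agent searches locally and asks to forward; the threshold stops the chain
searched = route_query("collection fusion", ["a1", "a2", "a3", "a4"],
                       lambda agent, query: (True, True), max_forwards=2)
print(searched)
```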

Collection Fusion in Carrot2
- An approach similar to query clustering.
- A query cluster and a metadata object are both representations of sub-collections.
- Both have a weight/similarity that indicates how relevant the documents in the sub-collection are to the given query.
- The similarity values of the metadata objects can be used to apportion the total number of documents that need to be returned.
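
One way to obtain such similarity values is sketched below: the query is treated as a binary term vector and compared with each agent's document-frequency metadata by cosine similarity, and the similarities are then normalised into shares of the result list. The choice of cosine similarity, the function name `metadata_similarity`, and the sample metadata pool are assumptions made for illustration, not details taken from Carrot2.

```python
import math

def metadata_similarity(query_terms, metadata):
    """Cosine similarity between a binary query vector and a document-frequency vector."""
    terms = set(query_terms)
    dot = sum(metadata.get(term, 0) for term in terms)
    norm_query = math.sqrt(len(terms))
    norm_meta = math.sqrt(sum(v * v for v in metadata.values()))
    return dot / (norm_query * norm_meta) if norm_query and norm_meta else 0.0

# Hypothetical metadata pool: agent name -> {term: document frequency}
pool = {
    "agent1": {"fusion": 4, "retrieval": 3},
    "agent2": {"image": 5, "retrieval": 1},
}
query = ["collection", "fusion", "retrieval"]
sims = {agent: metadata_similarity(query, meta) for agent, meta in pool.items()}

# Normalise the similarities into the fraction of results each agent should supply
sim_total = sum(sims.values())
shares = {agent: (s / sim_total if sim_total else 0.0) for agent, s in sims.items()}
print(sims, shares)
```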

Collection Fusion in Carrot2
- Requirement for implementation: access to the metadata objects of all participating sub-collections (C2 agents).
- Two ways to obtain this access:
  - Use the metadata pool of one agent when the metadata objects are distributed in broadcast mode (flooding strategy).
  - Introduce a new agent that accesses the metadata objects of all participating agents.

Collection Fusion in Carrot2
- A similarity value is appended to the result returned by each agent.
- The interleaving can be done by rolling a C-faced die (one face per sub-collection) that is biased by the number of documents still to be picked from each original result set.
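
The biased-die interleaving can be sketched as follows. At every step a sub-collection is drawn with probability proportional to the number of its documents not yet placed, and its next-best document is appended to the merged list. The function name `die_interleave` and the sample result sets are hypothetical.

```python
import random

def die_interleave(result_lists, rng=None):
    """Merge ranked lists by repeatedly rolling a die weighted by remaining documents."""
    rng = rng or random.Random()
    remaining = {name: list(docs) for name, docs in result_lists.items()}
    merged = []
    while any(remaining.values()):
        names = [name for name, docs in remaining.items() if docs]
        weights = [len(remaining[name]) for name in names]
        chosen = rng.choices(names, weights=weights, k=1)[0]
        # Take the best remaining document from the chosen sub-collection
        merged.append(remaining[chosen].pop(0))
    return merged

merged = die_interleave({"A": ["a1", "a2", "a3"], "B": ["b1", "b2"]}, rng=random.Random(0))
print(merged)
```

Each sub-collection's internal ranking is preserved, while sub-collections with larger allotments tend to appear earlier and more often in the merged list.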

Future Work
- The suitability of the proposed technique for the C2 system should be verified experimentally.
- Because the technique makes use of existing entities and information, it can be implemented with minimal changes to the existing architecture.

Conclusion
- A query-clustering-like approach combined with probabilistic interleaving is a good candidate for collection fusion in C2:
  - Decentralized nature
  - Use of existing entities
  - Easy to implement
  - Less prone to scalability issues

References
- Ellen M. Voorhees, Narendra K. Gupta, and Ben Johnson-Laird. Learning collection fusion strategies.
- James P. Callan, Zhihong Lu, and Bruce Croft. Searching distributed collections with inference networks.
- Ellen M. Voorhees, Narendra K. Gupta, and Ben Johnson-Laird. The collection fusion problem.