Distributed Information Retrieval Server Ranking for Distributed Text Retrieval Systems on the Internet B. Yuwono and D. Lee Siemens TREC-4 Report: Further.

Presentation transcript:

Distributed Information Retrieval
Server Ranking for Distributed Text Retrieval Systems on the Internet, B. Yuwono and D. Lee
Siemens TREC-4 Report: Further Experiments with Database Merging, E. Voorhees
Brian Shaw, CS 5604

Issue: Merging for Effective Results
Multiple brokers (which take search queries) and multiple collection servers.
The broker must select appropriate collection servers and merge their results.

Server Ranking: overview…
Problem: the “cost” (including the user’s time) of broadcasting a query to all servers, plus the processing power it consumes.
Solution: the broker ranks collection servers by a “goodness score”, broadcasts the query to at most σ (sigma) collection servers (a preset number or a scoring threshold), and merges the results.
1- Server Ranking for Distributed Text Retrieval on the Internet

Server Ranking: Server Selection
Relies solely on document frequency (DF) data; all collection servers must report changes to the broker.
The Cue Validity Variance (CVV) goodness score is based on an estimate of how well term j distinguishes one collection server from another; it is not an indication of the quantity or quality of relevance.
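The slide gives no formula for the CVV goodness score, so the sketch below is an assumption: it uses one commonly cited formulation in which the cue validity of term j for collection i compares the term's document-frequency density inside the collection with its density in all other collections, CVV is the variance of that value across collections, and goodness is the CVV-weighted sum of document frequencies over the query terms. The function names, the `df` and `sizes` dictionaries, and the tie-breaking rule are all hypothetical.

```python
def cue_validity(df, sizes, coll, term):
    """Assumed definition: share of the term's DF density that belongs to `coll`."""
    own = df[coll].get(term, 0) / sizes[coll]
    others_df = sum(df[c].get(term, 0) for c in df if c != coll)
    others_n = sum(sizes[c] for c in df if c != coll)
    rest = others_df / others_n if others_n else 0.0
    return own / (own + rest) if (own + rest) else 0.0


def cvv(df, sizes, term):
    """Variance of the cue validity of `term` across all collections."""
    values = [cue_validity(df, sizes, c, term) for c in df]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def rank_servers(df, sizes, query_terms, sigma):
    """Return the top `sigma` collections by goodness, plus all scores."""
    scores = {
        c: sum(cvv(df, sizes, t) * df[c].get(t, 0) for t in query_terms)
        for c in df
    }
    # Highest goodness first; collection name breaks ties (an arbitrary choice).
    ranked = sorted(scores, key=lambda c: (-scores[c], c))
    return ranked[:sigma], scores
```

Under this formulation, a term that is equally dense in every collection has zero variance and contributes nothing to any goodness score, which matches the slide's point that the score measures how well a term distinguishes servers rather than how relevant their documents are.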

Server Ranking: Merging
Assumption 1: the best document in collection i is as relevant as the best document in collection k.
A collection server containing few but highly relevant documents will therefore still contribute to the final list.
Assumption 2: the distance between two consecutive document ranks is inversely proportional to the goodness score.
The number of documents a collection contributes to the final list is therefore roughly proportional to its relative goodness score.
The final ranking combines the goodness score with the local rankings.
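The two merging assumptions can be illustrated with a small sketch. It gives the top document of every collection the same global score of 1.0 (Assumption 1) and lowers the score of each following rank by a step that shrinks as goodness grows (Assumption 2). The exact step size used here, `g_min / (top_h * goodness)`, is an assumption and is not stated on the slide; `merge_by_goodness` and its parameters are hypothetical names.

```python
def merge_by_goodness(local_rankings, goodness, top_h):
    """Merge per-collection ranked lists into one list of (doc, collection).

    local_rankings: {collection: [doc ids, best first]}
    goodness:       {collection: goodness score > 0}
    top_h:          number of documents wanted in the final list
    """
    g_min = min(goodness[c] for c in local_rankings)
    scored = []
    for coll, docs in local_rankings.items():
        # Assumed step: gap between consecutive ranks, inversely
        # proportional to the collection's goodness score.
        step = g_min / (top_h * goodness[coll])
        for rank, doc in enumerate(docs, start=1):
            # Rank 1 always scores 1.0, whatever the collection.
            scored.append((1.0 - (rank - 1) * step, coll, doc))
    # Highest global score first; names break ties deterministically.
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    return [(doc, coll) for _, coll, doc in scored[:top_h]]
```

With two collections whose goodness scores are 4.0 and 1.0 and `top_h=4`, the stronger collection supplies three of the four merged documents, while the weaker one still places its best document near the top.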

Experiments: (overview)…
Problem: the broker has no access to metadata from isolated collection servers.
Solution: choose collection server(s) based on results from previous training queries.
2- Further Experiments with Database Merging

Experiments: Server Selection, two approaches
Query Clustering (QC): cluster the training queries (based on the number of documents they retrieve in common) and calculate a “centroid vector” for each cluster; compare the query vector to the centroid vectors and assign a weight to each collection.
Modeling Relevant Document Distributions (MRDD): find the M most similar training queries and assign weights to collections based on the relevant document distributions of those training runs.
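As a rough illustration of the MRDD idea, the sketch below finds the M training queries closest to a new query and splits a budget of N documents across collections in proportion to how many relevant documents those training queries found in each one. This is a simplification made for illustration: the use of cosine similarity over term-weight dictionaries, the proportional split, and every name in the code are assumptions, not details taken from the slides.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mrdd_allocation(query, training, m, n):
    """Split a budget of n documents across collections.

    training: list of {"vector": {...}, "relevant": {collection: count}}
    Returns {collection: number of documents to request}.
    """
    # The m training queries most similar to the new query.
    nearest = sorted(
        training, key=lambda tq: cosine(query, tq["vector"]), reverse=True
    )[:m]
    totals = {}
    for tq in nearest:
        for coll, count in tq["relevant"].items():
            totals[coll] = totals.get(coll, 0) + count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    # Proportional split (assumed); rounding means the parts may not sum to n.
    return {coll: round(n * t / grand_total) for coll, t in totals.items()}
```

The returned counts play the role of the per-collection weights: a collection that held most of the relevant documents for similar past queries is asked for most of the documents now.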

Experiments: Merging
N documents are retrieved from each server, as determined by the weights.
The final ranking is a random process: roll a C-faced die that is biased by the number of documents still to be picked from each of the C collections.
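The biased-die merge described above can be sketched in a few lines. At each step one collection is drawn with probability proportional to the number of its documents not yet placed, and that collection's next-best document is appended to the final list. The function name `die_merge` and the optional `rng` parameter (added so runs can be seeded) are hypothetical.

```python
import random


def die_merge(retrieved, rng=None):
    """Interleave per-collection result lists with a biased die.

    retrieved: {collection: [doc ids, best first]}
    """
    rng = rng or random.Random()
    # Copy the lists so the caller's data is not modified.
    queues = {c: list(docs) for c, docs in retrieved.items() if docs}
    merged = []
    while queues:
        names = list(queues)
        # Each face of the die is weighted by the documents still to be picked.
        weights = [len(queues[c]) for c in names]
        chosen = rng.choices(names, weights=weights, k=1)[0]
        merged.append(queues[chosen].pop(0))
        if not queues[chosen]:
            del queues[chosen]
    return merged
```

Because documents are always taken from the front of a collection's list, each collection's local ranking is preserved inside the merged list; only the interleaving between collections is random.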

Comparison

                            | 1-Server Ranking               | 2-Experiments
Broker’s Knowledge          | Shared Document Frequency Data | Training Query Results
Collection Server Selection | CVV Goodness Scoring           | Comparison to Training Queries
Merging                     | Goodness Score & Local Rank    | Random

Conclusions
The server ranking method proposed by Yuwono and Lee is an effective way to minimize operating costs (such as time) in an environment where brokers and collection servers can share document frequency data.
The “isolated merging strategies” proposed by Voorhees are an effective way to choose a collection server when no meta-information is shared between the broker and the collection servers.