Federated Search of Text Search Engines in Uncooperative Environments Luo Si Language Technology Institute School of Computer Science Carnegie Mellon University.


Federated Search of Text Search Engines in Uncooperative Environments. Luo Si, Language Technology Institute, School of Computer Science, Carnegie Mellon University. Advisor: Jamie Callan (Carnegie Mellon University).

2 © Luo Si July, 2004 Outline:  Introduction: introduction to federated search  Research Problems: the state-of-the-art and our contribution  Demo: a prototype system for a real-world application

4 © Luo Si July, 2004 Introduction
Visible Web vs. Hidden Web
- Visible Web: information that can be copied (crawled) and accessed by conventional search engines like Google or AltaVista.
- Hidden Web: information hidden from conventional engines, which cannot index it (promptly) because arbitrary crawling of the data is not allowed (e.g., the ACM library) or the data is updated too frequently to be crawled (e.g., buy.com).
- The Hidden Web is larger than the Visible Web (2-50 times) and valuable (created by professionals); it must be searched by federated search.
- On the Web these are uncooperative information sources; federated search is also promoted as a feature to beat Google by search engines such as …

5 © Luo Si July, 2004 Introduction
Components of a Federated Search System (the slide shows a diagram of Engine 1 ... Engine N feeding the three components below):
(1) Resource Representation: build a description of each search engine.
(2) Resource Selection: choose the most promising engines for a query.
(3) Results Merging: combine the returned ranked lists into a single result list.
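As a minimal sketch of how these three components might fit together in code: the function names, the crude term-overlap selection score, and the round-robin merge below are illustrative placeholders, not the methods proposed in this work.

```python
# Minimal sketch of the three federated-search components; all names are illustrative.
from collections import Counter

def build_representation(sampled_docs, estimated_size):
    """(1) Resource representation: term statistics over sampled docs plus a size estimate."""
    terms = Counter(w for doc in sampled_docs for w in doc.lower().split())
    return {"terms": terms, "estimated_size": estimated_size}

def select_sources(query, representations, num_sources=2):
    """(2) Resource selection: rank sources by a crude query-term overlap score."""
    def score(rep):
        return sum(rep["terms"][w] for w in query.lower().split())
    ranked = sorted(representations, key=lambda name: score(representations[name]), reverse=True)
    return ranked[:num_sources]

def merge_results(result_lists):
    """(3) Results merging: interleave the per-source ranked lists (round-robin placeholder)."""
    merged = []
    depth = max(len(r) for r in result_lists.values())
    for i in range(depth):
        for name, results in result_lists.items():
            if i < len(results):
                merged.append((name, results[i]))
    return merged
```

Here `representations` would map source names to the output of `build_representation`, and `result_lists` would map the selected source names to their returned ranked lists.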

6 © Luo Si July, 2004 Introduction
Modeling Federated Search
- Applications in the real world: but not enough relevance judgments and not enough control, so thorough simulation is required.
- Modeling federated search in research environments: TREC testbeds with about 100 information sources.
  - Normal or moderately skewed size testbeds: Trec123 and Trec4_Kmeans.
  - Skewed testbeds: Representative (large sources with the same relevant-document density), Relevant (large sources with higher relevant-document density), Nonrelevant (large sources with lower relevant-document density).
  - Multiple types of search engines to reflect the uncooperative environment.

7 © Luo Si July, 2004 Outline:  Introduction  Research Problems: the state-of-the-art and our contribution - Resource Representation - Resource Selection - Results Merging - A Unified Framework  Demo

8 © Luo Si July, 2004 Research Problems (Resource Representation)
Previous research on resource representation
- Resource descriptions of words and their occurrences: Query-Based Sampling (Callan, 1999) sends queries and collects the sampled documents.
- Information source size estimation: the Capture-Recapture model (Liu and Yu, 1999), but it requires a large number of interactions with the information sources (sketch below).
- Centralized sample database: collect the documents obtained by Query-Based Sampling (QBS). Used for query expansion (Ogilvie & Callan, 2001), not very successful; utilized successfully for other problems throughout our new research.
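As a concrete illustration of the capture-recapture idea mentioned above, the Lincoln-Petersen estimator below infers a source's size from two independent samples of document identifiers; the function name and sampling interface are assumptions made for this sketch.

```python
# Capture-recapture (Lincoln-Petersen) size estimate from two independent samples
# of document identifiers drawn from the same source.
def capture_recapture_size(sample_a, sample_b):
    overlap = len(set(sample_a) & set(sample_b))
    if overlap == 0:
        raise ValueError("no overlap between the samples; cannot estimate the size")
    return len(sample_a) * len(sample_b) / overlap

# Example: samples of 200 and 300 doc IDs that share 15 IDs give an estimate of 4000 docs.
```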

9 © Luo Si July, 2004 Research Problems (Resource Representation)
Sample-Resample Model (Si and Callan, 2003): a new information source size estimation algorithm.
- Estimate the df of a term in the sampled documents.
- Get the term's total df from the source with a resample query.
- Scale the number of sampled documents by the ratio to estimate the source size (sketch below).
Experiments on the Trec123 and Trec123-10Col testbeds compare Capture-Recapture and Sample-Resample; the measure is the absolute error ratio, |estimated size - actual size| / actual size. (Result table not reproduced.)
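A hedged sketch of the sample-resample estimate follows; the probe-term loop and the `resample_df` callback (standing in for sending a one-term query to the source and reading the document frequency it reports) are illustrative assumptions.

```python
# Sketch of sample-resample size estimation (Si and Callan, 2003): the df of a probe
# term in the sampled docs, compared with the df the source reports for that term,
# scales the number of sampled docs up to an estimate of the source size.
def sample_resample_size(sampled_docs, probe_terms, resample_df):
    """
    sampled_docs: documents (strings) obtained by query-based sampling
    probe_terms:  terms sent back to the source as one-term resample queries
    resample_df:  callable term -> document frequency reported by the source
    """
    n_sampled = len(sampled_docs)
    estimates = []
    for term in probe_terms:
        df_sample = sum(1 for doc in sampled_docs if term in doc.lower().split())
        if df_sample == 0:
            continue  # probe term never seen in the sample; skip it
        estimates.append(resample_df(term) * n_sampled / df_sample)
    return sum(estimates) / len(estimates) if estimates else None

# Evaluation measure from the slide: absolute error ratio = |estimated - actual| / actual.
def abs_error_ratio(estimated, actual):
    return abs(estimated - actual) / actual
```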

10 © Luo Si July, 2004 Outline:  Introduction  Research Problems: the state-of-the-art and our contribution - Resource Representation - Resource Selection - Results Merging - A Unified Framework  Demo

11 © Luo Si July, 2004 Research Problems (Resource Selection)
Previous research on resource selection
- Goal of resource selection for information source recommendation is High-Recall: select the (few) information sources that hold the most relevant documents.
- "Big document" resource selection approach: treat information sources as big documents and rank them by similarity to the user query. Examples: CVV, CORI and KL-divergence. These methods lose document boundaries and do not optimize the High-Recall goal.
- New: Relevant Document Distribution Estimation (ReDDE) resource selection estimates the percentage of relevant documents in each source and ranks the sources accordingly.
"Relevant Document Distribution Estimation Method for Resource Selection" (Luo Si & Jamie Callan, SIGIR '03)

12 © Luo Si July, 2004 Research Problems (Resource Selection)
Relevant Document Distribution Estimation (ReDDE) Algorithm
- Rank the documents of the centralized sample DB for the query; this approximates a ranking of the centralized complete DB.
- Each sampled document stands for (estimated source size / number of sampled docs) documents of its source: the source scale factor.
- Assume "everything at the top is (equally) relevant" in the centralized complete DB ranking, and accumulate the scale factors of each source's documents that fall above that cutoff to estimate its number of relevant documents.
- Rank the sources by their estimated number of relevant documents (sketch below).
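The sketch below follows the ReDDE steps as described on this slide; the default relevance-cutoff ratio and the simple handling of the cutoff are simplifying assumptions, not the published parameter settings.

```python
# Sketch of ReDDE resource selection (after Si & Callan, SIGIR 2003).
def redde_scores(csdb_ranking, sample_counts, size_estimates, ratio=0.003):
    """
    csdb_ranking:   list of (source_name, doc_id) pairs, the centralized sample DB
                    ranked for the query (best first)
    sample_counts:  source_name -> number of sampled docs from that source
    size_estimates: source_name -> estimated total size of that source
    ratio:          fraction of the estimated complete centralized DB assumed relevant
    """
    total_size = sum(size_estimates.values())
    threshold = ratio * total_size          # "everything at the top is (equally) relevant"
    rel_estimate = {name: 0.0 for name in size_estimates}

    position = 0.0   # estimated rank position in the centralized *complete* DB
    for source, _doc in csdb_ranking:
        scale = size_estimates[source] / sample_counts[source]   # docs each sample represents
        if position < threshold:
            rel_estimate[source] += scale    # count this sampled doc's "mass" as relevant
        position += scale

    # Rank sources by their estimated number of relevant documents.
    return sorted(rel_estimate.items(), key=lambda kv: kv[1], reverse=True)
```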

13 © Luo Si July, 2004 Research Problems (Resource Selection)
Experiments: compare the evaluated ranking of sources against the desired (relevance-based) ranking. The measure is the standard recall-style ratio: the number of relevant documents held by the top k sources of the evaluated ranking, divided by the same count for the top k sources of the desired ranking (sketch below). (Result plots not reproduced.)
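A small sketch of that measure, assuming the true number of relevant documents in each source is known from the testbed judgments; the function name is illustrative.

```python
# Recall-style resource selection measure R_k: relevant documents covered by the
# top k sources of the evaluated ranking, relative to the top k of the desired ranking.
def r_k(evaluated_ranking, rel_counts, k):
    """
    evaluated_ranking: source names in the order produced by the algorithm
    rel_counts:        source_name -> true number of relevant documents it holds
    """
    desired_ranking = sorted(rel_counts, key=rel_counts.get, reverse=True)
    got = sum(rel_counts[s] for s in evaluated_ranking[:k])
    best = sum(rel_counts[s] for s in desired_ranking[:k])
    return got / best if best else 0.0
```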

14 © Luo Si July, 2004 Outline:  Introduction  Research Problems: the state-of-the-art and our contribution - Resource Representation - Resource Selection - Results Merging - A Unified Framework  Future Research

15 © Luo Si July, 2004 Research Problems (Results Merging)
Goal of results merging: make the different result lists comparable and merge them into a single list.
Difficulties: information sources may use different retrieval algorithms and have different corpus statistics.
Previous research on results merging:
- Some methods download all documents and calculate comparable scores: large communication and computation costs.
- Some methods use heuristic combination: the CORI merging method.
Semi-Supervised Learning (SSL) merging (Si & Callan, 2002, 2003):
- Basic idea: approximate the centralized document score by linear regression.
- Estimate linear models from the overlap documents that appear both in the centralized sample DB and in the individual ranked lists.

16 © Luo Si July, 2004 Research Problems (Results Merging)
SSL Results Merging (cont.)
- In resource representation: build representations by QBS and collapse the sampled documents into a centralized sample DB (CSDB).
- In resource selection: rank the sources and calculate centralized scores for the documents in the centralized sample DB.
- In results merging: find the overlap documents, build the linear models, and estimate centralized scores for all returned documents to produce the final results (sketch below).
(The slide shows the pipeline diagram: Engine 1 ... Engine N, the centralized sample DB, the CSDB ranking, the overlap documents, and the final results.)
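The sketch below shows the core merging step under simplified assumptions: each source returns (doc_id, score) pairs, centralized scores are available for the sampled documents, and a plain per-source least-squares line maps source scores onto the centralized scale; fallbacks for sources with too few overlap documents are omitted.

```python
# Sketch of Semi-Supervised Learning (SSL) results merging (after Si & Callan, 2002/2003).
def ssl_merge(result_lists, csdb_scores):
    """
    result_lists: source_name -> list of (doc_id, source_score) returned by that engine
    csdb_scores:  doc_id -> centralized score for docs in the centralized sample DB
    """
    merged = []
    for source, results in result_lists.items():
        # Overlap docs give (source_score, centralized_score) training pairs.
        pairs = [(s, csdb_scores[d]) for d, s in results if d in csdb_scores]
        if len(pairs) < 2:
            continue   # too few overlap docs to fit a line; a real system needs a fallback
        # Least-squares fit of centralized_score = a * source_score + b.
        n = len(pairs)
        sx = sum(x for x, _ in pairs); sy = sum(y for _, y in pairs)
        sxx = sum(x * x for x, _ in pairs); sxy = sum(x * y for x, y in pairs)
        denom = n * sxx - sx * sx
        a = (n * sxy - sx * sy) / denom if denom else 0.0
        b = (sy - a * sx) / n
        # Map every returned document onto the centralized score scale.
        merged.extend((a * s + b, doc_id, source) for doc_id, s in results)
    merged.sort(reverse=True)   # single list ordered by estimated centralized score
    return merged
```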

17 © Luo Si July, 2004 Research Problems (Results Merging)
Experiments on the Trec123 and Trec4-kmeans testbeds with 10 sources selected. (Result tables not reproduced.)
"Using Sampled Data and Regression to Merge Search Engine Results" (Luo Si & Jamie Callan, SIGIR '02)
"A Semi-Supervised Learning Method to Merge Search Engine Results" (Luo Si & Jamie Callan, TOIS '03)

18 © Luo Si July, 2004 Outline:  Introduction  Research Problems: the state-of-the-art and preliminary research - Resource Representation - Resource Selection - Results Merging - A Unified Framework  Demo

19 © Luo Si July, 2004 Research Problems (Unified Utility Framework)
Goal of the Unified Utility Maximization framework: integrate and adjust the individual components of federated search to obtain the globally desired results for different applications, rather than simply combining individually effective components.
High-Recall vs. High-Precision:
- High-Recall: select sources that contain as many relevant documents as possible, for information source recommendation.
- High-Precision: select sources that return many relevant documents at the top of the final ranked list, for federated document retrieval.
- The two goals are correlated but NOT identical; previous research does NOT distinguish them.

20 © Luo Si July, 2004 Research Problems (Unified Utility Framework)
Unified Utility Maximization Framework (UUM): formalize federated search as a mathematical optimization problem with respect to the different goals of different applications.
Example, document retrieval with the High-Precision goal: maximize the number of relevant documents in the top part of the final ranked list, subject to the number of sources to select, retrieving a fixed number of documents from each selected source.

21 © Luo Si July, 2004 Research Problems (Unified Utility Framework)
Resource selection for federated document retrieval under UUM:
- No simple closed-form solution; solved by dynamic programming (sketch below).
- A variant selects a variable number of documents from the selected sources, constrained by the total number of documents to retrieve.
"Unified Utility Maximization Framework for Resource Selection" (Luo Si & Jamie Callan, CIKM '04)
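As an illustration of the dynamic-programming idea, the sketch below allocates a total document budget across sources to maximize the estimated number of relevant documents retrieved; the input `expected_rel[i][j]` (estimated relevant documents in the top j results of source i) and the unconstrained per-source allocation are simplifying assumptions rather than the exact UUM formulation.

```python
# Sketch of a dynamic-programming allocation in the spirit of UUM document retrieval:
# choose how many documents to retrieve from each source, under a total budget,
# to maximize the estimated number of relevant documents retrieved.
def uum_allocate(expected_rel, total_docs):
    """
    expected_rel: expected_rel[i][j] = estimated relevant docs in the top j results
                  of source i (expected_rel[i][0] == 0, non-decreasing in j)
    total_docs:   total number of documents to retrieve across all sources
    """
    n = len(expected_rel)
    best = [0.0] * (total_docs + 1)                  # best[b]: max utility with budget b so far
    choice = [[0] * (total_docs + 1) for _ in range(n)]
    for i, rel in enumerate(expected_rel):
        new_best = [0.0] * (total_docs + 1)
        for b in range(total_docs + 1):
            for j in range(min(b, len(rel) - 1) + 1):   # retrieve j docs from source i
                value = best[b - j] + rel[j]
                if value > new_best[b]:
                    new_best[b] = value
                    choice[i][b] = j
        best = new_best
    # Walk back through the stored choices to recover the per-source allocation.
    alloc, b = [0] * n, total_docs
    for i in range(n - 1, -1, -1):
        alloc[i] = choice[i][b]
        b -= alloc[i]
    return alloc, best[total_docs]

# Example with hypothetical estimates: two sources and a budget of 4 documents.
# uum_allocate([[0, 0.8, 1.4, 1.7, 1.9], [0, 0.3, 0.5, 0.6, 0.65]], 4)  -> ([3, 1], 2.0)
```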

22 © Luo Si July, 2004 Research Problems (Unified Utility Framework)
Experiments: resource selection for federated document retrieval on the Trec123 and Representative testbeds, with 3 and 10 sources selected and SSL merging of the returned results. (Result tables not reproduced.)

23 © Luo Si July, 2004 Demo
FedStats project: joint work with Jamie Callan, Thi Nhu Truong and Lawrence Yau.

24 © Luo Si July, 2004 Demo
Results merging experiments on FedStats comparing CORI and SSL merging.

25 © Luo Si July, 2004 Conclusion
- Federated search has been a hot research topic over the last decade, but most previous research is tied to the "big document" approach.
- The new research advances the state of the art: a more theoretically solid foundation, more empirically effective methods, and better modeling of real-world applications.
- A bridge from cool research to practical tool.