Web Search – Summer Term 2006 VII. Selected Topics - Metasearch Engines [1] (c) Wolfgang Hürst, Albert-Ludwigs-University.

Introduction
So far: general-purpose search engines (index and search the whole web)
In addition: special-purpose search engines exist (index and search documents from a particular domain)
Problem:
- Both cover only a fraction of the web
- Users sometimes have to address several different search engines
Possible solution: metasearch engines

Motivation
Why should we build metasearch engines?
- To increase the coverage of the web
- To address the scalability problem of searching the web
- To facilitate the invocation of multiple search engines
- To improve retrieval effectiveness

A typical metasearch engine session
1. User sends query to metasearch engine
2. Metasearch engine processes query and sends it to component search engines
3. Component search engines return results (based on local similarity) to metasearch engine
4. Metasearch engine merges and ranks returned results (based on global similarities) and presents them to the user
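The four steps map onto a simple dispatch-collect-merge loop. A minimal Python sketch of such a session; the engine interface (a search method returning scored documents) and the merge function are hypothetical, not part of any real metasearch engine's API:

    def metasearch(query, engines, merge):
        # engines: objects with a search(query) method returning [(doc_id, local_score), ...] (hypothetical interface)
        local_results = {}
        for engine in engines:
            local_results[engine.name] = engine.search(query)   # steps 2-3: dispatch query, collect local rankings
        return merge(local_results)                              # step 4: merge into one globally ranked result list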

A typical architecture
[Diagram: USER INTERFACE, DATABASE SELECTOR, DOCUMENT SELECTOR, QUERY DISPATCHER, RESULT MERGER, and the component SEARCH ENGINEs; it highlights the three core problems: database selection, document selection, and result merging]

Challenges for metasearch engines
Why is it hard to create metasearch engines? Because of the heterogeneity among the involved autonomous component search engines:
- Various indexing methods
- Different document term weighting schemes
- Different query term weighting schemes
- Varying similarity functions
- Different document databases
- Unequal document versions
- Different result representations

A typical architecture (recap)
[Same diagram as before; next: the database selection problem]

The database selection problem
Selection is important if the number of component search engines is large.
Goal: identify as many potentially useful databases as possible while minimizing the number of useless databases wrongly identified as useful.
The decision is usually based on a representative of each database.
Most important questions:
- What are good representatives?
- How do we get them?

Approaches for database selection
Three types of techniques:
- Rough representative approaches: e.g. several keywords or paragraphs
- Statistical representative approaches: statistical information (e.g. the document frequency of a term)
- Learning-based approaches: learned from past experience (training queries or real (previous) user queries)

Rough representative approaches
Typically several keywords or paragraphs; they only give a rather general idea of the database.
Often manually generated; automatic approaches exist (e.g. taking the text from the interface page or the anchor text of pages pointing to the search engine).
Alternatively: involve the user, e.g. by explicit search engine selection or by requiring a subject area to be specified.
Advantages: easy to build, small storage requirements; works well for highly specialized component search engines with diversified topics.
Disadvantage: does not work well for databases whose documents cover diverse interests.

Statistical representative approaches
Examples:
- The D-WISE approach
- The CORI Net approach
- The gGlOSS approach
- Estimating the number of potentially useful documents
- Estimating the similarity of the most similar documents

The D-WISE approach
The representative of a component search engine contains
- the document frequency of each term in the component search engine (values d_ij)
- the number of documents in the database (1 value n_i)
Ranking the component search engines for a given query q:
1. Calculate the cue validity CV_ij of each search term t_j
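The cue validity formula itself was rendered as an image on the original slide; the following LaTeX line is a reconstruction following the usual D-WISE definition (cf. [1]), with d_ij the document frequency of term t_j in database i and n_i the number of documents in database i, so treat it as a sketch rather than a quote from the slide:

    CV_{ij} = \frac{d_{ij}/n_i}{\, d_{ij}/n_i + \left(\sum_{k \neq i} d_{kj}\right) \big/ \left(\sum_{k \neq i} n_k\right)}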

The D-WISE approach (cont.)
2. Calculate the variance CVV_j of the CV_ij's of each query term t_j over all component databases (ACV_j = average of the CV_ij's over all component databases)
3. Compute the ranking score r_i of each component database i with respect to q
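The two formulas on this slide were images as well; a LaTeX reconstruction following the usual D-WISE definitions (N component databases, sum over all query terms t_j; again a sketch, cf. [1]):

    CVV_j = \frac{1}{N} \sum_{i=1}^{N} \left( CV_{ij} - ACV_j \right)^2
    r_i = \sum_{j} CVV_j \cdot d_{ij}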

The D-WISE approach (cont.)
4. Select the component databases with the highest ranking scores
(Intuitive interpretation of the ranking score: it indicates where the useful query terms are concentrated)
Advantages of this approach:
- easily scalable
- easy to compute
Disadvantages:
- ranking scores are relative scores (no absolute quality measures)
- does not distinguish between multiple appearances of a term within a single document
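Putting the four steps together, a small Python sketch of the D-WISE ranking; the data structures (per-database term frequency dictionaries and size list) are illustrative choices, not prescribed by [1]:

    def dwise_rank(query_terms, doc_freqs, sizes):
        # doc_freqs[i][t]: document frequency of term t in database i; sizes[i]: number of docs in database i
        N = len(sizes)
        scores = [0.0] * N
        for t in query_terms:
            cv = []
            for i in range(N):
                own = doc_freqs[i].get(t, 0) / sizes[i]
                others_df = sum(doc_freqs[k].get(t, 0) for k in range(N) if k != i)
                others_n = sum(sizes[k] for k in range(N) if k != i) or 1
                denom = own + others_df / others_n
                cv.append(own / denom if denom > 0 else 0.0)   # cue validity CV_ij
            acv = sum(cv) / N
            cvv = sum((c - acv) ** 2 for c in cv) / N          # cue validity variance CVV_j
            for i in range(N):
                scores[i] += cvv * doc_freqs[i].get(t, 0)      # r_i = sum_j CVV_j * d_ij
        return sorted(range(N), key=lambda i: scores[i], reverse=True)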

Learning-based approaches
Predict the usefulness of a database for new queries based on experience with the database from past queries.
How to obtain these retrieval experiences?
- Use training queries (static learning)
- Use real user queries, accumulate retrieval knowledge gradually and update it continuously (dynamic learning)
- Use a combined learning approach, i.e. start with training queries and update continuously based on real user queries
Examples:
- The MRDD approach
- The SavvySearch approach
- The ProFusion approach

The MRDD approach
Static learning approach, i.e. a set of training queries is given and all relevant documents for each training query have to be identified (manually).
For each training query and component database, a vector reflecting the distribution of the relevant documents is stored:
<r_1, r_2, ...>, with r_i = number of top-ranked documents that must be retrieved to obtain i relevant documents.
Example:
Retrieval result: (d1, ..., d100); relevant documents: d1, d4, d10, d17, d30
Distribution vector: <1, 4, 10, 17, 30>
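For illustration, the distribution vector follows directly from the ranks of the relevant documents; a small Python sketch (the helper name is made up) reproducing the slide's example:

    def distribution_vector(result_ranks, relevant_ids):
        # r_i = number of top-ranked documents needed to obtain i relevant ones
        vector, seen = [], 0
        for rank, doc_id in enumerate(result_ranks, start=1):
            if doc_id in relevant_ids:
                seen += 1
                vector.append(rank)
        return vector

    # Example from the slide: relevant documents d1, d4, d10, d17, d30 in a 100-document result list
    docs = [f"d{i}" for i in range(1, 101)]
    print(distribution_vector(docs, {"d1", "d4", "d10", "d17", "d30"}))   # [1, 4, 10, 17, 30]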

The MRDD approach (cont.)
Application: input is a query q from the user
1. Compare q to the training queries and select the k most similar ones (e.g. k = 8)
2. For each database, calculate the average distribution vector over these k queries
3. Select the databases D_i (and how many documents to take from each) so that precision is maximized
Example: average distribution vectors for three databases D1, D2, D3 (see the illustrative sketch below)
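The numeric vectors of the slide's example did not survive the transcript, so the sketch below uses made-up average distribution vectors. It shows a simple greedy reading of the selection step (always take the next relevant document from the database where it costs the fewest additional retrieved documents); this illustrates the "maximize precision" idea but is not necessarily the exact optimization procedure from [1]:

    def select_databases(avg_vectors, want_relevant):
        # avg_vectors: {db: [r_1, r_2, ...]} average number of docs to fetch to obtain i relevant ones
        taken = {db: 0 for db in avg_vectors}            # relevant docs planned per database
        for _ in range(want_relevant):
            best = min(
                (db for db in avg_vectors if taken[db] < len(avg_vectors[db])),
                key=lambda db: avg_vectors[db][taken[db]]
                               - (avg_vectors[db][taken[db] - 1] if taken[db] else 0),
            )
            taken[best] += 1
        # number of documents to actually retrieve from each selected database
        return {db: avg_vectors[db][k - 1] for db, k in taken.items() if k > 0}

    # Hypothetical average distribution vectors for three databases:
    print(select_databases({"D1": [1, 4, 10], "D2": [2, 3, 5], "D3": [6, 7, 8]}, want_relevant=3))
    # -> {'D1': 1, 'D2': 5}: 3 relevant documents among 6 retrieved (precision 0.5 here; D2 alone would give 0.6 only for its own 3)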

A typical architecture (recap)
[Same diagram as before; next: the document selection problem]

The document selection problem
The naive approach (return all results) does not work because it delivers too many documents.
Goal: retrieve as many potentially useful documents from a component database as possible while minimizing the retrieval of useless documents.
Different categories of approaches:
- User determination
- Weighted allocation
- Learning-based approaches
- Guaranteed retrieval

User determination
Let the user select a number for each component database (or use a default number if none is given by the user).
Advantage: works well for small numbers of component databases and if the user is reasonably familiar with them.
Disadvantage: does not work for larger sets of component databases.

Weighted allocation
Use the rank (or ranking score) computed by the database selection algorithm to determine the number of documents to select from the respective database.
Example: if r_i is the ranking score of database D_i (i = 1, ..., N) and m documents should be returned to the user, then select from database D_i a number of documents proportional to r_i (see the sketch below).
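The allocation formula itself was an image on the slide; a plausible score-proportional instantiation in LaTeX (a reconstruction, so treat the exact form as an assumption rather than a quote from [1]):

    m_i = m \cdot \frac{r_i}{\sum_{k=1}^{N} r_k}

For example, with m = 20 and ranking scores 0.5, 0.3 and 0.2, the three databases would contribute 10, 6 and 4 documents respectively.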

Learning-based approaches
Idea: learn how many documents to retrieve for a given query based on past retrieval experiences with similar queries.
Example: the MRDD approach (see above) is a learning-based approach that already includes document selection.

Guaranteed retrieval
Idea: take global similarities into account (not only local ones, e.g. component-database-dependent document frequencies).
Example approaches:
- Query modification
- Computing the tightest local threshold

A typical architecture (recap)
[Same diagram as before; next: the result merging problem]

Result merging approaches
Goal: combine all returned results (sorted by local similarity) into one single result (sorted by global similarity).
Why is this a hard problem?
- Local similarities are not provided by all search engines
- The heterogeneity of the component search engines makes their similarities hard (if not impossible) to compare
- Local similarities might differ significantly from global ones
- Should documents returned by multiple search engines be treated differently?

Result merging approaches (cont.)
Two types of approaches exist:
Local similarity adjustment, e.g.
- Adjust local similarities using additional information (e.g. the quality of the component database)
- Convert local document ranks to similarities
Global similarity estimation
- Attempts to estimate the true global similarities

Local similarity adjustment
Distinguish three cases:
1. The selected databases (or returned results) are pairwise disjoint (or nearly disjoint)
2. The selected databases (or returned results) overlap but are not identical
3. The selected databases are identical
The latter case is known as data fusion and is normally not considered for metasearch engines.
In the following: assume case 1 (disjoint results).

Local similarity adjustment (case 1)
First sub-case: the component engines return local similarities
1. Normalize the returned similarities
2. Use the database scores to adjust the local similarities (i.e. give higher preference to documents from highly rated databases): if s is the ranking score of database D and s̄ is the average of all these scores, a weight w for database D is derived from s, s̄ and the number of databases N such that w grows with s; define the adjusted similarity as w * x (x = original, returned local similarity). An illustrative instantiation is sketched below.
3. Sort the results by adjusted similarity
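A compact Python sketch of the three steps, assuming non-empty result lists and positive database scores. The weight function used here (one plus the database score's relative deviation from the mean) is only illustrative, since the exact weight formula on the original slide was an image:

    def merge_with_adjustment(results, db_scores):
        # results: {db: [(doc_id, local_sim), ...]}; db_scores: {db: ranking score from database selection}
        mean = sum(db_scores.values()) / len(db_scores)
        merged = []
        for db, docs in results.items():
            top = max(sim for _, sim in docs) or 1.0      # 1. normalize local similarities to [0, 1]
            w = 1.0 + (db_scores[db] - mean) / mean       # 2. illustrative weight, grows with the database score
            merged.extend((doc_id, w * (sim / top)) for doc_id, sim in docs)
        return sorted(merged, key=lambda pair: pair[1], reverse=True)   # 3. sort by adjusted similarity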

Local similarity adjustment (case 1, cont.)
Second sub-case: local similarities are NOT returned
Possible approaches:
- Use the local document rank information directly to perform the merge, e.g.
  1. Arrange the databases based on their scores
  2. Select documents using the round-robin method (see the sketch below)
  (Alternatively: a randomized version where documents are selected based on the probability of the database being relevant)
- Convert the local document ranks to similarities
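A Python sketch of the rank-based round-robin merge, visiting the databases in order of their selection scores; input structures are assumptions for illustration:

    from itertools import zip_longest

    def round_robin_merge(ranked_results, db_order):
        # ranked_results: {db: [doc_id, ...]} local rankings; db_order: databases sorted by selection score
        merged = []
        for round_docs in zip_longest(*(ranked_results[db] for db in db_order)):
            merged.extend(d for d in round_docs if d is not None)   # take one document per database per round
        return merged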

Local similarity adjustment (case 2)
Case 2: the selected databases (or returned results) overlap but are not identical.
How to deal with results returned by multiple search engines? Calculate adjusted similarities as before and combine them using (e.g.) methods from data fusion.
However, this might not work well because of the different coverage of the component databases (an active research field).

Global similarity estimation
Approaches to estimate global similarities:
1. Document fetching
Idea: download the returned documents to obtain (e.g.) term frequencies and to estimate global document frequencies.
Disadvantage: expensive (but remedies exist, e.g. download documents in parallel, keep downloading and analyzing documents while the initial results are presented, download only the beginning portion).
Advantages:
- identifies obsolete URLs
- ranking is based on the current (up-to-date) content
- better result representation

Global similarity estimation (cont.)
2. Use of discovered knowledge
Basic idea: try to figure out the specific document indexing and similarity computation methods used by the different component search engines.
Use this information to
- better compare local similarities
- better adjust local similarities to make them comparable with each other
- better derive global similarities from local ones

Global similarity estimation (cont.)
2. Use of discovered knowledge: examples
- Assume all component search engines use the same indexing and local similarity estimation methods and these do not include collection-dependent statistics -> the local similarities are comparable and can be used directly
- Assume the only difference is the use of (different) stopword lists -> modify the query to make the results comparable
- Assume idf information is also used -> either adjust the local similarities or compute global similarities directly (cont. on next slide)

Global similarity estimation (cont.)
2. Use of discovered knowledge: examples (cont.)
Assume idf information is also used -> either adjust the local similarities or compute global similarities directly.
Case 1: the query q contains just a single term t. The similarity in the component database depends on the local idf of t; to obtain the global similarity, multiply the local similarity by the ratio of the global to the local idf weight of t (see the sketch below).
Case 2: the query q contains multiple terms: see the literature [1].
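A LaTeX sketch of the single-term case, assuming a tf-idf style weighting in which the only database-dependent quantity is the idf of t; the exact formulas were images on the original slide, so this is a reconstruction rather than a quote:

    \mathrm{sim}_{\mathrm{global}}(q, d) = \mathrm{sim}_{\mathrm{local}}(q, d) \cdot \frac{\mathrm{idf}_{\mathrm{global}}(t)}{\mathrm{idf}_{\mathrm{local}}(t)}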

Other challenges for metasearch engines
1. Integrate local systems employing different indexing techniques
2. Integrate local systems supporting different types of queries (e.g. Boolean vs. vector space queries)
3. Discover knowledge about component search engines
4. Develop more effective result merging methods
5. Study the appropriate cooperation between a metasearch engine and the local systems
6. Incorporate new indexing and weighting techniques to build better metasearch engines

Other challenges for metasearch engines (cont.)
7. Improve the effectiveness of metasearch
8. Decide where to place the software components of a metasearch engine
9. Create a standard testbed to evaluate the proposed techniques for database selection, document selection, and result merging
10. Extend metasearch techniques to different types of data sources

References
[1] Meng, Yu, Liu: Building Efficient and Effective Metasearch Engines. ACM Computing Surveys, Vol. 34, No. 1, March 2002.

Recap: IR system and tasks involved
[Diagram: from the information need via the user interface to the query, query processing (parsing & term processing) and the logical view of the information need; on the document side, selecting data for indexing with parsing & term processing; then searching, ranking, result representation, and performance evaluation]

General web search engine architecture
[Diagram: client, query engine, ranking, crawl control, crawler(s), usage feedback, results, queries, the WWW, collection analysis module, indexer module, page repository, and the structure, utility, and text indexes (cf. Fig. 1 in Arasu et al., "Searching the Web", ACM Transactions on Internet Technology, Vol. 1, No. 1, page 4)]

Next week
1. Final grading of the programming exercises
2. Exercises with the PageRank simulation tool (participation mandatory, but no grading)
Exam dates: August 28th and 29th, OR September 18th and 19th, OR by personal arrangement