Web Search – Summer Term 2006 II. Information Retrieval (Basics Cont.) (c) Wolfgang Hürst, Albert-Ludwigs-University

Organizational Remarks
Exercises: Please register for the exercises by sending me an e-mail by Friday, May 5th, with your name, your matriculation number (Matrikelnummer), your degree program (Studiengang), and whether you plan to take the exam. This is just to organize the exercises; it has no consequences if you decide to drop this course later.

Recap: IR System & Tasks Involved
[Diagram, built up over three slides: the user interface turns the user's INFORMATION NEED into a QUERY; query processing (parsing & term processing) produces the logical view of the information need; on the document side, data is selected for indexing and runs through parsing & term processing to build the INDEX; searching and ranking against the index yield the RESULTS, which are presented via the result representation; performance evaluation covers the whole process.]

Query Languages: Boolean Search
So far: a) single terms (unrelated, bag of words), b) Boolean conjunctions (AND, OR, NOT). Boolean search was the main search model before the Web came along (note: mainly professional users). Advantages of Boolean queries: precise (mathematical model), offers great control and transparency, good for domains where ranking is done by other means than relevance, e.g. chronologically.

Boolean Search (Cont.)
Disadvantages of Boolean queries: sometimes hard to specify, even for experts; binary decision (relevant or not); bag of words, no positions.
Example (single terms): Query: New York City. Doc. 1: This is a nice city. Doc. 2: This city has a new library. (Both documents match individual terms, neither is about New York City.)
Example (conjunction): Query: New AND York AND City. Doc. 1: New York has a new library. Doc. 2: The city of York has a new library. (Only Doc. 2 contains all three terms, although it is about York, not New York; the relevant Doc. 1 is missed.)
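
To make the example concrete, here is a minimal sketch (not from the slides) of Boolean AND retrieval over an inverted index; the two documents are the ones from the conjunction example, and term processing is reduced to lowercasing and stripping punctuation:

```python
from collections import defaultdict

def tokenize(text):
    # Minimal term processing: lowercase, strip punctuation
    return [w.strip(".,").lower() for w in text.split()]

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def boolean_and(index, query_terms):
    # Intersect the postings sets of all query terms
    postings = [index.get(t.lower(), set()) for t in query_terms]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "New York has a new library.",
    2: "The city of York has a new library.",
}
index = build_index(docs)
print(boolean_and(index, ["New", "York", "City"]))  # {2} -- the wrong document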

Further Query Types
- Phrases, e.g. "New York City"
- Proximity, e.g. University NEAR Freiburg (finds University of Freiburg and Albert-Ludwigs-University Freiburg)
- Structural queries, e.g. AUTHOR = Ottmann AND TEXT CONTAINS binary search tree
- Natural language vs. keywords
- Pattern matching, e.g. wildcards: index* (finds index, indexing, indexes, indexer, …)
- Spelling corrections
- and some more (often application dependent)

Phrases
Often used (esp. for web search): quotes, e.g. "New York City". Advantage: easy and seems to work well (about 10% of web queries are such phrases according to Manning et al. [2]). How do we support this? We need word positions. We need all original words (e.g. no stop word removal in University of Freiburg). We need an efficient way to do this.

Approaches to Support Phrases
Biword indexes: Idea: store pairs of consecutive words (in addition to single terms), e.g. New York City is represented by the terms New, York, City, New York, York City. Might cause problems for phrases with more than 2 words, but often works quite well (see the sketch below).
Positional indexes: Idea: store the position of each word in the postings list.
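
A minimal sketch of the biword idea, assuming a toy whitespace tokenizer; the phrase query is answered as a Boolean AND over its biwords:

```python
from collections import defaultdict

def biword_index(docs):
    # Index single terms plus pairs of consecutive words (biwords)
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().split()
        for term in words:
            index[term].add(doc_id)
        for w1, w2 in zip(words, words[1:]):
            index[w1 + " " + w2].add(doc_id)
    return index

index = biword_index({1: "new york city", 2: "york city council"})
# The phrase "new york city" becomes the AND query: "new york" AND "york city"
print(index["new york"] & index["york city"])  # {1}
```

For phrases of three or more words this intersection can return false positives (a document containing "new york" and "york city" in unrelated places), which is the problem mentioned above.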

Positional Indexes – Example
Dictionary entries (term, frequency) point to postings lists that now store positions; notation: docID : term frequency [positions].
CITY (18453): …, … : 4 [3, 12, 46, 78], 25 : 3 [43, 120, 221], 32 : 6 [12, 20, 57, 200, 322, 481], …
NEW (23535): …, 25 : 6 [41, 87, 136, …], …
YORK (9421): …, 25 : 2 [42, 137], …
E.g. in document 25, NEW occurs at position 41, YORK at 42, and CITY at 43, so document 25 contains the phrase "New York City".

Positional Indexes
Also works for queries such as University [1 word] Freiburg or University NEAR Freiburg. Problem: size. We need to store additional information (positions) in an already large index (stop words!). Approximate size: 2-4 times the original index, about 1/2 the size of the uncompressed documents [2]. In practice, combinations exist, e.g. index with names as phrases and useful biwords, and store positions.
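
A sketch of phrase matching with a positional index (an assumed toy implementation, not code from the lecture): candidate documents must contain all terms, and a hit requires the terms at consecutive positions. NEAR queries would work the same way, only with a position difference of at most k instead of exactly one.

```python
from collections import defaultdict

def positional_index(docs):
    # term -> {doc_id: [positions]}
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    terms = phrase.lower().split()
    postings = [index[t] for t in terms]
    # Candidate docs must contain all terms
    candidates = set(postings[0]).intersection(*postings[1:])
    hits = set()
    for doc_id in candidates:
        # Check consecutive positions: term i must appear at position p + i
        for p in postings[0][doc_id]:
            if all(p + i in postings[i][doc_id] for i in range(1, len(terms))):
                hits.add(doc_id)
                break
    return hits

docs = {25: "a new york city library", 32: "the city of york"}
index = positional_index(docs)
print(phrase_query(index, "new york city"))  # {25}
```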

Pattern Matching – Wildcards
Example: fußball* is mapped to fußballer, fußballspiel, fußballweltmeister, …
Trailing wildcard queries, e.g. fußball*: can easily be answered if the dictionary is stored as a B-tree.
Leading wildcard queries, e.g. *meister: can easily be answered if the dictionary is additionally stored as a reverse B-tree (i.e. terms stored backwards).

Wildcards (Cont.)
General wildcards, e.g. f*ball (matches e.g. fußball, federball, …). Idea: move the * to the end. Permuterm index: for each word (e.g. fußball) add an end symbol (fußball$) and create all rotations (fußball$, ußball$f, ßball$fu, ball$fuß, …, l$fußbal, $fußball). In the permuterm index, dictionary = all permuterms, postings = the dictionary terms containing this rotation. Query: rotate so that the * ends up at the end (e.g. f*ball becomes ball$f*) and answer it as a prefix lookup in the permuterm index (matching e.g. ball$fuß, ball$feder, …).
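
The lookup can be sketched as follows; the rotation and prefix-scan logic follows the slide, while the data structure (a sorted list of rotations scanned with bisect instead of a B-tree) is a simplification:

```python
import bisect

def rotations(term):
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def build_permuterm(vocabulary):
    # permuterm rotation -> original dictionary term
    permuterm = {}
    for term in vocabulary:
        for rot in rotations(term):
            permuterm[rot] = term
    return sorted(permuterm), permuterm

def wildcard_query(keys, permuterm, query):
    # Single '*' wildcard: rotate so the '*' is trailing,
    # then do a prefix scan over the sorted permuterm keys.
    left, right = query.split("*")
    prefix = right + "$" + left          # f*ball -> ball$f
    lo = bisect.bisect_left(keys, prefix)
    matches = set()
    while lo < len(keys) and keys[lo].startswith(prefix):
        matches.add(permuterm[keys[lo]])
        lo += 1
    return matches

keys, permuterm = build_permuterm(["fußball", "federball", "fußballer"])
print(wildcard_query(keys, permuterm, "f*ball"))  # {'fußball', 'federball'}
```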

Structural Queries
In practice: often semi-structured documents. Structural queries use the available structure to better specify the information need, e.g. AUTHOR = Ottmann AND TEXT CONTAINS search tree. Requires storing structure information, e.g. in a parametric index, encoded either in the dictionary (separate entries OTTMANN.AUTHOR, OTTMANN.TITLE, OTTMANN.BODY, each with its own postings list) or in the postings (one entry OTTMANN with zone-qualified postings such as 8.BODY, 9.AUTHOR, 9.BODY, 12.TITLE, …).
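
A toy version of the second encoding (the zone stored with each posting); the document contents and zone names below are made up for illustration:

```python
from collections import defaultdict

def parametric_index(docs):
    # Encode the zone in the postings: term -> {(doc_id, zone), ...}
    index = defaultdict(set)
    for doc_id, zones in docs.items():
        for zone, text in zones.items():
            for term in text.lower().split():
                index[term].add((doc_id, zone))
    return index

docs = {
    8: {"author": "Smith", "body": "sorting with Ottmann trees"},
    9: {"author": "Ottmann", "body": "binary search tree algorithms"},
}
index = parametric_index(docs)

# AUTHOR = Ottmann AND TEXT CONTAINS tree
authors = {d for d, z in index["ottmann"] if z == "author"}
body = {d for d, z in index["tree"] if z == "body"}
print(authors & body)  # {9}
```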

Summary: Further Query Types
- Phrases, e.g. "New York City"
- Proximity, e.g. University NEAR Freiburg (finds University of Freiburg and Albert-Ludwigs-University Freiburg)
- Structural queries, e.g. AUTHOR = Ottmann AND TEXT CONTAINS binary search tree
- Natural language vs. keywords
- Pattern matching, e.g. wildcards: index* (finds index, indexing, indexes, indexer, …)
- Spelling corrections
- and some more (often application dependent)

Recap: IR System & Tasks Involved
[The IR system diagram from the beginning of the lecture, shown again: information need → query → query processing → searching / ranking → result representation; documents → selection, parsing & term processing → index; performance evaluation spans the process.]

Ranking – Motivation
So far: mapping of processed words from the query to processed words from the documents yields a set of (hopefully) relevant documents. Similar to Boolean search, the combination is either explicitly specified by the user (q1 AND q2) or implicitly done by the system, e.g. by returning docs with all query terms (AND) or docs with any query term (OR). Intuitively: a document containing more different query terms than another one seems more relevant.

Estimating Relevance
Question: How can we estimate relevance based on a given query and a document collection?
- Different terms might have a different influence on relevance, e.g. stop words are less relevant than names.
- Documents containing more (different) query terms might be more relevant, e.g. New York (state and city) vs. New York City.
- Documents containing an important term more often might be more relevant, e.g. for a one-term query: doc. 1 contains the query term 200 times, doc. 2 just 5 times.

Example for Term Weighting (source: Frakes et al. [3], page 365)
Vocabulary: FACTORS, INFORMATION, HELP, HUMAN, OPERATION, RETRIEVAL, SYSTEMS
Query = {HUMAN FACTORS IN INFORMATION RETRIEVAL SYSTEMS}, vector representation = (1, 1, 0, 1, 0, 1, 1)
Document 1 = {HUMAN, FACTORS, INFORMATION, RETRIEVAL}, vector representation = (1, 1, 0, 1, 0, 1, 0)
Document 2 = {HUMAN, FACTORS, HELP, SYSTEMS}, vector representation = (1, 0, 1, 1, 0, 0, 1)
Document 3 = {FACTORS, OPERATION, SYSTEMS}, vector representation = (1, 0, 0, 0, 1, 0, 1)
Simple match (inner product of binary vectors): Query · Doc 1 = 4, Query · Doc 2 = 3, Query · Doc 3 = 2
Weighted match (inner product with a weighted query vector; the weights themselves are not preserved in the transcript): Query · Doc 1 = 13, Query · Doc 2 = 8, Query · Doc 3 = 3
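
A small sketch of the two scoring variants. The binary vectors are the ones from the example above; the term weights are NOT from Frakes et al. but one illustrative assignment chosen to reproduce the scores 13, 8, 3 on the slide:

```python
# Binary incidence vectors over the vocabulary
# (FACTORS, INFORMATION, HELP, HUMAN, OPERATION, RETRIEVAL, SYSTEMS)
query = (1, 1, 0, 1, 0, 1, 1)
docs = {
    1: (1, 1, 0, 1, 0, 1, 0),
    2: (1, 0, 1, 1, 0, 0, 1),
    3: (1, 0, 0, 0, 1, 0, 1),
}

def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))

# Simple match: binary query vector
for d, vec in docs.items():
    print(f"simple match doc {d}: {inner_product(query, vec)}")      # 4, 3, 2

# Weighted match: query terms carry weights (hypothetical values)
weighted_query = (2, 3, 0, 5, 0, 3, 1)
for d, vec in docs.items():
    print(f"weighted match doc {d}: {inner_product(weighted_query, vec)}")  # 13, 8, 3
```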

Term Frequency (TF)
In practice: various experiments have confirmed that term frequency (TF) is a significant measure for relevance. But: it depends on the document's length. Therefore: normalization

TF(T, D) = #T / DL

with #T = frequency of term T in document D, and DL = document length = number of terms in D.
[Figure: term distribution, terms sorted by number of appearances vs. number of appearances.]

Inverse Document Frequency (IDF)
Observation: the relevance of a term also depends on its frequency in the whole collection. Example: query = Amazon Rain Forest: the term Amazon discriminates well in a newspaper archive, but poorly in a collection of Amazon.com press releases.
Inverse Document Frequency (in one common form; the slide's exact formula is not preserved): IDF(T) = log (N / DF(T)), with N = number of documents in the collection and DF(T) = number of documents containing T.

The TF*IDF Measure
TF(T, D) = number of appearances in one document: an estimate of how well a term represents the content of a single document (intra-document frequency).
IDF(T) = inverse of the number of appearances in the collection: an estimate of how well a term separates different documents (inverse of the inter-document frequency).
Combined measure / weight: TF*IDF(T, D) = TF(T, D) * IDF(T) = (#T / DL) * log (N / DF(T)) (#T, DL, N as defined before).
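
A sketch of the measure using the definitions above; the whitespace tokenization is an assumption, and real systems would precompute document frequencies rather than scan the collection per term:

```python
import math

def tf_idf(term, doc, collection):
    """TF*IDF as sketched above: TF = #T / DL, IDF = log(N / DF)."""
    words = doc.lower().split()
    tf = words.count(term) / len(words)                    # #T / DL
    df = sum(term in d.lower().split() for d in collection)
    idf = math.log(len(collection) / df) if df else 0.0
    return tf * idf

collection = [
    "new york city library",
    "the city of york",
    "rain forest news",
]
print(tf_idf("york", collection[0], collection))  # 0.25 * ln(3/2) ~ 0.10
```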

TF*IDF Weighting – Comments
Note: different definitions / versions exist. Depending on the application and data, other weights might be used, e.g.:
- structure information (e.g. term in title, abstract, …)
- popularity (e.g. Titanic in a movie database)
- relative position between terms (e.g. Amazon Rain Forest vs. Amazon Press Releases)
- date (e.g. news archive: newer = more relevant)
- layout (e.g. boldfaced font)
- etc.
However, TF*IDF often has a high impact.

Two of the Most Important Weighting Functions (source: Amit Singhal, Modern Information Retrieval: A Brief Overview, IEEE Data Engineering Bulletin, 2001)
Okapi weighting based document score (reconstructed from Singhal's paper, as the slide image is not preserved):

$\sum_{t \in Q} \ln\frac{N - df + 0.5}{df + 0.5} \cdot \frac{(k_1 + 1)\,tf}{k_1\left((1 - b) + b\,\frac{dl}{avdl}\right) + tf} \cdot \frac{(k_3 + 1)\,qtf}{k_3 + qtf}$

Pivoted normalization weighting based document score (likewise from Singhal):

$\sum_{t \in Q} \frac{1 + \ln(1 + \ln(tf))}{(1 - s) + s\,\frac{dl}{avdl}} \cdot qtf \cdot \ln\frac{N + 1}{df}$

with tf = the term's frequency in the document, qtf = the term's frequency in the query, N = the total number of documents in the collection, df = the number of documents that contain the term, dl = the document length (in bytes), avdl = the average document length, and k1, b, k3, s tuning parameters.
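
For illustration, a compact sketch of the Okapi score above with the common parameter choices k1 = 1.2 and b = 0.75; the query-term factor is dropped (i.e. qtf = 1) and document length is counted in words rather than bytes, both simplifications:

```python
import math

def okapi_score(query, doc, collection, k1=1.2, b=0.75):
    """Okapi weighting as in the formula above, with qtf = 1 per term."""
    N = len(collection)
    avdl = sum(len(d.split()) for d in collection) / N
    words = doc.lower().split()
    dl = len(words)
    score = 0.0
    for t in set(query.lower().split()):
        df = sum(t in d.lower().split() for d in collection)
        if df == 0:
            continue
        tf = words.count(t)
        idf = math.log((N - df + 0.5) / (df + 0.5))
        score += idf * (k1 + 1) * tf / (k1 * ((1 - b) + b * dl / avdl) + tf)
    return score

collection = ["new york city library", "the city of york", "rain forest news"]
print(okapi_score("rain forest", collection[2], collection))
```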

Recap: IR System & Tasks Involved
[The IR system diagram shown once more as a transition to the evaluation part of the lecture.]

Evaluation of IR Systems
Standard approaches for algorithm and computer system evaluation: speed / processing time, storage requirements, correctness of the used algorithms and their implementation. But most importantly: performance / effectiveness. Another important issue: usability and the users' perception. Questions: What is a good / better search engine? How do we measure search engine quality? How do we perform evaluations? Etc.

What does Performance / Effectiveness of IR Systems Mean?
Typical questions: How good is the quality of a system? Which system should I buy? Which one is better? How can I measure the quality of a system? What does quality mean for me? Etc. The answers depend on users, applications, …: there are very different views and perceptions (user vs. search engine provider, developer vs. manager, seller vs. buyer, …). And remember: queries can be ambiguous, unspecific, etc. Hence, in practice, we use restrictions and idealizations, e.g. only binary relevance decisions.

Precision & Recall

PRECISION = (# found & relevant) / (# found)
RECALL = (# found & relevant) / (# relevant)

[Figure: a document collection A-J with the relevant documents marked, next to a ranked result list: 1. Doc. B, 2. Doc. E, 3. Doc. F, 4. Doc. G, 5. Doc. D, 6. Doc. H.]
Restrictions: 0/1 relevance, set instead of order/ranking. But: we can use this for the evaluation of rankings, too (via the top N documents).

Calculating Precision & Recall
Precision: can be calculated directly from the result. Recall: requires relevance judgments for the whole (!) data collection. In practice, approaches to estimate recall: 1.) use a representative sample instead of the whole data collection, 2.) document-source method, 3.) expanding queries, 4.) compare the result with external sources, 5.) pooling method.

Precision & Recall – Special Cases
Special treatment is necessary if no document is found or no relevant documents exist (division by zero). With A = # found & relevant, B = # found & not relevant, C = # relevant & not found (the remaining cell D completes the 2×2 contingency table):
No relevant document exists: A = C = 0 (recall undefined); 1st case: B = 0, 2nd case: B > 0.
Empty result set: A = B = 0 (precision undefined); 1st case: C = 0, 2nd case: C > 0.
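
A small helper computing set-based precision and recall; the treatment of the two zero cases is one possible convention (an assumption, not prescribed by the slide):

```python
def precision_recall(found, relevant):
    """Set-based precision and recall with explicit edge cases:
    an empty result for a query without relevant documents counts
    as perfect, otherwise the undefined ratio is scored as 0/1."""
    found, relevant = set(found), set(relevant)
    hits = len(found & relevant)
    precision = hits / len(found) if found else (1.0 if not relevant else 0.0)
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall

# Result set B, E, F, G, D, H evaluated against a hypothetical
# set of relevant documents {B, D, E, J}
print(precision_recall(["B", "E", "F", "G", "D", "H"], {"B", "D", "E", "J"}))
# (0.5, 0.75)
```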

Precision & Recall Graphs
Comparing two systems: Prec1 = 0.6, Rec1 = 0.3; Prec2 = 0.4, Rec2 = 0.6. Which one is better?
[Precision-recall graph: precision on one axis, recall on the other, with the two systems plotted.]

References & Recommended Reading
[1] R. Baeza-Yates, B. Ribeiro-Neto: Modern Information Retrieval, Addison Wesley, 1999. Chapter 4 (query languages).
[2] C. Manning, P. Raghavan, H. Schütze: Introduction to Information Retrieval (to appear 2007). Chapters 1.4, 2.2.2, 4.1, 6.1 (query languages), chapter 6.2 (ranking / relevance). Draft available online at ~schuetze/information-retrieval-book.html
[3] William B. Frakes, Ricardo Baeza-Yates (eds.): Information Retrieval – Data Structures and Algorithms, P T R Prentice Hall, 1992. Chapter 14 (ranking algorithms).
[4] G. Salton: A Blueprint for Automatic Indexing, ACM SIGIR Forum, Vol. 16, Issue 2, Fall 1981 (term processing, ranking / relevance).
(References for evaluation: next time.)

Schedule
- Introduction
- IR-Basics (Lectures): overview, terms and definitions; index (inverted files); term processing; query processing; ranking (TF*IDF, …); evaluation; IR-Models (Boolean, vector, probabilistic)
- IR-Basics (Exercises)
- Web Search (Lectures and exercises)