Web Search - Summer Term 2006 II. Information Retrieval (Models, Cont.) (c) Wolfgang Hürst, Albert-Ludwigs-University.

Classic Retrieval Models 1. Boolean Model (set theoretic) 2. Vector Model (algebraic) 3. Probabilistic Models (probabilistic)

Probabilistic IR Models Based on probability theory Basic idea: Given a document d and a query q, estimate the likelihood of d being relevant for the information need represented by q, i.e. P(R|q,d) Compared to previous models: Boolean and Vector Models: Ranking based on a relevance value which is interpreted as a similarity measure between q and d Probabilistic Models: Ranking based on the estimated likelihood of d being relevant for query q

Probabilistic Modeling Given: Documents d_j = (t_1, t_2, ..., t_n), queries q_i (n = number of index terms) We assume a similar dependence between d and q as before, i.e. relevance depends on the term distribution (Note: Slightly different notation here than before!) Estimating P(R|d,q) directly is often impossible in practice. Instead: Use Bayes' Theorem, i.e. P(R|d,q) = P(d|R,q) · P(R|q) / P(d|q) or P(R|d,q) = P(q|R,d) · P(R|d) / P(q|d)
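Expressed as code, the Bayesian inversion is a one-liner; the probabilities below are made-up illustration values, not estimates from the lecture:

```python
def posterior_relevance(p_d_given_r, p_r, p_d):
    """Bayes' theorem: P(R|d,q) = P(d|R,q) * P(R|q) / P(d|q)."""
    return p_d_given_r * p_r / p_d

# Hypothetical numbers: prior P(R|q) = 0.01; d is ten times more likely
# among relevant docs (0.10) than in the collection overall (0.01).
print(posterior_relevance(p_d_given_r=0.10, p_r=0.01, p_d=0.01))
```

For ranking, only the relative order of these posteriors across documents matters, so constant factors such as P(R|q) can often be dropped.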

Probab. Modeling as Decision Strategy Decision about which docs should be returned, based on a threshold calculated with a cost function C_j. Example:

C_j(R, dec)    | Retrieved | Not retrieved
Relevant doc.  |     0     |       1
Non-rel. doc.  |     2     |       0

Decision based on a risk function that minimizes the expected cost: retrieve d_j iff the expected cost of retrieving it does not exceed the expected cost of rejecting it, i.e. (with the example costs above) 2 · (1 − P(R|d_j,q)) ≤ 1 · P(R|d_j,q)
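A sketch of the resulting decision rule, using the example costs from the table (cost 1 for not retrieving a relevant doc, cost 2 for retrieving a non-relevant one); the function name and defaults are illustrative:

```python
def retrieve(p_rel, cost_miss=1.0, cost_false_alarm=2.0):
    """Retrieve a doc iff the expected cost of retrieving it is at most
    the expected cost of leaving it out."""
    cost_if_retrieved = cost_false_alarm * (1.0 - p_rel)  # non-rel. but shown
    cost_if_not_retrieved = cost_miss * p_rel             # rel. but missed
    return cost_if_retrieved <= cost_if_not_retrieved

print(retrieve(0.8))  # True
print(retrieve(0.5))  # False
```

With these particular costs the rule reduces to retrieving whenever P(R|d,q) ≥ 2/3.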

Probability Estimation Different approaches to estimate P(d|R) exist: Binary Independence Retrieval Model (BIR) Binary Independence Indexing Model (BII) Darmstadt Indexing Approach (DIA) Generally we assume stochastic independence between the terms of one document, i.e. P(d|R) = ∏_i P(t_i|R)
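Under this independence assumption, P(d|R) factorizes into per-term probabilities; a minimal sketch with illustrative values:

```python
import math

def p_doc_given_rel(term_probs):
    """P(d|R) = prod_i P(t_i|R), assuming stochastically independent terms."""
    return math.prod(term_probs)

# Three terms with hypothetical P(t_i|R) values:
print(p_doc_given_rel([0.5, 0.2, 0.1]))
```

In practice one sums log-probabilities instead of multiplying, to avoid numerical underflow for long documents.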

Binary Independence Retr. Model (BIR) Learning: Estimation of the probability distribution based on - a query q_k - a set of documents d_j - respective relevance judgments Application: Generalization to different documents from the collection (but restricted to the same query and to terms from training) [Figure: docs × terms × queries diagram marking the learning range (judged docs for q_k) and the application range (all docs of the collection)]
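The BIR learning step can be sketched as follows; the +0.5 smoothing and the log-odds term weights follow the common Robertson/Sparck Jones formulation of the model, which the slide does not spell out:

```python
import math

def bir_weights(judged_docs, vocab):
    """Estimate a BIR term weight per vocabulary term from a training set
    of (term_set, is_relevant) pairs for one fixed query."""
    rel = [d for d, r in judged_docs if r]
    non = [d for d, r in judged_docs if not r]
    weights = {}
    for t in vocab:
        p = (sum(t in d for d in rel) + 0.5) / (len(rel) + 1)  # P(t|R)
        q = (sum(t in d for d in non) + 0.5) / (len(non) + 1)  # P(t|not R)
        weights[t] = math.log(p * (1 - q) / (q * (1 - p)))
    return weights

def rsv(doc_terms, weights):
    """Retrieval status value of another doc from the same collection."""
    return sum(w for t, w in weights.items() if t in doc_terms)

judged = [({"a", "b"}, True), ({"a", "c"}, True), ({"c"}, False)]
w = bir_weights(judged, vocab={"a", "b", "c"})
print(rsv({"a", "b"}, w) > rsv({"c"}, w))  # True
```

The learned weights then rank unseen documents for the same query, which is exactly the generalization direction the slide describes.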

Binary Indep. Indexing Model (BII) Learning: Estimation of the probability distribution based on - a document d_j - a set of queries q_k - respective relevance judgments Application: Generalization to different queries (but restricted to the same document and to terms from training) [Figure: docs × terms × queries diagram marking the learning range (judged queries for d_j) and the application range (all queries)]

Darmstadt Indexing Approach (DIA) Learning: Estimation of the probability distribution based on - a set of queries q_k - an abstract description of a set of documents d_j - respective relevance judgments Application: Generalization to different queries and documents [Figure: docs × terms × queries diagram marking the learning range and the application range (whole collection, arbitrary queries)]

DIA - Description Step Basic idea: Instead of term-document pairs, consider relevance descriptions x(t_i, d_m) These contain the values of certain attributes of term t_i, document d_m and their relation to each other Examples: - Dictionary information about t_i (e.g. IDF) - Parameters describing d_m (e.g. length or no. of unique terms) - Information about the appearance of t_i in d_m (e.g. in title or abstract), its frequency, the distance between two query terms, etc. REFERENCE: FUHR, BUCKLEY [4]

DIA - Decision Step Estimation of the probability P(R | x(t_i, d_m)) P(R | x(t_i, d_m)) is the probability of a document d_m being relevant to an arbitrary query given that a term common to both document and query has the relevance description x(t_i, d_m). Advantages: - Abstraction from specific term-doc pairs and thus generalization to arbitrary docs and queries - Enables individual, application-specific relevance descriptions

DIA - (Very) Simple Example
EVENT SPACE / RELEVANCE DESCRIPTION: x(t_i, d_m) = (x_1, x_2) with
x_1 = 1 if t_i ∈ title of d_m, 0 otherwise
x_2 = 1 if t_i occurs once in d_m, 2 if t_i occurs at least twice in d_m
TRAINING SET: q_1, q_2, d_1, d_2, d_3

Query | Doc | Rel.     | Terms              | x
q_1   | d_1 | rel.     | t_1, t_2, t_3      | (1,1) (0,1) (1,2)
q_1   | d_2 | not rel. | t_1, t_3, t_4      | (0,2) (1,1) (0,1)
q_2   | d_1 | rel.     | t_2, t_5, t_6, t_7 | (0,2) (1,1) (1,2) (0,2)
q_2   | d_3 | not rel. | t_5, t_7           | (0,1) (0,1)

Resulting estimates:
x     | E(x)
(0,1) | 1/4
(0,2) | 2/3
(1,1) | 2/3
(1,2) | 1
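The estimates in this example can be reproduced by simple counting; a sketch (two entries that did not survive in the slide export, the fourth x value for q_2/d_1 and the per-term values for q_2/d_3, are filled in here so that the counts reproduce the stated estimates):

```python
from collections import defaultdict

# One training event per term shared by query and doc:
# (query, doc, relevant?, relevance description x)
events = [
    ("q1", "d1", True,  (1, 1)), ("q1", "d1", True,  (0, 1)),
    ("q1", "d1", True,  (1, 2)),
    ("q1", "d2", False, (0, 2)), ("q1", "d2", False, (1, 1)),
    ("q1", "d2", False, (0, 1)),
    ("q2", "d1", True,  (0, 2)), ("q2", "d1", True,  (1, 1)),
    ("q2", "d1", True,  (1, 2)), ("q2", "d1", True,  (0, 2)),
    ("q2", "d3", False, (0, 1)), ("q2", "d3", False, (0, 1)),
]

def estimate(events):
    """P(R|x) = (# relevant events with description x) / (# events with x)."""
    total, rel = defaultdict(int), defaultdict(int)
    for _query, _doc, is_rel, x in events:
        total[x] += 1
        rel[x] += is_rel
    return {x: rel[x] / total[x] for x in total}

p = estimate(events)
print(p[(0, 1)], p[(0, 2)], p[(1, 1)], p[(1, 2)])
```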

DIA - Indexing Function Because of the relevance descriptions: Generalization to arbitrary docs and queries Another advantage: Instead of probabilities, we can also use a general indexing function e(x(t_i, d_m)) Note: We have a typical pattern recognition problem here, i.e. - Given: Set of features / parameters and different classes (here: relevant and not relevant) - Goal: Classification based on these features Approaches such as Neural Networks, SVMs, etc. can be used.

Models for IR - Taxonomy Classic models: Boolean model (based on set theory) Vector space model (based on algebra) Probabilistic models (based on probability theory) Refinements: Fuzzy set model, Extended Boolean model (set theoretic) Generalized vector model, Latent semantic indexing, Neural networks (algebraic) Inference networks, Belief network (probabilistic) Further models: Structured Models Models for Browsing Filtering SOURCE: R. BAEZA-YATES [1]

References & Recommended Reading [1] R. BAEZA-YATES, B. RIBEIRO-NETO: MODERN IR, ADDISON WESLEY, 1999 CHAPTER 2 (IR MODELS), CH. 5 (RELEVANCE FEEDBACK) [2] N. FUHR: SKRIPTUM ZUR VORLESUNG INFORMATION RETRIEVAL (LECTURE NOTES), AVAILABLE ONLINE AT THE COURSE HOME PAGE, CHAPTER 5.5, 6 (IR MODELS) [3] F. CRESTANI, M. LALMAS, C.J. VAN RIJSBERGEN, I. CAMPBELL: IS THIS DOCUMENT RELEVANT?... PROBABLY: A SURVEY OF PROBABILISTIC MODELS IN INFORMATION RETRIEVAL, ACM COMPUTING SURVEYS, VOL. 30, NO. 4, DEC. 1998 (PROBABILISTIC MODELS) [4] N. FUHR, C. BUCKLEY: A PROBABILISTIC LEARNING APPROACH FOR DOCUMENT INDEXING, ACM TRANSACTIONS ON INFORMATION SYSTEMS, VOL. 9, NO. 3, JULY 1991 CHAPTER 2 AND 4 (PROBABILISTIC MODELS)

Web Search - Summer Term 2006 II. Information Retrieval (Basics: Relevance Feedback) (c) Wolfgang Hürst, Albert-Ludwigs-University

Relevance Feedback Motivation : Formulating a good query is often difficult Idea : Improve search result by indicating the relevance of the initially returned docs Possible usage : - Get better search results - Re-train the current IR model Different approaches based on - User feedback - Local information in the initial result set - Global information in the whole doc. coll.

Relev. Feedb. based on User Input Procedure: - User enters initial query - System returns a result based on this query - User marks relevant documents - System selects important terms from the marked docs - System returns a new result based on these terms Two approaches: - Query Expansion - Term Re-weighting Advantages: - Breaks down the search task into smaller steps - Relevance judgments are easier to make than the (re-)formulation of a query - Controlled process to emphasize relevant terms and de-emphasize irrelevant ones

Query Expansion & Term Re-Weighting for the Vector Model Vector Space Model : Representation of documents and queries as weighted vectors of terms Assumption : - Large overlap of term sets from relevant documents - Small overlap of term sets from irrelevant docs. Basic idea : Re-formulate the query in order to get the query vector closer to the documents marked as relevant

Optimum Query Vector D_r = set of returned docs marked as rel. D_n = set of returned docs marked as irrel. C_r = set of all relevant docs in the doc. coll. |D_r|, |D_n|, |C_r| = no. of docs in the respective doc. sets (N = no. of docs in the collection) α, β, γ = constant factors (for fine-tuning) Best query vector to distinguish relevant from non-relevant docs: q_opt = (1 / |C_r|) · Σ_{d_j ∈ C_r} d_j − (1 / (N − |C_r|)) · Σ_{d_j ∉ C_r} d_j
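A sketch of q_opt as the difference of two centroids, with doc vectors as plain Python lists (the tiny collection and its ids are hypothetical):

```python
def optimal_query(all_docs, relevant_ids):
    """q_opt = (1/|C_r|) * sum of relevant doc vectors
             - (1/(N - |C_r|)) * sum of non-relevant doc vectors."""
    dims = len(next(iter(all_docs.values())))
    n_rel = len(relevant_ids)
    n_non = len(all_docs) - n_rel
    q_opt = [0.0] * dims
    for doc_id, vec in all_docs.items():
        weight = 1.0 / n_rel if doc_id in relevant_ids else -1.0 / n_non
        for i, v in enumerate(vec):
            q_opt[i] += weight * v
    return q_opt

docs = {"d1": [1.0, 0.0], "d2": [1.0, 1.0], "d3": [0.0, 1.0]}
print(optimal_query(docs, relevant_ids={"d1", "d2"}))  # [1.0, -0.5]
```

Note that q_opt is only a theoretical optimum: computing it requires C_r, the complete set of relevant docs, which is exactly what retrieval is trying to find.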

Query Expansion & Term Re-Weighting Based on the relevance feedback from the user we incrementally change the initial query vector q to create a better query vector q_m Goal: Approximation of the optimum query vector q_opt Standard_Rocchio approach: q_m = α·q + (β / |D_r|) · Σ_{d_j ∈ D_r} d_j − (γ / |D_n|) · Σ_{d_j ∈ D_n} d_j Other approaches exist, e.g. Ide_Regular, Ide_Dec_Hi
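A minimal sketch of the Standard_Rocchio update; the default values for α, β, γ and the clipping of negative weights to zero are common conventions, not prescribed by the slide:

```python
def rocchio(q, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """q_m = alpha*q + (beta/|D_r|) * sum(D_r) - (gamma/|D_n|) * sum(D_n)."""
    q_m = [alpha * x for x in q]
    for d in rel_docs:
        for i, v in enumerate(d):
            q_m[i] += beta / len(rel_docs) * v
    for d in nonrel_docs:
        for i, v in enumerate(d):
            q_m[i] -= gamma / len(nonrel_docs) * v
    return [max(0.0, x) for x in q_m]  # clip negative term weights

q_m = rocchio([1.0, 0.0], rel_docs=[[1.0, 1.0]], nonrel_docs=[[0.0, 1.0]])
print(q_m)
```

Each feedback round can feed q_m back in as the new q, moving the query step by step toward the relevant region of the vector space.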

Relev. Feedb. without User Input Different approaches based on - User feedback - Local information in the initial result set - Global information in the whole doc. coll. Basic idea of relevance feedback: Clustering, i.e. the docs marked as relevant contain additional terms which describe a larger cluster of relevant docs So far: Get user feedback to create this term set Now: Approaches to get these term sets automatically Two approaches: - Local strategies (based on the returned result set) - Global strategies (based on the whole doc. collection)

Query Exp. Through Local Clustering Motivation : Given a query q, there exists a local relationship between relevant documents Basic idea : Expand query q with additional terms based on a clustering of the documents from the initial result set Different approaches exist: - Association Clusters : Assume a correlation between terms co-occurring in different docs - Metric Clusters : Assume a correlation between terms close to each other (in a document) - Scalar Clusters : Assume a correlation between terms with a similar neighborhood

Metric Clusters Note: In the following we consider word stems s instead of terms (analogous to the literature; works similarly with terms) r(t_i, t_j) = distance between two terms t_i and t_j in a document d (in no. of terms) s(t) = root (stem) of term t V(s) = set of all words with root s Define a local stem-stem correlation matrix S with elements s_{u,v} based on the correlation c_{u,v} = Σ_{t_i ∈ V(s_u)} Σ_{t_j ∈ V(s_v)} 1 / r(t_i, t_j) or normalized: s_{u,v} = c_{u,v} / (|V(s_u)| · |V(s_v)|)

Query Exp. with Metric Clusters Clusters based on the metric correlation matrix: Given a stem s_u, return the n stems s_v with the highest values s_{u,v} Use these clusters for query expansion Comments: - Clusters do not necessarily contain synonyms - Non-normalized clusters often contain high-frequency terms - Normalized clusters often group terms that appear less often - Therefore: Combined approaches exist (i.e. using normalized and non-normalized clusters)
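A sketch of both steps, building the correlation matrix and expanding a query stem; it operates directly on stem sequences, whereas a real implementation would first reduce tokens to stems:

```python
from collections import defaultdict

def metric_correlations(doc_stem_lists):
    """c_{u,v}: sum of 1/r(t_i, t_j) over co-occurrences within a doc,
    where r is the distance in token positions."""
    c = defaultdict(float)
    for stems in doc_stem_lists:
        for i, s_u in enumerate(stems):
            for j, s_v in enumerate(stems):
                if i != j and s_u != s_v:
                    c[(s_u, s_v)] += 1.0 / abs(i - j)
    return c

def expand(query_stem, c, n=2):
    """Return the n stems with the highest correlation to query_stem."""
    ranked = sorted(((v, s_v) for (s_u, s_v), v in c.items()
                     if s_u == query_stem), reverse=True)
    return [s_v for _v, s_v in ranked[:n]]

docs = [["web", "search", "engine"], ["search", "engine", "ranking"]]
c = metric_correlations(docs)
print(expand("search", c))  # ['engine', 'web']
```

Dividing each entry by |V(s_u)| · |V(s_v)| would give the normalized variant from the previous slide.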

Overview of Approaches Based on user feedback Based on local information in the initial result set - Local clustering - Local context analysis (combine local and global info) Based on global information in the whole document collection, examples: - Query expansion using a similarity thesaurus - Query expansion using a statistical thesaurus

References (Books) R. BAEZA-YATES, B. RIBEIRO-NETO: MODERN INFORMATION RETRIEVAL, ADDISON WESLEY, 1999 WILLIAM B. FRAKES, RICARDO BAEZA-YATES (EDS.): INFORMATION RETRIEVAL – DATA STRUCTURES AND ALGORITHMS, P T R PRENTICE HALL, 1992 C. J. VAN RIJSBERGEN: INFORMATION RETRIEVAL, 1979 C. MANNING, P. RAGHAVAN, H. SCHÜTZE: INTRODUCTION TO INFORMATION RETRIEVAL (TO APPEAR 2007) I. WITTEN, A. MOFFAT, T. BELL: MANAGING GIGABYTES, MORGAN KAUFMANN PUBLISHING, 1999 N. FUHR: SKRIPTUM ZUR VORLESUNG INFORMATION RETRIEVAL (LECTURE NOTES), SS 2006 AND MANY MORE!

References (Articles) G. SALTON: A BLUEPRINT FOR AUTOMATIC INDEXING, ACM SIGIR FORUM, VOL. 16, ISSUE 2, FALL 1981 F. CRESTANI, M. LALMAS, C.J. VAN RIJSBERGEN, I. CAMPBELL: IS THIS DOCUMENT RELEVANT?... PROBABLY: A SURVEY OF PROBABILISTIC MODELS IN INFORMATION RETRIEVAL, ACM COMPUTING SURVEYS, VOL. 30, NO. 4, DEC. 1998 N. FUHR, C. BUCKLEY: A PROBABILISTIC LEARNING APPROACH FOR DOCUMENT INDEXING, ACM TRANSACTIONS ON INFORMATION SYSTEMS, VOL. 9, NO. 3, JULY 1991 Further Sources: IR-RELATED CONFERENCES: ACM SIGIR International Conference on Information Retrieval ACM / IEEE Joint Conference on Digital Libraries (JCDL) ACM Conference on Information and Knowledge Management (CIKM) Text REtrieval Conference (TREC)

Recap: IR System & Tasks Involved [Diagram: the user's INFORMATION NEED is turned via the user interface into a QUERY, which undergoes QUERY PROCESSING (PARSING & TERM PROCESSING) to form the LOGICAL VIEW OF THE INFORMATION NEED; on the document side, DOCUMENTS are SELECTED FOR INDEXING and pass PARSING & TERM PROCESSING into the INDEX; SEARCHING and RANKING produce RESULTS, followed by RESULT REPRESENTATION and PERFORMANCE EVALUATION]

Schedule Introduction IR-Basics (Lectures) Overview, terms and definitions Index (inverted files) Term processing Query processing Ranking (TF*IDF, …) Evaluation IR-Models (Boolean, vector, probabilistic) IR-Basics (Exercises) Web Search (Lectures and exercises)

Organizational Remarks Exercises: Please register for the exercises by sending me an e-mail containing - Your name, - Matrikelnummer (student ID number), - Studiengang (course of study: BA, MSc, Diploma, ...) - Plans for taking the exam (yes, no, undecided) This is just to organize the exercises, i.e. there are no consequences if you decide to drop this course. Registrations should be done before the exercises start. Later registration might be possible under certain circumstances (contact me).