Web Search - Summer Term 2006 II. Information Retrieval (Models, Cont.) (c) Wolfgang Hürst, Albert-Ludwigs-University
Classic Retrieval Models 1. Boolean Model (set theoretic) 2. Vector Model (algebraic) 3. Probabilistic Models (probabilistic)
Probabilistic IR Models
Based on probability theory.
Basic idea: Given a document d and a query q, estimate the likelihood of d being relevant for the information need represented by q, i.e. P(R|q,d).
Compared to previous models:
- Boolean and Vector Models: Ranking based on a relevance value which is interpreted as a similarity measure between q and d
- Probabilistic Models: Ranking based on the estimated likelihood of d being relevant for query q
Probabilistic Modeling
Given: Documents d_j = (t_1, t_2, ..., t_n), queries q_i (n = no. of terms).
We assume a similar dependence between d and q as before, i.e. relevance depends on the term distribution. (Note: Slightly different notation here than before!)
Estimating P(R|d,q) directly is often impossible in practice. Instead: Use Bayes' theorem, i.e.
  P(R|d,q) = P(d,q|R) · P(R) / P(d,q)
or
  O(R|d,q) = P(R|d,q) / P(NR|d,q) = [P(d,q|R) / P(d,q|NR)] · [P(R) / P(NR)]
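Why the odds form is convenient (a standard argument, not spelled out on the slide): the hard-to-estimate normalizer P(d,q) cancels when forming the odds, and the prior odds P(R)/P(NR) are the same for every document under a fixed query. Hence ranking by
  P(R|d,q), by O(R|d,q), or by P(d,q|R) / P(d,q|NR)
produces the same document order, and only quantities of the form P(d|R) remain to be estimated (next slides).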
Probab. Modeling as Decision Strategy
Decision about which docs should be returned based on a threshold calculated with a cost function C_j.
Example:

  C_j(R, dec)    | Retrieved | Not retrieved
  Relevant doc.  |     0     |      1
  Non-rel. doc.  |     2     |      0

Decision based on a risk function that minimizes the expected costs: retrieve d iff
  C(R, ret) · P(R|q,d) + C(NR, ret) · P(NR|q,d)  ≤  C(R, not ret) · P(R|q,d) + C(NR, not ret) · P(NR|q,d)
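Worked through for the example costs above (my arithmetic, for illustration), the decision rule collapses to a simple probability threshold:
  EC(retrieve d)     = 0 · P(R|q,d) + 2 · (1 − P(R|q,d)) = 2 · (1 − P(R|q,d))
  EC(not retrieve d) = 1 · P(R|q,d) + 0 · (1 − P(R|q,d)) = P(R|q,d)
Retrieve d iff 2 · (1 − P(R|q,d)) ≤ P(R|q,d), i.e. iff P(R|q,d) ≥ 2/3.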
Probability Estimation
Different approaches to estimate P(d|R) exist:
- Binary Independence Retrieval Model (BIR)
- Binary Independence Indexing Model (BII)
- Darmstadt Indexing Approach (DIA)
Generally we assume stochastic independence between the terms of one document, i.e.
  P(d|R) = Π_i P(x_i|R), where x_i ∈ {0,1} indicates whether term t_i occurs in d
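A minimal Python sketch of ranking under this independence assumption, in the BIR style; the data layout (documents as term sets with a judged training sample) and the +0.5 smoothing are my assumptions, not prescribed by the slides:

  import math

  def bir_term_weights(train_docs, relevant_ids, vocabulary):
      # p = P(term present | relevant), s = P(term present | non-relevant);
      # +0.5 / +1.0 smoothing avoids probabilities of exactly 0 or 1.
      rel = [d for i, d in enumerate(train_docs) if i in relevant_ids]
      non = [d for i, d in enumerate(train_docs) if i not in relevant_ids]
      weights = {}
      for t in vocabulary:
          p = (sum(t in d for d in rel) + 0.5) / (len(rel) + 1.0)
          s = (sum(t in d for d in non) + 0.5) / (len(non) + 1.0)
          weights[t] = math.log(p * (1 - s) / (s * (1 - p)))
      return weights

  def score(doc_terms, query_terms, weights):
      # Independence lets the log-odds of relevance decompose into a sum
      # of per-term contributions for the query terms present in the doc.
      return sum(weights[t] for t in query_terms if t in doc_terms)

Documents are then returned in decreasing order of score.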
Binary Independence Retr. Model (BIR)
Learning: Estimation of the probability distribution based on
- a query q_k
- a set of documents d_j
- respective relevance judgments
Application: Generalization to different documents from the collection (but restricted to the same query and the terms from training)
[Figure: docs/terms/queries space; BIR learns on one query over several documents and applies to further documents]
Binary Indep. Indexing Model (BII)
Learning: Estimation of the probability distribution based on
- a document d_j
- a set of queries q_k
- respective relevance judgments
Application: Generalization to different queries (but restricted to the same document and the terms from training)
[Figure: docs/terms/queries space; BII learns on one document over several queries and applies to further queries]
Darmstadt Indexing Approach (DIA)
Learning: Estimation of the probability distribution based on
- a set of queries q_k
- an abstract description of a set of documents d_j
- respective relevance judgments
Application: Generalization to different queries and documents
[Figure: docs/terms/queries space; DIA learns and applies across both queries and documents]
DIA - Description Step
Basic idea: Instead of term-document pairs, consider relevance descriptions x(t_i, d_m).
These contain the values of certain attributes of term t_i, document d_m, and their relation to each other.
Examples:
- Dictionary information about t_i (e.g. IDF)
- Parameters describing d_m (e.g. length or no. of unique terms)
- Information about the appearance of t_i in d_m (e.g. in title or abstract), its frequency, the distance between two query terms, etc.
REFERENCE: FUHR, BUCKLEY [4]
DIA - Decision Step
Estimation of the probability P(R | x(t_i, d_m)):
P(R | x(t_i, d_m)) is the probability of a document d_m being relevant to an arbitrary query, given that a term common to both document and query has the relevance description x(t_i, d_m).
Advantages:
- Abstraction from specific term-doc pairs and thus generalization to arbitrary docs and queries
- Enables individual, application-specific relevance descriptions
DIA - (Very) Simple Example
RELEVANCE DESCRIPTION: x(t_i, d_m) = (x_1, x_2) with
  x_1 = 1 if t_i appears in the title of d_m, 0 otherwise
  x_2 = 1 if t_i appears in d_m once, 2 if it appears at least twice
TRAINING SET: q_1, q_2, d_1, d_2, d_3
EVENT SPACE:

  QUERY | DOC. | REL.     | TERM: x
  q_1   | d_1  | REL.     | t_1: (1,1), t_2: (0,1), t_3: (1,2)
  q_1   | d_2  | NOT REL. | t_1: (0,2), t_3: (1,1), t_4: (0,1)
  q_2   | d_1  | REL.     | t_2: (0,2), t_5: (1,1), t_6: (1,2), t_7: (0,2)
  q_2   | d_3  | NOT REL. | t_5: (0,1), t_7: (0,1)

Estimated probabilities E(x) = P(R|x):

  x     | E(x)
  (0,1) | 1/4
  (0,2) | 2/3
  (1,1) | 2/3
  (1,2) | 1
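A short Python sketch of this estimation step; the training tuples below are my reading of the event space table above:

  from collections import defaultdict

  # (query, doc, relevant?, relevance description x = (x1, x2))
  training = [
      ("q1", "d1", True,  (1, 1)), ("q1", "d1", True,  (0, 1)), ("q1", "d1", True,  (1, 2)),
      ("q1", "d2", False, (0, 2)), ("q1", "d2", False, (1, 1)), ("q1", "d2", False, (0, 1)),
      ("q2", "d1", True,  (0, 2)), ("q2", "d1", True,  (1, 1)), ("q2", "d1", True,  (1, 2)),
      ("q2", "d1", True,  (0, 2)),
      ("q2", "d3", False, (0, 1)), ("q2", "d3", False, (0, 1)),
  ]

  counts = defaultdict(lambda: [0, 0])          # x -> [no. relevant, total]
  for _, _, rel, x in training:
      counts[x][0] += int(rel)
      counts[x][1] += 1

  for x, (r, n) in sorted(counts.items()):
      print(x, "E(x) =", r, "/", n)             # (0,1): 1/4, (0,2): 2/3, (1,1): 2/3, (1,2): 1/1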
DIA - Indexing Function
Because of the relevance descriptions: Generalization to arbitrary docs and queries.
Another advantage: Instead of probabilities, we can also use a general indexing function e(x(t_i, d_m)).
Note: We have a typical pattern recognition problem here, i.e.
- Given: A set of features / parameters and different classes (here: relevant and not relevant)
- Goal: Classification based on these features
Approaches such as Neural Networks, SVMs, etc. can be used (see the sketch below).
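As a concrete, hypothetical instance of such a classifier, one could fit a logistic regression to the (x_1, x_2) descriptions from the toy example two slides back; scikit-learn is assumed here purely for illustration:

  from sklearn.linear_model import LogisticRegression

  # Relevance descriptions (x1, x2) and their relevance labels,
  # taken from the toy training set above.
  X = [[1, 1], [0, 1], [1, 2], [0, 2], [1, 1], [0, 1],
       [0, 2], [1, 1], [1, 2], [0, 2], [0, 1], [0, 1]]
  y = [1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0]

  clf = LogisticRegression().fit(X, y)
  # e(x): estimated probability of relevance for an unseen description
  print(clf.predict_proba([[1, 2]])[0, 1])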
Models for IR - Taxonomy
Classic models:
- Boolean model (based on set theory)
- Vector space model (based on algebra)
- Probabilistic models (based on probability theory)
Alternative set theoretic models: Fuzzy set model, Extended Boolean model
Alternative algebraic models: Generalized vector model, Latent semantic indexing, Neural networks
Alternative probabilistic models: Inference networks, Belief network
Further models: Structured Models, Models for Browsing, Filtering
SOURCE: R. BAEZA-YATES [1]
References & Recommended Reading
[1] R. BAEZA-YATES, B. RIBEIRO-NETO: MODERN INFORMATION RETRIEVAL, ADDISON WESLEY, 1999. CHAPTER 2 (IR MODELS), CHAPTER 5 (RELEVANCE FEEDBACK)
[2] N. FUHR: SKRIPTUM ZUR VORLESUNG INFORMATION RETRIEVAL [LECTURE NOTES ON INFORMATION RETRIEVAL], AVAILABLE ONLINE AT THE COURSE HOME PAGE. CHAPTERS 5.5 AND 6 (IR MODELS)
[3] F. CRESTANI, M. LALMAS, C. J. VAN RIJSBERGEN, I. CAMPBELL: "IS THIS DOCUMENT RELEVANT? ... PROBABLY": A SURVEY OF PROBABILISTIC MODELS IN INFORMATION RETRIEVAL, ACM COMPUTING SURVEYS, VOL. 30, NO. 4, DEC. 1998 (PROBABILISTIC MODELS)
[4] N. FUHR, C. BUCKLEY: A PROBABILISTIC LEARNING APPROACH FOR DOCUMENT INDEXING, ACM TRANSACTIONS ON INFORMATION SYSTEMS, VOL. 9, NO. 3, JULY 1991. CHAPTERS 2 AND 4 (PROBABILISTIC MODELS)
Web Search - Summer Term 2006 II. Information Retrieval (Basics: Relevance Feedback) (c) Wolfgang Hürst, Albert-Ludwigs-University
Relevance Feedback
Motivation: Formulating a good query is often difficult.
Idea: Improve the search result by indicating the relevance of the initially returned docs.
Possible usage:
- Get better search results
- Re-train the current IR model
Different approaches based on
- User feedback
- Local information in the initial result set
- Global information in the whole doc. collection
Relev. Feedb. Based on User Input
Procedure:
- User enters an initial query
- System returns a result based on this query
- User marks relevant documents
- System selects important terms from the marked docs
- System returns a new result based on these terms
Two approaches:
- Query Expansion
- Term Re-weighting
Advantages:
- Breaks the search task down into smaller steps
- Relevance judgments are easier to make than the (re-)formulation of a query
- Controlled process to emphasize relevant terms and de-emphasize irrelevant ones
Query Expansion & Term Re-Weighting for the Vector Model
Vector Space Model: Representation of documents and queries as weighted term vectors.
Assumption:
- Large overlap between the term sets of relevant documents
- Small overlap between the term sets of irrelevant docs
Basic idea: Re-formulate the query in order to move the query vector closer to the documents marked as relevant.
Optimum Query Vector
Notation:
- D_r = set of returned docs marked as relevant
- D_n = set of returned docs marked as irrelevant
- C_r = set of all relevant docs in the doc. collection
- N = no. of docs in the collection; |D_r|, |D_n|, |C_r| = no. of docs in the respective doc. sets
- α, β, γ = constant factors (for fine-tuning)
Best query vector to distinguish relevant from non-relevant docs:
  q_opt = 1/|C_r| · Σ_{d_j ∈ C_r} d_j − 1/(N − |C_r|) · Σ_{d_j ∉ C_r} d_j
Query Expansion & Term Re-Weighting
Based on the relevance feedback from the user, we incrementally change the initial query vector q to create a better query vector q_m.
Goal: Approximation of the optimum query vector q_opt.
Standard_Rocchio approach:
  q_m = α · q + β/|D_r| · Σ_{d_j ∈ D_r} d_j − γ/|D_n| · Σ_{d_j ∈ D_n} d_j
Other approaches exist, e.g. Ide_Regular, Ide_Dec_Hi.
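A minimal Python sketch of the Standard_Rocchio update; the parameter values and the clipping of negative weights are common defaults I assume here, not given on the slide:

  import numpy as np

  def rocchio(q, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
      # q and every document are equal-length term-weight vectors
      # (e.g. TF*IDF); relevant / non_relevant are the judged results.
      q_m = alpha * np.asarray(q, dtype=float)
      if len(relevant) > 0:
          q_m += (beta / len(relevant)) * np.sum(relevant, axis=0)
      if len(non_relevant) > 0:
          q_m -= (gamma / len(non_relevant)) * np.sum(non_relevant, axis=0)
      return np.maximum(q_m, 0.0)   # negative term weights are usually clipped

  # Usage: one result judged relevant, one judged non-relevant
  q_new = rocchio([1.0, 0.0, 0.0], relevant=[[1.0, 1.0, 0.0]], non_relevant=[[0.0, 0.0, 1.0]])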
Relev. Feedb. Without User Input
Different approaches based on
- User feedback
- Local information in the initial result set
- Global information in the whole doc. collection
Basic idea of relevance feedback: Clustering, i.e. the docs marked as relevant contain additional terms which describe a larger cluster of relevant docs.
So far: Get user feedback to create this term set.
Now: Approaches to get these term sets automatically.
Two approaches:
- Local strategies (based on the returned result set)
- Global strategies (based on the whole doc. collection)
Query Exp. Through Local Clustering
Motivation: Given a query q, there exists a local relationship between relevant documents.
Basic idea: Expand query q with additional terms based on a clustering of the documents from the initial result set.
Different approaches exist:
- Association Clusters: Assume a correlation between terms co-occurring in different docs
- Metric Clusters: Assume a correlation between terms close to each other (in a document)
- Scalar Clusters: Assume a correlation between terms with a similar neighborhood
Metric Clusters
Note: In the following we consider word stems s instead of terms (analogous to the literature; works similarly with terms).
Notation:
- r(t_i, t_j) = distance between two terms t_i and t_j in a document d (in no. of terms)
- S(t) = stem (root) of term t
- V(s) = set of all words with stem s
Define a local stem-stem correlation matrix S with elements s_{u,v} based on the correlation
  c_{u,v} = Σ_{t_i ∈ V(s_u)} Σ_{t_j ∈ V(s_v)} 1 / r(t_i, t_j)
or normalized:
  s_{u,v} = c_{u,v} / (|V(s_u)| · |V(s_v)|)
Query Exp. With Metric Clusters
Clusters based on metric correlation matrices: Generated by returning, for a given stem s_u, the n stems s_v with the highest values s_{u,v}.
Use these clusters for query expansion (see the sketch below).
Comments:
- Clusters do not necessarily contain synonyms
- Non-normalized clusters often contain high-frequency terms
- Normalized clusters often group terms that appear less often
- Therefore: Combined approaches exist (i.e. using normalized and non-normalized clusters)
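A small Python sketch of the whole loop, as referenced above: build the unnormalized correlations c_{u,v} from term positions in the local result set, then expand the query with the n best stems per query stem. The stemming function is assumed to be given; the identity "stemmer" below is only for the usage example.

  from collections import defaultdict

  def metric_correlations(docs, stem):
      # docs: list of token lists (the local result set).
      # c[s_u][s_v] accumulates 1 / r(t_i, t_j) over all term pairs with
      # different stems, where r = distance in number of terms.
      c = defaultdict(lambda: defaultdict(float))
      for tokens in docs:
          for i, ti in enumerate(tokens):
              for j, tj in enumerate(tokens):
                  if i != j and stem(ti) != stem(tj):
                      c[stem(ti)][stem(tj)] += 1.0 / abs(i - j)
      return c

  def expand(query_stems, c, n=3):
      # For each query stem s_u, add the n stems s_v with the highest c[s_u][s_v].
      extra = []
      for su in query_stems:
          extra.extend(sorted(c[su], key=c[su].get, reverse=True)[:n])
      return list(dict.fromkeys(list(query_stems) + extra))  # dedupe, keep order

  # Usage with an identity "stemmer":
  c = metric_correlations([["cheap", "flight", "deals", "cheap", "fares"]], stem=lambda t: t)
  print(expand(["flight"], c))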
Overview of Approaches
- Based on user feedback
- Based on local information in the initial result set:
  - Local clustering
  - Local context analysis (combines local and global information)
- Based on global information in the whole document collection, examples:
  - Query expansion using a similarity thesaurus
  - Query expansion using a statistical thesaurus
References (Books)
R. BAEZA-YATES, B. RIBEIRO-NETO: MODERN INFORMATION RETRIEVAL, ADDISON WESLEY, 1999
WILLIAM B. FRAKES, RICARDO BAEZA-YATES (EDS.): INFORMATION RETRIEVAL - DATA STRUCTURES AND ALGORITHMS, PTR PRENTICE HALL, 1992
C. J. VAN RIJSBERGEN: INFORMATION RETRIEVAL, 1979
C. MANNING, P. RAGHAVAN, H. SCHÜTZE: INTRODUCTION TO INFORMATION RETRIEVAL (TO APPEAR 2007)
I. WITTEN, A. MOFFAT, T. BELL: MANAGING GIGABYTES, MORGAN KAUFMANN, 1999
N. FUHR: SKRIPTUM ZUR VORLESUNG INFORMATION RETRIEVAL [LECTURE NOTES ON INFORMATION RETRIEVAL], SS 2006
AND MANY MORE!
References (Articles)
G. SALTON: A BLUEPRINT FOR AUTOMATIC INDEXING, ACM SIGIR FORUM, VOL. 16, ISSUE 2, FALL 1981
F. CRESTANI, M. LALMAS, C. J. VAN RIJSBERGEN, I. CAMPBELL: "IS THIS DOCUMENT RELEVANT? ... PROBABLY": A SURVEY OF PROBABILISTIC MODELS IN INFORMATION RETRIEVAL, ACM COMPUTING SURVEYS, VOL. 30, NO. 4, DEC. 1998
N. FUHR, C. BUCKLEY: A PROBABILISTIC LEARNING APPROACH FOR DOCUMENT INDEXING, ACM TRANSACTIONS ON INFORMATION SYSTEMS, VOL. 9, NO. 3, JULY 1991
Further Sources: IR-related conferences:
- ACM SIGIR International Conference on Information Retrieval
- ACM / IEEE Joint Conference on Digital Libraries (JCDL)
- ACM Conference on Information and Knowledge Management (CIKM)
- Text REtrieval Conference (TREC)
Recap: IR System & Tasks Involved
[Figure: IR system architecture. From the user's information need, the user interface produces a query, which is parsed and term-processed into a logical view of the information need. On the document side, data is selected for indexing, parsed, and term-processed into the index. Searching and ranking over the index produce result docs, which are represented to the user; performance evaluation spans the whole process.]
Schedule
Introduction
IR-Basics (Lectures):
- Overview, terms and definitions
- Index (inverted files)
- Term processing
- Query processing
- Ranking (TF*IDF, ...)
- Evaluation
- IR-Models (Boolean, vector, probabilistic)
IR-Basics (Exercises)
Web Search (Lectures and exercises)
Organizational Remarks
Exercises: Please register for the exercises by sending me an e-mail containing
- Your name,
- Matriculation number (Matrikelnummer),
- Degree program (Studiengang: BA, MSc, Diploma, ...),
- Plans for taking the exam (yes, no, undecided)
This is just to organize the exercises, i.e. there are no consequences if you decide to drop this course.
Registrations should be done before the exercises start. Later registration might be possible under certain circumstances (contact me).