SLIDE 1 – IS 240 – Spring 2011 (2011.02.16)
Principles of Information Retrieval
Lecture 9: Probabilistic Retrieval
Prof. Ray Larson, University of California, Berkeley School of Information

SLIDE 2 – IS 240 – Spring 2011
Mini-TREC
Need to make groups
– Today – give me a note with group members (names and login names)
Systems
– SMART (not recommended…) ftp://ftp.cs.cornell.edu/pub/smart
– MG (we have a special version if interested)
– Cheshire II & 3: II = ftp://cheshire.berkeley.edu/pub/cheshire & 3 =
– Zprise (older search system from NIST)
– IRF (new Java-based IR framework from NIST)
– Lemur
– Lucene (Java-based text search engine)
– Galago (also Java-based)
– Others?? (See )

SLIDE 3 – IS 240 – Spring 2011
Mini-TREC Proposed Schedule
– February 9 – database and previous queries
– March 2 – report on system acquisition and setup
– March 9 – new queries for testing…
– April 18 – results due
– April 20 – results and system rankings
– April 27 – group reports and discussion

SLIDE 4 – IS 240 – Spring 2011
Today
Review
– Clustering and Automatic Classification
Probabilistic Models
– Probabilistic Indexing (Model 1)
– Probabilistic Retrieval (Model 2)
– Unified Model (Model 3)
– Model 0 and real-world IR
– Regression Models
– The “Okapi Weighting Formula”

SLIDE 5 – IS 240 – Spring 2011
Today
Review
– Clustering and Automatic Classification
Probabilistic Models
– Probabilistic Indexing (Model 1)
– Probabilistic Retrieval (Model 2)
– Unified Model (Model 3)
– Model 0 and real-world IR
– Regression Models
– The “Okapi Weighting Formula”

SLIDE 6 – IS 240 – Spring 2011
Review: IR Models
Set Theoretic Models
– Boolean
– Fuzzy
– Extended Boolean
Vector Models (algebraic)
Probabilistic Models (probabilistic)

SLIDE 7 – IS 240 – Spring 2011
Similarity Measures
– Simple matching (coordination level match)
– Dice’s Coefficient
– Jaccard’s Coefficient
– Cosine Coefficient
– Overlap Coefficient
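These coefficients are easiest to see on binary (set-of-terms) representations of a query and a document. Below is a minimal Python sketch of all five; the function names and the toy term sets are illustrative and are not from the slides.

```python
# Minimal sketch: the five coefficients from slide 7, computed on
# binary (set-of-terms) representations of a query Q and a document D.
def simple_matching(q, d):            # coordination level match: |Q ∩ D|
    return len(q & d)

def dice(q, d):                       # 2|Q ∩ D| / (|Q| + |D|)
    return 2 * len(q & d) / (len(q) + len(d))

def jaccard(q, d):                    # |Q ∩ D| / |Q ∪ D|
    return len(q & d) / len(q | d)

def cosine(q, d):                     # |Q ∩ D| / sqrt(|Q| * |D|)
    return len(q & d) / (len(q) * len(d)) ** 0.5

def overlap(q, d):                    # |Q ∩ D| / min(|Q|, |D|)
    return len(q & d) / min(len(q), len(d))

# Illustrative (made-up) term sets:
Q = {"probabilistic", "retrieval", "model"}
D = {"probabilistic", "retrieval", "bayes", "weighting"}
for f in (simple_matching, dice, jaccard, cosine, overlap):
    print(f.__name__, round(f(Q, D), 3))
```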

SLIDE 8 – IS 240 – Spring 2011
Documents in Vector Space
[Figure: documents D1–D11 plotted as points in a three-dimensional term space with axes t1, t2, t3]

SLIDE 9 – IS 240 – Spring 2011
Vector Space Visualization

SLIDE 10 – IS 240 – Spring 2011
Vector Space with Term Weights and Cosine Matching
D_i = (d_i1, w_di1; d_i2, w_di2; …; d_it, w_dit)
Q = (q_i1, w_qi1; q_i2, w_qi2; …; q_it, w_qit)
[Figure: query Q = (0.4, 0.8) and documents D1 = (0.8, 0.3), D2 = (0.2, 0.7) plotted against axes Term A and Term B]
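To make the slide's figure concrete, here is a small sketch (not lecture code) computing the cosine match between the query and the two document vectors shown; with these weights D2 outranks D1.

```python
import math

def cosine(v, w):
    # cos(v, w) = (v · w) / (|v| * |w|)
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w)))

Q  = (0.4, 0.8)   # query weights for Term A, Term B (from the slide)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

print(round(cosine(Q, D1), 3))   # ≈ 0.733
print(round(cosine(Q, D2), 3))   # ≈ 0.983, so D2 is ranked above D1
```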

SLIDE 11 – IS 240 – Spring 2011
Document/Document Matrix

SLIDE 12 – IS 240 – Spring 2011
Hierarchical Methods: Single Link, Dissimilarity Matrix
Hierarchical methods are polythetic, usually exclusive, and ordered; clusters are order-independent.

SLIDE 13 – IS 240 – Spring 2011
Single Link, Dissimilarity Matrix, Threshold = 0.1

SLIDE 14 – IS 240 – Spring 2011
Threshold =

SLIDE 15 – IS 240 – Spring 2011
Threshold =
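Slides 13–15 step through single-link clustering of a dissimilarity matrix at increasing thresholds; the matrix images and the higher threshold values are not preserved in this transcript. A minimal sketch of the idea, on made-up data:

```python
# Threshold-based single-link clustering on a (hypothetical) dissimilarity
# matrix; the numbers are illustrative only, not the slide's matrix.
def single_link(dissim, threshold):
    n = len(dissim)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Single link: merge any two items whose dissimilarity is below the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if dissim[i][j] < threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

dissim = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
print(single_link(dissim, 0.3))   # -> [[0, 1], [2, 3]]; at a higher threshold the clusters merge
```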

SLIDE 16 – IS 240 – Spring 2011
K-means & Rocchio Clustering
Agglomerative methods are polythetic, exclusive or overlapping; unordered clusters are order-dependent.
Rocchio’s method:
1. Select initial centers (i.e., seed the space)
2. Assign docs to highest-matching centers and compute centroids
3. Reassign all documents to centroid(s)
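A minimal k-means sketch of the seed / assign / recompute loop described above, on made-up two-term weight vectors (not lecture code):

```python
import random

def kmeans(docs, k, iterations=10):
    # 1. Seed the space with k initial centers chosen from the documents.
    centers = random.sample(docs, k)
    for _ in range(iterations):
        # 2. Assign each doc to its closest (here: squared-Euclidean) center.
        clusters = [[] for _ in range(k)]
        for d in docs:
            best = min(range(k),
                       key=lambda c: sum((a - b) ** 2 for a, b in zip(d, centers[c])))
            clusters[best].append(d)
        # 3. Recompute each centroid and reassign on the next pass.
        for c, members in enumerate(clusters):
            if members:
                centers[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return centers, clusters

random.seed(0)
docs = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
centers, clusters = kmeans(docs, k=2)
print(clusters)
```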

SLIDE 17 – IS 240 – Spring 2011
Clustering
Advantages:
– See some main themes
Disadvantage:
– Many ways documents could group together are hidden
Thinking point: what is the relationship to classification systems and facets?

SLIDE 18 – IS 240 – Spring 2011
Automatic Class Assignment
1. Create pseudo-documents representing intellectually derived classes.
2. Search using document contents.
3. Obtain a ranked list.
4. Assign the document to the N categories ranked over a threshold, OR assign it to the top-ranked category.
Automatic class assignment is polythetic, exclusive or overlapping; usually ordered clusters are order-independent, usually based on an intellectually derived scheme.
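A hedged sketch of the assignment loop on slide 18: each class is represented by a pseudo-document, the incoming document is run as a "query" against them, and the document is assigned to the top-ranked class(es) above a threshold. The scoring function and the example class names are placeholders, not the Cheshire implementation.

```python
def assign_classes(doc_terms, class_pseudo_docs, threshold, n=3):
    # Toy score: term overlap between the document and each class pseudo-document.
    scored = sorted(((len(doc_terms & terms), name)
                     for name, terms in class_pseudo_docs.items()), reverse=True)
    # Keep the top-n classes scoring at or above the threshold,
    # falling back to the single top-ranked class if none qualify.
    above = [(name, s) for s, name in scored if s >= threshold]
    return above[:n] if above else [(scored[0][1], scored[0][0])]

# Hypothetical class pseudo-documents (placeholders, not real LC classes):
classes = {
    "TL Motor vehicles": {"engine", "spark", "ignition", "vehicle"},
    "QA Mathematics":    {"theorem", "proof", "algebra"},
}
print(assign_classes({"engine", "ignition", "automobile"}, classes, threshold=1))
```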

SLIDE 19 – IS 240 – Spring 2011
Automatic Categorization in Cheshire II
Cheshire supports a method we call “classification clustering” that relies on having a set of records that have been indexed using some controlled vocabulary.
Classification clustering has the following steps…

SLIDE 20 – IS 240 – Spring 2011
Start with a collection of documents.

SLIDE 21 – IS 240 – Spring 2011
Classify and index with a controlled vocabulary. Ideally, use a database that is already indexed.

SLIDE 22 – IS 240 – Spring 2011
Problem: Controlled vocabularies can be difficult for people to use.
“pass mtr veh spark ign eng”

SLIDE 23 – IS 240 – Spring 2011
Solution: Entry Level Vocabulary Indexes (EVIs).
“pass mtr veh spark ign eng” = “Automobile”

SLIDE 24 – IS 240 – Spring 2011
EVI Example
User query: “Automobile”
EVI 1 → index term: “pass mtr veh spark ign eng”
EVI 2 → index term: “automobiles” OR “internal combustion engines”

SLIDE 25 – IS 240 – Spring 2011
But why stop there?
[Figure: an index fronted by an EVI]

SLIDE 26 – IS 240 – Spring 2011
“Which EVI do I use?”
[Figure: several indexes, each fronted by its own EVI]

SLIDE 27 – IS 240 – Spring 2011
EVI to EVIs
[Figure: a second-level EVI (EVI 2) pointing to several index/EVI pairs]

SLIDE 28 – IS 240 – Spring 2011
Find “Plutonium” in: Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil
Why not treat language the same way?

SLIDE 29 – IS 240 – Spring 2011
Find “Plutonium” in: Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil
[Figure: statistical association linking the query to digital library resources in each language]

SLIDE 30 – IS 240 – Spring 2011
Cheshire II – Two-Stage Retrieval
Using the LC Classification System:
– Pseudo-document created for each LC class, containing terms derived from “content-rich” portions of documents in that class (e.g., subject headings, titles, etc.)
– Permits searching by any term in the class
– Ranked probabilistic retrieval techniques attempt to present the “best matches” to a query first
– User selects classes to feed back for the “second stage” search of documents
Can be used with any classified/indexed collection.

SLIDE 31 – IS 240 – Spring 2011
Cheshire EVI Demo

SLIDE 32 – IS 240 – Spring 2011
Problems with Vector Space
There is no real theoretical basis for the assumption of a term space
– it is more for visualization than having any real basis
– most similarity measures work about the same regardless of model
Terms are not really orthogonal dimensions
– terms are not independent of all other terms

SLIDE 33 – IS 240 – Spring 2011
Today
Review
– Clustering and Automatic Classification
Probabilistic Models
– Probabilistic Indexing (Model 1)
– Probabilistic Retrieval (Model 2)
– Unified Model (Model 3)
– Model 0 and real-world IR
– Regression Models
– The “Okapi Weighting Formula”

SLIDE 34 – IS 240 – Spring 2011
Probabilistic Models
– A rigorous formal model that attempts to predict the probability that a given document will be relevant to a given query
– Ranks retrieved documents according to this probability of relevance (the Probability Ranking Principle)
– Relies on accurate estimates of probabilities

SLIDE 35 – IS 240 – Spring 2011
Probability Ranking Principle
“If a reference retrieval system’s response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.”
– Stephen E. Robertson, J. Documentation 1977

SLIDE 36 – IS 240 – Spring 2011
Model 1 – Maron and Kuhns
Concerned with estimating probabilities of relevance at the point of indexing:
– If a patron came with a request using term t_i, what is the probability that she/he would be satisfied with document D_j?

SLIDE 37 – IS 240 – Spring 2011
Probability Theory (detour)
To get to the Bayesian statistical inference used in both Model 1 and Model 2…

SLIDE 38 – IS 240 – Spring 2011
Probability Theory
“Bayes’ Rule” (a.k.a. Bayesian inference) says:
P(A|B) = P(B|A) · P(A) / P(B)

SLIDE 39 – IS 240 – Spring 2011
Bayes’ Theorem
For example: A = disease, B = symptom
Bayes’ theorem lets us compute P(disease | symptom) from P(symptom | disease), P(disease), and P(symptom).

SLIDE 40 – IS 240 – Spring 2011
Bayes’ Theorem: Application
Box 1: p(box1) = 0.5, P(red ball | box1) = 0.4, P(blue ball | box1) = 0.6
Box 2: p(box2) = 0.5, P(red ball | box2) = 0.5, P(blue ball | box2) = 0.5
Toss a fair coin. If it lands heads up, draw a ball from box 1; otherwise, draw a ball from box 2. If the ball is blue, what is the probability that it was drawn from box 2?
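A worked answer to the slide's question (the computation itself is not preserved in the transcript), by direct application of Bayes' rule:

```python
# P(box2 | blue) = P(blue | box2) * P(box2) / P(blue)
p_box1, p_box2 = 0.5, 0.5
p_blue_given_box1, p_blue_given_box2 = 0.6, 0.5

# Total probability of drawing a blue ball: 0.6*0.5 + 0.5*0.5 = 0.55
p_blue = p_blue_given_box1 * p_box1 + p_blue_given_box2 * p_box2
p_box2_given_blue = p_blue_given_box2 * p_box2 / p_blue
print(round(p_box2_given_blue, 3))   # 0.455
```

So the blue ball came from box 2 with probability 0.25 / 0.55 = 5/11 ≈ 0.455.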

SLIDE 41 – IS 240 – Spring 2011
Bayes’ Theorem: Application in IR
Goal: estimate the probability that a document D is relevant to a given query.
It is often useful to work with the log odds of the probability of relevance.

SLIDE 42 – IS 240 – Spring 2011
Bayes’ Theorem: Application in IR
If documents are represented by binary vectors, this leads to the Robertson & Sparck Jones term weighting.

SLIDE 43 – IS 240 – Spring 2011
Bayes’ Theorem: Application in IR

SLIDE 44 – IS 240 – Spring 2011
Model 1
“A patron submits a query (call it Q) consisting of some specification of her/his information need. Different patrons submitting the same stated query may differ as to whether or not they judge a specific document to be relevant. The function of the retrieval system is to compute for each individual document the probability that it will be judged relevant by a patron who has submitted query Q.”
– Robertson, Maron & Cooper, 1982

SLIDE 45 – IS 240 – Spring 2011
Model 1 – Bayes
A is the class of events of using the system
D_i is the class of events of document i being judged relevant
I_j is the class of queries consisting of the single term I_j
P(D_i | A, I_j) = probability that if a query is submitted to the system then a relevant document is retrieved

SLIDE 46 – IS 240 – Spring 2011
Model 2
“Documents have many different properties; some documents have all the properties that the patron asked for, and other documents have only some or none of the properties. If the inquiring patron were to examine all of the documents in the collection she/he might find that some having all the sought after properties were relevant, but others (with the same properties) were not relevant. And conversely, he/she might find that some of the documents having none (or only a few) of the sought after properties were relevant, others not. The function of a document retrieval system is to compute the probability that a document is relevant, given that it has one (or a set) of specified properties.”
– Robertson, Maron & Cooper, 1982

SLIDE 47 – IS 240 – Spring 2011
Model 2 – Robertson & Sparck Jones
Given a term t and a query q (document indexing vs. document relevance):

                    Relevant     Not relevant      Total
  Term present      r            n - r             n
  Term absent       R - r        N - n - R + r     N - n
  Total             R            N - R             N

SLIDE 48 – IS 240 – Spring 2011
Robertson–Sparck Jones Weights
Retrospective formulation:
w_t = log [ ( r / (R - r) ) / ( (n - r) / (N - n - R + r) ) ]

SLIDE 49 – IS 240 – Spring 2011
Robertson–Sparck Jones Weights
Predictive formulation:
w_t^(1) = log [ ( (r + 0.5) / (R - r + 0.5) ) / ( (n - r + 0.5) / (N - n - R + r + 0.5) ) ]
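A small sketch of the predictive weight, using the symbols from slide 47's contingency table; the numbers in the example are made up.

```python
import math

def rsj_weight(r, n, R, N):
    """Predictive Robertson-Sparck Jones relevance weight for one term.
    r: relevant docs containing the term, n: docs containing the term,
    R: relevant docs, N: docs in the collection (as in slide 47's table)."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

# Illustrative numbers (not from the slides):
print(round(rsj_weight(r=8, n=20, R=10, N=1000), 3))
```

With no relevance information (r = R = 0) the weight reduces to the IDF-like form log((N - n + 0.5) / (n + 0.5)), which is how it typically appears as w^(1) inside BM-25 (slide 63).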

SLIDE 50 – IS 240 – Spring 2011
Probabilistic Models: Some Unifying Notation
D = all present and future documents
Q = all present and future queries
(D_i, Q_j) = a document–query pair
x = class of similar documents, y = class of similar queries
Relevance is a relation over D × Q: R = {(D_i, Q_j) : document D_i is judged relevant to query Q_j}

SLIDE 51 – IS 240 – Spring 2011
Probabilistic Models
Model 1 – Probabilistic Indexing, P(R | y, D_i)
Model 2 – Probabilistic Querying, P(R | Q_j, x)
Model 3 – Merged Model, P(R | Q_j, D_i)
Model 0 – P(R | y, x)
Probabilities are estimated based on prior usage or relevance estimation.

SLIDE 52 – IS 240 – Spring 2011
Probabilistic Models
[Figure: the document space D and query space Q, with document class x containing D_i and query class y containing Q_j]

SLIDE 53 – IS 240 – Spring 2011
Logistic Regression
Based on work by William Cooper, Fred Gey and Daniel Dabney
– Builds a regression model for relevance prediction based on a set of training data
– Uses less restrictive independence assumptions than Model 2: linked dependence

SLIDE 54 – IS 240 – Spring 2011
Dependence Assumptions
In Model 2, term independence was assumed:
– P(A,B | R) = P(A | R) · P(B | R)
– This is not very realistic, as we have discussed before
Cooper, Gey, and Dabney proposed linked dependence:
– If two or more retrieval clues are statistically dependent in the set of all relevance-related query–document pairs, then they are statistically dependent to a corresponding degree in the set of all nonrelevance-related pairs.
– Thus dependency in the relevant and nonrelevant documents is linked.

SLIDE 55 – IS 240 – Spring 2011
Linked Dependence
Linked dependence assumption: there exists a positive real number K such that the following two conditions hold:
– P(A,B | R) = K · P(A | R) · P(B | R)
– P(A,B | ¬R) = K · P(A | ¬R) · P(B | ¬R)
When K = 1 this is the same as binary independence.

SLIDE 56 – IS 240 – Spring 2011
Linked Dependence
The odds of an event E: O(E) = P(E) / (1 - P(E))  (see the paper for details)
Multiplying by O(R) and taking logs we get:
log O(R | A, B) = log O(R | A) + log O(R | B) - log O(R)
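A quick numeric check (not from the lecture) that the composition rule above follows when both linked-dependence conditions hold with the same K; all probabilities below are made up, chosen only to satisfy the assumption.

```python
# If P(A,B|R) = K*P(A|R)*P(B|R) and P(A,B|notR) = K*P(A|notR)*P(B|notR)
# with the SAME K, then O(R|A,B) = O(R|A) * O(R|B) / O(R).
pR, pnR = 0.3, 0.7
pA_R, pB_R   = 0.5, 0.4
pA_nR, pB_nR = 0.2, 0.1
K = 1.5
pAB_R  = K * pA_R * pB_R          # 0.30
pAB_nR = K * pA_nR * pB_nR        # 0.03

def odds(p_given_R, p_given_nR):
    # O(R | evidence) = P(evidence, R) / P(evidence, notR)
    return (p_given_R * pR) / (p_given_nR * pnR)

o_R    = pR / pnR
o_R_A  = odds(pA_R, pA_nR)
o_R_B  = odds(pB_R, pB_nR)
o_R_AB = odds(pAB_R, pAB_nR)

print(round(o_R_AB, 4), round(o_R_A * o_R_B / o_R, 4))   # both ≈ 4.2857
```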

SLIDE 57 – IS 240 – Spring 2011
Logistic Regression
The logistic function: f(z) = e^z / (1 + e^z) = 1 / (1 + e^(-z))
The logistic function is useful because it can take as an input any value from negative infinity to positive infinity, whereas the output is confined to values between 0 and 1. The variable z represents the exposure to some set of independent variables, while f(z) represents the probability of a particular outcome given that set of explanatory variables. The variable z is a measure of the total contribution of all the independent variables used in the model and is known as the logit.
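A minimal sketch of the logistic function and of a logit built as a linear combination of explanatory variables; the coefficient and feature values are made up for illustration.

```python
import math

def logistic(z):
    # Maps any real-valued logit z to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# The logit as a linear combination of explanatory variables (slide 58):
# z = c0 + c1*x1 + ... + cn*xn.  Coefficients here are illustrative only.
coefficients = [-3.5, 1.2, 0.8]   # c0, c1, c2 (made up)
features     = [1.0, 2.0, 1.5]    # 1 (intercept), x1, x2

z = sum(c * x for c, x in zip(coefficients, features))
print(round(z, 3), round(logistic(z), 3))   # logit and estimated probability
```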

SLIDE 58 – IS 240 – Spring 2011
Probabilistic Models: Logistic Regression
Estimates for relevance are based on a log-linear model with various statistical measures of document content as independent variables.
Log odds of relevance is a linear function of attributes:
log O(R | Q, D) = c_0 + c_1·X_1 + c_2·X_2 + … + c_n·X_n
Term contributions are summed over the terms shared by the query and document.
Probability of relevance is the inverse of the log odds:
P(R | Q, D) = e^(log O(R | Q, D)) / (1 + e^(log O(R | Q, D)))

SLIDE 59 – IS 240 – Spring 2011
Logistic Regression
[Figure: plot of relevance against term frequency in document, with a fitted logistic curve]

SLIDE 60 – IS 240 – Spring 2011
Probabilistic Models: Logistic Regression
The probability of relevance is based on logistic regression from a sample set of documents, used to determine the values of the coefficients.
At retrieval time the probability estimate is obtained by:
P(R | Q, D) = 1 / (1 + e^(-(c_0 + Σ_i c_i·X_i)))
for the six X attribute measures shown on the next slide.

SLIDE 61 – IS 240 – Spring 2011
Probabilistic Models: Logistic Regression Attributes (“TREC3”)
– Average absolute query frequency
– Query length
– Average absolute document frequency
– Document length
– Average inverse document frequency
– Inverse document frequency
– Number of terms in common between query and document (logged)
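A hedged sketch of how attributes like these might be computed from raw counts. The exact operationalizations (for example, whether logs or square roots are applied to each quantity) are not given on the slide, so the formulas below are placeholders, not the actual TREC3 definitions.

```python
import math

def trec3_style_attributes(query_tf, doc_tf, doc_len, df, N):
    """One plausible operationalization of the slide-61 attribute names,
    computed over the terms the query and document share.  Placeholder
    definitions, not the exact TREC3 variables."""
    common = [t for t in query_tf if t in doc_tf]
    M = len(common)
    if M == 0:
        return None
    x1 = sum(query_tf[t] for t in common) / M            # avg absolute query frequency
    x2 = sum(query_tf.values())                          # query length
    x3 = sum(doc_tf[t] for t in common) / M              # avg absolute document frequency
    x4 = doc_len                                         # document length
    x5 = sum(math.log(N / df[t]) for t in common) / M    # avg inverse document frequency
    x6 = math.log(M)                                     # number of matching terms, logged
    return [x1, x2, x3, x4, x5, x6]

print(trec3_style_attributes(query_tf={"probabilistic": 1, "retrieval": 1},
                             doc_tf={"probabilistic": 3, "retrieval": 5, "model": 2},
                             doc_len=120,
                             df={"probabilistic": 50, "retrieval": 400, "model": 300},
                             N=10000))
```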

SLIDE 62 – IS 240 – Spring 2011
Current Use of Probabilistic Models
Most of the major systems in TREC now use the “Okapi BM-25 formula” (or language models, more on those later), which incorporates the Robertson–Sparck Jones weights…

SLIDE 63 – IS 240 – Spring 2011
Okapi BM-25
BM25(Q, D) = Σ_{T in Q} w^(1) · ((k1 + 1)·tf / (K + tf)) · ((k3 + 1)·qtf / (k3 + qtf))
Where:
– Q is a query containing terms T
– K is k1·((1 - b) + b·dl/avdl)
– k1, b and k3 are parameters, usually 1.2, 0.75 and …
– tf is the frequency of the term in a specific document
– qtf is the frequency of the term in the topic from which Q was derived
– dl and avdl are the document length and the average document length, measured in some convenient unit (e.g., bytes)
– w^(1) is the Robertson–Sparck Jones weight
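A hedged sketch of the BM-25 score as defined above. Here w^(1) is computed in its no-relevance-information (IDF-like) form, k3 = 8.0 is an assumed value since the slide's k3 value is not preserved in the transcript, and the document statistics in the example are made up.

```python
import math

def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, df, N,
               k1=1.2, b=0.75, k3=8.0):
    # Length-normalization factor from the slide: K = k1*((1-b) + b*dl/avdl)
    K = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        n = df[term]                                  # docs containing the term
        w1 = math.log((N - n + 0.5) / (n + 0.5))      # RSJ weight with r = R = 0
        score += w1 * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score

# Illustrative (made-up) statistics:
print(round(bm25_score(query_tf={"probabilistic": 1, "retrieval": 2},
                       doc_tf={"probabilistic": 3, "retrieval": 5, "model": 2},
                       doc_len=120, avg_doc_len=100,
                       df={"probabilistic": 50, "retrieval": 400, "model": 300},
                       N=10000), 3))
```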

SLIDE 64 – IS 240 – Spring 2011
Probabilistic Models
Advantages:
– Strong theoretical basis
– In principle should supply the best predictions of relevance given available information
– Can be implemented similarly to vector models
Disadvantages:
– Relevance information is required, or is “guestimated”
– Important indicators of relevance may not be terms, though terms only are usually used
– Optimally requires ongoing collection of relevance information

SLIDE 65 – IS 240 – Spring 2011
Vector and Probabilistic Models
– Support “natural language” queries
– Treat documents and queries the same
– Support relevance feedback searching
– Support ranked retrieval
– Differ primarily in theoretical basis and in how the ranking is calculated
  – Vector assumes relevance
  – Probabilistic relies on relevance judgments or estimates