Retrieval Models and Ranking Systems CSC 575 Intelligent Information Retrieval.

Retrieval Models
- A model is an idealization or abstraction of an actual process
  - in this case, the process is the matching of documents with queries, i.e., retrieval
- Mathematical models are used to study the properties of the process, draw conclusions, and make predictions
  - conclusions derived from a model depend on whether the model is a good approximation to the actual situation
- Retrieval models can describe the computational process
  - e.g., how documents are ranked
  - note that an inverted file is an implementation, not a model
- Retrieval variables: queries, documents, terms, relevance judgements, users, information needs
- Retrieval models have an explicit or implicit definition of relevance

[Figure: retrieval process diagram with the components information need, query text input, lexical analysis and stop words, parse, pre-process, collections, index, rank, and result sets. Questions posed on the slide: How is the index constructed? How is the matching and scoring done?]

Retrieval Models
- It is customary to distinguish between exact-match and best-match retrieval
- Exact-match
  - query specifies precise retrieval criteria
  - every document either matches or fails to match the query
  - result is a set of documents
- Best-match
  - query describes a good or "best" matching document
  - result is a ranked list of documents
  - result may include an estimate of quality
- Best-match models: better retrieval effectiveness
  - good documents appear at the top of the ranking
  - but efficiency is better in exact match (e.g., Boolean)

Ranking Algorithms
- Assign weights to the terms in the query
- Assign weights to the terms in the documents
- Compare the weighted query terms to the weighted document terms
  - Boolean matching (exact match)
  - simple (coordinate level) matching
  - cosine similarity
  - other similarity measures (Dice, Jaccard, overlap, etc.)
  - extended Boolean models
  - probabilistic models
- Rank order the results
  - pure Boolean has no ordering

Boolean Retrieval
- Boolean retrieval is the most common exact-match model
  - queries are logic expressions with document features as operands
  - retrieved documents are generally not ranked
  - query formulation is difficult for novice users
- "Pure" Boolean operators: AND, OR, NOT
- Most systems have proximity operators
- Most systems support simple regular expressions as search terms to match spelling variants

Boolean Logic
- AND and OR in a Boolean query represent the intersection and union of the corresponding document sets, respectively
- NOT represents the complement of the corresponding set
[Figure: Venn diagram of sets A and B]

Boolean Queries
- Boolean queries are Boolean combinations of terms
- Each of the following combinations works:
  - Cat
  - Cat OR Dog
  - Cat AND Dog
  - (Cat AND Dog) OR Collar
  - (Cat AND Dog) OR (Collar AND Leash)
  - (Cat OR Dog) AND (Collar OR Leash)
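The query forms listed above can be evaluated directly with set operations. The following is a minimal sketch in Python, assuming a hypothetical four-document inverted index; the document IDs and postings are invented for illustration and do not come from the slides.

```python
# Toy inverted index: term -> set of document IDs (hypothetical data).
index = {
    "cat":    {1, 2, 4},
    "dog":    {2, 3},
    "collar": {3, 4},
    "leash":  {4},
}

def docs(term):
    """Return the posting set for a term (empty if the term is unknown)."""
    return index.get(term.lower(), set())

# AND is set intersection (&), OR is set union (|).
q1 = docs("Cat") | docs("Dog")                      # Cat OR Dog
q2 = docs("Cat") & docs("Dog")                      # Cat AND Dog
q3 = (docs("Cat") & docs("Dog")) | docs("Collar")   # (Cat AND Dog) OR Collar
q4 = (docs("Cat") | docs("Dog")) & (docs("Collar") | docs("Leash"))

print(sorted(q1), sorted(q2), sorted(q3), sorted(q4))
```

Each result is an unordered set, which reflects the point that pure Boolean retrieval produces no ranking.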

Boolean Matching
[Figure: Venn diagram of the terms t1, t2, and t3 over documents D1 to D11; the eight regions are labeled as minterms m1 to m8, each a conjunction of t1, t2, and t3 or their negations]
- Hit list for the query t1 AND t2:
  {D1, D3, D5, D9, D10, D11} ∩ {D1, D2, D4, D5, D6} = {D1, D5}
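The hit list above is the intersection of the document sets for t1 and t2. As a sketch, the same result can be obtained with a linear merge of two sorted posting lists, which is one common way to implement AND. The function name `intersect` is an illustrative choice, and documents are written as integers (D1 becomes 1).

```python
def intersect(p1, p2):
    """Intersect two sorted posting lists with a linear merge."""
    i, j, hits = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            hits.append(p1[i])   # document contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1               # advance the list with the smaller ID
        else:
            j += 1
    return hits

t1 = [1, 3, 5, 9, 10, 11]   # documents containing t1 (from the slide)
t2 = [1, 2, 4, 5, 6]        # documents containing t2 (from the slide)
hit_list = intersect(t1, t2)
print(hit_list)  # [1, 5]
```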

Pseudo-Boolean Queries
- A new notation, from web search
  - +cat dog +collar leash
- Does not mean the same thing as a Boolean query!
- Need a way to group combinations
- Phrases:
  - "stray cat" AND "frayed collar"
  - +"stray cat" +"frayed collar"

Faceted Boolean Query
- Strategy: break the query into facets
  - conjunction of disjunctions (conjunctive normal form):
    (a1 OR a2 OR a3) AND (b1 OR b2) AND (c1 OR c2 OR c3 OR c4)
  - each facet expresses a topic or concept:
    ("rain forest" OR jungle OR amazon) AND (medicine OR remedy OR cure) AND (research OR development)

Faceted Boolean Query
- Query still fails if one facet is missing
- Alternative: a form of coordination-level ranking
  - order results by how many facets (disjunctions) are satisfied
  - also called quorum ranking
- Problem: facets are still undifferentiated
- Alternative: assign weights to facets
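Coordination-level (quorum) ranking can be sketched as counting, for each document, how many facets have at least one matching term. The example below reuses the facets of the rain forest query; the four documents and their term sets are hypothetical.

```python
# Facets from the example query; each facet is a set of alternative terms.
facets = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"research", "development"},
]

# Hypothetical documents, each reduced to the set of terms/phrases it contains.
documents = {
    "D1": {"jungle", "cure", "research"},
    "D2": {"amazon", "remedy"},
    "D3": {"development"},
    "D4": {"weather"},
}

def facets_satisfied(doc_terms):
    """Count the facets with at least one term present in the document."""
    return sum(1 for facet in facets if facet & doc_terms)

# Rank by number of satisfied facets; drop documents matching no facet.
scores = {d: facets_satisfied(terms) for d, terms in documents.items()}
ranking = sorted((d for d in scores if scores[d] > 0),
                 key=lambda d: scores[d], reverse=True)
print(ranking)  # ['D1', 'D2', 'D3']
```

A strict faceted Boolean query would return only D1 here, while quorum ranking still surfaces D2 and D3 below it. Weighting facets, as the last bullet suggests, would amount to replacing the count with a weighted sum.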

Boolean Model
- Advantages
  - simple queries are easy to understand
  - relatively easy to implement
  - structured queries
  - queries can be automatically translated into CNF or DNF
- Disadvantages
  - difficult to specify what is wanted
  - too much returned, or too little (acceptable precision generally means unacceptable recall)
  - ordering not well determined
  - query formulation difficult for novice users
- Dominant language in commercial systems until the WWW

Vector Space Model (revisited)
- Documents are represented as "bags of words"
- Represented as vectors when used computationally
  - a vector is an array of floating-point values (or binary values in the case of bit maps)
  - has direction and magnitude
  - each vector has a place for every term in the collection (most are sparse)
[Figure: term-by-document weight table with term columns nova, galaxy, heat, actor, film, role and document IDs A to I as rows; each row is a document vector]
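As a rough illustration of the bag-of-words idea, the sketch below maps a text to a vector with one slot per vocabulary term. The vocabulary is taken from the slide's column labels; the document text is hypothetical, and raw term counts are used as weights purely as an assumption (the slide's own weighting scheme is not preserved).

```python
from collections import Counter

# Vocabulary taken from the slide's column labels.
vocabulary = ["nova", "galaxy", "heat", "actor", "film", "role"]

def to_vector(text):
    """Map a text to raw term counts, one slot per vocabulary term."""
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocabulary]

# Hypothetical document text; word order is discarded ("bag of words").
doc = "Nova galaxy heat nova"
vec = to_vector(doc)
print(vec)  # [2.0, 1.0, 1.0, 0.0, 0.0, 0.0]
```

The trailing zeros show why real document vectors are sparse: most vocabulary terms do not occur in any given document.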

Documents & Query in n-dimensional Space
- Documents are represented as vectors in term space
  - terms are usually stems
  - documents represented by binary vectors of terms
- Queries are represented the same as documents
- Query and document weights are based on the length and direction of their vectors
- A vector distance measure between the query and documents is used to rank retrieved documents

The Notion of "Similarity" in IR
- The notion of similarity is central to many aspects of information retrieval and filtering:
  - measuring similarity of the query to documents is the primary factor in determining what is returned (and how results are ranked)
  - similarity measures can also be used in clustering documents (i.e., grouping together documents with similar content)
  - the same similarity measures can also be used to group together related terms (based on their occurrence patterns across documents in the collection)

Vector-Based Similarity Measures
- Simple Matching and Cosine Similarity
  - Simple matching = dot product of two vectors
  - Cosine similarity = normalized dot product
- The norm of a vector X is: $\|X\| = \sqrt{\sum_i x_i^2}$
- The cosine similarity of vectors X and Y is: $\mathrm{sim}(X, Y) = \dfrac{X \cdot Y}{\|X\| \, \|Y\|} = \dfrac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2} \, \sqrt{\sum_i y_i^2}}$
- In other words, divide the dot product by the norms of the two vectors

Vector-Based Similarity Measures
- Why divide by the norm?
- Example:
  - take a vector X whose norm is ||X|| = 5.83
  - let X* = X / ||X||
  - now, note that ||X*|| = 1
- So dividing a vector by its norm turns it into a unit-length vector
- Cosine similarity measures the angle between two unit-length vectors (i.e., the magnitudes of the vectors are ignored)
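The effect of dividing by the norm can be checked numerically. In the sketch below the vector X = (3, 5) is a hypothetical choice, picked only because its norm, sqrt(34), matches the value 5.83 quoted above; the components of the slide's own example vector are not preserved.

```python
import math

def norm(x):
    """Euclidean norm: square root of the sum of squared components."""
    return math.sqrt(sum(v * v for v in x))

def normalize(x):
    """Divide each component by the norm to get a unit-length vector."""
    n = norm(x)
    return [v / n for v in x]

def cosine(x, y):
    """Dot product divided by the product of the two norms."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (norm(x) * norm(y))

X = [3.0, 5.0]            # hypothetical vector; sqrt(9 + 25) = sqrt(34)
X_unit = normalize(X)
print(round(norm(X), 2))        # 5.83
print(round(norm(X_unit), 2))   # 1.0
# Scaling a vector does not change its cosine with another vector.
print(math.isclose(cosine(X, [1.0, 2.0]), cosine(X_unit, [1.0, 2.0])))  # True
```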

Computing a Similarity Score: 2D Example

Computing Similarity Scores

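Other Vector Space Similarity Measures
- Simple Matching: $\mathrm{sim}(X, Y) = \sum_i x_i y_i$
- Cosine Coefficient: $\mathrm{sim}(X, Y) = \dfrac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2} \, \sqrt{\sum_i y_i^2}}$
- Dice's Coefficient: $\mathrm{sim}(X, Y) = \dfrac{2 \sum_i x_i y_i}{\sum_i x_i^2 + \sum_i y_i^2}$
- Jaccard's Coefficient: $\mathrm{sim}(X, Y) = \dfrac{\sum_i x_i y_i}{\sum_i x_i^2 + \sum_i y_i^2 - \sum_i x_i y_i}$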

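Vector Space Similarity Measures
- Again consider the following two document vectors and the query vector:
  D1 = (0.8, 0.3), D2 = (0.2, 0.7), Q = (0.4, 0.8)
- Here Q · D1 = 0.56, Q · D2 = 0.64, and the sums of squares are 0.80 for Q, 0.73 for D1, and 0.53 for D2
- Computing similarity using Jaccard's Coefficient:
  sim(Q, D1) = 0.56 / (0.80 + 0.73 - 0.56) = 0.58
  sim(Q, D2) = 0.64 / (0.80 + 0.53 - 0.64) = 0.93
- Computing similarity using Dice's Coefficient:
  sim(Q, D1) = (2 × 0.56) / (0.80 + 0.73) = 0.73
  sim(Q, D2) = (2 × 0.64) / (0.80 + 0.53) = 0.96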
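The scores for these vectors can be reproduced with a few lines of Python. This is a sketch that assumes the sum-of-squares forms of the Dice and Jaccard coefficients for weighted vectors; that assumption is consistent with the Dice values 0.73 and 0.96 given on the slide. The function names are illustrative.

```python
def dot(x, y):
    """Simple matching: dot product of two vectors."""
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """Dot product normalized by the two vector norms."""
    return dot(x, y) / (dot(x, x) ** 0.5 * dot(y, y) ** 0.5)

def dice(x, y):
    """Twice the dot product over the sum of the squared norms."""
    return 2 * dot(x, y) / (dot(x, x) + dot(y, y))

def jaccard(x, y):
    """Dot product over the squared norms minus the dot product."""
    return dot(x, y) / (dot(x, x) + dot(y, y) - dot(x, y))

D1, D2, Q = (0.8, 0.3), (0.2, 0.7), (0.4, 0.8)

measures = [("simple", dot), ("cosine", cosine), ("dice", dice), ("jaccard", jaccard)]
for name, measure in measures:
    # Each line shows sim(Q, D1) and sim(Q, D2), rounded to two decimals.
    print(name, round(measure(Q, D1), 2), round(measure(Q, D2), 2))
```

All four measures rank D2 above D1 for this query, although the absolute scores differ.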

Vector Space Similarity Measures: Example

Probabilistic Models
- Attempt to be more theoretically sound than the vector space model
  - try to predict the probability of a document's being relevant, given the query
  - there are many variations
  - usually more complicated to compute than vector space
  - usually many approximations are required
- Relevance information is required from a random sample of documents and queries (training examples)
- Work about the same as (sometimes better than) vector space approaches

Basic Probabilistic Retrieval
- Retrieval is modeled as a classification process
- Two classes for each query: the relevant and non-relevant documents (with respect to a given query)
  - could easily be extended to three classes (i.e., add a "don't care" class)
- Given a particular document D, calculate the probability of belonging to the relevant class
  - retrieve if greater than the probability of belonging to the non-relevant class
  - i.e., retrieve if P(R|D) > P(NR|D)
- Equivalently, rank by a discriminant value (also called likelihood ratio): P(R|D) / P(NR|D)
- Different ways of estimating these probabilities lead to different models

Basic Probabilistic Retrieval
- A given query divides the document collection into two sets: relevant and non-relevant
- If a document D has been selected in response to a query, retrieve the document if dis(D) > 1, where dis(D) = P(R|D) / P(NR|D) is the discriminant of D
- This criterion can be modified by weighting the two probabilities
[Figure: a document shown against the relevant and non-relevant document sets, with the probabilities P(R|D) and P(NR|D)]

Estimating Probabilities
- Bayes' Rule can be used to "invert" conditional probabilities: $P(A \mid B) = \dfrac{P(B \mid A) \, P(A)}{P(B)}$
- Applying that to the discriminant function: $\mathrm{dis}(D) = \dfrac{P(R \mid D)}{P(NR \mid D)} = \dfrac{P(D \mid R) \, P(R)}{P(D \mid NR) \, P(NR)}$
- Note that P(R) is the probability that a random document is relevant to the query, and P(NR) = 1 - P(R):
  P(R) = n / N and P(NR) = 1 - P(R) = (N - n) / N
  where n = number of relevant documents and N = total number of documents in the collection

Estimating Probabilities
- Now we need to estimate P(D|R) and P(D|NR)
- If we assume that a document is represented by terms t1, ..., tn, and that these terms are statistically independent, then $P(D \mid R) = \prod_{i=1}^{n} P(t_i \mid R)$
  - and similarly we can compute P(D|NR)
- Note that P(ti|R) is the probability that a term ti occurs in a relevant document, and it can be estimated from a previously available sample (e.g., through relevance feedback)
- So, based on the distribution of terms in relevant and non-relevant documents, we can estimate whether the document should be retrieved (i.e., if dis(D) > 1)
- Note that documents that are retrieved can be ranked based on the value of the discriminant

Probabilistic Retrieval: Example
- Since the discriminant is less than one, document D should not be retrieved

Probabilistic Retrieval (cont.)
- In practice, we can't build a model for each query
- Instead, a general model is built based on query-document pairs in the historical (training) data
- Then, for a given query Q, the discriminant is computed based only on the conditional probabilities of the query terms
  - if query term t occurs in D, take P(t|R) and P(t|NR)
  - if query term t does not appear in D, take 1 - P(t|R) and 1 - P(t|NR)
- Example: Q = t1, t3, t4 and D = t1, t4, t5
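The rule for present and absent query terms can be sketched as follows, using the slide's Q and D. All numeric values (the term probabilities and the counts n and N used for the prior) are hypothetical placeholders, since the slide's own estimates are not preserved; the function name `discriminant` is also an illustrative choice.

```python
# Hypothetical term probabilities, as if estimated from training data.
p_rel = {"t1": 0.8, "t3": 0.5, "t4": 0.6}      # P(t | R)
p_nonrel = {"t1": 0.4, "t3": 0.5, "t4": 0.2}   # P(t | NR)

def discriminant(query, doc, prior_rel):
    """dis(D) = P(D|R) P(R) / (P(D|NR) P(NR)), restricted to query terms."""
    num, den = prior_rel, 1.0 - prior_rel
    for t in query:
        if t in doc:   # term present: use P(t|R) and P(t|NR)
            num *= p_rel[t]
            den *= p_nonrel[t]
        else:          # term absent: use the complements
            num *= 1.0 - p_rel[t]
            den *= 1.0 - p_nonrel[t]
    return num / den

Q = ["t1", "t3", "t4"]
D = {"t1", "t4", "t5"}
n, N = 10, 100                       # hypothetical: 10 relevant docs out of 100
dis = discriminant(Q, D, prior_rel=n / N)
print(round(dis, 3), "retrieve" if dis > 1 else "do not retrieve")
```

Note that t5 occurs in D but plays no role, because only query terms contribute. With these made-up numbers the low prior P(R) = 0.1 pulls the discriminant below 1 even though t1 and t4 favor relevance.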

Probabilistic Models
- Advantages
  - strong theoretical basis
  - in principle should supply the best predictions of relevance given available information
  - can be implemented similarly to the vector model
- Disadvantages
  - relevance information is required, or is "guesstimated"
  - important indicators of relevance may not be terms, though usually only terms are used
  - optimally requires ongoing collection of relevance information

Vector and Probabilistic Models
- Support "natural language" queries
- Treat documents and queries the same
- Support relevance feedback searching
- Support ranked retrieval
- Differ primarily in theoretical basis and in how the ranking is calculated
  - Vector assumes relevance
  - Probabilistic relies on relevance judgments or estimates

Extended Boolean Models
- Weighted Boolean Queries
  - weights are assigned to the operands in a Boolean query, e.g.:
    A 0.6 AND B 0.75
    A 1.0 OR B 0.3
- The weighting operation depends on the distance between the document sets for A and B
  - a weight of 1.0 says that all of the corresponding document set is considered in the operation
  - a weight of 0 < w < 1 says that only a portion of the document set is considered
  - the documents added or deleted are those that are "closest" to the current set of documents

Weighted Boolean Queries
- A 1.0 AND B 1.0 = A ∩ B
- A 1.0 OR B 1.0 = A ∪ B
- A 1.0 AND B 0.0 = A
- A 1.0 OR B 0.0 = A
- A 1.0 OR B 0.75 = A ∪ (75% of B - A)
- A 1.0 AND B 0.75 = (A ∩ B) ∪ (25% of A - B)
[Figure: Venn diagrams of sets A and B]

Weighted Boolean Queries
- Matching Algorithm
  1. Find the initial matching set (non-weighted Boolean document set)
  2. Find the invariant document set (the set of documents that are present both when the operand weight is 1.0 and when the weight is 0.0); the optional set is the remaining items
  3. Compute the centroid of the invariant set
  4. Find the number of documents, say k, from the optional set that will potentially be added to the invariant set (determined by the weight of the query term)
  5. Compute the similarity between documents in the optional set and the centroid (of the invariant set)
  6. Items to be added or deleted are the top k documents in the optional set with the highest similarity scores

Demo of Extended Boolean Query*
* Thanks to Michael Bombyk for discovering this applet!

Weighted Boolean Queries: Example
- Query Q1 = A 1.0 OR B 0.333
- Q1 (initial) = (D1, D2, D3, D4, D5, D6, D8)
- Q1 (invariant) = (D3, D6, D8)
- Q1 (optional) = (D1, D2, D4, D5) => 4 items
- Number of documents selected from the optional set: 2 (D4 and D5, per the final hit list)
- Centroid(Q1) = (1/3)(D3 + D6 + D8) = (4.7, 0.7, 2.0, 2.0)
- Computing similarity (using simple matching):
  SIM(Centroid, D1) = (4.7, 0.7, 2.0, 2.0) · (0, 4, 0, 8) = 18.8
  SIM(Centroid, D2) = (4.7, 0.7, 2.0, 2.0) · (0, 2, 0, 0) = 1.4
  SIM(Centroid, D4) = (4.7, 0.7, 2.0, 2.0) · (0, 6, 4, 6) = 24.2
  SIM(Centroid, D5) = (4.7, 0.7, 2.0, 2.0) · (0, 4, 6, 4) = 22.8
- So the final hit list is: (D3, D6, D8) ∪ (D4, D5)
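Steps 4 to 6 of the matching algorithm can be traced on the Q1 example with a short sketch. The centroid and the optional-set vectors are copied from the slide. The rule for k (the weighted share of the optional set, rounded up) is an assumption that happens to reproduce the two documents added in the final hit list; the slide's own formula is not preserved.

```python
import math

def simple_match(x, y):
    """Simple matching: dot product of two vectors."""
    return sum(a * b for a, b in zip(x, y))

# Values copied from the Q1 example.
centroid = (4.7, 0.7, 2.0, 2.0)          # centroid of the invariant set
invariant = ["D3", "D6", "D8"]
optional = {
    "D1": (0, 4, 0, 8),
    "D2": (0, 2, 0, 0),
    "D4": (0, 6, 4, 6),
    "D5": (0, 4, 6, 4),
}
weight = 0.333                           # weight on operand B in A 1.0 OR B .333

# Assumption: k is the weighted share of the optional set, rounded up.
k = math.ceil(weight * len(optional))

# Score every optional document against the centroid and keep the top k.
sims = {d: simple_match(centroid, v) for d, v in optional.items()}
added = sorted(sorted(sims, key=sims.get, reverse=True)[:k])
hit_list = invariant + added
print({d: round(s, 1) for d, s in sims.items()})
print(hit_list)
```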

Weighted Boolean Queries: Example
- Query Q2 = C 0.75 AND D 1.0
- Q2 (initial) = (D3, D4, D5)
- Q2 (invariant) = (D3, D4, D5)
- Q2 (optional) = (D1, D8) => 2 items
- Number of documents selected from the optional set: 1 (D1, per the final hit list)
- Centroid(Q2) = (1/3)(D3 + D4 + D5)
- Computing similarity (using simple matching): SIM(Centroid, D1) and SIM(Centroid, D8)
- Final hit list is: (D3, D4, D5) ∪ (D1)