PrasadL2IRModels1 Models for IR Adapted from Lectures by Berthier Ribeiro-Neto (Brazil), Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning.

Presentation transcript:

PrasadL2IRModels1 Models for IR Adapted from Lectures by Berthier Ribeiro-Neto (Brazil), Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

PrasadL2IRModels2 Introduction [Figure: Docs DB, Information Need, Index Terms, Doc, Query, Ranked List of Docs; arrows labeled "abstract" and "match"]

PrasadL2IRModels3 Introduction Premise: Semantics of documents and user information need, expressible naturally through sets of index terms  Unfortunately, in general, matching at index term level is quite imprecise Critical Issue: Ranking - ordering of documents retrieved that (hopefully) reflects their relevance to the query

PrasadL2IRModels4 Fundamental premises regarding relevance determine an IR model:  common sets of index terms  sharing of weighted terms  likelihood of relevance. The IR model (boolean, vector, probabilistic, etc.), the logical view of the documents (full text, index terms, etc.) and the user task (retrieval, browsing, etc.) are all orthogonal aspects of an IR system.

PrasadL2IRModels5 IR Models
User task: Retrieval (Ad hoc, Filtering); Browsing
Classic models: boolean, vector, probabilistic
Set theoretic: Fuzzy, Extended Boolean
Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
Probabilistic: Inference Network, Belief Network
Structured models: Non-Overlapping Lists, Proximal Nodes
Browsing: Flat, Structure Guided, Hypertext

PrasadL2IRModels6 IR Models The IR model, the logical view of the docs, and the retrieval task are distinct aspects of the system

PrasadL2IRModels7 Retrieval: Ad Hoc vs Filtering Ad hoc retrieval: [Figure: queries Q1 to Q5 issued against a collection of “fixed size”]

PrasadL2IRModels8 Retrieval: Ad Hoc vs Filtering Filtering: [Figure: a documents stream matched against the User 1 and User 2 profiles, yielding the docs filtered for User 1 and for User 2]

PrasadL2IRModels9 Retrieval: Ad hoc vs Filtering Ad hoc: Docs collection relatively static while queries vary. Ranking for determining relevance to user information need.  Cf. string matching problem where the text is given and the pattern to be searched varies. E.g., use indexing techniques, suffix trees, etc. Filtering: Queries relatively static while new docs are added to the collection. Construction of user profile to reflect user preferences.  Cf. string matching problem where the pattern is given and the text varies. E.g., use automata-based techniques.

PrasadL2IRModels10 Specifying an IR Model Structure: quadruple [D, Q, F, R(qi, dj)]  D = Representation of documents  Q = Representation of queries  F = Framework for modeling representations and their relationships. Standard language/algebra/impl. type for translation to provide semantics. Evaluation w.r.t. “direct” semantics through benchmarks.  R = Ranking function that associates a real number with a query-doc pair

PrasadL2IRModels11 Classic IR Models - Basic Concepts Each document represented by a set of representative keywords or index terms  Index terms meant to capture document’s main themes or semantics.  Usually, index terms are nouns because nouns have meaning by themselves.  However, search engines assume that all words are index terms (full text representation)

PrasadL2IRModels12 Classic IR Models - Basic Concepts Not all terms are equally useful for representing the document’s content Let  ki be an index term  dj be a document  wij be the weight associated with (ki,dj) The weight wij quantifies the importance of the index term for describing the document content

PrasadL2IRModels13 Notations/Conventions  ki is an index term  dj is a document  t is the total number of index terms  K = {k1, k2, …, kt} is the set of all index terms  wij >= 0 is the weight associated with (ki,dj); wij = 0 if the term is not in the doc  vec(dj) = (w1j, w2j, …, wtj) is the weight vector associated with the document dj  gi(vec(dj)) = wij is the function which returns the weight associated with the pair (ki,dj)

PrasadL2IRModels14 Boolean Model

PrasadL2IRModels15 The Boolean Model Simple model based on set theory. Queries and documents specified as boolean expressions  precise semantics  E.g., q = ka ∧ (kb ∨ ¬kc). Terms are either present or absent. Thus, wij ∈ {0,1}

PrasadL2IRModels16 Example  q = ka ∧ (kb ∨ ¬kc)  vec(qdnf) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0) (disjunctive normal form)  vec(qcc) = (1,1,0) (a conjunctive component) Similar/Matching documents: md1 = [ka ka d e] => (1,0,0); md2 = [ka kb kc] => (1,1,1). Unmatched documents: ud1 = [ka kc] => (1,0,1); ud2 = [d] => (0,0,0)

PrasadL2IRModels17 Similarity/Matching function sim(q,dj) = 1 if vec(dj) ∈ vec(qdnf), 0 otherwise (requires coercion for accuracy)
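
The matching function can be sketched in a few lines of Python. This is a minimal illustration, assuming the query q = ka AND (kb OR NOT kc) and the four sample documents from the example slide; the names TERMS, Q_DNF, to_vector and sim are hypothetical helpers, not part of the lecture. The "coercion" step is read here as projecting each document onto the three query terms.

```python
# Term order in every vector is (ka, kb, kc).
TERMS = ("ka", "kb", "kc")

# Disjunctive normal form of q = ka AND (kb OR NOT kc),
# written as the set of binary vectors that satisfy the query.
Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}


def to_vector(doc_terms):
    """Coerce a document (an iterable of terms) to a binary vector over TERMS."""
    present = set(doc_terms)
    return tuple(1 if term in present else 0 for term in TERMS)


def sim(doc_terms):
    """Return 1 if the document's vector appears in the DNF of q, else 0."""
    return 1 if to_vector(doc_terms) in Q_DNF else 0


docs = {
    "md1": ["ka", "ka", "d", "e"],
    "md2": ["ka", "kb", "kc"],
    "ud1": ["ka", "kc"],
    "ud2": ["d"],
}
results = {name: sim(terms) for name, terms in docs.items()}
print(results)
```

Terms outside the query (d, e) are ignored by the projection, which is why md1 still matches.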

PrasadL2IRModels18 Venn Diagram q = ka ∧ (kb ∨ ¬kc) [Figure: Venn diagram over the sets Ka, Kb, Kc with the regions (1,1,1), (1,1,0) and (1,0,0) marked]

PrasadL2IRModels19 Drawbacks of the Boolean Model  Expressive power of boolean expressions to capture information need and document semantics inadequate  Retrieval based on binary decision criteria (with no partial match) does not reflect our intuitions behind relevance adequately As a result  Answer set contains either too few or too many documents in response to a user query  No ranking of documents

PrasadL2IRModels20 Vector Model

PrasadL2IRModels21 Documents as vectors Not all index terms are equally useful in representing document content. Each doc j can be viewed as a vector of non-boolean weights, one component for each term  terms are axes of vector space  docs are points in this vector space. Even with stemming, the vector space may have 20,000+ dimensions

PrasadL2IRModels22 Intuition Postulate: Documents that are “close together” in the vector space talk about the same things. [Figure: documents d1 to d5 drawn as vectors against term axes t1, t2, t3, with angles θ and φ between them]

PrasadL2IRModels23 Desiderata for proximity If d1 is near d2, then d2 is near d1. If d1 is near d2, and d2 is near d3, then d1 is not far from d3. No doc is closer to d than d itself.

PrasadL2IRModels24 First cut Idea: Distance between d1 and d2 is the length of the vector |d1 - d2|.  Euclidean distance. Why is this not a great idea? We still haven’t dealt with the issue of length normalization  Short documents would be more similar to each other by virtue of length, not topic. However, we can implicitly normalize by looking at angles instead  “Proportional content”
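
A small numeric sketch of why angles behave better than Euclidean distance. The three weight vectors below are made up for illustration: d2 has exactly the same term proportions as d1 but is ten times longer, while d3 is about as long as d1 but emphasizes a different term.

```python
import math


def euclidean(u, v):
    """Length of the difference vector |u - v|."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    """Cosine of the angle between u and v (both assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


d1 = [3.0, 1.0, 0.0]        # hypothetical term weights
d2 = [x * 10 for x in d1]   # same proportions, ten times longer
d3 = [1.0, 3.0, 0.0]        # similar length to d1, different emphasis

# Euclidean distance calls d3 the closer document to d1 ...
print(euclidean(d1, d2), euclidean(d1, d3))
# ... while the angle treats d2 as pointing in the same direction as d1.
print(cosine(d1, d2), cosine(d1, d3))
```

Under these assumptions the Euclidean measure ranks the off-topic d3 as nearer to d1 than the proportionally identical d2, which is the length effect the slide warns about.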

PrasadL2IRModels25 Cosine similarity Distance between vectors d1 and d2 captured by the cosine of the angle θ between them. [Figure: vectors d1 and d2 against term axes t1, t2, t3, with the angle θ between them]

PrasadL2IRModels26 Cosine similarity A vector can be normalized (given a length of 1) by dividing each of its components by its length; here we use the L2 norm, |x| = sqrt(sum_i xi^2). This maps vectors onto the unit sphere: then |dj| = sqrt(sum_i wij^2) = 1. Longer documents don’t get more weight

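PrasadL2IRModels27 Cosine similarity Cosine of angle between two vectors: sim(dj, dk) = (dj · dk) / (|dj| * |dk|) = sum_i (wij * wik) / (sqrt(sum_i wij^2) * sqrt(sum_i wik^2)). The denominator involves the lengths of the vectors (normalization).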

PrasadL2IRModels28 Example Docs: Austen's Sense and Sensibility, Pride and Prejudice; Bronte's Wuthering Heights. tf weights

PrasadL2IRModels29 Normalized weights cos(SAS, PAP) = .996 x .993 + .087 x .120 + .017 x 0.0 = 0.999; cos(SAS, WH) = .996 x .847 + .087 x .466 + .017 x .254 = 0.889
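
The normalization and dot-product steps can be reproduced with a short sketch. The raw term counts below (for the terms affection, jealous, gossip) are an assumption: the tf table did not survive in this transcript, and these values are used because they reproduce the .996, .254 and 0.889 figures that did. The helper names are hypothetical.

```python
import math

# ASSUMED raw term counts in the order (affection, jealous, gossip);
# the original tf table is not present in this transcript.
tf = {
    "SAS": [115, 10, 2],   # Sense and Sensibility
    "PAP": [58, 7, 0],     # Pride and Prejudice
    "WH": [20, 11, 6],     # Wuthering Heights
}


def l2_normalize(vec):
    """Divide each component by the vector's L2 length."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


unit = {name: l2_normalize(vec) for name, vec in tf.items()}

# For unit-length vectors the cosine is just the dot product.
cos_sas_pap = dot(unit["SAS"], unit["PAP"])
cos_sas_wh = dot(unit["SAS"], unit["WH"])
print(round(cos_sas_pap, 3), round(cos_sas_wh, 3))
```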

PrasadL2IRModels30 Queries in the vector space model Central idea: the query as a vector. We regard the query as a short document  Note that dq is very sparse! We return the documents ranked by the closeness of their vectors to the query, also represented as a vector.
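
Ranking by closeness to the query vector might look like the sketch below. The document weights and the query are invented for illustration, over three index terms (k1, k2, k3); an all-zero vector is given similarity 0.0 by convention here, which is a choice of this sketch rather than something stated in the lecture.

```python
import math


def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


# Hypothetical weights over the index terms (k1, k2, k3).
docs = {
    "d1": [1.0, 0.0, 1.0],
    "d2": [1.0, 0.0, 0.0],
    "d3": [0.0, 1.0, 1.0],
    "d4": [0.0, 0.0, 0.0],
}
query = [1.0, 0.0, 0.0]   # sparse: only k1 is mentioned

scores = {name: cosine(query, vec) for name, vec in docs.items()}
ranking = sorted(docs, key=lambda name: scores[name], reverse=True)
print(ranking)
```

Note that d1 receives a non-zero score even though it also contains k3, which is the partial matching that the Boolean model cannot express.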

PrasadL2IRModels31 The Vector Model: Example I [Figure: documents d1 to d7 and index terms k1, k2, k3]

PrasadL2IRModels32 The Vector Model: Example II [Figure: documents d1 to d7 and index terms k1, k2, k3]

PrasadL2IRModels33 The Vector Model: Example III [Figure: documents d1 to d7 and index terms k1, k2, k3]

PrasadL2IRModels34 Summary: What’s the point of using vector spaces? A well-formed algebraic space for retrieval Query becomes a vector in the same space as the docs. Can measure each doc’s proximity to it. Natural measure of scores/ranking – no longer Boolean.  Documents and queries are expressed as bags of words

PrasadL2IRModels35 The Vector Model Non-binary (numeric) term weights used to compute degree of similarity between a query and each of the documents. Enables  partial matches to deal with incompleteness  answer set ranking to deal with information overload

PrasadL2IRModels36 Define:  wij > 0 whenever ki ∈ dj  wiq >= 0 associated with the pair (ki,q)  vec(dj) = (w1j, w2j, ..., wtj), vec(q) = (w1q, w2q, ..., wtq)  To each term ki, associate a unit vector vec(i)  The t unit vectors, vec(1), ..., vec(t), form an orthonormal basis (embodying the independence assumption) for the t-dimensional space for representing queries and documents

PrasadL2IRModels37 The Vector Model How to compute the weights wij and wiq?  quantification of intra-document content (similarity/semantic emphasis): tf factor, the term frequency within a document  quantification of inter-document separation (dissimilarity/significant discriminant): idf factor, the inverse document frequency  wij = tf(i,j) * idf(i)

PrasadL2IRModels38 Let,  N be the total number of docs in the collection  ni be the number of docs which contain ki  freq(i,j) raw frequency of ki within dj A normalized tf factor is given by  f(i,j) = freq(i,j) / max(freq(l,j)) where the maximum is computed over all terms which occur within the document dj The idf factor is computed as  idf(i) = log (N/ni) the log makes the values of tf and idf comparable.
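
The two factors can be computed as in the following sketch. The three-document collection is invented for illustration, the natural logarithm is assumed (the slide does not fix a base), and idf is only defined here for terms that occur in at least one document.

```python
import math
from collections import Counter

# Hypothetical collection, already tokenized.
collection = {
    "d1": "retrieval model model ranking".split(),
    "d2": "boolean model".split(),
    "d3": "vector ranking ranking ranking".split(),
}
N = len(collection)   # total number of docs


def normalized_tf(doc_tokens):
    """f(i,j) = freq(i,j) / max over l of freq(l,j), for every term in the doc."""
    freq = Counter(doc_tokens)
    max_freq = max(freq.values())
    return {term: count / max_freq for term, count in freq.items()}


def idf(term):
    """idf(i) = log(N / ni); assumes the term occurs in at least one doc."""
    n_i = sum(1 for tokens in collection.values() if term in tokens)
    return math.log(N / n_i)


f_d1 = normalized_tf(collection["d1"])
print(f_d1)
print(idf("model"), idf("boolean"))
```

A term found in fewer documents ("boolean", one doc) gets a larger idf than one found in more ("model", two docs), which is the discriminating effect the idf factor is meant to capture.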

PrasadL2IRModels39 Digression: terminology WARNING: In a lot of IR literature, “frequency” is used to mean “count”  Thus term frequency in IR literature is used to mean number of occurrences in a doc  Not divided by document length (which would actually make it a frequency)

PrasadL2IRModels40 The best term-weighting schemes use weights which are given by  wij = f(i,j) * log(N/ni)  the strategy is called a tf-idf weighting scheme. For the query term weights, use  wiq = (0.5 + [0.5 * freq(i,q) / max(freq(l,q))]) * log(N/ni), where the maximum is computed over all terms l in the query. The vector model with tf-idf weights is a good ranking strategy for general collections.  It is also simple and fast to compute.
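
Both weighting formulas, wij = f(i,j) * log(N/ni) and wiq = (0.5 + 0.5 * freq(i,q) / max freq(l,q)) * log(N/ni), can be sketched together. The collection statistics below (N = 4 and the ni values in doc_freq) are invented, the natural logarithm is assumed, and skipping query terms that never occur in the collection is a choice of this sketch, not something the slide specifies.

```python
import math
from collections import Counter


def doc_weights(doc_tokens, doc_freq, n_docs):
    """wij = f(i,j) * log(N/ni) for each term of one document."""
    freq = Counter(doc_tokens)
    max_freq = max(freq.values())
    return {
        term: (count / max_freq) * math.log(n_docs / doc_freq[term])
        for term, count in freq.items()
    }


def query_weights(query_tokens, doc_freq, n_docs):
    """wiq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N/ni)."""
    freq = Counter(query_tokens)
    max_freq = max(freq.values())
    return {
        term: (0.5 + 0.5 * count / max_freq) * math.log(n_docs / doc_freq[term])
        for term, count in freq.items()
        if term in doc_freq   # sketch choice: unseen terms get no weight
    }


# Hypothetical statistics: N = 4 docs, ni = number of docs containing each term.
doc_freq = {"vector": 1, "model": 2, "ranking": 4}
w_d = doc_weights(["vector", "model", "model"], doc_freq, 4)
w_q = query_weights(["model", "ranking"], doc_freq, 4)
print(w_d)
print(w_q)
```

In this toy setup "ranking" appears in every document, so log(N/ni) = 0 and the term contributes nothing to the query vector regardless of how often it is typed.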

PrasadL2IRModels41 The Vector Model Advantages:  term-weighting improves answer set quality  partial matching allows retrieval of docs that approximate the query conditions  cosine ranking formula sorts documents according to degree of similarity to the query Disadvantages:  assumes independence of index terms; not clear that this is bad though