1
IR Models J. H. Wang Mar. 11, 2008
2
The Retrieval Process
[Diagram: the user need enters through the user interface and passes through text operations to form a logical view; query operations build the query, which searching runs against the inverted-file index maintained by the indexing module (DB Manager Module) over the text database; retrieved documents are ranked and returned to the user, with optional user feedback refining the query.]
3
Introduction
Traditional information retrieval systems usually adopt index terms to index and retrieve documents
–An index term is a keyword (or a group of related words) which has some meaning of its own (usually a noun)
Advantages
–Simple
–The semantics of the documents and of the user information need can be naturally expressed through sets of index terms
4
[Diagram: both the documents and the user's information need are represented by index terms; the ranking function matches the document representation against the query representation.]
5
IR Models Ranking algorithms are at the core of information retrieval systems (predicting which documents are relevant and which are not).
6
A Taxonomy of Information Retrieval Models
User task: Retrieval (ad hoc, filtering) and Browsing
–Classic models: Boolean, Vector, Probabilistic
–Set theoretic: Fuzzy, Extended Boolean
–Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
–Probabilistic: Inference Network, Belief Network
–Structured models: Non-overlapping Lists, Proximal Nodes
–Browsing: Flat, Structure Guided, Hypertext
7
User task | Index Terms | Full Text | Full Text + Structure
Retrieval | Classic, Set Theoretic, Algebraic, Probabilistic | Classic, Set Theoretic, Algebraic, Probabilistic | Structured
Browsing  | Flat | Flat, Hypertext | Structure Guided, Hypertext
Figure 2.2 Retrieval models most frequently associated with distinct combinations of a document logical view and a user task.
8
Retrieval: Ad Hoc and Filtering
Ad hoc (search): the documents in the collection remain relatively static while new queries are submitted to the system
Routing (filtering): the queries remain relatively static while new documents come into the system
9
Retrieval: Ad Hoc x Filtering
[Diagram: in ad hoc retrieval, a collection of roughly fixed size is probed by a stream of different queries Q1–Q5.]
10
Retrieval: Ad Hoc x Filtering
[Diagram: in filtering, a stream of incoming documents is matched against fixed user profiles; documents are routed to User 1 and filtered for User 2 according to their profiles.]
11
A Formal Characterization of IR Models
An IR model is a quadruple [D, Q, F, R(q_i, d_j)] where:
–D: a set composed of logical views (or representations) of the documents in the collection
–Q: a set composed of logical views (or representations) of the user information needs (queries)
–F: a framework for modeling document representations, queries, and their relationships
–R(q_i, d_j): a ranking function which associates a real number with a query q_i in Q and a document representation d_j in D, defining an ordering among the documents with regard to the query q_i
12
Definition
–k_i: a generic index term
–K: the set of all index terms, {k_1, …, k_t}
–w_{i,j}: a weight associated with index term k_i of a document d_j
–g_i: a function that returns the weight associated with k_i in any t-dimensional vector, i.e., g_i(d_j) = w_{i,j}
13
Classic IR Models
Basic concepts:
–Each document is described by a set of representative keywords called index terms
–A numerical weight is assigned to each index term to quantify its relevance to the document
Three classic models: Boolean, vector, probabilistic
14
Boolean Model
Binary decision criterion
–A document is either relevant or nonrelevant (no partial matching)
–Closer to a data retrieval model than to an information retrieval model
Advantage
–Clean formalism, simplicity
Disadvantages
–It is not simple to translate an information need into a Boolean expression
–Exact matching may lead to retrieval of too few or too many documents
15
Example
The query q = k_a ∧ (k_b ∨ ¬k_c) can be represented as a disjunction of conjunctive vectors (in disjunctive normal form):
–q_dnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0), where each vector gives the binary weights of (k_a, k_b, k_c)
Formal definition
–For the Boolean model, the index term weights are all binary, i.e., w_{i,j} ∈ {0,1}
–A query is a conventional Boolean expression, which can be transformed into a disjunctive normal form q_dnf (q_cc: a conjunctive component of q_dnf)
–sim(d_j, q) = 1 if there exists q_cc in q_dnf such that for all k_i, g_i(d_j) = g_i(q_cc); otherwise sim(d_j, q) = 0
[Figure: Venn diagram over K_a, K_b, K_c marking the three conjunctive components (1,1,1), (1,1,0), (1,0,0).]
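As an illustrative sketch (not part of the original slides), the matching rule above can be coded directly: a document matches when its binary weight vector over (k_a, k_b, k_c) equals one of the conjunctive components of q_dnf. The term names and documents below are made up.

```python
# Boolean model sketch: the query ka AND (kb OR NOT kc), in DNF, is the
# set of binary weight vectors over (ka, kb, kc) from the slide above.
DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

def sim(doc_terms, vocab=("ka", "kb", "kc")):
    """Return 1 if the document's binary vector equals some conjunctive component."""
    vec = tuple(1 if t in doc_terms else 0 for t in vocab)
    return 1 if vec in DNF else 0

print(sim({"ka"}))        # (1,0,0) is a component -> 1
print(sim({"ka", "kc"}))  # (1,0,1) is not -> 0
```

Note the all-or-nothing behavior: there is no notion of a partial match, which is exactly the disadvantage listed on the previous slide.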
16
Vector Model [Salton, 1968]
Assign non-binary weights to index terms in queries and in documents => TF×IDF
Compute the degree of similarity between each document and the query => sim(d_j, q)
More precise than the Boolean model
17
The IR Problem: A Clustering Problem
We think of the documents as a collection C of objects and of the user query as a (vague) specification of a set A of objects
Intra-cluster similarity
–What are the features which better describe the objects in the set A?
Inter-cluster dissimilarity
–What are the features which better distinguish the objects in the set A from the remaining objects in the collection C?
18
The Idea behind TF×IDF
TF: intra-cluster similarity is quantified by measuring the raw frequency of a term k_i inside a document d_j
–term frequency (the tf factor) provides one measure of how well that term describes the document contents
IDF: inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of a term k_i among the documents in the collection
–inverse document frequency (the idf factor): a term that appears in many documents is not useful for distinguishing relevant documents from nonrelevant ones
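A small sketch of the two factors (not from the slides): the toy documents below are illustrative, and the specific choices of normalized tf and logarithmic idf follow the classic formulas.

```python
import math

# Toy collection; rows = tokenized documents (illustrative data).
docs = {
    "d1": "gold silver truck".split(),
    "d2": "shipment of gold damaged in a fire".split(),
    "d3": "delivery of silver arrived in a silver truck".split(),
}
N = len(docs)

def tf(term, doc_id):
    # normalized frequency: raw count divided by the max count in the doc
    counts = {t: docs[doc_id].count(t) for t in docs[doc_id]}
    return counts.get(term, 0) / max(counts.values())

def idf(term):
    # inverse document frequency: log(N / n_i), n_i = docs containing the term
    n_i = sum(1 for d in docs.values() if term in d)
    return math.log(N / n_i) if n_i else 0.0

print(tf("silver", "d3"), idf("silver"))  # 1.0 and log(3/2)
```

"fire" (one document) gets a higher idf than "gold" (two documents), matching the intuition above.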
19
Vector Model (1/4)
Index terms are assigned positive, non-binary weights
The index terms in the query are also weighted
Term weights are used to compute the degree of similarity between each document and the user query
Retrieved documents are then sorted in decreasing order of this degree of similarity
20
Vector Model (2/4)
Degree of similarity: the cosine of the angle between the document vector and the query vector,
sim(d_j, q) = (d_j · q) / (|d_j| × |q|) = Σ_{i=1..t} w_{i,j} × w_{i,q} / ( sqrt(Σ_{i=1..t} w_{i,j}²) × sqrt(Σ_{i=1..t} w_{i,q}²) )
21
Vector Model (3/4)
Definition (freq_{i,j}: raw frequency of k_i in d_j; N: number of documents; n_i: number of documents containing k_i)
–normalized frequency: f_{i,j} = freq_{i,j} / max_l freq_{l,j}
–inverse document frequency: idf_i = log(N / n_i)
–term-weighting scheme: w_{i,j} = f_{i,j} × log(N / n_i)
–query-term weights: w_{i,q} = (0.5 + 0.5 × freq_{i,q} / max_l freq_{l,q}) × log(N / n_i)
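A minimal cosine-ranking sketch (not from the slides); the weight vectors below, over three hypothetical terms k1–k3, are made up for illustration.

```python
import math

def cosine(doc_w, query_w):
    # cosine of the angle between the two weight vectors
    dot = sum(d * q for d, q in zip(doc_w, query_w))
    norm_d = math.sqrt(sum(d * d for d in doc_w))
    norm_q = math.sqrt(sum(q * q for q in query_w))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

# Illustrative tf-idf weight vectors over terms (k1, k2, k3).
docs = {"d1": [0.5, 0.8, 0.3], "d2": [0.9, 0.4, 0.0], "d3": [0.0, 0.2, 0.7]}
query = [0.0, 0.7, 0.5]

# Documents sorted in decreasing order of similarity to the query.
ranked = sorted(docs, key=lambda d: cosine(docs[d], query), reverse=True)
print(ranked)  # -> ['d1', 'd3', 'd2']
```

This is the partial-matching behavior the next slide lists as an advantage: d2 shares only one weighted term with the query but is still retrieved, just ranked last.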
22
Vector Model (4/4) Advantages –Its term-weighting scheme improves retrieval performance –Its partial matching strategy allows retrieval of documents that approximate the query conditions –Its cosine ranking formula sorts the documents according to their degree of similarity to the query Disadvantage –The assumption of mutual independence between index terms
23
The Vector Model: Example I
[Figure: a sample term–document space with documents d1–d7 and index terms k1–k3; the accompanying weight table is not reproduced in this transcript.]
24
The Vector Model: Example II
[Figure: the same documents d1–d7 and terms k1–k3 with a different weight assignment; table not reproduced in this transcript.]
25
The Vector Model: Example III
[Figure: the same documents d1–d7 and terms k1–k3 with yet another weight assignment; table not reproduced in this transcript.]
26
Probabilistic Model (1/6)
Introduced by Robertson and Sparck Jones, 1976
–Binary independence retrieval (BIR) model
Idea: given a user query q and the ideal answer set R of the relevant documents, the problem is to specify the properties of this set
–Assumption (probabilistic principle): the probability of relevance depends on the query and document representations only; the ideal answer set R should maximize the overall probability of relevance
–The probabilistic model tries to estimate the probability that the user will find the document d_j relevant, and ranks by the ratio P(d_j relevant to q) / P(d_j nonrelevant to q)
27
Probabilistic Model (2/6)
Definition
–All index term weights are binary, i.e., w_{i,j} ∈ {0,1}
–Let R be the set of documents known to be relevant to query q
–Let R̄ be the complement of R (the set of nonrelevant documents)
–Let P(R|d_j) be the probability that the document d_j is relevant to the query q
–Let P(R̄|d_j) be the probability that the document d_j is nonrelevant to the query q
28
Probabilistic Model (3/6)
The similarity sim(d_j, q) of the document d_j to the query q is defined as the ratio
sim(d_j, q) = P(R|d_j) / P(R̄|d_j)
Using Bayes' rule,
sim(d_j, q) = [P(d_j|R) × P(R)] / [P(d_j|R̄) × P(R̄)]
–P(R) stands for the probability that a document randomly selected from the entire collection is relevant
–P(d_j|R) stands for the probability of randomly selecting the document d_j from the set R of relevant documents
Since P(R) and P(R̄) are the same for all documents, sim(d_j, q) ~ P(d_j|R) / P(d_j|R̄)
29
Probabilistic Model (4/6)
Assuming independence of index terms and representing d_j by its binary term vector (g_1(d_j), …, g_t(d_j)),
sim(d_j, q) ~ [ Π_{g_i(d_j)=1} P(k_i|R) × Π_{g_i(d_j)=0} P(¬k_i|R) ] / [ Π_{g_i(d_j)=1} P(k_i|R̄) × Π_{g_i(d_j)=0} P(¬k_i|R̄) ]
30
Probabilistic Model (5/6)
–P(k_i|R) stands for the probability that the index term k_i is present in a document randomly selected from the set R
–P(¬k_i|R) stands for the probability that the index term k_i is not present in a document randomly selected from the set R
31
Probabilistic Model (6/6)
Taking logarithms, using P(k_i|R) + P(¬k_i|R) = 1, and ignoring factors that are constant for all documents, we obtain the key ranking formula:
sim(d_j, q) ~ Σ_{i=1..t} w_{i,q} × w_{i,j} × [ log( P(k_i|R) / (1 − P(k_i|R)) ) + log( (1 − P(k_i|R̄)) / P(k_i|R̄) ) ]
32
Estimation of Term Relevance
In the very beginning (no retrieved documents yet), simple assumptions are used:
–P(k_i|R) = 0.5 and P(k_i|R̄) = df_i / N, where df_i is the number of documents containing k_i and N is the total number of documents
Next, let V be a subset of the documents initially retrieved and ranked, and V_i the subset of V containing k_i; the ranking can be improved as follows:
–P(k_i|R) = V_i / V and P(k_i|R̄) = (df_i − V_i) / (N − V)
For small values of V and V_i, adjustment factors are added, e.g.:
–P(k_i|R) = (V_i + 0.5) / (V + 1) and P(k_i|R̄) = (df_i − V_i + 0.5) / (N − V + 1)
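A sketch of the per-term weight under the initial estimates above, i.e., P(k_i|R) = 0.5 and P(k_i|R̄) = df_i / N (not from the slides; the collection sizes are made up):

```python
import math

def bir_term_weight(df_i, N, p_rel=0.5):
    # log-odds term from the BIR ranking formula, with the initial
    # estimates P(ki|R) = p_rel and P(ki|not-R) = df_i / N
    q_i = df_i / N
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - q_i) / q_i)

# A rare term (10 of 1000 docs) vs. a common one (500 of 1000 docs).
rare = bir_term_weight(10, 1000)
common = bir_term_weight(500, 1000)
print(rare, common)  # the rare term gets the larger weight
```

With p_rel = 0.5 the first log term vanishes, so the initial weight behaves like an idf: rare terms dominate the ranking, which is why the subsequent re-estimation from V and V_i matters.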
33
Advantage
–Documents are ranked in decreasing order of their probability of being relevant
Disadvantages
–The need to guess the initial relevant and nonrelevant sets
–Term frequency is not considered
–The independence assumption for index terms
34
Brief Comparison of Classic Models Boolean model is the weakest –Not able to recognize partial matches Controversy between probabilistic and vector models –The vector model is expected to outperform the probabilistic model with general collections
35
Alternative Set Theoretic Models Fuzzy Set Model Extended Boolean Model
36
Fuzzy Theory
A fuzzy subset A of a universe U is characterized by a membership function u_A: U → [0,1] which associates with each element u of U a number u_A(u) in the interval [0,1]
Let A and B be two fuzzy subsets of U; complement, union, and intersection are defined by
–u_Ā(u) = 1 − u_A(u)
–u_{A∪B}(u) = max(u_A(u), u_B(u))
–u_{A∩B}(u) = min(u_A(u), u_B(u))
37
Fuzzy Information Retrieval
Using a term–term correlation matrix c, where c_{i,l} measures the correlation between terms k_i and k_l (e.g., c_{i,l} = n_{i,l} / (n_i + n_l − n_{i,l}), with n_i the number of documents containing k_i and n_{i,l} the number containing both)
Define a fuzzy set associated to each index term k_i, with the membership of document d_j given by u_i(d_j) = 1 − Π_{k_l ∈ d_j} (1 − c_{i,l})
–If a term k_l of d_j is strongly related to k_i, that is c_{i,l} ≈ 1, then u_i(d_j) ≈ 1
–If all terms of d_j are loosely related to k_i, that is c_{i,l} ≈ 0, then u_i(d_j) ≈ 0
38
Example
[Figure: Venn diagram over K_a, K_b, K_c showing the three conjunctive components cc1, cc2, cc3 of a query in disjunctive normal form.]
39
Algebraic Sum and Product
The degree of membership in a disjunctive fuzzy set is computed using an algebraic sum, u_{A∪B}(u) = 1 − (1 − u_A(u)) × (1 − u_B(u)), instead of the max function
The degree of membership in a conjunctive fuzzy set is computed using an algebraic product, u_{A∩B}(u) = u_A(u) × u_B(u), instead of the min function
These operators are smoother than the max and min functions
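A sketch comparing the algebraic sum/product with the max/min operators (not from the slides; the membership values are illustrative):

```python
def alg_sum(memberships):
    # 1 - prod(1 - m): the algebraic sum, written as the negation of a product
    out = 1.0
    for m in memberships:
        out *= (1.0 - m)
    return 1.0 - out

def alg_prod(memberships):
    # the algebraic product of the membership degrees
    out = 1.0
    for m in memberships:
        out *= m
    return out

a, b = 0.8, 0.5
print(alg_sum([a, b]), max(a, b))   # 0.9 vs 0.8
print(alg_prod([a, b]), min(a, b))  # 0.4 vs 0.5
```

Unlike max and min, both inputs influence the result: raising b from 0.5 to 0.6 changes the algebraic sum and product but leaves max(a, b) and min-based conjunction partly insensitive, which is the "smoothness" the slide refers to.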
40
Alternative Algebraic Models Generalized Vector Space Model Latent Semantic Model Neural Network Model
41
Sparse Matrix Problem
Consider a term–document matrix of dimensions 1M × 1M
–Most of the entries will be 0: a sparse matrix
–A waste of storage and computation
–How can we reduce the dimensionality?
42
Latent Semantic Indexing (1/5)
Let M = (M_{ij}) be a term–document association matrix with t rows and N columns
Latent semantic indexing decomposes M by singular value decomposition, M = K S D^t, where
–K is the matrix of eigenvectors derived from the term-to-term correlation matrix (M M^t)
–D^t is the transpose of D, the matrix of eigenvectors derived from the document-to-document correlation matrix (M^t M)
–S is an r × r diagonal matrix of singular values, where r = min(t, N) is the rank of M
43
Latent Semantic Indexing (2/5)
Consider now only the s largest singular values of S, together with their corresponding columns in K and D^t (the remaining singular values are deleted)
The resultant matrix M_s = K_s S_s D_s^t (of rank s) is the matrix closest to the original matrix M in the least-squares sense
s, with s < r, is the dimensionality of a reduced concept space
44
Latent Semantic Indexing (3/5) The selection of s attempts to balance two opposing effects –s should be large enough to allow fitting all the structure in the real data –s should be small enough to allow filtering out all the non-relevant representational details
45
Latent Semantic Indexing (4/5)
Consider the relationship between any two documents: it can be obtained from the matrix
M_s^t M_s = D_s S_s (D_s S_s)^t
whose (i,j) entry quantifies the relationship between documents d_i and d_j in the reduced concept space
46
Latent Semantic Indexing (5/5)
To rank documents with regard to a given user query, we model the query as a pseudo-document in the original matrix M
–Assume the query is modeled as the document with number k
–Then the k-th row in the matrix M_s^t M_s provides the ranks of all documents with respect to this query
47
Computing an Example
Let M = (M_{ij}) be given by the matrix [values not reproduced in this transcript]
–Compute the matrices K, S, and D^t
48
Latent Semantic Indexing transforms the occurrence matrix into a relation between the terms and concepts, and a relation between the concepts and the documents
–Indirect relation between terms and documents through some hidden (or latent) concepts
[Figure: a direct link between the terms "Taipei" and "Taiwan" and a document is uncertain ("doc ?") without intermediate concepts.]
49
[Figure: the terms "Taipei" and "Taiwan" relate to the document indirectly, through shared (latent) concepts.]
50
Alternative Probabilistic Model Bayesian Networks Inference Network Model Belief Network Model
51
Bayesian Networks
Let x_i be a node in a Bayesian network G and Γ(x_i) be the set of parent nodes of x_i
The influence of Γ(x_i) on x_i can be specified by any set of functions F_i(x_i, Γ(x_i)) that satisfy 0 ≤ F_i(x_i, Γ(x_i)) ≤ 1 and Σ_{x_i} F_i(x_i, Γ(x_i)) = 1
Example (a five-node network with arcs x_1→x_2, x_1→x_3, x_2→x_4, x_3→x_4, x_3→x_5):
P(x_1, x_2, x_3, x_4, x_5) = P(x_1) P(x_2|x_1) P(x_3|x_1) P(x_4|x_2, x_3) P(x_5|x_3)
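The factorization above can be evaluated directly; as a sketch (not from the slides), the conditional-probability numbers below are made up, and we sanity-check that the factored joint sums to 1 over all assignments.

```python
from itertools import product

# Made-up conditional probability tables for binary x1..x5, following the
# factorization P(x1..x5) = P(x1)P(x2|x1)P(x3|x1)P(x4|x2,x3)P(x5|x3).
def p_x1(x1):         return 0.6 if x1 else 0.4
def p_x2(x2, x1):     p = 0.7 if x1 else 0.2; return p if x2 else 1 - p
def p_x3(x3, x1):     p = 0.5 if x1 else 0.1; return p if x3 else 1 - p
def p_x4(x4, x2, x3): p = [0.1, 0.6, 0.7, 0.9][2 * x2 + x3]; return p if x4 else 1 - p
def p_x5(x5, x3):     p = 0.8 if x3 else 0.3; return p if x5 else 1 - p

def joint(x1, x2, x3, x4, x5):
    return (p_x1(x1) * p_x2(x2, x1) * p_x3(x3, x1)
            * p_x4(x4, x2, x3) * p_x5(x5, x3))

# The joint must sum to 1 over all 2^5 assignments.
total = sum(joint(*bits) for bits in product([0, 1], repeat=5))
print(round(total, 10))  # -> 1.0
```

The point of the factorization is economy: five small conditional tables replace a full table of 2^5 joint probabilities, and the same idea underlies the inference and belief network models that follow.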
52
Belief Network Model (1/6)
The probability space
–The set K = {k_1, k_2, …, k_t} of all index terms is the universe
–To each subset u of K is associated a vector k such that g_i(k) = 1 iff k_i ∈ u
Random variables
–To each index term k_i is associated a binary random variable
53
Belief Network Model (2/6)
Concept space
–A document d_j is represented as a concept composed of the terms used to index d_j
–A user query q is also represented as a concept composed of the terms used to index q
–Both the user query and the documents are modeled as subsets of index terms
A probability distribution P is defined over this concept space; P(c) measures the degree of coverage of the space K by a concept c
54
Belief Network Model (3/6)
A query q is modeled as a network node associated with a binary random variable
–The variable is set to 1 whenever q completely covers the concept space K
–P(q) computes the degree of coverage of the space K by q
A document d_j is also modeled as a network node associated with a binary random variable
–The variable is set to 1 to indicate that d_j completely covers the concept space K
–P(d_j) computes the degree of coverage of the space K by d_j
55
Belief Network Model (4/6)
[Figure: the network topology, with the index term nodes k_i as roots and arcs to the query node q and to each document node d_j; the accompanying equations for P(d_j|q) are not reproduced in this transcript.]
56
Belief Network Model (5/6) Assumption –P( d j | q ) is adopted as the rank of the document d j with respect to the query q
57
Belief Network Model (6/6)
Specify the conditional probabilities of the document and query nodes in terms of the normalized tf–idf weights of the vector model
Thus, the belief network model can be tuned to subsume the vector model (reproducing its ranking)
58
Comparison
Belief network model
–It is based on a set-theoretic view
–It provides a clean separation between the document and the query
–It is able to reproduce any ranking strategy generated by the inference network model
Inference network model
–It takes a purely epistemological view, which is more difficult to grasp