Published by Posy Patrick. Modified over 9 years ago.
1
Modeling (Chap. 2) Modern Information Retrieval Spring 2000
2
Introduction
- Traditional IR systems use index terms to index and retrieve documents.
- An index term is simply any word that appears in the text of the documents.
- Retrieval based on index terms is simple.
  - The premise is that the semantics of the documents and of the user's information need can be expressed through a set of index terms.
3
- Key question
  - The semantics of a document (or user request) are lost when its text is replaced with a set of words.
  - Matching between documents and the user request is done in the very imprecise space of index terms (low-quality retrieval).
  - The problem is worse for users with no training in forming queries properly (a cause of the frequent dissatisfaction of Web users with the answers obtained).
4
Taxonomy of IR Models
Three classic models:
- Boolean: documents and queries are represented as sets of index terms.
- Vector: documents and queries are represented as vectors in a t-dimensional space.
- Probabilistic: document and query representations are based on probability theory.
5
Basic Concepts
- Classic models consider that each document is described by index terms.
- An index term is a (document) word that helps in remembering the document's main themes.
  - Index terms are used to index and summarize document content.
  - In general, index terms are nouns (because nouns have meaning by themselves).
  - Index terms may also be taken to be all distinct words in a document collection.
6
- Distinct index terms have varying relevance when describing document contents.
- Thus a numerical weight is assigned to each index term of a document.
- Let $k_i$ be an index term, $d_j$ a document, and $w_{i,j} \geq 0$ the weight for the pair $(k_i, d_j)$.
- The weight quantifies the importance of the index term for describing the document's semantic contents.
7
Definition (p. 25)
- Let $t$ be the number of index terms in the system and $k_i$ be a generic index term.
- $K = \{k_1, \ldots, k_t\}$ is the set of all index terms.
- A weight $w_{i,j} > 0$ is associated with each index term $k_i$ of document $d_j$.
- For an index term that does not appear in the document text, $w_{i,j} = 0$.
- Document $d_j$ is associated with an index term vector $\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$.
8
Boolean Model
- Simple retrieval model based on set theory and Boolean algebra.
  - The framework is easy for users to grasp (the concept of a set is intuitive).
- Queries are specified as Boolean expressions, which have precise semantics.
9
Drawbacks
- The retrieval strategy is a binary decision (a document is either relevant or non-relevant), which prevents good retrieval performance.
- It is not simple to translate an information need into a Boolean expression (difficult and awkward to express).
- Nevertheless, the Boolean model is the dominant model in commercial DB systems.
10
Boolean Model (Cont.)
- Considers that index terms are either present or absent in a document.
- Index term weights are binary, i.e. $w_{i,j} \in \{0,1\}$.
- Query $q$ is composed of index terms linked by not, and, or.
- The query is a Boolean expression, which can be represented in disjunctive normal form (DNF).
11
Boolean Model (Cont.)
- Query $q = k_a \wedge (k_b \vee \neg k_c)$ can be written in DNF as $\vec{q}_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)$.
  - Each component is a binary weighted vector associated with the tuple $(k_a, k_b, k_c)$.
  - These binary weighted vectors are called the conjunctive components of $\vec{q}_{dnf}$.
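The conjunctive components of the example query can be checked by brute force. The following is a minimal sketch, not part of the original slides: it assumes the query is given as a Python predicate over 0/1 term values, and the helper name `conjunctive_components` is hypothetical.

```python
from itertools import product

def conjunctive_components(query, num_terms):
    """Return every binary tuple over num_terms index terms that satisfies query.

    `query` is a hypothetical predicate taking one 0/1 value per index term.
    """
    return [bits for bits in product((1, 0), repeat=num_terms) if query(*bits)]

# q = k_a AND (k_b OR NOT k_c), the query used on the slide
q = lambda ka, kb, kc: bool(ka and (kb or not kc))

q_dnf = conjunctive_components(q, 3)
print(q_dnf)  # [(1, 1, 1), (1, 1, 0), (1, 0, 0)]
```

Enumerating all 2^t tuples is only practical for the handful of terms that appear in a query; it is used here purely to confirm the three components listed above.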
12
Boolean Model (Cont.)
- Index term weight variables are all binary, i.e. $w_{i,j} \in \{0,1\}$.
- Query $q$ is a Boolean expression.
- Let $\vec{q}_{dnf}$ be the DNF for query $q$.
- Let $\vec{q}_{cc}$ be any of the conjunctive components of $\vec{q}_{dnf}$.
- The similarity of document $d_j$ to query $q$ is
  - $sim(d_j, q) = 1$ if $\exists\, \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf}) \wedge (\forall k_i,\; g_i(\vec{d}_j) = g_i(\vec{q}_{cc}))$, where $g_i(\vec{d}_j) = w_{i,j}$
  - $sim(d_j, q) = 0$ otherwise
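The similarity rule amounts to a membership test: a document matches if its binary weights equal one of the conjunctive components. A minimal sketch follows, assuming (for illustration only) that the vocabulary is restricted to the three query terms $(k_a, k_b, k_c)$; the function name and the sample documents are hypothetical.

```python
def boolean_sim(doc_weights, q_dnf):
    """Return 1 if the document's binary weights equal some conjunctive component, else 0."""
    return 1 if tuple(doc_weights) in set(q_dnf) else 0

# Conjunctive components of q = k_a AND (k_b OR NOT k_c) over (k_a, k_b, k_c)
q_dnf = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

# Hypothetical documents, each given as binary weights for (k_a, k_b, k_c)
docs = {"d1": (1, 1, 0), "d2": (1, 0, 1), "d3": (0, 1, 1)}
for name, weights in docs.items():
    print(name, boolean_sim(weights, q_dnf))
```

Only `d1` scores 1 here; `d2` and `d3` score 0 even though each contains some of the query terms, which illustrates the absence of partial matching.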
13
Boolean Model (Cont.)
- If $sim(d_j, q) = 1$, the Boolean model predicts that document $d_j$ is relevant to query $q$ (it might not be).
- Otherwise, the prediction is that the document is not relevant.
- The Boolean model predicts that each document is either relevant or non-relevant.
- There is no notion of a partial match.
14
- Main advantages
  - clean formalism
  - simplicity
- Main disadvantage
  - exact matching may lead to retrieval of too few or too many documents
- Index term weighting can lead to an improvement in retrieval performance.
15
Vector Model
- Assigns non-binary weights to index terms in queries and documents.
- Term weights are used to compute the degree of similarity between a document and the user query.
- By sorting retrieved documents in decreasing order of degree of similarity, the vector model takes partially matching documents into account.
  - The ranked answer set is a lot more precise than the answer set retrieved by the Boolean model.
16
Vector Model (Cont.)
- The weight $w_{i,j}$ for the pair $(k_i, d_j)$ is positive and non-binary.
- Index terms in the query are also weighted.
- Let $w_{i,q}$ be the weight associated with the pair $[k_i, q]$, where $w_{i,q} \geq 0$.
- The query vector is defined as $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$, where $t$ is the total number of index terms in the system.
- The vector for document $d_j$ is represented by $\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$.
17
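Vector Model (Cont.)
- Document $d_j$ and user query $q$ are represented as t-dimensional vectors.
- The degree of similarity of $d_j$ with regard to $q$ is evaluated as the correlation between the vectors $\vec{d}_j$ and $\vec{q}$.
- The correlation can be quantified by the cosine of the angle between these two vectors:
  - $sim(d_j, q) = \dfrac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \times |\vec{q}|} = \dfrac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$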
18
Vector Model (Cont.)
- $sim(d_j, q)$ varies from 0 to +1.
- The model ranks documents according to their degree of similarity to the query.
- A document may be retrieved even if it only partially matches the query.
  - Establish a threshold on $sim(d_j, q)$ and retrieve the documents whose degree of similarity is above the threshold.
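The cosine measure and the threshold rule can be sketched together as follows. This is an illustrative sketch rather than the authors' implementation: the function names, the toy weights, and the convention of returning 0 for an all-zero vector are assumptions made here.

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_d * norm_q)

def rank(docs, q, threshold=0.0):
    """Return (name, similarity) pairs above threshold, most similar first."""
    scored = [(name, cosine_sim(vec, q)) for name, vec in docs.items()]
    kept = [(name, s) for name, s in scored if s > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical weights over a three-term vocabulary
docs = {"d1": [0.5, 0.8, 0.0], "d2": [0.0, 0.2, 0.9], "d3": [0.0, 0.0, 0.7]}
q = [1.0, 1.0, 0.0]
for name, score in rank(docs, q, threshold=0.1):
    print(name, round(score, 3))
```

With these toy weights `d2` is retrieved although it contains only one of the two query terms, while `d3` shares no term with the query and falls below the threshold.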
19
Index term weights
- Think of the documents as a collection $C$ of objects.
- Think of the user query as a specification of a set $A$ of objects.
- The IR problem is to determine which documents are in set $A$ and which are not (i.e. a clustering problem).
- In a clustering problem, two issues must be resolved:
  - intra-cluster similarity (the features which better describe the objects in set $A$)
  - inter-cluster dissimilarity (the features which better distinguish the objects in set $A$ from the remaining objects in collection $C$)
20
- In the vector model, intra-cluster similarity is quantified by measuring the raw frequency of term $k_i$ inside document $d_j$ (the tf factor).
  - It measures how well the term describes the document contents.
- Inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of term $k_i$ among the documents in the collection (the idf factor).
  - Terms which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one.
21
Definition (p. 29)
- Let $N$ be the total number of documents in the system.
- Let $n_i$ be the number of documents in which index term $k_i$ appears.
- Let $freq_{i,j}$ be the raw frequency of term $k_i$ in document $d_j$
  - i.e. the number of times term $k_i$ is mentioned in the text of document $d_j$.
- The normalized frequency $f_{i,j}$ of term $k_i$ in $d_j$ is
  - $f_{i,j} = \dfrac{freq_{i,j}}{\max_l freq_{l,j}}$
22
- The maximum is computed over all terms mentioned in the text of document $d_j$.
- If term $k_i$ does not appear in document $d_j$, then $f_{i,j} = 0$.
- Let $idf_i$, the inverse document frequency for $k_i$, be
  - $idf_i = \log \dfrac{N}{n_i}$
- The best-known term weighting scheme is
  - $w_{i,j} = f_{i,j} \times \log \dfrac{N}{n_i}$
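The tf-idf weights defined above can be computed directly from tokenized text. The sketch below assumes documents arrive as lists of tokens and uses the natural logarithm; the slides do not fix the log base, and the function name and toy collection are hypothetical.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """docs: list of token lists. Returns one {term: w_ij} dict per document."""
    n_docs = len(docs)                                   # N
    doc_freq = Counter(t for doc in docs for t in set(doc))  # n_i per term
    weights = []
    for doc in docs:
        freq = Counter(doc)                              # freq_ij
        max_freq = max(freq.values(), default=1)         # max over terms in d_j
        weights.append({
            term: (count / max_freq) * math.log(n_docs / doc_freq[term])
            for term, count in freq.items()
        })
    return weights

# Hypothetical three-document collection
docs = [
    ["retrieval", "model", "model"],
    ["model", "query"],
    ["query", "term", "term", "term"],
]
for j, w in enumerate(tf_idf_weights(docs)):
    print(j, {term: round(value, 3) for term, value in w.items()})
```

In the first document, "model" has the higher normalized frequency, but "retrieval" receives the larger weight because it occurs in only one document. A term that occurs in every document gets weight 0, since $\log(N/n_i) = 0$.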
23
- Advantages of the vector model
  - The term weighting scheme improves retrieval performance.
  - Partial matching allows retrieval of documents that approximate the query conditions.
  - Documents are sorted according to their degree of similarity to the query.
- Disadvantage
  - Index terms are assumed to be mutually independent.
24
Probabilistic Model
- Given a user query, there is a set of documents that contains exactly the relevant documents and no others.
  - This is the ideal answer set.
- Given a description of the ideal answer set, there is no problem in retrieving its documents.
- The querying process is a process of specifying the properties of the ideal answer set.
  - The properties are not exactly known.
  - There are index terms whose semantics are used to characterize these properties.
25
Probabilistic Model (Cont.)
- These properties are not known at query time.
- An effort has to be made to guess initially what the properties are.
- The initial guess generates a preliminary probabilistic description of the ideal answer set, which is used to retrieve a first set of documents.
- User interaction is then initiated to improve the probabilistic description of the ideal answer set.
26
- The user examines the retrieved documents and decides which ones are relevant.
- This information is used to refine the description of the ideal answer set.
- By repeating this process, the description evolves and gets closer to the ideal answer set.
27
Fundamental Assumption
- Given a user query $q$ and a document $d_j$ in the collection, the probabilistic model estimates the probability that the user will find document $d_j$ relevant.
  - It assumes that the probability of relevance depends on the query and document representations only.
  - It assumes that there is a subset of all documents which the user prefers as the answer set for query $q$.
  - This ideal answer set is labeled $R$.
  - The documents in set $R$ are predicted to be relevant to the query.
28
- Given query $q$, the probabilistic model assigns to each document $d_j$ the ratio $P(d_j \text{ relevant-to } q) / P(d_j \text{ non-relevant-to } q)$
  - as a measure of its similarity to the query
  - i.e. the odds of document $d_j$ being relevant to query $q$.
29
- Index term weight variables are all binary, i.e. $w_{i,j} \in \{0,1\}$ and $w_{i,q} \in \{0,1\}$.
- Query $q$ is a subset of the index terms.
- Let $R$ be the set of documents known (or initially guessed) to be relevant.
- Let $\bar{R}$ be the complement of $R$.
- Let $P(R \mid \vec{d}_j)$ be the probability that document $d_j$ is relevant to query $q$.
- Let $P(\bar{R} \mid \vec{d}_j)$ be the probability that document $d_j$ is not relevant to query $q$.
30
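- The similarity $sim(d_j, q)$ of document $d_j$ to query $q$ is the ratio
  - $sim(d_j, q) = \dfrac{P(R \mid \vec{d}_j)}{P(\bar{R} \mid \vec{d}_j)}$
  - $sim(d_j, q) \sim \dfrac{P(\vec{d}_j \mid R)}{P(\vec{d}_j \mid \bar{R})}$
  - $sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left( \log \dfrac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \dfrac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)$
- Here $P(k_i \mid R)$ is the probability that index term $k_i$ is present in a document randomly selected from $R$, and $P(k_i \mid \bar{R})$ is the same probability for $\bar{R}$.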
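Because all weights are binary, the ranking score reduces to a sum of two log-odds terms over the index terms that the query and the document share. The sketch below assumes documents and queries are Python sets of terms and that every probability lies strictly between 0 and 1; the function name `bir_sim` and the sample values are hypothetical.

```python
import math

def bir_sim(doc_terms, query_terms, p_rel, p_nonrel):
    """Ranking score for binary weights.

    p_rel[k] stands for P(k | R), p_nonrel[k] for P(k | not R).
    Both are assumed to lie strictly between 0 and 1.
    """
    score = 0.0
    for k in query_terms & doc_terms:  # terms with w_iq = w_ij = 1
        score += math.log(p_rel[k] / (1 - p_rel[k]))
        score += math.log((1 - p_nonrel[k]) / p_nonrel[k])
    return score

# Hypothetical probabilities for a two-term query
p_rel = {"a": 0.5, "b": 0.5}
p_nonrel = {"a": 0.2, "b": 0.5}
print(bir_sim({"a", "c"}, {"a", "b"}, p_rel, p_nonrel))
```

Only term "a" is shared, so the score is $\log(0.5/0.5) + \log(0.8/0.2) = \log 4$. A document that shares no term with the query scores 0.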
31
- How are $P(k_i \mid R)$ and $P(k_i \mid \bar{R})$ computed initially?
  - Assume that $P(k_i \mid R)$ is constant for all index terms $k_i$ (typically 0.5):
  - $P(k_i \mid R) = 0.5$
  - Assume that the distribution of index terms among the non-relevant documents can be approximated by the distribution of index terms among all documents in the collection:
  - $P(k_i \mid \bar{R}) = \dfrac{n_i}{N}$, where $n_i$ is the number of documents containing index term $k_i$ and $N$ is the total number of documents.
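The initial guesses can be computed from document frequencies alone. This is a minimal sketch, assuming the collection is a list of term sets; the function name and the toy collection are hypothetical.

```python
def initial_estimates(docs, query_terms):
    """Initial guesses: P(k|R) = 0.5 and P(k|not R) = n_i / N for each query term."""
    n_docs = len(docs)  # N
    p_rel = {k: 0.5 for k in query_terms}
    p_nonrel = {
        k: sum(1 for d in docs if k in d) / n_docs  # n_i / N
        for k in query_terms
    }
    return p_rel, p_nonrel

# Hypothetical collection of four documents, each a set of index terms
docs = [{"a", "b"}, {"a"}, {"c"}, {"b", "c"}, ]
p_rel, p_nonrel = initial_estimates(docs, {"a", "c"})
print(p_rel, p_nonrel)
```

With $P(k_i \mid R) = 0.5$ the first log term of the ranking formula vanishes, so the initial ranking depends only on $\log \frac{N - n_i}{n_i}$ for the matching terms. A query term that occurs in no document or in every document yields an estimate of 0 or 1, for which that logarithm is undefined, so such terms need special handling.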
32
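- Let $V$ be the subset of documents initially retrieved and ranked by the model.
- Let $V_i$ be the subset of $V$ composed of the documents in $V$ that contain index term $k_i$.
- $P(k_i \mid R)$ is approximated by the distribution of index term $k_i$ among the documents retrieved:
  - $P(k_i \mid R) = \dfrac{V_i}{V}$ (here $V$ and $V_i$ also denote the number of elements in the respective sets)
- $P(k_i \mid \bar{R})$ is approximated by considering that all non-retrieved documents are not relevant:
  - $P(k_i \mid \bar{R}) = \dfrac{n_i - V_i}{N - V}$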
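The re-estimation step can be sketched as follows, treating the retrieved set $V$ as a list of document indices. The function name and the toy data are hypothetical, and the sketch assumes $0 < V < N$ so that neither denominator is zero.

```python
def reestimate(docs, retrieved_ids, query_terms):
    """Re-estimate P(k|R) and P(k|not R) from the retrieved set V.

    docs: list of term sets; retrieved_ids: indices of the documents in V.
    Assumes 0 < len(retrieved_ids) < len(docs).
    """
    n_docs = len(docs)      # N
    v = len(retrieved_ids)  # size of V
    p_rel, p_nonrel = {}, {}
    for k in query_terms:
        n_i = sum(1 for d in docs if k in d)                # docs containing k
        v_i = sum(1 for j in retrieved_ids if k in docs[j])  # retrieved docs containing k
        p_rel[k] = v_i / v
        p_nonrel[k] = (n_i - v_i) / (n_docs - v)
    return p_rel, p_nonrel

# Hypothetical collection; documents 0 and 1 were retrieved in the first pass
docs = [{"a", "b"}, {"a"}, {"c"}, {"b", "c"}]
p_rel, p_nonrel = reestimate(docs, [0, 1], {"a", "b"})
print(p_rel, p_nonrel)
```

Term "a" appears in every retrieved document and in no other, so its estimates come out as exactly 1 and 0, values for which the logarithms in the ranking formula are undefined. With small retrieved sets this happens easily, which is why a small adjustment factor is usually added to these ratios in practice.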
33
- Advantage
  - Documents are ranked in decreasing order of their probability of being relevant.
- Disadvantages
  - The initial separation into relevant and non-relevant sets has to be guessed.
  - All index term weights are binary.
  - Index terms are assumed to be mutually independent.