Published by Ira Pierce. Modified over 9 years ago.
Information Retrieval and Web Search
IR models: Boolean model
Instructor: Rada Mihalcea
Class web page: http://www.cs.unt.edu/~rada/CSCE5300
Slide 1 Today’s topics
Boolean retrieval
Improvements / variations of the Boolean model
–Extended Boolean model
–Fuzzy information retrieval
Slide 2 IR Models
[Figure: taxonomy of IR models, organized by user task]
User task: Retrieval (ad hoc, filtering)
–Classic models: Boolean, vector, probabilistic
–Set theoretic: fuzzy, extended Boolean
–Algebraic: generalized vector, latent semantic indexing, neural networks
–Probabilistic: inference network, belief network
–Structured models: non-overlapping lists, proximal nodes
User task: Browsing
–Browsing: flat, structure guided, hypertext
Slide 3 The Boolean Model
Simple model based on set theory
Queries are specified as Boolean expressions
–precise semantics
–neat formalism
–$q = k_a \land (k_b \lor \lnot k_c)$
Terms are either present or absent; thus $w_{ij} \in \{0,1\}$
Consider
–$q = k_a \land (k_b \lor \lnot k_c)$
–$\vec{q}_{dnf} = (1,1,1) \lor (1,1,0) \lor (1,0,0)$
–$\vec{q}_{cc} = (1,1,0)$ is a conjunctive component
Each query can be transformed into disjunctive normal form (DNF)
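The DNF vector on this slide can be checked by brute force. The following is a minimal sketch, not part of the original deck: it enumerates all binary assignments to (ka, kb, kc) and keeps those that satisfy the example query. The function names `query` and `conjunctive_components` are illustrative.

```python
from itertools import product


def query(ka: bool, kb: bool, kc: bool) -> bool:
    """The slide's example query: q = ka AND (kb OR NOT kc)."""
    return ka and (kb or not kc)


def conjunctive_components():
    """Return every (ka, kb, kc) bit vector that satisfies the query."""
    return sorted(
        bits
        for bits in product((0, 1), repeat=3)
        if query(*(bool(b) for b in bits))
    )


components = conjunctive_components()
print(components)  # [(1, 0, 0), (1, 1, 0), (1, 1, 1)]
```

Each tuple in the result is one conjunctive component of the DNF; their disjunction is equivalent to the original query.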
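Slide 4 The Boolean Model
$q = k_a \land (k_b \lor \lnot k_c)$
$sim(q,d_j) = 1$ if the document satisfies the Boolean query, and 0 otherwise
–no in-between values, only 0 or 1
[Figure: Venn diagram of $K_a$, $K_b$, $K_c$ with the regions (1,1,1), (1,1,0), and (1,0,0) marked]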
Slide 5 Exercise
D1 = “computer information retrieval”
D2 = “computer retrieval”
D3 = “information”
D4 = “computer information”
Q1 = “information ∧ retrieval”
Q2 = “information ∧ ¬computer”
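One way to work through the exercise is to treat each document as a set of terms and each query as a predicate over that set. The sketch below assumes both queries are conjunctions (Q1 = information AND retrieval, Q2 = information AND NOT computer); the helper names `q1`, `q2`, and `retrieve` are illustrative and not from the deck.

```python
docs = {
    "D1": "computer information retrieval",
    "D2": "computer retrieval",
    "D3": "information",
    "D4": "computer information",
}

# Binary index: each document is reduced to the set of terms it contains.
index = {name: set(text.split()) for name, text in docs.items()}


def q1(terms):
    """Assumed reading of Q1: information AND retrieval."""
    return "information" in terms and "retrieval" in terms


def q2(terms):
    """Assumed reading of Q2: information AND NOT computer."""
    return "information" in terms and "computer" not in terms


def retrieve(predicate):
    """Return the names of all documents that satisfy the predicate."""
    return [name for name, terms in index.items() if predicate(terms)]


print(retrieve(q1))  # ['D1']
print(retrieve(q2))  # ['D3']
```

The result is an unordered match set: every returned document satisfies the query equally, which is the binary behavior the model defines.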
Slide 6 Exercise
Doc  Authors
0    (none)
1    Swift
2    Shakespeare
3    Swift
4    Milton
5    Swift
6    Milton, Shakespeare
7    Milton, Shakespeare, Swift
8    Chaucer
9    Swift
10   Chaucer, Shakespeare
11   Chaucer, Shakespeare, Swift
12   Chaucer, Milton
13   Chaucer, Milton, Swift
14   Chaucer, Milton, Shakespeare
15   Chaucer, Milton, Shakespeare, Swift
Query: ((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
Slide 7 Drawbacks of the Boolean Model
Retrieval is based on binary decision criteria, with no notion of partial matching
No ranking of the documents is provided (absence of a grading scale)
The information need has to be translated into a Boolean expression, which most users find awkward
The Boolean queries formulated by users are most often too simplistic
As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
Slide 8
The Boolean model imposes a binary criterion for deciding relevance
The question of how to extend the Boolean model to accommodate partial matching and ranking has attracted considerable attention in the past
Two extensions of the Boolean model:
–Fuzzy Set Model
–Extended Boolean Model
Slide 9 Fuzzy Set Model
Queries and docs are represented by sets of index terms: matching is approximate from the start
This vagueness can be modeled using a fuzzy framework, as follows:
–a fuzzy set is associated with each term
–each doc has a degree of membership in this fuzzy set
This interpretation provides the foundation for many models for IR based on fuzzy theory
Here we discuss the model proposed by Ogawa, Morita, and Kobayashi (1991)
Slide 10 Fuzzy Set Theory
Framework for representing classes whose boundaries are not well defined
The key idea is to introduce the notion of a degree of membership associated with the elements of a set
This degree of membership varies from 0 to 1 and allows modeling the notion of marginal membership
Thus, membership is now a gradual notion, contrary to the crisp notion enforced by classic Boolean logic
Slide 11 Fuzzy Set Theory
Definition
–A fuzzy subset $A$ of $U$ is characterized by a membership function $\mu(A,u) : U \to [0,1]$ which associates with each element $u$ of $U$ a number $\mu(A,u)$ in the interval $[0,1]$
Definition
–Let $A$ and $B$ be two fuzzy subsets of $U$. Also, let $\lnot A$ be the complement of $A$. Then,
$\mu(\lnot A,u) = 1 - \mu(A,u)$
$\mu(A \cup B,u) = \max(\mu(A,u), \mu(B,u))$
$\mu(A \cap B,u) = \min(\mu(A,u), \mu(B,u))$
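The three fuzzy set operations translate directly into code. The sketch below represents a fuzzy subset as a dictionary from element to membership degree; this representation and the sample values are assumptions made for illustration only.

```python
def complement(a, universe):
    """Membership in NOT A: 1 - mu(A, u) for every u in the universe."""
    return {u: 1.0 - a.get(u, 0.0) for u in universe}


def union(a, b, universe):
    """Membership in A OR B: max(mu(A, u), mu(B, u))."""
    return {u: max(a.get(u, 0.0), b.get(u, 0.0)) for u in universe}


def intersection(a, b, universe):
    """Membership in A AND B: min(mu(A, u), mu(B, u))."""
    return {u: min(a.get(u, 0.0), b.get(u, 0.0)) for u in universe}


# Hypothetical membership degrees over a two-element universe.
universe = ["d1", "d2"]
A = {"d1": 0.8, "d2": 0.3}
B = {"d1": 0.5, "d2": 0.6}

print(union(A, B, universe))         # {'d1': 0.8, 'd2': 0.6}
print(intersection(A, B, universe))  # {'d1': 0.5, 'd2': 0.3}
```

With degrees restricted to 0 and 1, these definitions reduce to ordinary set complement, union, and intersection.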
Slide 12 Fuzzy Information Retrieval
Fuzzy sets are modeled based on a thesaurus
This thesaurus is built as follows:
–Let $\vec{c}$ be a term-term correlation matrix
–Let $c(i,l)$ be a normalized correlation factor for $(k_i,k_l)$:
$c(i,l) = \frac{n(i,l)}{n_i + n_l - n(i,l)}$
-$n_i$: number of docs which contain $k_i$
-$n_l$: number of docs which contain $k_l$
-$n(i,l)$: number of docs which contain both $k_i$ and $k_l$
We now have the notion of proximity among index terms.
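The correlation factor is a ratio of document counts, so it can be computed from a collection of term sets. The following sketch uses the four documents from the Slide 5 exercise as sample data; the function name `correlation` and the convention of returning 0 when neither term occurs are assumptions for illustration.

```python
def correlation(docs, ki, kl):
    """c(i,l) = n(i,l) / (n_i + n_l - n(i,l)) over a list of term sets."""
    n_i = sum(1 for terms in docs if ki in terms)
    n_l = sum(1 for terms in docs if kl in terms)
    n_il = sum(1 for terms in docs if ki in terms and kl in terms)
    denominator = n_i + n_l - n_il
    # Assumed convention: terms that never occur are uncorrelated.
    return n_il / denominator if denominator else 0.0


docs = [
    {"computer", "information", "retrieval"},  # D1
    {"computer", "retrieval"},                 # D2
    {"information"},                           # D3
    {"computer", "information"},               # D4
]

print(correlation(docs, "information", "retrieval"))  # 0.25
print(correlation(docs, "computer", "information"))   # 0.5
```

A term is always fully correlated with itself: when ki equals kl the three counts coincide and the ratio is 1.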
Slide 13 Fuzzy Information Retrieval
The correlation factor $c(i,l)$ can be used to define fuzzy set membership for a document $d_j$ as follows:
$\mu(i,j) = 1 - \prod_{k_l \in d_j} (1 - c(i,l))$
-$\mu(i,j)$: membership of doc $d_j$ in the fuzzy subset associated with $k_i$
The above expression computes an algebraic sum over all terms in the doc $d_j$
A doc $d_j$ belongs to the fuzzy set for $k_i$ if its own terms are associated with $k_i$
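To make the membership formula concrete, the sketch below computes it for a document that does not contain the term in question. The sample collection is the four documents from the Slide 5 exercise; the function names are illustrative, and returning 0 for an undefined correlation is an assumed convention.

```python
from math import prod


def correlation(docs, ki, kl):
    """c(i,l) = n(i,l) / (n_i + n_l - n(i,l)) over a list of term sets."""
    n_i = sum(1 for terms in docs if ki in terms)
    n_l = sum(1 for terms in docs if kl in terms)
    n_il = sum(1 for terms in docs if ki in terms and kl in terms)
    denominator = n_i + n_l - n_il
    return n_il / denominator if denominator else 0.0  # assumed convention


def membership(docs, ki, doc_terms):
    """mu(i,j) = 1 - product over kl in dj of (1 - c(i,l))."""
    return 1.0 - prod(1.0 - correlation(docs, ki, kl) for kl in doc_terms)


docs = [
    {"computer", "information", "retrieval"},  # D1
    {"computer", "retrieval"},                 # D2
    {"information"},                           # D3
    {"computer", "information"},               # D4
]

# D4 does not contain "retrieval", yet its terms co-occur with it elsewhere:
# c(retrieval, computer) = 2/3 and c(retrieval, information) = 1/4,
# so mu = 1 - (1/3)(3/4), which is approximately 0.75.
print(membership(docs, "retrieval", docs[3]))
```

If the document contains ki itself, one factor of the product is zero and the membership is exactly 1.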
Slide 14 Fuzzy Information Retrieval
$\mu(i,j) = 1 - \prod_{k_l \in d_j} (1 - c(i,l))$
-$\mu(i,j)$: membership of doc $d_j$ in the fuzzy subset associated with $k_i$
If doc $d_j$ contains a term $k_l$ which is closely related to $k_i$, we have
–$c(i,l) \sim 1$
–$\mu(i,j) \sim 1$
–index $k_i$ is a good fuzzy index for the doc
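Slide 15 Fuzzy IR: An Example
$q = k_a \land (k_b \lor \lnot k_c)$
$\vec{q}_{dnf} = (1,1,1) + (1,1,0) + (1,0,0) = \vec{cc}_1 + \vec{cc}_2 + \vec{cc}_3$
$\mu(q,d_j) = \mu(cc_1 + cc_2 + cc_3, j)$
$= 1 - (1 - \mu(a,j)\,\mu(b,j)\,\mu(c,j)) \times (1 - \mu(a,j)\,\mu(b,j)\,(1 - \mu(c,j))) \times (1 - \mu(a,j)\,(1 - \mu(b,j))\,(1 - \mu(c,j)))$
[Figure: Venn diagram of $K_a$, $K_b$, $K_c$ with the regions $cc_1$, $cc_2$, and $cc_3$ marked]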
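The query membership above is an algebraic sum of three conjunctive components, each an algebraic product of term memberships. A minimal sketch, assuming the three term memberships for a document are already known (the function name and the sample values are illustrative):

```python
def fuzzy_query_membership(mu_a, mu_b, mu_c):
    """Membership of a doc in q = ka AND (kb OR NOT kc), via its DNF."""
    cc1 = mu_a * mu_b * mu_c              # component (1,1,1)
    cc2 = mu_a * mu_b * (1 - mu_c)        # component (1,1,0)
    cc3 = mu_a * (1 - mu_b) * (1 - mu_c)  # component (1,0,0)
    # Algebraic sum of the three components.
    return 1 - (1 - cc1) * (1 - cc2) * (1 - cc3)


# Crisp memberships reproduce the Boolean answer.
print(fuzzy_query_membership(1, 1, 0))  # 1
print(fuzzy_query_membership(1, 0, 1))  # 0

# Partial memberships give a graded score, here about 0.33.
print(fuzzy_query_membership(0.5, 0.5, 0.5))
```

Because the score varies continuously between 0 and 1, documents can be ranked by it, which the plain Boolean model does not allow.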
Slide 16 Fuzzy Information Retrieval
Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory
Experiments with standard test collections are not available
As a result, fuzzy models are difficult to compare with other IR models at this time
Slide 17 Extended Boolean Model
The Boolean model is simple and elegant, but makes no provision for ranking
As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership
Extend the Boolean model with the notions of partial matching and term weighting
Combine characteristics of the vector model with properties of Boolean algebra
Slide 18 The Idea
The extended Boolean model (introduced by Salton, Fox, and Wu, 1983) is based on a critique of a basic assumption in Boolean algebra
Let
–$q = k_x \land k_y$
–Use weights associated with $k_x$ and $k_y$
–In the Boolean model, only documents with $w_x = w_y = 1$ match; all other documents are irrelevant
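Slide 19 The Idea: AND
$q_{and} = k_x \land k_y$; $w_{xj} = x$ and $w_{yj} = y$
$sim(q_{and},d_j) = 1 - \sqrt{\frac{(1-x)^2 + (1-y)^2}{2}}$
We want a document to be as close as possible to (1,1)
[Figure: unit square with axes $x = w_{xj}$ (term $k_x$) and $y = w_{yj}$ (term $k_y$), corners (0,0) and (1,1), and documents $d_j$ and $d_{j+1}$ plotted]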
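Slide 20 The Idea: OR
$q_{or} = k_x \lor k_y$; $w_{xj} = x$ and $w_{yj} = y$
$sim(q_{or},d_j) = \sqrt{\frac{x^2 + y^2}{2}}$
We want a document to be as far as possible from (0,0)
[Figure: unit square with axes $x = w_{xj}$ (term $k_x$) and $y = w_{yj}$ (term $k_y$), corners (0,0) and (1,1), and documents $d_j$ and $d_{j+1}$ plotted]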
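The two-term AND and OR similarities are normalized Euclidean distances, from the point (1,1) and from the origin respectively. A small sketch of both, where the function names `sim_and` and `sim_or` and the sample weights are illustrative:

```python
import math


def sim_and(x, y):
    """1 minus the normalized distance from (x, y) to the ideal point (1, 1)."""
    return 1 - math.sqrt(((1 - x) ** 2 + (1 - y) ** 2) / 2)


def sim_or(x, y):
    """Normalized distance from (x, y) to the worst point (0, 0)."""
    return math.sqrt((x ** 2 + y ** 2) / 2)


# A document containing only one of the two terms (weights 1 and 0):
print(sim_and(1, 0))  # about 0.29: partial credit instead of a Boolean 0
print(sim_or(1, 0))   # about 0.71: less than a document containing both
```

Both functions return 1 at (1,1) and 0 at (0,0), so they agree with the Boolean model at the corners and interpolate in between.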
Slide 21 Generalizing the Idea
We can extend the previous model to consider Euclidean distances in a t-dimensional space
This can be done using p-norms, which extend the notion of distance to include p-distances, where $1 \le p \le \infty$ is a new parameter
A generalized disjunctive query is given by
–$q_{or} = k_1 \lor^p k_2 \lor^p \dots \lor^p k_t$
A generalized conjunctive query is given by
–$q_{and} = k_1 \land^p k_2 \land^p \dots \land^p k_t$
Slide 22 Generalizing the Idea
–$sim(q_{and},d_j) = 1 - \left( \frac{(1-x_1)^p + (1-x_2)^p + \dots + (1-x_m)^p}{m} \right)^{1/p}$
–$sim(q_{or},d_j) = \left( \frac{x_1^p + x_2^p + \dots + x_m^p}{m} \right)^{1/p}$
–If $p = 1$ then (vector-like)
$sim(q_{or},d_j) = sim(q_{and},d_j) = \frac{x_1 + \dots + x_m}{m}$
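The p-norm formulas can be written as two short functions over a list of term weights. This is a sketch under the slide's definitions, with illustrative function names and sample weights; it also checks numerically that the two similarities coincide at p = 1.

```python
def sim_or_p(weights, p):
    """p-norm OR: ((x1^p + ... + xm^p) / m) ** (1/p)."""
    if p < 1 or not weights:
        raise ValueError("need p >= 1 and at least one weight")
    m = len(weights)
    return (sum(x ** p for x in weights) / m) ** (1 / p)


def sim_and_p(weights, p):
    """p-norm AND: 1 - (((1-x1)^p + ... + (1-xm)^p) / m) ** (1/p)."""
    if p < 1 or not weights:
        raise ValueError("need p >= 1 and at least one weight")
    m = len(weights)
    return 1 - (sum((1 - x) ** p for x in weights) / m) ** (1 / p)


weights = [0.2, 0.9, 0.4]  # hypothetical term weights for one document

# At p = 1 both reduce to the mean of the weights (0.5 here, up to rounding).
print(sim_or_p(weights, 1), sim_and_p(weights, 1))

# At p = 2 the OR score rises above the mean and the AND score falls below it.
print(sim_or_p(weights, 2), sim_and_p(weights, 2))
```

With two weights and p = 2 these functions reproduce the Euclidean AND and OR similarities from the two-term case.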
Slide 23 Conclusions
The model is quite powerful
Its properties are interesting and might be useful
Computation is somewhat complex
However, distributivity does not hold for the ranking computation:
–$q_1 = (k_1 \lor k_2) \land k_3$
–$q_2 = (k_1 \land k_3) \lor (k_2 \land k_3)$
–$sim(q_1,d_j) \ne sim(q_2,d_j)$
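The failure of distributivity can be seen numerically. The sketch below assumes that a compound query is scored recursively, with the score of each sub-expression used as a weight in the enclosing operator, and that q1 = (k1 OR k2) AND k3 while q2 = (k1 AND k3) OR (k2 AND k3). The function names, the choice p = 2, and the sample weights are illustrative.

```python
def or_p(values, p):
    """p-norm OR over a list of weights or sub-expression scores."""
    return (sum(v ** p for v in values) / len(values)) ** (1 / p)


def and_p(values, p):
    """p-norm AND over a list of weights or sub-expression scores."""
    return 1 - (sum((1 - v) ** p for v in values) / len(values)) ** (1 / p)


def score_q1(x1, x2, x3, p):
    """q1 = (k1 OR k2) AND k3, evaluated inside-out."""
    return and_p([or_p([x1, x2], p), x3], p)


def score_q2(x1, x2, x3, p):
    """q2 = (k1 AND k3) OR (k2 AND k3), evaluated inside-out."""
    return or_p([and_p([x1, x3], p), and_p([x2, x3], p)], p)


# Hypothetical document containing k1 and k3 but not k2.
x1, x2, x3 = 1.0, 0.0, 1.0

print(score_q1(x1, x2, x3, p=2))  # about 0.79
print(score_q2(x1, x2, x3, p=2))  # about 0.74
```

The two queries are logically equivalent under Boolean algebra, yet they receive different scores for the same document, so the way a query is written affects the ranking.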