Chapter 02: Modeling (Continued). 19/10/2015, Dr. Almetwally Mostafa.


1 Chapter 02: Modeling (Continued)

2
- Objective: to capture the IR problem using a probabilistic framework
- Given a user query, there is an ideal answer set
- Querying is then a specification of the properties of this ideal answer set (compare clustering)
- But what are these properties?
- Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set)
- Improve by iteration

3 Probabilistic Model
- An initial set of documents is retrieved somehow
- The user inspects these docs looking for the relevant ones (in truth, only the top 10-20 need to be inspected)
- The IR system uses this information to refine the description of the ideal answer set
- By repeating this process, it is expected that the description of the ideal answer set will improve
- Keep in mind the need to guess, at the very beginning, the description of the ideal answer set
- The description of the ideal answer set is modeled in probabilistic terms

4 Probabilistic Ranking Principle
- Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). The model assumes that this probability of relevance depends on the query and the document representations only. The ideal answer set is referred to as R and should maximize the probability of relevance. Documents in the set R are predicted to be relevant.
- But:
  - how to compute the probabilities?
  - what is the sample space?

5 The Ranking
- The probabilistic ranking is computed as:
  sim(q,dj) = P(dj relevant-to q) / P(dj non-relevant-to q)
- This is the odds (likelihood ratio) of the document dj being relevant
- Taking the odds minimizes the probability of an erroneous judgement
- Definitions:
  - wij ∈ {0,1}
  - P(R | vec(dj)): probability that the given doc is relevant
  - P(¬R | vec(dj)): probability that the doc is not relevant

6 The Ranking
- sim(dj,q) = P(R | vec(dj)) / P(¬R | vec(dj))
            = [P(vec(dj) | R) × P(R)] / [P(vec(dj) | ¬R) × P(¬R)]
            ~ P(vec(dj) | R) / P(vec(dj) | ¬R)
- P(vec(dj) | R): probability of randomly selecting the document dj from the set R of relevant documents

7 The Ranking
- Assuming independence of index terms:
  sim(dj,q) ~ [Π P(ki | R) × Π P(¬ki | R)] / [Π P(ki | ¬R) × Π P(¬ki | ¬R)]
  where, in numerator and denominator, the first product runs over the terms present in dj and the second over the terms absent from dj
- P(ki | R): probability that the index term ki is present in a document randomly selected from the set R of relevant documents

8 The Ranking
- Taking logs and dropping factors that are constant for a given query:
  sim(dj,q) ~ Σ wiq × wij × ( log [P(ki | R) / P(¬ki | R)] + log [P(¬ki | ¬R) / P(ki | ¬R)] )
  where
  P(¬ki | R) = 1 - P(ki | R)
  P(¬ki | ¬R) = 1 - P(ki | ¬R)

9 The Initial Ranking
- sim(dj,q) ~ Σ wiq × wij × ( log [P(ki | R) / P(¬ki | R)] + log [P(¬ki | ¬R) / P(ki | ¬R)] )
- How to obtain the probabilities P(ki | R) and P(ki | ¬R)?
- Estimates based on assumptions:
  - P(ki | R) = 0.5
  - P(ki | ¬R) = ni / N, where ni is the number of docs that contain ki
  - Use this initial guess to retrieve an initial ranking
  - Improve upon this initial ranking

10 Improving the Initial Ranking
- sim(dj,q) ~ Σ wiq × wij × ( log [P(ki | R) / P(¬ki | R)] + log [P(¬ki | ¬R) / P(ki | ¬R)] )
- Let:
  - V: set of docs initially retrieved (and assumed relevant)
  - Vi: subset of those docs that contain ki
- Re-evaluate the estimates:
  - P(ki | R) = Vi / V
  - P(ki | ¬R) = (ni - Vi) / (N - V)
- Repeat recursively

11 Improving the Initial Ranking
- sim(dj,q) ~ Σ wiq × wij × ( log [P(ki | R) / P(¬ki | R)] + log [P(¬ki | ¬R) / P(ki | ¬R)] )
- To avoid problems with V = 1 and Vi = 0, add a small smoothing constant:
  - P(ki | R) = (Vi + 0.5) / (V + 1)
  - P(ki | ¬R) = (ni - Vi + 0.5) / (N - V + 1)
- Alternatively, use ni/N as the adjustment factor:
  - P(ki | R) = (Vi + ni/N) / (V + 1)
  - P(ki | ¬R) = (ni - Vi + ni/N) / (N - V + 1)
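
The estimate-rank-feedback loop of slides 9 to 11 can be illustrated with a minimal Python sketch. The toy collection, the query, and the choice of taking the top 2 documents of the initial ranking as the assumed-relevant set V are illustrative assumptions, not part of the original slides:

```python
import math

# Toy collection (illustrative): each document is a set of index terms,
# i.e. binary weights wij in {0,1}.
docs = {
    "d1": {"a", "b"},
    "d2": {"a"},
    "d3": {"b", "c"},
    "d4": {"c"},
    "d5": {"c", "d"},
    "d6": {"d"},
}
N = len(docs)
ni = {}  # ni: number of docs containing term ki
for terms in docs.values():
    for t in terms:
        ni[t] = ni.get(t, 0) + 1

def rank(query, p_rel, p_nrel):
    """sim(dj,q) ~ sum over query terms present in dj of
    log[P(ki|R)/P(~ki|R)] + log[P(~ki|~R)/P(ki|~R)]."""
    scores = {}
    for d, terms in docs.items():
        s = 0.0
        for k in query:
            if k in terms:
                s += math.log(p_rel[k] / (1 - p_rel[k]))
                s += math.log((1 - p_nrel[k]) / p_nrel[k])
        scores[d] = s
    return sorted(scores, key=scores.get, reverse=True)

query = {"a", "b"}
# Initial estimates: P(ki|R) = 0.5, P(ki|~R) = ni/N.
p_rel = {k: 0.5 for k in query}
p_nrel = {k: ni[k] / N for k in query}
ranking = rank(query, p_rel, p_nrel)

# Take the top of the initial ranking as the assumed-relevant set V and
# re-estimate with the smoothed formulas, avoiding zero counts.
V_docs, V = ranking[:2], 2
Vi = {k: sum(1 for d in V_docs if k in docs[d]) for k in query}
p_rel = {k: (Vi[k] + 0.5) / (V + 1) for k in query}
p_nrel = {k: (ni[k] - Vi[k] + 0.5) / (N - V + 1) for k in query}
ranking2 = rank(query, p_rel, p_nrel)
```

Note that with the initial estimate P(ki | R) = 0.5 the first log term vanishes, so the initial ranking is driven by the idf-like second term alone.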

12 Pluses and Minuses
- Advantages:
  - Docs are ranked in decreasing order of their probability of relevance
- Disadvantages:
  - need to guess the initial estimates for P(ki | R)
  - the method does not take tf and idf factors into account

13 Brief Comparison of Classic Models
- The Boolean model does not provide for partial matches and is considered the weakest classic model
- Salton and Buckley ran a series of experiments indicating that, in general, the vector model outperforms the probabilistic model on general collections
- This also seems to be the view of the research community

14
- The Boolean model imposes a binary criterion for deciding relevance
- The question of how to extend the Boolean model to accommodate partial matching and ranking has attracted considerable attention in the past
- We now discuss two set-theoretic models for this:
  - Fuzzy Set Model
  - Extended Boolean Model

15 Fuzzy Set Model
- Queries and docs are represented by sets of index terms: matching is approximate from the start
- This vagueness can be modeled using a fuzzy framework, as follows:
  - with each term a fuzzy set is associated
  - each doc has a degree of membership in this fuzzy set
- This interpretation provides the foundation for many IR models based on fuzzy theory
- Here we discuss the model proposed by Ogawa, Morita, and Kobayashi (1991)

16 Fuzzy Set Theory
- A framework for representing classes whose boundaries are not well defined
- The key idea is to introduce the notion of a degree of membership associated with the elements of a set
- This degree of membership varies from 0 to 1 and allows modeling the notion of marginal membership
- Thus, membership is now a gradual notion, in contrast to the crisp notion enforced by classic Boolean logic

17 Fuzzy Set Theory
- Definition: a fuzzy subset A of U is characterized by a membership function μ(A,u): U → [0,1] which associates with each element u of U a number μ(A,u) in the interval [0,1]
- Definition: let A and B be two fuzzy subsets of U, and let ¬A be the complement of A. Then:
  - μ(¬A,u) = 1 - μ(A,u)
  - μ(A ∪ B, u) = max(μ(A,u), μ(B,u))
  - μ(A ∩ B, u) = min(μ(A,u), μ(B,u))
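
The three operations above translate directly into a minimal sketch; the membership values for the two fuzzy subsets are illustrative:

```python
# Two illustrative fuzzy subsets of U = {x, y, z}, given by their
# membership values mu(A,u) and mu(B,u).
A = {"x": 0.8, "y": 0.3, "z": 0.0}
B = {"x": 0.5, "y": 0.9, "z": 0.2}

def complement(A):
    # mu(~A, u) = 1 - mu(A, u)
    return {u: 1 - m for u, m in A.items()}

def union(A, B):
    # mu(A u B, u) = max(mu(A, u), mu(B, u))
    return {u: max(A[u], B[u]) for u in A}

def intersection(A, B):
    # mu(A n B, u) = min(mu(A, u), mu(B, u))
    return {u: min(A[u], B[u]) for u in A}
```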

18 Fuzzy Information Retrieval
- Fuzzy sets are modeled based on a thesaurus
- This thesaurus is built as follows:
  - Let c be a term-term correlation matrix
  - Let c(i,l) be a normalized correlation factor for the pair (ki,kl):
    c(i,l) = n(i,l) / (ni + nl - n(i,l))
    where
    ni: number of docs which contain ki
    nl: number of docs which contain kl
    n(i,l): number of docs which contain both ki and kl
- We now have a notion of proximity among index terms
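
A minimal sketch of the normalized correlation factor, computed over a toy collection; the document sets and term names are illustrative assumptions:

```python
# Toy collection (illustrative): each document is its set of index terms.
docs = [{"k1", "k2"}, {"k1", "k3"}, {"k2", "k3"}, {"k1"}]

def corr(ki, kl):
    """c(i,l) = n(i,l) / (ni + nl - n(i,l))"""
    ni = sum(1 for d in docs if ki in d)
    nl = sum(1 for d in docs if kl in d)
    nil = sum(1 for d in docs if ki in d and kl in d)
    return nil / (ni + nl - nil)
```

Note that c(i,i) = 1 for any term that occurs at all, and c(i,l) = 0 for terms that never co-occur.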

19 Fuzzy Information Retrieval
- The correlation factor c(i,l) can be used to define fuzzy set membership for a document dj as follows:
  μ(i,j) = 1 - Π over kl ∈ dj of (1 - c(i,l))
  μ(i,j): membership of doc dj in the fuzzy subset associated with ki
- The above expression computes an algebraic sum over all terms in the doc dj (implemented as the complement of a product of complements)
- A doc dj belongs to the fuzzy set for ki if its own terms are associated with ki

20 Fuzzy Information Retrieval
- μ(i,j) = 1 - Π over kl ∈ dj of (1 - c(i,l))
  μ(i,j): membership of doc dj in the fuzzy subset associated with ki
- If doc dj contains a term kl which is closely related to ki, we have:
  - c(i,l) ~ 1
  - μ(i,j) ~ 1
  - and the index ki is a good fuzzy index for the doc
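
Putting the last two slides together, a self-contained sketch that computes μ(i,j) from the correlation factors; the toy collection is illustrative. Since c(i,i) = 1, a document that itself contains ki gets μ = 1:

```python
# Toy collection (illustrative): each document is its set of index terms.
docs = [{"k1", "k2"}, {"k1", "k3"}, {"k2", "k3"}, {"k1"}]

def corr(ki, kl):
    # c(i,l) = n(i,l) / (ni + nl - n(i,l)), as on the thesaurus slide.
    ni = sum(1 for d in docs if ki in d)
    nl = sum(1 for d in docs if kl in d)
    nil = sum(1 for d in docs if ki in d and kl in d)
    return nil / (ni + nl - nil)

def membership(ki, dj):
    """mu(i,j) = 1 - prod over kl in dj of (1 - c(i,l))"""
    prod = 1.0
    for kl in dj:
        prod *= 1 - corr(ki, kl)
    return 1 - prod
```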

21 Fuzzy IR: An Example
- q = ka ∧ (kb ∨ ¬kc)
- In disjunctive normal form:
  vec(qdnf) = (1,1,1) + (1,1,0) + (1,0,0) = vec(cc1) + vec(cc2) + vec(cc3)
  (the three conjunctive components cc1, cc2, cc3, shown in the original slide as regions of a Venn diagram over Ka, Kb, Kc)
- μ(q,dj) = μ(cc1 + cc2 + cc3, j)
          = 1 - (1 - μ(a,j) μ(b,j) μ(c,j)) × (1 - μ(a,j) μ(b,j) (1 - μ(c,j))) × (1 - μ(a,j) (1 - μ(b,j)) (1 - μ(c,j)))

22 Fuzzy Information Retrieval
- Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory
- Experiments with standard test collections are not available
- This makes the model difficult to compare with the others at this time

23 Extended Boolean Model
- Boolean retrieval is simple and elegant
- But no ranking is provided
- How to extend the model?
  - interpret conjunctions and disjunctions in terms of Euclidean distances

24
- The Boolean model is simple and elegant
- But it makes no provision for a ranking
- As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership
- Extend the Boolean model with the notions of partial matching and term weighting
- Combine characteristics of the Vector model with properties of Boolean algebra

25 The Idea
- The extended Boolean model (introduced by Salton, Fox, and Wu, 1983) is based on a critique of a basic assumption in Boolean algebra
- Let:
  - q = kx ∧ ky
  - wxj = fxj × idf(x) / max_i idf(i) be the weight associated with the pair [kx,dj]
  - further, write wxj = x and wyj = y

26 The Idea: AND
- qand = kx ∧ ky; wxj = x and wyj = y
- (Figure: documents dj, dj+1 plotted as points (x,y) in the kx-ky plane, between (0,0) and (1,1))
- sim(qand,dj) = 1 - sqrt( ((1-x)² + (1-y)²) / 2 )
  i.e., one minus the normalized distance from the ideal AND point (1,1)

27 The Idea: OR
- qor = kx ∨ ky; wxj = x and wyj = y
- (Figure: same kx-ky plane)
- sim(qor,dj) = sqrt( (x² + y²) / 2 )
  i.e., the normalized distance from the worst OR point (0,0)

28 Generalizing the Idea
- We can extend the previous model to consider Euclidean distances in a t-dimensional space
- This can be done using p-norms, which extend the notion of distance to include p-distances, where 1 ≤ p ≤ ∞ is a new parameter
- A generalized disjunctive query is given by:
  qor = k1 ∨^p k2 ∨^p ... ∨^p kt
- A generalized conjunctive query is given by:
  qand = k1 ∧^p k2 ∧^p ... ∧^p kt

29 Generalizing the Idea
- sim(qand,dj) = 1 - ( ((1-x1)^p + (1-x2)^p + ... + (1-xm)^p) / m )^(1/p)
- sim(qor,dj) = ( (x1^p + x2^p + ... + xm^p) / m )^(1/p)

30 Properties
- sim(qor,dj) = ( (x1^p + x2^p + ... + xm^p) / m )^(1/p)
- If p = 1 (Vector-like):
  sim(qor,dj) = sim(qand,dj) = (x1 + ... + xm) / m
- If p = ∞ (Fuzzy-like):
  sim(qor,dj) = max(xi)
  sim(qand,dj) = min(xi)
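
The p-norm similarities and their limit behaviour can be checked with a short sketch; the term weights are illustrative, and a large finite p stands in for p = ∞:

```python
def sim_and(x, p):
    """sim(qand,dj) = 1 - ( sum (1 - xi)^p / m )^(1/p)"""
    m = len(x)
    return 1 - (sum((1 - xi) ** p for xi in x) / m) ** (1 / p)

def sim_or(x, p):
    """sim(qor,dj) = ( sum xi^p / m )^(1/p)"""
    m = len(x)
    return (sum(xi ** p for xi in x) / m) ** (1 / p)

# p = 1 behaves like the vector model: both reduce to the average ...
v_or, v_and = sim_or([0.2, 0.8], 1), sim_and([0.2, 0.8], 1)
# ... while a large p approximates the fuzzy max (OR) and min (AND).
f_or, f_and = sim_or([0.2, 0.8], 1000), sim_and([0.2, 0.8], 1000)
```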

31 Properties
- By varying p, we can make the model behave as a vector model, as a fuzzy model, or as an intermediate model
- This is quite powerful and is a good argument in favor of the extended Boolean model
- Different p values can even be mixed within a single query, e.g. q = (k1 ∧² k2) ∨ k3:
  k1 and k2 are to be used as in vectorial retrieval, while the presence of k3 is required

32 Properties
- q = (k1 ∧ k2) ∨ k3
- sim(q,dj) = ( ( (1 - ( ((1-x1)^p + (1-x2)^p) / 2 )^(1/p) )^p + x3^p ) / 2 )^(1/p)

33
- The model is quite powerful
- Its properties are interesting and might be useful
- Computation is somewhat complex
- However, distributivity does not hold for the ranking computation:
  - q1 = (k1 ∨ k2) ∧ k3
  - q2 = (k1 ∧ k3) ∨ (k2 ∧ k3)
  - sim(q1,dj) ≠ sim(q2,dj)

34
- Classic models enforce independence of index terms
- For the Vector model:
  - The set of term vectors {k1, k2, ..., kt} is linearly independent and forms a basis for the subspace of interest
  - Frequently, this is interpreted as: ∀ i,j with i ≠ j, ki · kj = 0
- In 1985, Wong, Ziarko, and Wong proposed an interpretation in which the set of terms is linearly independent but not pairwise orthogonal

35 Key Idea
- In the generalized vector model, two index terms might be non-orthogonal and are represented in terms of smaller components (minterms)
- As before, let:
  - wij be the weight associated with [ki,dj]
  - {k1, k2, ..., kt} be the set of all terms
- If these weights are all binary, every pattern of occurrence of terms within documents can be represented by one of the 2^t minterms:
  m1 = (0,0,...,0)
  m2 = (1,0,...,0)
  m3 = (0,1,...,0)
  m4 = (1,1,...,0)
  m5 = (0,0,1,...,0)
  ...
  m_2^t = (1,1,1,...,1)
- Here, m2 indicates documents in which solely the term k1 occurs

36 Key Idea
- The basis for the generalized vector model is formed by a set of 2^t vectors defined over the set of minterms, as follows:
  m1 = (1,0,0,...,0,0)
  m2 = (0,1,0,...,0,0)
  m3 = (0,0,1,...,0,0)
  ...
  m_2^t = (0,0,0,...,0,1)
- Notice that ∀ i,j with i ≠ j, mi · mj = 0, i.e., the minterm vectors are pairwise orthogonal

37 Key Idea
- Minterm vectors are pairwise orthogonal, but this does not mean that the index terms are independent:
  - The minterm m4 is given by m4 = (1,1,0,...,0)
  - This minterm indicates the occurrence of the terms k1 and k2 within the same document. If such a document exists in a collection, we say that the minterm m4 is active and that a dependency between these two terms is induced
  - The generalized vector model adopts as its basic foundation the notion that co-occurrence of terms within documents induces dependencies among them

38 Forming the Term Vectors
- The vector associated with the term ki is computed as:
  ki = ( Σ over r with gi(mr)=1 of c_i,r × mr ) / sqrt( Σ over r with gi(mr)=1 of c_i,r² )
  where
  c_i,r = Σ over dj such that ∀l, gl(dj) = gl(mr) of wij
- The weight c_i,r associated with the pair [ki,mr] sums up the weights of the term ki in all the documents whose term-occurrence pattern is given by mr
- Notice that for a collection of size N, only N minterms affect the ranking (and not 2^t)

39 Dependency between Index Terms
- A degree of correlation between the terms ki and kj can now be computed as:
  ki · kj = Σ over r with gi(mr)=1 and gj(mr)=1 of c_i,r × c_j,r
- This degree of correlation sums up (in a weighted form) the dependencies between ki and kj induced by the documents in the collection (represented by the minterms mr)
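
A minimal sketch of the minterm bookkeeping from the last two slides: group documents by their term-occurrence pattern, accumulate the weights c_i,r, and sum products over the active minterms shared by both terms. The three-document collection and its weights are illustrative:

```python
from collections import defaultdict

# Toy collection (illustrative): per-document term weights wij.
docs = {
    "d1": {"k1": 2, "k3": 1},
    "d2": {"k1": 1},
    "d3": {"k2": 1, "k3": 3},
}
terms = ["k1", "k2", "k3"]

# c[(ki, mr)] accumulates the weights of ki over all documents whose
# occurrence pattern equals the minterm mr (only active minterms appear).
c = defaultdict(float)
for d, w in docs.items():
    mr = tuple(1 if k in w else 0 for k in terms)
    for k, wij in w.items():
        c[(k, mr)] += wij

def term_corr(ki, kj):
    """ki . kj = sum over active minterms where both terms occur of c_i,r * c_j,r"""
    patterns = {r for (_, r) in c}
    return sum(c[(ki, r)] * c[(kj, r)]
               for r in patterns
               if (ki, r) in c and (kj, r) in c)
```

Here k1 and k3 co-occur only in d1 (pattern (1,0,1)), so their correlation is 2 × 1 = 2, while k1 and k2 never co-occur and are orthogonal.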

40 The Generalized Vector Model: An Example
- (Figure: an example collection of seven documents d1..d7 indexed by the terms k1, k2, and k3)

41 Computation of c_i,r
- c_i,r = Σ over dj such that ∀l, gl(dj) = gl(mr) of wij
- (Table of the resulting c_i,r values shown in the original slide)

42 Computation of Index Term Vectors
- k1 = (3 m2 + 2 m6 + m8) / sqrt(3² + 2² + 1²)
- k2 = (5 m3 + 3 m7 + 2 m8) / sqrt(5² + 3² + 2²)
- k3 = (1 m6 + 5 m7 + 4 m8) / sqrt(1² + 5² + 4²)

43 Computation of Document Vectors
- d1 = 2 k1 + k3
- d2 = k1
- d3 = k2 + 3 k3
- d4 = 2 k1
- d5 = k1 + 2 k2 + 4 k3
- d6 = 2 k2 + 2 k3
- d7 = 5 k2
- q = k1 + 2 k2 + 3 k3

44 Ranking Computation
- k1 = (3 m2 + 2 m6 + m8) / sqrt(3² + 2² + 1²)
- k2 = (5 m3 + 3 m7 + 2 m8) / sqrt(5² + 3² + 2²)
- k3 = (1 m6 + 5 m7 + 4 m8) / sqrt(1² + 5² + 4²)
- d1 = 2 k1 + k3, d2 = k1, d3 = k2 + 3 k3, d4 = 2 k1, d5 = k1 + 2 k2 + 4 k3, d6 = 2 k2 + 2 k3, d7 = 5 k2
- q = k1 + 2 k2 + 3 k3
- The rank of each dj is obtained by expanding dj and q in the minterm basis and computing dj · q

45
- The model considers correlations among index terms
- It is not clear in which situations it is superior to the standard Vector model
- Computation costs are higher
- The model does introduce interesting new ideas

46 Latent Semantic Indexing
- Classic IR might lead to poor retrieval because:
  - unrelated documents might be included in the answer set
  - relevant documents that do not contain at least one index term are not retrieved
  - the reasoning: retrieval based on index terms is vague and noisy
- The user information need is more related to concepts and ideas than to index terms
- A document that shares concepts with another document known to be relevant might be of interest

47 Latent Semantic Indexing
- The key idea is to map documents and queries into a lower-dimensional space (i.e., one composed of higher-level concepts, which are fewer in number than the index terms)
- Retrieval in this reduced concept space might be superior to retrieval in the space of index terms

48 Latent Semantic Indexing
- Definitions:
  - Let t be the total number of index terms
  - Let N be the number of documents
  - Let (Mij) be a term-document matrix with t rows and N columns
  - To each element of this matrix is assigned a weight wij associated with the pair [ki,dj]
  - The weight wij can be based on a tf-idf weighting scheme

49 Latent Semantic Indexing
- The matrix (Mij) can be decomposed into three matrices (singular value decomposition) as follows:
  (Mij) = (K) (S) (D)^t
  - (K) is the matrix of eigenvectors derived from (M)(M)^t
  - (D)^t is the matrix of eigenvectors derived from (M)^t(M)
  - (S) is an r × r diagonal matrix of singular values, where r = min(t,N), that is, the rank of (Mij) (assuming full rank)

50 Computing an Example
- Let (Mij) be given by the matrix shown in the original slide
- Compute the matrices (K), (S), and (D)^t

51 Latent Semantic Indexing
- In the matrix (S), select only the s largest singular values
- Keep the corresponding columns in (K) and (D)^t
- The resulting matrix is called (M)s and is given by:
  (M)s = (K)s (S)s (D)s^t
  where s, s < r, is the dimensionality of the concept space
- The parameter s should be:
  - large enough to allow fitting the characteristics of the data
  - small enough to filter out non-relevant representational details
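
A small NumPy sketch of the truncated decomposition described above; the term-document matrix is illustrative (the example matrix of slide 50 is not reproduced in this transcript):

```python
import numpy as np

# Illustrative term-document matrix M (t = 4 terms, N = 3 documents).
M = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 1.0, 0.0],
])

# Singular value decomposition: M = K S D^t
K, svals, Dt = np.linalg.svd(M, full_matrices=False)

# Keep only the s largest singular values, together with the
# corresponding columns of K and rows of D^t (the concept space).
s = 2
Ms = K[:, :s] @ np.diag(svals[:s]) @ Dt[:s, :]

# (Ms)^t (Ms): document-to-document relationships in the concept space.
doc_sim = Ms.T @ Ms
```

NumPy returns the singular values in decreasing order, so slicing the first s columns/rows is exactly the "select the s largest" step.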

52 Latent Ranking
- The user query can be modelled as a pseudo-document in the original (M) matrix
- Assume the query is modelled as the document numbered 0 in the (M) matrix
- The matrix (M)s^t (M)s quantifies the relationship between any two documents in the reduced concept space
- The first row of this matrix provides the rank of all the documents with regard to the user query (represented as the document numbered 0)

53
- Latent semantic indexing provides an interesting conceptualization of the IR problem
- It allows reducing the complexity of the underlying representational framework, which might be exploited, for instance, for interfacing with the user

54
- Classic IR:
  - Terms are used to index documents and queries
  - Retrieval is based on index-term matching
- Motivation:
  - Neural networks are known to be good pattern matchers

55
- Neural networks:
  - The human brain is composed of billions of neurons
  - Each neuron can be viewed as a small processing unit
  - A neuron is stimulated by input signals and emits output signals in reaction
  - A chain reaction of propagating signals is called a spread activation process
  - As a result of spread activation, the brain might command the body to take physical reactions

56 Neural Network Model
- A neural network is an oversimplified representation of the neuron interconnections in the human brain:
  - nodes are processing units
  - edges are synaptic connections
  - the strength of a propagating signal is modelled by a weight assigned to each edge
  - the state of a node is defined by its activation level
  - depending on its activation level, a node might issue an output signal

57 Neural Network for IR
- From the work by Wilkinson & Hingston, SIGIR'91
- (Figure: a three-layer network with query term nodes, document term nodes k1..kt, and document nodes d1..dN)

58 Neural Network for IR
- A three-layer network
- Signals propagate across the network
- First level of propagation:
  - Query terms issue the first signals
  - These signals propagate across the network to reach the document nodes
- Second level of propagation:
  - Document nodes might themselves generate new signals which affect the document term nodes
  - Document term nodes might respond with new signals of their own

59 Quantifying Signal Propagation
- Normalize signal strength (MAX = 1)
- Query terms emit an initial signal equal to 1
- The weight associated with an edge from a query term node ki to a document term node ki is:
  Wiq = wiq / sqrt( Σi wiq² )
- The weight associated with an edge from a document term node ki to a document node dj is:
  Wij = wij / sqrt( Σi wij² )

60 Quantifying Signal Propagation
- After the first level of signal propagation, the activation level of a document node dj is given by:
  Σi Wiq × Wij = ( Σi wiq × wij ) / ( sqrt(Σi wiq²) × sqrt(Σi wij²) )
  which is exactly the ranking of the Vector model
- New signals might be exchanged among document term nodes and document nodes in a process analogous to a feedback cycle
- A minimum threshold should be enforced to avoid spurious signal generation
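
The first-level activation is just the cosine ranking of the Vector model, which a short sketch can confirm; the tf-idf style weights below are illustrative:

```python
import math

# Illustrative tf-idf style weights over three index terms.
w_q = [1.0, 0.5, 0.0]          # query term weights wiq
w_d = {
    "d1": [0.8, 0.4, 0.0],     # document term weights wij
    "d2": [0.0, 0.3, 0.9],
}

def activation(wq, wd):
    """First-level activation of a document node dj:
    sum_i Wiq * Wij, with Wiq = wiq / |wq| and Wij = wij / |wd|,
    i.e. the cosine ranking of the Vector model."""
    nq = math.sqrt(sum(x * x for x in wq))
    nd = math.sqrt(sum(x * x for x in wd))
    return sum(a * b for a, b in zip(wq, wd)) / (nq * nd)

scores = {d: activation(w_q, wd) for d, wd in w_d.items()}
```

Here d1 is a scalar multiple of the query vector, so its activation is exactly 1 (the normalized maximum).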

61 Conclusions
- The model provides an interesting formulation of the IR problem
- The model has not been tested extensively
- It is not clear what improvements the model might provide

