
1 Information Retrieval, CSE 8337, Spring 2005: Modeling (Part II). Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, http://www.sims.berkeley.edu/~hearst/irbook/

2 Generalized Vector Model. Classic models enforce independence of index terms. For the Vector model, the set of term vectors {k1, k2, ..., kt} is assumed to be linearly independent and to form a basis for the subspace of interest. Frequently, this is interpreted as pairwise orthogonality: for all i, j with i ≠ j, ki · kj = 0. In 1985, Wong, Ziarko, and Wong proposed an interpretation in which the set of term vectors is linearly independent but not pairwise orthogonal.

3 Key Idea: In the generalized vector model, two index terms may be non-orthogonal; term vectors are represented in terms of smaller components called minterms. Let {k1, k2, ..., kt} be the set of all terms and wij the weight associated with the pair [ki, dj]. If these weights are binary, every pattern of term occurrence within a document can be represented by one of the 2^t minterms:
m1 = (0, 0, ..., 0)
m2 = (1, 0, ..., 0)
m3 = (0, 1, ..., 0)
m4 = (1, 1, ..., 0)
...
m2^t = (1, 1, ..., 1)

4 Key Idea: The basis for the generalized vector model is formed by a set of 2^t vectors defined over the set of minterms, as follows:
m1 = (1, 0, 0, ..., 0, 0)
m2 = (0, 1, 0, ..., 0, 0)
m3 = (0, 0, 1, ..., 0, 0)
...
m2^t = (0, 0, 0, ..., 0, 1)
For all i, j with i ≠ j, mi · mj = 0, i.e., the minterm vectors are pairwise orthogonal.
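As a rough illustration of how occurrence patterns select minterms, here is a minimal Python sketch; the helper names (doc_pattern, pattern_to_index) and the bit-ordering convention are illustrative assumptions, not part of the slides or of the model's definition.

```python
# Minimal sketch: identify the minterm that matches a document's binary
# term-occurrence pattern. Helper names and the bit-ordering convention
# are illustrative assumptions, not taken from the slides.

def doc_pattern(doc_weights, num_terms):
    """Binary occurrence pattern (g_1(d_j), ..., g_t(d_j)) of a document."""
    return tuple(1 if doc_weights.get(i, 0) > 0 else 0 for i in range(num_terms))

def pattern_to_index(pattern):
    """Enumerate minterms by reading the pattern as a binary number."""
    return sum(bit << i for i, bit in enumerate(pattern))

# Example: a document containing terms k1 and k3 (term indices 0 and 2)
d = {0: 2, 2: 1}                      # term index -> weight w_ij
p = doc_pattern(d, num_terms=3)       # (1, 0, 1)
print(p, pattern_to_index(p))         # only 2^t patterns exist; at most N are active
```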

5 Key Idea: Minterm vectors are pairwise orthogonal and are used as the basis for the generalized vector space model. The minterm m4 indicates the occurrence of the terms k1 and k2 within the same document. If such a document exists in the collection, we say that the minterm m4 is active and that a dependency between these two terms is induced. The generalized vector model adopts as its basic foundation the notion that co-occurrence of terms within documents induces dependencies among them.

6 Forming the Term Vectors. Ci,r, the weight associated with the pair [ki, mr], sums up the weights of the term ki in all the documents whose term-occurrence pattern is given by mr. In this way, index terms are correlated with the minterm vectors. Notice that for a collection of N documents, at most N minterms can be active, so only N minterms affect the ranking (and not 2^t). The degree of correlation between ki and kj sums up (in weighted form) the dependencies between the two terms induced by the documents in the collection (represented by the minterms mr).

7 Computation of Ci,r:
c_{i,r} = Σ_{d_j : ∀l, g_l(d_j) = g_l(m_r)} w_{i,j}
That is, c_{i,r} sums the weights w_{i,j} over all documents d_j whose binary term-occurrence pattern coincides with the minterm m_r (g_l(·) returns the l-th binary coordinate).
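A small sketch of this computation in Python, assuming documents are given as dictionaries mapping term indices to weights wij (the names are illustrative). Keying each active minterm by its occurrence pattern makes it explicit that at most N minterms are ever materialized:

```python
from collections import defaultdict

def term_minterm_correlations(docs, num_terms):
    """c[i][r]: sum of w_ij over all documents d_j whose binary occurrence
    pattern equals the minterm m_r (here r is the pattern tuple itself)."""
    c = defaultdict(lambda: defaultdict(float))
    for d in docs:                       # d: {term index -> weight w_ij}
        pattern = tuple(1 if d.get(i, 0) > 0 else 0 for i in range(num_terms))
        for i, w in d.items():
            c[i][pattern] += w           # only active minterms appear as keys
    return c
```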

8 Ranking with the Generalized Model. Ci,r gives the correlation between an index term ki and a minterm mr. A generalized index term vector ki can then be calculated as a normalized, Ci,r-weighted combination of the minterm vectors mr. Documents and the query are expressed in terms of these ki vectors, and ranking is then computed using the standard cosine similarity.
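Spelled out, the construction looks roughly as follows; this follows the formulation in the Baeza-Yates and Ribeiro-Neto text cited on the title slide, and should be read as a sketch of that construction rather than as something stated on the slide itself.

```latex
\vec{k}_i = \frac{\sum_{r,\; g_i(m_r)=1} c_{i,r}\,\vec{m}_r}
                 {\sqrt{\sum_{r,\; g_i(m_r)=1} c_{i,r}^{\,2}}}
\qquad
\vec{d}_j = \sum_{i} w_{i,j}\,\vec{k}_i
\qquad
\vec{q} = \sum_{i} w_{i,q}\,\vec{k}_i
\qquad
\mathrm{sim}(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j|\,|\vec{q}|}
```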

9 Computation of Document Vectors (example collection):
d1 = 2 k1 + k3
d2 = k1
d3 = k2 + 3 k3
d4 = 2 k1
d5 = k1 + 2 k2 + 4 k3
d6 = 2 k2 + 2 k3
d7 = 5 k2
q = k1 + 2 k2 + 3 k3
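A hedged end-to-end sketch for this toy collection in plain Python. The weights come straight from the slide; the function names and the choice to key minterms by their occurrence pattern are assumptions made for the example.

```python
import math
from collections import defaultdict

# Toy collection from the slide: term index -> weight w_ij (terms k1..k3 are 0..2)
docs = {
    "d1": {0: 2, 2: 1}, "d2": {0: 1}, "d3": {1: 1, 2: 3}, "d4": {0: 2},
    "d5": {0: 1, 1: 2, 2: 4}, "d6": {1: 2, 2: 2}, "d7": {1: 5},
}
query = {0: 1, 1: 2, 2: 3}
T = 3

def pattern(d):
    return tuple(1 if d.get(i, 0) > 0 else 0 for i in range(T))

# c[i][m_r]: correlation between term k_i and each active minterm m_r
c = defaultdict(lambda: defaultdict(float))
for d in docs.values():
    p = pattern(d)
    for i, w in d.items():
        c[i][p] += w

# Generalized term vectors k_i, expressed in the (orthonormal) minterm basis
k = {}
for i in range(T):
    norm = math.sqrt(sum(v * v for v in c[i].values())) or 1.0
    k[i] = {r: v / norm for r, v in c[i].items()}

def expand(weights):
    """Map a document/query {term: w} into the minterm space: sum_i w_i * k_i."""
    vec = defaultdict(float)
    for i, w in weights.items():
        for r, v in k[i].items():
            vec[r] += w * v
    return vec

def cosine(a, b):
    dot = sum(a[r] * b.get(r, 0.0) for r in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

q_vec = expand(query)
for name, d in docs.items():
    print(name, round(cosine(expand(d), q_vec), 3))
```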

10 Conclusions. The model considers correlations among index terms. It is not clear in which situations it is superior to the standard Vector model. Computation costs are higher. The model does, however, introduce interesting new ideas.

11 Latent Semantic Indexing. Classic IR might lead to poor retrieval because: unrelated documents might be included in the answer set, and relevant documents that do not contain at least one index term are not retrieved. The reasoning: retrieval based on index terms is vague and noisy. The user's information need is more related to concepts and ideas than to index terms. A document that shares concepts with another document known to be relevant might be of interest.

12 Latent Semantic Indexing. The key idea is to map documents and queries into a lower-dimensional space (i.e., one composed of higher-level concepts, which are fewer in number than the index terms). Retrieval in this reduced concept space might be superior to retrieval in the space of index terms.
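A minimal sketch of the dimensionality-reduction step using a truncated SVD in numpy. The slides do not prescribe any of this: the example matrix, the number of concepts k, the fold-in of the query, and all names here are illustrative assumptions.

```python
import numpy as np

def lsi_project(term_doc, k):
    """Truncated SVD of a term-document matrix; returns document and term
    representations in the k-dimensional latent 'concept' space."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    doc_vecs = (np.diag(s_k) @ Vt_k).T      # one row per document
    term_vecs = U_k @ np.diag(s_k)          # one row per term
    return doc_vecs, term_vecs, (U_k, s_k)

def fold_in_query(q, U_k, s_k):
    """Map a query (term-weight vector) into the same latent space."""
    return q @ U_k @ np.diag(1.0 / s_k)

# Tiny example: 3 terms x 4 documents, reduced to 2 concepts
A = np.array([[2., 1., 0., 2.],
              [0., 0., 1., 5.],
              [1., 3., 0., 0.]])
doc_vecs, term_vecs, (U_k, s_k) = lsi_project(A, k=2)
q = fold_in_query(np.array([1., 2., 3.]), U_k, s_k)
sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
print(np.round(sims, 3))
```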

13 Neural Network Model. Classic IR: terms are used to index documents and queries, and retrieval is based on index term matching. Motivation: neural networks are known to be good pattern matchers.

14 Neural Network Model. Neural networks: the human brain is composed of billions of neurons. Each neuron can be viewed as a small processing unit. A neuron is stimulated by input signals and emits output signals in reaction. A chain reaction of propagating signals is called a spread activation process. As a result of spread activation, the brain might command the body to take physical action.

15 Neural Network Model. A neural network is an oversimplified representation of the neuron interconnections in the human brain: nodes are processing units; edges are synaptic connections; the strength of a propagating signal is modelled by a weight assigned to each edge; the state of a node is defined by its activation level; depending on its activation level, a node might issue an output signal.

16 Neural Network for IR (from the work by Wilkinson & Hingston, SIGIR '91). Figure: a three-layer network in which query term nodes (ka, kb, kc) connect to the corresponding document term nodes (k1, ..., kt), which in turn connect to the document nodes (d1, ..., dN).

17 Neural Network for IR. The network has three layers, and signals propagate across it. First level of propagation: query terms issue the first signals, which propagate across the network to reach the document nodes. Second level of propagation: document nodes might themselves generate new signals which affect the document term nodes; document term nodes might then respond with new signals of their own.
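A sketch of the first level of propagation in Python. The normalization used here (query and document weights divided by their vector norms, so that each document node's activation equals the cosine score) follows the usual textbook presentation of this model; the names and the stopping point after one pass are assumptions made for illustration.

```python
import math

def first_propagation(query_weights, doc_weights):
    """Activation arriving at each document node after one pass:
    query term nodes -> document term nodes -> document nodes.

    query_weights: {term: w_iq};  doc_weights: {doc: {term: w_ij}}.
    With the normalized signals used here, each document node's activation
    equals the classic vector-model cosine score for that document.
    """
    q_norm = math.sqrt(sum(w * w for w in query_weights.values())) or 1.0
    activations = {}
    for doc, weights in doc_weights.items():
        d_norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        # each query term node fires with signal w_iq / |q|; the edge from the
        # matching document term node to document d_j carries weight w_ij / |d_j|;
        # the document node sums all incoming signals
        activations[doc] = sum(
            (query_weights[t] / q_norm) * (weights[t] / d_norm)
            for t in query_weights if t in weights
        )
    return activations

# A second level of propagation would let document nodes whose activation
# exceeds a threshold send signals back to their term nodes, which may then
# activate additional documents (a thesaurus-like expansion effect).
```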

18 Conclusions. The model provides an interesting formulation of the IR problem. The model has not been tested extensively. It is not clear what improvements the model might provide.

