Vector Models for IR Gerald Salton, Cornell SMART System


1 Vector Models for IR Gerald Salton, Cornell SMART System
(Salton + Lesk, 68) (Salton, 71) (Salton + McGill, 83)
SMART System: Salton’s Magical Automatic Retrieval Tool(?)
Chris Buckley, Cornell / SAPIR systems → current keeper of the flame
CS466-8

2 Vector Models for IR
[Figure: Boolean Model vs. SMART Vector Model. Document vectors Doc V1 and Doc V2 are indexed by Term i, where a term may be a word, a stem, or a special compound.]
SMART vectors are composed of real-valued term weights, NOT simply Boolean "term present or not".
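The contrast between the two models can be sketched in a few lines of Python. This is only an illustration: the function names, the toy vocabulary, and the use of relative frequency as the real-valued weight are assumptions, not SMART's actual weighting scheme.

```python
from collections import Counter

def boolean_vector(tokens, vocabulary):
    """Boolean model: 1 if the term occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

def weighted_vector(tokens, vocabulary):
    """Vector model: a real-valued weight per term.

    Relative frequency is used here purely as a stand-in weight (assumption).
    """
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[term] / total for term in vocabulary]

# Hypothetical vocabulary and document
vocabulary = ["dna", "genome", "compiler"]
doc = ["dna", "genome", "dna", "protein"]

bool_vec = boolean_vector(doc, vocabulary)
real_vec = weighted_vector(doc, vocabulary)
```

The Boolean vector records only that "dna" and "genome" occur, while the weighted vector also records that "dna" is the more prominent term.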

3 Example
[Table: term weights for Doc V1, Doc V2, Doc V3 over the terms DNA, Comput*, C++, Sparc, genome, biolog*, protein, Compiler; the weight values did not survive extraction.]
Issues
How are weights determined? (simple options: 1. raw freq.; 2. weighted by region, titles, keywords)
Which terms to include? Stoplists
Stem or not?
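The two simple weighting options named on this slide (raw frequency, and frequency weighted by region) can be sketched as follows. The stoplist contents, the region names, and the region multipliers are all hypothetical values chosen for illustration; stemming is left out to keep the sketch short.

```python
from collections import Counter

# Hypothetical stoplist and region multipliers (assumptions, not SMART's values)
STOPLIST = {"the", "of", "a", "and", "in"}
REGION_WEIGHT = {"title": 3.0, "keywords": 2.0, "body": 1.0}

def term_weights(regions, use_region_weights=True):
    """Compute a term -> weight mapping for one document.

    regions maps a region name (e.g. "title") to its list of tokens.
    With use_region_weights=False every occurrence counts 1.0 (raw frequency).
    """
    weights = Counter()
    for region, tokens in regions.items():
        factor = REGION_WEIGHT.get(region, 1.0) if use_region_weights else 1.0
        for token in tokens:
            token = token.lower()
            if token in STOPLIST:
                continue  # stoplisted terms never enter the vector
            weights[token] += factor
    return dict(weights)

doc = {
    "title": ["The", "genome"],
    "body": ["DNA", "of", "the", "genome", "protein"],
}
raw = term_weights(doc, use_region_weights=False)
weighted = term_weights(doc)
```

With region weighting, the title occurrence of "genome" counts three times as much as a body occurrence, so the term's weight rises from 2.0 to 4.0 while "dna" and "protein" stay at 1.0.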

4 Queries and Documents share same vector representation
Given query DQ → map to vector VQ and find document Di such that sim(Vi, VQ) is greatest
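A minimal sketch of this retrieval step, assuming the query has already been mapped to a vector over the same terms as the documents. Cosine is used as the sim function here only as one possible choice; `best_match` and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def best_match(query_vec, doc_vecs, sim=cosine):
    """Return (index, score) of the document vector Vi maximizing sim(Vi, VQ)."""
    scores = [sim(v, query_vec) for v in doc_vecs]
    best = max(range(len(doc_vecs)), key=scores.__getitem__)
    return best, scores[best]

# Hypothetical 3-term document vectors and a query that mentions only the first term
docs = [[2, 0, 1], [0, 3, 0], [1, 1, 0]]
query = [1, 0, 0]
index, score = best_match(query, docs)
```

Sorting the scores instead of taking the maximum would give a ranked list rather than a single best document.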

5 Similarity Functions
[Figure: query vector Q and document vectors V1, V2, V3 (documents D2, D3 labeled)]
Cosine similarity is self-normalizing
Can use arbitrary integer values (don’t need to be probabilities)
Many other options available (Dice, Jaccard)
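The slide's own formulas did not survive extraction, so the sketch below uses the commonly cited vector forms of the three measures. Treat the Dice and Jaccard definitions as an assumption about which variants were intended; several variants exist.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    # dot(u, v) / (|u| * |v|): vector lengths cancel out
    denom = math.sqrt(dot(u, u) * dot(v, v))
    return dot(u, v) / denom if denom else 0.0

def dice(u, v):
    # 2 * dot(u, v) / (|u|^2 + |v|^2)
    denom = dot(u, u) + dot(v, v)
    return 2 * dot(u, v) / denom if denom else 0.0

def jaccard(u, v):
    # dot(u, v) / (|u|^2 + |v|^2 - dot(u, v))
    denom = dot(u, u) + dot(v, v) - dot(u, v)
    return dot(u, v) / denom if denom else 0.0

# d is q scaled by 2: same direction, different length
q = [1, 2, 0]
d = [2, 4, 0]
```

Here `cosine(q, d)` is 1.0 because the two vectors point the same way, which is the sense in which cosine is self-normalizing: the weights can be arbitrary counts and need not sum to one. In these forms, Dice and Jaccard give lower scores (0.8 and about 0.67) because they are sensitive to the length difference.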

6 Projection of Vectors into 2-D Plane

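7 Centroid Computation
[Figure: document clusters with centroids C1 and C2]
Centroid computation: $C = \frac{1}{|D|} \sum_{V_i \in D} V_i$, where D = documents in centroid set and |D| = total docs in centroid set
Basically, the average of the vectors in the centroid set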
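As a minimal sketch, the centroid is the component-wise mean of the vectors in the set. The function name and the sample vectors are hypothetical.

```python
def centroid(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("centroid set must contain at least one vector")
    n = len(vectors)
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dims)]

# Two hypothetical 3-term document vectors
c = centroid([[1, 0, 2], [3, 2, 0]])
```

For these two vectors the result is `[2.0, 1.0, 1.0]`.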

8 Hierarchical Search with Document Centroids
[Figure: hierarchy of centroids over document vectors V1 to V10]

9 Hierarchical Query Matching
VQ = Query Vector
Ci = Root Centroid
For all children {Cj} of Ci:
    find Cj such that sim(VQ, Cj) is maximum
    if Cj is a leaf (document vector), return Cj
    else Ci = Cj and iterate
log(|D|) vector comparisons (height of tree)
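The descent described above can be sketched in Python as follows. The `Node` class, the use of cosine as sim, and the four-document tree are assumptions made for illustration; the slide does not specify a data structure.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    """A tree node: a centroid if it has children, a document vector if it is a leaf."""
    vector: list
    children: list = field(default_factory=list)
    label: str = ""

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def hierarchical_match(query_vec, root, sim=cosine):
    """Greedy descent: at each level follow the child most similar to the query.

    Returns the leaf reached and the number of vector comparisons made.
    """
    node = root
    comparisons = 0
    while node.children:
        comparisons += len(node.children)
        node = max(node.children, key=lambda child: sim(query_vec, child.vector))
    return node, comparisons

# Hypothetical two-level tree: each centroid is the average of its two documents
d1, d2 = Node([1.0, 0.0], label="d1"), Node([0.9, 0.1], label="d2")
d3, d4 = Node([0.0, 1.0], label="d3"), Node([0.1, 0.9], label="d4")
c1 = Node([0.95, 0.05], [d1, d2], "c1")
c2 = Node([0.05, 0.95], [d3, d4], "c2")
root = Node([0.5, 0.5], [c1, c2], "root")

leaf, comparisons = hierarchical_match([0.2, 1.0], root)
```

With a fixed branching factor the comparison count grows with the height of the tree rather than with the number of documents. Because the descent is greedy, it can miss the globally most similar document when that document sits under a centroid that lost an earlier comparison.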

10 Ideal Clustering Behavior

11 Sample Clustered Document Collection
[Figure legend: document vector; centroid vector]

12 Ideal Document Space
[Figure legend: query vector; relevant document with respect to a query; nonrelevant document with respect to a query]

13 Introduction of Superclusters
[Figure legend: document vector; centroid vector; supercentroid vector]

