Vector Models for IR: Gerard Salton, Cornell SMART System


Vector Models for IR
Gerard Salton, Cornell
SMART System: (Salton + Lesk, 68), (Salton, 71), (Salton + McGill, 83)
Chris Buckley, Cornell / SAPIR systems → current keeper of the flame
SMART = Salton's Magical Automatic Retrieval Tool(?)
CS466-8

Vector Models for IR
Boolean Model (vector components: word, stem, special compounds)
Doc V1  Doc V2
0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0
Doc V2
0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0
SMART Vector Model
Term i  1    2    3    4    5    6
Doc V1  1.0  3.5  4.6  0.1  0.0  0.0
Doc V2  0.0  0.0  0.0  0.1  4.0  0.0
SMART vectors are composed of real-valued term weights, NOT simply Boolean term present or not.

Example
        DNA      Comput*  C++      Sparc    genome   biolog*  protein  Compiler
Doc V1  3        5        4        1        0        1        0        0
Doc V2  1        0        0        0        5        3        1        4
Doc V3  2        8        0        1        0        1        0        0
Issues
How are weights determined? (simple options: (1) raw freq., (2) weighted by region, e.g. titles, keywords)
Which terms to include? Stoplists
Stem or not?
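A minimal sketch of the two simple weighting options named above: raw frequency, and frequency weighted by region. The stoplist, the prefix match standing in for stemming, the term list, and the title boost of 3.0 are all assumptions made for illustration; none of them come from the slides.

```python
from collections import Counter

# Hypothetical values chosen only for illustration.
STOPLIST = {"the", "of", "a", "and", "in", "for"}
TERMS = ["dna", "comput", "genome", "protein"]  # stems; "comput" plays the role of comput*
TITLE_BOOST = 3.0

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stoplist words."""
    words = (w.strip(".,;:()").lower() for w in text.split())
    return [w for w in words if w and w not in STOPLIST]

def stem_counts(text):
    """Raw frequency of each stem in TERMS, matched by simple prefix test."""
    counts = Counter()
    for word in tokenize(text):
        for stem in TERMS:
            if word.startswith(stem):
                counts[stem] += 1
    return counts

def raw_freq_vector(body):
    """Option 1: term weight = raw frequency in the body text."""
    counts = stem_counts(body)
    return [float(counts[s]) for s in TERMS]

def region_weighted_vector(title, body):
    """Option 2: occurrences in the title count TITLE_BOOST times as much."""
    t, b = stem_counts(title), stem_counts(body)
    return [TITLE_BOOST * t[s] + b[s] for s in TERMS]

title = "Computing the genome"
body = "DNA computing for protein and genome analysis of the genome"
print(raw_freq_vector(body))
print(region_weighted_vector(title, body))
```

Either function yields a real-valued vector over the same term list, so documents weighted under different schemes remain comparable component by component.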

Queries and documents share the same vector representation
Given query DQ → map to vector VQ and find document Di : sim(Vi, VQ) is greatest
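The retrieval rule above can be sketched as an argmax over document vectors. This sketch assumes cosine similarity as sim; the document vectors follow the column order of the Example slide, and the query vector (weight 1 on the genome and protein columns) is a hypothetical one added for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def best_match(query_vec, doc_vecs, sim=cosine):
    """Return (doc_id, score) for the document Di maximizing sim(Vi, VQ)."""
    scored = ((doc_id, sim(vec, query_vec)) for doc_id, vec in doc_vecs.items())
    return max(scored, key=lambda pair: pair[1])

# Document vectors from the Example slide, same column order.
docs = {
    "V1": [3, 5, 4, 1, 0, 1, 0, 0],
    "V2": [1, 0, 0, 0, 5, 3, 1, 4],
    "V3": [2, 8, 0, 1, 0, 1, 0, 0],
}
query = [0, 0, 0, 0, 1, 0, 1, 0]  # hypothetical query: genome, protein

doc_id, score = best_match(query, docs)
print(doc_id, round(score, 3))
```

Because the query is built in the same term space as the documents, no separate query representation is needed; any similarity function over two vectors can be passed as `sim`.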

Similarity Functions
Many other options available (Dice, Jaccard)
Cosine similarity is self-normalizing
V1     100  200  300  50
D2 V2  1    2    3    0.5
D3 V3  10   20   30   5
[Figure labels: Q, D2, D3]
Can use arbitrary integer values (don't need to be probabilities)
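A short sketch of what "self-normalizing" means, assuming the standard cosine definition sim(Vi, VQ) = (Vi · VQ) / (|Vi| |VQ|). The three vectors are the ones on the slide; the query vector is a hypothetical one added for illustration.

```python
import math

def cosine(u, v):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def dot_product(u, v):
    """Unnormalized similarity, shown only for contrast."""
    return sum(a * b for a, b in zip(u, v))

# Vectors from the slide: same direction, very different magnitudes.
v1 = [100, 200, 300, 50]
v2 = [1, 2, 3, 0.5]
v3 = [10, 20, 30, 5]
q = [1, 0, 1, 0]  # hypothetical query vector

for name, vec in [("V1", v1), ("V2", v2), ("V3", v3)]:
    print(name, "dot =", dot_product(vec, q), "cosine =", round(cosine(vec, q), 6))
```

The dot products differ by factors of 10 and 100, while the cosine scores agree up to floating-point rounding, since the three vectors point in the same direction. This is why raw counts can be used directly as weights without rescaling them first.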

Projection of Vectors into 2-D Plane

Centroid computation
[Figure labels: C1, C2]
Basically, the average of the vectors in the centroid set:
$C = \frac{1}{|D|} \sum_{V_i \in D} V_i$
D = documents in centroid set
|D| = total docs in centroid set
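A minimal sketch of the centroid as a component-wise average. Grouping Doc V1 and Doc V3 from the Example slide into one centroid set is an assumption made for illustration; the slides do not say which documents cluster together.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("centroid set must contain at least one vector")
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / n for k in range(dim)]

# Hypothetical centroid set: Doc V1 and Doc V3 from the Example slide.
cluster = [
    [3, 5, 4, 1, 0, 1, 0, 0],
    [2, 8, 0, 1, 0, 1, 0, 0],
]
print(centroid(cluster))
```

The centroid lives in the same term space as the documents, so it can be compared against a query vector with the same similarity function.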

Hierarchical Search with Document Centroids
[Figure: centroid tree with document vectors V1 through V10 as leaves]

Hierarchical Query Matching
VQ = query vector
Ci = root centroid
Among all children {Cj} of Ci, find Cj : sim(VQ, Cj) is maximum
if Cj is a leaf (document vector), return Cj
else Ci = Cj and iterate
On the order of log(|D|) levels of vector comparisons (height of tree), assuming a balanced tree
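The descent above can be sketched as follows. The `Node` class, the two-cluster tree, and the query vector are hypothetical and exist only for illustration; cosine similarity is assumed as sim, and the leaf vectors are borrowed from the Example slide.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    """Tree node: a centroid if it has children, a document vector if it is a leaf."""
    vector: list
    children: list = field(default_factory=list)
    label: str = ""

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def hierarchical_match(query_vec, root):
    """Greedy descent: at each level follow the child most similar to the query."""
    node, comparisons = root, 0
    while node.children:
        comparisons += len(node.children)
        node = max(node.children, key=lambda c: cosine(query_vec, c.vector))
    return node, comparisons

# Hypothetical two-cluster tree over the three example documents.
v1 = Node([3, 5, 4, 1, 0, 1, 0, 0], label="V1")
v2 = Node([1, 0, 0, 0, 5, 3, 1, 4], label="V2")
v3 = Node([2, 8, 0, 1, 0, 1, 0, 0], label="V3")
c_a = Node(centroid([v1.vector, v3.vector]), [v1, v3], "C_A")
c_b = Node(centroid([v2.vector]), [v2], "C_B")
root = Node(centroid([v1.vector, v2.vector, v3.vector]), [c_a, c_b], "root")

query = [0, 0, 0, 0, 1, 0, 1, 0]  # hypothetical query: genome, protein
leaf, comparisons = hierarchical_match(query, root)
print(leaf.label, comparisons)
```

The descent is greedy: it compares the query only against the children of the current centroid, so it visits one node per level instead of scoring every document. The trade-off is that a relevant document filed under a less similar centroid can be missed.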

Ideal Clustering Behavior

Sample Clustered Document Collection
[Figure legend: document vector; centroid vector]

Ideal Document Space
[Figure legend: relevant document with respect to a query vector; nonrelevant document with respect to a query]

Introduction of Superclusters
[Figure legend: document vector; centroid vector; supercentroid vector]