CIS 530 Lecture 2 From frequency to meaning: vector space models of semantics.


Measuring similarity between objects
Many tasks can be expressed as a question of similarity:
- Information retrieval: the search engine should return documents that are similar to the user query
- Document classification
- Clustering of objects: find groups of similar items
- Document segmentation
- Essay grading
- SAT/GRE language questions

Lecture goal for today
- Introduce a simple yet powerful way of representing text, words, etc.
- Introduce several similarity metrics that work well with such representations

Spatial analogy
- Think about points (locations) in this room
- Choose a coordinate system
- Each point is represented by a vector (d1, d2, d3)
- We can compute the distance between points (e.g. Euclidean distance)
- Points that are near each other are similar
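A minimal sketch of the spatial analogy (the points below are made-up illustrative coordinates, not from the lecture): two points in 3-D space and their Euclidean distance.

```python
import math

# Two points in the room, each represented as a (d1, d2, d3) vector.
p = (1.0, 2.0, 0.5)
q = (1.5, 2.0, 1.0)

# Euclidean distance: square root of the sum of squared coordinate differences.
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
print(euclidean)  # points with a small distance are "similar" (near each other)
```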

Think about representing documents
- Choose a list of words w1, w2, w3, ..., wn
- These could be, for example, all the words that appear in a collection of documents
- Each of these words corresponds to one axis of a coordinate system
- The corresponding entry in a document's vector can be, for example, the number of times the word occurs in that document

Representing three documents
S1: The key is in the backpack.
S2: The key is in the front pocket of the backpack.
S3: The bear den is in the far end of the forest.

Term counts per document:

          the  key  is  in  backpack  front  pocket  of  bear  den  far  end  forest
    S1     2    1    1   1      1       0       0     0    0    0    0    0     0
    S2     3    1    1   1      1       1       1     1    0    0    0    0     0
    S3     3    0    1   1      0       0       0     1    1    1    1    1     1
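A minimal sketch of how these count vectors could be built (an illustration, not the lecture's own code; tokenization here is just lowercasing and stripping the final period, which is an assumption made for brevity):

```python
from collections import Counter

docs = {
    "S1": "The key is in the backpack.",
    "S2": "The key is in the front pocket of the backpack.",
    "S3": "The bear den is in the far end of the forest.",
}

def tokenize(text):
    # Crude tokenizer: lowercase, drop periods, split on whitespace.
    return text.lower().replace(".", "").split()

# Vocabulary = every word that appears in the collection; each word is one axis.
vocab = sorted({w for text in docs.values() for w in tokenize(text)})

# Each document's vector entry is the number of times the word occurs in it.
vectors = {
    name: [Counter(tokenize(text))[w] for w in vocab]
    for name, text in docs.items()
}

print(vocab)
print(vectors["S2"])
```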

Completing the vector representation
There are a number of reasonable choices for what the weight of each word should be:
- Binary: 1 if present, 0 if not
- Number of occurrences
- Tf.idf
- Mutual information

Tf.idf
- Tf is term frequency: the number of occurrences of the word in the document
- Idf is inverse document frequency: consider a collection of N documents; df for a particular word is the number of documents (among these N) that contain the word

idf weight
- df_t is the document frequency of t: the number of documents that contain t
- df_t is an inverse measure of the informativeness of t
- df_t <= N
- We define the idf (inverse document frequency) of t by idf_t = log(N / df_t)
- We use log(N / df_t) instead of N / df_t to "dampen" the effect of idf; it turns out the base of the log is immaterial

idf example, suppose N = 1 million (using log base 10):

    term        df_t        idf_t
    calpurnia   1           6
    animal      100         4
    sunday      1,000       3
    fly         10,000      2
    under       100,000     1
    the         1,000,000   0

There is one idf value for each term t in a collection.
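A minimal sketch of idf and tf.idf weighting over the toy collection above (the names `vectors` and `vocab` refer to the earlier count-vector sketch; this is an illustration under those assumptions, not the lecture's own code). It uses the natural log, since the base is immaterial:

```python
import math

N = len(vectors)  # number of documents in the collection

# df_t: in how many documents does term t occur at least once?
df = [sum(1 for vec in vectors.values() if vec[i] > 0) for i in range(len(vocab))]

# idf_t = log(N / df_t): rare terms get high weight, ubiquitous terms get ~0.
idf = [math.log(N / df_t) for df_t in df]

# tf.idf weight of term t in document d = (count of t in d) * idf_t.
tfidf = {
    name: [tf * w for tf, w in zip(vec, idf)]
    for name, vec in vectors.items()
}

print(tfidf["S1"])
```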

Mutual information
- Consider, for example, that we have several topics (sports, politics, entertainment, academic)
- We want to find words that are descriptive of a given topic and use only these in our representations
- Otherwise, small irrelevant differences can accumulate and mask real differences

Pointwise mutual information: PMI(topic, word) = log( P(topic, word) / (P(topic) P(word)) )
- The probabilities can easily be computed from counts in a sample collection. How?
- Pointwise mutual information has a high value when there is a strong relationship between the topic and the word
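A minimal sketch of estimating PMI between a topic and a word from document counts (the counts below are made up purely for illustration):

```python
import math

n_docs = 1000            # documents in the sample collection
n_topic = 200            # documents labeled with the topic, e.g. "sports"
n_word = 150             # documents containing the word, e.g. "goal"
n_topic_and_word = 90    # "sports" documents that contain "goal"

# Estimate the probabilities from counts.
p_topic = n_topic / n_docs
p_word = n_word / n_docs
p_joint = n_topic_and_word / n_docs

# PMI(topic, word) = log P(topic, word) / (P(topic) * P(word)):
# high values mean the word co-occurs with the topic far more often than chance.
pmi = math.log(p_joint / (p_topic * p_word))
print(pmi)
```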

Measures of distance and similarity
- Euclidean distance is almost never used in language-related applications
- This distance is large for vectors of different lengths, even when their content is similar

cosine(query, document)

cos(q, d) = (q . d) / (|q| |d|) = sum_i q_i d_i / ( sqrt(sum_i q_i^2) * sqrt(sum_i d_i^2) )

- The numerator is the dot product of q and d; dividing by the lengths is equivalent to taking the dot product of the corresponding unit vectors
- q_i is the tf-idf weight of term i in the query
- d_i is the tf-idf weight of term i in the document
- This is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d
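A minimal sketch of cosine similarity between two tf-idf vectors (the vectors below are arbitrary illustrative values, not from the lecture):

```python
import math

q = [0.0, 1.2, 0.7, 0.0, 0.3]  # query vector (tf-idf weights)
d = [0.4, 0.9, 0.0, 0.0, 0.3]  # document vector (tf-idf weights)

# Dot product of the two vectors.
dot = sum(qi * di for qi, di in zip(q, d))

# Lengths (Euclidean norms) of each vector.
norm_q = math.sqrt(sum(qi * qi for qi in q))
norm_d = math.sqrt(sum(di * di for di in d))

# cos(q, d) = (q . d) / (|q| |d|): the cosine of the angle between the vectors,
# which is insensitive to differences in vector length.
cosine = dot / (norm_q * norm_d)
print(cosine)
```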

Distance metrics

Similarity metrics

How to choose a weighting scheme and similarity measure
- There is no right answer, but by looking at the definitions of the metrics you can notice possible problems and peculiarities
- Try them out and see what works best for a given application:
  - Get some sample results you would like to obtain
  - Try out the different possibilities
  - See which one comes closest to predicting the results you would like

Departures from the spatial analogy
- Our "dimensions" are not independent of each other, and there are most likely more of them than necessary
- In new examples we are likely to see new dimensions: words that we did not see when we decided on the vocabulary
- The space is highly dimensional: there is a huge number of words in a language
- The weighting schemes are trying to deal with such issues