Term weighting and vector representation of text Lecture 3

Last lecture
- Tokenization: token, type, and term distinctions
- Case normalization
- Stemming and lemmatization
- Stopwords: the 30 most common words account for 30% of the tokens in written text

Last lecture: empirical laws
- Heaps' law: estimating the size of the vocabulary; linear in the number of tokens in log-log space
- Zipf's law: the frequency distribution of terms

This class
- Term weighting: tf-idf
- Vector representations of text
- Computing similarity: relevance, redundancy identification

Term frequency and weighting
A word that appears often in a document is probably very descriptive of what the document is about.
- Assign each term in a document a weight that depends on the number of occurrences of that term in the document
- Term frequency (tf): set the weight equal to the number of occurrences of term t in document d

Bag of words model
A document can now be viewed as the collection of terms in it with their associated weights; word order is ignored. For example:
- "Mary is smarter than John"
- "John is smarter than Mary"
These are equivalent in the bag of words model, as the sketch below shows.
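
A minimal Python sketch (not from the original slides) of raw term-frequency counting, showing why the two sentences above collapse to the same representation:

    from collections import Counter

    def term_frequencies(text):
        """Raw tf: count of each term after lowercasing and whitespace tokenization."""
        return Counter(text.lower().split())

    d1 = term_frequencies("Mary is smarter than John")
    d2 = term_frequencies("John is smarter than Mary")

    print(d1)        # Counter({'mary': 1, 'is': 1, 'smarter': 1, 'than': 1, 'john': 1})
    print(d1 == d2)  # True: word order is lost in the bag of words model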

Problems with term frequency
- Stop words are semantically vacuous
- Domain-specific frequent terms: in a collection about the auto industry, "auto" or "car" would not be at all indicative of what a document or sentence is about
We need a mechanism for attenuating the effect of terms that occur too often in the collection to be meaningful for relevance/meaning determination.

Attenuating frequent terms
Scale down the weight of terms with high collection frequency: reduce the tf weight of a term by a factor that grows with its collection frequency. More commonly, document frequency (the number of documents in the collection that contain the term) is used for this purpose.

Inverse document frequency
Let N be the number of documents in the collection and df[t] the document frequency of term t. Then

    idf[t] = ln(N / df[t])

For N = 1000:
- df[the] = 1000; idf[the] = 0
- df[some] = 100; idf[some] = 2.3
- df[car] = 10; idf[car] = 4.6
- df[merger] = 1; idf[merger] = 6.9
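
A quick check of these numbers in Python (a sketch; the slide's examples are consistent with the natural logarithm):

    import math

    def idf(N, df):
        """Inverse document frequency: ln(N / df), matching the examples above."""
        return math.log(N / df)

    N = 1000
    for term, df in [("the", 1000), ("some", 100), ("car", 10), ("merger", 1)]:
        print(f"idf[{term}] = {idf(N, df):.1f}")
    # idf[the] = 0.0, idf[some] = 2.3, idf[car] = 4.6, idf[merger] = 6.9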

tf-idf weighting
Combine the two quantities into a single weight for term t in document d:

    tf-idf[t, d] = tf[t, d] * idf[t]

The weight is:
- Highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents)
- Lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal)
- Lowest when the term occurs in virtually all documents

Document vector space representation
- Each document is viewed as a vector with one component corresponding to each term in the dictionary
- The value of each component is the tf-idf score for that term
- Dictionary terms that do not occur in the document get weight 0
A sketch of this construction follows.
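
A minimal sketch (toy corpus, raw tf, and the natural-log idf above; the corpus and names are illustrative) of building tf-idf vectors over a fixed dictionary:

    import math
    from collections import Counter

    docs = ["the car dealer announced a merger",
            "the merger talks stalled",
            "some car prices fell"]

    tokenized = [d.split() for d in docs]
    dictionary = sorted({t for doc in tokenized for t in doc})
    N = len(tokenized)
    # Document frequency of each dictionary term.
    df = {t: sum(t in doc for doc in tokenized) for t in dictionary}

    def doc_vector(tokens):
        """tf-idf vector over the dictionary; absent terms get weight 0."""
        tf = Counter(tokens)
        return [tf[t] * math.log(N / df[t]) for t in dictionary]

    vectors = [doc_vector(doc) for doc in tokenized]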

How do we quantify the similarity between documents in vector space?
- Magnitude of the vector difference? Problematic when the documents differ in length.

Cosine similarity

    sim(d1, d2) = V(d1) · V(d2) / (|V(d1)| |V(d2)|)

where V(d) is the vector representation of document d, "·" is the dot product, and |V(d)| is the Euclidean length of the vector.

The effect of the denominator is thus to length-normalize the initial document representation vectors to unit vectors, so cosine similarity is the dot product of the normalized versions of the two documents. A sketch follows.
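
A minimal, self-contained sketch of cosine similarity in plain Python:

    import math

    def cosine_similarity(u, v):
        """Dot product divided by the product of the vectors' Euclidean lengths."""
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        if norm_u == 0 or norm_v == 0:
            return 0.0  # an all-zero vector is similar to nothing
        return dot / (norm_u * norm_v)

    print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # ~1.0 (same direction)
    print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 (orthogonal)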

Tf modifications
It is unlikely that twenty occurrences of a term in a document truly carry twenty times the significance of a single occurrence. A standard remedy is sublinear tf scaling: replace tf with 1 + ln(tf) when tf > 0, and 0 otherwise, as sketched below.
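
A sketch of sublinear tf scaling (the exact variant used in the original slides is an assumption; this is the common 1 + ln(tf) form):

    import math

    def sublinear_tf(tf):
        """Dampen raw counts: 1 + ln(tf) for tf > 0, else 0."""
        return 1 + math.log(tf) if tf > 0 else 0.0

    for tf in (1, 2, 20):
        print(tf, round(sublinear_tf(tf), 2))
    # 1 -> 1.0, 2 -> 1.69, 20 -> 4.0: twenty occurrences weigh ~4x, not 20x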

Maximum tf normalization
Normalize each tf by the largest term frequency in the same document:

    ntf[t, d] = a + (1 - a) * tf[t, d] / tf_max(d)

where a is a smoothing term (typically around 0.4) and tf_max(d) is the largest raw tf of any term in d. A sketch follows.
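
A sketch of maximum tf normalization (the smoothing value a = 0.4 is a common choice, not confirmed by the original slide):

    def max_tf_normalize(tf_counts, a=0.4):
        """Scale each raw tf by the largest tf in the document, with smoothing a."""
        tf_max = max(tf_counts.values())
        return {term: a + (1 - a) * tf / tf_max for term, tf in tf_counts.items()}

    print(max_tf_normalize({"car": 10, "merger": 2, "the": 1}))
    # approx: {'car': 1.0, 'merger': 0.52, 'the': 0.46}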

Problems with the normalization
- A change in the stop word list can dramatically alter term weightings
- A document may contain an outlier term