Term weighting and vector representation of text Lecture 3

Last lecture
- Tokenization: token, type, and term distinctions
- Case normalization
- Stemming and lemmatization
- Stopwords: the 30 most common words account for 30% of the tokens in written text

Last lecture: empirical laws
- Heaps' law: estimating the size of the vocabulary; linear in the number of tokens in log-log space
- Zipf's law: the frequency distribution of terms

This class
- Term weighting: tf-idf
- Vector representations of text
- Computing similarity: relevance, redundancy identification

Term frequency and weighting
A word that appears often in a document is probably very descriptive of what the document is about.
- Assign each term in a document a weight that depends on the number of occurrences of that term in the document
- Term frequency (tf): set the weight equal to the number of occurrences of term t in document d

Bag of words model
A document can now be viewed as the collection of terms in it with their associated weights; word order is ignored. For example:
- "Mary is smarter than John"
- "John is smarter than Mary"
These are equivalent in the bag of words model, as the sketch below shows.
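
A minimal Python sketch (not from the original slides) of raw term-frequency counting, showing why the two sentences above collapse to the same representation:

    from collections import Counter

    def term_frequencies(text):
        """Raw tf: count of each term after lowercasing and whitespace tokenization."""
        return Counter(text.lower().split())

    d1 = term_frequencies("Mary is smarter than John")
    d2 = term_frequencies("John is smarter than Mary")

    print(d1)        # Counter({'mary': 1, 'is': 1, 'smarter': 1, 'than': 1, 'john': 1})
    print(d1 == d2)  # True: word order is lost in the bag of words model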

Problems with term frequency
- Stop words are semantically vacuous
- Domain-specific frequent terms: in a collection about the auto industry, "auto" or "car" would not be at all indicative of what a document or sentence is about
We need a mechanism for attenuating the effect of terms that occur too often in the collection to be meaningful for relevance/meaning determination.

Attenuating frequent terms
Scale down the weight of terms with high collection frequency: reduce the tf weight of a term by a factor that grows with its collection frequency. More commonly, document frequency (the number of documents in the collection that contain the term) is used for this purpose.

Inverse document frequency
Let N be the number of documents in the collection and df[t] the document frequency of term t. Then

    idf[t] = ln(N / df[t])

For N = 1000:
- df[the] = 1000; idf[the] = 0
- df[some] = 100; idf[some] = 2.3
- df[car] = 10; idf[car] = 4.6
- df[merger] = 1; idf[merger] = 6.9
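
A quick check of these numbers in Python (a sketch; the slide's examples are consistent with the natural logarithm):

    import math

    def idf(N, df):
        """Inverse document frequency: ln(N / df), matching the examples above."""
        return math.log(N / df)

    N = 1000
    for term, df in [("the", 1000), ("some", 100), ("car", 10), ("merger", 1)]:
        print(f"idf[{term}] = {idf(N, df):.1f}")
    # idf[the] = 0.0, idf[some] = 2.3, idf[car] = 4.6, idf[merger] = 6.9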

tf-idf weighting
Combine the two quantities into a single weight for term t in document d:

    tf-idf[t, d] = tf[t, d] * idf[t]

The weight is:
- Highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents)
- Lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal)
- Lowest when the term occurs in virtually all documents

Document vector space representation
- Each document is viewed as a vector with one component corresponding to each term in the dictionary
- The value of each component is the tf-idf score for that term
- Dictionary terms that do not occur in the document get weight 0
A sketch of this construction follows.
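
A minimal sketch (toy corpus, raw tf, and the natural-log idf above; the corpus and names are illustrative) of building tf-idf vectors over a fixed dictionary:

    import math
    from collections import Counter

    docs = ["the car dealer announced a merger",
            "the merger talks stalled",
            "some car prices fell"]

    tokenized = [d.split() for d in docs]
    dictionary = sorted({t for doc in tokenized for t in doc})
    N = len(tokenized)
    # Document frequency of each dictionary term.
    df = {t: sum(t in doc for doc in tokenized) for t in dictionary}

    def doc_vector(tokens):
        """tf-idf vector over the dictionary; absent terms get weight 0."""
        tf = Counter(tokens)
        return [tf[t] * math.log(N / df[t]) for t in dictionary]

    vectors = [doc_vector(doc) for doc in tokenized]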

How do we quantify the similarity between documents in vector space?
- Magnitude of the vector difference? Problematic when the documents differ in length.

Cosine similarity

    sim(d1, d2) = V(d1) · V(d2) / (|V(d1)| |V(d2)|)

where V(d) is the vector representation of document d, "·" is the dot product, and |V(d)| is the Euclidean length of the vector.

The effect of the denominator is thus to length-normalize the initial document representation vectors to unit vectors, so cosine similarity is the dot product of the normalized versions of the two documents. A sketch follows.
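
A minimal, self-contained sketch of cosine similarity in plain Python:

    import math

    def cosine_similarity(u, v):
        """Dot product divided by the product of the vectors' Euclidean lengths."""
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        if norm_u == 0 or norm_v == 0:
            return 0.0  # an all-zero vector is similar to nothing
        return dot / (norm_u * norm_v)

    print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # ~1.0 (same direction)
    print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 (orthogonal)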

Tf modifications
It is unlikely that twenty occurrences of a term in a document truly carry twenty times the significance of a single occurrence. A standard remedy is sublinear tf scaling: replace tf with 1 + ln(tf) when tf > 0, and 0 otherwise, as sketched below.
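
A sketch of sublinear tf scaling (the exact variant used in the original slides is an assumption; this is the common 1 + ln(tf) form):

    import math

    def sublinear_tf(tf):
        """Dampen raw counts: 1 + ln(tf) for tf > 0, else 0."""
        return 1 + math.log(tf) if tf > 0 else 0.0

    for tf in (1, 2, 20):
        print(tf, round(sublinear_tf(tf), 2))
    # 1 -> 1.0, 2 -> 1.69, 20 -> 4.0: twenty occurrences weigh ~4x, not 20x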

Maximum tf normalization
Normalize each tf by the largest term frequency in the same document:

    ntf[t, d] = a + (1 - a) * tf[t, d] / tf_max(d)

where a is a smoothing term (typically around 0.4) and tf_max(d) is the largest raw tf of any term in d. A sketch follows.
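
A sketch of maximum tf normalization (the smoothing value a = 0.4 is a common choice, not confirmed by the original slide):

    def max_tf_normalize(tf_counts, a=0.4):
        """Scale each raw tf by the largest tf in the document, with smoothing a."""
        tf_max = max(tf_counts.values())
        return {term: a + (1 - a) * tf / tf_max for term, tf in tf_counts.items()}

    print(max_tf_normalize({"car": 10, "merger": 2, "the": 1}))
    # approx: {'car': 1.0, 'merger': 0.52, 'the': 0.46}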

Problems with the normalization
- A change in the stop word list can dramatically alter term weightings
- A document may contain an outlier term