Vector Space Model Hongning Wang

Recap: what is text mining

"Text mining, also referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text." - Wikipedia

"Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known." - Hearst, 1999

Recap: text mining in general

- Access: filter information; serve for IR applications
- Mining: discover knowledge; based on NLP/ML techniques; sub-area of DM research
- Organization: add structure/annotations

Today's lecture

1. How to represent a document? Make it computable.
2. How to infer the relationship among documents, or identify the structure within a document? Knowledge discovery.

How to represent a document

- Represent it by a string? No semantic meaning.
- Represent it by a list of sentences? A sentence is just like a short document (recursive definition).

Vector space model

Represent documents by concept vectors:
- Each concept defines one dimension
- k concepts define a high-dimensional space
- Each element of the vector is a concept weight, e.g., d = (x_1, ..., x_k), where x_i is the "importance" of concept i in d

Distance between vectors in this concept space reflects the relationship among documents.

An illustration of the VS model

All documents are projected into this concept space.

[Figure: documents D1 through D5 plotted in a three-dimensional concept space with axes Sports, Education, and Finance; |D2 - D4| marks the distance between D2 and D4.]

What the VS model doesn't say

- How to define/select the "basic concepts" (concepts are assumed to be orthogonal)
- How to assign weights (weights indicate how well the concept characterizes the document)
- How to define the distance metric

What is a good "basic concept"?

- Orthogonal: linearly independent basis vectors, "non-overlapping" in meaning, no ambiguity
- Weights can be assigned automatically and accurately

Existing solutions:
- Terms or N-grams, a.k.a. bag-of-words
- Topics (we will come back to this later)

Bag-of-words representation

Terms as the basis for the vector space:
- Doc1: Text mining is to identify useful information.
- Doc2: Useful information is mined from text.
- Doc3: Apple is delicious.

        text  information  identify  mining  mined  is  useful  to  from  apple  delicious
Doc1       1            1         1       1      0   1       1   1     0      0          0
Doc2       1            1         0       0      1   1       1   0     1      0          0
Doc3       0            0         0       0      0   1       0   0     0      1          1
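The matrix above can be built in a few lines. A minimal sketch (illustrative only; the naive lowercase/strip-punctuation tokenizer stands in for the real tokenization discussed next):

```python
# Build a raw-count term-document matrix for the three example documents.
from collections import Counter

docs = [
    "Text mining is to identify useful information.",
    "Useful information is mined from text.",
    "Apple is delicious.",
]

# Naive tokenization: lowercase, strip periods, split on whitespace.
tokenized = [d.lower().replace(".", "").split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# One row per document, one column per vocabulary term.
matrix = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

for name, row in zip(["Doc1", "Doc2", "Doc3"], matrix):
    print(name, dict(zip(vocab, row)))
```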

Tokenization

Break a stream of text into meaningful units.
- Tokens: words, phrases, symbols
- The definition depends on language, corpus, or even context

Input: It's not straight-forward to perform so-called "tokenization."
Output (1): 'It's', 'not', 'straight-forward', 'to', 'perform', 'so-called', '"tokenization."'
Output (2): 'It', ''', 's', 'not', 'straight', '-', 'forward', 'to', 'perform', 'so', '-', 'called', '"', 'tokenization', '.', '"'

Tokenization

Solutions:
- Regular expressions:
  [\w]+ : so-called -> 'so', 'called'
  [\S]+ : It's -> 'It's' instead of 'It', ''s'
- Statistical methods: explore rich features to decide where the boundary of a word is
  Examples: Apache OpenNLP, the Stanford NLP Parser (online demos are available from Stanford and UIUC)

We will come back to this later.
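A quick sketch of the two regular-expression tokenizers above (using Python's re module; this is an illustration, not the course's reference code):

```python
import re

text = 'It\'s not straight-forward to perform so-called "tokenization."'

# [\w]+ splits at every non-word character: so-called -> 'so', 'called'
print(re.findall(r"[\w]+", text))

# [\S]+ splits on whitespace only, keeping punctuation attached: It's stays whole
print(re.findall(r"[\S]+", text))
```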

Bag-of-words representation

Assumption: words are independent of each other.

Pros: simple.

Cons:
- The basis vectors are clearly not linearly independent!
- Grammar and word order are missing.

Nevertheless, it is the most frequently used document representation, and the same idea is applied to images, speech, and gene sequences.

Bag-of-words with N-grams
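The idea is to use contiguous term sequences (N-grams) instead of single terms as basis vectors, which recovers some local word order. A minimal sketch (the `ngrams` helper is hypothetical; the hyphen-joined output matches the notation used later in this lecture):

```python
# Produce word N-grams by sliding a window of size n over the token list.
def ngrams(tokens, n):
    return ["-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["text", "mine", "is", "to", "identify", "use", "inform"]
print(ngrams(tokens, 2))
# ['text-mine', 'mine-is', 'is-to', 'to-identify', 'identify-use', 'use-inform']
```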

Automatic document representation

Recap: document representation

1. How to represent a document? Make it computable.
2. How to infer the relationship among documents, or identify the structure within a document? Knowledge discovery.

Recap: vector space model

Represent documents by concept vectors: each concept defines one dimension, k concepts define a high-dimensional space, and each vector element is a concept weight, e.g., d = (x_1, ..., x_k), where x_i is the "importance" of concept i in d. Distance between vectors in this concept space reflects the relationship among documents.

Recap: bag-of-words representation

The binary term-document matrix shown earlier. 0/1 weights might not be the best choice.

A statistical property of language

[Figure: log-log plot of word frequency versus word rank by frequency in Wikipedia (Nov 27, 2006).]

In the Brown Corpus of American English text, the word "the" is the most frequently occurring word and by itself accounts for nearly 7% of all word occurrences; the second-place word "of" accounts for slightly over 3.5% of words. Zipf's law is a discrete version of a power law.
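Zipf's law is easy to eyeball on any corpus: if frequency is roughly proportional to 1/rank, then rank times frequency should stay roughly constant for the head of the vocabulary. A minimal sketch (`corpus.txt` is a placeholder path, not part of the course material):

```python
import re
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z]+", f.read().lower())

counts = Counter(words)
# rank * freq should be roughly constant if Zipf's law holds.
for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(f"{rank:>4}  {word:<12} freq={freq:<8} rank*freq={rank * freq}")
```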

Zipf's law tells us

- Head words take a large portion of occurrences, but they are semantically meaningless (e.g., the, a, an, we, do, to)
- Tail words take a major portion of the vocabulary, but they rarely occur in documents (e.g., dextrosinistral)
- The rest are the most representative words, to be included in the controlled vocabulary

Automatic document representation

- Remove non-informative words
- Remove rare words
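In practice both ends of the Zipf curve are pruned by document frequency. A hedged sketch (the thresholds are illustrative assumptions, not values from the lecture):

```python
from collections import Counter

def prune_vocabulary(tokenized_docs, min_df=2, max_df_ratio=0.5):
    """Keep terms that are neither too rare nor too common."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))  # count each term once per document
    return {t for t, count in df.items()
            if count >= min_df and count / n_docs <= max_df_ratio}
```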

Normalization

Convert different forms of a word to a normalized form in the vocabulary, e.g., U.S.A. -> USA, St. Louis -> Saint Louis.

Solutions:
- Rule-based: delete periods and hyphens; lowercase everything
- Dictionary-based: construct equivalence classes, e.g., car -> "automobile, vehicle", mobile phone -> "cellphone"

We will come back to this later.

Stemming

Reduce inflected or derived words to their root form (plurals, adverbs, inflected word forms), e.g., ladies -> lady, referring -> refer, forgotten -> forget. This bridges the vocabulary gap.

Solutions (for English):
- Porter stemmer: patterns of vowel-consonant sequences
- Krovetz stemmer: morphological rules

Risk: losing the precise meaning of the word, e.g., lay -> lie (a false statement? or to be in a horizontal position?)
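For experimentation, NLTK ships a Porter stemmer (assumes `pip install nltk`). Note that Porter produces stems rather than dictionary words, so "ladies" typically comes out as "ladi"; the dictionary-based Krovetz stemmer is the one that returns "lady":

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["ladies", "referring", "mining", "forgotten"]:
    print(word, "->", stemmer.stem(word))
```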

Stopwords

Words useless for document analysis:
- Not all words are informative
- Removing them reduces vocabulary size
- There is no universal definition (frequent-word lists such as "The OEC: Facts about the language" are a common starting point)
- Risk: removal can break the original meaning and structure of the text, e.g., "this is not a good option" -> "option"; "to be or not to be" -> null

Constructing a VSM representation

Start from D1: 'Text mining is to identify useful information.'

1. Tokenization: 'Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.'
2. Stemming/normalization: 'text', 'mine', 'is', 'to', 'identify', 'use', 'inform', '.'
3. N-gram construction: 'text-mine', 'mine-is', 'is-to', 'to-identify', 'identify-use', 'use-inform', 'inform-.'
4. Stopword/controlled vocabulary filtering: 'text-mine', 'to-identify', 'identify-use', 'use-inform'

Documents are now in a vector space! The pipeline naturally fits the MapReduce paradigm: per-document processing in the mapper, aggregation in the reducer.
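A sketch of the full four-step pipeline (the regex, the stopword list, the choice of bigrams, and the convention of dropping any gram containing a stopword are all simplifying assumptions; exact filtering conventions vary):

```python
import re
from nltk.stem import PorterStemmer  # assumes nltk is installed

STOPWORDS = {"is", "to", "a", "the", "."}
stemmer = PorterStemmer()

def vsm_terms(text, n=2):
    tokens = re.findall(r"[\w]+|[^\w\s]", text.lower())   # 1. tokenization
    stems = [stemmer.stem(t) for t in tokens]             # 2. stemming/normalization
    grams = ["-".join(stems[i:i + n])                     # 3. N-gram construction
             for i in range(len(stems) - n + 1)]
    return [g for g in grams                              # 4. stopword filtering
            if not any(p in STOPWORDS for p in g.split("-"))]

print(vsm_terms("Text mining is to identify useful information."))
```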

How to assign weights?

Important! Why?
- Corpus-wise: some terms carry more information about the document content
- Document-wise: not all terms are equally important

How? Two basic heuristics:
- TF (term frequency), i.e., within-document frequency
- IDF (inverse document frequency)

Term frequency
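In the lecture's notation, with c(t,d) the count of term t in document d, the standard raw definition and a common sublinear variant are:

```latex
% Raw term frequency: the within-document count.
\mathrm{tf}(t,d) = c(t,d)

% A common sublinear variant that damps repeated occurrences:
\mathrm{tf}(t,d) =
  \begin{cases}
    1 + \log c(t,d) & \text{if } c(t,d) > 0\\
    0               & \text{otherwise}
  \end{cases}
```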

TF normalization

Two views of document length:
- A doc is long because it is verbose
- A doc is long because it has more content

Raw TF is inaccurate:
- Document length varies
- "Repeated occurrences" are less informative than the "first occurrence"
- The information a term conveys does not increase proportionally with its number of occurrences

Generally we penalize long documents, but avoid over-penalizing: pivoted length normalization.
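One widely cited instantiation of pivoted length normalization, following Singhal et al. (a sketch of the general form, not necessarily the exact variant plotted on the following slides), divides a sublinear TF by a length term interpolated around the average document length:

```latex
% s is the pivot slope (often around 0.2), |d| the document length,
% and avgdl the average document length in the collection.
\mathrm{ntf}(t,d) =
  \frac{1 + \ln\bigl(1 + \ln c(t,d)\bigr)}
       {1 - s + s \cdot \frac{|d|}{\mathrm{avgdl}}}
```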

[Two slides of figures: normalized TF plotted against raw TF for different normalization schemes.]

Document frequency

Idea: a term is more discriminative if it occurs in fewer documents.

Inverse document frequency

A standard formulation is IDF(t) = 1 + log(N / df(t)), where N is the total number of docs in the collection and df(t) is the number of docs containing t (variants drop the additive constant). The logarithm provides non-linear scaling.

Why document frequency?

Word        ttf     df
try         10422   8760
insurance   10440   3997

Table 1. Example total term frequency vs. document frequency in the Reuters-RCV1 collection. Although the two words have nearly identical total frequencies, "insurance" is concentrated in far fewer documents and is therefore the more discriminative term.

TF-IDF weighting

Combine the two heuristics into a single weight per term (see the formula below).

"Salton was perhaps the leading computer scientist working in the field of information retrieval during his time." - Wikipedia

The Gerard Salton Award is the highest achievement award in IR.
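Putting the two heuristics together with the definitions from the preceding slides (a standard formulation; many variants exist, as noted later):

```latex
% TF-IDF weight of term t in document d: large when the term is frequent
% within the document and rare across the collection.
w(t,d) = \mathrm{tf}(t,d) \times \mathrm{idf}(t)
       = c(t,d) \times \left(1 + \log\frac{N}{\mathrm{df}(t)}\right)
```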

How to define a good similarity metric?

Euclidean distance?

[Figure: documents D1 through D6 in the TF-IDF space with axes Sports, Education, and Finance.]

Recap: constructing a VSM representation

The four-step pipeline from before: tokenization, stemming/normalization, N-gram construction, and stopword/controlled vocabulary filtering put documents into a vector space, and the whole pipeline fits naturally into the MapReduce paradigm.

Recap: TF-IDF weighting

How to define a good similarity metric?

From distance to angle

Angle measures how much the vectors overlap: cosine similarity is the projection of one vector onto another.

[Figure: D1, D2, and D6 in the TF-IDF space with axes Sports and Finance, contrasting the choice of angle with the choice of Euclidean distance.]

Cosine similarity

Normalize each TF-IDF vector to a unit vector and measure the angle between them.

[Figure: D1, D2, and D6 as unit vectors in the TF-IDF space with axes Sports and Finance.]
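The standard definition is cosine(d_i, d_j) = (d_i . d_j) / (||d_i|| ||d_j||). A minimal sketch over TF-IDF vectors stored as {term: weight} dicts (an illustrative simplification; real systems use sparse matrices and inverted indexes):

```python
import math

def cosine(d1, d2):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Toy TF-IDF vectors for two documents (weights are made up for illustration).
print(cosine({"sports": 1.2, "finance": 0.3}, {"sports": 0.9, "finance": 1.1}))
```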

Advantages of the VS model

- Empirically effective!
- Intuitive
- Easy to implement
- Well studied and extensively evaluated

The SMART system, developed at Cornell, is still widely used. Warning: there are many variants of TF-IDF!

Disadvantages of the VS model

- Assumes term independence
- Lacks "predictive adequacy": arbitrary term weighting, arbitrary similarity measure
- Lots of parameter tuning!

What you should know

- Basic ideas of the vector space model
- The procedure for constructing a VS representation of a document
- Two important heuristics in bag-of-words representation: TF and IDF
- Similarity metrics for the VS model

Today's reading

Introduction to Information Retrieval:
- Chapter 2.2: Determining the vocabulary of terms
- Chapter 6.2: Term frequency and weighting
- Chapter 6.3: The vector space model for scoring
- Chapter 6.4: Variant tf-idf functions