Text Categorization


1 Text Categorization
- Assigning documents to a fixed set of categories
- Applications:
  - Web pages
    - Recommending pages
    - Yahoo-like classification hierarchies
    - Categorizing bookmarks
  - Newsgroup messages / news feeds / micro-blog posts
    - Recommending messages, posts, tweets, etc.
    - Message filtering
  - News articles
    - Personalized news
  - Email messages
    - Routing
    - Folderizing
    - Spam filtering

2 Learning for Text Categorization
- Text categorization is an application of classification
- Typical learning algorithms:
  - Naïve Bayes
  - Neural networks
  - Relevance feedback (Rocchio)
  - Nearest neighbor
  - Support vector machines (SVM)

3 Similarity Metrics
- The nearest-neighbor method depends on a similarity (or distance) metric
- Simplest for a continuous m-dimensional instance space: Euclidean distance
- Simplest for an m-dimensional binary instance space: Hamming distance (the number of feature values that differ)
- For text, cosine similarity of TF-IDF weighted vectors is typically most effective
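
The two metrics most relevant here can be sketched in a few lines. This is a minimal illustration (sparse vectors represented as term-to-weight dicts), not the implementation used in the slides:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two sparse vectors (dicts: term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hamming(x, y):
    """Hamming distance between two equal-length binary feature vectors."""
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

print(cosine_sim({"cat": 1.0, "dog": 2.0}, {"cat": 1.0}))
print(hamming([1, 0, 1], [1, 1, 1]))
```

Note that cosine similarity depends only on the angle between the vectors, not their lengths, which is why it suits documents of varying size.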

4 Basic Automatic Text Processing
- Parse documents to recognize structure and metadata
  - e.g., title, date, other fields, HTML tags, etc.
- Scan for word tokens
  - Lexical analysis to recognize keywords, numbers, special characters, etc.
- Stopword removal
  - Remove common words such as "the", "and", "or" that are not semantically meaningful in a document
- Stem words
  - Morphological processing to group word variants (e.g., "compute", "computer", "computing", "computes", ... can be represented by a single stem "comput" in the index)
- Assign weights to words
  - Using frequency in documents and across documents
- Store index
  - Stored in a Term-Document Matrix ("inverted index") that represents each document as a vector of keyword weights
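
The tokenizing, stopword-removal, and stemming steps above can be sketched as follows. The stopword list and suffix-stripping "stemmer" here are toy stand-ins for illustration, not a real stemmer such as Porter's:

```python
import re

# Toy stopword list; real systems use much larger ones
STOPWORDS = {"the", "and", "or", "a", "an", "of", "in", "to", "is"}

def stem(word):
    # Crude suffix stripping for illustration only (not the Porter stemmer)
    for suffix in ("ing", "ers", "er", "es", "s", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())     # lexical analysis
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return [stem(t) for t in tokens]                    # stemming

print(tokenize("The computers and computing"))  # both variants map to "comput"
```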

5 tf x idf Weights
- The tf x idf measure combines:
  - Term frequency (tf)
  - Inverse document frequency (idf): a way to deal with the problems of the Zipf distribution
- Recall the Zipf distribution
- Want to weight terms highly if they are:
  - Frequent in relevant documents, BUT
  - Infrequent in the collection as a whole
- Goal: assign a tf x idf weight to each term in each document

6 tf x idf
- [The formula on this slide was lost in transcription; the standard form, consistent with the example slide's idf = log2(N/df), is:]
- w_ij = tf_ij x log2(N / df_i), where:
  - w_ij = weight of term T_i in document D_j
  - tf_ij = frequency of term T_i in document D_j
  - N = number of documents in the collection
  - df_i = number of documents containing term T_i

7 Inverse Document Frequency
- IDF provides high values for rare words and low values for common words

8 tf x idf Example
- [Tables lost in transcription: the slide shows the initial Term x Doc matrix (inverted index) of raw term frequencies for terms T1-T8 across Doc 1-Doc 6, each term's df and idf = log2(N/df), and the resulting tf x idf Term x Doc matrix]
- Documents are represented as vectors of term weights
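
The computation behind such an example can be sketched as below, assuming idf = log2(N/df) as on the slide; the toy documents are already-tokenized term lists:

```python
import math
from collections import Counter

# Three toy documents as token lists (hypothetical data for illustration)
docs = [["comput", "retrieval"], ["comput", "comput", "text"], ["text", "mine"]]

N = len(docs)
df = Counter(t for d in docs for t in set(d))    # document frequency per term
idf = {t: math.log2(N / df[t]) for t in df}      # idf = log2(N / df)

def tfidf_vector(doc):
    tf = Counter(doc)                            # raw term frequency
    return {t: tf[t] * idf[t] for t in tf}       # tf x idf weight per term

for d in docs:
    print(tfidf_vector(d))
```

Terms appearing in every document get idf = log2(N/N) = 0, so they carry no weight, exactly the "infrequent in the collection as a whole" goal stated earlier.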

9 Nearest-Neighbor Learning Algorithm
- Learning is just storing the representations of the training examples in a data set D
- Testing instance x:
  - Compute the similarity between x and all examples in D
  - Assign x the category of the most similar examples in D
- Does not explicitly compute a generalization or category prototypes (i.e., no "modeling")
- Also called:
  - Case-based learning
  - Memory-based learning
  - Lazy learning

10 K Nearest Neighbor for Text
Training:
  For each training example <x, c(x)> in D:
    Compute the corresponding TF-IDF vector, d_x, for document x
Test instance y:
  Compute TF-IDF vector d for document y
  For each <x, c(x)> in D:
    Let s_x = cosSim(d, d_x)
  Sort examples x in D by decreasing value of s_x
  Let N be the first k examples in D (the most similar neighbors)
  Return the majority class of examples in N
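
A minimal sketch of this algorithm follows; vectors are dicts of term -> TF-IDF weight, with hand-made toy weights standing in for real TF-IDF values:

```python
import math
from collections import Counter

def cos_sim(a, b):
    """Cosine similarity of two sparse vectors (dicts: term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(train, query, k=3):
    """train: list of (tfidf_vector, label) pairs; returns majority label of the k nearest."""
    ranked = sorted(train, key=lambda ex: cos_sim(query, ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training set (hypothetical weights for illustration)
train = [({"ball": 1.0, "goal": 0.5}, "sports"),
         ({"goal": 1.0, "team": 0.8}, "sports"),
         ({"stock": 1.0, "market": 0.9}, "finance")]
print(knn_classify(train, {"goal": 1.0, "ball": 0.3}, k=1))  # -> sports
```

Note that all the work happens at test time (sorting D by similarity), which is why the previous slide calls this "lazy learning."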

11 Using the Rocchio Method
- The Rocchio method is typically used for relevance feedback in information retrieval
- It can be adapted for text categorization:
  - Use standard TF-IDF weighted vectors to represent text documents
  - For each category, compute a prototype vector by summing the vectors of the training documents in the category
  - Assign test documents to the category with the closest prototype vector, based on cosine similarity

12 Rocchio Text Categorization Algorithm (Training)
Assume the set of categories is {c_1, c_2, ..., c_n}
For i from 1 to n:
  Let p_i = <0, 0, ..., 0>   (initialize prototype vectors)
For each training example <x, c(x)> in D:
  Let d be the TF-IDF term vector for doc x
  Let i = j where c_j = c(x)
  Let p_i = p_i + d   (sum all the document vectors in c_i to get p_i)

13 Rocchio Text Categorization Algorithm (Test)
Given test document x:
  Let d be the TF-IDF term vector for x
  Let m = -2   (initialize maximum cosSim)
  For i from 1 to n:   (compute similarity to each prototype vector)
    Let s = cosSim(d, p_i)
    If s > m:
      Let m = s
      Let r = c_i   (update most similar class prototype)
  Return class r
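
The training and test procedures above can be sketched together; documents are again dicts of term -> TF-IDF weight, with toy weights for brevity:

```python
import math
from collections import defaultdict

def cos_sim(a, b):
    """Cosine similarity of two sparse vectors (dicts: term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rocchio_train(examples):
    """examples: list of (tfidf_vector, label); returns label -> summed prototype vector."""
    prototypes = defaultdict(lambda: defaultdict(float))
    for vec, label in examples:
        for term, w in vec.items():
            prototypes[label][term] += w     # sum the class's document vectors
    return {label: dict(p) for label, p in prototypes.items()}

def rocchio_classify(prototypes, query):
    # Return the class whose prototype is most similar by cosine similarity
    return max(prototypes, key=lambda label: cos_sim(query, prototypes[label]))

# Toy training set (hypothetical weights for illustration)
protos = rocchio_train([({"ball": 1.0}, "sports"),
                        ({"goal": 1.0, "ball": 0.4}, "sports"),
                        ({"stock": 1.0}, "finance")])
print(rocchio_classify(protos, {"ball": 0.7, "goal": 0.2}))  # -> sports
```

Unlike k-NN, all the work happens at training time: the test step only compares against n prototypes rather than against every training document.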

14 Rocchio Properties
- Does not guarantee a consistent hypothesis
- Forms a simple generalization of the examples in each class (a prototype)
- The prototype vector does not need to be averaged or otherwise normalized for length, since cosine similarity is insensitive to vector length
- Classification is based on similarity to the class prototypes