Text Categorization

1 Text Categorization

- Assigning documents to a fixed set of categories
- Applications:
  - Web pages
    - Recommending pages
    - Yahoo-like classification hierarchies
    - Categorizing bookmarks
  - Newsgroup messages / news feeds / micro-blog posts
    - Recommending messages, posts, tweets, etc.
    - Message filtering
  - News articles
    - Personalized news
  - Email messages
    - Routing
    - Folderizing
    - Spam filtering

2 Learning for Text Categorization

- Text categorization is an application of classification
- Typical learning algorithms:
  - Bayesian (naïve)
  - Neural network
  - Relevance feedback (Rocchio)
  - Nearest neighbor
  - Support Vector Machines (SVM)

3 Nearest-Neighbor Learning Algorithm

- Learning is just storing the representations of the training examples in data set D
- Testing instance x:
  - Compute the similarity between x and all examples in D
  - Assign x the category of the most similar example(s) in D
- Does not explicitly compute a generalization or category prototypes (i.e., no "modeling")
- Also called:
  - Case-based
  - Memory-based
  - Lazy learning

4 K Nearest-Neighbor

- Using only the closest example to determine the categorization is subject to errors due to:
  - A single atypical example
  - Noise (i.e., error) in the category label of a single training example
- A more robust alternative is to find the k most similar examples and return the majority category of these k examples
- The value of k is typically odd to avoid ties; 3 and 5 are the most common
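The majority-vote scheme above can be sketched as follows. This is a minimal illustration, not the slides' own code; the function and variable names, and the toy one-dimensional data, are invented for the example.

```python
from collections import Counter

def knn_classify(query, examples, similarity, k=3):
    """Classify `query` by majority vote among its k most similar examples.

    `examples` is a list of (instance, label) pairs; `similarity` is any
    function that returns a larger value for more similar instances.
    """
    # Rank training examples by similarity to the query, most similar first.
    ranked = sorted(examples, key=lambda ex: similarity(query, ex[0]), reverse=True)
    top_k_labels = [label for _, label in ranked[:k]]
    # Majority vote; an odd k avoids ties in binary classification.
    return Counter(top_k_labels).most_common(1)[0][0]

# Toy 1-D instance space: similarity = negative absolute distance.
data = [(1.0, "A"), (1.2, "A"), (5.0, "B"), (5.1, "B"), (1.1, "A")]
sim = lambda a, b: -abs(a - b)
print(knn_classify(1.05, data, sim, k=3))  # "A"
```

With k=1 this degrades to the plain nearest-neighbor rule from the previous slide, which is exactly why a single mislabeled point can flip the answer.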

5 Similarity Metrics

- The nearest-neighbor method depends on a similarity (or distance) metric
- Simplest for a continuous m-dimensional instance space: Euclidean distance
- Simplest for an m-dimensional binary instance space: Hamming distance (the number of feature values that differ)
- For text, cosine similarity of tf-idf weighted vectors is typically most effective
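The three metrics named above can each be written in a few lines. A minimal sketch (vectors are plain Python lists; no library assumed):

```python
import math

def euclidean(a, b):
    # Distance in a continuous m-dimensional space.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming(a, b):
    # Number of (binary) feature values that differ.
    return sum(x != y for x, y in zip(a, b))

def cosine_sim(a, b):
    # Cosine of the angle between two vectors; 1.0 = same direction.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

print(euclidean([0, 0], [3, 4]))      # 5.0
print(hamming([1, 0, 1], [1, 1, 0]))  # 2
print(cosine_sim([1, 1], [1, 0]))     # ~0.7071
```

Note that Euclidean and Hamming are distances (smaller = more similar) while cosine is a similarity (larger = more similar), so a kNN implementation must sort in the appropriate direction.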

6 Basic Automatic Text Processing

- Parse documents to recognize structure and meta-data
  - e.g., title, date, other fields, HTML tags, etc.
- Scan for word tokens
  - Lexical analysis to recognize keywords, numbers, special characters, etc.
- Stopword removal
  - Common words such as "the", "and", "or" which are not semantically meaningful in a document
- Stem words
  - Morphological processing to group word variants (e.g., "compute", "computer", "computing", "computes", … can be represented by a single stem "comput" in the index)
- Assign weights to words
  - Using frequency in documents and across documents
- Store the index
  - Stored in a Term-Document Matrix ("inverted index") which stores each document as a vector of keyword weights
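The tokenize / stopword / stem steps above can be sketched in a few lines. This is a deliberately crude illustration: the stopword list is a tiny sample, and the suffix-stripping `stem` function is a toy stand-in for a real stemmer such as Porter's.

```python
import re

STOPWORDS = {"the", "and", "or", "a", "an", "of", "in", "to", "is"}  # tiny illustrative list
SUFFIXES = ("ing", "es", "er", "s")  # toy stand-in for a real stemmer (e.g., Porter)

def stem(word):
    # Strip the first matching suffix; a rough approximation of morphological stemming.
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def process(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())      # lexical analysis / tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    return [stem(t) for t in tokens]                     # stemming

print(process("The computer is computing"))  # ['comput', 'comput']
```

Both surviving words collapse to the same stem "comput", which is exactly the conflation effect the slide describes.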

7 tf x idf Weights

- The tf x idf measure combines:
  - Term frequency (tf)
  - Inverse document frequency (idf): a way to deal with the problems of the Zipf distribution
- Recall the Zipf distribution
- We want to weight terms highly if they are:
  - Frequent in relevant documents … BUT
  - Infrequent in the collection as a whole
- Goal: assign a tf x idf weight to each term in each document

8 tf x idf

- Combined weight of term i in document j: w_ij = tf_ij x idf_i = tf_ij x log2(N / df_i)
- where tf_ij is the frequency of term i in document j, N is the total number of documents, and df_i is the number of documents containing term i

9 Inverse Document Frequency

- IDF provides high values for rare words and low values for common words

10 tf x idf Example

- [Table: the initial Term x Doc matrix (inverted index) — raw term frequencies for terms T1–T8 across Doc 1–Doc 6, with each term's document frequency df and idf = log2(N/df); numeric values not preserved in the transcript]
- [Table: the resulting tf x idf Term x Doc matrix, obtained by multiplying each term frequency by that term's idf]
- Documents represented as vectors of words
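The construction behind the example tables can be reproduced with a short sketch, using the same idf = log2(N/df) formula shown above. The function name and the toy corpus are illustrative, not from the slides.

```python
import math

def tfidf_matrix(docs):
    """docs: list of token lists. Returns (vocab, tf-idf vectors, one per doc)."""
    N = len(docs)
    vocab = sorted({t for d in docs for t in d})
    # df: number of documents containing each term.
    df = {t: sum(t in d for d in docs) for t in vocab}
    # idf = log2(N / df), as in the example table above.
    idf = {t: math.log2(N / df[t]) for t in vocab}
    # Each document becomes a vector: raw term frequency times the term's idf.
    return vocab, [[d.count(t) * idf[t] for t in vocab] for d in docs]

docs = [["apple", "fruit"], ["apple", "apple"], ["car"]]
vocab, m = tfidf_matrix(docs)
print(vocab)  # ['apple', 'car', 'fruit']
```

Here "apple" appears in 2 of 3 documents, so its idf is log2(3/2) ≈ 0.585, while "car" and "fruit" appear in only one document each and get the larger idf log2(3) ≈ 1.585 — rare terms are weighted highly, as the IDF slide says.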

11 K Nearest Neighbor for Text

Training:
  For each training example x ∈ D
    Compute the corresponding tf-idf vector, d_x, for document x

Test instance y:
  Compute the tf-idf vector d for document y
  For each x ∈ D
    Let s_x = cosSim(d, d_x)
  Sort the examples x in D by decreasing value of s_x
  Let N be the first k examples in D (the most similar neighbors)
  Return the majority class of the examples in N
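The pseudocode above can be rendered directly in Python. A minimal sketch, assuming the documents have already been converted to tf-idf vectors over a shared vocabulary; the toy vectors and category labels are invented for the example.

```python
import math
from collections import Counter

def cos_sim(a, b):
    # cosSim(d, d_x) from the pseudocode.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_text(query_vec, training, k=3):
    """training: list of (tfidf_vector, category) pairs.

    Returns the majority category of the k training documents most
    cosine-similar to query_vec.
    """
    # Sort by decreasing similarity s_x, then take the first k neighbors.
    ranked = sorted(training, key=lambda ex: cos_sim(query_vec, ex[0]), reverse=True)
    return Counter(cat for _, cat in ranked[:k]).most_common(1)[0][0]

# Toy tf-idf vectors over a 3-term vocabulary.
train = [([2.0, 0.0, 0.1], "sports"), ([1.5, 0.0, 0.0], "sports"),
         ([0.0, 3.0, 0.2], "politics"), ([0.1, 2.5, 0.0], "politics")]
print(knn_text([1.0, 0.1, 0.0], train, k=3))  # "sports"
```

Note that training really is "just storing": all the work — vector comparison, sorting, voting — happens at test time, which is why this family of methods is called lazy learning.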