Text Categorization Karl Rees Ling 580 April 2, 2001

What is Text Categorization? Assign a document to one or more predefined categories/classes, according to its topic or theme, based on its content. Humans can do this intuitively, but how do you teach a machine to classify a text?

Useful Applications Filter a stream of news for a particular interest group. Spam vs. Interesting mail. Determining Authorship. Poetry vs. Fiction vs. Essay, etc…

So, For Example: Ch. 16 of Foundations of Statistical Natural Language Processing. We want to create an agent that can give us the probability of this document belonging to certain categories: P(Poetry) = .015 P(Mathematics) = .198 P(Artificial Intelligence) = .732 P(Utterly Confusing) = .989

Training Sets Corpus of documents for which we already know the category. Essentially, we use this to teach a computer. Like showing a child a few pictures of a dog and a few pictures of a cat and then pointing to your neighbor's pet and asking her/him what kind of animal it is.

Data Representation Model Represent each object (document) in the training set in the form (x, c), where x is a vector of measurements and c is the class label. In other words, each document is represented as a vector of potentially weighted word counts.
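
As a concrete illustration (not from the original slides), here is a minimal sketch of turning raw documents into the (x, c) pairs described above, using plain word counts as the measurement vector; the toy documents, labels, and vocabulary size are assumptions for illustration only.

```python
from collections import Counter

def build_vocabulary(documents, vocab_size=20):
    """Collect the most frequent words across all training documents."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    return [word for word, _ in counts.most_common(vocab_size)]

def to_vector(text, vocabulary):
    """Represent one document as a vector of (optionally weighted) word counts."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Training set as (x, c) pairs: x = measurement vector, c = class label.
docs = ["profit rose 3 pct vs last year", "the poem sings of spring"]
labels = ["earnings", "poetry"]
vocab = build_vocabulary(docs)
training_set = [(to_vector(d, vocab), c) for d, c in zip(docs, labels)]
```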

Training Procedure A procedure/function that selects a classifier from a family of classifiers (the model class); the chosen classifier then assigns each document to a category. Typically there are two categories, c1 and c2, where c2 is NOT(c1). For example: g(x) = w * x + w0, where x is the vector of word counts, w * x is the dot product of x and a vector of weights (because we may attach more importance to certain words), and w0 is some threshold. Choose c1 for g(x) > 0, otherwise c2.
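
A minimal sketch of the linear classifier g(x) = w * x + w0 described above; the weights, threshold, and example counts are made-up values for illustration, not learned parameters from the slides.

```python
def g(x, w, w0):
    """Linear decision function: dot product of word-count vector x with weights w, plus threshold w0."""
    return sum(xi * wi for xi, wi in zip(x, w)) + w0

def classify(x, w, w0, c1="earnings", c2="not earnings"):
    """Choose c1 when g(x) > 0, otherwise c2."""
    return c1 if g(x, w, w0) > 0 else c2

# Hypothetical weights: 'profit' and 'cts' count in favour of earnings.
x = [2, 0, 1]          # counts for e.g. ["profit", "poem", "cts"]
w = [1.5, -2.0, 1.0]   # more important words get larger weights
w0 = -1.0
print(classify(x, w, w0))  # -> "earnings", since g(x) = 3.0 > 0
```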

Test Set After training the classifier, we want to test its accuracy on a test set. Accuracy = Number of Objects Correctly Classified / Number of Objects Examined. Precision = Number of Objects Correctly Assigned to a Specific Category / Number of Objects Assigned to that Category. Fallout = Number of Objects Incorrectly Assigned to a Category / Number of Objects NOT Belonging to that Category.
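
A small sketch of the evaluation measures defined above, computed directly from counts; the counts in the usage lines describe a hypothetical test run, not a real experiment.

```python
def accuracy(correct, examined):
    """Objects correctly classified / objects examined."""
    return correct / examined

def precision(correct_in_category, assigned_to_category):
    """Objects correctly assigned to a category / objects assigned to that category."""
    return correct_in_category / assigned_to_category

def fallout(incorrectly_assigned, not_in_category):
    """Objects incorrectly assigned to a category / objects not belonging to that category."""
    return incorrectly_assigned / not_in_category

# Hypothetical test run over 100 documents, 60 of which belong to 'earnings':
print(accuracy(correct=85, examined=100))                            # 0.85
print(precision(correct_in_category=55, assigned_to_category=65))    # ~0.846
print(fallout(incorrectly_assigned=10, not_in_category=40))          # 0.25
```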

Modeling How should texts be represented? Using all words leads to sparse statistics, but some words are indicative of a label. One approach: For each label, collect all words in texts with that label. Apply a mean square error test to determine whether a word occurs by chance in those texts. Sort all words by this score and take the top n (say 20). The idea is to select words that are correlated with a label. Example: for the label earnings, words such as "profit."
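
A rough sketch of the per-label word selection described above. The slides refer to a mean-square-error style test; here a simple chi-square association score is used as a stand-in scoring function, and the documents are invented for illustration.

```python
def chi_square(in_label_with_word, in_label, with_word, total):
    """2x2 chi-square statistic measuring association between a word and a label."""
    a = in_label_with_word
    b = with_word - a       # word present, other label
    c = in_label - a        # word absent, this label
    d = total - a - b - c   # word absent, other label
    num = total * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

def top_words(docs, labels, target_label, n=20):
    """Rank words by association with target_label and keep the top n."""
    total = len(docs)
    in_label = sum(1 for l in labels if l == target_label)
    doc_words = [set(d.lower().split()) for d in docs]
    vocab = set().union(*doc_words)
    scores = {}
    for w in vocab:
        with_word = sum(1 for dw in doc_words if w in dw)
        in_label_with_word = sum(1 for dw, l in zip(doc_words, labels)
                                 if w in dw and l == target_label)
        scores[w] = chi_square(in_label_with_word, in_label, with_word, total)
    return sorted(scores, key=scores.get, reverse=True)[:n]

docs = ["profit rose vs last year", "net profit 2 mln dlrs", "a poem about spring"]
labels = ["earnings", "earnings", "poetry"]
print(top_words(docs, labels, "earnings", n=5))
```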

Reuters Collection A common dataset in text classification is the Reuters collection: Articles categorized into about 100 topics, split into training examples and 3,299 test examples. Short texts, annotated with SGML. Available: 78.html

26-FEB :18:59.34 earn
COBANCO INC <CBCO> YEAR NET
SANTA CRUZ, Calif., Feb 26 -
Shr 34 cts vs 1.19 dlrs
Net 807,000 vs 2,858,000
Assets mln vs mln
Deposits mln vs mln
Loans mln vs mln
Note: 4th qtr not available. Year includes 1985 extraordinary gain from tax carry forward of 132,000 dlrs, or five cts per shr.
Reuter

For Reuters, Label = Earnings: Format of the document vector file. Each entry consists of 25 lines:
1. the document id
2. is the document in the training set (T) or in the evaluation set (E)?
3. is the document in the core training set (C) or in the validation set (V)? (X where this doesn't apply.)
4. is the document in the earnings category (Y) or not (N)?
5. feature weight for "vs"
6. feature weight for "mln"
7. feature weight for "cts"
8. feature weight for ";"
9. feature weight for "&"
10. feature weight for "000"

For Reuters, Label = Earnings
11. feature weight for "loss"
12. feature weight for "'"
13. feature weight for '"'
14. feature weight for "3"
15. feature weight for "profit"
16. feature weight for "dlrs"
17. feature weight for "1"
18. feature weight for "pct"
19. feature weight for "is"
20. feature weight for "s"
21. feature weight for "that"
22. feature weight for "net"
23. feature weight for "lt"
24. feature weight for "at"
25. semicolon (separator between entries)

Vector For Example Document: { docid 11 T C Y ; }
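
Given the 25-line-per-entry format described above, here is a small parsing sketch. The exact whitespace handling is an assumption, and the 20 feature words simply follow the list above (with item 13 taken to be the double-quote character).

```python
FEATURE_WORDS = ["vs", "mln", "cts", ";", "&", "000", "loss", "'", '"', "3",
                 "profit", "dlrs", "1", "pct", "is", "s", "that", "net", "lt", "at"]

def parse_entries(lines):
    """Split the flat file into 25-line entries and unpack each one."""
    entries = []
    for i in range(0, len(lines), 25):
        block = [l.strip() for l in lines[i:i + 25]]
        if len(block) < 25:
            break
        entries.append({
            "docid": block[0],
            "split": block[1],               # T = training, E = evaluation
            "core_or_validation": block[2],  # C, V, or X
            "is_earnings": block[3] == "Y",
            "weights": dict(zip(FEATURE_WORDS, map(float, block[4:24]))),
            # block[24] is the ';' separator between entries
        })
    return entries
```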

Classification Techniques: Decision Trees, Maximum Entropy Modeling, Perceptrons (Neural Networks), K-Nearest Neighbor Classification (kNN), Naïve Bayes, Support Vector Machines.

Decision Trees

Information Measure of how much we “know” about an object, document, decision, etc… At each successive node we have more information about the object’s classification.

Information p = Number of objects in a set that belong to a certain category. n = Number of objects in the set that don't belong to that category. I = a measure of the amount of information we still need in order to classify an object drawn from that set. I( p/(p+n), n/(p+n) ) = -[p/(p+n)] log2[p/(p+n)] - [n/(p+n)] log2[n/(p+n)]
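
A small sketch of the information measure I defined above, computed directly from the counts p and n; the example calls are illustrative.

```python
from math import log2

def information(p, n):
    """I(p/(p+n), n/(p+n)): bits needed to classify an object from a set
    with p positive and n negative examples."""
    total = p + n
    if total == 0:
        return 0.0
    result = 0.0
    for count in (p, n):
        q = count / total
        if q > 0:
            result -= q * log2(q)
    return result

print(information(5, 5))   # 1.0 bit: maximally uncertain split
print(information(10, 0))  # 0.0 bits: the set is pure
```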

Information Gain The amount of Information we gain from making a decision. Each decision we make will give us two new sets, each with its own distinct Information value. Taken together, these sets should tell us more about an object's classification than the previous set did, so we build our tree based on Information Gain: those decisions with the highest gain come first.

Information Gain Gain(A) = I( p/(p+n), n/(p+n) ) - Remainder(A). A is the attribute (decision) being tested. Remainder(A) is the weighted average of the information in the v resulting subsets, i = 1…v: Remainder(A) = sum over i = 1…v of [(pi + ni)/(p + n)] * I( pi/(pi+ni), ni/(pi+ni) ), where pi and ni are the positive and negative counts in subset i.
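
A self-contained sketch of Gain(A): the information of the parent set minus the weighted (remainder) information of the v subsets produced by testing attribute A. It repeats the information() helper from the previous sketch, and the subset counts in the example are invented.

```python
from math import log2

def information(p, n):
    """I(p/(p+n), n/(p+n)) as defined on the previous slide."""
    total = p + n
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in (p, n) if c > 0)

def gain(p, n, subsets):
    """Gain(A) = I(p/(p+n), n/(p+n)) - Remainder(A).

    `subsets` is a list of (p_i, n_i) pairs, one per branch of the split on A."""
    remainder = sum(((p_i + n_i) / (p + n)) * information(p_i, n_i)
                    for p_i, n_i in subsets)
    return information(p, n) - remainder

# Splitting 6 earnings / 6 non-earnings documents on a hypothetical word test:
print(gain(6, 6, [(5, 1), (1, 5)]))  # large gain: the split separates well
print(gain(6, 6, [(3, 3), (3, 3)]))  # zero gain: the split is uninformative
```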

Decision Trees At the bottom of our trees, we have leaf nodes. At each of these nodes, we compute the percentage of objects belonging to the node that fit into the category we are looking at. If it is greater than a certain percentage (say 50%), we say that all documents that fit into this node are in this category. Hopefully, though, the tree will give us more confidence than 50%.

Pruning After growing a tree, we want to prune it down to a smaller size. We may want to get rid of nodes/decisions that don't contribute any significant information (possibly nodes 6 and 7 in our example). We also want to get rid of decisions that are based on possibly insignificant details; these "overfit" the training set. For example, if there is only one document in the set that has both dlrs and pct, and this is in the earnings category, it would probably be a mistake to assume that all such documents are in earnings.

Decision Trees

Bagging / Boosting Obviously, there are many different ways to prune. Also, there are many other algorithms besides Information Gain for growing a decision tree. Bagging and Boosting are ensemble methods: generate many decision trees and combine (average or vote on) the results of running an object through each of these trees, as sketched below.
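
A rough sketch of the bagging idea just described: train several trees on bootstrap resamples of the training set and combine their votes. The base learner is passed in as a function; the toy `train_tree` stub and the example data are assumptions, not a real decision-tree grower.

```python
import random
from collections import Counter

def bagging(train_set, train_tree, num_trees=25, seed=0):
    """Train num_trees classifiers, each on a bootstrap resample of train_set.

    `train_tree` is any function that takes a list of (x, label) pairs and
    returns a classifier, i.e. a function from x to a label."""
    rng = random.Random(seed)
    trees = []
    for _ in range(num_trees):
        sample = [rng.choice(train_set) for _ in train_set]  # sample with replacement
        trees.append(train_tree(sample))
    return trees

def vote(trees, x):
    """Combine the ensemble by majority vote on a new object x."""
    predictions = Counter(tree(x) for tree in trees)
    return predictions.most_common(1)[0][0]

# Toy stand-in for a tree learner: predict the sample's majority label.
def train_tree(sample):
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: majority

train_set = [([2, 0], "earnings"), ([1, 1], "earnings"), ([0, 3], "poetry")]
ensemble = bagging(train_set, train_tree, num_trees=5)
print(vote(ensemble, [1, 0]))
```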

Decision Trees

Maximum Entropy Modeling Consult the slides at cs46520/ppframe.htm for more information about Maximum Entropy.

Perceptrons

kNN Basic idea: Keep the training set in memory. Define a similarity metric. At classification time, match the unseen example against all examples in memory. Select the k best matches. Predict the unseen example's label as the majority label of the k retrieved examples. One common similarity metric for text is cosine similarity between document vectors (see the sketch below).
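
A minimal kNN sketch following the steps above. Cosine similarity is used here as the example metric (an assumption, since the original slide's formula is not reproduced), and the "memory" is simply a Python list of stored training examples.

```python
from collections import Counter
from math import sqrt

def cosine(x, y):
    """Similarity between two word-count vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def knn_classify(memory, x, k=3):
    """memory: list of (vector, label) pairs kept verbatim from training.
    Match x against every stored example, keep the k best, vote on the label."""
    neighbours = sorted(memory, key=lambda ex: cosine(x, ex[0]), reverse=True)[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

memory = [([2, 0, 1], "earnings"), ([1, 0, 2], "earnings"), ([0, 3, 0], "poetry")]
print(knn_classify(memory, [1, 0, 1], k=3))  # -> "earnings"
```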

kNN Many variants on kNN. The underlying idea is that abstraction (rules, parameters, etc.) is likely to lose information. No abstraction; this is case-based reasoning. Training is fast, but testing can be slow. Potentially large memory requirements.

Naïve Bayes Assumption that all features are (conditionally) independent of each other given the label: P(A1, …, An | l) = P(A1 | l) * P(A2 | l) * … * P(An | l). Here A is a document consisting of features A1 … An, and l is the document label. Fast training, fast evaluation. Good when the features really are (close to) independent.
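
A small multinomial-style Naïve Bayes sketch of the independence assumption above, using words as features. The add-one smoothing and the toy documents are assumptions for illustration, not part of the original slides.

```python
from collections import Counter, defaultdict
from math import log

def train_nb(docs, labels):
    """Estimate P(l) and P(word | l) counts, for use with add-one smoothing."""
    label_counts = Counter(labels)
    word_counts = defaultdict(Counter)   # per-label word frequencies
    vocab = set()
    for text, l in zip(docs, labels):
        words = text.lower().split()
        word_counts[l].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def classify_nb(text, label_counts, word_counts, vocab):
    """Pick the label maximising log P(l) + sum_i log P(A_i | l)."""
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for l in label_counts:
        score = log(label_counts[l] / total_docs)
        total_words = sum(word_counts[l].values())
        for w in text.lower().split():
            score += log((word_counts[l][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = l, score
    return best_label

docs = ["net profit rose 3 pct", "loss of 2 mln dlrs", "a quiet poem of spring"]
labels = ["earnings", "earnings", "poetry"]
model = train_nb(docs, labels)
print(classify_nb("profit of 4 mln dlrs", *model))  # -> "earnings"
```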

Support Vector Machines SVMs are an interesting new classifier: Similar to kNN. Similar(ish) to maxent. Idea: Transform examples into a new space where they can be linearly separated. Group examples into regions that all share the same label. Base the grouping on the training items (support vectors) that lie on the boundary. The best grouping (the maximum-margin separator) is found automatically.
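
The slide only sketches the intuition. As one hedged illustration of a linear SVM in practice, here is a scikit-learn sketch; it assumes scikit-learn is available, uses toy stand-in documents rather than the Reuters collection, and is not the setup used in the original slides.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["net profit rose 3 pct vs last year",
        "loss of 2 mln dlrs in the quarter",
        "a quiet poem about spring rain",
        "verses on love and the sea"]
labels = ["earnings", "earnings", "poetry", "poetry"]

vectorizer = CountVectorizer()   # documents -> word-count vectors
X = vectorizer.fit_transform(docs)
classifier = LinearSVC()         # finds a separating hyperplane (maximum margin)
classifier.fit(X, labels)

print(classifier.predict(vectorizer.transform(["profit of 4 mln dlrs"])))
```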