Categorization/Classification

Presentation transcript:

Categorization/Classification Given: A description of an instance, x ∈ X, where X is the instance language or instance space. A fixed set of classes: C = {c1, c2, …, cJ}. Determine: The category of x: c(x) ∈ C, where c(x) is a classification function whose domain is X and whose range is C. We want to know how to build classification functions (“classifiers”).

More Text Classification Examples: Many search engine functionalities use classification. Assign labels to each document or web page: Labels are most often topics such as Yahoo categories, e.g., "finance", "sports", "news>world>asia>business". Labels may be opinion on a person/product, e.g., "like", "hate", "neutral". Labels may be domain-specific, e.g., "interesting-to-me" : "not-interesting-to-me", or "contains adult language" : "doesn't".

Classification Methods Manual classification Used by Yahoo! (originally) Very accurate when job is done by experts Consistent when the problem size and team are small Difficult and expensive to scale Means we need automatic classification methods for big problems

Classification Methods Supervised learning of a document-label assignment function Many systems partly rely on machine learning (MSN, Verity, Yahoo!, …) k-Nearest Neighbors (simple, powerful) Naive Bayes (simple, common method) … plus many other methods No free lunch: requires hand-classified training data Note that many commercial systems use a mixture of methods

Classification—A Two-Step Process Model construction: describing a set of predetermined classes Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute The set of tuples used for model construction is the training set The model is represented as classification rules, decision trees, or mathematical formulae

Classification—A Two-Step Process Model usage: for classifying future or unknown objects Estimate accuracy of the model The known label of each test sample is compared with the classified result from the model Accuracy rate is the percentage of test set samples that are correctly classified by the model Test set is independent of training set; otherwise over-fitting goes undetected and the accuracy estimate is too optimistic If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Process (1): Model Construction Training Data → Classification Algorithms → Classifier (Model): IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Process (2): Using the Model in Prediction Testing Data → Classifier (to estimate accuracy) Unseen Data (Jeff, Professor, 4) → Classifier → Tenured?
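The rule from the model-construction slide can be applied to the unseen tuple directly. The sketch below is only an illustration: the function name `predict_tenured` and the case-insensitive comparison are assumptions, not part of the slides.

```python
def predict_tenured(rank, years):
    """Apply the learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


# Unseen tuple from the slide: (name, rank, years)
name, rank, years = ("Jeff", "Professor", 4)
print(name, "tenured?", predict_tenured(rank, years))
```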

Supervised vs. Unsupervised Learning Supervised learning (classification) Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations New data is classified based on the training set Unsupervised learning (clustering) The class labels of the training data are unknown Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

The goal of the course Study supervised learning, especially for text and hypertext documents. Text has a very large number of potential features, of which many are irrelevant. If the vector space model is used, each term is a potential feature. The number of distinct class labels is much larger than in structured learning scenarios.

Topics included in the course Evaluating text classifiers Classifiers NN learners Bayesian learners Hypertext classification Feature selection methods

Evaluating text classifiers Accuracy The ability to predict the correct class labels This is based on comparing the classifier-assigned labels with human-assigned labels Speed time to construct the model (training time) time to use the model (classification/prediction time) Simplicity, speed, and scalability for document insertion, deletion and modification Scalability: efficiency in disk-resident databases Interpretability understanding and insight provided by the model

Benchmarks
Corpus    Labeled documents    Number of terms    Number of categories
Reuters   10700                30000              135
20NG      18800                94000              20
WebKB     8300                 (not given)        7

Measures of accuracy Each document is associated with a subset of classes To avoid searching over the power set of class labels, many systems create a two-class problem for every class Two-way ensemble or one-vs.-rest technique Ensemble classifiers are evaluated on the basis of recall and precision

Classifier Accuracy Measures
              C1 (guess)        ~C1 (guess)
C1 (true)     True positive     False negative
~C1 (true)    False positive    True negative

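A combined measure: F Combined measure that assesses the precision/recall tradeoff is the F measure (weighted harmonic mean): $F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$, where $P$ is precision, $R$ is recall, and $\beta^2 = (1 - \alpha)/\alpha$. People usually use the balanced F1 measure, i.e., with $\beta = 1$ or $\alpha = 1/2$.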

F: Example precision? recall? F1?

F: Why harmonic mean? The simple (arithmetic) mean is 50% for “return-everything” search engine, which is too high. Desideratum: Punish really bad performance on either precision or recall. Taking the minimum achieves this. But minimum is not smooth and hard to weight. F (harmonic mean) is a kind of smooth minimum.
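A small sketch of the F measure and of the "return everything" argument. It uses the standard definition F = (β² + 1)PR / (β²P + R); the confusion-matrix counts and the tiny precision value are hypothetical numbers chosen only for illustration.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from confusion-matrix counts for one class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (b2 + 1) * precision * recall / denom if denom else 0.0


# Hypothetical counts: 40 true positives, 10 false positives, 60 false negatives.
p, r = precision_recall(tp=40, fp=10, fn=60)
f1 = f_measure(p, r)

# "Return everything" engine: perfect recall, (assumed) tiny precision.
p_all, r_all = 0.0001, 1.0
arithmetic_mean = (p_all + r_all) / 2   # about 0.5, misleadingly high
f1_all = f_measure(p_all, r_all)        # close to 0, dominated by the bad precision
print(f1, arithmetic_mean, f1_all)
```

The harmonic mean stays close to the smaller of the two values, which is the "smooth minimum" behaviour described on the slide.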

Nearest Neighbor Learner Basic idea Similar documents are expected to be assigned the same class label. Vector space model and cosine measure for similarity let us formalize the idea.

The k-Nearest Neighbor Algorithm All instances correspond to points in the n-D space. The nearest neighbors are defined in terms of cosine similarity. k-NN returns the most common value among the k training examples nearest to xq. Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples. [Figure: query point xq among positive (+) and negative (_) training examples]

Discussion on the k-NN Algorithm Distance-weighted nearest neighbor algorithm Weight the contribution of each of the k neighbors according to their distance to the query xq Give greater weight to closer neighbors Robust to noisy data by averaging k-nearest neighbors Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes To overcome it, eliminate the least relevant attributes
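A minimal sketch of k-NN over sparse term vectors with cosine similarity, with optional similarity-weighted voting. Using the similarity itself as the vote weight is one possible weighting scheme, not necessarily the one the slides intend, and the tiny training set is invented for illustration.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(query, training, k=3, weighted=True):
    """training is a list of (vector, label); returns the winning label."""
    scored = sorted(((cosine(query, vec), label) for vec, label in training),
                    key=lambda pair: pair[0], reverse=True)[:k]
    votes = defaultdict(float)
    for sim, label in scored:
        # Closer (more similar) neighbors get a larger say when weighted=True.
        votes[label] += sim if weighted else 1.0
    return max(votes, key=votes.get)


# Hypothetical toy training set (term -> count).
training = [
    ({"ball": 2, "goal": 1}, "sports"),
    ({"goal": 1, "team": 2}, "sports"),
    ({"stock": 2, "market": 1}, "finance"),
    ({"market": 2, "bank": 1}, "finance"),
]
print(knn_classify({"team": 1, "goal": 1}, training))
```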

Bayesian Methods Learning and classification methods based on probability theory. Bayes theorem plays a critical role in probabilistic learning and classification. Build a generative model that approximates how data is produced Uses prior probability of each category given no information about an item. Categorization produces a posterior probability distribution over the possible categories given a description of an item.

Bayes’ Rule
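Stated for a class cj and an item description D (the symbol names are a choice made here to match the neighbouring slides), Bayes' rule reads:

```latex
P(c_j \mid D) = \frac{P(D \mid c_j)\, P(c_j)}{P(D)}
```

Because P(D) is the same for every class, comparing classes only requires the numerator P(D | cj) P(cj).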

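Naive Bayes Classifiers Task: Classify a new instance D based on a tuple of attribute values D = ⟨x1, x2, …, xn⟩ into one of the classes cj ∈ C: $c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \ldots, x_n) = \arg\max_{c_j \in C} P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)$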

Naïve Bayes Assumption P(cj): Can be estimated from the frequency of classes in the training examples. P(x1, x2, …, xn | cj): O(|X|^n · |C|) parameters; could only be estimated if a very, very large number of training examples was available. Naïve Bayes Conditional Independence Assumption: Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi | cj).

The Naïve Bayes Classifier [Figure: class node Flu with feature nodes X1 to X5: fever, sinus, cough, runny nose, muscle ache] Conditional Independence Assumption: features detect term presence and are independent of each other given the class: $P(X_1, \ldots, X_5 \mid C) = P(X_1 \mid C) \cdot P(X_2 \mid C) \cdots P(X_5 \mid C)$. This model is appropriate for binary variables: the multivariate Bernoulli model.

Learning the Model [Figure: class node C with feature nodes X1 to X6] First attempt: maximum likelihood estimates; simply use the frequencies in the data: $\hat{P}(c_j) = \frac{N(C = c_j)}{N}$ and $\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j)}{N(C = c_j)}$

Naïve Bayesian Classifier: Training Dataset Classes: C1: buys_computer = ‘yes’; C2: buys_computer = ‘no’. Data sample: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

An Example
P(Ci):
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357
Compute P(X|Ci) for each class:
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|Ci):
P(X | buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci) * P(Ci):
P(X | buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X | buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
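The arithmetic of this example can be checked with a few lines of Python. The fractions are the ones given on the slide; the dictionary layout and key names are just one convenient way to hold them.

```python
from math import prod

# Class priors and per-attribute likelihoods, copied from the slide.
priors = {"yes": 9 / 14, "no": 5 / 14}
likelihoods = {
    "yes": {"age<=30": 2 / 9, "income=medium": 4 / 9,
            "student=yes": 6 / 9, "credit_rating=fair": 6 / 9},
    "no": {"age<=30": 3 / 5, "income=medium": 2 / 5,
           "student=yes": 1 / 5, "credit_rating=fair": 2 / 5},
}
x = ["age<=30", "income=medium", "student=yes", "credit_rating=fair"]

# Naive Bayes score per class: P(Ci) * product of P(attribute | Ci).
scores = {c: priors[c] * prod(likelihoods[c][a] for a in x) for c in priors}
prediction = max(scores, key=scores.get)
print({c: round(s, 3) for c, s in scores.items()}, prediction)
```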

Problem with Max Likelihood [Figure: class node Flu with feature nodes X1 to X5: fever, sinus, cough, runny nose, muscle ache] What if we have seen no training cases where the patient had no flu and muscle aches? Zero probabilities cannot be conditioned away, no matter the other evidence!

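Smoothing to Avoid Overfitting The estimate is 0 because of sparseness: the training data are never large enough to represent the frequency of rare events adequately. To eliminate zeros, we use add-one or Laplace smoothing: $\hat{P}(x_k \mid c_j) = \frac{n_k + 1}{n + |V|}$, where $n_k$ is the number of occurrences of word $x_k$ in the training text of class $c_j$, $n$ is the total number of word occurrences in that text, and $|V|$ is the number of terms in the vocabulary.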

Underflow Prevention: log space Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow. Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities. Class with highest final un-normalized log probability score is still the most probable.
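A quick demonstration of the underflow problem and the log-space fix. The per-word probability of 1e-5 and the document length of 100 are hypothetical values picked to trigger underflow in double precision.

```python
import math

# Hypothetical: 100 word likelihoods of 1e-5 each for one class.
probs = [1e-5] * 100

# Direct multiplication underflows to exactly 0.0 (true value is 1e-500).
naive = 1.0
for p in probs:
    naive *= p

# Summing logs keeps a usable (un-normalized) score.
log_score = sum(math.log(p) for p in probs)

print(naive, log_score)
```

Since the logarithm is monotonic, the class with the largest summed log score is the same class that would win on the raw product.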

Two Models Model 1: Multivariate Bernoulli One feature Xw for each word in dictionary Xw = true in document d if w appears in d Naive Bayes assumption: Given the document’s topic, appearance of one word in the document tells us nothing about chances that another word appears

Example

Text classification example (multivariate Bernoulli model)

Two Models Model 2: Multinomial = Class conditional unigram One feature Xi for each word position in the document; the feature's values are all words in the dictionary Value of Xi is the word in position i Naïve Bayes assumption: Given the document’s topic, the word in one position in the document tells us nothing about words in other positions Second assumption: Word appearance does not depend on position: $P(X_i = w \mid c) = P(X_j = w \mid c)$ for all positions i, j, word w, and class c

Parameter estimation Multivariate Bernoulli model: $\hat{P}(X_w = \text{true} \mid c_j)$ = fraction of documents of topic cj in which word w appears. Multinomial model: $\hat{P}(X_i = w \mid c_j)$ = fraction of times in which word w appears across all documents of topic cj. Can create a mega-document for topic j by concatenating all documents in this topic, then use the frequency of w in the mega-document.

Naïve Bayes: Learning
From training corpus, extract Vocabulary
Calculate required P(cj) and P(xk | cj) terms
For each cj in C do
    docsj ← subset of documents for which the target class is cj
    P(cj) ← |docsj| / |total number of documents|
    Textj ← single document containing all docsj
    n ← total number of word positions in Textj
    for each word xk in Vocabulary
        nk ← number of occurrences of xk in Textj
        P(xk | cj) ← (nk + 1) / (n + |Vocabulary|)

Naïve Bayes: Classifying positions ← all word positions in current document which contain tokens found in Vocabulary Return cNB, where $c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i \in positions} P(x_i \mid c_j)$

Text classification example (multinomial model)
P(c) = 3/4, P(~c) = 1/4
P(Chinese|c) = (5+1)/(8+6) = 3/7
P(Tokyo|c) = P(Japan|c) = (0+1)/(8+6) = 1/14
P(Chinese|~c) = (1+1)/(3+6) = 2/9
P(Tokyo|~c) = P(Japan|~c) = (1+1)/(3+6) = 2/9
For test document d5:
c: (3/4) * (3/7)^3 * (1/14) * (1/14) ≈ 0.0003
~c: (1/4) * (2/9)^3 * (2/9) * (2/9) ≈ 0.0001
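The learning and classification steps for the multinomial model can be sketched as follows, using add-one smoothing and log-space scoring. The transcript does not include the training table for this example, so the four training documents and the test document d5 below are an assumption: they are the well-known toy collection from Manning, Raghavan and Schütze, which yields exactly the counts used above (8 tokens in c, 3 in ~c, 6 vocabulary terms). Function names are illustrative.

```python
import math
from collections import Counter


def train_multinomial_nb(docs):
    """docs: list of (tokens, label). Returns priors, smoothed P(word|class), vocabulary."""
    vocab = {t for tokens, _ in docs for t in tokens}
    label_counts = Counter(label for _, label in docs)
    priors = {c: n / len(docs) for c, n in label_counts.items()}
    cond = {}
    for c in label_counts:
        # "Mega-document" counts: every token of every document in class c.
        counts = Counter(t for tokens, label in docs if label == c for t in tokens)
        total = sum(counts.values())
        # Add-one (Laplace) smoothing over the vocabulary.
        cond[c] = {t: (counts[t] + 1) / (total + len(vocab)) for t in vocab}
    return priors, cond, vocab


def classify(tokens, priors, cond, vocab):
    """Return (best class, log scores); tokens outside the vocabulary are skipped."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for t in tokens:
            if t in vocab:
                score += math.log(cond[c][t])
        scores[c] = score
    return max(scores, key=scores.get), scores


# Assumed training set (not shown in the transcript).
docs = [
    ("chinese beijing chinese".split(), "c"),
    ("chinese chinese shanghai".split(), "c"),
    ("chinese macao".split(), "c"),
    ("tokyo japan chinese".split(), "not_c"),
]
d5 = "chinese chinese chinese tokyo japan".split()

priors, cond, vocab = train_multinomial_nb(docs)
label, scores = classify(d5, priors, cond, vocab)
print(label, {c: math.exp(s) for c, s in scores.items()})
```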

Stochastic Language Models Models probability of generating strings (each word in turn) in the language (commonly all strings over ∑). E.g., unigram model:
Model M: 0.2 the, 0.1 a, 0.01 man, 0.01 woman, 0.03 said, 0.02 likes, …
s = the man likes the woman
P(s | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008

Stochastic Language Models Model probability of generating any string
Word        Model M1    Model M2
the         0.2         0.2
class       0.01        0.0001
sayst       0.0001      0.03
pleaseth    0.0001      0.02
yon         0.0001      0.1
maiden      0.0005      0.01
woman       0.01        0.0001
s = the class pleaseth yon maiden
P(s|M1) = 0.2 × 0.01 × 0.0001 × 0.0001 × 0.0005
P(s|M2) = 0.2 × 0.0001 × 0.02 × 0.1 × 0.01
P(s|M2) > P(s|M1)
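Under a unigram model the probability of a string is just the product of its word probabilities, which a short sketch can confirm for both slides. The word probabilities are taken from the slides; the assignment of the two probability columns to M1 and M2 is read so that the stated conclusion P(s|M2) > P(s|M1) holds, and the helper name is illustrative.

```python
from math import isclose, prod


def string_probability(words, model):
    """Unigram probability of a word sequence: product of per-word probabilities."""
    return prod(model[w] for w in words)


M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}
p = string_probability("the man likes the woman".split(), M)

M1 = {"the": 0.2, "class": 0.01, "sayst": 0.0001, "pleaseth": 0.0001,
      "yon": 0.0001, "maiden": 0.0005, "woman": 0.01}
M2 = {"the": 0.2, "class": 0.0001, "sayst": 0.03, "pleaseth": 0.02,
      "yon": 0.1, "maiden": 0.01, "woman": 0.0001}
s = "the class pleaseth yon maiden".split()
p1, p2 = string_probability(s, M1), string_probability(s, M2)
print(p, p1, p2)
```

Treating each class as its own language model and picking the model that gives the document the highest probability is the same computation as multinomial Naive Bayes without the class prior.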

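Unigram and higher-order models
Chain rule: $P(w_1 w_2 w_3 w_4) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2)\, P(w_4 \mid w_1 w_2 w_3)$
Unigram Language Models: $P(w_1)\, P(w_2)\, P(w_3)\, P(w_4)$ Easy. Effective!
Bigram (generally, n-gram) Language Models: $P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2)\, P(w_4 \mid w_3)$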

WebKB Experiment (1998) Classify webpages from CS departments into: student, faculty, course, project. Train on ~5,000 hand-labeled web pages (Cornell, Washington, U.Texas, Wisconsin). Crawl and classify a new site (CMU). Results: [not reproduced in transcript]

NB Model Comparison: WebKB

Classification Multinomial vs Multivariate Bernoulli Multinomial model is almost always more effective in text applications!

Naive Bayes is Not So Naive Naïve Bayes took first and second place in the KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms. Goal: a direct mail response prediction model for the financial services industry: predict whether the recipient of mail will actually respond to the advertisement (750,000 records). Robust to irrelevant features: irrelevant features cancel each other without affecting results, whereas decision trees can suffer heavily from them. Very good in domains with many equally important features: decision trees suffer from fragmentation in such cases, especially with little data. A good, dependable baseline for text classification (but not the best)! Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes optimal classifier for the problem. Very fast: learning takes one pass of counting over the data; testing is linear in the number of attributes and the document collection size. Low storage requirements.

Hypertext classification Search engines assign heuristic weights to terms that occur in specific HTML tags Paying special attention to tags can help with supervised learning as well

Hypertext classification It is important to distinguish between the two occurrences of the word “surfing” resume.publication.title.surfing resume.hobbies.item.surfing Relations provide a uniform way to codify hypertextual features. Ex: contains-text(resume.hobbies.item, wind-surfing) Ex: links-to(source, destination)

Rule Induction

Rule Induction The outer loop learns new rules one at a time, removing positive examples covered by any rule generated thus far. When a new empty rule is initialized, its free variables can be bound in all possible ways The inner loop adds conjunctive literals to the new rule until no negative example is covered by the new rule. A heuristic is to pick a literal that rapidly increases the ratio of surviving positive to negative bindings.

Feature Selection: Why? Text collections have a large number of features: 10,000 to 1,000,000 unique words … and more. May make using a particular classifier feasible: some classifiers can’t deal with 100,000s of features. Reduces training time: training time for some methods is quadratic or worse in the number of features. Can improve generalization (performance): eliminates noise features and avoids overfitting.

Feature selection: how? An easy one: ignore terms that are “too frequent” or “too rare” according to an empirically chosen threshold. General idea: hypothesis testing statistics: are we confident that the value of one categorical variable is associated with the value of another? Chi-square test
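The "easy" frequency-based filter can be sketched with document frequencies. The thresholds `min_df` and `max_df_ratio`, their default values, and the toy documents are all assumptions made for illustration; in practice they would be chosen empirically, as the slide says.

```python
from collections import Counter


def select_by_document_frequency(docs, min_df=2, max_df_ratio=0.8):
    """Keep terms that appear in at least min_df documents and in at most
    max_df_ratio of all documents."""
    df = Counter(t for tokens in docs for t in set(tokens))
    n = len(docs)
    return {t for t, count in df.items()
            if count >= min_df and count / n <= max_df_ratio}


# Hypothetical tokenized documents.
docs = [
    ["the", "jaguar", "car"],
    ["the", "jaguar", "cat"],
    ["the", "stock", "market"],
    ["the", "market", "crash"],
    ["the", "car", "race"],
]
print(sorted(select_by_document_frequency(docs)))
```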

χ² statistic (CHI) χ² is interested in $\sum (f_o - f_e)^2 / f_e$ summed over all table entries: is the observed number what you’d expect given the marginals?
Observed f_o (expected f_e):
                Term = jaguar    Term ≠ jaguar
Class = auto    2 (0.25)         500 (502)
Class ≠ auto    3 (4.75)         9500 (9498)
The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the value for .999 confidence).

χ² statistic (CHI) There is a simpler formula for 2x2 χ²: $\chi^2(t, c) = \frac{N (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$, where A = #(t, c), B = #(t, ¬c), C = #(¬t, c), D = #(¬t, ¬c), and N = A + B + C + D.
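Both ways of computing the statistic can be checked against the jaguar/auto counts from the previous slide. The function names are illustrative. With unrounded expected counts the value comes out near 12.85; the 12.9 quoted on the slide follows from the rounded expected counts shown in the table. Either way it exceeds the 10.83 threshold.

```python
def chi_square_2x2(a, b, c, d):
    """Shortcut formula with a=#(t,c), b=#(t,not c), c=#(not t,c), d=#(not t,not c)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))


def chi_square_from_table(table):
    """General form: sum of (observed - expected)^2 / expected over all cells,
    with expected counts computed from the row and column marginals."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            total += (observed - expected) ** 2 / expected
    return total


# Rows: term = jaguar / term != jaguar; columns: class = auto / class != auto.
observed = [[2, 3], [500, 9500]]
chi_shortcut = chi_square_2x2(2, 3, 500, 9500)
chi_general = chi_square_from_table(observed)
print(round(chi_shortcut, 2), round(chi_general, 2))
```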

Feature Selection Chi-square Statistical foundation May select very slightly informative frequent terms that are not very useful for classification