1
Partially Supervised Classification of Text Documents
Authors: Bing Liu, Wee Sun Lee, Philip S. Yu, Xiaoli Li
Presented by: Swetha Nandyala
Hi everyone, I am Swetha, and today I will be talking about Partially Supervised Classification of Text Documents, a paper by Bing Liu, Wee Sun Lee, Philip S. Yu, and Xiaoli Li.
CIS 525: Neural Computation
2
Overview
Introduction
Theoretical foundations
Background methodology: the naive Bayesian classifier (NB-C) and the EM algorithm
Proposed strategy
Evaluation measures and experiments
Conclusion
In this talk I will cover each of these topics in turn.
3
Text Categorization
"... the activity of labeling natural language texts with thematic categories from a pre-defined set" [Sebastiani, 2002]
Text categorization is the task of automatically assigning to a text document d from a given domain D a category label c selected from a predefined set of category labels C.
(Diagram: documents from the domain D pass through the categorization system, which assigns each one a label c1, c2, ..., ck from the set C.)
Say we have a document domain D and a set of category labels C; text categorization assigns a category label to each document in D. Learning algorithms are then applied to this set of labeled documents to produce classifiers. This is nothing but the standard supervised learning problem.
4
Text Categorization (contd.)
Standard supervised learning problem
Bottleneck: a very large number of labeled training documents is needed to build an accurate classifier
Goal: to identify a particular class of documents from a set of mixed unlabeled documents
Standard classification methods are inapplicable
Partially supervised classification is used instead
As we saw, one bottleneck of supervised learning is the need for the training documents to be labeled, sometimes manually. The main goal of this paper is to propose a strategy that can identify a particular class of documents in a set of unlabeled documents, given only a small set of positive documents. Since we have to classify unlabeled text documents, the traditional classification setting is inapplicable, and partially supervised classification is used.
5
Theoretical foundations
Aim: to show that partially supervised classification (PSC) is a constrained optimization problem
Fixed distribution D over X × Y, where X and Y are the sets of possible documents and classes, with Y = {0, 1}
Two sets of documents:
a positive set P of size n1, drawn from X according to D_{X|Y=1}
an unlabeled set U of size n2, drawn independently from X according to D_X
Goal: find the positive documents in U
We assume a fixed distribution D over X × Y, where X is the set of possible documents and Y = {0, 1}, since we are only concerned with two classes, positive and negative.
6
Theoretical foundations
The learning algorithm selects a function f from a class of functions F: X → {0, 1} to classify the unlabeled documents.
Probability of error: Pr[f(X) ≠ Y], the sum of the "false positive" and "false negative" cases:
Pr[f(X) ≠ Y] = Pr[f(X) = 1 ∧ Y = 0] + Pr[f(X) = 0 ∧ Y = 1]
After transformation:
Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1]
Notation: Pr_D[A] is the probability of an event A ⊆ X × Y under D, and for a finite sample T ⊆ X × Y, Pr_T[A] is the corresponding empirical probability on T.
This decomposition is why learning is possible in the partially supervised case: the paper shows theoretically that PSC is a constrained optimization problem.
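The transformation above is not spelled out on the slide; it follows from elementary probability, as this short derivation shows:

```latex
\begin{align*}
\Pr[f(X)\neq Y] &= \Pr[f(X)=1,\,Y=0] + \Pr[f(X)=0,\,Y=1],\\
\Pr[f(X)=1,\,Y=0] &= \Pr[f(X)=1] - \Pr[f(X)=1,\,Y=1]
                   = \Pr[f(X)=1] - \Pr[Y=1] + \Pr[f(X)=0,\,Y=1],\\
\Pr[f(X)\neq Y] &= \Pr[f(X)=1] - \Pr[Y=1] + 2\,\Pr[f(X)=0,\,Y=1]\\
                &= \Pr[f(X)=1] - \Pr[Y=1] + 2\,\Pr[f(X)=0\mid Y=1]\,\Pr[Y=1].
\end{align*}
```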
7
Theoretical foundations (contd.)
Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1]
Note that Pr[Y = 1] is constant, so it does not change the minimization criterion.
Approximation: keeping Pr[f(X) = 0 | Y = 1] small gives
error ≈ Pr[f(X) = 1] − Pr[Y = 1] = Pr[f(X) = 1] − const,
i.e. minimizing Pr[f(X) = 1] approximately minimizes the error.
For large enough sets P(ositive) and U(nlabeled), this becomes: minimize Pr_U[f(X) = 1] while keeping Pr_P[f(X) = 1] ≥ r, where r is the required recall (recall = relevant retrieved / all relevant).
This is nothing but a constrained optimization problem, so learning is possible.
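Written out over the empirical samples P and U, the criterion this slide describes is the constrained optimization problem:

```latex
\min_{f \in F}\ \Pr_U[f(X) = 1]
\qquad \text{subject to} \qquad \Pr_P[f(X) = 1] \ge r .
```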
8
Naïve Bayesian Text Classification
D is the set of training documents; each document is treated as an ordered list of words.
V = <w1, w2, ..., w|V|> is the vocabulary used; w_{di,k} ∈ V is the word in position k of document di.
C = {c1, c2, ..., c|C|} is the set of predefined classes; here there are two, c1 and c2.
For each di ∈ D, the posterior probabilities Pr[cj | di] are computed under the NB model, and the class with the highest Pr[cj | di] is assigned to the document (see the sketch below).
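As an illustration of this posterior-based assignment, here is a minimal sketch using scikit-learn's multinomial naive Bayes; the toy documents, class names, and variable names are assumptions for illustration, not data from the paper.

```python
# Minimal sketch of multinomial naive Bayes text classification.
# The toy documents and class names are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["great match and a fast final set",
              "parliament passed the new budget bill",
              "the striker scored twice in the derby",
              "the senate debated the tax reform"]
train_labels = ["sport", "politics", "sport", "politics"]  # two classes, c1 and c2

vectorizer = CountVectorizer()                  # builds the vocabulary V
X_train = vectorizer.fit_transform(train_docs)  # word-count representation of each di

clf = MultinomialNB()                           # NB-C (Laplace smoothing by default)
clf.fit(X_train, train_labels)

test_doc = ["the coach praised the final set"]
X_test = vectorizer.transform(test_doc)
posteriors = clf.predict_proba(X_test)[0]       # Pr[cj | di] for each class cj
print(dict(zip(clf.classes_, posteriors)))
print("assigned class:", clf.predict(X_test)[0])  # class with the highest Pr[cj | di]
```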
9
The EM Algorithm
An iterative algorithm for maximum-likelihood estimation in problems with incomplete data.
A two-step method:
Expectation step: fills in the missing data
Maximization step: re-estimates the parameters after the missing data has been filled in
The two steps are alternated until the estimates converge.
10
Proposed Strategy
Step 1: Re-initialization
Iterative EM (I-EM): apply the EM algorithm over P and U
Identify a set of reliable negative documents from the unlabeled set by introducing spies
Step 2: Building and selecting a classifier
Spy-EM (S-EM): build a set of classifiers iteratively
Select a good classifier from the set of classifiers constructed above
After an initial classifier is built using naive Bayes and EM, the documents in the mixed set that are most likely to be negative are identified with the help of spies. The EM algorithm generates a sequence of solutions that increases the likelihood function, but the classification error does not necessarily improve, which is why a classifier has to be selected in Step 2.
11
Iterative EM with NB-C
Assign each document in P(ositive) to class c1 and each document in U(nlabeled) to class c2:
Pr[c1 | di] = 1 and Pr[c2 | di] = 0 for each di in P
Pr[c2 | dj] = 1 and Pr[c1 | dj] = 0 for each dj in U
After this initial labeling, an NB-C is built and used to classify the documents in U, i.e. to revise the posterior probabilities Pr[cj | dj] of the documents in U, which become their new probabilistic class labels.
After the revision, a new NB-C is built from the new posterior probabilities of the unlabeled and positive documents.
The iterative process continues until EM converges; the final probabilistic class labels can be used for classification.
Setback: the result is strongly biased towards the positive documents, so the spy technique is proposed to deal with this problem and balance the positive and negative documents. A sketch of the I-EM loop is given below.
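To make the loop concrete, here is a minimal sketch of I-EM with a naive Bayes classifier using scikit-learn; the toy documents, the fixed cap of 8 iterations, and the fractional-label trick (duplicating U with sample weights) are assumptions of this sketch, not the paper's implementation.

```python
# Sketch of I-EM: EM over P (positive) and U (unlabeled) with a naive Bayes classifier.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

P_docs = ["cheap pills online", "win money now fast", "limited offer buy now"]       # positives
U_docs = ["meeting moved to friday", "buy cheap pills today", "quarterly report attached",
          "win a free prize now", "lunch at noon tomorrow"]                           # mixed set

vec = CountVectorizer()
X = vec.fit_transform(P_docs + U_docs)
X_P, X_U = X[:len(P_docs)], X[len(P_docs):]

# Initial labeling: every document in P gets c1 (=1), every document in U gets c2 (=0).
y_init = np.r_[np.ones(len(P_docs)), np.zeros(len(U_docs))]
clf = MultinomialNB().fit(X, y_init)            # initial NB-C

for _ in range(8):                              # iterate until convergence (fixed cap here)
    pos_col = list(clf.classes_).index(1.0)
    # E-step: revise the posterior probabilities Pr[c1 | dj] of the documents in U.
    post_U = clf.predict_proba(X_U)[:, pos_col]
    # M-step: rebuild the NB-C from P (Pr[c1|di]=1) and U with its probabilistic labels,
    # implemented here by duplicating U and weighting the copies by the posteriors.
    X_all = np.vstack([X_P.toarray(), X_U.toarray(), X_U.toarray()])
    y_all = np.r_[np.ones(len(P_docs)), np.ones(len(U_docs)), np.zeros(len(U_docs))]
    w_all = np.r_[np.ones(len(P_docs)), post_U, 1.0 - post_U]
    clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)

print(clf.predict_proba(X_U)[:, list(clf.classes_).index(1.0)])  # final Pr[c1 | dj] for U
```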
12
Step 1: Re-initialization
Sample a certain percentage of the positive examples, call this set S, and put them into the unlabeled set to act as "spies".
The I-EM algorithm is run as before, but the U(nlabeled) set now contains the spy documents.
After EM completes, the probabilistic labels of the spies are used to decide which documents are most likely negative (LN).
A threshold t is used for this decision:
if Pr[c1 | dj] < t, the document dj is put into the L(ikely) N(egative) set
if Pr[c1 | dj] ≥ t, the document dj stays in the U(nlabeled) set
A sketch of this re-initialization step follows.
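A possible sketch of the re-initialization step, assuming an iterative_em helper like the I-EM sketch above that returns Pr[c1 | d] for every document in its unlabeled set; the 15% spy rate and the percentile rule for choosing t are assumptions of this sketch, not the paper's exact recipe.

```python
# Sketch of spy-based re-initialization (Step 1).
import numpy as np
import scipy.sparse as sp

def spy_reinitialize(X_P, X_U, iterative_em, spy_frac=0.15, seed=0):
    rng = np.random.default_rng(seed)
    n_pos = X_P.shape[0]
    spy_idx = rng.choice(n_pos, size=max(1, int(spy_frac * n_pos)), replace=False)
    keep_idx = np.setdiff1d(np.arange(n_pos), spy_idx)

    # Move the spies S from P into U, then run I-EM on the reduced P and augmented U.
    X_P_new = X_P[keep_idx]
    X_U_new = sp.vstack([X_U, X_P[spy_idx]])
    post = iterative_em(X_P_new, X_U_new)          # Pr[c1 | d] for every document now in U
    post_U, post_spies = post[:X_U.shape[0]], post[X_U.shape[0]:]

    # Choose the threshold t from the spies' probabilistic labels; using a low percentile
    # (so that most spies would still look positive) is an assumption of this sketch.
    t = np.percentile(post_spies, 10)

    likely_negative_mask = post_U < t              # documents to place in LN
    return likely_negative_mask, t
```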
13
Step-1 Effect
(Diagram: "before" shows the positive set P and the unlabeled set U containing both positives and negatives, with some spies moved from P into U; "after" shows U split into a likely-negative set LN and a remaining unlabeled set U.)
Initial situation: U contains both positives and negatives, and we have no clue which is which; some spies from P are added to U.
With the help of the spies, most of the positives in U stay in the unlabeled set, while most of the negatives end up in LN, so the purity of LN is higher than that of U.
14
Step 2: S-EM
Apply EM over P, LN and U. The algorithm proceeds as follows:
Put all spies S back into P (where they were before).
di ∈ P: class c1 (i.e. Pr[c1 | di] = 1), fixed throughout the iterations.
dj ∈ LN: class c2 (i.e. Pr[c2 | dj] = 1), allowed to change through EM.
dk ∈ U: initially assigned no label; a probabilistic label is assigned after the first EM iteration and revised in the subsequent ones.
Run EM using P, LN and U until it converges; a final classifier is produced when EM stops.
Now that the positive and negative documents are balanced, this procedure, called the spy technique (S-EM), leaves us with a set of classifiers, one per iteration; a sketch is given below.
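A sketch of the S-EM step in the same style as the I-EM sketch above, again using fractional labels via sample weights; the fixed iteration cap, the dense stacking, and the helper structure are assumptions of this sketch.

```python
# Sketch of S-EM (Step 2): EM over P, LN and U with a naive Bayes classifier.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def spy_em(X_P, X_LN, X_U, n_iter=8):
    # Initial labels: P fixed to c1, LN starts as c2, U gets its label from the first E-step.
    X_init = np.vstack([X_P.toarray(), X_LN.toarray()])
    y_init = np.r_[np.ones(X_P.shape[0]), np.zeros(X_LN.shape[0])]
    clf = MultinomialNB().fit(X_init, y_init)

    classifiers = [clf]
    for _ in range(n_iter):
        pos_col = list(clf.classes_).index(1.0)
        p_LN = clf.predict_proba(X_LN)[:, pos_col]   # labels of LN may change through EM
        p_U = clf.predict_proba(X_U)[:, pos_col]     # U gets probabilistic labels after EM(1)
        X_all = np.vstack([X_P.toarray()] + [X_LN.toarray()] * 2 + [X_U.toarray()] * 2)
        y_all = np.r_[np.ones(X_P.shape[0]),
                      np.ones(X_LN.shape[0]), np.zeros(X_LN.shape[0]),
                      np.ones(X_U.shape[0]), np.zeros(X_U.shape[0])]
        w_all = np.r_[np.ones(X_P.shape[0]),          # Pr[c1 | di] = 1 fixed for P
                      p_LN, 1.0 - p_LN,
                      p_U, 1.0 - p_U]
        clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
        classifiers.append(clf)
    return classifiers                                # one classifier per EM iteration
```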
15
Selecting a Classifier
Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1]
S-EM generates a set of classifiers, but the classification error is not necessarily improving.
Remedy: stop iterating EM at some point, by estimating the change of the probability of error between iterations i and i+1:
Δi = Pr[fi+1(X) ≠ Y] − Pr[fi(X) ≠ Y]
If Δi > 0 for the first time, the i-th classifier produced is the final classifier.
EM is prone to being trapped in a local maximum: if the local maximum separates the two classes well, EM works well; otherwise (e.g. when the positives and the negatives each consist of many clusters) the data is not well separated, and it may be better to stop early instead of iterating EM to convergence. Running S-EM therefore produces a sequence of classifiers, and Δi is used to choose one of them.
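Using the error decomposition at the top of this slide, with Pr[f(X) = 1] estimated on U and Pr[f(X) = 0 | Y = 1] estimated on P, the change in error can be written as below; treating Pr[Y = 1] as an unknown constant factor is an assumption of this sketch, and the paper's exact estimator may differ in detail.

```latex
\Delta_i \;\approx\; \bigl(\Pr_U[f_{i+1}(X)=1] - \Pr_U[f_i(X)=1]\bigr)
\;+\; 2\,\Pr[Y=1]\,\bigl(\Pr_P[f_{i+1}(X)=0] - \Pr_P[f_i(X)=0]\bigr)
```

The constant term −Pr[Y = 1] in the decomposition cancels when the difference between iterations i and i+1 is taken.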
16
Evaluation Measures
Accuracy (of a classifier): A = m / (m + i), where m and i are the numbers of correct and incorrect decisions, respectively.
F-score: F = 2pr / (p + r), a classification performance measure, where
recall r = a / (a + c) is the ratio of true positives to all positives in the test data
precision p = a / (a + b) is the ratio of true positives to all examples classified as positive by the system
(so a, b and c are the numbers of true positives, false positives and false negatives)
The F-value reflects the combined effect of precision and recall. A small worked example follows.
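A small worked example of these measures on a toy prediction vector; the labels below are made up for illustration.

```python
# Computing accuracy, precision, recall and F-score for a toy prediction vector.
# a = true positives, b = false positives, c = false negatives, following the slide.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # 1 = positive class
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

a = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
b = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
c = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
m = sum(1 for t, p in zip(y_true, y_pred) if t == p)             # correct decisions
i = len(y_true) - m                                              # incorrect decisions

accuracy = m / (m + i)
precision = a / (a + b)
recall = a / (a + c)
f_score = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f_score)   # 0.75 0.75 0.75 0.75
```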
17
Experiments
30 datasets were created from two large document corpora, e.g. 20 Newsgroups (subdivided into 4 groups) and WebKB (CS departments, subdivided into 7 categories).
Objective: recovering the positive documents that were placed into the mixed (unlabeled) sets.
For each experiment, the full positive set is divided into two subsets, P and R:
P: the positive set used by the algorithm, containing a% of the full positive set
R: the set of remaining documents, of which b% are put into U (not all of R is put into U)
The parameters a and b are varied to cover different scenarios, reflecting the belief that in reality the mixed set is large and contains only a small proportion of positive documents.
18
Experiments (contd.)
Techniques used:
NB-C: applied directly to P (as c1) and U (as c2) to build a classifier that classifies the data in U
I-EM: applies EM to P and U until it converges (no spies yet); the final classifier is applied to U to identify its positive documents
S-EM: uses the spies to re-initialize, then I-EM to build the final classifier; the threshold t is used as described above
19
Experiments (contd.)
S-EM outperforms NB and I-EM dramatically in F-score, and outperforms them in accuracy as well.
Comment: the datasets are skewed, so accuracy is not a reliable measure of classifier performance.
In the result tables, a and b are the percentages of positive data placed into P and R, NB is naive Bayes, I-EM8 is I-EM after 8 iterations (I-EM did not seem to improve beyond that), and S-EM is Spy-EM.
20
Experiments (contd.)
The results show the great effect of re-initialization with spies: S-EM outperforms I-EMbest.
Re-initialization is not, however, the only factor of improvement: S-EM also outperforms S-EM4.
Conclusion: both Step 1 (re-initializing) and Step 2 (selecting the best model) are needed.
In this comparison, I-EMbest is the best result obtained by I-EM over all 8 iterations, I-EM8 is I-EM after 8 iterations, S-EM4 is S-EM stopped after the 4th iteration, and S-EM is the full method after selecting a good model.
21
Conclusion
The paper gives an overview of the theory of learning with positive and unlabeled examples.
It describes a two-step strategy for learning that produces highly accurate classifiers.
Partially supervised classification is most helpful when the initial model is insufficiently trained.
22
Questions?