Partially Supervised Classification of Text Documents

Presentation transcript:

Partially Supervised Classification of Text Documents
Authors: Bing Liu, Wee Sun Lee, Philip S. Yu, Xiaoli Li
Presented by: Swetha Nandyala (CIS 525: Neural Computation)

Hi everyone, I am Swetha, and today I will be talking about "Partially Supervised Classification of Text Documents." The authors of the paper are Bing Liu, Wee Sun Lee, Philip S. Yu, and Xiaoli Li.

Overview
Introduction
Theoretical Foundation
Background Methodology: NB-C, EM-Algorithm
Proposed Strategy
Evaluation Measures & Experiments
Conclusion

In this talk I will present an introduction, the theoretical foundation, the background methodology (the naïve Bayesian classifier and the EM algorithm), the proposed strategy, the evaluation measures and experiments, and the conclusion.

Text Categorization
"... the activity of labeling natural language texts with thematic categories from a pre-defined set" [Sebastiani, 2002]
Text categorization is the task of automatically assigning to a text document d from a given domain D a category label c selected from a predefined set of category labels C.
[Figure: a categorization system maps the documents in D to category labels c1, c2, ..., cj, ..., ck in C]

Let us look at text categorization. TC is the activity of labeling natural language texts with thematic categories from a pre-defined set; that is, given a document domain D and a set of category labels C, TC assigns a category label to each document in D. Learning algorithms are then applied to this set of labeled documents to produce classifiers. This is the standard supervised learning problem.

Text Categorization (contd.)
Standard supervised learning problem
Bottleneck: a very large number of labeled training documents is needed to build an accurate classifier
Goal: identify a particular class of documents from a set of mixed, unlabeled documents
Standard classification methods are inapplicable; partially supervised classification is used instead

As we saw, one bottleneck of supervised learning is the need for training documents to be labeled, sometimes manually. The main goal of this paper is to propose a strategy that identifies a particular class of documents from a set of unlabeled documents, given only a small set of positive documents. Since the training data contains unlabeled documents, traditional classification methods are inapplicable, and partially supervised classification is used.

Theoretical Foundations
AIM: to show that partially supervised classification (PSC) is a constrained optimization problem
Fixed distribution D over X × Y, where Y = {0, 1}; X and Y are the sets of possible documents and classes
Two sets of documents:
Positive set P of size n1, drawn from X according to D_{X|Y=1}
Unlabeled set U of size n2, drawn independently from X according to D_X
GOAL: find the positive documents in U

We have a fixed distribution D over X × Y, where X is the set of possible documents and Y is the set of classes, here {0, 1}, since we are concerned with only two classes, positive and negative. Now that we know what partially supervised classification is, let us see why learning is possible in this setting.

Theoretical Foundations (contd.)
Learning algorithm: selects a function f ∈ F, f: X → {0, 1} (F a class of functions), to classify unlabeled documents
Probability of error: Pr[f(X) ≠ Y] is the sum of the "false positive" and "false negative" cases, written as
Pr[f(X) ≠ Y] = Pr[f(X) = 1 ∧ Y = 0] + Pr[f(X) = 0 ∧ Y = 1]
After transformation:
Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1]

This slide shows why learning is possible in the partially supervised case. The paper theoretically shows that PSC is a constrained optimization problem. Notation: Pr_D[A] is the probability of A ⊆ X × Y when an example is drawn according to D; T is a finite sample, a subset of our data set; Pr_T[A] is the probability of A when an example is chosen randomly from T.
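For reference, the transformation used on this slide follows in a few lines from the definitions above; this is a sketch of the standard derivation, written out in LaTeX.

```latex
\begin{align*}
\Pr[f(X) \neq Y]
  &= \Pr[f(X)=1 \wedge Y=0] + \Pr[f(X)=0 \wedge Y=1] \\
  &= \bigl(\Pr[f(X)=1] - \Pr[f(X)=1 \wedge Y=1]\bigr) + \Pr[f(X)=0 \wedge Y=1] \\
  &= \Pr[f(X)=1] - \bigl(\Pr[Y=1] - \Pr[f(X)=0 \wedge Y=1]\bigr) + \Pr[f(X)=0 \wedge Y=1] \\
  &= \Pr[f(X)=1] - \Pr[Y=1] + 2\,\Pr[f(X)=0 \mid Y=1]\,\Pr[Y=1].
\end{align*}
```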

Theoretical Foundations (contd.)
Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1]
Note that Pr[Y = 1] is constant.
Approximation: if Pr[f(X) = 0 | Y = 1] is kept small, then
error ≈ Pr[f(X) = 1] − Pr[Y = 1] = Pr[f(X) = 1] − const,
i.e. minimizing Pr[f(X) = 1] ≈ minimizing the error
≈ minimizing Pr_U[f(X) = 1] while keeping Pr_P[f(X) = 1] ≥ r
This is nothing but a constrained optimization problem, so learning is possible.

Since Pr[Y = 1] is constant, dropping it does not change the optimization criterion. Keeping Pr[f(X) = 0 | Y = 1] small corresponds to keeping the recall, (relevant retrieved) / (all relevant), high, and the probability estimates hold for large enough positive set P and unlabeled set U.

Naïve Bayesian Text Classification
Let D be the set of training documents
C = {c1, c2, ..., c|C|}: predefined classes; here only c1 and c2
For each di ∈ D, the posterior probabilities Pr[cj | di] are computed in the NB model; the class with the highest Pr[cj | di] is assigned to the document

D is a set of training documents, each considered an ordered list of words. V = <w1, w2, ..., w|V|> is the vocabulary used, and w_{di,k} ∈ V is the word in position k of document di. C = {c1, c2, ..., c|C|} is the set of predefined classes. The posterior probabilities are calculated and the class with the highest Pr[cj | di] is assigned to the document.
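To make the NB step concrete, here is a minimal sketch of multinomial naïve Bayes training and posterior computation for the two-class setting on this slide. The function names, the Laplace smoothing constant alpha, and the use of plain word lists as documents are illustrative assumptions, not details from the paper.

```python
import math
from collections import Counter

def train_nb(docs, labels, classes=("c1", "c2"), alpha=1.0):
    """Train a multinomial NB model; returns a function doc -> {class: Pr[class | doc]}.
    docs: list of documents, each a list of words; labels: one class name per document."""
    vocab = {w for d in docs for w in d}
    prior, counts, totals = {}, {}, {}
    for c in classes:
        docs_c = [d for d, y in zip(docs, labels) if y == c]
        prior[c] = (len(docs_c) + 1) / (len(docs) + len(classes))   # smoothed class prior
        counts[c] = Counter(w for d in docs_c for w in d)
        totals[c] = sum(counts[c].values())

    def posterior(doc):
        # log Pr[c] + sum over words of log Pr[w | c], then normalize to posteriors
        logp = {c: math.log(prior[c]) +
                   sum(math.log((counts[c][w] + alpha) / (totals[c] + alpha * len(vocab)))
                       for w in doc if w in vocab)
                for c in classes}
        m = max(logp.values())
        z = sum(math.exp(v - m) for v in logp.values())
        return {c: math.exp(v - m) / z for c, v in logp.items()}

    return posterior
```

The class with the highest posterior is then assigned, e.g. `probs = posterior(doc); label = max(probs, key=probs.get)`.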

The EM-Algorithm
Iterative algorithm for maximum likelihood estimation in problems with incomplete data
Two-step method:
Expectation step: fills in the missing data
Maximization step: estimates the parameters after the missing data is filled in

As we learned earlier, EM is an iterative algorithm for maximum likelihood estimation in problems with incomplete data. It has two steps, the expectation step and the maximization step. The expectation step fills in the missing data, and the maximization step estimates the parameters.
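The two steps can be written as a generic loop. This is only a schematic sketch: `e_step`, `m_step`, and `distance` are caller-supplied placeholders for the model-specific computations, not anything prescribed by the paper.

```python
def run_em(init_params, e_step, m_step, distance, max_iter=50, tol=1e-4):
    """Generic EM loop: alternate filling in the missing data (E-step) and
    re-estimating the parameters (M-step) until the change falls below tol."""
    params = init_params
    for _ in range(max_iter):
        expectations = e_step(params)      # E-step: soft-complete the missing labels
        new_params = m_step(expectations)  # M-step: maximum likelihood given completed data
        if distance(params, new_params) < tol:
            return new_params
        params = new_params
    return params
```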

Proposed Strategy
Step 1: Re-initialization
Iterative EM (I-EM): apply the EM algorithm over P and U
Identify a set of reliable negative documents in the unlabeled set by introducing spies
Step 2: Building and selecting a classifier
Spy-EM (S-EM): build a set of classifiers iteratively
Select a good classifier from the set of classifiers constructed above

After building an initial classifier using naïve Bayes and EM, the documents that are most likely to be negative in the mixed set are identified using spies. The EM algorithm generates a sequence of solutions that increases the likelihood function, but the classification error does not necessarily improve.

Iterative EM with NB-C
Assign each document in P(ositive) to class c1 and each document in U(nlabeled) to class c2:
Pr[c1 | di] = 1 and Pr[c2 | di] = 0 for each di in P
Pr[c2 | dj] = 1 and Pr[c1 | dj] = 0 for each dj in U
After this initial labeling, an NB-C is built and used to classify the documents in U, revising their posterior probabilities
After the revision, a new NB-C is built from the new posterior probabilities
The iterative process continues until EM converges
Setback: strongly biased towards positive documents

In I-EM, each document in P is assigned to class c1 and each document in U to c2, i.e. Pr[c1 | di] = 1 and Pr[c2 | di] = 0 for each di in P, and Pr[c2 | dj] = 1 and Pr[c1 | dj] = 0 for each dj in U. Using this initial labeling, a naïve Bayesian classifier is built and used to classify the documents in U; in particular, it computes the posterior probabilities of the documents in U, which are assigned as new probabilistic class labels. After all the probabilities are revised, a new NB classifier is built from the new posterior probabilities of the unlabeled and positive documents. This iterative process continues until EM converges, and the final probabilistic class labels can be used for classification. One setback is that the method is strongly biased towards positive documents, so a spy technique is proposed to deal with this problem and balance the positive and negative documents.
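A sketch of the I-EM loop just described, under the assumption that `build_nb(docs, soft_labels)` is available as a naïve Bayes trainer that accepts probabilistic labels (a hypothetical helper, e.g. a weighted variant of the earlier sketch); names and the convergence test are illustrative.

```python
def iterative_em(P, U, build_nb, max_iter=8, tol=1e-3):
    """I-EM: label P as c1 and U as c2, then repeatedly rebuild an NB classifier
    and revise the posterior probabilities of the documents in U.
    build_nb(docs, soft_labels) must return a function doc -> {"c1": p, "c2": 1 - p}."""
    docs = list(P) + list(U)
    labels = ([{"c1": 1.0, "c2": 0.0} for _ in P] +     # P: fixed positive labels
              [{"c1": 0.0, "c2": 1.0} for _ in U])      # U: initially labeled negative
    clf = build_nb(docs, labels)
    for _ in range(max_iter):
        new_u = [clf(d) for d in U]                     # E-step: revise posteriors of U
        delta = max((abs(n["c1"] - o["c1"])
                     for n, o in zip(new_u, labels[len(P):])), default=0.0)
        labels = labels[:len(P)] + new_u
        clf = build_nb(docs, labels)                    # M-step: rebuild the NB-C
        if delta < tol:                                 # stop when the posteriors stabilize
            break
    return clf, labels
```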

Step 1: Re-Initialization
Sample a certain percentage of the positive examples, call them S, and put them into the unlabeled set to act as "spies"
The I-EM algorithm is run as before, but the U(nlabeled) set now contains the spy documents
After EM completes, the probabilistic labels of the spies are used to decide which documents are most likely negative (LN)
A threshold t is used for the decision: documents dj with Pr[c1 | dj] < t are denoted L(ikely) N(egative); documents dj with Pr[c1 | dj] ≥ t remain in U(nlabeled)

Here we sample a certain percentage of the positive examples and put them into the unlabeled set to act as spies, so the unlabeled set now contains spy documents, and I-EM is run. After EM completes, the probabilistic labels of the spies are used to decide which documents belong to the likely-negative set. To make this decision a threshold t is chosen: documents with Pr[c1 | dj] < t go into LN, and the remaining documents stay in U.
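A sketch of the spy step under the assumptions above. `run_iem(P, U)` stands in for the I-EM procedure sketched earlier, the 15% spy fraction is only a plausible default, and choosing t as a low percentile of the spies' Pr[c1 | ·] is one possible reading of the threshold selection, not the paper's exact rule.

```python
import random

def spy_reinitialize(P, U, run_iem, spy_frac=0.15, percentile=0.05, seed=0):
    """Step 1: move a fraction of P into U as spies, run I-EM, then use the
    spies' posteriors to pick a threshold t and split U into likely-negative
    (LN) and remaining unlabeled documents.
    run_iem(P, U) must return a classifier doc -> {"c1": p, "c2": 1 - p}."""
    rng = random.Random(seed)
    spies = rng.sample(list(P), max(1, int(spy_frac * len(P))))
    P_rest = [d for d in P if d not in spies]
    U_with_spies = list(U) + spies
    clf = run_iem(P_rest, U_with_spies)

    # threshold t: a low percentile of the spies' Pr[c1 | spy] (assumed rule)
    spy_probs = sorted(clf(s)["c1"] for s in spies)
    t = spy_probs[int(percentile * (len(spy_probs) - 1))]

    LN = [d for d in U if clf(d)["c1"] < t]        # likely negative
    U_rest = [d for d in U if clf(d)["c1"] >= t]   # stays unlabeled
    return LN, U_rest, t
```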

Step-1 Effect
[Figure: before/after diagram. Before: the positive set P and the unlabeled set U, which contains positives and negatives, with some spies from P added to U. After: a likely-negative set LN, a smaller unlabeled set U, and P with its spies.]

Let us now see the result of Step 1. Before applying the spy technique we have a set of labeled positives and an unlabeled set that initially contains both positives and negatives, and we add some spies from P to U. After applying the spy technique, the spies help put most of the negatives into the likely-negative set and most of the positives into the remaining unlabeled set. Initially U is a mixture of positive and negative documents with no clue which are which; with the help of the spies added from P, most positives in U stay in the unlabeled set while most negatives move to LN, so the purity of LN is higher than that of U.

Step 2: S-EM
Apply EM over P, LN and U. The algorithm proceeds as follows:
Put all spies S back into P (where they were before)
di ∈ P: assigned c1 (i.e. Pr[c1 | di] = 1), fixed through the iterations
dj ∈ LN: assigned c2 (i.e. Pr[c2 | dj] = 1), allowed to change through EM
dk ∈ U: initially assigned no label (it gets one after the first EM iteration)
Run EM using P, LN and U until it converges
The final classifier is produced when EM stops

Now that the positive and negative documents are balanced, EM is employed over P, LN and U. The steps are: put all the spies back where they were initially, assign label c1 to every document in P and label c2 to every document in LN, and assign no label to the unlabeled documents. At the end of the first iteration each unlabeled document is assigned a probabilistic label, which is revised in subsequent iterations. EM is iterated until convergence. This procedure produces a sequence of classifiers, with a final classifier when EM stops, and the whole technique is called the spy technique (S-EM).
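A sketch of Step 2 under the same assumptions: `build_nb(docs, soft_labels)` again stands in for a hypothetical NB trainer that accepts probabilistic labels; P's labels are held fixed while LN's and U's are revised each round, and every iteration's classifier is kept for the model selection described on the next slide.

```python
def spy_em_step2(P, LN, U, build_nb, max_iter=8):
    """Step 2 of S-EM: run EM over P (fixed c1), LN (initially c2) and U
    (initially unlabeled), returning the classifier from every iteration."""
    classifiers = []
    p_labels = [{"c1": 1.0, "c2": 0.0} for _ in P]     # fixed through the iterations
    ln_labels = [{"c1": 0.0, "c2": 1.0} for _ in LN]   # revised by EM
    u_labels = []                                      # assigned after the first iteration
    for _ in range(max_iter):
        docs = list(P) + list(LN) + (list(U) if u_labels else [])
        labels = p_labels + ln_labels + u_labels
        clf = build_nb(docs, labels)                   # M-step
        ln_labels = [clf(d) for d in LN]               # E-step: revise LN
        u_labels = [clf(d) for d in U]                 # E-step: label / revise U
        classifiers.append(clf)
    return classifiers
```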

Selecting a Classifier
Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1]
S-EM generates a set of classifiers, but the classification error does not necessarily improve
Remedy: stop iterating EM at some point by estimating the change in the probability of error between iterations i and i+1:
Δi = Pr[f_{i+1}(X) ≠ Y] − Pr[f_i(X) ≠ Y]
If Δi > 0 for the first time, then the i-th classifier produced is the final classifier

EM is prone to the local-maximum trap: if the local maximum separates the two classes well, EM works well; otherwise (e.g. when the positives and negatives each consist of many clusters) the data is not well separable, and it may be better to stop early rather than iterate EM to convergence. More specifically, if S-EM is run for n iterations it produces n classifiers, from which we choose one. The selection is done by estimating the change in the probability of error between iterations i and i+1: if Δi > 0 for the first time, then the i-th classifier produced is the final classifier.
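One way to estimate Δi from the decomposition on the slide is sketched below: Pr[f(X) = 1] is estimated on the mixed set M, Pr[f(X) = 0 | Y = 1] on the positive set P, and an estimate of Pr[Y = 1] is simply passed in as an argument (the paper derives its own estimate; this sketch follows the slide's formula, not necessarily the paper's exact estimation procedure).

```python
def select_classifier(classifiers, M, P, pr_y1):
    """Pick the classifier just before the estimated error first increases.
    classifiers: list of functions doc -> {"c1": p, "c2": 1 - p}, one per EM iteration.
    M: mixed/unlabeled documents, P: positive documents, pr_y1: estimate of Pr[Y=1]."""
    def est_error(clf):
        # Pr[f(X)=1] estimated on M, Pr[f(X)=0 | Y=1] estimated on P;
        # the constant -Pr[Y=1] term is dropped since it cancels in differences
        pr_f1 = sum(clf(d)["c1"] > 0.5 for d in M) / len(M)
        pr_f0_given_pos = sum(clf(d)["c1"] <= 0.5 for d in P) / len(P)
        return pr_f1 + 2 * pr_f0_given_pos * pr_y1
    errors = [est_error(c) for c in classifiers]
    for i in range(len(errors) - 1):
        if errors[i + 1] - errors[i] > 0:   # Delta_i > 0 for the first time
            return classifiers[i]
    return classifiers[-1]
```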

Evaluation Measures
Accuracy (of a classifier): A = m / (m + i), where m and i are the numbers of correct and incorrect decisions, respectively
F-score: F = 2pr / (p + r), a classification performance measure, where
recall r = a / (a + c) and precision p = a / (a + b)
(a = true positives, b = false positives, c = false negatives)
The F-value reflects the average effect of both precision and recall (their harmonic mean)

Accuracy is the ratio of the number of correct decisions to the total number of decisions. Recall is the ratio of true positives to the total number of positives in the test data, while precision is the ratio of true positives to all examples classified as positive by the system. The F-value reflects the average effect of both precision and recall.
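A small sketch computing these measures from hard predictions; the label convention ("c1" for the positive class) follows the earlier sketches and is an assumption, not the paper's notation.

```python
def evaluate(y_true, y_pred, positive="c1"):
    """Accuracy, precision, recall and F-score from parallel label lists."""
    a = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))  # true positives
    b = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))  # false positives
    c = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))  # false negatives
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "F": f_score}
```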

Experiments
30 datasets created from 2 large document corpora
Objective: recover the positive documents placed into the mixed sets
For each experiment, the full positive set is divided into two subsets, P and R:
P: the positive set used by the algorithm, containing a% of the full positive set
R: the set of remaining positive documents, of which b% are put into U (not all of R is put into U)

For the experiments, 30 datasets were created from two large document corpora: 20 Newsgroups, subdivided into 4 groups, and WebKB (CS departments), subdivided into 7 categories. The objective was to recover the positive documents placed in the mixed sets. For each experiment, the positive set is divided into two subsets P and R; P contains a% of the full positive set and is used by the algorithm, while b% of the remaining documents in R are put into the mixed set. The parameters a and b are varied to cover different scenarios, reflecting the belief that in reality the mixed set M is large and contains only a small proportion of positive documents.

Experiments (contd.)
Techniques compared:
NB-C: applied directly to P (as c1) and U (as c2) to build a classifier that classifies the data in U
I-EM: applies EM to P and U until convergence (no spies yet); the final classifier is applied to U to identify its positives
S-EM: spies are used to re-initialize, then I-EM builds the final classifier; threshold t used

Three techniques were applied. NB-C is applied directly to P and the mixed set M to build a classifier for classifying the data. I-EM applies EM to P and M until convergence, and the final classifier is applied to M to identify its positives. S-EM re-initializes I-EM using spies to build the final classifier, using the threshold t.

Experiments (contd.)
S-EM dramatically outperforms NB and I-EM in F-score
S-EM outperforms NB and I-EM in accuracy as well
Comment: the datasets are skewed, so accuracy is not a reliable measure of classifier performance

Here a and b are the percentages of positive data drawn to form P and R, NB is naïve Bayes, I-EM8 is I-EM after 8 iterations (I-EM did not seem to improve after that), and S-EM is spy-EM. Looking at the values, S-EM clearly outperforms NB and I-EM in both F-score and accuracy.

Experiments (contd.)
The results show the strong effect of re-initialization with spies: S-EM outperforms I-EMbest
Re-initialization is not, however, the only factor of improvement: S-EM outperforms S-EM4
Conclusion: both Step 1 (re-initializing) and Step 2 (selecting the best model) are needed

In this table of algorithm variants, I-EMbest is the best result obtained by I-EM over all 8 iterations, I-EM8 is I-EM after 8 iterations, S-EM4 is S-EM stopped after the 4th iteration, and S-EM is the full method with model selection.

Conclusion
The paper gives an overview of the theory of learning with positive and unlabeled examples
It describes a two-step strategy for learning which produces extremely accurate classifiers
Partially supervised classification is most helpful when the initial model is insufficiently trained

Questions?