Partially Supervised Classification of Text Documents by Bing Liu, Philip Yu, and Xiaoli Li Presented by: Rick Knowles 7 April 2005

Agenda
- Problem Statement
- Related Work
- Theoretical Foundations
- Proposed Technique
- Evaluation
- Conclusions

Problem Statement: Common Approach
Text categorization: the automatic assignment of text documents to pre-defined classes.
Common approach: supervised learning
- Manually label a set of documents with the pre-defined classes
- Use a learning algorithm to build a classifier

Problem Statement: Common Approach (cont.)
Problem: the bottleneck is the large number of labeled training documents needed to build the classifier.
Nigam et al. have shown that adding a large amount of unlabeled data can help.

A different approach: Partially supervised classification
- Two-class problem: positive and unlabeled
- Key feature: there are no labeled negative documents
- Can be posed as a constrained optimization problem: a function that correctly classifies all positive documents while minimizing the number of mixed documents classified as positive will have an expected error rate of no more than a specified bound
- Exemplar: finding matching documents (i.e., positive documents) in a large collection such as the Web; matching documents are positive, all others are negative
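To make the optimization view above concrete, here is a minimal formal sketch of how the problem can be posed, using the sets defined later in the talk (P = positive documents, M = mixed/unlabeled documents); the exact formulation in the paper may differ, so treat this as an illustration:

```latex
% Sketch of the constrained optimization: choose f from a function class F that
% labels every positive document positive while labeling as few mixed documents
% positive as possible.
\[
\begin{aligned}
\min_{f \in \mathcal{F}} \quad & \bigl|\{\, d \in M : f(d) = 1 \,\}\bigr| \\
\text{subject to} \quad        & f(d) = 1 \quad \text{for all } d \in P .
\end{aligned}
\]
```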

Related Work
Text classification techniques:
- Naïve Bayesian
- K-nearest neighbor
- Support vector machines
Each requires labeled data for all classes.
The problem is similar to traditional information retrieval, which rank-orders documents according to their similarity to the query document but does not perform document classification.

Theoretical Foundations
Some discussion of the theoretical foundations, focused primarily on:
- Minimization of the probability of error
- Expected recall and precision of functions
Pr[f(X) ≠ Y] = Pr[f(X)=1] - Pr[Y=1] + 2 Pr[f(X)=0 | Y=1] Pr[Y=1]   (1)
Painful, painful… but it did show you can build accurate classifiers with high probability when sufficient documents in P (the positive document set) and M (the unlabeled set) are available.
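A short derivation sketch of why equation (1) supports this recipe, under the assumption that the learned function keeps recall on the positive class near 1:

```latex
% From (1):  Pr[f(X) != Y] = Pr[f(X)=1] - Pr[Y=1] + 2 Pr[f(X)=0 | Y=1] Pr[Y=1]
% If recall on the positive class is kept near 1, the last term (almost) vanishes:
\[
\Pr[f(X)=0 \mid Y=1] \approx 0
\;\Longrightarrow\;
\Pr[f(X) \neq Y] \approx \Pr[f(X)=1] - \Pr[Y=1].
\]
% Pr[Y=1] is a fixed property of the data, so minimizing the error is approximately
% the same as minimizing Pr[f(X)=1]: label as few documents positive as the recall
% constraint allows.
```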

Theoretical Foundations (cont.)
Two serious practical drawbacks to the theoretical method:
- The constrained optimization problem may not be easy to solve for the function class in which we are interested
- It is not easy to choose a desired recall level that will give a good classifier using the function class we are using

Proposed Technique
Theory be darned! The paper introduces a practical technique based on the naïve Bayes classifier and the Expectation-Maximization (EM) algorithm.
After introducing the general technique, the authors offer an enhancement using "spies".

Proposed Technique: Terms
- D is the set of training documents
- V is the set (vocabulary) of all words considered for classification
- w_{di,k} is the word in position k of document d_i
- N(w_t, d_i) is the number of times w_t occurs in d_i
- C = {c_1, c_2} is the set of predefined classes
- P is the set of positive documents
- M is the set of unlabeled documents
- S is the set of spy documents
- The posterior probability Pr[c_j | d_i] ∈ {0, 1} depends on the class label of the document

Proposed Technique: Naïve Bayesian Classifier (NB-C)
Pr[c_j] = (Σ_{i=1..|D|} Pr[c_j | d_i]) / |D|   (2)
Pr[w_t | c_j] = (1 + Σ_{i=1..|D|} Pr[c_j | d_i] N(w_t, d_i)) / (|V| + Σ_{s=1..|V|} Σ_{i=1..|D|} Pr[c_j | d_i] N(w_s, d_i))   (3)
Assuming the words are independent given the class:
Pr[c_j | d_i] = Pr[c_j] Π_{k=1..|d_i|} Pr[w_{di,k} | c_j] / (Σ_{r=1..|C|} Pr[c_r] Π_{k=1..|d_i|} Pr[w_{di,k} | c_r])   (4)
The class with the highest Pr[c_j | d_i] is assigned as the class of the document.
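To make equations (2)-(4) concrete, here is a minimal Python sketch of such a classifier; the class name, method names and data layout (documents as token lists, soft labels as dicts) are mine rather than the paper's, and soft labels are accepted so the same code can double as the M step of the EM loops sketched further down:

```python
import math
from collections import defaultdict

class NaiveBayesText:
    """Multinomial naive Bayes with Laplace smoothing, per equations (2)-(4).
    fit() accepts soft labels Pr[c|d] in [0, 1], so it can also serve as the
    M step when the classifier is embedded in EM."""

    def __init__(self, classes=("c1", "c2")):
        self.classes = classes

    def fit(self, docs, soft_labels, vocab):
        # docs: list of token lists; soft_labels: list of {class: Pr[c|d]} dicts
        self.vocab = set(vocab)
        self.prior = {}      # Pr[c_j], equation (2)
        self.cond = {}       # Pr[w_t | c_j], equation (3)
        for c in self.classes:
            self.prior[c] = sum(lbl[c] for lbl in soft_labels) / len(docs)
            counts, total = defaultdict(float), 0.0
            for tokens, lbl in zip(docs, soft_labels):
                for w in tokens:
                    if w in self.vocab:
                        counts[w] += lbl[c]   # accumulates Pr[c|d_i] * N(w, d_i)
                        total += lbl[c]
            denom = len(self.vocab) + total   # |V| + sum_s sum_i Pr[c|d_i] N(w_s, d_i)
            self.cond[c] = {w: (1.0 + counts[w]) / denom for w in self.vocab}
        return self

    def predict_proba(self, tokens):
        # Pr[c_j | d_i] via equation (4), computed in log space for stability
        log_scores = {}
        for c in self.classes:
            s = math.log(max(self.prior[c], 1e-12))
            for w in tokens:
                if w in self.vocab:
                    s += math.log(self.cond[c][w])
            log_scores[c] = s
        m = max(log_scores.values())
        exps = {c: math.exp(s - m) for c, s in log_scores.items()}
        z = sum(exps.values())
        return {c: v / z for c, v in exps.items()}
```

With hard 0/1 labels this is ordinary naive Bayes training; the EM sketches below reuse it with probabilistic labels.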

Proposed Technique: EM Algorithm
A popular class of iterative algorithms for maximum likelihood estimation in problems with incomplete data. Two steps:
- Expectation: fills in the missing data
- Maximization: estimates the parameters
Rinse and repeat.
Using an NB-C, equation (4) is the E step (it fills in the missing class labels) and equations (2) and (3) are the M step (they re-estimate the parameters).
The probability of a class now takes a value in [0,1] instead of {0,1}.

Proposed Technique: EM Algorithm (cont.)
- All positive documents have the class value c_1
- We need to determine the class value of each document in the mixed set; EM can help by assigning a probabilistic class label, Pr[c_1|d_j] and Pr[c_2|d_j], to each document d_j in the mixed set
- After a number of iterations, all the probabilities converge

Proposed Technique: Step 1 - Reinitialization (I-EM)
Build an initial NB-C using the document sets M and P:
- For each document d_j in P: Pr[c_1|d_j] = 1 and Pr[c_2|d_j] = 0
- For each document d_j in M: Pr[c_1|d_j] = 0 and Pr[c_2|d_j] = 1
Loop while the classifier parameters change:
- For each document d_j in M:
  - Compute Pr[c_1|d_j] using the current NB-C
  - Set Pr[c_2|d_j] = 1 - Pr[c_1|d_j]
- Update Pr[w_t|c_1] and Pr[c_1] given the probabilistically assigned classes Pr[c_1|d_j] and the set P (a new NB-C is built in the process)
This works well on easy datasets; the problem is that the initialization is strongly biased towards positive documents.
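A hedged Python sketch of I-EM as described on this slide, reusing the hypothetical NaiveBayesText class from the earlier sketch; a fixed iteration cap with a small tolerance is my stand-in for "loop while the classifier parameters change":

```python
def i_em(P, M, vocab, max_iters=8, tol=1e-4):
    """I-EM: positives keep Pr[c1|d]=1 throughout; the mixed documents start as
    negatives (Pr[c2|d]=1) and are re-labeled probabilistically by EM."""
    docs = P + M
    labels = ([{"c1": 1.0, "c2": 0.0} for _ in P] +        # class P initialization
              [{"c1": 0.0, "c2": 1.0} for _ in M])         # class M initialization
    nb = None
    for _ in range(max_iters):
        nb = NaiveBayesText().fit(docs, labels, vocab)     # (re)build the NB-C, eqs. (2)-(3)
        new_labels = list(labels)
        for i in range(len(P), len(docs)):                 # E step, eq. (4): mixed docs only
            new_labels[i] = nb.predict_proba(docs[i])
        delta = max(abs(new_labels[i]["c1"] - labels[i]["c1"])
                    for i in range(len(P), len(docs)))
        labels = new_labels
        if delta < tol:                                    # parameters effectively unchanged
            break
    return nb, labels
```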

Proposed Technique: Step 1 - Spies
The problem is that the initialization is strongly biased towards positive documents, so we need to identify some very likely negative documents in the mixed set.
We do this by sending "spy" documents from the positive set P into the mixed set M (10% of P was used).
A threshold t is then set, and the documents with a probabilistic label less than t are identified as likely negative; 15% was the threshold used.
(Slide figure: the positive set with spies planted into the mixed set; after this step the mixed set splits into likely negative and unlabeled documents.)

Proposed Technique: Step 1 - Spies (cont.)
N (most likely negative docs) = {}
U (unlabeled docs) = {}
S (spies) = sample(P, s%)
MS = M ∪ S
P = P - S
Assign every document d_i in P the class c_1
Assign every document d_j in MS the class c_2
Run I-EM(MS, P)
Classify each document d_j in MS
Determine the probability threshold t using S
For each document d_j in M:
  If Pr[c_1|d_j] < t:
    N = N ∪ {d_j}
  Else:
    U = U ∪ {d_j}
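A sketch of this step in Python, built on the hypothetical i_em and NaiveBayesText helpers above. The 10% spy sample and the 15% figure come from the slides; interpreting the 15% as the fraction of spies allowed to fall below the threshold t is my assumption, as are all function names:

```python
import random

def step1_spies(P, M, vocab, spy_frac=0.10, noise=0.15, seed=0):
    """Step 1 of S-EM: plant spies from P into M, run I-EM, then use the spies'
    posteriors to split M into likely-negative (N) and remaining unlabeled (U)."""
    rng = random.Random(seed)
    S = rng.sample(P, max(1, int(spy_frac * len(P))))      # S = sample(P, s%)
    P_minus_S = [d for d in P if d not in S]               # P = P - S
    MS = M + S                                             # MS = M ∪ S
    _, labels = i_em(P_minus_S, MS, vocab)                 # run I-EM(MS, P)

    # Pr[c1|d] for every document in MS (same order as MS).
    probs = [labels[len(P_minus_S) + i]["c1"] for i in range(len(MS))]
    spy_probs = sorted(probs[len(M):])                     # spies are the tail of MS
    # Threshold t: the value below which `noise` of the spies fall (assumption:
    # this is what "15% was the threshold used" refers to).
    t = spy_probs[int(noise * len(spy_probs))]

    N, U = [], []
    for d, p in zip(M, probs[:len(M)]):
        (N if p < t else U).append(d)                      # Pr[c1|d] < t -> likely negative
    return N, U, S
```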

Proposed Technique: Step 2 - Building the Final Classifier
Using P, N and U as developed in the previous step:
- Put all the spy documents S back in P
- Assign Pr[c_1|d_i] = 1 for all documents in P
- Assign Pr[c_2|d_i] = 1 for all documents in N; this will change with each iteration of EM
- Each document d_k in U is not assigned a label initially; at the end of the first iteration, it will have a probabilistic label Pr[c_1|d_k]
- Run EM using the document sets P, N and U until it converges
When EM stops, the final classifier has been produced. This two-step technique is called S-EM (Spy EM).
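Continuing the sketch, a hedged version of Step 2; P below is the full positive set with the spies restored, and a fixed iteration cap plus tolerance again stands in for "until it converges". Returning one classifier per iteration anticipates the selection step on the next slide:

```python
def step2_final(P, N, U, vocab, max_iters=8, tol=1e-4):
    """Step 2 of S-EM: run EM over P (labels fixed), N (initially negative,
    re-estimated each round) and U (unlabeled at the start)."""
    # The first classifier is built from P and N only, since U starts unlabeled.
    hard = ([{"c1": 1.0, "c2": 0.0}] * len(P) +
            [{"c1": 0.0, "c2": 1.0}] * len(N))
    nb = NaiveBayesText().fit(P + N, hard, vocab)
    docs = P + N + U
    prev_tail, classifiers = None, []
    for _ in range(max_iters):
        # E step: probabilistic labels for N and U; documents in P keep Pr[c1|d]=1.
        tail = [nb.predict_proba(d) for d in docs[len(P):]]
        labels = [{"c1": 1.0, "c2": 0.0}] * len(P) + tail
        # M step: rebuild the classifier from all documents and their current labels.
        nb = NaiveBayesText().fit(docs, labels, vocab)
        classifiers.append(nb)
        if prev_tail and max(abs(a["c1"] - b["c1"])
                             for a, b in zip(tail, prev_tail)) < tol:
            break
        prev_tail = tail
    return classifiers      # candidate classifiers, one per EM iteration
```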

Proposed Technique: Selecting a Classifier
Because EM only reaches a local maximum, the final classifier may not cleanly separate the positive and negative documents. This is likely if the data contain many local clusters.
If so, from the set of classifiers developed over the iterations, select the one with the least estimated probability of error; refer to (1):
Pr[f(X) ≠ Y] = Pr[f(X)=1] - Pr[Y=1] + 2 Pr[f(X)=0 | Y=1] Pr[Y=1]
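A sketch of how that selection might be coded against the candidate list returned by the hypothetical step2_final above, using equation (1) directly; how Pr[Y=1] is estimated is left open (prior_pos is a caller-supplied estimate), and the 0.5 decision threshold is an assumption of this sketch rather than something stated on the slide:

```python
def select_classifier(classifiers, P, M, prior_pos):
    """Pick the candidate with the smallest estimated error per equation (1):
    Pr[f(X)!=Y] = Pr[f(X)=1] - Pr[Y=1] + 2 Pr[f(X)=0|Y=1] Pr[Y=1].
    Pr[f(X)=1] is estimated on the mixed set M, Pr[f(X)=0|Y=1] on the positive
    set P, and prior_pos stands in for the unknown Pr[Y=1]."""
    def positive_rate(nb, docs):
        return sum(nb.predict_proba(d)["c1"] >= 0.5 for d in docs) / len(docs)

    best, best_err = None, float("inf")
    for nb in classifiers:
        pr_f1 = positive_rate(nb, M)          # estimate of Pr[f(X)=1]
        miss = 1.0 - positive_rate(nb, P)     # estimate of Pr[f(X)=0 | Y=1]
        err = pr_f1 - prior_pos + 2.0 * miss * prior_pos
        if err < best_err:
            best, best_err = nb, err
    return best
```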

Evaluation Measurements
Breakeven point: the point where p - r = 0 (precision p equals recall r)
- Only evaluates the sorting order of the class probabilities of documents
- Not appropriate here
F score: F = 2pr / (p + r)
- Measures performance on a particular class
- Reflects the average effect of both precision and recall
- Only when both p and r are large will F be large
Accuracy
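A tiny illustration of the point about the F score (the helper name is mine):

```python
def f_score(p, r):
    """F = 2pr / (p + r): large only when precision and recall are both large."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# High precision cannot hide poor recall: f_score(0.9, 0.1) is about 0.18
```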

Evaluation Results
Two large document corpora:
- 20NG (UseNet headers and subject lines removed)
- WebKB (HTML tags removed)
8 EM iterations.
(Results table: average Pos size, M size, Pos in M, and the F score and accuracy of NB, I-EM8 and S-EM.)

Evaluation Results (cont.)
The percentage of positive documents was also varied, both in P (a%) and in M (b%): a=20%, b=20%; a=50%, b=20%; a=50%, b=50%.
(Results table: Pos size, M size, Pos in M, and the F score and accuracy of NB, I-EM8 and S-EM for each setting.)

Conclusions
This paper studied the problem of classification with only partial information: one labeled class and a set of mixed documents.
Technique:
- Naïve Bayes classifier
- Expectation-Maximization algorithm
- Reinitialized using the positive documents and the most likely negative documents to compensate for the initialization bias
- An estimate of the classification error is used to select a good classifier
Extremely accurate results.