PEBL: Web Page Classification without Negative Examples
Hwanjo Yu, Jiawei Han, and Kevin Chen-Chuan Chang
IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 1, 2004
Presented by Chirayu Wongchokprasitti
Introduction
- Web page classification is one of the main techniques for Web mining
- Constructing a classifier requires positive and negative training examples
- Negative training examples are laborious to collect and must be chosen with care to avoid bias
Typical Learning Framework
Positive Example Based Learning (PEBL) Framework
- Learns from positive data and unlabeled data
- Unlabeled data are random samples of the universal set
- Applies the Mapping-Convergence (M-C) algorithm
Mapping-Convergence (M-C) Algorithm
- Divided into two stages
- Mapping stage
  - Uses any classifier that does not generate false negatives
  - The authors chose 1-DNF (monotone Disjunctive Normal Form)
- Convergence stage
  - Uses a classifier that maximizes the margin
  - The authors chose SVM (Support Vector Machine)
Mapping Stage
- Uses a weak classifier to draw an initial approximation of the “strong” negative data
- First, identify strong positive features by comparing feature frequencies in the positive and unlabeled data
- A feature is a strong positive if its frequency in the positive data is larger than its frequency in the universal (unlabeled) data
- Then filter out every unlabeled example that could be positive, leaving only the strong negatives
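The mapping stage can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each page is represented as a set of word features, that "frequency" means the fraction of documents containing a feature, and that a "possible positive" is any unlabeled page containing at least one strong positive feature. The function names and the toy data are invented for this example.

```python
from collections import Counter


def document_frequency(docs):
    """Fraction of documents in which each feature appears."""
    if not docs:
        return {}
    counts = Counter()
    for doc in docs:
        counts.update(set(doc))  # count each feature once per document
    return {feature: n / len(docs) for feature, n in counts.items()}


def strong_positive_features(positive_docs, unlabeled_docs):
    """Features that are more frequent in the positive set than in the unlabeled set."""
    pos_freq = document_frequency(positive_docs)
    unl_freq = document_frequency(unlabeled_docs)
    return {f for f, freq in pos_freq.items() if freq > unl_freq.get(f, 0.0)}


def mapping_stage(positive_docs, unlabeled_docs):
    """Split the unlabeled set into strong negatives and remaining candidates.

    Assumption: an unlabeled document is a strong negative only if it
    contains none of the strong positive features.
    """
    features = strong_positive_features(positive_docs, unlabeled_docs)
    strong_negatives = [d for d in unlabeled_docs if not features & set(d)]
    remaining = [d for d in unlabeled_docs if features & set(d)]
    return features, strong_negatives, remaining


# Toy data, invented for illustration only.
positive = [
    {"faculty", "research", "homepage"},
    {"faculty", "teaching"},
    {"research", "publications", "homepage"},
]
unlabeled = [
    {"faculty", "research"},
    {"price", "cart", "homepage"},
    {"price", "shipping"},
    {"news", "sports"},
]

features, strong_negatives, remaining = mapping_stage(positive, unlabeled)
print(sorted(features))
print(strong_negatives)
```

On this toy data the two pages that share no feature with the positive set are kept as strong negatives, while the shopping page that happens to contain "homepage" stays among the candidates. That is the intended conservative behavior: the filter prefers leaving a true negative unresolved over discarding a true positive.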
Convergence Stage
- Uses SVM to narrow the class boundary
- Iterates SVM training, each round extracting more negative data from the unlabeled data
- The boundary converges toward the true boundary
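The iteration can be sketched as a loop that retrains a classifier and moves newly rejected examples into the negative set until nothing more is rejected. This is a hedged sketch, not the authors' code: the `train` callback, the function names, and the one-dimensional toy data are assumptions made for illustration. To keep the example dependency-free, `train_midpoint` is a stand-in for SVM: on one-dimensional separable data it places the boundary halfway between the closest positive and negative points, which is the maximum-margin choice in that special case.

```python
def convergence_stage(positives, strong_negatives, remaining, train):
    """Grow the negative set until the classifier extracts no new negatives.

    `train(pos, neg)` must return a function `predict(x)` that is True
    when x is classified as positive.
    """
    negatives = []
    new_negatives = list(strong_negatives)
    candidates = list(remaining)
    predict = None  # stays None if the mapping stage found no strong negatives
    iterations = 0
    while new_negatives:
        negatives.extend(new_negatives)
        predict = train(positives, negatives)
        # Candidates rejected by the current boundary become new negatives.
        new_negatives = [x for x in candidates if not predict(x)]
        candidates = [x for x in candidates if predict(x)]
        iterations += 1
    return predict, negatives, iterations


def train_midpoint(pos, neg):
    """Toy stand-in for SVM on 1-D data with positives to the right of negatives."""
    boundary = (min(pos) + max(neg)) / 2
    return lambda x: x > boundary


# Toy 1-D data, invented for illustration only.
positives = [8, 9, 10]
strong_negatives = [0, 1]
remaining = [2, 3, 4, 5, 8.5, 9.5]

predict, negatives, iterations = convergence_stage(
    positives, strong_negatives, remaining, train_midpoint
)
print(iterations, negatives)
```

In this toy run the boundary moves from 4.5 to 6 to 6.5 over three rounds, and the negative set grows from the two strong negatives to every point up to 5. With a real SVM the same loop applies, with `train` wrapping the SVM training call.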
Support Vector Machines
[Figure: visualization of a Support Vector Machine]
Convergence of SVM
Data Flow Diagram
Experimental Results
- Results are reported as the precision-recall breakeven point (P-R)
- Experiment 1: the Internet
  - Uses DMOZ as the universal set
- Experiment 2: University CS department
  - Uses the WebKB data set
- Mixture Models
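The precision-recall breakeven point is the value at which precision equals recall. One common way to compute it, sketched below, is to rank the test examples by classifier score and take the top k, where k is the number of true positives in the test set; at that cutoff precision and recall are both TP/k. The function name and the toy scores are assumptions for illustration, and the paper may locate the breakeven point differently (for example by sweeping a decision threshold).

```python
def pr_breakeven(scores, labels):
    """Precision (= recall) when the top-k examples are predicted positive,
    with k equal to the number of true positives in `labels`.

    `labels` holds 1 for positive and 0 for negative. Ties in `scores`
    are broken by input order, which is an arbitrary choice.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("breakeven point is undefined without positive examples")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = sum(label for _, label in ranked[:n_pos])
    return true_positives / n_pos


# Toy scores, invented for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(pr_breakeven(scores, labels))
```

Here three examples are truly positive, and two of the three highest-scored examples are among them, so both precision and recall equal 2/3 at the breakeven cutoff.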
Experiment 1
Experiment 2
Mixture Models
Summary and Conclusions
- The PEBL framework eliminates the need to collect negative training examples manually
- The Mapping-Convergence (M-C) algorithm achieves classification accuracy as high as that of traditional SVM
- PEBL still needs a faster training procedure, since the convergence stage retrains the SVM at every iteration