Text Classification With Labeled and Unlabeled Data
Presenter: Aleksandar Milisic
Supervisor: Dr. David Albrecht
Overview
- Text Classification – What and Why?
- Text Clustering
- Support Vector Machines (SVMs) with Cluster Features
- Course of Project
- Results
- Conclusion
- Future Work
Text Classification – What and Why?
- Text Classification – assigning documents to predefined classes (categories)
- Labeling manually is time-consuming and sometimes impossible – the process needs to be automated
- To minimize labeling, automated text classifiers need to be able to utilize unlabeled data
How Does It Work?
- Text documents are represented using feature vectors
- Documents (both labeled and unlabeled) are clustered into groups of similar documents
  - features representing each document's relationship to the created clusters are added to the feature vectors
- The augmented feature vectors are then classified by a Support Vector Machine (SVM) (Raskutti et al., 2002)
- This novel approach was the basis of this project
Representing Text
Example text: "… With paperless offices becoming more common, companies start using document databases with classification schemes …"
Feature vector (word counts): Companies 1, Document 3, Distance 0, ..., Offices 1, Unix 0, Match 0
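The word-count representation on this slide can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function name `word_count_vector`, the tokenization rule, and the lowercase vocabulary are assumptions, and no stemming is applied. The counts differ from the slide because only an excerpt of the example document is visible.

```python
import re
from collections import Counter

def word_count_vector(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    # Assumed tokenization: lowercase alphabetic runs, no stemming.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Vocabulary taken from the six words shown on the slide.
vocabulary = ["companies", "document", "distance", "offices", "unix", "match"]
excerpt = ("With paperless offices becoming more common, companies start using "
           "document databases with classification schemes")

vector = word_count_vector(excerpt, vocabulary)
print(dict(zip(vocabulary, vector)))
```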
Clustering
[Figure: labeled and unlabeled feature vectors grouped into clusters]
SVMs With Cluster Features
[Figure: labeled feature vectors (Class 1), labeled feature vectors (Not Class 1) and unlabeled feature vectors, with added features, shown with the support vectors and the separating hyperplane]
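Once trained, a linear SVM classifies a feature vector by which side of the separating hyperplane it falls on. The sketch below shows only that decision rule; training is not shown, and the weights, bias and example vectors are made-up values for illustration.

```python
def svm_decision(weights, bias, features):
    """Return +1 (Class 1) or -1 (Not Class 1) for a linear decision rule."""
    # Signed score relative to the hyperplane w . x + b = 0.
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else -1

# Hypothetical weights: two word-frequency features plus one added cluster feature.
weights = [1.0, -1.0, 0.5]
bias = -0.25

print(svm_decision(weights, bias, [1, 0, 1]))
print(svm_decision(weights, bias, [0, 1, 0]))
```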
Augmented Feature Vector
[Figure: augmented feature vector = original word frequencies + added features]
Examples of added features:
- binary “closest cluster” indicator
- similarity to cluster centroids
- etc.
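One possible way to build such an augmented vector is sketched below. It assumes cosine similarity as the similarity measure and a fixed list of cluster centroids; the slides do not say which measure was used, and the names `cosine` and `augment` are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def augment(vector, centroids):
    """Append a binary closest-cluster indicator and per-centroid similarities."""
    sims = [cosine(vector, c) for c in centroids]
    closest = max(range(len(sims)), key=sims.__getitem__)
    indicator = [1 if i == closest else 0 for i in range(len(centroids))]
    return list(vector) + indicator + sims

# Two hypothetical centroids in a two-word vocabulary.
centroids = [[1, 0], [0, 1]]
augmented = augment([1, 0], centroids)
print(augmented)
```

Here the result is the two original word frequencies, followed by two indicator values and two similarities, one per cluster.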
Areas of Investigation
Investigated the following questions:
- The value of added cluster features
- Performance of different clustering algorithms:
  - Single-Pass (Raskutti et al., 2002)
  - Snob (Wallace and Boulton, 1968)
- Transductive Support Vector Machines (TSVMs) with clustering
- Different factors influencing performance
  - type of features, number of clusters etc.
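For readers unfamiliar with single-pass clustering, the following is a generic sketch of the idea: each document is seen once and either joins the most similar existing cluster or starts a new one. It is not necessarily the variant of Raskutti et al. (2002); the cosine measure, the running-mean centroid update and the `threshold` parameter are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def single_pass_cluster(vectors, threshold):
    """Cluster vectors in one pass; return (assignments, centroids)."""
    centroids, sizes, assignments = [], [], []
    for v in vectors:
        # Find the most similar centroid whose similarity reaches the threshold.
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = cosine(v, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            # No cluster is similar enough: start a new one at this vector.
            centroids.append([float(x) for x in v])
            sizes.append(1)
            best = len(centroids) - 1
        else:
            # Update the chosen centroid as a running mean of its members.
            n = sizes[best]
            centroids[best] = [(c * n + x) / (n + 1)
                               for c, x in zip(centroids[best], v)]
            sizes[best] = n + 1
        assignments.append(best)
    return assignments, centroids

vectors = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
assignments, centroids = single_pass_cluster(vectors, threshold=0.8)
print(assignments, len(centroids))
```

The result of a single-pass method depends on the order in which documents are presented, which is one reason several variants may be worth testing.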
Course of Project
- Implemented Single-Pass clustering algorithm and tested with SVMs
  - using variations on number and type of features added
- Combined Snob with SVMs
  - using different attribute types
- Tested the approaches on two different data sets
  - with random splits containing 1%, 5% and 10% labeled data out of the whole training set
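A random split of this kind could be produced along the following lines. This is a sketch only: the function name `split_labeled`, the seed, and the training-set size of 1000 are placeholders, not values from the project.

```python
import random

def split_labeled(n_documents, labeled_fraction, seed=0):
    """Randomly pick which training documents keep their labels.

    Returns (labeled_indices, unlabeled_indices), both sorted.
    """
    rng = random.Random(seed)
    indices = list(range(n_documents))
    rng.shuffle(indices)
    # Keep at least one labeled document even for tiny training sets.
    n_labeled = max(1, round(n_documents * labeled_fraction))
    return sorted(indices[:n_labeled]), sorted(indices[n_labeled:])

# Hypothetical training-set size, with the three fractions from the slide.
for fraction in (0.01, 0.05, 0.10):
    labeled, unlabeled = split_labeled(1000, fraction)
    print(fraction, len(labeled), len(unlabeled))
```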
Initial Results …
- Initial results showed that adding cluster features actually degraded SVM performance.
- Various slightly modified versions of the Single-Pass clustering algorithm, as well as Snob, were tested, all giving negative results when combined with SVMs.
- However, one approach showed an improvement...
Partitioning the Data
- Training set is divided into k partitions, with each partition clustered separately
  - features are added to documents relative to the k sets of clusters
  - k partitions means k × the number of cluster features
  - used k = 5 in experiments
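The bookkeeping behind "k × the number of cluster features" can be sketched as follows. The round-robin partitioning, the cosine similarity features, and the stand-in clustering function `mean_centroid` (one cluster per partition) are all assumptions made to keep the example short; the project used Single-Pass and Snob clustering.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def partition(items, k):
    """Split items into k partitions (round-robin; an assumed scheme)."""
    return [items[i::k] for i in range(k)]

def mean_centroid(vectors):
    """Stand-in clustering: a single cluster whose centroid is the mean vector."""
    n = len(vectors)
    return [[sum(column) / n for column in zip(*vectors)]]

def partitioned_cluster_features(vectors, k, cluster_fn):
    """Cluster each partition separately, then add features for all k cluster sets."""
    centroid_sets = [cluster_fn(part) for part in partition(vectors, k)]
    augmented = []
    for v in vectors:
        # One similarity per centroid, across every partition's clusters.
        added = [cosine(v, c) for centroids in centroid_sets for c in centroids]
        augmented.append(list(v) + added)
    return augmented

vectors = [[i + 1, 10 - i] for i in range(10)]
augmented = partitioned_cluster_features(vectors, k=5, cluster_fn=mean_centroid)
print(len(vectors[0]), len(augmented[0]))
```

With k = 5 and one cluster feature per partition, each document gains 5 added features on top of its 2 original ones.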
Results
[Chart: axis labels "Labeled Data" and "Average number of bits for test set of size 600"]
Results (cont.)
[Chart: axis labels "Labeled Data" and "Average number of bits for test set of size 3299"]
Conclusion
- Results suggest that performance of SVMs depends on:
  - number of features
  - type of features
  - clustering method
- Partitioning the data:
  - improves the quality of the features
  - improves overall performance
- Issues with use of Snob and clustering in general in text classification
Future Work
- Extending the SVM+Cluster approach to multi-labeled classification
- Investigating new sets of cluster features
- Determining:
  - optimal number of clusters used for adding cluster features
  - optimal number of partitions
- Investigating better methods of using Snob in text classification
References
- Raskutti, B., Ferra, H. and Kowalczyk, A. (2002). “Using Unlabeled Data for Text Classification through Addition of Cluster Parameters”, in International Conference on Machine Learning (Accepted).
- Wallace, C.S. and Boulton, D.M. (1968). “An Information Measure for Classification”, Computer Journal, Vol. 11, No. 2, pp