Text Classification With Support Vector Machines Presenter: Aleksandar Milisic Supervisor: Dr. David Albrecht
Overview Text Classification – What and Why? Text Clustering Support Vector Machines Current Techniques Project Aim and Plan
Text Classification – What and Why? Text Classification – assigning documents to predefined classes (categories). Example: Web pages can be assigned to “politics”, “sport”, “business”, “entertainment” etc. There are thousands of categories associated with web pages. Labeling manually is time-consuming and sometimes impossible – the process needs to be automated!
Text Classification – What and Why? Automated text classifiers need to be able to learn from a small set of labeled documents and a large set of unlabeled documents. Otherwise, a lot of labeling would have to be done by humans. So how is it done?
Representing Text [Figure: a sample document ("…With paperless offices becoming more common, companies start using document databases with classification schemes…") is converted into a feature vector of word counts for terms such as Companies, Document, Distance, Offices, Unix, Match.]
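A minimal sketch of this representation, assuming a simple bag-of-words count over an illustrative vocabulary (not the actual feature set used in the project):

from collections import Counter

# Illustrative vocabulary; a real system would build this from the training corpus.
vocabulary = ["companies", "document", "distance", "offices", "match", "unix"]

def to_feature_vector(text):
    # Lower-case, split on whitespace, strip punctuation, then count terms.
    tokens = [w.strip(".,…").lower() for w in text.split()]
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

doc = ("With paperless offices becoming more common, companies start using "
       "document databases with classification schemes")
print(to_feature_vector(doc))   # -> [1, 1, 0, 1, 0, 0]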
Clustering Feature Vectors [Figure: feature vectors of labeled and unlabeled documents grouped into clusters.]
Support Vector Machines (SVM) Binary classifiers. Maximize the distance (margin) between two classes by finding the Optimal Separating Hyperplane (OSH). Support vectors are the training points closest to the OSH. [Figure: the OSH separating "Class 1" from "Not Class 1", with support vectors marked.]
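Since the project's SVM implementation is not specified here, the sketch below uses scikit-learn's LinearSVC purely as a stand-in to show how a binary text classifier is trained on term-count vectors; the toy documents and labels are illustrative assumptions.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

train_docs   = ["stock markets rise", "parliament passes bill",
                "team wins final", "minister resigns"]
train_labels = [0, 1, 0, 1]          # 1 = "politics", 0 = "not politics"

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_docs)   # documents -> term-count feature vectors

svm = LinearSVC()                          # fits a maximum-margin separating hyperplane
svm.fit(X, train_labels)

print(svm.predict(vectorizer.transform(["budget bill debated in parliament"])))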
Current Techniques Clustering Methods: Rasmussen's Single Pass Algorithm (as described by Raskutti et al. (2002)), Reallocation Method, Hierarchical Methods. Classification Methods: Support Vector Machines, Co-Training Algorithm (Blum and Mitchell, 1998). Raskutti et al. (2002) describe an interesting approach – combining SVMs with Rasmussen's clustering algorithm.
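As a rough illustration of the single-pass idea only (not Rasmussen's algorithm as described by Raskutti et al., whose thresholding and cluster representation differ), the sketch below assigns each vector to the most similar existing cluster by cosine similarity, or starts a new cluster when nothing is similar enough; the threshold is an arbitrary assumption.

import numpy as np

def single_pass_cluster(vectors, threshold=0.5):
    # Assign each vector to the most similar existing centroid,
    # or start a new cluster if no centroid is similar enough.
    centroids, assignments = [], []
    for v in vectors:
        v = np.asarray(v, dtype=float)
        sims = [np.dot(v, c) / (np.linalg.norm(v) * np.linalg.norm(c) + 1e-12)
                for c in centroids]
        if sims and max(sims) >= threshold:
            k = int(np.argmax(sims))
            centroids[k] = (centroids[k] + v) / 2.0   # crude centroid update
        else:
            k = len(centroids)
            centroids.append(v)
        assignments.append(k)
    return assignments

print(single_pass_cluster([[1, 0], [0.9, 0.1], [0, 1]]))   # -> [0, 0, 1]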
Combining SVM With Clustering [Figure: labeled documents (Class 1 and Not Class 1) and unlabeled documents plotted with the separating hyperplane and support vectors; cluster information supplies added features.]
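One simplified way to picture the "added features" idea is to append cluster-membership indicators to each document vector before training the SVM. This is only an illustrative assumption; Raskutti et al. (2002) derive their cluster parameters differently.

import numpy as np

def add_cluster_features(X, assignments, n_clusters):
    # Append one-hot cluster-membership indicators to each feature vector.
    X = np.asarray(X, dtype=float)
    onehot = np.zeros((X.shape[0], n_clusters))
    onehot[np.arange(X.shape[0]), assignments] = 1.0
    return np.hstack([X, onehot])

X = [[1, 0, 2], [0, 3, 1]]          # original term-count vectors
print(add_cluster_features(X, assignments=[1, 0], n_clusters=2))
# [[1. 0. 2. 0. 1.]
#  [0. 3. 1. 1. 0.]]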
Project Aim Resolve the following issues: Can combining SVMs with other techniques improve performance? Documents have thousands of features: can different feature representation (selection) techniques improve performance without affecting accuracy? Documents can belong to multiple classes, but SVMs are binary classifiers!
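For the multi-class issue, one common workaround (shown here only for illustration, not necessarily the approach the project will take) is to train one binary SVM per class, "one-vs-rest"; scikit-learn is again just a stand-in.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs    = ["election results announced", "championship match tonight",
           "company profits fall", "prime minister visits stadium"]
classes = {"politics": [1, 0, 0, 1], "sport": [0, 1, 0, 1], "business": [0, 0, 1, 0]}

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# One binary SVM per class; a document may be accepted by several of them.
models = {name: LinearSVC().fit(X, labels) for name, labels in classes.items()}

test = vectorizer.transform(["minister comments on election"])
print([name for name, m in models.items() if m.predict(test)[0] == 1])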
Project Plan Currently implementing the clustering technique described in Raskutti et al. (2002). Plan to implement other clustering techniques. Investigate different feature representation (selection) techniques, for example, different weights for words in different positions in a document. Investigate the multi-class problem.
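As a toy example of position-dependent weighting (the cut-off and weights below are arbitrary illustrative assumptions, not a technique chosen for the project), words near the start of a document could simply count more than later occurrences:

from collections import Counter

def weighted_counts(tokens, lead_words=10, lead_weight=2.0):
    # Words among the first `lead_words` tokens (e.g. title or first sentence)
    # receive a higher weight than later occurrences.
    weights = Counter()
    for i, token in enumerate(tokens):
        weights[token.lower()] += lead_weight if i < lead_words else 1.0
    return weights

print(weighted_counts("Sport news today the big match result".split(), lead_words=3))
# 'sport', 'news', 'today' get weight 2.0; the remaining words get 1.0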
References Blum, A. and Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In COLT: Proceedings of the Workshop on Computational Learning Theory. Morgan Kaufmann Publishers. Raskutti, B., Ferra, H. and Kowalczyk, A. (2002). Using unlabeled data for text classification through addition of cluster parameters. In Proceedings of the International Conference on Machine Learning (accepted).