1
Text Classification With Support Vector Machines
Presenter: Aleksandar Milisic
Supervisor: Dr. David Albrecht
2
Overview
- Text Classification – What and Why?
- Text Clustering
- Support Vector Machines
- Current Techniques
- Project Aim and Plan
3
Text Classification – What and Why?
- Text Classification: assigning documents to predefined classes (categories).
- Example: Web pages can be assigned to “politics”, “sport”, “business”, “entertainment” etc.
- There are thousands of categories associated with web pages.
- Labeling manually is time-consuming and sometimes impossible, so the process needs to be automated!
4
Text Classification – What and Why?
Automated text classifiers need to be able to learn from:
- Small set of labeled documents
- Large set of unlabeled documents
Otherwise a lot of labeling would have to be done by humans.
So how is it done?
5
Representing Text
[Figure: the document excerpt “…With paperless offices becoming more common, companies start using document databases with classification schemes…” is mapped to a Feature Vector with one numeric entry per term (Companies, Document, Distance, Offices, Unix, Match, ...).]
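A minimal sketch of how such a feature vector could be built, assuming plain term counts over a fixed vocabulary. The function name `feature_vector`, the tokenization rule, and the use of raw counts are illustrative assumptions; the slide does not say which weighting the project uses, and the counts below come from the short excerpt only, not from the full document shown on the slide.

```python
import re
from collections import Counter

def feature_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in text (case-insensitive)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # One entry per vocabulary term, in vocabulary order; absent terms count 0.
    return [counts[term] for term in vocabulary]

# Terms taken from the slide; a real system would derive them from the corpus.
VOCABULARY = ["companies", "document", "distance", "offices", "unix", "match"]

EXCERPT = ("With paperless offices becoming more common, companies start using "
           "document databases with classification schemes")

vector = feature_vector(EXCERPT, VOCABULARY)
print(dict(zip(VOCABULARY, vector)))
```

Each document becomes a point in a space with one dimension per term, which is the representation the clustering and SVM slides operate on.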
6
Clustering Feature Vectors
[Figure: feature vectors of labeled documents and unlabeled documents.]
7
Support Vector Machines (SVM)
- Binary classifiers
- Maximize the distance between two classes (find the Optimal Separating Hyperplane, OSH)
- Support vectors are the points closest to the OSH
[Figure: the OSH separating Class 1 from Not Class 1, with the Support Vectors marked.]
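The slide states the idea in words. For reference, the standard hard-margin formulation is sketched below, assuming linearly separable training vectors $x_i$ with labels $y_i \in \{-1, +1\}$ (this notation is not from the slides):

```latex
\begin{aligned}
\min_{w,\,b}\quad & \tfrac{1}{2}\,\lVert w \rVert^{2} \\
\text{subject to}\quad & y_i\,(w \cdot x_i + b) \ge 1 \qquad \text{for all } i .
\end{aligned}
```

The separating hyperplane is $w \cdot x + b = 0$, the distance between the two classes (the margin) is $2 / \lVert w \rVert$, and the support vectors are the training points for which the constraint holds with equality.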
8
Current Techniques
Clustering Methods
- Rasmussen’s Single Pass Algorithm (as described by Raskutti et al. (2002))
- Reallocation Method
- Hierarchical Methods
Classification Methods
- Support Vector Machines
- Co-Training Algorithm (Blum and Mitchell, 1998)
Raskutti et al. (2002) describe an interesting approach: combining SVMs with Rasmussen’s clustering algorithm.
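To make the single-pass idea concrete, here is a rough sketch of a generic single-pass clustering scheme: each vector is seen once and either joins the most similar existing cluster or starts a new one. The cosine similarity measure, the `threshold` value, and the running-mean centroid update are assumptions for illustration; the slides do not give the details of the variant described by Raskutti et al. (2002).

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def single_pass_cluster(vectors, threshold=0.5):
    """Assign each vector, in order, to the most similar centroid or a new cluster."""
    centroids = []    # running mean vector of each cluster
    sizes = []        # number of members in each cluster
    assignments = []  # cluster index chosen for each input vector
    for v in vectors:
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = cosine(v, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            # Nothing is similar enough: start a new cluster at this vector.
            centroids.append(list(v))
            sizes.append(1)
            best = len(centroids) - 1
        else:
            # Fold the vector into the running mean of the chosen cluster.
            n = sizes[best]
            centroids[best] = [(c * n + x) / (n + 1)
                               for c, x in zip(centroids[best], v)]
            sizes[best] = n + 1
        assignments.append(best)
    return assignments, centroids

assignments, centroids = single_pass_cluster(
    [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]])
print(assignments)
```

Because every document is visited only once, the result depends on input order and on the threshold, which is the usual trade-off of single-pass methods against reallocation or hierarchical ones.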
9
Combining SVM With Clustering
[Figure labels: Added Features; Labeled documents (Class 1); Labeled documents (Not Class 1); Unlabeled documents; Support Vectors; Separating Hyperplane.]
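One way to picture the "added features" is to append cluster-derived values to each document's feature vector before training the SVM. The sketch below assumes the added values are cosine similarities to each cluster centroid; that choice, the function name `add_cluster_features`, and the example centroids are hypothetical, since the slides do not say which cluster parameters Raskutti et al. (2002) add.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def add_cluster_features(vector, centroids):
    """Return the original features followed by one similarity per cluster."""
    return list(vector) + [cosine(vector, c) for c in centroids]

# Hypothetical centroids, e.g. obtained by clustering labeled and unlabeled documents.
CENTROIDS = [[1, 0], [0, 1]]

augmented = add_cluster_features([1, 0], CENTROIDS)
print(augmented)
```

The clusters are built from labeled and unlabeled documents together, so the extra features carry information from the unlabeled set into the supervised classifier.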
10
Project Aim
Resolve the following issues:
- Can combining SVMs with other techniques improve performance?
- Documents have thousands of features: can different feature representation (selection) techniques improve performance without affecting accuracy?
- Documents can belong to multiple classes, but SVMs are binary classifiers!
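A common way to handle documents with multiple classes using binary classifiers is one-vs-rest: train one binary classifier per class and assign every class whose classifier fires. The sketch below illustrates only that decomposition and is not necessarily the approach the project takes. The binary learner is a simple perceptron used as a stand-in (it is not an SVM), and all function names and the toy data are hypothetical.

```python
def train_perceptron(vectors, labels, epochs=20):
    """Stand-in binary learner (NOT an SVM); labels must be +1 or -1."""
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified or on the boundary: update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def train_one_vs_rest(vectors, label_sets, classes, train_binary=train_perceptron):
    """Train one 'class c vs. not class c' model per class."""
    return {
        c: train_binary(vectors, [1 if c in labels else -1 for labels in label_sets])
        for c in classes
    }

def predict_classes(models, x):
    """Return every class whose binary model gives a positive score."""
    return {
        c for c, (w, b) in models.items()
        if sum(wi * xi for wi, xi in zip(w, x)) + b > 0
    }

# Toy data: the third document belongs to both classes.
VECTORS = [[1, 0], [0, 1], [1, 1]]
LABEL_SETS = [{"sport"}, {"politics"}, {"sport", "politics"}]

models = train_one_vs_rest(VECTORS, LABEL_SETS, ["sport", "politics"])
print(predict_classes(models, [1, 1]))
```

Any binary trainer with the same `(vectors, labels)` interface could be passed as `train_binary`, which is where an SVM would slot in.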
11
Project Plan
- Currently implementing the clustering technique described in Raskutti et al. (2002)
- Plan to implement other clustering techniques
- Investigate different feature representation (selection) techniques, for example different weights for words in different positions in the document
- Investigate the multi-class problem
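As an illustration of position-dependent weights, the sketch below counts each word with a weight that depends on the section it appears in. The section names, the weight values, and the function `weighted_counts` are hypothetical; the slides only propose investigating such schemes and do not specify one.

```python
import re
from collections import defaultdict

def weighted_counts(sections, weights):
    """Sum per-word weights, where each occurrence counts its section's weight."""
    totals = defaultdict(float)
    for name, text in sections.items():
        weight = weights.get(name, 1.0)  # sections without a weight count as 1.0
        for token in re.findall(r"[a-z]+", text.lower()):
            totals[token] += weight
    return dict(totals)

# Hypothetical document and weights: title words count three times as much.
DOC = {
    "title": "Document classification",
    "body": "Companies classify each document",
}
WEIGHTS = {"title": 3.0, "body": 1.0}

features = weighted_counts(DOC, WEIGHTS)
print(features)
```

Here "document" scores 4.0 (once in the title, once in the body), while a body-only word such as "companies" scores 1.0.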
12
References
Blum, A. and T. Mitchell (1998). Combining labeled and unlabeled data with co-training. In COLT: Proceedings of the Workshop on Computational Learning Theory. Morgan Kaufmann Publishers.
Raskutti, B., H. Ferra, and A. Kowalczyk (2002). Using unlabeled data for text classification through addition of cluster parameters. In International Conference on Machine Learning (accepted).