Published by Hannah Douglas. Modified over 9 years ago.
1
Active Learning
An example from Xu et al., “Training SpamAssassin with Active Semi-Supervised Learning”
2
Semi-Supervised and Active Learning
– Semi-supervised learning: using a combination of labeled and unlabeled examples, or using partially labeled examples.
– Active learning: having the learning system decide which examples to ask an oracle to label.
3
SpamAssassin
SpamAssassin:
– Asks users to label e-mail, but they often don’t do it.
– Also, they may not label the “most informative” examples.
SpamAssassin “self-training”:
– Train a classifier on a small number of labeled examples.
– Run the classifier on the unlabeled examples. Add the ones classified with high confidence to the original training set. (Problem: the ones classified with high confidence are not necessarily the most informative ones.)
– Retrain the classifier with the new, larger training set.
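The self-training loop above can be sketched in a few lines. This is a minimal illustration and not SpamAssassin's actual code: the toy multinomial Naive Bayes model, the whitespace tokenizer, the function names (`train_nb`, `predict_proba`, `self_train`) and the `threshold` parameter are all assumptions made for the example.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Fit a toy multinomial Naive Bayes model (add-one smoothing)."""
    word_counts = {c: Counter() for c in set(labels)}
    class_counts = Counter(labels)
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def predict_proba(model, doc):
    """Return {class: posterior probability} for one document."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for c in class_counts:
        denom = sum(word_counts[c].values()) + len(vocab)
        score = math.log(class_counts[c] / total_docs)
        for w in doc.split():
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[c][w] + 1) / denom)
        log_scores[c] = score
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}

def self_train(labeled_docs, labels, unlabeled_docs, threshold=0.9):
    """One self-training round: keep confident predictions, then retrain."""
    model = train_nb(labeled_docs, labels)
    docs, labs, remaining = list(labeled_docs), list(labels), []
    for doc in unlabeled_docs:
        probs = predict_proba(model, doc)
        best = max(probs, key=probs.get)
        if probs[best] >= threshold:   # assumed confidence cut-off
            docs.append(doc)
            labs.append(best)
        else:
            remaining.append(doc)      # stays unlabeled
    return train_nb(docs, labs), docs, labs, remaining
```

The e-mails left in `remaining` are exactly the ones the classifier is unsure about, which illustrates the problem noted on the slide: the examples that self-training adds are those the model already handles well.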
4
Xu et al. paper: Method
Supervised learning:
– Train a Naive Bayes classifier on a small subset of (labeled) e-mails.
Semi-supervised learning:
– Then run SpamAssassin’s self-training method, selecting a large number of new examples to add to the training set. Retrain the classifier.
Active learning:
– Cluster the remaining unlabeled e-mails using k-means (on term-frequency feature vectors) with Euclidean distance.
– Select q representative unlabeled e-mails, first from “pure” clusters, then from “impure” clusters, making sure that many clusters are sampled from. The e-mails selected from each cluster are the ones closest to the cluster centroid.
– Ask the user to label these q examples.
– For each of these q examples, if the corresponding cluster is “pure”, propagate its label to a fraction p of that cluster.
– Add the newly labeled examples to the training set, and retrain the classifier.
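The active-learning step can be sketched as follows. This is an illustration under several assumptions that the slide does not confirm: a cluster counts as “pure” when the current classifier predicts the same label for all of its members, one e-mail is queried per cluster (so q is at most the number of clusters), the label is propagated to the members nearest the centroid, and k-means is seeded with the first k points. The names `kmeans`, `select_and_propagate`, `predicted` and `oracle` are hypothetical.

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means with Euclidean distance; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign

def select_and_propagate(points, predicted, k, q, p, oracle):
    """Query up to q centroid-nearest points (pure clusters first) and
    propagate each answer to a fraction p of its cluster if that cluster is pure."""
    centroids, assign = kmeans(points, k)
    clusters = []
    for j in range(k):
        idx = [i for i, a in enumerate(assign) if a == j]
        if idx:
            # Order members from most to least central.
            idx.sort(key=lambda i: math.dist(points[i], centroids[j]))
            pure = len({predicted[i] for i in idx}) == 1  # assumed purity test
            clusters.append((pure, idx))
    clusters.sort(key=lambda c: not c[0])  # stable sort: pure clusters first
    new_labels = {}
    for pure, idx in clusters[:q]:
        label = oracle(idx[0])             # ask the user about the most central e-mail
        new_labels[idx[0]] = label
        if pure:
            n_prop = int(p * len(idx))     # fraction p of the cluster
            for i in idx[:n_prop]:
                new_labels[i] = label
    return new_labels
```

Here `points` stands for the term-frequency vectors, `predicted` for the current classifier's labels, and `oracle(i)` for the user's answer on e-mail `i`. The returned dictionary maps e-mail indices to labels that would be added to the training set before retraining.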
5
Xu et al. paper: Results
Ran on a large corpus (75K) of e-mails.