Active Learning: An Example from Xu et al., "Training SpamAssassin with Active Semi-Supervised Learning"
Semi-Supervised and Active Learning
– Semi-supervised learning: using a combination of labeled and unlabeled examples, or using partially labeled examples.
– Active learning: having the learning system decide which examples to ask an oracle to label.
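To make the active-learning definition concrete, here is a minimal sketch of a generic pool-based active-learning loop. It uses uncertainty sampling as the query criterion, which is just one possible choice; the Xu et al. paper instead selects queries by clustering (described on a later slide). Names such as oracle_label, rounds, and q are illustrative assumptions, and binary 0/1 (ham/spam) labels are assumed.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def active_learning_loop(X_labeled, y_labeled, X_pool, oracle_label, rounds=5, q=10):
    clf = MultinomialNB()
    for _ in range(rounds):
        clf.fit(X_labeled, y_labeled)
        # Query the q pool examples whose spam probability is closest to 0.5,
        # i.e. the ones the current classifier is least sure about.
        proba = clf.predict_proba(X_pool)[:, 1]
        query = np.argsort(np.abs(proba - 0.5))[:q]
        # Ask the oracle (e.g. the user) to label the selected examples.
        new_y = np.array([oracle_label(x) for x in X_pool[query]])
        X_labeled = np.vstack([X_labeled, X_pool[query]])
        y_labeled = np.concatenate([y_labeled, new_y])
        X_pool = np.delete(X_pool, query, axis=0)   # remove queried examples from the pool
    return clf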
SpamAssassin
SpamAssassin:
– Asks users to label emails, but they often don't do it.
– Also, they may not label the "most informative" examples.
SpamAssassin "self-training":
– Train a classifier on a small number of labeled examples.
– Run it on the unlabeled examples. Add the ones classified with high confidence to the original training set. (Problem: the ones classified with high confidence are not necessarily the most informative ones.)
– Retrain the classifier with the new, larger training set.
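A minimal sketch of the self-training loop described above, assuming a scikit-learn Naive Bayes classifier over term-frequency feature matrices and 0/1 (ham/spam) labels. The confidence threshold of 0.95 is an illustrative choice, not SpamAssassin's actual setting.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.95):
    # Step 1: train on the small labeled set.
    clf = MultinomialNB()
    clf.fit(X_labeled, y_labeled)

    # Step 2: classify the unlabeled examples and keep only the high-confidence ones.
    proba = clf.predict_proba(X_unlabeled)
    confident = proba.max(axis=1) >= threshold          # illustrative threshold
    pseudo_labels = clf.classes_[proba.argmax(axis=1)][confident]

    # Step 3: add the pseudo-labeled examples to the training set and retrain.
    X_train = np.vstack([X_labeled, X_unlabeled[confident]])
    y_train = np.concatenate([y_labeled, pseudo_labels])
    clf.fit(X_train, y_train)
    return clf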
Xu et al. paper: Method
Supervised learning:
– Train a Naive Bayes classifier on a small subset of (labeled) emails.
Semi-supervised learning:
– Then run SpamAssassin's self-training method, selecting a large number of new examples to add to the training set. Retrain the classifier.
Active learning:
– Cluster the remaining unlabeled emails using k-means (on term-frequency feature vectors) with Euclidean distance.
– Select q representative unlabeled emails, first from "pure" clusters, then from "impure" clusters, making sure that many clusters are sampled from. The emails selected from each cluster are the ones closest to the cluster centroids.
– Ask the user to label these q examples.
– For each of these q examples, if the corresponding cluster is "pure", propagate the label to a fraction p of the emails in that cluster.
– Add the newly labeled examples to the training set, and retrain the classifier.
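A sketch of the clustering and label-propagation step, following the description above: k-means on term-frequency vectors with Euclidean distance, querying the email closest to each cluster centroid, and propagating the user's label to a fraction p of a "pure" cluster. The purity test used here (agreement of the current classifier's predictions within a cluster), the parameter values, and the one-query-per-cluster simplification (rather than the paper's q-example, pure-clusters-first ordering) are assumptions for illustration, not the paper's exact definitions. Binary 0/1 labels are assumed.

import numpy as np
from sklearn.cluster import KMeans

def cluster_query_and_propagate(X_unlabeled, clf, k=50, p=0.5, purity_threshold=0.9):
    km = KMeans(n_clusters=k, n_init=10).fit(X_unlabeled)
    preds = clf.predict(X_unlabeled)          # current classifier's 0/1 predictions

    queries, propagation = [], []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        if members.size == 0:
            continue
        # Distance of each member to its centroid; the closest email is the
        # representative the user will be asked to label.
        dists = np.linalg.norm(X_unlabeled[members] - km.cluster_centers_[c], axis=1)
        order = members[np.argsort(dists)]
        queries.append(order[0])

        # "Purity" here: fraction of members the current classifier assigns to the
        # majority class (an assumed stand-in for the paper's purity criterion).
        purity = np.bincount(preds[members]).max() / members.size
        if purity >= purity_threshold:
            # Propagate the user's label to the fraction p of the cluster's emails
            # closest to the centroid.
            propagation.append(order[:max(1, int(p * members.size))])
    return queries, propagation

Restricting propagation to pure clusters, and only to the emails nearest the centroid, limits how far a single user label can spread a mistake.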
Xu et al. paper: Results
Ran on a large corpus (75K) of emails.