05/04/07 Using Active Learning to Label Large Email Corpora Ted Markowitz Pace University CSIS DPS & IBM T. J. Watson Research Ctr.

2 Quick History Email-related research suggested by Dr. Chuck Tappert’s work with MS student, Ian Stuart Decided to approach IBM Research’s SpamGuru anti-spam group for joint research Started P/T onsite at IBM in 11/05 Dr. Richard Segal of IBM Research generously agreed to act as adjunct advisor

3 Research Motivation Assumption: Ongoing training and testing of anti-spam tools require large, fresh databases– corpora–of labeled (spam vs. good) messages Problem: How do we accurately label large numbers of examples―potentially millions― without manually examining every one?

4 Building Email Corpora Accurate training & testing of anti-spam tools require: –truly random, i.e., unbiased, samples –sufficient # of examples to measure low (< 0.1%) error rates –reasonable distributions of spam vs. good mail –examples which represent the target operating environment However, most existing email testing corpora are: –Rather small (just a few thousand messages) –Very narrowly focused in type and content –Aging rapidly and growing more and more stale over time

5 Building Email Corpora (cont.) Email and spam are constantly evolving Building large, current and diverse bodies of examples is time-consuming and expensive Result: Just a few–relatively small and aging– email corpora are used over and over again

6 One Potential Approach Machine Learning (ML) methods can help to build corpus labelers which learn how to label Research in semi-supervised learning (SSL) has shown it’s possible to accurately learn by bootstrapping, i.e., using relatively few labeled examples and lots of unlabeled examples

7 Active Learning Active Learning (AL) is one form of SSL While some ML is passive (e.g., learner is only given labeled examples), AL is proactive Active Learner component directs attention to particular areas it wants information about from a teacher who knows all the labels

Active Learning & Email Corpora Select M “best” messages to label Ask human to label selected messages Update model based on returned labels Label messages using Spam Classifier Model Unlabeled Messages Done? No Yes Spam Classifier Model

9 Active Learning (cont.) Basic Challenge: Minimize the total cost of teacher queries required to achieve a target error rate,  often simply the fewest queries Research Question: How does one selectively choose an optimal set of queries for the teacher during each update cycle?

10 Selective Sampling Uncertainty Sampling † (US) is one selective sampling technique for choosing the most informative examples US is based on the premise that the learner learns fastest by asking first about those examples it, itself, is most uncertain about † “A Sequential Algorithm for Training Text Classifiers”, D. D. Lewis & W. A. Gale, ACM SIGIR ‘94

11 Uncertainty Sampling (cont.) Minimizing total uncertainty over all examples is computationally expensive: O(n) Can you reduce the # of questions asked in each cycle and still learn accurately? Is picking just the most uncertain examples always the best learning strategy? Can other knowledge be brought to bear in selecting the best questions?

12 Research Hypothesis Hypothesis: It should be possible to achieve close to full US accuracy while asking fewer, better questions Focused on development of Approximate Uncertainty Sampling (AUS) labelers –Compromise between speed of learning, # of questions asked & computational resources –Computational complexity is O(m log(n)) vs. O(n) for original Uncertainty Sampling algorithm

13 Research Approach 1.Construct competing AL/US-based labelers 2.Compare them by… –Accuracy (% correct, FP’s & FN’s) –# of teacher queries required to hit error rates –Relative sample sizes –Overall performance & resource usage 3.Select best labelers and refine them

14 Research Infrastructure Built a Java labeler testbench for comparing labeler variations on IBM SpamGuru codebase Developed and tested several Uncertainty Sampling-based labelers Used gold-standard, labeled 92K msg TREC 2005 Enron mail corpus to simulate the teacher Built a GUI front-end (CSI) to support human teacher interaction with labelers

16 Benefits of AUS Nearly as effective as vanilla US, but with lower computational complexity: O(m log(n)) Reduced computational cost allows AUS to be applied to labeling larger datasets AUS makes it possible to update the learned model more frequently AUS is applicable to any AL/US-based solution, not just email corpus labeling

17 Ongoing Work Determine why selective sampling of queries using simple unsupervised clustering (AUS3 & AUS4) didn’t produce better results Develop enhanced clustering versions to attempt to improve AUS performance

05/04/07 Using Active Learning to Label Large Email Corpora Ted Markowitz Pace University CSIS DPS & IBM T. J. Watson Research Ctr.

Similar presentations

Presentation on theme: "05/04/07 Using Active Learning to Label Large Email Corpora Ted Markowitz Pace University CSIS DPS & IBM T. J. Watson Research Ctr."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

05/04/07 Using Active Learning to Label Large Email Corpora Ted Markowitz Pace University CSIS DPS & IBM T. J. Watson Research Ctr.

Similar presentations

Presentation on theme: "05/04/07 Using Active Learning to Label Large Email Corpora Ted Markowitz Pace University CSIS DPS & IBM T. J. Watson Research Ctr."— Presentation transcript:

Similar presentations

About project

Feedback