05/04/07 Using Active Learning to Label Large Email Corpora Ted Markowitz Pace University CSIS DPS & IBM T. J. Watson Research Ctr.

2 Quick History Email-related research suggested by Dr. Chuck Tappert’s work with MS student Ian Stuart. Decided to approach IBM Research’s SpamGuru anti-spam group for joint research. Started P/T onsite at IBM in 11/05. Dr. Richard Segal of IBM Research generously agreed to act as adjunct advisor.

3 Research Motivation Assumption: Ongoing training and testing of anti-spam tools require large, fresh databases (corpora) of labeled (spam vs. good) messages. Problem: How do we accurately label large numbers of examples, potentially millions, without manually examining every one?

4 Building Corpora Accurate training & testing of anti-spam tools require: – truly random, i.e., unbiased, samples – a sufficient # of examples to measure low (< 0.1%) error rates – reasonable distributions of spam vs. good mail – examples which represent the target operating environment. However, most existing testing corpora are: – rather small (just a few thousand messages) – very narrowly focused in type and content – aging rapidly and growing more and more stale over time.

5 Building Corpora (cont.) Email and spam are constantly evolving. Building large, current and diverse bodies of examples is time-consuming and expensive. Result: Just a few (relatively small and aging) corpora are used over and over again.

6 One Potential Approach Machine Learning (ML) methods can help to build corpus labelers which learn how to label. Research in semi-supervised learning (SSL) has shown it’s possible to accurately learn by bootstrapping, i.e., using relatively few labeled examples and lots of unlabeled examples.

7 Active Learning Active Learning (AL) is one form of SSL. While some ML is passive (e.g., the learner is only given labeled examples), AL is proactive: the Active Learner component directs attention to the particular areas it wants information about, querying a teacher who knows all the labels.

Active Learning & Corpora (flowchart): starting from a pool of unlabeled messages and a spam classifier model, 1) select the M “best” messages to label, 2) ask a human to label the selected messages, 3) update the model based on the returned labels, 4) if not done, repeat; once done, label the remaining messages using the spam classifier model.
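
A minimal sketch of that loop in Java (the language of the labeler testbench described later). The SpamClassifier, Teacher, and ActiveLabeler names and methods are illustrative assumptions, not the SpamGuru codebase’s actual API.

```java
import java.util.*;

// Hypothetical interfaces for the sketch; not the SpamGuru API.
interface SpamClassifier {
    double probSpam(String message);            // current estimate of P(spam | message)
    void update(Map<String, Boolean> labeled);  // refresh the model with newly labeled messages
}

interface Teacher {
    boolean isSpam(String message);             // the (human or simulated) oracle
}

class ActiveLabeler {
    void label(List<String> unlabeled, SpamClassifier model, Teacher teacher,
               int m, int maxQueries) {
        int queries = 0;
        while (queries < maxQueries && !unlabeled.isEmpty()) {
            // 1. Select the M "best" (most informative) messages still unlabeled.
            List<String> batch = selectBatch(unlabeled, model, m);
            // 2. Ask the teacher to label the selected messages.
            Map<String, Boolean> answers = new HashMap<>();
            for (String msg : batch) {
                answers.put(msg, teacher.isSpam(msg));
                queries++;
            }
            // 3. Update the model based on the returned labels.
            model.update(answers);
            unlabeled.removeAll(batch);
        }
        // 4. Label the remaining messages using the learned model.
        for (String msg : unlabeled) {
            boolean spam = model.probSpam(msg) >= 0.5;
            System.out.println((spam ? "SPAM " : "GOOD ") + msg);
        }
    }

    // Selection strategy is pluggable; plain uncertainty sampling is sketched under slide 10.
    List<String> selectBatch(List<String> pool, SpamClassifier model, int m) {
        List<String> copy = new ArrayList<>(pool);
        Collections.shuffle(copy);               // placeholder: random selection
        return copy.subList(0, Math.min(m, copy.size()));
    }
}
```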

9 Active Learning (cont.) Basic Challenge: Minimize the total cost of teacher queries required to achieve a target error rate (often simply the fewest queries). Research Question: How does one selectively choose an optimal set of queries for the teacher during each update cycle?

10 Selective Sampling Uncertainty Sampling† (US) is one selective sampling technique for choosing the most informative examples. US is based on the premise that the learner learns fastest by asking first about those examples it, itself, is most uncertain about. † “A Sequential Algorithm for Training Text Classifiers”, D. D. Lewis & W. A. Gale, ACM SIGIR ’94
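
As a rough illustration, uncertainty sampling for a binary spam/good classifier can be read as “pick the m messages whose predicted spam probability is closest to 0.5.” The sketch below reuses the hypothetical SpamClassifier interface from the loop sketch above; it is not Lewis & Gale’s exact formulation.

```java
import java.util.*;

// Sketch only: rank by distance from the 0.5 decision boundary and take the m closest.
class UncertaintySampler {
    List<String> selectBatch(List<String> pool, SpamClassifier model, int m) {
        List<String> ranked = new ArrayList<>(pool);
        // Margin from the decision boundary; the smaller the margin, the more uncertain the model.
        ranked.sort(Comparator.comparingDouble((String msg) -> Math.abs(model.probSpam(msg) - 0.5)));
        return ranked.subList(0, Math.min(m, ranked.size()));
    }
}
```

Scoring every message in the pool on each cycle is what makes plain US O(n) per selection, which the next slide questions.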

11 Uncertainty Sampling (cont.) Minimizing total uncertainty over all examples is computationally expensive: O(n) per selection cycle. Can you reduce the # of questions asked in each cycle and still learn accurately? Is picking just the most uncertain examples always the best learning strategy? Can other knowledge be brought to bear in selecting the best questions?

12 Research Hypothesis Hypothesis: It should be possible to achieve close to full US accuracy while asking fewer, better questions. Focused on development of Approximate Uncertainty Sampling (AUS) labelers: – a compromise between speed of learning, # of questions asked & computational resources – computational complexity is O(m log(n)) vs. O(n) for the original Uncertainty Sampling algorithm.
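
The slides do not spell out the AUS algorithm itself, so the following is only a hypothetical illustration of how an O(m log(n)) selection step could arise: keep candidates in a priority queue ordered by their last-computed uncertainty and touch only m entries per cycle instead of rescoring all n. Class and method names are invented for the sketch; this is not the dissertation’s actual AUS labeler.

```java
import java.util.*;

// Illustration ONLY; the real AUS labelers are not described in these slides.
class ApproxUncertaintySampler {
    // Each entry pairs a message with the (possibly stale) margin it was queued with.
    private final PriorityQueue<Map.Entry<String, Double>> queue =
            new PriorityQueue<>(Map.Entry.comparingByValue());   // smallest margin = most uncertain

    ApproxUncertaintySampler(Collection<String> pool, SpamClassifier model) {
        // One-time O(n log n) build; afterwards each selection cycle touches only m entries.
        for (String msg : pool) {
            queue.add(Map.entry(msg, Math.abs(model.probSpam(msg) - 0.5)));
        }
    }

    List<String> selectBatch(int m) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < m && !queue.isEmpty()) {
            batch.add(queue.poll().getKey());   // each poll is O(log n)
        }
        return batch;
    }
}
```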

13 Research Approach 1. Construct competing AL/US-based labelers 2. Compare them by… – Accuracy (% correct, FP’s & FN’s) – # of teacher queries required to hit error rates – Relative sample sizes – Overall performance & resource usage 3. Select best labelers and refine them
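
To make the comparison criteria concrete, here is a small, self-contained sketch of the accuracy / false-positive / false-negative bookkeeping such a comparison needs; the class and method names are illustrative, not part of the actual testbench.

```java
// Compare a labeler's output against gold-standard labels:
// accuracy, false positives (good mail labeled spam) and false negatives (spam labeled good).
class LabelerMetrics {
    static void report(boolean[] predictedSpam, boolean[] goldSpam) {
        int correct = 0, fp = 0, fn = 0, good = 0, spam = 0;
        for (int i = 0; i < goldSpam.length; i++) {
            if (goldSpam[i]) spam++; else good++;
            if (predictedSpam[i] == goldSpam[i]) correct++;
            else if (predictedSpam[i]) fp++;    // good message labeled spam
            else fn++;                          // spam message labeled good
        }
        System.out.printf("accuracy=%.4f  FP rate=%.4f  FN rate=%.4f%n",
                (double) correct / goldSpam.length,
                good > 0 ? (double) fp / good : 0.0,
                spam > 0 ? (double) fn / spam : 0.0);
    }
}
```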

14 Research Infrastructure Built a Java labeler testbench for comparing labeler variations on the IBM SpamGuru codebase. Developed and tested several Uncertainty Sampling-based labelers. Used the gold-standard, labeled 92K-message TREC 2005 Enron mail corpus to simulate the teacher. Built a GUI front-end (CSI) to support human teacher interaction with labelers.
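
Simulating the teacher with a pre-labeled corpus amounts to answering every query from the gold-standard labels. A minimal sketch, assuming the hypothetical Teacher interface from the loop sketch above (field and constructor names are assumptions):

```java
import java.util.Map;

// Simulated teacher: every query is answered from a gold-standard label map
// (e.g., one built from the labeled TREC 2005 corpus).
class CorpusTeacher implements Teacher {
    private final Map<String, Boolean> goldLabels;   // message -> is it spam?

    CorpusTeacher(Map<String, Boolean> goldLabels) {
        this.goldLabels = goldLabels;
    }

    @Override
    public boolean isSpam(String message) {
        return goldLabels.getOrDefault(message, Boolean.FALSE);
    }
}
```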

16 Benefits of AUS Nearly as effective as vanilla US, but with lower computational complexity: O(m log(n)). Reduced computational cost allows AUS to be applied to labeling larger datasets. AUS makes it possible to update the learned model more frequently. AUS is applicable to any AL/US-based solution, not just corpus labeling.

17 Ongoing Work Determine why selective sampling of queries using simple unsupervised clustering (AUS3 & AUS4) didn’t produce better results. Develop enhanced clustering versions to attempt to improve AUS performance.