Proactive Learning: Cost-Sensitive Active Learning with Multiple Imperfect Oracles ► Pinar Donmez and Jaime Carbonell ► Language Technologies Institute, School of Computer Science, Carnegie Mellon University ► CIKM ’08, Napa Valley, October 2008
Active Learning Assumptions and the Real World ► Active learning assumes: a unique oracle; a perfect oracle (always right, never tired); an oracle that works for free or charges uniformly ► Real world: multiple sources of information; imperfect oracles (unreliable, reluctant); oracles that are expensive or charge non-uniformly
Solution: Proactive Learning ► Proactive learning is a generalization of active learning to relax these assumptions ► decision-theoretic framework to jointly optimize instance-oracle pair ► utility optimization problem under a fixed budget constraint
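One way to write the budget-constrained utility optimization named on this slide is sketched below. The notation is an assumption introduced for illustration, not taken from the slides: S is the set of instances queried from the unlabeled pool UL, V(S) is its value of information, t_k is the number of queries sent to oracle k, C_k is that oracle's cost per query, and B is the fixed budget.

```latex
\max_{S \subseteq UL} \; \mathbb{E}\left[ V(S) \right]
\quad \text{subject to} \quad
\sum_{k} t_k \, C_k \le B
```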
Outline ► Methodology: 3 scenarios (Reluctance; Fallibility; Variable and Fixed Cost) ► Evaluation: Problem Setup, Datasets, Results ► Conclusion
Scenario 1: Reluctance ► 2 oracles: a reliable oracle (expensive, but always answers with a correct label) and a reluctant oracle (cheap, but may not respond to some queries) ► Define a utility score as the expected value of information at unit cost
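A minimal sketch of the utility score described above, assuming it takes the form (probability of an answer) x (value of information) / (oracle cost). The function names, the `Oracle` class, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Oracle:
    name: str
    cost: float  # fee per query, in hypothetical units


def utility(answer_prob, info_value, cost):
    """Expected value of information per unit cost (assumed form)."""
    return answer_prob * info_value / cost


def best_pair(info_values, answer_probs, oracles):
    """Return (utility, instance_index, oracle_name) with the highest utility.

    info_values[i]        : value of information of instance i
    answer_probs[name][i] : estimated probability that the oracle answers i
    """
    best = None
    for oracle in oracles:
        for i, value in enumerate(info_values):
            u = utility(answer_probs[oracle.name][i], value, oracle.cost)
            if best is None or u > best[0]:
                best = (u, i, oracle.name)
    return best


# Illustrative numbers only: the reliable oracle always answers but costs 3x more.
oracles = [Oracle("reliable", 3.0), Oracle("reluctant", 1.0)]
info_values = [0.9, 0.6]
answer_probs = {"reliable": [1.0, 1.0], "reluctant": [0.2, 0.8]}
print(best_pair(info_values, answer_probs, oracles))
```

With these numbers the most informative instance is not selected: the reluctant oracle is unlikely to answer it, and the reliable oracle is too expensive, so a less informative instance sent to the cheap oracle scores higher.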
How to simulate oracle unreliability? ► Unreliability may depend on factors such as query difficulty (hard to classify), complexity of the data (requires long and time-consuming analysis), etc. In this work, we model it based on query difficulty ► Assumptions: perfect oracle ~ classifier having zero training error on the entire data; imperfect oracle ~ weak classifier trained on a subset of the entire data ► Train a logistic regression classifier on the subset to obtain class posterior estimates ► Identify the instances that are hard to classify under this classifier ► These are the unreliable instances ► Challenge: tradeoff between the information value of an instance and the reliability of the oracle
How to estimate the reluctant oracle's probability of answering? ► Cluster the unlabeled data using k-means ► Ask the reluctant oracle for the label of each cluster centroid. If a label is received: increase the estimate for nearby points; if no label is received: decrease the estimate for nearby points (the update term equals 1 when a label is received, -1 otherwise) ► The number of clusters depends on the clustering budget and the oracle fee
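The centroid-based update can be sketched as follows. The exponential distance weighting, the `step` size, and the clipping to [0, 1] are assumptions made for illustration and not the update rule from the talk; only the +1 / -1 sign follows the slide.

```python
import math


def update_answer_probs(points, probs, centroid, answered, step=0.2):
    """Shift the estimated answer probability of points near a queried centroid.

    answered : True if the reluctant oracle returned a label for the centroid.
    The distance weighting and `step` are illustrative assumptions.
    """
    h = 1.0 if answered else -1.0  # +1 when a label is received, -1 otherwise
    updated = []
    for x, p in zip(points, probs):
        weight = math.exp(-math.dist(x, centroid))  # closer points move more
        updated.append(min(1.0, max(0.0, p + h * step * weight)))
    return updated


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
probs = [0.5, 0.5, 0.5]  # uninformed starting estimate for every point

# The oracle answered the centroid at the origin but not the one at (5, 5).
probs = update_answer_probs(points, probs, centroid=(0.0, 0.0), answered=True)
probs = update_answer_probs(points, probs, centroid=(5.0, 5.0), answered=False)
print(probs)
```

Points near an answered centroid end up with a higher estimate than the starting value, and points near an unanswered centroid end up with a lower one.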
Penalization of the Reluctant Oracle ► The algorithm works in rounds until the budget is exhausted ► At each round, sampling continues until a label is obtained ► Be careful: you may spend the entire budget on a single attempt ► If no label is received, decrease the utility of the remaining instances ► This penalization is adaptive
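The budgeted loop with penalization can be sketched roughly as below. This is a simplified illustration under stated assumptions: the fee is paid even when no label arrives, the penalty is a hypothetical multiplicative factor applied to the unanswering oracle's remaining pairs, and oracle responses are simulated with a random draw. Rounds are flattened into a single loop, with each successful label closing one round.

```python
import random


def run_rounds(utilities, answer_prob, cost, budget, penalty=0.5, seed=0):
    """Query (instance, oracle) pairs by utility until the budget runs out.

    utilities   : dict (instance, oracle) -> utility score
    answer_prob : dict (instance, oracle) -> simulated probability of an answer
    cost        : dict oracle -> fee per query (assumed paid even without a label)
    """
    rng = random.Random(seed)
    utilities = dict(utilities)
    labeled = {}
    while utilities:
        affordable = {p: u for p, u in utilities.items() if cost[p[1]] <= budget}
        if not affordable:
            break
        x, k = max(affordable, key=affordable.get)
        budget -= cost[k]
        if rng.random() < answer_prob[(x, k)]:
            labeled[x] = k  # label obtained: this round ends
            utilities = {p: u for p, u in utilities.items() if p[0] != x}
        else:
            del utilities[(x, k)]
            for p in utilities:
                if p[1] == k:  # penalize the oracle that did not answer
                    utilities[p] *= penalty
    return labeled, budget


utilities = {("a", "reluctant"): 0.9, ("a", "reliable"): 0.3,
             ("b", "reluctant"): 0.8, ("b", "reliable"): 0.25}
answer_prob = {("a", "reluctant"): 0.0, ("a", "reliable"): 1.0,
               ("b", "reluctant"): 0.0, ("b", "reliable"): 1.0}
cost = {"reluctant": 1, "reliable": 3}
result = run_rounds(utilities, answer_prob, cost, budget=10)
print(result)
```

In this toy run the reluctant oracle never answers, so two fees are lost before the penalized utilities steer both instances to the reliable oracle.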
Algorithm for Scenario 1
Scenario 2: Fallibility ► 2 oracles: one perfect but expensive oracle; one fallible but cheap oracle that always answers ► The algorithm is similar to Scenario 1, with slight modifications ► During exploration, the fallible oracle provides the label together with its confidence ► If the confidence is too low, we do not use the label, but we still update the estimate
Outline of Scenario 2
Scenario 3: Non-uniform Cost ► Uniform cost: Fraud detection, face recognition, etc. ► Non-uniform cost: text categorization, medical diagnosis, protein structure prediction, etc. ► 2 oracles: Fixed-cost Oracle Variable-cost Oracle
Outline of Scenario 3
Evaluation ► Datasets: Face detection, UCI Letter (V-vs-Y), Spambase, and UCI Adult
Oracle Properties and Costs ► The cost is inversely proportional to reliability ► Higher costs for the fallible oracle since a noisy label should be penalized more than no label at all ► Cost ratio creates an incentive to choose between oracles
Underlying Sampling Strategy ► Conditional entropy based sampling, weighted by a density measure ► Captures the information content of a close neighborhood (the close neighbors of x)
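A rough sketch of density-weighted conditional entropy sampling for a binary problem follows. The Gaussian-style distance weight, the neighborhood size, and the function names are assumptions for illustration; the exact scoring function from the talk is not recoverable from the slide text.

```python
import math


def entropy(p):
    """Binary conditional entropy H(y | x) for the posterior p = P(y = 1 | x)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def density_weighted_score(i, points, posteriors, n_neighbors=2):
    """Sum the entropy over x and its nearest neighbors, weighted by closeness."""
    x = points[i]
    order = sorted(range(len(points)), key=lambda j: math.dist(x, points[j]))
    neighborhood = order[: n_neighbors + 1]  # x itself plus its nearest neighbors
    return sum(
        math.exp(-math.dist(x, points[j]) ** 2) * entropy(posteriors[j])
        for j in neighborhood
    )


# Points 0 and 3 are equally uncertain, but point 0 sits in a dense region.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0)]
posteriors = [0.5, 0.55, 0.45, 0.5]
scores = [density_weighted_score(i, points, posteriors) for i in range(len(points))]
best = max(range(len(points)), key=scores.__getitem__)
print(best, scores)
```

The isolated point scores lower than the equally uncertain point in the dense cluster, which is the effect the density weighting is meant to produce.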
Results: Overall and Reluctance on Spambase Data
Results: Reluctance
Cost varies non-uniformly ► Statistically significant results (p < 0.01)
More light on the clustering step ► Run each baseline without the clustering step ► Entire budget is spent in rounds for data elicitation ► No separate clustering budget ► Results on Spambase under Scenario 1, cost 1:3
Conclusion ► Address issues with the assumptions of active learning ► Introduction to a Proactive Learning framework ► Analysis of imperfect oracles with differing properties and costs ► Expected utility maximization across oracle-instance pairs ► Effective against exploitation of a single oracle