1 Efficiently Learning the Accuracy of Labeling Sources for Selective Sampling. Pinar Donmez, Jaime Carbonell, Jeff Schneider. School of Computer Science, Carnegie Mellon University. KDD ’09, June 30th 2009, Paris, France.

2 Problem Illustration. [Figure: a pool of instances and a set of oracles that can label them.]

3 Interval Estimate Threshold (IEThresh)
 Goal: find the labeler(s) with the highest expected accuracy
 Our work builds upon Interval Estimation [L. P. Kaelbling]
1. Estimate the reward of each labeler (more on the next slide)
2. Compute the upper confidence interval for each labeler
3. Select the labelers whose upper interval exceeds a threshold
4. Observe the outputs of the chosen oracles to update their reward estimates
5. Return to step 1
This filters out unreliable labelers early and reduces labeling cost; a sketch of one selection round follows below.
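To make steps 2 and 3 concrete, here is a minimal sketch of one selection round. This is not the authors' code: `alpha`, `eps`, and all names are illustrative, and the threshold is assumed to be a fraction `eps` of the best upper interval.

```python
import numpy as np
from scipy import stats

def upper_interval(rewards, alpha=0.05):
    """Upper end of a Student-t confidence interval on the mean reward."""
    n = len(rewards)
    if n < 2:
        # Labelers with fewer than 2 observations get an infinite bound,
        # so they are explored first.
        return np.inf
    m = np.mean(rewards)
    s = np.std(rewards, ddof=1)
    return m + stats.t.ppf(1 - alpha / 2, df=n - 1) * s / np.sqrt(n)

def select_labelers(reward_history, eps=0.8):
    """One IEThresh round (steps 2-3): keep every labeler whose upper
    interval reaches a fraction eps of the best upper interval."""
    uis = [upper_interval(r) for r in reward_history]
    cutoff = eps * max(uis)
    return [k for k, ui in enumerate(uis) if ui >= cutoff]

# e.g. select_labelers([[1, 1, 1, 0], [0, 1, 0, 0], [1, 0, 1, 1]]) -> [0, 2]
```

The selected labelers are then queried on the next instance, and their agreement with the majority vote (next slide) supplies the new reward observations.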

4 Reward of the labelers
 The reward of each labeler is unknown, so it must be estimated
 A labeler earns reward for eliciting the true label
 The true label is also unknown, so it is estimated by the majority vote
 We propose the following reward function (sketched in code below):
reward = 1 if the labeler agrees with the majority label
reward = 0 otherwise
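As a sketch, this reward can be encoded directly; the dict-based interface below is mine, not from the paper.

```python
from collections import Counter

def majority_label(votes):
    """votes maps labeler id to the label it returned; ties break arbitrarily."""
    return Counter(votes.values()).most_common(1)[0][0]

def agreement_rewards(votes):
    """reward = 1 if the labeler agrees with the majority label, else 0."""
    y_maj = majority_label(votes)
    return {k: int(y == y_maj) for k, y in votes.items()}

# e.g. agreement_rewards({"o1": "+", "o2": "+", "o3": "-"})
# -> {"o1": 1, "o2": 1, "o3": 0}
```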

5 IEThresh at the Beginning. [Figure: reward intervals for each oracle; expected reward increases along the vertical axis.]

6 IEThresh Oracle Selection. [Figure: reward intervals for each oracle with the selection threshold drawn; expected reward increases along the vertical axis.]

7 IE Learning Snapshot II. [Figure: reward intervals for each oracle against the threshold; expected reward increases along the vertical axis.]

8 IEThresh Instance Selection

9 Uniform Expert Accuracy ∈ (0.5, 1]. Baseline: Repeated Labeling [Sheng et al., 2008], which queries all experts for each label. [Figure: classification error curves.]

10 # Oracle Queries vs. Accuracy. [Figure: legend distinguishes the first 10 iterations, the next 40 iterations, and the next 100 iterations.]

11 # Oracle queries to reach a target accuracy. [Figure: query counts as the skew in labeler accuracy increases; fewer queries is better.]

12 Results on AMT Data with Human Annotators
 IEThresh reaches the best performance with effort similar to repeated labeling
 The Repeated baseline needs 840 queries in total to reach 0.95 accuracy
Dataset made available by [Snow et al., 2008]. [Figure: two panels, one task with 5 annotators and one with 6 annotators.]

13 Conclusions and Future Work
 Conclusions
IEThresh is effective in balancing the exploration vs. exploitation tradeoff
Early filtering of unreliable labelers boosts performance
Using labeler accuracy estimates is more effective than querying all labelers or querying at random
 Future Work
from consistent to time-varying labeler quality
label noise conditioned on the data instance
correlated labeling errors

14 THANK YOU!

17 Problem Setup Summary
 multiple noisy oracles (labelers)
 unknown labeling accuracy
 Goal: estimate labeler accuracy (quality), select the highest-quality labeler(s), and balance the exploration vs. exploitation tradeoff

18 Interval Estimation Learning (IE) [L. P. Kaelbling]
Goal: find the action a* with the highest expected reward
1. Estimate the reward of each action (oracle)
2. Choose the action a* with the highest upper confidence interval (formula below)
3. Record the observed reward of a*
4. Return to step 1
 a* has a high expected reward (exploitation) and/or large uncertainty in its reward (exploration)
 IE automatically trades off these two
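The upper confidence interval in step 2 is the standard Student-t upper bound on the sample mean of the observed rewards; the notation below is mine, matching the code sketch on slide 3.

```latex
\mathrm{UI}(a) \;=\; \hat{m}_a \;+\; t^{(\alpha/2)}_{\,n_a - 1}\,\frac{\hat{s}_a}{\sqrt{n_a}}
```

Here \hat{m}_a and \hat{s}_a are the sample mean and standard deviation of the rewards observed for action a, and n_a is the number of observations. A rarely tried action has a wide interval (exploration); a consistently good one has a high mean (exploitation), so maximizing UI(a) balances both.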

19 IE Learning Snapshot I. [Figure: reward intervals for each action (oracles, experts, etc.); expected reward increases along the vertical axis.]

20 Outline of IEThresh

21 Classification Error vs. # Oracle Queries. [Figure: error curves as the skew in labeler accuracy increases.]