Towards Minimizing the Annotation Cost of Certified Text Classification
Mossaab Bagdouri¹, David D. Lewis², William Webber¹, Douglas W. Oard¹
¹ University of Maryland, College Park, MD, USA
² David D. Lewis Consulting, Chicago, IL, USA
Outline
Introduction
Economical assured effectiveness
Solution framework
Baseline solutions
Conclusion
Goal: Economical assured effectiveness
1. Build a good classifier
2. Certify that this classifier is good
3. Use nearly minimal total annotations
(Photo courtesy of www.stockmonkeys.com)
Notation
[Figure: estimated F̂1 and true F1 as functions of the number of annotations (training, then test), with the lower confidence bound θ and the target threshold τ; confidence parameter α = 0.05]
Fixed test set, growing training set
[Figure: F̂1 and true F1 vs. annotations, with target τ; the test set is fixed while the training set grows]
Stop criterion (fixed test set, growing training set)

Stop criterion   Success
Desired          95.00%
F̂1 ≥ τ           46.42%
θ ≥ τ            91.87%

[Figure: F̂1 vs. training documents against τ. Collection = RCV1, Topic = M132, Freq = 3.33%]
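(A minimal sketch of one way to compute a lower confidence bound θ on F1 from test-set counts, by sampling a posterior over the confusion-matrix cells. The Dirichlet prior, the counts, and τ below are illustrative assumptions, not necessarily the paper's exact procedure.)

```python
import numpy as np

def f1_lower_bound(tp, fp, fn, tn, alpha=0.05, draws=100_000, seed=0):
    """Sample a posterior over confusion-cell probabilities and return the
    alpha-quantile of the implied F1 distribution (a lower bound theta)."""
    rng = np.random.default_rng(seed)
    # Uniform Dirichlet(1,1,1,1) prior over (tp, fp, fn, tn) probabilities.
    cells = rng.dirichlet([tp + 1, fp + 1, fn + 1, tn + 1], size=draws)
    f1 = 2 * cells[:, 0] / (2 * cells[:, 0] + cells[:, 1] + cells[:, 2])
    return np.quantile(f1, alpha)

# Hypothetical test-set counts and target threshold.
theta = f1_lower_bound(tp=60, fp=15, fn=10, tn=915)
tau = 0.70
print(f"theta = {theta:.3f} -> {'certify' if theta >= tau else 'do not certify'}")
```

Comparing θ (rather than the point estimate F̂1) against τ is what brings the success rate close to the desired 1 − α.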
Fixed training set, growing test set
[Figure: F̂1 and true F1 vs. annotations, with target τ; the training set is fixed while the test set grows]
Problem 1: Sequential testing bias
[Figure: a noisy F̂1 curve fluctuates around the true F1 as annotations grow; it crosses τ early ("stop here") before the true F1 does ("want to stop here"), then can fall back below τ ("do not stop")]
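(A Monte Carlo illustration of the bias, under assumed cell probabilities: the true F1 is held just below τ, yet a policy that peeks at F̂1 after every bucket of test annotations "certifies" far more often than a single test at the end would.)

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 0.70
# Hypothetical cell probabilities (tp, fp, fn, tn); true F1 ~ 0.686 < tau.
p = np.array([0.060, 0.020, 0.035, 0.885])
true_f1 = 2 * p[0] / (2 * p[0] + p[1] + p[2])

runs, buckets, bucket_size = 2000, 50, 20
sequential_pass = single_pass = 0
for _ in range(runs):
    counts = np.zeros(4)
    stopped = False
    for _ in range(buckets):
        counts += rng.multinomial(bucket_size, p)
        tp, fp, fn, _ = counts
        denom = 2 * tp + fp + fn
        f1_hat = 2 * tp / denom if denom > 0 else 0.0
        if not stopped and f1_hat >= tau:
            sequential_pass += 1   # peeking policy stops and "certifies" here
            stopped = True
    single_pass += f1_hat >= tau   # one look, at the full final test set only

print(f"true F1 = {true_f1:.3f} < tau = {tau:.2f}")
print(f"false certification, peek every bucket: {sequential_pass / runs:.1%}")
print(f"false certification, test once at end:  {single_pass / runs:.1%}")
```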
Solution: Train sequentially, test once
[Figure: train without testing while the training annotations grow; then test only once, and compute θ from that single test set]
Problem 2: What is the size of the test set?
Solution: Power analysis
Observation 1, from power analysis (see the sketch below):
◦ When the true effectiveness greatly exceeds the target, only a small test set is needed
Observation 2, from the shape of learning curves:
◦ New training examples provide less and less of an increase in effectiveness
[Figure: learning curve of F1 vs. training documents against τ; power = 1 − β, with β = 0.07]
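(A sketch of simulation-based power analysis, reusing the posterior lower bound above: for a hypothetical true F1 of 0.80, search for the smallest test set size n whose certification power reaches 1 − β. The cell probabilities and the size grid are assumptions for illustration.)

```python
import numpy as np

rng = np.random.default_rng(1)

def theta(counts, alpha=0.05, draws=4000):
    # Lower confidence bound on F1 from test counts (posterior sampling).
    cells = rng.dirichlet(counts + 1, size=draws)
    f1 = 2 * cells[:, 0] / (2 * cells[:, 0] + cells[:, 1] + cells[:, 2])
    return np.quantile(f1, alpha)

def certification_power(n, p_cells, tau, reps=300):
    # Probability that a size-n test set yields theta >= tau when the
    # true cell probabilities are p_cells.
    hits = sum(theta(rng.multinomial(n, p_cells).astype(float)) >= tau
               for _ in range(reps))
    return hits / reps

# Hypothetical true cells (tp, fp, fn, tn) with true F1 = 0.80, target tau = 0.70.
p_cells = np.array([0.08, 0.02, 0.02, 0.88])
tau, beta = 0.70, 0.07
for n in (100, 200, 400, 800, 1600):
    pw = certification_power(n, p_cells, tau)
    print(f"n = {n:4d}: power = {pw:.2f}")
    if pw >= 1 - beta:
        break
```

The wider the gap between the true F1 and τ, the sooner the search stops, which is Observation 1 in action.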
Designing annotation minimization policies
[Figure: true F1 vs. training annotations (from none to +∞) against total annotation cost, training + test ($$$), with target τ]
Allocation policies in practice
No closed-form solution to go from an effect size on F1 to a test set size
◦ Use simulation methods
The true effectiveness is invisible
◦ Use cross-validation to estimate it
No access to the entire learning curve, only scattered and noisy estimates
◦ Need to decide online
[Figure: true F1 and total cost, training + test ($$$), vs. training documents. Topic = C18, Frequency = 6.57%]
Estimating the true F1 (cross-validation)
[Figure: cross-validation folds over the training set]
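(A minimal illustration of the cross-validation step; scikit-learn's LinearSVC and a synthetic pool stand in for the paper's SVMperf and RCV1-v2.)

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Stand-in for the labeled training pool accumulated so far:
# 600 documents, ~10% positive prevalence.
X, y = make_classification(n_samples=600, n_features=50, weights=[0.9],
                           random_state=0)
# Average F1 over 10 folds approximates the current classifier's true F1.
f1_cv = cross_val_score(LinearSVC(), X, y, cv=10, scoring="f1").mean()
print(f"cross-validated F1 estimate: {f1_cv:.3f}")
```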
Estimating the true F1 (simulations)
[Figure: posterior distribution over F1, simulated from the training data]
Minimizing the annotations
[Figure: algorithm schematic; inputs are α, τ, β, the measure (F1), and the learning algorithm (SVM); the loop grows the training set, infers the test set size from the estimated F1, and finally tests once]
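(A hypothetical end-to-end sketch of this control flow, not the paper's published algorithm: at each training bucket, take an F1 estimate, infer the test set size that power analysis demands, and pick the training size that minimizes total annotations. A toy learning curve replaces real cross-validation, and the scan here is offline, whereas the paper's policies must decide online.)

```python
import numpy as np

rng = np.random.default_rng(2)
tau, alpha, beta, bucket, prevalence = 0.70, 0.05, 0.07, 20, 0.10

def cells_from_f1(f1, prev):
    # Map an F1 value to cell probabilities (tp, fp, fn, tn), assuming for
    # the sketch that precision equals recall (both then equal F1).
    tp, err = prev * f1, prev * (1 - f1)
    return np.array([tp, err, err, 1 - tp - 2 * err])

def theta(counts, draws=1500):
    d = rng.dirichlet(counts + 1, size=draws)
    return np.quantile(2 * d[:, 0] / (2 * d[:, 0] + d[:, 1] + d[:, 2]), alpha)

def test_size(f1_est, reps=60):
    # Smallest n on a doubling grid whose certification power reaches 1 - beta.
    p = cells_from_f1(f1_est, prevalence)
    for n in (50, 100, 200, 400, 800, 1600, 3200):
        ok = sum(theta(rng.multinomial(n, p).astype(float)) >= tau
                 for _ in range(reps))
        if ok / reps >= 1 - beta:
            return n
    return None  # estimate too close to tau: no affordable test set

# Toy learning curve standing in for cross-validated F1 estimates;
# a real run would retrain and re-estimate at every bucket.
best = (np.inf, None)
for t in range(bucket, 2001, bucket):
    f1_est = 0.85 - 0.60 / np.sqrt(t / bucket)
    if f1_est <= tau:
        continue
    n = test_size(f1_est)
    if n is not None and t + n < best[0]:
        best = (t + n, t)
print(f"train on {best[1]} documents, then annotate a test set; total ~= {best[0]}")
```

The trade-off the loop exposes is the one on the slide: more training raises F1 and shrinks the required test set, but past some point the extra training annotations cost more than they save.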
Experiments
Test collection: RCV1-v2
◦ 29 topics with a prevalence ≥ 3%
◦ 20 randomized runs per topic
Classifier: SVMperf
◦ Off-the-shelf classifier
◦ Training optimizes for F1
Settings
◦ Budget: 10,000 documents
◦ Power: 1 − β = 0.93
◦ Confidence level: 1 − α = 0.95
◦ Documents added in buckets of 20
Policies
[Figure: total cost, training + test ($$$), vs. training documents. Topic = C18, Frequency = 6.57%]
Stop as early as possible
Stops within the 10,000-document budget in 70.52% of runs
Failure rate of 20.54%, well above β (7%)
Sequential testing bias is pushed into process management
[Figure: cost curve for this policy. Topic = C18, Frequency = 6.57%]
Oracle policies
Minimum cost policy
◦ Savings: 43.21% of the total annotations
◦ Failure rate of 27.14%, above β (7%)
Minimum cost for success policy
◦ Savings: 38.08%
[Figure: cost curves for the oracle policies. Topic = C18, Frequency = 6.57%]
Wait-a-while policies
[Figure/table: savings (%), success (%), and cannot-open (%) for wait parameters W = 0, 1, 2, 3 and a "last chance" variant. Topic = C18, Frequency = 6.57%]
Conclusion
Re-testing introduces statistical bias
Algorithm that indicates:
◦ If and when a classifier can achieve a threshold
◦ How many documents are required to certify a trained model
A subroutine for policies that minimize annotation cost
Possibility to save 38% of the cost
Towards Minimizing the Annotation Cost of Certified Text Classification
Thank you!