Towards Minimizing the Annotation Cost of Certified Text Classification Mossaab Bagdouri 1 David D. Lewis 2 William Webber 1 Douglas W. Oard 1 1 University.

1 Towards Minimizing the Annotation Cost of Certified Text Classification
Mossaab Bagdouri (1), David D. Lewis (2), William Webber (1), Douglas W. Oard (1)
(1) University of Maryland, College Park, MD, USA
(2) David D. Lewis Consulting, Chicago, IL, USA

2 Outline
Introduction
Economical assured effectiveness
Solution framework
Baseline solutions
Conclusion

3 Goal: Economical assured effectiveness
1. Build a good classifier
2. Certify that this classifier is good
3. Use nearly minimal total annotations
(Photo courtesy of www.stockmonkeys.com)

4 Notation
[Figure: F1 against the number of annotations (training and test), marking the true F1, its estimate F̂1, θ, and τ; α = 0.05]

5 Fixed test set, growing training set
[Figure: F1, F̂1, and θ against the number of annotations (training and test), with τ marked]

6 Fixed test set, growing training set
Collection = RCV1, Topic = M132, Frequency = 3.33%
Stop criterion    Success
Desired           95.00%
F̂1 ≥ τ            46.42%
θ ≥ τ             91.87%
[Figure: effectiveness against training documents (training and test), with τ marked]

7 Fixed training set, growing test set
[Figure: F1, F̂1, and θ against the number of annotations (training and test), with τ marked]

8 Problem 1: Sequential testing → bias
[Figure: F1 and θ against the number of annotations, with τ marked; points labelled "Do not stop", "Want to stop here", and "Stop here"]
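To make the bias concrete, here is a minimal simulation sketch of the growing-test-set scenario. It rests on assumptions that are not in the slides: accuracy (a plain proportion) stands in for F1, a one-sided Wald lower bound stands in for θ, and the hypothetical classifier's true effectiveness sits exactly at τ, so every certification is a false one. All names and parameter values are illustrative.

```python
import math
import random

def false_certification_rates(tau=0.8, looks=20, step=50, trials=1000, seed=0):
    """Compare 'test once' with 'test after every batch' on a growing test set.

    Assumption: the classifier's true accuracy equals tau, so a sound
    procedure should certify it in roughly alpha = 5% of trials.
    """
    z = 1.645  # one-sided 95% normal quantile
    rng = random.Random(seed)
    once = 0      # certified when only the final, full test set is examined
    any_look = 0  # certified at one or more of the intermediate looks
    for _trial in range(trials):
        correct = 0
        n = 0
        passed_some_look = False
        passed_last_look = False
        for _look in range(looks):
            # Annotate one more batch of test documents.
            correct += sum(rng.random() < tau for _doc in range(step))
            n += step
            p_hat = correct / n
            lower = p_hat - z * math.sqrt(p_hat * (1 - p_hat) / n)
            passed_last_look = lower >= tau
            passed_some_look = passed_some_look or passed_last_look
        once += passed_last_look
        any_look += passed_some_look
    return once / trials, any_look / trials

once_rate, sequential_rate = false_certification_rates()
print(f"test once: {once_rate:.3f}  stop at first success: {sequential_rate:.3f}")
```

Under these assumptions the second rate should come out well above the first, because stopping at the first passing look gives chance fluctuations many opportunities to cross τ.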

9 Solution: Train sequentially, test once
Train without testing
Test only once
[Figure: F1 and θ against training annotations, with τ marked; training followed by a single test]

10 Problem 2: What is the size of the test set?

11 Solution: Power analysis
Observation 1, from power analysis:
  ◦ True effectiveness greatly exceeds the target → small test set needed
Observation 2, from the shape of learning curves:
  ◦ New training examples provide progressively smaller increases in effectiveness
β = 0.07, Power = 1 - β
[Figure: F1 against training documents, with τ marked]
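Observation 1 can be illustrated with a standard closed-form sample-size calculation. The sketch below assumes effectiveness is a plain proportion such as accuracy, for which the normal approximation applies; it is a stand-in chosen for illustration and not a formula for F1. The function name and the example values are hypothetical.

```python
import math
from statistics import NormalDist

def required_test_size(true_p, tau, alpha=0.05, beta=0.07):
    """Approximate test-set size for a one-sided test of a proportion.

    Assumption: effectiveness is a proportion (e.g. accuracy), so the
    usual normal-approximation formula can be used. Not valid for F1.
    """
    if true_p <= tau:
        raise ValueError("true effectiveness must exceed the target tau")
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # confidence level 1 - alpha
    z_beta = NormalDist().inv_cdf(1 - beta)    # power 1 - beta
    numerator = (z_alpha * math.sqrt(tau * (1 - tau))
                 + z_beta * math.sqrt(true_p * (1 - true_p)))
    return math.ceil((numerator / (true_p - tau)) ** 2)

# The further the true effectiveness lies above tau, the fewer test
# documents are needed to certify it.
for true_p in (0.82, 0.85, 0.90):
    print(true_p, required_test_size(true_p, tau=0.80))
```

The required size shrinks with the square of the gap between true effectiveness and τ, which is why spending more on training can reduce the cost of testing.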

12 Designing annotation minimization policies
[Figure: true F1 against training set size, with τ marked, alongside total annotation cost, Training + Test ($$$); +∞ is marked in two places]

13 Allocation policies in practice
Topic = C18, Frequency = 6.57%
No closed form solution to go from an effect size on F1 to a test set size
  ◦ → Simulation methods
True effectiveness invisible
  ◦ → Cross-validation to estimate it
No access to the entire curve
Scattered and noisy estimates
  ◦ → Need to decide online
[Figure: true F1 against training documents, with τ marked, and Training + Test ($$$) against training documents]

14 Estimating the true F1 (Cross-validation)

15 Estimating the true F1 (Simulations)
[Figure: posterior distribution of F1 estimated from the training data]
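One way to obtain a posterior distribution over F1 by simulation is sketched below. It is an illustration under stated assumptions rather than the authors' procedure: a uniform Dirichlet prior over the confusion-matrix cells, a simple random sample of annotated documents, hypothetical counts, and θ read as a one-sided lower bound on F1 (the slides do not define it).

```python
import random

def f1_posterior_draws(tp, fp, fn, draws=5000, seed=0):
    """Sample F1 values from a Dirichlet posterior over the confusion cells.

    Assumption: uniform Dirichlet prior (one pseudo-count per cell).
    A Dirichlet draw is a vector of gamma variates divided by their sum;
    the sum, and with it the true-negative cell, cancels out of F1.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(draws):
        g_tp = rng.gammavariate(tp + 1, 1.0)
        g_fp = rng.gammavariate(fp + 1, 1.0)
        g_fn = rng.gammavariate(fn + 1, 1.0)
        values.append(2 * g_tp / (2 * g_tp + g_fp + g_fn))
    return sorted(values)

def lower_bound(sorted_draws, alpha=0.05):
    """Empirical alpha-quantile of the draws (a one-sided credible bound)."""
    return sorted_draws[int(alpha * len(sorted_draws))]

tp, fp, fn = 80, 20, 20  # hypothetical annotation counts
f1_hat = 2 * tp / (2 * tp + fp + fn)
theta = lower_bound(f1_posterior_draws(tp, fp, fn))
print(f"point estimate: {f1_hat:.3f}, lower bound: {theta:.3f}")
```

The bound sits below the point estimate, and the distance between them narrows as more documents are annotated.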

16 Minimizing the annotations
Infer test set size
α, τ, β, Measure (F1), Algorithm (SVM)
[Figure: F1 and θ against training annotations, with τ marked; training, then test; +∞ marked]
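Since no closed form links an effect size on F1 to a test set size, the "infer test set size" step can be pictured as a simulation-based search. The sketch below is only one possible reading, under assumptions not in the slides: the per-document cell probabilities are hypothetical, θ is taken to be a lower credible bound under a uniform Dirichlet prior, and candidate sizes are tried in increasing order until the estimated power reaches 1 - β.

```python
import random

def theta_from_counts(rng, tp, fp, fn, alpha=0.05, draws=200):
    """Lower credible bound on F1 under a uniform Dirichlet prior (assumed)."""
    samples = []
    for _ in range(draws):
        g_tp = rng.gammavariate(tp + 1, 1.0)
        g_fp = rng.gammavariate(fp + 1, 1.0)
        g_fn = rng.gammavariate(fn + 1, 1.0)
        samples.append(2 * g_tp / (2 * g_tp + g_fp + g_fn))
    samples.sort()
    return samples[int(alpha * draws)]

def estimated_power(n, cell_probs, tau, sims=100, seed=0):
    """Fraction of simulated test sets of size n whose bound reaches tau.

    cell_probs = (P(TP), P(FP), P(FN)) are hypothetical per-document
    probabilities; the remaining mass is the true-negative cell.
    """
    rng = random.Random(seed)
    p_tp, p_fp, p_fn = cell_probs
    passed = 0
    for _sim in range(sims):
        tp = fp = fn = 0
        for _doc in range(n):
            u = rng.random()
            if u < p_tp:
                tp += 1
            elif u < p_tp + p_fp:
                fp += 1
            elif u < p_tp + p_fp + p_fn:
                fn += 1
        passed += theta_from_counts(rng, tp, fp, fn) >= tau
    return passed / sims

def smallest_sufficient_size(sizes, cell_probs, tau, beta=0.07):
    """First candidate size whose estimated power reaches 1 - beta, else None."""
    for n in sizes:
        if estimated_power(n, cell_probs, tau) >= 1 - beta:
            return n
    return None

cells = (0.085, 0.015, 0.015)  # hypothetical classifier with true F1 = 0.85
print(smallest_sufficient_size([200, 400, 800, 1600], cells, tau=0.75))
```

In practice the true cell probabilities are unknown, so they would have to be replaced by estimates, for example from cross-validation on the training data.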

17 Experiments
Test collection: RCV1-v2
  ◦ 29 topics with a prevalence ≥ 3%
  ◦ 20 randomized runs per topic
Classifier: SVM Perf
  ◦ Off-the-shelf classifier
  ◦ Optimizes training for F1
Settings
  ◦ Budget: 10,000 documents
  ◦ Power 1 - β = 0.93
  ◦ Confidence level 1 - α = 0.95
  ◦ Documents added in buckets of 20

18 Policies
Topic = C18, Frequency = 6.57%
[Figure: Training + Test ($$$) against training documents]

19 Stop as early as possible
Topic = C18, Frequency = 6.57%
Budget achieved 70.52% of the time
Failure rate of 20.54% > β (7%)
Sequential testing bias pushed into process management
[Figure: Training + Test ($$$) against training documents]

20 Oracle policies
Topic = C18, Frequency = 6.57%
Minimum cost policy
  ◦ Savings: 43.21% of the total annotations
  ◦ Failure rate of 27.14% > β (7%)
Minimum cost for success policy
  ◦ Savings: 38.08%
[Figure: Training + Test ($$$) against training documents]

21 Wait-a-while policies
Topic = C18, Frequency = 6.57%
[Figure: Training + Test ($$$) against training documents; chart labels: Savings (%), Success (%), Cannot open (%), w, W = 0, W = 1, W = 2, W = 3, Last chance]

22 Conclusion
Re-testing introduces statistical bias
Algorithm to indicate:
  ◦ If / when a classifier can achieve a threshold
  ◦ How many documents required to certify a trained model
Subroutine for policies minimizing the cost
Possibility to save 38% of cost

23 Towards Minimizing the Annotation Cost of Certified Text Classification Thank you!

