Adjudicator Agreement and System Rankings for Person Name Search
Mark Arehart, Chris Wolf, Keith Miller, The MITRE Corporation


1 Adjudicator Agreement and System Rankings for Person Name Search Mark Arehart, Chris Wolf, Keith Miller The MITRE Corporation {marehart, cwolf, keith}@mitre.org

2 Summary
- Matching multicultural name variants is knowledge-intensive
- Ground truth dataset requires tedious adjudication
- Guidelines are not comprehensive, so adjudicators often disagree
- Previous evaluations: multiple adjudication, voting
- Results of study: high agreement, multiple adjudication not needed
- "Nearly" the same payoff for much less effort

3 Dataset
- Watchlist, ~71K
  - Deceased persons lists
  - Mixed cultures
  - 1.1K variants for 404 base names
  - Average 2.8 variants per base record
- Queries, 700
  - 404 base names
  - 296 randomly selected from watchlist
- Subset of 100 randomly selected for this study

4 Method
- Adjudication pools as in TREC: pool from 13 algorithms
- Four judges complete pools (1712 pairs, excluding exact matches)
- Compare system rankings under different versions of ground truth

     Type           Criteria for true match
  1  Consensus      Tie or majority vote (baseline)
  2  Union          Judged true by anyone
  3  Intersection   Judged true by everyone
  4  Single         Judgments from a single adjudicator (4 versions)
  5  Random         Randomly choose adjudicator per item (1000 versions)
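The five ground truth versions in the table can be sketched in a few lines. This is only an illustration under assumptions: the function name `compile_ground_truth`, the dictionary layout, and the toy judgments are hypothetical and do not come from the study.

```python
import random


def compile_ground_truth(judgments, method, judge=None, rng=None):
    """Return the set of pairs counted as true matches under one method.

    judgments: dict mapping a (query, candidate) pair to a list of
    booleans, one per adjudicator. This layout is an assumption made
    for illustration.
    """
    rng = rng or random.Random()
    truth = set()
    for pair, votes in judgments.items():
        if method == "consensus":
            # Tie or majority vote counts as a true match.
            is_match = 2 * sum(votes) >= len(votes)
        elif method == "union":
            is_match = any(votes)
        elif method == "intersection":
            is_match = all(votes)
        elif method == "single":
            # Use one adjudicator's judgments throughout.
            is_match = votes[judge]
        elif method == "random":
            # Pick one adjudicator at random for each item.
            is_match = rng.choice(votes)
        else:
            raise ValueError(f"unknown method: {method}")
        if is_match:
            truth.add(pair)
    return truth


# Toy judgments from four hypothetical adjudicators (not study data).
judgments = {
    ("q1", "n1"): [True, True, True, True],
    ("q1", "n2"): [True, True, False, False],
    ("q2", "n3"): [True, False, False, False],
    ("q2", "n4"): [False, False, False, False],
}

consensus = compile_ground_truth(judgments, "consensus")
union = compile_ground_truth(judgments, "union")
intersection = compile_ground_truth(judgments, "intersection")
judge_c = compile_ground_truth(judgments, "single", judge=2)
randomized = compile_ground_truth(judgments, "random", rng=random.Random(0))
```

Any randomly compiled set necessarily lies between the intersection and the union, which is a handy sanity check when generating many random versions.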

5 Adjudicator Agreement Measures

        +   -
    +   a   b
    -   c   d

overlap = a / (a + b + c)
p+ = 2a / (2a + b + c)
p- = 2d / (2d + b + c)
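As a rough sketch, the three measures above can be computed directly from the four cells of the two-judge contingency table. The kappa values reported on the next slide are not defined in the transcript, so the Cohen's kappa formula below is an assumption, and the cell counts are made up for illustration.

```python
def agreement_measures(a, b, c, d):
    """Agreement between two judges from a 2x2 contingency table.

    a: both judged +, b and c: the judges disagree, d: both judged -.
    No guard against empty cells is included in this sketch.
    """
    n = a + b + c + d
    overlap = a / (a + b + c)
    p_pos = 2 * a / (2 * a + b + c)
    p_neg = 2 * d / (2 * d + b + c)
    # Cohen's kappa (assumed variant): observed agreement corrected
    # for the agreement expected from the marginal totals.
    p_obs = (a + d) / n
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    kappa = (p_obs - p_exp) / (1 - p_exp)
    return {"overlap": overlap, "p+": p_pos, "p-": p_neg, "kappa": kappa}


# Hypothetical counts, not taken from the study.
measures = agreement_measures(a=40, b=10, c=10, d=40)
```

With these counts, overlap is 40/60, both p+ and p- are 0.8, and kappa is 0.6.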

6 Adjudicator Agreement
- Lowest is A~B, kappa 0.57
- Highest is C~D, kappa 0.78

7 So far…
- Test watchlist and query list
- Results from 13 algorithms
- Adjudications by 4 volunteers
- Ways of compiling alternate ground truth sets
Still need…

8 Comparing System Rankings
[Figure: two complete rankings of the five systems A, B, C, D, E]
- How similar?
  - Kendall's tau
  - Spearman's rank correlation
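For two complete rankings without ties, both correlation measures are short enough to write out by hand. The sketch below uses two hypothetical rankings, since the order shown in the slide's figure cannot be recovered from the transcript; the function names are illustrative.

```python
from itertools import combinations


def kendall_tau(rank_x, rank_y):
    """Kendall's tau for two complete rankings (best first, no ties)."""
    pos_x = {system: i for i, system in enumerate(rank_x)}
    pos_y = {system: i for i, system in enumerate(rank_y)}
    concordant = discordant = 0
    for s, t in combinations(rank_x, 2):
        # A pair is concordant when both rankings order it the same way.
        if (pos_x[s] - pos_x[t]) * (pos_y[s] - pos_y[t]) > 0:
            concordant += 1
        else:
            discordant += 1
    n_pairs = len(rank_x) * (len(rank_x) - 1) // 2
    return (concordant - discordant) / n_pairs


def spearman_rho(rank_x, rank_y):
    """Spearman's rank correlation for two complete rankings, no ties."""
    n = len(rank_x)
    pos_y = {system: i for i, system in enumerate(rank_y)}
    d_squared = sum((i - pos_y[system]) ** 2 for i, system in enumerate(rank_x))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical rankings that differ only in the top two positions.
ranking_1 = ["A", "B", "C", "D", "E"]
ranking_2 = ["B", "A", "C", "D", "E"]

tau = kendall_tau(ranking_1, ranking_2)   # 0.8: one of ten pairs is swapped
rho = spearman_rho(ranking_1, ranking_2)  # 0.9
```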

9 Significance Testing
- Not all differences are significant (duh)
- F1-measure: harmonic mean of precision & recall
  - Not a proportion or mean of independent observations
  - Not amenable to traditional significance tests
  - Like other IR measures, e.g. MAP
- Bootstrap resampling
  - Sample with replacement from data
  - Compute difference for many trials
  - Produces a distribution of differences
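The bootstrap procedure can be sketched as follows. The slides do not say what unit is resampled or how significance is read off the distribution, so this sketch assumes that queries are resampled, that F1 is pooled over the sampled queries, and that a difference is significant when a percentile interval excludes zero. All names and counts are hypothetical.

```python
import random


def pooled_f1(counts):
    """F1 from per-query (tp, fp, fn) counts pooled over all queries."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    # 2tp / (2tp + fp + fn) equals the harmonic mean of precision and recall.
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_differences(counts_a, counts_b, trials=1000, seed=0):
    """Distribution of F1(A) - F1(B) over resampled query sets."""
    rng = random.Random(seed)
    n = len(counts_a)
    diffs = []
    for _ in range(trials):
        # Sample query indices with replacement; both systems are
        # scored on the same resampled queries (paired design).
        idx = [rng.randrange(n) for _ in range(n)]
        f1_a = pooled_f1([counts_a[i] for i in idx])
        f1_b = pooled_f1([counts_b[i] for i in idx])
        diffs.append(f1_a - f1_b)
    return diffs


def is_significant(diffs, alpha=0.05):
    """Assumed criterion: the percentile interval excludes zero."""
    ordered = sorted(diffs)
    low = ordered[int(alpha / 2 * len(ordered))]
    high = ordered[int((1 - alpha / 2) * len(ordered)) - 1]
    return low > 0 or high < 0


# Hypothetical per-query counts (tp, fp, fn) for two systems.
system_a = [(2, 0, 0)] * 5
system_b = [(1, 1, 1)] * 5
diffs = bootstrap_f1_differences(system_a, system_b)
```

In this toy case every query favors system A by the same margin, so every resampled difference is 0.5 and the difference is significant; identical systems give a distribution concentrated at zero.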

10 Incomplete Ranking
[Figure: two partial orderings of the five systems A, B, C, D, E]
- Not all differences significant → partial ordering
- How similar?

11 Evaluation Statements
[Figure: two partial orderings of systems A to E, each with its evaluation statements]
- Ranking 1: A>B A>C A>D A>E B=C B>D B>E C>D C>E D=E
- Ranking 2: A<B A>C A>D A>E B>C B>D B>E C>D C>E D=E

12 Similarity
- n systems → n(n-1) / 2 evaluation statements
- Ranking 1: A>B A>C A>D A>E B=C B>D B>E C>D C>E D=E
- Ranking 2: A<B A>C A>D A>E B>C B>D B>E C>D C>E D=E
- Sensitivity: proportion of relations with sig diff
  - Ranking 1: Sens = 80%
  - Ranking 2: Sens = 90%
- Reversal rate: proportion of reversed relations: 10%
- Total disagreement: 20%
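The figures on this slide can be reproduced from the two sets of evaluation statements. The sketch below is one possible encoding, assuming each statement is a token such as `A>B` and that both rankings list every pair in the same orientation; the helper names are hypothetical.

```python
def parse_statements(text):
    """Map each system pair to its relation: '>', '<' or '='."""
    relations = {}
    for token in text.split():
        for op in "<>=":
            if op in token:
                left, right = token.split(op)
                relations[(left, right)] = op
                break
    return relations


def sensitivity(relations):
    """Proportion of pairs with a significant difference (not '=')."""
    return sum(op != "=" for op in relations.values()) / len(relations)


def reversal_rate(first, second):
    """Proportion of pairs whose significant difference flips direction."""
    flipped = sum({first[p], second[p]} == {"<", ">"} for p in first)
    return flipped / len(first)


def total_disagreement(first, second):
    """Proportion of pairs whose relation differs in any way."""
    return sum(first[p] != second[p] for p in first) / len(first)


ranking_1 = parse_statements("A>B A>C A>D A>E B=C B>D B>E C>D C>E D=E")
ranking_2 = parse_statements("A<B A>C A>D A>E B>C B>D B>E C>D C>E D=E")

sens_1 = sensitivity(ranking_1)                        # 0.8
sens_2 = sensitivity(ranking_2)                        # 0.9
reversals = reversal_rate(ranking_1, ranking_2)        # 0.1 (A vs. B)
disagree = total_disagreement(ranking_1, ranking_2)    # 0.2 (A vs. B, B vs. C)
```

With five systems there are 5 × 4 / 2 = 10 statements, so each differing pair contributes 10 percentage points.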

13 Comparisons With Baseline

  Truth Set      Sensitivity   Disagree   Reversal
  Consensus      0.744         n/a
  Union          0.782         0.064      0
  Intersection   0.538         0.423      0.038
  Judge A        0.769         0.051      0
  Judge B        0.705         0.038      0
  Judge C        0.756         0.115      0
  Judge D        0.692         0.179      0

- No reversals except with intersection GT (one algorithm)
- Highest and lowest agreement with consensus
- Low!

14 GT Comparisons
[Figure: comparison of ground truth (GT) versions]

15 Comparison With Random
- 1000 GT versions created by randomly selecting a judge
- Consensus sensitivity = 74.4%
- Average random sensitivity = 72.9% (sig diff at 0.05)
- Average disagreement with consensus = 7.3%
  - 5% disagreement expected (actually more)
  - 2.3% remainder (actually less) attributable to GT method
- No reversals in any of the 1000 sets

16 Conclusion
- Multiple adjudicators judge everything → expensive
- Single adjudicator → variability in sensitivity
- Multiple adjudicators randomly divide pool:
  - Slightly less sensitivity
  - No reversals of results
  - Much less labor
  - Differences wash out, approximating consensus
- Practically same result for less effort

