Adjudicator Agreement and System Rankings for Person Name Search
Mark Arehart, Chris Wolf, Keith Miller
The MITRE Corporation {marehart, cwolf,

Summary
- Matching multicultural name variants is knowledge intensive
- Building a ground truth dataset requires tedious adjudication
- Guidelines are not comprehensive; adjudicators often disagree
- Previous evaluations: multiple adjudication with voting
- Results of this study: high agreement, multiple adjudication not needed
- "Nearly" the same payoff for much less effort

Dataset
- Watchlist: ~71K records drawn from deceased persons lists, mixed cultures
- 1.1K variants for 404 base names (avg. 2.8 variants per base record)
- Queries (base names): 296 randomly selected from the watchlist
- Subset of 100 randomly selected for this study

Method
- Adjudication pools built as in TREC, pooled from 13 algorithms
- Four judges adjudicated the complete pools (1,712 pairs, excluding exact matches)
- Compare system rankings under different versions of the ground truth (GT)

Ground truth types (criteria for a true match; see the sketch below):
  1. Consensus    - tie or majority vote (baseline)
  2. Union        - judged true by any adjudicator
  3. Intersection - judged true by every adjudicator
  4. Single       - judgments from a single adjudicator (4 sets)
  5. Random       - randomly chosen adjudicator per item (1000 sets)
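The five ground-truth variants reduce to simple rules over the per-judge verdicts. The sketch below is illustrative only, not the authors' code: the judgments mapping, its name-pair keys, and the four-boolean layout are assumptions. (The Random variant is sketched after the "Comparison With Random" slide.)

```python
# Hypothetical layout: each (query name, candidate name) pair maps to the four
# adjudicators' boolean verdicts.  Illustrative data, not from the evaluation.
judgments = {
    ("Jon Smith", "John Smythe"): [True, True, False, True],
    ("Ali Hassan", "Aly Hasan"):  [True, False, False, True],
}

def consensus(votes):
    """True match on a tie or majority vote (the baseline GT)."""
    return 2 * sum(votes) >= len(votes)

def union(votes):
    """True match if any adjudicator judged the pair a match."""
    return any(votes)

def intersection(votes):
    """True match only if every adjudicator judged the pair a match."""
    return all(votes)

def single(votes, judge_index):
    """GT taken from one adjudicator; with four judges there are four such sets."""
    return votes[judge_index]

for pair, votes in judgments.items():
    print(pair, consensus(votes), union(votes), intersection(votes), single(votes, 0))
```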

Adjudicator Agreement Measures

Pairwise 2x2 table for two adjudicators (+ = judged a match, - = judged a non-match):

               Judge 2 +   Judge 2 -
  Judge 1 +        a           b
  Judge 1 -        c           d

  overlap = a / (a + b + c)
  p+      = 2a / (2a + b + c)
  p-      = 2d / (2d + b + c)
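A minimal sketch of these agreement measures, plus the Cohen's kappa reported on the next slide, computed from the four cell counts; the function name and the example counts are invented for illustration.

```python
def agreement_measures(a, b, c, d):
    """Pairwise agreement from a 2x2 table: a = both judged match, d = both
    judged non-match, b and c = the two kinds of disagreement."""
    overlap = a / (a + b + c)              # ignores pairs both judges rejected
    p_pos   = 2 * a / (2 * a + b + c)      # positive specific agreement (p+)
    p_neg   = 2 * d / (2 * d + b + c)      # negative specific agreement (p-)

    n = a + b + c + d
    observed = (a + d) / n                 # raw agreement
    # Chance agreement from each judge's marginal match rate.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return overlap, p_pos, p_neg, kappa

# Example with made-up counts:
print(agreement_measures(a=120, b=15, c=20, d=845))
```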

Adjudicator Agreement
- Lowest pairwise agreement: A~B, kappa = 0.57
- Highest pairwise agreement: C~D, kappa = 0.78

So far…
- A test watchlist and query list
- Results from 13 algorithms
- Adjudications by 4 volunteers
- Ways of compiling alternate ground truth sets
Still need: a way to compare the system rankings the alternate GT sets produce (next slides).

Comparing System Rankings
- Given two complete rankings of the systems, e.g. B C D A E vs. C B E A D: how similar are they?
- Measures: Kendall's tau and Spearman's rank correlation
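For complete rankings, both correlations are available in SciPy. A small illustration using the example orderings on this slide; the rank vectors simply list each system's position (A through E) under the two rankings.

```python
from scipy.stats import kendalltau, spearmanr

# Ranking 1: B C D A E  ->  positions A=4, B=1, C=2, D=3, E=5
# Ranking 2: C B E A D  ->  positions A=4, B=2, C=1, D=5, E=3
ranks_1 = [4, 1, 2, 3, 5]
ranks_2 = [4, 2, 1, 5, 3]

tau, _ = kendalltau(ranks_1, ranks_2)
rho, _ = spearmanr(ranks_1, ranks_2)
print(f"Kendall's tau = {tau:.3f}, Spearman's rho = {rho:.3f}")
```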

Significance Testing
- Not all differences between systems are significant
- F1-measure (harmonic mean of precision and recall):
  - not a proportion or a mean of independent observations
  - not amenable to traditional significance tests, like other IR measures such as MAP
- Bootstrap resampling (sketched below):
  - sample with replacement from the data
  - compute the difference between systems over many trials
  - this produces a distribution of differences
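A sketch of the bootstrap just described, assuming per-query true positive / false positive / false negative counts for each system; the data layout and function names are assumptions, not the authors' implementation.

```python
import random

def f1(tp, fp, fn):
    """F1 from pooled counts: harmonic mean of precision and recall."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def bootstrap_f1_diff(per_query_a, per_query_b, trials=10000, seed=0):
    """per_query_x: list of (tp, fp, fn) tuples, one per query, in the same
    order for both systems.  Returns the bootstrap distribution of F1(A) - F1(B)."""
    rng = random.Random(seed)
    n = len(per_query_a)
    diffs = []
    for _ in range(trials):
        idx = [rng.randrange(n) for _ in range(n)]   # resample queries with replacement
        a = [sum(per_query_a[i][k] for i in idx) for k in range(3)]
        b = [sum(per_query_b[i][k] for i in idx) for k in range(3)]
        diffs.append(f1(*a) - f1(*b))
    return diffs

# The fraction of resampled differences at or below zero gives a rough one-sided
# p-value when system A beat system B on the full data.
```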

Incomplete Ranking
- Not all differences are significant, so the complete ranking collapses into a partial ordering (e.g. B and C indistinguishable)
- How similar are two such partial orderings?

Evaluation Statements
- Each partial ordering expands into pairwise evaluation statements:
- Ordering 1: A>B A>C A>D A>E B=C B>D B>E C>D C>E D=E
- Ordering 2: A<B A>C A>D A>E B>C B>D B>E C>D C>E D=E

Similarity
- n systems yield n(n-1)/2 evaluation statements
- Set 1: A>B A>C A>D A>E B=C B>D B>E C>D C>E D=E (sensitivity = 80%)
- Set 2: A<B A>C A>D A>E B>C B>D B>E C>D C>E D=E (sensitivity = 90%)
- Sensitivity: proportion of relations with a significant difference
- Reversal rate: proportion of reversed relations (here 10%: A>B vs. A<B)
- Total disagreement: proportion of relations that differ at all (here 20%: A>B vs. A<B and B=C vs. B>C); see the sketch below
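Sensitivity, reversal rate, and total disagreement are easy to compute once each GT version has been reduced to its pairwise statements. The sketch below reproduces the example numbers above; the '>' / '<' / '=' encoding and the function name are assumptions.

```python
def compare_statements(rel_base, rel_alt):
    """rel_x maps a system pair (e.g. 'AB') to '>', '<' or '=' under one GT.
    Returns each set's sensitivity plus the reversal rate and total
    disagreement between the two sets."""
    pairs = sorted(rel_base)                                   # n(n-1)/2 pairs
    n = len(pairs)
    sens_base = sum(rel_base[p] != "=" for p in pairs) / n
    sens_alt  = sum(rel_alt[p]  != "=" for p in pairs) / n
    reversals = sum({rel_base[p], rel_alt[p]} == {">", "<"} for p in pairs) / n
    disagree  = sum(rel_base[p] != rel_alt[p] for p in pairs) / n
    return sens_base, sens_alt, reversals, disagree

base = {"AB": ">", "AC": ">", "AD": ">", "AE": ">", "BC": "=",
        "BD": ">", "BE": ">", "CD": ">", "CE": ">", "DE": "="}
alt  = {"AB": "<", "AC": ">", "AD": ">", "AE": ">", "BC": ">",
        "BD": ">", "BE": ">", "CD": ">", "CE": ">", "DE": "="}
print(compare_statements(base, alt))   # -> (0.8, 0.9, 0.1, 0.2)
```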

Comparisons With Baseline
- Table columns: truth set, sensitivity, disagreement with the consensus baseline, reversal rate
- Consensus (baseline): sensitivity 0.744; disagreement and reversal n/a
- Union, Intersection, Judge A, Judge B, Judge C, Judge D: remaining values appear only in the slide's table
- No reversals except with the intersection GT (one algorithm)
- Slide callouts: highest and lowest agreement with consensus; one value flagged "Low!"

GT Comparisons (figure-only slide comparing the ground truth versions)

Comparison With Random
- GT versions created by randomly selecting a judge per item (1000 sets; sketched below)
- Consensus sensitivity = 74.4%; average random sensitivity = 72.9% (significant difference at 0.05)
- Average disagreement with consensus = 7.3%
  - about 5% disagreement is expected by chance (actually somewhat more)
  - the remaining ~2.3% (actually less) is attributable to the GT method
- No reversals in any of the 1000 sets
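A sketch of how the 1000 random GT sets could be generated, picking one adjudicator's verdict per pair; the judgments layout matches the earlier illustrative sketch and is likewise an assumption.

```python
import random

def random_gt_sets(judgments, n_sets=1000, seed=0):
    """judgments: {pair: [bool verdict per adjudicator]}.  Each generated set
    keeps one randomly chosen adjudicator's verdict per pair, modelling
    'randomly divide the pool among the adjudicators'."""
    rng = random.Random(seed)
    return [{pair: rng.choice(votes) for pair, votes in judgments.items()}
            for _ in range(n_sets)]
```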

Conclusion
- Multiple adjudicators judging everything → expensive
- A single adjudicator → variability in sensitivity
- Multiple adjudicators randomly dividing the pool:
  - slightly lower sensitivity
  - no reversals of results
  - much less labor
- Individual differences wash out, approximating the consensus
- Practically the same result for much less effort