Overview of the KBP 2013 Slot Filler Validation Track
Hoa Trang Dang
National Institute of Standards and Technology
Slot Filler Validation (SFV) Track Goals
▫Allow teams without a full slot-filling system to participate; focus on answer validation rather than document retrieval
▫Evaluate the contribution of RTE systems to KBP slot filling
▫Allow teams to experiment with system voting and global
SFV input:
▫Candidate slot filler
▫Possibly additional information about candidate slot fillers
SFV output:
▫Binary classification (Correct / Incorrect) of each candidate slot filler
Can only improve precision, not recall, of full slot-filling systems
Evaluation metrics depend on the SFV use case and the availability of additional information about candidate fillers
TAC RTE KBP Validation task (2011)
TAC KBP Slot Filler Validation task (2012)
TAC RTE KBP Validation task (2011)
Each slot filler returned by SF systems becomes 1 RTE evaluation pair, where:
▫T is the entire document supporting the slot filler
▫H is a set of synonymous sentences, representing different realizations of the slot filler
Use Case 1: SFV as Textual Entailment (2011)
SFV input:
▫All regular English slot filling input (slot definitions, queries, source documents)
▫Individual candidate slot fillers (filler, provenance)
Local Approach:
▫Generic textual entailment: H is the relation implied by the candidate slot filler (e.g., “Barack Obama has lived in Chicago”), T is the provenance (entire document, or smaller regions defined by justification offsets)
▫Tailored textual entailment: train on different slot types; could be a validation module for a full slot filling system
Evaluation:
▫F score on entire pool of candidate slot fillers (unique slot filler, provenance)
▫Baseline: all T’s classified as entailing the corresponding H: P=R=percentage of entailing pairs in the pooled SF responses
▫Weak baseline, easily beaten by all SFV systems; not a direct measure of the utility of SFV to SF
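The pooled F score above can be illustrated with a minimal sketch. It assumes each pooled candidate has one gold judgment and one SFV decision, and scores the "Correct" class only. The function name `pool_prf` and the toy data are hypothetical and are not taken from the official scorer.

```python
from typing import List, Tuple

def pool_prf(gold: List[bool], predicted: List[bool]) -> Tuple[float, float, float]:
    """Precision, recall and F1 of the 'Correct' class over a candidate pool.

    gold[i] is the assessor's judgment for candidate i (True = Correct);
    predicted[i] is the SFV system's decision for the same candidate.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    true_pos = sum(g and p for g, p in zip(gold, predicted))
    n_predicted = sum(predicted)  # candidates the SFV system accepted
    n_gold = sum(gold)            # candidates that are actually correct
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy pool of four candidates (made-up judgments).
gold = [True, False, True, False]
sfv = [True, True, False, False]
print(pool_prf(gold, sfv))
```

Because this score is computed over the pooled candidates rather than per run, it does not show how much a given SF run would gain from the filter.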
Use Case 2: SFV impact on single SF systems
SFV input:
▫All regular English slot filling input (slot definitions, queries, source documents)
▫Individual candidate slot fillers (filler, provenance, confidence), broken out into individual slot filling runs
Global Approach:
▫System voting, leveraging features across multiple SF runs
Evaluation:
▫Filter out “Incorrect” slot fillers from each run, and score according to regular English SF; compare to score for original run
Slot Filler Validation (SFV) 2012
SFV input:
▫All regular English slot filling input (slot definitions, queries, source documents)
▫Individual candidate slot fillers (filler, provenance, confidence), broken out into individual slot filling runs
▫System profile for each SF run
▫Preliminary assessment of 10% of KBP 2013 Slot Filling queries
SFV output:
▫Binary classification (Correct / Incorrect) of each candidate slot filler
Evaluation:
▫Filter out “Incorrect” slot fillers from each run, and score according to regular English SF; compare to score for original run
▫One SFV submission; it decreased F1 of almost all SF runs, except the poorest-performing SF runs
Slot Filler Validation (SFV) 2013
SFV input:
▫All regular English slot filling input (slot definitions, queries, source documents)
▫Individual candidate slot fillers (filler, provenance, confidence), broken out into individual slot filling runs
▫System profile for each SF run
▫Preliminary assessment of 10% of KBP 2013 Slot Filling queries
SFV output:
▫Binary classification (Correct / Incorrect) of each candidate slot filler
Evaluation:
▫Filter out “Incorrect” slot fillers from each run, and score according to regular English SF; compare to score for original run
▫Score only on the 90% of KBP 2013 slot filling queries that didn’t have preliminary assessments released as part of SFV input
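The filter-and-rescore evaluation can be sketched as follows. This is a simplified illustration, not the official English SF scorer: it assumes gold answers are exact (query, slot, filler) triples, ignores provenance and filler equivalence classes, and keeps any response the SFV system did not label. The names `f1_score` and `sfv_impact` and all data below are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Response = Tuple[str, str, str]  # (query_id, slot, filler)

def f1_score(responses: Iterable[Response], gold: Set[Response]) -> float:
    """F1 of a set of responses against a set of gold triples (simplified)."""
    responses = set(responses)
    correct = len(responses & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(responses)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

def sfv_impact(run: List[Response],
               sfv_labels: Dict[Response, bool],
               gold: Set[Response],
               released_queries: Set[str]) -> Tuple[float, float]:
    """Return (F1 of original run, F1 after SFV filtering).

    Queries whose preliminary assessments were released as SFV input
    are excluded from both scores.
    """
    held_out_run = [r for r in run if r[0] not in released_queries]
    held_out_gold = {g for g in gold if g[0] not in released_queries}
    # Assumption: a response with no SFV label is kept.
    filtered = [r for r in held_out_run if sfv_labels.get(r, True)]
    return f1_score(held_out_run, held_out_gold), f1_score(filtered, held_out_gold)

# Made-up example: query Q3 had its assessments released, so it is not scored.
run = [("Q1", "per:title", "CEO"), ("Q1", "per:title", "chef"),
       ("Q2", "org:founded", "1999"), ("Q3", "per:age", "40")]
gold = {("Q1", "per:title", "CEO"), ("Q2", "org:founded", "1999"),
        ("Q2", "org:website", "x.org"), ("Q3", "per:age", "41")}
labels = {("Q1", "per:title", "chef"): False}
before, after = sfv_impact(run, labels, gold, released_queries={"Q3"})
print(before, after)
```

In this toy case the filter removes one wrong response, so precision rises while recall is unchanged. A filter that rejects a correct response would lower recall instead, which is how SFV can decrease the F1 of a run.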
SF System Profile
SF Team ranks in KBP
Did the system extract fillers from the KBP 2013 source corpus?
Do the Confidence Values have meaning?
Is the Confidence Value a probability?
Tools or methods for:
▫Query expansion
▫Document retrieval
▫Sentence retrieval
▫NER nominal tagging
▫Coreference resolution
▫Third-party relation/event extraction
▫Dependency/Constituent parsing
▫POS tagging
▫Chunking
▫Main slot filling algorithm
▫Learning algorithm
▫Ensemble model
▫External resources
Slot Filler Validation Teams and Approaches
BIT: Beijing Institute of Technology [local]
▫Generic RTE approach based on word overlap, cosine similarity, and token edit distance
Stanford: Stanford University [local]
▫Based on Stanford’s full slot-filling system, especially the component for checking consistency and validity of candidate fillers
UI_CCG: University of Illinois at Urbana-Champaign [local]
▫Tailored RTE approach; check candidate for slot-specific constraints
jhuapl: Johns Hopkins University Applied Physics Laboratory [weak global]
▫Consider only the confidence value associated with each candidate filler and aggregate confidence values across systems
RPI_BLENDER: Rensselaer Polytechnic Institute [strong global]
▫Based on RPI_BLENDER full slot-filling system (like Stanford), but also leveraged the full set of SFV input (including SF system profile and preliminary assessments) to rank systems and apply tier-specific filtering
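As an illustration of a weak global approach, the sketch below aggregates confidence values for the same candidate filler across SF runs. The slides do not say how jhuapl aggregated confidences; summing and thresholding are assumptions made for this example, and `aggregate_confidence`, the threshold value, and the data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Candidate = Tuple[str, str, str]  # (query_id, slot, filler)

def aggregate_confidence(runs: Dict[str, List[Tuple[Candidate, float]]],
                         threshold: float = 1.0) -> Dict[Candidate, bool]:
    """Label a candidate Correct if its summed confidence reaches the threshold.

    Summation and the default threshold are illustrative choices only.
    """
    totals: Dict[Candidate, float] = defaultdict(float)
    for responses in runs.values():
        for candidate, confidence in responses:
            totals[candidate] += confidence
    return {cand: total >= threshold for cand, total in totals.items()}

# Made-up runs: two systems agree on one filler, one system proposes another.
c1 = ("Q1", "per:title", "CEO")
c2 = ("Q1", "per:title", "chef")
runs = {
    "runA": [(c1, 0.75), (c2, 0.25)],
    "runB": [(c1, 0.5)],
}
decisions = aggregate_confidence(runs)
print(decisions[c1], decisions[c2])
```

Summing raw values treats confidences from different systems as comparable. The system profile questions about whether confidence values have meaning, or are probabilities, suggest that this cannot be taken for granted.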
Impact of RPI_BLENDER2 SFV on SF Runs
[Figure: for each SF run, F1 of original SF run vs. F1 after applying SFV filter. Runs shown come from teams lsv, ARPANI, RPI_BLENDER, PRIS, NYU, UWashington, SAFT_KRes, CMUML, and TALP_UPC. Annotated groups: Top 10 SF runs; Negatively impacted SF runs.]
Conclusion
Leveraging global features boosts scores of individual SF runs, if done discriminately
▫Don’t treat all slot filling systems the same
Even weak global features (e.g., raw confidence values) may help in some cases
Caveat: other evaluation metrics are also valid, depending on use case
▫RTE KBP validation (2011) metric may be appropriate if the goal is to make assessment more efficient