Distant Supervision for Knowledge Base Population
Mihai Surdeanu, David McClosky, John Bauer, Julie Tibshirani, Angel Chang, Valentin Spitkovsky, Christopher Manning
Definition and Approach
We took part in TAC KBP 2010 this year (both tasks)
Slot filling task: learning a pre-defined set of relations and attributes for target entities based on documents in a collection
– “Warren Buffett began studying at the Wharton School of Finance at the University of Pennsylvania, but transferred to the University of Nebraska where he graduated.”
  (per:schools_attended, Warren Buffett, University of Pennsylvania)
  (per:schools_attended, Warren Buffett, University of Nebraska)
Distant supervision approach: generate training data automatically from Wikipedia infoboxes
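The distant supervision idea can be sketched as follows. This is a minimal illustration, assuming plain substring matching; the function name `label_sentences` and the triple layout are hypothetical and not taken from the system itself. Every sentence that mentions both the entity and a known infobox value is treated as a (noisy) positive example for that slot.

```python
from typing import List, Tuple

# (slot, entity, value) triples as they might be derived from infoboxes.
# The layout is an assumption made for this sketch.
Triple = Tuple[str, str, str]


def label_sentences(
    sentences: List[str], triples: List[Triple]
) -> List[Tuple[str, str, str, str]]:
    """Return (sentence, entity, value, slot) for every sentence that
    mentions both the entity and the slot value of a known triple."""
    examples = []
    for sentence in sentences:
        lowered = sentence.lower()
        for slot, entity, value in triples:
            if entity.lower() in lowered and value.lower() in lowered:
                examples.append((sentence, entity, value, slot))
    return examples


triples = [
    ("per:schools_attended", "Warren Buffett", "University of Pennsylvania"),
    ("per:schools_attended", "Warren Buffett", "University of Nebraska"),
]
sentences = [
    "Warren Buffett began studying at the University of Pennsylvania, "
    "but transferred to the University of Nebraska where he graduated.",
    "Warren Buffett lives in Omaha.",
]
examples = label_sentences(sentences, triples)
```

Here the first sentence yields two training examples, one per school, while the second yields none because it contains no known slot value. Matching of this kind is also where the noise comes from: a sentence can mention both strings without expressing the relation.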
Training: Infobox KB
– Map infobox fields to KBP slots (one-to-many mapping)
– IR: find relevant sentences (query: entity name + slot value)
– Extract +/- slot candidates
– Train multiclass classifier
Map KBP slots to fine-grained NE labels
Evaluation: KBP query (entity name)
– IR: find relevant sentences (query: entity name + trigger words)
– Extract slot candidates
– Classify candidates
– Inference (greedy, local)
– Extracted slots
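The slide gives no detail on the "greedy, local" inference step, so the following is only one plausible reading, sketched under stated assumptions: each candidate is labeled independently from its own classifier scores (local), and slots are then filled in descending score order (greedy). The function name, the score format, and the `SINGLE_VALUED` set are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical set of slots assumed to hold at most one value.
SINGLE_VALUED = {"per:date_of_birth", "per:city_of_birth", "org:founded"}


def greedy_local_inference(
    candidates: List[Tuple[str, Dict[str, float]]]
) -> Dict[str, List[str]]:
    """candidates: (value, {label: score}) pairs for one entity."""
    # Local step: pick the best label for each candidate on its own.
    decisions = []
    for value, scores in candidates:
        label = max(scores, key=scores.get)
        if label != "UNRELATED":
            decisions.append((scores[label], label, value))

    # Greedy step: fill slots from the highest-scoring decision down.
    filled: Dict[str, List[str]] = {}
    for _score, label, value in sorted(decisions, reverse=True):
        values = filled.setdefault(label, [])
        if label in SINGLE_VALUED and values:
            continue  # slot already taken by a higher-scoring value
        if value not in values:
            values.append(value)
    return filled


candidates = [
    ("Omaha", {"per:city_of_birth": 0.7, "UNRELATED": 0.3}),
    ("New York", {"per:city_of_birth": 0.6, "UNRELATED": 0.4}),
    ("chairman", {"per:title": 0.8, "UNRELATED": 0.2}),
    ("Tuesday", {"per:date_of_birth": 0.1, "UNRELATED": 0.9}),
]
result = greedy_local_inference(candidates)
```

In this toy run "New York" is dropped because the single-valued birth-city slot was already filled by the higher-scoring "Omaha", and "Tuesday" is dropped because its best label is UNRELATED.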
Results
Columns: Label, Correct, Predict, Actual, P, R, F1
Rows (top 10 most common slots): UNRELATED, org:city_of_headquarters, org:country_of_headquarters, org:founded, org:parents, org:top_members/employees, per:city_of_birth, per:country_of_birth, per:date_of_birth, per:member_of, per:title
Total (for all slots)
Training on 2/3 of infoboxes, evaluating on 1/3
Evaluating only on sentences that contain at least one valid slot
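For reference, the P, R, and F1 columns can be derived from the Correct, Predict, and Actual counts, assuming the standard definitions of precision, recall, and F1. The helper name and the counts below are illustrative only and are not results from this evaluation.

```python
from typing import Tuple


def precision_recall_f1(correct: int, predicted: int, actual: int) -> Tuple[float, float, float]:
    """Compute (P, R, F1) for one label from its raw counts.

    correct:   predictions of this label that match the gold label
    predicted: total predictions of this label
    actual:    total gold instances of this label
    """
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts only, not taken from the results table.
p, r, f1 = precision_recall_f1(correct=30, predicted=40, actual=60)
```

The zero checks keep a label with no predictions or no gold instances from raising a division error.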
Challenges
Improve quality of data generated through distant supervision
Improve IR recall
– Use relation-specific trigger words (or n-grams, dependency paths, etc.) to boost sentences likely to contain answers to the top
– How to acquire these automatically?
Better classifiers for noisy text (e.g., web snippets)
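The trigger-word idea can be sketched as a simple re-ranking step. This is a rough illustration, assuming substring matching and a hand-written trigger list; the `TRIGGERS` table, the `rank_sentences` name, and the scoring formula are hypothetical, and acquiring such triggers automatically remains the open question above.

```python
from typing import Dict, List

# Hypothetical, hand-picked trigger words per slot.
TRIGGERS: Dict[str, List[str]] = {
    "per:schools_attended": ["graduated", "studied", "studying", "degree"],
    "per:date_of_birth": ["born"],
}


def rank_sentences(
    sentences: List[str], entity: str, slot: str, boost: float = 1.0
) -> List[str]:
    """Keep sentences that mention the entity and move those containing
    trigger words for the slot toward the top."""
    triggers = TRIGGERS.get(slot, [])
    scored = []
    for index, sentence in enumerate(sentences):
        lowered = sentence.lower()
        if entity.lower() not in lowered:
            continue
        hits = sum(1 for trigger in triggers if trigger in lowered)
        # -index keeps the original order among equally scored sentences.
        scored.append((1.0 + boost * hits, -index, sentence))
    scored.sort(reverse=True)
    return [sentence for _score, _neg_index, sentence in scored]


sentences = [
    "Warren Buffett lives in Omaha.",
    "Warren Buffett graduated from the University of Nebraska.",
    "The weather was fine.",
]
ranked = rank_sentences(sentences, "Warren Buffett", "per:schools_attended")
```

The sentence containing "graduated" is ranked first, the other sentence about the entity follows, and the unrelated sentence is filtered out. The same scoring slot could take n-gram or dependency-path matches in place of single words.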