Published by Madlyn Harmon. Modified over 6 years ago.
1
Concept Grounding to Multiple Knowledge Bases via Indirect Supervision
Chen-Tse Tsai and Dan Roth
University of Illinois at Urbana-Champaign
2
Concept Grounding
Grounding concepts and entities
  Wikipedia is not the single ideal resource for some domains
  There are multiple ontologies in the biological domain
    E.g., Gene Ontology, Sequence Ontology, Protein Ontology, …
  We study the problem of grounding concepts to multiple knowledge bases (KBs), using the biomedical domain as our application
Example text: "Mubarak, wife of Egyptian President Hosni Mubarak and …"
3
The Task
Input text: BRCA2 and homologous recombination.
Target concepts: PR:000004804, EG:675

Protein Ontology
  id: PR:
  name: breast cancer type 2 susceptibility protein
  def: A protein that is a translation product of the human BRCA2 gene or a 1:1 ortholog thereof
  synonyms: BRCA2, FACD, …
  is_a: PR:

Entrez Gene
  id: EG:675
  symbol: BRCA2
  description: protein-coding BRCA2 breast cancer 2, early onset
  synonyms: BRCC2, BROVCA2, …

Unlike Wikipedia entries, these entries have no full-text article with hyperlink structure
4
Challenges
Ambiguity
  A phrase can be used to express many different concepts
  BRCA2 is used by 177 concepts
Variability
  A concept may have many synonyms
  EG:675 has synonyms BRCC2, FACD, FAD, FANCD, …
Supervision
  Wikipedia has a rich hyperlink structure, which the other KBs lack
  It is difficult to obtain human annotations for scientific domains
  We explore the relationships between KBs to construct training examples without any document. The ranking model trained on these examples outperforms all unsupervised methods in our experiments.
5
System Overview
Pipeline: Mention -> Concept Candidate Generation -> Concept Candidate Ranking -> Global Inference with Knowledge -> Ranked Candidates
Indirect Supervision draws on KB1, KB2, …, KBl
Example: BRCA2 and homologous recombination
  Candidates: EG:675, PR:04804, EG:77244, GO:02111
  Ranked candidates:
    Candidate   Score
    PR:04804    0.8
    EG:77244    0.6
    EG:675      0.2
    GO:02111    0.1
6
Candidate Generation
Given a mention, produce a small set of possible concepts
Synonym Matching
  Construct a dictionary from all synonyms and names across all KBs
  Phrase -> possible concept IDs
Word Matching
  Split phrases into words, and combine concepts by words
  Word -> possible concept IDs
  Only keep the top k concepts
Words are normalized by the SPECIALIST Lexical Tools
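The two matching steps on this slide could be sketched roughly as follows. This is only an illustration under stated assumptions: `normalize` is a lowercase-only stand-in for the SPECIALIST Lexical Tools, the concept IDs `PR:X` and `GO:Y` are hypothetical, and ranking word matches by the number of shared words is an assumed way of choosing the top k (the slides do not say how the top k are selected).

```python
from collections import Counter, defaultdict

def normalize(text):
    # Assumption: lowercase-only stand-in for SPECIALIST Lexical Tools normalization.
    return text.lower().strip()

def build_indexes(kb_entries):
    """kb_entries: iterable of (concept_id, [names and synonyms]) pooled from all KBs."""
    phrase_index = defaultdict(set)  # phrase -> possible concept IDs
    word_index = defaultdict(set)    # word -> possible concept IDs
    for concept_id, names in kb_entries:
        for name in names:
            phrase = normalize(name)
            phrase_index[phrase].add(concept_id)
            for word in phrase.split():
                word_index[word].add(concept_id)
    return phrase_index, word_index

def generate_candidates(mention, phrase_index, word_index, k=10):
    phrase = normalize(mention)
    # Synonym matching: exact dictionary lookup of the whole phrase.
    exact = sorted(phrase_index.get(phrase, ()))
    # Word matching: count how many mention words each concept shares
    # (assumed scoring), then keep only the top k of these.
    votes = Counter()
    for word in set(phrase.split()):
        for concept_id in word_index.get(word, ()):
            votes[concept_id] += 1
    ranked = sorted(votes, key=lambda c: (-votes[c], c))
    by_words = [c for c in ranked if c not in exact][:k]
    return exact + by_words

# Toy KB entries; PR:X and GO:Y are hypothetical IDs.
entries = [
    ("EG:675", ["BRCA2", "BRCC2"]),
    ("PR:X", ["BRCA2", "breast cancer type 2 susceptibility protein"]),
    ("GO:Y", ["homologous recombination"]),
]
phrase_index, word_index = build_indexes(entries)
brca2_candidates = generate_candidates("BRCA2", phrase_index, word_index)
recomb_candidates = generate_candidates("recombination", phrase_index, word_index)
```

Here the mention "BRCA2" is resolved by synonym matching alone, while "recombination" has no exact dictionary entry and is only reachable through word matching.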
7
Candidate Ranking
Relevance score between (mention, concept candidate)
Representations of the mention m
  context-word(m): neighboring words in the document
  context-concept(m): concept candidates of other mentions in the document
Representations of the candidate c
  def(c): definition of c
  neighbor(c): concepts that have a relation with c, across all KBs
Ranking features
  Common words in context-word(m) and def(c)
  Common concepts in context-concept(m) and neighbor(c)
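A minimal sketch of the two overlap features, assuming they are simple set-intersection counts (the slides do not state whether counts, weights, or normalized scores are used). The function names, the toy inputs, and the ID `GO:Y` are hypothetical.

```python
def overlap(a, b):
    # Number of distinct items shared by two collections.
    return len(set(a) & set(b))

def ranking_features(context_words, context_concepts, definition_words, neighbor_concepts):
    """Features for one (mention m, candidate c) pair.

    context_words     ~ context-word(m)
    context_concepts  ~ context-concept(m)
    definition_words  ~ def(c)
    neighbor_concepts ~ neighbor(c)
    """
    return {
        "common_words": overlap(context_words, definition_words),
        "common_concepts": overlap(context_concepts, neighbor_concepts),
    }

# Toy example with hypothetical values.
features = ranking_features(
    context_words=["homologous", "recombination", "protein"],
    context_concepts=["GO:Y"],
    definition_words=["protein", "translation", "product", "human", "gene"],
    neighbor_concepts=["GO:Y", "EG:675"],
)
```

A learned ranker would then map such feature vectors to the relevance score between the mention and each candidate.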
8
Indirect Supervision
We explore the redundancy and relationships between KBs to construct training examples
Discovering positive examples
  Cross reference
    Example (Chromosome): GO: and SO: both xref Wikipedia:Chromosome
    Treat one concept as the "mention", annotated by the other
  has_participant relationship
    Example: GO: (fructose metabolic process) has_participant fructose
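The cross-reference idea could be sketched as below: concepts that point at the same external reference are paired, with one acting as the "mention" and the other as its annotation. The IDs `GO:a`, `SO:b`, `GO:c`, the input format, and the restriction to pairs from different KBs are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import permutations

def positive_pairs_from_xrefs(xrefs):
    """xrefs: dict mapping concept_id -> set of external references (e.g. Wikipedia pages).

    Returns sorted (mention_concept, gold_concept) pairs for concepts that
    share a cross reference.
    """
    by_target = defaultdict(set)
    for concept_id, targets in xrefs.items():
        for target in targets:
            by_target[target].add(concept_id)

    pairs = set()
    for concepts in by_target.values():
        for mention_concept, gold_concept in permutations(sorted(concepts), 2):
            # Assumption: only pair concepts from different KBs (prefix before ':').
            if mention_concept.split(":")[0] != gold_concept.split(":")[0]:
                pairs.add((mention_concept, gold_concept))
    return sorted(pairs)

# Hypothetical IDs standing in for the truncated GO: and SO: identifiers.
xrefs = {
    "GO:a": {"Wikipedia:Chromosome"},
    "SO:b": {"Wikipedia:Chromosome"},
    "GO:c": {"Wikipedia:Other"},
}
pairs = positive_pairs_from_xrefs(xrefs)
```

Each pair is emitted in both directions, so either concept can play the role of the mention.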
9
Indirect Supervision
Generating other candidates
  Apply candidate generation on the name of the concept
  Uniformly sample 200 concepts from all KBs
  Take the number of common ancestors between a candidate and the positive candidate as the relevance score
Extracting features from pairs of concepts
  There is no context for the concept used as the mention (GO: in the example)
  Use def(m) instead of context-word(m)
  Use neighbor(m) instead of context-concept(m)
Example pair: GO: and SO:
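The common-ancestor relevance score might be computed along these lines. The sketch assumes a single parent map covering both concepts (for instance built from is_a relations), and assumes that a concept is not counted as its own ancestor; neither detail is specified on the slide. The toy hierarchy and node names are hypothetical.

```python
def ancestors(concept, parents):
    """All ancestors of a concept, following parent links transitively."""
    seen = set()
    stack = list(parents.get(concept, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen

def relevance(candidate, positive, parents):
    # Relevance score = number of ancestors shared with the positive candidate.
    return len(ancestors(candidate, parents) & ancestors(positive, parents))

# Hypothetical hierarchy: c1 and c2 are siblings under p1; c3 hangs off the root.
parents = {
    "c1": ["p1"],
    "c2": ["p1"],
    "p1": ["root"],
    "c3": ["root"],
}
sibling_score = relevance("c1", "c2", parents)
distant_score = relevance("c3", "c2", parents)
```

Candidates closer to the positive concept in the hierarchy share more ancestors and so receive a higher graded relevance than distant ones.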
10
Global Inference
Enforcing a coherent global solution for all mentions in a document by constraints
Hard constraints
  If a gene is from a species that is not mentioned anywhere in the document, it is removed from the final list
  Entries in the Entrez Gene database and the Protein Ontology have relations to the NCBI Taxonomy
Formulation (equation image not preserved): its labeled parts are the ranking score of the j-th candidate of the i-th mention, and the constraints
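The species hard constraint can be illustrated as a simple post-filter over a ranked list. This is a sketch, not the inference formulation from the talk: the function name, the `species_of` mapping, the taxon labels, and the choice to keep concepts with no species link are all assumptions.

```python
def apply_species_constraint(ranked, species_of, species_in_document):
    """Drop candidates whose species is never mentioned in the document.

    ranked: list of (concept_id, score), best first.
    species_of: dict concept_id -> taxonomy ID (only for genes/proteins).
    species_in_document: set of taxonomy IDs grounded somewhere in the document.
    """
    kept = []
    for concept_id, score in ranked:
        species = species_of.get(concept_id)
        # Assumption: concepts without a species link are left untouched.
        if species is None or species in species_in_document:
            kept.append((concept_id, score))
    return kept

# Hypothetical candidates and taxon labels.
ranked = [("gene:1", 0.8), ("gene:2", 0.6), ("process:3", 0.1)]
species_of = {"gene:1": "taxon:A", "gene:2": "taxon:B"}
filtered = apply_species_constraint(ranked, species_of, {"taxon:B"})
```

Because the filter preserves order, the surviving candidates keep their relative ranking.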
11
Dataset
Colorado Richly Annotated Full-Text corpus [Bada et al., 2012]
  67 full-text biomedical journal articles
  7 ontologies

Ontology    # Concepts    # Annotations   # Unique annot.
PR              26,879           15,593               889
NCBITaxon      789,509            7,449               149
GO              25,471           29,443             1,235
CHEBI           19,633            8,137               553
EG          17,097,474           12,266             1,021
SO               1,704           21,284               259
CL                 857            5,760               155
Total       17,961,527           99,138             4,261
12
Evaluation
Comparing to 5 unsupervised methods
  AUC: area under the precision-recall curve
  hAUC: hierarchical version, considering common ancestors

Approach                        Mean AUC   Mean hAUC
TF-IDF                             40.44       48.50
PageRank                           42.78       50.04
Zheng et al. (2014)                35.67       42.93
Agirre and Soroa (2009)            43.39       51.88
Agirre and Soroa (2009) w2w        46.51       55.46
Our Approach                       48.58       57.37
Direct Supervision                 58.98       62.59
13
Using KBs Individually vs. Jointly
  Joint: grounding a mention to all KBs simultaneously
  Individual: focusing on a single KB each time

Approach                        Individual   Joint
PageRank                             49.85   55.74
Zheng et al. (2014)                  49.46   55.42
Agirre and Soroa (2009)              52.12   54.88
Agirre and Soroa (2009) w2w          52.23   56.18
Our Approach                         49.93   57.65
14
Conclusions
Concept grounding to multiple KBs without hyperlink structure
  We propose an approach to construct training examples without using any document
  It enables us to apply well-studied statistical models, and it outperforms unsupervised methods
  We show that considering multiple KBs together has an advantage over using each KB individually