Named Entity Disambiguation on an Ontology Enriched by Wikipedia
Hien Thanh Nguyen (1), Tru Hoang Cao (2)
(1) Ton Duc Thang University, Vietnam
(2) Ho Chi Minh City University of Technology, Vietnam
International IEEE Conference - RIVF’08

Outline
Introduction
Background
Approach
Evaluation
Conclusion

Introduction
Most Web pages present no explicit semantic information about their data and objects.
The Semantic Web aims to solve this problem by making semantic metadata available in web page content
– Ex: the entity “John McCarthy” pointing to the homepage of the inventor of the Lisp programming language
– Entity disambiguation

Introduction - Entity disambiguation
Entity disambiguation is the process of identifying when different references correspond to the same real-world entity (Jorge Cardoso and Amit Sheth)
Our work aims at detecting named entities in a text and linking them to a given ontology

Introduction - What are Named Entities?
Named entities (NEs) include people, organizations, locations, dates, times, money, measures, percentages, etc.
Example: “Ms. Washington's candidacy is being championed by several powerful lawmakers including her boss, Chairman John Dingell (D., Mich.) of the House Energy and Commerce Committee.”

Introduction - Basic problem in NE
Many NEs share the same name
– Ambiguity of NE types: John Smith (company vs. person), May (person vs. month), Washington (person vs. location), etc.
– Ambiguity of referent (e.g. Paris may be the capital of France or a small town in Texas)

Introduction - Our contribution is two-fold
Utilizing ontological concepts, and properties of instances in a specific KB, to automatically generate a corpus of labeled training data
Exploiting Wikipedia to enrich the training data with new and informative features
Exploring a range of features extracted from texts, a KB, and Wikipedia

Background - Ontology
An ontology schema defines a taxonomy of classes and properties (relations and attributes)
A knowledge base contains semantic descriptions, including attributes and relations, of named entities in the real world

Background - Wikipedia
Each article defines an entity or a concept
Four sources of information
– Title
– Redirect titles
– Categories
– Hyperlinks (outlinks vs. inlinks)

Background - Wikipedia

Approach
Exploiting terms (i.e. base noun phrases) and named entities co-occurring with an ambiguous name for disambiguation
Casting the problem as a ranking problem
– Using TF-IDF to calculate similarity and choosing the candidate with the highest score
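
The ranking step can be illustrated with a minimal sketch. It assumes each candidate is represented by a tokenized snippet, that the context of the ambiguous name is a token list, and that similarity is the cosine between TF-IDF vectors; the slides only say TF-IDF is used to calculate similarity, so the smoothed IDF weighting, the cosine measure, the function names, and the toy data below are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (dicts)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption): terms found everywhere keep a small weight.
        vectors.append(
            {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t in tf}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(context_tokens, candidate_snippets):
    """Return (candidate id, score) pairs, best match first."""
    ids = list(candidate_snippets)
    # The context is weighted together with the snippets (a design choice).
    vectors = tfidf_vectors([candidate_snippets[i] for i in ids] + [context_tokens])
    context_vec = vectors[-1]
    scored = [(i, cosine(context_vec, vec)) for i, vec in zip(ids, vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data, invented for illustration only.
candidates = {
    "candidate_a": ["lisp", "artificial", "intelligence", "computer", "scientist"],
    "candidate_b": ["senator", "politics", "election"],
}
ranking = rank_candidates(["inventor", "lisp", "programming"], candidates)
best_id = ranking[0][0]
```

Here `candidate_a` ranks first because it shares the term "lisp" with the context, while `candidate_b` shares nothing and scores zero.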

Approach
Constructing corpus
– Utilizing classes and properties to generate a snippet for each instance in an ontology
– Feature generation for enriching representation of those instances
Analyzing a text for disambiguation and identification of NEs occurring therein
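
As a rough sketch of the corpus construction step, the function below flattens one ontology instance into a bag of tokens and optionally appends Wikipedia-derived features (title, redirect titles, categories, hyperlinks). The dictionary layout, the key names, and the property name `capitalOf` are hypothetical; the slides do not specify how the KB or the Wikipedia features are stored.

```python
def build_snippet(instance, wiki_features=None):
    """Flatten an ontology instance (plus optional Wikipedia features) into tokens.

    `instance` is assumed to look like:
        {"label": str, "classes": [str, ...], "properties": {name: str or [str, ...]}}
    `wiki_features` is assumed to map "title", "redirects", "categories",
    "outlinks", and "inlinks" to a string or a list of strings.
    """
    parts = [instance["label"]]
    parts.extend(instance.get("classes", []))
    # Property values (attributes and relations) become part of the snippet.
    for value in instance.get("properties", {}).values():
        parts.extend(value if isinstance(value, list) else [value])
    if wiki_features:
        for key in ("title", "redirects", "categories", "outlinks", "inlinks"):
            value = wiki_features.get(key, [])
            parts.extend([value] if isinstance(value, str) else value)
    # Lowercase and split on whitespace; a real system would tokenize more carefully.
    return [token for part in parts for token in part.lower().split()]


# Hypothetical instance and Wikipedia features, for illustration only.
paris = {"label": "Paris", "classes": ["City"], "properties": {"capitalOf": "France"}}
wiki = {"title": "Paris", "categories": ["Capitals in Europe"]}
snippet = build_snippet(paris, wiki)
```

The resulting token list is what a ranking step would compare against the context of an ambiguous name.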

Approach - Construct corpus

Approach - Construct corpus

Approach - Disambiguation process
For each ambiguous name
– Looking up candidates
– Extracting base noun phrases in the same sentence and in the headline
– Extracting named entities in the whole text
– Using TF-IDF to rank and choose the candidate with the highest score
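
The first three steps above might be sketched as follows. The sketch assumes noun phrases and named entities have already been extracted by an upstream chunker and NE recognizer, and that candidates are found through a simple name index; the data layout and the names `lookup_candidates`, `collect_context`, and `name_index` are hypothetical.

```python
def lookup_candidates(name, name_index):
    """Return the ontology instances that share this name (case-insensitive)."""
    return name_index.get(name.lower(), [])


def collect_context(name, headline, sentences):
    """Gather disambiguation evidence for one ambiguous name.

    `headline` and every item of `sentences` are assumed to be dicts with
    "text", "noun_phrases", and "entities" keys filled in by upstream tools.
    """
    # Base noun phrases and entities from the headline.
    context = list(headline["noun_phrases"])
    context.extend(e for e in headline["entities"] if e != name)
    for sentence in sentences:
        # Base noun phrases only from sentences that mention the name.
        if name in sentence["text"]:
            context.extend(sentence["noun_phrases"])
        # Named entities from the whole text, excluding the name itself.
        context.extend(e for e in sentence["entities"] if e != name)
    return context


# Toy document, invented for illustration only.
headline = {"text": "Lisp turns fifty", "noun_phrases": ["Lisp"], "entities": []}
sentences = [
    {"text": "John McCarthy gave a talk.", "noun_phrases": ["a talk"],
     "entities": ["John McCarthy"]},
    {"text": "Stanford hosted the event.", "noun_phrases": ["the event"],
     "entities": ["Stanford"]},
]
name_index = {"john mccarthy": ["instance_1", "instance_2"]}

candidates = lookup_candidates("John McCarthy", name_index)
context = collect_context("John McCarthy", headline, sentences)
```

The collected context would then be scored against each candidate's snippet, and the highest-scoring candidate chosen.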

Approach - An example

Evaluation
Using the KIM Ontology
140 news articles from several news agencies
Focusing on four names: John McCarthy, John Williams, Georgia, and Columbia
Accuracy is measured as the number of correct assignments of NEs (in text) to ontology instances divided by the total number of assignments
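
The accuracy measure can be written as a short function. This is a sketch that assumes system output and gold annotations are both available as mappings from a mention identifier to an ontology instance identifier; the identifiers and the sample data are invented.

```python
def accuracy(predicted, gold):
    """Correct assignments divided by the total number of assignments made."""
    if not predicted:
        return 0.0
    correct = sum(
        1 for mention, instance in predicted.items() if gold.get(mention) == instance
    )
    return correct / len(predicted)


# Invented example: four assignments, three of them correct.
predicted = {"m1": "inst_a", "m2": "inst_b", "m3": "inst_c", "m4": "inst_d"}
gold = {"m1": "inst_a", "m2": "inst_b", "m3": "inst_c", "m4": "inst_x"}
score = accuracy(predicted, gold)  # 3 / 4
```

Note that the denominator counts only the assignments the system actually made, which follows the definition on the slide.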

Evaluation

Conclusion
Our approach is natural and similar to the way humans disambiguate, relying on co-occurring NEs and terms to resolve other ambiguous entities in a given context
Wikipedia editions are currently available for approximately 200 languages, so our method can be used to build NE disambiguation systems for a large number of languages
The features from Wikipedia and the NEs in the whole text are meaningful evidence for disambiguation
Future work: detecting NEs that are not in the ontology, and investigating other similarity metrics

Thanks for your attention!

