YAGO: A Large Ontology from Wikipedia and WordNet Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum Max-Planck-Institute for Computer Science, Saarbruecken,

Slides:



Advertisements
Similar presentations
Ontology-Based Computing Kenneth Baclawski Northeastern University and Jarg.
Advertisements

Dr. Leo Obrst MITRE Information Semantics Information Discovery & Understanding Command & Control Center February 6, 2014February 6, 2014February 6, 2014.
Fabian M. Suchanek SOFIE: A Self-Organizing Framework for Information Extraction 1 SOFIE: A Self-Organizing Framework for Information Extraction Fabian.
Schema Matching and Query Rewriting in Ontology-based Data Integration Zdeňka Linková ICS AS CR Advisor: Július Štuller.
CH-4 Ontologies, Querying and Data Integration. Introduction to RDF(S) RDF stands for Resource Description Framework. RDF is a standard for describing.
CILC2011 A framework for structured knowledge extraction and representation from natural language via deep sentence analysis Stefania Costantini Niva Florio.
A Stepwise Modeling Approach for Individual Media Semantics Annett Mitschick, Klaus Meißner TU Dresden, Department of Computer Science, Multimedia Technology.
GridVine: Building Internet-Scale Semantic Overlay Networks By Lan Tian.
RDF Tutorial.
Leveraging Community-built Knowledge For Type Coercion In Question Answering Aditya Kalyanpur, J William Murdock, James Fan and Chris Welty Mehdi AllahyariSpring.
Graph Data Management Lab, School of Computer Science Put conference information here.
Galia Angelova Institute for Parallel Processing, Bulgarian Academy of Sciences Visualisation and Semantic Structuring of Content (some.
Applications Chapter 9, Cimiano Ontology Learning Textbook Presented by Aaron Stewart.
Xyleme A Dynamic Warehouse for XML Data of the Web.
Sensemaking and Ground Truth Ontology Development Chinua Umoja William M. Pottenger Jason Perry Christopher Janneck.
March 17, 2008SAC WT Hermes: a Semantic Web-Based News Decision Support System* Flavius Frasincar Erasmus University Rotterdam.
CSE 730 Information Retrieval of Biomedical Data The use of medical lexicon in biomedical IR.
Enhance legal retrieval applications with an automatically induced knowledge base Ka Kan Lo.
Saarbrucken / Germany ¨
RDF (Resource Description Framework) Why?. XML XML is a metalanguage that allows users to define markup XML separates content and structure from formatting.
Query Relevance Feedback and Ontologies How to Make Queries Better.
Ontology Development Kenneth Baclawski Northeastern University Harvard Medical School.
1 The BT Digital Library A case study in intelligent content management Paul Warren
YAGO:A LARGE ONTOLOGY FROM WIKIPEDIA AND WORDNET FABIAN M. SUCHANEK, GJERGJI KASNECI, GERHARD WEIKUM Subbalakshmi Iyer.
Automatic Lexical Annotation Applied to the SCARLET Ontology Matcher Laura Po and Sonia Bergamaschi DII, University of Modena and Reggio Emilia, Italy.
Machine Learning Approach for Ontology Mapping using Multiple Concept Similarity Measures IEEE/ACIS International Conference on Computer and Information.
Name : Emad Zargoun Id number : EASTERN MEDITERRANEAN UNIVERSITY DEPARTMENT OF Computing and technology “ITEC547- text mining“ Prof.Dr. Nazife Dimiriler.
Building an Ontology of Semantic Web Techniques Utilizing RDF Schema and OWL 2.0 in Protégé 4.0 Presented by: Naveed Javed Nimat Umar Syed.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
Michael Cafarella Alon HalevyNodira Khoussainova University of Washington Google, incUniversity of Washington Data Integration for Relational Web.
Ontology-Driven Automatic Entity Disambiguation in Unstructured Text Jed Hassell.
Jennie Ning Zheng Linda Melchor Ferhat Omur. Contents Introduction WordNet Application – WordNet Data Structure - WordNet FrameNet Application – FrameNet.
Semantic Enrichment of Ontology Mappings: A Linguistic-based Approach Patrick Arnold, Erhard Rahm University of Leipzig, Germany 17th East-European Conference.
© Paul Buitelaar – November 2007, Busan, South-Korea Evaluating Ontology Search Towards Benchmarking in Ontology Search Paul Buitelaar, Thomas.
Péter Schönhofen – Ad Hoc Hungarian → English – CLEF Workshop 20 Sep 2007 Performing Cross-Language Retrieval with Wikipedia Participation report for Ad.
Mining fuzzy domain ontology based on concept Vector from wikipedia category network.
Information Extraction Lecture 8 – Ontological and Open IE CIS, LMU München Winter Semester Dr. Alexander Fraser.
Advanced topics in software engineering (Semantic web)
2007. Software Engineering Laboratory, School of Computer Science S E Web-Harvest Web-Harvest: Open Source Web Data Extraction tool 이재정 Software Engineering.
A Classification of Schema-based Matching Approaches Pavel Shvaiko Meaning Coordination and Negotiation Workshop, ISWC 8 th November 2004, Hiroshima, Japan.
P2P Concept Search Fausto Giunchiglia Uladzimir Kharkevich S.R.H Noori April 21st, 2009, Madrid, Spain.
Natural Language Questions for the Web of Data Mohamed Yahya 1, Klaus Berberich 1, Shady Elbassuoni 2 Maya Ramanath 3, Volker Tresp 4, Gerhard Weikum 1.
Efficient RDF Storage and Retrieval in Jena2 Written by: Kevin Wilkinson, Craig Sayers, Harumi Kuno, Dave Reynolds Presented by: Umer Fareed 파리드.
CSC Intro. to Computing Lecture 10: Databases.
Understanding User’s Query Intent with Wikipedia G 여 승 후.
Natural Language Questions for the Web of Data 1 Mohamed Yahya, Klaus Berberich, Gerhard Weikum Max Planck Institute for Informatics, Germany 2 Shady Elbassuoni.
Algorithmic Detection of Semantic Similarity WWW 2005.
LOGO 1 Corroborate and Learn Facts from the Web Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Shubin Zhao, Jonathan Betz (KDD '07 )
An Iterative Approach to Extract Dictionaries from Wikipedia for Under-resourced Languages G. Rohit Bharadwaj Niket Tandon Vasudeva Varma Search and Information.
Natural Language Questions for the Web of Data Mohamed Yahya, Klaus Berberich, Gerhard Weikum Max Planck Institute for Informatics, Germany Shady Elbassuoni.
1 Masters Thesis Presentation By Debotosh Dey AUTOMATIC CONSTRUCTION OF HASHTAGS HIERARCHIES UNIVERSITAT ROVIRA I VIRGILI Tarragona, June 2015 Supervised.
Finding frequent and interesting triples in text Janez Brank, Dunja Mladenić, Marko Grobelnik Jožef Stefan Institute, Ljubljana, Slovenia.
Similarity Measures for Query Expansion in TopX Caroline Gherbaoui Universität des Saarlandes Naturwissenschaftlich-Technische Fak. I Fachrichtung 6.2.
1 NAGA: Searching and Ranking Knowledge Gjergji Kasneci Joint work with: Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath, and Gerhard Weikum.
A Classification-based Approach to Question Answering in Discussion Boards Liangjie Hong, Brian D. Davison Lehigh University (SIGIR ’ 09) Speaker: Cho,
Knowledge Representation. Keywordsquick way for agents to locate potentially useful information Thesaurimore structured approach than keywords, arranging.
Session 1 Module 1: Introduction to Data Integrity
Personalized Recommendation of Related Content Based on Automatic Metadata Extraction Andreas Nauerz 1, Fedor Bakalov 2, Birgitta.
Acquisition of Categorized Named Entities for Web Search Marius Pasca Google Inc. from Conference on Information and Knowledge Management (CIKM) ’04.
Gaby Nativ, SDBI  Motivation  Other Ontologies  System overview  YAGO Dive IN  LEILA  NAGA  Conclusion.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Information Extraction from Wikipedia: Moving Down the Long.
Semantic Interoperability in GIS N. L. Sarda Suman Somavarapu.
GoRelations: an Intuitive Query System for DBPedia Lushan Han and Tim Finin 15 November 2011
Of 24 lecture 11: ontology – mediation, merging & aligning.
Software Documentation
Result of Ontology Alignment with RiMOM at OAEI’06
Presentation 王睿.
Extracting Semantic Concept Relations
DBpedia 2014 Liang Zheng 9.22.
Yago Type Heuristics 丁基伟.
Presentation transcript:

YAGO: A Large Ontology from Wikipedia and WordNet Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum Max-Planck-Institute for Computer Science, Saarbruecken, Germany Journal of Web Semantics August 2011 IDB Lab Seminar Presented by Jee-bum Park

Outline  Introduction  The YAGO model  Sources for YAGO  Information extraction  Evaluation  Conclusion 2

Introduction  Many applications in modern information technology utilize ontological background knowledge –Exploit lexical knowledge –Uses taxonomies –Combined with ontologies –Rely on background knowledge  Ontological knowledge structures play an important role in –Data cleaning –Record linkage –Information integration –Entity- and fact-oriented Web search –Community management  But the existing applications typically use only a single source of background knowledge 3

Introduction  If a huge ontology with knowledge from several sources were available, applications could boost their performance 4

Introduction  YAGO –Based on a data model that slightly extends RDFS –Combines high coverage with high quality  YAGO sources –From the vast amount of individuals known to Wikipedia –From WordNet for the clean taxonomy of concepts 5

Outline  Introduction  The YAGO model  Sources for YAGO  Information extraction  Evaluation  Conclusion 6

The YAGO model  The state-of-the-art formalism in knowledge representation is the Web Ontology Language (OWL) –However, it cannot express relations between facts  RDFS, the basis of OWL, –provides only very primitive semantics –For example, it does not know transitivity  This is why we introduce an extension of RDFS, the YAGO model 7

The YAGO model - Informal description  The YAGO model uses the same knowledge representation as RDFS –All objects are represented as entities in the YAGO model –Two entities can stand in a relation  For example, to state that Elvis won a Grammy Award, 8 ElvisPresley hasWonPrize GrammyAward Entities Relation

The YAGO model - Informal description  A certain word refers to a certain entity  This allows us to deal with synonymy and ambiguity  We use quotes to distinguish words from other entities 9 ”Elvis” means ElvisPresley ”Elvis” means ElvisConstello Words

The YAGO model - Informal description  Similar entities are grouped into classes  Each entity is an instance of at least one class  Classes are arranged in a taxonomic hierarchy, expressed by the subClassOf relation 10 ElvisPresley type singer singer subClassOf person Class

The YAGO model - Informal description  The triple of an entity, a relation and an entity is called a fact  The Two entities are called the arguments of the fact 11 ElvisPresley hasWonPrize GrammyAward Arguments Fact

The YAGO model - Informal description  In YAGO, we will store with each fact where it was found  For this purpose, facts are given a fact identifier –Each fact has a fact identifier  Suppose that the below fact had the fact identifier #1  Then the following line says that this fact was found in Wikipedia: 12 ElvisPresley bornInYear 1935 Fact identifier #1 foundIn Wikipedia

The YAGO model - Reification graphs  We write down a YAGO ontology by listing the elements of the function in the form id 1 : arg1 1 rel 1 arg2 1 id 2 : arg1 2 rel 2 arg2 2 …  We allow the following shorthand notation id 2 : (arg1 1 rel 1 arg2 1 ) rel 2 arg2 2 to mean id 1 : arg1 1 rel 1 arg2 1 id 2 : id 1 rel 2 arg2 2 13

The YAGO model - Reification graphs  For example, to state that Elvis’ birth date was found in Wikipedia, we can simply write this fragment of the reification graph as 14 Elvis bornInYear 1935 foundIn Wikipedia

The YAGO model - n-Ary relations  Some facts require more than two arguments  RDFS and OWL do not allow n-ary relations  Therefore, the standard way to deal with this problem is: 15 GrammyAward prize elvisGetsGrammy Elvis winner elvisGetsGrammy 1921 year elvisGetsGrammy

The YAGO model - n-Ary relations  The YAGO model offers a more natural solution to this problem: 16 Elvis hasWonPrize GrammyAward inYear 1967

The YAGO model - Query language  “When did Elvis win the Grammy Award?”  Usually, each entity that appears in the query also has to appear in the ontology –If that is not the case, there is no match –However “Which singers were born after 1930?”  Hence, we introduce filter relations ?x type singer ?x bornInYear ?y ?y after Elvis hasWonPrize GrammyAward inYear ?x

Outline  Introduction  The YAGO model  Sources for YAGO  Information extraction  Evaluation  Conclusion 18

Sources for YAGO - WordNet  WordNet is a semantic lexicon for the English language  WordNet distinguishes between words as literally appearing in texts and the actual senses of the words  A set of words that share one sense is called a synset 19

Sources for YAGO - Wikipedia  Wikipedia is a multilingual, Web-based encyclopedia  The majority of Wikipedia pages have been manually assigned to one or multiple categories  Furthermore, a Wikipedia page may have an infobox 20

Sources for YAGO - Wikipedia 21

Outline  Introduction  The YAGO model  Sources for YAGO  Information extraction  Evaluation  Conclusion 22

Information extraction - Wikipedia heuristics  The individuals for YAGO are taken from Wikipedia  Each Wikipedia page title is a candidate to become an individual in YAGO –The page titles in Wikipedia are unique 23

Information extraction - Wikipedia heuristics  Infobox heuristics 24

Information extraction - Wikipedia heuristics  To establish for each individual its class, we exploit the category system of Wikipedia  The Wikipedia categories are organized in a directed acyclic graph –The hierarchy is of little use from an ontological point of view  Hence we take only the leaf categories of Wikipedia and ignore all higher categories  Then we use WordNet to establish the hierarchy of classes, because WordNet offers an ontologically well-defined taxonomy of synsets 25

Information extraction - Wikipedia heuristics  Each synset of WordNet becomes a class of YAGO  For example, the Wikipedia class “American people in Korea”  Has to be made a subclass of the WordNet class “person” –We stem the head compound of the category name to its singular form: “American person in Korea” –We determine the pre-modifier and the post-modifier: “Amercian person”, “in Korea” –Then we check whether there is a WordNet synset for the modifier: “Amercian person” is a hyponym of “person” –The head compound “person” has to be mapped to a corresponding WordNet synset 26

Information extraction - Storage  We store for each individual the URL of the corresponding Wikipedia page with the describes relation –This will allow future applications to provide the user with detailed information on the entities  To produce minimal overhead, we decided to use simple text files as an internal format  We maintain a folder for each relation, each folder contains files that list the entity pairs 27

Information extraction - Query engine  Since entities can have several names in YAGO, we have to deal with ambiguity  We replace each non-literal, non-variable argument in the query by a fresh variable and add a means fact for it –We call this process word resolution 28

Information extraction - Query engine  “Who was born after Elvis?” ?i1: Elvis bornOnDate ?e ?i2: ?x bornOnDate ?y ?i3: ?y after ?e This query becomes ?i0: “Elvis” means ?Elvis ?i1: ?Elvis bornOnDate ?e ?i2: ?x bornOnDate ?y ?i3: ?y after ?e 29

Information extraction - Query engine  In the example, the SQL query is: SELECT f0.arg2, f1.arg2, f2.arg1, f2.arg2 FROM facts f0, facts f1, facts f2 WHERE f0.arg1=‘”Elvis”’ AND f0.relation=‘means’ AND f1.arg1=f0.arg2 AND f1.relation=‘bornOnDate’ AND f2.relation=‘bornOnDate’  Then, the query engine evaluates the after relation on the result 30

Information extraction - Query engine  This implementation leaves much room for improvement, especially concerning efficiency –It takes several seconds to return 10 answers to the previous query –Queries with more joins can take even longer  In this article, we use the engine only to showcase the contents of YAGO 31

Outline  Introduction  The YAGO model  Sources for YAGO  Information extraction  Evaluation  Conclusion 32

Evaluation - Precision  To evaluate the precision of an ontology, its facts have to be compared to some ground truth –We had to rely on manual evaluation  We presented randomly selected facts of the ontology to human judges and asked them to assess whether the facts were correct  13 judges participated in the evaluation  Evaluated a total number of 5200 facts 33

Evaluation - Precision 34

Evaluation - Size  Half of YAGO’s individuals are people and locations  The overall number of entities is 1.7 million 35

Outline  Introduction  The YAGO model  Sources for YAGO  Information extraction  Evaluation  Conclusion 36

Conclusion  We presented our ontology YAGO and the methodology  We showed how the category system and the infoboxes of Wikipedia can be exploited for knowledge extraction  Our evaluation showed not only that YAGO is one of the largest knowledge bases available today, but also that it has an unprecedented quality in the league of automatically generated ontologies 37

Thank You! Any Questions or Comments?