YAGO: A LARGE ONTOLOGY FROM WIKIPEDIA AND WORDNET. FABIAN M. SUCHANEK, GJERGJI KASNECI, GERHARD WEIKUM. Subbalakshmi Iyer.

Presentation transcript:


Motivation for an Ontology  Natural Language communication  Automated text translation  Finding information on internet  Computer-processable collection of knowledge

What is an Ontology?  An ontology is the description of a domain, its classes and properties and relationships between those classes by means of a formal language.  collection of knowledge about the world, a knowledge base  Example ontologies:  large taxonomies categorizing Web sites (such as on Yahoo!)  categorizations of products for sale and their features (such as on Amazon.com)

Uses of Ontologies  Machine Translation  Word Sense Disambiguation  Document Classification  Question Answering  Entity and fact-oriented Web Search

What is YAGO?  Yet Another Great Ontology  Part of the YAGO-NAGA project  Goal: to build a knowledge base that is  Large-scale  Domain-independent  Automatically constructed  Highly accurate  Uses Wikipedia and WordNet

More about YAGO  2 million entities  20 million facts  Facts represented as RDF triples  Accuracy of 95%  Examples:  Elvis Presley isA singer  singer subClassOf person  Elvis Presley bornOnDate January 8, 1935  Elvis Presley bornIn Tupelo  Tupelo locatedIn Mississippi(state)  Mississippi(state) locatedIn USA

The YAGO model  Slight extension of RDFS  Represents knowledge as  Entities  Classes  Relations  Facts  Properties of relations like transitivity  Simple and decidable model

Knowledge Representation in YAGO  All objects are entities  e.g. Elvis Presley, Grammy Award  2 entities can stand in a relationship  e.g. hasWonAward  Elvis Presley hasWonAward Grammy Award  The triple of entity, relationship, entity is a fact  e.g. Elvis Presley hasWonAward Grammy Award is a fact

Knowledge Representation in YAGO -2  Numbers, dates and strings are also entities.  Elvis Presley BornInYear 1935  Words are entities  “Elvis” means Elvis Presley  Entity is instance of class  Elvis Presley Type Singer  Classes are also entities  Singer Type class

Knowledge Representation in YAGO - 3  Classes have hierarchies  Singer SubClassOf Person  Relations are also entities  subClassOf Type atr (acyclic transitive relation)  Each fact has a fact identifier  #1 FoundIn Wikipedia
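
The representation described on the three slides above can be sketched in a few lines of Python. This is only an illustrative toy, not YAGO's actual storage format: the class `FactStore`, its method names and the `#n` identifier scheme are assumptions made for this example. It shows facts as triples, fact identifiers that can themselves appear in other facts, and the transitivity of subClassOf.

```python
from dataclasses import dataclass, field


@dataclass
class FactStore:
    """Toy store (hypothetical): every fact is a triple with its own identifier."""
    facts: dict = field(default_factory=dict)  # fact id -> (subject, relation, object)
    next_id: int = 1

    def add(self, subj, rel, obj):
        """Store a triple and return its fact identifier, e.g. '#1'."""
        fact_id = f"#{self.next_id}"
        self.facts[fact_id] = (subj, rel, obj)
        self.next_id += 1
        return fact_id

    def superclasses(self, cls):
        """Follow subClassOf edges transitively, starting from cls."""
        found, frontier = set(), [cls]
        while frontier:
            current = frontier.pop()
            for s, r, o in self.facts.values():
                if r == "subClassOf" and s == current and o not in found:
                    found.add(o)
                    frontier.append(o)
        return found


store = FactStore()
f1 = store.add("Elvis Presley", "type", "singer")
store.add("singer", "subClassOf", "performer")
store.add("performer", "subClassOf", "person")
# Because facts have identifiers, a fact can be the subject of another fact.
f4 = store.add(f1, "foundIn", "Wikipedia")
```

Here `store.superclasses("singer")` yields both `performer` and `person`, and fact `#4` records where fact `#1` was found.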

Key Contributions of YAGO  Information Extraction from Wikipedia  Infoboxes  Category Pages  Combination with WordNet  Taxonomy  Quality Control  Canonicalization  Type Checking

Information Extraction -1  Entities from Wikipedia  Each page title is candidate entity  Wiki Markup Language  Wikipedia dump as of September, 2008

Information Extraction - WML

Information Extraction Techniques  Infobox Harvesting  Wikipedia Infoboxes  Word-Level Techniques  Wikipedia Redirects  Category Harvesting  Wikipedia Categories  Type Extraction  Wikipedia Categories, WordNet Classes

1. Information Extraction from Wikipedia – Infobox Harvesting Wikipedia Infobox

Infobox attribute: Born: January 8, 1935  Attribute Map (Attribute, Relation, Inverse, Manifold, Indirect): Born → bornOnDate  Relation Map (Relation, Domain, Range): bornOnDate, person, yagoDate  Resulting fact: Elvis Presley bornOnDate January 8, 1935

Infobox attribute: Died: August 16, 1977  Attribute Map (Attribute, Relation, Inverse, Manifold, Indirect): Died → diedOnDate  Relation Map (Relation, Domain, Range): diedOnDate, person, yagoDate  Resulting fact: Elvis Presley diedOnDate August 16, 1977

Infobox attribute: Genre: Rock and Roll  Attribute Map (Attribute, Relation, Inverse, Manifold, Indirect): Genre → isOfGenre  Relation Map (Relation, Domain, Range): isOfGenre, entity, yagoClass  Resulting fact: Elvis Presley isOfGenre Rock and Roll

Infobox attribute: Birth Name: Elvis Aaron Presley  Attribute Map (Attribute, Relation, Inverse, Manifold, Indirect): birth name → means (inverse)  Relation Map (Relation, Domain, Range): means, yagoWord, entity  Resulting fact: “Elvis Aaron Presley” means Elvis Presley

Manifold Attributes  Some attributes may have multiple values  e.g. a person may have multiple children  Multiple facts are generated  e.g. one hasChild fact for each child
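
The infobox harvesting steps above (attribute map lookup, inverse attributes, manifold attributes) can be sketched as follows. This is a minimal illustration under assumptions: the dictionary layout of `ATTRIBUTE_MAP`, the function `harvest_infobox`, the `children` attribute name and the placeholder names "Some Parent", "Child A" and "Child B" are all hypothetical, and manifold values are assumed to arrive already split into a list.

```python
# Hypothetical attribute map: infobox attribute -> (relation, inverse?, manifold?)
ATTRIBUTE_MAP = {
    "born": ("bornOnDate", False, False),
    "died": ("diedOnDate", False, False),
    "genre": ("isOfGenre", False, False),
    "birth name": ("means", True, False),    # inverse: the value becomes the subject
    "children": ("hasChild", False, True),   # manifold: one fact per value
}


def harvest_infobox(entity, infobox):
    """Turn infobox attribute/value pairs into (subject, relation, object) facts."""
    facts = []
    for attribute, value in infobox.items():
        entry = ATTRIBUTE_MAP.get(attribute.lower())
        if entry is None:
            continue  # attribute is not covered by the map
        relation, inverse, manifold = entry
        values = value if manifold else [value]
        for v in values:
            facts.append((v, relation, entity) if inverse else (entity, relation, v))
    return facts


elvis_facts = harvest_infobox(
    "Elvis Presley",
    {"Born": "January 8, 1935", "Birth name": "Elvis Aaron Presley"},
)
parent_facts = harvest_infobox("Some Parent", {"Children": ["Child A", "Child B"]})
```

The manifold case produces one hasChild fact per list element, while the inverse flag swaps subject and object for the means relation.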

Indirect Attributes - 1  Some attributes do not concern the article entity, but another fact  e.g. the value of the attribute 'gdp year' (2008) does not concern the article entity, i.e. Republic of Singapore, but the hasGDP fact  Therefore, facts generated:  Singapore hasGDP billion (fact #14)  #14 during 2008  i.e. Singapore hasGDP billion during 2008  Attribute Map (Attribute, Relation, Inverse, Manifold, Indirect): gdp ppp → hasGDP; gdp year → during (indirect)
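
A small sketch of how an indirect attribute could attach to another fact rather than to the article entity. Everything here is an assumption for illustration: the function `harvest_indirect`, the map layout (relation plus the attribute whose fact is targeted) and the placeholder value "X billion", which stands in for the amount not given on the slide.

```python
def harvest_indirect(entity, infobox, attribute_map):
    """Generate facts; indirect attributes take another fact's id as their subject."""
    facts = {}     # fact id -> (subject, relation, object)
    fact_of = {}   # attribute -> id of the fact it produced

    # First pass: direct attributes describe the article entity itself.
    for attr, (relation, target) in attribute_map.items():
        if target is None and attr in infobox:
            fact_id = f"#{len(facts) + 1}"
            facts[fact_id] = (entity, relation, infobox[attr])
            fact_of[attr] = fact_id

    # Second pass: indirect attributes describe the fact created by `target`.
    for attr, (relation, target) in attribute_map.items():
        if target is not None and attr in infobox and target in fact_of:
            fact_id = f"#{len(facts) + 1}"
            facts[fact_id] = (fact_of[target], relation, infobox[attr])

    return facts


# Hypothetical map: 'gdp year' is indirect and refers to the 'gdp ppp' fact.
GDP_MAP = {"gdp ppp": ("hasGDP", None), "gdp year": ("during", "gdp ppp")}
gdp_facts = harvest_indirect(
    "Singapore", {"gdp ppp": "X billion", "gdp year": "2008"}, GDP_MAP
)
```

The second fact has the identifier of the first fact as its subject, which mirrors the "#14 during 2008" pattern on the slide.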

Indirect Attributes - 2 Singapore Infobox

Type of Infobox  Song Infobox (American Pie): Released October, 1971; Format vinyl record; Genre Folk Rock; Length 8:33 mins; Label United Artists; Writer Don McLean  Car Infobox (Tesla Roadster): Manufacturer Tesla Motors; Production 2008-present; Class Roadster; Length 3,946 mm; Width 1,873 mm; Height 1,127 mm

Type of Infobox: Attribute Map  Attribute Map (Attribute, Relation, Inverse, Manifold, Indirect):  car#length → hasLength  song#length → hasDuration  Song Infobox: American Pie hasDuration 8:33  Car Infobox: Tesla Roadster hasLength 3946
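
The type-dependent lookup can be illustrated with a keyed dictionary. The names `TYPED_ATTRIBUTE_MAP` and `relation_for` are hypothetical; only the two `type#attribute` entries come from the slide.

```python
# The same attribute name maps to different relations depending on the infobox type.
TYPED_ATTRIBUTE_MAP = {
    "car#length": "hasLength",
    "song#length": "hasDuration",
}


def relation_for(infobox_type, attribute):
    """Look up the relation for an attribute, qualified by the infobox type."""
    return TYPED_ATTRIBUTE_MAP.get(f"{infobox_type}#{attribute}".lower())


song_relation = relation_for("song", "Length")
car_relation = relation_for("car", "Length")
```

Qualifying the key with the infobox type is what keeps a song's length (a duration) apart from a car's length (a physical dimension).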

Information Extraction - Word Level Techniques  Wikipedia Redirects  virtual redirect page for “Presley, Elvis” links to “Elvis Presley”  Each redirect gives a ‘means’ fact  e.g. “Presley, Elvis” means Elvis Presley  Parsing Person Names  extract the name components  establish relations givenNameOf and familyNameOf  e.g. Presley familyNameOf Elvis Presley  Elvis givenNameOf Elvis Presley
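
Both word-level techniques can be sketched together. The regular expression below is a deliberately simple assumption (exactly one given name followed by one family name); the function name `word_level_facts` and the `is_person` flag are also hypothetical.

```python
import re

# Simplifying assumption: a person name is "Given Family", two capitalised words.
PERSON_NAME = re.compile(r"^([A-Z][a-z]+) ([A-Z][a-z]+)$")


def word_level_facts(entity, redirects, is_person):
    """Create 'means' facts from redirects and name facts for persons."""
    facts = [(f'"{title}"', "means", entity) for title in redirects]
    match = PERSON_NAME.match(entity) if is_person else None
    if match:
        given, family = match.groups()
        facts.append((given, "givenNameOf", entity))
        facts.append((family, "familyNameOf", entity))
    return facts


elvis_word_facts = word_level_facts("Elvis Presley", ["Presley, Elvis"], is_person=True)
```

Names with middle names, particles or non-ASCII letters would need a richer pattern than this sketch uses.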

Wikipedia Categories Categories: Presidents of the United States | Lists of office-holders | Lists of Presidents Categories: Rift Valleys | North Sea | Rivers of Germany | Articles needing translation from German Wikipedia | Rivers of Netherlands Categories: Canadian Singers| Canadian male singers| 1959 births | English-language singers | Living people | Grammy Award Winners | Portrait photographers

Facts created from Wikipedia Categories  Rhine locatedIn Germany  Bryan Adams bornOnDate 1959  Bryan Adams hasWonAward Grammy Award  Abraham Lincoln politicianOf United States

Information Extraction - Category Harvesting  Relational Categories  Table: Some Category Heuristics (Regular Expression → Relation)  ([0-9]{3,4}) births → bornOnDate  ([0-9]{3,4}) deaths → diedOnDate  ([0-9]{3,4}) establishments → establishedOnDate  ([0-9]{3,4}) books|novels → writtenOnDate  Mountains|Rivers in (.*) → locatedIn  Presidents|Governors of (.*) → politicianOf  (.*) winners → hasWonPrize  [A-Za-z]+ (.*) winners → hasWonPrize
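
The category heuristics lend themselves to a direct regular-expression sketch. Only a subset of the table is used here, the alternations are wrapped in non-capturing groups, and matching is case-insensitive; those choices, along with the names `CATEGORY_HEURISTICS` and `harvest_categories`, are assumptions of this example.

```python
import re

# Subset of the category heuristics: (pattern, relation). First match wins.
CATEGORY_HEURISTICS = [
    (re.compile(r"([0-9]{3,4}) births", re.IGNORECASE), "bornOnDate"),
    (re.compile(r"([0-9]{3,4}) deaths", re.IGNORECASE), "diedOnDate"),
    (re.compile(r"([0-9]{3,4}) establishments", re.IGNORECASE), "establishedOnDate"),
    (re.compile(r"(?:Mountains|Rivers) in (.*)", re.IGNORECASE), "locatedIn"),
    (re.compile(r"(.*) winners", re.IGNORECASE), "hasWonPrize"),
]


def harvest_categories(entity, categories):
    """Apply each heuristic to each category name and emit one fact per match."""
    facts = []
    for category in categories:
        for pattern, relation in CATEGORY_HEURISTICS:
            match = pattern.fullmatch(category)
            if match:
                facts.append((entity, relation, match.group(1)))
                break  # stop at the first heuristic that matches this category
    return facts


adams_facts = harvest_categories(
    "Bryan Adams",
    ["Canadian Singers", "1959 births", "Living people", "Grammy Award Winners"],
)
```

Categories that match no pattern, such as "Living people", simply produce no relational fact in this sketch.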

2. Connecting Wikipedia and WordNet – What is WordNet  Lexical database for the English language  Created at the Cognitive Science Laboratory of Princeton University  Groups English words into sets of synonyms called synsets  Provides short, general definitions  Provides hypernym/hyponym relations  e.g. canine is hypernym, dog is hyponym

Connecting Wikipedia and WordNet – Type Extraction  Goal: create class hierarchy  e.g. singer subClassOf performer performer subClassOf artist  hyponymy relation from WordNet  Wikipedia class ‘American people in Japan’ is subclass of WordNet class ‘person’

Classifications of Categories  Conceptual Categories  e.g. Albert Einstein is in ‘Naturalized citizens of the United States’  Administrative Categories  e.g. Albert Einstein is in ‘Articles with unsourced statements’  Relational Information  1879 births  Thematic Vicinity  Physics

Identification of Conceptual Categories  Only conceptual categories are used  Shallow linguistic parsing of category names  e.g. category ‘American people in Japan’  Break category into: pre-modifier ‘American’, head ‘people’, post-modifier ‘in Japan’  If head is plural, then category is a conceptual category  Extract class from Wikipedia category  Connect to class from WordNet  e.g. the Wikipedia class ‘American people in Japan’ has to be made a subclass of the WordNet class ‘person’

Algorithm
Function wiki2wordnet(c)
Input: Wikipedia category name c
Output: WordNet synset
1 head = headCompound(c)
2 pre = preModifier(c)
3 post = postModifier(c)
4 head = stem(head)
5 If there is a WordNet synset s for pre + head
6     return s
7 If there are WordNet synsets s1, …, sn for head
8     (ordered by their frequency for head)
9     return s1
10 fail

Explanation of Algorithm
Input: American people in Japan
1. Pre-modifier: American
2. Head: people
3. Post-modifier: in Japan
4. Stem(head): person
5. If there is a WordNet synset for ‘American person’
6.     return that synset
7. If there are synsets s1, …, sn for ‘person’
8.     (ordered by frequency for ‘person’)
9.     return s1
10. Fail
Output: person
Result: American People in Japan subClassOf person
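
The algorithm can be tried out with a toy stand-in for WordNet. This sketch is not the original implementation: the preposition list, the crude stemmer, the synset labels such as `person#1` and the dictionary `TOY_SYNSETS` are all hypothetical, and a real system would use a proper noun-group parser and the actual WordNet database.

```python
PREPOSITIONS = {"in", "of", "from", "by", "at", "on", "with"}
IRREGULAR_PLURALS = {"people": "person"}

# Toy lexicon: lemma -> synset labels, most frequent sense first (hypothetical).
TOY_SYNSETS = {
    "person": ["person#1", "person#2"],
    "singer": ["singer#1"],
}


def parse_category(name):
    """Split a category name into (pre-modifier, head, post-modifier)."""
    words = name.split()
    cut = next((i for i, w in enumerate(words) if w.lower() in PREPOSITIONS), len(words))
    before, post = words[:cut], words[cut:]
    head = before[-1] if before else ""
    return " ".join(before[:-1]), head, " ".join(post)


def stem(word):
    """Very crude singularisation, sufficient only for this example."""
    word = word.lower()
    if word in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[word]
    return word[:-1] if word.endswith("s") else word


def wiki2wordnet(category, synsets):
    """Return the synset a category should be attached to, or None on failure."""
    pre, head, _post = parse_category(category)
    head = stem(head)
    compound = f"{pre} {head}".strip().lower()
    if synsets.get(compound):      # steps 5-6: try pre-modifier + head first
        return synsets[compound][0]
    if synsets.get(head):          # steps 7-9: fall back to the most frequent sense
        return synsets[head][0]
    return None                    # step 10: fail


target = wiki2wordnet("American people in Japan", TOY_SYNSETS)
```

With this lexicon there is no entry for "american person", so the call falls through to the most frequent sense of "person", matching the walk-through on the slide.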

Fig.: WordNet search for “person” Fig.: WordNet search for ‘American Person’

Exceptions  Complete hierarchy of classes  Upper classes from WordNet  Leaves from Wikipedia  2 dozen cases failed  Categories with head compound “capital”  In Wikipedia, it means “capital city”  In WordNet, it means “financial asset”  These cases were corrected manually

3. Quality Control  Canonicalization  Each fact and each entity reference is unique  an entity is always referred to by the same identifier in all facts in YAGO  Type Checking  eliminates individuals that do not have a class  eliminates facts that do not respect domain and range constraints  an argument of a fact in YAGO is always an instance of the class required by the relation

Canonicalization - 1  Redirect Resolution  infobox heuristics deliver facts that have Wikipedia entities (i.e. Wikipedia links) as arguments  These links may not be correct Wikipedia page identifiers  Check if each argument is a correct Wikipedia identifier  Replace by the correct, redirected identifier  e.g. Hermitage Museum locatedIn St. Petersburg is replaced by Hermitage Museum locatedIn Saint Petersburg

Canonicalization - 2  Removal of Duplicate Facts  Sometimes, 2 heuristics deliver the same fact  Canonicalization eliminates one of them  e.g. category ‘1935 births’ yields the fact: Elvis Presley bornOnDate 1935  Infobox attribute ‘Born: January 8, 1935’ yields the fact: Elvis Presley bornOnDate January 8, 1935
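
Both canonicalization steps can be sketched in one function. The redirect table and the function name `canonicalize` are hypothetical. The rule used for the date example, dropping a fact when a longer value for the same subject and relation ends with it, is an assumption of this sketch and not a rule stated on the slides.

```python
def canonicalize(facts, redirects):
    """Resolve redirected identifiers, then remove duplicate facts."""
    # Redirect resolution; using a set also removes exact duplicates.
    unique = {(redirects.get(s, s), r, redirects.get(o, o)) for s, r, o in facts}

    # Assumed rule: drop a fact if a more specific value for the same subject
    # and relation ends with it (e.g. "1935" vs. "January 8, 1935").
    kept = [
        (s, r, o)
        for (s, r, o) in unique
        if not any(
            s2 == s and r2 == r and o2 != o and o2.endswith(o)
            for (s2, r2, o2) in unique
        )
    ]
    return sorted(kept)


raw_facts = [
    ("Hermitage Museum", "locatedIn", "St. Petersburg"),
    ("Hermitage Museum", "locatedIn", "Saint Petersburg"),
    ("Elvis Presley", "bornOnDate", "1935"),
    ("Elvis Presley", "bornOnDate", "January 8, 1935"),
]
clean_facts = canonicalize(raw_facts, {"St. Petersburg": "Saint Petersburg"})
```

After redirect resolution the two Hermitage facts become identical and collapse into one.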

Type Checking - 1  Reductive Type Checking  Sometimes the class of an entity cannot be determined  Such entities are discarded  e.g. Wikipedia entities that have been proposed for an article, but that do not have a page yet  Inductive Type Checking  Type constraints can be used to generate facts  e.g. Elvis Presley bornOnDate January 8, 1935  So, Elvis Presley is a person  Regular expression check to ensure the entity name follows the pattern of given name and family name

Type Checking - 2  Type Coherence Checking  Sometimes, classification yields wrong results  e.g. Abraham Lincoln is an instance of 13 classes  12 are subclasses of class ‘person’, e.g. lawyer, president  the 13th class is class ‘cabinet’  Class hierarchy of YAGO is partitioned into branches  e.g. locations, artifacts, people, other physical entities, and abstract entities  The branch that most types lead to is determined  Other types are purged
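
Reductive checking and type coherence checking can be sketched as two small functions. The function names, the `classes_of` and `branch_of` lookups, the placeholder entity "Proposed Article" and the assignment of ‘cabinet’ to the artifacts branch are assumptions made for this example.

```python
from collections import Counter


def reductive_check(facts, classes_of):
    """Keep only facts whose subject has at least one known class."""
    return [fact for fact in facts if classes_of.get(fact[0])]


def coherent_types(classes, branch_of):
    """Keep only the classes that lie in the branch most of the classes lead to."""
    counts = Counter(branch_of[c] for c in classes)
    main_branch, _ = counts.most_common(1)[0]
    return [c for c in classes if branch_of[c] == main_branch]


# Hypothetical branch assignment for three of Abraham Lincoln's classes.
BRANCH_OF = {"lawyer": "people", "president": "people", "cabinet": "artifacts"}
lincoln_types = coherent_types(["lawyer", "president", "cabinet"], BRANCH_OF)

checked = reductive_check(
    [("Elvis Presley", "bornIn", "Tupelo"), ("Proposed Article", "bornIn", "Tupelo")],
    {"Elvis Presley": ["singer"]},
)
```

The outlier class ‘cabinet’ is purged because the majority of Lincoln's classes fall into the people branch.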

References  YAGO: A Large Ontology from Wikipedia and WordNet. Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum. Max-Planck-Institute for Computer Science, Saarbruecken, Germany  Automated Construction and Growth of a Large Ontology. Fabian M. Suchanek. Thesis for obtaining the title of Doctor of Engineering of the Faculties of Natural Sciences and Technology of Saarland University  Wikipedia  WordNet

Thank You, Any Questions?