YAGO: A LARGE ONTOLOGY FROM WIKIPEDIA AND WORDNET. FABIAN M. SUCHANEK, GJERGJI KASNECI, GERHARD WEIKUM. Subbalakshmi Iyer
Motivation for an Ontology: natural language communication; automated text translation; finding information on the Internet; a computer-processable collection of knowledge.
What is an Ontology? An ontology is a description of a domain, its classes and properties, and the relationships between those classes, by means of a formal language. It is a collection of knowledge about the world, a knowledge base. Example ontologies: large taxonomies categorizing Web sites (such as on Yahoo!); categorizations of products for sale and their features (such as on Amazon.com).
Uses of Ontologies: machine translation; word sense disambiguation; document classification; question answering; entity- and fact-oriented Web search.
What is YAGO? Yet Another Great Ontology. Part of the YAGO-NAGA project. Goal: to build a knowledge base that is large-scale, domain-independent, automatically constructed, and highly accurate. Uses Wikipedia and WordNet.
More about YAGO: 2 million entities; 20 million facts; facts represented as RDF triples; accuracy of 95%. Examples: Elvis Presley isA singer; singer subClassOf person; Elvis Presley bornOnDate January 8, 1935; Elvis Presley bornIn Tupelo; Tupelo locatedIn Mississippi(state); Mississippi(state) locatedIn USA.
The YAGO model: a slight extension of RDFS. Represents knowledge as entities, classes, relations, and facts, plus properties of relations such as transitivity. A simple and decidable model.
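As a rough illustration of what a relation property such as transitivity buys, the sketch below derives implied facts from stored ones. It assumes facts are plain (subject, relation, object) tuples; the function name `transitive_closure` is made up for this example and is not part of YAGO.

```python
def transitive_closure(facts, relation):
    """Return every (x, relation, z) fact implied by transitivity of `relation`."""
    pairs = {(s, o) for s, r, o in facts if r == relation}
    changed = True
    while changed:
        changed = False
        # Join every known pair (a, b) with every pair (b, d) to derive (a, d).
        for a, b in list(pairs):
            for c, d in list(pairs):
                if b == c and (a, d) not in pairs:
                    pairs.add((a, d))
                    changed = True
    return {(s, relation, o) for s, o in pairs}


facts = {
    ("Tupelo", "locatedIn", "Mississippi(state)"),
    ("Mississippi(state)", "locatedIn", "USA"),
}
closure = transitive_closure(facts, "locatedIn")
```

With the two stored facts, the closure additionally contains Tupelo locatedIn USA.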
Knowledge Representation in YAGO: All objects are entities, e.g. Elvis Presley, Grammy Award. Two entities can stand in a relation, e.g. hasWonAward. The triple of entity, relation, entity is a fact, e.g. Elvis Presley hasWonAward Grammy Award.
Knowledge Representation in YAGO - 2: Numbers, dates and strings are also entities, e.g. Elvis Presley bornInYear 1935. Words are entities, e.g. "Elvis" means Elvis Presley. An entity is an instance of a class, e.g. Elvis Presley type singer. Classes are also entities, e.g. singer type class.
Knowledge Representation in YAGO - 3: Classes form hierarchies, e.g. singer subClassOf person. Relations are also entities, e.g. subClassOf type atr (acyclic transitive relation). Each fact has a fact identifier, so a fact can be the subject of another fact, e.g. #1 foundIn Wikipedia.
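The idea that facts carry identifiers, and can therefore appear as arguments of other facts, can be sketched as follows. This is a minimal illustration; the `Fact` and `FactStore` names and the `#n` numbering scheme are assumptions made for the example, not YAGO's actual storage format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    fact_id: str
    subject: str   # an entity, or the identifier of another fact
    relation: str
    obj: str


class FactStore:
    """Hypothetical in-memory store that assigns identifiers #1, #2, ..."""

    def __init__(self):
        self._facts = {}
        self._next = 1

    def add(self, subject, relation, obj):
        fact_id = f"#{self._next}"
        self._next += 1
        self._facts[fact_id] = Fact(fact_id, subject, relation, obj)
        return fact_id

    def get(self, fact_id):
        return self._facts[fact_id]


store = FactStore()
f1 = store.add("Elvis Presley", "hasWonAward", "Grammy Award")
# A fact about fact #1: its identifier is used as the subject.
f2 = store.add(f1, "foundIn", "Wikipedia")
```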
Key Contributions of YAGO: (1) information extraction from Wikipedia (infoboxes, category pages); (2) combination with the WordNet taxonomy; (3) quality control (canonicalization, type checking).
Information Extraction - 1: Entities are taken from Wikipedia; each page title is a candidate entity. Pages are written in the Wiki Markup Language (WML). Source: Wikipedia dump as of September 2008.
Information Extraction - WML
Information Extraction Techniques (technique: source): Infobox Harvesting: Wikipedia infoboxes. Word-Level Techniques: Wikipedia redirects. Category Harvesting: Wikipedia categories. Type Extraction: Wikipedia categories, WordNet classes.
1. Information Extraction from Wikipedia – Infobox Harvesting Wikipedia Infobox
Infobox attribute "Born: January 8, 1935". Infobox Attribute Map (columns: Attribute, Relation, Inverse, Manifold, Indirect): Born -> bornOnDate. Relation Map (columns: Relation, Domain, Range): bornOnDate, person, yagoDate. Resulting fact: Elvis Presley bornOnDate January 8, 1935.
Infobox attribute "Died: August 16, 1977". Infobox Attribute Map: Died -> diedOnDate. Relation Map: diedOnDate, person, yagoDate. Resulting fact: Elvis Presley diedOnDate August 16, 1977.
Infobox attribute "Genre: Rock and Roll". Infobox Attribute Map: Genre -> isOfGenre. Relation Map: isOfGenre, entity, yagoClass. Resulting fact: Elvis Presley isOfGenre Rock and Roll.
Infobox attribute "Birth Name: Elvis Aaron Presley". Infobox Attribute Map: birth name -> means. Relation Map: means, yagoWord, entity. Because the domain of means is a word and its range is an entity, the resulting fact is: "Elvis Aaron Presley" means Elvis Presley.
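The four infobox examples (Born, Died, Genre, Birth Name) follow one pattern: look the attribute up in the attribute map, then emit a fact with the article entity as one argument. A minimal sketch of that lookup is given below. The dictionary layout and the `inverse` flag on "birth name" are assumptions made for illustration; the slides only show the map's column names.

```python
# Hypothetical attribute map: infobox attribute -> relation, plus an "inverse"
# flag saying whether the article entity becomes the object of the fact.
ATTRIBUTE_MAP = {
    "born": {"relation": "bornOnDate", "inverse": False},
    "died": {"relation": "diedOnDate", "inverse": False},
    "genre": {"relation": "isOfGenre", "inverse": False},
    "birth name": {"relation": "means", "inverse": True},
}


def harvest_infobox(entity, infobox):
    """Turn infobox attribute/value pairs into (subject, relation, object) facts."""
    facts = []
    for attribute, value in infobox.items():
        entry = ATTRIBUTE_MAP.get(attribute.lower())
        if entry is None:
            continue  # attribute not covered by the map: no fact is generated
        if entry["inverse"]:
            facts.append((value, entry["relation"], entity))
        else:
            facts.append((entity, entry["relation"], value))
    return facts


facts = harvest_infobox(
    "Elvis Presley",
    {
        "Born": "January 8, 1935",
        "Birth Name": "Elvis Aaron Presley",
        "Website": "not in the map",
    },
)
```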
Manifold Attributes: Some attributes may have multiple values, e.g. a person may have multiple children. Multiple facts are generated, e.g. one hasChild fact for each child.
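A manifold attribute can be handled by splitting its value and emitting one fact per part. The sketch below assumes the values are comma-separated, which is a simplification; the entity and child names are placeholders, not real data.

```python
def harvest_manifold(entity, relation, raw_value, separator=","):
    """Split a multi-valued infobox attribute into one fact per value."""
    values = [part.strip() for part in raw_value.split(separator)]
    return [(entity, relation, value) for value in values if value]


# Placeholder names, for illustration only.
facts = harvest_manifold("Person X", "hasChild", "Child A, Child B")
```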
Indirect Attributes - 1: Some attributes do not concern the article entity but another fact. E.g. in the infobox of the Republic of Singapore, the attribute "gdp year" (2008) does not describe Singapore itself but the GDP fact. Facts generated: Singapore hasGDP … billion (fact #14); #14 during 2008. Attribute Map (columns: Attribute, Relation, Inverse, Manifold, Indirect): gdp ppp -> hasGDP; gdp year -> during (indirect).
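One way to picture indirect attributes is a two-pass harvest: direct attributes first, then indirect ones, which attach to the identifier of the fact they qualify. In the sketch below the `indirect` key (naming the attribute whose fact is qualified), the starting identifier, and the placeholder GDP value are all assumptions for illustration.

```python
def harvest_with_indirect(entity, infobox, attribute_map, first_id=14):
    """Generate facts keyed by identifier; indirect attributes qualify other facts."""
    facts = {}
    fact_for_attribute = {}
    next_id = first_id
    # Pass 1: direct attributes produce facts about the article entity.
    for attribute, value in infobox.items():
        entry = attribute_map.get(attribute)
        if entry is not None and "indirect" not in entry:
            fact_id = f"#{next_id}"
            next_id += 1
            facts[fact_id] = (entity, entry["relation"], value)
            fact_for_attribute[attribute] = fact_id
    # Pass 2: indirect attributes produce facts about the facts from pass 1.
    for attribute, value in infobox.items():
        entry = attribute_map.get(attribute)
        if entry is not None and "indirect" in entry:
            target = fact_for_attribute.get(entry["indirect"])
            if target is None:
                continue  # nothing to qualify
            fact_id = f"#{next_id}"
            next_id += 1
            facts[fact_id] = (target, entry["relation"], value)
    return facts


ATTRIBUTE_MAP = {
    "gdp ppp": {"relation": "hasGDP"},
    "gdp year": {"relation": "during", "indirect": "gdp ppp"},
}
# "<GDP value>" is a placeholder; the actual figure is not given here.
infobox = {"gdp ppp": "<GDP value>", "gdp year": "2008"}
facts = harvest_with_indirect("Singapore", infobox, ATTRIBUTE_MAP)
```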
Indirect Attributes - 2 Singapore Infobox
Type of Infobox
Song Infobox (American Pie): Released: October, 1971; Format: vinyl record; Genre: Folk Rock; Length: 8:33 mins; Label: United Artists; Writer: Don McLean
Car Infobox (Tesla Roadster): Manufacturer: Tesla Motors; Production: 2008-present; Class: Roadster; Length: 3,946 mm; Width: 1,873 mm; Height: 1,127 mm
Type of Infobox: Attribute Map. The same attribute name can map to different relations depending on the infobox type. Attribute Map (columns: Attribute, Relation, Inverse, Manifold, Indirect): car#length -> hasLength; song#length -> hasDuration. Resulting facts: American Pie hasDuration 8:33; Tesla Roadster hasLength 3946.
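The type-qualified lookup can be sketched by building the map key from the infobox type and the attribute name. The function name and the lower-casing of keys are choices made for this example.

```python
TYPED_ATTRIBUTE_MAP = {
    "car#length": "hasLength",
    "song#length": "hasDuration",
}


def harvest_typed(entity, infobox_type, infobox):
    """Look up each attribute under the key '<infobox type>#<attribute>'."""
    facts = []
    for attribute, value in infobox.items():
        relation = TYPED_ATTRIBUTE_MAP.get(f"{infobox_type}#{attribute.lower()}")
        if relation is not None:
            facts.append((entity, relation, value))
    return facts


song_facts = harvest_typed("American Pie", "song", {"Length": "8:33"})
car_facts = harvest_typed("Tesla Roadster", "car", {"Length": "3946"})
```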
Information Extraction - Word-Level Techniques. Wikipedia redirects: a virtual redirect page for "Presley, Elvis" links to "Elvis Presley". Each redirect gives a means fact, e.g. "Presley, Elvis" means Elvis Presley. Parsing person names: extract the name components and establish the relations givenNameOf and familyNameOf, e.g. Presley familyNameOf Elvis Presley; Elvis givenNameOf Elvis Presley.
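Both word-level techniques are easy to sketch. The name parser below is deliberately naive (first token as given name, last token as family name) and is only an assumption about how name components might be split; real names need more care.

```python
def redirect_facts(redirects):
    """Each redirect (title -> target page) yields one 'means' fact."""
    return [(title, "means", target) for title, target in redirects.items()]


def name_facts(entity):
    """Naive split: first token is the given name, last token the family name."""
    parts = entity.split()
    if len(parts) < 2:
        return []
    return [
        (parts[-1], "familyNameOf", entity),
        (parts[0], "givenNameOf", entity),
    ]


facts = redirect_facts({"Presley, Elvis": "Elvis Presley"}) + name_facts("Elvis Presley")
```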
Wikipedia Categories
Categories: Presidents of the United States | Lists of office-holders | Lists of Presidents
Categories: Rift Valleys | North Sea | Rivers of Germany | Articles needing translation from German Wikipedia | Rivers of Netherlands
Categories: Canadian Singers | Canadian male singers | 1959 births | English-language singers | Living people | Grammy Award Winners | Portrait photographers
Facts created from Wikipedia Categories: Rhine locatedIn Germany; Bryan Adams bornOnDate 1959; Bryan Adams hasWonAward Grammy Award; Abraham Lincoln politicianOf United States.
Information Extraction - Category Harvesting: Relational Categories
Table: Some Category Heuristics (regular expression -> relation)
([0-9]{3,4}) births -> bornOnDate
([0-9]{3,4}) deaths -> diedOnDate
([0-9]{3,4}) establishments -> establishedOnDate
([0-9]{3,4}) books|novels -> writtenOnDate
Mountains|Rivers in (.*) -> locatedIn
Presidents|Governors of (.*) -> politicianOf
(.*) winners -> hasWonPrize
[A-Za-z]+ (.*) winners -> hasWonPrize
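The category heuristics can be tried out with Python's `re` module. The sketch below covers a subset of the patterns and adapts them slightly: alternations are wrapped in non-capturing groups, and "in|of" is accepted after Mountains|Rivers so that a category such as "Rivers of Germany" matches. Both adaptations are assumptions made for this example.

```python
import re

# (pattern, relation) pairs; the first matching pattern wins.
CATEGORY_HEURISTICS = [
    (re.compile(r"([0-9]{3,4}) births"), "bornOnDate"),
    (re.compile(r"([0-9]{3,4}) deaths"), "diedOnDate"),
    (re.compile(r"([0-9]{3,4}) establishments"), "establishedOnDate"),
    (re.compile(r"([0-9]{3,4}) (?:books|novels)"), "writtenOnDate"),
    (re.compile(r"(?:Mountains|Rivers) (?:in|of) (.*)"), "locatedIn"),
    (re.compile(r"(?:Presidents|Governors) of (.*)"), "politicianOf"),
    (re.compile(r"(.*) winners"), "hasWonPrize"),
]


def harvest_categories(entity, categories):
    """Emit one fact for each category name that fully matches a heuristic."""
    facts = []
    for category in categories:
        for pattern, relation in CATEGORY_HEURISTICS:
            match = pattern.fullmatch(category)
            if match:
                facts.append((entity, relation, match.group(1)))
                break
    return facts


adams = harvest_categories(
    "Bryan Adams", ["1959 births", "Living people", "Grammy Award winners"]
)
rhine = harvest_categories("Rhine", ["North Sea", "Rivers of Germany"])
```

Categories that match no pattern, such as "Living people", simply yield no relational fact.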
2. Connecting Wikipedia and WordNet – What is WordNet? A lexical database for the English language, created at the Cognitive Science Laboratory of Princeton University. It groups English words into sets of synonyms called synsets, provides short, general definitions, and provides hypernym/hyponym relations, e.g. canine is a hypernym of dog, and dog is a hyponym of canine.
Connecting Wikipedia and WordNet – Type Extraction. Goal: create a class hierarchy, e.g. singer subClassOf performer; performer subClassOf artist. The hierarchy is taken from the hyponymy relation of WordNet. Wikipedia categories are attached below it, e.g. the Wikipedia class 'American people in Japan' is a subclass of the WordNet class 'person'.
Classifications of Categories: Conceptual categories, e.g. Albert Einstein is in 'Naturalized citizens of the United States'. Administrative categories, e.g. Albert Einstein is in 'Articles with unsourced statements'. Relational information, e.g. '1879 births'. Thematic vicinity, e.g. 'Physics'.
Identification of Conceptual Categories: Only conceptual categories are used. Shallow linguistic parsing is applied to category names, e.g. the category 'American people in Japan' is broken into pre-modifier 'American', head 'people', and post-modifier 'in Japan'. If the head is plural, the category is a conceptual category. A class is extracted from the Wikipedia category and connected to a class from WordNet, e.g. the Wikipedia class 'American people in Japan' has to be made a subclass of the WordNet class 'person'.
Algorithm
Function wiki2wordnet(c)
Input: Wikipedia category name c
Output: WordNet synset
1 head = headCompound(c)
2 pre = preModifier(c)
3 post = postModifier(c)
4 head = stem(head)
5 If there is a WordNet synset s for pre + head
6     return s
7 If there are WordNet synsets s1, …, sn for head
8     (ordered by their frequency for head)
9     return s1
10 fail
Explanation of Algorithm
Input: American people in Japan
1. Head: people
2. Pre-modifier: American
3. Post-modifier: in Japan
4. Stem(head): person
5. If there is a WordNet synset for 'American person'
6. return that synset (none is found here, so continue)
7. If there are synsets s1, …, sn for 'person'
8. (ordered by frequency for 'person')
9. return s1
10. Fail
Output: person
Result: American people in Japan subClassOf person
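The steps of wiki2wordnet can be sketched in runnable form with toy data standing in for the real resources. Everything below the function names is an assumption for illustration: the preposition list replaces shallow linguistic parsing, `PLURAL_TO_SINGULAR` replaces a stemmer and doubles as the plural test, and `TOY_SYNSETS` with its made-up labels replaces a WordNet lookup ordered by sense frequency.

```python
PREPOSITIONS = {"in", "of", "from", "by", "with", "at"}
PLURAL_TO_SINGULAR = {"people": "person", "singers": "singer", "rivers": "river"}
# Term -> synset labels, most frequent sense first (labels are invented).
TOY_SYNSETS = {
    "person": ["person/1", "person/2"],
    "singer": ["singer/1"],
    "river": ["river/1"],
}


def parse_category(name):
    """Split a category name into (pre-modifier, head, post-modifier)."""
    words = name.split()
    split = len(words)
    for i, word in enumerate(words):
        if i > 0 and word.lower() in PREPOSITIONS:
            split = i  # the post-modifier starts at the first preposition
            break
    return " ".join(words[: split - 1]), words[split - 1], " ".join(words[split:])


def wiki2wordnet(category):
    """Return a toy synset label for a conceptual category, or None on failure."""
    if not category.split():
        return None
    pre, head, _post = parse_category(category)
    stem = PLURAL_TO_SINGULAR.get(head.lower())
    if stem is None:
        return None  # head is not a known plural: not treated as conceptual
    compound = f"{pre} {stem}".strip().lower()
    if pre and compound in TOY_SYNSETS:
        return TOY_SYNSETS[compound][0]
    if stem in TOY_SYNSETS:
        return TOY_SYNSETS[stem][0]
    return None


result = wiki2wordnet("American people in Japan")
```

For the running example there is no entry for "american person", so the function falls through to the most frequent sense of "person".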
Fig.: WordNet search for “person” Fig.: WordNet search for ‘American Person’
Exceptions: The result is a complete hierarchy of classes, with the upper classes from WordNet and the leaves from Wikipedia. About two dozen cases failed, e.g. categories with the head compound "capital": in Wikipedia it means "capital city", but in WordNet the most frequent sense is "financial asset". These cases were corrected manually.
3. Quality Control. Canonicalization: each fact and each entity reference is unique; an entity is always referred to by the same identifier in all facts in YAGO. Type checking: eliminates individuals that do not have a class and facts that do not respect the domain and range constraints; an argument of a fact in YAGO is always an instance of the class required by the relation.
Canonicalization - 1: Redirect Resolution. The infobox heuristics deliver facts that have Wikipedia entities (i.e. Wikipedia links) as arguments. These links may not be correct Wikipedia page identifiers, so each argument is checked and, where necessary, replaced by the correct, redirected identifier. E.g. Hermitage Museum locatedIn St. Petersburg becomes Hermitage Museum locatedIn Saint Petersburg.
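Redirect resolution amounts to following redirect links until a real page identifier is reached. A minimal sketch, assuming the redirects are available as a title-to-target dictionary (the `seen` set guards against redirect loops):

```python
def resolve(entity, redirects):
    """Follow redirects until reaching an identifier that is not redirected."""
    seen = set()
    while entity in redirects and entity not in seen:
        seen.add(entity)
        entity = redirects[entity]
    return entity


def canonicalize(fact, redirects):
    """Replace both arguments of a fact by their canonical identifiers."""
    subject, relation, obj = fact
    return (resolve(subject, redirects), relation, resolve(obj, redirects))


REDIRECTS = {"St. Petersburg": "Saint Petersburg"}
fixed = canonicalize(("Hermitage Museum", "locatedIn", "St. Petersburg"), REDIRECTS)
```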
Canonicalization - 2: Removal of Duplicate Facts. Sometimes two heuristics deliver the same fact, possibly with different precision; canonicalization eliminates one of them, keeping the more precise fact. E.g. the category '1935 births' yields the fact Elvis Presley bornOnDate 1935, while the infobox attribute 'Born: January 8, 1935' yields the fact Elvis Presley bornOnDate January 8, 1935.
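Duplicate removal for date facts can be sketched as follows, assuming the less precise of two compatible facts is the one dropped. Dates are modelled as (year, month, day) tuples with None for unknown parts; that encoding is an assumption made for this example, not YAGO's date format.

```python
def is_less_precise(a, b):
    """True if date a agrees with b wherever a is known, but knows fewer parts."""
    if not (isinstance(a, tuple) and isinstance(b, tuple)):
        return False
    compatible = all(x is None or x == y for x, y in zip(a, b))
    known_a = sum(x is not None for x in a)
    known_b = sum(y is not None for y in b)
    return compatible and known_a < known_b


def remove_duplicates(facts):
    """Drop exact duplicates, then facts subsumed by a more precise one."""
    unique = list(dict.fromkeys(facts))
    kept = []
    for subject, relation, obj in unique:
        subsumed = any(
            s == subject and r == relation and is_less_precise(obj, o)
            for s, r, o in unique
        )
        if not subsumed:
            kept.append((subject, relation, obj))
    return kept


facts = [
    ("Elvis Presley", "bornOnDate", (1935, None, None)),  # from '1935 births'
    ("Elvis Presley", "bornOnDate", (1935, 1, 8)),        # from the infobox
    ("Elvis Presley", "bornOnDate", (1935, 1, 8)),        # exact duplicate
]
deduplicated = remove_duplicates(facts)
```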
Type Checking - 1. Reductive type checking: sometimes the class of an entity cannot be determined, e.g. for Wikipedia entities that have been proposed for an article but do not have a page yet; facts about such entities are discarded. Inductive type checking: type constraints can be used to generate facts, e.g. from Elvis Presley bornOnDate January 8, 1935 it follows that Elvis Presley is a person. A regular expression check ensures that the entity name follows the pattern of a given name and a family name.
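Both kinds of type checking can be sketched over a toy class hierarchy. The hierarchy, the name pattern, and the entity "UnknownBand" below are assumptions made for illustration; only the bornOnDate domain and range (person, yagoDate) come from the relation map shown earlier.

```python
import re

SUBCLASS_OF = {"singer": "person", "person": "entity", "yagoDate": "entity"}
RELATION_MAP = {"bornOnDate": ("person", "yagoDate")}  # relation -> (domain, range)
# Assumed pattern: two or more capitalized words, e.g. "Elvis Presley".
PERSON_NAME = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+")


def is_subclass(cls, ancestor):
    """Walk up the toy hierarchy from cls, looking for ancestor."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def induce_types(facts, types):
    """Inductive check: an untyped, person-named subject of a person relation."""
    induced = {entity: set(classes) for entity, classes in types.items()}
    for subject, relation, _obj in facts:
        domain, _range = RELATION_MAP[relation]
        if subject not in induced and domain == "person" and PERSON_NAME.fullmatch(subject):
            induced[subject] = {"person"}
    return induced


def type_check(facts, types):
    """Reductive check: keep facts whose arguments satisfy domain and range."""
    kept = []
    for subject, relation, obj in facts:
        domain, range_ = RELATION_MAP[relation]
        subject_ok = any(is_subclass(c, domain) for c in types.get(subject, ()))
        object_ok = any(is_subclass(c, range_) for c in types.get(obj, ()))
        if subject_ok and object_ok:
            kept.append((subject, relation, obj))
    return kept


facts = [
    ("Elvis Presley", "bornOnDate", "January 8, 1935"),
    ("UnknownBand", "bornOnDate", "January 8, 1935"),
]
types = {"January 8, 1935": {"yagoDate"}}
checked = type_check(facts, induce_types(facts, types))
```

Without the inductive step neither subject has a class, so both facts would be discarded; with it, only the fact about the untyped "UnknownBand" is dropped.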
Type Checking - 2: Type Coherence Checking. Sometimes classification yields wrong results, e.g. Abraham Lincoln is an instance of 13 classes: 12 are subclasses of the class 'person' (e.g. lawyer, president), but the 13th is the class 'cabinet'. The class hierarchy of YAGO is partitioned into branches, e.g. locations, artifacts, people, other physical entities, and abstract entities. The branch that most types lead to is determined, and the other types are purged.
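Type coherence checking is essentially a majority vote over branches. In the sketch below the mapping from class to branch is supplied by hand, and placing 'cabinet' in the artifacts branch is an assumption made for the example.

```python
from collections import Counter

# Hypothetical class -> branch assignment for the running example.
BRANCH_OF = {"lawyer": "people", "president": "people", "cabinet": "artifacts"}


def purge_incoherent(types, branch_of):
    """Keep only the types that lie in the branch most types lead to."""
    if not types:
        return []
    counts = Counter(branch_of[t] for t in types)
    majority_branch, _count = counts.most_common(1)[0]
    return [t for t in types if branch_of[t] == majority_branch]


lincoln_types = purge_incoherent(["lawyer", "president", "cabinet"], BRANCH_OF)
```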
References
YAGO: A Large Ontology from Wikipedia and WordNet. Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum. Max-Planck-Institute for Computer Science, Saarbruecken, Germany.
Automated Construction and Growth of a Large Ontology. Fabian M. Suchanek. Thesis for obtaining the title of Doctor of Engineering of the Faculties of Natural Sciences and Technology of Saarland University.
Wikipedia
WordNet
Thank You, Any Questions?