Published by Thomasine Kelley. Modified over 9 years ago.
1
YAGO: A LARGE ONTOLOGY FROM WIKIPEDIA AND WORDNET
FABIAN M. SUCHANEK, GJERGJI KASNECI, GERHARD WEIKUM
Subbalakshmi Iyer
2
Motivation for an Ontology
  Natural-language communication
  Automated text translation
  Finding information on the internet
  Computer-processable collection of knowledge
3
What is an Ontology?
  An ontology is a description of a domain, its classes and properties, and the relationships between those classes, expressed in a formal language.
  It is a collection of knowledge about the world: a knowledge base.
  Example ontologies:
    large taxonomies categorizing Web sites (such as on Yahoo!)
    categorizations of products for sale and their features (such as on Amazon.com)
4
Uses of Ontologies
  Machine translation
  Word sense disambiguation
  Document classification
  Question answering
  Entity- and fact-oriented Web search
5
What is YAGO?
  Yet Another Great Ontology
  Part of the YAGO-NAGA project
  Goal: to build a knowledge base that is
    large scale
    domain-independent
    automatically constructed
    highly accurate
  Uses Wikipedia and WordNet
6
More about YAGO
  2 million entities
  20 million facts
  Facts represented as RDF triples
  Accuracy of 95%
  Examples:
    Elvis Presley isA singer
    singer subClassOf person
    Elvis Presley bornOnDate 1935-01-08
    Elvis Presley bornIn Tupelo
    Tupelo locatedIn Mississippi (state)
    Mississippi (state) locatedIn USA
7
The YAGO model
  Slight extension of RDFS
  Represents knowledge as
    entities
    classes
    relations
    facts
  Expresses properties of relations, such as transitivity
  Simple and decidable model
8
Knowledge Representation in YAGO - 1
  All objects are entities
    e.g. Elvis Presley, Grammy Award
  Two entities can stand in a relationship
    e.g. hasWonAward: Elvis Presley hasWonAward Grammy Award
  The triple of entity, relationship, entity is a fact
    e.g. "Elvis Presley hasWonAward Grammy Award" is a fact
9
Knowledge Representation in YAGO - 2
  Numbers, dates and strings are also entities
    Elvis Presley bornInYear 1935
  Words are entities
    “Elvis” means Elvis Presley
  An entity is an instance of a class
    Elvis Presley type singer
  Classes are also entities
    singer type class
10
Knowledge Representation in YAGO - 3
  Classes are arranged in hierarchies
    singer subClassOf person
  Relations are also entities
    subClassOf type atr (acyclic transitive relation)
  Each fact has a fact identifier, so facts about facts are possible
    #1 foundIn Wikipedia
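The three slides above describe facts as identified triples whose identifiers can themselves appear in other facts. A minimal Python sketch of that idea follows; the `Fact` and `FactStore` names and the sequential `#n` numbering are assumptions made for illustration, not YAGO's actual storage format.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Fact:
    fact_id: str   # e.g. "#1"; the identifier is itself usable as an entity
    subject: str
    relation: str
    obj: str


class FactStore:
    """Toy in-memory store (hypothetical, for illustration only)."""

    def __init__(self):
        self._ids = count(1)
        self.facts = {}

    def add(self, subject, relation, obj):
        fact_id = f"#{next(self._ids)}"
        self.facts[fact_id] = Fact(fact_id, subject, relation, obj)
        return fact_id


store = FactStore()
f1 = store.add("Elvis Presley", "bornInYear", "1935")
# A fact about a fact: the subject is the identifier of the first fact.
f2 = store.add(f1, "foundIn", "Wikipedia")
```

Because identifiers are ordinary strings here, a fact such as `#1 foundIn Wikipedia` needs no special handling.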
11
Key Contributions of YAGO
  Information extraction from Wikipedia
    Infoboxes
    Category pages
  Combination with WordNet
    Taxonomy
  Quality control
    Canonicalization
    Type checking
12
Information Extraction - 1
  Entities come from Wikipedia
  Each page title is a candidate entity
  Pages are written in Wiki Markup Language (WML)
  Wikipedia dump as of September 2008
13
Information Extraction - WML
14
Information Extraction Techniques
  Technique                Source
  Infobox Harvesting       Wikipedia infoboxes
  Word-Level Techniques    Wikipedia redirects
  Category Harvesting      Wikipedia categories
  Type Extraction          Wikipedia categories, WordNet classes
15
1. Information Extraction from Wikipedia – Infobox Harvesting Wikipedia Infobox
16
Infobox entry: Born: January 8, 1935
Attribute Map:
  Attribute    Relation      Inverse    Manifold    Indirect
  …
  Born         bornOnDate
  …
Relation Map:
  Relation      Domain    Range
  …
  bornOnDate    person    yagoDate
  …
Resulting fact: Elvis Presley bornOnDate January 8, 1935
17
Infobox entry: Died: August 16, 1977
Attribute Map:
  Attribute    Relation      Inverse    Manifold    Indirect
  …
  Died         diedOnDate
  …
Relation Map:
  Relation      Domain    Range
  …
  diedOnDate    person    yagoDate
  …
Resulting fact: Elvis Presley diedOnDate August 16, 1977
18
Infobox entry: Genre: Rock and Roll
Attribute Map:
  Attribute    Relation     Inverse    Manifold    Indirect
  …
  Genre        isOfGenre
  …
Relation Map:
  Relation     Domain    Range
  …
  isOfGenre    entity    yagoClass
  …
Resulting fact: Elvis Presley isOfGenre Rock and Roll
19
Infobox entry: Birth Name: Elvis Aaron Presley
Attribute Map:
  Attribute     Relation    Inverse    Manifold    Indirect
  …
  birth name    means
  …
Relation Map:
  Relation    Domain      Range
  …
  means       yagoWord    entity
  …
Resulting fact: “Elvis Aaron Presley” means Elvis Presley
20
Manifold Attributes
  Some attributes may have multiple values
    e.g. a person may have multiple children
  Multiple facts are generated
    e.g. one hasChild fact for each child
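The infobox slides above amount to a table lookup: each attribute maps to a relation plus flags that change how the fact is built. The sketch below illustrates this under stated assumptions: the flag values, the `children` row, and the semicolon separator for multi-valued attributes are all hypothetical, since the slides show only the column names.

```python
# Hypothetical attribute map; the flag values are assumptions for illustration.
ATTRIBUTE_MAP = {
    "born": {"relation": "bornOnDate", "inverse": False, "manifold": False},
    "birth name": {"relation": "means", "inverse": True, "manifold": False},
    "children": {"relation": "hasChild", "inverse": False, "manifold": True},
}


def harvest_infobox(entity, infobox):
    """Turn infobox attribute/value pairs into (subject, relation, object) facts."""
    facts = []
    for attribute, raw_value in infobox.items():
        rule = ATTRIBUTE_MAP.get(attribute.strip().lower())
        if rule is None:
            continue  # attribute not covered by the map
        if rule["manifold"]:
            # Assumed input convention: multiple values separated by ';'.
            values = [v.strip() for v in raw_value.split(";") if v.strip()]
        else:
            values = [raw_value.strip()]
        for value in values:
            if rule["inverse"]:
                facts.append((value, rule["relation"], entity))
            else:
                facts.append((entity, rule["relation"], value))
    return facts


elvis_facts = harvest_infobox(
    "Elvis Presley",
    {"Born": "January 8, 1935", "Birth Name": "Elvis Aaron Presley"},
)
family_facts = harvest_infobox("Example Person", {"Children": "Child A; Child B"})
```

A manifold attribute yields one fact per value, and an inverse attribute swaps subject and object, which is how a `means` fact ends up with the word first.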
21
Indirect Attributes - 1
  Some attributes do not concern the article entity, but another fact
    e.g. the attribute "GDP year" does not describe the article entity (Republic of Singapore); it says that the GDP fact holds for the year 2008
  Therefore, two facts are generated:
    #14: Singapore hasGDP 238.755 billion
    #14 during 2008
  Attribute Map:
    Attribute    Relation    Inverse    Manifold    Indirect
    …
    gdp ppp      hasGDP
    gdp year     during
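An indirect attribute can be sketched as a second fact whose subject is the identifier of the first. The function name and the tuple layout below are assumptions for illustration; only the attribute names `gdp ppp` and `gdp year` and the relations `hasGDP` and `during` come from the slide.

```python
def harvest_gdp(entity, infobox, fact_id):
    """Build the base GDP fact and the indirect fact that qualifies it."""
    # Base fact, stored together with its identifier.
    base = (fact_id, entity, "hasGDP", infobox["gdp ppp"])
    # Indirect fact: its subject is the base fact's identifier, not the entity.
    qualifier = (fact_id, "during", infobox["gdp year"])
    return base, qualifier


base_fact, year_fact = harvest_gdp(
    "Singapore", {"gdp ppp": "238.755 billion", "gdp year": "2008"}, "#14"
)
```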
22
Indirect Attributes - 2 Singapore Infobox
23
Type of Infobox
  Song infobox (American Pie):
    Released    October, 1971
    Format      vinyl record
    Genre       Folk Rock
    Length      8:33 mins
    Label       United Artists
    Writer      Don McLean
  Car infobox (Tesla Roadster):
    Manufacturer    Tesla Motors
    Production      2008-present
    Class           Roadster
    Length          3,946 mm
    Width           1,873 mm
    Height          1,127 mm
24
Type of Infobox: Attribute Map
  The same attribute name maps to different relations depending on the infobox type.
  Attribute Map:
    Attribute       Relation       Inverse    Manifold    Indirect
    …
    car #length     hasLength
    song #length    hasDuration
    …
  Song infobox: American Pie hasDuration 8:33
  Car infobox: Tesla Roadster hasLength 3946
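Keying the attribute map on the pair (infobox type, attribute) is one simple way to model this disambiguation. The dictionary and function names below are hypothetical; the two rows are the ones shown on the slide.

```python
# Keys are (infobox type, attribute); the two rows come from the slide.
TYPED_ATTRIBUTE_MAP = {
    ("car", "length"): "hasLength",
    ("song", "length"): "hasDuration",
}


def relation_for(infobox_type, attribute):
    """Return the relation for an attribute in a given infobox type, or None."""
    return TYPED_ATTRIBUTE_MAP.get((infobox_type.lower(), attribute.lower()))


song_relation = relation_for("Song", "Length")
car_relation = relation_for("Car", "Length")
```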
25
Information Extraction - Word-Level Techniques
  Wikipedia Redirects
    A virtual redirect page for “Presley, Elvis” links to “Elvis Presley”
    Each redirect gives a ‘means’ fact
      e.g. “Presley, Elvis” means Elvis Presley
  Parsing Person Names
    Extract the name components
    Establish the relations givenNameOf and familyNameOf
      e.g. Presley familyNameOf Elvis Presley
           Elvis givenNameOf Elvis Presley
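Both word-level techniques can be sketched in a few lines. The name parser below simply takes the first and last whitespace-separated tokens, which is a deliberate simplification and an assumption of this sketch; the slides do not say how names are actually parsed.

```python
def redirect_fact(redirect_title, target):
    """Each redirect page yields one 'means' fact."""
    return (redirect_title, "means", target)


def name_facts(full_name):
    """Naive name parsing: first token is the given name, last is the family name."""
    parts = full_name.split()
    if len(parts) < 2:
        return []  # cannot split a single-token name
    return [
        (parts[0], "givenNameOf", full_name),
        (parts[-1], "familyNameOf", full_name),
    ]


means = redirect_fact("Presley, Elvis", "Elvis Presley")
names = name_facts("Elvis Presley")
```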
26
Wikipedia Categories
  Categories: Presidents of the United States | Lists of office-holders | Lists of Presidents
  Categories: Rift Valleys | North Sea | Rivers of Germany | Articles needing translation from German Wikipedia | Rivers of Netherlands
  Categories: Canadian Singers | Canadian male singers | 1959 births | English-language singers | Living people | Grammy Award Winners | Portrait photographers
27
Facts created from Wikipedia Categories
  Rhine locatedIn Germany
  Bryan Adams bornOnDate 1959
  Bryan Adams hasWonAward Grammy Award
  Abraham Lincoln politicianOf United States
28
Information Extraction - Category Harvesting
  Relational Categories
  Table: Some Category Heuristics
    Regular Expression               Relation
    ([0-9]{3,4}) births              bornOnDate
    ([0-9]{3,4}) deaths              diedOnDate
    ([0-9]{3,4}) establishments      establishedOnDate
    ([0-9]{3,4}) books|novels        writtenOnDate
    Mountains|Rivers in (.*)         locatedIn
    Presidents|Governors of (.*)     politicianOf
    (.*) winners                     hasWonPrize
    [A-Za-z]+ (.*) winners           hasWonPrize
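The heuristics table translates almost directly into Python's `re` module. The sketch below makes three assumptions of its own: alternations are wrapped in non-capturing groups, the location pattern also accepts "of" so that "Rivers of Germany" from the earlier slide matches, and "winners" is matched case-insensitively.

```python
import re

# (pattern, relation) pairs adapted from the table; the first match wins.
CATEGORY_HEURISTICS = [
    (re.compile(r"([0-9]{3,4}) births"), "bornOnDate"),
    (re.compile(r"([0-9]{3,4}) deaths"), "diedOnDate"),
    (re.compile(r"([0-9]{3,4}) establishments"), "establishedOnDate"),
    (re.compile(r"([0-9]{3,4}) (?:books|novels)"), "writtenOnDate"),
    # "of" is an assumption added here; the table lists only "in".
    (re.compile(r"(?:Mountains|Rivers) (?:in|of) (.*)"), "locatedIn"),
    (re.compile(r"(?:Presidents|Governors) of (.*)"), "politicianOf"),
    (re.compile(r"(.*) winners", re.IGNORECASE), "hasWonPrize"),
]


def category_fact(entity, category):
    """Return a fact for a relational category, or None if no pattern matches."""
    for pattern, relation in CATEGORY_HEURISTICS:
        match = pattern.fullmatch(category)
        if match:
            return (entity, relation, match.group(1))
    return None


born = category_fact("Bryan Adams", "1959 births")
river = category_fact("Rhine", "Rivers of Germany")
prize = category_fact("Bryan Adams", "Grammy Award Winners")
```

Categories that match no pattern, such as "Living people", produce no fact and are left to the other techniques.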
29
2. Connecting Wikipedia and WordNet – What is WordNet
  Lexical database for the English language
  Created at the Cognitive Science Laboratory of Princeton University
  Groups English words into sets of synonyms called synsets
  Provides short, general definitions
  Provides hypernym/hyponym relations
    e.g. canine is a hypernym of dog; dog is a hyponym of canine
31
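Connecting Wikipedia and WordNet – Type Extraction
  Goal: create a class hierarchy
    e.g. singer subClassOf performer
         performer subClassOf artist
  The subClassOf relation between WordNet classes is taken from WordNet's hyponymy relation
  Wikipedia classes are attached below WordNet classes
    e.g. the Wikipedia class ‘American people in Japan’ is a subclass of the WordNet class ‘person’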
32
Classifications of Categories
  Conceptual categories
    e.g. Albert Einstein is in ‘Naturalized citizens of the United States’
  Administrative categories
    e.g. Albert Einstein is in ‘Articles with unsourced statements’
  Relational information
    e.g. ‘1879 births’
  Thematic vicinity
    e.g. ‘Physics’
33
Identification of Conceptual Categories
  Only conceptual categories are used
  Shallow linguistic parsing of category names
    e.g. category ‘American people in Japan’ is broken into
      pre-modifier: ‘American’
      head: ‘people’
      post-modifier: ‘in Japan’
  If the head is plural, the category is a conceptual category
  Extract the class from the Wikipedia category
  Connect it to a class from WordNet
    e.g. the Wikipedia class ‘American people in Japan’ has to be made a subclass of the WordNet class ‘person’
34
Algorithm
  Function wiki2wordnet(c)
  Input: Wikipedia category name c
  Output: WordNet synset
   1  head = headCompound(c)
   2  pre = preModifier(c)
   3  post = postModifier(c)
   4  head = stem(head)
   5  If there is a WordNet synset s for pre + head
   6      return s
   7  If there are WordNet synsets s1, …, sn for head
   8      (ordered by their frequency for head)
   9      return s1
  10  fail
35
Explanation of Algorithm
  Input: American people in Japan
   1. Pre-modifier: American
   2. Head: people
   3. Post-modifier: in Japan
   4. Stem(head): person
   5. If there is a WordNet synset for ‘American person’
   6.     return that synset
   7. If there are synsets s1, …, sn for ‘person’
   8.     (ordered by frequency for ‘person’)
   9.     return s1
  10. Fail
  Output: person
  Result: American people in Japan subClassOf person
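The walkthrough above can be turned into a small runnable sketch. Everything lexical here is a toy stand-in and an assumption of this example: `PREPOSITIONS` drives a very naive category parser, `SINGULAR` replaces a real stemmer, and `SYNSETS` replaces WordNet with a hand-written dictionary whose senses are listed most frequent first.

```python
# Toy stand-ins (assumptions): a real system would use a parser, a stemmer and WordNet.
PREPOSITIONS = {"in", "of", "from", "by", "at"}
SINGULAR = {"people": "person", "singers": "singer", "rivers": "river"}
SYNSETS = {
    "person": ["person (human being)", "person (grammatical category)"],
    "singer": ["singer (vocalist)"],
}


def parse_category(name):
    """Split a category name into (pre-modifier, head, post-modifier)."""
    words = name.split()
    # The post-modifier starts at the first preposition, if there is one.
    cut = next((i for i, w in enumerate(words) if w.lower() in PREPOSITIONS), len(words))
    before, post = words[:cut], words[cut:]
    if not before:
        return "", "", " ".join(post)
    return " ".join(before[:-1]), before[-1], " ".join(post)


def is_conceptual(name):
    """A category counts as conceptual if its head is a known plural."""
    _, head, _ = parse_category(name)
    return head.lower() in SINGULAR


def wiki2wordnet(category):
    pre, head, _post = parse_category(category)
    stem = SINGULAR.get(head.lower(), head.lower())
    compound = f"{pre} {stem}".strip().lower()
    if pre and compound in SYNSETS:   # steps 5-6: synset for pre + head
        return SYNSETS[compound][0]
    if stem in SYNSETS:               # steps 7-9: most frequent sense of head
        return SYNSETS[stem][0]
    return None                       # step 10: fail


result = wiki2wordnet("American people in Japan")
```

Since the toy dictionary has no entry for "american person", the lookup falls through to the most frequent sense of "person", matching the slide's result.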
36
Fig.: WordNet search for ‘person’
Fig.: WordNet search for ‘American person’
37
Exceptions
  The result is a complete hierarchy of classes
    Upper classes come from WordNet
    Leaves come from Wikipedia
  About two dozen cases failed
    e.g. categories with the head compound “capital”
      In Wikipedia, it means “capital city”
      In WordNet, the most frequent sense is “financial asset”
  These cases were corrected manually
38
3. Quality Control
  Canonicalization
    Makes each fact and each entity reference unique
    An entity is always referred to by the same identifier in all facts in YAGO
  Type Checking
    Eliminates individuals that do not have a class
    Eliminates facts that do not respect domain and range constraints
    An argument of a fact in YAGO is always an instance of the class required by the relation
39
Canonicalization - 1
  Redirect Resolution
    Infobox heuristics deliver facts that have Wikipedia entities (i.e. Wikipedia links) as arguments
    These links may not be correct Wikipedia page identifiers
    Check whether each argument is a correct Wikipedia identifier
    If not, replace it by the correct, redirected identifier
    e.g. Hermitage Museum locatedIn St. Petersburg
         becomes Hermitage Museum locatedIn Saint Petersburg
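Redirect resolution can be sketched as a lookup that follows redirects until it reaches a title that is not itself a redirect. The `REDIRECTS` table and the function names are hypothetical; the loop guard against cyclic redirects is an extra precaution added in this sketch.

```python
# Hypothetical redirect table: redirect title -> canonical page title.
REDIRECTS = {"St. Petersburg": "Saint Petersburg"}


def resolve(entity, redirects):
    """Follow redirects until a canonical identifier is reached."""
    seen = set()
    while entity in redirects and entity not in seen:
        seen.add(entity)  # guard against redirect cycles
        entity = redirects[entity]
    return entity


def canonicalize(fact, redirects):
    """Replace both arguments of a fact by their canonical identifiers."""
    subject, relation, obj = fact
    return (resolve(subject, redirects), relation, resolve(obj, redirects))


fixed = canonicalize(("Hermitage Museum", "locatedIn", "St. Petersburg"), REDIRECTS)
```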
40
Canonicalization - 2
  Removal of Duplicate Facts
    Sometimes, two heuristics deliver the same fact
    Canonicalization eliminates one of them
    e.g. the category ‘1935 births’ yields the fact:
           Elvis Presley bornOnDate 1935
         the infobox attribute ‘Born: January 8, 1935’ yields the fact:
           Elvis Presley bornOnDate January 8, 1935
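The slide says only that one of the two facts is eliminated. The sketch below assumes that the more precise date is the one to keep, and it assumes dates are normalized to ISO-style strings so that a less precise date is a prefix of a more precise one; both are assumptions of this example, as is the `DATE_RELATIONS` set.

```python
# Relations whose values are dates (assumed list, for illustration).
DATE_RELATIONS = {"bornOnDate", "diedOnDate"}


def deduplicate(facts):
    """Drop exact duplicates and, for date relations, less precise duplicates."""
    kept = []
    for fact in dict.fromkeys(facts):  # removes exact duplicates, keeps order
        subject, relation, value = fact
        if relation in DATE_RELATIONS:
            same = [k for k in kept if k[0] == subject and k[1] == relation]
            if any(k[2].startswith(value) for k in same):
                continue  # an equally or more precise date is already kept
            # Remove kept facts that are less precise versions of this one.
            kept = [k for k in kept if not (k in same and value.startswith(k[2]))]
        kept.append(fact)
    return kept


facts = [
    ("Elvis Presley", "bornOnDate", "1935"),
    ("Elvis Presley", "bornOnDate", "1935-01-08"),
    ("Elvis Presley", "bornOnDate", "1935"),
    ("Rhine", "locatedIn", "Germany"),
]
unique_facts = deduplicate(facts)
```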
41
Type Checking - 1
  Reductive Type Checking
    Sometimes the class of an entity cannot be determined
      e.g. Wikipedia entities that have been proposed for an article, but that do not have a page yet
    Facts involving such entities are discarded
  Inductive Type Checking
    Type constraints can be used to generate facts
      e.g. Elvis Presley bornOnDate January 8, 1935
           so Elvis Presley is a person
    A regular-expression check ensures that the entity name follows the pattern of a given name and a family name
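Reductive and inductive type checking can be combined in one pass over the facts, as in the following sketch. The `RELATION_DOMAIN` table, the `PERSON_NAME` pattern and the function name are assumptions for illustration; the sketch checks only the subject of each fact and ignores range constraints and the class hierarchy.

```python
import re

# Assumed domain constraints: relation -> class required of the subject.
RELATION_DOMAIN = {"bornOnDate": "person", "locatedIn": "location"}
# Assumed name check: at least two capitalized words.
PERSON_NAME = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+")


def type_check(facts, types):
    """Return (kept facts, updated types); types maps entity -> set of classes."""
    types = {entity: set(classes) for entity, classes in types.items()}
    kept = []
    for subject, relation, obj in facts:
        if not types.get(subject):
            domain = RELATION_DOMAIN.get(relation)
            if domain == "person" and PERSON_NAME.fullmatch(subject):
                types[subject] = {"person"}  # inductive: infer the missing type
            else:
                continue                     # reductive: discard the fact
        kept.append((subject, relation, obj))
    return kept, types


kept_facts, inferred = type_check(
    [
        ("Elvis Presley", "bornOnDate", "1935-01-08"),
        ("Proposed article", "locatedIn", "Nowhere"),
    ],
    {},
)
```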
42
Type Checking - 2
  Type Coherence Checking
    Sometimes, classification yields wrong results
      e.g. Abraham Lincoln is an instance of 13 classes
           12 are subclasses of the class ‘person’, e.g. lawyer, president
           the 13th class is the class ‘cabinet’
    The class hierarchy of YAGO is partitioned into branches
      e.g. locations, artifacts, people, other physical entities, and abstract entities
    The branch that most of an entity's types lead to is determined
    Types in other branches are purged
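Type coherence checking is essentially a majority vote over branches. In the sketch below, the `BRANCH_OF` table is hypothetical, including the assignment of ‘cabinet’ to the artifacts branch, and ties are broken by whichever branch is encountered first.

```python
from collections import Counter

# Hypothetical class -> branch assignments for illustration.
BRANCH_OF = {
    "lawyer": "people",
    "president": "people",
    "politician": "people",
    "cabinet": "artifacts",
}


def purge_incoherent(classes, branch_of):
    """Keep only the classes that lie in the branch most classes lead to."""
    if not classes:
        return []
    counts = Counter(branch_of[c] for c in classes)
    majority_branch, _ = counts.most_common(1)[0]
    return [c for c in classes if branch_of[c] == majority_branch]


lincoln_types = purge_incoherent(
    ["lawyer", "president", "cabinet", "politician"], BRANCH_OF
)
```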
43
References
  Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum. YAGO: A Large Ontology from Wikipedia and WordNet. Max-Planck-Institute for Computer Science, Saarbruecken, Germany.
  Fabian M. Suchanek. Automated Construction and Growth of a Large Ontology. Thesis for obtaining the title of Doctor of Engineering of the Faculties of Natural Sciences and Technology of Saarland University.
  Wikipedia: http://en.wikipedia.org/wiki/Main_Page
  WordNet: http://wordnet.princeton.edu/
44
Thank You, Any Questions?