Published by Treyton Patman. Modified over 9 years ago.
1
For a Few Triples More
Gerhard Weikum
Max Planck Institute for Informatics
http://www.mpi-inf.mpg.de/~weikum/
2
Acknowledgements
3
LOD: RDF Triples on the Web http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png
4
LOD: Linked RDF Triples on the Web
[Figure: graph of linked RDF triples that includes dbpedia.org/resource/Ennio_Morricone.
Node labels: dbpedia.org/resource/Rome, rdf.freebase.com/ns/en.rome, data.nytimes.com/51688803696189142301, geonames.org/3169070/roma (N 41° 54' 10'', E 12° 29' 2''), yago/wikicategory:ItalianComposer, yago/wordnet:Artist109812338, yago/wordnet:Actor109765278, imdb.com/name/nm0910607/, imdb.com/title/tt0361748/.
Edge labels: owl:sameAs, Coord, dbpprop:citizenOf, rdf:type, rdfs:subclassOf, prop:actedIn, prop:composedMusicFor.]
5
LOD: Linked RDF Triples on the Web
Size: 30 billion triples
Linkage: 500 million links
Dynamics: encyclopedic reference data
6
The Good, the Bad, and the Ugly
7
For a Few Triples More
30 billion triples, still not enough? No! Consider:
1. Dynamics
2. Linkage
3. Ubiquity
8
Outline
Why More Triples: Dynamics, Linkage, Ubiquity
Web-Scale Linkage
Explain Title
Wrap-up
Linkage & Ubiquity: Named-Entity Disambiguation
9
1. Dynamics: in a Fast-Paced World
Anecdotal examples:
Chairman and CEO, Apple Inc.
<dcterms:subject rdf:resource="http://... Category:Nobel_Peace_Prize_laureates"/>
Dina Ruiz, 1 child
Maggie Johnson, 2 children
Nancy Shevell
Annotations on the examples: still there, not there, never there, both there, none there
10
1. Dynamics: As Fresh As Possible http://data.gov.uk/openspending
11
1. Dynamics: Updates in the Web of Data http://sindice.com
12
1. Dynamics: Closer to the Sources
RDF data on the Web is produced by:
  Maintained, but mostly "static" reference collections (e.g. geo)
  Periodic exports from curated databases (e.g. gov, bio, music)
  Periodic extraction from Web sources (e.g. encyclopedia, news)
  Tags in social streams and advertisements
Quality labels on the slide: mostly fresh, often stale, very noisy
Get closer to the data origin:
  RDF engines (SPARQL APIs) for production DBs
  View maintenance by pub-sub push (feeds)
  Deep-Web crawl/query for surfacing of RDF data
13
1. Dynamics: Nothing Lasts Forever
Even old and "static" data often needs temporal scope (timepoint, timespan) for proper interpretation.
Need to add temporal properties to RDF and SPARQL with reification, or use quads (quints, pints, etc.)
1: PaulMcCartney hasSpouse HeatherMills [11-Jun-2002, 2008]
2: PaulMcCartney hasSpouse NancyShevell [Oct-2011, now]
3: PaulMcCartney gotHonor SirPaul [1999]
1 validFrom 11-Jun-2002
1 validUntil 2008
2 validFrom Oct-2011
3 happenedOn 1999
Select ?w Where {
  ?id1: PM gotHonor SirPaul . ?id1 happenedOn ?t .
  ?id2: PM hasSpouse ?w . ?id2 validFrom ?b . ?id2 validUntil ?e .
  ?t containedIn [?b,?e] . }
but: principled, expressive, easy-to-use
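The reification idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the speaker's system: the `Fact` class, the helper names, and the choice of whole years as time granularity are assumptions made for the example (the slide gives 11-Jun-2002 and Oct-2011).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fact:
    """A reified triple: the id lets temporal statements refer to the fact."""
    fact_id: int
    subject: str
    predicate: str
    obj: str
    valid_from: Optional[int] = None   # year; None means no known start
    valid_until: Optional[int] = None  # year; None means still valid ("now")


# Facts 1 and 2 from the slide, with dates coarsened to years (assumption).
FACTS = [
    Fact(1, "PaulMcCartney", "hasSpouse", "HeatherMills", 2002, 2008),
    Fact(2, "PaulMcCartney", "hasSpouse", "NancyShevell", 2011, None),
]


def valid_at(fact: Fact, year: int) -> bool:
    """True if the fact's validity interval contains the given year."""
    starts_ok = fact.valid_from is None or fact.valid_from <= year
    ends_ok = fact.valid_until is None or year <= fact.valid_until
    return starts_ok and ends_ok


def objects_at(facts, subject, predicate, year):
    """All objects o such that (subject, predicate, o) holds in the given year."""
    return [f.obj for f in facts
            if f.subject == subject and f.predicate == predicate
            and valid_at(f, year)]


print(objects_at(FACTS, "PaulMcCartney", "hasSpouse", 2005))  # ['HeatherMills']
```

A quad or quint store would keep the interval as extra columns of the same row instead of separate statements about a fact id; the lookup logic stays the same.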
14
1. Dynamics: Nothing Lasts Forever http://www.mpi-inf.mpg.de/yago-naga/yago/
15
2. Linkage: sameAs Links
dbpedia.org/resource/Linda_Louise_Eastman owl:sameAs yago-knowledge.org/resource/Linda_McCartney
www.freebase.com/view/en/man_with_no_name owl:sameAs dbpedia.org/page/Clint_Eastwood
data.linkedmdb.org/page/film/38166 owl:sameAs de.dbpedia.org/page/Zwei_glorreiche_Halunken
LOD statistics: 30 billion triples, 500 million links
  330 million links trivial (ID-based): within pub, within bio
  tens of millions of links near-trivial: DBpedia, Freebase, Yago, GeoNames
sameas.org: 17 million bundles for 50 million URIs
data.nytimes.com: 5000 people, 2000 locations
Way too few for a world with: 1 million people, 10 million locations, tens of millions of species, 6 million books, 2 million movies, 10 million songs, etc.
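A "bundle" of URIs can be read as a connected component of the owl:sameAs graph. The following minimal sketch groups URIs with a union-find structure; it is an assumption about how such bundles could be computed, not a description of how sameas.org does it, and the `example.org` URI is made up to show transitivity.

```python
def same_as_bundles(links):
    """Group URIs into bundles, treating owl:sameAs as symmetric and transitive."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    bundles = {}
    for uri in parent:
        bundles.setdefault(find(uri), set()).add(uri)
    return list(bundles.values())


LINKS = [
    ("dbpedia.org/resource/Linda_Louise_Eastman",
     "yago-knowledge.org/resource/Linda_McCartney"),
    # Hypothetical third URI, added only to illustrate transitivity.
    ("yago-knowledge.org/resource/Linda_McCartney",
     "example.org/Linda_McCartney"),
    ("data.linkedmdb.org/page/film/38166",
     "de.dbpedia.org/page/Zwei_glorreiche_Halunken"),
]

for bundle in same_as_bundles(LINKS):
    print(sorted(bundle))
```

Note that transitive closure also spreads errors: a single wrong sameAs link merges two whole bundles, which is one reason link accuracy matters as much as coverage.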
16
2. Linkage: sameAs Coverage
17
2. Linkage: sameAs Accuracy http://sameas.org
18
3. Ubiquity: Web-of-Data & Web-of-Contents
19
3. Ubiquity: Web of Data & Other Contents
RDF data and Web contents need to be interconnected
RDFa & microformats provide the mechanism
How do we get the Web RDF-annotated (at large scale)?
Largely automated, but allow humans in the loop
20
3. Ubiquity: Web of Data & Other Contents
May 2, 2011
Maestro Morricone will perform on the stage of the Smetana Hall to conduct the Czech National Symphony Orchestra and Choir. The concert will feature both Classical compositions and soundtracks such as the Ecstasy of Gold. In programme two concerts for July 14th and 15th.
<html … May 2, 2011 Maestro Morricone <a rel="sameAs" resource="dbpedia…/Ennio_Morricone"/> … Smetana Hall … <span property="rdf:type" resource="yago:performance"> The concert will feature … <span property="event:date" content="14-07-2011"> July 1
21
Why a Few Triples More?
Linked Data is great! But it is still in its infancy.
Need to add triples to capture further issues:
Dynamics: Where is the live data?
Linkage: Where are the links in Linked Data?
Ubiquity: Where are the paths between the Web-of-Data and the Web?
22
Outline
Why More Triples: Dynamics, Linkage, Ubiquity
Web-Scale Linkage
Explain Title
Wrap-up
Linkage & Ubiquity: Named-Entity Disambiguation
23
Entities on the Web http://sig.ma
24
Named-Entity Disambiguation (NED)
Harry fought with you know who. He defeats the dark lord.
Candidate entities: Harry Potter, Dirty Harry, Prince Harry of England, Lord Voldemort, The Who (band)
Three NLP tasks:
1) named-entity detection: segment & label by HMM or CRF (e.g. Stanford NER tagger)
2) co-reference resolution: link to preceding NP (trained classifier over linguistic features)
3) named-entity disambiguation: map each mention (name) to canonical entity (entry in KB)
25
Mentions, Meanings, Mappings
Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
KB: mentions (surface names) and entities (meanings)
Sergio means Sergio_Leone
Sergio means Serge_Gainsbourg
Ennio means Ennio_Antonelli
Ennio means Ennio_Morricone
Eli means Eli_(bible)
Eli means ExtremeLightInfrastructure
Eli means Eli_Wallach
Ecstasy means Ecstasy_(drug)
Ecstasy means Ecstasy_of_Gold
trilogy means Star_Wars_Trilogy
trilogy means Lord_of_the_Rings
trilogy means Dollars_Trilogy
…
Entities shown: Eli (bible), Eli Wallach, Dollars Trilogy, Lord of the Rings, Star Wars Trilogy, Benny Andersson, Benny Goodman, Ecstasy of Gold, Ecstasy (drug)
?
[Slide footer: D5 Overview, May 30, 2011]
26
Mention-Entity Graph
Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Dollars Trilogy, Lord of the Rings, Star Wars, Ecstasy of Gold, Ecstasy (drug), Eli (bible), Eli Wallach
KB+Stats: weighted undirected graph with two types of nodes
Popularity (m,e): freq(e|m), length(e), #links(e)
Similarity (m,e): cos/Dice/KL (context(m), context(e))
bag-of-words or language model: words, bigrams, phrases
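The two mention-entity edge weights can be illustrated with a small sketch: a popularity prior estimated as freq(e|m) from link counts, and a bag-of-words cosine similarity between the mention context and an entity context. The counts and context words below are invented for illustration, and the function names are not taken from any tool mentioned in the talk.

```python
import math
from collections import Counter


def popularity(anchor_counts, mention):
    """Estimate freq(e|m): share of links with anchor text m that point to e."""
    counts = anchor_counts.get(mention, {})
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()} if total else {}


def cosine(words_a, words_b):
    """Cosine similarity between two bags of words."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Made-up anchor-text statistics: the prior alone prefers the wrong entity.
ANCHOR_COUNTS = {"Eli": {"Eli_(bible)": 70, "Eli_Wallach": 30}}
print(popularity(ANCHOR_COUNTS, "Eli"))

# Made-up context words for the mention and for two candidate entities.
mention_context = ["role", "scene", "western", "films"]
print(cosine(mention_context, ["actor", "western", "films", "role"]))
print(cosine(mention_context, ["priest", "temple", "samuel"]))
```

In this toy setting the prior favors Eli_(bible) while the context similarity favors Eli_Wallach, which is why the slide combines several signals as edge weights instead of relying on one.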
27
Mention-Entity Graph
Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Dollars Trilogy, Lord of the Rings, Star Wars, Ecstasy of Gold, Ecstasy (drug), Eli (bible), Eli Wallach
KB+Stats: weighted undirected graph with two types of nodes
Popularity (m,e): freq(e|m), length(e), #links(e)
Similarity (m,e): cos/Dice/KL (context(m), context(e))
joint mapping
28
Mention-Entity Graph
Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Dollars Trilogy, Lord of the Rings, Star Wars, Ecstasy of Gold, Ecstasy (drug), Eli (bible), Eli Wallach
KB+Stats: weighted undirected graph with two types of nodes
Popularity (m,e): freq(e|m), length(e), #links(e)
Similarity (m,e): cos/Dice/KL (context(m), context(e))
Coherence (e,e'): dist(types), overlap(links), overlap(anchor words)
29
Mention-Entity Graph
Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Dollars Trilogy, Lord of the Rings, Star Wars, Ecstasy of Gold, Ecstasy (drug), Eli (bible), Eli Wallach
KB+Stats: weighted undirected graph with two types of nodes
Popularity (m,e): freq(e|m), length(e), #links(e)
Similarity (m,e): cos/Dice/KL (context(m), context(e))
Coherence (e,e'): dist(types), overlap(links), overlap(anchor words)
Entity types shown (for dist(types)): American Jews, film actors, artists, Academy Award winners, Metallica songs, Ennio Morricone songs, artifacts, soundtrack music, spaghetti westerns, film trilogies, movies, artifacts
30
Mention-Entity Graph
Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Dollars Trilogy, Lord of the Rings, Star Wars, Ecstasy of Gold, Ecstasy (drug), Eli (bible), Eli Wallach
KB+Stats: weighted undirected graph with two types of nodes
Popularity (m,e): freq(e|m), length(e), #links(e)
Similarity (m,e): cos/Dice/KL (context(m), context(e))
Coherence (e,e'): dist(types), overlap(links), overlap(anchor words)
Links shown (for overlap(links)):
http://.../wiki/Dollars_Trilogy
http://.../wiki/The_Good,_the_Bad,_the_Ugly
http://.../wiki/Clint_Eastwood
http://.../wiki/Honorary_Academy_Award
http://.../wiki/The_Good,_the_Bad,_the_Ugly
http://.../wiki/Metallica
http://.../wiki/Bellagio_(casino)
http://.../wiki/Ennio_Morricone
http://.../wiki/Sergio_Leone
http://.../wiki/The_Good,_the_Bad,_the_Ugly
http://.../wiki/For_a_Few_Dollars_More
http://.../wiki/Ennio_Morricone
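The overlap(links) coherence signal could be sketched as a set-overlap score between the link sets of two entities. The slide does not say which overlap measure is used; Jaccard overlap is chosen here as an assumption, and the link sets are illustrative (which pages link to which entity is not recoverable from the slide).

```python
def link_overlap(links_a, links_b):
    """Jaccard overlap of two link sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(links_a), set(links_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Illustrative link sets; the page names echo the slide, the grouping is assumed.
ECSTASY_OF_GOLD = {"Ennio_Morricone", "The_Good,_the_Bad,_the_Ugly", "Metallica"}
DOLLARS_TRILOGY = {"Sergio_Leone", "The_Good,_the_Bad,_the_Ugly",
                   "Ennio_Morricone", "For_a_Few_Dollars_More"}
ECSTASY_DRUG = {"MDMA", "Serotonin"}  # hypothetical link names

print(link_overlap(ECSTASY_OF_GOLD, DOLLARS_TRILOGY))  # two shared links out of five
print(link_overlap(ECSTASY_DRUG, DOLLARS_TRILOGY))     # no shared links
```

Entity pairs with a high overlap get a heavy entity-entity edge, so choosing one of them for a mention supports choosing the other for a neighboring mention.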
31
Mention-Entity Graph
Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Dollars Trilogy, Lord of the Rings, Star Wars, Ecstasy of Gold, Ecstasy (drug), Eli (bible), Eli Wallach
KB+Stats: weighted undirected graph with two types of nodes
Popularity (m,e): freq(e|m), length(e), #links(e)
Similarity (m,e): cos/Dice/KL (context(m), context(e))
Coherence (e,e'): dist(types), overlap(links), overlap(anchor words)
Anchor words shown (for overlap(anchor words)): Metallica on Morricone tribute; Bellagio water fountain show; Yo-Yo Ma; Ennio Morricone composition; The Magnificent Seven; The Good, the Bad, and the Ugly; Clint Eastwood; University of Texas at Austin; For a Few Dollars More; The Good, the Bad, and the Ugly; Man with No Name trilogy; soundtrack by Ennio Morricone
32
Joint Mapping
Build mention-entity graph or joint-inference factor graph from knowledge and statistics in KB
Compute high-likelihood mapping (ML or MAP) or dense subgraph such that:
each m is connected to exactly one e (or at most one e)
[Figure: example mention-entity graph. Edge weights shown: 90, 30, 5, 100, 50, 20, 50, 90, 80, 90, 30, 10, 20, 30]
33
Coherence Graph Algorithm
Compute dense subgraph to maximize min weighted degree among entity nodes such that:
each m is connected to exactly one e (or at most one e)
Greedy approximation: iteratively remove weakest entity and its edges
Keep alternative solutions, then use local/randomized search
[J. Hoffart et al.: EMNLP'11]
[Figure: example graph, initial state. Numbers shown: 90, 30, 5, 100, 50, 90, 80, 90, 30, 10, 20, 10, 20, 30; then 140, 180, 50, 470, 145, 230]
34
Coherence Graph Algorithm
Compute dense subgraph to maximize min weighted degree among entity nodes such that:
each m is connected to exactly one e (or at most one e)
Greedy approximation: iteratively remove weakest entity and its edges
Keep alternative solutions, then use local/randomized search
[J. Hoffart et al.: EMNLP'11]
[Figure: example graph after a removal step. Numbers shown: 90, 30, 5, 100, 50, 90, 80, 90, 30, 10, 30; then 140, 180, 50, 470, 145, 230; then 140, 170, 470, 145, 210]
35
Coherence Graph Algorithm
Compute dense subgraph to maximize min weighted degree among entity nodes such that:
each m is connected to exactly one e (or at most one e)
Greedy approximation: iteratively remove weakest entity and its edges
Keep alternative solutions, then use local/randomized search
[J. Hoffart et al.: EMNLP'11]
[Figure: example graph after a further removal step. Numbers shown: 90, 30, 5, 100, 90, 80, 90, 30; then 140, 170, 460, 145, 210; then 120, 460, 145, 210]
36
Coherence Graph Algorithm
Compute dense subgraph to maximize min weighted degree among entity nodes such that:
each m is connected to exactly one e (or at most one e)
Greedy approximation: iteratively remove weakest entity and its edges
Keep alternative solutions, then use local/randomized search
[J. Hoffart et al.: EMNLP'11]
[Figure: example graph, final state. Numbers shown: 90, 100, 90, 30; then 120, 380, 145, 210]
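The greedy approximation described on these slides can be sketched as follows. This is a simplified illustration and not the AIDA implementation: the final local/randomized search is replaced by picking the strongest surviving candidate per mention, and all weights and names in the example are made up.

```python
def greedy_disambiguate(me_weights, ee_weights):
    """Map each mention to one entity by greedily removing weak entities.

    me_weights: {mention: {entity: weight}}      mention-entity edges
    ee_weights: {frozenset({e1, e2}): weight}    entity-entity (coherence) edges
    """
    candidates = {m: set(es) for m, es in me_weights.items() if es}
    alive = set().union(*candidates.values())
    if not alive:
        return {}

    def degree(entity, alive_entities):
        # Mention nodes are never removed, so all mention-entity edges count.
        d = sum(ws[entity] for ws in me_weights.values() if entity in ws)
        # Coherence edges count only while both endpoints are still present.
        d += sum(w for pair, w in ee_weights.items()
                 if entity in pair and pair <= alive_entities)
        return d

    def min_degree(alive_entities):
        return min(degree(e, alive_entities) for e in alive_entities)

    best, best_score = set(alive), min_degree(alive)
    while True:
        # An entity may go only if every mention it serves keeps another candidate.
        removable = [e for e in alive
                     if all(len(c & alive) > 1
                            for c in candidates.values() if e in c)]
        if not removable:
            break
        weakest = min(removable, key=lambda e: (degree(e, alive), e))
        alive = alive - {weakest}
        score = min_degree(alive)
        if score >= best_score:
            best, best_score = set(alive), score

    # Simplification: pick the strongest surviving candidate per mention.
    return {m: max(c & best, key=lambda e: (degree(e, best), e))
            for m, c in candidates.items()}


# Made-up weights: the priors favor the wrong entities, coherence corrects them.
ME = {
    "Eli": {"Eli_(bible)": 50, "Eli_Wallach": 30},
    "Ecstasy": {"Ecstasy_(drug)": 60, "Ecstasy_of_Gold": 20},
    "trilogy": {"Star_Wars": 50, "Dollars_Trilogy": 30},
}
EE = {
    frozenset({"Eli_Wallach", "Ecstasy_of_Gold"}): 80,
    frozenset({"Eli_Wallach", "Dollars_Trilogy"}): 90,
    frozenset({"Ecstasy_of_Gold", "Dollars_Trilogy"}): 100,
    frozenset({"Eli_(bible)", "Ecstasy_(drug)"}): 5,
}
print(greedy_disambiguate(ME, EE))
```

On this toy input the removal order is Star_Wars, Eli_(bible), Ecstasy_(drug), and the minimum weighted degree of the remaining entities rises from 50 to 200. Recomputing degrees from scratch in every round keeps the sketch short; a real implementation would update them incrementally.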
37
Named-Entity Disambiguation: State-of-the-Art
Online tools:
https://d5gate.ag5.mpi-sb.mpg.de/webaida/
http://tagme.di.unipi.it/
http://spotlight.dbpedia.org/demo/index.html
http://viewer.opencalais.com/
http://wikipedia-miner.cms.waikato.ac.nz/demos/annotate/
etc.
Literature:
Razvan Bunescu, Marius Pasca: EACL 2006
Silviu Cucerzan: EMNLP 2007
David Milne, Ian Witten: CIKM 2008
S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti: KDD 2009
G. Limaye, S. Sarawagi, S. Chakrabarti: VLDB 2010
Paolo Ferragina, Ugo Scaiella: CIKM 2010
Mark Dredze et al.: COLING 2010
Johannes Hoffart et al.: EMNLP 2011
etc.
38
NED: Experimental Evaluation
Benchmark:
Extended CoNLL 2003 dataset: 1400 newswire articles, originally annotated with mention markup (NER), now with NED mappings to Yago and Freebase
Difficult texts:
  … Australia beats India … maps to Australian_Cricket_Team
  … White House talks to Kremlin … maps to President_of_the_USA
  … EDS made a contract with … maps to HP_Enterprise_Services
Results:
Best: AIDA method with prior+sim+coh + robustness test
82% precision @100% recall, 87% mean average precision
For a comparison to other methods, see the paper:
J. Hoffart et al.: Robust Disambiguation of Named Entities in Text, EMNLP 2011
http://www.mpi-inf.mpg.de/yago-naga/aida/
39
AIDA: Accurate Online Disambiguation http://www.mpi-inf.mpg.de/yago-naga/aida/
44
Interesting Research Issues
More efficient graph algorithms (multicore, etc.)
Allow mentions of unknown entities, mapped to null
Short and difficult texts:
  tweets, headlines, etc.
  fictional texts: novels, song lyrics, etc.
  incoherent texts
Disambiguation beyond entity names:
  coreferences: pronouns, paraphrases, etc.
  common nouns, verbal phrases (general WSD)
Leverage deep-parsing structures, leverage semantic types
45
Why Named Entity Disambiguation is Key
Linked data is best if it has many good links
New & rich contents mostly in traditional Web
Create sameAs links in (X)HTML contents, via RDFa
Links for named entities give best mileage/effort
Methods & tools greatly advanced & gradually maturing
Keep human in the loop, embed NED in authoring tools
46
Outline
Why More Triples: Dynamics, Linkage, Ubiquity
Web-Scale Linkage
Explain Title
Wrap-up
Linkage & Ubiquity: Named-Entity Disambiguation
47
Variants of NED at Web Scale
Tools can map short text onto entities in a few seconds
How to run this on a big batch of 1 million input texts?
  partition inputs across distributed machines, organize dictionary appropriately, …
  exploit cross-document contexts
How to deal with inputs from different time epochs?
  consider time-dependent contexts, map to entities of proper epoch (e.g. harvested from Wikipedia history)
How to handle Web-scale inputs (100 million pages) restricted to a set of interesting entities? (e.g. tracking politicians and companies)
48
Linked RDF Triples on the Web
[Figure: graph of linked RDF triples that includes dbpedia.org/resource/Ennio_Morricone.
Node labels: dbpedia.org/resource/Rome, rdf.freebase.com/ns/en.rome_ny, data.nytimes.com/51688803696189142301, geonames.org/5134301/city_of_rome (N 43° 12' 46'', W 75° 27' 20''), yago/wikicategory:ItalianComposer, yago/wordnet:Artist109812338, yago/wordnet:Actor109765278, imdb.com/name/nm0910607/, imdb.com/title/tt0361748/.
Edge labels: owl:sameAs, Coord, dbpprop:citizenOf, rdf:type, rdfs:subclassOf, prop:actedIn, prop:composedMusicFor.
Marks on the figure: ! ? ? ?]
referential data quality: automatic, dynamic, high coverage
49
Outline
Why More Triples: Dynamics, Linkage, Ubiquity
Web-Scale Linkage
Explain Title
Wrap-up
Linkage & Ubiquity: Named-Entity Disambiguation
50
Summary
Linked Data is great! But it needs more triples to capture:
Dynamics: (Deep-Web) sources; feeds, pub-sub, … ?; fresh & versioned triples
Linkage: LOD; entity mapping; user community
Ubiquity: RDFa; entity disambiguation; authoring
51
Outlook
For a Few Triples More
Challenge 1: generate high-quality sameAs links in RDFa & across all LOD sources
For a Few Triples Less
Challenge 2: add efficient top-k ranking to queries over RDF-in-context
52
Thank You!