joint work with Shady Elbassuoni, Georgiana Ifrim, Gjergji Kasneci,

Slides:



Advertisements
Similar presentations
Fabian M. Suchanek SOFIE: A Self-Organizing Framework for Information Extraction 1 SOFIE: A Self-Organizing Framework for Information Extraction Fabian.
Advertisements

Fabian M. SuchanekYAGO - A Core of Semantic Knowledge 1 YAGO – A Core of Semantic Knowledge Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum (Max-Planck.
Date: 2014/05/06 Author: Michael Schuhmacher, Simon Paolo Ponzetto Source: WSDM’14 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang Knowledge-based Graph Document.
Research Internships Advanced Research and Modeling Research Group.
YAGO: A Large Ontology from Wikipedia and WordNet Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum Max-Planck-Institute for Computer Science, Saarbruecken,
SWIMs: From Structured Summaries to Integrated Knowledge Base
1 gStore: Answering SPARQL Queries Via Subgraph Matching Presented by Guan Wang Kent State University October 24, 2011.
Gerhard Weikum Max Planck Institute for Informatics & Saarland University Semantic Search: from Names and Phrases to.
Knowledge Base Completion via Search-Based Question Answering
YAGO-NAGA Project Presented By: Mohammad Dwaikat To: Dr. Yuliya Lierler CSCI 8986 – Fall 2012.
Date : 2013/05/27 Author : Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Halevy, Hongrae Lee, Fei Wu, Reynold Xin, Gong Yu Source : SIGMOD’12 Speaker.
Lei Zou 1, Jinghui Mo 1, Lei Chen 2, M. Tamer Özsu 3, Dongyan Zhao 1 1 gStore: Answering SPARQL Queries Via Subgraph Matching 1 Peking University, 2 Hong.
Linked data: P redicting missing properties Klemen Simonic, Jan Rupnik, Primoz Skraba {klemen.simonic, jan.rupnik,
RDF Tutorial.
RDF-3X: a RISC style Engine for RDF Ref: Thomas Neumann and Gerhard Weikum [PVLDB’08 ] Presented by: Pankaj Vanwari Course: Advanced Databases (CS 632)
Gerhard Weikum Harvesting, Searching, and Ranking Knowledge from the Web Joint work with Georgiana.
Search Engines and Information Retrieval
Database and Information- Retrieval Methods for Knowledge Discovery Database and Information- Retrieval Methods for Knowledge Discovery Gerhard Weikum,
Gimme’ The Context: Context- driven Automatic Semantic Annotation with CPANKOW Philipp Cimiano et al.
Information Retrieval in Practice
Which Nobel laureate survived both world wars and all his four children? Tandem What‘s this? Question AnsweringPhoto AnnotationTimeline Analysis.
Saarbrucken / Germany ¨
Information Retrieval in Practice
CSC 9010 Spring Paula Matuszek A Brief Overview of Watson.
Ontology Development Kenneth Baclawski Northeastern University Harvard Medical School.
Search Engines and Information Retrieval Chapter 1.
SEEKING STATEMENT-SUPPORTING TOP-K WITNESSES Date: 2012/03/12 Source: Steffen Metzger (CIKM’11) Speaker: Er-gang Liu Advisor: Dr. Jia-ling Koh 1.
RuleML-2007, Orlando, Florida1 Towards Knowledge Extraction from Weblogs and Rule-based Semantic Querying Xi Bai, Jigui Sun, Haiyan Che, Jin.
Automatic Lexical Annotation Applied to the SCARLET Ontology Matcher Laura Po and Sonia Bergamaschi DII, University of Modena and Reggio Emilia, Italy.
Author: William Tunstall-Pedoe Presenter: Bahareh Sarrafzadeh CS 886 Spring 2015.
DBrev: Dreaming of a Database Revolution Gjergji Kasneci, Jurgen Van Gael, Thore Graepel Microsoft Research Cambridge, UK.
Question Answering.  Goal  Automatically answer questions submitted by humans in a natural language form  Approaches  Rely on techniques from diverse.
Ontology-Based Information Extraction: Current Approaches.
Assigning Global Relevance Scores to DBpedia Facts Philipp Langer, Patrick Schulze, Stefan George, Tobias Metzke, Ziawasch Abedjan, Gjergji Kasneci DESWeb.
Péter Schönhofen – Ad Hoc Hungarian → English – CLEF Workshop 20 Sep 2007 Performing Cross-Language Retrieval with Wikipedia Participation report for Ad.
Information Extraction Lecture 8 – Ontological and Open IE CIS, LMU München Winter Semester Dr. Alexander Fraser.
LOD for the Rest of Us Tim Finin, Anupam Joshi, Varish Mulwad and Lushan Han University of Maryland, Baltimore County 15 March 2012
Q2Semantic: A Lightweight Keyword Interface to Semantic Search Haofen Wang 1, Kang Zhang 1, Qiaoling Liu 1, Thanh Tran 2, and Yong Yu 1 1 Apex Lab, Shanghai.
Question Answering over Implicitly Structured Web Content
Natural Language Questions for the Web of Data Mohamed Yahya, Klaus Berberich, Gerhard Weikum Max Planck Institute for Informatics, Germany Shady Elbassuoni.
Natural Language Questions for the Web of Data Mohamed Yahya 1, Klaus Berberich 1, Shady Elbassuoni 2 Maya Ramanath 3, Volker Tresp 4, Gerhard Weikum 1.
Natural Language Questions for the Web of Data 1 Mohamed Yahya, Klaus Berberich, Gerhard Weikum Max Planck Institute for Informatics, Germany 2 Shady Elbassuoni.
RDF-3X : RISC-Style RDF Database Engine
Ontology Mapping in Pervasive Computing Environment C.Y. Kong, C.L. Wang, F.C.M. Lau The University of Hong Kong.
Semi-Automatic Quality Assessment of Linked Data without Requiring Ontology Saemi Jang, Megawati, Jiyeon Choi, and Mun Yong Yi KIRD, KAIST NLP&DBPEDIA.
RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08.
Summary Knowledge Bases from Web are Real, Big & Useful: Entities, Classes & Relations Key Asset for Intelligent Applications: Semantic Search, Question.
Natural Language Questions for the Web of Data Mohamed Yahya, Klaus Berberich, Gerhard Weikum Max Planck Institute for Informatics, Germany Shady Elbassuoni.
Of 33 lecture 1: introduction. of 33 the semantic web vision today’s web (1) web content – for human consumption (no structural information) people search.
RDF-3X: a RISC-style Engine for RDF Presented by Thomas Neumann, Gerhard Weikum Max-Planck-Institut fur Informatik Saarbrucken, Germany Session 19: System.
Named Entity Disambiguation on an Ontology Enriched by Wikipedia Hien Thanh Nguyen 1, Tru Hoang Cao 2 1 Ton Duc Thang University, Vietnam 2 Ho Chi Minh.
1 NAGA: Searching and Ranking Knowledge Gjergji Kasneci Joint work with: Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath, and Gerhard Weikum.
Tutorial: Knowledge Bases for Web Content Analytics
Gaby Nativ, SDBI  Motivation  Other Ontologies  System overview  YAGO Dive IN  LEILA  NAGA  Conclusion.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Information Extraction from Wikipedia: Moving Down the Long.
RDF storages and indexes Maciej Janik September 1, 2005 Enterprise Integration – Semantic Web.
GoRelations: an Intuitive Query System for DBPedia Lushan Han and Tim Finin 15 November 2011
Einat Minkov University of Haifa, Israel CL course, U
Database and Information - Retrieval Methods for Knowledge Discovery
Information Extraction from Wikipedia: Moving Down the Long Tail
Ontology Evolution: A Methodological Overview
Web IR: Recent Trends; Future of Web Search
Wikitology Wikipedia as an Ontology
Extracting Semantic Concept Relations
DBpedia 2014 Liang Zheng 9.22.
ProBase: common Sense Concept KB and Short Text Understanding
Chaitali Gupta, Madhusudhan Govindaraju
Aiming at prize for brilliant idea the world is not ready for.
Topic: Semantic Text Mining
Introduction to Search Engines
Presentation transcript:

joint work with Shady Elbassuoni, Georgiana Ifrim, Gjergji Kasneci, Thomas Neumann, Maya Ramanath, Mauro Sozio, Fabian Suchanek

My Vision Opportunity: Turn the Web (and Web 2.0 and Web 3.0 ...) into the world‘s most comprehensive knowledge base Approach: 1) harvest and combine hand-crafted knowledge sources (Semantic Web, ontologies) automatic knowledge extraction (Statistical Web, text mining) social communities and human computing (Social Web, Web 2.0) 2) express knowledge queries, search, and rank 3) everything efficient and scalable

Why Google and Wikipedia Are Not Enough Answer „knowledge queries“ (by scientists, journalists, analysts, etc.) such as: drugs or enzymes that inhibit proteases (HIV) connections between Thomas Mann and Goethe German Nobel prize winner who survived both world wars and outlived all of his four children how are Max Planck, Angela Merkel, Jim Gray, and the Dalai Lama related politicians who are also scientists

Why Google and Wikipedia Are Not Enough Answer „knowledge queries“ (by scientists, journalists, analysts, etc.) such as: What is lacking? Information is not Knowledge. Knowledge is not Wisdom. Wisdom is not Truth Truth is not Beauty. Beauty is not Music. Music is the best. (Frank Zappa 1940 – 1993) extract facts from Web pages capture user intention by concepts, entities, relations drugs or enzymes that inhibit proteases (HIV) connections between Thomas Mann and Goethe German Nobel prize winner who survived both world wars and outlived all of his four children how are Max Planck, Angela Merkel, Jim Gray, and the Dalai Lama related politicians who are also scientists

Related Work information Web extraction & entity ontology search Kylin KOG Cimple DBlife Libra START TextRunner Answers Avatar information extraction & ontology building Web entity search & QA UIMA Hakia Powerset Freebase EntityRank Cyc ExpertFinder DBpedia semistructured IR & graph search TopX XQ-FT Yago Naga Tijah SPARQL DBexplorer Banks SWSE

Relevant Projects KnowItAll / TextRunner (UW Seattle) IntelligenceInWikipedia (UW Seattle) DBpedia (U Leipzig & FU Berlin) SeerSuite (PennState) Cimple / DBlife (U Wisconsin & Yahoo) Avatar / System T (IBM Almaden) Libra (MS Research Beijing) SQoUT (Columbia U) Wikipedia Entities (Yahoo Barcelona) Expert Finding (U Amsterdam) Expertise Finding (U Twente) ... and more and G, Y, MS for products, locations, ... Selected overviews in: ACM SIGMOD Record 37(4), Dec 2008

Outline  Motivation • Information Extraction & Knowledge Harvesting (YAGO) • Consistent Growth of Knowledge (SOFIE) • Ranking for Search over Entity-Relation Graphs (NAGA) • Efficient Query Processing (RDF-3X) • Conclusion

Information Extraction (IE): Text to Records Max Planck 4/23, 1858 Kiel Albert Einstein 3/14, 1879 Ulm Mahatma Gandhi 10/2, 1869 Porbandar Person BirthDate BirthPlace ... Person ScientificResult Max Planck Quantum Theory extracted facts often have confidence < 1 (DB with uncertainty) Planck‘s constant 6.2261023 Js Constant Value Dimension sometimes: confidence << 1 high computational costs Person Collaborator Max Planck Albert Einstein Max Planck Niels Bohr Person Organization Max Planck KWG / MPG combine NLP, pattern matching, lexicons, statistical learning

High-Quality Knowledge Sources General-purpose ontologies and thesauri: WordNet family 200 000 concepts and relations; can be cast into description logics or graph, with weights for relation strengths (derived from co-occurrence statistics) scientist, man of science (a person with advanced knowledge) => cosmographer, cosmographist => biologist, life scientist => chemist => cognitive scientist => computer scientist ... => principal investigator, PI … HAS INSTANCE => Bacon, Roger Bacon can use onto service for various purposes ... heuristically map Web queries onto search interfaces of Deep Web portals, for example, query forms for sheetmusic like this; as a Web Service it looks like this, but key problem is what to do with query flute if interface has no general-purpose keyword field but rather has specific parameters like instrument, category, style, etc.

Exploit Hand-Crafted Knowledge Wikipedia and other lexical sources can use onto service for various purposes ... heuristically map Web queries onto search interfaces of Deep Web portals, for example, query forms for sheetmusic like this; as a Web Service it looks like this, but key problem is what to do with query flute if interface has no general-purpose keyword field but rather has specific parameters like instrument, category, style, etc.

Exploit Hand-Crafted Knowledge Wikipedia and other lexical sources {{Infobox_Scientist | name = Max Planck | birth_date = [[April 23]], [[1858]] | birth_place = [[Kiel]], [[Germany]] | death_date = [[October 4]], [[1947]] | death_place = [[Göttingen]], [[Germany]] | residence = [[Germany]] | nationality = [[Germany|German]] | field = [[Physicist]] | work_institution = [[University of Kiel]]</br> [[Humboldt-Universität zu Berlin]]</br> [[Georg-August-Universität Göttingen]] | alma_mater = [[Ludwig-Maximilians-Universität München]] | doctoral_advisor = [[Philipp von Jolly]] | doctoral_students = [[Gustav Ludwig Hertz]]</br> … | known_for = [[Planck's constant]], [[Quantum mechanics|quantum theory]] | prizes = [[Nobel Prize in Physics]] (1918) DB inside can use onto service for various purposes ... heuristically map Web queries onto search interfaces of Deep Web portals, for example, query forms for sheetmusic like this; as a Web Service it looks like this, but key problem is what to do with query flute if interface has no general-purpose keyword field but rather has specific parameters like instrument, category, style, etc.

Exploit Hand-Crafted Knowledge Wikipedia, WordNet, and other lexical sources can use onto service for various purposes ... heuristically map Web queries onto search interfaces of Deep Web portals, for example, query forms for sheetmusic like this; as a Web Service it looks like this, but key problem is what to do with query flute if interface has no general-purpose keyword field but rather has specific parameters like instrument, category, style, etc.

YAGO: Yet Another Great Ontology [F. Suchanek et al.: WWW‘07] Turn Wikipedia into formal knowledge base (semantic DB); keep source pages as witnesses Exploit hand-crafted categories and infoboxes Represent facts as knowledge triples: relation (entity1, entity2) (in FOL, compatible with RDF, OWL-lite, XML, etc.) Map relations into WordNet concept DAG relation entity1 entity2 can use onto service for various purposes ... heuristically map Web queries onto search interfaces of Deep Web portals, for example, query forms for sheetmusic like this; as a Web Service it looks like this, but key problem is what to do with query flute if interface has no general-purpose keyword field but rather has specific parameters like instrument, category, style, etc. Examples: bornIn isInstanceOf Max_Planck Kiel Kiel City

Difficulties in Wikipedia Harvesting instanceOf relation: misleading and difficult category names „disputed articles“, „particle physics“, „American Music of the 20th Century“, „naturalized citizens of the United States“, … subclass relation: mapping categories onto WordNet classes: „Nobel laureates in physics“  Nobel_laureates, „people from Kiel“  person entity name synonyms & ambiguities: „St. Petersburg“, „Saint Petersburg“, „M31“, „NGC224“  means ... can use onto service for various purposes ... heuristically map Web queries onto search interfaces of Deep Web portals, for example, query forms for sheetmusic like this; as a Web Service it looks like this, but key problem is what to do with query flute if interface has no general-purpose keyword field but rather has specific parameters like instrument, category, style, etc. type (consistency) checking for rejecting false candidates: AlmaMater (Max Planck, Kiel)  Person  University DB inside

YAGO Knowledge Base [F. Suchanek et al.: WWW 2007] Entity subclass subclass Person concepts Location subclass Scientist subclass subclass subclass subclass City Country Biologist Physicist instanceOf instanceOf Nobel Prize Erwin_Planck bornIn Kiel hasWon FatherOf individual entities can use onto service for various purposes ... heuristically map Web queries onto search interfaces of Deep Web portals, for example, query forms for sheetmusic like this; as a Web Service it looks like this, but key problem is what to do with query flute if interface has no general-purpose keyword field but rather has specific parameters like instrument, category, style, etc. diedOn bornOn October 4, 1947 Max_Planck April 23, 1858 means means means words, phrases “Max Planck” “Max Karl Ernst Ludwig Planck” “Dr. Planck” Online access and download at http://www.mpi-inf.mpg.de/yago/

YAGO Knowledge Base [F. Suchanek et al.: WWW 2007] Entities Facts KnowItAll 30 000 SUMO 20 000 60 000 WordNet 120 000 80 000 Cyc 300 000 5 Mio. TextRunner n/a 8 Mio. YAGO 1.9 Mio. 19 Mio. DBpedia 1.9 Mio. 103 Mio. Freebase ??? 156 Mio. RDF triples ( entity1-relation-entity2, subject-predicate-object ) Entity subclass subclass Person concepts Location subclass Scientist subclass subclass subclass subclass City Country Biologist Physicist Accuracy  95% IWP YAGO instanceOf instanceOf Nobel Prize Erwin_Planck bornIn Kiel hasWon FatherOf individual entities can use onto service for various purposes ... heuristically map Web queries onto search interfaces of Deep Web portals, for example, query forms for sheetmusic like this; as a Web Service it looks like this, but key problem is what to do with query flute if interface has no general-purpose keyword field but rather has specific parameters like instrument, category, style, etc. diedOn bornOn October 4, 1947 Max_Planck April 23, 1858 means means means words, phrases “Max Planck” “Max Karl Ernst Ludwig Planck” “Dr. Planck” Online access and download at http://www.mpi-inf.mpg.de/yago/

Long Tail of Wikipedia (Intelligence-in-Wikipedia Project) [Wu / Weld: WWW 2008] YAGO & DBpedia mappings of entities onto classes are valuable assets Learning infobox attributes  sparse & noisy training data Computer Scientist Scientist Musician University Organization Physicist Artist can use onto service for various purposes ... heuristically map Web queries onto search interfaces of Deep Web portals, for example, query forms for sheetmusic like this; as a Web Service it looks like this, but key problem is what to do with query flute if interface has no general-purpose keyword field but rather has specific parameters like instrument, category, style, etc.

Outline  Motivation  Information Extraction & Knowledge Harvesting (YAGO) • Consistent Growth of Knowledge (SOFIE) • Ranking for Search over Entity-Relation Graphs (NAGA) • Efficient Query Processing (RDF-3X) • Conclusion

Maintaining and Growing YAGO Web sources + Word Net Wikipedia YAGO Gatherer Hypotheses YAGO Core Extractors YAGO Gatherer Scrutinizer YAGO Core Checker can use onto service for various purposes ... heuristically map Web queries onto search interfaces of Deep Web portals, for example, query forms for sheetmusic like this; as a Web Service it looks like this, but key problem is what to do with query flute if interface has no general-purpose keyword field but rather has specific parameters like instrument, category, style, etc. YAGO Core YAGO G r o w i n g knows  all entities focus on facts

SOFIE: Self-Organizing Framework for IE [F. Suchanek et al.: WWW 2009] Reconcile textual/linguistic pattern-based IE with statistics seeds  patterns  facts  patterns  ... declarative rule-based IE with constraints functional dependencies: hasCapital is a function inclusion dependencies: isCapitalOf  isCityOf DB inside can use onto service for various purposes ... heuristically map Web queries onto search interfaces of Deep Web portals, for example, query forms for sheetmusic like this; as a Web Service it looks like this, but key problem is what to do with query flute if interface has no general-purpose keyword field but rather has specific parameters like instrument, category, style, etc.

From Facts to Patterns to Hypotheses occurs (X and her husband Y, Angela Merkel, JoachimSauer) [4] MelindaGates, BillGates) [2] CarlaBruni, NIcolasSarkozy) [3] occurs (X loves Y, LarryPage, Google) [5] occurs (X married to Y, Facts Spouse (HillaryClinton, BillClinton) Spouse (MelindaGates, BillGates) Spouse (AngelaMerkel, JoachimSauer) Hypotheses Spouse (LarryPage, Google) Spouse (CarlaBruni, NicolasSarkozy) Spouse (AngelaMerkel, UlrichMerkel) expresses (and her husband, Spouse) expresses (married to, Spouse) expresses (loves, Spouse) can use onto service for various purposes ... heuristically map Web queries onto search interfaces of Deep Web portals, for example, query forms for sheetmusic like this; as a Web Service it looks like this, but key problem is what to do with query flute if interface has no general-purpose keyword field but rather has specific parameters like instrument, category, style, etc.

Adding Consistency Constraints occurs (X and her husband Y, Angela Merkel, JoachimSauer) [4] Facts Spouse (HillaryClinton, BillClinton) Patterns Spouse (MelindaGates, BillGates) occurs (X and her husband Y, MelindaGates, BillGates) [2] Spouse (AngelaMerkel, JoachimSauer) occurs (X and her husband Y, CarlaBruni, NIcolasSarkozy) [3] occurs (X married to Y, MelindaGates, BillGates) [2] occurs (X loves Y, LarryPage, Google) [5] Hypotheses Spouse (CarlaBruni, NicolasSarkozy) expresses (and her husband, Spouse) Spouse (LarryPage, Google) expresses (married to, Spouse) Spouse (AngelaMerkel, UlrichMerkel) expresses (loves, Spouse) can use onto service for various purposes ... heuristically map Web queries onto search interfaces of Deep Web portals, for example, query forms for sheetmusic like this; as a Web Service it looks like this, but key problem is what to do with query flute if interface has no general-purpose keyword field but rather has specific parameters like instrument, category, style, etc. occur(P, X, Y)  expresses(P, Spouse)  Spouse(X,Y) Constraints DB inside occur(P, X, Y)  Spouse(X,Y)  expresses (P, Spouse) Spouse(X, Y)  YZ  Spouse(X,Z) Spouse(X, Y)  Type(X,Person)  Type(Y,Person)

Representation by Clauses occurs (X and her husband Y, Angela Merkel, JoachimSauer) [4] Facts Spouse (HillaryClinton, BillClinton) Patterns Spouse (MelindaGates, BillGates) Clauses connect facts, patterns, hypotheses, constraints Treat hypotheses as variables, facts as constants: (1  A  1), (1  A  B), (1  C), (D  E), (D  F), ... Clauses can be weighted by pattern statistics Solve weighted Max-Sat problem: assign truth values to variables s.t. total weight of satisfied clauses is max! occurs (X and her husband Y, MelindaGates, BillGates) [2] Spouse (AngelaMerkel, JoachimSauer) occurs (X and her husband Y, CarlaBruni, NIcolasSarkozy) [3] occurs (X married to Y, MelindaGates, BillGates) [2] occurs (X loves Y, LarryPage, Google) [5] Hypotheses Spouse (CarlaBruni, NicolasSarkozy) expresses (and her husband, Spouse) Spouse (LarryPage, Google) expresses (married to, Spouse) Spouse (AngelaMerkel, UlrichMerkel) expresses (loves, Spouse) occur (and her husband, AngelaMerkel, JoachimSauer)  expresses(and her husband, Spouse)  Spouse(AngelaMerkel, JoachimSauer) Clauses occur (and her husband, CarlaBruni, NicolasSarkozy)  expresses(and her husband, Spouse)  Spouse(CarlaBruni, NicolasSarkozy) Spouse(AngelaMerkel, JoachimSauer)  Spouse(AngelaMerkel, UlrichMerkel) Spouse(LarryPage, Google)  Type(LarryPage,Person)  Type(Google,Person) ...

SOFIE: Consistent Growth of YAGO [F. Suchanek et al.: WWW 2009] self-organizing framework for scrutinizing hypotheses about new facts, enabling automated growth of the knowledge base unifies pattern-based IE, consistency checking and entity disambiguation DB inside Experimental evidence: input: biographies of 400 US senators, 3500 HTML files output: birth/death date&place, politicianOf (state) run-time: 7 h parsing, 6 h hypotheses, 2 h weighted Max-Sat precision: 90-95 %, except for death place discovered patterns: politicianOf: X was a * of Y, X represented Y, ... deathDate: X died on Y, X was assassinated on Y, ... deathPlace: X was born in Y can use onto service for various purposes ... heuristically map Web queries onto search interfaces of Deep Web portals, for example, query forms for sheetmusic like this; as a Web Service it looks like this, but key problem is what to do with query flute if interface has no general-purpose keyword field but rather has specific parameters like instrument, category, style, etc.

Open Issues Temporal Knowledge: temporal validity of all facts (spouses, CEO‘s, etc.) Total Knowledge: all possible relations („Open IE“), but in canonical form worksFor, employedAt, isEmployeeOf, ...  affiliation Multimodal Knowledge: photos, videos, sound, sheetmusic of entities (people, landmarks, etc.) and facts (marriages, soccer matches, etc.) Scalable Knowledge Gathering: high-quality IE at the rate at which news, blogs, Wikipedia updates are produced ! can use onto service for various purposes ... heuristically map Web queries onto search interfaces of Deep Web portals, for example, query forms for sheetmusic like this; as a Web Service it looks like this, but key problem is what to do with query flute if interface has no general-purpose keyword field but rather has specific parameters like instrument, category, style, etc.

Scalability: Benchmark Proposal for all people in Wikipedia (100,000‘s) gather all spouses, incl. divorced & widowed, and corresponding time periods! >95% accuracy, >95% coverage, in one night redundancy of sources helps, stresses scalability even more consistency constraints are potentially helpful: functional dependencies: {husband, time}  wife inclusion dependencies: marriedPerson  adultPerson age/time/gender restrictions: birthdate +  < marriage < divorce DB inside

Outline    Motivation Information Extraction & Knowledge Harvesting (YAGO)  Consistent Growth of Knowledge (SOFIE) • Ranking for Search over Entity-Relation Graphs (NAGA) • Efficient Query Processing (RDF-3X) • Conclusion

NAGA: Graph Search with Ranking [G. Kasneci et al NAGA: Graph Search with Ranking [G. Kasneci et al.: ICDE 2008, ICDE 2009] Graph-based search on knowledge bases with built-in ranking based on confidence and informativeness Simple query Complex query (with regular expr.) Sep 2, 1945 Jul 28, 1914 Germany (bornIn | livesIn | citizenOf) .locatedIn* > < Politician ?b isa bornOn can use onto service for various purposes ... heuristically map Web queries onto search interfaces of Deep Web portals, for example, query forms for sheetmusic like this; as a Web Service it looks like this, but key problem is what to do with query flute if interface has no general-purpose keyword field but rather has specific parameters like instrument, category, style, etc. ?x has Won Nobel Prize diedOn ?d ?x isa > fatherOf Scientist diedOn ?c ?y ?x hasWon NobelPrize . ?x fatherOf ?y ?x bornOn ?b . FILTER (?b < Jul-28-1914) ?x diedOn ?d . ... ?x isa Politician . ?x isa Scientist

Statistical Language Models (LM‘s) for Entity Ranking [work by U Amsterdam, MSR Beijing, U Twente, Yahoo Barcelona, ...] LM (entity e) = prob. distr. of words seen in context of e query q: „Dutch soccer player Barca“ Dutch goalgetter soccer champion Dutch player Ajax Amsterdam trainer Barca 8 years Camp Nou played soccer FC Barcelona Jordi Cruyff son candidate entities: e1: Johan Cruyff e2: Ruud van Nistelroy e3: Ronaldinho e4: Zinedine Zidane e5: FC Barcelona weighted by extraction accuracy can use onto service for various purposes ... heuristically map Web queries onto search interfaces of Deep Web portals, for example, query forms for sheetmusic like this; as a Web Service it looks like this, but key problem is what to do with query flute if interface has no general-purpose keyword field but rather has specific parameters like instrument, category, style, etc. Zizou champions league 2002 Real Madrid van Nistelroy Dutch soccer world cup best player 2005 lost against Barca

LM for Fact (Entity-Relation) Ranking query q fact pool for candidate answers q1: ?x hasWon NobelPrize f1: Einstein hasWon NobelPrize f2: Gruenberg hasWon NobelPrize f3: Gruenberg hasWon JapanPrize f4: Vickrey hasWon NobelPrize f5: Cerf hasWon TuringAward f6: Einstein bornIn Germany f7: Gruenberg bornIn Germany f8: Goethe bornIn Germany f9: Schiffer bornIn Germany f10: Vickrey bornIn Canada f11: Cerf bornIn USA 200 50 20 100 150 10 q2: ?x bornIn Germany instantiation (user interests) plus smoothing LM(q1): Einstein hasWon NP Gruenberg hasWon NP Vickrey hasWon NP 200/300 50/300 can use onto service for various purposes ... heuristically map Web queries onto search interfaces of Deep Web portals, for example, query forms for sheetmusic like this; as a Web Service it looks like this, but key problem is what to do with query flute if interface has no general-purpose keyword field but rather has specific parameters like instrument, category, style, etc. LM(q2): Einstein bornIn G Gruenberg bornIn G Goethe bornIn G Schiffer bornIn G 100/470 20/470 200/470 150/470 witnesses may be weighted by confidence

NAGA Example Query: ?x isa politician ?x isa scientist Results: Benjamin Franklin Paul Wolfowitz Angela Merkel … Benjamin Franklin: Conservation of charge, lightning rod, bifocal glasses, repeal stamp act, active in american revolution, drafted declaration of independence Paul Wolfowitz: WB president until recently, Math and Chem. BE, political science Angela Merkel: Physical chemistry doctorate Alan Greenspan: Economics degree, Chairman of the Board of Federal Reserve

Outline     Motivation Information Extraction & Knowledge Harvesting (YAGO)  Consistent Growth of Knowledge (SOFIE)  Ranking for Search over Entity-Relation Graphs (NAGA) • Efficient Query Processing (RDF-3X) • Conclusion

Scalable Semantic Web: Pattern Queries on Large RDF Graphs schema-free RDF triples: subject-property-object (SPO) example: Einstein hasWon NobelPrize SPARQL triple patterns: Select ?p,?c Where { ?p isa scientist . ?p hasWon NobelPrize . ?p bornIn ?t . ?t inCountry ?c . ?c partOf Europe} large join queries, unpredictable workload, difficult physical design, difficult query optimization AllTriples S hasWon bornIn .,. Person Einstein Nobel Ulm .,. Ronaldo FIFA Rio .,. .,. .,. .,. .,. S O Einstein Nobel Ronaldo FIFA … … hasWon Einstein hasWon Nobel Einstein bornIn Ulm Ronaldo hasWon FIFA Spain partOf Europe France partOf Europe … … .,. S P O S O … … bornIn Country S partOf capital .,. .,. .,. .,. .,. Semantic-Web engines (Sesame, Jena, etc.) did not provide scalable query performance

Scalable Semantic Web: RDF-3X Engine [T. Neumann et al.: VLDB’08] RISC-style, tuning-free system architecture map literals into ids (dictionary) and precompute exhaustive indexing for SPO triples: SPO, SOP, PSO, POS, OSP, OPS, SP*, PS*, SO*, OS*, PO*, OP*, S*, P*, O* very high compression efficient merge joins with order-preservation join-order optimization by dynamic programming over subplan  result-order statistical synopses for accurate result-size estimation IR inside DB inside

Performance Experiments Librarything social-tagging excerpt (36 Mio. triples) Benchmark queries such as: Select ?t Where { ?b hasTitle ?t . ?u romance ?b . ?u love ?b . ?u mystery ?b . ?u suspense ?b . ?u crimeNovel ?c . ?u hasFriend ?f . ?f ... } execution time [s] books tagged with romance, love, mystery, suspense by users who like crime novels and have friends who ... RDF-3X on PC (2 GHz, 2 GB RAM, 30 MB/s disk) compared to: column-store (for property tables) using MonetDB triples store (with selected indexes) using PostgreSQL similar results on YAGO, Uniprot (845 Mio. triples) and Billion-Triples

Outline      Motivation Information Extraction & Knowledge Harvesting (YAGO)  Consistent Growth of Knowledge (SOFIE)  Ranking for Search over Entity-Relation Graphs (NAGA)  Efficient Query Processing (RDF-3X) • Conclusion

inside DB Take-Home Message Information is not Knowledge. Knowledge is not Wisdom. Wisdom is not Truth Truth is not Beauty. Beauty is not Music. Music is the best. (Frank Zappa, 1940 – 1993) turn Wikipedia, Web, news, literature, ... into comprehensive knowledge base of facts  YAGO core reconcile rule-based & pattern-based info extraction (Semantic-Web & Statistical-Web) with consistency constraints  YAGO growth with SOFIE enable search & ranking over entity-relation graphs  NAGA, RDF-3X DB inside

Technical Challenges ... and more Handling Time extracting temporal attributes reasoning on validity times of facts life-cycle management of KB Scalable Performance high-quality dynamic IE at the rate of news/blogs/Wikipedia updates „Marital Knowledge“ benchmark Query Language and Ranking querying expressive but simple (Sparql-FT ?) LM-based ranking vs. PR/HITS-style vs. learned scoring from user behavior efficient top-k queries on ER graphs ... and more

Thank You !