Harvesting, Searching, and Ranking Knowledge from the Web
Gerhard Weikum
weikum@mpi-inf.mpg.de
http://www.mpi-inf.mpg.de/~weikum/
Joint work with Georgiana Ifrim, Gjergji Kasneci, Thomas Neumann, Maya Ramanath, Fabian Suchanek
2/38 Vision
Opportunity: Turn the Web (and Web 2.0 and Web 3.0 ...) into the world's most comprehensive knowledge base
Approach:
1) harvest and combine
   a) hand-crafted knowledge sources (Semantic Web, ontologies)
   b) automatic knowledge extraction (Statistical Web, text mining)
   c) social communities and human computing (Social Web, Web 2.0)
2) express knowledge queries, search, and rank
3) everything efficient and scalable
3/38 Why Google and Wikipedia Are Not Enough
Answer "knowledge queries" such as:
German universities with world-class computer scientists
German Nobel prize winner who survived both world wars and all of his four children
proteins that inhibit proteases and other human enzymes
connection between Thomas Mann and Goethe
politicians who are also scientists
4/38 Why Google and Wikipedia Are Not Enough
Which politicians are also scientists?
What is lacking?
Information is not Knowledge. Knowledge is not Wisdom. Wisdom is not Truth. Truth is not Beauty. Beauty is not Music. Music is the best. (Frank Zappa)
extract facts from Web pages
capture user intention by concepts, entities, relations
5/38 NAGA Example Query: $x isa politician $x isa scientist Results: Benjamin Franklin Paul Wolfowitz Angela Merkel …
6/38 Related Work semistructured IR & graph search Banks TextRunner DBexplorer Cyc Freebase Cimple DBlife UIMA DBpedia Yago Naga XQ-FT Libra SPARQL Avatar EntityRank Powerset START Web entity search & QA information extraction & ontology building TopX Answers SWSE Hakia Tijah
7/38 Outline
Motivation
Information Extraction & Knowledge Harvesting (YAGO)
Ranking for Search over Entity-Relation Graphs (NAGA)
Efficient Query Processing (RDF-3X)
Conclusion
8/38 Information Extraction (IE): Text to Records

Person           BirthDate    BirthPlace ...
Max Planck       4/23, 1858   Kiel
Albert Einstein  3/14, 1879   Ulm
Mahatma Gandhi   10/2, 1869   Porbandar

Person       ScientificResult
Max Planck   Quantum Theory

Person       Collaborator
Max Planck   Albert Einstein
Max Planck   Niels Bohr

Constant            Value           Dimension
Planck's constant   6.626 · 10^-34  Js

combine NLP, pattern matching, lexicons, statistical learning
extracted facts often have confidence < 1: DB with uncertainty (probabilistic DB)
expensive and error-prone
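The combination of pattern matching and per-fact confidence sketched on this slide can be illustrated with a toy extractor; the patterns, relation names, and confidence values below are invented for illustration, not an actual IE system:

```python
import re

# Hypothetical minimal pattern-based fact extractor: one hand-written regex
# per relation, each match yielding a (relation, arg1, arg2, confidence) tuple.
# Confidence < 1 reflects that textual patterns are unreliable.
PATTERNS = [
    # "X was born in Y" -> bornIn(X, Y)
    (re.compile(r"(?P<s>[A-Z][\w. ]+?) was born in (?P<o>[A-Z]\w+)"), "bornIn", 0.8),
    # "X collaborated with Y" -> collaborator(X, Y)
    (re.compile(r"(?P<s>[A-Z][\w. ]+?) collaborated with (?P<o>[A-Z][\w. ]+)"), "collaborator", 0.6),
]

def extract_facts(sentence):
    facts = []
    for regex, relation, confidence in PATTERNS:
        for m in regex.finditer(sentence):
            facts.append((relation, m.group("s").strip(), m.group("o").strip(), confidence))
    return facts

facts = extract_facts("Max Planck was born in Kiel.")
```

Real extractors combine such patterns with NLP parsing, lexicons, and statistical learning, as the slide notes; the confidences then feed a probabilistic DB.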
9/38 High-Quality Knowledge Sources
General-purpose ontologies and thesauri: WordNet family
scientist, man of science (a person with advanced knowledge)
  => cosmographer, cosmographist
  => biologist, life scientist
  => chemist
  => cognitive scientist
  => computer scientist ...
  => principal investigator, PI ...
  HAS INSTANCE => Bacon, Roger Bacon ...
200 000 concepts and relations; can be cast into description logics or a graph, with weights for relation strengths (derived from co-occurrence statistics)
10/38 Exploit Hand-Crafted Knowledge
Wikipedia, WordNet, and other lexical sources
{{Infobox_Scientist
| name = Max Planck
| birth_date = [[April 23]], [[1858]]
| birth_place = [[Kiel]], [[Germany]]
| death_date = [[October 4]], [[1947]]
| death_place = [[Göttingen]], [[Germany]]
| residence = [[Germany]]
| nationality = [[Germany|German]]
| field = [[Physicist]]
| work_institution = [[University of Kiel]] [[Humboldt-Universität zu Berlin]] [[Georg-August-Universität Göttingen]]
| alma_mater = [[Ludwig-Maximilians-Universität München]]
| doctoral_advisor = [[Philipp von Jolly]]
| doctoral_students = [[Gustav Ludwig Hertz]] …
| known_for = [[Planck's constant]], [[Quantum mechanics|quantum theory]]
| prizes = [[Nobel Prize in Physics]] (1918) …
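A minimal sketch of harvesting such an infobox into triples, assuming a hand-made field-to-relation mapping; the mapping, helper names, and link handling below are illustrative, not YAGO's actual code:

```python
import re

# Hypothetical field-to-relation mapping; real systems maintain one per
# infobox template.
FIELD_TO_RELATION = {
    "birth_place": "bornIn",
    "doctoral_advisor": "hasAdvisor",
}

# [[Target]] or [[Target|label]] -> Target
LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

def parse_infobox(wikitext, entity):
    triples = []
    for line in wikitext.splitlines():
        if "=" not in line or not line.lstrip().startswith("|"):
            continue
        field, value = line.lstrip(" |").split("=", 1)
        relation = FIELD_TO_RELATION.get(field.strip())
        if relation is None:
            continue  # unmapped field, ignore
        for target in LINK.findall(value):
            triples.append((entity, relation, target.replace(" ", "_")))
    return triples

infobox = """{{Infobox_Scientist
| name = Max Planck
| birth_place = [[Kiel]], [[Germany]]
| doctoral_advisor = [[Philipp von Jolly]]
}}"""
triples = parse_infobox(infobox, "Max_Planck")
```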
11/38 Exploit Hand-Crafted Knowledge Wikipedia, WordNet, and other lexical sources
12/38 YAGO: Yet Another Great Ontology [F. Suchanek, G. Kasneci, G. Weikum: WWW'07]
Turn Wikipedia into an explicit knowledge base (semantic DB); keep source pages as witnesses
Exploit hand-crafted categories and infobox templates
Represent facts as explicit knowledge triples: relation (entity1, entity2) (in FOL, compatible with RDF, OWL-lite, XML, etc.)
Map (and disambiguate) relations into the WordNet concept DAG
Examples:
  relation      entity1      entity2
  bornIn        Max_Planck   Kiel
  isInstanceOf  Kiel         City
13/38 YAGO Knowledge Base [F. Suchanek et al.: WWW'07]
(figure: the individual Max_Planck, with means-edges from the words "Max Planck", "Dr. Planck", "Max Karl Ernst Ludwig Planck"; facts bornOn April 23, 1858; diedOn October 4, 1947; bornIn Kiel; hasWon Nobel Prize; FatherOf Erwin_Planck; instanceOf Physicist, subclass of Scientist, Person, ..., Entity; Kiel instanceOf City, subclass of Location; layers: words, individuals, concepts)
Accuracy 95%

             Entities    Facts
KnowItAll    30 000
SUMO         20 000      60 000
WordNet      120 000     80 000
Cyc          300 000     5 Mio.
TextRunner   n/a         8 Mio.
YAGO         1.7 Mio.    15 Mio.
DBpedia      1.9 Mio.    103 Mio.
Freebase     ???         ???

Online access and download at http://www.mpi-inf.mpg.de/~suchanek/yago/
14/38 Wikipedia Harvesting: Difficulties & Solutions
instanceOf relation: misleading and difficult category names ("disputed articles", "particle physics", "American Music of the 20th Century", "Nobel laureates in physics", "naturalized citizens of the United States", ...)
  noun group parser: ignore when the head word is in singular
isA relation: mapping categories onto WordNet classes: "Nobel laureates in physics" -> Nobel_laureates, "people from Kiel" -> person
  map to (singular of) head word; exploit synsets and statistics
Entity name ambiguities: "St. Petersburg", "Saint Petersburg", "M31", "NGC224" -> means ...
  exploit Wikipedia redirects & disambiguations, WN synsets
type checking for scrutinizing candidates: accept a fact candidate only if its arguments have the proper classes
  marriedTo (Max Planck, quantum physics) violates Person × Person
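The head-word heuristic and the type check above can be sketched as follows; the mini-lexicon and the relation signature are toy stand-ins for WordNet lookups and YAGO's relation declarations:

```python
# Hypothetical sketch of two harvesting heuristics: map a category name onto
# a class via its (singularized) head word, and accept a fact candidate only
# if its arguments carry the types the relation expects.
KNOWN_CLASSES = {"laureate": "person", "citizen": "person", "article": "other"}

def head_word_class(category):
    # Take the last word before a preposition as the head
    # ("Nobel laureates in physics" -> "laureates"), crudely singularize it,
    # and look it up; None means: derive no isA edge from this category.
    head = None
    for word in category.lower().split():
        if word in ("in", "of", "from"):
            break
        head = word
    singular = head[:-1] if head and head.endswith("s") else head
    return KNOWN_CLASSES.get(singular)

def type_check(fact, types, signatures):
    relation, arg1, arg2 = fact
    domain, range_ = signatures[relation]
    return types.get(arg1) == domain and types.get(arg2) == range_

# The slide's example: marriedTo requires Person x Person, so this candidate
# is rejected.
ok = type_check(("marriedTo", "Max_Planck", "quantum_physics"),
                {"Max_Planck": "person", "quantum_physics": "field"},
                {"marriedTo": ("person", "person")})
```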
15/38 Higher-Order Facts in YAGO
Berlin capitalOf Germany: validIn 1990-2008
Bonn capitalOf Germany: validIn 1949-1989
Arnold Schwarzenegger instanceOf Actor: validIn 1987-2008; instanceOf Politician: validIn 2003-2008
facts about facts: represented by reification as first-order facts
e314159: (Berlin, capitalOf, Germany); (e314159, validIn, 1990-2008)
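The reification scheme above can be illustrated with a toy triple store; the fact id e314159 is from the slide, while the store and helper names are invented:

```python
# Sketch of reification: every base fact gets an identifier, so facts about
# facts (here: a validity interval) remain ordinary first-order triples.
facts = {}

def add_fact(fact_id, subj, relation, obj):
    facts[fact_id] = (subj, relation, obj)
    return fact_id

base = add_fact("e314159", "Berlin", "capitalOf", "Germany")
add_fact("f1", base, "validIn", "1990-2008")

def valid_in(fact_id):
    # Collect the validity intervals attached to a (reified) fact.
    return [o for (s, r, o) in facts.values() if s == fact_id and r == "validIn"]
```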
16/38 Ongoing Work: YAGO for Easier IE
YAGO knows (almost) all (interesting) entities
leverage this for discovering & extracting new facts in NL texts:
can filter out many uninteresting sentences
can quickly identify relation arguments
can eliminate many fact candidates by type checking
can focus on specific properties like time
IE with a dependency parser is expensive!
(figure: dependency-parsed example sentences "Cologne lies on the banks of the Rhine", "People in Cairo like wine from the Rhine valley", "The city of Paris was founded on an island in the Seine in 300 BC", annotated with known facts such as Seine runsThrough Paris, Paris isa city, France locatedIn Europe)
17/38 Outline
Motivation
Information Extraction & Knowledge Harvesting (YAGO)
Ranking for Search over Entity-Relation Graphs (NAGA)
Efficient Query Processing (RDF-3X)
Conclusion
18/38 NAGA: Graph Search [G. Kasneci et al.: ICDE'08]
Graph-based search on YAGO-style knowledge bases, with built-in ranking based on confidence and informativeness
complex queries (with regular expressions), e.g.: $x isa scientist; $x wonPrize ...; $x (worksAt | graduatedFrom) $u; $u isa university; Germany locatedIn*; $p inField computer science
discovery queries: $x isa scientist; $x isa politician
connectedness queries: Thomas Mann -*- Goethe; German novelist
queries over reified facts: ($c capitalOf Germany) validIn 1988; $c isa city
19/38 Search Results Without Ranking
q: Fisher isa scientist; Fisher isa $x
Sample result graph:
"Fisher" —familyNameOf—> Ronald_Fisher
Ronald_Fisher —type—> Alumni_of_Gonville_and_Caius_College,_Cambridge
Alumni_of_Gonville_and_Caius_College,_Cambridge —subClassOf—> alumnus_109165182
Ronald_Fisher —type—> 20th_century_mathematicians
mathematician_109635652 —subClassOf—> scientist_109871938
"scientist" —means—> scientist_109871938
Bindings:
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = alumnus_109165182
$@Fisher = Irving_Fisher, $@scientist = scientist_109871938, $X = social_scientist_109927304
$@Fisher = James_Fisher, $@scientist = scientist_109871938, $X = ornithologist_109711173
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = theorist_110008610
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = colleague_109301221
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = organism_100003226
…
20/38 Ranking with Statistical Language Model
statistical language model for result graphs
q: Fisher isa scientist; Fisher isa $x
Top result (score 7.184462521168058E-13):
"Fisher" —familyNameOf—> Ronald_Fisher
Ronald_Fisher —type—> 20th_century_mathematicians
20th_century_mathematicians —subClassOf—> mathematician_109635652
mathematician_109635652 —subClassOf—> scientist_109871938
"scientist" —means—> scientist_109871938
Bindings:
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = mathematician_109635652
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = statistician_109958989
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = president_109787431
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = geneticist_109475749
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = scientist_109871938
…
Online access at http://www.mpi-inf.mpg.de/~kasneci/naga/
21/38 Ranking Factors
Confidence: prefer results that are likely to be correct
  certainty of IE; authenticity and authority of sources
  bornIn (Max Planck, Kiel) from "Max Planck was born in Kiel" (Wikipedia)
  vs. livesIn (Elvis Presley, Mars) from "They believe Elvis hides on Mars" (Martian Bloggeria)
Informativeness: prefer results that are likely important; may prefer results that are likely new to the user
  frequency in answer, frequency in corpus (e.g. Web), frequency in query log
  q: isa (Einstein, $y): prefer isa (Einstein, scientist) over isa (Einstein, vegetarian)
  q: isa ($x, vegetarian): prefer isa (Einstein, vegetarian) over isa (Al Nobody, vegetarian)
Compactness: prefer results that are tightly connected
  size of answer graph
  (figure: small answer graphs around Einstein, Bohr, Nobel Prize, Tom Cruise with isa, bornIn, diedIn, won edges)
22/38 NAGA Ranking Model
Following the paradigm of statistical language models (used in speech recognition and modern IR), applied to graphs.
For a query q with fact templates q_1 ... q_n (e.g. bornIn ($x, Frankfurt)), rank result graphs g with facts g_1 ... g_n (e.g. bornIn (Goethe, Frankfurt)) by decreasing likelihood, using a generative mixture model:
P[q | g] = prod_i ( alpha * P[q_i | g_i] + (1 - alpha) * P[q_i | background] )
where P[q_i | g_i] reflects informativeness, the background model smooths, and the weights combine the subqueries.
Ex.: bornIn ($x, Germany) & wonAward ($x, Nobel)
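One possible reading of this mixture model as code, with made-up numbers (alpha and all probability estimates below are illustrative toys, not NAGA's actual parameters or estimators):

```python
from math import prod

# Mixture language model for result graphs: the score of a result is the
# product, over subqueries, of a mixture of the informativeness of the
# matched fact and a background model.
def score(match_probs, background_probs, alpha=0.8):
    return prod(alpha * p + (1 - alpha) * b
                for p, b in zip(match_probs, background_probs))

# Two answers for isa (Einstein, $y): the frequently attested fact
# isa (Einstein, physicist) outranks the rarely stated isa (Einstein, vegetarian).
s_physicist = score([0.5], [0.01])
s_vegetarian = score([0.05], [0.01])
```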
23/38 NAGA Ranking Model: Informativeness
Estimate P[q_i | g_i] for q_i = (x*, r, z) with variable x* (analogously for the other cases)
Ex.: bornIn ($x, Frankfurt): candidates bornIn (Goethe, Frankfurt) vs. bornIn (GW, Frankfurt)
Ex.: isa (Einstein, $z): candidates isa (Einstein, physicist) vs. isa (Einstein, vegetarian)
Estimate on the knowledge graph, or estimate on the Web (exploit redundancy): freq (Einstein, isa, physicist) vs. freq (Einstein, isa, vegetarian)
24/38 NAGA Example Query: $x isa politician $x isa scientist Results: Benjamin Franklin Paul Wolfowitz Angela Merkel …
25/38 User Study for Quality Assessment (1) Benchmark: 55 queries from TREC QA 2005/2006 Examples: 1) In what country is Luxor? 2) Discoveries of the 20th Century? 12 queries from work on SphereSearch Examples: 1) In which movies did a governor act? 2) Firstname of politician Rice? 18 regular expression queries by us Example: What do Albert Einstein and Niels Bohr have in common? Competitors: NAGA vs. Google, Yahoo! Answers, BANKS (IIT Bombay), START (MIT)
26/38 User Study for Quality Assessment (2)
Quality measures: Precision@1; NDCG (normalized discounted cumulative gain) based on ratings highly relevant (2), somewhat relevant (1), irrelevant (0); with Wilson confidence intervals at alpha = 0.95

Benchmark      #Q   #A    Metric   Google   Yahoo! Answers   START    BANKS    NAGA
TREC QA        55   1098  NDCG     75.88%   26.15%           75.38%   87.93%   92.75%
                          P@1      67.81%   17.20%           73.23%   69.54%   84.40%
SphereSearch   12   343   NDCG     38.22%   17.23%           2.87%    88.82%   91.01%
                          P@1      19.38%   6.15%                     84.28%   84.94%
Own            18   418   NDCG     54.09%   17.98%           13.35%   85.59%   91.33%
                          P@1      27.95%   6.57%            13.57%   76.54%   86.56%
27/38 Outline
Motivation
Information Extraction & Knowledge Harvesting (YAGO)
Ranking for Search over Entity-Relation Graphs (NAGA)
Efficient Query Processing (RDF-3X)
Conclusion
28/38 Why RDF? Why a New Engine?
(figure: ER graph around Marie Curie: bornAs Maria Sklodowska, bornOn 1867, diedOn 1934, bornIn Warsaw inCountry Poland, marriedTo Pierre Curie, advisor Henri Becquerel (bornOn 1852, diedOn 1908), Alma Mater U Paris, wonAward Nobel Prize Physics and Nobel Prize Chemistry)
RDF triples (subject – property/predicate – value/object):
(id1, Name, "Marie Curie"), (id1, bornAs, "Maria Sklodowska"), (id1, bornOn, 1867), (id1, bornIn, id2), (id2, Name, "Warsaw"), (id2, locatedIn, id3), (id3, Name, "Poland"), (id1, marriedTo, id4), (id4, Name, "Pierre Curie"), (id1, wonAward, id5), (id4, wonAward, id5), …
pay-as-you-go: schema-agnostic, or schema later
RDF triples form a fine-grained (ER) graph
queries bound to need many star joins and long chain joins
physical design critical, but hardly predictable workload
29/38 SPARQL Query Language
SPJ combinations of triple patterns
Ex.: Select ?c Where { ?p isa scientist. ?p bornIn ?t. ?p hasWon ?a. ?t inCountry ?c. ?a Name NobelPrize }
options for filter predicates, duplicate handling, wildcard joins, etc.
Ex.: Select Distinct ?c Where { ?p ?r1 ?t. ?t ?r2 ?c. ?c isa. ?p bornOn ?b. Filter (?b > 1945) }
support for RDFS: types
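The SPJ semantics of such triple patterns can be sketched with a tiny in-memory matcher; a real engine would of course use indexes and join operators. The data, pattern syntax, and helper names below are illustrative:

```python
# Minimal SPARQL-style triple-pattern matching over an in-memory triple set:
# terms starting with '?' are variables, and a result is a binding of all
# variables that satisfies every pattern (conceptually a series of joins).
TRIPLES = {
    ("Marie_Curie", "isa", "scientist"),
    ("Marie_Curie", "bornIn", "Warsaw"),
    ("Warsaw", "inCountry", "Poland"),
}

def _unify(term, value, binding):
    if term.startswith("?"):
        if term in binding:
            return binding[term] == value
        binding[term] = value
        return True
    return term == value

def match(patterns, binding=None):
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    pattern, rest = patterns[0], patterns[1:]
    for triple in TRIPLES:
        new = dict(binding)
        if all(_unify(p, t, new) for p, t in zip(pattern, triple)):
            yield from match(rest, new)

results = list(match([("?p", "isa", "scientist"),
                      ("?p", "bornIn", "?t"),
                      ("?t", "inCountry", "?c")]))
```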
30/38 RDF & SPARQL Engines
choice of physical design is crucial:
giant triples table (S, P, O), e.g. (id1, Name, Marie Curie), (id1, bornOn, 1867), (id1, bornIn, id2), (id2, Name, Warsaw), (id2, Country, id11), (id1, Advisor, id5), ...
vertically partitioned: one (S, O) table per property, e.g. bornOn, Advisor
property tables, e.g. Person (S, Name, bornOn, bornIn, ...)
clustered property tables (+ leftover table), e.g. Town (S, Name, Country)
Systems: SESAME / OpenRDF, YARS2 (DERI), Jena (HP Labs), Oracle RDF_MATCH; C-Store (MIT), MonetDB (CWI): column stores + materialized views
+ physical design wizard!
31/38 RDF-3X: a RISC-style Engine [T. Neumann, G. Weikum: VLDB 2008]
Design rationale:
RDF-specific engine (not an RXORDBMS)
simplify operations; reduce implementation choices
optimize for the common case; eliminate tuning knobs
Key principles:
mapping dictionary for encoding all literals into ids
exhaustive indexing of id triples
index-only store, high compression
QP mostly merge joins with order preservation
very fast DP-based query optimizer
frequent-path synopses, property-value histograms
32/38 RDF-3X Indexing
index all collation orders of subject-property-object id triples: SPO, SOP, OSP, OPS, PSO, POS
directly stored in clustered B+ trees; high compression: indexes < original data
can choose any order for scan & join
additionally index count-aggregated projections in all orders: SP, SO, OS, OP, PS, PO, with a counter for each entry; enables efficient bookkeeping for duplicates
also index projections S, P, O with count aggregation
also need two mapping indexes: literal -> id and id -> literal
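The idea of exhaustive collation-order indexes can be sketched in a few lines; sorted Python lists stand in for clustered B+ trees, and the ids and helper names are invented:

```python
from bisect import bisect_left

# RDF-3X-style exhaustive indexing: the same id triples stored in all six
# collation orders, so any access pattern becomes a sorted range scan.
def build_indexes(triples):
    orders = {"SPO": (0, 1, 2), "SOP": (0, 2, 1), "OSP": (2, 0, 1),
              "OPS": (2, 1, 0), "PSO": (1, 0, 2), "POS": (1, 2, 0)}
    return {name: sorted(tuple(t[i] for i in perm) for t in triples)
            for name, perm in orders.items()}

def range_scan(index, prefix):
    # All entries starting with `prefix`, located by binary search.
    lo = bisect_left(index, prefix)
    return [e for e in index[lo:] if e[:len(prefix)] == prefix]

triples = [(1, 10, 2), (1, 11, 3), (4, 10, 2)]
idx = build_indexes(triples)
# "Which subjects have property 10 with object 2?" -> POS scan, prefix (10, 2).
subjects = [s for (_, _, s) in range_scan(idx["POS"], (10, 2))]
```

Because every order is materialized, a scan for any bound prefix returns its results already sorted, which is what makes the order-preserving merge joins of the previous slide cheap.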
33/38 RDF-3X Query Optimization
Principles:
optimizing join orders is key (star joins, long join chains)
should exploit exhaustive indexes and order preservation
support merge joins and hash joins
Bottom-up dynamic programming for exhaustive plan enumeration (< 100 ms for 20 joins)
Cost model based on selectivity estimation from histograms for each of the 6 SPO orderings (approx. equi-depth) and frequent join paths (property sequences) for stars and chains
Example query: a chain ?x1 -p1-> ?x2 -p2-> ?x3 -p3-> ?x4 -p4-> ?x5 -p5-> ?x6 with value predicates a1 = v1, a4 = v4, a6 = v6
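A toy version of bottom-up DP plan enumeration in the spirit of this slide; the cost model below is invented purely for illustration (every join keeps 10% of the cross product) and is not RDF-3X's histogram-based estimator:

```python
from itertools import combinations
from math import prod

# Bottom-up dynamic programming over join orders: the DP table maps each
# subset of relations to the cheapest (cost, plan-tree) that joins it.
def best_plan(cards, sel=0.1):
    def card(subset):
        # Toy cardinality estimate for the join result of `subset`.
        return prod(cards[r] for r in subset) * sel ** (len(subset) - 1)
    table = {frozenset([r]): (0.0, r) for r in cards}
    for size in range(2, len(cards) + 1):
        for subset in map(frozenset, combinations(cards, size)):
            best = None
            for k in range(1, size):
                for left in map(frozenset, combinations(sorted(subset), k)):
                    right = subset - left
                    # Cost = subplan costs + size of this intermediate result.
                    cost = table[left][0] + table[right][0] + card(subset)
                    if best is None or cost < best[0]:
                        best = (cost, (table[left][1], table[right][1]))
            table[subset] = best
    return table[frozenset(cards)]

cost, plan = best_plan({"A": 1000, "B": 10, "C": 100})
```

The enumeration is exhaustive yet cheap because each subset is solved once; this is the structure that lets a real optimizer enumerate plans for 20 joins in well under 100 ms.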
34/38 Experimental Evaluation: Setup Setup and competitors: 2GHz dual core, 2 GB RAM, 30MB/s disk, Linux column-store property tables by Abadi et al., using MonetDB triples store with SPO, POS, PSO indexes, using PostgreSQL Datasets: 1) Barton library catalog: 51 Mio. triples (4.1 GB) 2) YAGO knowledge base: 40 Mio. triples (3.1 GB) 3) Librarything social-tagging excerpt: 30 Mio. triples (1.8 GB) Benchmark queries (7 or 8 per dataset) in the spirit of: 1) counts of French library items (books, music, etc.), with creator, publisher, language, etc. 2) scientist from Poland with French advisor who both won awards 3) books tagged with romance, love, mystery, suspense by users who like crime novels and have friends who... Select ?t Where { ?b hasTitle ?t. ?u romance ?b. ?u love ?b. ?u mystery ?b. ?u suspense ?b. ?u crimeNovel ?c. ?u hasFriend ?f. ?f... }
35/38 Experimental Evaluation: Results
DB sizes [GB]:
             Barton    Yago      LibThing
RDF-3X       2.8       2.7       1.6
MonetDB      1.6-2.0   1.1-2.4   0.7-6.9
PostgreSQL   8.7       7.5       5.7

DB load times [min]:
             Barton    Yago      LibThing
RDF-3X       13        25        20
MonetDB      11        21        4
PostgreSQL   30        25        20

Geometric means of query run-times [sec], warm (cold) cache:
             Barton         Yago          LibThing
RDF-3X       0.4 (5.9)      0.04 (0.7)    0.13 (0.89)
MonetDB      3.8 (26.4)     54.6 (78.2)   4.39 (8.16)
PostgreSQL   64.3 (167.8)   0.56 (10.6)   30.4 (93.9)
36/38 Outline
Motivation
Information Extraction & Knowledge Harvesting (YAGO)
Ranking for Search over Entity-Relation Graphs (NAGA)
Efficient Query Processing (RDF-3X)
Conclusion
37/38 Summary & Outlook
lift the world's best information sources (Wikipedia, Web, Web 2.0) to the level of explicit knowledge (ER-oriented facts)
1) building knowledge graphs: combine semantic & statistical & social IE sources (for scholarly Web, digital libraries, enterprise know-how); challenges in consistency vs. uncertainty, long-term evolution
2) heterogeneity & uncertain IE necessitate ranking: new ranking models (e.g. statistical LM for graphs)
3) efficiency and scalability: challenges for search & ranking (top-k queries) and updates
38/38 Thank You !