ESTER Efficient Search on Text, Entities, and Relations Holger Bast Max-Planck-Institut für Informatik Saarbrücken, Germany joint work with Alexandru Chitea,

Presentation transcript:

ESTER Efficient Search on Text, Entities, and Relations Holger Bast Max-Planck-Institut für Informatik Saarbrücken, Germany joint work with Alexandru Chitea, Fabian Suchanek, Ingmar Weber Talk at SIGIR’07 in Amsterdam, July 26th

ESTER
It's about: Fast Semantic Search

Keyword Search vs. Semantic Search
Keyword search
  – Query: john lennon
  – Answer: documents containing the words john and lennon
Semantic search
  – Query: musician
  – Answer: documents containing an instance of musician
Combined search
  – Query: beatles musician
  – Answer: documents containing the word beatles and an instance of musician
Useful by itself or as a component of a QA system

Semantic Search: Challenges + Our System
1. Entity recognition
  – approach 1: let users annotate (semantic web)
  – approach 2: annotate (semi-)automatically
  – our system: uses Wikipedia links + learns from them
2. Query processing (focus of the paper and of this talk)
  – build a space-efficient index
  – which enables fast query answers
  – our system: as compact and fast as a standard full-text engine
3. User interface
  – easy to use
  – yet powerful query capabilities
  – our system: standard interface with interactive suggestions

In the Rest of this Talk …
Efficiency
  – three simple ideas (which all fail)
  – our approach (which works)
Queries supported
  – essentially all SPARQL queries, and
  – seamless integration with ordinary full-text search
Experiments
  – efficiency (great)
  – quality (not so great yet)
Conclusions
  – lots of interesting + challenging open problems

Efficiency: Simple Idea 1
Add "semantic tags" to the document
  – e.g., add the special word tag:musician before every occurrence of a musician in a document
Problem 1: Index blowup
  – e.g., John Lennon is a: Musician, Singer, Composer, Artist, Vegetarian, Person, Pacifist, … (28 classes)
Problem 2: Limited querying capabilities
  – e.g., could not produce a list of the musicians that occur in documents that also contain the word beatles
  – in particular, could not do all SPARQL queries (more on that later)
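To make the index blowup of Simple Idea 1 concrete, here is a minimal Python sketch. The class list and the helper `tag_document` are hypothetical and only illustrate the effect: every entity occurrence adds one artificial token per class the entity belongs to.

```python
# Hypothetical toy data: a short class list for one entity (the slide mentions 28 classes).
CLASSES = {"john_lennon": ["musician", "singer", "composer", "artist", "person"]}

def tag_document(tokens, entity_at):
    """Insert one tag:<class> token before each entity occurrence, one per class."""
    tagged = []
    for i, token in enumerate(tokens):
        entity = entity_at.get(i)
        if entity is not None:
            tagged.extend(f"tag:{cls}" for cls in CLASSES[entity])
        tagged.append(token)
    return tagged

tokens = ["john", "lennon", "of", "the", "beatles"]
# Assumption: entity recognition already found john_lennon starting at token 0.
tagged = tag_document(tokens, {0: "john_lennon"})
blowup = len(tagged) / len(tokens)
```

With five classes, a five-word sentence doubles in length; with the 28 classes mentioned on the slide, the growth per occurrence would be correspondingly larger.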

Efficiency: Simple Idea 2
Query expansion
  – e.g., replace the query word musician by the disjunction musician:aaron_copland OR … OR musician:zarah_leander (7,593 musicians in Wikipedia)
Problem: Inefficient query processing
  – one intersection per element of the disjunction needed
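The cost argument of Simple Idea 2 can be sketched as follows. This is an illustrative Python sketch, not the evaluation strategy of any particular engine; the toy index and the function names `intersect` and `expanded_query` are assumptions made for the example.

```python
def intersect(a, b):
    """Intersect two sorted lists of document ids with a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def expanded_query(index, word, class_name, instances):
    """Evaluate `word AND (class:inst_1 OR ... OR class:inst_n)`,
    doing one intersection per element of the disjunction."""
    word_docs = index.get(word, [])
    result = set()
    intersections = 0
    for instance in instances:
        term = f"{class_name}:{instance}"
        result.update(intersect(word_docs, index.get(term, [])))
        intersections += 1
    return sorted(result), intersections

# Hypothetical toy inverted index: word -> sorted document ids.
INDEX = {
    "beatles": [1, 2, 5],
    "musician:aaron_copland": [5, 6],
    "musician:john_lennon": [1, 3],
    "musician:zarah_leander": [4],
}
docs, n = expanded_query(
    INDEX, "beatles", "musician",
    ["aaron_copland", "john_lennon", "zarah_leander"],
)
```

Here `n` equals the number of instances in the disjunction, so with the 7,593 musicians from the slide a single query word would trigger thousands of intersections.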

Efficiency: Simple Idea 3
Use a database
  – map semantic queries to SQL queries on suitably constructed tables
  – that's what the Artificial-Intelligence / Semantic-Web people usually do
Problem: Inefficient + lack of control
  – building a search engine on top of an off-the-shelf database is orders of magnitude slower or uses orders of magnitude more space, or both
  – very limited control regarding efficiency aspects

Efficiency: Our Approach
Two basic operations
  – prefix search of a special kind [will be explained by example]
  – join [will be explained by example]
An index data structure
  – which supports these two operations efficiently
Artificial words in the documents
  – such that a large class of semantic queries reduces to a combination of (few of) these operations

Processing the query "beatles musician"
Text document "Gitanes": … legend says that John Lennon entity:john_lennon of the Beatles smoked Gitanes to deepen his voice …
Ontology document "John Lennon" (position, word):
  0  entity:john_lennon
  1  relation:is_a
  2  class:musician
  2  class:singer
  …
Prefix query 1: beatles entity:*
  – completions: entity:john_lennon, entity:1964, entity:liverpool, etc.
Prefix query 2: entity:* . relation:is_a . class:musician
  – completions: entity:wolfgang_amadeus_mozart, entity:johann_sebastian_bach, entity:john_lennon, etc.
Join of the two completion lists: entity:john_lennon, etc.
Two prefix queries, one join
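The two prefix queries and the join on this slide can be mimicked on a toy corpus. The following Python sketch is only an illustration under stated assumptions: the documents, the helper names, and the brute-force scans are invented for the example, whereas the actual system answers both prefix queries from its index.

```python
from collections import defaultdict

# Toy corpus: document name -> list of (position, word).
# Artificial words follow the slide's naming; the documents are made up.
DOCS = {
    "gitanes": [(0, "john"), (1, "lennon"), (1, "entity:john_lennon"),
                (2, "of"), (3, "the"), (4, "beatles")],
    "liverpool": [(0, "beatles"), (1, "entity:liverpool")],
    "john_lennon": [(0, "entity:john_lennon"), (1, "relation:is_a"),
                    (2, "class:musician"), (2, "class:singer")],
    "mozart": [(0, "entity:wolfgang_amadeus_mozart"), (1, "relation:is_a"),
               (2, "class:musician")],
}

def completions_cooccurring(docs, word, prefix):
    """Prefix query 1: completions of `prefix` in documents containing `word`."""
    found = set()
    for postings in docs.values():
        words = {w for _, w in postings}
        if word in words:
            found |= {w for w in words if w.startswith(prefix)}
    return found

def completions_in_sequence(docs, prefix, rest):
    """Prefix query 2: completions of `prefix` followed, at consecutive
    positions, by the words in `rest`."""
    found = set()
    for postings in docs.values():
        at = defaultdict(set)
        for pos, w in postings:
            at[pos].add(w)
        for pos, w in postings:
            if w.startswith(prefix) and all(
                r in at[pos + k + 1] for k, r in enumerate(rest)
            ):
                found.add(w)
    return found

def beatles_musicians(docs):
    left = completions_cooccurring(docs, "beatles", "entity:")
    right = completions_in_sequence(
        docs, "entity:", ["relation:is_a", "class:musician"]
    )
    return sorted(left & right)  # the join

result = beatles_musicians(DOCS)
```

On this toy data the first query yields john_lennon and liverpool, the second yields john_lennon and wolfgang_amadeus_mozart, and the join keeps only the entity that appears in both lists.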

Processing the query "beatles musician"
Problem: entity:* has a huge number of occurrences
  – ≈ 200 million for Wikipedia, which is ≈ 20% of all occurrences
  – prefix search efficient only for up to ≈ 1% (explanation follows)
Solution: frontier classes
  – classes at "appropriate" level in the hierarchy
  – e.g.: artist, believer, worker, vegetable, animal, …
[Figure: the "Gitanes" text document, the "John Lennon" ontology document, and the queries beatles entity:* and entity:* . relation:is_a . class:musician, as on the previous slide]

Processing the query "beatles musician"
First figure out: musician → artist (easy)
Text document "Gitanes": … legend says that John Lennon artist:john_lennon believer:john_lennon of the Beatles smoked …
Ontology document "John Lennon" (position, word):
  0  artist:john_lennon
  0  believer:john_lennon
  1  relation:is_a
  2  class:musician
  …
Prefix query 1: beatles artist:*
  – completions: artist:john_lennon, artist:graham_greene, artist:pete_best, etc.
Prefix query 2: artist:* . relation:is_a . class:musician
  – completions: artist:wolfgang_amadeus_mozart, artist:johann_sebastian_bach, artist:john_lennon, etc.
Join of the two completion lists: artist:john_lennon, etc.
Two prefix queries, one join
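The "first figure out: musician → artist" step amounts to walking up the class hierarchy until a frontier class is reached. A minimal Python sketch, assuming a hypothetical child-to-parent map and frontier set (neither is taken from YAGO):

```python
# Hypothetical slice of a class hierarchy (child -> parent) and frontier set.
PARENT = {
    "singer": "musician",
    "musician": "artist",
    "artist": "person",
    "pacifist": "believer",
    "believer": "person",
}
FRONTIER = {"artist", "believer"}

def frontier_class(cls):
    """Walk up the hierarchy until a frontier class is reached (None if there is none)."""
    while cls is not None and cls not in FRONTIER:
        cls = PARENT.get(cls)
    return cls

def rewrite_query(word, cls):
    """Build the two prefix queries from the slide for `word` plus class `cls`."""
    frontier = frontier_class(cls)
    if frontier is None:
        raise ValueError(f"no frontier class above {cls!r}")
    return [
        f"{word} {frontier}:*",
        f"{frontier}:* . relation:is_a . class:{cls}",
    ]

queries = rewrite_query("beatles", "musician")
```

Using the frontier prefix artist:* instead of entity:* keeps the first prefix query restricted to a much smaller share of the word occurrences.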

The HYB Index [Bast/Weber, SIGIR'06]
Maintains lists for word ranges (not words)
Example block for the word range abl-abt:
  Doc.:  12, 83, 187, …
  Pos.:  5, 14, 124, 88, …
  Score: 0.5, 0.2, 0.7, 0.4, …
  Word:  able, ablaze, abroad, abnormal
Looks like this for person:*
  Doc.:  17, 23, 72, …
  Pos.:  12, 3, 55, 59, …
  Score: 0.1, 0.5, 0.3, 0.5, …
  Word:  person:john_lennon, person:ringo_starr, person:graham_greene, person:john_lennon

Maintains lists for word ranges (not words)
Example block for the word range abl-abt:
  Doc.:  12, 83, 187, …
  Pos.:  5, 14, 124, 88, …
  Score: 0.5, 0.2, 0.7, 0.4, …
  Word:  able, ablaze, abroad, abnormal
Provably efficient
  – no more space than an inverted index (on the same data)
  – each query = scan of a moderate number of (compressed) items
Extremely versatile
  – can do all kinds of things an inverted index cannot do (efficiently)
  – autocompletion, faceted search, query expansion, error correction, select and join, …
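The idea of keeping one list per word range can be illustrated with a small Python sketch. This is a simplified toy, not the HYB implementation: the class `RangeIndex`, its range boundaries, and the uncompressed tuples are assumptions made for the example.

```python
from bisect import bisect_right

class RangeIndex:
    """Toy index with one posting list per word range instead of per word."""

    def __init__(self, boundaries):
        # `boundaries` holds the first word of each range (hypothetical choice).
        self.boundaries = sorted(boundaries)
        self.blocks = [[] for _ in self.boundaries]

    def _block_of(self, word):
        return max(bisect_right(self.boundaries, word) - 1, 0)

    def add(self, doc, pos, score, word):
        self.blocks[self._block_of(word)].append((doc, pos, score, word))

    def finalize(self):
        for block in self.blocks:
            block.sort()  # order each block by document id

    def prefix_query(self, prefix, docs=None):
        """Scan the blocks that may hold completions of `prefix`,
        optionally restricted to a set of candidate documents."""
        lo = self._block_of(prefix)
        hi = self._block_of(prefix + "\uffff")
        hits = []
        for block in self.blocks[lo:hi + 1]:
            for doc, _pos, _score, word in block:
                if word.startswith(prefix) and (docs is None or doc in docs):
                    hits.append((doc, word))
        return hits

index = RangeIndex(["a", "abl", "abu", "person:"])
index.add(12, 5, 0.5, "able")
index.add(83, 14, 0.2, "ablaze")
index.add(187, 124, 0.7, "abroad")
index.add(17, 12, 0.1, "person:john_lennon")
index.add(23, 3, 0.5, "person:ringo_starr")
index.finalize()
```

A prefix query such as person:* then touches a single block, and restricting the scan to candidate documents is what turns it into the context-sensitive prefix search used for the join.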

Queries we can handle
We prove the following theorem:
  – Any basic SPARQL graph query with m edges can be reduced to at most 2m prefix / join operations
Example (musicians born in the same year as John Lennon):
  SELECT ?who WHERE {
    ?who is_a Musician
    ?who born_in_year ?when
    John_Lennon born_in_year ?when
  }
ESTER achieves seamless integration with full-text search
  – SPARQL has no means for dealing with full-text search
  – XQuery can handle full-text search, but is not really suitable for semantic search
More about supported queries in the paper
Footnote: SPARQL = SPARQL Protocol And RDF Query Language (yes, it's recursive)

Experiments: Corpus, Ontology, Index
Corpus: English Wikipedia (XML dump from Nov. 2006)
  – ≈ 8 GB raw XML
  – ≈ 2.8 million documents
  – ≈ 1 billion words
Ontology: YAGO (Suchanek/Kasneci/Weikum, WWW'07)
  – ≈ 2.5 million facts
  – derived from a clever combination of Wikipedia + WordNet (entities from Wikipedia, taxonomy from WordNet)
Our index
  – ≈ 1.5 billion words (original + artificial)
  – ≈ 3.3 GB total index size; ontology-only is a mere 100 MB
Note: our system works for an arbitrary corpus + ontology

Experiments: Efficiency – What Baseline?
SPARQL engines
  – can't do text search
  – and slow for ontology-only too (on Wikipedia: seconds)
XQuery engines
  – extremely slow for text search (on Wikipedia: minutes)
  – and slow for ontology-only too (on Wikipedia: seconds)
Other prototypes which do semantic + full-text search
  – efficiency is hardly considered
  – e.g., the system of Castells/Fernandez/Vallet (TKDE'07): "… average informally observed response time on a standard professional desktop computer [of] below 30 seconds [on 145,316 documents and an ontology with 465,848 facts] …"
  – our system: ~100 ms, 2.8 million documents, 2.5 million facts

Experiments: Efficiency – Stress Test 1
Compare to ontology-only system
  – the YAGO engine from WWW'07
  – Onto Simple: when was [person] born [1000 queries]
  – Onto Advanced: list all people from [profession] [1000 queries]
  – Onto Hard: when did people die who were born in the same year as [person] [1000 queries]

                 Our system (100 MB index)   Onto-Only (4 GB index)
                 avg.      max.              avg.      max.
  Onto Simple    2 ms      5 ms              3 ms      20 ms
  Onto Advanced  9 ms      31 ms             3 ms      794 ms
  Onto Hard      64 ms     208 ms            78 ms     550 ms

Note: comparison very unfair (for our system)

Experiments: Efficiency – Stress Test 2
Compare to text-only search engine
  – state-of-the-art system from SIGIR'06
  – Onto+Text Easy: counties in [US state] [50 queries]
  – Onto+Text Hard: computer scientists [nationality] [50 queries]
  – Full-text query: e.g. german computer scientists (note: hardly finds relevant documents)

                  Our system          Full-Text Only
                  avg.      max.      avg.      max.
  Onto+Text Easy  224 ms    772 ms    90 ms     498 ms
  Onto+Text Hard  279 ms    502 ms    44 ms     85 ms

Note: comparison extremely unfair (for our system)

Experiments: Quality – Entity Recognition
Use Wikipedia links as hints
  – "… following [[John Lennon | Lennon]] and Paul McCartney, two of the Beatles, …"
  – "… The southern terminus is located south of the town of [[Lennon, Michigan | Lennon]] …"
Learn other links
  – use words in neighborhood as features
Accuracy

  all words   2 senses   3 senses   ≥4 senses
  93.4%       88.2%      84.4%      80.3%

Experiments: Quality – Relevance
2 query sets
  – People associated with [american university] [100 queries]
  – Counties of [american state] [50 queries]
Ground truth
  – Wikipedia has corresponding lists, e.g., List of Carnegie Mellon University People
Precision and recall

             Precision   Recall
  PEOPLE     37.3%       89.7%
  COUNTIES   66.5%       97.8%
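For reference, precision and recall against a ground-truth list follow the standard set-based definitions. The Python sketch below uses made-up result and ground-truth sets, not the talk's data; the function name `precision_recall` is illustrative.

```python
def precision_recall(retrieved, relevant):
    """Standard set-based precision and recall.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: four returned entities, three in the ground-truth list.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
```

High recall with lower precision, as in the table above, means most ground-truth entries are found but many returned entities are not on the reference list.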

Conclusions
Semantic retrieval system ESTER
  – fast and scalable via reduction to prefix search and join
  – can handle all basic SPARQL queries
  – seamless integration with full-text search
  – standard user interface with (semantic) suggestions
Lots of interesting and challenging problems
  – simultaneous ranking of entities and documents
  – proper snippet generation and highlighting
  – search result quality
  – …
Dank je wel! (Thank you!)

Context-Sensitive Prefix-Search
Compute completions of the last query word
  – which together with the previous part of the query would lead to a hit
  – formal definition in the paper
  – [DEMO: show a live example]
Extremely useful
  – autocompletion search
  – faceted search
  – error correction, synonym search, …
  – category search: for example, add place:amsterdam, then the query place:* finds all instances of a place
Isn't the last idea enough for semantic search?
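As a rough illustration of the operation described on this slide (the formal definition is in the paper), the Python sketch below computes the completions of a prefix that occur in the documents matching the earlier part of the query. The toy index and the function name `complete` are assumptions made for the example, and the linear scan over all words stands in for an index lookup.

```python
def complete(index, candidate_docs, prefix):
    """Return {completion: matching docs} for all words that start with
    `prefix` and occur in at least one of `candidate_docs`."""
    result = {}
    for word, docs in index.items():
        if word.startswith(prefix):
            hits = docs & candidate_docs
            if hits:
                result[word] = hits
    return result

# Hypothetical toy index: word -> set of document ids.
INDEX = {
    "beatles": {1, 2},
    "music": {2},
    "musician:john_lennon": {1, 3},
    "musician:mozart": {4},
}

# Query "beatles musi": the hits for "beatles" form the context for the prefix "musi".
completions = complete(INDEX, INDEX["beatles"], "musi")
```

The completion musician:mozart is dropped because it never occurs in a document that also contains beatles, which is what makes the prefix search context-sensitive.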

DEMO
Do the following queries [live or recorded]
  – beatles
  – beatles musi
  – beatles musicia
  – beatles musician:john_lennon (or beatles entity:john_lennon)

Processing the query "beatles musician"
Text document "Liverpool" [one of many documents mentioning John Lennon]: … in honor of the late Beatle entity:john_lennon
Ontology document "John Lennon" (position, word):
  0  entity:john_lennon
  1  r:is_a
  2  class:musician
  2  class:singer
  …
Prefix query 1: beatles entity:*
Prefix query 2: "entity:* r:is_a class:musician"
Problem: entity:* has a huge number of occurrences
  – ≈ 200 million for Wikipedia = 20% of all occurrences
  – prefix search efficient only up XXX
Solution: Frontier set
  – classes high up in the hierarchy [explain more]
  – e.g.: person, animal, substance, abstraction, …

Processing the query "beatles musician"
Text document "Liverpool" [one of many documents mentioning John Lennon]: … in honour of the late Beatle person:john_lennon
Ontology document "John Lennon" (position, word):
  0  person:john_lennon
  1  is_a:
  2  class:musician
  2  class:singer
  …
Prefix query 1: beatles person:*
  – completions: person:john_lennon, person:the_queen, person:pete_best, etc.
Prefix query 2: "person:* r:is_a class:musician"
  – completions: person:wolfgang_amadeus_mozart, person:johann_sebastian_bach, person:john_lennon, etc.
Join of the two completion lists: entity:john_lennon, etc.
Two prefix queries, one join

Our Solution, Version 1
Combination of Prefix Search + Join
  – Query 1: beatles entity:* (entities co-occurring with beatles)
  – Query 2: musician – entity:* (entities which are musicians)
  – Join the completions from 1 & 2 (musicians co-occurring with beatles)
Some document about Albert Einstein: … entity:einstein …
Ontology document "Albert Einstein": entity:albert_einstein scientist vegetarian intellectual …
But: unspecific prefixes (entity:*) are hard

Our Solution, Version 2
Combination of Prefix Search + Join
  – Query 1: translate:singer:* (tells us that a singer is a musician)
  – Query 2: beatles musician:* (musicians co-occurring with beatles)
  – Query 3: singer – musician:* (musicians which are singers)
  – Join the completions from 2 & 3 (singers co-occurring with beatles)
Some document mentioning John Lennon: … musician:john_lennon xyz:john_lennon …
Ontology document "John Lennon": musician:john_lennon xyz:john_lennon …
[Special Doc]: TRANSLATE:singer:musician

Processing the query "beatles musician"
Text document "Gitanes": … legend says that John Lennon artist:john_lennon believer:john_lennon of the Beatles smoked …
Ontology document "John Lennon" (position, word):
  0  artist:john_lennon
  0  believer:john_lennon
  1  relation:is_a
  2  class:musician
  …
Prefix query 1: beatles artist:*
  – completions: artist:john_lennon, artist:queen_elisabeth, artist:pete_best, etc.
Prefix query 2: artist:* . relation:is_a . class:musician
  – completions: artist:wolfgang_amadeus_mozart, artist:johann_sebastian_bach, artist:john_lennon, etc.
Join of the two completion lists: person:john_lennon, etc.
Two prefix queries, one join
John Lennon at the Royal Variety Show in 1963, in the presence of members of the British royalty: "Those of you in the cheaper seats can clap your hands. The rest of you, if you'll just rattle your jewellery."