The CompleteSearch Engine: Interactive, Efficient, and Towards IR&DB integration Holger Bast Max-Planck-Institut für Informatik Saarbrücken, Germany joint.

Presentation transcript:

The CompleteSearch Engine: Interactive, Efficient, and Towards IR&DB Integration
Holger Bast, Max-Planck-Institut für Informatik, Saarbrücken, Germany
joint work with Ingmar Weber
CIDR 2007 in Asilomar, California, 8th January 2007

IR versus DB (simplified view)

IR system (search engine): scales very well but special-purpose
– single data structure and query algorithm, optimized for ranked retrieval on textual data
– highly compressible and high locality of access
– ranking is an integral part
– can't do even simple selects, joins, etc.

DB system (relational): general-purpose but slow on large data
– variety of indices and query algorithms, to suit all sorts of complex queries on structured data
– space overhead and limited locality of access
– no integrated ranked retrieval
– can do complex selects, joins, … (SQL)

Our contribution (in a nutshell)

The CompleteSearch engine: fairly general-purpose and scales very well
– novel data structure and query algorithm for context-sensitive prefix search and completion
– highly compressible and high locality of access
– IR-style ranked retrieval
– DB-style selects and joins
– natural blend of the two
– subsecond query times for up to a terabyte on a single machine
– no transactions, recovery, etc.
– for low dynamics (few insertions/deletions)
– other open issues at the end of the talk …

Context-Sensitive Prefix Search & Completion

Data is given as
– documents containing words
– documents have ids (D1, D2, …)
– words have ids (A, B, C, …)

Query
– given a sorted list of doc ids
– and a range of word ids

[Figure: example collection of documents with their words. D1: A O E W H; D2: B F A; D3: Q D A; D4: K L K A B; D9: E E R; D13: A O E W H; D17: B W U K A; D27: K L D F; D32: I L S D H; D43: D Q; D53: J D E A; D74: J W Q; D78: K L S; D88: P A E G Q; D92: P U D E M; D98: E B A S. Example query: doc ids D13, D17, D88, …; word range C D E F G H.]

Context-Sensitive Prefix Search & Completion

Data is given as
– documents containing words
– documents have ids (D1, D2, …)
– words have ids (A, B, C, …)

Query
– given a sorted list of doc ids
– and a range of word ids

Answer
– all matching word-in-doc pairs
– with scores

[Figure: the example collection from the previous slide, with the query documents D13, D17, D88 singled out. Query: doc ids D13, D17, D88, …; word range C D E F G H. Answer shown: D13 E, …, D88 E, …, D88 G.]
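The query definition above can be spelled out as a naive reference implementation. This is only a sketch of the query semantics, not of the engine's index or algorithm: the function name `prefix_search` is hypothetical, scores are left out, and the collection is transcribed from the slide's example with single letters standing in for word ids.

```python
# Example collection from the slide: doc id -> words (letters stand for word ids).
DOCS = {
    1: "AOEWH", 2: "BFA", 3: "QDA", 4: "KLKAB", 9: "EER", 13: "AOEWH",
    17: "BWUKA", 27: "KLDF", 32: "ILSDH", 43: "DQ", 53: "JDEA",
    74: "JWQ", 78: "KLS", 88: "PAEGQ", 92: "PUDEM", 98: "EBAS",
}


def prefix_search(doc_ids, word_lo, word_hi):
    """Return all (doc id, word) pairs with doc in doc_ids and
    word_lo <= word <= word_hi (hypothetical helper, no scores)."""
    pairs = []
    for doc_id in doc_ids:  # doc_ids is assumed to be sorted
        # Deduplicate the words of a document and visit them in word order.
        for word in sorted(set(DOCS.get(doc_id, ""))):
            if word_lo <= word <= word_hi:
                pairs.append((doc_id, word))
    return pairs


# The slide's query: docs D13, D17, D88 and the word range C..H.
result = prefix_search([13, 17, 88], "C", "H")
print(result)
```

D17 contributes nothing because none of its words fall into the range C..H, which matches the answer shown on the slide, where only D13 and D88 appear.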

Index data structure (previous work)

AutoTree (SPIRE'06)
– hierarchies of ranges, relative bit vectors
– output sensitive: one item output every O(1) steps
– only good in main memory (bit rank data structure)

Half-inverted index (SIGIR'06)
– flat partitioning into equal-size blocks, entropy encoding
– very good compressibility
– very good locality of access (data accessed in large blocks)

Basic idea: precompute lists of word-in-document pairs for ranges of words

[Figure: one such list, showing doc ids D5, D15, D37, D39, D67, D95, D98, … and word ids A, R, T, F, D, K, L, B, E, A, …]

No time for that, sorry!
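The basic idea of precomputing lists of word-in-document pairs for ranges of words might look roughly like the following. This is a simplified sketch under stated assumptions: "equal-size blocks" is interpreted here as a fixed number of distinct words per block, the lists are stored uncompressed (no entropy encoding), and the names `build_block_index` and `query` are hypothetical.

```python
from collections import defaultdict


def build_block_index(docs, words_per_block):
    """Partition the sorted vocabulary into blocks of words_per_block words and
    store, per block, one list of (doc id, word) pairs sorted by doc id."""
    vocab = sorted({w for words in docs.values() for w in words})
    block_of = {w: i // words_per_block for i, w in enumerate(vocab)}
    blocks = defaultdict(list)
    for doc_id in sorted(docs):
        for word in sorted(set(docs[doc_id])):
            blocks[block_of[word]].append((doc_id, word))
    return block_of, blocks


def query(block_of, blocks, doc_ids, word_lo, word_hi):
    """Scan only the blocks that overlap [word_lo, word_hi] and keep the
    pairs whose document is in doc_ids and whose word is in the range."""
    wanted = set(doc_ids)
    hit_blocks = sorted({b for w, b in block_of.items() if word_lo <= w <= word_hi})
    out = []
    for b in hit_blocks:
        out.extend(
            (d, w) for d, w in blocks[b]
            if d in wanted and word_lo <= w <= word_hi
        )
    return sorted(out)


# Four documents taken from the slide's example collection.
docs = {13: "AOEWH", 17: "BWUKA", 53: "JDEA", 88: "PAEGQ"}
block_of, blocks = build_block_index(docs, words_per_block=4)
answer = query(block_of, blocks, [13, 17, 88], "C", "H")
print(answer)
```

The point of the layout is that a query touches a few whole blocks sequentially instead of many short per-word lists, which is where the locality of access comes from.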

Supported queries (examples)

Full-text search with autocompletion (SIGIR'06)
– cidr con*

Add structured data via special words
– conference:sigmod
– author:gerhard_weikum
– year:2005

Gerhard Weikum      SIGMOD   2005   paper #23876
Surajit Chaudhuri   SIGMOD   2005   paper #23876
Gerhard Weikum      SIGIR    2006   paper #31457
Ralitsa Angelova    SIGIR    2006   paper #31457
…                   …        …      …

Select … Where … queries
– conference:sigmod author:*

Join queries
– launch conference:sigmod author:* and conference:sigir author:* and intersect the set of completions (not documents)
– syntax is author[conference:sigmod conference:sigir]

Mixed IR/DB queries
– continuous query processing author:*
– author[conference:sigir conference:sigmod] query optimization
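The join query described above (run two prefix queries, then intersect the sets of completions rather than the sets of documents) can be illustrated on the table's four rows. This is a sketch, assuming each paper is one document that carries its attributes as special words; the helper names `completions` and `join` are hypothetical, and author words other than author:gerhard_weikum are spelled by the same convention as an assumption.

```python
# Each paper is modelled as one document containing special words (an assumption).
PAPERS = {
    23876: {"author:gerhard_weikum", "author:surajit_chaudhuri",
            "conference:sigmod", "year:2005"},
    31457: {"author:gerhard_weikum", "author:ralitsa_angelova",
            "conference:sigir", "year:2006"},
}


def completions(context_words, prefix):
    """All words starting with prefix that occur in documents
    containing every word in context_words."""
    found = set()
    for words in PAPERS.values():
        if set(context_words) <= words:
            found |= {w for w in words if w.startswith(prefix)}
    return found


def join(prefix, contexts):
    """Sketch of prefix[context1 context2 ...]: one prefix query per
    context, then intersect the completions (not the documents)."""
    return set.intersection(*(completions([c], prefix) for c in contexts))


# Select ... Where ...: conference:sigmod author:*
sigmod_authors = completions(["conference:sigmod"], "author:")

# Join: author[conference:sigmod conference:sigir]
both = join("author:", ["conference:sigmod", "conference:sigir"])
print(sorted(sigmod_authors), sorted(both))
```

Intersecting documents would return nothing here, since no single paper appeared at both conferences; intersecting completions finds the author common to both result sets.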

Efficiency

Index size
– theoretical guarantee: space consumption is within 1+ε of data entropy
– empirical results (on TREC Terabyte): raw data: 426 GB, index size: 4.9 GB

Query time
– theoretical guarantee: each query ≈ a scan of ε ∙ #docs items (compressed)
– empirical results (on TREC Terabyte): average / maximal query time: 0.11 secs / 0.86 secs

Note:
– 100 disk seeks take about half a second
– in that time can read 200 MB of data, if compressed on disk
(assuming 5 ms seek time, 50 MB/s transfer rate, compression factor 8)
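The note's arithmetic can be checked directly from the stated assumptions (5 ms seek time, 50 MB/s transfer rate, compression factor 8). The variable names below are illustrative only; the ratio of raw data to index size is derived from the two figures quoted for TREC Terabyte.

```python
# Assumptions quoted on the slide.
seek_time_s = 0.005          # 5 ms per disk seek
transfer_rate_mb_s = 50      # sequential read rate in MB/s
compression_factor = 8       # uncompressed size / compressed size
seeks = 100

# Time spent on 100 random seeks.
seek_budget_s = seeks * seek_time_s

# Compressed data that can be read sequentially in the same time,
# and the amount of uncompressed data this corresponds to.
compressed_mb = seek_budget_s * transfer_rate_mb_s
uncompressed_mb = compressed_mb * compression_factor

# Ratio of raw data to index size on TREC Terabyte (426 GB vs 4.9 GB).
raw_to_index = 426 / 4.9

print(f"{seek_budget_s:.2f} s of seeking")
print(f"{compressed_mb:.0f} MB compressed, {uncompressed_mb:.0f} MB of data")
print(f"raw data is about {raw_to_index:.0f}x the index size")
```

This is the argument for locality of access: the time lost to 100 seeks would otherwise pay for a sequential scan of a large compressed block.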

Conclusions

Summary
– mechanism for context-sensitive prefix search and completion
– very efficient in space and time, scales very well
– combines IR-style ranked retrieval with DB-style selects and joins

On our TODO list
– achieve both output-sensitivity and locality of access
– integrate top-k query processing
– find out which SQL queries can be supported efficiently
– deal with high dynamics (many insertions/deletions)

Thank you!