Type Less, Find More: Fast Autocompletion Search with a Succinct Index Holger Bast Max-Planck-Institut für Informatik Saarbrücken, Germany joint work with.

Presentation transcript:

Type Less, Find More: Fast Autocompletion Search with a Succinct Index Holger Bast Max-Planck-Institut für Informatik Saarbrücken, Germany joint work with Ingmar Weber at Google in Mountain View, USA, August 14

Basic Autocompletion: It's useful …
– saves typing
– no more information than necessary
– find out about formulations used, e.g. googlism, googlearchy
– error correction, e.g. googel

It's more useful …
Complete to phrases
– phrase mountain view → add word mountain_view to index
Complete to subwords
– compound word eigenproblem → add word problem to index
Complete to category names
– author Edleno Moura → add moura:edleno::author and edleno::moura:author
Faceted search
– add ct:conference:sigir
– add ct:author:edleno_moura
– add ct:year:2005
All via the same mechanism
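The "same mechanism" idea can be illustrated with a small sketch: every feature only adds artificial words to a document's index entries, and ordinary prefix completion does the rest. The function name `extra_index_words` and its parameters are assumptions made for this illustration; only the word formats (`mountain_view`, `ct:author:edleno_moura`) come from the slide.

```python
def extra_index_words(phrases, facets):
    """Return artificial words to index for one document (hypothetical helper).

    phrases: multi-word phrases found in the document, e.g. ["mountain view"]
    facets:  mapping category -> value, e.g. {"year": 2005}
    """
    words = []
    for phrase in phrases:
        # Phrase completion: join the words of the phrase with underscores.
        words.append("_".join(phrase.lower().split()))
    for category, value in facets.items():
        # Faceted search: one "ct:<category>:<value>" word per facet.
        normalized = "_".join(str(value).lower().split())
        words.append("ct:{}:{}".format(category, normalized))
    return words


print(extra_index_words(
    ["mountain view"],
    {"conference": "SIGIR", "author": "Edleno Moura", "year": 2005},
))
# ['mountain_view', 'ct:conference:sigir', 'ct:author:edleno_moura', 'ct:year:2005']
```

With such words in the index, a prefix query like `ct:author:` would return the author facet values as completions.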

Related Engines

Basic Problem Definition
Query
– a set D of documents (= hits for the first part of the query)
– a range W of words (= potential completions of the last word)
Answer
– all documents D' from D that contain a word from W
– all words W' from W that are contained in a document from D
Extensions (see paper at SIGIR'06)
– ranking (best hits from D' and best completions from W')
– positional information (proximity queries)
First try: inverted index (INV)
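As a reference point, the problem definition can be written down directly as a brute-force procedure, without any index. This is only a sketch of the specification, not of INV or HYB; the function name `autocomplete`, the representation of W as a prefix, and the toy documents are assumptions.

```python
def autocomplete(docs, D, prefix):
    """Brute-force answer to an autocompletion query.

    docs:   mapping doc id -> set of words in that document
    D:      set of doc ids (hits for the first part of the query)
    prefix: the word range W, given here as a prefix
    Returns (D', W') as sorted lists.
    """
    d_prime, w_prime = set(), set()
    for doc_id in D:
        matches = {w for w in docs[doc_id] if w.startswith(prefix)}
        if matches:
            d_prime.add(doc_id)   # document contains a word from W
            w_prime |= matches    # words from W occurring in a document from D
    return sorted(d_prime), sorted(w_prime)


# Hypothetical toy collection.
docs = {
    3: {"google", "mountain"},
    16: {"googlism"},
    18: {"google", "mountain", "moura"},
    25: {"googles", "moura"},
    66: {"googlearchy", "mountain"},
}
print(autocomplete(docs, {3, 16, 18, 25}, "mou"))
# ([3, 18, 25], ['mountain', 'moura'])
```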

Processing 1-word queries with INV
For example, goog*
D: all documents
W: all words matching goog*
Iterate over all words from W
  google       Doc. 18, Doc. 53, Doc. 591, ...
  googlearchy  Doc. 3, Doc. 66, Doc. 765, ...
  googles      Doc. 25, Doc. 98, Doc. 221, ...
  googling     Doc. 67, Doc. 189, Doc. 221, ...
  googlism     Doc. 16, Doc. 110, Doc. 141, ...
Merge the document lists (expensive!)
D': Doc. 3, Doc. 16, Doc. 18, Doc. 25, …
Output all words from the range as completions (trivial for 1-word queries)
W': google, googlearchy, googles, …

Processing multi-word queries with INV
For example, goog* mou*
D: Doc. 3, Doc. 16, Doc. 18, Doc. 25, … (hits for goog*)
W: all words matching mou*
Iterate over all words from W
  mould     Doc. 8, Doc. 23, Doc. 291, ...
  mount     Doc. 24, Doc. 36, Doc. 165, ...
  mountain  Doc. 3, Doc. 18, Doc. 66, ...
  mounting  Doc. 56, Doc. 129, Doc. 251, ...
  moura     Doc. 18, Doc. 21, Doc. 25, ...
Intersect each list with D, then merge
D': Doc. 3, Doc. 18, Doc. 25, …
Output all words with non-empty intersection
W': mountain, moura
Most intersections are empty, but INV has to compute them all!
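The procedure on this slide can be sketched as follows, using the (truncated) example lists above as toy data. The function name `inv_query` is an assumption, and Python set intersection stands in for whatever list-intersection routine a real engine would use on sorted, compressed lists.

```python
from bisect import bisect_left


def inv_query(index, sorted_vocab, D, prefix):
    """Answer a multi-word query with an inverted index.

    index:        mapping word -> list of doc ids
    sorted_vocab: all words of the index in sorted order
    D:            doc ids matching the first part of the query
    prefix:       prefix of the last query word
    """
    d_set = set(D)
    hits, completions = set(), []
    start = bisect_left(sorted_vocab, prefix)  # first word >= prefix
    for word in sorted_vocab[start:]:
        if not word.startswith(prefix):
            break  # left the word range W
        common = d_set.intersection(index[word])  # one intersection per word
        if common:
            completions.append(word)
            hits |= common  # merge the non-empty intersections
    return sorted(hits), completions


index = {
    "mould": [8, 23, 291],
    "mount": [24, 36, 165],
    "mountain": [3, 18, 66],
    "mounting": [56, 129, 251],
    "moura": [18, 21, 25],
}
print(inv_query(index, sorted(index), [3, 16, 18, 25], "mou"))
# ([3, 18, 25], ['mountain', 'moura'])
```

Three of the five intersections in this toy run are empty, which is the wasted work the slide points at.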

INV: Problems
Asymptotic time complexity is bad (for our problem)
– many intersections (one per potential completion)
– has to merge/sort (the non-empty intersections)
Still hard to beat INV in practice
– highly compressible: half the space on disk means half the time to read it
– very good locality of access: the ratio of random access time to sequential access time is 50,000 for disk, and still 100 for main memory
– simple code: instruction cache, branch prediction, etc.

A Hybrid Index (HYB)
Basic idea: have lists for ranges of words
  mould – moura: Doc. 3, Doc. 16, Doc. 18, Doc. 25, ...
Problem: not enough to show completions
Solution: store the word(s) along with each doc id
  mould – moura: Doc. 3, Doc. 16, Doc. 18, Doc. 25, ... with the words mould, moura, mount, mould, mountain, mounting, moura, ... stored alongside
But this looks very wasteful
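A minimal sketch of the idea, assuming a block is simply a list of (doc id, word) pairs sorted by doc id: a query then needs one scan of one block instead of one intersection per candidate word. The names `build_block` and `hyb_query` and the toy data are assumptions; compression and the choice of block boundaries are left out.

```python
def build_block(index, words):
    """Build one HYB-style block: all (doc id, word) pairs for a word range."""
    return sorted((doc, word) for word in words for doc in index[word])


def hyb_query(block, D, prefix):
    """Scan a single block once, keeping pairs that match both D and the prefix."""
    d_set = set(D)
    hits, completions = set(), set()
    for doc, word in block:
        if doc in d_set and word.startswith(prefix):
            hits.add(doc)
            completions.add(word)
    return sorted(hits), sorted(completions)


# Toy inverted lists, used here only to build the block.
index = {
    "mould": [8, 23, 291],
    "mount": [24, 36, 165],
    "mountain": [3, 18, 66],
    "mounting": [56, 129, 251],
    "moura": [18, 21, 25],
}
block = build_block(index, ["mould", "mount", "mountain", "mounting", "moura"])
print(hyb_query(block, [3, 16, 18, 25], "mou"))
# ([3, 18, 25], ['mountain', 'moura'])
```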

HYB: Details
HYB has a block for each word range, conceptually a list of doc ids with the words stored alongside; the words in the example are:
  D A C A B A C A D A A B C A C A
Replace doc ids by gaps and words by frequency ranks:
  3rd 1st 2nd 1st 4th 1st 2nd 1st 3rd 1st 1st 4th 2nd 1st 2nd 1st
Encode both gaps and ranks such that x → log2 x bits
  gaps: +0 → 0, +1 → …
  ranks: 1st (A) → 0, 2nd (C) → 10, 3rd (D) → …, 4th (B) → …
An actual block of HYB
How well does it compress? Which block size?
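The first transformation step (doc ids to gaps, words to frequency ranks) can be sketched as below. The word sequence is the one from the slide; the doc ids are hypothetical, since they are not preserved in the transcript. Ties in frequency are broken by first occurrence, an assumption that happens to reproduce the slide's ranking (D 3rd, B 4th). The final bit-level code is not implemented here.

```python
from collections import Counter


def to_gaps_and_ranks(pairs):
    """pairs: list of (doc id, word), sorted by doc id.

    Returns (gaps, ranks, words_by_rank). The first gap is relative to 0,
    and rank 1 is the most frequent word of the block.
    """
    counts = Counter(word for _, word in pairs)
    first_seen = {}
    for i, (_, word) in enumerate(pairs):
        first_seen.setdefault(word, i)
    # Most frequent word first; ties broken by first occurrence (assumption).
    words_by_rank = sorted(counts, key=lambda w: (-counts[w], first_seen[w]))
    rank = {w: r for r, w in enumerate(words_by_rank, start=1)}
    gaps, ranks, prev = [], [], 0
    for doc_id, word in pairs:
        gaps.append(doc_id - prev)
        ranks.append(rank[word])
        prev = doc_id
    return gaps, ranks, words_by_rank


words = "DACABACADAABCACA"
doc_ids = [1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7, 8]  # hypothetical
gaps, ranks, words_by_rank = to_gaps_and_ranks(list(zip(doc_ids, words)))
print(words_by_rank)  # ['A', 'C', 'D', 'B']
print(ranks)          # [3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 1, 4, 2, 1, 2, 1]
print(gaps)           # [1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1]
```

Small gaps and low ranks dominate, which is what makes short codes for small values pay off.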

INV vs. HYB: Space Consumption
Theorem: The empirical entropy of INV is Σ_i n_i ∙ ( 1/ln 2 + log2(n/n_i) )
Theorem: The empirical entropy of HYB with block size ε∙n is Σ_i n_i ∙ ( (1+ε)/ln 2 + log2(n/n_i) )
where n_i = number of documents containing the i-th word, n = number of documents

            MEDICINE         WIKIPEDIA        TREC.GOV
docs        44,015           2,866,503        25,204,013
words       263,817          6,700,119        25,263,176
positions   with positions   with positions   no positions
raw size    452 MB           7.4 GB           426 GB
INV         13 MB            0.48 GB          4.6 GB
HYB         14 MB            0.51 GB          4.9 GB

Nice match of theory and practice
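To get a feeling for the two bounds, they can be evaluated numerically. The sketch below simply evaluates the two formulas as stated on the slide; the function names and the toy document frequencies are assumptions. The two expressions differ by exactly ε/ln 2 bits per posting, so for small ε the HYB bound stays close to the INV bound.

```python
import math


def inv_entropy_bits(doc_freqs, n):
    """Sum of n_i * (1/ln 2 + log2(n/n_i)) over all words."""
    return sum(ni * (1 / math.log(2) + math.log2(n / ni)) for ni in doc_freqs)


def hyb_entropy_bits(doc_freqs, n, eps):
    """Sum of n_i * ((1+eps)/ln 2 + log2(n/n_i)) over all words."""
    return sum(ni * ((1 + eps) / math.log(2) + math.log2(n / ni)) for ni in doc_freqs)


doc_freqs = [8, 4, 2, 2]  # hypothetical n_i values
n = 16                    # hypothetical number of documents
inv_bits = inv_entropy_bits(doc_freqs, n)
hyb_bits = hyb_entropy_bits(doc_freqs, n, eps=0.1)
print(round(inv_bits, 2), round(hyb_bits, 2))
print("overhead per posting:", (hyb_bits - inv_bits) / sum(doc_freqs))
```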

INV vs. HYB: Query Time
Experiment: type ordinary queries from left to right: go, goo, goog, googl, google, google mo, google mou, ...

Collection   Size                                Queries                              INV avg / max           HYB avg / max
MEDICINE     44,015 docs, 263,817 words          5,732 real queries, with proximity   0.03 secs / 0.38 secs   0.003 secs / 0.06 secs
WIKIPEDIA    2,866,503 docs, 6,700,119 words     100 random queries, with proximity   0.17 secs / 2.27 secs   0.05 secs / 0.49 secs
TREC.GOV     25,204,013 docs, 25,263,176 words   50 TREC queries, no proximity        0.58 secs / … secs      0.11 secs / 0.86 secs

HYB better by an order of magnitude
Theoretical analysis → see paper at SIGIR'06

System Design: High Level View
– Compute Server: C++
– Web Server: PHP
– User Client: JavaScript
Debugging such an application is hell!

Summary of Results
Properties of HYB
– highly compressible (just like INV)
– fast prefix-completion queries (perfect locality of access)
– fast indexing (no full inversion necessary)
Autocompletion and more
– phrase and subword completion, semantic completion, XML support, …
– faceted search (Workshop Talk on Thursday)
– efficient DB joins: author[sigir sigmod] (NEW)
All with one and the same (efficient) mechanism