The CompleteSearch Engine: Interactive, Efficient, and Towards IR&DB Integration
Holger Bast, Ingmar Weber
Max-Planck-Institut für Informatik
Conference on Innovative Data Systems Research (CIDR 2007)
Summarized and presented by Jaehui Park, IDS Lab., Seoul National University

Copyright 2008 by CEBT

Introduction
Interactive search engine: CompleteSearch
Variety of complex features
  – Automatic query completion (IR perspective)
  – Semi-structured retrieval
  – Semantic search
  – DB-style joins and grouping (DB perspective)
  – Range search (theorist’s perspective)
  – Combining IR-style with DB-style querying
Example query
  – ir db integration conference:sigmod author: author[conference:sigmod] ir db integration
Context-sensitive prefix search and completion
  – For a given collection of documents, with a unique id for each document and a unique id for each word used in the collection, a context-sensitive prefix search and completion query is a pair (D, W), where D is a set of document ids and W is a range of word ids.
  – To process the query means to compute a ranked list of all pairs (d, w) such that word w occurs in document d, d is from D, and w is from W.
Novel index data structure: HYB [SIGIR 2006][SPIRE 2006]
  – Uses no more space than a state-of-the-art compressed inverted index
  – 10 times faster query processing

Model
Data is given as documents containing words
  – Documents have ids (D1, D2, …)
  – Words have ids (A, B, C, …)
Query
  – Given as a sorted list of document ids and a range of word ids
[Figure: example collection of documents, each shown as an id with its words (e.g. D13: A O E W H; D17: B W U K A; D88: P A E G Q), queried with the document list D13, D17, D88, … and the word range C D E F G H]

Model (continued)
Answer
  – All matching word-in-document pairs with scores
[Figure: the same example collection, with the matching pairs highlighted: (D13, E), (D88, E), (D88, G), …]
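
The query model above can be illustrated with a deliberately naive sketch. It scans three of the example documents from the figure (D13, D17, D88) and returns every word-in-document pair whose word falls in the given range. The function name `prefix_search` is hypothetical, single letters stand in for word ids, and scoring and ranking are left out; this is not how CompleteSearch itself processes queries.

```python
from typing import Dict, List, Tuple

# Three of the example documents from the figure (doc id -> words).
DOCS: Dict[int, List[str]] = {
    13: ["A", "O", "E", "W", "H"],
    17: ["B", "W", "U", "K", "A"],
    88: ["P", "A", "E", "G", "Q"],
}


def prefix_search(
    docs: Dict[int, List[str]],
    doc_ids: List[int],
    word_lo: str,
    word_hi: str,
) -> List[Tuple[int, str]]:
    """Return all pairs (d, w) with d in doc_ids, w in d, word_lo <= w <= word_hi.

    Naive scan for illustration only: no index, no scores, no ranking.
    """
    hits: List[Tuple[int, str]] = []
    for d in sorted(doc_ids):
        if d not in docs:
            continue
        for w in sorted(set(docs[d])):
            if word_lo <= w <= word_hi:
                hits.append((d, w))
    return hits


# Query: documents D13, D17, D88 and the word range C..H.
hits = prefix_search(DOCS, [13, 17, 88], "C", "H")
print(hits)  # [(13, 'E'), (13, 'H'), (88, 'E'), (88, 'G')]
```

D17 contributes no pairs because none of its words lies in the range C to H.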

HYB
Indexing data structure
  – Space usage: HYB ~= INV (inverted index)
      Empirical entropy: the inherent space complexity of an index
  – Processing time: HYB < INV
      The number of operations needed
      The latencies of access to data
Basic idea
  – Precompute inverted lists for unions of words
      The union of all lists for a word range W (W: arbitrary word range)
  – The basic unit of processing is a block
      A block corresponds to a range of words
      A block consists of all pairs (w, d)
      Each block is sorted by document id
  – Effective gap encoding scheme
      Compressed multiset
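
Gap encoding of a block's document ids can be sketched as follows. Because a block is sorted by document id and may list the same document several times (once per word), the ids form a sorted multiset whose successive differences are small non-negative numbers. The sketch shows only the gap transformation; the actual bit-level code used by HYB is not described on the slide, and the sample ids and function names are hypothetical.

```python
from typing import List


def gap_encode(sorted_ids: List[int]) -> List[int]:
    """Replace each id by its difference to the previous id (first gap is from 0)."""
    gaps: List[int] = []
    prev = 0
    for doc_id in sorted_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def gap_decode(gaps: List[int]) -> List[int]:
    """Invert gap_encode by taking running sums."""
    ids: List[int] = []
    total = 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids


# Hypothetical block: a document id repeats when several words of the
# block's range occur in that document, which yields gaps of 0.
doc_ids = [3, 5, 5, 6, 11, 11, 11, 15]
gaps = gap_encode(doc_ids)
print(gaps)  # [3, 2, 0, 1, 5, 0, 0, 4]
assert gap_decode(gaps) == doc_ids
```

Small gaps need fewer bits than the ids themselves, which is why sorting each block by document id helps compression.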

HYB (an example)
10 documents (ids: 3, 5, 6, 7, 8, 9, 11, 12, 13, 15)
A block for the word range A-D
  – HYB consists of a collection of such blocks
  – Word-in-document pairs: (w, d)
Two operations on the block
  – Intersection with a sorted list of document ids
  – Intersection with a list of word ids
For example
  – Query: ontol sem search
      The sorted list of ids of documents matching ontol sem
      The sorted list of document ids from the blocks containing all occurrences of the word search
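
The two block operations can be sketched as a single sequential pass each. The block contents below are made up for illustration (the slide's figure did not survive extraction); only the ten document ids and the word range A-D come from the slide. The function names are hypothetical, and pairs are stored as (doc id, word) so that the block is sorted by document id.

```python
from typing import List, Tuple

Pair = Tuple[int, str]  # (document id, word)


def intersect_with_docs(block: List[Pair], doc_ids: List[int]) -> List[Pair]:
    """Keep the pairs whose document id occurs in doc_ids.

    Both inputs must be sorted by document id, so one merge-style
    sequential pass over each is enough.
    """
    out: List[Pair] = []
    i = 0
    for d, w in block:
        while i < len(doc_ids) and doc_ids[i] < d:
            i += 1
        if i < len(doc_ids) and doc_ids[i] == d:
            out.append((d, w))
    return out


def filter_word_range(pairs: List[Pair], lo: str, hi: str) -> List[Pair]:
    """Keep the pairs whose word lies in the range lo..hi (inclusive)."""
    return [(d, w) for d, w in pairs if lo <= w <= hi]


# Hypothetical block for the word range A-D, sorted by document id.
block: List[Pair] = [
    (3, "D"), (5, "A"), (5, "B"), (6, "A"), (7, "C"), (8, "A"),
    (9, "A"), (11, "A"), (12, "B"), (13, "C"), (15, "D"),
]

matching_docs = [5, 9, 12, 13]  # e.g. documents matching the earlier query words
in_context = intersect_with_docs(block, matching_docs)
print(in_context)  # [(5, 'A'), (5, 'B'), (9, 'A'), (12, 'B'), (13, 'C')]
print(filter_word_range(in_context, "B", "C"))  # [(5, 'B'), (12, 'B'), (13, 'C')]
```

Neither operation sorts or jumps around in the block, which matches the slide's emphasis on sequential access.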

HYB
Block volumes
  – The number of pairs (w, d)
  – A small fraction of the total number of documents
      c < 1 (c ~= 0.2)
Advantages
  – Simple
  – Can be compressed extremely well
      Proven both theoretically and empirically in [SIGIR 2006]
  – Enables processing of prefix search and completion queries
      By mere sequential access
      Without sorting or other non-linear operations
  – Rank by a precomputed score
      For each word-in-document pair
      Okapi BM25 score + IDF

CompleteSearch’s feature set
Context-sensitive autocompletion search
  – Display completions of the last query word that would lead to good hits, as well as the best hits for any of these completions
Compute all completions of the last query word
  – Google Suggest, Apple’s Spotlight, AlltheWeb Live Search
  – We remark that a prototype of our engine already existed when Google Suggest and Apple Spotlight were launched.
  – Algorithmically easier
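
One way to see why a prefix corresponds to a range of word ids: if word ids follow the lexicographic order of the vocabulary, all completions of a prefix are contiguous. The sketch below finds that range with binary search. The vocabulary and the function name `prefix_range` are hypothetical, and the sentinel character assumes no word contains U+FFFF. It covers only the prefix-to-range step, not the context-sensitive part (restricting completions to the documents matching the earlier query words).

```python
import bisect
from typing import List, Tuple


def prefix_range(vocab: List[str], prefix: str) -> Tuple[int, int]:
    """Return (lo, hi) such that vocab[lo:hi] are exactly the words starting with prefix.

    vocab must be sorted; word ids are taken to be positions in vocab.
    """
    lo = bisect.bisect_left(vocab, prefix)
    # Assumption: no word contains U+FFFF, so prefix + U+FFFF sorts after
    # every word that starts with prefix.
    hi = bisect.bisect_left(vocab, prefix + "\uffff")
    return lo, hi


# Hypothetical sorted vocabulary.
vocab = ["data", "database", "db", "search", "sem",
         "semantic", "semistructured", "sigmod"]

lo, hi = prefix_range(vocab, "sem")
print((lo, hi))       # (4, 7)
print(vocab[lo:hi])   # ['sem', 'semantic', 'semistructured']
```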

CompleteSearch’s feature set
Structured search in XML documents
  – XML tags as special words
  – .. (two dots): proximity operator
  – Ex) tag: ..tag:subj..dbworld
      Retrieve all messages mentioning dbworld in their subject line
Semantic search
  – Tag documents
      Ex) politician:tony_blair
      The necessity of semantic annotation (proactive behavior) tagging
  – Which politician had a private audience with a pope?
      Query: audience pope politician:
      Compute a ranked list of completions of politician: which occur in the context of audience pope

CompleteSearch’s feature set
DB-style joins
Add structured data via special words
  – attribute:value, e.g. conference:sigmod
DB-style join functionality
  – Intersect the two lists of completions == attribute-value pairs for the join attributes
  – Ex) query -> table:ABC attr_k: and table:XYZ attr_k: (select … where …)
Something standard IR-style keyword search cannot handle
  – conference:sigir author:
  – conference:sigmod author:
  – The completions of the two queries
      Intersecting the two lists of authors
      No document is a SIGIR paper and a SIGMOD paper at the same time
  – When the answer is spread over several pages
      Which German chancellors had an audience with the pope?
      Combine information from the following
        One page about Angela Merkel (current German chancellor)
        Another page about the current pope having met Angela Merkel
      Intersect the completions of the following two queries
        german chancellor politician:
        audience pope politician:
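
The SIGIR/SIGMOD join can be sketched as two completion queries followed by a set intersection. The documents, author names, and the helper `completions` are hypothetical, and the helper scans documents naively rather than using an index; the point is only that the join happens on the completions, not on the documents.

```python
from typing import Dict, List, Set


def completions(docs: Dict[int, Set[str]], keywords: List[str], prefix: str) -> Set[str]:
    """Words starting with prefix that occur in documents containing all keywords."""
    result: Set[str] = set()
    for words in docs.values():
        if all(k in words for k in keywords):
            result.update(w for w in words if w.startswith(prefix))
    return result


# Hypothetical papers, each annotated with special attribute:value words.
docs: Dict[int, Set[str]] = {
    1: {"conference:sigir", "author:alice", "author:bob"},
    2: {"conference:sigmod", "author:bob", "author:carol"},
    3: {"conference:sigmod", "author:dave"},
}

sigir_authors = completions(docs, ["conference:sigir"], "author:")
sigmod_authors = completions(docs, ["conference:sigmod"], "author:")

# No single document matches both conferences, so the join is done on
# the two completion lists.
both = sorted(sigir_authors & sigmod_authors)
print(both)  # ['author:bob']
```

A plain keyword query for both conferences at once returns nothing here, which is the case the slide says standard IR-style search cannot handle.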

Lessons learned
Locality of access
  – For efficiency, access data as sequentially as possible
      Faster than random access (100 times)
      Average / maximal query time: 0.11 secs / 0.86 secs
  – Process as little data as possible per query
      Extensive use of compression
      Raw data: 426 GB; index size: 4.9 GB (within 1+ε of data entropy)
  – Hardware-aware implementation
      For algorithms highly optimized for sequential access to data, the choice of programming language is critical: C++
An interactive web application
  – AJAX
User feedback
  – The vast majority of users are not willing to read even the tiniest bit of documentation
  – Make the user interface intuitive and simple

Summary and things to do
Summary
  – We described CompleteSearch: a novel data structure and query algorithm for context-sensitive prefix search and completion
      Highly compressible, with high locality of access
      IR-style ranked retrieval
      DB-style selects and joins
      Natural blend of the two
  – Subsecond query times for up to a terabyte on a single machine
Things to do
  – We have not yet fully exploited the potential of top-k retrieval techniques
  – How to deal with dynamics (many insertions or deletions)