Search Engines WS 2009 / 2010
Prof. Dr. Hannah Bast
Chair of Algorithms and Data Structures
Department of Computer Science
University of Freiburg
Lecture 6, Thursday, November 26th, 2009 (Prefix Search)

Overview of today's Lecture

Everything about prefix search:
– how to realize it … using an ordinary inverted index
– how to realize it efficiently … using a special kind of index
– what prefix search is good for, for example, synonym search

Prefix Search Demo

Obvious advantages:
– type less
– find more
– find out what words there are in the collection

Less obvious advantages:
– many advanced search features reduce to prefix search:
  – synonym search
  – error-tolerant search
  – database-like search
  – semantic search

Prefix Search via an Inverted Index

Binary search on the (sorted) vocabulary (see the sketch below):
– let the size of the vocabulary be n
– for example, bas*
– time ~ log₂ n to find the first match (locate bas)
– time ~ log₂ n to find the last match (locate bat)
– so time ~ log₂ n overall
– for n = 100 million ≈ 2^27 … log₂ n is 27
– one string comparison takes ≈ 1 µsec
– so a fraction of 1 msec even for large vocabularies
– but this only works if the vocabulary fits into memory
– and 100 million words take up ≈ 1 GB

about aware banks base based bases basics basis bruno cache call cases …
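A minimal sketch of this in Python, assuming an in-memory sorted vocabulary (the word list is made up); bisect is the stdlib binary search:

```python
import bisect

vocabulary = ["about", "aware", "banks", "base", "based", "bases",
              "basics", "basis", "bruno", "cache", "call", "cases"]

def prefix_range(vocab, prefix):
    """Two binary searches locate all words starting with prefix."""
    lo = bisect.bisect_left(vocab, prefix)             # first match ("locate bas")
    hi = bisect.bisect_left(vocab, prefix + "\uffff")  # one past the last match ("locate bat")
    return vocab[lo:hi]

print(prefix_range(vocabulary, "bas"))
# ['base', 'based', 'bases', 'basics', 'basis']
```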

Permuterm Index

What if we allow the * in any place?
– for example, ba*s
– should find banks, bases, basics, and basis
– no longer a range of words (worst case: * at the beginning)
– scanning the whole vocabulary is too expensive: for n = 100 million → 100 seconds

Idea: Permuterm index
– append a $ to each word
– add all rotations (cyclic permutations) of each word
– for example, for base$ add each of base$, ase$b, se$ba, e$bas, $base

Permuterm Index

Assume a three-word vocabulary: banks, base, basics

Permuterm index (see the sketch below):
– simply all rotations, sorted
– each rotation points to the inverted list of the word of which it is a rotation (no need to duplicate the lists for each rotation)
– now, for the query ba*s, find matches for s$ba*
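A minimal sketch of the permuterm construction and lookup in Python (the vocabulary is the slide's; the function names are illustrative):

```python
import bisect

def rotations(word):
    w = word + "$"
    return [w[i:] + w[:i] for i in range(len(w))]

words = ["banks", "base", "basics"]
# each rotation keeps a pointer back to its word (and thus its inverted list)
permuterm = sorted((rot, w) for w in words for rot in rotations(w))
keys = [rot for rot, _ in permuterm]

def permuterm_search(query):
    # "ba*s" -> "ba*s$" -> rotate so the * ends up at the back: prefix "s$ba"
    q = query + "$"
    star = q.index("*")
    rotated = q[star + 1:] + q[:star]
    lo = bisect.bisect_left(keys, rotated)
    hi = bisect.bisect_left(keys, rotated + "\uffff")
    return sorted({w for _, w in permuterm[lo:hi]})

print(permuterm_search("ba*s"))  # ['banks', 'basics']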

Permuterm Index

Efficiency:
– the blowup in vocabulary size is about a factor of 8
– a factor of 8 increases log₂ n by only 3
– so no problem for the binary searches
– but a very large vocabulary might not fit into memory anymore
– note that the size of the inverted lists remains the same (we did not copy the lists, just pointed to them)

Data structure for very large vocabularies:
– the B-tree
– with today's memory, depth 2 is usually enough

[maybe draw picture of B-tree on separate slide]

Permuterm Index

How about more than one *?
– for example, in*ma*tik
– should find informatik

Simple trick (see the sketch below):
– first collapse to one *, as in in*tik
– we already know how to handle this query
– but this will find a (typically strict) superset of the matches
– for example, it will also find intervallarithmetik
– anyway, the number of matches will be relatively small
– so just go over them and filter out the false positives
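A hedged sketch of the collapse-and-filter trick in Python; the candidate list stands in for the result of any single-wildcard search from above, and fnmatchcase is the stdlib pattern matcher:

```python
from fnmatch import fnmatchcase

def collapse(query):
    """Collapse a multi-* query to a single-* query, e.g. in*ma*tik -> in*tik."""
    first, last = query.index("*"), query.rindex("*")
    return query[:first] + "*" + query[last + 1:]

query = "in*ma*tik"
# superset of matches, as returned by a search for the collapsed query in*tik
one_star_matches = ["informatik", "intervallarithmetik"]
result = [w for w in one_star_matches if fnmatchcase(w, query)]
print(collapse(query), result)  # in*tik ['informatik']
```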

N-Gram Index

Can we do with less space than Permuterm?
– YES WE CAN!

Idea: index not the words, but the n-grams of the words
– the n-grams of a word = its substrings of length n
– for example, the 3-grams of $informatik$ are $in, inf, nfo, for, orm, rma, mat, ati, tik, ik$

N-gram index:
– Variant 1: let each n-gram point to the words that contain it
– Variant 2: let each n-gram point directly to the union of the inverted lists of the words that contain it

N-Gram Index

Why more space-efficient than the Permuterm index?
– because many n-grams are common to many words (whereas the rotations were unique)
– and anyway, the number of 3-grams is bounded: say 128 = 2^7 symbols occur at all, then there are at most 2^21 ≈ 2 million 3-grams over these symbols

How to query with the n-gram index? (see the sketch below)
– for example, search in*tik
– it contains the n-grams $in, tik, and ik$
– boolean query for $in AND tik AND ik$
– again, we must post-filter … why? (the n-grams constrain neither their order nor their position in the word)
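A minimal sketch of Variant 1 in Python (made-up vocabulary, illustrative names); the classic red*/retired pair shows why the post-filter is needed:

```python
from collections import defaultdict
from fnmatch import fnmatchcase

def three_grams(s):
    return {s[i:i+3] for i in range(len(s) - 2)}

words = ["red", "redo", "retired", "informatik"]
index = defaultdict(set)                 # 3-gram -> words containing it
for w in words:
    for g in three_grams("$" + w + "$"):
        index[g].add(w)

def ngram_search(query):
    # 3-grams of the query that do not cross the *,
    # e.g. "in*tik" contributes "$in", "tik", "ik$"
    parts = query.split("*")
    parts[0] = "$" + parts[0]
    parts[-1] = parts[-1] + "$"
    grams = set().union(*(three_grams(p) for p in parts))
    candidates = set.intersection(*(index[g] for g in grams))  # boolean AND
    return sorted(w for w in candidates if fnmatchcase(w, query))

print(ngram_search("red*"))    # ['red', 'redo'] -- "retired" contains
                               # "$re" and "red" but is filtered out
print(ngram_search("in*tik"))  # ['informatik']
```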

Merging the Inverted Lists

Whatever we do …
– … be it binary search, Permuterm, or n-gram index
– we end up with a large number of inverted lists (one for each word matching the wildcard query)
– these have to be merged (now it's really a merge, not an intersection)
– merging k sorted lists with a total of n elements takes time n ∙ log k

K-Way Merge

Algorithm (see the sketch below):
– for each of the k lists, maintain the current position
– in each step, determine the smallest of the elements at the k current positions
– output that element and advance by one in that list
– this requires a data structure that, at each point, holds k elements, can return the smallest of them … fast, and can replace it with a new one
– this is called a (fixed-size) priority queue
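A minimal sketch of the k-way merge in Python, built on the stdlib heapq (which also ships a ready-made heapq.merge doing the same job); the input lists are made up:

```python
import heapq

def k_way_merge(lists):
    # the heap holds one (current element, list index, position) per list
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)                      # k elements, built in O(k)
    out = []
    while heap:
        value, i, pos = heap[0]              # smallest of the k fronts
        out.append(value)
        if pos + 1 < len(lists[i]):
            # replace-min: advance in list i, one O(log k) sift
            heapq.heapreplace(heap, (lists[i][pos + 1], i, pos + 1))
        else:
            heapq.heappop(heap)              # list i is exhausted
    return out

print(k_way_merge([[1, 4, 7], [2, 5], [3, 6, 8]]))
# [1, 2, 3, 4, 5, 6, 7, 8]  -- n elements, O(n log k) total
```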

Priority Queue of Fixed Size k

A fixed-size priority queue can be easily realized with a heap data structure:
– at each point in time, maintain the heap property: each element is larger than its parent

[show example of a heap with 8 elements]

Priority Queue of Fixed Size k

Analysis:
– the heap obviously achieves time ~ log k per replace-min (a typical priority queue supports get-min, delete-min, and insert separately, but here we only need replace-min)
– so merging k lists with a total of n elements can be done in time n ∙ log k
– could it possibly be done (asymptotically) faster?
– No! (At least not comparison-based.) Why? Otherwise we could sort faster than n ∙ log n: split the n elements into k = n singleton lists (each trivially sorted) and merge them.

Efficiency of the Approaches so far

Summary of what we have seen so far:
– space consumption can be an issue for Permuterm
– finding (a superset of) the matching words is very fast
– but then we have to merge all these inverted lists
– and that is very, very, very expensive:
  – the cost is C ∙ log₂ k ∙ (total size of the inverted lists)
  – k can easily become 128 → log₂ k = 7
  – C ≈ 5 compared to a simple scan → C ∙ log₂ k ≈ 50
  – the total size of these inverted lists is, say, several times larger than that of a typical inverted list of a single word
  – that is, prefix search is several hundred times more expensive than an ordinary keyword search

The HYB Index

HYB is the index behind our CompleteSearch engine.

Simple idea behind HYB:
– precompute inverted lists for unions of words
– in the following, let words be capital letters: A, B, C, …
– along with each doc id, we now also have to store the word because of which that doc id is in the list

DACABACADAABCACA   list for A–D
EFGJHIIEFGHJI      list for E–J
LNMNNKLMNMKLMKL    list for K–N

The HYB Index

How do we do prefix search here? (see the sketch below)
– simply find the enclosing blocks (typically only one)
– scan the block and filter out the false positives (that's why we need to store the words along with the doc ids)
– we thus avoid the merge overhead (which gave the bulk of the cost before: a factor of ≈ 50)

DACABACADAABCACA   list for A–D
EFGJHIIEFGHJI      list for E–J
LNMNNKLMNMKLMKL    list for K–N
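A minimal sketch of a block scan in Python. The block mirrors the slide's "list for A–D": each entry is a (doc id, word) pair; the doc ids 1..16 are made up, the words are from the slide:

```python
words_in_block = "DACABACADAABCACA"
block_A_to_D = list(enumerate(words_in_block, start=1))  # (doc id, word) pairs

def hyb_prefix_search(block, word_range):
    """Scan the enclosing block, keep doc ids whose word is in the query's range."""
    return [doc_id for doc_id, word in block if word in word_range]

# a query whose matching words are {A, B} (a stand-in for a prefix query
# like bas* whose word range falls inside this block)
print(hyb_prefix_search(block_A_to_D, {"A", "B"}))
# [2, 4, 5, 6, 8, 10, 11, 12, 14, 16]
```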

The HYB Index

Timewise, this looks good …
– … but what about the space?
– we know that lists of sorted doc ids can be compressed well
– but the lists of words are going to completely spoil it, right?

DACABACADAABCACA   list for A–D
EFGJHIIEFGHJI      list for E–J
LNMNNKLMNMKLMKL    list for K–N

The HYB Index — Space Analysis

Let us analyze:
– the entropy of the ordinary inverted index (INV)
– the entropy of the HYB index
– (we already know that the inverted index has great space complexity)
– (we already know that we can achieve compression close to the entropy)

Notation:
– let n_i denote the size of the inverted list of the i-th word
– so the sum of all n_i is just N = the total number of word occurrences
– we will not assume anything about the n_i
– let n be the total number of documents

Entropy of the INV Index

We will show that the entropy of INV is close to

  Σ n_i ∙ (1/ln 2 + log₂(n/n_i))

[prove it live]
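The proof is done live in the lecture; the following is a hedged reconstruction of the standard counting argument, not a verbatim transcript. The inverted list of word i is one of the binom(n, n_i) subsets of size n_i of the n doc ids, so its entropy is about log₂ binom(n, n_i), and for n_i ≪ n:

```latex
\[
  \log_2 \binom{n}{n_i}
  \;\approx\; n_i \log_2 \frac{n\,e}{n_i}
  \;=\; n_i \left( \log_2 e + \log_2 \frac{n}{n_i} \right)
  \;=\; n_i \left( \frac{1}{\ln 2} + \log_2 \frac{n}{n_i} \right)
\]
```

Summing over all words gives the formula above; the log₂ e = 1/ln 2 ≈ 1.44 term is the fixed overhead per doc id on top of the gap term.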

Entropy of the HYB Index

We will show that the entropy of HYB is at most

  Σ n_i ∙ ((1+ε)/ln 2 + log₂(n/n_i))

where ε ∙ n is the average number of doc ids in a block.

[prove it live]
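A small numeric sanity check of the two formulas (all numbers made up): compared to INV, the HYB bound only adds ε/ln 2 bits per occurrence, which is a tiny relative overhead:

```python
from math import log, log2

n = 10_000_000                       # total number of documents
list_sizes = [10, 1_000, 100_000]    # example n_i values
eps = 0.1                            # eps * n = average doc ids per block

inv = sum(ni * (1 / log(2) + log2(n / ni)) for ni in list_sizes)
hyb = sum(ni * ((1 + eps) / log(2) + log2(n / ni)) for ni in list_sizes)
print(f"INV ≈ {inv:,.0f} bits, HYB ≈ {hyb:,.0f} bits, "
      f"overhead {hyb / inv - 1:.1%}")
```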

Synonym Search

Naïve solution:
– have a synonym dictionary = thesaurus
– at query time, look up each query word in the thesaurus
– if it's there, replace it with a disjunction of all its synonyms
– for example, uni AND studieren could become (uni OR universität OR hochschule) AND (studieren OR lernen OR abhängen)
– same problem as for prefix search: expanding the query is not hard, but, again, computing the union of all the inverted lists (there could be very many) is very expensive

Synonym Search

Idea: do it via prefix search (see the sketch below)
– give each group of synonyms a group id
– uni, universität, hochschule, etc. get the id 174
– studieren, lernen, abhängen, etc. get the id 99
– now, in your vocabulary, prepend the group id to each word:
  syngroup:174:uni   syngroup:174:universität   syngroup:99:studieren
– at query time, determine the group id for each query word
– and replace uni studieren by syngroup:174:* syngroup:99:*
– if you don't want synonym search, just replace the * by the query word, as in syngroup:174:uni syngroup:99:studieren
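A minimal sketch of the query rewriting in Python; the group ids 174 and 99 are from the slide, everything else (names, the toy thesaurus) is illustrative:

```python
groups = {"uni": 174, "universität": 174, "hochschule": 174,
          "studieren": 99, "lernen": 99, "abhängen": 99}

# vocabulary with the group id prepended to each word
vocabulary = sorted(f"syngroup:{gid}:{word}" for word, gid in groups.items())
print(vocabulary[:2])  # ['syngroup:174:hochschule', 'syngroup:174:uni']

def rewrite(query_word, with_synonyms=True):
    gid = groups.get(query_word)
    if gid is None:
        return query_word                     # not in any synonym group
    if with_synonyms:
        return f"syngroup:{gid}:*"            # prefix query = whole group
    return f"syngroup:{gid}:{query_word}"     # exact word only

print([rewrite(w) for w in ["uni", "studieren"]])
# ['syngroup:174:*', 'syngroup:99:*']
print(rewrite("uni", with_synonyms=False))    # 'syngroup:174:uni'
```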

Prefix Search is extremely universal

Basically everything can be done with prefix search:
– prefix search
– autocompletion
– synonym search
– error-tolerant search
– database-like search
– semantic search
– factorize arbitrarily large numbers
– failure-safe lecture recording
– automatic exercise sheet solving
– and many more …