Auto-completion Search

Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"

How it works
What's the dictionary?

Trie for the Dictionary
[Figure: compacted trie of the dictionary. Edge labels: s, y, z, omo, stile, aibelyite, zyg, czecin, etic, ygy, ial; the nodes carry integer labels.]
Pro: O(p) search time = path scan, where p is the length of the query prefix P
Cons: space for edge labels + node labels + tree structure
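A minimal sketch of the prefix search on a trie, under simplifying assumptions: the trie here is uncompacted (one character per edge) rather than the compacted trie of the slide, and the class and function names are hypothetical. The walk down the prefix is the O(p) path scan; collecting the completions then costs time proportional to the size of the subtree.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False  # True if a dictionary string ends here


def build_trie(words):
    """Insert every word character by character (uncompacted trie)."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def complete(root, prefix):
    """Return all dictionary strings that start with prefix, sorted."""
    node = root
    for ch in prefix:  # O(p) path scan
        node = node.children.get(ch)
        if node is None:
            return []
    found = []
    stack = [(node, prefix)]
    while stack:  # visit the subtree below the prefix
        cur, text = stack.pop()
        if cur.is_word:
            found.append(text)
        for ch, child in cur.children.items():
            stack.append((child, text + ch))
    return sorted(found)
```

The words used to exercise it can be spelled out from the edge labels of the figure (for example s + y + stile = systile); that reading of the figure is an assumption.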

Top-1
P = sy
[Figure: the dictionary trie with a score attached to each string, queried with P = sy.]
What's the ranking/scoring of the answers?

Top-1: How to speed-up
P = sy
[Figure: the same trie with additional annotations on the internal nodes.]
How to compute the top-1 in O(1) time?

Top-2
P = sy
[Figure: the same trie with a pair of annotations on each internal node.]
How to compute the top-2 in O(1) time?
Top-k in O(1) time, but k× space
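One plausible reading of "Top-k in O(1) time, but k× space" is that every trie node stores the k best-scoring strings below it, so that after the O(p) walk the answer is read off directly. The sketch below follows that reading; the dict-based node layout, the function names and the scores used to try it are all hypothetical.

```python
import heapq


def new_node():
    # "top" holds up to k (score, word) pairs, best first
    return {"children": {}, "top": []}


def build_topk_trie(scored_words, k):
    """Build an uncompacted trie where each node caches its top-k strings."""
    root = new_node()
    for word, score in scored_words.items():
        node = root
        path = [root]
        for ch in word:
            node = node["children"].setdefault(ch, new_node())
            path.append(node)
        # Every node on the path is a prefix of word: update its cached list.
        for n in path:
            n["top"] = heapq.nlargest(k, n["top"] + [(score, word)])
    return root


def top_k(root, prefix):
    """Walk the prefix, then return the cached answers without any search."""
    node = root
    for ch in prefix:
        node = node["children"].get(ch)
        if node is None:
            return []
    return [word for _, word in node["top"]]
```

The price is the k× space of the slide: each node keeps k entries regardless of how rarely it is queried.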

Top-k: How to squeeze?
P = sy
[Figure: the trie with the scores of its strings laid out in an array, in leaf order.]

String  1  2  3  4  5  6  7
Score   8  2  1  4  2  3  5

Take the range of strings prefixed by P, and proceed by D&C.

Top-k: How to squeeze?

String  1  2  3  4  5  6  7
Score   8  2  1  4  2  3  5

The strings prefixed by P span the range <L,R>; proceed by D&C.
RMQ-query in O(1) time and O(n) space.
- Let H be a max-heap of size k; keep also min[H] and max[H].
- Initialize H with k pairs <-∞, NULL>.
- Given the range <L,R> (here <1,4>):
  - Compute the max-score in Array[L,R] (pos. M, value m).
  - If m ≤ min[H], skip; else:
    - Insert <m, string> in H;
    - If size(H) > k then remove min[H];
    - Recurse on <L,M-1> and <M+1,R>, if not empty.
Time: O(k) time and space
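The steps above can be sketched as follows. Two substitutions are assumptions of this sketch, not the slide's choices: the RMQ is a sparse table (O(1) query but O(n log n) space, instead of the O(n)-space structure mentioned), and H is a min-heap of at most k pairs, so that its root plays the role of min[H] and no dummy pairs are needed. Indices are 0-based and all names are hypothetical.

```python
import heapq


def build_sparse_table(a):
    """table[j][i] = index of a maximum of a[i : i + 2**j]."""
    n = len(a)
    table = [list(range(n))]
    j = 1
    while (1 << j) <= n:
        prev, half = table[-1], 1 << (j - 1)
        row = []
        for i in range(n - (1 << j) + 1):
            l, r = prev[i], prev[i + half]
            row.append(l if a[l] >= a[r] else r)
        table.append(row)
        j += 1
    return table


def rmq(a, table, lo, hi):
    """Index of a maximum in a[lo..hi], both ends inclusive."""
    j = (hi - lo + 1).bit_length() - 1
    l, r = table[j][lo], table[j][hi - (1 << j) + 1]
    return l if a[l] >= a[r] else r


def top_k_pruned(scores, table, lo, hi, k):
    """Top-k (score, index) pairs in scores[lo..hi], best first."""
    heap = []  # min-heap: heap[0] is min[H]

    def visit(l, r):
        if l > r:
            return
        m = rmq(scores, table, l, r)
        if len(heap) == k and scores[m] <= heap[0][0]:
            return  # nothing in this range can enter H
        heapq.heappush(heap, (scores[m], m))
        if len(heap) > k:
            heapq.heappop(heap)  # remove min[H]
        visit(l, m - 1)
        visit(m + 1, r)

    visit(lo, hi)
    return sorted(heap, reverse=True)
```

The table is built once per score array and shared by all queries, so only the recursion is paid per prefix.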

Example for Top-2
Consider this other array:

String  1  2  3  4  5  6  7
Score   4  2  1  8  2  3  5

Range: operations (pairs are <score, string>)
[1,7]: H ← <8,4>; recurse on [1,3] and [5,7]
[1,3]: H = {<8,4>} ← <4,1>; recurse on [1,0] and [2,3]
[5,7]: H = {<8,4>,<4,1>} ← <5,7>; delete <4,1> from H; recurse on [5,6] and [8,7]
[2,3]: H = {<8,4>,<5,7>} ← <2,2>; since min[H] = 5, do not insert in H
[5,6]: H = {<8,4>,<5,7>} ← <3,6>; since min[H] = 5, do not insert in H
Result: H = {<8,4>, <5,7>}

A smarter approach

String  1  2  3  4  5  6  7
Score   8  2  1  4  2  3  5

The strings prefixed by P span the range <L,R>; proceed by D&C.
Let H be a max-heap, including items <val, string, [low,high]>
Compute max-score in Array[L,R] (pos. M, value m)
i = 0; insert <m, string[M], L, R> in H
While (i < k) do
    Extract <x, string[X], Lx, Rx> from H, where x is the max-value in H
    Return string[X] as one of the top-k strings
    Compute max-score in Array[Lx,X-1] (pos. M', value m'); insert <m', string[M'], Lx, X-1> in H
    Compute max-score in Array[X+1,Rx] (pos. M'', value m''); insert <m'', string[M''], X+1, Rx> in H
    i++
Time: still O(k) time and space
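A sketch of this best-first variant, with two assumptions stated up front: `argmax` is a linear scan standing in for the O(1) RMQ, and empty ranges are simply not inserted in H (the pseudocode leaves that case implicit). Python's `heapq` is a min-heap, so scores are negated to extract the maximum. Indices are 0-based and the function names are hypothetical.

```python
import heapq


def argmax(scores, lo, hi):
    """Stand-in for an O(1) RMQ: index of a maximum in scores[lo..hi]."""
    return max(range(lo, hi + 1), key=scores.__getitem__)


def top_k_best_first(scores, lo, hi, k):
    """Return the top-k (score, index) pairs of scores[lo..hi], best first."""
    heap = []  # items: (-score, position, range_lo, range_hi)

    def push(l, r):
        if l <= r:  # skip empty ranges
            m = argmax(scores, l, r)
            heapq.heappush(heap, (-scores[m], m, l, r))

    push(lo, hi)
    result = []
    while heap and len(result) < k:
        neg_score, x, l, r = heapq.heappop(heap)  # current max in H
        result.append((-neg_score, x))
        push(l, x - 1)   # best candidate to the left of x
        push(x + 1, r)   # best candidate to the right of x
    return result
```

Unlike the pruned recursion, the answers come out already sorted by decreasing score, and each extraction adds at most two items to H.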

Random access to postings lists and other data types

A basic problem!
Dog → 1 12 15 20 22 ....
- Array of n skip pointers to an array of m integers
- (log m) bits per pointer = (n log m) bits = 32 n bits
- It is effective for few pointers
D = AbacoBattleCarColdCod ....
- Array of n string pointers to strings of total length m
- (n log m) bits = 32 n bits
- It is independent of string length
B = 100001000001001000100 ....
We aim at achieving ≈ n log(m/n) bits < n log m
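To make the role of B concrete, here is a small sketch that rebuilds the bit vector of the example: B has a 1 at the first character of every string of D and 0 elsewhere, so the i-th string lies between the i-th and the (i+1)-th 1. The function names are hypothetical, strings are assumed non-empty, and the i-th 1 is found by a linear scan purely for illustration.

```python
def mark_starts(strings):
    """Concatenate the strings into D and mark each start with a 1 in B."""
    d = "".join(strings)
    b = "".join("1" + "0" * (len(s) - 1) for s in strings)
    return d, b


def ith_one(b, i):
    """0-based position of the i-th 1 in b (i >= 1), or len(b) if absent."""
    seen = 0
    for pos, bit in enumerate(b):
        if bit == "1":
            seen += 1
            if seen == i:
                return pos
    return len(b)  # sentinel: one past the end


def get_string(d, b, i):
    """The i-th string (1-based) is D[start of i-th 1 : start of (i+1)-th 1]."""
    return d[ith_one(b, i):ith_one(b, i + 1)]
```

With the five strings of the slide this reproduces B = 100001000001001000100, and no explicit pointer array is stored.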

Rank/Select
Wish to index the bit vector B (possibly compressed).
B = 00101001010101011111110000011010101....
m = |B|, n = #1
Rank_b(i) = number of b in B[1,i]
Select_b(i) = position of the i-th b in B
Example: Rank_1(6) = 2
There exist data structures that solve this problem in O(1) query time and very small extra space (i.e. +o(m) bits).
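As a reference point for the two definitions, a naive sketch that scans B directly (linear time, no index). Positions are 1-based as on the slide, B is held as a Python string of "0"/"1" characters, and the function names are hypothetical.

```python
def rank(bits, b, i):
    """Number of occurrences of symbol b in bits[1..i] (1-based, inclusive)."""
    return bits[:i].count(b)


def select(bits, b, i):
    """1-based position of the i-th occurrence of symbol b in bits."""
    pos = -1
    for _ in range(i):
        pos = bits.find(b, pos + 1)
        if pos == -1:
            raise ValueError("fewer than i occurrences of b")
    return pos + 1
```

The two operations are inverse in the sense that rank(bits, b, select(bits, b, i)) == i whenever the i-th occurrence exists.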

The Bit-Vector Index: |B| + o(|B|)
m = |B|, n = #1s
Goal. B is read-only, and the additional index takes o(m) bits.
Rank
B = 00101001010101011 1111100010110101 0101010111000....
[Figure: B is split into buckets of Z bits, each storing its absolute Rank1 (8, 18, ...); every bucket is split into blocks of z bits, each storing its bucket-relative Rank1 (4, 5, 8, ...); a lookup table with columns block, pos, #1 (rows such as block 0000, pos 1 and block 1011, pos 2) answers queries inside a block.]
Setting Z = poly(log m) and z = (1/2) log m:
- Extra space is (m/Z) log m + (m/z) log Z + o(m) = O(m loglog m / log m) = o(m) bits
- Rank time is O(1)
- The term o(m) is crucial in practice
- B is untouched (not compressed)
There exists a Bit-Vector Index taking o(m) extra bits and constant time for Rank/Select. B is needed and read-only!
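A sketch of the two-level Rank directory, under assumptions: the sizes Z = 16 and z = 4 are fixed constants chosen for readability rather than poly(log m) and (1/2) log m, the lookup table for in-block queries is replaced by a direct count over at most z bits, and the class name is hypothetical. B is kept untouched as a string.

```python
class RankIndex:
    def __init__(self, bits, Z=16, z=4):
        assert Z % z == 0, "a bucket must contain a whole number of blocks"
        self.bits, self.Z, self.z = bits, Z, z
        self.bucket_abs = []  # 1s before each bucket (absolute Rank1)
        self.block_rel = []   # 1s before each block, from its bucket start
        total = rel = 0
        # The +1 adds a sentinel entry so that i == len(bits) is answerable.
        for start in range(0, len(bits) + 1, z):
            if start % Z == 0:
                self.bucket_abs.append(total)
                rel = 0
            self.block_rel.append(rel)
            ones = bits[start:start + z].count("1")
            total += ones
            rel += ones

    def rank1(self, i):
        """Number of 1s in B[1..i] (1-based i, 0 <= i <= len(B))."""
        block = i // self.z
        in_block = self.bits[block * self.z:i].count("1")  # table lookup stand-in
        return self.bucket_abs[i // self.Z] + self.block_rel[block] + in_block
```

Each query adds one absolute counter, one relative counter and one in-block count, which is where the constant query time comes from once the in-block step is a table lookup.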

Elias-Fano (B is not needed)
Example: z = 3, w = 2, buckets 0 1 2 3 4 5 6 7
If w = log (m/n) and z = log n, where m = |B| and n = #1, then
- L takes n w = n log (m/n) bits
- H takes n 1s + n 0s = 2n bits (written in unary)
Select1(i) on B: uses L and (Select1(H,i) - i), with Select1 on H in +o(n) space
Rank1(i) on B: needs binary search over B
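A sketch of the encoding, assuming the usual convention in which H writes, bucket by bucket, one 1 per element followed by a 0 that closes the bucket. With that convention the number of 0s before the i-th 1 of H is the bucket (high part) of the i-th element, which is exactly Select1(H,i) - i. The input values, the function names and the linear scan over H (in place of a Select1 structure in o(n) extra bits) are assumptions of this sketch.

```python
def ef_encode(values, w):
    """values: sorted positions of the 1s of B; w: number of low bits."""
    low = [v & ((1 << w) - 1) for v in values]  # L: w bits per element
    high = []                                   # H: unary bucket sizes
    bucket = 0
    for v in values:
        h = v >> w
        while bucket < h:   # close every bucket before h
            high.append(0)
            bucket += 1
        high.append(1)      # one 1 per element in its bucket
    high.append(0)          # close the last used bucket
    return low, high


def ef_select(low, high, w, i):
    """Position of the i-th 1 of B (i >= 1), recovered from L and H only."""
    seen = 0
    for pos, bit in enumerate(high, start=1):  # 1-based positions in H
        if bit:
            seen += 1
            if seen == i:
                return ((pos - i) << w) | low[i - 1]
    raise IndexError("fewer than i elements")
```

For n = 8 values below m = 32 this uses w = 2 and z = 3 as in the example: L takes 16 bits and H takes 8 ones plus 8 zeros.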

If you wish to play with Rank and Select
m/10 + n log (m/n) bits
Rank in 0.4 msec, Select in < 1 msec
vs 32n bits of explicit pointers