Auto-completion Search
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
How it works
What's the dictionary?
Trie for the Dictionary
[Figure: trie over the dictionary strings, with string labels on the edges (s, y, z, stile, zyg, etic, ygy, ial, aibelyite, czecin, ...) and numbers on nodes and leaves.]
Pro: O(p) search time = path scan, where p is the length of the query prefix P
Cons: space for edge labels + node labels + tree structure
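The path scan can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: it uses a plain one-character-per-edge trie rather than the compacted trie with string labels shown in the figure, the names TrieNode, build_trie and complete are hypothetical, and the four-word dictionary is only illustrative.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False  # True if a dictionary string ends here


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def complete(root, prefix):
    """Return all dictionary strings that start with prefix, sorted."""
    # Path scan: O(p) steps, one per character of the prefix.
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    # Collect every string in the subtree below the reached node.
    found = []
    stack = [(node, prefix)]
    while stack:
        cur, text = stack.pop()
        if cur.is_word:
            found.append(text)
        for ch, child in cur.children.items():
            stack.append((child, text + ch))
    return sorted(found)


# Illustrative toy dictionary (an assumption, not the slide's full dictionary).
root = build_trie(["systile", "syzygial", "syzygy", "szczecin"])
print(complete(root, "sy"))
```

Only the descent is O(p); listing the completions costs time proportional to the size of the subtree, which is what the ranking slides that follow try to avoid.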
What's the ranking/scoring of the answers?
Top-1
P = sy
[Figure: the dictionary trie with a score attached to each string; the answer for P = sy is marked 8,1, i.e. score 8 for string 1.]
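Top-1: How to speed up
P = sy
[Figure: the trie in which every internal node stores the identifier of the top-scoring string of its subtree; the node reached by P = sy stores string 1, whose score is 8.]
How to compute the top-1 in O(1) time?
Once the node of P has been reached, the top-1 is read directly from that node.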
Top-2
P = sy
[Figure: the trie in which every internal node stores the identifiers of its two top-scoring strings; the root stores 1,7 and the node reached by P = sy stores 1,4.]
How to compute the top-2 in O(1) time?
Top-k in O(1) time, but k× space
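Storing the top-k at every node can be sketched as follows. This is a minimal sketch, assuming an uncompacted trie and hypothetical names (Node, build, top_k); the words and scores are illustrative, not the ones in the figure. Setting k = 1 gives the top-1 variant.

```python
import heapq


class Node:
    def __init__(self):
        self.children = {}  # character -> Node
        self.top = []       # up to k (score, word) pairs, best first


def build(scored_words, k):
    root = Node()
    for word, score in scored_words:
        node = root
        path = [root]
        for ch in word:
            node = node.children.setdefault(ch, Node())
            path.append(node)
        # Every node on the path keeps only its k best strings: k x space.
        for n in path:
            n.top = heapq.nlargest(k, n.top + [(score, word)])
    return root


def top_k(root, prefix):
    node = root
    for ch in prefix:        # O(p) descent
        node = node.children.get(ch)
        if node is None:
            return []
    return [word for _, word in node.top]  # answer is already stored


# Illustrative scored dictionary (an assumption).
words = [("systile", 2), ("syzygetic", 8), ("syzygial", 1),
         ("syzygy", 4), ("szczecin", 5)]
root = build(words, k=2)
print(top_k(root, "sy"))
```

The query cost after the descent does not depend on how many strings share the prefix, at the price of k entries per node.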
Top-k: How to squeeze?
P = sy
[Figure: the trie with scores stored only at the leaves, one per string.]
On the strings prefixed by P, proceed by D&C:
String  1 2 3 4 5 6 7
Score   8 2 1 4 2 3 5
Top-k: How to squeeze?
On the strings prefixed by P, proceed by D&C:
String  1 2 3 4 5 6 7
Score   8 2 1 4 2 3 5
with L = 1 and R = 4 delimiting the strings prefixed by P.
RMQ-query in O(1) time and O(n) space
Let H be a max-heap of size k, keep also min[H] and max[H]
Initialize H with k pairs <-∞, NULL>
Given the range <L,R> (here <1,4>)
  Compute max-score in Array[L,R] (pos. M, value m)
  If m ≤ min[H], skip; else:
    Insert <m,string> in H;
    If size(H)>k then remove min[H];
    Recurse on <L,M-1> and <M+1,R>, if not empty.
Time: O(k) time, and space
Example for Top-2
Consider this other array:
String  1 2 3 4 5 6 7
Score   4 2 1 8 2 3 5
with L = 1 and R = 7.
Range : operations
[1,7]: insert <8,4> in H; recurse on [1,3] and [5,7]
[1,3]: H={<8,4>}, insert <4,1>; recurse on [1,0] and [2,3]
[5,7]: H={<8,4>,<4,1>}, insert <5,7>; delete <4,1> from H, recurse on [5,6] and [8,7]
[2,3]: H={<8,4>,<5,7>}, max is <2,2>; since min[H]=5, it is not inserted in H
[5,6]: H={<8,4>,<5,7>}, max is <3,6>; since min[H]=5, it is not inserted in H
H = {<8,4>, <5,7>}
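The pruned recursion and the trace above can be replayed with a short Python sketch. Several choices here are assumptions, not the slide's: positions are 0-based, the RMQ is a sparse table (O(1) query but O(n log n) space, swapped in for the O(n)-space structure the slide refers to), and H is Python's heapq min-heap capped at k entries, so min[H] is heap[0]. The names build_sparse_table, range_argmax and top_k_dc are hypothetical.

```python
import heapq


def build_sparse_table(a):
    """table[j][i] = position of the maximum of a[i : i + 2**j]."""
    n = len(a)
    table = [list(range(n))]
    j = 1
    while (1 << j) <= n:
        prev, half = table[-1], 1 << (j - 1)
        row = []
        for i in range(n - (1 << j) + 1):
            left, right = prev[i], prev[i + half]
            row.append(left if a[left] >= a[right] else right)
        table.append(row)
        j += 1
    return table


def range_argmax(a, table, lo, hi):
    """Position of the maximum of a[lo..hi] (inclusive), in O(1) time."""
    j = (hi - lo + 1).bit_length() - 1
    left, right = table[j][lo], table[j][hi - (1 << j) + 1]
    return left if a[left] >= a[right] else right


def top_k_dc(scores, lo, hi, k):
    table = build_sparse_table(scores)
    heap = []  # min-heap of (score, position), never more than k entries

    def visit(l, r):
        if l > r:
            return
        m = range_argmax(scores, table, l, r)
        # Prune: nothing in [l, r] can beat the current k-th best score.
        if len(heap) == k and scores[m] <= heap[0][0]:
            return
        heapq.heappush(heap, (scores[m], m))
        if len(heap) > k:
            heapq.heappop(heap)
        visit(l, m - 1)
        visit(m + 1, r)

    visit(lo, hi)
    return sorted(heap, reverse=True)


# Scores consistent with the trace above, shifted to 0-based positions.
scores = [4, 2, 1, 8, 2, 3, 5]
print(top_k_dc(scores, 0, 6, 2))
```

With 0-based positions the result (8, 3), (5, 6) corresponds to <8,4> and <5,7> in the trace.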
A smarter approach
On the strings prefixed by P, proceed by D&C:
String  1 2 3 4 5 6 7
Score   8 2 1 4 2 3 5
with L and R delimiting the strings prefixed by P.
Let H be a max-heap, including items <val, string, [low,high]>
Compute max-score in Array[L,R] (pos. M, value m)
i=0; insert <m, string[M], L, R> in H
While (i<k) do
  Extract <x, string[X], Lx, Rx> from H, where x is the max-value in H
  Return string[X] as one of the top-k strings
  Compute max-score in Array[Lx,X-1] (pos. M', value m'); insert <m', string[M'], Lx, X-1> in H
  Compute max-score in Array[X+1,Rx] (pos. M'', value m''); insert <m'', string[M''], X+1, Rx> in H
  (empty ranges are skipped)
  i++;
Time: still O(k) time, and space
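The extract-and-split loop above can be sketched in Python as follows. This is a sketch under stated assumptions: positions are 0-based, argmax is a linear scan standing in for the O(1) RMQ structure, and heapq is a binary heap, so each heap operation in this sketch costs O(log k) rather than O(1). The names argmax and top_k_best_first are hypothetical.

```python
import heapq


def argmax(scores, lo, hi):
    """Position of the maximum of scores[lo..hi]; stand-in for an O(1) RMQ."""
    return max(range(lo, hi + 1), key=scores.__getitem__)


def top_k_best_first(scores, lo, hi, k):
    """Return up to k (score, position) pairs of scores[lo..hi], best first."""
    result = []
    if lo > hi:
        return result
    m = argmax(scores, lo, hi)
    # heapq is a min-heap, so scores are negated to extract the maximum.
    heap = [(-scores[m], m, lo, hi)]
    while heap and len(result) < k:
        neg, x, lx, rx = heapq.heappop(heap)
        result.append((-neg, x))
        # Split the range around x and push the best of each non-empty half.
        if lx <= x - 1:
            ml = argmax(scores, lx, x - 1)
            heapq.heappush(heap, (-scores[ml], ml, lx, x - 1))
        if x + 1 <= rx:
            mr = argmax(scores, x + 1, rx)
            heapq.heappush(heap, (-scores[mr], mr, x + 1, rx))
    return result


scores = [8, 2, 1, 4, 2, 3, 5]
print(top_k_best_first(scores, 0, 6, 3))
```

Unlike the pruned recursion, this variant emits the answers in decreasing score order and performs exactly one extraction and at most two RMQ queries per reported string.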
Random access to postings lists and other data types
A basic problem!
Dog: 1 12 15 20 22 ....
  Array of n skip pointers to an array of m integers
  (log m) bits per pointer = (n log m) bits = 32 n bits with 32-bit pointers.
  It is effective for few pointers.
D: AbacoBattleCarColdCod ....
  Array of n string pointers to strings of total length m
  (n log m) bits = 32 n bits with 32-bit pointers.
  It is independent of string length.
B: 100001000001001000100 ....
  B marks with a 1 the first character of each string in D.
We aim at achieving ≈ n log(m/n) bits < n log m
Rank/Select
Wish to index the bit vector B (possibly compressed).
B = 00101001010101011111110000011010101....
m = |B|, n = #1
Rankb(i) = number of b in B[1,i], for a bit b
Selectb(i) = position of the i-th b in B
Example: Rank1(6) = 2
There exist data structures that solve this problem in O(1) query time and very small extra space (i.e. +o(m) bits)
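The two definitions can be pinned down with a naive Python sketch that scans B in O(m) time; it only fixes the semantics (1-based positions, as on the slide) and is not one of the O(1)-time structures mentioned above. The function names rank and select are hypothetical.

```python
def rank(bits, b, i):
    """Number of occurrences of bit b in B[1, i] (1-based, inclusive)."""
    return bits[:i].count(b)


def select(bits, b, i):
    """1-based position of the i-th occurrence of bit b in B, or None."""
    seen = 0
    for pos, bit in enumerate(bits, start=1):
        if bit == b:
            seen += 1
            if seen == i:
                return pos
    return None


B = "00101001010101011111110000011010101"
print(rank(B, "1", 6))    # the slide's example Rank1(6)
print(select(B, "1", 2))  # position of the second 1
```

The two operations are inverse in the sense that rank(B, b, select(B, b, i)) == i whenever the i-th b exists.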
The Bit-Vector Index: |B| + o(|B|)
m = |B|, n = #1s
Goal. B is read-only, and the additional index takes o(m) bits.
Rank
[Figure: B = 00101001010101011 1111100010110101 0101010111000.... is split into buckets of Z bits and blocks of z bits. The (absolute) Rank1 is stored at bucket boundaries (8 and 18 for the first two buckets), the (bucket-relative) Rank1 is stored at block boundaries, and a table with columns block, pos, #1 gives the number of 1s inside every possible block.]
Setting Z = poly(log m) and z = (1/2) log m:
  Extra space is (m/Z) log m + (m/z) log Z + o(m) = O(m loglog m / log m) = o(m) bits
  Rank time is O(1)
  Term o(m) is crucial in practice, B is untouched (not compressed)
There exists a Bit-Vector Index taking o(m) extra bits and constant time for Rank/Select. B is needed and read-only!
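The two-level layout can be sketched in Python as below. This is a toy sketch under stated assumptions: the sizes z = 4 and Z = 16 are arbitrary small values rather than the (1/2) log m and poly(log m) settings of the slide, B is kept as a Python string, and the in-block count uses str.count on at most z characters in place of the precomputed (block, pos) table. The class name RankIndex is hypothetical.

```python
class RankIndex:
    def __init__(self, bits, z=4, blocks_per_bucket=4):
        self.bits = bits                  # B itself, read-only
        self.z = z                        # block size
        self.Z = z * blocks_per_bucket    # bucket size, a multiple of z
        self.bucket_rank = []  # absolute Rank1 before each bucket
        self.block_rank = []   # bucket-relative Rank1 before each block
        total = 0
        in_bucket = 0
        for start in range(0, len(bits), z):
            if start % self.Z == 0:       # a new bucket begins here
                self.bucket_rank.append(total)
                in_bucket = 0
            self.block_rank.append(in_bucket)
            ones = bits.count("1", start, start + z)
            total += ones
            in_bucket += ones

    def rank1(self, i):
        """Number of 1s in B[1, i] (1-based, inclusive)."""
        if i <= 0:
            return 0
        i = min(i, len(self.bits))
        block = (i - 1) // self.z
        bucket = (i - 1) // self.Z
        # Stand-in for the table lookup: count inside one block only.
        inside = self.bits.count("1", block * self.z, i)
        return self.bucket_rank[bucket] + self.block_rank[block] + inside


B = "00101001010101011111110000011010101"
index = RankIndex(B)
print(index.rank1(6))
```

A query touches one bucket counter, one block counter and one block of B, which is why B must stay available next to the index.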
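Elias-Fano (B is not needed)
[Figure: example with z = 3, w = 2. Each position of a 1 in B is split into its w low bits, stored verbatim in L, and its z high bits, which identify one of the buckets 0 1 2 3 4 5 6 7 and are encoded in unary in H.]
If w = log (m/n) and z = log n, where m = |B| and n = #1, then
  L takes n w = n log (m/n) bits
  H takes n 1s + n 0s = 2n bits
Select1(i) on B uses L and (Select1(H,i) - i); Select1 on H takes +o(n) space
Rank1(i) on B needs binary search over B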
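The split into L and H, and the Select1 formula, can be sketched in Python. This is a minimal sketch under stated assumptions: indices and positions are 0-based, w is the floor of log2(m/n), Select1 on H is a linear scan instead of an o(n)-bit index, and the eight positions are illustrative values chosen so that z = 3 and w = 2 (they are not the slide's example). The names ef_encode and ef_select are hypothetical.

```python
def ef_encode(positions, m):
    """Encode the sorted positions of the 1s of a bit vector of length m."""
    n = len(positions)
    w = max(0, (m // n).bit_length() - 1)   # low bits per element
    low = [p & ((1 << w) - 1) for p in positions]
    high = []        # one 0 per bucket advance, one 1 per element
    bucket = 0
    for p in positions:
        h = p >> w
        high.extend([0] * (h - bucket))
        high.append(1)
        bucket = h
    return w, low, high


def ef_select(encoded, i):
    """Position of the i-th 1 (0-based) from L and Select1(H, i) - i."""
    w, low, high = encoded
    seen = -1
    for pos, bit in enumerate(high):   # stand-in for Select1 on H
        if bit:
            seen += 1
            if seen == i:
                return ((pos - i) << w) | low[i]
    raise IndexError(i)


# Illustrative input: n = 8 ones in a vector of length m = 32.
positions = [1, 4, 7, 18, 24, 26, 30, 31]
encoded = ef_encode(positions, 32)
print([ef_select(encoded, i) for i in range(len(positions))])
```

In this toy run L holds 8 values of 2 bits each and H has 15 bits, in line with the n w and 2n bounds stated above.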
If you wish to play with Rank and Select
m/10 + n log (m/n) bits
Rank in 0.4 msec, Select in < 1 msec
vs 32n bits of explicit pointers