Auto-completion Search


1 Auto-completion Search
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Auto-completion Search

2 How it works: what’s the dictionary?

3 Trie for the Dictionary
Pro: O(p) search time = path scan, where p is the length of the searched pattern.
Cons: space for edge labels + node labels + tree structure.
[Figure: trie over the dictionary, with numeric node labels and edge labels s, y, z, omo, stile, aibelyite, zyg, czecin, etic, ygy, ial.]
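The O(p) path scan can be sketched as follows. This is a minimal illustration, not the structure drawn on the slide: it assumes a plain (uncompacted) trie with one character per edge, and the word list, `build_trie` and `complete` are hypothetical names and data chosen for the example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.is_word = False  # True if a dictionary string ends here


def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def complete(root, prefix):
    """Return every dictionary string that starts with prefix, sorted."""
    node = root
    for ch in prefix:                 # the O(p) path scan
        node = node.children.get(ch)
        if node is None:
            return []                 # no string has this prefix
    out, stack = [], [(node, prefix)]
    while stack:                      # collect the whole subtree below the prefix
        cur, s = stack.pop()
        if cur.is_word:
            out.append(s)
        for ch, child in cur.children.items():
            stack.append((child, s + ch))
    return sorted(out)


# Hypothetical dictionary, not the one in the figure.
words = ["symbol", "system", "syntax", "style", "zygote"]
root = build_trie(words)
print(complete(root, "sy"))  # ['symbol', 'syntax', 'system']
```

Reaching the node costs O(p); listing the completions then costs time proportional to the subtree size, which is what the ranking slides that follow try to avoid.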

4 What is the ranking/scoring of the answers?
Top-1 for P = sy.
[Figure: the trie for the dictionary, annotated with numeric scores.]

5 Top-1: how to speed up?
P = sy. How to compute the top-1 in O(1) time?
[Figure: the trie for the dictionary, with additional numeric annotations at the nodes.]

6 Top-2
P = sy. How to compute the top-2 in O(1) time?
Top-k in O(1) time, but k× space.
[Figure: the trie for the dictionary, with pairs of values stored at the nodes, such as 1,7 and 7,6 and 1,4 and 4,2.]
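One way to read the "O(1) time, but k× space" trade-off is to store, at every trie node, the k best-scoring strings of its subtree. The sketch below does exactly that under stated assumptions: an uncompacted trie, and hypothetical words, scores and function names (`build`, `top_k`) that do not come from the slides.

```python
import heapq


class Node:
    def __init__(self):
        self.children = {}  # char -> Node
        self.top = []       # up to k (score, word) pairs of this subtree, best first


def build(scored_words, k):
    """Insert every word and keep the k best (score, word) pairs at each node."""
    root = Node()
    for word, score in scored_words.items():
        node, path = root, [root]
        for ch in word:
            node = node.children.setdefault(ch, Node())
            path.append(node)
        for n in path:  # every prefix of the word sees this candidate
            n.top = heapq.nlargest(k, n.top + [(score, word)])
    return root


def top_k(root, prefix):
    """Walk the prefix, then read the precomputed answer stored at the node."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    return [word for _, word in node.top]


# Hypothetical scored dictionary.
scores = {"symbol": 8, "syntax": 2, "syrup": 1, "system": 4, "zygote": 5}
root = build(scores, k=2)
print(top_k(root, "sy"))  # ['symbol', 'system']
```

Once the node for P is reached, the answer is read without visiting the subtree, at the price of k entries per node.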

7 Top-k: How to squeeze ?
P = sy. Among the strings prefixed by P, proceed by divide and conquer (D&C) over the score array:

Score:  8 2 1 4 2 3 5
String: 1 2 3 4 5 6 7

[Figure: the trie for the dictionary with the scores of its strings; the strings prefixed by P = sy have scores 8, 2, 1, 4.]

8 Top-k: How to squeeze ?
Among the strings prefixed by P, proceed by D&C.

Score:  8 2 1 4 2 3 5
String: 1 2 3 4 5 6 7

An RMQ-query (here: position of the maximum in a range) takes O(1) time and O(n) space.
Let H be a max-heap of size k; keep also min[H] and max[H].
Initialize H with k pairs <-∞, NULL>.
Given the range <L,R> (here <1,4>):
  1. Compute the max score in Array[L,R] (position M, value m).
  2. If m ≤ min[H], skip the range; else:
     - insert <m, String[M]> in H; if size(H) > k, remove min[H];
     - recurse on <L,M-1> and <M+1,R>, if not empty.
Cost: O(k) time and space.

9 Example for Top-2
Consider this other array:

Score:  4 2 1 8 2 3 5
String: 1 2 3 4 5 6 7

Range: operations
[1,7]: H ← <8,4>; recurse on [1,3] and [5,7]
[1,3]: H = {<8,4>} ← <4,1>; recurse on [1,0] and [2,3]
[5,7]: H = {<8,4>, <4,1>} ← <5,7>; delete <4,1> from H; recurse on [5,6] and [8,7]
[2,3]: H = {<8,4>, <5,7>}, candidate <2,2>; since min[H] = 5, do not insert it in H
[5,6]: H = {<8,4>, <5,7>}, candidate <3,6>; since min[H] = 5, do not insert it in H
Result: H = {<8,4>, <5,7>}
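The Top-2 trace can be replayed with a short sketch. Several choices here are assumptions rather than the slides' method: the RMQ is a sparse table (O(n log n) space instead of the O(n) structure mentioned on the slide), H is kept as a min-heap of at most k pairs via `heapq` (so each update costs O(log k)), ranges are visited in the breadth-first order of the trace, and the score array is the one consistent with that trace. Function names are hypothetical.

```python
import heapq
from collections import deque


def build_sparse_table(a):
    """table[j][i] = index of the max of a[i : i + 2**j] (leftmost on ties)."""
    n = len(a)
    table = [list(range(n))]
    j = 1
    while (1 << j) <= n:
        prev, half = table[-1], 1 << (j - 1)
        row = []
        for i in range(n - (1 << j) + 1):
            l, r = prev[i], prev[i + half]
            row.append(l if a[l] >= a[r] else r)
        table.append(row)
        j += 1
    return table


def range_argmax(a, table, lo, hi):
    """Index of the max of a[lo..hi] (0-based, inclusive) with two lookups."""
    j = (hi - lo + 1).bit_length() - 1
    l, r = table[j][lo], table[j][hi - (1 << j) + 1]
    return l if a[l] >= a[r] else r


def top_k(a, table, L, R, k):
    """Top-k <score, position> pairs of Array[L,R]; L, R and positions are 1-based."""
    heap = []                               # min-heap: heap[0] plays the role of min[H]
    queue = deque([(L - 1, R - 1)])
    while queue:
        lo, hi = queue.popleft()
        if lo > hi:
            continue                        # empty range
        m = range_argmax(a, table, lo, hi)
        if len(heap) == k and a[m] <= heap[0][0]:
            continue                        # nothing in this range can enter H
        heapq.heappush(heap, (a[m], m + 1))
        if len(heap) > k:
            heapq.heappop(heap)             # remove min[H]
        queue.append((lo, m - 1))
        queue.append((m + 1, hi))
    return sorted(heap, reverse=True)


scores = [4, 2, 1, 8, 2, 3, 5]
table = build_sparse_table(scores)
print(top_k(scores, table, 1, 7, 2))  # [(8, 4), (5, 7)]
```

The pruning test is what keeps the number of visited ranges proportional to k: a range whose maximum cannot enter H is dropped without being split.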

10 A smarter approach
Among the strings prefixed by P, proceed by D&C.

Score:  8 2 1 4 2 3 5
String: 1 2 3 4 5 6 7

Let H be a max-heap of items <val, string, [low,high]>.
Compute the max score in Array[L,R] (position M, value m).
i = 0; insert <m, String[M], L, R> in H.
While (i < k) do:
    Extract <x, String[X], Lx, Rx> from H, where x is the max value in H.
    Return String[X] as one of the top-k strings.
    Compute the max score in Array[Lx,X-1] (position M’, value m’); insert <m’, String[M’], Lx, X-1> in H.
    Compute the max score in Array[X+1,Rx] (position M’’, value m’’); insert <m’’, String[M’’], X+1, Rx> in H.
    (Skip a sub-range when it is empty.)
    i++
Cost: still O(k) time and space.
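A minimal sketch of this heap-of-ranges variant follows. It makes two simplifying assumptions: the range maximum is found by a linear scan (a stand-in for the O(1) RMQ), and `heapq` is used with negated scores because it is a min-heap. Positions are 0-based here, and the names `argmax` and `top_k_ranges` are hypothetical.

```python
import heapq


def argmax(a, lo, hi):
    """Leftmost index of the max of a[lo..hi]; linear-scan stand-in for an RMQ."""
    return max(range(lo, hi + 1), key=lambda i: a[i])


def top_k_ranges(a, lo, hi, k):
    """Return the top-k (score, position) pairs of a[lo..hi], best first."""
    result = []
    if lo > hi:
        return result
    m = argmax(a, lo, hi)
    heap = [(-a[m], m, lo, hi)]           # items: <val, position, [low, high]>
    while heap and len(result) < k:
        neg, x, lx, rx = heapq.heappop(heap)
        result.append((-neg, x))          # next best answer, in decreasing score order
        if lx <= x - 1:                   # left sub-range, if not empty
            m1 = argmax(a, lx, x - 1)
            heapq.heappush(heap, (-a[m1], m1, lx, x - 1))
        if x + 1 <= rx:                   # right sub-range, if not empty
            m2 = argmax(a, x + 1, rx)
            heapq.heappush(heap, (-a[m2], m2, x + 1, rx))
    return result


scores = [8, 2, 1, 4, 2, 3, 5]
print(top_k_ranges(scores, 0, 3, 2))  # [(8, 0), (4, 3)]
```

Unlike the previous scheme, each extraction yields one final answer, so results come out already sorted and the loop performs exactly k extractions.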

11 Random access to postings lists and other data types

12 A basic problem !
Case 1: an array of n skip pointers into an array of m integers (e.g. the postings list of "Dog"). Each pointer takes (log m) bits, hence (n log m) bits in total = 32 n bits with 32-bit pointers. It is effective only for few pointers.
Case 2: an array of n string pointers into strings of total length m (D = AbacoBattleCarColdCod ....). Again (n log m) bits = 32 n bits, independent of the string length.
[Figure: the two pointer arrays, with a bit vector B drawn under the concatenated strings D.]
We aim at achieving ≈ n log(m/n) bits < n log m.

13 Rank/Select
We wish to index the bit vector B (possibly compressed), where m = |B| and n = #1.
Rank_b(i) = number of bits equal to b in B[1,i]
Select_b(i) = position of the i-th b in B
Example on the bit vector of the figure: Rank_1(6) = 2.
There exist data structures that solve this problem in O(1) query time and very small extra space (i.e. +o(m) bits).
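The two definitions can be pinned down with a naive linear-scan version. The bit vector below is not the one in the figure; it is an illustrative choice that happens to satisfy Rank1(6) = 2, and positions are 1-based as on the slide.

```python
def rank(B, b, i):
    """Number of bits equal to b in B[1..i] (1-based, inclusive)."""
    return sum(1 for x in B[:i] if x == b)


def select(B, b, i):
    """1-based position of the i-th bit equal to b, or None if there is none."""
    seen = 0
    for pos, x in enumerate(B, start=1):
        if x == b:
            seen += 1
            if seen == i:
                return pos
    return None


# Illustrative bit vector (an assumption, not taken from the slide).
B = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0]
print(rank(B, 1, 6))    # 2
print(select(B, 1, 3))  # 7
```

Both functions take O(m) time here; the point of the indexes discussed next is to answer them in O(1) with o(m) extra bits.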

14 The Bit-Vector Index: |B| + o(|B|)
m = |B|, n = #1s.
Goal: B is read-only, and the additional index takes o(m) bits.
[Figure: Rank on B. B is split into buckets of Z bits, each storing an (absolute) Rank1 value (e.g. 8, 18), and into blocks of z bits, each storing a (bucket-relative) Rank1 value; a table with columns block, pos, #1 has rows such as (0000, 1, ...) and (1011, 2, ...).]
Setting Z = poly(log m) and z = (1/2) log m:
- Extra space is (m/Z) log m + (m/z) log Z + o(m) bits, where the middle term is O(m loglog m / log m); the total is o(m) bits.
- Rank time is O(1).
- The o(m) term is crucial in practice; B is untouched (not compressed).
There exists a Bit-Vector Index taking o(m) extra bits and constant time for Rank/Select. B is needed and read-only!
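The two-level layout can be sketched as below. The sizes Z = 8 and z = 4 are illustrative constants rather than the poly(log m) and (1/2) log m settings of the slide, the in-block count is a short scan standing in for the precomputed lookup table, and the class name `RankIndex` and the bit vector are hypothetical.

```python
class RankIndex:
    """Two-level Rank1 index over a read-only bit list B (illustrative sizes)."""

    def __init__(self, B, Z=8, z=4):
        assert Z % z == 0, "a bucket must contain a whole number of blocks"
        self.B, self.Z, self.z = B, Z, z
        self.abs_rank = []  # absolute Rank1 at the start of each bucket of Z bits
        self.rel_rank = []  # bucket-relative Rank1 at the start of each block of z bits
        total = rel = 0
        for start in range(0, len(B), z):
            if start % Z == 0:          # a new bucket begins here
                self.abs_rank.append(total)
                rel = 0
            self.rel_rank.append(rel)
            ones = sum(B[start:start + z])
            total += ones
            rel += ones

    def rank1(self, i):
        """Number of 1s in B[1..i], 1-based inclusive; rank1(0) == 0."""
        if i == 0:
            return 0
        block = (i - 1) // self.z
        in_block = sum(self.B[block * self.z:i])  # stand-in for the table lookup
        return self.abs_rank[(i - 1) // self.Z] + self.rel_rank[block] + in_block


# Illustrative bit vector (an assumption, not taken from the slide).
B = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0]
idx = RankIndex(B)
print(idx.rank1(6))  # 2
```

A query adds three numbers: one per bucket, one per block, and one for the partial block, which is why the time is constant once the partial-block count is tabulated.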

15 Elias-Fano (B is not needed)
Let m = |B| and n = #1, and set w = log (m/n) and z = log n (in the example: z = 3, w = 2).
L stores the w low bits of each position: it takes n w = n log (m/n) bits.
H encodes the z high bits in unary: it takes n 1s + n 0s = 2n bits.
Select1(i) on B → uses L and (Select1(H,i) – i), i.e. a Select1 on H, in +o(n) space.
Rank1(i) on B → needs a binary search over B.
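A small sketch of the encoding and of Select1 through (Select1(H,i) – i) is given below. The eight positions are an illustrative set matching z = 3 and w = 2 (n = 8, m = 32), not necessarily the one on the slide; positions in B are 0-based here, Select1 on H is a linear scan standing in for an O(1) index, and the function names are hypothetical.

```python
def ef_encode(values, w, z):
    """Split sorted values into low parts L and a unary-coded high part H."""
    L = [v & ((1 << w) - 1) for v in values]  # w low bits of each value
    H, bucket = [], 0
    for v in values:
        high = v >> w                         # z high bits = bucket number
        while bucket < high:                  # a 0 closes each finished bucket
            H.append(0)
            bucket += 1
        H.append(1)                           # a 1 for each value in the bucket
    H.extend([0] * ((1 << z) - bucket))       # close the remaining buckets
    return L, H


def select1(bits, i):
    """1-based position of the i-th 1 (linear scan for illustration only)."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        seen += b
        if b and seen == i:
            return pos
    raise IndexError(i)


def ef_select(L, H, w, i):
    """Position in B of its i-th 1 (i is 1-based), rebuilt from L and H."""
    high = select1(H, i) - i                  # zeros before the i-th 1 = bucket number
    return (high << w) | L[i - 1]


values = [1, 4, 7, 18, 24, 26, 30, 31]        # illustrative positions of the 1s
L, H = ef_encode(values, w=2, z=3)
print(len(H), ef_select(L, H, 2, 4))          # 16 18
```

With n = 8 the upper part has 8 ones and 8 zeros, matching the 2n-bit bound, and the lower part has n w = 16 bits.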

16 If you wish to play with Rank and Select
Space: m/10 + n log (m/n) bits, vs 32n bits of explicit pointers.
Time: Rank in 0.4 msec, Select in < 1 msec.

