Published by Gabriella Leonard. Modified over 9 years ago.
1
Introduction to Information Retrieval
Adapted from Christopher Manning and Prabhakar Raghavan
Dictionary indexing
2
String Search
Given a dictionary D of K strings, of total length N, store them so that we can efficiently support searches for a pattern P over them.
Hashing?
3
Hashing with chaining
4
Key issue: a good hash function
Basic assumption: uniform hashing
Avg #keys per slot = n * (1/m) = n/m = α (load factor), for n keys and m slots
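A minimal Python sketch of hashing with chaining and its load factor. It assumes Python's built-in hash() stands in for the uniform hash function; the class name ChainedHashTable and its methods are illustrative and do not come from the slides.

```python
class ChainedHashTable:
    """Hash table with chaining: each slot holds a list of (key, value) pairs."""

    def __init__(self, m):
        self.m = m                              # number of slots
        self.n = 0                              # number of stored keys
        self.slots = [[] for _ in range(m)]

    def _slot(self, key):
        # Assumption: built-in hash() plays the role of the uniform hash function.
        return hash(key) % self.m

    def insert(self, key, value):
        chain = self.slots[self._slot(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)         # overwrite an existing key
                return
        chain.append((key, value))
        self.n += 1

    def search(self, key):
        # Cost is proportional to the chain length, n/m on average.
        for k, v in self.slots[self._slot(key)]:
            if k == key:
                return v
        return None

    def load_factor(self):
        return self.n / self.m


table = ChainedHashTable(m=8)
for i, word in enumerate(["systile", "syzygetic", "syzygial", "syzygy"]):
    table.insert(word, i)
```

With 4 keys in 8 slots the load factor is 0.5, so a lookup scans half a key per slot on average under the uniform-hashing assumption.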
5
Search cost
With m = Θ(n) the load factor n/m is O(1), so a search takes O(1) expected time under uniform hashing.
6
In practice we use simple hash functions, typically with a prime modulus.
7
Do "provably good" hashes exist?
Split the key k into r+1 pieces k_0, k_1, k_2, ..., k_r of ≈ log_2 m bits each, so r ≈ log_2 U / log_2 m.
Pick a vector a = (a_0, a_1, a_2, ..., a_r); each a_i is selected at random in [0,m).
U = universe of keys, m = table size (a prime).
m is not necessarily a prime: use (... mod p) mod m, with p a prime.
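The extracted slide text does not include the hash formula itself. The sketch below assumes the classic dot-product family h_a(k) = (a_0 k_0 + a_1 k_1 + ... + a_r k_r) mod m with m prime, and splits the key into base-m digits rather than bit blocks; the function names make_universal_hash and split_key are hypothetical.

```python
import random


def split_key(key, m, r):
    """Write an integer key in base m as r+1 digits k_0..k_r (least significant first)."""
    digits = []
    for _ in range(r + 1):
        digits.append(key % m)
        key //= m
    return digits


def make_universal_hash(m, r, seed=None):
    """Draw a = (a_0..a_r) at random in [0, m) and return h_a. m is assumed prime."""
    rng = random.Random(seed)
    a = [rng.randrange(m) for _ in range(r + 1)]

    def h(key):
        # Assumed form: dot product of a and the key digits, modulo m.
        digits = split_key(key, m, r)
        return sum(ai * ki for ai, ki in zip(a, digits)) % m

    return h


m, r = 13, 3                      # table size (prime) and number of extra digits
h = make_universal_hash(m, r, seed=1)
slot = h(2024)                    # keys are assumed to be below m ** (r + 1)
```

Drawing a fresh vector a (a new seed) yields a different function from the same family, which is what makes the guarantee independent of the key set.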
8
Cuckoo Hashing
2 hash tables, and 2 random choices where an item can be stored
[Figure: cells holding the keys A, B, C and E, D]
9
A running example
[Figure: cells hold A, B, C and E, D; key F is being inserted]
10
A running example
[Figure: cells hold A, B, F, C and E, D]
11
A running example
[Figure: cells hold A, B, F, C and E, D; key G is being inserted]
12
A running example
[Figure: after the evictions, cells hold E, G, B, F, C and A, D]
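The insert-and-evict behavior of the running example can be sketched as follows. This is a simplified illustration, not the slides' exact tables: salted calls to Python's hash() stand in for the two random hash functions, and the class name, max_kicks bound, and rehash-on-failure policy are assumptions.

```python
import random


class CuckooHashTable:
    def __init__(self, size, max_kicks=32, seed=0):
        self.size = size                  # cells per table
        self.max_kicks = max_kicks        # eviction bound before rehashing
        self.rng = random.Random(seed)
        self.tables = [[None] * size, [None] * size]
        self._new_hashes()

    def _new_hashes(self):
        # Assumption: salting hash() stands in for two random hash functions.
        self.salts = [self.rng.getrandbits(64), self.rng.getrandbits(64)]

    def _pos(self, i, key):
        return hash((self.salts[i], key)) % self.size

    def contains(self, key):
        # A key can only live in one of its 2 candidate cells.
        return any(self.tables[i][self._pos(i, key)] == key for i in (0, 1))

    def insert(self, key):
        if self.contains(key):
            return
        i = 0
        for _ in range(self.max_kicks):
            p = self._pos(i, key)
            # Place the key and pick up whatever occupied the cell.
            key, self.tables[i][p] = self.tables[i][p], key
            if key is None:
                return
            i = 1 - i                     # the evicted key moves to its other table
        self._rehash(key)

    def _rehash(self, pending):
        keys = [k for t in self.tables for k in t if k is not None] + [pending]
        self.tables = [[None] * self.size, [None] * self.size]
        self._new_hashes()
        for k in keys:
            self.insert(k)


cuckoo = CuckooHashTable(size=8)
for key in "ABCDEFG":
    cuckoo.insert(key)
```

Lookups probe at most two cells, which is the point of the scheme; the price is that an insertion may trigger a chain of evictions or, rarely, a full rehash.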
13
Cuckoo Hashing Examples
[Figure: keys A, B, C, E, D, F, G]
Random (bipartite) graph: node = cell, edge = key
14
Natural Extensions
More than 2 hashes (choices) per key.
  Very different: hypergraphs instead of graphs.
  Higher memory utilization
    3 choices: 90+% in experiments
    4 choices: about 97%
  ...but more insert time (and random access)
2 hashes + bins of B-size.
  Balanced allocation and tightly O(1)-size bins
  Insertion sees a tree of possible evict+ins paths
  ...more memory, but more local
15
Minimal Ordered Perfect Hashing
m = 1.25 n (in the example, n = 12 and m = 15)
The functions h_1 and h_2 are not perfect
16
h(t) = [ g(h_1(t)) + g(h_2(t)) ] mod n
The computed h is perfect, and no strings need to be stored.
Space is negligible for h_1 and h_2, and m log n bits for g.
17
How to construct it
Term = edge; its two vertices are given by h_1 and h_2.
Start with all g(v) = 0; then assign g() by difference with the known h().
Acyclic graph: ok.
Not acyclic: regenerate the hashes.
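The construction above can be sketched in Python as follows. The helper names (salted_hash, assign_g, build_moph) are hypothetical, and salted BLAKE2b digests are an assumed stand-in for the random functions h_1 and h_2, which the slides do not specify. The i-th term of the sorted list is the edge that must receive h = i.

```python
import hashlib
import math
import random


def salted_hash(term, salt, m):
    # Assumption: a salted BLAKE2b digest plays the role of h_1 or h_2.
    digest = hashlib.blake2b(term.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big") % m


def assign_g(edges, m, n):
    """Return g with (g[u] + g[v]) % n == i for every edge i, or None if cyclic."""
    adj = [[] for _ in range(m)]
    for i, (u, v) in enumerate(edges):
        if u == v:
            return None                       # a self-loop is a cycle
        adj[u].append((v, i))
        adj[v].append((u, i))
    g = [0] * m                               # all g(v) = 0 initially
    seen = [False] * m
    used = [False] * n
    for root in range(m):
        if seen[root]:
            continue
        seen[root] = True
        stack = [root]
        while stack:
            u = stack.pop()
            for v, i in adj[u]:
                if used[i]:
                    continue
                used[i] = True
                if seen[v]:
                    return None               # unused edge into a visited node: cycle
                seen[v] = True
                g[v] = (i - g[u]) % n         # assign by difference with the known h
                stack.append(v)
    return g


def build_moph(terms, c=1.25, max_tries=10_000, seed=0):
    n = len(terms)
    m = math.ceil(c * n)                      # m = 1.25 n, as on the slide
    rng = random.Random(seed)
    for _ in range(max_tries):
        s1 = rng.getrandbits(64).to_bytes(8, "big")
        s2 = rng.getrandbits(64).to_bytes(8, "big")
        edges = [(salted_hash(t, s1, m), salted_hash(t, s2, m)) for t in terms]
        g = assign_g(edges, m, n)
        if g is not None:                     # acyclic: ok
            return lambda t: (g[salted_hash(t, s1, m)] + g[salted_hash(t, s2, m)]) % n
    raise RuntimeError("no acyclic graph found; try a larger c")


terms = ["systile", "syzygetic", "syzygial", "syzygy", "szaibelyite", "szczecin"]
h = build_moph(terms)
ranks = [h(t) for t in terms]
```

Only the two salts and the array g are kept after construction; the terms themselves are not stored, so h returns some value in [0, n) for strings outside the dictionary as well.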
18
Prefix Search
Given a dictionary D of K strings, of total length N, store them so that we can efficiently support prefix searches for a pattern P over them.
19
Array of strings (pointers…)
Example strings: systile, syzygetic, syzygial, syzygy
Search = O(p * log_2 K) time, where p is the length of P, and O(log_2 K) I/Os
Space = N + 4K bytes
I/O = cache misses (esp. range search)
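A minimal sketch of prefix search by binary search over the sorted array. The function name prefix_range is illustrative, and the two extra dictionary strings are taken from the later bucketing slides. Each comparison looks at no more than p characters, and each of the two searches takes about log_2 K steps.

```python
def prefix_range(sorted_strings, pattern):
    """Return (start, end) such that sorted_strings[start:end] have the given prefix."""
    p = len(pattern)

    # First position whose length-p prefix is >= pattern.
    lo, hi = 0, len(sorted_strings)
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_strings[mid][:p] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First position whose length-p prefix is > pattern.
    hi = len(sorted_strings)
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_strings[mid][:p] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


D = ["systile", "syzygetic", "syzygial", "syzygy", "szaibelyite", "szczecin"]
start, end = prefix_range(D, "syzyg")
matches = D[start:end]
```

Every probe follows a pointer to a string stored elsewhere, which is where the O(log_2 K) I/Os (cache misses) come from.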
20
Reduce I/Os: Force some locality
….systilesyzygeticsyzygialsyzygy….
sorted order + linear storage
2 advantages:
  Save random I/Os in the last binary-search steps
  I/O-scan in reporting range-results
How do we reduce the space?
21
Space + I/O reduction: Bucketing
….7systile9syzygetic8syzygial6syzygy | 11szaibelyite8szczecin9szomo….
(each string is preceded by its length; "|" separates two buckets of size B)
2 further advantages:
  Search = O(log_2 b) I/Os, where b ≈ N/B
  Space = (N + K) + 4 * b bytes
22
Space reduction: Front-coding
Sorted strings:
  http://checkmate.com/All_Natural/
  http://checkmate.com/All_Natural/Applied.html
  http://checkmate.com/All_Natural/Aroma.html
  http://checkmate.com/All_Natural/Aroma1.html
  http://checkmate.com/All_Natural/Aromatic_Art.html
  http://checkmate.com/All_Natural/Ayate.html
  http://checkmate.com/All_Natural/Ayer_Soap.html
  http://checkmate.com/All_Natural/Ayurvedic_Soap.html
  http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
  http://checkmate.com/All_Natural/Bath_Salts.html
  http://checkmate.com/All/Essence_Oils.html
  http://checkmate.com/All/Mineral_Bath_Crystals.html
  http://checkmate.com/All/Mineral_Bath_Salt.html
  http://checkmate.com/All/Mineral_Cream.html
  http://checkmate.com/All/Natural/Washcloth.html...
Front-coded (length of the prefix shared with the previous string, then the remaining suffix):
  0 http://checkmate.com/All_Natural/
  33 Applied.html
  34 roma.html
  38 1.html
  38 tic_Art.html
  34 yate.html
  35 er_Soap.html
  35 urvedic_Soap.html
  33 Bath_Salt_Bulk.html
  42 s.html
  25 Essence_Oils.html
  25 Mineral_Bath_Crystals.html
  38 Salt.html
  33 Cream.html
  0 http://checkmate.com/All/Natural/Washcloth.html...
Slide annotation (unlabeled in the extracted text): 33 45%
On the dictionary ….systile syzygetic syzygial syzygy…. the shared-prefix lengths are 2, 5, 5.
Gzip may be much better...
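A minimal sketch of front-coding as (shared-prefix length, suffix) pairs. The function names front_encode and front_decode are illustrative, and a real implementation would pack the pairs into bytes instead of keeping Python tuples.

```python
def front_encode(sorted_strings):
    """Encode each string as (length of prefix shared with the previous one, suffix)."""
    pairs, prev = [], ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(s), len(prev)) and s[lcp] == prev[lcp]:
            lcp += 1
        pairs.append((lcp, s[lcp:]))
        prev = s
    return pairs


def front_decode(pairs):
    """Rebuild the strings; each one needs the previously decoded string."""
    strings, prev = [], ""
    for lcp, suffix in pairs:
        prev = prev[:lcp] + suffix
        strings.append(prev)
    return strings


words = ["systile", "syzygetic", "syzygial", "syzygy"]
coded = front_encode(words)
# coded == [(0, "systile"), (2, "zygetic"), (5, "ial"), (5, "y")]
```

Decoding is inherently sequential: recovering the last string requires decoding every string before it, which is why front-coding is usually applied to short runs of strings.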
23
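Solution #1: Bucketing + FC
….7 0 systile 9 2 zygetic 8 5 ial 6 5 y | 11 0 szaibelyite 8 2 czecin 9 2 omo….
(each entry: string length, length of the prefix shared with the previous string, remaining suffix; "|" separates two buckets of size B)
Search = O(log_2 b) I/Os, where b ≈ N/B, so it still depends on D's size
Space ≈ (FC(D) + K) + 4 * b bytes
Not really FC(D): the first string of every bucket is stored in full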
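A minimal sketch of bucketing plus front-coding with prefix search. For simplicity it assumes buckets hold a fixed number of strings (per_bucket) instead of B bytes, and keeps the first string of each bucket in a separate heads array; all names are illustrative.

```python
from bisect import bisect_right


def build_buckets(sorted_strings, per_bucket):
    """Front-code each bucket independently; the first string of a bucket is stored in full."""
    heads, buckets = [], []
    for i in range(0, len(sorted_strings), per_bucket):
        chunk = sorted_strings[i:i + per_bucket]
        heads.append(chunk[0])
        coded, prev = [], ""
        for s in chunk:
            lcp = 0
            while lcp < min(len(s), len(prev)) and s[lcp] == prev[lcp]:
                lcp += 1
            coded.append((lcp, s[lcp:]))
            prev = s
        buckets.append(coded)
    return heads, buckets


def decode_bucket(coded):
    strings, prev = [], ""
    for lcp, suffix in coded:
        prev = prev[:lcp] + suffix
        strings.append(prev)
    return strings


def prefix_search(heads, buckets, pattern):
    # Binary search over the bucket heads, then scan forward bucket by bucket.
    b = max(bisect_right(heads, pattern) - 1, 0)
    result = []
    while b < len(buckets):
        for s in decode_bucket(buckets[b]):
            head = s[:len(pattern)]
            if head == pattern:
                result.append(s)
            elif head > pattern:
                return result            # past the matching range
        b += 1
    return result


D = ["systile", "syzygetic", "syzygial", "syzygy", "szaibelyite", "szczecin"]
heads, buckets = build_buckets(D, per_bucket=4)
found = prefix_search(heads, buckets, "syzyg")
```

With per_bucket=4 the two buckets match the slide's layout: the second bucket restarts with the full string szaibelyite, which is the overhead relative to front-coding the whole dictionary at once.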
24
Trie: speeding-up searches
[Figure: compacted trie over systile, syzygetic, syzygial, syzygy, szaibelyite, szczecin, szomo…, with edge labels such as s, y, z, stile, zyg, etic, ial, aibelyite, czecin, omo and numeric node labels]
Pro: O(p) search time
Cons: space for edge + node labels and tree structure
25
Solution #2: 2-level indexing
Internal Memory: CT (compacted trie) on a sample of the strings (systile, szaibelyite)
Disk: front-coded buckets ….7 0 systile 9 2 zygetic 8 5 ial 6 5 y | 11 0 szaibelyite 8 2 czecin 9 2 omo….
2 advantages:
  Search ≈ typically 1 I/O
  Space ≈ Front-coding over buckets
2 disadvantages:
  Trade-off ≈ speed vs space (because of bucket size)
  Sampling rate ≈ lengths of sampled strings