Dictionary search Exact string search Paper on Cuckoo Hashing
Exact String Search Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support searches for a pattern P over them. Hashing
Hashing with chaining
Key issue: a good hash function Basic assumption: Uniform hashing Avg #keys per slot = n * (1/m) = n/m = (load factor)
Search cost m = (n)
In practice A trivial hash function is: prime
A “provably good” hash is Each a i is selected at random in [0,m) k0k0 k1k1 k2k2 krkr ≈log 2 m r ≈ L / log 2 m a0a0 a1a1 a2a2 arar K a prime l = max string len m = table size not necessarily: (...mod p) mod m
Cuckoo Hashing ABC ED 2 hash tables, and 2 random choices where an item can be stored
ABC ED F A running example
ABFC ED
ABFC ED G
EGBFC AD
Cuckoo Hashing Examples ABC ED F G Random (bipartite) graph: node=cell, edge=key
Natural Extensions More than 2 hashes (choices) per key. Very different: hypergraphs instead of graphs. Higher memory utilization 3 choices : 90+% in experiments 4 choices : about 97% 2 hashes + bins of B-size. Balanced allocation and tightly O(1)-size bins Insertion sees a tree of possible evict+ins paths but more insert time (and random access) more memory...but more local
Dictionary search Making one-side errors Paper on Bloom Filter
Crawling How to keep track of the URLs visited by a crawler? URLs are long Check should be very fast No care about small errors (≈ page not crawled) Bloom Filter over crawled URLs
Searching with errors...
Problem: false positives
TTT 2
Not perfectly true but...
m/n = 8 Opt k = We do have an explicit formula for the optimal k
Dictionary search Prefix-string search Reading 3.1 and 5.2
Prefix-string Search Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix searches for a pattern P over them.
Trie: speeding-up searches y s 1 z stile zyg 5 etic ial ygy aibelyite czecin omo Pro: O(p) search time Cons: edge + node labels and tree structure
Front-coding: squeezing strings Applied.html 34 roma.html 38 1.html 38 tic_Art.html 34 yate.html 35 er_Soap.html 35 urvedic_Soap.html 33 Bath_Salt_Bulk.html 42 s.html 25 Essence_Oils.html 25 Mineral_Bath_Crystals.html 38 Salt.html 33 Cream.html 33 45% 0 ….systile syzygetic syzygial syzygy… Gzip may be much better...
….70systile 92zygeti c85ial 65y 110szaibelyite 82czecin92omo…. systile szaielyite CT on a sample 2-level indexing Disk Internal Memory A disadvantage: Trade-off ≈ speed vs space (because of bucket size) 2 advantages: Search ≈ typically 1 I/O Space ≈ Front-coding over buckets