Practical Perfect Hashing for very large Key-Value Databases Amjad Daoud, Ph.D.


Abstract

This presentation describes a practical algorithm for perfect hashing that is suitable for very large key-value (KV) databases. The algorithm was recently used to compute minimal perfect hash functions (MPHFs) for a keyset with 74 billion keys.

Dictionary (Key, Value) Data Structures

Tries (from "text reTRIEval") can be used deterministically to find the closest match in a dictionary of words. Unsuccessful searches terminate quickly. Searches are usually tolerant, allowing mistyped queries and fuzzy searches. Storage requirements are huge, and to be practical a trie must be compressed, sacrificing speed. DAWGs (directed acyclic word graphs) compress a trie by sharing links among similar keys, but cannot store other information as values.
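As a rough illustration of why unsuccessful trie searches stop early, here is a minimal uncompressed trie sketch in Python 3. The class and method names (TrieNode, Trie, longest_prefix) are invented for this example and do not come from the presentation; no compression or DAWG sharing is attempted.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # maps one character to the next TrieNode
        self.is_key = False  # True if a stored key ends at this node
        self.value = None    # payload for a complete key


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, value):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.is_key = True
        node.value = value

    def get(self, key):
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return None  # search ends at the first missing character
        return node.value if node.is_key else None

    def longest_prefix(self, query):
        """Longest prefix of query that is a path in the trie (a crude
        starting point for tolerant matching of mistyped queries)."""
        node, depth = self.root, 0
        for ch in query:
            node = node.children.get(ch)
            if node is None:
                break
            depth += 1
        return query[:depth]


words = Trie()
words.insert("hash", 1)
words.insert("hashing", 2)
```

Each node holds its own dictionary of children, which is where the large storage cost mentioned on the slide comes from.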

Dictionary Data Structures: Traditional Hashing

Hash tables are one of the most powerful ways of searching data based on a key. The main issue with traditional hash functions is hash value collisions. Consider a hash table that uses separate chaining: to look up a word, we run it through a hash function H(), which returns a number; we then look at all the items in that "bucket" to find the data. The search algorithm must recheck the key against the list of elements sharing the same hash, which can cause major performance issues and use too much storage space. See the Hashing Animations Applet.

If we could guarantee that there were no collisions, we could throw away the keys of the hash table. This is very useful for URLs on the entire Internet, or parts of the human genome.
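The separate-chaining lookup described above can be sketched as follows. This is an illustrative Python 3 toy, not code from the presentation: the class name ChainedHashTable is made up, and Python's built-in hash() stands in for H().

```python
class ChainedHashTable:
    def __init__(self, num_buckets=8):
        # One list ("chain") per bucket; colliding keys share a chain.
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (stored_key, _) in enumerate(bucket):
            if stored_key == key:
                bucket[i] = (key, value)  # overwrite an existing entry
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        # The key must be stored and rechecked, because several keys
        # can hash to the same bucket.
        for stored_key, value in self.buckets[self._index(key)]:
            if stored_key == key:
                return value
        return default


table = ChainedHashTable(num_buckets=2)
for number, word in enumerate(["alpha", "beta", "gamma", "delta", "epsilon"]):
    table.put(word, number)
```

With five keys in two buckets at least one chain holds three entries, so a lookup there compares up to three stored keys. A collision-free hash removes both the chain scan and the need to store the keys.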

Dictionary Data Structures: Minimal perfect hashing

Perfect hashing builds a hash table with no collisions and O(1) search time; all of the keys must be known in advance. Minimal implies no empty slots. Hash functions of the form F(d, key) use a small table G of parameters, or displacements, to find the unique position for the key; F(0, key) can be used to index the table G. The function F always returns the correct location of a key that is known for sure to be in the table; no false positives. Usually faster than Bloomier filters (see the Bloom Filter Animation). Requires more space than Bloom filters, on average 2 bits/key (very close to the theoretical limit).
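To make "perfect" and "minimal" concrete, the toy below brute-forces a single displacement d that sends five keys to five distinct slots. It is a Python 3 sketch under stated assumptions and is not the MOS algorithm, which keeps one displacement per bucket in G. The helper names (fnv1a, find_displacement, contains) and the sample keys are invented, and the hash is a seeded 32-bit FNV-1a. Five keys are used so that the table size is prime.

```python
def fnv1a(d, key):
    """Seeded 32-bit FNV-1a; d == 0 selects the standard offset basis."""
    if d == 0:
        d = 0x811C9DC5
    for ch in key:
        d = ((d ^ ord(ch)) * 16777619) & 0xFFFFFFFF
    return d


def find_displacement(keys, max_tries=100000):
    """Return a d for which fnv1a(d, key) % n is distinct for every key."""
    n = len(keys)
    for d in range(1, max_tries):
        if len({fnv1a(d, k) % n for k in keys}) == n:
            return d
    raise ValueError("no displacement found within max_tries")


keys = ["apple", "banana", "cherry", "date", "fig"]
d = find_displacement(keys)

# Minimal: five keys fill a table of exactly five slots, none left empty.
table = [None] * len(keys)
for k in keys:
    table[fnv1a(d, k) % len(keys)] = k


def contains(key):
    # F alone maps ANY string to some slot, so a key that may be absent
    # has to be compared with what is stored there.
    return table[fnv1a(d, key) % len(table)] == key
```

The contains() check shows the caveat on the slide: the location is only guaranteed correct for keys known to be in the set, and the stored keys can be dropped only when that is guaranteed.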

Dictionary Data Structures: Minimal perfect hashing MapOrderSearch algorithm

How do we find the G table in linear time? By exploiting randomness.
Mapping step: map the keys into buckets (patterns) using a simple hash.
Ordering step: process the buckets largest first.
Searching step: for each bucket, try to place all the keys it contains in empty slots of the value table using F(d=1, key). If that is unsuccessful, try different random displacements d.
Randomness is key here, and the search is fast because the table is largely empty and much larger than the patterns being processed. Patterns of size 1 are fitted directly into the remaining empty slots, and a negative displacement in G records the slot.

Minimal perfect hashing References

Implementing the MOS Algorithm II (CACM92), and Amjad M Daoud Ph.D. Thesis 1993 at VT.
An example mphf in C for the unix dictionary.
The code ported to Python and to JavaScript (available for download).
The algorithm is used to compute Google page rank: Google Page Rank in C#.
Better and more scalable algorithm for perfect hashing: Perfect Hash Functions for Large Web Repositories, The Seventh International Conference on Information Integration and Web Based Applications & Services (iiWAS2005).

Many Derivative Open Source Implementations

1. Fuzzy Tolerant Search with DAWGs and MPHF
2. CMPH Library, cmph.sourceforge.com
3. MPHF in C#
4. sh/perfect.html (splits keys into buckets by a first hash h1, sorts buckets by size, and maps them in decreasing order so that table[hash1(key)] ^ hash2(key) causes no collision)
5. The algorithm is used to compute Google page rank: Google Page Rank in C#

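Minimal perfect hashing Python Code

    import sys
    import math

    # First-level simple hash; also used to disperse patterns using random d values.
    # Note: the names hash and str shadow Python built-ins.
    def hash( d, str ):
        if d == 0: d = 0x811C9DC5   # 32-bit FNV offset basis
        # Use the FNV-1a hash
        for c in str:
            # FNV-1a: xor in the byte, then multiply by the 32-bit FNV prime 16777619
            d = ( (d ^ ord(c)) * 16777619 ) & 0xffffffff
        return d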

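Minimal perfect hashing Python Code

    # Create PHF with MOS (Map, Order, Search); g is the specification array.
    # nextprime() is not shown on these slides; it is assumed to return a prime >= its argument.
    def CreatePHF( dict ):
        # size = len(dict) for minimal perfect hash
        size = nextprime(len(dict) + len(dict)//4)
        print("Size = %d" % (size))
        # c=4 corresponds to 4 bits/key: g holds gsize entries of about log2(size) bits each,
        # so gsize = c*size/log2(size). (The slide divides by 4*log2(size) instead, which
        # gives 1/4 bit/key and buckets too large to place.)
        gsize = nextprime(int(4*size/math.log(size,2)))
        print("G array size = %d" % (gsize))
        sys.stdout.flush()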
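CreatePHF calls nextprime(), which the slides do not define. A minimal Python 3 sketch of such a helper follows, assuming it should return the smallest prime greater than or equal to its argument. The trial-division approach and the is_prime name are choices made for this example, not the author's implementation.

```python
import math


def is_prime(n):
    """Trial division; adequate for table sizes that fit in memory."""
    if n < 2:
        return False
    if n < 4:
        return True          # 2 and 3
    if n % 2 == 0:
        return False
    for f in range(3, math.isqrt(n) + 1, 2):
        if n % f == 0:
            return False
    return True


def nextprime(n):
    """Smallest prime >= n (assumed contract of the slides' helper)."""
    n = max(2, int(n))
    while not is_prime(n):
        n += 1
    return n


num_keys = 1000
size = nextprime(num_keys + num_keys // 4)   # about 25% slack, as in CreatePHF
```

A prime table size matters for a hash like FNV-1a: with a power-of-two modulus, the slot would depend only on the low bits of the hash, which in turn depend only on the low bits of d and of each character.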

Minimal perfect hashing Python Code

        # Step 1: Mapping
        patterns = [ [] for i in range(gsize) ]
        g = [0] * gsize          # initialize g
        values = [None] * size   # initialize values
        for key in dict.keys():
            patterns[hash(0, key) % gsize].append( key )

Minimal perfect hashing Python Code

        # Step 2: Sort patterns in descending order of size and process
        patterns.sort( key=len, reverse=True )
        for b in range( gsize ):
            pattern = patterns[b]
            if len(pattern) <= 1: break
            d = 1
            item = 0
            slots = []

Minimal perfect hashing Python Code

            # Step 3: rotate patterns and search for suitable displacement
            while item < len(pattern):
                slot = hash( d, pattern[item] ) % size
                if values[slot] is not None or slot in slots:
                    # collision: restart this pattern with the next displacement
                    d += 1
                    if d < 0 : break
                    item = 0
                    slots = []
                else:
                    slots.append( slot )
                    item += 1
            if d < 0:
                print("failed")
                return
            g[hash(0, pattern[0]) % gsize] = d
            for i in range(len(pattern)):
                values[slots[i]] = dict[pattern[i]]

Minimal perfect hashing Python Code

        # Process patterns with one key and use a negative value of d
        freelist = []
        for i in range(size):
            if values[i] is None: freelist.append( i )
        # Start at b, not b+1: the pattern that ended the loop above has not been placed yet.
        for b in range( b, gsize ):
            pattern = patterns[b]
            if len(pattern) == 0: break
            if len(pattern) > 1: continue   # already placed in Step 3
            slot = freelist.pop()
            # subtract one to handle slot zero
            g[hash(0, pattern[0]) % gsize] = -slot-1
            values[slot] = dict[pattern[0]]
        print("PHF succeeded")
        return (g, values)

Minimal perfect hashing Python Code

    # Look up a value in the hash table, defined by g and V.
    def lookup( g, V, key ):
        d = g[hash(0,key) % len(g)]
        if d < 0: return V[-d-1]            # single-key pattern: slot stored directly
        return V[hash(d, key) % len(V)]

Minimal perfect hashing Python Code

    # main program
    # reading keyset; size is given by num
    DICTIONARY = "/usr/share/dict/words"
    num = 100000   # assumed value; the slides do not show where num is set
    dict = {}
    line = 1
    for key in open(DICTIONARY, "rt").readlines():
        dict[key.strip()] = line
        line += 1
        if line > num: break
    (g, V) = CreatePHF( dict )
    # printing phf specification
    print(g)

Minimal perfect hashing Python Code

    # fast verification for a few (key, value) pairs; count given by num1
    num1 = 5
    print("Verifying hash values for the first %d words" % (num1))
    count = 1   # separate counter, so the looked-up line number is not reused as the loop counter
    for key in open(DICTIONARY, "rt").readlines():
        line = lookup( g, V, key.strip() )
        print("Word %s occurs on line %d" % (key.strip(), line))
        count += 1
        if count > num1: break
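The slide code is spread over several slides and depends on a dictionary file and on a nextprime() helper that is not shown. For experimenting without those, the following is a condensed Python 3 sketch of the same Map, Order, Search idea on an in-memory key set. Several points are assumptions rather than the author's code: the names fnv1a, create_phf and max_d, the nextprime contract (smallest prime >= n), the bucket count c*size/log2(size) with c=4, and the requirement that stored values are never None.

```python
import math


def fnv1a(d, key):
    """Seeded 32-bit FNV-1a; d == 0 selects the standard offset basis."""
    if d == 0:
        d = 0x811C9DC5
    for ch in key:
        d = ((d ^ ord(ch)) * 16777619) & 0xFFFFFFFF
    return d


def nextprime(n):
    """Smallest prime >= n (assumed behaviour of the slides' helper)."""
    n = max(2, int(n))
    while any(n % f == 0 for f in range(2, math.isqrt(n) + 1)):
        n += 1
    return n


def create_phf(items, c=4, max_d=1 << 20):
    """items maps key -> value (values must not be None). Returns (g, values)."""
    size = nextprime(len(items) + len(items) // 4)
    gsize = nextprime(int(c * size / math.log(size, 2)))
    # Mapping: distribute keys into gsize buckets with the unseeded hash.
    buckets = [[] for _ in range(gsize)]
    for key in items:
        buckets[fnv1a(0, key) % gsize].append(key)
    # Ordering: largest buckets first, while the table is still mostly empty.
    buckets.sort(key=len, reverse=True)
    g = [0] * gsize
    values = [None] * size
    # Searching: find a displacement d that puts the whole bucket in free slots.
    for bucket in buckets:
        if len(bucket) <= 1:
            break
        for d in range(1, max_d):
            slots = [fnv1a(d, k) % size for k in bucket]
            if len(set(slots)) == len(slots) and all(values[s] is None for s in slots):
                break
        else:
            raise RuntimeError("no displacement found for a bucket")
        g[fnv1a(0, bucket[0]) % gsize] = d
        for k, s in zip(bucket, slots):
            values[s] = items[k]
    # Single-key buckets go straight into free slots, recorded as -slot-1.
    free = [i for i, v in enumerate(values) if v is None]
    for bucket in buckets:
        if len(bucket) == 1:
            slot = free.pop()
            g[fnv1a(0, bucket[0]) % gsize] = -slot - 1
            values[slot] = items[bucket[0]]
    return g, values


def lookup(g, values, key):
    d = g[fnv1a(0, key) % len(g)]
    if d < 0:
        return values[-d - 1]
    return values[fnv1a(d, key) % len(values)]


items = {"key%d" % i: i for i in range(1, 201)}
g, values = create_phf(items)
```

Every key in items should then satisfy lookup(g, values, key) == items[key]. As on the slides, the table has about 25% slack, so this sketch is perfect but not minimal.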

Hashing Animations and Videos

Hashing Animations Applet
MIT 6.046J Introduction to Algorithms