Hashing functions have many uses. We can use them to map values into a hash table, but they also have more general uses, such as computing a compact, nearly unique identifier for a piece of data.

Hashing functions
Recall that hash functions are transformations that map keys to hash table indexes. Specifically, we want to transform keys into integers in the range {0, ..., M-1}, where M is the size of the hash table. Ideally, a hash function should be easy to compute and should approximate a "random" function: for each input, every output should be, in some sense, equally likely. That is, the hash function is not biased toward any slot.

Numerical keys
Suppose that our key is some large integer. The most common method of hashing is to choose M to be prime and, for any key k, to define the hash function h(k) as:
    h(k) = k mod M
Example: k = 44217, M = 101
    h(k) = 44217 mod 101 = 80
Therefore key 44217 hashes to slot 80.
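The division method above can be sketched in a few lines of C++. The function name `divisionHash` and the use of 64-bit unsigned keys are assumptions made for illustration, not part of the slides.

```cpp
#include <cassert>
#include <cstdint>

// Division-method hash: maps a non-negative integer key to a slot in [0, M-1].
// divisionHash is a hypothetical name chosen for this sketch.
std::uint64_t divisionHash(std::uint64_t key, std::uint64_t M) {
    assert(M > 0);  // a table needs at least one slot
    return key % M;
}
```

Calling `divisionHash(44217, 101)` reproduces the slide's example and returns 80.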

Non-Numerical keys
Sometimes the key is not numeric to begin with, so a number k must first be computed from the actual data. For example, assume the key is "akey". We choose to use a 5-bit value for every character in the string, where each 5-bit value represents the character's position in the alphabet. This amounts to:
    00001 01011 00101 11001 = 44217 (decimal)
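As a rough sketch of this encoding, the function below concatenates one 5-bit group per letter. The name `packKey`, the restriction to lowercase 'a'..'z', and the 12-letter limit (so the result fits in 64 bits) are assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Packs a lowercase key into an integer, 5 bits per letter ('a' -> 1 ... 'z' -> 26).
// Assumes only 'a'..'z' and at most 12 letters, so the result fits in 64 bits.
std::uint64_t packKey(const std::string& key) {
    assert(key.size() <= 12);
    std::uint64_t k = 0;
    for (char ch : key) {
        assert(ch >= 'a' && ch <= 'z');
        std::uint64_t code = static_cast<std::uint64_t>(ch - 'a' + 1);
        k = (k << 5) | code;  // append the next 5-bit group
    }
    return k;
}
```

For the slide's key, `packKey("akey")` yields 44217.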

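Non-Numerical keys
    h(k) = k mod M
Why choose M to be prime? This is mainly due to the arithmetic properties of the mod operator. As it turns out, we are treating the key k as a base-32 number:
    00001 01011 00101 11001 = 44217 (decimal)
    $1 \cdot 32^3 + 11 \cdot 32^2 + 5 \cdot 32^1 + 25 \cdot 32^0$
where 1 = position of 'a', 11 = position of 'k', 5 = position of 'e', and 25 = position of 'y'. If M were 32, k mod M would be just the last base-32 digit (25 here), so only the final character would matter. With a prime M such as 101, every character influences the result.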

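Non-Numerical keys
Typically, keys are non-numeric and not necessarily short. For example, the key 'VERYLONGKEY' amounts to
    $k = 22 \cdot 32^{10} + 5 \cdot 32^{9} + 18 \cdot 32^{8} + 25 \cdot 32^{7} + 12 \cdot 32^{6} + 15 \cdot 32^{5} + 14 \cdot 32^{4} + 7 \cdot 32^{3} + 11 \cdot 32^{2} + 5 \cdot 32^{1} + 25$
Evaluating this directly is inefficient: it needs large powers of 32 and produces a value far wider than a 32-bit machine word. But we should be able to handle much larger keys than this example. This is accomplished by refactoring the expression into nested form:
    $(((((((((22 \cdot 32 + 5) \cdot 32 + 18) \cdot 32 + 25) \cdot 32 + 12) \cdot 32 + 15) \cdot 32 + 14) \cdot 32 + 7) \cdot 32 + 11) \cdot 32 + 5) \cdot 32 + 25$
which is computationally more palatable.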

Refactoring - aka Horner's Rule (an aside)
    $p(x) = x^4 + 3x^3 - 6x^2 + 2x + 1$
can be rewritten as:
    $p(x) = x(x(x(x + 3) - 6) + 2) + 1$
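A minimal sketch of Horner's rule for an arbitrary polynomial follows. The function name `horner` and the convention of listing coefficients from the highest power down are choices made for this illustration.

```cpp
#include <cassert>
#include <vector>

// Evaluates a polynomial with Horner's rule.
// coeffs lists the coefficients from the highest power down to the constant term.
long long horner(const std::vector<long long>& coeffs, long long x) {
    assert(!coeffs.empty());
    long long result = 0;
    for (long long c : coeffs) {
        result = result * x + c;  // one multiply and one add per coefficient
    }
    return result;
}
```

For the polynomial on this slide, `horner({1, 3, -6, 2, 1}, 2)` evaluates p(2) = 16 + 24 - 24 + 4 + 1 = 21 using four multiplications and no explicit powers.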

Non-Numerical keys
This leads to a direct algorithm for computing hash functions:
    h = key[0];
    for (j = 1; j < keysize; j++)
        h = ((h * 32) + key[j]) % M;
where h is the computed hash value and key[i] is assumed to contain j if the i-th character in the key is the j-th letter of the alphabet. Overflow is avoided because the mod operation always leaves a value less than M, so the intermediate value h * 32 + key[j] stays below 32M. Using this algorithm with alphabet positions, h('VERYLONGKEY') = 97 (for M = 101); the value 81 is what results when the characters' ASCII codes are used instead of alphabet positions.
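The algorithm can be sketched as a self-contained C++ function. The name `alphaHash`, the restriction to uppercase 'A'..'Z', and starting from h = 0 (equivalent to starting from key[0] when key[0] < M) are assumptions for this example. Working through it by hand with alphabet positions gives 97 for 'VERYLONGKEY' with M = 101; 81 arises only when ASCII codes are fed in instead.

```cpp
#include <cassert>
#include <string>

// Horner-style hash of an uppercase key, reducing mod M at every step.
// Assumes 'A'..'Z' only; each letter contributes its alphabet position (A=1 ... Z=26).
// Assumes M is small enough that 32 * M fits in an unsigned int.
unsigned int alphaHash(const std::string& key, unsigned int M) {
    assert(M > 0);
    unsigned int h = 0;
    for (char ch : key) {
        assert(ch >= 'A' && ch <= 'Z');
        unsigned int code = static_cast<unsigned int>(ch - 'A' + 1);
        h = (h * 32 + code) % M;  // h < M before this step, so h * 32 + code < 32 * M
    }
    return h;
}
```

As a cross-check against the earlier slides, `alphaHash("AKEY", 101)` returns 80, the same slot obtained from 44217 mod 101.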

More hash functions
The following three hash function examples were extracted verbatim from http://www.cs.yorku.ca/~oz/hash.html as simple examples. Take the contents with a grain of salt given the source; however, they are good examples of variations on a theme.

More hash functions
djb2
this algorithm (k=33) was first reported by dan bernstein many years ago in comp.lang.c. another version of this algorithm (now favored by bernstein) uses xor: hash(i) = hash(i - 1) * 33 ^ str[i]; the magic of number 33 (why it works better than many other constants, prime or not) has never been adequately explained.
    unsigned long hash(unsigned char *str)
    {
        unsigned long hash = 5381;
        int c;
        while (c = *str++)
            hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
        return hash;
    }

More hash functions
sdbm
this algorithm was created for sdbm (a public-domain reimplementation of ndbm) database library. it was found to do well in scrambling bits, causing better distribution of the keys and fewer splits. it also happens to be a good general hashing function with good distribution. the actual function is hash(i) = hash(i - 1) * 65599 + str[i]; what is included below is the faster version used in gawk. [there is even a faster, duff-device version] the magic constant 65599 was picked out of thin air while experimenting with different constants, and turns out to be a prime. this is one of the algorithms used in berkeley db and elsewhere.
    static unsigned long
    sdbm(str)
    unsigned char *str;
    {
        unsigned long hash = 0;
        int c;
        while (c = *str++)
            hash = c + (hash << 6) + (hash << 16) - hash;
        return hash;
    }
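The shift expressions in the two listings above are multiplications in disguise: (hash << 5) + hash is hash * 33, and (hash << 6) + (hash << 16) - hash is hash * (64 + 65536 - 1) = hash * 65599. The sketch below checks both identities for a single step. The helper names are hypothetical, and a fixed 64-bit type is used here (instead of the listings' `unsigned long`) so the wraparound behaviour is the same on every platform.

```cpp
#include <cassert>
#include <cstdint>

// One sdbm step written with shifts, as in the listing above.
std::uint64_t sdbmStepShift(std::uint64_t hash, unsigned char c) {
    return c + (hash << 6) + (hash << 16) - hash;
}

// The same step written as a multiplication by 65599 = 2^16 + 2^6 - 1.
std::uint64_t sdbmStepMul(std::uint64_t hash, unsigned char c) {
    return hash * 65599u + c;
}

// One djb2 step: (hash << 5) + hash equals hash * 33.
std::uint64_t djb2StepShift(std::uint64_t hash, unsigned char c) {
    return ((hash << 5) + hash) + c;
}

// Checks both identities for one (hash, c) pair.
// Unsigned arithmetic wraps modulo 2^64 in either form, so they agree even on overflow.
void checkStep(std::uint64_t hash, unsigned char c) {
    assert(sdbmStepShift(hash, c) == sdbmStepMul(hash, c));
    assert(djb2StepShift(hash, c) == hash * 33u + c);
}
```

Calling `checkStep` with values such as `(5381, 'z')` or the largest 64-bit value passes without tripping an assertion.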

More hash functions
This hash function appeared in K&R (1st ed) but at least the reader was warned: "This is not the best possible algorithm, but it has the merit of extreme simplicity." This is an understatement; It is a terrible hashing algorithm, and it could have been much better without sacrificing its "extreme simplicity." [see the second edition!] Many C programmers use this function without actually testing it, or checking something like Knuth's Sorting and Searching, so it stuck. It is now found mixed with otherwise respectable code.
    unsigned long hash(unsigned char *str)
    {
        unsigned int hash = 0;
        int c;
        while (c = *str++)
            hash += c;
        return hash;
    }
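One concrete weakness of the additive function above is that it ignores character order, so every anagram of a key collides with it. The sketch below restates the function over `std::string` (the name `additiveHash` is an assumption for this example) so the effect is easy to try.

```cpp
#include <cassert>
#include <string>

// Additive hash in the style of the listing above: sums the character codes.
// Because addition is commutative, the order of the characters has no effect.
unsigned long additiveHash(const std::string& s) {
    unsigned long h = 0;
    for (unsigned char c : s) {
        h += c;
    }
    return h;
}
```

For instance, `additiveHash("listen")` and `additiveHash("silent")` are equal, and so are `additiveHash("ad")` and `additiveHash("bc")`, since 97 + 100 = 98 + 99.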

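A simple program – and the importance of M (hash table size)
    #include <iostream>
    // std:: is spelled out: "using namespace std" can make the name hash
    // ambiguous with std::hash.
    static char buff[512];
    // Horner-style hash of a C string using its character codes, reduced mod M.
    unsigned int hash(const char *s, unsigned int M) {
        unsigned int h = 0;
        for (int i = 0; s[i] != '\0'; i++) {
            h = ((h * 32) + (unsigned char)s[i]) % M;
        }
        return h;
    }
    int main() {
        while (true) {
            std::cout << "string to hash: ";
            // Stop on end of input or on an empty line.
            if (!std::cin.getline(buff, sizeof buff) || buff[0] == '\0') break;
            std::cout << "hash: " << hash(buff, 101) << std::endl;
        }
        return 0;
    }
What is the implication of a poor choice for M? For example, what if M = 32?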
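One way to explore the question is to rerun the same recurrence with M = 32. Since (h * 32 + c) mod 32 = c mod 32, everything except the last character is discarded. The sketch below uses a hypothetical `stringHash` over `std::string` that mirrors the program's hash function.

```cpp
#include <cassert>
#include <string>

// Same recurrence as the program above, using the character codes directly.
// Assumes M is small enough that 32 * M + 255 fits in an unsigned int.
unsigned int stringHash(const std::string& s, unsigned int M) {
    assert(M > 0);
    unsigned int h = 0;
    for (unsigned char c : s) {
        h = (h * 32 + c) % M;  // with M = 32 this reduces to c % 32
    }
    return h;
}
```

With M = 32, "cat", "bat", and "hat" all land in slot 20 (the code of 't', 116, mod 32), whereas with M = 101 "cat" and "bat" hash to 61 and 47 respectively.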
