Hash Tables and Associative Containers

CS-240 Dick Steflik

Hash Tables
- a hash table is an array of size Tsize
  - has index positions 0 .. Tsize-1
- two types of hash tables
  - open hash table: the array element type is a <key, value> pair; all items are stored in the array
  - chained hash table: the element type is a pointer to a linked list of nodes containing <key, value> pairs; items are stored in the linked list nodes
- keys are used to generate an array index
  - home address (0 .. Tsize-1)

Faster Searching
- "balanced" search trees guarantee an O(log n) search path by controlling the height of the search tree
  - AVL tree
  - 2-3-4 tree
  - red-black tree (used by STL associative container classes)
- a hash table allows for O(1) search performance on average
  - search time does not increase as n increases, provided the load factor stays bounded

Hash Table
- a hash table is an array/vector (fixed size)
  - has index positions 0 .. Tsize-1
- if we could use the keys as an index we would have O(1) retrieval
  - hashTable[key]
- keys are used to generate an array index
  - home address (0 .. Tsize-1)
  - the function to do this is called a hash function
  - hash(key) returns an int value
  - hash(key) % Tsize => 0 .. Tsize-1
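A minimal Python sketch of the hash(key) % Tsize step. The names hash_key and home_address, and the multiply-by-31 string hash, are illustrative assumptions rather than anything from the slides; any function that returns an int could stand in for hash_key.

```python
def hash_key(key: str) -> int:
    """Illustrative string hash (an assumption, not taken from the slides)."""
    h = 0
    for ch in key:
        h = h * 31 + ord(ch)  # fold in one character at a time
    return h


def home_address(key: str, tsize: int) -> int:
    """Reduce the int hash value to an index in 0 .. tsize-1."""
    return hash_key(key) % tsize


if __name__ == "__main__":
    for k in ("apple", "pear", "plum"):
        print(k, "->", home_address(k, 7))
```

Whatever hash_key returns, the modulo keeps the home address inside the array bounds.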

Collisions
- collisions occur whenever two keys produce the same index (hash to the same location)
- design goal: pick a hash function that produces no collisions
- in practice, collisions are a way of life with hash tables
- what do you do?
  - linear probing: check the next location; if it's empty, use it
  - quadratic probing: check the next location, then 4 away, then 9 away, ... (offsets 1, 4, 9, ... from the home address)
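The two probing strategies can be compared by listing the slots each one visits. This is a rough Python sketch; the function names are hypothetical, and the quadratic version uses the common home + i*i form, which is one of several variants.

```python
def linear_probes(home: int, tsize: int) -> list:
    """Slots visited by linear probing: home, home+1, home+2, ... (wrapping)."""
    return [(home + i) % tsize for i in range(tsize)]


def quadratic_probes(home: int, tsize: int) -> list:
    """Slots visited by quadratic probing: home, home+1, home+4, home+9, ..."""
    return [(home + i * i) % tsize for i in range(tsize)]


if __name__ == "__main__":
    print("linear:   ", linear_probes(3, 7))
    print("quadratic:", quadratic_probes(3, 7))
```

With Tsize = 7 and home address 3, the linear sequence reaches every slot, while this quadratic sequence only ever touches slots 3, 4, 0 and 5. That is the usual trade-off: quadratic probing spreads keys out but may not examine the whole table.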

a Hash Table of size 7
- index positions: 0 1 2 3 4 5 6
- some insertions:
  - hash(K1) % 7 => 3
  - hash(K2) % 7 => 5
  - hash(K3) % 7 => 2
  - hash(K4) % 7 => 3
  - hash(K5) % 7 => 2
  - hash(K6) % 7 => 4
- collision resolution strategy: open addressing with a linear probe
- each slot holds: empty flag (T/F), key, value
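The insertions above can be replayed with a short Python sketch of linear-probe insertion. The home addresses are the ones given on the slide; the names HOMES, insert_linear and build_table are illustrative, and None stands in for the "empty" flag.

```python
TSIZE = 7
# Home addresses as listed on the slide, in insertion order.
HOMES = {"K1": 3, "K2": 5, "K3": 2, "K4": 3, "K5": 2, "K6": 4}


def insert_linear(table: list, key: str, home: int) -> int:
    """Place key in the first empty slot at or after home, wrapping around."""
    n = len(table)
    for i in range(n):
        slot = (home + i) % n
        if table[slot] is None:  # None marks an empty slot
            table[slot] = key
            return slot
    raise RuntimeError("table is full")


def build_table() -> list:
    """Insert K1..K6 in order and return the resulting array."""
    table = [None] * TSIZE
    for key, home in HOMES.items():
        insert_linear(table, key, home)
    return table


if __name__ == "__main__":
    for index, key in enumerate(build_table()):
        print(index, key)
```

K4, K5 and K6 all find their home address taken and slide forward; K6 wraps around to index 0.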

Search Performance
average number of probes needed to retrieve the value with key K?

index  empty  key  value
0      F      K6   K6info
1      T
2      F      K3   K3info
3      F      K1   K1info
4      F      K4   K4info
5      F      K2   K2info
6      F      K5   K5info

K    home address  #probes
K1   3             1
K2   5             1
K3   2             1
K4   3             2
K5   2             5
K6   4             4

14/6 = 2.33 (successful)
unsuccessful search?
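The probe counts can be checked with a small Python sketch that searches the linear-probe table. SLIDE_TABLE is the array that results from inserting K1..K6 in order with the slide's home addresses. For the unsuccessful case the sketch assumes every home address is equally likely and counts the look at the empty slot as a probe; both are assumptions, not statements from the slides.

```python
# Table after inserting K1..K6 with linear probing; None marks the empty slot.
SLIDE_TABLE = ["K6", None, "K3", "K1", "K4", "K2", "K5"]
HOMES = {"K1": 3, "K2": 5, "K3": 2, "K4": 3, "K5": 2, "K6": 4}


def probes_successful(table: list, key: str, home: int) -> int:
    """Number of slots inspected before key is found."""
    n = len(table)
    for i in range(n):
        if table[(home + i) % n] == key:
            return i + 1
    raise KeyError(key)


def probes_unsuccessful(table: list, home: int) -> int:
    """Number of slots inspected before an empty slot ends the search."""
    n = len(table)
    for i in range(n):
        if table[(home + i) % n] is None:
            return i + 1
    return n  # no empty slot: every position was inspected


def average_successful(table: list, homes: dict) -> float:
    return sum(probes_successful(table, k, h) for k, h in homes.items()) / len(homes)


def average_unsuccessful(table: list) -> float:
    n = len(table)
    return sum(probes_unsuccessful(table, h) for h in range(n)) / n


if __name__ == "__main__":
    print("successful:  ", round(average_successful(SLIDE_TABLE, HOMES), 2))
    print("unsuccessful:", average_unsuccessful(SLIDE_TABLE))
```

Under those assumptions the successful average is 14/6 and the unsuccessful average works out to 28/7 = 4 probes, because most home addresses sit inside the long run of occupied slots.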

Chaining with Separate Lists
- hash(K1) % 7 => 3
- hash(K2) % 7 => 5
- hash(K3) % 7 => 2
- hash(K4) % 7 => 3
- hash(K5) % 7 => 2
- hash(K6) % 7 => 4

index  chain
0      (empty)
1      (empty)
2      K3 K3info -> K5 K5info
3      K1 K1info -> K4 K4info
4      K6 K6info
5      K2 K2info
6      (empty)

linked lists of synonyms
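A minimal Python sketch of a chained hash table for this example. The class and method names are hypothetical, and each chain is a Python list rather than a hand-built linked list, which keeps the sketch short but is a simplification of what the slide describes. slide_hash simply looks up the home addresses given on the slide.

```python
HOMES = {"K1": 3, "K2": 5, "K3": 2, "K4": 3, "K5": 2, "K6": 4}


def slide_hash(key: str) -> int:
    """Stand-in hash function returning the slide's values (only K1..K6)."""
    return HOMES[key]


class ChainedHashTable:
    def __init__(self, tsize: int, hash_fn):
        self.buckets = [[] for _ in range(tsize)]  # one chain per index
        self.hash_fn = hash_fn

    def _home(self, key) -> int:
        return self.hash_fn(key) % len(self.buckets)

    def insert(self, key, value) -> None:
        chain = self.buckets[self._home(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:  # key already present: replace its value
                chain[i] = (key, value)
                return
        chain.append((key, value))  # new synonym goes at the end of the chain

    def search(self, key):
        for k, v in self.buckets[self._home(key)]:
            if k == key:
                return v
        return None


def build_slide_table() -> ChainedHashTable:
    table = ChainedHashTable(7, slide_hash)
    for key in HOMES:
        table.insert(key, key + "info")
    return table


if __name__ == "__main__":
    for index, chain in enumerate(build_slide_table().buckets):
        print(index, [k for k, _ in chain])
```

Keys with the same home address (synonyms) end up in the same chain, so a collision never displaces another key.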

Search Performance
average number of probes needed to retrieve the value with key K?

index  chain
0      (empty)
1      (empty)
2      K3 K3info -> K5 K5info
3      K1 K1info -> K4 K4info
4      K6 K6info
5      K2 K2info
6      (empty)

K    home address  #probes
K1   3             1
K2   5             1
K3   2             1
K4   3             2
K5   2             2
K6   4             1

8/6 = 1.33 (successful)
unsuccessful search?
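For chaining, the number of probes for a successful search is just the key's position in its chain. A short Python sketch, with the chains written out as they appear in this example (the names CHAINS and chain_probes are illustrative):

```python
# Chains from the example, indexed by home address; empty chains are omitted.
CHAINS = {2: ["K3", "K5"], 3: ["K1", "K4"], 4: ["K6"], 5: ["K2"]}


def chain_probes(chains: dict) -> dict:
    """Map each key to the number of nodes visited to find it."""
    return {
        key: position + 1
        for chain in chains.values()
        for position, key in enumerate(chain)
    }


def average_probes(chains: dict) -> float:
    counts = chain_probes(chains)
    return sum(counts.values()) / len(counts)


if __name__ == "__main__":
    print(chain_probes(CHAINS))
    print(round(average_probes(CHAINS), 2))
```

Only the second node of each two-element chain costs an extra probe, which is why the average stays close to 1.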

Where are Hash Tables used?
- databases
- spelling checkers
- Java uses them all over the place (built into the language)
- most scripting languages (ASP, Perl, PHP) have associative arrays
- caching schemes
  - software: browsers, HTTP proxy servers, DNS servers
  - hardware: memory caching, instruction caching

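Deletions?
- search for the item to be deleted
- chained hash table
  - delete a node from a linked list
- open hash table
  - just mark the spot as "empty"? No: a later search would stop at that spot and miss any key that was placed beyond it by probing
  - must mark the vacated spot as "deleted", which is different from "empty"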
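Why "deleted" has to differ from "empty" in an open hash table can be shown with a small Python sketch using linear probing. EMPTY, DELETED, find_slot and delete are hypothetical names, and SLIDE_TABLE is the size-7 linear-probe table from the earlier example.

```python
EMPTY = None
DELETED = object()  # marker for a vacated slot, distinct from EMPTY

# Size-7 table from the linear-probing example (K1 and K4 both have home 3).
SLIDE_TABLE = ["K6", EMPTY, "K3", "K1", "K4", "K2", "K5"]


def find_slot(table: list, key: str, home: int):
    """Return the slot holding key, or None if an EMPTY slot ends the search."""
    n = len(table)
    for i in range(n):
        slot = (home + i) % n
        if table[slot] is EMPTY:
            return None  # a truly empty slot means the key is absent
        if table[slot] == key:
            return slot
        # DELETED slots and other keys: keep probing
    return None


def delete(table: list, key: str, home: int) -> bool:
    """Mark the key's slot as DELETED so later probes can pass through it."""
    slot = find_slot(table, key, home)
    if slot is None:
        return False
    table[slot] = DELETED
    return True


if __name__ == "__main__":
    good = list(SLIDE_TABLE)
    delete(good, "K1", 3)
    print("with DELETED marker, K4 at:", find_slot(good, "K4", 3))

    bad = list(SLIDE_TABLE)
    bad[3] = EMPTY  # wrong: vacated slot marked as empty
    print("with EMPTY marker, K4 at:  ", find_slot(bad, "K4", 3))
```

After K1 is removed, K4 is still found when the slot is marked as deleted, but the search gives up immediately when the slot is marked as empty. An insert routine would typically be allowed to reuse DELETED slots; that part is left out of the sketch.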

Hash Functions
- a hash function is used to map a key to an array index (home address)
  - search starts from here
- insert, retrieve, delete all start by applying the hash function to the key
- characteristics
  - uniform distribution of hash values (no clustering)
- goals for a hash function
  - fast to compute
  - even distribution over the entire collection of keys
- all hash functions produce collisions when there are more possible keys than table positions
  - multiple keys hash to the same home address

Some Hash Functions...
Division
- H(key) = key mod m
- works well in most cases as long as keys are relatively random
- if key is an integer, the identity function (return key) can supply the hash value
  - good if keys are random
  - not good if keys have similar characteristics
- ex: m = 25
  - all keys divisible by 5 would map into positions 0, 5, 10, 15, 20, causing clustering around those values
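The m = 25 example is easy to reproduce. The Python sketch below hashes multiples of 5 by division and lists which positions get used; the comparison with m = 23 is an added illustration (a table size that shares no factor with 5), not something stated on the slide.

```python
def division_hash(key: int, m: int) -> int:
    """H(key) = key mod m."""
    return key % m


def used_positions(keys, m: int) -> list:
    """Sorted list of distinct table positions hit by the given keys."""
    return sorted({division_hash(k, m) for k in keys})


if __name__ == "__main__":
    keys = range(0, 500, 5)  # 100 keys, all divisible by 5
    print("m = 25:", used_positions(keys, 25))
    print("m = 23:", len(used_positions(keys, 23)), "positions used")
```

With m = 25 the 100 keys pile into five positions, while with m = 23 the same keys reach every position in the table.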

more Hash functions...
Mid-Squared
- produces a nearly random distribution of indices
- mid-square technique takes longer to compute but gives a better distribution when keys may have some digits in common
- convert the key to an octal string
  - A-Z = ... and 0-9 = ...
  - ex: key = A1 = 134 (octal)
- square the key: 134 (octal) * 134 (octal) = ...
- using a table of 1024 elements, use the middle 10 bits as the index
  - index = 1000100010 (binary) = 546 (decimal)
- note: most collisions will occur for short identifiers
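A rough Python sketch of the mid-square idea for integer keys. It does not reproduce the slide's character-to-octal encoding; the function name, the key_bits parameter and the way the "middle" bits are centred are all assumptions made for illustration.

```python
def mid_square(key: int, key_bits: int, index_bits: int) -> int:
    """Square the key and return the middle index_bits bits of the square.

    The square of a key_bits-wide key is treated as 2 * key_bits bits wide,
    and the window of index_bits bits is centred in that width.
    """
    if not 0 <= key < (1 << key_bits):
        raise ValueError("key does not fit in key_bits")
    if index_bits > 2 * key_bits:
        raise ValueError("index_bits is wider than the square")
    square = key * key
    shift = (2 * key_bits - index_bits) // 2  # low-order bits to discard
    return (square >> shift) & ((1 << index_bits) - 1)


if __name__ == "__main__":
    # 10 index bits address a table of 1024 elements.
    for key in (1234, 1235, 2234):
        print(key, "->", mid_square(key, 16, 10))
```

Because the middle bits of the square depend on every bit of the key, keys that share digits still tend to land on different indices.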

more Hash functions...
Digit Folding
- assume a 5 digit decimal string (digits 0-9 only)
- H(key) = d1 + d2 + d3 + d4 + d5 (sum of digits)
  - this would yield 0 <= h <= 45 for all possible keys
- if we were to fold the digits in pairs
  - H(key) = d1d2 + d3d4 + d5
  - 0 <= h <= 207 (99 + 99 + 9)
Double hashing
- use two (or more) hash functions serially
- helps overcome the effects of a function that produces a poor distribution of keys
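Both folding variants fit in a few lines of Python. The function names are illustrative, and keys are assumed to be strings of decimal digits as on the slide.

```python
def fold_digits(key: str) -> int:
    """H(key) = d1 + d2 + d3 + d4 + d5 (sum of the individual digits)."""
    return sum(int(d) for d in key)


def fold_pairs(key: str) -> int:
    """H(key) = d1d2 + d3d4 + d5 (sum of two-digit groups, last may be short)."""
    return sum(int(key[i:i + 2]) for i in range(0, len(key), 2))


if __name__ == "__main__":
    for key in ("12345", "99999", "00000"):
        print(key, fold_digits(key), fold_pairs(key))
```

Folding in pairs widens the range of results from 0..45 to 0..207, so it can address a larger table; either result would still be reduced with % Tsize to get a home address.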

Clustering
- undesirable characteristic of the hash function selected and the collision resolution strategy
  - too many keys hash to the same location, causing long strings of keys that need to be searched
  - especially bad when using a division-based function together with linear probing
  - insertion/deletion/search can approach O(n)
- solutions
  - pick a different hash function
  - pick a different collision resolution strategy

Factors Affecting Search Performance
- quality of the hash function
  - uniformity of the distribution
  - depends on actual data
- collision resolution strategy used
- load factor of the hash table
  - N/Tsize
  - the lower the load factor, the better the search performance

Successful Search Performance
[Table: successful search performance by load factor, with columns for open addressing (linear probing), open addressing (double hashing), and chaining]
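The figures for this comparison are not in the transcript. As a stand-in, the Python sketch below evaluates the usual textbook approximations for the expected number of probes in a successful search at load factor a: (1 + 1/(1 - a))/2 for linear probing, (1/a) ln(1/(1 - a)) for double hashing, and 1 + a/2 for chaining. These formulas are brought in from standard analyses and assume a well-behaved hash function; they are not taken from the slide.

```python
import math


def linear_probing_successful(alpha: float) -> float:
    """Approximate probes for a successful search, linear probing (0 < alpha < 1)."""
    return 0.5 * (1 + 1 / (1 - alpha))


def double_hashing_successful(alpha: float) -> float:
    """Approximate probes for a successful search, double hashing (0 < alpha < 1)."""
    return (1 / alpha) * math.log(1 / (1 - alpha))


def chaining_successful(alpha: float) -> float:
    """Approximate probes for a successful search, separate chaining (alpha > 0)."""
    return 1 + alpha / 2


if __name__ == "__main__":
    print("load  linear  double  chaining")
    for alpha in (0.25, 0.5, 0.75, 0.9):
        print(
            f"{alpha:<5} "
            f"{linear_probing_successful(alpha):6.2f}  "
            f"{double_hashing_successful(alpha):6.2f}  "
            f"{chaining_successful(alpha):8.2f}"
        )
```

The open addressing curves climb steeply as the load factor approaches 1, while chaining grows only linearly and remains defined for load factors above 1.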

Summary of Hash tables
- search speed depends on load factor and quality of hash function
  - load factor should be less than 0.75 for open addressing
  - load factor can be more than 1 for chaining
- items are not kept sorted by key
- very good for fast access to unordered data with a known upper bound on the number of items, which is what lets you pick a good Tsize
