Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 G64ADS Advanced Data Structures Hashing. 2 Overview oHashing oTechnique supporting insertion, deletion and search in average-case constant time oOperations.

Similar presentations


Presentation on theme: "1 G64ADS Advanced Data Structures Hashing. 2 Overview oHashing oTechnique supporting insertion, deletion and search in average-case constant time oOperations."— Presentation transcript:

1 1 G64ADS Advanced Data Structures Hashing

2 2 Overview oHashing oTechnique supporting insertion, deletion and search in average-case constant time oOperations requiring elements to be sorted (e.g., FindMin) are not efficiently supported o Hash table ADT o Implementations o Analysis o Applications

3 3 Hashing Table o One approach o Hash table is an array of fixed size TableSize o Array elements indexed by a key, which is mapped to an array index (0…TableSize-1) o Mapping (hash function) h from key to index o e.g., h(“john”) = 3

4 4 Hashing Table o Problems oChoose a hash function oWhat to do when two keys hash to the same value (collision) oTable size

5 5 Hashing Table o Insert o T [h(“john”] = o Delete o T [h(“john”)] = NULL o Search o Return T [h(“john”)] o What if h(“john”) = h(“joe”) ?

6 6 Hash Function oMapping from key to array index is called a hash function oTypically, many-to-one mapping oDifferent keys map to different indices oDistributes keys evenly over table oCollision occurs when hash function maps two keys to the same array index

7 7 Hash Function oSimple hash o h(Key) = Key mod TableSize o Assumes integer keys oFor random keys, h() distributes keys evenly over table oWhat if TableSize = 100 and keys are multiples of 10? oBetter if TableSize is a prime number oNot too close to powers of 2 or 10

8 8 Hash Function for String Keys o Approach 1 o Add up character ASCII values (0-127) to produce integer keys o Small strings may not use all of table o Strlen(S) * 127 < TableSize o Approach 2 o Treat first 3 characters of string as base-27 integer (26 letters plus space) o Key = S[0] + (27 * S[1]) + (272 * S[2]) o Assumes first 3 characters randomly distributed o Not true of English

9 9 Hash Function for String Keys o Approach3 o Use all N characters of string as an N-digit base-K integer o Choose K to be prime number larger than number of different digits (characters), i.e., K = 29, 31, 37 o If L = length of string S, then oUse Horner’s rule to compute h(S) oLimit L for long strings

10 10 Hash Function for String Keys o Approach3 o Use all N characters of string as an N-digit base-K integer o Choose K to be prime number larger than number of different digits (characters) o i.e., K = 29, 31, 37 o If L = length of string S, then oUse Horner’s rule to compute h(S) oLimit L for long strings

11 11 Collision Resolution o What happens when h(k1) = h(k2)? o Collision resolution strategies o Chaining o Store colliding keys in a linked list o Open addressing o Store colliding keys elsewhere in the table

12 12 Collision Resolution - Chaining oHash table T is a vector of lists oOnly singly-linked lists needed if memory is tight oKey k is stored in list at T[h(k)] o e.g., TableSize = 10 o h(k) = k mod 10 oInsert first 10 perfect squares

13 13 Chaining Implementation

14 14 Chaining Implementation

15 15 Collision Resolution - Chaining oAnalysis o Load factor λ of a hash table o N = number of elements in T o M = size of T o λ = N/M o Average length of a chain is λ o Unsuccessful search O(λ) o Successful search O(λ/2) o Ideally, want λ ≈ 1 (not a function of N) o i.e., TableSize = number of elements you expect to store in the table

16 16 Collision Resolution – Open Addressing o When a collision occurs, look elsewhere in the table for an empty slot o Advantages over chaining o No need for addition list structures o No need to allocate/deallocate memory during insertion/deletion (slow) o Disadvantages oSlower insertion – May need several attempts to find an empty slot oTable needs to be bigger (than chaining-based table) to achieve average-case constant-time performance o Load factor λ ≈ 0.5

17 17 Collision Resolution – Open Addressing o Probe sequence o Sequence of slots in hash table to search o h0(x), h1(x), h2(x), … o Needs to visit each slot exactly once o Needs to be repeatable (so we can find/delete what we’ve inserted) o Hash function o hi(x) = (h(x) + f(i)) mod TableSize o f(0) = 0

18 18 Collision Resolution – Open Addressing o Linear probing o f(i) is a linear function of i o e.g., f(i) = i

19 19 Collision Resolution – Open Addressing o

20 20 Collision Resolution – Open Addressing oLinear Probing – Analysis o Probe sequences can get long o Primary clustering o Keys tend to cluster in one part of table o Keys that hash into cluster will be added to the end of the cluster (making it even bigger)

21 21 Collision Resolution – Open Addressing oExpected number of probes for insertion or unsuccessful search oExpected number of probes for successful search

22 22 Collision Resolution – Open Addressing oRandom probe does not suffered from clustering oExpected number of probes for insertion or unsuccessful search

23 23 Collision Resolution – Open Addressing oLinear vs. random probing

24 24 Collision Resolution – Open Addressing oQuadratic probing o Avoids primary clustering o f(i) is quadratic in i o e.g., f(i) = i 2 o Example o h0(58) = (h(58)+f(0)) mod 10 = 8 (X) o h1(58) = (h(58)+f(1)) mod 10 = 9 (X) o h2(58) = (h(58)+f(2)) mod 10 = 2

25 25 Collision Resolution – Open Addressing oQuadratic probing

26 26 Collision Resolution – Open Addressing oQuadratic probing- Analysis oDifficult to analyze oTheorem 5.1: New element can always be inserted into a table that is at least half empty and TableSize is prime oOtherwise, may never find an empty slot, even if one exists o Ensure table never gets half full o If close, then expand it

27 27 Collision Resolution – Open Addressing oQuadratic probing- Analysis o Only M (TableSize) different probe sequences o May cause “secondary clustering” o Deletion o Emptying slots can break probe sequence o Lazy deletion o Differentiate between empty and deleted slot o Skip deleted slots o Slows operations (effectively increases λ)

28 28 Collision Resolution – Open Addressing oQuadratic probing- Implementation

29 29 Collision Resolution – Open Addressing oQuadratic probing- Implementation

30 30 Double Hashing o Combine two different hash functions o f(i) = i * h2(x) o Good choices for h2(x) ? o Should never evaluate to 0 o h2(x) = R – (x mod R) o R is prime number less than TableSize o Previous example with R=7 o h0(49) = (h(49)+f(0)) mod 10 = 9 (X) o h1(49) = (h(49)+(7 – 49 mod 7)) mod 10 = 6

31 31 Double Hashing o

32 32 Double Hashing o Analysis oImperative that TableSize is prime o e.g., insert 23 into previous table oEmpirical tests show double hashing close to random hashing oExtra hash function takes extra time to compute

33 33 Rehashing oIncrease the size of the hash table when load factor too high oTypically expand the table to twice its size (but still prime) oReinsert existing elements into new hash table

34 34 Rehashing h(x) = x mod 7 λ = 0.57 Insert 23 λ = 0.71 Rehashing h(x) = x mod 17 λ = 0.29

35 35 Rehashing oAnalysis o Rehashing takes O(N) time o But happens infrequently o Specifically o Must have been N/2 insertions since last rehash o Amortizing the O(N) cost over the N/2 prior insertions yields only constant additional time per insertion

36 36 Rehashing oImplementation o When to rehash o When table is half full (λ = 0.5) o When an insertion fails o When load factor reaches some threshold o Works for chaining and open addressing

37 37 Rehashing oFor chaining

38 38 Rehashing oFor quadratic probing

39 39 Hash Table in the Standard Library

40 40 Problem with Large Tables o What if hash table is too large to store in main memory? o Solution: Store hash table on disk o Minimize disk accesses o But… o Collisions require disk accesses o Rehashing requires a lot of disk accesses

41 41 Extendible Hashing o Store hash table in a depth-1 tree o Every search takes 2 disk accesses o Insertions require few disk accesses o Hash the keys to a long integer (“extendible”) o Use first few bits of extended keys as the keys in the root node (“directory”) o Leaf nodes contain all extended keys starting with the bits in the associated root node key

42 42 Extendible Hashing oExtendible hash table oContains N = 12 data elements oFirst D = 2 bits of key used by root node keys o2 D entries in directory oEach leaf contains up to M = 4 data elements oAs determined by disk page size oEach leaf stores number of common starting bits (d L )

43 43 Extendible Hashing After inserting 100100 Directory split and rewritten

44 44 Extendible Hashing After inserting 000000 Directory split and rewritten

45 45 Extendible Hashing oAnalysis o Expected number of leaves is (N/M)*log2 e = (N/M)*1.44 o Average leaf is (ln 2) = 0.69 full o Same as for B-trees o Expected size of directory is O(N(1+1/M)/M) o O(N/M) for large M (elements per leaf)

46 46 Extendible Hashing oApplications o Maintaining symbol table in compilers o Accessing tree or graph nodes by name o e.g., city names in Google maps o Maintaining a transposition table in games o Remember previous game situations and the move taken (avoid re-computation) o Dictionary lookups o Spelling checkers oNatural language understanding (word sense)

47 47 Summary oHash tables support fast insert and search oO(1) average case performance oDeletion possible, but degrades performance oNot good if need to maintain ordering over elements oMany applications


Download ppt "1 G64ADS Advanced Data Structures Hashing. 2 Overview oHashing oTechnique supporting insertion, deletion and search in average-case constant time oOperations."

Similar presentations


Ads by Google