Dictionaries, Tables, Hashing (TCSS 342): presentation transcript


Slide 1/51: Dictionaries, Tables, Hashing (TCSS 342)

Slide 2/51: The Dictionary ADT
– a dictionary (table) is an abstract model of a database
– like a priority queue, a dictionary stores key-element pairs
– the main operation supported by a dictionary is searching by key

Slide 3/51: Examples
– telephone directory
– library catalogue
– Books in Print (key: ISBN)
– FAT (File Allocation Table)

Slide 4/51: Main Issues
– size
– operations: search, insert, delete, others?
– create reports? produce a list?
– what will be stored in the dictionary?
– how will items be identified?

Slide 5/51: The Dictionary ADT
– simple container methods: size(), isEmpty(), elements()
– query methods: findElement(k), findAllElements(k)

Slide 6/51: The Dictionary ADT
– update methods: insertItem(k, e), removeElement(k), removeAllElements(k)
– special element: NO_SUCH_KEY, returned by an unsuccessful search
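A minimal Java sketch of the ADT outlined on the last two slides; the interface name, the generic parameters, and the choice of null as the NO_SUCH_KEY sentinel are assumptions, not part of the slides.

public interface Dictionary<K, E> {
    // simple container methods
    int size();
    boolean isEmpty();
    Iterable<E> elements();

    // query methods; null plays the role of NO_SUCH_KEY in this sketch
    E findElement(K key);
    Iterable<E> findAllElements(K key);

    // update methods
    void insertItem(K key, E element);
    E removeElement(K key);
    Iterable<E> removeAllElements(K key);
}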

Slide 7/51: Implementing a Dictionary with a Sequence
– unordered sequence
– searching and removing take O(n) time
– inserting takes O(1) time
– application to log files (frequent insertions, rare searches and removals)
– [figure: unordered sequence 34, 14, 12, 22, 18]

Slide 8/51: Implementing a Dictionary with a Sequence
– array-based ordered sequence (assumes keys can be ordered)
– searching takes O(log n) time (binary search)
– inserting and removing take O(n) time
– application to look-up tables (frequent searches, rare insertions and removals)
– [figure: ordered sequence 12, 14, 18, 22, 34]

Slide 9/51: Binary Search
– narrow down the search range in stages (the "high-low" game)
– example: findElement(22) on the sorted array 2 4 5 7 8 9 12 14 17 19 22 25 27 28 33 37
– [figure: low, mid, high pointers; the first probe compares against the mid value 14]

Slide 10/51: Binary Search
– [figure: successive probes on the same array compare against the mid values 25, then 19, until low = mid = high at the key 22]

Slide 11/51: Pseudocode for Binary Search

Algorithm BinarySearch(S, k, low, high)
  if low > high then
    return NO_SUCH_KEY
  else
    mid ← (low + high) / 2
    if k = key(mid) then
      return element(mid)
    else if k < key(mid) then
      return BinarySearch(S, k, low, mid - 1)
    else
      return BinarySearch(S, k, mid + 1, high)
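A runnable Java version of the same recursion, searching a sorted int array from the earlier example; returning the index (with -1 as the "not found" sentinel) is my convention, not the slide's.

public class BinarySearchDemo {
    static final int NO_SUCH_KEY = -1;   // sentinel index meaning "not found"

    // recursive binary search mirroring the slide's pseudocode
    static int binarySearch(int[] s, int k, int low, int high) {
        if (low > high) {
            return NO_SUCH_KEY;
        }
        int mid = (low + high) / 2;
        if (k == s[mid]) {
            return mid;                                   // found: return the position
        } else if (k < s[mid]) {
            return binarySearch(s, k, low, mid - 1);      // search the left half
        } else {
            return binarySearch(s, k, mid + 1, high);     // search the right half
        }
    }

    public static void main(String[] args) {
        int[] s = {2, 4, 5, 7, 8, 9, 12, 14, 17, 19, 22, 25, 27, 28, 33, 37};
        System.out.println(binarySearch(s, 22, 0, s.length - 1));   // prints 10
    }
}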

Slide 12/51: Running Time of Binary Search
– the range of candidate items to be searched is halved after each comparison

Comparison    Search range
0             n
1             n/2
...           ...
i             n/2^i
log2 n        1

Slide 13/51: Running Time of Binary Search
– in the array-based implementation, access by rank takes O(1) time, so binary search runs in O(log n) time
– binary search is applicable only to random-access structures (arrays, vectors, ...)

Slide 14/51: Implementations
– sorted or unsorted?
– elementary: arrays, vectors, linked lists
– organization: none (log file), sorted, hashed
– advanced: balanced trees

Slide 15/51: Skip Lists
– simulate binary search on a linked list
– a linked list allows easy insertion and deletion
– http://www.epaperpress.com/s_man.html

Slide 16/51: Hashing
– place the item with key k in position h(k); the hope is that h(k) is one-to-one
– requires a unique key (unless multiple items are allowed)
– the key must be protected from change (use an abstract class that provides only a constructor)
– keys must be "comparable"

Slide 17/51: Key class

public abstract class KeyID {
    private Comparable searchKey;

    public KeyID(Comparable m) {
        searchKey = m;   // only one constructor, so the key cannot be replaced after creation
    }

    public Comparable getSearchKey() {
        return searchKey;
    }
}

Slide 18/51: Hash Tables
– RT&T is a large phone company, and they want to provide enhanced caller ID capability:
– given a phone number, return the caller's name
– phone numbers are in the range 0 to R = 10^10 - 1
– n is the number of phone numbers in use
– we want to do this as efficiently as possible

Slide 19/51: Alternatives
There are a few ways to design this dictionary:
– a balanced search tree (AVL, red-black, 2-4 trees, B-trees) or a skip list with the phone number as the key has O(log n) query time and O(n) space: good space usage and search time, but can we reduce the search time to constant?
– a bucket array indexed by the phone number has optimal O(1) query time, but there is a huge amount of wasted space: O(n + R)

Slide 20/51: Bucket Array
– each cell is thought of as a bucket, i.e. a container of key-element pairs
– in an array A of size N, an element e with key k is inserted in A[k]
– table operations without searches!
– [figure: array indexed by phone numbers 000-000-0000 through 999-999-9999, with Roberto stored at 401-863-7639 and null elsewhere]
– note: we need 10,000,000,000 buckets!

Slide 21/51: Generalized Indexing
– hash table: a data storage location associated with a key
– the key need not be an integer, but keys must be comparable

Slide 22/51: Hash Tables
– a data structure in which the location of an item is determined directly as a function of the item itself, not by a sequence of trial-and-error comparisons
– commonly used to provide faster searching
– comparison of search times: O(n) for linear search, O(log n) for binary search, O(1) for a hash table

Slide 23/51: Examples
– the symbol table constructed by a compiler: stores identifiers and information about them in an array
– file systems: the i-node location of a file in the file system
– personal records: retrieval of personal information based on a key

Slide 24/51: Hashing Engine
– [figure: an item's key is fed into a position calculator (the hashing engine), which produces a table position]

Slide 25/51: Example
– insert the item (401-863-7639, Roberto) into a table of size 5
– calculate: 4018637639 mod 5 = 4, so insert the item (401-863-7639, Roberto) in position 4 of the table (array, vector)
– a lookup uses the same process: use the hash engine to map the key to a position, then check the array cell at that position
– [figure: table with cells 0 through 4, holding (401-863-7639, Roberto) in cell 4]
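A tiny Java illustration of the arithmetic on this slide; using a long because 4018637639 does not fit in an int. The class and variable names are illustrative only.

public class PhoneHashDemo {
    public static void main(String[] args) {
        long phone = 4018637639L;            // the digits of 401-863-7639
        int tableSize = 5;
        int position = (int) (phone % tableSize);
        System.out.println(position);        // prints 4
    }
}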

Slide 26/51: Chaining
– the expected search/insertion/removal time is O(n/N), provided the indices are uniformly distributed
– the performance of the data structure can be fine-tuned by changing the table size N

Slide 27/51: From Keys to Indices
– the mapping of keys to indices of a hash table is called a hash function
– a hash function is usually the composition of two maps:
  – hash code map: key → integer
  – compression map: integer → [0, N - 1]
– an essential requirement of the hash function is that it map equal keys to equal indices
– a "good" hash function is fast and minimizes the probability of collisions

Slide 28/51: Perfect Hash Functions
– a perfect hash function maps each key to a unique position
– a perfect hash function can be constructed if we know in advance all the keys to be stored in the table (which is almost never the case...)

Slide 29/51: A Good Hash Function
A good hash function should:
1. be easy and fast to compute
2. distribute items evenly throughout the hash table
3. support efficient collision resolution

Slide 30/51: Popular Hash-Code Maps
– integer cast: for numeric types with 32 bits or less, we can reinterpret the bits of the number as an int
– component sum: for numeric types with more than 32 bits (e.g., long and double), we can add the 32-bit components
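A sketch of the component-sum hash code for a 64-bit long, adding its high and low 32-bit halves; the method name and the sample key are illustrative, not from the slides.

public class ComponentSum {
    // add the two 32-bit components of a long to produce a 32-bit hash code
    static int componentSumHash(long key) {
        return (int) ((key >>> 32) + (key & 0xFFFFFFFFL));
    }

    public static void main(String[] args) {
        System.out.println(componentSumHash(4018637639L));
    }
}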

Slide 31/51: Sample Hash Functions
– digit selection: h(2536924520) = 590 (select the 2nd, 5th, and last digits)
– this is usually not a good hash function: it will not distribute keys evenly
– a hash function should use every part of the key

Slide 32/51: Sample Hash Functions (continued)
– folding: add all the digits
– modulo arithmetic: h(x) = x mod table_size
– modulo arithmetic is a very popular basis for hash functions
– to improve the chance of an even distribution, table_size should be a prime number
– if n is the number of items, there is always a prime p with n < p < 2n
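A sketch of digit folding followed by modulo compression, using a prime table size as the slide recommends; the table size 13 and the sample key are chosen for illustration only.

public class FoldingHash {
    // folding: add all the decimal digits of the key
    static int fold(long key) {
        int sum = 0;
        for (long k = key; k > 0; k /= 10) {
            sum += (int) (k % 10);
        }
        return sum;
    }

    public static void main(String[] args) {
        int tableSize = 13;                          // a prime, as suggested on the slide
        long key = 4018637639L;
        System.out.println(fold(key) % tableSize);   // digit sum compressed into [0, 12]
    }
}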

Slide 33/51: Popular Hash-Code Maps
– polynomial accumulation: for strings of a natural language, combine the character values (ASCII or Unicode) a0, a1, ..., a(n-1) by viewing them as the coefficients of a polynomial: a0 + a1*x + ... + a(n-1)*x^(n-1)
– for instance, choosing x = 33, 37, 39, or 41 gives at most 6 collisions on a vocabulary of 50,000 English words
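A sketch of polynomial accumulation over a string's characters using Horner's rule with x = 33; letting the int arithmetic overflow and wrap is the usual practice, and the method name is mine.

public class PolynomialHash {
    // evaluate the polynomial of character codes at x = 33 via Horner's rule
    static int polynomialHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 33 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(polynomialHash("hashing"));
    }
}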

Slide 34/51: Popular Hash-Code Maps
– why is the component-sum hash code bad for strings?

Slide 35/51: Popular Compression Maps
– division: h(k) = |k| mod N
  – the choice N = 2^k is bad because not all the bits are taken into account
  – the table size N is usually chosen as a prime number
  – certain patterns in the hash codes are propagated
– Multiply, Add, and Divide (MAD): h(k) = |ak + b| mod N
  – eliminates patterns, provided a mod N ≠ 0
  – the same formula is used in linear congruential (pseudo) random number generators
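A sketch of the MAD compression map; the constants a = 7 and b = 3 and the prime table size 13 are arbitrary choices subject to a mod N ≠ 0, not values from the slides.

public class MadCompression {
    // Multiply, Add, and Divide: h(k) = |a*k + b| mod N
    static int compress(int hashCode, int a, int b, int n) {
        // compute in long so the product cannot overflow before taking the absolute value
        return (int) (Math.abs((long) a * hashCode + b) % n);
    }

    public static void main(String[] args) {
        int n = 13;                                      // prime table size
        System.out.println(compress(4018637, 7, 3, n));  // a mod n != 0, so patterns are broken up
    }
}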

Slide 36/51: Java Hash
– Java provides a hashCode() method for the Object class, which typically returns the 32-bit memory address of the object
– this default hash code would work poorly for Integer and String objects
– the hashCode() method should be suitably redefined by classes
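A sketch of redefining hashCode() (and, as Java requires, equals()) for a simple key class; the PhoneKey class is invented for illustration and is not part of the slides.

public class PhoneKey {
    private final long number;

    public PhoneKey(long number) {
        this.number = number;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof PhoneKey && ((PhoneKey) other).number == this.number;
    }

    @Override
    public int hashCode() {
        return Long.hashCode(number);   // mixes the two 32-bit halves of the long
    }
}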

Slide 37/51: Collision
– a collision occurs when two distinct items are mapped to the same position
– insert (401-863-9350, Andy): 4018639350 mod 5 = 0
– then insert (401-863-2234, Devin): 4018632234 mod 5 = 4, but position 4 already holds Roberto; we have a collision!
– [figure: table with cells 0 through 4, Andy in cell 0 and Roberto in cell 4]

Slide 38/51: Collision Resolution
– how do we deal with two keys that map to the same cell of the array?
– we need policies, and well-designed hashing engines that minimize collisions

Slide 39/51: Chaining I
– use chaining: each position is viewed as a container of a list of items, not a single item
– all items in this list share the same hash value

Slide 40/51: Chaining II
– [figure: a table with positions 0 through 4, each position pointing to a chain of colliding items]
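A minimal Java sketch of a chaining table, where each bucket holds a list of entries that hash to the same index; the class names, the fixed capacity, and the use of null for NO_SUCH_KEY are my choices.

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class ChainingTable<K, E> {
    private static class Entry<K, E> {
        final K key;
        final E element;
        Entry(K key, E element) { this.key = key; this.element = element; }
    }

    private final List<List<Entry<K, E>>> buckets;

    public ChainingTable(int capacity) {
        buckets = new ArrayList<>(capacity);
        for (int i = 0; i < capacity; i++) {
            buckets.add(new LinkedList<>());   // each position is a container of items
        }
    }

    private int indexFor(K key) {
        return Math.abs(key.hashCode() % buckets.size());
    }

    public void insertItem(K key, E element) {
        buckets.get(indexFor(key)).add(new Entry<>(key, element));   // collisions just extend the chain
    }

    public E findElement(K key) {
        for (Entry<K, E> entry : buckets.get(indexFor(key))) {
            if (entry.key.equals(key)) {
                return entry.element;
            }
        }
        return null;   // plays the role of NO_SUCH_KEY
    }
}

With n items in N buckets and a well-spread hash function, each chain has about n/N entries, which is where the O(n/N) expected time on slide 26 comes from.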

Slide 41/51: Collision Resolution Policies
– a key is mapped to an already occupied table location: what do we do?
– use a collision handling technique:
  – chaining (may have fewer buckets than items)
  – open addressing (load factor < 1): linear probing, quadratic probing, double hashing

Slide 42/51: Linear Probing
– if the current location is used, try the next table location

linear_probing_insert(K)
  if (table is full) error
  probe = h(K)
  while (table[probe] occupied)
    probe = (probe + 1) mod M
  table[probe] = K
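A Java rendering of the pseudocode above; storing Integer keys and using null to mark an empty slot are representation choices of this sketch, not of the slide.

public class LinearProbing {
    private final Integer[] table;   // null means the slot is empty
    private int count = 0;

    public LinearProbing(int m) {
        table = new Integer[m];
    }

    private int h(int k) {
        return Math.abs(k % table.length);
    }

    public void insert(int k) {
        if (count == table.length) {
            throw new IllegalStateException("table is full");
        }
        int probe = h(k);
        while (table[probe] != null) {            // step forward until a free slot is found
            probe = (probe + 1) % table.length;
        }
        table[probe] = k;
        count++;
    }
}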

Slide 43/51: Linear Probing
– lookups walk along the table until the key or an empty slot is found
– uses less memory than chaining: no links to store
– slower than chaining: may have to walk along the table for a long way
– deletion is more complex: either mark the deleted slot, or fill in the slot by shifting some elements down

Slide 44/51: Linear Probing Example
– h(k) = k mod 13, table size M = 13
– insert keys: 18 41 22 44 59 32 31 73
– [figure: resulting table, indices 0-12, with 41 in cell 2, 18 in 5, 44 in 6, 59 in 7, 32 in 8, 22 in 9, 31 in 10, and 73 in 11]

Slide 45/51: Double Hashing
– use two hash functions
– if M is prime, the probe sequence will eventually examine every position in the table

double_hash_insert(K)
  if (table is full) error
  probe = h1(K)
  offset = h2(K)
  while (table[probe] occupied)
    probe = (probe + offset) mod M
  table[probe] = K
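A Java rendering of the double-hashing pseudocode; h2 is borrowed from the example two slides below, and the Integer/null representation is again my choice.

public class DoubleHashing {
    private final Integer[] table;   // null means the slot is empty
    private int count = 0;

    public DoubleHashing(int m) {
        table = new Integer[m];      // m should be prime so every slot can be reached
    }

    private int h1(int k) {
        return Math.abs(k % table.length);
    }

    private int h2(int k) {
        return 8 - Math.abs(k % 8);  // always between 1 and 8, so the offset is never zero
    }

    public void insert(int k) {
        if (count == table.length) {
            throw new IllegalStateException("table is full");
        }
        int probe = h1(k);
        int offset = h2(k);
        while (table[probe] != null) {
            probe = (probe + offset) % table.length;   // jump by the key-dependent offset
        }
        table[probe] = k;
        count++;
    }
}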

Slide 46/51: Double Hashing
– has many of the same (dis)advantages as linear probing
– distributes keys more uniformly than linear probing does

Slide 47/51: Double Hashing Example
– h1(K) = K mod 13, h2(K) = 8 - (K mod 8); we want h2 to be an offset to add
– insert keys: 18 41 22 44 59 32 31 73
– e.g., h1(44) = 5 (occupied); h2(44) = 8 - 4 = 4, so probe 5 + 4 = 9 (occupied), then (9 + 4) mod 13 = 0 (free), and 44 goes in cell 0
– [figure: resulting table, indices 0-12, with 44 in cell 0, 41 in 2, 73 in 3, 18 in 5, 32 in 6, 59 in 7, 31 in 8, and 22 in 9]

Slide 48/51: Why So Many Hash Functions?
– "It's different strokes for different folks": we seldom know the nature of the objects that will be stored in our dictionary

Slide 49/51: A FAT Example
– directory entry: the key is the file name; the data includes time, date, size, ... and the location of the file's first block in the FAT
– if the first block is at physical location #23 (disk block number), look up position #23 in the FAT; that entry either marks the end of the file or holds the next block number on disk
– example: the directory entry points to block #4, and the FAT is: x x x F 5 6 10 x 23 25 3
– following the chain 4 → 5 → 6 → 10 → 3, the file occupies blocks 4, 5, 6, 10, 3
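A small Java sketch that follows the block chain from this example; modelling the FAT as an int array with END for the slide's "F" and FREE for its "x" is an assumption of this sketch.

public class FatChainDemo {
    static final int END = -1;    // stands for the slide's "F" (end of file)
    static final int FREE = 0;    // stands for the slide's "x" (unused entry); the value is arbitrary here

    public static void main(String[] args) {
        // FAT entries copied from the slide: entry 3 ends the file, and 4 -> 5 -> 6 -> 10 -> 3
        int[] fat = {FREE, FREE, FREE, END, 5, 6, 10, FREE, 23, 25, 3};
        int block = 4;                        // first block, taken from the directory entry
        while (block != END) {
            System.out.print(block + " ");    // prints 4 5 6 10 3
            block = fat[block];               // each FAT entry points to the next block of the file
        }
    }
}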

