EEM 480 Lecture 11 Hashing and Dictionaries

Symbol Table Symbol tables are used by compilers to keep track of information about variables, functions, class names, type names, temporary variables, etc. Typical symbol table operations are Insert, Delete and Search. It's a dictionary structure!

Symbol Table What kind of information is usually stored in a symbol table? Type (int, short, long int, float, …), storage class (label, static symbol, external def, structure tag, ...), size, scope, stack frame offset, register. We also need a way to keep track of reserved words.

Symbol Table Where is a symbol table stored? Array/linked list: simple, but linear lookup time. However, we may use a sorted array for reserved words, since they are generally few and known in advance. Balanced tree: O(log n) lookup time. Hash table: the most common implementation, with O(1) amortized time for dictionary operations.

Hashing Hashing depends on mapping keys into positions in a table called a hash table. It is a technique used for performing insertions, deletions and searches in constant average time.

Hashing In this example, john maps to 3, phil maps to 4, … Problems: How will the mapping be done? What happens if two items map to the same place?

A Plan For Hashing Save items in a key-indexed table; the index is a function of the key. Hash function: method for computing the table index from the key. Collision resolution strategy: algorithm and data structure to handle two keys that hash to the same index. If there is no space limitation: trivial hash function with the key as address. If there is no time limitation: trivial collision resolution = sequential search. Limitations on both time and space: hashing (the real world).

Hashing Hash tables use an array of size m to store elements. Given a key k (the identifier name), use a function h to compute the index h(k) for that key. Collisions are possible: two keys may hash into the same slot. A good hash function is easy to compute and avoids collisions (by breaking up patterns in the keys and uniformly distributing the hash values).

Hashing Nomenclature k is a key, h(k) is the hash function, m is the size of the hash table, n is the number of keys in the hash table.

What is Hash (in Wikipedia) Hash is an American dish consisting of a mixture of beef (often corned beef or roast beef), onions, potatoes, and spices that are mashed together into a coarse, chunky paste, and then cooked, either alone, or with other ingredients. Is it related to our definition? Yes: hashing chops up any patterns in the keys so that the results are uniformly distributed.

What is Hashing Hashing is the transformation of a string of characters into a usually shorter fixed-length value or key that represents the original string. Hashing is used to index and retrieve items in a database because it is faster to find the item using the shorter hashed key than to find it using the original value. It is also used in many encryption algorithms.

Hashing When the key is a string, we generally use the ASCII values of its characters in some way. Examples for $k = c_1 c_2 c_3 \ldots c_x$: $h(k) = (c_1 \cdot 128^{x-1} + c_2 \cdot 128^{x-2} + \cdots + c_x \cdot 128^{0}) \bmod m$; $h(k) = (c_1 + c_2 + \cdots + c_x) \bmod m$; $h(k) = (h_1(c_1) + h_2(c_2) + \cdots + h_x(c_x)) \bmod m$, where each $h_i$ is an independent hash function.
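The first two string hash functions above can be sketched in Python as follows. The function names are hypothetical, and the base-128 version uses Horner's rule (an implementation choice not stated on the slide) so that intermediate values stay small.

```python
def polynomial_hash(key, m, base=128):
    """Treat the characters as digits of a base-128 number, reduced mod m."""
    h = 0
    for ch in key:
        # Horner's rule: same result as summing c_i * 128^(x-i), taken mod m.
        h = (h * base + ord(ch)) % m
    return h


def additive_hash(key, m):
    """Sum of the character codes mod m; any two anagrams collide."""
    return sum(ord(ch) for ch in key) % m


if __name__ == "__main__":
    for word in ("ab", "ba"):
        print(word, polynomial_hash(word, 101), additive_hash(word, 101))
```

The additive version ignores character order, which is one reason the positional (polynomial) form is usually preferred for identifiers.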

Finding A Hash Function Goal: scramble the keys, so that each table position is equally likely for each key. Ex: citizenship number (Vatandaşlık Numarası) for 10000 people. Bad: the whole number, since a table that large would never be filled by 10000 people. Better: the last three digits, but every number is even. Best: use digits 2, 3, 4 and 5. Ex: date of birth. Bad: first three digits of birth year. Better: birthday. Ex: phone numbers. Bad: first three digits. Better: last three digits.

Hash Function Truncation Ignore part of the key and use the remaining part directly as the index. Example: if the keys are 8-digit numbers and the hash table has 1000 entries, then the first, fourth and eighth digits could make up the hash value. Not a very good method: it does not distribute keys uniformly.

Hash Function Folding Break up the key into parts and combine them in some way. Example: if the keys are 9-digit numbers, break up a key into three 3-digit numbers and add them up. Ex: ISBN 0-321-37319-7. Divide it into three parts: 321, 373 and 197. Add them: 891. Reduce mod 500: 891 mod 500 = 391.
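A minimal sketch of folding, assuming the key is given as a digit string with the leading zero of the ISBN already dropped (as in the example, which uses the parts 321, 373 and 197). The name `fold_hash` and the `width` parameter are illustrative only.

```python
def fold_hash(key, m, width=3):
    """Split the digits of key into width-sized parts, add them, reduce mod m."""
    digits = "".join(ch for ch in key if ch.isdigit())
    parts = [int(digits[i:i + width]) for i in range(0, len(digits), width)]
    return sum(parts) % m


if __name__ == "__main__":
    # 321 + 373 + 197 = 891, and 891 mod 500 = 391
    print(fold_hash("321373197", 500))
```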

Hash Function Middle square Compute k*k and pick some digits from the resulting number. Example: given a 9-digit key k and a hash table of size 1000, pick three digits from the middle of the number k*k. Ex: for the key 175344387, take the middle digits 344; 344*344 = 118336, so pick 183 or 833. Works fairly well in practice if the keys do not have many leading or trailing zeroes.
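The digit-picking step can be sketched as below. This is only an illustration: `middle_square_hash` is a hypothetical name, and taking the digits from the exact center of the decimal expansion (leaning left when the centering is not exact) is an assumption about which middle digits to use.

```python
def middle_square_hash(k, digits=3):
    """Square k and return `digits` decimal digits from the middle of k*k."""
    squared = str(k * k)
    start = (len(squared) - digits) // 2
    return int(squared[start:start + digits])


if __name__ == "__main__":
    # 344 * 344 = 118336; the three digits starting at position 1 are "183"
    print(middle_square_hash(344))
```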

Hash Function Division h(k) = k mod m. Fast. Not all values of m are suitable for this. For example, powers of 2 should be avoided, because then k mod m is just the least significant bits of k. Good values for m are prime numbers.

Hash Function Multiplication $h(k) = \lfloor m \, (k c - \lfloor k c \rfloor) \rfloor$, $0 < c < 1$. In English: multiply the key k by a constant c, 0 < c < 1; take the fractional part of k * c; multiply that by m; take the floor of the result. The value of m does not make a difference. Some values of c work better than others. A good value for c: $c = (\sqrt{5} - 1)/2 \approx 0.618$.

Hash Function Multiplication Example: Suppose the size of the table, m, is 1301. For k=1234, h(k)=850. For k=1235, h(k)=353. For k=1236, h(k)=1157. For k=1237, h(k)=660. For k=1238, h(k)=164. For k=1239, h(k)=968. For k=1240, h(k)=471.
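The multiplication method can be checked numerically with a short sketch. It assumes c = (√5 − 1)/2 ≈ 0.618, a commonly recommended constant; with that assumption and m = 1301 the sketch yields 850, 353, 1157, 660, 164, 968 and 471 for k = 1234 through 1240. The function name is hypothetical.

```python
import math

# Assumed constant: the fractional part of the golden ratio.
C = (math.sqrt(5) - 1) / 2


def multiplication_hash(k, m, c=C):
    """h(k) = floor(m * frac(k * c)) for a constant 0 < c < 1."""
    fractional_part = (k * c) % 1.0
    return math.floor(m * fractional_part)


if __name__ == "__main__":
    for k in range(1234, 1241):
        print(k, multiplication_hash(k, 1301))
```

Consecutive keys land far apart in the table, which is the scattering behavior the method is meant to provide.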

Hash Function Universal Hashing Worst-case scenario: the chosen keys all hash to the same slot. This can be avoided if the hash function is not fixed: start with a collection of hash functions with the property that, for any given set of inputs, they scatter the inputs well among the range of the function; select one at random and use that. Good performance on average: the probability that the randomly chosen hash function exhibits the worst-case behavior is very low.
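One well-known family that fits this description is h(k) = ((a·k + b) mod p) mod m with a prime p larger than any key and random a, b. The slide does not name a specific family, so the choice below (and the names `make_universal_hash` and `P`) is an assumption made for illustration, restricted to integer keys in [0, P).

```python
import random

# Assumed prime modulus (2^31 - 1); keys are assumed to be integers below P.
P = 2_147_483_647


def make_universal_hash(m, rng=random):
    """Pick one function at random from the family ((a*k + b) mod P) mod m."""
    a = rng.randrange(1, P)   # a must be non-zero
    b = rng.randrange(0, P)

    def h(k):
        return ((a * k + b) % P) % m

    return h


if __name__ == "__main__":
    h = make_universal_hash(10)
    print([h(k) for k in (9, 769, 1234)])
```

Because a and b are drawn at table-creation time, no fixed set of keys is bad for every function in the family.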

When Collision Occurs... A collision occurs when more than one item has been mapped to the same location. Ex: n = 10, m = 10, use mod 10: 9 will be mapped to 9, and 769 will also be mapped to 9. In probability theory, the birthday problem or birthday paradox pertains to the probability that in a set of randomly chosen people some pair of them will have the same birthday. In a group of 23 (or more) randomly chosen people, there is more than 50% probability that some pair of them will both have been born on the same day. For 57 or more people, the probability is more than 99%, reaching 100% as the number of people reaches 366 (assuming 365 possible birthdays). The mathematics behind this problem leads to a well-known cryptographic attack called the birthday attack. When a collision occurs, an algorithm has to map the second, third, ..., n-th item to definite places in the table. In order to read data from the table, the same algorithm has to be used to retrieve it.
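The birthday figures quoted above can be reproduced with a few lines of Python. The sketch assumes 365 equally likely birthdays; the function name is illustrative.

```python
def birthday_collision_probability(n, days=365):
    """Probability that at least two of n people share a birthday."""
    p_all_distinct = 1.0
    for i in range(n):
        # The (i+1)-th person must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    for n in (10, 23, 57, 366):
        print(n, birthday_collision_probability(n))
```

The same computation with `days` set to the table size m estimates how soon a hash table sees its first collision.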

Resolving Collision Chaining Put all the elements that collide in a chain (list) attached to the slot. The hash table is an array of linked lists. The load factor indicates the average number of elements stored in a chain. It could be less than, equal to, or larger than 1.

What is Load Factor? Given a hash table of size m, and n elements stored in it, we define the load factor of the table as λ = n/m (lambda). The load factor gives us an indication of how full the table is. The possible values of the load factor depend on the method we use for resolving collisions.

Return to Resolving Collision Chaining ctd. Chaining puts elements that hash to the same slot in a linked list. Separate chaining: array of M linked lists. Hash: map key to integer i between 0 and M-1. Insert: put at front of the i-th chain (constant time). Search: only need to search the i-th chain (time proportional to the length of the chain).

Chaining Insert/Delete/Lookup in expected O(1) time. Keep the list doubly-linked to facilitate deletions. Worst-case lookup time is linear. However, the expected O(1) bound assumes that the chains are kept small. If the chains start becoming too long, the table must be enlarged and all the keys rehashed.
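A minimal sketch of separate chaining follows. To stay short it uses Python lists as chains instead of doubly-linked lists and appends at the end of a chain rather than the front; both are simplifications, and the class and method names are hypothetical.

```python
class ChainedHashTable:
    """Separate chaining: an array of m chains of (key, value) pairs."""

    def __init__(self, m=11):
        self.m = m                              # number of slots
        self.n = 0                              # number of stored keys
        self.slots = [[] for _ in range(m)]

    def _chain(self, key):
        return self.slots[hash(key) % self.m]

    def insert(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:                        # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))
        self.n += 1

    def search(self, key):
        for k, v in self._chain(key):           # only this one chain is scanned
            if k == key:
                return v
        return None

    def delete(self, key):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                self.n -= 1
                return True
        return False

    def load_factor(self):
        return self.n / self.m


if __name__ == "__main__":
    table = ChainedHashTable(m=10)
    table.insert(9, "first")
    table.insert(769, "second")                 # collides with 9 (both mod 10 = 9)
    print(table.slots[9], table.load_factor())
```

A production version would also enlarge the table and rehash once the load factor passes a chosen threshold.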

Chaining Performance Search cost is proportional to the length of the chain. Trivial: average length = N / M. Worst case: all keys hash to the same chain. Theorem. Let λ = N / M > 1 be the average length of a list, which is called the load factor. Average search cost: 1 + λ/2. What is the choice of M? M too large: too many empty chains. M too small: chains too long. Typical choice: λ = N / M ≈ 10, giving constant-time search/insert.

Chaining Performance Analysis of successful search: the expected number e of elements examined during a successful search for key k is one more than the expected number of elements examined when k was inserted. It makes no difference whether we insert at the beginning or the end of the list. To obtain e, take the average, over the n items in the table, of 1 plus the expected length of the chain to which the i-th element was added.
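The averaging described above can be written out as follows. This is a sketch under the usual assumption of simple uniform hashing, so that when the i-th element is inserted its chain has expected length (i − 1)/m; the symbols n, m and λ = n/m are those of the load factor definition.

```latex
\begin{aligned}
e &= \frac{1}{n}\sum_{i=1}^{n}\left(1 + \frac{i-1}{m}\right)
   = 1 + \frac{1}{nm}\sum_{i=1}^{n}(i-1) \\
  &= 1 + \frac{1}{nm}\cdot\frac{n(n-1)}{2}
   = 1 + \frac{n-1}{2m}
   = 1 + \frac{\lambda}{2} - \frac{1}{2m}
\end{aligned}
```

For large m the last term vanishes, which gives the average search cost of about 1 + λ/2.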

Open Addressing Store all elements within the table. The space we save from the chain pointers is used instead to make the array larger. If there is a collision, probe the table in a systematic way to find an empty slot. If the table fills up, we need to enlarge it and rehash all the keys.

Open Addressing Probe sequence: (h(k) + i) mod m for i = 0, 1, ..., m-1. Insert: start with the location where the key hashed and do a sequential search for an empty slot. Search: start with the location where the key hashed and do a sequential search until you either find the key (success) or find an empty slot (failure). Delete (lazy deletion): follow the same route but mark the slot as DELETED rather than EMPTY, otherwise subsequent searches will fail.

Hash Table without Linked-List Linear probing: array of size M. Hash: map key to integer i between 0 and M-1. Insert: put in slot i if free, if not try i+1, i+2, etc. Search: search slot i, if occupied but no match, try i+1, i+2, etc. Cluster. Contiguous block of items. Search through cluster using elementary algorithm for arrays.
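Linear probing with lazy deletion can be sketched as follows. The sentinel objects `EMPTY` and `DELETED`, the class name, and the decision to reuse the first DELETED slot on insertion are illustrative assumptions; the table stores keys only and does not resize.

```python
EMPTY = object()      # slot has never been used
DELETED = object()    # slot held a key that was removed (lazy deletion)


class LinearProbingTable:
    def __init__(self, m=11):
        self.m = m
        self.keys = [EMPTY] * m

    def _probe(self, key):
        start = hash(key) % self.m
        for i in range(self.m):
            yield (start + i) % self.m          # i, i+1, i+2, ... wrapping around

    def insert(self, key):
        first_free = None
        for j in self._probe(key):
            slot = self.keys[j]
            if slot is EMPTY:
                if first_free is None:
                    first_free = j
                break                           # key cannot be further along
            if slot is DELETED:
                if first_free is None:
                    first_free = j              # remember, but keep looking for key
            elif slot == key:
                return j                        # already stored
        if first_free is None:
            raise OverflowError("hash table is full")
        self.keys[first_free] = key
        return first_free

    def search(self, key):
        for j in self._probe(key):
            slot = self.keys[j]
            if slot is EMPTY:
                return None                     # empty slot ends the search
            if slot is not DELETED and slot == key:
                return j
        return None

    def delete(self, key):
        j = self.search(key)
        if j is None:
            return False
        self.keys[j] = DELETED                  # not EMPTY, so later searches pass through
        return True


if __name__ == "__main__":
    t = LinearProbingTable(m=10)
    print(t.insert(9), t.insert(769), t.insert(19))
```

With m = 10 the keys 9, 769 and 19 all hash to slot 9 and end up in one cluster wrapping around to slots 0 and 1.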

Open Addressing: Linear Probing Advantage: very easy to implement. Disadvantage: primary clustering. Long sequences of used slots build up with gaps between them. Every insertion requires several probes and adds to the cluster. The average length of a probe sequence when inserting is approximately $\frac{1}{2}\left(1 + \frac{1}{(1-\lambda)^2}\right)$, where λ is the load factor.

Quadratic Probes Probe the table at slots $(h(k) + i^2) \bmod m$ for i = 0, 1, 2, 3, ..., m-1. Ease of computation: not as easy as linear probing. Do we really have to compute a power? No: since $i^2 - (i-1)^2 = 2i - 1$, each probe is the previous one plus the next odd number. Clustering: primary clustering is avoided, since the probes are not sequential.

Search Quadratic Probing Probe sequence for hash value 3 in a table of size 16 (all sums reduced mod 16):
3 + 0^2 = 3     3 + 1^2 = 4     3 + 2^2 = 7     3 + 3^2 = 12
3 + 4^2 = 3     3 + 5^2 = 12    3 + 6^2 = 7     3 + 7^2 = 4
3 + 8^2 = 3     3 + 9^2 = 4     3 + 10^2 = 7    3 + 11^2 = 12
3 + 12^2 = 3    3 + 13^2 = 12   3 + 14^2 = 7    3 + 15^2 = 4

Quadratic Probing Probe sequence for hash value 3 in a table of size 19 (all sums reduced mod 19):
3 + 0^2 = 3     3 + 1^2 = 4     3 + 2^2 = 7     3 + 3^2 = 12    3 + 4^2 = 0
3 + 5^2 = 9     3 + 6^2 = 1     3 + 7^2 = 14    3 + 8^2 = 10    3 + 9^2 = 8
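The two probe sequences can be regenerated with a one-line function, which also makes the contrast between table sizes 16 and 19 easy to see. The function name is illustrative.

```python
def quadratic_probe_sequence(h, m, count=None):
    """Slots (h + i^2) mod m for i = 0 .. count-1 (count defaults to m)."""
    if count is None:
        count = m
    return [(h + i * i) % m for i in range(count)]


if __name__ == "__main__":
    size16 = quadratic_probe_sequence(3, 16)
    size19 = quadratic_probe_sequence(3, 19, count=10)
    print(size16, "distinct slots:", len(set(size16)))
    print(size19, "distinct slots:", len(set(size19)))
```

With m = 16 only four different slots (3, 4, 7, 12) are ever visited, while with the prime size 19 the first ten probes all land on different slots.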

Quadratic Probing Disadvantage: secondary clustering: if h(k1) == h(k2), the probing sequences for k1 and k2 are exactly the same. Is this really bad? In practice, not so much. It becomes an issue when the load factor is high.

Double Hashing The probe function is (h(k) + i·h2(k)) mod m. In English: use a second hash function to obtain the next slot. The probing sequence is: h(k), h(k)+h2(k), h(k)+2h2(k), h(k)+3h2(k), ... (all mod m). Performance: much better than linear or quadratic probing. Does not suffer from clustering, BUT requires computation of a second function.

Double Hashing The choice of h2(k) is important: it must never evaluate to zero. Consider h2(k) = k mod 9 for k = 81: the step is 0, so every probe hits the same slot. The choice of m is important: if it is not prime, we may run out of alternate locations very fast.
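Both warnings can be illustrated with a small probe-sequence generator. The helper name and the particular functions h1(k) = k mod 13 and h2(k) = 1 + (k mod 11) in the usage example are assumptions chosen for the demonstration, not taken from the slides.

```python
def double_hash_probes(k, m, h1, h2):
    """Probe sequence (h1(k) + i * h2(k)) mod m for i = 0 .. m-1."""
    step = h2(k) % m
    if step == 0:
        # A zero step would probe the same slot forever.
        raise ValueError("h2(k) must not be 0 (mod m)")
    start = h1(k) % m
    return [(start + i * step) % m for i in range(m)]


if __name__ == "__main__":
    # Prime table size: the sequence visits every slot exactly once.
    print(double_hash_probes(81, 13, lambda k: k % 13, lambda k: 1 + k % 11))
    # Non-prime size 12 with step 4: only three slots are ever reached.
    print(double_hash_probes(3, 12, lambda k: k % 12, lambda k: 4))
```

Adding 1 to the second hash is one simple way to guarantee a non-zero step.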

Rehashing After 70% of the table is full, double the size of the hash table. Don't forget to make the new size a prime number.
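The sizing rule can be sketched as below. The helper names are hypothetical, trial division is used only because it is short, and "the first prime at or above twice the old size" is one possible reading of the rule.

```python
def is_prime(n):
    """Trial division; adequate for table sizes."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def next_prime(n):
    """Smallest prime greater than or equal to n."""
    while not is_prime(n):
        n += 1
    return n


def new_table_size(m, n, max_load=0.7):
    """Return the size to use for a table of size m holding n keys."""
    if n / m <= max_load:
        return m                    # still below the threshold: keep the size
    return next_prime(2 * m)        # roughly double, rounded up to a prime


if __name__ == "__main__":
    print(new_table_size(11, 7), new_table_size(11, 8))
```

After choosing the new size, every key has to be re-inserted, because its slot depends on m.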

Lempel-Ziv-Welch (LZW) Compression Algorithm Introduction to the LZW Algorithm. Example 1: Encoding using LZW. Example 2: Decoding using LZW. LZW: Concluding Notes.

Introduction to LZW As mentioned earlier, static coding schemes require some knowledge about the data before encoding takes place. Universal coding schemes, like LZW, do not require advance knowledge and can build such knowledge on-the-fly. LZW is the foremost technique for general purpose data compression due to its simplicity and versatility. It is the basis of many PC utilities that claim to “double the capacity of your hard drive”. LZW compression uses a code table, with 4096 as a common choice for the number of table entries.

Introduction to LZW (cont'd) Codes 0-255 in the code table are always assigned to represent single bytes from the input file. When encoding begins, the code table contains only the first 256 entries, with the remainder of the table being blank. Compression is achieved by using codes 256 through 4095 to represent sequences of bytes. As the encoding continues, LZW identifies repeated sequences in the data and adds them to the code table. Decoding is achieved by taking each code from the compressed file and translating it through the code table to find what character or characters it represents.

LZW Encoding Algorithm
Initialize table with single character strings
P = first input character
WHILE not end of input stream
    C = next input character
    IF P + C is in the string table
        P = P + C
    ELSE
        output the code for P
        add P + C to the string table
        P = C
END WHILE
output code for P
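The encoding algorithm translates almost line by line into Python. In this sketch the string table is a dictionary from strings to codes, the input is assumed to contain only characters with codes 0-255, and the 4096-entry limit is not enforced; the function name is illustrative.

```python
def lzw_encode(data):
    """Return the list of LZW codes for the string data."""
    # Codes 0-255 stand for the single characters.
    table = {chr(i): i for i in range(256)}
    next_code = 256
    if not data:
        return []
    p = data[0]
    output = []
    for c in data[1:]:
        if p + c in table:
            p = p + c                       # keep extending the current match
        else:
            output.append(table[p])         # emit the code for the longest match
            table[p + c] = next_code        # remember the new, longer string
            next_code += 1
            p = c
    output.append(table[p])                 # flush the final match
    return output


if __name__ == "__main__":
    print(lzw_encode("BABAABAAA"))
```

For the input BABAABAAA this produces the codes 66, 65, 256, 257, 65, 260.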

Example 1: Compression using LZW Use the LZW algorithm to compress the string BABAABAAA.

Example 1: LZW Compression Step 1
Input: BABAABAAA    P = A, C = empty
STRING TABLE          ENCODER OUTPUT
string   codeword     representing   output code
BA       256          B              66

Example 1: LZW Compression Step 2
Input: BABAABAAA    P = B, C = empty
STRING TABLE          ENCODER OUTPUT
string   codeword     representing   output code
BA       256          B              66
AB       257          A              65

Example 1: LZW Compression Step 3
Input: BABAABAAA    P = A, C = empty
STRING TABLE          ENCODER OUTPUT
string   codeword     representing   output code
BA       256          B              66
AB       257          A              65
BAA      258          BA             256

Example 1: LZW Compression Step 4
Input: BABAABAAA    P = A, C = empty
STRING TABLE          ENCODER OUTPUT
string   codeword     representing   output code
BA       256          B              66
AB       257          A              65
BAA      258          BA             256
ABA      259          AB             257

Example 1: LZW Compression Step 5
Input: BABAABAAA    P = A, C = A
STRING TABLE          ENCODER OUTPUT
string   codeword     representing   output code
BA       256          B              66
AB       257          A              65
BAA      258          BA             256
ABA      259          AB             257
AA       260          A              65

Example 1: LZW Compression Step 6
Input: BABAABAAA    P = AA, C = empty
STRING TABLE          ENCODER OUTPUT
string   codeword     representing   output code
BA       256          B              66
AB       257          A              65
BAA      258          BA             256
ABA      259          AB             257
AA       260          A              65
                      AA             260

LZW Decompression The LZW decompressor creates the same string table during decompression. It starts with the first 256 table entries initialized to single characters. The string table is updated for each code in the input stream, except the first one. Decoding is achieved by reading codes and translating them through the code table being built.

LZW Decompression Algorithm
Initialize table with single character strings
OLD = first input code
output translation of OLD
C = first character of translation of OLD
WHILE not end of input stream
    NEW = next input code
    IF NEW is not in the string table
        S = translation of OLD
        S = S + C
    ELSE
        S = translation of NEW
    output S
    C = first character of S
    add translation of OLD + C to the string table
    OLD = NEW
END WHILE
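A Python sketch of the decompression algorithm follows. The table maps codes to strings, the 4096-entry limit is again ignored, and C is initialized from the first decoded string so that the special case (a code that is not yet in the table) also works when it occurs on the very first loop iteration. The function name is illustrative.

```python
def lzw_decode(codes):
    """Rebuild the original string from a list of LZW codes."""
    # Codes 0-255 stand for the single characters.
    table = {i: chr(i) for i in range(256)}
    next_code = 256
    if not codes:
        return ""
    old = codes[0]
    output = [table[old]]
    c = table[old][0]
    for new in codes[1:]:
        if new not in table:
            # The encoder used a string it had only just added:
            # it must be the previous string plus its own first character.
            s = table[old] + c
        else:
            s = table[new]
        output.append(s)
        c = s[0]
        table[next_code] = table[old] + c   # same entry the encoder created
        next_code += 1
        old = new
    return "".join(output)


if __name__ == "__main__":
    print(lzw_decode([66, 65, 256, 257, 65, 260]))
```

Decoding the sequence 66, 65, 256, 257, 65, 260 gives back BABAABAAA; the last code, 260, exercises the not-in-table branch.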

Example 2: LZW Decompression Use LZW to decompress the output sequence of Example 1: <66><65><256><257><65><260>.

Example 2: LZW Decompression Step 1
Old = 65, New = 65, S = A, C = A
STRING TABLE          DECODER OUTPUT
string   codeword     string
                      B
BA       256          A

Example 2: LZW Decompression Step 2
Old = 256, New = 256, S = BA, C = B
STRING TABLE          DECODER OUTPUT
string   codeword     string
                      B
BA       256          A
AB       257          BA

Example 2: LZW Decompression Step 3
Old = 257, New = 257, S = AB, C = A
STRING TABLE          DECODER OUTPUT
string   codeword     string
                      B
BA       256          A
AB       257          BA
BAA      258          AB

Example 2: LZW Decompression Step 4
Old = 65, New = 65, S = A, C = A
STRING TABLE          DECODER OUTPUT
string   codeword     string
                      B
BA       256          A
AB       257          BA
BAA      258          AB
ABA      259          A

Example 2: LZW Decompression Step 5
Old = 260, New = 260, S = AA, C = A
STRING TABLE          DECODER OUTPUT
string   codeword     string
                      B
BA       256          A
AB       257          BA
BAA      258          AB
ABA      259          A
AA       260          AA

LZW: Some Notes This algorithm compresses repetitive sequences of data well. Since the codewords are 12 bits, any single encoded character will expand the data size rather than reduce it. In this example, 72 bits are represented with 72 bits of data (9 characters × 8 bits in, 6 codes × 12 bits out). After a reasonable string table is built, compression improves dramatically. Advantages of LZW over Huffman: LZW requires no prior information about the input data stream. LZW can compress the input stream in one single pass. Another advantage of LZW is its simplicity, allowing fast execution.

LZW: Limitations What happens when the dictionary gets too large (i.e., when all the 4096 locations have been used)? Here are some options usually implemented: Simply forget about adding any more entries and use the table as is. Throw the dictionary away when it reaches a certain size. Throw the dictionary away when it is no longer effective at compression. Clear entries 256-4095 and start building the dictionary again. Some clever schemes rebuild a string table from the last N input characters.
