
Hashing: Static Hashing and Dynamic Hashing

Symbol table ADT
- We define the symbol table as a set of name-attribute pairs.
- In a thesaurus, the name is a word, and the attribute is a list of synonyms for the word.
- In a symbol table for a compiler, the name is an identifier, and the attributes might include an initial value and a list of lines that use the identifier.
- Generally we would want to perform the following operations on any symbol table (a minimal interface is sketched after this list):
  - Determine if a particular name is in the table
  - Retrieve the attributes of that name
  - Modify the attributes of that name
  - Insert a new name and its attributes
  - Delete a name and its attributes
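A sketch of such an interface in C, assuming string names and a placeholder attribute record; all type and function names here are illustrative, not taken from the slides:

typedef struct {
    int initial_value;          /* example attribute for a compiler symbol table */
    int first_use_line;         /* example attribute */
} attributes;

typedef struct symtab symtab;   /* opaque table type */

int  symtab_contains(symtab *t, const char *name);              /* is name in the table? */
attributes *symtab_retrieve(symtab *t, const char *name);       /* attributes of name */
void symtab_modify(symtab *t, const char *name, attributes a);  /* modify the attributes */
void symtab_insert(symtab *t, const char *name, attributes a);  /* insert name and attributes */
void symtab_delete(symtab *t, const char *name);                /* delete name and attributes */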

- There are only three basic operations on symbol tables: searching, inserting, and deleting.
- To implement these operations, we could use the binary search tree introduced in Section 5.7, which is O(n) in the worst case, or other binary trees with O(log n) complexity.
- In this chapter we examine a technique for search, insert, and delete operations that has very good expected performance. This technique is referred to as hashing.
- Unlike search tree methods, which rely on identifier comparisons to perform a search, hashing relies on a formula called the hash function.

Static hashing
- In static hashing, we store the identifiers in a fixed-size table called a hash table.
- We use an arithmetic function, f, to determine the address, or location, of an identifier, x, in the table. Thus, f(x) gives the hash, or home address, of x in the table.
- The hash table ht is stored in sequential memory locations that are partitioned into b buckets, ht[0], …, ht[b-1]. Each bucket has s slots (a C sketch follows).
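One way to lay this out in C, assuming fixed-length string identifiers; the sizes below are assumptions for illustration:

#define BUCKETS  26             /* b: number of buckets (assumed) */
#define SLOTS     2             /* s: slots per bucket (assumed) */
#define MAX_CHAR 10             /* maximum identifier length (assumed) */

typedef struct {
    char key[MAX_CHAR];         /* an identifier; "" marks an empty slot */
} element;

/* ht[i][j] is slot j of bucket i; f(x) gives the home bucket of x */
element ht[BUCKETS][SLOTS];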

- Definition: The identifier density of a hash table is the ratio n/T, where n is the number of identifiers in the table and T is the total number of possible identifiers. The loading density or loading factor of a hash table is α = n/(s·b).
  - For example, n = 10 identifiers in a table with b = 26 buckets and s = 2 slots per bucket give α = 10/52 ≈ 0.19.
- Two identifiers, i1 and i2, are synonyms with respect to f if f(i1) = f(i2).
- An overflow occurs when we hash a new identifier, i, into a full bucket.
- A collision occurs when we hash two nonidentical identifiers into the same bucket.
- When the bucket size is 1, collisions and overflows occur simultaneously.

- The time required to enter, delete, or search for identifiers does not depend on the number of identifiers n in use; it is O(1).
- Since the ratio b/T is usually small, we cannot avoid collisions altogether.
- Example 8.1
  - b = 26, s = 2
  - f(x) = the first character of x

  Bucket   Slot 0   Slot 1
  0        acos     atan
  1
  2        char     ceil
  3        define
  4        exp
  5        float    floor
  6
  …
  25

  (A hash table with 26 buckets and two slots per bucket.)

Hashing functions
- A hash function, f, transforms an identifier, x, into a bucket address in the hash table.
- We want a hash function that is easy to compute and that minimizes the number of collisions.
- To avoid collisions, the hash function should depend on all the characters in an identifier.
- Hashing functions should be unbiased:
  - If we randomly choose an identifier, x, from the identifier space, the probability that f(x) = i should be 1/b for all buckets i.
  - We call a hash function that satisfies this unbiased property a uniform hash function.

- Four types of uniform hash functions:
  - Mid-square
  - Division
  - Folding
  - Digit analysis
- Mid-square (sketched in C below)
  - We compute the function f by squaring the identifier and then using an appropriate number of bits from the middle of the square to obtain the bucket address.
  - Since the middle bits of the square usually depend upon all the characters in an identifier, there is a high probability that different identifiers will produce different hash addresses.
  - The number of bits used to obtain the bucket address depends on the table size. If we use r bits, the range of the values is 2^r.
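A minimal mid-square sketch in C, assuming the identifier has already been converted to an unsigned integer and that the table has 2^r buckets (both assumptions, not from the slides):

#include <stdint.h>

/* Square the key and take r bits from the middle of the 64-bit square.
   Assumes 0 < r < 32. */
unsigned int mid_square(uint32_t key, unsigned int r)
{
    uint64_t square = (uint64_t)key * key;     /* 64-bit square of a 32-bit key */
    unsigned int shift = (64 - r) / 2;         /* start of the middle r bits */
    return (unsigned int)((square >> shift) & ((1u << r) - 1u));
}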

Division
- We divide the identifier x by some number M and use the remainder as the hash address for x: f(x) = x % M (a C sketch follows this list).
- This gives bucket addresses that range from 0 to M - 1, where M is the table size.
- The choice of M is critical:
  - If M is divisible by 2, then odd keys are mapped to odd buckets and even keys are mapped to even buckets.
  - When many identifiers are permutations of each other, a biased use of the table results.
  - A good choice for M is a prime number such that M does not divide r^k ± a for small k and a, where r is the radix of the character set.
  - In practice, choose M such that it has no prime divisors less than 20.
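A minimal division-method sketch in C. Converting the identifier string to a number by treating its characters as digits in some radix is one common choice; the radix 37 below is an assumption:

/* Interpret x as a number in base 37, then take the remainder mod M. */
unsigned int division_hash(const char *x, unsigned int M)
{
    unsigned int h = 0;
    while (*x)
        h = h * 37 + (unsigned char)*x++;   /* accumulates mod 2^32 on overflow */
    return h % M;                           /* bucket address in 0 .. M-1 */
}

Per the guidance above, M would be chosen as a prime, or at least a number with no prime divisors less than 20, close to the desired table size.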

Folding
- We partition the identifier x into several parts. All parts, except possibly the last one, have the same length. We then add the parts together to obtain the hash address for x.
- There are two ways of carrying out this addition (both are sketched in C below).
  - Shift folding: We shift all parts except for the last one, so that the least significant bit of each part lines up with the corresponding bit of the last part. We then add the parts together to obtain f(x).
    - Ex: suppose that we have divided the identifier x into the following parts: x1 = 123, x2 = 203, x3 = 241, x4 = 112, and x5 = 20. We would align x1 through x4 with x5 and add: 123 + 203 + 241 + 112 + 20 = 699, the hash address.
  - Folding at the boundaries: reverses every other partition before adding.
    - Ex: suppose the identifier x is divided into the same partitions as in shift folding. We would reverse the second and fourth partitions, that is, x2 = 302 and x4 = 211, and add the partitions: 123 + 302 + 241 + 211 + 20 = 897, the hash address.
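A sketch of both folding variants in C, working on decimal parts as in the examples above; the parts are taken from the example, everything else is illustrative:

/* Reverse the decimal digits of a part, e.g. 203 -> 302. */
static unsigned int reverse_digits(unsigned int p)
{
    unsigned int r = 0;
    while (p) {
        r = r * 10 + p % 10;
        p /= 10;
    }
    return r;
}

/* Add the parts; for folding at the boundaries, reverse every other part first. */
unsigned int fold(const unsigned int parts[], int n, int at_boundaries)
{
    unsigned int sum = 0;
    for (int i = 0; i < n; i++)
        sum += (at_boundaries && i % 2 == 1) ? reverse_digits(parts[i]) : parts[i];
    return sum;
}

With the parts {123, 203, 241, 112, 20}, fold(parts, 5, 0) returns 699 (shift folding) and fold(parts, 5, 1) returns 897 (folding at the boundaries), matching the examples.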

Digit analysis
- Digit analysis is used with static files. A static file is one in which all the identifiers are known in advance.
- Using this method:
  - We first transform the identifiers into numbers using some radix, r.
  - We then examine the digits of each identifier, deleting those digits that have the most skewed distribution.
  - We continue deleting digits until the number of remaining digits is small enough to give an address in the range of the hash table.
- Of these methods, the one most suitable for general-purpose applications is the division method with a divisor, M, such that M has no prime factors less than 20.

Overflow Handling (1/8)
- Linear open addressing (linear probing), sketched in C below:
  - Compute f(x) for identifier x.
  - Examine the buckets ht[(f(x)+j) % TABLE_SIZE], 0 ≤ j < TABLE_SIZE, in order, until one of the following occurs:
    - The bucket contains x: the search succeeds.
    - The bucket contains the empty string: insert x there.
    - The bucket contains a nonempty string other than x: examine the next bucket (wrapping around circularly).
  - If we return to the home bucket ht[f(x)], the table is full; we report an error condition and exit.
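A linear-probing insert sketched in C, assuming one slot per bucket, string keys stored directly in the table, and the first-character hash of Example 8.1; the names follow the slides' conventions but the code itself is illustrative:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 26
#define MAX_CHAR   10

typedef struct { char key[MAX_CHAR]; } element;

element ht[TABLE_SIZE];   /* one slot per bucket; "" marks an empty bucket */

/* First-character hash, as in Example 8.1 (assumes lowercase identifiers). */
unsigned int hash(const char *key)
{
    return (unsigned int)(key[0] - 'a') % TABLE_SIZE;
}

void linear_insert(element item)
{
    unsigned int home = hash(item.key), j = home;
    do {
        if (!strcmp(ht[j].key, item.key)) {     /* bucket contains x */
            fprintf(stderr, "The key is in the table\n");
            exit(1);
        }
        if (ht[j].key[0] == '\0') {             /* empty bucket: insert here */
            ht[j] = item;
            return;
        }
        j = (j + 1) % TABLE_SIZE;               /* occupied: probe next, circularly */
    } while (j != home);
    fprintf(stderr, "The table is full\n");     /* back at the home bucket */
    exit(1);
}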

Overflow Handling (2/8)
- Additive transformation and division.
(Figure: insertions into a hash table with linear probing, 13 buckets and 1 slot per bucket.)

Overflow Handling (3/8)
- Problems with linear probing:
  - Identifiers tend to cluster together.
  - Adjacent clusters tend to coalesce.
  - This increases the search time.
- Example: suppose we enter the C built-in functions into a 26-bucket hash table (1 slot per bucket), hashing on the first character of each function name.
  - Insertion sequence: acos, atoi, char, define, exp, ceil, cos, float, atol, floor, ctime
  - Average number of key comparisons = 35/11 ≈ 3.18

Overflow Handling (4/8)
- Alternative techniques to improve the open addressing approach:
  - Quadratic probing
  - Rehashing
  - Random probing
- Rehashing
  - Try f1, f2, …, fm in sequence if a collision occurs.
  - Disadvantage: identifiers with different hash values are still compared against each other.
  - Alternatively, use chaining to resolve collisions.

Overflow Handling (5/8)
- Quadratic probing (a C sketch follows below)
  - Linear probing searches buckets (f(x)+i) % b.
  - Quadratic probing uses a quadratic function of i as the increment.
  - Examine buckets f(x), (f(x)+i^2) % b, and (f(x)-i^2) % b, for 1 ≤ i ≤ (b-1)/2.
  - When b is a prime number of the form 4j+3, with j an integer, the quadratic search examines every bucket in the table.
  - (Table: primes of the form 4j+3, e.g., 3, 7, 11, 19, 23, 31 with j = 0, 1, 2, 4, 5, 7.)
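A sketch of this probe sequence in C, reusing the element type and hash function from the linear-probing sketch above; it returns the bucket index holding key, or -1 after every bucket has been examined (assumes b is a prime of the form 4j+3):

int quadratic_search(element table[], long b, const char *key)
{
    long home = (long)hash(key) % b;
    if (!strcmp(table[home].key, key))
        return (int)home;                          /* found in the home bucket */
    for (long i = 1; i <= (b - 1) / 2; i++) {
        long up = (home + i * i) % b;              /* (f(x) + i^2) % b */
        if (!strcmp(table[up].key, key))
            return (int)up;
        long down = ((home - i * i) % b + b) % b;  /* (f(x) - i^2) % b */
        if (!strcmp(table[down].key, key))
            return (int)down;
    }
    return -1;                                     /* key not in the table */
}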

Overflow Handling (6/8)
- Chaining
  - Linear probing and its variations perform poorly because inserting an identifier requires comparisons with identifiers that have different hash values.
  - In this approach we maintain a list of synonyms for each bucket.
  - To insert a new element:
    - Compute the hash address f(x).
    - Examine the identifiers in the list for f(x).
  - Since we do not know the sizes of the lists in advance, we maintain them as linked chains.
  - Experimental evaluation indicates that chaining performs better than linear open addressing.

Overflow Handling (7/8)
- Results of hash chaining:
  - Insertion sequence: acos, atoi, char, define, exp, ceil, cos, float, atol, floor, ctime
  - f(x) = first character of x
  - Average number of key comparisons = 21/11 ≈ 1.91

Overflow Handling (8/8)
- Comparison:
  - In Figure 8.7, the values in each column give the average number of bucket accesses made in searching eight different tables with 33,575, 24,050, 4,909, 3,072, 2,241, 930, 762, and 500 identifiers each.
  - Chaining performs better than linear open addressing.
  - We can see that division is generally superior.
(Figure 8.7: average number of bucket accesses per identifier retrieved.)

typedef struct {
    char key[MAX_CHAR];
    /* other fields */
} element;

typedef struct list *list_pointer;
struct list {
    element item;
    list_pointer link;
};

list_pointer hash_table[TABLE_SIZE];

void chain_insert(element item, list_pointer ht[])
{
    int hash_value = hash(item.key);
    list_pointer ptr, trail = NULL, lead = ht[hash_value];
    /* walk the chain, remembering the trailing node so we can append */
    for (; lead; trail = lead, lead = lead->link)
        if (!strcmp(lead->item.key, item.key)) {
            fprintf(stderr, "The key is in the table\n");
            exit(1);
        }
    ptr = (list_pointer)malloc(sizeof(struct list));
    if (IS_FULL(ptr)) {
        fprintf(stderr, "The memory is full\n");
        exit(1);
    }
    ptr->item = item;
    ptr->link = NULL;
    if (trail)
        trail->link = ptr;      /* append at the end of the chain */
    else
        ht[hash_value] = ptr;   /* first element in this bucket */
}
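A matching search routine, not on the original slide, sketched here against the declarations above:

/* Return the node holding key, or NULL if key is not in the table. */
list_pointer chain_search(const char *key, list_pointer ht[])
{
    list_pointer p;
    for (p = ht[hash(key)]; p; p = p->link)   /* only synonyms of key are examined */
        if (!strcmp(p->item.key, key))
            return p;
    return NULL;
}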

Dynamic hashing
- Motivation for dynamic hashing:
  - One of the most important classes of software is the DBMS. A key characteristic of a DBMS is that the amount of information can vary a great deal over time.
  - Various data structures have been suggested for storing the data in a DBMS. In this section, we examine an extension of hashing that permits the technique to be used by a DBMS.
  - Traditional hashing schemes are not ideal, because we must statically allocate a portion of memory to hold the hash table.
  - Dynamic hashing, also referred to as extendible hashing, accommodates dynamically increasing and decreasing file sizes without penalty.

Dynamic hashing using directories
- Example: an identifier consists of two characters, and each character is represented by 3 bits.
- We would like to place these identifiers into a table that has four pages. Each page can hold no more than two identifiers, and the pages are indexed by the 2-bit sequences 00, 01, 10, and 11, respectively.
- We use the two low-order bits of each identifier to determine the page address of the identifier.
- Here a = 100, b = 101, c = 110, and the digits use their 3-bit binary codes:

  Identifier   Binary representation
  a0           100 000
  a1           100 001
  b0           101 000
  b1           101 001
  c1           110 001
  c2           110 010
  c3           110 011
  c5           110 101

(Figures: a two-level trie on four pages; the trie after inserting c5, which overflows page 01; the trie after inserting c1, which overflows again and splits further.)
- We use the term trie to denote a binary tree in which we locate an identifier by following its bit sequence.
- Notice that this trie has nodes that always branch in two directions, corresponding to 0 and 1. Only the leaf nodes of the trie contain a pointer to a page.

- From this example we can see that two major problems exist:
  - The access time for a page depends on the number of bits needed to distinguish the identifiers.
  - If the identifiers have a skewed distribution, the tree is also skewed.
- How do we avoid these problems?
  - To avoid the skewed distribution of identifiers, a hash function is used.
  - To avoid the long search down the trie, the trie is mapped to a directory.
  - A directory is a table of page pointers.
  - In case k bits are needed to distinguish the identifiers, the directory has 2^k entries, indexed 0, 1, 2, …, 2^k - 1.

- Directory contents as the trie grows (each entry holds a page pointer; entries whose indices share the distinguishing low-order bits share a page):
  - k = 2 (before c5): 00 -> (a0, b0), 01 -> (a1, b1), 10 -> (c2), 11 -> (c3)
  - k = 3 (after inserting c5): 001 -> (a1, b1) and 101 -> (c5); 000 and 100 still point to (a0, b0), 010 and 110 to (c2), 011 and 111 to (c3)
  - k = 4 (after inserting c1): 0001 -> (a1, c1) and 1001 -> (b1); all other entries point to the same pages as before

- Advantages of a directory:
  - Using a directory to represent a trie allows the table of identifiers to grow and shrink dynamically.
  - Accessing any page requires only two steps (see the sketch below):
    - First, use the hash function to find the address of the directory entry.
    - Then, retrieve the page associated with the address.
- Disadvantages of a directory:
  - If the keys are not uniformly divided among the pages, the directory can grow quite large, with most of the entries pointing to the same pages.
  - To prevent this from happening, we cannot use the bit sequence of the keys themselves. Instead, we translate the bits into a random sequence using a uniform hash function, as discussed in the previous section.
  - We need a family of hash functions because, at any point, we may require a different number of bits to distinguish a new key.
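A sketch of the two-step access in C; the page structure, the fixed directory depth, and the reuse of the element type and hash function from the earlier sketches are all assumptions for illustration:

#define PAGE_CAPACITY 2

typedef struct {
    int count;                        /* identifiers currently on the page */
    element items[PAGE_CAPACITY];
} page;

int   dir_bits = 4;                   /* k: bits currently needed (assumed fixed here) */
page *directory[1 << 4];              /* 2^k page pointers */

/* Step 1: hash the key and keep its k low-order bits as the directory index.
   Step 2: follow the page pointer stored in that entry. */
page *find_page(const char *key)
{
    unsigned int h = hash(key);                              /* uniform hash */
    unsigned int index = h & ((1u << (unsigned)dir_bits) - 1u);
    return directory[index];
}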

- Our solution is a family of hash functions hash(key, i), where hash(key, i) is simply hash(key, i-1) with either a zero or a one appended as the new leading bit of the result. Thus, hash(key, i) might be a function that produces a random number of i bits from the identifier key (one realization is sketched below).
- Some important twists are associated with this approach.
  - For example, suppose a page identified by i bits overflows. We allocate a new page and rehash the identifiers into those two pages. The identifiers in both pages have their low-order i bits in common. We refer to these pages as buddies. When the number of identifiers in two buddy pages is no more than the capacity of a single page, we coalesce the two pages into one.
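One simple way to realize such a family, sketched under the assumption that a single fixed-width uniform hash function (such as hash above) is available: hash(key, i) keeps the i low-order bits of the full hash, so it equals hash(key, i-1) with one new leading bit appended:

/* hash_family(key, i): the i low-order bits of a full uniform hash.
   By construction this is hash_family(key, i-1) with one extra
   leading bit, as the family requires. Assumes 0 < i < 32. */
unsigned int hash_family(const char *key, unsigned int i)
{
    unsigned int full = hash(key);      /* fixed-width uniform hash (assumed) */
    return full & ((1u << i) - 1u);     /* keep the low-order i bits */
}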