CHAPTER 81 HASHING All the programs in this file are selected from Ellis Horowitz, Sartaj Sahni, and Susan Anderson-Freed “Fundamentals of Data Structures.

Slides:



Advertisements
Similar presentations
CS Data Structures Chapter 8 Hashing.
Advertisements

Review of Chapter 8 張啟中.
Part II Chapter 8 Hashing Introduction Consider we may perform insertion, searching and deletion on a dictionary (symbol table). Array Linked list Tree.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Hash-Based Indexes Chapter 11.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Hash-Based Indexes Chapter 11.
Skip List & Hashing CSE, POSTECH.
Chapter 11 (3 rd Edition) Hash-Based Indexes Xuemin COMP9315: Database Systems Implementation.
NCUE CSIE Wireless Communications and Networking Laboratory CHAPTER 8 Hashing 1.
Data Structures Using C++ 2E
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree.
Hashing as a Dictionary Implementation
What we learn with pleasure we never forget. Alfred Mercier Smitha N Pai.
CHAPTER 7 HASHING What is hashing for? For searching But we already have binary search in O( ln n ) time after sorting. And we already have algorithms.
Hashing Chapters What is Hashing? A technique that determines an index or location for storage of an item in a data structure The hash function.
Searching Kruse and Ryba Ch and 9.6. Problem: Search We are given a list of records. Each record has an associated key. Give efficient algorithm.
1 Hash-Based Indexes Yanlei Diao UMass Amherst Feb 22, 2006 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
1 Chapter 9 Maps and Dictionaries. 2 A basic problem We have to store some records and perform the following: add new record add new record delete record.
© 2006 Pearson Addison-Wesley. All rights reserved13 A-1 Chapter 13 Hash Tables.
Sets and Maps Chapter 9. Chapter 9: Sets and Maps2 Chapter Objectives To understand the Java Map and Set interfaces and how to use them To learn about.
Symbol Table Symbol table is used widely in many applications.
FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12.
Data Structures Using Java1 Chapter 8 Search Algorithms.
CHAPTER 8 Hashing Instructors: C. Y. Tang and J. S. Roger Jang All the material are integrated from the textbook "Fundamentals of Data Structures in C"
COMP 171 Data Structures and Algorithms Tutorial 10 Hash Tables.
CS Data Structures Chapter 8 Hashing (Concentrating on Static Hashing)
Hashing General idea: Get a large array
Data Structures Using C++ 2E Chapter 9 Searching and Hashing Algorithms.
CHAPTER 8 Hashing Instructor: C. Y. Tang Ref.: Ellis Horowitz, Sartaj Sahni, and Susan Anderson-Freed, “Fundamentals of Data Structures in C,” Computer.
Data Structures Using Java1 Chapter 8 Search Algorithms.
Symbol Tables Symbol tables are used by compilers to keep track of information about variables functions class names type names temporary variables etc.
IKI 10100: Data Structures & Algorithms Ruli Manurung (acknowledgments to Denny & Ade Azurat) 1 Fasilkom UI Ruli Manurung (Fasilkom UI)IKI10100: Lecture8.
CS212: DATA STRUCTURES Lecture 10:Hashing 1. Outline 2  Map Abstract Data type  Map Abstract Data type methods  What is hash  Hash tables  Bucket.
Hashing. Concept of Hashing In CS, a hash table, or a hash map, is a data structure that associates keys (names) with values (attributes). Look-Up Table.
Hashing Table Professor Sin-Min Lee Department of Computer Science.
Hashing Chapter 20. Hash Table A hash table is a data structure that allows fast find, insert, and delete operations (most of the time). The simplest.
Appendix E-A Hashing Modified. Chapter Scope Concept of hashing Hashing functions Collision handling – Open addressing – Buckets – Chaining Deletions.
Comp 335 File Structures Hashing.
Hashing Hashing is another method for sorting and searching data.
Hashing as a Dictionary Implementation Chapter 19.
Been-Chian Chien, Wei-Pang Yang, and Wen-Yang Lin 8-1 Chapter 8 Hashing Introduction to Data Structure CHAPTER 8 HASHING 8.1 Symbol Table Abstract Data.
Chapter 10 Hashing. The search time of each algorithm depend on the number n of elements of the collection S of the data. A searching technique called.
Chapter 11 Hash Tables © John Urrutia 2014, All Rights Reserved1.
Hashing Basis Ideas A data structure that allows insertion, deletion and search in O(1) in average. A data structure that allows insertion, deletion and.
Introduction to Database, Fall 2004/Melikyan1 Hash-Based Indexes Chapter 10.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Indexed Sequential Access Method.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Hash-Based Indexes Chapter 10.
COSC 2007 Data Structures II Chapter 13 Advanced Implementation of Tables IV.
Hashing Static Hashing Dynamic Hashing. – 2 – Sungkyunkwan University, Hyoung-Kee Choi © Symbol table ADT  We define the symbol table as a set of name-attribute.
Chapter 13 C Advanced Implementations of Tables – Hash Tables.
Data Structures Using C++
Chapter 9 Hashing Dr. Youssef Harrath
Hashing COMP171. Hashing 2 Hashing … * Again, a (dynamic) set of elements in which we do ‘search’, ‘insert’, and ‘delete’ n Linear ones: lists, stacks,
Hash Tables Ellen Walker CPSC 201 Data Structures Hiram College.
Sets and Maps Chapter 9. Chapter Objectives  To understand the Java Map and Set interfaces and how to use them  To learn about hash coding and its use.
Chapter 5 Record Storage and Primary File Organizations
Data Structures Chapter 8: Hashing 8-1. Performance Comparison of Arrays and Trees Is it possible to perform these operations in O(1) ? ArrayTree Sorted.
Data Structures Using C++ 2E
Hashing Alexandra Stefan.
C. Y. Tang and J. S. Roger Jang
EEE2108: Programming for Engineers Chapter 8. Hashing
Hashing Alexandra Stefan.
Data Structures Using C++ 2E
Advanced Associative Structures
Chapter 10 Hashing.
CHAPTER 8 HASHING All the programs in this file are selected from
CH 9.2 : Hash Tables Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and.
Hash-Based Indexes Chapter 11
CH 9.2 : Hash Tables Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and.
CS202 - Fundamental Structures of Computer Science II
What we learn with pleasure we never forget. Alfred Mercier
Presentation transcript:

CHAPTER 81 HASHING All the programs in this file are selected from Ellis Horowitz, Sartaj Sahni, and Susan Anderson-Freed “Fundamentals of Data Structures in C”, Computer Science Press, 1992.

CHAPTER 82 Symbol Table n Definition A set of name-attribute pairs n Operations –Determine if a particular name is in the table –Retrieve the attributes of the name –Modify the attributes of that name –Insert a new name and its attributes –Delete a name and its attributes

CHAPTER 83 The ADT of Symbol Table Structure SymbolTable(SymTab) is objects: a set of name-attribute pairs, where the names are unique functions: for all name belongs to Name, attr belongs to Attribute, symtab belongs to SymbolTable, max_size belongs to integer SymTab Create(max_size) ::= create the empty symbol table whose maximum capacity is max_size Boolean IsIn(symtab, name) ::= if (name is in symtab) return TRUE else return FALSE Attribute Find(symtab, name) ::= if (name is in symtab) return the corresponding attribute else return null attribute SymTab Insert(symtal, name, attr) ::= if (name is in symtab) replace its existing attribute with attr else insert the pair (name, attr) into symtab SymTab Delete(symtab, name) ::= if (name is not in symtab) return else delete (name, attr) from symtab search, insertion, deletion

CHAPTER 84 Search vs. Hashing n Search tree methods: key comparisons n hashing methods: hash functions n types –statistic hashing –dynamic hashing

CHAPTER 85 Static Hashing b-2 b ………. s hash table (ht) f(x): 0 … (b-1) s slots b buckets

CHAPTER 86 Identifier Density and Loading Density n The identifier density of a hash table is the ratio n/T –n is the number of identifiers in the table –T is possible identifiers The loading density or loading factor of a hash table is  = n/(sb) –s is the number of slots –b is the number of buckets

CHAPTER 87 Synonyms n Since the number of buckets b is usually several orders of magnitude lower than T, the hash function f must map several different identifiers into the same bucket n Two identifiers, i and j are synonyms with respect to f if f(i) = f(j)

CHAPTER 88 Overflow and Collision n An overflow occurs when we hash a new identifier into a full bucket n A collision occurs when we hash two non-identical identifiers into the same bucket

CHAPTER 89 Example b=26, s=2, n=10,  =10/52=0.19, f(x)=the first char of x x: acos, define, float, exp, char, atan, ceil, floor, clock, ctime f(x):0, 3, 5, 4, 2, 0, 2, 5, 2, 2 synonyms synonyms: char, ceil, clock, ctime overflow synonyms

CHAPTER 810 Hashing Functions n Two requirements –easy computation –minimal number of collisions n mid-square (middle of square) n division (0 ~ (M-1)) Avoid the choice of M that leads to many collisions

CHAPTER 811 M = 2 i f D (X) depends on LSBs of X Example. (1)Each character is represented by six bits. (2)Identifiers are right-justified and zero-filled A1 =… B1 =… C1 =… X41 =… NTXY1 =… Zero-filled f D (X) shift right i bits M=2 i, i  6 Similarly, AXY, BXY, WTXY (M=2 i, i  12) have the same bucket address A program tends to have variables with the same suffix, M=2 i is not suitable

CHAPTER 812 (3)Identifiers are left-justified and zero-filled. 60-bit word f D (one-char id) = 0, M=2 i, i  54 f D (two-char id) = 0, M=2 i, i  48

CHAPTER 813 Programs in which many variables are permutations of each other. Example. X=X1X2Y=X2X1 X1 --> C(X1)X2 --> C(X2) Each character is represented by six bits X: C(X1) * C(X2) Y: C(X2) * C(X1) (f D (X) - f D (Y)) % P (where P is a prime number) = C(X1) * 2 6 % P + C(X2) % P - C(X2) * 2 6 % P- C(X1) % P P = 3 64 % 3 * C(X1) % 3 + C(X2) % % 3 * C(X2) % 3- C(X1) % 3 = C(X1) % 3 + C(X2) % 3 - C(X2) % 3- C(X1) % 3 = 0 P = 7? M is a prime number such that M does not divide r k  a for small k and a (Knuth) for example, M = 1009

CHAPTER 814 Hashing Functions n Folding –Partition the identifier x into several parts –All parts except for the last one have the same length –Add the parts together to obtain the hash address –Two possibilities Shift folding –x1=123, x2=203, x3=241, x4=112, x5=20, address=699 Folding at the boundaries –x1=123, x2=203, x3=241, x4=112, x5=20, address=897

CHAPTER 815 P1P2P3P4P shift folding folding at the boundaries MSD ---> LSD LSD <--- MSD

CHAPTER 816 Digital Analysis n All the identifiers are known in advance M=1~999 X 1 d 11 d 12 …d 1n X 2 d 21 d 22 …d 2n … X m d m1 d m2 …d mn Select 3 digits from n Criterion: Delete the digits having the most skewed distributions

CHAPTER 817 Overflow Handling n Linear Open Addressing (linear probing) n Quadratic probing n Chaining

CHAPTER 818 Data Structure for Hash Table #define MAX_CHAR 10 #define TABLE_SIZE 13 typedef struct { char key[MAX_CHAR]; /* other fields */ } element; element hash_table[TABLE_SIZE];

CHAPTER 819 Hash Algorithm via Division void init_table(element ht[]) { int i; for (i=0; i<TABLE_SIZE; i++) ht[i].key[0]=NULL; } int transform(char *key) { int number=0; while (*key) number += *key++; return number; } int hash(char *key) { return (transform(key) % TABLE_SIZE); }

CHAPTER 820 Example

CHAPTER 821 Linear Probing (linear open addressing) n Compute f(x) for identifier x n Examine the buckets ht[(f(x)+j)%TABLE_SIZE] 0  j  TABLE_SIZE –The bucket contains x. –The bucket contains the empty string –The bucket contains a nonempty string other than x –Return to ht[f(x)]

CHAPTER 822 Linear Probing void linear_insert(element item, element ht[]) { int i, hash_value; I = hash_value = hash(item.key); while(strlen(ht[i].key)) { if (!strcmp(ht[i].key, item.key)) fprintf(stderr, “Duplicate entry\n”); exit(1); } i = (i+1)%TABLE_SIZE; if (i == hash_value) { fprintf(stderr, “The table is full\n”); exit(1); } ht[i] = item; }

CHAPTER 823 Problem of Linear Probing n Identifiers tend to cluster together n Adjacent cluster tend to coalesce n Increase the search time

CHAPTER 824 Coalesce Phenomenon Average number of buckets examined is 41/11=3.73

CHAPTER 825 Quadratic Probing n Linear probing searches buckets (f(x)+i)%b n Quadratic probing uses a quadratic function of i as the increment n Examine buckets f(x), (f(x)+i )%b, (f(x)- i )%b, for 1<=i<=(b-1)/2 n b is a prime number of the form 4j+3, j is an integer

CHAPTER 826 rehashing n Try f 1, f 2, …, f m in sequence if collision occurs n disadvantage –comparison of identifiers with different hash values –use chain to resolve collisions

CHAPTER 827 Data Structure for Chaining #define MAX_CHAR 10 #define TABLE_SIZE 13 #define IS_FULL(ptr) (!(ptr)) typedef struct { char key[MAX_CHAR]; /* other fields */ } element; typedef struct list *list_pointer; typedef struct list { element item; list_pointer link; }; list_pointer hash_table[TABLE_SIZE];

CHAPTER 828 Chain Insert void chain_insert(element item, list_pointer ht[]) { int hash_value = hash(item.key); list_pointer ptr, trail=NULL, lead=ht[hash_value]; for (; lead; trail=lead, lead=lead->link) if (!strcmp(lead->item.key, item.key)) { fprintf(stderr, “The key is in the table\n”); exit(1); } ptr = (list_pointer) malloc(sizeof(list)); if (IS_FULL(ptr)) { fprintf(stderr, “The memory is full\n”); exit(1); } ptr->item = item; ptr->link = NULL; if (trail) trail->link = ptr; else ht[hash_value] = ptr; }

CHAPTER 829 Results of Hash Chaining acos, atoi, char, define, exp, ceil, cos, float, atol, floor, ctime f(x)=first character of x # of key comparisons=21/11=1.91

CHAPTER 830

CHAPTER 831 dynamic hashing (extensible hashing) n dynamically increasing and decreasing file size n concepts –file: a collection of records –record: a key + data, stored in pages (buckets) –space utilization

CHAPTER 832 *Figure 8.8:Some identifiers requiring 3 bits per character(p.414) Dynamic Hashing Using Directories Example. m(# of pages)=4, P(page capacity)=2 00, 01, 10, 11 allocation: lower order two bits from LSB to MSB

CHAPTER 833 *Figure 8.9: A trie to hole identifiers(p.415) a0,b0 c2 a1,b1 c3 a0,b0 c2 c3 a1,b1 c5 a0,b0 c2 c3 c5 a1,c1 b1 (a) two level trie on four pages (b) inserting c5 with overflow (c) inserting c1 with overflow c5 +c1 Note: time to access a page: # of bits to distinguish the identifiers Note: identifiers skewed: depth of tree skewed

CHAPTER 834 Extendiable Hashing f(x)=a set of binary digits --> table lookup local depth global depth: 4 page pointer {0000,1000,0100,1100} {0001} {0010,1010,0110,1110} {0011,1011,0111,1111} {0101,1101} {1001} pages c & d: buddies {000,100} {001} {010,110} {011,111} {101} a0,b0 c2 a1,b1 c a0,b0 c2 c3 a1,b1 c c1

CHAPTER 835 If keys do not uniformly divide up among pages, then the directory can glow quite large, but most of entries will point to the same page f: a family of hashing functions hash i : key --> {0.. 2 i-1 } 1  i  d hash(key, i): produce random number of i bits from identifier key hash i is hash i-1 with either a zero or one appeared as the new leading bit of result hash(a0,2)=00hash(a1,4)=0001hash(b0,2)= hash(b1,4)=1001hash(c1, 4)=0001hash(c2, 2)= hash(c3,2)=11hash(c5,3)=101

CHAPTER 836 *Program 8.5: Dynamic hashing (p.421) #include #include #include #define WORD_SIZE 5 /* max number of directory bits */ #define PAGE_SIZE 10 /* max size of a page */ #define DIRECTORY_SIZE 32 /* max size of directory */ typedef struct page *paddr; typedef struct page { int local_depth; /* page level */ char *name[PAGE_SIZE]; int num_idents; /* #of identifiers in page */ }; typedef struct { char *key; /* pointer to string */ /*other fields */ } brecord; int global_depth; /* trie height */ paddr directory[DIRECTORY_SIZE]; /* pointers to pages */ 2 5 =32 the actual identifiers See Figure 8.10(c) global depth=4 local depth of a =2

CHAPTER 837 paddr hash(char *, short int); paddr buddy(paddr); short int pgsearch(char *, paddr ); int convert(paddr); void enter(brecord, paddr); void pgdelete(char *, paddr); paddr find(brecord, char *); void insert (brecord, char *); int size(paddr); void coalesce (paddr, paddr); void delete(brecord, char *); paddr hash(char *key, short int precision) { /* *key is hashed using a uniform hash function, and the low precision bits are returned as the page address */ } directory subscript for directory lookup

CHAPTER 838 paddr buddy(paddr index) { /*Take an address of a page and returns the page’s buddy, i. e., the leading bit is complemented */ } int size(paddr ptr) { /* return the number of identifiers in the page */ } void coalesce(paddr ptr, paddr, buddy) { /*combine page ptr and its buddy into a single page */ } short int pgsearch{char *key, paddr index) { /*Search a page for a key. If found return 1 otherwise return 0 */ } buddy b n-1 b n-2 … b 0 b n-1 b n-2 … b 0

CHAPTER 839 void convert (paddr ptr) { /* Convert a pointer to a pointer to a page to an equivalent integer */ } void enter(brecord r, paddr ptr) { /* Insert a new record into the page pointed at by ptr */ } void pgdelete(char *key, paddr ptr) { /* remove the record with key, hey, from the page pointed to by ptr */ } short int find (char *key, paddr *ptr) { /* return 0 if key is not found and 1 if it is. Also, return a pointer (in ptr) to the page that was searched. Assume that an empty directory has one page. */

CHAPTER 840 paddr index; int intindex; index = hash(key, global_depth); intindex = convert(index); *ptr = directory[intindex]; return pgsearch(key, ptr); } void insert(brecord r, char *key) { paddr ptr; if find(key, &ptr) { fprintr(stderr, “ The key is already in the table.\n”); exit(1); } if (ptr-> num_idents != PAGE_SIZE) { enter(r, ptr); ptr->num_idents++; } else{ /*Split the page into two, insert the new key, and update global_depth if necessary.

CHAPTER 841 If this causes global_depth to exceed WORD_SIZE then print an error and terminate. */ }; } void delete(brecord r, char *key) { /* find and delete the record r from the file */ paddr ptr; if (!find (key, &ptr )) { fprintf(stderr, “Key is not in the table.\n”); return; /* non-fatal error */ } pgdelete(key, ptr); if (size(ptr) + size(buddy(ptr)) <= PAGE_SIZE) coalesce(ptr, buddy(ptr)); } void main(void) { }

CHAPTER 842 *Figure 8.12: A trie mapped to a directoryless, contiguous storage (p.424) a0,b0 c2 a1,b1 c a0 b0 c2 - a1 b1 c Directoryless Dynamic Hashing (Linear Hashing) continuous address space offset of base address (cf directory scheme)

CHAPTER 843 *Figure 8.13: An example with two insertions (p.425) a0 b0 c2 - a1 b1 c new page a0 b0 c2 - a1 b1 c c5 overflow page a0 b0 c2 - a1 b1 c c1 c5 new page start of expansion 2 there are 4 pages (a) insert c5 page 10 overflows page 00 splits (b) insert c1 page 10 overflows page 01 splits (c) +c5 1 2 rehash & split +c1 rehash & split

CHAPTER 844 *Figure 8.14: During the rth phase of expansion of directoryless method (p.426) pages already split pages not yet split pages added so far addressed by r+1 bits addressed by r bits addressed by r+1 bits q r 2 r pages at start suppose we are at phase r; there are 2 r pages indexed by r bits

CHAPTER 845 *Program 8.6:Modified hash function (p.427) if ( hash(key,r) < q) page = hash(key, r+1); else page = hash(key, r); if needed, then follow overflow pointers;