Hashing CENG 351.

Slides:



Advertisements
Similar presentations
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part C Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
Advertisements

Hash-Based Indexes Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY.
Hash-based Indexes CS 186, Spring 2006 Lecture 7 R &G Chapter 11 HASH, x. There is no definition for this word -- nobody knows what hash is. Ambrose Bierce,
1 Hash-Based Indexes Module 4, Lecture 3. 2 Introduction As for any index, 3 alternatives for data entries k* : – Data record with key value k – –Choice.
Hashing. CENG 3512 Motivation The primary goal is to locate the desired record in a single access of disk. – Sequential search: O(N) – B+ trees: O(log.
Hash-Based Indexes The slides for this text are organized into chapters. This lecture covers Chapter 10. Chapter 1: Introduction to Database Systems Chapter.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Hash-Based Indexes Chapter 11.
Quick Review of Apr 10 material B+-Tree File Organization –similar to B+-tree index –leaf nodes store records, not pointers to records stored in an original.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Hash-Based Indexes Chapter 11.
CPSC 404, Laks V.S. Lakshmanan1 Hash-Based Indexes Chapter 11 Ramakrishnan & Gehrke (Sections )
Chapter 11 Indexing and Hashing (2) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
Chapter 11 (3 rd Edition) Hash-Based Indexes Xuemin COMP9315: Database Systems Implementation.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree.
What we learn with pleasure we never forget. Alfred Mercier Smitha N Pai.
Index tuning Hash Index. overview Introduction Hash-based indexes are best for equality selections. –Can efficiently support index nested joins –Cannot.
1 Hash-Based Indexes Yanlei Diao UMass Amherst Feb 22, 2006 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
1 Hash-Based Indexes Chapter Introduction  Hash-based indexes are best for equality selections. Cannot support range searches.  Static and dynamic.
FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12.
METU Department of Computer Eng Ceng 302 Introduction to DBMS Disk Storage, Basic File Structures, and Hashing by Pinar Senkul resources: mostly froom.
1 Hash-Based Indexes Chapter Introduction : Hash-based Indexes  Best for equality selections.  Cannot support range searches.  Static and dynamic.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe Chapter 13 Disk Storage, Basic File Structures, and Hashing.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 17 Disk Storage, Basic File Structures, and Hashing.
Hashing and Hash-Based Index. Selection Queries Yes! Hashing  static hashing  dynamic hashing B+-tree is perfect, but.... to answer a selection query.
1 Database Systems ( 資料庫系統 ) November 8, 2004 Lecture #9 By Hao-hua Chu ( 朱浩華 )
FALL 2005 CENG 351 Data Management and File Structures 1 Hashing.
March 23 & 28, Csci 2111: Data and File Structures Week 10, Lectures 1 & 2 Hashing.
March 23 & 28, Hashing. 2 What is Hashing? A Hash function is a function h(K) which transforms a key K into an address. Hashing is like indexing.
Database Management 7. course. Reminder Disk and RAM RAID Levels Disk space management Buffering Heap files Page formats Record formats.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Hash-Based Indexes Chapter 11 Modified by Donghui Zhang Jan 30, 2006.
Introduction to Database, Fall 2004/Melikyan1 Hash-Based Indexes Chapter 10.
1.1 CS220 Database Systems Indexing: Hashing Slides courtesy G. Kollios Boston University via UC Berkeley.
Static Hashing (using overflow for collision managment e.g., h(key) mod M h key Primary bucket pages 1 0 M-1 Overflow pages(as separate link list) Overflow.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Indexed Sequential Access Method.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Hash-Based Indexes Chapter 10.
1 Database Systems ( 資料庫系統 ) November 28, 2005 Lecture #9.
Chapter 5 Record Storage and Primary File Organizations
CENG Hashing for files. CENG 3512 Introduction Idea: to reference items in a table directly by doing arithmetic operations to transform keys into.
Database Management 7. course. Reminder Disk and RAM RAID Levels Disk space management Buffering Heap files Page formats Record formats.
Data Structures Using C++ 2E
Azita Keshmiri CS 157B Ch 12 indexing and hashing
Dynamic Hashing (Chapter 12)
Lecture 21: Hash Tables Monday, February 28, 2005.
Are they better or worse than a B+Tree?
Hash-Based Indexes Chapter 11
Subject Name: File Structures
Database Management Systems (CS 564)
Data Structures Using C++ 2E
Review Graph Directed Graph Undirected Graph Sub-Graph
Disk Storage, Basic File Structures, and Hashing
Introduction to Database Systems
CS222: Principles of Data Management Notes #8 Static Hashing, Extendible Hashing, Linear Hashing Instructor: Chen Li.
Hash-Based Indexes R&G Chapter 10 Lecture 18
Hash-Based Indexes Chapter 10
External Memory Hashing
Indexing and Hashing Basic Concepts Ordered Indices
CS222P: Principles of Data Management Notes #8 Static Hashing, Extendible Hashing, Linear Hashing Instructor: Chen Li.
Hashing.
Hash-Based Indexes Chapter 11
Index tuning Hash Index.
Advance Database System
CS202 - Fundamental Structures of Computer Science II
Database Systems (資料庫系統)
Database Design and Programming
2018, Spring Pusan National University Ki-Joune Li
Module 12a: Dynamic Hashing
Hash-Based Indexes Chapter 11
Chapter 11 Instructor: Xin Zhang
What we learn with pleasure we never forget. Alfred Mercier
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #07 Static Hashing, Extendible Hashing, Linear Hashing Instructor: Chen Li.
Presentation transcript:

Hashing CENG 351

Hashing: Intro The primary goal is to locate the desired record in a single access of disk. Sequential search: O(N) B+ trees: O(logk N) Hashing: O(1) In hashing, the key of a record is transformed into an address and the record is stored at that address. Hash-based indexes are best for equality selections. Cannot support range searches. Static and dynamic hashing techniques exist. CENG 351

Hash-based Index Data entries are kept in buckets (an abstract term) Each bucket is a collection of one primary page and zero or more overflow pages. Given a search key value, k, we can find the bucket where the data entry k* is stored as follows: Use a hash function, denoted by h The value of h(k) is the address for the desired bucket. h(k) should distribute the search key values uniformly over the collection of buckets CENG 351 2

Design Factors Bucket size: the number of records that can be held at the same address. Loading factor: the ratio of the number of records put in the file to the total capacity of the buckets. Hash function: should evenly distribute the keys among the addresses. Overflow resolution technique. CENG 351

Hash Functions Key mod N: Folding: Truncation: Squaring: N is the size of the table, better if it is prime. Folding: e.g. 123|456|789: add them and take mod. Truncation: e.g. 123456789 map to a table of 1000 addresses by picking 3 digits of the key. Squaring: Square the key and then truncate Radix conversion: e.g. 1 2 3 4 treat it to be base 11, truncate if necessary. CENG 351

Static Hashing Primary Area: # primary pages fixed, allocated sequentially, never de-allocated; (say M buckets). A simple hash function: h(k) = f(k) mod M Overflow area: disjoint from the primary area. It keeps buckets which hold records whose key maps to a full bucket. Adding the address of an overflow bucket to a primary area bucket is called chaining. Collision does not cause a problem as long as there is still room in the mapped bucket. Overflow occurs during insertion when a record is hashed to the bucket that is already full. CENG 351 3

Example Assume f(k) = k. Let M = 5. So, h(k) = k mod 5 Bucket factor = 3 records. Insert: 12, 35, 44, 60, 6, 46,57,33,62,17 CENG 351

Example 1 2 3 4 Assume f(k) = k. Let M = 5. So, h(k) = k mod 5 Bucket factor = 3 records. Insert: 12, 35, 44, 60, 6, 46,57,33,62,17 1 2 3 4 35 60 6 46 17 12 57 62 33 overflow 44 Primary area CENG 351

Load Factor (Packing density) To limit the amount of overflow we allocate more space to the primary area than we need (i.e. the primary area will be, say, 70% full) Load Factor = => Lf = # of records in the file # of spaces in primary area n M * Bkfr Where n: number of records, Bkf: the blocking factor, M: number of blocks CENG 351

Effects of Lf and Bkf Performance can be enhanced by the choice of bucket size and load factor. In general, a smaller load factor means less overflow and a faster fetch time; but more wasted space. A larger Bkf means less overflow in general, but slower fetch. CENG 351

Insertion and Deletion Insertion: New records are inserted at the end of the chain. Deletion: Two ways are possible: Mark the record to be deleted Consolidate sparse buckets when deleting records. In the 2nd approach: When a record is deleted, fill its place with the last record in the chain of the current bucket. Deallocate the last bucket when it becomes empty. CENG 351

Problem of Static Hashing The main problem with static hashing: the number of buckets is fixed: Long overflow chains can develop and degrade performance. On the other hand, if a file shrinks greatly, a lot of bucket space will be wasted. There are some other hashing techniques that allow dynamically growing and shrinking hash index. These include: linear hashing extendible hashing CENG 351 4

Linear Hashing It maintains a constant load factor. Thus avoids reorganization. by incrementally adding new buckets to the primary area. In linear hashing the last bits in the hash number are used for placing the records. CENG 351

Example e.g. 34: 100010 28: 011100 08: 001000 13: 001101 21: 010101 37: 100101 12: 001100 Last 3 bits 000 8 16 32 001 17 25 010 34 50 011 11 27 100 28 101 5 110 14 111 55 15 Lf = 14/24 = 58% Lf = 15/24 = 63% Insert: 13, 21, 37,12 Lf = 16/24 = 67% 13 21 37 Lf = 17/24 = 70% CENG 351

Insertion of records To expand the table: split an existing bucket denoted by k digits into two buckets using the last k+1 digits. e.g. 0000 000 1000 CENG 351

Expanding the table 0000 16 32 001 17 25 010 34 50 011 11 27 100 28 12 101 5 13 21 110 14 111 55 15 1000 8 Boundary value 37 CENG 351

Hash # 1000: uses last 4 digits Hash # 1101: uses last 3 digits 0000 16 32 0001 17 0010 34 50 011 11 27 100 28 12 101 5 13 21 110 14 111 55 15 1000 8 1001 25 1010 26 k = 3 Hash # 1000: uses last 4 digits Hash # 1101: uses last 3 digits Boundary value 37 CENG 351

Fetching a record Calculate the hash function. Look at the last k digits. If it’s less than the boundary value, the location is in the bucket labeled with the last k+1 digits. Otherwise it is in the bucket labeled with the last k digits. Follow overflow chains as with static hashing. CENG 351

Insertion Search for the correct bucket into which to place the new record. If the bucket is full, allocate a new overflow bucket. Split condition*: If Lf is above threshold, Add one more bucket to the primary area. Distribute the records from the bucket chain at the boundary value between the original area and the new primary area buckets Add 1 to the boundary value. *any overflow may be also considered as split condition (Elmasri), or Lf*Bkfr records more than needed for the given Lf (Salzberg) CENG 351

Deletion Read in a chain of records. Replace the deleted record with the last record in the chain. If the last overflow bucket becomes empty, deallocate it. Grouping condition*: When the last bucket is empty or Lf is below a given threshold, contract the primary area by one bucket. Compressing the table is exact opposite of expanding it: Keep the total # of records in the file and buckets in primary area. When we have Lf * Bkfr fewer records than needed, consolidate the last bucket with the bucket which shares the same last k digits. *when number of records is Lf * Bkfr less than the number needed for Lf, it may be considered as a grouping condition (Salzberg) CENG 351

Extendible Hashing:Intr. Extendible hashing does not have chains of buckets, contrary to linear hashing. Hashing is based on creating index for an index table, which have pointers to the data buckets. The number of the entries in the index table is 2i, where i is number of bit used for indexing. The records with the same first i bits are placed in the same bucket, If max content of any bucket is reached, a new bucket is added, provided the corresponding index entry is free. If max content of any bucket is reached and a new bucket cannot be added, according to the first i bits, then the table size is doubled. CENG 351

Extendible Hashing: Overflow When i=j and overflow occurs, then index table is doubled; where j is the level of the index, while i is the level of data buckets Successive doubling of the index table may happen, in case of a poor hashing function. The main disadvantage of the extendible hashing is that, the index table may grow to be too big and too sparse. The best case: there are as many buckets as index table size. If the index table can fit into the memory, the access to a record requires one disk access only. CENG 351

Length of common hash prefix Extendible Hashing Extendable Hashing i i2 bucket2 i3 bucket3 i1 bucket1 Data bucket Bucket address table Length of common hash prefix Hash prefix Hash function returns b bits Only the prefix i bits are used to hash the item There are 2i entries in the bucket address table Let ij be the length of the common hash prefix for data bucket j, there is 2(i-ij) entries in bucket address table points to j CENG 351

Splitting a bucket: Case 1 Extendable Hashing Splitting (Case 1: ij=i) Only one entry in bucket address table points to data bucket j i++; split data bucket j to j, z; ij=iz=i; rehash all items previously in j; 3 2 1 000 001 010 011 100 101 110 111 2 00 01 10 11 1 CENG 351

Splitting: Case 2 Extendable Hashing Splitting (Case 2: ij< i) More than one entry in bucket address table point to data bucket j split data bucket j to j, z; ij = iz = ij +1; Adjust the pointers previously point to j to j and z; rehash all items previously in j; 2 3 000 001 010 011 100 101 110 111 2 1 3 000 001 010 011 100 101 110 111 CENG 351

Example Extendable Hashing Suppose the hash function is h(x) = x mod 8 and each bucket can hold at most two records. Show the extendable hash structure after inserting 1, 4, 5, 7, 8, 2, 20. 1 4 5 7 8 2 20 001 100 101 111 000 010 1 4 00 01 10 11 1 8 2 4 5 7 1 4 5 CENG 351

Example Extendable Hashing inserting 1, 4, 5, 7, 8, 2, 20 1 4 5 7 8 2 001 100 101 111 000 010 3 000 001 010 011 100 101 110 111 1 8 2 4 20 7 5 1 8 2 4 5 7 00 01 10 11 CENG 351

Comments on Extendible Hashing If directory fits in memory, equality search answered with one disk access. A typical example: a100MB file with 100 bytes/entry and a page size of 4K contains 1,000,000 records (as data entries) but only about 25,000 directory elements  chances are high that directory will fit in memory. If the distribution of hash values is skewed (e.g., a large number of search key values all are hashed to the same bucket ), directory can grow large. But this kind of skew can be avoided with a well-tuned hashing function CENG 351 9

Comments on Extendible Hashing Delete: If removal of data entry makes bucket empty, can be merged with a “buddy” bucket. If each directory element points to same bucket as its split image, can halve directory. CENG 351 9

Simple Hashing: Search No overflow case, ie. direct hit: TF=s+r+btt Overflow successful case, for average chain length of x TFs=s+r+btt + (x/2)*(s+r+btt) Overflow unsuccessful case, for average chain length of x TFu=s+r+btt + x*(s+r+btt) CENG 351

Deletion from a hashed file Delete without consolidation, i.e., mark the deleted record as deleted… which requires occasional reorganization. TD=TFs+2r Where x/2 of the chain is involved in TFs- successful search. CENG 351

Delete with consolidation Delete with consolidation, requires the entire chain to be read in and written out, after been updated. This will not require any reorganization. The approximate formula: TD=TFu+2r Where the full chain (x blocks) is involved in TFu… CENG 351

Insertion Assuming that insertion involves the modification of the last bucket, just like deletion… TI=TFu+2r CENG 351

Sequential Access In hashing, each record is a randomly placed and randomly accessed, requiring very long sequential record processing: TX=n*TF CENG 351

Linear hashing For access time considerations, experimentally obtained estimations E.g., blocking factor m=50 and load factor f=75%, the average access times for successful and unsuccessful cases can be stated as TFs=1.05*(s+r+btt) TFu=1.27*(s+r+btt) CENG 351

Summary Hash-based indexes: best for equality searches, cannot support range searches. Static Hashing can lead to long overflow chains. Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. Directory to keep track of buckets, doubles periodically. Can get large with skewed data; additional I/O if this does not fit in main memory. CENG 351 16