External Memory Hashing

Slides:



Advertisements
Similar presentations
CS4432: Database Systems II Hash Indexing 1. Hash-Based Indexes Adaptation of main memory hash tables Support equality searches No range searches 2.
Advertisements

Hashing Dashiell Fryer CS 157B Dr. Lee. Contents Static Hashing Static Hashing File OrganizationFile Organization Properties of the Hash FunctionProperties.
Hash-based Indexes CS 186, Spring 2006 Lecture 7 R &G Chapter 11 HASH, x. There is no definition for this word -- nobody knows what hash is. Ambrose Bierce,
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Hash-Based Indexes Chapter 11.
Quick Review of Apr 10 material B+-Tree File Organization –similar to B+-tree index –leaf nodes store records, not pointers to records stored in an original.
Department of Computer Science and Engineering, HKUST Slide 1 Dynamic Hashing Good for database that grows and shrinks in size Allows the hash function.
Hash Tables Hash function h: search key  [0…B-1]. Buckets are blocks, numbered [0…B-1]. Big idea: If a record with search key K exists, then it must be.
DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals.
Chapter 11 Indexing and Hashing (2) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
CS 245Notes 51 CS 245: Database System Principles Hector Garcia-Molina Notes 5: Hashing and More.
Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree.
Index tuning Hash Index. overview Introduction Hash-based indexes are best for equality selections. –Can efficiently support index nested joins –Cannot.
COMP 451/651 B-Trees Size and Lookup Chapter 1.
Hash Tables Hash function h: search key  [0…B-1]. Buckets are blocks, numbered [0…B-1]. Big idea: If a record with search key K exists, then it must be.
Hash Table indexing and Secondary Storage Hashing.
1 Hash-Based Indexes Yanlei Diao UMass Amherst Feb 22, 2006 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
External Memory Hashing. Hash Tables Hash function h: search key  [0…B-1]. Buckets are blocks, numbered [0…B-1]. Big idea: If a record with search key.
1 Hash-Based Indexes Chapter Introduction  Hash-based indexes are best for equality selections. Cannot support range searches.  Static and dynamic.
CPSC-608 Database Systems Fall 2010 Instructor: Jianer Chen Office: HRBB 315C Phone: Notes #8.
FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12.
CPSC-608 Database Systems Fall 2011 Instructor: Jianer Chen Office: HRBB 315C Phone: Notes #11.
Chapter 13.4 Hash Tables Steve Ikeoka ID: 113 CS 257 – Spring 2008.
1 Hash-Based Indexes Chapter Introduction : Hash-based Indexes  Best for equality selections.  Cannot support range searches.  Static and dynamic.
CPSC-608 Database Systems Fall 2008 Instructor: Jianer Chen Office: HRBB 309B Phone: Notes #8.
1 Lecture 19: B-trees and Hash Tables Wednesday, November 12, 2003.
CS 4432lecture #10 - indexing & hashing1 CS4432: Database Systems II Lecture #10 Professor Elke A. Rundensteiner.
CS CS4432: Database Systems II. CS Index definition in SQL Create index name on rel (attr) (Check online for index definitions in SQL) Drop.
1 Chapter 12: Indexing and Hashing Indexing Indexing Basic Concepts Basic Concepts Ordered Indices Ordered Indices B+-Tree Index Files B+-Tree Index Files.
Hashing and Hash-Based Index. Selection Queries Yes! Hashing  static hashing  dynamic hashing B+-tree is perfect, but.... to answer a selection query.
CS 245Notes 51 CS 245: Database System Principles Hector Garcia-Molina Notes 5: Hashing and More.
1.1 CS220 Database Systems Indexing: Hashing Slides courtesy G. Kollios Boston University via UC Berkeley.
1 Lecture 21: Hash Tables Wednesday, November 17, 2004.
CMPT 454, Simon Fraser University, Fall 2009, Martin Ester 111 Database Systems II Index Structures.
1 Ullman et al. : Database System Principles Notes 5: Hashing and More.
COMP3017 Advanced Databases
CS 245: Database System Principles
Relational Database Systems 2
Dynamic Hashing (Chapter 12)
Lecture 21: Hash Tables Monday, February 28, 2005.
Are they better or worse than a B+Tree?
Hash-Based Indexes Chapter 11
Hashing CENG 351.
CPSC-608 Database Systems
Database Management Systems (CS 564)
External Memory Hashing
Introduction to Database Systems
External Memory Hashing
CS222: Principles of Data Management Notes #8 Static Hashing, Extendible Hashing, Linear Hashing Instructor: Chen Li.
Hash-Based Indexes Chapter 10
Extendible Hashing Primarily used for storage of files on disk
CS222P: Principles of Data Management Notes #8 Static Hashing, Extendible Hashing, Linear Hashing Instructor: Chen Li.
CS 245: Database System Principles
Hashing.
Hash-Based Indexes Chapter 11
Index tuning Hash Index.
Database Systems (資料庫系統)
LINEAR HASHING E0 261 Jayant Haritsa Computer Science and Automation
Database Design and Programming
2018, Spring Pusan National University Ki-Joune Li
Module 12a: Dynamic Hashing
Chapter 11: Indexing and Hashing
CPSC-608 Database Systems
Index tuning Hash Index.
Hash-Based Indexes Chapter 11
Chapter 11 Instructor: Xin Zhang
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #07 Static Hashing, Extendible Hashing, Linear Hashing Instructor: Chen Li.
CS4432: Database Systems II
Index Structures Chapter 13 of GUW September 16, 2019
Presentation transcript:

External Memory Hashing

Hash Tables Hash function h: search key  [0…B-1]. Buckets are blocks, numbered [0…B-1]. Big idea: If a record with search key K exists, then it must be in bucket h(K). One disk I/O if there is only one block per bucket. h(d) = 0, h(e) = h(c) = 1, … Hash­Table Lookup: For record(s) with search key K, compute h(K); search that bucket.

Hash­Table Insertion Put in bucket h(K) if it fits; otherwise create an overflow block. Overflow block(s) are part of bucket. Example: Insert record with search key g, where h(g) = 1 Delete c, where h(c)=1 ?

What if the File Grows too Large? Efficiency is highest if actual #records < #buckets  max #(records/bucket) If file grows, we need a dynamic hashing method to maintain the above relationship. Extensible Hashing: double the number of buckets when needed. Linear hashing: add one more bucket as appropriate.

Dynamic Hashing Framework Hash function h produces a sequence of k bits. Only some of the bits are used at any time to determine placement of keys in buckets. Extensible Hashing (Buckets may share blocks!) Keep parameter i = number of bits from the beginning of h(K) that determine the bucket. Bucket array now = pointers to buckets. A block can serve several buckets. For each block, a parameter ji tells how many bits of h(K) determine membership in the block. i.e., a block represents 2i-j buckets that share the first j bits of their number.

Example An extensible hash table when i=1:

Extensible Hash­table Insert If record with key K fits in the block B pointed to by h(K), put it there. If not, let this block B represent j bits. j=i: Set i:=i+1; Double the bucket array, so it has now 2i+1 entries; Let w be an old array entry. Both the new entries, w0 and w1, point to the same block that w used to point to. Split B into two and distribute the records (of B) according to (j+1)st bit; set j:=j+1; fix pointers in bucket array, so that entries that formerly pointed to B now point either to B or the new block How? depending on…(j+1)st bit j<i: Do as in 1.d

Now, after the insertion Example Insert record with h(K) = 1010. Before Now, after the insertion

Example: Next Next: records with h(K)=0000; h(K)=0111. Bucket for 0... gets split, but i stays at 2. Then: record with h(K) = 1000. Overflows bucket for 10... Raise i to 3. After the insertions Currently

More on Extensible Hashing Two ideas Use i bits of the k bits of h(Key) 00110101 k use i  grows over time…. h(Key)  Use directory: Bucket Address Table (BAT) . h(Key)[i ] to bucket

Example – Extensible Hashing – Insert New directory 2 00 01 10 11 i = i = 1 0001 1001 1100 h(K) is 4 bits; 2 keys/bucket 1 1100 1010 Insert 1010

Example – Extensible Hashing – Insert 2 0111 0000 0001 i = 1 0001 2 1001 1010 1100 00 01 10 11 h(K) is 4 bits; 2 keys/bucket 0111 Insert: 0111 0000

Example – Extensible Hashing – Insert 000 001 010 011 100 101 110 111 3 i = 00 01 10 11 2 i = 1001 1010 1100 0111 0000 0001 h(K) is 4 bits; 2 keys/bucket 1001 1011 Insert: 1011

Extensible Hash Tables: Advantages: Lookup; never search more than one data block. Hope that the bucket array fits in main memory Defects: Doubling the bucket array could make the array to not fit in main memory. Problem with skewed key distributions. E.g. Let 1 block=2 records. Suppose that three records have hash values, which happen to be the same in the first 20 bits. In that case we would have i=21 and 2^21 ≈ two million bucket-array entries, even though we have only 3 records!!

This is also part of the structure Linear Hashing Number of buckets n is always chosen so that no. of records r < p × (no. of buckets × bucket capacity) p is typically between 0.5 and 0.9 Overflow blocks are permitted If r > p × (no. of buckets × bucket capacity) one new bucket is allocated, and one old bucket needs to be rehashed Number of bits i needed to index n buckets = log2(n) Buckets numbered [0…n-1], where 2i-1 < n  2i. i=1 n=2 r=3 #of buckets #of records This is also part of the structure

This is also part of the structure Linear Hashing Use i bits from right (low ­order) end of h(K). Buckets numbered [0…n-1], where 2i-1 < n  2i. Let last i bits of h(K) be ai-1ai-2…a0 = integer m. If m < n, then record belongs to bucket m. If n  m < 2i, then record belongs to bucket m-2i-1, that is the bucket we would get if we changed ai-1 (which must be 1) to 0. i=1 n=2 r=3 #of buckets #of records This is also part of the structure

Lookup in Linear Hash Table For record(s) with search key K, compute h(K); search the corresponding bucket according to the procedure described for insertion. If the record we wish to look up isn’t there, it can’t be anywhere else. E.g. lookup for a key which hashes to 1010, and then for a key which hashes to 1011. i=2 n=3 r=4

Linear Hash­Table Insert Pick an upper limit on capacity, e.g., 85% (1.7 records/bucket in our example). If an insertion exceeds capacity limit, set n := n + 1. If new n is 2i + 1, set i := i + 1. No change in bucket numbers needed --- just imagine a leading 0. Need to split bucket n - 2i-1 because there is now a bucket numbered (old) n.

Example Insert records with h(K) = 0000, 1010, 1111, 0101, 0001, 1100. r=1 n=1 i=1 0000 Before After

Example Capacity limit exceeded; increment n Insert records with h(K) = 0000, 1010, 1111, 0101, 0001, 1100. r=1 n=1 i=1 0000 r=2 n=2 i=1 1010 0000 1 Before After Capacity limit exceeded; increment n

Example Insert records with h(K) = 0000, 1010, 1111, 0101, 0001, 1100. r=2 n=2 i=1 1010 0000 r=3 n=2 i=1 1 1111 1 Before After

Example Capacity limit exceeded; increment n, Insert records with h(K) = 0000, 1010, 1111, 0101, 0001, 1100. r=3 n=2 i=1 1010 0000 r=4 n=3 i=2 0000 00 1111 1 1111 01 0101 1010 10 Before After Capacity limit exceeded; increment n, which causes incrementing i as well.

Example As long as capacity is not exceeded can add overflow blocks. Insert records with h(K) = 0000, 1010, 1111, 0101, 0001, 1100. r=4 n=3 i=2 0000 00 r=5 n=3 i=2 0000 00 1111 01 0101 1111 01 0101 0001 1010 10 1010 10 Before After As long as capacity is not exceeded can add overflow blocks.

Example Capacity limit exceeded; increment n. Insert records with h(K) = 0000, 1010, 1111, 0101, 0001, 1100. r=5 n=3 i=2 0000 00 r=6 n=4 i=2 0000 00 1100 1111 01 0101 0001 0001 01 0101 1010 10 1010 10 1111 11 Before After Capacity limit exceeded; increment n.

Exercise Suppose we want to insert keys with hash values: 0000…1111 in a linear hash table with 100% capacity threshold. Assume that a block can hold three records.