CSE 326: Data Structures Lecture #14


CSE 326: Data Structures Lecture #14 More, Bigger Hash Please Bart Niswonger Summer Quarter 2001

Today’s Outline Project: rules of competition; comments? Hashing: probing continued; rehashing; extendible hashing; case study. A whole lotta hashing going on! We’ll get to the hashing, but first the project stuff: new assignment, communication. Rules for maze runners? Comments?

Cost of a Database Query This is a trace of how much time a simple database operation takes: this one lists all employees along with their job information, getting the employee and job information from separate databases. The thing to notice is that disk access takes something like 100 times as much time as processing. I told you disk access was expensive! BTW: the “index” in the picture is a B-Tree. - SQLServer I/O to CPU ratio is 300!

Extendible Hashing Hashing technique for huge data sets; optimizes to reduce disk accesses: each hash bucket fits on one disk block; better than B-Trees if order is not important. Table contains: buckets, each fitting in one disk block, holding the data; a directory that fits in one disk block, used to hash to the correct bucket. OK, extendible hashing takes hashing and makes it work for huge data sets. The idea is very similar to B-Trees.

Extendible Hash Table Directory: entries labeled by k bits, each with a pointer to the bucket holding all keys starting with those bits. Each bucket contains keys & data matching on the first j ≤ k bits. directory for k = 3: 000 001 010 011 100 101 110 111 Notice that those are the hashed keys I’m representing in the table. Actually, the buckets would store the original key/value pair. (2) 00001 00011 00100 00110 (2) 01001 01011 01100 (3) 10001 10011 (3) 10101 10110 10111 (2) 11001 11011 11100 11110
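The directory lookup on this slide can be sketched in a few lines. This is a find-only toy reproducing the slide's k = 3 table of 5-bit hashed keys; the bucket labels, dict layout, and function names are my own, not the lecture's code.

```python
# Find-only sketch of the slide's extendible hash table (k = 3).
# Each bucket is labeled by the j-bit prefix its keys share (j <= k).
K = 3
buckets = {
    "00":  [0b00001, 0b00011, 0b00100, 0b00110],  # local depth 2
    "01":  [0b01001, 0b01011, 0b01100],           # local depth 2
    "100": [0b10001, 0b10011],                    # local depth 3
    "101": [0b10101, 0b10110, 0b10111],           # local depth 3
    "11":  [0b11001, 0b11011, 0b11100, 0b11110],  # local depth 2
}

# Directory: each of the 2^k entries points at the bucket whose
# prefix matches the entry's first j bits.
directory = {}
for idx in range(2 ** K):
    label = format(idx, "03b")
    for prefix in buckets:
        if label.startswith(prefix):
            directory[label] = prefix

def find(hashed_key):
    """One disk access for the directory, one for the bucket."""
    top_bits = format(hashed_key, "05b")[:K]
    return hashed_key in buckets[directory[top_bits]]

print(find(0b10110))  # True
print(find(0b00010))  # False
```

Note how entries 000 and 001 share one bucket: its local depth is only 2, so it only agrees on the first two bits.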

Inserting (easy case) insert(11011) 000 001 010 011 100 101 110 111 (2) 00001 00011 00100 00110 (2) 01001 01011 01100 (3) 10001 10011 (3) 10101 10110 10111 (2) 11001 11100 11110 Inserting if there’s room in the bucket is easy. Just like in B-Trees. This takes 2 disk accesses. One to grab the directory, and one for the bucket. But what if there isn’t room? 000 001 010 011 100 101 110 111 (2) 00001 00011 00100 00110 (2) 01001 01011 01100 (3) 10001 10011 (3) 10101 10110 10111 (2) 11001 11011 11100 11110

Splitting insert(11000) 000 001 010 011 100 101 110 111 (2) 00001 00011 00100 00110 (2) 01001 01011 01100 (3) 10001 10011 (3) 10101 10110 10111 (2) 11001 11011 11100 11110 Then, as in B-Trees, we need to split a node. In this case, that turns out to be easy because we have room in the directory for another node. How many disk accesses does this take? 3: one for the directory, one for the old bucket, and one for the new bucket. 000 001 010 011 100 101 110 111 (2) 00001 00011 00100 00110 (2) 01001 01011 01100 (3) 10001 10011 (3) 10101 10110 10111 (3) 11000 11001 11011 (3) 11100 11110

Rehashing insert(10010) No room to insert and no adoption! 11 (2) 01101 (2) 10000 10001 10011 10111 (2) 11001 11110 Now, even though there’s tons of space in the table, we have to expand the directory. Once we’ve done that, we’re back to the case on the previous page. How many disk accesses does this take? 3! Just as for a normal split. The only problem is if the directory ends up too large to fit in a block. So, what does this say about the hash function we want for extendible hashing? We want something parameterized so we can rehash if necessary to get the buckets evened out. Expand directory 000 001 010 011 100 101 110 111 Now, it’s just a normal split.
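Putting the last three slides together, the insert path (easy case, bucket split, and directory doubling) might look like this. Bucket capacity 4, 5-bit hashed keys, and all class and method names are assumptions for the sketch, not the lecture's code.

```python
# Sketch of extendible-hashing insert with split and directory doubling.
BITS, CAPACITY = 5, 4

class Bucket:
    def __init__(self, depth):
        self.depth = depth          # local depth j: bits this bucket agrees on
        self.keys = []

class ExtendibleHash:
    def __init__(self):
        self.k = 1                  # global depth: directory has 2^k entries
        self.dir = [Bucket(1), Bucket(1)]

    def _index(self, key):
        return key >> (BITS - self.k)       # top k bits of the hashed key

    def find(self, key):
        # 2 disk accesses: one for the directory, one for the bucket
        return key in self.dir[self._index(key)].keys

    def insert(self, key):
        b = self.dir[self._index(key)]
        if len(b.keys) < CAPACITY:          # easy case: room in the bucket
            b.keys.append(key)
            return
        if b.depth == self.k:               # no spare entries: double directory
            self.dir = [p for p in self.dir for _ in (0, 1)]
            self.k += 1
        b.depth += 1                        # split b on its next prefix bit
        new = Bucket(b.depth)
        for i, p in enumerate(self.dir):
            if p is b and (i >> (self.k - b.depth)) & 1:
                self.dir[i] = new           # upper half of b's entries move
        pending, b.keys = b.keys + [key], []
        for p in pending:                   # redistribute; may split again
            self.insert(p)
```

Inserting the slide's keys and then 11011 and 11000 reproduces the figures: 11011 fits in an existing bucket (2 disk accesses), while 11000 forces the 11-bucket to split (3 accesses: directory, old bucket, new bucket).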

When Directory Gets Too Large Store only pointers to the items: (potentially) much smaller, so fewer items in the directory, but one extra disk access! Rehash: potentially better distribution over the buckets; fewer unnecessary items in the directory; can’t solve the problem if there’s simply too much data. What if these don’t work? Use a B-Tree to store the directory! If the directory gets too large, we’re in trouble. In that case, we have two ways to fix it. One puts more keys in each bucket. The other evens out the buckets. Neither is guaranteed to work.

Rehash of Hashing Hashing is a great data structure for storing unordered data that supports insert, delete & find. Both separate chaining (open) and open addressing (closed) hashing are useful: separate chaining is flexible; closed hashing uses less storage, but performs badly with load factors near 1; extendible hashing is for very large disk-based data. Hashing pros and cons: + very fast + simple to implement, supports insert, delete, find − lazy deletion necessary in open addressing, can waste storage − does not support operations dependent on order: min, max, range
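The lazy-deletion point can be illustrated with a tiny linear-probing (closed hashing) table; the table size and names here are illustrative only, not the course's code.

```python
# Lazy deletion in a closed (open-addressing) hash table: deleted slots
# are marked with a tombstone so later probes can walk past them.
DELETED = object()      # tombstone sentinel
SIZE = 8
table = [None] * SIZE

def insert(key):
    i = key % SIZE
    while table[i] is not None and table[i] is not DELETED:
        i = (i + 1) % SIZE          # linear probe; tombstones are reusable
    table[i] = key

def find(key):
    i = key % SIZE
    while table[i] is not None:     # a tombstone keeps the probe going
        if table[i] == key:
            return True
        i = (i + 1) % SIZE
    return False

def delete(key):
    i = key % SIZE
    while table[i] is not None:
        if table[i] == key:
            table[i] = DELETED      # just clearing the slot would cut
            return                  # the probe chain and break find
        i = (i + 1) % SIZE
```

If delete simply emptied the slot, any key that had probed past it would become unfindable; the tombstone is what "can waste storage" until a rehash reclaims it.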

Case Study Spelling dictionary: 30,000 words; static; arbitrary(ish) preprocessing time. Goals: fast spell checking; minimal storage. Practical notes: almost all searches are successful; words average about 8 characters in length; 30,000 words at 8 bytes/word is 1/4 MB; pointers are 4 bytes; there are many regularities in the structure of English words. Why? Alright, let’s move on to the case study. Here’s the situation. Why are most searches successful? Because most words will be spelled correctly.

Solutions sorted array + binary search open hashing closed hashing + linear probing Remember that we get arbitrary preprocessing time. What can we do with that? What kind of hash function should we use? We should use a universal hash function (or at least a parameterized hash function) because that will let us rehash in preprocessing until we’re happy with the distribution.
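One way to get such a parameterized hash function is a polynomial string hash whose multiplier is drawn at random; re-drawing the multiplier gives a fresh function, so preprocessing can rehash until the distribution looks good. The constants and names below are my own assumptions, not the course's.

```python
# A parameterized string hash: the random multiplier `a` is the parameter.
import random

PRIME = 1_000_003   # large prime modulus (assumed constant)

def make_hash(table_size, seed=None):
    """Return a fresh hash function; a new seed gives a new function."""
    rng = random.Random(seed)
    a = rng.randrange(2, PRIME)             # random multiplier
    def h(word):
        v = 0
        for ch in word:
            v = (v * a + ord(ch)) % PRIME   # polynomial evaluation
        return v % table_size
    return h

h1 = make_hash(64, seed=1)
h2 = make_hash(64, seed=2)   # a different function over the same table
```

During preprocessing, the spell checker could build the table with h1, measure clustering, and simply switch to h2 (or h3, ...) if unhappy.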

Storage Assume words are strings & entries are pointers to strings. Array + binary search; open hashing; closed hashing (open addressing). Notice that all of these use the same amount of memory for the strings. How many pointers does each one use? Array: n, just a pointer to each string. Open hashing: n (a pointer to each string) + n/λ (a null pointer ending each list); note, I’m playing a little game here; without an optimization, this would actually be 2n + n/λ. Closed hashing: n/λ.

Analysis Binary search: storage: n pointers + words = 360KB; time: log₂n ≈ 15 probes/access, worst case. Open hashing: storage: 2n + n/λ pointers + words (λ = 1 ⇒ 600KB); time: 1 + λ/2 probes/access, average (λ = 1 ⇒ 1.5). Closed hashing: storage: n/λ pointers + words (λ = 0.5 ⇒ 480KB); time: ½(1 + 1/(1−λ)) probes/access, average (λ = 0.5 ⇒ 1.5). Which one should we use? The answer, of course, is it depends! However, given that we want fast checking, either hash table is better than binary search. And, given that we want low memory usage, we should use the closed hash table. Can anyone argue for the binary search?
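The storage figures above can be checked with a few lines of arithmetic (n = 30,000 words, 8-byte words, 4-byte pointers, and KB = 1000 bytes, as the slide seems to assume):

```python
# Reproducing the slide's storage arithmetic.
n, word_bytes, ptr_bytes = 30_000, 8, 4
words = n * word_bytes                            # 240,000 bytes of strings

binary_search = n * ptr_bytes + words             # one pointer per word
open_hash = (2 * n + n // 1) * ptr_bytes + words  # lambda = 1, unoptimized lists
closed_hash = int(n / 0.5) * ptr_bytes + words    # lambda = 0.5 table slots

print(binary_search // 1000, open_hash // 1000, closed_hash // 1000)
# -> 360 600 480  (KB, matching the slide)
```

Note the 600KB open-hashing figure only works out if each list node carries its own next-pointer (the unoptimized 2n + n/λ count from the storage slide).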

To Do Start Project III (only 1 week!) Start reading chapter 8 in the book. Even if you don’t yet have a team, you should absolutely start thinking about project III. The code should not be hard to write, I think, but the algorithm may take you a bit of thought.

Coming Up Disjoint-set union-find ADT; Quiz (Thursday). Next time, we’ll look at the disjoint-set union-find ADT.