CSE 326: Data Structures Lecture #16 Hashing HUGE Data Sets (and two presents from the Database Fiancée) Steve Wolfman Winter Quarter 2000.

Slides:



Advertisements
Similar presentations
External Memory Hashing. Model of Computation Data stored on disk(s) Minimum transfer unit: a page = b bytes or B records (or block) N records -> N/B.
Advertisements

CS4432: Database Systems II Hash Indexing 1. Hash-Based Indexes Adaptation of main memory hash tables Support equality searches No range searches 2.
Quick Review of Apr 10 material B+-Tree File Organization –similar to B+-tree index –leaf nodes store records, not pointers to records stored in an original.
DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals.
File Processing : Hash 2015, Spring Pusan National University Ki-Joune Li.
Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree.
Index tuning Hash Index. overview Introduction Hash-based indexes are best for equality selections. –Can efficiently support index nested joins –Cannot.
1 Lecture 8: Data structures for databases II Jose M. Peña
Hash Table indexing and Secondary Storage Hashing.
B+-tree and Hashing.
Spring 2003 ECE569 Lecture ECE 569 Database System Engineering Spring 2003 Yanyong Zhang
Database Implementation Issues CPSC 315 – Programming Studio Spring 2008 Project 1, Lecture 5 Slides adapted from those used by Jennifer Welch.
CSE 326: Data Structures Lecture #11 B-Trees Alon Halevy Spring Quarter 2001.
1 Hash-Based Indexes Chapter Introduction  Hash-based indexes are best for equality selections. Cannot support range searches.  Static and dynamic.
FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12.
METU Department of Computer Eng Ceng 302 Introduction to DBMS Disk Storage, Basic File Structures, and Hashing by Pinar Senkul resources: mostly froom.
Chapter 13.4 Hash Tables Steve Ikeoka ID: 113 CS 257 – Spring 2008.
1 Hash-Based Indexes Chapter Introduction : Hash-based Indexes  Best for equality selections.  Cannot support range searches.  Static and dynamic.
Design and Analysis of Algorithms - Chapter 71 Hashing b A very efficient method for implementing a dictionary, i.e., a set with the operations: – insert.
Query Optimization 3 Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15.
HASH TABLES Malathi Mansanpally CS_257 ID-220. Agenda: Extensible Hash Tables Insertion Into Extensible Hash Tables Linear Hash Tables Insertion Into.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe Chapter 13 Disk Storage, Basic File Structures, and Hashing.
1 Lecture 19: B-trees and Hash Tables Wednesday, November 12, 2003.
CS 4432lecture #10 - indexing & hashing1 CS4432: Database Systems II Lecture #10 Professor Elke A. Rundensteiner.
CSE 326: Data Structures Lecture #13 Extendible Hashing and Splay Trees Alon Halevy Spring Quarter 2001.
Spring 2004 ECE569 Lecture ECE 569 Database System Engineering Spring 2004 Yanyong Zhang
(B+-Trees, that is) Steve Wolfman 2014W1
MA/CSSE 473 Day 28 Hashing review B-tree overview Dynamic Programming.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 File Organizations and Indexing Chapter 5, 6 of Elmasri “ How index-learning turns no student.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 17 Disk Storage, Basic File Structures, and Hashing.
Basic File Structures and Hashing Lectured by, Jesmin Akhter, Assistant professor, IIT, JU.
1 CMSC 341 Extensible Hashing Chapter 5, Section 6 (pp. 200 – 203)
1 CSE 326: Data Structures Sorting It All Out Henry Kautz Winter Quarter 2002.
1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003.
CSE 326: Data Structures Lecture #13 Bart Niswonger Summer Quarter 2001.
Lecture 5 Cost Estimation and Data Access Methods.
Indexing and hashing Azita Keshmiri CS 157B. Basic concept An index for a file in a database system works the same way as the index in text book. For.
Hashing as a Dictionary Implementation Chapter 19.
1 CPS216: Data-intensive Computing Systems Operators for Data Access (contd.) Shivnath Babu.
Storage Structures. Memory Hierarchies Primary Storage –Registers –Cache memory –RAM Secondary Storage –Magnetic disks –Magnetic tape –CDROM (read-only.
File Structures. 2 Chapter - Objectives Disk Storage Devices Files of Records Operations on Files Unordered Files Ordered Files Hashed Files Dynamic and.
Chapter 11 Hash Tables © John Urrutia 2014, All Rights Reserved1.
Hashing is a method to store data in an array so that sorting, searching, inserting and deleting data is fast. For this every record needs unique key.
Database Management 7. course. Reminder Disk and RAM RAID Levels Disk space management Buffering Heap files Page formats Record formats.
1 CSE 326: Data Structures Part 5 Hashing Henry Kautz Autumn 2002.
1 CPS216: Advanced Database Systems Notes 05: Operators for Data Access (contd.) Shivnath Babu.
Hashing 1 Hashing. Hashing 2 Hashing … * Again, a (dynamic) set of elements in which we do ‘search’, ‘insert’, and ‘delete’ n Linear ones: lists, stacks,
1 Lecture 21: Hash Tables Wednesday, November 17, 2004.
CSE 326: Data Structures Lecture #14 Whoa… Good Hash, Man Steve Wolfman Winter Quarter 2000.
Database System Concepts, 6 th Ed. ©Silberschatz, Korth and Sudarshan See for conditions on re-usewww.db-book.com Module D: Hashing.
Hash Tables ADT Data Dictionary, with two operations – Insert an item, – Search for (and retrieve) an item How should we implement a data dictionary? –
Exam 3 Review Data structures covered: –Hashing and Extensible hashing –Priority queues and binary heaps –Skip lists –B-Tree –Disjoint sets For each of.
Chapter 5 Record Storage and Primary File Organizations
Em Spatiotemporal Database Laboratory Pusan National University File Processing : Hash 2004, Spring Pusan National University Ki-Joune Li.
CSE 326: Data Structures Lecture #9 Big, Bad B-Trees Steve Wolfman Winter Quarter 2000.
DS.H.1 Hashing Chapter 5 Overview The General Idea Hash Functions Separate Chaining Open Addressing Rehashing Extendible Hashing Application Example: Geometric.
Database Management 7. course. Reminder Disk and RAM RAID Levels Disk space management Buffering Heap files Page formats Record formats.
Hashing (part 2) CSE 2011 Winter March 2018.
Lecture 21: Hash Tables Monday, February 28, 2005.
Hashing CENG 351.
Advanced Associative Structures
CSE 326: Data Structures Hashing
Indexing and Hashing B.Ramamurthy Chapter 11 2/5/2019 B.Ramamurthy.
Wednesday, 5/8/2002 Hash table indexes, physical operators
CSE 326: Data Structures Lecture #12 Hashing II
CSE 326: Data Structures Lecture #10 B-Trees
CSE 326: Data Structures Lecture #14
Index Structures Chapter 13 of GUW September 16, 2019
Presentation transcript:

CSE 326: Data Structures Lecture #16 Hashing HUGE Data Sets (and two presents from the Database Fiancée) Steve Wolfman Winter Quarter 2000

Cost of a Database Query I/O to CPU ratio is 300!

Today’s Outline Finish rehashing rehashing Case study Extendible hashing CIDR student interview

Case Study Spelling dictionary –30,000 words –static –arbitrary(ish) preprocessing time Goals –fast spell checking –minimal storage Practical notes –almost all searches are successful –words average about 8 characters in length –30,000 words at 8 bytes/word is 1/4 MB –pointers are 4 bytes –there are many regularities in the structure of English words Why?

Solutions –sorted array + binary search –open hashing –closed hashing + linear probing What kind of hash function should we use?

Storage Assume words are strings and entries are pointers to strings Array + binary search Open hashing … Closed hashing (open addressing) How many pointers does each use?

Analysis Binary search –storage:n pointers + words = 360KB –time: log 2 n  15 probes per access, worst case Open hashing –storage:n + n/ pointers + words ( = 1  600KB) –time:1 + /2 probes per access on average ( = 1  1.5) Closed hashing –storage:n/ pointers + words ( = 0.5  480KB) –time: probes per access on average ( = 0.5  1.5) Which one should we use?

Extendible Hashing Hashing technique for huge data sets –optimizes to reduce disk accesses –each hash bucket fits on one disk block –better than B-Trees if order is not important Table contains –buckets, each fitting in one disk block, with the data –a directory that fits in one disk block used to hash to the correct bucket

Extendible Hash Table Directory contains entries labelled by k bits plus a pointer to the bucket with all keys starting with its bits Each block contains keys+data matching on the first j  k bits (2) (2) (3) (3) (2) directory for k = 3

Inserting (easy case) (2) (2) (3) (3) (2) insert(11011) (2) (2) (3) (3) (2)

Splitting (2) (2) (3) (3) (2) insert(11000) (2) (2) (3) (3) (3) (3)

Rehashing (2) (2) (2) insert(10010) No room to insert and no adoption! Expand directory Now, it’s just a normal split.

When Extendible Hashing Doesn’t Cut It Store only pointers to the items +(potentially) much smaller M +fewer items in the directory –one extra disk access! Rehash +potentially better distribution over the buckets +fewer unnecessary items in the directory –can’t solve the problem if there’s simply too much data What if these don’t work? –use a B-Tree to store the directory!

To Do Continue Project III Read chapter 7 in the book Start HW#4

Coming Up CIDR student interview (in about 10 seconds) Disjoint-set union-find ADT Fourth HW due (February 16 th !) Project III due (February 17 th by 5PM!)