

I/O Efficient Algorithms

Problem
- Data is often too massive to fit in internal memory
- I/O communication between internal memory (fast) and external memory (slower) can be a major performance bottleneck

Goal
- Design algorithms and data structures for external memory that exploit locality and parallelism in order to reduce I/O costs

Fundamental I/O operations
- Scanning
- Sorting
- Searching
- Outputting

Bounds
- N = problem size (in units of data items)
- M = internal memory size (in units of data items)
- B = block transfer size (in units of data items)
- D = number of independent disk drives
- Z = number of items of an answer
- n = N/B, m = M/B, z = Z/B (the same quantities in units of blocks)

Operation  | I/O bound, D = 1                         | I/O bound, general D ≥ 1
Scan(N)    | Θ(N/B) = Θ(n)                            | Θ(N/(DB)) = Θ(n/D)
Sort(N)    | Θ((N/B) log_{M/B} (N/B)) = Θ(n log_m n)  | Θ((N/(DB)) log_{M/B} (N/B)) = Θ((n/D) log_m n)
Search(N)  | Θ(log_B N)                               | Θ(log_{DB} N)
Output(Z)  | Θ(max{1, Z/B}) = Θ(max{1, z})            | Θ(max{1, Z/(DB)}) = Θ(max{1, z/D})
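To get a feel for the magnitudes in the table, the bounds can be evaluated for concrete parameters. The sketch below is only illustrative: the function names are made up for this example, and every hidden constant in the Θ-notation is assumed to be 1, so the results are orders of magnitude rather than exact I/O counts.

```python
import math

def scan_ios(N, B, D=1):
    """Scan(N) = Theta(N / (D*B)); constant factor assumed to be 1."""
    return math.ceil(N / (D * B))

def sort_ios(N, B, M, D=1):
    """Sort(N) = Theta((n/D) * log_m n) with n = N/B and m = M/B."""
    n, m = N / B, M / B
    # max(1, ...) keeps the estimate at least one pass over the data
    return (n / D) * max(1.0, math.log(n, m))

def search_ios(N, B, D=1):
    """Search(N) = Theta(log_{D*B} N)."""
    return math.log(N, D * B)

def output_ios(Z, B, D=1):
    """Output(Z) = Theta(max{1, Z / (D*B)})."""
    return max(1.0, Z / (D * B))

if __name__ == "__main__":
    # Hypothetical parameters: 10^9 items, blocks of 10^3 items, memory of 10^6 items
    N, B, M = 10**9, 10**3, 10**6
    print("scan  :", scan_ios(N, B))
    print("sort  :", sort_ios(N, B, M))
    print("search:", search_ios(N, B))
    print("output:", output_ios(50, B))
```

With these assumed parameters a scan costs on the order of a million block transfers, while a search costs only a handful, which is why the online operations are treated separately from the batched ones.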

Types of problems
- Batched: Scan and Sort
- Online: Search and Output

External Hashing for Online Dictionary Search
- Insert: O(1)
- Delete: O(1)
- Lookup: O(Output(Z))

Statically allocated tables
- Most commonly/traditionally used
- Can handle only a fixed range of N
- Goal is to develop dynamic external memory structures that can easily handle different sizes of data

Extendible Hashing
R. Fagin, J. Nievergelt, N. Pippenger, and H. R. Strong
- assume that the size K of the range of the hash function is sufficiently large
- the directory consists of an array of 2^d pointers for a given d ≥ 0 (d is the global depth)
- each item is assigned to the table location corresponding to the d least significant bits of its hash address
- d is set to the smallest value for which each table location has at most B items assigned to it
- each table location contains a pointer to a block where its items are stored
- a lookup takes two I/Os: one to access the directory and one to access the block storing the item (only one I/O if the directory fits in internal memory)
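The two-step lookup can be sketched in a few lines. This is a minimal in-memory illustration, not the authors' implementation: `directory` is assumed to be a Python list of block ids, `blocks` a dict standing in for the disk, and `hash_fn` a hypothetical hash function supplied by the caller.

```python
def directory_slot(hash_address, d):
    """Table location given by the d least significant bits of the hash address."""
    return hash_address & ((1 << d) - 1)

def lookup(directory, blocks, key, d, hash_fn):
    """Return the value stored under key, or None if it is absent."""
    slot = directory_slot(hash_fn(key), d)
    block_id = directory[slot]      # first access: read the directory entry
    block = blocks[block_id]        # second access: read the block itself
    return block.get(key)

if __name__ == "__main__":
    # Assumed toy state: global depth d = 2, two blocks shared by four slots
    directory = [0, 1, 0, 1]
    blocks = {0: {4: "a"}, 1: {5: "b"}}
    print(lookup(directory, blocks, 5, 2, hash_fn=lambda x: x))
```

If the directory is held in internal memory, only the second access touches the disk, which matches the one-I/O case on the slide.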

Improving Storage Utilization
- a table location may hold fewer than B items, so several table locations can share the same disk block for storing their items
- a table location shares a disk block with all the other table locations having the same k least significant bits in their address
- k is chosen to be as small as possible so that the pooled items fit into a single disk block
- each disk block has its own local depth k
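Which table locations share a block follows directly from the bit pattern. The helper below is a small sketch with a made-up name; it assumes 0 ≤ k ≤ d and lists every directory slot that agrees with a given slot on its k least significant bits.

```python
def slots_sharing_block(slot, k, d):
    """All directory slots (out of 2^d) with the same k least significant bits as slot."""
    low = slot & ((1 << k) - 1)            # the k shared low-order bits
    # The remaining d - k high-order bits are free, giving 2^(d-k) slots
    return [low | (high << k) for high in range(1 << (d - k))]

if __name__ == "__main__":
    # Global depth 3, local depth 1: slot 101 shares its block with every slot ending in 1
    print([format(s, "03b") for s in slots_sharing_block(0b101, 1, 3)])
```

A block of local depth k is therefore referenced by 2^(d-k) directory slots, and a block with k = d is referenced by exactly one.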

Inserting New Items
- when a new item is inserted and its disk block overflows, the global depth d and the block's local depth k are recalculated so that the invariants on d and k once again hold
- this is done by splitting the block that overflows and redistributing its items
- each time the global depth d is incremented by 1, the directory doubles in size (this is how the hash table is able to adapt to the growing N)
- pointers in the new directory are set to the appropriate disk blocks
- the disk blocks themselves do not need to be changed during doubling, except for the one block where the overflow has occurred

Inserting New Items contd.
- let hash_d be the hash function corresponding to the d least significant bits of hash, that is, hash_d(x) = hash(x) mod 2^d
- initially a single disk block is created to store the data items, and all the slots in the directory are initialized to point to that block
- the local depth k of the block is set to 0
- when a new item with key value x is inserted, it is stored in the disk block pointed to by directory slot hash_d(x)
- if block b overflows as a result, then b is split into two blocks, the original block b and a new block b', and its items are redistributed based upon the (b.k + 1)st least significant bit of hash(x) (b.k = b's local depth)
- b.k is incremented by 1 and that value is also stored in b'.k
- if one of the blocks is still overflowing, it is split again and the local depths are incremented until overflow no longer occurs

Inserting New Items contd.
- after all splits are done, if b.k ≤ d, we just update those directory pointers originally pointing to b that need to be changed
- if b.k > d, then the directory is not large enough to accommodate hash addresses with b.k bits, so we repeatedly double the directory size and increment the global depth d by 1 until d = b.k
- once again:
  - pointers in the new directory are initialized to point to the appropriate disk blocks
  - the disk blocks do not need to be modified during doubling, except for the block that overflows
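The insertion procedure can be put together as a small in-memory sketch. It is an illustration under stated assumptions rather than the original authors' code: the class and method names are invented, blocks are Python objects instead of disk pages, `hash_fn` is supplied by the caller, and the directory is doubled as soon as a split needs it instead of after all splits are done (the end state is the same). It also assumes that no more than B keys share an identical hash address, since otherwise splitting would never terminate.

```python
class Block:
    def __init__(self, k):
        self.k = k        # local depth
        self.items = {}   # key -> value, at most B entries once insert() returns

class ExtendibleHash:
    def __init__(self, B, hash_fn=hash):
        self.B = B                    # block capacity in items
        self.hash = hash_fn
        self.d = 0                    # global depth
        self.directory = [Block(0)]   # 2^d slots, initially one slot and one block

    def _slot(self, key):
        # hash_d(x) = hash(x) mod 2^d, i.e. the d least significant bits
        return self.hash(key) & ((1 << self.d) - 1)

    def lookup(self, key):
        return self.directory[self._slot(key)].items.get(key)

    def _split(self, b):
        if b.k == self.d:
            # Directory cannot address b.k + 1 bits: double it.
            # The new half mirrors the old pointers, so no block is touched.
            self.directory = self.directory + self.directory
            self.d += 1
        bit = 1 << b.k                # the (b.k + 1)st least significant bit
        b.k += 1
        new = Block(b.k)              # b' receives the same local depth
        for key in list(b.items):
            if self.hash(key) & bit:
                new.items[key] = b.items.pop(key)
        # Redirect only those slots that pointed to b and have the new bit set
        for slot in range(len(self.directory)):
            if self.directory[slot] is b and slot & bit:
                self.directory[slot] = new
        return new

    def insert(self, key, value):
        b = self.directory[self._slot(key)]
        b.items[key] = value
        while len(b.items) > self.B:  # keep splitting whichever half overflows
            new = self._split(b)
            if len(new.items) > self.B:
                b = new

if __name__ == "__main__":
    table = ExtendibleHash(B=2, hash_fn=lambda x: x)
    for x in range(8):
        table.insert(x, str(x))
    print("global depth:", table.d, "directory size:", len(table.directory))
```

Doubling by concatenating the directory with itself works because slot i and slot i + 2^d agree on their d least significant bits, so both must point to the same block until that block is split.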

Deleting Items
- deletion is handled very similarly to insertion
- when two blocks with the same local depth k contain items whose hash addresses share the same k-1 least significant bits and can fit together into a single block, their items are merged into a single block with a decremented value of k
- the combined size of the blocks being merged must be sufficiently less than B to prevent immediate splitting after a subsequent insertion
- the directory shrinks by half and the global depth d is decremented by 1 when all the local depths are less than the current value of d
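The conditions for merging and for shrinking the directory can be written as three small predicates. This is a sketch only: the function names are invented, and the `slack` parameter is an assumption standing in for "sufficiently less than B", which the slide does not quantify.

```python
def buddy_slot(slot, k):
    """Slot whose block shares the k-1 least significant bits and differs in the k-th (k >= 1)."""
    return slot ^ (1 << (k - 1))

def should_merge(size_a, size_b, k_a, k_b, B, slack=1):
    """Merge two buddy blocks only if they have equal local depth and leave some room."""
    return k_a == k_b and k_a > 0 and size_a + size_b <= B - slack

def can_halve_directory(local_depths, d):
    """The directory can shrink when every local depth is below the global depth."""
    return d > 0 and all(k < d for k in local_depths)

if __name__ == "__main__":
    print(format(buddy_slot(0b101, 3), "03b"))   # buddy of slot 101 at local depth 3
    print(should_merge(1, 1, 2, 2, B=4))
    print(can_halve_directory([1, 1], d=2))
```

Leaving slack after a merge avoids a merge-then-split cycle when deletions and insertions alternate on the same block.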

Some Numbers
- the expected number of disk blocks required to store the data items is n / ln 2 (n = N/B), so the blocks tend to be about 69% full
- at least Ω(n/B) blocks are needed to store the directory
- P. Flajolet showed that on average the directory uses Θ(N^(1/B) n/B) = Θ(N^(1+1/B) / B^2) blocks, which can be superlinear in N asymptotically
- for practical values of N and B, the N^(1/B) term is a small constant, typically less than 2, and the directory size is within a constant factor of the optimum
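These figures are easy to check numerically. The sketch below uses invented function names and hypothetical values of N and B; it evaluates the expected block count, the resulting fill ratio, and the N^(1/B) factor in the directory size.

```python
import math

def expected_data_blocks(N, B):
    """Expected number of blocks holding the items: n / ln 2 with n = N/B."""
    return (N / B) / math.log(2)

def expected_fill_ratio():
    """Average fraction of each block in use: n divided by n / ln 2, i.e. ln 2."""
    return math.log(2)

def directory_growth_factor(N, B):
    """The N^(1/B) term by which the directory exceeds the n/B lower bound."""
    return N ** (1.0 / B)

if __name__ == "__main__":
    print("fill ratio:", round(expected_fill_ratio(), 3))
    for N, B in [(10**9, 1000), (10**9, 32), (10**6, 100)]:
        print(N, B, round(directory_growth_factor(N, B), 3))
```

For the assumed parameters the factor stays below 2 even for a block size as small as 32 items, which is consistent with the slide's remark about practical values.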

So...
- the resulting directory is equivalent to the leaves of a perfectly balanced tree, in which the search path for each item is determined by its hash address, except that hashing allows the leaves of the tree to be accessed directly in a single I/O
- therefore any item can be retrieved in a total of two I/Os
- if the directory fits in internal memory, only one I/O is needed

The End Jeff Vitter's survey paper: