Database Management 7. course. Reminder Disk and RAM RAID Levels Disk space management Buffering Heap files Page formats Record formats.

Presentation transcript:

Database Management 7. course

Reminder: disk and RAM, RAID levels, disk space management, buffering, heap files, page formats, record formats.

Today: system catalogue; hash-based indexing (static, extendible, linear); time cost of operations.

System catalogue
– A special table
– Indexes: type of the data structure and the search key
– Tables: name, filename, file structure (e.g. heap), attribute names and types, integrity constraints, index names
– Views: name and definition
– Statistics, permissions, buffer size, etc.

Attr_Cat(attr_name, rel_name, type, position)
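Because the catalogue is itself a table, it can describe its own attributes. The following is a minimal sketch of that idea, assuming the Attr_Cat schema above; the `Students` rows, the type names, and the helper `attributes_of` are hypothetical and only serve as an illustration.

```python
from typing import NamedTuple


class AttrCatRow(NamedTuple):
    """One row of Attr_Cat(attr_name, rel_name, type, position)."""
    attr_name: str
    rel_name: str
    type: str
    position: int


ATTR_CAT = [
    # The catalogue table describes its own four attributes.
    AttrCatRow("attr_name", "Attr_Cat", "string", 1),
    AttrCatRow("rel_name", "Attr_Cat", "string", 2),
    AttrCatRow("type", "Attr_Cat", "string", 3),
    AttrCatRow("position", "Attr_Cat", "integer", 4),
    # Hypothetical user table, not taken from the slides.
    AttrCatRow("sid", "Students", "string", 1),
    AttrCatRow("name", "Students", "string", 2),
]


def attributes_of(rel_name):
    """Return the attribute names of a relation in positional order."""
    rows = [r for r in ATTR_CAT if r.rel_name == rel_name]
    return [r.attr_name for r in sorted(rows, key=lambda r: r.position)]
```

Looking up `attributes_of("Attr_Cat")` reads the catalogue to find the schema of the catalogue itself.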

Hash-based indexing

Basic idea
– An index is defined for each search key
– A hash function f maps the search key K to a memory address A: A = f(K)
– Ideally there is one key per address and one address per key, so the key effectively is the address

Hashing
– Ideal for joining tables
– Does not support range searches, only equality checks
– Many versions exist, e.g. static and dynamic

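Static hashing
– A file is a collection of buckets
– Every bucket has one primary page and possibly additional overflow pages
– The file has N buckets, numbered 0..N-1
– A bucket contains data entries, which can be stored in 3 ways:
  – data records with key k
  – <k, rid of a matching data record> pairs
  – <k, list of rids of matching data records> pairs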

– To identify the bucket that holds a data entry, the hash function h is applied
– Within the bucket, the entry is then located by a separate search
– To insert a record, h is used to find the proper bucket and the record is placed there
– If there is not enough space, an overflow page is chained to the bucket

– To delete a record, h is used to locate the data
– If the deleted record was the last one on an overflow page, the page is removed from the overflow chain and returned to the list of free pages
– Bucket number: h(value) mod N, where h(value) = a * value + b
– a and b are constants that can be tuned to influence searching
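The search, insert and delete steps of static hashing can be sketched as follows. This is a minimal in-memory illustration, assuming integer keys and pages modelled as Python lists; the class name `StaticHashFile` and the parameters `n_buckets` and `page_capacity` are hypothetical.

```python
class StaticHashFile:
    """Static hashing with N buckets and overflow chains (in-memory sketch)."""

    def __init__(self, n_buckets=4, page_capacity=2, a=1, b=0):
        self.n = n_buckets
        self.capacity = page_capacity
        self.a, self.b = a, b
        # Each bucket is a list of pages; pages[0] is the primary page,
        # any further pages form the overflow chain.
        self.buckets = [[[]] for _ in range(n_buckets)]

    def bucket_no(self, key):
        # h(value) = a * value + b, bucket = h(value) mod N
        return (self.a * key + self.b) % self.n

    def search(self, key):
        return any(key in page for page in self.buckets[self.bucket_no(key)])

    def insert(self, key):
        pages = self.buckets[self.bucket_no(key)]
        for page in pages:
            if len(page) < self.capacity:
                page.append(key)
                return
        pages.append([key])  # no free slot: chain a new overflow page

    def delete(self, key):
        pages = self.buckets[self.bucket_no(key)]
        for i, page in enumerate(pages):
            if key in page:
                page.remove(key)
                if not page and i > 0:
                    del pages[i]  # empty overflow page goes back to the free list
                return True
        return False
```

With 4 buckets and 2 entries per page, inserting 1, 5 and 9 fills the primary page of bucket 1 and chains one overflow page; deleting 9 removes that overflow page again.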

– Primary pages can be stored sequentially on the disk
– If the file grows a lot: overflow chains become long and searching gets slower, so a new file with more buckets has to be created
– If the file shrinks a lot: a lot of space is wasted

Solution
– Ideally, buckets are about 80% full and there are no overflow pages
– Periodically rehash the file: this takes time, and the index cannot be used during rehashing
– Or use dynamic hashing: Extendible Hashing or Linear Hashing

Extendible hashing

– Consider Static Hashing when a new entry is to be inserted into a full bucket
– Doubling the number of buckets and redistributing the entries is time consuming
– To save time, use a directory of pointers that point at the buckets
– Only the directory has to be doubled
– Only the overflowed bucket is split

Example
– Size of the directory: 4
– The last 2 binary digits of the hashed search field give a number between 0 and 3 -> the number of the directory element -> follow its pointer to the bucket

– Insert: calculate the entry's hash value (13) and locate the bucket number (01)
– If the page has free space, insert the entry
– What if data entry 20* is to be inserted (into bucket A)?

– Split bucket A and redistribute its entries
– Consider the last 3 digits of the hash values
– Double the directory and set the pointers

Result
– The local depth of bucket A is increased
– What happens when 9* is inserted?

– Split bucket B
– The directory does not have to be doubled, since the local depth of bucket B < global depth
– The local depth of bucket B is increased
– Initially, local depth = global depth
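The insert and split procedure of extendible hashing can be sketched as below. This is a simplified in-memory illustration under stated assumptions: keys are integers used directly as their hash values, the directory is indexed by the last `global_depth` bits, and duplicate hash values are ignored because handling them would need overflow pages. The names `Bucket`, `ExtendibleHash` and `capacity` are hypothetical.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.entries = []


class ExtendibleHash:
    """Extendible hashing indexed by the last global_depth bits (sketch)."""

    def __init__(self, capacity=4, global_depth=2):
        self.capacity = capacity
        self.global_depth = global_depth
        # Initially every directory element has its own bucket,
        # so local depth = global depth.
        self.directory = [Bucket(global_depth) for _ in range(2 ** global_depth)]

    def _slot(self, key):
        return key & ((1 << self.global_depth) - 1)

    def search(self, key):
        return key in self.directory[self._slot(key)].entries

    def insert(self, key):
        bucket = self.directory[self._slot(key)]
        if key in bucket.entries:
            return  # assumption: equal hash values are not stored twice
        if len(bucket.entries) < self.capacity:
            bucket.entries.append(key)
            return
        if bucket.local_depth == self.global_depth:
            # Double the directory: slot i + 2^d points where slot i points.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split only the overflowed bucket on the next bit of the hash value.
        bit = 1 << bucket.local_depth
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)
        old = bucket.entries
        bucket.entries = [k for k in old if not k & bit]
        image.entries = [k for k in old if k & bit]
        for i, b in enumerate(self.directory):
            if b is bucket and i & bit:
                self.directory[i] = image
        self.insert(key)  # retry; another split may be needed
```

The directory is doubled only when the overflowed bucket's local depth equals the global depth; otherwise the split just re-points half of the directory elements that shared the bucket.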

If a bucket becomes empty
– Merging buckets is also possible
– It is not always done
– Merging decreases the local depth

Storage
– Typical case: 100 MB file, 100 bytes per data entry, page size 4 KB
– 1,000,000 data entries, 25,000 elements in the directory
– High chance that the directory fits in memory -> speed = speed of Static Hashing
– Otherwise it is twice as slow
– Collision: entries with the same hash value (overflow pages are needed)
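The directory size above follows from simple arithmetic. The sketch below reproduces it, assuming 1 MB = 10^6 bytes, a 4096-byte page, and one directory element per page of data entries; the variable names are illustrative only.

```python
FILE_SIZE = 100 * 10**6   # 100 MB file (assuming 1 MB = 10^6 bytes)
ENTRY_SIZE = 100          # bytes per data entry
PAGE_SIZE = 4096          # 4 KB page

entries = FILE_SIZE // ENTRY_SIZE                  # number of data entries
entries_per_page = PAGE_SIZE // ENTRY_SIZE         # whole entries that fit on a page
directory_elements = entries // entries_per_page   # one element per page (assumption)

print(entries, entries_per_page, directory_elements)
```

With 40 entries per page, 1,000,000 entries need 25,000 pages, hence 25,000 directory elements.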

Linear Hashing
– Family of hash functions: h_0, h_1, ...
– Each function's range is twice that of its predecessor
– E.g. h_i(value) = h(value) mod (2^i N)
– d_0: number of bits of N's representation; d_i = d_0 + i
– Example: N = 32, d_0 = 5, h_1 is h mod (2 * 32), d_1 = 6
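The hash function family can be written down directly. The following is a small sketch, assuming the base hash function h is the identity on integers and N is a power of two; the function names `h_i` and `d_i` are chosen here for illustration.

```python
def h_i(i, value, n, h=lambda v: v):
    """h_i(value) = h(value) mod (2^i * N)."""
    return h(value) % (2**i * n)


def d_i(i, n):
    """Number of bits examined by h_i when N is a power of two: d_i = d_0 + i."""
    d_0 = n.bit_length() - 1
    return d_0 + i
```

For N = 32 this gives d_0 = 5 and d_1 = 6, and h_1(value) is either h_0(value) or h_0(value) + 32, which is why a split only moves entries between a bucket and its new image.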

Basic idea
– Splitting proceeds in rounds
– The number of the current round is L
– Only h_L and h_{L+1} are in use
– At any given point within a round there are buckets that have been split, buckets yet to be split, and buckets created by splits in this round

Searching
– h_L is applied
  – If it leads to an unsplit bucket, we look there
  – If it leads to a split bucket, we apply h_{L+1} to decide which bucket holds our data
– An insertion may need an overflow page
– If the overflow chain gets long, a split is triggered

Example
– L = 0: round number
– N_L = N * 2^L: number of buckets at the beginning of the L-th round (N_0 = N)

– If a split is triggered, the current (Next) bucket is split and its entries are redistributed by h_{L+1}
– The new bucket is appended to the end of the buckets
– Next is incremented by 1
– When searching, apply h_L; if the resulting bucket is before Next, apply h_{L+1}
– Continue: insert 43*, 37*, 29*, 22*, 66*, 34*, and 50*

[Figures: the file after inserting 43*; after 37*; after 29*; after 22*, 66* and 34*; after 50*]
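The search and split rules of linear hashing can be sketched as follows. This is an in-memory illustration under stated assumptions: keys are integers used as their own hash values, a bucket is a single list whose entries beyond `capacity` stand for overflow pages, and a split is triggered whenever an insertion needs an overflow page (one possible trigger policy). The class name `LinearHashFile` and its attributes are hypothetical.

```python
class LinearHashFile:
    """Linear hashing with round number L (level) and split pointer Next."""

    def __init__(self, n=4, capacity=4):
        self.n = n                # N: number of buckets at the start of round 0
        self.capacity = capacity  # entries that fit on a primary page
        self.level = 0            # L
        self.next = 0             # Next: the bucket to be split next
        self.buckets = [[] for _ in range(n)]

    def _h(self, i, key):
        return key % (2**i * self.n)

    def _bucket_no(self, key):
        b = self._h(self.level, key)
        if b < self.next:         # this bucket was already split in this round
            b = self._h(self.level + 1, key)
        return b

    def search(self, key):
        return key in self.buckets[self._bucket_no(key)]

    def insert(self, key):
        bucket = self.buckets[self._bucket_no(key)]
        needs_overflow = len(bucket) >= self.capacity
        bucket.append(key)
        if needs_overflow:
            self._split()

    def _split(self):
        old = self.buckets[self.next]
        self.buckets[self.next] = []
        self.buckets.append([])   # the new bucket goes to the end
        for k in old:             # redistribute with h_{L+1}
            self.buckets[self._h(self.level + 1, k)].append(k)
        self.next += 1
        if self.next == self.n * 2**self.level:  # all N_L buckets split
            self.level += 1
            self.next = 0
```

Note that the bucket being split is always the one at Next, not necessarily the one that overflowed; the overflowed bucket keeps its overflow entries until its own turn comes.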

Deletion
– If the last bucket is empty, it can be removed and Next can be decremented
– Under some conditions, merging can also be triggered for non-empty buckets
– New round during merging: empty buckets are removed, L is decremented, Next = N_L / 2 - 1

Comparison
– Imagine that Linear Hashing is stored like Extendible Hashing, with a directory
– The choice of hash functions is similar to Extendible Hashing: moving from h_i to h_{i+1} corresponds to doubling the directory
– Extendible Hashing: reduced number of splits and higher bucket occupancy

Linear hashing
– A clever choice of the bucket to split avoids the directory structure
– Primary pages are stored consecutively, so finding them is easy (offset calculation); equality selection is quicker
– A skewed distribution results in almost empty buckets, so storage is not efficient

– Imagine a directory structure for Linear Hashing: one directory element per bucket
– Overflow pages are stored easily
– Overhead of a directory level
– Costly for large, uniformly distributed files
– Improves space occupancy, but it is still worse than that of Extendible Hashing

File organizations

Cost model
– Used to analyze the (time) cost of the DB operations
– No. of data pages: B
– Records per page: R
– Time of reading/writing a page: D = 15 ms (dominant)
– Time of processing a record: C = 100 ns
– Time of hashing: H = 100 ns

– Reduced calculation: only the I/O time is considered
– 3 basic file organizations: heap files, sorted files, hashed files

File operations
– Scan: fetch all records in the file and locate them
– Search with equality selection: fetch and locate all records that satisfy an equality selection (=)
– Search with range selection: fetch all records that satisfy a range selection (>, <)
– Insert: identify the page, fetch it, add the record, and write the page back to the file (sometimes together with other pages, e.g. when the file consists of multiple pages)
– Delete: identify the page, fetch it, remove the record, and write the page back to the file (sometimes together with other pages, e.g. when the file consists of multiple pages)

Heap files
– Scan the file: B(D + RC)
– Search with equality selection:
  – one result: on average B(D + RC) / 2
  – several results: search the entire file, B(D + RC)
– Search with range selection: B(D + RC)
– Insert: fetch the last page, add the record, write it back: 2D + C
– Delete: find the record, delete it, write the page: cost of searching + C + D

Sorted files
– Scan: B(D + RC)
– Search with equality selection:
  – one result: D log_2 B + C log_2 R
  – several results: D log_2 B + C log_2 R + no. of results
– Search with range selection: D log_2 B + C log_2 R + no. of results
– Insert: find the place, insert, move the rest, write the pages: cost of searching + B(D + RC) on average
– Delete: find the record, delete it, move the rest, write the pages: cost of searching + B(D + RC)

Hashed files
– Assumptions: no overflow pages, 80% occupancy of buckets
– Scan the file: 1.25 * B(D + RC)
– Search with equality selection: on average H + D + RC / 2
– Search with range selection: 1.25 * B(D + RC)
– Insert: locate the page, add the record, write it back: cost of searching + D + C
– Delete: find the record, delete it, write the page: cost of searching + C + D
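The cost formulas for the three file organizations can be evaluated for concrete file sizes. The sketch below is illustrative only: it uses the constants from the cost model (D = 15 ms, C = 100 ns, H = 100 ns), interprets "cost of searching" as the single-result equality search cost (an assumption), and leaves out the cases that depend on the number of results. The function names are hypothetical.

```python
import math

D = 15e-3    # time of reading/writing one page, in seconds
C = 100e-9   # time of processing one record
H = 100e-9   # time of hashing


def heap(B, R):
    scan = B * (D + R * C)
    eq = scan / 2                       # one result, on average
    return {"scan": scan, "eq_search": eq, "range_search": scan,
            "insert": 2 * D + C, "delete": eq + C + D}


def sorted_file(B, R):
    scan = B * (D + R * C)
    eq = D * math.log2(B) + C * math.log2(R)   # binary search, one result
    return {"scan": scan, "eq_search": eq,
            "insert": eq + scan, "delete": eq + scan}


def hashed(B, R):
    scan = 1.25 * B * (D + R * C)       # 80% occupancy -> 1.25 times the pages
    eq = H + D + R * C / 2
    return {"scan": scan, "eq_search": eq, "range_search": scan,
            "insert": eq + D + C, "delete": eq + C + D}


if __name__ == "__main__":
    for name, f in (("heap", heap), ("sorted", sorted_file), ("hashed", hashed)):
        print(name, f(B=1000, R=100))
```

For any reasonably large B, the equality search cost orders as hashed < sorted < heap, while inserts are cheap for heap and hashed files and expensive for sorted files.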

Summary
– Heap file: storage is good, modifying is good, searching is bad
– Sorted file: searching is good, modifying is bad
– Hashed file: modifying is good, range selection is not supported, needs more space

Type    Scan    Eq. search  Range search          Insert       Delete
Heap    BD      BD/2        BD                    2D           Search + D
Sorted  BD      D log_2 B   D log_2 B + #matches  Search + BD  Search + BD
Hashed  1.25BD  D           1.25BD                2D           Search + D

Thank you for your attention! Book is uploaded: R. Ramakrishnan, J. Gehrke: Database Management Systems, 2nd edition