
1 Database Management, Lecture 7

2 Reminder Disk and RAM, RAID levels, disk space management, buffering, heap files, page formats, record formats

3 Today System catalogue, hash-based indexing (static, extendible, linear), time cost of operations

4 System catalogue A special set of tables that describes the database itself. Indexes – type of the data structure and the search key. Tables – name, filename, file structure (e.g. heap), attribute names and types, integrity constraints, index names. Views – name and definition. Also statistics, permissions, buffer size, etc.

5 Example catalogue relation describing the attributes of all relations: Attr_Cat(attr_name, rel_name, type, position)
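
Since the catalogue is itself a collection of tables, it also describes its own schema. A minimal Python sketch of what a few rows of Attr_Cat could look like; the Students relation and the concrete types are made-up examples:

```python
# Hypothetical contents of Attr_Cat(attr_name, rel_name, type, position).
# Note that Attr_Cat describes itself as well as the user relation Students.
attr_cat = [
    # (attr_name,  rel_name,   type,      position)
    ("attr_name", "Attr_Cat", "string",  1),
    ("rel_name",  "Attr_Cat", "string",  2),
    ("type",      "Attr_Cat", "string",  3),
    ("position",  "Attr_Cat", "integer", 4),
    ("sid",       "Students", "string",  1),
    ("name",      "Students", "string",  2),
    ("gpa",       "Students", "real",    3),
]

# Listing the attributes of a relation is just a selection on rel_name.
def attributes_of(rel):
    return [row for row in attr_cat if row[1] == rel]

print(attributes_of("Students"))
```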

6 Hash-based indexing

7 Basic idea An index is built for every search key. A hash function f maps a search key K to a memory address A: A = f(K). Ideally the mapping is one-to-one (one key per address, one address per key): the key effectively is the address.

8 Hashing Well suited for joining tables. Does not support range search, only equality checks. Many variants exist, e.g. static and dynamic hashing.

9 Static hashing A file is a collection of buckets. Every bucket has one primary page and possibly further overflow pages. The file has N buckets: 0..N-1. A bucket contains data entries, which can be stored in 3 ways: – the data records themselves (with search key k) – <k, rid of a matching record> – <k, list of rids of matching records>

10 To identify the bucket in which a data entry is stored, the hash function h is applied; within the bucket we then search for the entry itself. For insertion, h is used to find the proper bucket and the record is put there. If there is not enough space, an overflow page is chained to the bucket.

11 For deletion, h is used to locate the data. If the deleted record was the last one on its page, the page is removed from the overflow chain and returned to the list of free pages. The bucket is identified as h(value) mod N, with h(value) = a * value + b, where a and b are constants that can be tuned.
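
A minimal sketch of static hashing along these lines; the page capacity, the number of buckets N and the constants a and b are made-up values, and the overflow chain is modelled as extra pages appended to the bucket:

```python
PAGE_CAPACITY = 4          # data entries per page (small value, for illustration only)
N = 8                      # number of primary buckets, fixed in static hashing
a, b = 31, 7               # tunable constants of the hash function

def h(value):
    return a * value + b

def bucket_of(value):
    return h(value) % N    # bucket number in 0..N-1

class Bucket:
    def __init__(self):
        self.pages = [[]]            # pages[0] is the primary page, the rest are overflow pages

    def insert(self, entry):
        for page in self.pages:
            if len(page) < PAGE_CAPACITY:
                page.append(entry)
                return
        self.pages.append([entry])   # no space left: chain a new overflow page

    def search(self, key):
        return [e for page in self.pages for e in page if e == key]

buckets = [Bucket() for _ in range(N)]
for k in [20, 13, 9, 43, 37]:
    buckets[bucket_of(k)].insert(k)
print(buckets[bucket_of(13)].search(13))
```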

12 (figure only)

13 Primary pages can be stored sequentially on disk. If the file grows a lot – long overflow chains develop – searching gets slower – a new file with more buckets has to be created. If the file shrinks a lot – a lot of space is wasted.

14 Solution Ideally about 80% of each bucket is occupied and there is no overflow. One option is to periodically rehash the file – it takes time and the index cannot be used during rehashing. The other option is dynamic hashing – Extendible Hashing – Linear Hashing

15 Extendible hashing

16 Consider static hashing: a new entry has to be inserted into a full bucket. Doubling the number of buckets and redistributing all entries is time consuming. To save time, use a directory of pointers to the buckets: then only the directory has to be doubled, and only the overflowed bucket is split.

17 Example Size of the directory: 4. The last 2 binary digits of the hashed search field give a number between 0 and 3, the number of the directory element; we follow its pointer to the bucket.
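
The directory lookup is just a bit mask; a one-line illustration (the global depth of 2 corresponds to the 4-element directory of the example):

```python
def directory_slot(hashed_key, global_depth=2):
    # keep only the last `global_depth` bits of the hashed key
    return hashed_key & ((1 << global_depth) - 1)

print(directory_slot(13))   # 13 = 0b1101  -> last two bits 01 -> slot 1
print(directory_slot(20))   # 20 = 0b10100 -> last two bits 00 -> slot 0
```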

18 Insertion: calculate the entry's hash value (13 = 1101), take the last two binary digits (01) to locate the bucket. If the page has free space, simply insert. What happens if data entry 20* has to be inserted into the already full bucket A?

19 Split bucket A and redistribute its entries, now considering the last 3 binary digits of the hash values. Double the directory and set the pointers.

20 Result The local depth of bucket A is increased. What happens when 9* is inserted?

21 Split bucket B Doubling the directory is not needed, since the local depth of bucket B < the global depth; only the local depth of bucket B is increased. (Initially local depth = global depth.)
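
A compact sketch of the mechanism just described, assuming a bucket holds 4 data entries and the key itself serves as its hash value; this is only an illustration, not the textbook's pseudocode:

```python
PAGE_CAPACITY = 4            # data entries per bucket page (assumed for illustration)

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.entries = []

class ExtendibleHash:
    def __init__(self, global_depth=2):
        # initially every directory element has its own bucket, local depth = global depth
        self.global_depth = global_depth
        self.directory = [Bucket(global_depth) for _ in range(1 << global_depth)]

    def _slot(self, hashed):
        # the last `global_depth` bits of the hashed key select the directory element
        return hashed & ((1 << self.global_depth) - 1)

    def insert(self, hashed):
        bucket = self.directory[self._slot(hashed)]
        if len(bucket.entries) < PAGE_CAPACITY:
            bucket.entries.append(hashed)
            return
        # the bucket is full: double the directory only if its local depth
        # already equals the global depth
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # split the overflowed bucket: one more bit becomes significant
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        bit = 1 << (bucket.local_depth - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and (i & bit):
                self.directory[i] = sibling
        # redistribute the old entries (plus the new one) over the two buckets
        old, bucket.entries = bucket.entries, []
        for e in old + [hashed]:
            self.insert(e)

eh = ExtendibleHash()
for key in [32, 16, 20, 13, 9, 5, 21, 12]:
    eh.insert(key)
```

Doubling simply copies the existing pointers, so after the split only the directory slots whose newly significant bit is 1 are redirected to the new bucket, mirroring the "double the directory and set the pointers" step of the example.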

22 If a bucket gets empty, it can be merged with its split image and the local depth decreased. Merging is possible but not always done in practice.

23 Storage Typical example: 100 MB file, 100 bytes per data entry, 4 KB page size. That is 1,000,000 data entries and about 25,000 elements in the directory. There is a high chance that the directory fits in memory, in which case the speed equals that of static hashing; otherwise it is twice as slow. Collision: entries with the same hash value (overflow pages are needed).

24 Linear Hashing Family of hash functions: h_0, h_1, … Each function's range is twice that of its predecessor, e.g. h_i(value) = h(value) mod (2^i * N). d_0 is the number of bits in the representation of N, and d_i = d_0 + i. Example: N = 32, d_0 = 5, h_1 is h mod (2 * 32), d_1 = 6.
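
A quick check of the function family for the N = 32 example; using the identity as the base hash function h is an assumption made only for illustration:

```python
N = 32

def h(value):
    # base hash function; the identity is an assumption for illustration
    return value

def h_i(i, value):
    # h_i(value) = h(value) mod (2^i * N); each function's range is twice its predecessor's
    return h(value) % ((2 ** i) * N)

print(h_i(0, 43))   # 43 mod 32 = 11  (d_0 = 5 bits)
print(h_i(1, 43))   # 43 mod 64 = 43  (d_1 = 6 bits)
```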

25 Basic idea Splitting proceeds in rounds; the number of the current round is L. Only h_L and h_(L+1) are in use. At any given point within a round we have buckets that have already been split, buckets yet to be split, and buckets created by splits in this round.

26 (figure only)

27 Searching h_L is applied first – if it leads to a bucket that has not been split yet, we look there – if it leads to an already split bucket, we apply h_(L+1) to decide which bucket holds our data. Insertion may need an overflow page; if the overflow chain gets long, a split is triggered.
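
The search rule can be written down directly; a small sketch, assuming the base hash is the identity and N_0 = 32 as in the example:

```python
def locate_bucket(value, level, next_split, n0=32):
    """Bucket that may hold `value` during round `level` (n0 buckets at the very start)."""
    n_level = n0 * (2 ** level)        # buckets present at the beginning of round `level`
    b = value % n_level                # apply h_L (identity base hash assumed)
    if b < next_split:                 # this bucket was already split in the current round,
        b = value % (2 * n_level)      # so h_(L+1) decides between bucket b and b + n_level
    return b

print(locate_bucket(43, level=0, next_split=0))   # 43 mod 32 = 11, bucket 11 not split yet
print(locate_bucket(43, level=0, next_split=12))  # 11 < 12, so h_1 applies: 43 mod 64 = 43
```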

28 Example L = 0 is the round number. N_L = N * 2^L is the number of buckets at the beginning of the L-th round (N_0 = N).

29 If a split is triggered, the current bucket (pointed to by Next) is split and its entries are redistributed by h_(L+1). The new bucket is appended to the end of the file and Next is incremented by 1. For searching, apply h_L; if the resulting bucket number is before Next, apply h_(L+1). Continue the example: insert 43*, 37*, 29*, 22*, 66*, 34*, and 50*.

30 Insert 43* (figure only)

31 Insert 37* (figure only)

32 Insert 29* (figure only)

33 Insert 22*, 66*, 34* (figure only)

34 Insert 50* (figure only)

35 Deletion If the last bucket becomes empty, it can be removed and Next can be decremented. Under some conditions merging can also be triggered for non-empty buckets. When we step back to the previous round, empty buckets are removed, L is decremented, and Next = N_L / 2 - 1.

36 Comparison Imagine Linear Hashing implemented with a directory, as in Extendible Hashing. The choice of hash functions is similar (moving from h_i to h_(i+1) corresponds to doubling the directory in Extendible Hashing). Extendible Hashing gives a reduced number of splits and higher bucket occupancy.

37 Linear hashing – the clever choice of which bucket to split avoids a directory structure – primary pages are stored consecutively, so finding them is easy (an offset calculation), giving quicker equality selections – a skewed distribution results in almost empty buckets: storage is not efficient

38 Imagine a directory structure for Linear Hashing with one directory element per bucket: overflow pages are handled easily, but there is the overhead of a directory level, which is costly for large, uniformly distributed files. It improves space occupancy, but still not to the level of Extendible Hashing.

39 File organizations

40 Cost model To analyze the (time) cost of the DB operations. Number of data pages: B. Records per page: R. Time of reading/writing a page: D = 15 ms (dominant). Time of processing a record: C = 100 ns. Time of applying the hash function: H = 100 ns.

41 The calculation is often reduced to the I/O time only. 3 basic file organizations: – Heap files – Sorted files – Hashed files

42 File operations Scan: fetch all records in the file; every page has to be read. Search with equality selection: fetch all records that satisfy an equality selection (=). Search with range selection: fetch all records that satisfy a range selection (<, >). Insert: insert a record into the file; identify the page, fetch it, modify it, write it back (sometimes other pages as well, since a file consists of multiple pages). Delete: delete a record; identify the page, fetch it, modify it, write it back (sometimes other pages as well).

43 Heap files Scan the file: B(D + RC) Search with equality selection: – exactly one match: B(D + RC)/2 on average – several matches: the entire file is scanned, B(D + RC) Search with range selection: B(D + RC) Insert: fetch the last page, add the record, write it back: 2D + C Delete: find the record, delete it, write the page back: cost of searching + C + D (B: data pages, R: records/page, D: read/write time, C: record processing time)

44 Sorted files Scan: B(D + RC) Search with equality selection: – one result: D log2 B + C log2 R – several results: D log2 B + C log2 R + cost of reading the matching records Search with range selection: D log2 B + C log2 R + cost of reading the matching records Insert: find the place, insert, shift the rest, write the pages: cost of searching + B(D + RC) on average Delete: find the record, delete it, shift the rest, write the pages: cost of searching + B(D + RC) (B: data pages, R: records/page, D: read/write time, C: record processing time)

45 Hashed files Assume no overflow pages and 80% occupancy of the buckets. Scan the file: 1.25 * B(D + RC) Search with equality selection: H + D + RC/2 on average Search with range selection: 1.25 * B(D + RC) Insert: locate the page, add the record, write it back: cost of searching + D + C Delete: find the record, delete it, write the page back: cost of searching + C + D (B: data pages, R: records/page, D: read/write time, C: record processing time, H: hashing time)
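
To get a feel for the magnitudes, the formulas above can be evaluated with the example numbers: D, C and H come from the cost-model slide, while B = 25,000 and R = 40 are derived from the 100 MB / 4 KB / 100-byte example of the storage slide (an assumption, since the cost formulas themselves leave B and R symbolic):

```python
import math

D, C, H = 15e-3, 100e-9, 100e-9   # page I/O, record processing, hashing (seconds)
B, R = 25_000, 40                 # pages and records per page (assumed from the storage example)

print(f"heap/sorted scan:        {B * (D + R * C):8.2f} s")
print(f"heap, equality search:   {0.5 * B * (D + R * C):8.2f} s")
print(f"sorted, equality search: {D * math.log2(B) + C * math.log2(R):8.4f} s")
print(f"hashed, equality search: {H + D + R * C / 2:8.4f} s")
```

The page I/O term D dominates everything else, which is why the summary on the next slide keeps only the D terms.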

46 Summary Heap file: storage and modification are cheap, searching is expensive. Sorted file: searching is cheap, modification is expensive. Hashed file: modification is cheap, range selection is not supported, and it needs more space. Only the dominant I/O terms are shown below.

Type   | Scan    | Eq. search | Range search               | Insert      | Delete
Heap   | BD      | BD/2       | BD                         | 2D          | Search + D
Sorted | BD      | D log2 B   | D log2 B + #matching pages | Search + BD | Search + BD
Hashed | 1.25 BD | D          | 1.25 BD                    | 2D          | Search + D

47 Thank you for your attention! Book is uploaded: R. Ramakrishnan, J. Gehrke: Database Management Systems, 2nd edition

