
1 Database Management, Lecture 7

2 Reminder Disk and RAM, RAID levels, disk space management, buffering, heap files, page formats, record formats

3 Today System catalogue, hash-based indexing (static, extendible, linear), time cost of operations

4 System catalogue A special set of tables that describes the database itself. Indexes – type of the data structure and the search key. Tables – name, filename, file structure (e.g. heap), attribute names and types, integrity constraints, index names. Views – name and definition. Also statistics, permissions, buffer size, etc.

5 Example catalogue relation describing the attributes of all relations: Attr_Cat(attr_name, rel_name, type, position)
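
Since the catalogue is itself a collection of tables, it also describes its own schema. A minimal Python sketch of what a few rows of Attr_Cat could look like; the Students relation and the concrete types are made-up examples:

```python
# Hypothetical contents of Attr_Cat(attr_name, rel_name, type, position).
# Note that Attr_Cat describes itself as well as the user relation Students.
attr_cat = [
    # (attr_name,  rel_name,   type,      position)
    ("attr_name", "Attr_Cat", "string",  1),
    ("rel_name",  "Attr_Cat", "string",  2),
    ("type",      "Attr_Cat", "string",  3),
    ("position",  "Attr_Cat", "integer", 4),
    ("sid",       "Students", "string",  1),
    ("name",      "Students", "string",  2),
    ("gpa",       "Students", "real",    3),
]

# Listing the attributes of a relation is just a selection on rel_name.
def attributes_of(rel):
    return [row for row in attr_cat if row[1] == rel]

print(attributes_of("Students"))
```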

6 Hash-based indexing

7 Basic idea An index is built for every search key. A hash function f maps a search key K to a memory address A: A = f(K). Ideally the mapping is one-to-one (one key per address, one address per key): the key effectively is the address.

8 Hashing Well suited for joining tables. Does not support range search, only equality checks. Many variants exist, e.g. static and dynamic hashing.

9 Static hashing A file is a collection of buckets. Every bucket has one primary page and possibly further overflow pages. The file has N buckets: 0..N-1. A bucket contains data entries, which can be stored in 3 ways: – the data records themselves (with search key k) – <k, rid of a matching record> – <k, list of rids of matching records>

10 To identify the bucket in which a data entry is stored, the hash function h is applied; within the bucket we then search for the entry itself. For insertion, h is used to find the proper bucket and the record is put there. If there is not enough space, an overflow page is chained to the bucket.

11 For deletion, h is used to locate the data. If the deleted record was the last one on its page, the page is removed from the overflow chain and returned to the list of free pages. The bucket is identified as h(value) mod N, with h(value) = a * value + b, where a and b are constants that can be tuned.
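
A minimal sketch of static hashing along these lines; the page capacity, the number of buckets N and the constants a and b are made-up values, and the overflow chain is modelled as extra pages appended to the bucket:

```python
PAGE_CAPACITY = 4          # data entries per page (small value, for illustration only)
N = 8                      # number of primary buckets, fixed in static hashing
a, b = 31, 7               # tunable constants of the hash function

def h(value):
    return a * value + b

def bucket_of(value):
    return h(value) % N    # bucket number in 0..N-1

class Bucket:
    def __init__(self):
        self.pages = [[]]            # pages[0] is the primary page, the rest are overflow pages

    def insert(self, entry):
        for page in self.pages:
            if len(page) < PAGE_CAPACITY:
                page.append(entry)
                return
        self.pages.append([entry])   # no space left: chain a new overflow page

    def search(self, key):
        return [e for page in self.pages for e in page if e == key]

buckets = [Bucket() for _ in range(N)]
for k in [20, 13, 9, 43, 37]:
    buckets[bucket_of(k)].insert(k)
print(buckets[bucket_of(13)].search(13))
```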

12 (figure only)

13 Primary pages can be stored sequentially on disk. If the file grows a lot – long overflow chains develop – searching gets slower – a new file with more buckets has to be created. If the file shrinks a lot – a lot of space is wasted.

14 Solution Ideally about 80% of each bucket is occupied and there is no overflow. One option is to periodically rehash the file – it takes time and the index cannot be used during rehashing. The other option is dynamic hashing – Extendible Hashing – Linear Hashing

15 Extendible hashing

16 Consider static hashing: a new entry has to be inserted into a full bucket. Doubling the number of buckets and redistributing all entries is time consuming. To save time, use a directory of pointers to the buckets: then only the directory has to be doubled, and only the overflowed bucket is split.

17 Example Size of the directory: 4. The last 2 binary digits of the hashed search field give a number between 0 and 3, the number of the directory element; we follow its pointer to the bucket.
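
The directory lookup is just a bit mask; a one-line illustration (the global depth of 2 corresponds to the 4-element directory of the example):

```python
def directory_slot(hashed_key, global_depth=2):
    # keep only the last `global_depth` bits of the hashed key
    return hashed_key & ((1 << global_depth) - 1)

print(directory_slot(13))   # 13 = 0b1101  -> last two bits 01 -> slot 1
print(directory_slot(20))   # 20 = 0b10100 -> last two bits 00 -> slot 0
```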

18 Insertion: calculate the entry's hash value (13 = 1101), take the last two binary digits (01) to locate the bucket. If the page has free space, simply insert. What happens if data entry 20* has to be inserted into the already full bucket A?

19 Split bucket A and redistribute its entries, now considering the last 3 binary digits of the hash values. Double the directory and set the pointers.

20 Result The local depth of bucket A is increased. What happens when 9* is inserted?

21 Split bucket B Doubling the directory is not needed, since the local depth of bucket B < the global depth; only the local depth of bucket B is increased. (Initially local depth = global depth.)
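
A compact sketch of the mechanism just described, assuming a bucket holds 4 data entries and the key itself serves as its hash value; this is only an illustration, not the textbook's pseudocode:

```python
PAGE_CAPACITY = 4            # data entries per bucket page (assumed for illustration)

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.entries = []

class ExtendibleHash:
    def __init__(self, global_depth=2):
        # initially every directory element has its own bucket, local depth = global depth
        self.global_depth = global_depth
        self.directory = [Bucket(global_depth) for _ in range(1 << global_depth)]

    def _slot(self, hashed):
        # the last `global_depth` bits of the hashed key select the directory element
        return hashed & ((1 << self.global_depth) - 1)

    def insert(self, hashed):
        bucket = self.directory[self._slot(hashed)]
        if len(bucket.entries) < PAGE_CAPACITY:
            bucket.entries.append(hashed)
            return
        # the bucket is full: double the directory only if its local depth
        # already equals the global depth
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # split the overflowed bucket: one more bit becomes significant
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        bit = 1 << (bucket.local_depth - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and (i & bit):
                self.directory[i] = sibling
        # redistribute the old entries (plus the new one) over the two buckets
        old, bucket.entries = bucket.entries, []
        for e in old + [hashed]:
            self.insert(e)

eh = ExtendibleHash()
for key in [32, 16, 20, 13, 9, 5, 21, 12]:
    eh.insert(key)
```

Doubling simply copies the existing pointers, so after the split only the directory slots whose newly significant bit is 1 are redirected to the new bucket, mirroring the "double the directory and set the pointers" step of the example.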

22 If a bucket gets empty, it can be merged with its split image and the local depth decreased. Merging is possible but not always done in practice.

23 Storage Typical example: 100 MB file, 100 bytes per data entry, 4 KB page size. That is 1,000,000 data entries and about 25,000 elements in the directory. There is a high chance that the directory fits in memory, in which case the speed equals that of static hashing; otherwise it is twice as slow. Collision: entries with the same hash value (overflow pages are needed).

24 Linear Hashing Family of hash functions: h_0, h_1, … Each function's range is twice that of its predecessor, e.g. h_i(value) = h(value) mod (2^i * N). d_0 is the number of bits in the representation of N, and d_i = d_0 + i. Example: N = 32, d_0 = 5, h_1 is h mod (2 * 32), d_1 = 6.
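
A quick check of the function family for the N = 32 example; using the identity as the base hash function h is an assumption made only for illustration:

```python
N = 32

def h(value):
    # base hash function; the identity is an assumption for illustration
    return value

def h_i(i, value):
    # h_i(value) = h(value) mod (2^i * N); each function's range is twice its predecessor's
    return h(value) % ((2 ** i) * N)

print(h_i(0, 43))   # 43 mod 32 = 11  (d_0 = 5 bits)
print(h_i(1, 43))   # 43 mod 64 = 43  (d_1 = 6 bits)
```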

25 Basic idea Splitting proceeds in rounds; the number of the current round is L. Only h_L and h_(L+1) are in use. At any given point within a round we have buckets that have already been split, buckets yet to be split, and buckets created by splits in this round.

26 (figure only)

27 Searching h_L is applied first – if it leads to a bucket that has not been split yet, we look there – if it leads to an already split bucket, we apply h_(L+1) to decide which bucket holds our data. Insertion may need an overflow page; if the overflow chain gets long, a split is triggered.
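
The search rule can be written down directly; a small sketch, assuming the base hash is the identity and N_0 = 32 as in the example:

```python
def locate_bucket(value, level, next_split, n0=32):
    """Bucket that may hold `value` during round `level` (n0 buckets at the very start)."""
    n_level = n0 * (2 ** level)        # buckets present at the beginning of round `level`
    b = value % n_level                # apply h_L (identity base hash assumed)
    if b < next_split:                 # this bucket was already split in the current round,
        b = value % (2 * n_level)      # so h_(L+1) decides between bucket b and b + n_level
    return b

print(locate_bucket(43, level=0, next_split=0))   # 43 mod 32 = 11, bucket 11 not split yet
print(locate_bucket(43, level=0, next_split=12))  # 11 < 12, so h_1 applies: 43 mod 64 = 43
```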

28 Example L = 0 is the round number. N_L = N * 2^L is the number of buckets at the beginning of the L-th round (N_0 = N).

29 If a split is triggered, the current bucket (pointed to by Next) is split and its entries are redistributed by h_(L+1). The new bucket is appended to the end of the file and Next is incremented by 1. For searching, apply h_L; if the resulting bucket number is before Next, apply h_(L+1). Continue the example: insert 43*, 37*, 29*, 22*, 66*, 34*, and 50*.

30 Insert 43* (figure only)

31 Insert 37* (figure only)

32 Insert 29* (figure only)

33 Insert 22*, 66*, 34* (figure only)

34 Insert 50* (figure only)

35 Deletion If the last bucket becomes empty, it can be removed and Next can be decremented. Under some conditions merging can also be triggered for non-empty buckets. When we step back to the previous round, empty buckets are removed, L is decremented, and Next = N_L / 2 - 1.

36 Comparison Imagine Linear Hashing implemented with a directory, as in Extendible Hashing. The choice of hash functions is similar (moving from h_i to h_(i+1) corresponds to doubling the directory in Extendible Hashing). Extendible Hashing gives a reduced number of splits and higher bucket occupancy.

37 Linear hashing – the clever choice of which bucket to split avoids a directory structure – primary pages are stored consecutively, so finding them is easy (an offset calculation), giving quicker equality selections – a skewed distribution results in almost empty buckets: storage is not efficient

38 Imagine a directory structure for Linear Hashing with one directory element per bucket: overflow pages are handled easily, but there is the overhead of a directory level, which is costly for large, uniformly distributed files. It improves space occupancy, but still not to the level of Extendible Hashing.

39 File organizations

40 Cost model To analyze the (time) cost of the DB operations. Number of data pages: B. Records per page: R. Time of reading/writing a page: D = 15 ms (dominant). Time of processing a record: C = 100 ns. Time of applying the hash function: H = 100 ns.

41 The calculation is often reduced to the I/O time only. 3 basic file organizations: – Heap files – Sorted files – Hashed files

42 File operations Scan: fetch all records in the file; every page has to be read. Search with equality selection: fetch all records that satisfy an equality selection (=). Search with range selection: fetch all records that satisfy a range selection (<, >). Insert: insert a record into the file; identify the page, fetch it, modify it, write it back (sometimes other pages as well, since a file consists of multiple pages). Delete: delete a record; identify the page, fetch it, modify it, write it back (sometimes other pages as well).

43 Heap files Scan the file: B(D + RC) Search with equality selection: – exactly one match: B(D + RC)/2 on average – several matches: the entire file is scanned, B(D + RC) Search with range selection: B(D + RC) Insert: fetch the last page, add the record, write it back: 2D + C Delete: find the record, delete it, write the page back: cost of searching + C + D (B: data pages, R: records/page, D: read/write time, C: record processing time)

44 Sorted files Scan: B(D + RC) Search with equality selection: – one result: D log2 B + C log2 R – several results: D log2 B + C log2 R + cost of reading the matching records Search with range selection: D log2 B + C log2 R + cost of reading the matching records Insert: find the place, insert, shift the rest, write the pages: cost of searching + B(D + RC) on average Delete: find the record, delete it, shift the rest, write the pages: cost of searching + B(D + RC) (B: data pages, R: records/page, D: read/write time, C: record processing time)

45 Hashed files Assume no overflow pages and 80% occupancy of the buckets. Scan the file: 1.25 * B(D + RC) Search with equality selection: H + D + RC/2 on average Search with range selection: 1.25 * B(D + RC) Insert: locate the page, add the record, write it back: cost of searching + D + C Delete: find the record, delete it, write the page back: cost of searching + C + D (B: data pages, R: records/page, D: read/write time, C: record processing time, H: hashing time)
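
To get a feel for the magnitudes, the formulas above can be evaluated with the example numbers: D, C and H come from the cost-model slide, while B = 25,000 and R = 40 are derived from the 100 MB / 4 KB / 100-byte example of the storage slide (an assumption, since the cost formulas themselves leave B and R symbolic):

```python
import math

D, C, H = 15e-3, 100e-9, 100e-9   # page I/O, record processing, hashing (seconds)
B, R = 25_000, 40                 # pages and records per page (assumed from the storage example)

print(f"heap/sorted scan:        {B * (D + R * C):8.2f} s")
print(f"heap, equality search:   {0.5 * B * (D + R * C):8.2f} s")
print(f"sorted, equality search: {D * math.log2(B) + C * math.log2(R):8.4f} s")
print(f"hashed, equality search: {H + D + R * C / 2:8.4f} s")
```

The page I/O term D dominates everything else, which is why the summary on the next slide keeps only the D terms.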

46 Summary Heap file: storage and modification are cheap, searching is expensive. Sorted file: searching is cheap, modification is expensive. Hashed file: modification is cheap, range selection is not supported, and it needs more space. Only the dominant I/O terms are shown below.

Type   | Scan    | Eq. search | Range search               | Insert      | Delete
Heap   | BD      | BD/2       | BD                         | 2D          | Search + D
Sorted | BD      | D log2 B   | D log2 B + #matching pages | Search + BD | Search + BD
Hashed | 1.25 BD | D          | 1.25 BD                    | 2D          | Search + D

47 Thank you for your attention! Book is uploaded: R. Ramakrishnan, J. Gehrke: Database Management Systems, 2nd edition

