Efficient Storage and Retrieval of Data

Name: Efficient Storage and Retrieval of Data
Uploaded: 2017-11-29T07:57:05+00:00
Duration: PTM17S21
Description: Efficient Storage and Retrieval of Data

Physical Data Organization
Management of large amounts of persistent, reliable and shared data:
- large: the data does not fit into main memory, so secondary storage has to be used
- persistent: data written to a file persists after it has been used, so that it can be used again
- reliable: the data should survive hardware and software failures, and the system should be able to recover from them
- sharable: the data can be shared by multiple users

Physical Storage Media (hierarchy)
- cache
- main memory
- flash memory
- magnetic-disk storage
- optical storage
- magnetic-tape storage

Magnetic Disks
- Access time is much larger than processing time.
- Access time consists of seek time, rotational delay and block transfer time.
- A better organization of data requires fewer disk accesses.
- A relation can be stored in one or more files, with tuples as records and attributes as fields.
- Block size: B bytes
- Blocking factor: the number of whole records that fit in a block, f = floor(B/R), where f = blocking factor, R = record size, B = block size
- E.g. B = 1024 bytes, R = 100 bytes, f = floor(1024/100) = 10
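
The blocking-factor arithmetic can be checked with a short Python sketch. The function names are illustrative (they are not from the slides), and the sketch assumes unspanned, fixed-length records, so a partly filled remainder of a block is wasted.

```python
def blocking_factor(block_size: int, record_size: int) -> int:
    """Whole records per block, f = floor(B / R), for unspanned records."""
    return block_size // record_size


def blocks_needed(num_records: int, block_size: int, record_size: int) -> int:
    """Blocks required to hold num_records, rounding up to a whole block."""
    f = blocking_factor(block_size, record_size)
    return -(-num_records // f)  # ceiling division


if __name__ == "__main__":
    print(blocking_factor(1024, 100))       # records per block
    print(blocks_needed(30000, 1024, 100))  # blocks for 30,000 records
```

With B = 1024 and R = 100, 24 bytes of every block stay unused, which is why the division is rounded down.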

Files and Records
File operations: Find, Delete, Modify, Insert
Block allocation:
- contiguous: consecutive blocks are assigned; difficult to expand
- linked: each block contains the address of the next block; easy to expand, but reading is slow
- indexed: an index is stored in the file header
Records:
- fixed-length vs variable-length records
- spanned vs unspanned records

File Organization
Unordered (heap or pile)
- a new record is placed in the last block
- insertion is very cheap, but search, delete, update and reading in order are expensive
- a search requires b/2 block accesses on average because it is a linear search
- for 4096 blocks, 4096/2 = 2048 block accesses are needed
Ordered
- the ordering field is the same as the key field
- search, update and reading in order are efficient
- a search requires log2(b) block accesses because binary search can be used
- insertion is expensive (overflow blocks can be used to reduce the cost)
- for 4096 blocks, log2(4096) = 12 block accesses are needed
Hashing
- used when fast access is required
- whereas the access time for an ordered file is log2(b), the access time for hashing is constant
- permits access on the basis of the key
- performs poorly on range queries
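
To make the comparison concrete, the following Python sketch evaluates both search costs for a file of b blocks. The function names are hypothetical; the formulas are the b/2 average for a linear scan and log2(b), rounded up, for binary search.

```python
import math


def heap_search_cost(b: int) -> float:
    """Average block accesses for a linear search over b blocks."""
    return b / 2


def ordered_search_cost(b: int) -> int:
    """Block accesses for a binary search over b ordered blocks."""
    return math.ceil(math.log2(b))


if __name__ == "__main__":
    for b in (1024, 4096):
        print(b, heap_search_cost(b), ordered_search_cost(b))
```

For b = 4096 this gives 2048 accesses for the heap file against 12 for the ordered file, since 2^12 = 4096.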

Access Methods
Primary Key Access Methods
- Hashing
- Primary Key Indexing
- Multilevel Indexing
- B-Trees
- B+-Trees
Secondary Key Access Methods
- Secondary Key Indexing
- Clustering Indexing

Internal Hashing
- h(K) = K mod m
- m = 70 - 90% of the expected number of records
- applying the hash function to the key gives the physical address
- Example: h(James Adams) = (74 + 65) mod 17 = 139 mod 17 = 3, where 74 and 65 are the ASCII codes of the initials J and A

Slot  Name          Department  Salary  Overflow Pointer
3     James Adams
15    Mary Jones
17    Henry Truman
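
The worked example can be reproduced in Python. This is a minimal sketch that assumes the key is reduced to the sum of the ASCII codes of the two initials, which is inferred from 74 = 'J' and 65 = 'A' in the example; the function name is hypothetical.

```python
def initials_hash(name: str, m: int = 17) -> int:
    """h(K) = K mod m, with K taken as the sum of the initials' ASCII codes.

    Assumption: names have the form "First Last"; this key encoding is
    inferred from the slide's example and is not stated there in general.
    """
    first, last = name.split()
    return (ord(first[0]) + ord(last[0])) % m


if __name__ == "__main__":
    for name in ("James Adams", "Mary Jones", "Henry Truman"):
        print(name, initials_hash(name))
```

Under this assumption James Adams and Henry Truman hash to the same address, which is the kind of collision an overflow pointer is meant to resolve.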

External Hashing
- the number of disk accesses is never more than 2 and will usually be 1
- the file has 2 levels: the directory (bucket address table) and the buckets
- the buckets contain the actual records
- the key is to choose a good hash function h such that no more than n records have the same hash value, where n is the number of records that can be stored in a bucket
- if there is a collision, overflow buckets may be used
[Figure: part numbers (3760, 2428, 5659, 1620, 2369, 4871, 4692, 7115) are passed through the hash function h(K) = K mod 8 to select an entry in the bucket address table; each bucket holds records and an overflow pointer, which is null when there is no overflow bucket.]
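
A minimal in-memory Python sketch of the two-level structure follows. The class name, the bucket capacity of 2 and the use of Python lists to stand in for disk buckets are all assumptions made for illustration; only the hash function K mod 8 and the part numbers come from the slide.

```python
class HashFile:
    """Directory (bucket address table) plus buckets with overflow chains."""

    def __init__(self, num_buckets: int = 8, bucket_capacity: int = 2):
        self.num_buckets = num_buckets
        self.capacity = bucket_capacity
        # Each directory entry is a chain: primary bucket first, then overflow.
        self.directory = [[[]] for _ in range(num_buckets)]

    def insert(self, key: int) -> None:
        chain = self.directory[key % self.num_buckets]
        if len(chain[-1]) >= self.capacity:
            chain.append([])  # last bucket is full: allocate an overflow bucket
        chain[-1].append(key)

    def search(self, key: int):
        """Return how many buckets were read to find key, or None if absent."""
        chain = self.directory[key % self.num_buckets]
        for reads, bucket in enumerate(chain, start=1):
            if key in bucket:
                return reads
        return None


if __name__ == "__main__":
    parts = HashFile()
    for part_no in (3760, 2428, 5659, 1620, 2369, 4871, 4692, 7115):
        parts.insert(part_no)
    print(parts.search(3760), parts.search(4692))
```

With a capacity of 2, the third key that maps to bucket 4 (4692) lands in an overflow bucket, so finding it costs two bucket reads instead of one.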

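Primary Indexing
EMPLOYEE (data file, ordered on EMP #)

Block  EMP #  NAME  DEPT  SALARY
1      107          1     10k
1      110          3     12k
1      112          4     20k
1      115          1     15k
2      201          1     25k
2      236          5     10k
2      307          3     30k
2      366          2     35k
3      371          1     12k
3      395          3     15k
3      524          4     33k
3      608          5     25k
4      624          2     20k
4      630          2     30k
4      724          5     30k
4      798          4     35k

Index file (one entry per data block, holding the first EMP # in the block)

EMP #  Block Pointer
107    block 1
201    block 2
371    block 3
624    block 4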
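
A lookup through a primary index can be sketched in Python as a binary search over the index followed by a scan of one data block. The grouping into blocks of four records is inferred from the index keys 107, 201, 371 and 624; the names BLOCKS, INDEX and find_block are hypothetical.

```python
from bisect import bisect_right

# Data file ordered on EMP #, four records per block (assumed from the index keys).
BLOCKS = [
    [107, 110, 112, 115],
    [201, 236, 307, 366],
    [371, 395, 524, 608],
    [624, 630, 724, 798],
]

# Sparse primary index: the first key of each block.
INDEX = [block[0] for block in BLOCKS]


def find_block(emp_no: int):
    """Return the 0-based number of the block holding emp_no, or None."""
    i = bisect_right(INDEX, emp_no) - 1  # last index entry with key <= emp_no
    if i < 0:
        return None                      # smaller than every key in the file
    return i if emp_no in BLOCKS[i] else None


if __name__ == "__main__":
    print(find_block(395), find_block(400))
```

Only the small index is binary-searched; exactly one data block is then read, whether or not the key turns out to be present.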

Example
- Number of records, r = 30,000
- Block size, B = 1024 bytes
- Record length, R = 100 bytes
- Blocking factor, f = floor(B/R) = floor(1024/100) = 10 records/block
- Number of blocks needed, b = 30,000/10 = 3000 blocks
- Key field, V = 9 bytes
- Block pointer, P = 6 bytes
- Blocking factor for index entries = floor(1024/15) = 68
- Number of blocks needed to store the index entries = ceil(3000/68) = 45 blocks
- Number of block accesses needed = ceil(log2(45)) + 1 = 6 + 1 = 7
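
The same calculation as a Python sketch, with a hypothetical function name and the example's sizes as default parameters. The number of index entries is passed in, so it is 3000 here (one entry per data block).

```python
import math


def single_level_index_cost(num_entries: int, block_size: int = 1024,
                            key_size: int = 9, pointer_size: int = 6):
    """Return (index blocking factor, index blocks, block accesses).

    Accesses = binary search over the index blocks + 1 data block read.
    """
    fan_out = block_size // (key_size + pointer_size)
    index_blocks = math.ceil(num_entries / fan_out)
    accesses = math.ceil(math.log2(index_blocks)) + 1
    return fan_out, index_blocks, accesses


if __name__ == "__main__":
    print(single_level_index_cost(3000))
```

Without the index, a binary search over the 3000 data blocks would need ceil(log2(3000)) = 12 accesses, so the index saves 5 accesses per lookup in this example.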

Clustering Indexing
EMPLOYEE (data file, ordered on SALARY)

EMP #  NAME  DEPT  SALARY
107          1     10k
236          5     10k
110          3     12k
371          1     12k
115          1     15k
395          3     15k
112          4     20k
624          2     20k
201          1     25k
608          5     25k
307          3     30k
630          2     30k
724          5     30k
524          4     33k
366          2     35k
798          4     35k

Index file (one entry per distinct SALARY value, with a block pointer to the first block holding that value)

Salary  Block Pointer
10k
12k
15k
20k
25k
30k
33k
35k
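
A clustering index can be sketched in Python as a map from each distinct salary to the first block that contains it. The block size of two records and all names below are assumptions for illustration; the figure's actual block boundaries are not recoverable from the transcript.

```python
# (EMP #, salary in thousands), already ordered on salary as in the data file.
RECORDS = [
    (107, 10), (236, 10), (110, 12), (371, 12), (115, 15), (395, 15),
    (112, 20), (624, 20), (201, 25), (608, 25), (307, 30), (630, 30),
    (724, 30), (524, 33), (366, 35), (798, 35),
]
RECORDS_PER_BLOCK = 2  # assumed for this sketch


def build_clustering_index(records, per_block):
    """Map each distinct salary to the first block in which it appears."""
    index = {}
    for pos, (_, salary) in enumerate(records):
        index.setdefault(salary, pos // per_block)
    return index


INDEX = build_clustering_index(RECORDS, RECORDS_PER_BLOCK)


def employees_with_salary(salary: int):
    """Follow the index to the first block, then scan forward while matching."""
    start_block = INDEX.get(salary)
    if start_block is None:
        return []
    found = []
    for emp_no, s in RECORDS[start_block * RECORDS_PER_BLOCK:]:
        if s > salary:
            break  # the file is ordered, so no later record can match
        if s == salary:
            found.append(emp_no)
    return found


if __name__ == "__main__":
    print(employees_with_salary(30))
```

Because the file is ordered on the indexed field, all records with the same salary are adjacent, so the scan stops at the first larger value.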

Secondary Indexing
- constructed on a nonordering field
- many secondary indexes can be created on the same file
- if it is constructed on a key field, that field is called a secondary key

EMPLOYEE (data file, not ordered on EMP #)

EMP #  NAME  DEPT  SALARY
201          1     25k
110          3     12k
366          2     35k
107          1     10k
115          1     15k
236          5     10k
307          3     30k
112          4     20k
798          4     35k
395          3     15k
524          4     33k
724          5     30k
624          2     20k
630          2     30k
608          5     25k
371          1     12k

Index file (one entry per record, sorted on EMP #, each with a block pointer)

EMP #: 107, 110, 112, 115, 201, 236, 307, 366, 371, 395, 524, 608, 624, 630, 724, 798
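
A dense secondary index over the unordered file can be sketched in Python as a sorted list of (key, block) pairs searched with bisect. The four-records-per-block layout and the names used are assumptions; the record order follows the figure.

```python
from bisect import bisect_left

# EMP # values in the order they are stored in the data file.
FILE_ORDER = [201, 110, 366, 107, 115, 236, 307, 112,
              798, 395, 524, 724, 624, 630, 608, 371]
RECORDS_PER_BLOCK = 4  # assumed for this sketch

# Dense index: one (key, block number) entry per record, sorted on the key.
INDEX = sorted((emp_no, pos // RECORDS_PER_BLOCK)
               for pos, emp_no in enumerate(FILE_ORDER))
KEYS = [key for key, _ in INDEX]


def block_of(emp_no: int):
    """Binary-search the index; return the data block number or None."""
    i = bisect_left(KEYS, emp_no)
    if i < len(KEYS) and KEYS[i] == emp_no:
        return INDEX[i][1]
    return None


if __name__ == "__main__":
    print(block_of(107), block_of(798), block_of(100))
```

Unlike a primary index, every record needs its own entry here, because neighbouring keys in the index can live in unrelated blocks of the data file.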

Secondary Index Example
- Number of records, r = 30,000
- Block size, B = 1024 bytes
- Record length, R = 100 bytes
- Blocking factor, f = floor(B/R) = floor(1024/100) = 10 records/block
- Number of blocks needed, b = 30,000/10 = 3000 blocks
- Key field, V = 9 bytes
- Block pointer, P = 6 bytes
- Blocking factor for index entries = floor(1024/15) = 68
- Number of blocks needed to store the index entries = ceil(30,000/68) = 442 blocks
- Number of block accesses needed = ceil(log2(442)) + 1 = 9 + 1 = 10
A secondary index:
- occupies more space
- requires maintenance and is hence expensive
- can be created on a non-key field

Multilevel Indexing
- when the index file itself is large, we can construct an index on the index
- the index built on top of another index is always a primary index, because the index below it is ordered on its key
- Blocking factor for index entries = floor(1024/15) = 68
- Number of blocks needed to store the index entries at level 1 = ceil(30,000/68) = 442
- Number of blocks needed to store the index entries at level 2 = ceil(442/68) = 7
- Number of blocks needed to store the index entries at level 3 = ceil(7/68) = 1
- Number of block accesses needed = 3 + 1 = 4
[Figure: multilevel index; each index entry holds a key and a block pointer.]
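
The level sizes can be derived mechanically: keep dividing by the index blocking factor until a single block remains. A minimal Python sketch, with a hypothetical function name:

```python
import math


def index_levels(first_level_entries: int, fan_out: int):
    """Return the number of index blocks at each level, bottom level first."""
    levels = []
    entries = first_level_entries
    while True:
        blocks = math.ceil(entries / fan_out)
        levels.append(blocks)
        if blocks == 1:
            break          # a single block is the top of the index
        entries = blocks   # the next level has one entry per block below it
    return levels


if __name__ == "__main__":
    levels = index_levels(30000, 68)
    print(levels, "accesses:", len(levels) + 1)
```

One block is read per index level plus one data block, so the search cost grows with log base 68 of the number of entries rather than log base 2.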

B-Trees and B+-Trees
B-Tree
- each node in a B-tree of order p is of the form <P1, <K1, Pr1>, P2, <K2, Pr2>, ..., <Kq-1, Prq-1>, Pq>, where Pi is a tree pointer, Ki is a key field value, Pri is a data pointer and q <= p
- each path from the root node to a leaf node has the same length
- each node (except the root and the leaves) has at least ceil(p/2) children
B+-Tree
- all the keys and the associated data pointers to the records reside in the leaf nodes
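
The node layout maps directly onto a small Python structure. The sketch below covers search only, on a hand-built tree of order p = 3; the class, the field names and the "rec-..." data pointers are illustrative assumptions, and insertion with node splitting is omitted.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list                                    # K1 < K2 < ... < Kq-1
    data: list                                    # Pr1 ... Prq-1, one per key
    children: list = field(default_factory=list)  # P1 ... Pq; empty in a leaf


def search(node, key):
    """Return (data pointer or None, number of nodes visited)."""
    visited = 0
    while node is not None:
        visited += 1
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.data[i], visited          # found in this node
        # Otherwise descend into the subtree between Ki-1 and Ki.
        node = node.children[i] if node.children else None
    return None, visited


def leaf(*keys):
    return Node(list(keys), [f"rec-{k}" for k in keys])


# Order p = 3: at most 3 tree pointers and 2 keys per node.
ROOT = Node([201, 524], ["rec-201", "rec-524"],
            [leaf(107, 112), leaf(307, 371), leaf(624, 724)])

if __name__ == "__main__":
    print(search(ROOT, 371), search(ROOT, 201), search(ROOT, 999))
```

In a B-tree a search can stop at an internal node, as it does for key 201 here; in a B+-tree every search continues to a leaf, because only the leaves carry data pointers.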

[Figure: Example of a B-Tree]
