Databasteknik Databaser och bioinformatik Data structures and Indexing (II) Fang Wei-Kleiner.

Storage hierarchy Important because it affects query efficiency. CPU. Primary storage (fast, small, expensive, volatile, directly accessible by the CPU): cache memory, main memory. Secondary storage (slow, big, cheap, permanent, not directly accessible by the CPU): disk, tape. Databases reside in secondary storage.

Disk Storage Devices Preferred secondary storage device for high storage capacity and low cost. Data stored as magnetized areas on magnetic disk surfaces. A disk pack contains several magnetic disks connected to a rotating spindle. Disks are divided into concentric circular tracks on each disk surface. Track capacities vary typically from 4 to 50 Kbytes or more

Disk Storage Devices (cont.) A track is divided into smaller blocks or sectors. The division of a track into sectors is hard-coded on the disk surface and cannot be changed. The block size B is fixed for each system. Typical block sizes range from B=512 bytes to B=4096 bytes. Whole blocks are transferred between disk and main memory for processing.

Disk Storage Devices (cont.) A read-write head moves to the track that contains the block to be transferred. Disk rotation moves the block under the read-write head for reading or writing. A physical disk block (hardware) address consists of: a cylinder number (imaginary collection of tracks of the same radius from all recorded surfaces), the track number or surface number (within the cylinder), and the block number (within the track). Reading or writing a disk block is time consuming: seek time ≈ 5 – 10 msec; rotational delay (depends on revolutions per minute) ≈ 3 – 5 msec.
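The dependence of rotational delay on revolutions per minute can be illustrated with a small sketch. It assumes the usual convention that the average rotational delay is half a revolution, and it ignores transfer time; the function names and the sample rpm and seek values are illustrative assumptions, not figures from the slides.

```python
def avg_rotational_delay_ms(rpm: float) -> float:
    """Average rotational delay in msec, assumed to be half a revolution."""
    ms_per_revolution = 60_000.0 / rpm
    return ms_per_revolution / 2.0


def avg_block_access_ms(seek_ms: float, rpm: float) -> float:
    """Seek time plus average rotational delay (transfer time ignored)."""
    return seek_ms + avg_rotational_delay_ms(rpm)


if __name__ == "__main__":
    # Hypothetical drive: 6000 rpm, 8 msec average seek time.
    print(avg_rotational_delay_ms(6000))
    print(avg_block_access_ms(8.0, 6000))
```

Under these assumptions a faster spindle only reduces the rotational part; the seek time has to be reduced separately, for example by placing related blocks on the same cylinder.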

Disk Reading from or writing to disk is a bottleneck: a disk access takes milliseconds (seek time plus rotational delay), whereas a main memory access or a CPU instruction is orders of magnitude faster.

Files and records Data is stored in files. A file is a sequence of records. A record is a set of field values. For instance, file = relation, record = entity, and field = attribute. Records are allocated to file blocks.

Files and records Let us assume B is the size in bytes of the block, R is the size in bytes of the record, and r is the number of records in the file. Blocking factor (number of records in each block): bfr = ⌊B / R⌋. Blocks needed for the file: b = ⌈r / bfr⌉. What is the space wasted per block?

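Files and records Wasted space per block = B – bfr * R. Solution: spanned records. [Figure: unspanned vs. spanned allocation. Unspanned: blocks i and i+1 each hold whole records followed by wasted space. Spanned: block i holds record 1, record 2, and the first part of record 3 plus a pointer p to block i+1, which holds the rest of record 3 followed by records 4 and 5.]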
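The quantities on these two slides can be computed directly. The following is a minimal sketch for unspanned allocation; the function names are illustrative, and the sample values for B, R, and r are an assumption chosen for the example.

```python
def blocking_factor(block_size: int, record_size: int) -> int:
    """bfr = floor(B / R): whole records that fit in one block (unspanned)."""
    if record_size > block_size:
        raise ValueError("unspanned allocation requires R <= B")
    return block_size // record_size


def blocks_needed(num_records: int, bfr: int) -> int:
    """b = ceil(r / bfr), using integer arithmetic."""
    return (num_records + bfr - 1) // bfr


def wasted_per_block(block_size: int, record_size: int) -> int:
    """Wasted space per block = B - bfr * R."""
    return block_size - blocking_factor(block_size, record_size) * record_size


if __name__ == "__main__":
    B, R, r = 4096, 1000, 1_000_000   # sample values (assumed)
    bfr = blocking_factor(B, R)
    print(bfr, blocks_needed(r, bfr), wasted_per_block(B, R))
```

With spanned records the wasted space per block disappears, at the price of a pointer in each block and records that may require two block accesses.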

From file blocks to disk blocks Contiguous allocation: cheap sequential access but expensive record addition. Why ? Linked allocation: expensive sequential access but cheap record addition. Why ? Linked clusters allocation. Indexed allocation.

File organization Heap files. Sorted files. Hash files. File organization != access method, though related in terms of efficiency.

Heap files Records are added to the end of the file. Hence, cheap record addition. Expensive record retrieval, removal and update, since they imply linear search: Average case: b/2 block accesses. Worst case: b block accesses (if the record does not exist or several matching records exist). Moreover, record removal implies waste of space, so periodic reorganization is needed.

Sorted files Records ordered according to some field. So, cheap ordered record retrieval (on the ordering field, otherwise expensive): All the records: access blocks sequentially. Next record: probably in the same block. Random record: binary search, so the worst case implies about ⌈log2 b⌉ block accesses. Expensive record addition, but less expensive record deletion (deletion markers + periodic reorganization). Is record updating cheap or expensive?
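To compare heap files and sorted files, the following sketch counts block accesses for a file of b blocks. It assumes the usual textbook estimates (b/2 on average and b in the worst case for linear search, about ceil(log2 b) for binary search); the function names are illustrative.

```python
import math


def heap_search_cost(num_blocks: int) -> tuple:
    """Linear search on a heap file: (average, worst) block accesses."""
    return num_blocks / 2, num_blocks


def sorted_search_cost(num_blocks: int) -> int:
    """Binary search on the ordering field: about ceil(log2 b) accesses."""
    return max(1, math.ceil(math.log2(num_blocks)))


if __name__ == "__main__":
    b = 250_000   # sample file size in blocks (assumed)
    print(heap_search_cost(b), sorted_search_cost(b))
```

The gap between the two costs is what the sorted file buys in exchange for its more expensive insertions.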

Primary index Let us assume that the ordering field is a key. Primary index = ordered file whose records contain two fields: One of the ordering key values. A pointer to a disk block. There is one record for each data block, and the record contains the ordering key value of the first record in the data block plus a pointer to the block. The index itself is searched with binary search.

Primary index Why is it faster to access a random record via a binary search in index than in the file ? What is the cost of maintaining an index? If the order of the data records changes…
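A primary index lookup can be sketched as follows. Data blocks are modelled as Python lists of sorted keys and the index as a list of (first key in block, block number) pairs; this in-memory representation and the function names are assumptions made for illustration only.

```python
from bisect import bisect_right


def build_primary_index(blocks):
    """One index entry per data block: (key of first record, block number)."""
    return [(block[0], block_no) for block_no, block in enumerate(blocks)]


def lookup(index, blocks, key):
    """Binary search in the index, then read the single candidate block."""
    anchors = [anchor for anchor, _ in index]
    pos = bisect_right(anchors, key) - 1   # last anchor <= key
    if pos < 0:
        return None                        # key is smaller than every anchor
    block_no = index[pos][1]
    return block_no if key in blocks[block_no] else None


if __name__ == "__main__":
    data_blocks = [[1, 3, 5], [7, 9, 11], [13, 15]]
    idx = build_primary_index(data_blocks)
    print(idx)
    print(lookup(idx, data_blocks, 9), lookup(idx, data_blocks, 4))
```

The index has one small entry per block rather than one large record per row, so it occupies far fewer blocks than the data file, and the binary search touches correspondingly fewer blocks.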

Clustering index Now, the ordering field is a non-key. Clustering index = ordered file whose records contain two fields: One of the ordering field values. A pointer to a disk block. There is one record for each distinct value of the ordering field, and the record contains the ordering field value plus a pointer to the first data block where that value appears. The index itself is searched with binary search.
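The "one record per distinct value" rule can be made concrete with a short sketch. It again models data blocks as lists of values sorted on the (non-key) ordering field; the representation and the function name are illustrative assumptions.

```python
def build_clustering_index(blocks):
    """One entry per distinct ordering-field value: (value, first block)."""
    index = []
    seen = set()
    for block_no, block in enumerate(blocks):
        for value in block:
            if value not in seen:          # first occurrence of this value
                seen.add(value)
                index.append((value, block_no))
    return index                           # sorted, because the file is sorted


if __name__ == "__main__":
    data_blocks = [[1, 1, 2], [2, 2, 3], [3, 5, 5]]
    print(build_clustering_index(data_blocks))
```

A lookup binary-searches this index, jumps to the first block for the value, and then reads forward while the value still matches, since duplicates may continue into the following blocks.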

Clustering index Efficiency gain ? Maintenance cost ?

Secondary indexes The index is now on a non-ordering field. Let us assume that this field is a key. Secondary index = ordered file whose records contain two fields: One of the non-ordering field values. A pointer to a disk record or block. There is one record per data record. The index itself is searched with binary search.
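Because the data file is not ordered on the indexed field, a secondary index needs one entry per data record. The sketch below builds such a dense index over a list of dictionaries; the record layout, the field names `id` and `code`, and the function names are hypothetical.

```python
from bisect import bisect_left


def build_secondary_index(records, field):
    """One entry per record: (field value, record position), sorted by value."""
    return sorted((rec[field], pos) for pos, rec in enumerate(records))


def find(index, value):
    """Binary search on the index; returns the record position or None."""
    keys = [key for key, _ in index]
    i = bisect_left(keys, value)
    if i < len(keys) and keys[i] == value:
        return index[i][1]
    return None


if __name__ == "__main__":
    # File ordered on "id"; "code" is a non-ordering key field (assumed).
    rows = [{"id": 1, "code": "q7"}, {"id": 2, "code": "a3"}, {"id": 3, "code": "m1"}]
    idx = build_secondary_index(rows, "code")
    print(idx, find(idx, "m1"))
```

The index is larger than a primary index on the same file, but the gain is also larger, since without it a search on a non-ordering field falls back to a linear scan.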

Secondary indexes Efficiency gain ? Maintenance cost ?

Secondary indexes Now, the index is on a non-ordering and non-key field.

Multilevel indexes Index on index (first level, second level, etc.). Works for primary, clustering and secondary indexes as long as the first level index has a distinct index value for every entry. How many levels ? Until the last level fits in a single disk block. How many disk block accesses to retrieve a random record?

Multilevel indexes Efficiency gain ? Maintenance cost ?

Exercise Assume an ordered file whose ordering field is a key. The file has 1000000 records of size 1000 bytes each. The disk block is of size 4096 bytes (unspanned allocation). The index record is of size 32 bytes. How many disk block accesses are needed to retrieve a random record when searching for the key field Using no index ? Using a primary index ? Using a multilevel index ?
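One way to work through this exercise is sketched below. It assumes unspanned allocation for index blocks as well, a binary search cost of ceil(log2 of the number of blocks), and one extra access to fetch the data block after the index search; the function and variable names are illustrative.

```python
import math


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def binary_search_cost(num_blocks: int) -> int:
    return max(1, math.ceil(math.log2(num_blocks)))


def solve_exercise(B=4096, R=1000, r=1_000_000, Ri=32):
    bfr = B // R                          # data records per block
    b = ceil_div(r, bfr)                  # data blocks
    no_index = binary_search_cost(b)      # binary search on the data file

    fan_out = B // Ri                     # index records per block
    index_blocks = ceil_div(b, fan_out)   # one index record per data block
    primary = binary_search_cost(index_blocks) + 1

    levels, blocks = 1, index_blocks      # add levels until one block remains
    while blocks > 1:
        blocks = ceil_div(blocks, fan_out)
        levels += 1
    multilevel = levels + 1               # one access per level + data block

    return {"no_index": no_index, "primary": primary, "multilevel": multilevel}


if __name__ == "__main__":
    print(solve_exercise())
```

Under these assumptions the data file has 250000 blocks, the first-level index has 1954 blocks, and the multilevel index has three levels.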

Dynamic multilevel indexes Record insertion, deletion and update may be expensive operations. Recall that all the index levels are ordered files. Solutions: Overflow area + periodic reorganization. Empty records + dynamic multilevel indexes, based on B-trees and B+-trees. Topics: search tree, B-tree, B+-tree.

Search Tree A search tree of order p is a tree such that each node contains at most p-1 search values and at most p pointers: <P1, K1, …, Pi, Ki, …, Kq-1, Pq> where q ≤ p. Pi: pointer to a child node. Ki: a search value (key). Within each node: K1 < K2 < … < Kq-1.

Searching a value X over the search tree Follow the appropriate pointer Pi at each level of the tree → only one node access at each tree level → time cost for retrieval equals the depth h of the tree. It is expected that h << tree size (the number of key values). Is that always guaranteed?
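The search procedure can be sketched as follows. The `Node` class is a hypothetical in-memory stand-in for a tree node (a real index node would be a disk block), with `children[i]` playing the role of pointer Pi and an empty list or `None` standing for a null pointer.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    keys: List[int]                                    # K1 < K2 < ... < Kq-1
    children: List[Optional["Node"]] = field(default_factory=list)


def search(root: Optional[Node], x: int):
    """Return (found, nodes_visited); one node is visited per tree level."""
    node, visited = root, 0
    while node is not None:
        visited += 1
        i = bisect_left(node.keys, x)                  # first key >= x
        if i < len(node.keys) and node.keys[i] == x:
            return True, visited
        node = node.children[i] if node.children else None
    return False, visited


if __name__ == "__main__":
    tree = Node([5, 9], [Node([1, 3]), Node([6, 8]), Node([12])])
    print(search(tree, 8), search(tree, 7), search(tree, 5))
```

Nothing in this structure keeps the tree balanced: inserting keys in sorted order can produce a depth proportional to the number of keys, which is the motivation for B-trees and B+-trees.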

Dynamic Multilevel Indexes Using B-Trees and B+-Trees B stands for Balanced → all the leaf nodes are at the same level (both B-Tree and B+-Tree are balanced). The depth of the tree is minimized. These data structures are variations of search trees that allow efficient insertion and deletion of search values. In B-Tree and B+-Tree data structures, each node corresponds to a disk block (recall the multilevel index). Ensure a big fan-out (number of pointers in each node). Each node is kept between half-full and completely full. Why?

Dynamic Multilevel Indexes Using B-Trees and B+-Trees (cont.) Insertion: An insertion into a node that is not full is quite efficient. If a node is full, the insertion causes a split into two nodes. Splitting may propagate to other tree levels. Deletion: A deletion is quite efficient if a node does not become less than half full. If a deletion causes a node to become less than half full, it must be merged with neighboring nodes.
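The split step for a full leaf can be sketched in a few lines. This is only the leaf-level part of an insertion, not a complete B+-tree; the leaf is modelled as a sorted Python list, and the convention that the left half keeps the extra entry and its largest key is copied up as the separator is one common textbook choice, assumed here.

```python
from bisect import insort


def insert_into_leaf(leaf, key, capacity):
    """Insert key into a sorted leaf; split if it overflows.

    Returns (left, right, separator). right and separator are None when
    no split was needed. The separator would be inserted into the parent,
    which may in turn overflow and split (propagation to upper levels).
    """
    insort(leaf, key)                      # keep the leaf sorted
    if len(leaf) <= capacity:
        return leaf, None, None
    mid = (len(leaf) + 1) // 2             # left half keeps the extra entry
    left, right = leaf[:mid], leaf[mid:]
    return left, right, left[-1]           # separator: largest key on the left


if __name__ == "__main__":
    print(insert_into_leaf([1, 5], 3, capacity=3))
    print(insert_into_leaf([1, 5, 8], 6, capacity=3))
```

After a split both halves are at least half full, which is why the half-full invariant survives insertions.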

Difference between B-tree and B+-tree In a B-tree, pointers to data records exist at all levels of the tree. In a B+-tree, all pointers to data records exist only at the leaf-level nodes. A B+-tree can have fewer levels (or a higher capacity of search values) than the corresponding B-tree.

B-tree Structures

The Nodes of a B+-tree Pnext (pointer at leaf node): ordered access to the data records on the indexing fields

B+-trees: Retrieval Very fast retrieval of a random record p is the order of the internal nodes of the B+-tree. N is the number of leaves in the B+-tree. How would the retrieval proceed ? Insertion and deletion can be expensive.

B+-trees: Insertion

B-trees: Order

B+-trees: Order

Exercise B=4096 bytes, P=16 bytes, K=64 bytes, node fill percentage=70 %. For both B-trees and B+-trees: Compute the order p. Compute the number of nodes, pointers and key values in the root, level 1, level 2 and leaves. If the results are different for B-trees and B+-trees, explain why this is so.
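A possible starting point for the first part of the exercise is sketched below. It assumes the common textbook node layouts: a B-tree node holds p tree pointers and p-1 (key, data pointer) pairs; a B+-tree internal node holds p tree pointers and p-1 keys; a B+-tree leaf holds (key, data pointer) pairs plus one next-leaf pointer. It also assumes data pointers have the same size P as tree pointers, which the exercise does not state.

```python
def btree_order(B: int, P: int, K: int, Pr: int) -> int:
    """Largest p with p*P + (p-1)*(Pr + K) <= B."""
    return (B + Pr + K) // (P + Pr + K)


def bplus_internal_order(B: int, P: int, K: int) -> int:
    """Largest p with p*P + (p-1)*K <= B."""
    return (B + K) // (P + K)


def bplus_leaf_order(B: int, P: int, K: int, Pr: int) -> int:
    """Largest p_leaf with p_leaf*(K + Pr) + P <= B."""
    return (B - P) // (K + Pr)


if __name__ == "__main__":
    B, P, K = 4096, 16, 64
    Pr = P   # assumption: data pointer size equals tree pointer size
    print(btree_order(B, P, K, Pr),
          bplus_internal_order(B, P, K),
          bplus_leaf_order(B, P, K, Pr))
```

Under these assumptions the B+-tree internal order comes out larger than the B-tree order, because internal nodes carry no data pointers; that larger fan-out is the reason a B+-tree can have fewer levels.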