CS4432: Database Systems II Data Storage 1
Storage in DBMSs
DBMSs manage large amounts of data
How does a DBMS store and manage large amounts of data?
– Has significant impact on performance
Design decisions:
– What representations and data structures best support efficient manipulation of this data?
To understand why a DBMS applies specific strategies
– Must first understand how disks work
Disks and Files
DBMS stores information on (“hard”) disks. Main memory is only for processing.
This has major implications for DBMS design!
– READ: transfer data from disk to main memory (RAM).
– WRITE: transfer data from RAM to disk.
– Both are high-cost operations relative to in-memory operations, so they must be planned carefully!
DBMS vs. OS: Who’s in Control?
The DBMS is in control of managing its data
– It knows more about the structure of the data
– It knows more about the access patterns
That is why a DBMS has a Storage Manager and a Buffer Manager
Understanding Disks
Storage Hierarchy (fastest to slowest)
Cache (all levels)
– Avg. size: 256 KB – 1 MB
– Read/write time: [value missing] seconds
– Random access
– Smallest of all memory, and also the most costly
– Usually on the same chip as the processor
– Easy to manage in single-processor environments, more complicated in multiprocessor systems
Main Memory
– Avg. size: 128 MB – 1 GB
– Read/write time: [value missing] to [value missing] seconds
– Random access
– Becoming more affordable
– Volatile
Secondary Storage
– Avg. size: 30 GB – 160 GB
– Read/write time: [value missing] seconds
– NOT random access
– Extremely affordable: $0.68/GB!!!
– Can be used for file system, virtual memory, or raw data access
– Blocking (needs buffering)
Tertiary Storage
– Avg. size: gigabytes to terabytes
– Read/write time: [value missing] seconds
– NOT random access, or even remotely close
– Extremely affordable: pennies/GB!!!
– Not efficient for any real-time database purposes; could be used in an offline processing environment
Storage Hierarchy
Memory Hierarchy Summary
[Figure: axes are access time (sec) and typical capacity (bytes); plotted levels: cache, electronic main, electronic secondary, magnetic optical disks, online tape, nearline tape & optical disks, offline tape]
Memory Hierarchy Summary
[Figure: axes are access time (sec) and dollars/MB; plotted levels: cache, electronic main, electronic secondary, magnetic optical disks, online tape, nearline tape & optical disks, offline tape]
Why Not Store Everything in Main Memory?
Costs too much. $100 will buy you either 16GB of RAM or 360GB of disk today.
Main memory is volatile. We want data to be saved between runs. (Obviously!)
Typical hierarchy:
– Main memory (RAM): processing
– Disks (secondary storage): persistent storage
– Tapes & DVDs: archival
Motivation
Consider the following algorithm:
For each tuple r in relation R {
    read the tuple r
    For each tuple s in relation S {
        read the tuple s
        append the entire tuple s to r
    }
}
What is the time complexity of this algorithm?
Motivation
Complexity:
– This algorithm is O(n^2)! Is it always?
– Yes, if we assume random access of data.
Hard disks are not efficient at random access!
Unless the data is organized efficiently, this algorithm may be much worse than O(n^2).
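The nested loop can be sketched in Python to make the quadratic number of tuple reads visible. This is only an illustration: the relations are small in-memory lists, the name `nested_loop_append` is hypothetical, and "append s to r" is interpreted here as concatenating the two tuples.

```python
def nested_loop_append(R, S):
    """Pair every tuple of R with every tuple of S, counting tuple reads."""
    reads = 0
    result = []
    for r in R:
        reads += 1                    # read the tuple r
        for s in S:
            reads += 1                # read the tuple s
            result.append(r + s)      # append the entire tuple s to r
    return result, reads


# Toy relations (made-up data, for illustration only)
R = [(1, "a"), (2, "b"), (3, "c")]
S = [(10,), (20,)]

pairs, reads = nested_loop_append(R, S)
print(len(pairs), reads)   # |R| * |S| output tuples, |R| + |R| * |S| reads
```

In memory each read costs about the same. On disk, each read may involve mechanical movement, which is why the count alone does not tell the whole story.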
Disks: Some Facts
Data is stored and retrieved in units called disk blocks.
– Disk block size: 512 bytes to 4 KB or 8 KB
Movement to main memory
– Must read or write one block at a time
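Because transfers happen in whole blocks, even a small request can cost more than one block. A minimal sketch, assuming a 4 KB block size and a hypothetical helper `blocks_touched`:

```python
def blocks_touched(offset, length, block_size=4096):
    """Number of whole blocks that must be transferred to cover
    the byte range [offset, offset + length)."""
    if length <= 0:
        return 0
    first = offset // block_size                   # block holding the first byte
    last = (offset + length - 1) // block_size     # block holding the last byte
    return last - first + 1


print(blocks_touched(0, 100))       # 100 bytes inside one block
print(blocks_touched(4090, 100))    # 100 bytes straddling a block boundary
print(blocks_touched(0, 8192))      # exactly two full blocks
```

The second call shows that 100 bytes can require two block transfers when the range crosses a block boundary.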
Disk Components
Platter (2 surfaces)
[Figure labels: Virtual Cylinder, Disk Head, Platter, Cylinder]
Tracks divided into Sectors
[Figure labels: Track, Sector, Gap]
Gaps ≈ 10%, Sectors ≈ 90%
Movements
Arm moves in and out
– The time this takes is called seek time
– Mechanical
Platter rotates
– The time this takes is called latency time
– Mechanical
Actual Disk
Disk Controller
[Figure labels: Processor, Memory, Disk Controller, Disk 1, Disk 2, ...]
1. Controls the mechanical movement
2. Transfers data between the disks and memory
3. Performs smart buffering and scheduling
How big is the disk if:
There are 4 platters
There are 8192 tracks per surface
There are 256 sectors per track
There are 512 bytes per sector
Size = 2 surfaces * num of platters * tracks per surface * sectors per track * bytes per sector
Size = 2 * 4 * 8192 * 256 * 512 = 2^33 bytes
Size = 2^33 bytes / (1024 bytes/KB) / (1024 KB/MB) / (1024 MB/GB) = 8 GB
Equivalently, 2^33 = 2^3 * 2^30 bytes = 8 GB
Remember 1 KB = 1024 bytes, not 1000!
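The capacity calculation can be checked with a few lines of Python. The function name `disk_capacity_bytes` and its parameter names are hypothetical; the numbers are the ones given on the slide.

```python
def disk_capacity_bytes(platters, tracks_per_surface, sectors_per_track,
                        bytes_per_sector, surfaces_per_platter=2):
    """Total capacity: multiply the geometry together."""
    return (surfaces_per_platter * platters * tracks_per_surface
            * sectors_per_track * bytes_per_sector)


capacity = disk_capacity_bytes(platters=4, tracks_per_surface=8192,
                               sectors_per_track=256, bytes_per_sector=512)

print(capacity)                  # 8589934592
print(capacity == 2 ** 33)       # True
print(capacity / 2 ** 30, "GB")  # 8.0 GB, using 1 GB = 2^30 bytes
```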
Scale of Bytes
More Disk Terminology
Rotation speed:
– The speed at which the disk rotates, e.g. 5400 RPM
Number of tracks:
– Typically 10,000 to 15,000
Bytes per track:
– ~10^5 bytes per track
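Rotation speed translates directly into time per revolution. The sketch below assumes that, on average, the desired sector is half a revolution away from the head. That is a common simplification rather than something stated on the slide, and the function names are hypothetical.

```python
def rotation_time_ms(rpm):
    """Time for one full revolution, in milliseconds."""
    return 60_000.0 / rpm          # 60,000 ms per minute / revolutions per minute


def average_rotational_latency_ms(rpm):
    """Assumption: on average the head waits half a revolution."""
    return rotation_time_ms(rpm) / 2


print(round(rotation_time_ms(5400), 2))               # 11.11
print(round(average_rotational_latency_ms(5400), 2))  # 5.56
```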
Big Question: What about access time?
[Figure: a request "I want block X" is issued; when is block X in memory?]
Time = Disk Controller Processing Time + Disk Delay {seek & rotation} + Transfer Time
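The access-time formula can be turned into a small calculator. Everything numeric below is an assumption made for illustration: the 0.1 ms controller time and 10 ms seek time are made-up values, rotational delay is taken as half a revolution, and transfer time is estimated as the fraction of a track the block occupies (ignoring gaps). The 5400 RPM and 10^5 bytes per track figures come from the terminology slide.

```python
def transfer_time_ms(block_bytes, bytes_per_track, rpm):
    """Assumption: a block passes under the head in the same fraction of a
    revolution as the fraction of the track it occupies (gaps ignored)."""
    revolution_ms = 60_000.0 / rpm
    return (block_bytes / bytes_per_track) * revolution_ms


def access_time_ms(controller_ms, seek_ms, rotational_ms, transfer_ms):
    """Time = controller processing + disk delay (seek + rotation) + transfer."""
    return controller_ms + seek_ms + rotational_ms + transfer_ms


rpm = 5400
rotational_ms = (60_000.0 / rpm) / 2                  # assumed: half a revolution
transfer_ms = transfer_time_ms(4096, 100_000, rpm)    # one 4 KB block

total = access_time_ms(controller_ms=0.1,             # hypothetical value
                       seek_ms=10.0,                  # hypothetical value
                       rotational_ms=rotational_ms,
                       transfer_ms=transfer_ms)
print(f"{total:.2f} ms")
```

With these assumed values, the mechanical terms (seek and rotation) dominate the total, and the transfer itself is a small fraction.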