Lecture 15: Midterm Review Data Storage Monday, October 31, 2005
Midterm Monday, 11:30, this room (in class), 50 minutes Open book: notes, notebooks, anything No computers
Midterm SQL E/R Diagrams Functional Dependencies XML/Xpath/XQuery
SQL Know the basics: SFW, GROUP-BY, HAVING… When are two queries equivalent ? Eliminating subqueries Be aware of duplicates Insert/delete, especially more than one tuple Constraints in SQL
E/R Diagrams Good design (don’t make stupid mistakes) Translation to relations Many-many vs. many-one relationships Subtleties: Inheritance Union types Weak entity sets
Functional Dependencies Know the definition of X → Y Does a given table satisfy X → Y ? Understand inference If A → B, B → C, does it follow that C → A ? Why ? Why not ? Understand closure: X+ Understand BCNF and 3NF
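The closure X+ and the inference question above can be checked mechanically. The following is a minimal sketch of the standard fixpoint computation; the function name `closure` and the representation of dependencies as pairs of Python sets are assumptions made for illustration, not notation from the lecture.

```python
def closure(attrs, fds):
    """Return the closure X+ of the attribute set `attrs` under `fds`.

    `fds` is a list of (lhs, rhs) pairs, each side a set of attribute names.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If X+ already contains the left side, it must also contain the right side.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


fds = [({"A"}, {"B"}), ({"B"}, {"C"})]
print(sorted(closure({"A"}, fds)))  # ['A', 'B', 'C']
print(sorted(closure({"C"}, fds)))  # ['C']
```

Since A is not in the closure of {C}, C → A does not follow from A → B and B → C.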
XML Basics in XPath and XQuery In what sense is XML “semistructured” ? Mapping relations to XML Simple ways to store XML data Exclude the XML index
Midterm How to prepare: Read lecture notes Read from the textbook Review the homeworks Make sure you understand
Outline Disks: 11.3 Representing data elements: 12 Recommended reading: entire chapter 11
The Mechanics of Disk Mechanical characteristics: Rotation speed (5400 RPM) Number of platters (1-30) Number of tracks (<=10000) Number of bytes/track (10^5) Unit of read or write: disk block Once in memory: page Typically: 4k or 8k or 16k [Figure labels: spindle, platters, tracks, cylinder, sector, disk head, arm assembly, arm movement]
Disk Access Characteristics Disk latency = time between when command is issued and when data is in memory Disk latency = seek time + rotational latency Seek time = time for the head to reach cylinder 10ms – 40ms Rotational latency = time for the sector to rotate under the head Rotation time = 10ms Average latency = 10ms/2 Transfer rate = typically 40MB/s Disks read/write one block at a time
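To get a feel for the numbers above, here is a small sketch that adds up the cost of reading one block. The function name and the choice of a 4 KB block and a 10 ms seek are assumptions for illustration; 1 MB is taken as 10^6 bytes.

```python
def block_access_ms(seek_ms, rotation_ms, transfer_mb_per_s, block_bytes):
    """Estimate the time to read one block, in milliseconds."""
    # On average the desired sector is half a rotation away from the head.
    rotational_latency_ms = rotation_ms / 2
    # Time to move the block's bytes once the head is positioned.
    transfer_ms = block_bytes / (transfer_mb_per_s * 1_000_000) * 1000
    return seek_ms + rotational_latency_ms + transfer_ms


t = block_access_ms(seek_ms=10, rotation_ms=10, transfer_mb_per_s=40, block_bytes=4096)
print(f"{t:.2f} ms")  # 15.10 ms
```

Under these assumptions the transfer itself accounts for only about 0.1 ms; seek time and rotational latency dominate.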
Average Seek Time Suppose we have N tracks, what is the average seek time ? Getting from cylinder x to y takes time |x-y|
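One way to explore the question is to average |x-y| over all pairs of cylinders by brute force. The sketch below assumes the start and target cylinders are independent and uniformly distributed over 1..N, which the slide does not state explicitly; the function name is hypothetical.

```python
from fractions import Fraction


def average_seek_distance(n):
    """Exact average of |x - y| over all pairs of cylinders x, y in 1..n."""
    total = sum(abs(x - y) for x in range(1, n + 1) for y in range(1, n + 1))
    return Fraction(total, n * n)


for n in (10, 100, 1000):
    avg = average_seek_distance(n)
    # Compare the exact average with the approximation n / 3.
    print(n, float(avg), n / 3)
```

Under this assumption the exact value is (N^2 - 1) / (3N), so the average seek covers roughly one third of the tracks.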
RAID Several disks that work in parallel Redundancy: use parity to recover from disk failure Speed: read from several disks at once Various configurations (called levels): RAID 1 = mirror RAID 4 = n disks + 1 parity disk RAID 5 = n+1 disks, assign parity blocks round robin RAID 6 = “Hamming codes”
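The parity idea behind RAID 4 and RAID 5 can be sketched with bytewise XOR: the parity block is the XOR of the data blocks, and any single lost block is the XOR of everything that survives. The block contents and helper name below are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data disks, one 3-byte block each.
data_disks = [b"\x0f\xf0\xaa", b"\x33\x55\x00", b"\xff\x01\x10"]
parity = xor_blocks(data_disks)

# Simulate losing disk 1 and rebuilding it from the survivors plus parity.
lost = 1
survivors = [blk for i, blk in enumerate(data_disks) if i != lost]
recovered = xor_blocks(survivors + [parity])
print(recovered == data_disks[lost])  # True
```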
Buffer Management in a DBMS [Figure: page requests from higher levels (READ, WRITE) are served from the buffer pool in main memory; disk pages move between the DB on disk and free frames (INPUT, OUTPUT); choice of frame dictated by replacement policy] Data must be in RAM for DBMS to operate on it! Table of <frame#, pageid> pairs is maintained
Buffer Manager Manages buffer pool: the pool provides space for a limited number of pages from disk. Needs to decide on page replacement policy LRU Clock algorithm Both work well in OS, but not always in DB Enables the higher levels of the DBMS to assume that the needed data is in main memory.
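As an illustration of a replacement policy, here is a minimal LRU buffer pool sketch. The class name, the `read_page` callback, and the `disk_reads` counter are assumptions for this example; it leaves out dirty pages, pinning, and writing pages back to disk.

```python
from collections import OrderedDict


class BufferPool:
    def __init__(self, num_frames, read_page):
        self.num_frames = num_frames
        self.read_page = read_page      # hypothetical callback: pageid -> page contents
        self.frames = OrderedDict()     # pageid -> contents, least recently used first
        self.disk_reads = 0

    def get(self, pageid):
        if pageid in self.frames:
            self.frames.move_to_end(pageid)   # mark as most recently used
            return self.frames[pageid]
        if len(self.frames) >= self.num_frames:
            self.frames.popitem(last=False)   # evict the least recently used page
        self.disk_reads += 1
        self.frames[pageid] = self.read_page(pageid)
        return self.frames[pageid]


pool = BufferPool(num_frames=2, read_page=lambda pid: f"contents of page {pid}")
for pid in [1, 2, 1, 3, 2]:
    pool.get(pid)
print(list(pool.frames), pool.disk_reads)  # [3, 2] 4
```

A sequential scan over a table larger than the pool evicts every page just before it would be reused, which is one reason LRU does not always work well in a DBMS.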
Buffer Manager Why not use the Operating System for the task?? - DBMS may be able to anticipate access patterns - Hence, may also be able to perform prefetching - DBMS needs the ability to force pages to disk, for recovery purposes
Representing Data Elements Relational database elements: A tuple is represented as a record The table is a sequence of records
CREATE TABLE Product (
    pid INT PRIMARY KEY,
    name CHAR(20),
    description VARCHAR(200),
    maker CHAR(10) REFERENCES Company(name)
)
Issues Represent attributes inside the records Represent the records inside the blocks
Record Formats: Fixed Length Base address (B) Address of the third field = B+L1+L2 Information about field types same for all records in a file; stored in system catalogs. Finding i’th field does not require scan of record: its offset follows from the field lengths. Note the importance of schema information!
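With fixed-length fields, the address of a field is the base address plus the lengths of the fields before it. A small sketch, where the field lengths and base address are hypothetical values chosen for illustration:

```python
from itertools import accumulate


def field_offsets(lengths):
    """Offset of field i = sum of the lengths of the fields before it."""
    return [0] + list(accumulate(lengths))[:-1]


def field_address(base, lengths, i):
    """Address of the i'th field (0-based) of a record starting at `base`."""
    return base + field_offsets(lengths)[i]


lengths = [4, 20, 10, 8]   # hypothetical L1..L4 in bytes
B = 1000                   # hypothetical base address of the record
print(field_address(B, lengths, 2))  # 1024, i.e. B + L1 + L2
```

The lengths come from the schema in the system catalog, so they need not be stored in each record.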
Record Header [Figure: a record with a header (pointer to schema, length, timestamp) followed by fields F1 F2 F3 F4 of lengths L1 L2 L3 L4] Need the header because: The schema may change; for a while new+old may coexist Records from different relations may coexist
Variable Length Records [Figure: a record with a header (length, other header information) followed by fields F1 F2 F3 F4 of lengths L1 L2 L3 L4] Place the fixed fields first: F1 Then the variable length fields: F2, F3, F4 Null values take 2 bytes only Sometimes they take 0 bytes (when at the end)
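One possible layout along these lines is sketched below: a header with the record length and one 2-byte end offset per variable-length field, then the fixed field, then the variable-length bytes. The exact layout and the function names are assumptions for illustration; real systems differ in the details.

```python
import struct


def encode_record(pid, strings):
    """Pack one fixed-length INT field followed by variable-length string fields.

    Hypothetical layout: [record length][end offset of each string][pid][string bytes]
    All header entries are 2-byte unsigned integers.
    """
    payloads = [s.encode("utf-8") for s in strings]
    header_size = 2 * (1 + len(payloads))
    pos = header_size + 4                 # variable part starts after the 4-byte pid
    ends = []
    for p in payloads:
        pos += len(p)
        ends.append(pos)
    header = struct.pack(f"<{1 + len(ends)}H", pos, *ends)
    return header + struct.pack("<i", pid) + b"".join(payloads)


def decode_record(buf, num_strings):
    """Inverse of encode_record; field boundaries come from the header offsets."""
    header_size = 2 * (1 + num_strings)
    _length, *ends = struct.unpack_from(f"<{1 + num_strings}H", buf, 0)
    (pid,) = struct.unpack_from("<i", buf, header_size)
    starts = [header_size + 4] + ends[:-1]
    return pid, [buf[s:e].decode("utf-8") for s, e in zip(starts, ends)]


rec = encode_record(7, ["gizmo", "", "GizmoWorks"])
print(len(rec), decode_record(rec, 3))
```

In this sketch an empty field occupies only its 2-byte header entry, in the spirit of the remark about null values above.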
Records With Repeating Fields [Figure: a record with a header (length, other header information) followed by fields F1 F2 F3 of lengths L1 L2 L3] Needed e.g. in Object Relational systems, or fancy representations of many-many relationships
Storing Records in Blocks Blocks have fixed size (typically 4k – 8k) [Figure: a block holding records R1, R2, R3, R4]
Spanning Records Across Blocks When records are very large Or even medium size: saves space in blocks [Figure: two blocks, each with a block header; record R2 is split across both blocks]
BLOB Binary large objects Supported by modern database systems E.g. images, sounds, etc. Storage: attempt to cluster blocks together CLOB = character large object Supports only restricted operations
Modifications: Insertion File is unsorted: add it to the end (easy) File is sorted: Is there space in the right block ? Yes: we are lucky, store it there Is there space in a neighboring block ? Look 1-2 blocks to the left/right, shift records If everything else fails, create overflow block
Overflow Blocks [Figure: Block n-1, Block n, Block n+1, with an overflow block attached] After a while the file starts being dominated by overflow blocks: time to reorganize
Modifications: Deletions Free space in block, shift records May be able to eliminate an overflow block Can never really eliminate the record, because others may point to it Place a tombstone instead (a NULL record) How can we point to a record in an RDBMS ?
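The tombstone idea can be sketched as follows. The `Block` class and its slot numbering are hypothetical; the point is only that a deleted record leaves a marker behind so that existing pointers never silently reach a different record.

```python
TOMBSTONE = None  # stands in for the NULL record left behind by a delete


class Block:
    def __init__(self):
        self.slots = []                 # slot number -> record or TOMBSTONE

    def insert(self, record):
        self.slots.append(record)
        return len(self.slots) - 1      # the slot number acts as the record's address

    def delete(self, slot):
        self.slots[slot] = TOMBSTONE    # keep the slot so old pointers stay valid

    def lookup(self, slot):
        record = self.slots[slot]
        if record is TOMBSTONE:
            raise KeyError(f"record in slot {slot} was deleted")
        return record


block = Block()
a = block.insert("R1")
b = block.insert("R2")
block.delete(a)
print(block.lookup(b))  # R2
```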
Modifications: Updates If new record is shorter than previous, easy If it is longer, need to shift records, create overflow blocks
Pointers Physical addresses Where do we need them in RDBMS ? Each block and each record have a physical address that consists of: The host The disk The cylinder number The track number The block within the track For records: an offset in the block’s header Note: review what a pointer in C is
Pointers Logical address: a string of bytes (10-16) More flexible: can move blocks/records around But need translation table: Logical address → Physical address: L1 → P1, L2 → P2, L3 → P3
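The flexibility of logical addresses can be sketched with a small translation table. The `MapTable` class and the tuple used as a physical address are hypothetical choices for this example.

```python
class MapTable:
    def __init__(self):
        self.table = {}                    # logical address -> physical address

    def register(self, logical, physical):
        self.table[logical] = physical

    def resolve(self, logical):
        return self.table[logical]

    def move(self, logical, new_physical):
        # Only the table entry changes; every stored logical pointer stays valid.
        self.table[logical] = new_physical


table = MapTable()
table.register("L1", ("disk0", 12, 3, 7))   # hypothetical (disk, cylinder, track, block)
pointer = "L1"                              # a pointer stored inside some other record
table.move("L1", ("disk0", 40, 1, 2))
print(table.resolve(pointer))  # ('disk0', 40, 1, 2)
```

The price is one extra table lookup on every dereference.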
Main Memory Address When the block is read in main memory, it receives a main memory address Need another translation table Memory address → Logical address: M1 → L1, M2 → L2, M3 → L3
Optimization: Pointer Swizzling = the process of replacing a physical/logical pointer with a main memory pointer Still need translation table, but subsequent references are faster
Pointer Swizzling [Figure: Block 1 and Block 2 on disk; Block 1 is read into memory, showing one swizzled and one unswizzled pointer]
Pointer Swizzling Automatic: when block is read in main memory, swizzle all pointers in the block On demand: swizzle only when user requests No swizzling: always use translation table
Pointer Swizzling When blocks return to disk: pointers need to be unswizzled Danger: someone else may point to this block Pinned blocks: we don’t allow them to return to disk Keep a list of references to this block
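A rough sketch of on-demand swizzling and unswizzling follows. The `Pointer` and `BlockCache` classes, the dict standing in for the disk, and the `lookups` counter are all assumptions made for illustration; pinning and the list of references to a block are left out.

```python
class Pointer:
    def __init__(self, logical):
        self.logical = logical
        self.target = None          # in-memory block once swizzled


class BlockCache:
    def __init__(self, disk):
        self.disk = disk            # hypothetical dict: logical address -> block contents
        self.memory = {}            # translation table: logical address -> in-memory block
        self.lookups = 0

    def load(self, logical):
        self.lookups += 1           # count trips through the translation table
        if logical not in self.memory:
            self.memory[logical] = self.disk[logical]
        return self.memory[logical]

    def follow(self, ptr):
        if ptr.target is None:                  # unswizzled: go through the table once
            ptr.target = self.load(ptr.logical)
        return ptr.target                       # swizzled: direct memory reference

    def unswizzle(self, ptr):
        ptr.target = None                       # needed before the block leaves memory


cache = BlockCache({"L1": ["R1", "R2"]})
p = Pointer("L1")
for _ in range(3):
    cache.follow(p)
print(cache.lookups)  # 1
```

Only the first dereference pays for the table lookup; later ones use the memory reference directly until the pointer is unswizzled.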