Lecture 15: Midterm Review Data Storage

Monday, October 31, 2005

Midterm
- Monday, 11:30, this room (in class)
- Open book, 50 minutes
- Notes, notebooks, anything
- No computers

Midterm
- SQL
- E/R Diagrams
- Functional Dependencies
- XML/XPath/XQuery

SQL
- Know the basics: SFW, GROUP-BY, HAVING…
- When are two queries equivalent?
- Eliminating subqueries
- Be aware of duplicates
- Insert/delete, especially more than one tuple
- Constraints in SQL

E/R Diagrams
- Good design (don’t make stupid mistakes)
- Translation to relations
- Many-many vs. many-one relationships
- Subtleties:
  - Inheritance
  - Union types
  - Weak entity sets

Functional Dependencies
- Know the definition of X → Y
- Does a given table satisfy X → Y?
- Understand inference
  - If A → B, B → C, does it follow that C → A? Why? Why not?
- Understand closure: X+
- Understand BCNF and 3NF
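The inference question above can be checked mechanically with the standard attribute-closure computation. The following is a minimal sketch, assuming functional dependencies are given as (left side, right side) pairs of attribute strings; the function name `closure` is chosen here for illustration.

```python
def closure(attrs, fds):
    """Return X+, the set of attributes determined by attrs under fds.

    fds is a list of (lhs, rhs) pairs; each side is an iterable of
    single-character attribute names, e.g. ("AB", "C") for AB -> C.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the whole left side is already determined, add the right side.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


fds = [("A", "B"), ("B", "C")]
print(sorted(closure("A", fds)))  # A determines A, B and C
print(sorted(closure("C", fds)))  # C determines only itself
```

Since A is not in the closure of C, the dependency C → A does not follow from A → B and B → C.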

XML
- Basics in XPath and XQuery
- In what sense is XML “semistructured”?
- Mapping relations to XML
- Simple ways to store XML data
- Exclude the XML index

Midterm
How to prepare:
- Read lecture notes
- Read from the textbook
- Review the homeworks
- Make sure you understand

Outline
- Disks: 11.3 (recommended reading: entire chapter 11)
- Representing data elements: 12

The Mechanics of Disk
Mechanical characteristics:
- Rotation speed (5400 RPM)
- Number of platters (1-30)
- Number of tracks (<= 10000)
- Number of bytes/track (10^5)
Unit of read or write: disk block
- Once in memory: page
- Typically: 4k or 8k or 16k
[Figure labels: spindle, platters, tracks, cylinder, sector, disk head, arm assembly, arm movement]

Disk Access Characteristics
- Disk latency = time between when command is issued and when data is in memory
- Disk latency = seek time + rotational latency
  - Seek time = time for the head to reach the cylinder: 10ms – 40ms
  - Rotational latency = time for the sector to rotate under the head
    - Rotation time = 10ms
    - Average latency = 10ms/2
- Transfer rate = typically 40MB/s
- Disks read/write one block at a time
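As a rough worked example, the sketch below adds up seek time, average rotational latency, and transfer time for a single block access. The function name and the sample numbers (6000 RPM, which corresponds to the 10ms rotation quoted above, and an 8000-byte block) are assumptions for illustration.

```python
def access_time_ms(seek_ms, rpm, block_bytes, transfer_mb_per_s):
    """Estimate the time to read one block, in milliseconds."""
    rotation_ms = 60_000 / rpm                # time for one full rotation
    rotational_latency_ms = rotation_ms / 2   # on average, half a rotation
    transfer_ms = block_bytes / (transfer_mb_per_s * 1_000_000) * 1000
    return seek_ms + rotational_latency_ms + transfer_ms


print(access_time_ms(seek_ms=10, rpm=6000, block_bytes=8000, transfer_mb_per_s=40))
```

With these numbers the total is about 15.2 ms, of which only 0.2 ms is the transfer itself. Seek and rotation dominate the cost of a random block read.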

Average Seek Time
- Suppose we have N tracks; what is the average seek time?
- Getting from cylinder x to y takes time |x-y|
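One way to explore the question is to average |x-y| over all pairs of start and target cylinders. This is a brute-force sketch, assuming both cylinders are chosen uniformly and independently from 0..N-1; the function name is made up for illustration.

```python
def average_seek_distance(n):
    """Average of |x - y| over all ordered pairs of cylinders in 0..n-1."""
    total = sum(abs(x - y) for x in range(n) for y in range(n))
    return total / (n * n)


for n in (10, 100, 300):
    print(n, average_seek_distance(n), n / 3)
```

Under this uniform assumption the exact average is (N^2 - 1)/(3N), which approaches N/3 as N grows.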

RAID
Several disks that work in parallel
- Redundancy: use parity to recover from disk failure
- Speed: read from several disks at once
Various configurations (called levels):
- RAID 1 = mirror
- RAID 4 = n disks + 1 parity disk
- RAID 5 = n+1 disks, assign parity blocks round robin
- RAID 6 = “Hamming codes”
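The parity idea behind RAID 4 and RAID 5 can be sketched with a byte-wise XOR. This is a toy illustration, not a storage-system implementation; the function names and the tiny 2-byte blocks are assumptions.

```python
from functools import reduce


def parity_block(blocks):
    """XOR equal-length blocks byte by byte to produce the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_block(surviving_blocks, parity):
    """Rebuild the one missing data block from the survivors and the parity."""
    return parity_block(list(surviving_blocks) + [parity])


data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = parity_block(data)
# Pretend the disk holding data[1] failed.
print(recover_block([data[0], data[2]], parity) == data[1])
```

XOR-ing the parity with all surviving blocks cancels them out and leaves the lost block, which is why a single parity disk tolerates exactly one disk failure.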

Buffer Management in a DBMS
[Figure: page requests (READ, WRITE) from higher levels arrive at the buffer pool in main memory; disk pages move between the DB on disk and free frames in the pool (INPUT, OUTPUT); choice of frame dictated by replacement policy]
- Data must be in RAM for DBMS to operate on it!
- Table of <frame#, pageid> pairs is maintained

Buffer Manager
- Manages buffer pool: the pool provides space for a limited number of pages from disk.
- Needs to decide on page replacement policy
  - LRU
  - Clock algorithm
  - Both work well in OS, but not always in DB
- Enables the higher levels of the DBMS to assume that the needed data is in main memory.
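To make the LRU policy concrete, here is a minimal sketch of a buffer pool that evicts the least recently used page when all frames are full. The class name, the `read_page` callback, and the `evicted` log are hypothetical; a real buffer manager would also track pinned and dirty pages, which this sketch ignores.

```python
from collections import OrderedDict


class LRUBufferPool:
    def __init__(self, num_frames, read_page):
        self.num_frames = num_frames
        self.read_page = read_page    # hypothetical callback: page id -> contents
        self.frames = OrderedDict()   # page id -> contents, least recent first
        self.evicted = []             # eviction log, kept only for illustration

    def get_page(self, page_id):
        if page_id in self.frames:
            # Hit: mark the page as most recently used.
            self.frames.move_to_end(page_id)
            return self.frames[page_id]
        if len(self.frames) >= self.num_frames:
            # Miss with a full pool: evict the least recently used page.
            victim, _ = self.frames.popitem(last=False)
            self.evicted.append(victim)
        self.frames[page_id] = self.read_page(page_id)
        return self.frames[page_id]


pool = LRUBufferPool(2, lambda pid: f"page{pid}".encode())
for pid in (1, 2, 1, 3):
    pool.get_page(pid)
print(pool.evicted, list(pool.frames))
```

In this run page 2 is evicted when page 3 arrives, because page 1 was touched more recently.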

Buffer Manager
Why not use the Operating System for the task?
- DBMS may be able to anticipate access patterns
- Hence, may also be able to perform prefetching
- DBMS needs the ability to force pages to disk, for recovery purposes

Representing Data Elements
Relational database elements:
- A tuple is represented as a record
- The table is a sequence of records

CREATE TABLE Product (
    pid INT PRIMARY KEY,
    name CHAR(20),
    description VARCHAR(200),
    maker CHAR(10) REFERENCES Company(name)
)

Issues
- Represent attributes inside the records
- Represent the records inside the blocks

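Record Formats: Fixed Length
[Figure: record starting at base address (B); with field lengths L1, L2, the third field is at address B+L1+L2]
- Information about field types same for all records in a file; stored in system catalogs.
- Finding i’th field does not require a scan of the record: its address is the base address plus the lengths of the preceding fields.
- Note the importance of schema information!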
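The address computation B+L1+L2 generalizes to any field. A minimal sketch, assuming the field lengths come from the catalog as a list of byte counts (the lengths and base address below are made-up values):

```python
def field_address(base, field_lengths, i):
    """Address of field i (0-based) in a fixed-length record starting at base."""
    return base + sum(field_lengths[:i])


lengths = [4, 20, 10]  # hypothetical lengths L1, L2, L3 in bytes
print(field_address(1000, lengths, 2))  # B + L1 + L2
```

Only the schema is consulted; the record bytes themselves are never read to locate a field.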

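Record Header
[Figure: record with a header followed by fields F1 F2 F3 F4 of lengths L1 L2 L3 L4; the header holds a pointer to the schema, the record length, and a timestamp]
Need the header because:
- The schema may change; for a while new+old may coexist
- Records from different relations may coexist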

Variable Length Records
[Figure: header with record length and other header information, followed by fields F1 F2 F3 F4 of lengths L1 L2 L3 L4]
- Place the fixed fields first: F1
- Then the variable length fields: F2, F3, F4
- Null values take 2 bytes only
- Sometimes they take 0 bytes (when at the end)
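One possible byte layout along these lines is sketched below: a header with the record length and one 2-byte end offset per variable-length field, then the fixed field, then the variable fields. The exact layout, the 4-byte integer field, and the function names are assumptions for illustration, not the format of any particular system.

```python
import struct


def encode_record(pid, var_fields):
    """Header (total length + end offsets), fixed 4-byte pid, then variable fields."""
    payloads = [s.encode("utf-8") for s in var_fields]
    header_size = 2 + 2 * len(payloads)
    pos = header_size + 4              # variable data starts after the pid
    ends = []
    for p in payloads:
        pos += len(p)
        ends.append(pos)               # end offset of each variable field
    header = struct.pack(f"<H{len(ends)}H", pos, *ends)
    return header + struct.pack("<I", pid) + b"".join(payloads)


def decode_record(record, num_var_fields):
    """Inverse of encode_record; returns (pid, list of strings)."""
    header_size = 2 + 2 * num_var_fields
    _total, *ends = struct.unpack_from(f"<H{num_var_fields}H", record, 0)
    (pid,) = struct.unpack_from("<I", record, header_size)
    start = header_size + 4
    fields = []
    for end in ends:
        fields.append(record[start:end].decode("utf-8"))
        start = end
    return pid, fields


rec = encode_record(7, ["gizmo", "", "GizmoWorks"])
print(len(rec), decode_record(rec, 3))
```

In this layout an empty field costs only its 2-byte offset entry in the header, which is in the spirit of the remark that null values take 2 bytes.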

Records With Repeating Fields
[Figure: header with record length and other header information, followed by fields F1 F2 F3 of lengths L1 L2 L3]
- Needed e.g. in Object Relational systems, or fancy representations of many-many relationships

Storing Records in Blocks
- Blocks have fixed size (typically 4k – 8k)
[Figure: a BLOCK holding records R1 R2 R3 R4]

Spanning Records Across Blocks
- When records are very large
- Or even medium size: saves space in blocks
[Figure: two blocks, each with a block header, holding records R1, R2, R3, with R2 spanning both blocks]

BLOB
- Binary large objects
- Supported by modern database systems
- E.g. images, sounds, etc.
- Storage: attempt to cluster blocks together
CLOB = character large object
- Supports only restricted operations

Modifications: Insertion
- File is unsorted: add it to the end (easy)
- File is sorted:
  - Is there space in the right block?
    - Yes: we are lucky, store it there
  - Is there space in a neighboring block?
    - Look 1-2 blocks to the left/right, shift records
  - If everything else fails, create overflow block
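The sorted-file case can be sketched as follows. This is a simplified illustration, assuming blocks are non-empty sorted lists of keys with a fixed capacity; it covers the "space in the right block" and "overflow block" cases only and deliberately omits shifting records into neighboring blocks. The class and attribute names are hypothetical.

```python
import bisect


class SortedFile:
    def __init__(self, blocks, capacity):
        self.blocks = blocks        # list of non-empty sorted lists of keys
        self.capacity = capacity    # maximum number of records per block
        self.overflow = {i: [] for i in range(len(blocks))}

    def insert(self, key):
        # The "right" block is the last one whose first key is <= key.
        firsts = [b[0] for b in self.blocks]
        i = max(bisect.bisect_right(firsts, key) - 1, 0)
        if len(self.blocks[i]) < self.capacity:
            bisect.insort(self.blocks[i], key)   # lucky: room in the right block
            return "block"
        self.overflow[i].append(key)             # otherwise use its overflow block
        return "overflow"


f = SortedFile([[1, 3], [10, 12, 14]], capacity=3)
print(f.insert(2), f.insert(11))
print(f.blocks, f.overflow)
```

Here key 2 fits into the first block, while key 11 belongs in the second block, which is already full, so it lands in that block's overflow list.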

Overflow Blocks
[Figure: Block n-1, Block n, Block n+1, with an overflow block attached]
- After a while the file starts being dominated by overflow blocks: time to reorganize

Modifications: Deletions
- Free space in block, shift records
  - May be able to eliminate an overflow block
- Can never really eliminate the record, because others may point to it
  - Place a tombstone instead (a NULL record)
- How can we point to a record in an RDBMS?

Modifications: Updates
- If new record is shorter than previous, easy
- If it is longer, need to shift records, create overflow blocks

Pointers
- Where do we need them in RDBMS?
Physical addresses
- Each block and each record have a physical address that consists of:
  - The host
  - The disk
  - The cylinder number
  - The track number
  - The block within the track
  - For records: an offset in the block’s header
- Note: review what a pointer in C is

Pointers
Logical address: a string of bytes (10-16)
- More flexible: can move blocks/records around
- But need translation table:

  Logical address   Physical address
  L1                P1
  L2                P2
  L3                P3

Main Memory Address
- When the block is read in main memory, it receives a main memory address
- Need another translation table:

  Memory address   Logical address
  M1               L1
  M2               L2
  M3               L3

Optimization: Pointer Swizzling
- = the process of replacing a physical/logical pointer with a main memory pointer
- Still need translation table, but subsequent references are faster

Pointer Swizzling
[Figure: Block 1 and Block 2 on disk; after a block is read into memory, some of its pointers are swizzled and others remain unswizzled]

Pointer Swizzling
- Automatic: when block is read in main memory, swizzle all pointers in the block
- On demand: swizzle only when user requests
- No swizzling: always use translation table
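The on-demand strategy can be sketched in a few lines: a pointer starts out holding only a logical address and is replaced by a direct in-memory reference the first time it is followed. The `Pointer` and `BlockStore` classes, the dictionary standing in for the disk, and the `lookups` counter are all hypothetical names used for illustration.

```python
class Pointer:
    def __init__(self, logical_addr):
        self.logical_addr = logical_addr
        self.target = None            # set to the in-memory object once swizzled


class BlockStore:
    def __init__(self, disk):
        self.disk = disk              # stand-in for disk: logical address -> block
        self.in_memory = {}           # translation table: logical address -> object
        self.lookups = 0              # counts translation-table lookups

    def load(self, logical_addr):
        self.lookups += 1
        if logical_addr not in self.in_memory:
            self.in_memory[logical_addr] = self.disk[logical_addr]
        return self.in_memory[logical_addr]

    def follow(self, ptr):
        if ptr.target is None:
            # Unswizzled: translate once, then remember the memory reference.
            ptr.target = self.load(ptr.logical_addr)
        return ptr.target


store = BlockStore({"L1": ["r1", "r2"]})
p = Pointer("L1")
first = store.follow(p)
second = store.follow(p)
print(first is second, store.lookups)
```

The second dereference skips the translation table entirely, which is the speedup swizzling is meant to provide. Automatic swizzling would instead resolve every pointer in a block at load time.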

Pointer Swizzling
- When blocks return to disk: pointers need to be unswizzled
- Danger: someone else may point to this block
  - Pinned blocks: we don’t allow it to return to disk
  - Keep a list of references to this block
