The Relational Model (cont’d) Introduction to Disks and Storage CS 186, Spring 2007, Lecture 3 Cow book Section 1.5, Chapter 3 (cont’d) Cow book Chapter 9 Mary Roth
Administrivia Homework 0 due today 10 p.m.! Nathan and Erinaios posted their office hours on class homepage Homework 1 available today from class web site Submit team members online Read thru description; we’ll talk more about it after today’s lecture Questions from last time?
Outline What we learned last time –Components of a DBMS –Relational Data Model New stuff –Storage, Disks and Files
Review: Components of a DBMS Query Optimization and Execution Relational Operators Files and Access Methods Buffer Management Disk Space Management DB A DBMS is like an ogre; it has layers Today we go here…
Review: Relational Data Model Most widely used data model today. Relations: –Schema : specifies name of relation, plus name and type of each column. –Instance : a table, with rows and columns that contain data. SQL is a query language for relational data model –DDL: To define/modify/change schemas –DML: To query data in table. Keys are a way to associate tuples in different relations
Let’s return to our bank… Can we apply a relational model to our bank spreadsheet? Customer ID NameAddressAccount ID CREATE TABLE CUSTOMERS (CustomerID INTEGER, Name VARCHAR(128), Address VARCHAR(256), AccountID INTEGER); CREATE TABLE ACCOUNTS (AccountID INTEGER, Balance Double); Frodo Baggins BagEnd Sam Gamgee BagShot Row Bilbo Baggins Rivendell Account ID Balance
A set of fields is a superkey if: –No two distinct tuples can have same values in all key fields A set of fields is a key for a relation if : –It is a superkey –No subset of the fields is a superkey what if >1 key for a relation? –One of the keys is chosen (by DBA) to be the primary key. Other keys are called candidate keys. e.g. –{sid, gpa} is an example of a superkey. –sid is a key for Students. –what about name? login? Primary Keys sidnameloginagegpa Students
Primary and Candidate Keys in SQL Keys must be chosen and defined carefully! They imply semantics! What does this set of key definitions imply about students? “Students can take only one course, and no two students in a course receive the same grade.” CREATE TABLE Enrolled (sid CHAR(20) cid CHAR(20), grade CHAR(2), PRIMARY KEY (sid), UNIQUE (cid, grade))
Primary and Candidate Keys in SQL Better definition: “ For a given student and course, there is a single grade.” CREATE TABLE Enrolled (sid CHAR(20) cid CHAR(20), grade CHAR(2), PRIMARY KEY (sid,cid))
Foreign Keys, Referential Integrity Foreign key: Set of fields in one relation that is used to `refer’ to a tuple in another relation. –Must correspond to the primary key of the other relation. –Like a `logical pointer’. Plays the same role as the physical pointer in IMS If all foreign keys in a table refer to tuples in the other, referential integrity is achieved (i.e., no dangling references.)
Foreign Keys in SQL E.g. Only students listed in the Students relation should be allowed to enroll for courses. – sid is a foreign key referring to Students: CREATE TABLE Enrolled (sid CHAR(20),cid CHAR(20),grade CHAR(2), PRIMARY KEY (sid,cid), FOREIGN KEY (sid) REFERENCES Students ) sidcidgrade 53666Carnatic101C 53666Reggae203B 53650Topology112A 53666History105B Enrolled sidnameloginagegpa Students English102 A
Let’s return to our bank… Can we define keys for our relations? CREATE TABLE CUSTOMERS (CustomerID INTEGER NOT NULL, Name VARCHAR(128), Address VARCHAR(256), AccountID INTEGER, PRIMARY KEY(CustomerID), FOREIGN KEY(accountid) references ACCOUNTS); CREATE TABLE ACCOUNTS (AccountID INTEGER NOT NULL, Balance Double, PRIMARY KEY (AccountID)); Why do we need NOT NULL? What would happen if I executed these commands in this order?
Let’s return to our bank… We’ll come back to these later… Write a SQL query (DML) that returns the names and account balances for all customers that have an account balance > Write a SQL query (DML) that withdraws $300 from Frodo’s account.
Intermission
Disks, Memory, and Files Query Optimization and Execution Relational Operators Files and Access Methods Buffer Management Disk Space Management DB You are here…
Disks and Files DBMS stores information on disks. –Data must be transferred to and from disk and RAM –READ: transfer data from disk to main memory (RAM). –WRITE: transfer data from RAM to disk. READ and WRITE are expensive and must be planned carefully! –DBMS architecture is designed to minimize both
Why Not Store Everything in Main Memory? Costs too much. For ~$300, PCConnection will sell you: –~1GB of RAM –~30GB of flash –~1 TB of disk Main memory is volatile. We want data to be saved between runs. (Obviously!)
The Storage Hierarchy Source: Operating Systems Concepts 5th Edition –Main memory (RAM) for currently used data. –Disk for the main database (secondary storage). –Tapes for archiving older versions of the data (tertiary storage). Smaller, Faster Bigger, Slower
Jim Gray’s Storage Latency Analogy: How Far Away is the Data?
Disks Secondary storage device of choice. Main advantage over tapes: – faster time to retrieve –random access vs. sequential. Data is stored and retrieved in units called disk blocks or pages. Unlike RAM, time to retrieve a disk block varies depending upon location on disk. –Therefore, relative placement of blocks on disk has major impact on DBMS performance!
Components of a Disk Platters The platters spin (say, 120 rps). Spindle The arm assembly is moved in or out to position a head on a desired track. Tracks under heads make a cylinder (imaginary!). Disk head Arm movement Arm assembly Only one head reads/writes at any one time. Tracks Sector v Block size is a multiple of sector size (which is fixed).
Accessing a Disk Page Time to access (read/write) a disk block: –seek time ( moving arms to position disk head on track ) –rotational delay ( waiting for block to rotate under head ) –transfer time ( actually moving data to/from disk surface ) Arm movement Seek time Rotational delay Transfer time
Accessing a Disk Page Seek time and rotational delay dominate. –Seek time varies between about 0.3 and 10msec –Rotational delay varies from 0 to 4msec –Transfer rate around.08msec per 8K block Key to lower I/O cost: reduce seek/rotation delays!
Arranging Pages on Disk `Next’ block concept: –blocks on same track, followed by –blocks on same cylinder, followed by –blocks on adjacent cylinder Blocks in a file should be arranged sequentially on disk (by `next’), to minimize seek and rotational delay. For a sequential scan, pre-fetching several pages at a time is a big win!
Summary: Disk Space Manager Lowest layer of DBMS software manages space on disk (using OS file system or not?). Higher levels call upon this layer to: –allocate/de-allocate a page –read/write a page Best if a request for a sequence of pages is satisfied by pages stored sequentially on disk! –Responsibility of disk space manager. –Higher levels don’t know how this is done, or how free space is managed. –Though they may make performance assumptions! Hence disk space manager should do a decent job.
Buffer Management Query Optimization and Execution Relational Operators Files and Access Methods Buffer Management Disk Space Management DB You are here… Data must be in RAM for DBMS to operate on it! Buffer Mgr hides the fact that not all data is in RAM
Buffer Management in a DBMS DB MAIN MEMORY DISK disk page free frame Page Requests from Higher Levels BUFFER POOL choice of frame dictated by replacement policy Buffer pool information table contains:
Requesting a page 22 MAIN MEMORY DISK disk page free frames BUFFER POOL …… Higher level DBMS component I need page 3 Disk Mgr Buf Mgr I need page * If requests can be predicted (e.g., sequential scans) pages can be pre-fetched several pages at a time!
Releasing a page 22 MAIN MEMORY DISK disk page free frames BUFFER POOL …… Higher level DBMS component I read page 3 and I’m done with it Disk Mgr Buf Mgr 3
Releasing a page 22 MAIN MEMORY DISK disk page free frames BUFFER POOL …… Higher level DBMS component I wrote on page 3 and I’m done with it Disk Mgr Buf Mgr 3’
More on Buffer Management Requestor of page must eventually unpin it, and indicate whether page has been modified: –dirty bit is used for this. Page in pool may be requested many times, –a pin count is used. –To pin a page, pin_count++ – A page is a candidate for replacement iff pin count == 0 (“unpinned”) CC & recovery may entail additional I/O when a frame is chosen for replacement. –Write-Ahead Log protocol; more later!