Database Techniek
Martin Kersten, Peter Boncz (CWI)


©Silberschatz, Korth and Sudarshan, Database System Concepts

Outline
- Introduction & Course Organization
- Recap of Introductory Database Course
  - SQL
  - Relational Algebra (X100 flavor)
- Storage and File Structures

Why a DBMS?
Main advantages:
- Centralization (at least conceptually)
- Data independence (physical changes don't break legacy apps)
- Declarative data integrity constraints
- Atomic actions (the DBMS recovers consistently from a system crash)
- Consistency under multi-user concurrent updates
- Declarative and powerful query language, automatically optimized
- Multi-user security
The DBMS is now the basic building block of all information systems.
Almost everybody in IT works with a DBMS on a daily basis.

Application Architectures
- Two-tier architecture (aka "client-server"): e.g. client programs using ODBC/JDBC to communicate with a database
- Three-tier architecture: e.g. web-based applications (e.g. LAMP), or application servers (e.g. JBoss, BEA)

Goal: gain insight into the implementation techniques inside a relational DBMS.
Grading:
- Grade = (2*exam + practicum)/3
- exam >= 6, practicum >= 6
Literature:
- A. Silberschatz et al., 'Database System Concepts', 4th ed., McGraw-Hill, 2002

Lectures

#  Topic               Material            Lecturer        Date
1  SQL + X100 Alg      Ch. 4 + X100 doc    Kersten/Boncz   Feb 8
   Storage + B-Trees   Ch. 11-12
2  Query Processing    Ch. 13              Boncz           Feb 15
   Query Optimization  Ch. 14              Boncz           Feb 22
3  Transactions        Ch. 15-17           Kersten         Mar 1
4  MonetDB/SQL                             Kersten/Nes     Mar 8
5  MonetDB/XQuery                          Kersten/Boncz   Mar 15

Exam: last week of March

Practicum
- Assignment 0: Hands-on experience with relational DBMSs & SQL
- Assignment 1: Translating SQL to X100 algebra ("by hand")
- Assignment 2 (choose one of):
  a) Building logical cost functions for X100 algebra operations ("by hand")
  b) Analyse and explain the behaviour of a query optimizer
Supervisor: Marc Makkes
Hard deadlines (first: Saturday, February 17, 2007, 23:59:59 CET!)
Work in pairs

Outline
- Introduction & Course Organization
- Recap of Introductory Database Course
  - SQL
  - Relational Algebra (X100 flavor)
- Storage and File Structures

SQL re-cap: Basic Structure
A typical SQL query has the form:

    select A1, A2, ..., An
    from r1, r2, ..., rm
    where P

- The Ai represent attributes
- The ri represent relations
- P is a predicate
This query is equivalent to the relational algebra expression:

    project_{A1, A2, ..., An} (select_P (r1 join_true r2 join_true ... join_true rm))

The result of an SQL query is again a relation.
SQL relations may have duplicates.
- Use select distinct to get a set
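The point about duplicates can be checked with a small sketch using Python's built-in sqlite3 module. The table `account`, its columns and its rows are hypothetical and only serve as an illustration.

```python
import sqlite3

# Hypothetical toy table; names and values are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("create table account (branch_name text, balance integer)")
conn.executemany(
    "insert into account values (?, ?)",
    [("Perryridge", 400), ("Perryridge", 900), ("Brighton", 750)],
)

# A plain select keeps duplicates (bag semantics).
bag = [row[0] for row in conn.execute("select branch_name from account")]

# select distinct removes them (set semantics).
distinct = [row[0] for row in conn.execute("select distinct branch_name from account")]

print(sorted(bag))
print(sorted(distinct))
```

The first list contains Perryridge twice, the second only once.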

Aggregate Queries
Find the names of all branches where the average account balance is more than $1,200.

    select branch-name, avg (balance)
    from account
    group by branch-name
    having avg (balance) > 1200

Note: predicates in the having clause are applied after the formation of groups, whereas predicates in the where clause are applied before forming groups.
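The difference between where and having can be sketched with sqlite3. The rows below are made up, and the column is named `branch_name` (with an underscore) because SQLite does not accept the textbook's hyphenated identifiers unquoted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table account (branch_name text, balance integer)")
conn.executemany("insert into account values (?, ?)", [
    ("Perryridge", 400), ("Perryridge", 2600),   # group average 1500
    ("Brighton", 900), ("Brighton", 1300),       # group average 1100
    ("Redwood", 1250),                           # group average 1250
])

# having: the predicate is applied to each group after grouping.
having_rows = conn.execute(
    "select branch_name, avg(balance) from account "
    "group by branch_name having avg(balance) > 1200 order by branch_name"
).fetchall()

# where: the predicate filters individual rows before groups are formed.
where_rows = conn.execute(
    "select branch_name, avg(balance) from account "
    "where balance > 1200 group by branch_name order by branch_name"
).fetchall()

print(having_rows)
print(where_rows)
```

With having, Brighton drops out because its group average is 1100. With where, Brighton stays because one of its rows exceeds 1200, and the averages are computed over the surviving rows only.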

Ordering the Display of Tuples
List in alphabetic order the names of all customers having a loan in Perryridge branch:

    select customer-name
    from borrower, loan
    where borrower.loan-number = loan.loan-number
      and branch-name = 'Perryridge'
    order by customer-name

We may specify desc for descending order or asc for ascending order, for each attribute; ascending order is the default.
- E.g. order by customer-name desc
We may restrict the result to the first N tuples.
- E.g. order by customer-name limit N

Nested Subqueries
- SQL provides a mechanism for the nesting of subqueries.
- A subquery is a select-from-where expression that is nested within another query.
- A common use of subqueries is to perform tests for set membership, set comparisons, and set cardinality.

Example Query
Find all customers who have both an account and a loan at the bank.

With in:

    select distinct customer-name
    from borrower
    where customer-name in (select customer-name from depositor)

With exists:

    select distinct customer-name
    from borrower as B
    where exists (select * from depositor where customer-name = B.customer-name)
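A small sqlite3 sketch suggests that the two formulations return the same customers. The table contents are invented, and identifiers use underscores instead of the textbook's hyphens.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table borrower (customer_name text, loan_number text)")
conn.execute("create table depositor (customer_name text, account_number text)")
conn.executemany("insert into borrower values (?, ?)", [
    ("Hayes", "L-15"), ("Jones", "L-17"), ("Jones", "L-23"), ("Smith", "L-11"),
])
conn.executemany("insert into depositor values (?, ?)", [
    ("Hayes", "A-102"), ("Jones", "A-217"), ("Lindsay", "A-222"),
])

# Set membership test with an uncorrelated subquery.
with_in = conn.execute(
    "select distinct customer_name from borrower "
    "where customer_name in (select customer_name from depositor) "
    "order by customer_name"
).fetchall()

# Existence test with a subquery correlated on the outer alias B.
with_exists = conn.execute(
    "select distinct customer_name from borrower as B "
    "where exists (select * from depositor as D "
    "              where D.customer_name = B.customer_name) "
    "order by customer_name"
).fetchall()

print(with_in, with_exists)
```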

Outline
- Introduction & Course Organization
- Recap of Introductory Database Course
  - SQL
  - Relational Algebra (X100 flavor)
- Storage and File Structures

Relational algebra
[Diagram: query processing pipeline]
SQL -> (parsing, normalization) -> logical algebra -> (logical query optimization, physical query optimization) -> physical algebra -> query execution

The Practicum
[Diagram: the same pipeline, with X100 algebra and the X100 system]
SQL -> (parsing, normalization) -> X100 algebra -> (logical query optimization, physical query optimization) -> physical algebra -> X100 system

X100 relational algebra
MonetDB/X100 is a CWI research project: a high-performance experimental DBMS for, e.g.,
- Data warehousing
- Data mining
- Information retrieval
- Video databases (retrieval by content)
Research goal: study the interaction between modern hardware and database internals
- High-performance algorithms, compression
- E.g. exploit CPU caches, multi-processors, MEMS

X100 relational algebra (Cont.)
X100 has a relational algebra interface:

    Table ::= table(Identifier)
            | select(Table, Expr)
            | project(Table, [ Expr ])
            | join(Table, Table, Expr)
            | aggr(Table, [ Expr ], [ AggrFcn ])
            | order(Table, [ Expr ])
            | topn(Table, [ Expr ], Expr)

    Identifier = Table

select(Table, Expr)
[Table: relation r with attributes A, B, C, D, and the result of the selection]

    select(r, and( ==(A,B), >(D,int('5')) ))

- Functional notation; the C-like equivalent is: A = B and D > 5
- All constants are denoted as a cast: TYPE('string')
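The semantics of select can be sketched in Python, with a relation modelled as a list of dicts. This is an illustration of the operator's meaning only, not X100's implementation, and the tuples of r are hypothetical because the slide's table values are not available.

```python
def select(table, predicate):
    """Keep the tuples of `table` (a list of dicts) that satisfy `predicate`."""
    return [row for row in table if predicate(row)]

# Hypothetical contents for relation r.
r = [
    {"A": 1, "B": 1, "C": 3, "D": 7},
    {"A": 1, "B": 2, "C": 5, "D": 7},
    {"A": 2, "B": 2, "C": 12, "D": 3},
    {"A": 2, "B": 2, "C": 23, "D": 10},
]

# Counterpart of: select(r, and( ==(A,B), >(D,int('5')) ))
result = select(r, lambda t: t["A"] == t["B"] and t["D"] > int("5"))
print(result)
```

Only the first and last tuples qualify: the second fails A = B and the third fails D > 5.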

project(Table, [ Expr ])
[Table: relation r with attributes A, B, C, and the result with attributes A, D]

    project(r, [ A, D=*(C,int('10')) ])

- X100 is a bag algebra: no duplicate elimination
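A minimal Python sketch of a bag-semantics project, assuming relations are lists of dicts and expressions are plain functions; the contents of r are hypothetical.

```python
def project(table, exprs):
    """Compute one output tuple per input tuple; duplicates are NOT removed.

    `exprs` maps each output attribute name to a function of the input tuple.
    """
    return [{name: fn(row) for name, fn in exprs.items()} for row in table]

# Hypothetical contents for relation r.
r = [
    {"A": "a", "B": 10, "C": 1},
    {"A": "a", "B": 20, "C": 1},
    {"A": "b", "B": 30, "C": 1},
    {"A": "b", "B": 40, "C": 2},
]

# Counterpart of: project(r, [ A, D=*(C,int('10')) ])
result = project(r, {"A": lambda t: t["A"], "D": lambda t: t["C"] * int("10")})
print(result)
```

The first two output tuples are identical, and both are kept: the result has as many tuples as the input.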

join(Table, Table, Expr)
[Table: relations r (attributes A, B, C, D) and s (attributes E, F), and the join result with attributes A, B, C, D, E, F]

    join(r, s, ==(B,E))

- The X100 join result is the union of all attributes.
- Name conflicts must be resolved with an extra project. E.g. for a relation t with attributes E and C, first rename C to F:

    project(t, [ E, F=C ])
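The name-conflict rule can be sketched with a nested-loop join in Python. This is a sketch under the assumption that relations are lists of dicts; it rejects inputs that share an attribute name, which is one way to model "must be resolved with an extra project". All tuple values are invented.

```python
def join(left, right, predicate):
    """Nested-loop join; the result tuple is the union of all attributes."""
    result = []
    for l in left:
        for r in right:
            clash = l.keys() & r.keys()
            if clash:
                raise ValueError(f"attribute name conflict: {sorted(clash)}")
            row = {**l, **r}
            if predicate(row):
                result.append(row)
    return result

# Hypothetical relations: both r and t have an attribute called C.
r = [{"A": "x", "B": 1, "C": "p"}, {"A": "y", "B": 2, "C": "q"}]
t = [{"E": 1, "C": "u"}, {"E": 3, "C": "v"}]

# Counterpart of project(t, [ E, F=C ]): rename C to F before joining.
s = [{"E": row["E"], "F": row["C"]} for row in t]

# Counterpart of join(r, s, ==(B,E))
joined = join(r, s, lambda row: row["B"] == row["E"])
print(joined)
```

Joining r with t directly raises the conflict error; joining with the renamed s succeeds.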

aggr(Table, [ Expr ], [ AggrFcn ])

Grouping with an aggregate
[Table: relation account (branch-name, account-number, balance) grouped by branch-name, with branches Perryridge, Brighton and Redwood; result with attributes branch-name, balance]

    aggr( account, [ branch-name ], [ balance = sum(balance) ] )

- Each aggregate has the form: Identifier = AggrFcn(Identifier)
- AggrFcn ::= count() | avg(T) | sum(T) | min(T) | max(T)

Empty groupby-list -> global aggregate
[Table: relation r with attributes A, B, C; result: total = 27]

    aggr( r, [], [ total = sum(C) ] )

Empty AggrFcn-list -> duplicate elimination
[Table: relation account grouped by branch-name; result: the branch-names Perryridge, Brighton, Redwood]

    aggr( account, [ branch-name ], [] )
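The three uses of aggr can be sketched with one small Python function. This is a hash-grouping sketch with relations as lists of dicts; the account balances below are assumptions, since the slide's values are not available.

```python
from collections import defaultdict

def aggr(table, group_by, aggregates):
    """group_by: list of attribute names.
    aggregates: {output_name: (function, input_attribute)}."""
    groups = defaultdict(list)
    for row in table:
        groups[tuple(row[c] for c in group_by)].append(row)
    result = []
    for key, rows in groups.items():
        out = dict(zip(group_by, key))
        for name, (fn, col) in aggregates.items():
            out[name] = fn(r[col] for r in rows)
        result.append(out)
    return result

# Hypothetical account relation (balances are made up).
account = [
    {"branch_name": "Perryridge", "account_number": "A-102", "balance": 400},
    {"branch_name": "Brighton", "account_number": "A-201", "balance": 900},
    {"branch_name": "Brighton", "account_number": "A-217", "balance": 750},
    {"branch_name": "Redwood", "account_number": "A-215", "balance": 750},
    {"branch_name": "Redwood", "account_number": "A-222", "balance": 700},
]

per_branch = aggr(account, ["branch_name"], {"balance": (sum, "balance")})
global_total = aggr(account, [], {"total": (sum, "balance")})   # empty groupby-list
branches = aggr(account, ["branch_name"], {})                   # empty AggrFcn-list

print(per_branch, global_total, branches)
```

With an empty groupby-list all tuples fall into a single group, and with an empty aggregate list only the distinct group keys remain.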

order(Table, [ Expr ])
[Table: relation r with attributes A, B, C, D, and the ordered result]

    orderby(r, [D,C desc])

topn(Table, [ Expr ], int)
[Table: relation r with attributes A, B, C, D, and the first two tuples of the ordered result]

    topn(r, [D,C desc], int('2'))
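A topn can avoid sorting the whole input by keeping only the best n tuples, for instance with a heap. The sketch below uses Python's heapq.nsmallest and assumes numeric sort attributes, so that "desc" can be expressed by negating the value; the tuples of r are hypothetical.

```python
import heapq

def topn(table, sort_key, n):
    """Return the first n tuples in sort order without sorting the whole table."""
    return heapq.nsmallest(n, table, key=sort_key)

# Hypothetical contents for relation r.
r = [
    {"A": "a", "B": "x", "C": 1, "D": 7},
    {"A": "a", "B": "y", "C": 5, "D": 7},
    {"A": "b", "B": "y", "C": 12, "D": 3},
    {"A": "b", "B": "y", "C": 23, "D": 10},
]

# Counterpart of topn(r, [D,C desc], int('2')):
# D ascending, ties on D broken by C descending.
top2 = topn(r, lambda t: (t["D"], -t["C"]), int("2"))
print(top2)
```

The tuple with D = 3 comes first; among the two tuples with D = 7, the one with the larger C wins.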

TPC-H: Data Warehousing Scenario
- TPC-C: transaction processing
- TPC-H: data warehousing
Large repository of data about Orders, consisting of Lineitems, delivered to Customers.

    CUSTOMER 1:n ORDER 1:n LINEITEM

Query 3: "Give date, priority and sum of the top 10 high revenue orders for construction customers that had been ordered but not yet shipped on March 15"

SQL Data Warehousing Query (TPC-H Query 3)

    select l_orderkey, o_orderdate, o_shippriority,
           sum(l_extendedprice * (1 - l_discount)) as revenue
    from customer, orders, lineitem
    where c_custkey = o_custkey
      and l_orderkey = o_orderkey
      and c_mktsegment = 'BUILDING'
      and o_orderdate < date ' '
      and l_shipdate > date ' '
    group by l_orderkey, o_orderdate, o_shippriority
    order by revenue desc, o_orderdate
    limit 10;

SQL -> Algebra translation
The query is translated step by step into X100 operators:
- join: the from clause with the join conditions c_custkey = o_custkey and l_orderkey = o_orderkey
- select: the remaining where predicates (c_mktsegment, o_orderdate, l_shipdate)
- aggr: the group by clause with sum(...) as revenue
- topn: order by ... limit 10

Query in X100 Algebra
[Figure: TPC-H Query 3 expressed in X100 algebra]

Outline
- Introduction & Course Organization
- Recap of Introductory Database Course
  - SQL
  - Relational Algebra (X100 flavor)
- Storage and File Structures

Storage Hierarchy

Device                bandwidth         latency   EUR/GB   Unit
CPU registers         24000MB/s         1                  8B
L1 CPU cache          24000MB/s         1ns                64B
L2 CPU cache          7000MB/s          10ns               64B
RAM (DDR2)            3000MB/s          70ns      60       64B
NAND Flash            60MB/s (20MB/s)   ? ns      20       2KB
Magnetic disk (IDE)   80MB/s            ? ns      0.30     8KB
Tape (HP)             80MB/s            10 min    0.10     32KB

Sizes listed on the slide: 300GB, 4GB, 2GB, 2MB, 64KB, 128B

Hardware Trends
[Figure: trends in CPU speed (KHz), RAM size (KB), disk size (MB), RAM bandwidth (MB/s), disk bandwidth (MB/s), RAM latency (ns) and disk latency (ms)]

Storage Hierarchy (Cont.)
- primary storage: fastest media but volatile (cache, main memory)
- secondary storage: next level in hierarchy, non-volatile, moderately fast access time
  - also called on-line storage
  - E.g. flash memory, magnetic disks
- tertiary storage: lowest level in hierarchy, non-volatile, slow access time
  - also called off-line storage
  - E.g. magnetic tape, optical storage

Magnetic Hard Disk Mechanism
[Figure: schematic of a magnetic hard disk]
NOTE: Diagram is schematic, and simplifies the structure of actual disk drives

Performance Measures of Disks
- Access time: the time it takes from when a read or write request is issued to when data transfer begins. Consists of:
  - Seek time: time it takes to reposition the arm over the correct track.
    - Average seek time is 1/2 the worst case seek time.
      (Would be 1/3 if all tracks had the same number of sectors, and we ignore the time to start and stop arm movement.)
    - 4 to 10 milliseconds on typical disks
  - Rotational latency: time it takes for the sector to be accessed to appear under the head.
    - Average latency is 1/2 of the worst case latency.
    - 4 to 11 milliseconds on typical disks (5400 to r.p.m.)
- Data-transfer rate: the rate at which data can be retrieved from or stored to the disk.
  - 20 to 60 MB per second is typical
  - Multiple disks may share a controller, so the rate that the controller can handle is also important
  - E.g. ATA: 100 MB/second, SCSI: 320 MB/second
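The relation between rotation speed and rotational latency can be worked out with a few lines of Python. This is a simplified cost sketch: the function names and the example request (8 KB, 8 ms average seek, 40 MB/s) are assumptions, and controller overhead and caching are ignored.

```python
def rotation_time_ms(rpm):
    """Time for one full rotation: the worst-case rotational latency."""
    return 60_000.0 / rpm

def average_rotational_latency_ms(rpm):
    """On average the head waits half a rotation."""
    return rotation_time_ms(rpm) / 2

def read_time_ms(num_bytes, avg_seek_ms, rpm, transfer_mb_per_s):
    """Seek + rotational latency + transfer time for one request."""
    transfer_ms = num_bytes / (transfer_mb_per_s * 1_000_000) * 1000.0
    return avg_seek_ms + average_rotational_latency_ms(rpm) + transfer_ms

print(rotation_time_ms(5400))               # about 11.1 ms per rotation
print(average_rotational_latency_ms(5400))  # about 5.6 ms on average
print(read_time_ms(8192, 8.0, 5400, 40))    # dominated by seek and rotation
```

At 5400 r.p.m. a full rotation takes about 11 ms, which matches the upper end of the range quoted on the slide. For a small request the transfer itself contributes only a fraction of a millisecond.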

Magnetic Disk Hardware Trends
[Figure: magnetic disk hardware trends]

Performance Measures (Cont.)
- Mean time to failure (MTTF): the average time the disk is expected to run continuously without any failure.
  - Typically 3 to 5 years
  - Probability of failure of new disks is quite low, corresponding to a "theoretical MTTF" of 30,000 to 1,200,000 hours for a new disk
    - E.g., an MTTF of 1,200,000 hours for a new disk means that given 1000 relatively new disks, on average one will fail every 1200 hours
  - MTTF decreases as the disk ages

RAID
RAID: Redundant Arrays of Independent Disks
- disk organization techniques that manage a large number of disks, providing a view of a single disk of
  - high capacity and high speed by using multiple disks in parallel, and
  - high reliability by storing data redundantly, so that data can be recovered even if a disk fails
The chance that some disk out of a set of N disks will fail is much higher than the chance that a specific single disk will fail.
- E.g., a system with 100 disks, each with an MTTF of 100,000 hours (approx. 11 years), will have a system MTTF of 1000 hours (approx. 41 days)
- Techniques for using redundancy to avoid data loss are critical with large numbers of disks
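The MTTF figures quoted on these slides follow from a simple division, assuming that disks fail independently and at a constant rate. The function name below is illustrative.

```python
def hours_between_failures(disk_mttf_hours, num_disks):
    """Expected time until some disk in the set fails,
    assuming independent failures at a constant rate."""
    return disk_mttf_hours / num_disks

fleet = hours_between_failures(1_200_000, 1000)   # 1000 relatively new disks
system = hours_between_failures(100_000, 100)     # the 100-disk system

print(fleet)         # hours between failures in the fleet
print(system)        # system MTTF in hours
print(system / 24)   # the same in days
```

This reproduces the 1200 hours for the fleet of 1000 disks and the 1000 hours (roughly 41 days) for the 100-disk system.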

Improvement of Reliability via Redundancy
Redundancy: store extra information that can be used to rebuild information lost in a disk failure
E.g., mirroring (or shadowing)
- Duplicate every disk. A logical disk consists of two physical disks.
- Every write is carried out on both disks
  - Reads can take place from either disk
- If one disk in a pair fails, the data is still available on the other
  - Data loss would occur only if a disk fails, and its mirror disk also fails before the system is repaired
    - Probability of the combined event is very small
    - Except for dependent failure modes such as fire, building collapse or electrical power surges
Mean time to data loss depends on mean time to failure and mean time to repair
- E.g. an MTTF of 100,000 hours and a mean time to repair of 10 hours give a mean time to data loss of 500 * 10^6 hours (or 57,000 years) for a mirrored pair of disks (ignoring dependent failure modes)
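The mean time to data loss of a mirrored pair can be reproduced with the usual approximation MTTF^2 / (2 * MTTR). The formula is not spelled out on the slide; it is stated here as an assumption that matches the quoted numbers and presumes independent failures.

```python
def mean_time_to_data_loss_hours(mttf_hours, mttr_hours):
    """Approximate mean time to data loss for a mirrored pair:
    MTTF^2 / (2 * MTTR), assuming independent disk failures."""
    return mttf_hours ** 2 / (2 * mttr_hours)

mttdl = mean_time_to_data_loss_hours(100_000, 10)
years = mttdl / (24 * 365)

print(mttdl)   # hours
print(years)   # years
```

The result is 500 * 10^6 hours, or about 57,000 years.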

RAID Levels
Schemes to provide redundancy at lower cost by using disk striping combined with parity bits
- Different RAID organizations, or RAID levels, have differing cost, performance and reliability characteristics
RAID Level 1: Mirrored disks with block striping
- Offers best write performance.
- Popular for applications such as storing log files in a database system.
RAID Level 0: Block striping; non-redundant.
- Used in high-performance applications where data loss is not critical.

RAID Levels (Cont.)
RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among all N + 1 disks, rather than storing data in N disks and parity in 1 disk.
- E.g., with 5 disks, the parity block for the nth set of blocks is stored on disk (n mod 5) + 1, with the data blocks stored on the other 4 disks.
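The parity placement rule and the way parity allows a lost block to be rebuilt can be sketched as follows. Parity is assumed to be the bytewise XOR of the data blocks; the block contents and function names are invented for illustration.

```python
from functools import reduce

def parity_disk(n, num_disks=5):
    """Disk (numbered from 1) holding the parity block of the nth set of blocks."""
    return (n % num_disks) + 1

def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Four hypothetical data blocks of one stripe, plus their parity block.
data = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f", b"\xff\x00"]
parity = xor_blocks(data)

# If the disk holding data[2] fails, XOR of the survivors and the parity rebuilds it.
rebuilt = xor_blocks([data[0], data[1], data[3], parity])

print([parity_disk(n) for n in range(6)])
print(rebuilt == data[2])
```

The parity block rotates over disks 1 to 5, so no single disk becomes the parity bottleneck.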

Choice of RAID Level
- Level 0 provides maximum performance, no safety
- Level 1 provides much better write performance than level 5
  - Level 5 requires at least 2 block reads and 2 block writes to write a single block, whereas Level 1 only requires 2 block writes
  - Level 1 is preferred for high update environments such as log disks
- Level 1 has a higher storage cost than level 5
  - disk drive capacities are increasing rapidly (50%/year) whereas disk access times have decreased much less (x 3 in 10 years)
  - I/O requirements have increased greatly, e.g. for Web servers
  - When enough disks have been bought to satisfy the required rate of I/O, they often have spare storage capacity
    - so there is often no extra monetary cost for Level 1!
- Level 5 is preferred for applications with a low update rate and large amounts of data
- Level 1 is preferred for all other applications
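The "2 block reads and 2 block writes" of a Level 5 single-block write come from updating the parity incrementally. A sketch, assuming XOR parity and made-up block contents:

```python
def raid5_small_write(old_data, old_parity, new_data):
    """Return the new parity block after overwriting one data block.

    I/O cost: read old data + read old parity (2 reads),
    then write new data + write new parity (2 writes).
    """
    return bytes(p ^ d_old ^ d_new
                 for p, d_old, d_new in zip(old_parity, old_data, new_data))

def xor_all(blocks):
    """Parity recomputed from scratch, used here only to check the result."""
    out = bytes(len(blocks[0]))
    for block in blocks:
        out = bytes(a ^ b for a, b in zip(out, block))
    return out

stripe = [b"\x01", b"\x02", b"\x04"]      # hypothetical data blocks
parity = xor_all(stripe)

new_block = b"\x0a"
new_parity = raid5_small_write(stripe[1], parity, new_block)

print(new_parity == xor_all([stripe[0], new_block, stripe[2]]))
```

The other data blocks of the stripe never need to be read, but the write still costs four I/Os, compared with two block writes for a mirrored pair.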

Hardware Issues
- Hot swapping: replacement of a disk while the system is running, without power down
  - Supported by some hardware RAID systems
  - Reduces time to recovery, and improves availability greatly
- Many systems maintain spare disks which are kept online, and used as replacements for failed disks immediately on detection of failure
  - Reduces time to recovery greatly
- Many hardware RAID systems ensure that a single point of failure will not stop the functioning of the system by using
  - Redundant power supplies with battery backup
  - Multiple controllers and multiple interconnections to guard against controller/interconnection failures

Organization of Records in Files
- Heap: a record can be placed anywhere in the file where there is space
- Sequential: store records in sequential order, based on the value of the search key of each record
- Hashing: a hash function is computed on some attribute of each record; the result specifies in which block of the file the record should be placed
Records of each relation may be stored in a separate file. In a clustering file organization, records of several different relations can be stored in the same file.
- Motivation: store related records on the same block to minimize I/O

Index Classification
Primary vs. Secondary
- primary: the index on the primary key
- unique: an index on a candidate key
- secondary: not primary
Clustered vs. Unclustered
- clustered: key order corresponds with record order
  - E.g. B-tree separate from record file
  - Index-organized table: B-tree leaves store records (no file)
- unclustered: index contains record-IDs in random order

B+Tree
[Figure: example B+tree with its root]

Sample non-leaf node
[Figure: non-leaf node with keys 57, 81, 95 and pointers to keys k < 57, to keys 57 <= k < 81, to keys 81 <= k < 95, and to keys k >= 95]

Sample leaf node
[Figure: leaf node with keys 57, 81, 85, reached from a non-leaf node; pointers to the records with keys 57, 81 and 85, plus a pointer to the next leaf in sequence]

Non-root nodes have to be at least half-full
Use at least:
- Non-leaf: ⌈n/2⌉ children
- Leaf: ⌈(n-1)/2⌉ pointers to data
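The half-full rule can be written out as two small functions. Two assumptions are made here: n is the maximum number of pointers per node, and both bounds are rounded up (the rounding symbols are not legible in this transcript).

```python
import math

def min_children_nonleaf(n):
    """Minimum number of children of a non-root, non-leaf node (assumed ceiling)."""
    return math.ceil(n / 2)

def min_pointers_leaf(n):
    """Minimum number of data pointers in a non-root leaf (assumed ceiling)."""
    return math.ceil((n - 1) / 2)

for n in (4, 5):
    print(n, min_children_nonleaf(n), min_pointers_leaf(n))
```

Under these assumptions, for n = 4 a non-leaf node needs at least 2 children and a leaf at least 2 data pointers; for n = 5 the bounds are 3 and 2.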

[Figure: a full node and a minimum node, for non-leaf and leaf nodes]

Insert into B+tree
(a) simple case: space available in leaf
(b) leaf overflow
(c) non-leaf overflow
(d) new root

[Figure: (a) simple case, insert key = 32]

[Figure: (b) leaf overflow, insert key = 7]

[Figure: (c) internal (non-leaf) overflow, insert key = 160]

[Figure: (d) new root, insert key = 45]

insert: 1, 2, 10, 20, 3, 12, 30, 32, 25, 40, 45 (n=4)

B+ trees and CPU Caches
Problem:
- Binary search in a B+ tree node
- CPU cache misses!
Ideas:
- Fractal Prefetching B-trees (Chen et al., SIGMOD 2002)
  - "cache-oblivious B trees"
- Optimizing the memory layout (Rao et al., SIGMOD 2000)
  - Eliminate internal pointers
- Buffered Access (Zhou et al., SIGMOD 2004)
  - Do lookups in batches

Deletion from B+tree
(a) Simple case (no example)
(b) Coalesce with neighbor (sibling)
(c) Re-distribute keys
(d) Cases (b) or (c) at non-leaf

[Figure: (b) coalesce with sibling after a delete (figure labels: n=5, 40)]

[Figure: (c) redistribute keys after a delete (figure labels: n=4, 35)]

[Figure: (d) non-leaf coalesce, delete 37, resulting in a new root]

B+tree deletions in practice
- Often, coalescing is not implemented
  - Too hard and not worth it!

Interesting problem:
For a B+tree, how large should n be?
(n is the number of keys per node)

Assumptions
You have the right to set the disk page size for the disk where a B-tree will reside.
Compute the optimum page size n assuming that:
- The items are 4 bytes long and the pointers are also 4 bytes long.
- Time to read a node from disk is a linear function of n (a fixed access time plus a transfer time proportional to n)
- Time to process a block in memory is unimportant
- The B+tree is full (i.e., every page has the maximum number of items and pointers)

Find n_opt by f'(n) = 0

    f(n) = time to find a record = log_n(T) * (time to read a node of n entries from disk)

1994 (book) -> 2004 (now): n=500 -> n=4000

1994:
- Table: 1M records
- 10ms access time
- 4MB/s bandwidth
- 8KB pages
- Be conservative to limit RAM consumption

2004:
- Table: 10M records
- 6ms access time
- 40MB/s bandwidth
- 32KB pages
- The relative benefit decreases, so don't overdo it

Answer should be n_opt = "few thousand"
What happens to n_opt as
- disk bandwidth increases?
- access time stays behind?
- CPUs get faster?
-> block sizes are increasing
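The optimum can also be found numerically instead of solving f'(n) = 0 by hand. The sketch below assumes a cost model reconstructed from the stated assumptions: each node read costs one access plus the transfer of n entries of 8 bytes (a 4-byte item plus a 4-byte pointer), T is taken to be the number of records, and 1 MB is 10^6 bytes. The function names are illustrative.

```python
import math

def lookup_time_ms(n, num_records, access_ms, bandwidth_bytes_per_s, entry_bytes=8):
    """f(n) = log_n(T) * (access time + transfer time of one node of n entries)."""
    levels = math.log(num_records) / math.log(n)
    transfer_ms = n * entry_bytes / bandwidth_bytes_per_s * 1000.0
    return levels * (access_ms + transfer_ms)

def optimal_fanout(num_records, access_ms, bandwidth_bytes_per_s,
                   candidates=range(2, 20001)):
    """Brute-force search for the n that minimizes the lookup time."""
    return min(candidates,
               key=lambda n: lookup_time_ms(n, num_records, access_ms,
                                            bandwidth_bytes_per_s))

n_1994 = optimal_fanout(1_000_000, 10.0, 4e6)     # 10ms access, 4MB/s
n_2004 = optimal_fanout(10_000_000, 6.0, 40e6)    # 6ms access, 40MB/s

print(n_1994, n_1994 * 8 / 1024)   # fanout and node size in KB
print(n_2004, n_2004 * 8 / 1024)
```

Under these assumptions the optimum lands in the high hundreds for the 1994 parameters (a node of roughly 7 KB, in line with 8KB pages) and around four thousand for the 2004 parameters (roughly 32 KB). In this model the optimum depends only on the ratio of access time to per-entry transfer time, not on the table size, so it grows as bandwidth improves faster than access time.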

Primary or Auxiliary Structure: Clustered vs. Unclustered Index
Primary index
- Leaf blocks in sequence -> clustered index
- Main storage structure for a database table
  - E.g. B+-tree organized file / hash structured files
- Typically an index on a unique key
  - But not necessarily
- Normally, you can have only one clustered index!
Secondary index
- Also called unclustered index
- A separate file from where the table is stored
- Refers with (block/offset) pointers to records in the table file
- You can define as many as you want (to maintain)
[Figure: lookup of a key range (low to high) through each kind of index]
- Primary B-tree index: 1 access only (the rest is 'just' bandwidth)
- Secondary B-tree index: pay N times the access cost

Are Unclustered Indices a Good Idea?
- Secondary indices depend on random I/O
  - can do asynchronous I/O (multiple I/Os at-a-time)
  - degenerates into full table scans
[Figure: Block size for sequential reads?]
[Figure: When do random I/Os make sense?]
- Is not using an index at all better?
  - I.e. read the entire table sequentially without any index
- Use redundant clustered orderings
  - Materialized views
  - C-STORE (Stonebraker et al., VLDB 2005), MonetDB/X100
  - Database Cracking (Kersten, CIDR)
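A back-of-the-envelope model shows why an unclustered index quickly loses against a full scan. It assumes that every matching record costs one random access through the secondary index, that a scan costs one access plus sequential transfer of the whole table, and that records are 100 bytes long (the record size is an assumption; the 6ms and 40MB/s figures are the 2004 parameters used earlier).

```python
def secondary_index_time_s(num_matches, access_s):
    """One random I/O per matching record (index traversal itself ignored)."""
    return num_matches * access_s

def full_scan_time_s(table_bytes, bandwidth_bytes_per_s, access_s):
    """One access to reach the table, then purely sequential transfer."""
    return access_s + table_bytes / bandwidth_bytes_per_s

def break_even_matches(table_bytes, bandwidth_bytes_per_s, access_s):
    """Number of matches above which the full scan is faster."""
    return full_scan_time_s(table_bytes, bandwidth_bytes_per_s, access_s) / access_s

num_records = 10_000_000
table_bytes = num_records * 100          # assumed 100-byte records
matches = break_even_matches(table_bytes, 40e6, 0.006)

print(matches)                 # matching records at the break-even point
print(matches / num_records)   # the same as a fraction of the table
```

In this model the scan wins as soon as the query selects more than a few thousand records, which is well below 0.1% of the table. Asynchronous I/O and caching shift the break-even point but do not remove it.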