Database Techniek Martin Kersten Peter Boncz CWI.


© Silberschatz, Korth and Sudarshan, Database System Concepts

Outline
- Introduction & Course Organization
- Recap of Introductory Database Course
  - SQL
  - Relational Algebra (X100 flavor)
- Storage and File Structures

Why a DBMS?
Main advantages:
- Centralization (at least conceptually)
- Data independence (physical changes don't break legacy apps)
- Declarative data integrity constraints
- Atomic actions (the DBMS recovers consistently from a system crash)
- Consistency under multi-user concurrent updates
- Declarative and powerful query language, automatically optimized
- Multi-user security
The DBMS is now the basic building block of all information systems.
Almost everybody in IT works with a DBMS on a daily basis.

Application Architectures
- Two-tier architecture: e.g. client programs using ODBC/JDBC to communicate with a database (aka "client-server")
- Three-tier architecture: e.g. web-based applications (e.g. LAMP), or application servers (e.g. JBoss, BEA)

Goal
Gain insight into the implementation techniques inside a relational DBMS.
Grading:
- Grade = (2 * exam + practicum) / 3
- exam >= 6, practicum >= 6
Literature:
- A. Silberschatz et al., 'Database System Concepts', 4th ed., McGraw-Hill, 2002
- http://www.cwi.nl/~manegold/teaching/DBtech/

Lectures

#  Date    Lecturer       Material                   Topic
1  Feb 8   Kersten/Boncz  Ch. 4 + X100 doc           SQL + X100 Alg
                          Ch. 11-12                  Storage + B-Trees
2  Feb 15  Boncz          Ch. 13                     Query Processing
   Feb 22  Boncz          Ch. 14                     Query Optimization
3  Mar 1   Kersten        Ch. 15-17                  Transactions
4  Mar 8   Kersten/Nes                               MonetDB/SQL
5  Mar 15  Kersten/Boncz                             MonetDB/XQuery

Exam: last week of March

Practicum
- Assignment 0: Hands-on experience with relational DBMSs & SQL
- Assignment 1: Translating SQL to X100 algebra ("by hand")
- Assignment 2 (choose one of):
  a) Building logical cost functions for X100 algebra operations ("by hand")
  b) Analyse and explain the behaviour of a query optimizer
Supervisor: Marc Makkes (mmakkes@science.uva.nl)
Hard deadlines (first: Saturday, February 17, 2007, 23:59:59 CET!)
Work in pairs

Outline
- Introduction & Course Organization
- Recap of Introductory Database Course
  - SQL (next)
  - Relational Algebra (X100 flavor)
- Storage and File Structures

SQL Recap: Basic Structure
A typical SQL query has the form:
    select A1, A2, ..., An
    from r1, r2, ..., rm
    where P
- The Ai represent attributes
- The ri represent relations
- P is a predicate
This query is equivalent to the relational algebra expression:
    project_{A1, A2, ..., An}(select_P(r1 join_true r2 join_true ... join_true rm))
The result of an SQL query is again a relation.
SQL relations may have duplicates
- Use select distinct to get a set
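The equivalence between select-from-where and project(select(cross product)) can be made concrete with a small sketch. This is only an illustration, not how a DBMS evaluates queries: the function name `sql_select_from_where`, the list-of-dicts representation, and the sample rows are assumptions introduced here, and attribute names are assumed to be unique across the joined relations.

```python
from itertools import product


def sql_select_from_where(relations, predicate, attributes, distinct=False):
    """Evaluate select-from-where naively as project(select(cross product))."""
    result = []
    for combo in product(*relations):       # r1 join_true r2 ... = cross product
        row = {}
        for tup in combo:
            row.update(tup)                 # assumes unique attribute names
        if predicate(row):                  # select_P
            result.append(tuple(row[a] for a in attributes))  # project
    if distinct:
        # "select distinct": drop duplicates, keep first-seen order
        result = list(dict.fromkeys(result))
    return result


# Hypothetical sample data in the spirit of the banking schema.
borrower = [{"customer": "Jones", "b_loan": "L-17"},
            {"customer": "Smith", "b_loan": "L-23"}]
loan = [{"loan": "L-17", "branch": "Perryridge"},
        {"loan": "L-23", "branch": "Redwood"}]

names = sql_select_from_where(
    [borrower, loan],
    lambda r: r["b_loan"] == r["loan"] and r["branch"] == "Perryridge",
    ["customer"],
)
```

Without `distinct=True` the result is a bag, so duplicate rows in the input survive into the output.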

Aggregate Queries
Find the names of all branches where the average account balance is more than $1,200.
    select branch-name, avg (balance)
    from account
    group by branch-name
    having avg (balance) > 1200
Note: predicates in the having clause are applied after the formation of groups, whereas predicates in the where clause are applied before forming groups.
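The difference between where (before grouping) and having (after grouping) can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the column names use underscores instead of hyphens, the rows are illustrative, and the threshold 700 is chosen here (instead of 1200) so that both queries return something.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table account (branch_name text, account_number text, balance integer)"
)
conn.executemany(
    "insert into account values (?, ?, ?)",
    [
        ("Perryridge", "A-102", 400),
        ("Perryridge", "A-201", 900),
        ("Brighton", "A-217", 750),
        ("Brighton", "A-215", 750),
        ("Redwood", "A-222", 700),
    ],
)

# where: rows are filtered first, then the survivors are grouped.
where_first = conn.execute(
    "select branch_name, avg(balance) from account "
    "where balance > 700 group by branch_name order by branch_name"
).fetchall()

# having: all rows are grouped first, then whole groups are filtered.
having_after = conn.execute(
    "select branch_name, avg(balance) from account "
    "group by branch_name having avg(balance) > 700 order by branch_name"
).fetchall()
```

Perryridge shows up in the where variant (its 900 row survives the filter) but not in the having variant (its group average is 650).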

Ordering the Display of Tuples
List in alphabetic order the names of all customers having a loan at the Perryridge branch:
    select customer-name
    from borrower, loan
    where borrower.loan-number = loan.loan-number
      and branch-name = 'Perryridge'
    order by customer-name
We may specify desc for descending order or asc for ascending order, for each attribute; ascending order is the default.
- E.g. order by customer-name desc
We may restrict the result to the first N tuples
- E.g. order by customer-name limit N

Nested Subqueries
SQL provides a mechanism for the nesting of subqueries.
A subquery is a select-from-where expression that is nested within another query.
A common use of subqueries is to perform tests for set membership, set comparisons, and set cardinality.

Example Query
Find all customers who have both an account and a loan at the bank.
    select distinct customer-name
    from borrower
    where customer-name in (select customer-name from depositor)
Alternatively, using exists:
    select distinct customer-name
    from borrower as B
    where exists (select * from depositor where customer-name = B.customer-name)

Outline
- Introduction & Course Organization
- Recap of Introductory Database Course
  - SQL
  - Relational Algebra (X100 flavor) (next)
- Storage and File Structures

Relational Algebra
[Figure: query processing pipeline]
SQL -> parsing, normalization -> logical algebra -> logical query optimization -> physical query optimization -> physical algebra -> query execution

The Practicum
[Figure: the same pipeline, with X100 in place of the generic components]
SQL -> parsing, normalization -> X100 algebra -> logical query optimization -> physical query optimization -> physical algebra -> X100 system

X100 Relational Algebra
MonetDB/X100 is a CWI research project: http://www.cwi.nl/~boncz/x100.html
A high-performance experimental DBMS for e.g.
- Data warehousing
- Data mining
- Information retrieval
- Video databases (retrieval by content)
Research goal: study the interaction between modern hardware and database internals
- High-performance algorithms, compression
- E.g. exploit CPU caches, multi-processors, MEMS

X100 Relational Algebra (Cont.)
X100 has a relational algebra interface:
    Table ::= table(Identifier)
            | select(Table, Expr)
            | project(Table, [ Expr ])
            | join(Table, Table, Expr)
            | aggr(Table, [ Expr ], [ AggrFcn ])
            | order(Table, [ Expr ])
            | topn(Table, [ Expr ], Expr)
            | Identifier = Table

select(Table, Expr)
Relation r:
    A  B  C   D
    α  α  1   7
    α  β  5   7
    β  β  12  3
    β  β  23  10
select(r, and(==(A,B), >(D,int('5')))):
    A  B  C   D
    α  α  1   7
    β  β  23  10
- Functional notation for the C-like expression: A == B and D > 5
- All constants are denoted as a cast: TYPE('string')

project(Table, [ Expr ])
Relation r:
    A  B   C
    α  10  1
    α  20  1
    β  30  1
    β  40  2
project(r, [ A, D=*(C,int('10')) ]):
    A  D
    α  10
    α  10
    β  10
    β  20
X100 is a bag algebra: no duplicate elimination

join(Table, Table, Expr)
[Figure: relations r(A, B, C, D) and s(E, F), and the result of join(r, s, ==(B,E)) with attributes A, B, C, D, E, F]
The X100 join result is the union of all attributes. Name conflicts must be resolved with an extra project. For a relation t(E, C), whose attribute C clashes with r:
    project(t, [ E, F=C ])
    join(r, s, ==(B,E))

aggr(Table, [ Expr ], [ AggrFcn ])
Relation account grouped by branch-name:
    branch-name  account-number  balance
    Perryridge   A-102           400
    Perryridge   A-201           900
    Brighton     A-217           750
    Brighton     A-215           750
    Redwood      A-222           700
aggr(account, [ branch-name ], [ balance = sum(balance) ]):
    branch-name  balance
    Perryridge   1300
    Brighton     1500
    Redwood      700
Each aggregate has the form Identifier = AggrFcn(Identifier), with
    AggrFcn ::= count() | avg(T) | sum(T) | min(T) | max(T)

aggr(Table, [ Expr ], [ AggrFcn ])
Relation r:
    A  B  C
    α  α  7
    α  β  7
    β  β  3
    β  β  10
aggr(r, [], [ total = sum(C) ]):
    total
    27
Empty groupby-list -> global aggregate

aggr(Table, [ Expr ], [ AggrFcn ])
Relation account grouped by branch-name:
    branch-name  account-number  balance
    Perryridge   A-102           400
    Perryridge   A-201           900
    Brighton     A-217           750
    Brighton     A-215           750
    Redwood      A-222           700
aggr(account, [ branch-name ], []):
    branch-name
    Perryridge
    Brighton
    Redwood
Empty AggrFcn-list -> duplicate elimination
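The three uses of aggr (grouped aggregate, global aggregate with an empty groupby-list, duplicate elimination with an empty AggrFcn-list) can be mimicked in a few lines. This is a sketch, not the X100 implementation: the Python function `aggr`, the list-of-dicts table format, and the `{name: (function, column)}` encoding of the aggregate list are assumptions made for illustration.

```python
def aggr(table, group_by, aggr_fcns):
    """Group rows (dicts) on the group_by columns and apply named aggregates.

    aggr_fcns maps an output name to (function, input column); this is a
    hypothetical stand-in for "Identifier = AggrFcn(Identifier)".
    """
    groups = {}  # dicts keep insertion order: groups appear in first-seen order
    for row in table:
        key = tuple(row[col] for col in group_by)
        groups.setdefault(key, []).append(row)
    result = []
    for key, rows in groups.items():
        out = dict(zip(group_by, key))
        for name, (fcn, col) in aggr_fcns.items():
            out[name] = fcn([r[col] for r in rows])
        result.append(out)
    return result


account = [
    {"branch-name": "Perryridge", "account-number": "A-102", "balance": 400},
    {"branch-name": "Perryridge", "account-number": "A-201", "balance": 900},
    {"branch-name": "Brighton", "account-number": "A-217", "balance": 750},
    {"branch-name": "Brighton", "account-number": "A-215", "balance": 750},
    {"branch-name": "Redwood", "account-number": "A-222", "balance": 700},
]
r = [{"C": 7}, {"C": 7}, {"C": 3}, {"C": 10}]

by_branch = aggr(account, ["branch-name"], {"balance": (sum, "balance")})
total = aggr(r, [], {"total": (sum, "C")})        # empty groupby-list
branches = aggr(account, ["branch-name"], {})     # empty AggrFcn-list
```

With an empty groupby-list every row falls into the single group keyed by the empty tuple, which is why one global row comes out. Note that this sketch returns no row at all for an empty input table.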

order(Table, [ Expr ])
Relation r (only columns C and D shown):
    C   D
    23  10
    12  9
    35  7
    25  7
orderby(r, [ D, C desc ]):
    C   D
    35  7
    25  7
    12  9
    23  10

topn(Table, [ Expr ], int)
Relation r (only columns C and D shown):
    C   D
    23  10
    12  9
    35  7
    25  7
topn(r, [ D, C desc ], int('2')):
    C   D
    35  7
    25  7
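A top-N operator does not need to sort the whole input; a bounded heap is enough. The sketch below uses Python's heapq for this and compares it with a full sort. The tuple layout (C, D) and the function names are assumptions for illustration; the ordering [D, C desc] is encoded as the sort key (D, -C), which only works because C is numeric.

```python
import heapq

# Rows as (C, D) pairs, matching the example relation r.
r = [(23, 10), (12, 9), (35, 7), (25, 7)]


def sort_key(row):
    c, d = row
    return (d, -c)          # D ascending, then C descending


def order_by(rows):
    """Full sort on [D, C desc]."""
    return sorted(rows, key=sort_key)


def topn(rows, n):
    """First n rows of the [D, C desc] ordering, without a full sort."""
    return heapq.nsmallest(n, rows, key=sort_key)


ordered = order_by(r)
top2 = topn(r, 2)
```

For small N relative to the input, keeping only N candidates is the reason to offer topn as its own operator rather than order followed by a cut-off.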

TPC-H: Data Warehousing Scenario
http://www.tpc.org
- TPC-C: transaction processing
- TPC-H: data warehousing
Large repository of data about Orders, consisting of Lineitems, delivered to Customers:
    CUSTOMER 1:n ORDER 1:n LINEITEM
Query 3: "Give date, priority and sum of the top 10 high revenue orders for construction customers that had been ordered but not yet shipped on March 15"

SQL Data Warehousing Query (TPC-H Query 3)
    select l_orderkey, o_orderdate, o_shippriority,
           sum(l_extendedprice * (1 - l_discount)) as revenue
    from customer, orders, lineitem
    where c_custkey = o_custkey
      and l_orderkey = o_orderkey
      and c_mktsegment = 'BUILDING'
      and o_orderdate < date '1995-03-15'
      and l_shipdate > date '1995-03-15'
    group by l_orderkey, o_orderdate, o_shippriority
    order by revenue desc, o_orderdate
    limit 10;

SQL -> Algebra Translation
The clauses of TPC-H Query 3 map onto X100 algebra operators:
- join: from customer, orders, lineitem, with c_custkey = o_custkey and l_orderkey = o_orderkey
- select: c_mktsegment = 'BUILDING', o_orderdate < date '1995-03-15', l_shipdate > date '1995-03-15'
- aggr: group by l_orderkey, o_orderdate, o_shippriority, with sum(l_extendedprice * (1 - l_discount)) as revenue
- topn: order by revenue desc, o_orderdate limit 10

Query in X100 Algebra
[Figure: TPC-H Query 3 expressed in X100 algebra]

Outline
- Introduction & Course Organization
- Recap of Introductory Database Course
  - SQL
  - Relational Algebra (X100 flavor)
- Storage and File Structures (next)

Storage Hierarchy

medium               bandwidth        latency     EUR/GB  unit
CPU registers        24000MB/s        1                   8B
L1 CPU cache         24000MB/s        1ns                 64B
L2 CPU cache         7000MB/s         10ns                64B
RAM (DDR2)           3000MB/s         70ns        60      64B
NAND Flash           60MB/s (20MB/s)  100000ns    20      2KB
Magnetic disk (IDE)  80MB/s           10000000ns  0.30    8KB
Tape (HP)            80MB/s           10 min      0.10    32KB

Sizes listed on the slide, from largest to smallest: 300GB, 4GB, 2GB, 2MB, 64KB, 128B

Hardware Trends
[Figure: trends in CPU speed (KHz), RAM size (KB), disk size (MB), RAM bandwidth (MB/s), disk bandwidth (MB/s), RAM latency (ns), disk latency (ms)]

Storage Hierarchy (Cont.)
- Primary storage: fastest media but volatile (cache, main memory)
- Secondary storage: next level in the hierarchy, non-volatile, moderately fast access time
  - Also called on-line storage
  - E.g. flash memory, magnetic disks
- Tertiary storage: lowest level in the hierarchy, non-volatile, slow access time
  - Also called off-line storage
  - E.g. magnetic tape, optical storage

Magnetic Hard Disk Mechanism
[Figure: schematic of a magnetic hard disk]
NOTE: Diagram is schematic, and simplifies the structure of actual disk drives

Performance Measures of Disks
Access time: the time it takes from when a read or write request is issued to when data transfer begins. Consists of:
- Seek time: the time it takes to reposition the arm over the correct track.
  - Average seek time is 1/2 the worst case seek time.
    - Would be 1/3 if all tracks had the same number of sectors, and we ignore the time to start and stop arm movement
  - 4 to 10 milliseconds on typical disks
- Rotational latency: the time it takes for the sector to be accessed to appear under the head.
  - Average latency is 1/2 of the worst case latency.
  - 4 to 11 milliseconds on typical disks (5400 to 15000 r.p.m.)
Data-transfer rate: the rate at which data can be retrieved from or stored to the disk.
- 20 to 60 MB per second is typical
- Multiple disks may share a controller, so the rate that the controller can handle is also important
  - E.g. ATA: 100 MB/second, SCSI: 320 MB/
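The rotational latency figures follow directly from the spindle speed. The short calculation below is a sketch with helper names chosen here; it suggests that the quoted range of 4 to 11 ms corresponds to one full rotation (the worst case) at 15000 and 5400 r.p.m., with the average being half of that.

```python
def full_rotation_ms(rpm):
    """Time for one full platter rotation, i.e. the worst-case rotational latency."""
    return 60_000.0 / rpm       # 60,000 ms per minute / rotations per minute


def average_rotational_latency_ms(rpm):
    """On average the wanted sector is half a rotation away."""
    return full_rotation_ms(rpm) / 2


for rpm in (5400, 7200, 15000):
    print(rpm, round(full_rotation_ms(rpm), 2),
          round(average_rotational_latency_ms(rpm), 2))
```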

Magnetic Disk Hardware Trends
[Figure: magnetic disk hardware trends]

Performance Measures (Cont.)
Mean time to failure (MTTF): the average time the disk is expected to run continuously without any failure.
- Typically 3 to 5 years
- Probability of failure of new disks is quite low, corresponding to a "theoretical MTTF" of 30,000 to 1,200,000 hours for a new disk
  - E.g., an MTTF of 1,200,000 hours for a new disk means that given 1000 relatively new disks, on average one will fail every 1200 hours
- MTTF decreases as the disk ages

RAID
RAID: Redundant Arrays of Independent Disks
- Disk organization techniques that manage a large number of disks, providing a view of a single disk of
  - high capacity and high speed by using multiple disks in parallel, and
  - high reliability by storing data redundantly, so that data can be recovered even if a disk fails
The chance that some disk out of a set of N disks will fail is much higher than the chance that a specific single disk will fail.
- E.g., a system with 100 disks, each with an MTTF of 100,000 hours (approx. 11 years), will have a system MTTF of 1000 hours (approx. 41 days)
- Techniques for using redundancy to avoid data loss are critical with large numbers of disks
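Both MTTF examples in these slides (1000 disks with an MTTF of 1,200,000 hours, and 100 disks with an MTTF of 100,000 hours) use the same arithmetic: with N independent disks, the expected time until some disk fails is roughly MTTF / N. The sketch below just spells that out; the function name is chosen here, and independent failures are an assumption.

```python
HOURS_PER_DAY = 24


def hours_between_failures(mttf_hours, n_disks):
    """Expected time until some disk out of n_disks fails (independent failures)."""
    return mttf_hours / n_disks


fleet = hours_between_failures(1_200_000, 1000)   # 1000 relatively new disks
system = hours_between_failures(100_000, 100)     # the 100-disk RAID example
system_days = system / HOURS_PER_DAY
```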

Improvement of Reliability via Redundancy
Redundancy: store extra information that can be used to rebuild information lost in a disk failure
E.g., mirroring (or shadowing)
- Duplicate every disk. A logical disk consists of two physical disks.
- Every write is carried out on both disks
  - Reads can take place from either disk
- If one disk in a pair fails, the data is still available on the other
  - Data loss would occur only if a disk fails, and its mirror disk also fails before the system is repaired
    - The probability of the combined event is very small
    - Except for dependent failure modes such as fire, building collapse, or electrical power surges
Mean time to data loss depends on mean time to failure and mean time to repair
- E.g. an MTTF of 100,000 hours and a mean time to repair of 10 hours give a mean time to data loss of 500 * 10^6 hours (or 57,000 years) for a mirrored pair of disks (ignoring dependent failure modes)
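The 500 * 10^6 hours figure can be reproduced with the usual approximation for a mirrored pair, MTTF^2 / (2 * MTTR), which assumes independent failures and a repair time much shorter than the MTTF. The slide does not state the formula, so treat it as an assumption of this sketch; the function name is also chosen here.

```python
HOURS_PER_YEAR = 24 * 365


def mean_time_to_data_loss(mttf_hours, mttr_hours):
    """Approximate mean time to data loss for a mirrored pair of disks.

    Either disk fails at rate 2/MTTF; data is lost if the mirror also fails
    during the repair window, which has probability of about MTTR/MTTF.
    """
    return mttf_hours ** 2 / (2 * mttr_hours)


mttdl_hours = mean_time_to_data_loss(100_000, 10)
mttdl_years = mttdl_hours / HOURS_PER_YEAR
```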

RAID Levels
Schemes to provide redundancy at lower cost by using disk striping combined with parity bits
- Different RAID organizations, or RAID levels, have differing cost, performance and reliability characteristics
RAID Level 0: Block striping; non-redundant.
- Used in high-performance applications where data loss is not critical.
RAID Level 1: Mirrored disks with block striping
- Offers best write performance.
- Popular for applications such as storing log files in a database system.

RAID Levels (Cont.)
RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among all N + 1 disks, rather than storing data in N disks and parity in 1 disk.
- E.g., with 5 disks, the parity block for the nth set of blocks is stored on disk (n mod 5) + 1, with the data blocks stored on the other 4 disks.
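The placement rule (n mod 5) + 1 rotates the parity block over all disks, so no single disk becomes the parity bottleneck. A small sketch of the rule, with function names chosen here and disks numbered 1 to 5 as on the slide:

```python
def parity_disk(n, num_disks=5):
    """Disk (numbered from 1) holding the parity block of the nth block set."""
    return (n % num_disks) + 1


def data_disks(n, num_disks=5):
    """The remaining disks hold the data blocks of the nth block set."""
    p = parity_disk(n, num_disks)
    return [d for d in range(1, num_disks + 1) if d != p]


# Parity placement for the first six block sets: it wraps around after five.
layout = {n: parity_disk(n) for n in range(6)}
```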

Choice of RAID Level
Level 0 provides maximum performance, no safety
Level 1 provides much better write performance than level 5
- Level 5 requires at least 2 block reads and 2 block writes to write a single block, whereas Level 1 only requires 2 block writes
- Level 1 is preferred for high update environments such as log disks
Level 1 has a higher storage cost than level 5
- Disk drive capacities are increasing rapidly (50%/year) whereas disk access times have decreased much less (x 3 in 10 years)
- I/O requirements have increased greatly, e.g. for Web servers
- When enough disks have been bought to satisfy the required rate of I/O, they often have spare storage capacity
  - So there is often no extra monetary cost for Level 1!
Level 5 is preferred for applications with a low update rate and large amounts of data
Level 1 is preferred for all other applications
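The "2 block reads and 2 block writes" for a single-block write under Level 5 come from the parity update: read the old data block and the old parity block, then write the new data block and the new parity. With XOR parity the new parity is old_parity XOR old_data XOR new_data. The sketch below illustrates this on in-memory byte strings; the helper names and the sample blocks are assumptions, and no real disk I/O is involved.

```python
from functools import reduce


def xor_blocks(a, b):
    """Bytewise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_of(blocks):
    """Parity block of a stripe: XOR of all its blocks."""
    return reduce(xor_blocks, blocks)


def raid5_small_write(old_data, old_parity, new_data):
    """New parity after overwriting one data block.

    Inputs are the 2 blocks that must be read; the caller then writes
    2 blocks: new_data and the returned parity.
    """
    return xor_blocks(xor_blocks(old_parity, old_data), new_data)


stripe = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0", b"\xaa\x55"]   # 4 data blocks
parity = parity_of(stripe)

new_block = b"\x33\x44"
new_parity = raid5_small_write(stripe[2], parity, new_block)
stripe[2] = new_block
```

The other three data blocks are never touched, which is what keeps the small write at four I/Os regardless of the stripe width.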

Hardware Issues
Hot swapping: replacement of a disk while the system is running, without power down
- Supported by some hardware RAID systems
- Reduces time to recovery, and improves availability greatly
Many systems maintain spare disks which are kept online, and used as replacements for failed disks immediately on detection of failure
- Reduces time to recovery greatly
Many hardware RAID systems ensure that a single point of failure will not stop the functioning of the system by using
- Redundant power supplies with battery backup
- Multiple controllers and multiple interconnections to guard against controller/interconnection failures

Organization of Records in Files
- Heap: a record can be placed anywhere in the file where there is space
- Sequential: store records in sequential order, based on the value of the search key of each record
- Hashing: a hash function is computed on some attribute of each record; the result specifies in which block of the file the record should be placed
Records of each relation may be stored in a separate file. In a clustering file organization, records of several different relations can be stored in the same file
- Motivation: store related records on the same block to minimize I/O

Index Classification
Primary vs. Secondary
- primary: the index on the primary key
- unique: an index on a candidate key
- secondary: not primary
Clustered vs. Unclustered
- clustered: key order corresponds with record order
  - E.g. B-tree separate from record file
  - Index-organized table: B-tree leaves store records (no file)
- unclustered: index contains record-IDs in random order

B+Tree Example (n=4)
Root:      [100]
Non-leaf:  [30]  [120 150 180]
Leaves:    [3 5 11] [30 35] [100 101 110] [120 130] [150 156 179] [180 200]

Sample non-leaf node
Keys: 57 81 95
Pointers lead to subtrees with:
- keys k < 57
- keys 57 <= k < 81
- keys 81 <= k < 95
- keys k >= 95
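Choosing the child pointer in a non-leaf node is a search for the first key greater than the lookup key. A minimal sketch using Python's bisect module; the function name and the 0-based child numbering are assumptions made here.

```python
from bisect import bisect_right


def child_index(keys, k):
    """Index of the child pointer to follow for search key k.

    Child 0 covers k < keys[0], child i covers keys[i-1] <= k < keys[i],
    and the last child covers k >= keys[-1].
    """
    return bisect_right(keys, k)


node_keys = [57, 81, 95]
followed = [child_index(node_keys, k) for k in (10, 57, 80, 81, 95, 200)]
```

bisect_right performs exactly the binary search inside a node that a later slide identifies as a source of CPU cache misses.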

Sample leaf node
Keys: 57 81 95 (reached from a non-leaf node)
- Pointer to record with key 57
- Pointer to record with key 81
- Pointer to record with key 95
- Pointer to next leaf in sequence

Non-root nodes have to be at least half-full
Use at least:
- Non-leaf: ceil(n/2) children
- Leaf: ceil((n-1)/2) pointers to data

Full node vs. minimal node (n=4)
           Full node      Min. node
Non-leaf   [120 150 180]  [30]
Leaf       [3 5 11]       [30 35]

Insert into B+tree
(a) Simple case: space available in leaf
(b) Leaf overflow
(c) Non-leaf overflow
(d) New root

(b) Leaf overflow: insert key = 7 (n=4)
Before: root [100]; non-leaf [30]; leaves [3 5 11] [30 31]
After:  leaf [3 5 11] splits into [3 5] and [7 11]; key 7 is added to the non-leaf node next to 30
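The leaf-overflow case can be sketched as a function on the sorted key list of a single leaf. This is an illustration only: it ignores record pointers, the next-leaf pointer, and the update of the parent node. The function name and the split rule (left half gets the extra key when the count is odd) are assumptions; `max_keys=3` reflects n=4 under the convention that a leaf holds at most n-1 keys.

```python
from bisect import insort


def insert_into_leaf(keys, key, max_keys=3):
    """Insert key into a sorted leaf.

    Returns (leaf, None, None) when the key fits, otherwise
    (left, right, separator), where separator is the smallest key of the
    new right leaf and has to be inserted into the parent.
    """
    keys = list(keys)
    insort(keys, key)                 # keep the leaf sorted
    if len(keys) <= max_keys:
        return keys, None, None       # (a) simple case: space available
    mid = (len(keys) + 1) // 2        # (b) overflow: split roughly in half
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]


simple = insert_into_leaf([30, 31], 32)
overflow = insert_into_leaf([3, 5, 11], 7)
```

If the parent has no room for the separator, the same split is repeated one level up, which is the non-leaf overflow case.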

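(c) Non-leaf (internal) overflow: insert key = 160 (n=4)
Before: root [100]; non-leaf [120 150 180]; leaves [150 156 179] [180 200]
After:  leaf [150 156 179] splits into [150 156] and [160 179]; the non-leaf node would need keys 120 150 160 180, so it splits into [120 150] and [180], and key 160 moves up next to 100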

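(d) New root: insert key = 45 (n=4)
Before: root [10 20 30]; leaves [1 2 3] [10 12] [20 25] [30 32 40]
After:  leaf [30 32 40] splits into [30 32] and [40 45]; the old root would need keys 10 20 30 40, so it splits into [10 20] and [40] under a new root [30]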

B+ trees and CPU Caches
Problem:
- Binary search in a B+ tree node
- CPU cache misses!
Ideas:
- Fractal Prefetching B-trees (Chen et al., SIGMOD 2002)
  - "cache-oblivious B-trees"
- Optimizing the memory layout (Rao et al., SIGMOD 2000)
  - Eliminate internal pointers
- Buffered access (Zhou et al., SIGMOD 2004)
  - Do lookups in batches

Deletion from B+tree
(a) Simple case: no example
(b) Coalesce with neighbor (sibling)
(c) Re-distribute keys
(d) Cases (b) or (c) at non-leaf

(b) Coalesce with sibling: delete 50 (n=5)
Before: parent [10 40 100]; leaves [10 20 30] [40 50]
After:  leaf [40] is too empty and merges with its sibling into [10 20 30 40]; key 40 is removed from the parent

(c) Redistribute keys: delete 50 (n=4)
Before: parent [10 40 100]; leaves [20 30 35] [40 50]
After:  key 35 moves over from the sibling, giving leaves [20 30] and [35 40]; the parent key 40 becomes 35

(d) Non-leaf coalesce: delete 37 (n=4)
Before: root [25]; non-leaf nodes [20] and [30]; leaves [10 14] [20 22] [25 26] [30 37]
After:  leaf [30] merges with its sibling [25 26]; the non-leaf nodes then coalesce, pulling key 25 down from the old root, and the merged node becomes the new root

B+tree deletions in practice
- Often, coalescing is not implemented
  - Too hard and not worth it!

Interesting problem: for a B+tree, how large should n be?
(n is the number of keys per node)

Assumptions
You have the right to set the disk page size for the disk where a B-tree will reside.
Compute the optimum page size n assuming that
- The items are 4 bytes long and the pointers are also 4 bytes long.
- Time to read a node from disk is 10 + 0.0002n
- Time to process a block in memory is unimportant
- The B+tree is full (i.e., every page has the maximum number of items and pointers)

Find n_opt by solving f'(n) = 0
What happens to n_opt as:
- Disk bandwidth increases?
- Access time stays behind?
- CPUs get faster?

f(n) = time to find a record = log_n(T) * (10 + 0.0002n)
(log_n(T) is the number of levels to traverse for a table of T records; 10 + 0.0002n is the time to read one node.)
1994 (book) -> 2004 (now): n = 500 -> n = 4000
1994:
- Table of 1M records
- 10 ms access time
- 4 MB/s bandwidth
- n ~ 500-1000, 4KB / 8KB pages
- Be conservative to limit RAM consumption
2004:
- Table of 10M records
- 6 ms access time
- 40 MB/s bandwidth
- n ~ 1000-4000, 8KB / 32KB pages
- Relative benefit decreases, so don't overdo it

Find n_opt by solving f'(n) = 0
Answer should be n_opt = "a few thousand"
What happens to n_opt as:
- Disk bandwidth increases?
- Access time stays behind?
- CPUs get faster?
-> Block sizes are increasing
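Instead of solving f'(n) = 0 by hand, the minimum of f(n) = log_n(T) * (10 + 0.0002n) can be located numerically. The sketch below does a brute-force search over integer n; the function and parameter names, the search range, and the reading of the two constants as "access time" and "transfer time per key" are assumptions made here.

```python
import math


def lookup_time(n, T, access=10.0, per_key=0.0002):
    """f(n) = log_n(T) * (access + per_key * n): levels times time per node."""
    return math.log(T, n) * (access + per_key * n)


def optimal_fanout(T, access=10.0, per_key=0.0002, max_n=100_000):
    """Integer n in [2, max_n) minimizing the lookup time (brute force)."""
    return min(range(2, max_n), key=lambda n: lookup_time(n, T, access, per_key))


n_opt = optimal_fanout(T=1_000_000)
# Doubling the bandwidth halves the per-key transfer time.
n_opt_fast_disk = optimal_fanout(T=1_000_000, per_key=0.0001)
```

Under these constants the minimum lies at a few thousand keys per node, and it moves to larger n when bandwidth improves while access time stays the same. Setting f'(n) = 0 gives n * (ln n - 1) = access / per_key, so the optimum does not depend on the table size T.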

Clustered vs. Unclustered Index (Primary or Auxiliary Structure)
Primary index
- Leaf blocks in sequence -> clustered index
- Main storage structure for a database table
  - E.g. B+-tree organized file / hash structured files
- Typically an index on a unique key
  - But not necessarily
- Normally, you can have only one clustered index!
Secondary index
- Also called unclustered index
- A separate file from where the table is stored
- Refers with (block/offset) pointers to records in the table file
- You can define as many as you want (to maintain)
Range lookup from low to high:
- Primary B-tree index: 1 access only (the rest is 'just' bandwidth)
- Secondary B-tree index: pay N times the access cost

Are Unclustered Indices a Good Idea?
- Secondary indices depend on random I/O
  - Can do asynchronous I/O (multiple I/Os at a time)
  - Degenerates into full table scans
- Is not using an index at all better?
  - I.e. read the entire table sequentially without any index
- Use redundant clustered orderings
  - Materialized views
  - C-STORE (Stonebraker et al., VLDB 2005), MonetDB/X100
  - Database Cracking (Kersten, CIDR 2005+2007)
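A back-of-the-envelope cost model shows how quickly random I/O through an unclustered index loses against a sequential scan. The disk parameters (80 MB/s, 10 ms access time) are the magnetic disk figures from the storage hierarchy slide; the table of 10M records of 100 bytes each, the function names, and the simplifications (one random access per matching record, index traversal and caching ignored) are assumptions of this sketch.

```python
def full_scan_ms(n_records, record_bytes, bandwidth_mb_s=80.0, access_ms=10.0):
    """One access, then the whole table at sequential bandwidth."""
    table_mb = n_records * record_bytes / 1e6
    return access_ms + table_mb / bandwidth_mb_s * 1000.0


def unclustered_fetch_ms(n_matches, access_ms=10.0):
    """Pay the access cost once per matching record (random I/O)."""
    return n_matches * access_ms


N_RECORDS = 10_000_000
scan = full_scan_ms(N_RECORDS, record_bytes=100)
# Number of matches at which index lookups cost as much as a full scan.
break_even_matches = scan / 10.0
break_even_selectivity = break_even_matches / N_RECORDS
```

Under these assumptions the unclustered index only pays off when the query selects roughly 0.01% of the table or less, which is the motivation for the redundant clustered orderings listed on the slide.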
