Database Tuning Principles, Experiments and Troubleshooting Techniques Dennis Shasha (shasha@cs.nyu.edu) Philippe Bonnet (bonnet.p@gmail.com)
Availability of Materials PowerPoint presentation is available on my web site (and Philippe Bonnet’s); just type our names into Google. Experiments are available from the Denmark site maintained by Philippe (site in a few slides). Book with the same title is available from Morgan Kaufmann.
Database Tuning Database Tuning is the activity of making a database application run more quickly. “More quickly” usually means higher throughput, though it may mean lower response time for time-critical applications.
Hardware [Processor(s), Disk(s), Memory] Application Programmer (e.g., business analyst, Data architect) Application Sophisticated Application Programmer (e.g., SAP admin) Query Processor Indexes Storage Subsystem Concurrency Control Recovery DBA, Tuner Operating System Hardware [Processor(s), Disk(s), Memory]
Outline of Tutorial Basic Principles Tuning the guts Indexes Relational Systems Application Interface Ecommerce Applications Data warehouse Applications Distributed Applications Troubleshooting
Goal of the Tutorial To show: Tuning principles that port from one system to the other and to new technologies Experimental results to show the effect of these tuning principles. Troubleshooting techniques for chasing down performance problems.
Tuning Principles Leitmotifs Think globally, fix locally (does it matter?) Partitioning breaks bottlenecks (temporal and spatial) Start-up costs are high; running costs are low (disk transfer, cursors) Be prepared for trade-offs (indexes and inserts)
Experiments -- why and where Simple experiments to illustrate the performance impact of tuning principles. Philippe Bonnet has lots of code on his course’s site: https://learnit.itu.dk/mod/workshop/view.php?id=43059
Experimental DBMS and Hardware Results presented here obtained with old systems. Conclusions mostly hold: SQL Server 7, SQL Server 2000, Oracle 8i, Oracle 9i, DB2 UDB 7.1 Three configurations: Dual Xeon (550MHz,512Kb), 1Gb RAM, Internal RAID controller from Adaptec (80Mb) 2 Ultra 160 channels, 4x18Gb drives (10000RPM), Windows 2000. Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives (10000RPM), Windows 2000. Pentium III (1 GHz, 256 Kb), 1Gb RAM, Adapter 39160 with 2 channels, 3x18Gb drives (10000RPM), Linux Debian 2.4.
Tuning the Guts Concurrency Control: How to minimize lock contention? Recovery: How to manage the writes to the log (to dumps)? OS: How to optimize buffer size, process scheduling, … Hardware: How to allocate CPU, RAM and disk subsystem resources?
Isolation Correctness vs. Performance Number of locks held by each transaction Kind of locks Length of time a transaction holds locks
Isolation Levels Read Uncommitted (No lost update) Exclusive locks for write operations are held for the duration of the transactions No locks for read Read Committed (No dirty retrieval) Shared locks are released as soon as the read operation terminates. Repeatable Read (no unrepeatable reads for read/write ) Two phase locking Serializable (read/write/insert/delete model) Table locking or index locking to avoid phantoms
Snapshot isolation Each transaction executes against the version of the data items that was committed when the transaction started: No locks for read. Costs space (old copy of data must be kept). Almost serializable level: T1: x:=y T2: y:=x Initially x=3 and y=17. Serial execution: x,y=17 or x,y=3. Snapshot isolation: x=17, y=3 if both transactions start at the same time. [Figure: timeline with T1, T2, T3; initially X=Y=Z=0; writes W(Y:=1) and W(X:=2, Z:=3); T1's reads: R(Y) returns 1, R(Z) returns 0, R(X) returns 0.]
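The x:=y / y:=x example can be replayed in a few lines. This is a minimal sketch, not a model of any particular DBMS: the function names and the dictionary standing in for the database are assumptions made for illustration.

```python
def run_serial(order, state):
    """Run T1 (x := y) and T2 (y := x) one after the other on the live data."""
    db = dict(state)
    for txn in order:
        if txn == "T1":
            db["x"] = db["y"]   # T1: x := y
        else:
            db["y"] = db["x"]   # T2: y := x
    return db


def run_snapshot(state):
    """Both transactions start at the same time and read their own snapshot."""
    snapshot_t1 = dict(state)   # version committed when T1 started
    snapshot_t2 = dict(state)   # version committed when T2 started
    db = dict(state)
    db["x"] = snapshot_t1["y"]  # T1 writes x from its snapshot
    db["y"] = snapshot_t2["x"]  # T2 writes y from its snapshot
    return db


initial = {"x": 3, "y": 17}
print(run_serial(["T1", "T2"], initial))  # both values end up 17
print(run_serial(["T2", "T1"], initial))  # both values end up 3
print(run_snapshot(initial))              # values are swapped
```

The two transactions write different items, so no write-write conflict is detected, yet the outcome matches neither serial order.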
Value of Serializability -- Data Settings: accounts( number, branchnum, balance); create clustered index c on accounts(number); 100000 rows Cold buffer; same buffer size on all systems. Row level locking Isolation level (SERIALIZABLE or READ COMMITTED) SQL Server 7, DB2 v7.1 and Oracle 8i on Windows 2000 Dual Xeon (550MHz,512Kb), 1Gb RAM, Internal RAID controller from Adaptec (80Mb), 4x18Gb drives (10000RPM), Windows 2000.
Value of Serializability -- transactions Concurrent Transactions: T1: summation query [1 thread] select sum(balance) from accounts; T2: swap balance between two account numbers (in order of scan to avoid deadlocks) [N threads] T1: valX:=select balance from accounts where number=X; valY:=select balance from accounts where number=Y; T2: update accounts set balance=valX where number=Y; update accounts set balance=valY where number=X;
Value of Serializability -- results With SQL Server and DB2 the scan returns incorrect answers if the read committed isolation level is used (default setting) With Oracle correct answers are returned (snapshot isolation).
Cost of Serializability Because the update conflicts with the scan, correct answers are obtained at the cost of decreased concurrency and thus decreased throughput.
Locking Implementation Database Item (e.g., row or table) Lock set L Wait set W x LO1(T1,S), LO3(T3,S) LW2(T2,X) y LO2(T2,X) LW4(T4, S), LW5(T5, X) Transaction ID Locks T1 LO1 T2 LO2, LW2 T3 LO3 T4 LW4 T5 LW5 @ Dennis Shasha and Philippe Bonnet, 2013
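The lock set / wait set structure above can be sketched as follows. This is an illustrative toy, assuming only S and X modes and a FIFO wait set; the class and method names are hypothetical and do not come from any DBMS.

```python
from collections import defaultdict

# Assumption: only shared/shared requests are compatible.
COMPATIBLE = {("S", "S")}


class LockTable:
    def __init__(self):
        self.lock_set = defaultdict(list)  # item -> granted (txn, mode) pairs
        self.wait_set = defaultdict(list)  # item -> queued (txn, mode) pairs

    def request(self, txn, item, mode):
        """Grant the lock if compatible with all holders, else queue it."""
        holders = self.lock_set[item]
        compatible = all((held, mode) in COMPATIBLE for _, held in holders)
        # Queue behind existing waiters so that they are not starved.
        if compatible and not self.wait_set[item]:
            holders.append((txn, mode))
            return True
        self.wait_set[item].append((txn, mode))
        return False


table = LockTable()
table.request("T1", "x", "S")  # granted
table.request("T2", "y", "X")  # granted
table.request("T3", "x", "S")  # granted, shared with T1
table.request("T2", "x", "X")  # waits behind T1 and T3
table.request("T4", "y", "S")  # waits behind T2
table.request("T5", "y", "X")  # waits behind T2
```

Replaying the requests in this order reproduces the lock and wait sets shown for items x and y.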
Latches and Locks Locks are used for concurrency control: Requests for locks are queued (priority queue). Lock data structure: locking mode (S, X), lock granularity, transaction id. Lock table. Latches are used for mutual exclusion: A request for a latch succeeds or fails. Active wait (spinning) on latches on multiple CPUs. Single location in memory. Test and set for latch manipulation.
Phantom Problem Table R (E#, Name, age): [row1] 01 Smith 35; [row2] 02 Jones 28. T1: Insert into R values (03, Smythe, 42) T2: Select max(age) from R where Name like ‘Sm%’; Select max(age) from R where Name like ‘Sm%’ Any serializable execution returns twice the same max value (either 35 if T2 is executed before T1, or 42 if T1 is executed before T2). Schedule over time (snapshot isolation not in effect): T2 gets locks on existing rows, reads all rows and computes max (35); T1 inserts (03, Smythe, 42) and commits; T2 reads all rows and computes max again (42), releases locks and commits. Two-phase locking with row locks does not prevent concurrent insertions, as row locks only protect existing rows. This schedule returns two different max values!
Solution to Phantom Problem Solution #1: Table locking (mode X). No insertion is allowed in the table. Problem: too coarse if predicate is used in transactions. Solution #2: Predicate locking – avoid inserting tuples that satisfy a given predicate, e.g., 30 < age < 40. Problem: very complex to implement. Solution #3: Next key locking (NS). See index tuning. [Figure: within the set of all tuples that can be inserted in R, the set of tuples in R and the set of tuples that satisfy predicate P.]
Locking Overhead -- data Settings: accounts( number, branchnum, balance); create clustered index c on accounts(number); 100000 rows Cold buffer SQL Server 7, DB2 v7.1 and Oracle 8i on Windows 2000 No lock escalation on Oracle; Parameter set so that there is no lock escalation on DB2; no control on SQL Server. Dual Xeon (550MHz,512Kb), 1Gb RAM, Internal RAID controller from Adaptec (80Mb), 4x18Gb drives (10000RPM), Windows 2000.
Locking Overhead -- transactions No Concurrent Transactions: Update [10 000 updates] update accounts set balance = Val; Insert [10 000 transactions], e.g. typical one: insert into accounts values(664366,72255,2296.12);
Locking Overhead Row locking is barely more expensive than table locking because recovery overhead is higher than row locking overhead Exception is updates on DB2 where table locking is distinctly less expensive than row locking.
Logical Bottleneck: Sequential Key generation Consider an application in which one needs a sequential number to act as a key in a table, e.g. invoice numbers for bills. Ad hoc approach: a separate table holding the last invoice number. Fetch and update that number on each insert transaction. Counter approach: use facility such as Sequence (Oracle)/Identity(MSSQL).
Counter Facility -- data Settings: default isolation level: READ COMMITTED; Empty tables Dual Xeon (550MHz,512Kb), 1Gb RAM, Internal RAID controller from Adaptec (80Mb), 4x18Gb drives (10000RPM), Windows 2000. accounts( number, branchnum, balance); create clustered index c on accounts(number); counter ( nextkey ); insert into counter values (1);
Counter Facility -- transactions No Concurrent Transactions: System [100 000 inserts, N threads] SQL Server 7 (uses Identity column) insert into accounts values (94496,2789); Oracle 8i insert into accounts values (seq.nextval,94496,2789); Ad-hoc [100 000 inserts, N threads] begin transaction NextKey:=select nextkey from counter; update counter set nextkey = NextKey+1; insert into accounts values(NextKey,?,?); commit transaction
Avoid Bottlenecks: Counters A system generated counter (system) is much better than a counter managed as an attribute value within a table (ad hoc). The counter is incremented in a separate transaction. The Oracle counter can become a bottleneck if every update is logged to disk, but caching many counter numbers is possible. Counters may miss ids.
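The two key generation approaches can be contrasted in a small sketch. It uses SQLite through Python's standard sqlite3 module purely as a stand-in (an assumption; the experiments used SQL Server and Oracle), with SQLite's automatic INTEGER PRIMARY KEY assignment playing the role of the system counter. It shows the shape of the two transactions, not their relative performance.

```python
import sqlite3


def setup():
    """Create the accounts and counter tables in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table accounts(number integer primary key, "
                 "branchnum integer, balance real)")
    conn.execute("create table counter(nextkey integer)")
    conn.execute("insert into counter values (1)")
    conn.commit()
    return conn


def insert_ad_hoc(conn, branchnum, balance):
    """Ad hoc approach: fetch and update the counter row, then insert."""
    with conn:  # commits on success, rolls back on error
        (next_key,) = conn.execute("select nextkey from counter").fetchone()
        conn.execute("update counter set nextkey = ?", (next_key + 1,))
        conn.execute("insert into accounts values (?, ?, ?)",
                     (next_key, branchnum, balance))
    return next_key


def insert_system(conn, branchnum, balance):
    """System approach: let the DBMS assign the key."""
    with conn:
        cur = conn.execute(
            "insert into accounts(branchnum, balance) values (?, ?)",
            (branchnum, balance))
    return cur.lastrowid


ad_hoc_keys = [insert_ad_hoc(setup_conn, 94496, 2789.0)
               for setup_conn in [setup()] for _ in range(1)]
```

In the ad hoc version every insert reads and updates the same counter row, so that row is the point where concurrent inserters would queue up.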
Insertion Points -- transactions No Concurrent Transactions: Sequential [100 000 inserts, N threads] Insertions into account table with clustered index on ssnum Data is sorted on ssnum Single insertion point Non Sequential [100 000 inserts, N threads] Data is not sorted (uniform distribution) 100 000 insertion points Hashing Key [100 000 inserts, N threads] Insertions into account table with extra attribute att with clustered index on (ssnum, att) Extra attribute att contains hash key (1021 possible values) 1021 insertion points
Insertion Points Page locking: single insertion point is a source of contention (sequential key with clustered index, or heap) Row locking: No contention between successive insertions. DB2 v7.1 and Oracle 8i do not support page locking.
Semantics-Altering Chopping You call up an airline and want to reserve a seat. You talk to the agent, find a seat and then reserve it. Should all of this be a single transaction?
Semantics-Altering Chopping II Probably not. You don’t want to hold locks through human interaction. Transaction redesign: read is its own transaction. Get seat is another. Consequences?
Atomicity and Durability Every transaction either commits or aborts. It cannot change its mind Even in the face of failures: Effects of committed transactions should be permanent; Effects of aborted transactions should leave no trace. COMMITTED COMMIT Ø ACTIVE (running, waiting) BEGIN TRANS ABORTED ROLLBACK
Pi Pj UNSTABLE STORAGE DATABASE BUFFER LOG DATA DATA DATA LOG BUFFER lri lrj WRITE log records before commit WRITE modified pages after commit LOG DATA DATA DATA RECOVERY STABLE STORAGE
Physical Logging (pure) Update records contain before and after images. Roll-back: install before image. Roll-forward: install after image. Pros: Idempotent. If a crash occurs during recovery, then recovery simply restarts, as phase 2 (rollforward) and phase 3 (rollback) rely on operations that are idempotent. Cons: A single SQL statement might touch many pages and thus generate many update records. The before and after images are large.
Logical Logging (kdb/other main memory systems) Update records contain a logical operation and its inverse instead of before and after images, e.g., <op: insert t in T, inv: delete t from T>. Pro: compact. Cons: Not idempotent. Solution: Include a LastLSN in each database page. During phase 2 of recovery, an operation is rolled forward iff its LSN is higher than the LastLSN of the page. Not atomic: What if a logical operation actually involves several pages, e.g., a data and index page? And possibly several index pages? Solution: Physiological logging.
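The LastLSN rule can be illustrated with a toy roll-forward pass. This is a sketch under simplifying assumptions: each page holds a single integer, each log record is a hypothetical (lsn, page_id, delta) tuple, and the logical operation is "add delta", which is not idempotent.

```python
def redo(pages, log):
    """Roll forward: apply a record only if its LSN exceeds the page's LastLSN."""
    applied = []
    for lsn, page_id, delta in log:      # log records arrive in LSN order
        page = pages[page_id]
        if lsn > page["last_lsn"]:
            page["value"] += delta       # non-idempotent logical operation
            page["last_lsn"] = lsn       # remember how far this page has got
            applied.append(lsn)
    return applied


# Page P already reflects LSN 1 (it reached disk before the crash).
pages = {"P": {"value": 10, "last_lsn": 1}}
log = [(1, "P", 5), (2, "P", 7)]

first_pass = redo(pages, log)    # only LSN 2 is applied
second_pass = redo(pages, log)   # a crash during recovery: nothing is re-applied
```

Running the pass twice leaves the page unchanged the second time, which is the property the LastLSN check restores.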
Physiological Logging Physical across pages, logical within a page. Combines the benefits of logical logging and avoids the problem of atomicity, as logical mini-operations are bound to a single page. Logical operations are split into mini-operations on each page. A log record is created for each mini-operation. Mini-operations are not idempotent, thus page LSNs have to be used in phase 2 of recovery.
Logging in SQL Server, DB2 Log entries: - LSN - before and after images or logical log Physiological logging Free Log caches Current Log caches Flush Log caches free Pi Pj DATABASE BUFFER Waiting processes db writer Flush Log caches Lazy- writer Flush queue Synchronous I/O Asynchronous I/O LOG DATA @ Dennis Shasha and Philippe Bonnet, 2013
Logging in Oracle after 10g Physiological logging In memory undo latch In-memory undo Private redo redo allocation latch Free list In memory undo latch In-memory undo Private redo redo allocation latch In memory undo latch In-memory undo Private redo redo allocation latch (public) Redo allocation latch Redo log buffer Pi Pj Redo copy latches Redo log records (redo+undo) DATABASE BUFFER Redo log records (redo+undo) Redo log records (redo+undo) LGWR (log writer) DBWR (database writer) UNDO LOG DATA Log File #1 Log File #2 @ Dennis Shasha and Philippe Bonnet, 2013
Log IO -- data Settings: lineitem ( L_ORDERKEY, L_PARTKEY , L_SUPPKEY, L_LINENUMBER , L_QUANTITY, L_EXTENDEDPRICE , L_DISCOUNT, L_TAX , L_RETURNFLAG, L_LINESTATUS , L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT , L_SHIPMODE , L_COMMENT ); READ COMMITTED isolation level Empty table Dual Xeon (550MHz,512Kb), 1Gb RAM, Internal RAID controller from Adaptec (80Mb), 4x18Gb drives (10000RPM), Windows 2000.
Log IO -- transactions No Concurrent Transactions: Insertions [300 000 inserts, 10 threads], e.g., insert into lineitem values (1,7760,401,1,17,28351.92,0.04,0.02,'N','O','1996-03-13','1996-02-12','1996-03-22','DELIVER IN PERSON','TRUCK','blithely regular ideas caj');
Group Commits For small transactions, log records might have to be flushed to disk before a log page is filled up. When many small transactions are committed, many IOs are executed to flush half-empty pages. Group commit delays transaction commit until a log page is filled up, so that many transactions are committed together. Pros: Avoids too many round trips to disk, i.e., improves throughput. Cons: Increases mean response time.
Group Commits DB2 UDB v7.1 on Windows 2000 Log records of many transactions are written together Increases throughput by reducing the number of writes at cost of increased minimum response time.
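A back-of-the-envelope count of log writes shows why group commit helps. This is a sketch only: the function name, the 200-byte log record and the 4096-byte log page are assumptions chosen for illustration, not measured values.

```python
import math


def log_flushes(num_txns, record_bytes, page_bytes, group_commit):
    """Number of log page writes needed to commit num_txns small transactions."""
    if not group_commit:
        return num_txns                        # one flush per commit
    # With group commit, a page is flushed once it is full of log records.
    records_per_page = max(1, page_bytes // record_bytes)
    return math.ceil(num_txns / records_per_page)


without_group = log_flushes(1000, 200, 4096, group_commit=False)
with_group = log_flushes(1000, 200, 4096, group_commit=True)
print(without_group, with_group)
```

The price is that the first transaction in each group waits for the page to fill before its commit returns.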
Put Log on a Separate Disk Improve log writer performance HDD: sequential IOs not disturbed by random IOs SSD: minimal garbage collection/wear leveling Isolate data and log failures
Put the Log on a Separate Disk DB2 UDB v7.1 on Windows 2000 5 % performance improvement if log is located on a different disk Controller cache hides negative impact mid-range server, with Adaptec RAID controller (80Mb RAM) and 2x18Gb disk drives.
Tuning Database Writes Dirty data is written to disk When the number of dirty pages is greater than a given parameter (Oracle 8) When the number of dirty pages crosses a given threshold (less than 3% of free pages in the database buffer for SQL Server 7) When the log is full, a checkpoint is forced. This can have a significant impact on performance.
Tune Checkpoint Intervals Oracle 8i on Windows 2000 A checkpoint (partial flush of dirty pages to disk) occurs at regular intervals or when the log is full: Impacts the performance of on-line processing Reduces the size of log Reduces time to recover from a crash
Database Buffer Size Buffer too small, then hit ratio too small: hit ratio = (logical acc. - physical acc.) / (logical acc.) Buffer too large, paging. Recommended strategy: monitor hit ratio and increase buffer size until hit ratio flattens out. If there is still paging, then buy memory. [Figure: database processes and the database buffer in RAM, with the paging disk and the log and data disks.]
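The hit ratio formula and the "increase until it flattens out" strategy can be sketched as below. The measurement tuples and the 0.01 flattening threshold are made-up assumptions for illustration; real numbers would come from the DBMS's monitoring counters.

```python
def hit_ratio(logical, physical):
    """hit ratio = (logical accesses - physical accesses) / logical accesses"""
    return (logical - physical) / logical


def pick_buffer_size(measurements, min_gain=0.01):
    """Return the largest buffer size that still improved the hit ratio.

    measurements: (buffer_mb, logical, physical) tuples sorted by buffer size.
    """
    best_size = measurements[0][0]
    previous = hit_ratio(*measurements[0][1:])
    for size, logical, physical in measurements[1:]:
        current = hit_ratio(logical, physical)
        if current - previous < min_gain:   # curve has flattened out
            break
        best_size, previous = size, current
    return best_size


# Hypothetical monitoring data: (buffer size in Mb, logical, physical accesses).
samples = [(100, 1000, 600), (200, 1000, 300), (400, 1000, 100), (800, 1000, 95)]
chosen = pick_buffer_size(samples)
```

With these sample numbers the gain from 400 Mb to 800 Mb is below the threshold, so the sketch stops at 400 Mb.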
Buffer Size -- data Settings: employees(ssnum, name, lat, long, hundreds1, hundreds2); clustered index c on employees(lat); (unused) 10 distinct values of lat and long, 100 distinct values of hundreds1 and hundreds2 20000000 rows (630 Mb); Warm Buffer Dual Xeon (550MHz,512Kb), 1Gb RAM, Internal RAID controller from Adaptec (80Mb), 4x18Gb drives (10000 RPM), Windows 2000.
Buffer Size -- queries Queries: Scan Query select sum(long) from employees; Multipoint query select * from employees where lat = ?;
Database Buffer Size SQL Server 7 on Windows 2000 Scan query: LRU (least recently used) does badly when table spills to disk as Stonebraker observed 20 years ago. Multipoint query: Throughput increases with buffer size until all data is accessed from RAM.
Scan Performance -- data Settings: lineitem ( L_ORDERKEY, L_PARTKEY , L_SUPPKEY, L_LINENUMBER , L_QUANTITY, L_EXTENDEDPRICE , L_DISCOUNT, L_TAX , L_RETURNFLAG, L_LINESTATUS , L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT , L_SHIPMODE , L_COMMENT ); 600 000 rows Lineitem tuples are ~ 160 bytes long Cold Buffer Dual Xeon (550MHz,512Kb), 1Gb RAM, Internal RAID controller from Adaptec (80Mb), 4x18Gb drives (10000RPM), Windows 2000.
Scan Performance -- queries select avg(l_discount) from lineitem;
Prefetching DB2 UDB v7.1 on Windows 2000 Throughput increases up to a certain point when prefetching size increases.
Usage Factor DB2 UDB v7.1 on Windows 2000 Usage factor is the percentage of the page used by tuples and auxiliary data structures (the rest is reserved for future use). Scan throughput increases with usage factor.
Large Reads External algorithms for sorting/hashing manipulate a working set which is larger than RAM (or larger than the buffer space allocated for sorting/hashing). Sort or hash is performed in multiple passes. In each pass, data is read from disk, hashed/sorted/merged in memory and then written to secondary storage. Pass N+1 can only start when Pass N is done writing data back to secondary storage.
Chop Large Update Transactions Consider an update-intensive batch transaction (concurrent access is not an issue): It can be broken up in short transactions (mini-batch): + Does not overfill the log buffers + Does not overfill the log files Example: Transaction that updates, in sorted order, all accounts that had activity on them, in a given day. Break-up to mini-batches each of which access 10,000 accounts and then updates a global counter. Note: DB2 has a parameter limiting the portion of the log used by a single transaction (max_log) © Dennis Shasha, Philippe Bonnet 2001
Tuning the Storage Subsystem Goals: Principles of tuning are based on today's infrastructure. Storage is a major aspect of DB architectures. Magnetic disks should be understood. We should also understand current trends to evaluate the limitations of the tuning principles. For instance, we will see that today the choice of index has a major impact on performance; the day my 800 Gb database fits in main memory, is the choice of an index still important? In some cases it is not and in some cases it is. We will come back to this in a few weeks. Goal today: What are the characteristics of secondary and tertiary storage? What are the trends that drive the evolution of storage? … we will discuss a couple of challenges.
Outline Storage Subsystem Components Moore’s law and consequences Magnetic disk performances From SCSI to SAN, NAS and Beyond Storage virtualization Tuning the Storage Subsystem RAID levels RAID controller cache
Exponential Growth Moore’s law Every 18 months: New processing = sum of all existing processing New storage = sum of all existing storage 2x / 18 months ~ 100x / 10 years http://www.intel.com/research/silicon/moorespaper.pdf
Consequences of “Moore’s law” Over the last decade: 10x better access time 10x more bandwidth 100x more capacity 4000x lower media price Scan takes 10x longer (3 min vs 45 min) Data on disk is accessed 25x less often (on average) Need for an address bit every 18 months Example: IDE uses 28 bits for sector number Barrier at 2^28 sectors of 512 bytes ~ 137 GB
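The arithmetic behind the growth rate and the IDE addressing barrier can be checked directly. This is a small sketch; the function names are illustrative only.

```python
def growth_factor(months, doubling_months=18):
    """Growth after `months` if capacity doubles every `doubling_months`."""
    return 2 ** (months / doubling_months)


def max_capacity_bytes(address_bits, sector_bytes=512):
    """Largest disk addressable with `address_bits` bits of sector number."""
    return (2 ** address_bits) * sector_bytes


ten_year_growth = growth_factor(120)    # roughly 100x over 10 years
ide_barrier = max_capacity_bytes(28)    # 28-bit sector numbers, 512-byte sectors
print(round(ten_year_growth), ide_barrier / 1e9)
```

Each doubling of capacity needs one more address bit, so adding a bit moves the barrier by a factor of two.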
Data Flood Disk Sales double every nine months Because volume of stored data increases Data Warehouses Internet Logs Web Archives Sky Survey Because media price drops much faster than areal density. Graph courtesy of Joe Hellerstein Source: J. Porter, Disk/Trend, Inc. http://www.disktrend.com/pdf/portrpkg.pdf
Memory Hierarchy [Figure: storage levels ordered from fastest and most expensive to slowest and cheapest: processor cache (about 1 ns access time), RAM/flash, disks, tapes / optical disks (nearline); axes: price ($/Mb) and access time.]
Magnetic Disks Components: controller, read/write head, disk arm, tracks, platter, spindle, actuator, disk interface. 1956: IBM (RAMAC) first disk drive 5 Mb – 0.002 Mb/in2 35000$/year 9 Kb/sec 1980: SEAGATE first 5.25’’ disk drive 5 Mb – 1.96 Mb/in2 625 Kb/sec 1999: IBM MICRODRIVE first 1’’ disk drive 340Mb 6.1 MB/sec Discussion: See disks downstairs: form factor is an element in the evolution. Areal density is the main challenge for capacity: How many bits per track (or per sector, a fraction of a track)? Coating of the platter, bit encoding, process for encoding information. Rotation speed is the main challenge for throughput: How fast can the heads read and decode without making too many mistakes? Around 10000 RPM is current. Actuator and control are key for the access time: controller with cache and processor.
Magnetic Disks Access Time (2001) Disk Interface Controller overhead (0.2 ms) Seek Time (4 to 9 ms) Rotational Delay (2 to 6 ms) Read/Write Time (10 to 500 KB/ms) Disk Interface IDE (16 bits, Ultra DMA - 25 MHz) SCSI: width (narrow 8 bits vs. wide 16 bits) - frequency (Ultra3 - 80 MHz). http://www.pcguide.com/ref/hdd/ Questions: - is data stored on both sides of a platter? Discussion - implication on performances: minimize seek time (sequential access, prefetching/large pages) shared disk (wait most of the time – minimize disk arm) Question: Given disk characteristics: what are principles for data layout? 1 – hot data in RAM - cache (memory hierarchy) 2 – sequential access vs. random access. Sequential access should be favored (locality for reads and writes should be researched and preserved) 3 – role of controller
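The access time components above can be combined into a rough cost model that shows why sequential access should be favored. This is a sketch: the chosen values (6 ms seek, 4 ms rotational delay, 100 KB/ms transfer) are assumptions picked from within the quoted ranges, and the 8 KB page size is hypothetical.

```python
def access_time_ms(kb, seek_ms, rotation_ms, transfer_kb_per_ms, overhead_ms=0.2):
    """Controller overhead + seek + rotational delay + transfer time."""
    return overhead_ms + seek_ms + rotation_ms + kb / transfer_kb_per_ms


def random_read_ms(pages, page_kb, **disk):
    """Every page pays the full positioning cost."""
    return pages * access_time_ms(page_kb, **disk)


def sequential_read_ms(pages, page_kb, **disk):
    """Position once, then transfer all pages in one go."""
    return access_time_ms(pages * page_kb, **disk)


disk = dict(seek_ms=6, rotation_ms=4, transfer_kb_per_ms=100)
random_ms = random_read_ms(1000, 8, **disk)          # about 10 seconds
sequential_ms = sequential_read_ms(1000, 8, **disk)  # about 0.09 seconds
print(random_ms, sequential_ms)
```

Under these assumptions positioning dominates: reading the same 1000 pages sequentially is roughly two orders of magnitude faster than reading them at random.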
Solid State Drive (SSD) Read Write Logical address space Scheduling & Mapping Wear Leveling Garbage collection Program Erase Chip … Flash memory array Channels Physical address space Example on a disk with 1 channel and 4 chips Chip bound Channel bound Chip bound Chip1 Page transfer Page program Chip2 Page program Chip3 Page read Page program Chip4 Command Page program Four parallel reads Four parallel writes @ Dennis Shasha and Philippe Bonnet, 2013
Performance Contract Disk (spinning media): The block device abstraction hides a lot of complexity while providing a simple performance contract: Sequential IOs are orders of magnitude faster than random IOs. Contiguity in the logical space favors sequential IOs. Flash (solid state drive): No intrinsic performance contract. A few invariants: No need to avoid random IOs. Applications should avoid writes smaller than a flash page. Applications should fill up the device IO queues (but not overflow them) so that the SSD can leverage its internal parallelism.
RAID Levels RAID 0: striping (no redundancy) RAID 1: mirroring (2 disks) RAID 5: parity checking Read: stripes read from multiple disks (in parallel) Write: 2 reads + 2 writes RAID 10: striping and mirroring Software vs. Hardware RAID: Software RAID: run on the server’s CPU Hardware RAID: run on the RAID controller’s CPU
Why 4 reads/writes when updating a single stripe using RAID 5? Read old data stripe; read parity stripe (2 reads). XOR old data stripe with the new data stripe. Take the result of the XOR and XOR it with the parity stripe to get the new parity stripe. Write new data stripe and new parity stripe (2 writes).
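The RAID 5 small-write sequence can be written out with byte strings standing in for stripes. This is a minimal sketch: the list of three "disks" and the function names are assumptions for illustration, not a controller implementation.

```python
def xor_blocks(a, b):
    """Bytewise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def raid5_small_write(disks, data_idx, parity_idx, new_data):
    """Update one data stripe and its parity; returns the IO count."""
    old_data = disks[data_idx]                   # read 1: old data stripe
    old_parity = disks[parity_idx]               # read 2: parity stripe
    delta = xor_blocks(old_data, new_data)       # what changed in the data
    new_parity = xor_blocks(delta, old_parity)   # fold the change into parity
    disks[data_idx] = new_data                   # write 1: new data stripe
    disks[parity_idx] = new_parity               # write 2: new parity stripe
    return {"reads": 2, "writes": 2}


# Two data stripes and their parity (d0 XOR d1).
disks = [b"\x01\x02", b"\x10\x20", b"\x11\x22"]
ios = raid5_small_write(disks, data_idx=0, parity_idx=2, new_data=b"\xff\x00")
```

The other data stripe is never read, and the parity still equals the XOR of all data stripes after the update.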
RAID Levels -- data Settings: accounts( number, branchnum, balance); create clustered index c on accounts(number); 100000 rows Cold Buffer Dual Xeon (550MHz,512Kb), 1Gb RAM, Internal RAID controller from Adaptec (80Mb), 4x18Gb drives (10000RPM), Windows 2000.
RAID Levels -- transactions No Concurrent Transactions: Read Intensive: select avg(balance) from accounts; Write Intensive, e.g. typical insert: insert into accounts values (690466,6840,2272.76); Writes are uniformly distributed.
RAID Levels SQL Server7 on Windows 2000 (SoftRAID means striping/parity at host) Read-Intensive: Using multiple disks (RAID0, RAID 10, RAID5) increases throughput significantly. Write-Intensive: Without cache, RAID 5 suffers. With cache, it is ok.
RAID Levels Log File Temporary Files Data and Index Files RAID 1 is appropriate Fault tolerance with high write throughput. Writes are synchronous and sequential. No benefits in striping. Temporary Files RAID 0 is appropriate. No fault tolerance. High throughput. Data and Index Files RAID 5 is best suited for read intensive apps or if the RAID controller cache is effective enough. RAID 10 is best suited for write intensive apps.
Controller Prefetching no, Write-back yes. Read-ahead: Prefetching at the disk controller level. No information on access pattern. Better to let database management system do it. Write-back vs. write through: Write back: transfer terminated as soon as data is written to cache. Batteries to guarantee write back in case of power failure Write through: transfer terminated as soon as data is written to disk.
SCSI Controller Cache -- data Settings: employees(ssnum, name, lat, long, hundreds1, hundreds2); create clustered index c on employees(hundreds2); Employees table partitioned over two disks; Log on a separate disk; same controller (same channel). 200 000 rows per table Database buffer size limited to 400 Mb. Dual Xeon (550MHz,512Kb), 1Gb RAM, Internal RAID controller from Adaptec (80Mb), 4x18Gb drives (10000RPM), Windows 2000.
SCSI (not disk) Controller Cache -- transactions No Concurrent Transactions: update employees set lat = long, long = lat where hundreds2 = ?; cache friendly: update of 20,000 rows (~90Mb) cache unfriendly: update of 200,000 rows (~900Mb)
SCSI Controller Cache SQL Server 7 on Windows 2000. Adaptec ServerRaid controller: 80 Mb RAM Write-back mode Updates Controller cache increases throughput whether operation is cache friendly or not. Efficient replacement policy!
Dealing with Multi-Core Socket 0 Socket 1 Cache locality is king: Processor affinity Interrupt affinity Spinning vs. blocking Core 0 Core 1 Core 2 Core 3 CPU CPU CPU CPU L1 cache L1 cache L1 cache L1 cache L2 Cache L2 Cache RAM System Bus IO, NIC Interrupts LOOK UP: SW for shared multi-core, Interrupts and IRQ tuning @ Dennis Shasha and Philippe Bonnet, 2013
Row store Page Layout
structure record_id { integer page_id; integer row_id; }
procedure RECORD_ID_TO_BYTES(record_id rid) returns bytes {
  pid = rid.page_id;
  p = PAGE_ID_TO_PAGE(pid);
  byte byte_array[PAGE_SIZE_IN_BYTES];
  byte_array = p.contents;
  byte_address = byte_array + PAGE_SIZE_IN_BYTES - 1; // last byte of the page
  row_start = byte_address - rid.row_id * 2; // each address entry is 2B
  return RECORD_ADDRESS_TO_BYTES(row_start);
}
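The lookup above can be made concrete with a small slotted-page sketch. Several details are assumptions rather than part of the slide: a 4096-byte page, 0-based row ids, a slot directory growing backwards from the end of the page, and a 2-byte length prefix in front of each record.

```python
import struct

PAGE_SIZE = 4096   # assumed page size in bytes
SLOT_BYTES = 2     # each address entry is 2 bytes


def new_page():
    return {"contents": bytearray(PAGE_SIZE), "free": 0, "rows": 0}


def slot_position(row_id):
    """Slot entries grow backwards from the end of the page."""
    return PAGE_SIZE - SLOT_BYTES * (row_id + 1)


def insert_row(page, payload):
    """Append a length-prefixed record and register its offset in a slot."""
    start = page["free"]
    record = struct.pack("<H", len(payload)) + payload
    slot = slot_position(page["rows"])
    if start + len(record) > slot:
        raise ValueError("page full")
    page["contents"][start:start + len(record)] = record
    struct.pack_into("<H", page["contents"], slot, start)
    page["free"] = start + len(record)
    page["rows"] += 1
    return page["rows"] - 1


def record_id_to_bytes(pages, page_id, row_id):
    """Follow the slot entry to the record and return its bytes."""
    contents = pages[page_id]["contents"]
    (start,) = struct.unpack_from("<H", contents, slot_position(row_id))
    (length,) = struct.unpack_from("<H", contents, start)
    return bytes(contents[start + 2:start + 2 + length])


pages = {7: new_page()}
first = insert_row(pages[7], b"alice")
second = insert_row(pages[7], b"bob")
```

Because records are reached through the slot directory, a record can be moved within the page without changing its record id.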
Record Structure Procedure column_id_to_bytes return bytes
Storing Large Attributes
Columnstore Ids Explicit IDs: Expand size on disk. Expand size when transferring data to RAM. Virtual IDs: Offset as virtual ID. Trades simple arithmetic for space, i.e., CPU time for IO time. Assumes fixed width attributes. Challenge when using compression.
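Using the offset as a virtual ID amounts to one multiplication per lookup. The sketch below assumes a column of fixed-width 4-byte little-endian integers; the layout and function name are illustrative only.

```python
import struct

VALUE_FORMAT = "<i"                           # assumed fixed-width 4-byte integer
VALUE_WIDTH = struct.calcsize(VALUE_FORMAT)   # 4 bytes


def read_value(column_bytes, virtual_id):
    """Locate a value by position alone: no ID is stored with the value."""
    offset = virtual_id * VALUE_WIDTH         # simple arithmetic replaces a stored ID
    (value,) = struct.unpack_from(VALUE_FORMAT, column_bytes, offset)
    return value


column = struct.pack("<4i", 10, 20, 30, 40)   # four values, 16 bytes, no IDs
third = read_value(column, 2)
```

The multiplication only works while every value has the same width, which is why variable-length or compressed columns make virtual IDs harder.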
Page Layout (source: IEEE) Row store: N-ary Storage Model (NSM). Decomposed Storage Model (DSM). PAX Model: Partition Attributes Across.
PAX Model Invented by A.Ailamaki in early 2000s IO Pattern of NSM Great for cache utilization columns packed together in cache lines
Partitioning There is parallelism (i) across servers, and (ii) within a server both at the CPU level and throughout the IO stack. To leverage this parallelism Rely on multiple instances/multiple partitions per instance A single database is split across several instances. Different partitions can be allocated to different CPUs (partition servers) / Disks (partition). Problem#1: How to control overall resource usage across instances/partitions? Control the number/priority of threads spawned by a DBMS instance Problem#2: How to manage priorities? Problem#3: How to map threads onto the available cores Fix: processor/interrupt affinity
@ Dennis Shasha and Philippe Bonnet, 2013 Instance Caging Allocating a number of CPU (core) or a percentage of the available IO bandwidth to a given DBMS Instance Two policies: Partitioning: the total number of CPUs is partitioned across all instances Over-provisioning: more than the total number of CPUs is allocated to all instances # Cores Instance A (2 CPU) Max #core Instance A (2 CPU) Instance B (3 CPU) Instance B (2 CPU) Instance C (2 CPU) Instance C (1 CPU) Instance D (1 CPU) Instance C (1 CPU) Partitioning Over-provisioning LOOK UP: Instance Caging @ Dennis Shasha and Philippe Bonnet, 2013
Number of DBMS Threads Given the DBMS process architecture How many threads should be defined for Query agents (max per query, max per instance) Multiprogramming level (see tuning the writes and index tuning) Log flusher See tuning the writes Page cleaners Prefetcher See index tuning Deadlock detection See lock tuning Fix the number of DBMS threads based on the number of cores available at HW/VM level Partitioning vs. Over-provisioning Provisioning for monitoring, back-up, expensive stored procedures/UDF
Priorities Mainframe operating systems have long allowed thread priority as well as IO priority to be configured. It is now possible to set IO priorities on Linux as well: Threads associated with synchronous IOs (writes to the log, page cleaning under memory pressure, query agent reads) should have higher priorities than threads associated with asynchronous IOs (prefetching, page cleaner with no memory pressure) – see tuning the writes and index tuning. Synchronous IOs should have higher priority than asynchronous IOs. LOOK UP: Getting Priorities Straight, Linux IO priorities
The Priority Inversion Problem Three transactions: T1, T2, T3 in priority order (high to low). T3 obtains lock on x and is preempted. T1 blocks on x lock, so is descheduled. T2 does not access x and runs for a long time. Net effect: T1 waits for T2. Solution: No thread priority, or priority inheritance. [Figure: timeline of T1 (priority #1), T2 (priority #2) and T3 (priority #3) showing running and waiting states; T3 locks X, T1 requests X.]
Processor/Interrupt Affinity Mapping of thread context or interrupt to a given core Allows cache line sharing between application threads or between application thread and interrupt (or even RAM sharing in NUMA) Avoid dispatch of all IO interrupts to core 0 (which then dispatches software interrupts to the other cores) Should be combined with VM options Specially important in NUMA context Affinity policy set at OS level or DBMS level? LOOK UP: Linux CPU tuning
Processor/Interrupt Affinity IOs should complete on the core that issued them I/O affinity in SQL server Log writers distributed across all NUMA nodes Locking of a shared data structure across cores, and specially across NUMA nodes Avoid multiprogramming for query agents that modify data Query agents should be on the same NUMA node DBMS have pre-set NUMA affinity policies LOOK UP: Oracle and NUMA, SQL Server and NUMA
Transferring large files (with TCP) With the advent of cloud computing, it is often necessary to transfer large files over the network when loading a database. To speed up the transfer of large files: Increase the size of the TCP buffers. Increase the socket buffer size (Linux). Set up TCP large windows (and timestamps). Rely on selective acks. LOOK UP: TCP Tuning
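The socket buffer advice above can be sketched from the application side. This is a minimal illustration, not a tuning recipe: the 4 MiB request is an assumed value (in practice it would be sized to bandwidth times round-trip time), and the helper name make_bulk_socket is hypothetical. On Linux the kernel may silently cap the request at the system-wide limits (net.core.wmem_max and net.core.rmem_max), so the code reads the sizes back instead of trusting the request.

```python
import socket

# Assumed target size for illustration only (4 MiB).
REQUESTED_BYTES = 4 * 1024 * 1024


def make_bulk_socket(requested=REQUESTED_BYTES):
    """Create a TCP socket and ask for larger send/receive buffers.

    The buffers are requested before connecting so that the TCP window
    can be negotiated with the larger size in mind.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, requested)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
    return sock


sock = make_bulk_socket()
# Read back what the operating system actually granted.
send_buf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
recv_buf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
sock.close()
print("send buffer:", send_buf, "receive buffer:", recv_buf)
```

If the granted sizes are far below the request, the system-wide limits probably need raising before the per-socket setting can take effect.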
Throwing hardware at a problem More Memory: increase buffer size without increasing paging. More Disks: log on separate disk; mirror frequently read file; partition large files. More Processors: off-load non-database applications onto other CPUs; off-load data mining applications to old database copy; increase throughput to shared data. Shared memory or shared disk architecture.
Virtual Storage Performance
Host: Ubuntu 12.04, noop scheduler.
VM: VirtualBox 4.2 (nice -5), all accelerations enabled, 4 CPUs, 8 GB VDI disk (fixed), SATA controller.
Guest: Ubuntu 12.04, noop scheduler.
Hardware: Core i5 CPU 750 @ 2.67GHz, 4 cores. Intel 710 (100GB, 2.5in SATA 3Gb/s, 25nm, MLC): 170 MB/s sequential write, 85 usec write latency, 38500 iops random reads (full range; 32 iodepth), 75 usec read latency.
Experiments with flexible I/O tester (fio): sequential writes (seqWrites.fio) and random reads (randReads.fio).
[seqWrites]
ioengine=libaio
iodepth=32
rw=write
bs=4k,4k
direct=1
numjobs=1
size=50m
directory=/tmp
[randReads]
ioengine=libaio
iodepth=32
rw=randread
bs=4k,4k
direct=1
numjobs=1
size=50m
directory=/tmp
Virtual Storage - seqWrites Default page size is 4k, default iodepth is 32. [Figures: Performance on Host; Performance on Guest]
Virtual Storage - seqWrites Page size is 4k, iodepth is 32
Virtual Storage - randReads Page size is 4k (experiments with 32k show negligible difference)
Index Tuning Index issues Indexes may be better or worse than scans Multi-table joins that run on for hours, because the wrong indexes are defined Concurrency control bottlenecks Indexes that are maintained and never used
Index An index is a data structure that supports efficient access to data. [Figure: a condition on an attribute value goes through the index (search key) to select the matching records from a set of records.]
Search Keys A (search) key is a sequence of attributes. create index i1 on accounts(branchnum, balance); Types of keys Sequential: the value of the key is monotonic with the insertion order (e.g., counter or timestamp) Non sequential: the value of the key is unrelated to the insertion order (e.g., social security number)
Data Structures Most index data structures can be viewed as trees. In general, the root of this tree will always be in main memory, while the leaves will be located on disk. The performance of a data structure depends on the number of nodes in the average path from the root to the leaf. Data structures with high fan-out (maximum number of children of an internal node) are thus preferred.
B+-Tree Performance Tree levels: tree fanout (size of key, page utilization). Tree maintenance: online (on inserts, on deletes) or offline. Tree locking. Tree root in main memory.
B+-Tree A B+-Tree is a balanced tree whose nodes contain a sequence of key-pointer pairs. Depth Fan-out
B+-Tree Nodes contain a bounded number of key-pointer pairs determined by b (branching factor), where size(node) = number of key-pointer pairs. Internal nodes: ceiling(b/2) <= size(node) <= b. Root node: if the root is the only node in the tree, 1 <= size(node) <= b; if internal nodes exist, 2 <= size(node) <= b. Leaves (no pointers): floor(b/2) <= number(keys) <= b-1. Insertion and deletion algorithms keep the tree balanced and maintain these constraints on the size of each node. Nodes might then be split or merged, and the depth of the tree possibly increased.
B+-Tree Performance #1 Memory / Disk: Root is always in memory. What is the portion of the index actually in memory? It impacts the number of IOs. Worst case: an I/O per level in the B+-tree! Fan-out and tree level: Both are interdependent. They depend on the branching factor. In practice, index nodes are mapped onto index pages of fixed size. Branching factor then depends on key size (pointers of fixed size) and page utilization.
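The dependence of tree depth on key size can be made concrete with a little arithmetic. The sketch below is a simplification, assuming 4 KB pages, 4-byte pointers and 69% page utilization (all illustrative figures, not taken from the experiments), and it models a tree of L levels with fan-out f as addressing f^L keys. The function names are hypothetical.

```python
def branching_factor(page_bytes, key_bytes, pointer_bytes, utilization):
    """Key-pointer pairs that fit in one index page at a given utilization."""
    return int(page_bytes * utilization // (key_bytes + pointer_bytes))


def tree_levels(num_keys, fanout):
    """Smallest number of levels such that fanout ** levels >= num_keys."""
    levels, capacity = 1, fanout
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels


# Assumed figures: 4 KB pages, 4-byte pointers, 69% page utilization.
small_key_fanout = branching_factor(4096, 4, 4, 0.69)    # 4-byte key
large_key_fanout = branching_factor(4096, 60, 4, 0.69)   # 60-byte key

small_key_levels = tree_levels(1_000_000, small_key_fanout)
large_key_levels = tree_levels(1_000_000, large_key_fanout)

print(small_key_fanout, small_key_levels)
print(large_key_fanout, large_key_levels)
```

Under these assumptions the longer key costs one extra level for a million keys, which in the worst case means one extra I/O per lookup.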
B+-Tree Performance Key length influences fanout Choose small key when creating an index Key compression Prefix compression (Oracle 8, MySQL): only store that part of the key that is needed to distinguish it from its neighbors: Smi, Smo, Smy for Smith, Smoot, Smythe. Front compression (Oracle 5): adjacent keys have their front portion factored out: Smi, (2)o, (2)y. There are problems with this approach: Processor overhead for maintenance Locking Smoot requires locking Smith too.
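The two compression schemes named above can be illustrated on the slide's own keys. This is a toy sketch over an in-memory sorted list, not how any DBMS lays out index pages; the function names are hypothetical.

```python
def common_prefix_len(a, b):
    """Number of leading characters shared by strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def distinguishing_prefixes(sorted_keys):
    """Prefix compression: keep only enough of each key to tell it
    apart from its immediate neighbors."""
    out = []
    for i, key in enumerate(sorted_keys):
        shared = 0
        if i > 0:
            shared = max(shared, common_prefix_len(key, sorted_keys[i - 1]))
        if i + 1 < len(sorted_keys):
            shared = max(shared, common_prefix_len(key, sorted_keys[i + 1]))
        out.append(key[:shared + 1])
    return out


def front_compress(sorted_keys):
    """Front compression: replace the part shared with the previous key
    by its length, keeping only the differing tail."""
    out, prev = [], ""
    for key in sorted_keys:
        shared = common_prefix_len(key, prev)
        out.append((shared, key[shared:]))
        prev = key
    return out


prefixes = distinguishing_prefixes(["Smith", "Smoot", "Smythe"])
fronted = front_compress(prefixes)
print(prefixes)
print(fronted)
```

The front-compressed form shows why maintenance is costly: each entry can only be decoded relative to its predecessor, so changing or locking one key involves its neighbor.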
Hash Index A hash index stores key-value pairs based on a pseudo-randomizing function called a hash function. Good for point queries. Overflow buckets + offline reorganization. The length of these chains impacts performance. [Figure: a hash function maps key 2341 to one of the hashed key values 1..n; records R1, R5, R3, R6, R9, R14, R17, R21, R25 sit in buckets and overflow buckets.]
Clustered / Non clustered index Clustered index (primary index): A clustered index on attribute X co-locates records whose X values are near to one another. Non-clustered index (secondary index): A non clustered index does not constrain table organization. There might be several non-clustered indexes per table.
Dense / Sparse Index Sparse index: pointers are associated with pages. Dense index: pointers are associated with records. Non clustered indexes are dense. [Figure: a sparse index pointing to pages P1, P2, ..., Pi; a dense index pointing to individual records.]
DBMS Implementation Oracle 11g B-tree vs. Hash vs. Bitmap Index-organized table vs. heap Non-clustered index can be defined on both Reverse key indexes Key compression Invisible index (not visible to optimizer – allows for experimentation) Function-based indexes CREATE INDEX idx ON table_1 (a + b * (c - 1), a, b); See: Oracle 11g indexes
DBMS Implementation DB2 10: B+-tree (hash index only for db2 for z/OS); Non-cluster vs. cluster; Key compression; Indexes on expression. SQL Server 2012: B+-tree (spatial indexes built on top of B+-trees); Columnstore index for OLAP queries; Non key columns included in index (for coverage); Indexes on simple expressions (filtered indexes). See: DB2 types of indexes, SQLServer 2012 indexes
Index Implementations in some major DBMS SQL Server B+-Tree data structure Clustered indexes are sparse Indexes maintained as updates/insertions/deletes are performed DB2 B+-Tree data structure, spatial extender for R-tree Clustered indexes are dense Explicit command for index reorganization Oracle B+-tree, hash, bitmap, spatial extender for R-Tree clustered index Index organized table (unique/clustered) Clusters used when creating tables. Index-organized tables organize data according to index.
Types of Queries Point Query SELECT balance FROM accounts WHERE number = 1023; Multipoint Query SELECT balance FROM accounts WHERE branchnum = 100; Range Query SELECT number FROM accounts WHERE balance > 10000 and balance <= 20000; Prefix Match Query SELECT * FROM employees WHERE name LIKE 'J%';
More Types of Queries Extremal Query SELECT * FROM accounts WHERE balance = (SELECT max(balance) FROM accounts); Ordering Query SELECT * FROM accounts ORDER BY balance; Grouping Query SELECT branchnum, avg(balance) FROM accounts GROUP BY branchnum; Join Query SELECT distinct branch.adresse FROM accounts, branch WHERE accounts.branchnum = branch.number and accounts.balance > 10000;
Index Tuning -- data Settings: employees(ssnum, name, lat, long, hundreds1, hundreds2); clustered index c on employees(hundreds1) with fillfactor = 100; nonclustered index nc on employees (hundreds2); index nc3 on employees (ssnum, name, hundreds2); index nc4 on employees (lat, ssnum, name); 1000000 rows ; Cold buffer Dual Xeon (550MHz,512Kb), 1Gb RAM, Internal RAID controller from Adaptec (80Mb), 4x18Gb drives (10000RPM), Windows 2000.
Index Tuning -- operations Update: update employees set name = 'XXX' where ssnum = ?; Insert: insert into employees values (1003505,'polo94064',97.48,84.03,4700.55,3987.2); Multipoint query: select * from employees where hundreds1= ?; select * from employees where hundreds2= ?; Covered query: select ssnum, name, lat from employees; Range Query: select * from employees where long between ? and ?; Point Query: select * from employees where ssnum = ?;
Clustered Index Multipoint query that returns 100 records out of 1000000. Cold buffer Clustered index is twice as fast as non-clustered index and orders of magnitude faster than a scan.
Index “Face Lifts” Index is created with fillfactor = 100. Insertions cause page splits and extra I/O for each query. Maintenance consists of dropping and recreating the index. With maintenance, performance stays constant; without it, performance degrades significantly.
Index Maintenance In Oracle, clustered indexes are approximated by an index defined on a clustered table. No automatic physical reorganization. Index defined with pctfree = 0. Overflow pages cause performance degradation.
Covering Index - defined Select name from employee where department = 'marketing'; A good covering index would be on (department, name). An index on (name, department) is less useful. An index on department alone is moderately useful.
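Whether an index covers a query can be checked by asking the optimizer for its plan. The sketch below uses SQLite through Python's sqlite3 module purely as a convenient stand-in (it is not one of the systems measured here); the table contents, the index name dept_name and the helper plan are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (ssnum INTEGER PRIMARY KEY, name TEXT, "
    "department TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [(i, f"name{i}", "marketing" if i % 10 == 0 else "sales", 1000.0 + i)
     for i in range(1000)],
)
# Index with the WHERE attribute first and the SELECT attribute last.
conn.execute("CREATE INDEX dept_name ON employee (department, name)")


def plan(sql):
    """Return the optimizer's plan description for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# Every column the query needs is in the index: no table access required.
covered = plan("SELECT name FROM employee WHERE department = 'marketing'")
# salary is not in the index, so the table itself must also be read.
not_covered = plan("SELECT salary FROM employee WHERE department = 'marketing'")

print(covered)
print(not_covered)
```

The exact wording of the plan text varies between SQLite versions, so only the presence or absence of the phrase COVERING INDEX should be relied on.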
Covering Index - impact Covering index performs better than clustering index when first attributes of index are in the where clause and last attributes in the select. When attributes are not in order then performance is much worse.
Scan Can Sometimes Win IBM DB2 v7.1 on Windows 2000 Range Query If a query retrieves 10% of the records or more, scanning is often better than using a non-clustering non-covering index. Crossover > 10% when records are large or table is fragmented on disk – scan cost increases.
Index on Small Tables Small table: 100 records, i.e., a few pages. Two concurrent processes perform updates (each process works for 10ms before it commits). No index: the table is scanned for each update, so there are no concurrent updates. A clustered index makes it possible to take advantage of row locking.
Bitmap vs. Hash vs. B+-Tree Settings: employees(ssnum, name, lat, long, hundreds1, hundreds2); create cluster c_hundreds (hundreds2 number(8)) PCTFREE 0; create cluster c_ssnum(ssnum integer) PCTFREE 0 size 60; create cluster c_hundreds(hundreds2 number(8)) PCTFREE 0 HASHKEYS 1000 size 600; create cluster c_ssnum(ssnum integer) PCTFREE 0 HASHKEYS 1000000 SIZE 60; create bitmap index b on employees (hundreds2); create bitmap index b2 on employees (ssnum); 1000000 rows ; Cold buffer Dual Xeon (550MHz,512Kb), 1Gb RAM, Internal RAID controller from Adaptec (80Mb), 4x18Gb drives (10000RPM), Windows 2000.
Multipoint query: B-Tree, Hash Tree, Bitmap There is an overflow chain in a hash index because hundreds2 has few values In a clustered B-Tree index records are on contiguous pages. Bitmap is proportional to size of table and non-clustered for record access.
B-Tree, Hash Tree, Bitmap Hash indexes don’t help when evaluating range queries Hash index outperforms B-tree on point queries
Tuning Relational Systems Schema Tuning Denormalization, Vertical Partitioning Query Tuning Query rewriting Materialized views
Denormalizing -- data Settings: lineitem ( L_ORDERKEY, L_PARTKEY , L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE , L_DISCOUNT, L_TAX , L_RETURNFLAG, L_LINESTATUS , L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT , L_SHIPMODE , L_COMMENT ); region( R_REGIONKEY, R_NAME, R_COMMENT ); nation( N_NATIONKEY, N_NAME, N_REGIONKEY, N_COMMENT ); supplier( S_SUPPKEY, S_NAME, S_ADDRESS, S_NATIONKEY, S_PHONE, S_ACCTBAL, S_COMMENT); 600000 rows in lineitem, 25 nations, 5 regions, 500 suppliers
Denormalizing -- transactions lineitemdenormalized ( L_ORDERKEY, L_PARTKEY , L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE , L_DISCOUNT, L_TAX , L_RETURNFLAG, L_LINESTATUS , L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT , L_SHIPMODE , L_COMMENT, L_REGIONNAME); 600000 rows in lineitemdenormalized Cold Buffer Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives (10000RPM), Windows 2000.
Queries on Normalized vs. Denormalized Schemas select L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT, R_NAME from LINEITEM, REGION, SUPPLIER, NATION where L_SUPPKEY = S_SUPPKEY and S_NATIONKEY = N_NATIONKEY and N_REGIONKEY = R_REGIONKEY and R_NAME = 'EUROPE'; select L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT, L_REGIONNAME from LINEITEMDENORMALIZED where L_REGIONNAME = 'EUROPE';
Denormalization TPC-H schema Query: find all lineitems whose supplier is in Europe. With a normalized schema this query is a 4-way join. If we denormalize lineitem and add the name of the region for each lineitem (foreign key denormalization) throughput improves 30%
Vertical Partitioning Consider account(id, balance, homeaddress) When might it be a good idea to do a “vertical partitioning” into account1(id,balance) and account2(id,homeaddress)? Join vs. size.
Vertical Partitioning Which design is better depends on the query pattern: The application that sends a monthly statement is the principal user of the address of the owner of an account The balance is updated or examined several times a day. The second schema might be better because the relation (account_ID, balance) can be made smaller: More account_ID, balance pairs fit in memory, thus increasing the hit ratio A scan performs better because there are fewer pages.
Tuning Normalization A single normalized relation XYZ is better than two normalized relations XY and XZ if the single relation design allows queries to access X, Y and Z together without requiring a join. The two-relation design is better iff: User accesses tend to partition between the two sets Y and Z most of the time. Attributes Y or Z have large values.
Vertical Partitioning and Scan R (X,Y,Z): X is an integer; Y and Z are large strings. Scan Query. Vertical partitioning exhibits poor performance when all attributes are accessed. Vertical partitioning provides a speedup if only two of the attributes are accessed.
Vertical Partitioning and Point Queries R (X,Y,Z) X is an integer YZ are large strings A mix of point queries access either XYZ or XY. Vertical partitioning gives a performance advantage if the proportion of queries accessing only XY is greater than 20%. The join is not expensive compared to a simple look-up.
Queries Settings: employee(ssnum, name, dept, salary, numfriends); student(ssnum, name, course, grade); techdept(dept, manager, location); clustered index i1 on employee (ssnum); nonclustered index i2 on employee (name); nonclustered index i3 on employee (dept); clustered index i4 on student (ssnum); nonclustered index i5 on student (name); clustered index i6 on techdept (dept); 100000 rows in employee, 100000 students, 10 departments; Cold buffer Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives (10000RPM), Windows 2000.
Queries – View on Join View Techlocation: create view techlocation as select ssnum, techdept.dept, location from employee, techdept where employee.dept = techdept.dept; Queries: Original: select dept from techlocation where ssnum = ?; Rewritten: select dept from employee where ssnum = ?;
Query Rewriting - Views All systems expand the selection on a view into a join The difference between a plain selection and a join (on a primary key-foreign key) followed by a projection is greater on SQL Server than on Oracle and DB2 v7.1.
Queries – Correlated Subqueries Original: select ssnum from employee e1 where salary = (select max(salary) from employee e2 where e2.dept = e1.dept); Rewritten: select max(salary) as bigsalary, dept into TEMP from employee group by dept; select ssnum from employee, TEMP where salary = bigsalary and employee.dept = temp.dept;
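The equivalence of the two formulations can be checked on a small table. The sketch below runs both against SQLite via Python's sqlite3 module, which stands in for the systems measured here; the sample rows are made up, and the temporary table is named maxsal (a hypothetical name) and created with CREATE TEMP TABLE ... AS rather than SELECT ... INTO, which SQLite does not support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (ssnum INTEGER PRIMARY KEY, name TEXT, "
    "dept TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [(1, "a", "db", 100.0), (2, "b", "db", 120.0), (3, "c", "os", 90.0),
     (4, "d", "os", 90.0), (5, "e", "net", 70.0)],
)

# Original: the subquery is logically re-evaluated for each outer row.
original = conn.execute(
    """SELECT ssnum FROM employee e1
       WHERE salary = (SELECT max(salary) FROM employee e2
                       WHERE e2.dept = e1.dept)
       ORDER BY ssnum"""
).fetchall()

# Rewritten: compute the per-department maximum once, then join.
conn.execute(
    """CREATE TEMP TABLE maxsal AS
       SELECT max(salary) AS bigsalary, dept FROM employee GROUP BY dept"""
)
rewritten = conn.execute(
    """SELECT ssnum FROM employee, maxsal
       WHERE salary = bigsalary AND employee.dept = maxsal.dept
       ORDER BY ssnum"""
).fetchall()

print(original)
print(rewritten)
```

Both forms return every employee tied for the top salary in a department, so ties (employees 3 and 4 above) appear in both results.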
Query Rewriting – Correlated Subqueries SQL Server 2000 does a good job at handling the correlated subqueries (a hash join is used as opposed to a nested loop between query blocks). The techniques implemented in SQL Server 2000 are described in “Orthogonal Optimization of Subqueries and Aggregates” by C.Galindo-Legaria and M.Joshi, SIGMOD 2001. [Chart annotations: > 1000, > 10000]
Eliminate unneeded DISTINCTs Query: Find employees who work in the information systems department. There should be no duplicates. SELECT distinct ssnum FROM employee WHERE dept = 'information systems' DISTINCT is unnecessary, since ssnum is a key of employee so certainly is a key of a subset of employee.
Eliminate unneeded DISTINCTs Query: Find social security numbers of employees in the technical departments. There should be no duplicates. SELECT DISTINCT ssnum FROM employee, tech WHERE employee.dept = tech.dept Is DISTINCT needed?
Distinct Unnecessary Here Too Since dept is a key of the tech table, each employee record will join with at most one record in tech. Because ssnum is a key for employee, distinct is unnecessary.
Reaching The relationship among DISTINCT, keys and joins can be generalized: Call a table T privileged if the fields returned by the select contain a key of T Let R be an unprivileged table. Suppose that R is joined on equality by its key field to some other table S, then we say R reaches S. Now, define reaches to be transitive. So, if R1 reaches R2 and R2 reaches R3 then say that R1 reaches R3.
Reaches: Main Theorem There will be no duplicates among the records returned by a selection, even in the absence of DISTINCT, if one of the two following conditions holds: Every table mentioned in the FROM clause is privileged. Every unprivileged table reaches at least one privileged table.
Reaches: Proof Sketch If every relation is privileged then there are no duplicates: the keys of those relations are in the select clause. Suppose some relation T is not privileged but reaches at least one privileged one, say R. Then the qualifications linking T with R ensure that each distinct combination of privileged records is joined with at most one record of T.
Reaches: Example 1 SELECT ssnum FROM employee, tech WHERE employee.manager = tech.manager The same employee record may match several tech records (because manager is not a key of tech), so the ssnum of that employee record may appear several times. Tech does not reach the privileged relation employee.
Reaches: Example 2 SELECT ssnum, tech.dept FROM employee, tech WHERE employee.manager = tech.manager Each repetition of a given ssnum value would be accompanied by a new tech.dept since tech.dept is a key of tech. Both relations are privileged.
Reaches: Example 3 SELECT student.ssnum FROM student, employee, tech WHERE student.name = employee.name AND employee.dept = tech.dept; Student is privileged. Employee does not reach student (name is not a key of employee). DISTINCT is needed to avoid duplicates.
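The privileged/reaches test is mechanical enough to code. The sketch below is one possible rendering under simplifying assumptions: each table has a single one-attribute key, and the query is described by hand as a set of selected fields plus a list of equality joins (no SQL parsing). The function name needs_distinct and the input encoding are hypothetical. Returning True means the theorem gives no guarantee, so DISTINCT should be kept.

```python
def needs_distinct(keys, selected, joins):
    """keys: {table: key attribute}; selected: set of (table, attr);
    joins: list of ((table_a, attr_a), (table_b, attr_b)) equalities."""
    # A table is privileged if the select list contains its key.
    privileged = {t for t, k in keys.items() if (t, k) in selected}

    # R reaches S directly if R is joined to S on R's own key.
    reach = {t: set() for t in keys}
    for (ta, aa), (tb, ab) in joins:
        if keys[ta] == aa:
            reach[ta].add(tb)
        if keys[tb] == ab:
            reach[tb].add(ta)

    # Make "reaches" transitive.
    changed = True
    while changed:
        changed = False
        for table in reach:
            via = set()
            for middle in list(reach[table]):
                via |= reach[middle]
            if not via <= reach[table]:
                reach[table] |= via
                changed = True

    # DISTINCT may be needed if some unprivileged table reaches no privileged one.
    return any(t not in privileged and not (reach[t] & privileged) for t in keys)


emp_tech = {"employee": "ssnum", "tech": "dept"}

# Join on tech's key: tech reaches the privileged table employee.
dept_join = needs_distinct(
    emp_tech, {("employee", "ssnum")},
    [(("employee", "dept"), ("tech", "dept"))])
# Example 1: join on manager, which is not a key of tech.
example1 = needs_distinct(
    emp_tech, {("employee", "ssnum")},
    [(("employee", "manager"), ("tech", "manager"))])
# Example 2: both keys are selected, so both tables are privileged.
example2 = needs_distinct(
    emp_tech, {("employee", "ssnum"), ("tech", "dept")},
    [(("employee", "manager"), ("tech", "manager"))])
# Example 3: employee is unprivileged and reaches nothing privileged.
example3 = needs_distinct(
    {"student": "ssnum", "employee": "ssnum", "tech": "dept"},
    {("student", "ssnum")},
    [(("student", "name"), ("employee", "name")),
     (("employee", "dept"), ("tech", "dept"))])

print(dept_join, example1, example2, example3)
```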
Aggregate Maintenance -- data Settings: orders( ordernum, itemnum, quantity, storeid, vendorid ); create clustered index i_order on orders(itemnum); store( storeid, name ); item(itemnum, price); create clustered index i_item on item(itemnum); vendorOutstanding( vendorid, amount); storeOutstanding( storeid, amount); 1000000 orders, 10000 stores, 400000 items; Cold buffer Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives (10000RPM), Windows 2000.
Aggregate Maintenance -- triggers Triggers for Aggregate Maintenance create trigger updateVendorOutstanding on orders for insert as update vendorOutstanding set amount = (select vendorOutstanding.amount+sum(inserted.quantity*item.price) from inserted,item where inserted.itemnum = item.itemnum ) where vendorid = (select vendorid from inserted) ; create trigger updateStoreOutstanding on orders for insert as update storeOutstanding set amount = (select storeOutstanding.amount+sum(inserted.quantity*item.price) from inserted,item where inserted.itemnum = item.itemnum ) where storeid = (select storeid from inserted) ;
Aggregate Maintenance -- transactions Concurrent Transactions: Insertions insert into orders values (1000350,7825,562,'xxxxxx6944','vendor4'); Queries (first without, then with redundant tables) select orders.vendorid, sum(orders.quantity*item.price) from orders,item where orders.itemnum = item.itemnum group by orders.vendorid; vs. select * from vendorOutstanding; select store.storeid, sum(orders.quantity*item.price) from orders,item, store where orders.itemnum = item.itemnum and orders.storename = store.name group by store.storeid; vs. select * from storeOutstanding;
Aggregate Maintenance SQLServer 2000 on Windows 2000. Using triggers for view maintenance. If queries are frequent or important, then aggregate maintenance is good.
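The trade-off can be tried out on a small scale. The sketch below recreates the vendorOutstanding maintenance in SQLite through Python's sqlite3 module, so the trigger uses SQLite's row-level AFTER INSERT syntax with NEW rather than SQL Server's inserted table; the sample items, orders and vendor names are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (itemnum INTEGER PRIMARY KEY, price REAL);
CREATE TABLE orders (ordernum INTEGER, itemnum INTEGER, quantity INTEGER,
                     storeid TEXT, vendorid TEXT);
CREATE TABLE vendorOutstanding (vendorid TEXT PRIMARY KEY, amount REAL);

-- Each insert into orders pays a small extra cost to keep the aggregate current.
CREATE TRIGGER updateVendorOutstanding AFTER INSERT ON orders
BEGIN
    UPDATE vendorOutstanding
    SET amount = amount + NEW.quantity *
                 (SELECT price FROM item WHERE itemnum = NEW.itemnum)
    WHERE vendorid = NEW.vendorid;
END;
""")

conn.executemany("INSERT INTO item VALUES (?, ?)", [(1, 2.5), (2, 10.0)])
conn.executemany("INSERT INTO vendorOutstanding VALUES (?, 0)",
                 [("vendor1",), ("vendor4",)])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [(100, 1, 4, "store1", "vendor4"),
     (101, 2, 3, "store2", "vendor4"),
     (102, 1, 2, "store1", "vendor1")],
)

# Cheap read of the maintained aggregate.
maintained = dict(conn.execute("SELECT vendorid, amount FROM vendorOutstanding"))
# Expensive recomputation by join and group-by.
recomputed = dict(conn.execute(
    "SELECT orders.vendorid, sum(orders.quantity * item.price) "
    "FROM orders, item WHERE orders.itemnum = item.itemnum "
    "GROUP BY orders.vendorid"
))

print(maintained)
print(recomputed)
```

The read side becomes a lookup in a small table, while every insertion now carries an additional update; that is the balance the slide's "frequent or important" condition refers to.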
Superlinearity -- data Settings: sales( id, itemid, customerid, storeid, amount, quantity); item (itemid); customer (customerid); store (storeid); A sale is successful if all foreign keys are present. successfulsales(id, itemid, customerid, storeid, amount, quantity); unsuccessfulsales(id, itemid, customerid, storeid, amount, quantity); tempsales( id, itemid, customerid, storeid, amount,quantity);
Superlinearity -- indexes Settings (non-clustering, dense indexes): index s1 on item(itemid); index s2 on customer(customerid); index s3 on store(storeid); index succ on successfulsales(id); 1000000 sales, 400000 customers, 40000 items, 1000 stores Cold buffer Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives (10000RPM), Windows 2000.
Superlinearity -- queries Insert/create index/delete: insert into successfulsales select sales.id, sales.itemid, sales.customerid, sales.storeid, sales.amount, sales.quantity from sales, item, customer, store where sales.itemid = item.itemid and sales.customerid = customer.customerid and sales.storeid = store.storeid; insert into unsuccessfulsales select * from sales; go delete from unsuccessfulsales where id in (select id from successfulsales)
Superlinearity -- batch queries Small batches DECLARE @Nlow INT; DECLARE @Nhigh INT; DECLARE @INCR INT; set @INCR = 100000 set @NLow = 0 set @Nhigh = @INCR WHILE (@NLow <= 500000) BEGIN insert into tempsales select * from sales where id between @NLow and @Nhigh set @Nlow = @Nlow + @INCR set @Nhigh = @Nhigh + @INCR delete from tempsales where id in (select id from successfulsales); insert into unsuccessfulsales select * from tempsales; delete from tempsales; END
Superlinearity -- outer join Queries: outerjoin insert into successfulsales select sales.id, item.itemid, customer.customerid, store.storeid, sales.amount, sales.quantity from ((sales left outer join item on sales.itemid = item.itemid) left outer join customer on sales.customerid = customer.customerid) left outer join store on sales.storeid = store.storeid; insert into unsuccessfulsales select * from successfulsales where itemid is null or customerid is null or storeid is null; go delete from successfulsales where itemid is null or customerid is null or storeid is null
Circumventing Superlinearity SQL Server 2000 Outer join achieves the best response time. Small batches do not help because overhead of crossing the application interface is higher than the benefit of joining with smaller tables.
Tuning the Application Interface 4GL Power++, Visual basic Programming language + Call Level Interface ODBC: Open DataBase Connectivity JDBC: Java based API OCI (C++/Oracle), CLI (C++/ DB2), Perl/DBI In the following experiments, the client program is located on the database server site. Overhead is due to crossing the application interface.
Looping can hurt -- data Settings: lineitem ( L_ORDERKEY, L_PARTKEY , L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE , L_DISCOUNT, L_TAX , L_RETURNFLAG, L_LINESTATUS , L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT , L_SHIPMODE , L_COMMENT ); 600 000 rows; warm buffer. Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives (10000RPM), Windows 2000.
Looping can hurt -- queries No loop: sqlStmt = "select * from lineitem where l_partkey < 200;" odbc->prepareStmt(sqlStmt); odbc->execPrepared(sqlStmt); Loop: sqlStmt = "select * from lineitem where l_partkey = ?;" odbc->prepareStmt(sqlStmt); for (int i=1; i<200; i++) { odbc->bindParameter(1, SQL_INTEGER, i); odbc->execPrepared(sqlStmt); }
Looping can Hurt SQL Server 2000 on Windows 2000 Crossing the application interface has a significant impact on performance. Why would a programmer use a loop instead of relying on set-oriented operations: object-orientation?
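The two access patterns can be reproduced in miniature. The sketch below uses SQLite through Python's sqlite3 module instead of ODBC against SQL Server, with a cut-down three-column lineitem table and synthetic rows (both assumptions for the illustration). Since SQLite runs in-process, the per-call overhead is far smaller than across a real client/server interface, so the point here is only that the two forms return the same rows while issuing very different numbers of statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lineitem (l_orderkey INTEGER, l_partkey INTEGER, "
    "l_quantity INTEGER)"
)
# Synthetic data: part keys 1..500, ten rows per part key.
conn.executemany(
    "INSERT INTO lineitem VALUES (?, ?, ?)",
    [(i, i % 500 + 1, i % 7) for i in range(5000)],
)


def no_loop():
    """One set-oriented statement: a single crossing of the interface."""
    return conn.execute(
        "SELECT * FROM lineitem WHERE l_partkey < 200"
    ).fetchall()


def loop():
    """One statement per part key: 199 crossings of the interface."""
    rows = []
    for partkey in range(1, 200):
        rows.extend(conn.execute(
            "SELECT * FROM lineitem WHERE l_partkey = ?", (partkey,)
        ).fetchall())
    return rows


set_rows = no_loop()
loop_rows = loop()
print(len(set_rows), len(loop_rows))
```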
Cursors are Death -- data Settings: employees(ssnum, name, lat, long, hundreds1, hundreds2); 100000 rows ; Cold buffer Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives (10000RPM), Windows 2000.
Cursors are Death -- queries No cursor select * from employees; Cursor DECLARE d_cursor CURSOR FOR select * from employees; OPEN d_cursor while (@@FETCH_STATUS = 0) BEGIN FETCH NEXT from d_cursor END CLOSE d_cursor go
Cursors are Death SQL Server 2000 on Windows 2000 Response time is a few seconds with a SQL query and more than an hour iterating over a cursor.
Retrieve Needed Columns Only - data Settings: lineitem ( L_ORDERKEY, L_PARTKEY , L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE , L_DISCOUNT, L_TAX , L_RETURNFLAG, L_LINESTATUS , L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT , L_SHIPMODE , L_COMMENT ); create index i_nc_lineitem on lineitem (l_orderkey, l_partkey, l_suppkey, l_shipdate, l_commitdate); 600 000 rows; warm buffer. Lineitem records are ~ 10 bytes long Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives (10000RPM), Windows 2000.
Retrieve Needed Columns Only - queries All Select * from lineitem; Covered subset Select l_orderkey, l_partkey, l_suppkey, l_shipdate, l_commitdate from lineitem;
Retrieve Needed Columns Only Avoid transferring unnecessary data May enable use of a covering index. In the experiment the subset contains ¼ of the attributes. Reducing the amount of data that crosses the application interface yields significant performance improvement. Experiment performed on Oracle8iEE on Windows 2000.
Bulk Loading Data Settings: lineitem ( L_ORDERKEY, L_PARTKEY , L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE , L_DISCOUNT, L_TAX , L_RETURNFLAG, L_LINESTATUS , L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT , L_SHIPMODE , L_COMMENT ); Initially the table is empty; 600 000 rows to be inserted (138Mb). Table sits on one disk. No constraint or index is defined. Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives (10000RPM), Windows 2000.
Bulk Loading Queries Oracle 8i sqlldr directpath=true control=load_lineitem.ctl data=E:\Data\lineitem.tbl load data infile "lineitem.tbl" into table LINEITEM append fields terminated by '|' ( L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE DATE "YYYY-MM-DD", L_COMMITDATE DATE "YYYY-MM-DD", L_RECEIPTDATE DATE "YYYY-MM-DD", L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT )
Direct Path Direct path loading bypasses the query engine and the storage manager. It is orders of magnitude faster than conventional bulk load (commit every 100 records) and inserts (commit for each record). Experiment performed on Oracle8iEE on Windows 2000.
Batch Size Throughput increases steadily as the batch size increases to 100000 records. Throughput remains constant afterwards. Trade-off between performance and amount of data that has to be reloaded in case of problem. Experiment performed on SQL Server 2000 on Windows 2000.
Tuning E-Commerce Applications Database-backed web-sites: Online shops Shop comparison portals MS TerraServer
E-commerce Application Architecture [Figure: clients, web servers, application servers and a database server, along with several web caches and DB caches.]
E-commerce Application Workload Touristic searching (frequent, cached) Access the top few pages. Pages may be personalized. Data may be out-of-date. Category searching (frequent, partly cached and need for timeliness guarantees) Down some hierarchy, e.g., men’s clothing. Keyword searching (frequent, uncached, need for timeliness guarantees) Shopping cart interactions (rare, but transactional) Electronic purchasing (rare, but transactional)
Design Issues Need to keep historic information Electronic payment acknowledgements get lost. Preparation for variable load Regular patterns of web site accesses during the day, and within a week. Possibility of disconnections State information transmitted to the client (cookies) Special consideration for low bandwidth Schema evolution Representing e-commerce data as attribute-value pairs (IBM Websphere)
Caching Web cache: Static web pages. Caching fragments of dynamically created web pages. Database cache (Oracle9iAS, TimesTen’s FrontTier): Materialized views to represent cached data. Queries are executed either using the database cache or the database server. Updates are propagated to keep the cache(s) consistent. Note to vendors: It would be good to have queries distributed between cache and server.
Ecommerce -- setting Settings: shoppingcart( shopperid, itemid, price, qty); 500000 rows; warm buffer Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives (10000RPM), Windows 2000.
Ecommerce -- transactions Concurrent Transactions: Mix insert into shoppingcart values (107999,914,870,214); update shoppingcart set Qty = 10 where shopperid = 95047 and itemid = 88636; delete from shoppingcart where shopperid = 86123 and itemid = 8321; select shopperid, itemid, qty, price from shoppingcart where shopperid = ?; Queries Only
Connection Pooling (no refusals) Each thread establishes a connection and performs 5 insert statements. If a connection cannot be established the thread waits 15 secs before trying again. The number of connections is limited to 60 on the database server. Using connection pooling, the requests are queued and serviced when possible. There are no refused connections. Experiment performed on Oracle8i on Windows 2000
Indexing Using a clustered index on shopperid in the shopping cart provides: Query speed-up Update/Deletion speed-up Experiment performed on SQL Server 2000 on Windows 2000
Capacity Planning Arrival Rate: A1 is given as an assumption; A2 = (0.4 A1) + (0.5 A2); A3 = 0.1 A2. Service Time (S): S1, S2, S3 are measured. Utilization: U = A x S. Response Time: R = U/(A(1-U)) = S/(1-U) (assuming Poisson arrivals). [Figure: Entry (S1) leads to Search (S2) with probability 0.4; Search leads back to Search with probability 0.5 and to Checkout (S3) with probability 0.1.] Getting the demand assumptions right is what makes capacity planning hard.
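The flow equations above can be solved and plugged into the utilization and response-time formulas. In the sketch below the transition probabilities (0.4, 0.5, 0.1) come from the slide, while the entry arrival rate of 10 per second and the three service times are assumed values chosen only to make the arithmetic concrete; the function names are hypothetical.

```python
def arrival_rates(a1, p_entry_to_search=0.4, p_search_again=0.5,
                  p_search_to_checkout=0.1):
    """Solve A2 = 0.4*A1 + 0.5*A2 and A3 = 0.1*A2 for a given A1."""
    a2 = p_entry_to_search * a1 / (1.0 - p_search_again)
    a3 = p_search_to_checkout * a2
    return a1, a2, a3


def utilization(arrival_rate, service_time):
    """U = A x S."""
    return arrival_rate * service_time


def response_time(arrival_rate, service_time):
    """R = S / (1 - U), valid only while the server is not saturated."""
    u = utilization(arrival_rate, service_time)
    if u >= 1.0:
        raise ValueError("utilization >= 1: the queue grows without bound")
    return service_time / (1.0 - u)


# Assumed inputs: 10 entries per second; service times in seconds.
a1, a2, a3 = arrival_rates(10.0)
s1, s2, s3 = 0.01, 0.05, 0.1

u_search = utilization(a2, s2)
r_search = response_time(a2, s2)
print(a2, a3, u_search, r_search)
```

Because A2 works out to 0.8 A1, any error in the assumed entry rate or in the branching probabilities carries straight through to the search utilization.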
How to Handle Multiple Servers Suppose one has n servers for some task that requires S time for a single server to perform. The perfect parallelism model is that it is as if one has a single server that is n times as fast. However, this overstates the advantage of parallelism, because even if there were no waiting, single tasks require S time.
Rough Estimate for Multiple Servers There are two components to response time: waiting time + service time. In the parallel setting, the service time is still S. The waiting time however can be well estimated by a server that is n times as fast.
Approximating waiting time for n parallel servers. Recall: R = U/(A(1-U)) = S/(1-U). On an n-times faster server, service time is divided by n, so the single processor utilization U is also divided by n. So we would get: R_ideal = (S/n)/(1 – (U/n)). Now R_ideal = service_ideal + wait_ideal, so wait_ideal = R_ideal – S/n. Our assumption: wait_ideal ~ wait for n processors.
Approximating response time for n parallel servers Waiting time for n parallel processors ~ (S/n)/(1 – (U/n)) – S/n = (S/n) ( 1/(1-(U/n)) – 1) = (S/(n(1 – U/n)))(U/n) = (S/(n – U))(U/n). So, the response time for n parallel processors is the above waiting time + S.
Example A = 8 per second. S = 0.1 second. U = 0.8. Single server response time = S/(1-U) = 0.1/0.2 = 0.5 seconds. If we have 2 servers, then we estimate waiting time to be (0.1/(2-0.8))(0.4) = 0.04/1.2 = 0.033. So the response time is 0.133. For a 2-times faster server, S = 0.05, U = 0.4, so response time is 0.05/0.6 = 0.0833
Example -- continued A = 8 per second. S = 0.1 second. U = 0.8. If we have 4 servers, then we estimate waiting time to be (S/(n – U))(U/n) = 0.1/(3.2) * (0.8/4) = 0.02/3.2 = 0.00625 So response time is 0.10625.
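The two worked examples can be reproduced with a short script. This is a sketch of the approximation exactly as stated on the slides (wait ~ (S/(n - U))(U/n), response = wait + S); the function names are ours.

```python
def single_server_response(arrival, service):
    """R = S / (1 - U) for one server, with U = A * S."""
    return service / (1 - arrival * service)


def parallel_wait(arrival, service, n):
    """Approximate waiting time for n parallel servers: (S/(n - U)) * (U/n)."""
    u = arrival * service
    return (service / (n - u)) * (u / n)


def parallel_response(arrival, service, n):
    """Waiting time plus the full service time S, which parallelism cannot shrink."""
    return parallel_wait(arrival, service, n) + service


A, S = 8, 0.1
print(single_server_response(A, S))      # one server
print(parallel_response(A, S, 2))        # two servers
print(parallel_response(A, S, 4))        # four servers
print(single_server_response(A, S / 2))  # one server that is twice as fast
```

Comparing the second and fourth lines shows why the perfect-parallelism model overstates the benefit: the faster single server also halves the service time, while two servers only reduce the waiting.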
Data Warehouse Tuning Aggregate (strategic) targeting: Aggregates flow up from a wide selection of data, and then targeted decisions flow down. Examples: Riding the wave of clothing fads. Tracking delays for frequent-flyer customers.
Data Warehouse Workload Broad Aggregate queries over ranges of values, e.g., find the total sales by region and quarter. Deep Queries that require precise individualized information, e.g., which frequent flyers have been delayed several times in the last month? Dynamic (vs. Static) Queries that require up-to-date information, e.g. which nodes have the highest traffic now?
Tuning Knobs Indexes Materialized views Approximation
Bitmaps -- data Settings: lineitem ( L_ORDERKEY, L_PARTKEY , L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE , L_DISCOUNT, L_TAX , L_RETURNFLAG, L_LINESTATUS , L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT , L_SHIPMODE , L_COMMENT ); create bitmap index b_lin_2 on lineitem(l_returnflag); create bitmap index b_lin_3 on lineitem(l_linestatus); create bitmap index b_lin_4 on lineitem(l_linenumber); 100000 rows ; cold buffer Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives (10000RPM), Windows 2000.
Bitmaps -- queries Queries: 1 attribute: select count(*) from lineitem where l_returnflag = 'N'; 2 attributes: select count(*) from lineitem where l_returnflag = 'N' and l_linenumber > 3; 3 attributes: select count(*) from lineitem where l_returnflag = 'N' and l_linenumber > 3 and l_linestatus = 'F';
Bitmaps Order of magnitude improvement compared to scan, because summing up the 1s is very fast. Bitmaps are best suited for multiple conditions on several attributes, each having a low selectivity. In this data set, l_returnflag takes the values A, N, R and l_linestatus takes the values O, F.
Vertical Partitioning Revisited Vertical partitioning becomes more attractive in the same settings where bit vectors are useful: lots of attributes, queried in unpredictable combinations. If a table has 200 attributes, there is no reason to retrieve 200-field rows when only three attributes are needed. Sybase IQ, KDB, Vertica all use column-oriented tables.
Multidimensional Indexes -- data Settings: create table spatial_facts ( a1 int, a2 int, a3 int, a4 int, a5 int, a6 int, a7 int, a8 int, a9 int, a10 int, geom_a3_a7 mdsys.sdo_geometry ); create index r_spatialfacts on spatial_facts(geom_a3_a7) indextype is mdsys.spatial_index; create bitmap index b2_spatialfacts on spatial_facts(a3,a7); 500000 rows ; cold buffer Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives (10000RPM), Windows 2000.
Multidimensional Indexes -- queries Point Queries select count(*) from fact where a3 = 694014 and a7 = 928878; select count(*) from spatial_facts where SDO_RELATE(geom_a3_a7, MDSYS.SDO_GEOMETRY(2001, NULL, MDSYS.SDO_POINT_TYPE(694014,928878, NULL), NULL, NULL), 'mask=equal querytype=WINDOW') = 'TRUE'; Range Queries select count(*) from spatial_facts where SDO_RELATE(geom_a3_a7, mdsys.sdo_geometry(2003,NULL,NULL, mdsys.sdo_elem_info_array(1,1003,3),mdsys.sdo_ordinate_array(10,800000,1000000,1000000)), 'mask=inside querytype=WINDOW') = 'TRUE'; select count(*) from spatial_facts where a3 > 10 and a3 < 1000000 and a7 > 800000 and a7 < 1000000;
Multidimensional Indexes Oracle 8i on Windows 2000 Spatial Extension: 2-dimensional data Spatial functions used in the query R-tree does not perform well because of the overhead of spatial extension.
Multidimensional Indexes -- access plans
R-Tree:
SELECT STATEMENT
  SORT AGGREGATE
    TABLE ACCESS BY INDEX ROWID SPATIAL_FACTS
      DOMAIN INDEX R_SPATIALFACTS
Bitmaps:
SELECT STATEMENT
  SORT AGGREGATE
    BITMAP CONVERSION COUNT
      BITMAP AND
        BITMAP INDEX SINGLE VALUE B_FACT7
        BITMAP INDEX SINGLE VALUE B_FACT3
Materialized Views -- data Settings: orders( ordernum, itemnum, quantity, storeid, vendor ); create clustered index i_order on orders(itemnum); store( storeid, name ); item(itemnum, price); create clustered index i_item on item(itemnum); 1000000 orders, 10000 stores, 400000 items; Cold buffer Oracle 9i Pentium III (1 GHz, 256 Kb), 1Gb RAM, Adapter 39160 with 2 channels, 3x18Gb drives (10000RPM), Linux Debian 2.4.
Materialized Views -- data Settings: create materialized view vendorOutstanding build immediate refresh complete enable query rewrite as select orders.vendor, sum(orders.quantity*item.price) from orders,item where orders.itemnum = item.itemnum group by orders.vendor;
Materialized Views -- transactions Concurrent Transactions: Insertions insert into orders values (1000350,7825,562,'xxxxxx6944','vendor4'); Queries select orders.vendor, sum(orders.quantity*item.price) from orders,item where orders.itemnum = item.itemnum group by orders.vendor; select * from vendorOutstanding;
Materialized Views Graph: Oracle9i on Linux Total sale by vendor is materialized Trade-off between query speed-up and view maintenance: The impact of incremental maintenance on performance is significant. Rebuild maintenance achieves a good throughput. A static data warehouse offers a good trade-off.
Materialized View Maintenance Maintenance becomes a problem when there is a large number of views to maintain. The order in which views are maintained is important: A view can be computed from an existing view instead of being recomputed from the base relations (total per region can be computed from total per nation). Let the views and base tables be nodes v_i. Let there be an edge from v_1 to v_2 if it is possible to compute the view v_2 from v_1. Associate the cost of computing v_2 from v_1 with this edge. Compute all-pairs shortest paths where the start nodes are the set of base tables. The result is an acyclic graph A. Take a topological sort of A and let that be the order of view construction.
Materialized View Example Detail(storeid, item, qty, price, date) Materializedview1(storeid, category, qty, month) Materializedview2(city, category, qty, month) Materializedview3(storeid, category, qty, year)
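The ordering procedure can be sketched as follows, using the tables from the example slide as nodes. The edge costs are invented for illustration, as is the assumption that views 2 and 3 can each be derived from view 1. Since only paths that start at base tables matter, the sketch runs a multi-source Dijkstra from the base tables rather than a full all-pairs computation, then topologically sorts the resulting graph A with the standard-library `graphlib` (Python 3.9+).

```python
import heapq
from graphlib import TopologicalSorter


def view_build_order(edges, base_tables):
    """edges maps (source, target) -> cost of computing target from source."""
    adjacent = {}
    for (src, dst), cost in edges.items():
        adjacent.setdefault(src, []).append((dst, cost))

    # Shortest paths whose start nodes are the base tables.
    dist = {b: 0 for b in base_tables}
    parent = {}  # chosen source for each view; these edges form graph A
    heap = [(0, b) for b in base_tables]
    heapq.heapify(heap)
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, cost in adjacent.get(node, []):
            if d + cost < dist.get(nxt, float("inf")):
                dist[nxt] = d + cost
                parent[nxt] = node
                heapq.heappush(heap, (d + cost, nxt))

    # Topological sort of A gives the construction order.
    sorter = TopologicalSorter({view: {src} for view, src in parent.items()})
    order = [n for n in sorter.static_order() if n not in base_tables]
    return order, parent


# Hypothetical costs for the example schema.
edges = {
    ("Detail", "view1"): 100,
    ("Detail", "view2"): 150,
    ("Detail", "view3"): 120,
    ("view1", "view2"): 10,
    ("view1", "view3"): 5,
}
order, parent = view_build_order(edges, {"Detail"})
print(order, parent)
```

With these costs, view 1 is built from Detail first and the other two views are then derived from view 1.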
Approximations -- data Settings: TPC-H schema Approximations insert into approxlineitem select top 6000 * from lineitem where l_linenumber = 4; insert into approxorders select O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE, O_ORDERDATE, O_ORDERPRIORITY, O_CLERK, O_SHIPPRIORITY, O_COMMENT from orders, approxlineitem where o_orderkey = l_orderkey;
Approximations -- queries insert into approxsupplier select distinct S_SUPPKEY, S_NAME , S_ADDRESS, S_NATIONKEY, S_PHONE, S_ACCTBAL, S_COMMENT from approxlineitem, supplier where s_suppkey = l_suppkey; insert into approxpart select distinct P_PARTKEY, P_NAME , P_MFGR , P_BRAND , P_TYPE , P_SIZE , P_CONTAINER , P_RETAILPRICE , P_COMMENT from approxlineitem, part where p_partkey = l_partkey; insert into approxpartsupp select distinct PS_PARTKEY, PS_SUPPKEY, PS_AVAILQTY, PS_SUPPLYCOST, PS_COMMENT from partsupp, approxpart, approxsupplier where ps_partkey = p_partkey and ps_suppkey = s_suppkey; insert into approxcustomer select distinct C_CUSTKEY, C_NAME , C_ADDRESS, C_NATIONKEY, C_PHONE , C_ACCTBAL, C_MKTSEGMENT, C_COMMENT from customer, approxorders where o_custkey = c_custkey; insert into approxregion select * from region; insert into approxnation select * from nation;
Approximations -- more queries Single table query on lineitem select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price, sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count(*) as count_order from lineitem where datediff(day, l_shipdate, '1998-12-01') <= '120' group by l_returnflag, l_linestatus order by l_returnflag, l_linestatus;
Approximations -- still more Queries: 6-way join select n_name, avg(l_extendedprice * (1 - l_discount)) as revenue from customer, orders, lineitem, supplier, nation, region where c_custkey = o_custkey and l_orderkey = o_orderkey and l_suppkey = s_suppkey and c_nationkey = s_nationkey and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'AFRICA' and o_orderdate >= '1993-01-01' and datediff(year, o_orderdate,'1993-01-01') < 1 group by n_name order by revenue desc;
Approximation accuracy Good approximation for query Q1 on lineitem The aggregated values obtained on a query with a 6-way join are significantly different from the actual values -- for some applications may still be good enough.
Approximation Speedup Aqua approximation on the TPC-H schema 1% and 10% lineitem sample propagated. The query speed-up obtained with approximated relations is significant.
Good Way to Find a Random Sample If you know how big your data is (i.e. it is not streaming), then the best way is to include each record independently at random with some probability p. If you don’t know how big your data is, hash each record to a value in some range and keep the records with the k lowest hash values. Can use a priority queue holding k elements: once it is full, insert each new element and then perform deletemax, so that the k lowest remain.
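Both sampling schemes fit in a few lines. This is a sketch under assumptions: SHA-256 over `repr(record)` stands in for the hash function (any well-mixing hash would do), and the function names are ours. Python's `heapq` is a min-heap, so hash values are negated to get the deletemax behaviour.

```python
import hashlib
import heapq
import random


def bernoulli_sample(records, p, rng=random):
    """Known-size case: keep each record independently with probability p."""
    return [r for r in records if rng.random() < p]


def hash_to_unit(record):
    """Map a record to a pseudo-random value in [0, 1) (assumed hash choice)."""
    digest = hashlib.sha256(repr(record).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def lowest_k_sample(stream, k):
    """Streaming case: keep the k records with the lowest hash values."""
    heap = []  # entries are (-hash, record), so the heap root has the largest hash
    for record in stream:
        heapq.heappush(heap, (-hash_to_unit(record), record))
        if len(heap) > k:
            heapq.heappop(heap)  # deletemax: evict the largest hash value
    return [record for _, record in heap]


print(sorted(lowest_k_sample(range(1000), 10)))
```

Note that identical records hash to the same value, so the streaming variant samples over distinct record values; attach a unique id to each record if duplicates should be sampled separately.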
Approximate Counts of Every Item Count-min idea. Imagine that you had a perfect hash function H (i.e. no collisions). Then, you could simply increment location H(x) every time x appeared. You don’t. So use several hash functions H1, H2, … Hj. When you see x, increment locations H1(x), H2(x), … Hj(x). To estimate the count of x, take the minimum over those locations.
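A minimal count-min sketch along these lines is shown below. Two choices are assumptions rather than anything on the slide: each hash function gets its own row of counters (the usual layout), and the hash functions are simulated by salting SHA-256 with the row index.

```python
import hashlib


class CountMin:
    def __init__(self, num_hashes=4, width=1024):
        self.width = width
        # One row of counters per hash function.
        self.rows = [[0] * width for _ in range(num_hashes)]

    def _location(self, i, item):
        """Simulated hash function H_i: salted SHA-256 reduced modulo the row width."""
        digest = hashlib.sha256(f"{i}:{item!r}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for i, row in enumerate(self.rows):
            row[self._location(i, item)] += count

    def estimate(self, item):
        """Collisions only ever inflate a counter, so the minimum is the best guess."""
        return min(row[self._location(i, item)] for i, row in enumerate(self.rows))


cm = CountMin()
for word in ["a", "b", "a", "c", "a"]:
    cm.add(word)
print(cm.estimate("a"))
```

The estimate never undercounts; it overcounts only when an item collides with other items in every row.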
Keep Track of k Most Frequent Items Take the first k items. Each time you see an item x, if x is among the top k, increment the count of x. Otherwise, replace the item y having the lowest count among the top k by x and increment the count (formerly for y) and ascribe that count to x. Works (approximately) if items are presented as a random permutation. Remarkably robust.
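The replacement rule above can be written directly with a dictionary of k counters. This is a sketch of the slide's procedure (it resembles what the streaming literature calls Space-Saving); the function name is ours.

```python
def top_k_frequent(stream, k):
    """Track approximate counts for at most k candidate items."""
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1
        elif len(counts) < k:
            counts[x] = 1  # still filling up with the first k items
        else:
            # Evict the item with the lowest count; x inherits that count plus one.
            y = min(counts, key=counts.get)
            counts[x] = counts.pop(y) + 1
    return counts


print(top_k_frequent(["a", "a", "b", "a", "c", "a"], k=2))
```

Because a newcomer inherits the evicted item's count, reported counts can overestimate true frequencies, which is why the method is only approximate.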
Maintaining “typical” clusters for streaming data Find a sample using the sampling technique from before. Form a cluster off-line. As points come in, associate them with nearest centroid. Let the initial fraction with centroid i be p(i). If that fraction changes a lot, then maybe the distribution has changed.
Approximating Count Distinct Suppose you have one or more columns with duplicates. You want to know how many distinct values there are. Imagine that you could hash into (0,1). (Simulate this by a very large integer range). Hashing will put duplicates in the same location and, if done right, will distribute values uniformly. Draw the dots of hash values between 0 and 1.
Approximating Count Distinct Suppose you hash every item and keep the 1000 smallest hash values. Suppose that the highest of those 1000 hash values is f (between 0 and 1). Intuitively, if f is small then there are many distinct values. What is a good approximation of the number of distinct values?
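One common answer, offered here as a sketch rather than as the authors' intended solution: if k hash values fall uniformly in (0, f), the density suggests about k/f distinct values overall, and the estimator usually quoted for this "k minimum values" scheme is (k - 1)/f. The hash function (SHA-256 scaled into (0, 1]) is an assumption.

```python
import hashlib
import heapq


def hash_to_unit(value):
    """Map a value to a pseudo-random number in (0, 1] (assumed hash choice)."""
    digest = hashlib.sha256(repr(value).encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64


def approx_count_distinct(values, k=1000):
    heap = []     # negated hashes: a max-heap over the k smallest hash values
    kept = set()  # hash values currently in the heap, to skip duplicates
    for v in values:
        h = hash_to_unit(v)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values seen: the count is exact
    f = -heap[0]          # the k-th smallest hash value
    return (k - 1) / f


print(approx_count_distinct((i % 20000 for i in range(60000)), k=1000))
```

The relative error shrinks roughly as 1/sqrt(k), so k = 1000 typically lands within a few percent of the true count.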
Tuning Distributed Applications Queries across multiple databases Federated data warehouse IBM’s DataJoiner, now integrated into DB2 v7.2 The source should perform as much work as possible and return as little data as possible for processing at the federated server Processing data across multiple databases
A puzzle Two databases X and Y X records inventory data to be used for restocking Y contains delivery data about shipments Improve the speed of shipping by sharing data Certain data of X should be postprocessed on Y shortly after it enters X You want to avoid losing data from X and you want to avoid double-processing data on Y, even in the face of failures.
Two-Phase Commit Commits are coordinated between the source and the destination. If one participant fails then blocking can occur. Not all DB systems support a prepare-to-commit interface. [Figure: X, the source (coordinator and participant), and Y, the destination (participant), each receive a commit.]
Replication Server Destination within a few seconds of being up-to-date. Decision support queries can be asked on the destination db. Administrator is needed when the network connection breaks! [Figure: X (source) replicated to Y (destination).]
Staging Tables No specific mechanism is necessary at source or destination. Coordination of transactions on X and Y. [Figure: X (source) feeds a staging table, which feeds Y (destination).]
Issues in Solution A single thread can invoke a transaction on X, Y or both, but the thread may fail in the middle. A new thread will regain state by looking at the database. We want to delete data from the staging tables when we are finished. The tuples in the staging tables will represent an operation as well as data.
States A tuple (operation plus data) will start in state unprocessed, then state processed, and then deleted. The same transaction that processes the tuple on the destination site also changes its state. This is important so each tuple’s operation is done exactly once.
Staging Tables, Steps 1 and 2 [Figure: Database X and Database Y, each with Table I and Table S. Step 1 (M1) is labelled "Write to destination"; Step 2 (M2) is labelled "Read from source table, write to destination". Tuples are shown in state unprocessed.]
Staging Tables, Steps 3 and 4 [Figure: Database X and Database Y, each with Table I and Table S. Step 3 (M3) is labelled "Update on destination", with the transactions "Xact: update source site" and "Xact: update destination site"; tuples move from unprocessed to processed. Step 4 (M4) is labelled "Delete source; then dest".]
What to Look For Transactions are atomic but threads, remember, are not. So a replacement thread will look to the database for information. In which order should the final step do its transactions? Should it delete the unprocessed tuple from X as the first transaction or as the second?
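The protocol can be simulated with two in-memory SQLite databases standing in for X and Y. Everything specific here is an assumption made for illustration: the table and column names, the "operation" (adding a quantity to a running total on Y), and the simulated failure. The deletion order follows the Step 4 label in the slides (delete at the source first, then at the destination). With the opposite order, a failure between the two deletes would leave the tuple on X only, and a replacement thread would copy and process it again.

```python
import sqlite3


def setup():
    x = sqlite3.connect(":memory:")  # database X (source)
    y = sqlite3.connect(":memory:")  # database Y (destination)
    x.execute("create table staging (id integer primary key, qty integer)")
    y.execute("create table staging (id integer primary key, qty integer, state text)")
    y.execute("create table delivered (total integer)")
    y.execute("insert into delivered values (0)")
    x.commit()
    y.commit()
    return x, y


def copy_to_destination(x, y):
    """Copy staged tuples to Y as 'unprocessed'; repeating the copy is harmless."""
    rows = x.execute("select id, qty from staging").fetchall()
    y.executemany("insert or ignore into staging values (?, ?, 'unprocessed')", rows)
    y.commit()


def process_on_destination(y):
    """One transaction on Y both applies the operation and flips the state."""
    rows = y.execute("select id, qty from staging where state = 'unprocessed'").fetchall()
    for id_, qty in rows:
        y.execute("update delivered set total = total + ?", (qty,))
        y.execute("update staging set state = 'processed' where id = ?", (id_,))
        y.commit()


def clean_up(x, y, fail_between=False):
    """Delete at the source first, then at the destination."""
    ids = [r[0] for r in y.execute("select id from staging where state = 'processed'")]
    for id_ in ids:
        x.execute("delete from staging where id = ?", (id_,))
        x.commit()
        if fail_between:
            raise RuntimeError("simulated thread failure")
        y.execute("delete from staging where id = ?", (id_,))
        y.commit()


def run_once(x, y, fail_between=False):
    copy_to_destination(x, y)
    process_on_destination(y)
    clean_up(x, y, fail_between)


x, y = setup()
x.execute("insert into staging values (1, 5)")
x.commit()
try:
    run_once(x, y, fail_between=True)  # thread dies between the two deletes
except RuntimeError:
    pass
run_once(x, y)                         # replacement thread recovers from database state
total = y.execute("select total from delivered").fetchone()[0]
print(total)
```

The replacement thread keeps no memory of the failed one; it finds the processed tuple still on Y, repeats the (now empty) delete on X, and finishes the clean-up without reapplying the operation.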
Main-memory Databases Database Tuning – Spring 2015 Philippe Bonnet – phbo@itu.dk
Agenda Memory hierarchy Cache-optimized data structures Multi-version Concurrency Control System Examples New Architectures SQL Server 2014 (Hekaton)
Memory Hierarchy Source: http://www.3dincites.com/2015/01/iedm-2014-3d-short-course-highlights-3d-memory-cubes-system-design/
Main-memory Databases Memory has finite capacity: How to deal with space limitation? Memory is volatile: How to deal with persistence? Memory is fast: How can a DBMS take advantage?
Dealing with capacity On Azure, G-series instances go up to 32 cores, 448 GiB of RAM, and 6,596 GB of local SSD storage
Dealing with Capacity Many databases of tens or hundreds of GB fit in RAM on a single host. The storage engine of SQL Server 2014 makes it possible to split a database into a collection of tables that are memory-optimized and a collection of tables that are disk-optimized (data resides on disk and is manipulated through the traditional buffer pool). For databases that are larger than RAM, two options: All data in RAM on a cluster of hosts (e.g., VoltDB) Separate hot data in RAM from cold data on SSD on the same host (e.g. SQL Server 2014)
Dealing with Persistence Check-in/Check-out Memory optimized tables are loaded in memory when the database management system starts up They are stored on disk when the database management system shuts down Swap in/Swap out Tables are swapped in when data gets hot and swapped out when data gets cold What happens if data is modified? ...
Dealing with Persistence Defining the database state: 2nd storage Traditional log-based approach Write-ahead logging protocol (remember that?) RAM and 2nd storage on a single host (SQL Server) No need for write-ahead logging (see log tuning) RAM on multiple hosts (VoltDB) Recovery is needed when a host fails Checkpointing to save consistent states Command log to replay history and recover failed host (see log tuning)
Dealing with Persistence Restrict (greatly) the modifications that are allowed on memory-optimized tables: No insertions/deletions No updates No new index No concurrent modifications, only short transactions
Leveraging RAM Using actual pointers (that refer to memory addresses), rather than logical IDs and offsets within buffer pool pages Variable-size data structures, rather than fixed-size pages Using an index structure to represent a collection of tuples SQL Server 2014 relies on hash-index
Leveraging RAM Traditional focus on RAM/Disk transfers With main-memory databases, the focus is on cache/RAM transfers Maximize cache utilization once data is in cache Column-based representation (see storage tuning) Cache-conscious indexes (see index tuning) Staged query processing Avoid concurrent accesses on cache lines Introduce optimistic concurrency control Trade more instructions for fewer cache transfers Use compression
Avoiding Concurrent Accesses on Cache Lines Two approaches: Enforce serial transaction execution (VoltDB) No need for concurrency control Explicit conflict declaration/discovery when transactions start Conflict discovery when transactions commit Optimistic concurrency control Multi-Version Concurrency Control (SQL Server)
Troubleshooting Techniques(*) Extraction and Analysis of Performance Indicators Consumer-producer chain framework Tools Query plan monitors Performance monitors Event monitors (*) From Alberto Lerner’s chapter
A Consumer-Producer Chain of a DBMS’s Resources High Level Consumers: SQL commands. Intermediate Resources/Consumers: PARSER, OPTIMIZER, EXECUTION SUBSYSTEM, DISK SUBSYSTEM, LOCKING SUBSYSTEM, CACHE MANAGER, LOGGING SUBSYSTEM. Primary Resources: MEMORY, CPU, DISK/CONTROLLER, NETWORK. [Figure: the chain from consumers down to primary resources, with probing spots for indicators marked along it.]
Recurrent Patterns of Problems Effects are not always felt first where the cause is! An overloading high-level consumer A poorly parameterized subsystem An overloaded primary resource
A Systematic Approach to Monitoring Extract indicators to answer the following questions Question 1: Are critical queries being served in the most efficient manner? Question 2: Are subsystems making optimal use of resources? Question 3: Are there enough primary resources available?
Investigating High Level Consumers Answer question 1: “Are critical queries being served in the most efficient manner?” Identify the critical queries Analyze their access plans Profile their execution
Identifying Critical Queries Critical queries are usually those that: Take a long time Are frequently executed Often, a user complaint will tip us off.
Event Monitors to Identify Critical Queries If no user complains... Capture usage measurements at specific events (e.g., end of each query) and then sort by usage. Less overhead than other types of tools, because the indicators are usually a by-product of the events monitored. Typical measures include CPU used, IO used, locks obtained, etc.
@ Dennis Shasha and Philippe Bonnet, 2013 DBMS Components Parser Query Processor Compiler Execution Engine Indexes Storage Subsystem Concurrency Control Recovery Buffer Manager
DB2 9.7 Process Architecture Exercise 1: Is intra-query parallelism possible with this process model? In other words, can a query be executed in parallel within the same instance (or partition)? Source: http://pic.dhe.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.admin.perf.doc/doc/00003525.gif
DB2 10.1 Process Architecture No need to know much more about the process abstractions. We will cover much more on the memory abstractions (log tuning), and on the communication abstractions (tuning the application interface, tuning across instances). Source: http://pic.dhe.ibm.com/infocenter/db2luw/v10r1/index.jsp?topic=%2Fcom.ibm.db2.luw.admin.perf.doc%2Fdoc%2Fc0008930.html
MySQL Architecture Source: http://docs.oracle.com/cd/E19957-01/mysql-refman-5.5/storage-engines.html
Experimental Framework
System: Application + DBMS + OS + HW. Parameters (fixed/factors).
Metrics: Throughput / Response Time; DBMS performance indicators; OS performance indicators.
Workload: Actual users (production), replay trace or synthetic workload (e.g., TPC benchmark).
Experiments: What factor to vary?
Exercise 2: Is throughput always the inverse of response time?
Exercise 3: Define an experiment to measure the write throughput of the file system on your laptop.
Example
DBMS:
Statement number: 1
select C_NAME, N_NAME from DBA.CUSTOMER join DBA.NATION on C_NATIONKEY = N_NATIONKEY where C_ACCTBAL > 0
Number of rows retrieved is: 136308
Number of rows sent to output is: 0
Elapsed Time is: 76.349 seconds
…
Buffer pool data logical reads = 272618
Buffer pool data physical reads = 131425
Buffer pool data writes = 0
Buffer pool index logical reads = 273173
Buffer pool index physical reads = 552
Buffer pool index writes = 0
Total buffer pool read time (ms) = 71352
Total buffer pool write time (ms) = 0
Summary of Results
==================
             Elapsed    Agent CPU   Rows      Rows
Statement #  Time (s)   Time (s)    Fetched   Printed
1            76.349     6.670       136308    0
OS: Windows: Performance monitor. Linux: iostat, vmstat.
More on this in tuning the guts.
An example Event Monitor CPU indicators sorted by Oracle’s Trace Data Viewer Similar tools: DB2’s Event Monitor and MSSQL’s Server Profiler
An example Plan Explainer Access plan according to MSSQL’s Query Analyzer Similar tools: DB2’s Visual Explain and Oracle’s SQL Analyze Tool
Finding Strangeness in Access Plans What to pay attention to in a plan Access paths for each table Sorts or intermediary results (join, group by, distinct, order by) Order of operations Algorithms used in the operators
To Index or not to index? select c_name, n_name from CUSTOMER join NATION on c_nationkey=n_nationkey where c_acctbal > 0 Which plan performs best? (nation_pk is a non-clustered index over n_nationkey, and similarly for acctbal_ix over c_acctbal)
Non-clustering indexes can be trouble For a low selectivity predicate, each access to the index generates a random access to the table, possibly a duplicate one! It ends up that the number of pages read from the table is greater than its size, i.e., a table scan is way better.
                        Table Scan       Index Scan
CPU time                5 sec            76 sec
data logical reads      143,075 pages    272,618 pages
data physical reads     6,777 pages      131,425 pages
index logical reads     136,319 pages    273,173 pages
index physical reads    7 pages          552 pages
An example Performance Monitor (query level)
Statement number: 1
select C_NAME, N_NAME from DBA.CUSTOMER join DBA.NATION on C_NATIONKEY = N_NATIONKEY where C_ACCTBAL > 0
Number of rows retrieved is: 136308
Number of rows sent to output is: 0
Elapsed Time is: 76.349 seconds
…
Buffer pool data logical reads = 272618
Buffer pool data physical reads = 131425
Buffer pool data writes = 0
Buffer pool index logical reads = 273173
Buffer pool index physical reads = 552
Buffer pool index writes = 0
Total buffer pool read time (ms) = 71352
Total buffer pool write time (ms) = 0
Summary of Results
==================
             Elapsed    Agent CPU   Rows      Rows
Statement #  Time (s)   Time (s)    Fetched   Printed
1            76.349     6.670       136308    0
Buffer and CPU consumption for a query according to DB2’s Benchmark tool. Similar tools: MSSQL’s SET STATISTICS switch and Oracle’s SQL Analyze Tool.
An example Performance Monitor (system level) An IO indicator’s consumption evolution (qualitative and quantitative) according to DB2’s System Monitor. Similar tools: the Windows Performance Monitor and Oracle’s Performance Manager.
Investigating High Level Consumers: Summary [Flowchart] Find critical queries. Found any? If no, investigate lower levels. If yes, answer Q1 over them. Overconsumption? If yes, tune problematic queries. If no, investigate lower levels.
Investigating Primary Resources Answer question 3: “Are there enough primary resources available for a DBMS to consume?” Primary resources are: CPU, disk & controllers, memory, and network Analyze specific OS-level indicators to discover bottlenecks. A system-level Performance Monitor is the right tool here
CPU Consumption Indicators at the OS Level Sustained utilization over 70% should trigger the alert. System utilization shouldn’t be more than 40%. DBMS (on a non-dedicated machine) should be getting a decent time share. [Chart: CPU % of utilization (up to 100%) over time, showing total usage and system usage, with the 70% level marked.]
Disk Performance Indicators at the OS Level Average queue size should be close to zero. Wait times should also be close to zero. Idle disk with pending requests? Check controller contention. Also, transfers (disk transfers/second) should be balanced among disks/controllers. [Figure: new requests enter a wait queue in front of the disk.]
Memory Consumption Indicators at the OS Level Page faults/time should be close to zero. If paging happens, it should at least not involve DB cache pages. % of pagefile in use will tell you how much memory is “needed.” [Figure: virtual memory, real memory, and the pagefile.]
Investigating Intermediate Resources/Consumers Answer question 2: “Are subsystems making optimal use of resources?” Main subsystems: Cache Manager, Disk subsystem, Lock subsystem, and Log/Recovery subsystem Similarly to Q3, extract and analyze relevant Performance Indicators
Cache Manager Performance Indicators If a page is not in the cache, readpage (logical) generates an actual IO (physical). The fraction of readpages that did not generate physical IO should be 90% or more (hit ratio). Pages are regularly saved to disk to make free space. # of free slots should always be > 0. [Figure: a table scan calls readpage() on the Cache Manager, which applies a pick-victim strategy over data pages and free page slots and issues page reads/writes.]
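As a worked check of the 90% rule of thumb, the hit ratio can be computed from the data-page counters reported earlier for the table-scan and index-scan plans of the CUSTOMER/NATION query. The function name is ours; the counter values are copied from those slides.

```python
def hit_ratio(logical_reads, physical_reads):
    """Fraction of logical page reads that did not trigger a physical IO."""
    return 1 - physical_reads / logical_reads


# Data-page counters from the two plans shown earlier.
table_scan = hit_ratio(logical_reads=143_075, physical_reads=6_777)
index_scan = hit_ratio(logical_reads=272_618, physical_reads=131_425)

print(f"table scan: {table_scan:.1%}")
print(f"index scan: {index_scan:.1%}")
```

The table scan clears the 90% threshold while the index scan falls far below it, which matches the observation that the non-clustering index plan reads more pages than the table contains.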
Disk Manager Performance Indicators Storage hierarchy (simplified), with one indicator per level. Rows: displaced rows (moved to other pages) should be kept under 5% of rows. Page: free space fragmentation; pages with little space should not be in the free list. Extent and file: data fragmentation; ideally files that store DB objects (table, index) should be in one or few (<5) contiguous extents. Disk: file position should balance workload evenly among all disks.
Lock Manager Performance Indicators Lock wait time for a transaction should be a small fraction of the whole transaction time. Number of lock waits should be a small fraction of the number of locks on the lock list. Deadlocks and timeouts should seldom happen (no more than 1% of the transactions). [Figure: a lock request goes either to the locks pending list or to the lock list, whose entries record Object, Lock Type, and TXN ID.]
Investigating Intermediate and Primary Resources: Summary [Flowchart] Answer Q3. Problems at OS level? If yes, tune low-level resources. If no, answer Q2. Problematic subsystems? If yes, tune subsystems. If no, investigate upper level.
Troubleshooting Techniques Monitoring a DBMS’s performance should be based on queries and resources. The consumption chain helps distinguish the causes of problems from their symptoms. Existing tools help extract relevant performance indicators.
Recall Tuning Principles Think globally, fix locally (troubleshoot to see what matters) Partitioning breaks bottlenecks (find parallelism in processors, controllers, caches, and disks) Start-up costs are high; running costs are low (batch size, cursors) Be prepared for trade-offs (unless you can rethink the queries)