Slide 1 Initial Availability Benchmarking of a Database System Aaron Brown DBLunch Seminar, 1/23/01.

Initial Availability Benchmarking of a Database System Aaron Brown abrown@cs.berkeley.edu DBLunch Seminar, 1/23/01

Motivation Availability is a key metric for modern apps. –e-commerce, enterprise apps, online services, ISPs Database availability is particularly important –databases hold the critical hard state for most enterprise and e-business applications »the most important system component to keep available –we trust databases to be highly dependable. Should we? »how do DBMSs react to hardware faults/failures? »what is the user-visible impact of such failures?

Overview of approach Use availability benchmarking to evaluate database dependability –an empirical technique based on simulated faults –study 3-tier OLTP workload »back-end: commercial database »middleware: transaction monitor & business logic »front-end: web-based form interface –focus on storage system faults/failures –measure availability in terms of performance »also possible to look at consistency of data

Outline Availability benchmarking methodology Adapting methodology for OLTP databases Case study of Microsoft SQL Server 2000 Discussion and future directions

Availability benchmarking A general methodology for defining and measuring availability –focused toward research, not marketing –empirically demonstrated with software RAID systems [Usenix00] 3 components 1) metrics 2) benchmarking techniques 3) representation of results

Part 1: Availability metrics Traditionally, percentage of time system is up –time-averaged, binary view of system state (up/down) This metric is inflexible –doesn’t capture degraded states »a non-binary spectrum between “up” and “down” –time-averaging discards important temporal behavior »compare 2 systems with 96.7% traditional availability: system A is down for 2 seconds per minute system B is down for 1 day per month Our solution: measure variation in system quality of service metrics over time –performance, fault-tolerance, completeness, accuracy

Part 2: Measurement techniques Goal: quantify variation in QoS metrics as system availability is compromised Leverage existing performance benchmarks –to measure & trace quality of service metrics –to generate fair workloads Use fault injection to compromise system –hardware and software faults –maintenance events (repairs, SW/HW upgrades) Examine single-fault and multi-fault workloads –the availability analogues of performance micro- and macro-benchmarks

Results are most accessible graphically –plot change in QoS metrics over time –compare to “normal” behavior »99% confidence intervals calculated from no-fault runs Part 3: Representing results Graphs can be distilled into numbers

Outline Availability benchmarking methodology Adapting methodology for OLTP databases –metrics –workload and fault injection Case study of Microsoft SQL Server 2000 Discussion and future directions

Availability metrics for databases Possible OLTP quality of service metrics –transaction throughput –transaction response time »better: % of transactions longer than a fixed cutoff –rate of transactions aborted due to errors –consistency of database –fraction of database content available Our experiments focused on throughput –rates of normal and failed transactions

Workload & fault injection Performance workload –easy: TPC-C Fault workload: disk subsystem –realistic fault set based on Tertiary Disk study »correctable & uncorrectable media errors, hardware errors, power failures, disk hangs/timeouts »both transient and “sticky” faults –injected via an emulated SCSI disk (~0.5ms overhead) –faults injected in one of two partitions: »database data partition »database’s write-ahead log partition

Experimental setup Database –Microsoft SQL Server 2000, default configuration Middleware/front-end software –Microsoft COM+ transaction monitor/coordinator –IIS 5.0 web server with Microsoft’s tpcc.dll HTML terminal interface and business logic –Microsoft BenchCraft remote terminal emulator TPC-C-like OLTP order-entry workload –10 warehouses, 100 active users, ~860 MB database Measured metrics –throughput of correct NewOrder transactions/min –rate of aborted NewOrder transactions (txn/min)

Experimental setup (2) Database installed in one of two configurations: –data on emulated disk, log on real (IBM) disk –data on real (IBM) disk, log on emulated disk IBM 18 GB 10k RPM DB Server IDE system disk = Fast/Wide SCSI bus, 20 MB/sec Adaptec 3940 Emulated Disk DB data/ log disks Front End SCSI system disk 100mb Ethernet IBM 18 GB 10k RPM SCSI system disk Disk Emulator Intel P-II/300 128 MB DRAM Windows NT 4.0 Adaptec 2940 emulator backing disk (NTFS) AdvStor ASC-U2W UltraSCSI ASC VirtualSCSI lib. Intel P-III/450 256 MB DRAM Windows 2000 AS MS BenchCraft RTE IIS + MS tpcc.dll MS COM+ AMD K6-2/333 128 MB DRAM Windows 2000 AS SQL Server 2000

Results All results are from single-fault micro- benchmarks 14 different fault types –injected once for each of data and log partitions 4 categories of behavior detected 1) normal 2) transient glitch 3)degraded 4)failed

Type 1: normal behavior System tolerates fault Demonstrated for all sector-level faults except: –sticky uncorrectable read, data partition –sticky uncorrectable write, log partition

Type 2: transient glitch One transaction is affected, aborts with error Subsequent transactions using same data would fail Demonstrated for one fault only: –sticky uncorrectable read, data partition

Type 3: degraded behavior DBMS survives error after running log recovery Middleware partially fails, results in degraded perf. Demonstrated for one fault only: –sticky uncorrectable write, log partition

Type 4: failure DBMS hangs or aborts all transactions Middleware behaves erratically, sometimes crashing Demonstrated for all fatal disk-level faults –SCSI hangs, disk power failures Example behaviors (10 distinct variants observed) Disk hang during write to data diskSimulated log disk power failure

Results: summary DBMS was robust to a wide range of faults –tolerated all transient and recoverable errors –tolerated some unrecoverable faults »transparently (e.g., uncorrectable data writes) »or by reflecting fault back via transaction abort »these were not tolerated by the SW RAID systems Overall, DBMS is significantly more robust to disk faults than software RAID on same OS!

Results: discussion DBMS’s extra robustness comes from: –redundant data representation in form of log –transactions »standard mechanism for reporting errors (txn abort) »encapsulate meaningful unit of work, providing consistent rollback upon failure But, middleware was not robust, compromising overall system availability –crashed or behaved erratically when DBMS recovered or returned errors –user cannot distinguish DBMS and middleware failure –system is only as robust as its weakest component! compare RAID: blocks don’t let you do this

Discussion of methodology General availability benchmarking methodology does work on more than just RAID systems Issues in adapting the methodology –defining appropriate metrics –measuring non-performance availability metrics –understanding layered (multi-tier) systems with only end-to-end instrumentation

Discussion of methodology General availability benchmarking methodology does work on more than just RAID systems Issues in adapting the methodology –defining appropriate metrics »metrics to capture database ACID properties »adapting “binary” metrics such as data consistency –measuring non-performance availability metrics »existing benchmarks (like TPC-C) may not do this –understanding layered (multi-tier) systems with only end-to-end instrumentation »teasing apart availability impact of different layers DO NOT PROJECT THIS SLIDE!

Future directions Direct extensions of this work: –expand metrics, including tests of ACID properties –consider other fault injection points besides disks –investigate clustered database designs –study issues in benchmarking layered systems

Future directions (2) Availability/maintainability extensions to TPC –proposed by James Hamilton at ISTORE retreat –an “optional maintainability test” after regular run –sponsor supplies N best administrators –TPC benchmark run repeated with realistic fault injection and a set of maintenance tasks to perform –measure availability, performance, admin. time,... –requires: »characterization of typical failure modes, admin. tasks »scalable, easy-to-deploy fault-injection harness This work is a (small) step toward that goal –and hints at poor state-of-the-art in TPC-C benchmark middleware fault handling

Thanks! Microsoft SQL Server group –for generously providing access to SQL Server 2000 and the Microsoft TPC-C Benchmark Kit –James Hamilton –Jamie Redding and Charles Levine

Backup slides

Example results: failing data disk Transient, correctable read fault (system tolerates fault) Sticky, uncorrectable read fault (transaction is aborted with error) Disk hang between SCSI commands (DBMS hangs, middleware returns errors) Disk hang during a data write (DBMS hangs, middleware crashes)

Example results: failing log disk Transient, correctable write fault (system tolerates fault) Sticky, uncorrectable write fault (DBMS recovers, middleware degrades) Simulated disk power failure (DBMS aborts all txns with errors) Disk hang between SCSI commands (DBMS hangs, middleware hangs)

Slide 1 Initial Availability Benchmarking of a Database System Aaron Brown DBLunch Seminar, 1/23/01.

Similar presentations

Presentation on theme: "Slide 1 Initial Availability Benchmarking of a Database System Aaron Brown DBLunch Seminar, 1/23/01."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Slide 1 Initial Availability Benchmarking of a Database System Aaron Brown DBLunch Seminar, 1/23/01.

Similar presentations

Presentation on theme: "Slide 1 Initial Availability Benchmarking of a Database System Aaron Brown DBLunch Seminar, 1/23/01."— Presentation transcript:

Similar presentations

About project

Feedback