
Slide 1 Initial Availability Benchmarking of a Database System Aaron Brown DBLunch Seminar, 1/23/01

Slide 2 Motivation Availability is a key metric for modern apps. –e-commerce, enterprise apps, online services, ISPs Database availability is particularly important –databases hold the critical hard state for most enterprise and e-business applications »the most important system component to keep available –we trust databases to be highly dependable. Should we? »how do DBMSs react to hardware faults/failures? »what is the user-visible impact of such failures?

Slide 3 Overview of approach Use availability benchmarking to evaluate database dependability –an empirical technique based on simulated faults –study 3-tier OLTP workload »back-end: commercial database »middleware: transaction monitor & business logic »front-end: web-based form interface –focus on storage system faults/failures –measure availability in terms of performance »also possible to look at consistency of data

Slide 4 Outline Availability benchmarking methodology Adapting methodology for OLTP databases Case study of Microsoft SQL Server 2000 Discussion and future directions

Slide 5 Availability benchmarking A general methodology for defining and measuring availability –focused toward research, not marketing –empirically demonstrated with software RAID systems [Usenix00] 3 components: 1) metrics, 2) benchmarking techniques, 3) representation of results

Slide 6 Part 1: Availability metrics Traditionally, percentage of time system is up –time-averaged, binary view of system state (up/down) This metric is inflexible –doesn’t capture degraded states »a non-binary spectrum between “up” and “down” –time-averaging discards important temporal behavior »compare 2 systems with 96.7% traditional availability: system A is down for 2 seconds per minute system B is down for 1 day per month Our solution: measure variation in system quality of service metrics over time –performance, fault-tolerance, completeness, accuracy
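
To make the comparison concrete, here is a minimal Python sketch (the downtime numbers come from the slide; the windowed-QoS function is an illustration of the proposed metric, not code from the talk):

    def traditional_availability(downtime_s: float, period_s: float) -> float:
        # Classic binary metric: fraction of the period the system is "up".
        return 1.0 - downtime_s / period_s

    # System A: down 2 seconds per minute; system B: down 1 day per month.
    a = traditional_availability(2, 60)
    b = traditional_availability(86_400, 30 * 86_400)
    print(f"A: {a:.3%}  B: {b:.3%}")  # both ~96.667% -- indistinguishable

    def qos_availability(throughput_trace, fault_free_baseline):
        # Non-binary view: per-interval quality of service relative to a
        # fault-free baseline, preserving the temporal behavior that the
        # time-averaged metric discards.
        return [min(t / fault_free_baseline, 1.0) for t in throughput_trace]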

Slide 7 Part 2: Measurement techniques Goal: quantify variation in QoS metrics as system availability is compromised Leverage existing performance benchmarks –to measure & trace quality of service metrics –to generate fair workloads Use fault injection to compromise system –hardware and software faults –maintenance events (repairs, SW/HW upgrades) Examine single-fault and multi-fault workloads –the availability analogues of performance micro- and macro-benchmarks
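
A minimal sketch of that measurement loop, assuming hypothetical hooks into the benchmark driver (sample_qos) and the fault injector (inject_fault):

    import time
    from typing import Callable, List, Tuple

    def single_fault_run(
        inject_fault: Callable[[], None],  # hypothetical: arms one fault
        sample_qos: Callable[[], float],   # hypothetical: QoS sample, e.g. txns/interval
        warmup_s: float = 300.0,
        post_fault_s: float = 900.0,
        interval_s: float = 10.0,
    ) -> List[Tuple[float, float]]:
        # Run the workload to steady state, inject exactly one fault, and
        # keep tracing the QoS metric to observe degradation and recovery.
        trace: List[Tuple[float, float]] = []
        start = time.time()
        injected = False
        while (elapsed := time.time() - start) < warmup_s + post_fault_s:
            if not injected and elapsed >= warmup_s:
                inject_fault()             # the single-fault micro-benchmark event
                injected = True
            trace.append((elapsed, sample_qos()))
            time.sleep(interval_s)
        return trace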

Slide 8 Part 3: Representing results Results are most accessible graphically –plot change in QoS metrics over time –compare to “normal” behavior »99% confidence intervals calculated from no-fault runs Graphs can be distilled into numbers
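
One way to compute that confidence band from repeated no-fault runs (the normal approximation is an assumption; the talk does not specify the calculation):

    import statistics
    from typing import List, Tuple

    def ci99_band(no_fault_runs: List[List[float]]) -> Tuple[List[float], List[float]]:
        # Per-interval 99% confidence band on the QoS metric, built from
        # several equal-length no-fault traces. Samples of a faulted run
        # falling outside (lo, hi) count as availability deviations.
        z = 2.576  # two-sided 99% quantile of the standard normal
        lo, hi = [], []
        for samples in zip(*no_fault_runs):
            m = statistics.mean(samples)
            se = statistics.stdev(samples) / len(samples) ** 0.5
            lo.append(m - z * se)
            hi.append(m + z * se)
        return lo, hi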

Slide 9 Outline Availability benchmarking methodology Adapting methodology for OLTP databases –metrics –workload and fault injection Case study of Microsoft SQL Server 2000 Discussion and future directions

Slide 10 Availability metrics for databases Possible OLTP quality of service metrics –transaction throughput –transaction response time »better: % of transactions longer than a fixed cutoff –rate of transactions aborted due to errors –consistency of database –fraction of database content available Our experiments focused on throughput –rates of normal and failed transactions
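
The two throughput metrics (and the response-time cutoff variant) are easy to derive from a transaction completion log; a small sketch with a hypothetical log format:

    from collections import Counter
    from typing import Iterable, List, Tuple

    def per_minute_rates(txn_log: Iterable[Tuple[float, bool]]):
        # txn_log yields (completion_time_s, committed?) pairs. Returns
        # per-minute counts of normal and aborted transactions -- the two
        # rates the experiments focused on.
        ok, aborted = Counter(), Counter()
        for t, committed in txn_log:
            (ok if committed else aborted)[int(t // 60)] += 1
        return ok, aborted

    def slow_fraction(latencies_ms: List[float], cutoff_ms: float = 5000.0) -> float:
        # "% of transactions longer than a fixed cutoff"; the 5-second
        # cutoff here is illustrative, not a value from the talk.
        return sum(l > cutoff_ms for l in latencies_ms) / len(latencies_ms)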

Slide 11 Workload & fault injection Performance workload –easy: TPC-C Fault workload: disk subsystem –realistic fault set based on Tertiary Disk study »correctable & uncorrectable media errors, hardware errors, power failures, disk hangs/timeouts »both transient and “sticky” faults –injected via an emulated SCSI disk (~0.5ms overhead) –faults injected in one of two partitions: »database data partition »database’s write-ahead log partition
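
The fault workload can be written down as a table of descriptors handed to the emulated disk; an illustrative subset (the type names and fields are mine, not the emulator's actual interface):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class DiskFault:
        kind: str     # e.g. "correctable_read", "uncorrectable_write", "scsi_hang"
        sticky: bool  # sticky faults persist across retries; transient ones clear
        target: str   # "data" or "log" partition

    # The study used 14 fault types, each injected against both partitions;
    # a few representative entries:
    FAULT_WORKLOAD = [
        DiskFault(kind, sticky, target)
        for kind, sticky in [
            ("correctable_read", False),
            ("uncorrectable_read", True),
            ("uncorrectable_write", True),
            ("power_failure", True),
            ("scsi_hang", True),
        ]
        for target in ("data", "log")
    ]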

Slide 12 Outline Availability benchmarking methodology Adapting methodology for OLTP databases Case study of Microsoft SQL Server 2000 Discussion and future directions

Slide 13 Experimental setup Database –Microsoft SQL Server 2000, default configuration Middleware/front-end software –Microsoft COM+ transaction monitor/coordinator –IIS 5.0 web server with Microsoft’s tpcc.dll HTML terminal interface and business logic –Microsoft BenchCraft remote terminal emulator TPC-C-like OLTP order-entry workload –10 warehouses, 100 active users, ~860 MB database Measured metrics –throughput of correct NewOrder transactions/min –rate of aborted NewOrder transactions (txn/min)

Slide 14 Experimental setup (2) Database installed in one of two configurations: –data on emulated disk, log on real (IBM) disk –data on real (IBM) disk, log on emulated disk [Testbed diagram: the front end (Intel P-III, Windows 2000 AS; MS BenchCraft RTE, IIS + MS tpcc.dll, MS COM+) connects over 100 Mb Ethernet to the DB server (AMD K6-2, Windows 2000 AS, SQL Server 2000; IDE system disk, Adaptec 3940); the DB data/log disks are IBM 18 GB 10k RPM drives plus an emulated disk on a Fast/Wide SCSI bus (20 MB/sec); the disk emulator (Intel P-II, Windows NT 4.0; Adaptec 2940, AdvStor ASC-U2W UltraSCSI, ASC VirtualSCSI library, NTFS backing disk) provides the emulated disk.]

Slide 15 Results All results are from single-fault micro-benchmarks 14 different fault types –injected once for each of the data and log partitions 4 categories of behavior detected: 1) normal 2) transient glitch 3) degraded 4) failed
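
A heuristic sketch of how a throughput trace might be bucketed into these four categories against the no-fault lower confidence bound (my own simplification, not the classification procedure used in the talk):

    from typing import List

    def classify_run(samples: List[float], lo: List[float]) -> str:
        # Compare each post-fault throughput sample to the no-fault lower
        # 99%-confidence bound and bucket the run's overall behavior.
        below = [s < l for s, l in zip(samples, lo)]
        frac_below = sum(below) / len(below)
        if frac_below == 0:
            return "normal"            # fault tolerated transparently
        if samples[-1] == 0:
            return "failed"            # DBMS hung or aborting everything
        if frac_below < 0.05 and not below[-1]:
            return "transient glitch"  # brief dip, then full recovery
        return "degraded"              # keeps running, but below baseline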

Slide 16 Type 1: normal behavior System tolerates fault Demonstrated for all sector-level faults except: –sticky uncorrectable read, data partition –sticky uncorrectable write, log partition

Slide 17 Type 2: transient glitch One transaction is affected, aborts with error Subsequent transactions using same data would fail Demonstrated for one fault only: –sticky uncorrectable read, data partition

Slide 18 Type 3: degraded behavior DBMS survives error after running log recovery Middleware partially fails, results in degraded perf. Demonstrated for one fault only: –sticky uncorrectable write, log partition

Slide 19 Type 4: failure DBMS hangs or aborts all transactions Middleware behaves erratically, sometimes crashing Demonstrated for all fatal disk-level faults –SCSI hangs, disk power failures Example behaviors (10 distinct variants observed): disk hang during write to data disk; simulated log disk power failure

Slide 20 Results: summary DBMS was robust to a wide range of faults –tolerated all transient and recoverable errors –tolerated some unrecoverable faults »transparently (e.g., uncorrectable data writes) »or by reflecting fault back via transaction abort »these were not tolerated by the SW RAID systems Overall, the DBMS is significantly more robust to disk faults than software RAID on the same OS!

Slide 21 Outline Availability benchmarking methodology Adapting methodology for OLTP databases Case study of Microsoft SQL Server 2000 Discussion and future directions

Slide 22 Results: discussion DBMS’s extra robustness comes from: –redundant data representation in form of log –transactions »standard mechanism for reporting errors (txn abort) »encapsulate meaningful unit of work, providing consistent rollback upon failure But, middleware was not robust, compromising overall system availability –crashed or behaved erratically when DBMS recovered or returned errors –user cannot distinguish DBMS and middleware failure –system is only as robust as its weakest component! (compare RAID: a block-level interface offers no such error-reporting or rollback mechanism)

Slide 23 Discussion of methodology General availability benchmarking methodology does work on more than just RAID systems Issues in adapting the methodology –defining appropriate metrics –measuring non-performance availability metrics –understanding layered (multi-tier) systems with only end-to-end instrumentation

Slide 24 Discussion of methodology General availability benchmarking methodology does work on more than just RAID systems Issues in adapting the methodology –defining appropriate metrics »metrics to capture database ACID properties »adapting “binary” metrics such as data consistency –measuring non-performance availability metrics »existing benchmarks (like TPC-C) may not do this –understanding layered (multi-tier) systems with only end-to-end instrumentation »teasing apart availability impact of different layers

Slide 25 Future directions Direct extensions of this work: –expand metrics, including tests of ACID properties –consider other fault injection points besides disks –investigate clustered database designs –study issues in benchmarking layered systems

Slide 26 Future directions (2) Availability/maintainability extensions to TPC –proposed by James Hamilton at ISTORE retreat –an “optional maintainability test” after regular run –sponsor supplies N best administrators –TPC benchmark run repeated with realistic fault injection and a set of maintenance tasks to perform –measure availability, performance, admin. time,... –requires: »characterization of typical failure modes, admin. tasks »scalable, easy-to-deploy fault-injection harness This work is a (small) step toward that goal –and hints at poor state-of-the-art in TPC-C benchmark middleware fault handling

Slide 27 Thanks! Microsoft SQL Server group –for generously providing access to SQL Server 2000 and the Microsoft TPC-C Benchmark Kit –James Hamilton –Jamie Redding and Charles Levine

Slide 28 Backup slides

Slide 29 Example results: failing data disk Transient, correctable read fault (system tolerates fault) Sticky, uncorrectable read fault (transaction is aborted with error) Disk hang between SCSI commands (DBMS hangs, middleware returns errors) Disk hang during a data write (DBMS hangs, middleware crashes)

Slide 30 Example results: failing log disk Transient, correctable write fault (system tolerates fault) Sticky, uncorrectable write fault (DBMS recovers, middleware degrades) Simulated disk power failure (DBMS aborts all txns with errors) Disk hang between SCSI commands (DBMS hangs, middleware hangs)