Slide 1 Initial Availability Benchmarking of a Database System Aaron Brown 2001 Winter ISTORE Retreat.

Presentation transcript:

Slide 1 Initial Availability Benchmarking of a Database System Aaron Brown 2001 Winter ISTORE Retreat

Slide 2 Motivation
Extend availability benchmarks to new areas
– explore generality and limitations of approach
– gain more understanding of system failure modes
Why look at database availability?
– databases hold the critical hard state for most enterprise and e-business applications
  » the most important system component to keep available
– we trust databases to be highly reliable. Should we?
  » how do DBMSs react to hardware faults/failures?
  » what is the user-visible impact of such failures?

Slide 3 Approach
Use our availability benchmarking methodology to evaluate database robustness
– focus on storage system failures
– study 3-tier OLTP workload
  » back-end: commercial database
  » middleware: transaction monitor & business logic
  » front-end: web-based form interface
– measure availability in terms of performance
  » also possible to look at consistency of data

Slide 4 Refresher: availability benchmarks
Goal: quantify variation in quality of service as system availability is compromised
Leverage existing performance benchmark
– to measure & trace quality of service metrics
– to generate fair workloads
Use fault injection to compromise system
Observe results graphically
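
The "measure, inject, observe graphically" loop on this slide can be sketched in a few lines. This is only an illustration, not the authors' tooling: the function names `qos_trace` and `ascii_plot`, the bucket size, and the timestamp format are all assumptions made for the example.

```python
from collections import Counter


def qos_trace(completion_times, interval_s, duration_s):
    """Bucket transaction completion timestamps (in seconds) into a
    throughput trace: one count of completed transactions per interval."""
    n_buckets = int(duration_s // interval_s)
    horizon = n_buckets * interval_s
    counts = Counter(
        int(t // interval_s) for t in completion_times if 0 <= t < horizon
    )
    return [counts.get(i, 0) for i in range(n_buckets)]


def ascii_plot(trace, fault_bucket):
    """Render the trace as a text bar chart and mark the interval in
    which the (hypothetical) fault was injected."""
    lines = []
    for i, n in enumerate(trace):
        marker = " <- fault injected" if i == fault_bucket else ""
        lines.append(f"{i:3d} | {'#' * n}{marker}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up completion times, for illustration only.
    times = [0.5, 1.5, 2.5, 3.5, 10.2, 11.0, 11.9, 25.0]
    trace = qos_trace(times, interval_s=10, duration_s=30)
    print(ascii_plot(trace, fault_bucket=1))
```

Plotting the trace against time, with the injection point marked, is what makes a dip, a plateau at a lower level, or a drop to zero visible at a glance.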

Slide 5 Availability metrics for databases
Possible OLTP quality of service metrics
– transaction throughput
– transaction response time
  » better: % of transactions longer than a fixed cutoff
– rate of transactions aborted due to errors
– consistency of database
– fraction of database content available
Our experiments focused on throughput
– rates of normal and failed transactions
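
As a rough illustration of the first three metrics, the sketch below computes throughput, abort rate, and the percentage of transactions over a response-time cutoff from a list of transaction records. The `Txn` record, the function name `oltp_metrics`, and the cutoff value are hypothetical and are not taken from the benchmark kit.

```python
from dataclasses import dataclass


@dataclass
class Txn:
    response_s: float  # response time in seconds
    ok: bool           # False means the transaction aborted with an error


def oltp_metrics(txns, window_min, cutoff_s):
    """Summarize a measurement window of `window_min` minutes."""
    ok = [t for t in txns if t.ok]
    aborted = len(txns) - len(ok)
    slow = sum(1 for t in ok if t.response_s > cutoff_s)
    return {
        # correct transactions per minute
        "throughput_per_min": len(ok) / window_min,
        # aborted transactions per minute
        "aborted_per_min": aborted / window_min,
        # share of correct transactions slower than the fixed cutoff
        "pct_over_cutoff": 100.0 * slow / len(ok) if ok else 0.0,
    }


if __name__ == "__main__":
    sample = [Txn(0.2, True), Txn(0.4, True), Txn(3.0, True),
              Txn(0.1, False), Txn(0.3, True)]
    print(oltp_metrics(sample, window_min=2, cutoff_s=1.0))
```

Consistency and content-availability metrics would need checks against the database itself and are not covered by this sketch.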

Slide 6 Fault injection
Disk subsystem faults only
– realistic fault set based on Tertiary Disk study
  » correctable & uncorrectable media errors, hardware errors, power failures, disk hangs/timeouts
  » both transient and “sticky” faults
  » note: similar fault set to RAID benchmarks
– injected via an emulated SCSI disk (~0.5ms overhead)
– faults injected in one of two partitions:
  » database data partition
  » database’s write-ahead log partition
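
The difference between a transient and a "sticky" fault can be shown with a toy in-memory disk. This is a minimal sketch under stated assumptions, not the SCSI disk emulator used in the study: `EmulatedDisk`, `Fault`, and `DiskError` are hypothetical names, and only uncorrectable read/write errors on a single sector are modeled (no hangs, power failures, or correctable errors).

```python
from dataclasses import dataclass


class DiskError(Exception):
    """Raised when an injected uncorrectable fault hits a request."""


@dataclass
class Fault:
    op: str       # "read" or "write"
    sector: int   # sector the fault is attached to
    sticky: bool  # True: fires on every matching request; False: fires once


class EmulatedDisk:
    def __init__(self):
        self.sectors = {}  # sector number -> bytes
        self.faults = []   # currently armed faults

    def inject(self, fault):
        self.faults.append(fault)

    def _check(self, op, sector):
        for f in self.faults:
            if f.op == op and f.sector == sector:
                if not f.sticky:
                    # A transient fault disappears after firing once.
                    self.faults.remove(f)
                raise DiskError(f"uncorrectable {op} error at sector {sector}")

    def read(self, sector):
        self._check("read", sector)
        return self.sectors.get(sector, b"")

    def write(self, sector, data):
        self._check("write", sector)
        self.sectors[sector] = data


if __name__ == "__main__":
    disk = EmulatedDisk()
    disk.write(7, b"row")
    disk.inject(Fault("read", 7, sticky=False))
    try:
        disk.read(7)
    except DiskError as e:
        print("first read:", e)
    print("retry:", disk.read(7))  # the transient fault is gone
```

In the experiments the same fault would be injected once with the emulated disk holding the data partition and once with it holding the log partition.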

Slide 7 Experimental setup
Database
– Microsoft SQL Server 2000, default configuration
Middleware/front-end software
– Microsoft COM+ transaction monitor/coordinator
– IIS 5.0 web server with Microsoft’s tpcc.dll HTML terminal interface and business logic
– Microsoft BenchCraft remote terminal emulator
TPC-C-like OLTP order-entry workload
– 10 warehouses, 100 active users, ~860 MB database
Measured metrics
– throughput of correct NewOrder transactions/min
– rate of aborted NewOrder transactions (txn/min)

Slide 8 Experimental setup (2)
Database installed in one of two configurations:
– data on emulated disk, log on real (IBM) disk
– data on real (IBM) disk, log on emulated disk
[Figure: testbed diagram. Labels recoverable from the slide:]
– Front End: Intel P-III/ MB DRAM, Windows 2000 AS, MS BenchCraft RTE, IIS + MS tpcc.dll, MS COM+
– DB Server: AMD K6-2/ MB DRAM, Windows 2000 AS, SQL Server 2000
– Disk Emulator: Intel P-II/ MB DRAM, Windows NT 4.0, Adaptec 2940, emulator backing disk (NTFS), AdvStor ASC-U2W, ASC VirtualSCSI lib.
– Disks and adapters: IDE system disk, SCSI system disk (two labels), IBM 18 GB 10k RPM (two labels), Adaptec 3940, Emulated Disk, DB data/log disks
– Links: 100mb Ethernet; UltraSCSI; legend: = Fast/Wide SCSI bus, 20 MB/sec

Slide 9 Results
All results are from single-fault micro-benchmarks
14 different fault types
– injected once for each of data and log partitions
4 categories of behavior detected
1) normal
2) transient glitch
3) degraded
4) failed
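
One way to make the four categories concrete is a small rule-based classifier over a post-fault throughput trace. The talk does not say how runs were categorized, so the rules, the `degraded_ratio` threshold, and the function name `classify_run` below are assumptions chosen purely for illustration.

```python
def classify_run(baseline_tpm, post_fault_tpm, post_fault_aborts,
                 degraded_ratio=0.9):
    """Assign one of the four behavior categories to a single-fault run.

    baseline_tpm      -- mean correct txn/min before the fault
    post_fault_tpm    -- per-minute correct txn counts after the fault
    post_fault_aborts -- per-minute aborted txn counts after the fault
    degraded_ratio    -- assumed threshold separating degraded from normal
    """
    if not post_fault_tpm:
        raise ValueError("need at least one post-fault sample")
    # Judge the steady state from the second half of the post-fault trace.
    tail = post_fault_tpm[len(post_fault_tpm) // 2:]
    tail_mean = sum(tail) / len(tail)
    if tail_mean == 0:
        return "failed"            # no correct transactions complete
    if tail_mean < degraded_ratio * baseline_tpm:
        return "degraded"          # system runs, but below baseline
    if sum(post_fault_aborts) > 0:
        return "transient glitch"  # some aborts, then full recovery
    return "normal"                # fault tolerated transparently


if __name__ == "__main__":
    print(classify_run(100, [100, 101, 99, 100], [0, 0, 0, 0]))
    print(classify_run(100, [98, 100, 100, 100], [1, 0, 0, 0]))
    print(classify_run(100, [60, 50, 55, 45], [0, 0, 0, 0]))
    print(classify_run(100, [20, 0, 0, 0], [80, 100, 100, 100]))
```

With 14 fault types injected once per partition, a classifier like this would be applied to 28 runs.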

Slide 10 Type 1: normal behavior
System tolerates fault
Demonstrated for all sector-level faults except:
– sticky uncorrectable read, data partition
– sticky uncorrectable write, log partition

Slide 11 Type 2: transient glitch
One transaction is affected, aborts with error
Subsequent transactions using same data would fail
Demonstrated for one fault only:
– sticky uncorrectable read, data partition

Slide 12 Type 3: degraded behavior
DBMS survives error after running log recovery
Middleware partially fails, results in degraded performance
Demonstrated for one fault only:
– sticky uncorrectable write, log partition

Slide 13 Type 4: failure
DBMS hangs or aborts all transactions
Middleware behaves erratically, sometimes crashing
Demonstrated for all fatal disk-level faults
– SCSI hangs, disk power failures
Example behaviors (10 distinct variants observed)
[Figure panels: Disk hang during write to data disk; Simulated log disk power failure]

Slide 14 Results: summary
DBMS was robust to a wide range of faults
– tolerated all transient and recoverable errors
– tolerated some unrecoverable faults
  » transparently (e.g., uncorrectable data writes)
  » or by reflecting fault back via transaction abort
  » these were not tolerated by the SW RAID systems
Overall, DBMS is significantly more robust to disk faults than software RAID systems!

Slide 15 Results: discussion
DBMS’s extra robustness comes from:
– redundant data representation in form of log
– transactions
  » standard mechanism for reporting errors (txn abort)
  » encapsulate meaningful unit of work, providing consistent rollback upon failure
But, middleware was not robust, compromising overall system availability
– crashed or behaved erratically when DBMS recovered or returned errors
– user cannot distinguish DBMS and middleware failure
– system is only as robust as its weakest component!
[Callout: compare RAID: blocks don’t let you do this]

Slide 16 Discussion of methodology
General availability benchmarking methodology does work on more than just RAID systems
Issues in adapting the methodology
– defining appropriate metrics
– measuring non-performance availability metrics
– understanding layered (multi-tier) systems with only end-to-end instrumentation

Slide 17 Discussion of methodology
General availability benchmarking methodology does work on more than just RAID systems
Issues in adapting the methodology
– defining appropriate metrics
  » metrics to capture database ACID properties
  » adapting “binary” metrics such as data consistency
– measuring non-performance availability metrics
  » existing benchmarks (like TPC-C) may not do this
– understanding layered (multi-tier) systems with only end-to-end instrumentation
  » teasing apart availability impact of different layers
DO NOT PROJECT THIS SLIDE!

Slide 18 Future directions
Last retreat: James Hamilton proposed availability/maintainability extensions to TPC
This work is a (small) step toward that goal
– exposed limitations, capabilities of disk fault injection
– revealed importance of middleware, which clearly must be considered as part of the benchmark
– hints at poor state-of-the-art in TPC-C benchmark middleware fault handling
Next:
– expand metrics, including tests of ACID properties
– consider other fault injection points besides disks
– investigate clustered database designs
– study issues in benchmarking layered systems

Slide 19 Thanks!
Microsoft SQL Server group
– for generously providing access to SQL Server 2000 and the Microsoft TPC-C Benchmark Kit
– James Hamilton
– Jamie Redding and Charles Levine

Slide 20 Backup slides

Slide 21 Example results: failing data disk
– Transient, correctable read fault (system tolerates fault)
– Sticky, uncorrectable read fault (transaction is aborted with error)
– Disk hang between SCSI commands (DBMS hangs, middleware returns errors)
– Disk hang during a data write (DBMS hangs, middleware crashes)

Slide 22 Example results: failing log disk
– Transient, correctable write fault (system tolerates fault)
– Sticky, uncorrectable write fault (DBMS recovers, middleware degrades)
– Simulated disk power failure (DBMS aborts all txns with errors)
– Disk hang between SCSI commands (DBMS hangs, middleware hangs)