Everest: scaling down peak loads through I/O off-loading
D. Narayanan, A. Donnelly, E. Thereska, S. Elnikety, A. Rowstron
Microsoft Research, Cambridge, UK

Presentation transcript:

Everest: scaling down peak loads through I/O off-loading
D. Narayanan, A. Donnelly, E. Thereska, S. Elnikety, A. Rowstron
Microsoft Research, Cambridge, UK
(image: UKClimbing.com)

Problem: I/O peaks on servers
Short, unexpected peaks in I/O load
– This is not about predictable trends
Uncorrelated across servers in the data center
– And across volumes on a single server
Bad I/O response times during peaks

Example: Exchange server
Production mail server
– 5000 users, 7.2 TB across 8 volumes
Well provisioned
– Hardware RAID, NVRAM, over 100 spindles
24-hour block-level I/O trace
– At peak load, response time is 20x the mean
– Peaks are uncorrelated across volumes

Exchange server load (figure)

Write off-loading (diagram)
– No off-loading: the Everest client sends reads and writes to the volume
– Off-loading: the Everest client off-loads writes to an Everest store
– Reclaiming: off-loaded data is later reclaimed from the store back to the volume
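To make the off-load path concrete, here is a minimal Python sketch of the client-side decision implied by this diagram. All names (EverestClient, is_overloaded, append, load) are illustrative assumptions, not the paper's interfaces; persistence and failure handling are omitted.

```python
# Hypothetical sketch of the write off-loading path; names and the
# load test are assumptions, not the paper's actual API.
class EverestClient:
    def __init__(self, volume, stores):
        self.volume = volume        # base (possibly overloaded) volume
        self.stores = stores        # candidate Everest stores
        self.offloaded = {}         # block -> (store, version) soft state
        self.version = 0

    def write(self, block, data):
        # Off-load if the volume is loaded, or if this range is already
        # off-loaded (off-loaded ranges stay off-loaded until reclaimed).
        if block in self.offloaded or (self.volume.is_overloaded() and self.stores):
            store = min(self.stores, key=lambda s: s.load())  # least-loaded store
            self.version += 1
            store.append(block, self.version, data)           # off-load the write
            self.offloaded[block] = (store, self.version)
        else:
            self.volume.write(block, data)                    # normal path

    def read(self, block):
        # Reads must return the latest version: consult the soft state first
        if block in self.offloaded:
            store, version = self.offloaded[block]
            return store.read(block, version)
        return self.volume.read(block)
```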

Exploits workload properties
Peaks uncorrelated across volumes
– Loaded volume can find less-loaded stores
Peaks have some writes
– Off-load writes → reads see less contention
Few foreground reads on off-loaded data
– Recently written, hence in buffer cache
– Can optimize stores for writes

Challenges
Any write anywhere
– Maximize potential for load balancing
Reads must always return latest version
– Split across stores/base volume if required
State must be consistent, recoverable
– Track both current and stale versions
No meta-data writes to base volume

Design features
Recoverable soft state
Write-optimized stores
Reclaiming off-loaded data
N-way off-loading
Load-balancing policies

Recoverable soft state
Need meta-data to track off-loads
– block ID →
– Latest version as well as old (stale) versions
Meta-data cached in memory
– On both clients and stores
Off-loaded writes have meta-data header
– 64-bit version, client ID, block range
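A rough sketch of the record format these bullets describe: each off-loaded write carries a header with a 64-bit version, a client ID, and the block range, persisted together with the data as one record. The exact field widths and layout below are assumptions, not the paper's format.

```python
import struct

# Assumed header layout: 64-bit version, 64-bit client ID, 64-bit start
# block, 32-bit block count. Only the listed fields come from the slide;
# the widths of the client ID and block range are guesses.
HEADER_FMT = "<QQQI"
HEADER_SIZE = struct.calcsize(HEADER_FMT)

def pack_record(version, client_id, start_block, block_count, data):
    header = struct.pack(HEADER_FMT, version, client_id, start_block, block_count)
    return header + data          # data and meta-data written as one record

def unpack_record(record):
    version, client_id, start, count = struct.unpack_from(HEADER_FMT, record)
    return version, client_id, start, count, record[HEADER_SIZE:]
```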

Recoverable soft state (2)
Meta-data also persisted on stores
– No synchronous writes to base volume
– Stores write data+meta-data as one record
"Store set" persisted on base volume
– Small, infrequently changing
Client recovery → contact store set
Store recovery → read from disk
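The recovery path can be sketched as follows: a recovering client reads its small, persisted store set from the base volume, asks each store for the records it holds, and keeps the highest version per block range. The helper scan_records() and the tuple shape are hypothetical.

```python
# Hypothetical client recovery: rebuild the in-memory soft state from the
# stores named in the persisted store set. scan_records() is assumed to
# yield (version, client_id, block, data) tuples recovered from the
# store's on-disk log records.
def recover_client_state(store_set, client_id):
    latest = {}                                   # block -> (version, store)
    for store in store_set:
        for version, owner, block, _data in store.scan_records():
            if owner != client_id:
                continue                          # record belongs to another client
            if block not in latest or version > latest[block][0]:
                latest[block] = (version, store)
    return latest
```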

Everest stores
Short-term, write-optimized storage
– Simple circular log
– Small file or partition on an existing volume
– Not LFS: data is reclaimed, no cleaner
Monitors load on underlying volume
– Only used by clients when lightly loaded
One store can support many clients
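Below is a small sketch of the store-side behaviour these bullets describe: a store appends records to a circular log on an existing volume and only advertises itself to clients while that volume is lightly loaded. The IOPS threshold and helper names are assumptions.

```python
# Hypothetical Everest store: a write-optimized log hosted on a lightly
# loaded volume. The threshold and recent_iops() helper are assumptions.
class EverestStore:
    LIGHT_LOAD_IOPS = 50          # illustrative threshold, not from the paper

    def __init__(self, volume, log):
        self.volume = volume      # volume hosting the store's log file/partition
        self.log = log            # circular log (see the layout sketch later)

    def available(self):
        # Clients only use the store while its underlying volume is idle enough
        return self.volume.recent_iops() < self.LIGHT_LOAD_IOPS

    def append(self, block, version, data):
        self.log.append(block, version, data)     # sequential, write-optimized

    def delete(self, block, version):
        self.log.delete(block, version)           # data is reclaimed, no cleaner
```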

Reclaiming in the background (diagram)
– The Everest client issues a "read any" to the Everest store, writes the data back to the volume, then sends delete(block range, version) to the store
– Multiple concurrent reclaim "threads"
– Efficient utilization of disk/network resources
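A minimal sketch of the background reclaim loop, assuming the client-side off-load map from the earlier sketch: each worker reads an off-loaded block back, writes it to the base volume, and only then tells the store to delete that (block range, version). Worker count and helper names are assumptions; coordination with foreground I/O is omitted.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical background reclaim with several concurrent workers.
def reclaim_block(volume, store, block, version):
    data = store.read(block, version)     # "read any": fetch the off-loaded data
    volume.write(block, data)             # write it back to the base volume
    store.delete(block, version)          # only then delete it from the store

def reclaim_all(volume, offloaded, workers=4):
    # offloaded maps block -> (store, version), as in the client sketch above
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for block, (store, version) in list(offloaded.items()):
            pool.submit(reclaim_block, volume, store, block, version)
```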

Correctness invariants
I/O on off-loaded range always off-loaded
– Reads: sent to correct location
– Writes: ensure latest version is recoverable
– Foreground I/Os never blocked by reclaim
Deletion of a version only allowed if
– Newer version written to some store, or
– Data reclaimed and older versions deleted
All off-loaded data eventually reclaimed
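The deletion rule can be written as a simple predicate, a sketch of the invariant rather than the actual implementation; the argument names are hypothetical.

```python
# Hypothetical predicate for the deletion rule above: a record holding
# (block, version) may be deleted only if a newer version of that block is
# safely on some store, or the block has been reclaimed to the base volume
# and all older versions are already deleted.
def version_deletable(version, newest_on_stores, reclaimed, older_versions_remaining):
    newer_exists = newest_on_stores is not None and newest_on_stores > version
    return newer_exists or (reclaimed and older_versions_remaining == 0)
```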

Evaluation
Exchange server traces
OLTP benchmark
Scaling
Micro-benchmarks
Effect of NVRAM
Sensitivity to parameters
N-way off-loading

Exchange server workload
Replay Exchange server trace
– 5000 users, 8 volumes, 7.2 TB, 24 hours
Choose time segments with peaks
– Extend segments to cover all reclaim
Our server: 14 disks, 2 TB
– Can fit 3 Exchange volumes
Subset of volumes for each segment

Trace segment selection (figure)

Three volumes/segment (diagram; labels: trace, min, max, median, store (3%), client)

Mean response time (figure)

99th percentile response time (figure)

Exchange server summary
Substantial improvement in I/O latency
– On a real enterprise server workload
– Both reads and writes, mean and 99th percentile
What about application performance?
– I/O trace cannot show end-to-end effects
Where is the benefit coming from?
– Extra resources, log structure, ...?

OLTP benchmark (diagram: SQL Server binary with the Everest client interposed via Detours DLL redirection; Log, Data, and Store volumes; OLTP client over a LAN; 10 min warmup, 10 min measurement)

OLTP throughput (figure; annotations: extra disk + log layout; 2x disks, 3x speedup?)

Off-loading not a panacea
Works for short-term peaks
Cannot be used to improve performance 24/7
Data usually reclaimed while store still idle
– Long-term off-load → eventual contention
Data is reclaimed before store fills up
– Long-term → log cleaner issue

Conclusion
Peak I/O is a problem
Everest solves this through off-loading
By modifying the workload at block level
– Removes writes from the overloaded volume
– Off-loading is short term: data is reclaimed
Consistency, persistence are maintained
– State is always correctly recoverable

Questions?

Why not always off-load? (diagram: SQL Server 1 with the Everest client and SQL Server 2, each with Data and Store volumes; OLTP clients issuing reads and writes)

10 min off-load, 10 min contention (figure)

Mean and 99th percentile (log scale) (figure)

Read/write ratio of peaks (figure)

Exchange server response time (figure)

Exchange server load (volumes) (figure)

Effect of volume selection (figure)

Scaling with #stores (diagram: the OLTP setup as before, SQL Server binary with the Everest client via Detours DLL redirection, Log and Data volumes, OLTP client over a LAN, with multiple Everest stores)

Scaling: linear until CPU-bound (figure)

Everest store: circular log layout (diagram; labels: header block, head, tail, active log, stale records, reclaim, delete)
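A toy in-memory model of the circular log layout in this diagram: records are appended at the head, delete() marks records stale, and the tail only advances past records that are already stale (there is no LFS-style cleaner). The on-disk header block and wrap-around arithmetic are omitted; class and method names are illustrative.

```python
from collections import deque

# Toy model of the store's circular log: append at the head, mark records
# stale on delete, and free space only by advancing the tail past stale
# records. This is an illustration, not the on-disk format.
class CircularLog:
    def __init__(self):
        self.records = deque()    # tail is on the left, head on the right
        self.stale = set()        # (block, version) pairs marked for deletion

    def append(self, block, version, data):
        self.records.append((block, version, data))

    def delete(self, block, version):
        self.stale.add((block, version))
        self._advance_tail()

    def _advance_tail(self):
        # A stale record in the middle of the log cannot be freed until all
        # records behind it (closer to the tail) are stale as well.
        while self.records:
            block, version, _data = self.records[0]
            if (block, version) not in self.stale:
                break
            self.stale.discard((block, version))
            self.records.popleft()
```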

Exchange server load: CDF (figure)

Unbalanced across volumes (figure)