Write off-loading: Practical power management for enterprise storage
D. Narayanan, A. Donnelly, A. Rowstron
Microsoft Research, Cambridge, UK


Energy in data centers
Substantial portion of TCO
– Power bill, peak power ratings
– Cooling
– Carbon footprint
Storage is significant
– Seagate Cheetah 15K.4: 12 W (idle)
– Intel Xeon dual-core: 24 W (idle)

Challenge
Most of a disk’s energy goes just to keeping it spinning
– 17 W peak, 12 W idle, 2.6 W standby
Flash is still too expensive
– Cannot replace disks with flash
So: we need to spin disks down when idle

Intuition
Real workloads have
– Diurnal and weekly patterns
– Idle periods
– Write-only periods (reads absorbed by main-memory caches)
We should exploit these
– Convert write-only periods to idle periods
– Spin down when idle

Small/medium enterprise data center
10s to 100s of disks
– Not MSN search
Heterogeneous servers
– File system, DBMS, etc.
RAID volumes
High-end disks
(Figure: example servers FS1, FS2, and DBMS, each hosting RAID volumes Vol 0, Vol 1, …)

Design principles
Incremental deployment
– Don’t rearchitect the storage: keep existing servers, volumes, etc.
– Work with current, disk-based storage
  Flash will stay more expensive per GB for at least 5-10 years; if the system has some flash, then use it
Assume a fast network
– 1 Gbps+

Write off-loading
Spin down idle volumes
Off-load writes while spun down
– To idle / lightly loaded volumes
– Reclaim data lazily on spin-up
– Maintain consistency and failure resilience
Spin up on read miss
– Large penalty, but should be rare

Roadmap
– Motivation
– Traces
– Write off-loading
– Evaluation

How much idle time is there?
Is there enough to justify spinning down?
– Previous work claims not
  Based on TPC benchmarks and the cello traces
– What about real enterprise workloads?
  Traced servers in our DC for one week

MSRC data center traces
Traced 13 core servers for 1 week
– File servers, DBMS, web server, web cache, …
– 36 volumes, 179 disks
Per-volume, per-request tracing
– Block-level, below the buffer cache
Typical of a small/medium enterprise DC
– Serves one building, ~100 users
– Captures daily/weekly usage patterns

Idle and write-only periods (results figure)
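
The deck presents this analysis as a chart. As a rough illustration of how such periods could be derived, here is a minimal Python sketch that classifies fixed-length intervals of a block-level trace as idle, write-only, or read-write. The CSV layout assumed here (timestamp in seconds, request type, offset, size, one request per line, no header row) is hypothetical and not necessarily the format of the MSRC traces.

```python
import csv
from collections import Counter

def classify_intervals(trace_path, interval_s=60.0):
    """Classify fixed-length intervals of a block-level trace as
    'idle', 'write-only', or 'read-write'.
    Assumes a hypothetical CSV format: timestamp_s, type ('Read'/'Write'),
    offset_bytes, size_bytes -- one request per line, no header row."""
    intervals = {}  # interval index -> set of request types seen in it
    with open(trace_path, newline='') as f:
        for row in csv.reader(f):
            ts, req_type = float(row[0]), row[1]
            intervals.setdefault(int(ts // interval_s), set()).add(req_type)

    counts = Counter()
    last = max(intervals) if intervals else -1
    for idx in range(last + 1):
        types = intervals.get(idx, set())
        if not types:
            counts['idle'] += 1
        elif types == {'Write'}:
            counts['write-only'] += 1
        else:
            counts['read-write'] += 1
    return counts

# Example: fraction of time that plain spin-down (idle) and write
# off-loading (idle + write-only) could in principle exploit.
# counts = classify_intervals('volume0.csv')
# total = sum(counts.values())
# print(counts['idle'] / total, (counts['idle'] + counts['write-only']) / total)
```

The write-only fraction is exactly what write off-loading aims to convert into additional spin-down time.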

Roadmap
– Motivation
– Traces
– Write off-loading
– Preliminary results

Write off-loading: managers
One manager per volume
– Intercepts all block-level requests
– Spins the volume up/down
Off-loads writes when spun down
– Probes its logger view to find the least-loaded logger
Spins up on read miss
– Reclaims off-loaded data lazily
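
A minimal sketch of the per-volume manager logic this slide describes, assuming single-block requests for brevity. The class and method names (Manager, handle_read, probe_load, append) are illustrative, not the paper's implementation, and the background reclaim path is omitted.

```python
import time

class Manager:
    """Per-volume manager: intercepts block requests, spins the volume
    up/down, and off-loads writes to a logger while spun down.
    Illustrative sketch; real requests cover LBN ranges, not single blocks."""

    def __init__(self, volume, loggers, idle_timeout_s=60.0):
        self.volume = volume               # the underlying spinning volume
        self.loggers = loggers             # logger view (loggers on other volumes)
        self.idle_timeout_s = idle_timeout_s
        self.spun_up = True
        self.last_io = time.time()
        self.offload_map = {}              # lbn -> (logger, version) of latest off-loaded copy

    def handle_read(self, lbn):
        if lbn in self.offload_map:                     # latest version lives on a logger
            logger, version = self.offload_map[lbn]
            return logger.read(self.volume.id, lbn, version)
        if not self.spun_up:                            # read miss: pay the spin-up penalty
            self.volume.spin_up()
            self.spun_up = True
        self.last_io = time.time()
        return self.volume.read(lbn)

    def handle_write(self, lbn, block):
        if self.spun_up:
            self.last_io = time.time()
            self.volume.write(lbn, block)
        else:                                           # off-load instead of spinning up
            logger = min(self.loggers, key=lambda l: l.probe_load())
            version = logger.append(self.volume.id, lbn, block)  # durable before it returns
            self.offload_map[lbn] = (logger, version)

    def maybe_spin_down(self):
        """Called periodically; spins down after an idle timeout.
        Off-loaded data is reclaimed lazily in the background (not shown)."""
        if self.spun_up and time.time() - self.last_io > self.idle_timeout_s:
            self.volume.spin_down()
            self.spun_up = False
```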

Write off-loading: loggers
Reliable, write-optimized, short-term store
– Circular log structure
Uses a small amount of storage
– Unused space at the end of a volume, or a flash device
Stores data off-loaded by managers
– Includes version, manager ID, LBN range
– Until reclaimed by the manager
Not meant for long-term storage
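
The slide names the metadata a logger stores with each off-loaded write (version, manager ID, LBN range). One possible record encoding is sketched below; the exact on-log layout, field sizes, and the checksum are assumptions, added so that a logger could recover by scanning its log as described later in the deck.

```python
import struct, zlib

# Hypothetical on-log record layout for off-loaded data, following the fields
# named on the slide (version, manager ID, LBN range) plus a checksum.
HEADER_FMT = '<Q16sQQI'   # version, manager_id (16 bytes), start_lbn, lbn_count, crc32
HEADER_SIZE = struct.calcsize(HEADER_FMT)

def encode_record(version, manager_id, start_lbn, lbn_count, data):
    """Serialize one off-loaded write as header + payload.
    manager_id must be at most 16 bytes."""
    crc = zlib.crc32(data)
    header = struct.pack(HEADER_FMT, version, manager_id, start_lbn, lbn_count, crc)
    return header + data

def decode_record(buf):
    """Parse one record during a log scan; returns None if the checksum
    fails (for example, a torn write at the head of the log)."""
    version, manager_id, start_lbn, lbn_count, crc = struct.unpack(
        HEADER_FMT, buf[:HEADER_SIZE])
    data = buf[HEADER_SIZE:]
    if zlib.crc32(data) != crc:
        return None
    return version, manager_id, start_lbn, lbn_count, data
```

A real logger would also batch records and align them to block boundaries; that is omitted here.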

Off-load life cycle (figure): spin-down, probe, write (versions v1, v2), read, invalidate, reclaim, spin-up.

Consistency and durability
Read/write consistency
– Manager keeps an in-memory map of off-loads
– It always knows where the latest version is
Durability
– Writes are acked only after the data hits a disk
Same guarantees as existing volumes
– Transparent to applications
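
Expanding on the in-memory map mentioned above, the sketch below shows one way the manager could track which copy of each block is newest and resolve reads accordingly. The versioning scheme and method names are assumptions, not the paper's data structures.

```python
class OffloadMap:
    """In-memory map kept by the manager: for each LBN, where the latest
    version of the block lives. A monotonically increasing version number
    decides which copy wins. Illustrative sketch only."""

    def __init__(self):
        self.next_version = 1
        self.latest = {}   # lbn -> (location, version); location is 'volume' or a logger

    def record_offload(self, lbn, logger):
        version = self.next_version
        self.next_version += 1
        self.latest[lbn] = (logger, version)
        return version

    def record_volume_write(self, lbn):
        """A write applied directly to the volume supersedes any off-loaded copy."""
        version = self.next_version
        self.next_version += 1
        self.latest[lbn] = ('volume', version)
        return version

    def resolve_read(self, lbn):
        """Reads always go to wherever the latest version is."""
        return self.latest.get(lbn, ('volume', 0))

    def reclaimed(self, lbn, version):
        """After reclaiming version v back to the volume, point the map at the
        volume only if no newer off-load happened in the meantime."""
        _, latest_version = self.latest.get(lbn, ('volume', 0))
        if latest_version == version:
            self.latest[lbn] = ('volume', version)
```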

Recovery: transient failures
Loggers can recover locally
– Scan the log
Managers recover from the logger view
– The logger view is persisted locally
– Recovery: fetch metadata from all loggers in the view
On clean shutdown, persist metadata locally
– The manager then recovers without network communication
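
A compact sketch of the two recovery paths just described: after a clean shutdown the manager reloads its locally persisted metadata, while after a crash it queries every logger in the persisted logger view and keeps the highest version per block. The metadata_for call is a hypothetical logger API.

```python
def recover_manager_map(volume_id, logger_view, clean_shutdown_state=None):
    """Rebuild a manager's in-memory off-load map after a failure.
    Sketch of the recovery path on the slide; the function and the
    logger metadata call are assumptions, not the paper's API."""
    if clean_shutdown_state is not None:
        # Clean shutdown: metadata was persisted locally, no network needed.
        return dict(clean_shutdown_state)

    latest = {}   # lbn -> (logger, version)
    # Crash recovery: ask every logger in the (locally persisted) logger view
    # which blocks it holds for this volume, and keep the highest version.
    for logger in logger_view:
        for lbn, version in logger.metadata_for(volume_id):
            if lbn not in latest or version > latest[lbn][1]:
                latest[lbn] = (logger, version)
    return latest
```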

Recovery: disk failures
Data on the original volume: same as before
– Typically RAID-1 / RAID-5
– Can recover from one failure
What about off-loaded data?
– Ensure logger redundancy >= that of the manager's volume
– k-way logging for additional redundancy
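
One way k-way logging could look: each off-loaded write goes to k distinct loggers and is acknowledged only after all k copies are durable. Choosing the k least-loaded loggers is an assumption for illustration, not the paper's stated policy.

```python
def offload_write_k_way(loggers, k, volume_id, lbn, data, version):
    """Off-load one write to k distinct loggers so the off-loaded copy is at
    least as redundant as the home volume. Sketch only; the selection policy
    and call names are assumptions. The manager assigns the version so that
    all k copies agree."""
    targets = sorted(loggers, key=lambda l: l.probe_load())[:k]
    if len(targets) < k:
        raise RuntimeError("not enough loggers for %d-way logging" % k)
    for logger in targets:
        logger.append(volume_id, lbn, data, version)  # each append is durable before returning
    return targets  # ack the client's write only after all k appends have completed
```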

Roadmap
– Motivation
– Traces
– Write off-loading
– Experimental results

Testbed
4 rack-mounted servers
– 1 Gbps network
– Seagate Cheetah 15K RPM disks
Single process per testbed server
– Trace replay app + managers + loggers
– In-process communication on each server
– UDP+TCP between servers

Workload
Open-loop trace replay (see the sketch below)
Traced volumes are larger than the testbed
– Divided the traced servers into 3 “racks”
– Combined results in post-processing
1 week is too long for real-time replay
– Chose the best and worst days for off-load: the days with the most and least write-only time
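
For reference, open-loop replay issues each request at its traced timestamp whether or not earlier requests have completed, so the storage configuration under test cannot throttle the offered load. A minimal sketch, assuming a pre-sorted in-memory trace; this is not the replay tool used in the paper.

```python
import threading, time

def replay_open_loop(trace, issue_io):
    """Open-loop trace replay: issue each request at its traced time,
    regardless of whether earlier requests have completed.
    `trace` is an iterable of (timestamp_s, request) sorted by time;
    `issue_io` performs one request. Illustrative sketch only."""
    start = time.monotonic()
    t0 = None
    for ts, request in trace:
        if t0 is None:
            t0 = ts
        delay = (ts - t0) - (time.monotonic() - start)
        if delay > 0:
            time.sleep(delay)
        # Fire-and-forget: completions and latencies are recorded by the workers.
        threading.Thread(target=issue_io, args=(request,), daemon=True).start()
```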

Configurations
Baseline
Vanilla spin-down (no off-load)
Machine-level off-load
– Off-load to any logger within the same machine
Rack-level off-load
– Off-load to any logger in the rack

Storage configuration
1 manager + 1 logger per volume
– For the off-load configurations
– Logger uses a 4 GB partition at the end of the volume
Spin up/down emulated in software
– Our RAID hardware does not support spin-down
– Parameters from Seagate docs (used in the sketch below)
  12 W spun up, 2.6 W spun down
  Spin-up delay is 10-15 s; the energy penalty is 20 J compared to keeping the spindle spinning always
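
Using the power numbers above, here is a short sketch of the energy accounting for a fixed-timeout spin-down policy. The 60 s idle timeout and the assumption that all busy time is spent at the spun-up power are illustrative; the deck does not state the emulator's exact model.

```python
P_UP = 12.0       # W, spun up (idle), from the slide
P_DOWN = 2.6      # W, spun down (standby), from the slide
E_SPINUP = 20.0   # J, extra energy per spin-up, from the slide

def energy_joules(idle_gaps_s, busy_time_s, idle_timeout_s=60.0):
    """Energy of one volume under a fixed-timeout spin-down policy, using the
    power numbers on the slide. The timeout value is an assumption."""
    energy = busy_time_s * P_UP
    for gap in idle_gaps_s:
        if gap <= idle_timeout_s:
            energy += gap * P_UP                       # never spun down
        else:
            energy += idle_timeout_s * P_UP            # waiting for the timeout
            energy += (gap - idle_timeout_s) * P_DOWN  # spun down
            energy += E_SPINUP                         # penalty on the next access
    return energy

# Break-even: spinning down pays off once the spun-down portion of a gap
# exceeds E_SPINUP / (P_UP - P_DOWN) = 20 / 9.4, roughly 2.1 seconds.
```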

Energy savings (results figure)

Energy by volume, worst day (results figure)

Response time: 95th percentile (results figure)

Response time: mean (results figure)

Conclusion
Need to save energy in DC storage
Enterprise workloads have idle periods
– Analysis of a 1-week, 36-volume trace
Spinning disks down is worthwhile
– Large but rare delay on spin-up
Write off-loading: write-only → idle
– Increases the energy savings of spin-down

Questions?

Related Work
PDC
↓ Periodic reconfiguration/data movement
↓ Big change to current architectures
Hibernator
↑ Saves energy without spinning down
↓ Requires multi-speed disks
MAID
– Needs massive scale

Just buy fewer disks?
Fewer spindles → less energy, but
– Need spindles for peak performance
  A mostly-idle workload can still have high peaks
– Need disks for capacity
  High-performance disks have lower capacities; managers add disks incrementally to grow capacity
– Need performance isolation
  Cannot simply consolidate all workloads

Circular on-disk log (figure): HEAD and TAIL pointers, write, reclaim, spin-up.

Circular on-disk log (figure): header block, null blocks, active log, stale versions; head and tail pointers; nuller, reclaim, and invalidate operations.
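
To make the head/tail mechanics concrete, here is a small sketch of a circular log that appends at the head, marks records stale when invalidated, and reclaims contiguous stale records from the tail; the nuller's on-disk null blocks are only noted in a comment. The structure and names are assumptions, not the paper's implementation.

```python
class CircularLog:
    """Sketch of a circular log: append at the head, reclaim from the tail.
    Wrap-around of an individual record across the end of the log and the
    actual on-disk writes are ignored for brevity."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.head = 0            # next block position to write
        self.tail = 0            # start of the oldest live record
        self.used = 0
        self.records = []        # live records in append order

    def append(self, length):
        """Append a record of `length` blocks at the head; returns its position."""
        if self.used + length > self.capacity:
            raise RuntimeError("log full: reclaim before appending")
        pos = self.head
        self.head = (self.head + length) % self.capacity
        self.used += length
        self.records.append({"pos": pos, "len": length, "stale": False})
        return pos

    def invalidate(self, pos):
        """Mark a record stale (its manager reclaimed or superseded it)."""
        for rec in self.records:
            if rec["pos"] == pos:
                rec["stale"] = True

    def reclaim_tail(self):
        """Advance the tail past stale records; a nuller would then overwrite
        the freed range with null blocks on disk (not modeled here)."""
        while self.records and self.records[0]["stale"]:
            rec = self.records.pop(0)
            self.tail = (rec["pos"] + rec["len"]) % self.capacity
            self.used -= rec["len"]
```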

Client state (figure)

Server state (figure)

Mean I/O rate (figure)

Peak I/O rate (figure)

Drive characteristics (figure): typical ST drive +12 V LVD current profile

Drive characteristics (figure)