Pelican: A Building Block for Exascale Cold Data Storage

Presentation transcript:

Pelican: A Building Block for Exascale Cold Data Storage Shobana Balakrishnan, Richard Black, Austin Donnelly, Paul England, Adam Glass, Dave Harper, Sergey Legtchenko, Aaron Ogus, Eric Peterson, and Antony Rowstron (Microsoft Research)

Data in the cloud

The Problem A significant portion of the data in the cloud is rarely accessed; such data is known as cold data. Pelican addresses the need for cost-effective storage of cold data and serves as a basic building block for exabyte-scale cold data storage.

Price versus Latency

Right Provisioning Provision resources just for the cold data workload:
• Disks: archival and SMR drives instead of commodity drives
• Servers: enough for data management, instead of 1 server per 40 disks
• Bandwidth: enough for the required workload
• Power and cooling: not enough to keep all disks spinning

Advantages Benefits of removing unnecessary resources:
• High storage density
• Low hardware cost
• Low operating cost (capped performance)

Pelican

The Pelican Rack The mechanical design, hardware, and storage software stack are co-designed, and right-provisioned for the cold data workload:
• 52U rack with 1152 archival-class 3.5” SATA disks, averaging 4.55 TB each, for a total of over 5 PB of storage (1152 × 4.55 TB ≈ 5.2 PB)
• 2 servers and no top-of-rack switch
• Only 8% of the disks can be spinning concurrently
• Designed to store blobs which are infrequently accessed
Advantages: total cost of ownership comparable to tape, with lower latency than tape. The challenges of the resource limitations are managed in software.

Resource Domains Resource domains are the categories of resources being managed; in Pelican these are power, cooling, vibration, and bandwidth. Each domain is provisioned to supply its resource only to a subset of the disks, and each disk draws resources from a set of resource domains.
• Domain-conflicting: disks that are in the same resource domain.
• Domain-disjoint: disks that share no common resource domains.
Domain-disjoint disks can move between disk states independently, whereas domain-conflicting disks cannot, since doing so could over-commit a shared resource.
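To make the conflict relation concrete, here is a minimal Python sketch of the domain-conflict test. The disk-to-domain mapping (`domains_of`) is a hypothetical placeholder; in the real rack it comes from the physical topology of trays, columns, and backplanes.

```python
# Minimal sketch of the domain-conflict test described above.
# The disk-to-domain assignment is a hypothetical placeholder, not the
# real Pelican topology.

def domains_of(disk_id):
    """Return the set of resource domains a disk draws from (assumed mapping)."""
    return {
        ("power", disk_id // 16),      # assumed: 16 disks per power domain
        ("cooling", disk_id // 12),    # assumed: 12 disks per cooling domain
        ("vibration", disk_id // 2),   # assumed: 2 disks per vibration domain
    }

def domain_conflicting(disk_a, disk_b):
    """Two disks conflict if they share at least one resource domain."""
    return bool(domains_of(disk_a) & domains_of(disk_b))

def domain_disjoint(disk_a, disk_b):
    """Domain-disjoint disks can change spin state independently."""
    return not domain_conflicting(disk_a, disk_b)
```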

Handling Right-Provisioning Constraints over sets of active disks:
• Hard constraints: power, cooling, failure domains
• Soft constraints: bandwidth, vibration
Constraints in Pelican:
• At most 2 active disks out of 16 per power domain
• At most 1 active disk out of 12 per cooling domain
• At most 1 active disk out of 2 per vibration domain
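Building on the hypothetical `domains_of` mapping sketched above, the following is a rough illustration of how a candidate set of active disks could be checked against the per-domain limits listed on the slide. It is a sketch under those assumptions, not Pelican's actual implementation.

```python
from collections import Counter

# Per-domain limits on concurrently active disks, taken from the slide:
# 2 of 16 per power domain, 1 of 12 per cooling domain, 1 of 2 per vibration domain.
ACTIVE_LIMITS = {"power": 2, "cooling": 1, "vibration": 1}

def within_constraints(active_disks):
    """Check that a candidate set of active disks respects every domain limit.

    Reuses the hypothetical domains_of() mapping from the earlier sketch.
    """
    usage = Counter()
    for disk in active_disks:
        for kind, index in domains_of(disk):
            usage[(kind, index)] += 1
            if usage[(kind, index)] > ACTIVE_LIMITS[kind]:
                return False
    return True
```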

Schematic representation The rack contains 6 chassis of 8U each; each chassis has 12 trays, and each tray holds 16 disks (6 × 12 × 16 = 1152 disks).

Data Layout Objective: maximize the number of requests that can be serviced concurrently while operating within the constraints. Each blob is stored over a set of disks. It is split into a sequence of 128 kB fragments, and for every k fragments an additional r fragments are generated; the k + r fragments form a stripe. The additional fragments contain redundancy information computed with a Cauchy Reed-Solomon erasure code; the code is systematic, so the input can be regenerated by simply concatenating the data fragments. Pelican statically partitions the disks into groups such that disks within a group can be concurrently active, which concentrates all conflicts over a few sets of disks.
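As a rough illustration of the stripe layout, the sketch below splits a blob into 128 kB fragments and groups them into stripes of k data fragments plus r redundancy fragments (the defaults k = 15 and r = 3 follow the next slide). The erasure-coding step is a placeholder: Pelican uses a Cauchy Reed-Solomon code, which is not implemented here.

```python
FRAGMENT_SIZE = 128 * 1024  # 128 kB fragments, as described above

def make_stripes(blob: bytes, k: int = 15, r: int = 3):
    """Split a blob into stripes of k data fragments plus r redundancy fragments.

    The redundancy fragments stand in for the Cauchy Reed-Solomon code used by
    Pelican; here they are produced by a placeholder encoder.
    """
    fragments = [blob[i:i + FRAGMENT_SIZE]
                 for i in range(0, len(blob), FRAGMENT_SIZE)]
    stripes = []
    for i in range(0, len(fragments), k):
        data = fragments[i:i + k]
        # Pad the last stripe so the coder always sees k fragments.
        data += [b"\x00" * FRAGMENT_SIZE] * (k - len(data))
        parity = erasure_encode(data, r)  # placeholder erasure-coder call
        stripes.append(data + parity)     # k + r fragments per stripe
    return stripes

def erasure_encode(data_fragments, r):
    """Placeholder for a systematic erasure coder (e.g. Cauchy Reed-Solomon)."""
    # A real implementation would compute r parity fragments from the k data
    # fragments; returning zero-filled fragments keeps the sketch runnable.
    return [b"\x00" * FRAGMENT_SIZE] * r
```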

Data Placement

Group Property: groups are either fully conflicting or fully independent. In Pelican there are 48 groups of 24 disks, organized as 4 classes of 12 fully-conflicting groups each. A blob is stored in one group, spread over 18 disks (15 data + 3 redundancy). Off-rack metadata called the “catalog” holds information about blob placement.
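A hypothetical placement sketch based on the numbers above: pick one of the 48 groups and spread a stripe's 18 fragments over 18 of its 24 disks. The group-selection policy (most free capacity) and the random disk choice are assumptions for illustration, not the paper's policy; the resulting mapping is the kind of information the off-rack catalog would record.

```python
import random

NUM_GROUPS = 48
DISKS_PER_GROUP = 24
STRIPE_WIDTH = 18  # 15 data + 3 redundancy fragments per stripe

def place_blob(blob_id, free_capacity):
    """Choose a group and 18 disks within it for a blob's fragments.

    free_capacity maps group id -> free bytes. Picking the group with the most
    free capacity is an assumed policy; the placement would be recorded in the
    off-rack catalog.
    """
    group = max(range(NUM_GROUPS), key=lambda g: free_capacity[g])
    disks = random.sample(range(DISKS_PER_GROUP), STRIPE_WIDTH)
    return {"blob": blob_id, "group": group, "disks": sorted(disks)}
```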

Advantages Groups encapsulate the constraints and also reduce the time to recover from a failed disk: since a blob lies within a single group, rebuild operations use disks that are spinning concurrently. Groups also simplify the IO scheduler's task. The probability of conflict is O(n), an improvement over the O(n²) of random placement.

IO Scheduler Traditional disk schedulers reorder IOs to minimize seek latency; Pelican reorders requests to minimize the impact of spin-up latency. There are four independent schedulers, one per class; each services the requests for its class, and reordering happens at the class level. Each scheduler uses two queues, one for rebuild operations and one for other operations. Reordering amortizes the group spin-up latency over a set of operations, and rate limiting manages the interference between rebuild and other operations.
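The sketch below illustrates the batching idea for one per-class scheduler: requests are queued (client and rebuild separately), and the scheduler drains a whole batch for one group at a time so the group's spin-up cost is amortized over many operations. The fairness bounds on reordering and the rebuild rate limiting are omitted; this is an illustrative sketch, not Pelican's actual scheduler.

```python
from collections import defaultdict, deque

class ClassScheduler:
    """Sketch of one per-class scheduler: batch pending requests by group so a
    single group spin-up is amortized over many operations. Fairness bounds on
    reordering and rebuild rate limiting are intentionally omitted."""

    def __init__(self):
        self.client_q = deque()   # ordinary blob requests
        self.rebuild_q = deque()  # rebuild operations

    def submit(self, request, rebuild=False):
        """Queue a request; each request is a dict with at least a 'group' key."""
        (self.rebuild_q if rebuild else self.client_q).append(request)

    def next_batch(self):
        """Pick the group with the most pending work and drain its requests."""
        pending = list(self.client_q) + list(self.rebuild_q)
        if not pending:
            return None, []
        by_group = defaultdict(list)
        for req in pending:
            by_group[req["group"]].append(req)
        group = max(by_group, key=lambda g: len(by_group[g]))
        batch = by_group[group]
        # Remove the batched requests from both queues.
        self.client_q = deque(r for r in self.client_q if r["group"] != group)
        self.rebuild_q = deque(r for r in self.rebuild_q if r["group"] != group)
        return group, batch
```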

Implementation Considerations

Power Up in Standby is used to ensure disks do not spin up without the Pelican storage stack managing the constraints. The group abstraction is exploited to parallelize the boot process: initialization is done at the group level, and if there are no user requests then 4 groups are initialized concurrently, so the entire rack takes around 3 minutes to initialize. Unexpected disk spin-ups are prevented by adding a special No Access flag to the Windows Server driver stack. The BIOS on the server is modified to ensure it can handle all 72 HBAs.

Evaluation Pelican is compared against a system organized like Pelican but fully provisioned (FP) for power and cooling. FP uses the same physical internal topology, but its disks are never spun down.

Simulator Cross-Validation • Burst workload with varying burst intensity. The simulator accurately predicts the real system's behavior for all metrics.

Performance – Rack Throughput

Service Time

Time to first byte

Power Consumption

Scheduler Performance: rebuild time and impact on client requests

Cost of Fairness: impact on throughput and latency; impact on reject rate

Disk Lifetime: disk statistics as a function of the workload

Pros:
• Reduced cost and power consumption
• Erasure codes for fault tolerance
• The hardware abstraction (groups) simplifies the IO scheduler's work
Cons:
• Tight constraints, so less flexibility to changes
• Sensitive to hardware changes
• No justification for some of the configuration decisions made; it is not clear whether the design is optimal

Discussion The authors claim the values they chose are a sweet spot for the hardware. How is performance affected by changes to these values? More importantly, how do we handle systems with different configurations and hardware? Could the synthesis of the data layout be automated? Thoughts?