
Understanding System Characteristics of Online Erasure Coding on Scalable, Distributed and Large-Scale SSD Array Systems Sungjoon Koh, Jie Zhang, Miryeong Kwon, Jungyeon Yoon, David Donofrio, Nam Sung Kim and Myoungsoo Jung

Take-away
Motivation: Distributed systems are starting to adopt erasure coding as a fault tolerance mechanism instead of replication because of replication's high storage overhead.
Goal: Understand the system characteristics of online erasure coding by analyzing them and comparing them with those of replication.
Observations on Online Erasure Coding: up to 13× I/O performance degradation compared to replication; around 50% CPU usage and a large number of context switches; up to 700× more I/O than the total request volume; up to 500× more network traffic among the storage nodes than the total request amount.
Summary of Our Work: We quantitatively measure the various overheads imposed by online erasure coding on a distributed system consisting of 52 SSDs, and we collect block-level traces from the all-flash-array-based storage cluster, which can be downloaded freely.

Overall Results

Introduction
Demand for scalable, high-performance distributed storage systems.

Employing SSDs in HPC & Datacenter Systems
Compared to HDDs, SSDs offer higher bandwidth, shorter latency, and lower power consumption, which drives their adoption in HPC and datacenter (DC) systems.

Storage System Failures
Storage systems typically experience regular failures:
1) Storage failures. Facebook reports that up to 3% of its HDDs fail each day (M. Sathiamoorthy et al., "XORing Elephants: Novel Erasure Codes for Big Data," PVLDB, 2013). Although SSDs are more reliable than HDDs, daily failures cannot be ignored.
2) Network switch errors, power outages, and soft/hard errors.
"So we need a fault tolerance mechanism."

Fault Tolerance Mechanisms in Distributed Systems: Replication
Replication is the traditional fault tolerance mechanism and a simple, effective way to make a system resilient, but it has high storage overhead (3×). For SSDs in particular, replication is expensive because of their high cost per GB, and it causes performance degradation due to SSD-specific characteristics such as garbage collection and wear-out. We therefore need an alternative method that reduces the storage overhead.

Fault Tolerance Mechanisms in Distributed Systems: Erasure Coding
Erasure coding is the alternative to replication, with lower storage overhead. Its well-known problem is the high reconstruction cost; for example, a Facebook cluster using erasure coding adds more than 100TB of network traffic in a day, and much research tries to reduce this reconstruction cost. In this work, however, we observed significant overheads imposed during I/O services in a distributed system employing erasure codes.

Background
Reed-Solomon: the erasure coding algorithm we analyze.
Ceph: the distributed system used in this research (architecture, data path, and storage stack).

Reed-Solomon
The most famous erasure coding algorithm. It divides data into k equal data chunks (D0, D1, ..., Dk-1), which form a stripe, and generates m coding chunks (C0, C1, ..., Cm-1). Encoding is a multiplication of a generator matrix with the data chunks as a vector. We write Reed-Solomon with k data chunks and m coding chunks as RS(k,m); it can recover from up to m failures. For example, RS(4,3) stores four data chunks (D0-D3) and three coding chunks (C0-C2) and tolerates three failures.
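To make the generator-matrix view concrete, here is a small, hedged Python sketch of Reed-Solomon-style encoding over GF(2^8). It is a non-systematic, evaluation-style variant kept short for illustration; it is not the encoder used in the measured Ceph setup (Ceph relies on erasure-code plugins), and all function names here are ours.

```python
# Reed-Solomon-style encoding sketch over GF(2^8) with a Vandermonde generator
# matrix. Illustration only: non-systematic, unoptimized, and unrelated to the
# encoder plugin Ceph actually uses.

def gf_mul(a, b, poly=0x11D):
    """Multiply two GF(2^8) elements (primitive polynomial x^8+x^4+x^3+x^2+1)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result

def gf_pow(x, e):
    result = 1
    for _ in range(e):
        result = gf_mul(result, x)
    return result

def rs_encode(data_chunks, m):
    """Encode k equal-length data chunks into k+m codeword chunks.

    Generator matrix V[i][j] = i^j in GF(2^8): chunk i holds the evaluation of
    the data polynomial at point i, so any k of the k+m chunks recover the data
    (the MDS property behind "RS(k,m) tolerates m failures").
    """
    k = len(data_chunks)
    size = len(data_chunks[0])
    rows = [[gf_pow(i, j) for j in range(k)] for i in range(k + m)]
    codeword = []
    for row in rows:
        chunk = bytearray(size)
        for coeff, data in zip(row, data_chunks):
            for b in range(size):
                chunk[b] ^= gf_mul(coeff, data[b])
        codeword.append(bytes(chunk))
    return codeword

# RS(4,3)-style example: 4 data chunks in, 7 chunks out, any 4 of them suffice.
chunks = rs_encode([b"AAAA", b"BBBB", b"CCCC", b"DDDD"], m=3)
print(len(chunks))  # 7
```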

Ceph Architecture
Client nodes are connected to the storage nodes through the public network, while the storage nodes communicate with each other through the private network. Each storage node runs several object storage device daemons (OSDs) and monitors: the OSDs handle read/write services, and the monitors manage access permissions and the status of the OSDs.

Data Path
A file or block is handled as an object. The object is assigned to a placement group (PG), which consists of several OSDs, according to the result of a hash function. The CRUSH algorithm determines the primary OSD within the PG, and the object is sent to that primary OSD. The primary OSD then sends the object to the other OSDs (secondary, tertiary, ...) in the form of replicas or chunks, depending on the fault tolerance mechanism.
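The sketch below illustrates the flavor of this deterministic, hash-based placement. It is a simplified stand-in, not Ceph's actual CRUSH implementation, and every name and parameter in it (PG count, OSD count, chunk count, object name) is a made-up example.

```python
# Simplified stand-in for Ceph-style placement (not the real CRUSH algorithm):
# hash the object name onto a placement group, then deterministically pick an
# ordered set of OSDs for that PG, so every client computes the same mapping.
import hashlib
import random

def place_object(obj_name, pg_count=128, osd_count=52, replicas_or_chunks=9):
    """Map an object to a PG and an ordered OSD list (index 0 = primary OSD)."""
    digest = hashlib.sha1(obj_name.encode()).digest()
    pg_id = int.from_bytes(digest[:4], "big") % pg_count
    rng = random.Random(pg_id)          # deterministic selection seeded by the PG id
    osds = rng.sample(range(osd_count), replicas_or_chunks)
    return pg_id, osds

pg, osds = place_object("volume1.obj.0000000042")
print(f"PG {pg}: primary OSD {osds[0]}, remaining OSDs {osds[1:]}")
```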

Storage Stack “Implemented in user space.”

Analysis Overview
1) Overall performance: throughput & latency.
2) CPU utilization & number of context switches.
3) Actual amount of reads & writes served from the disks.
4) Private network traffic.

Object Management in Erasure Coding
We observe that erasure coding manages objects differently from replication in order to maintain data and coding chunks. It has two phases: object initialization and object update.
i) Object initialization: a k KB write to a new object causes the rest of the (4MB) object to be filled with dummy data, and the object is then generated together with its coding chunks.

ii) Object update: to update data in an existing object, the OSDs i) read the whole stripe (data chunks 0..5 in the RS(6,3) example), ii) regenerate the coding chunks (coding chunks 0..2), and iii) write the updated data and coding chunks back to storage.
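A minimal sketch of that read-modify-write cycle follows. An in-memory dict stands in for the on-disk chunks, and a toy XOR parity stands in for the real Reed-Solomon encoder; all helper names are hypothetical. The point is the extra reads and writes, not the coding math.

```python
# Sketch of the stripe read-modify-write that an erasure-coded object update implies.

def toy_encode(data_chunks, m):
    """Placeholder encoder: returns m copies of the XOR parity of the data chunks."""
    parity = bytearray(len(data_chunks[0]))
    for chunk in data_chunks:
        for i, byte in enumerate(chunk):
            parity[i] ^= byte
    return [bytes(parity)] * m

def update_data_chunk(stripe, k, m, chunk_idx, new_data):
    """Update one data chunk: read the whole stripe, re-encode, write everything back."""
    data = [stripe[f"D{i}"] for i in range(k)]   # i)   read the whole stripe
    data[chunk_idx] = new_data
    coding = toy_encode(data, m)                 # ii)  regenerate the coding chunks
    stripe[f"D{chunk_idx}"] = new_data           # iii) write the updated data chunk ...
    for j, chunk in enumerate(coding):
        stripe[f"C{j}"] = chunk                  #      ... and every coding chunk
    return stripe

# RS(6,3)-shaped example: updating 1 chunk touches 6 chunk reads and 4 chunk writes.
stripe = {f"D{i}": bytes(4) for i in range(6)}
stripe.update({f"C{j}": bytes(4) for j in range(3)})
update_data_chunk(stripe, k=6, m=3, chunk_idx=2, new_data=b"\x01\x02\x03\x04")
```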

Workload Description
Configurations: 3-replication, RS(6,3), RS(10,4)
Micro-benchmark: Flexible I/O (FIO)
Request size (KB): 1, 2, 4, 8, 16, 32, 64, 128
Access type: sequential (no pre-write) and random (with pre-write)
Operation type: write and read
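As an illustration only, a sweep like this could be driven through fio's command line; the slides do not list the exact fio options or target device the authors used, so every flag, path, and runtime below is an assumption, and the pre-write step for random workloads is omitted for brevity.

```python
# Hypothetical driver for the request-size / access-pattern sweep using fio's CLI.
import itertools
import subprocess

DEVICE = "/dev/rbd0"                      # hypothetical RADOS block device exposed by Ceph
REQUEST_SIZES_KB = [1, 2, 4, 8, 16, 32, 64, 128]
PATTERNS = ["write", "read", "randwrite", "randread"]   # sequential/random x write/read

for bs_kb, rw in itertools.product(REQUEST_SIZES_KB, PATTERNS):
    cmd = [
        "fio",
        "--name", f"{rw}-{bs_kb}k",
        "--filename", DEVICE,
        "--rw", rw,
        "--bs", f"{bs_kb}k",
        "--ioengine", "libaio",
        "--direct", "1",
        "--runtime", "60", "--time_based",
    ]
    subprocess.run(cmd, check=True)       # assumes root access to the block device
```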

Analysis (1): Overall Performance (Throughput & Latency)

Performance Comparison (Sequential Write)
Reed-Solomon shows significant performance degradation: throughput is up to 11.3× worse and latency up to 12.9× longer than with replication. The degradation for 4~16KB request sizes is not acceptable. The encoding computation, the data management, and the additional network traffic cause this degradation in erasure coding.

Performance Comparison (Sequential Read)
Reed-Solomon also degrades reads: for 4KB requests, throughput is 3.4× worse and latency 3.4× longer. Even though there was no failure, performance degradation occurred. It is caused by RS-concatenation, in which the data chunks of a stripe are gathered to serve a read, generating extra data transfers.
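As a back-of-the-envelope sketch of why small reads hurt under RS-concatenation: the 4MB object size and RS(6,3) layout below are assumptions carried over from the earlier slides, and the real per-request cost depends on how much of the stripe Ceph actually fetches.

```python
# Rough read-amplification estimate for a small read against a striped object.
# Assumed layout: a 4MB object split into k data chunks forming one stripe.

def read_amplification(request_kb, object_mb=4, k=6):
    """Return (whole-stripe, single-chunk) amplification factors for one read."""
    stripe_kb = object_mb * 1024          # all k data chunks of the object
    chunk_kb = stripe_kb / k              # the single chunk holding the requested range
    return stripe_kb / request_kb, chunk_kb / request_kb

whole, single = read_amplification(4)
print(f"4KB read: ~{whole:.0f}x if the whole stripe is fetched, "
      f"~{single:.0f}x if only the enclosing chunk is fetched")
```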

Analysis (2): CPU Utilization & Context Switches

Computing and Software Overheads (CPU Utilization)
For both random writes and random reads, RS requires many more CPU cycles than replication, and user-mode CPU utilization accounts for 70~75% of the total CPU cycles. This is uncommon in RAID systems; it arises because the storage stack is implemented at the user level (e.g., the OSD daemon, the PG backend, and the fault tolerance modules).

Computing and Software Overheads (Context Switches)
Relative number of context switches = (number of context switches) / (total request amount in MB).
Many more context switches occur with RS than with replication.
Read: data transfers among OSDs and the computation during RS-concatenation.
Write: 1) initializing an object issues many writes and a significant amount of computation; 2) updating an object introduces many transfers among OSDs through the user-level modules.

Analysis (3): I/O Amplification (Reads & Writes Served from Disks)

I/O Amplification (Random Write)
I/O amplification = (read/write amount from storage in MB) / (total request amount in MB).
Erasure coding causes write amplification of up to 700× the total request volume.
Why is the write amplification of random writes so large? Random writes mostly initialize objects (initialize, initialize, ..., initialize, update), whereas sequential writes mostly update objects (initialize, update, ..., update, initialize).
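For clarity, the two normalized metrics used in this analysis are simple ratios; the sketch below restates them with made-up counter values, assuming the storage-side byte counts are aggregated over all OSD SSDs (e.g., from iostat or /proc/diskstats).

```python
# The normalized metrics used in this analysis, restated as code.
# Example numbers are made up; real counters would be aggregated across all OSD SSDs.

def io_amplification(storage_read_mb, storage_write_mb, requested_mb):
    """Traffic actually served by the SSDs divided by the host-requested volume."""
    return (storage_read_mb / requested_mb, storage_write_mb / requested_mb)

def relative_context_switches(num_context_switches, requested_mb):
    """Context switches normalized by the total requested volume in MB."""
    return num_context_switches / requested_mb

read_amp, write_amp = io_amplification(
    storage_read_mb=150_000, storage_write_mb=716_800, requested_mb=1_024)
print(f"read amplification ~{read_amp:.0f}x, write amplification ~{write_amp:.0f}x")
```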

I/O Amplification (Read)
Read amplification is caused by RS-concatenation. Random reads mostly touch different stripes, which leads to heavy read amplification; sequential reads serve consecutive I/O requests from the same stripe, so there is almost no read amplification.

Analysis (4): Private Network Traffic

Network Traffic Among the Storage Nodes
The private network traffic shows a trend similar to the I/O amplification. With erasure coding, writes generate heavy traffic because of object initialization and updates, and reads do so because of RS-concatenation. Replication, in contrast, exhibits only the minimal data transfers required for necessary communication (e.g., OSD interactions for monitoring the status of each OSD).

Conclusion
We studied the overheads imposed by erasure coding on a distributed SSD array system. In contrast to the common expectations about erasure codes, we observed that they exhibit heavy network traffic and more I/O amplification than replication. Erasure coding also requires many more CPU cycles and context switches than replication because of its user-level implementation.

Q&A

Object Management in Erasure Coding: object initialization and object update. Time-series analysis of CPU utilization, context switches, and private network throughput, observed for random writes on a pristine image and for random overwrites.