Parity Logging with Reserved Space: Towards Efficient Updates and Recovery in Erasure-coded Clustered Storage Jeremy C. W. Chan*, Qian Ding*, Patrick P. C. Lee, Helen H. W. Chan

Presentation transcript:

Parity Logging with Reserved Space: Towards Efficient Updates and Recovery in Erasure-coded Clustered Storage Jeremy C. W. Chan*, Qian Ding*, Patrick P. C. Lee, Helen H. W. Chan The Chinese University of Hong Kong FAST’14 1 The first two authors contributed equally to this work.

Clustered storage systems provide scalable storage by striping data across multiple nodes – e.g., GFS, HDFS, Azure, Ceph, Panasas, Lustre, etc. Maintain data availability with redundancy – Replication – Erasure coding 2 Motivation

With explosive data growth, enterprises move to erasure-coded storage to reduce storage footprint and cost – e.g., 3-way replication has 200% overhead; erasure coding can reduce the overhead to 33% [Huang, ATC’12] Erasure coding recap: – Encodes data chunks to create parity chunks – A sufficiently large subset of the stored data/parity chunks (any k out of n) can recover the original data chunks Erasure coding introduces two challenges: (1) updates and (2) recovery/degraded reads 3 Motivation
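To make the encode-and-recover structure concrete, here is a toy single-parity code in Python: the parity chunk is the bytewise XOR of the data chunks, and any one missing chunk can be rebuilt from the survivors. This is only an illustration; the codes evaluated later in this talk (RDP, Cauchy Reed-Solomon) tolerate more failures but follow the same pattern.

```python
# Toy (k+1, k) erasure code: one XOR parity over k data chunks.
# Illustration only; real deployments use stronger codes (RDP, Reed-Solomon).

def xor_chunks(chunks):
    """Bytewise XOR of equal-sized chunks."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

def encode(data_chunks):
    """Return the stripe: data chunks followed by one parity chunk."""
    return list(data_chunks) + [xor_chunks(data_chunks)]

def recover(stripe, lost_index):
    """Rebuild one missing chunk from the surviving chunks."""
    survivors = [c for i, c in enumerate(stripe) if i != lost_index]
    return xor_chunks(survivors)

data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # k = 4 data chunks
stripe = encode(data)                          # 5 chunks stored on 5 nodes
assert recover(stripe, 1) == b"BBBB"           # any single chunk is recoverable
```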

1. Updates are expensive When a data chunk is updated, its encoded parity chunks need to be updated Two update approaches: – In-place updates: overwrites existing chunks – Log-based updates: appends changes 4 Challenges

2. Recovery/degraded reads are expensive Failures are common – Data may be permanently lost due to crashes – 90% of failures are transient (e.g., reboots, power loss, network connectivity loss, stragglers) [Ford, OSDI’10] Recovery/degraded read approach: – Reads enough data and parity chunks – Reconstructs lost/unavailable chunks 5 Challenges

How to achieve both efficient updates and fast recovery in clustered storage systems? Target scenario: – Server workloads with frequent updates – Commodity configurations with frequent failures – Disk-based storage Potential bottlenecks in clustered storage systems – Network I/O – Disk I/O 6 Challenges

Propose parity-logging with reserved space – Uses hybrid in-place and log-based updates – Puts parity deltas in a reserved space next to parity chunks to mitigate disk seeks – Predicts and reclaims the reserved space in a workload-aware manner → Achieves both efficient updates and fast recovery Build CodFS, a clustered storage system prototype – Incorporates different erasure coding and update schemes – Released as open-source software Conduct extensive trace-driven testbed experiments 7 Our Contributions

MSR Cambridge traces – Block-level I/O traces captured by Microsoft Research Cambridge in 2007 – 36 volumes (179 disks) on 13 servers – Workloads include home directories and project directories Harvard NFS traces (DEAS03) – NFS requests/responses of a NetApp file server in 2003 – Mixed workloads including email, research, and development 8 Background: Trace Analysis

9 MSR Trace Analysis [Figure: distribution of update size in 10 volumes of the MSR Cambridge traces] Updates are small – All updates are smaller than 512KB – 8 in 10 volumes show more than 60% of tiny updates (<4KB)

Updates are intensive – 9 in 10 volumes show more than 90% update writes among all writes Update coverage varies – Measured by the fraction of the working set size (WSS) that is updated at least once throughout the trace period – Large variation among different workloads, so a dynamic algorithm for handling updates is needed Similar observations for the Harvard traces 10 MSR Trace Analysis
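For reference, the two trace metrics above (the fraction of tiny updates and the update coverage over the working set) can be computed with a short script. The sketch below assumes a simplified CSV trace with (timestamp, offset, size, is_write) fields; the real MSR trace format has more columns, so the field layout here is hypothetical.

```python
# Sketch of the trace metrics: update coverage over the working set and the
# fraction of tiny (<4KB) updates. Trace format is a simplifying assumption.
import csv

BLOCK = 4096  # block granularity for the coverage computation (assumed)

def analyze_trace(trace_path):
    written = set()      # blocks written at least once (working set)
    updated = set()      # blocks overwritten at least once
    update_sizes = []    # sizes of writes that touch previously written blocks
    with open(trace_path, newline="") as f:
        for ts, offset, size, is_write in csv.reader(f):
            if is_write != "1":
                continue
            offset, size = int(offset), int(size)
            blocks = set(range(offset // BLOCK, (offset + size - 1) // BLOCK + 1))
            overlap = blocks & written
            if overlap:               # this write updates existing data
                update_sizes.append(size)
                updated |= overlap
            written |= blocks
    coverage = len(updated) / len(written) if written else 0.0
    tiny_frac = sum(s < 4096 for s in update_sizes) / max(len(update_sizes), 1)
    return coverage, tiny_frac
```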

Objective #1: Efficient handling of small, intensive updates in an erasure-coded clustered storage 11

12 Saving Network Traffic in Parity Updates Make use of the linearity of erasure coding – CodFS reduces network traffic by sending only the parity delta Question: how to save the data update (A’) and the parity delta (ΔA) on disk? [Figure: parity P is a linear combination of data chunks A, B, C with some encoding coefficients; when A is updated to A’, applying the same coefficient to the parity delta ΔA yields the new parity P’, so only ΔA needs to be sent to the parity node]
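A minimal sketch of the parity-delta idea, using the XOR toy code for clarity: because encoding is linear, the parity node can fold the data delta into its parity chunk without reading the other data chunks. With Reed-Solomon coding the delta would additionally be multiplied by the chunk's encoding coefficient in GF(2^8); the function names below are illustrative, not CodFS APIs.

```python
# Parity-delta update for an XOR parity; with XOR the encoding coefficient
# is 1, so the parity delta equals the data delta.
from functools import reduce

def xor_bytes(x, y):
    return bytes(a ^ b for a, b in zip(x, y))

def data_delta(old_chunk, new_chunk):
    # Delta shipped over the network instead of the whole stripe.
    return xor_bytes(old_chunk, new_chunk)

def apply_parity_delta(parity, parity_delta):
    # Parity node: fold the delta into the existing parity chunk.
    return xor_bytes(parity, parity_delta)

A, A_new, B, C, D = b"AAAA", b"AZAZ", b"BBBB", b"CCCC", b"DDDD"
parity = reduce(xor_bytes, [A, B, C, D])
parity_new = apply_parity_delta(parity, data_delta(A, A_new))
assert parity_new == reduce(xor_bytes, [A_new, B, C, D])
```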

13 Update Approach #1: In-place updates (overwrite) Used in host-based file systems (e.g., NTFS and ext4); also used for parity updates in RAID systems [Figure: updating chunk A overwrites it directly in place on disk] Problem: significant I/O to read and update parities

14 Update Approach #2: Log-based updates (logging) Used by most clustered storage systems (e.g., GFS, Azure) – Original concept from the log-structured file system (LFS): convert random writes into sequential writes [Figure: updating chunk A appends the new version of A to the disk, leaving the old copy in place] Problem: fragmentation of chunk A
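The difference between the two approaches is mainly where the write lands on disk. The sketch below is only an illustration of that (overwrite at an offset versus append to a log plus an in-memory index); it is not CodFS's actual on-disk format.

```python
# Illustration of the two update approaches on a single chunk file.
import os

def update_in_place(chunk_path, offset, data):
    # Overwrite: seek into the existing chunk and overwrite the old bytes.
    # Parities must then be read and rewritten as well (extra random I/O).
    with open(chunk_path, "r+b") as f:
        f.seek(offset)
        f.write(data)

def update_logged(log_path, log_index, chunk_offset, data):
    # Logging: append the change to a log; the write is sequential, but the
    # chunk becomes fragmented, so a later full read must also visit the log.
    with open(log_path, "ab") as f:
        log_offset = f.seek(0, os.SEEK_END)
        f.write(data)
    log_index.append((chunk_offset, len(data), log_offset))
```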

Objective #2: Preserves sequentiality in large read (e.g. recovery) for both data and parity chunks 15

16 Parity Update Schemes (O = overwrite, L = logging)

Scheme                                      Data update   Parity delta
Full-overwrite (FO)                         O             O
Full-logging (FL)                           L             L
Parity-logging (PL)                         O             L
Parity-logging with reserved space (PLR)    O             L    (our proposal)
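A rough sketch of the PLR layout and update path, under assumed sizes (4 MB chunks, 1 MB reserved space per parity chunk) and hypothetical names: data updates overwrite in place, parity deltas are appended into the reserved region that sits immediately after the parity chunk, and a merge folds the deltas back into the parity when the region fills up.

```python
# Sketch of a parity chunk with reserved space (PLR); sizes are assumptions.
CHUNK_SIZE = 4 * 1024 * 1024      # parity chunk size (illustrative)
RESERVED_SIZE = 1 * 1024 * 1024   # reserved space per parity chunk (illustrative)

class PLRParityChunk:
    def __init__(self):
        self.parity = bytearray(CHUNK_SIZE)
        self.deltas = []           # (offset_in_parity, delta_bytes)
        self.reserved_used = 0     # bytes of reserved space consumed

    def append_delta(self, offset, delta):
        # Deltas land in the reserved region right after the parity chunk,
        # so no seek away from the parity is needed.
        if self.reserved_used + len(delta) > RESERVED_SIZE:
            self.merge()           # reserved space full: merge first
        self.deltas.append((offset, delta))
        self.reserved_used += len(delta)

    def merge(self):
        # Fold all logged deltas into the parity chunk: one sequential read
        # of the parity plus its adjacent deltas, then one overwrite.
        for offset, delta in self.deltas:
            for i, b in enumerate(delta):
                self.parity[offset + i] ^= b
        self.deltas.clear()
        self.reserved_used = 0
```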

17–23 Parity Update Schemes [Figure, animated across several slides: the four schemes write data chunks a, b, c, d and later updates a’, b’, c’ across three storage nodes, with parity chunks a+b and c+d and parity deltas Δa, Δb, Δc. Final annotations: FL incurs a disk seek when reading data chunk b; FL and PL incur a disk seek for the parity chunk of b; PLR needs no seeks for either data or parity; FO needs an extra read to merge the parity]

24 Implementation - CodFS [Figure: CodFS architecture] Exploits parallelization across nodes and within each node Provides a file system interface based on FUSE OSD follows a modular design

25 Experiments Testbed: 22 nodes with commodity hardware – 12-node storage cluster – 10 client nodes sending requests – Connected via a Gigabit switch Experiments – Baseline tests: show that CodFS can achieve the theoretical throughput – Synthetic workload evaluation – Real-world workload evaluation (focus of this talk)

26 Synthetic Workload Evaluation – Random Write (IOzone, record length 128KB, RDP coding (6,4)) Logging the parity (FL, PL, PLR) helps random writes by saving disk seeks and parity read overhead; FO has 20% lower IOPS than the others

27 Synthetic Workload Evaluation – Sequential Read / Recovery [Figure annotations: no seeks in recovery for FO and PLR; only FL needs disk seeks when reading data chunks; merge overhead]

28 Fixing Storage Overhead [Figure: chunk layouts of PLR (6,4), FO/FL/PL (8,6), and FO/FL/PL (8,4), showing data chunks, parity chunks, and reserved space; panels for random write and recovery] FO (8,6) is still 20% slower than PLR (6,4) in random writes; PLR and FO are still much faster than FL and PL

Remaining problem – What is the appropriate reserved space size? Too small: frequent merges; too large: waste of space – Can we shrink the reserved space if it is not used? Baseline approach – Fixed reserved space size Workload-aware management approach – Predict: exponential moving average to estimate the reserved space size – Shrink: release unused space back to the system – Merge: merge all parity deltas back into the parity chunk 29 Dynamic Resizing of Reserved Space

30 Dynamic Resizing of Reserved Space Step 1: Predict the reserved-space utility from the past workload pattern using an exponential moving average: u_t = α × (current usage) + (1 − α) × u_{t−1}, where α is the smoothing factor Step 2: Compute the number of chunks to shrink from the predicted utility Step 3: Perform the shrink on disk and write new data chunks Shrinking the reserved space in multiples of the chunk size avoids creating unusable “holes”
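A small sketch of the prediction and shrink computation described above, assuming a smoothing factor alpha and a fixed chunk size; the exact bookkeeping in CodFS may differ.

```python
# Workload-aware reserved-space resizing: EMA prediction, then shrink in
# whole chunks so that no unusable "holes" are left behind.
CHUNK_SIZE = 4 * 1024 * 1024   # chunk size (illustrative)

def predict_utility(prev_utility, current_usage, alpha=0.5):
    # Exponential moving average of reserved-space usage:
    # u_t = alpha * current_usage + (1 - alpha) * u_{t-1}
    return alpha * current_usage + (1 - alpha) * prev_utility

def chunks_to_shrink(reserved_size, predicted_utility):
    # Release only whole chunks of unused reserved space.
    unused = max(reserved_size - predicted_utility, 0)
    return int(unused // CHUNK_SIZE)

# Example: 16 MB reserved, ~3 MB predicted usage -> release 3 chunks (12 MB).
u = predict_utility(prev_utility=2 * 2**20, current_usage=4 * 2**20)
print(chunks_to_shrink(16 * 2**20, u))   # 3
```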

31 Dynamic Resizing of Reserved Space [Figure: reserved space overhead under different shrink strategies in the Harvard trace; (10,8) Cauchy RS coding with 16MB segments; fixed 16MB reserved space as the baseline] Shrink only: performs shrinking at 00:00 and 12:00 each day Shrink + merge: additionally performs a merge after the daily shrinking

32 Penalty of Over-shrinking [Figure: average number of merges per 1000 writes under different shrink strategies in the Harvard trace; (10,8) Cauchy RS coding with 16MB segments] Less than 1% of writes are stalled by a merge operation (the penalty of an inaccurate prediction)

33 Open Issues – Latency analysis – Metadata management – Consistency / locking – Applicability to different workloads

Key idea: parity logging with reserved space – Keep parity updates next to parity chunks to reduce disk seeks Workload-aware scheme to predict and adjust the reserved space size Built the CodFS prototype, which achieves efficient updates and fast recovery Source code: Conclusions

35 Backup

36 MSR Cambridge Traces Replay Update Throughput PLR ~ PL ~ FL >> FO

37 MSR Cambridge Traces Replay Recovery Throughput PLR ~ FO >> FL ~ PL