Reconsidering Single Failure Recovery in Clustered File Systems

Slides:

Advertisements

Similar presentations

Request Dispatching for Cheap Energy Prices in Cloud Data Centers

Advertisements

SpringerLink Training Kit

Luminosity measurements at Hadron Colliders

From Word Embeddings To Document Distances

Choosing a Dental Plan Student Name

Virtual Environments and Computer Graphics

Chương 1: CÁC PHƯƠNG THỨC GIAO DỊCH TRÊN THỊ TRƯỜNG THẾ GIỚI

THỰC TIỄN KINH DOANH TRONG CỘNG ĐỒNG KINH TẾ ASEAN –

D. Phát triển thương hiệu

NHỮNG VẤN ĐỀ NỔI BẬT CỦA NỀN KINH TẾ VIỆT NAM GIAI ĐOẠN

Điều trị chống huyết khối trong tai biến mạch máu não

BÖnh Parkinson PGS.TS.BS NGUYỄN TRỌNG HƯNG BỆNH VIỆN LÃO KHOA TRUNG ƯƠNG TRƯỜNG ĐẠI HỌC Y HÀ NỘI Bác Ninh 2013.

Nasal Cannula X particulate mask

Evolving Architecture for Beyond the Standard Model

HF NOISE FILTERS PERFORMANCE

Electronics for Pedestrians – Passive Components –

Parameterization of Tabulated BRDFs Ian Mallett (me), Cem Yuksel

L-Systems and Affine Transformations

CMSC423: Bioinformatic Algorithms, Databases and Tools

Some aspect concerning the LMDZ dynamical core and its use

Bayesian Confidence Limits and Intervals

实习总结（Internship Summary)

Current State of Japanese Economy under Negative Interest Rate and Proposed Remedies Naoyuki Yoshino Dean Asian Development Bank Institute Professor Emeritus,

Front End Electronics for SOI Monolithic Pixel Sensor

Face Recognition Monday, February 1, 2016.

Solving Rubik's Cube By: Etai Nativ.

CS284 Paper Presentation Arpad Kovacs

انتقال حرارت 2 خانم خسرویار.

Summer Student Program First results

Theoretical Results on Neutrinos

HERMESでのHard Exclusive生成過程による核子内クォーク全角運動量についての研究

Wavelet Coherence & Cross-Wavelet Transform

yaSpMV: Yet Another SpMV Framework on GPUs

Creating Synthetic Microdata for Higher Educational Use in Japan: Reproduction of Distribution Type based on the Descriptive Statistics Kiyomi Shirakawa.

MOCLA02 Design of a Compact L-band Transverse Deflecting Cavity with Arbitrary Polarizations for the SACLA Injector Sep. 14th, 2015 H. Maesaka, T. Asaka,

Hui Wang†*, Canturk Isci‡, Lavanya Subramanian*,

Fuel cell development program for electric vehicle

Overview of TST-2 Experiment

Optomechanics with atoms

داده کاوی سئوالات نمونه

Inter-system biases estimation in multi-GNSS relative positioning with GPS and Galileo Cecile Deprez and Rene Warnant University of Liege, Belgium

ლექცია 4 - ფული და ინფლაცია

10. predavanje Novac i financijski sustav

Wissenschaftliche Aussprache zur Dissertation

FLUORECENCE MICROSCOPY SUPERRESOLUTION BLINK MICROSCOPY ON THE BASIS OF ENGINEERED DARK STATES* *Christian Steinhauer, Carsten Forthmann, Jan Vogelsang,

Particle acceleration during the gamma-ray flares of the Crab Nebular

Interpretations of the Derivative Gottfried Wilhelm Leibniz

Advisor: Chiuyuan Chen Student: Shao-Chun Lin

Widow Rockfish Assessment

SiW-ECAL Beam Test 2015 Kick-Off meeting

On Robust Neighbor Discovery in Mobile Wireless Networks

Chapter 6 并发：死锁和饥饿 Operating Systems: Internals and Design Principles

You NEED your book!!! Frequency Distribution

Y V =0 a V =V0 x b b V =0 z

Fairness-oriented Scheduling Support for Multicore Systems

Climate-Energy-Policy Interaction

Hui Wang†*, Canturk Isci‡, Lavanya Subramanian*,

Ch48 Statistics by Chtan FYHSKulai

The ABCD matrix for parabolic reflectors and its application to astigmatism free four-mirror cavities.

Measure Twice and Cut Once: Robust Dynamic Voltage Scaling for FPGAs

Online Learning: An Introduction

Factor Based Index of Systemic Stress (FISS)

What is Chemistry? Chemistry is: the study of matter & the changes it undergoes Composition Structure Properties Energy changes.

THE BERRY PHASE OF A BOGOLIUBOV QUASIPARTICLE IN AN ABRIKOSOV VORTEX*

Quantum-classical transition in optical twin beams and experimental applications to quantum metrology Ivano Ruo-Berchera Frascati.

The Toroidal Sporadic Source: Understanding Temporal Variations

FW 3.4: More Circle Practice

ارائه یک روش حل مبتنی بر استراتژی های تکاملی گروه بندی برای حل مسئله بسته بندی اقلام در ظروف

Decision Procedures Christoph M. Wintersteiger 9/11/2017 3:14 PM

Limits on Anomalous WWγ and WWZ Couplings from DØ

Presentation transcript:

Reconsidering Single Failure Recovery in Clustered File Systems Zhirong Shen*, Jiwu Shu*, Patrick P. C. Lee# *Tsinghua University #The Chinese University of Hong Kong IEEE/IFIP DSN’16

Clustered File Systems Clustered file systems (CFSes) provide unified storage service over multiple storage nodes e.g., GFS, HDFS, Azure, Ceph, Panasas, Lustre, etc Hierarchical Topology: Nodes are grouped in racks Failures are common Network core

Redundancy Solution: data redundancy Replication: make duplicate copies Erasure coding: encode data to parity Enterprises (e.g., Google, Azure, Facebook) move to erasure coding to save footprints Azure reduces overhead from 3x to 1.33x via erasure coding  over 50% cost saving [Huang, ATC’12] Facebook reduces overhead from 3.6x to 2.1x via erasure coding for warm binary data [Muralidhar, OSDI’14]

Erasure Coding Divide data to k data chunks (each with multiple elements) Encode data chunks to additional m parity chunks Distribute each stripe of data/parity chunks to k+m nodes MDS property: any k out of k+m nodes can recover data Data encode divide Nodes (k, m) = (2, 2) A B C D A+C B+D A+D B+C+D

Erasure-Coded CFS Architecture: 5 racks, 20 nodes Erasure Coding: 4 stripes, 14 chunks

Single Failure Recovery Single node failures are most common Extensive work on optimizing single failure recovery by minimizing repair traffic Xiang et al. [ACM SIGMETRICS’10] Khan et al. [USENIX FAST’12] Zhu et al. [IEEE MSST’12] Fu et al. [IEEE SRDS’14] Shen et al. [IEEE SRDS’15] Are there any limitation of existing designs?

Limitations #1: Neglecting bandwidth diversity Oversubscription ratio: 5:1 - 20:1 Cross-rack connection Intra-rack connection Scarce resource Sufficient resource

Limitations #2: Incompatibility with Reed-Solomon (RS) Codes RS codes provide general fault tolerance, and are used in Google and Facebook Existing Solutions Existing Solutions XOR-based code RS code Read any k chunks; C(k+m-1,k) choices Selectively read elements from chunks

Limitations #3: Neglecting load balancing Only focus on minimizing repair traffic, but ignore load balancing property Can we minimize and balance cross-rack repair traffic for single failure recovery in a CFS deployed with RS Codes?

Our Contributions Propose cross-rack-aware recovery (CAR): Designed for single-failure recovery for RS codes in a CFS setting Reduces cross-rack repair traffic for each stripe Balances cross-rack repair traffic for multiple stripes Evaluate CAR in a testbed cluster Reduces 66.9% of cross-rack repair traffic and 53.8% of recovery time over baseline

Problem Formulation … … A1 A2 Af Ar Load balancing rate 𝜆= t1,f Network core t1,f … … A1 A2 Af Ar Amount of cross-rack repair traffic from A1 to Af Load balancing rate 𝜆= MAX cross-rack repair traffic of all racks AVERAGE cross-rack repair traffic of all racks

Objectives Our Objective: CAR’s approach: Minimize 𝜆 (find the most balanced solution) Subject to min 1≤𝑖≠𝑓≤𝑟 𝑡 𝑖,𝑓 (cross-rack repair traffic minimized) CAR’s approach: #1: Minimizing number of accessed racks #2: Intra-rack chunk aggregation #3: Load Balancing Minimize cross-rack repair traffic Balance cross-rack repair traffic

Design of CAR #1: Minimizing number of accessed racks Question 1: how to minimize number of accessed racks? Idea: For each stripe, sort all racks (except 𝑨 𝒇 ) by number of available chunks in each rack in descending order Read chunks from the sorted list of racks, until the number of retrieved chunks plus the remaining chunks in 𝑨 𝒇 is larger than k

Design of CAR #1: Minimizing number of accessed racks Suppose k=8 and 𝑨 𝒇 = 𝑨 𝟏 . We can read chunks from 𝑨 𝟑 and 𝑨 𝟓 , such that 3+4+3=10≥8.

Design of CAR #1: Minimizing number of accessed racks Question 2: how to find the “right” solution with minimum number of accessed racks? Let 𝑛 𝑖 be minimum number of accessed racks for the repair of i-th stripe. A solution is valid if it can repair the i-th stripe by accessing only 𝑛 𝑖 racks A stripe may have many valid recovery solutions

Design of CAR #1: Minimizing number of accessed racks We can also read chunks from 𝑨 𝟑 and 𝑨 𝟒 , since 3+2+3=8≥8.

Design of CAR #2: Intra-rack chunk aggregation Chunks retrieved from a rack can be aggregated into a single chunk before transmitted across racks, due to linear property of RS code Hi = i-th chunk to be read; H’ = repaired chunk H’ = y1 H1 + y2 H2 + … + yk Hk Number of cross-rack transmitted chunks after aggregation is equal to number of accessed racks Partial decoding within each rack

Design of CAR #2: Intra-rack chunk aggregation Four chunks in A5 are aggregated into one chunk. The aggregated chunk is sent to A1 for recovery

Design of CAR #3: Load balancing Use a greedy algorithm Idea: find the most loaded rack 𝐴𝑙 and another rack 𝐴𝑖 find the stripe-level solution 𝑅𝑗 that reads data from 𝐴𝑙 for 𝑗-th stripe find another valid solution 𝑅𝑗′, such that the read chunks in 𝐴𝑙 can be substituted by those in 𝐴𝑖

Design of CAR #3: Load balancing load balancing rate: 𝟏𝟔 𝟗

Design of CAR #3: Load balancing most loaded rack

Design of CAR #3: Load balancing most loaded rack another rack

Design of CAR #3: Load balancing load balancing rate: 𝟏𝟐 𝟗 Find another valid solution in the 3rd stripe. It skips A2 and reads chunks in A3 load balancing rate: 𝟏𝟐 𝟗

Erasure code parameters Evaluation Three cluster settings Chunk size: 4MB, 8MB, and 16MB Compare CAR with random recovery (RR), which randomly reads k chunks from surviving nodes Jerasure library: encoding/decoding operations Clusters Number of racks Number of nodes Erasure code parameters CFS1 3 10 (k=4,m=3) CFS2 4 13 (k=6,m=3) CFS3 5 20 (k=10,m=4)

CAR reduces cross-rack repair traffic by up to 66.9% over RR Evaluation # Cross-rack repair traffic: CAR reduces cross-rack repair traffic by up to 66.9% over RR

CAR effectively balances cross-rack repair traffic Evaluation # Load balancing rate: Unbalanced solution: the solution found in CAR without load balancing CAR effectively balances cross-rack repair traffic

Evaluation # Recovery time: CAR reduces recovery time by up to 53.8% CFS1 CFS2 CFS3 CAR reduces recovery time by up to 53.8% Large k  long recovery time

CAR won’t increase computation burden as it uses same coding Evaluation # Computation time and transmission time: Transmission time dominates overall recovery time, especially for CFS3 with largest k CAR won’t increase computation burden as it uses same coding

Recent Work on Partial Decoding Partial-Parallel Repair [Mitra, EuroSys’16] Partial decoding at different nodes Double Regenerating Codes [Hu, ISIT’16] Optimal regenerating code construction that provably minimizes cross-rack repair traffic CAR focuses on Reed-Solomon codes, and addresses load balancing across stripes

Conclusions Existing single failure recovery solutions are inadequate for CFS architectures CAR: cross-rack-aware recovery Minimizes number of accessed racks Performs intra-rack chunk aggregation Balances cross-rack repair traffic over multiple stripes Experiments show performance gain of CAR