NCCloud: A Network-Coding-Based Storage System in a Cloud-of-Clouds

Presentation transcript:

NCCloud: A Network-Coding-Based Storage System in a Cloud-of-Clouds. Henry C. H. Chen, Yuchong Hu, Patrick P. C. Lee, Yang Tang. IEEE Transactions on Computers, 15 August 2013

Outline: Introduction; Repair in Multiple Cloud Storage; FMSR Codes; NCCloud; Conclusion

Introduction: Cloud storage provides an on-demand remote backup solution. Relying on a single cloud storage provider, however, raises problems such as a single point of failure.

Introduction: The general solution is to stripe data across different cloud providers. The diversity of multiple clouds improves fault tolerance.

Introduction-Data Failure: This paper focuses on unexpected permanent cloud failures. When a cloud fails permanently, a repair is activated to maintain data redundancy and fault tolerance. A repair operation retrieves data from the surviving clouds and reconstructs the lost data in a new cloud.

Introduction-Data Failure: During repair, each surviving node encodes its stored data chunks and sends the encoded chunks to a new node, which regenerates the lost data.

Introduction-Cost Problem: Today's cloud storage providers charge users for outbound data. When repairing failures, moving an enormous amount of data (the repair traffic) can incur significant monetary costs.

Introduction-Repair Traffic Problem: To minimize repair traffic, regenerating codes [16] have been proposed. They store data redundantly in a distributed storage system and require less repair traffic at the same fault-tolerance level. [16] Network Coding for Distributed Storage Systems

Introduction-Regenerating Codes: However, most existing regenerating codes require storage nodes to be equipped with computation capabilities so that they can perform encoding operations during repair.

Introduction-Regenerating Codes: To make regenerating codes portable to any cloud storage service, this paper assumes only a thin-cloud interface, in which storage nodes support nothing beyond read and write.

Introduction-NCCloud: This paper presents the design and implementation of NCCloud, a proxy-based system that provides fault-tolerant storage over multiple cloud storage providers.

Introduction-FMSR: NCCloud is built on functional minimum-storage regenerating (FMSR) codes. The FMSR code implementation maintains double-fault tolerance and the same storage cost as RAID-6, while using less repair traffic when recovering from a single-cloud failure.

Introduction-FMSR: FMSR codes are non-systematic: the encoded chunks are linear combinations of the original data chunks, and the original data chunks are not stored as they would be in systematic coding schemes.

Repair in Multiple Cloud Storage: A transient failure is short-term: the failed cloud returns to normal after some time and no outsourced data is lost.

Repair in Multiple Cloud Storage: A permanent failure is long-term, in the sense that the outsourced data on a failed cloud becomes permanently unavailable. Examples: data center outages in disasters, data loss and corruption, and malicious attacks.

Outline: Introduction; Repair in Multiple Cloud Storage; FMSR Codes (Motivation, Implementation); NCCloud; Conclusion

Motivation: This paper considers distributed multiple-cloud storage in which data is striped across clouds under a proxy-based design.

Motivation: During repair, the proxy reads the essential data pieces from the surviving clouds, reconstructs new data pieces, and writes these new pieces to a new cloud.

Fault-tolerant: The Maximum Distance Separable (MDS) property. An (n, k)-MDS code divides a file into equal-size native chunks, which are linearly combined to form code chunks. The code chunks are distributed over n (larger than k) nodes. The original file can be reconstructed from any k of the n nodes, so the code tolerates the failure of any n − k nodes.

Fault-tolerant: FMSR codes can reconstruct the data of a failed node from the surviving nodes by downloading less data than the whole file, without reconstructing the whole file.

Different Coding Schemes: [Figure comparing three coding schemes.] First scheme: storage size 2M, repair traffic M. Second scheme: storage size 2M, repair traffic 0.75M. Third scheme: storage size 2M, repair traffic 0.75M.

Double-fault Tolerant FMSR Codes: Divide a file of size M into 2(n − 2) native chunks and generate 2n code chunks. Each node stores two code chunks, each of size $\frac{M}{2(n-2)}$. Repairing a failed node takes repair traffic $\frac{M(n-1)}{2(n-2)}$. With RAID-6 codes, the total storage size is $\frac{Mn}{n-2}$ and the repair traffic is M. FMSR therefore saves up to 50% of the repair traffic, a bound approached as n grows.
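The storage and repair-traffic formulas on this slide can be checked numerically. The following is a minimal sketch, not code from NCCloud; the function names are hypothetical and all quantities are expressed in units of the file size M.

```python
from fractions import Fraction


def raid6_costs(n):
    """Return (total storage, single-node repair traffic) for RAID-6, in units of M."""
    if n <= 2:
        raise ValueError("double-fault tolerance needs n > 2 nodes")
    return Fraction(n, n - 2), Fraction(1)


def fmsr_costs(n):
    """Return (total storage, single-node repair traffic) for FMSR, in units of M."""
    if n <= 2:
        raise ValueError("double-fault tolerance needs n > 2 nodes")
    chunk = Fraction(1, 2 * (n - 2))   # size of one code chunk
    storage = 2 * n * chunk            # 2n code chunks in total
    repair = (n - 1) * chunk           # one chunk read from each of the n-1 survivors
    return storage, repair


def repair_saving(n):
    """Fraction of RAID-6 repair traffic that FMSR avoids."""
    return 1 - fmsr_costs(n)[1] / raid6_costs(n)[1]


for n in (4, 10, 100):
    storage, repair = fmsr_costs(n)
    print(n, float(storage), float(repair), float(repair_saving(n)))
```

With n = 4 the FMSR repair traffic is 0.75M, a 25% saving over RAID-6; the saving only approaches 50% as n becomes large.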

FMSR Codes Implementation: FMSR codes do not require lost chunks to be exactly reconstructed. The regenerated chunks need not be identical to those in the failed node, as long as the MDS property holds.

FMSR Codes Implementation: This paper proposes a two-phase checking scheme to ensure that the code chunks on all nodes always satisfy the MDS property.

FMSR Codes Implementation: The implementation assumes a thin-cloud interface and supports three operations: file upload, file download, and repair.

File Upload: The k(n−k) native chunks are encoded into n(n−k) code chunks using an encoding matrix of coefficients of size $n(n-k) \times k(n-k)$, with entries in the Galois field $GF(p^n)$.

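File Upload: Arithmetic is carried out in the Galois field $GF(p^n)$. Each row of the encoding matrix is an encoding coefficient vector (ECV), which holds the coefficients that form one code chunk.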

File Download: Download the k(n−k) code chunks from any k of the n storage nodes. The ECVs of these k(n−k) code chunks form a k(n−k)×k(n−k) square matrix. Multiplying the inverse of this square matrix by the code chunks yields the original k(n−k) native chunks.
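The upload and download operations can be illustrated with a small round trip. This is a sketch under stated assumptions rather than the NCCloud implementation: it assumes the field GF(2^8) with reducing polynomial 0x11D, draws the encoding matrix at random, and checks only that every set of k nodes is decodable. All function names are hypothetical.

```python
import random
from itertools import combinations

# GF(2^8) log/antilog tables (assumed field: polynomial 0x11D, generator 2).
EXP, LOG = [0] * 510, [0] * 256
_x = 1
for _i in range(255):
    EXP[_i], LOG[_x] = _x, _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def gf_inv(a):
    return EXP[255 - LOG[a]]  # a must be non-zero


def invert(matrix):
    """Gauss-Jordan inversion over GF(2^8); returns None if the matrix is singular."""
    size = len(matrix)
    aug = [list(row) + [int(i == j) for j in range(size)] for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = next((r for r in range(col, size) if aug[r][col]), None)
        if pivot is None:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = gf_inv(aug[col][col])
        aug[col] = [gf_mul(inv, v) for v in aug[col]]
        for r in range(size):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [v ^ gf_mul(f, p) for v, p in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]


def combine(coeff_rows, chunks):
    """Form one output chunk per coefficient row as a linear combination of `chunks`."""
    out = []
    for row in coeff_rows:
        acc = bytearray(len(chunks[0]))
        for coeff, chunk in zip(row, chunks):
            for pos, byte in enumerate(chunk):
                acc[pos] ^= gf_mul(coeff, byte)
        out.append(bytes(acc))
    return out


def node_rows(rows, nodes, n, k):
    """Pick the n-k rows (ECVs or code chunks) held by each listed node."""
    per = n - k
    return [rows[i * per + j] for i in nodes for j in range(per)]


def upload(data, n, k, rng):
    """Split data into k(n-k) native chunks and encode them into n(n-k) code chunks."""
    parts = k * (n - k)
    size = -(-len(data) // parts)  # ceiling division
    padded = data.ljust(size * parts, b"\0")
    native = [padded[i * size:(i + 1) * size] for i in range(parts)]
    while True:  # redraw until every choice of k nodes gives an invertible matrix
        em = [[rng.randrange(256) for _ in range(parts)] for _ in range(n * (n - k))]
        if all(invert(node_rows(em, s, n, k)) for s in combinations(range(n), k)):
            return em, combine(em, native)


def download(em, code, nodes, n, k, length):
    """Decode the file from the code chunks stored on any k nodes."""
    inverse = invert(node_rows(em, nodes, n, k))
    native = combine(inverse, node_rows(code, nodes, n, k))
    return b"".join(native)[:length]


rng = random.Random(2013)
message = b"NCCloud stripes coded chunks across four clouds."
em, code = upload(message, n=4, k=2, rng=rng)
print(download(em, code, nodes=(1, 3), n=4, k=2, length=len(message)) == message)
```

The retry loop in `upload` is a simplification: it enforces decodability from any k nodes but does not perform the additional checks that the transcript describes for iterative repair.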

Iterative Repair: The MDS property must hold even after iterative repairs. This paper proposes two-phase checking, which verifies both the MDS property and the repair MDS (rMDS) property.

Satisfy MDS, but not rMDS

Iterative Repair: Step 1. Download the encoding matrix from a surviving node. Step 2. Select one ECV from each of the n−1 surviving nodes. Step 3. Generate a repair matrix. Step 4. Compute the ECVs for the new code chunks and produce a new encoding matrix EM'.

Iterative Repair: Step 5. Given EM', verify that both properties are satisfied: verify MDS by enumerating all $\binom{n}{k}$ subsets of k nodes, and verify rMDS by checking $n(n-k)^{n-1}\binom{n}{k}$ cases. The corresponding encoding matrices must all have full rank. Step 6. Download the actual chunk data and regenerate the new chunk data. (Figure labels: the new ECVs from Step 4; code chunks from surviving nodes.)
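Steps 2 to 5 operate on ECVs only, before any chunk data is moved, and can be sketched as below. This is an illustration under assumptions, not the paper's code: the field is taken to be GF(2^8) with polynomial 0x11D, the selection and the repair matrix are drawn at random, only the MDS half of the two-phase check is implemented (the rMDS check is omitted), and every name is hypothetical.

```python
import random
from itertools import combinations

# GF(2^8) log/antilog tables (assumed field: polynomial 0x11D, generator 2).
EXP, LOG = [0] * 510, [0] * 256
_x = 1
for _i in range(255):
    EXP[_i], LOG[_x] = _x, _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def full_rank(rows):
    """Gaussian elimination over GF(2^8) on a square matrix; True if it is invertible."""
    m = [list(r) for r in rows]
    for col in range(len(m)):
        pivot = next((r for r in range(col, len(m)) if m[r][col]), None)
        if pivot is None:
            return False
        m[col], m[pivot] = m[pivot], m[col]
        inv = EXP[255 - LOG[m[col][col]]]
        for r in range(col + 1, len(m)):
            if m[r][col]:
                f = gf_mul(m[r][col], inv)
                m[r] = [v ^ gf_mul(f, p) for v, p in zip(m[r], m[col])]
    return True


def is_mds(em, n, k):
    """MDS check: the ECVs of every choice of k nodes must form a full-rank matrix."""
    per = n - k
    return all(
        full_rank([em[i * per + j] for i in nodes for j in range(per)])
        for nodes in combinations(range(n), k)
    )


def lin_comb(coeffs, vectors):
    """Linear combination of ECVs over GF(2^8)."""
    out = [0] * len(vectors[0])
    for c, vec in zip(coeffs, vectors):
        for pos, v in enumerate(vec):
            out[pos] ^= gf_mul(c, v)
    return out


def repair_ecvs(em, n, k, failed, rng, max_tries=1000):
    """Return a new encoding matrix EM' with the failed node's ECVs regenerated."""
    per = n - k
    survivors = [i for i in range(n) if i != failed]
    for _ in range(max_tries):
        picked = [em[i * per + rng.randrange(per)] for i in survivors]       # Step 2
        rm = [[rng.randrange(256) for _ in survivors] for _ in range(per)]   # Step 3
        candidate = [list(row) for row in em]
        candidate[failed * per:(failed + 1) * per] = [lin_comb(r, picked) for r in rm]  # Step 4
        if is_mds(candidate, n, k):                                           # Step 5 (MDS only)
            return candidate
    raise RuntimeError("no MDS-preserving repair found")


def random_mds_matrix(n, k, rng):
    """Demo helper: draw random encoding matrices until one satisfies the MDS check."""
    while True:
        em = [[rng.randrange(256) for _ in range(k * (n - k))] for _ in range(n * (n - k))]
        if is_mds(em, n, k):
            return em


rng = random.Random(7)
em0 = random_mds_matrix(4, 2, rng)
em1 = repair_ecvs(em0, 4, 2, failed=1, rng=rng)
print(is_mds(em1, 4, 2))
```

Because the check runs on the small coefficient matrix alone, a rejected candidate costs no cloud traffic; chunk data would be downloaded only after a candidate passes.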

rMDS Sustaining

Time of Two-phase Checking

Double-fault Tolerant Codes Markov Model

MTTDL (Mean Time To Data Loss), Compared to RAID-6

NCCloud: A proxy that bridges user applications and multiple clouds. Its design is built on three layers: the file system layer, the coding layer, and the storage layer.

NCCloud It is mainly implemented in Python, while the coding schemes are implemented in C for better efficiency.

Goal of NCCloud: Compare the costs and response times of RAID-6 and FMSR codes, and show the cost advantage of FMSR over RAID-6 while maintaining acceptable response time.

Goal of NCCloud: Normal operations: RAID-6 and FMSR incur similar storage costs. Repair operation: FMSR saves a significant amount of transfer cost over RAID-6.

Cost Saving-Price

Cost Saving: Normal operations: with 1.25PB of data stored, the monthly storage cost is $86,851 for FMSR and $86,851 for RAID-6. Repair operation: RAID-6 transfers 1PB of data at a cost of $56,832, while FMSR transfers 0.5625PB at a cost of $33,894, a saving of $22,938.
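The data volumes on this slide agree with the double-fault-tolerant formulas for a file of size M = 1PB spread over n = 10 clouds. The value n = 10 is inferred here from the numbers and is not stated in the transcript.

```latex
\begin{align*}
\text{total storage} &= \frac{Mn}{n-2} = \frac{10}{8}\,\text{PB} = 1.25\,\text{PB},\\
\text{FMSR repair traffic} &= \frac{M(n-1)}{2(n-2)} = \frac{9}{16}\,\text{PB} = 0.5625\,\text{PB},\\
\text{repair cost saving} &= \$56{,}832 - \$33{,}894 = \$22{,}938 \approx 40\%\ \text{of the RAID-6 repair cost}.
\end{align*}
```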

Response Time-Local Cloud

Response Time-Local Cloud

Response Time-Commercial Cloud

Conclusion: This paper presents NCCloud, a proxy-based multiple-cloud storage system that addresses the reliability of today's cloud backup storage. NCCloud not only provides fault tolerance in storage, but also allows cost-effective repair. The FMSR code implementation eliminates the encoding requirement of storage nodes during repair.