1 NCFS: On the Practicality and Extensibility of a Network-Coding-Based Distributed File System
Yuchong Hu¹, Chiu-Man Yu², Yan-Kit Li², Patrick P. C. Lee², John C. S. Lui²
¹Institute of Network Coding, ²Department of Computer Science and Engineering, The Chinese University of Hong Kong
NetCod ’11

2 NCFS Overview
- NCFS: network-coding-based distributed file system
  - Uses network coding to speed up repair while preserving the fault tolerance of storage
  - No intelligence required on storage nodes
  - Data is transparently striped across nodes

3 NCFS Overview
- NCFS is a file system
  - Organizes storage in a hierarchical namespace
  - Manages file namespace metadata (e.g., filenames, directories, access rights)
  - Clients only see a mounted drive, similar to Linux/Windows file systems

    # Log in as root (currently NCFS needs root access)
    sudo -i
    # Mount the file system on ./mountdir
    ./ncfs rootdir mountdir
    # Copy a file to NCFS
    time cp test.dat ./mountdir

4 Why Did We Build NCFS?
- To develop a proof-of-concept prototype for experimenting with erasure codes and regenerating codes in practice
  - We by no means claim it is ready for large-scale use
- To provide a storage interface between user applications and distributed storage backends
- To provide an extensible platform for future systems research

5 Definitions
- A data stream is divided into block groups, each with m native blocks
- The m native blocks are encoded to form c code blocks (also called parity blocks)
[Figure: an array of n disks D1, D2, …, Dn; each row of blocks across the disks is a segment (aka stripe)]

6 MDS Codes
- A maximum distance separable code, denoted MDS(n, k), has the property that any k (< n) out of the n nodes can be used to reconstruct the original native blocks
  - i.e., at most n − k disk failures can be tolerated
- Example: RAID-5 is an MDS(n, n−1) code, as it can tolerate one disk failure

7 Repair Degree
- Repair degree d
  - Consider the case where one disk fails
  - The lost blocks of the failed disk are repaired by retrieving data from d surviving disks
- The design of an MDS code is determined by the parameters (n, k, d, m, c)

8 Classes of MDS Codes
- Erasure codes
  - Stripe data with redundancy
  - e.g., RAID-like codes, array codes
- Regenerating codes
  - Stripe data with redundancy and reduce the repair bandwidth based on network coding
- We focus on optimal erasure codes (i.e., those achieving the MDS property); near-optimal erasure codes require slightly more than k blocks to recover the original data
Dimakis et al., “A Survey on Network Codes for Distributed Storage”, Proc. of the IEEE, March 2011

9 Erasure Codes: RAID-5
- Define + as the XOR operator
- Code block c1 = XOR-sum of the native blocks in the same segment, e.g., c1 = m1 + m2 + m3
- Parameters: n ≥ 2, k = n − 1, d = k, m = n − 1, c = 1
Example layout for n = 4 (data stream of native blocks m1, m2, m3, …; parity rotates across disks):

    D1   D2   D3   D4
    m1   m2   m3   c1
    m4   m5   c2   m6
    m7   c3   m8   m9
    c4   m10  m11  m12
    m13  m14  m15  c5
    m16  m17  c6   m18
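A minimal Python sketch of this logic (illustrative only, not NCFS's implementation): the code block is the XOR-sum of its segment, and a single lost block is regenerated by XOR-ing the survivors.

    from functools import reduce

    def xor_blocks(blocks):
        # XOR-sum of equal-length byte blocks.
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    # One segment with m = n - 1 = 3 native blocks (n = 4 disks).
    m1, m2, m3 = b"AAAA", b"BBBB", b"CCCC"
    c1 = xor_blocks([m1, m2, m3])        # c1 = m1 + m2 + m3

    # Repair: the disk holding m2 fails; XOR the surviving blocks of the segment.
    assert xor_blocks([m1, m3, c1]) == m2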

10 Erasure Codes: RAID-6
- Code blocks:
  - c1 = XOR-sum of the native blocks in the same segment
  - c2 = linear combination based on Reed-Solomon codes
  - e.g., c1 = m1 + m2, c2 = m1 + 2·m2
- Parameters: n ≥ 3, k = n − 2, d = k, m = n − 2, c = 2
Example layout for n = 4 (data stream of native blocks m1, m2, m3, …):

    D1   D2   D3   D4
    m1   m2   c1   c2
    m3   c3   c4   m4
    c5   c6   m5   m6
    m7   m8   c7   c8
    m9   c9   c10  m10
    c11  c12  m11  m12

Intel, “Intelligent RAID 6 Theory Overview and Implementation”, 2005.
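One way to make the c2 = m1 + 2·m2 example concrete is arithmetic in the standard RAID-6 field GF(2^8) with polynomial 0x11d; this is a sketch of that construction, and NCFS's Reed-Solomon details may differ.

    # GF(2^8) exp/log tables, RAID-6 polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d).
    EXP, LOG = [0] * 512, [0] * 256
    x = 1
    for i in range(255):
        EXP[i], LOG[x] = x, i
        x <<= 1
        if x & 0x100:
            x ^= 0x11D
    for i in range(255, 512):
        EXP[i] = EXP[i - 255]

    def gf_mul(a, b):
        return 0 if 0 in (a, b) else EXP[LOG[a] + LOG[b]]

    def gf_div(a, b):
        return 0 if a == 0 else EXP[(LOG[a] - LOG[b]) % 255]

    m1, m2 = 0x5A, 0xC3                 # two native bytes
    c1 = m1 ^ m2                        # c1 = m1 + m2
    c2 = m1 ^ gf_mul(2, m2)             # c2 = m1 + 2*m2

    # Repair both native bytes from the two code bytes:
    # c1 + c2 = (1 + 2)*m2 = 3*m2, so m2 = (c1 + c2)/3 and m1 = c1 + m2.
    r2 = gf_div(c1 ^ c2, 3)
    assert (c1 ^ r2, r2) == (m1, m2)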

11 Regenerating Codes
- Regenerating codes are MDS codes that realize the optimal tradeoff between storage and repair bandwidth
- Two extreme points of the tradeoff curve:
  - MBR (minimum bandwidth regenerating) codes
  - MSR (minimum storage regenerating) codes
Dimakis et al., “Network Coding for Distributed Storage Systems”, IEEE Trans. on Info. Theory, 2010

12 Regenerating Codes
- E-MBR: minimum bandwidth regenerating codes with exact repair
  - Minimize repair bandwidth while maintaining the MDS property
- Idea:
  - Assume d = n − 1 (repair reads data from the n − 1 surviving disks after a single-node failure)
  - Make a duplicate copy of each native/code block
  - To repair a failed disk, download exactly one block per segment from each surviving disk
Rashmi et al., “Explicit Construction of Optimal Exact Regenerating Codes for Distributed Storage”, Allerton 2009.

13 Regenerating Codes
- E-MBR(n, k = n−1, d = n−1)
- Duplicate each native block; the two copies of a block are placed on different disks
- Parameters: n ≥ 2, k = n − 1, d = n − 1, m = n(n−1)/2, c = 0 (i.e., no code blocks)
Example layout for n = 4 (data stream of native blocks m1, m2, m3, …; each block appears on exactly two disks):

    D1   D2   D3   D4
    m1   m1   m2   m3
    m2   m4   m4   m5
    m3   m5   m6   m6
    m7   m7   m8   m9
    m8   m10  m10  m11
    m9   m11  m12  m12
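The layout above has a clean combinatorial structure: m = n(n−1)/2 is exactly the number of disk pairs, so each native block can be identified with one pair and stored on both of its disks. A hypothetical sketch of generating one block group of this placement (it reproduces the n = 4 layout above):

    from itertools import combinations

    def embr_placement(n):
        # E-MBR(n, k = n-1): one native block per pair of disks, stored on both.
        disks = {d: [] for d in range(1, n + 1)}
        for i, (d1, d2) in enumerate(combinations(range(1, n + 1), 2), start=1):
            disks[d1].append(f"m{i}")
            disks[d2].append(f"m{i}")
        return disks

    print(embr_placement(4))
    # {1: ['m1', 'm2', 'm3'], 2: ['m1', 'm4', 'm5'],
    #  3: ['m2', 'm4', 'm6'], 4: ['m3', 'm5', 'm6']}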

14 Regenerating Codes
- E-MBR(n, k = n−2, d = n−1)
- Code block: c1 = m1 + m2 + m3 + m4 + m5
- Parameters: n ≥ 3, k = n − 2, d = n − 1, m = (n−2)(n+1)/2, c = 1
Example layout for n = 4 (each native/code block duplicated on two disks):

    D1   D2   D3   D4
    m1   m1   m2   m3
    m2   m4   m4   m5
    m3   m5   c1   c1
    m6   m6   m7   m8
    m7   m9   m9   m10
    m8   m10  c2   c2

15 Regenerating Codes
- E-MBR(n, k, d = n−1) for general n and k
  - Each native/code block still has a duplicate copy
  - Code blocks are formed by Reed-Solomon codes
- Parameters: n ≥ k + 1, k arbitrary, d = n − 1, m = k(2n−k−1)/2, c = (n−k)(n−k−1)/2
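A quick sanity check of these per-group counts against the two special cases on the previous slides (a sketch, not NCFS code):

    def embr_params(n, k):
        m = k * (2 * n - k - 1) // 2        # native blocks per block group
        c = (n - k) * (n - k - 1) // 2      # code blocks per block group
        return m, c

    assert embr_params(4, 3) == (6, 0)   # E-MBR(k = n-1): m = n(n-1)/2, c = 0
    assert embr_params(4, 2) == (5, 1)   # E-MBR(k = n-2): m = (n-2)(n+1)/2, c = 1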

16 Regenerating Codes
- Repair in E-MBR: download one block per segment from each surviving disk
[Figure: repairing a failed disk in the E-MBR(n, k = n−1, d = n−1) layout of slide 13; each lost block is copied from the one other disk that holds its duplicate]
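Continuing the placement sketch from slide 13: repair needs no decoding at all, since each lost block has a duplicate on exactly one other disk, and the pair structure means each surviving disk supplies exactly one lost block per block group. A hypothetical sketch:

    def embr_repair(disks, failed):
        # Copy each lost block's duplicate from the unique other disk holding it.
        plan = []
        for block in disks[failed]:
            source = next(d for d, blks in disks.items()
                          if d != failed and block in blks)
            plan.append((block, source))
        return plan

    # Using embr_placement(4) from the previous sketch:
    print(embr_repair(embr_placement(4), 1))
    # [('m1', 2), ('m2', 3), ('m3', 4)]
    # i.e., exactly one block is downloaded from each surviving disk.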

17 Regenerating Codes
- Other regenerating codes (e.g., MSR) typically require storage nodes to perform encoding
- This assumption may be too strong for some types of storage nodes
  - e.g., block devices such as hard disks

18 Codes That We Consider
- We consider the following MDS codes in this talk:
  - Erasure codes: RAID
  - Regenerating codes: E-MBR(d = n−1)
- Neither RAID nor E-MBR(d = n−1) requires encoding capabilities on the disks

19 NCFS Architecture
- NCFS is a proxy-based file system that interconnects different types of storage nodes
- Both client-side and server-side deployments are feasible
[Figure: client-side deployment; a file written to /mnt/ncfs on the client passes through NCFS, which stripes it over the network to the storage nodes]

20 NCFS Architecture
[Figure: server-side deployment; clients access /mnt/ncfs over the network, and NCFS stripes files across the storage nodes]

21 NCFS Architecture
- Repair through NCFS:
  - Read data from the surviving nodes
  - Regenerate the lost data blocks
  - Write the regenerated blocks to a new node
[Figure: NCFS reads blocks A and B from the surviving nodes, regenerates C = A + B, and writes it to the new node replacing the failed one]
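As a sketch of the proxy-side repair path for the XOR case in the figure (read_block and write_block are hypothetical I/O helpers, not NCFS's actual API):

    from functools import reduce

    def repair_failed_disk(surviving_nodes, new_node, num_segments):
        # For each segment: read one block from every surviving node,
        # XOR them to regenerate the lost block (C = A + B in the figure),
        # and write the result to the replacement node.
        # All coding happens inside the proxy; the nodes stay dumb.
        for seg in range(num_segments):
            blocks = [read_block(node, seg) for node in surviving_nodes]   # hypothetical helper
            lost = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)
            write_block(new_node, seg, lost)                               # hypothetical helper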

22 Layered Design of NCFS
- NCFS uses a layered design to allow extensibility
- File system layer: implemented in FUSE, which provides all file system semantics and repair operations
- Coding layer: RAID-5, RAID-6, MBR, MSR
- Storage layer: interface to the storage nodes
[Figure: clients access /mnt/ncfs; requests pass through the file system, coding, and storage layers of NCFS down to the storage nodes]
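One plausible shape for the coding layer's extension point (an assumed sketch, not the actual NCFS class hierarchy): each scheme implements encoding and repair against the storage layer, and new codes plug in by subclassing.

    from abc import ABC, abstractmethod

    class CodingScheme(ABC):
        # Assumed coding-layer interface: RAID-5, RAID-6, E-MBR, ...
        # would each subclass this and be selected at mount time.

        @abstractmethod
        def encode_segment(self, native_blocks: list[bytes]) -> list[bytes]:
            """Map the m native blocks of a segment to the n blocks stored across disks."""

        @abstractmethod
        def repair(self, failed_disk: int, read_fn) -> dict[int, bytes]:
            """Regenerate the failed disk's blocks, reading survivors via read_fn."""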

23 Experiments
- Goal: empirically evaluate different coding schemes under NCFS
  - Upload/download throughput
  - Degraded download throughput
  - Repair throughput
- NCFS deployment: Intel quad-core machine running Linux

24 Topologies
- Three network topologies considered:
  - 4-node switch setting: tests basic functionality
  - 8-node switch setting: tests scalability
  - 4-node LAN setting: tests bandwidth bottlenecks
[Figure: NCFS connected to NAS nodes via a 1-Gb switch, and to PCs over the department LAN]

25 Experiment: Normal Mode
- Normal usage (i.e., all nodes work):
  - Upload a 256-MB file
  - Download a 256-MB file
- Different coding schemes have different storage costs, e.g., actual storage for n = 4:
  - RAID-5: 341 MB
  - RAID-6: 512 MB
  - E-MBR(k = n−1): 512 MB
  - E-MBR(k = n−2): 614 MB
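These figures follow from each scheme's per-group block counts; a quick sketch reproducing them for a 256-MB file with n = 4:

    def storage_mb(file_mb, blocks_stored, blocks_native):
        # Actual storage = file size scaled by (stored blocks / native blocks) per group.
        return file_mb * blocks_stored / blocks_native

    n, f = 4, 256
    print(storage_mb(f, n, n - 1))   # RAID-5: 341.3 MB (3 native + 1 parity)
    print(storage_mb(f, n, n - 2))   # RAID-6: 512.0 MB (2 native + 2 parity)
    print(storage_mb(f, 12, 6))      # E-MBR(k=n-1): 512.0 MB (6 native blocks, all duplicated)
    print(storage_mb(f, 12, 5))      # E-MBR(k=n-2): 614.4 MB (5 native + 1 code, duplicated)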

26 Normal Upload Throughput
- E-MBR(k = n−1) has the highest upload throughput because it performs no code-block updates, even though it uploads more data

27 Upload Time Breakdown (4-node switch setting)

    Scheme          Encoding  Disk read  Disk write  Others
    RAID-5          —         49.78%     28.64%      12.33%
    RAID-6          —         28.94%     36.74%      6.94%
    E-MBR(k=n-1)    0.08%     0%         83.43%      16.49%
    E-MBR(k=n-2)    5.93%     34.30%     52.01%      7.77%

28 Experiment: Repair
- Recover all lost data to a new node
- Lost data includes both native blocks and redundancy
- Measure the effective repair throughput T_eff = (1 − f) · N / t, where:
  - f = fraction of redundant blocks
  - N = size of all lost blocks
  - t = time for recovery
[Figure: 4-node switch setting; NCFS regenerates the failed NAS node's data onto a new node over the 1-Gb switch]
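For example, under RAID-5 with n = 4, a quarter of the lost disk's blocks are parity, so f = 1/4. A sketch with hypothetical timing numbers:

    def effective_throughput(f, lost_mb, seconds):
        # T_eff = (1 - f) * N / t: only the regenerated native data counts.
        return (1 - f) * lost_mb / seconds

    # Hypothetical: RAID-5 (n = 4), 128 MB lost, repaired in 10 s -> 9.6 MB/s.
    print(effective_throughput(f=0.25, lost_mb=128, seconds=10))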

29 Repair Throughput
- E-MBR has higher repair throughput, which conforms to the theory
(Repairing a single-node failure)

30 Repair Time Breakdown (repairing a single-node failure, 4-node switch setting)

    Scheme          Encoding  Disk read  Disk write  Others
    RAID-5          —         68.62%     27.67%      0.38%
    RAID-6          —         57.98%     28.70%      0.64%
    E-MBR(k=n-1)    0.096%    46.05%     52.58%      1.27%
    E-MBR(k=n-2)    0.95%     46.70%     51.94%      1.26%

31 Experiment Summary
- Demonstrates how NCFS can be used to test different coding schemes
- E-MBR improves not only repair throughput but also normal upload/download throughput
  - Trade-off: higher storage overhead
- More coding schemes can be plugged into NCFS

32 Open Issues
- Performance optimization
  - Caching strategy
  - Grouping multiple block accesses into one request
  - Parallelization
- Security
  - Confidentiality
  - Byzantine fault tolerance
  - Pollution resilience

33 NCFS References  Source code: