Typhoon: An Ultra-Available Archive and Backup System Utilizing Linear-Time Erasure Codes

Part of OceanStore...

[Architecture diagram: Client, Naming/Location, Cache]

Erasure Codes
– An erasure code is a form of data coding that allows lost portions of data to be recovered
– The idea is similar to ECC, except that the algorithm must be told which portions of the data are missing
– Reed-Solomon codes are a common type of erasure code, but they are computationally expensive and are usually implemented in hardware
(A minimal XOR-parity sketch of the idea follows below.)
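A minimal sketch of what "recovering a lost portion whose position is known" can look like, using the simplest possible erasure code, a single XOR parity block. This is for illustration only (it is not Typhoon's code and recovers at most one lost block); the names and block size are assumptions:

    #include <array>
    #include <cstddef>

    constexpr std::size_t kBlockSize = 512;
    using Block = std::array<unsigned char, kBlockSize>;

    // One XOR parity block over k data blocks.
    Block makeParity(const Block* data, std::size_t k) {
        Block parity{};                               // zero-initialized
        for (std::size_t i = 0; i < k; ++i)
            for (std::size_t b = 0; b < kBlockSize; ++b)
                parity[b] ^= data[i][b];
        return parity;
    }

    // Rebuild the block at index `missing`; as with any erasure code, the
    // decoder must be told which block is gone.
    Block recoverMissing(const Block* data, std::size_t k,
                         std::size_t missing, const Block& parity) {
        Block out = parity;
        for (std::size_t i = 0; i < k; ++i)
            if (i != missing)
                for (std::size_t b = 0; b < kBlockSize; ++b)
                    out[b] ^= data[i][b];             // xor out the surviving blocks
        return out;
    }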

Tornado Codes: A Linear-Time Probabilistic Family of Erasure Codes
– Tornado Codes are linear time, but use probabilistic assumptions to “guarantee” that the decoding process will succeed
– A rate-1/2 erasure code will double the size of a file, and any half of the encoded file can be used to recreate the original data
– Tornado Codes actually require slightly more than half of the encoded file, thus trading network bandwidth for speed
  – The inventors of Tornado Codes report that about 5% extra is typical (e.g., a 1 MB file encoded into 2 MB of nodes needs roughly 1.05 MB worth of nodes collected before decoding completes)

Overview of Encoding Process
– The file is divided into nodes of equal size (e.g. 512 bytes)
– Data nodes are associated with check nodes using a series of bipartite graphs
– The contents of a check node are the XOR of its neighbors (a short encoding sketch follows below)
– The bipartite graphs are created to satisfy mathematical constraints that “guarantee” the recovery process will successfully recover the file
[Diagram: a data file split into data nodes, with edges to check nodes]
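A rough sketch of this encoding step, assuming the bipartite graph is given as an adjacency list (constructing the actual cascade of graphs with the right degree distributions is the hard part of Tornado Codes and is not shown); the names and node size are illustrative:

    #include <array>
    #include <cstddef>
    #include <vector>

    constexpr std::size_t kNodeSize = 512;            // e.g. 512-byte nodes
    using Node = std::array<unsigned char, kNodeSize>;

    // graph[c] lists the data-node indices adjacent to check node c;
    // each check node is the XOR of its data-node neighbors.
    std::vector<Node> encodeChecks(const std::vector<Node>& dataNodes,
                                   const std::vector<std::vector<std::size_t>>& graph) {
        std::vector<Node> checkNodes(graph.size());
        for (std::size_t c = 0; c < graph.size(); ++c) {
            checkNodes[c].fill(0);
            for (std::size_t d : graph[c])
                for (std::size_t b = 0; b < kNodeSize; ++b)
                    checkNodes[c][b] ^= dataNodes[d][b];
        }
        return checkNodes;
    }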

Overview of Encoding Process (continued)
– Once a file is encoded, the data nodes and check nodes are randomly distributed to a set of recipients (a small placement sketch follows below)
[Diagram: data nodes and check nodes scattered across recipient servers]
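A hedged sketch of the placement step, assuming each encoded node is simply assigned to a uniformly random recipient from a known server list; the slide does not spell out the placement policy, so this is an illustration, not Typhoon's implementation:

    #include <cstddef>
    #include <random>
    #include <string>
    #include <vector>

    // Returns, for each of the nodeCount encoded nodes (data and check),
    // the index of the randomly chosen recipient server.
    std::vector<std::size_t> assignNodes(std::size_t nodeCount,
                                         const std::vector<std::string>& servers) {
        std::random_device rd;
        std::mt19937 gen(rd());
        std::uniform_int_distribution<std::size_t> pick(0, servers.size() - 1);

        std::vector<std::size_t> placement(nodeCount);
        for (std::size_t n = 0; n < nodeCount; ++n)
            placement[n] = pick(gen);
        return placement;
    }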

MMX: SIMD or Marketing?
– There are eight MMX registers
– Data in the registers can be divided into four different sizes (packed bytes, packed words, packed doublewords, or a single 64-bit quadword)
– MMX has 57 instructions for 6 types of operations:
  – ADD
  – SUBTRACT
  – MULTIPLY
  – MULTIPLY THEN ADD
  – COMPARISON
  – LOGICAL (AND, NAND, OR, XOR)

MMX: SIMD or Marketing?
– Byte-at-a-time scalar baseline:

    char array1[512];
    char array2[512];

    for (int i = 0; i < 512; ++i)
        array1[i] = array1[i] ^ array2[i];

– MMX is 2.3 times faster than this (1.9x without pipeline scheduling)

MMX: SIMD or Marketing?
– The same XOR done a long at a time:

    char array1[512];
    char array2[512];
    long *array1ptr = (long *)array1;
    long *array2ptr = (long *)array2;

    for (int i = 0; i < 512 / sizeof(long); ++i)
        array1ptr[i] = array1ptr[i] ^ array2ptr[i];

– MMX is 50% faster than this (22% without scheduling)

MMX: SIMD or Marketing?
– Calling an MMX helper on 32 bytes at a time (the routine appears on the following slides):

    char array1[512];
    char array2[512];

    for (int i = 0; i < 512; i += 32)
        xor32fast((long *)(array1 + i), (long *)(array2 + i), (long *)(array1 + i));  // in-place, as in the scalar versions

MMX: SIMD or Marketing?
– Straightforward (unscheduled) MMX version: load 32 bytes from each buffer, XOR them 64 bits at a time, store the result

    inline void xor32bytes(long *array1reg, long *array2reg, long *destreg)
    {
        _asm {
            mov  eax, [array1reg]
            mov  ecx, [array2reg]
            movq mm0, [eax]
            movq mm1, [ecx]
            movq mm2, [eax+8]
            movq mm3, [ecx+8]
            movq mm4, [eax+16]
            movq mm5, [ecx+16]
            movq mm6, [eax+24]
            movq mm7, [ecx+24]
            pxor mm0, mm1          ; 64-bit xor
            pxor mm2, mm3          ; 64-bit xor
            pxor mm4, mm5          ; 64-bit xor
            pxor mm6, mm7          ; 64-bit xor
            mov  ecx, [destreg]
            movq [ecx],    mm0     ; store result
            movq [ecx+8],  mm2     ; store result
            movq [ecx+16], mm4     ; store result
            movq [ecx+24], mm6     ; store result
        }
    }

MMX: SIMD or Marketing?
– Pipeline-scheduled version: loads, XORs, and stores are interleaved so that instructions pair in the Pentium's U and V pipes (the U/V letters in the comments)

    inline void xor32fast(long *array1reg, long *array2reg, long *destreg)
    {
        _asm {
            mov  eax, [array1reg]
            mov  ebx, [array2reg]
            mov  ecx, [destreg]
            movq mm0, [eax]        ; load 1a    U
            movq mm1, [ebx]        ; load 1b    U
            movq mm2, [eax+8]      ; load 2a    U
            pxor mm0, mm1          ; xor 1      V
            movq mm3, [ebx+8]      ; load 2b    U
            movq [ecx], mm0        ; store 1    U
            pxor mm2, mm3          ; xor 2      V
            movq mm4, [eax+16]     ; load 3a    U
            movq mm5, [ebx+16]     ; load 3b    U
            movq mm6, [eax+24]     ; load 4a    U
            pxor mm4, mm5          ; xor 3      V
            movq mm7, [ebx+24]     ; load 4b    U
            movq [ecx+8], mm2      ; store 2    U
            pxor mm6, mm7          ; xor 4      V
            movq [ecx+16], mm4     ; store 3    U
            movq [ecx+24], mm6     ; store 4    U
        }
    }

Overview of Encoding Process
– The server sends a storage announcement to a particular set of servers
  – The set can be determined/specified using multicast groups, a server list, or some form of DNS address lookup
– The announcement is sent over UDP, either unicast or to a multicast group (a hedged socket sketch follows below)

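A hedged sketch of what such an announcement could look like with ordinary POSIX sockets; the message format, multicast group address, and port below are assumptions made for illustration, not Typhoon's actual wire protocol:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string>
    #include <sys/socket.h>
    #include <unistd.h>

    // Send a single UDP datagram announcing that `nodeCount` nodes of file
    // `fileId` are available for storage, to an assumed multicast group.
    int announceStorage(const std::string& fileId, int nodeCount) {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        if (sock < 0) return -1;

        sockaddr_in group{};
        group.sin_family = AF_INET;
        group.sin_port   = htons(9999);                      // assumed port
        inet_pton(AF_INET, "239.0.0.1", &group.sin_addr);    // assumed group address

        std::string msg = "STORE " + fileId + " " + std::to_string(nodeCount);
        ssize_t sent = sendto(sock, msg.data(), msg.size(), 0,
                              reinterpret_cast<const sockaddr*>(&group), sizeof(group));
        close(sock);
        return sent < 0 ? -1 : 0;
    }

For a unicast announcement, the same call would simply target one server's address instead of a group.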

Overview of Encoding Process
– The server encodes the file
– During the encoding process, the data nodes and check nodes are [randomly] distributed to other servers

Overview of Decoding Process
– A set of nodes is received, ideally with a random distribution
– Check nodes can be used to recover missing data nodes, but only a check node that is missing exactly one neighbor can recreate a data node
– The structure of the graph ensures [w.h.p.] that the decoding process will succeed
  – The graph is designed so that there is always at least one check node that is missing only one child
  – Data nodes can be used to recover check nodes, but that is not important here
(A sketch of this peeling step follows below.)
[Diagram: check nodes and data-file nodes marked “Node Received” / “Node Not Received”]
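A sketch of the peeling step, reusing the adjacency-list assumption from the encoding sketch above; for brevity it assumes every check node was received (missing check nodes could first be rebuilt from their data-node neighbors, as the slide notes):

    #include <array>
    #include <cstddef>
    #include <vector>

    constexpr std::size_t kNodeSize = 512;
    using Node = std::array<unsigned char, kNodeSize>;

    // Repeatedly find a check node with exactly one missing data-node neighbor
    // and rebuild that neighbor by XOR-ing the check node with its received
    // neighbors. Returns false if decoding stalls before all data is recovered.
    bool peelDecode(std::vector<Node>& dataNodes, std::vector<bool>& haveData,
                    const std::vector<Node>& checkNodes,
                    const std::vector<std::vector<std::size_t>>& graph) {
        bool progress = true;
        while (progress) {
            progress = false;
            for (std::size_t c = 0; c < graph.size(); ++c) {
                std::size_t missingCount = 0, missingIdx = 0;
                for (std::size_t d : graph[c])
                    if (!haveData[d]) { ++missingCount; missingIdx = d; }
                if (missingCount != 1) continue;      // usable only with exactly one gap
                Node rebuilt = checkNodes[c];
                for (std::size_t d : graph[c])
                    if (d != missingIdx)
                        for (std::size_t b = 0; b < kNodeSize; ++b)
                            rebuilt[b] ^= dataNodes[d][b];
                dataNodes[missingIdx] = rebuilt;
                haveData[missingIdx]  = true;
                progress = true;
            }
        }
        for (bool h : haveData)
            if (!h) return false;                     // decoding stalled
        return true;
    }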

Overview of Decoding Process
– The server sends a file request announcement to a particular set of servers and retrieves data from multiple servers simultaneously
– The recovery process can be performed in parallel with the receive (network-based RAID-1)
– Depending on the data loss pattern, a particular subset of the servers can be selected (a small selection sketch follows the list):
  – the fastest servers (closest, or least utilized)
  – operational servers (i.e., some portion of the set is not functioning)
  – all servers might be needed in some cases, such as network congestion / packet loss
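A small sketch of that selection step; the server descriptor and its fields are assumptions made for illustration, not Typhoon's actual interfaces:

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <vector>

    struct ServerInfo {
        std::string address;
        double      rttMs;          // measured or estimated response time
        bool        operational;    // currently reachable and functioning
    };

    // Keep only operational servers, then take the `needed` fastest ones;
    // if fewer remain, all of them are used (e.g. under heavy packet loss).
    std::vector<std::string> pickServers(std::vector<ServerInfo> servers,
                                         std::size_t needed) {
        servers.erase(std::remove_if(servers.begin(), servers.end(),
                          [](const ServerInfo& s) { return !s.operational; }),
                      servers.end());
        std::sort(servers.begin(), servers.end(),
                  [](const ServerInfo& a, const ServerInfo& b) { return a.rttMs < b.rttMs; });
        if (servers.size() > needed)
            servers.resize(needed);

        std::vector<std::string> chosen;
        for (const ServerInfo& s : servers)
            chosen.push_back(s.address);
        return chosen;
    }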

Architecture
[Architecture diagram: Client, Naming/Location, Cache]

Architecture
– What did we implement? Client, cache, naming and location mechanism, replication mechanism, filestore
– What did we test?
  – Communication: explicit communication (unicast TCP request, TCP response) and implicit communication (multicast request, TCP response)
  – Network: servers distributed throughout the Berkeley domain; network delay simulated by randomizing response times
  – Caching: none, to model the worst case
  – Simulation: stressed the Typhoon system by replaying the requests of a 24-hour NFS trace over a 3-hour period (a small replay sketch follows below)
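A rough sketch of that kind of accelerated trace replay; the record format, callback, and the 8x speed-up implied by compressing 24 hours into 3 are assumptions based on this slide, not the actual test harness:

    #include <chrono>
    #include <string>
    #include <thread>
    #include <vector>

    struct TraceRecord {
        double      seconds;   // request time offset within the original trace
        std::string request;   // e.g. the file being asked for
    };

    // Replay the trace with inter-request gaps divided by `speedup`;
    // issueRequest stands in for whatever sends the request to Typhoon.
    void replayTrace(const std::vector<TraceRecord>& trace,
                     void (*issueRequest)(const std::string&),
                     double speedup = 8.0) {
        double prev = trace.empty() ? 0.0 : trace.front().seconds;
        for (const TraceRecord& rec : trace) {
            double gap = (rec.seconds - prev) / speedup;
            if (gap > 0)
                std::this_thread::sleep_for(std::chrono::duration<double>(gap));
            issueRequest(rec.request);
            prev = rec.seconds;
        }
    }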

Benefits of Typhoon
– Data is ultra-available: up to half of the servers can fail before availability is affected
– Fast file retrieval: data can be retrieved simultaneously from multiple servers
  – The system can choose to use the fastest machines in a set of servers
  – Load balancing can be achieved because slow or heavily utilized servers are not used
  – Information can be dispersed geographically
    – This increases the accessibility of data in the event of a major disaster, such as an earthquake
    – It can also benefit people who travel to remote locations, since data may be closer to them
  – Multicast can be used to reduce latency
– Low-overhead algorithms: the encoding and decoding algorithms are linear-time
– The disk overhead of the system can be adjusted (a rate-1/2 encoding typically doubles the size of a file)

Conclusion
– Tornado Codes are significantly faster than Cauchy Reed-Solomon codes
– A Typhoon-based system can keep up with the request rate of a loaded NFS server
– Typhoon is a viable solution for increasing the reliability and accessibility of data
