Storage Codes: Managing Big Data with Small Overheads. Presented by Anwitaman Datta & Frédérique E. Oggier, Nanyang Technological University, Singapore. Tutorial at NetCod 2013, Calgary, Canada. © 2013, A. Datta & F. Oggier, NTU Singapore.

Big Data Storage: Disclaimer. A note from the trenches: "You know you have a large storage system when you get paged at 1 AM because you only have a few petabytes of storage left." – from Andrew Fikes' (Principal Engineer, Google) faculty summit talk "Storage Architecture and Challenges". We never get such calls!!

Big data. A June 2011 EMC study found that the world's data is more than doubling every 2 years – faster than Moore's Law – with 1.8 zettabytes* of data to be created in 2011. Big data: big problem, or big opportunity? *Zetta = 10^21. If you stored all of this data on DVDs, the stack would reach from the Earth to the moon and back.

The data deluge: some numbers. Facebook "currently" (in 2010) stores over 260 billion images, which translates to over 20 petabytes of data. Users upload one billion new photos (60 terabytes) each week, and Facebook serves over one million images per second at peak. [quoted from a Facebook paper on "Haystack"] On "Saturday", photo number four billion was uploaded to the photo sharing site Flickr. This comes just five and a half months after the 3 billionth, and nearly 18 months after photo number two billion. – Mashable (13 October 2009)

Scale how? To scale vertically (or scale up) means to add resources to a single node in a system*. To scale horizontally (or scale out) means to add more nodes to a system, such as adding a new computer to a distributed software application*. At this data volume, not distributing is not an option! *Definitions from Wikipedia.

Failure (of parts) is inevitable. But failure of the system is not an option either! – Failure is the pillar of rivals' success …

Deal with it. Data from Los Alamos National Laboratory (DSN 2006), gathered over 9 years on 4,750 machines. Distribution of failures: Hardware 60%; Software 20%; Network/Environment/Humans 5%. Failures occurred between once a day and once a month.

Failure happens without fail, so … Solution: Redundancy.

Many levels of redundancy: physical, virtual resource, availability zone, region, cloud. From: oreilly.com/2011/04/the-aws-outage-the-clouds-shining-moment.html

Redundancy-based fault tolerance. Replicate data – e.g., 3 or more copies – in nodes on different racks, which can deal with switch failures. Power back-up using batteries between racks (Google).

But at what cost? Failure is not an option, but … are the overheads acceptable?

Reducing the overheads of redundancy. Erasure codes offer much lower storage overhead with a high level of fault-tolerance, in contrast to replication or RAID-based systems. They have the potential to significantly improve the "bottom line" – e.g., both Google's new DFS Colossus and Microsoft's Azure now use ECs.

Erasure Codes (ECs). An (n,k) erasure code is a map that takes as input k blocks and outputs n blocks, thus introducing n−k blocks of redundancy. 3-way replication is a (3,1) erasure code: k=1 block is encoded into n=3 encoded blocks. An erasure code such that the k original blocks can be recreated out of any k encoded blocks is called MDS (maximum distance separable).
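
To make the definition concrete (an illustration of ours, not from the tutorial): a single XOR parity already gives a small (3,2) MDS code, where any k=2 of the n=3 stored blocks recover the data. A minimal Python sketch, assuming equal-length byte-string blocks:

```python
# A (3,2) MDS erasure code via one XOR parity: any 2 of 3 blocks recover the data.
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(o1: bytes, o2: bytes) -> list:
    # n=3 encoded blocks: the two data blocks plus their parity (systematic code).
    return [o1, o2, xor(o1, o2)]

def decode(b1, b2, p):
    # At most one of the three arguments is None (erased).
    if b1 is None:
        b1 = xor(b2, p)          # o1 = o2 XOR (o1 XOR o2)
    elif b2 is None:
        b2 = xor(b1, p)          # o2 = o1 XOR (o1 XOR o2)
    return b1, b2

o1, o2 = b"hello---", b"world!!!"
b1, b2, p = encode(o1, o2)
assert decode(None, b2, p) == (o1, o2)   # survives losing any single block
assert decode(b1, None, p) == (o1, o2)
assert decode(b1, b2, None) == (o1, o2)
```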

Erasure Codes (ECs), originally designed for communication – EC(n,k): the data is a message of k blocks O1, O2, …, Ok. Encoding produces n encoded blocks B1, B2, …, Bn. Some blocks may be lost in transit; receive any k′ (≥ k) blocks, and decoding reconstructs the original k blocks O1, O2, …, Ok, i.e., the data.

Erasure codes for networked storage: the data is an object of k blocks O1, O2, …, Ok. Encoding produces n encoded blocks B1, B2, …, Bn, stored in storage devices in a network. Some blocks may be lost; retrieve any k′ (≥ k) blocks, and decoding reconstructs the original k blocks O1, O2, …, Ok, i.e., the data.

HDFS-RAID. Distributed RAID File System (DRFS) client – provides application access to the files in the DRFS, and transparently recovers any corrupt or missing blocks encountered when reading a file (degraded read); it does not carry out repairs. RaidNode – a daemon that creates and maintains parity files for all data files stored in the DRFS. BlockFixer – periodically recomputes blocks that have been lost or corrupted. RaidShell – allows on-demand repair triggered by an administrator. Two kinds of erasure codes are implemented – an XOR code and a Reed-Solomon code (typically 10+4, with 1.4x overhead).
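
As a quick back-of-the-envelope check of the overhead figure quoted above (our arithmetic, not code from HDFS-RAID): the storage overhead of an (n,k) code is n/k, so the "10+4" Reed-Solomon configuration stores 14 blocks per 10 blocks of data.

```python
# Storage overhead = n/k: raw bytes stored per byte of user data.
def overhead(n: int, k: int) -> float:
    return n / k

print(overhead(3, 1))    # 3-way replication, a (3,1) code -> 3.0x
print(overhead(14, 10))  # HDFS-RAID's "10+4" Reed-Solomon code -> 1.4x
# RS(14,10) tolerates any 4 lost blocks; 3-way replication tolerates only 2
# lost copies -- better fault tolerance at less than half the footprint.
```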

Replenishing lost redundancy for ECs. When encoded blocks are lost: retrieve any k′ (≥ k) of the surviving blocks B1, …, Bn; decode to recover the original k blocks O1, …, Ok; re-encode to recreate the lost blocks; and reinsert them into (new) storage devices, so that there are (again) n encoded blocks. Repair is needed for long-term resilience. Repairs are expensive!
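
A sketch of why this is costly for a classical MDS code (illustrative accounting of ours): even a single lost block forces a repair node to fetch k whole blocks, decode, and re-encode.

```python
# Repair traffic, in blocks moved over the network, to rebuild lost blocks.
def mds_repair_traffic(k: int, lost: int = 1) -> int:
    # Classical (n,k) MDS repair: download any k surviving blocks, decode,
    # re-encode -- k blocks in, however few blocks were actually lost.
    return k

def replication_repair_traffic(lost: int = 1) -> int:
    return lost  # just copy each lost block from a surviving replica

print(mds_repair_traffic(k=10))      # RS(14,10): 10 blocks moved to rebuild 1
print(replication_repair_traffic())  # replication: 1 block moved
```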

Tailor-made codes for storage. Desired code properties include: low storage overhead; good fault tolerance (traditional MDS erasure codes achieve these); better repairability – smaller repair fan-in, reduced I/O for repairs, the possibility of multiple simultaneous repairs, fast repairs, efficient bandwidth usage, …; and better … – better data insertion, better migration to archival, …

Pyramid (Local Reconstruction) Codes – good for degraded reads (data locality), but not all repairs are cheap (only partial parity locality). Refs: Pyramid Codes: Flexible Schemes to Trade Space for Access Efficiency in Reliable Data Storage Systems, C. Huang et al., NCA 2007; Erasure Coding in Windows Azure Storage, C. Huang et al., USENIX ATC 2012.

Regenerating codes: network information flow based arguments determine the "optimal" trade-off between storage and repair bandwidth.

Locally repairable codes: codes satisfying low repair fan-in for any failure. The name is reminiscent of "locally decodable codes".

Self-repairing codes. Usual disclaimer: "to the best of our knowledge", these were the first instances of locally repairable codes: Self-repairing Homomorphic Codes for Distributed Storage Systems (Infocom 2011); Self-repairing Codes for Distributed Storage Systems – A Projective Geometric Construction (ITW 2011). Since then, there have been many other instances from other researchers/groups. Note – k encoded blocks are enough to recreate the object. Caveat: not any arbitrary k blocks (i.e., SRCs are not MDS); however, there are many such k-block combinations.

Self-repairing codes: blackbox view. n encoded blocks B1, …, Bn are stored in storage devices in a network. When a block is lost, retrieve some k″ (< k) blocks (e.g., k″ = 2) to recreate the lost block, and reinsert it into a (new) storage device, so that there are (again) n encoded blocks.

PSRC example. Repair using two nodes (say N1 and N3), with four pieces needed to regenerate two pieces: (o1+o2+o4) + (o1) => o2+o4; (o3) + (o2+o3) => o2; (o1) + (o2) => o1+o2. Repair using three nodes (say N2, N3 and N4), with three pieces needed to regenerate two pieces: (o1+o2+o4) + (o4) => o1+o2; (o2) + (o4) => o2+o4.
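
The arrows above are plain XOR identities over the pieces o1, …, o4, so they can be sanity-checked directly. A sketch of ours (the full node layout of the PSRC is not captured in this transcript, so only the identities are verified):

```python
import os

def xor(*parts: bytes) -> bytes:
    # XOR an arbitrary number of equal-length byte strings.
    out = bytes(len(parts[0]))
    for p in parts:
        out = bytes(a ^ b for a, b in zip(out, p))
    return out

o1, o2, o3, o4 = (os.urandom(8) for _ in range(4))

# Repair using two nodes (four pieces downloaded):
assert xor(xor(o1, o2, o4), o1) == xor(o2, o4)  # (o1+o2+o4) + (o1) => o2+o4
assert xor(o3, xor(o2, o3)) == o2               # (o3) + (o2+o3)    => o2
combined = xor(o1, o2)                          # (o1) + (o2)       => o1+o2

# Repair using three nodes (three pieces downloaded):
assert xor(xor(o1, o2, o4), o4) == xor(o1, o2)  # (o1+o2+o4) + (o4) => o1+o2
```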

Recap, and next. [Figure: replicas alongside erasure coded data.] Data access: fault-tolerant data access (MSR's Reconstruction code). Repair: (partial) re-encode/repair (e.g., Self-Repairing Codes). Data insertion: pipelined insertion (e.g., data for analytics), and – next – in-network coding (e.g., multimedia).

Inserting redundant data. Replicas can be inserted in a pipelined manner; traditionally, erasure coded systems have instead used a central point of processing. Can the process of redundancy generation be distributed among the storage nodes?

In-network coding. Ref: In-Network Redundancy Generation for Opportunistic Speedup of Backup, Future Generation Computer Systems 29(1). Motivations – Reduce the bottleneck at a single point: the "source" (or first point of processing) still needs to inject "enough information" for the network to be able to carry out the rest of the redundancy generation. Utilize network resources opportunistically: in data-centers, when network/nodes are not busy doing other things; in P2P/F2F systems, when nodes are online. The trade-off: more traffic (an ephemeral resource), but higher data-insertion throughput.

In-network coding. Dependencies among self-repairing coded fragments can be exploited for in-network coding! Consider a SRC(7,3) code with the following dependencies: r1, r2, r3=r1+r2, r4, r5=r1+r4, r6=r2+r4, r7=r1+r6. Note: r6=r1+r7, etc. also hold [details in the Infocom 2011 paper]. A naïve approach: the source inserts r1, r2 and r4, and the network then derives r6 and r7. A good "schedule" is needed to insert the redundancy – one that avoids cycles/dependencies – subject to "unpredictable" availability of resources. Finding it turns out to be O(n!) even with an oracle.
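
The dependencies are again XOR relations, so the naïve schedule above (insert r1, r2, r4; derive r6, then r7) can be checked in a few lines. A sketch of ours, treating + as XOR as in the slides:

```python
import os

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

r1, r2, r4 = (os.urandom(8) for _ in range(3))  # fragments the source inserts
r3 = xor(r1, r2)   # derivable in-network from r1 and r2
r5 = xor(r1, r4)   # ... from r1 and r4
r6 = xor(r2, r4)   # ... from r2 and r4
r7 = xor(r1, r6)   # ... from r1 and the already-derived r6

# Alternative dependencies also hold, e.g. r6 = r1 + r7, which is what lets
# the network derive "missing" fragments from whichever happen to be available:
assert xor(r1, r7) == r6
```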

In-network coding: heuristics and results. Several policies (such as max data) were tried; RndFlw was the best heuristic among those we tried. It provided 40% (out of a possible 57%) bandwidth savings at the source for a SRC(7,3) code, and an increase in data-insertion throughput. No free lunch: overall network traffic increased by 20-30%.

Recap, and next. In addition to the above – pipelined insertion (e.g., data for analytics), in-network coding (e.g., multimedia), fault-tolerant data access (MSR's Reconstruction code), and (partial) re-encode/repair (e.g., Self-Repairing Codes) – next comes archival of "cold" data.

RapidRAID. Ref: RapidRAID: Pipelined Erasure Codes for Fast Data Archival in Distributed Storage Systems (Infocom 2013). It has some local repairability properties, but that aspect is yet to be explored. Another code (ICDCN 2013): Decentralized Erasure Coding for Efficient Data Archival in Distributed Storage Systems – a systematic code (unlike RapidRAID), found using numerical methods; a general theory for the construction of such codes, as well as their repairability properties, remain open issues. Problem statement: can the existing (replication based) redundancy be exploited to create an erasure coded archive?

Slight change of view. [Figure: two ways to look at data replicated across storage nodes S1-S4.]

RapidRAID: decentralizing the hitherto centralized encoding process.

RapidRAID – example (8,4) code. [Figures: initial configuration; logical phase 1: pipelined coding; logical phase 2: further local coding; the resulting linear code.]
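
A minimal sketch of the phase-1 pipelining idea, under simplifying assumptions of ours: coefficients are plain XOR rather than the finite-field coefficients an actual RapidRAID code uses, and only one pipelined block is produced. Each node folds its local replica data into a running partial block and forwards it, so no single node has to gather all the data:

```python
from functools import reduce
import os

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Four nodes, each already holding one replica block of the object.
local_blocks = [os.urandom(8) for _ in range(4)]

# Logical phase 1 (pipelined coding): a partial code block flows through the
# nodes; each node XORs in its local contribution before forwarding it.
partial = bytes(8)
for block in local_blocks:
    partial = xor(partial, block)   # this node's contribution
    # ... `partial` is then sent to the next node in the pipeline ...

# The pipeline ends with the same block a centralized encoder would compute.
assert partial == reduce(xor, local_blocks)
# Logical phase 2 (further local coding) would combine such pipelined blocks
# locally at each node to produce the final stored code blocks.
```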

RapidRAID: some results. [Figures.]

The big picture. Agenda: a composite system achieving all of these properties – pipelined insertion (e.g., data for analytics), in-network coding (e.g., multimedia), archival of "cold" data, fault-tolerant data access (MSR's Reconstruction code), and (partial) re-encode/repair (e.g., Self-Repairing Codes).

Wrapping up: a moment of reflection. Revisiting repairability – an engineering alternative: Redundantly Grouped Cross-object Coding for Repairable Storage (APSys 2012); The CORE Storage Primitive: Cross-Object Redundancy for Efficient Data Repair & Access in Erasure Coded Storage (arXiv preprint), with an HDFS-RAID compatible implementation.

Wrapping up: a moment of reflection. [Figure: each object i is erasure coded into pieces e_i1, …, e_in (erasure coding of individual objects); across objects, the j-th pieces e_1j, …, e_mj are protected by a RAID-4 parity p_j (RAID-4 of erasure coded pieces of different objects) – reminiscent of product codes!]
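
A minimal sketch of this layout, assuming XOR for the RAID-4 parity and placeholder random bytes standing in for the erasure coded pieces: m objects each contribute their j-th piece to one cross-object parity p_j, so a single lost piece is repairable from its column alone.

```python
from functools import reduce
import os

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

m, n, size = 4, 6, 8   # m objects, n encoded pieces per object (illustrative)
# e[i][j]: j-th encoded piece of object i. Stand-ins here; a real system
# produces these with its base erasure code (e.g. Reed-Solomon).
e = [[os.urandom(size) for _ in range(n)] for _ in range(m)]

# RAID-4 style parity across objects: one parity per column of pieces.
p = [reduce(xor, (e[i][j] for i in range(m))) for j in range(n)]

# Repair the lost piece e[2][5] from its column: m-1 sibling pieces + parity.
rebuilt = reduce(xor, [e[i][5] for i in range(m) if i != 2] + [p[5]])
assert rebuilt == e[2][5]   # repair fan-in is m, not the base code's k
```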

Separation of concerns. Two distinct design objectives for distributed storage systems – fault-tolerance and repairability. An extremely simple idea – introduce two different kinds of redundancy: any (standard) erasure code for fault-tolerance, and RAID-4 like parity (across encoded pieces of different objects) for repairability.

CORE repairability. Choosing a suitable m < k yields a reduction in the data transferred for repair, and disentangles repair fan-in from the base code parameter k. A large k may be desirable for faster (parallel) data access, yet codes typically trade off repair fan-in, the code parameter k, and the code's storage overhead (n/k). However, the gain from reduced fan-in is probabilistic (consider i.i.d. failures with probability f). It is also possible to reduce repair time by pipelining data through the live nodes and computing partial parities.
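
One way to see the probabilistic caveat (a rough model of ours, not a result from the tutorial): the cheap column repair of one lost piece needs its m−1 sibling pieces and the column parity – m storage units in all – to be alive; otherwise the repair falls back on the base erasure code with fan-in k.

```python
# P(cheap repair) for a single lost piece, with i.i.d. piece failures of
# probability f: the other m-1 pieces in its column plus the column parity
# (m units in total) must all have survived.
def p_cheap_repair(m: int, f: float) -> float:
    return (1.0 - f) ** m

for f in (0.01, 0.05, 0.10):
    print(f"f={f:.2f}: cheap (fan-in m) repair applies w.p. {p_cheap_repair(4, f):.3f}")
```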

Interested? Follow the work, and get involved. Also see two surveys on (repairability of) storage codes – one short, at a high level (SIGACT Distributed Computing News, March 2013), and one detailed (FnT, June 2013). ଧନ୍ଯବାଦ (Thank you)