
EC-Cache: Load-balanced, Low-latency Cluster Caching with Online Erasure Coding
K. V. Rashmi, Mosharaf Chowdhury, Jack Kosaian, Ion Stoica, Kannan Ramchandran
Presented by Haoran Wang (EECS 582 – F16)

Background
- Trend of in-memory caching for object stores, e.g. Alluxio (Tachyon): reduces a lot of disk I/O
- Load imbalance remains, due to skew in object popularity and background network traffic congestion

Erasure coding in a nutshell
- Fault tolerance in a space-efficient fashion
- Trades data availability for space efficiency

Replication
Objects X, Y, Z are stored with replication factor 2: two copies of each (X Y Z | X Y Z), six storage units in total.

Erasure Coding
Objects X, Y, Z are encoded into two linear parities W and T (k = 3 data units, r = 2 parity units):
$w = a_{11}x + a_{12}y + a_{13}z$
$t = a_{21}x + a_{22}y + a_{23}z$
using a well-known coefficient matrix $\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{pmatrix}$.
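To make the encoding step concrete, here is a minimal sketch in Python. The coefficient values are illustrative (not from the paper), and plain integer arithmetic is used for readability; real erasure codes such as Reed-Solomon perform the same linear combination over a finite field.

```python
# Minimal sketch of linear-parity encoding, with illustrative coefficients.
# Real codes (e.g. Reed-Solomon) do this arithmetic over a finite field.

A = [
    [1, 1, 1],  # a11, a12, a13 -> parity w
    [1, 2, 3],  # a21, a22, a23 -> parity t
]

def encode(data):
    """Return the r parity units for the k data units."""
    return [sum(a * d for a, d in zip(row, data)) for row in A]

x, y, z = 7, 4, 9          # k = 3 data units
w, t = encode([x, y, z])   # r = 2 parity units
print(w, t)                # 20 42 -- stored alongside x, y, z
```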

Fault tolerance - Replication
- One failure: always survivable (a copy of each of X, Y, Z remains).
- Two failures: sometimes survivable, but not if both copies of the same object are lost.
- Total storage is 6 units, yet only one failure is guaranteed to be survivable.
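A tiny sketch that checks this claim: with 2x replication of X, Y, Z across six units, every single failure is survivable, but some pairs of failures (both copies of one object) are not. The unit names below are made up for the illustration.

```python
from itertools import combinations

# Six stored units under 2x replication of objects X, Y, Z (names are illustrative).
units = ["X1", "X2", "Y1", "Y2", "Z1", "Z2"]

def survives(failed):
    """The data survives iff at least one copy of every object remains."""
    alive = set(units) - set(failed)
    return all(any(u.startswith(obj) for u in alive) for obj in "XYZ")

print(all(survives(f) for f in combinations(units, 1)))  # True: any single failure is fine
print(all(survives(f) for f in combinations(units, 2)))  # False: losing X1 and X2 loses X
```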

Fault tolerance - EC
With k = 3 data units and r = 2 linear parities (same coefficient matrix as above), the five stored units translate into linear equations:
$a_{11}X + a_{12}Y + a_{13}Z = w$
$a_{21}X + a_{22}Y + a_{23}Z = t$
$X = x, \quad Y = y, \quad Z = z$
where $X$ is the symbol representing the data and $x$ is its actual value.
Losing a unit removes one equation, but the system stays solvable with a unique solution: 5 equations in 3 variables can lose ANY 2 and still determine X, Y, Z, because N variables can be solved from N linearly independent equations.
So only 5 storage units are needed to survive any 2 failures, compared to replication's 6 units for 1 guaranteed failure.
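A small sketch of that reconstruction argument, under the same illustrative coefficients: stack the three identity equations and the two parity equations into one generator matrix, then check that every 3-of-5 subset of stored units recovers the data. With these particular coefficients every 3x3 submatrix happens to be invertible; real MDS codes guarantee this by construction over a finite field.

```python
import numpy as np
from itertools import combinations

# Generator matrix: 3 identity rows (the data itself) + 2 parity rows.
# Coefficients are illustrative, matching the encoding sketch above.
G = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 1],  # parity w
    [1, 2, 3],  # parity t
], dtype=float)

data = np.array([7.0, 4.0, 9.0])  # x, y, z
stored = G @ data                 # the 5 stored units: [x, y, z, w, t]

# Any 3 surviving units give a solvable 3x3 system in X, Y, Z.
for survivors in combinations(range(5), 3):
    rows = list(survivors)
    recovered = np.linalg.solve(G[rows], stored[rows])
    assert np.allclose(recovered, data)
print("recovered x, y, z from every 3-of-5 subset of units")
```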

What’s the tradeoff? Saving space vs. the time needed to reconstruct data (solving the linear equations).

Switch of context
- EC for fault tolerance → EC for load balancing and low latency
- Granularity: single data object → splits of an object
- What about the tradeoff? EC-Cache deals primarily with in-memory caches, so reconstruction time is not a problem.

Load balancing
With 4 pieces of data (hot, hot, cold, cold), pairing one hot with one cold object per server looks balanced. But what if one object is super hot (super hot >> hot + cold + cold)? No placement of whole objects can balance that load.

Load balancing - Introducing splits
- “Distribute” the hotness of a single object across servers
- Also good for read latency: splits can be read in parallel

But what’s wrong with split-level replication?
Splits X, Y, Z are replicated across six servers. The figure shows replication with a static factor of 2, but the problem remains for selective replication: retrieving the whole object still requires at least one copy of every split (X, Y, Z).

The flexibility of EC
Units X, Y, Z, W, T are spread across five servers; any k (= 3) of the k + r (= 5) units suffice to reconstruct the original object.

EC-Cache: Dealing with stragglers/failures
A failed or slow server is OK: read $k + \Delta$ ($\Delta < r$) out of the $k + r$ units and finish as soon as the first $k$ reads complete.
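A minimal sketch of that late-binding read policy: issue k + Δ parallel reads and return once any k complete, ignoring the straggler. The read_split function, the sleep times, and the server numbering are all made up for the illustration.

```python
import concurrent.futures as cf
import random
import time

K, DELTA = 3, 1  # read K + DELTA of the K + R stored units

def read_split(server_id):
    """Stand-in for fetching one unit over the network; server 2 plays the straggler."""
    time.sleep(1.0 if server_id == 2 else random.uniform(0.01, 0.05))
    return server_id

def degraded_read(servers):
    """Issue K + DELTA reads in parallel; return as soon as any K of them finish."""
    pool = cf.ThreadPoolExecutor(max_workers=K + DELTA)
    futures = [pool.submit(read_split, s) for s in servers[:K + DELTA]]
    done = []
    for fut in cf.as_completed(futures):
        done.append(fut.result())
        if len(done) == K:             # enough units to decode the object
            pool.shutdown(wait=False)  # don't wait for the straggler
            return done

print(degraded_read([0, 1, 2, 3, 4]))  # e.g. [0, 3, 1] -- the read from server 2 is ignored
```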

Writes
- Create splits of the object
- Encode them into parity units
- Distribute the k + r units uniformly at random to distinct servers (a placement sketch follows below)
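A minimal sketch of the placement step only, assuming the k data splits and r parities have already been produced; place_units and the server names are illustrative, not the paper's API.

```python
import random

def place_units(units, servers):
    """Place the k + r encoded units of one object on k + r distinct servers,
    chosen uniformly at random."""
    # Sampling without replacement guarantees distinct servers.
    targets = random.sample(servers, len(units))
    return {server: unit for server, unit in zip(targets, units)}

# e.g. k = 3 data splits plus r = 2 parities scattered over 5 of 10 cache servers
placement = place_units(["X", "Y", "Z", "W", "T"],
                        [f"server-{i}" for i in range(10)])
print(placement)
```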

Implementation
- Built on top of Alluxio (Tachyon); transparent to the backend storage/caching servers
- Reed-Solomon codes instead of the simple linear parities used in the example above, with Intel ISA-L for faster encoding/decoding
- ZooKeeper manages the metadata: location of splits, mapping of objects to splits, and which splits to write (an illustrative layout is sketched below)
- The backend storage server takes care of the “ultimate” fault tolerance; EC-Cache deals only with in-memory caches
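For concreteness, here is a guess at the shape of the metadata listed above (object → splits, split → server). The actual ZooKeeper znode layout used by EC-Cache is not described in the slides; the names and numbers are assumptions.

```python
# Illustrative per-object metadata: which units an object maps to and which
# cache server currently holds each unit. Not the real ZooKeeper schema.
metadata = {
    "object-123": {
        "k": 10,
        "r": 4,
        "splits": {  # unit id -> cache server currently holding that unit
            "object-123/unit-0": "server-7",
            "object-123/unit-1": "server-2",
            "object-123/unit-13": "server-9",
        },
    },
}
lookup = metadata["object-123"]["splits"]["object-123/unit-0"]  # "server-7"
```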

Evaluation
Metrics:
- Load imbalance: $\lambda = \frac{L_{max} - L_{avg}}{L_{avg}} \times 100$
- Latency improvement: $\frac{L_{\text{compared scheme}}}{L_{\text{EC-Cache}}}$
Setting:
- Selective replication vs. EC-Cache
- Zipf distribution of object popularity, highly skewed (p = 0.9)
- 15% memory overhead allowed
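For concreteness, a small sketch of the two metrics exactly as defined above; the example loads and latencies are made up.

```python
def load_imbalance(server_loads):
    """Percent load imbalance: lambda = (L_max - L_avg) / L_avg * 100."""
    avg = sum(server_loads) / len(server_loads)
    return (max(server_loads) - avg) / avg * 100

def latency_improvement(l_compared, l_ec_cache):
    """Ratio of a compared scheme's latency to EC-Cache's latency."""
    return l_compared / l_ec_cache

print(load_imbalance([120, 80, 100, 100]))  # 20.0 (made-up loads)
print(latency_improvement(9.0, 3.0))        # 3.0x (made-up latencies)
```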

Evaluation – Read latency

Evaluation – Load Balancing

Evaluation – Varying object sizes (median latency and tail latency)

Evaluation – Tail latency improvement from additional reads

Summary
- EC used for in-memory object caching: effective for load balancing and reducing read latency
- Suitable workloads:
  - Immutable objects, because an update to any data unit would require re-encoding a new set of parities
  - A cut-off for small objects: the overhead cannot be amortized, so selective replication is still used for them

Discussions
- EC-Cache adds overheads for encoding/decoding, metadata, and networking. In what contexts can they be tolerated or amortized?
- What design would work well for frequent in-place updates? Are there any existing solutions?
- What eviction policy could minimize evicting splits that current read requests still need?
Notes from the paper reviews:
- Many reviews discussed the overhead of encoding/decoding and of networking (TCP connection overhead for small splits).
- In my understanding, encoding/decoding can be justified because in-memory computation is fast.
- Networking/metadata overhead can be amortized if EC-Cache is used only for large objects, with k and r chosen so that the splits are large enough too.
- In-place updates are another popular concern; are there any existing solutions for that?