CubicRing: ENABLING ONE-HOP FAILURE DETECTION AND RECOVERY FOR DISTRIBUTED IN-MEMORY STORAGE SYSTEMS Yiming Zhang, Chuanxiong Guo, Dongsheng Li, Rui Chu, Haitao Wu, Yongqiang Xiong


CubicRing: ENABLING ONE-HOP FAILURE DETECTION AND RECOVERY FOR DISTRIBUTED IN-MEMORY STORAGE SYSTEMS
Yiming Zhang, Chuanxiong Guo, Dongsheng Li, Rui Chu, Haitao Wu, Yongqiang Xiong
Presented by Kirill Varshavskiy

CURRENT SYSTEMS
Low-latency, large-scale storage systems with recovery techniques: all data is kept in RAM, backups live on designated backup servers.
RAMCloud: all primary data is kept in RAM, with redundant backups on disk; backup servers are selected randomly for even distribution.
CMEM: creates elastic in-memory storage clusters, each with one synchronous backup server.

CURRENT SYSTEM DRAWBACKS
Effective, but with several important flaws:
Recovery traffic congestion: on server failure, surges of recovery traffic cause in-network congestion.
False failure detection: transient network failures that inflate heartbeat RTTs may cause false positives, which trigger unnecessary recoveries.
ToR switch failures: Top-of-Rack (ToR) switches may fail, taking working servers with them.

INTUITION BEHIND THE PAPER
Reducing the distance to backup servers improves reliability.
One-hop communication ensures low-latency recovery over high-speed InfiniBand (or Ethernet) links.
A designated recovery mapping provides coordinated, parallelized recovery.
Robust heartbeating can prevent false positives.
Efficient backup techniques should not significantly impact availability.

RECOVERY PLANS
Primary-Recovery-Backup:
Primary servers keep all data in RAM.
Backup servers write backups to disk.
Recovery servers are the servers onto which a failed primary server's data is recovered; each recovery server stores the backup mappings for its portion of the data.
Each server takes on all three roles in different rings.

CubicRing: EVERYTHING IS A HOP AWAY
Primary ring, recovery ring, backup ring.
One hop is a trip from a source server to a destination server via a single switch.
BCube creates an interconnected system of servers in which each server can reach any other in at most k+1 hops (one more than the cube dimension k).
Using BCube, all recovery servers are one hop away from the primary server: the recovery servers are the other n-1 servers in the primary's level-0 BCube container, plus the servers reached through its immediate switch connections to other BCube containers.
All backup servers are one hop away from the recovery servers.
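The one-hop structure above can be made concrete: in BCube(n,k) a server's address is a string of k+1 base-n digits, and its level-i switch connects it to exactly the servers whose address differs in digit i only. A minimal sketch (function name is illustrative, not from the paper) enumerates those one-hop neighbors:

```python
def one_hop_neighbors(addr, n):
    """In BCube(n, k), addr is a tuple (a_k, ..., a_0) of base-n digits.
    Through its level-i switch a server reaches exactly the servers whose
    address differs from its own in digit i only, so its one-hop set is
    every address at Hamming distance 1: (k+1) * (n-1) servers."""
    neighbors = []
    for level in range(len(addr)):
        for digit in range(n):
            if digit != addr[level]:
                candidate = list(addr)
                candidate[level] = digit
                neighbors.append(tuple(candidate))
    return neighbors

# BCube(4,1): 2-digit addresses, so each server has 2 * (4-1) = 6 one-hop neighbors.
print(len(one_hop_neighbors((0, 0), 4)))  # 6
```

These (k+1)(n-1) one-hop servers are the candidate recovery set the slides describe.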

PRIMARY RING
[figure: (K,V) key-value placement on the primary ring]

CUBIC RING, BCUBE(4,1)
[figure: BCube(4,1) topology with level-0 switches <0,0>-<0,3> and level-1 switches <1,0>-<1,3>]

RECOVERY RING
[figure: recovery ring highlighted on the BCube(4,1) topology]

BACKUP RING
[figure: backup ring highlighted on the BCube(4,1) topology]

BACKUP SERVER RECOVERY TRAFFIC
[figure: recovery traffic paths from backup servers on the BCube(4,1) topology]

DATA STORAGE REDUNDANCY
Key-value store using a global coordinator (MemCube).
The global coordinator maintains the key-space-to-server mapping in the primary ring.
Each primary server maps its data subspaces to recovery servers, and each recovery server maps its cached subspaces to its backup ring.
Every primary server has f backups, one of which is designated the dominant copy and is used first.
Backups are distributed across different failure domains.
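The two-level mapping above (coordinator maps key space to primaries; each primary maps subspaces to recovery servers) can be sketched as follows. MemCube's coordinator keeps explicit tables; the hash here is only a stand-in to keep the sketch self-contained, and all names are illustrative:

```python
import hashlib

def assign_recovery_server(key, recovery_ring):
    """Hypothetical sketch of subspace-to-recovery-server assignment:
    hash the key into a subspace and map that subspace onto one of the
    primary's one-hop recovery servers. The real system uses a mapping
    table maintained by the global coordinator, not a hash."""
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return recovery_ring[h % len(recovery_ring)]

ring = ["recovery-0", "recovery-1", "recovery-2"]
server = assign_recovery_server("user:42", ring)  # deterministic for a given key
```

Determinism matters here: every replica of the mapping must send a given key's recovery traffic to the same one-hop recovery server.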

SINGLE SERVER FAILURE RECOVERY
Primary servers send heartbeats to their recovery servers.
If a heartbeat is not received, the global coordinator pings the primary through all of its other BCube switches; the server is declared failed only if all of the pings fail.
This minimizes false positives due to network failures, since every probe path is one hop.
The failed server's roles are recovered simultaneously.
The system can tolerate at least as many failures as there are servers in the recovery ring.

RECOVERY FLOW
Heartbeats carry bandwidth limits, which can be used to identify stragglers and keep them from taking a large share of a recovery.
The recovery payload is split between the recovery servers and their backups, and all traffic travels over different links to prevent in-network congestion.
All servers overprovision RAM for use during recovery (discussed in Section 5, proven in the Appendix).
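The straggler handling above suggests a bandwidth-weighted split: each recovery server's share of the failed primary's data is proportional to the bandwidth it reported in its heartbeats, so a slow link automatically gets a small share. A sketch under that assumption (the paper's exact policy may differ; names are illustrative):

```python
def split_recovery_load(total_bytes, reported_bw):
    """Divide a failed primary's data among recovery servers in
    proportion to the bandwidth each reported in its heartbeats.
    reported_bw maps server id -> bandwidth (any consistent unit)."""
    total_bw = sum(reported_bw.values())
    return {server: total_bytes * bw // total_bw
            for server, bw in reported_bw.items()}

# A straggler reporting half the bandwidth receives half the share:
shares = split_recovery_load(48 * 10**9, {"r0": 10, "r1": 10, "r2": 5})
```

Since each recovery server pulls its share over a different one-hop link, proportional splitting also keeps any single link from becoming the bottleneck.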

EVALUATION SETUP
64 PowerLeader servers: 12 Intel Xeon 2.5 GHz cores, 64 GB RAM, six 7200 RPM 1 TB disks.
Five 48-port 10GbE switches.
Three setups:
CubicRing organized as BCube(8,1), running the MemCube KV store.
64-node tree running RAMCloud.
64-node FatTree running RAMCloud.

EXPERIMENTAL DATA
Each primary server is filled with 48 GB of data.
Max write throughput is 197.6K writes per second.
A primary server taken offline took 3.1 seconds for MemCube to recover all 48 GB.
Aggregate throughput: ___ GB/s; each recovery server contributes about 8.85 GB/s.
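The headline numbers on this slide imply an end-to-end recovery rate that can be sanity-checked in one line (using only the figures reported here):

```python
# Recovering a failed primary's 48 GB in 3.1 s implies this many GB/s
# delivered end-to-end during the recovery:
data_gb, recovery_seconds = 48, 3.1
implied_rate = data_gb / recovery_seconds
print(round(implied_rate, 1))  # 15.5
```

Note this is the rate at which one server's data was restored; the aggregate figure on the slide counts the parallel traffic across all participating recovery and backup servers, which is why per-server contributions of 8.85 GB/s can sum to more.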

DETERMINING RECOVERY AND BACKUP SERVER COUNTS
Increasing the number of recovery servers linearly increases the aggregate recovery bandwidth but decreases the fragmentation ratio (less locality).
[figure: impact of the number of backup servers per recovery server]

THOUGHTS
It would be interesting to see evaluations of how quickly backup and recovery state are restored after a recovery completes, as well as of larger-scale failures.
The centralized global coordinator creates a single point of failure; how decoupled is it from the rest of the architecture?
With so many recoveries and backups, I wonder whether the total amount of backed-up data can be reduced.
The paper uses many terms interchangeably; it is sometimes hard to distinguish the properties of MemCube from those of CubicRing and BCube.

THANK YOU! QUESTIONS?