Surviving Failures in Bandwidth Constrained Datacenters. Authors: Peter Bodik, Ishai Menache, Mosharaf Chowdhury, Pradeepkumar Mani, David A. Maltz, Ion Stoica.


Surviving Failures in Bandwidth Constrained Datacenters. Authors: Peter Bodik, Ishai Menache, Mosharaf Chowdhury, Pradeepkumar Mani, David A. Maltz, Ion Stoica. Presented by: Sneha Arvind Mani

OUTLINE Introduction Motivation and Background Problem Statement Algorithmic Solutions Evaluation of the Algorithms Related Work Conclusion

Introduction The main goals of this paper: ◦ To improve the fault tolerance of deployed applications. ◦ To reduce bandwidth usage in the network core. How? By optimizing the allocation of applications to physical machines. Both of the above problems are NP-hard, so the authors formulated a related convex optimization problem that: ◦ Incentivizes spreading the machines of individual services across fault domains. ◦ Adds a penalty term for machine reallocations that increase bandwidth usage.

Introduction (2) Their algorithm achieved a 20%-50% reduction in bandwidth usage while improving worst-case survival by 40%-120%. Improvement in fault tolerance: it reduced the fraction of services affected by potential hardware failures by up to a factor of 14. The contribution of this paper is three-fold: ◦ Measurement study ◦ Algorithms ◦ Methodology

Motivation and Background Bing.com – a large-scale Web application running in multiple datacenters around the world. Some definitions used in this paper: ◦ Logical machine: the smallest logical component of a web application. ◦ Service: consists of many logical machines executing the same code. ◦ Environment: consists of many services. ◦ Physical machine: a physical server that can run a single logical machine. ◦ Fault domain: a set of physical machines that share a single point of failure.

Communication Patterns Tracing communication between all pairs of servers, aggregated for each pair of services i and j, showed that the datacenter network core is highly utilized. The traffic matrix is very sparse: only 2% of service pairs communicate at all. [Table: aggregate link-months spent above core-link utilization thresholds of 50%, 60%, 70%, and 80%.]

Communication Patterns (2) The communication pattern is very skewed: 0.1% of the communicating services generate 60% of all traffic, and 4.8% of service pairs generate 99% of traffic. Services that do not require a lot of bandwidth can be spread out across the datacenter, improving their fault tolerance.

Communication Patterns (3) The majority of the traffic, 45%, stays within the same service; 23% leaves the service but stays within the same environment; and 23% crosses environments. The median service talks to nine other services. Communicating services form both small and large components.

Failure Characteristics Networking hardware failures cause significant outages. Redundancy reduces the impact of failures on lost bytes by only 40%. Power fault domains create non-trivial patterns. Implications for the optimization framework: it has to consider the complex patterns of the power and networking fault domains, instead of simply spreading the services across several racks, to achieve good fault tolerance.

Problem Statement Metrics: ◦ Bandwidth (BW): the sum of the rates on the core links; the overall measure of bandwidth usage at the core of the network. ◦ Fault Tolerance (FT): the average of Worst-Case Survival (WCS) across all services. ◦ Number of Moves (NM): the number of servers that have to be re-imaged to get from the initial datacenter allocation to the proposed allocation. Optimization: maximize FT – α·BW subject to NM ≤ N0, where α is a tunable positive parameter and N0 is an upper limit on the number of moves.
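The objective on this slide can be sketched on a toy allocation. This is an illustrative sketch, not the paper's code: the function and field names are invented, worst-case survival is computed as the fraction of a service's machines surviving its single worst fault, and the bandwidth term is a made-up constant.

```python
# Toy sketch of the objective FT - alpha * BW (names are illustrative).
# Each service records how many of its machines sit in each fault domain.

def worst_case_survival(domain_counts, total_machines):
    """Fraction of a service's machines surviving its worst single fault."""
    return (total_machines - max(domain_counts.values())) / total_machines

def fault_tolerance(services):
    """FT = average worst-case survival across all services."""
    return sum(worst_case_survival(s["domains"], s["machines"])
               for s in services) / len(services)

# Two services with 4 machines each, placed over fault domains A and B.
services = [
    {"machines": 4, "domains": {"A": 2, "B": 2}},  # evenly spread: WCS = 0.5
    {"machines": 4, "domains": {"A": 4}},          # one domain:    WCS = 0.0
]
bw = 10.0      # core bandwidth used by this allocation (toy value)
alpha = 0.1    # tunable trade-off parameter

ft = fault_tolerance(services)   # (0.5 + 0.0) / 2 = 0.25
objective = ft - alpha * bw      # 0.25 - 1.0 = -0.75
```

Spreading the second service evenly over both domains would raise FT to 0.5, which is exactly the behavior the convex penalty in the later slides rewards.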

Algorithmic Solutions The solution roadmap is as follows: ◦ Cells – subsets of physical machines that belong to exactly the same fault domains. This allows a reduction in the size of the optimization problem. ◦ The Fault Tolerance Cost (FTC) is convex, hence minimizing FTC improves FT. ◦ The method to optimize BW is to perform a minimum k-way cut on the communication graph. ◦ CUT + FT + BW consists of two phases: a minimum k-way cut to compute an initial assignment that minimizes bandwidth at the network core, then iteratively moving machines to improve FT. ◦ FT + BW does not perform a graph cut but starts with the current allocation and improves performance by greedy moves that reduce a weighted sum of BW and FTC.
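The minimum k-way cut step can be illustrated with a much simpler greedy heuristic than the solver the paper uses. This is only a stand-in sketch: it places each node (service) into the capacity-limited cluster where it has the most traffic, so heavily communicating services end up together and cross-cluster (core) traffic stays low.

```python
# Greedy stand-in for the minimum k-way cut step (illustrative only; the
# paper uses a proper graph-cut algorithm). Assumes k * capacity >= len(nodes).

def greedy_k_way_cut(nodes, edges, k, capacity):
    """nodes: list of node ids; edges: {(u, v): traffic weight}."""
    clusters = [set() for _ in range(k)]
    # Visit the heaviest-talking nodes first so they anchor the clusters.
    weight = {n: 0 for n in nodes}
    for (u, v), w in edges.items():
        weight[u] += w
        weight[v] += w
    for n in sorted(nodes, key=lambda m: -weight[m]):
        best, best_gain = None, -1
        for c in clusters:
            if len(c) >= capacity:
                continue
            # Traffic kept inside the cluster if n joins it.
            gain = sum(edges.get((n, m), 0) + edges.get((m, n), 0) for m in c)
            if gain > best_gain:
                best, best_gain = c, gain
        best.add(n)
    return clusters

clusters = greedy_k_way_cut(
    nodes=["a", "b", "c", "d"],
    edges={("a", "b"): 5, ("c", "d"): 4, ("a", "c"): 1},
    k=2, capacity=2)
# The heavy pairs a-b and c-d land together; only the light a-c edge is cut.
```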

Formal Definitions I – the indicator function: I(n1, n2) = 1 if traffic from machine n1 to machine n2 traverses a core link, and I(n1, n2) = 0 otherwise. Bandwidth is given by: BW = Σ over machine pairs (n1, n2) of I(n1, n2) · b(k1, k2), where b(k1, k2) is the required bandwidth between a pair of machines from services k1 and k2. To define FT, let z(k, j) be the total number of machines allocated to service k that are affected by fault j. FT is the average worst-case survival: FT = (1/K) Σ over services k of min over faults j of (m(k) − z(k, j)) / m(k), where K is the total number of services and m(k) is the number of machines of service k.

Formal Definitions (2) The Fault Tolerance Cost (FTC) is given by: FTC = Σ over services k and faults j of b(k) · w(j) · z(k, j)², where b(k) and w(j) are positive weights assigned to services and faults. A decrease in FTC should increase FT: squaring the z(k, j) variables incentivizes keeping their values small, which is achieved by spreading the machine assignment across multiple fault domains. Minimization of BW is based on a minimum k-way cut, which partitions the logical machines into a given number of clusters.
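The spreading incentive of the squared penalty is easy to see numerically. A minimal sketch with unit weights (the weight values b(k) and w(j) are left at 1 for simplicity; names are illustrative):

```python
# FTC = sum over services k and faults j of b_k * w_j * z_{k,j}^2.
# z[k][j] = number of machines of service k hit by fault j.

def ftc(z, b=None, w=None):
    total = 0.0
    for k, row in z.items():
        for j, zkj in row.items():
            bk = b[k] if b else 1.0
            wj = w[j] if w else 1.0
            total += bk * wj * zkj ** 2
    return total

# Four machines of one service over two fault domains:
concentrated = {"svc": {"A": 4, "B": 0}}   # FTC = 4^2 = 16
spread       = {"svc": {"A": 2, "B": 2}}   # FTC = 2^2 + 2^2 = 8
```

Because the square grows faster than linearly, any move that evens out the z(k, j) values lowers FTC, which is exactly why minimizing this convex cost spreads services across fault domains.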

Algorithms to Improve Both BW & FT CUT + FT: apply CUT in the first phase, then minimize FTC in the second phase using machine swaps. CUT + FT + BW: as above, but in the second phase a penalty term for bandwidth is added, i.e., a swap is scored by ΔFTC + α·ΔBW, where α is the weighting factor. NM-aware algorithm – FT + BW: start with the initial allocation and perform only the second phase of CUT + FT + BW.

Scaling to Large Datacenters An algorithm that directly exploits the skewness of the communication matrix. CUT + RandLow: apply the cut in the first phase; determine the subset of services whose aggregate bandwidth is lower than the others, then randomly permute the machine allocation of all services belonging to that subset. To scale to large datacenters, sample a large number of candidate swaps and choose the one that most improves FTC. Also, during the graph cut, logical machines of the same service are grouped into a smaller number of representative nodes.
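The sampled-swap step can be sketched as follows. This is an illustrative approximation of the idea, not the paper's procedure: `score_delta` is a hypothetical callback that returns (ΔFTC, ΔBW) for exchanging the placements of two machines, and the sampler keeps the best improving swap out of a fixed-size sample.

```python
# Sample candidate machine swaps and keep the one with the largest
# decrease in FTC + alpha * BW (negative score = improvement).

import random

def sample_best_swap(machines, score_delta, n_samples, alpha, rng=random):
    """score_delta(m1, m2) -> (delta_ftc, delta_bw) for swapping m1 and m2."""
    best_pair, best_score = None, 0.0
    for _ in range(n_samples):
        m1, m2 = rng.sample(machines, 2)
        d_ftc, d_bw = score_delta(m1, m2)
        score = d_ftc + alpha * d_bw
        if score < best_score:
            best_pair, best_score = (m1, m2), score
    return best_pair  # None if no sampled swap improves the objective
```

Repeating this step until no sampled swap improves the objective gives the incremental, move-bounded behavior the FT + BW algorithm relies on.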

Evaluation of Algorithms CUT + FT + BW: when ignoring server moves, it achieves a 30%-60% reduction in BW usage while at the same time improving FT. FT + BW is close to CUT + FT + BW: FT + BW performs only steepest-descent moves, so it could be used in scenarios where the number of concurrent server moves is limited. Random allocation in CUT + RandLow works well because many services transfer relatively little data and can be spread randomly across the datacenter.

Methodology to Evaluate The following information is needed to perform the evaluation: ◦ Network topology of a cluster ◦ Services running in the cluster and the list of machines required for each service ◦ List of fault domains and the machines in each fault domain ◦ Traffic matrix for the services in the cluster The algorithms are compared on their entire achievable trade-off boundaries instead of single performance points.

Comparing Different Algorithms The solid circles represent the FT and BW at the starting allocation (at the origin), after BW-only optimization (bottom-left corner), and after FT-only optimization (top-right corner).

Optimizing for Both BW and FT Artificially partitioning each service into several subgroups did not lead to satisfactory results. Augmenting the cut procedure with "spreading" requirements for services did not scale to large applications. Cut + FT: the graph is plotted by increasing the number of server swaps. By changing the number of swaps, the trade-off between FT and BW can be controlled. The formulation is convex, so performing steepest descent until convergence leads to the global minimum with respect to fault tolerance.

Optimizing for Both BW and FT (2) Cut + FT + BW: depends on α. The higher the value of α, the more weight on improving BW at the cost of not improving FT. It does not optimize over a convex function, so it is not guaranteed to reach the global optimum. Cut + RandLow: performs close to Cut + FT + BW, but does not optimize the BW of low-talking services nor the FT of high-talking ones.

These graphs show the trade-off boundary between FT and BW for the different algorithms across three more datacenters.

Optimizing for BW, FT and NM We notice significant improvements from moving just 5% of the cluster. Moving 29% of the cluster achieves results similar to moving most of the machines using Cut + FT + BW.

When run until convergence, FT + BW achieves results close to Cut + FT + BW even without the graph cut. This is significant because it means FT + BW can be used incrementally and still reach performance similar to Cut + FT + BW, which reshuffles the whole datacenter.

Improvements in FT & BW For α = 0.1, FT + BW reduced BW usage by 26% but improved FT by 140%; FT was reduced for only 2.7% of services, much less than for α = 1.0. For α = 1.0, FT + BW reduced core BW usage by 47% and improved average FT by 121%.

Additional Scenarios Optimization of bandwidth across multiple layers. Preparing for maintenance and online recovery. Adapting to changes in traffic patterns. Hard constraints on fault tolerance and placement. Multiple logical machines on a server.

Related Work Datacenter traffic analysis Datacenter resource allocation Virtual network embedding High availability in distributed systems VPN and network testbed allocation

Conclusion The analysis shows that the communication volume between pairs of services has a long tail, with the majority of traffic generated by a small fraction of service pairs. This allows the optimization algorithm to spread most services across fault domains without significantly increasing BW usage in the core.

Thank You!