Cluster Resource Management: A Scalable Approach. Ning Li and Jordan Parker. CS 736 Class Project.
Outline: Introduction; A Scalable Approach: Hierarchy; Results; Conclusions; Questions. 11/21/2018
Why Study Resource Management? Clusters have become increasingly popular for large-scale parallel computing, such as web servers. They are growing to the order of thousands of nodes and are providing multiple services. Resource management is hard to evaluate: bad management is easy to spot, but good management is much harder to define. A possible scenario: an ISP doing web hosting has 5 clients; 4 pay the same amount and the fifth pays five times as much, so that client should get half of the cluster's resources. But clients are really only paying for guarantees, and it can get very complicated from there.
Resource Management Example: the 4th node services only B. Poor management: Nodes 1-3 each run A 50% / B 50%, Node 4 runs B 100%; overall A 37.5% / B 62.5%. Ideal: Nodes 1-3 each run A 66% / B 33%, Node 4 runs B 100%; overall A 50% / B 50%.
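The overall shares in this example follow from averaging each service's share across the equal-capacity nodes. A quick check (our own illustration, not project code):

```python
# Each node's scheduler hands services the listed CPU fractions; the
# cluster-wide share of a service is its average across the (equal) nodes.

def overall_share(per_node_shares):
    """Cluster-wide share of one service, given its share on each node."""
    return sum(per_node_shares) / len(per_node_shares)

# Poor management: nodes 1-3 split A/B evenly, node 4 runs only B.
print(overall_share([0.50, 0.50, 0.50, 0.00]))  # A: 0.375
print(overall_share([0.50, 0.50, 0.50, 1.00]))  # B: 0.625

# Ideal: nodes 1-3 give A two thirds, restoring the 50/50 cluster split.
print(overall_share([2/3, 2/3, 2/3, 0.00]))     # A: 0.5
print(overall_share([1/3, 1/3, 1/3, 1.00]))     # B: 0.5
```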
Clustering Goals: scalability, reliability, high performance, and affordability. Cluster resource management should have these same goals.
Related Work: Proportional-Share scheduling (Arpaci-Dusseau) and Cluster Reserves.
Related Work: Approach Differences. Our goal: to provide a scalable solution for resource management. Other work focused primarily on just having good management, which often meant one manager for all the nodes; clearly this could present a scalability bottleneck. Effectiveness: the other solutions are probably better for smaller clusters; we hope to be better for large (>1000 node) clusters.
Outline: Introduction; A Scalable Approach: Hierarchy; Results; Conclusions; Questions.
Hierarchy: A Scalable Approach. Hierarchical management: nodes service jobs; managers facilitate resource management. (Tree diagram: root manager 1; second-level managers 2-4; leaf nodes 5-12.)
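The management tree on this slide can be sketched as a simple aggregation hierarchy; the class names and the sum-based report are our illustration, not the project's code:

```python
# Leaf nodes report per-class usage to their manager; each manager
# aggregates its children's reports and passes the totals up the tree.

class Node:
    def __init__(self, name):
        self.name = name
        self.usage = {}  # service class -> measured share on this node

class Manager:
    def __init__(self, name, children):
        self.name = name
        self.children = children  # leaf Nodes or lower-level Managers

    def report(self):
        """Aggregate per-class usage from all children, recursively."""
        total = {}
        for child in self.children:
            usage = child.report() if isinstance(child, Manager) else child.usage
            for cls, share in usage.items():
                total[cls] = total.get(cls, 0.0) + share
        return total

# Layout from the slide: root manager 1 oversees managers 2-4, which
# between them manage leaf nodes 5-12.
nodes = {i: Node(f"node{i}") for i in range(5, 13)}
root = Manager("m1", [
    Manager("m2", [nodes[5], nodes[6], nodes[7]]),
    Manager("m3", [nodes[8], nodes[9], nodes[10]]),
    Manager("m4", [nodes[11], nodes[12]]),
])
```

Because each manager only talks to its own children, adding nodes grows the fan-out of the leaves rather than the load on any single manager, which is the scalability argument of the talk.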
Banking Algorithm. Goal: determine the best allocation given previous usage. Primitives: tickets, bank accounts, and deposit/withdraw operations on tickets. The algorithm runs in 6 steps.
Banking Algorithm. Step 1: for each service class on each node, deposit unused tickets. Step 2: for each service class on each node, reallocate the service class: at full utilization, allocation = usage + k; at under-utilization, allocation = usage - k.
Banking Algorithm (cont.). Step 3: for each service class, compare the total allocation to the desired allocation; subtract from the over-allocated and add to the needy and under-allocated. Step 4: for each service class, deposit or withdraw: if still over-allocated, withdraw; if still under-allocated, deposit.
Banking Algorithm (cont.). Step 5: withdraw from the bank and allocate, rewarding the needy nodes. Step 6: done; clear the bank accounts.
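Read as pseudocode, the six steps for a single service class might look like the sketch below. This is one reading of the terse slides: k, integer tickets, and a single pooled bank account are all our assumptions about details the slides leave open.

```python
# Single-class sketch of one banking round across all nodes.
K = 2  # adjustment step in tickets (assumed value)

def banking_round(alloc, usage, desired_total):
    """alloc/usage: per-node ticket counts for one service class."""
    # Step 1: deposit each node's unused tickets into the bank.
    bank = sum(max(a - u, 0) for a, u in zip(alloc, usage))

    # Step 2: reallocate toward observed usage: fully utilized nodes
    # get usage + k, under-utilized nodes drop to usage - k.
    new = [u + K if u >= a else max(u - K, 0) for a, u in zip(alloc, usage)]

    # Step 3: compare the class's total allocation to the desired total.
    gap = desired_total - sum(new)

    # Step 4: settle with the bank: deposit the excess of an
    # over-allocated class, withdraw for an under-allocated one.
    withdrawn = min(gap, bank) if gap > 0 else 0
    bank += -gap if gap < 0 else -withdrawn

    # Step 5: hand the withdrawn tickets to the needy (fully utilized) nodes.
    needy = [i for i, (a, u) in enumerate(zip(alloc, usage)) if u >= a]
    while withdrawn > 0 and needy:
        for i in needy:
            if withdrawn == 0:
                break
            new[i] += 1
            withdrawn -= 1

    # Step 6: the round is done; the bank account is cleared.
    return new
```

For example, with two nodes allocated 10 tickets each, one fully used and one using only 4, a desired total of 20 shifts the under-used node's slack to the busy node.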
Reliability: Bottom-up Manager Replacement. (Diagram: a failed manager, e.g. manager 2, is replaced bottom-up by one of its nodes, e.g. node 5, which takes its place in the tree.) This is not relevant to the performance of our scheduler, and we did not simulate it, but it does show that the network layout we designed could easily handle failures; making the tree balance itself and handle failures could be relatively straightforward.
Outline: Introduction; A Scalable Approach: Hierarchy; Results; Conclusions; Questions.
Results: Test Configurations. Test 1: 4 nodes, 2/1 first/second-level managers, 1/1 reporting, steady workload, constrained class on node 4. Test 2: 100 nodes, 10/1 managers, 1/1 reporting, steady workload, constrained class on nodes 1-30. Test 3: 100 nodes, 10/1 managers, 1/1 reporting, dynamic workload, constrained class on nodes 1-30. Test 4: 100 nodes, 10/1 managers, 1/5 reporting, steady workload, constrained class on nodes 1-30. Test 5: 900 nodes, 30/1 managers, 1/1 reporting, steady workload, constrained class on nodes 1-300. Notes: we do not show the single-manager case here; its data is a nice comparison, but we got essentially the same results. The maximum is 900 nodes because NS started page faulting with more; we now have access to a larger server, ironsides (7 GB RAM), so we will see what happens. We chose 3 service classes because they are easy to evaluate.
Implementation Details. Simulations via NS, the Network Simulator. Low-bandwidth 10 Mbps communication network; UDP for lower server overhead. Assumption: node-level resource management works ideally. UDP might not be appropriate once fault tolerance is introduced; we just wanted to show the scheme could work on a low-bandwidth network, since UDP introduces lower overheads.
Test 1: Overview. 4 nodes, 3 services, 60/30/10 allocation; the 4th node receives all of the 3rd class's requests; steady workload. Target allocation: Nodes 1-3 each 1st 66% / 2nd 33%; Node 4 1st 40% / 2nd 20% / 3rd 40%; overall 1st 60% / 2nd 30% / 3rd 10%.
Test 1: Data.
Test 2: Overview. 100 nodes, 3 services, 60/30/10 allocation; nodes 1-30 receive all of the 3rd class's requests; steady workload.
Test 2: Data.
Test 3: Overview. 100 nodes, 3 services, 60/30/10 allocation; nodes 1-30 receive all of the 3rd class's requests; dynamic workload.
Test 3: Data.
Test 4: Overview. 100 nodes, 3 services, 60/30/10 allocation; nodes 1-30 receive all of the 3rd class's requests; steady workload; 1/5 reporting (nodes report every 0.3 seconds, managers every 1.5 seconds).
Test 4: Data.
Test 5: Overview. 900 nodes, 3 services, 60/30/10 allocation; nodes 1-300 receive all of the 3rd class's requests; steady workload.
Test 5: Data.
Outline: Introduction; A Scalable Approach: Hierarchy; Results; Conclusions; Questions.
Conclusions. Benefits of a hierarchy: scalability, reliability, and geographic applications. We implemented a new management scheme, Banking, with comparable results and improved scalability.
Conclusions. Clusters are sensitive to small policy changes: clusters are built for specific workloads, their performance is important, and small changes have significant impact; no scheme is universally applicable. Future work: a real system implementation, real workloads, real node-level resource management, and steadier performance.
Outline: Introduction; A Scalable Approach: Hierarchy; Results; Conclusions; Questions.
Questions?
Related Work: Proportional-Share. Stride scheduling: ticket based and similar to lottery scheduling. Scale: randomly query k nodes to find the best allocation. Different application: Condor-like resource allocation. Reference: "Extending Proportional-Share Scheduling to a Network of Workstations," Andrea C. Arpaci-Dusseau and David Culler.
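For reference, classic stride scheduling (the ticket-based mechanism this related work builds on) can be sketched as follows; `STRIDE1` and the heap-based client selection are the standard formulation, not details from the slide:

```python
import heapq

STRIDE1 = 1 << 20  # large constant so integer strides stay precise

def schedule(tickets, quanta):
    """Return the run order for `quanta` time slices.
    tickets: dict mapping client name -> ticket count."""
    heap = []
    for client, t in tickets.items():
        stride = STRIDE1 // t           # stride is inverse to tickets
        heapq.heappush(heap, (stride, stride, client))  # (pass, stride, name)
    order = []
    for _ in range(quanta):
        pass_, stride, client = heapq.heappop(heap)  # lowest pass runs next
        order.append(client)
        heapq.heappush(heap, (pass_ + stride, stride, client))
    return order

# With a 3:1 ticket ratio, client A receives three of every four quanta.
order = schedule({"A": 3, "B": 1}, 8)
```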
Related Work: Cluster Reserves. Resource-container schedulers; a constrained-optimization algorithm. Scale: a centralized single manager.
Hierarchical Cluster Reserves, Version 1: modify the Cluster Reserves optimization algorithm and use it both when a manager manages nodes and when a level n+1 manager manages level n managers.
Hierarchical Cluster Reserves, Version 2: use the Cluster Reserves optimization algorithm when a manager manages nodes, but not for upper-level managers; instead, modify the manager-to-manager reporting ("lie to the algorithm").