Cluster Resource Management: A Scalable Approach

Ning Li and Jordan Parker CS 736 Class Project

Outline
  Introduction
  A Scalable Approach: Hierarchy
  Results
  Conclusions
  Questions
11/21/2018 Ning Li and Jordan Parker

Why Study Resource Management?
Clusters have become increasingly popular for large parallel computing.
  Web servers
Clusters are becoming increasingly large, on the order of thousands of nodes.
Clusters are providing multiple services.
Hard to evaluate
  Bad is easy to determine
  Good is much harder
Possible scenario: an ISP is doing web hosting and has 5 clients. 4 of them pay the same amount; the fifth pays five times as much. That client should therefore get roughly half (5/9) of the cluster's resources. But clients are really only paying for guarantees, so it can get very complicated from here.

Resource Management Example
4th node services only B
Poor management
  Node 1: A 50%, B 50%
  Node 2: A 50%, B 50%
  Node 3: A 50%, B 50%
  Node 4: B 100%
  Overall: A 37.5%, B 62.5%
Ideal
  Node 1: A 66%, B 33%
  Node 2: A 66%, B 33%
  Node 3: A 66%, B 33%
  Node 4: B 100%
  Overall: A 50%, B 50%
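
The overall figures in this example follow from averaging each service's share across the nodes. A minimal sketch, assuming all four nodes have equal capacity and that the slide's 66% and 33% stand for 2/3 and 1/3 (the function name `overall_share` is hypothetical):

```python
from fractions import Fraction

def overall_share(nodes):
    """Average per-node shares, assuming every node has equal capacity."""
    services = sorted({s for node in nodes for s in node})
    return {s: sum(node.get(s, Fraction(0)) for node in nodes) / len(nodes)
            for s in services}

half, third = Fraction(1, 2), Fraction(1, 3)
# Node 4 services only B in both layouts.
poor = [{"A": half, "B": half}] * 3 + [{"B": Fraction(1)}]
ideal = [{"A": 2 * third, "B": third}] * 3 + [{"B": Fraction(1)}]

for name, layout in (("poor", poor), ("ideal", ideal)):
    shares = overall_share(layout)
    print(name, {s: f"{float(v):.1%}" for s, v in shares.items()})
```

Under these assumptions the poor layout averages to A 37.5% and B 62.5%, while the ideal layout averages to 50% each, matching the slide.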

Clustering Goals
  Scalability
  Reliability
  High Performance
  Affordability
Cluster Resource Management should have these same goals

Related Work
  Proportional-Share (Andrea Arpaci-Dusseau)
  Cluster Reserves

Related Work: Approach Differences
Our goal: to provide a scalable solution for resource management.
Other work focused primarily on just having good management
  This often meant 1 manager for all the nodes
  Clearly this could present a scalability bottleneck
Effectiveness: other solutions are probably better for smaller clusters; we hope to be better for large (>1000 nodes) clusters.

Hierarchy: A Scalable Approach
Hierarchical Management
  Nodes service jobs
  Managers facilitate resource management
[Figure: hierarchy diagram with labels 1; 2, 3, 4; and 5 through 12]
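
The division of labor between nodes and managers can be sketched as a tree in which leaves report usage and each manager sums the reports of its children. This is only an illustration: the class names, the ticket numbers, and the grouping of nodes 5 through 12 under managers 2 through 4 are assumptions, not details taken from the slides.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Leaf that services jobs and reports per-class usage (in tickets)."""
    ident: int
    usage: dict

    def report(self):
        return dict(self.usage)

@dataclass
class Manager:
    """Interior vertex that aggregates the reports of its children."""
    ident: int
    children: list = field(default_factory=list)

    def report(self):
        total = {}
        for child in self.children:
            for cls, tickets in child.report().items():
                total[cls] = total.get(cls, 0) + tickets
        return total

# Assumed layout: manager 1 on top, managers 2-4 below it, nodes 5-12 as leaves.
leaves = {i: Node(i, {"A": 10, "B": 5}) for i in range(5, 13)}
m2 = Manager(2, [leaves[5], leaves[6], leaves[7]])
m3 = Manager(3, [leaves[8], leaves[9], leaves[10]])
m4 = Manager(4, [leaves[11], leaves[12]])
root = Manager(1, [m2, m3, m4])

print(root.report())
```

Because each manager only talks to its direct children, the top-level manager handles three reports here rather than eight, which is the scalability argument behind the hierarchy.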

Banking Algorithm
Goal
  Determine best allocation given previous usage
Primitives
  Tickets
  Bank accounts
  Deposit / withdraw tickets
6 Steps

Banking Algorithm
Step 1: For each service class on each node
  Deposit unused tickets
Step 2: For each service class on each node
  Reallocate service class
  Full utilization: Allocation = usage + k
  Under utilization: Allocation = usage - k
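
Steps 1 and 2 can be sketched for a single node as follows. The slides do not define "full utilization", the value of k, or what happens when usage - k would go negative, so the choices below (usage at or above the allocation counts as full, k is a small constant, allocations are clamped at zero) are assumptions, and the function name is hypothetical.

```python
def deposit_and_reallocate(allocation, usage, bank, k=1):
    """Apply Steps 1 and 2 to one node.

    allocation, usage: dicts mapping service class -> tickets on this node.
    bank: dict mapping service class -> banked tickets (updated in place).
    k: adjustment constant from the slide; its value here is an assumption.
    """
    new_allocation = {}
    for cls, allocated in allocation.items():
        used = usage.get(cls, 0)
        unused = max(allocated - used, 0)
        bank[cls] = bank.get(cls, 0) + unused       # Step 1: deposit unused tickets
        if used >= allocated:                       # assumed meaning of "full utilization"
            new_allocation[cls] = used + k
        else:                                       # under utilization
            new_allocation[cls] = max(used - k, 0)  # clamp at zero (assumption)
    return new_allocation

bank = {}
print(deposit_and_reallocate({"A": 60, "B": 40}, {"A": 60, "B": 25}, bank, k=2))
print(bank)
```

In this example class A used its whole allocation and is nudged up by k, while class B left 15 tickets unused, which are banked before its allocation is nudged down.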

Banking Algorithm Cont.
Step 3: For each service class
  Compare total allocation to desired
  Subtract from over-allocated
  Add to needy & under-allocated
Step 4: For each service class
  Deposit / Withdraw
  If still over-allocated withdraw
  If still under-allocated deposit
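
The comparison at the start of Step 3 can be sketched as below. Only the "compare total allocation to desired" part is shown; how the excess is then subtracted from or added to individual nodes is not spelled out on the slide and is left out. The function name and the use of fractional desired shares are assumptions.

```python
def allocation_gaps(node_allocations, desired_share):
    """Return total allocation minus desired allocation for each service class.

    node_allocations: list of dicts, one per node, class -> tickets.
    desired_share: dict class -> fraction of all tickets the class should hold.
    A positive gap means over-allocated, a negative gap means under-allocated.
    """
    totals = {cls: 0 for cls in desired_share}
    for alloc in node_allocations:
        for cls, tickets in alloc.items():
            totals[cls] = totals.get(cls, 0) + tickets
    grand_total = sum(totals.values())
    return {cls: totals[cls] - desired_share.get(cls, 0) * grand_total
            for cls in totals}

nodes = [{"A": 70, "B": 30}, {"A": 50, "B": 50}]
gaps = allocation_gaps(nodes, {"A": 0.5, "B": 0.5})
over = [c for c, g in gaps.items() if g > 0]
under = [c for c, g in gaps.items() if g < 0]
print(gaps, over, under)
```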

Banking Algorithm Cont.
Step 5: Withdraw and allocate
  Reward the needy nodes
Step 6: Done, clear the bank accounts

Reliability
Bottom-up Manager Replacement
[Figure: hierarchy diagrams illustrating bottom-up manager replacement]
This is not relevant to the performance of our scheduler, and we did not even simulate it, but it does show that the network layout we've designed could easily handle failures. Making the tree balance itself and handle failures could be relatively straightforward.
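
One possible reading of bottom-up replacement is that a child of the failed manager is promoted into its place. The sketch below illustrates that reading only; the slides say this was never simulated, and the promotion rule (lowest id wins), the dictionary representation, and the function name are all assumptions.

```python
def replace_failed_manager(tree, failed):
    """Promote one child of a failed manager into its place.

    tree: dict mapping manager id -> list of child ids (updated in place).
    The promoted child (here simply the lowest id, an assumption) takes over
    the failed manager's remaining children and its slot under the parent.
    """
    children = tree.pop(failed)
    promoted = min(children)
    tree[promoted] = [c for c in children if c != promoted]
    for kids in tree.values():
        for i, kid in enumerate(kids):
            if kid == failed:
                kids[i] = promoted
    return promoted

tree = {1: [2, 3, 4], 2: [5, 6, 7], 3: [8, 9, 10], 4: [11, 12]}
print(replace_failed_manager(tree, 2), tree)
```

Only the failed manager's subtree and its parent's child list change, so the repair stays local to one branch of the hierarchy.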

Results
  4 nodes: managers (1st/2nd level) 2/1; Test 1
  100 nodes: managers (1st/2nd level) 10/1; class 2 constraints: nodes 1-30; Tests 2, 3
  900 nodes: managers (1st/2nd level) 30/1; class 2 constraints: nodes 1-300; Test 5
  Reporting (1st/2nd level): 1/1, 1/5
  Workloads: Steady, Dynamic
We are not going to show the single-manager case here. The data makes a nice comparison, but take our word for it: we got essentially the same results. The maximum is 900 nodes because NS started page faulting with more nodes. We now have access to a larger server, ironsides (7 GB RAM), so we'll see what happens. We chose 3 service classes because it makes the results easy to evaluate and easy to ...

Implementation Details
Simulations via NS, the Network Simulator
  Low-bandwidth 10 Mb/s communication network
  UDP for lower server overhead
Assumptions
  Node-level resource management works ideally
UDP might not be appropriate once fault tolerance is introduced; we just wanted to show that the scheme could work on a low-bandwidth network, and UDP introduces lower overhead.

Test 1: Overview
4 nodes, 3 services, 60/30/10 allocation
4th node receives all of 3rd class's requests
Steady workload
  Node 1: 1st 66%, 2nd 33%
  Node 2: 1st 66%, 2nd 33%
  Node 3: 1st 66%, 2nd 33%
  Node 4: 1st 40%, 2nd 20%, 3rd 40%
  Overall: 1st 60%, 2nd 30%, 3rd 10%
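
The per-node targets for this test can be derived from the 60/30/10 goal and the constraint that only node 4 serves the 3rd class. A minimal sketch, assuming equal-capacity nodes and assuming the unconstrained classes split each node's remaining capacity in proportion to their overall targets (the function and parameter names are hypothetical):

```python
from fractions import Fraction

def constrained_allocation(targets, num_nodes, pinned_class, pinned_nodes):
    """Per-node shares when one class may only run on some nodes.

    targets: dict class -> desired overall share (fractions summing to 1).
    pinned_class is served only on the node numbers in pinned_nodes; the other
    classes split whatever capacity is left on each node in proportion to
    their targets (an assumption of this sketch).
    """
    pinned_share = targets[pinned_class] * num_nodes / len(pinned_nodes)
    others = {c: t for c, t in targets.items() if c != pinned_class}
    others_total = sum(others.values())
    layout = {}
    for node in range(1, num_nodes + 1):
        taken = pinned_share if node in pinned_nodes else Fraction(0)
        layout[node] = {c: (1 - taken) * t / others_total for c, t in others.items()}
        if taken:
            layout[node][pinned_class] = taken
    return layout

targets = {"1st": Fraction(60, 100), "2nd": Fraction(30, 100), "3rd": Fraction(10, 100)}
layout = constrained_allocation(targets, num_nodes=4, pinned_class="3rd", pinned_nodes={4})
for node, shares in layout.items():
    print(node, {c: f"{float(s):.0%}" for c, s in shares.items()})
```

Nodes 1 through 3 come out at 2/3 and 1/3 (shown as 66% and 33% on the slide), and node 4 at 40/20/40, which averages to the 60/30/10 goal.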

Test 1: Data

Test 2: Overview
100 nodes, 3 services, 60/30/10 allocation
Nodes 1-30 receive all of 3rd class's requests
Steady workload

Test 3: Overview
100 nodes, 3 services, 60/30/10 allocation
Nodes 1-30 receive all of 3rd class's requests
Dynamic workload

Test 4: Overview
100 nodes, 3 services, 60/30/10 allocation
Nodes 1-30 receive all of 3rd class's requests
Steady workload
Reporting 1/5
  Nodes every 0.3 seconds
  Managers every 1.5 seconds

Test 5: Overview
900 nodes, 3 services, 60/30/10 allocation
Nodes 1-300 receive all of 3rd class's requests
Steady workload

Conclusions
Benefits of a hierarchy
  Scalable
  Reliable
  Geographic applications
Implemented a new management scheme: Banking
  Comparable results
  Improved scalability

Conclusions
Clusters are sensitive to small policy changes
  Clusters are built for specific workloads
  Their performance is important and small changes have significant impact
  No scheme is universally applicable
Future Work
  Real system implementation
  Real workloads
  Real node-level resource management
  More steady performance

Questions

Related Work: Proportional-Share
Stride Scheduling
  Ticket based and similar to lottery
Scale
  Randomly query k nodes to find best allocation
Different Application
  Condor-like resource allocation/applications
"Extending Proportional-Share Scheduling to a Network of Workstations", Andrea C. Arpaci-Dusseau and David Culler
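
For readers unfamiliar with stride scheduling, here is a sketch of the basic single-machine algorithm: each client holds tickets, its stride is inversely proportional to its ticket count, and the client with the smallest pass value runs next. This is the textbook form, not the cluster extension described in the cited paper; the function name, the `stride1` constant, and the tie-breaking by name are assumptions of the sketch.

```python
import heapq

def stride_schedule(tickets, quanta, stride1=1 << 20):
    """Return the order in which clients run under basic stride scheduling.

    tickets: dict client -> ticket count (positive integers).
    Each client advances its pass by stride1 // tickets whenever it runs, and
    the client with the smallest pass runs next (ties broken by name).
    """
    heap = [(stride1 // t, name) for name, t in tickets.items()]
    heapq.heapify(heap)
    order = []
    for _ in range(quanta):
        pass_value, name = heapq.heappop(heap)
        order.append(name)
        heapq.heappush(heap, (pass_value + stride1 // tickets[name], name))
    return order

order = stride_schedule({"A": 3, "B": 2, "C": 1}, quanta=60)
print({name: order.count(name) for name in "ABC"})
```

With tickets in a 3:2:1 ratio, the 60 quanta are split 30/20/10, and unlike lottery scheduling the split is deterministic.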

Related Work: Cluster Reserves
Resource Container Schedulers
Constrained Optimization Algorithm
Scale
  Centralized single manager

Hierarchical Cluster Reserves – Version 1
Modify Cluster Reserves optimization algorithm
  Use it when a manager manages nodes AND when a level_n+1 manager manages level_n managers.

Hierarchical Cluster Reserves – Version 2
Cluster Reserves optimization algorithm
  Use it when a manager manages nodes
  Don't use it for upper-level managers
Modify the manager-to-manager reporting
  Lie to the algorithm
