ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters
Vignesh Ravi, Michela Becchi, Gagan Agrawal, Srimat Chakradhar
Context
GPUs are used in supercomputers – some of the Top500 supercomputers use GPUs:
– Tianhe-1A: 14,336 Xeon X5670 processors, 7,168 Nvidia Tesla M2050 GPUs
– Stampede: about 6,000 nodes with Xeon E5-2680 8C processors and Intel Xeon Phi coprocessors
GPUs are also used in cloud computing
→ Need for resource managers and scheduling schemes for heterogeneous clusters including many-core GPUs
Categories of Scheduling Objectives
Traditional schedulers for supercomputers aim to improve system-wide metrics: throughput & latency
A market-based service world is emerging, focusing on the provider's profit and the user's satisfaction
– Cloud: pay-as-you-go model; Amazon offers different user classes (On-Demand, Free, Spot, …)
– Recent resource managers for supercomputers (e.g. MOAB) have the notion of a service-level agreement (SLA)
Motivation: State of the Art
Open-source batch schedulers are starting to support GPUs
– TORQUE, SLURM
– Users guide the mapping of jobs to heterogeneous nodes
– Simple scheduling schemes (goals: throughput & latency)
Recent proposals describe runtime systems & virtualization frameworks for clusters with GPUs
– [gViM HPCVirt'09][vCUDA IPDPS'09][rCUDA HPCS'10][gVirtuS Euro-Par'10][our HPDC'11, CCGRID'12, HPDC'12]
– Simple scheduling schemes (goals: throughput & latency)
Proposals on market-based scheduling policies focus on homogeneous CPU clusters
– [Irwin HPDC'04][Sherwani Soft. Pract. Exp. '04]
Our Goal: Reconsider market-based scheduling for heterogeneous clusters including GPUs
Considerations
The community is looking into code portability between CPU and GPU
– OpenCL
– PGI CUDA-x86
– MCUDA (CUDA-C), Ocelot, SWAN (CUDA-OpenCL), OpenMPC
→ Opportunity to flexibly schedule a job on the CPU or the GPU
In cloud environments, oversubscription is commonly used to reduce infrastructure costs
→ Use resource sharing to improve performance by maximizing hardware utilization
Problem Formulation
Given a CPU-GPU cluster, schedule a set of jobs on the cluster
– To maximize the provider's profit / aggregate user satisfaction
Exploit the portability offered by OpenCL
– Flexibly map each job onto either the CPU or the GPU
Maximize resource utilization
– Allow sharing of a multi-core CPU or a GPU
Assumptions/Limitations
– 1 multi-core CPU and 1 GPU per node
– Single-node, single-GPU jobs
– Only space-sharing, limited to two jobs per resource
Value Function: Market-Based Scheduling Formulation
For each job, a Linear-Decay Value Function [Irwin HPDC'04]:
– Max Value → importance/priority of the job
– Decay rate → urgency of the job
– Delay due to: queuing, execution on a non-optimal resource, resource sharing
Yield = maxValue – decay * delay
(Figure: yield/value plotted against execution time; the yield starts at Max Value and decreases at the decay rate.)
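A minimal sketch of the linear-decay value function above, assuming each job carries a user-specified maxValue and decay rate; the function and variable names are illustrative, not taken from the authors' code.

# Linear-decay yield: Yield = maxValue - decay * delay
def yield_value(max_value, decay_rate, delay):
    # delay accounts for queuing, non-optimal execution, and sharing slowdown;
    # the yield can become negative for long delays.
    return max_value - decay_rate * delay

# Example: a job worth 100 value units, losing 2 units per second of delay.
print(yield_value(max_value=100.0, decay_rate=2.0, delay=15.0))  # -> 70.0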
Overall Scheduling Approach
Jobs arrive in batches; the scheduling flow has three phases:
Phase 1: Mapping – jobs are enqueued on their optimal resource (CPU queue or GPU queue); this phase is oblivious of other jobs and based only on the optimal walltime.
Phase 2: Sorting – jobs in each queue are sorted to improve yield, bringing in inter-job scheduling considerations.
Phase 3: Re-mapping – different schemes decide when to remap and what to remap; jobs then execute on the CPU or the GPU.
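A minimal sketch of this three-phase flow, assuming each job carries user-supplied CPU/GPU walltimes; the reward and remap_policy callables stand in for the mechanisms described on the following slides, and all names are illustrative rather than the authors' exact implementation.

def schedule_batch(jobs, cpu_queue, gpu_queue, reward, remap_policy):
    # Phase 1: map each job onto its optimal resource (shorter walltime).
    for job in jobs:
        (cpu_queue if job.cpu_walltime <= job.gpu_walltime else gpu_queue).append(job)

    # Phase 2: sort each queue so higher-reward jobs run first.
    cpu_queue.sort(key=reward, reverse=True)
    gpu_queue.sort(key=reward, reverse=True)

    # Phase 3: optionally remap jobs across queues (uncoordinated or coordinated).
    remap_policy(cpu_queue, gpu_queue)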
Phase 1: Mapping
Users provide walltimes for both the CPU and the GPU
– The walltimes are used as an indicator of the optimal/non-optimal resource
– Each job is mapped onto its optimal resource
NOTE: in our experiments we assumed maxValue = optimal walltime
Phase 2: Sorting
Sort jobs based on Reward [Irwin HPDC'04]
Present Value
– f(maxValue_i, discount_rate)
– Value after discounting the risk of running a job
– The shorter the job, the lower the risk
Opportunity Cost
– Degradation in value due to selecting one job among several alternatives
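A minimal sketch of reward-based sorting following the present-value idea of [Irwin HPDC'04]. The discounting and opportunity-cost formulas below are illustrative assumptions, not the authors' exact expressions.

def present_value(max_value, runtime, discount_rate=0.01):
    # Discount the job's value over its expected runtime:
    # shorter jobs are discounted less, hence carry lower risk.
    return max_value / (1.0 + discount_rate) ** runtime

def reward(job, queue):
    pv = present_value(job.max_value, job.optimal_walltime)
    # Opportunity cost (assumed form): decay suffered by the other
    # queued jobs while this job occupies the resource.
    others_decay = sum(j.decay_rate for j in queue if j is not job)
    return pv - others_decay * job.optimal_walltime

# Sort a queue so the highest-reward job runs first:
# queue.sort(key=lambda j: reward(j, queue), reverse=True)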
Phase 3: Re-mapping
When to remap:
– Uncoordinated schemes: when a queue is empty and its resource is idle
– Coordinated scheme: when the CPU and GPU queues are imbalanced
What to remap:
– Which job will have the best reward on its non-optimal resource?
– Which job will suffer the least reward penalty?
Phase 3: Uncoordinated Schemes
1. Last Optimal Reward (LOR)
– Remap the job with the least reward on its optimal resource
– Idea: least reward → least risk in moving
2. First Non-Optimal Reward (FNOR)
– Compute the reward each job could produce on its non-optimal resource
– Remap the job with the highest reward on the non-optimal resource
– Idea: consider the non-optimal penalty
3. Last Non-Optimal Reward Penalty (LNORP)
– Remap the job with the least reward degradation
– RewardDegradation_i = OptimalReward_i - NonOptimalReward_i
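A minimal sketch of the three uncoordinated selection policies, assuming helper functions that evaluate a job's reward on its optimal and non-optimal resource; all names are illustrative.

def pick_lor(queue, reward_on_optimal):
    # Last Optimal Reward: move the job with the least reward on its optimal resource.
    return min(queue, key=reward_on_optimal)

def pick_fnor(queue, reward_on_non_optimal):
    # First Non-Optimal Reward: move the job with the highest reward elsewhere.
    return max(queue, key=reward_on_non_optimal)

def pick_lnorp(queue, reward_on_optimal, reward_on_non_optimal):
    # Last Non-Optimal Reward Penalty: move the job with the least degradation.
    return min(queue, key=lambda j: reward_on_optimal(j) - reward_on_non_optimal(j))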
Phase 3: Coordinated Scheme – Coordinated Least Penalty (CORLP)
When to remap: when there is an imbalance between the queues
– The imbalance is affected by the decay rates and execution times of the queued jobs
– Total Queuing-Delay Decay-Rate Product (TQDP)
– Remap if |TQDP_CPU – TQDP_GPU| > threshold
What to remap:
– Remap the job with the least penalty degradation
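A minimal sketch of the coordinated (CORLP) remap trigger. TQDP is taken here as the sum, over the queued jobs, of expected queuing delay times decay rate; this interpretation and the helper names are assumptions for illustration only.

def tqdp(queue):
    total, elapsed = 0.0, 0.0
    for job in queue:                  # jobs in execution order
        total += elapsed * job.decay_rate   # queuing delay seen by this job
        elapsed += job.walltime             # later jobs wait for earlier ones
    return total

def should_remap(cpu_queue, gpu_queue, threshold):
    # Trigger coordinated remapping when the queues are imbalanced.
    return abs(tqdp(cpu_queue) - tqdp(gpu_queue)) > threshold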
Resource Sharing Heuristic
Limitation: at most two jobs can space-share a CPU or a GPU
Factors affecting sharing:
– (-) Slowdown incurred by jobs using half of a resource
– (+) More resources become available for other jobs
Jobs are categorized as low, medium, or high scaling (based on models/profiling)
When to enable sharing:
– A large fraction of jobs in the pending queues have negative yield
What jobs share a resource: Scalability-DecayRate factor
– Jobs are grouped based on scalability
– Within each group, jobs are ordered by decay rate (urgency)
– Pick the top K fraction of jobs (low scalability, low decay); K is tunable
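A minimal sketch of the sharing heuristic: group jobs by scalability class, order them by decay rate within each group, then pick the top-K fraction, favoring low-scalability, low-decay jobs. The attribute names, ordering details, and the negative-yield threshold are illustrative assumptions.

def pick_jobs_to_share(pending, k_fraction):
    order = {"low": 0, "medium": 1, "high": 2}
    # Low-scalability jobs lose least from half a resource; low decay means low urgency.
    ranked = sorted(pending, key=lambda j: (order[j.scalability], j.decay_rate))
    return ranked[: max(1, int(k_fraction * len(ranked)))]

def sharing_enabled(pending, negative_fraction_threshold=0.5):
    # Enable sharing when a large fraction of pending jobs already have negative yield.
    if not pending:
        return False
    negative = sum(1 for j in pending if j.current_yield() < 0)
    return negative / len(pending) > negative_fraction_threshold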
Overall System Prototype
Master Node: cluster-level scheduler (scheduling schemes & policies), TCP communicator, and submission, pending, execution, and finished queues. Centralized decision making; a shared file system is assumed.
Compute Nodes (each with a multi-core CPU and a GPU): a node-level runtime with a TCP communicator, CPU execution processes, GPU execution processes, and a GPU consolidation framework. The node-level runtime provides the execution & sharing mechanisms; CPU sharing relies on OS-based scheduling, while GPU sharing is handled by the consolidation framework.
GPU-Related Node-Level Runtime (a simplified version of our HPDC'11 runtime)
Front-end (GPU execution processes): CUDA applications 1…N, each linked against a CUDA interception library; CUDA calls are forwarded to the back-end over a front-end/back-end communication channel.
Back-end (GPU sharing framework): a back-end server hosts a virtual context in which the CUDA calls arriving from the front-ends are multiplexed onto CUDA streams 1…N; a workload consolidator manipulates kernel configurations to allow GPU space sharing on top of the CUDA runtime and driver.
Experimental Setup
16-node cluster
– CPU: 8-core Intel Xeon E5520 (2.27 GHz), 48 GB memory
– GPU: Nvidia Tesla C2050 (1.15 GHz), 3 GB device memory
256-job workload
– 10 benchmark programs
– 3 configurations: small, large, very large datasets
– Various application domains: scientific computations, financial analysis, data mining, machine learning
Baselines
– TORQUE (always uses the optimal resource)
– Minimum Completion Time (MCT) [Maheswaran et al., HCW'99]
Throughput & Latency: Comparison with TORQUE-Based Metrics
Completion time is 10-20% better and average latency is ~20% better than the baselines.
– The baselines suffer from idle resources
– By privileging shorter jobs, our schemes reduce queuing delays
Results with the Average Yield Metric: Effect of Job Mix
Yield improvements over the baselines range from up to 2.3x to up to 8.8x, depending on the job mix (Uniform, Skewed-CPU, Skewed-GPU).
Our schemes do better on skewed job mixes:
– More idle time in the case of the baseline schemes
– More room for dynamic mapping
Results with the Average Yield Metric: Effect of Value Function
Up to 3.8x and up to 6.9x better yield under the different value functions, showing the adaptability of our schemes to different value functions.
Results with the Average Yield Metric: Effect of System Load
Up to 8.2x better yield.
– As the load increases, the yield of the baselines decreases linearly
– The proposed schemes achieve an initially increasing yield and then sustain it
Yield Improvements from Sharing: Effect of the Fraction of Jobs Shared
Up to 23x improvement in yield.
– Careful space sharing can help performance by freeing resources for other jobs
– Excessive sharing can be detrimental to performance
Conclusion
Value-based scheduling on CPU-GPU clusters
– Goal: improve the aggregate yield
Coordinated and uncoordinated scheduling schemes for dynamic mapping
Automatic space sharing of resources based on heuristics
Prototype framework for evaluating the proposed schemes
Improvement over the state of the art
– Based on completion time & latency
– Based on average yield