Scheduling Concurrent Applications on a Cluster of CPU-GPU Nodes
Vignesh Ravi (The Ohio State University), Michela Becchi (University of Missouri), Wei Jiang (The Ohio State University), Gagan Agrawal (The Ohio State University), Srimat Chakradhar (NEC Research Laboratories)
Rise of Heterogeneous Architectures
Today's High Performance Computing
– Multi-core CPUs and many-core GPUs are mainstream
Many-core GPUs offer
– Excellent "price-performance" and "performance-per-watt"
Flavors of heterogeneous computing
– Multi-core CPUs + (GPUs/MICs) connected over PCI-E
– Integrated CPU-GPUs like AMD Fusion, Intel Sandy Bridge
Such heterogeneous platforms exist in:
– 3 out of 5 top supercomputers, and large clusters in academia and industry
– Many cloud providers: Amazon, Nimbix, SoftLayer …
Motivation
Supercomputers and cloud environments are typically "shared"
– Accelerate a set of applications as opposed to a single application
Software stack to program CPU-GPU architectures
– Combination of (Pthreads/OpenMP…) + (CUDA/Stream)
– Now, OpenCL is becoming more popular
OpenCL, a device-agnostic platform
– Offers great flexibility with portable solutions
– Write a kernel once, execute it on any device (see the sketch below)
Today's schedulers (like TORQUE) for heterogeneous clusters:
– DO NOT exploit the portability offered by OpenCL
– Require user-guided mapping of jobs to resources
– Do not consider desirable scheduling possibilities (using CPU+GPU)
This work revisits scheduling problems for CPU-GPU clusters to:
1) Exploit the portability offered by models like OpenCL
2) Automatically map jobs to resources
3) Support desirable advanced scheduling considerations
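To illustrate the portability point, the same OpenCL kernel source can be dispatched to a CPU or a GPU simply by selecting a different device at run time. Below is a minimal sketch using pyopencl; the kernel, the doubling computation, and the function names are illustrative assumptions, not material from the talk:

```python
import numpy as np
import pyopencl as cl

KERNEL_SRC = """
__kernel void scale(__global float *x) {
    int i = get_global_id(0);
    x[i] = 2.0f * x[i];          /* trivial illustrative kernel */
}
"""

def run_kernel(device_type, data):
    """Run the same OpenCL kernel on whichever device type is requested."""
    # Collect all devices of the requested type across platforms.
    devices = [d for p in cl.get_platforms() for d in p.get_devices()
               if d.type & device_type]
    if not devices:
        raise RuntimeError("no device of the requested type found")

    ctx = cl.Context(devices=[devices[0]])
    queue = cl.CommandQueue(ctx)
    program = cl.Program(ctx, KERNEL_SRC).build()

    buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                    hostbuf=data)
    program.scale(queue, data.shape, None, buf)   # same kernel, any device
    cl.enqueue_copy(queue, data, buf)             # copy result back to host
    return data

x = np.arange(8, dtype=np.float32)
run_kernel(cl.device_type.CPU, x.copy())   # runs on a multi-core CPU
run_kernel(cl.device_type.GPU, x.copy())   # runs on a GPU, source unchanged
```

It is exactly this run-time device choice that the schedulers described next exploit: the decision of where a job runs can be made by the scheduler instead of the user.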
Outline
– Problem Formulation
– Challenges and Solution Approach
– Scheduling of Single-Node, Single-Resource Jobs
– Scheduling of Multi-Node, Multi-Resource Jobs
– Experimental Results
– Conclusions
Problem Formulation
Goal: Accelerate a set of applications on a CPU-GPU cluster
– Each node has two resources: a multi-core CPU and a GPU
– Map applications to resources to:
  – Maximize overall system throughput
  – Minimize application latency
Scheduling formulations:
1) Single-Node, Single-Resource Allocation & Scheduling
2) Multi-Node, Multi-Resource Allocation & Scheduling
Scheduling Formulations
Single-Node, Single-Resource Allocation & Scheduling
– Allocates a multi-core CPU or a GPU from a node in the cluster
– Benchmarks like Rodinia (UVA) & Parboil (UIUC) contain single-node apps.
– Limited mechanisms exist to exploit CPU+GPU simultaneously
– Exploits the portability offered by the OpenCL programming model
Multi-Node, Multi-Resource Allocation & Scheduling
– In addition, allows CPU+GPU allocation
  – Desirable in the future to allow flexibility in accelerating applications
  – MATE-CG [IPDPS'12], a framework for the Map-Reduce class of apps., allows such implementations
– In addition, allows allocation of multiple nodes per job
Challenges and Solution Approach
Decision-making challenges:
– Allocate/map to CPU-only, GPU-only, or CPU+GPU?
– Wait for the optimal resource (involves queuing delay) or assign to a non-optimal resource (involves a penalty)?
– Always allocating CPU+GPU may affect global throughput
  – Should consider other possibilities like CPU-only or GPU-only
– Always allocate the requested number of nodes?
  – May increase wait time; can consider allocating fewer nodes
Solution approach:
– Take different levels of user input (relative speedups, execution times…)
– Design scheduling schemes for each scheduling formulation
Scheduling Schemes for the First Formulation
Two input categories & three schemes (categories are based on the amount of input expected from the user):
Category 1: Relative multi-core (MP) and GPU (GP) performance as input
– Scheme 1: Relative Speedup based with Aggressive Option (RSA)
– Scheme 2: Relative Speedup based with Conservative Option (RSC)
Category 2: Additionally, sequential CPU execution time (SQ)
– Scheme 3: Adaptive Shortest Job First (ASJF)
Relative-Speedup Aggressive (RSA) or Conservative (RSC)
[Flowchart] Given N jobs with speedups MP[n] and GP[n]: create a CPU job queue (CJQ) and a GPU job queue (GJQ), enqueue each job according to its GP-MP gap, and sort both queues in descending order. R = GetNextResourceAvailable(). If R is a GPU and GJQ is not empty, assign the top of GJQ to R; if GJQ is empty, the aggressive option assigns the bottom of CJQ to R, while the conservative option waits for a CPU. (The CPU case is symmetric.)
– Takes multi-core and GPU speedups as input
– Creates CPU/GPU queues; maps each job to its optimal resource queue
– Aggressive option minimizes the penalty of a non-optimal assignment
– Conservative option waits for the optimal resource
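One way to make the RSA/RSC flow concrete is a small Python sketch. This is not the authors' implementation: the job records, field names (GP, MP), and the exact sorting key are assumptions based on the slide, and only the GPU side of the symmetric logic is shown.

```python
from collections import deque

def build_queues(jobs):
    """Split jobs by optimal device; sort each queue by its own speedup advantage."""
    gjq = deque(sorted((j for j in jobs if j["GP"] >= j["MP"]),
                       key=lambda j: j["GP"] - j["MP"], reverse=True))  # GPU queue
    cjq = deque(sorted((j for j in jobs if j["MP"] > j["GP"]),
                       key=lambda j: j["MP"] - j["GP"], reverse=True))  # CPU queue
    return gjq, cjq

def next_job_for_gpu(gjq, cjq, aggressive):
    """Pick a job when a GPU becomes free (the CPU case is symmetric)."""
    if gjq:
        return gjq.popleft()        # optimal: top of the GPU queue
    if aggressive and cjq:
        # RSA: take the CPU job at the bottom of CJQ, i.e. the one with the
        # smallest CPU advantage, so the misplacement penalty is minimized.
        return cjq.pop()
    return None                     # RSC: leave the GPU idle and wait
```

A usage example would repeatedly call next_job_for_gpu (or its CPU counterpart) from GetNextResourceAvailable's event loop until both queues drain.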
Adaptive Shortest Job First (ASJF)
[Flowchart] Given N jobs with MP[n], GP[n], and sequential times SQ[n]: create CJQ and GJQ, enqueue each job according to its GP-MP gap, and sort both queues in ascending order of SQ. R = GetNextResourceAvailable(). If R is a GPU and GJQ is not empty, assign the top of GJQ to R. If GJQ is empty, compute T1 = GetMinWaitTimeForNextCPU() and T2_k = GetJobWithMinPenOnGPU(CJQ); if T1 > T2_k, assign CJQ_k to R, otherwise wait for a CPU to become free or for GPU jobs.
– Minimizes latency for short jobs
– Switches automatically between the aggressive and conservative options
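ASJF can be read as the same queue structure plus an automatic aggressive/conservative switch driven by SQ. The sketch below is a hypothetical rendering: it assumes T2_k is the GPU run time of the CPU job with the smallest misplacement penalty, which is one plausible reading of the flowchart, and the field names are invented for illustration.

```python
def asjf_pick_for_gpu(gjq, cjq, min_cpu_wait):
    """
    gjq, cjq: lists of GPU-/CPU-optimal jobs sorted by ascending sequential time SQ.
    min_cpu_wait: estimated time until the next CPU frees up (T1 in the slide).
    """
    if gjq:
        return gjq.pop(0)            # shortest GPU-optimal job first

    if not cjq:
        return None

    # Index k of the CPU job with the minimum penalty if moved to the GPU:
    # penalty = extra time paid by running at speedup GP instead of MP.
    k = min(range(len(cjq)),
            key=lambda i: cjq[i]["SQ"] / cjq[i]["GP"] - cjq[i]["SQ"] / cjq[i]["MP"])
    t2_k = cjq[k]["SQ"] / cjq[k]["GP"]   # that job's run time on the GPU

    if min_cpu_wait > t2_k:
        return cjq.pop(k)   # aggressive: misplacing it is cheaper than waiting
    return None             # conservative: keep it queued for a CPU
```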
Scheduling Scheme for the Second Formulation
Solution approach:
– Flexibly schedule on CPU-only, GPU-only, or CPU+GPU
– Mold the number of nodes requested by a job
  – Consider allocating 1/2 or 1/4 of the requested nodes
Inputs from the user:
– Execution times of the CPU-only, GPU-only, and CPU+GPU versions
– Execution times of jobs with n, n/2, and n/4 nodes
– Such application information can also be obtained from profiles
Flexible Moldable Scheduling Scheme (FMS)
[Flowchart] Given N jobs and their execution times:
– Group jobs with the requested number of nodes as the index (minimizes resource fragmentation)
– Sort each group by the execution time of its CPU+GPU version (gives a global view for co-locating jobs on the same nodes)
– Pick a pair of jobs to schedule in sorted order (helps co-locate a CPU job and a GPU job on the same node)
– For each job, find the fastest completion option among T(i,n,C), T(i,n,G), T(i,n,CG)
– If one job chooses C and the other chooses G, co-locate the pair on the same set of nodes
– If both choose the same resource, (C,C), (G,G), or (CG,CG):
  – If twice the requested number of nodes is available, schedule the pair in parallel on those nodes
  – Otherwise, schedule the first job on the requested nodes and consider molding the number of nodes for the next job
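The pairing and molding decision at the heart of FMS can be sketched as follows. This is a simplified, hypothetical rendition, not the paper's code: the profile layout (a per-node-count table of C/G/CG times), the launch callback, and the node-availability checks are assumptions.

```python
def fastest_option(job, n):
    """Return (resource, time) minimizing T(i, n, r) over r in {C, G, CG}."""
    times = job["T"][n]                  # e.g. {"C": 12.0, "G": 7.5, "CG": 6.0}
    resource = min(times, key=times.get)
    return resource, times[resource]

def schedule_pair(job_a, job_b, n, free_nodes, launch):
    """Schedule two jobs that both requested n nodes (availability checks elided)."""
    res_a, _ = fastest_option(job_a, n)
    res_b, _ = fastest_option(job_b, n)

    if {res_a, res_b} == {"C", "G"}:
        # One CPU job and one GPU job: co-locate them on the same n nodes.
        launch(job_a, n, res_a)
        launch(job_b, n, res_b)
        return n

    if free_nodes >= 2 * n:
        # Same resource choice for both: run them side by side on disjoint nodes.
        launch(job_a, n, res_a)
        launch(job_b, n, res_b)
        return 2 * n

    # Not enough nodes: run the first job, then mold the second down to n/2 or
    # n/4 nodes if a profile for that size exists and it fits in what is free.
    launch(job_a, n, res_a)
    used = n
    for smaller in (n // 2, n // 4):
        if smaller and smaller in job_b["T"] and free_nodes - used >= smaller:
            res_b, _ = fastest_option(job_b, smaller)
            launch(job_b, smaller, res_b)
            used += smaller
            break
    return used
```

The design choice mirrored here is that molding is a fallback: co-location and parallel placement are tried first, and the request size is shrunk only when waiting for the full allocation would leave resources idle.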
Cluster Hardware Setup
– Cluster of 16 CPU-GPU nodes
– Each CPU is an 8-core Intel Xeon E5520 (2.27 GHz)
– Each GPU is an NVIDIA Tesla C2050 (1.15 GHz)
– CPU main memory: 48 GB
– GPU device memory: 3 GB
– Machines are connected through InfiniBand
Benchmarks
Single-node jobs
– We use 10 benchmarks: scientific, financial, data mining, and image processing applications
– Each benchmark is run with 3 different execution configurations
– Overall, a pool of 30 jobs
Multi-node jobs
– We use 3 applications: Gridding Kernel, Expectation-Maximization, and PageRank
– Applications are run with 2 different datasets and on 3 different node counts
– Overall, a pool of 18 jobs
Baselines & Metrics
Baselines for single-node jobs
– Blind Round Robin (BRR)
– Manual Optimal (exhaustive search, upper bound)
Baselines for multi-node jobs
– TORQUE, a widely used resource manager for heterogeneous clusters
– Minimum Completion Time (MCT) [Maheswaran et al., HCW'99]
Metrics
– Completion time (Comp. Time)
– Application latency: non-optimal assignment (Avg. NOA Lat.) and queuing delay (Avg. QD Lat.)
– Maximum idle time (Max. Idle Time)
Single-Node Job Results
Setup: 24 jobs on 2 nodes, with a uniform CPU-GPU job mix and a CPU-biased job mix; the proposed schemes are compared on 4 different metrics.
– Overall, the proposed schemes are 108% better than BRR and within 12% of Manual Optimal
– Latency exposes the tradeoff between non-optimal penalty and wait time for a resource: BRR has the highest latency, RSA incurs non-optimal penalty, RSC incurs high queuing delay, and ASJF is as good as Manual Optimal
– Idle time: BRR has very high idle times, and RSC's can be very high too; RSA has the best utilization among the proposed schemes
Multi-Node Job Results
Setup: 32 jobs on 16 nodes; the proposed schemes are evaluated with varying job execution lengths (Short Jobs (SJ), Long Jobs (LJ)) and varying resource request sizes (Small Requests (SR), Large Requests (LR)).
– Varying execution lengths: FMS is 42% better than the best of TORQUE or MCT; each type of molding gives a reasonable improvement; our schemes utilize the resources better, yielding higher throughput
– Varying request sizes: FMS is 32% better than the best of TORQUE or MCT; the scheduler decides intelligently whether to wait for a resource or mold the job to a smaller one; the benefit from resource-type molding is greater than from number-of-nodes molding
Conclusions
– Revisited scheduling problems on CPU-GPU clusters with the goal of improving aggregate throughput
  – Single-node, single-resource scheduling problem
  – Multi-node, multi-resource scheduling problem
– Developed novel scheduling schemes
  – Exploit the portability offered by OpenCL
  – Automatically map jobs to heterogeneous resources
  – RSA, RSC, and ASJF for single-node jobs
  – Flexible Moldable Scheduling (FMS) for multi-node jobs
– Significant improvement over the state of the art
Thank You! Questions?
raviv@cse.ohio-state.edu, becchim@missouri.edu, jiangwei@cse.ohio-state.edu, agrawal@cse.ohio-state.edu, chak@nec-labs.com
Benchmarks – Large Dataset

Benchmark | Seq. CPU Exec. (sec) | GPU Speedup (GP) | Multicore Speedup (MP) | Dataset Characteristics
PDE Solver | 7.3 | 4.7 | 6.8 | 14336*14336
Image Processing | 33.8 | 5.1 | 7.8 | 14336*14336
FDTD | 8.4 | 2.2 | 7.6 | 14336*14336
BlackScholes | 2.6 | 2.1 | 7.2 | 10 mil options
Binomial Options | 11.8 | 5.6 | 4.2 | 1024 options
MonteCarlo | 45.4 | 38.4 | 7.9 | 1024 options
Kmeans | 330.0 | 12.1 | 7.8 | 1.6 * 10^9 points
KNN | 67.3 | 7.8 | 6.2 | 67108864 points
PCA | 142.0 | 9.7 | 5.6 | 262144*80
Molecular Dynamics | 46.6 | 12.9 | 7.9 | 256000 nodes, 31744000 edges
Benchmarks – Small Dataset

Benchmark | Seq. CPU Exec. (sec) | GPU Speedup (GP) | Multicore Speedup (MP) | Dataset Characteristics
PDE Solver | 1.8 | 3.8 | 7.1 | 7168*7168
Image Processing | 8.4 | 5.6 | 7.5 | 7168*7168
FDTD | 2.1 | 1.3 | 7.7 | 7168*7168
BlackScholes | 0.7 | 0.6 | 6.8 | 2.5 mil options
Binomial Options | 3.0 | 2.3 | 4.2 | 128 options
MonteCarlo | 11.0 | 9.4 | 7.9 | 256 options
Kmeans | 74.2 | 6.3 | 7.7 | 0.4 * 10^9 points
KNN | 16.8 | 2.9 | 6.2 | 16777216 points
PCA | 33.8 | 9.1 | 5.6 | 65536*80
Molecular Dynamics | 6.7 | 12.8 | 7.3 | 32000 nodes, 3968000 edges
Benchmarks – Large No. of Iterations

Benchmark | Seq. CPU Exec. (sec) | GPU Speedup (GP) | Multicore Speedup (MP) | Dataset Characteristics
PDE Solver | 722.1 | 4.3 | 8.1 | 14336*14336
Image Processing | 3385.5 | 4.8 | 8.0 | 14336*14336
FDTD | 423.3 | 1.8 | 7.9 | 14336*14336
BlackScholes | 269.1 | 92.8 | 7.8 | 10 mil options
Binomial Options | 1213.6 | 12.2 | 4.3 | 1024 options
MonteCarlo | 453.3 | 368.5 | 7.8 | 1024 options
Kmeans | 1593.8 | 12.6 | 7.9 | 1.6 * 10^9 points
KNN | 1691.1 | 58.4 | 6.9 | 67108864 points
PCA | 2835.7 | 11.8 | 6.2 | 262144*80
Molecular Dynamics | 593.8 | 20.8 | 7.8 | 256000 nodes, 31744000 edges