PVOCL: Power-Aware Dynamic Placement and Migration in Virtualized GPU Environments Palden Lama, Xiaobo Zhou, University of Colorado at Colorado Springs.


pVOCL: Power-Aware Dynamic Placement and Migration in Virtualized GPU Environments. Palden Lama, Xiaobo Zhou (University of Colorado at Colorado Springs); Pavan Balaji, James Dinan, Rajeev Thakur (Argonne National Laboratory); Yan Li, Yunquan Zhang (Chinese Academy of Sciences); Ashwin M. Aji, Wu-chun Feng (Virginia Tech); Shucai Xiao (Advanced Micro Devices).

Trends in Graphics Processing Unit Performance. Performance improvements come from SIMD parallel hardware, which is fundamentally superior in the number of arithmetic operations delivered per unit of chip area and power.

Graphics Processing Unit Usage in Applications (from the NVIDIA website). Programming models (CUDA, OpenCL) have greatly eased the complexity of programming these devices and quickened the pace of adoption.

GPUs in HPC Datacenters. GPUs are ubiquitous accelerators in HPC data centers, offering significant speedups over CPUs due to SIMD parallel hardware, at the cost of high power usage.

GPU/CPU                    | TDP (Thermal Design Power)
512-core NVIDIA Fermi GPU  | 295 Watts
Quad-core x86-64 CPU       | 125 Watts

Motivation and Challenges. Peak power constraints: power constraints are imposed at various levels in a datacenter; CDUs are equipped with circuit breakers that trip when power constraints are violated; this becomes a bottleneck for high-density configurations, especially when power-hungry GPUs are used. Time-varying GPU resource demand: GPUs can be idle at different times, creating the need for dynamic GPU resource scheduling. Tongji University (01/27/2013).

GPU Clusters in a 3-Phase CDU. Power consumption characteristics are complex, depending on the placement of GPU workloads across compute nodes and power phases. Power drawn across the three phases needs to be balanced for better power efficiency and equipment reliability, motivating power-aware placement of GPU workloads. (Figure: a cabinet fed by a 3-phase power supply, with six GPUs distributed across phases 1-3 of the power strip.)

Impact of Phase Imbalance. Phase imbalance is defined as the largest deviation of any phase's power draw from the average, normalized by the average:

Phase imbalance = max(Power_Ph1 - Avg. Power, Power_Ph2 - Avg. Power, Power_Ph3 - Avg. Power) / Avg. Power
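The metric above is straightforward to compute. A minimal sketch in Python, directly following the slide's definition (the function name is illustrative, not from the pVOCL code):

```python
def phase_imbalance(p1, p2, p3):
    """Phase imbalance: the largest deviation of any phase's power
    draw from the three-phase average, normalized by that average."""
    avg = (p1 + p2 + p3) / 3.0
    return max(p1 - avg, p2 - avg, p3 - avg) / avg
```

A perfectly balanced cabinet (equal draw on all three phases) gives an imbalance of 0; a draw of 900/600/300 Watts gives 0.5, since one phase sits 300 Watts above the 600-Watt average.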

pVOCL: Power-Aware Virtual OpenCL. Online power management of GPU-enabled server clusters: dynamic consolidation and placement of GPU workloads improves energy efficiency and controls peak power consumption. pVOCL investigates and enables the use of GPU virtualization for power management, enhancing and utilizing the Virtual OpenCL (VOCL) library.

GPU Virtualization. Transparent utilization of remote GPUs: remote GPUs look like local "virtual" GPUs, applications can access them as if they were regular local GPUs, and the virtualization layer automatically moves data and computation. Efficient GPU resource management: virtual GPUs can migrate from one physical GPU to another, and a system administrator can add or remove a node while the applications are running (hot-swap capability).

VOCL: A Virtual Implementation of OpenCL to access and manage remote GPUs [InPar'12]. In the traditional model, an application on a compute node issues OpenCL API calls to the native OpenCL library, which drives the local physical GPU. In the VOCL model, the application issues the same OpenCL API calls to the VOCL library, which presents virtual GPUs locally and forwards the calls over MPI to VOCL proxies; each proxy runs on a remote compute node and invokes the native OpenCL library on its physical GPU.
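The forwarding idea can be sketched in a few lines. This is a conceptual Python stand-in, not the actual VOCL C/MPI implementation: the class and method names are illustrative, and message passing is replaced by direct calls.

```python
class VOCLProxy:
    """Stand-in for a VOCL proxy process: receives forwarded OpenCL
    requests and runs them on a local physical GPU (modeled as a label)."""
    def __init__(self, node, num_gpus):
        self.node = node
        self.gpus = [f"{node}:gpu{i}" for i in range(num_gpus)]

    def execute(self, gpu_index, request):
        # in real VOCL this would invoke the native OpenCL library
        return f"{request} ran on {self.gpus[gpu_index]}"


class VOCLLibrary:
    """Stand-in for the VOCL client library: presents one virtual GPU
    per remote physical GPU and forwards calls (over MPI in real VOCL)."""
    def __init__(self, proxies):
        # virtual GPU id -> (proxy, local physical GPU index)
        self.vgpus = [(p, i) for p in proxies for i in range(len(p.gpus))]

    def enqueue(self, vgpu_id, request):
        proxy, idx = self.vgpus[vgpu_id]
        return proxy.execute(idx, request)
```

With two proxy nodes of two GPUs each, the application sees four virtual GPUs and addresses them uniformly; the library, not the application, knows which node each call lands on. Migration then amounts to remapping a virtual GPU to a different (proxy, index) pair.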

pVOCL Architecture. The VOCL Power Manager consists of a Topology Monitor, a Power Model, and a Power Optimizer; a Migration Manager carries out VGPU migration across the VOCL proxy nodes and powers nodes on/off through the Cabinet Distribution Units (CDUs). The Topology Monitor tracks the VGPU mapping and the node-to-phase mapping, and the optimizer derives the next configuration from the current one. GPU virtualization is provided by the VOCL library. Testbed implementation: each compute node has 2 NVIDIA Tesla C1060 GPUs; CUDA 4.2 toolkit; switched CW-24VD/VY 3-phase CDU; MPICH2 MPI (Ethernet-connected nodes).

Topology Monitor

GPU Consolidation and Placement: Optimization Problem. Notation: c_i is a configuration; p(c_i) is its power usage; g(c_i) is its number of GPUs; a_i is an adaptation action; d(a_i) is the duration of the adaptation; P is the peak power budget. The goal is to find a sequence of GPU consolidation and node placement actions leading from the initial configuration c_0 to a final configuration c_n such that: the final configuration is the most power-efficient for the current GPU demand; no intermediate configuration violates the power budget; every intermediate configuration meets the current GPU demand; and, among multiple candidate final configurations, the one reachable in the shortest time is chosen. Solved with Dijkstra's algorithm (single-source shortest paths).

Optimization Algorithm. The current node configuration is set as the source vertex in the graph. Dijkstra's algorithm then finds single-source shortest paths to the remaining vertices such that each intermediate vertex neither violates the power constraint nor fails to satisfy the GPU demand. The optimization problem is thus reduced to a search over a list of target vertices, with the search criteria defined by the optimization models.
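The constrained shortest-path search described above can be sketched with a standard heap-based Dijkstra. This is an illustrative reconstruction under the slide's notation, not pVOCL's code; the function and parameter names are assumptions.

```python
import heapq

def shortest_adaptation(configs, edges, source, power, gpus, budget, demand):
    """Dijkstra over the configuration graph.

    configs: iterable of configuration ids
    edges:   dict config -> list of (next_config, action_duration d(a))
    power:   dict config -> power usage p(c)
    gpus:    dict config -> number of GPUs g(c)
    budget:  peak power budget P
    demand:  current GPU demand

    Returns dict config -> shortest total adaptation time, visiting only
    configurations that respect the budget and meet the demand.
    """
    dist = {c: float("inf") for c in configs}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, c = heapq.heappop(heap)
        if d > dist[c]:
            continue  # stale heap entry
        for nxt, cost in edges.get(c, []):
            # every intermediate configuration must stay within the
            # power budget and satisfy the current GPU demand
            if power[nxt] > budget or gpus[nxt] < demand:
                continue
            nd = d + cost
            if nd < dist[nxt]:
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return dist
```

Among the reachable configurations that meet the demand, the optimizer would then pick the most power-efficient one, breaking ties by the shortest adaptation time already computed.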

Migration Manager

Evaluation. Each compute node has dual Intel Xeon quad-core CPUs, 16 GB of memory, and two NVIDIA Tesla C1060 GPUs; CUDA 4.2 toolkit; switched CW-24VD/VY 3-phase CDU (Cabinet Power Distribution Unit); MPICH2 MPI for Ethernet-connected nodes. Application kernel benchmarks: matrix multiplication, N-Body, matrix transpose, Smith-Waterman.

Impact of GPU Consolidation: 43% and 18% improvement.

Impact of GPU Consolidation (with various workload mixes)

Power-Phase Topology-Aware Consolidation. Workload: 3 compute nodes per phase, each node with 2 GPUs; 20 n-body instances on phase 1 GPUs, 40 on phase 2 GPUs, and 80 on phase 3 GPUs. Result: 14% improvement.

Peak Power Control (Power budget 2000 Watts)

Peak Power Control (Power budget 2600 Watts)

Peak Power Control. A higher power budget allows the workloads to be distributed across different power phases, reaching more power-efficient configurations much earlier.

Overhead Analysis. Migration time is very small compared to computation time. The time required to turn on compute nodes has no effect on performance.

Conclusion. We investigate and enable dynamic scheduling of GPU resources for online power management in virtualized GPU environments. pVOCL supports dynamic placement and consolidation of GPU workloads in a power-aware manner. It controls peak power consumption and improves the energy efficiency of the underlying server system.

Thank You! Webpage: