Performance Optimizations for running NIM on GPUs
Jacques Middlecoff, NOAA/OAR/ESRL/GSD/AB
Mark Govett, Tom Henderson, Jim Rosinski

Goal for NIM

Optimizations to be discussed
- NIM: the halo to be communicated between processors is packed and unpacked on the GPU (see the sketch after this list)
  - No copy of the entire variable to and from the CPU
  - About the same speed as the CPU
- Halo computation
- Overlapping communication with computation
- Mapped, pinned memory
- NVIDIA GPUDirect technology
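A minimal CUDA sketch of the GPU-side pack, assuming the field is stored with nz consecutive vertical levels per horizontal column and an index list of halo columns. All names here (pack_halo, d_var, d_halo_idx) are hypothetical, not NIM's actual code:

#include <cuda_runtime.h>

// One thread per (halo column, vertical level): gathers scattered halo
// columns into a contiguous send buffer that stays resident on the GPU.
__global__ void pack_halo(const float *d_var,     // field: npoints x nz
                          float *d_sendbuf,       // packed halo: nhalo x nz
                          const int *d_halo_idx,  // indices of columns to send
                          int nhalo, int nz)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;  // which halo column
    int k = blockIdx.y * blockDim.y + threadIdx.y;  // which vertical level
    if (p < nhalo && k < nz)
        d_sendbuf[p * nz + k] = d_var[d_halo_idx[p] * nz + k];
}

The unpack kernel is the mirror image, scattering a received buffer back into the field. Only the small packed buffer then crosses the PCIe bus for the MPI exchange, instead of the entire variable.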

Halo Computation
- Redundant computation to avoid communication: calculate values in the halo instead of doing an MPI send (sketch after this list)
- Trades computation time for communication time
- GPUs create more opportunity for halo computation
- NIM already uses halo computation for everything that does not require extra communication
- Next step for NIM: look at halo computations that require new, but less frequent, communication
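A sketch of the idea, with hypothetical names and a trivial stand-in for the real stencil: instead of updating only the owned columns and exchanging afterwards, the kernel is launched over owned plus halo columns, so each rank recomputes its halo values from data it already holds:

#include <cuda_runtime.h>

// Hypothetical kernel; the body is a trivial stand-in for NIM's stencil.
__global__ void update(float *d_var, int ncols, int nz)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per column
    if (p >= ncols) return;
    for (int k = 0; k < nz; ++k)
        d_var[p * nz + k] *= 0.99f;                 // placeholder computation
}

// Without halo computation: update the nown owned columns, then exchange.
// With halo computation: launch over nown + nhalo columns and skip the
// exchange, provided the values the stencil reads are already valid in
// the halo.
void step_with_halo_comp(float *d_var, int nown, int nhalo, int nz)
{
    int ncols = nown + nhalo;                       // halo columns included
    update<<<(ncols + 255) / 256, 256>>>(d_var, ncols, nz);
    // No MPI exchange needed for d_var this step.
}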

Overlapping Communication with Computation
- Works best with a co-processor to handle the communication
- Overlap communication with other calculations between the point where a variable is set and where it is used
  - Loop level: not enough computation on the GPU to hide the communication
  - Subroutine level: not enough computation time
  - Entire dynamics: not feasible for NIM (next slide)
- Calculate the perimeter first, then do the communication while calculating the interior (sketch after this list)
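A sketch of the perimeter-first scheme using two CUDA streams. The kernel names, buffers, and the simplified exchange are all assumptions, and the host buffers must be pinned for the asynchronous copies to actually overlap:

#include <mpi.h>
#include <cuda_runtime.h>

// Hypothetical kernels standing in for the real dynamics routines.
__global__ void compute_perimeter(float *v, int n) { }
__global__ void compute_interior(float *v, int n) { }

// One exchange with communication hidden behind the interior computation.
void overlapped_step(float *d_var, float *h_send, float *h_recv,
                     int nperim, int ninner, int nbuf, int nbr, MPI_Comm comm)
{
    cudaStream_t s_perim, s_inner;
    cudaStreamCreate(&s_perim);
    cudaStreamCreate(&s_inner);

    // Perimeter first, in its own stream; the (much larger) interior
    // starts immediately in a second stream and runs throughout.
    compute_perimeter<<<(nperim + 255) / 256, 256, 0, s_perim>>>(d_var, nperim);
    compute_interior<<<(ninner + 255) / 256, 256, 0, s_inner>>>(d_var, ninner);

    // Copy the perimeter down and exchange it while s_inner is still busy.
    cudaMemcpyAsync(h_send, d_var, nbuf * sizeof(float),
                    cudaMemcpyDeviceToHost, s_perim);
    cudaStreamSynchronize(s_perim);              // send buffer is ready
    MPI_Sendrecv(h_send, nbuf, MPI_FLOAT, nbr, 0,
                 h_recv, nbuf, MPI_FLOAT, nbr, 0, comm, MPI_STATUS_IGNORE);
    cudaMemcpyAsync(d_var, h_recv, nbuf * sizeof(float),
                    cudaMemcpyHostToDevice, s_perim);   // unpack step elided

    cudaStreamSynchronize(s_perim);
    cudaStreamSynchronize(s_inner);
    cudaStreamDestroy(s_perim);
    cudaStreamDestroy(s_inner);
}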

Overlapping Communication with Computation: Entire Dynamics
- 14 exchanges per time step
- 3-iteration Runge-Kutta loop, with exchanges inside the RK loop
- Results in a 7-deep halo
  [Slide diagram: perimeter and interior regions]
- Way too much communication
- More halo computation? Move the exchanges out of the RK loop?
- Considerable code restructuring required either way (structural sketch after this list)
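A structural sketch of the trade-off being weighed here. All routine names are hypothetical and the bodies are empty stubs; only the loop structure matters:

// Hypothetical stubs standing in for the real NIM routines.
void dynamics_substep(void) { }
void dynamics_substep_halo_comp(void) { }
void exchange_halo(void) { }        // shallow halo; ~14 exchanges per step
void exchange_deep_halo(void) { }   // 7-deep halo, once per time step

// Current structure: exchanges inside the 3-iteration Runge-Kutta loop.
void timestep_current(void)
{
    for (int it = 0; it < 3; ++it) {
        dynamics_substep();
        exchange_halo();            // communication every iteration
    }
}

// Restructured: one deeper, less frequent exchange hoisted out of the
// loop, with halo computation keeping the halo valid across iterations.
void timestep_restructured(void)
{
    exchange_deep_halo();
    for (int it = 0; it < 3; ++it)
        dynamics_substep_halo_comp();
}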

Mapped, Pinned Memory: Theory
- Mapped, pinned memory is CPU memory
  - Mapped so the GPU can access it across the PCIe bus
  - Page-locked so the OS cannot swap it out
  - Available only in limited amounts
- Integrated GPUs: always a performance gain
- Discrete GPUs (what we have): advantageous only in certain cases
  - The data is not cached on the GPU
  - Global loads and stores must be coalesced
- Zero-copy: both the GPU and the CPU can access the data (allocation sketch after this list)
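A minimal allocation sketch; the buffer names are assumptions:

#include <cuda_runtime.h>

// Allocate mapped, pinned host memory and obtain its device-side alias.
void alloc_mapped(float **h_buf, float **d_buf, size_t nbytes)
{
    cudaSetDeviceFlags(cudaDeviceMapHost);      // enable mapping; must run
                                                // before the context is created
    cudaHostAlloc((void **)h_buf, nbytes,
                  cudaHostAllocMapped);         // page-locked CPU memory
    cudaHostGetDevicePointer((void **)d_buf,
                             *h_buf, 0);        // GPU-visible pointer to it
}
// A kernel writing through *d_buf stores directly across PCIe into host
// memory (zero-copy); the CPU then reads *h_buf with no cudaMemcpy at all.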

Mapped, Pinned Memory: Practice
- Using mapped, pinned memory for a fast copy:
  - SendBuf is mapped and pinned
  - A regular GPU array (d_buff) is packed on the GPU
  - d_buff is copied to SendBuf
  - Twice as fast as copying d_buff to an ordinary (pageable) CPU array
- Packing the halo on the GPU directly into SendBuf (SendBuf = VAR, zero-copy) is 2.7x slower. Why?
- Unpacking the halo on the GPU (VAR = RecvBuf): the zero-copy unpack runs at the same speed, but with no copy step at all
- Both packing paths are sketched after this list
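The two packing paths, sketched with the hypothetical pack_halo kernel and buffer names from the earlier examples. The 2.7x slowdown of the zero-copy pack is consistent with the theory slide: every store then goes individually, uncached, across the PCIe bus:

#include <cuda_runtime.h>

// pack_halo as defined in the earlier sketch.
__global__ void pack_halo(const float *, float *, const int *, int, int);

// Fast path: pack into a regular device buffer, then one bulk copy into
// the pinned SendBuf (about twice as fast as copying to pageable memory).
void pack_and_copy(const float *d_var, float *d_buff, float *SendBuf_h,
                   const int *d_halo_idx, int nhalo, int nz)
{
    dim3 block(32, 8), grid((nhalo + 31) / 32, (nz + 7) / 8);
    pack_halo<<<grid, block>>>(d_var, d_buff, d_halo_idx, nhalo, nz);
    cudaMemcpy(SendBuf_h, d_buff, (size_t)nhalo * nz * sizeof(float),
               cudaMemcpyDeviceToHost);
}

// Zero-copy path (measured 2.7x slower for packing): the kernel writes
// straight into the mapped SendBuf through its device-side alias.
void pack_zero_copy(const float *d_var, float *SendBuf_d,
                    const int *d_halo_idx, int nhalo, int nz)
{
    dim3 block(32, 8), grid((nhalo + 31) / 32, (nz + 7) / 8);
    pack_halo<<<grid, block>>>(d_var, SendBuf_d, d_halo_idx, nhalo, nz);
    cudaDeviceSynchronize();    // make the PCIe writes visible to the CPU
}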

Mapped, Pinned Memory: Results
- NIM horizontal, 96 vertical levels, 10 processors
- Lowest value selected to avoid skew

Mapped, Pinned Memory: Results (continued)

NVIDIA GPUDirect Technology
- Eliminates the CPU from interprocessor communication
- Based on an interface between the GPU and InfiniBand
  - Both devices share pinned memory buffers
  - Data written by the GPU can be sent immediately by InfiniBand
- Overlapping communication with computation? No longer a co-processor to do the communication?
- We have this technology but have yet to install it
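For illustration only: the slide describes the first-generation GPUDirect, where CUDA and the InfiniBand driver share the same pinned host buffer so the extra host-side staging copy disappears. A later evolution, CUDA-aware MPI built on GPUDirect, goes further and accepts device pointers directly; a hedged sketch of that form, with buffer and neighbor names as assumptions:

#include <mpi.h>
#include <cuda_runtime.h>

// With a CUDA-aware MPI over GPUDirect, device buffers are handed to MPI
// directly; no hand-written copy through CPU memory is needed.
void exchange_gpudirect(float *d_send, float *d_recv,
                        int n, int nbr, MPI_Comm comm)
{
    MPI_Sendrecv(d_send, n, MPI_FLOAT, nbr, 0,
                 d_recv, n, MPI_FLOAT, nbr, 0,
                 comm, MPI_STATUS_IGNORE);
}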

Questions?