© NVIDIA Corporation 2011 The 'Super' Computing Company: From Super Phones to Super Computers. CUDA 4.0



© NVIDIA Corporation 2011 CUDA Toolkit 4.0 Release Candidate
Available to registered developers on March 4th
Press embargo: February 28th, 6am PST (San Francisco)

© NVIDIA Corporation 2011 CUDA 4.0: Application Porting Made Simpler
- Rapid application porting: Unified Virtual Addressing
- Faster multi-GPU programming: GPUDirect 2.0
- Easier parallel programming in C++: Thrust

© NVIDIA Corporation 2011 CUDA 4.0 for Broader Developer Adoption

© NVIDIA Corporation 2011 NVIDIA GPUDirect™: Towards Eliminating the CPU Bottleneck
Version 1.0, for applications that communicate over a network:
- Direct access to GPU memory for 3rd-party devices
- Eliminates unnecessary system-memory copies and CPU overhead
- Supported by Mellanox and QLogic
- Up to 30% improvement in communication performance
Version 2.0, for applications that communicate within a node:
- Peer-to-peer memory access, transfers, and synchronization
- MPI implementations natively support GPU data transfers
- Less code, higher programmer productivity

© NVIDIA Corporation 2011 Before GPUDirect v2.0: copying between GPUs required a staging copy in main memory (diagram: GPU 1 memory, over PCIe and the CPU/chipset, into system memory, then back out to GPU 2 memory).

© NVIDIA Corporation 2011 GPUDirect v2.0: Peer-to-Peer Communication. Direct transfers between GPUs (diagram: GPU 1 memory to GPU 2 memory over PCIe, bypassing system memory).
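The peer-to-peer path shown above can be sketched with the CUDA 4.0 runtime API. This is a minimal sketch, assuming two P2P-capable Fermi GPUs (devices 0 and 1) under the same PCIe root complex; error checking is omitted for brevity:

```cuda
#include <cuda_runtime.h>

int main(void) {
    const size_t N = 1 << 20;
    int canAccess = 0;

    // Ask whether GPU 0 can directly read/write GPU 1's memory.
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);

    float *buf0, *buf1;
    cudaSetDevice(0);
    cudaMalloc(&buf0, N * sizeof(float));
    if (canAccess)
        cudaDeviceEnablePeerAccess(1, 0);  // flags must be 0

    cudaSetDevice(1);
    cudaMalloc(&buf1, N * sizeof(float));

    // Direct GPU-to-GPU copy; with peer access enabled this bypasses
    // the staging copy through system memory shown on the earlier slide.
    cudaMemcpyPeer(buf1, 1, buf0, 0, N * sizeof(float));

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```

With peer access enabled, kernels on one device can also dereference pointers allocated on the other, which is what makes "less code, higher programmer productivity" possible for multi-GPU MPI codes.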

© NVIDIA Corporation 2011 Unified Virtual Addressing: Easier to Program with a Single Address Space
Without UVA, system memory and each GPU's memory are separate address spaces; with UVA, the CPU and all GPUs share one virtual address space. (Diagram: separate 0x0000-0xFFFF ranges per memory without UVA vs. one contiguous range spanning system, GPU 0, and GPU 1 memory with UVA.)
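Because every pointer is unique under UVA, the runtime can infer which memory a pointer refers to. A minimal sketch (assumes a Fermi-class GPU and a 64-bit OS, which CUDA 4.0's UVA requires):

```cuda
#include <cuda_runtime.h>

int main(void) {
    const size_t N = 1 << 20;
    float *host, *dev;

    cudaMallocHost(&host, N * sizeof(float));  // pinned host memory
    cudaMalloc(&dev, N * sizeof(float));

    // Under UVA, cudaMemcpyDefault lets the runtime deduce the
    // transfer direction from the pointer values themselves,
    // instead of requiring cudaMemcpyHostToDevice etc.
    cudaMemcpy(dev, host, N * sizeof(float), cudaMemcpyDefault);
    cudaMemcpy(host, dev, N * sizeof(float), cudaMemcpyDefault);

    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}
```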

© NVIDIA Corporation 2011 C++ Templatized Algorithms & Data Structures (Thrust)
- Powerful open-source C++ parallel algorithms and data structures
- Similar to the C++ Standard Template Library (STL)
- Automatically chooses the fastest code path at compile time
- Divides work between GPUs and multi-core CPUs
- 5x to 100x faster than STL and TBB
Data structures: thrust::device_vector, thrust::host_vector, thrust::device_ptr, etc.
Algorithms: thrust::sort, thrust::reduce, thrust::exclusive_scan, etc.
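The STL-like style above can be illustrated with a short Thrust sketch: fill a vector on the host, copy it to the GPU with a single assignment, then sort and reduce on the device:

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdlib>

int main(void) {
    // Fill a host vector with random data.
    thrust::host_vector<int> h(1 << 20);
    for (size_t i = 0; i < h.size(); ++i)
        h[i] = rand();

    // Copy to the GPU; the assignment performs the host-to-device transfer.
    thrust::device_vector<int> d = h;

    // Sort and reduce on the device with STL-like calls.
    thrust::sort(d.begin(), d.end());
    int sum = thrust::reduce(d.begin(), d.end(), 0);
    (void)sum;

    return 0;
}
```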

© NVIDIA Corporation 2011 C/C++: The Parallel Programming Sweet Spot (chart)

© NVIDIA Corporation 2011 CUDA 4.0: Highlights
Easier parallel application porting:
- Share GPUs across multiple threads
- Single-thread access to all GPUs
- No-copy pinning of system memory
- New CUDA C/C++ features
- Thrust templated primitives library
- NPP image/video processing library
- Layered textures
New & improved developer tools:
- Automated performance analysis
- C++ debugging
- GPU binary disassembler
- cuda-gdb for MacOS
Faster multi-GPU programming:
- Unified Virtual Addressing
- NVIDIA GPUDirect™ v2.0 (peer-to-peer access and transfers)
- GPU-accelerated MPI
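One porting feature listed above, no-copy pinning, lets an existing host allocation be page-locked in place rather than copied into a dedicated cudaMallocHost buffer. A sketch using cudaHostRegister (new in CUDA 4.0):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    const size_t bytes = 1 << 22;
    // An ordinary malloc'd buffer, e.g. one owned by legacy code.
    float *host = (float *)malloc(bytes);
    float *dev;
    cudaMalloc(&dev, bytes);

    // Pin the existing allocation in place; no copy into a
    // separate pinned staging buffer is needed.
    cudaHostRegister(host, bytes, cudaHostRegisterDefault);

    // Pinned memory enables fast, async-capable DMA transfers.
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

    cudaHostUnregister(host);
    cudaFree(dev);
    free(host);
    return 0;
}
```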

© NVIDIA Corporation 2011 GPU Technology Conference 2011, Oct, San Jose, CA: the 3rd annual GPU Technology Conference
New for 2011: co-located with the Los Alamos HPC Symposium, with 300+ research scientists from national labs
2010 highlights: 280 hours of sessions, 100+ research posters, 42 countries represented

© NVIDIA Corporation 2011 CUDA 4.0: Background Slides

© NVIDIA Corporation 2011 NVIDIA CUDA Summary: New in CUDA 4.0
Libraries:
- Thrust C++ library (templated performance primitives)
- NVIDIA library support: complete math.h, complete BLAS library (levels 1, 2 and 3), sparse matrix math library, RNG library, FFT library (1D, 2D and 3D), image processing library (NPP), video processing library (NPP)
- 3rd-party math libraries: CULA Tools, MAGMA, IMSL, VSIPL
NVIDIA tools support:
- Parallel Nsight Pro, Parallel Nsight 1.0 IDE
- cuda-gdb debugger with multi-GPU support
- CUDA/OpenCL Visual Profiler
- CUDA memory checker
- CUDA C SDK
- CUDA disassembler
- Partner tools: Allinea DDT, RogueWave/TotalView, Vampir, Tau, CAPS HMPP
Hardware support:
- GPUDirect 2.0 (fast path to data)
- ECC memory, double precision, native 64-bit architecture
- Concurrent kernel execution, dual copy engines
- Multi-GPU support, 6 GB per GPU supported
Operating system support: MS Windows 32/64, Linux 32/64, Mac OS X
Cluster management: GPUDirect, Tesla Compute Cluster (TCC), graphics interoperability
Programming model:
- Unified Virtual Addressing
- C support (NVIDIA C compiler, CUDA C parallel extensions): function pointers, recursion, atomics, malloc/free
- C++ support: new/delete, virtual functions, classes/objects, class inheritance, polymorphism, operator overloading, class templates, function templates, virtual base classes, namespaces
- Fortran, OpenCL
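Among the programming-model items above, device-side C++ new/delete can be sketched as follows (a minimal sketch, assuming a Fermi GPU; the hypothetical kernel name is illustrative, and the device heap that backs device-side allocation may need enlarging via cudaDeviceSetLimit):

```cuda
#include <cuda_runtime.h>

__global__ void scratch_kernel(int *out) {
    // Device-side new/delete: each thread allocates from the device heap.
    int *scratch = new int[4];
    for (int i = 0; i < 4; ++i)
        scratch[i] = threadIdx.x + i;
    out[threadIdx.x] = scratch[3];
    delete[] scratch;
}

int main(void) {
    int *out;
    cudaMalloc(&out, 32 * sizeof(int));
    // Enlarge the device heap used by device-side new/malloc.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 8 * 1024 * 1024);
    scratch_kernel<<<1, 32>>>(out);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```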

© NVIDIA Corporation 2011 cuda-gdb Now Available for MacOS

© NVIDIA Corporation 2011 Automated Performance Analysis in Visual Profiler
- Summary analysis and hints at the session, device, context, and kernel levels
- New UI for kernel analysis
- Identify the limiting factor
- Analyze instruction throughput, memory throughput, and kernel occupancy

© NVIDIA Corporation 2011 NVIDIA Parallel Nsight™: professional features now available free of charge!
Key features: professional profiler, standard Microsoft Visual Studio 2010 support, single-system debugging, Tesla Compute Cluster support, CUDA Toolkit 3.2

© NVIDIA Corporation 2011 CUDA 3rd-Party Ecosystem
Tools:
- Parallel debuggers: Visual Studio IDE with Parallel Nsight Pro, Allinea DDT debugger, TotalView debugger
- Performance tools: ParaTools VampirTrace, TauCUDA performance tools, PAPI, HPC Toolkit
Compute platform providers:
- Cloud compute: Amazon EC2, Peer 1
- OEMs: Dell, HP, IBM
Cluster tools:
- Cluster management: Platform LSF Cluster Manager, Platform Symphony, Bright Cluster Manager
- Job scheduling: Altair PBS, Cluster Resources TORQUE
MPI libraries: OpenMPI, QLogic OFED
Compilers: PGI CUDA Fortran, PGI Accelerators, PGI CUDA x86, CAPS HMPP, TidePowerd GPU.NET, PyCUDA

© NVIDIA Corporation 2011

NVIDIA CUDA Developer Resources
Engines & libraries:
- Math libraries: CUFFT, CUBLAS, CUSPARSE, CURAND
- 3rd-party libraries: CULA LAPACK, VSIPL, NPP
- Image libraries: performance primitives for imaging
- App acceleration engines: ray tracing (OptiX, iray)
- Video libraries: NVCUVID / NVCUVENC
Development tools:
- CUDA Toolkit: complete GPU computing development kit
- cuda-gdb: GPU hardware debugging
- Visual Profiler: GPU hardware profiler for CUDA C and OpenCL
- Parallel Nsight: integrated development environment for Visual Studio
SDKs and code samples:
- GPU Computing SDK: CUDA C/C++, DirectCompute, and OpenCL code samples and documentation
- Books: CUDA by Example, GPU Gems
- Optimization guides: best practices for GPU computing and graphics development

© NVIDIA Corporation 2011 Proven Research Vision: GPU Computing Research & Education
World-class research leadership and teaching: University of Cambridge, Harvard University, University of Utah, University of Tennessee, University of Maryland, University of Illinois at Urbana-Champaign, Tsinghua University, Tokyo Institute of Technology, Chinese Academy of Sciences, National Taiwan University, Georgia Institute of Technology
Academic partnerships / fellowships: Johns Hopkins University, Nanyang University, Technical University-Czech, CSIRO, SINTEF, HP Labs, ICHEC, Barcelona Supercomputing Center, Clemson University, Fraunhofer SCAI, Karlsruhe Institute of Technology, Mass. Gen. Hospital/NE Univ, North Carolina State University, Swinburne University of Tech., Technische Univ. Munich, UCLA, University of New Mexico, University of Warsaw-ICM, VSB-Tech University of Ostrava, and more coming shortly
GPGPU education: 350+ universities

© NVIDIA Corporation 2011 CUDA Applications: Momentum Increasing

© NVIDIA Corporation 2011 Today's CUDA CAE Solutions
Structural mechanics: ANSYS Mechanical, AFEA, Abaqus/Standard (beta)
Fluid dynamics: AcuSolve, Moldflow, Culises (OpenFOAM), Particleworks
Electromagnetics: Nexxim, EMPro, CST MWS, XFdtd, SEMCAD X