Graphics Processors and the Exascale: Parallel Mappings, Scalability and Application Lifespan
Rob Farber, Senior Scientist, PNNL


Questions 1 and 2

Question 1: Looking forward in the 2-5 year timeframe, will we continue to need new languages, compiler directives, or language extensions to use accelerators?
- Absolutely, as will be discussed in the next few slides.

Question 1 (continued): Will compiler technology advance sufficiently to seamlessly use accelerators, as when the 8087 was added to the 8086 in the early days of the x86 architecture, or when instruction sets were extended to include SSE or AltiVec and compilers eventually generated code for them?
- Oh, I wish! However, there is hope for data-parallel problems.

Question 2: What is your vision of what a unified heterogeneous HPC ecosystem should encompass? What languages, libraries, and frameworks? Should debuggers and profiling tools be integrated across heterogeneous architectures?
- Humans are the weak link.
- A scalable, globally unified file system is essential.
- Yes to a unified set of debugger and profiling tools.
- I'd like to say any language, but many semantics and assumptions will not scale!

A perfect storm of opportunities and technology
(Summary of Farber, Scientific Computing, "Realizing the Benefits of Affordable Teraflop-capable Hardware")
- Multi-threaded software is a must-have because manufacturers were forced to move to multi-core CPUs.
  - The failure of Dennard scaling meant processor manufacturers had to add cores to increase performance and entice customers.
  - This is a new model for a huge body of legacy code! Multi-core is disruptive to single-threaded and poorly scaling legacy apps.
- GPGPUs, the Cray XMT, and Blue Waters have changed the numbers. Commodity systems are catching up. Massive threading is the future.
- Research efforts will not benefit from new hardware unless they invest in scalable, multi-threaded software.
  - Lack of investment risks stagnation and losing to the competition.
  - Competition is fierce, the new technology is readily available, and it is inexpensive!
- Which software and models? Look to the successes that are widely adopted and have withstood the test of time.
  - Briefly examine CUDA, OpenCL, and data-parallel extensions.

GPGPUs: an existing capability
- Market forces evolved GPUs into massively parallel GPGPUs (General-Purpose Graphics Processing Units). NVIDIA quotes a 100+ million installed base of CUDA-enabled GPUs.
- GPUs put supercomputing in the hands of the masses.
  - December 1996: ASCI Red, the first teraflop supercomputer.
  - Today: kids buy GPUs with flop rates comparable to the systems available to scientists with supercomputer access in the mid to late 1990s.
  - Remember that Finnish kid who wrote some software to understand operating systems?
- Inexpensive commodity hardware enables:
  - New thinking
  - A large, educated base of developers

  GPU              Peak 32-bit GF/s   Peak 64-bit GF/s   Cost
  GeForce GTX                                            < $500
  AMD Radeon HD                                          < $380

Meeting the need: CUDA was adopted quickly!
- February 2007: the initial CUDA SDK was made public.
- Now: CUDA-based GPU computing is part of the curriculum at more than 200 universities, including MIT, Harvard, Cambridge, Oxford, the Indian Institutes of Technology, National Taiwan University, and the Chinese Academy of Sciences.
- Application speed tells the story. For the fastest 100 apps in the NVIDIA Showcase (Sept. 8, 2010; click on "Sort by Speed Up"):
  - Fastest: 2,600x
  - Median: 253x
  - Slowest: 98x

GPGPUs are not a one-trick pony
- Used on a wide range of computational, data-driven, and real-time applications.
- Exhibit knife-edge performance.
- Balance ratios can help map problems.
- Can really be worth the effort:
  - 10x can make computational workflows more interactive (even poorly performing GPU apps are useful).
  - 100x is disruptive and has the potential to fundamentally affect scientific research by removing time-to-discovery barriers.
  - 1000x and greater has been achieved through the use of optimized transcendental functions and/or multiple GPUs.

Three rules for fast GPU codes
1. Get the data on the GPU (and keep it there!)
   - PCIe x16 v2.0 bus: 8 GiB/s in a single direction.
   - 20-series GPUs: GiB/s.
   - (A minimal sketch of this rule follows below.)
2. Give the GPU enough work to do
   - Assume 10^-6 s of latency and a 1 TF/s device: can waste (10^-6 × 10^12) = 1M operations.
3. Reuse and locate data to avoid global memory bandwidth bottlenecks
   - Teraflop hardware delivers only gigaflop-level performance when limited by global memory bandwidth.
   - Can cause a 100x slowdown!
This is tough for people. Tools need heuristics that can work on incomplete data and adjust for bad decisions. It's even worse in a distributed and non-failsafe environment.
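A minimal CUDA sketch of rule 1, with an illustrative element-wise kernel and iteration count that are assumptions rather than anything from the talk: the data crosses the PCIe bus once, is reused by many kernel launches on the device, and only the final result is copied back.

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    // Illustrative kernel: a lightweight per-element update that reuses
    // data already resident in GPU global memory.
    __global__ void step(float *d_data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d_data[i] = 0.5f * d_data[i] + 1.0f;
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *h_data = (float *)malloc(n * sizeof(float));
        for (int i = 0; i < n; ++i) h_data[i] = (float)i;

        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        // Rule 1: pay the PCIe cost once on the way in...
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

        // Rules 1 and 2: keep the data on the GPU and give it plenty of work;
        // no host-device copies inside the loop.
        for (int iter = 0; iter < 1000; ++iter)
            step<<<(n + 255) / 256, 256>>>(d_data, n);

        // ...and once on the way out.
        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("h_data[0] = %g\n", h_data[0]);

        cudaFree(d_data);
        free(h_data);
        return 0;
    }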

Application lifespan. SIMD: a key from the past
- Farber: a general SIMD mapping from the 1980s.
- This mapping for neural networks on the Connection Machine: "Most efficient implementation to date" (Singer 1990), (Thearling 1995).
- Results presented at SC09 (courtesy TACC):
  - 60,000 cores: 363 TF/s measured
  - 62,796 cores: 386 TF/s (projected)
- Acknowledgements: work performed at or funded by the Santa Fe Institute, the theoretical division at Los Alamos National Laboratory, and various NSF, DOE, and other funding sources including the Texas Advanced Computing Center.

The Parallel Mapping
energy = objFunc(p_1, p_2, ..., p_n)
- Step 1: The optimization method (Powell, conjugate gradient, or other) broadcasts the parameters p_1, p_2, ..., p_n to every GPU.
- Step 2: Each GPU calculates partial results over its share of the data (GPU 1: examples 0 to N-1, GPU 2: examples N to 2N-1, GPU 3: examples 2N to 3N-1, GPU 4: examples 3N to 4N-1).
- Step 3: The partials are summed to get the energy, which is returned to the optimization method.
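As a rough single-GPU sketch of steps 1-3 (not the actual production code), the Thrust fragment below assumes a toy two-parameter linear model; the functor name objFuncPartial and the data layout are hypothetical. The parameters and examples sit on the device, each example computes its partial in parallel, and transform_reduce sums the partials into the energy.

    #include <thrust/device_vector.h>
    #include <thrust/transform_reduce.h>
    #include <thrust/functional.h>
    #include <thrust/iterator/counting_iterator.h>

    // Hypothetical per-example partial: squared error of a two-parameter model.
    struct objFuncPartial
    {
        const float *p;        // broadcast parameters (device pointer)
        const float *x, *y;    // example data (device pointers)

        objFuncPartial(const float *p_, const float *x_, const float *y_)
            : p(p_), x(x_), y(y_) {}

        __host__ __device__
        float operator()(int i) const
        {
            // Step 2: one partial per example, evaluated in parallel.
            float diff = p[0] * x[i] + p[1] - y[i];
            return diff * diff;
        }
    };

    float energy(const thrust::device_vector<float> &p,   // step 1: parameters on the device
                 const thrust::device_vector<float> &x,
                 const thrust::device_vector<float> &y)
    {
        // Step 3: sum the per-example partials on the GPU to get the energy.
        return thrust::transform_reduce(
            thrust::counting_iterator<int>(0),
            thrust::counting_iterator<int>((int)x.size()),
            objFuncPartial(thrust::raw_pointer_cast(p.data()),
                           thrust::raw_pointer_cast(x.data()),
                           thrust::raw_pointer_cast(y.data())),
            0.0f,
            thrust::plus<float>());
    }

The host-side optimizer would call energy() repeatedly; across multiple GPUs, each node would run the same reduction on its slice of the examples and a final sum across nodes (e.g., via MPI_Allreduce) would complete step 3.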

Results = The Connection Machine * C_NVIDIA (where C_NVIDIA >> 1)

Average of 100 iterations (sec), 8x core* vs. C2050**:
- Nonlinear PCA: speedup 40x; 295x vs. 1 core (measured)
- Linear PCA: speedup 8x; 57x vs. 1 core (measured)

* 2x Intel (quad-core) 2.53 GHz, OpenMP, SSE enabled via g++
** Includes all data-transfer overhead ("effective flops")

What is C_NVIDIA for modern x86_64 machines?

Scalability across GPU/CPU cluster nodes (big hybrid supercomputers are coming)
- Oak Ridge National Laboratory looks to the NVIDIA "Fermi" architecture for a new supercomputer.
- NERSC experimental GPU cluster: Dirac.
- EMSL experimental GPU cluster: Barracuda.

Looking into my crystal ball
- I predict long life for GPGPU applications. Why?
  - SIMD/SPMD/MIMD mappings translate well to new architectures.
  - CUDA and OpenCL provide an excellent way to create these codes.
- Will these applications always be written in these languages? Data-parallel extensions are hot!

Data-parallel extensions: Thrust
Example (from the Thrust website):

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/generate.h>
    #include <thrust/reduce.h>
    #include <thrust/functional.h>
    #include <cstdlib>

    int main(void)
    {
        // generate random data on the host
        thrust::host_vector<int> h_vec(100);
        thrust::generate(h_vec.begin(), h_vec.end(), rand);

        // transfer to device and compute sum
        thrust::device_vector<int> d_vec = h_vec;
        int x = thrust::reduce(d_vec.begin(), d_vec.end(), 0, thrust::plus<int>());

        return 0;
    }
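Assuming the example is saved as a file such as sum.cu, it would typically be built and run with NVIDIA's nvcc compiler driver, for instance: nvcc -O2 sum.cu -o sum && ./sum. The appeal of this style is that the parallel reduction is expressed at a high level, so the source is insulated from low-level device details.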

OpenCL has potential (but is still very new)
- x86: the dominant architecture; more cores with greater memory bandwidth and lower power.
- POWER7: Blue Waters, with over 1 million concurrent threads of execution in a petabyte of shared memory and innovative design features to avoid SMP scaling bottlenecks.
- Hybrid architectures: CPU/GPU clusters.
- Problems dominated by irregular access in large data: the Cray XMT, specialized for large graph problems.

Question 3
Will we need a whole new computational execution model for exascale systems, e.g. something like LSU's ParalleX? It certainly sounds wonderful: a new model of parallel computation with
- Semantics for state objects, functions, parallel flow control, and distributed interactions.
- Unbounded policies for implementation technology, structure, and mechanism.
- Intrinsic system-wide latency hiding.
- Near fine-grain global parallelism.
- Global unified parallel programming.
But, as before:
- Humans are the weak link.
- A scalable, globally unified file system is essential.
- Yes to a unified set of debugger and profiling tools.
- Many language semantics and assumptions will not scale!