SOS14 Panel Discussion: Software programming environments and tools for GPUs. Where are they today? What do we need for tomorrow? Savannah, GA, 9 March 2010.

Presentation transcript:

SOS14 Panel Discussion: Software programming environments and tools for GPUs. Where are they today? What do we need for tomorrow? Savannah, GA, 9 March 2010.
Panelists: John A. Turner (ORNL), Allen Malony (Univ. of Oregon), Simone Melchionna (EPFL), Mike Heroux (SNL)

2 U.S. Department of Energy: DOE INCITE applications span a broad range of science challenges. Project counts by area, over four successive INCITE allocation years:
Accelerator physics: 1, 1, 1, 1
Astrophysics: 3, 4, 5, 5
Chemistry: 1, 1, 2, 4
Climate change: 3, 3, 4, 5
Combustion: 1, 1, 2, 2
Computer science: 1, 1, 1, 1
Fluid dynamics: 1, 1
Fusion: 4, 5, 3, 5
Geosciences: 1, 1, 1
High energy physics: 1, 1
Life sciences: 2, 2, 2, 4
Materials science: 2, 3, 3, 4
Nuclear physics: 2, 2, 1, 2
Industry: 2, 3, 3, 3
Total projects:
CPU hours: 36,156,000 / 75,495,000 / 145,387,000 / 469,683,000

3 Application Challenges
Node-level concurrency
– finding sufficient concurrency to amortize or hide data-movement costs
– thread creation and management
Data movement
– managing the memory hierarchy
– minimizing data movement between CPU and GPU
Programming models
– expressing the concurrency and data movement mentioned above
– refactoring applications undergoing active development
– maintaining code portability and performance
Compilers and tools
– availability, quality, usability
– ability to use particular language features (e.g., C++ templates)
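The data-movement bullets above are the most amenable to a concrete illustration. Below is a minimal CUDA sketch, not from the panel slides, of the standard mitigation: splitting transfers into chunks on independent streams so host-device copies overlap with kernel execution and part of the transfer cost is hidden. The kernel, problem size, and chunk count are illustrative assumptions.

#include <cuda_runtime.h>

// Illustrative kernel: scale a chunk in place.
__global__ void scale(float* d, int n, float a) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] *= a;
}

int main() {
  const int n = 1 << 22, nChunks = 4, chunk = n / nChunks;
  float *h = nullptr, *d = nullptr;
  cudaMallocHost(&h, n * sizeof(float));  // pinned host memory: required for truly asynchronous copies
  cudaMalloc(&d, n * sizeof(float));
  for (int i = 0; i < n; ++i) h[i] = 1.0f;

  cudaStream_t s[nChunks];
  for (int c = 0; c < nChunks; ++c) cudaStreamCreate(&s[c]);

  // Pipeline chunks through independent streams so the copy of chunk c
  // overlaps with the kernel on earlier chunks, hiding PCIe transfer cost.
  for (int c = 0; c < nChunks; ++c) {
    int off = c * chunk;
    cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float), cudaMemcpyHostToDevice, s[c]);
    scale<<<(chunk + 255) / 256, 256, 0, s[c]>>>(d + off, chunk, 2.0f);
    cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float), cudaMemcpyDeviceToHost, s[c]);
  }
  cudaDeviceSynchronize();

  for (int c = 0; c < nChunks; ++c) cudaStreamDestroy(s[c]);
  cudaFreeHost(h); cudaFree(d);
  return 0;
}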

4 How are you programming GPUs today?
– In preparing apps for hybrid / accelerated / many-core platforms, what is your preferred path for balancing performance, portability, and "future-proofing"?
– What is the typical effort required to refactor applications for hybrid / accelerated / many-core platforms?
What role does performance modeling play, if any?
– What role can / should it play?
Software programming environments and tools for GPUs (Where are they today? What do we need for tomorrow?)

5 What programming environment tools are key to preparing applications for coming platforms?
– What tools are most in need of improvement & investment?
Do the application modifications for existing and near-term platforms (Roadrunner, GPU-accelerated systems) better prepare them for other future platforms envisioned in the Exascale roadmap?

6 What is difficult or missing from your current programming models?
What important architectural features must be captured in new programming models?
What are the possible future programming models 10 and 20 years from now?
Software programming environments and tools for GPUs (Where are they today? What do we need for tomorrow?)

7 In previous major shifts, such as vector to distributed-memory parallel, some applications did not survive or required such major refactoring that for all intents and purposes they were new codes.
– What fraction of applications would you predict will fail to make this transition?
– Will the survival rate be significantly different between "science" apps and "engineering" apps?
– Software has become more complex since the last major transition, but also tends to make greater use of abstraction. In the end, does this help or hinder applications in moving to these architectures?
– Is there an aspect to survival that could be viewed as a positive rather than a negative? That is, does the "end of the road" for an application really present an opportunity to re-think the physics / math / algorithms / implementations?

8 Allen Malony slides

9 Simone Melchionna, EPFL: High Performance Simulations of Complex Biological Flows on CPUs and GPUs
Collaboration: EPFL (CH), Harvard (US), CNR (Italy)

10 Multi-scale Hemodynamics [figure panels: mesh generation; level-set to 3D reconstruction; domain partitioning; vascular remodeling]

11 MUPHY: MULti-PHYsics simulator [figure: graph partitioning of the simulation domain]

12 Multiscale computing: GPU challenges
Heterogeneous multiscale simulation does not necessarily fit heterogeneous computing and data-parallel accelerators.
Method modifications have long latency, since they imply extensive code modifications.
Great constraints on programmers: 1) avoid conditionals; 2) avoid non-coalesced global memory access; 3) lack of inter-block communication; 4) data layout is critical. (A sketch of constraint 2 follows.)
Rethinking consolidated simulation methods: a great practice.
Hybrid programming model: efficient and accessible, but vulnerable.
For multiscale simulations, several components do not map onto threads / coalesced memory access, e.g. bisection search, ordered traversal of matrix elements.
Development is not an incremental porting: CPU-GPU bandwidth is a critical step.
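As an illustration of constraint 2, added here and not from the original slide: in the first CUDA kernel below, consecutive threads in a warp read consecutive addresses, so the hardware coalesces them into a few wide memory transactions; in the second, a stride scatters each warp's reads across memory, turning one transaction into many. Kernel names and the stride parameter are hypothetical.

// Coalesced: thread i touches element i, so a warp's 32 reads are adjacent.
__global__ void copy_coalesced(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];
}

// Non-coalesced: the strided index scatters each warp's reads,
// the access pattern slide 12 warns against.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[(i * stride) % n];
}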

13 Wish list
Overcome the heroic stage: a more accessible programming model, and lift constraints on programmers. More room for science!
Less strain from global code rewriting (scripting & meta-programming). Allow for incremental porting.
Overcome programming-model vulnerabilities (long modification latency) and increase the level of abstraction.
More libraries and coarse/fine-grained algorithmics. Profiling & debugging tools.
More testbeds, more HW emulators, more mature ties with the community. Avoid the xN temptation.
Establish models and best practice for established simulation methods. Put forward a GPU landscape over the space of simulation methods.
Establish models of data locality (method rethinking!). Moving-object simulation methods: establish new roadmaps.

14 Mike Heroux: Programming Environments for GPUs
Q: In preparing apps for hybrid / accelerated / many-core platforms, what is your preferred path for balancing performance, portability, and "future-proofing"?
A: We have developed a node API using C++ template meta-programming. It is similar to CUDA (actually Thrust) with the OpenCL memory model (Trilinos Node API).
Q: What is the typical effort required to refactor applications for hybrid / accelerated / many-core platforms?
A: We started with a clean slate (Tpetra) instead of refactoring Epetra.
Q: What programming environment tools are key to preparing applications for coming platforms? What tools are most in need of improvement & investment?
A: We need a portable programming model, higher level than OpenCL; the CUDA level of abstraction is sufficient. Something like the CUDA SDK is essential.
Q: What role can performance modeling play, if any?
A: Some, but a working example suite is even better, or miniapps.
Q: Do the application modifications for existing and near-term platforms (Roadrunner, GPU-accelerated systems) better prepare them for other future platforms envisioned in the Exascale roadmap?
A: Yes; if not in reusable software, at least in strategy and design.

15 Mike Heroux: Programming Environments for GPUs
In previous major shifts, such as vector to distributed-memory parallel, some applications did not survive or required such major refactoring that for all intents and purposes they were new codes.
– What fraction of applications would you predict will fail to make this transition? Many will make it, at least as "petascale kernels".
– Will the survival rate be significantly different between "science" apps and "engineering" apps? Science "forward problems" appear to be more scalable than engineering forward problems. But engineering apps are more ready for advanced modeling and simulation, UQ, etc.
– Software has become more complex since the last major transition, but also tends to make greater use of abstraction. In the end, does this help or hinder applications in moving to these architectures? Proper abstraction can be a huge advantage. Caveat: abstraction can solve any problem except too much abstraction.
– Is there an aspect to survival that should be viewed as a positive rather than a negative? That is, does the "end of the road" for an application really present an opportunity to re-think the physics / math / algorithms at a deeper level? Yes, but parallelism cannot overcome inferior algorithm complexity.

16 Trilinos Node API Example
Kernels: axpy() and dot()

// Generic node-level parallel skeletons.
template <class WDP>
void Node::parallel_for(int beg, int end, WDP workdata);

template <class WDP>
typename WDP::ReductionType Node::parallel_reduce(int beg, int end, WDP workdata);

// Work-data pair for y[i] = alpha*x[i] + beta*y[i].
template <class T>
struct AxpyOp {
  const T* x;
  T* y;
  T alpha, beta;
  void execute(int i) { y[i] = alpha*x[i] + beta*y[i]; }
};

// Work-data pair for a dot-product reduction.
template <class T>
struct DotOp {
  typedef T ReductionType;
  const T *x, *y;
  T identity() { return (T)0; }
  T generate(int i) { return x[i]*y[i]; }
  T reduce(T x, T y) { return x + y; }
};

AxpyOp<float> axpyOp;
axpyOp.x = ...; axpyOp.alpha = ...;
axpyOp.y = ...; axpyOp.beta = ...;
node.parallel_for<AxpyOp<float> >(0, length, axpyOp);

DotOp<float> dotOp;
dotOp.x = ...; dotOp.y = ...;
float dot;
dot = node.parallel_reduce<DotOp<float> >(0, length, dotOp);

Work with Chris Baker, ORNL
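For comparison, and not part of the original slide: slide 14 describes the Node API as "similar to CUDA (actually Thrust)", and the same two kernels can be expressed in Thrust roughly as follows. This is a minimal sketch; the functor name Axpby and the alpha/beta values are illustrative.

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/inner_product.h>

// y = alpha*x + beta*y, as a binary functor over (x[i], y[i]).
struct Axpby {
  float alpha, beta;
  __host__ __device__
  float operator()(float xi, float yi) const { return alpha * xi + beta * yi; }
};

int main() {
  const int n = 1 << 20;
  thrust::device_vector<float> x(n, 1.0f), y(n, 2.0f);

  // axpy: elementwise transform writing back into y.
  thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), Axpby{2.0f, 0.5f});

  // dot: fused multiply plus reduction in one call.
  float dot = thrust::inner_product(x.begin(), x.end(), y.begin(), 0.0f);
  return 0;
}

The parallel is direct: AxpyOp maps to the transform functor, and DotOp's generate/reduce pair maps to inner_product's multiply and plus defaults.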

17 Some Additional Thoughts
New programming models/environments:
– Must compete/cooperate with MPI.
– MPI advantage: most code is serial.
New PM/E:
– Truly useful if it solves the ubiquitous markup problem (a sketch of what markup means follows).
– Conjecture: data-intensive apps will drive new PM/Es. Our community will adopt as opportunities arise.
Source of new PM/Es: industry.
– The best compilers are commercial.
– The runtime system is critical.
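One reading of the "markup problem", an interpretation rather than something stated on the slide: since most code is serial, the lowest-disruption path to a new PM/E is directive markup over existing loops, OpenMP being the canonical example. The loop below is illustrative.

// The "markup" path: the serial loop is untouched; parallelism is
// expressed by a directive the compiler may honor or ignore.
void axpy(int n, float alpha, const float* x, float* y) {
  #pragma omp parallel for
  for (int i = 0; i < n; ++i)
    y[i] = alpha * x[i] + y[i];
}

// The contrasting path is an explicit rewrite against a device API
// (CUDA, OpenCL), which touches every kernel and its call sites.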