 What hardware accelerators are you using/evaluating?
 Cells in a Roadrunner configuration
◦ 8-way SPE threads w/ local memory, DMA & vector unit programming issues, but tremendous flexibility (see the small example below)
◦ Fast (25.6 GB/s) & large memory (4 GB or larger)
◦ Augmented C language; also C++ & now Fortran; GNU & XL variants; OpenMP is new; OpenCL is being prototyped
◦ Opterons can run the bulk of the code not needing acceleration; Cell-only clusters are possible
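For flavor, a minimal sketch of the SPE vector-unit side of this "augmented C", assuming the Cell SDK spu_intrinsics.h intrinsics (spu_splats, spu_madd); the function and its names are illustrative, not code from the presentation:

#include <spu_intrinsics.h>

/* y = a*x + y on the SPE vector unit, four float lanes per operation.
   nvec counts 16-byte vectors; the buffers are assumed 16-byte aligned. */
void saxpy_spu(vector float *y, const vector float *x, float a, int nvec)
{
    vector float va = spu_splats(a);        /* broadcast the scalar a to all lanes */
    for (int i = 0; i < nvec; i++)
        y[i] = spu_madd(va, x[i], y[i]);    /* fused multiply-add per vector */
}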

 What hardware accelerators are you using/evaluating? Several years ago…
◦ GPUs (pre-CUDA & Tesla)
 Brook & Scout (LANL data-parallel language)
 No 32-bit at the time; limited memory; everything must be a data-parallel problem
 No ECC memory; insufficient parity/ECC protection of data paths and logic
 Others at LANL are still working in this area (including Tesla & CUDA)
◦ ClearSpeed (several years ago)
 Earliest ClearSpeeds, before the Advance families
 Augmented C language; 96 SIMD PEs
 Everything is done as long SIMD data-parallel work, in sync
 Low power
◦ FPGAs (HDL, several years ago)
 Programming is hard -- very hard
 Logic space limited the number of 64-bit ops
 Fast but small SRAM; modest-size external DRAM, no faster than a CPU's
 One algorithm at a time, so using them for multi-physics has a significant impact
 Low power

 Describe the applications that you are porting to accelerators.
◦ MD (materials), laser-plasma PIC, IMC X-ray (particle) transport, GROMACS, n-body universe & galaxies, DNS turbulence & supernovae, HIV genealogy, nanowire long-time-scale MD
◦ Ocean circulation, wildfires, discrete social simulations, clouds & rain, influenza spread, plasma turbulence, plasma sheaths, fluid instabilities
 My personal observations:
◦ Particle methods are generally easiest
◦ Codes with good characteristics:
 A few computationally intense "algorithms"
 Pre-existing or obvious "fine-grain" parallel work units
 C language versus Fortran or highly OO C++

 Describe the kinds of speed-ups you are seeing (provide the basis for the comparison).
◦ 5x to 10x over a single Opteron core for memory-bandwidth-intensive code running at 5%-10% of peak
◦ 10x to 25x on particle methods, searches, etc.
 How does it compare to scaling out (i.e., just using more x86 processors)? What are the bottlenecks to further performance improvements?
◦ Scale-out via more sockets is better – BUT!
 Scaling efficiencies are already a problem for several LANL applications running at 4,000 to 10,000 cores; scaling out LANL-sized machines means $$$ for HW, space, & power
 Scaling out by multi-core is not a clear winner
◦ Memory BW and cache architectures often limit performance, which Cells mostly get around
◦ Memory BW per core is decreasing at an "inverse Moore's law" rate!

 Describe the programming effort required to make use of the accelerator.
◦ ½ to 1 man-year to "convert" a code, mostly spent on data structures and threaded-parallelism designs
◦ The lack of debuggers & similar tools is like the earliest days of parallel computing (LANL was a leader then as well – remember the early PVM Ethernet workstation "carpet" clusters in the mid-80's, before MPPs)
◦ We like to see 1-2 programming experts (PhD-level or equiv.) assigned to forefront-science code projects, which have 1 to 4+ physics experts (PhD-level)
 Amortization
◦ Ready for the future – codes and skilled programmers. We expect our dual-level (MPI + threads) & SIMD-vectorization techniques used for Roadrunner to pay off on future multi-core and many-core chips as well (see the sketch below).
◦ It's not just about running codes this year. Others will have to work through new forms of parallelism soon.
◦ We can do science now that isn't possible with most other machines
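A minimal, hypothetical sketch of the dual-level (MPI + threads) structure mentioned above, assuming MPI with OpenMP as the threading layer; on Roadrunner itself the threaded level runs on Cell SPEs rather than OpenMP threads, so this shows the code shape, not the actual implementation:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Outer level: MPI ranks across nodes (threads funneled through the main thread). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Inner level: fine-grain parallel work units within the node. */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        /* ... per-thread / per-work-unit compute would go here ... */
        printf("rank %d, thread %d\n", rank, tid);
    }

    MPI_Finalize();
    return 0;
}

The same decomposition (coarse MPI domains, fine-grain work units inside each domain) is what maps onto SPE threads on Cell and onto cores on later multi-core chips.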

 Compare accelerator cost to scaling-out cost.
◦ A commodity-processor-only machine would have cost 2x what Roadrunner did (~$80M more)
◦ It would also have used 2x or more power (~$1M per MW)
◦ Significantly larger node counts cause scaling & reliability issues
◦ Accelerators or heterogeneous chips should be greener
 Ease-of-use issues
◦ Newer Cell programming techniques (ALF, OpenMP) could make this easier
◦ A Cell cluster would be easier, but the PPE is really, really slow for non-SPU-accelerated code segments
◦ Not for the faint of heart, but Top20 machines never are

 What is the future direction of hardware-based accelerators?
◦ Domain-specific libraries can make them far more useful in those specific areas
◦ Some may appear on Intel QPI or AMD HT
◦ Specialized cores will show up within commodity microprocessors – ignore them or use them
◦ GPU-based systems will have to adopt ECC & parity protection
◦ Convey appears to have the most viable FPGA approach (FPGA as a compiler-managed co-processor)
 Software futures?
◦ OpenCL looks promising but doesn't address programming the specialized accelerator devices themselves
◦ The uber-auto-wizard-compiler will never come
◦ Heterogeneous compilers may come
◦ Debuggers & tools may come
 What are your thoughts on what the vendors need to do to ensure wider acceptance of accelerators?
◦ Create next-generation versions and sell them as mainstream products

 Compile & run on the PowerPC PPE
 Identify & isolate the algorithm & data that will run in parallel on the 8 "remote" SPEs
 Compile a scalar version of the algorithm on the SPE
◦ Add SPE thread process control
◦ Add DMAs
 Use "blocking" DMAs at this stage, just for functionality (see the sketch below)
 Worry about data alignments
◦ First on a single SPE, then on 8 SPEs
 Optimize the SPE code
◦ SIMD, branches → merges
◦ Add asynchronous double/triple buffering of DMAs
 For Roadrunner, connect to the rest of the code on the Opteron via DaCS and a "message relay"
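A minimal sketch of the "blocking DMAs for functionality" stage on the SPE side, assuming the Cell SDK spu_mfcio.h interface (mfc_get, mfc_put, mfc_write_tag_mask, mfc_read_tag_status_all); the buffer sizes, the ctx_t control block, and the doubling kernel are illustrative, not code from the presentation:

#include <spu_mfcio.h>

#define N 1024   /* floats per work unit (illustrative); 4 KB fits in a single DMA */

/* Local-store buffers; DMA requires at least 16-byte (here 128-byte) alignment. */
static float in_buf[N]  __attribute__((aligned(128)));
static float out_buf[N] __attribute__((aligned(128)));

typedef struct {                 /* control block the host side is assumed to fill */
    unsigned long long ea_in;    /* effective address of input in main memory  */
    unsigned long long ea_out;   /* effective address of output in main memory */
} ctx_t;

static ctx_t ctx __attribute__((aligned(128)));

int main(unsigned long long spe_id, unsigned long long argp, unsigned long long envp)
{
    const unsigned int tag = 1;
    (void)spe_id; (void)envp;

    /* Blocking DMA: fetch the control block pointed to by argp, then the input data. */
    mfc_get(&ctx, argp, sizeof(ctx), tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();               /* wait for completion: "blocking" */

    mfc_get(in_buf, ctx.ea_in, sizeof(in_buf), tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();

    /* Scalar compute stage; SIMDization and double buffering come in later steps. */
    for (int i = 0; i < N; i++)
        out_buf[i] = 2.0f * in_buf[i];

    /* Blocking DMA: push results back to main memory. */
    mfc_put(out_buf, ctx.ea_out, sizeof(out_buf), tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();

    return 0;
}

Once this runs correctly on one SPE (and then on 8), the blocking waits are replaced by double/triple-buffered asynchronous DMAs and the inner loop is SIMDized.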

 Roadrunner is more than a petascale supercomputer for today's use
◦ It provides a balanced platform to explore new algorithm designs and programming models, and to refresh developer skills
 LANL has been an early adopter of transformational technology*:
◦ 1970s: HPC is scalar → LANL adopts vector (Cray 1 w/ no OS)
◦ 1980s: HPC is vector → LANL adopts data parallel (big CM-2)
◦ 2000s: HPC is multi-core clusters → LANL adopts hybrid (Roadrunner)
*Credit to Scott Pakin, CCS-1, for this list idea

Roadrunner hybrid node data flow (Opteron host, Cell PPE, and 8 SPEs in parallel):
(1) The host (Opteron) launches the Cell code; non-accelerated code and cluster MPI remain on the Opteron.
(2) Host data is pushed/pulled to Cell memory over the PCIe link via DaCS.
(3) The Cell PPE spawns parallel threads on the 8 SPEs (see the sketch below).
(4) Until done, each SPE DMA multi-buffers Cell data into its local memory, computes within its local-memory buffers, and DMA multi-buffers results back to Cell memory, 8-way parallel.
(5a) Updated data is pushed/pulled back to the host over the PCIe link via DaCS. (5b) Simultaneously, the node may need to push/pull more data to/from the Cell and to/from the cluster (MPI), or could be available for concurrent work during this time.
(6) The Cell code and its parallel threads complete.
How much can be automated in compilers or languages?
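A hypothetical sketch of step (3), the PPE spawning parallel threads on the 8 SPEs, assuming the libspe2 interface (spe_context_create, spe_program_load, spe_context_run) and an SPE executable embedded under the name spe_kernel; this illustrates the pattern only, not the actual Roadrunner code, which also layers DaCS and the "message relay" on the Opteron side:

#include <libspe2.h>
#include <pthread.h>

#define NUM_SPES 8

extern spe_program_handle_t spe_kernel;   /* embedded SPE program (assumed name) */

static void *run_spe(void *arg)
{
    spe_context_ptr_t ctx = (spe_context_ptr_t)arg;
    unsigned int entry = SPE_DEFAULT_ENTRY;
    /* Blocks in this PPE thread until the SPE program exits. */
    spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);
    return NULL;
}

int main(void)
{
    spe_context_ptr_t ctx[NUM_SPES];
    pthread_t thr[NUM_SPES];

    for (int i = 0; i < NUM_SPES; i++) {
        ctx[i] = spe_context_create(0, NULL);            /* one context per SPE    */
        spe_program_load(ctx[i], &spe_kernel);           /* load the SPE image     */
        pthread_create(&thr[i], NULL, run_spe, ctx[i]);  /* 8-way parallel launch  */
    }
    for (int i = 0; i < NUM_SPES; i++) {
        pthread_join(thr[i], NULL);                      /* step (6): threads done */
        spe_context_destroy(ctx[i]);
    }
    return 0;
}

(Link with -lspe2 -lpthread; the per-SPE argp argument, NULL here, is where a control block like the one in the earlier SPE sketch would be passed.)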