Extracting Ultra-Scale Lattice Boltzmann Performance via Hierarchical and Distributed Auto-Tuning
Samuel Williams, Leonid Oliker, Jonathan Carter, John Shalf
Future Technologies Group, Lawrence Berkeley National Laboratory

Introduction
 Management of data locality and data movement is essential to attaining high performance.
 However, the subtleties of multicore architectures, deep memory hierarchies, and networks, together with the randomness of batch job scheduling, make attaining high performance in large-scale distributed-memory applications increasingly challenging.
 In this talk, we examine techniques to mitigate these challenges and attain high performance on a plasma physics simulation.

LBMHD

LBMHD
 Lattice Boltzmann Magnetohydrodynamics (CFD + Maxwell's equations)
 Plasma turbulence simulation via the Lattice Boltzmann Method (~structured grid) for simulating astrophysical phenomena and fusion devices
 Three macroscopic quantities:
 Density
 Momentum (vector)
 Magnetic field (vector)
 LBM requires two velocity distributions:
 momentum distribution (27 scalar components)
 magnetic distribution (15 Cartesian vector components)
 Overall, that represents a storage requirement of 158 doubles per point.
[Figure: lattice layouts of the momentum distribution, the magnetic distribution, and the macroscopic variables]
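A quick check of the 158-double figure, assuming it counts two time-step copies of the per-point state:

\[
\underbrace{27}_{\text{momentum}} \;+\; \underbrace{15 \times 3}_{\text{magnetic}} \;+\; \underbrace{1 + 3 + 3}_{\text{macroscopic}} \;=\; 79 \ \text{doubles per point},
\qquad 2 \times 79 = 158 .
\]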

LBMHD
 Code structure:
 We base our code on the 2005 Gordon Bell nominated implementation.
 Time evolution proceeds through a series of collision( ) and stream( ) functions.
 collision( ) evolves the state of the simulation.
 stream( ) exchanges ghost zones among processes.
 When parallelized, stream( ) should constitute ~10% of the runtime.
 Notes on collision( ):
 For each lattice update, we must read 73 doubles, perform about 1300 flops, and update 79 doubles (1216 bytes of data movement).
 Just over 1.0 flop per byte (on an ideal architecture).
 Suggests LBMHD is memory-bound on most architectures.
 A structure-of-arrays layout (components are separated) ensures that cache capacity requirements are independent of problem size (see the layout sketch below).
 However, the TLB capacity requirement increases to >150 entries (similarly, HW prefetchers must deal with 150 streams).
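A minimal sketch in C (not the authors' actual code; type and function names are illustrative) of the structure-of-arrays layout described above: every distribution component gets its own contiguous array, so each lattice update touches a fixed number of doubles regardless of grid size, but the hardware must track many separate address streams.

    #include <stdlib.h>

    #define NMOM 27          /* momentum distribution: 27 scalar components     */
    #define NMAG (15 * 3)    /* magnetic distribution: 15 Cartesian vectors     */

    typedef struct {
        int     nx, ny, nz;             /* local grid dimensions                */
        double *f[NMOM];                /* one contiguous array per momentum component */
        double *g[NMAG];                /* one contiguous array per magnetic component */
        double *rho, *mom[3], *mag[3];  /* macroscopic quantities               */
    } lattice_soa;

    static lattice_soa *lattice_alloc(int nx, int ny, int nz)
    {
        size_t n = (size_t)nx * ny * nz;
        lattice_soa *l = calloc(1, sizeof *l);
        l->nx = nx; l->ny = ny; l->nz = nz;
        for (int c = 0; c < NMOM; c++) l->f[c] = malloc(n * sizeof(double));
        for (int c = 0; c < NMAG; c++) l->g[c] = malloc(n * sizeof(double));
        l->rho = malloc(n * sizeof(double));
        for (int d = 0; d < 3; d++) {
            l->mom[d] = malloc(n * sizeof(double));
            l->mag[d] = malloc(n * sizeof(double));
        }
        return l;
    }

Since collision( ) reads one copy of the lattice (73 doubles) and writes another (79 doubles), the number of distinct streams the TLB and prefetchers see during the update is roughly the ~150 the slide mentions.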

LBMHD Sequential Challenges
 Stencil-like code is sensitive to bandwidth and data movement (maximize bandwidth, minimize superfluous data movement).
 To attain good bandwidth on processors with HW prefetching, we must express a high degree of sequential locality, i.e. access all addresses f[i]…f[i+k] for a moderately large k.
 The code demands maintaining a cache working set of >80 doubles per lattice update (simpler stencils require a working set of 3 doubles).
 The working set scales linearly with sequential locality.
 For every element of sequential access, we must maintain a working set of >640 bytes (a worked example follows below)! Thus, to express one page of sequential locality, we must maintain a working set of over 2.5MB per thread.
 Clearly, we must trade sequential locality for temporal locality.
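One way to make the trade-off concrete (a back-of-the-envelope note; the 512 KB L2 is just an example cache size):

\[
80 \times 8\,\mathrm{B} = 640\,\mathrm{B}\ \text{of working set per point of unit-stride access},
\]

so a run of VL consecutive points needs roughly VL x 640 B of cache, and keeping that inside a 512 KB L2 already caps VL at well under 800 points.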

LBMHD Parallel Challenges
 Each process maintains a 1-element (79 doubles) deep ghost zone around its subdomain.
 On each time step, a subset of these values (24 doubles) must be communicated for each face of the subdomain.
 For small problems per process (e.g. 24x32x48), each process must communicate at least 6% of its data (a rough check follows below).
 Given that network bandwidth per core is typically far less than DRAM bandwidth per core, this poor surface-to-volume ratio can impede performance.
 Moreover, network bandwidth can vary depending on the locations of the communicating pairs.
 We must explore approaches to parallelization and decomposition that mitigate these effects.
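A rough check of the 6% figure, assuming the percentage is measured against the 79 doubles of a single lattice copy per point and 24 doubles are exchanged per surface point:

\[
\frac{2\,(24\cdot 32 + 24\cdot 48 + 32\cdot 48)\times 24}{(24\cdot 32\cdot 48)\times 79}
= \frac{6912 \times 24}{36864 \times 79} \approx 5.7\%,
\]

which is roughly consistent with the "at least 6%" quoted above.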

Supercomputing Platforms

Supercomputing Platforms
 We examine performance on three machines: the Cray XT4 (Franklin), the Cray XE6 (Hopper), and the IBM BGP (Intrepid).

                              Intrepid           Hopper              Franklin
  Chips per node              1                  4                   1
  Cores per chip              4                  6                   4
  Interconnect                Custom 3D Torus    Gemini 3D Torus     SeaStar2 3D Torus
  Gflop/s per core            3.4
  DRAM GB/s per core
  Node power per core (1)                        19W                 30W
  Private cache per core      32KB               64+512KB
  LLC cache per core          2MB                1MB                 512KB

  (1) Based on top500

Experimental Methodology

Experimental Methodology
 Memory usage: We explore three different (per node) problem sizes: 96³, 144³, and a third, larger grid. These correspond to using (at least) 1GB, 4GB, and 16GB of DRAM (a quick check of these figures follows below).
 Programming models: We explore the flat MPI, MPI+OpenMP, and MPI+Pthreads programming models. OpenMP/pthreads comparisons are used to quantify the performance impact of increased productivity.
 Scale: We auto-tune at relatively small scale (64 nodes = 1536 cores on Hopper), then evaluate at scale (2K nodes = 49,152 cores on Hopper).
 Note: in every configuration, we always use every core on a node. That is, as we increase the number of threads per process, we proportionally reduce the number of processes per node.
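As a sanity check on the memory figures, assuming the 158 doubles per lattice point quoted earlier and ignoring ghost zones:

\[
96^3 \times 158 \times 8\,\mathrm{B} \approx 1.1\,\mathrm{GB},
\qquad
144^3 \times 158 \times 8\,\mathrm{B} \approx 3.8\,\mathrm{GB}.
\]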

Performance Optimization

Sequential Optimizations
 Virtual vectors: Treat the cache hierarchy as a virtual vector register file with a parameterized vector length. This parameter allows one to trade sequential locality for temporal locality (see the sketch below).
 Unrolling/reordering/SIMDization: Restructure the operations on virtual vectors to facilitate code generation. Use SIMD intrinsics to map operations on virtual vectors directly to SIMD instructions. There is a large parameter space for balancing register-file locality and L1 bandwidth.
 SW prefetching: The memory access pattern is complicated by short stanzas. Use SW prefetching intrinsics to hide memory latency. Tuning the prefetch distance trades temporal for sequential locality.
 BGP-specific: Restructure code to cope with L1 quirks (half bandwidth, write-through) and insert XL/C-specific pragmas (aligned and disjoint).
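A minimal sketch of the virtual-vector idea (illustrative only, not the tuned kernel; names are hypothetical): VL points are gathered into small cache-resident buffers, processed, and written back, so a larger VL buys longer unit-stride streams at the cost of a larger working set.

    #define NMOM 27     /* momentum components; the magnetic ones are handled analogously */
    #define VL   256    /* the tunable virtual vector length                              */

    void collision_blocked(double *const f_in[NMOM], double *const f_out[NMOM],
                           long npoints)
    {
        double buf[NMOM][VL];        /* "virtual vector registers" held in cache */

        for (long base = 0; base < npoints; base += VL) {
            long len = (npoints - base < VL) ? (npoints - base) : VL;

            /* gather: one long unit-stride read per component stream */
            for (int c = 0; c < NMOM; c++)
                for (long i = 0; i < len; i++)
                    buf[c][i] = f_in[c][base + i];

            /* ... the ~1300 flops of collision arithmetic per point go here ... */

            /* scatter: one long unit-stride write per component stream */
            for (int c = 0; c < NMOM; c++)
                for (long i = 0; i < len; i++)
                    f_out[c][base + i] = buf[c][i];
        }
    }

Doubling VL doubles the length of each unit-stride stanza (better for HW prefetchers and DRAM) but also doubles the buffer that must stay cache-resident, which is exactly the sequential-versus-temporal-locality trade described above.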

Optimizations for Parallelism
 Affinity: We use aprun options coupled with first-touch initialization to optimize for NUMA on Hopper (a sketch of first-touch initialization follows below). Similar techniques were applied on Intel clusters.
 MPI/thread decomposition: We explore all possible decompositions of the cubical per-node grid among processes and threads. Threading is only in the z-dimension.
 Optimization of stream( ): We explored sending individual velocities and entire faces. We also thread buffer packing. The MPI decomposition also affects the time spent accessing the various faces.
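A minimal first-touch sketch in C/OpenMP (an illustration of the general technique, not the authors' code; the function name is hypothetical): with thread affinity pinned via aprun, having each thread initialize the portion of the array it will later update causes those pages to be placed in that thread's local NUMA domain.

    #include <stdlib.h>

    /* Allocate and zero an array so that each page is first touched by the
     * thread that will later compute on it (same static schedule as the
     * compute loops), placing the page in that thread's NUMA domain. */
    double *numa_first_touch_alloc(long n)
    {
        double *a = malloc((size_t)n * sizeof(double));
        if (!a) return NULL;

        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; i++)
            a[i] = 0.0;

        return a;
    }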

Hierarchical Auto-tuning

Hierarchical Auto-tuning
 Prior work (IPDPS'08) examined auto-tuning LBMHD on a single node (ignoring time spent in MPI).
 To avoid a combinatorial explosion, we auto-tune the parallel (distributed) application in a two-stage fashion using 64 nodes:
 Stage 1: explore sequential optimizations on the 4GB problem.
 Stage 2: explore MPI and thread decomposition for each problem size.
 Once the optimal parameters are determined, we can run problems at scale.

Hierarchical Auto-tuning
 The best VL was 256 elements on the Opterons (fitting in the L2 was more important than page-length stanzas).
 The best VL was 128 elements on the BGP (clearly, fitting in the L1 did not provide sufficient sequential locality).
 The Cray machines appear to be much more sensitive to MPI and thread decomposition (threading provided a 30% performance boost over flat MPI).
 Although OpenMP is NUMA-aware, it did not scale perfectly across multiple sockets (pthreads did).

Results at Scale

Performance at Scale
 Using the best configurations from our hierarchical auto-tuning, we evaluate performance per core using 2K nodes on each platform for each problem size.
 Note: due to limited memory per node, not all machines can run all problem configurations.
 Assuming we are bandwidth-limited, all machines should attain approximately 2 GFlop/s per core (the Roofline limit; see the estimate below).
 Clearly, reference flat MPI performance (which sidesteps NUMA issues) is far below our expectations.
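A rough justification of the 2 GFlop/s figure, combining the collision( ) numbers from earlier with an assumed ~2 GB/s of sustained DRAM bandwidth per core:

\[
\mathrm{AI} \approx \frac{1300\ \text{flops}}{1216\ \text{B}} \approx 1.07\ \text{flops/B},
\qquad
1.07\ \text{flops/B} \times 2\ \text{GB/s} \approx 2.1\ \text{GFlop/s per core}.
\]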

Performance at Scale
 Auto-tuning the sequential implementation in a portable C fashion shows only moderate boosts to performance.
 Clearly, we are still well below the performance bound.

Performance at Scale
 However, auto-tuning the sequential implementation with ISA-specific optimizations shows a huge boost on all architectures.
 Some of this benefit comes from SIMDization/prefetching, but on Franklin a large benefit comes from cache-bypass instructions.
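For illustration, a minimal example (not the generated LBMHD kernel; the function name is hypothetical) of the cache-bypass idea on the x86 machines: SSE2 non-temporal stores write output components without first reading their cache lines, which matters for a kernel that overwrites 79 doubles per point. Both pointers are assumed 16-byte aligned and n is assumed even.

    #include <emmintrin.h>   /* SSE2: __m128d, _mm_load_pd, _mm_stream_pd */

    /* Copy n doubles (n even, src/dst 16-byte aligned) using streaming
     * stores so the destination bypasses the cache hierarchy. */
    void stream_write(double *dst, const double *src, long n)
    {
        for (long i = 0; i < n; i += 2) {
            __m128d v = _mm_load_pd(&src[i]);
            _mm_stream_pd(&dst[i], v);       /* non-temporal (cache-bypass) store */
        }
        _mm_sfence();                        /* order/flush the streaming stores */
    }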

Performance at Scale
 Threading showed no benefit on Intrepid.
 We observe that performance on Franklin is approaching the 2 GFlop/s performance bound.
 In fact, when examining collision( ), we observe that Franklin attains 94% of its STREAM bandwidth.
 Overall, we see a substantial performance boost over the reference implementation.
 On Hopper, threading boosts performance by 11-26% depending on problem size.
 Why does the benefit of threading vary so greatly?

Memory and Communication
 For the largest problem per node, we examine how optimization reduces time spent in collision( ) (computation) and stream( ) (communication).
 Clearly, sequential auto-tuning can dramatically reduce runtime, but the time spent in communication remains constant.
 Threading, in contrast, reduces the time spent in communication, as it replaces explicit messaging with coherent shared-memory accesses.

Memory and Communication
 However, when we consider the smallest (1GB) problem, we see that the time spent in stream( ) is significant. Note, this is an artifact of Hopper's greatly increased node performance.
 After sequential optimization on Hopper, more than 55% of the time is spent in communication.
 This does not bode well for future GPUs or accelerators on this class of computation, as they only address computation performance (the yellow bars) and lack enough memory to amortize communication (the red bars).

Energy Efficiency
 Increasingly, energy is the great equalizer among HPC systems.
 Thus, rather than comparing performance per node or performance per core, we consider energy efficiency (performance per Watt).
 Surprisingly, despite the differences in age and process technology, all three systems deliver similar energy efficiencies for this application.
 The energy efficiency of the Opteron-based system only improved by 33% in two years.

Conclusions

Conclusions
 We used LBMHD as a testbed to evaluate, analyze, and optimize performance on the BGP, XT4, and XE6.
 We employed a two-stage approach to auto-tuning and delivered substantial speedups across a variety of supercomputers and problem sizes.
 We observe a similar improvement in energy efficiency, noting that after optimization all machines delivered similar energy efficiencies.
 We observe that threading (either OpenMP or pthreads) improves performance over flat MPI, with pthreads making better use of NUMA architectures.
 Unfortunately, performance on small problems is hampered by finite network bandwidth. Accelerators' sacrifice of memory capacity to maximize bandwidth exacerbates the problem.

Future Work
 We used an application-specific solution to maximize performance.
 We will continue to investigate OpenMP issues on NUMA architectures.
 Just as advances in in-core performance will outstrip DRAM bandwidth, DRAM bandwidth will outstrip network bandwidth. This will place a great emphasis on communication-hiding techniques and one-sided, PGAS-like communication.
 In the future, we can use Active Harmony to generalize the search process and DSLs/source-to-source tools to facilitate code generation.
 Although we showed significant performance gains, in the future we must ensure these translate to superior (O(N)) algorithms.

Acknowledgements
 All authors from Lawrence Berkeley National Laboratory were supported by the DOE Office of Advanced Scientific Computing Research under contract number DE-AC02-05CH.
 This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH.
 George Vahala and his research group provided the original (Fortran) version of the LBMHD code.

Questions?