Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms
Samuel Williams (1,2), Jonathan Carter (2), Leonid Oliker (1,2), John Shalf (2), Katherine Yelick (1,2)
(1) University of California, Berkeley   (2) Lawrence Berkeley National Laboratory

Motivation
- Multicore is the de facto solution for improving peak performance for the next decade.
- How do we ensure this applies to sustained performance as well?
- Processor architectures are extremely diverse, and compilers can rarely fully exploit them.
- We require a HW/SW solution that guarantees performance without completely sacrificing productivity.

Overview
- Examined the Lattice-Boltzmann Magneto-hydrodynamic (LBMHD) application.
- Present and analyze two threaded & auto-tuned implementations.
- Benchmarked performance across 5 diverse multicore microarchitectures:
  - Intel Xeon (Clovertown)
  - AMD Opteron (rev.F)
  - Sun Niagara2 (Huron)
  - IBM QS20 Cell Blade (PPEs)
  - IBM QS20 Cell Blade (SPEs)
- We show that:
  - auto-tuning can significantly improve application performance,
  - Cell consistently delivers good performance and efficiency,
  - Niagara2 delivers good performance and productivity.

Multicore SMPs used

Multicore SMP Systems

Multicore SMP Systems (memory hierarchy): conventional cache-based memory hierarchy

Multicore SMP Systems (memory hierarchy): conventional cache-based memory hierarchy vs. disjoint local-store memory hierarchy

Multicore SMP Systems (memory hierarchy): Cache + Pthreads implementations vs. Local Store + libspe implementations

Multicore SMP Systems (peak flops)
- Intel Xeon (Clovertown): 75 Gflop/s
- AMD Opteron (rev.F): 17 Gflop/s
- Sun Niagara2 (Huron): 11 Gflop/s
- IBM Cell Blade: PPEs: 13 Gflop/s, SPEs: 29 Gflop/s

Multicore SMP Systems (peak DRAM bandwidth)
- Intel Xeon (Clovertown): 21 GB/s (read), 10 GB/s (write)
- AMD Opteron (rev.F): 21 GB/s
- Sun Niagara2 (Huron): 42 GB/s (read), 21 GB/s (write)
- IBM Cell Blade: 51 GB/s

Auto-tuning

Auto-tuning
- Hand-optimizing each architecture/dataset combination is not feasible.
- Our auto-tuning approach finds a good performance solution through a combination of heuristics and exhaustive search:
  - a Perl script generates many possible kernels (including SIMD-optimized kernels),
  - an auto-tuning benchmark examines the kernels and reports back the best one for the current architecture/dataset/compiler/...
- Performance depends on the optimizations generated.
- Heuristics are often desirable when the search space isn't tractable.
- Auto-tuning has proven value in dense linear algebra (ATLAS), spectral methods (FFTW, SPIRAL), and sparse methods (OSKI).
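To make the search concrete, the following is a minimal, self-contained C sketch of the kind of timing/search driver such an auto-tuner uses. The two toy "variants" and all names here are illustrative only; the real tuner benchmarks the Perl-generated LBMHD kernel variants.

    /* Minimal auto-tuning search driver (illustrative).  In the real tuner the
     * candidates are Perl-generated LBMHD kernel variants; two toy variants of
     * the same update stand in for them here. */
    #include <stdio.h>
    #include <time.h>
    #include <float.h>

    #define N (1 << 20)
    static double a[N], b[N];

    static void variant_unroll1(void) { for (int i = 0; i < N; i++) a[i] += 1.1 * b[i]; }
    static void variant_unroll2(void) { for (int i = 0; i < N; i += 2) { a[i] += 1.1 * b[i]; a[i+1] += 1.1 * b[i+1]; } }

    typedef void (*kernel_t)(void);

    static double seconds(kernel_t k)          /* time one run of a candidate kernel */
    {
        clock_t t0 = clock();
        k();
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    int main(void)
    {
        kernel_t candidates[] = { variant_unroll1, variant_unroll2 };
        int best = 0;
        double best_t = DBL_MAX;
        for (int i = 0; i < 2; i++) {          /* exhaustive search over the variants */
            double t = seconds(candidates[i]);
            if (t < best_t) { best_t = t; best = i; }
        }
        printf("best variant: %d (%.4f s)\n", best, best_t);
        return 0;
    }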

Introduction to LBMHD

Introduction to Lattice Methods
- Structured grid code with a series of time steps.
- Popular in CFD (allows for complex boundary conditions).
- Overlay a higher-dimensional phase space.
- Simplified kinetic model that maintains the macroscopic quantities.
- Distribution functions (e.g. velocities per point in space) are used to reconstruct macroscopic quantities.
- Significant memory capacity requirements.
(figure: lattice axes +X, +Y, +Z)
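For reference, the "reconstruction" mentioned above is, in the standard lattice Boltzmann formulation, just a set of low-order moments of the distributions; the magnetic field in the MHD extension is likewise the sum of the vector-valued magnetic distributions. These are the textbook relations, not taken from the slide:

    \rho(\mathbf{x},t) = \sum_i f_i(\mathbf{x},t), \qquad
    \rho\,\mathbf{u}(\mathbf{x},t) = \sum_i \mathbf{c}_i\, f_i(\mathbf{x},t), \qquad
    \mathbf{B}(\mathbf{x},t) = \sum_j \mathbf{b}_j(\mathbf{x},t)

where the f_i are the scalar momentum distributions along lattice velocities c_i, and the b_j are the vector magnetic distributions.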

LBMHD (general characteristics)
- Plasma turbulence simulation.
- Couples CFD with Maxwell's equations.
- Two distributions:
  - momentum distribution (27 scalar velocities)
  - magnetic distribution (15 vector velocities)
- Three macroscopic quantities:
  - density
  - momentum (vector)
  - magnetic field (vector)
(figures: momentum distribution, magnetic distribution, and macroscopic variables on the +X/+Y/+Z lattice)

LBMHD (flops and bytes)
- Must read 73 doubles and update 79 doubles per point in space (minimum 1200 bytes).
- Requires about 1300 floating-point operations per point in space.
- Flop:byte ratio:
  - 0.71 (write-allocate architectures)
  - 1.07 (ideal)
- Rule of thumb for LBMHD: architectures with more flops than bandwidth are likely memory bound (e.g. Clovertown).
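A quick back-of-the-envelope check of those ratios, assuming 8-byte doubles (on a write-allocate architecture each stored line is also read from memory, adding another 79 doubles of traffic):

    \text{ideal: } \frac{1300\ \text{flops}}{(73 + 79)\times 8\ \text{bytes}} = \frac{1300}{1216} \approx 1.07,
    \qquad
    \text{write-allocate: } \frac{1300}{(73 + 79 + 79)\times 8} = \frac{1300}{1848} \approx 0.7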

LBMHD (implementation details)
- Data structure choices:
  - Array of Structures: no spatial locality, strided access.
  - Structure of Arrays: a huge number of memory streams per thread, but guarantees spatial locality and unit stride, and vectorizes well.
- Parallelization:
  - The Fortran version used MPI to communicate between tasks, a bad match for multicore.
  - The version in this work uses pthreads for multicore and MPI for inter-node; MPI is not used when auto-tuning.
- Two problem sizes:
  - 64^3 (~330MB)
  - 128^3 (~2.5GB)
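A small C sketch of the two layout choices for the 27 scalar momentum velocities (declarations only; the names and sizes are illustrative, and the real code also carries the 15 vector magnetic velocities and the macroscopic fields):

    /* Array of Structures: all velocities of one point are contiguous, so
     * neighboring points of the SAME component are 27 doubles apart:
     * strided access, poor spatial locality, hard to SIMDize. */
    #define NV 27
    struct point_aos { double f[NV]; };
    struct point_aos *lattice_aos;            /* nx*ny*nz elements */

    /* Structure of Arrays: one contiguous array per velocity component:
     * unit stride and good spatial locality, vectorizes well, but ~150
     * simultaneous memory streams per thread across all distributions. */
    struct lattice_soa { double *f[NV]; };    /* f[v][x + nx*(y + ny*z)] */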

Stencil for Lattice Methods
- Very different from the canonical heat-equation stencil:
  - there are multiple read and write arrays,
  - there is no reuse.
(figure: read_lattice[][] vs. write_lattice[][])

Side Note on Performance Graphs
- Threads are mapped first to cores, then to sockets, i.e. multithreading, then multicore, then multisocket.
- Niagara2 always used 8 threads/core.
- We show two problem sizes.
- We'll step through performance as optimizations/features are enabled within the auto-tuner: more colors implies more optimizations were necessary.
- This allows us to compare architecture performance while keeping programmer effort (productivity) constant.

Performance and Analysis of Pthreads Implementation

Pthread Implementation
- Not naïve:
  - fully unrolled loops
  - NUMA-aware
  - 1D parallelization
- Always used 8 threads per core on Niagara2.
- 1P Niagara2 is faster than the 2P x86 machines.
(figure: performance on Intel Xeon (Clovertown), AMD Opteron (rev.F), Sun Niagara2 (Huron), and the IBM Cell Blade (PPEs))
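"NUMA-aware" here means each thread first touches (initializes) the slab of the lattice it will later update, so first-touch page placement puts that slab in the DRAM attached to the thread's own socket. A minimal pthreads sketch of that idiom, with illustrative sizes and names (not the actual LBMHD code):

    /* First-touch NUMA placement sketch: each pthread initializes, and later
     * updates, its own contiguous slab, so the OS maps those pages to the
     * memory controller nearest that thread. */
    #include <pthread.h>
    #include <stdlib.h>

    #define NTHREADS 8
    #define POINTS_PER_THREAD (1 << 20)

    static double *lattice;                 /* allocated but untouched in main() */

    static void *init_and_compute(void *arg)
    {
        long tid = (long)arg;
        double *mine = lattice + tid * POINTS_PER_THREAD;
        for (long i = 0; i < POINTS_PER_THREAD; i++)   /* first touch = page placement */
            mine[i] = 0.0;
        /* ... later time steps update the same slab from the same thread ... */
        return NULL;
    }

    int main(void)
    {
        pthread_t th[NTHREADS];
        lattice = malloc((size_t)NTHREADS * POINTS_PER_THREAD * sizeof *lattice);
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&th[t], NULL, init_and_compute, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(th[t], NULL);
        free(lattice);
        return 0;
    }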

Pthread Implementation (fractions of peak)
- Same configuration as above (fully unrolled, NUMA-aware, 1D parallelization; 8 threads/core on Niagara2).
- Figure callouts: Intel Xeon (Clovertown): 4.8% of peak flops, 17% of bandwidth; AMD Opteron (rev.F): 14% of peak flops, 17% of bandwidth; Sun Niagara2 (Huron): 54% of peak flops, 14% of bandwidth; IBM Cell Blade (PPEs): 1% of peak flops, 0.3% of bandwidth.

Initial Pthread Implementation
- Same configuration as above.
- Figure callout: performance degradation despite the improved surface-to-volume ratio.

Cache effects
- We want to maintain a working set of velocities in the L1 cache: ~150 arrays, each trying to keep at least 1 cache line resident.
- Impossible with Niagara2's 1KB/thread L1 working set, which produces capacity misses.
- On the other architectures, the combination of:
  - low-associativity L1 caches (2-way on the Opteron),
  - large numbers of arrays, and
  - near-power-of-2 problem sizes
  can result in large numbers of conflict misses.
- Solution: apply a lattice- (offset-) aware padding heuristic to the velocity arrays to avoid/minimize conflict misses.

Auto-tuned Performance (+Stencil-aware Padding)
- This lattice method is essentially 79 simultaneous 72-point stencils.
- That can cause conflict misses even with highly associative L1 caches (not to mention the Opteron's 2-way L1).
- Solution: pad each component so that, when accessed with the corresponding stencil (spatial) offset, the components are uniformly distributed in the cache.
(figure: Clovertown, Opteron rev.F, Niagara2, and Cell PPEs; bars stack Naïve+NUMA, +Padding)
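A C sketch of the padding idea: stagger the starting address of each velocity array by a small, component-dependent number of cache lines so the ~150 concurrent streams map to different cache sets. The offsets and sizes below are illustrative only, not the paper's actual lattice-aware heuristic.

    #include <stdlib.h>

    #define NCOMP   150          /* ~150 read/write velocity and macroscopic arrays */
    #define NPOINTS (64*64*64)
    #define PAD_QUANTUM 8        /* pad in units of one 64-byte line (8 doubles) */

    static double *component[NCOMP];

    static void allocate_padded(void)
    {
        for (int c = 0; c < NCOMP; c++) {
            size_t pad = (size_t)(c % 16) * PAD_QUANTUM;   /* stagger start addresses */
            double *base = malloc((NPOINTS + pad) * sizeof(double));
            component[c] = base + pad;   /* keep 'base' elsewhere if you need to free it */
        }
    }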

Blocking for the TLB
- Touching 150 different arrays will thrash any TLB with fewer than 128 entries.
- Goal: maximize TLB page locality.
- Solution: borrow a technique from compilers for vector machines:
  - fuse the spatial loops,
  - strip-mine them into vectors of size VL (the vector length),
  - interchange the spatial and velocity loops.
- This can be generalized by varying:
  - the number of velocities simultaneously accessed, and
  - the number of macroscopics / velocities simultaneously updated.
- It has the side benefit of expressing more ILP and DLP (SIMDization) and a cleaner loop structure, at the cost of increased L1 cache traffic.
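A structural C sketch of that transformation. It is illustrative only: the real LBMHD update couples many components per point, and the strip buffer, NV value, and function name here are placeholders.

    /* Fused spatial loops, strip-mined into chunks of VL points, with the
     * velocity loop outside the strip: each stream (and its TLB page) is
     * reused VL times before moving on to the next component. */
    #define NV 72                 /* velocity components per point (illustrative) */

    void collision_vectorized(double *phi[NV], double *tmp, int npoints, int VL)
    {
        for (int start = 0; start < npoints; start += VL) {
            int len = (npoints - start < VL) ? npoints - start : VL;
            for (int v = 0; v < NV; v++) {            /* one stream at a time ... */
                double *p = phi[v] + start;
                for (int i = 0; i < len; i++)         /* ... over a VL-point strip */
                    tmp[v * VL + i] = p[i];           /* gather into a small working set */
            }
            /* ... perform the collision on the VL-point working set in 'tmp',
             *     then scatter the updated values back out the same way ... */
        }
    }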

Multicore SMP Systems (TLB organization)
- Intel Xeon (Clovertown): 16 entries, 4KB pages
- AMD Opteron (rev.F): 32 entries, 4KB pages
- Sun Niagara2 (Huron): 128 entries, 4MB pages
- IBM Cell Blade: PPEs: 1024 entries, SPEs: 256 entries, 4KB pages

Cache / TLB Tug-of-War
- For cache locality we want a small VL; for TLB page locality we want a large VL.
- Each architecture strikes a different balance between these two forces.
- Solution: auto-tune to find the optimal VL.
(figure: the L1 miss penalty grows once VL exceeds roughly L1 size / 1200 bytes per point; the TLB miss penalty grows once VL falls below roughly page size / 8 bytes per double)
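Reading the two bounds off the diagram, with ~1200 bytes touched per lattice point and 8-byte doubles, the tension is roughly:

    \frac{\text{page size}}{8\ \text{bytes/double}} \;\lesssim\; \mathrm{VL} \;\lesssim\; \frac{\text{L1 size}}{\sim 1200\ \text{bytes/point}}

As a worked example (numbers chosen to match a 64 KB L1 with 4 KB pages, the Opteron's configuration), the upper bound is about 65536/1200, i.e. roughly 54 points, while the lower bound is 4096/8 = 512, so the two constraints cannot both be satisfied and the auto-tuner must search for the best compromise.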

Auto-tuned Performance (+Vectorization)
- Each update requires touching ~150 components, each likely to be on a different page.
- TLB misses can significantly impact performance.
- Solution: vectorization (fuse the spatial loops, strip-mine into vectors of size VL, and interchange with the phase-dimension loops).
- Auto-tune: search for the optimal vector length.
- Significant benefit on some architectures; becomes irrelevant when bandwidth dominates performance.
(figure: Clovertown, Opteron rev.F, Niagara2, and Cell PPEs; bars stack Naïve+NUMA, +Padding, +Vectorization)

Auto-tuned Performance (+Explicit Unrolling/Reordering)
- Give the compilers a helping hand with the complex loops.
- Code generator: a Perl script generates all power-of-2 possibilities.
- Auto-tune: search for the best unrolling and expression of data-level parallelism.
- This is essential when using SIMD intrinsics.
(figure: Clovertown, Opteron rev.F, Niagara2, and Cell PPEs; bars stack Naïve+NUMA, +Padding, +Vectorization, +Unrolling)
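For illustration, one generated unroll-by-4 variant of a single-stream strip update might look like the following (hypothetical code, not the generator's actual output); the tuner picks among such power-of-2 unrollings and orderings.

    /* Unroll by 4, with independent temporaries grouped to expose ILP. */
    static void update_strip_unroll4(double *restrict dst, const double *restrict src,
                                     double coeff, int VL)
    {
        int i;
        for (i = 0; i + 4 <= VL; i += 4) {
            double t0 = coeff * src[i + 0];
            double t1 = coeff * src[i + 1];
            double t2 = coeff * src[i + 2];
            double t3 = coeff * src[i + 3];
            dst[i + 0] = t0;
            dst[i + 1] = t1;
            dst[i + 2] = t2;
            dst[i + 3] = t3;
        }
        for (; i < VL; i++)                  /* remainder when VL is not a multiple of 4 */
            dst[i] = coeff * src[i];
    }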

Auto-tuned Performance (+Software Prefetching)
- Expanded the code generator to insert software prefetches in case the compiler doesn't.
- Auto-tune over: no prefetch, prefetch 1 line ahead, prefetch 1 vector ahead.
- Relatively little benefit for relatively little work.
(figure: Clovertown, Opteron rev.F, Niagara2, and Cell PPEs; bars stack Naïve+NUMA, +Padding, +Vectorization, +Unrolling, +SW Prefetching)
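A sketch of how the generated prefetch variants might look, using GCC-style __builtin_prefetch on a single illustrative stream; the distance parameter selects "one line ahead" versus "one vector ahead".

    #define LINE_DOUBLES 8                       /* 64-byte line = 8 doubles */

    static void update_strip_prefetch(double *restrict dst, const double *restrict src,
                                      double coeff, int VL, int dist /* in doubles */)
    {
        for (int i = 0; i < VL; i++) {
            __builtin_prefetch(&src[i + dist], 0);   /* 0 = prefetch for read  */
            __builtin_prefetch(&dst[i + dist], 1);   /* 1 = prefetch for write */
            dst[i] = coeff * src[i];
        }
    }
    /* dist = LINE_DOUBLES gives "one line ahead"; dist = VL gives "one vector ahead". */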

Auto-tuned Performance (+Software Prefetching, fractions of peak)
- Same optimizations as above.
- Figure callouts: Clovertown: 6% of peak flops, 22% of bandwidth; Opteron (rev.F): 32% of peak flops, 40% of bandwidth; Niagara2: 59% of peak flops, 15% of bandwidth; Cell PPEs: 10% of peak flops, 3.7% of bandwidth.

Auto-tuned Performance (+SIMDization, including non-temporal stores)
- The compilers (gcc & icc) failed to exploit SIMD.
- Expanded the code generator to use SIMD intrinsics.
- Explicit unrolling/reordering was extremely valuable here.
- Exploited movntpd (non-temporal stores) to minimize memory traffic; this is the only hope when memory bound.
- Significant benefit for significant work.
(figure: Clovertown, Opteron rev.F, Niagara2, and Cell PPEs; bars stack Naïve+NUMA, +Padding, +Vectorization, +Unrolling, +SW Prefetching, +SIMDization)
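An x86 sketch of the idea with SSE2 intrinsics: _mm_stream_pd emits movntpd, a non-temporal store that bypasses the cache and thus avoids the write-allocate read of the destination line. A single illustrative stream is shown; the real generated kernels are far more involved.

    #include <emmintrin.h>

    /* Assumes VL is even and dst is 16-byte aligned (required by movntpd). */
    static void update_strip_simd(double *restrict dst, const double *restrict src,
                                  double coeff, int VL)
    {
        __m128d c = _mm_set1_pd(coeff);
        for (int i = 0; i < VL; i += 2) {
            __m128d s = _mm_loadu_pd(&src[i]);
            _mm_stream_pd(&dst[i], _mm_mul_pd(c, s));   /* cache-bypassing store */
        }
        _mm_sfence();   /* make the streaming stores globally visible before reuse */
    }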

Auto-tuned Performance (+SIMDization, fractions of peak)
- Same optimizations as above.
- Figure callouts: Clovertown: 7.5% of peak flops, 18% of bandwidth; Opteron (rev.F): 42% of peak flops, 35% of bandwidth; Niagara2: 59% of peak flops, 15% of bandwidth; Cell PPEs: 10% of peak flops, 3.7% of bandwidth.

Auto-tuned Performance (+SIMDization, speedup over the naïve+NUMA baseline)
- Same optimizations as above.
- Figure callouts: overall auto-tuning speedups of 1.5x, 10x, 4.3x, and 1.6x across the four platforms (the 10x corresponds to the Cell PPEs, which rose from 1% to 10% of peak flops).

Performance and Analysis of Cell Implementation

Cell Implementation
- Double-precision implementation; DP will severely hamper performance.
- Vectorized and double buffered, but not auto-tuned:
  - no NUMA optimizations
  - no unrolling
  - VL is fixed
- Written straight to SIMD intrinsics.
- Prefetching is replaced by DMA list commands.
- Only collision() was implemented.
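A heavily simplified sketch of the SPE-side double-buffering pattern, assuming the Cell SDK's spu_mfcio.h DMA primitives. The chunk size, tags, and compute step are illustrative, and the real code also streams results back to main memory with mfc_put and uses DMA lists rather than single transfers.

    #include <spu_mfcio.h>

    #define CHUNK 2048                       /* doubles per DMA chunk (16 KB, the MFC limit) */
    static volatile double buf[2][CHUNK] __attribute__((aligned(128)));

    static void wait_tag(unsigned tag)
    {
        mfc_write_tag_mask(1u << tag);
        mfc_read_tag_status_all();
    }

    /* ea = effective address of the data in main memory (assumed 128-byte aligned) */
    void process(unsigned long long ea, int nchunks)
    {
        int cur = 0;
        mfc_get(buf[cur], ea, CHUNK * sizeof(double), cur, 0, 0);
        for (int i = 0; i < nchunks; i++) {
            int nxt = cur ^ 1;
            if (i + 1 < nchunks)             /* start fetching the next chunk ... */
                mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * CHUNK * sizeof(double),
                        CHUNK * sizeof(double), nxt, 0, 0);
            wait_tag(cur);                   /* ... while waiting only for the current one */
            /* compute_collision(buf[cur], CHUNK);  -- work on the chunk in the local store */
            cur = nxt;
        }
    }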

Auto-tuned Performance (Local Store Implementation)
- First attempt at a Cell implementation.
- VL, unrolling, and reordering are fixed; no NUMA.
- Exploits DMA and double buffering to load vectors.
- Written straight to SIMD intrinsics.
- Despite the relative performance, Cell's DP implementation severely impairs performance.
(figure: Clovertown, Opteron rev.F, Niagara2, and Cell SPEs (* collision() only); bars stack Naïve+NUMA through +SIMDization)

Auto-tuned Performance (Local Store Implementation, fractions of peak)
- Same configuration as above.
- Figure callouts: Clovertown: 7.5% of peak flops, 18% of bandwidth; Opteron (rev.F): 42% of peak flops, 35% of bandwidth; Niagara2: 59% of peak flops, 15% of bandwidth; Cell SPEs (collision() only): 57% of peak flops, 33% of bandwidth.

Speedup from Heterogeneity
- Same configuration as above.
- Figure callouts: the Cell SPE implementation is 13x faster than the auto-tuned PPE version; the remaining callouts (1.5x, 4.3x, 1.6x) are the auto-tuning speedups on the cache-based machines, carried over from the previous results.

Speedup over Naïve
- Same configuration as above.
- Figure callouts: 1.5x, 4.3x, and 1.6x on the cache-based machines, and 130x for the Cell SPEs relative to the naïve PPE code.

Summary

Aggregate Performance (fully optimized)
- The Cell SPEs deliver the best full-system performance, although Niagara2 delivers nearly comparable per-socket performance.
- The dual-core Opteron delivers far better performance (bandwidth) than Clovertown.
- Clovertown has far too little effective FSB bandwidth.

Parallel Efficiency (average performance per thread, fully optimized)
- Computed as aggregate Mflop/s divided by the number of cores.
- Niagara2 & Cell showed very good multicore scaling.
- Clovertown showed very poor multicore scaling (FSB limited).

Power Efficiency (fully optimized)
- Used a digital power meter to measure sustained power.
- Power efficiency is calculated as sustained performance / sustained power.
- All cache-based machines delivered similar power efficiency.
- FBDIMMs (~12W each) add substantially to sustained power: 8 DIMMs on the Clovertown system (total of ~330W) and 16 DIMMs on the Niagara2 machine (total of ~450W).

Productivity
- Niagara2 required significantly less work to deliver good performance (just vectorization for large problems).
- Clovertown, Opteron, and Cell all required SIMD (which hampers productivity) for best performance.
- The cache-based machines required search for some optimizations, while Cell relied solely on heuristics (less time to tune).

Summary
- Niagara2 delivered both very good performance and productivity.
- Cell delivered very good performance and efficiency (processor and power).
- On the memory-bound Clovertown, parallelism wins out over optimization and auto-tuning.
- Our multicore auto-tuned LBMHD implementation significantly outperformed the already-optimized serial implementation.
- Sustainable memory bandwidth is essential even on kernels with moderate computational intensity (flop:byte ratio).
- Architectural transparency is invaluable in optimizing code.

Multi-core arms race

New Multicores
(figure: auto-tuned performance of the 2.2GHz Opteron (rev.F) and the 1.40GHz Niagara2)

New Multicores
- Barcelona is a quad-core Opteron.
- Victoria Falls is a dual-socket (128-thread) Niagara2.
- Both have the same total bandwidth.
(figure: 2.2GHz Opteron (rev.F), 2.3GHz Opteron (Barcelona), 1.40GHz Niagara2, 1.16GHz Victoria Falls; bars stack Naïve+NUMA, +Padding, +Vectorization, +Unrolling, +SW Prefetching, +SIMDization, +Smaller pages)

Speedup from Multicore/Socket
- Barcelona is a quad-core Opteron; Victoria Falls is a dual-socket (128-thread) Niagara2; both have the same total bandwidth.
- Figure callouts: Barcelona over Opteron (rev.F): 1.9x (1.8x frequency-normalized); Victoria Falls over Niagara2: 1.6x (1.9x frequency-normalized).

Speedup from Auto-tuning
- Same four machines as above.
- Figure callouts: auto-tuning speedups of 1.5x, 16x, 3.9x, and 4.3x across the Opteron (rev.F), Barcelona, Niagara2, and Victoria Falls results (one per platform).

Questions?

Acknowledgements
- UC Berkeley: RADLab cluster (Opterons), PSI cluster (Clovertowns)
- Sun Microsystems: Niagara2 donations
- Forschungszentrum Jülich: Cell blade cluster access
- George Vahala et al.: original version of LBMHD
- ASCR Office in the DOE Office of Science, contract DE-AC02-05CH11231