LAWRENCE BERKELEY NATIONAL LABORATORY / FUTURE TECHNOLOGIES GROUP
The Roofline Model
Samuel Williams, Lawrence Berkeley National Laboratory

Outline
Challenges / Goals
Fundamentals
Roofline Performance Model
Example: Heat Equation
Example: SpMV
Alternate Roofline Formulations
Summary

Challenges / Goals

Four Architectures
[Block diagrams of the four systems: AMD Barcelona (dual-socket Opteron, 667 MHz DDR2 DIMMs, 2x64b memory controllers, HyperTransport), Sun Victoria Falls (MT SPARC cores, shared L2, 667 MHz FBDIMMs), IBM Cell Blade (VMT PPE plus 8 SPEs with 256 KB local stores, XDR DRAM, EIB ring network), and NVIDIA G80.]

Challenges / Goals
We have extremely varied architectures. Moreover, the characteristics of numerical methods can vary dramatically. The result is that performance and the benefit of optimization can vary significantly from one architecture x kernel combination to the next.
We wish to understand whether or not we've attained good performance (a high fraction of theoretical peak).
We wish to identify performance bottlenecks and enumerate potential remediation strategies.

Fundamentals

Little's Law
Little's Law: Concurrency = Latency * Bandwidth, or equivalently, Effective Throughput = Expressed Concurrency / Latency.
Bandwidth: conventional memory bandwidth, or the number of floating-point units.
Latency: memory latency, or functional-unit latency.
Concurrency: bytes in flight to the memory subsystem, or concurrent (parallel) memory/floating-point operations.
For example, consider a CPU with 2 FPUs, each with a 4-cycle latency. Little's Law states that we must express 8-way ILP (2 * 4 independent operations) to fully utilize the machine.

Little's Law Examples
Applied to FPUs: consider a CPU with 2 FPUs, each with a 4-cycle latency. Little's Law states that we must express 8-way ILP to fully utilize the machine. Solution: unroll/jam the code to express 8 independent FP operations. Note that simply unrolling dependent operations (e.g. a reduction) does not increase ILP; it merely amortizes loop overhead.
Applied to memory: consider a CPU with 20 GB/s of bandwidth and 100 ns memory latency. Little's Law states that we must express 2 KB of concurrency (independent memory operations) to the memory subsystem to attain peak bandwidth. On today's superscalar processors, hardware stream prefetchers speculatively load consecutive elements. Solution: express the memory access pattern in a streaming fashion in order to engage the prefetchers.
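A minimal sketch (not from the slides) of what "express 8 independent FP operations" can look like for a sum reduction; the eight partial-sum variables break the dependence chain:

    /* Naive reduction: each add depends on the previous one, so only
     * one add is in flight per 4-cycle latency. */
    double sum_naive(const double *x, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += x[i];
        return s;
    }

    /* Unrolled with 8 independent partial sums: 8-way ILP keeps
     * 2 FPUs with 4-cycle latency busy (assumes n is a multiple of 8). */
    double sum_unrolled(const double *x, int n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0, s6 = 0, s7 = 0;
        for (int i = 0; i < n; i += 8) {
            s0 += x[i];     s1 += x[i + 1];
            s2 += x[i + 2]; s3 += x[i + 3];
            s4 += x[i + 4]; s5 += x[i + 5];
            s6 += x[i + 6]; s7 += x[i + 7];
        }
        return ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
    }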

Three Classes of Locality
Temporal locality: reusing data (in registers or cache lines) multiple times amortizes the impact of limited bandwidth. Transform loops or algorithms to maximize reuse.
Spatial locality: data is transferred from cache to registers in words, but data is transferred into the cache in 64- or 128-byte lines; using every word in a line maximizes spatial locality. Transform data structures into a structure-of-arrays (SoA) layout (sketched below).
Sequential locality: many memory address patterns access cache lines sequentially. CPU hardware stream prefetchers exploit this observation, speculatively loading data to hide memory latency. Transform loops to generate (a few) long, unit-stride accesses.
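A minimal illustration (not from the slides) of the AoS-to-SoA transform; the field names are hypothetical. If a loop touches only one field, the SoA layout uses every byte of each cache line it fetches, while the AoS layout wastes most of the line:

    /* Array of structures: summing only .x touches 8 of every
     * 32 bytes fetched per element. */
    struct ParticleAoS { double x, y, z, m; };

    /* Structure of arrays: summing only x[] is a unit-stride stream
     * that uses every word of every cache line. */
    struct ParticlesSoA { double *x, *y, *z, *m; };

    double sum_x_aos(const struct ParticleAoS *p, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += p[i].x;
        return s;
    }

    double sum_x_soa(const struct ParticlesSoA *p, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += p->x[i];
        return s;
    }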

Arithmetic Intensity
True arithmetic intensity (AI) ~ Total Flops / Total DRAM Bytes.
Some HPC kernels have an arithmetic intensity that scales with problem size (increased temporal locality); others have constant intensity.
Arithmetic intensity is ultimately limited by compulsory traffic, and is diminished by conflict or capacity misses.
[Figure: spectrum of arithmetic intensity. O(1): SpMV, BLAS1/2, stencils (PDEs), lattice methods. O(log N): FFTs. O(N): dense linear algebra (BLAS3), particle methods.]
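As a concrete, illustrative example (not from the slides), consider double-precision DAXPY, y[i] = a*x[i] + y[i]: each element performs 2 flops and, counting only compulsory traffic, moves 24 bytes of DRAM data (read x, read y, write y), so

    AI = 2 flops / 24 bytes ~ 0.083 flops/byte

placing it firmly at the O(1), bandwidth-bound end of the spectrum.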

NUMA
Recent multicore SMPs have integrated the memory controllers on chip. As a result, memory access is non-uniform (NUMA): the bandwidth to read a given address varies dramatically between cores.
Exploit NUMA (affinity + first touch) when you malloc/initialize data. The concept is similar to data decomposition for distributed memory.
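A minimal sketch (not from the slides) of affinity plus first touch using OpenMP; it assumes a Linux-style first-touch page-placement policy, so pages land in the memory of the socket whose thread first writes them, and the compute loops later use the same static schedule:

    #include <stdlib.h>

    double *alloc_and_first_touch(size_t n) {
        double *a = malloc(n * sizeof(double));
        /* Initialize in parallel with the same schedule the compute
         * loops will use, so each page is faulted in (first touched)
         * by the thread/socket that will later consume it. */
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; i++)
            a[i] = 0.0;
        return a;
    }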


Roofline Model

Overlap of Communication
Consider a simple example in which an FP kernel maintains a working set in DRAM. We assume we can perfectly overlap computation with communication (or vice versa), either through prefetching/DMA and/or pipelining (decoupling of communication and computation).
Thus time is the maximum of the time required to transfer the data and the time required to perform the floating-point operations:
Time = max( Bytes / STREAM Bandwidth , Flops / Peak Flop/s )

Roofline Model: Basic Concept
Synthesize communication, computation, and locality into a single visually intuitive performance figure using bound and bottleneck analysis:
Attainable Performance_i,j = min( FLOP/s with optimizations 1..i , AI * Bandwidth with optimizations 1..j )
where optimization i can be SIMDize, or unroll, or SW prefetch, etc.
Given a kernel's arithmetic intensity (based on DRAM traffic after being filtered by the cache), programmers can inspect the figure and bound performance. Moreover, it provides insight into which optimizations will potentially be beneficial.
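The bound is simple enough to compute directly. A minimal sketch (names and numbers are illustrative, not from the slides):

    #include <stdio.h>

    /* Attainable GFLOP/s for a kernel of arithmetic intensity ai
     * (flops per DRAM byte), given an in-core ceiling in GFLOP/s
     * and a bandwidth ceiling in GB/s. */
    double roofline_gflops(double peak_gflops, double peak_gbs, double ai) {
        double bw_bound = ai * peak_gbs;
        return bw_bound < peak_gflops ? bw_bound : peak_gflops;
    }

    int main(void) {
        /* e.g. a machine with 73.6 GFLOP/s peak DP and ~20 GB/s STREAM
         * bandwidth, evaluated at a few arithmetic intensities. */
        double ai[] = { 1.0 / 8, 1.0 / 2, 2.0, 8.0 };
        for (int i = 0; i < 4; i++)
            printf("AI = %5.3f -> %6.2f GFLOP/s\n",
                   ai[i], roofline_gflops(73.6, 20.0, ai[i]));
        return 0;
    }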

Example
Consider the Opteron 2356:
dual socket (NUMA)
limited HW stream prefetchers
quad-core (8 cores total), 2.3 GHz
2-way SIMD (DP)
separate FP multiply and FP add datapaths
4-cycle FP latency
Assuming expression of parallelism is the challenge on this architecture, what would the Roofline Model look like?
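For reference, an illustrative calculation derived from the figures on this slide, showing where the in-core peak on the following plots comes from:

    peak DP = 8 cores * 2.3 GHz * 2 (SIMD) * 2 (mul + add) = 73.6 GFLOP/s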

Roofline Model: Basic Concept
Plot on a log-log scale. Given AI, we can easily bound performance. But architectures are much more complicated, so we will bound performance as we eliminate specific forms of in-core parallelism.
[Roofline plot: attainable GFLOP/s vs. actual FLOP:Byte ratio for the Opteron 2356 (Barcelona), with the peak DP ceiling and the Stream Bandwidth diagonal.]

Roofline Model: Computational Ceilings
Opterons have dedicated multipliers and adders. If the code is dominated by adds, then attainable performance is half of peak. We call these "ceilings"; they act like constraints on performance.
[Roofline plot: Opteron 2356 (Barcelona), peak DP and Stream Bandwidth with a mul/add imbalance ceiling.]

Roofline Model: Computational Ceilings
Opterons have 128-bit datapaths. If instructions aren't SIMDized, attainable performance will be halved.
[Roofline plot: Opteron 2356 (Barcelona), adding a "w/out SIMD" ceiling below the mul/add imbalance ceiling.]

Roofline Model: Computational Ceilings
On Opterons, floating-point instructions have a 4-cycle latency. If we don't express 4-way ILP, performance will drop by as much as 4x.
[Roofline plot: Opteron 2356 (Barcelona), adding a "w/out ILP" ceiling below the "w/out SIMD" ceiling.]
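Stacking the factors from these three slides on the 73.6 GFLOP/s peak derived earlier gives a rough, illustrative sense of where the ceilings sit (the exact ordering and placement depend on the kernel and machine):

    half of peak (mul/add imbalance, or no SIMD): 73.6 / 2 = 36.8 GFLOP/s
    additionally without 4-way ILP: up to another 4x lower, roughly 36.8 / 4 = 9.2 GFLOP/s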

Roofline Model: Communication Ceilings
We can perform a similar exercise, taking away parallelism from the memory subsystem.
[Roofline plot: Opteron 2356 (Barcelona), peak DP and Stream Bandwidth.]

Roofline Model: Communication Ceilings
Explicit software prefetch instructions are required to achieve peak bandwidth.
[Roofline plot: Opteron 2356 (Barcelona), adding a "w/out SW prefetch" bandwidth ceiling below Stream Bandwidth.]

Roofline Model: Communication Ceilings
Opterons are NUMA. As such, memory traffic must be correctly balanced between the two sockets to achieve good STREAM bandwidth. We could continue this by examining strided or random memory access patterns.
[Roofline plot: Opteron 2356 (Barcelona), adding a "w/out NUMA" bandwidth ceiling below "w/out SW prefetch".]

Roofline Model: Computation + Communication Ceilings
We may bound performance based on the combination of expressed in-core parallelism and attained bandwidth.
[Roofline plot: Opteron 2356 (Barcelona) with all computational ceilings (mul/add imbalance, w/out SIMD, w/out ILP) and bandwidth ceilings (w/out SW prefetch, w/out NUMA).]

Roofline Model: Locality Walls
Remember, memory traffic includes more than just compulsory misses. As such, actual arithmetic intensity may be substantially lower. Walls are unique to the architecture-kernel combination.
With only compulsory miss traffic:
AI = FLOPs / Compulsory Misses
[Roofline plot: Opteron 2356 (Barcelona) with all ceilings and a vertical compulsory-AI wall.]

Cache Behavior
Knowledge of the underlying cache operation can be critical. For example, caches are organized into lines, and lines are organized into sets and ways (associativity). Thus, we must mimic the effect of Mark Hill's 3C's of caches: the impacts of conflict, compulsory, and capacity misses are both architecture- and application-dependent, and ultimately they reduce the actual flop:byte ratio.
Moreover, many caches are write-allocate: a write-allocate cache reads in an entire cache line upon a write miss. If the application ultimately overwrites that line, the read was superfluous (further reducing the flop:byte ratio).
Because programs access data in words, but hardware transfers it in 64- or 128-byte cache lines, spatial locality is key. Array-of-structures data layouts can lead to dramatically lower flop:byte ratios; e.g., if a program only operates on the red field of a pixel, bandwidth is wasted.
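An illustrative calculation (not from the slides) of the write-allocate penalty for a simple double-precision scaling kernel, y[i] = a * x[i]:

    without write allocate: AI = 1 flop / (8 B read x + 8 B write y) = 1/16 flops/byte
    with write allocate:    AI = 1 flop / (8 B read x + 8 B allocate y + 8 B write y) = 1/24 flops/byte

A streaming (cache-bypass) store eliminates the allocation read and restores the 1/16 ratio.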

Roofline Model: Locality Walls
Remember, memory traffic includes more than just compulsory misses. As such, actual arithmetic intensity may be substantially lower. Walls are unique to the architecture-kernel combination.
With compulsory miss traffic plus write-allocation traffic:
AI = FLOPs / (Write Allocations + Compulsory Misses)
[Roofline plot: Opteron 2356 (Barcelona); the AI wall shifts left with the added write-allocation traffic.]

Roofline Model: Locality Walls
Remember, memory traffic includes more than just compulsory misses. As such, actual arithmetic intensity may be substantially lower. Walls are unique to the architecture-kernel combination.
With compulsory, write-allocation, and capacity miss traffic:
AI = FLOPs / (Capacity Misses + Write Allocations + Compulsory Misses)
[Roofline plot: Opteron 2356 (Barcelona); the AI wall shifts further left with capacity miss traffic.]

Roofline Model: Locality Walls
Remember, memory traffic includes more than just compulsory misses. As such, actual arithmetic intensity may be substantially lower. Walls are unique to the architecture-kernel combination.
With compulsory, write-allocation, capacity, and conflict miss traffic:
AI = FLOPs / (Conflict + Capacity + Write Allocations + Compulsory Misses)
[Roofline plot: Opteron 2356 (Barcelona); the AI wall shifts further left with conflict miss traffic.]

Roofline Model: Locality Walls
Optimizations remove these walls and ceilings, which act to constrain performance.
[Roofline plot: Opteron 2356 (Barcelona) with all ceilings and locality walls shown together.]

Instruction Issue Bandwidth
On a superscalar processor, there is likely ample instruction issue bandwidth. This allows loads, integer, and FP instructions to be issued simultaneously. As such, we assumed that expression of parallelism was the underlying in-core challenge. However, on some architectures, finite instruction-issue bandwidth can become a major impediment to performance.

Roofline Model: Instruction Mix
As the instruction mix shifts away from floating point, finite issue bandwidth begins to limit in-core performance. On Niagara2, with dual issue units but only one FPU, FP instructions must constitute 50% of the mix to attain peak performance. A similar approach should be used on GPUs, where proper use of CUDA solves the parallelism challenges.
[Roofline plot: UltraSPARC T2+ T5140 (Niagara2), with in-core ceilings at peak DP (50% FP), 25% FP, and 12% FP, and bandwidth ceilings for w/out SW prefetch and w/out NUMA.]
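One hedged way to read these ceilings (an interpretation, not text from the slides): if a core can issue I instructions per cycle, of which at most one can be floating point, and a fraction f of the kernel's dynamic instructions are FP, then

    attainable FP rate ~ min( peak FP rate , f * I * frequency )

so on a dual-issue core with one FPU, peak is only reachable when f >= 50%, and a 25% or 12% FP mix caps performance at roughly half or a quarter of peak.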

Optimization Categorization
Maximizing (attained) in-core performance
Minimizing (total) memory traffic
Maximizing (attained) memory bandwidth

Optimization Categorization
Maximizing (attained) in-core performance: exploit in-core parallelism (ILP, DLP, etc.); achieve good (enough) floating-point balance. Techniques: unroll & jam, explicit SIMD, reorder, eliminate branches.
Maximizing (attained) memory bandwidth: exploit NUMA, hide memory latency, satisfy Little's Law. Techniques: memory affinity, SW prefetch, DMA lists, unit-stride streams, TLB blocking.
Minimizing (total) memory traffic: eliminate capacity misses, conflict misses, compulsory misses, and write-allocate behavior. Techniques: cache blocking, array padding, compress data, streaming stores.

Examples

Multicore SMPs Used
[Block diagrams of the four systems: AMD Opteron 2356 (Barcelona), Intel Xeon E5345 (Clovertown), IBM QS20 Cell Blade, and Sun T2+ T5140 (Victoria Falls), showing their cores, caches, memory controllers, and DRAM interfaces.]

Heat Equation

7-point Stencil
[Figure: 3D PDE grid and the 7-point stencil for the heat equation, touching the center point (x,y,z) and its six neighbors in the x, y, and z directions.]
The simplest derivation of the Laplacian operator results in a constant-coefficient 7-point stencil. For all x,y,z:
u(x,y,z,t+1) = alpha*u(x,y,z,t) + beta*( u(x,y,z-1,t) + u(x,y-1,z,t) + u(x-1,y,z,t) + u(x+1,y,z,t) + u(x,y+1,z,t) + u(x,y,z+1,t) )
Clearly, each stencil performs:
8 floating-point operations
8 memory references, all but 2 of which should be filtered by an ideal cache
6 memory streams, all but 2 of which should be filtered (fewer than the number of HW prefetchers)
Ideally AI = 0.5, but write-allocate traffic bounds it to 0.33.
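A straightforward C sketch of this stencil sweep (the array names and the padded indexing macro IDX are illustrative, not from the slides):

    /* One Jacobi-style sweep of the constant-coefficient 7-point stencil
     * over an nx x ny x nz grid with a one-point ghost zone on each face.
     * IDX is a hypothetical helper for the (nx+2)(ny+2)(nz+2) layout. */
    #define IDX(x, y, z) ((x) + (nx + 2) * ((y) + (ny + 2) * (z)))

    void heat_step(const double *u0, double *u1,
                   int nx, int ny, int nz, double alpha, double beta) {
        for (int z = 1; z <= nz; z++)
            for (int y = 1; y <= ny; y++)
                for (int x = 1; x <= nx; x++)
                    u1[IDX(x, y, z)] =
                        alpha * u0[IDX(x, y, z)] +
                        beta  * (u0[IDX(x, y, z - 1)] + u0[IDX(x, y - 1, z)] +
                                 u0[IDX(x - 1, y, z)] + u0[IDX(x + 1, y, z)] +
                                 u0[IDX(x, y + 1, z)] + u0[IDX(x, y, z + 1)]);
    }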

Roofline Model for the Stencil (out-of-the-box code)
Large datasets; 2 unit-stride streams; no NUMA; little ILP; no DLP; far more adds than multiplies (imbalance). Ideal flop:byte ratio of 1/3, but high locality requirements: capacity and conflict misses will severely impair the flop:byte ratio. No naïve Cell implementation.


Roofline Model for the Stencil (NUMA, cache blocking, unrolling, prefetch, ...)
Cache blocking helps ensure the flop:byte ratio is as close as possible to 1/3. Clovertown has huge caches but is pinned to a lower bandwidth ceiling. Cache management is essential when capacity per thread is low. No naïve Cell implementation.

Roofline Model for the Stencil (SIMDization + cache bypass)
Make SIMDization explicit. Technically, this swaps the ILP and SIMD ceilings. Use the cache-bypass (streaming store) instruction movntpd, which increases the flop:byte ratio to ~0.5 on x86/Cell.
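A hedged sketch of what an explicitly SIMDized, cache-bypassing inner loop can look like with SSE2 intrinsics (illustrative only; the tuned stencil code in this work is more involved):

    #include <emmintrin.h>  /* SSE2: __m128d, movntpd via _mm_stream_pd */

    /* y[i] = alpha * x[i], written with a non-temporal (streaming) store
     * so destination lines are not read into cache on a write miss.
     * Assumes x and y are 16-byte aligned and n is a multiple of 2. */
    void scale_stream(const double *x, double *y, int n, double alpha) {
        __m128d va = _mm_set1_pd(alpha);
        for (int i = 0; i < n; i += 2) {
            __m128d vx = _mm_load_pd(&x[i]);          /* aligned 2-wide load   */
            _mm_stream_pd(&y[i], _mm_mul_pd(va, vx)); /* movntpd: bypass cache */
        }
        _mm_sfence();  /* make the streaming stores globally visible */
    }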

SpMV

Sparse Matrix-Vector Multiplication
What's a sparse matrix? Most entries are 0.0, so there is a performance advantage in only storing and operating on the nonzeros; this requires significant metadata to reconstruct the matrix structure.
What's SpMV? Evaluate y = Ax, where A is a sparse matrix and x and y are dense vectors.
Challenges: very low arithmetic intensity (often < 0.166 flops/byte); difficult to exploit ILP (bad for pipelined or superscalar cores); difficult to exploit DLP (bad for SIMD).
[Figure: (a) algebra conceptualization of y = Ax; (b) the CSR data structure: A.val[], A.rowStart[], A.col[].]
(c) CSR reference code:
    for (r = 0; r < A.rows; r++) {
        double y0 = 0.0;
        for (i = A.rowStart[r]; i < A.rowStart[r+1]; i++) {
            y0 += A.val[i] * x[A.col[i]];
        }
        y[r] = y0;
    }
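To make the reference code self-contained, here is a hedged sketch of the surrounding CSR structure and routine (the field names follow the slide; everything else is illustrative):

    typedef struct {
        int rows;        /* number of matrix rows               */
        int *rowStart;   /* rows+1 offsets into val[]/col[]     */
        int *col;        /* column index of each stored nonzero */
        double *val;     /* value of each stored nonzero        */
    } CSRMatrix;

    /* y = A*x for a CSR matrix: one dot product of a sparse row with x
     * per output element; x is gathered through col[], hence the poor
     * spatial locality and low arithmetic intensity. */
    void spmv_csr(const CSRMatrix *A, const double *x, double *y) {
        for (int r = 0; r < A->rows; r++) {
            double y0 = 0.0;
            for (int i = A->rowStart[r]; i < A->rowStart[r + 1]; i++)
                y0 += A->val[i] * x[A->col[i]];
            y[r] = y0;
        }
    }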

Roofline Model for SpMV
Double-precision roofline models, with in-core optimizations 1..i and DRAM optimizations 1..j. FMA is inherent in SpMV (so place that ceiling at the bottom).
GFlops_i,j(AI) = min( InCoreGFlops_i , StreamBW_j * AI )
[Roofline plots for the four machines: IBM QS20 Cell Blade, Opteron 2356 (Barcelona), Intel Xeon E5345 (Clovertown), and Sun T2+ T5140 (Victoria Falls), each with its in-core ceilings (peak DP, w/out SIMD, w/out ILP, w/out FMA, mul/add imbalance, 25%/12% FP) and bandwidth ceilings (w/out NUMA, w/out SW prefetch, bank conflicts, dataset fits in snoop filter).]

Roofline Model for SpMV (overlay arithmetic intensity)
Two unit-stride streams; inherent FMA; no ILP; no DLP; FP is 12-25% of the instruction mix; naïve compulsory flop:byte < 0.166. No naïve SPE implementation.
[Roofline plots for the four machines with the SpMV arithmetic intensity overlaid.]

Roofline Model for SpMV (out-of-the-box parallel)
Two unit-stride streams; inherent FMA; no ILP; no DLP; FP is 12-25% of the instruction mix; naïve compulsory flop:byte < 0.166. For simplicity: a dense matrix stored in sparse format. No naïve SPE implementation.
[Roofline plots for the four machines with out-of-the-box parallel performance marked.]

Roofline Model for SpMV (NUMA and SW prefetch)
The compulsory flop:byte ratio is unchanged; the optimizations here utilize all memory channels. No naïve SPE implementation.
[Roofline plots for the four machines with performance after the NUMA and software-prefetch optimizations.]

Roofline Model for SpMV (matrix compression)
Inherent FMA. Register blocking improves ILP, DLP, the flop:byte ratio, and the FP fraction of instructions.
[Roofline plots for the four machines with performance after matrix compression and register blocking.]
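For intuition, a hedged sketch of a 2x1 register-blocked (BCSR-style) inner loop; the structure is illustrative, not the tuned kernel from this work. Each stored block reuses one column index across two rows, and the two independent accumulators expose ILP/DLP:

    /* y = A*x for a matrix stored as 2x1 blocks (BCSR with r=2, c=1):
     * val[] holds 2 doubles per block, col[] one column index per block,
     * rowStart[] offsets per block-row. Index metadata is amortized over
     * two nonzeros, raising the flop:byte ratio. */
    void spmv_bcsr_2x1(int blockRows, const int *rowStart, const int *col,
                       const double *val, const double *x, double *y) {
        for (int br = 0; br < blockRows; br++) {
            double y0 = 0.0, y1 = 0.0;         /* independent accumulators */
            for (int i = rowStart[br]; i < rowStart[br + 1]; i++) {
                double xj = x[col[i]];         /* one gather per block     */
                y0 += val[2 * i]     * xj;
                y1 += val[2 * i + 1] * xj;
            }
            y[2 * br]     = y0;
            y[2 * br + 1] = y1;
        }
    }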

Various Kernels
We have examined and heavily optimized a number of kernels and applications for both CPUs and GPUs. We observe that for most of them, performance is highly correlated with DRAM bandwidth, particularly on the GPU. Note that GTC has a strong scatter/gather component that skews STREAM-based rooflines.
[Roofline plots with measured kernels overlaid: Xeon X5550 (Nehalem) with single-precision peak, double-precision peak, DP add-only, and STREAM bandwidth; NVIDIA C2050 (Fermi) with single-precision peak, double-precision peak, DP add-only, and device bandwidth.]

Alternate Rooflines

No Overlap of Communication and Computation
Previously, we assumed perfect overlap of communication and computation. What happens if there is a dependency (either inherent or due to a lack of optimization) that serializes communication and computation? Time becomes the sum of the communication time and the computation time:
Time = Bytes / STREAM Bandwidth + Flops / Peak Flop/s
The result is that attained flop/s approaches the roofline only asymptotically.

No Overlap of Communication and Computation
Consider a generic machine. If we can perfectly decouple and overlap communication with computation, the roofline is sharp/angular. However, without overlap, the roofline is smoothed, and attainable performance is degraded by up to a factor of 2x.
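A minimal sketch (illustrative) contrasting the two bounds; with no overlap the attainable rate is the harmonic-style combination of the two limits, which is at worst half of the overlapped bound (when the two times are equal):

    /* Overlapped bound: time = max(mem, comp)  ->  min(peak, ai*bw)   */
    double gflops_overlap(double peak, double bw, double ai) {
        double b = ai * bw;
        return b < peak ? b : peak;
    }

    /* Serialized bound: time = mem + comp  ->  1/(1/peak + 1/(ai*bw)) */
    double gflops_no_overlap(double peak, double bw, double ai) {
        return 1.0 / (1.0 / peak + 1.0 / (ai * bw));
    }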

Alternate Bandwidths
Thus far, we assumed a synergy between streaming applications and bandwidth (proxied by the STREAM benchmark). STREAM is NOT a good proxy for short-stanza/random cache-line access patterns, since memory latency (instead of just bandwidth) is exposed. Thus one might conceive of alternate memory benchmarks to provide a bandwidth upper bound (ceiling).
Similarly, if data primarily resides in the last-level cache (LLC), one should construct rooflines based on LLC bandwidth and flop:LLC-byte ratios.
For GPUs/accelerators, PCIe bandwidth can be an impediment. Thus one can construct a roofline model based on PCIe bandwidth and the flop:PCIe-byte ratio.

Alternate Computations
Arising from HPC kernels, it's no surprise that rooflines typically use DP flop/s. Of course, one could instead use SP flop/s, integer ops, bit operations, pairwise comparisons (sorting), graphics operations, etc.

Time-based Roofline
In some cases, it is easier to visualize performance in terms of seconds (i.e., time to solution). We can invert the roofline (seconds per flop) and simply multiply by the number of requisite flops. Additionally, we could change the horizontal axis from locality to some more appealing metric.
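Written out, and assuming the overlapped (max-based) formulation from earlier, the inversion described above is:

    Time = Flops / min( Peak Flop/s , AI * Bandwidth )
         = max( Flops / Peak Flop/s , Bytes / Bandwidth )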

Empirical Roofline
Thus far, all in-core estimates have been based on a first-principles analysis of the underlying computer architecture (frequency, SIMD width, latency, etc.). Conceivably, one could design a series of compiled benchmarks to extract the relevant roofline parameters. Similarly, one could use performance counters to extract application characteristics, so that application coordinates can be determined accurately.

Questions?
Acknowledgments: research supported by the DOE Office of Science under contract number DE-AC02-05CH

BACKUP SLIDES