The Price of Cache-obliviousness

Keshav Pingali, University of Texas, Austin
Kamen Yotov, Goldman Sachs
Tom Roeder, Cornell University
John Gunnels, IBM T.J. Watson Research Center
Fred Gustavson, IBM T.J. Watson Research Center

Memory Hierarchy Management

Cache-conscious (CC) approach:
- Blocked iterative algorithms and arrays (usually)
- Code and data structures have parameters that depend on careful blocking for the memory hierarchy
- Used in dense linear algebra libraries: BLAS, LAPACK
- Lots of algorithmic data reuse: O(N³) operations on O(N²) data

Cache-oblivious (CO) approach:
- Recursive algorithms and data structures (usually)
- Not aware of the memory hierarchy: approximate blocking
- I/O optimal: Hong and Kung; Frigo and Leiserson
- Used in FFT implementations: FFTW
- Little algorithmic data reuse: O(N log N) operations on O(N) data

Questions

- Does the CO approach perform as well as the CC approach?
  - Intuitively, a "self-adaptive" program that is oblivious of some hardware feature should be at a disadvantage.
  - There is little experimental data in the literature.
  - The CO community believes its approach outperforms the CC approach, but most studies from the CO community compare performance against unblocked (unoptimized) CC codes.
- If not, what can be done to improve the performance of CO programs?

One study

- Piyush Kumar (LNCS 2625, 2004) studied recursive and iterative MMM on Itanium 2.
- The recursive version performs better, but look at the absolute numbers: about 30 MFlops.
- For comparison, Intel MKL reaches 6 GFlops.

Organization of talk

- CO and CC approaches to blocking
  - control structures
  - data structures
- Non-standard view of blocking (or why CO may work well)
  - reduce bandwidth required from memory
- Experimental results
  - UltraSPARC IIIi
  - Itanium
  - Xeon
  - Power 5
- Lessons and ongoing work

Cache-Oblivious Algorithms

Divide all dimensions (AD):
- C00 = A00*B00 + A01*B10
- C01 = A00*B01 + A01*B11
- C10 = A10*B00 + A11*B10
- C11 = A10*B01 + A11*B11
- 8-way recursive tree down to 1x1 blocks (Bilardi et al.)
- Gray-code order of the eight calls promotes reuse
- We use AD in the rest of the talk

Divide largest dimension (LD):
- C0 = A0*B
- C1 = A1*B
- Two-way recursive tree down to 1x1 blocks (Frigo, Leiserson, et al.)
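
As a concrete illustration of the AD scheme, here is a minimal C sketch that divides all three dimensions recursively down to single elements. The function name, the row-major layout with a common leading dimension, and the plain (non-Gray-code) ordering of the eight recursive calls are assumptions for illustration, not the authors' implementation.

```c
#include <stddef.h>

/* Recursive all-dimensions (AD) matrix multiply: C += A*B.
 * A is m x k, B is k x n, C is m x n; `ld` is the row stride
 * (leading dimension) of the original row-major arrays. */
static void mmm_ad(const double *A, const double *B, double *C,
                   size_t m, size_t k, size_t n, size_t ld)
{
    if (m == 1 && k == 1 && n == 1) {      /* leaf: one multiply-add */
        C[0] += A[0] * B[0];
        return;
    }
    size_t mh = (m + 1) / 2, kh = (k + 1) / 2, nh = (n + 1) / 2;
    /* Eight sub-problems: each quadrant of C receives two updates,
     * e.g. C00 += A00*B00 and C00 += A01*B10. */
    for (int i = 0; i < 2; i++)
      for (int p = 0; p < 2; p++)
        for (int j = 0; j < 2; j++) {
            size_t ms = i ? m - mh : mh;   /* sub-block extents */
            size_t ks = p ? k - kh : kh;
            size_t ns = j ? n - nh : nh;
            if (ms && ks && ns)
                mmm_ad(A + (i ? mh : 0) * ld + (p ? kh : 0),
                       B + (p ? kh : 0) * ld + (j ? nh : 0),
                       C + (i ? mh : 0) * ld + (j ? nh : 0),
                       ms, ks, ns, ld);
        }
}
```

Called as mmm_ad(A, B, C, N, N, N, N) for square N x N matrices, with C zeroed first for C = A*B. Recursing all the way to single elements is exactly the overhead the next slide addresses.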

CO: recursive micro-kernel

- Internal nodes of the recursion tree are recursive overhead:
  - roughly 100 cycles on Itanium 2
  - 360 cycles on UltraSPARC IIIi
- Large overhead: for LD, roughly one internal node per leaf node
- Solution:
  - Micro-kernel: code obtained by completely unrolling the recursion tree for some fixed problem size (RU x RU x RU)
  - Cut off the recursion when the sub-problem size becomes equal to the micro-kernel size, and invoke the micro-kernel
  - The overhead of an internal node is then amortized over a whole micro-kernel, rather than a single multiply-add
- Choice of RU: empirical
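
A hedged sketch of the cutoff: the recursion stops at RU x RU x RU and calls a micro-kernel. RU = 8 follows the slide; the loop body below merely stands in for the fully unrolled basic block the slide describes, and the names are illustrative.

```c
#include <stddef.h>

enum { RU = 8 };   /* empirically chosen micro-kernel size */

/* Stand-in micro-kernel: in the real scheme this RU x RU x RU body
 * is completely unrolled into one large basic block; plain loops
 * are shown here only to keep the sketch short and compilable. */
static void microkernel(const double *A, const double *B, double *C,
                        size_t ld)
{
    for (size_t i = 0; i < RU; i++)
        for (size_t p = 0; p < RU; p++)
            for (size_t j = 0; j < RU; j++)
                C[i * ld + j] += A[i * ld + p] * B[p * ld + j];
}

/* In mmm_ad() from the previous sketch, the leaf test becomes:
 *   if (m == RU && k == RU && n == RU) {
 *       microkernel(A, B, C, ld);   // amortize call overhead
 *       return;                      // over RU^3 multiply-adds
 *   }
 */
```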

Data Structures

- Match data structure layout to access patterns, to improve
  - spatial locality
  - streaming
- Options: Row-major, Row-Block-Row (RBR), Morton-Z
- Morton-Z is more complicated to implement, and the payoff is small or even negative in our experience
- Rest of talk: use the RBR format with block size matched to the micro-kernel
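
A sketch of element addressing in a Row-Block-Row layout, assuming square NB x NB blocks stored contiguously, with blocks and elements within each block both in row-major order; the function name and the divisibility assumption are illustrative.

```c
#include <stddef.h>

/* Row-Block-Row (RBR) layout: the N x N matrix is tiled into
 * NB x NB blocks. N is assumed to be a multiple of NB to keep
 * the sketch simple. Element (i,j) lives at a[rbr_index(i,j,N,NB)]. */
static inline size_t rbr_index(size_t i, size_t j, size_t N, size_t NB)
{
    size_t bi = i / NB, bj = j / NB;   /* which block            */
    size_t oi = i % NB, oj = j % NB;   /* offset inside the block */
    size_t blocks_per_row = N / NB;
    return (bi * blocks_per_row + bj) * (NB * NB)  /* start of block */
         + oi * NB + oj;                           /* within block   */
}
```

With the block size matched to the micro-kernel, each micro-kernel invocation then walks contiguous memory.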

Cache-conscious algorithms

[Figure: blocked loop nests illustrating cache blocking and register blocking; the innermost working set should fit in registers]

CC algorithms: discussion

- Iterative codes: nested loops
- Implementation of blocking:
  - Cache blocking
    - Mini-kernel: in ATLAS, multiply NB x NB blocks
    - Choose NB so that NB² + NB + 1 ≤ C_L1 (L1 capacity in elements)
  - Register blocking
    - Micro-kernel: in ATLAS, multiply an MU x 1 block of A with a 1 x NU block of B into an MU x NU block of C
    - Choose MU, NU so that MU + NU + MU·NU ≤ NR (number of registers)
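
A sketch of the two levels just described; the values of NB, MU, NU are illustrative (ATLAS chooses them empirically subject to the constraints above; the ones below happen to satisfy them for the UltraSPARC IIIi parameters given later), and the register tile c stands in for the MU x NU block of C held in registers.

```c
enum { NB = 60, MU = 4, NU = 4 };  /* illustrative: NB²+NB+1 = 3661 <= 8192
                                      doubles of 64 KB L1, and
                                      MU+NU+MU*NU = 24 <= 32 registers */

/* Mini-kernel: C += A*B for one NB x NB block triple (row-major,
 * leading dimension NB). Each MU x NU tile of C stays in registers
 * across the entire k loop. NB is assumed to be a multiple of MU
 * and NU to keep the sketch short. */
static void minikernel(const double *A, const double *B, double *C)
{
    for (int j = 0; j < NB; j += NU)
        for (int i = 0; i < NB; i += MU) {
            double c[MU][NU] = {{0}};           /* register tile */
            for (int k = 0; k < NB; k++)        /* micro-kernel:  */
                for (int ii = 0; ii < MU; ii++) /* MUx1 of A times */
                    for (int jj = 0; jj < NU; jj++) /* 1xNU of B   */
                        c[ii][jj] += A[(i+ii)*NB + k] * B[k*NB + (j+jj)];
            for (int ii = 0; ii < MU; ii++)
                for (int jj = 0; jj < NU; jj++)
                    C[(i+ii)*NB + (j+jj)] += c[ii][jj];
        }
}
```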

Organization of talk

- CO and CC approaches to blocking
  - control structures
  - data structures
- Non-standard view of blocking
  - reduce bandwidth required from memory
- Experimental results
  - UltraSPARC IIIi
  - Itanium
  - Xeon
  - Power 5
- Lessons and ongoing work

Blocking

Microscopic view:
- Blocking reduces the expected latency of memory accesses

Macroscopic view:
- The memory hierarchy can be ignored if
  - memory has enough bandwidth to feed the processor, and
  - data can be pre-fetched to hide memory latency
- Blocking reduces the bandwidth needed from memory

It is useful to consider the macroscopic view in more detail.

Blocking for MMM

[Figure: CPU ↔ Cache ↔ Memory]

- Assume the processor can perform 1 FMA every cycle
  - ideal execution time for N x N MMM = N³ cycles
- Square blocks: NB x NB
- Upper bound for NB: the working set of a block computation must fit in cache
  - the size of the working set depends on the schedule: at most 3·NB²
  - upper bound on NB: 3·NB² ≤ cache capacity
- Lower bound for NB: data movement in a block computation = 4·NB² doubles (read the A, B, and C blocks; write the C block)
  - total data movement < (N/NB)³ · 4·NB² = 4·N³/NB doubles
  - required bandwidth from memory = (4·N³/NB) / N³ = 4/NB doubles/cycle
  - lower bound on NB: 4/NB < bandwidth between cache and memory
- Multi-level memory hierarchy: the same idea applies at each pair of adjacent levels (L, L+1):
  sqrt(capacity(L)/3) > NB_L > 4/Bandwidth(L, L+1)

Example: MMM on Itanium 2

[Figure: memory hierarchy with bandwidths in doubles/cycle: Registers ↔ L2 = 4 (limit of 4 accesses per cycle between registers and L2), L2 ↔ L3 = 4, L3 ↔ Memory ≈ 0.5; the FPU performs 2 FMAs/cycle]

Between L3 and Memory:
- Constraints:
  - 8/NB_L3 ≤ 0.5 (the 4/NB bound becomes 8/NB because the FPU retires 2 FMAs per cycle, halving the ideal execution time)
  - 3·NB_L3² ≤ 524288 doubles (4 MB L3)
- Therefore memory has enough bandwidth for 16 ≤ NB_L3 ≤ 418
  - NB_L3 = 16 requires 8/NB_L3 = 0.5 doubles/cycle from memory
  - NB_L3 = 418 requires 8/NB_L3 ≈ 0.02 doubles/cycle from memory
  - NB_L3 > 418 is possible with better scheduling

Bounds per level:
  Registers:   2 ≤ NB_R  ≤ 6      1.33 ≤ B(R, L2)      ≤ 4
  L2:          2 ≤ NB_L2 ≤ 6      1.33 ≤ B(L2, L3)     ≤ 4
  L3:         16 ≤ NB_L3 ≤ 418    0.02 ≤ B(L3, Memory) ≤ 0.5
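
A small check of the L3 arithmetic above; the capacity and bandwidth numbers are the slide's, the code and names are illustrative.

```c
#include <math.h>
#include <stdio.h>

/* Recompute the L3 block-size bounds from the slide's constraints:
 *   bandwidth bound:  8/NB <= B   =>  NB >= 8/B
 *   capacity bound:   3*NB^2 <= C =>  NB <= sqrt(C/3)
 * (8 rather than 4 doubles/cycle because the FPU does 2 FMAs/cycle) */
int main(void)
{
    double B = 0.5;       /* L3 <-> memory bandwidth, doubles/cycle */
    double C = 524288.0;  /* 4 MB L3 capacity, in doubles */
    printf("NB_L3 range: %.0f .. %.0f\n",
           ceil(8.0 / B), floor(sqrt(C / 3.0)));
    /* prints: NB_L3 range: 16 .. 418 */
    return 0;
}
```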

Lessons

Reducing bandwidth requirements:
- The block size does not have to be exact; it is enough for it to lie within an interval that depends on hardware parameters
- If the upper bound on NB is more than twice the lower bound, divide-and-conquer automatically generates a block size in this range
  => approximate blocking, CO-style, is OK

Reducing latency:
- Accurate block sizes are better
- If the block size is chosen only approximately, one may need to compensate with pre-fetching

Organization of talk

- CO and CC approaches to blocking
  - control structures
  - data structures
- Non-standard view of blocking
  - reduce bandwidth required from memory
- Experimental results
  - UltraSPARC IIIi
  - Itanium
  - Xeon
  - Power 5
- Lessons and ongoing work

UltraSPARC IIIi

- Peak performance: 2 GFlops (1 GHz, 2 FPUs)
- Memory hierarchy:
  - Registers: 32
  - L1 data cache: 64 KB, 4-way
  - L2 data cache: 1 MB, 4-way
- Compiler: Sun C 5.5

Naïve algorithms

[Diagram: design space: outer control structure (iterative / recursive) × inner control structure (statement)]

- Recursive, down to 1 x 1 x 1:
  - 360 cycles of overhead for each multiply-add => 6 MFlops
- Iterative: triply nested loop, little overhead
- Both give roughly the same performance
- Vendor BLAS and ATLAS: 1750 MFlops
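
For reference, the iterative baseline is the textbook triple loop; a minimal sketch, assuming row-major square matrices:

```c
/* Naive iterative MMM: C += A*B, row-major N x N matrices.
 * No blocking: for large N, every level of the memory
 * hierarchy misses heavily. */
static void mmm_naive(const double *A, const double *B, double *C, int N)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i*N + j] += A[i*N + k] * B[k*N + j];
}
```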

Miss ratios

- Misses/FMA for the iterative code: roughly 2
- Misses/FMA for the recursive code: 0.002
  - a practical manifestation of the theoretical I/O-optimality results for recursive codes
- However, two competing factors affect performance: cache misses and overhead
- 6 MFlops is a long way from 1750 MFlops!

Recursive micro-kernel

[Diagram: design space: outer control structure (iterative / recursive) × inner control structure (statement / micro-kernel); micro-kernel variants: none / compiler, scalarized, Belady, BRILA coloring]

- Recursion down to RU (= 8)
- Unfold completely below RU to get one large basic block
- Micro-kernel: scheduling and register allocation using heuristics for large basic blocks in the BRILA compiler

Lessons

- Bottom line on UltraSPARC:
  - Peak: 2 GFlops
  - ATLAS: 1.75 GFlops
  - Best CO strategy: 700 MFlops
- Similar results on other machines:
  - best CO performance on Itanium: roughly 2/3 of peak
- Conclusion: recursive micro-kernels are not a good idea

Iterative micro-kernel

[Figure: cache blocking and register blocking, as before; the innermost working set should fit in registers]

Recursion + iterative micro-kernel

[Diagram: design space: recursive outer control structure with an iterative micro-kernel (none / compiler, scalarized, Belady, BRILA coloring)]

- Recursion down to MU x NU x KU (4 x 4 x 120)
- Micro-kernel:
  - completely unroll the MU x NU nested loop
  - construct a preliminary schedule
  - perform graph-coloring register allocation
  - schedule using the BRILA compiler
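
A sketch of the micro-kernel in this hybrid, using the slide's MU = NU = 4 and KU = 120; the i/j loops below stand in for the fully unrolled, BRILA-scheduled code, and the names are illustrative.

```c
enum { MU = 4, NU = 4, KU = 120 };   /* from the slide */

/* Iterative micro-kernel reached by the recursive outer structure:
 * a KU-deep k loop around an MU x NU register tile. The recursion
 * above it blocks approximately for every cache level, so this
 * kernel only has to manage registers and the pipeline.
 * Row-major, leading dimension ld. */
static void micro_iter(const double *A, const double *B, double *C, int ld)
{
    double c[MU][NU] = {{0}};
    for (int k = 0; k < KU; k++)
        for (int i = 0; i < MU; i++)      /* fully unrolled in the */
            for (int j = 0; j < NU; j++)  /* real micro-kernel     */
                c[i][j] += A[i*ld + k] * B[k*ld + j];
    for (int i = 0; i < MU; i++)
        for (int j = 0; j < NU; j++)
            C[i*ld + j] += c[i][j];
}
```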

Loop + iterative micro-kernel

- Wrapping a plain loop around the highly optimized iterative micro-kernel does not give good performance
- This version does not block for any cache level, so the micro-kernel is starved for data
- The version with a recursive outer structure blocks approximately for the L1 cache and higher levels, so its micro-kernel is not starved

Lessons

- Two hardware constraints on the size of micro-kernels:
  - the I-cache limits the amount of unrolling
  - the number of registers
- Iterative micro-kernel: three degrees of freedom (MU, NU, KU)
  - choose MU and NU to optimize register usage
  - choose the KU unrolling to fit into the I-cache
- Recursive micro-kernel: one degree of freedom (RU)
  - even if you choose rectangular tiles, all three degrees of freedom are tied to both hardware constraints
- Recursive control structure + iterative micro-kernel: performs reasonably, because the recursion takes care of the caches and the micro-kernel optimizes for registers and the pipeline
- Iterative control structure + iterative micro-kernel: performs poorly, because the micro-kernel is starved for data from the caches
- What happens if you tile explicitly for caches?

Recursion + mini-kernel

[Diagram: design space: recursive outer control structure with a mini-kernel wrapping the micro-kernel]

- Recursion down to NB
- Mini-kernel: NB x NB x NB triply nested loop (NB = 120), tiling for the L1 cache
- The body of the mini-kernel is the iterative micro-kernel

Recursion + mini-kernel + pre-fetching

[Diagram: design space: recursive outer structure; mini-kernel variants: ATLAS CGw/S, ATLAS Unleashed; micro-kernel variants: none / compiler, scalarized, Belady, BRILA coloring]

- Using the mini-kernel from ATLAS Unleashed gives a big performance boost over the BRILA mini-kernel
- Reason: pre-fetching

Vendor BLAS

[Diagram: design space, as before]

- Not much difference from the previous case; vendor BLAS is at the same level.

Lessons

- Vendor BLAS gets the highest performance
- Pre-fetching boosts performance by roughly 40%
- Iterative code: pre-fetching is well understood
- Recursive code: pre-fetching is not well understood
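
For the iterative case, the idea can be sketched with the GCC/Clang __builtin_prefetch intrinsic; the pre-fetch distance PF, the kernel shape, and the names are illustrative assumptions, not values from the talk.

```c
/* Iterative micro-kernel k loop with software pre-fetching.
 * PF is a hypothetical pre-fetch distance, tuned so that data
 * arrives in cache roughly PF iterations before it is used.
 * Pre-fetching a few elements past the end of an array is
 * harmless: prefetch instructions do not fault. */
enum { PF = 16 };

static void micro_prefetch(const double *A, const double *B,
                           double *C, int ld, int K)
{
    double c[4][4] = {{0}};
    for (int k = 0; k < K; k++) {
        /* pre-fetch ahead in the first row of A and the upcoming
         * row of B; a real kernel covers all of its streams */
        __builtin_prefetch(&A[k + PF], 0, 3);        /* read, keep in cache */
        __builtin_prefetch(&B[(k + PF) * ld], 0, 3);
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                c[i][j] += A[i*ld + k] * B[k*ld + j];
    }
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            C[i*ld + j] += c[i][j];
}
```

What the analogous pre-fetching discipline should be for recursive control structures is exactly the open question the slide raises.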

Summary

- The iterative approach has been proven to work well in practice
  - Vendor BLAS, ATLAS, etc.
  - but it requires a lot of work to produce the code and tune the parameters
- Implementing a high-performance CO code is not easy
  - careful attention to the micro-kernel and mini-kernel is needed
  - using the fully recursive approach with a highly optimized recursive micro-kernel, we never got more than 2/3 of peak
- Issues with the CO approach:
  - recursive micro-kernels yield less performance than iterative ones using the same scheduling techniques
  - pre-fetching is needed to compete with the best code, and it is not well understood in the context of CO codes

Ongoing Work

- Explain the performance of all results shown
- Complete the ongoing matrix transpose study
- Proteus system and BRILA compiler
- I/O optimality:
  - interesting theoretical results for a simple model of computation
  - what additional aspects of the hardware and program need to be modeled for it to be useful in practice?

Miss ratios

[Chart-only slide; no text content in the transcript]