The Price of Cache-obliviousness

Keshav Pingali, University of Texas, Austin
Kamen Yotov, Goldman Sachs
Tom Roeder, Cornell University
John Gunnels, IBM T.J. Watson Research Center
Fred Gustavson, IBM T.J. Watson Research Center

Memory Hierarchy Management

Cache-conscious (CC) approach:
– Blocked iterative algorithms and arrays (usually)
  Code and data structures have parameters that depend on careful blocking for the memory hierarchy
– Used in dense linear algebra libraries: BLAS, LAPACK
  Lots of algorithmic data reuse: O(N^3) operations on O(N^2) data

Cache-oblivious (CO) approach:
– Recursive algorithms and data structures (usually)
  Not aware of the memory hierarchy: approximate blocking
  I/O optimal: Hong and Kung, Frigo and Leiserson
– Used in FFT implementations: FFTW
  Little algorithmic data reuse: O(N log N) computations on O(N) data

Questions

Does the CO approach perform as well as the CC approach?
– Intuitively, a "self-adaptive" program that is oblivious of some hardware feature should be at a disadvantage.
– There is little experimental data in the literature: the CO community believes its approach outperforms the CC approach, but most studies from the CO community compare against unblocked (unoptimized) CC codes.

If not, what can be done to improve the performance of CO programs?

One study

Piyush Kumar (LNCS 2625, 2004) studied recursive and iterative MMM on Itanium 2.
– The recursive version performs better than the iterative one.
– But look at the absolute numbers: about 30 MFlops, whereas Intel MKL reaches 6 GFlops.

Organization of talk

– CO and CC approaches to blocking: control structures, data structures
– Non-standard view of blocking (or why CO may work well): reduce the bandwidth required from memory
– Experimental results: UltraSPARC IIIi, Itanium, Xeon, Power 5
– Lessons and ongoing work

Cache-Oblivious Algorithms

Divide all dimensions (AD): 8-way recursive tree down to 1x1 blocks (Bilardi et al.)
– C00 = A00*B00 + A01*B10
– C01 = A01*B11 + A00*B01
– C11 = A10*B01 + A11*B11
– C10 = A10*B00 + A11*B10
– Gray-code order of the eight sub-problems promotes reuse

Divide largest dimension (LD): two-way recursive tree down to 1x1 blocks (Frigo, Leiserson, et al.)
– e.g., splitting A and C by rows into A0, A1 and C0, C1:
  C0 = A0*B
  C1 = A1*B

We use AD in the rest of the talk.

(Figure: 2x2 block decompositions of A, B, C for AD and the two-way row split for LD)
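Below is a minimal C sketch of the AD scheme (my illustration, not the authors' code). Each call splits A, B and C into 2x2 blocks and recurses on the eight sub-products; the call order pairs consecutive calls that share an operand, in the spirit of the Gray-code ordering. It assumes n is a power of two, row-major storage with leading dimension ld, and that the caller has initialized C.

#include <stddef.h>

/* C += A * B for n x n matrices, divide-all-dimensions recursion to 1x1 */
static void mmm_ad(const double *A, const double *B, double *C,
                   size_t n, size_t ld) {
    if (n == 1) {                        /* leaf: a single multiply-add */
        C[0] += A[0] * B[0];
        return;
    }
    size_t h = n / 2;                    /* half size */
    const double *A00 = A,          *A01 = A + h,
                 *A10 = A + h * ld, *A11 = A + h * ld + h;
    const double *B00 = B,          *B01 = B + h,
                 *B10 = B + h * ld, *B11 = B + h * ld + h;
    double       *C00 = C,          *C01 = C + h,
                 *C10 = C + h * ld, *C11 = C + h * ld + h;

    mmm_ad(A00, B00, C00, h, ld);        /* C00 += A00*B00                */
    mmm_ad(A01, B10, C00, h, ld);        /* C00 += A01*B10 (shares C00)   */
    mmm_ad(A01, B11, C01, h, ld);        /* C01 += A01*B11 (shares A01)   */
    mmm_ad(A00, B01, C01, h, ld);        /* C01 += A00*B01 (shares C01)   */
    mmm_ad(A10, B01, C11, h, ld);        /* C11 += A10*B01 (shares B01)   */
    mmm_ad(A11, B11, C11, h, ld);        /* C11 += A11*B11 (shares C11)   */
    mmm_ad(A11, B10, C10, h, ld);        /* C10 += A11*B10 (shares A11)   */
    mmm_ad(A10, B00, C10, h, ld);        /* C10 += A10*B00 (shares C10)   */
}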

CO: recursive micro-kernel

Internal nodes of the recursion tree are pure overhead; roughly
– 100 cycles on Itanium 2
– 360 cycles on UltraSPARC IIIi
This is a large overhead: for LD, there is roughly one internal node per leaf node.

Solution:
– Micro-kernel: code obtained by completely unrolling the recursion tree for some fixed problem size (RU x RU x RU)
– Cut off the recursion when the sub-problem size becomes equal to the micro-kernel size, and invoke the micro-kernel
– The overhead of an internal node is then amortized over a whole micro-kernel rather than a single multiply-add
– Choice of RU: empirical

(Figure: recursive micro-kernel)
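As an illustration only (the actual micro-kernels in the study are generated by the BRILA compiler for an empirically chosen RU such as 8), here is what the fully unrolled micro-kernel looks like for RU = 2: the eight leaf multiply-adds of the recursion tree, in the same Gray-code-like order as the recursive calls in the earlier sketch. The base case of that sketch would then become "if (n == RU) { micro_kernel_2(A, B, C, ld); return; }".

#include <stddef.h>

/* Fully unrolled 2x2x2 micro-kernel: one basic block of 8 multiply-adds,
   so recursion overhead is paid once per 8 FMAs instead of once per FMA. */
static void micro_kernel_2(const double *A, const double *B, double *C,
                           size_t ld) {
    C[0]      += A[0]      * B[0];       /* C00 += A00*B00 */
    C[0]      += A[1]      * B[ld];      /* C00 += A01*B10 */
    C[1]      += A[1]      * B[ld + 1];  /* C01 += A01*B11 */
    C[1]      += A[0]      * B[1];       /* C01 += A00*B01 */
    C[ld + 1] += A[ld]     * B[1];       /* C11 += A10*B01 */
    C[ld + 1] += A[ld + 1] * B[ld + 1];  /* C11 += A11*B11 */
    C[ld]     += A[ld + 1] * B[ld];      /* C10 += A11*B10 */
    C[ld]     += A[ld]     * B[0];       /* C10 += A10*B00 */
}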

Data Structures

Match the data structure layout to the access patterns, to improve
– spatial locality
– streaming

Morton-Z is more complicated to implement
– payoff is small or even negative in our experience

Rest of talk: use the Row-Block-Row (RBR) format with the block size matched to the micro-kernel.

(Figure: Row-major, Row-Block-Row and Morton-Z layouts)
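A minimal sketch of RBR addressing (my own helper, not from the paper): the matrix is stored as an array of nb x nb blocks laid out in row-major order, and each block is itself row-major. With nb matched to the micro-kernel, the micro-kernel then walks each block with unit stride.

#include <stddef.h>

/* Address of element (i, j) of an n x n matrix stored in Row-Block-Row
   format with block size nb; assumes nb divides n. */
static double *rbr_element(double *A, size_t i, size_t j, size_t n, size_t nb) {
    size_t bi = i / nb, bj = j / nb;      /* which block            */
    size_t oi = i % nb, oj = j % nb;      /* position in the block  */
    size_t blocks_per_row = n / nb;
    return A + (bi * blocks_per_row + bj) * nb * nb   /* start of the block */
             + oi * nb + oj;                          /* offset within it   */
}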

Cache-conscious algorithms
– Cache blocking
– Register blocking

CC algorithms: discussion

Iterative codes
– nested loops

Implementation of blocking
– Cache blocking
  Mini-kernel: in ATLAS, multiply NB x NB blocks
  Choose NB so that NB^2 + NB + 1 <= C_L1
– Register blocking
  Micro-kernel: in ATLAS, multiply an MU x 1 block of A with a 1 x NU block of B into an MU x NU block of C
  Choose MU, NU so that MU + NU + MU*NU <= NR
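A simplified sketch (not ATLAS's generated code) of the structure just described: an outer loop nest over NB x NB cache blocks, a mini-kernel over one block, and an MU x NU register-blocked micro-kernel as its body. NB, MU and NU are illustrative values (ATLAS chooses them empirically, subject to the constraints above); row-major storage is assumed, with n a multiple of NB and NB a multiple of MU and NU.

enum { NB = 60, MU = 4, NU = 4 };   /* illustrative blocking parameters */

/* micro-kernel: MU x NU block of C += (MU x NB panel of A) * (NB x NU panel of B) */
static void micro_kernel(const double *A, const double *B, double *C, int ld) {
    double c[MU][NU];                             /* the register block of C */
    for (int i = 0; i < MU; i++)
        for (int j = 0; j < NU; j++)
            c[i][j] = C[i * ld + j];
    for (int k = 0; k < NB; k++)                  /* KU unrolling of this loop omitted */
        for (int i = 0; i < MU; i++)
            for (int j = 0; j < NU; j++)
                c[i][j] += A[i * ld + k] * B[k * ld + j];
    for (int i = 0; i < MU; i++)
        for (int j = 0; j < NU; j++)
            C[i * ld + j] = c[i][j];
}

/* mini-kernel: multiply one NB x NB block of A by one NB x NB block of B */
static void mini_kernel(const double *A, const double *B, double *C, int ld) {
    for (int i = 0; i < NB; i += MU)
        for (int j = 0; j < NB; j += NU)
            micro_kernel(A + i * ld, B + j, C + i * ld + j, ld);
}

/* cache blocking: iterate over NB x NB blocks of the whole n x n problem */
void mmm_cc(const double *A, const double *B, double *C, int n) {
    for (int i = 0; i < n; i += NB)
        for (int j = 0; j < n; j += NB)
            for (int k = 0; k < n; k += NB)
                mini_kernel(A + i * n + k, B + k * n + j, C + i * n + j, n);
}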

Organization of talk

– CO and CC approaches to blocking: control structures, data structures
– Non-standard view of blocking: reduce the bandwidth required from memory
– Experimental results: UltraSPARC IIIi, Itanium, Xeon, Power 5
– Lessons and ongoing work

Blocking

Microscopic view
– blocking reduces the expected latency of a memory access

Macroscopic view
– the memory hierarchy can be ignored if
  memory has enough bandwidth to feed the processor, and
  data can be pre-fetched to hide memory latency
– blocking reduces the bandwidth needed from memory

It is useful to consider the macroscopic view in more detail.

Blocking for MMM

Assume the processor can perform 1 FMA every cycle; the ideal execution time for an N x N MMM is then N^3 cycles. Use square NB x NB blocks.

Upper bound for NB:
– the working set of a block computation must fit in the cache
– the size of the working set depends on the schedule: at most 3*NB^2 doubles
– upper bound on NB: 3*NB^2 ≤ cache capacity

Lower bound for NB:
– data movement per block computation = 4*NB^2 doubles
– total data movement < (N/NB)^3 * 4*NB^2 = 4*N^3 / NB doubles
– required bandwidth from memory = (4*N^3 / NB) / N^3 = 4 / NB doubles/cycle
– lower bound on NB: 4/NB < bandwidth between cache and memory

Multi-level memory hierarchy: same idea
– sqrt(capacity(L)/3) > NB_L > 4/Bandwidth(L, L+1) for adjacent levels L, L+1

(Diagram: CPU – Cache – Memory)
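Putting the two bounds together (still assuming one FMA per cycle, as on this slide), the block size at each level L has to satisfy a sandwich inequality; this is the same relation as the last bullet above, written out:

\[
\frac{4}{B(L,\,L+1)} \;<\; NB_L \;<\; \sqrt{\frac{C_L}{3}}
\]

where C_L is the capacity of level L in doubles and B(L, L+1) is the bandwidth between levels L and L+1 in doubles per cycle.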

Example: MMM on Itanium 2

Itanium 2 can execute 2 FMAs per cycle, so the analysis of the previous slide applies with 8/NB (instead of 4/NB) doubles per cycle of required bandwidth. Bandwidths below are in doubles per cycle; there is a limit of 4 accesses per cycle between the registers and L2 (floating-point data bypasses the L1 cache on Itanium 2).

Between L3 and memory:
– Constraints: 8 / NB_L3 ≤ 0.5 (available memory bandwidth) and 3*NB_L3^2 ≤ 4MB (L3 capacity)
– Therefore memory has enough bandwidth for 16 ≤ NB_L3 ≤ 418
– NB_L3 = 16 requires 8 / NB_L3 = 0.5 doubles per cycle from memory
– NB_L3 = 418 requires 8 / NB_L3 ≈ 0.02 doubles per cycle from memory
– NB_L3 > 418 is possible with better scheduling

(Figure: the FPU – registers – L2 – L3 – memory hierarchy annotated with the bandwidth between adjacent levels, about 0.5 doubles/cycle between L3 and memory, and with the feasible block-size range and required bandwidth at each level, e.g. 16 ≤ NB_L3 ≤ 418.)

Lessons

Reducing bandwidth requirements
– the block size does not have to be exact
– it is enough for the block size to lie within an interval that depends on hardware parameters
– if the upper bound on NB is more than twice the lower bound, divide-and-conquer will automatically generate a block size in this range, so approximate blocking CO-style is OK

Reducing latency
– accurate block sizes are better
– if the block size is chosen only approximately, one may need to compensate with pre-fetching
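A short argument for the factor-of-two condition above (my phrasing, assuming the divide-and-conquer step halves the problem and the starting size exceeds the upper bound): the candidate block sizes are N, N/2, N/4, ..., and if the feasible interval [ℓ, u] satisfies u ≥ 2ℓ, one of them must land in it.

\[
\text{Let } k \text{ be the least integer with } N/2^{k} \le u.\quad
\text{Then } N/2^{k-1} > u \ge 2\ell \;\Rightarrow\; N/2^{k} > \ell,
\quad\text{so}\quad \ell < N/2^{k} \le u .
\]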

Organization of talk

– CO and CC approaches to blocking: control structures, data structures
– Non-standard view of blocking: reduce the bandwidth required from memory
– Experimental results: UltraSPARC IIIi, Itanium, Xeon, Power 5
– Lessons and ongoing work

UltraSPARC IIIi

Peak performance: 2 GFlops (1 GHz, 2 FPUs)
Memory hierarchy:
– registers: 32
– L1 data cache: 64KB, 4-way
– L2 data cache: 1MB, 4-way
Compilers:
– C: SUN C 5.5

Naïve algorithms

Recursive:
– recursion down to 1 x 1 x 1
– 360 cycles of overhead for each multiply-add: 6 MFlops

Iterative:
– triply nested loop
– little overhead

Both give roughly the same performance.
Vendor BLAS and ATLAS: 1750 MFlops.

(Diagram: design space so far: outer control structure {iterative, recursive}, inner control structure {statement})
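For reference, the iterative baseline here is essentially the textbook triply nested loop; a sketch (the deck does not show its exact code):

/* naive iterative MMM: C += A * B, row-major n x n matrices */
void mmm_naive(const double *A, const double *B, double *C, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];   /* one FMA */
}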

Miss ratios

– Misses/FMA for the iterative code is roughly 2.
– Misses/FMA for the recursive code is lower (the exact values appear only in the chart): a practical manifestation of the theoretical I/O-optimality results for recursive code.
– However, two competing factors affect performance: cache misses and overhead.
– 6 MFlops is a long way from 1750 MFlops!

Recursive micro-kernel

Recursion down to RU (= 8)
– unfold completely below RU to get a basic block

Micro-kernel
– scheduling and register allocation using heuristics for large basic blocks in the BRILA compiler

(Diagram: design space: outer control structure {iterative, recursive}; inner control structure {statement, recursive micro-kernel}; micro-kernel scheduling/register allocation {none/compiler, scalarized/compiler, Belady/BRILA, coloring/BRILA})

Lessons

Bottom line on UltraSPARC:
– peak: 2 GFlops
– ATLAS: 1.75 GFlops
– best CO strategy: 700 MFlops

Similar results on other machines:
– best CO performance on Itanium: roughly 2/3 of peak

Conclusion: recursive micro-kernels are not a good idea.

Iterative micro-kernel
– Cache blocking
– Register blocking

Recursion + Iterative micro-kernel

Recursion down to MU x NU x KU (4 x 4 x 120)

Micro-kernel
– completely unroll the MU x NU nested loop
– construct a preliminary schedule
– perform graph-coloring register allocation
– schedule using the BRILA compiler

(Diagram: design space: outer control structure {iterative, recursive}; inner control structure {statement, recursive, iterative micro-kernel}; micro-kernel scheduling/register allocation {none/compiler, scalarized/compiler, Belady/BRILA, coloring/BRILA})

Loop + iterative micro-kernel

– Wrapping a loop nest around the highly optimized iterative micro-kernel does not give good performance.
– This version does not block for any cache level, so the micro-kernel is starved for data.
– The version with a recursive outer structure blocks approximately for the L1 cache and higher levels, so its micro-kernel is not starved.

Lessons

Two hardware constraints on the size of micro-kernels:
– the I-cache limits the amount of unrolling
– the number of registers

Iterative micro-kernel: three degrees of freedom (MU, NU, KU)
– choose MU and NU to optimize register usage
– choose the KU unrolling to fit into the I-cache

Recursive micro-kernel: one degree of freedom (RU)
– even if you choose rectangular tiles, all three degrees of freedom are tied to both hardware constraints

Recursive control structure + iterative micro-kernel:
– performs reasonably because the recursion takes care of the caches and the micro-kernel optimizes for registers and the pipeline

Iterative control structure + iterative micro-kernel:
– performs poorly because the micro-kernel is starved for data from the caches
– what happens if you tile explicitly for caches?

Recursion + mini-kernel

Recursion down to NB

Mini-kernel
– NB x NB x NB triply nested loop (NB = 120)
– tiling for the L1 cache
– body of the mini-kernel is the iterative micro-kernel

(Diagram: design space, now including the mini-kernel level)

Recursion + mini-kernel + pre-fetching

– Using the mini-kernel from ATLAS Unleashed gives a big performance boost over the BRILA mini-kernel.
– Reason: pre-fetching.

(Diagram: design space with mini-kernel choices {ATLAS CGw/S, ATLAS Unleashed} and micro-kernel scheduling {none/compiler, scalarized/compiler, Belady/BRILA, coloring/BRILA})
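The deck does not show how the ATLAS Unleashed kernels pre-fetch. Purely as an illustration of the idea, software pre-fetching in a simple inner loop might look like this, using GCC's __builtin_prefetch intrinsic; the function, K_LEN and PF_DIST are my own illustrative choices, not values from the study:

enum { K_LEN = 120, PF_DIST = 16 };   /* illustrative loop length and prefetch distance */

/* dot product with operands requested PF_DIST iterations ahead, so they are
   already in cache when the multiply-adds need them; the prefetch is only a
   hint, so touching a little past the end of the arrays is harmless on
   mainstream targets */
static double dot_with_prefetch(const double *a, const double *b, int ldb) {
    double acc = 0.0;
    for (int k = 0; k < K_LEN; k++) {
        __builtin_prefetch(&a[k + PF_DIST], 0, 3);          /* read, keep in cache */
        __builtin_prefetch(&b[(k + PF_DIST) * ldb], 0, 3);
        acc += a[k] * b[k * ldb];
    }
    return acc;
}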

Vendor BLAS

– Not much difference from the previous case.
– Vendor BLAS is at the same level.

(Diagram: design space, as on the previous slide)

Lessons

– Vendor BLAS gets the highest performance.
– Pre-fetching boosts performance by roughly 40%.
– For iterative code, pre-fetching is well understood.
– For recursive code, it is not well understood.

Summary

The iterative approach has been proven to work well in practice
– vendor BLAS, ATLAS, etc.
– but it requires a lot of work to produce the code and tune the parameters

Implementing a high-performance CO code is not easy
– careful attention to the micro-kernel and mini-kernel is needed
– using a fully recursive approach with a highly optimized recursive micro-kernel, we never got more than 2/3 of peak

Issues with the CO approach
– recursive micro-kernels yield less performance than iterative ones using the same scheduling techniques
– pre-fetching is needed to compete with the best code, and it is not well understood in the context of CO codes

Ongoing Work

– Explain performance of all results shown
– Complete ongoing Matrix Transpose study
– Proteus system and BRILA compiler
– I/O optimality:
  Interesting theoretical results for simple model of computation
  What additional aspects of hardware/program need to be modeled for it to be useful in practice?

Miss ratios