A Comparison of Cache-conscious and Cache-oblivious Codes

A Comparison of Cache-conscious and Cache-oblivious Codes
Tom Roeder, Cornell University
Kamen Yotov, IBM T. J. Watson (presenter)
Keshav Pingali, Cornell University
Joint work with Fred Gustavson and John Gunnels (IBM T. J. Watson)
Automatic Tuning of Whole Applications (LACSI'05)

Motivation
- Current compilers do not generate good code even for BLAS.
- Conjecture: code generation requires parameters, such as tile sizes and loop unroll factors, whose optimal values are difficult to determine using analytical models.
- Previous work on ATLAS: compiler models are in fact quite adequate to produce optimized iterative cache-conscious MMM (Yotov et al. 2005).
- So what is going on inside compilers? Hard to know, because compilers like XLF are not in the public domain.
- Goal: build a domain-specific compiler (BRILA) for dense linear algebra programs.
  - Input: high-level block-recursive algorithms; the key data structure is the matrix, not the array.
  - Output: code optimized for the memory hierarchy.
- Question: what should the output look like?

Two Possible Answers
- Cache-oblivious (CO) approach:
  - Execute the recursive algorithm directly.
  - Not aware of the memory hierarchy: approximate blocking.
  - I/O optimal.
  - Used in FFT implementations (e.g. FFTW), which have little data reuse.
- Cache-conscious (CC) approach:
  - Execute a carefully blocked iterative algorithm.
  - Code (and data structures) have parameters that depend on the memory hierarchy.
  - Used in the dense linear algebra domain (e.g. BLAS, LAPACK), which has lots of data reuse.
- Questions:
  - How does the performance of the CO approach compare with that of the CC approach for algorithms with a lot of reuse, such as dense linear algebra?
  - More generally, under what assumptions about hardware and algorithms does the CO approach perform well?

Organization of Talk
- A non-standard view of blocking: reducing the bandwidth required from memory.
- CO and CC approaches to blocking: control structures and data structures.
- Experimental results: UltraSPARC IIIi, Itanium, Xeon, Power 5.
- Lessons and ongoing work.

Blocking
- Microscopic view: blocking reduces the latency of a memory access.
- Macroscopic view: the memory hierarchy can be ignored if
  - memory has enough bandwidth to feed the processor, and
  - data can be pre-fetched to hide memory latency.
- From this view, blocking reduces the bandwidth needed from memory.
- It is useful to consider the macroscopic view in more detail.

Example: MMM on Itanium 2
- Processor features: 2 FMAs per cycle, 126 effective FP registers.
- Basic MMM:
    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
          C[i][j] += A[i][k] * B[k][j];
- Execution requirements: N^3 multiply-adds, so the ideal execution time is N^3 / 2 cycles; 3 N^3 loads + N^3 stores = 4 N^3 memory operations.
- Bandwidth requirement: 4 N^3 / (N^3 / 2) = 8 doubles per cycle.
- Memory cannot sustain this bandwidth, but the register file can.
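The counts above can be reproduced with a minimal Python sketch (the function names and the `fmas_per_cycle` parameter are illustrative, not from the talk):

```python
def naive_mmm(A, B, C, N):
    # Basic MMM: N^3 multiply-adds; each inner iteration issues
    # 3 loads (C, A, B) and 1 store, i.e. 4*N^3 memory operations total.
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i][j] += A[i][k] * B[k][j]

def required_bandwidth(N, fmas_per_cycle=2):
    # With 2 FMAs per cycle the ideal time is N^3 / 2 cycles;
    # dividing 4*N^3 memory operations by that gives 8 doubles/cycle.
    mem_ops = 4 * N ** 3
    ideal_cycles = N ** 3 / fmas_per_cycle
    return mem_ops / ideal_cycles
```

The quotient is 8 doubles per cycle for every N: the bandwidth demand of the unblocked loop does not shrink with problem size, which is what blocking addresses.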

Reduce Bandwidth by Blocking
- Square blocks: NB x NB x NB.
  - The working set must fit in the cache; its size depends on the schedule, with a maximum of 3 NB^2 doubles.
- Data touched (doubles): 4 NB^2 per block computation; total (N / NB)^3 * 4 NB^2 = 4 N^3 / NB.
- Ideal execution time: N^3 / 2 cycles.
- Required bandwidth from memory: (4 N^3 / NB) / (N^3 / 2) = 8 / NB doubles per cycle.
- General picture for a multi-level memory hierarchy (CPU, caches, memory): the bandwidth required from level L+1 is 8 / NB_L.
- Constraints:
  - Lower bound: 8 / NB_L ≤ bandwidth between levels L and L+1.
  - Upper bound: working set of the block computation ≤ capacity of level L.
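The tiled loop nest and the 8 / NB figure can be sketched as follows (a minimal illustration; `blocked_mmm` and `blocked_bandwidth` are our names, and the bandwidth formula assumes N divisible by NB):

```python
def blocked_mmm(A, B, C, N, NB):
    # Tile all three loops into NB x NB x NB block computations;
    # the working set of one block computation is at most 3*NB^2 doubles.
    for i0 in range(0, N, NB):
        for j0 in range(0, N, NB):
            for k0 in range(0, N, NB):
                for i in range(i0, min(i0 + NB, N)):
                    for j in range(j0, min(j0 + NB, N)):
                        for k in range(k0, min(k0 + NB, N)):
                            C[i][j] += A[i][k] * B[k][j]

def blocked_bandwidth(N, NB, fmas_per_cycle=2):
    # Each block computation touches 4*NB^2 doubles and there are
    # (N/NB)^3 of them: 4*N^3/NB doubles over N^3/2 ideal cycles = 8/NB.
    touched = (N // NB) ** 3 * 4 * NB ** 2
    ideal_cycles = N ** 3 / fmas_per_cycle
    return touched / ideal_cycles
```

For example, NB = 2 halves the unblocked demand of 8 doubles per cycle to 4, and larger NB reduces it proportionally.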

Example: MMM on Itanium 2 (registers to L2)
[Diagram: FPU, register file, L1, L2, L3, memory, annotated with per-link bandwidths in doubles per cycle; at most 4 accesses per cycle between the registers and L2, and roughly 0.5 doubles per cycle between L3 and memory.]
- Between the register file and L2, the constraints are 8 / NB_R ≤ 4 and 3 NB_R^2 ≤ 126.
- Therefore Bandwidth(R, L2) is sufficient for 2 ≤ NB_R ≤ 6:
  - NB_R = 2 requires 8 / NB_R = 4 doubles per cycle from L2.
  - NB_R = 6 requires 8 / NB_R ≈ 1.33 doubles per cycle from L2.
  - NB_R > 6 is possible with better scheduling.

Example: MMM on Itanium 2 (L2 to L3)
- Between L2 and L3 there is sufficient bandwidth without blocking at L2.
- Feasible intervals so far: 2 ≤ NB_R ≤ 6 with 1.33 ≤ B(R, L2) ≤ 4, and 2 ≤ NB_L2 ≤ 6 with 1.33 ≤ B(L2, L3) ≤ 4.

Example: MMM on Itanium 2 (L3 to memory)
- Between L3 and memory, the constraints are 8 / NB_L3 ≤ 0.5 and 3 NB_L3^2 ≤ 524288 doubles (4 MB).
- Therefore Bandwidth(L3, Memory) is sufficient for 16 ≤ NB_L3 ≤ 418:
  - NB_L3 = 16 requires 8 / NB_L3 = 0.5 doubles per cycle from memory.
  - NB_L3 = 418 requires 8 / NB_L3 ≈ 0.02 doubles per cycle from memory.
  - NB_L3 > 418 is possible with better scheduling.
- Summary of feasible intervals: 2 ≤ NB_R ≤ 6 (1.33 ≤ B(R, L2) ≤ 4); 2 ≤ NB_L2 ≤ 6 (1.33 ≤ B(L2, L3) ≤ 4); 16 ≤ NB_L3 ≤ 418 (0.02 ≤ B(L3, Memory) ≤ 0.5).
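Both feasible intervals follow mechanically from the two constraints; a small sketch (the function name is ours, and it assumes the worst-case working set of 3 NB^2 doubles used throughout these slides):

```python
import math

def feasible_nb(bandwidth, capacity_doubles):
    # Lower bound: 8 / NB <= bandwidth to the next level,
    # so NB >= 8 / bandwidth.
    # Upper bound: working set 3 * NB^2 <= capacity of this level,
    # so NB <= sqrt(capacity / 3).
    nb_min = math.ceil(8 / bandwidth)
    nb_max = math.floor(math.sqrt(capacity_doubles / 3))
    return nb_min, nb_max
```

With bandwidth 4 and 126 registers this yields the interval (2, 6) from the register-file slide; with bandwidth 0.5 and a 4 MB (524288-double) L3 it yields (16, 418).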

Lessons
- Blocking is useful to reduce bandwidth requirements.
- The block size does not have to be exact: it is enough for it to lie within an interval that depends on hardware parameters, so approximate blocking may be OK.
- Latency: use pre-fetching to reduce expected latency.

Implementation of Blocking
- Control structure:
  - What are the block computations?
  - In what order are they performed?
  - How is this order generated?
- Data structure: non-standard storage orders to match the control structure.

Cache-Oblivious Algorithms
- Divide all dimensions (AD): an 8-way recursive tree down to 1x1 blocks; a Gray-code order of the subproblems promotes reuse (Bilardi et al.):
    C00 = A00*B00 + A01*B10
    C01 = A00*B01 + A01*B11
    C10 = A10*B00 + A11*B10
    C11 = A10*B01 + A11*B11
- Divide the largest dimension (LD): a two-way recursive tree down to 1x1 blocks (Frigo, Leiserson, et al.), e.g. splitting the rows of A:
    C0 = A0*B
    C1 = A1*B
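The AD scheme can be sketched as a recursion over index ranges (a minimal illustration; the talk orders the eight subproblems by a Gray code to promote reuse, while plain order is used here for clarity, and `co_mmm` is our name):

```python
def co_mmm(A, B, C, i0, i1, j0, j1, k0, k1):
    # Divide-all-dimensions (AD) cache-oblivious MMM: halve every
    # dimension larger than 1, recursing 8-way down to scalar updates.
    if i1 - i0 == 1 and j1 - j0 == 1 and k1 - k0 == 1:
        C[i0][j0] += A[i0][k0] * B[k0][j0]
        return
    # Ceiling midpoints: a dimension of size 1 produces one empty half,
    # which the guard below skips, so the recursion always terminates.
    im = (i0 + i1 + 1) // 2
    jm = (j0 + j1 + 1) // 2
    km = (k0 + k1 + 1) // 2
    for il, ih in ((i0, im), (im, i1)):
        for jl, jh in ((j0, jm), (jm, j1)):
            for kl, kh in ((k0, km), (km, k1)):
                if il < ih and jl < jh and kl < kh:
                    co_mmm(A, B, C, il, ih, jl, jh, kl, kh)
```

No block-size parameter appears anywhere: every level of the cache hierarchy is blocked approximately by some level of the recursion, which is the sense in which the algorithm is cache-oblivious.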

Cache-Oblivious: Discussion
- Block sizes are generated dynamically, at each level of the recursive call tree.
- In our experience the performance of AD and LD is similar; we use AD for the rest of the talk.

Cache-Conscious Algorithms
- Usually iterative: nested loops.
- Implementation of blocking:
  - Cache blocking is achieved by loop tiling.
  - Register blocking additionally requires loop unrolling.

Structure of CC Code
- Cache blocking (loop tiling)
- Register blocking (loop unrolling)
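The two levels compose as in the following sketch (a minimal illustration assuming a 2 x 2 register block and N divisible by NB; the name `cc_mmm` and the particular unroll factors are ours, not parameters from the talk):

```python
def cc_mmm(A, B, C, N, NB):
    # Cache blocking: loop tiling into NB x NB x NB tiles.
    # Register blocking: a fully unrolled 2 x 2 micro-kernel whose
    # accumulators (c00..c11) model values held in registers
    # across the entire k loop.
    assert N % NB == 0 and NB % 2 == 0
    for i0 in range(0, N, NB):
        for j0 in range(0, N, NB):
            for k0 in range(0, N, NB):
                for i in range(i0, i0 + NB, 2):
                    for j in range(j0, j0 + NB, 2):
                        c00, c01 = C[i][j], C[i][j + 1]
                        c10, c11 = C[i + 1][j], C[i + 1][j + 1]
                        for k in range(k0, k0 + NB):
                            a0, a1 = A[i][k], A[i + 1][k]
                            b0, b1 = B[k][j], B[k][j + 1]
                            c00 += a0 * b0; c01 += a0 * b1
                            c10 += a1 * b0; c11 += a1 * b1
                        C[i][j], C[i][j + 1] = c00, c01
                        C[i + 1][j], C[i + 1][j + 1] = c10, c11
```

Each k-iteration of the micro-kernel loads 4 doubles and performs 4 multiply-adds, so the C values never leave their accumulators between tiles of the same (i, j) block.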

Data Structures
- Row-major
- Row-Block-Row (RBR)
- Morton-Z
The non-standard layouts fit the control structure better, improve spatial locality, and support streaming.

Data Structures: Discussion
- Morton-Z matches the recursive control structure better than RBR does, which suggests better performance for CO.
- It is more complicated to implement, and in our experience the payoff is small or even negative.
- We use RBR for the rest of the talk.
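The two layouts can be made concrete by their index mappings into a flat array (a sketch; it assumes N divisible by NB, and the bit-interleaving convention noted in the comment, since conventions vary):

```python
def rbr_index(i, j, N, NB):
    # Row-Block-Row: NB x NB blocks laid out in row-major block order,
    # with elements stored row-major inside each block.
    bi, bj, oi, oj = i // NB, j // NB, i % NB, j % NB
    return (bi * (N // NB) + bj) * NB * NB + oi * NB + oj

def morton_index(i, j, bits=16):
    # Morton-Z order: interleave the bits of the row and column indices
    # (row bits placed in the odd positions here).
    z = 0
    for b in range(bits):
        z |= ((i >> b) & 1) << (2 * b + 1)
        z |= ((j >> b) & 1) << (2 * b)
    return z
```

RBR only needs the block size of one level, while Morton-Z recursively interleaves at every scale, mirroring the AD recursion; the extra implementation complexity is the cost the slide refers to.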

UltraSPARC IIIi
- Peak performance: 2 GFlops.
- Memory hierarchy: 32 registers; 64 KB, 4-way L1 data cache; 1 MB, 4-way L2 data cache.
- Compilers: SUN F95 7.1 (FORTRAN), SUN C 5.5 (C).

Control Structures
- Outer control structure: iterative or recursive.
- Inner control structure: a single statement.
  - Iterative: a triply nested loop.
  - Recursive: recursion down to 1 x 1 x 1.

Control Structures: Micro-Kernel (None / Compiler)
- Recursion down to NB; unfold completely below NB to get a basic block.
- Micro-kernel: the basic block, compiled with the native compiler.
  - Best performance at NB = 12.
  - The compiler is unable to use the registers well.
  - Unfolding reduces control overhead but is limited by the I-cache.

Control Structures: Micro-Kernel (Scalarized)
- Recursion down to NB; unfold completely below NB to get a basic block.
- Micro-kernel: scalarize all array references in the basic block, then compile with the native compiler.

Control Structures: Micro-Kernel (Belady / BRILA)
- Recursion down to NB; unfold completely below NB to get a basic block.
- Micro-kernel: perform Belady's register allocation on the basic block, then schedule using the BRILA compiler.

Control Structures: Micro-Kernel (Coloring / BRILA)
- Recursion down to NB; unfold completely below NB to get a basic block.
- Micro-kernel: construct a preliminary schedule, perform graph-coloring register allocation, then schedule using the BRILA compiler.

Control Structures: Iterative Micro-Kernel (Coloring / BRILA)
- Recursion down to an MU x NU x KU block.
- Micro-kernel: completely unroll the MU x NU x KU triply nested loop, construct a preliminary schedule, perform graph-coloring register allocation, and schedule using the BRILA compiler.

Control Structures: Mini-Kernel
- Recursion down to NB.
- Mini-kernel: an NB x NB x NB triply nested loop, tiled for the L1 cache; the loop body is the micro-kernel.

Control Structures: Full Design Space
- Outer control structure: iterative or recursive.
- Inner control structure: statement, mini-kernel, or micro-kernel.
- Mini-kernel variants: ATLAS CGw/S, ATLAS Unleashed.
- Micro-kernel variants: none / compiler, scalarized, Belady + BRILA, coloring + BRILA.

UltraSPARC IIIi: Complete Results
[Performance graphs not reproduced in this transcript.]

Some Observations
- The iterative approach has been proven to work well in practice (vendor BLAS, ATLAS, etc.), but it requires a lot of work to produce the code and tune the parameters.
- Implementing a high-performance CO code is not easy: careful attention to the micro-kernel and mini-kernel is needed.
- The recursive approach suffers overhead on several fronts:
  - Recursive micro-kernels yield less performance than iterative ones built with the same scheduling techniques.
  - Recursive micro-kernels have large code size, which sometimes hurts instruction-cache performance.
  - Obtaining high performance from a recursive outer structure requires large kernels at the leaves to reduce recursive overhead.
- Using a fully recursive approach with a highly optimized micro-kernel, we never got more than 2/3 of peak; we are just starting to analyze the numbers.
- Automating code generation: pre-fetching in iterative codes can be automated; it is not obvious how to do this for CO codes.

Ongoing Work
- Explain the performance of all results shown.
- Complete the ongoing matrix-transpose study.
- I/O optimality: there are interesting theoretical results for a simple model of computation; what additional aspects of hardware and programs need to be modeled for these results to be useful in practice?
- Compiler-generated iterative multi-level blocking for dense linear algebra programs (the BRILA compiler).

Itanium 2: (In)Complete Results
[Performance graphs not reproduced in this transcript.]

Xeon: (In)Complete Results
[Performance graphs not reproduced in this transcript.]

Power 5: (In)Complete Results
In the works…