Memory Hierarchies

Memory Hierarchy of SGI Octane
  Regs: 64 registers
  L1 cache: 32KB (I) + 32KB (D), access time 2 cycles
  L2 cache: 1MB, access time 10 cycles
  Memory: 128MB, access time 70 cycles
R10K processor: 4-way superscalar, 2 fpo/cycle, 195MHz. Peak performance: 390 Mflops.
Experience: sustained performance is less than 10% of peak; the processor often stalls waiting for the memory system to load data.

Memory-wall solutions
  Latency avoidance: multi-level memory hierarchies (caches)
  Latency tolerance: pre-fetching, multi-threading
These techniques are not mutually exclusive: most microprocessors have both caches and pre-fetching, and modest multi-threading is coming into vogue.
Our focus: memory hierarchies.

Software problem
Caches are useful only if programs have locality of reference:
  Temporal locality: references to a given memory address are clustered together in time.
  Spatial locality: references to nearby memory addresses are clustered together in time.
Problem: programs obtained by expressing most algorithms in the straightforward way do not have much locality of reference, and worrying about locality when coding algorithms complicates the software process enormously.
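
As a minimal illustration (mine, not from the slides), consider summing an NxN matrix in C, whose arrays are row-major. The row-wise traversal walks memory with unit stride and exploits spatial locality; the column-wise traversal jumps N elements between accesses and touches a new cache line almost every time. The function names and the size N are hypothetical.

  #define N 2048

  /* Unit-stride traversal: consecutive accesses fall in the same
     cache line, so most of them hit (good spatial locality). */
  double sum_rowwise(double a[N][N]) {
      double s = 0.0;
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              s += a[i][j];
      return s;
  }

  /* Stride-N traversal: nearly every access touches a new cache
     line, so spatial locality is lost and the miss ratio soars. */
  double sum_colwise(double a[N][N]) {
      double s = 0.0;
      for (int j = 0; j < N; j++)
          for (int i = 0; i < N; i++)
              s += a[i][j];
      return s;
  }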

Example: matrix multiplication

  DO I = 1, N    ! assume arrays stored in row-major order
    DO J = 1, N
      DO K = 1, N
        C(I,J) = C(I,J) + A(I,K)*B(K,J)

Great algorithmic data reuse: each array element is touched O(N) times! All six loop permutations are computationally equivalent (even down to round-off error, since the K loop runs in the same order for each C(I,J) in every permutation). However, the execution times of the six versions can be very different if the machine has a cache.
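
The same computation in C (a sketch with hypothetical names; C arrays are row-major, matching the slide's assumption). In the IJK order, the inner loop reads A(I,:) with unit stride but B(:,J) with stride N; permuting to IKJ makes the innermost accesses to both B and C unit-stride.

  #define N 1024

  /* IJK order: the inner loop walks B down a column (stride N),
     which has poor spatial locality. */
  void mmm_ijk(double C[N][N], const double A[N][N], const double B[N][N]) {
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              for (int k = 0; k < N; k++)
                  C[i][j] += A[i][k] * B[k][j];
  }

  /* IKJ order: same arithmetic, different traversal -- the inner
     loop now walks B and C along rows (unit stride). */
  void mmm_ikj(double C[N][N], const double A[N][N], const double B[N][N]) {
      for (int i = 0; i < N; i++)
          for (int k = 0; k < N; k++) {
              double a = A[i][k];
              for (int j = 0; j < N; j++)
                  C[i][j] += a * B[k][j];
          }
  }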

MMM Experiments
[Figure: simulated L1 cache miss ratios for the Intel Pentium III running MMM with N = 1…1300; cache modeled as 16KB, 32B blocks, 4-way set-associative, with 8-byte array elements.]

Quantifying performance differences

  DO I = 1, N    ! assume arrays stored in row-major order
    DO J = 1, N
      DO K = 1, N
        C(I,J) = C(I,J) + A(I,K)*B(K,J)

Octane L2 cache hit: 10 cycles; cache miss: 70 cycles. Each version performs 2N^3 floating-point operations and about 4N^3 memory accesses.

Time to execute the IKJ version (miss ratio 0.13):
  2N^3 + 70*0.13*4N^3 + 10*0.87*4N^3 = 73.2 N^3
Time to execute the JKI version (miss ratio 0.5):
  2N^3 + 70*0.5*4N^3 + 10*0.5*4N^3 = 162 N^3

Speed-up = 162/73.2 ≈ 2.2. Key transformation: loop permutation.
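
The cost model above is easy to express in a few lines of C (my formulation, not from the slides): 2N^3 cycles of floating-point work plus 4N^3 memory accesses, each charged the hit or miss latency.

  #include <stdio.h>

  /* Estimated cycles for MMM under the simple Octane model. */
  double mmm_cycles(double n, double miss_ratio,
                    double hit_cost, double miss_cost) {
      double n3 = n * n * n;
      return 2.0 * n3
           + 4.0 * n3 * (miss_ratio * miss_cost
                         + (1.0 - miss_ratio) * hit_cost);
  }

  int main(void) {
      double n = 1000.0;
      double ikj = mmm_cycles(n, 0.13, 10.0, 70.0);  /* 73.2 N^3 */
      double jki = mmm_cycles(n, 0.50, 10.0, 70.0);  /* 162 N^3  */
      printf("speed-up = %.1f\n", jki / ikj);        /* ~2.2     */
      return 0;
  }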

Even better…
Break MMM into a bunch of smaller MMMs, so that the large-cache model holds for each small MMM. Then the large-cache model is valid for the entire computation, and the miss ratio for the entire computation will be 0.75/bt, where t is the tile size and b is the number of array elements per cache block.

Loop tiling

  DO It = 1, N, t
    DO Jt = 1, N, t
      DO Kt = 1, N, t
        DO I = It, It+t-1
          DO J = Jt, Jt+t-1
            DO K = Kt, Kt+t-1
              C(I,J) = C(I,J) + A(I,K)*B(K,J)

[Figure: the I, J, K iteration space of C = A*B partitioned into t x t tiles indexed by It, Jt, Kt.]

Break the big MMM into a sequence of smaller MMMs, each of which multiplies sub-matrices of size t x t. The parameter t (tile size) must be chosen carefully: as large as possible, yet such that the working set of the small matrix multiplication fits in cache.
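
The same tiling in C (a sketch with hypothetical names; assumes N is a multiple of the tile size T for brevity). The inner three loops use the IKJ order from earlier so the innermost accesses are unit-stride.

  #define N 1024
  #define T 64   /* tile size: three TxT working sets must fit in cache */

  /* Tiled MMM: each iteration of the three outer loops performs a
     small TxT matrix multiply whose working set fits in cache. */
  void mmm_tiled(double C[N][N], const double A[N][N], const double B[N][N]) {
      for (int it = 0; it < N; it += T)
          for (int jt = 0; jt < N; jt += T)
              for (int kt = 0; kt < N; kt += T)
                  for (int i = it; i < it + T; i++)
                      for (int k = kt; k < kt + T; k++) {
                          double a = A[i][k];
                          for (int j = jt; j < jt + T; j++)
                              C[i][j] += a * B[k][j];
                      }
  }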

Speed-up from tiling
Miss ratio for the blocked computation = miss ratio under the large-cache model = 0.75/bt ≈ 0.001 (b = 4, t = 200 for the Octane).
Time to execute the tiled version:
  2N^3 + 70*0.001*4N^3 + 10*0.999*4N^3 ≈ 42.2 N^3
Speed-up over the JKI version = 162/42.2 ≈ 4.

Observations
  Locality-optimized code is more complex than the high-level algorithm.
  Locality optimization changed the order in which operations were done, not the number of operations.
  A "fine-grain" view of data structures (arrays) is critical.
  Loop orders and the tile size must be chosen carefully: cache size is the key parameter, and associativity matters.
  Actual code is even more complex, because it must also optimize for processor resources: registers (register tiling) and the pipeline (loop unrolling). Optimized MMM code can be ~1000 lines of C code.
Wouldn't it be nice to have all of this done automatically by a compiler? Actually, it is done automatically nowadays…
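
For a flavor of loop unrolling, here is a minimal sketch (mine, not from the slides) of a dot-product kernel like the one at the heart of MMM, unrolled by four with separate accumulators so the pipeline can overlap independent multiply-adds. The vectors are assumed contiguous and n a multiple of 4.

  /* Dot-product inner kernel, unrolled by 4: four independent
     accumulators break the addition dependence chain, letting the
     floating-point pipeline keep multiple operations in flight. */
  double dot_unrolled(const double *a, const double *b, int n) {
      double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
      for (int k = 0; k < n; k += 4) {
          s0 += a[k]     * b[k];
          s1 += a[k + 1] * b[k + 1];
          s2 += a[k + 2] * b[k + 2];
          s3 += a[k + 3] * b[k + 3];
      }
      return (s0 + s1) + (s2 + s3);
  }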

Itanium MMM (compiled with –O3)
[Figure: performance graph; the compiled code reaches 92% of peak performance.]