Locality / Tiling
María Jesús Garzarán, University of Illinois at Urbana-Champaign

Roadmap

Locality (tiling) for matrix multiplication
–Find the optimal tile size, assuming data are copied to consecutive locations
 Kamen Yotov et al. A Comparison of Empirical and Model-driven Optimization. PLDI 2003.
Locality for non-numerical codes
–Structure splitting
–Field reordering
 Cache-Conscious Structure Definition, by Trishul M. Chilimbi, Bob Davidson, and James Larus. PLDI 1999.
–Cache-conscious structure layout
 Cache-Conscious Structure Layout, by Trishul M. Chilimbi, Mark D. Hill, and James Larus. PLDI 1999.

Memory Hierarchy

Most programs have a high degree of locality in their accesses
–Spatial locality: accessing things near previous accesses
–Temporal locality: accessing an item that was previously accessed
The memory hierarchy tries to exploit this locality.

[Figure: the memory hierarchy, from the processor (registers, datapath, control) through the on-chip cache, second-level cache (SRAM), main memory (DRAM), secondary storage (disk), and tertiary storage (disk/tape). Example latencies: L1 about 4 cycles and L2 about 23 cycles on a Pentium 4 (Prescott); about 3 and 17 cycles on an AMD Athlon 64. Sizes range from 8-32 KB of L1 cache up to 1 GB-8 GB of main memory.]
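To make the two kinds of locality concrete, here is a small sketch (mine, not from the slides; the matrix size is an illustrative assumption). Summing a row-major matrix row by row walks memory contiguously and exploits spatial locality; summing it column by column touches a different cache line on almost every access.

#include <stddef.h>

#define N 1024   /* illustrative size (assumed) */

/* Row-major traversal: consecutive accesses fall in the same cache
   line, so spatial locality is high. */
double sum_row_major(const double a[N][N]) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal of the same row-major array: successive
   accesses are N*sizeof(double) bytes apart, so almost every access
   touches a different cache line. */
double sum_col_major(const double a[N][N]) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}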

Matrix Multiplication

for (i = 0; i < SIZE; i++)
  for (j = 0; j < SIZE; j++)
    for (k = 0; k < SIZE; k++)
      C[i][j] += A[i][k] * B[k][j];

[Figure: C[i][j] is computed from row i of A and column j of B.]

Matrix Multiplication: Loop Invariant

for (i = 0; i < SIZE; i++)
  for (j = 0; j < SIZE; j++)
    for (k = 0; k < SIZE; k++)
      C[i][j] += A[i][k] * B[k][j];

becomes

for (i = 0; i < SIZE; i++)
  for (j = 0; j < SIZE; j++) {
    D = C[i][j];
    for (k = 0; k < SIZE; k++)
      D += A[i][k] * B[k][j];
    C[i][j] = D;
  }

Matrix Multiplication: Cache Tiling

for (i0 = 0; i0 < SIZE; i0 += block)
  for (j0 = 0; j0 < SIZE; j0 += block)
    for (k0 = 0; k0 < SIZE; k0 += block)
      for (i = i0; i < min(i0 + block, SIZE); i++)
        for (j = j0; j < min(j0 + block, SIZE); j++)
          for (k = k0; k < min(k0 + block, SIZE); k++)
            C[i][j] += A[i][k] * B[k][j];

[Figure: a block x block tile of C at (i0, j0) is computed from a tile of A at (i0, k0) and a tile of B at (k0, j0).]

Modeling for the Tile Size (NB)

Models of increasing complexity:
–3·NB² ≤ C: the whole working set fits in L1
–NB² + NB + 1 ≤ C: fully associative cache, optimal replacement, line size of 1 word
–Refined conditions for a line size larger than 1 word
–Refined conditions for LRU replacement

Largest NB for No Capacity/Conflict Misses

Tiles are copied into contiguous memory.
Condition for cold misses only:
–3·NB² ≤ L1Size

[Figure: the NB x NB tiles of A, B, and C that must be resident in L1 at the same time.]

Largest NB for No Capacity Misses

MMM:

for (int j = 0; j < N; j++)
  for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
      c[i][j] += a[i][k] * b[k][j];

Cache model:
–Fully associative
–Line size of 1 word
–Optimal replacement
Bottom line: NB² + NB + 1 ≤ L1Size
–One full NB x NB matrix
–One row / column
–One element
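As a quick sanity check of these two conditions, the sketch below (mine, not from the slides; the 32 KB L1 size is an assumed example) computes the largest NB each model allows when the cache holds C double-precision elements.

#include <stdio.h>

/* Largest NB with 3*NB^2 <= C (three full tiles resident). */
static int nb_three_tiles(long c_elems) {
    int nb = 0;
    while (3L * (nb + 1) * (nb + 1) <= c_elems)
        nb++;
    return nb;
}

/* Largest NB with NB^2 + NB + 1 <= C (one tile, one row/column,
   one element resident, under optimal replacement). */
static int nb_refined(long c_elems) {
    int nb = 0;
    while ((long)(nb + 1) * (nb + 1) + (nb + 1) + 1 <= c_elems)
        nb++;
    return nb;
}

int main(void) {
    long c_elems = 32 * 1024 / 8;   /* assumed 32 KB L1, 8-byte doubles */
    printf("3*NB^2 <= C    -> NB = %d\n", nb_three_tiles(c_elems)); /* 36 */
    printf("NB^2+NB+1 <= C -> NB = %d\n", nb_refined(c_elems));     /* 63 */
    return 0;
}

For a 32 KB L1 the refined condition nearly doubles the usable tile size (NB = 63 versus 36), which is why the refinement is worth modeling.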

Extending the Model

Line size > 1 word
–Spatial locality
–The array layout in memory matters
Bottom line: depending on the loop order, the capacity condition takes one of two forms.
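The two alternative inequalities can be written, in the spirit of the earlier conditions, by counting cache lines instead of words. This is my reconstruction under stated assumptions (line size of B words, L1 capacity of C words), not necessarily the slide's exact formulas:

\[
  \left\lceil \frac{NB^2}{B} \right\rceil
  + \left\lceil \frac{NB}{B} \right\rceil + 1
  \;\le\; \frac{C}{B}
  \qquad\text{or}\qquad
  \left\lceil \frac{NB^2}{B} \right\rceil
  + NB + 1
  \;\le\; \frac{C}{B}
\]

The first form applies when the NB-element row or column is laid out contiguously, so it occupies only ⌈NB/B⌉ lines; the second when it is traversed with a large stride, so each of its NB elements sits on a different line.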

Extending the Model (cont.)

LRU (rather than optimal) replacement
MMM sample:

for (int j = 0; j < N; j++)
  for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
      c[i][j] += a[i][k] * b[k][j];

Bottom line: with LRU the largest safe NB depends on the loop order; the six orders fall into three groups, each with its own condition: IJK and IKJ, JIK and JKI, KIJ and KJI.

Matrix Multiplication: Cache and Register Tiling

for (j = 0; j < SIZE; j += block)
  for (i = 0; i < SIZE; i += block)
    for (k = 0; k < SIZE; k += block)
      // miniMMM code
      for (jj = j; jj < j + block; jj += MU)
        for (ii = i; ii < i + block; ii += NU)
          for (kk = k; kk < k + block; kk++) {
            // microMMM code
            C[ii][jj]     += A[ii][kk]   * B[kk][jj];
            C[ii+1][jj]   += A[ii+1][kk] * B[kk][jj];
            C[ii+2][jj]   += A[ii+2][kk] * B[kk][jj];
            C[ii][jj+1]   += A[ii][kk]   * B[kk][jj+1];
            C[ii+1][jj+1] += A[ii+1][kk] * B[kk][jj+1];
            C[ii+2][jj+1] += A[ii+2][kk] * B[kk][jj+1];
          }

with MU = 2 and NU = 3 (the unrolled body updates a 3 x 2 register tile of C).
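The point of the register tile is that the six partial sums of C can stay in registers for the whole kk loop. Below is a hedged sketch of the same 3 x 2 micro-kernel written with explicit scalar accumulators (the function and variable names are mine, not the slide's):

#ifndef SIZE
#define SIZE 1024   /* assumed; matches the SIZE of the earlier slides */
#endif

/* microMMM over C[ii..ii+2][jj..jj+1]: six accumulators that the
   compiler can keep in registers across the kk loop. */
void micro_mmm(int kstart, int kend, int ii, int jj,
               double A[][SIZE], double B[][SIZE], double C[][SIZE]) {
    double c00 = C[ii][jj],   c01 = C[ii][jj+1];
    double c10 = C[ii+1][jj], c11 = C[ii+1][jj+1];
    double c20 = C[ii+2][jj], c21 = C[ii+2][jj+1];
    for (int kk = kstart; kk < kend; kk++) {
        double b0 = B[kk][jj], b1 = B[kk][jj+1];
        double a0 = A[ii][kk], a1 = A[ii+1][kk], a2 = A[ii+2][kk];
        c00 += a0 * b0;  c01 += a0 * b1;
        c10 += a1 * b0;  c11 += a1 * b1;
        c20 += a2 * b0;  c21 += a2 * b1;
    }
    C[ii][jj]   = c00;  C[ii][jj+1]   = c01;
    C[ii+1][jj] = c10;  C[ii+1][jj+1] = c11;
    C[ii+2][jj] = c20;  C[ii+2][jj+1] = c21;
}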

Locality for Non-Numerical Codes

Cache-Conscious Structure Definition, by Trishul M. Chilimbi, Bob Davidson, and James Larus. PLDI 1999.
–Structure splitting
–Field reordering
Cache-Conscious Structure Layout, by Trishul M. Chilimbi, Mark D. Hill, and James Larus. PLDI 1999.

Cache-Conscious Structure Definition

Split a structure's fields into hot (frequently accessed) and cold (rarely accessed) parts, and group fields based on their temporal affinity.

Program Transformation: Example

–The cold fields are moved into a new cold class and labelled public.
–The original class keeps a reference to the new cold class.
–A new cold-class instance is assigned to that cold-class reference field.
–Accesses to cold fields now require an extra indirection.
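The paper's example is a Java class split; a C rendering of the same idea (the field names are hypothetical, chosen only for illustration) looks like this:

/* Cold part: rarely accessed fields moved out of the hot structure. */
struct node_cold {
    char name[32];        /* e.g., data used only for reporting */
    long creation_time;
};

/* Hot part: the fields touched on almost every traversal, plus one
   reference to the cold part, so more hot nodes fit per cache line. */
struct node {
    int               key;
    struct node      *left, *right;
    struct node_cold *cold;    /* the cold-class reference field */
};

/* Reading a cold field now costs one extra indirection. */
long node_creation_time(const struct node *n) {
    return n->cold->creation_time;
}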

Cache-Conscious Layout

Locality can be improved by:
1. Changing the program's data access pattern
 –Applied to scientific programs that manipulate dense matrices
 –Uniform, random access to elements
 –Static analysis of data dependences
2. Changing the data organization and layout
 –Pointer-based structures have locational transparency: elements of a structure can be placed at different memory (and cache) locations without changing the program's semantics.
Two placement techniques:
–Coloring
–Clustering

Clustering

Clustering packs data structure elements that are likely to be accessed contemporaneously into the same cache block.
It improves spatial and temporal locality and provides implicit prefetching.
One way to cluster a tree is to pack subtrees into a cache block.
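As an illustration of subtree clustering, here is a sketch under my own assumptions (64-byte cache blocks; 3-node subtrees packed into 128-byte block-aligned chunks; not the paper's allocator). Allocating a parent and its two children back-to-back from the same chunk keeps each small subtree within two adjacent cache blocks.

#include <stdlib.h>

#define BLOCK             64                /* assumed cache-block size */
#define CLUSTER_BYTES     (2 * BLOCK)
#define NODES_PER_CLUSTER 3                 /* parent + two children */

struct node {
    int          key;
    struct node *left, *right;
};

struct cluster {
    struct node nodes[NODES_PER_CLUSTER];
    int         used;
};

/* Grab one cache-block-aligned chunk that will hold a small subtree. */
static struct cluster *new_cluster(void) {
    struct cluster *c = aligned_alloc(BLOCK, CLUSTER_BYTES);
    if (c) c->used = 0;
    return c;
}

/* Hand out node slots from the current cluster, starting a new one
   when it is full.  Building the tree in parent, left-child,
   right-child order therefore co-locates each 3-node subtree. */
struct node *cluster_alloc(struct cluster **cur) {
    if (*cur == NULL || (*cur)->used == NODES_PER_CLUSTER) {
        *cur = new_cluster();
        if (*cur == NULL) return NULL;
    }
    return &(*cur)->nodes[(*cur)->used++];
}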

Clustering (cont.)

Why is this clustering good for a binary tree?
–Assuming a random tree search, the probability of accessing either child of a node is 1/2.
–With the K nodes of a complete subtree clustered in a cache block, the expected number of accesses to the block is the height of the subtree, log2(K+1), which is greater than 2 when K > 3.
–With a depth-first clustering, the expected number of accesses to the block is smaller.
–Of course, this is only true for a random access pattern.
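A small worked instance of this bound (my numbers, purely for illustration): take a complete 3-level subtree, K = 7 nodes, packed into one block.

% Subtree clustering: a random root-to-leaf descent that enters the
% block touches it once per level of the packed subtree.
\[
  \mathbb{E}[\text{accesses to the block}] = \log_2(K+1) = \log_2 8 = 3
\]
% Depth-first packing instead places the 7 nodes along one path; after
% the first node, each step stays on that path with probability 1/2, so
\[
  \mathbb{E}[\text{accesses to the block}]
  \;\le\; 1 + \tfrac{1}{2} + \tfrac{1}{4} + \dots + \tfrac{1}{2^{6}} \;<\; 2
\]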

Coloring

Coloring maps contemporaneously accessed elements to non-conflicting regions of the cache.

[Figure: a 2-way cache partitioned into a region of p sets reserved for frequently accessed data structure elements and the remaining C - p sets for the remaining elements.]
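A minimal sketch of the idea (my own construction, with assumed cache parameters; it is not the paper's ccmalloc interface): viewing the cache as 512 direct-mapped sets of 64 bytes, an arena allocator can hand out hot objects only from addresses that map to the first HOT_SETS sets, and everything else from the remaining sets, so cold data can never evict the hot working set.

#include <stdlib.h>
#include <stddef.h>

#define LINE       64                  /* assumed line size           */
#define NSETS      512                 /* assumed number of sets      */
#define HOT_SETS   64                  /* sets reserved for hot data  */
#define WAY_BYTES  (LINE * NSETS)      /* bytes covering every set    */
#define HOT_BYTES  (LINE * HOT_SETS)   /* hot region inside each span */
#define N_SPANS    8                   /* arena = 8 cache-sized spans */

static unsigned char *arena;
static size_t hot_span,  hot_off;                 /* hot-region cursor  */
static size_t cold_span, cold_off = HOT_BYTES;    /* cold-region cursor */

/* Arena aligned to WAY_BYTES so offset 0 maps to cache set 0. */
int coloring_init(void) {
    arena = aligned_alloc(WAY_BYTES, (size_t)N_SPANS * WAY_BYTES);
    return arena != NULL;
}

/* Bump-allocate n bytes (n <= HOT_BYTES) mapping to the hot sets. */
void *alloc_hot(size_t n) {
    if (hot_off + n > HOT_BYTES) { hot_off = 0; hot_span++; }
    if (hot_span >= N_SPANS) return NULL;
    void *p = arena + hot_span * WAY_BYTES + hot_off;
    hot_off += n;
    return p;
}

/* Bump-allocate n bytes mapping to the remaining (cold) sets. */
void *alloc_cold(size_t n) {
    if (cold_off + n > WAY_BYTES) { cold_off = HOT_BYTES; cold_span++; }
    if (cold_span >= N_SPANS) return NULL;
    void *p = arena + cold_span * WAY_BYTES + cold_off;
    cold_off += n;
    return p;
}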
