Cache Lab Implementation and Blocking

Monil Shah, June 19th, 2015

Welcome to the World of Pointers!

Outline
- Schedule
- Memory organization
- Caching
  - Different types of locality
  - Cache organization
- Cache Lab
  - Part (a): Building a cache simulator
  - Part (b): Efficient matrix transpose
  - Blocking

Class Schedule
Cache Lab: due Wednesday. Start now (if you haven't already!)
Exam soon! Mon June 29th – Wed July 1st. Start doing practice problems.
10 days to go!

Memory Hierarchy
L0: Registers – CPU registers hold words retrieved from the L1 cache
L1: L1 cache (SRAM) – holds cache lines retrieved from the L2 cache
L2: L2 cache (SRAM) – holds cache lines retrieved from main memory
L3: Main memory (DRAM) – holds disk blocks retrieved from local disks
L4: Local secondary storage (local disks) – holds files retrieved from disks on remote network servers
L5: Remote secondary storage (tapes, distributed file systems, Web servers)
Moving up the hierarchy: smaller, faster, costlier per byte. Moving down: larger, slower, cheaper per byte.

Memory Hierarchy
Registers – SRAM – DRAM – Local secondary storage – Remote secondary storage
We will discuss the SRAM/DRAM interaction (cache and main memory).

SRAM vs DRAM Tradeoff
SRAM (cache):
- Faster (L1 cache: ~1 CPU cycle)
- Smaller (kilobytes (L1) or megabytes (L2))
- More expensive and "energy-hungry"
DRAM (main memory):
- Relatively slower (hundreds of CPU cycles)
- Larger (gigabytes)
- Cheaper

Locality
Temporal locality: recently referenced items are likely to be referenced again in the near future. After accessing address X in memory, save those bytes in the cache for future access.
Spatial locality: items with nearby addresses tend to be referenced close together in time. After accessing address X, save the block of memory around X in the cache for future access.
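To make spatial locality concrete, here is a minimal sketch (my own illustration, not from the slides) contrasting a cache-friendly and a cache-unfriendly traversal of the same array:

    #include <stdio.h>

    #define N 1024
    int a[N][N];

    /* Row-major traversal touches consecutive addresses:
     * good spatial locality. */
    long sum_row_major(void) {
        long sum = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    /* Column-major traversal jumps N*sizeof(int) bytes per access:
     * poor spatial locality, so it misses far more often. */
    long sum_col_major(void) {
        long sum = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];
        return sum;
    }

    int main(void) {
        printf("%ld %ld\n", sum_row_major(), sum_col_major());
        return 0;
    }

Both functions compute the same sum, but on large arrays the column-major version runs noticeably slower because almost every access misses.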

Memory Address
64-bit on the shark machines
Block offset: b bits
Set index: s bits
Tag bits: address size – b – s

Cache
A cache consists of S = 2^s cache sets.
A cache set is a set of E cache lines; E is called the associativity.
If E = 1, the cache is called "direct-mapped".
Each cache line stores one block; each block has B = 2^b bytes.
Total capacity = S * E * B
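As a sketch (my own illustration; the s, b values below are the Part (b) parameters, the address is arbitrary), here is how a 64-bit address splits into tag, set index, and block offset:

    #include <stdio.h>

    int main(void) {
        int s = 5, b = 5;              /* example values from Part (b) */
        unsigned long addr = 0x602340; /* arbitrary example address */

        unsigned long offset = addr & ((1UL << b) - 1);
        unsigned long set    = (addr >> b) & ((1UL << s) - 1);
        unsigned long tag    = addr >> (s + b);

        printf("tag=%lx set=%lu offset=%lu\n", tag, set, offset);
        return 0;
    }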

Visual Cache Terminology
[Figure: a cache with S = 2^s sets and E lines per set. The address of a word splits into a tag (t bits), a set index (s bits), and a block offset (b bits). Each line holds a valid bit, a tag, and a cache block of B = 2^b bytes (the data); the requested data begins at the block offset.]

General Cache Concepts
A smaller, faster, more expensive memory (the cache) holds a subset of the blocks of a larger, slower, cheaper memory, which is viewed as partitioned into "blocks". Data is copied between the two in block-sized transfer units.
Hit: it's in the cache. Miss: the tags don't match.
[Figure: memory blocks 0–15, with blocks 4, 8, 9, 10, and 14 currently cached.]

General Cache Concepts: Miss
Request: 12. The data in block 12 is needed, but block 12 is not in the cache: miss!
Block 12 is fetched from memory and stored in the cache.
Placement policy: determines where the block goes.
Replacement policy: determines which block gets evicted (the victim).

General Caching Concepts: Types of Cache Misses
Cold (compulsory) miss: the first access to a block has to be a miss.
Conflict miss: occurs when the level-k cache is large enough, but multiple data objects all map to the same level-k block. E.g., referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.
Capacity miss: occurs when the set of active cache blocks (the working set) is larger than the cache.

Cache Lab
Part (a): Building a cache simulator
Part (b): Optimizing matrix transpose

Part (a): Cache Simulator
A cache simulator is NOT a cache!
- Memory contents are NOT stored
- Block offsets are NOT used – the b bits in your address don't matter
- Simply count hits, misses, and evictions
Your cache simulator needs to work for different s, b, E, given at run time.
Use the LRU (Least Recently Used) replacement policy: evict the least recently used block from the cache to make room for the next block. Queues? Timestamps?

Part (a): Hints
A cache is just a 2D array of cache lines:
    struct cache_line cache[S][E];
S = 2^s is the number of sets; E is the associativity.
Each cache_line has:
- Valid bit
- Tag
- LRU counter (only if you are not using a queue)
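As one possible starting point, here is a minimal sketch of such a line structure with a timestamp-based LRU lookup. The names (cache_line_t, access_set, now) are my own choices, not required by the handout:

    typedef struct cache_line {
        int valid;
        unsigned long tag;
        unsigned long lru;    /* timestamp of the most recent access */
    } cache_line_t;

    static unsigned long now = 0;   /* global access counter */

    /* Simulate one access to a set of E lines, updating the counts. */
    void access_set(cache_line_t *set, int E, unsigned long tag,
                    int *hits, int *misses, int *evictions) {
        /* 1. Search for a hit. */
        for (int i = 0; i < E; i++) {
            if (set[i].valid && set[i].tag == tag) {
                set[i].lru = ++now;   /* refresh the timestamp */
                (*hits)++;
                return;
            }
        }
        (*misses)++;
        /* 2. Miss: pick an empty line if any, else the least
         * recently used line. */
        int victim = 0;
        for (int i = 0; i < E; i++) {
            if (!set[i].valid) { victim = i; break; }
            if (set[i].lru < set[victim].lru) victim = i;
        }
        if (set[victim].valid) (*evictions)++;
        set[victim].valid = 1;
        set[victim].tag = tag;
        set[victim].lru = ++now;
    }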

Part (a): getopt
getopt() automates parsing options on the Unix command line.
It is typically called in a loop to retrieve arguments, and its return value is stored in a local variable.
When getopt() returns -1, there are no more options.
To use getopt, your program must include the header file unistd.h. If you are not running on the shark machines, you will also need #include <getopt.h>.
Better advice: run on the shark machines!

Part (a): getopt
A switch statement is used on the local variable holding the return value from getopt().
Each command-line option case can be handled separately.
"optarg" is an important variable – it points to the value of the option's argument.
Think about how to handle invalid inputs.
For more information, see man 3 getopt or http://www.gnu.org/software/libc/manual/html_node/Getopt.html

Part (a): getopt Example

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        int opt, x = 0, y = 0;
        /* loop over the arguments */
        while (-1 != (opt = getopt(argc, argv, "x:y:"))) {
            /* determine which option is being processed */
            switch (opt) {
            case 'x':
                x = atoi(optarg);
                break;
            case 'y':
                y = atoi(optarg);
                break;
            default:
                printf("wrong argument\n");
                break;
            }
        }
        return 0;
    }

Suppose the program executable was called "foo". Then we would call "./foo -x 1 -y 3" to pass the value 1 to variable x and 3 to y.

Part (a): fscanf
The fscanf() function is just like scanf(), except you can specify a stream to read from (scanf always reads from stdin).
Parameters:
- A stream pointer
- A format string with information on how to parse the file
- The rest are pointers to variables that store the parsed data
You typically want to call this function in a loop. It returns EOF (-1) when it hits end-of-file or if the data doesn't match the format string.
For more information: man fscanf or http://crasseux.com/books/ctutorial/fscanf.html
fscanf will be useful for reading lines from the trace files, e.g.:
L 10,1
M 20,1

Part (a): fscanf Example

    FILE *pFile;                         /* pointer to FILE object */
    pFile = fopen("tracefile.txt", "r"); /* open file for reading */

    char identifier;
    unsigned address;
    int size;

    /* Reading lines like " M 20,1" or "L 19,3" */
    while (fscanf(pFile, " %c %x,%d", &identifier, &address, &size) > 0) {
        /* Do stuff */
    }

    fclose(pFile); /* remember to close the file when done */

Part (a): malloc/free
Use malloc to allocate memory on the heap.
Always free what you malloc; otherwise you may get a memory leak.

    int *some_pointer_you_malloced = malloc(sizeof(int));
    free(some_pointer_you_malloced);

Don't free memory you didn't allocate.

Part (b): Efficient Matrix Transpose
Matrix transpose (A -> B): how do we optimize this operation using the cache?

    Matrix A:         Matrix B:
     1  2  3  4        1  5  9 13
     5  6  7  8        2  6 10 14
     9 10 11 12        3  7 11 15
    13 14 15 16        4  8 12 16
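For reference, a straightforward (non-blocked) transpose looks like the sketch below; this naive version is my own illustration, not the handout's solution, and its access pattern is what the next slide traces:

    /* Naive transpose: B[j][i] = A[i][j]. A is read row by row (good
     * spatial locality), but B is written column by column (poor
     * spatial locality), so B's accesses miss often. */
    void trans_naive(int N, int A[N][N], int B[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                B[j][i] = A[i][j];
    }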

Part (b): Efficient Matrix Transpose
Suppose the block size is 8 bytes:
- Access A[0][0]: cache miss
- Access B[0][0]: cache miss
- Access A[0][1]: cache hit
- Access B[1][0]: cache miss
Should we handle elements 3 & 4 next, or 5 & 6?

Part (b): Blocking
Blocking: divide the matrix into sub-matrices.
The size of each sub-matrix depends on the cache block size, the cache size, and the input matrix size.
Try different sub-matrix sizes.
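A minimal sketch of a blocked transpose, assuming a square N x N matrix and a tunable block size BSIZE (both names are mine; the handout's 61x67 matrix is not square, so treat this only as a starting point):

    #define BSIZE 8   /* tune this: try different sub-matrix sizes */

    /* Blocked transpose: process BSIZE x BSIZE sub-matrices so that
     * the rows of A and the columns of B touched at any one time fit
     * in the cache together. */
    void trans_blocked(int N, int A[N][N], int B[N][N]) {
        for (int ii = 0; ii < N; ii += BSIZE)
            for (int jj = 0; jj < N; jj += BSIZE)
                for (int i = ii; i < ii + BSIZE && i < N; i++)
                    for (int j = jj; j < jj + BSIZE && j < N; j++)
                        B[j][i] = A[i][j];
    }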

Example: Matrix Multiplication

    c = (double *) calloc(n*n, sizeof(double));

    /* Multiply n x n matrices a and b */
    void mmm(double *a, double *b, double *c, int n) {
        int i, j, k;
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                for (k = 0; k < n; k++)
                    c[i*n + j] += a[i*n + k] * b[k*n + j];
    }

[Figure: c = a * b; row i of a times column j of b produces element (i, j) of c.]

Cache Miss Analysis
Assume:
- Matrix elements are doubles
- Cache block = 8 doubles
- Cache size C << n (much smaller than n)
First iteration:
- n/8 + n = 9n/8 misses (one miss per 8-double block of a's row, plus one miss per element of b's column)
[Figure: afterwards the cache holds, schematically, the row of a and an 8-wide strip of b.]

Cache Miss Analysis
Assume the same as above.
Second iteration:
- Again n/8 + n = 9n/8 misses
Total misses: 9n/8 * n^2 = (9/8) * n^3

Blocked Matrix Multiplication

    c = (double *) calloc(n*n, sizeof(double));

    /* Multiply n x n matrices a and b; B is the block size
     * (assume n is a multiple of B) */
    void mmm(double *a, double *b, double *c, int n) {
        int i, j, k, i1, j1, k1;
        for (i = 0; i < n; i += B)
            for (j = 0; j < n; j += B)
                for (k = 0; k < n; k += B)
                    /* B x B mini matrix multiplications */
                    for (i1 = i; i1 < i+B; i1++)
                        for (j1 = j; j1 < j+B; j1++)
                            for (k1 = k; k1 < k+B; k1++)
                                c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
    }

[Figure: c = c + a * b, computed one B x B block at a time.]

Cache Miss Analysis
Assume:
- Cache block = 8 doubles
- Cache size C << n (much smaller than n)
- Three blocks fit into the cache: 3B^2 < C
First (block) iteration:
- B^2/8 misses for each block
- 2n/B blocks * B^2/8 = nB/4 misses (omitting matrix c)
[Figure: afterwards the cache holds, schematically, a row of B x B blocks of a and a column of B x B blocks of b; n/B blocks per row/column.]

Cache Miss Analysis
Second (block) iteration:
- Same as the first iteration: 2n/B * B^2/8 = nB/4 misses
Total misses: nB/4 * (n/B)^2 = n^3/(4B)

Part (b): Blocking Summary
No blocking: (9/8) * n^3 misses
Blocking: 1/(4B) * n^3 misses
Use the largest possible block size B, but respect the limit 3B^2 < C! (For example, with B = 8, blocking cuts misses from 9n^3/8 to n^3/32, a 36x improvement.)
Reason for the dramatic difference: matrix multiplication has inherent temporal locality.
- Input data: 3n^2 elements; computation: 2n^3 operations
- Every array element is used O(n) times!
But the program has to be written properly to exploit it.
For a detailed discussion of blocking: http://csapp.cs.cmu.edu/public/waside.html

Part (b): Specs
Cache:
- You get 1 kilobyte of cache
- Direct-mapped (E = 1)
- Block size is 32 bytes (b = 5)
- There are 32 sets (s = 5)
Test matrices:
- 32 by 32
- 64 by 64
- 61 by 67

Part (b)
Things you'll need to know:
- Warnings are errors
- Header files
- Eviction policies in the cache

Warnings are Errors
Strict compilation flags: add "-Werror" to your compilation flags.
Reasons:
- Avoid potential errors that are hard to debug
- Learn good habits from the beginning
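For example, a command line along these lines (hypothetical; the lab's Makefile sets its own flags and file names) turns all warnings into errors:

    gcc -Wall -Wextra -Werror -std=c99 -O1 -o csim csim.c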

Missing Header Files
Remember to include the header files for the functions you use.
If a function declaration is missing:
- Find the corresponding header file
- Use: man <function-name>
Live example: man 3 getopt

Eviction Policies of the Cache
Caches are memory-aligned. Matrices A and B are stored in memory at addresses such that both first elements align to the same place in the cache! So the first row of matrix A evicts the first row of matrix B, and diagonal elements evict each other.
Matrices are stored in memory in row-major order. If the entire matrix can't fit in the cache, then once the cache is full, the next elements will evict existing elements.
Example: a 4x4 matrix of integers and a 32-byte cache. The third row will evict the first row!
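To see why, here is the arithmetic as a small sketch (my own illustration): each row of a 4x4 int matrix is 16 bytes, so in a 32-byte direct-mapped cache the third row (bytes 32..47) lands on the same slots as the first row (bytes 0..15), since the slot is the address modulo the cache size:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        int m[4][4];       /* 4x4 ints = 64 bytes; each row is 16 bytes */
        uintptr_t base = (uintptr_t) &m[0][0];
        int cache_size = 32;   /* tiny direct-mapped cache from the example */

        for (int i = 0; i < 4; i++) {
            uintptr_t off = (uintptr_t) &m[i][0] - base;
            /* in a direct-mapped cache the slot is (address mod cache size) */
            printf("row %d starts at byte %lu -> cache byte %lu\n",
                   i, (unsigned long) off,
                   (unsigned long) (off % cache_size));
        }
        return 0;
    }

Rows 0 and 2 both map to cache byte 0, and rows 1 and 3 both map to cache byte 16, so they evict each other.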

Style
Read the style guideline.
"But I already read it!" Good, read it again.
Start forming good habits now!

Questions?