Computer Systems – Cache characteristics
Arnoud Visser, University of Amsterdam

Slide 1: Cache characteristics

Slide 2: How do you know?
(ow133) cpuid
This system has a Genuine Intel(R) Pentium(R) 4 processor
Processor family: F, extended family: 0, model: 2, stepping: 7
Pentium 4 core C1 (0.13 micron): core speed 2 GHz (bus speed 400/533 MHz)
Instruction TLB: 4K, 2M or 4M pages, fully associative, 128 entries
Data TLB: 4K or 4M pages, fully associative, 64 entries
1st-level data cache: 8K bytes, 4-way set associative, sectored cache, 64-byte line size
No 2nd-level cache or, if the processor contains a valid 2nd-level cache, no 3rd-level cache
Trace cache: 12K uops, 8-way set associative
2nd-level cache: 512K bytes, 8-way set associative, sectored cache, 64-byte line size

Slide 3: The CPUID instruction
asm("push %ebx; movl $2,%eax; cpuid; movl %eax,cache_eax; movl %ebx,cache_ebx; movl %edx,cache_edx; movl %ecx,cache_ecx; pop %ebx");

Slide 4: A limited number of answers

Slide 5: Direct-Mapped Cache
The simplest kind of cache: exactly one line per set (E = 1).
Each of the sets 0 … S-1 holds a valid bit, a tag, and a cache block.

Slide 6: Fully Associative Cache
One and only one set, containing E = C/B lines; any block can go in any line.
Difficult and expensive to build one that is both large and fast, because every tag must be compared on each access.

Slide 7: E-way Set Associative Caches
Characterized by more than one line per set (e.g., E = 2); each of the sets 0 … S-1 holds E (valid, tag, cache block) lines.

Slide 8: Accessing Set Associative Caches
Set selection: use the set index bits to determine the set of interest.
An m-bit address is split into tag (t bits), set index (s bits) and block offset (b bits), with s = log2(S) and S = C / (B x E).

Slide 9: Accessing Set Associative Caches (continued)
Line matching and word selection: the tag in each valid line in the selected set must be compared.
(1) The valid bit must be set.
(2) The tag bits in one of the cache lines must match the tag bits in the address.
(3) If (1) and (2) hold, it is a cache hit, and the block offset selects the starting byte; b = log2(B).

Slide 10: Observations
A contiguous region of addresses forms a block.
Multiple address blocks share the same set index.
Blocks that map to the same set are uniquely identified by their tag.

Slide 11: Caching in a Memory Hierarchy
The larger, slower, cheaper storage device at level k+1 is partitioned into blocks; data is copied between levels in block-sized transfer units.
The smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1; blocks with the same set index compete for the same place at level k.

Slide 12: General Caching Concepts
The program needs object d, which is stored in some block b.
Cache hit: the program finds b in the cache at level k (e.g., block 14).
Cache miss: b is not at level k, so the level k cache must fetch it from level k+1 (e.g., block 12). If the level k cache is full, some current block must be replaced (evicted) – which one is the "victim"?
Conflict miss: the cache is large enough, but the requested blocks map to the same set and keep evicting each other (e.g., blocks 4 and 12 both map to set 0 of a 4-set cache).

Slide 13: Intel Processors Cache SRAM

Processor (year): L1 instruction | L1 data | on-die L2
Pentium II (1997): 4-way, 32B | 4-way, 32B | –
Celeron A (1998): 4-way, 32B | 4-way, 32B | 4-way, 32B
Pentium III Coppermine (2000): 4-way, 32B | 4-way, 32B | 4-way, 32B
Pentium 4 Willamette: 8-way trace cache | 4-way, 64B | 8-way, 64B
Pentium 4 Northwood (2002): 8-way trace cache | 4-way, 64B | 8-way, 64B

Slide 14: Intel Processors Cache SRAM
See ftp://download.intel.com/design/Pentium4/manuals/ pdf (section 2.4)

Slide 15: Impact
Associativity: more lines per set reduce vulnerability to thrashing, but require more tag bits and control logic, which increases the hit time (1-2 cycles for L1).
Block size: larger blocks exploit spatial locality (not temporal locality) to increase the hit rate, but larger blocks also increase the miss penalty.

Slide 16: Optimum Block Size (figure)

Slide 17: Absolute Miss Rate (figure)

Slide 18: Relative Miss Rate (figure)

Slide 19: Matrix Multiplication Example
Major cache effects to consider:
– Total cache size: exploit temporal locality and keep the working set small (e.g., by using blocking).
– Block size: exploit spatial locality.
Description: multiply N x N matrices; O(N^3) total operations; each source element is read N times; N values are summed per destination element.
Assumption: N is so large that the cache is not big enough to hold multiple rows.

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

The variable sum is held in a register.

Slide 20: Matrix Multiplication (ijk)

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop: A(i,*) row-wise, B(*,j) column-wise, C(i,j) fixed.
Misses per inner-loop iteration: A = 0.25, B = 1.00, C = 0.00 (total 1.25).

Slide 21: Matrix Multiplication (jik)

/* jik */
for (j=0; j<n; j++) {
  for (i=0; i<n; i++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop: A(i,*) row-wise, B(*,j) column-wise, C(i,j) fixed.
Misses per inner-loop iteration: A = 0.25, B = 1.00, C = 0.00 (total 1.25).

Slide 22: Matrix Multiplication (kij)

/* kij */
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

Inner loop: A(i,k) fixed, B(k,*) row-wise, C(i,*) row-wise.
Misses per inner-loop iteration: A = 0.00, B = 0.25, C = 0.25 (total 0.5).

Slide 23: Matrix Multiplication (ikj)

/* ikj */
for (i=0; i<n; i++) {
  for (k=0; k<n; k++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

Inner loop: A(i,k) fixed, B(k,*) row-wise, C(i,*) row-wise.
Misses per inner-loop iteration: A = 0.00, B = 0.25, C = 0.25 (total 0.5).

Slide 24: Matrix Multiplication (jki)

/* jki */
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}

Inner loop: A(*,k) column-wise, B(k,j) fixed, C(*,j) column-wise.
Misses per inner-loop iteration: A = 1.00, B = 0.00, C = 1.00 (total 2.0).

Slide 25: Matrix Multiplication (kji)

/* kji */
for (k=0; k<n; k++) {
  for (j=0; j<n; j++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}

Inner loop: A(*,k) column-wise, B(k,j) fixed, C(*,j) column-wise.
Misses per inner-loop iteration: A = 1.00, B = 0.00, C = 1.00 (total 2.0).

Slide 26: Summary of Matrix Multiplication

ijk (and jik): 2 loads, 0 stores; misses/iter = 1.25
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

kij (and ikj): 2 loads, 1 store; misses/iter = 0.5
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

jki (and kji): 2 loads, 1 store; misses/iter = 2.0
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}

Slide 27: Matrix Multiply Performance (figure)
Miss rates are helpful but not perfect predictors; code scheduling matters, too.

Slide 28: Improving Temporal Locality by Blocking
Example: blocked matrix multiplication. "Block" here does not mean cache block; it means a sub-block within the matrix. Example with N = 8 and sub-block size 4:

C11 = A11 B11 + A12 B21    C12 = A11 B12 + A12 B22
C21 = A21 B11 + A22 B21    C22 = A21 B12 + A22 B22

Key idea: sub-blocks (i.e., A_xy) can be treated just like scalars.

Slide 29: Blocked Matrix Multiply (bijk)

for (jj=0; jj<n; jj+=bsize) {
  for (i=0; i<n; i++)
    for (j=jj; j < min(jj+bsize,n); j++)
      c[i][j] = 0.0;
  for (kk=0; kk<n; kk+=bsize) {
    for (i=0; i<n; i++) {
      for (j=jj; j < min(jj+bsize,n); j++) {
        sum = 0.0;
        for (k=kk; k < min(kk+bsize,n); k++)
          sum += a[i][k] * b[k][j];
        c[i][j] += sum;
      }
    }
  }
}

Slide 30: Blocked Matrix Multiply Analysis
The innermost loop pair multiplies a 1 x bsize sliver of A by a bsize x bsize block of B and accumulates into a 1 x bsize sliver of C.
The loop over i steps through n row slivers of A and C, using the same block of B: each block of B is reused n times in succession, and each row sliver is accessed bsize times.

Slide 31: Matrix Multiply Performance (figure)
Blocking (bijk and bikj) improves performance by a factor of two over the unblocked versions, and is relatively insensitive to array size.

Slide 32: Concluding Observations
The programmer can optimize for cache performance:
– how data structures are organized;
– how data are accessed (nested loop structure, blocking as a general technique).
All systems favor "cache friendly" code. Getting the absolute optimum performance is very platform specific (cache sizes, line sizes, associativities, etc.), but you can get most of the advantage with generic code: keep the working set reasonably small (temporal locality) and use small strides (spatial locality).

Slide 33: Assignment
Homework problem 6.28: what percentage of writes will miss in the cache for the blank screen function?