Memory System Performance
October 29, 1998
class20.ppt

Topics
Impact of cache parameters
Impact of memory reference patterns
–matrix multiply
–transpose
–memory mountain range

Basic Cache Organization
Cache: C = S x E x B bytes, organized as S = 2^s sets with E blocks per set.
Cache block (cache line): B = 2^b bytes of data, plus a valid bit (1 bit) and a tag (t bits).
Address space: N = 2^n bytes.
Address (n = t + s + b bits): tag (t bits), set index (s bits), block offset (b bits).
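To make the address partitioning concrete, here is a minimal sketch in C that splits an address into tag, set index, and block offset. The particular values of s and b (and the example address) are assumptions of this sketch, not taken from the slide.

#include <stdio.h>

/* Example geometry (assumed): 32-byte blocks (b = 5), 128 sets (s = 7). */
#define B_BITS 5
#define S_BITS 7

int main(void)
{
    unsigned long addr   = 0x80018;                                    /* example address */
    unsigned long offset = addr & ((1UL << B_BITS) - 1);               /* low b bits      */
    unsigned long set    = (addr >> B_BITS) & ((1UL << S_BITS) - 1);   /* next s bits     */
    unsigned long tag    = addr >> (B_BITS + S_BITS);                  /* remaining bits  */

    printf("addr 0x%lx -> tag 0x%lx, set %lu, offset %lu\n",
           addr, tag, set, offset);
    return 0;
}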

Multi-Level Caches
regs: 200 B, 5 ns, 4 B block size
L1 Icache / L1 Dcache: 8 KB, 5 ns, 16 B block size
L2 Icache / L2 Dcache (SRAM): 1 M, 6 ns, $200/MB, 32 B block size
Memory (DRAM): 128 MB, 70 ns, $1.50/MB, 4 KB block size
disk: 10 GB, 10 ms, $0.06/MB

Moving away from the processor: larger, slower, cheaper; larger block size, higher associativity, more likely to write back.
Can have separate Icache and Dcache, or a unified Icache/Dcache.
[Figure: processor with registers, TLB, and split L1 caches, backed by split L2 caches, DRAM memory, and disk]

Cache Performance Metrics

Miss Rate
Fraction of memory references not found in cache (misses / references)
Typical numbers: 5-10% for L1, 1-2% for L2

Hit Time
Time to deliver a block in the cache to the processor (includes time to determine whether the block is in the cache)
Typical numbers: 1 clock cycle for L1, 3-8 clock cycles for L2

Miss Penalty
Additional time required because of a miss
–Typically tens to hundreds of cycles for main memory
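These metrics combine into an average memory access time. Below is a minimal sketch using the typical L1/L2 figures above; the 100-cycle main-memory penalty, the exact rates picked from the quoted ranges, and the treatment of the L2 miss rate as a local rate (misses per L2 access) are assumptions of this example.

#include <stdio.h>

int main(void)
{
    double l1_hit_time  = 1.0;    /* cycles, from the slide's typical numbers      */
    double l1_miss_rate = 0.08;   /* within the 5-10% range quoted for L1          */
    double l2_hit_time  = 5.0;    /* within the 3-8 cycle range quoted for L2      */
    double l2_miss_rate = 0.015;  /* within the 1-2% range, taken as a local rate  */
    double mem_penalty  = 100.0;  /* assumed main-memory miss penalty              */

    /* Average memory access time: each level's misses pay for the next level. */
    double amat = l1_hit_time
                + l1_miss_rate * (l2_hit_time + l2_miss_rate * mem_penalty);

    printf("average memory access time = %.2f cycles\n", amat);
    return 0;
}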

Impact of Cache and Block Size

Cache Size
Effect on miss rate
–Larger is better
Effect on hit time
–Smaller is faster

Block Size
Effect on miss rate
–Big blocks help exploit spatial locality
–For a given cache size, can hold fewer big blocks than little ones, though
Effect on miss penalty
–Longer transfer time

Impact of Associativity
Direct-mapped, set associative, or fully associative?

Total Cache Size (tags + data)
Higher associativity requires more tag bits, LRU state machine bits
Additional read/write logic, multiplexors

Miss Rate
Higher associativity decreases miss rate

Hit Time
Higher associativity increases hit time
–Direct mapped allows test and data transfer at the same time for read hits

Miss Penalty
Higher associativity requires additional delays to select victim

Impact of Write Strategy
Write-through or write-back?

Advantages of Write-Through
Read misses are cheaper. Why?
Simpler to implement
Requires a write buffer to pipeline writes

Advantages of Write-Back
Reduced traffic to memory
–Especially if bus used to connect multiple processors or I/O devices
Individual writes performed at the processor rate

Qualitative Cache Performance Model

Compulsory Misses
First access to a line not in the cache
Also called "cold start" misses

Capacity Misses
Active portion of memory exceeds cache size

Conflict Misses
Active portion of address space fits in cache, but too many lines map to the same cache entry
Direct-mapped and set-associative placement only
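Conflict misses are easy to trigger by accident. The following is a minimal sketch of such a scenario, assuming a direct-mapped cache whose size evenly divides the 1 MB spacing between the two arrays; the array sizes, their contiguous placement, and the cache geometry are assumptions of this example, not from the slide.

#include <stdio.h>

#define N (1 << 17)          /* 2^17 doubles = 1 MB per array */

static double x[N], y[N];    /* likely laid out 1 MB apart in memory */

int main(void)
{
    int i;
    double sum = 0.0;
    /* If x[i] and y[i] map to the same set of a direct-mapped cache,
     * each access evicts the block the other just brought in, so the
     * loop misses on nearly every reference even though only two
     * blocks are live at a time: a conflict problem, not a capacity one. */
    for (i = 0; i < N; i++)
        sum += x[i] * y[i];
    printf("sum = %f\n", sum);
    return 0;
}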

Miss Rate Analysis
Assume
Block size = 32 B (big enough for 4 64-bit words)
n is very large
–Approximate 1/n as 0.0
Cache not even big enough to hold multiple rows

Analysis Method
Look at the access pattern of the inner loop
[Figure: arrays A (row i, column k), B (row k, column j), and C (row i, column j)]

Interactions Between Program & Cache

Major Cache Effects to Consider
Total cache size
–Try to keep heavily used data in highest level cache
Block size (sometimes referred to as "line size")
–Exploit spatial locality

Example Application
Multiply n x n matrices
O(n^3) total operations
Accesses
–n reads per source element
–n values summed per destination
»But may be able to hold in register

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Variable sum held in register

Matmult Performance (Sparc20)
As matrices grow in size, they exceed cache capacity
Different loop orderings give different performance
–Cache effects
–Whether or not the sum can accumulate in a register

Layout of Arrays in Memory

C Arrays Allocated in Row-Major Order
Each row in contiguous memory locations

Stepping Through Columns in One Row
for (i = 0; i < n; i++)
    sum += a[0][i];
Accesses successive elements
For block size > 8, get spatial locality
–Cold Start Miss Rate = 8/B

Stepping Through Rows in One Column
for (i = 0; i < n; i++)
    sum += a[i][0];
Accesses distant elements
No spatial locality
–Cold Start Miss Rate = 1

Memory Layout (8-byte elements, 256 columns per row)
a[0][0] 0x80000, a[0][1] 0x80008, a[0][2] 0x80010, a[0][3] 0x80018, ..., a[0][255] 0x807F8
a[1][0] 0x80800, a[1][1] 0x80808, a[1][2] 0x80810, a[1][3] 0x80818, ..., a[1][255] 0x80FF8
...
a[255][255] 0xFFFF8
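The two access patterns above can be compared directly with a small program. A minimal sketch follows; the matrix dimension and the use of clock() for timing are choices of this example, not from the slide. Printing the sums keeps an optimizing compiler from discarding the loops.

#include <stdio.h>
#include <time.h>

#define N 2048

static double a[N][N];

/* Row-wise: steps through each row, touching successive memory locations. */
static double sum_rows(void)
{
    int i, j;
    double sum = 0.0;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-wise: steps down each column, jumping N*8 bytes between accesses. */
static double sum_cols(void)
{
    int i, j;
    double sum = 0.0;
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

int main(void)
{
    clock_t t0, t1, t2;
    double s1, s2;

    t0 = clock();
    s1 = sum_rows();
    t1 = clock();
    s2 = sum_cols();
    t2 = clock();
    printf("row-wise:    %.3f s (sum=%g)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, s1);
    printf("column-wise: %.3f s (sum=%g)\n", (double)(t2 - t1) / CLOCKS_PER_SEC, s2);
    return 0;
}

On a fast machine the row-wise pass may finish below clock()'s resolution; a larger N or several repetitions makes the gap visible.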

Matrix multiplication (ijk)

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop (over k):
A: row (i,*), accessed row-wise
B: column (*,j), accessed column-wise
C: element (i,j), fixed
[Figure: approximate per-iteration miss rates for a, b, c]

Matrix multiplication (jik)

/* jik */
for (j=0; j<n; j++) {
  for (i=0; i<n; i++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop (over k):
A: row (i,*), accessed row-wise
B: column (*,j), accessed column-wise
C: element (i,j), fixed
[Figure: approximate per-iteration miss rates for a, b, c]

Matrix multiplication (kij)

/* kij */
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

Inner loop (over j):
A: element (i,k), fixed
B: row (k,*), accessed row-wise
C: row (i,*), accessed row-wise
[Figure: approximate per-iteration miss rates for a, b, c]

Matrix multiplication (ikj)

/* ikj */
for (i=0; i<n; i++) {
  for (k=0; k<n; k++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

Inner loop (over j):
A: element (i,k), fixed
B: row (k,*), accessed row-wise
C: row (i,*), accessed row-wise
[Figure: approximate per-iteration miss rates for a, b, c]

Matrix multiplication (jki)

/* jki */
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}

Inner loop (over i):
A: column (*,k), accessed column-wise
B: element (k,j), fixed
C: column (*,j), accessed column-wise
[Figure: approximate per-iteration miss rates for a, b, c]

Matrix multiplication (kji)

/* kji */
for (k=0; k<n; k++) {
  for (j=0; j<n; j++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}

Inner loop (over i):
A: column (*,k), accessed column-wise
B: element (k,j), fixed
C: column (*,j), accessed column-wise
[Figure: approximate per-iteration miss rates for a, b, c]

Summary of Matrix Multiplication
(L = loads, S = stores, and MR = approximate cache misses, each per iteration of the innermost loop.)

ijk (L=2, S=0, MR=1.25):
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

jik (L=2, S=0, MR=1.25):
for (j=0; j<n; j++) {
  for (i=0; i<n; i++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

kij (L=2, S=1, MR=0.5):
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

ikj (L=2, S=1, MR=0.5):
for (i=0; i<n; i++) {
  for (k=0; k<n; k++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

jki (L=2, S=1, MR=2.0):
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}

kji (L=2, S=1, MR=2.0):
for (k=0; k<n; k++) {
  for (j=0; j<n; j++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}
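One way to check this table is to time two of the orderings on the same data. Here is a minimal sketch; the matrix size, the initialization values, and the clock()-based timing are assumptions of this example, not taken from the slides.

#include <stdio.h>
#include <time.h>

#define N 512

static double a[N][N], b[N][N], c[N][N];

/* ijk: inner loop walks a row of a and a column of b (poor locality for b). */
static void matmul_ijk(void)
{
    int i, j, k;
    double sum;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            sum = 0.0;
            for (k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

/* kij: inner loop walks rows of b and c (good spatial locality for both). */
static void matmul_kij(void)
{
    int i, j, k;
    double r;
    for (k = 0; k < N; k++)
        for (i = 0; i < N; i++) {
            r = a[i][k];
            for (j = 0; j < N; j++)
                c[i][j] += r * b[k][j];
        }
}

static void clear_c(void)
{
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            c[i][j] = 0.0;
}

static double time_it(void (*f)(void))
{
    clock_t t0 = clock();
    f();
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            a[i][j] = 1.0;
            b[i][j] = 2.0;
        }
    clear_c();
    printf("ijk: %.3f s\n", time_it(matmul_ijk));
    clear_c();                 /* kij accumulates into c, so reset it first */
    printf("kij: %.3f s\n", time_it(matmul_kij));
    return 0;
}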

Matmult performance (DEC5000)
[Figure: performance vs. matrix size; the curves group as kij/ikj (L=2, S=1, MR=0.5), ijk/jik (L=2, S=0, MR=1.25), and jki/kji (L=2, S=1, MR=2.0)]

Matmult Performance (Sparc20)
[Figure: performance vs. matrix size for the same three groups (MR=0.5, 1.25, 2.0); annotation: "Multiple columns of B fit in cache?"]

Matmult Performance (Alpha 21164)
[Figure: performance vs. matrix size for the same three groups (MR=0.5, 1.25, 2.0); annotations mark where the matrices become too big for the L1 cache and too big for the L2 cache]

Block Matrix Multiplication
Example: n = 8, B = 4 (block size). Partition each matrix into four B x B sub-blocks:

  [A11 A12]   [B11 B12]   [C11 C12]
  [A21 A22] X [B21 B22] = [C21 C22]

C11 = A11 B11 + A12 B21
C12 = A11 B12 + A12 B22
C21 = A21 B11 + A22 B21
C22 = A21 B12 + A22 B22

Key idea: sub-blocks (i.e., Aij) can be treated just like scalars.

Blocked Matrix Multiply (bijk)

for (jj=0; jj<n; jj+=bsize) {
  for (i=0; i<n; i++)
    for (j=jj; j < min(jj+bsize,n); j++)
      c[i][j] = 0.0;
  for (kk=0; kk<n; kk+=bsize) {
    for (i=0; i<n; i++) {
      for (j=jj; j < min(jj+bsize,n); j++) {
        sum = 0.0;
        for (k=kk; k < min(kk+bsize,n); k++) {
          sum += a[i][k] * b[k][j];
        }
        c[i][j] += sum;
      }
    }
  }
}
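A side note: min() is not a standard C library function, so the code above assumes some definition of it. One common choice, given here as an assumption rather than something shown on the slide, is a macro:

#define min(a, b) ((a) < (b) ? (a) : (b))   /* note: evaluates its arguments twice */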

Blocked Matrix Multiply Analysis

Innermost loop pair multiplies a 1 x bsize sliver of A by a bsize x bsize block of B and accumulates the result into a 1 x bsize sliver of C
The loop over i steps through n row slivers of A & C, using the same block of B

Innermost loop pair:
for (i=0; i<n; i++) {
  for (j=jj; j < min(jj+bsize,n); j++) {
    sum = 0.0;
    for (k=kk; k < min(kk+bsize,n); k++) {
      sum += a[i][k] * b[k][j];
    }
    c[i][j] += sum;
  }
}

[Figure: the block of B is reused n times in succession; each row sliver of A is accessed bsize times; successive elements of the C sliver are updated]

Blocked matmult perf (DEC5000)
[Figure: blocked matrix multiply performance]

Blocked matmult perf (Sparc20)
[Figure: blocked matrix multiply performance]

Blocked matmult perf (Alpha 21164)
[Figure: blocked matrix multiply performance]

Matrix transpose

Row-wise transpose:
for (i=0; i < N; i++)
  for (j=0; j < M; j++)
    dst[j][i] = src[i][j];

Column-wise transpose:
for (j=0; j < M; j++)
  for (i=0; i < N; i++)
    dst[j][i] = src[i][j];

[Figure: src has N rows and M cols; dst, its transpose, has M rows and N cols]
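Throughput numbers like those on the next few slides can be produced by timing both loop orders over the same matrices. A minimal sketch follows; the matrix dimensions, the repetition count, the use of clock(), and the convention of counting only the bytes of src are assumptions of this example.

#include <stdio.h>
#include <time.h>

#define N 1024
#define M 1024
#define REPS 10

static double src[N][M], dst[M][N];

int main(void)
{
    int i, j, r;
    clock_t t0;
    double secs, mbytes = (double)REPS * N * M * sizeof(double) / 1e6;

    t0 = clock();
    for (r = 0; r < REPS; r++)                /* row-wise transpose */
        for (i = 0; i < N; i++)
            for (j = 0; j < M; j++)
                dst[j][i] = src[i][j];
    secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("row-wise:    %.1f MB/s\n", mbytes / secs);

    t0 = clock();
    for (r = 0; r < REPS; r++)                /* column-wise transpose */
        for (j = 0; j < M; j++)
            for (i = 0; i < N; i++)
                dst[j][i] = src[i][j];
    secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("column-wise: %.1f MB/s\n", mbytes / secs);
    return 0;
}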

[Figure: transpose performance, 11 MB/s]

[Figure: transpose performance, 14 MB/s]

[Figure: transpose performance, 45 MB/s]

The Memory Mountain Range
DEC Alpha 8400 (21164), 300 MHz
8 KB (L1), 96 KB (L2), 4 MB (L3)
[Figure: the memory mountain: read throughput as a function of working-set size and stride]
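Each point of such a mountain comes from reading an array of a given size with a given stride and reporting the throughput. The following is a minimal sketch in that spirit; the maximum array size, the size/stride ranges, the MB/s convention (counting only bytes actually read), and the clock()-based timing are assumptions of this example, not taken from the slide.

#include <stdio.h>
#include <time.h>

#define MAXELEMS (1 << 23)    /* 8M ints = 32 MB maximum working set */

static int data[MAXELEMS];
static volatile int sink;     /* keeps the read loop from being optimized away */

/* Read elems elements of data[] with the given stride. */
static void run(int elems, int stride)
{
    int i, result = 0;
    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result;
}

/* Measure read throughput in MB/s for one (size, stride) point. */
static double throughput(int size_bytes, int stride)
{
    int elems = size_bytes / (int)sizeof(int);
    long reps = 0;
    clock_t t0;
    double secs;

    run(elems, stride);                      /* warm up the caches */
    t0 = clock();
    do {                                     /* measure for at least ~0.2 s */
        run(elems, stride);
        reps++;
    } while (clock() - t0 < CLOCKS_PER_SEC / 5);
    secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    return reps * (size_bytes / (double)stride) / (secs * 1e6);
}

int main(void)
{
    long size;
    int stride;
    for (size = 1L << 14; size <= (long)MAXELEMS * (long)sizeof(int); size <<= 2)
        for (stride = 1; stride <= 16; stride *= 2)
            printf("size=%8ld stride=%2d: %8.1f MB/s\n",
                   size, stride, throughput((int)size, stride));
    return 0;
}

Plotting throughput against size and stride yields the ridges and cliffs described on the next slide.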

Effects Seen in Mountain Range

Cache Capacity
See sudden drops as the working set size increases

Cache Block Effects
Performance degrades as the stride increases
–Less spatial locality
Levels off
–When it reaches a single access per block