Faculty of Computer Science © 2006 CMPUT 229 Cache Performance Analysis: Hitting for Performance

© 2006 Department of Computing Science CMPUT 229
Standard Matrix Multiplication

    for (i = 0; i < n; i++){
      for (j = 0; j < n; j++){
        c[i,j] = 0.0;
        for (k = 0; k < n; k++){
          c[i,j] = c[i,j] + a[i,k] * b[k,j];
        }
      }
    }

Assume that:
- Each matrix element is stored in 8 bytes;
- The data cache has 32 Kbytes and 128-byte cache lines;
- The data cache is direct-mapped;
- n = 1024, Address(a[0,0]) = $…, Address(b[0,0]) = $…, Address(c[0,0]) = $….

What is the data cache hit ratio for this program?

Written in terms of the individual memory accesses, the loop nest is:

    for (i = 0; i < n; i++){
      for (j = 0; j < n; j++){
        sum = 0.0;
        for (k = 0; k < n; k++){
          temp1 ← load(a[i,k])
          temp2 ← load(b[k,j])
          sum ← sum + temp1 * temp2
        }
        store(c[i,j]) ← sum
      }
    }
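
To make the loop nest concrete, here is a minimal runnable C sketch of the standard version (the row-major 1-D array layout, the constant N, the driver in main, and the initialization values are assumptions introduced for illustration; they are not part of the course code):

    /* Sketch: runnable C version of the standard multiplication above,
     * using row-major 1-D arrays in place of the slide's a[i,k] notation. */
    #include <stdlib.h>

    #define N 1024                      /* n in the slides */

    static void matmul_naive(const double *a, const double *b, double *c) {
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++) {
                double sum = 0.0;
                for (size_t k = 0; k < N; k++)
                    sum += a[i*N + k] * b[k*N + j];  /* a walks a row, b walks a column */
                c[i*N + j] = sum;
            }
    }

    int main(void) {
        double *a = malloc(sizeof(double) * N * N);
        double *b = malloc(sizeof(double) * N * N);
        double *c = malloc(sizeof(double) * N * N);
        if (!a || !b || !c) return 1;
        for (size_t i = 0; i < (size_t)N * N; i++) { a[i] = 1.0; b[i] = 1.0; }
        matmul_naive(a, b, c);          /* every element of c should equal N */
        free(a); free(b); free(c);
        return 0;
    }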

© 2006 Department of Computing Science CMPUT 229
Data Access Pattern

(figure: access pattern over matrices A and B; A is traversed along a row while B is traversed down a column)
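
To make the figure concrete, the following sketch (not from the slides) prints the byte stride of the two inner-loop streams under the row-major layout assumed above, showing why a touches a new cache line only every 16 accesses while every access to b lands on a different line:

    /* Sketch: strides of the inner-loop streams a[i,k] and b[k,j],
     * assuming row-major 1024x1024 doubles and 128-byte cache lines. */
    #include <stdio.h>

    int main(void) {
        const long n = 1024, elem = 8, line = 128;

        long a_stride = elem;      /* a[i,k] -> a[i,k+1]: next element in the same row */
        long b_stride = n * elem;  /* b[k,j] -> b[k+1,j]: jumps over a whole row of b  */

        printf("a row walk:    %ld bytes per access, a new line every %ld accesses\n",
               a_stride, line / a_stride);
        printf("b column walk: %ld bytes per access, i.e. %ld cache lines apart\n",
               b_stride, b_stride / line);
        return 0;
    }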

© 2006 Department of Computing Science CMPUT 229
Cache Access Analysis

Assume that:
- Each matrix element is stored in 8 bytes;
- The data cache has 32 Kbytes and 128-byte cache lines;
- The data cache is direct-mapped;
- n = 1024, Address(a[0,0]) = $…, Address(b[0,0]) = $…, Address(c[0,0]) = $….

What is the data cache hit ratio for this program?

    32 Kbyte cache / 128 bytes per line   = 256 lines in the cache
    128 bytes per line / 8 bytes per element = 16 elements per line

© 2006 Department of Computing Science CMPUT 229
Cache Data Access Pattern

If we ignore conflict misses, then:
- Every 16th access to A is a miss;
- Every access to B is a miss.

How many hits and misses will occur to compute one element of C?
(256 lines in the cache, 16 elements per line)

In A there will be 1024/16 = 64 misses and 1024 - 64 = 960 hits.
In B there will be 1024 misses.

Thus, what is the hit ratio?

    Hit ratio = # hits / # of accesses = 960 hits / 2048 accesses = 0.47 = 47%

© 2006 Department of Computing Science CMPUT 229
Address Anatomy

The data cache has 32 Kbytes and 128-byte cache lines:
- 128 bytes per line = 2^7, so the offset field is 7 bits;
- 256 lines in the cache = 2^8, so the index field is 8 bits;
- the remaining 32 - 7 - 8 = 17 bits of the address form the tag.

    | Tag: 17 bits | Index: 8 bits | Offset: 7 bits |
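
A small sketch of the field extraction implied by this layout (the address value is hypothetical; the 32-bit width and the 17/8/7 split follow the slide):

    /* Sketch: decompose a 32-bit address into tag, index and offset for a
     * 32 KB direct-mapped cache with 128-byte lines. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t addr = 0x00012345u;           /* hypothetical address */

        uint32_t offset = addr & 0x7Fu;         /* low 7 bits: byte within the line */
        uint32_t index  = (addr >> 7) & 0xFFu;  /* next 8 bits: cache line          */
        uint32_t tag    = addr >> 15;           /* remaining 17 bits                */

        printf("addr = 0x%08X  tag = 0x%05X  index = %u  offset = %u\n",
               addr, tag, index, offset);
        return 0;
    }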

© 2006 Department of Computing Science CMPUT 229
Conflict Misses

(256 lines in the cache, 16 elements per line)

    Cache Access   Outcome       Cache Access   Outcome
    A[0,0]         miss          B[0,0]         miss
    A[0,1]         miss          B[1,0]         miss
    A[0,2]         hit           B[2,0]         miss
    A[0,3]         hit           B[3,0]         miss
    A[0,4]         hit           B[4,0]         miss
    A[0,5]         hit           B[5,0]         miss
    A[0,6]         hit           B[6,0]         miss
    A[0,7]         hit           B[7,0]         miss
    A[0,8]         hit           B[8,0]         miss
    A[0,9]         miss          B[9,0]         miss

The misses on A[0,1] and A[0,9] are conflict misses: an intervening access to B evicted the line of A that was in use.

In general: a 1024-element row of A occupies 1024/16 = 64 cache lines. There will be 2 conflict misses in two of these lines, for a total of 4 conflict misses per row. Thus the accesses of A will result in 68 misses and 956 hits for each 1024 accesses. The conflict misses are not significant and can be ignored.
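
The analysis above can be checked with a tiny direct-mapped cache simulator. This is a sketch, not course code: the base addresses of a and b below are hypothetical, since the slide's concrete addresses were not preserved, but with these values the simulator reproduces 68 misses on the row of A (64 compulsory plus 4 conflict misses) and a hit ratio of roughly 47%:

    /* Sketch: replay the inner-loop stream a[0,k], b[k,0] for k = 0..1023
     * through a 32 KB direct-mapped cache with 128-byte lines. */
    #include <stdio.h>
    #include <stdint.h>

    #define LINE_BYTES 128u
    #define NUM_LINES  256u                    /* 32 KB / 128 B */

    static uint32_t tags[NUM_LINES];
    static int      valid[NUM_LINES];

    /* Returns 1 on a hit, 0 on a miss (and installs the line on a miss). */
    static int access_cache(uint32_t addr) {
        uint32_t line  = addr / LINE_BYTES;
        uint32_t index = line % NUM_LINES;
        uint32_t tag   = line / NUM_LINES;
        if (valid[index] && tags[index] == tag) return 1;
        valid[index] = 1;
        tags[index]  = tag;
        return 0;
    }

    int main(void) {
        const uint32_t n = 1024, elem = 8;
        const uint32_t base_a = 0x00100000u;   /* hypothetical placements */
        const uint32_t base_b = 0x00900000u;
        uint32_t hits = 0, misses = 0;

        for (uint32_t k = 0; k < n; k++) {
            uint32_t addr_a = base_a + k * elem;      /* a[0,k]: walks a row    */
            uint32_t addr_b = base_b + k * n * elem;  /* b[k,0]: walks a column */
            if (access_cache(addr_a)) hits++; else misses++;
            if (access_cache(addr_b)) hits++; else misses++;
        }
        printf("hits = %u, misses = %u, hit ratio = %.1f%%\n",
               hits, misses, 100.0 * hits / (hits + misses));
        return 0;
    }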

© 2006 Department of Computing Science CMPUT 229
Matrix Multiplication with Transpose

First transpose b into b1:

    for (i = 0; i < n; i++){
      for (j = 0; j < n; j++){
        b1[i,j] = b[j,i];
      }
    }

Then multiply using b1:

    for (i = 0; i < n; i++){
      for (j = 0; j < n; j++){
        for (k = 0; k < n; k++){
          c[i,j] = c[i,j] + a[i,k] * b1[j,k];
        }
      }
    }

Assume that:
- Each matrix element is stored in 8 bytes;
- The data cache has 32 Kbytes and 128-byte cache lines;
- The data cache is direct-mapped;
- n = 1024, Address(a[0,0]) = $…, Address(b[0,0]) = $…, Address(c[0,0]) = $….

What is the data cache hit ratio for this program?
Where in memory should we place matrix b1 to reduce conflict misses?
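
A runnable C sketch of this variant (same assumed row-major layout and driver style as the earlier sketch; the slides leave the exact placement of b1 open, which the next slide takes up):

    /* Sketch: transpose b into b1, then multiply so that both input
     * streams of the inner loop walk along rows. */
    #include <stdlib.h>

    #define N 1024

    static void matmul_transposed(const double *a, const double *b, double *c) {
        double *b1 = malloc(sizeof(double) * N * N);
        if (b1 == NULL) return;

        for (size_t i = 0; i < N; i++)             /* b1[i,j] = b[j,i] */
            for (size_t j = 0; j < N; j++)
                b1[i*N + j] = b[j*N + i];

        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++) {
                double sum = 0.0;
                for (size_t k = 0; k < N; k++)
                    sum += a[i*N + k] * b1[j*N + k];  /* both are row walks */
                c[i*N + j] = sum;
            }
        free(b1);
    }

    int main(void) {
        double *a = malloc(sizeof(double) * N * N);
        double *b = malloc(sizeof(double) * N * N);
        double *c = malloc(sizeof(double) * N * N);
        if (!a || !b || !c) return 1;
        for (size_t i = 0; i < (size_t)N * N; i++) { a[i] = 1.0; b[i] = 1.0; }
        matmul_transposed(a, b, c);
        free(a); free(b); free(c);
        return 0;
    }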

© 2006 Department of Computing Science CMPUT 229
Where to place matrix b1?

    | Tag: 17 bits | Index: 8 bits | Offset: 7 bits |

Intuitively, the index of b1[0][0] should be far away from the index of a[0][0]. The index of a[0][0] is 0, so we could aim to place b1 at an address whose index is 128.
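
One way to realize this placement is sketched below (an illustration, not the course's allocator: the function name and the over-allocation trick are introduced here). With 128-byte lines and 256 lines, an address has index 128 exactly when its offset within a 32 KB span is 128 × 128 = 16384 bytes, so the sketch over-allocates by one cache size and offsets the returned pointer accordingly:

    /* Sketch: allocate storage for b1 so that b1[0][0] maps to cache index 128. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define CACHE_BYTES 32768u     /* 32 KB direct-mapped cache        */
    #define TARGET_OFF  16384u     /* index 128 times 128-byte lines   */

    static double *alloc_at_index_128(size_t elems) {
        /* Over-allocating by CACHE_BYTES guarantees that some address in the
         * buffer has the desired residue modulo the cache size.  (In real code
         * the raw pointer would have to be remembered so it can be freed.) */
        void *raw = malloc(elems * sizeof(double) + CACHE_BYTES);
        if (raw == NULL) return NULL;
        uintptr_t p    = (uintptr_t)raw;
        uintptr_t skip = (CACHE_BYTES + TARGET_OFF - p % CACHE_BYTES) % CACHE_BYTES;
        return (double *)(p + skip);
    }

    int main(void) {
        double *b1 = alloc_at_index_128((size_t)1024 * 1024);
        if (b1 != NULL)
            printf("index of b1[0][0] = %lu\n",
                   (unsigned long)(((uintptr_t)b1 >> 7) & 0xFFu));
        return 0;
    }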

© 2006 Department of Computing Science CMPUT 229
Cache Access Pattern for the Transpose

    for (i = 0; i < n; i++){
      for (j = 0; j < n; j++){
        b1[i,j] = b[j,i];
      }
    }

If we ignore conflict misses, then:
- Every 16th access to b1 is a miss;
- Every access to b is a miss.

The transpose's inner loop yields 2048 accesses and 960 hits, and the inner loop is repeated 1024 times:

    1024 × 2048 accesses
    1024 × 960 hits

Thus, the hit ratio is:

    Hit ratio = # hits / # of accesses = 960 hits / 2048 accesses = 0.47 = 47%

© 2006 Department of Computing Science CMPUT 229
Cache Access Pattern for the Multiplication

    for (i = 0; i < n; i++){
      for (j = 0; j < n; j++){
        sum = 0.0;
        for (k = 0; k < n; k++){
          temp1 ← load(a[i,k])
          temp2 ← load(b1[j,k])
          sum ← sum + temp1 * temp2
        }
        store(c[i,j]) ← sum
      }
    }

If we ignore conflict misses, then:
- Every 16th access to a is a miss;
- Every 16th access to b1 is a miss.

Thus the inner loop yields 2048 accesses and 1920 hits. The inner loop is executed n² = 1024 × 1024 times, so the total number of accesses (ignoring accesses to c) in the multiplication is:

    1024 × 1024 × 2048 accesses
    1024 × 1024 × 1920 hits

© 2006 Department of Computing Science CMPUT 229
Hit Ratio for Multiplication with Transpose

The multiplication (ignoring accesses to c) yields:

    1024 × 1024 × 2048 accesses
    1024 × 1024 × 1920 hits

The transpose yields:

    1024 × 2048 accesses
    1024 × 960 hits

    Hit ratio = (1024 × 960 + 1024 × 1024 × 1920) / (1024 × 2048 + 1024 × 1024 × 2048)
              = (960 + 1024 × 1920) / (1025 × 2048)
              ≈ 0.937 = 93.7%

© 2006 Department of Computing Science CMPUT 229
Blocked Matrix Multiplication*

    for (i0 = 0; i0 < n; i0 = i0 + b){
      for (j0 = 0; j0 < n; j0 = j0 + b){
        for (k0 = 0; k0 < n; k0 = k0 + b){
          for (i = i0; i < min(i0+b, n); i++){
            for (j = j0; j < min(j0+b, n); j++){
              for (k = k0; k < min(k0+b, n); k++){
                c[i,j] = c[i,j] + a[i,k] * b[k,j];
              }
            }
          }
        }
      }
    }

(Here b is the block size used in the loop bounds, while b[k,j] denotes an element of matrix b.)

* Code adapted from … Assumes that all elements of matrix c were initialized to zero beforehand.
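
For completeness, a self-contained C version of the blocked loop nest (a sketch: the 1-D row-major layout, the helper min_sz, and the block-size constant BS are assumptions introduced here; the slides call the block size b):

    /* Sketch: runnable blocked multiplication; c must be zero-initialized,
     * as the slide assumes. */
    #include <stdlib.h>

    #define N  1024
    #define BS 16                               /* block size (b in the slides) */

    static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

    static void matmul_blocked(const double *a, const double *b, double *c) {
        for (size_t i0 = 0; i0 < N; i0 += BS)
            for (size_t j0 = 0; j0 < N; j0 += BS)
                for (size_t k0 = 0; k0 < N; k0 += BS)
                    for (size_t i = i0; i < min_sz(i0 + BS, N); i++)
                        for (size_t j = j0; j < min_sz(j0 + BS, N); j++)
                            for (size_t k = k0; k < min_sz(k0 + BS, N); k++)
                                c[i*N + j] += a[i*N + k] * b[k*N + j];
    }

    int main(void) {
        double *a = malloc(sizeof(double) * N * N);
        double *b = malloc(sizeof(double) * N * N);
        double *c = calloc((size_t)N * N, sizeof(double));
        if (!a || !b || !c) return 1;
        for (size_t i = 0; i < (size_t)N * N; i++) { a[i] = 1.0; b[i] = 1.0; }
        matmul_blocked(a, b, c);                /* every element of c should equal N */
        free(a); free(b); free(c);
        return 0;
    }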

© 2006 Department of Computing Science CMPUT 229
Data Access Pattern: blocked multiplication, step by step

(series of figures stepping through the multiplication of the first row of a block of A by a block of B; the running miss and hit counts after each pair of accesses are:)

    misses: 2  3  4  4  4  4  4   4   4
    hits:   0  1  2  4  6  8  10  12  14

Multiplying the first row of the block of A by the block of B required 18 accesses that resulted in 4 misses.

How many of the 18 accesses required to multiply the second row of the block of A by the block of B will be misses?

The second and third rows of the block of A each contribute 1 miss and 17 hits, so for the whole block-by-block multiplication:

    misses: 4 + 1  + 1  = 6
    hits:  14 + 17 + 17 = 48

© 2006 Department of Computing Science CMPUT 229
Data Access Pattern: blocked multiplication, in general

What is the hit ratio for the next block multiplication?
Once the block of A is resident in the cache, only the block of B misses: 3 misses in the 54 references of the example block.

In general, a block multiplication performs 2 × b³ accesses and suffers b misses, so:

    Hit ratio = (2b³ - b) / (2b³) = 1 - 1/(2b²)

© 2006 Department of Computing Science CMPUT 229
Data Access Pattern

Assume that:
- Each matrix element is stored in 8 bytes;
- The data cache has 32 Kbytes and 128-byte cache lines;
- The data cache is direct-mapped.

What should be the value of b?
Do the memory locations of A and B matter?
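
One back-of-the-envelope way to look at the first question (an illustration, not from the slides): check which block sizes keep the three b × b working blocks of a, b and c inside the 32 KB cache and line up with the 16-element cache lines, and evaluate the hit-ratio formula derived above. Conflict misses are ignored here, which is exactly why the memory locations of A and B can still matter:

    /* Sketch: capacity check and predicted hit ratio for candidate block sizes. */
    #include <stdio.h>

    int main(void) {
        const unsigned cache_bytes = 32 * 1024, line_bytes = 128, elem = 8;
        const unsigned elems_per_line = line_bytes / elem;          /* 16 */

        for (unsigned b = elems_per_line; b <= 64; b += elems_per_line) {
            unsigned footprint = 3u * b * b * elem;                 /* blocks of a, b, c */
            double hit_ratio   = 1.0 - 1.0 / (2.0 * b * b);          /* (2b^3 - b)/(2b^3) */
            printf("b = %2u: working set = %5u bytes (%s), predicted hit ratio = %.2f%%\n",
                   b, footprint, footprint <= cache_bytes ? "fits" : "too big",
                   100.0 * hit_ratio);
        }
        return 0;
    }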

© 2006 Department of Computing Science CMPUT 229
Cache Usage for Blocked Matrix Multiplication

Assume that:
- Each matrix element is stored in 8 bytes;
- The data cache has 32 Kbytes and 128-byte cache lines;
- The data cache is direct-mapped.

    for (i0 = 0; i0 < n; i0 = i0 + b){
      for (j0 = 0; j0 < n; j0 = j0 + b){
        for (k0 = 0; k0 < n; k0 = k0 + b){
          for (i = i0; i < min(i0+b, n); i++){
            for (j = j0; j < min(j0+b, n); j++){
              for (k = k0; k < min(k0+b, n); k++){
                c[i,j] = c[i,j] + a[i,k] * b[k,j];
              }
            }
          }
        }
      }
    }

Ignore conflict misses. Estimate the hit ratio for the block computation if b = 16. With b = 16, each row of a block is exactly one 128-byte cache line.

    Hit ratio = (2b³ - b) / (2b³)
              = (2 × 16³ - 16) / (2 × 16³)
              = 8176 / 8192
              ≈ 99.8%