Fast matrix multiplication; Cache usage


Fast matrix multiplication; Cache usage Assignment #2

Multiplication of 2D Matrices
- Simple algorithm
- Time and performance measurement
- Simple code improvements

Matrix Multiplication: C = A * B

Matrix Multiplication: C = A * B, where c_ij = a_i* · b_*j, i.e. each element c_ij is the dot product of row i of A and column j of B: c_ij = Σ_k a_ik · b_kj.

The simplest algorithm
Assumption: the matrices are stored as 2-D NxN arrays.

for (i=0; i<N; i++)
  for (j=0; j<N; j++)
    for (k=0; k<N; k++)
      c[i][j] += a[i][k] * b[k][j];

Advantage: code simplicity. Disadvantage: performance (?)

First Improvement
In the loop below, c[i][j] requires an address (pointer) computation on every access, yet it is constant in the k-loop:

for (i=0; i<N; i++)
  for (j=0; j<N; j++)
    for (k=0; k<N; k++)
      c[i][j] += a[i][k] * b[k][j];

First Performance Improvement
Accumulate in a local variable and write c[i][j] once:

for (i=0; i<N; i++) {
  for (j=0; j<N; j++) {
    int sum = 0;
    for (k=0; k<N; k++) {
      sum += a[i][k] * b[k][j];
    }
    c[i][j] = sum;
  }
}

Performance Analysis
clock_t clock() returns the processor time used by the program since the beginning of execution, or -1 if unavailable. clock()/CLOCKS_PER_SEC gives the time in seconds (approx.).

#include <time.h>
clock_t t1, t2;
t1 = clock();
mult_ijk(a,b,c,n);
t2 = clock();
printf("Running time = %f seconds\n", (double)(t2 - t1)/CLOCKS_PER_SEC);

The running time: compared with the simplest algorithm, the first optimization reduces the running time by about 15% (measured results were shown as a chart in the original slides).

Profiling
Profiling lets you learn where your program spends its time. This information shows which pieces of your program are slower and might be candidates for rewriting to make your program execute faster. It can also tell you which functions are called more often than you expected.

gprof
gprof is a profiling tool ("display call graph profile data"). Using gprof:
1. Compile and link with the -pg flag (gcc -pg).
2. Run your program. After it completes, a file named gmon.out is created in the current directory; it contains the data collected for profiling.
3. Run gprof (e.g. "gprof test", where test is the executable filename).
For more details: man gprof.

gprof output (gprof test gmon.out | less): the "self" time reported for each function is the total time spent in that function WITHOUT its descendants.

Cache Memory

Cache memory (CPU cache): a temporary storage area where frequently accessed data can be stored for rapid access. Once data is stored in the cache, future accesses use the cached copy rather than re-fetching the original data, so the average access time is lower.

Memory Hierarchy
- CPU registers: ~500 bytes, 0.25 ns; word transfer (1-8 bytes)
- Cache: 16K-1M bytes, 1 ns; cache-block transfer (8-128 bytes)
- Main memory: 64M-2G bytes, 100 ns
- Disk: ~100 GB, 5 ms
Moving down the hierarchy, access time, capacity, and the size of the transfer unit increase, while cost per bit and frequency of access decrease.
Most computers possess four kinds of memory: registers in the CPU; CPU caches (generally some kind of static RAM) both inside and adjacent to the CPU; main memory (generally dynamic RAM), which the CPU can read and write directly and reasonably quickly; and disk storage, which is much slower but also much larger. CPU register use is generally handled by the compiler, and this isn't a huge burden as data doesn't generally stay in registers very long. The decision of when to use the cache and when to use main memory is generally dealt with by hardware, so the programmer usually regards both together as simply physical memory. The cache is closer to the processor than main memory; it is smaller, faster, and more expensive. Transfer between caches and main memory is performed in units called cache blocks/lines.

Types of Cache Misses
1. Compulsory misses: caused by the first access to blocks that have never been in the cache (also known as cold-start misses).
2. Capacity misses: occur when the cache cannot contain all the blocks needed during execution of a program, so blocks are replaced and later retrieved when accessed again.
3. Conflict misses: occur when multiple blocks compete for the same set.

Main Cache Principles
Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon. LRU principle: keep the most recently used data.
Spatial Locality (Locality in Space): if an item is referenced, items at nearby addresses tend to be referenced soon. Move blocks of contiguous words to the cache.
Here is a simple overview of how a cache works. A cache works on the principle of locality. It takes advantage of temporal locality by keeping the most recently addressed words in the cache. To take advantage of spatial locality, on a cache miss we move an entire block of data containing the missing word from the lower level into the cache. This way, we store not only the most recently touched data in the cache but also the data adjacent to it.

Improving Spatial Locality: Loop Reordering for Matrices Allocated by Row (diagram: an N x M matrix stored row after row in memory).

Writing Cache Friendly Code
Assumptions:
1) block size = β words (word = int)
2) N is divisible by β
3) the cache cannot hold a complete row / column

int sumarrayrows(int a[N][N])
{
    int i, j, sum = 0;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

Accesses successive elements: a[0][0], ..., a[0][N-1], a[1][0], ... Miss rate = 1/β.

int sumarraycols(int a[N][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

Accesses distant elements: a[0][0], ..., a[N-1][0], a[0][1], ... No spatial locality! Miss rate = 1.

Matrix Multiplication (ijk)

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    int sum = 0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop: A is scanned row-wise (i,*), B column-wise (*,j), C is fixed (i,j). Every element of A and B is accessed N times.
Misses per inner-loop iteration (i=j=0): A = 1/β, B = 1.0, C = 1.0 (β = words per cache block).
What happens when i=0 and j=1?

Matrix Multiplication (jik)

/* jik */
for (j=0; j<n; j++) {
  for (i=0; i<n; i++) {
    int sum = 0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop: A is scanned row-wise (i,*), B column-wise (*,j), C is fixed (i,j). Every element of A and B is accessed N times.
Misses per inner-loop iteration (i=j=0): A = 1/β, B = 1.0, C = 1.0.
What happens when j=0 and i=1?

Matrix Multiplication (kij)

/* kij */
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    int x = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += x * b[k][j];
  }
}

Inner loop: A is fixed (i,k), B is scanned row-wise (k,*), C row-wise (i,*). Every element of B and C is accessed N times; each pass makes a partial calculation of the elements of C.
Misses per inner-loop iteration (i=j=0): A = 1.0, B = 1/β, C = 1/β.
What happens when k=0 and i=1?

Matrix Multiplication (jki)

/* jki */
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    int x = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * x;
  }
}

Inner loop: A is scanned column-wise (*,k), B is fixed (k,j), C column-wise (*,j). Every element of A and C is accessed N times.
Misses per inner-loop iteration (i=j=0): A = 1.0, B = 1.0, C = 1.0.
What happens when j=0 and k=1?

Summary: miss ratios for the first iteration of the inner loop (β = words per cache block):
- ijk (and jik): (β+1)/(2β) > 1/2
- kij: 1/β
- jki: 1

for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    int sum = 0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    int x = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += x * b[k][j];
  }
}

for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    int x = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * x;
  }
}

Cache Misses Analysis
Assumptions about the cache:
- β = block size (in words)
- the cache cannot hold an entire matrix
- the replacement policy is LRU (Least Recently Used)
Observation: for each loop ordering, one of the matrices (the one indexed by the inner loop indices) is scanned n times.
Lower bound = 2n²/β + n³/β = Θ(n³/β)
(2n²/β: one read of the other two matrices, compulsory misses; n³/β: n sequential reads of one matrix, compulsory + capacity misses)
For further details see the tutorial on the website.

Improving Temporal Locality: Blocked Matrix Multiplication

“Blocked” Matrix Multiplication (diagram: computing C[i][j] from row i of A and column j of B; each cache block holds r elements)
Key idea: reuse the other elements in each cache block as much as possible.

“Blocked” Matrix Multiplication
The blocks loaded for the computation of C[i][j] are also appropriate for the computation of C[i][j+1], ..., C[i][j+r-1]:
- compute the first r terms of C[i][j], ..., C[i][j+r-1]
- compute the next r terms of C[i][j], ..., C[i][j+r-1]
- ...

“Blocked” Matrix Multiplication
Next improvement: reuse the loaded blocks of B for the computation of the next (r-1) subrows.

“Blocked” Matrix Multiplication
(diagram: C1 is a subrow of C; A1..A4 are blocks of a row of A, B1..B4 are blocks of a column strip of B)
Order of the operations:
- compute the first r terms of C1 (=A1*B1)
- compute the next r terms of C1 (=A2*B2)
- ...
- compute the last r terms of C1 (=A4*B4)

“Blocked” Matrix Multiplication
With N = 4r, for example: C22 = A21·B12 + A22·B22 + A23·B32 + A24·B42 = Σ_k A2k·Bk2.
Main point: each multiplication operates on small “block” matrices, whose size may be chosen so that they fit in the cache.

Blocked Algorithm
r = block (sub-matrix) size (assume r divides N); X[i][j] = the r x r sub-matrix of X defined by block row i and block column j. The blocked version of the i-j-k algorithm is written simply as

for (i=0; i<N/r; i++)
  for (j=0; j<N/r; j++)
    for (k=0; k<N/r; k++)
      C[i][j] += A[i][k]*B[k][j];

where += is an r x r matrix addition and * is an r x r matrix multiplication.

Maximum Block Size
The blocking optimization works only if the blocks fit in the cache; that is, 3 blocks of size r x r (for A, B, and C) must fit. Let M = size of the cache (in elements/words). We must have 3r² ≤ M, i.e. r ≤ √(M/3).
Lower bound = (r²/β)(2(n/r)³ + (n/r)²) = (1/β)(2n³/r + n²) = Θ(n³/(rβ)) = Θ(n³/(β√M))
Therefore, the ratio of cache misses, ijk-blocked vs. ijk-unblocked, is 1:√M.

Home Exercise

Home exercise
Implement the described algorithms for matrix multiplication and measure the performance.
Store the matrices as 1-D arrays, organized by columns: element a[i][j] is stored at A[i + j*N].

Question 2.1: mlpl
Implement all 6 options of loop ordering (ijk, ikj, jik, jki, kij, kji). Run them for matrices of different sizes. Measure the performance with clock() and gprof. Plot the running times of all the options (ijk, jki, etc.) as a function of matrix size. Select the most efficient loop ordering.

Question 2.2: block_mlpl
Implement the blocking algorithm, using the most efficient loop ordering from 2.1. Run it for matrices of different sizes. Measure the performance with clock(). Plot the running times in CPU ticks as a function of matrix size.

User Interface
Input:
- Case 1: 0 or a negative number
- Case 2: a positive integer followed by the values of two matrices (separated by spaces)
Output:
- Case 1: running times
- Case 2: a matrix which is the multiplication of the input matrices (C = A*B)

Files and locations
All of your files should be located under your home directory /soft-proj09/assign2/.
Strictly follow the provided prototypes and the file framework (explained in the assignment).

The Makefile

mlpl: allocate_free.c matrix_manipulate.c multiply.c mlpl.c
	gcc -Wall -pg -g -ansi -pedantic-errors allocate_free.c matrix_manipulate.c multiply.c mlpl.c -o mlpl

block_mlpl: allocate_free.c matrix_manipulate.c multiply.c block_mlpl.c
	gcc -Wall -pg -g -ansi -pedantic-errors allocate_free.c matrix_manipulate.c multiply.c block_mlpl.c -o block_mlpl

Commands:
- make mlpl – creates the executable mlpl for 2.1
- make block_mlpl – creates the executable block_mlpl for 2.2

Plotting the graphs
Save the output to a *.csv file and open it in Excel. Use Excel's "Chart Wizard" to plot the data as an XY Scatter. X-axis: the matrix sizes. Y-axis: the running time in CPU ticks.

Final Notes
Arrays and pointers: as function parameters, the declarations below are equivalent:
int *a
int a[]

Good Luck in the Exercise!!!