Matrix Multiplication (i,j,k) for I = 1 to n do for j = 1 to n do for k = 1 to n do C[i,j] = C[i,j] + A[i,k] x B[k,j] endfor

(i,j,k) Memory Map: [figure: C = A x B with the i, j, and k indices marked on the three matrices]

Scalar Architecture: [diagram: registers and functional units, cache memory, main memory, and the memory bus connecting them]

Cache lines: the matrix is stored by rows (row-major), so consecutive elements of a row share a cache line; the row direction is the stride-1 dimension.
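A small, self-contained illustration of row-major storage (the 3 x 4 array and the program itself are just an example, not part of the slides):

#include <stdio.h>

int main(void)
{
    /* Row-major storage: a[i][j] lives at offset i*4 + j from &a[0][0],
     * so walking along a row is a stride-1 access pattern. */
    int a[3][4] = {{1,2,3,4},{5,6,7,8},{9,10,11,12}};
    int *base = &a[0][0];

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 4; j++)
            printf("a[%d][%d] = %2d at offset %2ld\n",
                   i, j, a[i][j], (long)(&a[i][j] - base));
    return 0;
}

Walking along a row touches consecutive addresses, so one cache line serves several accesses; walking down a column jumps by a full row between accesses.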

Matrix Multiplication (i,k,j): Improve Spatial Locality

for i = 1 to n do
  for k = 1 to n do
    for j = 1 to n do
      C[i,j] = C[i,j] + A[i,k] x B[k,j]
    endfor
  endfor
endfor
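The same ordering as a C sketch, under the same flat row-major assumptions as the earlier (i,j,k) sketch (the name matmul_ikj is illustrative):

/* (i,k,j) ordering: with A[i,k] held in a scalar, the inner loop walks
 * both C[i,*] and B[k,*] with stride 1 (consecutive addresses). */
void matmul_ikj(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double a = A[i*n + k];
            for (int j = 0; j < n; j++)
                C[i*n + j] += a * B[k*n + j];
        }
}

Every memory stream touched in the inner loop (a row of C and a row of B) is now stride 1, so consecutive iterations hit in the same cache line.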

(i,k,j) Memory Map: [figure: C = A x B with the i, j, and k indices marked on the three matrices]

Matrix Multiplication (i,k,j): Improve Temporal Locality

Partition C, A, and B into 3 x 3 grids of submatrices:

  | C11 C12 C13 |   | A11 A12 A13 |   | B11 B12 B13 |
  | C21 C22 C23 | = | A21 A22 A23 | x | B21 B22 B23 |
  | C31 C32 C33 |   | A31 A32 A33 |   | B31 B32 B33 |

Each block of C is then a sum of submatrix products, for example:

  C11 = A11 x B11 + A12 x B21 + A13 x B31

Submatrix Multiplication (i,k,j)

for it = 1 to n by s do
  for kt = 1 to n by s do
    for jt = 1 to n by s do
      for i = it to min(it+s-1, n) do
        for k = kt to min(kt+s-1, n) do
          for j = jt to min(jt+s-1, n) do
            C[i,j] = C[i,j] + A[i,k] x B[k,j]
          endfor
        endfor
      endfor
    endfor
  endfor
endfor
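A hedged C sketch of the same blocking (tiling) scheme; the flat row-major layout, the min_int helper, and the function name are assumptions carried over from the earlier sketches.

static int min_int(int a, int b) { return a < b ? a : b; }

/* Blocked (i,k,j) multiply with block size s: the three inner loops work
 * on one s-by-s tile of C, A, and B at a time, so the tiles can stay
 * resident in cache while they are being reused. */
void matmul_blocked(int n, int s,
                    const double *A, const double *B, double *C)
{
    for (int it = 0; it < n; it += s)
        for (int kt = 0; kt < n; kt += s)
            for (int jt = 0; jt < n; jt += s)
                for (int i = it; i < min_int(it + s, n); i++)
                    for (int k = kt; k < min_int(kt + s, n); k++) {
                        double a = A[i*n + k];
                        for (int j = jt; j < min_int(jt + s, n); j++)
                            C[i*n + j] += a * B[k*n + j];
                    }
}

A common rule of thumb is to choose s so that three s x s tiles of doubles fit comfortably in the cache level being targeted.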

(i,k,j) Memory Map (blocked): [figure: C = A x B with the s x s blocks at block indices it, kt, and jt marked]

Multiprocessor Architecture: [diagram: two CPUs, each with its own cache memory, connected by the memory bus to a shared main memory]

Parallel (i,k,j): Inner loop

for i = 1 to n do
  for k = 1 to n do
    parfor j = 1 to n do
      C[i,j] = C[i,j] + A[i,k] x B[k,j]
    endparfor
  endfor
endfor
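One way to realize the parfor on a shared-memory machine is an OpenMP parallel for; the slides use a generic parfor, so the pragma below is an assumption, not the authors' API.

/* Parallel (i,k,j), inner loop: the j loop is split across threads.
 * Different iterations write disjoint elements C[i,j], so no locking is
 * needed, but a parallel region is entered once per (i,k) pair.
 * Compile with OpenMP enabled (e.g. gcc -fopenmp). */
void matmul_par_inner(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double a = A[i*n + k];
            #pragma omp parallel for
            for (int j = 0; j < n; j++)
                C[i*n + j] += a * B[k*n + j];
        }
}

Because only the innermost loop is parallel, threads are forked and joined once per (i,k) pair, which the outer-loop version on the next slide avoids.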

Parallel (i,k,j): Inner loop memory mapping: [figure: C = A x B with the i and k indices marked, showing the data touched while the j loop runs in parallel]

Parallel (i,k,j): Outer loop

parfor i = 1 to n do
  for k = 1 to n do
    for j = 1 to n do
      C[i,j] = C[i,j] + A[i,k] x B[k,j]
    endfor
  endfor
endparfor
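The outer-loop version under the same OpenMP assumption:

/* Parallel (i,k,j), outer loop: whole rows of C are distributed across
 * threads, so there is a single fork/join for the whole multiplication
 * and each thread writes only its own rows of C. */
void matmul_par_outer(int n, const double *A, const double *B, double *C)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double a = A[i*n + k];
            for (int j = 0; j < n; j++)
                C[i*n + j] += a * B[k*n + j];
        }
}

Each thread reads the rows of A it needs plus all of B, and the parallel region is entered only once instead of once per (i,k) pair.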

Parallel (i,k,j): Outer loop memory mapping: [figure: C = A x B memory mapping when the outer i loop is parallel]

Parallel (i,k,j): Submatrix

parfor it = 1 to n by s do
  for kt = 1 to n by s do
    for jt = 1 to n by s do
      for i = it to min(it+s-1, n) do
        for k = kt to min(kt+s-1, n) do
          for j = jt to min(jt+s-1, n) do
            C[i,j] = C[i,j] + A[i,k] x B[k,j]
          endfor
        endfor
      endfor
    endfor
  endfor
endparfor
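Finally, a sketch that combines blocking with the parallel outer loop, again assuming OpenMP, the flat row-major layout, and the min_int helper from the sequential blocked sketch.

static int min_int(int a, int b) { return a < b ? a : b; }

/* Parallel blocked (i,k,j): block-rows of C (the it loop) are distributed
 * across threads; within each thread the s-by-s tiles are multiplied as
 * in the sequential blocked version. */
void matmul_par_blocked(int n, int s,
                        const double *A, const double *B, double *C)
{
    #pragma omp parallel for
    for (int it = 0; it < n; it += s)
        for (int kt = 0; kt < n; kt += s)
            for (int jt = 0; jt < n; jt += s)
                for (int i = it; i < min_int(it + s, n); i++)
                    for (int k = kt; k < min_int(kt + s, n); k++) {
                        double a = A[i*n + k];
                        for (int j = jt; j < min_int(jt + s, n); j++)
                            C[i*n + j] += a * B[k*n + j];
                    }
}

Each thread processes whole block-rows of C, so it keeps the cache reuse of the sequential blocked version while paying for only one parallel region.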