15-213 Recitation 7 (10/21/02): Program Optimization, Annie Luo


Recitation 7: 10/21/02
Annie Luo

Outline
– Program Optimization
  – Machine Independent
  – Machine Dependent
– Loop Unrolling
– Blocking

Office Hours: Thursday 6:00 – 7:00, Wean 8402
Reminder: L4 is due Thursday 10/24 at 11:59pm. Submission is *online*.

Machine Independent Optimization
Code Motion
– move repeated, loop-invariant computations out of the loop
Reduction in Strength
– replace costly operations with simpler ones
– keep data in registers rather than in memory
– avoid procedure calls inside loops; use pointers instead
Share common subexpressions
Remember: procedure calls are expensive, bounds checking is expensive, and memory references are expensive! A short sketch of code motion and strength reduction follows below.
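As a quick illustration (not from the original slides; lower_slow, lower_fast, and set_row are made-up names), here is a minimal C sketch of two of these transformations: code motion hoists a loop-invariant strlen() call out of the loop condition, and reduction in strength replaces a per-iteration multiply with a value computed once.

    #include <string.h>

    /* Code motion, before: strlen(s) is re-evaluated on every iteration,
       turning a linear loop into a quadratic one. */
    void lower_slow(char *s) {
        for (size_t i = 0; i < strlen(s); i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] += 'a' - 'A';
    }

    /* Code motion, after: the loop-invariant call is hoisted out. */
    void lower_fast(char *s) {
        size_t len = strlen(s);            /* computed once */
        for (size_t i = 0; i < len; i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] += 'a' - 'A';
    }

    /* Reduction in strength: the index expression i*n + j would cost a
       multiply per element; computing i*n once removes it from the loop. */
    void set_row(double *a, double *b, long i, long n) {
        long in = i * n;                   /* one multiply instead of n */
        for (long j = 0; j < n; j++)
            a[in + j] = b[j];
    }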

Machine Dependent Optimization
Take advantage of the underlying processor and memory system
– Loop unrolling
– Blocking

Loop Unrolling – Why Can We Do It?
[Figure: block diagram of a modern out-of-order processor – instruction control (fetch control, instruction decode, instruction cache, branch prediction), execution functional units (integer/branch, FP add, FP mult/div, load, store), data cache, register file, and retirement unit]
Superscalar: the processor can perform multiple operations on every clock cycle
Out-of-order: instructions may execute in a different order than they appear in the assembly code

CPU Capability of the Pentium III
Multiple instructions can execute in parallel:
– 1 load
– 1 store
– 2 integer (one may be a branch)
– 1 FP addition
– 1 FP multiplication or division
Some instructions take more than 1 cycle, but can be pipelined:

    Instruction                 Latency   Cycles/Issue
    Load / Store                    3          1
    Integer Multiply                4          1
    Integer Divide                 36         36
    Double/Single FP Multiply       5          2
    Double/Single FP Add            3          1
    Double/Single FP Divide        38         38

Loop Unrolling
Combine multiple iterations into a single loop body
– perform more data operations in each iteration
Amortize loop overhead across multiple iterations
– computing the loop index
– testing the loop condition
– make sure the loop condition does NOT run past the array bounds
Finish any extra elements at the end

    void combine5(vec_ptr v, int *dest)
    {
        int length = vec_length(v);
        int limit = length - 2;
        int *data = get_vec_start(v);
        int sum = 0;
        int i;

        /* Combine 3 elements at a time */
        for (i = 0; i < limit; i += 3) {
            sum += data[i] + data[i+1] + data[i+2];
        }

        /* Finish any remaining elements */
        for (; i < length; i++) {
            sum += data[i];
        }
        *dest = sum;
    }

Practice Time
Work on practice problems 5.12 and 5.13.

Solution 5.12

    void inner5(vec_ptr u, vec_ptr v, data_t *dest)
    {
        int i;
        int length = vec_length(u);
        int limit = length - 3;
        data_t *udata = get_vec_start(u);
        data_t *vdata = get_vec_start(v);
        data_t sum = (data_t) 0;

        /* Do four elements at a time */
        for (i = 0; i < limit; i += 4) {
            sum += udata[i]   * vdata[i]
                 + udata[i+1] * vdata[i+1]
                 + udata[i+2] * vdata[i+2]
                 + udata[i+3] * vdata[i+3];
        }

        /* Finish off any remaining elements */
        for (; i < length; i++)
            sum += udata[i] * vdata[i];
        *dest = sum;
    }

Solution 5.12
A. We must perform two loads per element to read the values of udata and vdata. There is only one unit to perform these loads, and each load requires one cycle, so the CPE cannot drop below 2.
B. Floating-point performance is still limited by the 3-cycle latency of the floating-point adder.

Solution 5.13

    void inner6(vec_ptr u, vec_ptr v, data_t *dest)
    {
        int i;
        int length = vec_length(u);
        int limit = length - 3;
        data_t *udata = get_vec_start(u);
        data_t *vdata = get_vec_start(v);
        data_t sum0 = (data_t) 0;
        data_t sum1 = (data_t) 0;

        /* Do four elements at a time */
        for (i = 0; i < limit; i += 4) {
            sum0 += udata[i]   * vdata[i];
            sum1 += udata[i+1] * vdata[i+1];
            sum0 += udata[i+2] * vdata[i+2];
            sum1 += udata[i+3] * vdata[i+3];
        }

        /* Finish off any remaining elements */
        for (; i < length; i++)
            sum0 = sum0 + udata[i] * vdata[i];
        *dest = sum0 + sum1;
    }

Solution 5.13 For each element, we must perform two loads with a unit that can only load one value per clock cycle. We must also perform one floating-point multiplication with a unit that can only perform one multiplication every two clock cycles. Both of these factors limit the CPE to 2.
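For intuition (this is a hypothetical extension, not part of the recitation; inner7 is a made-up name), here is a sketch of pushing the same idea further: unrolling by 4 with four separate accumulators. Under the throughput limits stated above, each element still requires two loads (one load unit, one load per cycle) and one floating-point multiply (one multiply every two cycles), so the extra accumulators cannot push the CPE below 2 on this machine.

    /* Hypothetical inner7: unroll by 4 with four accumulators.
       Exposes more instruction-level parallelism, but the load and
       multiply throughput bounds described above still limit the CPE. */
    void inner7(vec_ptr u, vec_ptr v, data_t *dest)
    {
        int i;
        int length = vec_length(u);
        int limit = length - 3;
        data_t *udata = get_vec_start(u);
        data_t *vdata = get_vec_start(v);
        data_t sum0 = (data_t) 0, sum1 = (data_t) 0;
        data_t sum2 = (data_t) 0, sum3 = (data_t) 0;

        /* Do four elements at a time, each feeding its own accumulator */
        for (i = 0; i < limit; i += 4) {
            sum0 += udata[i]   * vdata[i];
            sum1 += udata[i+1] * vdata[i+1];
            sum2 += udata[i+2] * vdata[i+2];
            sum3 += udata[i+3] * vdata[i+3];
        }

        /* Finish off any remaining elements */
        for (; i < length; i++)
            sum0 += udata[i] * vdata[i];
        *dest = sum0 + sum1 + sum2 + sum3;
    }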

Summary of Matrix Multiplication

ijk (& jik): 2 loads, 0 stores; misses/iter = 1.25

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

kij (& ikj): 2 loads, 1 store; misses/iter = 0.5

    for (k = 0; k < n; k++) {
        for (i = 0; i < n; i++) {
            r = a[i][k];
            for (j = 0; j < n; j++)
                c[i][j] += r * b[k][j];
        }
    }

jki (& kji): 2 loads, 1 store; misses/iter = 2.0

    for (j = 0; j < n; j++) {
        for (k = 0; k < n; k++) {
            r = b[k][j];
            for (i = 0; i < n; i++)
                c[i][j] += a[i][k] * r;
        }
    }

[Figure: inner-loop access patterns – ijk reads row (i,*) of A and column (*,j) of B to produce element (i,j) of C; kij reads element (i,k) of A and row (k,*) of B to update row (i,*) of C; jki reads column (*,k) of A and element (k,j) of B to update column (*,j) of C]

Improve Temporal Locality by Blocking
Example: blocked matrix multiplication
– "block" (in this context) does not mean "cache block"
– instead, it means a sub-block within the matrix
– e.g., n = 8; sub-block size = 4 x 4

    [A11 A12]   [B11 B12]   [C11 C12]
    [A21 A22] x [B21 B22] = [C21 C22]

    C11 = A11*B11 + A12*B21    C12 = A11*B12 + A12*B22
    C21 = A21*B11 + A22*B21    C22 = A21*B12 + A22*B22

Key idea: sub-blocks (i.e., Axy) can be treated just like scalars.

Blocked Matrix Multiply (bijk)

    int en = n / bsize * bsize;  /* assume n is an integral multiple of bsize */

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            c[i][j] = 0.0;

    for (kk = 0; kk < en; kk += bsize) {
        for (jj = 0; jj < en; jj += bsize) {
            for (i = 0; i < n; i++) {
                for (j = jj; j < jj + bsize; j++) {
                    sum = c[i][j];
                    for (k = kk; k < kk + bsize; k++) {
                        sum += a[i][k] * b[k][j];
                    }
                    c[i][j] = sum;
                }
            }
        }
    }

Blocked Matrix Multiply Analysis
The innermost loop pair multiplies a 1 x bsize sliver of A by a bsize x bsize block of B and accumulates into a 1 x bsize sliver of C
The loop over i steps through n row slivers of A and C, using the same block of B
– the block of B is reused n times in succession
– each row sliver of A is accessed bsize times
– successive elements of the C sliver are updated

Innermost loop pair:

    for (i = 0; i < n; i++) {
        for (j = jj; j < jj + bsize; j++) {
            sum = c[i][j];
            for (k = kk; k < kk + bsize; k++) {
                sum += a[i][k] * b[k][j];
            }
            c[i][j] = sum;
        }
    }

[Figure: slivers of A and C and the block of B, positioned by indices i, kk, and jj]

Pentium Blocked Matrix Multiply Performance
Blocking (bijk and bikj) improves performance by a factor of two over the unblocked versions (ijk and jik)
– performance is relatively insensitive to array size

Summary
You can improve the performance of your program!
– Keep the working set reasonably small (temporal locality)
– Use small strides (spatial locality)
Keep "cache friendly code" in mind as you write programs; a small sketch of stride-1 versus stride-n access follows below
– Absolute optimum performance is very platform specific: cache sizes, line sizes, etc.
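As a closing illustration (not from the original slides; the array size is arbitrary), the two functions below sum the same matrix. The row-wise traversal uses stride-1 accesses that match C's row-major layout, while the column-wise traversal uses stride-N accesses that touch a new cache line on nearly every reference.

    #define N 1024   /* arbitrary size for illustration */

    /* Cache friendly: stride-1 accesses; each cache line is fully
       consumed before it is evicted (good spatial locality). */
    double sum_rowwise(double a[N][N]) {
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    /* Cache unfriendly: stride-N accesses; each reference typically
       lands on a different cache line, so most of each line is wasted. */
    double sum_colwise(double a[N][N]) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];
        return sum;
    }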