1 Code Optimization. 2 Outline Machine-Independent Optimization –Code motion –Memory optimization Suggested reading –5.2 ~ 5.6.

Slides:



Advertisements
Similar presentations
Code Optimization and Performance Chapter 5 CS 105 Tour of the Black Holes of Computing.
Advertisements

Carnegie Mellon Today Program optimization  Optimization blocker: Memory aliasing  Out of order processing: Instruction level parallelism  Understanding.
Instructor: Erol Sahin Program Optimization CENG331: Introduction to Computer Systems 11 th Lecture Acknowledgement: Most of the slides are adapted from.
Program Optimization (Chapter 5)
1 Code Optimization(II). 2 Outline Understanding Modern Processor –Super-scalar –Out-of –order execution Suggested reading –5.14,5.7.
Winter 2015 COMP 2130 Intro Computer Systems Computing Science Thompson Rivers University Program Optimization.
Fall 2011SYSC 5704: Elements of Computer Systems 1 SYSC 5704 Elements of Computer Systems Optimization to take advantage of hardware.
Carnegie Mellon 1 Program Optimization : Introduction to Computer Systems 25 th Lecture, Nov. 23, 2010 Instructors: Randy Bryant and Dave O’Hallaron.
Code Optimization I September 24, 2007 Topics Machine-Independent Optimizations Basic optimizations Optimization blockers class08.ppt F’07.
Code Optimization: Machine Independent Optimizations Feb 12, 2004 Topics Machine-Independent Optimizations Machine-Dependent Opts Understanding Processor.
Code Optimization I: Machine Independent Optimizations Sept. 26, 2002 Topics Machine-Independent Optimizations Code motion Reduction in strength Common.
Code Optimization II: Machine Dependent Optimizations Topics Machine-Dependent Optimizations Pointer code Unrolling Enabling instruction level parallelism.
C++ / G4MICE Course Session 3 Introduction to Classes Pointers and References Makefiles Standard Template Library.
Architecture Basics ECE 454 Computer Systems Programming
Chapter 1 Algorithm Analysis
Code Optimization 1. Outline Machine-Independent Optimization –Code motion –Memory optimization Suggested reading –5.1 ~
Cache Performance Metrics
1 Code Optimization. 2 Outline Machine-Independent Optimization –Code motion –Memory optimization Suggested reading –5.2 ~ 5.6.
Recitation 7: 10/21/02 Outline Program Optimization –Machine Independent –Machine Dependent Loop Unrolling Blocking Annie Luo
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 1 Computer Systems Optimizing program performance.
1 Code Optimization. 2 Outline Optimizing Blockers –Memory alias –Side effect in function call Understanding Modern Processor –Super-scalar –Out-of –order.
1 Code Optimization. 2 Outline Optimizing Blockers –Memory alias –Side effect in function call Understanding Modern Processor –Super-scalar –Out-of –order.
Machine-Dependent Optimization CS 105 “Tour of the Black Holes of Computing”
Introduction to Computer Systems Topics: Theme Five great realities of computer systems (continued) “The class that bytes”
University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 1 Computer Systems Optimizing program performance.
Code Optimization and Performance Chapters 5 and 9 perf01.ppt CS 105 “Tour of the Black Holes of Computing”
Addressing Modes Chapter 6 S. Dandamudi To be used with S. Dandamudi, “Introduction to Assembly Language Programming,” Second Edition, Springer,
Code Optimization II: Machine Dependent Optimization Topics Machine-Dependent Optimizations Unrolling Enabling instruction level parallelism.
Machine Independent Optimizations Topics Code motion Reduction in strength Common subexpression sharing.
Copyright 2014 – Noah Mendelsohn Code Tuning Noah Mendelsohn Tufts University Web:
Code Optimization and Performance CS 105 “Tour of the Black Holes of Computing”
Programming for Performance CS 740 Oct. 4, 2000 Topics How architecture impacts your programs How (and how not) to tune your code.
2005MEE Software Engineering Lecture 7 –Stacks, Queues.
Machine-Dependent Optimization CS 105 “Tour of the Black Holes of Computing”
Introduction to Computer Systems Class “The Class That Gives CMU Its Zip!” Randal E. Bryant August 30, 2005.
Code Optimization and Performance I Chapter 5 perf01.ppt CS 105 “Tour of the Black Holes of Computing”
1 Array. 2 Outline Aggregate scalar data into larger data Pointers vs. Array Array subscript and pointer arithmetic Memory layout for array –Static vs.
Code Optimization and Performance II
Code Optimization Code produced by compilation algorithms can often be improved (ideally optimized) in terms of run-time speed and the amount of memory.
Code Optimization.
Code Optimization II September 27, 2006
Today’s Instructor: Phil Gibbons
Machine-Dependent Optimization
Code Optimization II Dec. 2, 2008
Code Optimization I: Machine Independent Optimizations
Instructors: Dave O’Hallaron, Greg Ganger, and Greg Kesden
Program Optimization CENG331 - Computer Organization
Instructors: Pelin Angin and Erol Sahin
Program Optimization CSCE 312
Code Optimization II: Machine Dependent Optimizations Oct. 1, 2002
Code Optimization II: Machine Dependent Optimizations
Machine-Dependent Optimization
Code Optimization /18-213/14-513/15-513: Introduction to Computer Systems 10th Lecture, September 27, 2018.
Introduction to Computer Systems
Code Optimization and Performance
Code Optimization I: Machine Independent Optimizations Feb 11, 2003
Code Optimization(II)
Code Optimization April 6, 2000
Code Optimization I Nov. 25, 2008
Machine-Level Programming 5 Structured Data
Machine-Level Programming 5 Structured Data
COMP 2130 Intro Computer Systems Thompson Rivers University
Optimizing program performance
Program Optimization CSE 238/2038/2138: Systems Programming
Structured Data I: Homogenous Data Feb. 10, 2000
Machine-Independent Optimization
Lecture 11: Machine-Dependent Optimization
Code Optimization and Performance
Code Optimization II October 2, 2001
Presentation transcript:

1 Code Optimization

2 Outline Machine-Independent Optimization –Code motion –Memory optimization Suggested reading –5.2 ~ 5.6

3 Motivation Constant factors matter too! –easily see 10:1 performance range depending on how code is written –must optimize at multiple levels algorithm, data representations, procedures, and loops

4 Motivation Must understand system to optimize performance –how programs are compiled and executed –how to measure program performance and identify bottlenecks –how to improve performance without destroying code modularity and generality

5 5.3 Program Example

6 Vector ADT P384 typedef struct { int len ; data_t *data ; } vec_rec, *vec_ptr ; typedef int data_t ; length data      012length–1 Figure 5.3 P385

7 Procedures P386 vec_ptr new_vec(int len) –Create vector of specified length data_t *get_vec_start(vec_ptr v) –Return pointer to start of vector data P392 P386

8 Procedures int get_vec_element( vec_ptr v, int index, int *dest ) –Retrieve vector element, store at *dest –Return 0 if out of bounds, 1 if successful Similar to array implementations in Pascal, Java –E.g., always do bounds checking

9 Vector ADT vec_ptr new_vec(int len) { /* allocate header structure */ vec_ptr result = (vec_ptr) malloc(sizeof(vec_rec)) ; if ( !result ) return NULL ;

10 Vector ADT /* allocate array */ if ( len > 0 ) { data_t *data = (data_t *)calloc(len, sizeof(data_t)) ; if ( !data ) { free( (void *)result ) ; return NULL ; /* couldn’t allocte stroage */ } result->data = data } else result->data = NULL return result ; }

11 Vector ADT /* * Retrieve vector element and store at dest. * Return 0 (out of bounds) or 1 (successful) */ int get_vec_element(vec_ptr v, int index, data_t *dest) { if ( index = v->len) return 0 ; *dest = v->data[index] ; return 1; }

12 Vector ADT /* Return length of vector */ int vec_length(vec_ptr) { return v->len ; } /* Return pointer to start of vector data */ data_t *get_vec_start(vec_ptr v) { return v->data ; } P392

13 Optimization Example P385 #ifdef ADD #define IDENT 0 #define OPER + #else #define IDENT 1 #define OPER * #endif

14 Optimization Example P387 void combine1(vec_ptr v, data_t *dest) { int i; *dest = IDENT; for (i = 0; i < vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest = *dest OPER val; }

15 Optimization Example Procedure –Compute sum (product) of all elements of vector –Store result at destination location

Expressing Program Performance

17 Time Scales P382 Absolute Time –Typically use nanoseconds 10 –9 seconds –Time scale of computer instructions

18 Time Scales Clock Cycles –Most computers controlled by high frequency clock signal –Typical Range 100 MHz –10 8 cycles per second –Clock period = 10ns 2 GHz –2 X 10 9 cycles per second –Clock period = 0.5ns

19 CPE P383 1 void vsum1(int n) 2 { 3 int i; 4 5 for (i = 0; i < n; i++) 6 c[i] = a[i] + b[i]; 7 } 8

20 CPE P383 9 /* Sum vector of n elements (n must be even) */ 10 void vsum2(int n) 11 { 12 int i; for (i = 0; i < n; i+=2) { 15 /* Compute two elements per iteration */ 16 c[i] = a[i] + b[i]; 17 c[i+1] = a[i+1] + b[i+1]; 18 } 19 }

21 Cycles Per Element Convenient way to express performance of program that operators on vectors or lists Length = n T = CPE*n + Overhead

22 Cycles Per Element Figure 5.2 P383 vsum1 Slope = 4.0 vsum2 Slope = 3.5

Program Example 5.4 Eliminating Loop Inefficiencies 5.5 Reducing Procedure Calls 5.6 Eliminating Unneeded Memory References

24 Time Scales P387 void combine1(vec_ptr v, int *dest) { int i; *dest = 0; for (i = 0; i < vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest += val; } 5.3 Program Example

25 Time Scales P385 Procedure –Compute sum of all elements of integer vector –Store result at destination location –Vector data structure and operations defined via abstract data type Pentium II/III Performance: CPE –42.06 (Compiled -g)31.25 (Compiled -O2)

26 Understanding Loop void combine1-goto(vec_ptr v, int *dest) { int i = 0; int val; *dest = 0; if (i >= vec_length(v)) goto done; loop: get_vec_element(v, i, &val); *dest += val; i++; if (i < vec_length(v)) goto loop done: } 1 iteration

27 Inefficiency Procedure vec_length called every iteration Even though result always the same 5.4 Eliminating Loop Inefficiencies

28 Code Motion P388 void combine2(vec_ptr v, int *dest) { int i; int length = vec_length(v); *dest = 0; for (i = 0; i < length; i++) { int val; get_vec_element(v, i, &val); *dest += val; }

29 Code Motion P388 Optimization –Move call to vec_length out of inner loop Value does not change from one iteration to next Code motion –CPE: (Compiled -O2) vec_length requires only constant time, but significant overhead

30 Reduction in Strength P392 void combine3(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); *dest = 0; for (i = 0; i < length; i++) { *dest += data[i]; } 5.5 Reducing Procedure Calls

31 Reduction in Strength Optimization –Avoid procedure call to retrieve each vector element Get pointer to start of array before loop Within loop just do pointer reference Not as clean in terms of data abstraction –CPE: 6.00 (Compiled -O2) Procedure calls are expensive! Bounds checking is expensive

32 Eliminate Unneeded Memory References P394 void combine4(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int sum = 0; for (i = 0; i < length; i++) sum += data[i]; *dest = sum; } 5.6 Eliminating Unneeded Memory References

33 Eliminate Unneeded Memory References Optimization –Don’t need to store in destination until end –Local variable sum held in register –Avoids 1 memory read, 1 memory write per cycle –CPE: 2.00 (Compiled -O2) Memory references are expensive!

34 Detecting Unneeded Memory References.L18: movl (%ecx,%edx,4),%eax addl %eax,(%edi) incl %edx cmpl %esi,%edx jl.L18.L24: addl (%eax,%edx,4),%ecx incl %edx cmpl %esi,%edx jl.L24 Combine3Combine4

35 Detecting Unneeded Memory References Performance –Combine3 5 instructions in 6 clock cycles addl must read and write memory –Combine4 4 instructions in 2 clock cyles