COMP 2130 Intro Computer Systems
Computing Science, Thompson Rivers University
Code Optimization
Winter 2013
Your vision? Seek with all your heart?
Course Objectives
The better your knowledge of computer systems, the better your programming.
    Computer System
    C Programming Language
    Computer architecture
    CPU (Central Processing Unit)
    IA32 assembly language
    Introduction to C language
    Compiling, linking, loading, executing
    Physical main memory
    MMU (Memory Management Unit)
    Virtual memory space
    Memory hierarchy
    Cache
    Dynamic memory management
    Better coding: locality
Reliable and efficient programming for power programmers (to avoid strange errors, to optimize code, to avoid security holes, …)
TRU-COMP2130 Code Optimization
Course Contents
    Introduction to computer systems: B&O 1
    Introduction to C programming: K&R 1 – 4
    Data representations: B&O 2.1 – 2.4
    C: advanced topics: K&R 5.1 – 5.10, 6 – 7
    Introduction to IA32 (Intel Architecture 32): B&O 3.1 – 3.8, 3.13
    Compiling, linking, loading, and executing: B&O 7 (except 7.12)
    Dynamic memory management (heap): B&O 9.1–2, 9.3–4, 9.9.1–2, 9.9.4–5, 9.11
    Code optimization: B&O 5.1 – 5.6, 5.13
    Memory hierarchy, locality, caching: B&O 5.12, 6.1 – 6.3, 6.4.1 – 6.4.2, 6.5, 6.6.2 – 6.6.3, 6.7
    Virtual memory (if time permits): B&O 9.4 – 9.5
Unit Learning Objectives
    List the two optimization blockers.
    Give examples of the two optimization blockers.
    Use optimization techniques.
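Unit Contents
    5.1 Limitations of Optimizing Compilers
    5.2 Expressing Program Performance
    5.3 Program Example
    5.4 Eliminating Loop Inefficiencies
    5.5 Reducing Procedure Calls
    5.6 Eliminating Unneeded Memory References
    5.13 Performance Improvement Techniques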
Introduction
The primary objective in writing a program is to make it work correctly under all possible conditions. Making it run fast is also an important consideration.
[Q] How do you write an efficient program?
    1. Choose appropriate algorithms and data structures.
    2. Write source code that the compiler can effectively optimize into efficient executable code.
For the second point, it is important to understand the capabilities and limitations of optimizing compilers.
However, programmers must make a trade-off between how easy a program is to implement and maintain and how fast it runs.
Modern compilers employ sophisticated forms of analysis and optimization.
Even the best compilers, however, can be thwarted by optimization blockers: aspects of the program's behavior that depend strongly on the execution environment.
Optimization blockers can confuse programmers as well and lead to logic errors.
Programmers must assist the compiler by writing code that can be optimized readily.
5.1 Limitations of Optimizing Compilers
Higher optimization levels of gcc can improve program performance, but they may increase program size and make the program more difficult to debug with standard debugging tools.
Compilers must be careful to apply only safe optimizations to a program.
Example: Memory Aliasing
    void twiddle1(int *xp, int *yp) {
        *xp += *yp;
        *xp += *yp;
    }
[Q] Can twiddle1 be replaced by twiddle2?
    void twiddle2(int *xp, int *yp) {
        *xp += 2 * *yp;
    }
[Q] What if xp == yp, i.e., both pointers refer to the same location?
    In twiddle1, *xp becomes four times its original value, but
    in twiddle2, *xp becomes three times its original value.
[Q] Is it good programming style to pass pointers and manipulate data through them? How can twiddle1() be improved?
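The aliasing case can be checked directly. The sketch below assumes the textbook form of twiddle1, which performs the addition twice; check_twiddle is a hypothetical helper added only for illustration.

```c
#include <assert.h>

/* Assumed two-statement form of twiddle1: adds *yp to *xp twice. */
void twiddle1(int *xp, int *yp) {
    *xp += *yp;
    *xp += *yp;
}

/* Candidate replacement: adds 2 * *yp to *xp once. */
void twiddle2(int *xp, int *yp) {
    *xp += 2 * *yp;
}

/* Hypothetical self-check; call it from main(). */
void check_twiddle(void) {
    int a = 5, b = 7;

    twiddle1(&a, &b);      /* no aliasing: 5 + 7 + 7 */
    assert(a == 19);
    a = 5;
    twiddle2(&a, &b);      /* no aliasing: 5 + 2*7 */
    assert(a == 19);

    a = 5;
    twiddle1(&a, &a);      /* aliased: 5 -> 10 -> 20 */
    assert(a == 20);
    a = 5;
    twiddle2(&a, &a);      /* aliased: 5 + 2*5 */
    assert(a == 15);
}
```

With distinct pointers the two functions agree. With xp == yp they do not, so a compiler that cannot rule out aliasing has to keep twiddle1 as written.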
Example
    x = 1000; y = 3000;
    *q = y;
    *p = x;
    t1 = *q;
[Q] What value will t1 have? 1000 or 3000?
    -> It depends on whether p and q point to the same location: 1000 if they do, 3000 if they do not.
    -> It is not easy even for us to understand the above code.
    -> Definitely not good programming style.
Compilers cannot replace the last statement with t1 = y;.
Optimization blockers
    Memory aliasing around pointers
    …
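A small sketch of the two possible outcomes, wrapping the statement sequence in a function so that p and q can be chosen by the caller. The names store_then_load and check_store_then_load are hypothetical.

```c
#include <assert.h>

/* The slide's statement sequence, with p and q supplied by the caller. */
int store_then_load(int *p, int *q) {
    int x = 1000, y = 3000;
    *q = y;
    *p = x;
    return *q;             /* this is t1 */
}

/* Hypothetical self-check; call it from main(). */
void check_store_then_load(void) {
    int a = 0, b = 0;
    assert(store_then_load(&a, &b) == 3000);  /* p != q: *q still holds y */
    assert(store_then_load(&a, &a) == 1000);  /* p == q: *p = x overwrote y */
}
```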
Example: Side Effect
    int f();
    int func1() { return f() + f() + f() + f(); }
    int func2() { return 4 * f(); }
[Q] Can you see any problem?
[Q] What if f() is defined as follows?
    int counter = 0;
    int f() { return counter++; }
[Q] What will func1() and func2() return?
[Q] Good programming style? How to improve?
Optimization blockers
    Memory aliasing around pointers
    Functions with a side effect
    …
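The question can be answered by running the code. This is a minimal sketch using the counter-based f() from the slide; check_side_effect is a hypothetical helper that resets the counter before each call.

```c
#include <assert.h>

int counter = 0;

/* f has a side effect: every call changes the global counter. */
int f(void) { return counter++; }

int func1(void) { return f() + f() + f() + f(); }
int func2(void) { return 4 * f(); }

/* Hypothetical self-check; call it from main(). */
void check_side_effect(void) {
    counter = 0;
    assert(func1() == 6);    /* 0 + 1 + 2 + 3, whatever order the calls run in */
    assert(counter == 4);    /* f was called four times */

    counter = 0;
    assert(func2() == 0);    /* 4 * 0 */
    assert(counter == 1);    /* f was called once */
}
```

Both the return value and the final state of counter differ, so the compiler may not turn func1 into func2 unless it can prove that f has no side effects.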
5.2 Expressing Program Performance
Cycles Per Element (CPE)
    Measures how many clock cycles are executed per element processed, rather than how fast the clock runs or how many lines of C are written.
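One common way to read CPE, assuming the running time of a loop grows roughly linearly with the number of elements n. The symbols T(n) for the total cycle count and T_0 for the fixed overhead are introduced here for illustration only.

```latex
T(n) \approx T_0 + \mathrm{CPE} \cdot n
```

Under this model, a hypothetical loop with 100 cycles of overhead and a CPE of 4 would need about 4100 cycles for n = 1000. For large n the CPE term dominates, which is why CPE is used to compare loop implementations.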
Example: loop unrolling
    void psum1(float a[], float p[], long int n) {
        long int i;
        p[0] = a[0];
        for (i = 1; i < n; i++)
            p[i] = p[i-1] + a[i];
    }

    void psum2(float a[], float p[], long int n) {
        long int i;
        float mid_val;
        p[0] = a[0];
        for (i = 1; i < n-1; i += 2) {   // two elements per iteration
            mid_val = p[i-1] + a[i];
            p[i] = mid_val;
            p[i+1] = mid_val + a[i+1];
        }
        if (i < n)                       // odd element left over
            p[i] = p[i-1] + a[i];
    }
[Q] Which one do you think runs faster?
[Q] Can you simply count the number of operations that access main memory?
    psum1: 3 (n-1)
    psum2: 5 (n-1)/2
Loop unrolling
    Possibly reduces the number of memory accesses.
    Possibly lets the processor execute several statements in parallel (instruction-level parallelism within a single core).
    In the previous example ???
5.3 Program Example
    typedef struct {          // vector abstract data type
        long int len;
        data_t *data;         // vector values
    } vec_rec, *vec_ptr;

    #define IDENT 0
    #define OP +

    void combine1(vec_ptr v, data_t *dest) {
        long int i;
        *dest = IDENT;
        for (i = 0; i < vec_length(v); i++) {   // it is good to hide len.
            data_t val;
            get_vec_element(v, i, &val);        // stores element i in val.
            *dest = *dest OP val;               // it makes sum.
        }
    }
[Q] Can you write vec_length() and get_vec_element()?
[Q] Can compilers optimize the above code well? Can you optimize it?
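One possible answer to the first question, as a sketch. It assumes data_t is int (the slides leave the type open) and uses the bounds-checked get_vec_element that appears later in the deck. check_combine1 is a hypothetical helper.

```c
#include <assert.h>

typedef int data_t;            /* assumption: the slides leave data_t open */
#define IDENT 0
#define OP +

typedef struct {               /* vector abstract data type */
    long int len;
    data_t *data;              /* vector values */
} vec_rec, *vec_ptr;

/* Return the number of elements in v. */
long int vec_length(vec_ptr v) {
    return v->len;
}

/* Copy element `index` into *dest; return 0 if out of bounds, 1 on success. */
int get_vec_element(vec_ptr v, long int index, data_t *dest) {
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

void combine1(vec_ptr v, data_t *dest) {
    long int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}

/* Hypothetical self-check; call it from main(). */
void check_combine1(void) {
    data_t values[] = {2, 3, 5};
    vec_rec v = {3, values};
    data_t sum, out = -1;

    combine1(&v, &sum);
    assert(sum == 10);
    /* An out-of-range index is rejected and leaves *dest untouched. */
    assert(get_vec_element(&v, 3, &out) == 0 && out == -1);
}
```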
5.4 Eliminating Loop Inefficiencies
Code motion
    Identify a computation that is performed multiple times (e.g., within a loop) but whose result does not change, and move it to a point where it is computed only once.
Example:
    void combine1(vec_ptr v, data_t *dest) {
        long int i;
        *dest = IDENT;
        for (i = 0; i < vec_length(v); i++) {
            data_t val;
            get_vec_element(v, i, &val);   // stores element i in val.
            *dest = *dest OP val;          // it makes sum.
        }
    }
Does vec_length() have a side effect? Or is the length of the vector changed in the loop? No. Then?
From the previous example:
    void combine2(vec_ptr v, data_t *dest) {
        long int i;
        long int length = vec_length(v);   // computed once, outside the loop
        *dest = IDENT;
        for (i = 0; i < length; i++) {
            data_t val;
            get_vec_element(v, i, &val);   // stores element i in val.
            *dest = *dest OP val;          // it makes sum.
        }
    }
Can we remove &val?
Example: Any problem? How can you improve?
    void lower1(char *s) {
        int i;
        for (i = 0; i < strlen(s); i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }
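A sketch of one improvement to lower1, applying the same code motion as combine2. The name lower2 and the helper check_lower2 are chosen here for illustration. The call to strlen() walks the whole string, so calling it in the loop test makes lower1 quadratic in the string length. The loop never writes or removes a terminating '\0', so the length can safely be computed once.

```c
#include <assert.h>
#include <string.h>

/* lower1 with the strlen() call moved out of the loop (code motion). */
void lower2(char *s) {
    size_t i;
    size_t len = strlen(s);    /* computed once: the loop never changes the length */
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');   /* subtracting a negative offset adds 32 in ASCII */
}

/* Hypothetical self-check; call it from main(). */
void check_lower2(void) {
    char buf[] = "Hello, WORLD 42";
    lower2(buf);
    assert(strcmp(buf, "hello, world 42") == 0);
}
```

A compiler usually cannot make this change on its own, because the loop body writes to s and it would have to prove that those writes never affect the result of strlen().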
Example:
    void set_row(double *a, double *b, long i, long n) {
        long j;
        for (j = 0; j < n; j++)
            a[n*i+j] = b[j];
    }
How to improve?
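One possible improvement, sketched under the assumption that a is an n x n matrix stored in row-major order. The product n*i does not depend on j, so it can be computed once before the loop. The names set_row_hoisted and check_set_row are hypothetical.

```c
#include <assert.h>

/* Copy b[0..n-1] into row i of the n x n row-major matrix a. */
void set_row_hoisted(double *a, double *b, long i, long n) {
    long j;
    long ni = n * i;           /* loop-invariant: computed once */
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
}

/* Hypothetical self-check; call it from main(). */
void check_set_row(void) {
    double a[9] = {0};
    double b[3] = {1.0, 2.0, 3.0};

    set_row_hoisted(a, b, 1, 3);
    /* Only the middle row (indices 3..5) is written. */
    assert(a[2] == 0.0 && a[6] == 0.0);
    assert(a[3] == 1.0 && a[4] == 2.0 && a[5] == 3.0);
}
```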
5.5 Reducing Procedure Calls
From the previous example: Any problem?
    typedef struct {          // vector abstract data type
        long int len;
        data_t *data;         // vector values
    } vec_rec, *vec_ptr;

    int get_vec_element(vec_ptr v, long int index, data_t *dest) {
        if (index < 0 || index >= v->len)   // bounds check on every call
            return 0;
        *dest = v->data[index];
        return 1;
    }

    void combine2(vec_ptr v, data_t *dest) {
        long int i;
        long int length = vec_length(v);
        *dest = IDENT;
        for (i = 0; i < length; i++) {
            data_t val;
            get_vec_element(v, i, &val);   // stores element i in val.
            *dest = *dest OP val;          // it makes sum.
        }
    }
From the previous example:
    typedef struct {          // vector abstract data type
        long int len;
        data_t *data;         // vector values
    } vec_rec, *vec_ptr;

    int get_vec_element(vec_ptr v, long int index, data_t *dest) {
        if (index < 0 || index >= v->len)
            return 0;
        *dest = v->data[index];
        return 1;
    }

    void combine3(vec_ptr v, data_t *dest) {
        long int i;
        long int length = vec_length(v);
        data_t *data = get_vec_start(v);   // v->data
        *dest = IDENT;
        for (i = 0; i < length; i++)
            *dest = *dest OP data[i];      // it makes sum.
    }
Can you write get_vec_start()?
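A minimal sketch of get_vec_start(), following the slide's own hint that it returns v->data. The type data_t is assumed to be int here, and check_combine3 is a hypothetical helper.

```c
#include <assert.h>

typedef int data_t;            /* assumption: the slides leave data_t open */
#define IDENT 0
#define OP +

typedef struct {
    long int len;
    data_t *data;
} vec_rec, *vec_ptr;

long int vec_length(vec_ptr v) { return v->len; }

/* Return a pointer to the first element, so callers can index directly. */
data_t *get_vec_start(vec_ptr v) {
    return v->data;
}

void combine3(vec_ptr v, data_t *dest) {
    long int i;
    long int length = vec_length(v);
    data_t *data = get_vec_start(v);
    *dest = IDENT;
    for (i = 0; i < length; i++)
        *dest = *dest OP data[i];   /* no function call, no bounds check */
}

/* Hypothetical self-check; call it from main(). */
void check_combine3(void) {
    data_t values[] = {2, 3, 5};
    vec_rec v = {3, values};
    data_t sum = -1;

    assert(get_vec_start(&v) == values);
    combine3(&v, &sum);
    assert(sum == 10);
}
```

The design trade-off is that combine3 is faster but depends on the internal layout of the vector, which the abstract data type was meant to hide, and it no longer gets the bounds check.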
5.6 Eliminating Unneeded Memory References
From the previous example: Any problem?
    void combine3(vec_ptr v, data_t *dest) {
        long int i;
        long int length = vec_length(v);
        data_t *data = get_vec_start(v);   // v->data
        *dest = IDENT;
        for (i = 0; i < length; i++)
            *dest = *dest OP data[i];      // accumulates into *dest.
    }

    // the statement in the for loop
    // data_t = int; OP = *; i in %edx, data in %ecx, dest in %ebx
    movl (%ebx), %eax              # read *dest from memory
    imull (%ecx, %edx, 4), %eax    # multiply by data[i]
    movl %eax, (%ebx)              # write the result back to *dest
From the previous example:
    void combine4(vec_ptr v, data_t *dest) {
        long int i;
        long int length = vec_length(v);
        data_t *data = get_vec_start(v);   // v->data
        data_t acc = IDENT;                // can be kept in a register
        for (i = 0; i < length; i++)
            acc = acc OP data[i];          // accumulates in acc.
        *dest = acc;                       // single write to memory
    }

    // the statement in the for loop
    // data_t = int; OP = *; i in %edx, data in %ecx, acc in %eax
    imull (%ecx, %edx, 4), %eax    # acc = acc * data[i]
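A sketch of why a compiler cannot always turn combine3 into combine4 by itself: the two behave differently when dest aliases an element of the vector. It assumes data_t is int with OP set to * and IDENT set to 1, matching the assembly comment rather than the earlier #define lines. check_combine_aliasing is a hypothetical helper.

```c
#include <assert.h>

typedef int data_t;
/* Product version, as in the slide's assembly comment (OP = *). */
#define IDENT 1
#define OP *

typedef struct {
    long int len;
    data_t *data;
} vec_rec, *vec_ptr;

long int vec_length(vec_ptr v) { return v->len; }
data_t *get_vec_start(vec_ptr v) { return v->data; }

/* Reads and writes *dest on every iteration. */
void combine3(vec_ptr v, data_t *dest) {
    long int i;
    long int length = vec_length(v);
    data_t *data = get_vec_start(v);
    *dest = IDENT;
    for (i = 0; i < length; i++)
        *dest = *dest OP data[i];
}

/* Accumulates in a local variable and writes *dest once. */
void combine4(vec_ptr v, data_t *dest) {
    long int i;
    long int length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;
    for (i = 0; i < length; i++)
        acc = acc OP data[i];
    *dest = acc;
}

/* Hypothetical self-check; call it from main(). */
void check_combine_aliasing(void) {
    data_t x[] = {2, 3, 5}, y[] = {2, 3, 5};
    vec_rec vx = {3, x}, vy = {3, y};
    data_t r3, r4;

    /* Separate destination: both compute 2 * 3 * 5. */
    combine3(&vx, &r3);
    combine4(&vx, &r4);
    assert(r3 == 30 && r4 == 30);

    /* Destination is the last element of the vector itself. */
    combine3(&vx, &x[2]);   /* [2,3,1] -> [2,3,2] -> [2,3,6] -> [2,3,36] */
    combine4(&vy, &y[2]);   /* acc = 1 * 2 * 3 * 5, stored at the end */
    assert(x[2] == 36);
    assert(y[2] == 30);
}
```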
5.13 Performance Improvement Techniques
High-level design
    Appropriate algorithms and data structures
Basic coding principles
    Elimination of loop inefficiency
    Elimination of excessive function calls
    Elimination of unnecessary memory references: introduce temporary variables to hold intermediate results.
    Elimination of pointers if possible
    …
Low-level optimizations
    Unroll loops to reduce overhead and to enable further optimizations.
    Find ways to increase instruction-level parallelism.
Unroll loops to reduce overhead and to enable further optimizations.
Find ways to increase instruction-level parallelism.
    for (i = 0; i < length; i++)
        acc = acc OP data[i];           // it makes sum.
    *dest = acc;
    //-------------------------
    limit = length - 1;
    for (i = 0; i < limit; i += 2) {    // combine two elements
        acc0 = acc0 OP data[i];         // two statements at a time
        acc1 = acc1 OP data[i+1];
    }
    for (; i < length; i++)             // finish any remaining elements
        acc1 = acc1 OP data[i];
    *dest = acc0 OP acc1;
Example: Convert the following code to use 4-way loop unrolling:
    for (i = 0; i < length; i++)
        sum = sum + udata[i] * vdata[i];
    *dest = sum;
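One possible answer, as a sketch. It wraps the loop in a function with the hypothetical name inner4, assumes data_t is double, and keeps a single accumulator so that the additions happen in the same order as in the original loop. check_inner4 is a hypothetical helper.

```c
#include <assert.h>

typedef double data_t;         /* assumption: the exercise does not fix the type */

/* Dot product of udata and vdata with the loop unrolled four ways. */
void inner4(const data_t *udata, const data_t *vdata, long length, data_t *dest) {
    long i;
    long limit = length - 3;   /* last index at which a full group of four fits */
    data_t sum = 0;

    /* Four elements per iteration: one loop test and one index update per group. */
    for (i = 0; i < limit; i += 4) {
        sum = sum + udata[i]     * vdata[i];
        sum = sum + udata[i + 1] * vdata[i + 1];
        sum = sum + udata[i + 2] * vdata[i + 2];
        sum = sum + udata[i + 3] * vdata[i + 3];
    }
    /* Finish the zero to three remaining elements. */
    for (; i < length; i++)
        sum = sum + udata[i] * vdata[i];
    *dest = sum;
}

/* Hypothetical self-check; call it from main(). */
void check_inner4(void) {
    data_t u[7] = {1, 2, 3, 4, 5, 6, 7};
    data_t v[7] = {2, 2, 2, 2, 2, 2, 2};
    data_t r = -1;

    inner4(u, v, 7, &r);       /* 2 * (1 + 2 + ... + 7) */
    assert(r == 56.0);
    inner4(u, u, 4, &r);       /* 1 + 4 + 9 + 16 */
    assert(r == 30.0);
}
```

Using four separate accumulators would expose more instruction-level parallelism, but for floating-point data it changes the order of the additions and can therefore change the rounded result.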
Example: Improve the following code by using a word of data type unsigned long to pack four copies of c:
    void *basic_memset(void *s, int c, int n) {
        int cnt = 0;
        unsigned char *schar = s;
        while (cnt < n) {
            *schar = (unsigned char) c;   // one byte per iteration
            schar++;
            cnt++;
        }
        return s;
    }
    void *memset(void *s, int c, int n) {
        int cnt = 0;
        int length = n / 4;              // number of full 4-byte words
        unsigned ic;
        unsigned uc = c & 0xff;          // low byte only, as unsigned
        unsigned char *schar = s;
        unsigned int *si = s;            // assumes s is aligned for unsigned int
        // Parentheses are required: + binds tighter than <<.
        ic = (uc << 24) | (uc << 16) | (uc << 8) | uc;
        while (cnt < length) {           // store four bytes at a time
            *si = ic;
            si++;
            cnt++;
        }
        cnt = length * 4;
        schar += length * 4;
        while (cnt < n) {                // store the remaining bytes
            *schar = (unsigned char) c;
            schar++;
            cnt++;
        }
        return s;
    }
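A variant sketch of the same idea that uses unsigned long, as the exercise statement asks, and does not assume that s is word-aligned or that a word is four bytes. The name packed_memset is hypothetical, chosen to avoid clashing with the library memset. The code assumes 8-bit bytes and relies on the traditional practice of writing words through a cast pointer, as in the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill n bytes at s with the low byte of c, using word stores where possible. */
void *packed_memset(void *s, int c, size_t n) {
    unsigned char *schar = s;
    unsigned char byte = (unsigned char) c;
    unsigned long word = 0;
    unsigned long *sword;
    size_t k;

    /* Pack sizeof(unsigned long) copies of the byte into one word. */
    for (k = 0; k < sizeof(unsigned long); k++)
        word = (word << 8) | byte;

    /* Byte stores until the address is word-aligned. */
    while (n > 0 && (uintptr_t) schar % sizeof(unsigned long) != 0) {
        *schar++ = byte;
        n--;
    }

    /* Word stores for the aligned middle part. */
    sword = (unsigned long *) schar;
    while (n >= sizeof(unsigned long)) {
        *sword++ = word;
        n -= sizeof(unsigned long);
    }

    /* Byte stores for the tail. */
    schar = (unsigned char *) sword;
    while (n > 0) {
        *schar++ = byte;
        n--;
    }
    return s;
}

/* Hypothetical self-check; call it from main(). */
void check_packed_memset(void) {
    unsigned char buf[32] = {0};
    size_t k;

    packed_memset(buf + 3, 0x5A, 20);
    assert(buf[2] == 0 && buf[23] == 0);      /* neighbors untouched */
    for (k = 3; k < 23; k++)
        assert(buf[k] == 0x5A);
}
```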
Reduction in Strength
Carnegie Mellon
Replace costly operation with simpler one
    Shift, add instead of multiply or divide
        16*x --> x << 4
    Utility machine dependent
    Depends on cost of multiply or divide instruction
        On Intel Nehalem, integer multiply requires 3 CPU cycles
Recognize sequence of products
    Original:
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                a[n*i + j] = b[j];
    After strength reduction:
        int ni = 0;
        for (i = 0; i < n; i++) {
            for (j = 0; j < n; j++)
                a[ni + j] = b[j];
            ni += n;
        }
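The two loop nests can be compared directly. This is a sketch with hypothetical function names; both versions copy b into every row of an n x n row-major matrix a, and the second replaces the multiplication n*i with a running sum.

```c
#include <assert.h>

/* Original form: the index expression multiplies n by i. */
void fill_rows_mul(double *a, const double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n * i + j] = b[j];
}

/* Strength-reduced form: the products n*0, n*1, ... are built by addition. */
void fill_rows_add(double *a, const double *b, long n) {
    long i, j;
    long ni = 0;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
        ni += n;               /* ni == n * (i + 1) for the next row */
    }
}

/* Hypothetical self-check; call it from main(). */
void check_fill_rows(void) {
    double b[3] = {1.0, 2.0, 3.0};
    double a1[9] = {0}, a2[9] = {0};
    long k;

    fill_rows_mul(a1, b, 3);
    fill_rows_add(a2, b, 3);
    for (k = 0; k < 9; k++) {
        assert(a1[k] == a2[k]);
        assert(a1[k] == b[k % 3]);
    }
}
```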
Share Common Subexpressions
Reuse portions of expressions
Compilers often not very sophisticated in exploiting arithmetic properties
    Original: 3 multiplications: i*n, (i-1)*n, (i+1)*n
        /* Sum neighbors of i,j */
        up    = val[(i-1)*n + j  ];
        down  = val[(i+1)*n + j  ];
        left  = val[i*n     + j-1];
        right = val[i*n     + j+1];
        sum = up + down + left + right;

        leaq   1(%rsi), %rax    # i+1
        leaq   -1(%rsi), %r8    # i-1
        imulq  %rcx, %rsi       # i*n
        imulq  %rcx, %rax       # (i+1)*n
        imulq  %rcx, %r8        # (i-1)*n
        addq   %rdx, %rsi       # i*n+j
        addq   %rdx, %rax       # (i+1)*n+j
        addq   %rdx, %r8        # (i-1)*n+j

    Shared subexpression: 1 multiplication: i*n
        long inj = i*n + j;
        up    = val[inj - n];
        down  = val[inj + n];
        left  = val[inj - 1];
        right = val[inj + 1];
        sum = up + down + left + right;

        imulq  %rcx, %rsi           # i*n
        addq   %rdx, %rsi           # i*n+j
        movq   %rsi, %rax           # i*n+j
        subq   %rcx, %rax           # i*n+j-n
        leaq   (%rsi,%rcx), %rcx    # i*n+j+n
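A runnable sketch of both forms, wrapped in functions with hypothetical names and assuming val holds long values in an n-column row-major grid. It checks that sharing the subexpression i*n + j does not change the result for an interior cell.

```c
#include <assert.h>

/* Direct form: computes (i-1)*n, (i+1)*n and i*n separately. */
long sum_neighbors_direct(const long *val, long i, long j, long n) {
    long up    = val[(i - 1) * n + j];
    long down  = val[(i + 1) * n + j];
    long left  = val[i * n + j - 1];
    long right = val[i * n + j + 1];
    return up + down + left + right;
}

/* Shared form: one multiplication, then additions and subtractions. */
long sum_neighbors_shared(const long *val, long i, long j, long n) {
    long inj = i * n + j;      /* index of the cell itself */
    long up    = val[inj - n];
    long down  = val[inj + n];
    long left  = val[inj - 1];
    long right = val[inj + 1];
    return up + down + left + right;
}

/* Hypothetical self-check; call it from main(). */
void check_sum_neighbors(void) {
    /* 3 x 3 grid:  1 2 3 / 4 5 6 / 7 8 9 */
    long grid[9] = {1, 2, 3, 4, 5, 6, 7, 8, 9};

    /* Neighbors of the center cell: 2 (up), 8 (down), 4 (left), 6 (right). */
    assert(sum_neighbors_direct(grid, 1, 1, 3) == 20);
    assert(sum_neighbors_shared(grid, 1, 1, 3) == 20);
}
```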