COMP 2130 Intro Computer Systems Thompson Rivers University

Presentation transcript:

Code Optimization
Winter 2013, COMP 2130 Intro Computer Systems, Computing Science, Thompson Rivers University

Your vision? Seek with all your heart?

Course Objectives

The better you know computer systems, the better you program.

- Computer system; C programming language
- Computer architecture
- CPU (Central Processing Unit)
- IA32 assembly language
- Introduction to C language
- Compiling, linking, loading, executing
- Physical main memory
- MMU (Memory Management Unit)
- Virtual memory space
- Memory hierarchy
- Cache
- Dynamic memory management
- Better coding: locality
- Reliable and efficient programming for power programmers (to avoid strange errors, to optimize code, to avoid security holes, ...)

TRU-COMP2130 Code Optimization

Course Contents

- Introduction to computer systems: B&O 1
- Introduction to C programming: K&R 1 – 4
- Data representations: B&O 2.1 – 2.4
- C: advanced topics: K&R 5.1 – 5.10, 6 – 7
- Introduction to IA32 (Intel Architecture 32): B&O 3.1 – 3.8, 3.13
- Compiling, linking, loading, and executing: B&O 7 (except 7.12)
- Dynamic memory management (heap): B&O 9.1–2, 9.3–4, 9.9.1–2, 9.9.4–5, 9.11
- Code optimization: B&O 5.1 – 5.6, 5.13
- Memory hierarchy, locality, caching: B&O 5.12, 6.1 – 6.3, 6.4.1 – 6.4.2, 6.5, 6.6.2 – 6.6.3, 6.7
- Virtual memory (if time permits): B&O 9.4 – 9.5

Unit Learning Objectives

- List the two optimization blockers.
- Give examples of the two optimization blockers.
- Use optimization techniques.

Unit Contents

Introduction

The primary objective in writing a program is to make it work correctly under all possible conditions. Making a program run fast is also an important consideration.

[Q] How do you write an efficient program?
- Choose appropriate algorithms and data structures.
- Write source code that the compiler can effectively optimize into efficient executable code.

For the second point, it is important to understand the capabilities and limitations of optimizing compilers.

However, programmers must make a trade-off between how easy a program is to implement and maintain, and how fast it runs.

Modern compilers employ sophisticated forms of analysis and optimization. Even the best compilers, however, can be thwarted by optimization blockers: aspects of the program's behavior that depend strongly on the execution environment.

Optimization blockers can confuse programmers as well and lead to logic errors.

Programmers must assist the compiler by writing code that can be optimized readily.

5.1 Limitations of Optimizing Compilers

Higher optimization levels of gcc can improve program performance, but they may increase program size and make the program harder to debug with standard debugging tools.

Compilers must be careful to apply only safe optimizations to a program.

Example: Memory Aliasing

    void twiddle1(int *xp, int *yp)
    {
        *xp += *yp;
        *xp += *yp;
    }

[Q] Can twiddle1 be replaced by twiddle2?

    void twiddle2(int *xp, int *yp)
    {
        *xp += 2 * *yp;
    }

[Q] What if xp == yp, i.e. both pointers refer to the same location? In twiddle1, *xp is doubled twice and ends up at four times its original value, but in twiddle2, *xp ends up at three times its original value.

[Q] Is it good programming style to pass pointers and modify data through them? How would you improve twiddle1()?
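
The aliasing question can be checked directly. The sketch below assumes that twiddle1 adds *yp to *xp twice, as in B&O Section 5.1; run_aliased and run_distinct are hypothetical helpers added only for this illustration.

```c
/* Adds *yp to *xp twice (assumed two-statement form of twiddle1). */
void twiddle1(int *xp, int *yp)
{
    *xp += *yp;
    *xp += *yp;
}

/* Adds 2 * *yp to *xp in a single step. */
void twiddle2(int *xp, int *yp)
{
    *xp += 2 * *yp;
}

/* Hypothetical helper: calls fn with both pointers referring to the same int. */
int run_aliased(void (*fn)(int *, int *), int start)
{
    int x = start;
    fn(&x, &x);
    return x;
}

/* Hypothetical helper: calls fn with two distinct ints and returns the new x. */
int run_distinct(void (*fn)(int *, int *), int x, int y)
{
    fn(&x, &y);
    return x;
}
```

With distinct locations both functions agree, e.g. run_distinct(twiddle1, 5, 3) and run_distinct(twiddle2, 5, 3) both give 11. With aliased pointers they differ: run_aliased(twiddle1, 5) gives 20 while run_aliased(twiddle2, 5) gives 15, which is why a compiler may not substitute one for the other.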

Example

    x = 1000; y = 3000;
    *q = y;
    *p = x;
    t1 = *q;

[Q] What value will t1 have, 1000 or 3000?
-> It depends on whether p and q point to the same location: 1000 if they do, 3000 if they do not. It is not easy even for us to understand the above code.
-> Definitely not a good programming style.

Compilers cannot replace the last statement with t1 = y;.

Optimization blockers
- Memory aliasing around pointers
- ...
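
A small sketch of the same statements wrapped in a function, so that the caller decides whether p and q alias. The function name store_then_load is hypothetical and not part of the slides.

```c
/* Hypothetical wrapper around the slide's statements. */
int store_then_load(int *p, int *q)
{
    int x = 1000, y = 3000;
    int t1;

    *q = y;      /* store 3000 through q */
    *p = x;      /* store 1000 through p; this overwrites *q when p == q */
    t1 = *q;     /* must be reloaded from memory, it may no longer be y */
    return t1;
}
```

Called with two different variables the result is 3000; called with the same variable for both arguments it is 1000.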

Example: Side Effect

    int f();

    int func1() { return f() + f() + f() + f(); }

    int func2() { return 4 * f(); }

[Q] Can you see any problem?
[Q] What if f is defined as follows?

    int counter = 0;

    int f() { return counter++; }

[Q] What will func1() and func2() return?
[Q] Good programming style? How to improve?

Optimization blockers
- Memory aliasing around pointers
- Functions with a side effect
- ...
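
The two questions about return values can be answered by running the code. In this sketch func1_from_zero and func2_from_zero are hypothetical helpers that reset the counter first, so each call starts from the same state.

```c
static int counter = 0;

/* f has a side effect: every call changes counter. */
int f(void) { return counter++; }

int func1(void) { return f() + f() + f() + f(); }

int func2(void) { return 4 * f(); }

/* Hypothetical helpers: reset the shared state before each experiment. */
int func1_from_zero(void) { counter = 0; return func1(); }
int func2_from_zero(void) { counter = 0; return func2(); }
```

Starting from counter == 0, func1 returns 0 + 1 + 2 + 3 = 6 (the order of the four calls does not change the sum), while func2 returns 4 * 0 = 0. Because the results differ, the compiler cannot replace four calls with one.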

5.2 Expressing Program Performance

Cycles Per Element (CPE)
- Performance is expressed as the number of clock cycles spent per element processed.
- What counts is how many instructions (cycles) are executed, not the number of lines of C code and not how fast the clock runs.

Example: loop unrolling

    void psum1(float a[], float p[], long int n)
    {
        long int i;

        p[0] = a[0];
        for (i = 1; i < n; i++)
            p[i] = p[i-1] + a[i];
    }

    void psum2(float a[], float p[], long int n)
    {
        long int i;
        float mid_val;

        p[0] = a[0];
        for (i = 1; i < n-1; i += 2) {
            mid_val = p[i-1] + a[i];
            p[i] = mid_val;
            p[i+1] = mid_val + a[i+1];
        }
        if (i < n)                      // one element left over
            p[i] = p[i-1] + a[i];
    }

[Q] Which one do you think runs faster?
[Q] Can you simply count the number of operations that access main memory?
    psum1: 3 × (n-1)
    psum2: 5 × (n-1)/2

Loop unrolling
- Possibly reduces the number of memory accesses.
- Possibly lets multiple statements execute in parallel on the CPU's multiple functional units (instruction-level parallelism).
- In the previous example: ???

5.3 Program Example

    typedef struct {              // vector abstract data type
        long int len;
        data_t *data;             // vector values
    } vec_rec, *vec_ptr;

    #define IDENT 0
    #define OP +

    void combine1(vec_ptr v, data_t *dest)
    {
        long int i;

        *dest = IDENT;
        for (i = 0; i < vec_length(v); i++) {   // it is good to hide len
            data_t val;
            get_vec_element(v, i, &val);        // stores element i in val
            *dest = *dest OP val;               // accumulates the sum
        }
    }

[Q] Can you write vec_length() and get_vec_element()?
[Q] Compilers can optimize the above code well. Can you optimize it?
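
One possible answer to the first question, as a minimal sketch. It assumes data_t is int (the slide leaves the type open). new_vec and sum_1_to_n are hypothetical helpers added for this illustration; get_vec_element follows the definition shown in Section 5.5.

```c
#include <stdlib.h>

typedef int data_t;               /* assumption: element type is int */

#define IDENT 0
#define OP +

typedef struct {                  /* vector abstract data type */
    long int len;
    data_t *data;                 /* vector values */
} vec_rec, *vec_ptr;

/* Hypothetical constructor: allocates a zero-filled vector of len elements. */
vec_ptr new_vec(long int len)
{
    vec_ptr v = malloc(sizeof(vec_rec));
    if (v == NULL)
        return NULL;
    v->len = len;
    v->data = NULL;
    if (len > 0) {
        v->data = calloc((size_t) len, sizeof(data_t));
        if (v->data == NULL) {
            free(v);
            return NULL;
        }
    }
    return v;
}

/* Returns the number of elements; hides the len field from callers. */
long int vec_length(vec_ptr v)
{
    return v->len;
}

/* Copies element index into *dest; returns 0 if index is out of bounds. */
int get_vec_element(vec_ptr v, long int index, data_t *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

void combine1(vec_ptr v, data_t *dest)
{
    long int i;

    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}

/* Hypothetical driver: fills a vector with 1..n and combines it. */
data_t sum_1_to_n(long int n)
{
    data_t result = IDENT;
    vec_ptr v = new_vec(n);
    if (v == NULL)
        return result;
    for (long int i = 0; i < n; i++)
        v->data[i] = (data_t) (i + 1);
    combine1(v, &result);
    free(v->data);
    free(v);
    return result;
}
```

For example, sum_1_to_n(4) combines the elements 1, 2, 3, 4 with OP set to + and yields 10.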

5.4 Eliminating Loop Inefficiencies

Code motion
- Identify a computation that is performed multiple times (e.g., within a loop) but whose result does not change, and move it so that it is computed only once.

Example:

    void combine1(vec_ptr v, data_t *dest)
    {
        long int i;

        *dest = IDENT;
        for (i = 0; i < vec_length(v); i++) {
            data_t val;
            get_vec_element(v, i, &val);        // stores element i in val
            *dest = *dest OP val;               // accumulates the sum
        }
    }

Does vec_length() have a side effect? Or is the length of the vector changed in the loop? No. Then?

From the previous example:

    void combine2(vec_ptr v, data_t *dest)
    {
        long int i;
        long int length = vec_length(v);        // moved out of the loop

        *dest = IDENT;
        for (i = 0; i < length; i++) {
            data_t val;
            get_vec_element(v, i, &val);        // stores element i in val
            *dest = *dest OP val;               // accumulates the sum
        }
    }

Can we remove &val?

Example: Any problem? How can you improve?

    void lower1(char *s)
    {
        int i;

        for (i = 0; i < strlen(s); i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= 'A' - 'a';
    }
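
A sketch of one way to improve lower1, applying the same code motion as combine2. The problem in lower1 is that strlen() walks the whole string on every iteration, so the loop takes time quadratic in the string length. The name lower2 is an assumption (it follows the combine1/combine2 naming).

```c
#include <stddef.h>
#include <string.h>

/* Converts upper-case ASCII letters in s to lower case, in place. */
void lower2(char *s)
{
    size_t i;
    size_t len = strlen(s);          /* computed once, outside the loop */

    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= 'A' - 'a';       /* 'A' - 'a' is negative, so this adds 32 */
}
```

The transformation is safe because the loop never writes a '\0' into the string, so its length cannot change while the loop runs. A compiler generally cannot prove this on its own, since the loop body does modify s.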

Example:

    void set_row(double *a, double *b, long i, long n)
    {
        long j;

        for (j = 0; j < n; j++)
            a[n*i+j] = b[j];
    }

How to improve?
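
One possible improvement, sketched below: n*i does not change inside the loop, so it can be computed once before the loop. The name set_row2 is hypothetical.

```c
/* Copies the n elements of b into row i of the n x n matrix a. */
void set_row2(double *a, double *b, long i, long n)
{
    long j;
    long ni = n * i;                 /* loop-invariant, computed once */

    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
}
```

Optimizing compilers often perform this particular hoisting themselves, because the multiplication has no side effects; writing it explicitly mainly documents the intent.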

5.5 Reducing Procedure Calls

From the previous example: Any problem?

    typedef struct {              // vector abstract data type
        long int len;
        data_t *data;             // vector values
    } vec_rec, *vec_ptr;

    int get_vec_element(vec_ptr v, long int index, data_t *dest)
    {
        if (index < 0 || index >= v->len)
            return 0;
        *dest = v->data[index];
        return 1;
    }

    void combine2(vec_ptr v, data_t *dest)
    {
        long int i;
        long int length = vec_length(v);

        *dest = IDENT;
        for (i = 0; i < length; i++) {
            data_t val;
            get_vec_element(v, i, &val);        // stores element i in val
            *dest = *dest OP val;               // accumulates the sum
        }
    }

From the previous example:

    typedef struct {              // vector abstract data type
        long int len;
        data_t *data;             // vector values
    } vec_rec, *vec_ptr;

    int get_vec_element(vec_ptr v, long int index, data_t *dest)
    {
        if (index < 0 || index >= v->len)
            return 0;
        *dest = v->data[index];
        return 1;
    }

    void combine3(vec_ptr v, data_t *dest)
    {
        long int i;
        long int length = vec_length(v);

        *dest = IDENT;
        data_t *data = get_vec_start(v);        // v->data
        for (i = 0; i < length; i++)
            *dest = *dest OP data[i];           // accumulates the sum
    }

Can you write get_vec_start()?

5.6 Eliminating Unneeded Memory References

From the previous example: Any problem?

    void combine3(vec_ptr v, data_t *dest)
    {
        long int i;
        long int length = vec_length(v);
        data_t *data = get_vec_start(v);        // v->data

        *dest = IDENT;
        for (i = 0; i < length; i++)
            *dest = *dest OP data[i];           // accumulates the result
    }

    // the statement in the for loop
    // data_t = int; OP = *; i in %edx, data in %ecx, dest in %ebx
    movl   (%ebx), %eax               // read *dest from memory
    imull  (%ecx, %edx, 4), %eax      // multiply by data[i]
    movl   %eax, (%ebx)               // write the result back to *dest

From the previous example:

    void combine4(vec_ptr v, data_t *dest)
    {
        long int i;
        long int length = vec_length(v);
        data_t *data = get_vec_start(v);        // v->data
        data_t acc = IDENT;                     // can be kept in a register

        for (i = 0; i < length; i++)
            acc = acc OP data[i];               // accumulates the result
        *dest = acc;
    }

    // the statement in the for loop
    // data_t = int; OP = *; i in %edx, data in %ecx, acc in %eax
    imull  (%ecx, %edx, 4), %eax
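
Why can the compiler not turn combine3 into combine4 by itself? Memory aliasing again: dest might point into the vector. The sketch below assumes data_t = int and OP = * (as in the assembly comments above), gives one possible get_vec_start(), and uses a hypothetical driver last_after that passes the address of the last element as dest.

```c
typedef int data_t;               /* as in the assembly comment: data_t = int */

#define IDENT 1                   /* identity element for OP = * */
#define OP *

typedef struct {
    long int len;
    data_t *data;
} vec_rec, *vec_ptr;

long int vec_length(vec_ptr v) { return v->len; }

/* One possible get_vec_start(): exposes the underlying array. */
data_t *get_vec_start(vec_ptr v) { return v->data; }

void combine3(vec_ptr v, data_t *dest)
{
    long int i;
    long int length = vec_length(v);
    data_t *data = get_vec_start(v);

    *dest = IDENT;
    for (i = 0; i < length; i++)
        *dest = *dest OP data[i];     /* reads and writes memory every time */
}

void combine4(vec_ptr v, data_t *dest)
{
    long int i;
    long int length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;

    for (i = 0; i < length; i++)
        acc = acc OP data[i];         /* accumulates in a local variable */
    *dest = acc;
}

/* Hypothetical driver: combines {2, 3, 5} with dest aliasing the last element. */
data_t last_after(void (*combine)(vec_ptr, data_t *))
{
    data_t values[3] = {2, 3, 5};
    vec_rec v = {3, values};

    combine(&v, &values[2]);
    return values[2];
}
```

With this aliasing, combine3 first overwrites the last element with 1 and then computes 1*2 = 2, 2*3 = 6, 6*6 = 36, while combine4 computes 2*3*5 = 30. The two functions agree only when dest does not point into the vector, which the programmer may know but the compiler cannot assume.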

5.13 Performance Improvement Techniques

High-level design
- Appropriate algorithms and data structures

Basic coding principles
- Elimination of loop inefficiency
- Elimination of excessive function calls
- Elimination of unnecessary memory references: introduce temporary variables to hold intermediate results.
- Elimination of pointers if possible
- ...

Low-level optimizations
- Unroll loops to reduce overhead and to enable further optimizations.
- Find ways to increase instruction-level parallelism.

Unroll loops to reduce overhead and to enable further optimizations.
Find ways to increase instruction-level parallelism.

    // acc starts at IDENT
    for (i = 0; i < length; i++)
        acc = acc OP data[i];               // accumulates the result
    *dest = acc;

    //-------------------------

    // acc0 and acc1 both start at IDENT
    limit = length - 1;
    for (i = 0; i < limit; i += 2) {        // combine two elements
        acc0 = acc0 OP data[i];             // two statements at a time
        acc1 = acc1 OP data[i+1];
    }
    for (; i < length; i++)                 // finish any remaining elements
        acc1 = acc1 OP data[i];
    *dest = acc0 OP acc1;

Example: Convert the following code to use 4-way loop unrolling:

    for (i = 0; i < length; i++)
        sum = sum + udata[i] * vdata[i];
    *dest = sum;
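
One possible solution, as a sketch. It assumes data_t is long so that results can be compared exactly; the function names inner1 and inner4 and the parameter list are hypothetical, since the slide shows only the loop.

```c
typedef long data_t;              /* assumption: integer elements */

/* Reference version: the loop from the slide. */
void inner1(const data_t *udata, const data_t *vdata, long length, data_t *dest)
{
    long i;
    data_t sum = 0;

    for (i = 0; i < length; i++)
        sum = sum + udata[i] * vdata[i];
    *dest = sum;
}

/* 4-way unrolled version: four elements per loop iteration. */
void inner4(const data_t *udata, const data_t *vdata, long length, data_t *dest)
{
    long i;
    long limit = length - 3;      /* last index where i+3 is still in range */
    data_t sum = 0;

    for (i = 0; i < limit; i += 4) {
        sum = sum + udata[i]   * vdata[i]
                  + udata[i+1] * vdata[i+1]
                  + udata[i+2] * vdata[i+2]
                  + udata[i+3] * vdata[i+3];
    }
    for (; i < length; i++)       /* finish the 0 to 3 remaining elements */
        sum = sum + udata[i] * vdata[i];
    *dest = sum;
}
```

The unrolled loop performs the loop test and index update once per four elements instead of once per element. It still uses a single accumulator; splitting sum into several accumulators, as in the two-accumulator example above, would be a further step.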

Example: Improve the following code by using a word of data type unsigned long to pack four copies of c:

    void *basic_memset(void *s, int c, int n)
    {
        int cnt = 0;
        unsigned char *schar = s;

        while (cnt < n) {
            *schar = (unsigned char) c;
            schar++;
            cnt++;
        }
        return s;
    }

    void *memset(void *s, int c, int n)
    {
        int cnt = 0;
        int length = n / 4;                 // number of whole 4-byte words
        unsigned ic;
        unsigned uc = c & 0xff;             // keep only the low byte of c
        unsigned char *schar = s;
        unsigned int *si = s;

        // four copies of the byte; parentheses are required because
        // + binds more tightly than <<
        ic = (uc << 24) | (uc << 16) | (uc << 8) | uc;
        while (cnt < length) {              // store one word at a time
            *si = ic;
            si++;
            cnt++;
        }
        cnt = length * 4;
        schar += length * 4;
        while (cnt < n) {                   // store the remaining bytes
            *schar = (unsigned char) c;
            schar++;
            cnt++;
        }
        return s;
    }
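
A variant sketch that can be compiled and tested next to the C library. It differs from the slide in three labeled ways: it is named word_memset (hypothetical) to avoid clashing with the library's memset, it packs sizeof(unsigned long) copies of the byte rather than exactly four, and it stores each word with memcpy so that it makes no assumption about the alignment of s.

```c
#include <stddef.h>
#include <string.h>

/* Fills the first n bytes of s with the low byte of c, one word at a time. */
void *word_memset(void *s, int c, size_t n)
{
    unsigned char *schar = s;
    unsigned char byte = (unsigned char) c;
    const size_t word_size = sizeof(unsigned long);
    unsigned long word = 0;
    size_t cnt = 0;
    size_t k;

    /* Pack word_size copies of the byte into one word. */
    for (k = 0; k < word_size; k++)
        word = (word << 8) | byte;

    /* Store whole words; memcpy avoids any alignment requirement on s. */
    while (cnt + word_size <= n) {
        memcpy(schar + cnt, &word, word_size);
        cnt += word_size;
    }

    /* Store the remaining bytes one at a time. */
    while (cnt < n) {
        schar[cnt] = byte;
        cnt++;
    }
    return s;
}
```

Because every byte of the packed word is the same, the result does not depend on byte order.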

Carnegie Mellon

Reduction in Strength
- Replace a costly operation with a simpler one.
- Shift, add instead of multiply or divide: 16*x --> x << 4
    - Utility is machine dependent.
    - Depends on the cost of the multiply or divide instruction.
    - On Intel Nehalem, integer multiply requires 3 CPU cycles.
- Recognize a sequence of products.

Before:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];

After:

    int ni = 0;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
        ni += n;
    }
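
Both forms of strength reduction can be checked for equivalence with a short sketch. The function names below are hypothetical; the loop bodies are the two versions from the slide, wrapped in functions so they can be called.

```c
/* Multiplication in the index on every row. */
void fill_rows_mul(double *a, const double *b, long n)
{
    long i, j;

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];
}

/* The product n*i replaced by a running sum: ni takes the values 0, n, 2n, ... */
void fill_rows_add(double *a, const double *b, long n)
{
    long i, j;
    long ni = 0;

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
        ni += n;
    }
}

/* Multiply by 16 versus shift left by 4, for unsigned operands. */
unsigned long times16_mul(unsigned long x)   { return 16 * x; }
unsigned long times16_shift(unsigned long x) { return x << 4; }
```

Both fill functions write the same values to the same locations; the second simply avoids one multiplication per row. The shift example uses unsigned operands, where the two expressions are always equal.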

Share Common Subexpressions
- Reuse portions of expressions.
- Compilers are often not very sophisticated in exploiting arithmetic properties.

3 multiplications: i*n, (i-1)*n, (i+1)*n

    /* Sum neighbors of i,j */
    up    = val[(i-1)*n + j  ];
    down  = val[(i+1)*n + j  ];
    left  = val[i*n     + j-1];
    right = val[i*n     + j+1];
    sum = up + down + left + right;

    leaq   1(%rsi), %rax     # i+1
    leaq   -1(%rsi), %r8     # i-1
    imulq  %rcx, %rsi        # i*n
    imulq  %rcx, %rax        # (i+1)*n
    imulq  %rcx, %r8         # (i-1)*n
    addq   %rdx, %rsi        # i*n+j
    addq   %rdx, %rax        # (i+1)*n+j
    addq   %rdx, %r8         # (i-1)*n+j

1 multiplication: i*n

    long inj = i*n + j;
    up    = val[inj - n];
    down  = val[inj + n];
    left  = val[inj - 1];
    right = val[inj + 1];
    sum = up + down + left + right;

    imulq  %rcx, %rsi            # i*n
    addq   %rdx, %rsi            # i*n+j
    movq   %rsi, %rax            # i*n+j
    subq   %rcx, %rax            # i*n+j-n
    leaq   (%rsi,%rcx), %rcx     # i*n+j+n
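
The two versions can be wrapped in functions and compared, as in the sketch below. The function names and the use of long elements are assumptions made for this illustration; the index arithmetic is taken from the slide.

```c
/* Three multiplications: (i-1)*n, (i+1)*n and i*n. */
long sum_neighbors_v1(const long *val, long i, long j, long n)
{
    long up    = val[(i-1)*n + j];
    long down  = val[(i+1)*n + j];
    long left  = val[i*n + j-1];
    long right = val[i*n + j+1];

    return up + down + left + right;
}

/* One multiplication: i*n + j is computed once and shared. */
long sum_neighbors_v2(const long *val, long i, long j, long n)
{
    long inj   = i*n + j;
    long up    = val[inj - n];
    long down  = val[inj + n];
    long left  = val[inj - 1];
    long right = val[inj + 1];

    return up + down + left + right;
}
```

On a 3 x 3 grid holding 1 to 9 in row order, the neighbors of the center element (i = 1, j = 1) are 2, 8, 4 and 6, so both functions return 20. Both require that (i, j) is an interior position so that all four neighbors exist.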