1 Program Optimization
Professor Jennifer Rexford

2 Goals of Today's Class

Improving program performance
- When and what to optimize
- Better algorithms & data structures vs. tuning the code

Exploiting an understanding of the underlying system
- Compiler capabilities
- Hardware architecture
- Program execution

Why?
- To be effective, and efficient, at making programs faster
  – Avoid optimizing the fast parts of the code
  – Help the compiler do its job better
- To review material from the second half of the course

3 Improving Program Performance

Most programs are already "fast enough"
- No need to optimize performance at all
- Save your time, and keep the program simple/readable

Most parts of a program are already "fast enough"
- Usually only a small part makes the program run slowly
- Optimize only this portion of the program, as needed

Steps to improve execution (time) efficiency
- Do timing studies (e.g., with gprof; see the sketch below)
- Identify hot spots
- Optimize that part of the program
- Repeat as needed
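
Before reaching for a full profiler, a quick way to time a suspected hot spot is to bracket it with clock(); a minimal sketch, where slow_function is a hypothetical stand-in for the code under test:

    #include <stdio.h>
    #include <time.h>

    /* Hypothetical stand-in for a suspected hot spot. */
    static void slow_function(void)
    {
        volatile double x = 0.0;  /* volatile keeps the compiler from deleting the loop */
        int i;
        for (i = 0; i < 1000000; i++)
            x += i * 0.5;
    }

    int main(void)
    {
        clock_t start, end;
        int i;

        start = clock();
        for (i = 0; i < 100; i++)   /* repeat to get a measurable interval */
            slow_function();
        end = clock();

        printf("CPU time: %.3f seconds\n",
               (double)(end - start) / CLOCKS_PER_SEC);
        return 0;
    }

For whole-program profiling, compiling with gcc -pg and running the executable produces a gmon.out file that gprof turns into a per-function breakdown.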

4 Ways to Optimize Performance

Better data structures and algorithms
- Improves the "asymptotic complexity"
  – Better scaling of computation/storage as the input grows
  – E.g., going from an O(n^2) sorting algorithm to O(n log n)
- Clearly important if large inputs are expected
- Requires understanding data structures and algorithms

Better source code the compiler can optimize
- Improves the "constant factors"
  – Faster computation during each iteration of a loop
  – E.g., going from 1000n to 10n running time
- Clearly important if a portion of code is running slowly
- Requires understanding hardware, compiler, execution
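
As an illustration of the first kind of improvement (not from the original slides): replacing a hand-written O(n^2) sort with the C library's qsort, which typically runs in O(n log n):

    #include <stdio.h>
    #include <stdlib.h>

    /* Comparator for qsort: negative, zero, or positive, like strcmp. */
    static int cmp_int(const void *a, const void *b)
    {
        int x = *(const int *)a;
        int y = *(const int *)b;
        return (x > y) - (x < y);   /* avoids the overflow risk of x - y */
    }

    int main(void)
    {
        int data[] = { 42, 7, 19, 3, 25 };
        size_t i, n = sizeof(data) / sizeof(data[0]);

        qsort(data, n, sizeof(data[0]), cmp_int);

        for (i = 0; i < n; i++)
            printf("%d ", data[i]);
        printf("\n");
        return 0;
    }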

5 Helping the Compiler Do Its Job

6 Optimizing Compilers

Provide an efficient mapping of the program to the machine
- Register allocation
- Code selection and ordering
- Eliminating minor inefficiencies

Don't (usually) improve asymptotic efficiency
- It is up to the programmer to select the best overall algorithm

Have difficulty overcoming "optimization blockers"
- Potential function side effects
- Potential memory aliasing

7 Limitations of Optimizing Compilers

Fundamental constraint
- The compiler must not change program behavior
- Ever, even under rare pathological inputs

Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
- Data ranges more limited than the variable types suggest
- Array elements remain unchanged by function calls

Most analysis is performed only within functions
- Whole-program analysis is too expensive in most cases

Most analysis is based only on static information
- The compiler has difficulty anticipating run-time inputs

8 Avoiding Repeated Computation

A good compiler recognizes simple optimizations
- Avoiding redundant computations in simple loops
- Still, the programmer may want to make the optimization explicit

Example: the computation n * i is repeated for every iteration of the inner loop

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];

Hoisting it out of the inner loop makes the savings explicit:

    for (i = 0; i < n; i++) {
        int ni = n * i;
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
    }

9 Worrying About Side Effects

The compiler cannot always avoid repeated computation
- It may not know if the code has a "side effect"
- ... one that would make the transformation change the code's behavior

Is this transformation okay?

    int func1(int x)
    {
        return f(x) + f(x) + f(x) + f(x);
    }

becoming

    int func1(int x)
    {
        return 4 * f(x);
    }

Not necessarily. Consider:

    int counter = 0;

    int f(int x)
    {
        return counter++;
    }

And this function may be defined in another file, known only at link time!

10 Another Example of Side Effects

Is this optimization okay?

    for (i = 0; i < strlen(s); i++) {
        /* Do something with s[i] */
    }

becoming

    length = strlen(s);
    for (i = 0; i < length; i++) {
        /* Do something with s[i] */
    }

Short answer: it depends
- The compiler often cannot tell
- Most compilers do not try to identify side effects

The programmer knows best
- And can decide whether the optimization is safe (here, whether the loop body ever changes the length of s)

11 Memory Aliasing

Is this optimization okay?

    void twiddle(int *xp, int *yp)
    {
        *xp += *yp;
        *xp += *yp;
    }

becoming

    void twiddle(int *xp, int *yp)
    {
        *xp += 2 * *yp;
    }

Not necessarily: what if xp and yp are equal?
- First version: the result is 4 times the original *xp
- Second version: the result is 3 times the original *xp

12 Memory Aliasing

Memory aliasing
- A single data location accessed through multiple names
- E.g., two pointers (such as xp and yp above) that point to the same memory location

Modifying the data using one name
- Implicitly modifies the values seen through the other names

Aliasing blocks optimization by the compiler
- The compiler cannot tell when aliasing may occur
- ... and so must forgo optimizing the code

The programmer often does know
- And can optimize the code accordingly
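
One way the programmer can pass that knowledge along (not covered in the original slides) is C99's restrict qualifier, which promises the compiler that the pointers never alias; a minimal sketch:

    /* With restrict, the compiler may assume xp and yp never alias,
     * so it is free to apply the "*xp += 2 * *yp" transformation itself.
     * Calling twiddle(&a, &a) then has undefined behavior. */
    void twiddle(int *restrict xp, int *restrict yp)
    {
        *xp += *yp;
        *xp += *yp;
    }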

13 Another Aliasing Example

Is this optimization okay?

    int *x, *y;
    ...
    *x = 5;
    *y = 10;
    printf("x = %d\n", *x);

becoming

    printf("x = 5\n");

Not necessarily
- If y and x point to the same location in memory...
- ... the correct output is "x = 10\n"

14 Summary: Helping the Compiler

The compiler can perform many optimizations
- Register allocation
- Code selection and ordering
- Eliminating minor inefficiencies

But often the compiler needs your help
- Knowing whether the code is free of side effects
- Knowing whether memory aliasing can occur

Modifying the code can lead to better performance
- Profile the code to identify the "hot spots"
- Look at the assembly language the compiler produces
- Rewrite the code to get the compiler to do the right thing

15 Exploiting the Hardware

16 Underlying Hardware

Implements a collection of instructions
- The instruction set varies from one architecture to another
- Some instructions may be faster than others

Registers and caches are faster than main memory
- The number of registers and the sizes of the caches vary
- Exploit both spatial and temporal locality

Exploits opportunities for parallelism
- Pipelining: decoding one instruction while running another
  – Benefits from code that runs in sequence
- Superscalar: performing multiple operations per clock cycle
  – Benefits from operations that can run independently
- Speculative execution: performing instructions before knowing they will be reached (e.g., without knowing the outcome of a branch)

17 Addition Faster Than Multiplication

Adding instead of multiplying
- Addition is faster than multiplication

Recognize sequences of products, and replace the multiplication with repeated addition (a transformation known as "strength reduction"):

    for (i = 0; i < n; i++) {
        int ni = n * i;
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
    }

becomes

    int ni = 0;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
        ni += n;
    }

18 Bit Operations Faster Than Arithmetic

Shift operations multiply/divide by powers of 2
- "x >> 3" is faster than "x / 8"
- "x << 3" is faster than "x * 8"

Bit masking is faster than the mod operation
- "x & 15" is faster than "x % 16"
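
A caveat worth adding (not on the original slide): these equivalences hold exactly for unsigned or non-negative values, which is one reason compilers substitute shifts only when they can prove it safe. A small sketch:

    #include <stdio.h>

    int main(void)
    {
        unsigned x = 200;
        int y = -7;

        /* For unsigned values, each pair is equivalent. */
        printf("%u %u\n", x >> 3, x / 8);    /* 25 25 */
        printf("%u %u\n", x << 3, x * 8);    /* 1600 1600 */
        printf("%u %u\n", x & 15, x % 16);   /* 8 8 */

        /* For negative signed values, the equivalence breaks:
         * -7 / 8 is 0 in C (truncation toward zero), while
         * -7 >> 3 is -1 on typical two's-complement machines. */
        printf("%d %d\n", y / 8, y >> 3);    /* 0 -1 */
        return 0;
    }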

19 Caching: Matrix Multiplication

Caches
- Slower than registers, but faster than main memory
- Both instruction caches and data caches

Locality
- Temporal locality: recently referenced items are likely to be referenced in the near future
- Spatial locality: items with nearby addresses tend to be referenced close together in time

Matrix multiplication
- Multiply n-by-n matrices A and B, and store the result in matrix C
- Performance depends heavily on effective use of the caches

20 Matrix Multiply: Cache Effects

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];

Reasonable cache effects
- Good spatial locality for A: the inner loop scans row (i,*) left to right
- Poor spatial locality for B: the inner loop walks down column (*,j), striding across rows
- Good temporal locality for C: element (i,j) is reused across the whole inner loop

21 Matrix Multiply: Cache Effects

    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
            for (i = 0; i < n; i++)
                c[i][j] += a[i][k] * b[k][j];

Rather poor cache effects
- Bad spatial locality for A: the inner loop walks down column (*,k)
- Good temporal locality for B: element (k,j) is reused across the inner loop
- Bad spatial locality for C: the inner loop walks down column (*,j)

22 Matrix Multiply: Cache Effects

    for (k = 0; k < n; k++)
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                c[i][j] += a[i][k] * b[k][j];

Good cache effects
- Good temporal locality for A: element (i,k) is reused across the inner loop
- Good spatial locality for B: the inner loop scans row (k,*)
- Good spatial locality for C: the inner loop scans row (i,*)
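
To see the difference concretely, the loop orders can be timed side by side; a minimal self-contained sketch (the matrix size N and the fill values are arbitrary choices, not from the slides):

    #include <stdio.h>
    #include <time.h>

    #define N 512

    static double a[N][N], b[N][N], c[N][N];

    /* Multiply with the mixed-locality ijk order. */
    static double time_ijk(void)
    {
        clock_t t = clock();
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    c[i][j] += a[i][k] * b[k][j];
        return (double)(clock() - t) / CLOCKS_PER_SEC;
    }

    /* Multiply with the cache-friendly kij order. */
    static double time_kij(void)
    {
        clock_t t = clock();
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    c[i][j] += a[i][k] * b[k][j];
        return (double)(clock() - t) / CLOCKS_PER_SEC;
    }

    static void clear_c(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                c[i][j] = 0.0;
    }

    int main(void)
    {
        /* Fill A and B with arbitrary data. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                a[i][j] = i + j;
                b[i][j] = i - j;
            }

        clear_c();
        printf("ijk: %.2f s\n", time_ijk());
        clear_c();
        printf("kij: %.2f s\n", time_kij());
        return 0;
    }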

23 Parallelism: Loop Unrolling

What limits the performance of this loop?

    for (i = 0; i < length; i++)
        sum += data[i];

Limited apparent parallelism
- One main operation per iteration (plus book-keeping)
- Not enough work to keep multiple functional units busy
- Disruption of the instruction pipeline from frequent branches

Solution: unroll the loop
- Perform multiple operations on each iteration

24 Parallelism: After Loop Unrolling

Original code:

    for (i = 0; i < length; i++)
        sum += data[i];

After loop unrolling (by three):

    /* Combine three elements at a time */
    limit = length - 2;
    for (i = 0; i < limit; i += 3)
        sum += data[i] + data[i+1] + data[i+2];

    /* Finish any remaining elements */
    for ( ; i < length; i++)
        sum += data[i];
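
Unrolling by itself still funnels every addition through the single variable sum, one long dependence chain. A further step that follows from the previous slide's reasoning (though not shown in the original deck) is to split the work across independent accumulators; a hedged sketch:

    /* Unroll by two with two independent accumulators: the two
     * running sums do not depend on each other, so a superscalar
     * processor can issue their additions in parallel.
     * (With floating-point data this reorders the additions,
     * which a compiler may not do on its own.) */
    int sum_array(const int *data, int length)
    {
        int sum0 = 0, sum1 = 0;
        int i, limit = length - 1;

        for (i = 0; i < limit; i += 2) {
            sum0 += data[i];
            sum1 += data[i+1];
        }
        for ( ; i < length; i++)   /* finish any remaining element */
            sum0 += data[i];
        return sum0 + sum1;
    }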

25 Program Execution

26 Avoiding Function Calls

Function calls are expensive
- The caller saves registers and pushes arguments onto the stack
- The callee saves registers and pushes local variables onto the stack
- Call and return disrupt the sequential flow of the code

Function inlining replaces the call with the body of the callee:

    void g(void)
    {
        /* Some code */
    }

    void f(void)
    {
        ...
        g();
        ...
    }

becomes

    void f(void)
    {
        ...
        /* Some code */
        ...
    }

Some compilers support an "inline" keyword directive.
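
A minimal sketch of the keyword form (the function square is an illustrative placeholder; C's inline is only a hint, and compilers may inline functions with or without it):

    /* "static inline" in a header is the common idiom: it suggests
     * expanding calls in place, avoiding call/return overhead
     * for a tiny function. */
    static inline int square(int x)
    {
        return x * x;
    }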

27 Writing Your Own Malloc and Free

Dynamic memory management
- malloc to allocate blocks of memory
- free to release blocks of memory

Existing malloc and free implementations
- Designed to handle a wide range of request sizes
- Good most of the time, but rarely the best for every workload

Designing your own dynamic memory management (see the sketch below)
- Forgo the traditional malloc/free, and write your own
- E.g., if you know all blocks will be the same size
- E.g., if you know blocks will usually be freed in the order allocated
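
For the fixed-size case, for example, a free-list pool allocator can replace malloc/free with a few pointer operations; a minimal sketch (the names pool_alloc and pool_free and the sizes are illustrative, not from the slides):

    #include <stddef.h>

    #define BLOCK_SIZE 64      /* every allocation is this size */
    #define NUM_BLOCKS 1024

    /* A free block stores the pointer to the next free block in its
     * own first bytes, so the free list needs no extra space. */
    union block {
        union block *next;
        char payload[BLOCK_SIZE];
    };

    static union block pool[NUM_BLOCKS];
    static union block *free_list = NULL;
    static size_t next_unused = 0;

    void *pool_alloc(void)
    {
        if (free_list != NULL) {         /* reuse a freed block */
            union block *b = free_list;
            free_list = b->next;
            return b;
        }
        if (next_unused < NUM_BLOCKS)    /* carve out a fresh block */
            return &pool[next_unused++];
        return NULL;                     /* pool exhausted */
    }

    void pool_free(void *p)
    {
        union block *b = p;              /* push onto the free list */
        b->next = free_list;
        free_list = b;
    }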

28 Conclusion

Work smarter, not harder
- No need to optimize a program that is "fast enough"
- Optimize only when, and where, necessary

Speeding up a program
- Better data structures and algorithms improve the asymptotic behavior
- Optimized code shrinks the constant factors

Techniques for speeding up a program
- Coax the compiler
- Exploit the capabilities of the hardware
- Capitalize on knowledge of program execution

29 Course Wrap Up

30 The Rest of the Semester

Dean's Date: Tuesday, May 12
- Final assignment due at 9pm
- Cannot be accepted after 11:59pm

Final Exam: Friday, May 15
- 1:30-4:20pm in Friend Center 101
- Exams from previous semesters are online
- Covers the entire course, with emphasis on the second half of the term
- Open book, open notes, open slides, etc. (just no computers!)
  – No need to print/bring the IA-32 manuals

Office hours during the reading/exam period
- Daily; times TBA on the course mailing list

Review sessions
- May 13-14; times TBA on the course mailing list

31 Goals of COS 217

Understand the boundary between code and computer
- Machine architecture
- Operating systems
- Compilers

Learn C and the Unix development tools
- C is widely used for programming low-level systems
- Unix has a rich development environment
- Unix is open and well specified, good for study & research

Improve your programming skills
- More experience in programming
- Challenging and interesting programming assignments
- Emphasis on modularity and debugging

32 Relationship to Other Courses

Machine architecture
- Logic design (306) and computer architecture (471)
- COS 217: assembly language and basic architecture

Operating systems
- Operating systems (318)
- COS 217: virtual memory, system calls, and signals

Compilers
- Compiling techniques (320)
- COS 217: compilation process, symbol tables, assembly and machine language

Software systems
- Numerous courses, independent work, etc.
- COS 217: programming skills, UNIX tools, and ADTs

33 Lessons About Computer Science

Modularity
- Well-defined interfaces between components
- Allows changing the implementation of one component without changing another
- The key to managing complexity in large systems

Resource sharing
- Time-sharing of the CPU by multiple processes
- Sharing of the physical memory by multiple processes

Indirection
- Representing the address space with virtual memory
- Manipulating data via pointers (or addresses)

34 Lessons, Continued

Hierarchy
- Memory: registers, cache, main memory, disk, tape, ...
- Balancing the trade-off between fast/small and slow/big

Bits can mean anything
- Code, addresses, characters, pixels, money, grades, ...
- Arithmetic can be done through logic operations
- The meaning of the bits depends entirely on how they are accessed, used, and manipulated

35 Have a Great Summer!!!