Programming for Performance CS 740 Oct. 4, 2000 Topics How architecture impacts your programs How (and how not) to tune your code.

CS 740 F'00 – 2 – Performance Matters

Constant factors count!
- easily see a 10:1 performance range depending on how code is written
- must optimize at multiple levels: algorithm, data representations, procedures, and loops

Must understand the system to optimize performance:
- how programs are compiled and executed
- how to measure program performance and identify bottlenecks
- how to improve performance without destroying code modularity and generality

CS 740 F'00 – 3 – Optimizing Compilers

Provide efficient mapping of program to machine:
- register allocation
- code selection and ordering
- eliminating minor inefficiencies

Don't (usually) improve asymptotic efficiency:
- up to the programmer to select the best overall algorithm
- big-O savings are (often) more important than constant factors, but constant factors also matter

Have difficulty overcoming "optimization blockers":
- potential memory aliasing
- potential procedure side effects

CS 740 F'00 – 4 – Limitations of Optimizing Compilers

Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles:
- e.g., data ranges may be more limited than variable types suggest (such as using an "int" in C for what could be an enumerated type)

Most analysis is performed only within procedures:
- whole-program analysis is too expensive in most cases

Most analysis is based only on static information:
- the compiler has difficulty anticipating run-time inputs

When in doubt, the compiler must be conservative:
- it cannot perform an optimization if it changes program behavior under any realizable circumstance, even if circumstances seem quite bizarre and unlikely
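The conservatism point can be made concrete with a small sketch (the function name and setup are illustrative, not from the slides): because the two pointers below may alias, the compiler cannot hoist the load of *q out of the loop.

```c
#include <stddef.h>

/* Hypothetical illustration of forced conservatism: p and q may alias,
 * so *q must be re-loaded on every iteration -- a store to p[i] could
 * have changed it. If the compiler could prove no aliasing, *q could be
 * loaded once before the loop. */
void add_scalar(int *p, const int *q, size_t n)
{
    for (size_t i = 0; i < n; i++)
        p[i] += *q;   /* re-load *q each time: some p[i] might be q itself */
}
```

If a caller passes `add_scalar(a, &a[0], 3)`, the first store changes *q, so hoisting the load would change the result; this is exactly the "realizable circumstance" the compiler must account for.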

CS 740 F'00 – 5 – What Do Compilers Try to Do?

- Reduce the number of instructions (both dynamic and static counts)
- Take advantage of parallelism
- Eliminate useless work
- Optimize memory access patterns
- Use special hardware when available

CS 740 F'00 – 6 – Matrix Multiply – Simple Version

Heavy use of memory operations, addition, and multiplication. Contains redundant operations.

    for (i = 0; i < SIZE; i++) {
      for (j = 0; j < SIZE; j++) {
        for (k = 0; k < SIZE; k++) {
          c[i][j] += a[i][k] * b[k][j];
        }
      }
    }

CS 740 F'00 – 7 – Matrix Multiply – Hand Optimized

Turned array accesses into pointer dereferences. Assign to each element of c just once.

    for (i = 0; i < SIZE; i++) {
      int *orig_pa = &a[i][0];
      for (j = 0; j < SIZE; j++) {
        int *pa = orig_pa;
        int *pb = &b[0][j];
        int sum = 0;
        for (k = 0; k < SIZE; k++) {
          sum += *pa * *pb;
          pa++;
          pb += SIZE;
        }
        c[i][j] = sum;
      }
    }

CS 740 F'00 – 8 – Results

Is the "optimized" code optimal?

    R10000        Simple    Optimized
    cc -O0        34.7s     27.4s
    cc -O3         5.3s      8.0s
    egcc -O9      10.1s      8.3s

    21164         Simple    Optimized
    cc -O0        40.5s     12.2s
    cc -O5        16.7s     18.6s
    egcc -O0      27.2s     19.5s
    egcc -O9      12.3s     14.7s

    Pentium II    Simple    Optimized
    egcc -O9      28.4s     25.3s

    RS/6000       Simple    Optimized
    xlC -O3       63.9s     65.3s

CS 740 F'00 – 9 – Why is Simple Better?

- Easier for humans and the compiler to understand; the more the compiler knows, the more it can do
- Pointers are hard to analyze; arrays are easier
- You never know how fast code will run until you time it
- The transformations we did by hand, good optimizers will do for us, and they will often do a better job than we can
- Pointers may cause aliases and data dependences where the array code had none

CS 740 F'00 – 10 – Optimization Blocker: Pointers

Aliasing: if the compiler can't tell what a pointer points at, it must be conservative and assume it can point at almost anything. E.g., strcpy could be optimized into a much better loop if only we knew that our strings never alias each other:

    void strcpy(char *dst, char *src)
    {
        while ((*dst++ = *src++) != '\0')
            ;
    }
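One way to hand the compiler that missing knowledge in C99 is the `restrict` qualifier, by which the programmer promises that the pointers do not alias. This is a hypothetical sketch (the function name is illustrative), not code from the lecture:

```c
#include <stddef.h>

/* Hypothetical sketch: `restrict` asserts that dst and src never overlap,
 * so the compiler is free to unroll, reorder, and vectorize the copy
 * without re-checking memory after each store. Passing overlapping
 * arguments is undefined behavior -- the promise is the programmer's
 * responsibility, not something the compiler verifies. */
void copy_ints(int *restrict dst, const int *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

With plain pointers the compiler would have to assume each store to dst[i] might change a later src[j]; `restrict` removes exactly that blocker.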

CS 740 F'00 – 11 – SGI's Superior Compiler

- Loop unrolling: the central loop is unrolled 2x
- Code scheduling: loads are moved up in the schedule to hide their latency
- Loop interchange: the inner two loops are interchanged, giving us ikj rather than ijk; better cache performance gives us a huge benefit
- Software pipelining: do loads for the next iteration while doing the multiply for the current iteration
- Strength reduction: add 4 to the current array location to get the next one, rather than multiplying by the index
- Loop-invariant code motion: values which are constant are not re-computed for each loop iteration
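The strength-reduction bullet can be sketched as a source-level before/after pair (function names are hypothetical; a real compiler performs this on the generated addressing code, where the multiply-by-index becomes an add of the element size):

```c
#include <stddef.h>

/* Before: the address computation does a multiply on every iteration. */
long sum_strided_mul(const int *a, size_t n, size_t stride)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i * stride];        /* multiply per iteration */
    return sum;
}

/* After strength reduction: the multiply is replaced by a running
 * pointer that is bumped by the stride -- a cheap add -- each time. */
long sum_strided_ptr(const int *a, size_t n, size_t stride)
{
    long sum = 0;
    const int *p = a;
    for (size_t i = 0; i < n; i++) {
        sum += *p;
        p += stride;                 /* add per iteration */
    }
    return sum;
}
```

Both functions compute the same sum; only the cost of generating each address differs.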

CS 740 F'00 – 12 – Loop Interchange

Does any loop iteration read a value produced by any other iteration? What do the memory access patterns look like in the inner loop?

    for (i = 0; i < SIZE; i++)
      for (j = 0; j < SIZE; j++)
        for (k = 0; k < SIZE; k++)
          c[i][j] += a[i][k] * b[k][j];

    ijk: constant   += sequential * striding
    ikj: sequential += constant   * sequential
    jik: constant   += sequential * striding
    jki: striding   += striding   * constant
    kij: sequential += constant   * sequential
    kji: striding   += striding   * constant
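The ikj ordering from the table above can be written out directly (SIZE and the function name here are illustrative, not from the slides):

```c
#define SIZE 4   /* hypothetical small size for illustration */

/* ikj matrix multiply: with j innermost, both c[i][j] and b[k][j] are
 * walked with unit stride, and a[i][k] is constant in the inner loop --
 * the "sequential += constant * sequential" pattern from the table. */
void matmul_ikj(int c[SIZE][SIZE], int a[SIZE][SIZE], int b[SIZE][SIZE])
{
    for (int i = 0; i < SIZE; i++)
        for (int k = 0; k < SIZE; k++) {
            int r = a[i][k];             /* loop-invariant in j */
            for (int j = 0; j < SIZE; j++)
                c[i][j] += r * b[k][j];
        }
}
```

The unit-stride inner loop is what gives ikj its cache advantage over ijk, where b[k][j] jumps a whole row per iteration.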

CS 740 F'00 – 13 – Software Pipelining

    for (j = 0; j < SIZE; j++)
        c_r[j] += a_r_c * b_r[j];

[Dataflow graph: load b_r[j] and a_r_c feed a multiply (*); the product and load c_r[j] feed an add (+); the sum feeds store c_r[j].]

Now we must optimize the inner loop:
- do as much work as possible in each iteration
- keep all of the functional units in the processor busy

CS 740 F'00 – 14 – Software Pipelining cont.

Not pipelined: each iteration runs its full load / multiply / add / store chain to completion before the next begins.

    for (j = 0; j < SIZE; j++)
        c_r[j] += a_r_c * b_r[j];

Pipelined: copies of the per-iteration chain (load b_r[j], load c_r[j], *, +, store c_r[j]) from successive iterations are overlapped in time, in three phases: fill, steady state, drain.

[Diagram: staggered, overlapping copies of the per-iteration dataflow chain, one per loop iteration.]

CS 740 F'00 – 15 – Code Motion Examples

Sum the integers from 1 to fact(n):

Bad – fact(n) is re-evaluated on every iteration:

    sum = 0;
    for (i = 0; i <= fact(n); i++)
        sum += i;

Better – hoist the loop-invariant call, or run the loop downward from fact(n):

    sum = 0;
    fn = fact(n);
    for (i = 0; i <= fn; i++)
        sum += i;

    sum = 0;
    for (i = fact(n); i > 0; i--)
        sum += i;

Best – use the closed form:

    fn = fact(n);
    sum = fn * (fn + 1) / 2;

CS 740 F'00 – 16 – Optimization Blocker: Procedure Calls

Why couldn't the compiler move fact(n) out of the inner loop?
- The procedure may have side effects, i.e., it may alter global state each time it is called
- The function may not return the same value for given arguments; the result may depend on other parts of global state

Why doesn't the compiler look at the code for fact(n)?
- The linker may overload it with a different version (unless it is declared static)
- Interprocedural optimization is not used extensively due to cost; inlining can achieve the same effect for small procedures

Warning: the compiler treats a procedure call as a black box, which weakens optimizations in and around it.
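A minimal sketch of the side-effect blocker (the global counter and function names are hypothetical): because f() mutates global state, the compiler cannot hoist the call out of the loop condition without observably changing how many times `calls` is incremented.

```c
/* Hypothetical example: f() has a visible side effect, so the call in
 * the loop condition below must be re-executed on every iteration --
 * hoisting it would change the final value of `calls`. */
int calls = 0;

int f(int n)
{
    calls++;          /* side effect: global state changes on every call */
    return n;
}

long sum_up_to(int n)
{
    long sum = 0;
    for (int i = 0; i <= f(n); i++)   /* f(n) evaluated each iteration */
        sum += i;
    return sum;
}
```

For sum_up_to(3), the condition is tested five times (i = 0 through 4), so f runs five times; a version with the call hoisted would run it once and be observably different.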

CS 740 F'00 – 17 – Role of the Programmer

How should I write my programs, given that I have a good optimizing compiler?

Don't: smash code into oblivion
- hard to read, maintain, and ensure correctness

Do:
- Select the best algorithm
- Write code that is readable and maintainable (procedures, recursion, no built-in constant limits), even though these factors can slow down code
- Eliminate optimization blockers, allowing the compiler to do its job
- Account for cache behavior
- Focus on inner loops; use a profiler to find the important ones!