Optimizing program performance


Optimizing program performance
Computer Systems lecture by Arnoud Visser, University of Amsterdam

Performance can make the difference
Recommendations by Patrick van der Smagt (1991) for neural net implementations:
- Use pointers instead of array indices
- Use doubles instead of floats
- Optimize inner loops
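
To make the first recommendation concrete, here is a minimal sketch (an illustration, not from the slides) of the same reduction written with array indexing and with pointer arithmetic, the style recommended for early-1990s compilers:

/* Index form: the address data + 4*i is recomputed on every iteration. */
int sum_indexed(int *data, int n)
{
    int sum = 0;
    int i;
    for (i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* Pointer form: a single pointer increment per element. */
int sum_pointer(int *data, int n)
{
    int *end = data + n;
    int sum = 0;
    for (; data < end; data++)
        sum += *data;
    return sum;
}

On a modern optimizing compiler both versions typically compile to the same inner loop, which is part of the point of this lecture.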

Machine-independent versus machine-dependent optimizations
Machine-Independent Optimizations, which you should do regardless of processor / compiler:
- Code motion (out of the loop; see the sketch below)
- Reducing procedure calls
- Eliminating unneeded memory usage
- Sharing common sub-expressions
Machine-Dependent Optimizations:
- Pointer code
- Unrolling
- Enabling instruction-level parallelism
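
As an example of code motion, a sketch in the spirit of the book's lower1/lower2 case-conversion example (the function bodies here are reconstructions, not taken from this slide):

#include <string.h>
#include <ctype.h>

/* Before: strlen(s) is loop-invariant but re-evaluated on every
   iteration, making the loop quadratic in the string length. */
void lower1(char *s)
{
    size_t i;
    for (i = 0; i < strlen(s); i++)
        s[i] = tolower((unsigned char) s[i]);
}

/* After code motion: the invariant computation is hoisted out of the loop. */
void lower2(char *s)
{
    size_t len = strlen(s);    /* computed once, before the loop */
    size_t i;
    for (i = 0; i < len; i++)
        s[i] = tolower((unsigned char) s[i]);
}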

Machine dependent: one has to know today's architectures
- Superscalar (Pentium): often two instructions/cycle
- Dynamic execution (P6): three instructions out-of-order/cycle
- Explicit parallelism (Itanium): six execution units

Pentium III Design
[Block diagram: the Instruction Control unit (Fetch Control, Instruction Cache, Instruction Decode, Retirement Unit, Register File, branch prediction) issues operations to the Execution unit's functional units (Integer/Branch, General Integer, FP Add, FP Mult/Div, Load, Store), which exchange addresses and data with the Data Cache and send operation results back for register updates.]

Functional Units of Pentium III
Multiple instructions can execute in parallel:
- 1 load
- 1 store
- 2 integer (one may be branch)
- 1 FP addition
- 1 FP multiplication or division

Performance of Pentium III operations
Many instructions can be pipelined down to 1 cycle/issue:

Instruction                 Latency   Cycles/Issue
Load / Store                   3           1
Integer Multiply               4           1
Integer Divide                36          36
Double/Single FP Multiply      5           2
Double/Single FP Add           3           1
Double/Single FP Divide       38          38

Latency is the total number of cycles one operation takes; cycles/issue is how soon the next one can start. FP multiply (5/2), for example, takes 5 cycles but a new multiply can begin every 2 cycles, while divide (38/38) is not pipelined at all.

Instruction Control
[Diagram: Fetch Control and the Instruction Cache feed the decoder; decoded operations flow to the Retirement Unit and Register File.]
- Grabs instructions from memory, based on the current PC plus predicted targets for predicted branches; the hardware dynamically guesses (possibly) the branch target
- Translates instructions into operations: the primitive steps required to perform an instruction; a typical instruction requires 1-3 operations
- Converts register references into tags: an abstract identifier linking the destination of one operation with the sources of later operations

Translation Example
Version of combine4: integer data, multiply operation.

Translation of the first iteration:

.L24:                               # Loop:
    imull (%eax,%edx,4),%ecx        # t *= data[i]
    incl %edx                       # i++
    cmpl %esi,%edx                  # i:length
    jl .L24                         # if < goto Loop

The imull is split into two operations: a load reads from memory to generate the temporary result t.1, and the multiply operation itself works only on registers.

Operands:
- Register %eax does not change in the loop; its value is retrieved from the register file during decoding.
- Register %ecx changes on every iteration. Register renaming uniquely identifies the different versions as %ecx.0, %ecx.1, %ecx.2, ..., so that values are passed directly from producer to consumers.

The renamed operations:

    load (%eax,%edx.0,4)  →  t.1
    imull t.1, %ecx.0     →  %ecx.1
    incl %edx.0           →  %edx.1
    cmpl %esi, %edx.1     →  cc.1
    jl-taken cc.1
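
For reference, a sketch of the C source this loop comes from, reconstructed in the style of the other combine versions in these slides (the variable names are assumptions):

void combine4(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int t = 1;
    int i;
    for (i = 0; i < length; i++)
        t *= data[i];          /* compiles to the imull loop above */
    *dest = t;
}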

Visualizing Operations

    load (%eax,%edx,4)  →  t.1
    imull t.1, %ecx.0   →  %ecx.1
    incl %edx.0         →  %edx.1
    cmpl %esi, %edx.1   →  cc.1
    jl-taken cc.1

[Dataflow diagram: the load feeds the imull; incl feeds cmpl, which feeds the jl; time runs downward.]
Operations: the vertical position denotes the time at which an operation executes; an operation cannot begin until its operands are available; height denotes latency.
Operands: arcs are shown only for operands that are passed within the execution unit.

4 Iterations of Combining Sum
[Diagram: four successive iterations of the sum loop overlapped in time; each iteration consists of 4 integer operations.]
Resource analysis: unlimited resources should give a CPE of 1.0, but that would require executing 4 integer operations in parallel.

Pentium Resource Constraints
- Only two integer functional units
- Priority is set based on program order
Performance: sustains a CPE of 2.0.
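
A quick check of that number, spelling out the arithmetic the slide implies: each iteration of the sum loop issues 4 integer operations, but only 2 of them can start per cycle:

    4 integer ops per element / 2 integer units = 2 cycles per element  →  CPE = 2.0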

Loop Unrolling
Measured CPE = 1.33

void combine5(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length - 2;
    int *data = get_vec_start(v);
    int sum = 0;
    int i;
    /* Combine 3 elements at a time */
    for (i = 0; i < limit; i += 3) {
        sum += data[i] + data[i+2] + data[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        sum += data[i];
    }
    *dest = sum;
}

Optimization:
- Combine multiple iterations into a single loop body
- Amortize the loop overhead across multiple iterations
- Finish any extras at the end

Resource distribution with Loop Unrolling
[Diagram: operation schedule for the 3-way unrolled sum loop.]
Predicted performance: an iteration (3 elements) can complete in 3 cycles, which should give a CPE of 1.0.

Effect of Unrolling

Unrolling Degree      1      2      3      4      8      16
Integer Sum         2.00   1.50   1.33   1.25    –     1.06
Integer Product     4.00 at every degree
FP Sum              3.00 at every degree
FP Product          5.00 at every degree

- Unrolling only helps the integer sum in our examples; the other cases are constrained by the functional-unit latencies
- The effect is nonlinear in the degree of unrolling: many subtle effects determine the exact scheduling of operations

Unrolling is for long vectors

Vector length \ Unrolling Degree     1      2      3      4      8      16
∞ (Integer Sum)                    2.00   1.50   1.33   1.25    –     1.06
1024                               2.06   1.56   1.40   1.31    –     1.12
31                                 4.02   3.57   3.39   3.84   3.91   3.66

New source of overhead: the remaining elements must be finished one at a time when the vector length is not divisible by the degree of unrolling.
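
One way to read the long-vector rows (an observation that fits the numbers, not a claim made on the slide): each iteration performs k element operations plus roughly one cycle of loop overhead, so

    CPE ≈ (k + 1) / k = 1 + 1/k
    k = 1: 2.00    k = 2: 1.50    k = 3: 1.33    k = 4: 1.25    k = 16: 1.06

The length-1024 row sits a constant ≈ 0.06 above this, the start-up overhead amortized over 1024 elements. For length 31, up to k-1 leftover elements run through the slow one-at-a-time cleanup loop, which is why the CPE rises again at high degrees of unrolling.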

3 Iterations of Combining Product
[Diagram: dataflow for three product iterations; each imull waits for the result of the previous one.]
Unlimited resource analysis: assume an operation can start as soon as its operands are available.
Performance: the limiting factor becomes the multiply latency, which gives a CPE of 4.0.

Iteration splitting

void combine6(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length - 1;
    int *data = get_vec_start(v);
    int x0 = 1;
    int x1 = 1;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        x0 *= data[i];
        x1 *= data[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x0 *= data[i];
    }
    *dest = x0 * x1;
}

Optimization:
- Make operands available earlier by accumulating into two different products (x0, x1)
- Combine them at the end
Performance: CPE = 2.0
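
A back-of-the-envelope check of these CPE numbers, using the 4-cycle integer multiply latency from the Pentium III table earlier (the derivation is mine, not on the slide):

    one chain:   n dependent multiplies × 4 cycles = 4n cycles    →  CPE = 4.0
    two chains:  two independent chains of n/2 multiplies overlap,
                 so n elements finish in 4 × (n/2) = 2n cycles    →  CPE = 2.0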

Resource distribution with Iteration Splitting
[Diagram: operation schedule with two parallel multiply chains.]
Question: what is wrong in this schedule? The third jl conflicts with the first imull.
Predicted performance: making use of both execution units gives a CPE of 2.0.

Results for Pentium III
The biggest gain comes from the basic optimizations, but the last little bit of machine-dependent tuning still helps.

Results for Pentium 4
- Higher latencies (integer multiply = 14, FP add = 5.0, FP multiply = 7.0)
- Clock runs at 2.0 GHz
- Not an improvement over the 1.0 GHz Pentium III for integer multiply
- Avoids the FP multiplication anomaly
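
To see why it is not a win for integer multiply, combine the latencies with the clock rates (arithmetic using the slides' own numbers):

    Pentium 4:    14 cycles / 2.0 GHz = 7.0 ns per multiply
    Pentium III:   4 cycles / 1.0 GHz = 4.0 ns per multiply

A latency-bound product loop therefore runs slower in wall-clock time on the 2.0 GHz Pentium 4.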

Machine-Dependent Optimization: Summary
Loop unrolling:
- Some compilers do this automatically
- Generally not as clever as what you can achieve by hand
Exposing instruction-level parallelism:
- Very machine dependent; the benefits depend heavily on the particular machine, so do this only for the performance-critical parts of your code
- Best if performed by the compiler, but GCC on IA32/Linux is not very good at it
Pointer code:
- Look carefully at the generated code to see whether it actually helps

Conclusion
How should I write my programs, given that I have a good, optimizing compiler?
Don't: smash code into oblivion; that makes it hard to read, maintain, and assure correctness.
Do:
- Select the best algorithm and data representation
- Write code that is readable and maintainable: use procedures and recursion, without built-in constant limits, even though these factors can slow down the code
- Focus on the inner loops
- Remember that detailed optimization requires detailed measurement

Assignment
Practice Problems:
- Practice Problem 5.5: "What program variable has been spilled for combine6()?"
- Practice Problem 5.6: "Indicate the CPE of the differently associated products."
Optimization Lab

Performance of Intel Core 2 operations
Two stores and one load can issue in a single cycle; the performance of IDIV depends on the quotient.

Instruction                 Latency   Cycles/Issue
Load / Store                   1         0.33
Integer Multiply               3         1
Integer Divide               17-41      12-36
Double/Single FP Multiply      5         2
Double/Single FP Add           3         1
Double/Single FP Divide       32        32

Source: http://download.intel.com/design/processor/manuals/248966.pdf, Tables C-13a and C-14a (Core 2 = cpuid 06_0FH).
The latency and throughput of IDIV on the Intel Core microarchitecture vary with the number of significant digits of the quotient, and may increase as the quotient gains significant digits. IDIV with 64-bit operands is significantly slower than with 32-bit operands.
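
A small, self-contained benchmark sketch (my illustration, not from the slide) that exposes the quotient-dependent IDIV timing described above; the exact numbers will vary by machine:

#include <stdio.h>
#include <time.h>

/* Time n signed 64-bit divisions. The volatile qualifiers keep the
   compiler from hoisting the division out of the loop or replacing
   it with a multiplication by a constant. */
static double time_divs(long long dividend, long long divisor, int n)
{
    volatile long long d = divisor;
    volatile long long q = 0;
    int i;
    clock_t start = clock();
    for (i = 0; i < n; i++)
        q = q + dividend / d;          /* one IDIV per iteration */
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

int main(void)
{
    int n = 10 * 1000 * 1000;
    /* Small quotient (100 / 7 = 14): few significant digits, the fast case. */
    printf("small quotient: %.2f s\n", time_divs(100, 7, n));
    /* Large quotient ((1LL << 62) / 3, about 1.5e18): the slow case. */
    printf("large quotient: %.2f s\n", time_divs(1LL << 62, 3, n));
    return 0;
}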