Code Optimization II: Machine-Dependent Optimization

Topics:
- Machine-Dependent Optimizations
- Pointer code
- Unrolling
- Enabling instruction-level parallelism

– 2 – Previous Best Combining Code

Task:
- Compute sum of all elements in vector
- Vector represented by C-style abstract data type (sketched below)
- Achieved CPE (cycles per element) of 2.00

void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}
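A minimal sketch of the vector abstract data type the combining routines assume. Only the names vec_ptr, vec_length, and get_vec_start appear on the slides; the struct layout and accessor bodies below are an assumption, modeled on the CS:APP vector package:

/* Assumed vector ADT: a length plus a contiguous array of elements. */
typedef struct {
    int len;     /* number of elements */
    int *data;   /* pointer to element storage */
} vec_rec, *vec_ptr;

int vec_length(vec_ptr v) { return v->len; }

int *get_vec_start(vec_ptr v) { return v->data; }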

– 3 – General Forms of Combining

void abstract_combine4(vec_ptr v, data_t *dest)
{
    int i;
    int length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t t = IDENT;
    for (i = 0; i < length; i++)
        t = t OP data[i];
    *dest = t;
}
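The abstract version is parameterized by the element type data_t, the combining operation OP, and its identity element IDENT. A plausible instantiation for the integer-sum case (only the three macro names come from the slide; the exact #define lines are an assumption):

typedef int data_t;   /* element type under test */
#define IDENT 0       /* identity element: 0 for sum, 1 for product */
#define OP +          /* combining operation: + for sum, * for product */

Recompiling with OP defined as * and IDENT as 1 yields the integer-product variant measured later in the deck.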

– 4 – Pointer Code

Optimization:
- Use pointers rather than array references
- CPE: 3.00 (compiled -O2)
- Oops! We're not making progress here!
- Warning: some compilers do a better job optimizing array code

void combine4p(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int *dend = data+length;
    int sum = 0;
    while (data < dend) {
        sum += *data;
        data++;
    }
    *dest = sum;
}

– 5 – Pointer vs. Array Code Inner Loops

Array code:
.L24:                          # Loop:
    addl (%eax,%edx,4),%ecx    #   sum += data[i]
    incl %edx                  #   i++
    cmpl %esi,%edx             #   i:length
    jl .L24                    #   if < goto Loop

Pointer code:
.L30:                          # Loop:
    addl (%eax),%ecx           #   sum += *data
    addl $4,%eax               #   data++
    cmpl %edx,%eax             #   data:dend
    jb .L30                    #   if < goto Loop

Performance:
- Array code: 4 instructions in 2 clock cycles
- Pointer code: almost the same 4 instructions, in 3 clock cycles

– 6 – Modern CPU Design

[Block diagram: an Instruction Control unit (Instruction Cache, Fetch Control, Instruction Decode, Retirement Unit, Register File) issues operations to the execution engine's functional units (Integer/Branch, FP Add, FP Mult/Div, Load, Store), which exchange addresses and data with the Data Cache; branch-prediction feedback ("Prediction OK?") and register updates flow back to instruction control.]

– 7 – CPU Capabilities of Pentium III

Multiple instructions can execute in parallel:
- 1 load
- 1 store
- 2 integer (one may be a branch)
- 1 FP addition
- 1 FP multiplication or division

Some instructions take > 1 cycle, but can be pipelined:

Instruction                  Latency    Cycles/Issue
Load / Store                 3          1
Integer Multiply             4          1
Double/Single FP Multiply    5          2
Double/Single FP Add         3          1

– 8 – CPU Capabilities of Pentium III (cont.)

Same as the previous slide, adding the divide operations:

Instruction                  Latency    Cycles/Issue
Integer Divide               36         36
Double/Single FP Divide      38         38

For divides, latency equals cycles/issue: a new divide cannot start until the previous one finishes, so division is not pipelined.

– 9 – Visualizing Operations

Operations:
- Vertical position denotes time at which executed
- Cannot begin operation until operands available
- Height denotes latency

[Diagram: dataflow for one iteration of the product loop; a load feeds a tall imull box (4-cycle latency) producing %ecx.1, while incl, cmpl, and jl on the renamed registers %edx.0/%edx.1 complete quickly. The vertical axis is time.]

– 10 – Visualizing Operations (cont.)

Operations:
- Same as before, except that add has latency of 1

[Diagram: the same dataflow with addl in place of imull; the short addl box reflects its 1-cycle latency.]

– 11 – 3 Iterations of Combining Product

Unlimited resource analysis:
- Assume an operation can start as soon as its operands are available
- Operations for multiple iterations overlap in time

Performance:
- Limiting factor becomes latency of integer multiplier
- Gives CPE of 4.0

– 12 – 4 Iterations of Combining Sum

Unlimited resource analysis:
- Can begin a new iteration on each clock cycle
- Should give CPE of 1.0
- Would require executing 4 integer operations in parallel

– 13 – Combining Sum: Resource Constraints

- Only have two integer functional units
- Some operations delayed even though operands available

Performance:
- Sustain CPE of 2.0

– 14 – Loop Unrolling

Optimization:
- Combine multiple iterations into single loop body
- Amortizes loop overhead across multiple iterations
- Finish extras at end
- Measured CPE = 1.33

void combine5(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-2;
    int *data = get_vec_start(v);
    int sum = 0;
    int i;
    /* Combine 3 elements at a time */
    for (i = 0; i < limit; i += 3)
        sum += data[i] + data[i+1] + data[i+2];
    /* Finish any remaining elements */
    for (; i < length; i++)
        sum += data[i];
    *dest = sum;
}

– 15 – Visualizing Unrolled Loop

- Loads can pipeline, since they have no dependencies
- Only one set of loop control operations

[Diagram: one unrolled iteration; three pipelined loads (t.1a, t.1b, t.1c) feed a chain of three addl operations producing %ecx.1a, %ecx.1b, %ecx.1c, with a single addl/cmpl/jl for loop control on %edx.]

– 16 – Executing with Loop Unrolling

Predicted performance:
- Can complete iteration in 3 cycles
- Should give CPE of 1.0

Measured performance:
- CPE of 1.33
- One iteration every 4 cycles

– 17 – Effect of Unrolling

- Only helps integer sum for our examples (CPE 2.00 without unrolling, 1.33 at degree 3)
- Other cases constrained by functional unit latencies: Integer Product stays at CPE 4.00, FP Sum at 3.00, FP Product at 5.00 for every unrolling degree
- Effect is nonlinear with degree of unrolling
- Many subtle effects determine exact scheduling of operations

– 18 – Serial Computation

Computation:
((((((((((((1 * x0) * x1) * x2) * x3) * x4) * x5) * x6) * x7) * x8) * x9) * x10) * x11)

Performance:
- N elements, D cycles/operation
- N*D cycles (e.g., 12 elements at 4 cycles per multiply = 48 cycles)

[Diagram: a left-leaning multiply chain; each * combines the running product with the next element x0 ... x11.]

– 19 – Parallel Loop Unrolling

Code version: integer product

Optimization:
- Accumulate in two different products, which can be computed simultaneously
- Combine at end

Performance:
- CPE = 2.0
- 2X performance

void combine6(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-1;
    int *data = get_vec_start(v);
    int x0 = 1;
    int x1 = 1;
    int i;
    /* Combine 2 elements at a time, in two independent running products */
    for (i = 0; i < limit; i += 2) {
        x0 *= data[i];
        x1 *= data[i+1];
    }
    /* Finish any remaining element */
    for (; i < length; i++)
        x0 *= data[i];
    *dest = x0 * x1;
}

– 20 – Dual Product Computation

Computation:
((((((1 * x0) * x2) * x4) * x6) * x8) * x10) * ((((((1 * x1) * x3) * x5) * x7) * x9) * x11)

Performance:
- N elements, D cycles/operation
- (N/2+1)*D cycles
- ~2X performance improvement

– 21 – Requirements for Parallel Computation

Mathematical:
- Combining operation must be associative & commutative
- OK for integer multiplication
- Not strictly true for floating point (OK for most applications; see the example below)

Hardware:
- Pipelined functional units
- Ability to dynamically extract parallelism from code
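A small self-contained illustration of why floating-point combining is only "OK for most applications": reassociating a sum changes the result when magnitudes differ widely. The specific values below are chosen for illustration, not taken from the slides:

#include <stdio.h>

/* Floating-point addition is not associative: in the second grouping
   the small term z is absorbed by the huge magnitude of y before x
   can cancel it, so the two orders give different answers. */
int main(void)
{
    double x = 1e20, y = -1e20, z = 3.14;
    printf("(x + y) + z = %g\n", (x + y) + z);  /* prints 3.14 */
    printf("x + (y + z) = %g\n", x + (y + z));  /* prints 0    */
    return 0;
}

This is exactly the kind of reassociation that parallel loop unrolling performs, which is why it is safe for integer multiplication but only approximately safe for floating point.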

– 22 – Visualizing Parallel Loop

- Two multiplies within loop no longer have data dependency
- Allows them to pipeline

[Diagram: one iteration; two loads (t.1a, t.1b) feed two independent imull operations updating %ecx.1 (x0) and %ebx.1 (x1), with a single addl/cmpl/jl for loop control on %edx.]

– 23 – Executing with Parallel Loop

Predicted performance:
- Can keep 4-cycle multiplier busy performing two simultaneous multiplications
- Gives CPE of 2.0

– 25 – Parallel Unrolling: Method #2

Code version: integer product

Optimization:
- Multiply pairs of elements together, then update the product

Performance:
- CPE = 2.5

void combine6aa(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-1;
    int *data = get_vec_start(v);
    int x = 1;
    int i;
    /* Combine 2 elements at a time: pair product first, then update x */
    for (i = 0; i < limit; i += 2)
        x = x * (data[i] * data[i+1]);
    /* Finish any remaining element */
    for (; i < length; i++)
        x = x * data[i];
    *dest = x;
}

– 26 – Method #2 Computation

Computation:
((((((1 * (x0 * x1)) * (x2 * x3)) * (x4 * x5)) * (x6 * x7)) * (x8 * x9)) * (x10 * x11))

Performance:
- N elements, D cycles/operation
- Should be (N/2+1)*D cycles, i.e. CPE = 2.0
- Measured CPE of 2.5 is worse than the theoretical value

– 27 – Understanding Parallelism

/* Combine 2 elements at a time */
for (i = 0; i < limit; i += 2) {
    x = (x * data[i]) * data[i+1];
}

CPE = 4.00: all multiplies performed in sequence, since every multiply needs the latest value of x.

/* Combine 2 elements at a time, reassociated */
for (i = 0; i < limit; i += 2) {
    x = x * (data[i] * data[i+1]);
}

CPE = 2.50: multiplies overlap, because the pair product data[i] * data[i+1] does not depend on x and can execute while the previous update of x is still in flight.

– 28 – Limitations of Parallel Execution

Need lots of registers:
- To hold sums/products
- Only 6 usable integer registers, which are also needed for pointers and loop conditions
- 8 FP registers
- When there are not enough registers, temporaries must spill onto the stack, wiping out any performance gains

The sketch below shows how quickly accumulators use up the register budget.
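A hypothetical 4-way variant of the sum (combine_4way does not appear on the slides; it simply extends the two-accumulator idea) makes the pressure concrete: four partial sums plus data, limit, and i already claim seven live values, more than IA32's six usable integer registers.

/* Hypothetical 4-way parallel unrolling. The four accumulators plus
   data, limit, and i exceed IA32's 6 usable integer registers, so the
   compiler is forced to spill some values to the stack. */
void combine_4way(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length - 3;
    int *data = get_vec_start(v);
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    /* Combine 4 elements at a time, in four independent sums */
    for (i = 0; i < limit; i += 4) {
        s0 += data[i];
        s1 += data[i+1];
        s2 += data[i+2];
        s3 += data[i+3];
    }
    /* Finish any remaining elements */
    for (; i < length; i++)
        s0 += data[i];
    *dest = s0 + s1 + s2 + s3;
}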

– 29 – Machine-Dependent Opt. Summary

Pointer code:
- Look carefully at generated code to see whether it is actually helpful

Loop unrolling:
- Some compilers do this automatically
- Generally not as clever as what can be achieved by hand

Exposing instruction-level parallelism:
- Very machine dependent (GCC on IA32/Linux is not very good at it)

Warning:
- Benefits depend heavily on the particular machine
- Do this only for performance-critical parts of code
