Lecture 11: Machine-Dependent Optimization
CS 105, February 27, 2019
From Monday...

    void combine1(vec_ptr v, data_t *dest) {
        long i;
        *dest = IDENT;
        for (i = 0; i < vec_length(v); i++) {
            data_t val;
            get_vec_element(v, i, &val);
            *dest = *dest OP val;
        }
    }

    void combine2(vec_ptr v, data_t *dest) {
        long i;
        data_t t = IDENT;
        long length = vec_length(v);
        data_t *d = get_vec_start(v);
        for (i = 0; i < length; i++) {
            t = t OP d[i];
        }
        *dest = t;
    }

Measured performance (CPE, cycles per element):

    Method          Integer Add   Integer Mult   Double FP Add   Double FP Mult
    Combine1 -O0        22.68         20.02           19.98           20.18
    Combine1 -O1        10.12         10.12           10.17           11.14
    Combine2             1.27          3.01            3.01            5.01
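For context, these routines use the vector abstract data type from earlier lectures. A minimal sketch of what it might look like (field and function names follow the CS:APP presentation; the exact definitions in this course's code may differ):

    /* Sketch of the vector ADT assumed by combine1/combine2 (assumed, not from the slides) */
    typedef long data_t;          /* element type; could also be int, float, or double */
    #define IDENT 0               /* identity for OP: 0 for +, 1 for *                 */
    #define OP    +               /* the combining operation                           */

    typedef struct {
        long len;
        data_t *data;
    } vec_rec, *vec_ptr;

    long vec_length(vec_ptr v)      { return v->len; }
    data_t *get_vec_start(vec_ptr v){ return v->data; }

    /* Retrieve element i, with a bounds check on every call */
    int get_vec_element(vec_ptr v, long i, data_t *val) {
        if (i < 0 || i >= v->len)
            return 0;
        *val = v->data[i];
        return 1;
    }

The bounds check inside get_vec_element is part of what combine2 avoids by reading directly through the raw data pointer.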
Assembly/Machine Code View

[Figure: CPU (registers, PC, condition codes) connected to memory (code, data, heap, stack) by address, data, and instruction buses]

Programmer-visible state:
  - PC: program counter
  - 16 integer registers
  - Condition codes
  - Memory: byte-addressable array holding code and user data, plus a stack to support procedures
The Datapath for a CPU
Pipelined Processors

  1. Instruction Fetch (IF)
  2. Instruction Decode and register file read (ID)
  3. Execute or address calculation (EX)
  4. Memory Access (MEM)
  5. Write Back (WB)
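As an illustration (not from the slides): with these five stages, three independent instructions can overlap so that a new one starts every cycle, finishing all three in 7 cycles instead of 15.

    Cycle:    1    2    3    4    5    6    7
    Instr 1:  IF   ID   EX   MEM  WB
    Instr 2:       IF   ID   EX   MEM  WB
    Instr 3:            IF   ID   EX   MEM  WB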
Superscalar Processors

Definition: A superscalar processor can issue and execute multiple instructions in one cycle. The instructions are retrieved from a sequential instruction stream and are usually scheduled dynamically.

Benefit: without any programming effort, a superscalar processor can take advantage of the instruction-level parallelism that most programs have.

Most modern CPUs are superscalar. Intel: since Pentium (1993).
What might go wrong?

  - Instructions might need the same hardware (structural hazard)
      → duplicate functional units
Modern CPU Design

[Figure: block diagram of a modern CPU. An instruction control unit (fetch control, instruction cache, instruction decode, retirement unit, register file) sends operations to an execution unit containing multiple functional units (branch, three arithmetic, load, store) backed by a data cache; operation results, register updates, and branch prediction feedback ("Prediction OK?") flow back to instruction control.]
Example: Haswell CPU (introduced 2013)

  - 4-way superscalar, 14 pipeline stages
  - 8 total functional units; multiple instructions can execute in parallel:
      2 load (with address computation)
      1 store (with address computation)
      4 integer
      2 FP multiply
      1 FP add
      1 FP divide
  - Some instructions take more than 1 cycle, but can be pipelined:

    Instruction                  Latency   Cycles/Issue
    Load / Store                 4         1
    Integer Multiply             3         1
    Integer/Long Divide          3-30      3-30
    Single/Double FP Multiply    5         1
    Single/Double FP Add         3         1
    Single/Double FP Divide      3-15      3-15
Pipelined Functional Units

    long mult_eg(long a, long b, long c) {
        long p1 = a*b;
        long p2 = a*c;
        long p3 = p1 * p2;
        return p3;
    }

  - Divide the computation into stages
  - Pass partial computations from stage to stage
  - Stage i can start on a new computation once its values are passed to stage i+1
  - Complete 3 multiplications in 7 cycles, not 9 cycles:

    Time      1     2     3     4     5      6      7
    Stage 1   a*b   a*c               p1*p2
    Stage 2         a*b   a*c                p1*p2
    Stage 3               a*b   a*c                 p1*p2
What might go wrong?

  - Instructions might need the same hardware (structural hazard)
      → duplicate functional units
      → dynamic scheduling
  - Instructions might not be independent (data hazard; see the sketch below)
      → forwarding
      → stalling
  - Might not know the next instruction (control hazard)
      → branch prediction
      → eliminate unnecessary jumps (e.g., loop unrolling)

Haswell can pick from a 96-instruction window.
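To make the data-hazard case concrete, here is a small illustration (my own example, not from the slides): the first function forms a dependency chain, while the second contains two independent multiplies that a superscalar CPU can issue in the same cycle.

    /* Hypothetical example of a data hazard (not from the slides) */
    long dependent(long a, long b, long c) {
        long p = a * b;      /* produces p                                     */
        return p * c;        /* needs p right away: a read-after-write         */
                             /* dependency; this multiply must wait (or rely   */
                             /* on forwarding) for the first one to finish     */
    }

    long independent(long a, long b, long c, long d) {
        long p = a * b;      /* these two multiplies share no results, so the  */
        long q = c * d;      /* CPU can start them in the same cycle           */
        return p + q;
    }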
combine, revisited

    void combine2(vec_ptr v, data_t *dest) {
        long i;
        long length = vec_length(v);
        data_t *d = get_vec_start(v);
        data_t t = IDENT;
        for (i = 0; i < length; i++)
            t = t OP d[i];
        *dest = t;
    }

  - Move vec_length out of the loop
  - Avoid the bounds check on each iteration
  - Accumulate in a temporary
Measuring Performance

Latency: how long it takes to get one result, from beginning to end
  - Appropriate measure when each instruction must wait for another to complete
  - Upper bound on time

Throughput: how long between completions of instructions
  - Usually (cycle time) / (number of functional units)
  - Appropriate measure when there are no dependencies between instructions
  - Lower bound on time
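As a worked example using the Haswell figures above (my arithmetic, not a claim from the slides): double-precision FP multiply has a latency of 5 cycles, and there are 2 pipelined FP multipliers, each able to start a new multiply every cycle. A chain of dependent multiplies therefore costs 5 cycles per element (latency bound: CPE = 5.00), whereas fully independent multiplies can complete two per cycle (throughput bound: CPE = 1/2 = 0.50). These are exactly the Latency Bound and Throughput Bound values that appear in the tables below.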
x86-64 Compilation of Combine2

Inner loop (case: integer multiply):

    .L519:                            # Loop:
        imull  (%rax,%rdx,4), %ecx    #   t = t * d[i]
        addq   $1, %rdx               #   i++
        cmpq   %rdx, %rbp             #   Compare length:i
        jg     .L519                  #   If >, goto Loop

    Method          Integer Add   Integer Mult   Double FP Add   Double FP Mult
    Combine1 -O0        22.68         20.02           19.98           20.18
    Combine1 -O1        10.12         10.12           10.17           11.14
    Combine2             1.27          3.01            3.01            5.01
    Latency Bound        1.00          3.00            3.00            5.00
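Reading off the loop (an observation, not stated explicitly on the slide): each imull needs the value of %ecx produced by the previous iteration's imull, so the multiplies form a single dependency chain. With a 3-cycle integer-multiply latency, the loop cannot run faster than about 3 cycles per element, which is why Combine2 sits right at the 3.00 latency bound for integer multiply; integer add, with its 1-cycle latency, reaches 1.27, close to its 1.00 bound.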
Loop Unrolling (2x1)

Perform 2x more useful work per iteration:

    void unroll2a_combine(vec_ptr v, data_t *dest) {
        long length = vec_length(v);
        long limit = length-1;
        data_t *d = get_vec_start(v);
        data_t x = IDENT;
        long i;
        /* Combine 2 elements at a time */
        for (i = 0; i < limit; i+=2) {
            x = (x OP d[i]) OP d[i+1];
        }
        /* Finish any remaining elements */
        for (; i < length; i++) {
            x = x OP d[i];
        }
        *dest = x;
    }
Effect of Loop Unrolling

    Method          Integer Add   Integer Mult   Double FP Add   Double FP Mult
    Combine1 -O0        22.68         20.02           19.98           20.18
    Combine1 -O1        10.12         10.12           10.17           11.14
    Combine2             1.27          3.01            3.01            5.01
    Unroll 2             1.01          3.01            3.01            5.01
    Latency Bound        1.00          3.00            3.00            5.00

  - Helps integer add: achieves the latency bound
  - Is this the best we can do?

        x = (x OP d[i]) OP d[i+1];
Combine = Serial Computation (OP = *)

    x = (x OP d[i]) OP d[i+1];

[Figure: a single chain of multiplies; each product feeds the next]

Computation (length = 8):
    ((((((((1 * d[0]) * d[1]) * d[2]) * d[3]) * d[4]) * d[5]) * d[6]) * d[7])

  - Sequential dependence
  - Performance: determined by the latency of OP
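In the notation used on the following slides (a small derivation, not shown on this slide): with N elements and D cycles of latency per OP, this chain takes roughly N * D cycles, so CPE ≈ D. That is the latency bound, and it is essentially where Combine2 and Unroll 2 land for the multiply cases.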
Loop Unrolling with Reassociation (2x1a)

    void unroll2aa_combine(vec_ptr v, data_t *dest) {
        long length = vec_length(v);
        long limit = length-1;
        data_t *d = get_vec_start(v);
        data_t x = IDENT;
        long i;
        /* Combine 2 elements at a time */
        for (i = 0; i < limit; i+=2) {
            x = x OP (d[i] OP d[i+1]);
        }
        /* Finish any remaining elements */
        for (; i < length; i++) {
            x = x OP d[i];
        }
        *dest = x;
    }

Compare to before:
    x = (x OP d[i]) OP d[i+1];
Reassociated Computation

    x = x OP (d[i] OP d[i+1]);

[Figure: each pair d[i] * d[i+1] is computed off the critical path; only the running product of the pair results forms a chain]

What changed:
  - Ops in the next iteration can be started early (no dependency)

Overall performance:
  - N elements, D cycles latency per op
  - (N/2 + 1) * D cycles, so CPE = D/2
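Plugging in the Haswell latencies (my arithmetic): for integer multiply D = 3, predicting CPE = 3/2 = 1.50, and for double FP multiply D = 5, predicting 5/2 = 2.50. The measured 1.51 and 2.51 on the next slide match these predictions almost exactly.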
Effect of Reassociation

    Method            Integer Add   Integer Mult   Double FP Add   Double FP Mult
    Combine1 -O0          22.68         20.02           19.98           20.18
    Combine1 -O1          10.12         10.12           10.17           11.14
    Combine2               1.27          3.01            3.01            5.01
    Unroll 2               1.01          3.01            3.01            5.01
    Unroll 2a              1.01          1.51            1.51            2.51
    Latency Bound          1.00          3.00            3.00            5.00
    Throughput Bound       0.50          1.00            1.00            0.50

  - Nearly 2x speedup for Int *, FP +, FP *
  - Reason: breaks the sequential dependency
        x = x OP (d[i] OP d[i+1]);
  - 2 functional units for FP *, 2 functional units for load
  - 4 functional units for int +, 2 functional units for load
Loop Unrolling with Separate Accumulators (2x2)

    void unroll2x2_combine(vec_ptr v, data_t *dest) {
        long length = vec_length(v);
        long limit = length-1;
        data_t *d = get_vec_start(v);
        data_t x0 = IDENT;
        data_t x1 = IDENT;
        long i;
        /* Combine 2 elements at a time */
        for (i = 0; i < limit; i+=2) {
            x0 = x0 OP d[i];
            x1 = x1 OP d[i+1];
        }
        /* Finish any remaining elements */
        for (; i < length; i++) {
            x0 = x0 OP d[i];
        }
        *dest = x0 OP x1;
    }
Separate Accumulators

    x0 = x0 OP d[i];
    x1 = x1 OP d[i+1];

[Figure: two independent multiply chains, one over the even-indexed elements and one over the odd-indexed elements, combined at the end]

What changed:
  - Two independent "streams" of operations

Overall performance:
  - N elements, D cycles latency per op
  - Should be (N/2 + 1) * D cycles, so CPE = D/2
  - The measured CPE matches the prediction!

What now?
Effect of Separate Accumulators

    Method            Integer Add   Integer Mult   Double FP Add   Double FP Mult
    Combine1 -O0          22.68         20.02           19.98           20.18
    Combine1 -O1          10.12         10.12           10.17           11.14
    Combine2               1.27          3.01            3.01            5.01
    Unroll 2               1.01          3.01            3.01            5.01
    Unroll 2a              1.01          1.51            1.51            2.51
    Unroll 2x2             0.81          1.51            1.51            2.51
    Latency Bound          1.00          3.00            3.00            5.00
    Throughput Bound       0.50          1.00            1.00            0.50

        x0 = x0 OP d[i];
        x1 = x1 OP d[i+1];

  - Int + makes use of two load units
  - 2x speedup (over Unroll 2) for Int *, FP +, FP *
Unrolling & Accumulating

Idea:
  - Can unroll to any degree L
  - Can accumulate K results in parallel
  - L must be a multiple of K
  (a sketch of the general pattern follows this list)

Limitations:
  - Diminishing returns: cannot go beyond the throughput limitations of the execution units
  - Large overhead for short lengths: must finish off the remaining iterations sequentially
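As a concrete sketch of the L x K idea (my own example, reusing the vec_ptr / data_t / OP / IDENT conventions from the earlier slides; the function name unroll4x2_combine is hypothetical), here is unrolling by L = 4 with K = 2 accumulators:

    void unroll4x2_combine(vec_ptr v, data_t *dest) {
        long length = vec_length(v);
        long limit = length - 3;          /* need 4 elements per full iteration */
        data_t *d = get_vec_start(v);
        data_t x0 = IDENT;
        data_t x1 = IDENT;
        long i;
        /* Combine 4 elements at a time, split across 2 independent accumulators */
        for (i = 0; i < limit; i += 4) {
            x0 = (x0 OP d[i])   OP d[i+2];
            x1 = (x1 OP d[i+1]) OP d[i+3];
        }
        /* Finish any remaining elements sequentially */
        for (; i < length; i++) {
            x0 = x0 OP d[i];
        }
        *dest = x0 OP x1;
    }

Each accumulator carries only half of the dependency chain, so the latency bound is effectively halved, while the larger unrolling factor reduces loop overhead. Raising K further keeps cutting the chain until the throughput bound (and, in practice, the limited supply of registers) takes over.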
Unrolling & Accumulating: Double *

Case: Intel Haswell, double FP multiplication
Latency bound: 5.00; throughput bound: 0.50

CPE for K accumulators with unrolling factor L, where K and L range over 1, 2, 3, 4, 6, 8, 10, 12 and L is a multiple of K. The measured values on the K = L diagonal:

    K = L:   1      2      3      4      6      8      10     12
    CPE:     5.01   2.51   1.67   1.25   0.84   0.63   0.51   0.52

(The slide also reports a few off-diagonal cells, e.g. 1.26 and 0.88, at larger L for the same K; fixing K and increasing L further makes little difference.)
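These numbers follow the pattern CPE ≈ 5/K up to the throughput bound (my observation from the data): 5/1 = 5.0, 5/2 = 2.5, 5/3 ≈ 1.67, 5/4 = 1.25, 5/6 ≈ 0.83, 5/8 ≈ 0.63, and with 10 or more accumulators the measured 0.51-0.52 is essentially the 0.50 throughput bound. It is the number of parallel accumulators K, not the unrolling factor L by itself, that cuts the dependency chain.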
Unrolling & Accumulating: Int +

Case: Intel Haswell, integer addition
Latency bound: 1.00; throughput bound: 1.00

CPE for K accumulators with unrolling factor L, where K and L range over 1, 2, 3, 4, 6, 8, 10, 12. Selected measurements from the slide: 1.27 (no unrolling), 1.01, 0.81, 0.69, 0.74, 0.56, and a best of 0.54; one aggressive configuration is actually slower (1.24), illustrating the diminishing returns noted on the previous slide.
Achievable Performance

    Method             Integer Add   Integer Mult   Double FP Add   Double FP Mult
    Combine1 -O0           22.68         20.02           19.98           20.18
    Combine1 -O1           10.12         10.12           10.17           11.14
    Combine2                1.27          3.01            3.01            5.01
    Unroll 2                1.01          3.01            3.01            5.01
    Unroll 2a               1.01          1.51            1.51            2.51
    Unroll 2x2              0.81          1.51            1.51            2.51
    Optimal Unrolling       0.54          1.01            1.01            0.52
    Latency Bound           1.00          3.00            3.00            5.00
    Throughput Bound        0.50          1.00            1.00            0.50

  - Limited only by the throughput of the functional units
  - Up to 42x improvement over the original, unoptimized code (22.68 / 0.54 ≈ 42)
  - Combines machine-independent and machine-dependent factors