Computer Architecture Lecture 7: Compiler Considerations and Optimizations


Structure of Recent Compilers
Front End: transforms the source language into a common intermediate form (language dependent, machine independent). Note: only a few companies build C++ front ends; the source code of a C++ front end is about 30 times bigger than that of a C front end, so most front ends translate C++ down to C before compilation.
High-Level Optimization: high-level loop optimizations, for example procedure in-lining (language dependent, machine independent).
Global Optimization: global and local optimization plus register allocation (small language dependence, small machine dependence).
Code Generation: detailed instruction selection and machine-dependent optimization (no language dependence, highly machine dependent).

Compiler Primary Targets
Program correctness
Speed of the generated code
Compilation time?
The division of compilers into phases helps compiler writers produce bug-free code.

Optimizations
High-level
Local (within a basic block)
Global (across branches)
Register allocation, live-range analysis
Processor dependent

Optimization Names
Procedure integration (in-lining)
Common sub-expression elimination / dead code elimination:
    A = b + c    ; dead code, eliminated: A is overwritten before any use
    A = x + y
Similarly, a call to a procedure that does not return a value and uses only local variables will be eliminated. (Test this in VC++.)
Constant propagation: a variable used as a constant is replaced by the constant. ("Constants aren't, variables won't." Osborn's Law)
Global sub-expression elimination
Copy propagation: after a = b, uses of a are replaced by b.
Code motion: code whose result does not change with the loop index is moved out of the loop (see the sketch below).
Induction-variable elimination: A = A + 5 in a loop that runs n times is replaced with A = A + 5 * n outside the loop, if A is not used inside it.
Strength reduction: a multiply is replaced with shifts and adds where possible; A*25 + B*25 is replaced with (A + B) * 25.
Pipeline scheduling
Branch optimization
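A minimal C sketch of two of these transformations, code motion and strength reduction (the function and variable names are invented for illustration; a real compiler applies these on intermediate code):

    /* Before: x * y is loop-invariant, and i * 4 grows linearly with i. */
    void before(int *a, int n, int x, int y) {
        for (int i = 0; i < n; i++)
            a[i] = x * y + i * 4;
    }

    /* After code motion (hoist x * y out of the loop) and strength
     * reduction (replace the multiply i * 4 with a running add). */
    void after(int *a, int n, int x, int y) {
        int t = x * y;                          /* hoisted loop-invariant value */
        for (int i = 0, j = 0; i < n; i++, j += 4)
            a[i] = t + j;                       /* j == i * 4 at every iteration */
    }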

Problems with Pointers
    A = 5;
    p = x + y;    ; address arithmetic: p may or may not point to A
    *p = 9;       ; only the programmer knows whether p == &A
Because of possible aliasing, the compiler cannot safely keep A in a register across the store through p.
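A small C sketch of why aliasing blocks register allocation (the function names are invented; the C99 restrict qualifier is one way for the programmer to promise no aliasing):

    #include <stdio.h>

    /* The compiler must assume p may point to *a, so it has to
     * reload *a from memory after the store through p. */
    int may_alias(int *a, int *p) {
        *a = 5;
        *p = 9;
        return *a;          /* 9 if p == a, otherwise 5 */
    }

    /* restrict promises that a and p never alias, so the compiler
     * may keep *a in a register and fold the return value to 5. */
    int no_alias(int *restrict a, int *restrict p) {
        *a = 5;
        *p = 9;
        return *a;
    }

    int main(void) {
        int x = 0;
        printf("%d\n", may_alias(&x, &x));   /* prints 9 */
        int y = 0, z = 0;
        printf("%d\n", no_alias(&y, &z));    /* prints 5 */
        return 0;
    }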

Architecture Help
Provide orthogonality: the operations, the data types, the addressing modes, and the register functions should be orthogonal.
Simplify trade-offs between alternatives. (With caches and pipelining, trade-offs have become very complex.) For example, the most difficult one in a register-memory architecture: how many times must a variable be referenced before it pays to assign it a register?
Provide instructions that bind variables with quantities known at compile time (constants).
Most SIMD kernels are still hand-coded because compiler support is lacking.

Hand-Coded vs. Compiler-Generated Code
On the TMS320C6203 (a VLIW CPU), as reported in May 2000 for the EEMBC Telecom kernels:

    EEMBC Telecom kernel         Execution-time ratio       Code-size ratio
                                 (compiler/hand-written)    (compiler/hand-written)
    Convolutional encoder        -                          -
    Fixed-point complex FFT      -                          -
    Viterbi GSM decoder          -                          -
    Fixed-point bit allocation   -                          -
    Autocorrelation              1.8                        0.7

For autocorrelation, the compiler-generated code ran 1.8 times slower but was 0.7 times the size of the hand-written version.

Basic Compiler Techniques
Basic pipelining and static loop unrolling. The examples below assume a standard five-stage pipeline with these latencies (empty cycles needed between a producing and a consuming instruction):

    Instruction producing result    Instruction using result    Latency in clock cycles
    FP ALU op                       Another FP ALU op           3
    FP ALU op                       Store double                2
    Load double                     FP ALU op                   1
    Load double                     Store double                0

Example (Contd...)
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop
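For reference, this assembly is the compiled form of the scalar loop that reappears on the Loop-Level Parallelism slide; a minimal C rendering (the function name is invented):

    /* Adds the scalar s to every element x[1]..x[1000];
     * R1 walks the array and R2 holds the stopping address. */
    void add_scalar(double *x, double s) {
        for (int i = 1000; i > 0; i = i - 1)
            x[i] = x[i] + s;
    }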

Example (Without Scheduling)
    Loop: L.D    F0, 0(R1)
          stall                  ; load-use delay
          ADD.D  F4, F0, F2
          stall                  ; FP ALU -> store needs 2 empty cycles
          stall
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          stall
          BNE    R1, R2, Loop
          stall                  ; successor flushed (branch delay)
Total: 10 clock cycles per element.

Example (With Scheduling)
    Loop: L.D    F0, 0(R1)
          DADDUI R1, R1, #-8
          ADD.D  F4, F0, F2
          stall
          BNE    R1, R2, Loop
          S.D    F4, 8(R1)       ; fills the branch-delay slot; offset adjusted for the early DADDUI
Total: 6 clock cycles (3 doing the actual work, 3 overhead).

Example (Static Loop Unrolling, 4 times)
    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          ADD.D  F16, F14, F2
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          DADDUI R1, R1, #-32
          S.D    F12, 16(R1)     ; 16 - 32 = -16
          BNE    R1, R2, Loop
          S.D    F16, 8(R1)      ; delay slot; 8 - 32 = -24
Total: 14 clock cycles, i.e. 3.5 per element. A C-level sketch follows below.
Compiler considerations:
1. Use of the delay slot
2. Loop-level independence
3. Register assignment
4. Proper loop adjustment (store offsets fixed up after moving DADDUI)
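A C-level sketch of what fourfold unrolling does to the source loop (assuming, as the assembly does, that the trip count is a multiple of 4; a real compiler emits a cleanup loop for leftover iterations):

    /* One copy of the loop overhead (decrement, branch) per four elements. */
    void add_scalar_unrolled(double *x, double s) {
        for (int i = 1000; i > 0; i = i - 4) {
            x[i]     = x[i]     + s;
            x[i - 1] = x[i - 1] + s;
            x[i - 2] = x[i - 2] + s;
            x[i - 3] = x[i - 3] + s;
        }
    }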

Example (Static Dual Issue: 1 integer + 1 FP instruction per clock cycle)
The loop is unrolled 5 times; the first ADD.D waits one cycle for its load (load-use delay):

    Integer instruction          FP instruction           Cycle
    Loop: L.D  F0, 0(R1)                                    1
          L.D  F6, -8(R1)                                   2
          L.D  F10, -16(R1)      ADD.D F4, F0, F2           3
          L.D  F14, -24(R1)      ADD.D F8, F6, F2           4
          L.D  F18, -32(R1)      ADD.D F12, F10, F2         5
          S.D  F4, 0(R1)         ADD.D F16, F14, F2         6
          S.D  F8, -8(R1)        ADD.D F20, F18, F2         7
          S.D  F12, -16(R1)                                 8
          DADDUI R1, R1, #-40                               9
          S.D  F16, 16(R1)                                 10
          BNE  R1, R2, Loop                                11
          S.D  F20, 8(R1)        ; delay slot              12

Total: 12 clock cycles for 5 elements = 2.4 cycles per element.

VLIW
The compiler formats the issue packets.
The compiler ensures that dependences are not present within a packet.
Instructions are 64 to 200 bits long.

Example (VLIW: 1 integer, 2 FP, 2 LD/ST slots per clock cycle, 5 slots total)
The loop is unrolled 7 times:

    Mem slot 1          Mem slot 2          FP slot 1            FP slot 2            Int/branch slot
    L.D F0, 0(R1)       L.D F6, -8(R1)
    L.D F10, -16(R1)    L.D F14, -24(R1)
    L.D F18, -32(R1)    L.D F22, -40(R1)    ADD.D F4, F0, F2     ADD.D F8, F6, F2
    L.D F26, -48(R1)                        ADD.D F12, F10, F2   ADD.D F16, F14, F2
                                            ADD.D F20, F18, F2   ADD.D F24, F22, F2
    S.D F4, 0(R1)       S.D F8, -8(R1)      ADD.D F28, F26, F2
    S.D F12, -16(R1)    S.D F16, -24(R1)                                              DADDUI R1, R1, #-56
    S.D F20, 24(R1)     S.D F24, 16(R1)
    S.D F28, 8(R1)                                                                    BNE R1, R2, Loop

Total: 9 cycles for 7 elements = 1.29 cycles per element; 23 of a potential 45 slots are used.

Loop-Level Parallelism
Loop-carried dependence: data calculated in one loop iteration is required by a later iteration.
A parallel loop (no loop-carried dependence):
    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;

Example
    for (i = 1; i <= 100; i = i + 1) {
        A[i+1] = A[i] + C[i];     /* S1 */
        B[i+1] = B[i] + A[i+1];   /* S2 */
    }
Dependences? (S1 uses the A[i] produced by S1 in the previous iteration, and S2 uses the B[i] produced by S2 in the previous iteration, so both dependences are loop-carried and the loop is not parallel; S2 also uses A[i+1] from S1 within the same iteration, which by itself would not prevent parallelism.)

Example 2
Make the following loop parallel (a restructured version follows below):
    for (i = 1; i <= 100; i = i + 1) {
        A[i] = A[i] + B[i];       /* S1 */
        B[i+1] = C[i] + D[i];     /* S2 */
    }
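One standard way to parallelize it: the dependence through B is loop-carried (S2 of iteration i feeds S1 of iteration i+1) but not circular, so the loop can be restructured by peeling:

    A[1] = A[1] + B[1];                  /* peeled first use of B */
    for (i = 1; i <= 99; i = i + 1) {
        B[i+1] = C[i] + D[i];            /* producer and consumer of B[i+1] */
        A[i+1] = A[i+1] + B[i+1];        /* now sit in the same iteration   */
    }
    B[101] = C[100] + D[100];            /* peeled last definition of B */

No value now crosses between iterations, so the loop body is parallel.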

The GCD Test
Suppose a loop stores into index a*j + b and later fetches from index c*k + d. A sufficient test: if a loop-carried dependence exists, then GCD(c, a) must divide (d - b) with no remainder. So if GCD(c, a) does not divide (d - b), no dependence is possible.
    for (i = 1; i <= 100; i = i + 1)
        x[2*i + 3] = x[2*i] * 5;
Note that this test ignores the loop bounds.
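Applying the test to the loop above: the store index 2*i + 3 gives a = 2, b = 3; the load index 2*i gives c = 2, d = 0. GCD(2, 2) = 2 does not divide d - b = -3, so there is no loop-carried dependence. A tiny C sketch of the check (helper names invented here):

    #include <stdio.h>

    static int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

    int main(void) {
        int a = 2, b = 3;   /* store index: a*i + b */
        int c = 2, d = 0;   /* load  index: c*i + d */
        int g = gcd(c, a);
        /* A loop-carried dependence is possible only if g divides d - b. */
        printf("GCD = %d, d - b = %d: %s\n", g, d - b,
               (d - b) % g == 0 ? "dependence possible"
                                : "no loop-carried dependence");
        return 0;
    }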

Example 3
Use renaming to expose ILP:
    for (i = 1; i <= 100; i = i + 1) {
        Y[i] = X[i] / c1;     /* S1 */
        X[i] = X[i] + c2;     /* S2 */
        Z[i] = Y[i] + c3;     /* S3 */
        Y[i] = c4 - Y[i];     /* S4 */
    }
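One possible renaming (T and X1 are fresh names invented here; X must be renamed consistently afterwards or copied back). The anti- and output dependences through Y and X disappear, leaving only the true dependences from S1:

    for (i = 1; i <= 100; i = i + 1) {
        T[i]  = X[i] / c1;    /* Y renamed to T: kills the output dep. S1 -> S4 */
        X1[i] = X[i] + c2;    /* X renamed to X1: kills the anti-dep. S1 -> S2  */
        Z[i]  = T[i] + c3;
        Y[i]  = c4 - T[i];    /* reads T, so the anti-dep. S3 -> S4 is gone     */
    }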

Other Techniques
Combining immediates (a by-product of copy and constant propagation):
    ADDI R1, R2, #4
    ADDI R1, R1, #4
becomes
    ADDI R1, R2, #8
Tree-height reduction: the dependent chain
    ADD R1, R2, R3
    ADD R2, R1, R5
    ADD R7, R2, R8
computes ((R2 + R3) + R5) + R8 serially; reassociating it as (R2 + R3) + (R5 + R8) lets the two inner adds execute in parallel, reducing the tree height from 3 to 2.
Recurrence optimization: unrolling sum = sum + x[i] and reassociating gives
    sum = (sum + x[1]) + ((x[2] + x[3]) + (x[4] + x[5]))
so the independent sub-sums can be computed in parallel.
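A C sketch of the recurrence optimization: reassociating a sequential sum into independent partial sums roughly halves the length of the dependence chain (safe for integers; for floating point it changes rounding, so compilers need permission, e.g. GCC's -ffast-math):

    /* Sequential: every add waits on the previous one (chain of length n). */
    int sum_seq(const int *x, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum = sum + x[i];
        return sum;
    }

    /* Reassociated: two independent chains that can run in parallel. */
    int sum_pair(const int *x, int n) {
        int s0 = 0, s1 = 0;
        int i;
        for (i = 0; i + 1 < n; i += 2) {
            s0 += x[i];
            s1 += x[i + 1];
        }
        if (i < n)          /* leftover element when n is odd */
            s0 += x[i];
        return s0 + s1;
    }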