1 Compiling for VLIWs and ILP Profiling Region formation Acyclic scheduling Cyclic scheduling.


1 Compiling for VLIWs and ILP Profiling Region formation Acyclic scheduling Cyclic scheduling

2 Profiling
Many crucial ILP optimizations require good profile information.
ILP optimizations try to maximize performance/price by increasing the IPC.
Compiler techniques are needed to expose and enhance ILP.
Two types of profiles: point profiles and path profiles.

3 Compiling with Profiling

4 Point Profiles
Point profiles collect statistics about points in call graphs and control flow graphs.
gprof produces call graph profiles: statistics on how many times a function was called, who called it, and (sometimes) how much time was spent in that function.
Control flow graph profiles give statistics on nodes (node profiles) and edges (edge profiles).
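
A sketch of what a point profiler collects (the trace-based representation is illustrative, not from the slides — real profilers instrument the code or sample it): node and edge counts can both be derived from an observed sequence of executed blocks.

```python
from collections import Counter

def profile_points(trace):
    """Derive node and edge profiles from an observed block trace."""
    nodes, edges = Counter(), Counter()
    for i, b in enumerate(trace):
        nodes[b] += 1                          # node profile
        if i + 1 < len(trace):
            edges[(b, trace[i + 1])] += 1      # edge profile
    return nodes, edges
```

Note that the edge profile carries strictly more information than the node profile: node counts can be recomputed from edge counts, but not vice versa.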

5 Path Profiles
Path profiles measure the execution frequency of a sequence of basic blocks along a path in a CFG.
A "hot path" is a path that is (very) frequently executed.
Types include forward paths (no backedges), bounded-length paths (fixed start/stop points), and whole-program paths (interprocedural).
The choice is a tradeoff between the accuracy of the profile and the efficiency of collecting it.
Example (for the CFG B1–B7 on the slide):
Path 1 {B1, B2, B3, B5, B7}: count = 7
Path 2 {B1, B2, B3, B6, B7}: count = 9
Path 3 {B1, B2, B4, B6, B7}: count = 123
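
A minimal sketch of forward-path collection (the cut-at-backedge rule used here is an assumption for illustration; production collectors such as Ball–Larus instead number the paths and increment counters):

```python
from collections import Counter

def profile_paths(trace, loop_headers):
    """Count forward paths: the executed block sequence is cut whenever
    control re-enters a loop header (i.e., a backedge is taken), so every
    counted path is acyclic."""
    paths, path = Counter(), []
    for b in trace:
        if path and b in loop_headers:   # backedge taken: close current path
            paths[tuple(path)] += 1
            path = []
        path.append(b)
    if path:
        paths[tuple(path)] += 1
    return paths
```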

6 Profile Collection
Data collected through code instrumentation is very detailed, but instrumentation overhead affects execution.
Hardware counters have very low overhead, but the information is not exhaustive.
Interrupt-based sampling examines the machine state at intervals.
Collecting path profiles requires enumerating the set of paths encountered at runtime.
Instrumentation inserts instructions to record edge profiling events.

7 Profile Bookkeeping
Problem: compiler optimizations modify (instrumented) code in ways that change the use and applicability of profile information for later compilation stages.
One remedy: apply profiling right before the profile data is needed.
Axiom of profile uniformity: "When one copies a chunk of a program, one should equally divide the profile frequency of the original chunk among the copies."
Use this axiom as a simple heuristic for point profiles.
Path profiles correlate branches, so path-based compiler optimizations preserve these profiles.
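
The axiom of profile uniformity amounts to a tiny bookkeeping helper (a hypothetical sketch, not from the slides): when a block with frequency f is duplicated into n copies, each copy receives f/n.

```python
def split_frequency(freq, n_copies):
    """Divide a duplicated block's profile frequency equally among its
    copies; the integer remainder goes to the first copy so that the
    total frequency is preserved."""
    base = freq // n_copies
    counts = [base] * n_copies
    counts[0] += freq - base * n_copies
    return counts
```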

8 Instruction Scheduling
Instruction scheduling is the most fundamental ILP-oriented compilation phase.
It is responsible for identifying and grouping operations that can be executed in parallel.
Two approaches:
 Cyclic schedulers operate on loops to exploit ILP in (tight) loop nests, usually without control flow
 Acyclic schedulers consider loop-free regions
Region shapes (acyclic vs. cyclic): basic block, superblock, trace, DAG.

9 Acyclic Scheduling of Basic Block Region Shapes
The region is restricted to a single basic block.
Local scheduling of instructions in a single basic block is simple.
ILP is exposed by bundling operations into VLIW instructions (instruction formation or instruction compaction):

add $r13 = $r3, $r0
shl $r13 = $r13, 3
ld.w $r14 = 0[$r4]
sub $r16 = $r6, 3
shr $r15 = $r15, 9

becomes the two-bundle schedule

add $r13 = $r3, $r0
sub $r16 = $r6, 3
;; ## end of 1st instr.
shl $r13 = $r13, 3
shr $r15 = $r15, 9
ld.w $r14 = 0[$r4]
;; ## end of 2nd instr.
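
A greedy sketch of instruction compaction over a toy register-tuple IR (an illustrative assumption, not the course's actual algorithm): scan the operations in order and start a new bundle whenever an op conflicts with a register already touched in the current bundle. Being conservative, it may produce a more sequential packing than the slide's schedule.

```python
def compact(ops):
    """ops: list of (dest, sources) register tuples. An op may not join a
    bundle that already wrote one of its sources (RAW) or touched its
    destination (WAW/WAR). Returns the list of bundles."""
    bundles, current, written, read = [], [], set(), set()
    for dest, srcs in ops:
        conflict = (dest in written or dest in read
                    or any(s in written for s in srcs))
        if conflict and current:
            bundles.append(current)              # close the current bundle
            current, written, read = [], set(), set()
        current.append((dest, srcs))
        written.add(dest)
        read.update(srcs)
    if current:
        bundles.append(current)
    return bundles
```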

10 Intermezzo: VLIW Encoding
A VLIW schedule can be encoded compactly using horizontal and vertical nops.
Start bits, stop bits, or instruction templates are used to compress the VLIW instructions into variable-width instruction bundles:

add $r13 = $r3, $r0
sub $r16 = $r6, 3
;; ## end of 1st instr.
shl $r13 = $r13, 3
shr $r15 = $r15, 9
ld.w $r14 = 0[$r4]
;; ## end of 2nd instr.

11 Intermezzo: VLIW Execution Model Subtleties
Horizontal issues (within an instruction): does a read of a register written in the same instruction see the original value or the value written by the write, or is a read and write to the same register simply illegal? There are also exception issues.

mov $r1 = 2 ;;
mov $r0 = $r1
mov $r1 = 3 ;;

Vertical issues (across pipelined instructions): the EQ model (a result appears exactly when the latency expires) versus the LEQ model (a result may appear any time up to the latency).

ld.w $r0 = 0[$r1] ;;
add $r0 = $r1, $r2 ;;
sub $r3 = $r0, $r4
…
# load completed:
add $r3 = $r3, $r0

The EQ model allows $r0 to be reused between the issue of the 1st instruction and its completion when the latency expires.

12 Acyclic Region Scheduling for Loops
To enlarge the region size of a loop body and expose more ILP, apply:
 Loop fusion
 Loop peeling
 Loop unrolling

DO I = 1, N
  A(I) = C*A(I)
ENDDO
DO I = 1, N
  D(I) = A(I)*B(I)
ENDDO

fuses into

DO I = 1, N
  A(I) = C*A(I)
  D(I) = A(I)*B(I)
ENDDO

which unrolls by 2 into (assuming 2 divides N)

DO I = 1, N, 2
  A(I) = C*A(I)
  D(I) = A(I)*B(I)
  A(I+1) = C*A(I+1)
  D(I+1) = A(I+1)*B(I+1)
ENDDO
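
The fused-and-unrolled loop can be checked against the original two-loop version in a few lines (a Python stand-in for the Fortran; the factor-divides-N assumption matches the slide):

```python
def original(a, b, c):
    a = [c * x for x in a]                # first loop:  A(I) = C*A(I)
    d = [x * y for x, y in zip(a, b)]     # second loop: D(I) = A(I)*B(I)
    return a, d

def fused_unrolled(a, b, c, factor=2):
    assert len(a) % factor == 0           # assuming `factor` divides N
    a, d = list(a), [0] * len(a)
    for i in range(0, len(a), factor):    # DO I = 1, N, 2
        for j in range(factor):           # unrolled copies of the fused body
            a[i + j] = c * a[i + j]
            d[i + j] = a[i + j] * b[i + j]
    return a, d
```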

13 Region Scheduling Across Basic Blocks
Region scheduling schedules operations across basic blocks, usually on hot paths.
It fulfills the need to increase the region size by merging operations from several blocks to expose more ILP.
But conditional flow is a problem: how do we move operations from one block to another for instruction scheduling?
(Figure: an operation moved from one block to another is now missing on an alternate path through the CFG.)

14 Region Scheduling Across Basic Blocks
Problem: how do we move operations from one block to another for instruction scheduling?
Affected branches need to be compensated.
(Figure: an operation moved from one block to another is now spuriously inserted on an alternate path through the CFG.)

15 Trace Scheduling
The earliest region scheduling approach; it has restrictions.
A trace consists of the operations from a list of basic blocks B0, B1, …, Bn:
1. Each Bi is a predecessor of (falls through or branches to) the next Bi+1 on the list
2. For any i and k there is no path Bi → Bk → Bi except for i = 0, i.e. the code is cycle-free, except that the entire region can be part of a loop
(Figure: a CFG B1–B6 with a highlighted trace.)

16 Superblocks
Superblocks are single-entry multiple-exit traces.
Superblock formation uses tail duplication to eliminate side entrances:
1. Each Bi is a predecessor of the next Bi+1 on the list (fall through)
2. For any i and k there is no path Bi → Bk → Bi except for i = 0
3. There are no branches into a block in the region (no side entrances), except to B0
(Figure: side entrances into the trace are redirected to duplicated blocks B3' and B4'.)
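
Tail duplication can be sketched over a toy CFG represented as a dict of successor lists (a hypothetical helper, simplified from real superblock formation): find the first trace block with a side entrance, duplicate the rest of the trace from there, and retarget off-trace edges into the duplicates.

```python
def tail_duplicate(cfg, trace):
    """cfg: block -> list of successors. Returns a new CFG in which the
    trace has no side entrances: the tail from the first side-entered
    block onward is duplicated (B -> B') and off-trace edges retargeted."""
    new_cfg = {b: list(s) for b, s in cfg.items()}
    start = None
    for i in range(1, len(trace)):
        preds = [p for p, ss in cfg.items() if trace[i] in ss]
        if any(p != trace[i - 1] for p in preds):   # side entrance found
            start = i
            break
    if start is None:
        return new_cfg                              # already a superblock
    dup = {b: b + "'" for b in trace[start:]}
    for b in trace[start:]:                         # duplicate the tail
        new_cfg[dup[b]] = [dup.get(s, s) for s in cfg[b]]
    for p in cfg:                                   # retarget off-trace edges
        if p not in trace:
            new_cfg[p] = [dup.get(s, s) for s in new_cfg[p]]
    return new_cfg
```

On the slide's CFG with trace B1–B2–B3–B4, the side entrance from B5 into B3 triggers duplication of B3 and B4, matching the B3'/B4' copies in the figure.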

17 Hyperblocks
Hyperblocks are single-entry multiple-exit traces with internal control flow, effectuated via instruction predication.
If-conversion folds the control flow into a single block using instruction predication.
(Figure: blocks B2 and B5 are merged into a single predicated block.)

18 Intermezzo: Predication
If-conversion translates control dependences into data dependences by predicating instructions so that they execute conditionally.
Predication requires hardware support.
Full predication adds a boolean operand to (all or selected) instructions.
Partial predication executes all instructions, but selects the final result based on a condition.

Original:
cmpgt $b1 = $r5, 0 ;;
br $b1, L1 ;;
mpy $r3 = $r1, $r2 ;;
L1: stw 0[$r10] = $r3 ;;

After full predication:
cmpgt $p1 = $r5, 0 ;;
($p1) mpy $r3 = $r1, $r2 ;;
stw 0[$r10] = $r3 ;;

After partial predication:
mpy $r4 = $r1, $r2 ;;
cmpgt $b1 = $r5, 0 ;;
slct $r3 = $b1, $r4, $r3 ;;
stw 0[$r10] = $r3 ;;
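
The partial-predication variant can be mimicked in two lines of Python (a stand-in for the assembly: the multiply always executes into a scratch register, and `slct` commits it only when the predicate holds):

```python
def partial_predication(p, r3_old, r1, r2):
    """Compute the multiply unconditionally, then select between the
    speculative result and the old value of $r3."""
    r4 = r1 * r2                 # mpy $r4 = $r1, $r2 (always executes)
    r3 = r4 if p else r3_old     # slct $r3 = $b1, $r4, $r3
    return r3
```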

19 Treegions
Treegions are regions containing a tree of blocks such that no block in a treegion has side entrances.
Any path through a treegion is a superblock.
(Figure: a CFG partitioned into treegions 1–3.)

20 Region Formation
The scheduler constructs schedules for a single region at a time.
Region selection: select which region to optimize (within the limits of the region shape), i.e. group traces of frequently executed blocks into regions.
Region enlargement: regions may need to be enlarged to expose enough ILP for the scheduler.
(Pipeline: region selection → region enlargement → schedule construction.)

21 Region Selection by Trace Growing
Trace growing uses the mutual-most-likely heuristic: suppose A is the last block in the trace; add block B to the trace if B is A's most likely successor and A is B's most likely predecessor.
The same idea also works to grow the trace backward.
It requires only edge profiling, but the result can be poor because edge profiling does not correlate branch probabilities.
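
A sketch of forward trace growing over an edge profile (the dict-of-edge-counts representation is an illustrative assumption):

```python
def grow_trace(edge_freq, seed):
    """Grow a trace forward from `seed`: extend with B only if B is the
    most likely successor of the trace's last block A *and* A is B's most
    likely predecessor (the mutual-most-likely condition)."""
    def best(pairs):
        return max(pairs, key=lambda kv: kv[1])[0] if pairs else None
    trace = [seed]
    while True:
        a = trace[-1]
        b = best([(d, c) for (s, d), c in edge_freq.items() if s == a])
        if b is None or b in trace:
            break
        if best([(s, c) for (s, d), c in edge_freq.items() if d == b]) != a:
            break                  # A is not B's most likely predecessor
        trace.append(b)
    return trace
```

In the test below, growth stops at D even though D is B's most likely successor, because D's own most likely predecessor is E — exactly the situation the mutual condition guards against.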

22 Region Selection by Path Profiling
Treat a trace as a path and consider its execution frequency obtained by path profiling.
Correlations are preserved in the region formation process.
path 1: {B1, B2, B3, B4} count = 44
path 2: {B1, B2, B3, B6, B4} count = 0
path 3: {B1, B5, B3, B4} count = 16
path 4: {B1, B5, B3, B6, B4} count = 12

23 Superblock Enlargement by Target Expansion
Target expansion is useful when the branch at the end of a superblock has a high probability but the superblock cannot be enlarged due to a side entrance.
Duplicate the sequence of target blocks to create a larger superblock.
(Figure: the target blocks B3 and B4 are duplicated as B3' and B4' onto the frequently taken (80%) exit.)

24 Superblock Enlargement by Loop Peeling
Peel a number of iterations off a small loop body to create a larger superblock that branches into the loop.
Useful when the profiled loop iteration count is bounded by a small constant (two iterations in the example).
(Figure: two peeled copies B1'/B2' and B1''/B2'' precede the original loop B1/B2.)

25 Superblock Enlargement by Loop Unrolling
Loops with a superblock body and a backedge with high probability are called superblock loops.
When a superblock loop is small we can unroll the loop.
(Figure: the loop body B1/B2 is unrolled into copies B1'/B2' and B1''/B2''.)

26 Exposing ILP After Loop Unrolling
Loop unrolling exposes a limited amount of ILP.
Cross-iteration dependences on the loop counter's updates prevent parallel execution of the copies of the loop body.
In general, instructions cannot be moved across split points (the loop exits between the unrolled copies).
Note: speculative execution can be used to hoist instructions above split points.

27 Exposing ILP with Renaming and Copy Propagation

28 Schedule Construction
The schedule constructor (scheduler) uses compaction techniques to produce a schedule for a region after region formation.
The goal is to minimize an objective cost function while maintaining program correctness and obeying resource limitations:
 Increase speed by reducing completion time
 Reduce code size
 Increase energy efficiency

29 Schedule Construction and Explicitly Parallel Architectures
A scheduler for an explicitly parallel architecture such as VLIW and EPIC uses the exposed ILP to statically schedule instructions in parallel.
Instruction compaction must obey data dependences (RAW, WAR, and WAW) and control dependences to ensure correctness.
(Example: the same two-bundle compaction shown on slide 9.)

30 Schedule Construction and Instruction Latencies
Instruction latencies must be taken into account by the scheduler, but they are not always fixed or the same for all operations.
A scheduler can assume average or worst-case instruction latencies.
Hide instruction latencies by ensuring there is sufficient height between an instruction's issue and the use of its result, to avoid pipeline stalls.
Also recall the difference between the EQ and LEQ models.

mul $r3 = $r3, $r1      # takes 2 cycles to complete
add $r13 = $r2, $r3     # takes 1 cycle; RAW hazard on $r3
ld.w $r14 = 0[$r5]      # takes >3 cycles (4 cycles avg.)
add $r13 = $r13, $r14   # RAW hazard on $r14
ld.w $r15 = 0[$r6]

31 Linear Scheduling Techniques
Instruction compaction using linear-time scans over the region:
 As-soon-as-possible (ASAP) scheduling places ops in the earliest possible cycle using a top-down scan
 As-late-as-possible (ALAP) scheduling places ops in the latest possible cycle using a bottom-up scan
 Critical-path (CP) scheduling uses ASAP followed by ALAP
Resource hazard detection is local (at most one load per instruction in the example):

mul $r3 = $r3, $r1
ld.w $r14 = 0[$r5] ;;
ld.w $r15 = 0[$r6] ;;
add $r13 = $r2, $r3 ;;
add $r13 = $r13, $r14 ;;
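
An ASAP pass over straight-line code can be sketched as follows (toy op format and latencies are assumptions; resource limits such as the one-load-per-instruction rule are ignored for brevity):

```python
def asap(ops, latency):
    """ops: list of (opcode, dest, sources). Top-down scan: each op issues
    in the first cycle at which all its source registers are available
    (RAW dependences only)."""
    avail = {}            # register -> cycle its value becomes ready
    issue = []
    for opcode, dest, srcs in ops:
        start = max([avail.get(s, 0) for s in srcs], default=0)
        issue.append(start)
        avail[dest] = start + latency[opcode]
    return issue
```

Run on slide 30's five-op sequence with latencies mul = 2, ld = 3, add = 1, it places both loads and the multiply in cycle 0 and the dependent adds in cycles 2 and 3.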

32 List Scheduling
List scheduling schedules operations from the global region based on a data dependence graph (DDG) or program dependence graph (PDG), both of which have O(n^2) complexity.
It repeatedly selects an operation from a data-ready queue (DRQ), where an operation is ready when all of its DDG predecessors have been scheduled:

for each root r in the PDG sorted by priority do
  enqueue(r)
while DRQ is non-empty do
  h = dequeue()
  schedule(h)
  for each DAG successor s of h do
    if all predecessors of s have been scheduled then
      enqueue(s)
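
The pseudocode above can be made runnable over a toy DDG (successor map; the priority function is assumed given, e.g. critical-path height):

```python
import heapq

def list_schedule(ddg, priority):
    """ddg: op -> list of successors. An op enters the data-ready queue
    (DRQ) once all its predecessors are scheduled; the highest-priority
    ready op is dequeued and scheduled next."""
    n_preds = {n: 0 for n in ddg}
    for succs in ddg.values():
        for s in succs:
            n_preds[s] += 1
    drq = [(-priority[n], n) for n, c in n_preds.items() if c == 0]
    heapq.heapify(drq)            # max-priority queue via negated keys
    order = []
    while drq:
        _, h = heapq.heappop(drq)
        order.append(h)
        for s in ddg[h]:
            n_preds[s] -= 1
            if n_preds[s] == 0:   # all predecessors scheduled: s is ready
                heapq.heappush(drq, (-priority[s], s))
    return order
```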

33 Data Dependence Graph
The data dependence graph (DDG):
 Nodes are operations
 Edges are RAW, WAR, and WAW dependences

34 Control Flow Dependence

35 Compensation Code
Compensation code is needed when operations are scheduled across basic blocks in a region.
Compensation code corrects scheduling changes by duplicating code on entries to and exits from a scheduled region.
(Figure: the scheduler interchanges operations A and B; the entry and/or exit between them must be compensated.)

36 No Compensation
No compensation code is needed when block B has neither a side entry nor an exit.
(Figure: A and B are simply interchanged.)

37 Join Compensation
Join compensation is applied when block B has an entry: duplicate block B.
(Figure: the side entrance Z now enters through the duplicate B'.)

38 Split Compensation
Split compensation is applied when block B has an exit: duplicate block A.
(Figure: the exit W now executes the duplicate A'.)

39 Join-Split Compensation
Join-split compensation is applied when block B has both an entry and an exit: duplicate blocks A and B.
(Figure: the duplicates A' and B' serve the side entrance Z and the exit W.)

40 Resource Management with Reservation Tables
A resource reservation table records which resources are busy in each cycle.
Reservation tables allow easy scheduling of operations by matching an operation's required resources to empty slots.
The reservation table at a join point in the CFG is constructed by merging the busy slots from both branches.

Cycle | Integer ALU | FP ALU | MEM | Branch
  0   |    busy     |        |     |
  1   |             |        |     |
  2   |             |        |     |
  3   |             |        |     |
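
A reservation table can be sketched as a cycle-indexed set of busy resources (a hypothetical helper, for illustration):

```python
def place(table, earliest, resources):
    """Place an op at the first cycle >= earliest in which none of its
    required resources are busy; mark them busy and return that cycle.
    table: cycle -> set of busy resource names."""
    cycle = earliest
    while any(r in table.setdefault(cycle, set()) for r in resources):
        cycle += 1
    table[cycle].update(resources)
    return cycle
```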

41 Software Pipelining
DO i = 0, 6
  A; B; C; D; E; F; G; H
ENDDO
Assuming that the initiation interval (II) is 3 cycles, overlapping the iterations yields a prologue (pipeline filling), a steady-state kernel, and an epilogue (pipeline draining).
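
The prologue/kernel/epilogue shape can be sketched numerically (three pipeline stages are assumed for illustration; the slide fixes only the seven iterations and the 3-cycle II):

```python
def pipeline_shape(n_iters, n_stages):
    """Iteration i's stage s occupies stage-slot i + s, where each slot is
    one initiation interval long. Slots running all n_stages stages at once
    form the kernel; partial slots before/after are prologue/epilogue."""
    slots = {}
    for i in range(n_iters):
        for s in range(n_stages):
            slots.setdefault(i + s, []).append((i, s))
    kernel = [t for t, v in sorted(slots.items()) if len(v) == n_stages]
    return slots, kernel
```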

42 Software Pipelining Example
(Figure: the loop's dependence graph, with edge latencies of >3 cycles, >2 cycles, and >1 cycle.)

43 Modulo Scheduling
(Figure: a data dependence graph (DDG) and the corresponding modulo reservation table (MRT).)

44 Constructing Kernel-Only Code by Predicate Register Rotation
BRT branches to the top and rotates the predicate registers: p1 = p0, p2 = p1, p3 = p2, p0 = p3.

45 Modulo Variable Expansion (1)

46 Modulo Variable Expansion (2)