CMPUT 680 - Compiler Design and Optimization
CMPUT680 - Winter 2006
Topic F: IA-64 Hardware Support for Software Pipelining
José Nelson Amaral
http://www.cs.ualberta.ca/~amaral/courses/680

Suggested Reading
Intel IA-64 Architecture Software Developer’s Manual, Chapters 8, 9

Instruction Group
An instruction group is a set of instructions that have no read-after-write (RAW) or write-after-write (WAW) register dependencies among them. Consecutive instruction groups are separated by stops, represented by a double semicolon (;;) in the assembly code. In the example, the add reads r1 and the st8 reads r6, so both must come after the stop.
    ld8 r1=[r5]        // First group
    sub r6=r8, r9 ;;   // First group
    add r3=r1,r4       // Second group
    st8 [r6]=r12       // Second group

Instruction Bundles
Instructions are organized in bundles of three instructions, with the following format:
    Field                 Bits       Width
    instruction slot 2    127..87    41
    instruction slot 1    86..46     41
    instruction slot 0    45..5      41
    template              4..0       5
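The bundle layout can be made concrete with a few shifts and masks. The following is a minimal sketch that treats a bundle as a 128-bit Python integer; the function names `decode_bundle` and `encode_bundle` are hypothetical helpers chosen for this illustration, and the slot contents are treated as opaque 41-bit values (no instruction decoding is attempted).

```python
# Sketch: split a 128-bit IA-64 bundle into its template and three slots.
# Field positions follow the slide: template in bits 4..0, slot 0 in
# bits 45..5, slot 1 in bits 86..46, slot 2 in bits 127..87.

SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def decode_bundle(bundle):
    """Return (template, slot0, slot1, slot2) for a 128-bit integer."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("a bundle is exactly 128 bits wide")
    template = bundle & TEMPLATE_MASK
    slot0 = (bundle >> 5) & SLOT_MASK
    slot1 = (bundle >> 46) & SLOT_MASK
    slot2 = (bundle >> 87) & SLOT_MASK
    return template, slot0, slot1, slot2


def encode_bundle(template, slot0, slot1, slot2):
    """Inverse of decode_bundle (hypothetical helper for round-trip checks)."""
    return template | (slot0 << 5) | (slot1 << 46) | (slot2 << 87)
```

Note that 5 + 3 * 41 = 128, so the four fields cover the bundle exactly.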

Bundles
In assembly, each 128-bit bundle is enclosed in curly braces and contains a template specification:
    { .mii
      ld4 r28=[r8]    // Load a 4-byte value
      add r9=2,r1     // 2+r1 and put in r9
      add r30=1,r1    // 1+r1 and put in r30
    }
An instruction group can extend over an arbitrary number of bundles.

Templates
There are restrictions on the type of instructions that can be bundled together. The IA-64 has five slot types (M, I, F, B, and L), six instruction types (M, I, A, F, B, L), and twelve basic template types (MII, MI_I, MLX, MMI, M_MI, MFI, MMF, MIB, MBB, BBB, MMB, and MFB). The underscore in the bundle acronym indicates a stop within the bundle. Every basic template type has two versions: one with a stop at the end of the bundle and one without.

Control Dependency Preventing Code Motion
In the code below, the ld4 is control dependent on the branch, and thus cannot be safely moved up in conventional processor architectures.
    add r7=r6,1                   // cycle 0
    add r13=r25, r27
    cmp.eq p1, p2=r12, r23
    (p1) br.cond some_label ;;
    ld4 r2=[r3] ;;                // cycle 1
    sub r4=r2, r11                // cycle 3
[Figure: control-flow sketch of block A and block B, with the br and the ld marked.]

Control Speculation
In the following code, suppose a load latency of two cycles:
    (p1) br.cond.dptk L1      // cycle 0
    ld8 r3=[r5] ;;            // cycle 1
    shr r7=r3,r87             // cycle 3
However, if we execute the load before we know that we actually have to do it (control speculation), we get:
    ld8.s r3=[r5]             // earlier cycle
    // other, unrelated instructions
    (p1) br.cond.dptk L1 ;;   // cycle 0
    chk.s r3, recovery        // cycle 1
    shr r7=r3,r87             // cycle 1

Control Speculation
The ld8.s instruction is a speculative load. The chk.s instruction is a check that verifies whether the speculatively loaded value is valid, that is, whether the load completed without deferring an exception; if the value is not valid, control transfers to the recovery code.

Ambiguous Memory Dependencies
An ambiguous memory dependency is a dependence between a load and a store, or between two stores, where it cannot be determined if the instructions involved access overlapping memory locations. Two or more memory references are independent if it is known that they access non-overlapping memory locations.

Data Speculation
An advanced load allows a load to be moved above a store even if it is not known whether the load and the store reference overlapping memory locations.
Original code:
    st8 [r55]=r45       // cycle 0
    ld8 r3=[r5] ;;      // cycle 0
    shr r7=r3,r87       // cycle 2
With an advanced load:
    ld8.a r3=[r5] ;;    // Advanced Load
    // other, unrelated instructions
    st8 [r55]=r45       // cycle 0
    ld8.c r3=[r5] ;;    // cycle 0 - check
    shr r7=r3,r87       // cycle 0

Moving Up Loads + Uses: Recovery Code
Original Code:
    st8 [r4] = r12        // cycle 0: ambiguous store
    ld8 r6 = [r8] ;;      // cycle 0: load to advance
    add r5 = r6,r7        // cycle 2
    st8 [r18] = r5        // cycle 3
Speculative Code:
    ld8.a r6 = [r8] ;;    // cycle -3
    // other, unrelated instructions
    add r5 = r6,r7        // cycle -1; add that uses r6
    // other, unrelated instructions
    st8 [r4]=r12          // cycle 0
    chk.a r6, recover     // cycle 0: check
    back:                 // Return point from jump to recover
    st8 [r18] = r5        // cycle 0
    recover:
    ld8 r6 = [r8] ;;      // Reload r6 from [r8]
    add r5 = r6,r7        // Re-execute the add
    br back               // Jump back to main code

ld.c, chk.a and the ALAT
The execution of an advanced load, ld.a, creates an entry in a hardware structure, the Advanced Load Address Table (ALAT). This table is indexed by the register number. Each entry records the load address, the load type, and the size of the load. When a check is executed, the table is searched for the register to verify that a valid entry with the specified type is present.

ld.c, chk.a and the ALAT
Entries are removed from the ALAT when:
(1) A store overlaps with the memory locations specified in the ALAT entry;
(2) Another advanced load to the same register is executed;
(3) There is a context switch caused by the operating system (or hardware);
(4) Capacity limitation of the ALAT implementation requires reuse of the entry.
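The four removal rules can be illustrated with a toy software model of the ALAT. This is a sketch under stated assumptions, not a description of any real implementation: the class name `ALAT`, its method names, the default capacity of 32 entries, and the oldest-first eviction policy are all hypothetical, and the load type recorded by the real table is omitted for brevity.

```python
class ALAT:
    """Toy model of the Advanced Load Address Table (hypothetical API)."""

    def __init__(self, capacity=32):  # capacity is an assumed value
        self.capacity = capacity
        self.entries = {}  # register number -> (address, size in bytes)

    def advanced_load(self, reg, addr, size):
        """Model of ld.a: record the address and size for the target register."""
        # Rule (2): a new advanced load to the same register replaces the entry.
        self.entries.pop(reg, None)
        # Rule (4): when the table is full, reuse the oldest entry (assumed policy).
        if len(self.entries) >= self.capacity:
            oldest = next(iter(self.entries))
            del self.entries[oldest]
        self.entries[reg] = (addr, size)

    def store(self, addr, size):
        """Rule (1): a store removes every entry whose bytes it overlaps."""
        self.entries = {
            reg: (a, s)
            for reg, (a, s) in self.entries.items()
            if a + s <= addr or addr + size <= a
        }

    def context_switch(self):
        """Rule (3): entries do not survive a context switch."""
        self.entries.clear()

    def check(self, reg):
        """Model of ld.c / chk.a: True if the advanced load is still valid."""
        return reg in self.entries
```

For the recovery-code example, `advanced_load(6, addr_r8, 8)` followed by a `store` to an overlapping address makes `check(6)` return False, which corresponds to chk.a branching to the recover label.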

Not a Thing (NaT)
The IA-64 has 128 general-purpose registers, each with 64+1 bits, and 128 floating-point registers, each with 82 bits. The extra bit in the GPRs is the NaT bit, which is used to indicate that the content of the register is not valid. NaT=1 indicates that an instruction that generated an exception wrote to the register. It is a way to defer exceptions caused by speculative loads. Any operation that uses NaT as an operand results in NaT.
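NaT propagation can be sketched in a few lines. The model below is an illustration only: `Reg`, `speculative_load`, `add`, and `chk_s` are hypothetical names, memory is a plain dictionary, and an access to an address missing from the dictionary stands in for a load that would raise an exception.

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1


@dataclass(frozen=True)
class Reg:
    """A general register: 64 value bits plus the NaT bit."""
    value: int = 0
    nat: bool = False


def speculative_load(memory, addr):
    """Model of ld.s: a faulting access sets NaT instead of raising."""
    if addr not in memory:  # stands in for any exception on the load
        return Reg(0, nat=True)
    return Reg(memory[addr] & MASK64)


def add(a, b):
    """Any operation that uses NaT as an operand results in NaT."""
    if a.nat or b.nat:
        return Reg(0, nat=True)
    return Reg((a.value + b.value) & MASK64)


def chk_s(reg):
    """Model of chk.s: True means control must go to the recovery code."""
    return reg.nat
```

Because the NaT bit flows through dependent operations, a single chk.s on the final result is enough to detect a deferred exception anywhere in the speculated chain.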

If-conversion
If-conversion uses predicates to transform conditional code into code with a single control stream.
Source:
    if (r4) {
        add r1 = r2, r3
        ld8 r6 = [r5]
    }
After if-conversion:
    cmp.ne p1, p0 = r4, 0 ;;    // Set predicate reg
    (p1) add r1 = r2, r3
    (p1) ld8 r6 = [r5]
Source:
    if (r1) r2 = r3 + r4
    else r7 = r6 - r5
After if-conversion:
    cmp.ne p1, p2 = r1, 0 ;;    // Set predicate reg
    (p1) add r2 = r3, r4
    (p2) sub r7 = r6, r5

Code Generation for Software Pipelining
    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    DO k=1,N                          Lat.
      (a) u_k ← load z_{k-1}          (6)
      (b) v_k ← load x_{k-1}          (6)
      (c) w_k ← u_k * v_k             (2)
      (d) q_k ← q_{k-1} + w_k         (2)
      (e) z_k ← z_{k-1} + 4           (1)
      (f) x_k ← x_{k-1} + 4           (1)
    END DO

Code Generation for Software Pipelining
    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    (a1) u_1 ← load z_0
    (b1) v_1 ← load x_0
    (e1) z_1 ← z_0 + 4
    (f1) x_1 ← x_0 + 4
    (a2) u_2 ← load z_1
    (b2) v_2 ← load x_1
    (e2) z_2 ← z_1 + 4
    (f2) x_2 ← x_1 + 4
    (a3) u_3 ← load z_2
    (b3) v_3 ← load x_2
    (e3) z_3 ← z_2 + 4
    (f3) x_3 ← x_2 + 4
    (a4) u_4 ← load z_3
    (b4) v_4 ← load x_3
    (c1) w_1 ← u_1 * v_1
    (e4) z_4 ← z_3 + 4
    (f4) x_4 ← x_3 + 4

Code Generation for Software Pipelining
Kernel loop, followed by the remaining operations (shown for N = 100):
    DO k=1,N-4
      (a_{k+4}) u_{k+4} ← load z_{k+3}
      (b_{k+4}) v_{k+4} ← load x_{k+3}
      (c_{k+1}) w_{k+1} ← u_{k+1} * v_{k+1}
      (d_k)     q_k ← q_{k-1} + w_k
      (e_{k+4}) z_{k+4} ← z_{k+3} + 4
      (f_{k+4}) x_{k+4} ← x_{k+3} + 4
    END DO
    (c_98)  w_98 ← u_98 * v_98
    (d_97)  q_97 ← q_96 + w_97
    (c_99)  w_99 ← u_99 * v_99
    (d_98)  q_98 ← q_97 + w_98
    (c_100) w_100 ← u_100 * v_100
    (d_99)  q_99 ← q_98 + w_99
    (d_100) q_100 ← q_99 + w_100

Code Generation for Software Pipelining
    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    DO k=1,4                                  [prolog counter]
      (a) u_k ← load z_{k-1}
      (b) v_k ← load x_{k-1}
      (e) z_k ← z_{k-1} + 4
      (f) x_k ← x_{k-1} + 4
    END DO
    (c) w_1 ← u_1 * v_1
    DO k=1,N-4                                [loop counter]
      (a) u_{k+4} ← load z_{k+3}
      (b) v_{k+4} ← load x_{k+3}
      (c) w_{k+1} ← u_{k+1} * v_{k+1}
      (d) q_k ← q_{k-1} + w_k
      (e) z_{k+4} ← z_{k+3} + 4
      (f) x_{k+4} ← x_{k+3} + 4
    END DO

Code Generation for Software Pipelining
    DO k=N-3,N-1                              [epilog counter]
      (c) w_{k+1} ← u_{k+1} * v_{k+1}
      (d) q_k ← q_{k-1} + w_k
    END DO
    (d) q_N ← q_{N-1} + w_N
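One way to convince oneself that the prolog, kernel, and epilog together compute the same result as the source loop is to execute both. The sketch below is a hedged Python transcription, not the compiler's output: the function names are hypothetical, the address updates (e) and (f) are replaced by direct list indexing, the multiply is written as w_{k+1} = u_{k+1} * v_{k+1} following the definition of statement (c), and the epilog is assumed to run for k = N-3 .. N-1 followed by a final accumulation of w_N.

```python
def dot_reference(Z, X):
    """The source loop: q accumulates Z(k) * X(k) for k = 1..N."""
    q = 0.0
    for zk, xk in zip(Z, X):
        q = q + zk * xk
    return q


def dot_pipelined(Z, X):
    """Prolog / kernel / epilog version; u, v, w use 1-based indices."""
    n = len(Z)
    if n != len(X) or n < 5:
        raise ValueError("this sketch needs two vectors of equal length >= 5")
    u, v, w = {}, {}, {}
    q = 0.0
    # Prolog: issue the first four pairs of loads, then the first multiply.
    for k in range(1, 5):
        u[k] = Z[k - 1]
        v[k] = X[k - 1]
    w[1] = u[1] * v[1]
    # Kernel: loads run four iterations ahead, the multiply one ahead.
    for k in range(1, n - 3):          # k = 1 .. N-4
        u[k + 4] = Z[k + 3]
        v[k + 4] = X[k + 3]
        w[k + 1] = u[k + 1] * v[k + 1]
        q = q + w[k]
    # Epilog: no more loads; drain the multiplies and additions.
    for k in range(n - 3, n):          # k = N-3 .. N-1
        w[k + 1] = u[k + 1] * v[k + 1]
        q = q + w[k]
    q = q + w[n]
    return q
```

Both functions add the products in the same order, so the results agree exactly even in floating point.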

Code Generation for Software Pipelining (try 3)
Without software pipelining the following code could be generated:
          R0 ← &Z(1)
          R1 ← &X(1)
          F0 ← 0.0
          R2 ← 1
    loop: F1 ← load [R0]
          F2 ← load [R1]
          F3 ← mult F1, F2
          F0 ← add F0, F3
          R0 ← add R0, 4
          R1 ← add R1, 4
          R2 ← add R2, 1
          brne R2, N loop
But we still have not solved the register allocation problem. The code on the right needs a large number of registers. What can we do about it?

Optimization of Loops
    L1: ld4 r4 = [r5], 4 ;;    // Cycle 0 load postinc 4
        add r7 = r4, r9 ;;     // Cycle 2
        st4 [r6] = r7, 4       // Cycle 3 store postinc 4
        br.cloop L1 ;;         // Cycle 3
Instructions Description:
    ld4 r4 = [r5], 4 ;;    r4 ← MEM[r5]; r5 ← r5 + 4
    st4 [r6] = r7, 4       MEM[r6] ← r7; r6 ← r6 + 4
    br.cloop L1            if LC ≠ 0 then LC ← LC - 1; goto L1

Optimization of Loops
    (a) L1: ld4 r4 = [r5], 4 ;;
    (b)     add r7 = r4, r9 ;;
    (c)     st4 [r6] = r7, 4
    (d)     br.cloop L1 ;;
    Cycle   Iter. 1   Iter. 2   Iter. 3   Iter. 4
    0       a
    1
    2       b
    3       c/d
    4                 a
    5
    6                 b
    7                 c/d
    8                           a
    9
    10                          b
    11                          c/d
    12                                    a
    13
    14                                    b
If LC=1000, how long does it take for this loop to execute? Each iteration takes 4 cycles, so it takes 4000 cycles.

Optimization of Loops: Loop Unrolling
    (a) L1: ld4 r4 = [r5], 4 ;;
    (b)     ld4 r14 = [r5], 4 ;;
    (c)     add r7 = r4, r9 ;;
    (d)     add r17 = r14, r9
    (e)     st4 [r6] = r7, 4 ;;
    (f)     st4 [r6] = r17, 4
    (g)     br.cloop L1 ;;
    Cycle   Iter. 1   Iter. 2   Iter. 3
    0       a
    1       b
    2       c
    3       d/e
    4       f/g
    5                 a
    6                 b
    7                 c
    8                 d/e
    9                 f/g
    10                          a
    11                          b
    12                          c
    13                          d/e
    14                          f/g
For simplicity we assumed that N is a multiple of 2. Because the loads (a) and (b) both update r5, they have to be serialized.

Optimization of Loops: Loop Unrolling
If LC=1000 for the original loop, how long does it take for the loop unrolled twice to execute? It takes 500 iterations of 5 cycles each, or 2500 cycles. Thus the loop is 4000/2500 = 1.6 times faster.

Optimization of Loops: Expanding the Induction Variable
            add r15 = 4, r5
            add r16 = 4, r6 ;;
    (a) L1: ld4 r4 = [r5], 8
    (b)     ld4 r14 = [r15], 8 ;;
    (c)     add r7 = r4, r9
    (d)     add r17 = r14, r9 ;;
    (e)     st4 [r6] = r7, 8
    (f)     st4 [r16] = r17, 8
    (g)     br.cloop L1 ;;
    Cycle   Iter. 1   Iter. 2   Iter. 3   Iter. 4
    0       a/b
    1
    2       c/d
    3       e/f/g
    4                 a/b
    5
    6                 c/d
    7                 e/f/g
    8                           a/b
    9
    10                          c/d
    11                          e/f/g
    12                                    a/b
    13
    14                                    c/d
We use twice as many functional units as the original code, but no instruction is issued in cycle 1, and functional units are still under-utilized.

Optimization of Loops: Expanding the Induction Variable
If LC=1000 for the original loop, how long does it take for the loop with expanded induction variables to execute? It takes 500 iterations of 4 cycles each, or 2000 cycles. Thus the loop is 4000/2000 = 2.0 times faster.

Optimization of Loops: Further Loop Unrolling
            add r15 = 4, r5
            add r25 = 8, r5
            add r35 = 12, r5
            add r16 = 4, r6
            add r26 = 8, r6
            add r36 = 12, r6 ;;
    (a) L1: ld4 r4 = [r5], 16
    (b)     ld4 r14 = [r15], 16 ;;
    (c)     ld4 r24 = [r25], 16
    (d)     ld4 r34 = [r35], 16 ;;
    (e)     add r7 = r4, r9
    (f)     add r17 = r14, r9 ;;
    (g)     st4 [r6] = r7, 16
    (h)     st4 [r16] = r17, 16
    (i)     add r27 = r24, r9
    (j)     add r37 = r34, r9 ;;
    (k)     st4 [r26] = r27, 16
    (l)     st4 [r36] = r37, 16
    (m)     br.cloop L1 ;;
    Cycle   Iter. 1    Iter. 2    Iter. 3
    0       a/b
    1       c/d
    2       e/f
    3       g/h/i/j
    4       k/l/m
    5                  a/b
    6                  c/d
    7                  e/f
    8                  g/h/i/j
    9                  k/l/m
    10                            a/b
    11                            c/d
    12                            e/f
    13                            g/h/i/j
    14                            k/l/m

Optimization of Loops: Further Loop Unrolling
If LC=1000 for the original loop, how long does it take for this loop (unrolled 4 times) to execute? It takes 250*5=1250 cycles. Thus the loop is 4000/1250 = 3.2 times faster.
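The cycle counts quoted for the four versions of the loop all follow from the same arithmetic: the number of kernel iterations (1000 divided by the unroll factor) times the cycles per kernel iteration read off the schedule. A small sketch, using a hypothetical helper `loop_cycles` and the per-iteration cycle counts from the schedules above:

```python
def loop_cycles(source_iterations, unroll_factor, cycles_per_iteration):
    """Total cycles, assuming the trip count is a multiple of the unroll factor."""
    if source_iterations % unroll_factor != 0:
        raise ValueError("trip count must be a multiple of the unroll factor")
    return (source_iterations // unroll_factor) * cycles_per_iteration


# (unroll factor, cycles per kernel iteration) for each version of the loop
VERSIONS = {
    "original": (1, 4),
    "unrolled twice": (2, 5),
    "unrolled twice, expanded induction variables": (2, 4),
    "unrolled four times, expanded induction variables": (4, 5),
}


def summarize(source_iterations=1000):
    """Return {version: (cycles, speedup over the original loop)}."""
    base = loop_cycles(source_iterations, *VERSIONS["original"])
    return {
        name: (loop_cycles(source_iterations, *params),
               base / loop_cycles(source_iterations, *params))
        for name, params in VERSIONS.items()
    }
```

Calling `summarize()` reproduces the figures on the slides: 4000, 2500, 2000, and 1250 cycles, with speedups of 1.0, 1.6, 2.0, and 3.2.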

Loop Optimization: Loop Unrolling
In the previous example we obtained a good utilization of the functional units through loop unrolling, but at the cost of code expansion and higher register pressure. Software pipelining offers an alternative by overlapping the execution of operations from multiple iterations of the loop.

Loop Optimization: Software Pipelining
    (S1) ld4 r4 = [r5], 4
    (S2) - - -
    (S3) add r7 = r4, r9
    (S4) st4 [r6] = r7, 4
* This is not real code
    Cycle   It.1   It.2   It.3   It.4   It.5   It.6   It.7
    0       S1                                               prologue
    1              S1                                        prologue
    2       S3            S1                                 prologue
    3       S4     S3            S1                          kernel
    4              S4     S3            S1                   kernel
    5                     S4     S3            S1            kernel
    6                            S4     S3            S1     kernel
    7                                   S4     S3            epilogue
    8                                          S4     S3     epilogue
    9                                                 S4     epilogue

Loop Optimization: Software Pipelining Code
    // prologue
        ld4 r4 = [r5], 4 ;;    // load x[1]
        ld4 r4 = [r5], 4 ;;    // load x[2]
        add r7 = r4, r9        // y[1] = x[1] + k
        ld4 r4 = [r5], 4 ;;    // load x[3]
    // kernel
    L1: ld4 r4 = [r5], 4       // load x[i+3]
        add r7 = r4, r9        // y[i+1] = x[i+1] + k
        st4 [r6] = r7, 4       // store y[i]
        br.cloop L1 ;;
    // epilogue
        st4 [r6] = r7, 4       // store y[n-2]
        add r7 = r4, r9 ;;     // y[n-1] = x[n-1] + k
        st4 [r6] = r7, 4       // store y[n-1]
        add r7 = r4, r9 ;;     // y[n] = x[n] + k
        st4 [r6] = r7, 4       // store y[n]

Support for Software Pipelining in the IA-64
After a loop is converted into a software pipeline, it looks quite different from the original loop. Intel adopts the following terminology:
- source loop and source iteration: refer to the original source code;
- kernel loop and kernel iteration: refer to the code that implements the software pipeline.

Loop Support in the IA-64: Register Rotation
The IA-64 has a rotating register base (rrb) register that is decremented by special software-pipelined loop branches. When the rrb is decremented, the value stored in register X appears to move to register X+1, and the value of the highest numbered rotating register appears to move to the lowest numbered rotating register.

Loop Support in the IA-64: Register Rotation
What registers can rotate?
- The predicate registers p16-p63;
- The floating-point registers f32-f127;
- A programmable portion of the general registers:
  - The alloc instruction can allocate 0, 8, 16, 24, …, 96 general registers as rotating registers;
  - The lowest numbered rotating register is r32.
- There are three rrbs: rrb.gr, rrb.fr, and rrb.pr.

How Register Rotation Helps Software Pipeline
The concept of a software pipelining branch:
    L1: ld4 r35 = [r4], 4     // post-increment by 4
        st4 [r5] = r37, 4     // post-increment by 4
        swp_branch L1 ;;
The pseudo-instruction swp_branch in the example rotates the general registers. Therefore the value stored into r35 is read in r37 two kernel iterations (and two rotations) later. The register rotation eliminated a dependence between the load and the store instructions, and allowed the loop to execute in one cycle.
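The renaming performed by rotation can be sketched as a mapping from logical to physical register numbers. The model below is an illustration under stated assumptions: the class `RotatingRegisters` and its methods are hypothetical, the rotating region is taken to be the eight registers r32-r39, and the rrb is kept as a plain negative integer (as in the slides) rather than wrapped to the region size.

```python
class RotatingRegisters:
    """Toy model of a rotating region of the general register file."""

    def __init__(self, size=8, base=32):
        self.size = size          # number of rotating registers (assumed 8)
        self.base = base          # lowest numbered rotating register (r32)
        self.rrb = 0              # rotating register base
        self.physical = [None] * size

    def _index(self, logical):
        """Physical slot currently named by a logical register number."""
        if not self.base <= logical < self.base + self.size:
            raise ValueError("register is outside the rotating region")
        return (logical - self.base + self.rrb) % self.size

    def read(self, logical):
        return self.physical[self._index(logical)]

    def write(self, logical, value):
        self.physical[self._index(logical)] = value

    def rotate(self):
        """What a software pipelining branch does to the register base."""
        self.rrb -= 1
```

For example, writing 7 to r35, rotating, writing 8 to r35, and rotating again leaves 7 readable as r37 and 8 readable as r36, which is the behavior described for the ld4/st4 pair above.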

How Register Rotation Helps Software Pipeline
[Figure: logical-to-physical mapping of registers R32-R39 over three kernel iterations of the loop above. With RRB = 0 the value 7 is loaded into logical R35; after one rotation the value 8 is loaded into logical R35; after two rotations (RRB = -2) the value 9 is loaded into logical R35 while the value 7 is read through logical R37.]

The stage predicate
    (S1): (p16) ld4 r4 = [r5], 4
    (S2): (p17) - - -
    (S3): (p18) add r7 = r4, r9
    (S4): (p19) st4 [r6] = r7, 4
When assembling a software pipeline the programmer can assign a stage predicate to each stage of the pipeline to control the execution of the instructions in that stage. p16 is architecturally defined as the predicate for the first stage, p17 for the second, and so on. The software pipeline branch rotates the predicate registers and injects a 1 in p16, thus enabling one additional stage of the pipeline at a time during the execution of the prolog.

The stage predicate
When the loop counter (LC) reaches zero, the software pipeline branch starts to decrement the epilog counter (EC) and injects a 0 in p16 at every rotation, disabling one stage at a time to execute the epilogue of the software pipelined loop.

Anatomy of a Software Pipelining Branch
    Condition                      Counter update   Predicate   Rotation   Outcome
    LC ≠ 0 (prolog/kernel)         LC--             PR[16]=1    RRB--      branch
    LC == 0 (epilog), EC > 1       EC--             PR[16]=0    RRB--      branch
    LC == 0 (epilog), EC == 1      EC--             PR[16]=0    RRB--      fall-thru
    LC == 0 (epilog), EC == 0      EC unchanged                            fall-thru
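The decision logic of the branch can be written as a small state function. This is a sketch, not the architectural definition: the names `br_ctop` and `count_kernel_iterations` are hypothetical, and the behavior in the EC = 0 case (nothing injected into p16 and no rotation) is an assumption, since the flowchart only shows that EC is left unchanged and the branch falls through.

```python
def br_ctop(lc, ec):
    """One execution of the software pipelining branch.

    Returns (lc, ec, inject, rotate, taken), where inject is the value
    written to p16 after rotation (None if nothing is injected).
    """
    if lc != 0:                      # prolog/kernel: start another iteration
        return lc - 1, ec, 1, True, True
    if ec > 1:                       # epilog: more stages left to drain
        return lc, ec - 1, 0, True, True
    if ec == 1:                      # last epilog iteration
        return lc, 0, 0, True, False
    return lc, ec, None, False, False  # ec == 0: assumed no rotation


def count_kernel_iterations(lc, ec):
    """Run the branch until it falls through; return (iterations, final rrb)."""
    iterations, rrb = 0, 0
    while True:
        lc, ec, _inject, rotate, taken = br_ctop(lc, ec)
        iterations += 1
        if rotate:
            rrb -= 1
        if not taken:
            return iterations, rrb
```

With LC = 4 and EC = 3, `count_kernel_iterations(4, 3)` gives 7 kernel iterations: five that start a new source iteration plus two that only drain the pipeline.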

Software Pipelining Example in the IA-64
          mov pr.rot = 0               // Clear all rotating predicate registers
          cmp.eq p16,p0 = r0,r0        // Set p16=1
          mov ar.lc = 4                // Set loop counter to n-1
          mov ar.ec = 3                // Set epilog counter to 3
          …
    loop:
    (p16) ld1 r32 = [r12], 1           // Stage 1: load x
    (p17) add r34 = 1, r33             // Stage 2: y=x+1
    (p18) st1 [r13] = r35, 1           // Stage 3: store y
          br.ctop loop                 // Branch back
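The whole example can be executed in a small simulator that combines rotating general registers, rotating predicates, and the LC/EC counters. The sketch below is an approximation for illustration: the function name `simulate_ctop_loop` is hypothetical, the rotating general-register region is assumed to be r32-r39, the three predicated instructions are treated as one instruction group (all sources are read before any destination is written), and memory is modeled as Python lists rather than the addresses in r12 and r13.

```python
def simulate_ctop_loop(x, stages=3):
    """Run the three-stage pipelined loop; return (y, kernel iterations, rrb)."""
    if not x:
        raise ValueError("the loop needs at least one element")
    gr = [0] * 8            # physical rotating general registers r32..r39
    pr = [0] * 48           # physical rotating predicates p16..p63
    rrb = 0                 # one base used for both files in this sketch

    def g(reg):             # physical slot of logical general register
        return (reg - 32 + rrb) % 8

    def p(reg):             # physical slot of logical predicate register
        return (reg - 16 + rrb) % 48

    pr[p(16)] = 1           # cmp.eq p16,p0 = r0,r0
    lc, ec = len(x) - 1, stages
    y, next_load, iterations = [], 0, 0
    while True:
        iterations += 1
        # Read all sources first: the instructions form one group.
        p16, p17, p18 = pr[p(16)], pr[p(17)], pr[p(18)]
        r33, r35 = gr[g(33)], gr[g(35)]
        if p16:             # Stage 1: ld1 r32 = [r12], 1
            gr[g(32)] = x[next_load]
            next_load += 1
        if p17:             # Stage 2: add r34 = 1, r33
            gr[g(34)] = r33 + 1
        if p18:             # Stage 3: st1 [r13] = r35, 1 (low byte only)
            y.append(r35 & 0xFF)
        # br.ctop: update counters, rotate, inject the new p16.
        if lc != 0:
            lc, inject, taken = lc - 1, 1, True
        else:
            ec, inject, taken = ec - 1, 0, ec > 1
        rrb -= 1
        pr[p(16)] = inject
        if not taken:
            return y, iterations, rrb
```

For five input bytes the simulation runs seven kernel iterations and ends with rrb = -7, and the stored values are the inputs plus one, in order.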

Software Pipelining Example in the IA-64
[Animation: step-by-step execution of the loop above on a memory holding x1..x5, showing the predicate registers p16-p18, LC, EC, RRB, and the logical-to-physical mapping of general registers r32-r39. The state at the start of each kernel iteration, and the work done in it, is summarized below.]
    Iteration   LC   EC   RRB   p16 p17 p18   ld1 loads   add computes   st1 stores
    1           4    3     0    1   0   0     x1          -              -
    2           3    3    -1    1   1   0     x2          y1             -
    3           2    3    -2    1   1   1     x3          y2             y1
    4           1    3    -3    1   1   1     x4          y3             y2
    5           0    3    -4    1   1   1     x5          y4             y3
    6           0    2    -5    0   1   1     -           y5             y4
    7           0    1    -6    0   0   1     -           -              y5
    (exit)      0    0    -7    0   0   0
