HARP Control Divergence & Assignment 4 Blaise Tine Georgia Institute of Technology
Agenda
  Harp Control Divergence
    Predication
    Split-Join
  Assignment 4
    Codebase
    Clone
    Barriers
    Samples Walkthrough
  Questions?
Control Divergence
  Two techniques supported by the ISA:
    Predication
      Controls branch divergence at instruction granularity
    Split-Join
      Controls branch divergence at block granularity
Harp Predication
  Full Predication
    All instructions can be predicated
  Implementation
    Separate predicate register file
    All predicated instructions execute
      Fetch => Decode => Execute
    Conditional Commit stage
      Only instructions with predicate value ‘true’ commit their results
Harp Predication (2)
  Compiler Support
    If-conversion: converts control dependencies into data dependencies
  Example
    if (r1) { ++r2; } else { --r2; }

    rtop @p0, %r1              # set predicate
    @p0 ? addi %r2, %r2, #1
    ntop @p0, @p0              # inverse predicate
    @p0 ? subi %r2, %r2, #1
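The if-converted sequence above can be sketched in C++ to show what each lane ends up with. This is a minimal sketch, not the emulator's code: the 4-lane width, the `WarpState` struct and the helper names are assumptions made for illustration, and `rtop` is assumed to set the predicate when the source register is nonzero.

```cpp
#include <array>
#include <cstdint>

constexpr int kLanes = 4;  // assumed warp width, for illustration only

// Hypothetical per-warp state: one r1, r2 and p0 per lane.
struct WarpState {
    std::array<int32_t, kLanes> r1{};
    std::array<int32_t, kLanes> r2{};
    std::array<bool, kLanes> p0{};
};

// rtop @p0, %r1 : assumed to set the predicate when %r1 is nonzero.
void rtop(WarpState& w) {
    for (int lane = 0; lane < kLanes; ++lane) w.p0[lane] = (w.r1[lane] != 0);
}

// ntop @p0, @p0 : invert the predicate in every lane.
void ntop(WarpState& w) {
    for (int lane = 0; lane < kLanes; ++lane) w.p0[lane] = !w.p0[lane];
}

// @p0 ? addi %r2, %r2, #imm : every lane executes, only 'true' lanes commit.
void predicated_addi(WarpState& w, int32_t imm) {
    for (int lane = 0; lane < kLanes; ++lane) {
        int32_t result = w.r2[lane] + imm;      // execute stage runs regardless
        if (w.p0[lane]) w.r2[lane] = result;    // conditional commit
    }
}

// The four-instruction if-converted sequence from the slide.
void if_converted(WarpState& w) {
    rtop(w);                 // set predicate
    predicated_addi(w, 1);   // then-path: ++r2
    ntop(w);                 // inverse predicate
    predicated_addi(w, -1);  // else-path: --r2 (subi modeled as addi of -1)
}
```

Both predicated instructions are issued for every lane no matter how the predicate falls, which is the cost the limitations slide refers to for unanimous branches.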
Harp Predication (3)
  Predicate Value Test Instructions
    rtop @dst, %src
    isneg @dst, %src
    iszero @dst, %src
  Predicate Manipulation Instructions
    ntop @dst, @src0
    andp @dst, @src0, @src1
    orp @dst, @src0, @src1
    xorp @dst, @src0, @src1
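As a rough per-lane model, the instructions listed above can be read as small pure functions. The semantics below are inferred from the mnemonics rather than taken from an ISA manual: in particular, `rtop` is assumed to mean "register to predicate" (true when the register is nonzero).

```cpp
#include <cstdint>

// Predicate value tests: derive a predicate bit from a register value.
inline bool rtop(int32_t src)   { return src != 0; }  // assumed: nonzero => true
inline bool isneg(int32_t src)  { return src < 0; }
inline bool iszero(int32_t src) { return src == 0; }

// Predicate manipulation: combine existing predicate bits.
inline bool ntop(bool src0)            { return !src0; }
inline bool andp(bool src0, bool src1) { return src0 && src1; }
inline bool orp(bool src0, bool src1)  { return src0 || src1; }
inline bool xorp(bool src0, bool src1) { return src0 != src1; }
```

A compound condition such as `if (r1 < 0 || r2 == 0)` would then map to `orp(isneg(r1), iszero(r2))`, that is, two value tests followed by one manipulation instruction.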
Harp Predication (4)
  Advantages
    No branching overhead
    Simple microarchitecture
  Limitations
    If-conversion is not always possible
      e.g. loops, indirect branches
    Inefficient with unanimous branches
      Both paths are always executed
Harp Split-Join
  ISA Support
    @p ? split: partition a warp using the predicate mask, each subset taking a different target
    join: merge the partitioned subsets back into a single execution block
  Implementation
    Hardware stack management
    Compiler support
Harp Split-Join (2)
  Example
    if (r1) { ++r2; } else { --r2; }

    rtop @p0, %r1
    @p0 ? split
    @p0 ? jmp then
    subi %r2, %r2, #1
    jmp next
    then: addi %r2, %r2, #1
    next: join

  Execution steps (HW stack entries shown as NPC/mask, top first)
    NPC values are zero-based instruction indices: @2 is the instruction after split, @7 the instruction after join.
    1. Set predicate (rtop @p0, %r1)
         HW stack: empty
    2. split: push PC and mask onto HW stack
         HW stack: @2 1001, @7 0110
    3. Execute threads with ‘true’ predicate
         HW stack: @2 1001, @7 0110
    4. join: pop HW stack and jmp to @2
         HW stack: @7 0110
    5. Execute threads with ‘false’ predicate
         HW stack: @7 0110
    6. join: pop HW stack and jmp to @7
         HW stack: empty
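The stack walk in this example can be sketched as a small C++ model. The `Warp` and `StackEntry` types are hypothetical, and the sketch makes one assumption the slides do not spell out: the bottom entry pushed by `split` restores the full pre-split mask and lets execution fall through past the `join`.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical HW stack entry: a lane mask plus an optional resume PC.
struct StackEntry {
    uint32_t mask;   // lanes to activate when this entry is popped
    bool has_pc;     // true: jump to pc on pop; false: fall through past the join
    uint32_t pc;
};

// Hypothetical warp state: PC, active-lane mask, and the HW stack.
struct Warp {
    uint32_t pc = 0;
    uint32_t active = 0;
    std::vector<StackEntry> stack;
};

// @p ? split : lanes with a 'true' predicate keep running; the others are parked.
void split(Warp& w, uint32_t pred_mask) {
    uint32_t taken = w.active & pred_mask;
    uint32_t not_taken = w.active & ~pred_mask;
    w.stack.push_back({w.active, false, 0});         // assumed: restore full mask at the last join
    w.stack.push_back({not_taken, true, w.pc + 1});  // 'false' lanes resume right after the split
    w.active = taken;
    w.pc += 1;
}

// join : pop one entry, switch to its mask, and either jump or fall through.
void join(Warp& w) {
    if (w.stack.empty()) return;  // nothing to merge; treated as a no-op in this sketch
    StackEntry e = w.stack.back();
    w.stack.pop_back();
    w.active = e.mask;
    w.pc = e.has_pc ? e.pc : w.pc + 1;
}
```

With a predicate mask of 0110 and the split at index 1, the first `join` switches to lanes 1001 at @2 and the second one continues at @7, matching the NPC values in the walkthrough.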
Harp Split-Join (3)
  Advantages
    Efficient with unanimous branches
      Only a single path is executed
    The active mask turns off inactive threads
  Challenges
    Complex microarchitecture
      HW stack manager
    Split-jmp-Join overhead
Morgan Kaufmann Publishers April 3, 2019
Assignment 4: Mini Harp
  Minimal ISA
    Word encoding
    Integers only
    A single predicate register
    No Split-Join
    No warp creation
    No interrupts
    No virtual addressing
  Instruction Set
    Nop, Add, Sub, And, Or, Xor, Not, Shr, Shl, Ld, St, Jmp, Jal, Bar
  Configuration
    Register size, warp size, number of warps
Chapter 1 — Computer Abstractions and Technology
Assignment 4: Code base
  Shared header
    Common.h          // common includes and definitions
  Utility Library
    utils.cpp/h       // utility functions
  Core classes
    mem.cpp/h         // memory
    lrucache.cpp/h    // cache
    Instr.cpp/h       // instruction
    decode.cpp/h      // decoder
    regfile.h         // register file
    warp.cpp/h        // warp unit
    core.cpp/h        // processor core
Assignment 4: Core Initialization
  Program RAM
  Core Construction
  Console output
  Load/Store Unit
  ICache & DCache
  IDecoder
  Warps
Assignment 4: Memory Layout
  console
  RAM
Assignment 4: Warp Initialization
  Warp Construction
  GP Registers
  Pred Registers
  Boot enable
Assignment 4: Warp Execute
  Step Function
  Pipeline stages
    Fetch
    Decode
Assignment 4: Warp Execute (2)
  Execution
  Instructions
  Predication
  Jump instruction
  Set predicate
  Add your code!
Assignment 4: Clone Instruction
  Format
    clone %src0
  Operation
    Copy the current lane's registers into the %src0 lane.
    Register %src0 holds the destination lane index.
  e.g.
    ldi %r0, #2
    clone %r0          # copy current registers into 3rd lane
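One way to picture the clone operation is as a register-file copy between lanes. The sketch below is an illustration under assumptions, not the assignment's solution: `LaneRegs` and `clone_lane` are hypothetical names, and the out-of-range policy (do nothing and return false) is a choice made here. Whether clone also activates the destination lane is not stated on the slide and is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical register storage: one vector of GP registers per lane.
using LaneRegs = std::vector<int32_t>;

// clone %src0 : copy the registers of lane `current` into the lane whose
// index is held in register `src0` of the current lane.
// Returns false (and copies nothing) if that index is not a valid lane.
bool clone_lane(std::vector<LaneRegs>& lanes, std::size_t current, std::size_t src0) {
    int32_t dest = lanes[current][src0];
    if (dest < 0 || static_cast<std::size_t>(dest) >= lanes.size()) {
        return false;  // assumed policy for an invalid destination lane
    }
    lanes[static_cast<std::size_t>(dest)] = lanes[current];
    return true;
}
```

For the slide's example, `%r0` holds 2, so lane index 2 (the 3rd lane) receives a copy of every register of the current lane, including `%r0` itself.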
Assignment 4: Barrier Instruction
  Format
    bar %src0, %src1
  Operation
    Synchronize %src1 warps on the barrier with identifier %src0.
    Register %src0 holds the barrier id (supported max value is 3).
    Register %src1 holds the number of warps to wait on.
  e.g.
    ldi %r0, #1
    ldi %r1, #2
    bar %r0, %r1       # insert a size-2 named barrier with id=1
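The bookkeeping behind a named barrier can be sketched as a small table indexed by barrier id. This is a minimal sketch under assumptions: `BarrierTable` and `arrive` are hypothetical names, an invalid id throws, and a barrier resets itself once it releases. Only the id limit of 3 comes from the slide.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical named-barrier table: one waiting list per barrier id (0..3).
class BarrierTable {
public:
    static constexpr std::size_t kMaxBarrierId = 3;  // from the slide

    // Warp `warp_id` arrives at barrier `id`, which opens once `count`
    // warps have arrived. Returns the warps released by this arrival,
    // or an empty list while the barrier is still waiting.
    // Throws std::out_of_range (via at()) for an unsupported id.
    std::vector<std::size_t> arrive(std::size_t id, std::size_t count, std::size_t warp_id) {
        std::vector<std::size_t>& waiting = waiting_.at(id);
        waiting.push_back(warp_id);
        if (waiting.size() < count) {
            return {};  // caller should stall this warp
        }
        std::vector<std::size_t> released;
        released.swap(waiting);  // barrier is empty again and can be reused
        return released;
    }

private:
    std::array<std::vector<std::size_t>, kMaxBarrierId + 1> waiting_;
};
```

In the slide's example (`bar %r0, %r1` with id 1 and size 2), the first warp to arrive would be stalled and the second arrival would release both.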
Assignment 4: Testing
  Emulator command line
    ./miniharp.out -r #regs -t #threads -w #warps -o #output
  Sample programs
    $ ./miniharp.out hello.bin -t 4 -w 1 -r 8 -o output.log
    $ ./miniharp.out sum.bin -t 4 -w 1 -r 8 -o output.log
    $ ./miniharp.out barrier.bin -t 4 -w 4 -r 8 -o output.log
  Output format
    “<Program Output>”
    “Instruction Count: <?>”
Assignment 4: runtime.s
  Print Hex
  Print String
  Print NewLine
Assignment 4: hello.s
  Load string
  Call prints
  Exit
  String data
Assignment 4: sum.s
  Clone Registers
  Parallel Call
  Print result0
  Array data
  Output address
Assignment 4: barrier.s
  Start new Warp
  Barrier
  Single warp
  Print results
Questions?