The Heterogeneous Architecture Research Prototype (HARP)

1 The Heterogeneous Architecture Research Prototype (HARP)
Chad Kersey, Hyesoon Kim, S. Yalamanchili, Georgia Institute of Technology

2 Agenda
- Motivations
- Design Objectives
- The HARP Infrastructure
- The HARP ISA
- The HARP Compiler
- Harmonica Microarchitecture
- Assignment 4: Mini-Harp
- Questions?

3 Motivations
- Application performance and efficiency are largely constrained by the memory system
  - DRAM latency bottleneck
  - DRAM bandwidth constrained by pin saturation
- The influx of memory-bound applications
  - Low compute-to-memory-access ratio
  - Poor spatial and temporal locality
  - Irregular control flow
- Processing in Memory (PIM) challenges
  - Limited area in the logic layer
  - Low power requirements
- HMC Vault - Graph Analytics

4 Design Objectives
- Area and power constraints
  - Compact and efficient microarchitecture
  - Reduced instruction set
- Effective memory bandwidth
  - Latency hiding via parallelism
  - Bandwidth saturation via concurrency
- Parametrization
  - Design space exploration (area vs. power)
  - Scale the design to domain-specific applications
  - Hardware prototyping (e.g. FPGA)

5 The HARP ISA
Morgan Kaufmann Publishers, February 25, 2019
- Simple, RISC-like
  - 64 opcodes (register and immediate operands)
  - General-purpose and predicate registers
  - e.g. addi %r0, %r0, #3
- Full predication
  - addi %r0, %r0, #3
- SIMT oriented
  - Control divergence
  - Warp control instructions
  - Barrier synchronization
- Customizable
  - ArchID: 4w8/8/16/16
Chapter 1 — Computer Abstractions and Technology
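As a rough illustration of what full predication means in a SIMT setting, the sketch below applies `addi` to every active lane whose guard predicate is true and leaves the other lanes untouched. The `Lane` class, the `exec_addi` helper, and the way the guard predicate is passed are assumptions made for this example; they are not taken from the HARP tools, and register-width wrap-around is ignored.

```python
from dataclasses import dataclass


@dataclass
class Lane:
    regs: list   # general-purpose registers of this lane
    preds: list  # predicate registers of this lane (booleans)


def exec_addi(lanes, active_mask, dst, src, imm, pred=None):
    """Apply `addi dst, src, #imm` per lane (hypothetical helper).

    A lane executes only if it is active and, when a guard predicate
    index is given, that predicate register is true in the lane.
    """
    for lane, active in zip(lanes, active_mask):
        if not active:
            continue
        if pred is not None and not lane.preds[pred]:
            continue  # guard is false: the instruction is a no-op here
        lane.regs[dst] = lane.regs[src] + imm


# Four lanes; predicate 0 is true in even lanes; lane 3 is inactive.
lanes = [Lane(regs=[10 * i, 0], preds=[i % 2 == 0]) for i in range(4)]
exec_addi(lanes, [True, True, True, False], dst=0, src=0, imm=3, pred=0)
result = [lane.regs[0] for lane in lanes]
```

Here only lanes 0 and 2 are updated: lane 1 is masked by its predicate and lane 3 by the activity mask.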

6 The HARP ISA (2)
- Instruction encoding
  - Word/byte encoding
  - Little-endian
- Privileged instructions
  - Interrupts (ei, di, reti, halt)
  - Kernel (skep, jmpru)
  - TLB (tlbadd, tlbrm, tlbflush)
- Memory loads/stores: ld/st
- Predicate manipulation: andp, orp, xorp, notp

7 The HARP ISA (3)
- Value tests: rtop (!= 0), isneg (< 0), iszero (== 0)
- Arithmetic instructions
  - Immediate integer
  - Register integer
  - Register fixed/floating-point
- Control flow: jmpi, jmpr, jali, jalr
- SIMD control
  - clone
  - jalis, jalrs, jmprt
  - split/join
Notes:
- jmpi => jump immediate (PC-relative)
- jmpr => jump indirect
- jali => jump and link immediate (subroutine calls)
- jalr => jump and link indirect (subroutine calls)
- clone => clone register state into the specified lane
- jalis/jalrs => jump and link, spawning N active lanes
- jmprt => jump indirect, terminating execution of the spawned lanes
- split => control flow divergence
- join => control flow convergence
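The three value tests turn a register value into a predicate. A minimal sketch of their semantics is given below, assuming two's-complement registers of a configurable width; the function signatures and the `bits` parameter are illustrative choices, not the HARP tool's API.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def rtop(value, bits=32):
    """Register-to-predicate: true when the register is non-zero."""
    return (value & ((1 << bits) - 1)) != 0


def isneg(value, bits=32):
    """True when the register, read as a signed integer, is below zero."""
    return to_signed(value, bits) < 0


def iszero(value, bits=32):
    """True when the register is zero."""
    return (value & ((1 << bits) - 1)) == 0
```

For a 32-bit register, `0xFFFFFFFF` is non-zero and negative, so `rtop` and `isneg` are both true for it while `iszero` is false.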

8 The HARP ISA (4)
- Warp control: wspawn, bar
- User/kernel: trap
- Interrupts
  - Lane 0 is specialized for interrupt handling
- ABI
  - The stack pointer and link register are the highest-numbered registers
  - The frame pointer optionally follows
  - The callee manages the stack and frame pointers
  - Function arguments use temp registers or the stack
Notes:
- wspawn => create a new warp
- bar => barrier
- trap => user-generated interrupt
- The link register holds the call return address
- The frame pointer keeps the return stack frame address for variable-size stack frames

9 The HARP Infrastructure
- Harp Compiler
  - Clang extension
  - LLVM backend
- Harp Runtime
- Harp Tool
  - Assembler
  - Disassembler
  - Linker
  - Emulator
- Harmonica
  - FPGA / simulation

10 The HARP Compiler
- Clang Front-End
  - C language extension for kernel parameters
  - e.g. __attribute(harp_kernel) int foo( __attribute(warp_id) int wid, __attribute(lane_id) int lid, int val) {…}
- LLVM Back-End
  - Two targets: harp32 and harp64
  - LLVM IR to HARP assembly

11 The HARP Compiler (2)
- Instruction selection: DAG creation / lowering / scheduling
- Register allocation: restriction, spilling, frame index elimination
- Peephole codegen pass: pseudo instructions to HARP instructions
- Control divergence pass: predication, split-join insertion
- Frame lowering
- Asm printer
Notes:
- Register allocation restriction => reserved registers are removed from allocation
- Frame index elimination => translate virtual stack references to machine stack references
- Frame lowering => prolog/epilog insertion

12 The HARP Compiler (3)
- Predication
  - Converts control dependencies into data dependencies
  - Done during the if-conversion pass, after register allocation
  - e.g. if (%r1) %r2++ else %r2--;
- Split-Join
  - Hardware stack-based control divergence
  - Divergent loops not supported
  - Done after if-conversion
- Decision framework
  - Static: operand def-use chains
  - Dynamic: profile-guided
Notes:
- Split-join is best for unanimous branches
- Split-join uses the IPDOM (immediate post-dominator) algorithm
- Static analysis checks whether the operands of a conditional branch depend on a thread-variant value
- Profile-guided analysis uses instrumentation
- Divergent loops => different lanes in the warp can execute different numbers of iterations
- Workaround: use lane_or(), the logical OR of the loop predicate across all lanes in the warp
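The lane_or() workaround for divergent loops can be pictured as follows: the whole warp keeps iterating while any lane still wants to, and the loop body is predicated so that finished lanes do nothing. This is a software sketch under that reading of the slide; `run_divergent_loop` and its counters are made up for the example and say nothing about how the HARP compiler actually emits the loop.

```python
def lane_or(preds):
    """Logical OR of a predicate across all lanes of the warp."""
    return any(preds)


def run_divergent_loop(trip_counts):
    """Run a loop whose trip count differs per lane.

    Returns the per-lane iteration counters and the number of
    iterations the warp as a whole had to execute.
    """
    n = len(trip_counts)
    counters = [0] * n
    warp_iterations = 0
    preds = [counters[i] < trip_counts[i] for i in range(n)]
    while lane_or(preds):          # uniform branch: same outcome in every lane
        for i in range(n):
            if preds[i]:           # predicated body: finished lanes are masked off
                counters[i] += 1
        warp_iterations += 1
        preds = [counters[i] < trip_counts[i] for i in range(n)]
    return counters, warp_iterations


counters, warp_iterations = run_divergent_loop([1, 3, 0, 2])
```

Each lane ends with its own trip count, while the warp runs for as many iterations as the longest-running lane (three in this case). The branch itself never diverges, which is why the hardware split/join stack is not needed for the loop.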

13 Harmonica Microarchitecture
- CHDL implementation
  - C++ hardware modelling via template generators
  - Verilog codegen for FPGA
- Simplified pipeline
  - No thread blocks
  - Single-issue warp scheduler
  - SRAM-based register file with single read/write ports
  - No bank parallelism
  - Warp formation (nested parallelism)

14 Harmonica Microarchitecture (2)
- Pipeline stages: Schedule, Fetch, PredRegs, GPRegs, Exec, WriteBack
Notes:
- 32 warps with 32 lanes each at 650 MHz
- 7 GB/s memory bandwidth (~2/3 of the available 10 GB/s HMC bandwidth)
- 1 KB ICache
- 32 KB DCache
- 4 simultaneous barriers
- 4-entry control flow stack

15 Harmonica Microarchitecture (3)
- Scheduler: FIFO circular queue
Notes:
- Issues a single warp at a time
- Warp state: user mode enabled, interrupt enabled
- Activity mask => which threads are active
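A FIFO circular queue of warps behaves like round-robin issue: the warp at the head is issued and rejoins the tail when it is ready again. The sketch below models that behaviour in software. The `Warp` fields mirror the state listed on the slide, but the class names, the `retire` method, and the rule that a warp with an empty activity mask is dropped are assumptions for illustration only.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Warp:
    warp_id: int
    active_mask: int                 # bit i set => lane i is active
    user_mode: bool = True
    interrupts_enabled: bool = True


class FifoScheduler:
    def __init__(self, warps):
        self.queue = deque(warps)

    def issue(self):
        """Pop the warp at the head of the queue, or None if it is empty."""
        return self.queue.popleft() if self.queue else None

    def retire(self, warp):
        """Put a warp back at the tail; assumed: warps with no active lanes are dropped."""
        if warp.active_mask:
            self.queue.append(warp)


sched = FifoScheduler([Warp(i, active_mask=0b1111) for i in range(3)])
order = []
for _ in range(5):
    warp = sched.issue()             # exactly one warp issues per step
    order.append(warp.warp_id)
    sched.retire(warp)
```

With three runnable warps the issue order cycles 0, 1, 2, 0, 1, which is the round-robin pattern a circular queue gives.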

16 Harmonica Microarchitecture (4)
- Instruction fetch
Notes:
- The memory subsystem matches requests and responses using tags
- The warp ID is used to tag memory requests for fetches and loads
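Tagging requests with the warp ID lets responses come back in any order and still reach the right warp. A minimal sketch of that bookkeeping is shown here, assuming at most one outstanding fetch per warp; the `FetchUnit` class and its method names are hypothetical and are not part of the Harmonica sources.

```python
class FetchUnit:
    """Tracks outstanding fetches, keyed by the warp ID used as the tag."""

    def __init__(self):
        self.pending = {}  # tag (warp ID) -> program counter being fetched

    def request(self, warp_id, pc):
        """Record an outstanding fetch and return the (tag, address) pair sent to memory."""
        if warp_id in self.pending:
            raise ValueError(f"warp {warp_id} already has a fetch in flight")
        self.pending[warp_id] = pc
        return warp_id, pc

    def respond(self, tag, data):
        """Match a memory response to its warp; returns (warp_id, pc, data)."""
        pc = self.pending.pop(tag)
        return tag, pc, data


fetch = FetchUnit()
fetch.request(0, 0x100)
fetch.request(1, 0x200)
first = fetch.respond(1, "insn_b")   # responses arrive out of order
second = fetch.respond(0, "insn_a")
```

Because the tag alone identifies the warp, the fetch unit needs no ordering guarantee from the memory subsystem.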

17 Harmonica Microarchitecture (5)
Predicate Register File Access

18 Harmonica Microarchitecture (6)
General Purpose Register File Access

19 Harmonica Microarchitecture (7)
Functional Unit Output
Notes:
- AM => active mask

20 Harmonica Microarchitecture (8)
Register File Writeback
Notes:
- Clone_info => information for copying the current register file into another lane

21 Harmonica Microarchitecture (9)
Next Instruction Fetch

22 Harmonica Microarchitecture (10)
Pipeline Stalls
Notes:
- Warps wait for instruction fetch requests

23 Assignment 4: Mini Harp Emulator
- Minimal ISA
  - Word encoding
  - Integers only
  - A single predicate register
  - No split-join
  - No interrupts
  - No virtual addressing
- Instruction set
  - Nop, Add, Sub, And, Or, Xor, Not, Shr, Shl, Ld, St, Clone, Bar
- Configuration
  - Register size, warp size, number of warps
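To make the role of the configurable register size concrete, here is a small sketch of how the integer ALU operations in the list above might be evaluated for a given width. It is not the assignment's code base: `make_alu`, the lowercase mnemonic strings, the logical (unsigned) right shift, and reducing the shift amount modulo the register size are all assumptions chosen for the example. Ld, St, Clone, and Bar are left out because they need memory, lane, and warp state.

```python
import operator


def make_alu(reg_bits):
    """Build an execute function for registers that are reg_bits wide."""
    mask = (1 << reg_bits) - 1
    ops = {
        "add": operator.add,
        "sub": operator.sub,
        "and": operator.and_,
        "or": operator.or_,
        "xor": operator.xor,
        "not": lambda a, _b: ~a,                    # unary: second operand ignored
        "shl": lambda a, b: a << (b % reg_bits),    # assumed: shift amount wraps
        "shr": lambda a, b: a >> (b % reg_bits),    # assumed: logical shift
    }

    def execute(op, a, b=0):
        if op == "nop":
            return None                             # nothing to write back
        # Operands and result are truncated to the register width.
        return ops[op](a & mask, b & mask) & mask

    return execute


alu8 = make_alu(8)
wrapped_sum = alu8("add", 250, 10)   # overflows an 8-bit register
borrowed = alu8("sub", 0, 1)         # wraps around to all ones
```

Masking the result after every operation is what makes the same code serve any register size chosen in the configuration.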

24 Assignment 4: Mini Harp Emulator (2)
- Emulator baseline
  - System
  - Core
  - Memory subsystem
  - Instruction decode
  - Instruction fetch
- You implement
  - Register file
  - Execute stage
  - Print stats
- Provided
  - Code base, sample programs

25 Questions?

