Peer-to-peer Hardware-Software Interfaces for Reconfigurable Fabrics. Mihai Budiu, Mahim Mishra, Ashwin Bharambe, Seth Copen Goldstein. Carnegie Mellon University
Peer-to-peer hw/sw interfaces
Resources Galore [figure labels: Reconfigurable Hardware, Cache, Logic]
Why RH: Computational Bandwidth [figure: CPU, fixed; RH, "unbounded"]
Using RH Today [figure labels: Application, Partition, C Program, Compiler, HDL, CAD, OS support, communication]
Computer System Tomorrow: CPU for low-ILP computation + OS + VM; RH for high-ILP computation; tight coupling [figure labels: CPU, RH, Memory]
This Work: We suggest a high-level mechanism (not a policy). [figure labels: HLL Program, Partitioning, cc, CAD, CPU, RH, Memory]
Outline: Motivation; Interfacing RH & CPU; Opportunities; Conclusions
Premises
RH is large
– can implement large program fragments
RH can access memory
– does not require CPU support to access data
– coherent memory view with CPU
RH seen through clean abstraction
– interface portability
Unit of Partitioning: Procedure [figure: program call-graph with nodes labeled library, leaves, recursive, hot spot, high ILP]
Production-Quality Software
int foo(….) {
    highly parallel computation;
    ….
    if (!r) {
        fprintf(stderr, "Unexpected input");
        return E_BADIN;
    }
    ….
}
Peering
Program: a( ) { b( ); } b( ) { c( ); } c( ) { d( ); } d( ) { }
[figure: procedures a, b, c, d distributed between CPU and RH]
Stubs: marshalling, control transfer; software procedure call; hardware dependent; "RPC" [figure: CPU and RH with procedures a, b, c, d and stubs b’, c’, d’]
Stubs
Program: a( ) { r = b(b_args); } b(b_args) { }
CPU: a( ) { r = b’(b_args); } b’(b_args) { send_rh(b_args); invoke_rh(b); r = receive_rh( ); return r; }
RH: b
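The stub b’ on this slide can be sketched in plain C. This is only a software model, not the authors' implementation: the RH is imitated by two static slots, and the bodies of send_rh, invoke_rh and receive_rh, the function-pointer type rh_proc, the name b_stub (standing in for b’) and the arithmetic inside b_on_rh are all assumptions made for illustration. Only the three-call sequence inside the stub comes from the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical software model of the RH: one argument slot, one result slot. */
typedef int (*rh_proc)(int);

static int rh_arg_slot;
static int rh_result_slot;

/* Marshal the argument to the RH (here: copy it into the slot). */
static void send_rh(int arg) { rh_arg_slot = arg; }

/* Transfer control to the RH procedure and wait until it finishes. */
static void invoke_rh(rh_proc proc)
{
    assert(proc != NULL);
    rh_result_slot = proc(rh_arg_slot);
}

/* Fetch the result the RH left behind. */
static int receive_rh(void) { return rh_result_slot; }

/* Stand-in for b as it would run on the RH (made-up body). */
static int b_on_rh(int x) { return x * x + 1; }

/* Stub b': same signature as b, so a's call site does not change. */
int b_stub(int b_args)
{
    send_rh(b_args);
    invoke_rh(b_on_rh);
    int r = receive_rh();
    return r;
}

/* a calls the stub exactly as it would have called b. */
int a(void) { return b_stub(6); }
```

Because the stub keeps b's signature, the caller a cannot tell whether b runs in software or on the RH, which is the transparency the stubs are meant to provide.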
Required Stubs: 1 stub to call each RH procedure; 1 stub for each procedure called by RH [figure labels: CPU, RH]
Compiling [flow diagram labels: Program, Partitioning, policy, Procedures for CPU, Procedures for RH, Stubs, HLL to HDL, Synthesis, Configuration, Linker, Executable, automatic]
Outline: Motivation; Interfacing RH & CPU; Opportunities; Conclusions
Evaluation
How much can be mapped to RH?
SpecInt95 & Mediabench
Partition strictly on procedure boundaries
Limit RH to 10^6 bit-operations
Coverage
Program: a( ) { b( ); } b( ) { c( ); } c( ) {}
Running time: 40%, 35%, 25% (total 100%)
On RH, Method 1 / Method 2 [table cells: N N Y Y Y N]; totals: 40%, 75%
Coverage
Program: a( ) { b( ); } b( ) { c( ); } c( ) {}
Running time: 40%, 35%, 25% (total 100%)
On RH, Method 1 / Method 2 [table cells: N N Y Y N Y]; totals: 25%, 65%
Policies: leaves on RH; arbitrary [figure labels: RH, X, CPU]
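To make the two policies concrete, here is a minimal sketch of how the share of running time that can move to the RH might be computed under each. The reading of the policies is an assumption: "leaves on RH" is taken to mean that only procedures calling nothing else may be placed on the RH, and "arbitrary" that any procedure that fits may be placed there, which the peer interface permits because the RH can call back into the CPU. The struct layout, the function names and the profile numbers are invented for the example.

```c
#include <assert.h>

#define NPROC 3

struct proc {
    int time_pct;         /* share of running time spent in this procedure's own body */
    int fits_rh;          /* 1 if the body could be synthesized for the RH */
    int ncallees;         /* number of procedures it calls */
    int callees[NPROC];   /* indices of the called procedures */
};

/* Made-up profile for the call chain a -> b -> c; b is assumed not to fit. */
const struct proc example_profile[NPROC] = {
    { 40, 1, 1, { 1 } },   /* a calls b */
    { 35, 0, 1, { 2 } },   /* b calls c */
    { 25, 1, 0, { 0 } },   /* c is a leaf */
};

/* "Leaves on RH": only procedures that call nothing may be placed on the RH. */
int coverage_leaves(const struct proc *p, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        if (p[i].fits_rh && p[i].ncallees == 0)
            total += p[i].time_pct;
    return total;
}

/* "Arbitrary": any procedure that fits may be placed on the RH,
   because the RH can call procedures that stay on the CPU. */
int coverage_arbitrary(const struct proc *p, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        if (p[i].fits_rh)
            total += p[i].time_pct;
    return total;
}
```

With this invented profile, the leaves policy covers only c, while the arbitrary policy also covers a even though its callee b stays on the CPU.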
RH Stack Models
Locals in registers: f(x) { return x+1; }
Locals statically allocated: f() { int local; g(&local); }
Dynamic stack: f(x) { f(x+1); }
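The three stack models can be told apart by two properties of a procedure, judging from the slide's examples: whether it is recursive and whether it takes the address of a local. The classifier below is a sketch of that reading, not a rule stated by the authors; the enum, the struct and the function name are hypothetical.

```c
#include <assert.h>

enum stack_model { NO_STACK, STATIC_FRAME, DYNAMIC_STACK };

/* Hypothetical per-procedure facts a compiler front end could supply. */
struct proc_info {
    int is_recursive;          /* e.g. f(x) { f(x+1); } */
    int takes_local_address;   /* e.g. f() { int local; g(&local); } */
};

/* Pick the cheapest stack model that still supports the procedure. */
enum stack_model required_stack_model(struct proc_info p)
{
    if (p.is_recursive)
        return DYNAMIC_STACK;  /* unbounded number of live frames */
    if (p.takes_local_address)
        return STATIC_FRAME;   /* the local needs a memory address, one frame suffices */
    return NO_STACK;           /* locals can live in registers */
}
```

For example, a procedure like f(x) { return x+1; } has neither property and needs no stack at all.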
Potential RH Coverage: SpecINT95 [chart: % running time; policies: leaves, CPU->RH, CPU->RH->CPU; stack models: no stack, static stack frames, dynamic stack]
Potential RH Coverage: Mediabench [chart: policies: leaves, CPU->RH, CPU->RH->CPU; stack models: no stack, static stack frames, dynamic stack]
Conclusions
Stubs make RH/CPU interface transparent
Stubs are automatically generated
RH and CPU as peers
RH/CPU interface: (remote) procedure call
RPC used for control transfer (not data)
Peering gives partitioner freedom
The End
Stubs
Program: a( ) { r = b(b_args); } b(b_args) { if (x) c( ); return r; } c( ) { }
b’(b_args) { send_rh(b_args); invoke_rh(b); while (1) { com = get_rh_command( ); if (! com) break; (*com)( ); } r = receive_rh( ); return r; }
[figure annotations: Dispatcher; Independent of b; c’s stub]
C’s Stub
Program: a( ) { r = b(b_args); } b(b_args) { if (x) c( ); return r; } c( ) { }
c’( ) { receive_rh(c_args); r = c(c_args); send_rh(r); invoke_rh(return_to_rh); }
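The dispatcher loop in b’ and the callback stub c’ can be exercised together in a small C model. As a sketch under stated assumptions: the RH is simulated by a mailbox and a pending-command pointer, invoke_rh takes a hypothetical entry-point enum instead of a procedure name, b's condition "if (x)" is modeled as "argument greater than zero", and the arithmetic in c and in b's return value is made up. The control flow of b_stub and c_stub follows the two slides.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*command)(void);
enum rh_entry { RH_B, RH_RETURN_TO_RH };   /* hypothetical RH entry points */

/* ---- hypothetical model of the RH side ---- */
static int rh_mailbox;       /* value passed between CPU and RH */
static command rh_pending;   /* stub the RH wants the CPU to run, or NULL */
int c_calls;                 /* how many times c ran on the CPU */

static void c_stub(void);

static void send_rh(int v) { rh_mailbox = v; }
static int receive_rh(void) { return rh_mailbox; }

/* Hand the pending request (if any) to the CPU and clear it. */
static command get_rh_command(void)
{
    command com = rh_pending;
    rh_pending = NULL;
    return com;
}

/* Run the RH until it either finishes b or needs a CPU procedure. */
static void invoke_rh(enum rh_entry entry)
{
    if (entry == RH_B && rh_mailbox > 0)
        rh_pending = c_stub;   /* b: "if (x) c( );", c_args stay in the mailbox */
    else if (entry == RH_B)
        rh_mailbox = -1;       /* b finishes without calling c */
    else
        rh_mailbox += 1;       /* c returned; b derives r from c's result */
}

/* ---- CPU side ---- */
static int c(int x) { c_calls++; return 2 * x; }

/* Stub c': lets the RH call the software procedure c. */
static void c_stub(void)
{
    int c_args = receive_rh();
    int r = c(c_args);
    send_rh(r);
    invoke_rh(RH_RETURN_TO_RH);
}

/* Stub b': starts b on the RH, then serves RH requests until b is done. */
int b_stub(int b_args)
{
    send_rh(b_args);
    invoke_rh(RH_B);
    while (1) {                /* dispatcher: contains nothing specific to b */
        command com = get_rh_command();
        if (!com) break;
        (*com)();
    }
    return receive_rh();
}
```

The loop only follows whatever function pointer the RH hands back, so the same dispatcher text serves any RH procedure; only the stubs it dispatches to differ.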
Attempt 1
Manual partitioning
Interface: ad hoc
Ex: OneChip, NAPA, PAM
Advantage: huge speed-ups
Problem: very hard work
[figure labels: Program, RH]
Attempt 2
Select small computations
Interface: RH = functional unit
Ex: PRISC, Chimaera
Advantage: easy to automate
Problem: low speed-up
[figure: Program with operators +, >>, *]
Attempt 3
Program: while (b) { b[ j+5]; }
Select loop body
Deeply pipelined implementation
No memory access
Interface: I/O or Functional Unit or Coprocessor
Ex: PipeRench
Advantage: very high speed-up
Problems: cannot be automated; loop-carried dependences; few opportunities
Attempt 4
Program: while (b) { if (error) printf("err"); a[x] = y; }
Select whole loop
Pipelined implementation
Autonomous memory access
Interface: coprocessor
Ex: GARP
Advantage: many opportunities
Problems: complicated algorithm; requires exceptional loop exits